
Mastering the Concurrency of Shared Path TCP Connections

Pedro de Almeida Braz

Thesis to obtain the Master of Science Degree in

Telecommunications and Computer Engineering

Supervisor: Prof. Ricardo Jorge Feliciano Lopes Pereira

Examination Committee

Chairperson: Prof. Luís Manuel Antunes Veiga
Supervisor: Prof. Ricardo Jorge Feliciano Lopes Pereira

Member of the Committee: Prof. Rui Jorge Morais Tomaz Valadas

November 2016


Acknowledgments

The author of this work was never alone. In the pursuit of knowledge, hardships take place, and the process would have been harder if not for those who comfort you and those who turn your doubts around.

As a student, I want to thank Professor Ricardo for thinking clearly when problems arose.

As a son, I want to thank my father, my mother and my sister for caring.

As a colleague and friend, I want to thank Pedro, Nuno, João, Rui, Artur and Karan for being there, every step of the way.


Abstract

Parallelism is a necessity for the Internet, and most Transmission Control Protocol (TCP) based networking applications benefit from its use, as is the case of the Hypertext Transfer Protocol (HTTP). However, this introduces a worrying amount of concurrency, which has adverse effects on networks: having too many parallel TCP connections is proven to be too aggressive, causing unnecessary congestion and throttling the throughput for all network users.

This work addresses the problem by grouping, at the sender, parallel connections which share the same path, reducing per-connection redundancies. We survey existing implementations and adapt them for our own protocol. Our solution enables TCP to group connections from hosts in close proximity, obtain finer network state estimates, react quickly to congestion, skip slow start and achieve an increased average throughput.

Keywords: TCP, HTTP, Concurrency, Linux, Reno, Internet.


Resumo

The parallelism of TCP connections has become a necessity of the Internet, and many protocols benefit from its use, as is the case of HTTP applications. Unfortunately, this concurrency has adverse consequences for the network: parallel connections are aggressive and cause unnecessary congestion. This work focuses on that problem, grouping at the sender the parallel connections that share the same network path, thereby reducing redundancies. We analyse implementations that allow this and adapt them to our protocol. Our solution enables TCP to group connections by nearby receivers, improve its estimates of the network state, react faster to congestion, skip slow start and increase the maximum throughput in the network.

Keywords: TCP, HTTP, Concurrency, Linux, Reno, Internet.


Contents

Acknowledgments
Abstract
Resumo
List of Figures
List of Tables
List of Acronyms

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Proposed Solution
  1.4 Outline

2 Related Work
  2.1 Transmission Control Protocol
    2.1.1 TCP operation
  2.2 Conservation in TCP
  2.3 The Hypertext Transfer Protocol
  2.4 Optimizations for Shared Path Parallel TCP Connections
  2.5 Multiplexing Parallel TCP flows
  2.6 Protocols Comparison and Analysis
  2.7 Ensemble Sharing Considerations

3 Architecture
  3.1 Requirements
  3.2 Hydra: Connection Grouping
  3.3 Heracles: Congestion Control
  3.4 Chapter Summary

4 Implementation
  4.1 Implementation Options
    4.1.1 Linux Kernel
    4.1.2 Linux Modules
    4.1.3 Kernel Debugging
    4.1.4 Scripting
    4.1.5 Iperf
    4.1.6 Netkit
    4.1.7 Tc
  4.2 Heracles Module
    4.2.1 Fast Retransmit
    4.2.2 Data Structures

5 Evaluation
  5.1 Test Objectives
  5.2 Test Scenarios
    5.2.1 Long-Short
    5.2.2 Parallel
    5.2.3 Sequential
    5.2.4 Packet
  5.3 Methodology
  5.4 Test Results
    5.4.1 Long-Short
    5.4.2 Parallel
    5.4.3 Sequential
    5.4.4 Packet
  5.5 Protocol Analysis

6 Conclusions
  6.1 Summary
  6.2 Achievements
  6.3 Future Work

Bibliography


List of Figures

2.1 Representation of the three-way handshake for hosts A and B. (1) and (2) contain the sequence numbers for hosts A and B.
2.2 Representation of a TCP connection closing.
2.3 State transitions for TCP Reno.
2.4 Per-host TCP-Int structures.
3.1 Representation of the hydra structure, composed of an externally linked hash table and binary trees, where each leaf represents a hydra group, which the Heracles structure points to.
4.1 Diagram of the Heracles cong_avoid function.
4.2 Diagram of the Heracles pkts_acked function.
5.1 Test network.
5.2 Empirical Cumulative Distribution Function (CDF) plot for long/short throughput.
5.3 Empirical CDF plots for 2, 4 and 10 parallel connections, respectively.
5.4 Empirical CDF plot for sequential throughput.
5.5 Empirical CDF plot for packet test throughput.
5.6 Two connections partitioning into different groups, with different throughput values.


List of Tables

2.1 Temporal Sharing TCB Initialization.
2.2 Temporal Sharing Cache Updates.
2.3 Ensemble Sharing TCB Initialization.
2.4 Ensemble Sharing Cache Updates. rtt_update indicates the operation of sampling the newest round trip time (rtt) value.
2.5 Part of the ECB structure definitions.
2.6 TCP/DCA-C Congestion Window Updates.
3.1 Information stored in each hydra group.
5.1 Long/short test results.
5.2 Results for parallel tests with 2 connections.
5.3 Results for parallel tests with 4 connections.
5.4 Results for parallel tests with 10 connections.
5.5 Sequential test results.
5.6 Packet test results.


List of Acronyms

ack acknowledgment

CDF Cumulative Distribution Function

csv comma-separated values

cwnd congestion window

HTTP Hypertext Transfer Protocol

IP Internet Protocol

IPv4 Internet Protocol version 4

MSS Maximum Segment Size

NAT Network Address Translation

P-HTTP Persistent HTTP

rto retransmission timeout

rtt round trip time

rttvar round trip time variance

srtt smooth round trip time

ssthresh slow start threshold

syscall system call

tbf token bucket filter

TCB TCP Control Block

TCP Transmission Control Protocol

URI Uniform Resource Identifier

WWW World Wide Web


Chapter 1

Introduction

1.1 Motivation

TCP's congestion avoidance and control algorithms play an important role in supporting the Internet infrastructure. Van Jacobson describes a period when these were badly implemented and almost caused an Internet collapse [1]. The congestion mechanisms allow Internet hosts to detect and deal effectively with congestion; even so, they are not well suited for web traffic, which is predominantly derived from HTTP (a 2009 study gathered traffic from 20,000 residential clients and found HTTP traffic to account for almost 60% of total Internet traffic [2]).

Most traffic over the Internet has a high degree of parallelism. Previously, HTTP connections were characterized as having a short lifespan: in HTTP/1.0, as each TCP connection was used for a single request/response exchange, multiple connections were required for fetching a webpage. Modifications introduced in HTTP/1.1 allowed TCP connections to be reused, making them long-lived, yet parallel connections continued to be a requirement for reducing client latency. This is because, on a long-lived connection, a request cannot be sent before the previous response is received, blocking subsequent request/response exchanges on the same connection. In addition, a change in the nature of web traffic, with the rise of Ajax and video streaming, shifted content from static to dynamic, which resulted in more traffic bursts, forcing browsers to raise their upper limit on parallel connections [3].

In the year 2000, parallel connections accounted for 44% of total connections to a web server [4]. By a 2010 data set, connections had evolved to being mostly parallel, with a median number of 6 to 7 [3].

In this work, we address the problem of parallel connections, which are a necessity for most web applications but create congestion in the network. Each individual connection is aggressive by nature: it constantly probes the network to discover its maximum available throughput, and this process requires the detection of losses to signal congestion in the network. This is done not only by each individual host in the network, but also by each of the parallel connections that every client uses. From the network's perspective there is no difference: every connection is treated equally, independently of the originating user and of whether that user has one connection or many. Research has been published on this subject, attempting to reduce the impact of parallel TCP connections, but it has failed to gain support since publication [5, 6, 7].

This research targets TCP, to make connections cooperate better in scenarios with high concurrency,


by using two different techniques: ensemble sharing and temporal sharing. Both are based on grouping same-path connections and sharing network resources efficiently between them. They attempt to reduce the normal aggressiveness by making connections share:

• losses, so they can all react immediately, instead of each having to experience one;

• state, for finer-grained control of the network's condition, by using estimates from different connections.

From these solutions, we notice some problems: mainly, each single address is assumed to be a single host, which is not always the case with the use of Network Address Translation (NAT) interfaces that hide multiple hosts. Even assuming that latency is negligible, an untrusted host, or one that crashes, can deny the service for all hosts behind the NAT interface.

1.2 Problem Statement

Parallel connections play a major role in web traffic, but a fundamental TCP design choice makes it ill-suited for dealing with these connections. As a transport layer protocol, TCP applies congestion algorithms per connection, so every connection a server uses is independent from all others. But state is path-specific: connections sharing the same path will have similar estimates for latency and available throughput, and these calculations are not cheap due to TCP's mechanisms for congestion control:

1. Slow start, which is required for finding an initial threshold for traffic throughput. Connections start with reduced throughput.

2. Congestion avoidance algorithms, which are as aggressive as they need to be, reacting to losses that signal congestion in the network. A loss lowers the connection's throughput and increases latency because of packet retransmissions.

To minimize the negative effects of TCP's concurrency on application protocols, some solutions have been proposed, targeting either the transport or the application protocol. To deal with the problem directly, we focus on transport level solutions, enabling us to group, at the sender, connections sharing the same path. This allows TCP to reduce the aggressiveness of parallel connections, for fewer losses and increased throughput.

1.3 Proposed Solution

Our initial goal is to design an ensemble sharing technique, adapted from existing research, where same-path connections can:

1. skip slow start on paths for which there is already a connection in congestion avoidance;

2. react to losses from other connections by decreasing their throughput, thus preventing a loss;


3. share network path information to provide finer TCP estimates.

By applying these optimizations to standard TCP, we can reduce unnecessary congestion, reduce losses, and increase throughput when possible, effectively pacifying the effects of concurrency. To improve further on this, we allow the protocol to group hosts by Internet Protocol (IP) address and deal accordingly with any group inconsistencies that arise, making it possible to benefit a larger number of hosts.

With this work, we contribute: a new protocol design, adapted from existing research; a Linux implementation; and a set of experimental tests that evaluate the protocol.

1.4 Outline

• Chapter 2 analyses previous research on solving the problem of network concurrency, looking at solutions that propose either sharing information between parallel connections or multiplexing them into a single entity.

• Chapter 3 describes the architecture of the systems that compose the implemented solution, and their purpose.

• Chapter 4 describes the Linux kernel, how it was used to build the proposed solution, and the tooling used to assert the protocol's correct behavior.

• Chapter 5 describes the evaluation scenarios used to test the solution and their objectives, and analyses the results.

• Chapter 6 summarizes the goals achieved by the protocol and its shortcomings, as well as problems to be solved in the future.


Chapter 2

Related Work

In this chapter we survey the literature on TCP optimizations for shared path connections. We start by describing TCP, over which the optimizations were designed, along with the mechanisms necessary to make it behave correctly on the Internet. We then describe basic HTTP and its evolution. Finally, we present the protocols designed specifically for optimizing TCP for parallel connections and compare their different aspects.

2.1 Transmission Control Protocol

TCP [8] was developed to provide reliable communication for the Internet. The protocol operates in a symmetric manner, providing basic data transfer on a duplex connection, with added mechanisms to assure reliability to its end hosts.

As a transport layer protocol, TCP must be capable of dealing with the network layer's flaws. For this, TCP provides the following mechanisms:

• basic data transfer: packaging streams of bytes into segments to be transmitted over IP;

• reliability: the ability to recover from data that is lost, damaged, duplicated or delivered out of order;

• flow control: restricting the maximum amount of data that the other host can accept.

2.1.1 TCP operation

For each sequence of bytes sent, an acknowledgment (ack) is expected from the receiver, asserting the correct delivery of the packet. For each client, packets are given sequence numbers that identify them and denote their ordering. These sequence numbers are assigned incrementally, based on the size of previous packets. A received packet with a higher sequence number than the expected one cannot be acknowledged, because packets have to arrive in the correct sequence; this indicates that the expected packet was delayed or lost in the network. At the sender side, transmitting a packet starts a timer; if no acknowledgment is received during this interval, a timeout occurs, which forces the sender to retransmit the same packet. Retransmitting packets may cause duplicate packets to be received, due to network congestion.

Each TCP connection is responsible for two buffers, one for sending and the other for receiving bytes. The client inserts data into the former buffer and waits for TCP to schedule the transmission. In the receiving buffer, data is stored until TCP can pass it to the client. These buffers limit the amount of outstanding data each client can have on the network. Every ack received informs the sender of how many bytes it can still fit into the other host's receiving buffer.

The opening of a connection can be performed actively or passively, depending on whether the client knows the foreign host's (socket) information or wants to wait for incoming connection requests. Independently of the type of connection, a TCP Control Block (TCB) is created, responsible for storing state.

The procedure used for initializing the connection between hosts is called the three-way handshake, in which an active host sends a message with the synchronize flag (SYN) to a passive host, which acknowledges the SYN packet and, in the same packet, sends its own SYN (called a SYN-ACK); lastly, the active host acknowledges it. From this point onward, both hosts are capable of sending and receiving segments to and from one another (Figure 2.1).

Its purpose is for both hosts to decide on the sequence numbers in use for each side of the connection; these numbers will identify packets in the stream and point to data in the buffers. On the originating connection, the sequence number points to the last bytes of data sent in the sending buffer; for the receiving connection, it points to the last bytes of data received in the receiving buffer.

For a connection to close, each user must signal the other that there are no more segments to send. When a user is ready to close the connection, it sends a finish segment (FIN) to signal the remote host that it has nothing more to send. From this point the connection is still open, because it is assumed the user can still receive segments, until the other side also decides to end, by sending its own FIN packet and receiving the corresponding ack (Figure 2.2).

The unpredictability of the network makes it hard to decide with any certainty when a packet has been lost. When waiting for an acknowledgment, a timeout needs to be calculated based on the smooth round trip time (srtt), which tries to account for the network's unpredictability by smoothing the samples received. The expression originally used was later found to be inappropriate [1]; this is discussed further in the following section.

2.2 Conservation in TCP

Tahoe was the first version of TCP using congestion avoidance and control algorithms, introduced in the late 1980s in a release of the BSD (Berkeley Unix) operating system [9]. Here we describe the conservation property of TCP and the initial algorithms built for it: slow start, accurate round-trip timing, congestion avoidance, fast retransmit and fast recovery.

avoidance, fast retransmit and fast recovery.

Figure 2.1: Representation of the three-way handshake for hosts A and B. (1) and (2) contain the sequence numbers for hosts A and B.

Figure 2.2: Representation of a TCP connection closing.

TCP needs to obey the conservation of packets principle to work as intended [1]. This means that for a connection to run stably it requires a conservative flow of packets, where no new packet is injected into the network until an older one has left (when it has the maximum amount of data in transit, a full window). This can fail under three possible conditions:

1. The connection can't reach a stable state when starting. This is caused by having no information on the initial state of the network and overestimating the amount of data it can actually send, which causes unnecessary losses and retransmissions.

2. At a full window, the connection starts injecting packets before those in the network leave. Having more packets inside the network than the window allows may cause congestion.

3. There are not enough resources to allow stability: the network buffers along the path are not prepared to deal with an increase in packet rate, causing losses.

The same problems still apply to this day, but TCP has mechanisms in place to reliably detect the network's state and adapt accordingly to its needs.

• Slow Start was detailed as a new method for initializing TCP connections in a controlled manner, without injecting too many packets for the network to handle. It also initializes the ack clocking mechanism, which allows a host to estimate the delay in the network path from the acks it receives, and to discover the stable state of the network, at which it can send data safely without causing congestion on the link. It is a requirement for correct TCP behavior, and its relatively slow action across all connections sharing the same link allows them to start without causing congestion and without impacting network performance for others.

A slow start threshold (ssthresh) is created for the connection, indicating the point at which it can stop using slow start. Initially, when nothing is known about the path, it is set to a high value, to cause a necessary packet loss that reveals the initial threshold for that path. The initial window (IW) for the connection is set, according to the Maximum Segment Size (MSS), to around 4 KB [10]. For each ack received after the three-way handshake, the congestion window (cwnd) is increased by one MSS, causing it to double on each rtt, until cwnd > ssthresh [11]. A timeout causes slow start to restart, with ssthresh set to half of the congestion window and a cwnd of a single segment.
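As a concrete illustration, the following minimal C sketch (names are illustrative, not the thesis' Heracles module or the Linux kernel API; the window is counted in MSS units) captures the per-ack growth and the timeout reaction just described:

    /* Slow start sketch: +1 MSS per ack doubles cwnd every rtt;
       a timeout halves ssthresh and restarts from one segment. */
    struct conn {
        unsigned int cwnd;      /* congestion window, in MSS units */
        unsigned int ssthresh;  /* slow start threshold */
    };

    void on_ack(struct conn *c) {
        if (c->cwnd < c->ssthresh)
            c->cwnd += 1;       /* exponential growth phase */
    }

    void on_timeout(struct conn *c) {
        c->ssthresh = c->cwnd / 2;  /* remember half the reached window */
        c->cwnd = 1;                /* restart slow start */
    }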

• Round-Trip Timing, on which TCP depends for accurate estimates of a packet's travel time, is important for loss detection. This work added dynamic rtt variation to the estimate calculations; the original used a fixed value, which was found to cause retransmission of delayed packets once load on the link reached 30%.

When a packet leaves the host, a timer starts, counting the waiting time before retransmitting that same packet. The duration of the timer is the retransmission timeout (rto), and it depends on the srtt and the round trip time variance (rttvar). For precise calculation of the timer, the following equations are used, based on the newest rtt sample:

are used, based on the newest rtt value:

rttvar ← (1 − β) · rttvar + β · |srtt − rtt|

srtt ← (1 − α) · srtt + α · rtt

rto ← srtt + 4 · rttvar

The standard suggests the use of α = 1/8 for the srtt calculation and β = 1/4 for the rttvar calculation; the rto value is then updated according to both these equations [12].
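A minimal sketch of this estimator in C (floating point is used for clarity, whereas real stacks keep scaled integers; the first-sample initialization follows RFC 6298 as an assumption, since the text does not state it):

    /* rtt estimator sketch; 'rtt' is the newest sample in seconds. */
    #define ALPHA (1.0 / 8.0)   /* gain for srtt */
    #define BETA  (1.0 / 4.0)   /* gain for rttvar */

    struct rtt_state { double srtt, rttvar, rto; int has_sample; };

    void rtt_sample(struct rtt_state *s, double rtt) {
        if (!s->has_sample) {            /* first measurement */
            s->srtt = rtt;
            s->rttvar = rtt / 2.0;
            s->has_sample = 1;
        } else {
            double dev = s->srtt - rtt;
            if (dev < 0)
                dev = -dev;              /* |srtt - rtt| */
            s->rttvar = (1.0 - BETA) * s->rttvar + BETA * dev;
            s->srtt   = (1.0 - ALPHA) * s->srtt + ALPHA * rtt;
        }
        s->rto = s->srtt + 4.0 * s->rttvar;
    }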

• Congestion Avoidance requires two components to work: the sender must be able to detect congestion, and endpoints must have policies in place for dealing with it.

In a network, there are buffer limits along the path the packets travel, and there is a chance they will get discarded. We can almost certainly say that a loss happens due to congestion, and it will be signaled to the sender through a timeout. On a congested system, queue lengths start increasing exponentially. For the system to stabilize, the traffic sources must reduce their outgoing traffic as quickly as the queues grow; for the sender, this is a multiplicative decrease of the packets sent (currently the congestion window is cut in half).

In the case where a connection is using less than its fair share of the bandwidth, it should increase its utilization. This suffers from the same problem that slow start solves for detecting the available bandwidth. It uses the same method, increasing the amount of data it can send on each acknowledgment received, but instead of increasing exponentially, it increases linearly (adding one segment to the congestion window per rtt).
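Together with the halving on loss, this is the classic additive-increase/multiplicative-decrease rule. A minimal sketch (illustrative names; the window is kept in bytes, so the one-segment-per-rtt increase becomes MSS·MSS/cwnd per ack):

    /* AIMD sketch for the congestion avoidance phase. */
    void ca_on_ack(unsigned int *cwnd, unsigned int mss) {
        *cwnd += mss * mss / *cwnd;   /* ~1 MSS per rtt, spread over acks */
    }

    void ca_on_loss(unsigned int *cwnd, unsigned int mss) {
        *cwnd /= 2;                   /* multiplicative decrease */
        if (*cwnd < mss)
            *cwnd = mss;              /* never drop below one segment */
    }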

• Fast Retransmit [13, 11] describes an algorithm for early detection of losses, where a receiver generates a duplicate acknowledgment after receiving an out-of-order packet. This signals the sender that a packet still hasn't arrived and may be lost. At the sender, receiving three duplicate acks is taken as an indication that the packet was really lost, causing an early retransmission.


Figure 2.3: State transitions for TCP Reno.

Reno was the name given to the next version of TCP, which improved the fast retransmit algorithm and added a fast recovery phase after retransmitting a dropped segment, during which TCP doesn't need to drop back to slow start.

When the first duplicate ack arrives on a connection, it uses the limited transmit algorithm [14]. This algorithm allows the sender to keep injecting new packets into the network while receiving duplicate acks, without having to go through slow start. When three duplicate acks arrive, and the client retransmits the missing packet, the fast recovery algorithm governs the transmission until a non-duplicate ack arrives. The congestion window is then lowered to ssthresh + 3 × MSS, and for every duplicate ack the window is inflated, so that new data can keep being sent. A non-duplicate ack will stop the algorithm and deflate the window to the ssthresh value [11]. This allows the connection to conserve the packets that were already buffered by the receiver and to preserve the ack clocking. The different states and their respective transitions can be seen in Figure 2.3.
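A minimal sketch of this inflation/deflation cycle (illustrative event handlers, window in MSS units; this is standard Reno as described in the text, not the thesis' modified version):

    /* Reno fast retransmit / fast recovery sketch, window in MSS units. */
    struct reno { unsigned int cwnd, ssthresh, dupacks; int in_recovery; };

    void on_dupack(struct reno *r) {
        if (r->in_recovery) {
            r->cwnd += 1;              /* inflate: a packet left the network */
            return;
        }
        if (++r->dupacks == 3) {       /* third dup ack: enter fast recovery */
            r->ssthresh = r->cwnd / 2;
            r->cwnd = r->ssthresh + 3; /* account for the three buffered packets */
            r->in_recovery = 1;
            /* the missing segment is retransmitted here */
        }
    }

    void on_new_ack(struct reno *r) {
        if (r->in_recovery)
            r->cwnd = r->ssthresh;     /* deflate the window on exit */
        r->in_recovery = 0;
        r->dupacks = 0;
    }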

The algorithms presented above constitute a base TCP implementation. Their use is not a requirement, but it is imposed that any TCP implementation must not be more aggressive than these [9].

2.3 The Hypertext Transfer Protocol

HTTP allows creating client/server services targeted at the World Wide Web (WWW), hiding the implementation details of the services and presenting a uniform interface to the client for making requests, independently of the resources associated with the service.

An HTTP server waits for connections, servicing requests and sending responses [15]. It identifies available resources, and the relationships between them, with the Uniform Resource Identifier (URI) standard. A request sends a URI to the server, identifying the target resource for the server to return. The protocol doesn't define limitations on the nature of resources, only an interface that can be used to interact with them [16], giving the server implementation flexibility.


An HTTP message is divided into a header and an optional body. In the header, a client indicates a semantic method and the URI it applies to, while a server indicates the result of the request as a status code. The body is only used when a request or response requires a payload to be exchanged. The semantic methods indicate what is to be done with the identified resource; they are inserted into the message header in uppercase letters. Examples are:

• GET: fetches the current representation of the identified resource;

• POST: the identified resource processes the client's request.

The full list of existing methods tries to exhaust all possible use cases of the protocol.

At the start, in version 1.0, the protocol made each request/response pair a single TCP connection: after the server sent its response, it would close the underlying connection [17].

Persistent HTTP (P-HTTP) [18] was a solution for improving web traffic performance that updated HTTP to version 1.1.

Network latency is the biggest bottleneck for web retrieval, and congestion causes a large increase in rtt. To diminish this problem, unnecessary round trips must be avoided, which the initial version didn't do, incurring a bigger delay than needed. To improve on the inherent limitations of the protocol, two alterations were proposed:

• Long-Lived Connections keep a single TCP connection open for multiplexing all the HTTP objects needed. Both client and server keep connections open for future requests, eliminating the need for TCP to go through slow start for each request and decreasing latency.

Within a connection, a new request needs to wait for the previous response before it can be sent. For resource-intensive requests, the connection will be blocked for an extended period of time; this is known as the head-of-line blocking problem. In these cases, concurrency is needed.

• Request Pipelining expands on the previous long-lived connections, eliminating the need to wait before sending the next request on the same connection, allowing the host to send multiple concurrent requests and reducing latency between responses. The protocol requires responses to be sent in the same order that requests are received [15], so first in, first out (FIFO) ordering must be ensured. This is problematic when a more expensive request is issued: later requests will then suffer from head-of-line blocking, causing the connection to block and increasing the latency of all later requests.

HTTP/2 [19] is a newly proposed standard for the HTTP protocol, currently being pushed as a way of dealing with the increased requirements of the Internet and of mitigating congestion. It addresses issues like head-of-line blocking, as we have already seen, and verbose headers, which inflate the HTTP message's size. The protocol's main change is to allow HTTP to fully benefit from the use of a single connection, decreasing the impact of having multiple connections in the network, which can lead to congestion.


The protocol facilitates pipelining, removing FIFO constraints for concurrent requests, by multiplexing independent request/response exchanges inside the same TCP connection and identifying logically different sequences of messages. This way, FIFO ordering doesn't have to be followed and streams can be independent from each other, so blocking in one doesn't affect the others.

It also adds other features, such as:

• Server Push: a server can send an unsolicited response to a client that it predicts will be necessary. It is used when an HTTP object has multiple dependencies, which would otherwise force the client to request them independently, as when a browser requests an index page and then has to parse it and request all objects inside it. This reduces latency by removing the delay of the client's requests.

• Stream Priority: for managing resource allocation between concurrent streams. Each stream can be assigned a stream dependency and a weight. The stream dependency defines the parent of the stream, from which it receives its relative share of resources, but only when the parent is not being used. The weight defines the share it receives from the parent.

A unit of communication is called a frame, comprised of a header and a variable sequence of bytes. Frames are exchanged inside a stream. A single HTTP/2 connection can have multiple streams inside, each one identifiable, and each can be opened or closed by either client or server. This completely removes the need for parallel connections.

Frames can be of different types, serving different purposes:

• DATA: for client request or response payloads;

• HEADERS: for opening new streams;

• PRIORITY: for defining the stream's priority or dependency (for efficient multiplexing);

• RST_STREAM: for terminating a stream or indicating errors;

• SETTINGS: for declaring connection parameters;

• PUSH_PROMISE: for reserving a stream;

• PING: for testing connection availability and the respective round trip time;

• GOAWAY: for stopping the connection gracefully;

• WINDOW_UPDATE: for specific cases where limiting flow control is necessary due to peer constraints.

The operation of normal HTTP is mostly unchanged. A client who wishes to send a request uses a new stream, which is then used by the server for sending the response. A normal HTTP message is now divided into frames, with at least one HEADERS frame and optional DATA frames. A single stream behaves like a TCP connection would in the initial version of HTTP, where each request/response exchange would consume the entire stream.


Cached TCB    New TCB
old_mss       old_mss
old_rtt       old_rtt
old_rttvar    old_rttvar
old_cwnd      old_cwnd

Table 2.1: Temporal Sharing TCB Initialization

2.4 Optimizations for Shared Path Parallel TCP Connections

In this section we discuss several designs and protocols that were proposed for benefiting TCP connections with a high degree of parallelism; note that, for simplicity, we assume parallel connections to be connections which share the same network path.

The idea of dependence between shared path TCP connections started with Touch's TCP Control Block Interdependence draft [5]. In it, concerns were raised about TCP's per-connection state and its negative influence on performance for same-host connections.

For each connection, TCP allocates a TCP Control Block (TCB) to store its state, such as srtt, rttvar, ssthresh, cwnd and MSS. These are the most important variables for congestion control and the focus of the draft.

The draft classifies this state as host-pair dependent or host-pair dependent for the aggregate. The nuance here is that aggregate-dependent state must be divided between connections, as is the case for the congestion window, whereas host-pair dependent state is equal for each parallel connection, independently of the number of connections being used: MSS, srtt, rttvar.

These dependencies make the existence of a linking factor between connections clear, rendering most of the per-connection state calculations redundant, which introduces unneeded overhead. As a way of minimizing this, Control Block Interdependence proposes two different tactics: temporal sharing and ensemble sharing.

To build a model for each kind of sharing, the draft based its design on Transactional TCP [20, 21], which, in some available implementations, uses a cache for storing state from older TCP connections. Transactional TCP's purpose is to lower the latency of TCP connections by bypassing the three-way handshake for host pairs that have already connected. That protocol failed to gain widespread deployment and is now classified as obsolete [22, 23].

The type of state dependency plays an important role in determining how information can be used by the sharing tactics.

• Temporal Sharing tries to reuse closed connections' state, when available, to initialize a TCB faster (Table 2.1), where the values are simply copied. In temporal sharing, caching is done whenever a connection closes or, in the case of MSS, when the value is updated (Table 2.2).

• Ensemble Sharing is similar to temporal sharing, but allows interactions between concurrent parallel connections, where a cache is updated often during the connections' lifetimes, reflecting their joined state. Newer connections are able to copy updated rtt information from other connections, available in the cache. This benefits newer connections, which can start without a big delay, and older connections, which can be notified of changes in the network through the cache. Tables 2.3 and 2.4 present a trivial solution for ensemble sharing; the difference between these and Tables 2.1 and 2.2 is that the former have access to the most up-to-date information, because the cached information comes from open connections.

Current TCB      Cached TCB    When          New Cached TCB
current_mss      old_mss       MSS update    current_mss
current_rtt      old_rtt       conn close    old + (current − old) >> 2
current_rttvar   old_rttvar    conn close    old + (current − old) >> 2
current_cwnd     old_cwnd      conn close    current_cwnd

Table 2.2: Temporal Sharing Cache Updates

Cached TCB    New TCB
old_mss       old_mss
old_rtt       old_rtt
old_rttvar    old_rttvar

Table 2.3: Ensemble Sharing TCB Initialization
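The ">> 2" update in Table 2.2 is an exponentially weighted moving average with gain 1/4. A minimal sketch of the on-close cache update (illustrative names; assumes arithmetic right shift for negative deltas, as on common platforms):

    /* Temporal sharing cache update on connection close (Table 2.2):
       move cached rtt estimates a quarter of the way toward the closing
       connection's values; the congestion window is simply copied. */
    struct path_cache { int rtt, rttvar, cwnd, mss; };

    void cache_on_close(struct path_cache *old,
                        int cur_rtt, int cur_rttvar, int cur_cwnd) {
        old->rtt    += (cur_rtt    - old->rtt)    >> 2;
        old->rttvar += (cur_rttvar - old->rttvar) >> 2;
        old->cwnd    = cur_cwnd;
    }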

The draft enumerates some advantages of implementing TCB interdependence in TCP. First, it prevents the need to multiplex logically different streams into a single connection, as P-HTTP does to avoid the slow start penalty of starting a TCP connection for every request. TCB interdependence still provides the same benefits as P-HTTP, but removes the coupling of connections and, most importantly, transfers these concerns to the transport layer.

An initial solution to this problem was better described in TCP Behavior of a Busy Internet Server [6]. The article studied the performance of parallel connections in a web server, focusing on losses (how a group of independent connections experiences losses, and their combined congestion window after a loss) and on the increase in bandwidth (examining the ratio between total throughput and the number of parallel connections). The authors concluded that multiple parallel connections increased throughput, but made the protocol more aggressive; this aggression leads to more congestion and losses.

They proposed changes in the form of TCP-Int, which would make parallel connections less aggressive, behaving similarly to a single connection. TCP-Int provides better loss recovery and start-up performance for parallel connections, and it only requires changes at the sender, making it compatible with other TCP versions.

Current TCB      Cached TCB    When          New Cached TCB
current_mss      old_mss       MSS update    current_mss
current_rtt      old_rtt       conn close    rtt_update(old, curr)
current_rttvar   old_rttvar    conn close    rtt_update(old, curr)
current_cwnd     old_cwnd      conn close    current_cwnd

Table 2.4: Ensemble Sharing Cache Updates. rtt_update indicates the operation of sampling the newest round trip time (rtt) value.

For improving loss recovery, they devised an integrated congestion control for parallel connections. This mechanism keeps a single congestion window for all parallel connections, and a loss affects the shared window, mimicking the effects of a single TCP connection. The unified window also makes it unnecessary for new connections to do slow start. It can also speed up fast retransmit: a TCP connection that suffers a loss can use packets received later, from any other parallel connection, as duplicate acknowledgments.

They created two simplified structures for storing state per host, instead of per connection; Figure 2.4 shows them. The connection is linked to the host, and packets for the different connections are maintained per host and sent according to a round-robin scheduler.

    struct chost {
        Address addr;        /* address of the receiving host */
        int cwnd;            /* shared congestion window */
        int ownd;            /* outstanding (in-flight) data */
        int ssthresh;        /* shared slow start threshold */
        int count;
        Time decr_ts;        /* time of the last window decrease */
        Packet pkts[];       /* unacknowledged packets, all connections */
        TCPConn conn[];      /* connections sharing this host */
    };

    struct packet {
        TCPConn *conn;       /* connection the packet belongs to */
        int seqno;           /* sequence number */
        int size;            /* packet size */
        Time sent_ts;        /* time the packet was sent */
        int later_acks;      /* acks received after this packet was sent */
    };

Figure 2.4: Per-host TCP-Int structures

The integrated fast recovery mechanism stores, for each packet sent and unacknowledged, the number of acknowledgments that were received after it (later_acks in Figure 2.4). These acknowledgments are not all duplicates, as in standard fast retransmit, because they may come from any connection sharing the same host. By doing this, the protocol decreases the number of false timeouts during the fast retransmit phase; these false timeouts were due to insufficient duplicate acknowledgments. From their server analysis, the authors predicted that 25% of all retransmissions from coarse timeouts could have been avoided.
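A minimal sketch of how such an integrated check might look, under the assumption (suggested by the structures above, with Packet a typedef of struct packet) that each ack arriving on any of the host's connections increments later_acks on the older in-flight packets:

    /* Integrated fast retransmit sketch: an ack on any connection of the
       host counts toward the later_acks of packets sent before the acked
       one; three such acks suggest the packet was lost. Illustrative
       names, not taken from the TCP-Int paper's code. */
    #define LATER_ACK_THRESHOLD 3

    void host_on_ack(struct chost *h, Time acked_sent_ts, int npkts) {
        for (int i = 0; i < npkts; i++) {
            Packet *p = &h->pkts[i];
            if (p->sent_ts < acked_sent_ts &&
                ++p->later_acks >= LATER_ACK_THRESHOLD) {
                /* retransmit p->seqno on p->conn and reduce h->cwnd */
            }
        }
    }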

In their tests, they compared connections using TCP-Int and SACK [24], another protocol used over TCP for congestion avoidance. They noticed that transfers using SACK had more timeouts and worse bandwidth sharing; for TCP-Int, round-robin scheduling allows connections a better share of the network under the single congestion window.

Even though, in the tests performed, TCP-Int performs better with a single congestion window, it is hard to reach conclusions about its performance over the Internet (deployed on a server), which is the primary focus of the protocol changes. Their tests were purposely restrictive, done in a way to experience constant losses, and did not mimic a large-scale network.

Addressing the Challenges of Web Data Transport [25], based on the previous research on TCB interdependence, developed two techniques, for temporal and ensemble sharing respectively: TCP fast start and TCP sessions. The authors started by defining the needs of the WWW at the time: TCP is designed for long burst transfers that maximize throughput, while HTTP transfers are too short to allow this and latency is more important than throughput. They came up with the following solutions:

avoiding the slow-start penalty after an idle period. This can be especially important in connections with

higher latency. The cache also stores the congestion window, but they raise the issue that using an older

congestion window could be too aggressive on the network. Their solution to this is implementing a new

drop algorithm in routers for making fast start traffic have lower priority and be dropped first if causing

congestion. But this solution would require changes in all routers, if there was an incompatible router, it

would not be able to discern that the marked packets must be discarded and would worsen congestion.

TCP Session is built over TCP-Int, aggregating the parallel connections and providing congestion

control and loss recovery mechanisms (most of these were already discussed), designed with a focus

on HTTP applications. These changes impact only the sending of data, not the receiving, so it isn’t

required for both endpoints to use the protocol, or detect if the other one has it.

The changes talked in TCP Session are mostly the same to those on TCP-Int, but it goes into finer

detail about packet scheduling for different connections, implementing a weighted round robin scheduler,

to have a better distinction between differently privileged connections. They claim that the connections’

weights can change dynamically, and this could solve this scheduler’s problems when having to deal

with different sized packets for each connection, but it is not explained.

In their simulation tests, they compared their approach to persistent HTTP and to independent TCP connections. They arrived at similar results between TCP Session and P-HTTP, but for independent connections they noted an increase of 30% to 40% in packet loss when there was more congestion. For moderate loads, TCP Session performs 20% to 25% better than P-HTTP because of its changes to fast retransmit.

One of the faults we find in their work is that the slow start solution is not explained thoroughly; seeing that TCP Session will benefit short connections, slow start should play an important role in minimizing their delay.

Effects of Ensemble-TCP [26] also pursued ways of adapting the initial design for TCB interdependence. Their Ensemble-TCP architecture is capable of both temporal sharing and ensemble sharing, providing a shared structure for parallel TCP connections. The biggest divergence is that they don't differentiate as much between the two, which are tightly coupled in the architecture. To explain the design, the authors go through the different components and the respective thought process.

The authors start by defining the state that should be cached, based on its initial performance cost. A misestimated rtt is costly; by default, TCP connections use a conservative value. An initially higher value is important for unknown higher-latency connections, but a high delay can also be caused by packet loss, so real losses will take time to be detected and dealt with. The same happens with the congestion window: it has a low initial value and increases exponentially on each rtt. This conservative start is bad for the connections' throughput, especially on smaller connections, but it can be improved by using caching.

Variable        Description
r_rtt           round trip time
r_srtt          smoothed round trip time
t_rttvar        variance of the round trip time
snd_cwnd        congestion window
snd_ssthresh    slow start threshold
members         associated connections' TCBs
osegs           unacknowledged packets

Table 2.5: Part of the ECB structure definitions

For grouping connections into ensembles, they differentiate between hosts, but they suggest that this could be extended by grouping subnets instead (in which case, the delay between links in the same subnet would have to be negligible). As in TCB Interdependence, state is shared differently depending on its kind, and some of it needs to be divided between the ensemble's connections. In their case, for sharing congestion control information, a priority scheduler was chosen, with four different policies.

The use of temporal sharing depends on the stability of the values stored, because the network is constantly changing, and using cached values can be too aggressive if the conditions in the network deteriorate. The same problem was referred to previously for TCP fast start, whose method for dealing with it was flawed. Ensemble-TCP proposes an aging mechanism to avoid it, even though the authors fail to describe one in the paper; we can assume such a mechanism would make cached values converge to default ones over time.

The structure used for caching state is called the Ensemble Control Block (ECB), and it can be in two different states, active or cached, depending on whether there is any connection associated with it. In it, they store the TCP state required by a single connection, plus their own variables for supporting multiple connections; a representation of this can be seen in Table 2.5. A new connection will try to use a cached ECB, or create a new one if none exists yet for that specific host. It then creates a modified TCB that is stored inside the ECB; the new TCB references values directly from the ECB and has an added priority field.

For congestion control, it applies the same algorithms we've seen in TCP-Int: a shared congestion window, where an ack increases the shared window and a loss decreases it, and shared fast recovery, where other connections' packets can be used as duplicate acks after an initial timeout.
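Based on the fields in Table 2.5, an ECB might look roughly like the following C sketch (field and type names are assumptions assembled from the table; the paper does not give the full definition):

    /* Hypothetical ECB layout reconstructed from Table 2.5. */
    struct ecb {
        int r_rtt;             /* round trip time */
        int r_srtt;            /* smoothed round trip time */
        int t_rttvar;          /* round trip time variance */
        int snd_cwnd;          /* shared congestion window */
        int snd_ssthresh;      /* shared slow start threshold */
        struct tcb *members;   /* TCBs of the associated connections */
        struct pkt *osegs;     /* unacknowledged (outstanding) segments */
        int active;            /* active while connections exist, else cached */
    };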

An Integrated Congestion Management Architecture for Internet Hosts[27], described changes

in the internet traffic patterns, which could, in turn, threaten the long term stability of the internet. A novel

framework is introduced, capable of controlling network congestion from an end-to-end perspective, that

allows:

1. Multiplexing parallel flows to ensure proper congestion behavior;

2. Adaptation of application and transport protocols through an API.

The article has a wider focus, giving applications control over some transport layer concerns (the ability to track and adapt to different bandwidths and to congestion), at the cost of removing abstraction between protocols and increasing coupling. With this, they get a framework which is independent of transport and application protocols.

For managing different flows of data with different needs, a Congestion Manager (CM) acts as a


central point for maintaining network statistics and scheduling outgoing transmissions according to formal congestion control mechanisms, instead of having streams act independently of each other. Applications use shared state learning to share network information along common paths.

Their implementation of the CM is divided into two modules, one for sending and another for receiving. The sender side schedules data transmissions for all connections sharing the same links, while the receiver side stores statistics on losses.

All network concerns are centered in the CM and state is stored in a centralized manner. This includes TCP's congestion avoidance and control mechanisms, which are used by default for all communications. The CM estimates network capacity based on receiver feedback, mainly packet losses, independently of the transport protocol.

Their design of a web server over the congestion manager is more closely related to our work. A client requests objects from the server and the CM can control how bandwidth is divided between them. It can also provide an adaptive solution, where the same object can be requested at a specified quality, to increase context-based performance.

In Collaborative Congestion Control in Parallel TCP Flows [7], the authors propose TCP/DCA-C as a different way of sharing state between parallel connections, using a delay-based congestion avoidance scheme (DCA) where events are shared collaboratively (C) between flows. Again, the idea of grouping by subnet is hinted at, but they keep host-specific groups in their protocol. To detect congestion, each flow calculates a threshold for the rtt, given by T = rttmin + λ × (rttmax − rttmin), where λ is a constant and the rtt values are taken from the connections' older rtt samples. An rtt higher than the threshold (T) indicates impending congestion. This allows the first connection in a group that experiences a larger delay to quickly share it with the others, providing more accurate congestion windows for the other group members.

The protocol behaves like an ensemble, grouping a host's parallel connections together, but without using any kind of caching; the group is used only for direct communication between flows. When rtt > T in one flow, an event is signaled to all other flows; each flow in the congestion avoidance phase then reduces its congestion window if the originating flow had a lower window. The congestion windows are decreased by a factor of 0.125, given by the parameter α. This value is low, compared to other congestion avoidance mechanisms where the decrease would be 0.5, because the event only signals an increase in latency and not a loss, which would be worse, so the decrease doesn't need to be as pronounced. For the flows on the receiving end of an event signal, the congestion window adjustment is smaller than that of the event sender (Table 2.6); it is inversely proportional to the size of the group (N), to compensate for cases where more flows experience imminent congestion and also signal delay events.

Sender:    cwnd = cwnd − α × cwnd
Receiver:  cwnd = cwnd − α × cwnd/(N − 1)

Table 2.6: TCP/DCA-C congestion window update
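A minimal sketch of these rules in C, under our own reading of the paper; the function names are illustrative, and α = 0.125 follows the value quoted above:

    #include <stdbool.h>

    #define ALPHA 0.125 /* window decrease factor used by TCP/DCA-C */

    /* Delay threshold T = rtt_min + lambda * (rtt_max - rtt_min). */
    static double dca_threshold(double rtt_min, double rtt_max, double lambda)
    {
            return rtt_min + lambda * (rtt_max - rtt_min);
    }

    /* True when a flow should signal a delay event to its group. */
    static bool dca_congestion_imminent(double rtt, double rtt_min,
                                        double rtt_max, double lambda)
    {
            return rtt > dca_threshold(rtt_min, rtt_max, lambda);
    }

    /* Window update on the flow that detected the delay (Table 2.6). */
    static double dca_sender_update(double cwnd)
    {
            return cwnd - ALPHA * cwnd;
    }

    /* Window update on a flow receiving the event, for a group of n flows. */
    static double dca_receiver_update(double cwnd, int n)
    {
            return cwnd - ALPHA * cwnd / (n - 1);
    }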


2.5 Multiplexing Parallel TCP flows

A recurring topic of discussion is the multiplexing of parallel TCP flows. Multiplexing combines the different streams into a single one, as a way of reducing the redundant aggressiveness of having multiple independent flows that share the same path. This aggressiveness comes from each connection having to do slow start and experience losses individually. Multiplexing is an application-level solution: the underlying protocol layers are unchanged, and it is implemented by each application.

Implementing a multiplexing solution is constrained by the TCP protocol layer, because TCP is unaware of multiplexing. All flows that the application differentiates are treated equally by a single TCP connection. If the application requires more from the transport protocol (using different ports on the same host or increasing throughput), then it needs to use parallel connections.

From the literature, some drawbacks can be enumerated related to the use of most multiplexing solutions (as in the case of P-HTTP):

1. Protocol changes are done per application.

2. Each application requires an independent TCP connection; different applications can't multiplex into a single connection.

3. Multiplexing adds coupling between independent streams that share the same connection. A loss

or delay will affect all the objects in the connection.

4. The maximum throughput available to a multiplexed connection is the same as that of a single TCP connection.

The biggest advantage of using parallel connections is, as we have already seen, that we can increase the throughput for a client by increasing the degree of parallelism, but at the cost of increased aggressiveness [6]. Adding an ensemble technique makes these parallel connections behave only as aggressively as a single one, conforming to the TCP Reno standard, without compromising the higher throughput of parallelism.

2.6 Protocols Comparison and Analysis

For a detailed analysis of the different protocols, we focus only on the ensemble sharing part of the reviewed work. We compare the distinct mechanisms in use, giving a subjective assessment where deemed necessary to highlight flaws. The protocols are: TCB Interdependence, TCP-Int, TCP Session, E-TCP, CM and TCP/DCA-C.

The different protocols share similar tactics for introducing efficient ensemble sharing into TCP. TCB Interdependence introduced the idea, but has less architectural detail and no working implementation. It details storing a connection's rtt, MSS and rttvar as soon as possible in a central cache, so that they can be accessed by other connections that share the same path.


TCP-Int, TCP Session and E-TCP are the closest to TCB Interdependence, and easier to differentiate. But, unlike TCB Interdependence, they choose to cache directly in the TCB, which then acts as a common structure between same-path connections and can be updated directly when any flow detects a change in the network. This works better because no extra logic is needed for connections to know when to cache and update their state, and the state is always kept current. Because these designs use a common structure, the srtt, cwnd and ssthresh are also stored, in addition to the values already mentioned. Here a distinction can be made: Ensemble-TCP uses a structure for each TCP connection and then associates its state with the common structure. This allows connections to have independent parameters, as is the case of the connection priority, and gives more flexibility as to what can be stored, future-proofing it for subsequent updates. We question the usefulness of this, compared to the others' simpler design. Of all the protocols, TCP/DCA-C is a special case: it only groups connections so that there can be signaling between them; no state is shared directly, only when signals need to be sent.

As for the Congestion Manager, because the paper doesn't fully explore its inner architecture, there is no information on the details of its congestion mechanisms for parallel connections, making it impossible to dissect thoroughly. It does provide insight into the performance tests used to compare the CM to a standard TCP implementation, checking whether it achieves similar performance and can compete fairly with other TCP implementations.

The main goal of most of these algorithms is to have a shared congestion window for same-path connections; changes to this window are reflected in all the group's connections. TCP-Int and TCP Session act as a single connection for the same path, with a single congestion window and a single queue where the different connections' packets are enqueued. Scheduling-wise, for the former we only know that it uses a round-robin variant; the latter uses weighted round-robin with dynamically set weights. E-TCP does not stray from the formula, using a ticket-based approach which reflects each connection's relative share of the congestion window, according to its assigned priority. The CM also uses a round-robin scheduler, but reinforces the fact that schedulers are interchangeable. Then there is the later acks mechanism implemented by TCP-Int, and also used in E-TCP and TCP Session, which speeds up fast retransmit.

Later acks and congestion window sharing prove that algorithms from standard TCP can be reimplemented to benefit from ensemble sharing, at least those that are path dependent. It remains to be seen whether more algorithms can be adapted the same way.

2.7 Ensemble Sharing Considerations

Ensemble sharing solutions have failed to gain adoption since they were first proposed. We were unable to find any stated reasons for this, but there are a few disadvantages that we assume led to it:

1. At the time, the use of parallel connections was uncommon. In the year 2000, a server found only 44% of clients making parallel connections [4].


2. The implementations don't conform to the TCP/IP protocol stack, where the abstraction between the network layer and the transport layer should be well defined. The protocols need to group connections according to IP-specific concerns, and that breaks the abstraction. This can make it harder to push for the protocol's implementation in operating systems, whose network code relies on those abstractions.

3. NAT interfaces present a risk: a single client behind a NAT may completely deny connections to the same server for other users in the same private network, by purposely delaying acknowledgments or making its connection time out prematurely.


Chapter 3

Architecture

In this chapter we provide an overview of the chosen architecture, detailing the parts that constitute it and how it addresses the imposed objectives. In section 3.1, we describe our architecture requirements. Sections 3.2 and 3.3 explain the mechanisms that compose the architecture, which we divide into connection grouping and congestion control. Section 3.4 gives a short summary of the entire chapter.

3.1 Requirements

As discussed previously, parallelism is often the preferred option for increasing the throughput of TCP connections for a single host, but, as a consequence, multiple connections between the same hosts add per-connection redundancies, resulting in increased aggressiveness inside the network. This will, in turn, increase the amount of packet loss, diminishing the maximum available throughput in the network. Even so, as parallelism is a must for most HTTP applications, enabling TCP to deal effectively with it can be an important factor in reducing the overhead present in modern traffic.

Not only are the solutions presented in chapter 2 relatively old, but web traffic has changed completely in the last decade. With the rise of dynamic content, traffic bursts became more characteristic, which forced browsers to increase their upper limit on parallel connections [3]. It remains to be seen whether the kind of solutions described are now, more than ever, appropriate for the new needs of the Internet.

The base idea is simple: group connections and share events inside the group. This is based on how TCP/DCA-C used concurrency.

• To group connections, we take into account the geographical location of the destination hosts and their delay. To do so, we use the IPv4 address' hierarchical nature to group connections in close proximity and use the round trip time to deal with path and delay inconsistencies.

• Same-group connections can share events, signaling the entry of a new connection, losses and exits, and providing new estimates for the congestion window and slow start threshold.


The system is composed of two parts: Hydra, a data structure that receives network information from different connections and assigns them to groups; and Heracles, the congestion control mechanism that shares events between same-group connections.

3.2 Hydra: Connection Grouping

Hydra can aggregate multiple remote hosts by their Internet Protocol version 4 (IPv4) addresses, using an implementation-defined mask. In our approach we assume that 24-bit masks are used. There is no certainty that any two addresses sharing the same first 24 bits are indeed topologically close, or that hosts with completely different addresses aren't close, but it is the most efficient way for the sender to check for potential group matches.

When a connection is inserted into a group based only on its IPv4 address, a problem arises: we have to assume that all connections in the group have the same delay, which, in the case of the Internet, is impossible. As Jacobson described it, the Internet is a "network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations" [1]. Basing our work on assumptions can reduce system complexity and improve performance for most hosts, but false positives (when a connection is wrongly inserted into a group with different network requirements), even if rare, can be problematic. Suppose disparate rtt samples are taken from different connections: the disparity causes wrong rttvar and srtt calculations, which will, in turn, produce wrong timeouts, decreasing throughput. Even groups of connections with the same IP address could exhibit the same problem: the address could be a NAT interface hiding multiple hosts, and we would need to assume that the internal network latency is always negligible. It should be noted that, even if the internal network latency were negligible, the grouping is still vulnerable to connections that time out, or to ill-intentioned users that delay their acks to inflate the timeout estimate.

This could be handled with prior knowledge of the network, using the protocol only for specific trusted hosts known to have similar path delays. Instead, we have a fail-safe mechanism for detecting the problem above, to make the protocol functional independently of the network topology and minimize the effects of false positives.

The hydra structure is managed by a simple interface for adding, removing and updating groups from the congestion control module. An interval comparison function is used to make decisions for each group, based only on the latest rtt sample. Whenever a group is to be picked, the hydra structure is accessed (Figure 3.1). A hash table separates different 24-bit subnets according to the connections' IPv4 addresses. Each position stores a binary tree, where groups are sorted by the last rtt sample taken from a connection. The group's information can then be accessed during congestion control and updated with new information. Initially a connection has no group; once enough information is gathered from the connection's path, it tries to find a group or create a new one. A group is determined solely by the path's round trip time, and the connection waits for a minimum of 3 acknowledgments before searching for or creating a group. There is no specific reason for the choice of 3 as the lower limit; it was chosen as a way to let the



connection increase its sending rate before joining a group, allowing the network to stabilize a bit before we assume that the connection shares a path with other connections. The group information, described in Table 3.1, is later accessed to calculate network latency estimates. The subnet is stored so that connections changing groups don't need to traverse the initial hash table more than once.

Hydra Group
  Group Size
  Subnet
  Group rtt
  Total ssthresh
  Total cwnd
  Event Timestamps (JOIN, LOSS, LEAVE)

Table 3.1: Information stored in each hydra group.

Figure 3.1: Representation of the hydra structure, composed of an externally linked hash table and binary trees, where each leaf represents a hydra group, which the Heracles structure points to.
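A sketch of how the group state of Table 3.1 could be laid out in C; the field names and types here are illustrative rather than the exact ones used in our source code:

    #include <linux/rbtree.h>

    /* Per-group state from Table 3.1 (names illustrative). */
    struct hydra_group {
            struct rb_node node;         /* links the group into its subnet's tree */
            unsigned int size;           /* number of member connections */
            unsigned int subnet;         /* masked 24-bit IPv4 prefix */
            unsigned int rtt;            /* last rtt sample seen by the group */
            unsigned int total_ssthresh; /* sum of the members' ssthresh */
            unsigned int total_cwnd;     /* sum of the members' cwnd */
            unsigned long events[3];     /* timestamps: JOIN, LOSS, LEAVE */
    };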

After each rtt sample received, the connection checks if it needs to change group, resetting group-specific information and finding a new group. When the connection stops transmitting, it performs any necessary group cleanup. The operation required to change group is heavily simplified in our implementation. Our assumption is that, once a connection is inside a group, the rtt interval calculation doesn't need to be strict and should be quick to verify. We only check whether the rtt sample received is within an interval of 0.5 to 1.5 times the group's last round trip time; the calculation is coarse, but enough to be affected only by big changes in the network.
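In C, the membership test reduces to an integer comparison along these lines (a sketch; the function name is ours, and the u32/bool types come from <linux/types.h>):

    /* Coarse membership test: a sample keeps the connection in its group
     * while it stays within [0.5, 1.5] of the group's last recorded rtt. */
    static bool rtt_within_group_interval(u32 sample, u32 group_rtt)
    {
            return sample >= group_rtt / 2 &&
                   sample <= group_rtt + group_rtt / 2;
    }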


3.3 Heracles: Congestion Control

Heracles is the congestion control algorithm, based on Reno. Its base operation is closely related to Reno's, with added complexity for dealing with the multiple connections in each group, so that a connection inside a group can send and receive events. There are different possible execution paths depending on whether the connection is in slow start or congestion avoidance:

• During Slow Start a connection can skip the phase if it has already found a group; otherwise it behaves the same as Reno and increases the congestion window exponentially.

• During Congestion Avoidance a connection always increases linearly. Afterwards, it tries to find a group, if it hasn't already, and updates the group's total cwnd and ssthresh values.

In either of the previous states, connections check for the latest group events and deal with them accordingly. There are 3 possible events that force cwnd and ssthresh changes for the entire group: losses, joins and leaves. For losses, all connections update their cwnd and ssthresh to the estimated ssthresh of the connection that transmitted the event (the one that suffered the loss), decreasing both values. During leaves, the leaving connection's excess cwnd is split equally across the remaining connections. On reception of a join event, connections only update the ssthresh; there is no congestion window decrease in this case, even if the joining connection starts with an increased cwnd. Performing some decrease could be beneficial, making it easier for newer connections to grow. Our evaluated implementation doesn't do it, as a way of mimicking normal connections and behaving better in concurrent environments.

The Reno cwnd increases stay unmodified in the Heracles protocol. For N connections inside the group, the total window rises N times as fast as a single independent connection, as all connections search for bandwidth concurrently, whether exponentially during slow start or linearly in congestion avoidance.

When a loss happens in Reno, the single connection that perceived the loss halves its ssthresh. In Heracles the same happens, but the reduction is split equally between the connections in the group: instead of a decrease of 1/2 on one connection, there is a decrease of 1/(2N) on each (for N connections in the group). From the group's viewpoint, only one connection suffered the loss, as with Reno, but the penalty is shared equally, to promote fairness. This goes against Van Jacobson's stand on multiplicative decrease, halving the cwnd during congestion [1], and disagrees with the previous implementations, which made losses halve all connections in a group [6, 26]. We argue that this approach is better for the current state of the Internet, where most hosts using multiple parallel connections decrease only the connection that saw the loss, because no information is shared between the different connections. Against competing protocols, the usual aggressive approach would put a conservative Heracles that reduced too quickly at a disadvantage, as others would take the decrease in throughput as a chance to steal more bandwidth. This can be seen in the interactions between Vegas and Reno, where Reno is more aggressive and ends up using a higher share of the bandwidth: Reno tries to overflow buffer queues, leading to losses, and Vegas sees this as a reason to reduce its own sending rate [28].
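As a sketch of this rule in kernel-style C (the function name and the 2-segment floor are our own choices; u32 comes from <linux/types.h>):

    /* Group-wide loss reaction: the halving Reno would apply to one
     * connection is split across the N members, so each connection
     * decreases by cwnd/(2N) instead of cwnd/2. */
    static u32 heracles_loss_ssthresh(u32 cwnd, u32 group_size)
    {
            u32 decrease = cwnd / (2 * group_size);

            /* keep at least 2 segments so the connection can recover */
            return cwnd - decrease < 2 ? 2 : cwnd - decrease;
    }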


A connection can leave a group for two reasons: when it has nothing more to send, or when it receives a round trip time sample outside of the group's interval. In both cases, it produces an event sharing new window information with the group. When it changes group, its own window information remains unaltered, so that the connection doesn't have to restart itself in the new group.

During a join, a connection fetches a ssthresh estimate, which it uses to update its own cwnd and instantly skip the slow start procedure. The connection can then start sending without having to find the initial slow start value, suffer a loss and recover from it. Skipping slow start saves the round trips it would require, during which the connection has a bottlenecked cwnd and purposefully limits its own throughput.

The mechanism used for sharing information is not perfect, because of the use of events. Events are not consumed immediately after being emitted, only when execution control is passed to the recipient connection. This is slightly problematic: in the time it takes for a connection to read an event, the network state may have changed, rendering the event's information useless. The delay until other connections receive the event may be enough to provoke network congestion, especially with a high degree of concurrency in bigger groups. This is a limitation of doing shared congestion control on top of Linux's network stack, which we discuss further in chapter 4. Events are prioritized (from most important to least important) as: loss, join and leave. Losses are the most important, because they always force a window decrease. Joins only require the ssthresh estimate to be updated. A leave event is the least important, because it increases the connection's window and is the only event where our protocol forces a window rise. To simplify the event sharing platform and minimize the performance overhead, connections only deal with the latest highest-priority event received; all other events are ignored. This solution is similar to what was already seen in TCP/DCA-C. Using events allows the congestion control to be built separately from the transport protocol, without having to modify the kernel directly.
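The priority rule can be captured in a few lines of C (a sketch; the names are ours, and the enum ordering encodes the priorities listed above):

    /* Event priorities: loss > join > leave. */
    enum heracles_event { EV_NONE = 0, EV_LEAVE, EV_JOIN, EV_LOSS };

    /* Decide which event survives when a new one arrives while a
     * pending one has not been consumed yet: keep the higher priority,
     * preferring the newer event on a tie. */
    static enum heracles_event keep_event(enum heracles_event pending,
                                          enum heracles_event arrived)
    {
            return arrived >= pending ? arrived : pending;
    }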

3.4 Chapter Summary

In this chapter we presented a general overview of our protocol's inner workings, detailing the two mechanisms that compose it and explaining the reasoning behind the decisions taken in its design.

• Hydra, for grouping different connections into a single entity based on a common path to a subnet and a similar rtt. Shared information is calculated and later read by other connections in the same group, being constantly updated to provide the best network estimates.

• Heracles, through which connections interact with the kernel: receiving network-specific information and sharing it in their respective groups, then deciding on a network state for all connections in each group.

We explained the way different connections communicate with each other, storing events in the group to be later accessed by others. These events can come from a new connection, from a connection that left, or from a connection that suffered a loss. We then presented how connections individually react to each


of these events, changing their network values to accurately represent the shared state. We also explained how connections deal with the rtt samples they receive, adapting to changes in network congestion by partitioning into different groups. This way, connections can defend themselves against abnormal network states, where connections in a group with supposedly close network proximity provide disparate values.


Chapter 4

Implementation

In this chapter, we go through important aspects of the protocol implementation. In section 4.1, we

provide information on the tools that were used to build the working protocol implementation. In section

4.2, we discuss details of the implemented protocol.

4.1 Implementation Options

This thesis' work targets the Linux kernel specifically. The possibility of using network simulators was discarded in favor of building the protocol directly in the kernel, as a real use case implementation. With a simulator we would have more control over the state of the network and could make time-agnostic tests, simulating the connections' delay; this would allow tests to iterate over the protocol's congestion control more times, facilitating the evaluation. Even so, a direct software implementation for the Linux kernel gives us results pertaining to real-world use, in a real operating system, without having to port code built specifically for a simulator, whose different restrictions could change the end results.

4.1.1 Linux Kernel

Linux is based on the Unix operating system. Created by Linus Torvalds and maintained by a team of remote contributors across the world 1, the kernel is almost entirely coded in the C programming language, with the code base openly available.

1. git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/README?id=refs/tags/v3.16.37

Kernel code is different from normal user space code, which uses user libraries and syscalls (system calls) to communicate indirectly with the kernel. User space code is not as prioritized as kernel code, is limited to the provided higher-level libraries, and pays a performance cost for changing the operating system context (between user and kernel) when using syscalls, but is much safer to run. The sanitization of user space input when changing context prevents bad user space code or inputs from stopping the operating system's execution. The same control is not entirely possible in kernel code, where it is easier to cause a panic and crash.


The kernel provides an IPv4 implementation for socket communication and, with it, a complex TCP state machine for managing the sender's transmission, as defined by the TCP standard. TCP has loose rules pertaining to the sender's flow control: the rate of sending and retransmitting is not defined in the describing RFC. These rules are mostly implementation dependent 2.

2. git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/networking/tcp.txt?id=refs/tags/v3.16.37

As for our protocol, the original intention was to code it within the kernel itself. However, this quickly became a problem due to the sheer complexity of the kernel network stack. A straight kernel implementation would give the best performance, but it would require modifying the TCP stack and possibly the IP stack. Direct changes to the kernel code would most likely break the normal implementation, introducing side effects and severely increasing the development time spent debugging.

Some decisions must be stated: we chose to target version 3.16 of the kernel. Major version 4 is the most recent, but version 3 still sees widespread use, and from it we chose minor version 16. The major version update brought small changes to the module interface, but it remains backwards compatible; beyond that, different versions have small differences in the code, for which results may differ. We make no claims about the protocol's effectiveness on versions other than our own.

All referenced documentation and code in our work is from that same version, taken from the Linux code base, which is accessible from public repositories 3.

3. git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree?id=refs/tags/v3.16.37

4.1.2 Linux Modules

Linux allows for the creation of dynamically loaded code in the form of modules. As of kernel version 2.6.13, the TCP implementation allows pluggable congestion control modules.

Congestion Control Modules

The Linux kernel allows the congestion control mechanism to be linked from outside the core kernel, as a kernel module. Building the congestion control as a module reduces the interactions between the network protocol itself and the congestion protocol, making it less error prone. All kernel congestion control mechanisms are built as modules, except Reno, which is hard coded into the kernel and used as a fallback in the absence of any other mechanism. By default, the 3.16 version of the kernel uses the Cubic algorithm 4 [29], a congestion control protocol designed for fairness with competing protocols, whether on short or long rtt paths.

4. git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/networking/tcp.txt?id=refs/tags/v3.16.37

An interface is defined for calling congestion control functions. At a minimum, it requires only 2 functions, ssthresh and cong avoid. The first is called upon a loss and returns a new value for the connection's ssthresh; Reno reads the congestion window and returns half its value. The latter is called when any number of packets is acknowledged (the kernel doesn't call it for each individual ack); Reno compares the cwnd with the ssthresh to determine the connection's state, processing either the exponential or the linear window increase. Implementing the protocol as a module is the easier option, though it has some minor drawbacks (a minimal module skeleton is sketched after the list):

though it has some minor drawbacks:

• The scope of the provided interface only allows operations over a small number of settings that influence congestion control, as is the purpose of the module. For example, we don't change the rtt calculations performed in the kernel directly, at the risk of breaking the TCP protocol completely. Fast retransmit is also out of the module's reach.

• Path information can't be shared directly between connections. Connections in groups can only perform Heracles-specific calculations from inside the congestion control module, increasing the time between one connection modifying the group state and all others updating.
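The sketch below shows the boilerplate of such a module under the 3.16 interface, with placeholder hooks that simply fall back to the kernel's exported Reno helpers; the group logic described in chapter 3 would be added inside them:

    #include <linux/module.h>
    #include <net/tcp.h>

    static u32 heracles_ssthresh(struct sock *sk)
    {
            /* group-aware 1/(2N) decrease would go here */
            return tcp_reno_ssthresh(sk);
    }

    static void heracles_cong_avoid(struct sock *sk, u32 ack, u32 acked)
    {
            /* group lookup and event handling would go here */
            tcp_reno_cong_avoid(sk, ack, acked);
    }

    static struct tcp_congestion_ops heracles_ops = {
            .name       = "heracles",
            .owner      = THIS_MODULE,
            .ssthresh   = heracles_ssthresh,
            .cong_avoid = heracles_cong_avoid,
    };

    static int __init heracles_module_init(void)
    {
            return tcp_register_congestion_control(&heracles_ops);
    }

    static void __exit heracles_module_exit(void)
    {
            tcp_unregister_congestion_control(&heracles_ops);
    }

    module_init(heracles_module_init);
    module_exit(heracles_module_exit);
    MODULE_LICENSE("GPL");

Once loaded, the algorithm can be selected system-wide through the net.ipv4.tcp_congestion_control sysctl.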

4.1.3 Kernel Debugging

Debugging at the kernel level is not easy and requires some setup. Most options consist of running a debugger against a virtualized kernel, though there are simpler options, such as printing debugging information and having errors logged.

Logging

The function printk allows kernel code to print formatted strings into the kernel ring buffer, from which the messages can be read with appropriate timestamps. It has the signature int printk(const char *fmt, ...), and the first argument can receive a prepended keyword to characterize the message's logging level 5. Printing always adds performance overhead, which is noticeable in network code that needs to be efficient. It can be used to log values directly from the congestion protocol, from which we can graph the algorithm's variables, checking if it behaves as expected.

5. git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/include/linux/printk.h?id=refs/tags/v3.16.37

We added a printk line to log information on the state of each connection whenever TCP called the cong avoid function to increase the congestion window. On each log line we stored connection and group identifiers, TCP information and a timestamp. For each entry, we trimmed unnecessary information, plotting the cwnd and ssthresh of the different connections. For the evaluation portion of the work, the logging was removed to improve performance.
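The logging was along these lines (a sketch: conn_id and group_id stand in for however the identifiers are derived, which we leave unspecified here):

    /* Sketch: log the connection state on every cong_avoid call.
     * jiffies provides the timestamp; snd_cwnd and snd_ssthresh are
     * read from the connection's tcp_sock. */
    static void heracles_log_state(struct sock *sk, u32 conn_id, u32 group_id)
    {
            struct tcp_sock *tp = tcp_sk(sk);

            printk(KERN_DEBUG "heracles: conn=%u group=%u cwnd=%u ssthresh=%u t=%lu\n",
                   conn_id, group_id, tp->snd_cwnd, tp->snd_ssthresh, jiffies);
    }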

Kernel Oops

A kernel oops is a kind of error that makes the offending process die, but is not severe enough to crash the kernel itself. It can be caused by a module trying to access an incorrect memory position, or by a call to the BUG or BUG ON macros, which are used for code assertions. Whenever a kernel oops occurs, the kernel task being executed is killed, and the error that led to the oops is logged to the kernel ring buffer with a call stack, some disassembled code and a register dump. After this, the module stops working and can't be unloaded unless the system is rebooted. In the log, we find the function where the oops happened and the machine code offset of the offending instruction. If the kernel


is compiled with the CONFIG DEBUG INFO flag, debugging symbols are available, allowing the kernel

oops to indicate the specific line in the source code that caused the error.

Kernel Panic

Kernel panics are errors from which the kernel can't recover, leading to a crash. Debugging a kernel crash is harder, because information isn't logged as it is for a kernel oops, and memory is flushed from the system and becomes unrecoverable. However, a tool such as kdump 6 helps recover crash dump data, from which we can debug the kernel problem that triggered the crash.

6. git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/kdump/kdump.txt?id=refs/tags/v3.16.37

4.1.4 Scripting

Scripting allows us to reduce the time it takes to perform a set of actions repeatedly. In our case, we resorted to scripting to reduce the overall time spent compiling, running the protocols and evaluating them. We used Make and Python.

Python

Python is an interpreted, dynamically typed programming language, with an emphasis on readability and ease of programming. We used Python 2.7 7 for:

• Starting external processes with the subprocess module;

• Automating the creation of different TCP client processes: using the threading library, we could start concurrent threads opening connections to a TCP server and control their expected behavior;

• Regex pattern matching, which we used to automate the creation of incremental logging files;

• Parsing and trimming of logging files.

7. www.python.org/download/releases/2.7/

Most high-level languages offer the functionality listed above. Languages like C require a lot of boilerplate code, and most of their default libraries are low level, targeting systems programming. On the other side of the spectrum, interpreted languages like Ruby 8, Python and Lua 9 are all dynamically typed, perform no compile-time type checking, and offer high-level programming constructs. They are easier to develop in and can cut a large chunk of development time, though at the cost of lower performance when compared to compiled languages. We chose Python over the others because it has a low learning curve and comes packaged with most Linux distributions.

8. Ruby language: www.ruby-lang.org/en/
9. Lua language: www.lua.org/



Make

Make is used for program compilation and building; we used it for compiling Linux kernel modules. It is useful because it takes care of dependencies between files and can check on its own which files require recompilation, without compiling everything again.

4.1.5 Iperf

Iperf 10 is a simple tool for creating TCP data streams between hosts. A client connects to a server and can transmit a set amount of data, or transmit data over a number of seconds, pushing the limits of the TCP window.

10. Iperf website: iperf.fr/

Initially we tried using our own implementation of a TCP server and client, but the connection constantly failed to make use of the network's capacity. The sender couldn't inject enough data into the network, making the connection window stall during slow start, never losing a packet. In Linux, the window stalls if the congestion window is more than twice the number of packets in flight, in which case the exponential and linear increases aren't processed. This prevents the connection from growing the window indefinitely without using it completely, and then suddenly filling it, injecting more packets than the network can handle and causing congestion. Iperf doesn't suffer from these problems: it can stress the network enough to force losses and a transition to congestion avoidance. It comes with some other useful features too: it can transmit using a specific congestion control algorithm and output data as a comma-separated values (csv) file from which we can graph the throughput. Iperf is available in two versions, 2 and 3.

Iperf3 adds some changes over the previous version; most importantly, it is able to create parallel connections between the client and the server, reverse the data flow (which, by default, goes from the client to the server), and log retransmission information. Unfortunately, its server can only handle one connection at a time. This is a major disadvantage for our evaluation, which requires different clients to connect to the server at any time and precise control of each client's lifespan. For that reason we chose version 2 of the tool.

4.1.6 Netkit

Netkit 11 is a lightweight tool for creating multiple virtual machines to test network applications. It is not a simulator; it only provides a virtual infrastructure, the closest alternative to a physical network. One of this work's objectives was to implement a real use case protocol for the Linux kernel. It is not possible to test it on a simulator, as it would have to be adapted, and even then the simulator would need to use Linux's network stack to accurately portray its behavior. Using Netkit, we can run the tests in a controlled virtual environment that is easily deployable. The biggest drawback to its use is that it is only available for some older specific kernel versions, which makes it harder to access and download the tools required to compile and run the module from the available repositories. To work around this, we created a network interface from the physical machine on which Netkit is running to Netkit itself. The virtual machines only need to run the servers (which receive data), while the physical machine runs the clients, where the congestion protocols are tested, acting as the server of our scenarios.

11. Netkit homepage: wiki.netkit.org/

4.1.7 Tc

Tc is a complex tool for managing traffic control in the Linux kernel. It allows us to constrain network interfaces by imposing queuing disciplines, which take decisions on the packets scheduled for specific interfaces. In our case we used a token bucket filter (tbf), a discipline for limiting the traffic rate. With it we can bottleneck the available bandwidth and add delay to the link.

4.2 Heracles Module

The subsystems that compose the module were already detailed in the previous chapter. Here we present an in-depth explanation of how the module interfaces with the kernel and the Reno algorithm.

In total, five functions are defined to interface with the kernel through the tcp congestion ops structure 12. We've already discussed the ssthresh and cong avoid function pointers, which are the minimum requirements for the TCP stack. We also used three optional functions:

• pkts acked - notification of a new round trip time calculation;

• init - startup behavior for new connections;

• release - cleanup behavior for connections.

12. git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/include/net/tcp.h?id=refs/tags/v3.16.37

The module isn't implemented as a universal bridge between the congestion protocol and the tcp congestion ops interface; instead it encapsulates Reno, entangling its interface with the kernel. The downside is that our module can't easily bridge other congestion protocols, and the implementation would need to be adapted for each one. We could have implemented a middleware algorithm, agnostic to the congestion control module, but it would introduce significant performance overhead, which should be avoided when dealing with a network algorithm.

The ssthresh function is modified from Reno to check if the connection is inside a group. A connection inside a group doesn't have to halve its congestion window; it only needs to decrease it by 1/(2N) (for N connections inside the group). We update the expected ssthresh value, store an event for the loss and return the new ssthresh estimate.

The cong avoid is the main operating function of any congestion algorithm. In ours, as in Reno, the cwnd is compared with the ssthresh to determine the congestion state: if the window is lower than the threshold, it performs slow start; if equal or higher, it performs congestion avoidance. During slow start, the connection tries to find a group once it has the minimum number of acknowledgments. If there is no group, protocol execution is transferred to Reno, which manages the window individually. When in congestion avoidance, the algorithm always performs the Reno window calculations for the connection itself. The connection





doesn't need to have a group during congestion avoidance, but it keeps trying to find one. If there is a group, the connection starts receiving that group's events and updates its window information in the group. Figure 4.1 displays a diagram of the main operations performed by the function.

Figure 4.1: Diagram of the Heracles cong avoid function.

The pkts acked function receives a new rtt sample and does a fast interval check to determine if a group change is needed (Figure 4.2). A more detailed approach would use a mix of different metrics, because this problem bears some similarity to the rto calculation problem, where higher degrees of congestion were not taken into account and the algorithm failed to approximate the timeout correctly [1]. For our protocol, if the interval is too small, group churn will be higher, decreasing performance; if the interval is too big, connections with diverging paths will share bad information in the group, which may increase the injection of packets into the network, inducing congestion and, consequently, losses. We don't focus on this problem, but are aware of what it entails.

Figure 4.2: Diagram of the Heracles pkts acked function.

The init function initializes the heracles control structure, which primarily stores the current group and the event timestamp counters. The release function handles the heracles structure's exit routine, checking for an existing group, removing the connection from it and emitting a leave event.

As to the interaction between Heracles and Hydra, Hydra exposes an interface that abstracts its internals from Heracles and allows control over the following operations (a usage sketch follows the list):

34

Page 51: Mastering the Concurrency of Shared Path TCP Connections€¦ · Mastering the Concurrency of Shared Path TCP Connections Pedro de Almeida Braz Thesis to obtain the Master of Science

• bool hydra_remains_in_group(struct heracles *heracles)
quickly determines whether the connection will change group after the update;

• struct hydra_group *hydra_add_node(struct heracles *)
returns the group for the respective Heracles structure;

• struct hydra_group *hydra_update(struct heracles *)
changes the group of the Heracles structure;

• void hydra_remove_node(struct heracles *)
removes the connection from its group, performing group cleanup if required.
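A sketch of how the rtt handling drives this interface (control flow only; the heracles fields and the minimum-sample constant are simplified from the actual code, and inet_csk_ca comes from <net/inet_connection_sock.h>):

    static void heracles_pkts_acked(struct sock *sk, u32 num_acked, s32 rtt_us)
    {
            struct heracles *h = inet_csk_ca(sk); /* private congestion state */

            if (rtt_us <= 0)
                    return; /* no usable rtt sample in this ack */
            h->rtt = rtt_us;

            if (!h->group) {
                    /* wait for the minimum number of samples, then join */
                    if (++h->acks >= 3)
                            hydra_add_node(h);
            } else if (!hydra_remains_in_group(h)) {
                    /* sample fell outside the group's interval: move */
                    hydra_update(h);
            }
    }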

For each connection added to the hydra structure, we read the connection's IPv4 address from the sock structure (the sk daddr field 13), which is passed directly as an argument to the congestion control functions. When a subnet is to be picked for the new connection, we perform a bitwise operation to get the key for the hash table, traversing the initial portion of the hydra structure. The operation allocates memory for the hydra group structure using the kmalloc function, which is similar to malloc but receives a flag with specific instructions for the memory allocator. For network operations the allocator cannot sleep, so the GFP NOWAIT flag is chosen 14.

13. git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/include/net/sock.h?id=refs/tags/v3.16.37
14. git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/include/linux/slab.h?id=refs/tags/v3.16.37
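In code, the allocation reduces to something like the following (a sketch, assuming the hydra_group layout sketched in chapter 3):

    #include <linux/slab.h>

    /* Allocate a group without sleeping: congestion control runs in
     * softirq context, where the allocator must not block. */
    static struct hydra_group *hydra_alloc_group(void)
    {
            return kmalloc(sizeof(struct hydra_group), GFP_NOWAIT);
    }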

When an update happens, we check whether the connection will change groups and whether the old group will become empty, removing that group in the process. The connection is then added to a new group, directly from the tree it was in. Removing a node deletes the connection's information, and the group's if needed, freeing the allocated memory in the process. Special care is taken to successfully remove groups once no connection remains inside; failing to remove them would leak memory and increase group search times in the tree.

4.2.1 Fast Retransmit

Fast retransmit was one of the mechanisms referred to in previous papers [6] that could be modified to benefit from connection grouping: connections use acknowledgments from others in the same group to trigger fast retransmit sooner, recovering earlier. But this mechanism is not accessible from the kernel module, and gaining a deep enough understanding of the TCP portion of the kernel would require more time than was available. For those reasons it was left out of the implementation.

4.2.2 Data Structures

Hydra uses different data structures to fulfill its duties efficiently. A hash table is used for group insertion; as a key we use 8 bits of the address, from index 16 to 23. This minimizes Hydra's reserved memory space in the kernel, needing only a table of 256 positions. The kernel provides a default hash table implementation 15. This table uses external chaining for dealing with collisions, which are appended to a list, making the search time linear for IPs that repeat the same 8 key bits. This is not a big problem, since connections only go through the hash table when created; even so, for bigger servers, a bigger key should be used, sized for the average number of distinct IP addresses seen by the server. A hash table seems like the better option for high-performance code, being only memory dependent; trees require higher search times and constant rebalancing.

15. git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/include/linux/hashtable.h?id=refs/tags/v3.16.37
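A sketch of the table definition and key extraction, using the kernel's hashtable helpers; reading the third octet of the address is our interpretation of "bits 16 to 23":

    #include <linux/hashtable.h>

    /* 2^8 = 256 buckets, indexed by 8 bits of the destination address. */
    static DEFINE_HASHTABLE(hydra_table, 8);

    /* Extract the key from a network byte order IPv4 address. */
    static u32 hydra_hash_key(__be32 daddr)
    {
            return (ntohl(daddr) >> 8) & 0xff; /* third octet of a.b.c.d */
    }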

For the group search portion of the structure, a self-balancing sorted binary tree is used. Being able to access groups in sorted order is necessary for quickly picking a group whenever there is a new connection or an update causes a change of group. The use of a tree also allows groups to use dynamic amounts of memory, even though it has higher lookup times when compared to a hash table. This matters because a server has no guarantees about the number of connections, within the same 24-bit subnet, occupying the same tree. The kernel provides a Red-Black tree implementation 16 [30], which serves the same purpose as AVL trees [31]: both are sorted, self-balancing binary trees, differing in how strictly they balance, with red-black trees rebalancing less often on updates at the cost of slightly slower lookups.

16. git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/Documentation/rbtree.txt?id=refs/tags/v3.16.37
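The sorted insertion follows the standard kernel rbtree idiom; a sketch, assuming the hydra_group layout sketched in chapter 3 (with its rb_node member and rtt key):

    #include <linux/rbtree.h>

    /* Insert a group into its subnet's tree, keyed by the last rtt. */
    static void hydra_tree_insert(struct rb_root *root, struct hydra_group *grp)
    {
            struct rb_node **link = &root->rb_node, *parent = NULL;

            /* descend to the leaf position that keeps the tree sorted */
            while (*link) {
                    struct hydra_group *cur;

                    parent = *link;
                    cur = rb_entry(parent, struct hydra_group, node);
                    if (grp->rtt < cur->rtt)
                            link = &parent->rb_left;
                    else
                            link = &parent->rb_right;
            }
            rb_link_node(&grp->node, parent, link);
            rb_insert_color(&grp->node, root); /* rebalance */
    }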



Chapter 5

Evaluation

In this chapter, we present the results achieved with the Heracles protocol, comparing it against Reno and Cubic. Section 5.1 explains the goals of the evaluation scenarios, required to test the correct behavior of the protocol. Section 5.2 describes the different tests used to evaluate the protocol in different environments. Section 5.3 explains how the tests are performed and how results are obtained, handled and shown. Section 5.4 presents and discusses the results. Section 5.5 concludes the chapter, providing a deeper analysis of the entire set of results, characterizing the protocol's advantages and disadvantages in different scenarios.

5.1 Tests Objectives

For the evaluation portion of our work, we compare 3 different protocols: Heracles, Reno and Cubic. Of these, the comparison between Heracles and Reno is the most important, because the former builds on the latter to achieve better performance. Cubic is also included in the tests as a comparison point against a more modern congestion control algorithm.

The evaluation focuses on different aspects of the Heracles protocol, testing its throughput against the alternatives. The objectives are the same as described in chapter 1:

• Skip slow start on paths for which the threshold is already known;

• React to losses in a group by decreasing throughput fairly;

• Share common path information to provide better cwnd and ssthresh estimates.

For our evaluation, we present different test cases with specific characteristics where the Heracles protocol should have an advantage; better performance should show as increased throughput. It should be noted that the tools used don't provide TCP-specific information, as is the case of retransmissions. An increased number of retransmissions could be an indicator of a more aggressive protocol, but their negative effect ultimately impacts the protocol's throughput, even if the cause is harder to diagnose. As such, we only use throughput as a performance metric and the mean deviation as a fairness metric.


The following test cases are not meant to be thorough or to accurately describe the normal behavior of TCP-based protocols, but to provide insight into the algorithm's use cases. The tests are:

• Long/Short - 1 long-lived connection and short-lived connections at constant intervals;

• Parallel - multiple connections over the same interval;

• Sequential - short-lived connections overlap: before a connection ends, another starts;

• Packet - 2 client streams constantly flood the network, each with data of different lengths.

The tests are described more in detail in each respective section.

5.2 Tests Scenarios

5.2.1 Long-Short

This test has a maximum-throughput connection, which we call the long connection, that is the first to be created and lasts until the end of the test. Short-lived connections then start transmitting at constant intervals; at any point in the test there is one short-lived connection and the long-lived one. All connections are controlled by how much they can send in a specific amount of time, because time is independent of the network's available throughput and, as such, easier to manage. This should allow us to analyze how many bytes the short connections can transmit in short intervals of time while taking advantage of the preexisting connection; specifically, how fast they join a group and converge in the network, and how the whole group adapts. For this test we used 20 short connections, each lasting 5 seconds, with a 1 second interval between them. The test lasts 3 minutes in total.

5.2.2 Parallel

This test is comprised of parallel connections from the server to different clients; connections start and end at the same time. This test should discern how fast connections converge and how consistent they stay throughout their lifetime. The test won't factor in group churn for Heracles, because of the consistent network state throughout. Losses should be the major factor deciding throughput. The test was run for 2, 4 and 10 parallel connections, with each single test lasting 60 seconds.

5.2.3 Sequential

For this test, short connections are created sequentially, a connection only stops transmitting after the

following connection starts. At most, only 2 connections transmit at the same time. This allows us to test

connection churn inside the Heracles’ groups, for a low number of connections. It should also test how

fast can the connections share information and converge on the network, compared to other connections

38

Page 55: Mastering the Concurrency of Shared Path TCP Connections€¦ · Mastering the Concurrency of Shared Path TCP Connections Pedro de Almeida Braz Thesis to obtain the Master of Science

Figure 5.1: Test Network

The server opens 50 connections, each lasting 5 seconds; 2 seconds before the current connection stops transmitting, the next one starts.

5.2.4 Packet

The Packet test is the only one that is not time dependent. There are 2 main client streams, one sending 1 MB and the other sending 100 KB; the first opens 10 connections sequentially and the second opens 100 connections sequentially. Heracles should be able to reduce the throughput of the larger transfers in favor of speeding up the smaller ones.

5.3 Methodology

The Netkit software was used to run the tests locally in a controlled environment. Tests are not simulated, but use the Linux kernel to perform networking operations; this should give us accurate results with a low degree of unpredictability.

The network consists only of a server and a client, directly linked, with a 100 ms delay and a 100 Mb/s total throughput (Figure 5.1).

To make the connections' behavior more realistic we added a small number of background TCP connections between the 2 machines; the tests use 10 such connections, transmitted using Reno. These connections also try to send as much as possible and are refreshed every 10 seconds. Without background traffic, the connections under test could only steal throughput from each other. Since we want to replicate the parallel behavior of connections on the Internet, where many clients compete for the path, opening 2 connections should almost double the client's throughput, instead of it staying the same as it would if those were the only connections using the network.

Tests are almost all time dependent, each taking from 1 minute up to 5 minutes. We repeated each test 10 times for each of the 3 algorithms used. Samples were gathered using iperf's lowest logging interval of half a second. Output generated by the iperf client processes is redirected to temporary files; this data is then read by our Python script, which separates the different clients according to the source port in the CSV files. iperf generates CSV logs with the following fields: timestamp, source IPv4 address, source port, destination IPv4 address, destination port, total duration, log start interval, log end interval, bytes transferred and throughput in bits per second. The iperf logs revealed some problems:

• The timestamp is not accurate enough, with the smallest interval being in seconds.

• The output appears in the standard output with a delay, and the time field represents when the log was flushed to the standard output, not when the respective event happened.

These are problematic when trying to divide the logs and have them correspond correctly in time to each other. To fix this, we added a counter to each log entry representing the sequence in which it appeared (this works because logging intervals are constant).
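As an illustration, a minimal sketch of this splitting step is shown below, assuming the CSV column order listed above; the field names, file handling and helper name are ours, not part of the actual test harness.

    import csv
    from collections import defaultdict

    # CSV columns as produced by iperf, in the order listed above.
    FIELDS = ["timestamp", "src_ip", "src_port", "dst_ip", "dst_port",
              "duration", "interval_start", "interval_end", "bytes", "bps"]

    def split_by_client(log_path):
        """Group iperf samples per client (source port), indexed by a
        sequence counter instead of the second-accurate flush timestamp."""
        samples = defaultdict(list)   # src_port -> [(sample_index, Mb/s), ...]
        with open(log_path) as f:
            for row in csv.DictReader(f, fieldnames=FIELDS):
                port = row["src_port"]
                # The counter stands in for time: logging intervals are a
                # constant 0.5 s, so index i corresponds to t = 0.5 * i.
                samples[port].append((len(samples[port]),
                                      float(row["bps"]) / 1e6))
        return samples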

From the different connections we can plot graphs to observe the algorithms' behavior, and process the data to calculate the average throughput and the throughput deviation. To visualize the throughput probability for each protocol we show the data as an empirical CDF graph. Each graph is built from the average throughput samples taken directly from iperf. iperf showed limitations when outputting the average throughput for each connection: most log entries had the same throughput values, and we were unable to pinpoint what caused this inaccurate behavior. Only in the packet tests were we able to get a larger variety of throughput samples. In each graph of this type, the horizontal axis represents the connections' throughput in Mb/s and the vertical axis represents the cumulative fraction of samples. These graphs are created using the mathematical Python libraries numpy (www.numpy.org), which handles the data, and matplotlib (matplotlib.org), which plots it.
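As an illustration, the construction of such a graph can be sketched as follows; the function and variable names are ours and the styling is arbitrary.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_ecdf(samples_mbps, label):
        """Plot the empirical CDF of one protocol's throughput samples."""
        x = np.sort(np.asarray(samples_mbps, dtype=float))  # throughput (Mb/s)
        y = np.arange(1, len(x) + 1) / len(x)               # cumulative fraction
        plt.step(x, y, where="post", label=label)

    # Hypothetical usage, one curve per protocol:
    # plot_ecdf(reno, "Reno"); plot_ecdf(cubic, "Cubic"); plot_ecdf(heracles, "Heracles")
    # plt.xlabel("Throughput (Mb/s)"); plt.ylabel("Cumulative fraction of samples")
    # plt.legend(); plt.show()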

One of our examples shows the connections' throughput over time, to differentiate the connections being monitored in our test cases. This graph is made from the iperf data and plotted with gnuplot (gnuplot.sourceforge.net). To smooth the data visually, we apply a 5-point centered moving average, using the two previous values, the value itself and the two following values. The horizontal axis represents the sample index (2 samples per second) and the vertical axis the connection's throughput in Mb/s.
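The smoothing step corresponds to a centered moving average; a minimal sketch with numpy (the helper name is ours):

    import numpy as np

    def smooth5(samples):
        """5-point centered moving average: the two previous values, the
        value itself and the two following values, equally weighted."""
        kernel = np.ones(5) / 5.0
        # mode="valid" drops the edges, where a full 5-point window
        # does not exist.
        return np.convolve(np.asarray(samples, dtype=float), kernel, mode="valid")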

The result tables show average throughput and deviation. The average throughput is obtained by calculating the mean of all entries in each test and then averaging over all test repetitions. The deviation is the average of the mean deviation of each test: in each test we calculate the mean, for each sample we calculate its deviation from that mean, and we then average the mean deviation results over all repetitions.
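Restated as formulas (notation ours): with $x_{i,r}$ the $i$-th throughput sample of repetition $r$, $N_r$ the number of samples in that repetition, $\bar{x}_r$ its mean and $R = 10$ the number of repetitions,

$$\bar{x} = \frac{1}{R} \sum_{r=1}^{R} \bar{x}_r, \qquad d = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{N_r} \sum_{i=1}^{N_r} \lvert x_{i,r} - \bar{x}_r \rvert .$$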

5.4 Test Results

5.4.1 Long/Short

For the long/short test (Table 5.1), Cubic has the best overall performance, with the long connection clearly able to steal a big share of the available bandwidth. On the downside, it is the least fair of the 3 protocols, leaving the lowest amount of bandwidth to the short connections.



                                          Reno            Cubic           Heracles
                                        long  short     long  short     long  short
    Average Throughput (Mb/s)           9.57  7.28     14.19  6.2      11.45  7.61
    Average Short/Long Deviation (Mb/s)     2.69            7.72            3.98
    Total Throughput (Mb/s)                16.85           20.4            19.05

Table 5.1: Long/Short test results.

Figure 5.2: Empirical CDF plot for Long/short throughput.

Because the long connection has an average throughput of 14.62 Mb/s, less throughput is left for the rest of the connections in the network.

Heracles, on the other hand, has a better long throughput than Reno, along with the best performance for short connections. This is due to the way the protocol deals with group joins, making both slow start and its initial loss unnecessary for connections entering a group. Exits then allow the protocol to achieve a higher performance than Reno, on which it is based. Comparing total throughput, Heracles has 4.9% less than Cubic. As for the fairness metric, the average mean deviation between long and short connections, Cubic has almost double the deviation of Heracles.

From Figure 5.2, comparing the performance probability of the total throughput, Heracles has the lowest probability of poor performance, keeping the highest probability of better throughput until the 10 Mb/s mark; past it, Cubic has the highest throughput ceiling.

5.4.2 Parallel

For 2 long lasting parallel connections (Table 5.2), Heracles clearly had the worst performance of the 3 protocols, with an average throughput 18.56% lower than Cubic's and 2.8% lower than Reno's. For a 100 Mb/s link shared by the 2 test connections and the 10 background connections, perfect sharing would give each connection an ideal throughput of about 8 Mb/s; again, Cubic is able to steal the highest share.

This test only exercises long lasting connections, so joins and leaves are rare, and group churn should be non-existent because all connections keep a constant throughput. The test therefore focuses on loss performance, with the congestion avoidance window rising too high and lowering back down. Even when comparing only Reno and Heracles, our protocol's performance degrades. The way both protocols deal with losses is similar, so a big difference in throughput was not expected; the added complexity of the Heracles protocol is probably to blame for the slightly lower performance.

                                    Reno    Cubic   Heracles
    Average Throughput (Mb/s)       16.07   19.18   15.62
    Average Mean Deviation (Mb/s)   0.58    1.51    1.3

Table 5.2: Results for parallel tests with 2 connections.

With 4 connections (Table 5.3), the protocols start gaining a considerable share of the available

throughput. They all have similar performance, with a slight disadvantage to Heracles, though Cubic

loses its significant edge over the others.

                                    Reno    Cubic   Heracles
    Average Throughput (Mb/s)       30.25   30.40   29.84
    Average Mean Deviation (Mb/s)   0.5     1.19    1.04

Table 5.3: Results for parallel tests with 4 connections.

Finally, for 10 connections (Table 5.4), the test controls half of the connections using the network link. In this test, Reno gets on average more than half the throughput available in the network. Compared to Reno, Cubic has 3.86% worse performance and Heracles 12.18% worse. Interestingly, the connection deviation is lower here, which may be due to the higher number of connections being analyzed providing a better estimate. Figure 5.3 shows the empirical CDF values for each of the parallel tests. It should be noted that the throughput scale shrinks as connections are added, because more parallel connections reduce the throughput available to each one, which in turn diminishes the benefit of using parallel connections.

                                    Reno    Cubic   Heracles
    Average Throughput (Mb/s)       50.35   48.29   44.11
    Average Mean Deviation (Mb/s)   0.3     0.6     0.15

Table 5.4: Results for parallel tests with 10 connections.

5.4.3 Sequential

For sequential connections (Table 5.5), Heracles shows the best performance, transmitting 11.7% more than the other protocols. This can be attributed to the high connection churn, which benefits Heracles, as events consist mainly of joins and leaves: joins quickly increase the newer connections' windows, while leaves allow the older connections to take a higher share of the network than was previously available. In Figure 5.4 the values are similar for the 3 protocols, but for this test the Heracles curve is the rightmost, giving it the highest probability of high throughput.


Figure 5.3: Empirical CDF graph for 2, 4 and 10 parallel connections respectively.

                                    Reno    Cubic   Heracles
    Average Throughput (Mb/s)       6.80    6.80    7.60
    Average Mean Deviation (Mb/s)   1.14    1.00    1.14
    Bytes Transferred (GB)          0.433   0.432   0.483

Table 5.5: Sequential test results.

5.4.4 Packet

Connections during the Packet test achieve similar results (Table 5.6), with Reno having the worst performance and Heracles the best, but also the highest deviation of the 3 protocols. Heracles' throughput is only 2.78% higher than Cubic's, but its deviation is 31.94% higher. From Figure 5.5 we can see that the highest deviation translates into a higher throughput ceiling. For the lowest throughputs, Heracles has probabilities similar to Cubic's. Reno has the leftmost curve, with the worst throughput distribution and a clear margin between itself and the other protocols.

                                    Reno    Cubic   Heracles
    Average Throughput (Mb/s)       9.96    11.89   12.23
    Average Mean Deviation (Mb/s)   1.86    1.91    2.52

Table 5.6: Packet test results.


Figure 5.4: Empirical CDF plot for sequential throughput.

Figure 5.5: Empirical CDF plot for packet test throughput.

5.5 Protocol Analysis

From the results it is hard to draw firm conclusions on the effectiveness of the Heracles protocol. On one hand, Heracles proved able to keep up with Cubic in the packet and long/short tests, with only slight throughput differences between the two. In the sequential test, Heracles was the most successful, guaranteeing the highest throughput by a noticeable margin. These are tests with bursty connections lasting only a few seconds, where Heracles can take advantage of most of its information-sharing features; the scenario resembles the expected behavior of downloading a web page.

For parallel connections, the protocol was the worst performing, with a 12.18% lower throughput than the best performing protocol. This is a problem, because it does not guarantee that the protocol can achieve good results in environments with a high degree of parallelism, which is part of the protocol's specification.

From the evaluation, we identified specific points that should be worked on to improve the results. First, we detected two implementation faults. Connections inside groups can only change group after receiving an rtt sample and comparing it with the current group's interval. This does not prevent same path connections from ending up in two different groups with similar intervals; having the groups partitioned this way increases group lookup times and the path specific redundancies we try to avoid. Such groups should be merged into a single one. The other implementation problem comes from a connection changing its group. When a connection leaves a group and sends a leave event, the other connections try to inflate their windows quickly, while the connection that left keeps its throughput. This is an issue: if the path shared by these connections is in fact the same, or the group change was provoked by an anomalous rtt, the number of packets in flight suddenly increases, leading to congestion. Together, the two problems lead to unfair behavior of the protocol (as in Figure 5.6); the leave behavior is sketched below.

Figure 5.6: 2 connections partitioning into different groups with different throughput values.
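A minimal sketch of this behavior, with names and structure ours rather than the kernel implementation's, shows why the number of packets in flight can spike on a leave:

    class Connection:
        def __init__(self, cwnd):
            self.cwnd = cwnd                 # congestion window, in packets

    class Group:
        def __init__(self, members):
            self.members = list(members)     # connections assumed to share a path

        def on_leave(self, conn):
            """Redistribute the leaver's share among the remaining members."""
            self.members.remove(conn)
            if not self.members:
                return
            share = conn.cwnd // len(self.members)
            for m in self.members:
                m.cwnd += share              # remaining members inflate quickly...
            # ...but conn keeps its own cwnd and goes on sending, so if the
            # paths still coincide (e.g. the leave was due to an anomalous
            # rtt), the total data in flight suddenly grows, causing congestion.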

Finally, some values could be tweaked with the help of a more detailed analysis, which would require slightly different versions of the Heracles algorithm to be evaluated against each other. The aspects of the protocol we consider most important to tune are the following:

• Group interval - the rtt interval calculation is too coarse and does not take the connection's rttvar into consideration. It is important to factor in other variables instead of using just the rtt: a score should be calculated as a mix of different sender variables, such as the srtt and rttvar, that predict network anomalies. This would help reduce false group changes.

• Min Acks - the threshold for the minimum number of acknowledgments before a fresh connection can join a group, for which we use 3. The need for a Min Acks requirement comes from the lack of information about a new connection before it joins a group. We see two major problems with our approach. First, the number used has no specific reasoning behind it; it was chosen so that connections have enough packets in flight for the rtt to converge, which can then be used to join a group. Second, the current round trip time alone does not indicate that the connection belongs in a preexisting group. Each connection should estimate the variability of the network, using the previously sampled rtts, before inserting itself into a group; a possible admission gate is sketched below.
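The following sketch illustrates such a gate; the structures, names and the 0.25 variability factor are illustrative assumptions, not the implemented values:

    from dataclasses import dataclass

    @dataclass
    class ConnState:
        acked: int        # acknowledgments received so far
        srtt: float       # smoothed rtt estimate (ms)
        rttvar: float     # rtt variability estimate (ms)

    @dataclass
    class GroupState:
        rtt_lo: float     # lower bound of the group's rtt interval (ms)
        rtt_hi: float     # upper bound of the group's rtt interval (ms)

    MIN_ACKS = 3          # the current fixed threshold

    def may_join(conn, group):
        """Admit a connection only with enough samples AND a stable estimate."""
        if conn.acked < MIN_ACKS:            # not enough rtt samples yet
            return False
        if conn.rttvar > 0.25 * conn.srtt:   # estimate has not converged
            return False
        return group.rtt_lo <= conn.srtt <= group.rtt_hi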


Chapter 6

Conclusions

6.1 Summary

In this document we presented Heracles, a new congestion control protocol for TCP, designed to improve network performance for same path connections by decreasing losses and increasing the maximum per connection throughput.

The goal was to improve the performance of same host parallel connections by sharing information, mainly the slow start threshold and the congestion window, between connections that have similar paths to nearby receivers in the same subnet. For each individual connection, these values are the means of deciding a safe interval from which to control the packet sending rate.

For hosts sharing the same network path, per connection congestion information becomes redundant: every connection reaches the same results separately, wasting time in the process. We proposed an alternative: by sharing path specific information, connections can skip the slow start procedure, since they receive an estimate of the minimum number of outstanding packets they can have. Connections can share congestion events, dividing the window decreases evenly on losses to improve fairness. Finally, for exiting connections, the share of throughput they leave behind can be reused by the other connections in the group. This allows individual connections to increase their maximum throughput ceiling, taking a higher share of the network than connections using other protocols.

To enable group creation, we presented Hydra, the data structure responsible for managing the different groups of TCP connections, giving same path connections access to information about each other. Hydra is designed with performance in mind, as it sits on network operations that require it to be fast. The congestion control is Heracles, which controls access to the Hydra structure and takes care of interfacing with the kernel TCP stack through a module interface, from which it receives the information used to manage the different groups and to make decisions for the sender based on finer network estimates.

Our proposal was evaluated using a prototype implemented in the Linux kernel; we observed that it performs better for short connections, allowing them to finish sooner.


6.2 Achievements

We implemented a protocol, inspired by the previous work in the field referred to throughout this document, that encapsulates common path connections into groups, allowing them to share path specific information with each other. We then evaluated the protocol's throughput gains in different network environments. The protocol should help reduce both the slow start time for short, bursty connections and total network losses.

As an improvement over previous proposals, we allow the sender to group connections by subnet, increasing the number of hosts covered by the protocol, which is important for the server to client communication that occurs on the Internet. We also made the protocol robust in environments where same subnet or same address connections have different delay values, something the other protocols did not handle, allowing network throughput to be compromised in a number of scenarios.

The protocol is fully compatible with Linux 3.16 and could easily be adapted to older versions of the kernel. Tests were performed on Linux, so results should be close to the performance of a real use case.

6.3 Future Work

Some problems were left unsolved by this work and new ones arose. The Heracles evaluation showed decreased throughput with an increased number of parallel connections, and the protocol should be tuned to fix this. In its current state, it adapts poorly to highly parallel environments, precisely the ones it is most suited for, such as server to clients communication where some clients are in close network proximity to each other.

The memory usage of the protocol was not evaluated. For cases where memory can become a bottleneck, it is important to know how much memory the protocol requires on average per connection, compared to the normal network stack's memory usage.

The protocol does not deal with some problems that can arise after prolonged use, namely integer overflows. We use machine dependent integers as timestamps, but perform no overflow checks. During normal behavior, an event increases the group's timestamp; if a connection's timestamp is lower, it accepts the event and increases its own timestamp. After an integer overflow on the group's timestamp, connections stop receiving events; a standard fix is sketched below.
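One common fix is serial number arithmetic, as used for TCP sequence numbers; a minimal sketch in Python, assuming 32-bit timestamps (the kernel code would use fixed-width integer types directly):

    HALF = 1 << 31                 # half the 32-bit timestamp space
    MASK = (1 << 32) - 1

    def ts_after(a, b):
        """True if timestamp a is logically later than b, even across an
        overflow, provided the two are less than 2**31 apart."""
        diff = (a - b) & MASK      # wrap the difference into 32 bits
        return 0 < diff < HALF     # i.e. the signed difference is positive

    # The counter wrapped: 5 came after 0xFFFFFFFE.
    assert ts_after(5, 0xFFFFFFFE)
    assert not ts_after(0xFFFFFFFE, 5)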

The evaluation should be extended to TCP based applications, like HTTP, which can make connections behave in many different ways; results should then be analyzed from the point of view of an HTTP server.

The Heracles protocol should also be evaluated with tweaked variables, under parametrized tests. Several aspects of the protocol can influence fairness and losses, as discussed:

• cwnd decrease for join events;


• minimum amount of acks required to join a group;

• cwnd decrease for loss events;

• cwnd increase for leave events;

• group score calculation.

Tests should then be extended to the Internet. The protocol would have to be analyzed over a longer period of time, compete against a larger set of congestion control protocols and deal with different types of traffic with higher delay variability.

Finally, some performance tuning should be done: the code should be profiled, tuned and cleaned. The complexity of the protocol is much higher than that of other protocols, increasing the time spent processing congestion parameters, which adds to the total delay of connections. These operations are performed at the network level and should be as simple as possible.

