
Transcript of PhD Thesis


Rolando da Silva Martins

On the Integration of Real-Time and Fault-Tolerance in P2P Middleware

Departamento de Ciência de Computadores
Faculdade de Ciências da Universidade do Porto
2012


Rolando da Silva Martins

On the Integration of Real-Time and Fault-Tolerance in P2P Middleware

Thesis submitted to the Faculdade de Ciências da Universidade do Porto for the degree of Doctor in Computer Science

Advisors: Prof. Fernando Silva and Prof. Luís Lopes

Departamento de Ciência de Computadores
Faculdade de Ciências da Universidade do Porto
May 2012


To my wife Liliana, for her endless love, support, and encouragement.


–Imagination is everything. It is the preview of life's coming attractions.

Albert Einstein

Acknowledgments

To my soul-mate Liliana, for her endless support in the best and worst of times. Her unconditional love and support helped me to overcome the most daunting adversities and challenges.

I would like to thank EFACEC, in particular Cipriano Lomba, Pedro Silva and Paulo Paixão, for their vision and support that allowed me to pursue this Ph.D.

I would like to thank EFACEC, Sistemas de Engenharia, S.A. and FCT - Fundação para a Ciência e a Tecnologia for their financial support, through the Ph.D. grant SFRH/BDE/15644/2006.

I would especially like to thank my advisors, Professors Luís Lopes and Fernando Silva, for their endless effort and teaching over the past four years. Luís, thank you for steering me when my mind entered a code frenzy, and for teaching me how to put my thoughts into words. Fernando, your keen eye is always able to grasp the “big picture”; this was vital to detect and prevent the pitfalls of building large and complex middleware systems. To both, I thank you for opening the door of CRACS to me. I had an incredible time working with you.

A huge thank you to Professor Priya Narasimhan, for acting as an unofficial advisor. She opened the door of CMU to me and helped to shape my work at crucial stages. Priya, I had a fantastic time brainstorming with you; each time I managed to learn something new and exciting. Thank you for sharing with me your insights on MEAD's architecture, and your knowledge of fault-tolerance and real-time.

Luís, Fernando and Priya, I hope someday to be able to repay your generosity and friendship. It is inspirational to see your passion for your work, and your continuous effort in helping others.

I would like to thank Jiaqi Tan for taking the time to explain to me the architecture and functionalities of MapReduce, and Professor Alysson Bessani, for his thoughts on my work and for his insights on Byzantine failures and consensus protocols.

I would also like to thank the CRACS members, Professors Ricardo Rocha, Eduardo Correia, Vítor Costa, and Inês Dutra, for listening and sharing their thoughts on my work. A big thank you to Hugo Ribeiro, for his crucial help with the experimental setup.


–All is worthwhile if the soul is not small.

Fernando Pessoa

Abstract

The development and management of large-scale information systems, such as high-speed transportation networks, are pushing the limits of the current state-of-the-art in middleware frameworks. These systems are not only subject to hardware failures, but also impose stringent constraints on the software used for management and therefore on the underlying middleware framework. In particular, fulfilling the Quality-of-Service (QoS) demands of services in such systems requires simultaneous run-time support for Fault-Tolerance (FT) and Real-Time (RT) computing, a marriage that remains a challenge for current middleware frameworks. Fault-tolerance support is usually introduced in the form of expensive high-level services arranged in a client-server architecture. This approach is inadequate if one wishes to support real-time tasks, due to the expensive cross-layer communication and resource consumption involved.

In this thesis we design and implement Stheno, a general-purpose P2P middleware architecture. Stheno innovates by integrating both FT and soft-RT in the architecture by: (a) implementing FT support at a much lower level in the middleware, on top of a suitable network abstraction; (b) using the peer-to-peer mesh services to support FT; (c) supporting real-time services through a QoS daemon that manages the underlying kernel-level resource reservation infrastructure (CPU time), while simultaneously (d) providing support for multi-core computing and traffic demultiplexing. Stheno is able to minimize the resource consumption and latencies introduced by the FT mechanisms and allows RT services to perform within QoS limits.

Stheno has a service-oriented architecture that does not limit the type of service that can be deployed in the middleware. In contrast, current middleware systems do not provide a flexible service framework, as their architectures are normally designed to support a specific application domain, for example, the Remote Procedure Call (RPC) service. Stheno is able to transparently deploy a new service within the infrastructure without user assistance. Using the P2P infrastructure, Stheno searches for and selects a suitable node to deploy the service with the specified QoS limits.

We thoroughly evaluate Stheno: we evaluate the major overlay mechanisms, such as membership, discovery and service deployment, and the impact of FT on RT, with and without resource reservation, and we compare Stheno with other closely related middleware frameworks. Results show that Stheno is able to sustain RT performance while simultaneously providing FT support. The performance of the resource reservation infrastructure enabled Stheno to maintain this behavior even under heavy load.


Acronyms

API Application Programming Interface

BFT Byzantine Fault-Tolerance

CCM CORBA Component Model

CID Cell Identifier

CORBA Common Object Request Broker Architecture

COTS Common Off-The-Shelf

DBMS Database Management Systems

DDS Data Distribution Service

DHT Distributed Hash Table

DOC Distributed Object Computing

DRE Distributed Real-Time and Embedded

DSMS Data Stream Management Systems

EDF Earliest Deadline First

EM/EC Execution Model/Execution Context

FT Fault-Tolerance

IDL Interface Definition Language

IID Instance Identifier

IPC Inter-Process Communication

IaaS Infrastructure as a Service

J2SE Java 2 Standard Edition

JMS Java Messaging Service

JRTS Java Real-Time System

JVM Java Virtual Machine


JeOS Just Enough Operating System

KVM Kernel-based Virtual Machine

LFU Least Frequently Used

LRU Least Recently Used

LwCCM Lightweight CORBA Component Model

MOM Message-Oriented Middleware

NSIS Next Steps in Signaling

OID Object Identifier

OMA Object Management Architecture

OS Operating System

PID Peer Identifier

POSIX Portable Operating System Interface

PoL Place of Launch

QoS Quality-of-Service

RGID Replication Group Identifier

RMI Remote Method Invocation

RPC Remote Procedure Call

RSVP Resource Reservation Protocol

RTSJ Real-Time Specification for Java

RT Real-Time

SAP Service Access Point

SID Service Identifier

SLA Service Level Agreement

SSD Solid State Disk


TDMA Time Division Multiple Access

TSS Thread-Specific Storage

UUID Universally Unique Identifier

VM Virtual Machine

VoD Video on Demand


Contents

Acknowledgments
Abstract
Acronyms
List of Tables
List of Figures
List of Algorithms
List of Listings
1 Introduction
1.1 Motivation
1.2 Challenges and Opportunities
1.3 Problem Definition
1.4 Assumptions and Non-Goals
1.5 Contributions
1.6 Thesis Outline
2 Overview of Related Work
2.1 Overview
2.2 RT+FT Middleware Systems
2.2.1 Special Purpose RT+FT Systems
2.2.2 CORBA-based Real-Time Fault-Tolerant Systems
2.3 P2P+RT Middleware Systems
2.3.1 Streaming
2.3.2 QoS-Aware P2P
2.4 P2P+FT Middleware Systems
2.4.1 Publish-subscribe
2.4.2 Resource Computing
2.4.3 Storage
2.5 P2P+RT+FT Middleware Systems
2.6 A Closer Look at TAO, MEAD and ICE
2.6.1 TAO
2.6.2 MEAD
2.6.3 ICE
2.7 Summary
3 Architecture
3.1 Stheno's System Architecture
3.1.1 Application and Services
3.1.2 Core
3.1.3 P2P Overlay and FT Configuration
3.1.4 Support Framework
3.1.5 Operating System Interface
3.2 Programming Model
3.2.1 Runtime Interface
3.2.2 Overlay Interface
3.2.3 Core Interface
3.3 Fundamental Runtime Operations
3.3.1 Runtime Creation and Bootstrapping
3.3.2 Service Infrastructure
3.3.3 Client Mechanisms
3.4 Summary
4 Implementation
4.1 Overlay Implementation
4.1.1 Overlay Bootstrap
4.1.2 Mesh Service
4.1.3 Discovery Service
4.1.4 Fault-Tolerance Service
4.2 Implementation of Services
4.2.1 Remote Procedure Call
4.2.2 Actuator
4.2.3 Streaming
4.3 Support for Multi-Core Computing
4.3.1 Object-Based Interactions
4.3.2 CPU Partitioning
4.3.3 Threading Strategies
4.3.4 An Execution Model for Multi-Core Computing
4.4 Runtime Bootstrap Parameters
4.5 Summary
5 Evaluation
5.1 Evaluation Setup
5.1.1 Physical Infrastructure
5.1.2 Overlay Setup
5.2 Benchmarks
5.2.1 Overlay Benchmarks
5.2.2 Services Benchmarks
5.2.3 Load Generator
5.3 Overlay Evaluation
5.3.1 Membership Performance
5.3.2 Query Performance
5.3.3 Service Deployment Performance
5.4 Services Evaluation
5.4.1 Impact of Fault-Tolerance Mechanisms in Service Latency
5.4.2 Real-Time and Resource Reservation Evaluation
5.5 Summary
6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
6.3 Personal Notes
References


List of Tables

4.1 Runtime and overlay parameters.


List of Figures

1.1 Oporto's light-train network.
2.1 Middleware system classes.
2.2 TAO's architectural layout.
2.3 FLARe's architectural layout.
2.4 MEAD's architectural layout.
3.1 Stheno overview.
3.2 Application Layer.
3.3 Stheno's organization overview.
3.4 Core Layer.
3.5 QoS Infrastructure.
3.6 Overlay Layer.
3.7 Examples of mesh topologies.
3.8 Querying in different topologies.
3.9 Support framework layer.
3.10 QoS daemon resource distribution layout.
3.11 End-to-end network reservation.
3.12 Operating system interface.
3.13 Interactions between layers.
3.14 Multiple processes runtime usage.
3.15 Creating and bootstrapping of a runtime.
3.16 Local service creation.
3.17 Finding a suitable deployment site.
3.18 Remote service creation without fault-tolerance.
3.19 Remote service creation with fault-tolerance: primary-node side.
3.20 Remote service creation with fault-tolerance: replica creation.
3.21 Client creation and bootstrap sequence.
4.1 The peer-to-peer overlay architecture.
4.2 The overlay bootstrap.
4.3 The cell overview.
4.4 The initial binding process for a new peer.
4.5 The final join process for a new peer.
4.6 Overview of the cell group communications.
4.7 Cell discovery and management entities.
4.8 Failure handling for non-coordinator (left) and coordinator (right) peers.
4.9 Cell failure (left) and subsequent mesh tree rebinding (right).
4.10 Discovery service implementation.
4.11 Fault-Tolerance service overview.
4.12 Creation of a replication group.
4.13 Replication group binding overview.
4.14 The addition of a new replica to the replication group.
4.15 The control and data communication groups.
4.16 Semi-active replication protocol layout.
4.17 Recovery process within a replication group.
4.18 RPC service layout.
4.19 RPC invocation types.
4.20 RPC service architecture without (left) and with (right) semi-active FT.
4.21 RPC service with passive replication.
4.22 Actuator service layout.
4.23 Actuator service overview.
4.24 Actuator fault-tolerance support.
4.25 Streaming service layout.
4.26 Streaming service architecture.
4.27 Streaming service with fault-tolerance support.
4.28 Object-to-Object interactions.
4.29 Examples of CPU Partitioning.
4.30 Object-to-Object interactions with different partitions.
4.31 Threading strategies.
4.32 End-to-End QoS propagation.
4.33 RPC service using CPU partitioning on a quad-core processor.
4.34 Invocation across two distinct partitions.
4.35 Execution Model Pattern.
4.36 RPC implementation using the EM/EC pattern.
5.1 Overlay evaluation setup.
5.2 Physical evaluation setup.
5.3 Overview of the overlay benchmarks.
5.4 Network organization for the service benchmarks.
5.5 Overlay bind (left) and rebind (right) performance.
5.6 Overlay query performance.
5.7 Overlay service deployment performance.
5.8 Service rebind time (left) and latency (right).
5.9 Rebind time and latency results with resource reservation.
5.10 Missed deadlines without (left) and with (right) resource reservation.
5.11 Invocation latency without (left) and with (right) resource reservation.
5.12 RPC invocation latency comparing with reference middlewares (without fault-tolerance).


List of Algorithms

4.1 Overlay bootstrap algorithm
4.2 Mesh startup
4.3 Cell initialization
4.4 Cell group communications: receiving-end
4.5 Cell group communications: sending-end
4.6 Cell Discovery
4.7 Cell fault handling.
4.8 Cell fault handling (continuation).
4.9 Discovery service.
4.10 Creation and joining within a replication group
4.11 Primary bootstrap within a replication group
4.12 Fault-Tolerance resource discovery mechanism.
4.13 Replica startup.
4.14 Replica request handling
4.15 Support for semi-active replication.
4.16 Fault detection and recovery
4.17 A RPC object implementation.
4.18 RPC service bootstrap.
4.19 RPC service implementation.
4.20 RPC client implementation.
4.21 Semi-active replication implementation.
4.22 Service's replication callback.
4.23 Passive Fault-Tolerance implementation.
4.24 Actuator service bootstrap.
4.25 Actuator service implementation.
4.26 Actuator client implementation.
4.27 Stream service bootstrap.
4.28 Stream service implementation.
4.29 Stream client implementation.
4.30 Joining an Execution Model.
4.31 Execution Context stack management.
4.32 Implementation of the EM/EC pattern in the RPC service.


List of Listings

3.1 Overlay plugin and runtime bootstrap.
3.2 Transparent service creation.
3.3 Service creation with explicit and transparent deployments.
3.4 Service creation with Fault-Tolerance support.
3.5 Service client creation.
4.1 A RPC IDL example.


–Most of the important things in the world have been accomplished by people who have kept trying when there seemed to be no hope at all.

Dale Carnegie

1 Introduction

1.1 Motivation

The development and management of large-scale information systems is pushing the limits of the current state-of-the-art in middleware frameworks. At EFACEC¹, we have to handle a multitude of application domains, including: information systems used to manage public, high-speed transportation networks; automated power management systems to handle smart grids; and power supply systems to monitor power supply units through embedded sensors. Such systems typically transfer large amounts of streaming data; have erratic periods of extreme network activity; are subject to relatively common hardware failures, often for comparatively long periods; and require low jitter and fast response times for safety reasons, for example, vehicle coordination.

Target Systems

The main motivation for this PhD thesis was the need to address the requirements of the public transportation solutions at EFACEC, more specifically, the light-train systems. One such deployment is installed in Oporto's light-train network and is composed of 5 lines, 70 stations and approximately 200 sensors (partially illustrated in Figure 1.1). Each station is managed by a computational node, which we designate as a peer, that is responsible for managing all the local audio, video and display panels, and low-level sensors such as track sensors for detecting inbound and outbound trains.

¹EFACEC, the largest Portuguese group in the field of electricity, with a strong presence in systems engineering, namely in public transportation and energy systems, employs around 3000 people and has a turnover of almost 1000 million euro; it is established in more than 50 countries and exports almost half of its production (cf. http://www.efacec.com).


The system supports three types of traffic: normal, for regular operations over the system, such as playing an audio message in a station through an audio codec; critical, medium-priority traffic comprising urgent events, such as an equipment malfunction notification; and alarms, high-priority traffic that notifies critical events, such as low-level sensor events. Independently of the traffic type (e.g., event, RPC operation), the system requires that any operation be completed within 2 seconds.
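
As an illustration only, the following C++ sketch shows how these traffic classes and the common 2-second completion bound could be represented in a middleware configuration; the type and field names are hypothetical and are not part of Stheno's actual API.

    #include <chrono>

    // Hypothetical representation of the three traffic classes; identifiers are
    // illustrative only and not part of Stheno's actual API.
    enum class TrafficClass { Normal, Critical, Alarm };

    struct TrafficQoS {
        TrafficClass type;
        int priority;                        // higher value = more urgent
        std::chrono::milliseconds deadline;  // end-to-end completion bound
    };

    // Every traffic class shares the same 2-second completion requirement.
    constexpr std::chrono::milliseconds kDeadline{2000};

    inline TrafficQoS qos_for(TrafficClass c) {
        switch (c) {
            case TrafficClass::Alarm:    return {c, 2, kDeadline};
            case TrafficClass::Critical: return {c, 1, kDeadline};
            default:                     return {c, 0, kDeadline};
        }
    }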

From the point of view of distributed architectures, the current deployments are best matched by P2P infrastructures that are resilient, that allow resources (e.g., a sensor connected through a serial link to a peer) to be seamlessly mapped to the logical topology, the mesh, and that also provide support for real-time (RT) and fault-tolerant (FT) services. Support for both RT and FT is fundamental to meet system requirements. Moreover, next-generation light-train solutions require deployments across cities and regions that can be overwhelmingly large. This introduces the need for a scalable hierarchical abstraction, the cell, composed of several peers that cooperate to maintain a portion of the mesh.

Figure 1.1: Oporto’s light-train network.

1.2 Challenges and Opportunities

The requirements of our target systems pose a significant number of challenges. The presence of FT mechanisms, especially those using space redundancy [1], introduces the need for multiple copies of the same resource (replicas), and these, in turn, ultimately lead to greater resource consumption.

FT also introduces overheads in the form of latency, another important constraint when dealing with RT systems. When an operation is performed, irrespective of whether it is real-time or not, any state change it causes must be propagated among the replicas through a replication algorithm, which introduces an additional source of latency. Furthermore, the recovery time, that is, the time the system needs to recover from a fault, is an additional source of latency for real-time operations. There are well-known replication styles that offer different trade-offs between state consistency and latency.

Our target systems have different traffic types with distinct deadline requirements that must be supported while using Common Off-The-Shelf (COTS) hardware (e.g., Ethernet networking) and software (e.g., Linux). This requires that the RT mechanisms leverage the available resources, through resource reservation, while providing different threading strategies that allow different trade-offs between latency and throughput.

To overcome the overhead introduced by the FT mechanisms, it must be possible to employ a replication algorithm that does not compromise the RT requirements. Replication algorithms that offer a higher degree of consistency introduce a higher level of latency [1, 2] that may be prohibitive for certain traffic types. On the other hand, certain replication algorithms exhibit lower resource consumption and latency at the expense of a longer recovery time, which may also be prohibitive.

Considering current state-of-the-art research, we see many opportunities to address the previous challenges. One is the use of COTS operating systems, which allow for a faster implementation time, and thus a smaller development cost, while offering the necessary infrastructure to build a new middleware system.

P2P networks can be used to provide a resilient infrastructure that mirrors the physical deployments of our target systems; furthermore, different P2P topologies offer different trade-offs between self-healing, resource consumption and latency in end-to-end operations. Moreover, by directly implementing FT on the P2P infrastructure we hope to lower resource usage and latency enough to allow the integration of RT. By using proven replication algorithms [1, 2] that offer well-known trade-offs regarding consistency, resource consumption and latency, we can focus on the actual problem of integrating real-time and fault-tolerance within a P2P infrastructure.

On the other hand, RT support can be achieved through the implementation of different threading strategies, resource reservation (through Linux's Control Groups), and by avoiding traffic multiplexing through the use of different access points to handle different traffic priorities. Whilst the use of Earliest Deadline First (EDF) scheduling would provide greater RT guarantees, this goal will not be pursued due to the lack of maturity of the current EDF implementations in Linux (our reference COTS operating system). Because we are limited to priority-based scheduling and resource reservation, we can only partially fulfill our goal of providing end-to-end guarantees; more specifically, we enhance our RT guarantees through the use of RT scheduling policies with over-provisioning to ensure that deadlines are met.
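
To make this concrete, the sketch below shows one way such a reservation could be realized on Linux: it assumes a cgroup v1 CPU controller mounted at /sys/fs/cgroup/cpu and a thread scheduled with a fixed SCHED_FIFO priority. The group name, quota values and priority are illustrative only and are not Stheno's actual configuration.

    #include <fstream>
    #include <string>
    #include <pthread.h>
    #include <sched.h>

    // Reserve a CPU-bandwidth share for a service by writing to an existing
    // cgroup (v1 CPU controller assumed to be mounted at /sys/fs/cgroup/cpu).
    bool reserve_cpu_share(const std::string& group, long quota_us, long period_us) {
        const std::string base = "/sys/fs/cgroup/cpu/" + group + "/";
        std::ofstream period(base + "cpu.cfs_period_us");
        std::ofstream quota(base + "cpu.cfs_quota_us");
        if (!period || !quota) return false;      // the cgroup must already exist
        period << period_us;
        quota << quota_us;
        return true;
    }

    // Give the calling thread a fixed real-time priority (SCHED_FIFO): this is the
    // priority-based scheduling with over-provisioning referred to above.
    bool set_rt_priority(int priority) {
        sched_param param{};
        param.sched_priority = priority;          // 1..99 on Linux, needs root/CAP_SYS_NICE
        return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param) == 0;
    }

    int main() {
        reserve_cpu_share("stheno_rt", 50000, 100000);  // ~50% of one CPU (illustrative)
        set_rt_priority(80);
        return 0;
    }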

1.3 Problem Definition

The work presented in this thesis focuses on the integration of Real-Time (RT) and Fault-Tolerance (FT) in a scalable general-purpose middleware system. This goal can only be achieved if the following premises are valid: (a) the FT infrastructure cannot interfere with RT behavior, independently of the replication policy; (b) the network model must be able to scale; and (c) ultimately, FT mechanisms need to be efficient and aware of the underlying infrastructure, i.e., network model, operating system and physical environment.

Our problem definition is a direct consequence of the requirements of our target systems, and it can be summarized with the following question: “Can we opportunistically leverage and integrate these proven strategies to simultaneously support soft-RT and FT to meet the needs of our target systems even under faulty conditions?”

In this thesis we argue that a lightweight implementation of fault-tolerance mechanisms in a middleware is fundamental for its successful integration with soft real-time support. Our approach is novel in that it explores peer-to-peer networking as a means to implement generic, transparent, lightweight fault-tolerance support. We do this by directly embedding fault-tolerance mechanisms into peer-to-peer overlays, taking advantage of their scalable, decentralized and resilient nature. For example, peer-to-peer networks readily provide the functionality required to maintain and locate redundant copies of resources. Given their dynamic and adaptive nature, they are promising infrastructures for developing lightweight fault-tolerant and soft real-time middleware.

Despite these a priori advantages, mainstream generic peer-to-peer middleware systems for QoS computing are, to our knowledge, unavailable. Motivated by this state of affairs, by the limitations of the current infrastructure for the information system we are managing at EFACEC (based on CORBA technology) and, last but not least, by the comparative advantages of flexible peer-to-peer network architectures, we have designed and implemented a prototype service-oriented peer-to-peer middleware framework.

The networking layer relies on a modular infrastructure that can handle multiple peer-to-peer overlays. The support for fault-tolerance and soft real-time features is provided at this level through the implementation of efficient and resilient services for, e.g., resource discovery, messaging and routing. The kernel of the middleware system (the runtime) is implemented on top of these overlays and uses the above-mentioned peer-to-peer functionalities to provide developers with APIs for the customization of QoS policies for services (e.g., bandwidth reservation, CPU/core reservation, scheduling strategy, number of replicas). This approach was inspired by that of TAO [3], which allows distinct strategies for the execution of tasks by threads to be defined.
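
A minimal sketch of what such a per-service QoS policy descriptor could look like is given below; the structure and field names are hypothetical illustrations of the parameters listed above (bandwidth, CPU/core reservation, scheduling strategy, number of replicas), not the actual Stheno API.

    #include <cstdint>

    // Hypothetical per-service QoS policy mirroring the parameters listed above;
    // these names are illustrative and not Stheno's actual API.
    enum class ThreadingStrategy { LeaderFollowers, ThreadPerConnection, ThreadPerRequest };

    struct ServiceQoSPolicy {
        std::uint32_t bandwidth_kbps;   // network bandwidth reservation
        double        cpu_share;        // fraction of a CPU reserved (0.0 - 1.0)
        int           cpu_partition;    // core/partition the service is pinned to
        ThreadingStrategy strategy;     // threading strategy for request handling
        unsigned      replicas;         // number of replicas in the replication group
    };

    // Example: 1 Mbit/s, half a core on partition 0, thread-per-request, two replicas.
    const ServiceQoSPolicy example_policy{1000, 0.5, 0, ThreadingStrategy::ThreadPerRequest, 2};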

1.4 Assumptions and Non-Goals

The distributed model used in this thesis is based on a partially asynchronous computing model, as defined in [2], extended with fault detectors.

The services and P2P plugin implemented in this thesis only support crash failures. We consider a crash failure [1] to be characterized as a complete shutdown of a computing instance in the event of a failure, ceasing to interact any further with the remaining entities of the distributed system.

Timing faults are handled differently by services and by the P2P plugin. In our service implementations a timing fault is logged (for analysis) and no other action is performed, whereas in the P2P layer we treat a timing fault as a crash failure, i.e., if the remote creation of a service exceeds its deadline, the peer is considered crashed. This method is also called process-controlled crash, or crash control, as defined in [4]. In this thesis we adopted a more relaxed version: if a peer is wrongly suspected of having crashed, it is not killed and does not commit suicide; instead it is shunned, that is, the peer is expelled from the overlay and forced to rejoin it; more precisely, it must rebind using the membership service in the P2P layer.
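
The following sketch illustrates this shunning policy under the assumptions above; the peer and membership interfaces are invented for illustration and do not correspond to Stheno's real classes.

    #include <chrono>
    #include <functional>
    #include <string>

    using Clock = std::chrono::steady_clock;

    // Hypothetical handle to a remote peer as seen by the P2P layer.
    struct PeerHandle {
        std::string id;
        // Attempts a remote service creation; returns false on timeout/failure.
        std::function<bool(std::chrono::milliseconds)> create_service;
    };

    // Hypothetical membership facade: a shunned peer is expelled from the overlay
    // and must later rebind through the membership service.
    struct Membership {
        void expel(const std::string& peer_id) { (void)peer_id; /* drop peer from the overlay view */ }
    };

    // Treat a timing fault during remote service creation as a crash failure:
    // the suspected peer is shunned (expelled) rather than killed.
    void deploy_or_shun(PeerHandle& peer, Membership& members,
                        std::chrono::milliseconds deadline) {
        const auto start = Clock::now();
        const bool ok = peer.create_service(deadline);
        if (!ok || (Clock::now() - start) > deadline) {
            members.expel(peer.id);   // the peer must rejoin via the membership service
        }
    }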

The fault model used was motivated by the author's experience with several field deployments of light-train transportation systems, such as the Oporto, Dublin and Tenerife Light Rail solutions [5]. Due to the use of highly redundant hardware solutions, such as redundant power supplies and redundant 10-Gbit network ring links, network failures tend to be short. The most common cause of downtime is software bugs, which mostly result in a crashed computing node. While simultaneous failures can happen, they are considered rare events.

We also assume that the resource-reservation mechanisms are always available.

In this thesis we do not address value faults and Byzantine faults, as they are not a requirement for our target systems. Furthermore, we do not provide a formal specification and verification of the system. While this would be beneficial to assess system correctness, we had to limit the scope of this thesis. Nevertheless, we provide an empirical evaluation of the system.

We also do not address hard real-time because of the lack of mature support for EDF scheduling in the Linux kernel. Furthermore, we do not provide a fully optimized implementation, but only a proof-of-concept to validate our approach. Testing the system in a production environment is left for future work.

1.5 Contributions

Before undertaking the task of building an entirely new middleware system from scratch, we explored current solutions, presented in Chapter 2, to see if any of them could support the requirements of our target system. As we did not find any suitable solution, we then assessed whether it was possible to extend an available solution to meet those requirements. In our previous work, DAEM [6], we explored the use of JGroups [7] within a hierarchical P2P mesh, and concluded that the simultaneous support for real-time, fault-tolerance and P2P requires fine-grained control of resources that is not possible with the use of “black-box” solutions; for example, it is impossible to have out-of-the-box support for resource reservation in JGroups.

Given these assessments, we have designed and implemented Stheno, which to the best of our knowledge is the first middleware system to seamlessly integrate fault-tolerance and real-time in a peer-to-peer infrastructure. Our approach was motivated by the lack of support in current solutions for the timing, reliability and physical deployment characteristics of our target systems.

To that end, a complete architectural design is proposed that addresses all levels of the software stack, including kernel space, network, runtime and services, to achieve a seamless integration. The list of contributions includes: (a) a full specification of a user Application Programming Interface (API); (b) a pluggable P2P network infrastructure that can be adjusted to the target application; (c) support for configurable FT in the P2P layer, with the goal of providing lightweight FT mechanisms that fully enable RT behavior; and (d) the integration of resource reservation at all levels of the runtime, enabling (partial) end-to-end Quality-of-Service (QoS) guarantees.

Previous work [8, 9, 10] on resource reservation focused solely on CPU provisioning for real-time systems. In this thesis we present Euryale, a network-oriented QoS framework that features resource reservation with support for a broader range of subsystems, including CPU, memory, I/O and network bandwidth, on a general-purpose operating system such as Linux. At the heart of this infrastructure resides Medusa, a QoS daemon that handles the admission and management of QoS requests.

Well-known threading strategies, such as Leader-Followers [11], Thread-per-Connection [12] and Thread-per-Request [13], offer well-understood trade-offs between latency and resource usage [3, 14]. However, they do not support resource reservation, namely CPU partitioning. To overcome this limitation, this thesis provides an additional contribution with the introduction of a novel design pattern (Chapter 4) that is able to integrate multi-core computing with resource reservation within a configurable framework that supports these well-known threading strategies. For example, when a client connects to a service it can specify, through the real-time QoS parameters, the particular threading strategy that best meets its requirements.
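
As a rough illustration of combining a threading strategy with CPU partitioning (the combination the design pattern targets), the sketch below pins a Thread-per-Connection worker to a reserved set of cores. It is our own simplification using GNU/Linux affinity calls, not the EM/EC pattern described in Chapter 4.

    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    // Pin the calling thread to an explicit CPU partition (GNU/Linux extension).
    bool bind_to_partition(const std::vector<int>& cores) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int core : cores) CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
    }

    // Thread-per-Connection strategy combined with CPU partitioning: each accepted
    // connection gets a dedicated worker thread confined to the service's cores.
    void serve_connection(int connection_fd, std::vector<int> partition) {
        std::thread worker([connection_fd, partition]() {
            bind_to_partition(partition);   // keep the worker inside the reservation
            // ... read requests from connection_fd and dispatch them ...
            (void)connection_fd;
        });
        worker.detach();
    }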

We present a full implementation that covers all the previously described architectural features, including a complete overlay implementation, inspired by the P3 [15] topology, that seamlessly integrates RT and FT.

To evaluate our implementation and justify our claims, we present a complete evaluation of both mechanisms. The impact of the resource reservation mechanism is also evaluated, and we provide a comparative evaluation of RT performance against state-of-the-art middleware systems. The experimental results show that Stheno meets and exceeds the target system requirements for end-to-end latency and fail-over latency.

1.6 Thesis Outline

The focus of this thesis is on the design, implementation and evaluation of a scalable general-purpose middleware that provides the seamless integration of RT and FT. The remainder of this thesis is organized as follows.

Chapter 2: Overview of Related Work.

This chapter presents an overview of related middleware systems that exhibit support for RT, FT and P2P, the mandatory requirements of our target system. We started by searching for an available off-the-shelf solution that could support all of these requirements or, in its absence, by identifying a current solution that could be extended, in order to avoid creating a new middleware solution from scratch.

Chapter 3: Architecture.


Chapter 3 describes the runtime architecture of the proposed middleware. We start by providing detailed insight into the architecture, covering all layers present in the runtime. Special attention is given to the presentation of the QoS and resource reservation infrastructure. This is followed by an overview of the programming model that describes the most important interfaces present in the runtime, as well as the interactions that occur between them. The chapter ends with the description of the fundamental runtime operations, namely: the creation of services with and without FT support, the deployment strategy, and client creation.

Chapter 4: Implementation.

Chapter 4 describes the implementation of a prototype based on the aforementioned architecture, and is divided into four parts. In the first part, we present a complete implementation of a P2P overlay that is inspired by the P3 [15] topology, while providing some insight into the limitations of the current prototype. The second part of this chapter focuses on the implementation of three types of user services, namely Remote Procedure Call (RPC), Actuator, and Streaming. These services are thoroughly evaluated in Chapter 5. In the third part, we describe our support for multi-core computing, through the presentation of a novel design pattern, the Execution Model/Context. This design pattern is able to integrate resource reservation, especially CPU partitioning, with different well-known (and configurable) threading strategies. The fourth and final part of this chapter describes the most relevant parameters used in the bootstrap of the runtime.

Chapter 5: Evaluation.

The experimental results are presented in this chapter. It starts by providing details of the physical setup used throughout the evaluation. It then describes the parameters used in the testbed suite, which is composed of the three services previously described in Chapter 4. We then focus on presenting the results of the benchmarks, including the assessment of the impact of FT on RT, and the impact of the resource reservation infrastructure on overall performance. The chapter ends with a comparative evaluation against well-known middleware systems.

Chapter 6: Conclusions and Future Work.

This last chapter presents the concluding remarks. It highlights the contributions of the proposed and implemented middleware, and provides directions for future work.


–By failing to prepare, you are preparing to fail.

Benjamin Franklin

2 Overview of Related Work

2.1 Overview

This chapter presents an overview of the state-of-the-art in related middleware systems. As illustrated in Figure 2.1, we are mostly interested in systems that exhibit support for real-time (RT), fault-tolerance (FT) and peer-to-peer (P2P), the mandatory requirements of our target system. We started by searching for an available off-the-shelf solution that could support all of these requirements or, in its absence, by identifying a current solution that could be extended, and thus avoid the creation of a new middleware solution from the ground up. For that reason, we have focused on the intersecting domains, namely RT+FT, RT+P2P and FT+P2P, since the systems contained in these domains come closer to meeting the requirements of our target system.

From a historical perspective, the origins of modern middleware systems can be traced back to the 1980s, with the introduction of the concept of ubiquitous computing, in which computational resources are accessible and seen as ordinary commodities such as electricity or tap water [2]. Furthermore, the interaction between these resources and the users was governed by the client-server model [16] and a supporting protocol called RPC [17]. The client-server model is still the most prevalent paradigm in current distributed systems.

An important architecture for client-server systems was introduced with the Common Object Request Broker Architecture (CORBA) standard [18] in the 1990s, but it did not address real-time or fault-tolerance. Only recently were both the real-time and fault-tolerance specifications finalized, but they remain mutually exclusive. This means that a system supporting the real-time specification will not be able to support the fault-tolerance specification, and vice-versa. Nevertheless, seminal work has already addressed these limitations and offered systems supporting both features, namely TAO [3] and MEAD [14]. At the same time, Remote Method Invocation (RMI) [19] appeared as a Java alternative capable of providing a more flexible and easy-to-use environment.

Figure 2.1: Middleware system classes. (The original figure depicts the RT, FT and P2P middleware domains and their intersections, RT+FT, RT+P2P, FT+P2P and RT+FT+P2P, with example systems such as CORBA RT/FT, DDS, Pastry, video streaming, distributed storage, and Stheno.)

In recent years, CORBA entered a steady decline [20] in favor of web-oriented platforms, such as J2EE [21], .NET [22] and SOAP [23], and of P2P systems. The web-oriented platforms, such as the JBoss [24] application server, aim to integrate availability with scalability, but they remain unable to support real-time. Moreover, while partitioning offers a clean approach to improve scalability, it fails to support large-scale distributed systems [2]. Alternatively, P2P systems focused on providing logical organizations, i.e., meshes, that abstract the underlying physical deployment while providing a decentralized architecture for increased resiliency. These systems focused initially on resilient distributed storage solutions, such as Dynamo [25], but progressively evolved to support soft real-time systems, such as video streaming [26].

More recently, Message-Oriented Middleware (MOM) systems [27] offer a distributed message-passing infrastructure based on an asynchronous interaction model that is able to overcome the scaling issues present in RPC. A considerable number of implementations exist, including Tibco [28], WebSphere MQ [29] and the Java Messaging Service (JMS) [30]. MOM systems are sometimes integrated as subsystems in application server infrastructures, such as JMS in J2EE and WebSphere MQ in the WebSphere Application Server.

A substantial body of research has focused on the integration of real-time within CORBA-based middleware, such as TAO [3] (which later addressed the integration of fault-tolerance). More recently, QoS-enabled publish-subscribe middleware systems based on the JAIN SLEE specification [31], such as Mobicents [32], and on the Data Distribution Service (DDS) specification, such as OpenDDS [33], Connext DDS [34] and OpenSplice [35], appeared as a way to overcome the current lack of support for real-time applications in SOA-based middleware systems.

The introduction of fault-tolerance in middleware systems also remains an active topic of research. CORBA-based middleware systems were a fertile ground for testing fault-tolerance techniques in a general-purpose platform, resulting in the creation of the CORBA-FT specification [36]. Nowadays, some of this focus has been redirected to SOA-based platforms, such as J2EE. One of the most popular deployments, JBoss, supports scalability and availability through partitioning. Each partition is supported by a group communication framework based on the virtual synchrony model, more specifically, the JGroups [7] group communication framework.

2.2 RT+FT Middleware Systems

This section overviews systems that provide simultaneous support for real-time and fault-tolerance. These systems are divided into special-purpose solutions, designed for specific application domains, and CORBA-based solutions, aimed at general-purpose computing.

2.2.1 Special Purpose RT+FT Systems

Special-purpose real-time fault-tolerant systems introduced concepts and implementation strategies that are still relevant in current state-of-the-art middleware systems.

Armada

Armada [37] focused on providing middleware services and a communication infrastructure to support FT and RT semantics for distributed real-time systems. This was pursued in two ways, which we now describe.

The first contribution was the introduction of a communication infrastructure that is able to provide end-to-end QoS guarantees for both unicast and multicast primitives. This was supported by control signaling and QoS-sensitive data transfer (as in the newer Resource Reservation Protocol (RSVP) and Next Steps in Signaling (NSIS)).


The network infrastructure used a reservation mechanism based on an EDF scheduling policy that was built on top of the Mach OS priority-based scheduling. The initial implementation was done at the user level but was subsequently migrated to the kernel level with the goal of reducing latency.

Many of the architectural decisions regarding RT support were based on the operating systems available at the time, mainly Mach OS. Despite the advantages of a micro-kernel approach, its application remains restricted by the underlying cost associated with message passing and context switching. Instead, a large body of research has been carried out on monolithic kernels, especially the Linux OS, which are able to offer the advantages of the micro-kernel approach, through the introduction of kernel modules, together with the speed of monolithic kernels.

The second contribution came in the form of a group communication infrastructure, based on a ring topology, that ensured the delivery of messages in a reliable and totally ordered fashion within a bounded time. It also had support for membership management, offering consistent views of the group through the detection of process and communication failures. These group communication mechanisms enabled the support for FT through the use of a passive replication scheme that allowed for some inconsistencies between the primary and the replicas, where the states of the replicas could lag behind the state of the primary up to a bounded time window.

Mars

Mars [38] provided support for the analysis and deployment of synchronous hard real-time systems through a static off-line scheduler for the CPU and a Time Division Multiple Access (TDMA) bus. Mars is able to offer FT support through the use of active redundancy on the TDMA bus, i.e., sending multiple copies of the same message, and through self-checking mechanisms. Deterministic communication is achieved through the use of a time-triggered protocol.

The project focused on RT process control, where all the intervening entities are known in advance. Therefore, it does not offer any type of support for the dynamic admission of new components, nor does it support on-the-fly fault recovery.

ROAFTS

The ROAFTS [39, 40] system aims to provide transparent adaptive FT support for distributed RT applications, consisting of a network of Time-triggered Message-triggered Objects (TMOs) [41], whose execution is managed by a TMO support manager. The FT infrastructure consists of a set of specialized TMOs, which include: (a) a generic fault server and (b) a network surveillance [42] manager. Fault detection is assured by the network surveillance TMO and is used by the generic fault server to change the FT policy with the goal of preserving RT semantics. The system assumes that, under highly dynamic environments, RT applications can live with weaker reliability assurances from the middleware.

Maruti

Maruti [43] aimed to provide a development framework and an infrastructure for the deployment of hard real-time applications within a reactive environment, focusing on real-time requirements on a single-processor system. The reactive model is able to offer runtime decisions on the admission of new processing requests without producing adverse effects on the scheduling of existing requests. Fault-tolerance is achieved by redundant computation. A configuration language allows the deployment of replicated modules and services.

Delta-4

Delta-4 [44] provided an in-depth characterization of fault assumptions, for both hosts and the network. It also demonstrated various techniques for handling them, namely, passive and active replication for fail-silent hosts and Byzantine agreement for fail-uncontrolled hosts. This work was followed by the Delta-4 Extra Performance Architecture (XPA) [45], which aimed to provide real-time support to the Delta-4 framework through the introduction of the Leader/Follower replication model (better known as semi-active replication) for fail-silent hosts. This work also led to the extension of the communication system to support additional communication primitives (the original work on Delta-4 only supported the Atomic primitive), namely Reliable, AtLeastN, and AtLeastTo.

2.2.2 CORBA-based RT+FT Systems

The support for RT and FT in general-purpose distributed platforms remains mostly restricted to CORBA. While some work was carried out by Sun to introduce RT support for Java, with the introduction of the Real-Time Specification for Java (RTSJ) [46, 47], it was aimed at the Java 2 Standard Edition (J2SE). The most relevant implementations are Sun's Java Real-Time System (JRTS) [48] and IBM's WebSphere Real-Time VM [49, 50]. To the best of our knowledge, only WebLogic Real-Time [51] attempted to provide support for RT in a J2EE environment. Nevertheless, this support seems to be confined to the introduction of a deterministic garbage collector, through the use of the RT JRockit JVM, as a way to prevent unpredictable pause times caused by garbage collection [51].

Previous work on the integration of RT and FT in CORBA-based systems can be categorized into three distinct approaches: (a) integration, where the base ORB is modified; (b) services, where systems rely on high-level services to provide FT (and, indirectly, RT); and (c) interception, where systems intercept client requests to provide transparent FT and RT.

Integration Approach

Past work on the integration of fault-tolerance in CORBA-like systems was done in Electra [52], Maestro [53] and AQuA [54]. Electra [52] was one of the predecessors of the CORBA-FT standard [55, 36], and it focused on enhancing the Object Management Architecture (OMA) to support transparent and non-transparent fault-tolerance capabilities. Instead of using message queues or transaction monitors [56], it relied on object-communication groups [57, 58]. Maestro [53] is a distributed layer built on top of the Ensemble [59] group communication system, which was used by Electra [52] in the Quality of Service for CORBA Objects (QuO) project [60]. Its main focus was to provide an efficient, extensible and non-disruptive integration of the object layers with the low-level QoS system properties. The AQuA [54] system uses both QuO and Maestro on top of the Ensemble communication groups to provide a flexible and modular approach that is able to adapt to faults and to changes in the application requirements. Within its framework, a QuO runtime accepts availability requests from the application and relays them to a dependability manager, which is responsible for handling the requests from multiple QuO runtimes.

TAO+QuO

The work done in [61] focused on the integration of QoS mechanisms for both CPU and network resources, while supporting both priority- and reservation-based QoS semantics, with standard COTS Distributed Real-Time and Embedded (DRE) middleware, more precisely, TAO [3]. The underlying QoS infrastructure was provided by QuO [60]. The priority-based approach was built on top of the RT-CORBA specification, and it defined a set of standard features in order to provide end-to-end predictability for operations within a fixed-priority context [62]. The CPU priority-based resource management is left to the scheduler of the underlying Operating System (OS), whereas the network priority-based management is achieved through the use of the DiffServ architecture [63], by setting the DSCP field in the IP header of the GIOP requests. Based on

various factors, the QuO runtime can dynamically change this priority to adjust to


environment changes. Alternatively, the network reservation-based approach relies on

the RSVP [64] signaling protocol to guarantee the desired network bandwidth between

hosts. The QuO runtime monitors the RSVP connections and makes adjustments to

overcome abnormal conditions. For example, in a video service it can drop frames to

maintain stability. The CPU reservation is made using the reservation mechanisms present in the TimeSys Linux kernel. It is left to TAO and QuO to decide on the reservation policies. This was done to preserve the end-to-end QoS semantics that are only available at a higher level of the middleware.
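As a minimal illustration of the DiffServ-based network priority management described above, the following sketch marks an IPv4 socket with a DSCP value by setting the TOS byte, assuming a POSIX socket API; it is not TAO/QuO code, and the value 46 (Expedited Forwarding) is only an example.

    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <sys/socket.h>
    #include <cstdio>

    // Mark all traffic sent on 'fd' with the given DSCP value (0-63).
    // The DSCP occupies the upper six bits of the IPv4 TOS byte.
    bool set_dscp(int fd, int dscp) {
        int tos = dscp << 2;  // shift into the TOS field
        return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) == 0;
    }

    int main() {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }
        // 46 = Expedited Forwarding, commonly used for latency-sensitive traffic.
        if (!set_dscp(fd, 46)) perror("setsockopt");
        return 0;
    }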

CIAO+QuO

CIAO [65] is a QoS-aware CORBA Component Model (CCM) implementation built on

top of TAO [3] that aims to alleviate the complexity of integrating real-time features in DRE systems using Distributed Object Computing (DOC) middleware. These DOC systems, of which TAO is an example, offer configurable policies and mechanisms for QoS, namely real-time, but lack a programming model that is capable of separating systemic aspects from application logic. Furthermore, QoS provisioning must be done in an end-to-end fashion, and thus has to be applied to several interacting components. It is difficult, or nearly impossible, to properly configure a component without taking into account the QoS semantics of the interacting entities. Developers using standard DOC middleware systems are prone to producing misconfigurations that cause overall system misbehavior. CIAO overcomes these limitations by applying a wide range of aspect-oriented development techniques that support the composition of real-time semantics without intertwining configuration concerns. The support for CIAO's

CCM architecture was done in CORFU [66] and is described below.

Work on the integration of CIAO with Quality Objects (QuO) [60] was done in [67].

The integration of QuO's infrastructure into CIAO enhanced its limited static QoS provisioning into a total provisioning middleware that is also able to accommodate dynamic and adaptive QoS provisioning. Without it, for example, the setup of an RSVP [64] connection would require explicit configuration by the developer, defeating the purpose of CIAO. Nevertheless, while CIAO is able to compose QuO components, Qoskets [68], it does not provide a solution for component cross-cutting.

DynamicTAO

DynamicTAO [69] focused on providing a reflective middleware model that extends

TAO to support on-the-fly dynamic reconfiguration of its component behavior and

resource management through meta-interfaces. It allows the application to inspect

the internal state/configuration and, if necessary, to reconfigure it in order to adapt


to environment changes. Consequently, it is possible to select networking protocols,

encoding and security policies to improve the overall system performance in the presence

of unexpected events.

Service-based Approach

An alternative, high-level service approach for CORBA fault-tolerance was taken by

Distributed Object-Oriented Reliable Service (DOORS) [70], Object Group Service

(OGS) [71], and Newtop Object Group Service [72]. DOORS focused on providing

replica management, fault-detection and fault-recovery as a CORBA high-level service.

It did not rely on group communication and focused mainly on passive replication, but it allowed the developer to select the desired level of reliability (number of replicas), the replication policy, the fault-detection mechanism, e.g., SNMP-enhanced fault-detection, and the recovery strategy. OGS improved over prior approaches by using a group communication protocol that imposes consensus semantics. Instead of adopting an integrated approach, group communication services are kept transparent to the ORB by providing request-level bridging. Newtop followed a similar approach to OGS but added support for network partitions, allowing newly formed sub-groups to continue to operate.

TAO

TAO [3] is a CORBA middleware with support for RT and FT that is compliant with the OMG's standards for CORBA-RT [73] and CORBA-FT [36]. The

support for RT includes priority propagation, explicit binding, and RT thread pools.

FT is supported through the use of a high-level service, the Replication Manager, that

sits on top of the CORBA stack. This service is the cornerstone of the FT infrastructure,

acting as a rendezvous for all the remaining components, more precisely, monitors that

watch the status of the replicas, replica factories that allow the creation of new replicas,

and fault notifiers that inform the manager of failed replicas. TAO's architecture is further detailed in Section 2.6.

FLARe and CORFU

FLARe [74] focuses on proactively adapting the replication group to underlying changes in resource availability. To minimize resource usage, it only supports passive replication [75]. Its implementation is based on TAO [3]. It extends the existing architecture with: (a) a Replication Manager, a high-level service that decides on the strategy to be employed to address changes in resource availability and faults; (b) a client interceptor that redirects invocations to the active primary; (c) a redirection agent that receives updates from the Replication Manager and is used by the interceptor,


and (d) a resource monitor that watches the load on nodes and periodically notifies the Replication Manager. In the presence of faulty conditions, such as the overload of a node, the Replication Manager adapts the replication group to the changing conditions by activating replicas on nodes with lower resource usage and, additionally, by moving the primary to a more suitable node.

CORFU [66] extends FLARe to support real-time and fault-tolerance for the Lightweight

CORBA Component Model (LwCCM) [76] standard for DRE systems. It provides

fail-stop behavior, that is, when one component of a failover unit fails, all the remaining components are stopped, allowing for a clean switch to a new unit. This is achieved through a fault-mapping facility that maps an object failure to the respective plan(s), followed by the shutdown of the remaining components.

DeCoRAM

The DeCoRAM system [77] aims to provide RT and FT properties through a resource-

aware configuration, executed using a deployment infrastructure. The class of supported

systems is confined to closed DRE systems, where the number of tasks and their respective execution and resource requirements are known a priori and remain invariant throughout the system's life-cycle. As the tasks and resources are static, it is possible to optimize the allocation of the replicas on the available nodes. The allocation algorithm is configurable, allowing a user to choose the best approach for a particular application domain.

DeCoRAM provides a custom allocation algorithm named FERRARI (FailurE, Real-

Time, and Resource Awareness Reconciliation Intelligence) that addresses the opti-

mization problem, while satisfying both RT and FT system constraints. Because of the

limited resources normally available on DRE systems, DeCoRAM only supports passive

replication [75], thus avoiding the high overhead associated with active replication [78].

The allocation algorithm calculates the components' inter-dependencies and deploys the

execution plan using the underlying middleware infrastructure, which is provided by

FLARe [74].

Interception-based Approach

The work done in Eternal [79, 80] focused on providing transparent fault-tolerance for

CORBA, ensuring strong replica consistency through the use of a reliable totally-ordered multicast protocol. This approach relieved the developer from having to deal with low-

level mechanisms for supporting fault-tolerance. In order to maintain compatibility with

the CORBA-FT standard, Eternal exposes the replication manager, fault detector, and

fault notifier to developers. However, the main infrastructure components are located

below the ORB for both efficiency and transparency purposes. These components


include logging-recovery mechanisms, replication mechanisms, and interceptors. The

replication mechanisms provide support for warm and cold passive replication and active

replication. The interceptor captures the CORBA IIOP requests and replies (based on

TCP/IP) and redirects them to the fault-tolerance infrastructure. The logging-recovery

mechanisms are responsible for managing the logging, checkpointing, and performing

the recovery protocols.

MEAD

MEAD focuses on providing fault-tolerance support in a non-intrusive way by enhancing distributed RT systems with (a) transparent, although tunable, FT that is (b) proactively dependable through (c) resource awareness, with (d) scalable and fast fault-detection and fault-recovery. It uses CORBA-RT, more specifically TAO, as a proof-of-concept. The work makes an important contribution by balancing fault-tolerance resource consumption against RT behavior. MEAD is detailed further in Section 2.6.

2.3 P2P+RT Middleware Systems

While most of the focus on P2P systems has been on the support of FT, there is a

growing interest in using these systems for RT applications, namely in streaming and QoS support. This section provides an overview of P2P systems that support RT.

2.3.1 Streaming

Streaming, and especially Video on Demand (VoD), was a natural evolution of the first file-sharing P2P systems [81, 82]. With the steady increase of network bandwidth on the Internet, it is now possible to offer high-quality multimedia streaming solutions to the end-user. These systems focus on providing near soft real-time performance by splitting streams across distributed P2P storage and redundant network channels.

PPTV

The work done in [26] provides the background for the analysis, design and behavior

of VoD systems, focusing on the PPTV system [83]. An overview of the different

replication strategies and their respective trade-offs is presented, namely, Least Recently

Used (LRU) and Least Frequently Used (LFU). The latter uses a weighted estimation based on the local cache completion and on the availability-to-demand ratio (ATD).


Each stream is divided into chunks. The size of these chunks has a direct influence on the efficiency of the streaming: smaller pieces facilitate replication and thus overall system load-balancing, whereas bigger pieces decrease the resource overhead associated with piece management and the bandwidth consumed by protocol control. To allow for a more efficient piece selection, three algorithms are proposed: sequential, rarest-first and anchor-based. To ensure real-time behavior, the system offers different levels of aggressiveness, including: sending simultaneous requests of the same type to neighboring peers; simultaneously sending different content requests to multiple peers; and requesting from a single peer (making a more conservative use of resources).

Thicket

Efficient data dissemination over unstructured P2P networks was addressed by Thicket [84]. The work used multiple trees to ensure efficient usage of resources while providing redundancy in the presence of node failures. In order to improve load-balancing across the nodes, the protocol tries to minimize the number of nodes that act as interior nodes in several trees, thus reducing the load produced by forwarding messages. The protocol also defines a reconfiguration algorithm for balancing load across neighboring nodes and a tree-repair procedure to handle tree partitions. Results show that the protocol is able to quickly recover from a large number of simultaneous node failures and to balance the load across the existing nodes.

2.3.2 QoS-Aware P2P

Until recently, P2P systems focused on providing resiliency and throughput, and thus did not address the increasing need for QoS in latency-sensitive applications, such as VoD.

QRON

QRON [85] aimed to provide a general unified framework in contrast to application-

specific overlays. The overlay brokers (OBs), present at each autonomous system in

the Internet, support QoS routing for overlay applications through resource negotiation

and allocation, and topology discovery. The main goal of QRON is to find a path that

satisfies the QoS requirements, while balancing the overlay traffic across the OBs and

overlay links. For this, it proposes two distinct algorithms, a “modified shortest distance path” (MSDP) and a “proportional bandwidth shortest path” (PBSP).


GlueQoS

GlueQoS [86] focused on the dynamic and symmetric QoS negotiation between QoS

features from two communicating processes. It provides a declarative language that

allows the specification of the QoS feature set (and possible conflicts) and a runtime negotiation mechanism that finds a set of QoS features that is valid at both ends of the interacting components. Contrary to aspect-oriented programming [65], which

only enforces QoS semantics at deployment time, GlueQoS offers a runtime solution that

remains valid throughout the duration of the session between a client and a server.

2.4 P2P+FT Middleware Systems

The research on P2P systems has been largely dominated by the pursuit of fault-

tolerance, such as in distributed storage, mainly due to the resilient and decentralized

nature of P2P infrastructures.

2.4.1 Publish-subscribe

P2P publish-subscribe systems implement a messaging pattern in which publishers (senders) do not have a predefined set of subscribers (receivers) for their messages. Instead, subscribers must first register their interests with the target publisher before starting to receive published messages. This decoupling between publishers and subscribers allows for better scalability and, ultimately, better performance.

Scribe

Scribe [87] aimed to provide a large-scale event notification infrastructure, built on

top of Pastry [88], for topic-based publish-subscribe applications. Pastry is used to

support topics and subscriptions and build multicast trees. Fault-Tolerance is provided

by the self-organizing capabilities of Pastry, through the adaptation to network failures

and subsequent multicast tree repair. Event dissemination is best-effort and provides no delivery-order guarantees. Nevertheless, it is possible to enhance Scribe to support consistent ordering through the implementation of sequential time-stamping at the root of the topic. To ensure strong consistency and

tolerate topic root node failures, an implementation of a consensus algorithm such as

Paxos [89] is needed across the set of replicas (of the topic root).
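The consistent-ordering enhancement mentioned above can be illustrated with the following minimal sketch, which is not Scribe code: the topic root stamps every published event with a monotonically increasing sequence number, and subscribers buffer out-of-order events so that all deliver them in the same order.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Conceptual sequential time-stamping at the topic root.
    struct Event { uint64_t seq; std::string payload; };

    class TopicRoot {
    public:
        Event publish(const std::string& payload) {
            return Event{next_seq_++, payload};  // stamp, then multicast down the tree
        }
    private:
        uint64_t next_seq_ = 0;
    };

    class Subscriber {
    public:
        // Buffer out-of-order events and deliver strictly by sequence number.
        std::vector<Event> deliverable(const Event& e) {
            buffer_.push_back(e);
            std::vector<Event> out;
            bool progress = true;
            while (progress) {
                progress = false;
                for (auto it = buffer_.begin(); it != buffer_.end(); ++it) {
                    if (it->seq == next_expected_) {
                        out.push_back(*it);
                        buffer_.erase(it);
                        ++next_expected_;
                        progress = true;
                        break;
                    }
                }
            }
            return out;
        }
    private:
        std::vector<Event> buffer_;
        uint64_t next_expected_ = 0;
    };

    int main() {
        TopicRoot root;
        Subscriber sub;
        Event e0 = root.publish("a"), e1 = root.publish("b");
        sub.deliverable(e1);                              // arrives early, buffered
        return sub.deliverable(e0).size() == 2 ? 0 : 1;   // delivers "a" then "b"
    }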


Hermes

Hermes [90] focused on providing a distributed event-based middleware with an underly-

ing P2P overlay for scalability and reliability. Inspired by work done in Distributed Hash

Table (DHT) overlay routing [88, 91], it also has some notions of rendezvous similar

to [81]. It bridges the gap between programming language type semantics and low-level

event primitives by introducing the concepts of event types and event attributes, which have some common ground with an Interface Definition Language (IDL) in the RPC context. In order to improve performance, it is possible in the subscription process to

attach a filter expression to the event attributes. Several algorithms are proposed for

improving availability, but they all provide weak consistency properties.

2.4.2 Resource Computing

There is a growing interest in harvesting and managing the spare computing power

from the increasing number of networked devices, both public and private, as reported

in [92, 93, 94, 95]. Some relevant examples are:

BOINC

BOINC (Berkeley Open Infrastructure for Network Computing) [96] aimed to facili-

tate the harvesting of public resource computing by the scientific research community.

BOINC implements a redundant computing mechanism to prevent malicious or erro-

neous computational results. Each project specifies the number of results that should be

created for each “workunit”, i.e., the basic unit of computation to be performed. When some number of results are available, an application-specific function is called to evaluate them and possibly choose a canonical result. If no consensus is achieved, or if the results simply fail, a new set of results is computed. This process repeats until consensus is achieved or an application-defined timeout occurs.
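A minimal sketch of this redundant-computing validation step is shown below. It is an illustration only, not BOINC's API: the result type, the quorum value, and the majority-vote validator are all placeholders for an application-specific function.

    #include <algorithm>
    #include <map>
    #include <optional>
    #include <vector>

    // Pick a canonical result only if enough replicas agree on the same value;
    // otherwise the caller issues a new set of results for the workunit.
    std::optional<int> choose_canonical(const std::vector<int>& results, int quorum) {
        std::map<int, int> votes;
        for (int r : results) ++votes[r];
        auto best = std::max_element(votes.begin(), votes.end(),
            [](auto& a, auto& b) { return a.second < b.second; });
        if (best != votes.end() && best->second >= quorum) return best->first;
        return std::nullopt;  // no consensus
    }

    int main() {
        std::vector<int> results = {42, 42, 41};       // results returned for one workunit
        auto canonical = choose_canonical(results, 2); // application-defined quorum
        return canonical ? 0 : 1;
    }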

P2P-MapReduce

Developed at Google, MapReduce [97] is a programming model that is able to parallelize the processing of large data sets in a distributed environment. It follows a master-slave model, where a master distributes the data set across a set of slaves, which return, at the end, their computational results (from the map or reduce tasks). MapReduce provides fault-

tolerance for slave nodes by reassigning the failed job to an alternative active slave,

but lacks support for master failures. P2P-MapReduce [98] provides fault-tolerance by

resorting to two distinct P2P overlays, one containing the current available masters in


the system, and the other with the active slaves. When a user submits a MapReduce job, it queries the master overlay for a list of the available masters (ordered by their workload). It then selects a master node and the number of replicas. After this, the master node notifies its replicas that they will participate in the current job. A master

node is responsible for periodically synchronizing the state of the job over its replica set.

In case of failure, a distributed procedure is executed to elect the new master across

the active replicas. Finally, the master selects the set of slaves using a performance

metric based on workload and CPU performance from the slave overlay and starts the

computation.
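To make the programming model itself concrete, the following sketch shows the classic word-count map and reduce functions in a purely local, single-process form; the distribution across masters and slaves described above is omitted, and the function names are illustrative.

    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    // Map: emit (word, 1) for every word in the document.
    std::vector<std::pair<std::string, int>> map_fn(const std::string& document) {
        std::vector<std::pair<std::string, int>> out;
        std::istringstream in(document);
        for (std::string word; in >> word; ) out.emplace_back(word, 1);
        return out;
    }

    // Reduce: sum all counts emitted for a single word.
    int reduce_fn(const std::vector<int>& counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    int main() {
        auto pairs = map_fn("to be or not to be");
        std::map<std::string, std::vector<int>> grouped;  // the shuffle/group step
        for (auto& kv : pairs) grouped[kv.first].push_back(kv.second);
        std::map<std::string, int> counts;
        for (auto& kv : grouped) counts[kv.first] = reduce_fn(kv.second);
        return counts["to"];  // 2
    }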

2.4.3 Storage

Storage systems were one of the most prevalent applications of first-generation P2P sys-

tems. Evolving from early file-sharing systems, and with the help of DHT middlewares,

they have now become the choice for large-scale storage systems in both industry and

academia.

openDHT

Work done in [99] aimed to provide a lightweight framework for P2P storage using

DHTs (such as in [88, 91]) in a public environment. The key challenge was to handle mutually untrusting clients while guaranteeing fairness in the access to and allocation of storage. The work was able to provide fair access to the underlying storage capacity, under the assumption that storage capacity is free. Because of its intrinsically fair approach, the system is unable to provide any type of Service Level Agreement (SLA) to its clients, which reduces the domain of applications that can use it.

Dynamo

Recent research on data storage [25] and distribution at Amazon focuses on key-value approaches using P2P overlays, more precisely DHTs, to overcome the well-explored limitation of simultaneously providing high availability and strong consistency (through synchronous replication) [100, 101]. The approach taken was to use an optimistic replication scheme that relies on asynchronous replica synchronization (also known as passive replication). The consistency conflicts between different replicas, which are caused by network and server failures, are resolved at ’read time’, as opposed to the more traditional ’write time’ strategy, in order to maximize the write availability in the system. Such conflicts are resolved by the services, allowing for a

more efficient resolution (although the system offers a default ’last value holds’ strategy


to the services). Dynamo offers efficient key-value storage, while maximizing the availability of write operations. Nevertheless, the ring-based overlay hampers the scalability of the system and, depending on the partitioning strategy used, the membership process does not seem efficient.
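A minimal sketch of read-time reconciliation with a default ’last value holds’ policy is given below. It is an illustration only: Dynamo actually tracks causality with vector clocks and lets the service supply its own reconciliation logic, whereas this sketch simply keeps the version with the latest timestamp.

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    // One diverging version of a key, as returned by a replica on read.
    struct Version { uint64_t timestamp; std::string value; };

    // Called at read time, when several replicas return diverging versions.
    std::string reconcile_last_value_holds(const std::vector<Version>& versions) {
        auto latest = std::max_element(versions.begin(), versions.end(),
            [](const Version& a, const Version& b) { return a.timestamp < b.timestamp; });
        return latest->value;
    }

    int main() {
        std::vector<Version> replies = {{10, "cart:v1"}, {12, "cart:v2"}};
        return reconcile_last_value_holds(replies) == "cart:v2" ? 0 : 1;
    }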

2.5 P2P+RT+FT Middleware Systems

These types of systems offer a natural evolution over previous FT-RT middleware

systems. They aim to provide scalability and resilience through a P2P network infra-

structure that is able to provide lightweight FT mechanisms, allowing them to support

soft RT semantics. We first proposed an architecture [102, 103] for a general-purpose middleware that aimed to integrate FT into the P2P network layer while being able to provide RT support. The first implementation of the architecture, in Java, was done in DAEM [6, 104]. This work used a hierarchical tree P2P overlay based on P3 [15]. The FT support was performed at all levels of the tree, resulting in a high availability rate, but the use of JGroups [7] for maintaining strong consistency, both for mesh and service data, resulted in high overhead. Due to its highly coupled tree architecture, faults had a major impact on availability when they occurred near the root node, as they produced a cascading failure. Initial support for RT was provided, but the high overhead of the

replication infrastructure limited its applicability.

2.6 A Closer Look at TAO, MEAD and ICE

This section provides a closer look at middleware systems that have provided us with

several strategies and insights that we used to design and implement Stheno, our

middleware solution that is able to support RT, FT and P2P.

All the referred systems, TAO, MEAD, and ICE, share a service-oriented architecture with a client-server network model. In terms of RT, both TAO and MEAD support the RT-CORBA standard, while ICE only supports best-effort invocations. As for FT support, TAO and ICE use high-level services, whereas MEAD uses a hybrid approach that combines both low- and high-level services.


2.6.1 TAO

TAO is a classical RPC middleware and therefore only supports the client-server network

model. Name resolution is provided by a high-level service, representing a clear single point-of-failure and a bottleneck.

RT Support. TAO supports the RT-CORBA specification 1.0, with the most important features being: (a) priority propagation; (b) explicit binding, and (c) RT thread

pools.

The priority propagation ensures that a request maintains its priority across a chain of

invocations. A client issues a request to an Object A that, in turn, issues an invocation to another Object B. The request priority at Object A is then used to make the invocation at Object B. There are two types of propagation: server-declared priorities and client-propagated priorities. In the first type, a server dictates the priority that will

be used when processing an incoming invocation. In the other type, the priority of

the invocation is encoded within the request, so the server processes the request at the

priority specified by the client.

A source of unbounded priority inversion is the use of multiplexed communication channels. To overcome this, the RT-CORBA specification states that network channels should be pre-established, avoiding the latency caused by their creation. This model allows two possible policies: (a) a private connection between the client and the server, or (b) a priority-banded connection that can be shared but limits the priority of the requests that can be made on it.

In CORBA, a thread pool uses a threading strategy, such as leader-followers [11], with

the support of a reactor (an object that handles network event de-multiplexing), and is

normally associated with an acceptor (an entity that handles the incoming connections),

a connection cache, and a memory pool. In classic CORBA, a high-priority thread can be delayed by a low-priority one, leading to priority inversion. In an effort to avoid

this unwanted side-effect, the RT-CORBA specification defines the concept of thread

pool lanes.

All the threads belonging to a thread-pool lane have the same priority, and so they only process invocations that have that same priority (or a band that contains that priority). Because each lane has its own acceptor, memory pool and reactor, the risk of priority

inversion is greatly minimized at the expense of greater resource usage overhead.
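The lane concept can be illustrated with the following self-contained sketch, which is a simplification based on std::thread rather than the RT-CORBA/TAO API: each lane has a fixed priority and its own queue and worker threads, so a high-priority request never waits behind a low-priority one in the same dispatch queue. The priority values are illustrative.

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Simplified thread-pool lane: dedicated threads and queue per priority.
    class Lane {
    public:
        Lane(int priority, int nthreads) : priority_(priority) {
            for (int i = 0; i < nthreads; ++i)
                workers_.emplace_back([this] { run(); });
        }
        ~Lane() {
            { std::lock_guard<std::mutex> g(m_); done_ = true; }
            cv_.notify_all();
            for (auto& t : workers_) t.join();
        }
        int priority() const { return priority_; }
        void enqueue(std::function<void()> job) {
            { std::lock_guard<std::mutex> g(m_); jobs_.push(std::move(job)); }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> l(m_);
                    cv_.wait(l, [this] { return done_ || !jobs_.empty(); });
                    if (done_ && jobs_.empty()) return;
                    job = std::move(jobs_.front());
                    jobs_.pop();
                }
                job();  // a real ORB would also set the OS thread priority here
            }
        }
        int priority_;
        std::vector<std::thread> workers_;
        std::queue<std::function<void()>> jobs_;
        std::mutex m_;
        std::condition_variable cv_;
        bool done_ = false;
    };

    // A request is routed to the lane whose priority matches its own.
    Lane* select_lane(std::vector<Lane*>& lanes, int request_priority) {
        for (auto* l : lanes)
            if (l->priority() == request_priority) return l;
        return nullptr;
    }

    int main() {
        Lane low(1, 2), high(10, 2);
        std::vector<Lane*> lanes{&low, &high};
        select_lane(lanes, 10)->enqueue([] { /* high-priority work */ });
        select_lane(lanes, 1)->enqueue([] { /* low-priority work */ });
    }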

FT Support. In an effort to combine RT and FT semantics, the replication style


proposed, semi-active, was heavily based on Delta-4 [45]. This strategy avoids the

latency associated with both warm and cold passive replication [105] and the high

overhead and non-determinism of active replication, but represents an extension to the

FT specification.

Figure 2.2: TAO’s architectural layout (adapted from [3]).

Figure 2.2 shows the architectural overview of TAO. The support for FT is achieved

through the use of a set of high-level services built on top of TAO. These services include

a Fault Notifier, a Fault Detector and a Replication Manager.

The Replication Manager is the central component of the FT infrastructure. It acts

as a central rendezvous point for the remaining FT components, and it is responsible for managing the replication groups' life-cycle (creation/destruction) and for performing group maintenance, that is, the election of a new primary, the removal of faulty replicas, and the updating of group information.

It is composed of three sub-components: (a) a Group Manager, which manages the group membership operations (adds and removes elements), allows changing the primary of a given group (for passive replication only), and allows the manipulation and retrieval of group member locations; (b) a Property Manager, which allows the manipulation of


replication properties, like replication style; and (c) a Generic Factory, the entry point

for creating and destroying objects.

The Fault Detector is the most basic component of the FT infrastructure. Its role is to

monitor components, processes and processing nodes and to report any failures to the Fault Notifier. In turn, the Fault Notifier aggregates these failure reports and forwards

them to the Replication Manager.

The FT bootstrapping sequence is as follows: (a) the Naming Service is started; (b) the Replication Manager is started; (c) the Fault Notifier is started, (d) finds the Replication Manager and registers itself with it; in response, (e) the Replication Manager connects as a consumer to the Fault Notifier. (f) Each node that is going to participate starts a Fault Detector Factory and a Replica Factory, which in turn register themselves with the Replication Manager. (g) A group creation request is made to the Replication Manager (by a foreign entity, referred to as the Object Group Creator), followed by a request for the list of available Fault Detector Factories and Replica Factories; (h) this is followed by a request to create an object group in the Generic Factory. (i) The Object Group Creator then bootstraps the desired number of replicas using the Replica Factory at each target node; in turn, each Replica Factory creates the actual replica and, at the same time, starts a Fault Detector at each site using the Fault Detector Factory. Each one of these detectors finds the Replication Manager, retrieves the reference to the Fault Notifier and connects to it as a supplier. (j) Each replica is added to the object group by the Object Group Creator using the Group Manager at the Replication Manager. (k) At this point, a client is started, retrieves the object reference from the naming service, and makes an invocation to the group, which is then carried out by the primary of the replication group.

Proactive FT Support. An alternative approach has been proposed by FLARe [74],

which focuses on proactively adapting the replication group to the load present in the system. The replication style is limited to semi-active replication using state transfer, which is commonly referred to simply as passive replication.

Figure 2.3 shows the architectural overview of FLARe. This new architecture adds three components to TAO's FT infrastructure: (a) a client interceptor that redirects invocations to the proper server, as the initial reference may have been changed by the proactive strategy in response to a load change; (b) a redirection agent that receives the updates with these changes from the Replication Manager; and (c) a resource monitor that monitors the load on a processing node and sends periodic updates to the Replication Manager.

Figure 2.3: FLARe’s architectural layout (adapted from [74]).

In the presence of abnormal load fluctuations, the Replication Manager changes the replication group to adapt to these new conditions, by creating replicas on nodes with lower usage and, if required, by changing the primary to a more suitable replica.

TAO’s fault tolerance support relies on a centralized infrastructure, with its main

component, the Replication Manager, representing a major obstacle to the system's scalability and resiliency. No mechanisms are provided to replicate this entity.

2.6.2 MEAD

MEAD focused on providing fault-tolerance support in a non-intrusive way, enhancing distributed RT systems with transparent, although tunable, FT that is proactively dependable through resource awareness and has scalable and fast fault-detection and fault-recovery. It uses CORBA-RT, more specifically TAO, as a proof-of-concept.

Transparent Proactive FT Support. MEAD’s architecture contains three major

components, namely, the Proactive FT Manager, the MEAD Recovery Manager and the


MEAD Interceptor. The underlying communication is provided by Spread, a group communication framework that offers reliable totally-ordered multicast, guaranteeing consistency for both component and node membership.

The MEAD Interceptor provides the usual interception of system calls between the application and the underlying operating system. This approach allows a transparent and

non-intrusive way to enhance the middleware with fault-tolerance.

Figure 2.4: MEAD’s architectural layout (adapted from [14]).

Figure 2.4 shows the architectural overview of MEAD. The main component of the

MEAD system is the Proactive FT Manager, which is embedded within the interceptors at both server and client. It is responsible for monitoring the resource usage at each server and for initializing a proactive recovery scheme based on a two-step threshold. When the resource usage rises above the first threshold, the proactive manager sends a request to the MEAD Recovery Manager to launch a new replica. If the usage rises above the second threshold, the proactive manager starts migrating the replica's clients to the next non-faulty replica server.
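A minimal sketch of this two-step threshold logic follows. It is an illustration, not MEAD code: the threshold values and the two callbacks are placeholders for the actual requests sent to the recovery manager and for client migration.

    #include <functional>
    #include <iostream>

    // Hypothetical two-step proactive recovery: crossing the first threshold
    // requests a fresh replica; crossing the second migrates clients away.
    class ProactiveManager {
    public:
        ProactiveManager(double spawn_threshold, double migrate_threshold,
                         std::function<void()> spawn_replica,
                         std::function<void()> migrate_clients)
            : spawn_threshold_(spawn_threshold), migrate_threshold_(migrate_threshold),
              spawn_replica_(std::move(spawn_replica)),
              migrate_clients_(std::move(migrate_clients)) {}

        // Called periodically with the monitored resource usage (0.0 - 1.0).
        void on_usage_sample(double usage) {
            if (usage >= migrate_threshold_) {
                migrate_clients_();              // second step
            } else if (usage >= spawn_threshold_ && !spawn_requested_) {
                spawn_requested_ = true;
                spawn_replica_();                // first step
            }
        }
    private:
        double spawn_threshold_, migrate_threshold_;
        std::function<void()> spawn_replica_, migrate_clients_;
        bool spawn_requested_ = false;
    };

    int main() {
        ProactiveManager mgr(0.7, 0.9,
            [] { std::cout << "request new replica\n"; },
            [] { std::cout << "migrate clients to next replica\n"; });
        for (double u : {0.5, 0.75, 0.95}) mgr.on_usage_sample(u);
    }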

The MEAD Recovery Manager has some similarities with the Replication Manager of

CORBA-FT, as it also must launch new replicas in the presence of failures (node or

server). In MEAD, the recovery manager does not follow a centralized architecture, as

in TAO or FLARe, where all the components of the FT infrastructure are connected to

the replication manager; instead, they are connected by a reliable totally-ordered group

communication framework that establishes an implicit agreement at each communica-

tion round. These frameworks also provide a notion of view, i.e. an instantaneous


snapshot of the group membership, and notifications of any membership change. This

allows the MEAD Recovery Manager to detect a failed server and respawn a new replica,

maintaining the desired number of replicas.

Choosing FT properties, e.g., the replication style, without evaluating object state size and resource usage can severely affect the overall performance and reliability. The only way to achieve a balance between the two orthogonal domains of reliability and (real-time) performance is to weigh the object's resource usage, the system's resource availability, and the target levels of reliability and recovery time.

To overcome this issue, MEAD introduced an FT Advisor. This advisor profiles the object for a certain period of time to assess its resource usage, e.g., CPU, network bandwidth, etc., and its invocation rate. Using this information, the advisor can provide advice on

the proper settings of FT properties. For example, if an object uses little computation

time and has a large state, then active replication is the most suitable replication style.

The replication style is not the only choice considered. For passive replication there are two options of relevance: the checkpointing and fault-detection periods. The checkpointing period affects the consistency window between the primary and the replicas. More frequent checkpointing results in a smaller window, i.e., inconsistent state has a shorter duration, but brings a larger resource overhead, as more CPU and network bandwidth are needed. The fault-detection period directly impacts the recovery time, as a larger interval between fault-detection inspections results in a longer recovery time.

The FT Advisor continuously and periodically provides feedback to the runtime with

more accurate suggestions, adjusting to changes in resource usage and availability.

Normally, active replication support is restricted to deterministic single-threaded applications. MEAD's last contribution comes in the form of support for non-deterministic applications under active replication. To achieve this, MEAD uses source-code analysis to detect points in the source code that introduce non-determinism, e.g., system calls like gettimeofday. These non-deterministic points are stored in a data structure that is embedded within invocations and replies, and they must be stored locally in both clients and servers. The reason for this lies in the way active replication works. A client makes an invocation, which is multicast to the replicas. Each replica processes the request, storing the non-deterministic data locally and piggybacking it on the reply that is sent back to the client. The client picks the first reply and stores the non-deterministic data locally. The client then piggybacks this information on the next invocation it makes. When the replicas receive this invocation, they retrieve this non-deterministic


information and update their internal state, except the replica whose reply was chosen

by the client.
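The following sketch illustrates this piggybacking of non-deterministic values in a purely local form. It is a conceptual example, not MEAD code: the record structure and method names are hypothetical, and only the recording and reconciliation steps are shown.

    #include <sys/time.h>
    #include <cstdint>
    #include <map>

    // Hypothetical record of one non-deterministic value produced for a request.
    struct NonDetRecord { uint64_t request_id; int64_t usec_since_epoch; };

    class Replica {
    public:
        // Execute a request that contains a non-deterministic call (gettimeofday).
        NonDetRecord execute(uint64_t request_id) {
            timeval tv{};
            gettimeofday(&tv, nullptr);                // non-deterministic point
            NonDetRecord rec{request_id,
                             int64_t(tv.tv_sec) * 1000000 + tv.tv_usec};
            log_[request_id] = rec.usec_since_epoch;   // keep the local value
            return rec;                                // piggybacked on the reply
        }
        // Applied when the client's next invocation carries the chosen record.
        void reconcile(const NonDetRecord& chosen) {
            log_[chosen.request_id] = chosen.usec_since_epoch;  // adopt the winning value
        }
    private:
        std::map<uint64_t, int64_t> log_;
    };

    int main() {
        Replica r1, r2;
        NonDetRecord a = r1.execute(1), b = r2.execute(1);
        (void)b;
        // Suppose the client picked r1's reply: r2 adopts r1's non-deterministic value.
        r2.reconcile(a);
    }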

The recovery manager is not replicated, turning it into a single point of failure. The use of a reliable totally-ordered group communication framework partially improves the decentralization of the infrastructure, but the recovery manager still acts as a centralized unit, with a negative impact on the overall system scalability. In systems that are prone to a large churn rate, group communication could result in partitions, as we assessed in DAEM [6]. These partitions could result in a major outage, compromising reliability and real-time performance.

2.6.3 ICE

ICE [106] provides a lightweight RPC-based middleware that aims to overcome the inefficiencies, such as redundancy, present in the CORBA specification. For that purpose,

ICE provides an efficient communication protocol and data encoding. It does not

support any kind of RT semantics.

The support for FT in ICE is minimal and is restricted to naming for replication groups: when a client tries to resolve a replication group name, it receives a list with all the server instances that belong to the group (i.e., their endpoints). It does not support any type of replication style or even synchronization primitives, leaving this to the applications.

ICE does not provide an infrastructure to support very large-scale systems. Its registry, which acts like the CORBA Naming Service, constitutes a bottleneck and a possible single point of failure. The reliability of the registry can be improved by the addition of standby instances, in a master-slave relation.

2.7 Summary

The goal of this chapter was to search for a suitable solution that could address

all the requirements of our target system, that is, a middleware system capable

of simultaneously supporting RT+FT+P2P. As no solution was found, we focused

on systems that belong to the intersecting domains, namely, RT+FT, P2P+FT and

P2P+RT, to see if we could extend one of them and avoid designing and implementing

a new middleware from scratch.


In our previous work, DAEM [102, 103], we used some off-the-shelf components, e.g., JGroups [7] to manage replication groups, but realized that, in order to integrate real-time and fault-tolerance within a P2P infrastructure, we would have to completely control the underlying infrastructure with fine-grained management over all the resources available in the system. The use of COTS software components creates a “black-box” effect that introduces sources of unpredictable behavior and non-determinism that undermine any attempt to support real-time. For that reason, it was unavoidable to

create a solution from scratch.

Using the insights learned from several inspirational middleware systems, namely TAO, MEAD, and ICE, we have designed, in Chapter 3, and implemented, in Chapter 4, Stheno, which to the best of our knowledge is the first middleware system that simultaneously supports RT and FT within a P2P infrastructure.


–If you can’t explain it simply, you don’t understand it well

enough.

Albert Einstein

3 Architecture

The implementation of increasingly complex systems at EFACEC is currently lim-

ited by the capabilities of the supporting middleware infrastructure. These systems

include public information systems for public transportation, automated power grid

management and automated substation management for railways. The use of service-oriented architectures is an effective approach to reduce the complexity of such systems. However, the increasing demand for guarantees on the fulfillment of SLAs can only be met with a middleware platform that is able to provide QoS computing while

enforcing a resilient behavior.

Some middleware systems [14, 3] already addressed this problem by offering soft real-

time computing and fault-tolerance support. Nevertheless, their support for real-time

computing is limited, as they do not provide any type of isolation. For example, a service can hog the CPU and effectively starve the remaining services. The support for fault-tolerance is restricted to crash failures, and the implementation of the fault-tolerance

mechanisms is normally accomplished through the use of high-level services. However,

these high-level services cause a significant amount of overhead, due to cross-layering,

limiting the real-time capabilities of these middleware systems.

These systems also use a centralized networking model that is susceptible to single points of failure and offers limited scalability. For example, the CORBA naming service reflects these limitations: a crash failure can effectively stop an entire system because of the absence of the name resolution mechanism.

This chapter provides an architectural overview of a new general-purpose P2P middle-

ware that addresses the aforementioned problems. The resilient nature of P2P overlays

enables us to overcome the limitations of current approaches by offering a decentral-

ized and reconfigurable fault-resistant architecture that avoids bottlenecks, and thus


enhances overall performance.

Stheno, our middleware platform, is able to provide QoS computing with support for

resource reservation through the implementation of a QoS daemon. This daemon is

responsible for the admission and distribution of the available resources among the

components of the middleware. Furthermore, it also interacts with the low-level resource

reservation mechanisms of the operating system to perform the actual reservations.

With this support, we provide proper isolation that is able to accommodate soft real-

time tasks and thus provide guarantees on SLAs. While we currently only support CPU

reservation, the architecture was designed to be extensible and subsequently support

additional sub-systems, such as memory or networking resource reservations.

Notwithstanding, the real-time capabilities are limited by the amount of resources that are needed to provide fault-tolerance. To overcome the current limitations of providing fault-tolerance through the use of expensive high-level services, we propose the integration of the fault-tolerance mechanisms directly in the overlay layer. This provides two advantages over previous approaches: 1) it allows the implementation of lightweight fault-tolerance mechanisms by reducing cross-layering, and 2) the replica placement can be optimized using knowledge of the overlay's topology. Previous

systems relied on manual bootstrap of replicas, such as TAO [3], or required the presence

of additional high-level services to perform load balancing across the replica set, as in

FLARe [74].

While the work presented in this thesis only implements semi-active replication [44],

we designed a modular and flexible fault-tolerance infrastructure that is able to accom-

modate other types of replication policies, such as passive replication [75] and active

replication [78].

Our architectural design also considered future support for virtualization. However,

instead of providing virtualization as a service, as is done in cloud computing platforms [107], our goal is to support lightweight virtualized services to offer out-of-the-

box fault-tolerance support for legacy services through the use of the live-migration

mechanisms present in current hypervisors, such as KVM [108] and Xen [109]. This

can be achieved through the use of Just Enough Operating System (JeOS) [110], which enables the creation of small-footprint virtual machines, a critical requirement for performing virtual machine migration.

Finally, in order to minimize the effort required to port the runtime to a new operating

system, we used the ACE framework [111] that abstracts the underlying operating

system infrastructure.


3.1 Stheno’s System Architecture

In order to contextualize our approach, we will present our solution applied to one

of our target systems, the Oporto light-train public information system. As shown in Figure 3.1, the network uses a hierarchical tree-based topology, based on the P3 overlay [15], where each cell represents a portion of the mesh space that is

maintained (replicated) by a group of peers. These peers provide the computational

resources needed to maintain the light-train stations and host services within the system.

Additionally, there are also sensors that connect to the system through peers. They

offer an abstraction to several low-level activities, such as traffic track sensors and

video camera streams. A detailed discussion about the implementation of the overlay

is provided in Chapter 4.

Figure 3.1: Stheno overview.

The middleware’s runtime provides the necessary infrastructure that allows users to

launch and manipulate services, while hiding the interaction with low-level peer-to-peer overlay and operating system mechanisms. It is based on a five-layer model, as shown

in Figure 3.1.

The bottom layer, Operating System Interface, encapsulates the Linux operating system


and the ACE [111] network framework. The Support Framework is built on top of

the bottom layer, and offers a set of high-level abstractions for efficient, modular

component design. The P2P Layer and FT Configuration contains all the peer-to-

peer overlay infrastructure components and provides a communication abstraction and

FT configuration to the upper layers. The runtime can be loaded with a specific overlay

implementation at bootstrap. The middleware is parametric in the choice of overlay, and

these are provided as plugins and can be loaded dynamically. The Core layer represents

the kernel of the runtime, and is responsible for managing all the resources allocated

to the middleware and the peer-to-peer overlay. Finally, the Application and Services

layer is composed of the applications and services that run on top of the middleware.

Next, we describe the organization for each layer, as well as their inter-dependencies. In

an effort to improve the overall comprehension of the runtime, the layers are presented

using a top-down approach, starting at the application level, continuing through the

core and overlay layers, and ending at the operating system interface.

3.1.1 Application and Services

One of the most fundamental problems when developing a general-purpose middleware system is its ability to expose functionalities and configuration options to the user. This layer achieves that goal through the introduction of high-level APIs that allow users to query and configure the different layers of the runtime. For example, in our

target system, a system operator may create a video streaming service from a light-train

station and set the frame rate and replication style.

The service represents the main abstraction of the middleware, and is shown in Fig-

ure 3.4. A developer that wishes to deploy an application has to use this abstraction.

The node hosting a service guarantees that its QoS requirements (CPU, network,

memory and I/O) are assured throughout the service's entire life-cycle. The CPU subsystem is an exception to this rule: it allows the creation of best-effort

computing tasks that, as the name implies, do not have any QoS guarantees. These are

normally associated with helper mechanisms, such as logging.

A service can be statically or dynamically loaded into the middleware. Dynamic services

are encapsulated into a meta-archive called Stheno Service Archive, which has the .ssa

file extension, and uses the ZIP archive format. Such an archive contains a service

implementation (plugin) that may be loaded by the runtime. This solution allows the

runtime to dynamically retrieve a missing service implementation and load it on-the-fly.


Figure 3.2: Application Layer.

Figure 3.3: Stheno’s organization overview.

Each service is identified in the middleware system by a Service Identifier (SID) that

uniquely identifies the service implementation, and an Instance Identifier (IID) that

identifies a particular instance of a service in the system, as any given service imple-

mentation can have multiple instances running simultaneously (Figure 3.3).

An IID is unique across the peer-to-peer overlay; therefore, at any given time, the instance is running in only one peer, identified uniquely by a Peer Identifier (PID), but during its lifespan it can migrate to other peers; this occurs when a service instance moves from one peer to another. A PID can only be allocated to one Cell Identifier (CID).

However, this membership can dynamically change during the peer’s lifespan.

A cell can be seen as a set of peers that are organized to maintain a partition of

the overlay space. These cells can be loosely coupled, for example, Gnutella peers partition the overlay space in an ad-hoc fashion, or they can follow a structured topology. Other overlays [15] have a hierarchical tree of cells, and in each cell the peers cooperate with the purpose of maintaining a portion of an overlay tree. In turn, the cells cooperate


among themselves to maintain the global tree topology.

Some services can be deployed strictly as daemons. This class of services does not offer

any type of external interaction. Nevertheless, a service usually provides some sort of

interaction that is abstracted in the form of a client.

Using the RPC service as an example, a client is a broker between the user and the server, marshaling requests and unmarshaling replies. Another example is a video

streaming client that connects to a streaming service with the purpose of receiving a

video stream, acting as a stream sink.

The interaction with a service, through a client, is only possible if the service provides

one or more Service Access Points (SAPs). These SAPs provide the entry-points that

support such interactions, with each one providing a specific QoS. For example, an RPC service can provide two SAPs, one for low-priority invocations and the other for

high-priority invocations.

When a user (through a client) wants to contact a service instance, it first has to know which SAPs are available for that particular instance. In order to accomplish that goal, the user must use the discovery service to query the active access points of that particular service instance.

To summarize, the responsibilities of a service are the following: define the amount of

resources that it will need throughout its life-cycle; manage multiple SAPs; and provide a client implementation.
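The following hypothetical sketch illustrates this service abstraction. It is not Stheno's actual API; all type names, endpoint strings and QoS values are illustrative only, and a real client would obtain the SAPs through the discovery service rather than directly from the service object.

    #include <string>
    #include <vector>

    // Illustrative QoS specification and Service Access Point.
    struct QoSSpec { int cpu_share_percent; int priority; };

    struct ServiceAccessPoint {
        std::string endpoint;   // e.g. "tcp://host:port"
        QoSSpec qos;            // QoS offered on this entry point
    };

    class Service {
    public:
        virtual ~Service() = default;
        virtual QoSSpec resources() const = 0;  // resources needed for the life-cycle
        virtual std::vector<ServiceAccessPoint> access_points() const = 0;
    };

    // An RPC-like service exposing a low-priority and a high-priority SAP.
    class RpcService : public Service {
    public:
        QoSSpec resources() const override { return {20 /*% CPU*/, 0}; }
        std::vector<ServiceAccessPoint> access_points() const override {
            return {{"tcp://0.0.0.0:5000", {10, 1}},    // low-priority invocations
                    {"tcp://0.0.0.0:5001", {10, 10}}};  // high-priority invocations
        }
    };

    int main() {
        RpcService svc;
        return svc.access_points().size() == 2 ? 0 : 1;
    }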

3.1.2 Core

One important issue is how to deal with the different real-time and fault-tolerance

requirements of different services, which in turn are requested by different users. To address this issue, the core, shown in Figure 3.4, is responsible for the overall management of all assigned resources, including overlays and services. The resource reservation

mechanisms are not controlled directly by the runtime, but by a resource reservation

daemon, shown as QoS Daemon, that is responsible for managing the available low-level

resources.

This approach enables multiple runtimes to coexist within the same physical host and

further allows foreign applications to use the resource reservation infrastructure. The

runtime core merely acts as a broker for any resource reservation request initiated by

any of its applications or overlay services.


Figure 3.4: Core Layer.

The most important roles performed by the core are the following: a) maintain the

information of all local active service instances; b) act as a regulator, deciding on the acceptance of new local service instances; and c) provide a resource reservation broker.

The management of the active service instances is done through the use of the Service

Manager.

Service Manager

The service manager is responsible for managing all the local services of a runtime.

A service can be loaded into an active runtime in one of two ways: it can be locally

bootstrapped at start-up, such as static services that are loaded when the runtime

bootstraps; or it can be dynamically loaded in response to a local or remote request. The request for the creation of a new service instance can be initiated locally, by the user or by a local service, or remotely, when a remote peer requests it through the overlay

infrastructure.

This remote service creation is delegated to the overlay mesh service, which in turn uses the overlay's inner infrastructure to accomplish this task. The implementation of these mechanisms is detailed in Chapter 4.

The service manager is composed of two entities, a service factory and a service book-

keeper. The service factory is a repository of known service implementations that can

be manipulated dynamically, allowing the insertion and removal of service implementa-

tions. The service bookkeeper manages the information, such as SAPs, about the active

service instances that are running locally.


QoS Controller

The QoS Controller, shown in Figure 3.5, acts as a proxy between the components of

the runtime and the QoS daemon. Each component has access to resources that are assigned at creation time. A component uses its resources through a QoS Client that was previously assigned to it by the QoS Controller. A resource reservation request is created by a QoS Client and then re-routed by the QoS Controller to the QoS daemon. In the current implementation, the allocation assigned to each component is static. A dynamic reassignment of the resources allocated to a component is left for

future work.

Figure 3.5: QoS Infrastructure.

Section 3.1.4 provides the details on the QoS and resource reservation infrastructure,

in particular detailing the internals of the QoS daemon.
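The reservation path just described, from a component through its QoS Client and the QoS Controller down to the QoS daemon, can be sketched as follows. This is a hypothetical illustration only: the class names, the reservation parameters and the trivial admission test do not correspond to Stheno's actual implementation.

    #include <string>

    // Illustrative CPU reservation: runtime budget within a period, in microseconds.
    struct CpuReservation { int runtime_us; int period_us; };

    class QoSDaemon {                 // performs the actual OS-level reservation
    public:
        bool reserve(const std::string& component, const CpuReservation& r) {
            return r.runtime_us <= r.period_us;  // stand-in for a real admission test
        }
    };

    class QoSController {             // runtime-side proxy towards the daemon
    public:
        explicit QoSController(QoSDaemon* d) : daemon_(d) {}
        bool forward(const std::string& component, const CpuReservation& r) {
            return daemon_->reserve(component, r);
        }
    private:
        QoSDaemon* daemon_;
    };

    class QoSClient {                 // handed to each component at creation time
    public:
        QoSClient(std::string component, QoSController* c)
            : component_(std::move(component)), controller_(c) {}
        bool reserve_cpu(int runtime_us, int period_us) {
            return controller_->forward(component_, {runtime_us, period_us});
        }
    private:
        std::string component_;
        QoSController* controller_;
    };

    int main() {
        QoSDaemon daemon;
        QoSController controller(&daemon);
        QoSClient client("service-42", &controller);
        return client.reserve_cpu(2000, 10000) ? 0 : 1;  // 2 ms every 10 ms
    }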

3.1.3 P2P Overlay and FT Configuration

Our target systems require that the middleware be able to adapt its P2P networking layer to mimic the physical deployment while, at the same time, providing the fault-tolerance configuration options needed to meet application needs.

The overlay layer is based on a plugin infrastructure that enables a flexible deployment of

the middleware for different application domains. For example, in our flagship solution,

the Oporto light-train network, we used a P3-based plugin implementation that mirrors the regional hierarchy of the system. Additionally, the FT configuration options passed by the user, for example, the requirement to maintain a service replicated among 3 replicas using semi-active replication, are delegated to the FT service within the

P2P overlay.


Because of this flexibility, the runtime does not bootstrap with a specific overlay

implementation by default; it is left to the user to choose the most suitable P2P implementation for the particular target case. Figure 3.6 shows the components that

form the overlay abstraction layer.

Figure 3.6: Overlay Layer.

Every overlay implementation must provide the following services: (a) Mesh, responsible

for membership and overlay management; (b) Discovery, used to discover resources and

data across the overlay; and (c) FT (Fault-Tolerance), used to manage and negotiate

the fault-tolerance policies across the overlay.
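An interface-only sketch of this overlay plugin contract is shown below. The interface and method names are hypothetical and do not correspond to Stheno's actual headers; the sketch only conveys that every plugin must expose the three services to the runtime core.

    #include <string>

    // Hypothetical overlay plugin contract (names are illustrative).
    class MeshService {
    public:
        virtual ~MeshService() = default;
        virtual void join(const std::string& bootstrap_endpoint) = 0;  // membership
        virtual void leave() = 0;
    };

    class DiscoveryService {
    public:
        virtual ~DiscoveryService() = default;
        virtual std::string query(const std::string& what) = 0;        // resource/data lookup
    };

    class FaultToleranceService {
    public:
        virtual ~FaultToleranceService() = default;
        virtual void create_replication_group(const std::string& service_id,
                                              int replicas) = 0;
    };

    // Every overlay plugin exposes the three services to the runtime core.
    class Overlay {
    public:
        virtual ~Overlay() = default;
        virtual MeshService& mesh() = 0;
        virtual DiscoveryService& discovery() = 0;
        virtual FaultToleranceService& ft() = 0;
    };

    // Interface-only sketch; a concrete plugin (e.g., a P3-based one) would be
    // loaded by the runtime at bootstrap.
    int main() { return 0; }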

Mesh Service

The mesh service is responsible for managing the overlay topology and providing support

for the remote creation and removal of services. The management of the overlay

topology is supported through the membership and recovery mechanisms. The mem-

bership mechanism must allow the entrance and departure of peers while maintaining

consistency of the mesh topology. At the same time, the recovery mechanism has to

perform the necessary rebind and reconfiguration to ensure that the mesh topology

remains valid even in the presence of severe faults.

An overlay plugin is free to implement the membership and recovery mechanisms that

best fit its needs. This was motivated by the goal of minimizing the restrictions imposed on the overlay topology, thereby increasing the range of systems supported by

Stheno.

Figure 3.7 shows four possible implementation approaches. A portal can be used to act

as a gatekeeper [112] (shown in Figure 3.7a), resembling the approach taken by most

web services. This can be suitable for systems that do not have a high churn rate. On

the other hand, systems that need highly available and decentralized architectures may


Figure 3.7: Examples of mesh topologies.

use multicast mechanisms to detect other nodes present in the system [15] (shown in Figure 3.7b). Nevertheless, some systems require bounded operation times, for example for queries. This can be accomplished with the introduction of cells (also known as federations), as in Gnutella [81] (shown in Figure 3.7c), or alternatively, by imposing some kind of well-defined inter-peer relationship, as in Chord [113] (shown in Figure 3.7d).

Discovery Service

The discovery service offers an abstraction that allows the execution of queries on the

underlying overlay. As with the mesh service, each overlay plugin is free to implement

the discovery service as it best suits the needs of the target system. Figure 3.8 shows

the execution of a query under some possible topologies.

Figure 3.8: Querying in different topologies: (a) hierarchical overlay topology; (b) ad-hoc overlay topology; (c) DHT overlay topology.

The main goals of the discovery service are the following: performing synchronous and asynchronous querying with QoS awareness, and; handling query requests from neighboring peers while respecting the QoS associated with each request.
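A hypothetical sketch of the two query styles is given below; all names (Discovery, QoSSpec, query, queryAsync) are illustrative, since the concrete API is left to each overlay plugin.

#include <functional>
#include <string>
#include <vector>

struct QoSSpec { int priority; long deadlineMs; };   // QoS attached to a query
using QueryResult = std::vector<std::string>;        // e.g. SAPs of matching peers

class Discovery {
public:
    // Synchronous query: blocks until results arrive or the deadline expires.
    virtual QueryResult query(const std::string& expr, const QoSSpec& qos) = 0;
    // Asynchronous query: the callback is invoked when results (or a timeout) arrive.
    virtual void queryAsync(const std::string& expr, const QoSSpec& qos,
                            std::function<void(const QueryResult&)> onReply) = 0;
    virtual ~Discovery() = default;
};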

Fault-Tolerance Service

The FT infrastructure is based on replication groups. These groups can be defined

as a set of cooperating peers that have the common goal of providing reliability to a

high-level service. In current middleware systems, FT support is implemented through

a set of high-level services that use the underlying primitives, for example, TAO [3].

Our approach makes a fundamental shift from this principle by embedding FT support

in the overlay layer.

The integration of FT in the overlay reduces the overhead of cross-layering that is


associated with the use of high-level services. Furthermore, this approach also enables

the runtime to make decisions on the placement of replicas that are aware of the overlay

topology. This awareness allows for a better trade-off between the target reliability and resource usage.

The FT service is responsible for the creation and removal of replication groups. However, the management of a replication group is self-contained, that is, the FT service delegates all the logistics to the replication group itself. This allows further extensibility of the replication infrastructure and the coexistence of multiple replication strategies inside the FT service, so that each service can use the replication policy that best meets its requirements.
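This delegation could, for instance, be expressed by interfaces along the following lines; the names (ReplicationGroup, FTService::createGroup) are illustrative assumptions, not the runtime's actual API.

#include <memory>
#include <string>

enum class ReplicationPolicy { SemiActive, Passive };

class ReplicationGroup {
public:
    virtual void addReplica(const std::string& peerUUID) = 0;     // join a new replica
    virtual void removeReplica(const std::string& peerUUID) = 0;  // drop a (failed) replica
    virtual ReplicationPolicy policy() const = 0;
    virtual ~ReplicationGroup() = default;
};

class FTService {
public:
    // The FT service only creates and removes groups; each group runs its own
    // replication protocol, so different policies can coexist side by side.
    virtual std::shared_ptr<ReplicationGroup>
    createGroup(const std::string& serviceSID, ReplicationPolicy policy, int nReplicas) = 0;
    virtual ~FTService() = default;
};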

The assumptions made in the design of each service limit the type of fault-tolerance

policies that can be used. For example, if a service needs to maintain a high level of availability, then it should use active replication [78] in order to minimize recovery time.

For these reasons, we designed an architecture that provides a flexible framework, where

different fault-tolerance policies can be implemented. In Chapter 4 we provide an

example of a FT implementation.

3.1.4 Support Framework

Our target system has different RT requirements for different tasks; for example, a critical event is the highest-priority traffic present in the system and is highly sensitive to latency. To ensure that the 2 second deadline is met, it is necessary to reserve enough CPU to process the events and, at the same time, to employ a suitable threading strategy that minimizes latency (at the expense of throughput), such as Thread-per-Connection [12].

The support framework provides the necessary infrastructure to address these issues

by offering a set of packages that provide high level abstractions for different threading

strategies, network communication and QoS management, in particular the mechanisms

for resource reservation. Figure 3.9 shows the components of the support framework.

It introduces three key aspects: (a) provides a novel and extensible infrastructure for

resource reservation and QoS; (b) introduces a novel design pattern for multi-core

computing; and (c) provides an extensible monitoring facility. Before delving into the details of these components, we first present the framework's package layout. In an effort to improve maintainability, the framework uses a package-like schema, with the following layout:


Figure 3.9: Support framework layer.

• common - this package includes support for integer conversion, backtrace (for

debugging), state management, synchronization primitives, and exception han-

dling;

• network - this package has support for networking, namely stream and datagram

sockets, packet oriented sockets, low level network utilities, and request support;

• event - this package implements the event interface, a fundamental component

for network oriented programming.

• qos - this package implements the resource reservation infrastructure, namely

the QoS daemon and client, as well the QoS primitives which are used by the

threading package, such as scheduling information.

• serialization - this package includes a serialization interface support and provides

a default serialization implementation;

• threading - this package offers several scheduling strategies, including: Leader-

Followers [11], Thread-Pool [114], Thread-per-Connection [12] and Thread-per-

Request [13]. All of these strategies are implemented using the Execution Model

- Execution Context design pattern;

• tools - the tools package includes the loader and the monitoring sub-packages,

which contain a load injector and a resource monitoring daemon, respectively.

The most prominent package in the framework is the resource reservation and QoS

infrastructure. It provides the low-level support that is necessary for the integration

of RT and FT into the middleware’s runtime. Next, we present an overview of the


inner workings of each of the components and reason about their implications for several aspects of a real-time fault-tolerant middleware.

The QoS and Resource Reservation Infrastructure

One of the key aspects of real-time systems is the ability to fulfill an SLA even in the presence of an adverse environment. Adversities can be caused by system overload,

bugs, or malicious attacks, and can occur in the form of rogue services, device drivers,

or kernel modules.

The only viable solution to provide deterministic behavior is to isolate the various components present in the system. This type of containment can be achieved either by using a virtual machine, which obviously only works for user-space applications/services, or by using the low-level infrastructure provided by the underlying operating system, such as Control Groups [115] or Zones [116]. Control

Groups is a modular and extensible resource management facility provided by the

Linux kernel, while Zones is a similar but less powerful implementation for the Solaris

operating system.

These types of mechanisms are normally associated with static provisioning and left to

system administrators to manage. This, clearly, is not a suitable approach to complex

and dynamic environments that are the focus of this work. To overcome this limitation

we designed and implemented a novel QoS daemon that manages the available resources

in the Linux operating system.

The goal of the QoS daemon is to provide an admission control and management

facility that governs the underlying Control Groups infrastructure. There are four main

QoS subsystems: CPU, I/O, memory and network. At this time, we have only fully

implemented the CPU subsystem. The remaining subsystems have only preliminary support.

All the subsystems supported by Control Groups follow a hierarchical tree approach

to the distribution of their resources (Figure 3.10). Each node of the tree represents

a group that contains a set of threads that share the available resources of the group,

for example if a CPU group has 50% of the CPU resources, then all the threads of the

group share those resources. As usual, the distribution of the CPU time among the

threads is performed by the underlying CPU scheduler.

CPU subsystem

We define three types of threads: (a) best-effort threads, which do not have real-time requirements and are expected to run as soon as possible but without any deadline constraints; (b) soft real-time threads, which have a defined deadline but whose deadline misses do not produce system failures, and; (c) isolated soft real-time threads, which are placed on isolated core(s) in order to prevent entanglement with other activities of the operating system (interrupt handling from other cores, network multi-queue processing, etc.), resulting in less latency and jitter and thus providing a better assurance on the fulfillment of deadlines.

However, there is another type of thread that is not currently supported by the middleware: hard real-time threads. A failure to fulfill the deadline of one of these threads could result in a catastrophic failure, and such threads are normally associated with critical systems such as railway signalling or avionics. Ongoing work on EDF scheduling

seems to offer a solid way to provide hard real-time support in Linux [117, 118]. A

recent validation seems to confirm our beliefs [119]. We plan to extend our support to

accommodate threads that are governed by deadlines instead of priorities.

To simplify the explanation of the CPU subsystem, we describe it as one entity although, in reality, it is composed of two closely related groups: CPU Partitioning (also known as cpusets) and Real-Time Group Scheduling. The first group is responsible for providing isolation, commonly known as shielding, of subsets of the available cores, while the second group provides resource reservation guarantees to RT threads, that is, it controls the amount of CPU assigned to each reservation.

Figure 3.10: QoS daemon resource distribution layout.


Figure 3.10 illustrates a possible resource reservation schema. The nodes with RA and

RB represent the two runtimes present in the same physical host, while S1 and S2

represent services running under these two runtimes. The node labeled OS represents the resources allocated to the operating system, while the P2P node represents

the resources allocated to the overlay. For the sake of clarity, we do not present the

distribution of the overlay’s resources among its services. Each of the runtimes has

to request a provision of resources for later distribution among its services. Later, in

Chapter 5, we present the results that assess the potential of this approach.
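As an illustration of how one of these reservations could be materialized on a Linux host, the minimal sketch below creates a child group, shields it to two cores and grants it 20% of RT CPU time. It assumes a cgroup-v1 style hierarchy with the cpuset and cpu controllers co-mounted at a hypothetical path; the real QoS daemon manages this hierarchy itself.

#include <fstream>
#include <string>
#include <sys/stat.h>

static void writeValue(const std::string& path, const std::string& value) {
    std::ofstream f(path);  // cgroup attributes are exposed as plain files
    f << value;
}

int main() {
    const std::string grp = "/sys/fs/cgroup/cpu_rt/runtimeA";  // hypothetical group path
    mkdir(grp.c_str(), 0755);                // creating the directory creates the group

    // Shielding: pin the group's threads to cores 2-3 (cpuset controller).
    writeValue(grp + "/cpuset.cpus", "2-3");
    writeValue(grp + "/cpuset.mems", "0");

    // RT group scheduling: 200 ms of RT execution per 1 s period (20% of a core).
    writeValue(grp + "/cpu.rt_period_us", "1000000");
    writeValue(grp + "/cpu.rt_runtime_us", "200000");

    // Threads are placed in the group by writing their TIDs to the tasks file.
    writeValue(grp + "/tasks", "12345");     // hypothetical thread id
    return 0;
}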

I/O subsystem

Although not fully implemented, we have left preliminary support for the I/O subsystem, which is responsible for managing the I/O bandwidth of each device individually. The I/O reservation

can be accomplished either by specifying weights, or by specifying read and write

bandwidth limits and operations per second (IOPS).

When using weights to perform I/O reservation, groups with greater weights have more

I/O time quantum from the I/O scheduler. This approach is used for best effort

scenarios, which do not suit our purposes. In order to provide real-time behavior,

it is necessary to enforce I/O usage limits on both bandwidth and IOPS. Services that

manage large streams of information, such as video streaming, do not issue a high

number of I/O operations, but instead need a high amount of bandwidth. However,

low-latency data-centric services like Database Management Systems (DBMS) [120] or Data Stream Management Systems (DSMS) [121, 122] exhibit the opposite behavior: they do not need a high amount of bandwidth, but instead require a high number of IOPS.
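For illustration, the following minimal sketch shows how such bandwidth and IOPS caps could be expressed with the blkio controller of a cgroup-v1 style hierarchy; the mount point, group name and device numbers are assumptions.

#include <fstream>
#include <string>
#include <sys/stat.h>

static void writeValue(const std::string& path, const std::string& value) {
    std::ofstream f(path);
    f << value;
}

int main() {
    const std::string grp = "/sys/fs/cgroup/blkio/serviceS1";  // hypothetical group path
    mkdir(grp.c_str(), 0755);

    // "8:0" is the major:minor number of the target block device (e.g. /dev/sda).
    writeValue(grp + "/blkio.throttle.read_bps_device",  "8:0 52428800");  // 50 MiB/s
    writeValue(grp + "/blkio.throttle.write_bps_device", "8:0 52428800");  // 50 MiB/s
    writeValue(grp + "/blkio.throttle.read_iops_device", "8:0 2000");      // 2000 IOPS
    return 0;
}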

I/O contention can be caused by a high-consumption service that starves other services in the system, either by depleting the I/O bandwidth or by saturating the device with an overwhelming number of I/O requests that exceeds its operational capabilities, such as the length of the request queue.

The progressive introduction of Solid State Disk (SSD) technology alongside traditional storage devices, such as hard drives, is reshaping the approach taken to this type of resource [123]. These new devices are capable of unprecedented levels of performance, especially in terms of latency, where they are able to offer a hundredfold reduction in access times. The elimination of mechanical components allows SSDs to offer

low-latency read/write operations and deterministic behavior. An evaluation of these

features is left for future work.


Memory subsystem

A substantial number of system faults are caused by memory depletion, normally

associated with bugs, ill-defined applications, and system overuse. When an operating

system reaches a critical level of free memory, it tries to free all non-essential memory, such as program caches. If this is not sufficient, then a fail-over mechanism is started.

In the Linux operating system, this mechanism consists of randomly killing processes in

order to release allocated memory, in an effort to prevent the inevitable system crash.

The runtime ensures that it has access to the memory it needs throughout its life-cycle by requesting a static provision from the memory subsystem. In the memory subsystem, each group reserves a portion of the total system memory, following a hierarchical distribution model, allowing the runtime to further distribute the provisioned memory among the different components, such as the P2P overlay layer and the user services.
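A minimal sketch of such a static, hierarchical memory provision on top of a cgroup-v1 style memory controller is shown below; the mount point, group names and sizes are assumptions.

#include <fstream>
#include <string>
#include <sys/stat.h>

static void writeValue(const std::string& path, const std::string& value) {
    std::ofstream f(path);
    f << value;
}

int main() {
    // A parent group for the runtime and a child group for one of its services,
    // mirroring the hierarchical distribution model described above.
    const std::string runtimeGrp = "/sys/fs/cgroup/memory/runtimeA";
    const std::string serviceGrp = runtimeGrp + "/serviceS1";
    mkdir(runtimeGrp.c_str(), 0755);
    mkdir(serviceGrp.c_str(), 0755);

    writeValue(runtimeGrp + "/memory.limit_in_bytes", "536870912");  // 512 MiB for the runtime
    writeValue(serviceGrp + "/memory.limit_in_bytes", "134217728");  // 128 MiB slice for S1
    return 0;
}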

Network subsystem

Each group in the network subsystem tags the packets generated by its threads with an

ID, allowing the tc (Linux traffic controller) to identify packets of a particular group.

With this mapping it is possible to associate different priorities and scheduling policies

to different groups.
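For illustration, the following minimal sketch tags a (hypothetical) service group with a classid that a tc filter on the egress interface could later match; the group name and class handle are assumptions, using a cgroup-v1 style net_cls controller.

#include <fstream>
#include <string>
#include <sys/stat.h>

int main() {
    const std::string grp = "/sys/fs/cgroup/net_cls/serviceS1";  // hypothetical group path
    mkdir(grp.c_str(), 0755);

    // 0x00100001 encodes tc class 10:1; a filter such as "tc filter ... cgroup"
    // on the egress interface can then map this group's packets to that class.
    std::ofstream cls(grp + "/net_cls.classid");
    cls << "0x00100001";
    return 0;
}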

This approach deals with the local aspects of network reservation, that is the sending

and receiving on the local network interfaces, but this is not sufficient to guarantee end-

to-end network QoS. In order to provide this, all the hops, such as routers, between

the two peers must accept and enforce the target QoS reservation. An example of an end-to-end QoS reservation is depicted in Figure 3.11.

Figure 3.11: End-to-end network reservation.

In the future, we intend to provide an end-to-end QoS signaling protocol capable of providing QoS tunnels across a segment of a network, using a protocol such as RSVP [64] or NSIS [124].

Monitoring Infrastructure

The monitoring infrastructure audits the resource usage of the underlying OS, such as

CPU, memory, storage, etc. The monitoring data is gathered using the information exposed by the /proc pseudo-filesystem. For this to work, the Linux kernel must be configured to expose this information.


The main goal of the infrastructure is to provide a resource usage histogram (currently it

supports CPU and memory) that can be used for both off-line (log audit) and real-time

analysis. The log analysis is helpful in detecting abnormal behaviors that are normally

caused by bugs (such as memory leaks).
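As a small illustration of this approach, the sketch below samples the resident memory of a process from /proc; the choice of /proc/self/status is just an example of the kind of file the monitor reads.

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Sample the resident set size of this process; for another process the
    // monitor would read /proc/<pid>/status instead.
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("VmRSS:", 0) == 0) {  // resident set size, in kB
            std::cout << line << std::endl;
        }
    }
    return 0;
}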

Currently, we use a reactive fault-detection model that only acts after a fault has

occurred. With a real-time monitoring infrastructure it is possible to evolve to a

more efficient proactive fault-detection model. Using a proactive approach, the runtime

could predict imminent faults and take actions to eliminate, or at least minimize, the

consequences of such events. For example, if a runtime detects that its storage unit, such as a hard drive, is exhibiting an increasing number of bad blocks, it could decide to

migrate its services to other nodes in the overlay.

3.1.5 Operating System Interface

Our target systems can be supported by a collection of heterogeneous machines with different operating systems, so it was crucial to develop a portable runtime implementation. Additionally, fine-grained control over all the resources available in the

system is paramount to achieve real-time support. For example, in order to maintain a

highly critical surveillance feed, the middleware must be able to reserve (provision) the

necessary CPU time to process the video frames within a predefined deadline.

To meet this goal, we chose to control and monitor the underlying resources from userspace (shown in Figure 3.12), avoiding the use of specialized kernel modules. To complement this approach, we use ACE [111], a portable network framework that offers a common API abstracting the low-level system calls offered by the different operating systems, namely thread handling (including priorities), networking and I/O. Furthermore, ACE also provides several high-level design patterns, such as the reactor/connector design pattern, that enable the development of modular systems capable of offering high levels of performance.

The resource reservation mechanisms, including CPU partitioning, are not covered by

any of the Portable Operating System Interface (POSIX) standards, so there is no

common API to access them. The Linux operating system, on which our current implementation is based, provides access to the low-level resource reservation mechanisms,

via the Control Groups infrastructure, through the manipulation of a set of exposed

files in the /proc pseudo-filesystem.

Nevertheless, low-level RT support in Linux is not provided out-of-the-box. A careful


Figure 3.12: Operating system interface.

selection of the kernel version and proper configuration must be used. An initial

evaluation was performed for kernel 2.6.33 with the rt-preempt patch (usually referred to as kernel 2.6.33-rt), but its support for Control Groups revealed several issues, resulting

in unstable systems.

A second kernel version was evaluated, the kernel 2.6.39-git12, which already supports

almost every feature present in the rt-preempt patch and provides flawless support for

Control Groups.

The Linux kernel supports a wide range of parameters that can be adjusted. However,

only a small subset had a significant impact on the overall system performance and stability under RT, most notably:

• DynTicks - dynamic ticks replace the former static timer tick (usually 100, 250, 500 or 1000 Hz), allowing for a significant power reduction and, more importantly, a reduction of kernel latencies;

• Memory allocator - the two most relevant are the SLAB [125] and SLUB [126]

memory allocators. They both manage caches of objects, thus allowing for efficient

allocations. SLUB is an evolution of SLAB, offering a more efficient and scalable

implementation that reduces queuing and general overhead;

• RCU - the Read-Copy Update [127] is a synchronization mechanism that allows

reads to be performed concurrently with updates. Kernel 2.6.39-git12 offers a

novel RCU feature, the “RCU preemption priority boosting” [128]. This feature

enables a task that wants to synchronize with RCU to boost the priority of all sleeping readers to match the caller's priority.


3.2 Programming Model

The access to the runtime capabilities is safeguarded by a set of interfaces. The main

purpose of these interfaces is to provide a disciplined access to resources while providing

interoperability between the runtime and services that are not collocated within the

same memory address space. Furthermore, they also allow a better modularization of the components of the runtime. Figure 3.13 shows the interactions between the components

of the architecture through these interfaces.

Figure 3.13: Interactions between layers.

User applications and services access the runtime through the Runtime Interface. The

direct control of the overlay is restricted to the core of the runtime. The access to the

P2P overlay, for both services and users, is only allowed through the Overlay Interface

(described in Section 3.2.2) that is accessible from the Runtime Interface. Likewise, an overlay's access to the core of the runtime is restricted, being mediated by the Core Interface (described in Section 3.2.3), avoiding malicious use of runtime resources by overlay plugins.

3.2.1 Runtime Interface

The Runtime Interface is the main interface available to users and services; it provides proxy support, allowing them to interact with runtimes that are not in the same address space through an Inter-Process Communication (IPC) mechanism. While multiple runtimes can exist on a single host, this results in redundant resource consumption. Our approach allows the number of coexisting runtimes to be reduced, resulting in lower resource consumption.

Figure 3.14 shows the access to the runtime from different processes. The runtime is

initially bootstrapped in process 1. A virtualized service that uses the Kernel Virtual-

Machine (KVM) hypervisor is contained in process 2. Process 3 shows an additional user and service using the runtime of process 1. The support for additional languages

was also considered in the design of the architecture. Processes 4 and 5 show services

running inside a Java Virtual Machine (JVM) and a .NET Virtual Machine, respectively.

While we only show one runtime in this example, the QoS daemon, allocated in process

6, is able to support multiple runtimes.

Figure 3.14: Multiple processes runtime usage.

The operations supported by the Runtime Interface are the following: (a) bootstrap

new runtimes; (b) access previously bootstrapped runtimes; (c) start and stop services,

both local and remote; (d) attach new overlay plugins on-the-fly; (e) allow access to the

overlay, through the Overlay Interface, and; (f) create clients to interact with service

instances.

3.2.2 Overlay Interface

The main goal of the Overlay Interface is to provide disciplined access to a subset of the underlying overlay infrastructure while meeting performance goals, for example by avoiding lengthy code paths that can turn into hot paths, and while enforcing proper isolation, thus preventing the misuse of shared resources by rogue or misbehaving services or applications.

The Overlay Interface provides access to the overlay Mesh and Discovery services. These services and the overall overlay architecture were described in Section 3.1.3. We plan to extend the architecture to provide access to the underlying FT service, to allow the dynamic manipulation of the replication policy used in a replication group. However, as different replication policies have different resource requirements, it is necessary to provide additional support for dynamic changes to resource reservation assignments, that is, increasing or decreasing the amount of resources associated with a reservation. To overcome this, we plan to enhance our QoS daemon to provide the necessary support.

3.2.3 Core Interface

Every overlay implementation has to interact with the core. This interaction is mediated

through the Core Interface that is only accessible to the overlay plugin.

The operations supported by the Core Interface are the following: (a) start and stop

local services; (b) create replicas for the fault-tolerance service, and; (c) retrieve infor-

mation about service instances and resource availability.
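For illustration, these operations could be captured by an abstract interface along the following lines; apart from the Core Interface concept itself, the type and method names are assumptions made for this sketch.

#include <string>
#include <vector>

struct ResourceInfo { double cpuAvailable; long memAvailableBytes; };

class CoreInterface {
public:
    // (a) start and stop local services on behalf of remote requests
    virtual void startLocalService(const std::string& sid, std::string& iidOut) = 0;
    virtual void stopLocalService(const std::string& iid) = 0;
    // (b) create a replica for the fault-tolerance service
    virtual void createReplica(const std::string& sid, const std::string& groupId) = 0;
    // (c) information used by the discovery service to answer queries
    virtual std::vector<std::string> localServiceInstances() const = 0;
    virtual ResourceInfo resourceAvailability() const = 0;
    virtual ~CoreInterface() = default;
};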

The creation and destruction of local services are issued by the Mesh service upon the

reception of requests from remote peers. These requests are then redirected to the core

of the runtime by the Core Interface. In the case of the creation of a new service, the core requests the Service Manager to create a new service instance and makes the necessary QoS resource reservations with the QoS daemon, through the QoS client. On the other hand, when destroying a service instance, the core simply requests the removal of the instance from the Service Manager.

The creation and removal of replicas are issued by the FT service upon the reception of requests from a replication group, normally from the coordinator of the replication group, although this is implementation dependent. As with the previous case, the request for the creation or removal of a replica is handled by the core, after being redirected by the Core Interface. In the case of the removal of a replica, the core forwards the request to the proper replication group through the fault-tolerance service. On the other hand, in the case of the creation of a new replica, the core makes the QoS resource reservations that are needed to maintain both the service instance (that will


act as a replica of the primary service instance) and the replication group, that is,

the infrastructure necessary to enforce the replication mechanisms. The retrieval of

information about service instances and resource availability is used by the Discovery

service in response to queries.

3.3 Fundamental Runtime Operations

The runtime manages resources, services and clients. Its main operations are: the initial

runtime creation and corresponding bootstrap; creation of local and remote services with

and without fault-tolerance, and; creation of clients for user services.

3.3.1 Runtime Creation and Bootstrapping

The creation and initialization, normally designated as bootstrapping, of the runtime involves a three-phase process, as shown in Figure 3.15.

Figure 3.15: Creating and bootstrapping of a runtime.

The creation of the middleware, shown in Figure 3.15a, is accomplished by the user through

the Runtime Interface. At this point, the runtime does not have an active overlay

infrastructure. The user is responsible for choosing a suitable overlay implementation

(plugin) and for attaching it to the runtime (shown in Figure 3.15b). In the final


phase, the user bootstraps the newly created runtime (depicted in Figure 3.15c). This

bootstrapping process is governed by the core. If the runtime is configured to use

QoS reservation then the core connects to the QoS daemon and reserves the necessary

resources. Otherwise, step 2 is omitted, and no interaction is made with the QoS

daemon.

Listing 3.1: Overlay plugin and runtime bootstrap.

1 RuntimeInterface* runtime = 0;
2 try {
3     runtime = RuntimeInterface::createRuntime();
4     Overlay* overlay = createOverlay();
5     runtime->attachOverlay(overlay);
6     runtime->start(args);
7 } catch (RuntimeException& ex) {
8     Log("Runtime creation failed"); // handle error
9 }

The code snippet necessary to create and bootstrap a runtime is shown in Listing 3.1. Line 3 shows the creation of the runtime, as previously illustrated in Figure 3.15a. At this time, only the basic infrastructure is created and the runtime is still not bootstrapped. This is followed by the creation of the chosen overlay implementation, which is then attached to the runtime, in lines 4 and 5 (corresponding to the illustration of Figure 3.15b). Finally, the whole process is completed, in line 6, with the bootstrap of the runtime, which implicitly bootstraps the overlay (as shown in Figure 3.15c).

3.3.2 Service Infrastructure

The life cycle of a service starts with its creation and terminates with its destruction.

The service infrastructure provides the user with such mechanisms. This section starts

with an in-depth view of the local creation of services, as first introduced in Section 3.1.1. Then follows a detailed view of the mechanisms that regulate the creation of

remote services with and without FT support. It concludes with the complete outline

of the service deployment mechanisms.

Local Service Creation

The steps involved in instantiating a new local service are depicted in Figure 3.16. The

user, through the Runtime Interface, requests the creation (and bootstrap) of a new

local service instance (step 1). The core of the runtime redirects the request to the

service manager for further handling (step 2). The first step to be taken by the service


manager is to determine if the service implementation is known. If the service is not

known, then the core tries to find the respective implementation using the discovery

service in the overlay (step omitted). If the implementation is found, then it is transferred back to the requesting peer and the service creation can continue. Otherwise, the creation of

the service is aborted.

Figure 3.16: Local service creation.

If the runtime was bootstrapped with resource reservation enabled then, once the service implementation is retrieved, it is possible to retrieve its QoS requirements. Knowing these requirements, the runtime tries to allocate them through a QoS client (shown as dashed lines in steps 3 and 4). If the requested resources are available then the service is instantiated, otherwise the service creation is aborted. If the resources are available but the service does not successfully start, all the associated resource reservations are released.

If, on the other hand, the runtime does not have the resource reservation infrastructure enabled, then once the service implementation is known and retrieved, the core can immediately instantiate a local service instance.

Listing 3.2 shows the code snippet necessary to bootstrap a new local service instance.

Line 1 shows the initialization of the service parameters that are wrapped by a smart

pointer variable, allowing for a safe manipulation by the runtime. The actual service

creation is done in line 4 and is performed by the startService() method that

takes the following parameters: the SID of the service to be created; the service

parameters, and; the peer where the service is to be launched, which in this case is the

Universally Unique Identifier (UUID) of the local runtime. Upon the successful creation

of the service instance, the parameter iid of the call startService() will contain its

instance identifier.

Listing 3.2: Transparent service creation.


1 ServiceParamsPtr paramsPtr(new ServiceParams(sid));
2 try {
3     UUIDPtr iid;
4     runtime->startService(sid, paramsPtr, runtime->getUUID(), iid);
5 } catch (ServiceException& ex) {
6     Log("Service creation failed"); // handle error
7 }

Remote Service Creation

There are two distinct approaches to creating remote services. A user can either explicitly specify the peer to host the service or, alternatively, leave the decision of finding a suitable place for hosting the service to the middleware. This last approach is the

default way to bootstrap services.

Figure 3.17: Finding a suitable deployment site.

Figure 3.17 shows the mechanism associated with the search for a suitable place to

deploy a new service instance within a hierarchical mesh overlay, where each level of the tree is maintained by a cell. Cells are logical constructions that maintain portions of

the overlay space and provide mesh resilience.

The requesting peer uses the discovery service of the overlay to perform a Place of

Launch (PoL) query. This query retrieves the information about a suitable hosting

peer. However, as previously stated, the resolution of the query is totally dependent on the overlay implementation. In the example provided by Figure 3.17, the query issued

by peer A is relayed until it reaches peer C. This peer is able to satisfy the query and

replies back to peer B that in turn replies back to peer A. After receiving the query reply,

indicating peer D as the deployment site, peer A requests a remote service creation at

peer D. We describe an implementation for this mechanism in Chapter 4.


Figure 3.18: Remote service creation without fault-tolerance.

In order to create a remote service (Figure 3.18), the user at peer A makes the request through the Runtime Interface (steps 1 and 2). The core of the runtime uses its mesh service to request the creation of the desired service from the remote peer (steps 3 and 4). The mesh service of the remote peer, after receiving the request for the creation of a new service instance, uses the Core Interface (step 5) to redirect the request to the core of its runtime (step 6). At this point, the remote peer uses the previously described procedure for local service creation (Figure 3.16). The dashed lines represent the optional use of resource reservation.

The code snippet shown in Listing 3.3 creates two remote service instances, one using explicit deployment and the other using transparent deployment. Line 1 shows the initialization of the service parameters that are used in the creation of both service instances. Line 5 shows the creation of a remote service instance using explicit deployment. The

remote peer that will host the instance is given by the remotePeerUUID variable. Line

7 shows the creation of a remote service instance when using transparent deployment.

Upon the successful creation of the service instance, the last parameter used in the

call to startService() contains the instance identifier for the newly created service

instance.

Listing 3.3: Service creation with explicit and transparent deployments.

 1 ServiceParamsPtr paramsPtr(new ServiceParams(sid));
 2 try {
 3     UUIDPtr explicitIID, transparentIID;
 4     // explicit deployment
 5     runtime->startService(sid, paramsPtr, remotePeerUUID, explicitIID);
 6     // or, transparent deployment
 7     runtime->startService(sid, paramsPtr, transparentIID);
 8 } catch (ServiceException& ex) {
 9     Log("Service creation failed"); // handle error
10 }

Remote Service Creation With Fault-Tolerance

When creating a remote service with fault-tolerance (Figure 3.19), in response to a

request from another peer (steps 1 to 4), the remote peer acts as the main instance,

also known as the primary node, for that service (steps 5 to 8). Before being able to

instantiate the service, the primary node first has to find the placement for the requested number of replicas (step omitted). This process is delegated to and governed by the FT service (step 9). As before, the dashed lines (step 8) represent optional paths, if using

resource reservation.

Figure 3.19: Remote service creation with fault-tolerance: primary-node side.

The fault-tolerance service, using its underlying mechanisms, which are dependent on the implementation of the overlay, tries to find the optimal placements on the mesh to instantiate the needed replicas. In a typical implementation this is normally accomplished through the use of the discovery service. Depending on the overlay topology, finding the optimal placement can be intractable, as in ad-hoc topologies, so systems often implement more structured topologies or heuristics.

Given the modularity of the architecture, it is possible to configure for each service the

type of fault-tolerance strategy to be used, such as semi-active or passive replication,

allowing a better fit to the service's needs.


The primary node, using the FT service, creates the replication group that will support

replication for the service. To create the replication group, the FT service uses the

placement information to create replicas for the group.

Figure 3.20: Remote service creation with fault-tolerance: replica creation.

The process of creating a new replica is shown in Figure 3.20. After receiving the

request to join the replication group through the FT service (steps 1 to 2), the replica

proceeds as previously described for the local service creation (steps 3 to 6). We describe

the algorithms that materialize this behavior for different types of replication policies in Chapter 4.

Listing 3.4: Service creation with Fault-Tolerance support.

1 FTServiceParams* ftParamsPtr = createFTParams(nbrOfReplicas, FT::SEMI_ACTIVE_REPLICATION);
2 ServiceParamsPtr paramsPtr(new ServiceParams(sid, ftParamsPtr));
3 try {
4     UUIDPtr iid;
5     runtime->startService(sid, paramsPtr, iid);
6 } catch (ServiceException& ex) {
7     Log("Service creation failed"); // handle error
8 }

Listing 3.4 shows the code snippet that is necessary to bootstrap a remote service

with FT support. Line 1 shows the initialization of the FT parameters with a total of

nbrOfReplicas replicas and using semi-active replication. The actual service creation

is done in line 5. Upon the successful creation of the service instance, the parameter iid

of the call startService() will contain its instance identifier, and sid the system-

wide identifier for the service.


3.3.3 Client Mechanisms

The interactions between a user and a service instance are supported by a client.

A client is a proxy between the user and a service instance that is responsible for

handling all the underlying communication and resource reservation mechanisms. The

runtime provides a flexible infrastructure that does not impose any type of architectural

restrictions on either the design of a client or the type of interaction that can take place.

Figure 3.21 shows the creation and bootstrap sequence of a client.

Figure 3.21: Client creation and bootstrap sequence.

The creation of a client, shown in Figure 3.21a, starts with the user requesting a new

client through the Runtime Interface. Upon receiving the client creation request, the

core of the runtime uses the service factory to check if the service implementation is

known. If it is known then the core of the runtime returns a new client to the user from

the service implementation, otherwise the creation of the client is aborted.

After retrieving the client, the user must find a suitable service instance to connect to (shown in Figure 3.21b). After retrieving the Overlay Interface through the Runtime Interface, the user uses the discovery service to search for a suitable instance (the calling path is identified as 1). This is followed by the reply from the discovery service that is returned to the user (calling path identified as 2), in this case indicating peer B as the owner of a service instance.

If the user wishes to use resource reservation, then it must use the underlying resource


reservation infrastructure. This optional step is shown as a dashed line (calling path

identified as 3). To finish the bootstrap sequence, the user must use the information

about the service instance that was returned by the discovery service and connect to

the service (step 4).

Listing 3.5 shows the code snippet necessary to create a client, using the RPC service

as an example.

Listing 3.5: Service client creation.

1 try {
2     ClientParamsPtr paramsPtr(new ClientParams(QoS::RT, CPUQoS::MAX_RT_PRIO));
3     ServiceClient* client = runtime->getClient(sid, iid, paramsPtr);
4     RPCServiceClient* rpcClient = static_cast<RPCServiceClient*>(client);
5     RPCTestObjectClient* rpcTestObjectClient = new RPCTestObjectClient(rpcClient);
6     rpcTestObjectClient->ping();
7 } catch (ServiceException& ex) {
8     Log("Client creation failed"); // handle error
9 }

Prior to the actual creation of the client, the user must initialize the ClientParamsPtr parameter with the desired QoS properties. In line 2 of Listing 3.5, this parameter is

initialized to use the maximum RT priority. The actual creation of the client is done

in line 3. The runtime returns a generic ServiceClient pointer that must be down-

casted to the proper client implementation. In line 4, the generic pointer is converted to

a generic RPC client, that manages the low-level infrastructure that handles invocations

and replies. Line 5 shows the creation of the RPC “stub”, that is responsible for

marshaling requests and unmarshaling replies. Within the creation of the stub, the

general RPC client is attached to it. Line 6 shows an actual one-way RPC invocation

of a ping operation.

3.4 Summary

This chapter started by presenting the architecture of the runtime of a P2P middleware, providing an overview of all the layers that compose the runtime: the applications layer, which contains all the services and users that run on top of the middleware; the core layer, which is responsible for the overall management of the runtime; the overlay abstraction layer, which provides the abstractions for the low-level P2P services; the support framework, which provides a set of high-level abstractions for network communications and QoS management, and; the Linux/ACE layer, which provides an abstraction of the underlying Linux operating system through the ACE framework.

We then provided a detailed insight into the programming model, exposing the interfaces that must be used to access the runtime capabilities. Furthermore, we described the advantages of these programming interfaces, specifically their ability to provide modularity, interoperability and controlled access to runtime resources.

The chapter ended with an overview of the fundamental operations present in the

middleware, namely: runtime creation and bootstrap; local service creation; remote

service creation with and without FT, and; client creation.


–With great power comes great responsibility.

Voltaire

4 Implementation

This chapter presents the implementation details of the runtime, focusing on the under-

lying mechanisms that are present in the P2P services of our overlay implementation.

Additionally, we present three service implementations that showcase the runtime capabilities, more precisely, an RPC-like service, an actuator service and a streaming service.

4.1 Overlay Implementation

Figure 4.1: The peer-to-peer overlay architecture.

As a proof-of-concept for this prototype, we have chosen the P3 [15] topology, which follows a hierarchical tree P2P mesh. A representation of such a topology is shown in Figure 4.1. There are three different types of peers present in our implementation: peers, coordinator peers and leaf peers. The peers are responsible for maintaining the organization of the overlay and for providing access points to the overlay for leaf peers.

Each node in a P3 network corresponds to a cell, a set of peers that collaborate to

maintain a portion of the overlay. Cells are logical constructions that provide overlay

resilience and are central in our implementation of fault-tolerance mechanisms. Each

cell is coordinated by one peer, denominated the coordinator peer. Every other peer

in the cell is connected to the coordinator, allowing for efficient group communication.

If the coordinator fails, one of the peers in the cell takes its place and becomes the

new coordinator. The communication between distinct cells is accomplished through

point-to-point connections (TCP/IP sockets) between the coordinators of the cells.

The last type of peer present in the overlay is known as a leaf peer. These peers do

not have any type of responsibilities in maintaining the mesh. Typically, they use the

overlay capabilities, for instance, to advertise the presence of a sensor or simply to act

as a client. This type of peer does not host any user service, but instead relies on the

overlay to host them.

The original P3 topology [15] follows a hierarchical organization that had a significant problem: when the coordinator of a cell crashes, it causes a cascading failure, with its children coordinators propagating the failure to the remaining sub-trees. We explored this problem in previous work [6], and concluded that it was directly linked to the rigid naming scheme of the P3 architecture. In case of a cell failure, the cell and its sub-trees would have to perform a complete rebind to the mesh, and thus had to contact the root node of the tree to find a new suitable position. This caused two obvious problems: the overhead (and time) of rebinding all the cells and the bottleneck at the root node.

To avoid these limitations, we modified the original P3 topology. The problems as-

sociated with the rigid naming scheme of P3 were avoided through the design and implementation of a new fault-aware architecture. This type of architecture focuses on reducing the impact of faults, as it assumes that they happen frequently, taking special care to eliminate, or at least minimize, the occurrence of cascading failures. To

achieve this, the middleware introduces a new flexible naming scheme, that removes all

inter-dependencies between cells, and therefore allows the migration of entire sub-trees

between different portions of the tree.

The developer, however, is free to implement any type of topology and behavior within an overlay implementation for the middleware; they only have to implement the Overlay

Interface. This interface is composed of three basic P2P services. The mesh service,

described in sub-section 4.1.2, handles all the management for the overlay. In a sense


it is the most fundamental service since it provides the infrastructure for all the other services. The discovery service, detailed in sub-section 4.1.3, supports the infrastructure for generic querying. Last, the FT service provides the infrastructure for the fault-tolerance mechanisms present in the overlay and is described in sub-section 4.1.4.

4.1.1 Overlay Bootstrap

The bootstrap of an overlay is requested by the core of the runtime on behalf of the

user. Figure 4.2 illustrates this bootstrap process. The overlay sequentially bootstraps the mesh (step 1), discovery (step 2) and fault-tolerance (step 3) services.

Figure 4.2: The overlay bootstrap.

The bootstrap process is implemented by the Overlay:start() procedure and it is

shown in Algorithm 4.1. This procedure starts the mesh, discovery and FT services.

The order in which the services are started is dictated by the dependencies between the services, as both the discovery and fault-tolerance services need the information

about SAPs of homologous services in neighbor peers, and this information is provided

by the mesh service.

Algorithm 4.1: Overlay bootstrap algorithm

1 procedure Overlay:start()
2     for service in [Mesh, Discovery, Fault-Tolerance] do
3         service.start()
4     end for
5 end procedure


4.1.2 Mesh Service

The mesh service is the central component in our overlay implementation. It acts as

an overlay manager and it is also responsible for the creation and removal of high-level

services from the overlay, as previously described in Chapter 3.

A mesh service must extend the Mesh Interface, but is free to implement any type

of organizational logic. Nevertheless, in a typical implementation, the mesh service

normally has a mesh discovery sub-service, responsible for providing a dynamic discov-

ery mechanism for peers in the overlay. The discovery service, described in Section 4.1.3, on the other hand, provides a generic infrastructure capable of handling high-level queries. It is not possible to use the generic discovery service to search for peers in the overlay because of the dependencies between the mesh and discovery services, as explained previously.

A possible implementation of this type of mesh discovery mechanism could be accomplished through the use of a well-known portal. This has the advantage of being simple to implement, but inherently represents both a bottleneck and a single point of failure.

To overcome these limitations, our overlay has a discovery mechanism, one in each

cell, that uses low-level multicast sockets to provide a distributed and efficient mesh

discovery implementation. Figure 4.3 provides an overview of the major components in

a cell. Each peer participating in a cell has a cell object that contains a cell discovery

object and a cell group object : the cell object provides a global view of the cell to the

local peer; the cell discovery object provides the support, through the use of multicast

sockets, for the cell discovery mechanism, and; the cell group object provides the group

communications within the cell.
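As an illustration of the low-level mechanism, the following minimal sketch joins a (hypothetical) cell discovery multicast group and sends a discovery request using plain POSIX sockets; the address, port and message format are assumptions, and the actual implementation goes through the ACE socket wrappers.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    const char* group = "239.255.0.1";   // hypothetical cell discovery address
    const uint16_t port = 9000;          // hypothetical discovery port

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    int reuse = 1;
    setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));

    sockaddr_in local{};
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(port);
    if (bind(sock, (sockaddr*)&local, sizeof(local)) < 0) { perror("bind"); return 1; }

    // Join the multicast group so requests and replies from other peers are received.
    ip_mreq mreq{};
    mreq.imr_multiaddr.s_addr = inet_addr(group);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

    // Send a (hypothetical) discovery request to the group.
    const char msg[] = "CELL_DISCOVERY_REQUEST";
    sockaddr_in dst{};
    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = inet_addr(group);
    dst.sin_port = htons(port);
    sendto(sock, msg, sizeof(msg), 0, (sockaddr*)&dst, sizeof(dst));

    close(sock);
    return 0;
}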

Figure 4.3: The cell overview.

Building and Membership

The membership mechanism allows a peer to join the peer-to-peer overlay (Figure 4.4).

The process starts with a request for a binding cell (step 1). This request has to be


made to the root cell, which in turn replies with a tuple comprising a suitable cell, its corresponding coordinator, and the parent cell and coordinator (if available). The next step is the active binding (step 2), which is further sub-divided into two possibilities (steps 2-a and 2-b).

The multicast address for root cell discovery is a static and well-known value. While this can be seen as a single point of failure, in the presence of a cell crash, that is, when all the peers in a cell have crashed, the root cell is replaced by one of its children cells that belong to the first level of the tree. The process behind failure handling and recovery is described further below.

Figure 4.4: The initial binding process for a new peer.

Upon receiving the reply, and if the returned cell exists (step 2-a), the joining peer

connects to the coordinator (step 3-a). Otherwise, if the cell is new, it becomes the

coordinator for the cell (step 2-b). If the target cell is not the root cell, and if the peer

is the coordinator of the cell, then it connects to the coordinator peer of its parent cell

(step 3-b).

To finalize the binding process, the peer has to formalize its membership by sending

a join message that is illustrated in Figure 4.5. At this point, the peer sends a join

message to its parent (step 1), if it is the coordinator of the cell, or sends the message


to the coordinator of the cell (step 1-a) that forwards it to its parent (step 1-b). This

message is propagated through the overlay until it reaches the root cell. It is the

responsibility of the root cell to validate the join request and to reply accordingly. The

reply is propagated through the overlay downwards to the joining peer (step 3). After

this, the peer is part of the overlay.

Figure 4.5: The final join process for a new peer.

The mesh construction algorithm is depicted in Algorithm 4.2. To enter the mesh, a new

peer calls the Mesh:start() procedure, which then creates a cell discovery object for

accessing the root cell discovery infrastructure (line 2), which has a well-known multicast

address. This is then used to request a cell to which the joining node will connect itself

by making a call to the cellRootDiscoveryObj.requestCell() procedure (shown

in Algorithm 4.6, lines 1-8). This procedure multicasts a discovery message that tries to find the peer-to-peer overlay. If it fails, then no peer is present in the root cell, and the call to the Cell:requestCell() procedure returns the information associated with

the root cell, more specifically, the well-known multicast address used for the root cell

discovery. Otherwise, the appropriate bind information is returned. Using this binding

information, a new cell object is created and initialized (lines 4-5).


Algorithm 4.2: Mesh startup

1 procedure Mesh:start()
2     cellRootDiscoveryObj ← Cell:createRootCellDiscovery()
3     bindInfo ← cellRootDiscoveryObj.requestCell()
4     cellObj ← Cell:createCellObject()
5     cellObj.start(bindInfo)
6 end procedure

Cell Bootstrap

The binding information returned by the cell discovery mechanism has all the informa-

tion needed for the cell initialization (as shown in Figure 4.4). In Algorithm 4.3, we show the algorithm that rules the behavior of a cell.

Algorithm 4.3: Cell initialization

var: this // The current cell object

 1 procedure Cell:start(bindInfo)
 2     bindingCellInfo ← bindInfo.getBindingCellInfo()
 3     if not bindingCellInfo.isCoordinator() then
 4         cellGroupObj ← Cell:bindToCoordinatorPeer(bindingCellInfo.getCoordInfo())
 5     else
 6         parentPeerInfo ← ∅
 7         if not bindingCellInfo.isRoot() then
 8             parentPeerInfo ← bindInfo.getParentCellCoordInfo()
 9         end if
10         cellGroupObj ← Cell:createCellGroup(parentPeerInfo)
11     end if
12     cellGroupObj.requestJoin()
13     cellDiscoveryAddr ← bindingCellInfo.getCellDiscoveryAddress()
14     cellDiscoveryObj ← Cell:createCellDiscovery(cellDiscoveryAddr)
15     this.attach(cellDiscoveryObj)
16 end procedure

The bootstrap of the cell object is performed using the Cell:start() procedure that

takes the bindInfo as its argument. This bootstrap process is dependent on the state of the target cell. The call to the bindingCellInfo.isCoordinator() method

indicates if we are the coordinator of this cell. If the peer is not the coordinator peer

(Figure 4.4, step 2-a) for the cell then it has to join the cell group by binding to the

cell group’s coordinator peer (line 4). On the other hand, if the peer is the coordinator

(Figure 4.4, step 2-b), then it checks if the cell is the root. If the peer is on the root cell,

then the bootstrap is finished, otherwise it must connect to its parent cell coordinator,

and link the newly created cell to its parent cell (lines 5-10).


Regardless of whether the newly arrived peer is a non-coordinator in a cell group, or a coordinator of a non-root cell, it must propagate its membership by using a join

message. In line 12, the call to cellGroupObj.requestJoin() initializes the process.

The join process is depicted in Figure 4.5, while the cell group communication is shown

in Figure 4.6.

Lines 13-15 show the creation of the cell discovery object that will be associated with

this cell, with the multicast address being provided by the bindingCellInfo. After

the creation, it is attached to the cell in line 15, enabling the cell to handle cell discovery

requests.

Cell State and Communications

When a peer is running inside a cell, it is either the coordinator or a non-coordinator peer providing redundancy to the coordinator. Any external peer that connects to the cell must connect through the coordinator peer. It is the coordinator's responsibility

to validate any incoming request. If the request is valid and accepted, the coordinator

sends the request to its parent (if applicable). After receiving the reply from its parent,

the coordinator updates the state of the cell by synchronizing with all the active peers.

This synchronization is done using our group communication infrastructure that is

shown in Figure 4.6.

The synchronization process inside a cell can be divided into two cases, depending on whether the synchronization is initiated by the coordinator or by a non-coordinator peer. When the synchronization is initiated by the coordinator peer, shown in Figure 4.6a, it starts by sending the request to its parent peer (step 1), which recursively forwards it towards the root cell (step 2). After the root cell is synchronized, that is, after the request has been sent to all active peers and their replies have been received, an acknowledgment message is sent downwards towards the originating cell (step 3). Upon receiving the acknowledgment from its parent, each coordinator peer repeats the same process, that is, it synchronizes its cell (steps 4 and 5) and sends an acknowledgment downwards (step 6). When the acknowledgment reaches the originating cell, the request is synchronized (steps 7 and 8).

The synchronization process can be performed either in parallel or sequentially. Although we do not provide benchmarks, we have done a preliminary empirical assessment of the optimal transmission strategy. Early testing shows that for a small number of peers, the best transmission strategy is to send the requests sequentially. However, for a larger number of peers, the best transmission strategy is to send them in parallel, using a pool of threads to perform the transmissions simultaneously.


(a) Synchronization initiated by the coordinator.

(b) Synchronization initiated by a follower.

Figure 4.6: Overview of the cell group communications.

This behavior can be explained by the overhead associated with enqueuing the send requests on multiple threads. However, as the number of peers increases, the cost of sending the requests sequentially surpasses the overhead of the parallel transmission.
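To make the two strategies concrete, the following C++ sketch contrasts a sequential loop with a parallel variant based on asynchronous tasks; the Peer type and its sendMessage() method are hypothetical placeholders for the middleware's peer client, not the actual API.

#include <future>
#include <string>
#include <vector>

// Hypothetical peer handle; stands in for the middleware's peer client object.
struct Peer {
    bool sendMessage(const std::string& /*msg*/) { return true; }  // placeholder blocking send
};

// Sequential strategy: one blocking send per peer, in order.
void sendSequential(std::vector<Peer>& peers, const std::string& msg) {
    for (auto& peer : peers)
        peer.sendMessage(msg);
}

// Parallel strategy: each send runs as its own asynchronous task and the
// caller waits for all of them, mirroring the thread-pool variant.
void sendParallel(std::vector<Peer>& peers, const std::string& msg) {
    std::vector<std::future<bool>> pending;
    pending.reserve(peers.size());
    for (auto& peer : peers)
        pending.push_back(std::async(std::launch::async,
                                     [&peer, &msg] { return peer.sendMessage(msg); }));
    for (auto& request : pending)
        request.get();  // wait for every transmission to complete
}

Under this layout, the crossover point between the two strategies depends on how the per-task dispatch overhead compares with the per-send latency.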


Figure 4.6b shows the communication steps required when the synchronization is initiated by a non-coordinator peer. Here, the peer must send the request to the coordinator peer (step 1). Upon receiving the request, the coordinator peer performs the same process as in Figure 4.6a. It starts by propagating the request towards the root cell (steps 2 and 3), with the respective acknowledgment being sent after the root cell synchronizes (step 4). All the coordinator peers that belong to the cells between the root cell and the originating cell synchronize the request within their cell after receiving the acknowledgment from their parent. When the acknowledgment reaches the originating cell, the coordinator peer spreads the request through the remaining active peers and waits for their replies (steps 8 and 9). Last, the coordinator peer sends an acknowledgment back to the originating peer (step 10).

The cell communication algorithms are shown in Algorithms 4.4 and 4.5, and they expose the previously described roles that are present in the architecture: the coordinator and non-coordinator roles.

Algorithm 4.4: Cell group communications: receiving-end

var: this // the current cell communication group object
var: cellObj // the cell object associated with the communication group
var: coordinatorPeer // the cell coordinator peer

 1  procedure CellGroup:coordinatorHandleMsg(peer, msg)
 2      if not msg.isAckMessage() then
 3          ackMessage ← cellObj.processMsg(msg)
 4          if not isRoot() then
 5              request ← this.getParentPeer().sendMessage(msg)
 6              request.waitForCompletion()
 7              if request.failed() then
 8                  this.handleParentFailure()
 9              end if
10          end if
11          this.sendMessage(msg)
12          peer.sendMessage(ackMessage)
13      else
14          this.updatePendingRequests(msg)
15      end if
16  end procedure

17  procedure CellGroup:nonCoordinatorHandleMsg(msg)
18      if not msg.isAckMessage() then
19          ackMessage ← cellObj.processMsg(msg)
20          coordinatorPeer.sendMessage(ackMessage)
21      else
22          this.updatePendingRequests(msg)
23      end if
24  end procedure


If a peer is the coordinator of the cell, then all the incoming messages (from the cell or from child cells) are processed by the CellGroup:coordinatorHandleMsg() procedure; otherwise, the CellGroup:nonCoordinatorHandleMsg() procedure is used to process the incoming messages.

In the CellGroup:coordinatorHandleMsg() procedure, the coordinator receives a new message and, if it is not an acknowledgment, processes it in line 3. After the message has been processed and validated by the coordinator (line 3), and if the coordinator does not belong to the root cell, it must forward the message to its parent cell coordinator and wait for the acknowledgment (lines 5-6), with the process recursively updating the cells until the root node is reached. If the synchronization with the parent fails, then the coordinator enters a recovery stage by executing the handleParentFailure() procedure (lines 7-9), which is detailed below. After synchronizing with its parent, the coordinator uses the CellGroup:sendMessage() procedure to send the message across the peers, thus synchronizing the state among all the active peers present in the cell (line 11). The last remaining step is to send back the reply message to the requesting peer (line 12). On the other hand, if the coordinator received an acknowledgment, then it updates any pending request (lines 13-15).

If the peer is not the coordinator of the cell, then all the incoming messages are processed by the CellGroup:nonCoordinatorHandleMsg() procedure. If the message is not an acknowledgment, then the cell object processes it and updates its internal state (line 19), reflecting the changes performed globally in the cell. After this update, an acknowledgment is sent back to the coordinator peer (line 20). Otherwise, the message received was an acknowledgment and is used to update any pending request (lines 21-23).

The CellGroup:sendMessage() procedure, in Algorithm 4.5, illustrates the process of sending a message within a cell. If a message is being sent by the coordinator (lines 2-15) but originated in another peer, then the coordinator removes that peer from the sending set (lines 3 and 4); this is illustrated by step 1 in Figure 4.6a and step 2 in Figure 4.6b. The message is sent to all the peers present in the set, with each pending request being stored in an auxiliary list (lines 5-9). The coordinator then waits for the completion of all the pending requests (line 10). For each request that failed, the coordinator removes the peer associated with that request from the list containing all the active peers (lines 11-15).

On the other hand, if the message is being sent by a non-coordinator peer, then it is

forwarded to the coordinator of the cell (line 17). After sending the message, the peer

waits for the acknowledgment from the coordinator (line 18). The synchronization is


Algorithm 4.5: Cell group communications: sending-end

var: this // the current cell group communications object
var: peers // the active, non-coordinator, peer client list
var: coordinatorPeer // the coordinator peer client

 1  procedure CellGroup:sendMessage(msg)
 2      if this.isLocalPeerGroupCoordinator() then
 3          sendList ← peers
 4          sendList.remove(msg.getSourcePeer())
 5          cellRequestList ← ∅
 6          for peer in sendList do
 7              cellRequest ← peer.sendMessage(msg)
 8              cellRequestList.add(cellRequest)
 9          end for
10          cellRequestList.waitForCompletion()
11          for cellRequest in cellRequestList do
12              if cellRequest.failed() then
13                  peers.remove(cellRequest.getPeer())
14              end if
15          end for
16      else
17          cellRequest ← coordinatorPeer.sendMessage(msg)
18          cellRequest.waitForCompletion()
19          if cellRequest.failed() then
20              this.handleCoordinatorFailure()
21          end if
22      end if
23  end procedure

then handled by the coordinator through the CellGroup:coordinatorHandleMsg()

procedure (previously shown in Algorithm 4.4). If the request fails, it is assumed that

the coordinator has crashed. In order to recover the cell from this faulty state, the

CellGroup:handleCoordinatorFailure() procedure is triggered.

Cell Discovery Mechanism

The goal of the cell discovery mechanism is to allow the discovery of peers in a cell.

The cell discovery object implements this sub-service, and uses low-level multicast

sockets to achieve an efficient implementation. The cell membership management is

accomplished through the use of the join, leave and rebind operations. These operations

are implemented through the cell group object. Both these mechanisms are presented

in Figure 4.7.

The algorithms that implement the cell discovery mechanisms are presented in Algorithm 4.6. When a peer wants to join the mesh, it first has to find a suitable cell to bind to.


Figure 4.7: Cell discovery and management entities.

Algorithm 4.6: Cell Discovery

var: cellObj // the cell object
var: discoveryMC // the discovery low-level multicast socket

 1  procedure CellDiscovery:requestCell(peerType)
 2      request ← discoveryMC.sendRequestCell(peerType)
 3      if request.failed() then
 4          return Cell:createRootInfo()
 5      else
 6          return request.getCellInfo()
 7      end if
 8  end procedure

 9  procedure CellDiscovery:RequestParent(peerType)
10      request ← discoveryMC.requestParent(peerType)
11      request.waitForCompletion()
12      return request.getParent()
13  end procedure

14  procedure CellDiscovery:handleDiscoveryMsg(peer, msg)
15      switch(msg.getType())
16          case(RequestCell)
17              if not cellObj.isRoot() then
18                  return
19              end if
20              replyRequestCellMsg ← cellObj.getCell(msg.getPeerInfo())
21              peer.sendMessage(replyRequestCellMsg)
22          end case
23          case(RequestParent)
24              replyRequestParentMsg ← cellObj.getParent(msg.getPeerInfo())
25              peer.sendMessage(replyRequestParentMsg)
26          end case
27      end switch
28  end procedure

This is achieved through the call to the CellDiscovery:requestCell() procedure (lines 1-8), which sends a cell request message to the root cell. The call will


be serviced by any of the peers in the cell. If there are no peers in the root cell the

procedure returns the root cell identifier (line 4). Otherwise, it returns an appropriate

place in the mesh tree to position the requesting peer (line 6). The parameter peerType

denotes the type of node that is joining the cell, and it can be either a peer or a leaf

peer.

The optimal position for a new peer depends on the strategy used and the type of peer. For a new peer, and given a tree-like topology, we first try to occupy the top of the tree, aiming to improve the resiliency of the overlay.
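Purely as an illustration of such a top-first policy, the sketch below performs a breadth-first search from the root and returns the shallowest cell with a free slot; the Cell structure and its capacity field are assumptions made for the example and do not correspond to the overlay's actual data structures.

#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical view of a cell in the mesh tree.
struct Cell {
    std::size_t peerCount = 0;       // peers currently bound to the cell
    std::size_t capacity  = 0;       // maximum peers the cell accepts
    std::vector<Cell*> children;     // child cells
};

// Top-first placement: breadth-first search from the root, returning the
// shallowest cell that still has a free slot, or nullptr if the tree is full.
Cell* findPlacement(Cell* root) {
    std::queue<Cell*> frontier;
    frontier.push(root);
    while (!frontier.empty()) {
        Cell* cell = frontier.front();
        frontier.pop();
        if (cell->peerCount < cell->capacity)
            return cell;                      // free slot near the top of the tree
        for (Cell* child : cell->children)
            frontier.push(child);
    }
    return nullptr;
}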

The procedure CellDiscovery:handleDiscoveryMsg() (lines 14-28) is the call-back

that is executed on the cell’s active peers to process the discovery requests. The cell

discovery mechanism supports two types of messages, the request for a cell (lines 16-22)

and the request for a new parent (lines 23-26).

The request for a cell is only valid in the root cell; otherwise, the request is simply discarded (lines 17-19). The restriction of this operation to the root cell allows us to provide a better balance of the mesh tree, because the root cell is the only part of the tree that has full knowledge of the overlay. A suitable cell is found using the cellObj.getCell() procedure. The reply message containing the binding information is sent to the requesting peer (lines 20-21).

However, if the incoming request is for a new parent, then a suitable parent is found through the call to the cellObj.getParent() procedure, with the result being sent to the originating peer (lines 24 and 25). The request for a new parent is issued when the parent peer of a cell fails. The coordinator of the cell must be able to find a new parent peer within the parent cell, if one is available, by using the CellDiscovery:RequestParent() procedure.

Faults and Recovery

Faults arise for various reasons, ranging from hardware failures, including peer hardware failures and network outages, to software bugs. We consider three types of faults: peer crash, coordinator peer crash, and cell crash.

Figure 4.8 illustrates the fault handling processes in the presence of a fault in a cell. When a non-coordinator peer crashes in a cell, shown in Figure 4.8a, the coordinator peer issues a leavePeer request to the upper part of the tree (step 2), notifying it of the departure of the crashed peer. After the acknowledgment from the parent peer has been received (step 3), the coordinator peer notifies the active peers in the cell of the crashed peer (steps 4 and 5).


Figure 4.8: Failure handling for non-coordinator (left) and coordinator (right) peers.

On the other hand, when a failure happens in the cell's coordinator peer, shown in Figure 4.8b, one of the other peers in the cell takes its place as the new coordinator. After detecting the failure of the coordinator (step 1), the peer that is next-in-line, according to the order in which the peers entered the cell, succeeds it and becomes the new coordinator. The coordinator of the parent cell also detects the crashed coordinator peer, and sends a notification towards the root cell (steps omitted). The newly elected coordinator peer sends a rebind request to the parent coordinator and waits for the acknowledgment (steps 2 and 3), informing it that it is the new coordinator of the cell. Furthermore, each active peer in the cell rebinds to the new coordinator, as will any coordinator belonging to a child cell. These rebind requests are also sent towards the root and fully acknowledged (steps 4 to 7).

As said, the coordinator peers of the child cells try to rebind to the parent cell. If there are no more peers in the parent cell, then the cell has crashed and the coordinators of the child cells have to contact the root node of the tree to request a new suitable placement, that is, a new cell. At this point, it is possible for the child cells, and


their sub-trees, to migrate to their new location, effectively avoiding the costly rebinding

process that would arise from forcing every peer to individually rebind to the mesh.


Figure 4.9: Cell failure (left) and subsequent mesh tree rebinding (right).

Figure 4.9a shows the case where the coordinator peer crashes. Because it was the only active peer in the cell, this results in a cell crash, as no more peers are available in the cell. The reconfigured P2P network is shown in Figure 4.9b.

Algorithms 4.7 and 4.8 show the procedures that govern the fault-handling mechanism. When a TCP/IP connection closes without a proper shutdown, the peer is assumed to have crashed. Within a cell, the coordinator peer monitors all active peers and, in turn, they monitor the coordinator peer. The Cell:onPeerFailureHandler() procedure is called by the coordinator when any of the active peers has failed, or it is called by all the active peers when the coordinator has failed. Furthermore, when a parent coordinator detects that a child coordinator has failed, it calls the Cell:onChildFailureHandler() procedure. On the other hand, every child coordinator peer calls the Cell:onParentFailureHandler() procedure when it detects that its parent coordinator has crashed.
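A minimal sketch of this connection-based failure detection is shown below, assuming a plain POSIX stream socket per monitored peer rather than the middleware's own channel abstraction: a receive that returns zero or an error is interpreted as a peer crash and the corresponding failure handler is fired.

#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <functional>

// Blocking monitor loop over one peer connection. A recv() of 0 bytes (the
// peer closed the connection without a protocol-level shutdown) or an error
// (connection reset) is treated as a crash of the monitored peer.
void monitorPeer(int socketFd, const std::function<void()>& onPeerFailure) {
    char buffer[4096];
    for (;;) {
        ssize_t n = recv(socketFd, buffer, sizeof(buffer), 0);
        if (n <= 0) {            // connection closed or reset: assume the peer crashed
            onPeerFailure();     // e.g. fire the peer/parent failure handler
            close(socketFd);
            return;
        }
        // otherwise, hand the n received bytes to the message handling layer
    }
}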

When a peer crashes, there are two possible scenarios: the crash of a non-coordinator peer of the cell, shown in Figure 4.8a, and the crash of a coordinator peer, shown in Figure 4.8b. When a non-coordinator peer crashes, the coordinator peer of that cell calls the Cell:leavePeer() procedure at line 10 of the Cell:onPeerFailureHandler() procedure. It starts by removing the information about the peer (line 39) and then sending the notification to the parent coordinator peer and waiting for the acknowledgment (lines 40 to 42). After the acknowledgment has been received, the coordinator synchronizes the cell by issuing a


Algorithm 4.7: Cell fault handling.

var: this // the current cell object
var: cellGroupObj // the cell communication group object

 1  procedure Cell:onPeerFailureHandler(peerInfo)
 2      if peerInfo.isCoordinator() then
 3          this.removePeerInfo(peerInfo)
 4          if this.isNewCoordinator() then
 5              this.rebindParentPeer(this.getParentInfo())
 6          else
 7              this.rebindCoordinatorPeer()
 8          end if
 9      else
10          this.leavePeer(peerInfo)
11      end if
12  end procedure

13  procedure Cell:onParentFailureHandler(peerInfo)
14      cellDiscoveryObj ← Cell:createCellDiscovery(peerInfo.getCellInfo())
15      newParentInfo ← cellDiscoveryObj.requestParent()
16      if newParentInfo ≠ ∅ then
17          this.rebindParentPeer(newParentInfo)
18      else
19          cellRootDiscoveryObj ← Cell:createRootCellDiscovery()
20          newParentInfo ← cellRootDiscoveryObj.requestParent()
21          this.rebindParentPeer(newParentInfo)
22      end if
23  end procedure

24  procedure Cell:onChildFailureHandler(peerInfo)
25      leavePeer(peerInfo)
26  end procedure

departure notification through the cell group communication infrastructure (line 43).

No additional recovery is necessary at this point.

When the crashed peer was coordinating the cell, each active peer remaining in the cell calls the Cell:onPeerFailureHandler() procedure (lines 2 to 9). They start by removing the information about the crashed peer (line 3). The peer that is next-in-line to succeed the coordinator peer calls the Cell:rebindParentPeer() procedure (line 5). In turn, all the remaining active peers in that cell call the Cell:rebindCoordinatorPeer() procedure (line 7) in order to connect to the new coordinator peer. The Cell:rebindParentPeer() procedure starts by connecting to the parent coordinator peer (line 28), and then issues a rebind notification to it and waits for the acknowledgment (lines 29 to 31).



Algorithm 4.8: Cell fault handling (continuation).

27  procedure Cell:rebindParentPeer(parentInfo)
28      this.connectToParentPeer(parentInfo)
29      rebindMsg ← Cell:createRebindMsg(this.getOurPeerInfo())
30      request ← this.getParentPeer().sendMessage(rebindMsg)
31      request.waitForCompletion()
32  end procedure

33  procedure Cell:rebindCoordinatorPeer()
34      this.connectToCoordinator(this.getCoordinatorInfo())
35      rebindMsg ← Cell:createRebindMsg(this.getOurPeerInfo())
36      cellGroupObj.sendMessage(rebindMsg)
37  end procedure

38  procedure Cell:leavePeer(peerInfo)
39      this.removePeerInfo(peerInfo)
40      leaveMsg ← Cell:createLeaveMsg(peerInfo)
41      request ← this.getParentPeer().sendMessage(leaveMsg)
42      request.waitForCompletion()
43      cellGroupObj.sendMessage(leaveMsg)
44  end procedure

On the other hand, the Cell:rebindCoordinatorPeer() procedure starts by connecting to the new coordinator peer (line 34), and then issues a rebind notification to the coordinator through the cell group communication infrastructure (lines 35 and 36).

At the same time, the parent coordinator peer and all the child coordinator peers also detect that the coordinator peer has crashed. In the first case, the parent coordinator peer, through the Cell:onChildFailureHandler() procedure, issues a notification to the topmost portion of the tree informing it of the departure of the crashed peer (followed by the synchronization within its own cell). This is accomplished through the Cell:leavePeer() procedure (line 25). The child coordinators, upon detecting the failure of their parent coordinator, call the Cell:onParentFailureHandler() procedure. The procedure starts by trying to discover a new parent in the same cell as the crashed coordinator (lines 14 and 15). If there is an active coordinator in that cell, then the child coordinator rebinds by calling the Cell:rebindParentPeer() procedure. If there is no such coordinator available, the child coordinator contacts the root cell to ask for a new parent, and thus a new placement in the mesh, and rebinds to it, also using the Cell:rebindParentPeer() procedure (lines 19 to 21).

4.1.3 Discovery Service

The Discovery service provides a generic infrastructure for locating resources in the

overlay, such as the location of service instances, whereas the previously described cell


discovery infrastructure only provides the mechanisms to locate peers within a cell.

Figure 4.10: Discovery service implementation.

The overlay Discovery service is shown in Figure 4.10. A user in peer A issues a query

through the Runtime Interface and Overlay Interface. The runtime of peer A tries first

to resolve it locally. If it is unable to locally resolve the query, then it must forward

the query to its parent coordinator, peer B. If peer B is unable to resolve the query,

then the request is forwarded to its parent coordinator, in this case peer C. If peer C

is unable to resolve the query, then a failure reply is sent downwards to the originating

peer.

Furthermore, the querying process can be generalized in the following manner. Upon the reception of a discovery request, the runtime first tries to resolve it locally, in the peer, and only when this is not possible does it propagate the request to the cell's coordinator. If the coordinator is also unable to reply to the request, the request is propagated once more, to its parent cell coordinator, and the process is repeated recursively until a coordinator peer is able to reply. If this process reaches a point where there is no parent coordinator available (the root node of the sub-tree), the process fails and a failure reply is sent downwards to the originating peer.

Algorithm 4.9 shows the procedures that implement the behavior of the discovery service. The discovery service allows the execution of synchronous and asynchronous queries. The procedure Discovery:executeQuery() performs synchronous queries. The current implementation redirects the query to the root cell. This was done for the sake of simplicity, but it will be revised in the future.


Algorithm 4.9: Discovery service.

var: this // the current discovery service object
var: mesh // the mesh service

 1  procedure Discovery:executeQuery(query, qos)
 2      queryResult ← this.executeLocalQuery()
 3      if queryResult ≠ ∅ then
 4          return(queryResult)
 5      end if
 6      coordinatorUUID ← ∅
 7      if not mesh.getCell().isCoordinator() then
 8          coordinatorUUID ← mesh.getCell().getCoordinatorUUID()
 9      else
10          coordinatorUUID ← mesh.getCell().getParentUUID()
11      end if
12      if coordinatorUUID = ∅ then
13          return(∅)
14      end if
15      coordDiscoverySAP ← mesh.getDiscoveryInfo(coordinatorUUID)
16      coordDiscoveryClient ← this.createCoordinatorClient(coordDiscoverySAP, qos)
17      return(coordDiscoveryClient.executeQuery(query, qos))
18  end procedure

19  procedure Discovery:executeAsyncQuery(query, qos)
20      queryResult ← this.executeLocalQuery()
21      if queryResult ≠ ∅ then
22          future ← this.createFutureWithResult(queryResult)
23          return(future)
24      end if
25      coordinatorUUID ← ∅
26      if not mesh.getCell().isCoordinator() then
27          coordinatorUUID ← mesh.getCell().getCoordinatorUUID()
28      else
29          coordinatorUUID ← mesh.getCell().getParentUUID()
30      end if
31      if coordinatorUUID = ∅ then
32          future ← this.createFutureWithResult(∅)
33          return(future)
34      end if
35      coordDiscoverySAP ← mesh.getDiscoveryInfo(coordinatorUUID)
36      coordDiscoveryClient ← this.createCoordinatorClient(coordDiscoverySAP, qos)
37      return(coordDiscoveryClient.executeAsyncQuery(query, qos))
38  end procedure

39  procedure Discovery:handleQuery(peer, query, qos)
40      queryResult ← this.executeQuery(query, qos)
41      queryReplyMessage ← Discovery:createQueryReplyMessage(queryResult)
42      peer.sendMessage(queryReplyMessage)
43  end procedure


The procedure starts by trying to resolve the query locally and, if successful, returning the result (lines 2 to 5). Otherwise, the query must be propagated throughout the overlay. If the peer is not the coordinator of the cell, then the coordinator of the cell will be used as a gateway for the propagation of the query. On the other hand, if the peer is the coordinator of the cell, then the coordinator of the parent cell is used (lines 6 to 11). If no such coordinator is available, then the query fails (lines 12 to 14). Otherwise, the SAP information of the coordinator, which can be either the coordinator of the current cell or the coordinator of the parent cell, is retrieved using the mesh service, in line 15, followed by the creation of a client for the Discovery service of that coordinator (line 16). At line 17, we use the client to redirect the request to the parent and return the result.

The procedure Discovery:executeAsyncQuery() provides the asynchronous version of the querying primitive. It follows the same approach as the synchronous version, with some slight differences. Instead of returning the result of the query, it returns a future, which acts as a placeholder for the query result, notifying the owner when the data is available. If the query can be resolved locally, then a future is created with the query result and returned (lines 21 to 24). As with the synchronous querying, this is followed by the retrieval of the UUID of either the coordinator of the cell, if the peer is not the coordinator of the cell, or the coordinator of the parent cell. If no coordinator is available, then the procedure fails and a token reflecting this failure is created and returned (lines 26 to 34). Otherwise, a client to the coordinator is created after the retrieval of the necessary information about the SAP of that coordinator. Last, the procedure returns the future created by the asynchronous querying on the coordinator's client (lines 35 to 37).

The procedure Discovery:handleQuery() is the call-back that is executed to handle the query requests from the follower peers of the cell, or from child peers belonging to child cells. The Discovery:executeQuery() procedure, previously described in Algorithm 4.9, is used to process an incoming query. If the query fails, a failure message is created. If not, the query result is attached to a reply message. The reply message is finally sent to the requesting peer.

4.1.4 Fault-Tolerance Service

Our FT infrastructure is based on replication groups. These groups can be defined as a set of cooperating peers that have the common goal of providing reliability to a high-level service. Previous work [3, 14] implemented FT support through a set of


high-level services that used the underlying primitives of the middleware. Our approach (cf. Chapter 3) makes a fundamental shift from this principle by embedding lightweight FT support at the overlay layer.

The management of the replication group is self-contained, in the sense that the FT service delegates all the logistics to the replication group. This allows further extensibility of the replication infrastructure, and also allows the coexistence of multiple types of replication strategies inside the FT service.

The integration of FT in the overlay reduces the cross-layering overhead that is associated with the use of high-level services. Furthermore, this approach also enables the runtime to make replica placement decisions that are aware of the overlay topology. This awareness allows a better trade-off between the target reliability and resource usage. For example, placing replicas in different geographic locations leads to better reliability, but can be limited by the availability of bandwidth over WAN links.

Figure 4.11: Fault-Tolerance service overview.

Figure 4.11 shows an overview of the FT service, more specifically, of the bootstrap process of a replicated service. It starts with a peer, in this case referred to as the client, requesting the creation of a replicated service on peer B. This request is delegated to the mesh service. At this point, peer B receives the request and verifies if it is able to host the service. If enough resources are available for hosting the service, which will act as the primary service instance, then the core requests the FT service to create a replication group that will support the replication infrastructure for the service.

The FT service creates a new replication group object, which will oversee the management of the replication group, acting as its primary. Using the fault-tolerance parameters that were passed by the core, the primary of the replication group finds the necessary number of replicas across the overlay using the discovery service (this interaction is


omitted). After finding the suitable deployment peers, the primary sends requests to the remote FT services to join the replication group as replicas. Each remote peer verifies if it has the necessary resources to host the replica and, if so, the core creates a replication group object that will act as a replica in the replication group. This process ends with the replica binding to the primary of the replication group.

Replication Group Management

The management of a replication group includes the creation and removal of replicas.

Furthermore, a replication group is also responsible for providing the fail-over mecha-

nisms that allow the recovery from faults that occur in participating peers.

Figure 4.12: Creation of a replication group.

Figure 4.12 illustrates the creation of a replication group with one replica. The process starts with a user requesting the creation of a service with FT support (step 1). The core of the runtime processes the request and creates a service instance that will act as the Primary service instance (step 2). If configured, the core will make the necessary reservations by interacting with the QoS client. The core proceeds to create a replication group that will provide fault-tolerance support to the service (step 3).

After creating the replication group object, and finding a suitable deployment site

(omitted), the core requests the addition of a replica to the newly created replication

group, through the fault-tolerance service (step 4). The handleFTMsg procedure is the

call-back that is responsible for handling these types of requests.

After receiving and accepting the request for the creation of a replica, the peer, denominated as Replica, creates a service instance that will act as a replica of the primary


service instance (step 5). This is followed by the creation of a replication group object that will act as a replica in the existing group. In order to complete the join to the replication group, the replica issues a join request to the primary of the replication group, which is maintained by the primary peer (steps 6-7).

Because the example given in Figure 4.12 only has one replica, there is no need to advertise the arrival of a new replica. However, in the presence of a larger group, each newly added replica has to be advertised to the replication group.

Figure 4.13: Replication group binding overview.

Figure 4.13 depicts the existing bindings within a replication group with multiple replicas. The primary of the replication group, the peer that is managing the group and is responsible for hosting the primary service, has active bindings to all the replicas, that is, the peers that host a replica service.

The replicas are shown from left to right, denoting their order of entrance in the replication group. If the primary fails, the leftmost replica is elected as the new primary. Furthermore, each replica pre-binds to all the replicas that are placed on its right. These pre-binds allow the monitoring of the neighboring peers for failures and reduce the latency of the binding process.
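The ordering and pre-binding rules can be illustrated with the short sketch below; the ReplicaInfo type and its prebind() method are hypothetical stand-ins for the group's binding machinery, and the vector is assumed to be ordered by entrance in the group.

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical replica descriptor, kept in order of entrance in the group.
struct ReplicaInfo {
    std::string uuid;
    void prebind() { /* open control and data connections ahead of time */ }
};

// Each replica pre-binds to all the replicas placed on its right, so that the
// connections needed after a fail-over are already in place.
void prebindRightNeighbors(std::vector<ReplicaInfo>& replicas, std::size_t selfIndex) {
    for (std::size_t i = selfIndex + 1; i < replicas.size(); ++i)
        replicas[i].prebind();
}

// On a primary failure, the leftmost (oldest) replica is elected as the new primary.
const ReplicaInfo* electNewPrimary(const std::vector<ReplicaInfo>& replicas) {
    return replicas.empty() ? nullptr : &replicas.front();
}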

Figure 4.14 shows the details of the process involved in the creation of a new replica. Following a request for the creation of a new replica by the primary (shown in Figure 4.12, step 4), the new replica joins the replication group (step 1).

When the primary adds a new replica to the group, it first starts by binding to it (step 2). If this initialization is successful, the primary sends a message notifying the remaining replicas that a new replica was added (step 3). Upon the arrival of this message, each replica pre-binds to the new replica, and if this is done successfully, each replica replies back to the primary with an acceptance message (steps 4-5). Otherwise,


a rejection message is sent back to the primary and the addition of the new replica is

aborted (omitted).

Figure 4.14: The addition of a new replica to the replication group.

Fault-Tolerance Algorithms

The fault-tolerance service handles three types of requests: the creation of a new replication group, which is performed by the primary; the addition of a new replica to an existing replication group, requested by the primary to a new replica; and the removal of an existing replication group. The procedures FT:createReplicationGroup(), FT:joinReplicationGroup() and FT:removeReplicationGroup() handle these requests, respectively, and are shown in Algorithm 4.10.

When a service creation request is made locally or remotely, through the mesh service,

the core verifies if the necessary resources are available, and if so, creates a service

instance to be used by the replication group. Following this, the core creates the

replication group through the procedure FT:createReplicationGroup() (shown in

Figure 4.12, step 3). Acting on behalf of the core, the FT service creates the replication

group primary that will construct and manage the replication group.

This procedure takes as input the following parameters: svc, the service instance that

will act as the primary; params, the service parameters used in the creation of the

primary and replicas; and qos, a QoS broker to be used by the replication group. After

the replication group has been created (line 2), the output variable rgid is initialized

with the Replication Group Identifier (RGID) and the group is added to the group

manager (line 3) and bootstrapped (line 4).

The fault-tolerance requests are handled by the FT:handleFTMsg() procedure. Upon the reception of a request to host a new replica (lines 18-22), the FT service redirects the request to the core of the runtime by calling the joinReplicationGroup() procedure of the Core Interface (line 20).


Algorithm 4.10: Creation and joining within a replication group

var: this // the current FT service
var: ftGroupObj // the replication communication group
var: groupManager // the FT replication group manager

 1  procedure FT:createReplicationGroup(svc, params, rgid, qos)
 2      ftGroupObj ← this.createPrimaryFTGroupObj(svc, params, rgid, qos)
 3      groupManager.addGroup(ftGroupObj)
 4      ftGroupObj.start()
 5  end procedure

 6  procedure FT:joinReplicationGroup(svc, params, rgid, primary, replicas, qos)
 7      ftGroupObj ← this.createReplicaFTGroupObj(svc, params, rgid, primary, replicas, qos)
 8      groupManager.addGroup(ftGroupObj)
 9      ftGroupObj.start()
10  end procedure

11  procedure FT:removeReplicationGroup(rgid)
12      ftGroupObj ← groupManager.getGroup(rgid)
13      ftGroupObj.stop()
14      groupManager.removeGroup(ftGroupObj)
15  end procedure

16  procedure FT:handleFTMsg(peer, msg)
17      switch(msg.getType())
18          case(JoinFTGroup)
19              (rgid, sid, params) ← msg.getReplicaInfo()
20              getCoreInterface().joinReplicationGroup(primary, replicas, rgid, sid, params)
21              peer.sendMessage(FT:createAckMessage(msg))
22          end case
23          case(RemoveFTGroup)
24              ftGroupObj ← groupManager.getGroup(rgid)
25              ftGroupObj.stop()
26              groupManager.removeGroup(ftGroupObj)
27          end case
28      end switch
29  end procedure

The core of the runtime first verifies the availability of resources to run the replica, and if they are available, it requests the FT service to join the replication group. This is implemented by the FT:joinReplicationGroup() procedure (shown in Figure 4.12, step 6), which takes as input the following parameters: svc, the service instance that will act as a replica; params, the service parameters used in the creation of the primary and replicas; qos, a QoS broker to be used by the replication group object; rgid, the RGID of the replication group; the primary parameter, which holds the primary info; and the replicas parameter, which holds the current replicas info.


Replication Group Algorithms

The replication group is the core of the replication infrastructure. It enforces the

behavior that was requested in the creation of the replicated service, such as the number

of replicas or replication policy.

Algorithm 4.11: Primary bootstrap within a replication group

var: this // the local instance of the replication group
var: ft // the fault-tolerance service
var: rgControlGroup // the replication control group

 1  procedure FTGroup:startPrimary()
 2      this.openSAPs()
 3      (sid, params) ← this.getServiceInfo()
 4      nbrOfReplicas ← params.getFTParams().getReplicaCount()
 5      deployPeers ← ft.findResources(sid, params, nbrOfReplicas)
 6      for peer in deployPeers do
 7          replica ← this.createReplicaObject(peer)
 8          rgControlGroup.addReplica(replica.getInfo())
 9          this.addToReplicaList(replica)
10      end for
11      this.getService().setReplicationGroup(this);
12  end procedure

Algorithm 4.11 details the initialization procedure of a primary within a replication group. The FTGroup:startPrimary() procedure shows the bootstrap sequence of a primary. It starts by initializing two distinct access points, one for data and the other for control (line 2). This separation was made to prevent the multiplexing of control and data requests, which could lead to priority inversion or increased latency in the processing of requests. More specifically, the control SAP is used to manage the organization of the replication group, such as the addition and removal of replicas and the election of a new primary, while the data SAP is used to implement the “actual” FT protocol.
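A minimal sketch of this separation is shown below, with the control and data access points each served by a dedicated thread so that a large state transfer on the data path cannot delay a membership or election message; the Endpoint type is a hypothetical placeholder for the middleware's SAP abstraction.

#include <functional>
#include <string>
#include <thread>

// Hypothetical access point: a blocking receive loop over one SAP.
struct Endpoint {
    void receiveLoop(const std::function<void(const std::string&)>& handler) {
        (void)handler;  // accept connections and pass each incoming message to the handler
    }
};

// Control traffic (membership, elections) and data traffic (replication
// payloads) are opened and served independently, avoiding the multiplexing
// that could otherwise cause priority inversion.
void serveReplicationGroupSAPs(Endpoint& controlSap, Endpoint& dataSap,
                               const std::function<void(const std::string&)>& onControlMsg,
                               const std::function<void(const std::string&)>& onDataMsg) {
    std::thread control([&] { controlSap.receiveLoop(onControlMsg); });
    std::thread data([&] { dataSap.receiveLoop(onDataMsg); });
    control.join();
    data.join();
}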

Figure 4.15 illustrates the control and data communication groups. The dashed lines

represent pre-binds that are made to minimize recovery time. When the primary of a

replication group fails, the necessary TCP/IP connections are already in place, so when

the replica that is next-in-line becomes the new primary, it can immediately recover the

replication group.

After this initial setup, the primary calls the FT:findResources() procedure (shown in Algorithm 4.12) to search for suitable deployment sites to create the replicas. The total number of replicas is enclosed within the fault-tolerance parameters, which in turn belong to the service parameters (lines 3-4).


Figure 4.15: The control and data communication groups.

After retrieving the list of suitable deployment sites at line 5, the primary creates

and binds each replica (line 7). Each newly added replica is synchronized with the

existing replicas in the replication group, using the control group infrastructure (line 8).

Subsequently, the new replica is added to the replica list (line 9). Last, the replication

group is attached to the service instance, allowing the service to access the underlying

FT infrastructure (line 11). If any of the previously mentioned operations fails, the

whole bootstrap process fails.

Algorithm 4.12: Fault-Tolerance resource discovery mechanism.

var: this // the current FT service object
var: discovery // the discovery service
var: mesh // the mesh service

 1  procedure FT:findResources(sid, params, nbrOfReplicas)
 2      peerList ← ∅
 3      for i ← 1, i < nbrOfReplicas do
 4          filterList ← peerList
 5          query ← this.createPoLQuery(mesh.getUUID(), sid, filterList)
 6          queryReply ← discovery.executeQuery(query)
 7          peerList.add(queryReply.getPeerInfo())
 8      end for
 9      return(peerList)
10  end procedure

In order to bootstrap a replica, a suitable place must be found. Algorithm 4.12 shows the details of the mechanism responsible for finding suitable peers to host new replicas. The process is exposed by the FT:findResources() procedure. This procedure returns a list containing the peers, found across the overlay, that are able to host a replica. To prevent the duplication of replicas on the same runtime, a filter list is added to each query. The initialization of this list is performed at line 4, and it is


updated every time a query is performed, avoiding duplication of peers. The actual query is created in line 5, through the use of the FT:createPoLQuery() procedure. The short name PoL stands for Place of Deployment, and refers to the runtime where a service, or in this case the replica, will be launched. At this point, the FT uses the discovery service to perform the query (line 6), adding the reply to the peer list (line 7) in case of success. If this querying fails, the FT:findResources() procedure fails.

Algorithm 4.13: Replica startup.

var: this // the local instance of the replication group object

 1  procedure FTGroup:startReplica()
 2      this.openSAPs()
 3      this.getService().setReplicationGroup(this);
 4  end procedure

The startup of a replica is detailed in the FTGroup:startReplica() procedure in

Algorithm 4.13. The replica starts by opening the control and data access points. This

enables the primary of the group to bind to the replica (shown in Algorithm 4.11). Last,

the replication group is attached to the replica service (line 3).

Algorithm 4.14: Replica request handling

var: this // the local instance of the replication group object

 1  procedure FTGroup:replicaHandleControlMsg(primaryPeer, msg)
 2      switch(msg.getType())
 3          case(AddReplica)
 4              replicaInfo ← msg.getReplicaInfo()
 5              replica ← this.prebindControlAndDataToReplica(replicaInfo)
 6              this.addToReplicaList(replica)
 7              ackMessage ← FTGroup:createAckMessage(msg)
 8              primaryPeer.sendMessage(ackMessage)
 9          end case
10          case(RemoveReplica)
11              replicaInfo ← msg.getReplicaInfo()
12              this.removeFromReplicaList(replicaInfo)
13              ackMessage ← FTGroup:createAckMessage(msg)
14              primaryPeer.sendMessage(ackMessage)
15          end case
16      end switch
17  end procedure

Algorithm 4.14 shows the FTGroup:replicaHandleControlMsg() call-back that is

responsible for handling the control requests in a peer that is acting as a replica within


a replication group. The notification messages sent by the primary, informing of the arrival of new replicas to the replication group, are handled in lines 3-9. Upon receiving the request, each replica pre-binds to the new replica (line 5) and adds it to the replica list (line 6). This ends with a reply message being sent to the primary peer.

The removal of a replica from the replication group is handled in lines 10-15. When removing the replica from the list (line 12), all associated pre-binds (control and data) are closed. The process ends with an acknowledgment being sent to the primary peer.

Support for the Replication Protocol

Our current implementation only supports semi-active replication [44]. In this type of replication, the primary instance of the service, after receiving and processing a request from a client, replicates the new state across all the active replicas. As soon as the replication ends, an acknowledgment is sent back to the client.

Figure 4.16: Semi-active replication protocol layout.

Figure 4.16 illustrates the implementation of the semi-active replication policy. When

the primary service instance wants to replicate its state, it uses the replicate()

procedure within the replication group (step 1). The replication group then uses the

data group to synchronize the new state among the replicas (step 2). Each replica

handles the replication request through the replicaHandleDataMsg() procedure.

This takes the replication data and calls the onReplication() procedure (step 3).

The service, after synchronizing into the new state, issues an acknowledgment through

the replication group (step 4).

The actual replication protocol support is detailed in Algorithm 4.15. When a primary service needs to synchronize some data, which can be individual actions, such as RPC invocations, or state transfers (partial or complete), it uses the FTGroup:replicate()


procedure. The underlying replication group, depending on its policy, synchronizes the

replication data with all the replicas. For example, if the replication group is configured

to use semi-active replication, then when the FTGroup:replicate() procedure is

called (by the primary), the group immediately spreads the data. Alternatively, if

passive replication was in place, the replication group would buffer the data until the

next synchronization period expires. When the period expires, the replication group

synchronizes the data.
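The difference between the two policies can be sketched as follows; the Buffer alias, DataGroup type and period handling are assumptions made only to contrast the immediate spreading of semi-active replication with the buffering of a passive policy.

#include <string>
#include <vector>

using Buffer = std::string;  // hypothetical replication payload

// Hypothetical data group: synchronizes one buffer with all active replicas.
struct DataGroup {
    void replicate(const Buffer& data) { (void)data; /* send to every replica and wait for acks */ }
};

enum class ReplicationPolicy { SemiActive, Passive };

struct ReplicationGroupSketch {
    ReplicationPolicy policy;
    DataGroup dataGroup;
    std::vector<Buffer> pendingState;  // updates buffered under the passive policy

    // Semi-active: spread the data immediately.
    // Passive: buffer it until the synchronization period expires.
    void replicate(const Buffer& data) {
        if (policy == ReplicationPolicy::SemiActive)
            dataGroup.replicate(data);
        else
            pendingState.push_back(data);
    }

    // Called when the passive synchronization period expires.
    void onSyncPeriodExpired() {
        for (const Buffer& data : pendingState)
            dataGroup.replicate(data);
        pendingState.clear();
    }
};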

Each replica executes the FTGroup:replicaHandleDataMsg() call-back to handle the arrival of replication data. Upon arrival, the replication data is sent to the replica service instance to be processed.

Algorithm 4.15: Support for semi-active replication.

var: this // the local instance of the replication group object
var: rgDataGroup // the replication data group

 1  procedure FTGroup:replicate(buffer)
 2      rgDataGroup.replicate(buffer);
 3  end procedure

 4  procedure FTGroup:replicaHandleDataMsg(primaryPeer, msg)
 5      switch(msg.getType())
 6          case(Replication)
 7              buffer ← msg.getBuffer()
 8              replicationAckMsg ← this.getService().onReplication(buffer)
 9              primaryPeer.sendMessage(replicationAckMsg)
10          end case
11      end switch
12  end procedure

Fault Detection and Recovery in Replication Groups

The fault detection and recovery mechanisms within a replication group are implementation dependent. Figure 4.17 illustrates the recovery process within our current implementation. After detecting the failure of the primary (step 1), the replica that is next-in-line to become the new primary assumes the leadership of the replication group by sending a notification to all active replicas, informing them that it has assumed the coordination (step 2). Next, the new primary notifies its service instance, which was acting as a replica instance, that it has become the primary service instance (step 3). At this point, the primary node updates the information about the service, allowing any existing client to retrieve this information and rebind to the new primary. This is accomplished through the use of the changeIIDOfService() procedure of the Core Interface. For the sake of simplicity, we omit the additional steps required to perform


this update in the mesh.

Figure 4.17: Recovery process within a replication group.

Algorithm 4.16 details the detection and recovery call-backs that are used by the participants of the replication group. The procedure FTGroup:onPeerFailureHandler() is called when a bind or a pre-bind is closed, that is, when a peer has crashed. If the failing peer was the current primary of the group (line 2), then the next leftmost replica (line 3), the oldest replica in the group, is elected leader. If the executing peer is the new primary (line 4), then it must notify the service instance that it became the primary (line 5). The new primary sends a notification to all the active replicas informing them that it is ready to continue with the replication policy (line 6). This is followed by an update containing the information about the new primary (lines 7 to 8). However, if the faulty peer was not the primary, then it is just a matter of removing the binding information associated with the crashed peer (line 11).

4.2 Implementation of Services

EFACEC operates on several domains, including information systems used to manage public high-speed transportation networks, robotics, and smart (energy) grids. Despite their differences, these systems have many common requirements and problems: the need to transfer large sets of data; intermittent network activity, which can lead to data bursts; exposure to common hardware failures, which can vary in duration, from short outages (for example, a network reconfiguration caused by a link failure) to extended ones, such as fires; and the need for low jitter and low latency for safety reasons, such as vehicle coordination. The pursuit of these characteristics puts a tremendous stress


Algorithm 4.16: Fault detection and recovery

var: this // the local instance of the replication group object
var: mesh // the mesh service
var: rgControlGroup // the replication control group
var: service // the replicated service
var: replicas // the replica list
var: rgid // the replication group UUID

 1  procedure FTGroup:onPeerFailureHandler(peerID)
 2      if this.isPeerPrimary(peerID) then
 3          primaryPeer ← replicas.pop()
 4          if primaryPeer.getUUID() = mesh.getUUID() then
 5              this.fireOnChangeToPrimary()
 6              rgControlGroup.sendNewPrimaryInfo()
 7              iid ← service.getIID()
 8              this.getCoreInterface().changeIIDOfService(sid, iid, rgid);
 9          end if
10      else
11          replicas.remove(peerID)
12      end if
13  end procedure

14  procedure FTGroup:fireOnChangeToPrimary()
15      serviceChangeStatus ← service.changeToPrimaryRole();
16      return(serviceChangeStatus);
17  end procedure

on both software and hardware infrastructures, and particularly on the management middleware platform.

Our middleware architecture is able to support different types of services. To showcase some possible implementations, we present three distinct services: 1) RPC, the classical remote procedure call service; 2) Actuator, which allows the execution of commands on a set of sensors; and 3) Streaming, which allows data streaming from a sensor to a client. The RPC service is a standard in every middleware platform, whereas both the Actuator and Streaming services were designed to resemble current systems for public information management that were deployed in the Dublin and Tenerife metropolitan infrastructures. These services will form the basis for the evaluation of the middleware presented in Chapter 5.

4.2.1 Remote Procedure Call

The RPC service, depicted in Figure 4.18, allows the execution of a procedure in a foreign address space, relieving the programmer of the burden of coding the remote


interactions. The service uses fault-tolerance in the common way, with the primary

being the main service site, updating all the replicas that belong to the replication group

according to the group’s replication policy. The current implementation only supports

semi-active [44] replication, where the primary updates all replicas upon the reception

of a new invocation, and only replies to the client when all the replicas acknowledge the

update. On the other hand, if the RPC service is bootstrapped without fault-tolerance,

then the service executes a client invocation and replies immediately, as no replication

is involved. Figure 4.18 shows the RPC service deployed with two replicas across the

overlay.

Figure 4.18: RPC service layout.

The RPC service is divided into two layers. The topmost level contains the user-defined objects, referred to as servers. The servers are the building blocks of the RPC service, providing object-oriented semantics similar to CORBA. For now, they are statically linked, at compile time, to the RPC service. We have plans to expand this in the future. On the other hand, the bottommost level contains the server manager, also known as the service adapter, which is responsible for managing these user objects. The main functions of the server adapter include the registration and removal of objects, and the retrieval of the proper object to handle an incoming invocation.

In order to fully support object semantics, RPC has two distinct invocation types,

one-way and two-way invocations. One-way invocations do not return a value to the

client. Two-way invocations return a value back to the client that is dependent on the

particular operation.

Figures 4.19a and 4.19b show the interactions of a client performing one-way and two-way invocations, respectively. After receiving an invocation from a Service


(a) RPC one-way invocation. (b) RPC two-way invocation.

Figure 4.19: RPC invocation types.

Access Point (SAP), through the handleRPCServiceMsg call-back, the server adapter

redirects the request to the target object (server) that performs the call to the requested

method. If it is a one-way invocation then the server only has to call the target method

using the input arguments (handled by the handleOneWayInvocation() method).

Otherwise, the server invokes the method, also using the input arguments, and sends

back the output values to the invoker (the handleTwoWayInvocation() procedure

handles this case).

Listing 4.1: A RPC IDL example.

1  interface Counter {
2      void increment();
3      int sum(int num);
4  };

Listing 4.1 shows the IDL definition for a simple server that provides two basic op-

erations over a counter variable. The one-way Counter:increment() procedure in-

crements the counter by one, whereas the two-way Counter:sum() procedure adds a

given number to the counter variable and returns the new total.
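From the client's perspective, the two invocation types of the Counter interface could be exercised as in the sketch below; the RpcChannel type, the placeholder bodies and the numeric procedure identifiers are assumptions made for illustration, not the actual client API.

#include <cstdint>
#include <vector>

using Buffer = std::vector<std::uint8_t>;  // marshalled arguments and results

// Hypothetical low-level channel offered by the RPC client (placeholder bodies).
struct RpcChannel {
    // Fire-and-forget: the request is sent and the call returns immediately.
    void oneWayInvocation(int oid, int pid, const Buffer& args) { (void)oid; (void)pid; (void)args; }
    // Request-reply: the call blocks until the marshalled output arrives.
    Buffer twoWayInvocation(int oid, int pid, const Buffer& args) { (void)oid; (void)pid; (void)args; return {}; }
};

// Invoking the Counter object of Listing 4.1 through both invocation types.
void useCounter(RpcChannel& channel, int counterOid) {
    channel.oneWayInvocation(counterOid, /*pid=*/1, Buffer{});  // increment(): no reply expected
    Buffer result = channel.twoWayInvocation(counterOid, /*pid=*/2,
                                             Buffer{/* marshalled int argument */});
    (void)result;  // holds the marshalled return value of sum()
}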

Algorithm 4.17 shows an implementation of the Counter server, normally denominated the RPC skeleton. The Counter:handleOneWayInvocation() procedure handles one-way invocations. It starts by performing a look-up that checks if the requested procedure exists in the object (error handling is omitted), which is followed by the call to the target procedure (lines 3 to 5). The only available one-way procedure is Counter:increment(), which increments sumTotal, the counter variable (lines 18 to 20).


Algorithm 4.17: A RPC object implementation.

var: this // the current RPC server object
constant: PROC_INCREMENT_ID // one-way INCREMENT procedure identification
constant: PROC_SUM_ID // two-way SUM procedure identification
constant: COUNTER_OID // the object identification
var: sumTotal // the accumulator variable

 1  procedure Counter:handleOneWayInvocation(pid, args)
 2      switch(pid)
 3          case(PROC_INCREMENT_ID)
 4              this.increment()
 5          end case
 6      end switch
 7  end procedure

 8  procedure Counter:handleTwoWayInvocation(pid, args)
 9      switch(pid)
10          case(PROC_SUM_ID)
11              num ← RPCSerialization:unmarshall(INT_TYPE, args)
12              result ← this.sum(num)
13              output ← RPCSerialization:marshall(INT_TYPE, result)
14              return output
15          end case
16      end switch
17  end procedure

18  procedure Counter:increment()
19      sumTotal ← sumTotal + 1
20  end procedure

21  procedure Counter:sum(num)
22      sumTotal ← sumTotal + num
23      return sumTotal
24  end procedure

25  procedure Counter:getOID()
26      return COUNTER_OID
27  end procedure

28  procedure Counter:getState()
29      state ← RPCSerialization:marshall(INT_TYPE, sumTotal)
30      return state
31  end procedure

32  procedure Counter:setState(state)
33      sumTotal ← RPCSerialization:unmarshall(INT_TYPE, state)
34  end procedure

On the other hand, the Counter:handleTwoWayInvocation() procedure handles the two-way invocations. It also checks if the requested procedure exists and then performs the two-way invocation (lines 10 to 15). The Counter:sum() procedure has one input variable, which has to be unmarshalled from the arguments (args) serialization


buffer (line 11). This is followed by a call to the Counter:sum() procedure using the

unmarshalled argument num (line 12). The result from the call to the procedure is then

marshalled into the serialization buffer output (line 13) and returned (line 14).
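For the integer case used by Counter:sum(), the marshalling helpers could look like the sketch below, assuming a plain byte-vector buffer and host byte order; the names mirror the role of RPCSerialization in the pseudocode, but the wire layout is an assumption.

#include <cstdint>
#include <cstring>
#include <vector>

using Buffer = std::vector<std::uint8_t>;  // serialization buffer

// Marshall a 32-bit integer into a serialization buffer (host byte order;
// a real implementation would fix a wire byte order).
Buffer marshallInt(std::int32_t value) {
    Buffer out(sizeof(value));
    std::memcpy(out.data(), &value, sizeof(value));
    return out;
}

// Unmarshall a 32-bit integer back from the serialization buffer.
std::int32_t unmarshallInt(const Buffer& in) {
    std::int32_t value = 0;
    std::memcpy(&value, in.data(), sizeof(value));
    return value;
}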

The Counter:getOID() function returns the Object Identifier (OID) of the object; in this example, it returns the COUNTER_OID constant. The state of the Counter object is returned by the Counter:getState() procedure. In this implementation it returns the complete state, and for this it only has to marshall sumTotal into a serialization buffer and return it. The counterpart of this procedure, the Counter:setState() procedure, performs the opposite action. It takes a serialization buffer containing the state, unmarshalls it, and updates the local object.

Algorithm 4.18: RPC service bootstrap.

1  procedure RPCService:open()
2      hrt ← createQoSEndpoint(HRT, MAX_RT_PRIO)
3      srt ← createQoSEndpoint(SRT, MED_RT_PRIO)
4      be ← createQoSEndpoint(BE, BE_PRIO)
5      sapQoSList ← {hrt, srt, be}
6      serviceSAPs ← createRPCSAPs(sapQoSList)
7      serviceSAPs.open()
8  end procedure

The RPC service is responsible for performing the invocations and managing the objects. We start by presenting its bootstrap sequence. Algorithm 4.18 shows the opening sequence of the RPC service, exposed by the RPCService:open() procedure. Lines 2 to 5 show the creation of the list containing the QoS endpoint properties. This is followed by the creation of the SAPs and their respective bootstrap (lines 6 and 7). The information characterizing the SAPs is associated with the IID of the RPC service by the runtime, so when a client resolves a service identifier it also retrieves the associated SAP information.

Algorithm 4.19 details the most relevant aspects of the RPC implementation. The

procedure RPCService:handleRPCServiceMsg() is the call-back that handles all in-

coming invocations (issued by the lower-level SAP infrastructure). The procedure takes

as input two arguments: channel, the TCP/IP channel used to support the invocation,

and; invocation, that contains all the relevant information to the invocation.

The invocation argument is decomposed into five separate variables (line 2): iid, the

invocation identification that is used in the reply to the client; type, the type of

invocation (one-way or two-way); oid, the object/server identification; pid, the

identification of the procedure to be invoked; and args, the arguments to be used in the


Algorithm 4.19: RPC service implementation.

1  procedure RPCService:handleRPCServiceMsg(channel,invocation)
2    (iid,type,oid,pid,args) ← invocation
3    output ← handleInvocation(type,oid,pid,args)
4    if RPCService:isFTEnabled() then
5      RPCService:getReplicationGroup().replicate(getState())
6    end if
7    if type = TwoWay then
8      channel.replyInvocation(iid,output)
9    end if
10 end procedure

11 procedure RPCService:handleInvocation(type,oid,pid,args)
12   rpcObject ← getRPCObject(oid)
13   switch(type)
14     case(OneWay)
15       rpcObject.handleOneWayInvocation(pid,args)
16       return ∅
17     end case
18     case(TwoWay)
19       return rpcObject.handleTwoWayInvocation(pid,args)
20     end case
21   end switch
22 end procedure

invocation.

The actual invocation is delegated to the RPCService:handleInvocation() proce-

dure (lines 11-22). After retrieving the object associated with the invocation (line 12),

the procedure checks the type of the invocation and performs the corresponding action.

If it is a one-way invocation, it simply delegates it to the object to perform the

invocation (lines 14-17). If it is a two-way invocation, the results of the operation are

returned back to the RPCService:handleRPCServiceMsg() procedure (lines 18-20).

After the invocation, and if the RPC service was bootstrapped with fault-tolerance (lines

4-6), the state of the RPC service is synchronized across the replica set by the replication group

infrastructure (line 5). If the invocation returns an output value (two-way invocations),

it is then sent back to the client (line 8).

The creation of an RPC client was already described in Chapter 3, more specifically

in Listing 3.5. The bootstrap and invocation procedures of the RPC client are shown

in Algorithm 4.20. The bootstrap sequence of the RPC client is implemented within

the RPCServiceClient:open() procedure, that takes as input parameters: the sid

of the RPC service; the iid of the instance that the client will bind to, and; the

client parameters. The initial step is to retrieve the information associated with the


Algorithm 4.20: RPC client implementation.

var: this // the current RPC client object
var: channel // the low level connection object

1  procedure RPCServiceClient:open(sid, iid, clientParams)
2    queryInstanceInfoQuery ← this.createFindInstanceQuery(sid,iid)
3    discovery ← this.getRuntime().getOverlayInterface().getDiscovery()
4    queryInstanceInfo ← discovery.executeQuery(queryInstanceInfoQuery)
5    channel ← this.createRPCChannel(queryInstanceInfo.getSAPs(),clientParams)
6  end procedure

7  procedure RPCServiceClient:twoWayInvocation(oid,pid,args)
8    return (channel.twoWayInvocation(oid,pid,args))
9  end procedure

10 procedure RPCServiceClient:oneWayInvocation(oid,pid,args)
11   channel.oneWayInvocation(oid,pid,args)
12 end procedure

service instance (lines 1-4). It first starts by creating the query message, through the

RPCServiceClient:createFindInstanceQuery() procedure, using the sid and

iid arguments (line 2). This is followed by the retrieval of a reference to the discovery

service (line 3), that is necessary to execute the query (line 4). This process ends with

the creation of the network channel using the query reply, with the information about

the available access points, and the selected level of QoS that is enclosed within the

client parameters (line 5).

The RPCServiceClient:twoWayInvocation() procedure (lines 7-9) is used to per-

form two-way invocations, while the RPCServiceClient:oneWayInvocation() pro-

cedure (lines 10-12) handles one-way invocations. They both use the RPC network

channel to perform the low-level remote invocation, that is, creating the invocation packet

and sending it through the network channel. Contrary to its one-way counterpart, the

two-way operation must wait for the reply packet before returning to the caller.

Semi-Active Fault-Tolerance Support

The middleware offers an extensible fault-tolerance infrastructure that is able to ac-

commodate different types of replication policies.

Figure 4.20 depicts the currently implemented fault-tolerance policy in the overlay. Fig-

ure 4.20a) shows the RPC service without FT support. In this case, upon the reception

of an invocation, the RPC service executes the invocation and replies immediately to

the client, as no replication is to be performed.

Figure 4.20b) shows the RPC with semi-active fault-tolerance enabled. The primary


Figure 4.20: RPC service architecture without (left) and with (right) semi-active FT.

node, upon reception of an invocation (step 1), uses the replication group to update all

the replicas (steps 2 and 3). After the replication is completed, that is, when all the

acknowledgments have been received by the primary node (steps 4 and 5), it sends the

result of the invocation back to the RPC client (step 6).

Algorithm 4.21: Semi-active replication implementation.

1  procedure SemiActiveReplicationGroup:replicate(replicationObject)
2    if IsPrimary() then
3      replicationRequestList ← ∅
4      for replica in replicaGroup do
5        replicationRequest ← replica.sendMessage(replicationObject)
6        replicationRequestList.add(replicationRequest)
7      end for
8      replicationRequestList.waitForCompletion()
9    end if
10 end procedure

Algorithm 4.21 shows the algorithm used for implementing semi-active replication.

This procedure is only called in the primary peer of the replication group. After

receiving a replication object the replicate procedure sends a replication message

to all the replicas that are present in the replication group (the acknowledgments were

omitted for clarity).
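
As an illustration only, the primary-side fan-out of Algorithm 4.21 could be expressed in C++ roughly as follows; the Replica handle and its sendMessage() method are hypothetical placeholders for the replication group's messaging layer:

#include <future>
#include <vector>

// Hypothetical replica handle: sendMessage() ships the replication object and
// returns a future that completes when the replica acknowledges it.
struct Replica {
    std::future<void> sendMessage(const std::vector<uint8_t>& replicationObject) {
        // Placeholder: a real implementation would write to the replica's channel.
        return std::async(std::launch::async, [replicationObject] { /* ack */ });
    }
};

class SemiActiveReplicationGroup {
public:
    explicit SemiActiveReplicationGroup(bool primary) : primary_(primary) {}

    // Mirrors Algorithm 4.21: only the primary replicates, and it blocks until
    // every replica has acknowledged the update before the caller replies.
    void replicate(const std::vector<uint8_t>& replicationObject) {
        if (!primary_) return;
        std::vector<std::future<void>> pending;
        for (auto& replica : replicaGroup_)
            pending.push_back(replica.sendMessage(replicationObject));
        for (auto& request : pending)
            request.wait();  // waitForCompletion()
    }

private:
    bool primary_;
    std::vector<Replica> replicaGroup_;
};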

Algorithm 4.22 shows the RPCService:onReplication call-back that is used by the


Algorithm 4.22: Service’s replication callback.

var: this // the current RPC service object

1  procedure RPCService:onReplication(replicationObject)
2    switch(replicationObject.getType())
3      case(State)
4        this.setState(replicationObject)
5        return ∅
6      end case
7      case(Invocation)
8        (iid,type,oid,pid,args) ← replicationObject
9        return this.handleInvocation(iid,type,oid,pid,args)
10     end case
11   end switch
12 end procedure

13 procedure RPCService:setState(replicationObject)
14   (oid,state) ← replicationObject
15   rpcObject ← this.getRPCObject(oid)
16   rpcObject.setState(state)
17 end procedure

replication group to perform the state update. In the current implementation, we

perform replication by synchronizing the state of the RPC service among the members

of the replication group (lines 3-6). The RPCService:setState() procedure retrieves

the object identification and state serialization buffer from the replicationObject

variable (line 14). This is followed by the look-up of the target object (line 15), that

is then used to update the state of the object (line 16). Our RPC implementation can

be further extended to support replication based on the execution of the invocations.

We present a possible implementation in lines 7 to 10.

However, this implementation is only valid for single-threaded object implementations

without non-deterministic source code, such as calls to the gettimeofday system call. The

presence of multiple threads in a replica can alter the sequence of state updates, as

the thread scheduling is controlled by the underlying operating system, and can lead

to inconsistent states. The presence of non-deterministic source code in the server's

implementation can lead to inconsistent states if the replication is based on the re-

execution of the invocations by each replica. For example, if a server implementation

uses the gettimeofday system call then the execution of this system call will have a

different value on each replica, leading to an inconsistent state. Several techniques have

been proposed to address these problems [14, 129, 130].
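
A minimal example of the problem, assuming a hypothetical server object whose state mixes in the local clock, is sketched below in C++; re-executing stamp() on each replica yields a different lastUpdate value and, therefore, divergent states:

#include <sys/time.h>
#include <cstdint>

// Hypothetical server state that becomes inconsistent under re-execution:
// each replica that re-executes stamp() reads its own local clock, so the
// replicas' lastUpdate fields diverge even though they ran the same invocation.
struct NonDeterministicServer {
    int64_t lastUpdate = 0;
    void stamp() {
        timeval tv;
        gettimeofday(&tv, nullptr);               // different on every replica
        lastUpdate = int64_t(tv.tv_sec) * 1000000 + tv.tv_usec;
    }
};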


Fault-Tolerance Infrastructure Extensibility

To illustrate the extensibility of our fault-tolerance infrastructure, we provided the

algorithms necessary to implement passive replication. An overview of the architecture

of the semi-active and passive replication policies is shown in Figures 4.20 and 4.21, respectively.

Passive Replication

Figure 4.21: RPC service with passive replication.

Passive replication [75] is interesting from the point of view of RT integration because

it is associated with lower latency and lower resource requirements, such as CPU, as

shown in our previous work [6]. However, this is only feasible through the relaxation

of the state consistency among the replication group members. This is accomplished

by avoiding immediate replication, as performed in semi-active replication. Instead,

after receiving an invocation (step 1), the replication data is buffered and periodically

sent to the replicas (step 2). Because the primary node does not need to wait for the

acknowledgments, it can immediately return the result of the invocation to the RPC

client (step 3). Each replica periodically receives the updates (step 4), processes and

acknowledges them back to the primary of the replication group (step 5).

Algorithm 4.23 shows the algorithms needed to provide passive replication. The

PassiveReplicationGroup:replicate() procedure, instead of immediately repli-

cating the data as done in semi-active replication, queues the data for later replica-

tion. The replication is performed periodically, using a user-defined period, by the

PassiveReplicationGroup:timer() procedure (lines 6-14). To achieve a better

throughput, it sends a batch message containing all the replication data that was


Algorithm 4.23: Passive Fault-Tolerance implementation.

var: this // the current passive replication group object

1  procedure PassiveReplicationGroup:replicate(replicationObject)
2    if this.IsPrimary() then
3      this.enqueue(replicationObject)
4    end if
5  end procedure

6  procedure PassiveReplicationGroup:timer()
7    replicationBatch ← this.dequeueAll()
8    replicationRequestList ← ∅
9    for replica in replicaGroup do
10     replicationRequest ← replica.sendMessage(replicationBatch)
11     replicationRequestList.add(replicationRequest)
12   end for
13   replicationRequestList.waitForCompletion()
14 end procedure

15 procedure RPCService:onReplication(replicationObject)
16   switch(replicationObject.getType())
17     ... // (continuation of Algorithm 4.22)
18     case(BatchMessage)
19       replyList ← ∅
20       for item in replicationObject do
21         switch(item.getType())
22           case(State)
23             this.setState(item)
24           end case
25           case(Invocation)
26             (iid,type,oid,pid,args) ← item
27             replyList.add(handleInvocation(iid,type,oid,pid,args))
28           end case
29         end switch
30       end for
31       return replyList
32     end case
33   end switch
34 end procedure

previously enqueued. In order to use passive replication, the support for a batch message

is introduced in the RPCService:onReplication() procedure (lines 18-32).

that is contained in the batch message, it checks if it is a state transfer or an invocation.

In case of a state transfer, it updates the service using the setState() procedure (line

23). Otherwise, it is handling an invocation request and it has to perform the invocation

and store the result in the replyList variable (lines 25 to 28), which is used to return

the output values for all the batched invocations to the replication group infrastructure

(and is sent back to the primary).
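
A possible C++ sketch of the queue-and-flush behaviour of Algorithm 4.23 is shown below; the Replica handle, sendBatch() and the externally driven timer() are assumptions used only for illustration:

#include <future>
#include <mutex>
#include <vector>

using ReplicationObject = std::vector<uint8_t>;

// Hypothetical replica handle, as in the semi-active sketch above.
struct Replica {
    std::future<void> sendBatch(const std::vector<ReplicationObject>& batch) {
        return std::async(std::launch::async, [batch] { /* deliver + ack */ });
    }
};

class PassiveReplicationGroup {
public:
    // replicate(): the primary only queues the update and returns immediately,
    // so the invocation reply is not delayed by the replicas.
    void replicate(const ReplicationObject& obj) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push_back(obj);
    }

    // timer(): called with a user-defined period; drains the queue into a
    // single batch and ships it to every replica.
    void timer() {
        std::vector<ReplicationObject> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(queue_);
        }
        if (batch.empty()) return;
        std::vector<std::future<void>> pending;
        for (auto& replica : replicaGroup_)
            pending.push_back(replica.sendBatch(batch));
        for (auto& request : pending)
            request.wait();
    }

private:
    std::mutex mutex_;
    std::vector<ReplicationObject> queue_;
    std::vector<Replica> replicaGroup_;
};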


4.2.2 Actuator

One of the most important services in public information systems, for both railroads and

light trains, is the display of information at train stations about inbound and outbound

compositions, such as their track number and estimated time of arrival. The actuator

service allows a client to execute a command in a set of sensor nodes, such as displaying

a string in a set of information panels. These panels are implemented by leaf peers.

Figure 4.22: Actuator service layout.

Figure 4.22 shows the deployment of an actuator service instance while using 2 replicas.

The primary instance binds to each panel in the set, while the replicas make pre-bind

connections (shown as dashed lines).

Figure 4.23 shows an overview of the actuator service. The client starts by choosing

and binding to the appropriate SAP of the actuator service. To display a message on

the set of panels, the client sends a command (step 1) to the actuator service. After

receiving the command, the service sends it to the sensors (step 2), waits for their

acknowledgments (step 3), and then acknowledges the client itself (step 4).

Algorithm 4.24 shows the initial setup of the actuator service. As with the RPC service,

the initial steps focus on the construction and initialization of the service access points

(lines 2-7). If needed, a service must extend the generic class and augment it with the

service-specific arguments. Unlike the RPC service, the actuator service makes use

of this capability by introducing an additional panel list parameter. Before processing


Figure 4.23: Actuator service overview.

Algorithm 4.24: Actuator service bootstrap.

var: this // the current actuator service object
var: panelGroup // the panel communication group object

1  procedure ActuatorService:open(serviceArgs)
2    hrt ← createQoSEndpoint(HRT, MAX RT PRIO)
3    srt ← createQoSEndpoint(SRT, MED RT PRIO)
4    be ← createQoSEndpoint(BE, BE PRIO)
5    sapQoSList ← {hrt,srt,be}
6    serviceSAPs ← createActuatorSAPs(sapQoSList)
7    serviceSAPs.open()
8    actuatorServiceArgs ← downcast(serviceArgs)
9    for panel in actuatorServiceArgs.getPanelList() do
10     panelChannel ← createPanelChannel(panel)
11     panelGroup.add(panelChannel)
12   end for
13 end procedure

this information, the actuator service must downcast the serviceArgs to its concrete

implementation (line 8). Then, using the panel list, the actuator creates a network

channel for each of the panels and stores them in a list (lines 9-12).

Algorithm 4.25 shows the main algorithm present in the actuator service. The proce-

dure ActuatorService:handleAction() is the call-back that is executed upon the

reception of a new action by the actuator service. The actuator spreads the action across

all the panels (shown in Figure 4.23 as steps 2 and 3), using the channels previously


Algorithm 4.25: Actuator service implementation.

var: this // the current actuator service object
var: panelGroup // the panel communication group object

1  procedure ActuatorService:handleAction(action,channel)
2    actionRequestList ← ∅
3    for panel in panelGroup do
4      actionRequest ← panel.sendMessage(action)
5      actionRequestList.add(actionRequest)
6    end for
7    actionRequestList.waitForCompletion()
8    failedPanels ← ∅
9    for actionRequest in actionRequestList do
10     if actionRequest.failed() then
11       panelGroup.remove(actionRequest.getPanel())
12       failedPanels.add(actionRequest.getPanel())
13     end if
14   end for
15   ackMessage ← ActuatorService:createAckMessage(failedPanels)
16   channel.replyMessage(ackMessage)
17 end procedure

created in the bootstrap of the service (lines 2-6). Each failed panel is removed from

the service panel list (line 11) and stored in an auxiliary list (line 12). The procedure

ends with the creation of an acknowledgment message containing the list of failed panels

that is sent back to the client (lines 15 and 16).

Algorithm 4.26: Actuator client implementation.

var: this // the current actuator client object
var: channel // the low level connection object

1  procedure ActuatorServiceClient:open(sid, iid, clientParams)
2    queryInstanceInfoQuery ← ActuatorSvcClient:createFindInstanceQuery(sid,iid)
3    discovery ← this.getRuntime().getOverlayInterface().getDiscovery()
4    queryInstanceInfo ← discovery.executeQuery(queryInstanceInfoQuery)
5    channel ← this.createActuatorChannel(queryInstanceInfo.getSAPs(),clientParams)
6  end procedure

7  procedure ActuatorServiceClient:action(action)
8    actionRequest ← channel.sendMessage(action)
9    actionRequest.waitForCompletion()
10 end procedure

Algorithm 4.26 shows the initialization of the actuator client and the implementation of

the action operation. The ActuatorServiceClient:open() procedure exposes the

bootstrap of the client, following the same implementation as the RPC service. A query


to find the information about the service instance is created and sent over the discovery

service. Using the localization information retrieved in the query reply, a channel is

created to that instance (lines 3-6). The low-level socket operations are handled by the

ActuatorServiceClient:action() procedure (shown in Figure 4.23 in step 1). It

sends the action through the channel and waits for the corresponding acknowledgment.

Actuator Fault-Tolerance

The service does not use the fault-tolerance support for data synchronization (as in the

RPC service), but instead uses the replicas to pre-bind to the panels to minimize the

recovery time. Figure 4.24 shows the architectural details of the actuator service with

FT support.

Figure 4.24: Actuator fault-tolerance support.

In the event of a failure of the primary peer, the newly elected primary already has

pre-bound connections to all the panels in the set, thus minimizing recovery latency. After rebinding

to the new primary, the client reissues the failed action. While we could do the same

using multiple service instances, the actuator client would have to know about these

multiple instances, and switch among them in the presence of failures. Thus, using the

fault-tolerance infrastructure avoids this issue, and allows the client to transparently

switch over to the new running primary.


4.2.3 Streaming

The streaming of both video and audio in public information systems is an important

component in the management of train stations, especially in CCTV systems. The

streaming service allows the streaming of a data flow, such as video and audio, from

streamers to clients. While there is a considerable amount of work addressing streaming

over P2P networks [131, 132], we have chosen to implement it at a higher level to allow

us to provide an alternative example of an efficient streaming implementation, with

fault-tolerance support, on a general purpose middleware system.

Figure 4.25: Streaming service layout.

Figure 4.25 shows the deployment of a streaming service instance while using two

replicas. A leaf peer, referred to as the streamer, connects to all the members of the

replication group.

Figure 4.26 shows the architecture details of the streaming service. At bootstrap, the

streaming service connects to the streamer (step 1) and starts receiving the stream (step

2). Afterwards, a client connects to the streaming service and requests a stream (step

3). The server allocates a stream session and the client starts receiving the stream from

the service (step 4).

Each client is handled by a stream session, which was designed to support transcoding.

The term transcoding refers to the capability of converting a stream from one encoding,

such as raw data, to a different encoding, such as the H.264 standard [133]. The use

of transcoding allows the streaming service to soften the compression ratio of streams


Figure 4.26: Streaming service architecture.

to support lower performance computing devices. At the same time, it also enables a

reduction of bandwidth usage, through a higher compression ratio, for high performance

computing devices. However, in our current implementation, we do not implement any

encoding in this example; that is, we apply the identity filter.

Algorithm 4.27: Stream service bootstrap.

var: this // the current streaming service object
var: streamServiceArgs // the streaming service arguments
var: streamerChannel // the streamer channel object

1  procedure StreamService:open(serviceArgs)
2    hrt ← createQoSEndpoint(HRT, MAX RT PRIO)
3    srt ← createQoSEndpoint(SRT, MED RT PRIO)
4    be ← createQoSEndpoint(BE, BE PRIO)
5    sapQoSList ← {hrt,srt,be}
6    serviceSAPs ← this.createStreamSAPs(sapQoSList)
7    serviceSAPs.open()
8    streamServiceArgs ← downcast(serviceArgs)
9    streamerInfo ← streamServiceArgs.getStreamerInfo()
10   streamerChannel ← this.createStreamerChannel(streamerInfo)
11 end procedure

Algorithm 4.27 exposes the initialization process of the stream service. The bootstrap

process of the stream service is detailed in procedure StreamService:open(). The

initial setup creates and bootstraps the service access points (lines 2-7). The stream

service uses one additional parameter, the streamer endpoint. This parameter is used


to create a stream channel to the streamer (lines 8-10).

Algorithm 4.28: Stream service implementation.

var: this // the current streaming service object
var: streamSessions // the streaming session list
var: streamStore // the stream circular buffer
var: streamChannel // the streamer channel object

1  procedure StreamService:handleNewStreamServiceClient(client,sessionQoS)
2    streamSessions.add(createSession(client,sessionQoS))
3  end procedure

4  procedure StreamService:handleStreamerFrame(streamFrame)
5    for session in streamSessions do
6      session.processFrame(streamFrame)
7    end for
8  end procedure

9  procedure StreamSession:processFrame(streamFrame)
10   streamStore.add(streamFrame)
11   streamChannel.sendFrame(streamFrame)
12 end procedure

Algorithm 4.28 starts by exposing the procedure that handles a new incoming stream

client, the StreamService:handleNewStreamServiceClient() procedure. Upon

the arrival of a new client, the stream service creates a new session and stores it. The

StreamService:handleStreamerFrame() procedure handles incoming frames from

the streamer. When the service receives a new frame, it updates every active session

(lines 5 to 7) through the StreamSession:processFrame() procedure. Currently, a

session only stores the received frames in a circular buffer (whose size is pre-defined),

that will eventually substitute older frames with newer ones. The purpose of this buffer

is to suppress frame loss in the presence of a primary crash, allowing for the client to

request older frames to fix the damaged stream.

Algorithm 4.29 starts by describing the initialization process of the stream client. This

initialization follows the same sequence as with previously described clients. It retrieves

the information about the service instance, and then uses it to create a channel to

the service instance. Upon the reception of a new frame, by the stream client, the

StreamServiceClient:handleStreamFrame() procedure is executed.

Streaming Fault-Tolerance

Figure 4.27 shows the fault-tolerance support within the streaming service. The primary

server and the replicas all connect to the streamer, and receive the stream in parallel


Algorithm 4.29: Stream client implementation.

var: this // the current streaming client object
var: channel // the low level connection object

1  procedure StreamServiceClient:open(sid,iid,clientParams)
2    queryInstanceInfoQuery ← StreamSvcClient:createFindInstanceQuery(sid,iid)
3    discovery ← getRuntime().getOverlayInterface().getDiscovery()
4    queryInstanceInfo ← discovery.executeQuery(queryInstanceInfoQuery)
5    channel ← this.createStreamChannel(queryInstanceInfo.getSAPs(),clientParams)
6  end procedure

7  procedure StreamServiceClient:handleStreamFrame(streamFrame)
8    // application specific...
9  end procedure

Figure 4.27: Streaming service with fault-tolerance support.

(step 1). Each of the replicas stores the stream flow up to a maximum configurable

time, for example 5 minutes (step 2). When a stream client connects to the stream

service, it binds to the primary instance and starts receiving the data stream (step 3).

When a fault occurs in the primary, the client rebinds to the newly elected primary of

the replication group. As the client rebinds, it must inform the new primary what was

the last frame received. The new primary, through a new stream session, calculates the

missing data and sends it back to the client, thereafter resuming the normal stream

flow.


4.3 Support for Multi-Core Computing

The evolution of microprocessors has focused on the support for multi-core architectures

as a way to scale beyond current physical limits in manufacturing. This brings new

challenges to systems programmers as they must be able to deal with an ever increasing

potential for parallelism.

While coarse-grain parallelism can already be handled with current development frame-

works, such as MPI and OpenMP, they are aimed at best-effort tasks that do not have

a notion of deadline, and therefore are unable to support real-time. Furthermore, their

programming model is based on a set of low-level primitives that do not offer any type

of object-oriented programming support.

On the other hand, the use of object-oriented programming languages provides very

limited support for specifying object-to-object interactions and almost no parallelism

support. For example, in C/C++ the parallelism is achieved through the use of threads

or processes that are implemented in low-level C primitives that do not have any type

of object awareness.

For these reasons, fine-grained parallelism is hard to implement in a flexible and modular

fashion. While a considerable amount of research work has been done in threading

strategies with object awareness, such as the leader-followers pattern [11], they do not

offer support for resource reservation or regulated access between objects.

4.3.1 Object-Based Interactions

The object-oriented paradigm is based on the principle of using objects, which are data

structures containing data fields and methods, to develop computer programs. The

methods of an object allow manipulation of its internal state, which is composed by

its data fields. However, object-to-object interaction is not addressed by the object-

oriented paradigm. Recent work on component middleware systems [65, 66] addressed

this issue through the use of component-oriented models. However, component-based

programming offers a high-level approach that in our view is not able to address

important low-level object-to-object interactions, such as CPU partitioning, and fine-

grained parallelism.

The implementation of fine-grained parallelism frameworks has to support object-to-

object interactions that include direct and deferred calls. With direct calls (shown


in Figure 4.28a), the caller object enters the object-space of the callee, which might be

guarded through a mutex, and performs the target action. On the other hand, when the

target object enforces deferred calling, shown in Figure 4.28b), the caller object is unable

to perform the operation directly and must queue it. These requests are then handled

by a thread of the target object. The caller does not enter the callee object-space. This

pattern is commonly known as Active Object [114, 13].

(a) Direct calling.

(b) Deferred calling.

Figure 4.28: Object-to-Object interactions.

4.3.2 CPU Partitioning

CPU partitioning is an approach based on the isolation of individual cores or processors

to perform specific tasks, and is normally used to isolate real-time threads from potential

interference from other non real-time threads. Despite the large body of research on

real-time middleware systems that use general-purpose operating systems over Commercial-Off-

The-Shelf (COTS) hardware [3, 65, 66], to our knowledge, no real-time middleware

system, especially when combined with FT support, has ever employed a CPU partitioning

scheme (shielding) to further enhance real-time performance.

Figure 4.29 shows possible examples of CPU partitioning for 4 (Figure 4.29a), 6

(Figure 4.29b) and 8 core (Figure 4.29c) microprocessors. A more detailed explanation

of the resource reservation mechanisms is provided in Section 3.1.4. Now it suffices

to say that the partitions designated with OS contain the threads that belong to the

underlying operating system (in this case Linux). The partitions BE & RT contain the

threads for best-effort and soft real-time, and finally, the Isolated RT indicates that


(a) Quad-core partitioning. (b) Six-core partitioning. (c) Eight-core partitioning.

Figure 4.29: Examples of CPU Partitioning.

the partitions have dedicated cores that only host soft real-time threads, reducing the

scheduling latency caused by the switching between best-effort and real-time threads.
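
On Linux, one possible way to realize such shielding for a single thread is to combine CPU affinity with a real-time scheduling class, as in the hedged C++ sketch below; the chosen core and priority values are illustrative only, and the actual runtime may rely on cpusets/control groups instead:

#include <pthread.h>
#include <sched.h>

// Pins the calling thread to a single core and raises it to SCHED_FIFO.
// This is one possible way to realize an "Isolated RT" partition on Linux;
// the core number and priority are illustrative.
bool enterIsolatedRTPartition(int core, int rtPriority) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return false;
    sched_param param{};
    param.sched_priority = rtPriority;               // e.g. 80
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param) == 0;
}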

Our runtime can be seen as a set of low-level services that offer a set of high-level

abstractions for the implementation of high-level services. It was necessary to create a

mechanism that regulates access between the services in order to preserve the QoS

parameters of each individual service, that is, to regulate the interactions between

objects running on different partitions.

Figure 4.30 revisits the object-to-object interactions with the introduction of CPU

partitioning. Figure 4.30a shows object A making a direct call to operation op b1()

in object B. This normally implies that operation op b1() has a mutex to guard

any critical data structures. Even with priority boosting schemes, such as priority

inheritance, the use of mutexes can cause unbounded latencies. Subsequently, this would

break the isolation of partition Isolated RT and would defeat the purpose of using

CPU partitioning. In order to improve throughput, real-time threads can be co-located

with non real-time threads to maximize the use of the cores allocated to a particular

partition. The disadvantage of this approach is that the real-time threads are no longer

in an isolated environment, and so, the scheduling of non real-time threads can cause

interference in real-time threads. As in Figure 4.30b, a direct call involving objects

within the same partition is a valid option.

The use of deferred calling (shown in Figure 4.30c) avoids the problems of direct calling

when objects are allocated in different partitions. The call from object A is serialized

and queued, and a future is associated with the pending request. This call is later


(a) Direct calling with different partitions. (b) Direct calling within the same partition. (c) Deferred calling with different partitions.

Figure 4.30: Object-to-Object interactions with different partitions.

handled by a thread belonging to object B that dequeues it and executes the request,

updating the future with the respective result. The thread of object A, that was waiting

on the future, is woken and returns to op a1(). This execution model is commonly referred

to as worker-master [13].
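
The deferred-call interaction can be sketched with a minimal active object in C++, where callers enqueue a closure and wait on a future that the object's own thread fulfils; the class below is an illustration of the pattern, not the middleware's actual implementation:

#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

class ActiveObject {
public:
    ActiveObject() { worker_ = std::thread([this] { run(); }); }
    ~ActiveObject() {
        post([this] { stop_ = true; });
        worker_.join();
    }

    // Deferred call: the caller never enters the object-space directly; it
    // receives a future and blocks on it until the owner thread runs the call.
    std::future<int> deferredCall(std::function<int()> op) {
        auto task = std::make_shared<std::packaged_task<int()>>(std::move(op));
        auto result = task->get_future();
        post([task] { (*task)(); });
        return result;
    }

private:
    void post(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(job));
        }
        cv_.notify_one();
    }

    void run() {
        while (!stop_) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return !queue_.empty(); });
                job = std::move(queue_.front());
                queue_.pop();
            }
            job();  // executed by the callee's own thread, inside its partition
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    bool stop_ = false;
    std::thread worker_;
};

A caller in a different partition would invoke deferredCall() and block on the returned future, which corresponds to the thread of object A waiting on the future in the description above.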

4.3.3 Threading Strategies

A threading strategy defines how several threads interact in order to fulfill a goal, with

each strategy offering a trade-off between latency and throughput. Figure 4.31 presents

several well-known strategies that are implemented in our Support Framework, namely:

a) Leader-Followers [11]; b) Thread-Pool [114]; c) Thread-per-Connection [12], and d)

Thread-per-Request [13].

Leader-Followers (LF)

The leader-followers pattern (c.f. Figure 4.31a) [11] was designed to reduce context



Figure 4.31: Threading strategies.

switching overhead when multiple threads access a shared resource, such as a set of file

descriptors. This is a special kind of thread-pool where threads take turns as leaders, in

order to access the shared resource. If the shared resource is a descriptor set, such as

sockets, then, when a new event happens on a descriptor, the leader thread is notified

by the select system call. At this point, the leader removes the descriptor from the

set, elects a new leader, and then resumes the processing of the request associated with

the event.

In this case, our default implementation allows foreign threads to join the leader-

followers execution model. After joining, the foreign thread is inserted in the followers

thread set, waiting its turn to become a leader and process pending work. As soon as

the event reaches a final state (in case of success, error or timeout), the foreign thread

is removed from the followers set.
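
A much simplified sketch of the leader election loop is given below in C++; it snapshots the ready descriptors instead of removing them from the set, and a poll()-based handle set stands in for the select-based one described above:

#include <condition_variable>
#include <mutex>
#include <poll.h>
#include <vector>

// Simplified leader/followers loop over a shared descriptor set: exactly one
// thread (the leader) blocks in poll(); when an event arrives it promotes a
// follower to leader and processes the event concurrently.
class LeaderFollowers {
public:
    explicit LeaderFollowers(std::vector<pollfd> fds) : fds_(std::move(fds)) {}

    template <typename Handler>
    void join(Handler handleEvent) {
        for (;;) {  // sketch: loops forever processing events
            std::unique_lock<std::mutex> lock(mutex_);
            followers_.wait(lock, [this] { return !hasLeader_; });
            hasLeader_ = true;                      // this thread becomes the leader
            lock.unlock();

            ::poll(fds_.data(), fds_.size(), -1);   // only the leader touches fds_
            std::vector<pollfd> ready = fds_;       // snapshot events while still leader

            lock.lock();
            hasLeader_ = false;                     // promote a follower to leader
            followers_.notify_one();
            lock.unlock();

            handleEvent(ready);                     // process the event outside the lock
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable followers_;
    bool hasLeader_ = false;
    std::vector<pollfd> fds_;
};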

Thread-Pool (TP)

The thread-pool pattern (c.f. Figure 4.31b) [114] consists of a set of pre-spawned threads,

that normally are synchronized by a barrier primitive, such as select and read. This

pattern avoids the overhead and latency of dynamically creating threads to handle client

requests, but results in a loss of flexibility. In general, however, it is possible to adjust

the size of the pool in order to cope with environment changes.

Thread-per-Connection (TPC)

The thread-per-connection pattern (c.f. Figure 4.31c) [12] aims to provide minimum


latency time by avoiding request multiplexing, at the cost of having a dedicated thread

per connection.

Every SAP has a listening socket that is responsible for accepting new TCP/IP con-

nections, and that is usually managed by an Acceptor design pattern [13]. After accepting

a new connection, the Acceptor creates a new thread that will handle the connection

throughout its life-cycle. Given the one-to-one match between thread and connection,

it is not possible to allow foreign threads into the execution model without breaking

the correctness of the connection object, as it is not configured to allow multiple accesses

to low-level primitives such as the read system call. Because of this, any foreign thread

that invokes a synchronous operation on the connection object, has its request queued.

This is later processed by the thread that owns the connection.

Thread-per-Request (TPR)

The thread-per-request pattern (c.f. Figure 4.31d) [13] focuses on minimizing thread

usage, while trying to maximize the overall throughput on a set of network sockets.

This design pattern results from a combination of a low-level thread-per-connection

strategy with a high-level thread-pool strategy. This strategy is also referred to as Half-

Async/Half-Sync [13].

The role of the thread-per-connection strategy is to read and parse incoming packets,

and to enqueue them into an input queue to be processed by the workers of the thread-

pool. When a worker thread wants to send a packet, it also has to enqueue the packet

into an output queue.

Minimization of Network Induced Priority Inversion

Providing end-to-end QoS in a distributed environment needs a vertical approach,

starting at the network level (inside the OS layer). Previous research [134] focused

on the minimization of network-induced priority inversion through the enhancement of

Solaris’s network stack to support QoS. Additional work [3] extended this approach to

the runtime level by providing separate access points for requests of different priority.

Building on these principles, our runtime was built to preserve end-to-end QoS seman-

tics. To that end, each service publishes a set of access points, with associated QoS,

that will serve as entry points for client requests, thus avoiding request multiplexing.

This approach was based on TAO’s work on the minimization of priority inversion [3]

caused by the use of network multiplexing. The service access points are served by a

threading strategy that is statically configured during the bootstrap of the runtime.


Figure 4.32: End-to-End QoS propagation.

However, as TAO was designed to accommodate only one type of service, that is the

RPC service, it did not address the following aspects: service inter-dependencies and

resource reservation, more precisely, CPU shielding. In our middleware, each SAP is

served by an execution model, offering a flexible behavior.

4.3.4 An Execution Model for Multi-Core Computing

The lack of a design pattern capable of providing a flexible behavior that leverages the

use of multi-core processors through CPU reservation and partitioning, while providing

support for a configurable threading strategy, motivated the creation of the Execution

Model/Context design pattern.

Figure 4.33: RPC service using CPU partitioning on a quad-core processor.

Figure 4.33 shows an overview of the RPC service while using CPU partitioning. The

Isolated RT partition, containing core 1, supports the handling of high priority RT

invocations, whereas the BE & RT partition, containing cores 2 and 3, supports the


handling of medium priority RT invocations and best-effort invocations. Each SAP

features a thread-per-connection (TPC) threading strategy, but it can use any of the

previously described strategies.

Figure 4.34: Invocation across two distinct partitions.

Figure 4.34 shows the interaction of a medium priority RT invocation, which is handled

by a thread that belongs to a med RT SAP that resides in the BE & RT partition,

with a high priority server that resides in the Isolated RT partition. While any thread

belonging to the high RT SAP could directly interact with a high priority server, as

they reside in the same partition, this should not happen when the interaction was

originated by a thread belonging to a different partition. This last interaction could

cause a priority inversion on the threading strategy that is supporting the high priority

server.

The first part of the execution model/context pattern, the execution model sub-pattern,

allows an entity to regulate the acceptance of foreign threads, that is, the threads that

belong to other execution models, within its computing model. The rationale behind

this principle resides in the fact that an application might reside in a dedicated core and

the interaction with a foreign thread could cause cache line thrashing, or simply break

the isolation for some real-time threads.

The second sub-pattern is the execution context. Its role is to efficiently manage the

call stack through the use of Thread-Specific Storage (TSS). This allows the execution

model to retrieve the necessary information about a thread, for example the partition

that is assigned to the thread, and use it to regulate the behavior of the thread that is


interacting with it. For example, it prevents an isolated real-time thread that belongs to

an isolated execution model hosted on an isolated real-time partition, from participating

in a foreign execution model, that would break the isolation principle and result in

non-deterministic behavior (for example, by propagating interrupts from non-isolated

real-time threads into the isolated core).

The internals of the Execution Model/Execution Context (EM/EC) design pattern are

depicted in Figure 4.35 showing the interaction between three distinct execution models.

When a thread that belongs to EM0 calls an operation on EM1, it effectively enters a

new computational domain. An operation can either be synchronous or asynchronous.

If it is asynchronous, then the requesting EM0 will not participate in the computing

effort of EM1.

Figure 4.35: Execution Model Pattern.

On the other hand, if the operation is synchronous, then it must check whether the last

EM, at the top of the execution context call stack, allows its thread to participate

in the threading strategy of EM1. If the thread is allowed to join the threading strategy,

then it participates in the computing effort until it reaches a final state (that is operation

successful, error, or timeout). When it reaches the final state, it backtracks to the

requesting EM, in this case EM0 by popping the context from the stack. The operation

being performed on EM1 could continue the call chain by executing an operation on

EM2, and if so, this process would repeat itself.

If the requesting EM0 does not allow its threads to join EM1, then the operation must

be enqueued for future processing by a thread within the threading strategy of EM1. If

EM1 embodies a passive entity, i.e. an object that does not have active threads running


inside its scope, then the EM is considered a NO-OP EM. In this scenario, it is not

possible to enqueue the request because there are no threads to process it, so an error is

returned to EM0 (this should only happen in a configuration error). Otherwise, if EM1

is an active object, then if the queue buffer is not full, the request is enqueued and a

reply future is created. As the operation has synchronous semantics, the thread (that

belongs to EM0) must wait for the token to reach its final state before returning to its

originating EM.

Algorithm 4.30: Joining an Execution Model.

var: this // the current Execution Model object

1  procedure ExecutionModel:join(event,timeout)
2    ec ← TSS:getExecutionContext()
3    topEM ← ec.peekExecutionModel()
4    joinable ← topEM.allowsMigration(this)
5    if not joinable then
6      throw(ExecutionModelException)
7    end if
8    try
9      ec.pushEnvironment(this,event,timeout)
10     ts ← this.getThreadingStrategy()
11     ts.join(event,timeout)
12   catch(ThreadingStrategyException)
13     ec.popEnvironment()
14     throw(ExecutionModelException)
15   catch(ExecutionContextException)
16     throw(ExecutionModelException)
17   end try
18 end procedure

Algorithm 4.30 presents the ExecutionModel:join() procedure, that acts as the

entry point for every thread wanting to join the execution model. The procedure takes

two arguments, an event and a timeout. The event represents an uncompleted

operation belonging to the execution model, e.g. an unreceived packet reply from

a socket, that must be completed before the deadline given by timeout. It starts

by retrieving the Execution Context stored in Thread-Specific Storage (TSS) (line 2).

This allows the execution context to be private to the thread which owns it, avoiding

synchronized access to this data. At line 3, we retrieve the current, and also the last,

execution model that the thread has entered. If this last execution model does not

allow its threads to migrate to the new execution model, then an exception is raised and the

join process is aborted. Otherwise, the thread joins the new execution model, by first

pushing the call stack with the information regarding the join (a new tuple containing

the new execution model, event and timeout) (line 9). This is followed by the thread


joining the threading strategy (Lines 10-11). If the threading strategy does not allow

the thread to join it, then an exception is raised and the join is aborted. Independently

of the success or failure of the join, the call stack is popped, thus eliminating the

information regarding this completed join.

Algorithm 4.31: Execution Context stack management.

var: this // the current Execution Context object
var: stack // the environment stack object

1  procedure ExecutionContext:pushEnvironment(em,event,timeout)
2    topEnv ← stack.top()
3    if timeout > topEnv.getTimeout() then
4      throw(ExecutionModelException)
5    end if
6    if topEnv.getExecutionModel() = em & topEnv.getEvent() = event then
7      topEnv.incrementNestingCounter()
8    else
9      nesting counter ← 1
10     context ← createContextItem(em,event,timeout,nesting counter)
11     stack.push(context)
12   end if
13 end procedure

14 procedure ExecutionContext:popEnvironment()
15   topEnv ← stack.top()
16   topEnv.decrementNestingCounter()
17   if topEnv.getNestingCounter() = 0 then
18     stack.pop()
19   end if
20 end procedure

21 procedure ExecutionContext:peekExecutionModel()
22   return stack.top().getExecutionModel()
23 end procedure

Algorithm 4.31 shows the most relevant procedures of the execution context. The

ExecutionContext:pushEnvironment() procedure is responsible for pushing a new

execution environment onto the call stack. It starts by checking whether the timeout,

belonging to the new environment, violates the previously established deadline

(that belongs to the last execution model); if it does, an exception is raised

(Lines 3-5). If a thread is recursive, i.e. it enters multiple times in the same execution

model, then instead of creating a new execution environment and pushing it onto the

stack, it simply increments a nesting counter, which represents the number of times a

thread has reentered this execution domain (Lines 7-8). Otherwise (Lines 9-11), a new

execution environment (with the nesting counter set to 1) is created and pushed into

the stack. The ExecutionContext:popEnvironment() procedure eliminates the top


execution environment present in the call stack. It starts by decrementing the nesting

counter, and if it is equal to 0 then no recursive threads are present and the stack can

safely be popped. Otherwise, no further action is taken. The remaining procedure,

ExecutionContext:peekExecutionModel(), is an auxiliary procedure used to

peek at the top execution model associated with the current thread.
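
The execution context can be kept in thread-specific storage, as sketched below in C++; the Environment layout, the exception type and the use of a thread_local variable are illustrative assumptions that mirror Algorithm 4.31 rather than the actual implementation:

#include <stdexcept>
#include <vector>

class ExecutionModel;  // forward declaration; regulation logic omitted

// One environment per (execution model, event) pair, with a nesting counter
// for recursive re-entry, mirroring Algorithm 4.31.
struct Environment {
    const ExecutionModel* em;
    const void* event;
    long timeoutUs;
    int nesting;
};

class ExecutionContext {
public:
    void pushEnvironment(const ExecutionModel* em, const void* event, long timeoutUs) {
        if (!stack_.empty()) {
            Environment& top = stack_.back();
            if (timeoutUs > top.timeoutUs)
                throw std::runtime_error("join would violate the current deadline");
            if (top.em == em && top.event == event) {  // recursive re-entry
                ++top.nesting;
                return;
            }
        }
        stack_.push_back({em, event, timeoutUs, 1});
    }

    void popEnvironment() {
        Environment& top = stack_.back();
        if (--top.nesting == 0) stack_.pop_back();
    }

    const ExecutionModel* peekExecutionModel() const {
        return stack_.empty() ? nullptr : stack_.back().em;
    }

private:
    std::vector<Environment> stack_;
};

// Kept in thread-specific storage, so no locking is needed to access it.
thread_local ExecutionContext currentExecutionContext;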

Applying the EM/EC Pattern to the RPC Service

Figure 4.36 shows the RPC service using the EM/EC pattern. Each service access point

(SAP) is served by a thread-per-connection strategy that has a dedicated thread for

handling new connections, normally known as the Acceptor [13], that spawns a new

thread for each new client connection. Furthermore, the RPC service uses two CPU

partitions, an Isolated RT partition for supporting high priority RT invocations and a

BE & RT partition for supporting medium priority RT and best-effort invocations.

Figure 4.36: RPC implementation using the EM/EC pattern.

In Figure 4.36, each priority lane, the logical composition of the low-level socket handling

with the high-level server handling, is managed through a single execution model. Each

connection is handled by a thread that, after reading an invocation packet, uses the server

adapter to locate the target server and perform the invocation. As this approach does

not enqueue requests between the layers, it does not introduce additional sources of

latency. However, if the SAP that received the invocation request does not belong to


the same partition as the target server, then the request is enqueued in the execution model

containing the server. The invocation is later dequeued, in this case by a thread that is

handling the SAP, and executed. The reply is then enqueued in the execution model

that originated the invocation.

Algorithm 4.32: Implementation of the EM/EC pattern in the RPC service.

var: thisSocket // the current RPC socket object
var: thisService // the current RPC service object
var: timeout // the timeout associated with the invocation
var: rpcService // RPC service instance

1  procedure RPCServiceSocket:handleInput()
2    invocation ← getReadPacketFromSocket()
3    rpcService.handleRPCServiceMsg(thisSocket,invocation)
4  end procedure

5  procedure RPCServiceObject:handleTwoWayInvocation(pid,args)
6    try
7      event ← createInvocationEvent(pid,args)
8      thisService.getExecutionModel().join(event,timeout)
9      return event.getOutput()
10   catch(ExecutionModelException ex)
11     event.wait(timeout)
12     return event.getOutput()
13   end try
14 end procedure

Algorithm 4.32 provides the main details of the EM/EC pattern implementation in

the RPC service. The RPCServiceSocket:handleInput() procedure is the callback

that is used by the thread managing the connection when an input event has occurred

in the socket. After the packet is read from the socket, its processing is delegated to

the upper level of the service, through the RPCService:handleRPCServiceMsg()

procedure (shown previously in Algorithm 4.19). The server adapter is a bridge between

the layers, and is shown with a dashed outline. It starts by locating the server object

and delegating the invocation to it. The handling of a two-way invocation is im-

plemented in the RPCServiceObject:handleTwoWayInvocation() procedure (the

one-way invocation was omitted for clarity). If the invocation originated from a thread

belonging to server’s priority lane, more specifically from the socket that is handling

the connection, then is able to join the execution model of the server and help with the

computation (lines 7 to 9). On the other hand, if the invocation was originated from a

thread belonging to a execution model outside the server’s partition, then the request is

queued. After the threading strategy of the server executes the invocation, the request


is signaled as completed. At this point, the thread that originated the request is woken

in the wait() procedure (line 11) and the output is returned (line 12).

4.4 Runtime Bootstrap Parameters

The bootstrap of the core is implemented in the Core:open(args) method and adjusts

the behavior of the runtime during its life-cycle. The arguments are passed to the core

by using command line options. Table 4.1 shows the most relevant arguments present

in the system.

Property                   Meaning                                 Default

General use
resource reservation       Enables resource reservation            true
rr runtime                 Maximum global cpu runtime              10
rr period                  Maximum global cpu period               100

Overlay specific
default interface          Default NIC                             eth0
cell multicast interface   Default NIC for multicast               eth0
cell root discovery ip     IP address for root cell discovery      228.1.2.2
cell root discovery port   Port address for root cell discovery    2001
tree span i                Tree span at level i                    2
cell peers i               Maximum peers at tree level i           2
cell leafs i               Maximum leafs at tree level i           80

Table 4.1: Runtime and overlay parameters.

One of the most important flags in the system is the resource reservation support flag,

which is controlled by the --resource reservation command line option. Upon initializa-

tion, and if the resource reservation support flag is activated (the default behavior),

the core creates a QoS client and connects to the resource reservation daemon. The

--rr runtime parameter controls how much CPU time can be spent running in each

computational period, which in turn is defined by the --rr period parameter. Both

parameters are expressed in micro-seconds and are used to configure the underlying

Linux’s control groups.


The overlay is controlled by a set of specific command line options. The default network

interface card (NIC) to be used in the network communications is controlled by the

--default interface parameter. The --cell multicast interface defines the

network interface card to be used by the cell discovery mechanism. Furthermore, the

--cell root discovery ip and --cell root discovery port are used to specify

the IP address and port of the root multicast group. The --tree span i parameter

specifies the tree span for the ith level of the tree. The --cell peers i parameter

specifies the maximum number of peers in each cell at tree level i. Last, the maximum

number of leaf peers for every cell in tree level i is controlled by the --cell leafs i

parameter.

It is possible to automatically bootstrap an overlay during the initialization of the run-

time, using the --overlay command line option. For example, using --overlay=p3,

the core will look for a “libp3.so” in the current directory, and bootstrap it. Alterna-

tively, it is possible to programmatically attach an overlay to the runtime, c.f. Listing 3.1

in Chapter 3.

4.5 Summary

This chapter provided an overall view of the implementation of the runtime. We

presented an overlay implementation inspired by the P3 topology, detailing the three

mandatory peer-to-peer services: mesh, discovery, and fault-tolerance.

The chapter also provides a presentation of three high-level services that provide a proof-

of-concept for our runtime architecture, namely: a RPC service that implements the

traditional remote procedure call; an Actuator service that exemplifies an aggregation

service that uses the FT service solely to minimize rebind latency, and; a Streaming

Service that offers buffering capabilities to ensure stream integrity even in the presence

of faults.

Furthermore, the chapter provides an overview of the challenges faced in supporting

multi-core computing, followed by the presentation of our novel design pattern, the

Execution Model/Context, that provides an integrated solution for supporting multi-

core computing.

Last, the chapter ends with a short description of the options that may be used when

bootstrapping the runtime.


–Success consists in being successful, not in having potential for success. Any wide piece of ground is the potential site of a palace, but there's no palace till it's built.

Fernando Pessoa

5 Evaluation

This chapter provides an evaluation of the real-time performance of the middleware

while in the presence of the fault-tolerance and resource reservation mechanisms. The

chapter highlights the performance of the two most important parts in the system,

the overlay and the high-level services. This evaluation uses a set of benchmarks

that characterize key aspects of the infrastructure. The assessment of the overlay

infrastructure focuses on (a) membership (and recovery time), (b) query behavior, and

(c) service deployment performance, whereas the evaluation of the high-level services

focuses on (d) the impact of FT on service performance, (e) the impact of multiple clients

(using the RPC as test case), and finally, (f) a comparison with other platforms.

5.1 Evaluation Setup

The evaluation setup is composed of the physical infrastructure and the overlay config-

uration used to produce the benchmark results discussed throughout this chapter.

5.1.1 Physical Infrastructure

The physical infrastructure used to evaluate the middleware prototype consists of a

cluster of 20 quad-core nodes, equipped with AMD Phenom II X4 [email protected] CPUs

and 4GB of memory, totaling 80 cores and 80GB of memory. Each node was installed

with Ubuntu 10.10 and kernel 2.6.39-git12. Despite our earlier efforts to use the real-

time patch for Linux, known as the rt-preempt patch [135], this was not possible due

to bugs on the control group infrastructure. The purpose of this patch is to reduce


the number and length of non-preemptive sections in the Linux kernel, resulting in

less scheduling latency and jitter. Nevertheless, the 2.6.39 version incorporates most of

the advancements brought by the rt-branch, namely, threaded-irqs [136]. The physical

network infrastructure was a 100 Mbit/s Ethernet with a star topology.

5.1.2 Overlay Setup

At bootstrap, the middleware starts by building a peer-to-peer overlay with a user

specified number of peers and leaf peers. The peers are grouped in cells that are created

according to the rules of the underlying P2P framework, described in Chapter 4. Overlay

properties control the tree span and the maximum number of peers per cell at any given

depth.

Figure 5.1: Overlay evaluation setup.

Figure 5.1 shows the configuration used for all the benchmarks performed on the overlay.

The overlay forms a binary tree in which the first level, composed of the root cell, has four

peers, each cell on the second level has three peers, and the third, and last, level

has two peers per cell.

Figure 5.2 shows the physical layout used for the evaluation. Each peer is launched in a

separate node of the cluster, for a total of 18 cluster nodes. On the other hand, all the

leaf peers are launched in a single node. Last, the clients are either launched in the same

node where the leaf peers were launched, or in the remaining free node of the cluster. The

allocation of the clients and leaf peers on the same node was done to provide accurate

measurements in services, such as the streaming service, where the stream of data only

goes one way. Otherwise, the physical clock of both client and leaf nodes would have

to be accurately synchronized through specialized hardware.


Figure 5.2: Physical evaluation setup.

5.2 Benchmarks

We divided the benchmark suite into two separate categories, one focusing on the low-level overlay performance and the other on the high-level services. The main objective is to isolate key mechanisms, especially at the overlay level, that may interfere with the behavior of the services. A second objective is to create a solid benchmark facility to assess the impact of future overlay implementations on the overall middleware performance.

5.2.1 Overlay Benchmarks

The following benchmarks were designed to evaluate the performance of a P2P overlay

implementation. Figure 5.3 shows an overview of the different overlay benchmarks.

Membership Bind and Recovery

To evaluate the performance of the membership mechanism, we take two measurements: (a) the bind time, which reflects the time a node takes to negotiate its entry into the mesh, and (b) the rebind time, which comprises the recovery and rebinding (renegotiation) time that a node must undertake to deal with a faulty environment (Figure 5.3a). In our P2P overlay, this failure happens when a coordinator node crashes, leading to a fault on the containing cell, and subsequently to a fault in the tree mesh. The faulty cell recovers by electing a new coordinator node, allowing the children subtrees to rebind to the recovered cell. The time that it takes for a child subtree to rebind to the new coordinator is directly related to the size of its state (the serialized contents of the


(a) Membership bind & recovery.

(b) Querying. (c) Service deployment.

Figure 5.3: Overview of the overlay benchmarks.

subtree): the larger the subtree, the longer it will take to transfer its state to the new coordinator. So, in order to evaluate the worst-case scenario, after building the mesh, the coordinator of the root cell is crashed, forcing a rebind of the first-level cells.

Querying

One of the most fundamental aspects of P2P is its ability to efficiently find resources in the network. Given this, a measurement of the search mechanism is important to assess the performance of a given P2P implementation. To assess the worst-case scenario, we focused on measuring the Place of Launch (PoL) query, as shown in Figure 5.3b. In our current P2P implementation, a query is handled only at the root cell, since it has the best account of the resource usage across the mesh tree.

Service Deployment

In a cloud-like environment it is important to quickly deploy services, and so the goal of this benchmark is to profile the performance of such a mechanism in our overlay. This benchmark measures the latency associated with a service bootstrap with and without


FT. Figure 5.3c represents a request to launch a service on a peer that is yet to be discovered. After a suitable peer is found by the PoL query, the service is started. When a service is to be bootstrapped without FT support, the source creating the service only has to issue one PoL query, as no replicas are going to be bootstrapped. Otherwise, the primary of the replication group has to issue the same number of PoL queries as the number of replicas that it is bootstrapping.

5.2.2 Services Benchmarks

We wanted to evaluate the following parameters: (a) the impact of fault-tolerance mechanisms on priority-based real-time tasks; (b) the impact of fault-tolerance on isolated real-time tasks; and (c) a preliminary (latency-only) comparison with other mainstream middleware systems, such as TAO, ICE and RMI. We implemented three simple services to serve as benchmarks and one to inject load in the peers.

The maximum allowed priority for all benchmarks is 48. Priorities above 48, and up to 99, are reserved for the various low-level Linux kernel threads, namely the cgroup manager and IRQ handlers.

(a) RPC. (b) Actuator. (c) Streaming.

Figure 5.4: Network organization for the service benchmarks.

RPC

The RPC service (Figure 5.4a) executes a procedure in a foreign address space. This

is a standard service in any middleware system. A primary server receives a call from

a client, executes it, and updates the state in all service replicas. When all replicas

acknowledge the update, the primary server then replies to the client. In the absence of

161

Page 164: PhD Thesis

CHAPTER 5. EVALUATION

fault-tolerance mechanisms, the primary server executes the procedure and immediately

replies to the client.

To evaluate the RPC service we used the maximum available priority of 48. The remote procedure simply increments a counter and returns the value. We performed 1000 RPC calls in each run, with an invocation rate of 250 per second.
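For illustration, a minimal sketch of such a paced client loop is shown below; the rpc_increment() stub is a hypothetical placeholder for the actual middleware RPC proxy call, whose real interface is not shown here.

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical stand-in for the remote call to the RPC service; the real
// benchmark invokes the middleware proxy instead.
static long counter = 0;
long rpc_increment() { return ++counter; }

int main() {
    using clock = std::chrono::steady_clock;
    const auto period = std::chrono::microseconds(4000);   // 250 calls/s
    std::vector<double> latencies_ms;
    latencies_ms.reserve(1000);

    auto next = clock::now();
    for (int i = 0; i < 1000; ++i) {
        const auto start = clock::now();
        rpc_increment();                                    // remote counter increment
        latencies_ms.push_back(
            std::chrono::duration<double, std::milli>(clock::now() - start).count());
        next += period;                                     // fixed-rate pacing
        std::this_thread::sleep_until(next);
    }
    return 0;
}
```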

Actuator

The actuator service (Figure 5.4b) allows a client to execute a command on a set of panels controlled by leaf peers. This is used by EFACEC to display information about incoming and departing trains in a train station. After receiving the command, the primary server sends it to the panels, waits for their acknowledgments, and then acknowledges the client itself. The service does not use the fault-tolerance support for data synchronization (as in the RPC service), but instead pre-binds the replicas to the panels in the set.

We used 80 panels and a string of 14 bytes. The 80 panels are representative of a large real-world public information system in a light rail network. The string length represents the average size in current systems at EFACEC. We issued 1000 commands in each run, with an invocation rate of 250 per second.

Streaming

This service (Figure 5.4c) allows the streaming of a data flow (e.g. video, audio, events) from leaf peers to a client. This type of service is used by EFACEC to send and receive streams from train stations, namely to implement the CCTV subsystem. The primary server and the replicas all connect to the leaf peers and receive the stream in parallel. Each of the replicas stores the stream flow up to a maximum pre-defined time, for example 5 minutes. When a fault occurs in the primary, the client rebinds to the newly elected primary of the replication group. As the client rebinds, it must inform the new primary of the last frame it received. The new primary then calculates the missing data and sends it back to the client, thereafter resuming the normal stream flow.

We used a stream of 24 frames per second with 4 KB frames, resulting in a bitrate of 768 Kbit per second (24 frames/s × 4 KB × 8 bits = 768 Kbit/s). For example, this bitrate allows for a medium-quality MPEG-4 stream with a 480 x 272 resolution, matching the video stream used by EFACEC's CCTV. The client and leaf peers are located on the same machine, as this allows the determination of the one-way latency and jitter for the traffic. The stream was transmitted for 4 seconds in each run.


5.2.3 Load Generator

Complex distributed systems are prone to be affected by the presence of rogue services that can become a source of latency and jitter. We evaluate the impact of the presence of such entities by introducing a load generator service in each peer. The latter spawns as many threads as the logical core count of the CPU. Unless explicitly mentioned, the threads are allocated to the SCHED_FIFO scheduling class, with priority 48. This scheduling policy represents the worst-case scenario of unwanted computation. Given a desired load percentage p (in terms of the total available CPU time), each thread continuously generates random time intervals (up to a configurable maximum of 5ms). For each interval it computes the fraction of time during which it must compute so that the load is p. For example, if the desired load is 75% and the interval generated is 4ms, then the load generator must compute for 3ms and sleep for the remainder of that time lapse.
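As an illustration, the following is a minimal sketch of one load generator worker thread under the assumptions just described (busy-wait for a fraction p of each randomly sized interval, sleep for the rest); the names are ours and do not correspond to the actual service code. In the real service each worker is additionally placed in the SCHED_FIFO class at priority 48.

```cpp
#include <chrono>
#include <random>
#include <thread>

// One load-generator worker: busy-waits for a fraction p of each randomly
// sized interval (up to 5 ms) and sleeps for the rest, producing an average
// CPU load of p on its core.
void load_worker(double p) {
    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<int> interval_us(1, 5000);   // up to 5 ms
    for (;;) {
        const auto total = std::chrono::microseconds(interval_us(rng));
        const auto busy  = std::chrono::duration_cast<std::chrono::microseconds>(total * p);
        const auto start = std::chrono::steady_clock::now();
        while (std::chrono::steady_clock::now() - start < busy)
            ;                                                   // burn CPU
        std::this_thread::sleep_for(total - busy);              // idle for the rest
    }
}
```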

Each benchmark was run with increasing load values (in 5% steps), up to a maximum of 95%. For each of these configurations we ran the benchmark 16 times, and computed the average and the 95% confidence intervals (represented as error bars). A vertical dashed line at a load of 90% is used as a reference for the case where resource reservation is enabled.
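For reference, a small sketch of how the reported average and 95% confidence interval can be obtained from the 16 runs. The use of the Student t quantile for 15 degrees of freedom (approximately 2.131) is our assumption; the thesis does not state which interval estimator was used.

```cpp
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Mean and 95% confidence half-width (the error bar) for a small sample,
// e.g. the 16 runs of each benchmark configuration.
std::pair<double, double> mean_ci95(const std::vector<double>& x) {
    const double n = static_cast<double>(x.size());
    const double mean = std::accumulate(x.begin(), x.end(), 0.0) / n;
    double ss = 0.0;
    for (double v : x) ss += (v - mean) * (v - mean);
    const double sd  = std::sqrt(ss / (n - 1.0));   // sample standard deviation
    const double t95 = 2.131;                       // Student t, 15 degrees of freedom
    return {mean, t95 * sd / std::sqrt(n)};
}
```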

5.3 Overlay Evaluation

This section presents the results for the runs on the three benchmarks designed for evaluating the P2P overlay performance, namely: membership (bind and recovery), query, and service deployment. The benchmarks that present latency and jitter use a logarithmic scale, which distorts the error bars; this should be kept in mind when interpreting the results.

5.3.1 Membership Performance

These experiments estimate the impact on membership bind and recovery time as the peers are exposed to an increasing load. The membership mechanisms run with maximum priority (48) on each peer of the overlay.


Figure 5.5: Overlay bind (left) and rebind (right) performance. (Plots of latency in ms, log scale, versus load in %, with and without resource reservation.)

These two measurements are a key factor in the overall performance of the higher layers of the middleware, because a node is only fully functional when it is connected to the mesh. In the presence of a fault it is important to be able to quickly rebind and recover, to minimize the downtime of the low-level P2P services. This downtime, in turn, can become a source of latency for the high-level services, for example, the RPC service.

The membership bind time, depicted in Figure 5.5a, shows a linear increase in bind latency when resource reservation is disabled. This is expected: as the load increases, it creates additional interference on the threads of the mesh service. When the resource reservation mechanisms are enabled, the mesh service uses a portion of the resource reservation allocated to the runtime. The use of the resource reservation mechanisms allows for an almost constant latency, with some minor jitter for loads higher than 80%.

The rebind performance exhibits a behavior similar to the bind performance, although with a lower latency for loads below 80%. As with the bind benchmark, enabling the resource reservation mechanisms allows for a near-constant rebind latency with very small jitter.

5.3.2 Query Performance

The query performance is one of the most crucial aspects of every overlay implementation, because it is the basis of resource discovery. Figure 5.6 shows the results of performing the PoL query with and without resource reservation.

Figure 5.6: Overlay query performance. (Latency in ms, log scale, versus load in %, with and without resource reservation.)

The evaluation results show that, up to loads of 70%, the use of resource reservation introduces a small overhead, visible as a higher latency. This is explained by the fact that the execution model uses a Thread-per-Connection policy (without a connection pool), where a peer creates a new connection (using the desired level of QoS) to perform a query. When a neighbor peer receives a new connection (from the discovery service), it has to spawn a new thread to deal with the request. This process is repeated until a peer is able to handle the query, or the root cell is reached and a failure message is sent back to the originator peer. When using resource reservation, the creation of new threads must undergo an additional submission phase with the QoS daemon, and subsequently within the QoS infrastructure in Linux (control groups), which causes the latency increase observed with resource reservation. Nevertheless, from 70% to 95%, the resource reservation mechanism is able to provide stable behavior. In contrast, in the absence of the resource reservation mechanism, the query latency reaches a maximum of 400ms when the peers are subjected to a load of 95%.
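In outline, the discovery-side behavior described above corresponds to a loop like the one below. The Connection type and the helper functions, in particular qos_submit_current_thread(), are hypothetical placeholders for the middleware's transport layer and for the submission step to the QoS daemon.

```cpp
#include <thread>

struct Connection { /* accepted overlay connection (placeholder) */ };

// Hypothetical hooks; in the middleware these map to the transport layer
// and to the QoS daemon interface, respectively.
Connection* accept_connection()  { return new Connection; }
void handle_query(Connection* c) { delete c; }              // resolve or forward the PoL query
void qos_submit_current_thread() { /* register thread with the QoS daemon / cgroup */ }

void discovery_accept_loop(bool reservation_enabled) {
    for (;;) {
        Connection* c = accept_connection();
        // Thread-per-Connection: spawn one thread per incoming query connection.
        std::thread([c, reservation_enabled] {
            if (reservation_enabled)
                qos_submit_current_thread();                 // extra submission step adds latency
            handle_query(c);
        }).detach();
    }
}
```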

5.3.3 Service Deployment Performance

The quick allocation of services, and ultimately of resources, is a crucial aspect of scalable middleware infrastructures. Figure 5.7 shows the evaluation results for service deployment, with a varying number of replicas.


Figure 5.7: Overlay service deployment performance. (Latency in ms, log scale, versus load in %, for deployments with no FT and with 1, 2 and 4 replicas, with and without resource reservation.)

The results show that, without resource reservation, the system exhibits a linear increase of deployment time starting at loads of 30%, following the (linear) increase of the load injected into the system. Associated with this high latency, the results show high jitter throughout the service deployment. The maximum value registered was near 10s, for the deployment of the service with 4 replicas without resource reservation and a load of 95%. On the other hand, when the discovery service used the resource reservation mechanism, it exhibited a near-constant behavior, only showing a small increase of the deployment time when the service is deployed with FT. An increasing number of replicas brings additional latency to the deployment, as more queries need to be performed to discover additional sites to deploy the replicas. Naturally, the deployment of 4 replicas takes additional time, resulting in a maximum of around 100ms, still a 100-fold improvement over the 4-replica deployment without resource reservation. To conclude, the results show negligible jitter in all the deployment configurations when the resource reservation mechanism is activated.

5.4 Services Evaluation

Several aspects influence the behavior of the high-level services. Here, we present the two most important ones: the impact of FT mechanisms on service latency and the impact of resource reservation while enforcing FT policies. Additionally, we present


results that characterize the impact of the presence of multiple clients, using RPC as a test case. The evaluation of the system ends with a preliminary comparison with other closely related middleware systems.

5.4.1 Impact of FT Mechanisms in Service Latency

These experiments estimate the impact of the FT mechanisms on service latency and rebind latency, as the peers are subjected to increasing load. The services run with maximum priority (48) without resource reservation. To assess the scalability of the FT mechanisms we also vary the size of the replication group for the service across 2, 3 and 5 elements (1 primary server + 1, 2, 4 replicas). For the rebind latency, in the middle of the run, we crash the primary server. This is accomplished by invoking an auxiliary RPC object, initially loaded in every peer of the system. Finally, as a baseline reference, we present the results obtained with the same benchmarks but with all FT mechanisms disabled. In this case, no fault is injected, as no fault-tolerance is active.

The results for the runs can be seen in Figure 5.8. In general, the rebind latency presents a steeper increase when compared to the invocation latency, although the differences with a varying number of replicas are masked by jitter. The rebind process involves several steps: failure detection; election of a new primary server; discovery of the new primary server; and transfer of lost data. In each step, the increasing load introduces a new source of latency and jitter that accumulates in the overall rebind time. In this implementation the client must use the discovery service of the mesh to find the new primary server. This step could be optimized, for example, by keeping track of the replicas in the client. Despite this, the rebind latency remains fairly constant up to loads of 40% to 45%. The minimum and maximum rebind latencies for the RPC, Actuator and Streaming services are, respectively: 5.9ms, 5.7ms, 7.2ms, and 2823ms, 2068ms, 1087ms.

The invocation latencies depicted in Figure 5.8 show that, up to loads of 35%, the FT mechanisms introduce low overhead and low jitter. In the case of the RPC benchmark, which uses a more complex replica synchronization protocol, the overhead remains a constant factor in direct proportion to the number of replicas relative to the baseline case (no FT). The Actuator and Streaming services, with their simple (or non-existent) data synchronization protocols, follow the baseline very closely. Despite this, the Streaming service is far more CPU-intensive than the Actuator and therefore shows more impact from increasing loads. The minimum and maximum invocation latencies measured for the RPC, Actuator and Streaming services are, respectively: 0.1ms, 1.5ms, 1.1ms, and

259ms, 19ms, 96ms.

Figure 5.8: Service rebind time (left) and latency (right). (Per-service panels for RPC, Actuator and Stream; latency in ms, log scale, versus load in %, for the no-FT baseline and 1, 2 and 4 replicas.)

5.4.2 Real-Time and Resource Reservation Evaluation

In these runs we use the middleware’s QoS daemon to isolate the services by reserving

at least 10% of the available CPU time for the runtime that executes the service. The

remainning 90% are used for operating system tasks and for the Load Generator service.

Everything else is kept from the scenario described for the previous set of runs.
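As an illustration of the kind of reservation involved, the sketch below reserves roughly 10% of the RT CPU bandwidth for a control group and attaches the calling thread to it. It assumes a cgroup v1 hierarchy with the cpu controller mounted at /sys/fs/cgroup/cpu and RT group scheduling enabled; the actual reservation is performed by the QoS daemon through its own interface, which is not shown here.

```cpp
#include <fstream>
#include <string>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

// Reserve ~10% of the RT bandwidth (100 ms out of every 1 s) for a cgroup
// and move the calling thread into it. Assumes cgroup v1 with the cpu
// controller mounted at /sys/fs/cgroup/cpu and CONFIG_RT_GROUP_SCHED.
bool reserve_runtime_cpu(const std::string& group) {
    const std::string path = "/sys/fs/cgroup/cpu/" + group;
    mkdir(path.c_str(), 0755);                       // create the group (needs root)

    std::ofstream period(path + "/cpu.rt_period_us");
    std::ofstream runtime(path + "/cpu.rt_runtime_us");
    if (!period || !runtime) return false;
    period  << 1000000;                              // 1 s accounting period
    runtime << 100000;                               // 100 ms of RT time => 10%

    std::ofstream tasks(path + "/tasks");
    if (!tasks) return false;
    tasks << syscall(SYS_gettid);                    // attach this thread
    return static_cast<bool>(tasks);
}
```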

Impact of FT in Service Latency with Reservation

The results for the runs can be seen in Figure 5.9. The fact that the services are

now isolated, at least in terms of CPU, from the remainder of the system contributes to

their almost constant latencies and stability (low jitter) with increasing peer loads. The

invocation latency also shows the natural increase with the number of replicas. The

minimum and maximum rebind latencies for the RPC, Actuator and Streaming are,

respectively: 9.2ms, 10.2ms, 10.9ms, and 15.8ms, 18.7ms, 21.9ms. The minimum and

maximum invocation latencies for the RPC, Actuator and Streaming are, respectively:

0.1ms, 4.8ms, 1.1ms, and 1.0ms, 5.9ms, 1.9ms.

Relative to the previous set of runs, the latencies for low values of peer load with resource reservation activated are somewhat higher. For example, the ratios between the minimum rebind latencies with and without reservation for RPC, Actuator and Streaming are, respectively: 1.6, 1.8, and 1.5. This is explained by the overhead introduced by the reservation mechanisms (previously explained in Chapter 3). This overhead has a higher impact on the rebind latency than on the invocation latency, because the rebind process has a much shorter duration, and therefore the overhead represents a larger fraction of the total time. In other words, the overhead of the resource reservation setup on the invocation latency is amortized across the duration of the benchmark, for example, the 1000 calls performed to the RPC service.

Impact of Multiple Clients in RPC Latency

To evaluate the performance of the middleware in the presence of multiple clients with different priorities, we extended the RPC benchmark and introduced three service access points with distinct priorities, more precisely, 48, 24, and 0. The first two access points are served by a thread-per-connection model with scheduling class SCHED_FIFO (and priorities 48 and 24, respectively). The remaining SAP is served by threads with scheduling class SCHED_OTHER (with static priority 0).

This benchmark allows us to measure the impact of multiple clients on RT performance, especially the impact of low-priority clients on high-priority clients.

Figure 5.9: Rebind time and latency results with resource reservation. (Per-service panels for RPC, Actuator and Stream; latency in ms versus load in %, for the no-FT baseline and 1, 2 and 4 replicas.)

As in the previous RPC benchmark, the remote procedure increments a counter, but before returning the value it continuously computes a batch of arithmetic operations for 10ms. The objective is to evaluate the RT performance of the Linux scheduler and of the control group infrastructure. We used three clients with priorities 48, 24 and 0, and performed 1000 RPC calls in each run, with an invocation rate of 25 per second (corresponding to a deadline of 40ms). To evaluate the impact of different load conditions, we performed the benchmark under three load generator configurations, using priorities 48, 24 and 0.
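A sketch of how one such client can be set up is shown below: the thread is placed in the SCHED_FIFO class at the desired priority with pthread_setschedparam, and a call counts as a missed deadline when its response time exceeds the 40ms period. The rpc_busy_call() stub is a hypothetical placeholder for the actual remote invocation.

```cpp
#include <chrono>
#include <pthread.h>
#include <thread>

long rpc_busy_call() { return 0; }   // stand-in for the ~10 ms remote procedure

int run_client(int fifo_priority) {  // 48, 24, or 0 (0 => keep SCHED_OTHER)
    if (fifo_priority > 0) {
        sched_param sp{};
        sp.sched_priority = fifo_priority;
        pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);   // needs privileges
    }
    const auto deadline = std::chrono::milliseconds(40);          // 25 calls/s
    int missed = 0;
    auto next = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000; ++i) {
        const auto start = std::chrono::steady_clock::now();
        rpc_busy_call();
        if (std::chrono::steady_clock::now() - start > deadline)
            ++missed;                                              // deadline miss
        next += deadline;
        std::this_thread::sleep_until(next);
    }
    return missed;
}
```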

Figures 5.10a, 5.10c and 5.10e show the number of deadlines missed by each client without resource reservation, under an increasing load with priorities 0, 24 and 48, respectively. Figures 5.10b, 5.10d and 5.10f show the number of deadlines missed under the same conditions but with resource reservation enabled.

Without resource reservation, and if the load generator uses SCHED_OTHER threads with priority 0, the Linux scheduler is able to avoid any deadline miss. This is the expected outcome for clients using priorities 24 and 48, as they are served by SCHED_FIFO threads that are always scheduled ahead of any other scheduling class. The client using priority 0 (and the associated SCHED_OTHER threads) is also able to avoid any miss. This is explained by the good implementation of Linux's fair scheduler, which is able to handle loads of up to 95% of CPU time.

When the load generator uses SCHED_FIFO threads, the behavior starts to degrade for loads higher than 35%. In both cases, the client with priority 0 misses approximately 70% of its deadlines when the load reaches 95% of CPU time. This is explained by the CPU starvation caused by the load generator's high-priority RT threads. When the load generator uses priority 24, the client that uses priority 48 should not have any deadline misses. However, this is not the case: the client that uses priority 48 also experiences missed deadlines, although on a much smaller scale. This is due to priority inversion at the network interface card driver (whose IRQ is handled by a high-priority kernel thread).

When the load generator uses priority 48 (Figure 5.10e), this priority inversion is exacerbated. In addition, the contention between the load generator threads interferes with the remaining threads, due to their SCHED_FIFO scheduling. Threads of this type are only preempted by higher-priority threads; otherwise, they keep running until they voluntarily relinquish the CPU. But, as the load generator threads used the maximum

permitted priority, this caused unbounded latency in the middleware threads (even in the high-priority ones).

Figure 5.10: Missed deadlines without (left) and with (right) resource reservation. (Panels for load generator priorities 0, 24 and 48; number of missed deadlines for the clients with priorities 48, 24 and 0 versus load in %.)

With resource reservation, and when the load generator used priority 0, there were a few unexpected missed deadlines. A possible explanation is that we use a thread-per-connection strategy that creates a new thread for each new connection, with each new thread being submitted to the QoS daemon. This adds latency to the service and can cause some missed deadlines in the first invocations from the client. When the load generator uses priorities 24 and 48, it worsens the latency associated with the acceptance of new threads by the QoS daemon. However, additional analysis of the Linux kernel is still required to validate this hypothesis.

Figures 5.11a, 5.11c and 5.11e show the invocation latencies for each client without resource reservation, under an increasing load with priorities 0, 24 and 48. Figures 5.11b, 5.11d and 5.11f show the invocation latencies with resource reservation enabled.

The load generator using priority 0, Figures 5.11a and 5.11b (without and with resource reservation, respectively), only interferes with invocations using priority 0. When the load generator uses SCHED_FIFO threads with priorities 24 and 48, without the resource reservation mechanisms (Figures 5.11c and 5.11e), the performance starts to degrade at 35% of load. The client using priority 48 in Figure 5.11c should have a near-constant invocation latency, but due to priority inversion it presents a linear increase (although much smaller than for the other two priorities). Figure 5.11e shows the expected behavior: the load generator threads (using priority 48) cause a gradual latency increase in all the clients.

Figures 5.11d and 5.11f show the middleware performance with resource reservation enabled under load priorities of 24 and 48, respectively. A scheduling artifact is noticeable for invocations using priority 0: instead of remaining constant, their latency decreases as the load increases. The workload introduced by the RT threads of the load generator on the control group infrastructure, which is continuously forced to perform load balancing across the scheduling domains, causes a small jitter in the clients with priorities 24 and 48.

RPC Performance Comparison with Other Platforms

Figure 5.12 shows the measured invocation latencies for the RPC service as implemented in our middleware and in other mainstream platforms, using a single client and a single server and making 1000 RPC invocations at a rate of 250 invocations per second.


Figure 5.11: Invocation latency without (left) and with (right) resource reservation. (Panels for load generator priorities 0, 24 and 48; latency in ms, log scale, versus load in %, for the clients with priorities 48, 24 and 0.)


Figure 5.12: RPC invocation latency compared with reference middleware platforms (without fault-tolerance). (Latency in ms, log scale, versus load in %, for Stheno with and without resource reservation, ICE, TAO and RMI.)

As expected, RMI, implemented with Java SE, has the worst behavior, with minimum and maximum latencies of, respectively, 0.3ms and 8.9ms. TAO was optimized for real-time tasks through the RT-CORBA extension, exhibiting minimum and maximum latencies of, respectively, 0.3ms and 6.5ms. TAO's results were hampered by its strict adherence to the (bloated) IIOP specification. ICE, while less stable than TAO, is overall more efficient, with minimum and maximum latencies of, respectively, 0.1ms and 7.8ms. Despite the absence of RT support in ICE, its lightweight implementation (it does not use IIOP) provides good performance for low values of load. Our middleware implementation is able to offer minimum and maximum latencies of, respectively, 0.1ms and 14.6ms, without resource reservation. With resource reservation we achieve a maximum latency of just 0.1ms, by effectively isolating the service in terms of required resources. Our implementation without resource reservation exhibits a mixed performance. Up to 40% of load, it compares very favorably with the other platforms, but above this limit it starts to degrade more quickly. We attribute this behavior to the overhead associated with the time it takes to create a new thread to handle an incoming connection (a consequence of using the Thread-per-Connection strategy). Nevertheless, our performance is comparable with TAO's. Above the 60% load threshold, all systems without resource reservation have their performance severely hampered by the Load Generator. Our system, with resource reservation enabled, is able to sustain high levels of performance by shielding the service from resource starvation, offering, at 95% of load,


a 55-fold improvement over the second-best system (TAO), and a 77-fold improvement over the worst system (RMI).

5.5 Summary

This chapter provided an in-depth look at the performance behavior of several key components of our middleware infrastructure, more precisely, the low-level overlay and the high-level service layer. The benchmarks presented focused on highlighting crucial characteristics at both levels. At the overlay level, we focused on three aspects: membership behavior, query performance, and service deployment time. At the service layer, we focused on exposing the effects of our lightweight FT infrastructure on service performance, as well as the impact of the resource reservation mechanisms on both RT and FT performance. To contextualize the performance of our system, we presented two additional evaluations. The first shows the effects of the presence of multiple clients (with distinct priorities) on the RPC service, a common usage pattern for this type of service, as in [3]. The last evaluation presented an RT performance comparison with other closely related systems.


–Success consists in being successful, not in having potential for success. Any wide piece of ground is the potential site of a palace, but there's no palace till it's built.

Fernando Pessoa

6 Conclusions and Future Work

6.1 Conclusions

In this thesis we have designed and implemented Stheno, which, to the best of our knowledge, is the first middleware system to seamlessly integrate fault-tolerance and real-time in a peer-to-peer infrastructure. Our approach was motivated by the lack of support in current solutions for the timing, reliability and physical deployment characteristics of our target systems, as shown in the survey on related work.

Our hypothesis is that it is possible to effectively and efficiently integrate real-time support with fault-tolerance mechanisms in a middleware system using an approach fundamentally distinct from current solutions. Our solution involves: (a) implementing FT support at a low level in the middleware, albeit on top of a suitable network abstraction to maintain transparency; (b) using the peer-to-peer mesh services to support FT; and (c) supporting real-time services through kernel-level resource reservation mechanisms.

The proposed architecture offers a flexible design that is able to support different fault-tolerance policies, including semi-active and passive replication. The runtime's programming model details the most important interfaces and their interactions. It was designed to provide the necessary infrastructure for allowing users and services to interact with runtimes that are not in the same address space, thus allowing for a reduction in the resource footprint. Furthermore, it also provides support for additional languages.

We provide a complete implementation of a P2P overlay for efficient, transparent

and configurable fault-tolerance, and support real-time through the use of resource

reservation, network communication demultiplexing, and multi-core computing. The


support for resource reservation was achieved through the implementation of a QoS daemon that manages and interacts with the low-level QoS infrastructure present in the Linux kernel. The multiplexing of requests can force high-priority requests to miss their deadlines, because of the FIFO nature of network communications. To avoid this, our implementation allows services to define multiple access points, each one specifying a priority and a threading strategy. Last, to properly integrate resource reservation and the different threading strategies in a multi-core computing context, we have designed a novel design pattern, the Execution Model/Context. Fault-tolerance is efficiently implemented using the P2P overlay; the fault-tolerance strategy and the number of replicas are configurable per service. The current prototype has a code base of almost 1000 files and contains around 55,000 lines of code.

The experiments show that Stheno meets and exceeds the target system requirements for end-to-end latency and fail-over latency, thus validating our approach of implementing fault-tolerance mechanisms directly over the peer-to-peer overlay infrastructure. In particular, it is possible to isolate real-time tasks from system overhead, even in the presence of high loads and faults. Although support for proactive fault-tolerance is still absent from the current implementation, we were able to mitigate the impact of faults in the system by providing proper isolation between the low-level P2P services and the user's high-level services. This was mainly accomplished with the introduction of separate communication channels for the two service types. We are able to maintain performance in user services even in the presence of major mesh rebinds.

Taken as a whole, these evaluation results are promising and support the idea that the approach followed is valid. In summary, to the best of our knowledge, Stheno is the first system that supports:

Configurable Architecture. The architecture of our middleware platform is open, in the sense that it offers an adjustable and modular design that is able to accommodate a wide range of application domains. Instead of focusing on a specific application domain, such as RPC, we designed a service-oriented platform that offers a computational environment that seamlessly integrates both fault-tolerance and real-time. Furthermore, Stheno supports configurability at multiple levels: P2P, real-time and fault-tolerance.

P2P. Our infrastructure, based on pluggable P2P overlays, offers resilient behavior that can be adjusted to meet the overall system requirements. The selection between different overlay topologies, structured or unstructured, allows a software architect to trade off resource consumption, overall performance and resiliency.

Fault-Tolerance. We have implemented a lightweight fault-tolerance infrastructure


directly in the P2P overlay, currently supporting semi-active replication, which provides minimal overhead and thus enhances real-time performance. Nevertheless, a great effort was made to allow the support of additional replication policies, such as passive replication and active replication.

Real-Time Behavior. Our platform is able to offer resource reservation through the implementation of a QoS daemon that leverages the available resources and interacts with the low-level resource reservation infrastructure provided by the Linux kernel. Furthermore, our architecture decouples control and data information flows through the introduction of distinct service access points (SAPs). These SAPs are served by a configurable threading strategy with an associated priority. Last, we introduced a novel design pattern, the Execution Model/Context, which integrates resource reservation with distinct threading strategies, namely Leader-Followers [11], Thread-Pool [114], Thread-per-Connection [12] and Thread-per-Request [13], with a focus on support for multi-core computing.

6.2 Future Work

The work accomplished in this thesis opens paths in several research domains.

Real-Time. An interesting challenge in the RT domain is to enhance the middleware with support for EDF [117] and to study the limitations of implementing hard real-time tasks in a general-purpose operating system, such as Linux. A follow-up to this work is to study the implications of isolating low-level hardware interrupts and to measure the impact of different runtimes and periods on EDF tasks.

An in-depth study of the impact of the CPU architecture, especially the cache topology, on real-time performance and resource reservation behavior would also contribute to improving the deployment of distributed RT systems.

Fault-Tolerance. An interesting idea, which originated from the collaboration with Prof. Priya Narasimhan, consists of providing support for multiple overlays to further enhance dependability. This opens several challenges: (a) correlating faults from different overlays with the goal of identifying root causes; (b) choosing the optimal deployment site for service bootstrap; (c) enhancing the current state of the art in fault-tolerance with support for inter-overlay replication groups, that is, the placement of replicas across a distinct set of overlays; and (d) identifying nodes that are common to several overlays, as they diminish FT capabilities.

Currently, we use a reactive fault-detection model that only acts after a fault has happened. Using a proactive approach, the runtime could predict imminent faults and take actions to eliminate, or at least minimize, the consequences of such events. A possible way to accomplish this is to use a combination of real-time resource monitoring analysis and gossip-based network monitoring.

The addition of new replicas to a replication group still poses a significant challenge in distributed RT systems. The disturbance caused by the initialization process of the new replica can be mitigated by a two-phase process. In the first phase, if there is no checkpoint available, the replication group would have to create one. The existing replicas would then split the checkpoint state between themselves, thereby relieving the primary of further overhead. In the second phase, all the replicas would transfer their portion of the checkpoint state to the joining replica. This would end with the primary providing the delta between the checkpoint state and the current state. This would greatly minimize the interference in the primary node, especially for very large states.

Byzantine Fault-Tolerance. The introduction of Byzantine Fault-Tolerance (BFT) still poses a significant challenge. The integration of BFT with RT would represent the next evolution in terms of FT. We would like to assess the impact of recent BFT replication protocols, such as Zyzzyva [137] and Aardvark [138], on real-time performance.

Virtualization. Current virtualization solutions focus on providing on-demand Virtual Machines (VMs) with end-user QoS, such as Amazon EC2. A more low-level approach can be taken by using lightweight VMs to provide a virtualized environment for runtime (user) services, allowing the support of legacy services. This also allows the migration of services without having to implement FT awareness in the service itself. A second benefit of having support for virtualized services is the inherent support for providing strong isolation to services. This can be used as a way to prevent malicious services from compromising the entire node.

A broad study on the possibility of achieving RT performance with the currently available hypervisors is needed to assess the feasibility of having RT virtualized services. To the best of our knowledge, no RT support has ever been attempted in lightweight virtualization hypervisors, such as the Kernel Virtual Machine (KVM) [108]. We speculate that the use of CPU isolation could make this feasible, possibly allowing the introduction of RT semantics into the Infrastructure as a Service (IaaS) paradigm. The recent developments on virtualization at the operating system level [139], by the Linux-CR project [140], could represent an interesting alternative to lightweight virtualization hypervisors. Because no latency is added to the middleware runtime, the real-time behavior should be preserved. Furthermore, only the state of the application is serialized, resulting in less overhead for the operating system and producing smaller state images, which should provide a more efficient way of migrating runtimes between nodes, with a subsequent improvement in recovery time.

6.3 Personal Notes

The main motivation for undertaking this PhD was the desire to solve the problems created by the requirements of our target systems, and it can be summarized by the following question: "Can we opportunistically leverage and integrate these proven strategies to simultaneously support soft-RT and FT to meet the needs of our target systems even under faulty conditions?"

Doing research on middleware systems is a difficult, yet rewarding, task. We feel that all the major goals of this PhD were met, and the author has gained invaluable insight into the vast and complex domain of distributed computing.

From a computer science standpoint, the full implementation of a new P2P middleware platform that is able to offer seamless integration of both real-time and fault-tolerance was only possible with a thorough analysis of all the mechanisms involved, as well as their inter-dependencies. Eventually, this work will lead to further research on operating systems, parallel and distributed computing, and software engineering.

From the early stages of this PhD there has been an increasing focus on the support for adaptive behavior. The ultimate goal is to balance fault-tolerance assurances with real-time performance, in order to meet the requirements of the target system. One of the most prevalent application areas for this type of research is Cloud Computing. We hope that our work provides an open adaptive framework that allows researchers and developers to customize the behavior of the middleware to best suit their needs, while benefiting from a resilient and distributed network layer built on top of P2P overlays.

The evolution of middleware systems, and in particular of those that pursue simultaneous support of both real-time and fault-tolerance, has been gradually focusing on


efficient implementations of Byzantine fault-tolerance. The practical implementation of such systems constitutes a promising and exciting research field. Another promising research field is related to the introduction of hard real-time support in general-purpose middleware systems while supporting the dynamic insertion and removal of services. I hope to have the opportunity to contribute to these exciting research challenges.


References

[1] Paulo Veríssimo and Luís Rodrigues. Distributed Systems for System Architects. Kluwer Academic Publishers, Norwell, MA, USA, 2001.

[2] Kenneth Birman. Guide to Reliable Distributed Systems. Texts in Computer

Science. Springer, 2012.

[3] Douglas Schmidt, David Levine, and Sumedh Mungee. The Design of the TAO

Real-Time Object Request Broker. Computer Communications, 21(4):294–324,

1998.

[4] Xavier Défago. Agreement-Related Problems: from Semi-Passive Replication to Totally Ordered Broadcast. PhD thesis, École Polytechnique Fédérale de Lausanne, August 2000.

[5] EFACEC, S.A. EFACEC Markets. http://www.efacec.pt/

presentationlayer/efacec_mercado_00.aspx?idioma=2&area=8&local=

302&mercado=55. [Online; accessed 17-October-2011].

[6] Rolando Martins, Priya Narasimhan, Luıs Lopes, and Fernando Silva. Lightweight

Fault-Tolerance for Peer-to-Peer Middleware. In The First International Work-

shop on Issues in Computing over Emerging Mobile Networks (C-EMNs’10),

In Proceedings of the 29th IEEE Symposium on Reliable Distributed Systems

(SRDS’10), pages 313–317, November 2010.

[7] Bela Ban. Design and Implementation of a Reliable Group Communication

Toolkit for Java. Technical report, Cornell University, September 1998.

[8] Chen Lee, Ragunathan Rajkumar, and Cliff Mercer. Experiences with Processor

Reservation and Dynamic QOS in Real-Time Mach. Proceedings of Multimedia

Japan 96, April 1996.

[9] Hideyuki Tokuda, Tatsuo Nakajima, and Prithvi Rao. Real-Time Mach: Towards

a Predictable Real-Time System. In USENIX MACH Symposium, pages 73–82,

October 1990.

[10] Luigi Palopoli, Tommaso Cucinotta, Luca Marzario, and Giuseppe Lipari.

AQuoSA - Adaptive Quality of Service Architecture. Software: Practice and

Experience, 39(1):1–31, April 2009.


[11] Douglas Schmidt, Carlos O’Ryan, Irfan Pyarali, Michael Kircher, and Frank

Buschmann. Leader/Followers: A Design Pattern for Efficient Multi-threaded

Event Demultiplexing and Dispatching. In Proceedings of the 7th Conference on

Pattern Languages of Programs (PLoP’01), August 2001.

[12] Douglas Schmidt and Steve Vinoski. Comparing Alternative Programming

Techniques for Multithreaded CORBA Servers. C++ Report, 8(7):47–56, July

1996.

[13] Douglas Schmidt and Charles Cranor. Half-Sync/Half-Async: An Architectural

Pattern for Efficient and Well-Structured Concurrent I/O. In Proceedings of the

2nd Annual Conference on the Pattern Languages of Programs (PLoP’95), pages

1–10, 1995.

[14] Priya Narasimhan, Tudor Dumitras, Aaron Paulos, Soila Pertet, Carlos Reverte,

Joseph Slember, and Deepti Srivastava. MEAD: Support for Real-Time Fault-

Tolerant CORBA: Research Articles. Concurrency and Computation: Practice &

Experience, 17(12):1527–1545, October 2005.

[15] Licınio Oliveira, Luıs Lopes, and Fernando Silva. P3: Parallel Peer to Peer - An

Internet Parallel Programming Environment. In Workshop on Web Engineering &

Peer-to-Peer Computing, part of Networking 2002, volume 2376 of Lecture Notes

in Computer Science, pages 274–288. Springer-Verlag, May 2002.

[16] James E. White. A High-Level Framework for Network-Based Resource Sharing.

In Proceedings of the June 7-10, 1976, National Computer Conference and

Exposition (AFIPS’76), pages 561–570, New York, NY, USA, 1976. ACM.

[17] Andrew D. Birrell and Bruce Jay Nelson. Implementing Remote Procedure Calls.

ACM Transactions on Computer Systems, 2(1):39–59, February 1984.

[18] Object Management Group. CORBA Specification. OMG Technical Commit-

tee Document: http://www.omg.org/cgi-bin/doc?1991/91-08-01, Aug 1991.

[Online; accessed 17-October-2011].

[19] Ann Wollrath, Roger Riggs, and Jim Waldo. A Distributed Object Model for the

Java System. Computing Systems, 9(4):265–290, 1996.

[20] Michi Henning. The Rise and Fall of CORBA. Communications of the ACM,

51(8):52–57, August 2008.


[21] Enterprise Team, Vlada Matena, Eduardo Pelegri-Llopart Mark Hapner, James

Davidson, and Larry Cable. Java 2 Enterprise Edition Specifications. Addison-

Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.

[22] A. Wigley, M. Sutton, S. Wheelwright, R. Burbidge, and R. Mcloud. Microsoft

.Net Compact Framework: Core Reference. Microsoft Press, Redmond, WA, USA,

2002.

[23] Don Box, David Ehnebuske, Gopal Kakivaya, Andrew Layman, Noah Mendel-

sohn, Henrik Nielsen, Satish Thatte, and Dave Winer. Simple Object Access

Protocol (SOAP) 1.1. W3c note, World Wide Web Consortium, May 2000.

[Online; accessed 17-October-2011].

[24] Marc Fleury and Francisco Reverbel. The JBoss Extensible Server. In

Proceedings of the 4th ACM/IFIP/USENIX International Middleware Conference

(Middleware’03), pages 344–373, New York, NY, USA, 2003. Springer-Verlag New

York, Inc.

[25] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,

Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall,

and Werner Vogels. Dynamo: Amazon’s Highly Available Key-value Store.

In Proceedings of the 21st ACM Symposium on Operating Systems Principles

(SOSP’07), pages 205–220, October 2007.

[26] Yan Huang, Tom Fu, Dah-Ming Chiu, John Lui, and Cheng Huang. Challenges,

Design and Analysis of a Large-Scale P2P-VOD System. In Proceedings of the

ACM SIGCOMM Conference on Data Communication (SIGCOMM ’08), pages

375–388, New York, NY, USA, August 2008. ACM.

[27] Edward Curry. Message-Oriented Middleware, pages 1–28. John Wiley & Sons,

Ltd, 2005.

[28] Tibco. Tibco Rendezvous. http://www.tibco.com/products/soa/messaging/

rendezvous/. [Online; accessed 17-October-2011].

[29] IBM. WebSphere MQ. http://www-01.ibm.com/software/integration/wmq/.

[Online; accessed 17-October-2011].

[30] Richard Monson-Haefel and David Chappell. Java Message Service. O’Reilly &

Associates, Inc., Sebastopol, CA, USA, 2000.


[31] JCP. JAIN SLEE v1.1 Specification. JCP Document: http://download.

oracle.com/otndocs/jcp/jain_slee-1_1-final-oth-JSpec/, Jul 2008. [On-

line; accessed 17-October-2011].

[32] Mobicents. The Open Source SLEE and SIP Server. http://www.mobicents.

org/. [Online; accessed 17-October-2011].

[33] Object Management Group. OpenDDS. http://www.opendds.org/. [Online;

accessed 17-October-2011].

[34] RTI. Connext DDS. http://www.rti.com/products/dds/index.html. [Online;

accessed 17-October-2011].

[35] Douglas C. Schmidt and Hans van’t Hag. Addressing the challenges of mission-

critical information management in next-generation net-centric pub/sub systems

with opensplice dds. In IPDPS, pages 1–8, 2008.

[36] Object Management Group. Fault Tolerant CORBA Specification. OMG Techni-

cal Committee Document: http://www.omg.org/spec/FT/1.0/PDF/, May 2010.

[Online; accessed 17-October-2011].

[37] Tarek Abdelzaher, Scott Dawson, Wu Feng, Farnam Jahanian, S. Johnson, Ashish

Mehra, Todd Mitton, Anees Shaikh, Kang Shin, Zhiheng Wang, Hengming Zou,

M. Bjorkland, and Pedro Marron. ARMADA Middleware and Communication

Services. Real-Time Systems, 16:127–153, 1999.

[38] H. Kopetz, A. Damm, C. Koza, M. Mulazzani, W. Schwabl, C. Senft, and

R. Zainlinger. Distributed Fault-Tolerant Real-Time Systems: the Mars Ap-

proach. Micro, IEEE, 9(1):25–40, February 1989.

[39] Kane Kim. ROAFTS: A Middleware Architecture for Real-Time Object-Oriented

Adaptive Fault Tolerance Support. In Proceedings of the 3rd IEEE International

High-Assurance Systems Engineering Symposium (HASE’98), page 50. IEEE

Computer Society, November 1998.

[40] Eltefaat Shokri, Patrick Crane, Kane Kim, and Chittur Subbaraman. Archi-

tecture of ROAFTS/Solaris: A Solaris-Based Middleware for Real-Time Object-

Oriented Adaptive Fault Tolerance Support. In COMPSAC, pages 90–98. IEEE

Computer Society, 1998.

[41] Kane Kim and Chittur Subbaraman. Fault-Tolerant Real-Time Objects. Com-

munications of the ACM, 40(1):75–82, 1997.


[42] Kane Kim and Chittur Subbaraman. A Supervisor-Based Semi-Centralized

Network Surveillance Scheme and the Fault Detection Latency Bound. In

Proceedings of the 16th Symposium on Reliable Distributed Systems (SRDS’97),

pages 146–155, October 1997.

[43] Manas Saksena, James da Silva, and Ashok Agrawala. Design and implementation

of maruti-ii. In Sang Son, editor, Advances in Real-Time Systems, pages 73–102.

Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1995.

[44] David Powell, Gottfried Bonn, D. Seaton, Paulo Veríssimo, and François Waeselynck. The Delta-4 Approach to Dependability in Open Distributed Computing Systems. In Proceedings of the 18th Annual International Symposium on Fault-Tolerant Computing (FTCS'88), pages 246–251, Tokyo, Japan, 1988. IEEE Computer Society Press.

[45] P. Barrett, P. Bond, A. Hilborne, Luís Rodrigues, D. Seaton, N. Speirs, and Paulo Veríssimo. The Delta-4 Extra Performance Architecture (XPA). 20th International Symposium on Fault-Tolerant Computing, pages 481–488, 1990.

[46] James Gosling and Greg Bollella. The Real-Time Specification for Java. Addison-

Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.

[47] Greg Bollella, James Gosling, Ben Brosgol, P. Dibble, Steve Furr, David Hardin,

and Mark Turnbull. The Real-Time Specification for Java. The Java Series.

Addison-Wesley, 2000.

[48] Peter Dibble. Real-Time Java Platform Programming. BookSurge Publishing,

2nd edition, 2008.

[49] Joshua Auerbach, David Bacon, Daniel Iercan, Christoph Kirsch, V. Rajan,

Harald Roeck, and Rainer Trummer. Java Takes Flight:Time-Portable Real-

Time Programming with Exotasks. In Proceedings of the 2007 ACM SIG-

PLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded

Systems (LCTES’07), pages 51–62, New York, NY, USA, 2007. ACM.

[50] Joshua Auerbach, David Bacon, Bob Blainey, Perry Cheng, Michael Dawson,

Mike Fulton, David Grove, Darren Hart, and Mark Stoodley. Design and

Implementation of a Comprehensive Real-time Java Virtual Machine. In

Proceedings of the 7th ACM & IEEE International Conference on Embedded

Software (EMSOFT’07), pages 249–258, New York, NY, USA, 2007. ACM.


[51] Introduction to WebLogic Real-Time. http://docs.oracle.com/cd/E13221_01/wlrt/docs10/pdf/intro_wlrt.pdf. [Online; accessed 17-October-2011].

[52] Silvano Maffeis. Adding Group Communication and Fault-Tolerance to CORBA. In USENIX Conference on Object-Oriented Technologies, 1995.

[53] Alexey Vaysburd and Kenneth Birman. Building Reliable Adaptive Distributed Objects with the Maestro Tools. In Proceedings of the Workshop on Dependable Distributed Object Systems (OOPSLA'97), 1997.

[54] Yansong Ren, David Bakken, Tod Courtney, Michel Cukier, David Karr, Paul Rubel, Chetan Sabnis, William Sanders, Richard Schantz, and Mouna Seri. AQuA: An Adaptive Architecture that Provides Dependable Distributed Objects. IEEE Transactions on Computers, 52:31–50, January 2003.

[55] Balachandran Natarajan, Aniruddha Gokhale, Shalini Yajnik, and Douglas Schmidt. DOORS: Towards High-Performance Fault Tolerant CORBA. In Proceedings of the International Symposium on Distributed Objects and Applications (DOA'00), pages 39–48, 2000.

[56] Silvano Maffeis and Douglas Schmidt. Constructing Reliable Distributed Communications Systems with CORBA. IEEE Communications Magazine, 35(2):56–61, February 1997.

[57] Robbert van Renesse, Kenneth Birman, and Silvano Maffeis. Horus: A Flexible Group Communication System. Communications of the ACM, 39(4):76–83, November 1996.

[58] Kenneth Birman and Robbert van Renesse. Reliable Distributed Computing with the Isis Toolkit. IEEE Computer Society Press, 1994.

[59] Robbert van Renesse, Kenneth Birman, Mark Hayden, Alexey Vaysburd, and David Karr. Building Adaptive Systems Using Ensemble. Software: Practice and Experience, 28(8):963–979, August 1998.

[60] Thomas C. Bressoud. TFT: A Software System for Application-Transparent Fault Tolerance. In Proceedings of the 28th Annual International Symposium on Fault-Tolerant Computing (FTCS'98), pages 128–137, 1998.

[61] Richard Schantz, Joseph Loyall, Craig Rodrigues, Douglas Schmidt, Yamuna Krishnamurthy, and Irfan Pyarali. Flexible and Adaptive QoS Control for Distributed Real-Time and Embedded Middleware. In Markus Endler and Douglas Schmidt, editors, Proceedings of the ACM/IFIP/USENIX International Middleware Conference (Middleware'03), volume 2672 of Lecture Notes in Computer Science, pages 374–393. Springer, June 2003.

[62] Douglas Schmidt and Fred Kuhns. An Overview of the Real-Time CORBA Specification. IEEE Computer, 33(6):56–63, June 2000.

[63] IETF. An Architecture for Differentiated Services. http://www.ietf.org/rfc/rfc2475.txt. [Online; accessed 17-October-2011].

[64] Lixia Zhang, Stephen Deering, Deborah Estrin, Scott Shenker, and Daniel Zappala. RSVP: A New Resource ReSerVation Protocol. IEEE Network, 7(5):8–18, 1993.

[65] Nanbor Wang, Christopher Gill, Douglas Schmidt, and Venkita Subramonian. Configuring Real-Time Aspects in Component Middleware. In CoopIS/DOA/ODBASE (2), pages 1520–1537, 2004.

[66] Friedhelm Wolf, Jaiganesh Balasubramanian, Aniruddha Gokhale, and Douglas Schmidt. Component Replication Based on Failover Units. In Proceedings of the 15th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA'09), pages 99–108, August 2009.

[67] Nanbor Wang, Douglas Schmidt, Aniruddha Gokhale, Christopher Gill, Balachandran Natarajan, Craig Rodrigues, Joseph Loyall, and Richard Schantz. Total Quality of Service Provisioning in Middleware and Applications. Microprocessors and Microsystems, 26:9–10, 2003.

[68] Richard Schantz, Joseph Loyall, Craig Rodrigues, Douglas Schmidt, Yamuna Krishnamurthy, and Irfan Pyarali. Flexible and Adaptive QoS Control for Distributed Real-Time and Embedded Middleware. In Proceedings of the ACM/IFIP/USENIX 2003 International Conference on Middleware (Middleware'03), pages 374–393, New York, NY, USA, June 2003. Springer-Verlag New York, Inc.

[69] Fabio Kon, Fabio Costa, Gordon Blair, and Roy Campbell. The Case for Reflective Middleware. Communications of the ACM, 45:33–38, June 2002.

[70] Jürgen Schönwälder, Sachin Garg, Yennun Huang, Aad van Moorsel, and Shalini Yajnik. A Management Interface for Distributed Fault Tolerance CORBA Services. In Proceedings of the IEEE Third International Workshop on Systems Management (SMW'98), pages 98–107, Washington, DC, USA, April 1998.


[71] Pascal Felber, Benoît Garbinato, and Rachid Guerraoui. The Design of a CORBA Group Communication Service. In Proceedings of the 15th Symposium on Reliable Distributed Systems (SRDS'96), Washington, DC, USA, October 1996. IEEE Computer Society.

[72] Graham Morgan, Santosh Shrivastava, Paul Ezhilchelvan, and Mark Little. Design and Implementation of a CORBA Fault-Tolerant Object Group Service. In Proceedings of the 2nd IFIP WG 6.1 International Working Conference on Distributed Applications and Interoperable Systems (DAIS'99), pages 361–374, Deventer, The Netherlands, 1999. Kluwer, B.V.

[73] Object Management Group. Real-time CORBA Specification. OMG Technical Committee Document: http://www.omg.org/spec/RT/1.2/PDF, January 2005. [Online; accessed 17-October-2011].

[74] Jaiganesh Balasubramanian. FLARe: A Fault-Tolerant Lightweight Adaptive Real-Time Middleware for Distributed Real-Time and Embedded Systems. In Proceedings of the 4th Middleware Doctoral Symposium (MDS'07), pages 17:1–17:6, New York, NY, USA, November 2007. ACM.

[75] Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg. The Primary-Backup Approach. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 1993.

[76] Object Management Group. Light Weight CORBA Component Model Revised Submission. OMG Technical Committee Document: http://www.omg.org/spec/CCM/3.0/PDF/, June 2002. [Online; accessed 17-October-2011].

[77] Jaiganesh Balasubramanian, Aniruddha Gokhale, Abhishek Dubey, Friedhelm Wolf, Chenyang Lu, Christopher Gill, and Douglas Schmidt. Middleware for Resource-Aware Deployment and Configuration of Fault-Tolerant Real-Time Systems. In Marco Caccamo, editor, Proceedings of the 16th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'10), pages 69–78. IEEE Computer Society, April 2010.

[78] Fred Schneider. Replication Management using the State-Machine Approach. ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 1993.

[79] Louise Moser, P. Michael Melliar-Smith, and Priya Narasimhan. A Fault Tolerance Framework for CORBA. In Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing (FTCS'99), Washington, DC, USA, 1999. IEEE Computer Society.

[80] Priya Narasimhan, Louise Moser, and P. Michael Melliar-Smith. Strongly Consistent Replication and Recovery of Fault-Tolerant CORBA Applications. Computer Systems Science and Engineering, 17, 2002.

[81] Justin Frankel and Tom Pepper. Gnutella Specification. http://www.stanford.edu/class/cs244b/gnutella_protocol_0.4.pdf. [Online; accessed 17-October-2011].

[82] Yoram Kulbak and Danny Bickson. The eMule Protocol Specification, January 2005. [Online; accessed 17-October-2011].

[83] PPLive. PPTV. http://www.pplive.com/. [Online; accessed 17-October-2011].

[84] Mário Ferreira, João Leitão, and Luís Rodrigues. Thicket: A Protocol for Building and Maintaining Multiple Trees in a P2P Overlay. In Proceedings of the 29th International Symposium on Reliable Distributed Systems (SRDS'10), pages 293–302. IEEE, November 2010.

[85] Zhi Li and Prasant Mohapatra. QRON: QoS-Aware Routing in Overlay Networks. IEEE Journal on Selected Areas in Communications, 22(1):29–40, January 2004.

[86] Eric Wohlstadter, Stefan Tai, Thomas Mikalsen, Isabelle Rouvellou, and Premkumar Devanbu. GlueQoS: Middleware to Sweeten Quality-of-Service Policy Interactions. In Proceedings of the 26th International Conference on Software Engineering (ICSE'04), pages 189–199, May 2004.

[87] Anthony Rowstron, Anne-Marie Kermarrec, Miguel Castro, and Peter Druschel. SCRIBE: The Design of a Large-Scale Event Notification Infrastructure. In Proceedings of the 3rd International COST264 Workshop on Networked Group Communication (NGC'01), pages 30–43, November 2001.

[88] Anthony Rowstron and Peter Druschel. Pastry: Scalable, Decentralized Object Location and Routing for Large-Scale Peer-to-Peer Systems. In Proceedings of the 2nd ACM/IFIP/USENIX International Middleware Conference (Middleware'01), pages 329–350, November 2001.

[89] Leslie Lamport. The Part-Time Parliament. ACM Transactions on Computer Systems, 16:133–169, May 1998.


[90] Peter Pietzuch and Jean Bacon. Hermes: A Distributed Event-Based Middleware Architecture. In ICDCS Workshops, pages 611–618. IEEE Computer Society, July 2002.

[91] Ben Zhao, Ling Huang, Jeremy Stribling, Sean Rhea, Anthony Joseph, and John Kubiatowicz. Tapestry: A Resilient Global-Scale Overlay for Service Deployment. IEEE Journal on Selected Areas in Communications, June 2003.

[92] David Anderson, Jeff Cobb, Eric Korpela, Matt Lebofsky, and Dan Werthimer. SETI@home: An Experiment in Public-Resource Computing. Communications of the ACM, 45:56–61, November 2002.

[93] Björn Knutsson, Honghui Lu, Wei Xu, and Bryan Hopkins. Peer-to-Peer Support for Massively Multiplayer Games. In Proceedings of the 23rd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM'04), volume 1, March 2004.

[94] Gilles Fedak, Cécile Germain, Vincent Néri, and Franck Cappello. XtremWeb: A Generic Global Computing System. In Proceedings of the 1st IEEE/ACM International Symposium on Cluster Computing, pages 582–587, May 2001.

[95] Andrew Chien, Brad Calder, Stephen Elbert, and Karan Bhatia. Entropia: Architecture and Performance of an Enterprise Desktop Grid System. Journal of Parallel and Distributed Computing, 63:597–610, May 2003.

[96] David Anderson. BOINC: A System for Public-Resource Computing and Storage. In Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing (GRID'04), pages 4–10, Washington, DC, USA, November 2004. IEEE Computer Society.

[97] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51:107–113, January 2008.

[98] Fabrizio Marozzo, Domenico Talia, and Paolo Trunfio. Adapting MapReduce for Dynamic Environments Using a Peer-to-Peer Model. In Proceedings of the 1st Workshop on Cloud Computing and its Applications (CCA'08), Chicago, USA, October 2008.

[99] Sean Rhea, Brighten Godfrey, Brad Karp, John Kubiatowicz, Sylvia Ratnasamy, Scott Shenker, Ion Stoica, and Harlan Yu. OpenDHT: A Public DHT Service and Its Uses. In Roch Guerin, Ramesh Govindan, and Greg Minshall, editors, Proceedings of the ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM'05), pages 73–84. ACM, August 2005.

[100] Philip Bernstein and Nathan Goodman. An Algorithm for Concurrency Control and Recovery in Replicated Distributed Databases. ACM Transactions on Database Systems, 9(4):596–615, 1984.

[101] Bruce Lindsay, Patricia Selinger, Cesare Galtieri, Jim Gray, Raymond Lorie, T. G. Price, Franco Putzolu, and Bradford Wade. Notes on Distributed Databases. Technical report, International Business Machines (IBM), San Jose Research Laboratory (CA), July 1979.

[102] Rolando Martins, Luís Lopes, and Fernando Silva. A Peer-to-Peer Middleware Platform for QoS and Soft Real-Time Computing. Technical Report DCC-2008-02, Departamento de Ciência de Computadores, Faculdade de Ciências, Universidade do Porto, April 2008. Available at http://www.dcc.fc.up.pt/dcc/Pubs/TReports/.

[103] Rolando Martins, Luís Lopes, and Fernando Silva. A Peer-to-Peer Middleware Platform for Fault-Tolerant, QoS, Real-Time Computing. In Proceedings of the 2nd Workshop on Middleware-Application Interaction, part of DisCoTec 2008, pages 1–6, New York, NY, USA, June 2008. ACM.

[104] Rolando Martins, Priya Narasimhan, Luís Lopes, and Fernando Silva. On the Impact of Fault-Tolerance Mechanisms in a Peer-to-Peer Middleware with QoS Constraints. Technical Report DCC-2010-02, Departamento de Ciência de Computadores, Faculdade de Ciências, Universidade do Porto, April 2010. Available at http://www.dcc.fc.up.pt/dcc/Pubs/TReports/.

[105] Aniruddha Gokhale, Balachandran Natarajan, Douglas Schmidt, and Joseph Cross. Towards Real-Time Fault-Tolerant CORBA Middleware. Cluster Computing, 7(4):331–346, September 2004.

[106] Michi Henning. A New Approach to Object-Oriented Middleware. IEEE Internet Computing, 8(1):66–75, January 2004.

[107] Daniel Nurmi, Richard Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia Youseff, and Dmitrii Zagorodnov. The Eucalyptus Open-Source Cloud-Computing System. In Franck Cappello, Cho-Li Wang, and Rajkumar Buyya, editors, Proceedings of the 9th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid'09), pages 124–131. IEEE Computer Society, May 2009.

[108] Avi Kivity, Yaniv Kamay, Dor Laor, Uri Lublin, and Anthony Liguori. KVM: the Linux Virtual Machine Monitor. In Proceedings of the 9th Ottawa Linux Symposium (OLS'07), June 2007.

[109] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the Art of Virtualization. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP'03), October 2003.

[110] Canonical Ltd. JeOS and "vmbuilder". https://help.ubuntu.com/11.10/serverguide/C/jeos-and-vmbuilder.html. [Online; accessed 17-October-2011].

[111] Douglas Schmidt. An Architectural Overview of the ACE Framework. ;login: the USENIX Association newsletter, 24(1), January 1999.

[112] Francisco Curbera, Matthew Duftler, Rania Khalaf, William Nagy, Nirmal Mukhi, and Sanjiva Weerawarana. Unraveling the Web Services Web: An Introduction to SOAP, WSDL, and UDDI. IEEE Distributed Systems Online, 3(4), 2002.

[113] Ian Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In Proceedings of the ACM Special Interest Group on Data Communication Conference (SIGCOMM'01), volume 31(4) of Computer Communication Review, pages 149–160. ACM Press, August 2001.

[114] Greg Lavender and Douglas Schmidt. Active Object: An Object Behavioral Pattern for Concurrent Programming. In Proceedings of the 2nd Conference on Pattern Languages of Programs (PLoP'95), September 1995.

[115] Linux kernel 2.6.39. Real-Time Group Scheduling. http://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt, 2009. [Online; accessed 17-October-2011].

[116] Yuan Xu. A Study of Scalability and Performance of Solaris Zones, April 2007.

[117] Dario Faggioli, Michael Trimarchi, and Fabio Checconi. An Implementation of the Earliest Deadline First Algorithm in Linux. In Sung Shin and Sascha Ossowski, editors, Proceedings of the 24th ACM Symposium on Applied Computing (SAC'09), pages 1984–1989. ACM, March 2009.

[118] Nicola Manica, Luca Abeni, and Luigi Palopoli. Reservation-Based Interrupt Scheduling. In Marco Caccamo, editor, Proceedings of the 16th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'10), pages 46–55. IEEE Computer Society, April 2010.

[119] Shinpei Kato, Yutaka Ishikawa, and Ragunathan Rajkumar. CPU Scheduling and Memory Management for Interactive Real-Time Applications. Real-Time Systems, pages 1–35, 2011.

[120] Michael Stonebraker and Greg Kemnitz. The POSTGRES Next Generation Database Management System. Communications of the ACM, 34:78–92, October 1991.

[121] Vincenzo Gulisano, Ricardo Jiménez-Peris, Marta Patiño-Martínez, and Patrick Valduriez. StreamCloud: A Large Scale Data Streaming System. In Proceedings of the IEEE 30th International Conference on Distributed Computing Systems (ICDCS'10), pages 126–137, Washington, DC, USA, June 2010. IEEE Computer Society.

[122] Levent Gürgen, Claudia Roncancio, Cyril Labbé, André Bottaro, and Vincent Olive. SStreaMWare: A Service Oriented Middleware for Heterogeneous Sensor Data Management. In Proceedings of the 5th International Conference on Pervasive Services (ICPS'08), pages 121–130, New York, NY, USA, July 2008. ACM.

[123] Adrian Caulfield, Joel Coburn, Todor Mollov, Arup De, Ameen Akel, Jiahua He, Arun Jagatheesan, Rajesh Gupta, Allan Snavely, and Steven Swanson. Understanding the Impact of Emerging Non-Volatile Memories on High-Performance, IO-Intensive Computing. In Proceedings of the 23rd ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), pages 1–11, Washington, DC, USA, November 2010. IEEE Computer Society.

[124] Maxweel Carmo, Bruno Carvalho, Jorge Sá Silva, Edmundo Monteiro, Paulo Simões, Marília Curado, and Fernando Boavida. NSIS-Based Quality of Service and Resource Allocation in Ethernet Networks. In Torsten Braun, Georg Carle, Sonia Fahmy, and Yevgeni Koucheryavy, editors, Proceedings of the 4th International Conference on Wired/Wireless Internet Communications (WWIC'06), volume 3970 of Lecture Notes in Computer Science, pages 132–142. Springer, 2006.

[125] Jeff Bonwick. The Slab Allocator: An Object-Caching Kernel Memory Allocator. In USENIX Summer, pages 87–98, 1994.

[126] Christoph Lameter. The SLUB Allocator. LWN.net: http://lwn.net/Articles/229096/, March 2007. [Online; accessed 17-October-2011].

[127] Dinakar Guniguntala, Paul McKenney, Josh Triplett, and Jonathan Walpole. The Read-Copy-Update Mechanism for Supporting Real-Time Applications on Shared-Memory Multiprocessor Systems with Linux. IBM Systems Journal, 47:221–236, April 2008.

[128] Steven Rostedt. RCU Preemption Priority Boosting. LWN.net: http://lwn.net/Articles/252837/, October 2007. [Online; accessed 17-October-2011].

[129] Claudio Basile, Keith Whisnant, Zbigniew Kalbarczyk, and Ravishankar Iyer. Loose Synchronization of Multithreaded Replicas. In Proceedings of the 21st International Symposium on Reliable Distributed Systems (SRDS'02), pages 250–255, October 2002.

[130] Claudio Basile, Zbigniew Kalbarczyk, and Ravishankar Iyer. A Preemptive Deterministic Scheduling Algorithm for Multithreaded Replicas. In Proceedings of the 33rd International Conference on Dependable Systems and Networks (DSN'03), pages 149–158, June 2003.

[131] Guang Tan, Stephen Jarvis, and Daniel Spooner. Improving the Fault Resilience of Overlay Multicast for Media Streaming. IEEE Transactions on Parallel and Distributed Systems, 18(6):721–734, June 2007.

[132] Irena Trajkovska, Joaquín Salvachúa Rodríguez, and Alberto Mozo Velasco. A Novel P2P and Cloud Computing Hybrid Architecture for Multimedia Streaming with QoS Cost Functions. In Proceedings of the International Conference on Multimedia (MM'10), pages 1227–1230, New York, NY, USA, October 2010. ACM.

[133] Thomas Wiegand, Gary Sullivan, Gisle Bjøntegaard, and Ajay Luthra. Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.

[134] Fred Kuhns, Douglas Schmidt, and David Levine. The Design and Performance of a Real-Time I/O Subsystem. In Proceedings of the 5th IEEE Real-Time Technology and Applications Symposium (RTAS'99), pages 154–163, June 1999.


[135] Real-Time Preempt Linux Kernel Patch. kernel.org: http://www.kernel.org/pub/linux/kernel/projects/rt/. [Online; accessed 17-October-2011].

[136] Moving Interrupts to Threads. LWN.net: http://lwn.net/Articles/302043/. [Online; accessed 17-October-2011].

[137] Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. Zyzzyva: Speculative Byzantine Fault Tolerance. In Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles (SOSP'07), pages 45–58, New York, NY, USA, 2007. ACM.

[138] Allen Clement, Edmund Wong, Lorenzo Alvisi, Mike Dahlin, and Mirco Marchetti. Making Byzantine Fault Tolerant Systems Tolerate Byzantine Faults. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI'09), pages 153–168, Berkeley, CA, USA, 2009. USENIX Association.

[139] Andrey Mirkin, Alexey Kuznetsov, and Kir Kolyshkin. Containers Checkpointing and Live Migration. In Proceedings of the 10th Annual Linux Symposium (OLS'08), July 2008.

[140] Oren Laadan and Serge Hallyn. Linux-CR: Transparent Application Checkpoint-Restart in Linux. In Proceedings of the 12th Ottawa Linux Symposium (OLS'10), July 2010.
