Rolando da Silva Martins
On the Integration of Real-Time
and Fault-Tolerance in P2P
Middleware
Departamento de Ciência de Computadores
Faculdade de Ciências da Universidade do Porto
2012
Rolando da Silva Martins
On the Integration of Real-Time
and Fault-Tolerance in P2P
Middleware
Thesis submitted to the Faculdade de Ciências da Universidade do Porto for the degree of Doctor in Computer Science
Advisors: Prof. Fernando Silva and Prof. Luís Lopes
Departamento de Ciência de Computadores
Faculdade de Ciências da Universidade do Porto
May 2012
To my wife Liliana, for her endless love, support, and encouragement.
–Imagination is everything. It is the preview of
life’s coming attractions.
Albert Einstein
Acknowledgments
To my soul-mate Liliana, for her endless support in the best and worst of times. Her
unconditional love and support helped me overcome the most daunting adversities
and challenges.
I would like to thank EFACEC, in particular Cipriano Lomba, Pedro Silva and Paulo
Paixão, for the vision and support that allowed me to pursue this Ph.D.
I would like to acknowledge the financial support from EFACEC, Sistemas de Engenharia,
S.A. and FCT – Fundação para a Ciência e a Tecnologia, through Ph.D. grant
SFRH/BDE/15644/2006.
I would especially like to thank my advisors, Professors Luís Lopes and Fernando Silva,
for their endless effort and teaching over the past four years. Luís, thank you for steering
me when my mind entered a code frenzy, and for teaching me how to put my thoughts
into words. Fernando, your keen eye for the “big picture” was vital to detect and prevent
the pitfalls of building large and complex middleware systems. To both, I thank you for
opening the door of CRACS to me. I had an incredible time working with you.
A huge thank you to Professor Priya Narasimhan, for acting as an unofficial advisor.
She opened the door of CMU to me and helped shape my work at crucial stages.
Priya, I had a fantastic time brainstorming with you; each time I managed to learn
something new and exciting. Thank you for sharing with me your insights on MEAD’s
architecture, and your knowledge of fault-tolerance and real-time.
Luís, Fernando and Priya, I hope someday to be able to repay your generosity and
friendship. It is inspirational to see your passion for your work, and your continuous
effort in helping others.
I would like to thank Jiaqi Tan for taking the time to explain to me the architecture and
functionality of MapReduce, and Professor Alysson Bessani, for his thoughts on my
work and for his insights on Byzantine failures and consensus protocols.
I would also like to thank the CRACS members, Professors Ricardo Rocha, Eduardo
Correia, Vítor Costa, and Inês Dutra, for listening and sharing their thoughts on my work.
A big thank you to Hugo Ribeiro, for his crucial help with the experimental setup.
–All is worthwhile if the soul is not small.
Fernando Pessoa
Abstract
The development and management of large-scale information systems, such as high-
speed transportation networks, are pushing the limits of the current state-of-the-art
in middleware frameworks. These systems are not only subject to hardware failures,
but also impose stringent constraints on the software used for management and
therefore on the underlying middleware framework. In particular, fulfilling the Quality-
of-Service (QoS) demands of services in such systems requires simultaneous run-time
support for Fault-Tolerance (FT) and Real-Time (RT) computing, a marriage that
remains a challenge for current middleware frameworks. Fault-tolerance support is
usually introduced in the form of expensive high-level services arranged in a client-server
architecture. This approach is inadequate if one wishes to support real-time tasks due
to the expensive cross-layer communication and resource consumption involved.
In this thesis we design and implement Stheno, a general-purpose P2P middleware
architecture. Stheno innovates by integrating both FT and soft-RT in the architecture,
by: (a) implementing FT support at a much lower level in the middleware, on top of a
suitable network abstraction; (b) using the peer-to-peer mesh services to support FT;
(c) supporting real-time services through a QoS daemon that manages the underlying
kernel-level resource reservation infrastructure (CPU time); while simultaneously
(d) providing support for multi-core computing and traffic demultiplexing. Stheno is
able to minimize the resource consumption and latencies of the FT mechanisms and
allows RT services to perform within QoS limits.
Stheno has a service-oriented architecture that does not limit the type of service that can
be deployed in the middleware. Current middleware systems do not provide such a
flexible service framework, as their architectures are normally designed to support a
specific application domain, for example, the Remote Procedure Call (RPC) service.
Stheno is able to transparently deploy a new service within the infrastructure without
user assistance. Using the P2P infrastructure, Stheno searches for and selects a suitable
node on which to deploy the service with the specified QoS limits.
We thoroughly evaluate Stheno, namely the major overlay mechanisms, such
as membership, discovery and service deployment, and the impact of FT on RT, with
and without resource reservation, and compare it with closely related middleware
frameworks. The results show that Stheno is able to sustain RT performance while
simultaneously providing FT support. The performance of the resource reservation
infrastructure enables Stheno to maintain this behavior even under heavy load.
Acronyms
API Application Programming Interface
BFT Byzantine Fault-Tolerance
CCM CORBA Component Model
CID Cell Identifier
CORBA Common Object Request Broker Architecture
COTS Commercial Off-The-Shelf
DBMS Database Management Systems
DDS Data Distribution Service
DHT Distributed Hash Table
DOC Distributed Object Computing
DRE Distributed Real-Time and Embedded
DSMS Data Stream Management Systems
EDF Earliest Deadline First
EM/EC Execution Model/Execution Context
FT Fault-Tolerance
IDL Interface Definition Language
IID Instance Identifier
IPC Inter-Process Communication
IaaS Infrastructure as a Service
J2SE Java 2 Standard Edition
JMS Java Messaging Service
JRTS Java Real-Time System
JVM Java Virtual Machine
JeOS Just Enough Operating System
KVM Kernel-based Virtual Machine
LFU Least Frequently Used
LRU Least Recently Used
LwCCM Lightweight CORBA Component Model
MOM Message-Oriented Middleware
NSIS Next Steps in Signaling
OID Object Identifier
OMA Object Management Architecture
OS Operating Systems
PID Peer Identifier
POSIX Portable Operating System Interface
PoL Place of Launch
QoS Quality-of-Service
RGID Replication Group Identifier
RMI Remote Method Invocation
RPC Remote Procedure Call
RSVP Resource Reservation Protocol
RTSJ Real-Time Specification for Java
RT Real-Time
SAP Service Access Point
SID Service Identifier
SLA Service Level Agreement
SSD Solid State Disk
TDMA Time Division Multiple Access
TSS Thread-Specific Storage
UUID Universally Unique Identifier
VM Virtual Machine
VoD Video on Demand
Contents
Acknowledgments 5
Abstract 7
Acronyms 9
List of Tables 17
List of Figures 19
List of Algorithms 23
List of Listings 25
1 Introduction 27
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.2 Challenges and Opportunities . . . . . . . . . . . . . . . . . . . . . . . . 28
1.3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.4 Assumptions and Non-Goals . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.6 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2 Overview of Related Work 35
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.2 RT+FT Middleware Systems . . . . . . . . . . . . . . . . . . . . . . . . 37
2.2.1 Special Purpose RT+FT Systems . . . . . . . . . . . . . . . . . . 37
2.2.2 CORBA-based Real-Time Fault-Tolerant Systems . . . . . . . . . 39
2.3 P2P+RT Middleware Systems . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3.1 Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3.2 QoS-Aware P2P . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4 P2P+FT Middleware Systems . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.1 Publish-subscribe . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.2 Resource Computing . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.4.3 Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.5 P2P+RT+FT Middleware Systems . . . . . . . . . . . . . . . . . . . . . 49
2.6 A Closer Look at TAO, MEAD and ICE . . . . . . . . . . . . . . . . . . 49
2.6.1 TAO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.6.2 MEAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.6.3 ICE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3 Architecture 59
3.1 Stheno’s System Architecture . . . . . . . . . . . . . . . . . . . . . . . . 61
3.1.1 Application and Services . . . . . . . . . . . . . . . . . . . . . . . 62
3.1.2 Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.1.3 P2P Overlay and FT Configuration . . . . . . . . . . . . . . . . . 66
3.1.4 Support Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.1.5 Operating System Interface . . . . . . . . . . . . . . . . . . . . . 76
3.2 Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.2.1 Runtime Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.2.2 Overlay Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.2.3 Core Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3 Fundamental Runtime Operations . . . . . . . . . . . . . . . . . . . . . . 81
3.3.1 Runtime Creation and Bootstrapping . . . . . . . . . . . . . . . . 81
3.3.2 Service Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3.3 Client Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4 Implementation 91
4.1 Overlay Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.1 Overlay Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.1.2 Mesh Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.1.3 Discovery Service . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.1.4 Fault-Tolerance Service . . . . . . . . . . . . . . . . . . . . . . . . 111
4.2 Implementation of Services . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.2.1 Remote Procedure Call . . . . . . . . . . . . . . . . . . . . . . . . 123
4.2.2 Actuator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.2.3 Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.3 Support for Multi-Core Computing . . . . . . . . . . . . . . . . . . . . . 142
4.3.1 Object-Based Interactions . . . . . . . . . . . . . . . . . . . . . . 142
4.3.2 CPU Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
4.3.3 Threading Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.3.4 An Execution Model for Multi-Core Computing . . . . . . . . . . 148
4.4 Runtime Bootstrap Parameters . . . . . . . . . . . . . . . . . . . . . . . 155
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5 Evaluation 157
5.1 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.1.1 Physical Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . 157
5.1.2 Overlay Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
5.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.2.1 Overlay Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.2.2 Services Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.2.3 Load Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
5.3 Overlay Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
5.3.1 Membership Performance . . . . . . . . . . . . . . . . . . . . . . 163
5.3.2 Query Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 164
5.3.3 Service Deployment Performance . . . . . . . . . . . . . . . . . . 165
5.4 Services Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.4.1 Impact of Fault-Tolerance Mechanisms in Service Latency . . . . 167
5.4.2 Real-Time and Resource Reservation Evaluation . . . . . . . . . . 169
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
6 Conclusions and Future Work 177
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
6.3 Personal Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
References 182
List of Tables
4.1 Runtime and overlay parameters. . . . . . . . . . . . . . . . . . . . . . . 155
List of Figures
1.1 Oporto’s light-train network. . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1 Middleware system classes. . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2 TAO’s architectural layout. . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3 FLARe’s architectural layout. . . . . . . . . . . . . . . . . . . . . . . . . 53
2.4 MEAD’s architectural layout. . . . . . . . . . . . . . . . . . . . . . . . . 54
3.1 Stheno overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.2 Application Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3 Stheno’s organization overview. . . . . . . . . . . . . . . . . . . . . . . . 63
3.4 Core Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.5 QoS Infrastructure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.6 Overlay Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.7 Examples of mesh topologies. . . . . . . . . . . . . . . . . . . . . . . . . 68
3.8 Querying in different topologies. . . . . . . . . . . . . . . . . . . . . . . . 69
3.9 Support framework layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.10 QoS daemon resource distribution layout. . . . . . . . . . . . . . . . . . . 73
3.11 End-to-end network reservation. . . . . . . . . . . . . . . . . . . . . . . . 75
3.12 Operating system interface. . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.13 Interactions between layers. . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.14 Multiple processes runtime usage. . . . . . . . . . . . . . . . . . . . . . . 79
3.15 Creating and bootstrapping of a runtime. . . . . . . . . . . . . . . . . . . 81
3.16 Local service creation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.17 Finding a suitable deployment site. . . . . . . . . . . . . . . . . . . . . . 84
3.18 Remote service creation without fault-tolerance. . . . . . . . . . . . . . . 85
3.19 Remote service creation with fault-tolerance: primary-node side. . . . . . 86
3.20 Remote service creation with fault-tolerance: replica creation. . . . . . . 87
3.21 Client creation and bootstrap sequence. . . . . . . . . . . . . . . . . . . . 88
4.1 The peer-to-peer overlay architecture. . . . . . . . . . . . . . . . . . . . . 91
4.2 The overlay bootstrap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.3 The cell overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4 The initial binding process for a new peer. . . . . . . . . . . . . . . . . . 95
4.5 The final join process for a new peer. . . . . . . . . . . . . . . . . . . . . 96
4.6 Overview of the cell group communications. . . . . . . . . . . . . . . . . 99
4.7 Cell discovery and management entities. . . . . . . . . . . . . . . . . . . 103
4.8 Failure handling for non-coordinator (left) and coordinator (right) peers. 105
4.9 Cell failure (left) and subsequent mesh tree rebinding (right). . . . . . . . 106
4.10 Discovery service implementation. . . . . . . . . . . . . . . . . . . . . . . 109
4.11 Fault-Tolerance service overview. . . . . . . . . . . . . . . . . . . . . . . 112
4.12 Creation of a replication group. . . . . . . . . . . . . . . . . . . . . . . . 113
4.13 Replication group binding overview. . . . . . . . . . . . . . . . . . . . . . 114
4.14 The addition of a new replica to the replication group. . . . . . . . . . . 115
4.15 The control and data communication groups. . . . . . . . . . . . . . . . . 118
4.16 Semi-active replication protocol layout. . . . . . . . . . . . . . . . . . . . 120
4.17 Recovery process within a replication group. . . . . . . . . . . . . . . . . 122
4.18 RPC service layout. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.19 RPC invocation types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.20 RPC service architecture without (left) and with (right) semi-active FT. 130
4.21 RPC service with passive replication. . . . . . . . . . . . . . . . . . . . . 132
4.22 Actuator service layout. . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.23 Actuator service overview. . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.24 Actuator fault-tolerance support. . . . . . . . . . . . . . . . . . . . . . . 137
4.25 Streaming service layout. . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.26 Streaming service architecture. . . . . . . . . . . . . . . . . . . . . . . . . 139
4.27 Streaming service with fault-tolerance support. . . . . . . . . . . . . . . . 141
4.28 Object-to-Object interactions. . . . . . . . . . . . . . . . . . . . . . . . . 143
4.29 Examples of CPU Partitioning. . . . . . . . . . . . . . . . . . . . . . . . 144
4.30 Object-to-Object interactions with different partitions. . . . . . . . . . . 145
4.31 Threading strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.32 End-to-End QoS propagation. . . . . . . . . . . . . . . . . . . . . . . . . 148
4.33 RPC service using CPU partitioning on a quad-core processor. . . . . . . 148
4.34 Invocation across two distinct partitions. . . . . . . . . . . . . . . . . . . 149
4.35 Execution Model Pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . 150
4.36 RPC implementation using the EM/EC pattern. . . . . . . . . . . . . . . 153
5.1 Overlay evaluation setup. . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
5.2 Physical evaluation setup. . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.3 Overview of the overlay benchmarks. . . . . . . . . . . . . . . . . . . . . 160
5.4 Network organization for the service benchmarks. . . . . . . . . . . . . . 161
5.5 Overlay bind (left) and rebind (right) performance. . . . . . . . . . . . . 164
5.6 Overlay query performance. . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.7 Overlay service deployment performance. . . . . . . . . . . . . . . . . . . 166
5.8 Service rebind time (left) and latency (right). . . . . . . . . . . . . . . . 168
5.9 Rebind time and latency results with resource reservation. . . . . . . . . 170
5.10 Missed deadlines without (left) and with (right) resource reservation. . . 172
5.11 Invocation latency without (left) and with (right) resource reservation. . 174
5.12 RPC invocation latency comparing with reference middlewares (without
fault-tolerance). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
List of Algorithms
4.1 Overlay bootstrap algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2 Mesh startup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3 Cell initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4 Cell group communications: receiving-end . . . . . . . . . . . . . . . . . 100
4.5 Cell group communications: sending-end . . . . . . . . . . . . . . . . . . 102
4.6 Cell Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.7 Cell fault handling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.8 Cell fault handling (continuation). . . . . . . . . . . . . . . . . . . . . . . 108
4.9 Discovery service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.10 Creation and joining within a replication group . . . . . . . . . . . . . . 116
4.11 Primary bootstrap within a replication group . . . . . . . . . . . . . . . 117
4.12 Fault-Tolerance resource discovery mechanism. . . . . . . . . . . . . . . . 118
4.13 Replica startup. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.14 Replica request handling . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.15 Support for semi-active replication. . . . . . . . . . . . . . . . . . . . . . 121
4.16 Fault detection and recovery . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.17 A RPC object implementation. . . . . . . . . . . . . . . . . . . . . . . . 126
4.18 RPC service bootstrap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.19 RPC service implementation. . . . . . . . . . . . . . . . . . . . . . . . . 128
4.20 RPC client implementation. . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.21 Semi-active replication implementation. . . . . . . . . . . . . . . . . . . . 130
4.22 Service’s replication callback. . . . . . . . . . . . . . . . . . . . . . . . . 131
4.23 Passive Fault-Tolerance implementation. . . . . . . . . . . . . . . . . . . 133
4.24 Actuator service bootstrap. . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.25 Actuator service implementation. . . . . . . . . . . . . . . . . . . . . . . 136
4.26 Actuator client implementation. . . . . . . . . . . . . . . . . . . . . . . . 136
4.27 Stream service bootstrap. . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.28 Stream service implementation. . . . . . . . . . . . . . . . . . . . . . . . 140
4.29 Stream client implementation. . . . . . . . . . . . . . . . . . . . . . . . . 141
4.30 Joining an Execution Model. . . . . . . . . . . . . . . . . . . . . . . . . . 151
4.31 Execution Context stack management. . . . . . . . . . . . . . . . . . . . 152
4.32 Implementation of the EM/EC pattern in the RPC service. . . . . . . . . 154
List of Listings
3.1 Overlay plugin and runtime bootstrap. . . . . . . . . . . . . . . . . . . . 82
3.2 Transparent service creation. . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.3 Service creation with explicit and transparent deployments. . . . . . . . . 85
3.4 Service creation with Fault-Tolerance support. . . . . . . . . . . . . . . . 87
3.5 Service client creation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.1 A RPC IDL example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
–Most of the important things in the world have
been accomplished by people who have kept
trying when there seemed to be no hope at all.
Dale Carnegie

1 Introduction
1.1 Motivation
The development and management of large-scale information systems is pushing the
limits of the current state-of-the-art in middleware frameworks. At EFACEC¹, we have
to handle a multitude of application domains, including: information systems used
to manage public, high-speed transportation networks; automated power management
systems to handle smart grids; and power supply systems that monitor power supply units
through embedded sensors. Such systems typically transfer large amounts of streaming
data; have erratic periods of extreme network activity; are subject to relatively common
hardware failures, often for comparatively long periods; and require low jitter and fast
response times for safety reasons, for example, in vehicle coordination.
Target Systems
The main motivation for this PhD thesis was the need to address the requirements of the
public transportation solutions at EFACEC, more specifically, the light-train systems.
One such deployment is installed in Oporto’s light-train network and
is composed of 5 lines, 70 stations and approximately 200 sensors (partially illustrated
in Figure 1.1). Each station is managed by a computational node, which we designate
a peer, responsible for managing all the local audio, video and display panels, and
low-level sensors such as track sensors for detecting inbound and outbound trains.
¹ EFACEC, the largest Portuguese group in the field of electricity, with a strong presence in systems
engineering, namely in public transportation and energy systems, employs around 3000 people and has
a turnover of almost 1000 million euro; it is established in more than 50 countries and exports almost
half of its production (cf. http://www.efacec.com).
The system supports three types of traffic: normal, for regular operations over the
system, such as playing an audio message in a station through an audio codec; critical,
medium-priority traffic comprising urgent events, such as an equipment malfunction
notification; and alarms, high-priority traffic that signals critical events, such as low-level
sensor events. Independently of the traffic type (e.g., event, RPC operation), the system
requires that any operation completes within 2 seconds.
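As a rough illustration of this traffic model, the three classes and the common deadline bound could be captured as follows (a minimal C++ sketch; the type and field names are ours, not part of Stheno’s API):

    #include <chrono>

    // Hypothetical sketch of the three traffic classes described above.
    enum class TrafficClass {
        Normal,   // regular operations, e.g. playing an audio message at a station
        Critical, // medium priority, e.g. an equipment malfunction notification
        Alarm     // high priority, e.g. a low-level track sensor event
    };

    struct Operation {
        TrafficClass type;
        // Regardless of class, every operation must complete within 2 seconds.
        static constexpr std::chrono::seconds deadline{2};
    };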
From the point of view of distributed architectures, the current deployments are best
matched by P2P infrastructures that are resilient, allow resources (e.g., a sensor
connected through a serial link to a peer) to be seamlessly mapped to the logical
topology, the mesh, and also provide support for real-time (RT) and fault-tolerant (FT)
services. Support for both RT and FT is fundamental to meet system requirements.
Moreover, next-generation light-train solutions require deployments across cities
and regions that can be overwhelmingly large. This introduces the need for a scalable
hierarchical abstraction, the cell, composed of several peers that cooperate to
maintain a portion of the mesh.
Figure 1.1: Oporto’s light-train network.
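The relationship between these abstractions can be summarized by a minimal data-model sketch (hypothetical names; the identifiers echo the PID/CID entries of the acronym list):

    #include <string>
    #include <vector>

    struct Peer {                 // a computational node, e.g. one station
        std::string pid;          // Peer Identifier (PID)
    };

    struct Cell {                 // maintains one portion of the mesh
        std::string cid;          // Cell Identifier (CID)
        std::vector<Peer> peers;  // peers cooperating within this cell
    };

    struct Mesh {                 // the logical topology of a deployment
        std::vector<Cell> cells;
    };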
1.2 Challenges and Opportunities
The requirements of our target systems pose a significant number of challenges. The
presence of FT mechanisms, especially when using space redundancy [1], introduces the
need for multiple copies of the same resource (replicas), and these, in turn,
ultimately lead to greater resource consumption.
FT also introduces overheads in the form of latency, and this is another important
constraint when dealing with RT systems. When an operation is performed,
irrespective of whether it is real-time or not, any state change that it causes
must be propagated among the replicas through a replication algorithm, which introduces
an additional source of latency. Furthermore, the recovery time, that is, the
time that the system needs to recover from a fault, is an additional source of latency
for real-time operations. There are well-known replication styles that offer different
trade-offs between state consistency and latency.
Our target systems have different traffic types with distinct deadline requirements that
must be supported while using Commercial Off-The-Shelf (COTS) hardware (e.g., Ethernet
networking) and software (e.g., Linux). This requires that the RT mechanisms leverage
the available resources, through resource reservation, while providing different threading
strategies that allow different trade-offs between latency and throughput.
To overcome the overhead introduced by the FT mechanisms, it must be possible
to employ a replication algorithm that does not compromise the RT requirements.
Replication algorithms that offer a higher degree of consistency introduce a higher
level of latency [1, 2] that may be prohibitive for certain traffic types. On the other
hand, certain replication algorithms exhibit lower resource consumption and latency
at the expense of a longer recovery time, which may also be prohibitive.
Considering current state-of-the-art research, we see many opportunities to address
these challenges. One is the use of a COTS operating system, which allows for a
faster implementation, and thus a smaller development cost, while offering the necessary
infrastructure on which to build a new middleware system.
P2P networks can be used to provide a resilient infrastructure that mirrors the physical
deployments of our target systems; furthermore, different P2P topologies offer different
trade-offs between self-healing, resource consumption and latency in end-to-end operations.
Moreover, by implementing FT directly on the P2P infrastructure, we hope
to lower resource usage and latency enough to allow the integration of RT. By using proven
replication algorithms [1, 2] that offer well-known trade-offs regarding consistency,
resource consumption and latency, we can focus on the actual problem of integrating
real-time and fault-tolerance within a P2P infrastructure.
RT support, on the other hand, can be achieved through the implementation of different
threading strategies, resource reservation (through Linux’s Control Groups), and the
avoidance of traffic multiplexing through the use of different access points to handle
different traffic priorities. While the use of Earliest Deadline First (EDF) scheduling
would provide greater RT guarantees, this goal will not be pursued due to the lack of
maturity of current EDF implementations in Linux (our reference COTS operating system).
Because we are limited to priority-based scheduling and resource reservation, we can
only partially support our goal of providing end-to-end guarantees; more specifically,
we enhance our RT guarantees through the use of RT scheduling policies with
over-provisioning to ensure that deadlines are met.
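To make the mechanisms concrete, the sketch below shows a CPU reservation made through Linux Control Groups and a fixed real-time priority set with a POSIX scheduling policy. It is a minimal sketch only: it assumes a cgroup-v1 cpu controller mounted at /sys/fs/cgroup/cpu, an already-created group named stheno, and sufficient privileges; all values are illustrative.

    #include <fstream>
    #include <pthread.h>
    #include <sched.h>

    // Reserve roughly half of one CPU for the "stheno" group:
    // quota/period = 50000us / 100000us.
    void reserve_cpu_share() {
        std::ofstream("/sys/fs/cgroup/cpu/stheno/cpu.cfs_period_us") << 100000;
        std::ofstream("/sys/fs/cgroup/cpu/stheno/cpu.cfs_quota_us") << 50000;
    }

    // Give the calling thread a fixed priority under SCHED_FIFO; with
    // over-provisioned priorities, deadlines can be met without EDF.
    void set_rt_priority(int priority) {
        sched_param sp{};
        sp.sched_priority = priority;  // e.g. 80 in the 1..99 range
        pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    }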
1.3 Problem Definition
The work presented in this thesis focuses on the integration of Real-Time (RT) and
Fault-Tolerance (FT) in a scalable general purpose middleware system. This goal
can only be achieved if the following premises are valid: (a) FT infrastructure cannot
interfere in RT behavior, independently of the replication policy; (b) the network model
must be able to scale, and; (c) ultimately, FT mechanisms need to be efficient and aware
of the underlying infrastructure, i.e. network model, operating system and physical
environment.
Our problem definition is a direct consequence of the requirements of our target
systems, and it can be summarized with the following question: “Can we opportunistically
leverage and integrate these proven strategies to simultaneously support soft-RT and FT
to meet the needs of our target systems even under faulty conditions?”
In this thesis we argue that a lightweight implementation of fault-tolerance mechanisms
in a middleware is fundamental for its successful integration with soft real-time support.
Our approach is novel in that it explores peer-to-peer networking as a means to implement
generic, transparent, lightweight fault-tolerance support. We do this by directly
embedding fault-tolerance mechanisms into peer-to-peer overlays, taking advantage of
their scalable, decentralized and resilient nature. For example, peer-to-peer networks
readily provide the functionality required to maintain and locate redundant copies of
resources. Given their dynamic and adaptive nature, they are promising infrastructures
for developing lightweight fault-tolerant and soft real-time middleware.
Despite these a priori advantages, mainstream generic peer-to-peer middleware systems
for QoS computing are, to our knowledge, unavailable. Motivated by this state of
affairs, by the limitations of the current infrastructure for the information systems we
manage at EFACEC (based on CORBA technology) and, last but not least, by the
comparative advantages of flexible peer-to-peer network architectures, we have designed
and implemented a prototype service-oriented peer-to-peer middleware framework.
The networking layer relies on a modular infrastructure that can handle multiple
peer-to-peer overlays. The support for fault-tolerance and soft real-time features is provided
at this level through the implementation of efficient and resilient services for, e.g.,
resource discovery, messaging and routing. The kernel of the middleware system (the
runtime) is implemented on top of these overlays and uses the above-mentioned
peer-to-peer functionalities to provide developers with APIs for the customization of QoS
policies for services (e.g., bandwidth reservation, CPU/core reservation, scheduling
strategy, number of replicas). This approach was inspired by that of TAO [3], which
allows distinct strategies to be defined for the execution of tasks by threads.
1.4 Assumptions and Non-Goals
The distributed model used in this thesis is based on a partially asynchronous
computing model, as defined in [2], extended with fault detectors.
The services and the P2P plugin implemented in this thesis only support crash failures. We
consider a crash failure [1] to be characterized as a complete shutdown of a computing
instance in the event of a failure, after which it ceases to interact with the remaining
entities of the distributed system.
Timing faults are handled differently by services and by the P2P plugin. In our service
implementations a timing fault is logged (for analysis) with no other action being
performed, whereas in the P2P layer we treat a timing fault as a crash failure, i.e.,
if the remote creation of a service exceeds its deadline, the peer is considered crashed.
This method is also called process-controlled crash, or crash control, as defined in
[4]. In this thesis, we adopt a more relaxed version: a peer wrongly suspected of
having crashed is not killed and does not commit suicide; instead, it is shunned, that
is, expelled from the overlay and forced to rejoin it, more precisely, to rebind using
the membership service in the P2P layer.
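A minimal sketch of this shunning policy follows; all names are illustrative, and this is not Stheno’s actual membership code:

    #include <set>
    #include <string>

    class Overlay {
        std::set<std::string> members;  // peers currently bound to the mesh
    public:
        // The membership service: a shunned peer must rebind through here.
        void bind(const std::string& pid) { members.insert(pid); }

        // In the P2P layer a timing fault is treated as a crash failure:
        // the suspected peer is shunned, i.e., expelled from the overlay.
        void on_deadline_exceeded(const std::string& pid) { members.erase(pid); }

        bool is_member(const std::string& pid) const {
            return members.count(pid) != 0;
        }
    };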
The fault model used was motivated by the author’s experience with several field
deployments of light-train transportation systems, such as the Oporto, Dublin and Tenerife
Light Rail solutions [5]. Due to the use of highly redundant hardware, such as
redundant power supplies and redundant 10-Gbit network ring links, network failures
tend to be short. The most common cause of downtime is software bugs,
which mostly result in a crashing computing node. While simultaneous failures can
happen, they are considered rare events.
We also assume that the resource-reservation mechanisms are always available.
In this thesis we do not address value faults and Byzantine faults, as they are not a
requirement for our target systems. Furthermore, we do not provide a formal
specification and verification of the system. While this would be beneficial to assess system
correctness, we had to limit the scope of this thesis. Nevertheless, we provide an
empirical evaluation of the system.
We also do not address hard real-time, because of the lack of mature support for EDF
scheduling in the Linux kernel. Furthermore, we do not provide a fully optimized
implementation, but only a proof-of-concept to validate our approach. Testing the
system in a production environment is left for future work.
1.5 Contributions
Before undertaking the task of building an entirely new middleware system from scratch,
we explored current solutions, presented in Chapter 2, to see if any of them could
support the requirements of our target system. As we did not find any suitable
solution, we then assessed whether it was possible to extend an available solution to meet
those requirements. In our previous work, DAEM [6], we explored the use of JGroups [7]
within a hierarchical P2P mesh, and concluded that the simultaneous support for real-time,
fault-tolerance and P2P requires fine-grained control of resources that is not possible
with “black-box” solutions; for example, it is impossible to have out-of-the-box
support for resource reservation in JGroups.
Given these assessments, we designed and implemented Stheno, which to the best
of our knowledge is the first middleware system to seamlessly integrate fault-tolerance
and real-time in a peer-to-peer infrastructure. Our approach was motivated by the
lack of support in current solutions for the timing, reliability and physical deployment
characteristics of our target systems.
To this end, a complete architectural design is proposed that addresses all levels of the
software stack, including kernel space, network, runtime and services, to achieve a
seamless integration. The list of contributions includes: (a) a full specification of a user
Application Programming Interface (API); (b) a pluggable P2P network infrastructure
that can be adjusted to the target application; (c) support for configurable FT in
the P2P layer, with the goal of providing lightweight FT mechanisms that fully enable
RT behavior; and (d) the integration of resource reservation at all levels of the runtime,
enabling (partial) end-to-end Quality-of-Service (QoS) guarantees.
Previous work [8, 9, 10] on resource reservation focused solely on CPU provisioning
for real-time systems. In this thesis we present Euryale, a network-oriented QoS
framework that features resource reservation with support for a broader range of
subsystems, including CPU, memory, I/O and network bandwidth, on a general-purpose
operating system such as Linux. At the heart of this infrastructure resides Medusa, a QoS
daemon that handles the admission and management of QoS requests.
Current well-known threading strategies, such as Leader/Followers [11], Thread-per-
Connection [12] and Thread-per-Request [13], offer well-known trade-offs between
latency and resource usage [3, 14]. However, they do not support resource reservation,
namely, CPU partitioning. To overcome this limitation, this thesis makes an
additional contribution with the introduction of a novel design pattern (Chapter 4) that
integrates multi-core computing with resource reservation within a configurable
framework that supports these well-known threading strategies. For example, when a
client connects to a service it can specify, through the QoS real-time parameters, the
particular threading strategy that best meets its requirements.
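A hypothetical sketch of such a QoS parameter is given below; the strategy names mirror the cited literature, but the struct itself is illustrative rather than Stheno’s actual API:

    // QoS parameters a client could pass when connecting to a service.
    enum class ThreadingStrategy {
        LeaderFollowers,      // low latency, bounded number of threads
        ThreadPerConnection,  // per-client isolation
        ThreadPerRequest      // maximum concurrency, higher resource usage
    };

    struct QoSParams {
        ThreadingStrategy strategy = ThreadingStrategy::LeaderFollowers;
        int cpu_partition = -1;  // optional CPU partition id (-1 = none)
        int rt_priority = 0;     // real-time priority hint
    };

    // For example:
    //   QoSParams qos{ThreadingStrategy::ThreadPerRequest, 1, 80};
    //   client.connect(endpoint, qos);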
We present a full implementation that covers all the aforementioned architectural features,
including a complete overlay implementation, inspired by the P3 [15] topology, that
seamlessly integrates RT and FT.
To evaluate our implementation and justify our claims, we present a complete evaluation
of both mechanisms. We also evaluate the impact of the resource reservation
mechanism, and compare RT performance against state-of-the-art middleware systems.
The experimental results show that Stheno meets and exceeds the target system
requirements for end-to-end latency and fail-over latency.
1.6 Thesis Outline
The focus of this thesis is on the design, implementation and evaluation of a scalable
general purpose middleware that provides the seamless integration of RT and FT. The
remaining of this thesis is organized as follows.
Chapter 2: Overview of Related Work.
This chapter presents an overview of related middleware systems that exhibit support
for RT, FT and P2P, the mandatory requirements of our target system. We start
by searching for an available off-the-shelf solution that could support all of these
requirements or, in its absence, by identifying a current solution that could be extended,
in order to avoid creating a new middleware solution from scratch.
Chapter 3: Architecture.
Chapter 3 describes the runtime architecture of the proposed middleware. We start
by providing a detailed insight into the architecture, covering all layers of the
runtime. Special attention is given to the presentation of the QoS and resource
reservation infrastructure. This is followed by an overview of the programming model,
describing the most important interfaces in the runtime, as well as the interactions
that occur between them. The chapter ends with a description of the fundamental
runtime operations, namely: the creation of services with and without FT support,
the deployment strategy, and client creation.
Chapter 4: Implementation.
Chapter 4 describes the implementation of a prototype based on the aforementioned
architecture, and is divided into four parts. In the first part, we present a complete
implementation of the P2P overlay, inspired by the P3 [15] topology, while providing
some insight into the limitations of the current prototype. The second part of this chapter
focuses on the implementation of three types of user services, namely, Remote Procedure
Call (RPC), Actuator, and Streaming. These services are thoroughly evaluated in
Chapter 5. In the third part, we describe our support for multi-core computing, through
the presentation of a novel design pattern, the Execution Model/Context. This design
pattern is able to integrate resource reservation, especially CPU partitioning, with
different well-known (and configurable) threading strategies. The fourth and final part
of this chapter describes the most relevant parameters used in the bootstrap of the
runtime.
Chapter 5: Evaluation.
The experimental results are presented in this chapter. It starts by providing details
of the physical setup used throughout the evaluation. It then describes the parameters
used in the testbed suite, which is composed of the three services previously described in
Chapter 4. We then focus on the results of the benchmarks, including an
assessment of the impact of FT on RT and of the impact of the resource reservation
infrastructure on the overall performance. The chapter ends with a comparative evaluation
against well-known middleware systems.
Chapter 6: Conclusion and Future Work.
This last chapter presents the concluding remarks. It highlights the contributions of
the proposed and implemented middleware, and outlines directions for future work.
–By failing to prepare, you are preparing to fail.
Benjamin Franklin

2 Overview of Related Work
2.1 Overview
This chapter presents an overview of the state-of-the-art in related middleware systems.
As illustrated in Figure 2.1, we are mostly interested in systems that exhibit support
for real-time (RT), fault-tolerance (FT) and peer-to-peer (P2P) computing, the mandatory
requirements of our target system. We started by searching for an available off-the-shelf
solution that could support all of these requirements or, in its absence, by identifying a
current solution that could be extended, thus avoiding the creation of a new middleware
solution from the ground up. For that reason, we have focused on the intersecting
domains, namely, RT+FT, RT+P2P and FT+P2P, since the systems in these
domains come closest to meeting the requirements of our target system.
From a historical perspective, the origins of modern middleware systems can be traced
back to the 1980s, with the introduction of the concept of ubiquitous computing, in
which computational resources are accessible and seen as ordinary commodities such
as electricity or tap water [2]. The interaction between these resources
and their users was governed by the client-server model [16] and a supporting protocol
called RPC [17]. The client-server model is still the most prevalent paradigm in current
distributed systems.
An important architecture for client-server systems was introduced with the Common
Object Request Broker Architecture (CORBA) standard [18] in the 1990s, but it did not
address real-time or fault-tolerance. Only recently were both the real-time and the
fault-tolerance specifications finalized, and they remain mutually exclusive: a
system supporting the real-time specification is not able to support the fault-tolerance
specification, and vice-versa. Nevertheless, seminal work has already addressed these
limitations and offered systems supporting both features, namely, TAO [3]
and MEAD [14]. At the same time, Remote Method Invocation (RMI) [19] appeared as
a Java alternative capable of providing a more flexible and easy-to-use environment.

Figure 2.1: Middleware system classes.
In recent years, CORBA entered a steady decline [20] in favor of web-oriented
platforms, such as J2EE [21], .NET [22] and SOAP [23], and of P2P systems. The
web-oriented platforms, such as the JBoss [24] application server, aim to integrate
availability with scalability, but remain unable to support real-time. Moreover,
while partitioning offers a clean approach to improving scalability, it fails to support
large-scale distributed systems [2]. Alternatively, P2P systems focused on providing
logical organizations, i.e., meshes, that abstract the underlying physical deployment
while providing a decentralized architecture for increased resiliency. These systems
focused initially on resilient distributed storage solutions, such as Dynamo [25], but
progressively evolved to support soft real-time systems, such as video streaming [26].
More recently, Message-Oriented Middleware (MOM) systems [27] offer a distributed
message-passing infrastructure based on an asynchronous interaction model, which is
able to overcome the scaling issues present in RPC. A considerable number of
implementations exist, including Tibco [28], Websphere MQ [29] and the Java Messaging
Service (JMS) [30]. MOM systems are sometimes integrated as subsystems in application
server infrastructures, such as JMS in J2EE and Websphere MQ in the Websphere
Application Server.
A substantial body of research has focused on the integration of real-time within
CORBA-based middleware, such as TAO [3] (which later addressed the integration of
fault-tolerance). More recently, QoS-enabled publish-subscribe middleware systems
based on the JAIN SLEE specification [31], such as Mobicents [32], and on the Data
Distribution Service (DDS) specification, such as OpenDDS [33], Connext DDS [34]
and OpenSplice [35], appeared as a way to overcome the current lack of support for
real-time applications in SOA-based middleware systems.
The introduction of fault-tolerance in middleware systems also remains an active topic
of research. CORBA-based middleware systems were a fertile ground for testing
fault-tolerance techniques in a general-purpose platform, resulting in the creation of the
CORBA-FT specification [36]. Nowadays, some of this focus has been redirected to
SOA-based platforms, such as J2EE. One of the most popular deployments, JBoss, supports
scalability and availability through partitioning. Each partition is supported by a group
communication framework based on the virtual synchrony model, more specifically, the
JGroups [7] group communication framework.
2.2 RT+FT Middleware Systems
This section overviews systems that provide simultaneous support for real-time and
fault-tolerance. These systems are divided into special-purpose solutions, designed for
specific application domains, and CORBA-based solutions, aimed at general-purpose
computing.
2.2.1 Special Purpose RT+FT Systems
Special-purpose real-time fault-tolerant systems introduced concepts and implementation
strategies that are still relevant in current state-of-the-art middleware systems.
Armada
Armada [37] focused on providing middleware services and a communication infrastructure
to support FT and RT semantics for distributed real-time systems. This was
pursued in two ways, which we now describe.
The first contribution was the introduction of a communication infrastructure able
to provide end-to-end QoS guarantees, for both unicast and multicast primitives.
This was supported by control signaling and QoS-sensitive data transfer (as in the
newer Resource Reservation Protocol (RSVP) and Next Steps in Signaling (NSIS)).
The network infrastructure used a reservation mechanism based on an EDF scheduling
policy, built on top of the priority-based scheduling of the Mach OS. The initial
implementation was done at user level but subsequently migrated to kernel
level with the goal of reducing latency.
Many of the architectural decisions regarding RT support were based on the operating
systems available at the time, mainly the Mach OS. Despite the advantages of a
micro-kernel approach, its application remains restricted by the underlying cost associated
with message passing and context switching. Instead, a large body of research has been
devoted to monolithic kernels, especially the Linux OS, which are able to offer the advantages
of the micro-kernel approach, through the introduction of kernel modules, together with
the speed of monolithic kernels.
The second contribution came in the form of a group communication infrastructure,
based on a ring topology, that ensured the delivery of messages reliably and in total
order within a bounded time. It also supported membership management,
offering consistent views of the group through the detection of process and communication
failures. These group communication mechanisms enabled support for FT
through the use of a passive replication scheme that allowed for some inconsistency
between the primary and the replicas, where the state of a replica could lag behind
the state of the primary up to a bounded time window.
Mars
Mars [38] provided support for the analysis and deployment of synchronous hard real-time
systems through a static off-line scheduler for the CPU and a Time Division Multiple
Access (TDMA) bus. Mars offers FT support through the use of active
redundancy on the TDMA bus, i.e., sending multiple copies of the same message, and
through self-checking mechanisms. Deterministic communication is achieved through
the use of a time-triggered protocol.
The project focused on RT process control, where all the intervening entities are
known in advance, so it offers no support for the dynamic admission of
new components, nor does it support on-the-fly fault recovery.
ROAFTS
The ROAFTS [39, 40] system aims to provide transparent, adaptive FT support for
distributed RT applications, consisting of a network of Time-triggered Message-triggered
Objects [41] (TMOs) whose execution is managed by a TMO support manager. The
FT infrastructure consists of a set of specialized TMOs, which include: (a) a generic
fault server, and (b) a network surveillance [42] manager. Fault detection is assured
by the network surveillance TMO and used by the generic fault server to change the
FT policy with the goal of preserving RT semantics. The system assumes that RT
applications can live with weaker reliability assurances from the middleware under
highly dynamic environments.
Maruti
Maruti [43] aimed to provide a development framework and an infrastructure for the
deployment of hard real-time applications within a reactive environment, focusing on
real-time requirements on single-processor systems. The reactive model is able to
make runtime decisions on the admission of new processing requests without adverse
effects on the scheduling of existing requests. Fault-tolerance is achieved by
redundant computation, and a configuration language allows the deployment of replicated
modules and services.
Delta-4
Delta-4 [44] provided an in-depth characterization of fault assumptions, for both
hosts and the network. It also demonstrated various techniques for handling them,
namely, passive and active replication for fail-silent hosts and Byzantine agreement for
fail-uncontrolled hosts. This work was followed by the Delta-4 Extra Performance
Architecture (XPA) [45], which aimed to provide real-time support for the Delta-4 framework
through the introduction of the Leader/Follower replication model (better known as
semi-active replication) for fail-silent hosts. This work also led to the extension of the
communication system to support additional communication primitives (the original
work on Delta-4 only supported the Atomic primitive), namely, Reliable, AtLeastN and
AtLeastTo.
2.2.2 CORBA-based RT+FT Systems
The support for RT and FT in general-purpose distributed platforms remains mostly
restricted to CORBA. While some work was carried out by Sun to introduce RT support
in Java, with the introduction of the Real-Time Specification for Java (RTSJ) [46,
47], it was aimed at the Java 2 Standard Edition (J2SE). The most relevant
implementations are Sun’s Java Real-Time System (JRTS) [48] and IBM’s Websphere Real-Time
VM [49, 50]. To the best of our knowledge, only WebLogic Real-Time [51] attempted
to provide support for RT in a J2EE environment. Nevertheless, this support seems to
be confined to the introduction of a deterministic garbage collector, through the use of
the RT JRockit JVM, as a way to prevent the unpredictable pause times caused by garbage
collection [51].
Previous work on the integration of RT and FT in CORBA-based systems can be
categorized into three distinct approaches: (a) integration, where the base ORB is modified;
(b) services, where systems rely on high-level services to provide FT (and, indirectly, RT);
and (c) interception, where systems intercept client requests to provide
transparent FT and RT.
Integration Approach
Past work on the integration of fault-tolerance in CORBA-like systems was done in
Electra [52], Maestro [53] and AQuA [54]. Electra [52] was one of the predecessors of
the CORBA-FT standard [55, 36]; it focused on enhancing the Object Management
Architecture (OMA) to support transparent and non-transparent fault-tolerance
capabilities. Instead of using message queues or transaction monitors [56], it relied on
object-communication groups [57, 58]. Maestro [53] is a distributed layer built on top of
the Ensemble [59] group communication system, and was used by Electra [52] in the Quality
of Service for CORBA Objects (QuO) project [60]. Its main focus was to provide an
efficient, extensible and non-disruptive integration of the object layers with the low-level
QoS system properties. The AQuA [54] system uses both QuO and Maestro on
top of the Ensemble communication groups to provide a flexible and modular approach
that is able to adapt to faults and to changes in the application requirements. Within its
framework, a QuO runtime accepts availability requests from the application and relays
them to a dependability manager, which is responsible for handling the requests from
multiple QuO runtimes.
TAO+QuO
The work done in [61] focused on the integration of QoS mechanisms, for both CPU and
network resources, supporting both priority- and reservation-based QoS semantics,
with standard COTS Distributed Real-Time and Embedded (DRE) middleware, more
precisely, TAO [3]. The underlying QoS infrastructure was provided by QuO [60]. The
priority-based approach was built on top of the RT-CORBA specification, which defines
a set of standard features in order to provide end-to-end predictability for operations
within a fixed-priority context [62]. Priority-based CPU resource management is
left to the scheduler of the underlying Operating System (OS), whereas priority-based
network management is achieved through the use of the DiffServ architecture [63],
by setting the DSCP codepoint in the IP header of GIOP requests (a sketch of this
mechanism is given after this paragraph). Based on various factors, the QuO runtime
can dynamically change this priority to adjust to environment changes. Alternatively,
the reservation-based network approach relies on the RSVP [64] signaling protocol to
guarantee the desired network bandwidth between hosts. The QuO runtime monitors
the RSVP connections and makes adjustments to overcome abnormal conditions; for
example, in a video service it can drop frames to maintain stability. CPU reservations
are made using the reservation mechanisms present in the TimeSys Linux kernel, and
it is left to TAO and QuO to decide on the reservation policies. This was done to
preserve the end-to-end QoS semantics that are only available at a higher level of the
middleware.
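The DiffServ marking mentioned above can be illustrated with the standard IP_TOS socket option (a minimal sketch; the DSCP value of 46, Expedited Forwarding, is only an example):

    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <sys/socket.h>

    // Mark outgoing packets on this socket with a DiffServ codepoint.
    bool set_dscp(int sockfd, int dscp /* e.g. 46 = EF */) {
        int tos = dscp << 2;  // DSCP occupies the upper 6 bits of the TOS byte
        return setsockopt(sockfd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) == 0;
    }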
CIAO+QuO
CIAO [65] is a QoS-aware CORBA Component Model (CCM) implementation built on top of TAO [3] that aims to alleviate the complexity of integrating real-time features into DRE systems using Distributed Object Computing (DOC) middleware. These DOC systems, of which TAO is an example, offer configurable policies and mechanisms for QoS, namely real-time, but lack a programming model capable of separating systemic aspects from application logic. Furthermore, QoS provisioning must be done in an end-to-end fashion, and therefore has to be applied to several interacting components. It is difficult, or nearly impossible, to properly configure a component without taking into account the QoS semantics of the interacting entities. Developers using standard DOC middleware systems are thus liable to produce misconfigurations that cause overall system misbehavior. CIAO overcomes these limitations by applying a wide range of aspect-oriented development techniques that support the composition of real-time semantics without intertwining configuration concerns. Support for CIAO's CCM architecture was added in CORFU [66], described below.
Work on the integration of CIAO with Quality Objects (QuO) [60] was done in [67]. Integrating QuO's infrastructure into CIAO enhanced its limited static QoS provisioning into a total provisioning middleware that also accommodates dynamic and adaptive QoS provisioning. Without this integration, for example, the setup of an RSVP [64] connection would require explicit configuration by the developer, defeating the purpose of CIAO. Nevertheless, while CIAO is able to compose QuO components, known as Qoskets [68], it does not provide a solution for component cross-cutting.
DynamicTAO
DynamicTAO [69] focused on providing a reflective middleware model that extends TAO to support on-the-fly dynamic reconfiguration of its component behavior and resource management through meta-interfaces. It allows the application to inspect the internal state/configuration and, if necessary, to reconfigure it in order to adapt to environment changes. Subsequently, it is possible to select networking protocols, encodings and security policies to improve overall system performance in the presence of unexpected events.
Service-based Approach
An alternative, high-level service approach for CORBA fault-tolerance was taken by
Distributed Object-Oriented Reliable Service (DOORS) [70], Object Group Service
(OGS) [71], and the Newtop Object Group Service [72]. DOORS focused on providing replica management, fault-detection and fault-recovery as a CORBA high-level service. It did not rely on group communication and mainly focused on passive replication, but allowed the developer to select the desired level of reliability (number of replicas), the replication policy, the fault-detection mechanism, e.g., SNMP-enhanced fault detection, and the recovery strategy. OGS improved over prior approaches by using a group communication protocol that imposes consensus semantics. Instead of adopting an integrated approach, group communication services are transparent to the ORB and provided through request-level bridging. Newtop followed a similar approach to OGS but added support for network partitions, allowing newly formed sub-groups to continue to operate.
TAO
TAO [3] is a CORBA middleware with support for RT and FT that is compliant with the OMG standards for CORBA-RT [73] and CORBA-FT [36]. The support for RT includes priority propagation, explicit binding, and RT thread pools. FT is supported through the use of a high-level service, the Replication Manager, that sits on top of the CORBA stack. This service is the cornerstone of the FT infrastructure, acting as a rendezvous for the remaining components, more precisely: monitors that watch the status of the replicas, replica factories that allow the creation of new replicas, and fault notifiers that inform the manager of failed replicas. TAO's architecture is further detailed in Section 2.6.
FLARe and CORFU
FLARe [74] focuses on proactively adapting the replication group to underlying changes in resource availability. To minimize resource usage, it only supports passive replication [75]. Its implementation is based on TAO [3]. It adds four new components to the existing architecture: (a) a Replication Manager, a high-level service that decides on the strategy employed to address changes in resource availability and faults; (b) a client interceptor that redirects invocations to the active primary; (c) a redirection agent that receives updates from the Replication Manager and is used by the interceptor; and (d) a resource monitor that watches the load on nodes and periodically notifies the Replication Manager. In the presence of faulty conditions, such as the overload of a node, the Replication Manager adapts the replication group to the changing conditions by activating replicas on nodes that have lower resource usage and, additionally, by moving the primary to a more suitable placement.
CORFU [66] extends FLARe to support real-time and fault-tolerance for the Lightweight CORBA Component Model (LwCCM) [76] standard for DRE systems. It provides fail-stop behavior: when one component of a failover unit fails, all the remaining components are stopped, allowing for a clean switch to a new unit. This is achieved through a fault-mapping facility that maps an object failure to the respective plan(s), with the subsequent component shutdown.
DeCoRAM
The DeCoRAM system [77] aims to provide RT and FT properties through a resource-aware configuration, executed using a deployment infrastructure. The class of supported systems is confined to closed DRE systems, where the number of tasks and their respective execution and resource requirements are known a priori and remain invariant throughout the system's life-cycle. As the tasks and resources are static, it is possible to optimize the allocation of the replicas on the available nodes. The allocation algorithm is configurable, allowing the user to choose the best approach for a particular application domain. DeCoRAM provides a custom allocation algorithm named FERRARI (FailurE, Real-Time, and Resource Awareness Reconciliation Intelligence) that addresses the optimization problem while satisfying both RT and FT system constraints. Because of the limited resources normally available on DRE systems, DeCoRAM only supports passive replication [75], thus avoiding the high overhead associated with active replication [78]. The allocation algorithm calculates the components' inter-dependencies and deploys the execution plan using the underlying middleware infrastructure, which is provided by FLARe [74].
Interception-based Approach
The work done in Eternal [79, 80] focused on providing transparent fault-tolerance for CORBA, ensuring strong replica consistency through the use of a reliable totally-ordered multicast protocol. This approach relieves the developer from having to deal with the low-level mechanisms that support fault-tolerance. In order to maintain compatibility with the CORBA-FT standard, Eternal exposes the replication manager, fault detector, and fault notifier to developers. However, the main infrastructure components are located below the ORB, for both efficiency and transparency purposes. These components include logging-recovery mechanisms, replication mechanisms, and interceptors. The replication mechanisms provide support for warm and cold passive replication and for active replication. The interceptor captures the CORBA IIOP requests and replies (based on TCP/IP) and redirects them to the fault-tolerance infrastructure. The logging-recovery mechanisms are responsible for managing the logging, checkpointing, and recovery protocols.
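The interception mechanism itself can be illustrated with standard library interposition. The fragment below is a minimal sketch, not Eternal's actual interceptor: compiled as a shared object and activated with LD_PRELOAD, it wraps connect() so that every connection attempt made by the ORB can be observed and, in a real system, rerouted to the fault-tolerance infrastructure.

// Minimal sketch of library interposition, the general mechanism behind
// interception-based FT. Build: g++ -shared -fPIC intercept.cpp -o
// intercept.so -ldl, then run the ORB with LD_PRELOAD=./intercept.so.
#include <dlfcn.h>
#include <sys/socket.h>
#include <cstdio>

extern "C" int connect(int sockfd, const struct sockaddr* addr,
                       socklen_t addrlen) {
    using connect_fn = int (*)(int, const struct sockaddr*, socklen_t);
    // Resolve the real connect() the first time we are called.
    static connect_fn real_connect =
        reinterpret_cast<connect_fn>(dlsym(RTLD_NEXT, "connect"));

    // A real interceptor would inspect the destination here and, if it is
    // an IIOP endpoint, reroute the connection through the replication
    // infrastructure instead of passing it straight through.
    std::fprintf(stderr, "intercepted connect() on fd %d\n", sockfd);
    return real_connect(sockfd, addr, addrlen);
}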
MEAD
MEAD focuses on providing fault-tolerance support in a non-intrusive way by enhancing distributed RT systems with (a) transparent, although tunable, FT that is (b) proactively dependable through (c) resource awareness, and that has (d) scalable and fast fault-detection and fault-recovery. It uses CORBA-RT, more specifically TAO, as a proof-of-concept. An important contribution of this work is the balancing of fault-tolerance resource consumption against RT behavior. MEAD is detailed further in Section 2.6.
2.3 P2P+RT Middleware Systems
While most of the focus of P2P systems has been on the support of FT, there is a growing interest in using these systems for RT applications, namely in streaming and QoS support. This section provides an overview of P2P systems that support RT.
2.3.1 Streaming
Streaming, and especially Video on Demand (VoD), was a natural evolution of the first file-sharing P2P systems [81, 82]. With the steady increase of network bandwidth on the Internet, it is now possible to offer high-quality multimedia streaming solutions to the end-user. These systems focus on providing near soft real-time performance by splitting streams across distributed P2P storage and redundant network channels.
PPTV
The work done in [26] provides the background for the analysis, design and behavior of VoD systems, focusing on the PPTV system [83]. An overview of the different replication strategies and their respective trade-offs is presented, namely Least Recently Used (LRU) and Least Frequently Used (LFU). The latter uses a weighted estimation based on the local cache completion and on the availability-to-demand ratio (ATD). Each stream is divided into chunks. The size of these chunks has a direct influence on the efficiency of the streaming: smaller pieces facilitate replication, and thus overall system load-balancing, whereas bigger pieces decrease the resource overhead associated with piece management and the bandwidth consumed by protocol control. To allow for a more efficient piece selection, three algorithms are proposed: sequential, rarest-first and anchor-based. To ensure real-time behavior, the system offers different levels of aggressiveness, including: issuing simultaneous requests of the same type to neighboring peers; simultaneously sending different content requests to multiple peers; and requesting from a single peer (making more conservative use of resources).
Thicket
Efficient data dissemination over unstructured P2P networks was addressed by Thicket [84]. The work uses multiple trees to ensure efficient usage of resources while providing redundancy in the presence of node failures. In order to improve load-balancing across the nodes, the protocol tries to minimize the number of nodes that act as interior nodes in several trees, thus reducing the load produced by forwarding messages. The protocol also defines a reconfiguration algorithm for balancing load across neighboring nodes, and a tree repair procedure to handle tree partitions. Results show that the protocol is able to quickly recover from a large number of simultaneous node failures and to balance the load across the remaining nodes.
2.3.2 QoS-Aware P2P
Until recently, P2P systems have focused on providing resiliency and throughput, and have thus not addressed the increasing need for QoS in latency-sensitive applications, such as VoD.
QRON
QRON [85] aimed to provide a general unified framework, in contrast to application-specific overlays. Overlay brokers (OBs), present in each autonomous system in the Internet, support QoS routing for overlay applications through resource negotiation and allocation, and topology discovery. The main goal of QRON is to find a path that satisfies the QoS requirements while balancing the overlay traffic across the OBs and overlay links. For this, it proposes two distinct algorithms: a "modified shortest distance path" (MSDP) and a "proportional bandwidth shortest path" (PBSP).
GlueQoS
GlueQoS [86] focused on the dynamic and symmetric negotiation of QoS features between two communicating processes. It provides a declarative language that allows the specification of the QoS feature set (and possible conflicts), and a runtime negotiation mechanism that finds a set of QoS features that is valid at both ends of the interacting components. Contrary to aspect-oriented programming [65], which only enforces QoS semantics at deployment time, GlueQoS offers a runtime solution that remains valid throughout the duration of the session between a client and a server.
2.4 P2P+FT Middleware Systems
Research on P2P systems has been largely dominated by the pursuit of fault-tolerance, for example in distributed storage, mainly due to the resilient and decentralized nature of P2P infrastructures.
2.4.1 Publish-subscribe
P2P publish-subscribe systems implement a messaging pattern in which publishers (senders) do not have a predefined set of subscribers (receivers) for their messages. Instead, subscribers must first register their interests with the target publisher before starting to receive published messages. This decoupling between publishers and subscribers allows for better scalability and, ultimately, performance.
Scribe
Scribe [87] aimed to provide a large-scale event notification infrastructure, built on top of Pastry [88], for topic-based publish-subscribe applications. Pastry is used to support topics and subscriptions and to build multicast trees. Fault-tolerance is provided by the self-organizing capabilities of Pastry, through adaptation to network failures and subsequent multicast tree repair. Event dissemination is best-effort, without any delivery-order guarantees. Nevertheless, it is possible to enhance Scribe to support consistent ordering through the implementation of sequential time-stamping at the root of the topic. To ensure strong consistency and to tolerate topic root node failures, an implementation of a consensus algorithm such as Paxos [89] is needed across the set of replicas (of the topic root).
Hermes
Hermes [90] focused on providing a distributed event-based middleware with an underlying P2P overlay for scalability and reliability. It was inspired by work done in Distributed Hash Table (DHT) overlay routing [88, 91], and it also has some notion of rendezvous similar to [81]. It bridges the gap between programming-language type semantics and low-level event primitives by introducing the concepts of event-type and event-attributes, which have some common ground with an Interface Definition Language (IDL) in the RPC context. In order to improve performance, it is possible to attach a filter expression to the event attributes during the subscription process. Several algorithms are proposed for improving availability, but they all provide weak consistency properties.
2.4.2 Resource Computing
There is a growing interest in harvesting and managing the spare computing power of the increasing number of networked devices, both public and private, as reported in [92, 93, 94, 95]. Some relevant examples are:
BOINC
BOINC (Berkeley Open Infrastructure for Network Computing) [96] aimed to facilitate the harvesting of public computing resources by the scientific research community. BOINC implements a redundant computing mechanism to prevent malicious or erroneous computational results. Each project specifies the number of results that should be created for each "workunit", i.e., the basic unit of computation to be performed. When a given number of results is available, an application-specific function is called to evaluate them and possibly choose a canonical result. If no consensus is achieved, or if the results simply fail, a new set of results is computed. This process repeats until a successful consensus is achieved or an application-defined timeout occurs.
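The gist of this redundant-computing scheme can be sketched as follows; the types and the quorum rule are illustrative and do not reflect BOINC's actual validator API.

// Illustrative sketch of redundant computing: each workunit is replicated,
// and a project-defined validator looks for a quorum of matching results
// before accepting one of them as canonical.
#include <optional>
#include <string>
#include <vector>

struct Result { std::string output; };

// Project-specific equivalence test; byte equality stands in here for the
// fuzzy numerical comparison a real project might use.
bool equivalent(const Result& a, const Result& b) {
    return a.output == b.output;
}

std::optional<Result> choose_canonical(const std::vector<Result>& results,
                                       std::size_t quorum) {
    for (const Result& candidate : results) {
        std::size_t matches = 0;
        for (const Result& r : results)
            if (equivalent(candidate, r)) ++matches;
        if (matches >= quorum) return candidate;  // consensus reached
    }
    return std::nullopt;  // no quorum: a new batch of results is issued
}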
P2P-MapReduce
Developed at Google, MapReduce [97] is a programming model that is able to parallelize the processing of large data sets in a distributed environment. It follows a master-slave model, where a master distributes the data set across a set of slaves and collects, at the end, the computational results (from the map or reduce tasks). MapReduce provides fault-tolerance for slave nodes by reassigning a failed job to an alternative active slave, but lacks support for master failures. P2P-MapReduce [98] provides this fault-tolerance by resorting to two distinct P2P overlays, one containing the currently available masters in the system, and the other the active slaves. When a user submits a MapReduce job, the system queries the master overlay for a list of the available masters (ordered by their workload). It then selects a master node and the number of replicas. After this, the master node notifies its replicas that they will participate in the current job. A master node is responsible for periodically synchronizing the state of the job over its replica set. In case of failure, a distributed election procedure is executed across the active replicas to select the new master. Finally, the master selects the set of slaves from the slave overlay, using a performance metric based on workload and CPU performance, and starts the computation.
2.4.3 Storage
Storage systems were among the most prevalent applications of first-generation P2P systems. Evolving from early file-sharing systems, and with the help of DHT middleware, they have become a common choice for large-scale storage in both industry and academia.
openDHT
Work done in [99] aimed to provide a lightweight framework for P2P storage using DHTs (such as in [88, 91]) in a public environment. The key challenge was to handle mutually untrusting clients while guaranteeing fairness in the access to and allocation of storage. The work was able to provide fair access to the underlying storage capacity, under the assumption that storage capacity is free. Because of its intrinsically fair approach, the system is unable to provide any type of Service Level Agreement (SLA) to its clients, thus reducing the domain of applications that can use it.
Dynamo
Recent research on data storage and distribution at Amazon [25] focuses on key-value approaches using P2P overlays, more precisely DHTs, to overcome the well-explored limitation of simultaneously providing high availability and strong consistency (through synchronous replication) [100, 101]. The approach taken was to use an optimistic replication scheme that relies on asynchronous replica synchronization (also known as passive replication). Consistency conflicts between different replicas, caused by network and server failures, are resolved at 'read time', as opposed to the more traditional 'write time' strategy; this is done to maximize write availability in the system. Such conflicts are resolved by the services themselves, allowing for a more efficient resolution (although the system offers a default 'last value holds' strategy). Dynamo offers efficient key-value storage while maximizing the availability of write operations. Nevertheless, the ring-based overlay hampers the scalability of the system and, depending on the partitioning strategy used, the membership process does not seem efficient.
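The default 'last value holds' policy can be sketched as follows. Note that Dynamo itself tracks causality with vector clocks and delegates semantically richer resolution to the services; the wall-clock timestamp below is only a stand-in for that machinery.

// Sketch of read-time conflict resolution in the spirit of the default
// 'last value holds' strategy. Assumes at least one version is returned.
#include <cstdint>
#include <string>
#include <vector>

struct Version {
    std::uint64_t timestamp;  // write time recorded by the coordinator
    std::string value;
};

// Called on the read path with the versions returned by the replicas.
Version resolve_last_write_wins(const std::vector<Version>& versions) {
    Version winner = versions.front();
    for (const Version& v : versions)
        if (v.timestamp > winner.timestamp) winner = v;
    // A read-repair step would now write 'winner' back to stale replicas.
    return winner;
}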
2.5 P2P+RT+FT Middleware Systems
These types of systems are a natural evolution of previous RT+FT middleware systems. They aim to provide scalability and resilience through a P2P network infrastructure that is able to provide lightweight FT mechanisms, allowing them to support soft RT semantics. We first proposed an architecture [102, 103] for a general-purpose middleware that aimed to integrate FT into the P2P network layer while being able to provide RT support. The first implementation of the architecture, in Java, was done in DAEM [6, 104]. This work used a hierarchical tree-based P2P overlay based on P3 [15]. FT support was provided at all levels of the tree, resulting in a high availability rate, but the use of JGroups [7] for maintaining strong consistency, both for mesh and service data, resulted in high overhead. Due to its highly coupled tree architecture, faults had a major impact on availability when they occurred near the root node, as they produced cascading failures. Initial support for RT was provided, but the high overhead of the replication infrastructure limited its applicability.
2.6 A Closer Look at TAO, MEAD and ICE
This section provides a closer look at middleware systems that have provided us with
several strategies and insights that we used to design and implement Stheno, our
middleware solution that is able to support RT, FT and P2P.
All of the referred systems, TAO, MEAD, and ICE, share a service-oriented architecture with a client-server network model. In terms of RT, both TAO and MEAD support the RT-CORBA standard, while ICE only supports best-effort invocations. As for FT support, TAO and ICE use high-level services, whereas MEAD uses a hybrid approach that combines both low- and high-level services.
2.6.1 TAO
TAO is a classical RPC middleware and therefore only supports the client-server network model. Name resolution is provided by a high-level service, representing a clear single point-of-failure and a bottleneck.
RT Support. TAO supports the RT-CORBA specification 1.0, its most important features being: (a) priority propagation; (b) explicit binding; and (c) RT thread pools.
Priority propagation ensures that a request maintains its priority across a chain of invocations. A client issues a request to an object A that, in turn, issues an invocation on another object B. The request priority at object A is then used to make the invocation on object B. There are two types of propagation: server-declared priorities and client-propagated priorities. In the first, the server dictates the priority that will be used when processing an incoming invocation. In the second, the priority of the invocation is encoded within the request, so the server processes the request at the priority specified by the client.
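A minimal sketch of the client-propagated model is shown below; the names are illustrative rather than RT-CORBA API. The priority travels inside the request, and the servant thread adopts it before dispatching, so nested invocations keep the client's priority end-to-end.

// Illustrative sketch: the servant thread adopts the priority encoded in
// the incoming request before processing it.
#include <pthread.h>
#include <sched.h>

struct Request {
    int priority;  // encoded by the client into the request header
    // ... marshaled operation and arguments ...
};

void process_request(const Request& req) {
    sched_param sp{};
    sp.sched_priority = req.priority;
    // Adopt the client's priority for the duration of this invocation
    // (SCHED_FIFO requires suitable privileges).
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);

    // ... dispatch to the servant; any nested invocation made here would
    // again encode req.priority so the downstream server can adopt it ...
}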
A source of unbounded priority inversion is the use of multiplexed communication channels. To overcome this, the RT-CORBA specification states that network channels should be pre-established, avoiding the latency caused by their creation. This model allows two possible policies: (a) a private connection between the client and the server, or (b) a priority-banded connection that can be shared, but that limits the priority of the requests that can be made on it.
In CORBA, a thread pool uses a threading strategy, such as leader-followers [11], with the support of a reactor (an object that handles network event de-multiplexing), and is normally associated with an acceptor (an entity that handles incoming connections), a connection cache, and a memory pool. In classic CORBA, a high-priority thread can be delayed by a low-priority one, leading to priority inversion. In an effort to avoid this unwanted side-effect, the RT-CORBA specification defines the concept of thread pool lanes. All the threads belonging to a thread pool lane have the same priority, and thus only process invocations that have that same priority (or a band that contains that priority). Because each lane has its own acceptor, memory pool and reactor, the risk of priority inversion is greatly minimized, at the expense of greater resource usage.
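The essence of a lane, one fixed priority with its own queue and its own threads, can be sketched as follows; real RT-CORBA lanes also own their acceptor, reactor and memory pool, which this illustrative fragment omits.

// Sketch of a thread pool lane: threads of one fixed priority drain a
// queue holding only requests of that priority, so a low-priority request
// can never delay a high-priority one.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class Lane {
public:
    Lane(int priority, unsigned nthreads) : priority_(priority) {
        for (unsigned i = 0; i < nthreads; ++i)
            std::thread([this] { run(); }).detach();  // sketch: no shutdown
    }
    void enqueue(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void run() {
        // A real lane thread would first switch to SCHED_FIFO at priority_,
        // as in the previous sketch, before entering the dispatch loop.
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty(); });
            std::function<void()> job = std::move(q_.front());
            q_.pop();
            lk.unlock();
            job();  // only invocations of this lane's priority land here
        }
    }
    int priority_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
};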
FT Support. In an effort to combine RT and FT semantics, the proposed replication style, semi-active replication, was heavily based on Delta4 [45]. This strategy avoids the latency associated with both warm and cold passive replication [105], as well as the high overhead and non-determinism of active replication, but represents an extension to the FT specification.
Figure 2.2: TAO’s architectural layout (adapted from [3]).
Figure 2.2 shows the architectural overview of TAO. The support for FT is achieved
through the use of a set of high-level services built on top of TAO. These services include
a Fault Notifier, a Fault Detector and a Replication Manager.
The Replication Manager is the central component of the FT infrastructure. It acts as the central rendezvous for the remaining FT components, and it has the responsibility of managing the replication groups' life-cycle (creation/destruction) and of performing group maintenance, that is, the election of a new primary, the removal of faulty replicas, and the updating of group information.

It is composed of three sub-components: (a) a Group Manager, which manages the group membership operations (adding and removing elements), allows changing the primary of a given group (for passive replication only), and allows the manipulation and retrieval of group member locations; (b) a Property Manager, which allows the manipulation of replication properties, such as the replication style; and (c) a Generic Factory, the entry point for creating and destroying objects.
The Fault Detector is the most basic component of the FT infrastructure. Its role is to monitor components, processes and processing nodes, and to report failures to the Fault Notifier. In turn, the Fault Notifier aggregates these failure reports and forwards them to the Replication Manager.
The FT bootstrapping sequence is as follows: (a) the Naming Service is started; (b) the Replication Manager is started; (c) the Fault Notifier is started; (d) the Fault Notifier finds the Replication Manager and registers itself with it; (e) in response, the Replication Manager connects to the Fault Notifier as a consumer; (f) each node that is going to participate starts a Fault Detector Factory and a Replica Factory, which in turn register themselves with the Replication Manager; (g) a group creation request is made to the Replication Manager (by a foreign entity, referred to as the Object Group Creator), followed by a request for the list of available Fault Detector Factories and Replica Factories; (h) this is followed by a request to create an object group in the Generic Factory; (i) the Object Group Creator then bootstraps the desired number of replicas, using the Replica Factory at each target node; each Replica Factory creates the actual replica and, at the same time, starts a Fault Detector at each site using the Fault Detector Factory; each of these detectors finds the Replication Manager, retrieves the reference to the Fault Notifier, and connects to it as a supplier; (j) each replica is added to the object group by the Object Group Creator, using the Group Manager at the Replication Manager; (k) at this point, a client can start, retrieve the object reference from the naming service, and make an invocation on the group, which is then carried out by the primary of the replication group.
Proactive FT Support. An alternative approach was proposed by FLARe [74], which focuses on proactively adapting the replication group to the load present in the system. The replication style is limited to semi-active replication using state-transfer, which is commonly referred to simply as passive replication.
Figure 2.3 shows the architectural overview of FLARe. This architecture adds three new components to TAO's FT infrastructure: (a) a client interceptor that redirects invocations to the proper server, as the initial reference may have been changed by the proactive strategy in response to a load change; (b) a redirection agent that receives the updates with these changes from the Replication Manager; and (c) a resource monitor that watches the load on a processing node and sends periodic updates to the Replication Manager.

Figure 2.3: FLARe's architectural layout (adapted from [74]).
In the presence of abnormal load fluctuations, the Replication Manager adapts the replication group to the new conditions by creating replicas on nodes with lower usage and, if required, by changing the primary to a more suitable replica.
TAO’s fault tolerance support relies on a centralized infrastructure, with its main
component, the Replication Manager, representing a major obstacle in the system’s
scalability and resiliency. No mechanisms are provided to replicate this entity.
2.6.2 MEAD
MEAD focuses on providing fault-tolerance support for distributed RT systems in a non-intrusive way: a transparent, although tunable, FT that is proactively dependable through resource awareness, with scalable and fast fault-detection and fault-recovery. It uses CORBA-RT, more specifically TAO, as a proof-of-concept.
Transparent Proactive FT Support. MEAD's architecture contains three major components, namely the Proactive FT Manager, the MEAD Recovery Manager and the MEAD Interceptor. The underlying communication is provided by Spread, a group communication framework that offers reliable totally-ordered multicast, guaranteeing consistency for both component and node membership.
The MEAD Interceptor provides the usual interception of system calls between the application and the underlying operating system. This approach provides a transparent and non-intrusive way to enhance the middleware with fault-tolerance.
Figure 2.4: MEAD’s architectural layout (adapted from [14]).
Figure 2.4 shows the architectural overview of MEAD. The main component of the MEAD system is the Proactive FT Manager, which is embedded within the interceptors on both server and client. It has the responsibility of monitoring the resource usage at each server and of initializing a proactive recovery scheme based on a two-step threshold. When the resource usage rises above the first threshold, the proactive manager sends a request to the MEAD Recovery Manager to launch a new replica. If the usage rises above the second threshold, the proactive manager starts migrating the replica's clients to the next non-faulty replica server.
The MEAD Recovery Manager has some similarities with the Replication Manager of CORBA-FT, as it also must launch new replicas in the presence of failures (of nodes or servers). However, the recovery manager does not follow a centralized architecture as in TAO or FLARe, where all the components of the FT infrastructure are connected to the replication manager; instead, the components are connected by a reliable totally-ordered group communication framework that establishes an implicit agreement at each communication round. Such frameworks also provide a notion of view, i.e., an instantaneous snapshot of the group membership, and notifications of any membership change. This allows the MEAD Recovery Manager to detect a failed server and respawn a new replica, maintaining the desired number of replicas.
Choosing FT properties, e.g., the replication style, without evaluating object state size and resource usage can severely affect overall performance and reliability. Any balance between the two orthogonal domains of reliability and (real-time) performance must take into account the object's resource usage, the system's resource availability, and the target levels of reliability and recovery time.
To overcome this issue, MEAD introduced an FT Advisor. This advisor profiles an object for a certain period of time to assess its resource usage, e.g., CPU, network bandwidth, etc., and its invocation rate. Using this information, the advisor can recommend the proper settings for the FT properties. For example, if an object uses little computation time but has a large state, then active replication is the most suitable replication style.
The replication style is not the only relevant choice. For passive replication, two further options are of relevance: checkpointing and fault-detection. The periodicity of checkpointing affects the window of inconsistency between the primary and the replicas. A higher checkpointing frequency results in a smaller window, i.e., the inconsistent state has a shorter duration, but brings a larger resource overhead, as more CPU and network bandwidth are needed. Fault-detection directly impacts the recovery time, as a larger period between fault-detection inspections results in a longer recovery time.
The fault advisor continuously and periodically provides feedback to the runtime with
more accurate suggestions, adjusting to changes in resource usage and availability.
Normally, active replication is restricted to deterministic, single-threaded applications. MEAD's last contribution comes in the form of support for non-deterministic applications under active replication. To achieve this, MEAD uses source-code analysis to detect points in the source code that introduce non-determinism, e.g., system calls like gettimeofday. These non-deterministic points are stored in a data structure and embedded within invocations and replies, so they must be stored locally on both clients and servers. The reason for this lies in the way active replication works. A client makes an invocation, which is multicast to the replicas. Each replica processes the request, storing the non-deterministic data locally and piggybacking it on the reply that is sent back to the client. The client picks the first reply and stores its non-deterministic data locally. The client then piggybacks this information on the next invocation it makes. When the replicas receive this invocation, they retrieve the non-deterministic information and update their internal state, except for the replica whose reply was chosen by the client.
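The record-and-piggyback scheme can be sketched as follows; the wrapper and types are illustrative, not MEAD's actual implementation.

// Illustrative sketch: every non-deterministic value produced during an
// invocation is captured in a log that is piggybacked on the reply; on the
// next invocation, the replicas that were not chosen by the client
// reconcile their state against the adopted replica's log.
#include <sys/time.h>
#include <vector>

struct NonDetLog { std::vector<timeval> times; };

// Replicas call this instead of gettimeofday() directly, so the value is
// both used and recorded in the invocation's log.
timeval logged_gettimeofday(NonDetLog& log) {
    timeval tv;
    gettimeofday(&tv, nullptr);
    log.times.push_back(tv);
    return tv;
}

// Applied by the non-chosen replicas when the client's next invocation
// arrives carrying the adopted replica's log.
void reconcile(NonDetLog& local, const NonDetLog& chosen) {
    local = chosen;  // a real system patches the affected state, not the log
}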
The recovery manager is not replicated, turning it into a single point of failure. The use of a reliable totally-ordered group communication framework partially improves the decentralization of the infrastructure, but the recovery manager still acts as a centralized unit, with a negative impact on overall system scalability. In systems that are prone to a large churn rate, group communication can result in partitions, as we assessed in DAEM [6]. These partitions can result in a major outage, compromising reliability and real-time performance.
2.6.3 ICE
ICE [106] is a lightweight RPC-based middleware that aims to overcome inefficiencies, such as redundancy, present in the CORBA specification. For that purpose, ICE provides an efficient communication protocol and data encoding. It does not support any kind of RT semantics.
The support for FT in ICE is minimal and restricted to naming for replication groups: when a client tries to resolve a replication group name, it receives a list of all the server instances that belong to the group (i.e., the endpoints). ICE does not support any type of replication style, or even synchronization primitives, leaving this to the applications.
ICE does not provide an infrastructure to support very large-scale systems. Its registry, which acts like the CORBA Naming Service, constitutes a bottleneck and a possible single point of failure. The reliability of the registry can be improved by the addition of standby instances, in a master-slave relation.
2.7 Summary
The goal of this chapter was to search for a suitable solution that could address all the requirements of our target systems, that is, a middleware capable of simultaneously supporting RT+FT+P2P. As no such solution was found, we focused on systems that belong to the intersecting domains, namely RT+FT, P2P+FT and P2P+RT, to see if we could extend one of them and avoid designing and implementing a new middleware from scratch.
In our previous work, DAEM [102, 103], we used some off-the-shelf components, e.g., JGroups [7] to manage replication groups, but realized that, in order to integrate real-time and fault-tolerance within a P2P infrastructure, we would have to completely control the underlying infrastructure, with fine-grained management over all the resources available in the system. The use of COTS software components creates a "black-box" effect that introduces sources of unpredictable behavior and non-determinism, undermining any attempt to support real-time. For that reason, it was unavoidable to create a solution from scratch.
Using the insights gained from several inspirational middleware systems, namely TAO, MEAD, and ICE, we have designed (Chapter 3) and implemented (Chapter 4) Stheno, which, to the best of our knowledge, is the first middleware system that simultaneously supports RT and FT within a P2P infrastructure.
–If you can’t explain it simply, you don’t understand it well
enough.
Albert Einstein

3 Architecture
The implementation of increasingly complex systems at EFACEC is currently lim-
ited by the capabilities of the supporting middleware infrastructure. These systems
include public information systems for public transportation, automated power grid
management and automated substation management for railways. The use of service-oriented architectures is an effective approach to reduce the complexity of such systems. However, the increasing demand for guarantees on the fulfillment of SLAs can only be met by a middleware platform that is able to provide QoS computing while enforcing resilient behavior.
Some middleware systems [14, 3] have already addressed this problem by offering soft real-time computing and fault-tolerance support. Nevertheless, their support for real-time computing is limited, as they do not provide any type of isolation; for example, a service can hog the CPU and effectively starve the remaining services. Their support for fault-tolerance is restricted to crash-failures, and the fault-tolerance mechanisms are normally implemented through the use of high-level services. However, these high-level services cause a significant amount of overhead, due to cross-layering, limiting the real-time capabilities of these middleware systems.
These systems also use a centralized networking model that is susceptible to single points-of-failure and offers limited scalability. The CORBA naming service reflects these limitations: a crash failure can effectively stop an entire system through the absence of the name resolution mechanism.
This chapter describes the architecture of a new general-purpose P2P middleware that addresses the aforementioned problems. The resilient nature of P2P overlays enables us to overcome the limitations of current approaches by offering a decentralized, reconfigurable and fault-resistant architecture that avoids bottlenecks and thus enhances overall performance.
Stheno, our middleware platform, is able to provide QoS computing with support for
resource reservation through the implementation of a QoS daemon. This daemon is
responsible for the admission and distribution of the available resources among the
components of the middleware. Furthermore, it also interacts with the low-level resource
reservation mechanisms of the operating system to perform the actual reservations.
With this support, we provide proper isolation that is able to accommodate soft real-
time tasks and thus provide guarantees on SLAs. While we currently only support CPU
reservation, the architecture was designed to be extensible and subsequently support
additional sub-systems, such as memory or networking resource reservations.
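As an illustration of the kind of low-level mechanism such a daemon can drive, the following sketch caps the CPU usage of a process with the Linux cgroup (v1) CPU bandwidth controller. The paths, names and the use of cgroups itself are assumptions made for the example, not a description of Stheno's implementation, which is detailed in Chapter 4.

// Minimal sketch of a CPU cap using the Linux cgroup (v1) CPU bandwidth
// controller. Granting quota_us of CPU time per period_us caps the group
// at quota/period of one core. Assumes cgroup v1 mounted at
// /sys/fs/cgroup/cpu and requires root privileges.
#include <fstream>
#include <string>
#include <sys/stat.h>

bool reserve_cpu(const std::string& group, long period_us, long quota_us,
                 int pid) {
    std::string dir = "/sys/fs/cgroup/cpu/" + group;
    mkdir(dir.c_str(), 0755);  // create the control group

    std::ofstream(dir + "/cpu.cfs_period_us") << period_us;
    std::ofstream(dir + "/cpu.cfs_quota_us")  << quota_us;
    std::ofstream tasks(dir + "/tasks");
    tasks << pid;              // place the service's process in the group
    return tasks.good();
}

// e.g., reserve_cpu("stheno/video-svc", 100000, 20000, pid) caps the
// service at 20% of one CPU; the group name here is hypothetical.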
Notwithstanding, the real-time capabilities are limited by the amount of resources that are needed to provide fault-tolerance. To overcome the current limitations of providing fault-tolerance through expensive high-level services, we propose the integration of the fault-tolerance mechanisms directly into the overlay layer. This provides two advantages over previous approaches: 1) it allows the implementation of lightweight fault-tolerance mechanisms by reducing cross-layering, and; 2) the replica placement can be optimized using knowledge of the overlay's topology. Previous systems relied on the manual bootstrapping of replicas, such as TAO [3], or required the presence of additional high-level services to perform load balancing across the replica set, as in FLARe [74].
While the work presented in this thesis only implements semi-active replication [44], we designed a modular and flexible fault-tolerance infrastructure that is able to accommodate other types of replication policies, such as passive replication [75] and active replication [78].
Our architectural design also considered future support for virtualization. However, instead of providing virtualization as a service, as is done in cloud computing platforms [107], our goal is to support lightweight virtualized services that offer out-of-the-box fault-tolerance for legacy services through the live-migration mechanisms present in current hypervisors, such as KVM [108] and Xen [109]. This can be achieved through the use of a Just Enough Operating System (JeOS) [110], which enables the creation of small-footprint virtual machines, a critical requirement for performing virtual machine migration.
Finally, in order to minimize the effort required to port the runtime to a new operating
system, we used the ACE framework [111] that abstracts the underlying operating
system infrastructure.
3.1 Stheno’s System Architecture
In order to contextualize our approach, we present our solution applied to one of our target systems, Oporto's light-train public information system. As shown in Figure 3.1, the network uses a hierarchical tree-based topology based on the P3 overlay [15], where each cell represents a portion of the mesh space that is maintained (replicated) by a group of peers. These peers provide the computational resources needed to maintain the light-train stations and host services within the system. Additionally, there are also sensors that connect to the system through peers. They offer an abstraction over several low-level activities, such as traffic track sensors and video camera streams. A detailed discussion of the implementation of the overlay is provided in Chapter 4.
Figure 3.1: Stheno overview.
The middleware’s runtime provides the necessary infrastructure that allows users to
launch and manipulate services, while hiding the interaction with low level peer-to-peer
overlay and operating system mechanisms. It is based on a five layer model, as shown
in Figure 3.1.
The bottom layer, Operating System Interface, encapsulates the Linux operating system
and the ACE [111] network framework. The Support Framework is built on top of
the bottom layer, and offers a set of high-level abstractions for efficient, modular
component design. The P2P Layer and FT Configuration contains all the peer-to-
peer overlay infrastructure components and provides a communication abstraction and
FT configuration to the upper layers. The runtime can be loaded with a specific overlay
implementation at bootstrap. The middleware is parametric in the choice of overlay, and
these are provided as plugins and can be loaded dynamically. The Core layer represents
the kernel of the runtime, and is responsible for managing all the resources allocated
to the middleware and the peer-to-peer overlay. Finally, the Application and Services
layer is composed of the applications and services that run on top of the middleware.
Next, we describe the organization of each layer, as well as their inter-dependencies. In an effort to improve the overall comprehension of the runtime, the layers are presented using a top-down approach, starting at the application level, continuing through the core and overlay layers, and ending at the operating system interface.
3.1.1 Application and Services
One of the most fundamental problems when developing a general-purpose middleware system is its ability to expose functionalities and configuration options to the user. This layer achieves that goal through the introduction of high-level APIs that allow users to query and configure the different layers of the runtime. For example, in our target system, a system operator may create a video streaming service from a light-train station and set the frame rate and replication style.
The service represents the main abstraction of the middleware, and is shown in Figure 3.2. A developer who wishes to deploy an application has to use this abstraction. The node hosting a service guarantees that its QoS requirements (CPU, network, memory and I/O) are assured throughout the service's entire life-cycle. The CPU subsystem is an exception to this definition: it allows the creation of best-effort computing tasks that, as the name implies, do not have any QoS guarantees. These are normally associated with helper mechanisms, such as logging.
A service can be statically or dynamically loaded into the middleware. Dynamic services are encapsulated in a meta-archive called a Stheno Service Archive, which has the .ssa file extension and uses the ZIP archive format. Such an archive contains a service implementation (plugin) that may be loaded by the runtime. This solution allows the runtime to dynamically retrieve a missing service implementation and load it on-the-fly.
Figure 3.2: Application Layer.
Figure 3.3: Stheno’s organization overview.
Each service is identified in the middleware system by a Service Identifier (SID) that
uniquely identifies the service implementation, and an Instance Identifier (IID) that
identifies a particular instance of a service in the system, as any given service imple-
mentation can have multiple instances running simultaneously (Figure 3.3).
An IID is unique across the peer-to-peer overlay; at any given time, the instance it denotes runs on only one peer, itself uniquely identified by a Peer Identifier (PID), although during its lifespan the instance can migrate to other peers. A peer can only be allocated to one cell, identified by a Cell Identifier (CID); however, this membership can change dynamically during the peer's lifespan.
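The following definitions are hypothetical, not Stheno's actual types; they serve only to make the relationships between the identifiers explicit.

// Illustrative types for the identifier scheme described above. A running
// service instance is located by the chain IID -> PID -> CID: the instance
// runs on exactly one peer, and the peer belongs to exactly one cell at
// any given time.
#include <cstdint>

using ServiceId  = std::uint64_t;  // SID: a service implementation
using InstanceId = std::uint64_t;  // IID: one running instance of a SID
using PeerId     = std::uint64_t;  // PID: a peer in the overlay
using CellId     = std::uint64_t;  // CID: a cell (mesh partition)

struct InstanceLocation {
    ServiceId  sid;
    InstanceId iid;
    PeerId     pid;  // may change if the instance migrates
    CellId     cid;  // may change if the peer rebinds to another cell
};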
A cell can be seen as a set of peers that are organized to maintain a partition of the overlay space. These cells can be loosely coupled, as in Gnutella, where peers partition the overlay space in an ad-hoc fashion, or they can follow a structured topology. Other overlays [15] have a hierarchical tree of cells, where the peers within each cell cooperate to maintain a portion of the overlay tree, and the cells in turn cooperate among themselves to maintain the global tree topology.
Some services can be deployed strictly as daemons. This class of services does not offer
any type of external interaction. Nevertheless, a service usually provides some sort of
interaction that is abstracted in the form of a client.
Using the RPC service as an example, a client is a broker between the user and the server, marshaling the request and unmarshaling the reply. Another example is a video streaming client that connects to a streaming service with the purpose of receiving a video stream, acting as a stream sink.
Interaction with a service, through a client, is only possible if the service provides one or more Service Access Points (SAPs). These SAPs provide the entry-points that support such interactions, with each one providing a specific QoS. For example, an RPC service can provide two SAPs, one for low-priority invocations and the other for high-priority invocations. When a user (through a client) wants to contact a service instance, it first has to know which SAPs are available in that particular instance. To accomplish that, the user must use the discovery service and query it for the active access points of that particular service instance. To summarize, the responsibilities of a service are the following: defining the amount of resources that it will need throughout its life-cycle; managing multiple SAPs; and providing a client implementation.
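These responsibilities suggest an interface along the lines of the hedged sketch below; all names are illustrative, not Stheno's actual API.

// Hypothetical sketch of the service abstraction: a service declares its
// resource needs up front, exposes one or more SAPs (each with its own
// QoS), and supplies a client-side counterpart.
#include <memory>
#include <string>
#include <vector>

struct QoSSpec { double cpu_share; std::size_t memory_bytes; };
struct SAP { std::string endpoint; QoSSpec qos; };  // one entry point

class Client;  // broker used by callers to reach the service

class Service {
public:
    virtual ~Service() = default;
    virtual QoSSpec resources() const = 0;            // life-cycle budget
    virtual std::vector<SAP> access_points() const = 0;
    virtual std::unique_ptr<Client> make_client() = 0;
    virtual void start() = 0;
    virtual void stop() = 0;
};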
3.1.2 Core
One important issue is how to deal with the different real-time and fault-tolerance requirements of different services, which, in turn, are requested by different users. In order to address this issue, the core, shown in Figure 3.4, is responsible for the overall management of all assigned resources, including overlays and services. The resource reservation mechanisms are not controlled directly by the runtime but by a resource reservation daemon, shown as the QoS Daemon, which is responsible for managing the available low-level resources.

This approach enables multiple runtimes to coexist within the same physical host, and further allows foreign applications to use the resource reservation infrastructure. The runtime core merely acts as a broker for any resource reservation request initiated by its applications or overlay services.
Figure 3.4: Core Layer.
The most important roles performed by the core are the following: a) maintaining information on all local active service instances; b) acting as a regulator, deciding on the acceptance of new local service instances; and c) providing a resource reservation broker. The management of the active service instances is done through the Service Manager.
Service Manager
The service manager is responsible for managing all the local services of a runtime. A service can be loaded into an active runtime in one of two ways: it can be bootstrapped locally at start-up, as with static services that are loaded when the runtime bootstraps, or it can be dynamically loaded in response to a local or remote request. The request for the creation of a new service instance can be initiated locally, by the user or by a local service, or remotely, when a remote peer requests it through the overlay infrastructure. Remote service creation is delegated to the overlay's mesh service, which in turn uses the overlay's inner infrastructure to accomplish this task. The implementation of these mechanisms is detailed in Chapter 4.
The service manager is composed of two entities, a service factory and a service book-
keeper. The service factory is a repository of known service implementations that can
be manipulated dynamically, allowing the insertion and removal of service implementa-
tions. The service bookkeeper manages the information, such as SAPs, about the active
service instances that are running locally.
QoS Controller
The QoS Controller, shown in Figure 3.5, acts as a proxy between the components of
the runtime and the QoS daemon. Each component has access to a resources that are
assigned at creation time. A component uses its resources through a QoS Client, that
was previously assigned to it by the QoS Controller. A resource reservation request
is created by a QoS Client and then gets re-routed by the QoS Controller to the QoS
daemon. In the current implementation, the allocation assigned to each component is
static. A dynamical reassignment of the resources allocated to a component is left for
future work.
Figure 3.5: QoS Infrastructure.
Section 3.1.4 provides the details on the QoS and resource reservation infrastructure,
in particular detailing the internals of the QoS daemon.
3.1.3 P2P Overlay and FT Configuration
Our target systems require the middleware to be able to adapt its P2P networking layer to mimic the physical deployment while, at the same time, providing the fault-tolerance configuration options needed to meet application needs.

The overlay layer is based on a plugin infrastructure that enables a flexible deployment of the middleware in different application domains. For example, in our flagship solution, the Oporto light-train network, we used a P3-based plugin implementation that mirrors the regional hierarchy of the system. Additionally, the FT configuration options passed by the user, for example the requirement to keep a service replicated across 3 replicas using semi-active replication, are delegated to the FT service within the P2P overlay.
Because of this flexibility, the runtime does not bootstrap with a specific overlay implementation by default; it is left to the user to choose the P2P implementation that best matches the particular target case. Figure 3.6 shows the components that form the overlay abstraction layer.
Figure 3.6: Overlay Layer.
Every overlay implementation must provide the following services: (a) Mesh, responsible for membership and overlay management; (b) Discovery, used to discover resources and data across the overlay; and (c) FT (Fault-Tolerance), used to manage and negotiate the fault-tolerance policies across the overlay.
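These three services can be made concrete with a hedged sketch of the plugin contract; the names below are illustrative, not Stheno's actual plugin API. Whatever topology a plugin implements, it must hand the runtime these three services, and a query can be issued synchronously or with a callback, carrying the QoS of the request.

// Hypothetical sketch of the contract an overlay plugin must satisfy.
#include <functional>
#include <string>
#include <vector>

struct Query { std::string expression; int priority; };
using QueryCallback = std::function<void(const std::vector<std::string>&)>;

class MeshService {
public:
    virtual ~MeshService() = default;
    virtual void join() = 0;   // membership: enter the mesh
    virtual void leave() = 0;  // membership: orderly departure
};

class DiscoveryService {
public:
    virtual ~DiscoveryService() = default;
    virtual std::vector<std::string> query(const Query&) = 0;   // synchronous
    virtual void query_async(const Query&, QueryCallback) = 0;  // asynchronous
};

class FTService {
public:
    virtual ~FTService() = default;
    virtual void create_replication_group(const std::string& service,
                                          int replicas) = 0;
};

class Overlay {  // loaded as a plugin at bootstrap
public:
    virtual ~Overlay() = default;
    virtual MeshService&      mesh() = 0;
    virtual DiscoveryService& discovery() = 0;
    virtual FTService&        ft() = 0;
};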
Mesh Service
The mesh service is responsible for managing the overlay topology and providing support
for the remote creation and removal of services. The management of the overlay
topology is supported through the membership and recovery mechanisms. The mem-
bership mechanism must allow the entrance and departure of peers while maintaining
consistency of the mesh topology. At the same time, the recovery mechanism has to
perform the necessary rebind and reconfiguration to ensure that the mesh topology
remains valid even in the presence of severe faults.
An overlay plugin is free to implement the membership and recovery mechanisms that best fit its needs. This was motivated by the goal of minimizing the restrictions placed on the overlay topology, thereby increasing the range of systems supported by Stheno.
Figure 3.7 shows four possible implementation approaches. A portal can be used to act as a gatekeeper [112] (Figure 3.7a), resembling the approach taken by most web services; this can be suitable for systems that do not have a high churn rate. On the other hand, systems that need highly available and decentralized architectures may use multicast mechanisms to detect the other nodes present in the system [15] (Figure 3.7b). Nevertheless, some systems require bounded operation times, e.g., for queries. This can be accomplished with the introduction of cells (also known as federations), as in Gnutella [81] (Figure 3.7c), or, alternatively, by imposing some kind of well-defined inter-peer relationship, as in Chord [113] (Figure 3.7d).

Figure 3.7: Examples of mesh topologies.
Discovery Service
The discovery service offers an abstraction that allows the execution of queries over the underlying overlay. As with the mesh service, each overlay plugin is free to implement the discovery service in the way that best suits the needs of the target system. Figure 3.8 shows the execution of a query under some possible topologies. The main goals of the discovery service are the following: performing synchronous and asynchronous querying with QoS awareness, and handling query requests from neighboring peers while respecting the QoS associated with each request.

Figure 3.8: Querying in different topologies: (a) hierarchical; (b) ad-hoc; (c) DHT.
Fault-Tolerance Service
The FT infrastructure is based on replication groups. These groups can be defined as sets of cooperating peers that have the common goal of providing reliability to a high-level service. In current middleware systems, such as TAO [3], FT support is implemented through a set of high-level services that use the underlying primitives. Our approach makes a fundamental shift from this principle by embedding FT support in the overlay layer.
The integration of FT in the overlay reduces the cross-layering overhead associated with the use of high-level services. Furthermore, this approach enables the runtime to make replica placement decisions that are aware of the overlay topology. This awareness allows for a better trade-off between the target reliability and resource usage.
The FT service is responsible for the creation and removal of replication groups. However, the management of a replication group is self-contained, that is, the FT service delegates all the logistics to the replication group itself. This makes the replication infrastructure extensible and permits the co-existence of several types of replication strategies inside the FT service, so that each service can use the replication policy that best meets its requirements.
The assumptions made in the design of each service limit the types of fault-tolerance policies that can be used. For example, if a service needs to maintain a high level of availability, then it should use active replication [78] in order to minimize recovery time. For these reasons, we designed an architecture that provides a flexible framework in which different fault-tolerance policies can be implemented. In Chapter 4 we provide an example of an FT implementation.
3.1.4 Support Framework
Our target system has different RT requirements for different tasks. For example, a
critical event is the highest-priority traffic present in the system and is highly sensitive
to latency. To ensure that the 2-second deadline is met, it is necessary to reserve enough
CPU to process the events and, at the same time, to employ a suitable threading
strategy that minimizes latency (at the expense of throughput), such as Thread-per-
Connection [12].
The support framework provides the necessary infrastructure to address these issues
by offering a set of packages that provide high level abstractions for different threading
strategies, network communication and QoS management, in particular the mechanisms
for resource reservation. Figure 3.9 shows the components of the support framework.
It introduces three key aspects: (a) it provides a novel and extensible infrastructure for
resource reservation and QoS; (b) it introduces a novel design pattern for multi-core
computing; and (c) it provides an extensible monitoring facility.

Figure 3.9: Support framework layer.

Before delving into the details of these components, we first present the overall layout
of the framework. In an effort to improve maintainability, the framework uses a
package-like schema, with the following layout:
• common - this package includes support for integer conversion, backtrace (for
debugging), state management, synchronization primitives, and exception han-
dling;
• network - this package has support for networking, namely stream and datagram
sockets, packet oriented sockets, low level network utilities, and request support;
• event - this package implements the event interface, a fundamental component
for network-oriented programming;
• qos - this package implements the resource reservation infrastructure, namely
the QoS daemon and client, as well as the QoS primitives used by the threading
package, such as scheduling information;
• serialization - this package includes support for a serialization interface and
provides a default serialization implementation;
• threading - this package offers several scheduling strategies, including: Leader-
Followers [11], Thread-Pool [114], Thread-per-Connection [12] and Thread-per-
Request [13]. All of these strategies are implemented using the Execution Model
- Execution Context design pattern (a minimal sketch of the Thread-per-Connection
strategy is shown after this list);
• tools - the tools package includes the loader and the monitoring sub-packages,
which contain a load injector and a resource monitoring daemon, respectively.
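To make these strategies more concrete, the following is a minimal, hypothetical sketch
of the Thread-per-Connection strategy in plain C++; the socket setup, port number, and
echo-style handler are illustrative assumptions and do not correspond to the actual
Stheno classes.

    // Hypothetical Thread-per-Connection sketch (not the actual Stheno API).
    // Each accepted connection is served by a dedicated thread, trading
    // throughput for lower per-request latency.
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <thread>

    static void serveConnection(int fd) {
        char buf[512];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            write(fd, buf, n);  // echo back; a real service would dispatch requests
        close(fd);
    }

    int main() {
        int listener = socket(AF_INET, SOCK_STREAM, 0);
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons(9000);  // illustrative port
        bind(listener, (sockaddr*)&addr, sizeof(addr));
        listen(listener, 16);
        for (;;) {
            int fd = accept(listener, nullptr, nullptr);
            if (fd < 0) break;
            std::thread(serveConnection, fd).detach();  // one thread per connection
        }
        close(listener);
    }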
The most prominent package in the framework is the resource reservation and QoS
infrastructure. It provides the low-level support that is necessary for the integration
of RT and FT into the middleware's runtime. Next, we present an overview of the
inner workings of each component and reason about their implications for several
aspects of a real-time fault-tolerant middleware.
The QoS and Resource Reservation Infrastructure
One of the key aspects of real-time systems is the ability to fulfill an SLA even in the
presence of an adverse environment. Adversities can be caused by system overload,
bugs, or malicious attacks, and can occur in the form of rogue services, device drivers,
or kernel modules.
The only viable solution for providing deterministic behavior is to isolate the various
components present in the system. This type of containment can be achieved by using
a virtual machine, although this only works for user-space applications/services, or by
using the low-level infrastructure provided by the underlying operating system, such as
Control Groups [115] or Zones [116]. Control
Groups is a modular and extensible resource management facility provided by the
Linux kernel, while Zones is a similar but less powerful implementation for the Solaris
operating system.
These types of mechanisms are normally associated with static provisioning and are left
for system administrators to manage. This is clearly not a suitable approach for the
complex and dynamic environments that are the focus of this work. To overcome this
limitation, we designed and implemented a novel QoS daemon that manages the available
resources in the Linux operating system.
The goal of the QoS daemon is to provide an admission control and management
facility that governs the underlying Control Groups infrastructure. There are four main
QoS subsystems: CPU, I/O, memory and network. At this time, we have only fully
implemented the CPU subsystem; the remaining subsystems have only preliminary
support.
All the subsystems supported by Control Groups follow a hierarchical tree approach
to the distribution of their resources (Figure 3.10). Each node of the tree represents
a group that contains a set of threads that share the available resources of the group,
for example if a CPU group has 50% of the CPU resources, then all the threads of the
group share those resources. As usual, the distribution of the CPU time among the
threads is performed by the underlying CPU scheduler.
CPU subsystem
We define three types of threads: (a) best-effort threads, which do not have real-time
requirements and are expected to run as soon as possible but without any deadline
constraints; (b) soft real-time threads, which have a defined deadline but, in case of a
deadline miss, do not produce system failures, and; (c) isolated soft real-time threads,
which are pinned to isolated core(s) in order to prevent entanglement with other
activities of the operating system (interrupt handling from other cores, network
multi-queue processing, etc.), resulting in less latency and jitter and thus providing a
better assurance on the fulfillment of deadlines.
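As an illustration, the sketch below shows one way an isolated soft real-time thread could
be configured on Linux: the calling thread is pinned to a shielded core and given a fixed
real-time priority. The core number and priority are assumptions for the example; in
Stheno this configuration is mediated by the QoS daemon rather than performed directly
by services.

    // Minimal sketch: turn the calling thread into an "isolated soft real-time"
    // thread. Assumes the target core has been shielded beforehand (e.g., via
    // cpusets) and that the process has the required privileges.
    #include <pthread.h>
    #include <sched.h>

    static int makeIsolatedSoftRT(int core, int priority) {
        // Pin the thread to the isolated core.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            return -1;
        // Give it a fixed real-time priority (SCHED_FIFO).
        sched_param sp{};
        sp.sched_priority = priority;
        return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    }

Such a thread then competes only with other real-time threads on its core, which is what
keeps its latency and jitter low.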
However, there is another type of thread that is not currently supported by the
middleware: hard real-time threads. A failure to fulfill the deadline of one of these
threads could result in a catastrophic failure; such threads are normally associated with
critical systems such as railway signalling or avionics. Ongoing work on EDF scheduling
seems to offer a solid way to provide hard real-time support in Linux [117, 118], and a
recent validation seems to confirm our beliefs [119]. We plan to extend our support to
accommodate threads that are governed by deadlines instead of priorities.
To simplify the explanation of the CPU subsystem, we describe it as one entity although,
in reality, it is composed of two separate groups that are closely related: CPU
Partitioning (also known as cpusets) and Real-Time Group Scheduling. The first group
is responsible for providing isolation, commonly known as shielding, of subsets of the
available cores, while the second group provides resource reservation guarantees to RT
threads, that is, it is responsible for controlling the amount of CPU for each reservation.
Figure 3.10: QoS daemon resource distribution layout.
Figure 3.10 illustrates a possible resource reservation schema. The nodes with RA and
RB represent the two runtimes present in the same physical host, while S1 and S2
represent services running under these two runtimes. The node shown as OS represents
the resources allocated to the operating system. The P2P node represents
the resources allocated to the overlay. For the sake of clarity, we do not present the
distribution of the overlay’s resources among its services. Each of the runtimes has
to request a provision of resources for later distribution among its services. Later, in
Chapter 5, we present the results that assess the potential of this approach.
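To give a rough idea of what such a reservation looks like at the Control Groups level,
the sketch below creates a group that shields a set of cores and grants it a real-time CPU
budget, using the cgroup v1 filesystem; the mount points, group name and budget values
are assumptions for the example, not the actual layout used by our QoS daemon.

    // Minimal sketch: create a CPU reservation with cgroup v1 (cpuset for
    // shielding plus RT group scheduling for the budget). Assumes the cpuset
    // and cpu controllers are mounted under /sys/fs/cgroup and that the
    // caller has sufficient privileges.
    #include <fstream>
    #include <string>
    #include <sys/stat.h>
    #include <sys/types.h>

    static void writeValue(const std::string& path, const std::string& value) {
        std::ofstream f(path);
        f << value;  // each cgroup attribute is exposed as a plain file
    }

    static void createCpuReservation(const std::string& group,
                                     const std::string& cores,  // e.g. "2-3"
                                     long rtRuntimeUs) {        // budget per period
        const std::string cpuset = "/sys/fs/cgroup/cpuset/" + group;
        const std::string cpu    = "/sys/fs/cgroup/cpu/" + group;
        mkdir(cpuset.c_str(), 0755);
        mkdir(cpu.c_str(), 0755);
        writeValue(cpuset + "/cpuset.cpus", cores);        // shield these cores
        writeValue(cpuset + "/cpuset.mems", "0");          // memory node 0
        writeValue(cpu + "/cpu.rt_period_us", "1000000");  // 1 s period
        writeValue(cpu + "/cpu.rt_runtime_us", std::to_string(rtRuntimeUs));
    }

Threads are then placed in the group by writing their IDs to the group's tasks file, after
which the kernel enforces the reservation.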
I/O subsystem
Although not yet implemented, we have left support for an I/O subsystem that is
responsible for managing the I/O bandwidth of each device individually. The I/O
reservation can be accomplished either by specifying weights, or by specifying read and
write bandwidth limits and operations per second (IOPS).
When using weights to perform I/O reservation, groups with greater weights receive a
larger I/O time quantum from the I/O scheduler. This approach is suited to best-effort
scenarios, which do not fit our purposes. In order to provide real-time behavior,
it is necessary to enforce I/O usage limits on both bandwidth and IOPS. Services that
manage large streams of information, such as video streaming, do not issue a high
number of I/O operations, but instead need a high amount of bandwidth. However,
low-latency data-centric services like Database Management Systems (DBMS) [120] or
Data Stream Management Systems (DSMS) [121, 122] exhibit the opposite behavior:
they do not need a high amount of bandwidth but instead require a high number of
IOPS.
I/O contention can be caused by a high-consumption service that starves other services
in the system, either by depleting I/O bandwidth or by saturating the device with a
number of I/O requests that exceeds its operational capabilities, such as the length of
the request queue.
The progressive introduction of Solid State Disk (SSD) technology into a storage market
traditionally dominated by hard drives is reshaping the approach taken to this type of
resource [123]. These new devices are capable of unprecedented levels of performance,
especially in terms of latency, where they are able to offer a hundred-fold reduction
in access times. The elimination of mechanical components allows SSDs to offer
low-latency read/write operations and deterministic behavior. An evaluation of these
features is left for future work.
Memory subsystem
A substantial number of system faults are caused by memory depletion, normally
associated with bugs, ill-defined applications, and system overuse. When an operating
system reaches a critical level of free memory, it tries to free all non-essential memory,
such as program caches. If this is not sufficient, then a fail-over mechanism is started.
In the Linux operating system, this mechanism consists of randomly killing processes in
order to release allocated memory, in an effort to prevent the inevitable system crash.
The runtime ensures that it has access to the memory it needs throughout its life-cycle
by requesting a static provision from the memory subsystem. In the memory subsystem,
each group reserves a portion of the total system memory, following a hierarchical
distribution model, allowing the runtime to further distribute the provisioned memory
among its different components, such as the P2P overlay layer and the user services.
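Under the cgroup v1 memory controller, such a static provision could be expressed
roughly as in the sketch below; the group name and the use of a hard limit are
illustrative assumptions.

    // Minimal sketch: cap the memory available to a runtime's group using the
    // cgroup v1 memory controller (path and policy are illustrative).
    #include <fstream>
    #include <string>
    #include <sys/stat.h>

    static void provisionMemory(const std::string& group, long bytes) {
        const std::string dir = "/sys/fs/cgroup/memory/" + group;
        mkdir(dir.c_str(), 0755);  // create the group
        std::ofstream(dir + "/memory.limit_in_bytes") << bytes;  // hard cap
    }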
Network subsystem
Each group in the network subsystem tags the packets generated by its threads with an
ID, allowing the tc (Linux traffic controller) to identify packets of a particular group.
With this mapping it is possible to associate different priorities and scheduling policies
to different groups.
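A minimal sketch of this tagging with the cgroup v1 net_cls controller is shown below;
the 32-bit class identifier encodes a tc handle as major:minor (0x00100001 corresponds
to 10:1) and is an assumed value that must match a class configured in tc.

    // Minimal sketch: tag the packets generated by a group's threads with a
    // class ID via net_cls (cgroup v1). A tc filter can then match this ID and
    // map the group's traffic onto a priority class or scheduling policy.
    #include <fstream>
    #include <string>
    #include <sys/stat.h>

    static void tagGroupTraffic(const std::string& group, unsigned long classid) {
        const std::string dir = "/sys/fs/cgroup/net_cls/" + group;
        mkdir(dir.c_str(), 0755);
        std::ofstream(dir + "/net_cls.classid") << classid;
    }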
This approach deals with the local aspects of network reservation, that is, the sending
and receiving on the local network interfaces, but it is not sufficient to guarantee end-
to-end network QoS. In order to provide this, all the hops between the two peers, such
as routers, must accept and enforce the target QoS reservation. An example of an
end-to-end QoS reservation is depicted in Figure 3.11.
Figure 3.11: End-to-end network reservation.
In the future, we intend to provide an end-to-end QoS signaling protocol capable of
establishing QoS tunnels across a segment of a network, using a protocol such as
RSVP [64] or NSIS [124].
Monitoring Infrastructure
The monitoring infrastructure audits the resource usage of the underlying OS, such as
CPU, memory, and storage. The monitoring data is gathered using the information
exposed by the /proc pseudo-filesystem. For this to work, the Linux kernel must be
configured to expose this information.
The main goal of the infrastructure is to provide a resource usage histogram (currently it
supports CPU and memory) that can be used for both off-line (log audit) and real-time
analysis. The log analysis is helpful in detecting abnormal behaviors that are normally
caused by bugs (such as memory leaks).
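For illustration, the snippet below shows the kind of raw /proc sampling such a
histogram can be built from; it reads the total and free memory fields from
/proc/meminfo (the field names follow the standard /proc format).

    // Minimal sketch: sample memory usage from /proc/meminfo for the
    // monitoring histogram (only two fields are extracted).
    #include <cstdio>
    #include <fstream>
    #include <string>

    struct MemSample { long totalKb = 0; long freeKb = 0; };

    static MemSample sampleMemory() {
        MemSample s;
        std::ifstream f("/proc/meminfo");
        std::string line;
        while (std::getline(f, line)) {
            std::sscanf(line.c_str(), "MemTotal: %ld kB", &s.totalKb);
            std::sscanf(line.c_str(), "MemFree: %ld kB", &s.freeKb);
        }
        return s;
    }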
Currently, we use a reactive fault-detection model that only acts after a fault has
occurred. With a real-time monitoring infrastructure it is possible to evolve to a
more efficient proactive fault-detection model. Using a proactive approach, the runtime
could predict imminent faults and take actions to eliminate, or at least minimize, the
consequences of such events. For example, if a runtime detects that its storage unit,
such as a hard drive, is exhibiting an increasing number of bad blocks, it could decide to
migrate its services to other nodes in the overlay.
3.1.5 Operating System Interface
Our target systems can be supported by a collection of heterogeneous machines with
different operating systems, so it was crucial to develop a portable runtime
implementation. Additionally, fine-grained control over all the resources available in the
system is paramount to achieving real-time support. For example, in order to maintain a
highly critical surveillance feed, the middleware must be able to reserve (provision) the
necessary CPU time to process the video frames within a predefined deadline.
To meet this goal, we chose to control and monitor the underlying resources from
userspace (shown in Figure 3.12), avoiding the use of specialized kernel modules.
To complement this approach, we use ACE [111], a portable network framework that
offers a common API that abstracts the low-level system-calls offered by the different
operating systems, namely, thread handling (including priorities), networking and I/O.
Furthermore, ACE also provides several high-level design patterns, such as the
reactor/connector design pattern, that enable the development of modular systems
capable of offering high levels of performance.
The resource reservation mechanisms, including CPU partitioning, are not covered by
any of the Portable Operating System Interface (POSIX) standards, so there is no
common API to access them. The Linux operating system, on which our current
implementation is based, provides access to the low-level resource reservation
mechanisms via the Control Groups infrastructure, through the manipulation of a set of
files exposed in a pseudo-filesystem.
Figure 3.12: Operating system interface.

Nevertheless, low-level RT support in Linux is not provided out-of-the-box; a careful
selection of the kernel version and a proper configuration are required. An initial
evaluation was performed for kernel 2.6.33 with the rt-preempt patch (usually referred to
as kernel 2.6.33-rt), but its support for Control Groups revealed several issues, resulting
in unstable systems.
A second kernel version was evaluated, the kernel 2.6.39-git12, which already supports
almost every feature present in the rt-preempt patch and provides flawless support for
Control Groups.
The Linux kernel supports a wide range of parameters that can be adjusted. However,
only a small subset had a significant impact on overall system performance and
stability under RT, most notably:
• DynTicks - the dynamic ticks mechanism replaces the former periodic checking of
timer events (usually at 100, 250, 500 or 1000 Hz), allowing for a significant power
reduction and, more importantly, a reduction of kernel latencies;
• Memory allocator - the two most relevant are the SLAB [125] and SLUB [126]
memory allocators. They both manage caches of objects, thus allowing for efficient
allocations. SLUB is an evolution of SLAB, offering a more efficient and scalable
implementation that reduces queuing and general overhead;
• RCU - the Read-Copy Update [127] is a synchronization mechanism that allows
reads to be performed concurrently with updates. Kernel 2.6.39-git12 offers a
novel RCU feature, the “RCU preemption priority boosting” [128]. This feature
enables a task that wants to synchronize with RCU to boost the priority of all
sleeping readers to match the caller's priority.
3.2 Programming Model
The access to the runtime capabilities is safeguarded by a set of interfaces. The main
purpose of these interfaces is to provide disciplined access to resources while providing
interoperability between the runtime and services that are not collocated within the
same memory address space. Furthermore, these interfaces also allow a better
modularization of the components of the runtime. Figure 3.13 shows the interactions
between the components of the architecture through these interfaces.
Figure 3.13: Interactions between layers.
User applications and services access the runtime through the Runtime Interface. The
direct control of the overlay is restricted to the core of the runtime. The access to the
P2P overlay, for both services and users, is only allowed through the Overlay Interface
(described in Section 3.2.2) that is accessible from the Runtime Interface. An overlay
is likewise restricted in its access to the core of the runtime, which is only possible
through the Core Interface (described in Section 3.2.3), preventing malicious use of
runtime resources by overlay plugins.
3.2.1 Runtime Interface
The Runtime Interface is the main interface available to users and services. It provides
proxy-type support, allowing them to interact with runtimes that are not in the same
address space through an Inter-Process Communication (IPC) mechanism. While
multiple runtimes can coexist in a single host, this results in redundant resource
consumption; our approach allows the number of coexisting runtimes to be reduced,
resulting in lower resource consumption.
Figure 3.14 shows the access to the runtime from different processes. The runtime is
initially bootstrapped in process 1. A virtualized service that uses the Kernel Virtual-
Machine (KVM) hypervisor is contained in process 2. Process 3 shows an additional
user and service using the runtime of process 1. The support for additional languages
was also considered in the design of the architecture: processes 4 and 5 show services
running inside a Java Virtual Machine (JVM) and a .NET Virtual Machine, respectively.
While we only show one runtime in this example, the QoS daemon, allocated in process
6, is able to support multiple runtimes.
Figure 3.14: Multiple processes runtime usage.
The operations supported by the Runtime Interface are the following: (a) bootstrap
new runtimes; (b) access previously bootstrapped runtimes; (c) start and stop services,
both local and remote; (d) attach new overlay plugins on-the-fly; (e) allow access to the
overlay, through the Overlay Interface, and; (f) create clients to interact with service
instances.
3.2.2 Overlay Interface
The main goal of the Overlay Interface is to provide disciplined access to a subset of the
underlying overlay infrastructure, leveraging performance goals, for example avoiding
lengthy code paths that can lead to the creation of hot paths, while enforcing proper
isolation and thus preventing misuse of shared resources by rogue or misbehaved services
or applications.
The Overlay Interface provides access to the overlay Mesh and Discovery services.
These services and the overall overlay architecture were described in Section 3.1.3. We
plan to extend the architecture to provide access to the underlying FT service, to allow
the dynamic manipulation of the replication policy used in a replication group. However,
as different replication policies have different resource requirements, it is necessary to
provide additional support for dynamic changes in resource reservation assignments,
that is, the increase or decrease of the amount of resources associated with a reservation.
To overcome this, we plan to enhance our QoS daemon to provide the necessary support.
3.2.3 Core Interface
Every overlay implementation has to interact with the core. This interaction is mediated
through the Core Interface that is only accessible to the overlay plugin.
The operations supported by the Core Interface are the following: (a) start and stop
local services; (b) create replicas for the fault-tolerance service, and; (c) retrieve
information about service instances and resource availability.
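As an illustration only, a C++ rendering of this interface could look like the sketch
below; the type and method names (UUID, ServiceParams, ServiceInfo, ResourceInfo)
are assumptions and do not reproduce the actual Stheno declarations.

    // Hypothetical sketch of the Core Interface exposed to overlay plugins.
    // All names are illustrative.
    class UUID;
    class ServiceParams;
    class ServiceInfo;
    class ResourceInfo;

    class CoreInterface {
    public:
        virtual ~CoreInterface() {}
        // (a) start and stop local services
        virtual void startLocalService(const UUID& sid,
                                       const ServiceParams& params) = 0;
        virtual void stopLocalService(const UUID& iid) = 0;
        // (b) create replicas for the fault-tolerance service
        virtual void createReplica(const UUID& sid, const UUID& groupId) = 0;
        // (c) retrieve information about service instances and resources
        virtual const ServiceInfo* getServiceInfo(const UUID& iid) const = 0;
        virtual const ResourceInfo* getResourceAvailability() const = 0;
    };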
The creation and destruction of local services are issued by the Mesh service upon the
reception of requests from remote peers. These requests are then redirected to the core
of the runtime by the Core Interface. In the case of the creation of a new service, the
core requests the Service Manager to create a new service instance and makes the
necessary QoS resource reservations with the QoS daemon, through the QoS client. On
the other hand, when destroying a service instance, the core just has to request the
removal of the instance from the Service Manager.
The creation and removal of replicas are issued by the FT service upon the reception
of requests from a replication group; such requests normally come from the coordinator
of the replication group, although this is implementation dependent. As in the previous
case, the request for the creation or removal of a replica is handled by the core, after
being redirected by the Core Interface. In the case of the removal of a replica, the core
forwards the request to the proper replication group through the fault-tolerance service.
On the other hand, in the case of the creation of a new replica, the core makes the QoS
resource reservations that are needed to maintain both the service instance (that will
act as a replica of the primary service instance) and the replication group, that is,
the infrastructure necessary to enforce the replication mechanisms. The retrieval of
information about service instances and resource availability is used by the Discovery
service in response to queries.
3.3 Fundamental Runtime Operations
The runtime manages resources, services and clients. Its main operations are: the initial
runtime creation and corresponding bootstrap; creation of local and remote services with
and without fault-tolerance, and; creation of clients for user services.
3.3.1 Runtime Creation and Bootstrapping
The creation and initialization, normally designated as bootstrap, of the runtime
involves a three-phase process, as shown in Figure 3.15.
Figure 3.15: Creating and bootstrapping of a runtime.
The creation of the middleware, shown in Figure 3.15a, is accomplished by the user
through the Runtime Interface. At this point, the runtime does not have an active
overlay infrastructure. The user is responsible for choosing a suitable overlay
implementation (plugin) and for attaching it to the runtime (shown in Figure 3.15b). In
the final phase, the user bootstraps the newly created runtime (depicted in Figure 3.15c).
This bootstrapping process is governed by the core. If the runtime is configured to use
QoS reservation, then the core connects to the QoS daemon and reserves the necessary
resources. Otherwise, step 2 is omitted, and no interaction is made with the QoS
daemon.
Listing 3.1: Overlay plugin and runtime bootstrap.
1 RuntimeInterface* runtime = 0;
2 try {
3     runtime = RuntimeInterface::createRuntime();
4     Overlay* overlay = createOverlay();
5     runtime->attachOverlay(overlay);
6     runtime->start(args);
7 } catch (RuntimeException& ex) {
8     Log("Runtime creation failed"); // handle error
9 }
The code snippet necessary to create and bootstrap a runtime is shown in Listing 3.1.
Line 3 shows the creation of the runtime, as previously illustrated in Figure 3.15a. At
this time, only the basic infrastructure is created and the runtime is still not
bootstrapped. This is followed by the creation of the chosen overlay implementation,
which is then attached to the runtime, in lines 4 and 5 (corresponding to the illustration
of Figure 3.15b). Finally, the whole process is completed, in line 6, with the bootstrap of
the runtime, which implicitly bootstraps the overlay (as shown in Figure 3.15c).
3.3.2 Service Infrastructure
The life cycle of a service starts with its creation and terminates with its destruction.
The service infrastructure provides the user with such mechanisms. This section starts
with an in-depth view of the local creation of services, as first introduced in
Section 3.1.1. It then presents a detailed view of the mechanisms that regulate the
creation of remote services with and without FT support, and concludes with a complete
outline of the service deployment mechanisms.
Local Service Creation
The steps involved in instantiating a new local service are depicted in Figure 3.16. The
user, through the Runtime Interface, requests the creation (and bootstrap) of a new
local service instance (step 1). The core of the runtime redirects the request to the
service manager for further handling (step 2). The first step taken by the service
manager is to determine if the service implementation is known. If the service is not
known, then the core tries to find the respective implementation using the discovery
service in the overlay (step omitted). If the implementation is found, then it is
transferred back to the requesting peer and the service creation can continue. Otherwise,
the creation of the service is aborted.

Figure 3.16: Local service creation.
Figure 3.16: Local service creation.
If the runtime was bootstrapped with resource reservation enabled, then, once the
service implementation is retrieved, it is possible to retrieve its QoS requirements.
Knowing these requirements, the runtime tries to allocate them through a QoS client
(shown as dashed lines in steps 3 and 4). If the requested resources are available, then
the service is instantiated, otherwise the service creation is aborted. If the resources are
available but the service does not successfully start, all the associated resource
reservations are released.

If, on the other hand, the runtime does not have the resource reservation infrastructure
enabled, then once the service implementation is known and retrieved, the core can
immediately instantiate a local service instance.
Listing 3.2 shows the code snippet necessary to bootstrap a new local service instance.
Line 1 shows the initialization of the service parameters, which are wrapped by a smart
pointer variable, allowing for safe manipulation by the runtime. The actual service
creation is done in line 4 and is performed by the startService() method, which
takes the following parameters: the SID of the service to be created; the service
parameters, and; the peer where the service is to be launched, which in this case is the
Universal Unique Identifier (UUID) of the local runtime. Upon the successful creation
of the service instance, the parameter iid of the call to startService() will contain its
instance identifier.
Listing 3.2: Transparent service creation.
1 ServiceParamsPtr paramsPtr(new ServiceParams(sid));
2 try {
3     UUIDPtr iid;
4     runtime->startService(sid, paramsPtr, runtime->getUUID(), iid);
5 } catch (ServiceException& ex) {
6     Log("Service creation failed"); // handle error
7 }
Remote Service Creation
There are two distinct approaches to creating remote services. A user can either
explicitly specify the peer that will host the service, or, alternatively, leave the decision
of finding a suitable hosting place to the middleware. The latter approach is the default
way to bootstrap services.
Figure 3.17: Finding a suitable deployment site.
Figure 3.17 shows the mechanism associated with the search for a suitable place to
deploy a new service instance, within a hierarchical mesh overlay, where each level of
the tree is maintained by a cell. Cells are logical constructions that maintain portions of
the overlay space and provide mesh resilience.

The requesting peer uses the discovery service of the overlay to perform a Place of
Launch (PoL) query. This query retrieves the information about a suitable hosting
peer. However, as previously stated, the resolution of the query is totally dependent on
the overlay implementation. In the example provided by Figure 3.17, the query issued
by peer A is relayed until it reaches peer C. This peer is able to satisfy the query and
replies back to peer B, which in turn replies back to peer A. After receiving the query
reply, indicating peer D as the deployment site, peer A requests a remote service creation
at peer D. We describe an implementation of this mechanism in Chapter 4.
Figure 3.18: Remote service creation without fault-tolerance.
In order to create a remote service (Figure 3.18), the user at peer A makes the
request through the Runtime Interface (steps 1 and 2). The core of the runtime uses
its mesh service to request from the remote peer the creation of the desired service
(steps 3 and 4). The mesh service of the remote peer, after receiving the request for
the creation of a new service instance, uses the Core Interface (step 5) to redirect the
request to the core of its runtime (step 6). At this point, the remote peer uses the
previously described procedure for local service creation (Figure 3.16). The dashed
lines represent the optional use of resource reservation.
The code snippet shown in Listing 3.3 creates two remote service instances: one uses
explicit deployment and the other uses transparent deployment. Line 1 shows the
initialization of the service parameters that are used in the creation of both service
instances. Line 5 shows the creation of a remote service instance using explicit
deployment. The remote peer that will host the instance is given by the remotePeerUUID
variable. Line 7 shows the creation of a remote service instance using transparent
deployment. Upon the successful creation of a service instance, the last parameter used
in the call to startService() contains the instance identifier for the newly created
service instance.
Listing 3.3: Service creation with explicit and transparent deployments.
1 ServiceParamsPtr paramsPtr(new ServiceParams(sid));
2 try {
3     UUIDPtr explicitIID, transparentIID;
4     // explicit deployment
5     runtime->startService(sid, paramsPtr, remotePeerUUID, explicitIID);
6     // or, transparent deployment
7     runtime->startService(sid, paramsPtr, transparentIID);
8 } catch (ServiceException& ex) {
9     Log("Service creation failed"); // handle error
10 }
Remote Service Creation With Fault-Tolerance
When creating a remote service with fault-tolerance (Figure 3.19), in response to a
request from another peer (steps 1 to 4), the remote peer acts as the main instance,
also known as the primary node, for that service (steps 5 to 8). Before being able to
instantiate the service, the primary node first has to find a placement for the requested
number of replicas (step omitted). This process is delegated to and governed by the FT
service (step 9). As before, the dashed lines (step 8) represent optional paths, used when
resource reservation is enabled.
Figure 3.19: Remote service creation with fault-tolerance: primary-node side.
The fault-tolerance service, using its underlying mechanisms, which are dependent on
the implementation of the overlay, tries to find the optimal placements on the mesh to
instantiate the needed replicas. In a typical implementation this is normally
accomplished through the use of the discovery service. Depending on the overlay
topology, finding the optimal placement can be intractable, as in ad-hoc topologies, so
systems often implement more structured topologies or use heuristics.
Given the modularity of the architecture, it is possible to configure, for each service, the
type of fault-tolerance strategy to be used, such as semi-active or passive replication,
allowing a better fit to the service's needs.
The primary node, using the FT service, creates the replication group that will support
replication for the service. To create the replication group, the FT service uses the
placement information to create the replicas of the group.
Figure 3.20: Remote service creation with fault-tolerance: replica creation.
The process of creating a new replica is shown in Figure 3.20. After receiving the
request to join the replication group through the FT service (steps 1 and 2), the replica
proceeds as previously described for the local service creation (steps 3 to 6). We describe
the algorithms that materialize this behavior for different types of replication policies in
Chapter 4.
Listing 3.4: Service creation with Fault-Tolerance support.
1 FTServiceParams* ftParamsPtr = createFTParams(nbrOfReplicas, FT::SEMI_ACTIVE_REPLICATION);
2 ServiceParamsPtr paramsPtr(new ServiceParams(sid, ftParamsPtr));
3 try {
4     UUIDPtr iid;
5     runtime->startService(sid, paramsPtr, iid);
6 } catch (ServiceException& ex) {
7     Log("Service creation failed"); // handle error
8 }
Listing 3.4 shows the code snippet necessary to bootstrap a remote service with FT
support. Line 1 shows the initialization of the FT parameters, with a total of
nbrOfReplicas replicas and using semi-active replication. The actual service creation
is done in line 5. Upon the successful creation of the service instance, the parameter iid
of the call to startService() will contain its instance identifier, while sid holds the
system-wide identifier for the service.
3.3.3 Client Mechanisms
The interactions between a user and a service instance are supported by a client.
A client is a proxy between the user and a service instance, responsible for handling
all the underlying communication and resource reservation mechanisms. The runtime
provides a flexible infrastructure that does not impose any type of architectural
restrictions on either the design of a client or the type of interaction that can take place.
Figure 3.21 shows the creation and bootstrap sequence of a client.
Figure 3.21: Client creation and bootstrap sequence.
The creation of a client, shown in Figure 3.21a, starts with the user requesting a new
client through the Runtime Interface. Upon receiving the client creation request, the
core of the runtime uses the service factory to check if the service implementation is
known. If it is known, then the core of the runtime returns to the user a new client
obtained from the service implementation; otherwise, the creation of the client is
aborted.
After retrieving the client, the user must find a suitable service instance to connect
to (shown in Figure 3.21b). After retrieving the Overlay Interface through the Runtime
Interface, the user uses the discovery service to search for a suitable instance (the calling
path is identified as 1). This is followed by the reply from the discovery service, which is
returned to the user (calling path identified as 2), in this case indicating peer B as the
owner of a service instance.
If the user wishes to use resource reservation, then it must use the underlying resource
reservation infrastructure. This optional step is shown as a dashed line (calling path
identified as 3). To finish the bootstrap sequence, the user must use the information
about the service instance that was returned by the discovery service and connect to
the service (step 4).
Listing 3.5 shows the code snippet necessary to create a client, using the RPC service
as an example.
Listing 3.5: Service client creation.
1 try {
2     ClientParamsPtr paramsPtr(new ClientParams(QoS::RT, CPUQoS::MAX_RT_PRIO));
3     ServiceClient* client = runtime->getClient(sid, iid, paramsPtr);
4     RPCServiceClient* rpcClient = static_cast<RPCServiceClient*>(client);
5     RPCTestObjectClient* rpcTestObjectClient = new RPCTestObjectClient(rpcClient);
6     rpcTestObjectClient->ping();
7 } catch (ServiceException& ex) {
8     Log("Client creation failed"); // handle error
9 }
Prior to the actual creation of the client, the user must initialize the ClientParamsPtr
parameter with the desired QoS properties. In line 2 of Listing 3.5, this parameter is
initialized to use the maximum RT priority. The actual creation of the client is done
in line 3. The runtime returns a generic ServiceClient pointer that must be downcast
to the proper client implementation. In line 4, the generic pointer is converted to
a generic RPC client, which manages the low-level infrastructure that handles
invocations and replies. Line 5 shows the creation of the RPC “stub”, which is
responsible for marshaling requests and unmarshaling replies. During the creation of the
stub, the general RPC client is attached to it. Line 6 shows an actual one-way RPC
invocation of a ping operation.
3.4 Summary
This chapter started by presenting the architecture of the runtime of a P2P middleware,
providing an overview of all the layers that compose the runtime: the applications layer,
which contains all the services and users that run on top of the middleware; the core
layer, which is responsible for the overall management of the runtime; the overlay
abstraction layer, which provides the abstractions for the low-level P2P services; the
support framework, which provides a set of high-level abstractions for network
communications and QoS management, and; the Linux/ACE layer, which provides an
abstraction of the underlying Linux operating system through the ACE framework.
We then provided a detailed insight into the programming model, exposing the interfaces
that must be used to access the runtime capabilities. Furthermore, we described the
advantages of these programming interfaces, specifically their ability to provide
modularity, interoperability and controlled access to runtime resources.
The chapter ended with an overview of the fundamental operations present in the
middleware, namely: runtime creation and bootstrap; local service creation; remote
service creation with and without FT, and; client creation.
–With great power comes great responsibility.
Voltaire

4 Implementation
This chapter presents the implementation details of the runtime, focusing on the
underlying mechanisms present in the P2P services of our overlay implementation.
Additionally, we present three service implementations that showcase the runtime's
capabilities, more precisely: an RPC-like service, an actuator service, and a streaming
service.
4.1 Overlay Implementation
Figure 4.1: The peer-to-peer overlay architecture.
As a proof-of-concept for this prototype, we have chosen the P3 [15] topology, which
follows a hierarchical tree P2P mesh. A representation of such a topology is shown in
Figure 4.1. There are three different types of peers present in our implementation:
regular peers, coordinator peers and leaf peers. The peers are responsible for maintaining
the organization of the overlay and for providing access points to the overlay for leaf
peers.
Each node in a P3 network corresponds to a cell, a set of peers that collaborate to
maintain a portion of the overlay. Cells are logical constructions that provide overlay
resilience and are central in our implementation of fault-tolerance mechanisms. Each
cell is coordinated by one peer, denominated the coordinator peer. Every other peer
in the cell is connected to the coordinator, allowing for efficient group communication.
If the coordinator fails, one of the peers in the cell takes its place and becomes the
new coordinator. The communication between distinct cells is accomplished through
point-to-point connections (TCP/IP sockets) between the coordinators of the cells.
The last type of peer present in the overlay is known as leaf peer. These peers do
not have any type of responsibilities in maintaining the mesh. Typically, they use the
overlay capabilities, for instance, to advertise the presence of a sensor or simply to act
as a client. This type of peer does not host any user services, relying instead on the
overlay to host them.
The original P3 topology [15] follows a hierarchical organization that had a significant
problem: when the coordinator of a cell crashes, it causes a cascading failure, with its
children coordinators propagating the failure to the remaining sub-trees. We explored
this problem in previous work [6], and concluded that it was directly linked to the rigid
naming scheme of the P3 architecture. In case of a cell failure, the cell and its sub-trees
would have to perform a complete rebind to the mesh, and thus had to contact the
root node of the tree to find a new suitable position. This caused two obvious problems:
the overhead (and time) of rebinding all the cells, and the bottleneck at the root node.
To avoid these limitations, we modified the original P3 topology. The problems
associated with the rigid naming scheme of P3 were avoided through the design and
implementation of a new fault-aware architecture. This type of architecture focuses on
reducing the impact of faults, as it assumes that they happen frequently, taking special
care to eliminate, or at least minimize, the occurrence of cascading failures. To achieve
this, the middleware introduces a new flexible naming scheme that removes all
inter-dependencies between cells, and therefore allows the migration of entire sub-trees
between different portions of the tree.
The developer, however, is free to implement any type of topology and behavior within
an overlay implementation for the middleware; it only has to implement the Overlay
Interface. This interface is composed of three basic P2P services. The mesh service,
described in sub-section 4.1.2, handles all the management of the overlay; in a sense,
it is the most fundamental service, since it provides the infrastructure for all the other
services. The discovery service, detailed in sub-section 4.1.3, supports the infrastructure
for generic querying. Last, the FT service provides the infrastructure for the fault-
tolerance mechanisms present in the overlay and is described in sub-section 4.1.4.
4.1.1 Overlay Bootstrap
The bootstrap of an overlay is requested by the core of the runtime on behalf of the
user. Figure 4.2 illustrates this bootstrap process. The overlay sequentially bootstraps
the mesh (step 1), discovery (step 2) and fault-tolerance (step 3) services.
Figure 4.2: The overlay bootstrap.
The bootstrap process is implemented by the Overlay:start() procedure and it is
shown in Algorithm 4.1. This procedure starts the mesh, discovery and FT services.
The order in which the services are started is conditioned by the dependencies between
them: both the discovery and fault-tolerance services need the information about the
SAPs of homologous services in neighboring peers, and this information is provided
by the mesh service.
Algorithm 4.1: Overlay bootstrap algorithm
1 procedure Overlay:start()
2     for service in [Mesh, Discovery, Fault-Tolerance] do
3         service.start()
4     end for
5 end procedure
4.1.2 Mesh Service
The mesh service is the central component in our overlay implementation. It acts as
an overlay manager and it is also responsible for the creation and removal of high-level
services from the overlay, as previously described in Chapter 3.
A mesh service must extend the Mesh Interface, but is free to implement any type
of organizational logic. Nevertheless, in a typical implementation, the mesh service
normally has a mesh discovery sub-service, responsible for providing a dynamic
discovery mechanism for peers in the overlay, whereas the discovery service, described in
Section 4.1.3, provides a generic infrastructure capable of handling high-level queries.
It is not possible to use the generic discovery service to search for peers in the overlay
because of the dependencies between the mesh and discovery services, as explained
previously.
A possible implementation for this type of mesh discovery mechanism could be
accomplished through the use of a well-known portal. This has the advantage of being
simple to implement, but inherently represents both a bottleneck and a single
point-of-failure. To overcome these limitations, our overlay has a discovery mechanism,
one in each cell, that uses low-level multicast sockets to provide a distributed and
efficient mesh discovery implementation. Figure 4.3 provides an overview of the major
components in a cell. Each peer participating in a cell has a cell object that contains a
cell discovery object and a cell group object: the cell object provides a global view of the
cell to the local peer; the cell discovery object provides the support, through the use of
multicast sockets, for the cell discovery mechanism, and; the cell group object provides
the group communications within the cell.
Figure 4.3: The cell overview.
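The relationship between these objects can be sketched in C++ as follows; the class
names are illustrative and do not reproduce the actual implementation.

    // Illustrative sketch of the per-peer cell composition described above.
    class CellDiscovery;  // multicast-based peer discovery within the cell
    class CellGroup;      // intra-cell group communication

    class Cell {
    public:
        Cell(CellDiscovery* discovery, CellGroup* group)
            : discovery_(discovery), group_(group) {}
        bool isCoordinator() const { return coordinator_; }
        CellGroup* group() { return group_; }
        CellDiscovery* discovery() { return discovery_; }
    private:
        CellDiscovery* discovery_;  // answers cell discovery requests
        CellGroup* group_;          // synchronizes state with the other peers
        bool coordinator_ = false;  // does this peer coordinate the cell?
    };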
Building and membership
The membership mechanism allows a peer to join the peer-to-peer overlay (Figure 4.4).
The process starts with a request for a binding cell (step 1). This request has to be
made to the root cell, which in turn replies with a tuple comprising a suitable cell, its
corresponding coordinator, and the parent cell and coordinator (if available). The next
step is the active binding (step 2), which is further sub-divided into two possibilities
(steps 2-a and 2-b).
The multicast address for root cell discovery is a static and well-known value. While
this can be seen as a single point-of-failure, in the presence of a cell crash, that is, when
all the peers in a cell have crashed, the root cell is replaced by one of its children cells
that belong to the first level of the tree. The process behind failure handling and
recovery is described further below.
Figure 4.4: The initial binding process for a new peer.
Upon receiving the reply, and if the returned cell exists (step 2-a), the joining peer
connects to the coordinator (step 3-a). Otherwise, if the cell is new, it becomes the
coordinator for the cell (step 2-b). If the target cell is not the root cell, and if the peer
is the coordinator of the cell, then it connects to the coordinator peer of its parent cell
(step 3-b).
To finalize the binding process, the peer has to formalize its membership by sending
a join message, as illustrated in Figure 4.5. At this point, the peer sends a join
message to its parent (step 1), if it is the coordinator of the cell, or sends the message
to the coordinator of the cell (step 1-a), which forwards it to its parent (step 1-b). This
message is propagated through the overlay until it reaches the root cell. It is the
responsibility of the root cell to validate the join request and to reply accordingly. The
reply is propagated through the overlay downwards to the joining peer (step 3). After
this, the peer is part of the overlay.
Figure 4.5: The final join process for a new peer.
The mesh construction algorithm is depicted in Algorithm 4.2. To enter the mesh, a new
peer calls the Mesh:start() procedure, which creates a cell discovery object for
accessing the root cell discovery infrastructure (line 2), at a well-known multicast
address. This object is then used to request a cell to which the joining node will connect
itself, by making a call to the cellRootDiscoveryObj.requestCell() procedure (shown
in Algorithm 4.6, lines 1-8). This procedure multicasts a discovery message that tries to
find the peer-to-peer overlay. If it fails, then no peer is present in the root cell, and the
call to the CellDiscovery:requestCell() procedure returns the information associated
with the root cell, more specifically, the well-known multicast address used for root cell
discovery. Otherwise, the appropriate bind information is returned. Using this binding
information, a new cell object is created and initialized (lines 4-5).
Algorithm 4.2: Mesh startup
1 procedure Mesh:start()
2     cellRootDiscoveryObj ← Cell:createRootCellDiscovery()
3     bindInfo ← cellRootDiscoveryObj.requestCell()
4     cellObj ← Cell:createCellObject()
5     cellObj.start(bindInfo)
6 end procedure
Cell Bootstrap
The binding information returned by the cell discovery mechanism has all the
information needed for the cell initialization (as shown in Figure 4.4). In Algorithm 4.3,
we
show the algorithms that rule the behavior of a cell.
Algorithm 4.3: Cell initialization
var: this // The current cell object
1 procedure Cell:start(bindInfo)
2     bindingCellInfo ← bindInfo.getBindingCellInfo()
3     if not bindingCellInfo.isCoordinator() then
4         cellGroupObj ← Cell:bindToCoordinatorPeer(bindingCellInfo.getCoordInfo())
5     else
6         parentPeerInfo ← ∅
7         if not bindingCellInfo.isRoot() then
8             parentPeerInfo ← bindInfo.getParentCellCoordInfo()
9         end if
10        cellGroupObj ← Cell:createCellGroup(parentPeerInfo)
11    end if
12    cellGroupObj.requestJoin()
13    cellDiscoveryAddr ← bindingCellInfo.getCellDiscoveryAddress()
14    cellDiscoveryObj ← Cell:createCellDiscovery(cellDiscoveryAddr)
15    this.attach(cellDiscoveryObj)
16 end procedure
The bootstrap of the cell object is performed using the Cell:start() procedure, which
takes the bindInfo as its argument. This bootstrap process is dependent on the state
of the target cell. The call to the bindingCellInfo.isCoordinator() method
indicates whether we are the coordinator of this cell. If the peer is not the coordinator
peer for the cell (Figure 4.4, step 2-a), then it has to join the cell group by binding to the
cell group's coordinator peer (line 4). On the other hand, if the peer is the coordinator
(Figure 4.4, step 2-b), then it checks if the cell is the root. If the peer is on the root cell,
then the bootstrap is finished; otherwise, it must connect to its parent cell coordinator
and link the newly created cell to its parent cell (lines 5-10).
97
CHAPTER 4. IMPLEMENTATION
Regardless of whether the newly arrived peer is a non-coordinator in a cell group, or a
coordinator of a non-root cell, it must propagate its membership by using a join
message. In line 12, the call to cellGroupObj.requestJoin() initiates this process.
The join process is depicted in Figure 4.5, while the cell group communication is shown
in Figure 4.6.
Lines 13-14 show the creation of the cell discovery object that will be associated with
this cell, with the multicast address being provided by the bindingCellInfo. After its
creation, the object is attached to the cell in line 15, enabling the cell to handle cell
discovery requests.
Cell State and Communications
When a peer is running inside a cell, it is either the coordinator or a non-coordinator
peer providing redundancy to the coordinator. Any external peer that connects to the
cell must connect through the coordinator peer. It is the coordinator's responsibility
to validate any incoming request. If the request is valid and accepted, the coordinator
sends the request to its parent (if applicable). After receiving the reply from its parent,
the coordinator updates the state of the cell by synchronizing with all the active peers.
This synchronization is done using our group communication infrastructure, which is
shown in Figure 4.6.
The synchronization process inside a cell can be divided into two cases, depending on
whether the synchronization is initiated by the coordinator or by a non-coordinator
peer. When the synchronization is initiated by the coordinator peer, shown in
Figure 4.6a, it starts by sending the request to its parent peer (step 1), which recursively
forwards it towards the root cell (step 2). After the root cell is synchronized, that is,
after the request has been sent to all active peers and their replies have been received,
an acknowledgment message is sent downwards to the originating cell (step 3). Upon
receiving the acknowledgment from its parent, each coordinator peer repeats the same
process, that is, it synchronizes its cell (steps 4 and 5) and sends an acknowledgment
downwards (step 6). When the acknowledgment reaches the originating cell, the request
is synchronized (steps 7 and 8).
Figure 4.6: Overview of the cell group communications. (a) Synchronization initiated by
the coordinator; (b) synchronization initiated by a follower.

The synchronization process can be performed either in parallel or sequentially.
Although we do not provide benchmarks, we have done a preliminary empirical
assessment of the optimal transmission strategy. Early testing shows that, for a small
number of peers, the best strategy is to send the requests sequentially; for a larger
number of peers, the best strategy is to send them in parallel, using a pool of threads to
perform the transmissions simultaneously. This behavior can be explained by the
overhead associated with enqueueing the sending request in multiple threads: as the
number of peers increases, the cost of sending the requests sequentially surpasses the
overhead of the parallel transmission.
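A minimal sketch of this threshold-based choice is shown below; the threshold value is
an assumed tuning parameter, and std::async stands in for the middleware's thread
pool.

    // Illustrative sketch: pick the transmission strategy based on cell size.
    #include <future>
    #include <vector>

    struct Message { /* payload omitted */ };
    struct Peer {
        void sendMessage(const Message&) { /* network send omitted */ }
    };

    static void broadcast(std::vector<Peer*>& peers, const Message& msg) {
        const std::size_t kParallelThreshold = 8;  // assumed tuning value
        if (peers.size() < kParallelThreshold) {
            for (Peer* p : peers) p->sendMessage(msg);  // sequential send
        } else {
            std::vector<std::future<void>> pending;
            for (Peer* p : peers)
                pending.push_back(std::async(std::launch::async,
                                             [p, &msg] { p->sendMessage(msg); }));
            for (auto& f : pending) f.get();  // wait for all transmissions
        }
    }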
Figure 4.6b shows the communication steps required when the synchronization is
initiated by a non-coordinator peer. Here, the peer must send the request to the
coordinator peer (step 1). Upon receiving the request, the coordinator peer performs the
same process as that used in Figure 4.6a. It starts by propagating the request towards
the root cell (steps 2 and 3), with the respective acknowledgment being sent after the
root cell synchronizes (step 4). All the coordinator peers that belong to the cells between
the root cell and the originating cell synchronize the request within their cell after
receiving the acknowledgment from their parent. When the acknowledgment reaches
the originating cell, the coordinator peer spreads the request through the remaining
active peers and waits for their replies (steps 8 and 9). Last, the coordinator peer sends
an acknowledgment back to the originating peer (step 10).
The cell communication algorithms are shown in Algorithms 4.4 and 4.5, and they
expose the previously described roles present in the architecture: the coordinator and
non-coordinator roles.
Algorithm 4.4: Cell group communications: receiving-end
var: this // the current cell communication group object
var: cellObj // the cell object associated with the communication group
var: coordinatorPeer // the cell coordinator peer

1 procedure CellGroup:coordinatorHandleMsg(peer, msg)
2     if not msg.isAckMessage() then
3         ackMessage ← cellObj.processMsg(msg)
4         if not isRoot() then
5             request ← this.getParentPeer().sendMessage(msg)
6             request.waitForCompletion()
7             if request.failed() then
8                 this.handleParentFailure()
9             end if
10        end if
11        this.sendMessage(msg)
12        peer.sendMessage(ackMessage)
13    else
14        this.updatePendingRequests(msg)
15    end if
16 end procedure

17 procedure CellGroup:nonCoordinatorHandleMsg(msg)
18     if not msg.isAckMessage() then
19         ackMessage ← cellObj.processMsg(msg)
20         coordinatorPeer.sendMessage(ackMessage)
21     else
22         this.updatePendingRequests(msg)
23     end if
24 end procedure
If a peer is the coordinator of the cell, then all the incoming messages (from the cell
or from children cells) are processed by the CellGroup:coordinatorHandleMsg()
procedure, otherwise, the CellGroup:nonCoordinatorHandleMsg() procedure is
used to process the incoming messages.
In the CellGroup:coordinatorHandleMsg() procedure, the coordinator receives a new
message and, if it is not an acknowledgment, processes it in line 3. After the message is
processed and validated by the coordinator (line 3), and if the coordinator does not
belong to the root cell, then it must forward the message to its parent cell coordinator
and wait for the acknowledgment (lines 5-6), with the process recursively updating the
cells until the root node is reached. If the synchronization with the parent fails, then the
coordinator enters a recovery stage by executing the Cell:handleParentFailure()
procedure (lines 7-9), which is detailed below. After synchronizing with its parent, the
coordinator uses the CellGroup:sendMessage() procedure to send the message across
the peers, thus synchronizing the state among all the active peers present in the cell
(line 11). The last remaining step is to send back the reply message to the requesting
peer (line 12). On the other hand, if the coordinator received an acknowledgment, then
it updates any pending request (lines 13-15).
If the peer is not the coordinator of the cell, then all the incoming messages are
processed by the CellGroup:nonCoordinatorHandleMsg() procedure. If the message is
not an acknowledgment, then the cell object processes it and updates its internal state
(line 19), reflecting the changes performed globally in the cell. After this update, an
acknowledgment is sent back to the coordinator peer (line 20). Otherwise, the message
received was an acknowledgment and is used to update any pending request (lines 21-23).
The CellGroup:sendMessage() procedure, in Algorithm 4.5, illustrates the process
of sending a message within a cell. If a message is being sent by the coordinator
(lines 2-15) but originated in another peer, then the coordinator removes that
peer from the sending set (lines 3 and 4); this is illustrated in step 1 of Figure 4.6a and
step 2 of Figure 4.6b. The message is sent to all the peers present in the set, with each
pending request being stored in an auxiliary list (lines 5-9). The coordinator then waits
for the completion of all the pending requests (line 10). For each request that failed,
the coordinator removes the peer associated with that request from the list containing
all the active peers (lines 11-15).
On the other hand, if the message is being sent by a non-coordinator peer, then it is
forwarded to the coordinator of the cell (line 17). After sending the message, the peer
waits for the acknowledgment from the coordinator (line 18). The synchronization is
Algorithm 4.5: Cell group communications: sending-end
var: this // the current cell group communications object
var: peers // the active, non-coordinator, peer client list
var: coordinatorPeer // the coordinator peer client

 1 procedure CellGroup:sendMessage(msg)
 2   if this.isLocalPeerGroupCoordinator() then
 3     sendList ← peers
 4     sendList.remove(msg.getSourcePeer())
 5     cellRequestList ← ∅
 6     for peer in sendList do
 7       cellRequest ← peer.sendMessage(msg)
 8       cellRequestList.add(cellRequest)
 9     end for
10     cellRequestList.waitForCompletion()
11     for cellRequest in cellRequestList do
12       if cellRequest.failed() then
13         peers.remove(cellRequest.getPeer())
14       end if
15     end for
16   else
17     cellRequest ← coordinatorPeer.sendMessage(msg)
18     cellRequest.waitForCompletion()
19     if cellRequest.failed() then
20       this.handleCoordinatorFailure()
21     end if
22   end if
23 end procedure
then handled by the coordinator through the CellGroup:coordinatorHandleMsg()
procedure (previously shown in Algorithm 4.4). If the request fails, it is assumed that
the coordinator has crashed. In order to recover the cell from this faulty state, the
CellGroup:handleCoordinatorFailure() procedure is triggered.
Cell Discovery Mechanism
The goal of the cell discovery mechanism is to allow the discovery of peers in a cell.
The cell discovery object implements this sub-service, and uses low-level multicast
sockets to achieve an efficient implementation. The cell membership management is
accomplished through the use of the join, leave and rebind operations. These operations
are implemented through the cell group object. Both these mechanisms are presented
in Figure 4.7.
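To make the transport concrete, the C++ sketch below shows how a joining peer could multicast a cell discovery request using plain POSIX sockets. The group address, port, and message format are illustrative assumptions; the thesis leaves these details to the implementation.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
    const char* kGroup = "239.255.0.1";  // assumed discovery multicast group
    const int kPort = 30000;             // assumed discovery port

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(kPort);
    inet_pton(AF_INET, kGroup, &addr.sin_addr);

    // A joining peer multicasts a RequestCell message; an active peer of the
    // root cell answers over unicast with the binding information.
    const char* msg = "RequestCell:peer";
    if (sendto(sock, msg, strlen(msg), 0,
               reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0)
        perror("sendto");

    // Receivers join the group with setsockopt(IP_ADD_MEMBERSHIP) and
    // service the requests in a call-back such as handleDiscoveryMsg().
    close(sock);
    return 0;
}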
The algorithms that implement the cell discovery mechanisms are presented in Algo-
rithm 4.6. When a peer wants to join the mesh, it first has to find a suitable cell to bind
Figure 4.7: Cell discovery and management entities.
Algorithm 4.6: Cell Discovery
var: cellObj // the cell object
var: discoveryMC // the discovery low-level multicast socket

 1 procedure CellDiscovery:requestCell(peerType)
 2   request ← discoveryMC.sendRequestCell(peerType)
 3   if request.failed() then
 4     return Cell:createRootInfo()
 5   else
 6     return request.getCellInfo()
 7   end if
 8 end procedure

 9 procedure CellDiscovery:RequestParent(peerType)
10   request ← discoveryMC.requestParent(peerType)
11   request.waitForCompletion()
12   return request.getParent()
13 end procedure

14 procedure CellDiscovery:handleDiscoveryMsg(peer,msg)
15   switch(msg.getType())
16     case(RequestCell)
17       if not cellObj.isRoot() then
18         return
19       end if
20       replyRequestCellMsg ← cellObj.getCell(msg.getPeerInfo())
21       peer.sendMessage(replyRequestCellMsg)
22     end case
23     case(RequestParent)
24       replyRequestParentMsg ← cellObj.getParent(msg.getPeerInfo())
25       peer.sendMessage(replyRequestParentMsg)
26     end case
27   end switch
28 end procedure
to. This is achieved through a call to the CellDiscovery:requestCell() procedure
(lines 1-8), which in turn sends a cell request message to the root cell. The call will
be serviced by any of the peers in the root cell. If there are no peers in the root cell, the
procedure returns the root cell identifier (line 4). Otherwise, it returns an appropriate
place in the mesh tree to position the requesting peer (line 6). The parameter peerType
denotes the type of node that is joining the cell, and it can be either a peer or a leaf
peer.
The optimal position for a new peer depends on the strategy used and the type of peer.
For a new peer, given a tree-like topology, we first try to occupy the top of the tree,
aiming to improve the resiliency of the overlay.
The procedure CellDiscovery:handleDiscoveryMsg() (lines 14-28) is the call-back
that is executed on the cell’s active peers to process the discovery requests. The cell
discovery mechanism supports two types of messages, the request for a cell (lines 16-22)
and the request for a new parent (lines 23-26).
The request for a cell is only valid in the root cell, otherwise the request is simply
discarded (lines 17-19). The restriction of this operation to the root cell allows us
to provide a better balance of the mesh tree, because the root cell is the only part
of the tree that has full knowledge of the overlay. A suitable cell is found using the
cellObj.getCell() procedure. The reply message containing the binding information
is sent to the requesting peer (lines 20-21).
However, if the incoming request is for a new parent, then a suitable parent is found
through the call to the cellObj.getParent() procedure, with the result being sent to
the originating peer (lines 24 and 25). The request for a new parent is issued when the
parent peer of a cell fails. The coordinator of the cell must be able to find a new parent
peer within the parent cell, if one is available, by using the
CellDiscovery:RequestParent() procedure.
Faults and Recovery
Faults arise for various reasons, ranging from hardware failures, which include peer
hardware failures and network outages, to software bugs. We consider three types of
faults: peer crash, coordinator peer crash, and cell crash.
Figure 4.8 illustrates the fault handling processes in the presence of a fault in a cell.
When a non-coordinator peer crashes in a cell, shown in Figure 4.8a), the coordinator
peer issues a leavePeer request to the upper part of the tree (step 2), notifying it of the
departure of the crashed peer. After the acknowledgment from the parent peer has
been received (step 3), the coordinator peer notifies the active peers in the cell of the
crashed peer (steps 4 and 5).
Figure 4.8: Failure handling for non-coordinator (left) and coordinator (right) peers.
On the other hand, when a failure happens in the cell's coordinator peer, shown in
Figure 4.8b), one of the other peers in the cell takes its place as the new coordinator.
After detecting the failure of the coordinator (step 1), the peer that is next-in-line,
according to the order in which the peers entered the cell, succeeds it and becomes the new
coordinator. The coordinator of the parent cell also detects the crashed coordinator
peer, and sends a notification towards the root cell (steps omitted). The newly elected
coordinator peer sends a rebind request to the parent coordinator and waits for the
acknowledgment (steps 2 and 3), informing it that it is the new coordinator of the cell.
Furthermore, each active peer in the cell rebinds to the new coordinator, as will any
coordinator belonging to a child cell. These rebind requests are also sent towards
the root and fully acknowledged (steps 4 to 7).
As said, the coordinator peers from the child cells try to rebind to the parent's cell.
If there are no more peers in the parent's cell, then the cell has crashed and the coordinators
of the child cells have to contact the root node of the tree to request a new suitable
placement, that is, a new cell. At this point, it is possible for the child cells, and
their sub-trees, to migrate to their new location, effectively avoiding the costly rebinding
process that would arise from forcing every peer to individually rebind to the mesh.
Figure 4.9: Cell failure (left) and subsequent mesh tree rebinding (right).
Figure 4.9a) shows the instance when the coordinator peer crashes. Because it
was the only active peer in the cell, this resulted in a cell crash, as no more peers were
available in the cell. The reconfigured P2P network is shown in Figure 4.9b).
Algorithms 4.7 and 4.8 show the procedures that govern the fault-handling mechanism.
When a TCP/IP connection closes without a proper shutdown, the peer is assumed
to have crashed. Within a cell, the coordinator peer monitors all active peers, and
in turn, they monitor the coordinator peer. The Cell:onPeerFailureHandler()
procedure is called by the coordinator when any of the active peers has failed, or
it is called by all the active peers when the coordinator has failed. Furthermore,
when a parent coordinator detects that a child coordinator has failed, it calls
the Cell:onChildFailureHandler() procedure. On the other hand, every child
coordinator peer calls the Cell:onParentFailureHandler() procedure when it
detects that its parent coordinator has crashed.
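The failure-detection convention above can be sketched in a few lines of C++: a monitoring loop treats an end-of-stream or an error on the connection as a peer crash and triggers the failure handler. A local socketpair stands in for the TCP link; the handler name is only a placeholder for Cell:onPeerFailureHandler().

#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <functional>

// Watches one peer connection; any abnormal termination of the stream is
// interpreted as a crash of the remote peer.
void monitorPeer(int fd, const std::function<void()>& onPeerFailure) {
    char buf[4096];
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0) continue;             // regular traffic: peer is alive
        if (n == 0) puts("connection closed: assuming peer crash");
        else        perror("recv");      // reset or timeout: same assumption
        onPeerFailure();                 // e.g. Cell:onPeerFailureHandler()
        return;
    }
}

int main() {
    int fds[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
    close(fds[1]);                       // simulate the remote peer dying
    monitorPeer(fds[0], [] { puts("running recovery"); });
    return 0;
}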
When a peer crashes, there are two possible scenarios: the first is the crash
of a non-coordinator peer of the cell, shown in Figure 4.8a), and the second is the
crash of a coordinator peer, shown in Figure 4.8b). When a non-coordinator
peer crashes, the coordinator peer of that cell calls the Cell:leavePeer()
procedure at line 10 of the Cell:onPeerFailureHandler() procedure. It starts by
removing the information about the peer (line 39) and then sending the notification to
the parent coordinator peer and waiting for the acknowledgment (lines 40 to 42). After
the acknowledgment has been received, the coordinator synchronizes the cell by issuing a
Algorithm 4.7: Cell fault handling.
var: this // the current cell object
var: cellGroupObj // the cell communication group object

 1 procedure Cell:onPeerFailureHandler(peerInfo)
 2   if peerInfo.isCoordinator() then
 3     this.removePeerInfo(peerInfo)
 4     if this.isNewCoordinator() then
 5       this.rebindParentPeer(this.getParentInfo())
 6     else
 7       this.rebindCoordinatorPeer()
 8     end if
 9   else
10     this.leavePeer(peerInfo)
11   end if
12 end procedure

13 procedure Cell:onParentFailureHandler(peerInfo)
14   cellDiscoveryObj ← Cell:createCellDiscovery(peerInfo.getCellInfo())
15   newParentInfo ← cellDiscoveryObj.requestParent()
16   if newParentInfo ≠ ∅ then
17     this.rebindParentPeer(newParentInfo)
18   else
19     cellRootDiscoveryObj ← Cell:createRootCellDiscovery()
20     newParentInfo ← cellRootDiscoveryObj.requestParent()
21     this.rebindParentPeer(newParentInfo)
22   end if
23 end procedure

24 procedure Cell:onChildFailureHandler(peerInfo)
25   leavePeer(peerInfo)
26 end procedure
departure notification through the cell group communication infrastructure (line 43).
No additional recovery is necessary at this point.
When the crashed peer was coordinating the cell, then each active peer remaining in the
cell calls the Cell:onPeerFailureHandler() procedure (lines 2 to 9). They start by
removing the information about the crashed peer (line 3). The peer that is next-in-line
to succeed the coordinator peer calls the Cell:rebindParentPeer() procedure
(line 5). In turn, all the remaining active peers in that cell call the
Cell:rebindCoordinatorPeer() procedure (line 7) in order to connect to the new coordinator
peer. The Cell:rebindParentPeer() procedure starts by connecting to the parent
coordinator peer (line 28), and then issuing a rebind notification to it and waiting for
the acknowledgment (lines 29 to 31).
On the other hand, the Cell:rebindCoordinatorPeer() procedure starts by
Algorithm 4.8: Cell fault handling (continuation).
27 procedure Cell:rebindParentPeer(parentInfo)
28   this.connectToParentPeer(parentInfo)
29   rebindMsg ← Cell:createRebindMsg(this.getOurPeerInfo())
30   request ← this.getParentPeer().sendMessage(rebindMsg)
31   request.waitForCompletion()
32 end procedure

33 procedure Cell:rebindCoordinatorPeer()
34   this.connectToCoordinator(this.getCoordinatorInfo())
35   rebindMsg ← Cell:createRebindMsg(this.getOurPeerInfo())
36   cellGroupObj.sendMessage(rebindMsg)
37 end procedure

38 procedure Cell:leavePeer(peerInfo)
39   this.removePeerInfo(peerInfo)
40   leaveMsg ← Cell:createLeaveMsg(peerInfo)
41   request ← this.getParentPeer().sendMessage(leaveMsg)
42   request.waitForCompletion()
43   cellGroupObj.sendMessage(leaveMsg)
44 end procedure
connecting to the new coordinator peer (line 34), and then issuing a rebind notification to
the coordinator through the cell group communication infrastructure (lines 35 and 36).
At the same time, the parent coordinator peer and all the child coordinator peers also
detect that the coordinator peer has crashed. In the first case, the parent coordinator
peer, through the Cell:onChildFailureHandler() procedure, issues a notification
to the topmost portion of the tree informing it of the departure of the crashed peer
(followed by the synchronization within its own cell). This is accomplished through
the Cell:leavePeer() procedure (line 25). The child coordinators, upon detecting
the failure of their parent coordinator, call the Cell:onParentFailureHandler()
procedure. The procedure starts by trying to discover a new parent in the same cell as
the crashed coordinator (lines 14 and 15). If there is an active coordinator in that
cell, then the child coordinator rebinds by calling the Cell:rebindParentPeer()
procedure. If there is no such coordinator available, the child coordinator contacts
the root cell to ask for a new parent, and thus a new placement in the mesh, and
rebinds to it, also using the Cell:rebindParentPeer() procedure (lines 19 to 21).
4.1.3 Discovery Service
The Discovery service provides a generic infrastructure for locating resources in the
overlay, such as the location of service instances, whereas the previously described cell
discovery infrastructure only provides the mechanisms to locate peers within a cell.
Figure 4.10: Discovery service implementation.
The overlay Discovery service is shown in Figure 4.10. A user in peer A issues a query
through the Runtime Interface and Overlay Interface. The runtime of peer A first tries
to resolve it locally. If it is unable to resolve the query locally, then it must forward
the query to its parent coordinator, peer B. If peer B is unable to resolve the query,
then the request is forwarded to its parent coordinator, in this case peer C. If peer C
is unable to resolve the query, then a failure reply is sent downwards to the originating
peer.
Furthermore, the querying process can be generalized in the following manner. Upon the
reception of a discovery request, the runtime first tries to resolve it locally, in the peer,
and only when this is not possible does it propagate the request to the cell's coordinator. If
the cell's coordinator is also unable to reply to the request, the request is propagated
once more to its parent cell coordinator, and the process is repeated recursively until
a coordinator peer is able to reply. If this process reaches a point where there is no
parent coordinator available (the root node for the sub-tree), the process fails and a failure
reply is sent downwards to the originating peer.
Algorithm 4.9 shows the procedures that implement the behavior of the discovery
service. The discovery service allows the execution of both synchronous and asynchronous
queries. The procedure Discovery:executeQuery() performs synchronous queries.
The current implementation redirects the query to the root cell; this was done for the
sake of simplicity, but will be revised in the future.
Algorithm 4.9: Discovery service.
var: this // the current discovery service object
var: mesh // the mesh service

 1 procedure Discovery:executeQuery(query,qos)
 2   queryResult ← this.executeLocalQuery()
 3   if queryResult ≠ ∅ then
 4     return(queryResult)
 5   end if
 6   coordinatorUUID ← ∅
 7   if not mesh.getCell().isCoordinator() then
 8     coordinatorUUID ← mesh.getCell().getCoordinatorUUID()
 9   else
10     coordinatorUUID ← mesh.getCell().getParentUUID()
11   end if
12   if coordinatorUUID = ∅ then
13     return(∅)
14   end if
15   coordDiscoverySAP ← mesh.getDiscoveryInfo(coordinatorUUID)
16   coordDiscoveryClient ← this.createCoordinatorClient(coordDiscoverySAP,qos)
17   return(coordDiscoveryClient.executeQuery(query,qos))
18 end procedure

19 procedure Discovery:executeAsyncQuery(query,qos)
20   queryResult ← this.executeLocalQuery()
21   if queryResult ≠ ∅ then
22     future ← this.createFutureWithResult(queryResult)
23     return(future)
24   end if
25   coordinatorUUID ← ∅
26   if not mesh.getCell().isCoordinator() then
27     coordinatorUUID ← mesh.getCell().getCoordinatorUUID()
28   else
29     coordinatorUUID ← mesh.getCell().getParentUUID()
30   end if
31   if coordinatorUUID = ∅ then
32     future ← this.createFutureWithResult(∅)
33     return(future)
34   end if
35   coordDiscoverySAP ← mesh.getDiscoveryInfo(coordinatorUUID)
36   coordDiscoveryClient ← this.createCoordinatorClient(coordDiscoverySAP,qos)
37   return(coordDiscoveryClient.executeAsyncQuery(query,qos))
38 end procedure

39 procedure Discovery:handleQuery(peer,query,qos)
40   queryResult ← this.executeQuery(query,qos)
41   queryReplyMessage ← Discovery:createQueryReplyMessage(queryResult)
42   peer.sendMessage(queryReplyMessage)
43 end procedure
The procedure starts by trying to resolve the query locally and, if successful, returning
the result (lines 2 to 5). Otherwise, the query must be propagated throughout the
overlay. If the peer is not the coordinator of the cell, then the coordinator of the cell
will be used as the gateway for the propagation of the query. On the other hand, if the
peer is the coordinator of the cell, then the coordinator of the parent cell is used (lines
6 to 11). If no coordinator is available, then the query fails (lines 12 to 14).
Otherwise, the SAP information of the coordinator, which can be either the coordinator
of the current cell or the coordinator of the parent cell, is retrieved using the mesh
service (line 15), followed by the creation of a client to the Discovery service
of that coordinator (line 16). At line 17, we use the client to redirect the request to the
parent and return the result.
The procedure Discovery:executeAsyncQuery() provides the asynchronous version
of the querying primitive. It follows the same approach as the synchronous version,
with some slight differences. Instead of returning the result of the query, it returns a
future, which acts as a placeholder for the query result, notifying the owner when that
data is available. If the query can be resolved locally, then a future is created with the
query result and returned (lines 21 to 24). As with the synchronous querying, this
is followed by the retrieval of the UUID of either the coordinator of the cell, if the
peer is not the coordinator of the cell, or the coordinator of the parent cell. If no
coordinator is available, then the procedure fails and a token reflecting this failure is
created and returned (lines 26 to 34). Otherwise, a client to the coordinator is created
after the retrieval of the necessary information about the SAP of that coordinator.
Last, the procedure returns the future created by the asynchronous querying on the
coordinator's client (lines 35 to 37).
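The future mechanism can be approximated with standard C++ facilities, as in the hypothetical sketch below; executeLocalQuery() and the coordinator forwarding are stand-ins for the entities of Algorithm 4.9, not the middleware's actual API.

#include <future>
#include <iostream>
#include <optional>
#include <string>

// Stand-in for the local resolution step of Algorithm 4.9.
std::optional<std::string> executeLocalQuery(const std::string& query) {
    (void)query;
    return std::nullopt;  // assume the query cannot be resolved locally
}

std::future<std::optional<std::string>> executeAsyncQuery(std::string query) {
    if (auto local = executeLocalQuery(query)) {
        // Local hit: hand back an already-satisfied future.
        std::promise<std::optional<std::string>> p;
        p.set_value(local);
        return p.get_future();
    }
    // Miss: forward to the coordinator in the background; the returned
    // future is the placeholder the caller may block on or poll.
    return std::async(std::launch::async,
                      [query]() -> std::optional<std::string> {
        return "resolved-by-coordinator";  // would be the coordinator's reply
    });
}

int main() {
    auto fut = executeAsyncQuery("service:rpc");
    std::cout << fut.get().value_or("query failed") << "\n";
    return 0;
}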
The procedure Discovery:handleQuery() is the call-back that is executed to handle
the query requests from the follower peers of the cell, or from child peers that belong
to child cells. The Discovery:executeQuery() procedure, previously
described in Algorithm 4.9, is used to process an incoming query. If the query fails, a
failure message is created. If not, the query result is attached to a reply message. The
reply message is finally sent to the requesting peer.
4.1.4 Fault-Tolerance Service
Our FT infrastructure is based on replication groups. These groups can be defined
as a set of cooperating peers that have the common goal of providing reliability to
a high-level service. Previous work [3, 14] implemented FT support through a set of
high-level services that used the underlying primitives of the middleware. Our approach
(cf. Chapter 3) makes a fundamental shift to this principle, by embedding lightweight
FT support at the overlay layer.
The management of the replication group is self-contained, in the sense that the FT
service delegates all the logistics to the replication group. This allows further
extensibility of the replication infrastructure, and also allows different types of
replication strategies to co-exist inside the FT service.
The integration of FT in the overlay reduces the cross-layering overhead that is
associated with the use of high-level services. Furthermore, this approach also enables
the runtime to make replica placement decisions that are aware of the overlay
topology. This awareness allows a better trade-off between the target reliability and
resource usage. For example, placing replicas in different geographic locations leads to
better reliability, but can be limited by the availability of bandwidth over WAN links.
Figure 4.11: Fault-Tolerance service overview.
Figure 4.11 shows an overview of the FT service, more specifically, of the bootstrap
process of a replicated service. It starts with a peer, in this case referred to as the client,
requesting the creation of a replicated service from peer B. This request is delegated to
the mesh service. At this point, peer B receives the request and verifies whether it is able
to host the service. If enough resources are available to host the service, which will
act as the primary service instance, then the core requests the FT service to create a
replication group that will support the replication infrastructure for the service.
The FT service creates a new replication group object that will oversee the management
of the replication group, acting as its primary. Using the fault-tolerance parameters
that were passed by the core, the primary of the replication group finds the necessary
number of replicas across the overlay using the discovery service (this interaction is
omitted). After finding the suitable deployment peers, the primary sends requests to
the remote FT services to join the replication group, as replicas. Each remote peer
verifies if it has the necessary resources to host the replica, and if so, the core creates a
replication group object that will act as a replica in the replication group. This process
ends with the replica binding to the primary of the replication group.
Replication Group Management
The management of a replication group includes the creation and removal of replicas.
Furthermore, a replication group is also responsible for providing the fail-over mecha-
nisms that allow the recovery from faults that occur in participating peers.
Figure 4.12: Creation of a replication group.
Figure 4.12 illustrates the creation of a replication group with one replica. The process
starts with a user requesting the creation of a service with FT support (step 1). The
core of the runtime processes the request and creates a service instance that will act
as Primary service instance (step 2). If configured, the core will make the necessary
reservations by interacting with the QoS client. The core proceeds to create a replication
group that will provide fault-tolerance support to the service (step 3).
After creating the replication group object, and finding a suitable deployment site
(omitted), the core requests the addition of a replica to the newly created replication
group, through the fault-tolerance service (step 4). The handleFTMsg procedure is the
call-back that is responsible for handling these types of requests.
After receiving and accepting the request for the creation of a replica, the peer
denominated as Replica creates a service instance that will act as a replica to the primary
service instance (step 5). This is followed by the creation of a replication group object
that will act as a replica in the existing group. In order to complete the join to the
replication group, the replica issues a join request to the primary of the replication
group, which is maintained by the primary peer (steps 6-7).
Because the example given in Figure 4.12 only has one replica, there is no need to
advertise the arrival of a new replica. However, in the presence of a larger group, each
newly added replica has to be advertised within the replication group.
Figure 4.13: Replication group binding overview.
Figure 4.13 depicts the existing bindings within a replication group with multiple
replicas. The primary of the replication group, the peer that is managing the group
and is responsible for hosting the primary service, has active binds to all the replicas,
which are the peers that host a replica service.
The replicas are shown from left to right, denoting their order of entrance in the
replication group. If the primary fails, the leftmost replica is elected as the new primary.
Furthermore, each replica pre-binds to all the replicas that are placed on its right.
These pre-binds allow the monitoring of the neighboring peers for failures and reduce
the latency of the binding process.
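A minimal sketch of the bookkeeping implied by this layout is shown below: replicas are kept in join order, and on a primary failure the leftmost (oldest) one is promoted. The types and names are illustrative assumptions, not the middleware's API.

#include <deque>
#include <iostream>
#include <string>

// Join-order bookkeeping behind the fail-over rule: the leftmost (oldest)
// replica succeeds the primary.
struct ReplicationGroup {
    std::string primary;
    std::deque<std::string> replicas;  // ordered by entrance in the group

    void addReplica(const std::string& id) { replicas.push_back(id); }

    void onPrimaryFailure() {
        primary = replicas.front();    // next-in-line replica is elected
        replicas.pop_front();
        // The new primary would now notify the group (control SAP) and
        // let clients rebind to it.
    }
};

int main() {
    ReplicationGroup g{"peer-A", {}};
    g.addReplica("peer-B");            // joined first: next in line
    g.addReplica("peer-C");
    g.onPrimaryFailure();
    std::cout << "new primary: " << g.primary << "\n";  // peer-B
    return 0;
}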
Figure 4.14 shows the details of the process involved in the creation of a new replica.
Following a request by the primary for the creation of a new replica (shown in Figure 4.12,
step 4), the new replica joins the replication group (step 1).
When the primary adds a new replica to the group, it first starts by binding to it
(step 2). If this initialization is successful, the primary sends a message notifying the
remaining replicas that a new replica was added (step 3). Upon the arrival of this
message, each replica pre-binds to the new replica, and if this is done successfully, each
replica replies back to the primary with an acceptance message (steps 4-5). Otherwise,
a rejection message is sent back to the primary and the addition of the new replica is
aborted (omitted).
Figure 4.14: The addition of a new replica to the replication group.
Fault-Tolerance Algorithms
The fault-tolerance service handles three types of requests: the creation of a new
replication group, which is performed by the primary; the addition of a new replica to an
existing replication group, requested by the primary to a new replica; and the removal
of an existing replication group. The procedures FT:createReplicationGroup(),
FT:joinReplicationGroup() and FT:removeReplicationGroup() handle these
requests, respectively, and are shown in Algorithm 4.10.
When a service creation request is made locally or remotely, through the mesh service,
the core verifies if the necessary resources are available, and if so, creates a service
instance to be used by the replication group. Following this, the core creates the
replication group through the procedure FT:createReplicationGroup() (shown in
Figure 4.12, step 3). Acting on behalf of the core, the FT service creates the replication
group primary that will construct and manage the replication group.
This procedure takes as input the following parameters: svc, the service instance that
will act as the primary; params, the service parameters used in the creation of the
primary and replicas; and qos, a QoS broker to be used by the replication group. After
the replication group has been created (line 2), the output variable rgid is initialized
with the Replication Group Identifier (RGID) and the group is added to the group
manager (line 3) and bootstrapped (line 4).
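Assuming the group manager is essentially a map from RGIDs to replication group objects, it could be sketched as follows; the actual middleware types are not shown in the thesis, so everything here is illustrative.

#include <map>
#include <memory>
#include <mutex>
#include <string>

struct FTGroup { /* role (primary/replica), SAPs, replica list, ... */ };

// Thread-safe registry of replication groups, keyed by their RGID.
class GroupManager {
    std::mutex mtx;
    std::map<std::string, std::shared_ptr<FTGroup>> groups;
public:
    void addGroup(const std::string& rgid, std::shared_ptr<FTGroup> g) {
        std::lock_guard<std::mutex> lk(mtx);
        groups[rgid] = std::move(g);
    }
    std::shared_ptr<FTGroup> getGroup(const std::string& rgid) {
        std::lock_guard<std::mutex> lk(mtx);
        auto it = groups.find(rgid);
        return it == groups.end() ? nullptr : it->second;
    }
    void removeGroup(const std::string& rgid) {
        std::lock_guard<std::mutex> lk(mtx);
        groups.erase(rgid);
    }
};

int main() {
    GroupManager gm;
    gm.addGroup("rgid-1", std::make_shared<FTGroup>());
    gm.removeGroup("rgid-1");
    return 0;
}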
The fault-tolerance requests are handled by the FT:handleFTMsg() procedure. Upon
the reception of a request to host a new replica (lines 18-22), the FT service redirects the
request to the core of the runtime, by calling the joinReplicationGroup() procedure
of the Core Interface (line 20). The core of the runtime first verifies the availability of
Algorithm 4.10: Creation and joining within a replication group
var: this // the current FT service
var: ftGroupObj // the replication communication group
var: groupManager // the FT replication group manager

 1 procedure FT:createReplicationGroup(svc,params,rgid,qos)
 2   ftGroupObj ← this.createPrimaryFTGroupObj(svc,params,rgid,qos)
 3   groupManager.addGroup(ftGroupObj)
 4   ftGroupObj.start()
 5 end procedure

 6 procedure FT:joinReplicationGroup(svc,params,rgid,primary,replicas,qos)
 7   ftGroupObj ← this.createReplicaFTGroupObj(svc,params,rgid,primary,replicas,qos)
 8   groupManager.addGroup(ftGroupObj)
 9   ftGroupObj.start()
10 end procedure

11 procedure FT:removeReplicationGroup(rgid)
12   ftGroupObj ← groupManager.getGroup(rgid)
13   ftGroupObj.stop()
14   groupManager.removeGroup(ftGroupObj)
15 end procedure

16 procedure FT:handleFTMsg(peer,msg)
17   switch(msg.getType())
18     case(JoinFTGroup)
19       (rgid,sid,params) ← msg.getReplicaInfo()
20       getCoreInterface().joinReplicationGroup(primary,replicas,rgid,sid,params)
21       peer.sendMessage(FT:createAckMessage(msg))
22     end case
23     case(RemoveFTGroup)
24       ftGroupObj ← groupManager.getGroup(rgid)
25       ftGroupObj.stop()
26       groupManager.removeGroup(ftGroupObj)
27     end case
28   end switch
29 end procedure
resources to run the replica, and if they are available, it requests the FT service to
join the replication group. This is implemented by the FT:joinReplicationGroup()
procedure (shown in Figure 4.12, step 6), which takes as input the following parameters:
svc, the service instance that will act as a replica; params, the service parameters
used in the creation of the primary and replicas; qos, a QoS broker to be used by
the replication group object; rgid, the RGID of the replication group; the primary
parameter, which holds the primary info; and the replicas parameter, which holds the
current replicas info.
Replication Group Algorithms
The replication group is the core of the replication infrastructure. It enforces the
behavior that was requested in the creation of the replicated service, such as the number
of replicas or replication policy.
Algorithm 4.11: Primary bootstrap within a replication group
var: this // the local instance of the replication group
var: ft // the fault-tolerance service
var: rgControlGroup // the replication control group

 1 procedure FTGroup:startPrimary()
 2   this.openSAPs()
 3   (sid,params) ← this.getServiceInfo()
 4   nbrOfReplicas ← params.getFTParams().getReplicaCount()
 5   deployPeers ← ft.findResources(sid,params,nbrOfReplicas)
 6   for peer in deployPeers do
 7     replica ← this.createReplicaObject(peer)
 8     rgControlGroup.addReplica(replica.getInfo())
 9     this.addToReplicaList(replica)
10   end for
11   this.getService().setReplicationGroup(this)
12 end procedure
Algorithm 4.11 details the initialization procedure of a primary within a replication
group. The FTGroup:startPrimary() procedure shows the bootstrap sequence of a
primary. It starts by initializing two distinct access points, one for data and the other
for control (line 2). This separation was made to prevent the multiplexing of control and
data requests, which could lead to priority inversion or increased latency in the processing
of requests. More specifically, the control SAP is used to manage the organization of
the replication group, such as addition and removal of replicas and election of a new
primary, while the data SAP is used to implement the “actual” FT protocol.
Figure 4.15 illustrates the control and data communication groups. The dashed lines
represent pre-binds that are made to minimize recovery time. When the primary of a
replication group fails, the necessary TCP/IP connections are already in place, so when
the replica that is next-in-line becomes the new primary, it can immediately recover the
replication group.
After this initial setup, the primary calls the FT:findResources() procedure (shown
in Algorithm 4.12) to search for suitable deployment sites on which to create the replicas.
The total number of replicas is enclosed within the fault-tolerance parameters, which in
turn belong to the service parameters (lines 3-4).
Figure 4.15: The control and data communication groups.
After retrieving the list of suitable deployment sites at line 5, the primary creates
and binds each replica (line 7). Each newly added replica is synchronized with the
existing replicas in the replication group, using the control group infrastructure (line 8).
Subsequently, the new replica is added to the replica list (line 9). Last, the replication
group is attached to the service instance, allowing the service to access the underlying
FT infrastructure (line 11). If any of the previously mentioned operations fails, the
whole bootstrap process fails.
Algorithm 4.12: Fault-Tolerance resource discovery mechanism.
var: this // the current FT service object
var: discovery // the discovery service
var: mesh // the mesh service

 1 procedure FT:findResources(sid,params,nbrOfReplicas)
 2   peerList ← ∅
 3   for i ← 1, i < nbrOfReplicas do
 4     filterList ← peerList
 5     query ← this.createPoLQuery(mesh.getUUID(),sid,filterList)
 6     queryReply ← discovery.executeQuery(query)
 7     peerList.add(queryReply.getPeerInfo())
 8   end for
 9   return(peerList)
10 end procedure
In order to bootstrap a replica, a suitable place must be found. Algorithm 4.12
shows the details of the mechanism that is responsible for finding suitable peers to host
new replicas. The process is exposed by the FT:findResources() procedure. This
procedure returns a list containing the peers, found across the overlay, that are able
to host a replica. To prevent the duplication of replicas on the same runtime, a filter list
is added to each query. The initialization of this list is performed at line 4, and it is
updated every time a query is performed, avoiding the duplication of peers. The actual
query is created at line 5, through the use of the FT:createPoLQuery() procedure.
The short name PoL stands for Place of Deployment, and refers to the runtime where
a service, or in this case the replica, will be launched. At this point, the FT service uses
the discovery service to perform the query (line 6), adding the reply to the peer list
(line 7) in case of success. If this querying fails, the FT:findResources() procedure fails.
Algorithm 4.13: Replica startup.
var: this // the local instance of the replication group object

 1 procedure FTGroup:startReplica()
 2   FTGroup:openSAPs()
 3   FTGroup:getService().setReplicationGroup(FTGroup:this)
 4 end procedure
The startup of a replica is detailed in the FTGroup:startReplica() procedure in
Algorithm 4.13. The replica starts by opening the control and data access points. This
enables the primary of the group to bind to the replica (shown in Algorithm 4.11). Last,
the replication group is attached to the replica service (line 3).
Algorithm 4.14: Replica request handling
var: this // the local instance of the replication group object

 1 procedure FTGroup:replicaHandleControlMsg(primaryPeer,msg)
 2   switch(msg.getType())
 3     case(AddReplica)
 4       replicaInfo ← msg.getReplicaInfo()
 5       replica ← this.prebindControlAndDataToReplica(replicaInfo)
 6       this.addToReplicaList(replica)
 7       ackMessage ← FTGroup:createAckMessage(msg)
 8       primaryPeer.sendMessage(ackMessage)
 9     end case
10     case(RemoveReplica)
11       replicaInfo ← msg.getReplicaInfo()
12       this.removeFromReplicaList(replicaInfo)
13       ackMessage ← FTGroup:createAckMessage(msg)
14       primaryPeer.sendMessage(ackMessage)
15     end case
16   end switch
17 end procedure
Algorithm 4.14 shows the FTGroup:replicaHandleControlMsg() call-back that is
responsible for handling control requests in a peer that is acting as a replica within
a replication group. The notification messages sent by the primary that announce the
arrival of new replicas to the replication group are handled in lines 3-9. Upon receiving
the request, each replica pre-binds to the new replica (line 5) and adds it to the replica
list (line 6). This ends with a reply message being sent to the primary peer (lines 7-8).
The removal of a replica from the replication group is handled in lines 10-15. When
removing the replica from the list (line 12), all associated pre-binds (control and data)
are closed. The process ends with an acknowledgment being sent to the primary peer.
Support for the Replication Protocol
Our current implementation only supports semi-active replication [44]. In this type of
replication, the primary instance of the service, after receiving and processing a request
from a client, replicates the new state across all the active replicas. As soon as the
replication ends, an acknowledgment is sent back to the client.
Figure 4.16: Semi-active replication protocol layout.
Figure 4.16 illustrates the implementation of the semi-active replication policy. When
the primary service instance wants to replicate its state, it uses the replicate()
procedure within the replication group (step 1). The replication group then uses the
data group to synchronize the new state among the replicas (step 2). Each replica
handles the replication request through the replicaHandleDataMsg() procedure,
which takes the replication data and calls the onReplication() procedure (step 3).
The service, after synchronizing to the new state, issues an acknowledgment through
the replication group (step 4).
The actual replication protocol support is detailed in Algorithm 4.15. When a primary
service needs to synchronize some data, which can be individual actions, such as RPC
invocations, or state transfers (partial or complete), it uses the FTGroup:replicate()
procedure. The underlying replication group, depending on its policy, synchronizes the
replication data with all the replicas. For example, if the replication group is configured
to use semi-active replication, then when the FTGroup:replicate() procedure is
called (by the primary), the group immediately spreads the data. Alternatively, if
passive replication were in place, the replication group would buffer the data until the
next synchronization period expires. When the period expires, the replication group
synchronizes the data.
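The policy difference described above can be captured by two small strategies, sketched below in C++ under assumed interfaces: semi-active forwards each buffer immediately, while passive accumulates buffers and flushes them once the synchronization period expires.

#include <chrono>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;
using SendFn = std::function<void(const std::string&)>;  // data-group send

// Semi-active: every update is spread to the replicas at once.
struct SemiActivePolicy {
    SendFn send;
    void replicate(const std::string& buf) { send(buf); }
};

// Passive: updates are buffered and flushed when the period expires.
struct PassivePolicy {
    SendFn send;
    std::chrono::milliseconds period{100};  // assumed synchronization period
    Clock::time_point last = Clock::now();
    std::vector<std::string> pending;

    void replicate(const std::string& buf) {
        pending.push_back(buf);
        if (Clock::now() - last >= period) {
            for (const auto& b : pending) send(b);
            pending.clear();
            last = Clock::now();
        }
    }
};

int main() {
    auto send = [](const std::string& b) { std::cout << "sync: " << b << "\n"; };
    SemiActivePolicy sa{send};
    sa.replicate("update-1");   // sent immediately
    PassivePolicy p{send};
    p.replicate("update-2");    // buffered; flushed once the period expires
    return 0;
}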
Each replica executes the FTGroup:replicaHandleDataMsg() call-back to handle
the arrival of replication data. Upon arrival, the replication data is sent to the replica
service instance to be processed.
Algorithm 4.15: Support for semi-active replication.
var: this // the local instance of the replication group object
var: rgDataGroup // the replication data group

 1 procedure FTGroup:replicate(buffer)
 2   rgDataGroup.replicate(buffer)
 3 end procedure

 4 procedure FTGroup:replicaHandleDataMsg(primaryPeer,msg)
 5   switch(msg.getType())
 6     case(Replication)
 7       buffer ← msg.getBuffer()
 8       replicationAckMsg ← this.getService().onReplication(buffer)
 9       primaryPeer.sendMessage(replicationAckMsg)
10     end case
11   end switch
12 end procedure
Fault Detection and Recovery in Replication Groups
The fault detection and recovery mechanisms within a replication group are
implementation dependent. Figure 4.17 illustrates the recovery process within our current
implementation. After detecting the failure of the primary (step 1), the replica that
is next-in-line to become the new primary assumes the leadership of the replication
group by sending a notification to all active replicas, informing them that it has assumed
the coordination (step 2). Next, the new primary notifies its service instance, which was
acting as a replica instance, that it became the primary service instance (step 3). At
this point, the primary node updates the information about the service, allowing any
existing client to retrieve this information and rebind to the new primary. This is
accomplished through the use of the changeIIDOfService() procedure of the Core
Interface. For the sake of simplicity, we omit the additional steps required to perform
this update in the mesh.
Figure 4.17: Recovery process within a replication group.
Algorithm 4.16 details the detection and recovery call-backs that are used by the
participants of the replication group. The procedure FTGroup:onPeerFailureHandler()
is called when a bind or a pre-bind is closed, that is, when a peer has crashed. If the
failing peer was the current primary of the group (line 2), then the next leftmost replica
(line 3), the oldest replica in the group, is elected leader. If the executing peer is the new
primary (line 4), then it must notify the service instance that it became the primary
(line 5). The new primary sends a notification to all the active replicas informing them
that it is ready to continue with the replication policy (line 6). This is followed by an
update containing the information about the new primary (lines 7 to 8). However, if the
faulty peer was not the primary, then it is just a matter of removing the binding
information associated with the crashed peer (line 11).
4.2 Implementation of Services
EFACEC operates on several domains, including information systems used to manage
public high-speed transportation networks, robotics and smart (energy) grids. Despite
their differences, these systems have many common requirements and problems, such
as: the need to transfer large sets of data; intermittent network activity, that can lead to
data bursts; are exposure to common hardware failures, that can vary in time, ranging
from short (for example, network reconfiguration raised from a link failure) to extended
outages, such as fires, and; require low jitter and low latency for safety reasons, such
as vehicle coordination. The pursuit of these characteristics puts a tremendous stress
Algorithm 4.16: Fault detection and recovery
var: this // the local instance of the replication group object
var: mesh // the mesh service
var: rgControlGroup // the replication control group
var: service // the replicated service
var: replicas // the replica list
var: rgid // the replication group UUID

 1 procedure FTGroup:onPeerFailureHandler(peerID)
 2   if this.isPeerPrimary(peerID) then
 3     primaryPeer ← replicas.pop()
 4     if primaryPeer.getUUID() = mesh.getUUID() then
 5       this.fireOnChangeToPrimary()
 6       rgControlGroup.sendNewPrimaryInfo()
 7       iid ← service.getIID()
 8       this.getCoreInterface().changeIIDOfService(sid,iid,rgid)
 9     end if
10   else
11     replicas.remove(peerID)
12   end if
13 end procedure

14 procedure FTGroup:fireOnChangeToPrimary()
15   serviceChangeStatus ← service.changeToPrimaryRole()
16   return(serviceChangeStatus)
17 end procedure
on both software and hardware infrastructures, and particularly on the management
middleware platform.
Our middleware architecture is able to support different types of services. To showcase
some possible implementations, we present three distinct services: 1) RPC, the classical
remote procedure call service; 2) Actuator, which allows the execution of commands on
a set of sensors; and 3) Streaming, which allows data streaming from a sensor to a
client. The RPC service is a standard in every middleware platform, whereas both the
Actuator and Streaming services were designed to resemble current systems for public
information management that were deployed in the Dublin and Tenerife metropolitan
infrastructures. These services will form the basis for the evaluation of the middleware
to be presented in Chapter 5.
4.2.1 Remote Procedure Call
The RPC service, depicted in Figure 4.18, allows the execution of a procedure in a
foreign address space, relieving the programmer of the burden of coding the remote
interactions. The service uses fault-tolerance in the common way, with the primary
being the main service site, updating all the replicas that belong to the replication group
according to the group’s replication policy. The current implementation only supports
semi-active [44] replication, where the primary updates all replicas upon the reception
of a new invocation, and only replies to the client when all the replicas acknowledge the
update. On the other hand, if the RPC service is bootstrapped without fault-tolerance,
then the service executes a client invocation and replies immediately, as no replication
is involved. Figure 4.18 shows the RPC service deployed with two replicas across the
overlay.
Figure 4.18: RPC service layout.
The RPC service is divided into two layers. The topmost level contains the user-defined
objects, referred to as servers. The servers are the building blocks of the RPC service,
providing object-oriented semantics similar to CORBA. For now, they are statically
linked, at compile time, to the RPC service; we have plans to expand this in the
future. The bottommost level contains the server manager, also
known as the service adapter, which is responsible for managing these user objects. The main
functions of the server adapter include the registration and removal of objects, and the
retrieval of the proper object to handle an incoming invocation.
In order to fully support object semantics, RPC has two distinct invocation types,
one-way and two-way invocations. One-way invocations do not return a value to the
client. Two-way invocations return a value back to the client that is dependent on the
particular operation.
Figures 4.19a and 4.19b show the interaction between a client and the service while
performing one-way and two-way invocations, respectively. After receiving an invocation from a Service
(a) RPC one-way invocation. (b) RPC two-way invocation.
Figure 4.19: RPC invocation types.
Access Point (SAP), through the handleRPCServiceMsg call-back, the server adapter
redirects the request to the target object (server) that performs the call to the requested
method. If it is a one-way invocation then the server only has to call the target method
using the input arguments (handled by the handleOneWayInvocation() method).
Otherwise, the server invokes the method, also using the input arguments, and sends
back the output values to the invoker (the handleTwoWayInvocation() procedure
handles this case).
Listing 4.1: An RPC IDL example.
1 interface Counter {
2   void increment();
3   int sum(int num);
4 };
Listing 4.1 shows the IDL definition for a simple server that provides two basic op-
erations over a counter variable. The one-way Counter:increment() procedure in-
crements the counter by one, whereas the two-way Counter:sum() procedure adds a
given number to the counter variable and returns the new total.
Algorithm 4.17 shows an implementation of the Counter server, normally
referred to as an RPC skeleton. The Counter:handleOneWayInvocation() procedure
handles one-way invocations. It starts by performing a look-up that checks if the
requested procedure exists in the object (error handling is omitted), followed by the
call to the target procedure (lines 3 to 5). The only available one-way procedure
is Counter:increment(), which performs the increment over sumTotal, the counter
variable (lines 18 to 20).
Algorithm 4.17: An RPC object implementation.
var: this // the current RPC server object
constant: PROC INCREMENT PID // one-way INCREMENT procedure identification
constant: PROC SUM PID // two-way SUM procedure identification
constant: COUNTER OID // the object identification
var: sumTotal // the accumulator variable

 1 procedure Counter:handleOneWayInvocation(pid,args)
 2   switch(pid)
 3     case(PROC INCREMENT PID)
 4       this.increment()
 5     end case
 6   end switch
 7 end procedure

 8 procedure Counter:handleTwoWayInvocation(pid,args)
 9   switch(pid)
10     case(PROC SUM PID)
11       num ← RPCSerialization:unmarshall(INT TYPE,args)
12       result ← this.sum(num)
13       output ← RPCSerialization:marshall(INT TYPE,result)
14       return output
15     end case
16   end switch
17 end procedure

18 procedure Counter:increment()
19   sumTotal ← sumTotal + 1
20 end procedure

21 procedure Counter:sum(num)
22   sumTotal ← sumTotal + num
23   return sumTotal
24 end procedure

25 procedure Counter:getOID()
26   return COUNTER OID
27 end procedure

28 procedure Counter:getState()
29   state ← RPCSerialization:marshall(INT TYPE,sumTotal)
30   return state
31 end procedure

32 procedure Counter:setState(state)
33   sumTotal ← RPCSerialization:unmarshall(INT TYPE,state)
34 end procedure
On the other hand, the Counter:handleTwoWayInvocation() procedure handles the
two-way invocations. It also checks if the requested procedure exists and then performs
the two-way invocation (lines 10 to 15). The Counter:sum() procedure has one
input variable that has to be unmarshalled from the arguments (args) serialization
buffer (line 11). This is followed by a call to the Counter:sum() procedure using the
unmarshalled argument num (line 12). The result of the call is then
marshalled into the serialization buffer output (line 13) and returned (line 14).
The Counter:getOID() function returns the Object Identifier (OID) of the object;
in this example it returns the COUNTER OID constant. The state of the
Counter object is returned by the Counter:getState() procedure. In this
implementation it returns the complete state, and for this it only has to marshall sumTotal
into a serialization buffer and return it. The counterpart of this procedure, the
Counter:setState() procedure, performs the opposite action: it takes a serialization
buffer containing the state, unmarshalls it, and updates the local object.
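A possible minimal rendering of the RPCSerialization marshall/unmarshall pair for integer state is sketched below in C++; the wire format (here, raw host-order bytes) is an assumption, as the thesis does not fix it.

#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical RPCSerialization for 32-bit integers: raw host-order bytes.
std::vector<std::uint8_t> marshallInt(std::int32_t value) {
    std::vector<std::uint8_t> buf(sizeof(value));
    std::memcpy(buf.data(), &value, sizeof(value));
    return buf;
}

std::int32_t unmarshallInt(const std::vector<std::uint8_t>& buf) {
    std::int32_t value = 0;
    std::memcpy(&value, buf.data(), sizeof(value));
    return value;
}

int main() {
    // Round-trip the Counter state, mirroring getState()/setState().
    std::int32_t sumTotal = 42;
    auto state = marshallInt(sumTotal);
    assert(unmarshallInt(state) == 42);
    return 0;
}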
Algorithm 4.18: RPC service bootstrap.
 1 procedure RPCService:open()
 2   hrt ← createQoSEndpoint(HRT, MAX RT PRIO)
 3   srt ← createQoSEndpoint(SRT, MED RT PRIO)
 4   be ← createQoSEndpoint(BE, BE PRIO)
 5   sapQoSList ← {hrt,srt,be}
 6   serviceSAPs ← createRPCSAPs(sapQoSList)
 7   serviceSAPs.open()
 8 end procedure
The RPC service is responsible for performing the invocations and managing the
objects. We start by presenting its bootstrap sequence. Algorithm 4.18 shows the
opening sequence of the RPC service, exposed by the RPCService:open() procedure.
Lines 2 to 5 show the creation of the list containing the QoS endpoint properties.
This is followed by the creation of the SAPs and their respective bootstrap (lines 6 to
7). The information characterizing the SAPs is associated with the IID of the RPC
service by the runtime, so when a client resolves a service identifier it also retrieves the
associated SAP information.
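The thesis leaves the endpoint mechanism abstract; one plausible mapping on Linux is to tag each SAP socket with a distinct SO_PRIORITY, as in the C++ sketch below. The QoS class names mirror the algorithm, but the numeric priorities and the mechanism itself are assumptions.

#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

enum class QoSClass { HRT, SRT, BE };

// Opens a socket for a SAP and tags it with a priority derived from its
// QoS class; priorities 0-6 do not require extra privileges on Linux.
int openQoSEndpoint(QoSClass c) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return -1; }
    int prio = (c == QoSClass::HRT) ? 6
             : (c == QoSClass::SRT) ? 4
             : 0;                            // best-effort
    if (setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio)) < 0)
        perror("setsockopt(SO_PRIORITY)");   // Linux-specific option
    return fd;
}

int main() {
    int hrt = openQoSEndpoint(QoSClass::HRT);
    int srt = openQoSEndpoint(QoSClass::SRT);
    int be  = openQoSEndpoint(QoSClass::BE);
    close(hrt); close(srt); close(be);
    return 0;
}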
Algorithm 4.19 details the most relevant aspects of the RPC implementation. The
procedure RPCService:handleRPCServiceMsg() is the call-back that handles all
incoming invocations (issued by the lower-level SAP infrastructure). The procedure takes
as input two arguments: channel, the TCP/IP channel used to support the invocation,
and invocation, which contains all the information relevant to the invocation.
The invocation argument is decomposed into five separate variables (line 2): iid,
the invocation identification that is used in the reply to the client; type, the
type of invocation (one-way or two-way); oid, the object/server identification; pid,
the identification of the procedure to be invoked; and args, the arguments to be used in the
Algorithm 4.19: RPC service implementation.
 1 procedure RPCService:handleRPCServiceMsg(channel,invocation)
 2   (iid,type,oid,pid,args) ← invocation
 3   output ← handleInvocation(type,oid,pid,args)
 4   if RPCService:isFTEnabled() then
 5     RPCService:getReplicationGroup().replicate(getState())
 6   end if
 7   if type = TwoWay then
 8     channel.replyInvocation(iid,output)
 9   end if
10 end procedure

11 procedure RPCService:handleInvocation(type,oid,pid,args)
12   rpcObject ← getRPCObject(oid)
13   switch(type)
14     case(OneWay)
15       rpcObject.handleOneWayInvocation(pid,args)
16       return ∅
17     end case
18     case(TwoWay)
19       return rpcObject.handleTwoWayInvocation(pid,args)
20     end case
21   end switch
22 end procedure
invocation.
The actual invocation is delegated to the RPCService:handleInvocation()
procedure (lines 11-22). After retrieving the object associated with the invocation (line 12),
the procedure checks the type of the invocation and performs the corresponding action.
If it is a one-way invocation, then it simply delegates it to the object to perform the
invocation (lines 14-17). If it is a two-way invocation, then the results of the operation are
returned back to the RPCService:handleRPCServiceMsg() procedure (lines 18-20).
After the invocation, and if the RPC service was bootstrapped with fault-tolerance (lines
4-6), the state of the RPC service is synchronized across the replica set by the replication
group infrastructure (line 5). If the invocation returns an output value (two-way
invocations), it is then sent back to the client (line 8).
The creation of an RPC client was already described in Chapter 3, more specifically
in Listing 3.5. The bootstrap and invocation procedures of the RPC client are shown
in Algorithm 4.20. The bootstrap sequence of the RPC client is implemented within
the RPCServiceClient:open() procedure, which takes as input parameters: the sid
of the RPC service; the iid of the instance that the client will bind to; and the
client parameters. The initial step is to retrieve the information associated with the
Algorithm 4.20: RPC client implementation.
var: this // the current RPC client object
var: channel // the low level connection object

 1 procedure RPCServiceClient:open(sid,iid,clientParams)
 2   queryInstanceInfoQuery ← this.createFindInstanceQuery(sid,iid)
 3   discovery ← this.getRuntime().getOverlayInterface().getDiscovery()
 4   queryInstanceInfo ← discovery.executeQuery(queryInstanceInfoQuery)
 5   channel ← this.createRPCChannel(queryInstanceInfo.getSAPs(),clientParams)
 6 end procedure

 7 procedure RPCServiceClient:twoWayInvocation(oid,pid,args)
 8   return(channel.twoWayInvocation(oid,pid,args))
 9 end procedure

10 procedure RPCServiceClient:oneWayInvocation(oid,pid,args)
11   channel.oneWayInvocation(oid,pid,args)
12 end procedure
service instance (lines 1-4). It first starts by creating the query message, through the
RPCServiceClient:createFindInstanceQuery() procedure, using the sid and
iid arguments (line 2). This is followed by the retrieval of a reference to the discovery
service (line 3), which is necessary to execute the query (line 4). This process ends with
the creation of the network channel using the query reply, which contains the information
about the available access points, and the selected level of QoS enclosed within the
client parameters (line 5).
The RPCServiceClient:twoWayInvocation() procedure (lines 7-9) is used to
perform two-way invocations, while the RPCServiceClient:oneWayInvocation()
procedure (lines 10-12) handles one-way invocations. Both use the RPC network
channel to perform the low-level remote invocation, that is, creating the packet
and sending it through the network channel. Contrary to its one-way counterpart, the
two-way operation must wait for the reply packet before returning to the caller.
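A common way to implement this wait, sketched below with standard C++ primitives, is a pending-request table keyed by the invocation identifier (iid): the caller blocks on a future and the channel's reader thread fulfills the matching promise when the reply packet arrives. All names are illustrative, not the middleware's API.

#include <chrono>
#include <cstdint>
#include <future>
#include <iostream>
#include <map>
#include <mutex>
#include <string>
#include <thread>

class RPCChannel {
    std::mutex mtx;
    std::map<std::uint64_t, std::promise<std::string>> pending;  // iid -> slot
    std::uint64_t nextIid = 1;
public:
    // Two-way invocation: register the request, (notionally) send the packet,
    // then block until the reader thread delivers the reply.
    std::string twoWayInvocation(const std::string& payload) {
        std::future<std::string> reply;
        {
            std::lock_guard<std::mutex> lk(mtx);
            std::uint64_t iid = nextIid++;
            reply = pending[iid].get_future();
            // sendPacket(iid, payload) would go on the wire here.
        }
        (void)payload;
        return reply.get();
    }
    // Called by the channel's reader thread for each reply packet.
    void onReply(std::uint64_t iid, std::string result) {
        std::lock_guard<std::mutex> lk(mtx);
        auto it = pending.find(iid);
        if (it != pending.end()) {
            it->second.set_value(std::move(result));
            pending.erase(it);
        }
    }
};

int main() {
    RPCChannel ch;
    auto fut = std::async(std::launch::async,
                          [&] { return ch.twoWayInvocation("sum(41)"); });
    std::this_thread::sleep_for(std::chrono::milliseconds(50)); // let it register
    ch.onReply(1, "sum=42");   // simulate the reply packet for iid 1
    std::cout << fut.get() << "\n";
    return 0;
}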
Semi-Active Fault-Tolerance Support
The middleware offers an extensible fault-tolerance infrastructure that is able to ac-
commodate different types of replication policies.
Figure 4.20 depicts the currently implemented fault-tolerance policy in the overlay.
Figure 4.20a) shows the RPC service without FT support. In this case, upon the reception
of an invocation, the RPC service executes the invocation and replies immediately to
the client, as no replication is to be performed.
Figure 4.20b) shows the RPC with semi-active fault-tolerance enabled. The primary
Figure 4.20: RPC service architecture without (left) and with (right) semi-active FT.
node, upon reception of an invocation (step 1), uses the replication group to update all
the replicas (steps 2 and 3). After the replication is completed, that is, when all the
acknowledgments have been received by the primary node (steps 4 and 5), it sends the
result of the invocation back to the RPC client (step 6).
Algorithm 4.21: Semi-active replication implementation.
 1 procedure SemiActiveReplicationGroup:replicate(replicationObject)
 2   if IsPrimary() then
 3     replicationRequestList ← ∅
 4     for replica in replicaGroup do
 5       replicationRequest ← replica.sendMessage(replicationObject)
 6       replicationRequestList.add(replicationRequest)
 7     end for
 8     replicationRequestList.waitForCompletion()
 9   end if
10 end procedure
Algorithm 4.21 shows the algorithm used to implement semi-active replication.
This procedure is only called on the primary peer of the replication group. After
receiving a replication object, the replicate procedure sends a replication message
to all the replicas that are present in the replication group (the acknowledgments were
omitted for clarity).
Algorithm 4.22 shows the RPCService:onReplication call-back that is used by the
Algorithm 4.22: Service’s replication callback.
var: this // the current RPC service object
1  procedure RPCService:onReplication(replicationObject)
2      switch(replicationObject.getType())
3          case(State)
4              this.setState(replicationObject)
5              return ∅
6          end case
7          case(Invocation)
8              (iid,type,oid,pid,args) ← replicationObject
9              return this.handleInvocation(iid,type,oid,pid,args)
10         end case
11     end switch
12 end procedure

13 procedure RPCService:setState(replicationObject)
14     (oid,state) ← replicationObject
15     rpcObject ← this.getRPCObject(oid)
16     rpcObject.setState(state)
17 end procedure
replication group to perform the state update. In the current implementation, we
perform replication by synchronizing the state of the RPC service among the members
of the replication group (lines 3-6). The RPCService:setState() procedure retrieves
the object identification and state serialization buffer from the replicationObject
variable (line 14). This is followed by the look-up of the target object (line 15), which
is then used to update the state of the object (line 16). Our RPC implementation can
be further extended to support replication based on the execution of the invocations.
We present a possible implementation in lines 7 to 10.
However, this implementation is only valid for single-threaded object implementations
without non-deterministic code, such as calls to the gettimeofday system call. The
presence of multiple threads in a replica can alter the sequence of state updates, as
the thread scheduling is controlled by the underlying operating system, and can lead
to inconsistent states. Likewise, non-deterministic code in the server's implementation
can lead to inconsistent states if the replication is based on the re-execution of the
invocations by each replica. For example, if a server implementation uses the
gettimeofday system call, then this call will return a different value on each replica,
leading to inconsistent state. Several techniques have
been proposed to address these problems [14, 129, 130].
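As an illustration of one such technique, and only as a hedged sketch rather than the approach prescribed by our middleware, the primary can resolve the non-deterministic call once and embed the resolved value in the replication object, so that re-execution on each replica consumes the primary's timestamp instead of calling gettimeofday again. The names below are illustrative.

// Sketch: fix a non-deterministic input on the primary and replicate it.
#include <sys/time.h>
#include <cstdint>

struct ReplicatedInvocation {
    uint64_t resolved_time_us;  // non-deterministic input, fixed by the primary
    int arg;
};

uint64_t now_us() {
    timeval tv;
    gettimeofday(&tv, nullptr);
    return uint64_t(tv.tv_sec) * 1000000 + tv.tv_usec;
}

// On the primary: capture the value once and embed it in the invocation.
ReplicatedInvocation make_invocation(int arg) {
    return ReplicatedInvocation{now_us(), arg};
}

// On every replica (and on the primary itself): use the recorded value
// instead of calling gettimeofday() again, keeping the state identical.
uint64_t execute(const ReplicatedInvocation& inv, uint64_t& state) {
    state += inv.resolved_time_us + inv.arg;  // deterministic given inv
    return state;
}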
Fault-Tolerance Infrastructure Extensibility
To illustrate the extensibility of our fault-tolerance infrastructure, we provide the
algorithms necessary to implement passive replication. An overview of the architecture
of both policies is shown in Figures 4.20 and 4.21, respectively.
Passive Replication
Figure 4.21: RPC service with passive replication.
Passive replication [75] is interesting from the point of view of RT integration because
it is associated with lower latency and lower resource requirements, such as CPU, as
shown in our previous work [6]. However, this is only feasible through the relaxation
of the state consistency among the replication group members. This is accomplished
by avoiding immediate replication, as performed in semi-active replication. Instead,
after receiving an invocation (step 1), the replication data is buffered and periodically
sent to the replicas (step 2). Because the primary node does not need to wait for the
acknowledgments, it can immediately reply with the result of the invocation to the RPC
client (step 3). Each replica periodically receives the updates (step 4), processes and
acknowledges them back to the primary of the replication group (step 5).
Algorithm 4.23 shows the algorithms needed to provide passive replication. The
PassiveReplicationGroup:replicate() procedure, instead of immediately
replicating the data as done in semi-active replication, queues the data for later
replication. The replication is periodically performed, using a user-defined period, by the
PassiveReplicationGroup:timer() procedure (lines 6-14). To achieve a better
throughput, it sends a batch message containing all the replication data that was
Algorithm 4.23: Passive Fault-Tolerance implementation.
var: this // the current passive replication group object
1  procedure PassiveReplicationGroup:replicate(replicationObject)
2      if this.IsPrimary() then
3          this.enqueue(replicationObject)
4      end if
5  end procedure

6  procedure PassiveReplicationGroup:timer(replicationObject)
7      replicationBatch ← this.dequeAll()
8      replicationRequestList ← ∅
9      for replica in replicaGroup do
10         replicationRequest ← replica.sendMessage(replicationBatch)
11         replicationRequestList.add(replicationRequest)
12     end for
13     replicationRequestList.waitForCompletion()
14 end procedure

15 procedure RPCService:onReplication(replicationObject)
16     switch(replicationObject.getType())
17         ... // (continuation of Algorithm 4.22)
18         case(BatchMessage)
19             replyList ← ∅
20             for item in replicationObject do
21                 switch(item.getType())
22                     case(State)
23                         this.setState(item)
24                     end case
25                     case(Invocation)
26                         (iid,type,oid,pid,args) ← item
27                         replyList.add(handleInvocation(iid,type,oid,pid,args))
28                     end case
29                 end switch
30             end for
31             return replyList
32         end case
33     end switch
34 end procedure
previously enqueued. In order to use passive replication, the support for a batch message
is introduced in the RPCService:onReplication procedure (lines 18-32). For each item
that is contained in the batch message, it checks if it is a state transfer or an invocation.
In case of a state transfer, it updates the service using the setState() procedure (line
23). Otherwise, it is handling an invocation request and it has to perform the invocation
and store the result in the replyList variable (lines 25 to 28), which is used to return
the output values for all the batched invocations to the replication group infrastructure
(and is sent back to the primary).
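The buffering at the heart of this trade-off can be sketched as follows in C++; the types and the sendBatch call are assumptions for illustration, not the runtime's actual interfaces.

// Sketch of the buffering behind passive replication: replicate() only
// enqueues, and a periodic timer drains the queue into one batch message.
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

struct ReplicationObject { /* serialized state or invocation */ };

class PassiveReplicationQueue {
public:
    void enqueue(ReplicationObject obj) {            // called by replicate()
        std::lock_guard<std::mutex> lk(mtx_);
        pending_.push_back(std::move(obj));
    }
    // Timer thread body: every `period`, flush all pending objects at once,
    // trading state freshness for fewer, larger replication messages.
    void timerLoop(std::chrono::milliseconds period) {
        for (;;) {
            std::this_thread::sleep_for(period);
            std::vector<ReplicationObject> batch;
            {
                std::lock_guard<std::mutex> lk(mtx_);
                batch.swap(pending_);                // the dequeAll() step
            }
            if (!batch.empty()) sendBatch(batch);    // one message to all replicas
        }
    }
private:
    void sendBatch(const std::vector<ReplicationObject>&) { /* network send */ }
    std::mutex mtx_;
    std::vector<ReplicationObject> pending_;
};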
4.2.2 Actuator
One of the most important services in public information systems, for both railroads and
light trains, is the display of information at train stations about inbound and outbound
compositions, such as their track number and estimated time of arrival. The actuator
service allows a client to execute a command in a set of sensor nodes, such as displaying
a string in a set of information panels. These panels are implemented by leaf peers.
Figure 4.22: Actuator service layout.
Figure 4.22 shows the deployment of an actuator service instance while using two replicas.
The primary instance binds to each panel in the set, while the replicas make pre-bind
connections (shown as dashed lines).
Figure 4.23 shows an overview of the actuator service. The client starts by choosing
and binding to the appropriate SAP of the actuator service. To display a message on
the set of panels, the client sends a command (step 1) to the actuator service. After
receiving the command, the service sends it to the sensors (step 2), waits for their
acknowledgments (step 3), and then acknowledges the client itself (step 4).
Algorithm 4.24 shows the initial setup of the actuator service. As with the RPC service,
the initial steps focus on the construction and initialization of the service access points
(lines 2-7). If needed, a service must extend the generic class and augment it with the
service-specific arguments. Unlike the RPC service, the actuator service makes use
of this capability, by introducing an additional panel list parameter. Before processing
Figure 4.23: Actuator service overview.
Algorithm 4.24: Actuator service bootstrap.
var: this // the current actuator service object
var: panelGroup // the panel communication group object

1  procedure ActuatorService:open(serviceArgs)
2      hrt ← createQoSEndpoint(HRT, MAX_RT_PRIO)
3      srt ← createQoSEndpoint(SRT, MED_RT_PRIO)
4      be ← createQoSEndpoint(BE, BE_PRIO)
5      sapQoSList ← {hrt,srt,be}
6      serviceSAPs ← createActuatorSAPs(sapQoSList)
7      serviceSAPs.open()
8      actuatorServiceArgs ← downcast(serviceArgs)
9      for panel in actuatorServiceArgs.getPanelList() do
10         panelChannel ← createPanelChannel(panel)
11         panelGroup.add(panelChannel)
12     end for
13 end procedure
this information, the actuator service must downcast the serviceArgs to its concrete
implementation (line 8). Then, using the panel list, the actuator creates a network
channel for each of the panels and stores them in a list (lines 9-12).
Algorithm 4.25 shows the main algorithm present in the actuator service. The proce-
dure ActuatorService:handleAction() is the call-back that is executed upon the
reception of a new action by the actuator service. The actuator spreads the action across
all the panels (shown in Figure 4.23 as steps 2 and 3), using the channels previously
Algorithm 4.25: Actuator service implementation.
var: this // the current actuator service object
var: panelGroup // the panel communication group object

1  procedure ActuatorService:handleAction(action,channel)
2      actionRequestList ← ∅
3      for panel in panelGroup do
4          actionRequest ← panel.sendMessage(action)
5          actionRequestList.add(actionRequest)
6      end for
7      actionRequestList.waitForCompletion()
8      failedPanels ← ∅
9      for actionRequest in actionRequestList do
10         if actionRequest.failed() then
11             panelGroup.remove(actionRequest.getPanel())
12             failedPanels.add(actionRequest.getPanel())
13         end if
14     end for
15     ackMessage ← ActuatorService:createAckMessage(failedPanels)
16     channel.replyMessage(ackMessage)
17 end procedure
created in the bootstrap of the service (lines 2-6). Each failed panel is removed from
the service panel list (line 11) and stored in an auxiliary list (line 12). The procedure
ends with the creation of an acknowledgment message containing the list of failed panels
that is sent back to the client (lines 15 and 16).
Algorithm 4.26: Actuator client implementation.
var: this // the current actuator client object
var: channel // the low level connection object

1  procedure ActuatorServiceClient:open(sid, iid, clientParams)
2      queryInstanceInfoQuery ← ActuatorSvcClient:createFindInstanceQuery(sid,iid)
3      discovery ← this.getRuntime().getOverlayInterface().getDiscovery()
4      queryInstanceInfo ← discovery.executeQuery(queryInstanceInfoQuery)
5      channel ← this.createActuatorChannel(queryInstanceInfo.getSAPs(),clientParams)
6  end procedure

7  procedure ActuatorServiceClient:action(action)
8      actionRequest ← channel.sendMessage(action)
9      actionRequest.waitForCompletion()
10 end procedure
Algorithm 4.26 shows the initialization of the actuator client and the implementation of
the action operation. The ActuatorServiceClient:open() procedure exposes the
bootstrap of the client, following the same implementation as the RPC service. A query
to find the information about the service instance is created and sent over the discovery
service. Using the localization information retrieved in the query reply, a channel is
created to that instance (lines 3-6). The low-level socket operations are handled by the
ActuatorServiceClient:action() procedure (shown in Figure 4.23 as step 1). It
sends the action through the channel and waits for the corresponding acknowledgment.
Actuator Fault-Tolerance
The service does not use the fault-tolerance support for data synchronization (as in the
RPC service), but instead uses the replicas to pre-bind to the panels to minimize the
recovery time. Figure 4.24 shows the architectural details of the actuator service with
FT support.
Figure 4.24: Actuator fault-tolerance support.
In the event of a failure of the primary peer, the newly elected primary already has
pre-bound connections to all the panels in the set, thus minimizing recovery latency. After rebinding
to the new primary, the client reissues the failed action. While we could do the same
using multiple service instances, the actuator client would have to know about these
multiple instances, and switch among them in the presence of failures. Thus, using the
fault-tolerance infrastructure avoids this issue, and allows the client to transparently
switch over to the new primary.
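A minimal sketch of the client-side behavior just described, under the assumption of a rebind-and-retry loop; the class and method names are illustrative, not the real client API.

// Sketch: if an action fails because the primary crashed, rebind (discovery
// now resolves to the newly elected primary) and reissue the failed action.
#include <stdexcept>
#include <string>

class ActuatorClient {
public:
    void open(const std::string& sid, const std::string& iid) {
        /* discovery query + channel creation, as in Algorithm 4.26 */
    }
    void action(const std::string& command) {
        /* send the command and wait for the acknowledgment; throws on failure */
    }
};

void displayWithFailover(ActuatorClient& client, const std::string& sid,
                         const std::string& iid, const std::string& command) {
    for (int attempt = 0; attempt < 3; ++attempt) {
        try {
            client.action(command);   // acknowledged by the current primary
            return;
        } catch (const std::exception&) {
            client.open(sid, iid);    // rebind: resolves to the new primary
        }
    }
    throw std::runtime_error("actuator service unavailable");
}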
4.2.3 Streaming
The streaming of both video and audio in public information systems is an important
component in the management of train stations, especially in CCTV systems. The
streaming service allows the streaming of a data flow, such as video and audio, from
streamers to clients. While there is a considerable amount of work addressing streaming
over P2P networks [131, 132], we have chosen to implement it at a higher level to allow
us to provide an alternative example of an efficient streaming implementation, with
fault-tolerance support, on a general-purpose middleware system.
Figure 4.25: Streaming service layout.
Figure 4.25 shows the deployment of a streaming service instance while using two
replicas. A leaf peer, designated as the streamer, connects to all the members of the
replication group.
Figure 4.26 shows the architecture details of the streaming service. At bootstrap, the
streaming service connects to the streamer (step 1) and starts receiving the stream (step
2). Afterwards, a client connects to the streaming service and requests a stream (step
3). The server allocates a stream session and the client starts receiving the stream from
the service (step 4).
Each client is handled by a stream session, which was designed to support transcoding.
The term transcoding refers to the capability of converting a stream from one encoding,
such as raw data, to a different encoding, such as the H.264 standard [133]. The use
of transcoding allows the streaming service to lower the compression ratio of streams
Figure 4.26: Streaming service architecture.
to support lower-performance computing devices. At the same time, it also enables a
reduction of bandwidth usage, through a higher compression ratio, for high-performance
computing devices. However, in our current implementation we do not apply any
encoding in this example; that is to say, we apply the identity filter.
Algorithm 4.27: Stream service bootstrap.
var: this // the current streaming service object
var: streamServiceArgs // the streaming service arguments
var: streamChannel // the streamer channel object

1  procedure StreamService:open(serviceArgs)
2      hrt ← createQoSEndpoint(HRT, MAX_RT_PRIO)
3      srt ← createQoSEndpoint(SRT, MED_RT_PRIO)
4      be ← createQoSEndpoint(BE, BE_PRIO)
5      sapQoSList ← {hrt,srt,be}
6      serviceSAPs ← this.createStreamSAPs(sapQoSList)
7      serviceSAPs.open()
8      streamServiceArgs ← downcast(serviceArgs)
9      streamerInfo ← streamServiceArgs.getStreamerInfo()
10     streamChannel ← this.createStreamerChannel(streamerInfo)
11 end procedure
Algorithm 4.27 exposes the initialization process of the stream service. The bootstrap
process of the stream service is detailed in procedure StreamService:open(). The
initial setup creates and bootstraps the service access points (lines 2-7). The stream
service uses one additional parameter, the streamer endpoint. This parameter is used
to create a stream channel to the streamer (lines 8-10).
Algorithm 4.28: Stream service implementation.
var: this // the current streaming service object
var: streamSessions // the streaming session list
var: streamStore // the stream circular buffer
var: streamChannel // the streamer channel object

1  procedure StreamService:handleNewStreamServiceClient(client,sessionQoS)
2      streamSessions.add(createSession(client,sessionQoS))
3  end procedure

4  procedure StreamService:handleStreamerFrame(streamFrame)
5      for session in streamSessions do
6          session.processFrame(streamFrame)
7      end for
8  end procedure

9  procedure StreamSession:processFrame(streamFrame)
10     streamStore.add(streamFrame)
11     streamChannel.sendFrame(streamFrame)
12 end procedure
Algorithm 4.28 starts by exposing the procedure that handles a new incoming stream
client, the StreamService:handleNewStreamServiceClient() procedure. Upon
the arrival of a new client, the stream service creates a new session and stores it. The
StreamService:handleStreamerFrame() procedure handles incoming frames from
the streamer. When the service receives a new frame, it updates every active session
(lines 5 to 7) through the StreamSession:processFrame() procedure. Currently, a
session only stores the received frames in a circular buffer (whose size is pre-defined),
which eventually replaces older frames with newer ones. The purpose of this buffer
is to mask frame loss in the presence of a primary crash, allowing the client to
request older frames to repair the damaged stream.
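A minimal C++ sketch of such a pre-defined-size circular frame store follows; the types are illustrative.

// New frames overwrite the oldest slots; a rebinding client can still fetch
// any frame whose slot has not been overwritten yet.
#include <cstdint>
#include <optional>
#include <vector>

struct Frame { uint64_t seq = 0; std::vector<uint8_t> data; };

class FrameStore {
public:
    explicit FrameStore(std::size_t capacity) : ring_(capacity) {}
    void add(Frame f) {
        ring_[f.seq % ring_.size()] = std::move(f);   // replace the oldest frame
    }
    // Returns the frame iff it is still inside the buffered window.
    std::optional<Frame> get(uint64_t seq) const {
        const std::optional<Frame>& slot = ring_[seq % ring_.size()];
        if (slot && slot->seq == seq) return slot;
        return std::nullopt;
    }
private:
    std::vector<std::optional<Frame>> ring_;
};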
Algorithm 4.29 starts by describing the initialization process of the stream client. This
initialization follows the same sequence as with previously described clients. It retrieves
the information about the service instance, and then uses it to create a channel to
the service instance. Upon the reception of a new frame, by the stream client, the
StreamServiceClient:handleStreamFrame() procedure is executed.
Streaming Fault-Tolerance
Figure 4.27 shows the fault-tolerance support within the streaming service. The primary
server and the replicas all connect to the streamer, and receive the stream in parallel
Algorithm 4.29: Stream client implementation.
var: this // the current streaming client object
var: channel // the low level connection object

1  procedure StreamServiceClient:open(sid,iid,clientParams)
2      queryInstanceInfoQuery ← StreamSvcClient:createFindInstanceQuery(sid,iid)
3      discovery ← getRuntime().getOverlayInterface().getDiscovery()
4      queryInstanceInfo ← discovery.executeQuery(queryInstanceInfoQuery)
5      channel ← this.createStreamChannel(queryInstanceInfo.getSAPs(),clientParams)
6  end procedure

7  procedure StreamServiceClient:handleStreamFrame(streamFrame)
8      // application specific...
9  end procedure
Figure 4.27: Streaming service with fault-tolerance support.
(step 1). Each of the replicas stores the stream flow up to a maximum configurable
time, for example 5 minutes (step 2). When a stream client connects to the stream
service, it binds to the primary instance and starts receiving the data stream (step 3).
When a fault occurs in the primary, the client rebinds to the newly elected primary of
the replication group. As the client rebinds, it must inform the new primary what was
the last frame received. The new primary, through a new stream session, calculates the
missing data and sends it back to the client, thereafter resuming the normal stream
flow.
4.3 Support for Multi-Core Computing
The evolution of microprocessors has focused on support for multi-core architectures
as a way to scale past the current physical limits of manufacturing. This brings new
challenges to systems programmers, as they must be able to deal with an ever-increasing
potential for parallelism.
While coarse-grain parallelism can already be handled with current development frame-
works, such as MPI and OpenMP, they are aimed at best-effort tasks that do not have
a notion of deadline, and therefore are unable to support real-time. Furthermore, their
programming model is based on a set of low-level primitives that do not offer any type
of object-oriented programming support.
On the other hand, the use of object-oriented programming languages provides very
limited support for specifying object-to-object interactions and almost no parallelism
support. For example, in C/C++ the parallelism is achieved through the use of threads
or processes that are implemented in low-level C primitives that do not have any type
of object awareness.
For these reasons, fine-grained parallelism is hard to implement in a flexible and modular
fashion. While a considerable amount of research work has been done on threading
strategies with object awareness, such as the leader-followers pattern [11], they do not
offer support for resource reservation or regulated access between objects.
4.3.1 Object-Based Interactions
The object-oriented paradigm is based on the principle of using objects, which are data
structures containing data fields and methods, to develop computer programs. The
methods of an object allow manipulation of its internal state, which is composed of
its data fields. However, object-to-object interaction is not addressed by the object-
oriented paradigm. Recent work on component middleware systems [65, 66] addressed
this issue through the use of component-oriented models. However, component-based
programming offers a high-level approach that in our view is not able to address
important low-level object-to-object interactions, such as CPU partitioning, and fine-
grained parallelism.
The implementation of fine-grained parallelism frameworks has to support object-to-
object interactions that include direct and deferred calls. With direct calls (shown
in Figure 4.28a), the caller object enters the object-space of the callee, which might be
guarded by a mutex, and performs the target action. On the other hand, when the
target object enforces deferred calling (shown in Figure 4.28b), the caller object is unable
to perform the operation directly and must queue it. These requests are then handled
by a thread of the target object. The caller does not enter the callee object-space. This
pattern is commonly known as Active Object [114, 13].
Figure 4.28: Object-to-Object interactions: (a) direct calling; (b) deferred calling.
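The following is a condensed C++ sketch of the Active Object idea, assuming a single worker thread and a future-based reply; it illustrates the pattern itself, not our middleware's implementation of it.

// Deferred calling: callers never enter the callee's object-space; they
// enqueue a request and the object's own worker thread executes it,
// completing a future with the result.
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

class ActiveObject {
public:
    ActiveObject() : worker_([this] { run(); }) {}
    ~ActiveObject() {
        { std::lock_guard<std::mutex> lk(mtx_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    // Deferred call: op is queued and later executed by the worker thread.
    std::future<int> call(std::function<int()> op) {
        std::packaged_task<int()> task(std::move(op));
        std::future<int> fut = task.get_future();
        { std::lock_guard<std::mutex> lk(mtx_); queue_.push(std::move(task)); }
        cv_.notify_one();
        return fut;
    }
private:
    void run() {
        for (;;) {
            std::packaged_task<int()> task;
            {
                std::unique_lock<std::mutex> lk(mtx_);
                cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
                if (done_ && queue_.empty()) return;
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();  // runs inside the callee's thread, not the caller's
        }
    }
    std::mutex mtx_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<int()>> queue_;
    bool done_ = false;
    std::thread worker_;  // declared last so it starts after the other members
};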
4.3.2 CPU Partitioning
CPU partitioning is an approach based on the isolation of individual cores or processors
to perform specific tasks, and is normally used to isolate real-time threads from potential
interference from other non-real-time threads. Despite the large body of research on
real-time middleware systems that use general-purpose operating systems over
Commercial-Off-The-Shelf (COTS) hardware [3, 65, 66], to our knowledge, no real-time
middleware system, especially when combined with FT support, has ever employed a
CPU partitioning scheme (shielding) to further enhance real-time performance.
Figure 4.29 exemplifies possible CPU partitionings for quad-core (Figure 4.29a), six-core
(Figure 4.29b), and eight-core (Figure 4.29c) microprocessors. A more detailed explanation
of the resource reservation mechanisms is provided in Section 3.1.4. For now it suffices
to say that the partitions designated with OS contain the threads that belong to the
underlying operating system (in this case Linux). The partitions BE & RT contain the
threads for best-effort and soft real-time, and finally, the Isolated RT label indicates that
Figure 4.29: Examples of CPU Partitioning: (a) quad-core; (b) six-core; (c) eight-core.
the partitions have dedicated cores that only host soft real-time threads, reducing the
scheduling latency caused by the switching between best-effort and real-time threads.
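At the thread level, this kind of shielding can be approximated on Linux as sketched below; the helper is illustrative, and a complete shield would additionally migrate kernel threads and interrupt handling away from the isolated core (e.g., via cpusets), which is omitted here.

// Pin the calling thread to a dedicated core and raise it to SCHED_FIFO.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          // for pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>

bool shieldCurrentThread(int isolatedCore, int rtPriority) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(isolatedCore, &set);                 // e.g. the "Isolated RT" core
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return false;
    sched_param sp = {};
    sp.sched_priority = rtPriority;              // the desired real-time priority
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) == 0;
}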
Our runtime can be seen as a set of low-level services that offers high-level abstractions
for the implementation of high-level services. It was necessary to create a mechanism
that regulates access between services, that is, the interactions between objects running
on different partitions, in order to preserve the QoS parameters of each individual service.
Figure 4.30 revisits the object-to-object interactions with the introduction of CPU
partitioning. Figure 4.30a shows object A making a direct call to operation op_b1()
in object B. This normally implies that operation op_b1() has a mutex to guard
any critical data structures. Even with priority boosting schemes, such as priority
inheritance, the use of mutexes can cause unbounded latencies. Consequently, this would
break the isolation of partition Isolated RT and would defeat the purpose of using
CPU partitioning. In order to improve throughput, real-time threads can be co-located
with non real-time threads to maximize the use of the cores allocated to a particular
partition. The disadvantage of this approach is that the real-time threads are no longer
in an isolated environment, and so, the scheduling of non real-time threads can cause
interference in real-time threads. As in Figure 4.30b, a direct call involving objects
within the same partition is a valid option.
The use of deferred calling (shown in Figure 4.30c) avoids the problems of direct calling
when objects are allocated in different partitions. The call from object A is serialized
and queued, and a future is associated with the pending request. This call is later
Figure 4.30: Object-to-Object interactions with different partitions: (a) direct calling
with different partitions; (b) direct calling within the same partition; (c) deferred calling
with different partitions.
handled by a thread belonging to object B that dequeues it and executes the request,
updating the future with the respective result. The thread of object A, which was waiting
on the future, is woken up and returns to op_a1(). This execution model is commonly
referred to as worker-master [13].
4.3.3 Threading Strategies
A threading strategy defines how several threads interact in order to fulfill a goal, with
each strategy offering a trade-off between latency and throughput. Figure 4.31 presents
several well-known strategies that are implemented in our Support Framework, namely:
a) Leader-Followers [11]; b) Thread-Pool [114]; c) Thread-per-Connection [12]; and d)
Thread-per-Request [13].
Leader-Followers (LF)
The leader-followers pattern (c.f. Figure 4.31a) [11] was designed to reduce context
Figure 4.31: Threading strategies: (a) leader-followers; (b) thread-pool; (c) thread-per-
connection; (d) thread-per-request.
switching overhead when multiple threads access a shared resource, such as a set of file
descriptors. This is a special kind of thread-pool where threads take turns as leader, in
order to access the shared resource. If the shared resource is a descriptor set, such as
sockets, then, when a new event happens on a descriptor, the leader thread is notified
by the select system call. At this point, the leader removes the descriptor from the
set, elects a new leader, and then resumes the processing of the request associated with
the event.
In this case, our default implementation allows foreign threads to join the leader-
followers execution model. After joining, the foreign thread is inserted in the followers
thread set, waiting for its turn to become leader and process pending work. As soon as
the event reaches a final state (success, error, or timeout), the foreign thread
is removed from the followers set.
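The core hand-off of the pattern can be condensed into the following C++ sketch, where demultiplex() stands in for select() on the shared descriptor set; the class is an illustration of the pattern, not our actual implementation.

// One thread at a time is the leader and waits on the shared resource; once
// an event arrives it promotes a follower to leader before processing.
#include <condition_variable>
#include <functional>
#include <mutex>

class LeaderFollowers {
public:
    LeaderFollowers(std::function<int()> demultiplex,
                    std::function<void(int)> process)
        : demultiplex_(std::move(demultiplex)), process_(std::move(process)) {}

    // Thread body: joined by each pool thread (and, in our runtime, by
    // foreign threads that are allowed to migrate in).
    void join() {
        for (;;) {
            {
                std::unique_lock<std::mutex> lk(mtx_);
                followers_.wait(lk, [this] { return !leader_present_; });
                leader_present_ = true;              // become the leader
            }
            int event = demultiplex_();              // only the leader blocks here
            {
                std::lock_guard<std::mutex> lk(mtx_);
                leader_present_ = false;             // hand off leadership...
            }
            followers_.notify_one();                 // ...to one follower
            process_(event);                         // process outside the lock
        }
    }
private:
    std::function<int()> demultiplex_;
    std::function<void(int)> process_;
    std::mutex mtx_;
    std::condition_variable followers_;
    bool leader_present_ = false;
};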
Thread-Pool (TP)
The thread-pool pattern (c.f. Figure 4.31b) [114] consists of a set of pre-spawned threads
that are normally synchronized by a blocking primitive, such as select or read. This
pattern avoids the overhead and latency of dynamically creating threads to handle client
requests, but results in a loss of flexibility. In general, however, it is possible to adjust
the size of the pool in order to cope with environment changes.
Thread-per-Connection (TPC)
The thread-per-connection pattern (c.f. Figure 4.31c) [12] aims to provide minimum
latency by avoiding request multiplexing, at the cost of having a dedicated thread
per connection.
Every SAP has a listening socket, usually managed by an Acceptor design pattern [13],
that is responsible for accepting new TCP/IP connections. After accepting
a new connection, the Acceptor creates a new thread that will handle the connection
throughout its life-cycle. Given the one-to-one match between thread and connection,
it is not possible to allow foreign threads into the execution model without breaking
correctness of the connection object, as it is not configured to allow multiple accesses
to low-level primitives such as the read system call. Because of this, any foreign thread
that invokes a synchronous operation on the connection object has its request queued.
This is later processed by the thread that owns the connection.
Thread-per-Request (TPR)
The thread-per-request pattern (c.f. Figure 4.31d) [13] focuses on minimizing thread
usage, while trying to maximize the overall throughput on a set of network sockets.
This design pattern results from a combination of a low-level thread-per-connection
strategy with a high-level thread-pool strategy. This strategy is also referred to as
Half-Sync/Half-Async [13].
The role of the thread-per-connection strategy is to read and parse incoming packets,
and enqueue them into an input queue to be processed by the workers of the thread-
pool. When a worker thread wants to send a packet, it also has to enqueue the packet
into an output queue.
Minimization of Network Induced Priority Inversion
Providing end-to-end QoS in a distributed environment requires a vertical approach,
starting at the network level (inside the OS layer). Previous research [134] focused
on the minimization of network-induced priority inversion, through the enhancement of
Solaris’s network stack to support QoS. Additional work [3] extended this approach to
the runtime level by providing separate access points for requests of different priority.
Building on these principles, our runtime was built to preserve end-to-end QoS seman-
tics. To that end, each service publishes a set of access points, with associated QoS,
that will serve as entry points for client requests, thus avoiding request multiplexing.
This approach was based on TAO’s work on the minimization of priority inversion [3]
caused by the use of network multiplexing. The service access points are served by a
threading strategy that is statically configured during the bootstrap of the runtime.
Figure 4.32: End-to-End QoS propagation.
However, as TAO was designed to accommodate only one type of service, that is, the
RPC service, it did not address the following aspects: service inter-dependencies and
resource reservation, more precisely, CPU shielding. In our middleware, each SAP is
served by an execution model, offering a flexible behavior.
4.3.4 An Execution Model for Multi-Core Computing
The lack of a design pattern capable of providing a flexible behavior that leverages the
use of multi-core processors through CPU reservation and partitioning, while providing
support for a configurable threading strategy, motivated the creation of the Execution
Model/Context design pattern.
Figure 4.33: RPC service using CPU partitioning on a quad-core processor.
Figure 4.33 shows an overview of the RPC service while using CPU partitioning. The
Isolated RT partition, containing core 1, supports the handling of high priority RT
invocations, whereas the BE & RT partition, containing cores 2 and 3, supports the
handling of medium priority RT invocations and best-effort invocations. Each SAP
features a thread-per-connection (TPC) threading strategy, but any of the previously
described strategies could be used.
Figure 4.34: Invocation across two distinct partitions.
Figure 4.34 shows the interaction of a medium priority RT invocation, which is handled
by a thread that belongs to a med RT SAP that resides in the BE & RT partition,
with a high priority server that resides in the Isolated RT partition. While any thread
belonging to the high RT SAP could directly interact with a high priority server, as
they reside in the same partition, this should not happen when the interaction was
originated from a thread belonging to a different partition. This last interaction could
cause a priority inversion in the threading strategy that is supporting the high priority
server.
The first part of the execution model/context pattern, the execution model sub-pattern,
allows an entity to regulate the acceptance of foreign threads, that is, threads that
belong to other execution models, within its computing model. The rationale behind
this principle resides in the fact that an application might reside on a dedicated core,
where the interaction with a foreign thread could cause cache-line thrashing, or simply
break the isolation of some real-time threads.
The second sub-pattern is the execution context. Its role is to efficiently manage the
call stack through the use of Thread-Specific Storage (TSS). This allows the execution
model to retrieve the necessary information about a thread, for example the partition
that is assigned to the thread, and use it to regulate the behavior of the thread that is
interacting with it. For example, it prevents an isolated real-time thread that belongs to
an isolated execution model hosted on an isolated real-time partition, from participating
in a foreign execution model, which would break the isolation principle and result in
non-deterministic behavior (for example, by propagating interrupts from non-isolated
real-time threads into the isolated core).
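In C++, the per-thread execution context can be realized with thread-local storage, as in the following illustrative sketch; the types mirror, but are not, the runtime's own.

// A thread_local stack means each thread reads and updates its own call
// chain of execution models with no locking whatsoever.
#include <stack>

class ExecutionModel;  // forward declaration; defined by the runtime

struct Environment {
    ExecutionModel* em;
    long timeout_us;   // deadline inherited down the call chain
    int nesting;       // counts re-entries into the same model
};

struct ExecutionContext {
    std::stack<Environment> stack;
};

// One context per thread, created lazily on first access: the C++
// equivalent of the TSS lookup performed at the start of a join.
ExecutionContext& currentExecutionContext() {
    thread_local ExecutionContext ctx;
    return ctx;
}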
The internals of the Execution Model/Execution Context (EM/EC) design pattern are
depicted in Figure 4.35, which shows the interaction between three distinct execution models.
When a thread that belongs to EM0 calls an operation on EM1, it effectively enters a
new computational domain. An operation can either be synchronous or asynchronous.
If it is asynchronous, then the requesting EM0 will not participate in the computing
effort of EM1.
Figure 4.35: Execution Model Pattern.
On the other hand, if the operation is synchronous, then it must check whether the last
EM, at the top of the execution context call stack, allows its thread to participate
in the threading strategy of EM1. If the thread is allowed to join the threading strategy,
then it participates in the computing effort until it reaches a final state (that is, operation
successful, error, or timeout). When it reaches the final state, it backtracks to the
requesting EM, in this case EM0, by popping the context from the stack. The operation
being performed on EM1 could continue the call chain by executing an operation on
EM2, and if so, this process would repeat itself.
If the requesting EM0 does not allow its threads to join EM1, then the operation must
be enqueued for future processing by a thread within the threading strategy of EM1. If
EM1 embodies a passive entity, i.e. an object that does not have active threads running
inside its scope, then the EM is considered a NO-OP EM. In this scenario, it is not
possible to enqueue the request because there are no threads to process it, so an error is
returned to EM0 (this should only happen in the case of a configuration error). Otherwise,
if EM1 is an active object and the queue buffer is not full, the request is enqueued and a
reply future is created. As the operation has synchronous semantics, the thread (that
belongs to EM0) must wait for the future to reach its final state before returning to its
originating EM.
Algorithm 4.30: Joining an Execution Model.
var: this // the current Execution Model object
1  procedure ExecutionModel:join(event,timeout)
2      ec ← TSS:getExecutionContext()
3      topEM ← ec.peekExecutionModel()
4      joinable ← topEM.allowsMigration(this)
5      if not joinable then
6          throw(ExecutionModelException)
7      end if
8      try
9          ec.pushEnvironment(this,event,timeout)
10         ts ← this.getThreadingStrategy()
11         ts.join(event,timeout)
12     catch(ThreadingStrategyException)
13         ec.popEnvironment()
14         throw(ExecutionModelException)
15     catch(ExecutionContextException)
16         throw(ExecutionModelException)
17     end try
18 end procedure
Algorithm 4.30 presents the ExecutionModel:join() procedure, which acts as the
entry point for every thread wanting to join the execution model. The procedure takes
two arguments, an event and a timeout. The event represents an uncompleted
operation belonging to the execution model, e.g. an unreceived packet reply from
a socket, that must be completed before the deadline given by timeout. It starts
by retrieving the Execution Context stored in Thread-Specific Storage (TSS) (line 2).
This allows the execution context to be private to the thread which owns it, avoiding
synchronized access to this data. At line 3, we retrieve the current, and also the last,
execution model the thread has entered. If this last execution model does not allow its
threads to migrate to the new execution model, then an exception is raised and the
join process is aborted. Otherwise, the thread joins the new execution model, by first
pushing the information regarding the join (a new tuple containing the new execution
model, event, and timeout) onto the call stack (line 9). This is followed by the thread
joining the threading strategy (Lines 10-11). If the threading strategy does not allow
the thread to join it, then an exception is raised and the join is aborted. Independently
of the success or failure of the join, the call stack is popped, thus eliminating the
information regarding this completed join.
Algorithm 4.31: Execution Context stack management.
var: this // the current Execution Context object
var: stack // the environment stack object

1  procedure ExecutionContext:pushEnvironment(em,event,timeout)
2      topEnv ← stack.top()
3      if timeout > topEnv.getTimeout() then
4          throw(ExecutionModelException)
5      end if
6      if topEnv.getExecutionModel() = em and topEnv.getEvent() = event then
7          topEnv.incrementNestingCounter()
8      else
9          nesting_counter ← 1
10         context ← createContextItem(em,event,timeout,nesting_counter)
11         stack.push(context)
12     end if
13 end procedure

14 procedure ExecutionContext:popEnvironment()
15     topEnv ← stack.top()
16     topEnv.decrementNestingCounter()
17     if topEnv.getNestingCounter() = 0 then
18         stack.pop()
19     end if
20 end procedure

21 procedure ExecutionContext:peekExecutionModel()
22     return stack.top().getExecutionModel()
23 end procedure
Algorithm 4.31 shows the most relevant procedures of the execution context. The
ExecutionContext:pushEnvironment() procedure is responsible for pushing a new
execution environment onto the call stack. It starts by checking that the timeout of
the new environment does not violate the previously established deadline (that of the
last execution model); if it does, an exception is raised (lines 3-5). If a thread is
recursive, i.e., it enters the same execution model multiple times, then instead of
creating a new execution environment and pushing it onto the stack, it simply increments
a nesting counter that represents the number of times the thread has reentered this
execution domain (lines 6-7). Otherwise (lines 9-11), a new execution environment
(with the nesting counter set to 1) is created and pushed onto
the stack. The ExecutionContext:popEnvironment() procedure eliminates the top
execution environment present in the call stack. It starts by decrementing the nesting
counter, and if it is equal to 0 then no recursive threads are present and the stack can
safely be popped. Otherwise, no further action is taken. The remaining procedure,
ExecutionContext:peekExecutionModel(), is an auxiliary procedure used to
peek at the top execution model associated with the current thread.
Applying the EM/EC Pattern to the RPC Service
Figure 4.36 shows the RPC service using the EM/EC pattern. Each service access point
(SAP) is served by a thread-per-connection strategy that has a dedicated thread for
handling new connections, normally known as the Acceptor [13], that spawns a new
thread for each new client connection. Furthermore, the RPC service uses two CPU
partitions, an Isolated RT partition for supporting high priority RT invocations and a
BE & RT partition for supporting medium priority RT and best-effort invocations.
Figure 4.36: RPC implementation using the EM/EC pattern.
In Figure 4.36, each priority lane, the logical composition of the low-level socket handling
with the high-level server handling, is managed through a single execution model. Each
connection is handled by a thread that, after reading an invocation packet, uses the
server adapter to locate the target server and performs the invocation. As this approach does
not enqueue requests between the layers, it does not introduce additional sources of
latency. However, if the SAP that received the invocation request does not belong to
the same partition as the target server, then the request is enqueued in the execution
model containing the server. The invocation is later dequeued, in this case by a thread that is
handling the SAP, and executed. The reply is then enqueued in the execution model
that originated the invocation.
Algorithm 4.32: Implementation of the EM/EC pattern in the RPC service.
var: thisSocket // the current RPC socket object
var: thisService // the current RPC service object
var: timeout // the timeout associated with the invocation
var: rpcService // RPC service instance

1  procedure RPCServiceSocket:handleInput()
2      invocation ← getReadPacketFromSocket()
3      rpcService.handleRPCServiceMsg(thisSocket,invocation)
4  end procedure

5  procedure RPCServiceObject:handleTwoWayInvocation(pid,args)
6      try
7          event ← createInvocationEvent(pid,args)
8          thisService.getExecutionModel().join(event,timeout)
9          return event.getOutput()
10     catch(ExecutionModelException ex)
11         event.wait(timeout)
12         return event.getOutput()
13     end try
14 end procedure
Algorithm 4.32 provides the main details of the EM/EC pattern implementation in
the RPC service. The RPCServiceSocket:handleInput() procedure is the callback
that is used by the thread managing the connection when an input event has occurred
in the socket. After the packet is read from the socket, its processing is delegated to
the upper level of the service, through the RPCService:handleRPCServiceMsg()
procedure (shown previously in Algorithm 4.19). The server adapter is a bridge between
the layers, and is shown with a dashed outline. It starts by locating the server object
and delegating the invocation to it. The handling of a two-way invocation is im-
plemented in the RPCServiceObject:handleTwoWayInvocation() procedure (the
one-way invocation was omitted for clarity). If the invocation originated from a thread
belonging to server’s priority lane, more specifically from the socket that is handling
the connection, then is able to join the execution model of the server and help with the
computation (lines 7 to 9). On the other hand, if the invocation was originated from a
thread belonging to a execution model outside the server’s partition, then the request is
queued. After the threading strategy of the server executes the invocation, the request
is signaled as completed. At this point, the thread that originated the request is woken up
in the wait() procedure (line 11) and the output is returned (line 12).
4.4 Runtime Bootstrap Parameters
The bootstrap of the core is implemented in method Core:open(args) and adjusts
the behavior of the runtime during its life-cycle. The arguments are passed to the core
by using command line options. Table 4.1 shows the most relevant arguments present
in the system.
Property                    Meaning                                 Default
General use
  resource_reservation      Enables resource reservation            true
  rr_runtime                Maximum global cpu runtime              10
  rr_period                 Maximum global cpu period               100
Overlay specific
  default_interface         Default NIC                             eth0
  cell_multicast_interface  Default NIC for multicast               eth0
  cell_root_discovery_ip    IP address for root cell discovery      228.1.2.2
  cell_root_discovery_port  Port address for root cell discovery    2001
  tree_span_i               Tree span at level i                    2
  cell_peers_i              Maximum peers at tree level i           2
  cell_leafs_i              Maximum leafs at tree level i           80

Table 4.1: Runtime and overlay parameters.
One of the most important flags in the system is the resource reservation support flag,
which is controlled by the --resource_reservation command line option. Upon
initialization, and if the resource reservation support flag is activated (the default
behavior), the core creates a QoS client and connects to the resource reservation daemon.
The --rr_runtime parameter controls how much CPU time can be spent running in each
computational period, which in turn is defined by the --rr_period parameter. Both
parameters are expressed in microseconds and are used to configure the underlying
Linux control groups.
The overlay is controlled by a set of specific command line options. The default network
interface card (NIC) to be used in network communications is controlled by the
--default_interface parameter. The --cell_multicast_interface parameter defines the
network interface card to be used by the cell discovery mechanism. Furthermore, the
--cell_root_discovery_ip and --cell_root_discovery_port parameters specify
the IP address and port of the root multicast group. The --tree_span_i parameter
specifies the tree span for the ith level of the tree. The --cell_peers_i parameter
specifies the maximum number of peers in each cell at tree level i. Last, the maximum
number of leaf peers for every cell at tree level i is controlled by the --cell_leafs_i
parameter.
It is possible to automatically bootstrap an overlay during the initialization of the run-
time, using the --overlay command line option. For example, using --overlay=p3,
the core will look for a “libp3.so” in the current directory, and bootstrap it. Alterna-
tively, it is possible to programmatically attach an overlay to the runtime, c.f. Listing 3.1
in Chapter 3.
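The dynamic loading step can be sketched as follows; the factory symbol name create_overlay is an assumption for illustration, as the actual plug-in entry point is not shown here.

// Derive "libp3.so" from the option value, load it, and resolve a factory.
#include <dlfcn.h>
#include <stdexcept>
#include <string>

void* loadOverlay(const std::string& name) {
    const std::string lib = "./lib" + name + ".so";      // e.g. ./libp3.so
    void* handle = dlopen(lib.c_str(), RTLD_NOW);
    if (!handle)
        throw std::runtime_error(dlerror());
    // Resolve the overlay factory exported by the plug-in (assumed name).
    void* factory = dlsym(handle, "create_overlay");
    if (!factory)
        throw std::runtime_error(dlerror());
    return factory;  // cast to the factory function type and invoke
}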
4.5 Summary
This chapter provided an overall view of the implementation of the runtime. We
presented an overlay implementation inspired by the P3 topology, detailing the three
mandatory peer-to-peer services: mesh, discovery, and fault-tolerance.
The chapter also presented three high-level services that provide a proof-
of-concept for our runtime architecture, namely: an RPC service that implements the
traditional remote procedure call; an Actuator service that exemplifies an aggregation
service that uses the FT service solely to minimize rebind latency, and; a Streaming
Service that offers buffering capabilities to ensure stream integrity even in the presence
of faults.
Furthermore, the chapter provides an overview of the challenges faced in supporting
multi-core computing, followed by the presentation of our novel design pattern, the
Execution Model/Context, which provides an integrated solution for supporting multi-
core computing.
Last, the chapter ends with a short description of the options that may be used when
bootstrapping the runtime.
–Success consists in being successful, not in having potential for success. Any wide
piece of ground is the potential site of a palace, but there's no palace till it's built.
Fernando Pessoa

5 Evaluation
This chapter provides an evaluation of the real-time performance of the middleware
in the presence of the fault-tolerance and resource reservation mechanisms. The
chapter highlights the performance of the two most important parts in the system,
the overlay and the high-level services. This evaluation uses a set of benchmarks
that characterize key aspects of the infrastructure. The assessment of the overlay
infrastructure focuses on (a) membership (and recovery time), (b) query behavior, and
(c) service deployment performance. Whereas, the evaluation of the high-level services
focused on (d) the impact of FT on service performance, (e) impact of multiple clients
(using the RPC as test case), and finally, (f) a comparison with other platforms.
5.1 Evaluation Setup
The evaluation setup is composed of the physical infrastructure and the overlay config-
uration used to produce the benchmark results discussed throughout this chapter.
5.1.1 Physical Infrastructure
The physical infrastructure used to evaluate the middleware prototype consists of a
cluster of 20 quad-core nodes, equipped with AMD Phenom II X4 CPUs
and 4GB of memory, totaling 80 cores and 80GB of memory. Each node was installed
with Ubuntu 10.10 and kernel 2.6.39-git12. Despite our earlier efforts to use the real-
time patch for Linux, known as the rt-preempt patch [135], this was not possible due
to bugs on the control group infrastructure. The purpose of this patch is to reduce
the number and length of non-preemptive sections in the Linux kernel, resulting in
less scheduling latency and jitter. Nevertheless, the 2.6.39 version incorporates most of
the advancements brought by the rt-branch, namely, threaded-irqs [136]. The physical
network infrastructure was a 100 Mbit/s Ethernet with a star topology.
5.1.2 Overlay Setup
At bootstrap, the middleware starts by building a peer-to-peer overlay with a user-
specified number of peers and leaf peers. The peers are grouped in cells that are created
according to the rules of the underlying P2P framework, described in Chapter 4. Overlay
properties control the tree span and the maximum number of peers per cell at any given
depth.
Figure 5.1: Overlay evaluation setup.
Figure 5.1 shows the configuration used for all the benchmarks performed on the overlay.
The overlay forms a binary tree: the first level, composed of the root cell, has four
peers; each cell on the second level has three peers; and each cell on the third, and last,
level has two peers.
Figure 5.2 shows the physical layout used for the evaluation. Each peer is launched in a
separate node of the cluster, for a total of 18 cluster nodes. On the other hand, all the
leaf peers are launched in a single node. Last, the clients are either launched in the same
node where the leaf peers were launched, or in a remaining free node of the cluster. The
allocation of the clients and leaf peers on the same node was done to provide accurate
measurements in services, such as the streaming service, where the stream of data only
goes one way. Otherwise, the physical clock of both client and leaf nodes would have
to be accurately synchronized through specialized hardware.
Figure 5.2: Physical evaluation setup.
5.2 Benchmarks
We divided the benchmark suite into two separate categories, one focusing on the
low-level overlay performance and the other on the high-level services. The main
objective is to isolate key mechanisms, especially at the overlay level, that may interfere
with the behavior of the services. A second objective is to create a solid benchmark
facility to assess the impact of future overlay implementations in the overall middleware
performance.
5.2.1 Overlay Benchmarks
The following benchmarks were designed to evaluate the performance of a P2P overlay
implementation. Figure 5.3 shows an overview of the different overlay benchmarks.
Membership Bind and Recovery
To evaluate the performance of the membership mechanism, we take two measurements,
(a) the bind time, which reflects the time a node takes to negotiate its entry into the
mesh, and (b) the rebind time, which comprises the recovery and rebinding (renegotiation)
time that a node must undertake to deal with a faulty environment (Figure 5.3a). In
our P2P overlay, this failure happens when a coordinator node crashes, leading to a fault
on the containing cell, and subsequently to a fault in the tree mesh. The faulty cell
recovers by electing a new coordinator node, allowing the child subtrees to rebind
to the recovered cell. The time that it takes for a child subtree to rebind to the new
coordinator is directly related to the size of its state (the serialized contents of the
Figure 5.3: Overview of the overlay benchmarks: (a) membership bind & recovery;
(b) querying; (c) service deployment.
subtree), thus the larger the subtree, the longer it will take to transfer its state to the
new coordinator. So, in order to evaluate the worst case scenario, after building the
mesh, the coordinator of the root cell is crashed, forcing a rebind of the first level cells.
Querying
One of the most fundamental aspects of P2P is its ability to efficiently find resources in
the network. Given this, a measurement of the search mechanism is important to assess
the performance of a given P2P implementation. To assess the worst case scenario, we
focused on measuring the Place of Launch (PoL) query, as shown in Figure 5.3b. In
our current P2P implementation, a query is handled only at the root cell, since it has
a better account of the resource usage across the mesh tree.
Service Deployment
In a cloud-like environment it is important to quickly deploy services, and so the goal of
this benchmark is to profile the performance of such a mechanism in our overlay. This
benchmark measures the latency associated with a service bootstrap with and without
FT. Figure 5.3c represents a request to launch a service on a peer to be discovered.
After the peer is found by the PoL query, the service is started. When a service is to be
bootstrapped without FT support, the source creating the service only has to request
one PoL query, as no replicas are going to be bootstrapped. Otherwise, the primary
of the replication group has to issue the same number of PoL queries as the number of
replicas that it is bootstrapping.
5.2.2 Services Benchmarks
We wanted to evaluate the following parameters: (a) the impact of fault-tolerance
mechanisms in priority-based real-time tasks; (b) the impact of fault-tolerance in iso-
lated real-time tasks; and (c) a preliminary (latency-only) comparison with other mainstream
middleware systems, such as TAO, ICE and RMI. We implemented three simple services
to serve as benchmarks and one to inject load in the peers.
The maximum allowed priority for all benchmarks is 48. Priorities above 48, and up
to 99, are reserved for the various low-level Linux kernel threads, namely, the cgroup
manager and irq handlers.
Figure 5.4: Network organization for the service benchmarks: (a) RPC; (b) actuator;
(c) streaming.
RPC
The RPC service (Figure 5.4a) executes a procedure in a foreign address space. This
is a standard service in any middleware system. A primary server receives a call from
a client, executes it, and updates the state in all service replicas. When all replicas
acknowledge the update, the primary server then replies to the client. In the absence of
fault-tolerance mechanisms, the primary server executes the procedure and immediately
replies to the client.
To evaluate the RPC service we used the maximum available priority of 48. The remote
procedure simply increments a counter and returns the value. We performed 1000 RPC
calls in each run, with an invocation rate of 250 per second.
Actuator
The actuator service (Figure 5.4b) allows a client to execute a command in a set of panels
controlled by leaf peers. This is used by EFACEC to display information about incoming
and departing trains in a train station. After receiving the command, the primary server
sends it to the panels, waits for their acknowledgments, and then acknowledges the client
itself. The service does not use the fault-tolerance support for data synchronization (as
in the RPC service), but instead pre-binds the replicas to the panels in the set.
We used 80 panels, and a string of 14 bytes. The 80 panels are representative of a
large real-world public information system in a light train network. The string length
represents the average size in current systems at EFACEC. We issued 1000 commands
in each run, with an invocation rate of 250 per second.
Streaming
This service (Figure 5.4c) allows the streaming of a data flow (e.g. video, audio, events)
from leaf peers to a client. This type of service is used by EFACEC to send and receive
streams from train stations, namely, to implement the CCTV subsystem. The primary
server and the replicas all connect to the leaf peers, and receive the stream in parallel.
Each of the replicas stores the stream flow up to a maximum pre-defined time, for
example 5 minutes. When a fault occurs in the primary, the client rebinds to the newly
elected primary of the replication group. As the client rebinds, it must inform the new
primary what was the last frame received. The new primary then calculates the missing
data and sends it back to the client, thereafter resuming the normal stream flow.
We used a stream of 24 frames per second, each frame 4 Kbytes long, resulting in a bitrate
of 768Kbit per second. For example, this bitrate allows for a medium quality MPEG-4
stream with a 480 x 272 resolution, matching the video stream used by EFACEC’s
CCTV. The client and leaf peers are located in the same machine as this allows
the determination of the one-way latency and jitter for the traffic. The stream was
transmitted for 4 seconds in each run.
5.2.3 Load Generator
Complex distributed systems are prone to be affected by the presence of rogue services
that can become a source of latency and jitter. We evaluate the impact of the presence
of such entities by introducing in each peer a load generator service. The latter spawns
as many threads as the logical core count of the CPU. Unless explicitly mentioned,
the threads are allocated to the SCHED FIFO schedule class, with priority 48. This
scheduling policy represents the worst case scenario of unwanted computation. Given
a desired load percentage p (in terms of the total available CPU time), each thread
continuously generates random time intervals (up to a configurable maximum of 5ms).
For each value it computes the percentage of time that it must compute so that the load
is p. For example, if the desired load is 75% and the value generated is 4ms, then the
load generator must compute for 3ms and sleep for the remainder of that time lapse.
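A minimal sketch of one such load-generator thread is shown below; the interval bound, the busy fraction and the sleep follow the description above, while the spin loop itself is an illustrative choice. In the experiments, one such thread runs per logical core, under SCHED FIFO with priority 48.

    #include <chrono>
    #include <random>
    #include <thread>

    // One load-generator thread: for a target load p in [0,1], draw random
    // intervals up to 5ms, busy-loop for p of each interval and sleep for the
    // remainder (e.g. p = 0.75 and a 4ms interval: compute 3ms, sleep 1ms).
    void generate_load(double p) {
        using clock = std::chrono::steady_clock;
        std::mt19937 rng(std::random_device{}());
        std::uniform_int_distribution<int> interval_us(1, 5000);  // up to 5ms
        for (;;) {  // runs until the thread is terminated
            const auto total = std::chrono::microseconds(interval_us(rng));
            const auto busy =
                std::chrono::duration_cast<std::chrono::microseconds>(total * p);
            const auto start = clock::now();
            while (clock::now() - start < busy) { /* spin: consume CPU */ }
            std::this_thread::sleep_for(total - busy);
        }
    }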
Each benchmark was run with increasing load values, in 5% steps, up to a maximum of 95%. For each of these configurations we ran the benchmark 16 times and computed the average and the 95% confidence intervals (represented as error bars). A vertical dashed line at a load of 90% is used as a reference for the case where resource reservation is enabled.
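For reference, the statistics behind the error bars can be computed as in the sketch below. The use of the Student t quantile for 15 degrees of freedom is our assumption, since the text does not state which quantile backs the 95% intervals.

    #include <cmath>
    #include <numeric>
    #include <utility>
    #include <vector>

    // Mean and half-width of the 95% confidence interval over a set of runs
    // (16 per configuration here), using the t quantile for n-1 = 15 degrees
    // of freedom.
    std::pair<double, double> mean_ci95(const std::vector<double>& runs) {
        const double n = static_cast<double>(runs.size());
        const double mean = std::accumulate(runs.begin(), runs.end(), 0.0) / n;
        double ss = 0.0;
        for (double x : runs) ss += (x - mean) * (x - mean);
        const double se = std::sqrt(ss / (n - 1.0)) / std::sqrt(n);  // std. error
        const double t975 = 2.131;  // t quantile, 15 degrees of freedom
        return {mean, t975 * se};
    }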
5.3 Overlay Evaluation
This section presents the results for the runs on the benchmarks designed to evaluate the P2P overlay performance, namely: membership (bind and rebind), query, and service deployment. The benchmarks that present latency and jitter use a logarithmic scale, which distorts the error bars; this should be kept in mind when evaluating the results.
5.3.1 Membership Performance
These experiments estimate the impact of increasing load on the membership bind and recovery times. The membership mechanisms run with maximum priority (48) on each peer of the overlay.
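On Linux, pinning a thread to this priority reduces to a call such as the following sketch (a privileged operation, requiring root or CAP_SYS_NICE):

    #include <pthread.h>
    #include <sched.h>

    // Run the calling thread under SCHED_FIFO with real-time priority 48, the
    // maximum used by the middleware; priorities 49-99 are left to kernel
    // threads such as IRQ handlers. Returns 0 on success.
    int set_max_middleware_priority() {
        sched_param sp{};
        sp.sched_priority = 48;
        return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    }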
Figure 5.5: Overlay bind (left) and rebind (right) performance. (Log-scale latency in ms versus load in %, with and without resource reservation.)
These two measurements are a key factor in the overall performance of the higher layers of the middleware, because a node is only fully functional when it is connected to the mesh. In the presence of a fault it is important to be able to quickly rebind and recover, to minimize the downtime of the low-level P2P services. This downtime can, in turn, become a source of latency for the high-level services, for example, the RPC service.
The membership bind time, depicted in Figure 5.5a, shows a linear increase in bind latency when resource reservation is disabled. This is expected: as the load increases, it creates additional interference with the threads of the mesh service. When the resource reservation mechanisms are enabled, the mesh service uses a portion of the resource reservation allocated to the runtime. The use of the resource reservation mechanisms allows for an almost constant latency, with some minor jitter at loads higher than 80%.

The rebind performance exhibits a behavior similar to the bind performance, although with lower latency at loads below 80%. As with the bind benchmark, enabling the resource reservation mechanisms allows for a near constant rebind latency with very small jitter.
5.3.2 Query Performance
The query performance is one of the most crucial aspects of any overlay implementation, because it is the basis of resource discovery. Figure 5.6 shows the result of
performing the PoL query with and without resource reservation.
Figure 5.6: Overlay query performance. (Log-scale latency in ms versus load in %, with and without resource reservation.)
The evaluation results show that, up to loads of 70%, the use of resource reservation introduces a small overhead, as shown by the higher latency. This is explained by the fact that the execution model uses a Thread-per-Connection policy (without a connection pool), where a peer creates a new connection (with the desired level of QoS) to perform a query. When a neighbor peer receives a new connection (from the discovery service), it has to spawn a new thread to deal with the request. This process is repeated until a peer is able to handle the query, or the root cell is reached and a failure message is sent back to the originating peer. When using resource reservation, the creation of each new thread must undergo an additional submission phase with the QoS daemon, and subsequently with the QoS infrastructure in Linux (control groups), causing the increase in latency. Nevertheless, from 70% to 95%, the resource reservation mechanism is able to provide a stable behavior. In contrast, in the absence of the resource reservation mechanism, the query latency reaches a maximum of 400ms when the peers are subjected to a load of 95%.
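A skeletal version of this Thread-per-Connection path is shown below; the Acceptor and Connection types are hypothetical stand-ins for the middleware's transport classes, and the per-connection thread spawn is the point where the extra QoS-daemon submission occurs when resource reservation is enabled.

    #include <thread>

    // Hypothetical transport types standing in for the middleware's own.
    struct Connection { void handle_query() {} };
    struct Acceptor   { Connection accept() { return {}; } };

    // Thread-per-Connection, without a pool: every accepted query connection
    // gets a freshly spawned handler thread, which resolves the query locally
    // or forwards it towards the root cell.
    void serve_queries(Acceptor& acceptor) {
        for (;;) {
            Connection conn = acceptor.accept();     // blocks for a new peer
            std::thread([conn]() mutable {
                conn.handle_query();
            }).detach();
        }
    }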
5.3.3 Service Deployment Performance
The quick allocation of services, and ultimately of resources, is a crucial aspect of scalable middleware infrastructures. Figure 5.7 shows the evaluation results for service deployment, with a varying number of replicas.
Figure 5.7: Overlay service deployment performance. (Log-scale latency in ms versus load in %, with and without resource reservation, for deployments with no FT and with 1, 2 and 4 replicas.)
The results show that, without resource reservation, the system exhibits a linear increase in deployment time starting at loads of 30%, following the increase of the load injected into the system. Associated with this high latency, the results show a high jitter throughout the service deployment. The maximum value registered was near 10s, for the deployment of the service with 4 replicas, without resource reservation, at a load of 95%. On the other hand, when the discovery service used the resource reservation mechanism, it exhibited a near constant behavior, showing only a small increase in deployment time when the service is deployed with FT. An increasing number of replicas brings additional latency to the deployment, as more queries need to be performed to discover additional sites on which to deploy the replicas. Naturally, the deployment of 4 replicas takes additional time, resulting in a maximum of around 100ms; still, a 100-fold improvement over the 4-replica deployment without resource reservation. To conclude, the results show negligible jitter in all deployment configurations when the resource reservation mechanism is activated.
5.4 Services Evaluation
Several aspects influence the behavior of the high-level services. Here, we present the two most important: the impact of the FT mechanisms on service latency, and the impact of resource reservation while enforcing FT policies. Additionally, we present results that characterize the impact of the presence of multiple clients, using RPC as a test case. The evaluation of the system ends with a preliminary comparison with other closely related middleware systems.
5.4.1 Impact of FT Mechanisms in Service Latency
These experiments estimate the impact of the FT mechanisms on service latency and rebind latency, as the peers are subjected to increasing load. The services run with maximum priority (48) and without resource reservation. To assess the scalability of the FT mechanisms we also vary the size of the replication group for the service through 2, 3 and 5 (one primary server plus 1, 2 and 4 replicas). For the rebind latency, we crash the primary server in the middle of the run. This is accomplished by invoking an auxiliary RPC object, initially loaded in every peer of the system. Finally, as a baseline reference, we present the results obtained with the same benchmarks but with all FT mechanisms disabled. In this case, no fault is injected, as no fault-tolerance is active.
The results for the runs can be seen in Figure 5.8. In general, the rebind latency presents a steeper increase when compared to the invocation latency, although the differences with varying numbers of replicas are masked by jitter. The rebind process involves several steps: failure detection; election of a new primary server; discovery of the new primary server; and transfer of lost data. In each step, the increasing load introduces a new source of latency and jitter that adds to the overall rebind time. In this implementation the client must use the discovery service of the mesh to find the new primary server. This step could be optimized, for example, by keeping track of the replicas at the client. Despite this, the rebind latency remains fairly constant up to loads of 40% to 45%. The minimum and maximum rebind latencies for the RPC, Actuator and Streaming services are, respectively: 5.9ms, 5.7ms, 7.2ms, and 2823ms, 2068ms, 1087ms.
The invocation latencies depicted in Figure 5.8 show that, up to loads of 35%, the FT mechanisms introduce low overhead and low jitter. In the case of the RPC benchmark, which uses a more complex replica synchronization protocol, the overhead remains a constant factor, in direct proportion to the number of replicas, relative to the baseline case (no FT). The Actuator and Streaming services, with their simple (or non-existent) data synchronization protocols, follow the baseline very closely. Despite this, the Streaming service is far more CPU intensive than the Actuator service and therefore shows more impact from increasing loads. The minimum and maximum invocation latencies measured for the
increasing loads. The minimum and maximum invocation latencies measured for the
RPC, Actuator and Streaming services are, respectively: 0.1ms, 1.5ms, 1.1ms, and
259ms, 19ms, 96ms.

Figure 5.8: Service rebind time (left) and latency (right). (Log-scale latency in ms versus load in %, for the RPC, Actuator and Stream services with 1, 2 and 4 replicas; invocation latency also includes a no-FT baseline.)
5.4.2 Real-Time and Resource Reservation Evaluation
In these runs we use the middleware’s QoS daemon to isolate the services, by reserving at least 10% of the available CPU time for the runtime that executes the service. The remaining 90% are used for operating system tasks and for the Load Generator service. Everything else is kept as in the scenario described for the previous set of runs.
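The kernel-level knob behind such a reservation can be illustrated with the sketch below. This is only an illustration of the underlying Linux interface, assuming a cgroup-v1 cpu controller mounted at /sys/fs/cgroup/cpu and a kernel with RT group scheduling; the QoS daemon's own interface (Chapter 3) is not shown.

    #include <fstream>
    #include <string>

    // Grant a cgroup 'pct' percent of the CPU's real-time bandwidth by
    // setting its RT runtime to that fraction of the RT period (e.g. a
    // 1000000us period and pct = 10 yield a 100000us runtime).
    bool reserve_rt_share(const std::string& group, long period_us, long pct) {
        const std::string base = "/sys/fs/cgroup/cpu/" + group + "/";
        std::ofstream period(base + "cpu.rt_period_us");
        std::ofstream runtime(base + "cpu.rt_runtime_us");
        if (!period || !runtime) return false;
        period  << period_us;
        runtime << period_us * pct / 100;
        return static_cast<bool>(period) && static_cast<bool>(runtime);
    }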
Impact of FT in Service Latency with Reservation
The results for the runs can be seen in Figure 5.9. The fact that the services are
now isolated, at least in terms of CPU, from the remainder of the system contributes to
their almost constant latencies and stability (low jitter) with increasing peer loads. The
invocation latency also shows the natural increase with the number of replicas. The
minimum and maximum rebind latencies for the RPC, Actuator and Streaming are,
respectively: 9.2ms, 10.2ms, 10.9ms, and 15.8ms, 18.7ms, 21.9ms. The minimum and
maximum invocation latencies for the RPC, Actuator and Streaming are, respectively:
0.1ms, 4.8ms, 1.1ms, and 1.0ms, 5.9ms, 1.9ms.
Relative to the previous set of runs, the latencies for low values of peer load with resource reservation activated are somewhat higher. For example, the ratios between the minimum rebind latencies with and without reservation for RPC, Actuator and Streaming are, respectively: 1.6, 1.8, and 1.5. This is explained by the overhead introduced by the reservation mechanisms (previously explained in Chapter 3). This overhead has a higher impact on the rebind latency than on the invocation latency, because the rebind process has a much shorter duration, and therefore the overhead represents a larger fraction of the total time. In other words, the overhead of the resource reservation setup on the invocation latency is amortized across the duration of the benchmark, for example, the 1000 calls performed to the RPC service.
Impact of Multiple Clients in RPC Latency
To evaluate the performance of the middleware in the presence of multiple clients with different priorities, we extended the RPC benchmark and introduced three service access points (SAPs) with distinct priorities, more precisely, 48, 24, and 0. The first two access points are served by a thread-per-connection model with scheduling class SCHED FIFO (and priorities 48 and 24, respectively). The remaining SAP is served by threads with scheduling class SCHED OTHER (with static priority 0). This benchmark allows us to
measure the impact of multiple clients on RT performance, especially the impact of low-priority clients on high-priority clients.

Figure 5.9: Rebind time and latency results with resource reservation. (Log-scale latency in ms versus load in %, for the RPC, Actuator and Stream services with 1, 2 and 4 replicas; invocation latency also includes a no-FT baseline.)
As with the previous RPC benchmark, the remote procedure increments a counter but, before returning the value, it continuously computes a batch of arithmetic operations for 10ms. The objective is to evaluate the RT performance of the Linux scheduler and of the control group infrastructure. We used three clients with priorities 48, 24 and 0, and performed 1000 RPC calls in each run, at an invocation rate of 25 calls per second (corresponding to a deadline of 40ms). To evaluate the impact of different load conditions, we ran the benchmark under three load generator configurations, using priorities 48, 24 and 0.
Figures 5.10a, 5.10c and 5.10e show the number of missed deadlines for each client without resource reservation, under an increasing load with priorities 0, 24 and 48, respectively, while figures 5.10b, 5.10d and 5.10f show the number of missed deadlines under the same conditions but with resource reservation enabled.
Without resource reservation, and if the load generator uses SCHED OTHER threads with priority 0, the Linux scheduler is able to avoid any deadline miss. This is the expected outcome for the clients using priorities 24 and 48, as they are served by SCHED FIFO threads that are always scheduled ahead of any other scheduling class. The client using priority 0 (and the associated SCHED OTHER threads) is also able to avoid any miss. This is explained by the quality of Linux’s fair scheduler, which is able to handle loads up to 95% of CPU time.
When the load generator uses SCHED FIFO threads, the behavior starts to degrade at loads higher than 35%. In both cases, the client with priority 0 misses approximately 70% of its deadlines when the load reaches 95% of CPU time. This is explained by the CPU starvation caused by the load generator’s high-priority RT threads. When the load generator uses priority 24, the client that uses priority 48 should not miss any deadline. However, this is not the case. The client that uses priority 48 also experiences missed deadlines, although on a much smaller scale. This is due to priority inversion at the network interface card driver (whose IRQ is handled by a high-priority kernel thread).
When the load generator uses priority 48 (figure 5.10e), this priority inversion is exacerbated. In addition, the contention between the load generator threads interferes with the remaining threads, due to their SCHED FIFO scheduling: threads of this type are only preempted by higher-priority threads and otherwise keep running until they voluntarily relinquish the CPU. As the load generator threads used the maximum
permitted priority, this caused unbounded latency in the middleware threads (even in the high-priority ones).

Figure 5.10: Missed deadlines without (left) and with (right) resource reservation. (Missed deadlines versus load in %, for clients with priorities 48, 24 and 0; rows correspond to load generator priorities 0, 24 and 48.)
With resource reservation, and when the load generator used priority 0, there were a few unexpected missed deadlines. We speculate that a possible explanation resides in the fact that we use a thread-per-connection strategy, which creates a new thread for each new connection, with each new thread being submitted to the QoS daemon. This adds latency to the service and can cause some missed deadlines in the first invocations from the client. When the load generator uses priorities 24 and 48, it worsens the latency associated with the acceptance of new threads by the QoS daemon. However, additional analysis of the Linux kernel is still required to validate this hypothesis.
Figures 5.11a, 5.11c and 5.11e show the invocation latencies for each client without resource reservation, under an increasing load with priorities 0, 24 and 48, while figures 5.11b, 5.11d and 5.11f show the invocation latencies with resource reservation enabled.
The load generator using priority 0, in figures 5.11a and 5.11b (without and with resource reservation, respectively), only interferes with invocations using priority 0. When the load generator uses SCHED FIFO threads with priorities 24 and 48, without the resource reservation mechanisms (figures 5.11c and 5.11e), the performance starts to degrade at 35% of load. The client using priority 48 in figure 5.11c should have a near constant invocation latency but, due to priority inversion, it presents a linear increase (although a much smaller one than the other two priorities). Figure 5.11e shows the expected behavior: the load generator threads (using priority 48) cause a gradual latency increase in all the clients.
Figures 5.11d and 5.11f show the middleware performance with resource reservation enabled under load priorities of 24 and 48, respectively. A scheduling artifact is noticeable for invocations using priority 0: instead of remaining constant, their latency decreases with the increasing presence of load. The workload that the RT threads of the load generator impose on the control group infrastructure, which is continuously forced to perform load balancing across the scheduling domains, causes a small jitter for the clients with priorities 24 and 48.
RPC Performance Comparison with Other Platforms
Figure 5.12 shows the measured invocation latencies for the RPC service as implemented in our middleware and in other mainstream platforms, using only one client and one server and making 1000 RPC invocations, at a rate of 250 invocations per second.
Figure 5.11: Invocation latency without (left) and with (right) resource reservation. (Log-scale latency in ms versus load in %, for clients with priorities 48, 24 and 0; rows correspond to load generator priorities 0, 24 and 48.)
Figure 5.12: RPC invocation latency compared with reference middleware systems (without fault-tolerance). (Log-scale latency in ms versus load in %, for Stheno with and without resource reservation, ICE, TAO and RMI.)
As expected, RMI, implemented with Java SE, has the worst behavior, with minimum and maximum latencies of, respectively, 0.3ms and 8.9ms. TAO was optimized for real-time tasks through the CORBA-RT extension, exhibiting minimum and maximum latencies of, respectively, 0.3ms and 6.5ms. TAO’s results were hampered by its strict adherence to the (bloated) IIOP specification. ICE, while less stable than TAO, is overall more efficient, with minimum and maximum latencies of, respectively, 0.1ms and 7.8ms. Despite the absence of RT support in ICE, its lightweight implementation (it does not use IIOP) provides good performance for low values of load. Our middleware implementation is able to offer minimum and maximum latencies of, respectively, 0.1ms and 14.6ms, without resource reservation. With resource reservation we achieve a maximum latency of just 0.1ms, by effectively isolating the service in terms of required resources. Our implementation without resource reservation exhibits a mixed performance. Up to 40% of load, it compares very favorably to the other platforms, but above this limit it starts to degrade more quickly. We attribute this behavior to the overhead associated with the time it takes to create a new thread to handle an incoming connection (a consequence of using the Thread-per-Connection strategy). Nevertheless, our performance is comparable with TAO’s. Above the 60% load threshold, all systems without resource reservation have their performance severely hampered by the Load Generator. Our system, with resource reservation enabled, is able to sustain high levels of performance by shielding the service from resource starvation, offering, at 95% of load, a 55-fold improvement over the second best system (TAO) and a 77-fold improvement over the worst system (RMI).
5.5 Summary
This chapter provided an in-depth look at the performance behavior of several key components of our middleware infrastructure, more precisely, the low-level overlay and the high-level service layer. The benchmarks presented focused on highlighting crucial characteristics of both levels. At the overlay level, we focused on three aspects: membership behavior, query performance, and service deployment time. At the service layer, we focused on exposing the effects of our lightweight FT infrastructure on service performance, as well as the impact of the resource reservation mechanisms on both RT and FT performance. To contextualize the performance of our system, we presented two additional evaluations. The first exhibits the effects of the presence of multiple clients (with distinct priorities) on the RPC service, a common usage pattern for this type of service, as in [3]. The second presented an RT performance comparison with other closely related systems.
–Success consists in being successful, not in having potential for success. Any wide piece of ground is the potential site of a palace, but there’s no palace till it’s built.
Fernando Pessoa

6 Conclusions and Future Work
6.1 Conclusions
In this thesis we have designed and implemented Stheno which, to the best of our knowledge, is the first middleware system to seamlessly integrate fault-tolerance and real-time in a peer-to-peer infrastructure. Our approach was motivated by the lack of support in current solutions for the timing, reliability and physical deployment characteristics of our target systems, as shown in the survey of related work.
Our hypothesis is that it is possible to effectively and efficiently integrate real-time support with fault-tolerance mechanisms in a middleware system using an approach fundamentally distinct from current solutions. Our solution involves: (a) implementing FT support at a low level in the middleware, albeit on top of a suitable network abstraction to maintain transparency; (b) using the peer-to-peer mesh services to support FT; and (c) supporting real-time services through kernel-level resource reservation mechanisms.
The proposed architecture offers a flexible design that is able to support different fault-tolerance policies, including semi-active and passive replication. The runtime’s programming model details the most important interfaces and their interactions. It was designed to provide the necessary infrastructure for allowing users and services to interact with runtimes that are not in the same address space, thus allowing for a reduction in the resource footprint. Furthermore, it also provides support for additional languages.
We provide a complete implementation of a P2P overlay for efficient, transparent and configurable fault-tolerance, and support real-time through the use of resource reservation, network communication demultiplexing, and multi-core computing. The support for resource reservation was achieved through the implementation of a QoS daemon that manages and interacts with the low-level QoS infrastructure present in the Linux kernel. The multiplexing of requests can force high-priority requests to miss their deadlines, because of the FIFO nature of network communications. To avoid this, our implementation allows services to define multiple access points, with each one specifying a priority and a threading strategy. Last, to properly integrate resource reservation and the different threading strategies in a multi-core computing context, we have designed a novel design pattern, the Execution Model/Context. Fault-tolerance is efficiently implemented using the P2P overlay, and the fault-tolerance strategy and the number of replicas are configurable per service. The current prototype has a code base of almost 1000 files and contains around 55000 lines of code.
The experiments show that Stheno meets and exceeds the target system requirements for end-to-end latency and fail-over latency, thus validating our approach of implementing fault-tolerance mechanisms directly over the peer-to-peer overlay infrastructure. In particular, it is possible to isolate real-time tasks from system overhead, even in the presence of high loads and faults. Although support for proactive fault-tolerance is still absent from the current implementation, we were able to mitigate the impact of faults in the system by providing proper isolation between the low-level P2P services and the user’s high-level services. This was mainly accomplished with the introduction of separate communication channels for the two service types. We are able to maintain performance in user services even in the presence of major mesh rebinds.
Taken as a whole, these evaluation results are promising and support the idea that the approach followed is valid. In summary, to the best of our knowledge, Stheno is the first system that supports:
Configurable Architecture. The architecture of our middleware platform is open, in the sense that it offers an adjustable and modular design that is able to accommodate a wide range of application domains. Instead of focusing on a specific application domain, such as RPC, we designed a service-oriented platform that offers a computational environment that seamlessly integrates both fault-tolerance and real-time. Furthermore, Stheno supports configurability at multiple levels: P2P, real-time and fault-tolerance.
P2P. Our infrastructure, based on pluggable P2P overlays, offers a resilient behavior that can be adjusted to meet the overall system requirements. The selection between different overlay topologies, structured or unstructured, allows a software architect to balance resource consumption, overall performance and resiliency.
Fault-Tolerance. We have implemented a lightweight fault-tolerance infrastructure directly in the P2P overlay, currently supporting semi-active replication, which is able to provide minimal overhead and thus enhance real-time performance. Nevertheless, great effort was spent on allowing the support of additional replication policies, such as passive and active replication.
Real-Time Behavior. Our platform is able to offer resource reservation through the implementation of a QoS daemon that leverages the available resources and interacts with the low-level resource reservation infrastructure provided by the Linux kernel. Furthermore, our architecture decouples control and data information flows through the introduction of distinct service access points (SAPs). These SAPs are served by a configurable threading strategy with an associated priority. Last, we introduced a novel design pattern, the Execution Model/Context, that is able to integrate resource reservation with distinct threading strategies, namely, Leader-Followers [11], Thread-Pool [114], Thread-per-Connection [12] and Thread-per-Request [13], with a focus on support for multi-core computing.
6.2 Future Work
The work accomplished in this thesis opens paths in several research domains.
Real-Time. An interesting challenge in the RT domain is to enhance the middleware with support for EDF scheduling [117] and to study the limitations of implementing hard real-time tasks in a general-purpose operating system, such as Linux. Derivative work would be to study the implications of isolating low-level hardware interrupts and to measure the impact of different runtimes and periods on EDF tasks.

An in-depth study of the impact of the CPU architecture, especially cache topology, on real-time performance and resource reservation behavior would also contribute to improving the deployment of distributed RT systems.
Fault-Tolerance. An interesting idea, which originated from the collaboration with Prof. Priya Narasimhan, consists in providing support for multiple overlays to further enhance dependability. This opens several challenges: (a) correlating faults from different overlays with the goal of identifying root causes; (b) choosing the optimal deployment site for service bootstrap; (c) enhancing the current state of the art in fault-tolerance with support for inter-overlay replication groups, that is, the placement of replicas across a distinct set of overlays; and (d) identifying nodes that are common to several overlays, as they diminish FT capabilities.
Currently, we use a reactive fault-detection model that only acts after a fault has happened. Using a proactive approach, the runtime could predict imminent faults and take actions to eliminate, or at least minimize, the consequences of such events. A possible way to accomplish this involves combining real-time resource monitoring analysis with gossip-based network monitoring.
The addition of new replicas to a replication group still poses a significant challenge in distributed RT systems. The disturbance caused by the initialization process of the new replica can be mitigated by a two-phase process. In the first phase, if there is no checkpoint available, the replication group has to create one. The existing replicas then split the checkpoint state among themselves, thereby relieving the primary of further overhead. In the second phase, all the replicas transfer their portion of the checkpoint state to the joining replica. This ends with the primary providing the delta between the checkpoint state and the current state. This would greatly minimize the interference on the primary node, especially for very large states.
Byzantine Fault-Tolerance. The introduction of Byzantine Fault-Tolerance (BFT) still poses a significant challenge. The integration of BFT with RT would represent the next evolution in terms of FT. We would like to assess the impact of recent BFT replication protocols, such as Zyzzyva [137] and Aardvark [138], on real-time performance.
Virtualization. Current virtualization solutions focus on providing on-demand Virtual Machines (VMs) with QoS to the end-user, such as Amazon EC2. A more low-level approach can be taken by using lightweight VMs to provide a virtualized environment for runtime (user) services, allowing the support of legacy services. This also allows the migration of services without having to implement FT awareness in the service itself. A second benefit of having support for virtualized services is the inherent support for providing strong isolation to services. This can be used as a way to prevent malicious servers from compromising the entire node.

A broad study on the possibility of achieving RT performance on the currently available hypervisors is needed to assess the feasibility of RT virtualized services. To the best of our knowledge, no RT support has ever been attempted in lightweight virtualization hypervisors, such as the Kernel Virtual-Machine (KVM) [108]. We speculate that the use of CPU isolation could make this feasible, possibly allowing the introduction of RT semantics to the Infrastructure as a Service (IaaS) paradigm. The recent developments in virtualization at the operating system level [139], by the Linux-CR project [140], could represent an interesting alternative to lightweight virtualization hypervisors. Because no latency is added to the middleware runtime, the real-time behavior should be preserved. Furthermore, only the state of the application is serialized, resulting in less overhead for the operating system and in smaller state images, which should provide a more efficient way of migrating runtimes between nodes, with a subsequent improvement in recovery time.
6.3 Personal Notes
The main motivation for undertaking this PhD was the desire to solve the problems created by the requirements of our target systems, and it can be summarized with the following question: “Can we opportunistically leverage and integrate these proven strategies to simultaneously support soft-RT and FT to meet the needs of our target systems even under faulty conditions?”
Doing research on middleware systems is a difficult, yet rewarding, task. We feel that all the major goals of this PhD were met, and the author has gained invaluable insight into the vast and complex domain of distributed computing.
From a computer science standpoint, the full implementation of a new P2P middleware platform that is able to offer seamless integration of both real-time and fault-tolerance was only possible with a thorough analysis of all the mechanisms involved, as well as of their inter-dependencies. Eventually, this work will lead to further research on operating systems, parallel and distributed computing, and software engineering.
From the early stages of this PhD there has been an increasing focus on the support for adaptive behavior. The ultimate goal is to balance fault-tolerance assurances with real-time performance, in order to meet the requirements of the target system. One of the most prevalent applications for this type of research is Cloud Computing. We hope that our work provides an open adaptive framework that allows researchers and developers to customize the behavior of the middleware to best suit their needs, while benefiting from a resilient and distributed network layer built on top of P2P overlays.
The evolution of middleware systems, and in particular of those that pursue the simultaneous support of both real-time and fault-tolerance, has been gradually focusing on efficient implementations of Byzantine fault-tolerance. The practical implementation of such systems constitutes a promising and exciting research field. Another promising research field is related to the introduction of hard real-time support in general-purpose middleware systems while supporting the dynamic insertion and removal of services. I hope to have the opportunity to contribute to these exciting research challenges.
References
[1] Paulo Veríssimo and Luís Rodrigues. Distributed Systems for System Architects.
Kluwer Academic Publishers, Norwell, MA, USA, 2001.
[2] Kenneth Birman. Guide to Reliable Distributed Systems. Texts in Computer
Science. Springer, 2012.
[3] Douglas Schmidt, David Levine, and Sumedh Mungee. The Design of the TAO
Real-Time Object Request Broker. Computer Communications, 21(4):294–324,
1998.
[4] Xavier Défago. Agreement-Related Problems: from Semi-Passive Replication
to Totally Ordered Broadcast. PhD thesis, École Polytechnique Fédérale de
Lausanne, August 2000.
[5] EFACEC, S.A. EFACEC Markets. http://www.efacec.pt/
presentationlayer/efacec_mercado_00.aspx?idioma=2&area=8&local=
302&mercado=55. [Online; accessed 17-October-2011].
[6] Rolando Martins, Priya Narasimhan, Luís Lopes, and Fernando Silva. Lightweight
Fault-Tolerance for Peer-to-Peer Middleware. In The First International Work-
shop on Issues in Computing over Emerging Mobile Networks (C-EMNs’10),
In Proceedings of the 29th IEEE Symposium on Reliable Distributed Systems
(SRDS’10), pages 313–317, November 2010.
[7] Bela Ban. Design and Implementation of a Reliable Group Communication
Toolkit for Java. Technical report, Cornell University, September 1998.
[8] Chen Lee, Ragunathan Rajkumar, and Cliff Mercer. Experiences with Processor
Reservation and Dynamic QOS in Real-Time Mach. Proceedings of Multimedia
Japan 96, April 1996.
[9] Hideyuki Tokuda, Tatsuo Nakajima, and Prithvi Rao. Real-Time Mach: Towards
a Predictable Real-Time System. In USENIX MACH Symposium, pages 73–82,
October 1990.
[10] Luigi Palopoli, Tommaso Cucinotta, Luca Marzario, and Giuseppe Lipari.
AQuoSA - Adaptive Quality of Service Architecture. Software: Practice and
Experience, 39(1):1–31, April 2009.
[11] Douglas Schmidt, Carlos O’Ryan, Irfan Pyarali, Michael Kircher, and Frank
Buschmann. Leader/Followers: A Design Pattern for Efficient Multi-threaded
Event Demultiplexing and Dispatching. In Proceedings of the 7th Conference on
Pattern Languages of Programs (PLoP’01), August 2001.
[12] Douglas Schmidt and Steve Vinoski. Comparing Alternative Programming
Techniques for Multithreaded CORBA Servers. C++ Report, 8(7):47–56, July
1996.
[13] Douglas Schmidt and Charles Cranor. Half-Sync/Half-Async: An Architectural
Pattern for Efficient and Well-Structured Concurrent I/O. In Proceedings of the
2nd Annual Conference on the Pattern Languages of Programs (PLoP’95), pages
1–10, 1995.
[14] Priya Narasimhan, Tudor Dumitras, Aaron Paulos, Soila Pertet, Carlos Reverte,
Joseph Slember, and Deepti Srivastava. MEAD: Support for Real-Time Fault-
Tolerant CORBA: Research Articles. Concurrency and Computation: Practice &
Experience, 17(12):1527–1545, October 2005.
[15] Licínio Oliveira, Luís Lopes, and Fernando Silva. P3: Parallel Peer to Peer - An
Internet Parallel Programming Environment. In Workshop on Web Engineering &
Peer-to-Peer Computing, part of Networking 2002, volume 2376 of Lecture Notes
in Computer Science, pages 274–288. Springer-Verlag, May 2002.
[16] James E. White. A High-Level Framework for Network-Based Resource Sharing.
In Proceedings of the June 7-10, 1976, National Computer Conference and
Exposition (AFIPS’76), pages 561–570, New York, NY, USA, 1976. ACM.
[17] Andrew D. Birrell and Bruce Jay Nelson. Implementing Remote Procedure Calls.
ACM Transactions on Computer Systems, 2(1):39–59, February 1984.
[18] Object Management Group. CORBA Specification. OMG Technical Commit-
tee Document: http://www.omg.org/cgi-bin/doc?1991/91-08-01, Aug 1991.
[Online; accessed 17-October-2011].
[19] Ann Wollrath, Roger Riggs, and Jim Waldo. A Distributed Object Model for the
Java System. Computing Systems, 9(4):265–290, 1996.
[20] Michi Henning. The Rise and Fall of CORBA. Communications of the ACM,
51(8):52–57, August 2008.
[21] Enterprise Team, Vlada Matena, Eduardo Pelegri-Llopart, Mark Hapner, James
Davidson, and Larry Cable. Java 2 Enterprise Edition Specifications. Addison-
Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.
[22] A. Wigley, M. Sutton, S. Wheelwright, R. Burbidge, and R. Mcloud. Microsoft
.Net Compact Framework: Core Reference. Microsoft Press, Redmond, WA, USA,
2002.
[23] Don Box, David Ehnebuske, Gopal Kakivaya, Andrew Layman, Noah Mendel-
sohn, Henrik Nielsen, Satish Thatte, and Dave Winer. Simple Object Access
Protocol (SOAP) 1.1. W3c note, World Wide Web Consortium, May 2000.
[Online; accessed 17-October-2011].
[24] Marc Fleury and Francisco Reverbel. The JBoss Extensible Server. In
Proceedings of the 4th ACM/IFIP/USENIX International Middleware Conference
(Middleware’03), pages 344–373, New York, NY, USA, 2003. Springer-Verlag New
York, Inc.
[25] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,
Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall,
and Werner Vogels. Dynamo: Amazon’s Highly Available Key-value Store.
In Proceedings of the 21st ACM Symposium on Operating Systems Principles
(SOSP’07), pages 205–220, October 2007.
[26] Yan Huang, Tom Fu, Dah-Ming Chiu, John Lui, and Cheng Huang. Challenges,
Design and Analysis of a Large-Scale P2P-VOD System. In Proceedings of the
ACM SIGCOMM Conference on Data Communication (SIGCOMM ’08), pages
375–388, New York, NY, USA, August 2008. ACM.
[27] Edward Curry. Message-Oriented Middleware, pages 1–28. John Wiley & Sons,
Ltd, 2005.
[28] Tibco. Tibco Rendezvous. http://www.tibco.com/products/soa/messaging/
rendezvous/. [Online; accessed 17-October-2011].
[29] IBM. WebSphere MQ. http://www-01.ibm.com/software/integration/wmq/.
[Online; accessed 17-October-2011].
[30] Richard Monson-Haefel and David Chappell. Java Message Service. O’Reilly &
Associates, Inc., Sebastopol, CA, USA, 2000.
[31] JCP. JAIN SLEE v1.1 Specification. JCP Document: http://download.
oracle.com/otndocs/jcp/jain_slee-1_1-final-oth-JSpec/, Jul 2008. [On-
line; accessed 17-October-2011].
[32] Mobicents. The Open Source SLEE and SIP Server. http://www.mobicents.
org/. [Online; accessed 17-October-2011].
[33] Object Management Group. OpenDDS. http://www.opendds.org/. [Online;
accessed 17-October-2011].
[34] RTI. Connext DDS. http://www.rti.com/products/dds/index.html. [Online;
accessed 17-October-2011].
[35] Douglas C. Schmidt and Hans van’t Hag. Addressing the Challenges of Mission-Critical Information Management in Next-Generation Net-Centric Pub/Sub Systems with OpenSplice DDS. In IPDPS, pages 1–8, 2008.
[36] Object Management Group. Fault Tolerant CORBA Specification. OMG Techni-
cal Committee Document: http://www.omg.org/spec/FT/1.0/PDF/, May 2010.
[Online; accessed 17-October-2011].
[37] Tarek Abdelzaher, Scott Dawson, Wu Feng, Farnam Jahanian, S. Johnson, Ashish
Mehra, Todd Mitton, Anees Shaikh, Kang Shin, Zhiheng Wang, Hengming Zou,
M. Bjorkland, and Pedro Marron. ARMADA Middleware and Communication
Services. Real-Time Systems, 16:127–153, 1999.
[38] H. Kopetz, A. Damm, C. Koza, M. Mulazzani, W. Schwabl, C. Senft, and
R. Zainlinger. Distributed Fault-Tolerant Real-Time Systems: the Mars Ap-
proach. Micro, IEEE, 9(1):25–40, February 1989.
[39] Kane Kim. ROAFTS: A Middleware Architecture for Real-Time Object-Oriented
Adaptive Fault Tolerance Support. In Proceedings of the 3rd IEEE International
High-Assurance Systems Engineering Symposium (HASE’98), page 50. IEEE
Computer Society, November 1998.
[40] Eltefaat Shokri, Patrick Crane, Kane Kim, and Chittur Subbaraman. Archi-
tecture of ROAFTS/Solaris: A Solaris-Based Middleware for Real-Time Object-
Oriented Adaptive Fault Tolerance Support. In COMPSAC, pages 90–98. IEEE
Computer Society, 1998.
[41] Kane Kim and Chittur Subbaraman. Fault-Tolerant Real-Time Objects. Com-
munications of the ACM, 40(1):75–82, 1997.
[42] Kane Kim and Chittur Subbaraman. A Supervisor-Based Semi-Centralized
Network Surveillance Scheme and the Fault Detection Latency Bound. In
Proceedings of the 16th Symposium on Reliable Distributed Systems (SRDS’97),
pages 146–155, October 1997.
[43] Manas Saksena, James da Silva, and Ashok Agrawala. Design and Implementation of Maruti-II. In Sang Son, editor, Advances in Real-Time Systems, pages 73–102.
Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1995.
[44] David Powell, Gottfried Bonn, D. Seaton, Paulo Veríssimo, and François
Waeselynck. The Delta-4 Approach to Dependability in Open Distributed
Computing Systems. In Proceedings of the 18th Annual International Symposium
on Fault-Tolerant Computing (FTCS’88), pages 246–251, Tokyo, Japan, 1988.
IEEE Computer Society Press.
[45] P. Barrett, P. Bond, A. Hilborne, Luís Rodrigues, D. Seaton, N. Speirs, and
Paulo Veríssimo. The Delta-4 Extra Performance Architecture (XPA). 20th
International Symposium on Fault-Tolerant Computing, pages 481–488, 1990.
[46] James Gosling and Greg Bollella. The Real-Time Specification for Java. Addison-
Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.
[47] Greg Bollella, James Gosling, Ben Brosgol, P. Dibble, Steve Furr, David Hardin,
and Mark Turnbull. The Real-Time Specification for Java. The Java Series.
Addison-Wesley, 2000.
[48] Peter Dibble. Real-Time Java Platform Programming. BookSurge Publishing,
2nd edition, 2008.
[49] Joshua Auerbach, David Bacon, Daniel Iercan, Christoph Kirsch, V. Rajan,
Harald Roeck, and Rainer Trummer. Java Takes Flight:Time-Portable Real-
Time Programming with Exotasks. In Proceedings of the 2007 ACM SIG-
PLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded
Systems (LCTES’07), pages 51–62, New York, NY, USA, 2007. ACM.
[50] Joshua Auerbach, David Bacon, Bob Blainey, Perry Cheng, Michael Dawson,
Mike Fulton, David Grove, Darren Hart, and Mark Stoodley. Design and
Implementation of a Comprehensive Real-time Java Virtual Machine. In
Proceedings of the 7th ACM & IEEE International Conference on Embedded
Software (EMSOFT’07), pages 249–258, New York, NY, USA, 2007. ACM.
[51] Introduction to WebLogic Real-Time. http://docs.oracle.com/cd/E13221_
01/wlrt/docs10/pdf/intro_wlrt.pdf. [Online; accessed 17-October-2011].
[52] Silvano Maffeis. Adding Group Communication and Fault-Tolerance to CORBA.
In USENIX Conference on Object-Oriented Technologies, 1995.
[53] Alexey Vaysburd and Kenneth Birman. Building Reliable Adaptive Distributed
Objects with the Maestro Tools. In Proceedings of Workshop on Dependable
Distributed Object Systems (OOPSLA’97), 1997.
[54] Yansong Ren, David Bakken, Tod Courtney, Michel Cukier, David Karr, Paul
Rubel, Chetan Sabnis, William Sanders, Richard Schantz, and Mouna Seri.
AQuA: An Adaptive Architecture that Provides Dependable Distributed Objects.
IEEE Trans. Comput., 52:31–50, January 2003.
[55] Balachandran Natarajan, Aniruddha Gokhale, Shalini Yajnik, and Douglas
Schmidt. DOORS: Towards High-Performance Fault Tolerant CORBA. In
Proceedings of International Symposium on Distributed Objects and Applications
(DOA’00), pages 39–48, 2000.
[56] Silvano Maffeis and Douglas Schmidt. Constructing Reliable Distributed Commu-
nications Systems with CORBA. IEEE Communications Magazine, 35(2):56–61,
February 1997.
[57] Robbert van Renesse, Kenneth Birman, and Silvano Maffeis. Horus: A Flexible
Group Communication System. Communications of the ACM, 39(4):76–83,
November 1996.
[58] Kenneth Birman and Robert van Renesse. Reliable Distributed Computing with
the Isis Toolkit. IEEE Computer Society Press, 1994.
[59] Robbert van Renesse, Kenneth Birman, Mark Hayden, Alexey Vaysburd, and
David Karr. Building adaptive systems using Ensemble. Software–Practice and
Experience, 28(8):963–979, August 1998.
[60] Thomas C. Bressoud. TFT: A Software System for Application-Transparent Fault
Tolerance. In Proceedings of the 28th Annual International Symposium on Fault-
Tolerant Computing (FTCS’98), pages 128–137, 1998.
[61] Richard Schantz, Joseph Loyall, Craig Rodrigues, Douglas Schmidt, Yamuna
Krishnamurthy, and Irfan Pyarali. Flexible and Adaptive QoS Control for
Distributed Real-Time and Embedded Middleware. In Markus Endler and
Douglas Schmidt, editors, Proceedings of the ACM/IFIP/USENIX International
Middleware Conference (Middleware’03), volume 2672 of Lecture Notes in Com-
puter Science, pages 374–393. Springer, June 2003.
[62] Douglas Schmidt and Fred Kuhns. An Overview of the Real-Time CORBA
Specification. IEEE Computer, 33(6):56–63, June 2000.
[63] IETF. An Architecture for Differentiated Services. http://www.ietf.org/rfc/
rfc2475.txt. [Online; accessed 17-October-2011].
[64] Lixia Zhang, Stephen Deering, Deborah Estrin, Scott Shenker, and Daniel
Zappala. RSVP: A New Resource ReSerVation Protocol. IEEE Network, 7(5):8–
18, 1993.
[65] Nanbor Wang, Christopher Gill, Douglas Schmidt, and Venkita Subramonian.
Configuring Real-Time Aspects in Component Middleware. In CoopIS/DOA/OD-
BASE (2), pages 1520–1537, 2004.
[66] Friedhelm Wolf, Jaiganesh Balasubramanian, Aniruddha Gokhale, and Douglas
Schmidt. Component Replication Based on Failover Units. In Proceedings of
the 15th IEEE International Conference on Embedded and Real-Time Computing
Systems and Applications (RTCSA’09), pages 99–108, August 2009.
[67] Nanbor Wang, Douglas Schmidt, Aniruddha Gokhale, Christopher Gill, Balachan-
dran Natarajan, Craig Rodrigues, Joseph Loyall, and Richard Schantz. Total
Quality of Service Provisioning in Middleware and Applications. Microprocessors
and Microsystems, 26:9–10, 2003.
[68] Richard Schantz, Joseph Loyall, Craig Rodrigues, Douglas Schmidt, Yamuna
Krishnamurthy, and Irfan Pyarali. Flexible and adaptive QoS Control for
Distributed Real-Time and Embedded Middleware. In Proceedings of the ACM/I-
FIP/USENIX 2003 International Conference on Middleware (Middleware’03),
pages 374–393, New York, NY, USA, June 2003. Springer-Verlag New York, Inc.
[69] Fabio Kon, Fabio Costa, Gordon Blair, and Roy Campbell. The Case for Reflective
Middleware. Communications of the ACM, 45:33–38, June 2002.
[70] Jurgen Schonwalder, Sachin Garg, Yennun Huang, Aad van Moorsel, and Shalini
Yajnik. A Management Interface for Distributed Fault Tolerance CORBA
services. In Proceedings of the IEEE Third International Workshop on Systems
Management (SMW ’98), pages 98–107, Washington, DC, USA, April 1998.
[71] Pascal Felber, Benoit Garbinato, and Rachid Guerraoui. The Design of a CORBA
Group Communication Service. In Proceedings of the 15th Symposium on Reliable
Distributed Systems (SRDS’96), Washington, DC, USA, October 1996. IEEE
Computer Society.
[72] Graham Morgan, Santosh Shrivastava, Paul Ezhilchelvan, and Mark Little.
Design and Implementation of a CORBA Fault-Tolerant Object Group Service.
In Proceedings of the 2nd IFIP WG 6.1 International Working Conference on
Distributed Applications and Interoperable Systems (DAIS’99), pages 361–374,
Deventer, The Netherlands, The Netherlands, 1999. Kluwer, B.V.
[73] Object Management Group. Real-time CORBA Specification. OMG Technical
Committee Document: http://www.omg.org/spec/RT/1.2/PDF, January 2005.
[Online; accessed 17-October-2011].
[74] Jaiganesh Balasubramanian. FLARe: a Fault-tolerant Lightweight Adaptive
Real-time Middleware for Distributed Real-time and Embedded Systems. In
Proceedings of the 4th Middleware Doctoral Symposium (MDS’07), pages 17:1–
17:6, New York, NY, USA, November 2007. ACM.
[75] Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg. The
Primary-Backup Approach. ACM Press/Addison-Wesley Publishing Co., New
York, NY, USA, 1993.
[76] Object Management Group. Light Weight CORBA Component Model Revised
Submission. OMG Technical Committee Document: http://www.omg.org/
spec/CCM/3.0/PDF/, June 2002. [Online; accessed 17-October-2011].
[77] Jaiganesh Balasubramanian, Aniruddha Gokhale, Abhishek Dubey, Friedhelm
Wolf, Chenyang Lu, Christopher Gill, and Douglas Schmidt. Middleware
for Resource-Aware Deployment and Configuration of Fault-Tolerant Real-time
Systems. In Marco Caccamo, editor, Proceedings of the 16th IEEE Real-Time
and Embedded Technology and Applications Symposium (RTAS’10), pages 69–78.
IEEE Computer Society, April 2010.
[78] Fred Schneider. Replication Management using the State-machine Approach.
ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, 1993.
[79] Louise Moser, P. Michael Melliar-Smith, and Priya Narasimhan. A Fault Toler-
ance Framework for CORBA. In Proceedings of the 29th Annual International
Symposium on Fault-Tolerant Computing (FTCS’99), Washington, DC, USA,
1999. IEEE Computer Society.
[80] Priya Narasimhan, Louise Moser, and P. Michael Melliar-Smith. Strongly
Consistent Replication and Recovery of Fault-Tolerant CORBA Applications.
Computer System Science and Engineering Journal, 17, 2002.
[81] Justin Frankel and Tom Pepper. Gnutella Specification. http://www.
stanford.edu/class/cs244b/gnutella_protocol_0.4.pdf. [Online; accessed
17-October-2011].
[82] Yoram Kulbak and Danny Bickson. The eMule Protocol Specification, January
2005. [Online; accessed 17-October-2011].
[83] PPLive. PPTV. http://www.pplive.com/. [Online; accessed 17-October-2011].
[84] Mário Ferreira, João Leitão, and Luís Rodrigues. Thicket: A Protocol for Building
and Maintaining Multiple Trees in a P2P Overlay. In Proceedings of the 29th
International Symposium on Reliable Distributed Systems (SRDS’10), pages 293–
302. IEEE, November 2010.
[85] Zhi Li and Prasant Mohapatra. QRON: QoS-aware Routing in Overlay Networks.
IEEE Journal on Selected Areas in Communications, 22(1):29–40, January 2004.
[86] Eric Wohlstadter, Stefan Tai, Thomas Mikalsen, Isabelle Rouvellou, and Premku-
mar Devanbu. GlueQoS: Middleware to Sweeten Quality-of-Service Policy
Interactions. In Proceedings of the 26th International Conference on Software
Engineering (ICSE’04), pages 189–199, May 2004.
[87] Anthony Rowstron, Anne-Marie Kermarrec, Miguel Castro, and Peter Druschel.
SCRIBE: The Design of a Large-Scale Event Notification Infrastructure. In
Proceedings of the 3rd International COST264 Workshop on Networked Group
Communication (NGC’01), pages 30–43, November 2001.
[88] A. Rowstron and P. Druschel. Pastry: Scalable, Decentralized Object Location,
and Routing for Large-Scale Peer-to-Peer Systems. In Proceedings of the
2nd ACM/IFIP/USENIX International Middleware Conference (Middleware’01),
pages 329–350, November 2001.
[89] Leslie Lamport. The Part-Time Parliament. ACM Transactions on Computer
Systems, 16:133–169, May 1998.
[90] Peter Pietzuch and Jean Bacon. Hermes: A Distributed Event-Based Middleware
Architecture. In ICDCS Workshops, pages 611–618. IEEE Computer Society, July
2002.
[91] Ben Zhao, Ling Huang, Jeremy Stribling, Sean Rhea, Anthony Joseph, and John
Kubiatowicz. Tapestry: A Resilient Global-Scale Overlay for Service Deployment.
IEEE Journal on Selected Areas in Communications, June 2003.
[92] David Anderson, Jeff Cobb, Eric Korpela, Matt Lebofsky, and Dan Werthimer.
SETI@home: an Experiment in Public-Resource Computing. Communications of
the ACM, 45:56–61, November 2002.
[93] Bjorn Knutsson, Honghui Lu, Wei Xu, and Bryan Hopkins. Peer-to-peer Support
for Massively Multiplayer Games. In Proceedings of the 23rd Annual Joint Con-
ference of the IEEE Computer and Communications Societies (INFOCOM’04),
volume 1, March 2004.
[94] Gilles Fedak, Cecile Germain, Vincent Neri, and Franck Cappello. XtremWeb:
a Generic Global Computing System. In Proceedings of the 1st IEEE/ACM
International Symposium on Cluster Computing, pages 582–587, May 2001.
[95] Andrew Chien, Brad Calder, Stephen Elbert, and Karan Bhatia. Entropia:
Architecture and Performance of an Enterprise Desktop Grid System. Journal
Parallel Distributed Computing, 63:597–610, May 2003.
[96] David Anderson. BOINC: A System for Public-Resource Computing and Storage.
In Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing
(GRID’04), pages 4–10, Washington, DC, USA, November 2004. IEEE Computer
Society.
[97] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on
Large Clusters. Communications of the ACM, 51:107–113, January 2008.
[98] Fabrizio Marozzo, Domenico Talia, and Paolo Trunfio. Adapting MapReduce for
Dynamic Environments Using a Peer-to-Peer Model. In Proceedings of the 1st
Workshop on Cloud Computing and its Applications (CCA’08), Chicago, USA,
October 2008.
[99] Sean Rhea, Brighten Godfrey, Brad Karp, John Kubiatowicz, Sylvia Ratnasamy,
Scott Shenker, Ion Stoica, and Harlan Yu. OpenDHT: A Public DHT Service
and Its Uses. In Roch Guérin, Ramesh Govindan, and Greg Minshall, editors,
Proceedings of the ACM SIGCOMM Conference on Applications, Technologies,
Architectures, and Protocols for Computer Communications (SIGCOMM'05),
pages 73–84. ACM, August 2005.
[100] Philip Bernstein and Nathan Goodman. An Algorithm for Concurrency Control
and Recovery in Replicated Distributed Databases. ACM Transactions on
Database Systems, 9(4):596–615, 1984.
[101] Bruce Lindsay, Patricia Selinger, Cesare Galtieri, Jim Gray, Raymond Lorie,
T. G. Price, Franco Putzolu, and Bradford Wade. Notes on Distributed Databases.
Technical report, IBM San Jose Research Laboratory, San Jose, CA, July 1979.
[102] Rolando Martins, Luís Lopes, and Fernando Silva. A Peer-to-Peer Middleware
Platform for QoS and Soft Real-Time Computing. Technical Report DCC-2008-02,
Departamento de Ciência de Computadores, Faculdade de Ciências, Universidade
do Porto, April 2008. Available at http://www.dcc.fc.up.pt/dcc/Pubs/TReports/.
[103] Rolando Martins, Luís Lopes, and Fernando Silva. A Peer-To-Peer Middleware
Platform for Fault-Tolerant, QoS, Real-Time Computing. In Proceedings of the
2nd Workshop on Middleware-Application Interaction, part of DisCoTec 2008,
pages 1–6, New York, NY, USA, June 2008. ACM.
[104] Rolando Martins, Priya Narasimhan, Luís Lopes, and Fernando Silva. On
the Impact of Fault-Tolerance Mechanisms in a Peer-to-Peer Middleware with
QoS Constraints. Technical Report DCC-2010-02, Departamento de Ciência
de Computadores, Faculdade de Ciências, Universidade do Porto, April 2010.
Available at http://www.dcc.fc.up.pt/dcc/Pubs/TReports/.
[105] Aniruddha Gokhale, Balachandran Natarajan, Douglas Schmidt, and Joseph
Cross. Towards Real-Time Fault-Tolerant CORBA Middleware. Cluster
Computing, 7(4):331–346, September 2004.
[106] Michi Henning. A New Approach to Object-Oriented Middleware. IEEE Internet
Computing, 8(1):66–75, January 2004.
[107] Daniel Nurmi, Richard Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil
Soman, Lamia Youseff, and Dmitrii Zagorodnov. The Eucalyptus Open-Source
Cloud-Computing System. In Franck Cappello, Cho-Li Wang, and Rajkumar
Buyya, editors, Proceedings of the 9th IEEE/ACM International Symposium
on Cluster, Cloud, and Grid Computing (CCGrid'09), pages 124–131. IEEE
Computer Society, May 2009.
[108] Avi Kivity, Yaniv Kamay, Dor Laor, Uri Lublin, and Anthony Liguori. KVM:
the Linux Virtual Machine Monitor. In Proceedings of the 9th Ottawa Linux
Symposium (OLS’07), June 2007.
[109] Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Ian Pratt,
Andrew Warfield, Paul Barham, and Rolf Neugebauer. Xen and the Art of
Virtualization. In Proceedings of the ACM Symposium on Operating Systems
Principles (SOSP’03), October 2003.
[110] Canonical Ltd. JeOS and "vmbuilder".
https://help.ubuntu.com/11.10/serverguide/C/jeos-and-vmbuilder.html.
[Online; accessed 17-October-2011].
[111] Douglas Schmidt. An Architectural Overview of the ACE Framework. ;login: the
USENIX Association newsletter, 24(1), January 1999.
[112] Francisco Curbera, Matthew Duftler, Rania Khalaf, William Nagy, Nirmal Mukhi,
and Sanjiva Weerawarana. Unraveling the Web Services Web: An Introduction
to SOAP, WSDL, and UDDI. IEEE Distributed Systems Online, 3(4), 2002.
[113] Ian Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan.
Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications.
In Proceedings of the ACM Special Interest Group on Data Communication
Conference (SIGCOMM'01), volume 31(4) of Computer Communication Review,
pages 149–160. ACM Press, August 2001.
[114] Greg Lavender and Douglas Schmidt. Active Object: An Object Behavioral
Pattern for Concurrent Programming. In Proceedings of the 2nd Conference on
Pattern Languages of Programs (PLoP'95), September 1995.
[115] Linux kernel 2.6.39. Real-Time Group Scheduling.
http://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt, 2009.
[Online; accessed 17-October-2011].
[116] Yuan Xu. A Study of Scalability and Performance of Solaris Zones, April 2007.
[117] Dario Faggioli, Michael Trimarchi, and Fabio Checconi. An Implementation
of the Earliest Deadline First Algorithm in Linux. In Sung Shin and Sascha
Ossowski, editors, Proceedings of the 24th ACM Symposium on Applied Computing
(SAC'09), pages 1984–1989. ACM, March 2009.
[118] Nicola Manica, Luca Abeni, and Luigi Palopoli. Reservation-Based Interrupt
Scheduling. In Marco Caccamo, editor, Proceedings of the 16th IEEE Real-Time
and Embedded Technology and Applications Symposium (RTAS’10), pages 46–55.
IEEE Computer Society, April 2010.
[119] Shinpei Kato, Yutaka Ishikawa, and Ragunathan Rajkumar. CPU Scheduling
and Memory Management for Interactive Real-Time Applications. Real-Time
Systems, pages 1–35, 2011.
[120] Michael Stonebraker and Greg Kemnitz. The POSTGRES Next Generation
Database Management System. Communications of the ACM, 34:78–92, October
1991.
[121] Vincenzo Gulisano, Ricardo Jiménez-Peris, Marta Patiño-Martínez, and Patrick
Valduriez. StreamCloud: A Large Scale Data Streaming System. In Proceedings
of the IEEE 30th International Conference on Distributed Computing Systems
(ICDCS'10), pages 126–137, Washington, DC, USA, June 2010. IEEE Computer
Society.
[122] Levent Gürgen, Claudia Roncancio, Cyril Labbé, André Bottaro, and Vincent
Olive. SStreaMWare: A Service Oriented Middleware for Heterogeneous Sensor
Data Management. In Proceedings of the 5th International Conference on
Pervasive Services (ICPS'08), pages 121–130, New York, NY, USA, July 2008.
ACM.
[123] Adrian Caulfield, Joel Coburn, Todor Mollov, Arup De, Ameen Akel,
Jiahua He, Arun Jagatheesan, Rajesh Gupta, Allan Snavely, and Steven Swanson.
Understanding the Impact of Emerging Non-Volatile Memories on
High-Performance, IO-Intensive Computing. In Proceedings of the 23rd ACM/IEEE
International Conference for High Performance Computing, Networking, Storage
and Analysis (SC'10), pages 1–11, Washington, DC, USA, November 2010. IEEE
Computer Society.
[124] Maxweel Carmo, Bruno Carvalho, Jorge Sá Silva, Edmundo Monteiro, Paulo
Simões, Marília Curado, and Fernando Boavida. NSIS-Based Quality of Service
and Resource Allocation in Ethernet Networks. In Torsten Braun, Georg Carle,
Sonia Fahmy, and Yevgeni Koucheryavy, editors, Proceedings of the 4th
International Conference on Wired/Wireless Internet Communications (WWIC'06),
volume 3970 of Lecture Notes in Computer Science, pages 132–142. Springer, 2006.
[125] Jeff Bonwick. The Slab Allocator: An Object-Caching Kernel Memory Allocator.
In USENIX Summer, pages 87–98, 1994.
[126] Christoph Lameter. The SLUB Allocator. LWN.net:
http://lwn.net/Articles/229096/, March 2007. [Online; accessed 17-October-2011].
[127] Dinakar Guniguntala, Paul McKenney, Josh Triplett, and Jonathan Walpole.
The Read-Copy-Update Mechanism for Supporting Real-Time Applications on
Shared-Memory Multiprocessor Systems with Linux. IBM Systems Journal,
47:221–236, April 2008.
[128] Steven Rostedt. RCU Preemption Priority Boosting. LWN.net:
http://lwn.net/Articles/252837/, October 2007. [Online; accessed 17-October-2011].
[129] Claudio Basile, Keith Whisnant, Zbigniew Kalbarczyk, and Ravishankar Iyer.
Loose Synchronization of Multithreaded Replicas. In Proceedings of the 21st
International Symposium on Reliable Distributed Systems (SRDS’02), pages 250–
255, October 2002.
[130] Claudio Basile, Zbigniew Kalbarczyk, and Ravishankar Iyer. A Preemptive
Deterministic Scheduling Algorithm for Multithreaded Replicas. In Proceedings
of the 33rd International Conference on Dependable Systems and Networks
(DSN’03), pages 149–158, June 2003.
[131] Guang Tan, Stephen Jarvis, and Daniel Spooner. Improving the Fault
Resilience of Overlay Multicast for Media Streaming. IEEE Transactions on
Parallel and Distributed Systems, 18(6):721–734, June 2007.
[132] Irena Trajkovska, Joaquín Salvachúa Rodríguez, and Alberto Mozo Velasco.
A Novel P2P and Cloud Computing Hybrid Architecture for Multimedia
Streaming with QoS Cost Functions. In Proceedings of the International
Conference on Multimedia (MM'10), pages 1227–1230, New York, NY, USA,
October 2010. ACM.
[133] Thomas Wiegand, Gary Sullivan, Gisle Bjøntegaard, and Ajay Luthra. Overview
of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and
Systems for Video Technology, 13(7):560–576, 2003.
[134] Fred Kuhns, Douglas Schmidt, and David Levine. The Design and Performance
of a Real-Time I/O Subsystem. In Proceedings of the 5th IEEE Real-Time
Technology and Applications Symposium (RTAS’99), pages 154–163, June 1999.
[135] Real-Time Preempt Linux Kernel Patch. kernel.org:
http://www.kernel.org/pub/linux/kernel/projects/rt/. [Online; accessed
17-October-2011].
[136] Moving Interrupts to Threads. LWN.net: http://lwn.net/Articles/302043/.
[Online; accessed 17-October-2011].
[137] Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund
Wong. Zyzzyva: Speculative Byzantine Fault Tolerance. In Proceedings of the 21st
ACM SIGOPS Symposium on Operating Systems Principles (SOSP'07), pages
45–58, New York, NY, USA, 2007. ACM.
[138] Allen Clement, Edmund Wong, Lorenzo Alvisi, Mike Dahlin, and Mirco
Marchetti. Making Byzantine Fault Tolerant Systems Tolerate Byzantine Faults.
In Proceedings of the 6th USENIX Symposium on Networked Systems Design and
Implementation (NSDI’09), pages 153–168, Berkeley, CA, USA, 2009. USENIX
Association.
[139] Andrey Mirkin, Alexey Kuznetsov, and Kir Kolyshkin. Containers Checkpointing
and Live Migration. In Proceedings of the 10th Annual Linux Symposium
(OLS’08), July 2008.
[140] Oren Laadan and Serge Hallyn. Linux-CR: Transparent Application Checkpoint-
Restart in Linux. In Proceedings of the 12th Ottawa Linux Symposium (OLS’10),
July 2010.