PhD Thesis

REPLICATION IN DISTRIBUTED MANAGEMENT SYSTEMS

EVANGELOS GRIGORIOS KOTSAKIS

Telford Research Institute

Department of Electrical and Electronic Engineering

The University Of Salford

Submitted in Partial Fulfilment of the Requirements for the

Degree of Doctor of Philosophy

1998


This thesis is dedicated to

my daughter Dimitra


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
ABBREVIATIONS
ABSTRACT

1. INTRODUCTION
1.1 DISTRIBUTED MANAGEMENT SYSTEMS
1.2 REPLICATION ON A DISTRIBUTED MIB
1.3 THE WORK
1.4 ROAD MAP OF THE THESIS

2. REPLICATION MANAGEMENT SYSTEM ARCHITECTURE
2.1 MANAGEMENT FUNCTIONAL AREAS
2.2 MANAGEMENT ARCHITECTURAL MODEL
2.3 PROTOCOLS FOR CONTROLLING MANAGEMENT INFORMATION
2.3.1 OSI Management Framework
2.3.2 Internet Network Management
2.4 OBJECT ORIENTED MIB MODELLING
2.5 DISTRIBUTED MANAGEMENT INFORMATION BASE (MIB)
2.6 DISTRIBUTED NETWORK MANAGEMENT
2.7 CORBA SYSTEM
2.8 IMPLEMENTING OSI MANAGEMENT SERVICES FOR TMN
2.9 REPLICATION IN A MANAGEMENT SYSTEM
2.10 NEED FOR REPLICATION TECHNIQUES IN A MANAGEMENT SYSTEM
2.11 SYNCHRONOUS AND ASYNCHRONOUS REPLICA MODELS
2.12 REPLICATION TRANSPARENCY AND ARCHITECTURAL MODEL
2.13 SUMMARY

3. FAILURES IN A MANAGEMENT SYSTEM
3.1 DEPENDABILITY BETWEEN AGENTS
3.2 FAILURE CLASSIFICATION
3.3 FAULTY AGENT BEHAVIOUR
3.4 FAILURE SEMANTICS
3.5 FAILURE MASKING
3.6 ARCHITECTURAL ISSUES
3.7 GROUP SYNCHRONISATION
3.7.1 Close Synchronisation
3.7.2 Loose Synchronisation
3.8 GROUP SIZE
3.9 GROUP COMMUNICATION
3.10 AVAILABILITY POLICY
3.11 GROUP MEMBER AGREEMENT
3.12 SUMMARY

4. REPLICA CONTROL PROTOCOLS
4.1 PARTITIONING IN A REPLICATION SYSTEM
4.2 CORRECTNESS IN REPLICATION
4.3 TRANSACTION PROCESSING DURING PARTITIONING
4.4 PARTITION PROCESSING STRATEGY
4.5 AN ABSTRACT MODEL FOR STUDYING REPLICATION ALGORITHMS
4.6 PRIMARY SITE PROTOCOL
4.7 VOTING ALGORITHMS
4.7.1 Majority Consensus Algorithm
4.7.2 Voting With Witnesses
4.7.3 Dynamic Voting
4.7.4 Dynamic Majority Consensus Algorithm (DMCA) - A novel approach
4.8 SUMMARY

5. ANALYSIS AND DESIGN OF THE SOFTWARE SIMULATION
5.1 INTRODUCTION TO SIMULATION MODELLING
5.2 USING AN OBJECT-ORIENTED TECHNIQUE FOR MODELLING A SIMULATION SYSTEM
5.3 OBJECT ORIENTED DISCRETE EVENT SIMULATION
5.4 THE SIMULATION MODELLING PROCESS
5.4.1 Problem Formulation
5.4.2 Model Implementation
5.5 OBJECT ORIENTED ANALYSIS AND DESIGN
5.5.1 Analysis
5.5.2 Design
5.5.3 Implementation
5.6 ATS REQUIREMENTS
5.7 ATS ANALYSIS
5.7.1 Object Model
5.8 DYNAMIC MODEL
5.9 EVALUATION OF THE SYSTEM
5.10 SUMMARY

6. SIMULATION AND ESTIMATION OF REPLICA CONTROL PROTOCOLS
6.1 PERFORMANCE EVALUATION
6.2 THE SIMULATION MODEL
6.3 FAULT INJECTION
6.4 SIMULATED ALGORITHMS
6.5 THE PROTOCOLS ROUTINES
6.6 IMPLEMENTING GROUP COMMUNICATION
6.7 FUNCTIONAL COMPONENTS OF THE SIMULATION
6.8 PARAMETER OF THE SIMULATION
6.9 AVAILABILITY AND THE CONTRIBUTION OF THE DMCA ALGORITHM
6.10 RESULTS OF THE SIMULATION
6.11 SUMMARY

7. CONCLUSIONS
7.1 CONTRIBUTIONS OF THIS WORK
7.2 FUTURE RESEARCH DIRECTION
7.3 CONCLUDING REMARKS

APPENDIX-A PAPERS
APPENDIX-B TABLES
APPENDIX-C SOURCE CODE
LIST OF REFERENCES


LIST OF FIGURES

FIGURE 2-1: BASIC MANAGEMENT MODEL
FIGURE 2-2: VIEWS OF SHARED MANAGEMENT KNOWLEDGE
FIGURE 2-3: SIMPLIFIED MANAGEMENT SYSTEM
FIGURE 2-4: NETWORK MANAGEMENT APPROACHES (A) CENTRALISED (B) PLATFORM BASED (C) HIERARCHICAL (D) DISTRIBUTED
FIGURE 2-5: INTER-WORKING TMN
FIGURE 2-6: (A) REPLICATION (B) NO REPLICATION
FIGURE 2-7: NETWORK MANAGEMENT REPLICATION EXAMPLE
FIGURE 2-8: SYNCHRONOUS REPLICATION
FIGURE 2-9: ARCHITECTURAL MODEL FOR REPLICATION (A) NON TRANSPARENT SYSTEM (B) TRANSPARENT REPLICATION SYSTEM (C) LAZY REPLICATION (D) PRIMARY COPY MODEL
FIGURE 3-1: RELATIONSHIP BETWEEN USER AND RESOURCE
FIGURE 3-2: FAILURE MASKING
FIGURE 3-3: GROUP MASKING
FIGURE 4-1: REPLICATION ANOMALY CAUSED BY CONFLICT WRITE OPERATIONS (A) BEFORE ISOLATION (B) AFTER ISOLATION
FIGURE 4-2: LOGICAL AND PHYSICAL OBJECTS OF THE SENSOR ENTITY
FIGURE 4-3: REPLICATION USING PRIMARY SITE ALGORITHM
FIGURE 4-4: READ IN A PRIMARY SITE PROTOCOL
FIGURE 4-5: WRITE IN A PRIMARY SITE PROTOCOL
FIGURE 4-6: MAKE CURRENT IN A PRIMARY SITE PROTOCOL
FIGURE 4-7: READ IN A MAJORITY CONSENSUS ALGORITHM
FIGURE 4-8: WRITE IN A MAJORITY CONSENSUS ALGORITHM
FIGURE 4-9: MAKE CURRENT IN A MAJORITY CONSENSUS ALGORITHM
FIGURE 4-10: ISMAJORITY IN THE DYNAMIC VOTING PROTOCOL
FIGURE 4-11: READ FUNCTION IN THE DYNAMIC VOTING PROTOCOL
FIGURE 4-12: WRITE (UPDATE) IN THE DYNAMIC VOTING PROTOCOL
FIGURE 4-13: UPDATE IN THE DYNAMIC VOTING PROTOCOL
FIGURE 4-14: MAKE CURRENT IN THE DYNAMIC VOTING PROTOCOL
FIGURE 4-15: READPERMITTED IN THE DMCA
FIGURE 4-16: WRITEPERMITTED FUNCTION IN THE DMCA
FIGURE 4-17: DOREAD FUNCTION IN THE DMCA
FIGURE 4-18: DOWRITE FUNCTION IN THE DMCA
FIGURE 4-19: MAKE CURRENT FUNCTION IN DMCA
FIGURE 4-20: SEQUENCE DIAGRAM FOR DOREAD OPERATION
FIGURE 4-21: SEQUENCE DIAGRAM FOR DOWRITE OPERATION
FIGURE 4-22: SEQUENCE DIAGRAM FOR MAKECURRENT OPERATION
FIGURE 5-1: ATS PROCESS DIAGRAM
FIGURE 5-2: ATS OBJECT MODEL
FIGURE 5-3: ATS DYNAMIC MODEL
FIGURE 6-1: NETWORK MODEL
FIGURE 6-2: FAULT INJECTION SYSTEM
FIGURE 6-3: COMPONENTS OF THE SIMULATION MODEL
FIGURE 6-4: AVAILABILITY CURVE
FIGURE 6-5: TOTAL AVAILABILITY =4
FIGURE 6-6: BOUNDARIES OF TOTAL AVAILABILITY
FIGURE 6-7: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=0.1 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY
DELAY=0.2 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY

LIST OF TABLES

TABLE 2-1: CMISE SERVICES AND FUNCTIONS
TABLE 2-2: SNMP SERVICES AND FUNCTIONS
TABLE 4-1: DMCA MAPPING
TABLE 6-1: SIMULATION PARAMETERS

ACKNOWLEDGEMENTS

I would like to thank my supervisor, Dr. B. H. Pardoe, who assisted and guided me in preparing this thesis. His kind assistance in my struggles with the English language was very helpful. He was always willing to answer technical questions and to provide me with the information and knowledge needed to cope with the difficult task of preparing a Ph.D. thesis. Without his kind help and encouragement, this research would never have been done.

I would especially like to express my sincere appreciation for the financial support, encouragement and love given to me by my parents, Grigorios and Dimitra Kotsakis. They have given me much more than moral and material support throughout my university studies. They provided me with a rock-solid support system, which proved invaluable during my studies. Without their support and love, this research would never have been done.

Special thanks are due to my wife Chaido for encouraging me over the last three years. I would like to thank her for her unconditional love, unfailing enthusiasm, unending optimism and confidence in my abilities. Her patience and support are boundless.

Finally, I would like to thank little Dimitra, whose birth two years ago gave me great joy, for being quiet while I was writing this thesis.


ABBREVIATIONS

ANSA Advanced Networked Systems Architecture

ATM Asynchronous Transfer Mode

ATS Availability Testing System

CMIS Common Management Information Service

CMISE Common Management Information Service Element

DMCA Dynamic Majority Consensus Algorithm

FE Front End

IP Internet Protocol

ISO International Organization for Standardization

LAN Local Area Network

MIB Management Information Base

OMT Object Modelling Technique

OSF/DCE Open Software Foundation / Distributed Computing Environment

OSI Open Systems Interconnection

ROSE Remote Operations Service Element

SNA Systems Network Architecture

SNMP Simple Network Management Protocol

TCP Transmission Control Protocol

UDP User Datagram Protocol

WAN Wide Area Network


ABSTRACT

Systems management is concerned with supervising and controlling a system so that it fulfils its requirements. The management of a system may be performed by a mixture of human and automated components. These components are abstract representations of network resources and are known as managed objects. A distributed management system may be viewed as a collection of such objects located at different sites in a network. Replication is a technique used in distributed systems to improve the availability of vital data components and to provide higher system performance, since a particular object may be accessed at multiple sites concurrently. By applying replication in a distributed management system, we can locate certain management objects at multiple sites by copying their internal data and the operations used to access or update those data. This is a great advantage: it increases reliability and availability, provides higher fault tolerance and allows data sharing, thereby improving system performance.

This thesis is concerned with methods that may be used to apply replication in such a system, as well as with replica control algorithms that may be used to control operations on a replicated managed object. Several replication architectures are examined and the availability provided by each of them is discussed. A new replica control algorithm is proposed as an alternative that provides higher availability. A tool for evaluating the availability provided by a replica control algorithm is designed and proposed as a benchmark test utility for examining the suitability of particular replica control algorithms.


1. INTRODUCTION

The importance of replication techniques for providing high availability in distributed systems has been known for over two decades. However, the use of these techniques in network management systems has been minimal. One reason is that the machinery needed to cope with partitions and reunions is excessively complex. This thesis addresses the methods that may be safely used for replicating management objects and shows how replication techniques can be used in a management system while preserving usability and availability.

The rest of this chapter discusses replication techniques and the related problems. It begins with a discussion of network management systems and the use of replication in a distributed Management Information Base (MIB) for improving the availability of the managed objects. It then discusses the goal of this thesis and concludes with a road map for the rest of the thesis.

1.1 Distributed Management Systems

The management of a communication environment is a distributed information processing application in which individual components of the management activities are associated with network resources. Management applications perform the management activities in a distributed and consistent manner that guarantees transparency and system operability. Management information is stored in a special database known as the Management Information Base (MIB). The MIB is the conceptual repository of the management information, and each object stored in the MIB is associated with an individual network resource, an attribute used to represent a network activity. When the MIB is distributed over several sites, one site may fail while the other sites continue to operate. A distributed MIB may also increase performance, since managed objects located at different hosts may be accessed concurrently.

A fundamental problem with a distributed MIB is data availability. Since managed objects are stored on separate machines, a server crash or a network failure that partitions a client from a server can prevent a manager from accessing managed objects. Such situations are very frustrating to a manager because they impede computation even though client resources are still available. The problem of object availability is expected to grow over time for two reasons:

1. The frequency of network failures will increase. Networks are getting larger: they cover larger geographical areas, span multiple administrative boundaries and consist of multiple sub-networks connected via routers and bridges. Furthermore, there is an increased need for better network resource management that increases the availability of managed objects and the management performance.

2. The introduction of mobile network managers will increase the number of occasions on which management agencies are inaccessible. Wireless technologies such as packet radio suffer from inherent limitations such as short range and line-of-sight operation. Due to these limitations, the network connections between management agents and mobile managers will exhibit frequent partitions.

1.2 Replication on a distributed MIB

Replication is a technique used in distributed operating systems and distributed databases to improve the availability of system resources. In the case of the MIB, replication can be used to increase the performance of management activities and to provide high availability of management objects. Replicating the same management object at different sites can improve availability remarkably, because the system can continue to operate as long as at least one site is up. It also improves the performance of global retrieval queries, because the result of such a query can be obtained locally from any site; hence a retrieval query can be processed at the local site where it is submitted.
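The claim that the system keeps operating while at least one site is up can be illustrated with a small calculation. The sketch below is not taken from the thesis: it assumes that every site is up independently with the same probability `p`, and that any single reachable copy is enough to serve a request (which matches retrieval queries; updates under a replica control protocol generally need more than one copy). The function name `replicated_availability` is hypothetical.

```python
def replicated_availability(p: float, n: int) -> float:
    """Probability that at least one of n replica sites is up.

    Illustrative assumptions (not from the thesis): sites fail
    independently, and each site is up with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # All n sites are down with probability (1 - p) ** n,
    # so at least one is up with the complementary probability.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Compare one, two and three copies of the same managed object.
    for copies in (1, 2, 3):
        print(copies, replicated_availability(0.9, copies))
```

Under these assumptions each additional copy reduces the probability of total unavailability by a factor of `1 - p`, which is why even a small number of replicas helps noticeably.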

To deal with replicated objects in a management information base, a control method is needed to keep all the replicas in a consistent state even during partitioning. The techniques proposed to assure consistency may be divided into two families: those based on a distinguished copy and those based on voting. The former is based on the idea of using a designated copy of each replicated object, so that requests are sent to the site that contains that copy (ALSBERG 1976, BERNSTEIN 1987, GARCIA 1982).

Voting replica control algorithms are more promising. They do not use a distinguished copy; rather, a request is sent to all sites that hold a copy of the replicated object. Access to a particular copy is granted if a majority of votes is collected (GIFFORD 1979, JAJODIA 1989, KOTSAKIS 1996b). Voting algorithms are fully distributed concurrency control algorithms and they exhibit higher flexibility than those based on a distinguished copy. Although voting algorithms pass many messages among sites, good performance can be expected if the round-trip time becomes shorter. Today's technology can reduce the round-trip time through the use of high-speed networks such as ATM, so that messages may be transferred from one machine to another faster and more reliably.
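The majority rule described above can be sketched in a few lines. This is a minimal illustration only, not one of the protocols presented later in the thesis: the `Replica` class, the one-vote-per-copy default and the `has_majority` helper are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Replica:
    """One copy of a replicated object (hypothetical model)."""
    site: str
    reachable: bool = True
    votes: int = 1  # assumption: every copy carries a single vote


def has_majority(replicas: Iterable[Replica]) -> bool:
    """Grant access only if the reachable copies hold a strict majority of votes."""
    replicas = list(replicas)
    total = sum(r.votes for r in replicas)
    collected = sum(r.votes for r in replicas if r.reachable)
    return 2 * collected > total


if __name__ == "__main__":
    copies = [Replica("A"), Replica("B"), Replica("C", reachable=False)]
    # Two of three votes can be collected, so the request is granted.
    print(has_majority(copies))
```

Requiring a strict majority means that two disjoint groups of sites can never both collect enough votes, which is what lets a voting scheme keep the copies consistent without a distinguished copy.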

In the voting scheme, replicated objects can be accessed in the partition group that obtains a majority vote. In the distinguished (primary) copy scheme, availability is significantly limited in the case of a network or link failure. Primary copy algorithms exhibit good behaviour only for site failures. Voting algorithms, on the other hand, provide higher availability by tolerating both network and site failures. Even so, voting algorithms guarantee consistency at the expense of availability. To provide higher availability, one may either use a consistency relaxation technique that allows concurrent access to replicated objects across different partitions (an optimistic control algorithm) or improve the existing pessimistic voting algorithms by forming more sophisticated schemes. Optimistic control algorithms must be supported by an extra mechanism to detect and resolve diverging replicas once the partition groups are reconnected. This complicates the replication control task and allows, at least for a short interval of time, inconsistency between replicas. Such an approach requires a long time to restore the state of the database after a site failure and does not seem appropriate for databases such as those used to store management information. Therefore, the invention of a more sophisticated replica control algorithm based on voting seems to be a promising approach that may provide higher availability while preserving strong consistency between replicated objects.
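The difference between the two pessimistic families under a network partition can be made concrete with a toy comparison. The sketch below is an assumption-laden illustration rather than either protocol as specified in the thesis: a partition is modelled as a list of disjoint site groups, every site holds one copy with one vote, and the helper names are hypothetical.

```python
from typing import FrozenSet, List


def primary_copy_accessible(groups: List[FrozenSet[str]], primary: str) -> List[FrozenSet[str]]:
    """Under a primary copy scheme, only the group that contains the primary site can proceed."""
    return [g for g in groups if primary in g]


def voting_accessible(groups: List[FrozenSet[str]], n_sites: int) -> List[FrozenSet[str]]:
    """Under majority voting, only a group holding more than half of all copies can proceed."""
    return [g for g in groups if 2 * len(g) > n_sites]


if __name__ == "__main__":
    # Five sites; a link failure isolates the primary site A from the rest.
    groups = [frozenset({"A"}), frozenset({"B", "C", "D", "E"})]
    print([sorted(g) for g in primary_copy_accessible(groups, "A")])
    print([sorted(g) for g in voting_accessible(groups, 5)])
```

In this scenario the primary copy scheme leaves four of the five sites without access, whereas majority voting keeps the larger group operational. Neither scheme lets both groups proceed at once, which is the price paid for strong consistency.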

Finally, the following questions are addressed in this thesis:

Can we further improve the availability of managed objects by utilising voting techniques?

Can replication be used effectively in a distributed MIB in order to ensure fault

tolerance in a management system?

1.3 The work

The goal of this work is to investigate the potential use of replication in a network management system and to examine the practical aspects of applying such a technique in real-time systems. Intuitively, this work appears viable for the following reasons:

1. There is a great proliferation of different network technologies and a great need for network management. Keeping management information available and in a consistent state is of great importance, since network operability depends on management activities.

2. Availability of network information may be obtained only by applying redundancy. Replication is one of the most widely used techniques for ensuring high availability while keeping the replicated objects in a consistent state.

3. The development of replication schemes for ensuring higher object availability and for tolerating site and communication failures is very promising and should be studied further.

4. Failures always happen. No system can work forever within its specifications. Exogenous or endogenous factors can affect the operability of the system, causing temporary or permanent failures.

5. The need for developing fault-tolerant techniques for network management systems is of great importance.

1.4 Road Map of the Thesis

The rest of the thesis consists of six chapters.

Chapter 2 describes systems management and provides a discussion about the

architectural models of management systems. It examines object replication in terms of

system performance and availability and illustrates the replication architectural models

for managing failures.

Chapter 3 discusses the nature of failures in a management system. It decomposes a

management system into management agents and defines the concept of dependability

between agents. It also classifies certain failures according to their disruptive behaviour

and it ends by specifying the architectural impact of a group of agents on the

availability of replicated objects.

Chapter 4 presents the correctness criteria that should be taken into account when

designing a replication system. It also introduces an abstract model in order to study

formally certain replica control algorithms. It then presents a variety of replica control

algorithms. It ends with a thorough discussion of the DMCA (Dynamic Majority

Consensus Algorithm), a novel approach that extends current knowledge

of replication techniques and improves the overall management of replicated objects

by providing higher availability.

Chapter 5 mainly evaluates the DMCA algorithm and presents quantitative results

regarding the availability provided by DMCA. It starts by specifying the way in which

one can measure the performance of certain replica control protocols and introduces the

simulation model for building the benchmark test utility ATS (Availability Testing

System) that estimates the effectiveness of the algorithms. It also shows the fault

injection mechanism for generating faults and repairs. It ends with a thorough discussion

on the results of the simulation justifying the superiority of the DMCA.

Chapter 6 presents the object-oriented development process of the ATS tool. It first

discusses the advantages of using object-oriented technology to develop such a complex

system and it then presents the static object model and dynamic model of the ATS.

The thesis concludes with Chapter 7, which presents the contributions and includes a

discussion of future work and a summary of key results.

2. REPLICATION MANAGEMENT SYSTEM ARCHITECTURE

This chapter introduces the fundamental idea behind systems management and

illustrates the main features of a management system. It provides a brief discussion

about the architectural model of a management system and introduces the concept of

distributed MIB as a naturally distributed database. It highlights issues related to the

object oriented MIB modelling and definition of managed objects. It also justifies the

use of object replication in terms of system performance, data reliability and availability.

Finally, it discusses the types of failures that may occur in a management system as well

as the replication architectural models that may be used to maintain multiple replicas.

2.1 Management Functional Areas

Management of a system is concerned with supervising and controlling the

system so that it fulfils the operational requirements. To facilitate the management task,

Open Systems Interconnection (OSI) divides the management design process into five

areas known as OSI management functional areas (ISO 1989). The

objectives of the OSI functional areas are as follows.

1. To maintain proper operation of a complex network (fault management).

2. To maintain internal accounting procedures (accounting management).

3. To maintain procedures regarding the configuration of a network or a

distributed processing system (configuration management).

4. To provide the capability of performance evaluation (performance

management).

5. To allow authorised access-control information to be maintained and

distributed across a management domain (security management).

In other words, a network that has a management system must be able to manage

its own operations, performance, failures, modifications, security and hardware/software

configuration. To fulfil the above requirements, it is necessary to develop a management

model that is capable of incorporating a vast number of services covered under the

specifications of the OSI functional areas. The actual architecture of the network

management model varies greatly, depending on the functionality of the platform and

the details of the network management capability. A management architectural model

that has been proposed by OSI defines the fundamental concepts of systems

management (ISO 1992). This model describes the information, functional,

communication and organisational aspects of systems management.

Figure 2-1: Basic Management model

2.2 Management Architectural Model

The management of a communication environment is an information processing

application. Because the system being managed is distributed, the individual

components of the management activities are themselves distributed. Management

applications perform the management activities in a distributed manner by establishing

associations between management entities. As shown in Figure 2-1, there are two

fundamental types of entities that exchange management information; one takes the

manager role and the other the agent role. An entity that plays the manager role

generates queries to obtain management information. An

entity plays the agent role when it accepts those queries, returns a response to the

manager and generates notifications regarding the state of the objects located in the

domain of the agent. An agent performs management operations on managed objects as

a consequence of the communication with the manager. A manager may be seen as the

part of the distributed application that is responsible for generating messages related to

one or more management activities (collection of information, controlling the state of

remote objects, changing the configuration of managed devices, etc.). The agent, on the

other hand, could be viewed as a local correspondent of the manager in the managed

system controlling access to the managed object and looking after the distribution of

events occurring in the managed system.

As Figure 2-1 shows, three kinds of messages are transferred between manager and

agent. These are the following:

- request messages, transferred from the manager to the agent;

- response messages, which are bi-directional;

- notification messages, transferred from the agent to the manager.
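The manager and agent roles, and the request, response and notification messages exchanged between them, can be illustrated with a minimal sketch. The class names, the dictionary-based message format and the `noSuchObject` error string are assumptions made for illustration only; they are not taken from the OSI standards or from this thesis.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Agent:
    """Agent role: answers requests and emits notifications about local objects."""
    objects: Dict[str, object]
    listeners: List[Callable[[str, object], None]] = field(default_factory=list)

    def handle_request(self, name: str) -> dict:
        # A request arrives from the manager; a response message goes back.
        if name in self.objects:
            return {"kind": "response", "object": name, "value": self.objects[name]}
        return {"kind": "response", "object": name, "error": "noSuchObject"}

    def change_state(self, name: str, value: object) -> None:
        # A local state change triggers a notification to each subscribed manager.
        self.objects[name] = value
        for notify in self.listeners:
            notify(name, value)


@dataclass
class Manager:
    """Manager role: generates queries and receives notifications."""
    received: List[tuple] = field(default_factory=list)

    def query(self, agent: Agent, name: str) -> dict:
        return agent.handle_request(name)

    def on_notification(self, name: str, value: object) -> None:
        self.received.append((name, value))


agent = Agent(objects={"linkStatus": "up"})
manager = Manager()
agent.listeners.append(manager.on_notification)

reply = manager.query(agent, "linkStatus")   # request followed by response
agent.change_state("linkStatus", "down")     # notification sent to the manager
```

In this sketch the manager never touches the agent's objects directly; all access passes through the agent, which mirrors the agent's task of controlling access to managed objects.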

The database of systems management information, called the Management

Information Base (MIB), is associated with both the manager and the agent. The MIB is

the conceptual repository of the management information stored in an OSI-based

network management system. The definition of the MIB describes the conceptual

schema containing information about managed objects and relations between them. It

actually defines the set of all managed objects visible to a network management entity.

The MIB may be viewed as the interface definition - it defines a conceptual schema

which contains information about specific managed objects, which are instantiations of

managed object classes. The schema also embodies relationships between these

managed objects, specifies the operations which may be performed on them and

describes the notifications which they may emit (ISO 1993).

2.3 Protocols For Controlling Management Information

All types of management exchanges consist of requests and/or requests-responses.

There are currently two basic architectural frameworks related to the

standardisation of the exchange of messages passed between managers and agents: the

OSI management framework (ISO 1989) and the Internet network management

framework (CERF 1988).

2.3.1 OSI Management Framework

The Common Management Information Protocol (CMIP) is a protocol designed to

convey the requests or responses between managers and agents (ISO 1991a). The

CMIP offers specific system management functions and services for the remote handling

of management data. The CMIP uses the services offered by the Remote

Operations Service Element (ROSE) (ISO 1988) in order to perform the create, set,

delete, get, action, and event-report operations. The Common Management Information

Service Element (CMISE) is the standardised application service element that is used to

exchange management information in the form of requests and/or requests-responses

(ISO 1991). The CMISE is a basic vehicle that provides individual management

applications with the means of executing management operations on objects and issuing

notifications. The CMISE provides the means of supporting distributed management

operations using application associations. The CMISE services shown in Table 2-1

constitute the kernel functional unit of the CMISE. A system supporting the CMIP must

implement the kernel functional units of the CMISE.

Table 2-1: CMISE services and functions

Category        Service          Type   Function
Notifications   M_EVENT_REPORT   C/NC   Gives notification of an event occurring on a managed object
Operations      M_GET            C      Request for management data
Operations      M_SET            C/NC   Modification of management data
Operations      M_ACTION         C/NC   Action execution on a managed object
Operations      M_CREATE         C      Creation of a managed object
Operations      M_DELETE         C      Deletion of a managed object

C = Confirmed, NC = Not Confirmed; M stands for management.

2.3.2 Internet Network Management

The Simple Network Management Protocol (SNMP) (CASE 1990) is used to

convey management information in an Internet management system, just as the CMIP

is used in an OSI management system. The SNMP includes a limited set of management

requests and responses. The managing system issues get, get_next, and set requests to

retrieve single or multiple objects or to set the value of a single object. The managed

system sends a response to complete the get, get_next or set requests. The managed

system also sends an event notification, called a trap, to the managing system to identify

the occurrence of an event. Table 2-2 lists the SNMP request and response messages

along with their types and functions.
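The semantics of the get, get_next and set requests can be sketched with a toy in-memory agent. This is only an illustration and not an SNMP implementation: no encoding, transport or access control is modelled, and the class `ToyAgent`, the helper `walk` and the stored values are hypothetical. Object identifiers are represented as tuples of integers so that they compare in lexicographic order.

```python
import bisect
from typing import Dict, List, Optional, Tuple

Oid = Tuple[int, ...]


class ToyAgent:
    """Holds object values keyed by OID, kept in lexicographic order."""

    def __init__(self, values: Dict[Oid, object]) -> None:
        self._values = dict(values)
        self._order: List[Oid] = sorted(self._values)

    def get(self, oid: Oid) -> Optional[object]:
        # Retrieve the value of exactly one object, or None if it is absent.
        return self._values.get(oid)

    def get_next(self, oid: Oid) -> Optional[Tuple[Oid, object]]:
        # Return the first object whose OID sorts strictly after `oid`.
        i = bisect.bisect_right(self._order, oid)
        if i == len(self._order):
            return None  # no further objects in this agent's view
        nxt = self._order[i]
        return nxt, self._values[nxt]

    def set(self, oid: Oid, value: object) -> bool:
        # In this sketch only existing objects can be modified.
        if oid not in self._values:
            return False
        self._values[oid] = value
        return True


def walk(agent: ToyAgent, start: Oid) -> List[Tuple[Oid, object]]:
    """Retrieve multiple objects by issuing repeated get_next requests."""
    found = []
    step = agent.get_next(start)
    while step is not None:
        found.append(step)
        step = agent.get_next(step[0])
    return found


# Illustrative values stored under two MIB-II system group identifiers.
SYS_UPTIME = (1, 3, 6, 1, 2, 1, 1, 3, 0)
SYS_NAME = (1, 3, 6, 1, 2, 1, 1, 5, 0)
agent = ToyAgent({SYS_NAME: "router-1", SYS_UPTIME: 4200})
objects_found = walk(agent, (1, 3))
```

The `walk` helper shows why get_next allows a manager to retrieve several objects without knowing their identifiers in advance: each response supplies the identifier used in the following request.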

2.4 Object oriented MIB modelling

The central point in a management system is the managed object. A managed

object may be seen as the management view of a resource and is described by the

following characteristics:

- Attributes, which denote specific characteristics of the resource.

- Operations, which are performed on a set of attributes.

- Behaviour, which specifies how the object reacts to operations performed on it.

- Notifications, which may be emitted to the managing station through a protocol

as a reaction to an external event or as a repeated action.

The managed object class provides a way to specify a family of managed objects. A

managed object class is a template for managed objects that share the same attributes,

operations, notifications and behaviour. A managed object is an instantiation of the

managed object class.
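The relationship between a managed object class and its instances maps naturally onto classes in an object oriented language. The following sketch is illustrative only: the `Interface` class, its attribute names and the error threshold of three are hypothetical and do not come from any standard MIB definition.

```python
from typing import Callable, Dict, List


class ManagedObject:
    """Common template: attributes, operations, behaviour and notifications."""

    def __init__(self, name: str, attributes: Dict[str, object]) -> None:
        self.name = name
        self.attributes = dict(attributes)
        self._subscribers: List[Callable[[str, str], None]] = []

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        self._subscribers.append(callback)

    def notify(self, event: str) -> None:
        # Notifications are emitted to every subscribed managing station.
        for callback in self._subscribers:
            callback(self.name, event)


class Interface(ManagedObject):
    """Hypothetical managed object class describing a network interface."""

    def __init__(self, name: str) -> None:
        super().__init__(name, {"adminStatus": "up", "errorCount": 0})

    def record_error(self) -> None:
        # Operation performed on an attribute.
        self.attributes["errorCount"] += 1
        # Behaviour: emit a notification when the count reaches a threshold.
        if self.attributes["errorCount"] == 3:
            self.notify("errorThresholdReached")


events = []
eth0 = Interface("eth0")   # two instantiations of the same class
eth1 = Interface("eth1")
eth0.subscribe(lambda name, event: events.append((name, event)))
for _ in range(3):
    eth0.record_error()
```

Both instances share the attributes, operations, behaviour and notifications defined by the class, while each keeps its own attribute values.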

Table 2-2: SNMP services and functions

Category        Service          Type   Function
Notifications   Trap             C/NC   An agent sends a trap to alert the manager that an event has occurred
Operations      GetRequest       C      Retrieves the state of a single object
Operations      GetNextRequest   C      Retrieves the state of the next object in a sequence of objects
Operations      GetResponse      NC     Response sent by the agent to the manager
Operations      SetRequest       C      Sets the state of a managed object

C = Confirmed, NC = Not Confirmed

The MIB is the conceptual repository containing all the related information about

managed objects. The MIB modelling encompasses an abstract model and an

implementation model (KOTSAKIS 1995). The abstract model defines:

- Principles of naming objects

- The logical structure of management information

- Concepts related to managed object classes and the relationships between

them.

The implementation model (BABAT 1991, KOTSAKIS 1995) defines the following:

- The platform for hosting a MIB

- The architectural principles for partitioning MIB information

- Database type (object oriented or relational)

- Translation of the MIB object model into a schema

The Management Information Model (ISO 1993) defines two types of management

operations:

- operations intended to be applied to the object attributes;

- operations intended to be applied to the managed object as a whole.

Attribute oriented operations are as follows:

- get attribute value

- replace attribute value

- replace with default value

- add member

- remove member

Any operation may affect the state of one or more attributes. The operations may also be

performed atomically (either all operations succeed or none is performed). Operations

that may be applied to the managed object as a whole are the following:

- create

- delete

- action

An action operation requests the managed object to perform the specified action and to

indicate the result of this action.
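The attribute oriented operations and their atomic execution can be sketched as follows. The sketch assumes that set-valued attributes are held as Python sets and that atomicity is obtained by applying all operations to a working copy that is committed only if every operation succeeds. The class, the operation names and the exception type are hypothetical.

```python
import copy
from typing import Any, Dict, List, Tuple


class OperationFailed(Exception):
    """Raised when a single attribute operation cannot be performed."""


class ManagedObject:
    def __init__(self, attributes: Dict[str, Any], defaults: Dict[str, Any]) -> None:
        self.attributes = dict(attributes)
        self.defaults = dict(defaults)

    def get(self, attr: str) -> Any:
        return self.attributes[attr]

    def _apply(self, attrs: Dict[str, Any], op: str, attr: str, value: Any) -> None:
        if attr not in attrs:
            raise OperationFailed(f"no such attribute: {attr}")
        if op == "replace":
            attrs[attr] = value
        elif op == "replace_with_default":
            if attr not in self.defaults:
                raise OperationFailed(f"no default for: {attr}")
            attrs[attr] = copy.deepcopy(self.defaults[attr])
        elif op == "add_member":
            attrs[attr].add(value)
        elif op == "remove_member":
            if value not in attrs[attr]:
                raise OperationFailed(f"{value!r} is not a member of {attr}")
            attrs[attr].remove(value)
        else:
            raise OperationFailed(f"unknown operation: {op}")

    def apply_atomically(self, operations: List[Tuple[str, str, Any]]) -> None:
        # Work on a copy so that a failure leaves the object untouched.
        working = copy.deepcopy(self.attributes)
        for op, attr, value in operations:
            self._apply(working, op, attr, value)
        self.attributes = working  # commit only if every operation succeeded


port = ManagedObject({"speed": 100, "vlans": {1, 10}}, {"speed": 10, "vlans": {1}})
port.apply_atomically([("replace", "speed", 1000), ("add_member", "vlans", 20)])
try:
    # The second operation fails, so the first must not take effect either.
    port.apply_atomically([("replace", "speed", 10), ("remove_member", "vlans", 99)])
except OperationFailed:
    pass
```

Copying the whole attribute set is the simplest way to obtain all-or-nothing behaviour in a sketch; a real agent would more likely use transactions or an undo log.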

2.5 Distributed Management Information Base (MIB)

Roles are not permanently assigned to a management entity. Some management

entities may be restricted to only taking an agent role, some to only taking a manager

role, while others are allowed to take an agent role in one interaction and to take a

manager role in a separate interaction. In order to perform system management and

share management knowledge, it is sometimes necessary to embody manager and agent

within a single open system (see Figure 2-2). Shared management knowledge is implied

by the nature of the management framework since the management applications are

distributed across a network. Therefore the management information base may be

naturally viewed as a distributed database containing the managed objects that belong

to the same management system but are physically spread over multiple sites (hosts) of a

computer network. The MIB is considered as a superset of managed objects. Each

subset of this superset may constitute a set of objects associated with a device physically

separated from any other managed device (ARPEGE 1994). Therefore the managed

objects in each location may be viewed as a local management description of the

managed device. The distributed design of a MIB may be considered as a great

advantage for the following reasons:

- Increased reliability and availability: Availability is defined as the probability that a

system is up at a particular moment, whereas reliability is the probability that a

system operates continuously without failure during a time interval. When the MIB is

distributed over several sites, one site may fail while other sites continue to operate.

Only the objects associated with the failed site cannot be accessed. This improves

both reliability and availability. On the other hand, a failure in a centralised MIB

may make the whole system unavailable to all users.

- Allowing data sharing while maintaining some measure of local control: A

distributed MIB allows the control of objects locally at each agent. Objects that are

available to a specific manager may be hidden from some other managers.

- Improved performance: A distributed MIB implies the existence of smaller

databases at each site. If a site combines both the role of manager and agent, the

manager may gain faster access to the local MIB than any other manager located in

a remote site. This increases the performance of the system since a set of managed

objects may be accessed locally without the need to open a communication

transaction over the network. In addition, a distributed MIB decreases the load

(number of transactions) submitted to an agent compared with the load executed by

a centralised MIB. Since different agents may operate independently, different

agents may proceed in parallel, reducing response times.
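The effect of distribution on the accessibility of managed objects can be illustrated with a small calculation. It rests on simplifying assumptions that are not made in the text above: every site is up with the same probability, sites fail independently, and the MIB is partitioned across the sites without replication. The value 0.9 and the function names are purely illustrative.

```python
def p_no_object_reachable(p_up: float, sites: int) -> float:
    """Probability that every site holding part of the MIB is down at once."""
    return (1.0 - p_up) ** sites


def p_every_object_reachable(p_up: float, sites: int) -> float:
    """Probability that all sites are up, so the whole MIB is reachable."""
    return p_up ** sites


p = 0.9  # assumed probability that any one site is up at a given moment

centralised_total_outage = p_no_object_reachable(p, sites=1)   # about 0.1
distributed_total_outage = p_no_object_reachable(p, sites=3)   # about 0.001
distributed_full_view = p_every_object_reachable(p, sites=3)   # about 0.729
```

Under these assumptions a total loss of management information becomes much less likely when the MIB is spread over three sites, but the chance that every object is reachable at the same moment falls below that of the single centralised site. Distribution alone therefore limits the scope of a failure rather than removing it.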

A typical arrangement of a management system is shown in Figure 2-3. The nodes may

be located in physical proximity and connected via a Local Area Network (LAN), or

they may be geographically distributed over an interconnected network (Internet). It is

possible to connect a number of diskless workstations or personal computers as

managers to a set of agents that maintain the managed objects. As illustrated in Figure

2-3, some nodes may run as managers (such as the diskless node 1, or node 2 with

disks), while other nodes are dedicated to running only agent software, such as node 3.

Still other nodes may support both manager and agent roles, such as node 4.

Figure 2-2: Views of shared management knowledge

Figure 2-3: Simplified Management System.

Interaction between manager and agent might proceed as follows:

1. The manager parses a user query and decomposes it into a number of

independent queries that are sent separately to independent management agent

nodes.

2. Each agent node processes the local query and sends a response to the manager

node.

3. The manager node combines the results of the subqueries to produce the result of

the original submitted query.

4. If something occurs in an agent that changes its operational state, a notification

may be generated from the agent and an associated message is sent urgently to

the manager for further processing.

The agent software is responsible for local access of managed objects while the manager

software is responsible for most of the distribution functions; it processes all the user

requests that require access to more than one management node and it keeps track of

where each managed object is located. An important function of the manager is to hide

the details of data distribution from the user, that is, the user should write global queries

as though the MIB were not distributed. This property is called distribution transparency. A

management system that does not provide distribution transparency makes it the

responsibility of the user to specify the managed node associated with a managed object.
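Steps 1 to 3 of the interaction above, together with the manager's record of where each managed object is located, can be sketched as follows. The class names, the object names and the dictionary used as a location directory are assumptions for illustration; notifications (step 4) and failures are not modelled.

```python
from typing import Dict, List


class AgentNode:
    """Holds the managed objects of one site and answers local queries."""

    def __init__(self, objects: Dict[str, object]) -> None:
        self.objects = dict(objects)

    def local_query(self, names: List[str]) -> Dict[str, object]:
        return {name: self.objects[name] for name in names}


class ManagerNode:
    """Hides the distribution of the MIB behind a single query interface."""

    def __init__(self, agents: Dict[str, AgentNode]) -> None:
        self.agents = agents
        # Location directory: managed object name -> identifier of its agent.
        self.location = {
            name: agent_id
            for agent_id, agent in agents.items()
            for name in agent.objects
        }

    def query(self, names: List[str]) -> Dict[str, object]:
        # Step 1: decompose the user query into one subquery per agent.
        subqueries: Dict[str, List[str]] = {}
        for name in names:
            subqueries.setdefault(self.location[name], []).append(name)
        # Step 2: each agent processes its local subquery.
        # Step 3: the manager combines the partial results.
        result: Dict[str, object] = {}
        for agent_id, sub in subqueries.items():
            result.update(self.agents[agent_id].local_query(sub))
        return result


manager = ManagerNode({
    "node3": AgentNode({"cpuLoad": 0.35, "memFree": 512}),
    "node4": AgentNode({"linkStatus": "up"}),
})
answer = manager.query(["cpuLoad", "linkStatus"])
```

The caller names only the managed objects and never the node that holds them, which is the transparency property described above.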

2.6 Distributed Network Management

Systems are becoming increasingly complex and distributed. As a result, they are

exposed to problems such as failures, performance inefficiency and resource allocation.

So, an efficient integrated network management system is required to monitor, interpret

and control the behaviour of their hardware and software resources. This task is

currently being carried out by centralised network management systems in which a

single management system monitors the whole network. Most existing management

systems are platform-centred. That is, the applications are separated from the data they

require and from the devices they need to control. Although some experts believe that

most network management problems can be solved with a centralised management

system, there are real network management problems that cannot be adequately

addressed by the centralised approach (MEYER 1995).

There are four basic approaches to network management systems:

centralised, platform based, hierarchical and distributed (LEINWARD 1993).

Currently, most network management systems are centralised. In a centralised

management system (Figure 2-4.a) there is a single management machine (manager)

which collects the information and controls the entire network. This workstation is a

single point of failure and if it fails, the entire network could collapse. In case the

management host does not fail, but the fault partitions the network, the other part of the

network is left without any management functionality. Centralised network management

has proved inadequate for the efficient management of large heterogeneous networks. Also, a

centralised system cannot be easily scaled up when the size or complexity of the

network increases.

In the platform based approach (Figure 2-4.b), a single manager is divided into two

parts; the management platform and the management application. The management

platform is mainly concerned with information gathering while management

applications use the services offered by the management platform to handle decision

support. The advantage of this approach is that applications do not need to worry about

protocol complexity and heterogeneity.

The hierarchical architecture (Figure 2-4.c) uses the concept of Manager Of

Managers (MOM) and manager per domain paradigm (LEINWARD 1993). Each

domain manager is only responsible for the management of its domain and it is unaware

of other domains. The manager of managers sits at the higher level and requests

information from domain managers.

The distributed approach (Figure 2-4.d) is a peer architecture. Multiple managers,

each one responsible for a domain, communicate with each other in a peer system.

Whenever information from another domain is required, the corresponding manager is

contacted and the information is retrieved. By distributing management over several

workstations, the network management reliability, robustness and performance increase

while the network management cost in communication and computation decreases. This

approach has also been adopted by ISO standards and the Telecommunication

Management Network (TMN) architecture (ITU 1995).

Figure 2-4: Network management approaches: (a) centralised, (b) platform based, (c) hierarchical, (d) distributed
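The peer arrangement of the distributed approach can be sketched as a set of domain managers that answer queries about their own domain and contact the responsible peer for anything else. The class `DomainManager`, the `up` flag used to simulate a failed manager and the object names are hypothetical.

```python
from typing import Dict


class DomainManager:
    """Manager responsible for one domain and connected to peer managers."""

    def __init__(self, domain: str, objects: Dict[str, object]) -> None:
        self.domain = domain
        self.objects = dict(objects)
        self.peers: Dict[str, "DomainManager"] = {}
        self.up = True  # set to False to simulate a failed manager

    def connect(self, other: "DomainManager") -> None:
        self.peers[other.domain] = other
        other.peers[self.domain] = self

    def lookup(self, domain: str, name: str) -> object:
        if not self.up:
            raise ConnectionError(f"manager of domain {self.domain} is down")
        if domain == self.domain:
            return self.objects[name]          # served from the local domain
        return self.peers[domain].lookup(domain, name)  # ask the peer


a = DomainManager("A", {"routerLoad": 0.4})
b = DomainManager("B", {"linkStatus": "up"})
a.connect(b)

remote_value = a.lookup("B", "linkStatus")   # retrieved from the peer manager
b.up = False                                 # the manager of domain B fails
local_value = a.lookup("A", "routerLoad")    # domain A is still manageable
```

When the manager of domain B fails, only information about domain B becomes unavailable. Replicating that information at other sites, the subject of this thesis, is one way to close this remaining gap.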

A distributed system should use interconnected and independent processing

elements to avoid having a single point of failure. Several reasons favour the use of a

distributed management architecture: higher performance/cost ratio, modularity, greater

expandability and scalability, higher availability and reliability. Distributed management

services should be transparent to users, so that they cannot distinguish between a local

and a remote service. This requires the system to be consistent, secure, fault tolerant and

have a bounded response time.

Remote Procedure Call (RPC) (NELSON 1981) is a well-understood control

mechanism used for calling a remote procedure in a client server environment. The

Object Management Group (OMG) Common Object Request Broker Architecture

(CORBA) (OMG 1997) is also an important standard for distributed object oriented

systems. It is aimed at the management of objects in distributed heterogeneous systems.

CORBA addresses two challenges in developing distributed systems (OMG 1997):

1. Making the design of a distributed system no more difficult than that of a centralised one

2. Providing an infrastructure to integrate application components into a distributed

system.

2.7 CORBA System

The most promising approach to solve the distributed interface and integration

problem is the CORBA architecture (VINOSKI 1997). Although CORBA does not

directly support a network management architecture, it provides a distributed object

oriented framework in which a management system may be developed.

The main component of CORBA is the Object Request Broker (ORB). An ORB is

the basic mechanism by which objects transparently make requests to each other on the

same machine or across a network. A client object need not be aware of the mechanisms

used to communicate with or activate an object, of how the object is implemented, or of

where the object is located. The ORB forms the foundation for building applications

constructed from distributed objects and for interoperability between applications in

both homogeneous and heterogeneous environments.

The OMG Interface Definition Language (IDL) provides a standardised way to

define the interfaces to CORBA objects. The IDL definition is the contract between the

implementor of the object and the client. IDL is a strongly typed declarative language

that is programming language independent. Language mapping enables objects to be

implemented in the developer's programming language of choice.
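The idea of an interface contract and of a broker that locates the target object on behalf of the client can be conveyed by a toy sketch. It is not CORBA: the abstract class merely plays the role that an IDL definition would play, the `ToyBroker` class and its methods are invented for illustration, and no marshalling or network transport takes place.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class Thermometer(ABC):
    """Interface contract between the implementor and the client."""

    @abstractmethod
    def read_celsius(self) -> float:
        ...


class RoomThermometer(Thermometer):
    """One possible implementation, hidden from the client."""

    def read_celsius(self) -> float:
        return 21.5


class ToyBroker:
    """Maps object names to implementations and forwards requests to them."""

    def __init__(self) -> None:
        self._registry: Dict[str, Any] = {}

    def register(self, name: str, implementation: Any) -> None:
        self._registry[name] = implementation

    def invoke(self, name: str, method: str, *args: Any) -> Any:
        # The client supplies only a name and an operation; the broker
        # finds the target object and delivers the request.
        target = self._registry[name]
        return getattr(target, method)(*args)


broker = ToyBroker()
broker.register("sensors/room1", RoomThermometer())
reading = broker.invoke("sensors/room1", "read_celsius")
```

The client code depends only on the name `sensors/room1` and the operation declared in the contract, so the implementation could be replaced or moved without changing the client.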

CORBA services include naming, events, persistence, transactions, concurrency

control, relationships, queries, security etc. CORBA services are the basic building

blocks for distributed object applications. Compliant objects can be combined in many

different ways and put to many different uses in applications. They can be used to

construct higher level facilities and object frameworks that can inter-operate across

multiple platform environments.

2.8 Implementing OSI Management Services for TMN

Recently the telecommunication industry has gained knowledge and experience

establishing management functionality through the Telecommunication Management

Network (TMN) framework (ITU 1995). On the other hand, in the Internet community, the

Simple Network Management Protocol (SNMP) has gained widespread acceptance due

to its simplicity of implementation. Thus, TMN and Internet management will co-exist

in the future.

The aim of the TMN is to enhance interoperability of management software and to

provide an architecture for management systems. A TMN is a logically distinct network

from the telecommunication network that it manages. It interfaces with the

telecommunication network at several different points and controls their operations. The

TMN information architecture is based on an object oriented approach and the

agent/manager concepts that underlie the Open Systems Interconnection (OSI) systems

management.

The Telecommunication Management Network (TMN) is a framework for the

management of telecommunication networks and the services provided on those

networks. The Open Systems Interconnection (OSI) management framework is an

essential component of the TMN architecture. Each TMN function block can play the

role of an OSI manager, an OSI agent or both.

A managed object instance can represent a resource and thus there is a

requirement for communication between managed object instances in an OSI agent and

the resources they represent. Examples of resources include telecommunication

switches, bridges, gateways etc. If a new interface card is added to a switch, the switch

may send a create request to the agent for the creation of the corresponding managed

object instance.

Figure 2-5 illustrates how TMN systems can inter-work within the TMN logical layer

architecture (SIDOR 1998, FERIDUM 1996). In this architecture, system A manages

system B, and B may, in turn, require operations on the information model of

system C.

The management information base (MIB) is the managed object repository and

may be implemented by using C++ objects through a MIB composer tool (FERIDUM

1996). In (BAN 1995) a uniform generic object model (GOM) is proposed for

manipulating transparently managed objects of various specific object models (CORBA,

OSI X.700, COM etc.). Communication between managed object instances and

resources can be initiated from both directions. Resource access components access

managed object instances through the core agent. Some or all of the managed object

instances in a MIB may be persistent to allow fast recovery after an agent failure. There

are two major design considerations in implementing persistence:

1. Performance: persistent managed objects ensure fast restart after agent failures. For

example the instance representing a leased line between two communication nodes

may need to be persistent, whereas an instance representing a connection does not

(since after an agent failure, the connection will be terminated). Object oriented

databases, traditional relational databases or even flat files can be used to

implement selective persistence.

2. Synchronisation: When the agent restarts, managed objects must be updated to

reflect the current state of the resources. Synchronisation requires the exchange of

'are you there?' and 'what is the current value?' type messages between managed objects

and resources.

Figure 2-5: Inter-working TMN
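Selective persistence and synchronisation after a restart can be sketched using a flat file as the persistent store. The JSON file format, the class and method names, the `reachable` flag and the object names are assumptions made for this illustration; the resources are simulated by a dictionary mapping each resource that answers to its current value.

```python
import json
import os
import tempfile
from typing import Dict


class Agent:
    def __init__(self, store_path: str) -> None:
        self.store_path = store_path
        self.objects: Dict[str, dict] = {}

    def create(self, name: str, value: object, persistent: bool) -> None:
        self.objects[name] = {"value": value, "persistent": persistent}
        self._save()

    def _save(self) -> None:
        # Selective persistence: only objects flagged persistent reach the file.
        kept = {n: o for n, o in self.objects.items() if o["persistent"]}
        with open(self.store_path, "w") as handle:
            json.dump(kept, handle)

    def restart(self, resources: Dict[str, object]) -> None:
        # Volatile objects are lost; persistent ones are reloaded from the file.
        with open(self.store_path) as handle:
            self.objects = json.load(handle)
        # Synchronisation: ask each resource whether it is there and, if so,
        # what its current value is.
        for name, obj in self.objects.items():
            obj["reachable"] = name in resources
            if obj["reachable"]:
                obj["value"] = resources[name]
        self._save()


path = os.path.join(tempfile.mkdtemp(), "mib.json")
agent = Agent(path)
agent.create("leasedLine-1", "up", persistent=True)
agent.create("connection-42", "open", persistent=False)

# Simulated agent failure and restart; the leased line changed state meanwhile.
agent.restart(resources={"leasedLine-1": "degraded"})
```

After the restart the leased line object is available immediately and carries the refreshed value, while the connection object is gone, in line with the leased line and connection example given above.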

2.9 Replication In a Management System

Making managed objects persistent may increase the performance and object

availability offered during an agent's failure. Replication may be used to further increase

the performance and the availability of network managed objects. The main benefits

of replication are that it allows the control of objects

locally at each replication site and it lets managers gain faster access to a MIB managed

object by retrieving information locally without the need to perform remote

transactions. In that way the load is shared among many sites.

To better facilitate the use of a replication technique in a network management

system, we may incorporate replication into a distributed framework (CORBA). CORBA

has been designed to provide an architecture for distributed object-oriented computing,

not network management. Engineers have focused their efforts on developing an

integrated management platform to create, manage and invoke distributed

telecommunication services. Examples of these efforts are (LEPPINEN 1997, MAFFEIS

1997a, RAHKILA 1997).

The CORBA standard provides mechanisms for the definition of interfaces to

distributed objects and for communication of operations to those objects through

messages. Unfortunately, the current CORBA standard makes no provision for fault

tolerance.

To provide fault tolerance, objects should be replicated across multiple processors

within the distributed system (ADAMEC 1995). The motivations for applying object

replication in a distributed network management system are of several kinds:

- One is performance enhancement. Management information that is shared

between a large manager community should not be held at a single server, since this

computer will act as a bottleneck that slows down responses.

- Another motivation is improvement of fault tolerance. When the computer holding one

replica crashes, the system can proceed with management computation using another replica.

- Another motivation is the use of replicas to access remote objects.

When a remote object is to be accessed, a local replica reflecting the remote object's

state is created and used instead of the remote object.

Figure 2-6 shows a typical scheme for implementing replication of management

information. The agent updates the MIBs located at the manager sites by exchanging

messages with the managers. Each manager gets the management information locally

without the need to issue a remote request. This yields a performance improvement since

two additional instances of the MIB are used to provide information about the same

resources.
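The scheme of Figure 2-6 can be sketched as an agent that pushes every update to the MIB copies held at the manager sites, so that each manager reads locally. The class names and the push-on-update policy are assumptions for illustration. The sketch ignores site failures, lost messages and concurrent updates, which are exactly the problems that a replica control algorithm has to solve.

```python
from typing import Dict, List


class ManagerSite:
    """Manager holding a local replica of the management information."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.local_mib: Dict[str, object] = {}

    def apply_update(self, obj: str, value: object) -> None:
        # Message received from the agent.
        self.local_mib[obj] = value

    def read(self, obj: str) -> object:
        # Served from the local replica; no remote request is issued.
        return self.local_mib[obj]


class ReplicatingAgent:
    """Agent that keeps its own MIB and propagates changes to the replicas."""

    def __init__(self, replicas: List[ManagerSite]) -> None:
        self.mib: Dict[str, object] = {}
        self.replicas = replicas

    def update(self, obj: str, value: object) -> None:
        self.mib[obj] = value
        for site in self.replicas:
            site.apply_update(obj, value)


m1, m2 = ManagerSite("manager-1"), ManagerSite("manager-2")
agent = ReplicatingAgent([m1, m2])
agent.update("linkStatus", "up")
agent.update("linkStatus", "down")
values_seen = (m1.read("linkStatus"), m2.read("linkStatus"))
```

With two manager sites there are three copies of each managed object in total, the agent's own and one per manager, which corresponds to the two additional MIB instances mentioned above.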

(MAFFEIS 1997b) discusses a CORBA based fault tolerant system that monitors remote

objects and, if some of them fail, automatically restarts the failed objects and replicates

stateful objects on the fly, migrating objects from one host to another. In

(NARASIMHAN 1997) a similar system is discussed, which provides fault tolerant

services under CORBA to applications with no modification to the existing ORB.

Figure 2-6: (a) replication (b) no replication

2.10 Need for replication techniques in a management system

The Management Information Base (MIB) is viewed as a distributed database that

stores information associated with the resources of a network or a remote system.

Replication is applied to managed objects, which may be seen as an abstract

representation of the resources. In traditional database applications, the need for a

replication scheme is straightforward since the nature of the data easily allows the

implementation of such a scheme. For instance, a company may have locations in

different cities or a bank may have multiple branches. It is natural for such applications

to enforce replication since such a scheme may drastically increase the fault tolerance of

the system. If, for example, the software in some branch fails, information regarding

customers of that branch may be available from some other branch. A management

information base is different from a traditional database in that the information stored

in it corresponds to objects that represent software or hardware resources. For instance,

a variable associated with a remote sensor may be considered as a managed object and

its value may be viewed as an instantiation describing its state. The main questions that

arise in a management database application are the following:

1. Do we really need to replicate objects of this kind?

2. How useful is a replication scheme in a practical management system?

To answer these questions, we present an example of a network management system

which manages network resources spread across an interconnected network.

Figure 2-7 shows an interconnected network consisting of three networks. Each

network constitutes a management domain. Each domain has a manager, an agent, a

management information base (MIB) and some network resources that are monitored

and controlled by the manager. Agents are responsible for collecting information from

the resources and updating the relevant managed objects in the database to reflect the

current state of the resources. The manager and the agent of each domain could be

accommodated by the same host computer. However, we use the most general case in

which manager and agent reside on different computers. This could be the case where

the manager runs on a diskless machine. There are two possible scenarios:

1. No replication: Each MIB stores information about managed resources of its own

domain.

2. Replication: Each MIB contains replicated information which is associated with

resources of other domains.

In a no-replication scheme, when a manager wants information about some

network resources, it contacts the appropriate agent, which is responsible for providing

this management information. The agent either collects the information dynamically

from the resources or makes a relevant query to the MIB and sends a response back to

the manager. In a purely distributed management environment, if manager B wants some

information about the network resources A, it sends a request to agent A. Upon

receiving the request, agent A undertakes to provide the requested information to

manager B. In a hierarchical management system, this could be accomplished through

manager A. Manager B establishes manager-to-manager communication with

manager A and then asks A to request its local (domain) agent A to complete the task.

Figure 2-7: Network management replication example (Domains A, B and C, each with a manager, an agent, a MIB and network resources; Networks A, B and C interconnected by Bridges AB, BC and AC)

In a replication scheme, replicas of managed objects exist on other MIBs. When

an agent updates the state of a managed object locally, it also transmits the object state

to the other domains and updates all the replicated objects that reside at other MIBs. Under

this arrangement, when manager B wants some information about network resources A,

it needs to contact neither agent A nor manager A, only its local agent B, since

MIB B contains replicated managed objects of MIB A. This improves

performance and speeds up the process of collecting network management information

from remote systems. The benefit is more pronounced if network B is a remote network

linked to network A by a low-speed leased line. In case of an agent

failure, managers can still get information about particular managed objects from

other MIBs, which increases the availability of the management information and makes the

management system more robust and fault tolerant.

We may further increase performance and availability if the managed objects of faulty

agents are dynamically re-created on other agents. This can be achieved by

utilising CORBA-based replication techniques that may restart failed objects and

replicate their state on the fly on another agent (MAFFEIS 1997b).

It becomes clear that replicating managed objects to other agents' MIBs may

increase the availability of certain managed objects and the performance of the

management system, ensuring continuous network management without interrupting the

monitoring and control of network resources.
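
The replication scenario of this section can be illustrated with a minimal Python sketch. The class names `MIB` and `Agent`, the object name `ifOperStatus.1` and the in-memory dictionaries are illustrative assumptions and are not part of the management system described here. Message exchange between domains is reduced to a direct method call.

```python
class MIB:
    """In-memory stand-in for the management information base of one domain."""

    def __init__(self, domain):
        self.domain = domain
        self.objects = {}  # (owner_domain, object_name) -> value


class Agent:
    """Hypothetical agent that maintains its local MIB and pushes replicas."""

    def __init__(self, mib):
        self.mib = mib
        self.peer_mibs = []  # MIBs of other domains that hold replicas
        self.alive = True

    def update(self, name, value):
        key = (self.mib.domain, name)
        self.mib.objects[key] = value      # local update
        for peer in self.peer_mibs:        # propagate the state to the replicas
            peer.objects[key] = value

    def query(self, owner_domain, name):
        if not self.alive:
            raise ConnectionError("agent unavailable")
        return self.mib.objects[(owner_domain, name)]


mib_a, mib_b = MIB("A"), MIB("B")
agent_a, agent_b = Agent(mib_a), Agent(mib_b)
agent_a.peer_mibs.append(mib_b)  # replication: MIB B holds replicas of A's objects

agent_a.update("ifOperStatus.1", "up")
agent_a.alive = False            # simulate a failure of agent A

# Manager B asks its local agent B; neither agent A nor manager A is contacted.
status_via_b = agent_b.query("A", "ifOperStatus.1")
```

In this sketch manager B still obtains the state of a resource in domain A after agent A has failed, because the replica in MIB B was updated before the failure. Without the `peer_mibs` entry, the same query would have to go to agent A and would fail.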

2.11 Synchronous and Asynchronous replica models

In the context of research on fault tolerance, a system that has the property of

always responding to a message within a known finite time interval is said to be

synchronous. A replica system is said to be synchronous if all update

requests are ordered; that is, requests are processed at all replicas in the same order.

Consider, for example, the replication model in Figure 2-8. The node M sends an update

request r to all other nodes G1, G2, G3 and waits for responses. If it receives all the

acknowledgements A1, A2, A3 from those nodes, it assumes that the update has completed

successfully and proceeds to the next request. In a synchronous replication system the

next request is forwarded only if the current update request has been processed at all the

agencies holding replicas. A replication system not having this property is said to be

asynchronous. That is, in an asynchronous replication system a node proceeds to the

next request without waiting for acknowledgements from all the recipients of

the previous request. This may result in unordered processing of requests: requests

received by node G1 may be processed in a different order from that at node G2 or

G3.
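
The difference between the two models can be sketched as a small deterministic simulation. This is only an illustration under stated assumptions: the `Replica` class, the two functions and the `delays` table of per-message network delays are hypothetical, and real message passing is replaced by direct calls.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []  # requests in the order in which they were processed

    def process(self, request):
        self.log.append(request)
        return self.name  # the returned name plays the role of the acknowledgement


def synchronous_update(replicas, requests):
    """Forward the next request only after every replica has acknowledged."""
    expected = {replica.name for replica in replicas}
    for request in requests:
        acks = {replica.process(request) for replica in replicas}
        if acks != expected:
            raise RuntimeError("update not acknowledged by all replicas")


def asynchronous_update(replicas, requests, delays):
    """Send requests back to back; each arrives after its own (assumed) delay."""
    for replica in replicas:
        arrivals = sorted(
            (send_time + delays[replica.name][request], send_time, request)
            for send_time, request in enumerate(requests)
        )
        for _, _, request in arrivals:  # arrival order at this replica
            replica.process(request)


requests = ["r1", "r2"]

sync_group = [Replica("G1"), Replica("G2"), Replica("G3")]
synchronous_update(sync_group, requests)

async_group = [Replica("G1"), Replica("G2"), Replica("G3")]
delays = {
    "G1": {"r1": 1, "r2": 1},
    "G2": {"r1": 5, "r2": 1},  # r1 is slow on the link to G2
    "G3": {"r1": 1, "r2": 1},
}
asynchronous_update(async_group, requests, delays)
```

In the synchronous run every replica processes r1 before r2. In the asynchronous run G2 processes r2 before r1 because r1 is delayed on its link, while G1 and G3 keep the sending order.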

Figure 2-8: Synchronous Replication

2.12 Replication Transparency and Architectural Model

A key issue relating to replication is transparency. Transparency (invisibility)

determines the extent to which the users (managers) are aware that some objects are

replicated. At one extreme the users are fully aware of the replication process and can even

control it. At the other, the system does everything without users noticing anything. The

ANSA reference manual (ANSA 1989) and the International Organization for Standardization

Reference Model for Open Distributed Processing (ISO 1992a) provide definitions

related to replication transparency. Among other things, the standards state that a replication

system is transparent if it enables multiple instances of the information object (in our

case managed object) to be replicated without knowledge of the replicas by users or

application programs.

A basic architectural model for controlling replicated objects may involve distinct

agencies located across a network. Figure 2-9(a) shows how a manager may control the

entire process in a non-transparent system. When a manager creates or updates an object,

it does so on one agency and then it takes responsibility to make copies or complete any

update on other agencies.

An agency is a process that contains replicas and performs operations upon them

directly. An agency may maintain a physical copy of every logical item; however, there

are cases when an agency may not maintain a physical copy. For example, a managed

object needed mostly by a manager on one LAN may never be used by a manager on

another LAN. In this case the agency in the second LAN may not contain a physical

copy of that object and, if this manager ever requests information about the object, the

local agency may obtain the information by making a call to another agency that actually

holds a physical copy of the object. The general model for a transparent replication

system is shown in Figure 2-9(b). A manager's request is first handled by a Front End (FE)

component.
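
A minimal sketch of an agency that holds no physical copy of an object, and of a front end that hides this from the manager, might look as follows. The names `Agency`, `FrontEnd` and `printerQueue`, and the policy of always using the first agency, are assumptions made for illustration only.

```python
class Agency:
    def __init__(self, name):
        self.name = name
        self.copies = {}  # physical copies held by this agency
        self.peers = []   # other agencies that this agency can call

    def read(self, object_name):
        if object_name in self.copies:
            return self.copies[object_name]
        for peer in self.peers:  # no local copy: call an agency that holds one
            if object_name in peer.copies:
                return peer.copies[object_name]
        raise KeyError(object_name)


class FrontEnd:
    """Hides from the manager which agency actually serves a request."""

    def __init__(self, agencies):
        self.agencies = agencies

    def read(self, object_name):
        # Assumed policy: always forward to the first known agency.
        return self.agencies[0].read(object_name)


lan1, lan2 = Agency("LAN1"), Agency("LAN2")
lan1.copies["printerQueue"] = 3  # only the agency on LAN 1 holds a physical copy
lan2.peers.append(lan1)

fe = FrontEnd([lan2])            # front end used by the manager on LAN 2
value = fe.read("printerQueue")
```

The manager on the second LAN obtains the value through its front end without knowing that its local agency holds no copy of the object.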

Figure 2-9: Architectural model for replication. (a) non-transparent system (b) transparent replication system (c) lazy replication (d) primary copy model.

The FE component is used for passing the messages to at least one agency. This

hides the details of how, and to which agency, the message is forwarded. The manager does

not need to choose a specific agency for service; it just sends the message and the

FE component takes responsibility for determining which agency will receive the request.

The FE component may be implemented as part of the manager application or it may be

implemented as a separate process invoked by a manager application using a kind of

Interprocess Communication (PRESOTTO 1990). Figure 2-9(c) shows a specialisation of

the architectural model in Figure 2-9(b). The model in (c) is called a lazy replication

model and implements what is called gossip architecture (LADIN 1992). Here the

manager creates or updates only one copy in one agency. Later the agency itself makes

replicas on other agencies automatically without the manager's knowledge. The

replication server is running in the background all the time scanning the managed object

hierarchy. Whenever it finds a managed object that has fewer replicas than

expected, the replication server arranges to make all the additional copies. The

replication server works best for immutable objects since such objects cannot change

during the replication process. This architecture is also called gossip architecture

because the replica agencies exchange gossip messages in order to convey the updates

they have each received. In gossip architecture the FE component communicates directly

either with an individual agency or with more than one agency. Figure 2-9(d)

shows another replication architectural model known as the primary copy model

(LISKOV 1991). In that model all front ends communicate with the same primary

agency when updating a particular data item. The primary agency propagates the

updates to the other agencies called slaves. Front ends may also read objects from a

slave. If the primary agency fails, one of the slaves can be promoted to act as the

primary. Front ends may communicate with either a primary or a slave agency to

retrieve information. When communicating with a slave, however, front ends may not

perform updates; updates are made only to the primary copy of an object.
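
The primary copy model can be sketched as follows. This is a simplified illustration and not the algorithm of (LISKOV 1991): the class names, the `alive` flag used to simulate a failure and the rule of promoting the first live slave are all assumptions.

```python
class ReplicaAgency:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.alive = True


class PrimaryCopyGroup:
    def __init__(self, primary, slaves):
        self.primary = primary
        self.slaves = list(slaves)

    def update(self, key, value):
        """All updates go to the primary, which propagates them to the slaves."""
        if not self.primary.alive:
            self.promote()
        self.primary.store[key] = value
        for slave in self.slaves:
            if slave.alive:
                slave.store[key] = value

    def read(self, key, from_slave=None):
        """Reads may be served by the primary or by a chosen slave."""
        agency = from_slave if from_slave is not None else self.primary
        return agency.store[key]

    def promote(self):
        """Assumed rule: the first live slave becomes the new primary."""
        for slave in self.slaves:
            if slave.alive:
                self.slaves.remove(slave)
                self.primary = slave
                return
        raise RuntimeError("no live agency to promote")


p, s1, s2 = ReplicaAgency("P"), ReplicaAgency("S1"), ReplicaAgency("S2")
group = PrimaryCopyGroup(p, [s1, s2])

group.update("sysUpTime", 100)
p.alive = False                 # the primary agency fails
group.update("sysUpTime", 200)  # S1 is promoted and handles the update
```

After the failure, S1 acts as the primary and S2 still receives the second update, while the failed agency P keeps the stale value 100.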

2.13 Summary

This chapter has set the background for a replication management system. It has

shown the need for using a replication scheme in a real time application. It has

examined the distributed aspect of a network management system describing the

distributed nature of the MIB. It has briefly discussed two major protocols (CMIP and

SNMP) for exchanging management messages. It has also examined design aspects of

the MIB discussing the significance of the managed object as an autonomous entity for

performing operations related to incoming messages. The concepts of object availability

and performance have been defined and used as a measure of the quality of service of

the system. Synchronous and asynchronous replica models have been examined and

finally various architectural models for replication have been discussed as a way to

transparently maintain multiple replicas.

In the following chapters we will discuss the internal mechanisms (algorithms) used

to obtain transparent updates to replicated objects. We will discuss a variety of solutions

that may be applied to ensure consistency among multiple replicas in the event of node

or communication link failures.

3. FAILURES IN A MANAGEMENT SYSTEM

This chapter discusses the nature of failures in a management system. It first

defines the concept of dependability between management agents and then classifies

certain failures that may occur in a management system, further analysing each one by

its potentially disruptive behaviour. Failure semantics and masking are examined as a

way to understand how failures may be masked by using certain techniques. The chapter

ends by specifying some architectural issues including synchronisation, communication

and availability of certain components in a group of agents.

3.1 Dependability Between Agents

An agent provides certain management services that may be viewed as a collection

of operations whose execution can be triggered by inputs from other agents (proxy) or a

manager or the passage of time. An agent implements a management service without

exposing to the manager the internal representation of the managed objects. Such details

are hidden from the manager, who need know only the externally specified management

service behaviour. Agents may implement their services by using services that are implemented by

other agents. An agent U depends on the agent R if the correctness of U depends on the

correctness of R's behaviour. The agent U is called the user and the R is called the

resource of U. Resources in turn might depend on other resources to provide their

service, and so on, down to the managed objects. The managed object is the atomic

resource which is not analysed further and which is actually used to represent hardware

or software components in a network. What is a resource at a certain level of abstraction

can be a user at another level of abstraction. The relationship between user and resource

is a "depends on" relationship, as shown in Figure 3-1.
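
Because a resource may itself be a user of other resources, the "depends on" relation is transitive. The sketch below computes every resource on which a given agent depends, directly or indirectly. The function `resources_of` and the example graph are hypothetical and serve only to illustrate the idea.

```python
def resources_of(user, depends_on):
    """Return every resource that `user` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(user, ()))
    while stack:
        resource = stack.pop()
        if resource not in seen:
            seen.add(resource)
            stack.extend(depends_on.get(resource, ()))
    return seen


# Hypothetical graph: agent U uses agent R, which uses two managed objects.
depends_on = {
    "U": ["R"],
    "R": ["mo_interface", "mo_cpu"],
}
all_resources = resources_of("U", depends_on)
```

Here agent U depends on R and, through R, on both managed objects. A managed object has no entry in the graph, which reflects its role as the atomic resource.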

A distributed management system consists of many agents. The management

services provided by those agents may depend on other secondary low level

management services associated with operating system components as well as

communication components. The union of all these management services is provided as

a distributed management system service. To ensure correctness and management

service availability, the classes of possible failures in the lower levels of abstraction

should be studied and redundancy in particular management services should be

introduced to prevent system crashes.

Figure 3-1: Relationship between user and resource.

3.2 Failure Classification

An agent designed to provide a certain management service works correctly if, in

response to requests, it behaves in a manner consistent with the service specification. By

an agent's response we mean any output that has to be delivered to the manager. An

agent fails when it does not behave in the manner specified. The most frequent

failures are the following:

1. Omission Failure: It happens when the agent receiving a request omits to respond

to that request. This failure occurs either because the queue of incoming messages

in the agent is full, so that any additional request is lost, or because an internal failure

(e.g. a memory allocation failure) is experienced due to a temporary lack of physical

resources for handling the incoming request. A communication service that

occasionally loses messages but does not delay messages is an example of a

service that suffers omission failures.

2. Timing Failure: It happens when the agent response is functionally correct but

untimely. The response occurs outside the real-time interval specified. The most

frequent timing failure is the performance (late timing) failure, in which the

response reaches the manager after the time interval during which the

manager is expecting the response has elapsed. This failure occurs because either the network

is too slow or the agent is overloaded and is late in responding to the

manager. An excessive message transmission or message processing delay due to

an overload is an example of a performance failure.

3. Response Failure: It happens when the agent responds incorrectly: either the value

of its output is incorrect (value failure) or the state transition that takes place is

incorrect (state failure). A search procedure that "finds" a key that is not an entry

of a routing table is an example of a response failure.

4. Crash Failure: It happens when, after the first omission to produce a response to a

request, an agent omits to produce outputs for subsequent requests.
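
The classification above can be expressed as a small decision procedure applied to each request and response exchange observed by a manager. This is a sketch under stated assumptions: the names `Outcome`, `classify` and `has_crashed` are hypothetical, only late (performance) timing failures are modelled, only value failures are detected among response failures, and a crash can be judged only over the history observed so far.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    OMISSION = "omission failure"
    TIMING = "timing failure"
    RESPONSE = "response failure"


def classify(response, elapsed, expected, deadline):
    """Classify one request/response exchange.

    response -- value returned by the agent, or None if nothing arrived
    elapsed  -- seconds between request and response
    expected -- value required by the service specification
    deadline -- longest acceptable response time in seconds
    """
    if response is None:
        return Outcome.OMISSION
    if response != expected:
        return Outcome.RESPONSE
    if elapsed > deadline:
        return Outcome.TIMING
    return Outcome.CORRECT


def has_crashed(outcomes):
    """Crash failure: from the first omission onwards, every request is omitted."""
    if Outcome.OMISSION not in outcomes:
        return False
    first = outcomes.index(Outcome.OMISSION)
    return all(outcome is Outcome.OMISSION for outcome in outcomes[first:])


history = [
    classify("up", 0.2, "up", 1.0),    # timely and correct
    classify("up", 3.0, "up", 1.0),    # correct value, but too late
    classify("down", 0.1, "up", 1.0),  # wrong value
    classify(None, 0.0, "up", 1.0),    # no response
    classify(None, 0.0, "up", 1.0),    # still no response
]
crashed = has_crashed(history)
```

An omission followed by a later response is an omission failure only, so `has_crashed` returns false for such a history.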

3.3 Faulty Agent Behaviour

To detect a failure, an agent should reveal a certain behaviour that allows us to

identify the occurrence of a failure in order to perform the appropriate actions for

handling the failure. The behaviour of an agent under the occurrence of a failure may be

classified as follows:

Fail-stop behaviour

Byzantine behaviour

With fail-stop behaviour, a faulty agency just stops and does not respond to subsequent

requests or produce further output, except perhaps to announce that it is no longer

functioning. With Byzantine behaviour, a faulty agency continues to run, issuing wrong

responses to requests and possibly working together maliciously with other faulty

managers or agencies to give the impression that they are all working correctly when

they are not. In our study we assume only fail-stop behaviour.

3.4 Failure Semantics

The failure behaviour an agent can exhibit must be studied in order to suggest

possible fault tolerance mechanisms. Recovery actions invoked upon detection of an

agent failure depend on the likely failure behaviour of the agent. Therefore one has to

extend the standard specifications of an agent to include failure behaviour. If the

specification of an agent prescribes that the failure F may occur, it is said that the agent

has F failure semantics (CRISTIAN 1991). If a communication service is allowed to

lose messages but the probability that it delays or corrupts messages is negligible, we
