PhD Thesis

 See discussions, stats, and author profiles for this publication at: http://www.researchgate.net/publication/36207005 Replication in distributed management systems.  ARTICLE Source: OAI CITATIONS 14 DOWNLOADS 13 VIEWS 48 1 AUTHOR: Evangelos Kotsakis European Commission 21 PUBLICATIONS 124 CITATIONS SEE PROFILE Available from: Evangelos Kotsakis Retrieved on: 01 August 2015



  • REPLICATION IN DISTRIBUTED MANAGEMENT

    SYSTEMS

    EVANGELOS GRIGORIOS KOTSAKIS

    Telford Research Institute

    Department of Electrical and Electronic Engineering

    The University Of Salford

    Submitted in Partial Fulfilment of the Requirements for the

    Degree of Doctor of Philosophy

    1998

  • ii

    This thesis is dedicated to

    my daughter Dimitra

  • iii

    TABLE OF CONTENTS

    LIST OF FIGURES ... V
    LIST OF TABLES ... VII
    ACKNOWLEDGMENTS ... VIII
    ABBREVIATIONS ... IX
    ABSTRACT ... X
    1. INTRODUCTION ... 1
    1.1 DISTRIBUTED MANAGEMENT SYSTEMS ... 1
    1.2 REPLICATION ON A DISTRIBUTED MIB ... 2
    1.3 THE WORK ... 4
    1.4 ROAD MAP OF THE THESIS ... 5
    2. REPLICATION MANAGEMENT SYSTEM ARCHITECTURE ... 8
    2.1 MANAGEMENT FUNCTIONAL AREAS ... 8
    2.2 MANAGEMENT ARCHITECTURAL MODEL ... 10
    2.3 PROTOCOLS FOR CONTROLLING MANAGEMENT INFORMATION ... 11
    2.3.1 Osi Management Framework ... 11
    2.3.2 Internet Network Management ... 12
    2.4 OBJECT ORIENTED MIB MODELLING ... 13
    2.5 DISTRIBUTED MANAGEMENT INFORMATION BASE (MIB) ... 15
    2.6 DISTRIBUTED NETWORK MANAGEMENT ... 18
    2.7 CORBA SYSTEM ... 22
    2.8 IMPLEMENTING OSI MANAGEMENT SERVICES FOR TMN ... 23
    2.9 REPLICATION IN A MANAGEMENT SYSTEM ... 26
    2.10 NEED FOR REPLICATION TECHNIQUES IN A MANAGEMENT SYSTEM ... 29
    2.11 SYNCHRONOUS AND ASYNCHRONOUS REPLICA MODELS ... 33
    2.12 REPLICATION TRANSPARENCY AND ARCHITECTURAL MODEL ... 33
    2.13 SUMMARY ... 37
    3. FAILURES IN A MANAGEMENT SYSTEM ... 38
    3.1 DEPENDABILITY BETWEEN AGENTS ... 38
    3.2 FAILURE CLASSIFICATION ... 39
    3.3 FAULTY AGENT BEHAVIOUR ... 40
    3.4 FAILURE SEMANTICS ... 41
    3.5 FAILURE MASKING ... 42
    3.6 ARCHITECTURAL ISSUES ... 46
    3.7 GROUP SYNCHRONISATION ... 47
    3.7.1 Close Synchronisation ... 47
    3.7.2 Loose synchronisation ... 48
    3.8 GROUP SIZE ... 49
    3.9 GROUP COMMUNICATION ... 49
    3.10 AVAILABILITY POLICY ... 50
    3.11 GROUP MEMBER AGREEMENT ... 51
    3.12 SUMMARY ... 53
    4. REPLICA CONTROL PROTOCOLS ... 55
    4.1 PARTITIONING IN A REPLICATION SYSTEM ... 55
    4.2 CORRECTNESS IN REPLICATION ... 56
    4.3 TRANSACTION PROCESSING DURING PARTITIONING ... 59
    4.4 PARTITION PROCESSING STRATEGY ... 60
    4.5 AN ABSTRACT MODEL FOR STUDYING REPLICATION ALGORITHMS ... 62
    4.6 PRIMARY SITE PROTOCOL ... 66

  • iv

    4.7 VOTING ALGORITHMS ... 69
    4.7.1 Majority Consensus Algorithm ... 70
    4.7.2 Voting With Witnesses ... 73
    4.7.3 Dynamic Voting ... 73
    4.7.4 Dynamic Majority Consensus Algorithm (DMCA) - A novel approach ... 79
    4.8 SUMMARY ... 91
    5. ANALYSIS AND DESIGN OF THE SOFTWARE SIMULATION ... 93
    5.1 INTRODUCTION TO SIMULATION MODELLING ... 93
    5.2 USING AN OBJECT-ORIENTED TECHNIQUE FOR MODELLING A SIMULATION SYSTEM ... 94
    5.3 OBJECT ORIENTED DISCRETE EVENT SIMULATION ... 95
    5.4 THE SIMULATION MODELLING PROCESS ... 96
    5.4.1 Problem formulation ... 96
    5.4.2 Model Implementation ... 97
    5.5 OBJECT ORIENTED ANALYSIS AND DESIGN ... 97
    5.5.1 Analysis ... 98
    5.5.2 Design ... 98
    5.5.3 Implementation ... 99
    5.6 ATS REQUIREMENTS ... 99
    5.7 ATS ANALYSIS ... 101
    5.7.1 Object Model ... 103
    5.8 DYNAMIC MODEL ... 106
    5.9 EVALUATION OF THE SYSTEM ... 108
    5.10 SUMMARY ... 108
    6. SIMULATION AND ESTIMATION OF REPLICA CONTROL PROTOCOLS ... 110
    6.1 PERFORMANCE EVALUATION ... 110
    6.2 THE SIMULATION MODEL ... 111
    6.3 FAULT INJECTION ... 113
    6.4 SIMULATED ALGORITHMS ... 114
    6.5 THE PROTOCOLS ROUTINES ... 115
    6.6 IMPLEMENTING GROUP COMMUNICATION ... 116
    6.7 FUNCTIONAL COMPONENTS OF THE SIMULATION ... 117
    6.8 PARAMETER OF THE SIMULATION ... 119
    6.9 AVAILABILITY AND THE CONTRIBUTION OF THE DMCA ALGORITHM ... 119
    6.10 RESULTS OF THE SIMULATION ... 122
    6.11 SUMMARY ... 135
    7. CONCLUSIONS ... 138
    7.1 CONTRIBUTIONS OF THIS WORK ... 138
    7.2 FUTURE RESEARCH DIRECTION ... 139
    7.3 CONCLUDING REMARKS ... 141
    APPENDIX-A PAPERS ... PAPERS 1
    APPENDIX-B TABLES ... TABLES 1
    APPENDIX-C SOURCE CODE ... CODE 1
    LIST OF REFERENCES ... REFERENCES 1

  • v

    LIST OF FIGURES

    FIGURE 2-1: BASIC MANAGEMENT MODEL ... 9
    FIGURE 2-2: VIEWS OF SHARED MANAGEMENT KNOWLEDGE ... 17
    FIGURE 2-3: SIMPLIFIED MANAGEMENT SYSTEM ... 17
    FIGURE 2-4: NETWORK MANAGEMENT APPROACHES (A) CENTRALISED (B) PLATFORM BASED (C) HIERARCHICAL (D) DISTRIBUTED ... 21
    FIGURE 2-5: INTER-WORKING TMN ... 25
    FIGURE 2-6: (A) REPLICATION (B) NO REPLICATION ... 28
    FIGURE 2-7: NETWORK MANAGEMENT REPLICATION EXAMPLE ... 31
    FIGURE 2-8: SYNCHRONOUS REPLICATION ... 33
    FIGURE 2-9: ARCHITECTURAL MODEL FOR REPLICATION. (A) NON TRANSPARENT SYSTEM (B) TRANSPARENT REPLICATION SYSTEM (C) LAZY REPLICATION (D) PRIMARY COPY MODEL ... 35
    FIGURE 3-1: RELATIONSHIP BETWEEN USER AND RESOURCE ... 39
    FIGURE 3-2: FAILURE MASKING ... 42
    FIGURE 3-3: GROUP MASKING ... 43
    FIGURE 4-1: REPLICATION ANOMALY CAUSED BY CONFLICT WRITE OPERATIONS. (A) BEFORE ISOLATION (B) AFTER ISOLATION ... 57
    FIGURE 4-2: LOGICAL AND PHYSICAL OBJECTS OF THE SENSOR ENTITY ... 64
    FIGURE 4-3: REPLICATION USING PRIMARY SITE ALGORITHM ... 66
    FIGURE 4-4: READ IN A PRIMARY SITE PROTOCOL ... 68
    FIGURE 4-5: WRITE IN A PRIMARY SITE PROTOCOL ... 68
    FIGURE 4-6: MAKE CURRENT IN A PRIMARY SITE PROTOCOL ... 69
    FIGURE 4-7: READ IN A MAJORITY CONSENSUS ALGORITHM ... 71
    FIGURE 4-8: WRITE IN A MAJORITY CONSENSUS ALGORITHM ... 72
    FIGURE 4-9: MAKE CURRENT IN A MAJORITY CONSENSUS ALGORITHM ... 72
    FIGURE 4-10: ISMAJORITY IN THE DYNAMIC VOTING PROTOCOL ... 75
    FIGURE 4-11: READ FUNCTION IN THE DYNAMIC VOTING PROTOCOL ... 75
    FIGURE 4-12: WRITE (UPDATE) IN THE DYNAMIC VOTING PROTOCOL ... 76
    FIGURE 4-13: UPDATE IN THE DYNAMIC VOTING PROTOCOL ... 78
    FIGURE 4-14: MAKE CURRENT IN THE DYNAMIC VOTING PROTOCOL ... 78
    FIGURE 4-15: READPERMITTED IN THE DMCA ... 82
    FIGURE 4-16: WRITEPERMITTED FUNCTION IN THE DMCA ... 83
    FIGURE 4-17: DOREAD FUNCTION IN THE DMCA ... 84
    FIGURE 4-18: DOWRITE FUNCTION IN THE DMCA ... 86
    FIGURE 4-19: MAKE CURRENT FUNCTION IN DMCA ... 88
    FIGURE 4-20: SEQUENCE DIAGRAM FOR DOREAD OPERATION ... 89
    FIGURE 4-21: SEQUENCE DIAGRAM FOR DOWRITE OPERATION ... 90
    FIGURE 4-22: SEQUENCE DIAGRAM FOR MAKECURRENT OPERATION ... 90
    FIGURE 5-1: ATS PROCESS DIAGRAM ... 102
    FIGURE 5-2: ATS OBJECT MODEL ... 106
    FIGURE 5-3: ATS DYNAMIC MODEL ... 107
    FIGURE 6-1: NETWORK MODEL ... 112
    FIGURE 6-2: FAULT INJECTION SYSTEM ... 113
    FIGURE 6-3: COMPONENTS OF THE SIMULATION MODEL ... 118
    FIGURE 6-4: AVAILABILITY CURVE ... 120
    FIGURE 6-5: TOTAL AVAILABILITY =4 ... 121
    FIGURE 6-6: BOUNDARIES OF TOTAL AVAILABILITY ... 122
    FIGURE 6-7: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=0.1 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ... 126
    FIGURE 6-8: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=0.2 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ... 127
    FIGURE 6-9: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=0.3 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ... 128
    FIGURE 6-10: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=0.4 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ... 129
    FIGURE 6-11: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=0.5 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ... 130

  • vi

    FIGURE 6-12: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=1.0 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ... 131
    FIGURE 6-13: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=1.5 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ... 132
    FIGURE 6-14: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=2.0 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ... 133
    FIGURE 6-15: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=2.5 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ... 134
    FIGURE 6-16: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=3.0 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY ... 135

  • vii

    LIST OF TABLES

    TABLE 2-1: CMISE SERVICES AND FUNCTIONS ... 12
    TABLE 2-2: SNMP SERVICES AND FUNCTIONS ... 13
    TABLE 4-1: DMCA MAPPING ... 91
    TABLE 6-1: SIMULATION PARAMETERS ... 123

  • viii

    ACKNOWLEDGEMENTS

    I would like to thank my supervisor Dr. B. H. Pardoe who has assisted me and

    guided me in preparing this thesis. His kind assistance in my struggles with the English

    language was very helpful. He has been willing to answer any technical question and

    provide me with the information and knowledge to cope with the difficult task of

    preparing a Ph.D. thesis. Without his kind help and encouragement, this research

    would never have been done.

    I would especially like to express sincere appreciation for the financial support,

    encouragement and love given to me by my parents Grigorios and Dimitra Kotsakis.

    They have given me much more than moral and material support through my University

    studies. They provided me with a rock-solid support system which proved helpful during

    my studies. Without their support and love, this research would never have been done.

    Special thanks are due to my wife Chaido for encouraging me over the last three

    years. I would like to thank her for her unconditional love, unfailing enthusiasm,

    unending optimism and confidence in my abilities. Her patience and support are

    boundless.

    Finally, I would like to thank little Dimitra, whose birth two years ago gave me a

    great joy, for being quiet while I was writing this thesis.

  • ix

    ABBREVIATIONS

    ANSA Advanced Networked Systems Architecture

    ATM Asynchronous Transfer Mode

    ATS Availability Testing System

    CMIS Common Management Information Service

    CMISE Common Management Information Service Element

    DMCA Dynamic Majority Consensus Algorithm

    FE Front End

    IP Internet Protocol

    ISO International Organization for Standardization

    LAN Local Area Network

    MIB Management Information Base

    OMT Object Modelling Technique

    OSF/DCE Open Software Foundation / Distributed Computing Environment

    OSI Open Systems Interconnection

    ROSE Remote Operations Service Element

    SNA Systems Network Architecture

    SNMP Simple Network Management Protocol

    TCP Transmission Control Protocol

    UDP User Datagram Protocol

    WAN Wide Area Network

  • x

    ABSTRACT

    Systems management is concerned with supervising and controlling a system so

    that it fulfils its requirements. The management of a system may be performed by a

    mixture of human and automated components. These components are abstract

    representations of network resources and they are known as managed objects. A

    distributed management system may be viewed as a collection of such objects located at

    different sites in a network. Replication is a technique used in distributed systems to

    improve the availability of vital data components and to provide higher system

    performance since access to a particular object may be accomplished at multiple sites

    concurrently. By applying replication in a distributed management system, we can locate

    certain management objects at multiple sites by copying their internal data and the

    operations used to access or update those data. This is a considerable advantage

    since it increases reliability and availability, provides higher fault tolerance and

    allows data sharing, thereby improving system performance.

    This thesis is concerned with methods that may be used to apply replication in

    such a system, as well as certain replica control algorithms that may be used to control

    operations over a replicated managed object. Certain replication architectures are

    examined and the availability provided by each of them is discussed. A new replica

    control algorithm is proposed as an alternative that provides higher availability. A tool

    for evaluating the availability provided by a replica control algorithm is designed and

    proposed as a benchmark test utility that examines the suitability of certain replica

    control algorithms.

  • 1

    1. INTRODUCTION

    The importance of replication techniques for providing high availability in

    distributed systems has been known for over two decades. However, the use of these

    techniques in network management systems has been minimal. A reason for this has

    been that the machinery needed to cope with the partitions and reunions is excessively

    complex. This thesis addresses the methods that may be safely used for replicating

    management objects and it shows how one can use replication techniques in a

    management system that preserves usability and availability.

    The rest of this chapter discusses replication techniques and the related problems.

    It begins with a discussion on network management systems and the use of replication

    in a distributed Management Information Base (MIB) for improving the availability of

    the managed objects. It then discusses the goal of this thesis and it concludes with a

    road-map for the rest of the thesis.

    1.1 Distributed Management Systems

    The management of a communication environment is a distributed information

    processing application where individual components of the management activities are

    associated with network resources. Management applications perform the management

    activities in a distributed and consistent manner that guarantees transparency and system

    operability. Management information is stored in a special database known as the

    Management Information Base (MIB). The MIB is the conceptual repository of the

    management information and each object stored in the MIB is associated with an

    individual network resource, an attribute used to represent a network activity.

  • 2

    When the

    MIB is distributed over sites, one site may fail, while other sites continue to operate.

    Distributed MIBs may also increase performance since different managed objects

    located at different hosts may be accessed concurrently.

    A fundamental problem with a distributed MIB is data availability. Since managed

    objects are stored on separate machines, a server crash or a network failure that

    partitions a client from a server can prevent a manager from accessing managed objects.

    Such situations are very frustrating to a manager because they impede computation even

    though client resources are still available. The problem of object availability increases

    over time for two reasons:

    1. The frequency of network failures will increase. Networks get larger, they

    cover larger geographical areas, encompass multiple administrative boundaries and

    consist of multiple sub-networks interconnected via routers and bridges. Furthermore, there is an

    increased need for better network resource management that increases the

    availability of managed objects and the management performance.

    2. The introduction of mobile network managers will increase the number of

    occasions on which management agencies are inaccessible. Wireless technologies

    such as packet-radio suffer from inherent limitations like short range and line of

    sight. Due to these limitations, the network connections between management

    agents and mobile managers will exhibit frequent partitions.

    1.2 Replication on a distributed MIB

    Replication is a technique used in distributed operating systems and distributed

    databases to improve the availability of system resources. In the case of the MIB,

    replication can be used to increase the performance of management activities and to

    provide high availability of management objects.

  • 3

    Replicating the same management object at different sites can improve the availability

    remarkably because the system can

    continue to operate as long as at least one site is up. It also improves performance of

    global retrieval queries, because the result of such a query can be obtained locally from

    any site; hence a retrieval query can be processed at the local site where it is submitted.
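
As a rough worked illustration of the "at least one site is up" argument, assume (this is an illustrative assumption, not a model taken from the thesis) that each of the n sites holding a copy is up independently with probability p, and that a retrieval succeeds whenever any single copy can be reached. The symbols n, p and A_read are introduced here only for this sketch.

```latex
A_{\text{read}}(n) = 1 - (1 - p)^{n},
\qquad\text{e.g. } p = 0.9,\ n = 3:\quad
A_{\text{read}}(3) = 1 - 0.1^{3} = 0.999
```

This bound applies only to retrievals that may be served by any one copy. Updates must also keep the copies consistent, so their availability depends on the replica control method in use.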

    To deal with replicated objects in a management information base, a control

    method is needed to keep all the replicas in a consistent state even during partitioning.

    The proposed techniques used to assure consistency may be divided into two families:

    those based on a distinguished copy and those based on voting. The former technique is

    based on the idea of using a designated copy of each replicated

    object in such a way that requests are sent to the site that contains that copy

    (ALSBERG 1976, BERNSTEIN 1987, GARCIA 1982).

    Voting replica control algorithms are more promising. They do not use a

    distinguished copy; rather a request is sent to all sites that include a copy of the

    replicated object. An access to a particular copy is granted if a majority of votes is

    collected (GIFFORD 1979, JAJODIA 1989, KOTSAKIS 1996b). Voting algorithms

    are fully distributed concurrency control algorithms and they exhibit higher flexibility

    over those based on a distinguished copy. Although voting algorithms

    pass many messages among sites, good performance can be expected

    if the round trip time becomes shorter. Today's technology can shorten the round trip

    time through the use of high speed networks like ATM. Thus, messages may be

    transferred from one machine to another faster and more reliably.
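
    The majority rule described above can be illustrated with a minimal sketch. It assumes

    one vote per copy and a strict majority threshold; the Replica class, the access_granted

    function and the site names are hypothetical and are not taken from the cited algorithms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str
    votes: int = 1  # assumption: one vote per copy unless stated otherwise


def access_granted(replicas, reachable_sites):
    """Grant access only if the reachable copies hold a strict majority of all votes."""
    total = sum(r.votes for r in replicas)
    collected = sum(r.votes for r in replicas if r.site in reachable_sites)
    return 2 * collected > total


# Five copies of one managed object, one per site (hypothetical site names).
replicas = [Replica(site) for site in "ABCDE"]
majority_side = access_granted(replicas, {"A", "B", "C"})  # 3 of 5 votes
minority_side = access_granted(replicas, {"D", "E"})       # 2 of 5 votes
print(majority_side, minority_side)
```

    Because two disjoint groups of sites cannot both hold more than half of the votes, at

    most one group is granted access at any time.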

    In the voting scheme, replicated objects can be accessed in the partition group that

    obtains a majority vote. In the distinguished (primary) copy scheme, availability is

    significantly limited in the case of a network or link failure. Primary copy algorithms

    exhibit good behaviour only for site failures. On the other hand, voting algorithms

    provide higher availability, tolerating both network and site failures. Voting algorithms

    guarantee consistency at the expense of availability. To provide higher availability,

    one may either use a consistency relaxation technique that allows concurrent access to

    replicated objects across different partitions (optimistic control algorithm) or improve

    the existing pessimistic voting algorithms by forming more sophisticated schemes.

    Optimistic control algorithms must be supported by an extra mechanism to detect and

    resolve diverging replicas once the partition groups are reconnected. This complicates

    the replication control task and allows, at least for a short interval of time,

    inconsistency between replicas. Such an approach requires a long time to recover the state

    of the database after a site failure and it does not seem appropriate for databases such as

    those used to store management information. Therefore, the invention of a more

    sophisticated replica control algorithm based on voting seems to be a promising

    approach that may provide higher availability while preserving strong consistency between

    replicated objects.
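
    The contrast drawn above between the primary copy scheme and voting under a network

    partition can be sketched as follows. The sketch assumes that a primary copy scheme

    grants access only to the partition group containing the primary site and that voting

    uses one vote per site; both function names and the example partitions are hypothetical.

```python
def primary_copy_available(groups, primary_site):
    """Groups that can access the object when requests must reach the primary site."""
    return [g for g in groups if primary_site in g]


def majority_available(groups, total_sites):
    """Groups that can access the object when a strict majority of sites is required."""
    return [g for g in groups if 2 * len(g) > total_sites]


# Five sites A..E, with the primary copy assumed to be held at site A.
split = [{"A"}, {"B", "C", "D", "E"}]        # A is cut off from the others
three_way = [{"A", "B"}, {"C", "D"}, {"E"}]  # no group holds a majority

print(primary_copy_available(split, "A"))
print(majority_available(split, 5))
print(majority_available(three_way, 5))
```

    In the first partition the primary copy scheme serves only the isolated site, whereas

    voting serves the four connected sites. In the second partition voting serves nobody,

    which is the loss of availability that pessimistic voting accepts in exchange for

    consistency.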

    Finally, the following questions are addressed in this thesis:

    Can we improve further the availability of managed objects by utilising voting

    techniques?

    Can replication be used effectively in a distributed MIB in order to ensure fault

    tolerance in a management system?

    1.3 The work

    The goal of this work is to investigate the potential use of replication in a network

    management system and examine the practical aspect of applying such a technique in

    real time systems. Intuitively this work appears viable for the following reasons.

    1. There is a great proliferation of different network technologies and a great need for

    network management. Keeping management information available and in a

    consistent state is of great importance, since the network operability depends on

    management activities.

    2. Availability of network information may be obtained only by applying redundancy.

    Replication is one of the most widely used techniques for ensuring high availability

    while keeping the replicated objects in a consistent state.

    3. The development of replication schemes for ensuring higher object availability and

    for tolerating site and communication failures is very promising and should be

    studied further.

    4. Failures always happen. No system can work forever within its specifications.

    Exogenous or endogenous factors can affect the operability of the system causing

    temporary or permanent failures.

    5. The need for developing fault tolerant techniques for network management systems is

    of great importance.

    1.4 Road Map of the Thesis

    The rest of the thesis consists of six chapters.

    Chapter 2 describes systems management and provides a discussion about the

    architectural models of management systems. It examines object replication in terms of

    system performance and availability and illustrates the replication architectural models

    for managing failures.

    Chapter 3 discusses the nature of failures in a management system. It decomposes a

    management system into management agents and defines the concept of dependability

    between agents. It also classifies certain failures according to their disruptive behaviour

    and it then ends by specifying the architectural impact of a group of agents on the

    availability of replicated objects.

    Chapter 4 presents the correctness criteria that should be taken into account when

    designing a replication system. It also introduces an abstract model in order to study

    formally certain replica control algorithms. It then presents a variety of replica control

    algorithms. It ends with a thorough discussion of the DMCA (Dynamic Majority

    Consensus Algorithm), a novel approach that extends current knowledge

    of replication techniques and improves the overall management of replicated objects,

    providing higher availability.

    Chapter 5 mainly evaluates the DMCA algorithm and presents quantitative results

    regarding the availability provided by DMCA. It starts by specifying the way in which

    one can measure the performance of certain replica control protocols and introduces the

    simulation model for building the benchmark test utility ATS (Availability Testing

    System) that estimates the effectiveness of the algorithms. It also shows the fault

    injection mechanism for generating faults and repairs. It ends with a thorough discussion

    on the results of the simulation justifying the superiority of the DMCA.

    Chapter 6 presents the object-oriented development process of the ATS tool. It first

    discusses the advantages of using object-oriented technology to develop such a complex

    system and it then presents the static object model and dynamic model of the ATS.

    The thesis concludes with Chapter 7, which presents the contributions and includes a

    discussion of future work and a summary of key results.

    2. REPLICATION MANAGEMENT SYSTEM

    ARCHITECTURE

    This chapter introduces the fundamental idea behind systems management and

    illustrates the main features of a management system. It provides a brief discussion

    about the architectural model of a management system and introduces the concept of

    distributed MIB as a naturally distributed database. It highlights issues related to the

    object oriented MIB modelling and definition of managed objects. It also justifies the

    use of object replication in terms of system performance, data reliability and availability.

    Finally it discusses the type of failures that may occur in a management system as well

    as the replication architectural models that may be used to maintain multiple replicas.

    2.1 Management Functional Areas

    Management of a system is concerned with supervising and controlling the

    system so that it fulfils the operational requirements. To facilitate the management task,

    Open System Interconnection (OSI) divides the management design process into five

    areas known as OSI management functional areas (ISO 1989). The fundamental

    objectives of the OSI functional areas are to fulfil the following goals.

    1. To maintain proper operation of a complex network (fault management).

    2. To maintain internal accounting procedures (accounting management).

    3. To maintain procedures regarding the configuration of a network or a

    distributed processing system (configuration management).

    4. To provide the capability of performance evaluation (performance

    management).

    5. To allow authorised access-control information to be maintained and

    distributed across a management domain (security management).

    In other words, a network that has a management system must be able to manage

    its own operations, performance, failures, modifications, security and hardware/software

    configuration. To fulfil the above requirements, it is necessary to develop a management

    model that is capable of incorporating a vast amount of services covered under the

    specifications of the OSI functional areas. The actual architecture of the network

    management model varies greatly, depending on the functionality of the platform and

    the details of the network management capability. A management architectural model

    that has been proposed by OSI defines the fundamental concepts of systems

    management (ISO 1992). This model describes the information, functional,

    communication and organisational aspects of systems management.

    Figure 2-1: Basic Management model

    2.2 Management Architectural Model

    The management of a communication environment is an information processing

    application. Because the system being managed is distributed, the individual

    components of the management activities are themselves distributed. Management

    applications perform the management activities in a distributed manner by establishing

    associations between management entities. As shown in Figure 2-1, there are two

    fundamental types of entities that exchange management information: one takes the

    manager role and the other the agent role. An entity that plays the manager role is

    the entity that generates queries to obtain management information. An

    entity plays the agent role when it accepts those queries, returns a response to the

    manager and generates notifications regarding the state of the objects located in the

    domain of the agent. An agent performs management operations on managed objects as

    a consequence of the communication with the manager. A manager may be seen as the

    part of the distributed application that is responsible for generating messages related to

    one or more management activities (collecting information, controlling the state of

    remote objects, changing the configuration of managed devices etc.). The agent, on the

    other hand, could be viewed as a local correspondent of the manager in the managed

    system controlling access to the managed object and looking after the distribution of

    events occurring in the managed system.

    As Figure 2-1 shows, three kinds of messages are transferred between manager and

    agent. These are the following:

    request messages transferred from the manager to the agent.

    response messages are bi-directional

    notification messages transferred from the agent to the manager
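
    The three kinds of messages can be pictured with a small data-structure sketch. The

    Kind, Message and Agent names, the message fields and the example object names are

    illustrative assumptions and do not correspond to any standardised encoding. Only the

    agent-to-manager direction of the response is shown.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    REQUEST = "request"            # manager -> agent
    RESPONSE = "response"          # reply to a request
    NOTIFICATION = "notification"  # agent -> manager, unsolicited


@dataclass
class Message:
    kind: Kind
    sender: str
    receiver: str
    body: dict


@dataclass
class Agent:
    name: str
    objects: dict = field(default_factory=dict)  # managed objects in the agent's domain

    def handle(self, request):
        """Answer a manager's request with a response message."""
        wanted = request.body["object"]
        return Message(Kind.RESPONSE, self.name, request.sender,
                       {"object": wanted, "value": self.objects.get(wanted)})

    def notify(self, manager, event):
        """Report an event to a manager without being asked."""
        return Message(Kind.NOTIFICATION, self.name, manager, {"event": event})


agent = Agent("agent-1", {"ifOperStatus": "up"})
request = Message(Kind.REQUEST, "manager-1", "agent-1", {"object": "ifOperStatus"})
response = agent.handle(request)
alarm = agent.notify("manager-1", "linkDown")
```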

    The database of systems management information, called the Management

    Information Base (MIB), is associated with both the manager and the agent. The MIB is

    the conceptual repository of the management information stored in an OSI-based

    network management system. The definition of the MIB describes the conceptual

    schema containing information about managed objects and relations between them. It

    actually defines the set of all managed objects visible to a network management entity.

    The MIB may be viewed as the interface definition - it defines a conceptual schema

    which contains information about specific managed objects, which are instantiations of

    managed object classes. The schema also embodies relationships between these

    managed objects, specifies the operations which may be performed on them and

    describes the notifications which they may emit (ISO 1993).

    2.3 Protocols For Controlling Management Information

    All types of management exchanges consist of requests and/or request-response

    pairs. There are currently two basic architectural frameworks related to the

    standardisation of the exchange of messages passed between managers and agents: the

    OSI management framework (ISO 1989) and the Internet network management

    framework (CERF 1988).

    2.3.1 OSI Management Framework

    The Common Management Information Protocol (CMIP) is a protocol designed to

    convey the requests or responses between managers and agents (ISO 1991a). The

    CMIP offers specific system management functions and services for the remote handling

    of management data. The CMIP uses the services offered by the Remote

    Operations Service Element (ROSE) (ISO 1988) in order to perform the create, set,

    delete, get, action, and event-report operations. The Common Management Information

    Service Element (CMISE) is the standardised application service element that is used to

    exchange management information in the form of requests and/or request-response pairs

    (ISO 1991). The CMISE is a basic vehicle that provides individual management

    applications with the means of executing management operations on objects and issuing

    notifications. The CMISE provides the means of supporting distributed management

    operations using application associations. The CMISE services shown in Table 2-1

    constitute the kernel functional unit of the CMISE. A system supporting the CMIP must

    implement the kernel functional units of the CMISE.
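
    As an illustration only, the confirmation modes listed in Table 2-1 can be held in a

    lookup table that rejects combinations the table does not allow. The CMISE_KERNEL

    dictionary and the validate function are hypothetical names; the service names and

    modes are transcribed from Table 2-1 and not from the standard itself.

```python
# Confirmation modes per service, transcribed from Table 2-1.
# "C" = confirmed, "NC" = not confirmed.
CMISE_KERNEL = {
    "M_EVENT_REPORT": {"C", "NC"},
    "M_GET": {"C"},
    "M_SET": {"C", "NC"},
    "M_ACTION": {"C", "NC"},
    "M_CREATE": {"C"},
    "M_DELETE": {"C"},
}


def validate(service, confirmed):
    """Return the mode string, or raise ValueError if the table does not allow it."""
    if service not in CMISE_KERNEL:
        raise ValueError(f"unknown CMISE service: {service}")
    mode = "C" if confirmed else "NC"
    if mode not in CMISE_KERNEL[service]:
        raise ValueError(f"{service} does not support mode {mode}")
    return mode


print(validate("M_SET", confirmed=False))
```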

    Table 2-1: CMISE services and functions

                   Service          Type   Function
    Notifications  M_EVENT_REPORT   C/NC   Gives notification of an event occurring on a managed object
    Operations     M_GET            C      Request for management data
                   M_SET            C/NC   Modification of management data
                   M_ACTION         C/NC   Action execution on a managed object
                   M_CREATE         C      Creation of a managed object
                   M_DELETE         C      Deletion of a managed object

    C = Confirmed, NC = Not Confirmed; the M prefix stands for management

    2.3.2 Internet Network Management

    The Simple Network Management Protocol (SNMP) (CASE 1990) is used to

    convey management information in an Internet management system, just as the CMIP

    is used in an OSI management system. The SNMP includes a limited set of management

    requests and responses. The managing system issues get, get_next, and set requests to

    retrieve single or multiple objects or to set the value of a single object. The managed

    system sends a response to complete the get, get_next or set requests. The managed

    system also sends an event notification, called a trap, to the managing system to identify

    the occurrence of an event. Table 2-2 lists the SNMP request and response messages

    along with their types and functions.
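
    The behaviour of the get, get_next and set requests can be sketched with a toy agent

    that keeps its objects ordered by object identifier. The ToyAgent class and the stored

    values are hypothetical. The three identifiers are the standard sysDescr.0, sysUpTime.0

    and sysName.0 instances, written as tuples so that Python's tuple comparison gives the

    lexicographic ordering that get_next relies on.

```python
import bisect

SYS_DESCR = (1, 3, 6, 1, 2, 1, 1, 1, 0)
SYS_UPTIME = (1, 3, 6, 1, 2, 1, 1, 3, 0)
SYS_NAME = (1, 3, 6, 1, 2, 1, 1, 5, 0)


class ToyAgent:
    """Hypothetical in-memory agent; no network encoding is modelled."""

    def __init__(self, objects):
        self._objects = dict(objects)

    def get(self, oid):
        return self._objects[oid]

    def get_next(self, oid):
        """Return the first (oid, value) pair that sorts strictly after `oid`."""
        oids = sorted(self._objects)
        i = bisect.bisect_right(oids, oid)
        if i == len(oids):
            raise KeyError("no object after the given identifier")
        return oids[i], self._objects[oids[i]]

    def set(self, oid, value):
        if oid not in self._objects:
            raise KeyError(oid)
        self._objects[oid] = value
        return value


agent = ToyAgent({SYS_DESCR: "toy router", SYS_UPTIME: 4200, SYS_NAME: "r1"})
agent.set(SYS_NAME, "edge-router-1")

# Walk the whole view by repeatedly asking for the next object.
walked = []
oid = (1, 3, 6, 1, 2, 1, 1)  # start from a prefix that sorts before every instance
while True:
    try:
        oid, value = agent.get_next(oid)
    except KeyError:
        break
    walked.append((oid, value))
```

    Repeated get_next requests let a manager traverse objects whose identifiers it does not

    know in advance, which is how multiple objects are retrieved with this limited request set.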

    2.4 Object oriented MIB modelling

    The central point in a management system is the managed object. A managed

    object may be seen as the management view of a resource and is described by the

    following characteristics:

    Attributes, that denote specific characteristics of the resource.

    Operations, that are performed on a set of attributes

    Behaviour, that specifies how the object reacts to operations performed on it.

    Notifications, that may be emitted to the managing station through a protocol

    as a reaction to an external event or as a repeated action.

    The managed object class provides a way to specify a family of managed objects. A

    managed object class is a template for managed objects that share the same attributes,

    operations, notifications and behaviour. A managed object is an instantiation of the

    managed object class.
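
    The relationship between a managed object class and its instances, together with the

    four characteristics listed above, can be sketched as follows. The InterfaceCard class,

    its attribute names and the notification format are hypothetical and serve only to

    illustrate the terminology.

```python
class InterfaceCard:
    """Hypothetical managed object class: a template shared by all its instances."""

    def __init__(self, name, notify):
        self.name = name
        self._notify = notify  # callback standing in for the management protocol
        # Attributes: characteristics of the underlying resource.
        self.attributes = {"admin_state": "down", "speed_mbps": 155}

    # Operation: performed on the object's attributes.
    def set_admin_state(self, state):
        previous = self.attributes["admin_state"]
        self.attributes["admin_state"] = state
        # Behaviour: a real state change emits a notification; a no-op does not.
        if state != previous:
            self._notify({"source": self.name, "event": "stateChange",
                          "old": previous, "new": state})


received = []                                       # notifications seen by the managing station
card_1 = InterfaceCard("card-1", received.append)   # two instantiations
card_2 = InterfaceCard("card-2", received.append)   # of the same class
card_1.set_admin_state("up")
card_1.set_admin_state("up")                        # no change, so no second notification
```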

    Table 2-2: SNMP services and functions

                   Service          Type   Function
    Notifications  Trap             C/NC   An agent sends a trap to alert the manager that an event has occurred
    Operations     GetRequest       C      Retrieves the state of a single object
                   GetNextRequest   C      Retrieves the state of the next object in a sequence of objects
                   GetResponse      NC     Response sent by the agent to the manager
                   SetRequest       C      Sets the state of a managed object

    C = Confirmed, NC = Not Confirmed

    The MIB is the conceptual repository containing all the related information about

    managed objects. The MIB modelling encompasses an abstract model and an

    implementation model (KOTSAKIS 1995). The abstract model defines

    Principles of naming objects

    The logical structure of management information

    Concepts related to managed object classes and the relationships between

    them.

    The implementation model (BABAT 1991, KOTSAKIS 1995) defines the following

    The platform for hosting a MIB.

    The architectural principles for partitioning MIB information

    Database type (Object oriented or relational)

    Translation of MIB object model into a schema

    The Management Information Model (ISO 1993) defines two types of management

    operations

    operations intended to be applied to the object attributes

    operations intended to be applied to the managed object as a whole.

    Attribute oriented operations are as follows

    get attribute value

    replace attribute value

    replace with default value

    add member

    remove member

    Any operation may affect the state of one or more attributes. The operations may also be

    performed atomically (either all operations succeed or none is performed). Operations

    that may be applied to the managed object as a whole are the following

    create

    delete

    action

    An action operation requests the managed object to perform the specified action and to

    indicate the result of this action.
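
    The attribute oriented operations and their atomic application can be sketched as

    follows. The AttributeSet class, its method names and the snapshot-and-restore

    strategy are assumptions made for illustration; the Management Information Model does

    not prescribe how atomicity is implemented.

```python
import copy


class AttributeSet:
    """Hypothetical holder for the attributes of one managed object."""

    def __init__(self, values, defaults):
        self.values = dict(values)
        self.defaults = dict(defaults)

    def get(self, name):
        return self.values[name]

    def replace(self, name, value):
        if name not in self.values:
            raise KeyError(name)
        self.values[name] = value

    def replace_with_default(self, name):
        self.values[name] = self.defaults[name]

    def add_member(self, name, member):
        self.values[name].add(member)      # set-valued attribute

    def remove_member(self, name, member):
        self.values[name].remove(member)   # raises KeyError if the member is absent

    def apply_atomically(self, operations):
        """Apply every (method_name, args) pair, or leave the attributes untouched."""
        snapshot = copy.deepcopy(self.values)
        try:
            for method_name, args in operations:
                getattr(self, method_name)(*args)
        except Exception:
            self.values = snapshot         # undo partial changes
            raise


attrs = AttributeSet({"speed": 10, "peers": {"a"}}, defaults={"speed": 100})
attrs.apply_atomically([("replace", ("speed", 20)), ("add_member", ("peers", "b"))])

try:
    # The second operation fails, so the first must not take effect either.
    attrs.apply_atomically([("replace", ("speed", 30)),
                            ("remove_member", ("peers", "missing"))])
except KeyError:
    pass
```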

    2.5 Distributed Management Information Base (MIB)

    Roles are not permanently assigned to a management entity. Some management

    entities may be restricted to only taking an agent role, some to only taking a manager

    role, while others are allowed to take an agent role in one interaction and to take a

    manager role in a separate interaction. In order to perform system management and

    share management knowledge, it is sometimes necessary to embody manager and agent

    within a single open system (see Figure 2-2). Shared management knowledge is implied

    by the nature of the management framework since the management applications are

    distributed across a network. Therefore the management information base may be

    naturally viewed as a distributed database containing the managed objects that belong

    to the same management system but are physically spread over multiple sites (hosts) of a

    computer network. The MIB is considered as a superset of managed objects. Each

    subset of this superset may constitute a set of objects associated with a device physically

    separated from any other managed device (ARPEGE 1994). Therefore the managed

    objects in each location may be viewed as a local management description of the

    managed device. The distributed design of a MIB may be considered as a great

    advantage for the following reasons:

    Increased reliability and availability: Availability is defined as the probability that a

    system is up at a particular moment, whereas reliability is the probability that a

    system operates continuously without failure during a time interval. When the MIB is

    distributed over several sites, one site may fail while other sites continue to operate.

    Only the objects associated with the failed site cannot be accessed. This improves

    both reliability and availability. On the other hand, a failure in a centralised MIB

    may make the whole system unavailable to all users.

    Allowing data sharing while maintaining some measure of local control: A

    distributed MIB allows the control of objects locally at each agent. Objects that may

    be available to a specific manager may be hidden from other managers.

    Improved Performance: A distributed MIB implies the existence of smaller

    databases at each site. If a site combines both the role of manager and agent, the

    manager may gain faster access to the local MIB than any other manager located in

    a remote site. This increases the performance of the system since a set of managed

    objects may be accessed locally without the need to open a communication

    transaction over the network. In addition a distributed MIB decreases the load

    (number of transactions) submitted to an agent compared with the load executed by

    a centralised MIB. Since different agents may operate independently, different

    agents may proceed in parallel, reducing response times.

    Figure 2-2: Views of shared management knowledge

    Figure 2-3: Simplified Management System.

    A typical arrangement of a management system is shown in Figure 2-3. The nodes may

    be located in physical proximity and connected via a Local Area Network (LAN), or

    they may be geographically distributed over an interconnected network (Internet). It is

    possible to connect a number of diskless workstations or personal computers as

    managers to a set of agents that maintain the managed objects. As illustrated in Figure

    2-3, some nodes may run as managers (such as the diskless node 1, or node 2 with

    disks), while other nodes are dedicated to running only agent software, such as node 3.

    Still other nodes may support both manager and agent roles, such as node 4.

    Interaction between manager and agent might proceed as follows:

    1. The manager parses a user query and decomposes it into a number of

    independent queries that are sent separately to independent management agent

    nodes.

    2. Each agent node processes the local query and sends a response to the manager

    node.

    3. The manager node combines the results of the subqueries to produce the result of

    the original submitted query.

    4. If something occurs in an agent that changes its operational state, a notification

    may be generated from the agent and an associated message is sent urgently to

    the manager for further processing.

    The agent software is responsible for local access of managed objects while the manager

    software is responsible for most of the distribution functions; it processes all the user

    requests that require access to more than one management node and it keeps track

    of where each managed object is located. An important function of the manager is to hide

    the details of data distribution from the user, that is, the user should write global queries

    as though the MIB were not distributed. This property is called MIB transparency. A

    management system that does not provide distribution transparency makes it the

    responsibility of the user to specify the managed node related with a managed object.
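
    Steps 1 to 3 of the interaction above, together with the manager's record of where each

    managed object is located, can be sketched as follows. The Agent and Manager classes,

    the site names and the object names are hypothetical, and notifications (step 4) are

    left out. The user-facing query call names objects only, never sites, which is the

    transparency property just described.

```python
class Agent:
    """Hypothetical agent holding the managed objects of one site."""

    def __init__(self, objects):
        self.objects = dict(objects)

    def query(self, names):
        return {name: self.objects[name] for name in names}


class Manager:
    def __init__(self, agents):
        self.agents = agents  # site name -> Agent
        # Directory recording which site holds each managed object.
        self.location = {name: site
                         for site, agent in agents.items()
                         for name in agent.objects}

    def query(self, names):
        # Step 1: decompose the global query into one subquery per site.
        subqueries = {}
        for name in names:
            subqueries.setdefault(self.location[name], []).append(name)
        # Step 2: each agent processes its local subquery and responds.
        responses = [self.agents[site].query(sub) for site, sub in subqueries.items()]
        # Step 3: combine the partial results into the answer to the original query.
        result = {}
        for response in responses:
            result.update(response)
        return result


manager = Manager({
    "site-3": Agent({"router1.status": "up", "router1.load": 0.4}),
    "site-4": Agent({"switch7.status": "down"}),
})
answer = manager.query(["router1.status", "switch7.status"])
```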

    2.6 Distributed Network Management

    Systems are increasingly becoming complex and distributed. As a result, they are

    exposed to problems such as failures, performance inefficiency and resource allocation.

    So, an efficient integrated network management system is required to monitor, interpret

    and control the behaviour of their hardware and software resources. This task is

    currently being carried out by centralised network management systems in which a

    single management system monitors the whole network. Most existing management

    systems are platform-centred. That is, the applications are separated from the data they

    require and from the devices they need to control. Although some experts believe that

    most network management problems can be solved with a centralised management

    system, there are real network management problems that cannot be adequately

    addressed by the centralised approach (MEYER 1995).

    There are four basic approaches to network management systems:

    centralised, platform based, hierarchical and distributed (LEINWARD 1993).

    Currently, most network management systems are centralised. In a centralised

    management system (Figure 2-4.a) there is a single management machine (manager)

    which collects the information and controls the entire network. This workstation is a

    single point of failure and if it fails, the entire network could collapse. If the

    management host does not fail but a fault partitions the network, the part of the network

    that cannot reach the manager is left without any management functionality. Centralised

    network management has proved inadequate for efficient management of large

    heterogeneous networks. Also, a

    centralised system cannot be easily scaled up when the size or complexity of the

    network increases.

    In the platform based approach (Figure 2-4.b), a single manager is divided in two

    parts; the management platform and the management application. The management

    platform is mainly concerned with information gathering while management

    applications use the services offered by the management platform to handle decision

    support. The advantage of this approach is that applications do not need to worry about

    protocol complexity and heterogeneity.

    The hierarchical architecture (Figure 2-4.c) uses the concept of Manager Of

    Managers (MOM) and manager per domain paradigm (LEINWARD 1993). Each

    domain manager is only responsible for the management of its domain and it is unaware

    of other domains. The manager of managers sits at the higher level and requests

    information from domain managers.

    The distributed approach (Figure 2-4.d) is a peer architecture. Multiple managers,

    each one responsible for a domain, communicate with each other in a peer system.

    Whenever information from another domain is required, the corresponding manager is

    contacted and the information is retrieved. By distributing management over several

    workstations, the network management reliability, robustness and performance increase

    while the network management cost in communication and computation decreases. This

    approach has also been adopted by ISO standards and the Telecommunication

    Management Network (TMN) architecture (ITU 1995).

    Figure 2-4: Network management approaches (a) centralised (b) platform based (c) hierarchical (d) distributed

    A distributed system should use interconnected and independent processing

    elements to avoid having a single point of failure. Several reasons favour using a

    distributed management architecture: higher performance/cost ratio, modularity, greater

    expandability and scalability, higher availability and reliability. Distributed management

    services should be transparent to users, so that they cannot distinguish between a local

    and a remote service. This requires the system to be consistent, secure, fault tolerant and

    have a bounded response time.

    Remote Procedure Call (RPC) (NELSON 1981) is a well understood control

    mechanism used for calling a remote procedure in a client server environment. The

    Object Management Group (OMG) Common Object Request Broker Architecture

    (CORBA) (OMG 1997) is also an important standard for distributed object oriented

    systems. It is aimed at the management of objects in distributed heterogeneous systems.

    CORBA addresses two challenges in developing distributed systems (OMG 1997):

    1. Making the design of a distributed system no more difficult than that of a centralised one

    2. Providing an infrastructure to integrate application components into a distributed

    system.

    2.7 CORBA System

    The most promising approach to solving the distributed interface and integration

    problem is the CORBA architecture (VINOSKI 1997). Although CORBA does not

    directly support a network management architecture, it provides a distributed object

    oriented framework in which a management system may be developed.

    The main component of CORBA is the Object Request Broker (ORB). An ORB is

    the basic mechanism by which objects transparently make requests to each other on the

    same machine or across a network. A client object need not be aware of the mechanisms

    used to communicate with or activate an object, how the object is implemented nor

    where the object is located. The ORB forms the foundation for building applications

    constructed from distributed objects and for interoperability between applications in

    both homogeneous and heterogeneous environments.
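
    The location transparency that a request broker offers can be pictured with a toy,

    single-process sketch. This is not the CORBA API: the ToyBroker and LinkMonitor classes

    and their methods are hypothetical, and no network transport, activation or interface

    definition is modelled. The point is only that the client names an object and an

    operation, and the broker finds the implementation.

```python
class ToyBroker:
    """Hypothetical stand-in for a request broker; not the CORBA API."""

    def __init__(self):
        self._servants = {}  # object name -> implementation object

    def register(self, object_name, servant):
        self._servants[object_name] = servant

    def invoke(self, object_name, operation, *args):
        # The caller never sees where or how the object is implemented.
        servant = self._servants[object_name]
        return getattr(servant, operation)(*args)


class LinkMonitor:
    """Hypothetical object implementation registered with the broker."""

    def __init__(self):
        self.status = {"link-1": "up"}

    def get_status(self, link):
        return self.status[link]


broker = ToyBroker()
broker.register("monitor", LinkMonitor())
status = broker.invoke("monitor", "get_status", "link-1")
```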

    The OMG Interface Definition Language (IDL) provides a standardised way to

    define the interfaces to CORBA objects. The IDL definition is the contract between the

    implementor of the object and the client. IDL is a strongly typed declarative language

    that is programming language independent. Language mapping enables objects to be

    implemented in the developer's programming language of choice.

    CORBA services include naming, events, persistence, transactions, concurrency

    control, relationships, queries, security etc. CORBA services are the basic building

    blocks for distributed object applications. Compliant objects can be combined in many

    different ways and put to many different uses in applications. They can be used to

    construct higher level facilities and object frameworks that can inter-operate across

    multiple platform environments.

    2.8 Implementing OSI Management Services for TMN

    Recently the telecommunication industry has gained knowledge and experience

    establishing management functionality through the Telecommunication Management

    Network (TMN) framework (ITU 1995). On the other hand, in the Internet community, the

    Simple Network Management Protocol (SNMP) has gained widespread acceptance due

    to its simplicity of implementation. Thus, TMN and Internet management will co-exist

    in the future.

    The aim of the TMN is to enhance interoperability of management software and to

    provide an architecture for management systems. A TMN is a logically distinct network

    from the telecommunication network that it manages. It interfaces with the

    telecommunication network at several different points and controls its operations. The

    TMN information architecture is based on an object oriented approach and the

    agent/manager concepts that underlie the Open Systems Interconnection (OSI) systems

    management.

    The Telecommunication Management Network (TMN) is a framework for the

    management of telecommunication networks and the services provided on those

    networks. The Open Systems Interconnection (OSI) management framework is an

    essential component of the TMN architecture. Each TMN function block can play the

    role of an OSI manager, an OSI agent or both.

    A managed object instance can represent a resource and thus there is a

    requirement for communication between managed object instances in an OSI agent and

    the resources they represent. Examples of resources include telecommunication

    switches, bridges, gateways etc. If a new interface card is added to a switch, the switch

    may send a create request to the agent for the creation of the corresponding managed

    object instance.

    Figure 2-5 illustrates how TMN systems can inter-work within the TMN logical layer

    architecture (SIDOR 1998, FERIDUM 1996). In this architecture, system A manages

    system B, and B may, in turn, require operations on the information model of

    system C.

    The management information base ( MIB) is the managed object repository and

    may be implemented by using C++ objects through a MIB composer tool (FERIDUM

    1996). In (BAN 1995) a uniform generic object model (GOM) is proposed for

    manipulating transparently managed objects of various specific object models (CORBA,

    OSI X.700, COM etc.). Communication between managed object instances and

    resources can be initiated from both directions. Resource access components access

    managed object instances through the core agent. Some or all of the managed object

    instances in a MIB may be persistent to allow fast recovery after an agent failure. There

    are two major design considerations in implementing persistence:

    1. Performance: persistent managed objects ensure fast restart after agent failures. For

    example the instance representing a leased line between two communication nodes

    may need to be persistent, whereas an instance representing a connection does not

    (since after an agent failure, the connection will be terminated). Object oriented

    Figure 2-5: Inter-working TMN

  • 26

    databases or traditional relational databases or even flat files can be used to

    implement selective persistence.

    2. Synchronisation: When the agent restarts, managed objects must be updated to

    reflect the current state of the resources. Synchronisation requires exchange of "are

    you there" and "what is the current value" type messages between managed objects

    and resources.
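The two considerations above can be illustrated with a minimal sketch. It assumes a flat JSON file as the persistent store and a plain dictionary standing in for the resources; the class `Agent`, its methods and the object names are hypothetical and are not taken from any TMN platform. Dropping an object whose resource does not answer is likewise an assumption made only for this illustration.

```python
import json
import os
import tempfile


class Agent:
    """Toy agent; the MIB maps an object name to its value and a persistence flag."""

    def __init__(self, store_path):
        self.store_path = store_path  # flat file holding the persistent objects
        self.mib = {}

    def set(self, name, value, persistent=False):
        self.mib[name] = {"value": value, "persistent": persistent}
        # Selective persistence: only objects flagged persistent reach the file.
        kept = {n: o for n, o in self.mib.items() if o["persistent"]}
        with open(self.store_path, "w") as f:
            json.dump(kept, f)

    def restart(self, resources):
        """Simulate recovery: reload persistent objects, then synchronise them."""
        self.mib = {}
        if os.path.exists(self.store_path):
            with open(self.store_path) as f:
                self.mib = json.load(f)
        for name in list(self.mib):
            if name not in resources:      # no answer to "are you there?"
                del self.mib[name]
            else:                          # answer to "what is the current value?"
                self.mib[name]["value"] = resources[name]


store = os.path.join(tempfile.mkdtemp(), "mib.json")
agent = Agent(store)
agent.set("leased_line_1", "up", persistent=True)   # should survive a restart
agent.set("connection_42", "open")                  # transient, lost on restart
agent.restart(resources={"leased_line_1": "degraded"})
print(agent.mib)
```

After the simulated restart only the leased line instance remains, and its value has been refreshed from the resource rather than taken from the stale stored copy.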

    2.9 Replication In a Management System

    Making managed objects persistent may increase the performance and object

    availability offered during an agent's failure. Replication may be used to further increase

    the performance and the availability of network managed objects. The major design

    considerations in implementing replication are that it allows the control of objects

    locally at each replication site and it lets managers gain faster access to a MIB managed

    object by retrieving information locally without the need to perform remote

    transactions. In that way the load is shared among many sites.

    To better facilitate the use of a replication technique in a network management

    system, we may incorporate replication in a distributed framework (CORBA). CORBA

    has been designed to provide an architecture for distributed object-oriented computing,

    not network management. Engineers have focused their efforts on developing an

    integrated management platform to create, manage and invoke distributed

    telecommunication services. Some of these efforts are described in (LEPPINEN 1997,

    MAFFEIS 1997a, RAHKILA 1997).

    The CORBA standard provides mechanisms for the definition of interfaces to

    distributed objects and for communication of operations to those objects through

  • 27

    messages. Unfortunately, the current CORBA standard makes no provision for fault

    tolerance.

    To provide fault tolerance, objects should be replicated across multiple processors

    within the distributed system (ADAMEC 1995). There are several motivations for applying

    object replication in a distributed network management system:

    One is performance enhancement. Management information that is shared

    among a large manager community should not be held at a single server, since this

    computer will act as a bottleneck that slows down responses.

    Another motivation is improved fault tolerance. When the computer holding one

    replica crashes, the system can continue management computation with another replica.

    A further motivation is the use of replicas to access remote objects.

    When a remote object is to be accessed, a local replica reflecting the remote object's

    state is created and used instead of the remote object.

  • 28

    Figure 2-6 shows a typical scheme for implementing replication of management

    information. The agent updates the MIBs located at the manager sites by exchanging

    messages with the managers. Each manager gets the management information locally

    without the need to issue a remote request. This yields a performance improvement since

    two additional instances of the MIB are used to provide information about the same

    resources.

    (MAFFEIS 1997b) discusses a CORBA based fault tolerant system that monitors remote

    objects and if some of them fail it automatically restarts the failed objects and replicates

    stateful objects on the fly, migrating objects from one host to another. In

    (NARASIMHAN 1997) a similar system is discussed, which provides fault tolerant

    services under CORBA to applications with no modification to the existing ORB.

    Figure 2-6: (a) replication (b) no replication

  • 29

    2.10 Need for replication techniques in a management system

    The Management Information Base (MIB) is viewed as a distributed database that

    stores information associated with the resources of a network or a remote system.

    Replication is applied to managed objects, which may be seen as an abstract

    representation of the resources. In traditional database applications, the need for a

    replication scheme is straightforward since the nature of the data easily allows the

    implementation of such a scheme. For instance, a company may have locations in

    different cities or a bank may have multiple branches. It is natural for such applications

    to enforce replication since such a scheme may drastically increase the fault tolerance of

    the system. If, for example, the software in some branch fails, information regarding

    customers of that branch may be available from some other branch. A management

    information base is different from a traditional database in that the information stored

    in it corresponds to objects that represent software or hardware resources. For instance,

    a variable associated with a remote sensor may be considered as a managed object and

    its value may be viewed as an instantiation describing its state. The main questions that

    arise in a management database application are the following:

    1. Do we really need to replicate this kind of object?

    2. How useful is a replication scheme in a practical management system?

    To answer these questions, we consider an example of a network management system

    which manages network resources spread across an interconnected network.

    Figure 2-7 shows an interconnected network consisting of three networks. Each

    network constitutes a management domain. Each domain has a manager, an agent, a

    management information base (MIB) and some network resources that are monitored

    and controlled by the manager. Agents are responsible for collecting information from

    the resources and updating the relevant managed objects in the database to reflect the

  • 30

    current state of the resources. The manager and the agent of each domain could be

    accommodated by the same host computer. However, we use the most general case in

    which manager and agent reside at different computers. This could be the case where

    the manager runs on a diskless machine. There are two possible scenarios:

    1. No replication: Each MIB stores information about managed resources of its own

    domain.

    2. Replication: Each MIB contains replicated information which is associated with

    resources of other domains.

  • 31

    In a no replication scheme, when a manager wants information about some

    network resources, it contacts the appropriate agent which is responsible for providing

    this management information. The agent either collects the information dynamically

    from the resources or it makes a relevant query to the MIB and sends a response back to

    the manager. In a purely distributed management environment, if manager B wants some

    information about the network resources A, it sends a request to agent A. Upon

    receiving the request, agent A undertakes to provide the requested information to

    manager B. In a hierarchical management system, this could be accomplished through

    [Figure 2-7 labels: Domains A, B and C, each with a manager, an agent, a MIB and network resources; Networks A, B and C interconnected by Bridges AB, AC and BC.]

    Figure 2-7: Network management replication example

  • 32

    manager A. Manager B establishes manager-to-manager communication with

    manager A and then asks A to request its local (domain) agent A to complete the task.

    In a replication scheme, replicas of managed objects exist on other MIBs. When

    an agent updates the state of a managed object locally, it also transmits the object state

    to the other domains and updates all the replicated objects that reside in other MIBs. Under

    this arrangement, when manager B wants some information about network resources A,

    it needs to contact neither agent A nor manager A, only its local agent B, since

    MIB B contains replicas of the managed objects of MIB A. This improves

    performance and speeds up the process of collecting network management information

    from remote systems. The benefit is most evident if network B is a remote network

    linked to network A by a low-speed leased line. In case of an agent

    failure, managers can still get information about particular managed objects from

    other MIBs, increasing the availability of the management information and making the

    management system more robust and fault tolerant.
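The replication scenario can be sketched as follows. This is a minimal illustration under simplifying assumptions: update propagation is modelled as a direct in-memory write to every peer MIB, with no network, ordering or conflict handling, and the names `Agent`, `update_local`, `query` and `bridge_AB` are hypothetical.

```python
class Agent:
    """Toy agent owning one domain's MIB plus replicas of other domains' objects."""

    def __init__(self, domain):
        self.domain = domain
        self.mib = {}        # (domain, object name) -> state
        self.peers = []      # agents of the other domains
        self.alive = True

    def update_local(self, name, state):
        # Update the local managed object, then push its state to every replica.
        key = (self.domain, name)
        self.mib[key] = state
        for peer in self.peers:
            if peer.alive:
                peer.mib[key] = state

    def query(self, domain, name):
        if not self.alive:
            raise ConnectionError(f"agent {self.domain} is down")
        return self.mib[(domain, name)]


agent_a, agent_b = Agent("A"), Agent("B")
agent_a.peers, agent_b.peers = [agent_b], [agent_a]

agent_a.update_local("bridge_AB", "up")
local_answer = agent_b.query("A", "bridge_AB")          # manager B asks only agent B

agent_a.alive = False                                   # agent A fails
answer_after_failure = agent_b.query("A", "bridge_AB")  # still served from MIB B
```

Manager B obtains the state of a resource in domain A without any remote request, and the replica in MIB B remains readable after agent A has failed, although it can no longer be refreshed.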

    We may further increase performance and availability if we dynamically re-create

    the managed objects of faulty agents on other agents. This can be achieved by

    utilising CORBA based replication techniques that may restart failed objects and

    replicate their state on the fly on another agent (MAFFEIS 1997b).

    It becomes clear that replicating managed objects to other agents' MIBs may

    increase the availability of certain managed objects and the performance of the

    management system, ensuring continuous network management without interrupting the

    monitoring and control of network resources.

  • 33

    2.11 Synchronous and Asynchronous replica models

    In the context of research on fault tolerance, a system that has the property of

    always responding to a message within a known finite time interval is said to be

    synchronous. A replica system is said to be synchronous if all update

    requests are ordered, that is, requests are processed at all replicas in the same order.

    Consider, for example, the replication model in Figure 2-8. The node M sends an update

    request r to all other nodes G1, G2, G3 and waits for responses. If it receives all the

    acknowledgements A1, A2, A3 from those nodes, it assumes that the update has been applied

    successfully and proceeds to the next request. In a synchronous replication system the

    next request is forwarded only if the current update request has been processed at all the

    agencies holding replicas. A replication system not having this property is said to be

    asynchronous. That is, in an asynchronous replication system a node proceeds to the

    next request without waiting for acknowledgements from all the recipients of

    the previous request. This results in unordered processing of requests: requests

    received by node G1 may be processed in a different order from that at node G2 or

    G3.
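The difference between the two models can be sketched with a small simulation. It is only an illustration under stated assumptions: message reordering in the asynchronous case is modelled by shuffling the arrival order at each replica with a seeded random generator, and the names `Replica`, `synchronous_update` and `asynchronous_update` are hypothetical.

```python
import random


class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []            # requests in the order they were processed

    def apply(self, request):
        self.log.append(request)
        return True              # acknowledgement


def synchronous_update(replicas, requests):
    for request in requests:
        acks = [replica.apply(request) for replica in replicas]
        if not all(acks):        # proceed only when every replica has acknowledged
            raise RuntimeError(f"update {request!r} was not acknowledged by all")


def asynchronous_update(replicas, requests, rng):
    # The sender does not wait, so each replica may see its own arrival order.
    for replica in replicas:
        arrivals = list(requests)
        rng.shuffle(arrivals)
        for request in arrivals:
            replica.apply(request)


requests = ["r1", "r2", "r3", "r4"]
sync_group = [Replica(n) for n in ("G1", "G2", "G3")]
async_group = [Replica(n) for n in ("G1", "G2", "G3")]
synchronous_update(sync_group, requests)
asynchronous_update(async_group, requests, random.Random(0))
```

In the synchronous group every log equals the order in which M issued the requests. In the asynchronous group every replica eventually holds the same set of requests, but nothing guarantees that the logs agree on their order.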

    2.12 Replication Transparency and Architectural Model

    A key issue relating to replication is transparency. Transparency (invisibility)

    determines the extent to which the users (managers) are aware that some objects are

    replicated. At one extreme the users are fully aware of the replication process and can even

    Figure 2-8: Synchronous Replication

  • 34

    control it. At the other, the system does everything without users noticing anything. The

    ANSA reference manual (ANSA 1989) and the International Organization for Standardization

    Reference Model for Open Distributed Processing (ISO 1992a) provide definitions

    related to replication transparency. Among other things, the standards state that a replication

    system is transparent if it enables multiple instances of the information object (in our

    case managed object) to be replicated without knowledge of the replicas by users or

    application programs.

    A basic architectural model for controlling replicated objects may involve distinct

    agencies located across a network. Figure 2-9(a) shows how a manager may control the

    entire process in a non-transparent system. When a manager creates or updates an object,

    it does so on one agency and then takes responsibility for making copies or completing any

    update on other agencies.

    An agency is a process that contains replicas and performs operations upon them

    directly. An agency may maintain a physical copy of every logical item; however, there

    are cases in which an agency may not maintain a physical copy. For example, a managed

    object needed mostly by a manager on one LAN may never be used by a manager on

    another LAN. In this case the agency in the second LAN may not contain a physical

    copy of that object and, if this manager ever requests information about the object, the

    local agency may obtain the information by making a call to another agency that actually

    holds a physical copy of the object. The general model for a transparent replication

    system is shown in Figure 2-9(b). A manager's request is first handled by a Front End (FE)

    component.

  • 35

    Figure 2-9: Architectural model for replication. (a) non transparent system (b)

    transparent replication system (c) lazy replication (d) primary copy model.

  • 36

    The FE component is used for passing the messages to at least one agency. This

    hides the details of how, and to which agency, the message is forwarded. The user manager does

    not need to determine a specific agency for service; it just sends the message and the

    FE component takes responsibility for determining which agency will receive the request.

    The FE component may be implemented as part of the manager application or it may be

    implemented as a separate process invoked by a manager application using a kind of

    Interprocess Communication (PRESOTTO 1990). Figure 2-9(c) shows a specialisation of

    the architectural model in Figure 2-9(b). The model in (c) is called a lazy replication

    model and implements what is called gossip architecture (LADIN 1992). Here the

    manager creates or updates only one copy in one agency. Later the agency itself makes

    replicas on other agencies automatically without the manager's knowledge. The

    replication server runs in the background all the time, scanning the managed objects

    hierarchically. Whenever it finds a managed object with fewer replicas than

    expected, the replication server arranges to make all the additional copies. The

    replication server works best for immutable objects since such objects cannot change

    during the replication process. This architecture is called gossip architecture

    because the replica agencies exchange gossip messages in order to convey the updates

    they have each received. In gossip architecture the FE component communicates directly

    either with an individual agency or with more than one agency. Figure 2-9(d)

    shows another replication architectural model known as the primary copy model

    (LISKOV 1991). In that model all front ends communicate with the same primary

    agency when updating a particular data item. The primary agency propagates the

    updates to the other agencies, called slaves. Front ends may also read objects from a

    slave. If the primary agency fails, one of the slaves can be promoted to act as the

    primary. Front ends may communicate with either a primary or a slave agency to

  • 37

    retrieve information. In that case, however, front ends may not perform updates; updates

    are made to the primary copy of an object.
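The primary copy model with a front end can be sketched as below. This is a minimal illustration, not the protocol of (LISKOV 1991): propagation to slaves is modelled as a direct in-memory write, promotion simply picks the first live slave, and the names `Agency`, `PrimaryCopyGroup` and `FrontEnd` are hypothetical.

```python
class Agency:
    def __init__(self, name):
        self.name = name
        self.store = {}      # managed object name -> state
        self.alive = True


class PrimaryCopyGroup:
    def __init__(self, agencies):
        self.primary = agencies[0]
        self.slaves = list(agencies[1:])

    def _promote(self):
        # Assumed policy: the first live slave becomes the new primary.
        live = [s for s in self.slaves if s.alive]
        if not live:
            raise RuntimeError("no live agency left to act as primary")
        self.primary = live[0]
        self.slaves.remove(live[0])

    def update(self, key, value):
        if not self.primary.alive:
            self._promote()
        self.primary.store[key] = value          # updates go to the primary copy
        for slave in self.slaves:                # which propagates them to slaves
            if slave.alive:
                slave.store[key] = value


class FrontEnd:
    """Hides from the manager which agency actually serves a request."""

    def __init__(self, group):
        self.group = group

    def write(self, key, value):
        self.group.update(key, value)

    def read(self, key):
        # Reads may be served by any live agency, slaves first.
        for agency in [*self.group.slaves, self.group.primary]:
            if agency.alive:
                return agency.store[key]
        raise RuntimeError("no live agency available")


agencies = [Agency("P"), Agency("S1"), Agency("S2")]
fe = FrontEnd(PrimaryCopyGroup(agencies))
fe.write("link_1", "up")
first_read = fe.read("link_1")
agencies[0].alive = False            # the primary agency fails
fe.write("link_1", "down")           # triggers promotion of a slave
second_read = fe.read("link_1")
new_primary = fe.group.primary.name
```

The manager code only calls `write` and `read` on the front end; the choice of agency and the promotion of a slave after the primary fails stay invisible to it, which is the transparency property discussed above.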

    2.13 Summary

    This chapter has set the background for a replication management system. It has

    shown the need for using a replication scheme in a real time application. It has

    examined the distributed aspect of a network management system describing the

    distributed nature of the MIB. It has briefly discussed two major protocols (CMIP and

    SNMP) for exchanging management messages. It has also examined design aspects of

    the MIB discussing the significance of the managed object as an autonomous entity for

    performing operations related to incoming messages. The concepts of object availability

    and performance have been defined and used as a measure of the quality of service of

    the system. Synchronous and asynchronous replica models have been examined and

    finally various architectural models for replication have been discussed as a way to

    maintain transparently multiple replicas.

    The following chapters discuss the internal mechanisms (algorithms) used

    to obtain transparent updates to replicated objects. We will discuss a variety of solutions

    that may be applied to ensure consistency among multiple replicas in the presence of node

    or communication link failures.

  • 38

    3. FAILURES IN A MANAGEMENT SYSTEM

    This chapter discusses the nature of failures in a management system. It first

    defines the concept of dependability between management agents and it then classifies

    certain failures that may occur in a management system, analysing each one further by

    its potentially disruptive behaviour. Failure semantics and masking are examined as a

    way to understand how failures may be masked by using certain techniques. The chapter

    ends by specifying some architectural issues including synchronisation, communication

    and availability of certain components in a group of agents.

    3.1 Dependability Between Agents

    An agent provides certain management services that may be viewed as a collection

    of operations whose execution can be triggered by inputs from other agents (proxy) or a

    manager or the passage of time. An agent implements a management service without

    exposing to the manager the internal representation of the managed objects. Such details

    are hidden from the manager, who need know only the externally specified management

    service behaviour. Agents may implement their services using services implemented by

    other agents. An agent U depends on the agent R if the correctness of U depends on the

    correctness of R's behaviour. The agent U is called the user and the R is called the

    resource of U. Resources in turn might depend on other resources to provide their

    service, and so on, down to the managed objects. The managed object is the atomic

    resource which is not analysed further and which is actually used to represent hardware

    or software components in a network. What is a resource at a certain level of abstraction

  • 39

    can be a user at another level of abstraction. The relationship between user and resource

    is a "depends on" relationship, as shown in Figure 3-1.
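The "depends on" relationship can be expressed as a small dependency graph. The sketch below assumes the graph is acyclic and treats a component as correct only if it has not failed and all of its resources are correct; the component names and the function `behaves_correctly` are hypothetical.

```python
# user -> resources it depends on; managed objects are the atomic resources
depends_on = {
    "agent_U": ["agent_R"],
    "agent_R": ["managed_object_1", "managed_object_2"],
    "managed_object_1": [],
    "managed_object_2": [],
}


def behaves_correctly(component, failed):
    """A user is correct only if it and, recursively, all its resources are correct."""
    if component in failed:
        return False
    return all(behaves_correctly(r, failed) for r in depends_on[component])


all_fine = behaves_correctly("agent_U", failed=set())
user_after_fault = behaves_correctly("agent_U", failed={"managed_object_2"})
sibling_after_fault = behaves_correctly("managed_object_1", failed={"managed_object_2"})
```

A fault in one managed object propagates upwards to agent R and then to agent U, while the sibling managed object is unaffected. Here agent R is a resource with respect to U and a user with respect to the managed objects.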

    A distributed management system consists of many agents. The management

    services provided by those agents may depend on other secondary low level

    management services associated with operating system components as well as

    communication components. The union of all these management services is provided as

    a distributed management system service. To ensure correctness and management

    service availability, the classes of possible failures in the lower levels of abstraction

    should be studied and redundancy in particular management services should be

    introduced to prevent system crashes.

    3.2 Failure Classification

    An agent designed to provide a certain management service works correctly if in

    response to requests, it behaves in a manner consistent with the service specification. By

    an agent's response we mean any output that has to be delivered to the manager. An

    agent fails when it does not behave in the manner specified. The most frequent

    failures are the following:

    Figure 3-1: Relationship between user and resource.

  • 40

    1. Omission Failure: It happens when the agent receiving a request omits to respond

    to that request. This failure occurs either because the queue of incoming messages

    in the agent is full and therefore any additional request is lost, or because an internal failure

    (e.g. a memory allocation failure) is experienced due to a temporary lack of physical

    resources for handling the incoming request. A communication service that

    occasionally loses messages but does not delay messages is an example of a

    service that suffers omission failures.

    2. Timing Failure: It happens when the agent response is functionally correct but

    untimely. The response occurs outside the real-time interval specified. The most

    frequent timing failure is the performance (late timing) failure in which the

    response reaches the manager after the elapse of the time interval during which the

    manager is expecting the response. This failure occurs because either the network

    is too slow or the agent is overloaded and is late in giving a response to the

    manager. An excessive message transmission or message processing delay due to

    an overload is an example of performance failures.

    3. Response Failure: It happens when the agent responds incorrectly: either the value

    of its output is incorrect (value failure) or the state transition that takes place is

    incorrect (state failure). A search procedure that "finds" a key that is not an entry

    of a routing table is an example of a response failure.

    4. Crash Failure: It happens when, after the first omission to produce a response to a

    request, an agent omits to produce outputs for subsequent requests.
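The classification above can be turned into a small observer-side check. The sketch is illustrative only: it assumes the manager knows the expected value and a deadline for each request, it models value failures but not state failures, and the names `Failure`, `classify` and `looks_crashed` are hypothetical.

```python
from enum import Enum


class Failure(Enum):
    NONE = "none"
    OMISSION = "omission"
    TIMING = "timing"
    RESPONSE = "response"


def classify(response, latency, expected, deadline):
    """Classify the outcome of one request; response is None if nothing arrived."""
    if response is None:
        return Failure.OMISSION
    if response != expected:
        return Failure.RESPONSE      # value failure; state failures are not modelled
    if latency > deadline:
        return Failure.TIMING        # functionally correct but late
    return Failure.NONE


def looks_crashed(outcomes):
    """True if, from the first omission on, every later request was also omitted."""
    if Failure.OMISSION not in outcomes:
        return False
    tail = outcomes[outcomes.index(Failure.OMISSION):]
    return len(tail) > 1 and all(o is Failure.OMISSION for o in tail)


history = [
    classify("up", 0.2, expected="up", deadline=1.0),
    classify("up", 3.0, expected="up", deadline=1.0),
    classify("down", 0.2, expected="up", deadline=1.0),
    classify(None, None, expected="up", deadline=1.0),
    classify(None, None, expected="up", deadline=1.0),
]
crashed = looks_crashed(history)
```

On a finite history a crash can only be suspected, never proven: an agent that has omitted every response since its first omission may still answer a later request, in which case the omissions were ordinary omission failures.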

    3.3 Faulty Agent Behaviour

    To detect a failure, an agent should reveal a certain behaviour that allows us to

    identify the occurrence of a failure in order to perform the appropriate actions for

  • 41

    handling the failure. The behaviour of an agent under the occurrence of a failure may be

    classified as follows:

    Fail-stop behaviour

    Byzantine behaviour

    With fail-stop behaviour, a faulty agency just stops and does not respond to subsequent

    requests or produce further output, except perhaps to announce that it is no longer

    functioning. With Byzantine behaviour, a faulty agency continues to run, issuing wrong

    responses to requests and possibly working together maliciously with other faulty

    managers or agencies to give the impression that they are all working correctly when

    they are not. In our study we assume only fail-stop behaviour.

    3.4 Failure Semantics

    The failure behaviour an agent can exhibit must be studied in order to suggest

    possible fault tolerance mechanisms. Recovery actions invoked upon detection of an

    agent failure depend on the likely failure behaviour of the agent. Therefore one has to

    extend the standard specifications of an agent to include failure behaviour. If the

    specification of an agent prescribes that the failure F may occur, it is said that the agent

    has F failure semantics (CRISTIAN 1991). If a communication service is allowed to

    lose messages but the probability that it delays or corrupts messages is negligible, we