A Model for Alarm Correlation in Telecommunication Networks (Thesis)

Dilmar Malheiros Meira

A Model for Alarm Correlation in Telecommunications Networks

A Thesis Submitted to the Department of Computer Science in Partial Fulfillment of the Requirements for the Degree of Doctor of Science.

Federal University of Minas Gerais - UFMG
Belo Horizonte, November 1997


621.39
M515m

Meira, Dilmar Malheiros
A Model for Alarm Correlation in Telecommunications Networks / Dilmar Malheiros Meira. Belo Horizonte, 1997.
xx, 149 p.; 29.5 cm
Ph.D. Thesis. Computer Science. Institute of Exact Sciences (ICEx) of the UFMG.

1. Telecommunications 2. Modeling 3. Network Management
I. Title

CDU: 621.39

Approval Sheet

The members of the Examining Committee appointed by the Department of Computer Science of the Federal University of Minas Gerais have examined and approved the thesis entitled A Model for Alarm Correlation in Telecommunications Networks, presented and defended by Dilmar Malheiros Meira, a candidate for the degree of Doctor of Science in Computer Science.

Belo Horizonte, November 13th 1997.

Prof. José Marcos Silva Nogueira, Ph.D. (DCC/UFMG) - Advisor
Prof. Edward Pinnes, Ph.D. (Bellcore - USA)
Prof. Liane Tarouco, Ph.D. (Inst. Informática/UFRGS)
Prof. Antonio Alfredo Ferreira Loureiro, Ph.D. (DCC/UFMG)
Prof. João Eduardo de Rezende Dantas, Ph.D. (DCC/UFMG)


To my children Alexandre, Anália and Guilherme, for what this work represented in terms of hours, days and years subtracted from our family life.

Acknowledgements

To my wife, Zilda Maria.

To my parents, Benjamin and Valdelourdes.

To my sisters, Daíse and Elta Lyly.

To my grandmother Francisca Teixeira Couto (Mãe Chica); in memory of my grandparents Anália de Novaes Meira, Prudencio de Souza Meira and Alperino Malheiros.

To my brothers and sisters-in-law João André, Iwens, Ana Lúcia, Lulude, Zezinho, Eliana and Beto.

To my nephews and niece Leonardo, Fernando, Thiago and Ana.

To my uncles and aunts Valtelina (Tia Té), Tião, Rosalva, Juca, Yolanda and Aparecida.

To my cousins Zander (in memoriam), Enock, Vania and Isa.

To my advisor, Prof. José Marcos Silva Nogueira.

To my advisor during my stay at the University of British Columbia (Jan 94 - Feb 95), Prof. Son T. Vuong.

To The Minas Gerais State Telecommunications Company - TELEMIG, represented by the President Saulo L. Coelho, by the Vice-President Sérgio A. R. S. Braga, by the Engineering Director Carlos A. R. Andrade, by the Services Director Heleni M. Fonseca, by the Financial Director Geraldo Pereira Sobrinho, by the Administrative Director Júlio B. Braga and, particularly, by the Human Resources Director Mário Assad Jr.

To UFMG, here represented by the President Prof. Tomaz Aroldo da Mota Santos and by the Head of Department of Computer Sciences, Prof. Clarindo I. P. S. Pádua.

To Mike Shwe, President of Knowledge Industries, Inc. (Palo Alto, CA, USA).

To Edward Pinnes, Network Management Director of Bellcore (New Jersey, USA) and to Tony Harris, President of Comdale Technologies, Inc. (Toronto, Canada).

To my colleagues, former colleagues and friends at TELEMIG and, particularly, to Adriana A. Moura, Adriana J. Leitão de Almeida, Alessandra M. J. Maia, Alexandre C. Barros, Alexandre F. Barbosa, Ana Cristina P. Ruas, Analzira P. Horta, André Luiz de Abreu, Antonio C. Pazzini da Silva, Antonio Luiz M. Osse, Carlos Eduardo S. Pereira, Carlos Luiz, Carlos R. de Siqueira, Carmen Miranda F. da Silva, Christiano da Matta Machado, Cléia C. A. do Rosário, Dário A. Nunes, Daniel C. Pereira, Divar H. B. Júnior, Eduardo J. M. Swerts, Eduardo Winter, Elaine A. Santos, Elenice G. Silva, Eliana M. O. Diniz, Elizabeth M. Belico, Erben M. Macedo, Evandro Canabrava, Fátimo G. Pires, Fernando A. Fiuza, Flávio A. Carvalho, Francisco D. Paula, Francisco R. F. Saraiva, Francisco A. Santos, Geraldo G. Pena, Geraldo Ines Campos, Gilberto F. Pereira, Gisele M. G. Hannouche, Guilherme P. S. Pádua, Helvécio Alvim, Hildegardo A. da Silva, Iran A. Carvalho, Ivan Cleveland Andrade, Janduí M. V. Teixeira, Janilde G. Santos, João F. Coelho, João J. R. Bronzo, João de Morais Flores, José Augusto Cruz, José A. T. L. Baptista, José Francisco V. de Seniuk, José H. Vesperman, José Luciano Pimenta, José Márcio Ribeiro, José Pio S. de Souza, José Ricardo S. Fonseca, José Salomão Barquete, Luiz Carlos S. Bemfica, Luiz Gabriel de Castro, Luiz G. Leal, Márcia M. S. Scott, Marcínia M. C. Nahur, Marcelo A. C. G. Leão, Marcos Antonio Soares, Maria Cristina B. Sollero, Maria Helena M. Torres, Maria Mércia G. J. Vale, Maria da Conceição Arci, Maria de Fátima B. Duarte, Maria de Fátima M. Quintão Silva, Marilene O. Figueiredo, Marilia Markus, Martinho M. Evangelista, Milton A. Canabrava, Milton S. Nogueira, Morel A. Ribeiro, Otávio Marques de Azevedo, Patrícia V. C. Bicalho, Paulo M. Cherubino, Renato A. N. Osório, Ricardo Alves de Oliveira, Ricardo M. Conde, Ricardo Rogério D. Silva, Roberto H. Arantes, Rosangela Alves, Rosangela Diniz, Sámea R. S. Ferreira, Sérgio T. A. Giffoni, Shigueo Yoshizane, Tácito G. Sobrinho, Tania M. L. Vianna, Teresa M. T. Lima, Teresinha M. B. dos Reis, Valéria Noce, Vitorino P. dos Santos and Wilson P. Almeida.

To my friends, colleagues and professors at the Federal University of Minas Gerais and, in particular, to Angelo M. Guimarães, Angelo M. Menezes, Antonio Alfredo F. Loureiro, Antonio Emílio A. Araújo, Antonio Otávio Fernandes, Autran Macedo, Belkiz I. R. Costa, Berthier Ribeiro de A. Neto, Carlos C. Goulart, Cristina D. Murta, Eduardo F. Barbosa, Emília S. da Silva, Estefania M. Moreira, Frederico R. B. Cruz, Georgia C. Penido, Geraldo Robson Mateus, Henrique C. M. Andrade, Henrique Pacca L. Luna, Ilmério R. da Silva, João Carlos F. Barbosa, João E. R. Dantas, Jones O. de Albuquerque, Lucília C. de Figueiredo, Márcio M. Andrade, Mário F. M. Campos, Nívio Ziviani, Rafael G. R. da Silva, Renata V. Moraes, Renato A. V. Leite, Roberto S. Bigonha, Rogério A. de Barros, Sibele V. de Oliveira, Veronica P. P. Marques and Viviane Hon.

To my friends from Gentio do Ouro, Bananal, Porto Alegre, Tokyo, Guanambi, Geneva, Candiba, B.H., Vancouver, Vila Velha, Burnaby, Campinas, Campina Grande, ..., and, in particular, to Antonio Carlos & Cláudia Lisboa, Armindo & Alessandra Fontana, Avelar & Odete Pereira Viana, Bent Georgensen & Vera Botelho, Carlos & Rose Pinheiro, Eli & Eleonora Moreira Lima, Elias Procópio Duarte Jr., Ennio & Deusa Palmeira, Enoch Nascimento, Fernando & Erika de Castro, Firmino, Francisco & Keia Horta, Guilherme & Sandra Kerr, Jack Snoeyink, Jesus Afonso, João Augusto & Aydil Rocha, John Meech, José Francisco M. Nunes (in memoriam), José Gonçalves Filho, Kendra Cooper, Kevin L. Tong, Lucio Telles & Linda Harasin, Lucio & Vania Amorim, Luis & Susana Afonso, Marcello Walter, Marcello & Sonia Veiga, Marlindo Fernandes, Martinha, Nuno Tomaz & Ana Pires de Carvalho, Ricardo & Cristina de Sousa, Sandoval & Tidinha Carneiro and Washington Neves.


"Curiosity, or love of the knowledge of causes, draws a man from consideration of the effect to seek the cause; and again, the cause of that cause; till of necessity he must come to this thought at last, that there is some cause whereof there is no former cause, but is eternal; which is it men call God. So that it is impossible to make any profound inquiry into natural causes without being inclined to believe there is one God eternal."

Thomas Hobbes (1588-1679)
Leviathan
Part I: Of Man

Contents

List of Figures
List of Tables
Abstract

1 Introduction

2 A Review on Alarm Correlation in Telecommunications Networks
  2.1 Basic Concepts
    2.1.1 Management Functional Areas
    2.1.2 Logical Layered Architecture
    2.1.3 Alarm Correlation
    2.1.4 Fault Diagnosis
    2.1.5 Correlation Types
    2.1.6 Architectural Aspects
    2.1.7 Topology
  2.2 Methods and Algorithms for Alarm Correlation
    2.2.1 Rule-Based Correlation
    2.2.2 Fuzzy Logic
    2.2.3 Bayesian Networks
    2.2.4 Model-Based Reasoning
    2.2.5 Blackboard
    2.2.6 Filtering
    2.2.7 Event Forwarding Discriminator - EFD
    2.2.8 Case-Based Reasoning
    2.2.9 Coding Approach
    2.2.10 Explicit Localization
    2.2.11 Correlation by Voting
    2.2.12 Proactive Correlation
    2.2.13 Distributed Correlation
    2.2.14 Artificial Neural Networks
    2.2.15 Diagnosis by Comparison of Test Results
    2.2.16 Other Approaches
  2.3 Comparison among the Available Approaches
  2.4 Products, Solutions and Platforms for Alarm Correlation
    2.4.1 Event Correlation Services (HP)
    2.4.2 An Implementation of Intelligent Filtering (Philips)
    2.4.3 NerveCenter Pro (Seagate)
    2.4.4 Sinergia
    2.4.5 TASA
    2.4.6 InCharge (SMARTS)
    2.4.7 NetFACT (IBM)
    2.4.8 IMPACT (GTE)
    2.4.9 GMS
    2.4.10 ECXpert (AT&T)
    2.4.11 SCOUT (AT&T)
    2.4.12 NOAA
    2.4.13 CRITTER
    2.4.14 FIXIT
    2.4.15 OPA
  2.5 Final Considerations

3 A General Model of Telecommunications Networks for Fault Management Applications
  3.1 Telecommunications Networks Modeling
    3.1.1 Modeling Techniques
    3.1.2 Information Modeling for Subsystem Management
    3.1.3 Subsystem Architectural Models
    3.1.4 New Solutions in Telecommunications
    3.1.5 General Models
  3.2 The Proposed Model
    3.2.1 A High-level Model
    3.2.2 Public Network and User Networks
    3.2.3 Access Networks
    3.2.4 General Model in Multi-Network Layers
    3.2.5 An Example
    3.2.6 Robustness
    3.2.7 Limitations of the Model
  3.3 Final Considerations

4 A Model for Alarm Correlation
  4.1 Recursive Multifocal Correlation
    4.1.1 Performance and Precision
    4.1.2 Scope Limitation
  4.2 Introduction to Bayesian Networks
    4.2.1 Why Bayesian Networks?
    4.2.2 Basic Concepts
    4.2.3 Alarm Correlation Using Bayesian Networks
    4.2.4 Evaluation of Bayesian Networks
    4.2.5 DXpress 2.0™, API-DX™ and WIN-DX™: An Example of Tools for the Development and Evaluation of Bayesian Networks
  4.3 Description of the Model
    4.3.1 Presuppositions
    4.3.2 Construction of the Structure of the Bayesian Network
    4.3.3 Definition of the Variables and their States
    4.3.4 Specification of the Local Probabilities Distributions
    4.3.5 Dynamics of the Correlation Process
    4.3.6 Complexity and Performance of Alarm Correlation through Bayesian Networks
  4.4 Final Considerations

5 A Prototype
  5.1 Description of the Object Network
  5.2 Construction of the Bayesian Network
    5.2.1 Structure
    5.2.2 Variables and their States
    5.2.3 Local Probabilities Distributions
  5.3 Experimentation
    5.3.1 Case 1
    5.3.2 Case 2
    5.3.3 Case 3
    5.3.4 Case 4
    5.3.5 Case 5
    5.3.6 Case 6
    5.3.7 Case 7
    5.3.8 Case 8
    5.3.9 Case 9
    5.3.10 Case 10
    5.3.11 Case 11
    5.3.12 Case 12
    5.3.13 Case 13
    5.3.14 Case 14
    5.3.15 Case 15
    5.3.16 Case 16
  5.4 Final Considerations

6 Conclusions
  6.1 Summary of the Results Obtained
  6.2 Limitations
  6.3 Future Work
    6.3.1 Telecommunications Network Modeling
    6.3.2 Alarm Correlation at Sub-Network Level
    6.3.3 Determination of the Local Probabilities Distributions
    6.3.4 Acquisition of the Real Time Correlation Engine
    6.3.5 Extensions to the Model

Bibliography

List of Figures

2.1 Management Functional Areas and Logical Layers
3.1 Simplified Model of a Traditional Telecommunications Network
3.2 A High-level Model of the Telecommunications Network
3.3 Public Network and User Networks
3.4 Backbone and Access Networks
3.5 Example of General Model for a Telecommunications Network
3.6 Three-dimensional Representation of the Networks that constitute a Telecommunications Network
3.7 Plane Representation of the Networks that constitute a Telecommunications Network
3.8 Detailing of the Connection between Two Networks
3.9 A Telecommunications Network
3.10 Expanded Telecommunications Network
4.1 Example of Correlation Focuses in a Telecommunications Network
4.2 An Example of Bayesian Network
4.3 Example of a Bayesian Network with 16 nodes
4.4 Fragment of a Bayesian network
4.5 Two Functionally Dependent Networks
4.6 A Bayesian Network with n(n-1)/2 Edges
5.1 Structure of the Bayesian Network for the Example Telecommunications Network

List of Tables

2.1 Definition of a Fuzzy Set
4.1 An Example of How to Transform a Continuous Variable into a Discrete Variable
4.2 Definition of the Possible States of the Variables TO, TR, SS and PD
4.3 Probabilities Distribution for Node PD
5.1 Sub-networks that constitute the Example Network
5.2 Probabilities Distribution for the Opt-Net Node
5.3 Distribution of Probabilities for the Radio Node
5.4 Distribution of Probabilities for the Satellites Node
5.5 Distribution of Probabilities for the Coax-Cables Node
5.6 Distribution of Probabilities for the SDH Node
5.7 Distribution of Probabilities for the ATM Node
5.8 Distribution of Probabilities for the B-ISDN Node
5.9 Distribution of Probabilities for the N-ISDN Node
5.10 Distribution of Probabilities for the PDH Node
5.11 Distribution of Probabilities for the FDM Node
5.12 Distribution of Probabilities for the SS-7 Node
5.13 Distribution of Probabilities for the Fr-Relay Node
5.14 Distribution of Probabilities for the IDN Node
5.15 Distribution of Probabilities for the A-Switch Node
5.16 Distribution of Probabilities for the X.25 Node
5.17 Distribution of Probabilities for the Cellular Node
5.18 Distribution of Probabilities for the POTS Node

Abstract

This thesis proposes a general model for telecommunications networks and, from this model, a model for alarm correlation in the network as a whole. The model is based on a principle named recursive multifocal correlation, also developed in this thesis, according to which the telecommunications network is partitioned into several sub-networks, each of which constitutes a correlation focus. Breaking the problem down into smaller sub-problems facilitates its solution and allows each sub-network to use the correlation technique most suitable to its peculiarities. The multifocal correlation principle may be applied recursively in each sub-network until the network element level is reached.

The concepts developed were used in the implementation of a prototype for alarm correlation in a canonical telecommunications network. Using a commercial product as a tool for the development and evaluation of Bayesian networks, the occurrence of alarms was simulated and the functioning of the model was verified, both in identifying the possible causes of the received alarms (diagnostic inference) and in predicting possible effects (predictive inference).
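As an illustration of the two kinds of inference mentioned in the abstract, the following minimal Python sketch evaluates a two-node Bayesian network (fault causes alarm). It is not the prototype described in the thesis: the network structure, the names `predictive` and `diagnostic`, and all probability values are assumptions chosen only for illustration.

```python
# Hypothetical two-node Bayesian network: Fault -> Alarm.
# All probabilities are illustrative assumptions, not values from the thesis.

P_FAULT = 0.01               # prior probability of a fault
P_ALARM_GIVEN_FAULT = 0.95   # P(alarm | fault)
P_ALARM_GIVEN_OK = 0.02      # P(alarm | no fault), i.e. the false-alarm rate


def predictive(p_fault: float) -> float:
    """Predictive inference: P(alarm), reasoning from cause to effect."""
    return P_ALARM_GIVEN_FAULT * p_fault + P_ALARM_GIVEN_OK * (1.0 - p_fault)


def diagnostic(p_fault: float) -> float:
    """Diagnostic inference: P(fault | alarm), obtained with Bayes' rule."""
    return P_ALARM_GIVEN_FAULT * p_fault / predictive(p_fault)


if __name__ == "__main__":
    print(f"P(alarm)         = {predictive(P_FAULT):.4f}")
    print(f"P(fault | alarm) = {diagnostic(P_FAULT):.4f}")
```

With these assumed numbers, observing the alarm raises the belief in a fault well above its prior, which is the kind of update a diagnostic query performs; a realistic network would have many nodes and would need a proper inference engine.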

Chapter 1
Introduction

To manage a system means to supervise and control its functioning so that it meets the requirements of both its users and its owners [Sloman, 1994]. In the case of telecommunications systems, which by nature are large, complex, heterogeneous and highly distributed, this is a very difficult task, and it has attracted the attention of a growing number of researchers in recent years. Two reasons may be cited for this interest. The first is the economic relevance of the subject: a large telecommunications network represents an investment of several billion dollars. Effective management of such a network may bring a significant increase in its productivity, through better utilization of the available resources. Improved network management may have the following results [Meira and Lages, 1988]:

• improvement in the quality of services, from the users' point of view;
• increase of revenues, through the increase of the traffic routed in the network;
• reduction in the maintenance and operation costs of the network.

The second reason for the interest of the scientific community in network management is the challenges involved. The subject requires research in several areas of Computer Science, such as computer networks, database systems, human-machine interfaces, software engineering and artificial intelligence. In the case of telecommunications networks, the management concept is even richer, for it also relates to the technical, administrative and business aspects of telecommunications.

The management of telecommunications networks involves the transfer and processing of management information (that is, information about the structure and functioning of these networks) with the goal of helping a telecommunications company to conduct its business effectively [ITU-T, 1996]. The standards and recommendations already issued by the standards bodies that deal with the subject cover only the aspects connected to the structure and the protocols for the transfer of management information, disregarding the management activities themselves [Fink et al., 1993].

The ITU-T, the telecommunications standardization sector of the International Telecommunication Union (ITU), previously known as CCITT (The International Telegraph and Telephone Consultative Committee), issued the specifications for a Telecommunications Management Network - TMN. These specifications took into account only the technological aspects, giving little attention to the management needs of telecommunications services and businesses [Milham and Pai, 1994].

Most of the technological aspects addressed by TMN relate to the integrated and harmonious functioning of systems, networks and operators. The implementation of systemic solutions for the management of telecommunications networks and telecommunications services is not dealt with in the scope of TMN [Furley, 1996].

Even outside the standards environment, until recently most of the effort spent on the development of network management had been directed to the aspects related to the transfer of information. The great controversy over the virtues and limitations of CMIP (Common Management Information Protocol) [ITU-T, 1991b] and SNMP (Simple Network Management Protocol) [Case et al., 1990] [Gering, 1993] fits in this context. However important, necessary and complex it may be, the collection of information is not network management, but only a prerequisite for it.

One of the most important areas in telecommunications network management is the management of faults that occur during the functioning of these networks. According to ITU-T, fault management encompasses detection, isolation and correction of faults [ITU-T, 1992i].
Each of these three functions can be carried out only by adding value [Mansfield et al., 1993] to the raw data collected from the plant.

For a typical telecommunications plant, the problem of lack of information at the network management center has been gradually losing relevance. In fact, with the growth of the managed plant, combined with the deployment of modern management systems, there has been a great increase in the volume of information received at the management centers, making it practically infeasible to process all of it "manually" [Nygate, 1995].

It is therefore in the interest of the operators of public telecommunications services to develop solutions that process the management information made available to the network management systems, so as to deliver to the human operators only the information most relevant to their decisions.

Definition: Alarm Correlation consists of the conceptual interpretation of multiple alarms, resulting in the attribution of a new meaning to the original alarms [Jakobson and Weissman, 1993].

As part of the correlation process, raw data is interpreted and analyzed, taking into account a set of criteria that are either pre-established or dynamically defined according to the management process. Thus, correlation adds value to the original alarms and is an important mechanism for the management of telecommunications networks.

Because it reduces the time needed to identify the faults causing alarms, allowing a quick restoration of the affected services, the implementation of alarm correlation in a typical American operation center may yield annual benefits of up to US$1 million [Nygate, 1995]. The importance of this subject is acknowledged by ITU-T, which classifies alarm correlation as one of the problems to be solved so that a telecommunications management system produces the results expected from it. In spite of this importance, the ITU-T has not yet defined its position on the subject, which remains open to study [ITU-T, 1996].

Several interesting proposals may be found in the literature on alarm correlation in telecommunications networks (cf. Chapter 2). Most of these proposals aim at alarm correlation in some segment of these networks or even in isolated elements of a sub-network.

None of the proposals studied could be applied directly to the correlation of alarms generated across a telecommunications network as a whole.

One of the main obstacles to the correlation of alarms across a telecommunications network as a whole stems from the difficulty of understanding the networks themselves, for there is a lack of tools that facilitate the modeling of a telecommunications network with respect to the propagation of the effects of faults among its sub-networks.

As a contribution to make up for this lack, this thesis offers a telecommunications network model through which the functional dependencies among sub-networks may be specified (cf. Chapter 3).

The model is general and may be used to facilitate the understanding, as a whole, of any telecommunications network. The model is also robust, being easily adaptable to architectural or technological changes in the modeled networks.

From the general model of a telecommunications network developed in Chapter 3, a general model for the correlation of alarms in these networks is proposed, which constitutes another contribution of this thesis (cf. Chapter 4). As a strategy for dealing with the complexity of the alarm correlation problem, the proposed model adopts a technique named recursive multifocal correlation, also developed in this thesis. The technique is sufficiently general to allow, in principle, any of the methods and algorithms available in the literature to be adopted for alarm correlation in a telecommunications network or in one of its segments.

Another contribution of this research is the comparative study of the several approaches to alarm correlation found in the literature, taking as the major comparison parameter the nature of the application for which the correlation is intended (cf. Chapter 2).

A further contribution of this thesis is the classification of the telecommunications network models found in the literature, according to the objectives and scope of these models (cf. Chapter 3).

The theme of this thesis is alarm correlation in telecommunications networks. This does not imply a loss of generality in relation to the study of the same subject in computer networks, for, according to [Tanenbaum, 1996], a computer network may be seen as a set of autonomous interconnected (that is, able to exchange information) computers. Thus, many modern telecommunications sub-systems, such as the signaling networks [ITU-T, 1993b], the intelligent networks [ITU-T, 1992h] and the management networks [ITU-T, 1996], may be seen as large computer networks.

The remainder of this thesis is structured into five chapters. Chapter 2 supplies a basis for the study of alarm correlation through the discussion of the main concepts involved, provides a panoramic view of the main approaches found in the literature, and compares these approaches according to criteria related to the nature of the applications for which the correlation is intended. Chapter 2 also reviews the main products, solutions and platforms for alarm correlation found in the literature. Chapter 3 presents a study on the modeling of telecommunications networks, which includes a classification of the main approaches found in the literature according to the objectives and scope of the models.
Chapter 3 also establishes a general model for telecommunications networks that may be used as a basis for the study of alarm correlation and fault management in these networks. In Chapter 4, the concepts of recursive multifocal correlation and scope limitation are developed and, from them, a model for the correlation of alarms across a telecommunications network is proposed. In Chapter 5, a canonical telecommunications network is modeled and a prototype for alarm correlation in this network is developed. Chapter 6 presents the results obtained in this thesis, their limitations and possible future work.
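The recursive multifocal correlation principle outlined in this chapter can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model developed in Chapter 4: the class `Focus`, the functions `suppress_duplicates`, `summarize` and `multifocal`, and the sub-network, element and alarm names used in the example are all hypothetical. Each focus applies its own correlation technique to the results of its sub-focuses, and the recursion stops at the network element level.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Alarm = str
Correlator = Callable[[str, List[Alarm]], List[Alarm]]


def suppress_duplicates(focus: str, alarms: List[Alarm]) -> List[Alarm]:
    """Placeholder technique: keep one copy of each alarm, preserving order."""
    return list(dict.fromkeys(alarms))


def summarize(focus: str, alarms: List[Alarm]) -> List[Alarm]:
    """Placeholder technique: replace several alarms by one with a new meaning."""
    if len(alarms) > 1:
        return [f"{focus}: degraded ({len(alarms)} alarms)"]
    return alarms


@dataclass
class Focus:
    """A correlation focus: a (sub-)network, or a network element if childless."""
    name: str
    correlate: Correlator = suppress_duplicates
    children: List["Focus"] = field(default_factory=list)


def multifocal(focus: Focus, raw: Dict[str, List[Alarm]]) -> List[Alarm]:
    """Correlate every sub-focus recursively, then correlate the results here."""
    if not focus.children:
        # Network element level: start from the raw alarms it emitted.
        collected = list(raw.get(focus.name, []))
    else:
        collected = []
        for child in focus.children:
            collected.extend(multifocal(child, raw))
    return focus.correlate(focus.name, collected)


if __name__ == "__main__":
    network = Focus("network", children=[
        Focus("SDH", summarize, [Focus("mux-1"), Focus("mux-2")]),
        Focus("ATM", children=[Focus("switch-1")]),
    ])
    raw_alarms = {"mux-1": ["LOS", "LOS"], "mux-2": ["AIS"],
                  "switch-1": ["cell loss"]}
    print(multifocal(network, raw_alarms))
```

The correlator is passed in per focus so that each sub-network can use whichever technique suits it; the two placeholder functions stand in for real techniques such as rule-based or Bayesian correlation.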

Chapter 2
A Review on Alarm Correlation in Telecommunications Networks

This chapter provides a survey of alarm correlation, including the establishment of a set of basic concepts on the subject (section 2.1). Section 2.2 offers a panoramic view of the main approaches found in the literature. In section 2.3, the several approaches are compared, taking as a parameter the nature of the application for which the correlation is intended. Section 2.4 gathers some of the main products, solutions and commercial platforms for alarm correlation found in the literature. Section 2.5 summarizes the results of the study carried out in this chapter.

2.1 Basic Concepts

In 1985, in order to allow the implementation of management networks from multivendor equipment and systems, the ITU-T started the specification of its Telecommunications Management Network, best known as TMN [Mines, 1987] [ITU-T, 1996]. TMN is a general network model that supports the management needs of a telecommunications company to plan, provision, install, maintain, operate and administer telecommunications networks and services.

Definition: An Operations Support System - OS is a program which processes information related to telecommunications management, with the goal of monitoring, coordinating and/or controlling the telecommunications functions. An OS is characterized by implementing management functions named OSFs (Operations Systems Functions) [ITU-T, 1996].

Definition: A piece of equipment that communicates with the TMN, according to standards defined by the ITU-T, with the purpose of being monitored and/or controlled, is named a network element [ITU-T, 1996].

Definition: A managed object is defined as a view of a resource of a telecommunications network, from the point of view of the management system [ITU-T, 1992i].

One of the basic objectives of the TMN is to enable the interconnection of different operations support systems with the telecommunications plant, which is essentially made up of heterogeneous network elements.

So that a TMN can be implemented from functional blocks acquired from multiple vendors, its specification involved establishing reference points, which represent "boundaries" between functional blocks and whose purpose is to identify the information exchanged between these blocks. In general, each reference point corresponds, for implementation purposes, to an interface between two functional blocks.

In order to allow the exchange of information across these interfaces, the ITU-T, in association with the ISO (International Organization for Standardization), has defined standards both for the protocols and for the information models to be adopted in the communication.

The standardization effort carried out by the ITU-T resulted in the issuing of a series of Recommendations. Recommendation M.3000 [ITU-T, 1994a] presents an overview of the ITU-T standards referring to the TMN, serving as an "umbrella" for the development and use of the other Recommendations. Recommendation M.3010 [ITU-T, 1996] describes the TMN architecture, including the aspects referring to the exchange of information among the network elements and the physical and functional aspects of the architecture. Several other Recommendations describe in detail the principles [ITU-T, 1992i] [ITU-T, 1992j] [ITU-T, 1992m], the architectures [ITU-T, 1995b] [ITU-T, 1992k], the definitions [ITU-T, 1992d] [ITU-T, 1992e] [ITU-T, 1991a] [ITU-T, 1992l] and the specifications [ITU-T, 1992b] [ITU-T, 1992f] [ITU-T, 1992g] [ITU-T, 1993c] [ITU-T, 1993d] [ITU-T, 1991b] [ITU-T, 1993e] [ITU-T, 1992n] [ITU-T, 1992o] [ITU-T, 1993f] [ITU-T, 1995c] needed for the implementation of a TMN.

One of the greatest difficulties found in the implementation of TMNs is that a significant part of the telecommunications equipment and systems existing today is not prepared to implement the functions of a TMN network element [Meira et al., 1995]. In the near future, nevertheless, with the increasing digitalization of the network and the growing "intelligence" of its elements, TMN should be consolidated as the main architectural framework for the management of telecommunications networks, owing to its conceptual elegance, robustness and consistency.

2.1.1 Management Functional Areas

The management needs of a network may be grouped into five distinct functional areas [ITU-T, 1992i] (see footnote 1):

• Fault management. It includes the detection, isolation and correction of faults.
• Configuration management. It aims at preparing, initializing, starting, keeping in operation and terminating interconnection services, by exchanging information with managed objects and acting upon them. Configuration management also includes the creation and deletion of managed objects.
• Accounting management. It aims at identifying the costs of the network resources used to meet a given communication objective, so that these costs may be communicated to the users and the corresponding tariffs applied.
• Performance management. It enables the continuous evaluation of the behavior of the telecommunications plant resources, as well as of their results and efficiency.
• Security management. It aims at applying the security policies defined for the system. It includes the creation, deletion and control of security services and mechanisms, as well as the distribution of security-related information.

For a telecommunications network to be used effectively, it is indispensable that all of its resources be adequately managed and that the several management functional areas be integrated. The information generated in one area may be useful in other areas [Fink et al., 1993]. For instance, the occurrence of a fault may reduce the network performance and may also compromise its security, leading to the need to act on its configuration.

Footnote 1: Starting from the work developed by the ITU-T, the RACE European Program (Research and Technology Development in Advanced Communications Technologies in Europe) redefined the functional areas of telecommunications management so as to include, besides the "in-service" aspects of management, also some "pre-service" aspects. As a result, the RACE program identified nine functional areas: design; planning; installation; provisioning; maintenance; performance; security; accounting; consultation and control by the user.

2.1.2 Logical Layered Architecture

As a way to facilitate the understanding of the functionality of telecommunications management systems, the ITU-T defined a "Logical Layered Architecture" - LLA [ITU-T, 1996]. According to this model, some of the most important aspects of the management process are used as criteria for grouping the functionality of the operations support systems (that is, of the OSFs) into four logical management layers:

• Network Elements Management Layer
  In this layer are situated the functions referring to the management of individual network elements or of groups of network elements. The OSFs of this layer provide the OSFs of the upper layer with access to the functionality of the network elements and to the implementation of relationships among these elements. Access is provided through a uniform view of the network elements, independent of the equipment manufacturers.

• Network Management Layer
  Supported by the functionality of the network elements management layer, an OSF of this layer aims at managing a network as a whole, which is typically distributed over an extensive geographical area. This layer also aims to provide the upper layer with a view of the network which is independent of the technologies used in its implementation.
  Since they have a global view of the managed network, the OSFs of this layer are able to know, monitor and control the utilization of the network resources, thus ensuring that it operates according to adequate standards of performance and service quality.

• Services Management Layer
  In this layer, the OSFs aim at knowing, monitoring and controlling the contractual aspects of the services offered to the clients, including the receipt, processing and closing of service orders and complaints.
  This layer provides the main point of contact of clients with the service provider, and so it must have up-to-date and precise information on the activation and deactivation of services, the quality of these services and the occurrence of faults in the rendering of these services (independently of the causes).

• Business Management Layer
  One of the goals of the OSFs of this layer is the interaction with other OSFs, in order to obtain a better utilization of the telecommunications resources from the business point of view, which consists of seeking the best return on investment.
  Other attributions of the OSFs of this layer include support for the decision processes related to new investments and to the allocation of resources (human and material) for the operation, administration and maintenance of telecommunications resources.

The OSFs may be classified using a two-dimensional template similar to the one presented in [ITU-T, 1992e] (figure 2.1).

The functional management areas are represented in one of the dimensions of the chart, while the layers of the logical architecture are represented in the other dimension. A given OSF must fit into one of the cells of the chart. For example, the intersection of "Fault Management" with "Element Management" must contain all the OSFs referring to fault management at the network element level. An OS may contain several OSFs, and so it may fit into one or more positions in the chart.

+---------------------+------------+---------------+------------+-------------+------------+
|                     | Fault      | Configuration | Accounting | Performance | Security   |
|                     | Management | Management    | Management | Management  | Management |
+---------------------+------------+---------------+------------+-------------+------------+
| Business Management |            |               |            |             |            |
+---------------------+------------+---------------+------------+-------------+------------+
| Service Management  |            |               |            |             |            |
+---------------------+------------+---------------+------------+-------------+------------+
| Network Management  |            |               |            |             |            |
+---------------------+------------+---------------+------------+-------------+------------+
| Element Management  |            |               |            |             |            |
+---------------------+------------+---------------+------------+-------------+------------+

Figure 2.1: Management Functional Areas and Logical Layers

2.1.3 Alarm Correlation

A managed object, which may be seen as a representation of a real resource, may emit notifications in response to the occurrence of some event within itself. The notifications, as well as the information contained in them, are defined by the managed object class of which the managed object is an instance [ITU-T, 1992k]. A notification may be transmitted outside the managed object or simply stored internally in the object [ITU-T, 1992p].

Event reports are used to report, through communication protocols, the occurrence of events in a managed object [ITU-T, 1991a].

Definition: In the context of network management, a fault is defined as a cause of malfunctioning. Faults hinder or prevent the normal functioning of a system and manifest themselves through errors, that is, deviations from the normal operation of the system.

Definition: An alarm consists of a notification of the occurrence of a specific event, which may or may not represent an error. An alarm report is a kind of event report, used to convey alarm information [ITU-T, 1992o].

Some authors [Kehl and Hopfmüller, 1993] define correlation as a process in which a minimum set of fault hypotheses is created for a given set of alarms. This definition may be adequate for the fault diagnosis context (cf. item 2.1.4). However, an alarm is not always associated with a fault: in traffic management applications, for example, an alarm may report an increase of traffic in a trunk, which does not characterize a fault. It is therefore preferable to adopt the definition given by [Jakobson and Weissman, 1993], according to which alarm correlation consists in the conceptual interpretation of multiple alarms, leading to the attribution of a new meaning to the original alarms. The objective of correlation is generally to reduce the number of alarm notifications transferred to the operators of the network management system, while increasing the semantic content of the resulting notifications.

Alarm correlation may be applied to any of the five management functional areas defined by the ITU-T, that is, fault, configuration, accounting, performance and security management [ITU-T, 1992i]. However, most of the applications found in the literature are in the fault management area, which is the most elementary and, for this very reason, perhaps the most important. Nevertheless, there are other areas where, due to the large volume of information involved, alarm correlation may be useful. Traffic management, for example, is an application which demands the collection and processing of information in real time, with the objective of detecting abnormalities in the network's traffic patterns and taking the necessary measures to remedy them. Among the most important anomalies or exception conditions are random overloads, "killer" trunks (see footnote 2) and traffic overloads generated by exceptional events, such as catastrophes, sales promotions by telephone and special dates [Goodman et al., 1993].

Footnote 2: A trunk is named "killer" if it systematically accepts the calls that are offered to it and releases them (kills them) soon after.

The main requirement for performing fault management in an integrated way is the availability, in a management center, of information on the functioning of the network in real time. The abnormalities that occur during the operation of the network cause the automatic emission of alarm notifications, which are received at the network management center. From the alarm notifications received, the human operator must attempt to identify the fault that occurred and, if necessary, issue a trouble ticket, which is used as a reference for dispatching maintenance teams. Once the problem is solved, the trouble ticket is closed and remains available only for consultation.

With the growth of the managed plant, it is estimated that, in the medium term, the management center of a medium-sized regional operator will be receiving tens of thousands of alarm notifications per day, which will render the "manual" processing of all of them practically unfeasible [Nygate, 1995].

Besides that, many of the received notifications do not contain original information. In fact, the occurrence of a single fault in the supervised network sometimes results in the reception of multiple notifications. Several factors contribute to this situation [Houck et al., 1995]:

1. A device may generate several alarms due to a single fault;
2. A fault may be intrinsically intermittent, which implies the sending of a notification at each new occurrence;
3. The fault of a component may result in the sending of an alarm notification each time the service supplied by this component is invoked;
4. A single fault may be detected by multiple network components, each one of them emitting an alarm notification;
5. The fault of a given component may affect several other components, causing the fault's propagation.

Although, in principle, correlation may be performed "manually" by the network management center operators, in the context of this thesis the expression alarm correlation, or the equivalent expression "event correlation", means the use of computing resources in the correlation process.

2.1.4 Fault Diagnosis

Definition: Fault Diagnosis is a stage in the fault management process which consists of finding out the original cause of the received symptoms (represented by the alarms). Before getting to the original cause, it may be necessary to formulate a set of fault hypotheses, which will need to be validated by means of tests.

It is desirable that a system for fault diagnosis have a model of the managed configuration, process the flow of alarms in real time and be capable of working on incomplete data. Besides that, it is expected to be able to identify changes in the appearance and in the importance of problems as a function of time (for example, hour, day of the week, season of the year), to separate causes from effects and to solve the problems according to their seriousness (i.e., the most serious problems must have priority). In the selection of tests to be applied, the system must choose the most cost-effective ones. As far as possible, the diagnostic tests must be automated. Finally, it is desirable that the system be able, somehow, to interpret the results of tests [Sutter and Zeldin, 1988].

The problem of fault localization is NP-complete [Katzela and Schwartz, 1995], which means that no polynomial-time algorithm that solves it is known. Nevertheless, by means of heuristics [Pearl, 1984], in some cases it is possible to develop polynomial algorithms which give approximate solutions or, in other cases, which give a correct solution with a given probability [Katzela et al., 1995].

The needs for the centralized reception and storage of alarms, for knowledge of the configuration of the managed system when the fault occurs and for knowledge of how a fault in a component affects adjacent components in the configuration are some of the barriers which must be overcome before a practical solution for the alarm correlation problem can be implemented. In its most general case, correlation may demand other information, such as the results of tests executed in the network and data obtained from external databases and from users. The difficulty of obtaining this information constitutes an obstacle to the commercial implementation of alarm correlation and explains the fact that, among the several approaches that have been proposed, few have practical applications in the integrated management of telecommunications networks.

Besides the complexity inherent in any NP-complete problem, the design and development of the algorithms needed to perform the correlation, assuming that the mentioned barriers are overcome, must take into account the following additional aspects [Houck et al., 1995]:

• Noise, constituted by meaningless information, redundant information, streaming alarms, occasional spikes, frequent oscillations and repeated occurrences;
• Hidden Dependencies. Very often the strategy adopted in the correlation demands the construction of a model of the managed network. The simplifications adopted in this model may render some elements of the managed network "invisible" to the correlation process. As a result, a fault in an "invisible" network element may appear as a fault in another network element;
• Complex Dependencies. The dependency model adopted very often presupposes that, when a supporting resource fails, all the elements that depend on this resource will also fail, which sometimes does not happen;
• Incomplete Data. In general, it is presupposed that all the information necessary for the correlation is spontaneously sent by the network elements. At times, some of this information is not available (for example, due to an interruption in a link without an alternative path).

2.1.5 Correlation Types

Several types of correlation may be identified [Jakobson and Weissman, 1995], according to the operations executed on the alarms. The most important of these operations are detailed as follows:

Compression

Compression consists of detecting, from the observation of the alarms received in a given time window, multiple occurrences of the same event, and replacing the corresponding alarms with a single alarm, possibly indicating how many times the event occurred during the observation period.
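
The compression operation can be illustrated with a short sketch. The Python fragment below is only an illustration under stated assumptions: the Alarm record, its fields (source, event_type, timestamp, count), the fixed 60-second window and the function name compress are hypothetical and are not taken from this thesis or from any ITU-T information model. Alarms are grouped by source and event type within fixed, non-overlapping time windows, and each group is replaced by a single alarm carrying an occurrence count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    # Hypothetical alarm record; the field names are assumptions.
    source: str        # managed object that emitted the alarm
    event_type: str    # kind of event being notified
    timestamp: float   # reception time, in seconds
    count: int = 1     # number of occurrences this alarm stands for


def compress(alarms, window=60.0):
    """Replace repeated alarms inside each time window by one counted alarm."""
    groups = {}
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        # Alarms fall into fixed, non-overlapping windows of `window` seconds.
        slot = int(alarm.timestamp // window)
        key = (slot, alarm.source, alarm.event_type)
        if key in groups:
            first = groups[key]
            # Keep the first alarm of the group and increase its count.
            groups[key] = Alarm(first.source, first.event_type,
                                first.timestamp, first.count + alarm.count)
        else:
            groups[key] = alarm
    # Dictionaries preserve insertion order, so output follows first arrival.
    return list(groups.values())


received = [
    Alarm("trunk-7", "loss_of_signal", 3.0),
    Alarm("trunk-7", "loss_of_signal", 10.0),
    Alarm("trunk-9", "loss_of_signal", 12.0),
    Alarm("trunk-7", "loss_of_signal", 41.0),
    Alarm("trunk-7", "loss_of_signal", 75.0),
]
compressed = compress(received)
for alarm in compressed:
    print(alarm.source, alarm.event_type, alarm.timestamp, alarm.count)
```

With the five alarms shown, the three occurrences on trunk-7 inside the first window collapse into one alarm with a count of 3, while the alarm on trunk-9 and the later alarm on trunk-7 are kept as they are. A sliding window would be an equally plausible choice; fixed windows are used here only because they keep the sketch short.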

Selective Suppression

Selective Suppression is the temporary inhibition of alarms referring to a given event, according to criteria, continuously evaluated by the correlation system, related to the dynamic context of the network management process. The suppression criteria are generally linked to the presence of other alarms, to the temporal relationship among alarms or to the priorities established by the network managers.

Filtering

Filtering consists of suppressing a given alarm, depending on the values of a set of previously specified parameters. In a restricted sense, filtering takes into account only the parameters of the alarm which is being filtered. In a broader sense, filtering may take into account any other criteria. In this case, which might be characterized as intelligent filtering, the concept of filtering is expanded and may encompass several other types of operations, such as compression and suppression.

Counting

Counting consists of generating a new alarm each time the number of occurrences of a given type of event exceeds a previously established threshold.

Scaling

Scaling is an operation in which, depending on the operational context, an alarm is suppressed and another is created in its place, with one of its parameters (for example, severity) assuming a higher value. The operational context includes, among other factors, the presence of other alarms, the temporal relationship among alarms, the number of occurrences of an event in a given time window and the priorities established by the network managers.

Generalization

Generalization consists of replacing an alarm, depending on the operational context, by the alarm corresponding to its super-class [Bapat, 1994]. As an example, when the alarms corresponding to all the routes that use a certain cable as physical medium occur simultaneously, each one of the original alarms may be replaced by an alarm indicating a defect in the cable; next, through a compression operation, all the repeated alarms may be replaced by a single alarm.

This operation is based on inductive reasoning, which was originally studied by Aristotle in the 4th century BC [Benton, 1952]. Inductive reasoning allows the scope of knowledge to be expanded, at the expense of an increase in the complexity of the problem and of the introduction of a certain degree of uncertainty in the correlation result.

Two main types of generalization may be identified: generalization by simplification of conditions and instance-based generalization [Holland et al., 1986]. In the first case, for the lower-class alarm to be replaced by another of a higher class, one or more of the conditions defined as necessary to its identification are ignored. In the second case, a new alarm may be generated from the association of the information corresponding to two or more received alarms.

Specialization

Specialization is the reverse of generalization, and consists of replacing an alarm with another corresponding to a sub-class [Bapat, 1994]. This operation, based on deductive reasoning, does not add any new information beyond what was already implicitly present in the original alarms and in the configuration database, but it is useful in making evident the consequences that an event in a given management layer [ITU-T, 1996] may cause in the higher management layers.

As an example of a possible specialization, the correlation system may generate, whenever a given path is interrupted, an alarm for each one of the services affected by the interruption. Thus, through specialization, the consequences of a fault in the network management layer upon the entities of the services management layer are made evident.

Temporal Relationship

Temporal Relationship is an operation in which the correlation criteria depend on the order or the time at which the alarms are generated or received. Several temporal relationships may be defined, using concepts such as: after, follow, before, precede, during, start, finish, coincide, overlap [Jakobson and Weissman, 1995].

Clustering

Clustering consists of generating a new alarm based on the detection of complex correlation patterns in the received alarms.
The clustering operation may also take into account the result of other correlations and the result of tests carried out in the network.

2.1.6 Architectural Aspects

In the definition of the architecture of an alarm correlation system for a telecommunications network, the following aspects, among others, must be taken into account:

• The architecture of the managed telecommunications network;
• The architecture of the existing management network;
• The objectives of the correlation.

Initially, besides identifying which functional areas will be served by the alarm correlation, it is important to characterize the objective of the correlation, which may range from the reduction of the volume of information routed to the network managers to something more elaborate, such as fault localization and diagnosis, or the prediction of the future behavior of the network, based on trend analysis.

After defining the objectives of the correlation, the topology of the correlation system must be defined (cf. item 2.1.7), that is, it must be decided where the correlating devices must be located and what type of relationship must exist among them. Other aspects to be addressed by the architecture involve the definition of the types of correlation to be performed (cf. item 2.1.5) and where. Equally important is the definition of the approaches to be adopted, among those available or that may be implemented, for each type of correlation (cf. section 2.2).

2.1.7 Topology

In a telecommunications network, correlation may be performed at several levels of the configuration, from the level of the individual network elements up to the highest level, which involves the whole network. When correlation takes place at a lower level, it is generally made up of simpler and, consequently, faster processes. On the other hand, since it does not take into account the broader context, this kind of correlation suffers from a severe "myopia", which prevents it from detecting possible consequences that local problems may cause in the network as a whole. When correlation is performed at the highest level, all the relevant information may, in principle, be offered to the correlating mechanism, which thus has a broad view of the managed system and may diagnose problems by working on the consequences they provoke on the network as a whole. In return, the large amount of information available increases the complexity of the problem, which very often becomes intractable.

2.2 Methods and Algorithms for Alarm Correlation

This section provides a panoramic view of alarm correlation in telecommunications networks, gathering the main approaches found in the literature, classified according to the methods and algorithms used in the correlation process [Meira, 1997] [Meira and Nogueira, 1997a].

Only two types of approaches were identified by [Lazar et al., 1992] for the problem of detection and identification of faults in telecommunications networks: the probabilistic approaches, on the one hand, and, on the other, the approaches in which the network entities are modeled as finite state machines.

Today, the number of available approaches is much larger. Some of these approaches are probabilistic, others use traditional artificial intelligence paradigms and others apply principles defined in non-conventional logics [Smets et al., 1988]. There are also approaches which adopt ad hoc methods to deal with the alarm correlation problem.

2.2.1 Rule-Based Correlation

Of the several approaches presented in the literature on alarm correlation in telecommunications systems, a significant part consists of variations on the classic rule-based approach. In this approach, the general knowledge of a certain area is contained in a set of rules and the specific knowledge, relevant for a particular situation, is constituted of facts, expressed through assertions and stored in a database. A rule consists of two expressions (well-formed formulas of predicate calculus [Nilsson, 1980]) linked by an implication connective (⇒), and operates on a global database. The left side of each rule contains a prerequisite which must be satisfied by the database for the rule to be applicable. The right side describes the action to be executed if the rule is applied. The application of a rule alters the database.

Every rule-based system (or production system) has a control strategy which determines the order in which the applicable rules will be applied and which stops the computing process when a finishing condition is satisfied by the database.

There are two operation modes in a production system. The first one is the forward mode, in which the system departs from an initial state and constructs a sequence of steps that leads to the solution of the problem (the "goal"). In a fault diagnosis system, the rules would be applied to a database containing all the alarms received, until a finishing condition involving one fault is reached.
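
The forward mode can be sketched with a minimal production system. The fragment is an illustration under assumptions, not the correlation mechanism proposed in this thesis: the rule format (a set of prerequisite facts and a single fact to assert), the alarm and fault names and the function forward_chain are all hypothetical. Facts are plain strings rather than predicate calculus formulas, the control strategy simply applies the first applicable rule in list order, and the finishing condition is that the database contains a fact naming a fault.

```python
# Each hypothetical rule: (facts required in the database, fact to assert).
RULES = [
    ({"alarm:loss_of_signal_route_1", "alarm:loss_of_signal_route_2"},
     "hypothesis:shared_cable_affected"),
    ({"hypothesis:shared_cable_affected", "alarm:power_ok"},
     "fault:cable_cut"),
    ({"alarm:power_failure"}, "fault:power_supply"),
]


def forward_chain(facts, rules):
    """Apply rules to the database until a fault is asserted or nothing changes."""
    database = set(facts)
    applied = []  # indices of the rules, in the order in which they fired
    changed = True
    while changed:
        # Finishing condition: the database contains a fault.
        if any(fact.startswith("fault:") for fact in database):
            break
        changed = False
        for index, (prerequisite, conclusion) in enumerate(rules):
            # A rule is applicable when its prerequisite is satisfied by the
            # database and applying it would actually alter the database.
            if prerequisite <= database and conclusion not in database:
                database.add(conclusion)
                applied.append(index)
                changed = True
                break  # control strategy: restart from the first rule
    return database, applied


alarms = {"alarm:loss_of_signal_route_1",
          "alarm:loss_of_signal_route_2",
          "alarm:power_ok"}
database, applied = forward_chain(alarms, RULES)
print(sorted(fact for fact in database if fact.startswith("fault:")))
print(applied)
```

With the three alarms shown, rules 0 and 1 fire in that order and the database ends up containing the fact fault:cable_cut; rule 2 never becomes applicable because no power failure alarm was received.
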
In the second operation mode, called backward mode, the system starts from the configuration corresponding to the solution of the problem and constructs a sequence of steps that leads to the configuration corresponding to the initial state. Taking again a fault diagnosis system as an example, the rules would be applied to a database containing all the possible faults, until a finishing condition is reached in which all the received alarms are present. The same set of rules may be used for the two operation modes [Rich, 1983].

In comparison with traditional programs, which contain in their code both the specialized knowledge and the control information (which contributes to making them extremely complex and hard to maintain), a rule-based expert system is simpler, more modular and easier to maintain, for it is organized in three levels [Cronk et al., 1988]:

a) An inference engine, which contains the strategy to solve a given class of problems;
b) A knowledge base, containing a set of rules with the knowledge about a specific task, that is, an instance of that class of problems;
c) A working memory, containing the data about the problem being dealt with.

Despite the advantages that they present in relation to traditional programs, rule-based systems present some limitations as far as the acquisition of the necessary knowledge is concerned, because this acquisition is, in principle, based upon interviews with human specialists. This procedure takes a long time, is expensive and is subject to errors, which has motivated research aiming at automating it and making it faster by means of machine learning techniques [Michalski et al., 1983] [Goodman and Latin, 1991].

Another limitation of these systems is that they do not take advantage of past experience in the deductive process, that is, they lack "memory". Therefore, a purely rule-based system that has triggered thousands of rules in order to deduce, from a given set of alarms, the occurrence of a given fault will trigger all those rules again whenever it is submitted to the same set of alarms, reaching the same conclusion once again. The program does not "remember" the occurrence of a similar situation in the past.

Because they do not make use of past experience, rule-based systems are liable to repeat the same errors over and over again, which contributes to degrading their precision and performance.

Since its knowledge is limited to the rules of its knowledge base, the system cannot deal with situations to which these rules do not apply. This affects its robustness [Lewis and Dreo, 1993], since the system may lack alternatives in many situations that are common in an integrated network management environment.

In areas which change quickly, as is the case of telecommunications networks, rule-based systems tend to become obsolete quickly [Lewis, 1993].

2.2.2 Fuzzy Logic

Due to the complexity of the managed networks, it is not always possible to build precise models of these networks which make evident all the situations in which the occurrence of a given set of alarms indicates a fault in a given piece of equipment.

The knowledge of the cause and effect relations among faults and alarms is generally incomplete. Besides that, some of the alarms generated by a fault are frequently not made available to the correlation system in due time, because of losses or delays on the route from the network element which originated them. Finally, because the configuration changes frequently, the more detailed a model is, the faster it will become outdated.

18 The imprecision of the information supplied by specialists very often causes great di�-culties. In an hypothetical example, a network management specialist could formulate thefollowing rules:1. If tra�c in route A is very high and tra�c in route B is normal, then divert 1=4 ofroute A tra�c to route B;2. The occurrence of alarm C sometimes indicates equipment D fault.The expressions \very high", \normal" and \sometimes" are inherently imprecise andmay not be directly incorporated to the knowledge basis of a conventional rule-basedsystem.Fuzzy logic is an alternative to deal with the uncertainty and the imprecision whichcharacterize some applications of telecommunications network management.According to [Zadeh, 1988], fuzzy logic contains, as special cases, the traditional logicsystem, the mutivalued logic systems, the probability theory and the probabilistic logic.On the other hand, in spite of being possible to empirically attest that a given fuzzy logicsystem operates according to what was expected, there still are not tools that allow toprove, a priori, that this system works [Meech and Kumar, 1994], which indicates that theconcepts introduced by [Zadeh, 1965] still do not count on su�cient mathematical support.Some researchers argue that all problems that may be solved by means of fuzzy logicsmay be equally solved by means of probabilistic models as, for example, Bayesian networks(cf. item 2.2.3), with the advantage of counting, in the latter case, on a solid mathematicalbasis, which fuzzy logic lacks [Luna, 1994].The basic concept underlying fuzzy logic are the fuzzy sets. In classic logic, one set Apresents the property of, given an element X, the expression \X is a member in A" alwaysassumes one between two possible values: true or false. 
When it comes to fuzzy sets, each element X has, in relation to the set, a certain grade of membership, which may assume any value between 0 (when the element definitely does not belong to the set) and 1 (when the element certainly is a member of the set). The concept of fuzzy set brings the novelty that a given proposition does not have to be only true or false; it may be partially true, to any degree on a scale from 0 to 1.

As an example, let 100 Erlang be the maximum traffic capacity of a given route. By the classic logic criteria, "very high traffic" might arbitrarily be defined as traffic that surpasses, say, 85 Erlang. Therefore, 84.9 Erlang of traffic would not be considered "very high", whereas 85.1 Erlang would definitely fall into that category.
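
The contrast between the crisp criterion and a graded membership can be illustrated with a minimal sketch. The step values are those of Table 2.1; the function names and the treatment of the interval boundaries (lower bound included, upper bound excluded) are assumptions made only for this illustration, since the table does not state to which interval a boundary value belongs.

```python
# Crisp criterion from the text: traffic above 85 Erlang is "very high".
CRISP_THRESHOLD = 85.0

# Step function of Table 2.1 as (upper bound in Erlang, grade of membership).
# Assumption: each interval includes its lower bound and excludes its upper
# bound; only the last interval includes its upper bound (100 Erlang).
VERY_HIGH_STEPS = [
    (30, 0.0), (40, 0.2), (50, 0.4), (60, 0.6),
    (70, 0.8), (80, 0.9), (90, 0.95), (100, 1.0),
]


def crisp_very_high(traffic):
    """Classic logic: the proposition is either true or false."""
    return traffic > CRISP_THRESHOLD


def fuzzy_very_high(traffic):
    """Fuzzy logic: grade of membership in the "very high traffic" set."""
    if not 0 <= traffic <= 100:
        raise ValueError("traffic must lie between 0 and 100 Erlang")
    for upper, grade in VERY_HIGH_STEPS:
        if traffic < upper:
            return grade
    return 1.0  # traffic is exactly 100 Erlang


for erlang in (84.9, 85.1):
    print(erlang, crisp_very_high(erlang), fuzzy_very_high(erlang))
```

With these definitions, 84.9 and 85.1 Erlang are separated by the crisp criterion but receive the same grade of membership (0.95) in the fuzzy set.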

Traffic (Erlang)    Grade of Membership in the Set
up to 30            0
30 to 40            0.2
40 to 50            0.4
50 to 60            0.6
60 to 70            0.8
70 to 80            0.9
80 to 90            0.95
90 to 100           1

Table 2.1: Definition of a Fuzzy Set

In the fuzzy logic approach, the "very high traffic" fuzzy set could be represented by a function such as the one defined in table 2.1.

Through a special algebra, several operations on fuzzy sets are defined (for example, complementation, intersection and union).

Fuzzy expert systems allow the rules to be formulated directly with linguistic variables such as "very high" or "normal", which considerably simplifies the development of the system.

A number of applications of fuzzy systems have been implemented in several areas, such as: Strategic Planning [Hall, 1987], Mining [Meech and Jordon, 1993], Geology [Lebailly et al., 1987], Medicine [Henkind et al., 1987], Environmental Sciences [Veiga and Meech, 1994], Electrical Engineering [Chen and Rao, 1993] and Network Management [Lirov, 1993].

A good introduction to fuzzy systems is presented in [Negoita, 1984]. The theoretical aspects of expert systems based on fuzzy logic are explored in [Gupta et al., 1985], which also presents several applications of these systems.

2.2.3 Bayesian Networks

Bayesian networks ([Wright, 1921], as cited in [Heckerman et al., 1995b]) constitute an interesting approach to the treatment of uncertainty. Through them, it is possible to produce inferences even when the available information is incomplete and inaccurate.

Definition: A Bayesian network is a directed acyclic graph in which each node represents a random variable to which conditional probabilities are associated, given all the possible combinations of values of the variables represented by the directly preceding nodes; an edge in this graph indicates the existence of a direct causal influence between the variables corresponding to the interconnected nodes.

A subjective probability expresses the degree of belief of an expert in the occurrence of a given event, based on the information this person has available up to the moment [Henrion et al., 1991]. The use of subjective probabilities is very often the only resource in situations where analytical or experimental data is very hard, or even impossible, to obtain [Charniak, 1991] [Pearl, 1991] [Deng et al., 1993] [Kirsch and Kroschel, 1994].

Sometimes it is possible to estimate the conditional probabilities from empirical data, obtained from the study of the past behavior of the system being studied.

Given a Bayesian network and a set of evidences (nodes whose corresponding variables have been instantiated), it is possible to evaluate the network, that is, to calculate the conditional probability associated with each node, given the evidences observed up to the moment. Generally speaking, this is an NP-hard problem but, with the use of appropriate heuristics and depending on the problem dealt with, networks containing thousands of nodes may be evaluated in an acceptable time [Cooper, 1987] [Charniak, 1991].

A more detailed study of Bayesian networks is presented in item 4.2.

2.2.4 Model-Based Reasoning

Model-Based Reasoning (MBR) is a paradigm of the artificial intelligence area which has several applications in alarm correlation. The principles of MBR were originally proposed in [Davis et al., 1982]. MBR consists of representing a system through a structural model and a functional model, in contrast with the traditional rule-based systems, where the rules are based on empirical associations. In the case of a telecommunications network management system, the structural representation includes a description of the network elements and of the topology (i.e., connectivity and containment relationships).
The representation of functional behavior describes the processes of event propagation and event correlation [Jakobson and Weissman, 1995].

2.2.5 Blackboard

Several expert systems for network management and, particularly, for alarm correlation have been implemented using an architecture called "blackboard" [Goyal and Worrest, 1988] [Sasisekharan et al., 1993b] [Frontini et al., 1991]. The architecture consists of a global database, called the blackboard, several knowledge sources (KS) and a scheduler.

The blackboard [Hayes-Roth, 1987] is responsible for storing the solution elements produced by the system during the problem resolution process. The solution elements are organized in the blackboard according to two axes, representing abstraction levels and solution intervals, respectively.

Knowledge sources are processes responsible for generating solution elements and for storing them in the blackboard. Each KS is defined by a condition and an action. The condition is normally characterized by a given configuration of the blackboard and specifies in which situations the KS will be able to contribute to the solution of the problem through an action. Generally, one of the effects of an action is the modification of the blackboard's content. The knowledge sources are independent of one another, that is, a KS may not invoke other KSs, nor does it have any knowledge of the functionality or even the existence of these KSs.

Each change in the content of the blackboard constitutes an event, able to activate one or more KSs, depending on the information stored in the blackboard. The scheduler is responsible, among other things, for selecting which of the KSs that have had their condition satisfied will be triggered.
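
The interplay between blackboard, knowledge sources and scheduler can be sketched as below. This is only an illustration under strong simplifying assumptions: the blackboard is reduced to a dictionary (the two axes of abstraction levels and solution intervals are not modeled), the scheduling heuristic is reduced to a hypothetical numeric `priority`, and the two knowledge sources, the element names and the alarm lists are invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Blackboard = Dict[str, Any]


@dataclass
class KnowledgeSource:
    name: str
    condition: Callable[[Blackboard], bool]  # when the KS may contribute
    action: Callable[[Blackboard], None]     # modifies the blackboard
    priority: int = 0  # hypothetical stand-in for the scheduler's heuristic


def run_scheduler(blackboard: Blackboard, sources: List[KnowledgeSource],
                  max_cycles: int = 100) -> List[str]:
    """In each cycle, trigger one KS among those whose condition holds."""
    fired = []
    for _ in range(max_cycles):
        ready = [ks for ks in sources if ks.condition(blackboard)]
        if not ready:
            break
        chosen = max(ready, key=lambda ks: ks.priority)
        chosen.action(blackboard)
        fired.append(chosen.name)
    return fired


def group_alarms(bb: Blackboard) -> None:
    """Lower abstraction level: group the raw alarms by network element."""
    groups: Dict[str, List[str]] = {}
    for element, alarm in bb["alarms"]:
        groups.setdefault(element, []).append(alarm)
    bb["groups"] = groups


def propose_fault(bb: Blackboard) -> None:
    """Higher abstraction level: suspect the element with most alarms."""
    groups = bb["groups"]
    bb["hypothesis"] = max(groups, key=lambda element: len(groups[element]))


# The KSs never call each other; they interact only through the blackboard.
sources = [
    KnowledgeSource("propose_fault",
                    lambda bb: "groups" in bb and "hypothesis" not in bb,
                    propose_fault),
    KnowledgeSource("group_alarms",
                    lambda bb: "alarms" in bb and "groups" not in bb,
                    group_alarms),
]

blackboard = {"alarms": [("mux1", "LOS"), ("mux1", "AIS"), ("sw2", "AIS")]}
fired = run_scheduler(blackboard, sources)
print(fired, blackboard["hypothesis"])
```

Each action changes the blackboard, which in turn changes the set of KSs whose condition is satisfied in the next cycle; the loop stops when no KS is able to contribute.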
By using opportunistic heuristics, the scheduler may choose, in each cycle, which potential action is most adequate for the present situation.

The blackboard architecture has, among others, the following objectives [Hayes-Roth, 1987]: the reduction of the search space, through reasoning at multiple abstraction levels, the use of independent KSs and opportunistic scheduling; the integration of different types of knowledge; the simultaneous operation of redundant KSs, as a strategy to compensate for the lack of reliability of the available knowledge; independence among the work of the different KS development teams; an easy process of modification and evolution.

2.2.6 Filtering

Some network management systems have filters that select the alarm notifications to be displayed, at the operator's request, according to criteria such as the geographic area where the alarm originated, the technical area (transmission, switching, etc.) or the alarm's severity degree. In these systems, the filter concept is similar to ITU-T's definition, according to which a filter is a set of assertions on the presence or on the values of attributes in a managed object [ITU-T, 1991a]. The filtering criteria do not depend on the context and are based exclusively on the characteristics of the alarm itself [Meira, 1995]. In spite of being able to significantly reduce the amount of information displayed, the filters' cutting criteria sometimes do not help to identify the faults that caused the emission of alarm notifications, and may even prevent the presentation of the information necessary for the identification of these faults.

There is an alarm correlation modality, which could be called intelligent filtering, in which the selection criteria are more elaborate, being dynamically calculated by the system according to information obtained externally to the alarm being filtered [Möller et al., 1995]. This technique is appropriate to deal with a situation known as "event storm" [Hewlett Packard, 1995b], in which hundreds or even thousands of events are generated in a short period of time, as a consequence of a single problem. This phenomenon frequently occurs in systems that use high speed technologies such as, for example, ATM (Asynchronous Transfer Mode) [de Pricker, 1995] and SDH (Synchronous Digital Hierarchy) [ITU-T, 1993a], and must be minimized through alarm correlation.

2.2.7 Event Forwarding Discriminator (EFD)

In Recommendation X.734, ITU-T defines the services, the protocols and the functional units of a management system, as far as the event report function is concerned [ITU-T, 1993f]. In the recommended model, before being transferred outside a managed object in the form of event reports, the locally generated notifications are preprocessed, originating potential event reports.

An Event Forwarding Discriminator (EFD), as defined in Recommendation X.734, determines which potential event reports must be transferred, in the form of event reports, to a given destination and during a specified time interval.

The conditions to be satisfied, so that a potential event report may be transferred, are specified by means of an attribute called discriminator construct, which acts as a filtering mechanism on the objects presented at the input of the EFD.
The following attributes of a managed object may be specified in a discriminator construct to be evaluated by an EFD:

• The managed object class;
• The managed object instance;
• The type of event;
• The attributes of a given type of event, such as, for example, severity.

An EFD is a managed object and, so, it may be created and destroyed, and its state and the values of its attributes may be modified at any time.

2.2.8 Case-Based Reasoning

As an alternative to the rule-based approach, some authors propose a technique called Case-Based Reasoning (CBR) [Slade, 1991] [Weiner et al., 1995]. Here, the basic unit of knowledge is a case and not a rule. Cases consist of records containing the most relevant aspects of past episodes and are stored, retrieved, adapted and used in the solution of new problems. The experience obtained with the solution of these new problems constitutes new cases, which are added to the database for future use. Thus, the system is able to acquire knowledge through its own means, without needing to interview human experts. Another relevant characteristic of CBR systems is their ability to modify their future behavior according to the mistakes made. Besides that, a case-based system may build solutions to novel problems, through the adaptation of past cases to the new situations.

The development of CBR systems started in the 1980s and, since then, several challenges have stimulated the creativity of researchers: how to represent the cases; how to index them, so as to allow their retrieval when necessary; how to modify an old case, to adapt it to a new situation in order to generate an original solution; how to test a proposed solution, classifying it as a success or a failure; how to explain the failure of a suggested solution and repair it, originating a new proposal.

The problem of case adaptation was studied by [Lewis and Dreo, 1993], who describe a technique called parameterized adaptation, which is based on the existence, in a trouble ticket, of a certain relationship between the variables that describe a problem and the variables that specify the corresponding solution. The CBR system takes into account the parameters of this relationship in the proposition of a solution for the case under analysis. For the representation of parameters, the authors propose the use of linguistic variables (that is, variables that assume linguistic values instead of numeric values) and the provision of functions that translate the parameters' numeric values into grades of membership in a fuzzy set (cf. item 2.2.2).

To store and retrieve the knowledge on the solution of past problems, [Dreo and Valta, 1995] present the concept of master ticket which, instead of containing information on a single fault (case), contains a generalization of the information on the fault.
Thus, when a master ticket is retrieved, it must be instantiated before the information it contains can be applied to a particular case. The goal of this procedure is to facilitate access to the information on the solution of problems. The generalization consists of replacing with parameters all the specific information on the fault that originated the ticket, such as the users' information and the addresses of the nodes involved. Instantiating a master ticket consists of replacing its parameters with the real values of the case under consideration.

2.2.9 Coding Approach

In the coding approach [Kliger et al., 1995], most of the processing necessary for alarm correlation is carried out in advance, originating a database called the codebook. The codebook may be seen as a matrix, where each row corresponds to a symptom (or event, or alarm) and each column corresponds to a problem (or fault, or defect). If n distinct symptoms are represented in the codebook, each element of the vector $p_i = (s_1, s_2, \ldots, s_n)$ contains a measure of the causality of problem $p_i$ to the corresponding symptom. Thus, if, in the vector $p_i$, $s_1 = 0$, the symptom $s_1$ must never occur as a consequence of problem $p_i$; on the other hand, if $s_1 = 1$, the symptom $s_1$ must always occur as a consequence of problem $p_i$ [Yemini et al., 1996] [System Management ARTS, 1996a].

The values of the causality measures are not required to belong to the set $\{0, 1\}$; the model allows these values to belong to any semi-ring, which constitutes a special class of partially ordered sets. This leaves open the possibility of using several approaches to describe the likelihood of causality, such as deterministic, probabilistic, fuzzy logic and temporal models [Lirov, 1993] [Luna and Correa Filho, 1992].

In real time, each abnormal situation may be described by means of an alarm vector $a = (a_1, a_2, \ldots, a_n)$, where each element indicates whether or not the corresponding alarm has occurred. The correlation is made by choosing, in the codebook, the problem $p$ whose code is closest to $a$ in terms of Hamming distance [Hamming, 1950].

As most of the computing is carried out in advance, only the simpler operations are done in real time. This allows the performance of the coding approach, in terms of events processed per second, to be from two to four orders of magnitude higher than the performance of other alarm correlation approaches found in the literature [Yemini et al., 1996] [Nygate, 1995] [Jakobson and Weissman, 1993].

The object-oriented paradigm is adopted in the coding approach to represent the object classes of the modeled network, as well as their attributes, relationships and event information.
A class is a template for a set of object instances, in which the properties common to these objects are described, as far as structure and behavior are concerned.

The definition of events and the modeling of the propagation of their effects in the managed network are done through a special language, called MODEL [Ohsie et al., 1997].

One of the main motivations for the use of object orientation is to allow interoperability between an alarm correlation system and other applications running in heterogeneous distributed environments. For two applications to interoperate, they must obey the same interface standards, such as, for example, the ones defined in [OMG and X/Open, 1995].

The strong points of the codebook approach are: performance, robustness, automatic computation of the correlation rules, and versatility in adapting the system to changes in the network topology [Yemini et al., 1996].
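
The real-time step of the coding approach, that is, the choice of the problem whose code is closest to the observed alarm vector, can be sketched as follows. The codebook below is hypothetical (three problems, six symptoms, binary causality values) and was chosen so that any two codes differ in four positions; the function names are likewise assumptions of this illustration and do not reproduce any particular product.

```python
# Hypothetical codebook: one code per problem. Position j holds 1 when the
# problem always causes symptom j and 0 when it never does.
CODEBOOK = {
    "p1": (1, 1, 1, 0, 0, 0),
    "p2": (0, 0, 1, 1, 1, 0),
    "p3": (1, 0, 0, 0, 1, 1),
}


def hamming_distance(code, alarm_vector):
    """Number of positions in which the two binary vectors differ."""
    if len(code) != len(alarm_vector):
        raise ValueError("vectors must have the same length")
    return sum(c != a for c, a in zip(code, alarm_vector))


def correlate(alarm_vector, codebook=CODEBOOK):
    """Return the closest problem(s) and their distance to the alarm vector."""
    distances = {problem: hamming_distance(code, alarm_vector)
                 for problem, code in codebook.items()}
    best = min(distances.values())
    closest = sorted(p for p, d in distances.items() if d == best)
    return closest, best


# Problem p1 occurred, but the alarm for the second symptom was lost.
print(correlate((1, 0, 1, 0, 0, 0)))
```

Because the codes of this example are far apart, the loss of a single alarm still leaves p1 as the only closest problem. When two codes are equally close, the function returns both candidates instead of choosing arbitrarily.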

2.2.10 Explicit Localization

Most of the alarms received in a network management centre do not bring any explicit information on the location of the fault that originated them. In the proposal presented in [Bouloutas et al., 1994], information on the fault location is explicitly associated with each alarm, by using a set that contains all the possible locations (see footnote 3). It is initially supposed that all the alarms are reliable and that there is only one fault in the network. In this way, the fault will be located at the intersection of the localization sets provided by the several alarms. Next, this scenario is extended to cover multiple faults, in a more realistic environment, where the alarms received may not even be reliable. The initial problem evolves into a discrete optimization problem, whose objective is to find the set of faults and the set of alarms that minimize a certain cost function. Since this is an NP-hard problem [Bouloutas et al., 1994], its solution involves the use of heuristics.

2.2.11 Correlation by Voting

Correlation by voting is a technique conceptually similar to the explicit localization technique (cf. item 2.2.10). The main difference is that, instead of containing information on the exact location of the fault, given by a set containing all the possible locations, as is the case in explicit localization, in correlation by voting each alarm contains an integer number of votes, showing the direction (in relation to the element that reports the alarm) in which the problem that caused it may lie [Houck et al., 1995].

According to this technique, an alarm does not contain votes for each individual node, but for all the nodes in a given direction. Therefore, the correlation system must know the topology of the managed network so that, once it knows the number of votes an alarm assigns to a given direction, each of the nodes in that direction can receive that number of votes.
Next, the votes of each node can be totaled, and the most voted nodes chosen as possible fault locations.

The correlation by voting technique may be associated with other techniques such as, for example, dependence tree search [Houck et al., 1995], which allow the identification, among the components of the most voted nodes, of the one most probably responsible for the fault that caused the alarms. Through this search, it may also be determined whether the fault is attributable to the identified nodes or whether these nodes failed due to a problem with components on which they depend.

Footnote 3: This proposal shows some similarity with the model recommended by the ITU-T in [ITU-T, 1992o], in which each alarm notification may contain, among other information, a parameter called correlated notifications. When present, this parameter contains a set of notification identifiers and, if necessary, the respective names of instances of managed objects. This set contains all the notifications to which the present notification is considered to be related.
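
The two localization ideas of items 2.2.10 and 2.2.11 can be contrasted in a short sketch: the intersection of localization sets (single fault, reliable alarms) and the totaling of votes per node. The chain topology A - B - C - D, the direction names and the vote counts are hypothetical, and it is assumed here that the reporting element itself receives no votes, a point the description above leaves open.

```python
from collections import Counter


def locate_by_intersection(location_sets):
    """Explicit localization with one fault and reliable alarms: the fault
    lies in the intersection of the sets carried by the alarms."""
    sets = [set(s) for s in location_sets]
    return set.intersection(*sets) if sets else set()


# Hypothetical topology (chain A - B - C - D): for each reporting element
# and direction, the nodes that lie in that direction.
TOPOLOGY = {
    ("A", "downstream"): ["B", "C", "D"],
    ("B", "upstream"): ["A"],
    ("B", "downstream"): ["C", "D"],
    ("D", "upstream"): ["A", "B", "C"],
}


def tally_votes(alarms, topology=TOPOLOGY):
    """alarms: iterable of (reporting element, direction, number of votes).
    Every node in the voted direction receives the alarm's votes."""
    totals = Counter()
    for reporter, direction, votes in alarms:
        for node in topology[(reporter, direction)]:
            totals[node] += votes
    return totals


def most_voted(totals):
    """Nodes with the highest total, as candidate fault locations."""
    if not totals:
        return []
    top = max(totals.values())
    return sorted(node for node, votes in totals.items() if votes == top)


print(locate_by_intersection([{"B", "C"}, {"C", "D"}, {"A", "C"}]))
totals = tally_votes([("B", "downstream", 2), ("D", "upstream", 1)])
print(dict(totals), most_voted(totals))
```

In the voting example, node C lies in both voted directions and therefore accumulates the largest total; a subsequent technique, such as the dependence tree search mentioned above, would then examine the components of that node.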

2.2.12 Proactive Correlation

The fact that a managed network generates a large volume of alarms in a fault situation must not always be seen as negative. It is known that the manual processing of this mass of data tends to become unfeasible as the number of high speed systems in the network increases. The most common techniques of alarm correlation work, in real time, directly on the flow of alarms originating from the plant, trying to eliminate most of them or, at least, to "hide" them from the operators and network managers, thus facilitating the identification of faults that have already occurred.

Through data mining and knowledge discovery techniques [Sasisekharan et al., 1996] [Hätönen et al., 1996], it is possible to find patterns that characterize the present behavior and the future behavior tendencies of the network. The technique consists of sweeping the available data, systematically and exhaustively, by applying correlation and learning techniques [Sasisekharan et al., 1994].

In this way, it is possible to identify potential problems before they materialize, which allows the proactive maintenance of the network [Sasisekharan et al., 1993a].

The approach consists of examining the behavior of the network elements over time, by considering behavior patterns that are common to elements of the same class. Although it has a strong empirical component (represented by the collected data), the approach also includes knowledge on the elements and on the network topology.
The computing may be divided into three steps:

a) The classification of the network elements and their behavior over time;
b) The correlation of the information of the whole network and the formulation of hypotheses;
c) The resolution and verification, to confirm the real cause of the problem and to solve it.

The necessity of human involvement in the fault identification process is emphasized by [Sasisekharan et al., 1996], due to the fact that the problem is not yet completely solved.

More recent proposals for the solution of the proactive correlation problem involve the use of Bayesian networks (cf. item 2.2.3) [Hood and Ji, 1997].

2.2.13 Distributed Correlation

With the growth of telecommunications networks, both in size and in complexity, it may be advisable to partition the management environment into a certain number of management domains, so that the desired quality requirements can be met both in the operation and in the maintenance of the system. An example of this organizational architecture may be found in [Nogueira and Meira, 1996].

The adoption of a distributed architecture for the network management system facilitates the implementation of distributed fault localization schemes and justifies the development of distributed algorithms for this localization.

[Katzela et al., 1995] present a model of a managed telecommunications network in which the distributed management approach is assumed. The network is partitioned into several static, disconnected and logically autonomous domains, each of them managed by a single management center. Each management center has a limited view of the state of the other domains. Nevertheless, managers of different domains communicate among themselves and exchange information on the state of their domains. As the alarms generated in consequence of a given fault may not be restricted to a single domain, the management centers have to collaborate in order to infer the real state of the system. It is assumed that the processes of information transfer between managers and the other parts of the TMN are not affected by faults.

In the presented model, the set of managed objects whose fault may cause a given alarm is defined as the alarm domain. Before starting the fault localization process, the management center must find the domain of each of the alarms received, which may be done through explicit localization [Bouloutas et al., 1994] (cf. item 2.2.10). A cluster of alarms is a set of alarms whose domains have a non-empty intersection. Each alarm cluster may have one or more possible causes.
The fault localization algorithm must find the "best" (that is, the most probable) among these possible causes. This could be done by assigning a fault probability to each managed object. The "best" cause for the alarm cluster might then be defined as the set of managed objects whose combined fault probability is maximum. Instead of assigning a fault probability to each managed object, [Katzela et al., 1995] associate with each of these objects an "information cost", defined as the negative logarithm of the object's fault probability. The "best", that is, the most probable, cause is then given by the set of managed objects whose sum of information costs is minimum (for independent faults, maximizing the product of the probabilities is equivalent to minimizing the sum of their negative logarithms).

As fault localization is an NP-complete problem, no polynomial algorithm that gives it an exact solution is known for the general case. However, given a cluster of received alarms (A) and the set of corresponding managed objects, it may be demonstrated that there is a polynomial algorithm which finds an exact solution if the maximum number of simultaneous faults is less than a parameter k. Therefore, it may be assumed that there is a centralized algorithm which finds the most probable faults in a set of managed objects, given a set of alarms and the information costs associated with each managed object [Katzela et al., 1996]. Let Q be the probability that there are more than k simultaneous faults in the system. In this case, Q is also the probability that the algorithm does not provide a solution to the problem.

With relation to the distribution of the correlation process, three approaches for fault localization are identified:

1. Centralized localization. Relies on the existence of a central manager, which has a global view of the network and whose jurisdiction includes the domains of all the other managers. The central manager directly solves any problem that affects more than one management domain.

2. Decentralized localization. In this case, the problems that affect more than one domain are solved through the collaboration between the central manager and the managers of the affected domains.

   Each domain manager is responsible for calculating the partial solutions to the problem, containing the possible causes for the alarms whose domains include objects from more than one management domain, and for sending them to the central manager, which has to find, among these partial solutions, the compatible ones. Two partial solutions are compatible if all the alarms received by all the domain managers are explained by the final solution.

   Finally, the central manager selects the compatible global solution that has the minimum information cost.

3. Distributed localization. This approach consists of trying to find the causes for the alarms without the intervention of a central manager.

   Let the telecommunications network be divided into two domains, 1 and 2. From the point of view of the manager of domain 1, for each alarm whose domain crosses the boundary it would be desirable to associate the probability that it is explained by domain 2. For this to happen, all the managed objects that belong to domain 2 and are related to the alarm are associated with a "proxy node".
   The fault of a proxy node indicates that one or more objects in domain 2 have failed.

   It may be demonstrated that the calculation of the fault probability of a proxy node is an NP-complete problem [Katzela et al., 1996]. The best that may be done is an estimate of this probability, which introduces an error that, in turn, prevents an optimum global solution from being guaranteed.

[Katzela et al., 1995] compare the fault identification algorithms for these three approaches, with respect to precision and time complexity. Complexity is a function of the number of managed objects associated with the alarm clusters, of the number of alarms that cross the boundaries between the domains and of the parameter k (the maximum number of simultaneous faults allowed). On the other hand, the precision of each approach depends on the error in the estimate of the fault probabilities of the managed objects corresponding to the received alarm cluster. It is demonstrated that the decentralized approach generally has a smaller complexity than the centralized approach and that it provides equal or better solutions as far as precision is concerned. The distributed approach has a smaller complexity than either of the other two, but it does not always guarantee an optimum solution. Nevertheless, it provides solutions that are almost as precise as the solutions provided by the other two approaches.

2.2.14 Artificial Neural Networks

An Artificial Neural Network (ANN) is a system made up of elements ("neurons") interconnected according to a model that tries to reproduce the neural network existing in the human brain. Conceptually, each neuron may be considered an autonomous processing unit, provided with local memory and with unidirectional channels for communication with other neurons. The functioning of an input channel in an ANN is inspired by the operation of a dendrite in biological neurons. Analogously, an output channel has an axon as its model. A neuron has only one axon, but it may have an arbitrary number of dendrites (in a biological neuron there are around ten thousand dendrites). The output "signal" of a neuron may be used as the input for an arbitrary number of neurons [Meech and Kumar, 1994] [Mead, 1989].

In its simplest form, the processing carried out in a neuron consists of computing the weighted sum of the signals present at its inputs and generating an output signal if the result of the sum surpasses a certain threshold.
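
The simplest form of neuron just described can be written in a few lines. The function name, the binary output convention (1 when a signal is generated, 0 otherwise) and the weights used in the example are assumptions made for this illustration only.

```python
def neuron_output(inputs, weights, threshold):
    """Threshold neuron: output 1 if the weighted sum of the input signals
    surpasses the threshold, and 0 otherwise."""
    if len(inputs) != len(weights):
        raise ValueError("one weight is required per input channel")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0


# Hypothetical weights and threshold: the neuron fires only when both
# input channels are active.
WEIGHTS = (0.6, 0.6)
THRESHOLD = 1.0

for signals in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print(signals, neuron_output(signals, WEIGHTS, THRESHOLD))
```

In this sketch the weights play the role of the values stored in the neuron's local memory; learning would consist of adjusting them, which is outside the scope of the example.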
In the most general case, the processing may include any type of mathematical operation on the input signals, also taking into consideration the values stored in the neuron's local memory [Meira Júnior, 1993].

One of the main motivations for the development of ANNs is the use of computers to deal with a class of problems that are easily solved by the human brain, but which cannot be treated effectively using conventional programming paradigms alone.

After the great expectations raised by the first research on the modeling of the nervous system in the 1940s were frustrated, interest in ANNs was significantly reduced at the end of the 1960s, when theoretical studies showed strong limitations of this paradigm [Michalski et al., 1983]. Interest in the area was reborn at the beginning of the 1980s, when the performance of computers started to allow practical implementations, which have a very high computing cost. The discovery of new applications for ANNs has also contributed to this "renaissance".

The distributed control and storage of data and parallelism are remarkable features of ANNs. Besides that, an ANN does not require previous knowledge of the mathematical relationship between inputs and outputs, which may be learned automatically during the system's normal operation. This makes them, in principle, a good alternative for applications (such as alarm correlation and fault diagnosis) where the relationships between faults and alarms are not always well defined or understood and where the available data is sometimes ambiguous or inconsistent [Covo et al., 1989].

An excellent introduction to neural networks may be found in [Meira Júnior, 1993].

2.2.15 Diagnosis by Comparison of Test Results

[Nussbaumer and Chutani, 1995] propose a general technique for fault diagnosis in communication networks, applicable both to centralized systems and to distributed systems. The technique is based on a "test comparison paradigm", which consists of having the network nodes carry out a series of known tasks. The diagnosis of defective nodes or links is made from the discrepancies observed in the test results; the precision of the diagnosis may be controlled through the number of times the task is repeated.

2.2.16 Other Approaches

Due to the importance the subject has gained over the past few years, several new approaches have recently been proposed for the problem of alarm correlation. An example of these new approaches is the technique presented in [Ohta et al., 1997], based on the "divide and conquer" paradigm, through which the alarms of interest are isolated by dynamically constructed filters, which simplifies the mechanism to be used in the correlation.

[Sabin et al., 1997] use the paradigm known as "constraint satisfaction problem" (CSP) for fault diagnosis in computer networks. According to this method, a system is considered to be in a fault state if some constraints cannot be satisfied; knowing which constraints have been violated allows the nature of the fault to be identified.

2.3 Comparison among the Available Approaches

There is no single solution that is the "best", in terms of precision and/or complexity, for solving a generic problem of alarm correlation.
Recent research indicates a tendency toward combining different approaches to solve the problem in complex networks [Frey and Lewis, 1997] [Kätker and Paterok, 1997].

The option to be adopted in a specific case must be chosen according to the virtues and limitations of the approaches applicable to that case. A survey of some of the main solutions, platforms and products for alarm correlation is found in section 2.4 of this thesis.

Comparing the methods and algorithms presented in section 2.2 is a difficult process and must take into account the following factors, among which there is, in general, a trade-off: (1) the ease of constructing a theoretical model of the object network, that is, the network that will generate the alarms to be correlated; (2) the implementation complexity; (3) the ease of adapting to changes in the object network; (4) performance; (5) precision.

Although it is possible, in principle, to quantify these factors for a specific object network, it is hard to imagine how this could be done for a generic object network. Therefore, a comparison among approaches for alarm correlation must take into account the characteristics of the object network or, at least, of the class of object networks: SDH, ATM, digital switching, etc.

In general, rule-based approaches are indicated for correlation in network elements, or in networks whose configuration is rarely altered; the high costs of implementation and of adaptation to changes in the object network make it difficult to apply these strategies in large telecommunications networks. Other approaches, such as the case-based ones, are less sensitive to changes in the object network, but they still lack a theoretical basis that would allow their use in large commercial networks (cf. item 2.2.8).

The nature of the application for which the correlation is intended (for example, reducing the amount of information to be analyzed by the operator; identifying the faults that originated the alarms; predicting the occurrence of future faults) determines the type of correlation to be performed (cf. item 2.1.5), which, in turn, must be taken into consideration in the choice of the method or algorithm to be adopted in a given situation.

Alarm compression, selective suppression, simple filtering, counting and specialization are examples of correlation types that may be easily implemented with conventional rule-based approaches (cf.
item 2.2.1).

Applications involving intelligent filtering, scaling or temporal relationships may also be implemented through the approach described in 2.2.1, or one of its variations. Nevertheless, as the complexity of the problem increases, exceptions appear, and these are not always suitably treated by these approaches, which lack an effective mechanism for implementing non-monotonic reasoning [Pearl, 1991]. Thus, at the development stage, each identified exception demands the reformulation of the rules already established and/or the creation of new rules, which increases the complexity of the solution and reduces performance. In addition, the possibility of unidentified exceptions affects the robustness of the system, which has no alternatives for dealing with these situations. Other approaches, such as Model-based Reasoning or the Blackboard architecture (cf. items 2.2.4 and 2.2.5), add structure to the "pure" rule-based systems, which facilitates the development of applications but does not make them substantially more attractive for implementing more complex correlation types, because performance tends to decrease.

The most interesting applications of alarm correlation, such as fault diagnosis (cf. item 2.1.4), generally involve generalization or clustering (cf. item 2.1.5). In these cases, the complexity of the problem makes it extremely difficult to obtain exact solutions [Katzela et al., 1996], making uncertainty an ever-present factor in the correlation process.

Among the several approaches to dealing with uncertainty in alarm correlation, fuzzy logic (cf. item 2.2.2), Bayesian networks (cf. item 2.2.3), case-based reasoning (cf. item 2.2.8), coding correlation (cf. item 2.2.9) and artificial neural networks (cf. item 2.2.14) are noteworthy. There is much controversy about the advantages and disadvantages of each of these alternatives. The defenders of fuzzy-logic-based approaches, for example, argue that they considerably simplify the development of applications and result in products that work and perform excellently; on the other hand, the lack of a solid mathematical basis to support them inhibits the adoption of this alternative in a larger number of applications.

Coding correlation is a rather interesting alternative in terms of performance and robustness, but it demands a great effort in modeling the object network, which makes it hardly recommendable for complex networks.

Methods based on Bayesian networks were first used in 1921, in the analysis of harvest results [Wright, 1921] apud [Heckerman et al., 1995b], and rest on a solid mathematical basis. These methods are steadily gaining acceptance in the computer science community as a suitable option for solving problems involving uncertainty [Heckerman et al., 1995b].
These factors contributed to the adoption of Bayesian networks in the model proposed in chapter 4.

2.4 Products, Solutions and Platforms for Alarm Correlation

This section reviews the main products, solutions and platforms for alarm correlation found in the literature.

A large number of artificial intelligence applications and, more specifically, of expert systems have already been developed to help telecommunications network management. Some of these applications were gathered in [Liebowitz, 1988], which contains classic cases such as ACE (Automated Cable Expertise) [Wright et al., 1988] and SMART (Switching Maintenance Analysis and Repair Tool) [Slawsky and Sassa, 1988]. Several other artificial intelligence applications in telecommunications network management may be found in [Goyal and Worrest, 1988]. A more recent survey may be found in [Hedberg, 1996].

As far as rule-based systems are specifically concerned, [Cronk et al., 1988] compiled the main systems, already available or in development, intended for monitoring, control, diagnosis and repair of telecommunications networks.

Of the existing products, however, only a few are capable of correlating, in real time, the flow of alarms produced by modern telecommunications networks, which can generate and route a large amount of information to the network management centers during abnormal situations.

The survey presented next is not meant to be exhaustive, but it is meant to be representative of the application of the techniques, methods and algorithms discussed in section 2.2.

2.4.1 Event Correlation Services (HP)

HP developed, as part of its OpenView platform, a product called ECS ("Event Correlation Services") [Hewlett Packard, 1995b], which is intended to process hundreds of events per second, in order to deal with the phenomenon known as an "event storm", described in item 2.2.6.

ECS is a rule-based system whose rules are grouped into processing elements called "nodes", each responsible for a certain function in the correlation process (input, output, filtering, delay, counting, combination, modification, etc.). In the ECS system, combination consists of forming a single output flow from two event input flows, whereas modification is a process through which the value of any attribute of an input event is modified. In the latter case, the new attribute values are copied or calculated from "publicly" available data.

The nodes are interconnected in "circuits", through which the alarm "flows" generated by the managed plant pass.
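The idea of nodes connected into a circuit can be illustrated with a minimal Python sketch. It is not the ECS API: the `Event` type, the node functions and the alarm names are hypothetical, and only two of the node functions mentioned above (filtering and counting) are shown.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Event:
    time: float   # seconds since the start of observation
    source: str   # network element that reported the event
    kind: str     # event type (hypothetical names)


def filter_node(events: Iterable[Event],
                keep: Callable[[Event], bool]) -> Iterator[Event]:
    """Filtering node: passes on only the events accepted by `keep`."""
    return (e for e in events if keep(e))


def count_node(events: Iterable[Event], threshold: int,
               window: float) -> Iterator[Event]:
    """Counting node: emits one summary event each time `threshold` events
    from the same source arrive within `window` seconds."""
    recent = {}
    for e in events:
        times = [t for t in recent.get(e.source, []) if e.time - t < window]
        times.append(e.time)
        if len(times) >= threshold:
            yield Event(e.time, e.source, e.kind + "_storm")
            times = []  # start counting again after reporting
        recent[e.source] = times


def circuit(events: Iterable[Event]) -> Iterator[Event]:
    """A two-node circuit: filtering followed by counting."""
    link_down = filter_node(events, lambda e: e.kind == "link_down")
    return count_node(link_down, threshold=3, window=10.0)


flow = [
    Event(0.0, "ne1", "link_down"),
    Event(1.0, "ne1", "power"),
    Event(2.0, "ne1", "link_down"),
    Event(3.0, "ne2", "link_down"),
    Event(4.0, "ne1", "link_down"),
]
print(list(circuit(flow)))
```

Because each node consumes one flow and produces another, a function such as `circuit` can itself be reused as a node inside a larger circuit.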
A composite node may be defined from a circuit segment and added to the node library, to provide a certain user-defined function [Hewlett Packard, 1996b]. Through this metaphor, part of the complexity inherent to an alarm correlation system may be hidden from the user. The ECS system also has a graphical user interface that allows correlation rules to be developed by selecting, connecting and configuring the library "nodes".

The ECS system has a distributed architecture, which allows event correlation to be distributed and helps to reduce network management information traffic [Hewlett Packard, 1996a].

The ECS architecture provides two optional databases, called "Data Store" and "Fact Store", which are useful when, for the sake of reusability and reliability of the correlation circuits, it is desirable to keep certain information outside the general correlation rules. Whereas "Data Store" contains a set of name-value pairs, such as "TableSize = 20", "Fact Store" stores information in the entity1-relationship-entity2 form, such as "equipment15 is contained in cabinet5". These databases may be used to build a model of the managed network, allowing the relationships among events to be evaluated dynamically. Alternatively, the model could be built using only the correlation circuit [Hewlett Packard, 1996b].

The ECS system supports primitive events of the CMIP [ITU-T, 1991b] and SNMP [Case et al., 1990] types. As the internal processing does not depend on the format of the primitive events, the system may receive SNMP input events and, as a result of the correlation, issue CMIP notifications, or vice versa.

According to HP, the ECS system has been available since June 1996, at an initial price, at that date, of US$8,000 [Hewlett Packard, 1996a].

2.4.2 An Implementation of Intelligent Filtering (Philips)

[Möller et al., 1995] present an implementation of intelligent filtering that acts on an SDH network management system and uses the rule-based correlation paradigm.

A filter is divided into modules, each responsible for a network segment and for a certain function.
A module is made up of three parts: information on the topology, a description of the dependencies among events, and rules for the filtering process. The implementation includes three types of modules, with the following functions: compression, selective suppression and clustering.

For a network with 13 network elements (NEs) and 13 lines, in a simulation carried out by [Möller et al., 1995], the filter processed 1000 notifications per second for one minute, with a maximum delay of 5 s. This good performance may be partially explained by the fact that the compiler used in the experiment instantiates all the rules that contain variables with all the possible combinations of the values of these variables. This results in high memory consumption (1.5 MByte in the case mentioned).

2.4.3 NerveCenter Pro (Seagate)

The NerveCenter Pro system, by Seagate [Hewlett Packard, 1995a] [Seagate, 1996], uses "behavior models" to identify critical problems, to correlate alarms and to execute actions on the network. The information received by the system may include both SNMP messages ("traps" and "polls") and messages issued by the HP OpenView OperationsCenter or Seagate LANAlert systems.

The information contained in the received messages is continuously monitored to detect abnormal situations, as defined beforehand. If one of these situations occurs, the corresponding behavior profile moves to a new state, executing the actions and issuing the corresponding notifications. The behavior profile is updated with each new message received, until the situation is resolved. At that moment, the behavior profile returns to its initial state, erasing all the alarms associated with it.

2.4.4 Sinergia

Sinergia is a rule-based expert system used in fault diagnosis for the Italian telecommunications network. By correlating, in real time, alarms from the transmission and switching networks, the system reduces by one order of magnitude the amount of information presented to the operators [Brugnoni et al., 1993].

The goal of Sinergia is to locate faults in the digital paths of the transmission network, which consists of lines, line terminations, multiplexers and demultiplexers, "cross connects" and other devices. The system is able to identify intermittent faults, as well as multiple simultaneous faults, but it cannot identify internal faults in switches or in "cross connects".

Sinergia uses two strategies to deal with the complexity of the fault diagnosis problem. The first consists of ignoring alarms of the AIS (Alarm Indication Signal) type [ITU-T, 1995a], which are considered to have little value for the diagnosis and to enlarge the area of influence of the faults.
The second strategy consists of emphasizing the temporal locality of the alarms in relation to the faults that originated them.

The operation of the system is based on the generate-and-test paradigm, which establishes two steps in the fault identification process. The first step, based on a set of rules, instantiates a set of fault hypotheses. The second step consists of a classic heuristic search [Nilsson, 1980], through which the best solution among the hypotheses presented in the first step is chosen.

A more elaborate version of Sinergia is presented in [Manione and Paschetta, 1994]. In this new version, the system is able to deal with some degree of inconsistency in its topology database, showing, in this case, a graceful degradation in its performance.

To validate Sinergia, a plesiochronous transmission network simulator called Sprinter was used. It generates test patterns for the alarm correlation system [Manione and Montanari, 1995].

2.4.5 TASA

The TASA system (Telecommunication Alarm Sequence Analyzer) works on the flow of alarms generated by a telecommunications system, seeking to find regularities, that is, alarm sequences that are frequently repeated, and then to propose rules that may be incorporated into a correlation system [Hätönen et al., 1996].

The complete process of knowledge discovery may be divided into five distinct tasks: the choice of the pattern discovery methods, data collection, pattern discovery, presentation and selection of the discovered knowledge, and use of the knowledge (for example, in an expert system). TASA supports only the pattern discovery phase and the phase of presentation and selection of the discovered knowledge.

Regularities are discovered automatically, from a few parameters supplied by the user. The phase of presentation and selection of the discovered knowledge, however, cannot dispense with the human specialist, who is considered vital at this stage [Hätönen et al., 1996].

The proposed rules are similar to this: "if alarms of the types link alarm and link fault occur within a 5 second interval, then an alarm of the type high fault rate occurs within 60 seconds, with a 0.7 probability."

Through a hypertext interface, the TASA system presents huge sets of rule suggestions (corresponding to the regularities discovered by the system), of which different views may be offered to the operator. The rule selection is made interactively, using filtering, ordering and clustering.

2.4.6 InCharge (SMARTS)

InCharge is a completely automatic system for fault diagnosis in telecommunications or data networks [System Management ARTS, 1996a]. The system was developed by the North American company System Management ARTS Inc. (SMARTS) and uses a correlation engine based on the coding approach, patented by SMARTS [Kliger et al., 1995] [System Management ARTS, 1996b] [Riordan, 1996].
Besides a correlation engine, each InCharge system maintains, at runtime, a repository of modeled components.

The InCharge architecture is distributed, with two versions of runtime systems: Mid Level Managers (MLMs) and Enterprise Managers (EMs). The main difference between the two versions is that the EMs may communicate with external applications, which is not possible for the MLMs. A typical application of InCharge consists of an EM at a higher level and one or more MLM layers, making up a hierarchy of correlation systems.

Besides the runtime systems, InCharge contains an environment for the development of diagnosis models. For this environment a formal specification language was developed, called Managed Object Definition Language (MODEL) [Ohsie et al., 1997], which is an extension of CORBA IDL (Common Object Request Broker Architecture Interface Definition Language) [OMG and X/Open, 1995], with new syntactic constructs to specify semantic properties that cannot be specified in IDL, such as relationships, events, problems and causal propagation. The development environment also contains a compiler from MODEL to C++ code, as well as debugging tools and a set of libraries in MODEL.

The problem identification process is divided into two stages: in the first stage, which takes place in the development environment, the codes are generated; in the second stage, at runtime, the problems are decoded.

InCharge is used by Motorola in the Iridium project, a telecommunications system based on a constellation of low-orbit satellites. According to the management of this project [Duffy, 1996], an attempt was initially made to use the rule-based system NerveCenter, by Seagate, but, after three years with this system, Motorola chose the SMARTS technology, due to bugs and other deficiencies of NerveCenter.

The InCharge system was priced, in 1996, at about US$7,000 per server [Duffy, 1996].

2.4.7 NetFACT (IBM)

Through alarm correlation, the IBM NetFACT system (Network Fault and Alarm Correlator and Tester) [Houck et al., 1995] performs diagnosis and tracks fault propagation in telecommunications networks. The system uses the voting correlation technique (cf. item 2.2.11), combined with dependency tree search.

NetFACT operates in IBM's NetView network management environment.
Its operation is based on the normalization of all alarms received and on the existence of a network model, containing information on the configuration elements, as well as on the relationships among them (connectivity, dependencies, data flow).

Each normalized alarm indicates, through an integer number of votes, the direction in which the problem that caused it may lie, in relation to the node that reports the alarm. A normalized alarm also contains a "time stamp", the identification of the object to which the alarm refers and information on the impact of the alarm on the behavior of that object.

The dependency tree search takes into account both direct evidence (that is, alarms) and indirect evidence (for example, how many users of a lower-level component are having difficulties). Indirect evidence has to be used because failing components do not always generate alarms. When the fault of a given component may have been caused by faults in several distinct lower-level components, for which only indirect evidence exists, heuristics are used to choose the component with the highest probability of having caused the fault. Diagnostic tests, if available, may also be used to help resolve such ambiguities.

The implementation was made on an MVS/390 system and used, for configuration modeling, NetView's RODM ("Resource Object Data Manager"), a high-performance object-oriented data manager. Since the C++ language was not available in the MVS environment, it was not used in the development of the diagnosis application, which was written in ANSI C [Houck et al., 1995].

For systems such as NetFACT to become practical and commercial, suitable alarm reporting standards must be developed, so that the "translation" of each individual alarm into the normalized form becomes unnecessary. Because it requires knowledge of the semantics of each individual alarm, this translation is a difficult and expensive procedure, which makes the implementation of commercial products unfeasible, except in very limited contexts.

2.4.8 IMPACT (GTE)

IMPACT (Intelligent Management Platforms for Alarm Correlation Tasks) [Jakobson and Weissman, 1993] is a system developed by GTE, using the model-based approach (cf. item 2.2.4). The implementation is supported by an expert system called ART-IM, has a rule-based correlation engine and uses a forward-chaining algorithm.

The conditions for triggering correlation rules take into account the temporal relationships among the events. [Jakobson and Weissman, 1995] explore the temporal aspects of the adopted model, which include concepts such as correlation window and lifespan.
A correlation window may vary from a few seconds (for very fast processes, such as alarm streams in high-speed systems) to a few days, as in the case of a trend analysis from event log data. The same kind of variation may occur with the lifespan of a correlation.

Besides a model for alarm correlation, an end-user-oriented support software system was also created, that is, one that allows the alarm correlation to be specified by the domain specialists themselves.

As it is an MBR system, the proposed solution contains a structural component and a behavioral component. The first includes a model of the network configuration and a network element class hierarchy. The behavioral component, in turn, is composed of a message class hierarchy, a correlation class hierarchy and several correlation rules. The correlations are made in a fixed time interval, which may be absolute or relative. In the latter case, the time interval is implemented as a sliding time window, in which the correlation is executed continuously.

IMPACT works together with the NetAlert real-time network management system and was used in GTE's business units for the development of two alarm correlation applications: CORAL, for the cellular network, and AMES, for the conventional telecommunications network.

The system may be divided into two main parts: an application development environment and a runtime environment. The application development environment supports knowledge acquisition, editing and presentation; it provides the tools used by the operations personnel to create and maintain the knowledge base. The runtime environment monitors the network events in real time, parses the incoming messages, executes alarm correlation procedures and provides interfaces for the network operations personnel. This environment is composed of four main modules: the graphical user interface (GUI), the message and command processor, the action processor and the alarm correlation engine.

The IMPACT system has a network knowledge base, created by the application development environment, which contains the structural configuration of the network and the dynamic alarm correlation models. The knowledge base contains the correlation classes and rules, the classes and instances of managed objects and the message classes. It also stores the network configuration models, graphical objects for visualization, correlation icons and procedural scripts to be executed by the action processor.

The system runs on a SUN Sparc 10 workstation and correlates from 12 to 15 alarms per second.
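The sliding time window mentioned earlier in this section for relative correlation intervals can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not IMPACT's implementation: the class name, the alarm type names and the rule "all required alarm types are present in the window" are hypothetical.

```python
from collections import deque


class SlidingWindowCorrelator:
    """Keeps the alarms of the last `window` seconds and reports a correlation
    whenever every alarm type in `required` is present in the window."""

    def __init__(self, required, window):
        self.required = set(required)
        self.window = window
        self.alarms = deque()  # (timestamp, alarm_type), oldest first

    def receive(self, timestamp, alarm_type):
        self.alarms.append((timestamp, alarm_type))
        # Slide the window: drop alarms older than `window` seconds.
        while self.alarms and timestamp - self.alarms[0][0] > self.window:
            self.alarms.popleft()
        present = {kind for _, kind in self.alarms}
        return self.required <= present


correlator = SlidingWindowCorrelator(
    {"loss_of_signal", "high_error_rate"}, window=5.0)
results = [
    correlator.receive(0.0, "loss_of_signal"),
    correlator.receive(7.0, "high_error_rate"),  # first alarm already expired
    correlator.receive(9.0, "loss_of_signal"),   # both types within 5 s
]
print(results)
```

The correlation is re-evaluated on every incoming alarm, which corresponds to the continuous execution described for the relative interval; an absolute interval would instead use fixed start and end times.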
The correlation model presented is deterministic, but future developments may enable IMPACT to make inexact (fuzzy) correlations, in order to meet specific needs [Jakobson and Weissman, 1993].

2.4.9 GMS

[Kehl and Hopfmüller, 1993] discuss the application of model-based reasoning in the telecommunications network management area, taking as a basis the work carried out in the AIM (Advanced Information Processing (AIP) Application to IBCN (Integrated Broadband Communication Network) Maintenance) and GEMA (Generic Maintenance Application) projects of the European RACE program. In the AIM project, a model-based fault management system, called GMS (Generic Maintenance System), was developed for the corrective on-line maintenance of hardware faults. The GMS may be applied to the maintenance of any telecommunications system, provided that a specific knowledge database (model) is used for each system.

The GMS was used, in prototype form, in two networks: the BERKOM network ("Berliner Kommunikationssystem"), which is a test B-ISDN network; and a metropolitan area network (MAN) model built from knowledge of a test MAN in operation in Stuttgart.

To explain the symptoms of a fault, GMS requires a functional model of the telecommunications network. This model consists of structural information (the structural model) and of behavioral knowledge.

The structural model is based on the unit-port concept, according to which a unit is an object made up of internal attributes and ports. The internal attributes define the state of the unit, while the ports are used to connect units. Each unit is an instance of a unit class. A unit used to model a function of the telecommunications system is called a functional entity (FE). The topology of the modeled network is reflected in the way the FEs are interconnected, which determines the functional dependencies among FEs. There must be a mapping between the FEs and the entities of the physical model.

The structural model is organized in functional layers, according to the principles of the OSI model. Thus, FEs situated in adjacent layers are connected through connections between service-providing ports and service-using ports. This characteristic facilitates the modeling of the error propagation mechanism.

The behavioral knowledge is a description of how a modeled network behaves, notably with respect to faults. This knowledge encompasses three types of information:

• Internal states. Seen as root causes for a given behavior. Statements on the internal states are represented by fault hypotheses.
• Port states. They specify the interaction of a component with its neighbors.
• Behavior descriptions. By using if-then rules, they specify the states of the output ports as a function of the states of the input ports and of the internal states.

The GMS inference engine, which can handle multiple simultaneous faults, is composed of a suggestion interpreting module and a model interpreting module. The suggestion interpreting module operates in backward mode (cf. item 2.2.1), seeking to suggest causes for the symptoms presented by the network. Starting from a hypothesis suggested by the suggestion interpreting module, the model interpreting module uses the rules in forward mode, running a simulation that aims to prove the suggested hypothesis.
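The interplay between the two modules can be sketched as follows. This is a simplified illustration, not the GMS implementation: the four functional entities, the single behavior rule and the restriction to single-fault hypotheses are assumptions made for the example (GMS itself is described as handling multiple simultaneous faults).

```python
# Hypothetical layered model: each functional entity (FE) lists the FEs
# whose service it uses (its input ports).
USES = {
    "line": [],
    "mux": ["line"],
    "switch_a": ["mux"],
    "switch_b": ["mux"],
}


def simulate(faulty):
    """Forward mode: apply the if-then behavior rule
    'the output of an FE is bad if the FE is faulty or any input is bad'
    until no further FE changes state."""
    bad = set()
    changed = True
    while changed:
        changed = False
        for fe, inputs in USES.items():
            if fe not in bad and (fe == faulty or any(i in bad for i in inputs)):
                bad.add(fe)
                changed = True
    return bad


def suggest(symptoms):
    """Backward mode: every FE on which a symptomatic FE depends,
    including the FE itself, is a candidate fault hypothesis."""
    candidates = set()
    stack = list(symptoms)
    while stack:
        fe = stack.pop()
        if fe not in candidates:
            candidates.add(fe)
            stack.extend(USES[fe])
    return candidates


def diagnose(symptoms):
    """Keep only the hypotheses whose simulation reproduces the symptoms."""
    symptoms = set(symptoms)
    return sorted(h for h in suggest(symptoms) if simulate(h) == symptoms)


print(diagnose({"mux", "switch_a", "switch_b"}))
```

In this example a fault in "line" is rejected because its simulation would also make "line" itself symptomatic, which was not observed.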

2.4.10 ECXpert (AT&T)

[Nygate, 1995] describes a product called ECXpert, whose function is to help network management center operators analyze alarms and decide on corrective actions. ECXpert is part of a product family called TNM (Total Network Management) [IF Computer, 1996], a network management platform developed by AT&T and incorporated into the network management centers of several companies, including NYNEX, PacBell, Bell South, SNET and Bell Atlantic.

ECXpert adopts a data structure called a correlation tree skeleton to represent a set of alarms that have a cause-and-effect relationship among themselves. In these trees, a child/parent link is equivalent to a cause-effect relationship among alarms. Equivalent messages are represented in the same node. The skeletons are used as a basis for building correlation tree instances from the alarms received from the plant in real time. Therefore, the main role of ECXpert is to receive alarms and to create, dynamically, correlation trees based on correlation tree skeletons. To facilitate the configuration of the system in the field, correlation groups have been defined, which keep a one-to-one correspondence with the correlation tree skeletons. Each correlation group may be seen as a model of a particular network problem.

Like other network management systems [Nogueira and Meira, 1996], TNM also allows users to specify the view they intend to have of the network (for example, according to equipment type or to region). As usual, this filtering must be done with prudence because, when the operator's view is very restrictive, it is generally difficult to correlate alarms generated by problems that affect a large part of the network.
On the other hand, if no restrictions are made, the volume of alarms presented to the operator will be so large that it will not even be possible to take notice of these alarms in due time.

The ECXpert package includes a description language used to specify correlation groups, including information such as: when a new alarm belongs to a correlation group; when a new alarm is correlated with a previous alarm that belonged to this group; cause-and-effect relationships among alarms; actions to take; and the time window inside which alarms may be correlated.

ECXpert consists of four main modules: a correlation group compiler, a user interface, a test correlation process and a correlation process.

The correlation group compiler, written in Prolog, converts the correlation rules defined by the user into Prolog clauses. A typical correlation group contains around 100 lines, which yield, after compilation, around 250 lines of Prolog code.

The user interface presents, for a given chosen alarm, a window showing all the correlation trees to which the alarm belongs, in the form of lists of active alarms, where precedence information is associated with each alarm, which allows the user to reconstruct the correlation tree.

The test correlation process allows the system administrators to verify the semantic correctness of a correlation group. This is done by sending a series of alarms to the test process, one by one. The test process reports why certain alarms belong to a given correlation group and how they correlate with older alarms in the group. This procedure must be carried out to validate each correlation group before it is actually installed.

The correlation process is composed of a C++ object for collecting alarms from the TNM system, a Prolog object that executes the correlation algorithm and a C++ object that manipulates the database containing the correlation trees. The rule-based correlation algorithm (cf. item 2.2.1) may be seen as a forward-chaining engine that receives each alarm, checks which rules apply to it and activates these rules to update the correlation trees. Each alarm received is added to all the relevant correlation trees.

With ten active correlation groups, the version of ECXpert presented can correlate around 1,000 alarms per hour, running on a Tandem FT computer.

Thanks to the reduction in fault identification time and to the quick restoration of the affected services, ECXpert has brought a revenue increase and a reduction in labor costs, which resulted in hundreds of thousands of dollars in annual benefits.

2.4.11 SCOUT (AT&T)

SCOUT is a system developed by AT&T's Bell Laboratories to automate the diagnosis of transmission problems in telecommunications networks [Sasisekharan et al., 1993a] [Sasisekharan et al., 1994] [Sasisekharan et al., 1996].
One of the objectives of the product is to detect and predict chronic problems in the transmission systems, through the use of machine learning techniques, enabling proactive maintenance of these systems (cf. item 2.2.12).

The system does not acquire the plant's data itself and must therefore communicate with other systems to obtain this information. The processing is done in three stages: in the first stage, the information collected in the plant is used to classify the circuits into different categories, according to the types of errors observed. Next, the alarms are correlated, producing several fault hypotheses. Finally, from the fault hypotheses, the cause of the problem is determined and the problem is solved. This last stage is carried out with the participation of SCOUT's users.

The architecture of SCOUT is based on the blackboard paradigm (cf. item 2.2.5) [Sasisekharan et al., 1993b].

2.4.12 NOAA

NOAA (Network Operations Analyzer and Assistant) is a system used in traffic management in the telephone network of Pacific Bell in the State of California (USA) [AGL Systems Inc., 1996] [Goodman et al., 1995]. The system runs on a Sun workstation and is interconnected, through an Ethernet local network, to the AT&T NTMOS (NetMinder/NTM OS) network management system.

NOAA is a rule-based system (cf. item 2.2.1), which also uses neural network techniques (cf. item 2.2.14). The version presented has 120 rules and relies on an algorithm called ITRULE (Information Theoretic Rule Induction) for the automatic acquisition of rules from network management databases [Goodman and Latin, 1991]. Resources have also been developed to deal with situations to which no rule applies. In these cases, the operator is invited to enter new rules or to mark that situation as a "special case".

Four software modules compose NOAA, with the functions of collecting data, executing traffic calculations, proposing alternative actions and implementing a graphical user interface, respectively [Goodman et al., 1993].

The data offered to NOAA consist of counts of some types of events for each group of trunks, during a five-minute period. Examples of these events are: trunk seizure, overflow (unsuccessful trunk seizure attempt) and reports on trunk usage at a given instant. The following parameters are calculated from the data collected:

ACH  Call attempts per circuit per hour
CCH  Connections per circuit per hour
OFL  Percentage of attempts that overflowed
USG  Percentage of used trunks, on average
HT   Call holding time

Through the use of neural networks, NOAA foresees the traffic capacity that will be available on a route at a future instant, from several previous readings of this route's occupancy [Goodman et al., 1993].
Points with potential problems are also indicated, and expansive controls are automatically executed to re-route traffic to alternative routes. Restrictive commands (for example, selective blocking of traffic in its origin exchange) are executed by the operator, with the system's aid.

The work of interfacing NOAA with the remainder of Pacific Bell's network management system and of constructing the infrastructure for the expert system lasted three years. Additional developments still had to be made to enable the system to carry out event correlation and fault diagnosis.
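
The five traffic parameters listed for NOAA can be illustrated with a minimal sketch. The source does not give NOAA's formulas, so everything below is an assumption chosen to match the parameter descriptions: the `TrunkGroupCounts` record, its field names, the scaling of five-minute counts to hourly rates, and the estimate of holding time from the mean number of busy trunks.

```python
from dataclasses import dataclass

PERIOD_SECONDS = 5 * 60                      # five-minute collection period
PERIODS_PER_HOUR = 3600 / PERIOD_SECONDS     # 12 periods per hour


@dataclass
class TrunkGroupCounts:
    """Hypothetical five-minute counts for one trunk group."""
    circuits: int        # number of trunks in the group
    attempts: int        # seizure attempts, successful or not
    overflows: int       # unsuccessful seizure attempts
    connections: int     # completed connections
    busy_samples: list   # busy-trunk counts sampled during the period


def traffic_parameters(c: TrunkGroupCounts) -> dict:
    """Derive ACH, CCH, OFL, USG and HT under the assumptions stated above."""
    if c.circuits <= 0 or not c.busy_samples:
        raise ValueError("need at least one circuit and one usage sample")
    mean_busy = sum(c.busy_samples) / len(c.busy_samples)
    return {
        # attempts and connections per circuit, scaled to one hour
        "ACH": c.attempts / c.circuits * PERIODS_PER_HOUR,
        "CCH": c.connections / c.circuits * PERIODS_PER_HOUR,
        # share of attempts that found no free trunk, in percent
        "OFL": 100.0 * c.overflows / c.attempts if c.attempts else 0.0,
        # average share of trunks in use, in percent
        "USG": 100.0 * mean_busy / c.circuits,
        # assumed estimate: trunk-seconds carried divided by connections
        "HT": mean_busy * PERIOD_SECONDS / c.connections if c.connections else 0.0,
    }


if __name__ == "__main__":
    sample = TrunkGroupCounts(circuits=10, attempts=50, overflows=5,
                              connections=40, busy_samples=[4, 5, 6])
    print(traffic_parameters(sample))
```

A real system would probably smooth these values over several periods before feeding them to a predictor; the sketch only shows how counts per trunk group could map onto the five named parameters.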

2.4.13 CRITTER

CRITTER [Lewis, 1993] is a trouble ticket system which uses case-based reasoning (CBR, cf. item 2.2.8) for the diagnosis of network faults. Here, each closed trouble ticket constitutes one case, and a trouble ticket database is a case library. The CBR component of the CRITTER system provides mechanisms to retrieve a useful ticket, to adapt it (if this is necessary to generate a solution recommendation for an outstanding problem) and to add the new ticket to the case library.

To retrieve a useful ticket, a set of "determinators" is used, containing information on the relationship between network problem classes and a given attribute set registered in the ticket. A determinator points to a set of relevant attributes to be observed should a certain problem in the network occur.

It may happen that an assessment appearing on a retrieved ticket is directly applicable to the problem in question. In the most general case, however, the system must adapt the retrieved ticket in order to solve the problem.

The architecture of the CRITTER system consists of five modules:

• Input module, for the acquisition of information on the problems that are occurring in the network. This entry may be manual or automatic.
• Ticket retrieval module. It uses determinators to retrieve, from the library, a group of tickets that are similar to an outstanding ticket.
• Adaptation module. It operates on the retrieved ticket that is most similar to the outstanding ticket. If the two tickets match perfectly as far as all the relevant fields are concerned, the solution on the retrieved ticket is directly applied to the outstanding ticket. Otherwise, this module executes the necessary adaptation.
• Proposition module. It presents the user with the potential solutions found by the previous module and allows the user to inspect them and manually make the necessary adaptations.
• Process module. It updates the ticket library.

2.4.14 FIXIT

FIXIT (Fault Information Extraction and Investigation Tool) [Weiner et al., 1995] is a case-based architecture (cf. item 2.2.8) in which the experience acquired in fault management is coded and made available to the operation personnel. FIXIT was experimentally implemented as a decision support system for the controllers of a NASA satellite system. The implementation was written in C++, running on a UNIX workstation and using an Oracle database for case storage.
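
The retrieve, adapt and store cycle described for CRITTER can be sketched as follows. This is a minimal illustration and not CRITTER's implementation: the `Ticket` fields, the `DETERMINATORS` table, the similarity measure (fraction of relevant attributes that agree) and the adaptation step (which merely flags the retrieved solution for review) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    problem_class: str
    attributes: dict
    solution: str = ""


# Hypothetical determinators: problem class -> attributes worth comparing.
DETERMINATORS = {
    "link_down": ["device_type", "interface", "error_code"],
}


def similarity(a: Ticket, b: Ticket, relevant: list) -> float:
    """Fraction of the relevant attributes on which two tickets agree."""
    if not relevant:
        return 0.0
    hits = sum(1 for key in relevant
               if a.attributes.get(key) == b.attributes.get(key))
    return hits / len(relevant)


def retrieve(library: list, outstanding: Ticket):
    """Return (most similar closed ticket, score), or (None, 0.0)."""
    relevant = DETERMINATORS.get(outstanding.problem_class, [])
    best, best_score = None, 0.0
    for ticket in library:
        if ticket.problem_class != outstanding.problem_class:
            continue
        score = similarity(ticket, outstanding, relevant)
        if score > best_score:
            best, best_score = ticket, score
    return best, best_score


def propose(library: list, outstanding: Ticket):
    """Apply a perfectly matching solution directly, otherwise flag it."""
    best, score = retrieve(library, outstanding)
    if best is None:
        return None
    if score == 1.0:
        return best.solution
    return best.solution + " (adapt: review differing fields)"


def close_ticket(library: list, ticket: Ticket, solution: str) -> None:
    """Store the solved ticket so that it becomes a case for later retrieval."""
    ticket.solution = solution
    library.append(ticket)
```

In this sketch the determinators only narrow the comparison to the attributes relevant to a problem class; a fuller adaptation module would rewrite the retrieved solution rather than annotate it.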

2.4.15 OPA

Gensym Corporation developed a product called "Operations Assistant" (OPA), which works together with another software product of the company, called G2 [Gensym Corp., 1995]. According to its supplier, OPA is a rule-based system (cf. item 2.2.1) which can analyze, filter, correlate and prioritize alarms, aiming at fault diagnosis and the recommendation of corrective actions to the operators. It can also recognize patterns, which allows proactive network maintenance.

OPA may be interfaced with network management platforms such as HP OpenView and IBM NetView. AT&T has developed an integrated network management system using OpenView together with the G2 and OPA software.

2.5 Final Considerations

Given the importance of the subject, much still has to be developed in the alarm correlation area so that the needs of a telecommunications network may be met.

The study carried out in this chapter shows that several solutions exist for alarm correlation in specific segments of the network (sub-networks or network elements), but it also shows a great lack of proposals aiming at correlating alarms across a telecommunications network as a whole. This could be explained by the difficulty in understanding the functional relationship among the sub-networks that comprise a modern telecommunications network. This understanding is an essential condition for a general scheme of alarm correlation to be proposed. The next chapter aims to facilitate this understanding.

Chapter 3

A General Model of Telecommunications Networks for Fault Management Applications

The complexity and heterogeneity of telecommunications networks make it extremely hard to visualize their architecture or to understand their functioning as a whole without using some kind of model. A model provides an abstract view of the system, through which it is possible to hide all the aspects of this system which are irrelevant to a given objective. Thus, it is generally easier to propose a solution to a problem by working on a system model than by working directly on the real system.

This chapter proposes a general model for telecommunications networks, through which they are seen as sub-network sets (or sets of sub-network classes) partitioned in layers, according to functional affinity criteria. Beginning with the definition of some rules for the utilization of communication services among sub-networks, a telecommunications network is then modeled as a directed acyclic graph, where each node represents a sub-network or a class of sub-networks and an edge indicates a functional dependence between the interconnected nodes. The proposed model may be taken as a basis for the construction of a model for the correlation of the alarms generated in a telecommunications network, as shown in chapter 4.

This chapter is structured in three sections. In section 3.1, a survey is made of the most significant works found in the literature on telecommunications network modeling, which are classified according to the objectives and scope of the proposed models. In section 3.2, a new model for telecommunications networks is proposed and its main characteristics and limitations are discussed. Section 3.3 presents some final considerations on the model presented.
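
As a minimal sketch of the kind of structure this chapter works with, the code below represents a few network classes as nodes and functional dependences as directed edges, and checks that the resulting graph is acyclic. The particular nodes and dependences are illustrative assumptions, not the model defined in section 3.2, and the edge direction (from a network to the network whose services it uses) is likewise an assumption.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical example: node -> set of nodes it functionally depends on.
DEPENDS_ON = {
    "cellular access": {"SDH transport"},
    "ATM switching": {"SDH transport"},
    "SDH transport": {"optical physical"},
    "optical physical": set(),
}


def layering_order(depends_on: dict) -> list:
    """Order the nodes so that each one follows everything it depends on.

    Raises ValueError if the dependences contain a cycle, in which case
    they cannot be drawn as a directed acyclic graph.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # graphlib reports the offending cycle as the second element of args
        raise ValueError(f"dependences are not acyclic: {exc.args[1]}") from exc


if __name__ == "__main__":
    print(layering_order(DEPENDS_ON))
```

A topological order of this kind is one way to see why an acyclic dependence graph lends itself to a layered reading: every node can be placed above all the nodes it depends on.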

3.1 Telecommunications Networks Modeling

In the modeling of telecommunications sub-networks or networks, the approaches found in the literature may be divided into four distinct classes, depending on the objective and the scope of the models:

1. Subsystem Management Information Models. Their object of study is a specific part of the telecommunications network and they focus exclusively on the aspects of sub-network management. They are generally object oriented and follow the standards defined in ITU-T's M.3xxx series Recommendations [ITU-T, 1994a]. The primary objective of these models is to allow the exchange of management information through TMN interfaces. They generally present a high degree of formalism, which is highly desirable for the objectives they are meant for.

2. Subsystem Architectural Models. They describe the architecture (structure and functioning) of a subsystem as part of the telecommunications system. The aspects of interest are related to the transfer of user information between two points. Their object of study is also a specific part of the telecommunications network.

3. New Solutions in Telecommunications. Models through which new solutions are proposed and whose main objectives are the characterization of the functionality and of the architecture of the new systems.

4. General Models. The models of this class seek to contribute to the understanding of a telecommunications network as a whole. Owing to the size, complexity and heterogeneity of such a network, it is generally difficult to obtain a model of it that is both formal and useful. In this context, a formal model is one that uses mathematical notation and which is built up as a consequence of logical reasoning upon elementary principles [Bapat, 1994].

In the following sections, a survey of the literature is made, covering works on telecommunications network modeling techniques, as well as some models that represent each one of the previously identified classes.

3.1.1 Modeling Techniques

One of the most widely used techniques in the modeling of telecommunications networks is based on the object oriented paradigm. [Williamson and Azmoodeh, 1991] give an introduction to the development of object based information models. In terms of data organization, the following objectives are defined for object modeling:

• The model must contain some views which are intuitive and easy for non-specialists to understand;
• When appropriate, the model must facilitate formalization;

• The model must not depend on implementation options;
• The model must be extensible, alterable and portable;
• The model must reinforce the definition and the understanding of a consistent terminology.

A complete guide to object based network modeling is presented in [Bapat, 1994].

[Klerer, 1993], [Filipiak et al., 1993] and [Chan, 1994] present generic object based models of managed networks, which may be used to represent heterogeneous networks.

[Hall and Magedanz, 1993], [Sclavos et al., 1994], [Strang et al., 1993], [Petermueller, 1996], [Kheradpir et al., 1993], [Fink et al., 1993] and [Gillespie and Rees, 1996] present other examples of object oriented models.

[Shomaly, 1993] introduces the concept of model based management, which is also explored in [Stinson and Kheradpir, 1992] and [Kheradpir et al., 1993]. Model based management involves, besides the management information modeling, the representation of management functions through the description of their behavior. As it is a repository of logical information, a model is largely insensitive to implementation details [Strang et al., 1993], which is useful in multi-vendor environments.

[Benz and Leischner, 1993] present a technique for the modeling of management information that covers the semantic aspects of this information. The model includes consistency rules, aiming at the detection of anomalies in the network's configuration, but it does not take into account the internal structure of the objects, that is, their attributes, notifications, actions or behavior. As a consequence, rules based on the characteristics of the internal structure of the objects are not defined; only the higher level rules are, thus formalizing the relationships among the managed objects.

The consistency rules are implemented through database techniques; these same techniques are also used to provide view mechanisms [Korth and Silberschatz, 1989], which help to reduce the difficulties related to the size and complexity of the managed networks.

View, viewpoint or visibility level mechanisms are also presented in [Chan, 1994], [Cornily et al., 1993] and [Sclavos et al., 1994].

Plane and layer decomposition techniques are useful in the modeling of complex systems and have been used in [Appeldorn et al., 1993], [Campbell and Everitt, 1992], [Chan, 1994], [Strang et al., 1993], [Yamaguchi et al., 1992], [Stinson and Kheradpir, 1992], [Kheradpir et al., 1993] and [Uehara, 1996].

3.1.2 Information Modeling for Subsystem Management

A great part of the works available in the literature on telecommunications systems modeling deals with the modeling of specific segments of some systems, such as ATM (Asynchronous Transfer Mode) networks, SDH (Synchronous Digital Hierarchy) and cellular telephony. The formal modeling of these segments, which very often uses the object oriented paradigm, facilitates their understanding and the construction of computing systems to manage them.

Nevertheless, the existence of formal models for some segments of the telecommunications system is not enough as a basis for the implementation of an integrated management of this system. For example, in spite of the existence of formal models for transmission and switching systems, the integration of the management of transmission and switching equipment is still deficient or nonexistent in present-day network management systems. In another example, the alarms related to the microwave links, in some modern cellular telephony management systems, are not even shown on the same screen where the alarms referring to the radio base stations appear [Fröhlich et al., 1996].

Through Recommendations X.720 [ITU-T, 1992k] and M.3100 [ITU-T, 1995b], ITU-T provides a generic information model for a telecommunications network, containing the object classes necessary for the exchange of information through TMN interfaces. The model is conceptual, independent of implementation technologies, and works as a starting point for the development of more detailed models, through the specialization technique.

The ITU-T G.774 Recommendation [ITU-T, 1992a] provides an information model of the synchronous digital hierarchy (SDH), through the specialization of generic object classes defined in [ITU-T, 1992c].
The object classes defined are necessary for the exchange of management information through the standardized interfaces of a TMN.

[Fatato, 1996] gives a general view of the existing standards, in the ITU-T and in the ETSI (European Telecommunications Standards Institute), in the area of SDH and PDH (Plesiochronous Digital Hierarchy) system modeling.

Based on the OSI concepts, [Schott et al., 1992] present a general information model, which is used in the modeling of ATM network resources.

The integration of an ATM network with its SONET "server" network and its access "client" network is the object of the model proposed by [Aydemir and Tanzini, 1996].

[Gillespie and Rees, 1996] present several object based models for access networks. The models are based on the functional architecture of the modeled network, which is defined through the layer paradigm. In these models, each layer is served by the layer situated below and serves the layer situated above. Each architecture component is represented by an object class. In the physical implementation, each managed object is represented by an instance of one of these object classes. The relationship among object classes is represented through entity-relationship diagrams.

[Petermueller, 1996] analyzes the ITU-T and ETSI standards concerning the modeling of exchanges, according to three functional areas: subscriber management, traffic management and system resources management. The models analyzed are based on the Q3 standards of the ITU-T X.700 series of recommendations and on the ITU-T M.3100 Recommendation [ITU-T, 1992c].

[Owen, 1994] addresses the use of computerized tools to aid the modeling of SDH networks.

[Fröhlich et al., 1996] analyze the main existing solutions for alarm correlation in a cellular telephony network, listing the pros and cons of each one. They propose an alternative model-based system, which has the characteristic of being scalable.

[Sclavos et al., 1994] propose an information model for the management of an FDDI network.

[Czarnecki et al., 1996] present some criteria for the comparison and analysis of management information models (MIMs) based on the GDMO (Guidelines for the Definition of Managed Objects) [ITU-T, 1992m].

3.1.3 Subsystem Architectural Models

Through Recommendation G.803 [ITU-T, 1993a], ITU-T describes a telecommunications network by means of a functional and structural model of a transport network based on SDH technology. The main focus of the document is the information transfer capacity of the network. In ITU-T's view, the transport network may be broken down into several layers, associated according to the client-server paradigm.
The model permits the partitioning of each layer, according to some criteria that reflect the internal structure of the layer.

In spite of having been developed with SDH technology in mind, Recommendation G.803 also applies to PDH networks, as far as some of its fundamental principles are concerned.

[Chen and Liu, 1994] describe an architectural model for the management and control functions of ATM systems.

[Okamoto et al., 1996] propose an architecture for optical networks based on wavelength division multiplexing (WDM) and discuss the configuration, performance and fault management functions for these networks.

[Magendanz, 1993], [Pontailler, 1993], [Appeldorn et al., 1993] and [Magedanz, 1995] present descriptions of TMN and IN architectures, identifying several similarities and several areas of possible integration between them.

[Appeldorn et al., 1993] analyze the conceptual model of Intelligent Networks (IN) described in the ITU-T Q.1201 Recommendation, concluding that, of the four architectural planes (physical, distributed functional, global functional and service), the service plane encompasses the most interesting aspects of intelligent network management.

3.1.4 New Solutions in Telecommunications

[Oshisanwo and Boyd, 1993] present a summary of the works of the TINA Research Consortium, created with the mission of specifying TINA (Telecommunications Information Networking Architecture), whose objective is to meet the main requirements of telecommunications infrastructure users, namely: (1) access to any service, at any place and at any time; (2) quick service provisioning; (3) personalized, safe and reliable services.

[de la Fuente et al., 1995] use TINA architecture concepts in the definition of the management services of the Free Phone service.

[Lengdell et al., 1996] describe the TINA network resource information model.

According to [Yang, 1996], the TINA architecture, originally conceived at Bellcore, allows the implementation of several types of telecommunications applications and may be considered one of the bases for the development of "Information Super-Highway" applications.

The technological advance that has been taking place, both in the users' networks and in the service providers' networks, has contributed to reducing even further the technological and functional differences between these networks, which will no longer be separated by a great barrier once the access networks are digitalized.
According to [Ejiri, 1995], the user networks and those of the service providers tend to merge, with the use of the same technology (hardware and software) in the two types of network.

[Kitami, 1996] and [Kano, 1996] present a concept of telecommunication network architecture based on the existence of "Points of Interconnection" (POIs) between networks. Based on this architecture, called NTT's Open Network Architecture (NONA), the long distance service provider will open its network to other operators. NTT intends to offer several POI modalities, depending on the service desired, such as: common channel signaling, multimedia communications, interconnection of PHS (Personal Handy-Phone System) operators and CATV (Community Antenna Television).

3.1.5 General Models

Analytical models of complex systems generally require many assumptions, which limits the possibility of their application in real-world situations [Frost and Melamed, 1994]. On the other hand, natural language models tend to be incomplete, imprecise and/or ambiguous, which makes them hard to implement in computing systems. Therefore, in spite of using a certain degree of formalism, general models tend to be less formal than subsystem models.

A Traditional Model

A traditional telecommunications network could be seen as a set of exchanges, several access networks, a set of transmission systems and a set of infrastructure resources. Figure 3.1, adapted from Fig. 1.11 of [Pines and Barradas, 1977], illustrates a traditional network. In this model, an access network connects each user to a local exchange and, from there, to the rest of the system. Transmission systems are used to interconnect exchanges and may be classified as local (when used to interconnect local exchanges) or long-distance (when they are meant for long-distance communication). The infrastructure (which is not explicitly represented in figure 3.1) consists of buildings, towers, rectifiers, air conditioners, batteries and other items essential to the functioning of a telecommunications network as a whole.

This model is able to adequately represent a traditional telephony system, including some elementary low speed data communication services that use modems and dial-up lines.

With the advance of the technologies related to computing and telecommunications, and the rapid integration of these technologies, it became possible to gradually incorporate into the traditional telecommunications network a large number of sub-networks and platform complexes, computing systems and equipment, allowing new and different services to be offered to the users.
In this context, we may point out, among others, the mobile cellular telephony plant, the common channel signaling system, the intelligent networks, the ATM networks and the network management systems. These equipment items and systems differ among themselves in aspects such as functionality, architecture, manufacturer and implementation technology. The traditional model is not able to represent this new scenario, for which several alternative proposals are found in the literature.

Models in Layers, Planes and Levels

To cope with the complexity of a network, layer structuring brings, among others, the following benefits [ITU-T, 1993a] [ITU-T, 1994b]:

[Figure 3.1, diagram not reproduced. Labels recoverable from the diagram: Locality A, Locality B, Locality C; Local Switch; Long Distance Switch; Local Tx; Long Distance Tx; Access Network. Legend: Tx = Transmission System.]

Figure 3.1: Simplified Model of a Traditional Telecommunications Network

a) Changes in the architecture of one layer do not necessarily affect the architecture of the other layers; so, each layer may be defined without taking into account the particularities of the other layers;
b) A new layer may be defined at any moment, without implying modifications in the already existing layers;
c) Elements with similar functions are put on the same layer.

In the case of telecommunications networks, several models found in the literature use this decomposition method.

The model for network management proposed by [Campbell and Everitt, 1992] includes four control levels: network element layer, network layer, network product layer and customer layer.

The network element layer and the network layer are similar to the layers that have the same name in the logical architecture defined in [ITU-T, 1996]. The network layer is subdivided into three sub-layers: physical configuration sub-layer, path configuration sub-layer and traffic configuration sub-layer. The network product layer contains the telecommunications services offered by the network and is meant to isolate the users from the network implementation details. The customer layer allows an abstract view of the services and provides the clients with a portfolio of the products offered.

[Sclavos et al., 1994] introduce the management visibility level concept, representing the levels at which the management decisions have to be taken. Four of these levels are defined: physical, data link, network and application. The authors point out that the visibility level concept and the layer concept of the OSI model are not identical.

The model of [Sclavos et al., 1994] is recursive, which allows the same information model to be applied to each visibility level.
A network element at visibility level N is characterized in the following way: (1) by its own architecture; (2) by the network elements of level (N - 1), if this level exists; and (3) by the service that the element in question offers to the network elements of visibility level (N + 1).

To facilitate the interworking between network management and service management, and taking as reference the approach adopted in the ROSA ("RACE Open Services Architecture") European project, [Hall and Magedanz, 1993] propose the use of a unified model for telecommunications and management services, based on the object oriented paradigm.

[Yamaguchi et al., 1992] divide a telecommunications network into four layers:

• Transmission facility network layer;
• Digital path network layer;
• Switched circuit network layer;
• Traffic layer.

The first three layers, together, constitute the transport network layer.

Each one of the layers is modeled by the same structure, which contains three generalized entity types: nodes, links and end-points. The model of each layer contains specializations of the three generalized entity types.

By using an entity-relationship diagram, the authors propose a single model that represents the four layers that constitute the network.

[Widl and Woldegiorgis, 1992] divide a telecommunications network into four distinct networks, arranged in layers. Each one of these layers may be further subdivided into sub-layers.

The first layer of the telecommunications network corresponds to the Physical Transmission Network and contains the physical means of information transfer: switching matrices, cross-connect systems, modems, multiplexers, cable and line systems.

The second layer contains the Logical Transport Network, which is responsible for the reliable and transparent transfer of information between service users.

The third layer contains the Intelligent Service Network, whose function is to offer users the services they demand.
In this layer, the signaling and switching functions are situated.

The fourth and last layer contains the Telecommunications Management Network (TMN), which connects the managed objects of the several layers to the operation support systems (OSSs) and to the network managers.

[Kheradpir et al., 1992] break down the telecommunications network functionality into functional planes, each one of them sending alarms and reports to the network management systems and receiving commands and controls from them.

The functional planes resulting from the decomposition correspond to user services, local access, common channel signaling, switching, transmission and infrastructure.

[Stinson and Kheradpir, 1992] and [Kheradpir et al., 1993] analyze the evolution of the network management systems, identifying three generations of these systems:

• First Generation Systems. Introduced in the 1970s, they have little intelligence and do not provide integration among the management of the transmission, switching, signaling, access and infrastructure planes. Their functionality is restricted to the acquisition, filtering (used to reduce the management center operator's overload), storage and display of raw data.

• Second Generation Systems. Planned or installed from the beginning of the 1990s, they provide the integration, at the physical level, of the management of the several planes, which allows the data to be visualized on the same screen and stored in the same database.
• Third Generation Systems. They provide functional integration among the several planes, allowing an integrated view of the network performance. This is obtained through the addition of more intelligence to the network management system.

[Stinson and Kheradpir, 1992] and [Kheradpir et al., 1993] propose a functional model for a telecommunications network, organized according to functional planes [Kheradpir et al., 1992]. The model is composed of all the managed objects, their attributes and the relationships among them. For each plane (user, access, signaling, switching, transmission and infrastructure), the model identifies the network nodes and their interconnections, distinguishing the physical objects from the logical objects. As a general rule, it is considered that events occurring at a lower level manifest themselves as symptoms at the higher levels. In order to achieve an end-to-end network model, the planes must be interconnected at "appropriate points", which are not detailed in the model presented.

Four basic types of relationships are defined in the model: physical containment, logical containment, implementation and functional support. The physical containment and logical containment relationships form two distinct trees.

The implementation relationship applies between one or more objects of the physical containment tree and one or more objects of the logical containment tree. Examples: card 5 of frame A implements generation of the synchronism signal; rectifier and battery implement DC power.

A functional support relationship applies between one or more objects of the logical containment tree and one or more objects of the physical containment tree.
Example: DC power functionally supports frame A, frame B and frame C.

The functional dependence between logical objects is determined by the implementation and functional support relationships and gives rise to a graph, from which it is possible to correlate a set of state changes (which may be considered as a "signature" of a problem) with the original cause of the problem.

The functional model is used by a "network state estimator" to correlate the changes in the network state. The state changes are reported by the received alarms, to which information exogenous to the network (such as weather conditions) is added.

[Kheradpir et al., 1993] also address telecommunications network modeling from the standpoint of operations and services, giving a conceptual view of the stages necessary for the implementation of an automated fault management system.
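
The way implementation and functional support relationships give rise to a dependence graph, and how that graph can relate a set of state changes to a common cause, may be sketched as follows. The object names come from the examples in the text; the data layout, the function names and the rule used to pick a cause (an alarmed object that depends on no other alarmed object) are assumptions made for illustration, not the algorithm of [Kheradpir et al., 1993].

```python
# Logical function -> physical objects that implement it (from the examples).
IMPLEMENTS = {
    "DC power": {"rectifier", "battery"},
    "sync signal generation": {"frame A / card 5"},
}

# Logical function -> physical objects that it functionally supports.
SUPPORTS = {
    "DC power": {"frame A", "frame B", "frame C"},
}


def build_dependence_graph() -> dict:
    """Map each object to the set of objects it directly depends on."""
    depends_on = {}
    for function, implementers in IMPLEMENTS.items():
        # a logical function depends on the objects that implement it
        depends_on.setdefault(function, set()).update(implementers)
    for function, supported in SUPPORTS.items():
        # a supported object depends on the supporting function
        for obj in supported:
            depends_on.setdefault(obj, set()).add(function)
    return depends_on


def ancestors(graph: dict, node: str) -> set:
    """All objects that `node` depends on, directly or transitively."""
    seen, stack = set(), list(graph.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, ()))
    return seen


def probable_causes(graph: dict, alarmed: set) -> set:
    """Alarmed objects not explained by another alarmed object they depend on."""
    return {obj for obj in alarmed if not (ancestors(graph, obj) & alarmed)}


if __name__ == "__main__":
    graph = build_dependence_graph()
    signature = {"frame A", "frame B", "frame C", "DC power", "rectifier"}
    print(probable_causes(graph, signature))
```

With the sample signature, the alarms on the three frames and on DC power are all reachable from the rectifier, so only the rectifier remains as a candidate cause; a set of alarms with no alarmed ancestor in common is returned unchanged.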

A model in which a layer corresponds to each network or sub-network is not always suitable for the representation of a telecommunications network. On the other hand, the mechanism of decomposition into layers may be useful in the formulation of a model in which the allocation of several network classes to a single layer is allowed, as will be shown in item 3.2.4.

Other Approaches

DOMAINS (Distributed Open Management Architecture in Networked Systems) [Fink et al., 1993] provides an example of a recursive model, in which a manager must fulfill certain goals, whose achievement it communicates through reports. Each goal is transformed by the manager into a set of controls, to be applied to the resources under its management. A resource, in turn, may itself be another manager, to which other resources report. The controls stemming from the first manager are considered as goals by the lower level manager. The reports issued by the latter are used by the upper level manager to monitor the managed resources.

[Sclavos et al., 1994] point out the limitations of the existing information models as far as the integrated view of the service, network and network element aspects is concerned. The authors propose an integrated object based model, with the goal of overcoming these limitations.

[Frost and Melamed, 1994] present models for traffic simulation in telecommunications networks. According to the authors, simulation models have the advantage over analytical models of imposing few restrictive conditions on the classes of problems studied, which makes them more applicable to real-world situations.

[Uehara, 1996] presents some "conceptual views" of the telecommunications network, such as:

• Functional Model (user's point of view). Here, the focus is not on the internal (that is, technological) details of the network, but only on the services it can offer its users;
• Internal (network implementer's) View. Here, the internal details are important. The network is constituted by network elements, which are grouped, according to their basic functions, in access systems, switching nodes and interconnection systems and means.

The telecommunications network is segmented into a peripheral part (access network) and a central part, called the "backbone". The access network is formed by several access sub-networks, with different functions, that "gravitate around" the backbone, according to the concept of overlaid networks.

[Katzela and Schwartz, 1995] model a telecommunications network through a collection of objects, each of which represents a network, a node, a link, a software process, a hardware component or any other part of a telecommunications network. Each object may be successively subdivided until each of the resulting objects is considered indivisible. An indivisible object is called a terminal object. In the model of [Katzela and Schwartz, 1995], the telecommunications network is represented by a directed graph $G = (E, D)$, where $E$ is a non-empty finite set of terminal objects $e_i$, and $D$ is a set of directed edges $(e_i, e_j)$. The existence of an edge $(e_i, e_j)$ indicates that a fault in $e_i$ causes side effects in $e_j$.

3.2 The Proposed Model

Definition: In the scope of this thesis, a telecommunications network is constituted by the set of all the technological resources that work in the provision of telecommunications services, being basically composed of several sub-networks, from now on simply referred to as networks, and of a set of infrastructure resources.

Networks are constituted by network elements, as defined by ITU-T [ITU-T, 1996], and are generally characterized by being spatially distributed. An infrastructure resource, in turn, may generally be represented at a single point of the physical space.

Networks may be grouped into classes, defined according to their function in the telecommunications network and to the technology employed in their implementation. Examples of network classes are: N-ISDN access networks; cellular access networks; SDH transport networks; ATM switching networks; transparent (that is, composed only of optical fibers) physical networks.

Depending on the knowledge available on a given telecommunications network, it is possible to subdivide a network class into sub-classes, which may be convenient in some situations.
For example, the cellular access network class may be subdivided into AMPS (Advanced Mobile Phone Service) cellular access network, TDMA (Time Division Multiple Access) cellular access network and CDMA (Code Division Multiple Access) cellular access network.

The infrastructure resources may also be grouped into classes, according to their function and the technology used, as for example: power supply, grounding system, buildings, towers, roads and underground ducts.

Some infrastructure resources may be seen as forming "networks", as is the case of power, grounding, duct and road networks. In the present context, however, the term "network" is reserved for the components that perform a direct function in the telecommunication process.

When convenient, and depending on the available knowledge, it is possible to subdivide a class of infrastructure resources into sub-classes. The power supply class, for example, may be subdivided into AC utility power supply, rectifiers, batteries and distribution frames, and may also include power distribution networks, substations and diesel motor generators.

In the model presented in this thesis, the detailing of an infrastructure resource will be limited to what is strictly necessary to understand its direct role in the telecommunication process.

Several models have been proposed to formally and precisely describe some segments of a telecommunications network (cf. section 3.1). When seeking an overall network model, one option would be to start from the models of these segments, successively consolidating them until a single model is reached, representing the whole of the telecommunications network. This is not always possible in practice, because the models, having been elaborated by specialists in each of the modeled segments, picture the telecommunications network from distinct perspectives, which makes it difficult to achieve the generalizations that would be necessary to obtain a higher level model.

In most of the works found in the literature, the efforts are concentrated on the definition of information models, using methodological principles such as the object-oriented paradigm, with the main goal of enabling the adequate transfer of management information between one or more operation support systems (OSs) and a given subsystem.

For the study of alarm correlation in the context of a telecommunications network as a whole, it is necessary to use a model through which fault propagation among the several sub-networks may be adequately represented [Meira and Nogueira, 1997b].
In spite of the existence of several works on modeling, a general model of the telecommunications network that supports the study of alarm correlation across a whole network has not been found in the researched literature. In the remainder of this chapter, a model is proposed that aims at contributing to filling this gap.

3.2.1 A High-level Model

The first step in the modeling process is the definition of the universe of discourse and of a decomposition mechanism [Bapat, 1994]. In the present case, the universe of discourse encompasses all the telecommunications networks. As far as the decomposition mechanism is concerned, a top-down approach has been adopted, in which we start from a highly synthetic view of the telecommunications network. At this level, it is important to characterize the existence of access points through which the telecommunications network services are made available to the users.

Through the identification of the resources necessary for the provision of these services, it is possible to evolve recursively to a more detailed view, in which sub-networks provide the services demanded by the users (or user networks), by using the services offered by other sub-networks.

Figure 3.2 presents a high-level model of the telecommunications network, whose services are supplied through its Service Access Points (SAPs).

[Diagram labels: Telecommunications Network; Service Access Point (SAP).]

Figure 3.2: A High-level Model of the Telecommunications Network

3.2.2 Public Network and User Networks

At a subsequent step in the construction of the model, the telecommunications network is seen as a public telecommunications network, to which several user networks are added (figure 3.3).

A user network may have an arbitrary configuration. It may, in principle, be larger and more complex than the public telecommunications network itself. Communication among users of the same user network may take place without using the services of the public network, which is transparent to the model. This thesis does not intend to model the structure or the internal functioning of the user networks, but only the aspects concerning the use, by these networks, of the public telecommunications network services.

[Diagram labels: Public Telecommunications Network; User Network; User; Service Access Point (SAP).]

Figure 3.3: Public Network and User Networks

A user of the public telecommunications network is a person or a machine that uses this network's communication services, to which it has access either directly or through a user network.

3.2.3 Access Networks

In figure 3.4, the public telecommunications network is seen in a little more detail. Here, several access networks are responsible for making available to the public network users the telecommunications services they demand, by using the services offered by a "backbone", characterized by high traffic capacity and ample usage possibilities [Uehara, 1996].


[Diagram labels: Backbone; Access Network; User Network; User.]

Figure 3.4: Backbone and Access Networks

Only access networks may be connected to the Service Access Points of the backbone. The SAPs of the access networks may be accessed both by users and by user networks. Only users may have access to the Service Access Points of the user networks.

3.2.4 General Model in Multi-Network Layers

The next step in the refinement of the proposed model is the detailing of the nucleus, which gives rise to a set of new layers. The last of these layers corresponds to the physical transmission media.

Finally, through a generalization of the model presented in [Katzela and Schwartz, 1995] (cf. item 3.1.5), the telecommunications network is modeled as a directed graph $G = (V, E)$, where $V$ is a non-empty set of nodes and $E$ is a set of edges, represented by ordered pairs $(v_i, v_j)$, where $v_i$ and $v_j$ are distinct elements of $V$, that is, $v_i \neq v_j$. In principle, each node may represent a terminal object, as in [Katzela and Schwartz, 1995], a complete network, or even a network class, as defined in 3.2.

In the proposed model, if a node $v_i$ is adjacent to a node $v_j$, that is, if there is an edge $(v_i, v_j)$, then the entity represented by $v_i$ uses the services of the entity represented by $v_j$, that is, $v_i$ depends on $v_j$.

Cycles, that is, paths that begin and end at the same node, are not allowed. In particular, if $v_i$ depends on $v_j$ then $v_j$ does not depend on $v_i$:

$$(v_i, v_j) \in E \implies (v_j, v_i) \notin E.$$

The set of nodes $V$ is partitioned into $N + 1$ distinct subsets $c_1, c_2, \ldots, c_N, c_{N+1}$, where $N \geq 1$. Each of these subsets is defined as a layer. An edge departing from a node situated in layer $c_m$ and arriving at a node situated in layer $c_n$ is only possible if $m > n$.

Figure 3.5 exemplifies the proposed general model for a telecommunications network in which the layers are numbered from 1 up to $(N+1)$. In this figure, each node represents the set of all the networks of a given class present in the telecommunications network and is identified by a sequence number and by the number of the layer to which it belongs. Layer 1 has $p_1$ network classes; layer 2 has $p_2$ classes, and so on. As an example, the nodes of layer $N$ are: $v_{N.1}, v_{N.2}, v_{N.3}, \ldots, v_{N.p_N}$.

The set of edges departing from each node represents all the communication services that the networks of the corresponding class may use. The existence of an edge from node $v_i$ to node $v_j$ means that at least one network of the class represented by $v_i$ uses the services of a network of the class represented by $v_j$.

Two network classes belong to the same layer when there is a functional affinity between them. The model does not stipulate a maximum number of layers.

The following considerations may be made about this model:

[Diagram labels: Layer (N+1), User Networks; Layer N, Access Networks; ...; Layer 1, Physical Media; within each layer the nodes are numbered from 1 up to the number of classes of that layer ($p_{N+1}$, $p_N$, $p_{N-1}$, $p_{N-2}$, ..., $p_1$); (*) Backbone; (**) Public Network.]

Figure 3.5: Example of General Model for a Telecommunications Network

A) The sub-networks that constitute a public telecommunications network (simply called networks) may always be structured in at least two layers:
   - Access networks (Layer N);
   - Physical transmission media (Layer 1).
B) Any communication in the public telecommunications network is only possible through the use of physical transmission media (that is, networks of the classes $v_{1.1}, v_{1.2}, \ldots, v_{1.p_1}$).
C) There are two possible ways for a user to access the public telecommunications network:
   - Directly through an access network (Layer N);
   - Through a user network, in layer (N+1), which, in turn, has access to one or more access networks.
D) The set of networks of layers 1 up to (N-1) is called the "backbone".
E) In order to offer its communication services, a network uses the services of one or more networks of lower layers. The networks of layer 1, that is, the physical transmission media, are the only exceptions, as they do not require the services of other networks to carry out the communication.
F) A network never uses the services of networks of the same layer or of upper layers.
G) A Service Access Point (SAP) is a point through which it is possible to access the communication services offered by a network.
H) Only users may have access to the SAPs of the user networks. Both users and user networks may have access to the SAPs of the access networks. Only networks have access to the SAPs of the backbone.

In the example of figure 3.6 all the instances of each of the network classes present in the telecommunications network of figure 3.5 are represented. Within the same class, each network instance in this figure is represented in a different plane, giving rise to a three-dimensional representation. Figure 3.7 is a two-dimensional representation of the same telecommunications network.

Each network in figure 3.7 is identified by the class to which it belongs and by its sequence number within the class. As an example, the layer-N networks are: $v_{N.1.3}, v_{N.1.2}, v_{N.1.1}, v_{N.2.3}, v_{N.2.2}, v_{N.2.1}, v_{N.3.2}, v_{N.3.1}, \ldots, v_{N.p_N.1}$.

[Diagram labels: Layer (N+1), User Networks; Layer N, Access Networks; ...; Layer 1, Physical Media; Service Access Point (SAP); (*) Backbone; (**) Public Network.]

Figure 3.6: Three-dimensional Representation of the Networks that constitute a Telecommunications Network

[Diagram labels: as in figure 3.6, with each network instance numbered within its class.]

Figure 3.7: Plane Representation of the Networks that constitute a Telecommunications Network

Figure 3.8 illustrates the meaning of the expression "uses the services of". In this example, a set of three exchanges composes a switching network, which uses the services of a transmission network of a lower layer. The interconnections among the exchanges of the switching network (layer r) are implemented by the transmission network of layer s, which, in turn, uses the physical transmission media of a lower layer, not shown in the figure. The arrows indicate the points where exchanges A, B and C make use of the resources of layer s. There is no limit on the number of interconnections between the two networks. In other words, a single edge in figure 3.7 may represent an arbitrary number of interconnections between the corresponding networks.

[Diagram labels: Switching Layer [r], with Switch A, Switch B and Switch C; Transmission Layer [s], with nodes 1 to 9.]

Figure 3.8: Detailing of the Connection between Two Networks
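The layering rules of this item (every node belongs to one layer, and an edge may only go from a layer $c_m$ to a layer $c_n$ with $m > n$) can be sketched as a small data structure. This is a minimal illustration, not an implementation taken from the thesis: the class `LayeredModel`, its method names and the four-node example graph are assumptions made here. Because every accepted edge goes to a strictly lower layer, no cycle can be formed, so acyclicity needs no separate check.

```python
class LayeredModel:
    """Directed 'uses the services of' graph over nodes assigned to numbered layers."""

    def __init__(self):
        self.layer_of = {}  # node name -> layer number (1 = physical media)
        self.uses = {}      # node name -> set of nodes whose services it uses

    def add_node(self, name, layer):
        if layer < 1:
            raise ValueError("layers are numbered from 1 upwards")
        self.layer_of[name] = layer
        self.uses.setdefault(name, set())

    def add_edge(self, user, provider):
        """Record that `user` uses the services of `provider` (consideration F)."""
        m = self.layer_of[user]
        n = self.layer_of[provider]
        if m <= n:
            raise ValueError(
                f"{user} (layer {m}) may not use {provider} (layer {n})"
            )
        self.uses[user].add(provider)

    def unsupported(self):
        """Nodes above layer 1 that use no lower-layer service (consideration E)."""
        return sorted(
            name
            for name, layer in self.layer_of.items()
            if layer > 1 and not self.uses[name]
        )


# Hypothetical four-layer example; node names and edges are illustrative only.
model = LayeredModel()
model.add_node("optical network", 1)
model.add_node("SDH", 2)
model.add_node("IDN", 3)
model.add_node("POTS access", 4)
model.add_edge("SDH", "optical network")
model.add_edge("IDN", "SDH")
model.add_edge("POTS access", "IDN")
print(model.unsupported())  # empty list: every upper-layer node has a provider
```

Rejecting an edge at insertion time keeps the structure valid by construction; an attempt such as `model.add_edge("SDH", "IDN")` raises `ValueError`.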

3.2.5 An Example

Figure 3.9 illustrates the application of the concepts previously presented to a hypothetical telecommunications network, which was decomposed into five layers: physical media, transmission, switching, access networks and user networks. For each layer, all the existing network classes have been identified (for example: SDH, PDH and FDM transmission networks). It is possible to refine the model of figure 3.9 by replacing each node (representing a network class) with the set of networks of that class present in the telecommunications network.

3.2.6 Robustness

The model presented is robust enough to adapt to architectural and technological changes in telecommunications networks. This can be done through the introduction of new network instances in an existing class, the definition of new network classes in an existing layer, the creation of new layers, or the modification of the relationships among networks. Figure 3.10 shows, for the sake of illustration, what the network of figure 3.9 would look like after the introduction of a common channel signaling network, serving the integrated digital network (IDN) and being served by the SDH and PDH transmission networks.

3.2.7 Limitations of the Model

Let us consider a situation in which two end users, A and B, are having a telephone conversation. There are several paths through which the connection between A and B may be established. Any one of these paths must involve at least one access network and one physical medium. A long-distance call, for example, typically involves two access networks, one or more switching networks, one or more transmission networks and several physical media.

When a fault occurs in any of the networks involved in the call, the quality or even the continuity of the connection may be affected. Supposing that both user terminals are connected to access networks belonging to the modeled telecommunications network, the model presented in this thesis enables the determination of a set containing all the sub-networks that may be used in a possible connection between A and B.

Through scope limitation (cf. item 4.1.2), it is theoretically possible to separate the alarms of the networks that could be involved in the call from the other alarms generated by the telecommunications network, and then to correlate them, aiming at the identification of the original cause of the problem. In practice, due to the high volume of traffic offered or routed in a typical network (of the order of several million calls per hour), the execution of this type of correlation for each individual call would entail an unacceptable overhead for a network management system structured as it is today. Besides that, the resources available for service management are generally scarce.

[Diagram labels: layers User Networks, Access Networks, Switching (*), Transmission (*) and Physical Media; node labels include POTS, N-ISDN, B-ISDN, X.25, Cellular Access Network, Analog Switching Network, IDN, Frame Relay, ATM, SDH, PDH, FDM, Transparent Optical Network, Coaxial Cable, Radio Systems, Terrestrial and Satellite.]

Note: Layers marked with (*) are liable to spin-offs, combination, or another kind of reorganization, as a consequence of technological changes.

Figure 3.9: A Telecommunications Network

[Diagram labels: as in figure 3.9, with an additional Signaling (*) layer containing an SS7 network.]

Note: Layers marked with (*) are liable to spin-offs, combination, or another kind of reorganization, as a consequence of technological changes.

Figure 3.10: Expanded Telecommunications Network

There are not, in the general case, specific alarms for event notification at the level of the services provided to an individual user.

Furthermore, the model is not able to determine, for a given call, which resources are actually being used (cf. item 3.2.4). Let us consider, for example, that users A and B are using access networks connected respectively to exchanges A and B of figure 3.8. At the switching layer level, the call may have been completed directly between exchanges A and B, or using the transit functions of exchange C. For each of these alternatives, other options open up in the transmission layer, as is the case of the link between A and B, which may be made through nodes 1, 2 and 3 or, for example, through nodes 1, 7, 8 and 3.

Finally, the users are not always connected to a single telecommunications network, so some of the resources involved in the call may not be part of the model.

The situation presented illustrates the following limitations of the proposed model:

1. The model provides a global view of the managed network and may be considered as a tool for the development of OSFs at the Network Management Layer of the Logical Layered Architecture (LLA) defined by ITU-T (cf. item 2.1.2). On the other hand, the model does not directly help the monitoring of the quality of the services offered to each specific client and, consequently, does not offer a significant contribution towards the development of OSFs situated at the Service Management layer. In this respect, it is worth pointing out the absence, up to the moment, of a standardized model for the service management layer [Kätker and Paterok, 1997]. No direct contribution is given by the model to the development of OSFs in the other layers of the LLA architecture, namely Network Element Management and Business Management.
2. From the point of view of the management functional areas (cf. item 2.1.1), the proposed model directly serves only the fault management area, making no significant contribution to the other areas. Therefore, out of the 20 cells identified in the template presented in figure 2.1, only one is covered by the model.

3.3 Final Considerations

The general model proposed in this chapter is based on the partitioning of the set of sub-networks into layers, according to the functional affinities among these sub-networks, and on the use of a directed acyclic graph to represent the telecommunications network. Also, some rules are defined to govern the use of the services offered by the sub-networks of a given layer by the sub-networks of the upper layers.

By being able to qualitatively represent the functional dependences among sub-networks, the model is an appropriate tool for the study of alarm correlation in telecommunications networks, as will be shown in Chapter 4.

As the sub-networks that compose a telecommunications network are identified, the knowledge areas in which technological competence must be developed are also identified, as a prerequisite for the adequate operation and maintenance of the telecommunications network. Consequently, the model presented may also be useful in the human resources area, as an aid in the elaboration of professional development plans in telecommunication technologies.
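As one illustration of the qualitative dependence reasoning summarized above, the set of all sub-networks that a connection between two users might involve (item 3.2.7) can be obtained as the nodes reachable from the two access networks in the dependence graph. The sketch below is only an assumption-laden example: the function names and the small `USES` graph are hypothetical, and, as item 3.2.7 points out, the result is a superset of the resources actually used by any given call.

```python
def reachable(uses, start):
    """Return every node reachable from `start` by following 'uses' edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for provider in uses.get(node, ()):
            if provider not in seen:
                seen.add(provider)
                stack.append(provider)
    return seen


def candidate_subnetworks(uses, access_a, access_b):
    """Sub-networks that a connection between the two access networks may involve."""
    return (
        {access_a, access_b}
        | reachable(uses, access_a)
        | reachable(uses, access_b)
    )


# Hypothetical dependence graph: each key uses the services of the listed nodes.
USES = {
    "access A": {"switching 1"},
    "access B": {"switching 1"},
    "access C": {"switching 2"},
    "switching 1": {"transmission 1", "transmission 2"},
    "switching 2": {"transmission 2"},
    "transmission 1": {"optical 1"},
    "transmission 2": {"radio 1"},
}

print(sorted(candidate_subnetworks(USES, "access A", "access B")))
```

In this toy graph, "access C" and "switching 2" are left out of the result, so their alarms could be set aside when examining a connection between the other two access networks.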

Chapter 4

A Model for Alarm Correlation

In this chapter a model for alarm correlation across a telecommunications network is proposed. This proposal takes into account the network model presented in chapter 3 and is based on the recursive multifocal correlation technique. This technique, proposed in this thesis (cf. item 4.1), is a tool for tackling the complexity of the alarm correlation problem and is sufficiently general to allow, in principle, any of the methods or algorithms studied in section 2.2 to be adopted to perform alarm correlation in a telecommunications network or in one of its segments. The method based on Bayesian networks was chosen as an example because one of its important characteristics is the ease with which it deals with uncertainty, which is an essential factor in the solution of this class of problems, as will be seen in item 4.2.

The remainder of this chapter is structured in four parts. In section 4.1, the concepts of recursive multifocal correlation and scope limitation, fundamental to the development of the model, are introduced. Section 4.2 contains an introduction to the study of Bayesian networks and presents a commercial tool used in the development of these networks. Section 4.3 covers the details of the proposed model and of the real-time correlation process, and also analyzes the complexity and the performance of some of the Bayesian algorithms applicable to the model. In section 4.4 some final considerations on the model are made.

4.1 Recursive Multifocal Correlation

As has already been emphasized in this thesis, alarm correlation is, in the general case, an NP-complete problem (cf. item 2.1.4).
Besides that, as was seen in item 2.3, no single solution is known to be the "best" for the generic problem of alarm correlation, and it is often necessary to use a combination of different methods and algorithms [Meira and Nogueira, 1997a].

Several requirements are indispensable for the implementation of alarm correlation in a telecommunications network. The main one is the availability of the alarms in real time (cf. item 2.1.3). A proposed solution to an alarm correlation problem must also meet the following general requirements (cf. item 2.3) [Meira and Nogueira, 1997a] [Kätker and Paterok, 1997]:

1. Ease of adapting the alarm correlation system to the existing network management system;
2. Ease of theoretically modeling the object network;
3. Ease of adapting the alarm correlation system to changes in the object network;
4. The use of parallelism and functional distribution as a way to achieve satisfactory performance;
5. The ability to deal with uncertainty, notably as far as incomplete and/or inconsistent data are concerned, while maintaining acceptable precision levels.

The technique proposed here, called Recursive Multifocal Correlation, allows the combined use of different methods and algorithms for alarm correlation and meets all the general requirements presented. The approach applies to a telecommunications network modeled according to the criteria presented in item 3.2 and consists of partitioning the network into several sub-networks, each one constituting a correlation focus. The functional relationships among the sub-networks must also be modeled, in the aspects pertinent to the propagation of fault effects. According to the model presented in chapter 3, the existence of a functional relationship between two sub-networks implies the existence, in the corresponding graph, of a path between these sub-networks. Thus, the existence of an edge joining two nodes models the possibility of propagation of fault effects between the corresponding sub-networks.

Several correlation processes may be executed in parallel. Breaking the problem down into several smaller sub-problems reduces the complexity of the resulting problems.

The term "recursive correlation" is due to the fact that, for each focus resulting from the partitioning, the "multifocality" principle may be applied recursively, until each focus corresponds to a single network element.

In each focus, it is possible, in principle, to use a different correlation technique. Simpler approaches such as, for example, the rule-based ones, may be used at the lower levels, whereas more elaborate ones, such as those based on Bayesian networks, may be used at the higher levels, where computational complexity and uncertainty are more relevant. The result of the correlation obtained in each focus is passed to the upper level, where a new correlation may be made. For each level, besides defining the correlation focuses, the functional relationships (or dependences) among these focuses must also be defined (cf. item 3.2.4).
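The recursive structure just described can be sketched as a tree of correlation focuses, each with its own pluggable correlation function. This is an illustrative sketch only: the `Focus` class, the two toy correlators and the alarm strings are assumptions made here, the functional relationships among sibling focuses are left out, and the recursion runs sequentially, whereas the text allows the processes to run in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A correlator receives the focus's own alarms and the results of its child
# focuses, and returns a result (here simply a list of strings).
Correlator = Callable[[List[str], Dict[str, List[str]]], List[str]]


@dataclass
class Focus:
    name: str
    correlate: Correlator
    alarms: List[str] = field(default_factory=list)
    children: List["Focus"] = field(default_factory=list)


def correlate_focus(focus: Focus) -> List[str]:
    """Correlate the child focuses first, then pass their results upwards."""
    child_results = {child.name: correlate_focus(child) for child in focus.children}
    return focus.correlate(focus.alarms, child_results)


def dedupe_rule(alarms, child_results):
    """Toy rule-based correlator for a network element: collapse repeated alarms."""
    return sorted(set(alarms))


def merge_children(alarms, child_results):
    """Toy upper-level correlator: own alarms plus each child's labeled result."""
    report = sorted(set(alarms))
    for child, result in sorted(child_results.items()):
        report.extend(f"{child}: {item}" for item in result)
    return report


# Hypothetical focuses, named after the numbering scheme of figure 4.1.
ne1 = Focus("NE 1.3.2.1", dedupe_rule, alarms=["LOS", "LOS", "AIS"])
ne2 = Focus("NE 1.3.2.2", dedupe_rule)
sub = Focus("Sub-network 1.3.2", merge_children, children=[ne1, ne2])
top = Focus("Sub-network 1.3", merge_children, alarms=["link down"], children=[sub])

for line in correlate_focus(top):
    print(line)
```

Because each focus carries its own `correlate` function, a rule-based correlator at the element level and a more elaborate one at a higher level can coexist in the same tree, which is the point of the technique.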

Figure 4.1 represents an example telecommunications network, in which m correlation focuses are identified at Level 1. At this level, each correlation focus corresponds to a sub-network (cf. item 3.2), identified as Sub-network 1.x, where $1 \leq x \leq m$. For the correlation to be made at the level of the telecommunications network, it is necessary to know: (1) the structure of this network, that is, the correlation focuses and the functional relationships among them and (2) the state of each of these focuses, given by the alarms received from the corresponding sub-network and/or by the results offered by the correlation processes at the level of each sub-network (1.1, 1.2, ..., 1.m).

At Level 2, figure 4.1 details, as an example, sub-network 1.3, for which n correlation focuses are identified, corresponding to the sub-networks 1.3.1, 1.3.2, ..., 1.3.n. For the correlation in a sub-network at this level to be made, it is necessary to know: (1) the structure of this sub-network, that is, the correlation focuses and the functional relationships among them and (2) the state of each of these focuses, given by the alarms received from the corresponding sub-network and/or by the results offered by the correlation processes at the level of each sub-network (1.3.1, 1.3.2, ..., 1.3.n).

For the correlation in each given sub-network at Level 3 to be made, in the example of figure 4.1, it is necessary to know: (1) this sub-network's structure, that is, the correlation focuses and the functional relationships among them and (2) the state of each of these focuses, given by the alarms received from the corresponding sub-network and/or by the results offered by the correlation processes at the level of each sub-network. In figure 4.1, the correlation focuses identified for sub-network 1.3.2, taken as an example, correspond to the network elements 1.3.2.1, 1.3.2.2, ..., 1.3.2.p.

The recursion ends at the network element level, at which the alarm correlation is made directly, without the need for partitioning into new correlation focuses.

[Diagram labels: Level 1, complete view of the modeled network, with Sub-networks 1.1, 1.2, 1.3, ..., 1.m; Level 2, view of Sub-network 1.3, with Sub-networks 1.3.1, 1.3.2, 1.3.3, ..., 1.3.n; Level 3, view of Sub-network 1.3.2, with Network Elements 1.3.2.1, 1.3.2.2, 1.3.2.3, ..., 1.3.2.p. Legend: correlation focus; functional relationship between correlation focuses (sub-networks or network elements).]

Figure 4.1: Example of Correlation Focuses in a Telecommunications Network

4.1.1 Performance and Precision

To improve performance, the process responsible for the correlation at a focus of level h does not have to wait for a call from a process of a higher level (that is, h+1) in order to start its h-level correlation. By exploiting the parallelism made possible by the model, several correlation processes (one for each correlation focus) may be running continually. When requested by an (h+1)-level process, an h-level correlation process may simply offer the requesting process the result of the last correlation, previously and automatically executed. Alternatively, the (h+1)-level process may request from the h-level process, at any time, the immediate execution of a new correlation.

In the first case, the time necessary for the execution of the correlation process at an (h+1)-level focus will depend exclusively on the algorithm adopted for the correlation at this level, not being affected by correlation processes executed at focuses of lower levels. In the other case, each correlation process will only be executed when the processes it called return the respective correlation results, or when a previously stipulated time limit expires for each of these processes.

The recursive multifocal correlation technique brings a considerable performance improvement to the correlation process in a telecommunications network, at the expense of a loss in precision. As a matter of fact, as several correlation processes will be executed in parallel and asynchronously, the consistency among the corresponding correlation results is not guaranteed.

The absence of synchronism among concurrent processes is perfectly acceptable and is sometimes the only possible alternative in telecommunications network management. A typical example occurs in the System for the Integration of Supervision (SIS) [Meira and Lages, 1988], where several independent processes collect management information in distinct plant segments, storing this information in a distributed database in which some types of inconsistency are allowed [de Andrade, 1995]. As the alarm correlation system depends on the information collected by a network management system, of which SIS is a representative example, it is reasonable to accept the possibility of inconsistency, which brings as a consequence uncertainty in the correlation.

Therefore, the adoption of the recursive multifocal correlation technique implies the explicit incorporation of some degree of uncertainty into the correlation result, which does not necessarily represent a limitation in relation to other approaches, for uncertainty is an intrinsic component of any correlation process (cf. item 2.1.4).

4.1.2 Scope Limitation

Scope limitation consists of a pre-processing step through which the number of alarms to be considered in a correlation may be considerably reduced, contributing to reduce the complexity of the problem and allowing the faults that occurred in specific time intervals or in specific segments of the managed network to be identified through a subsequent correlation process on the remaining alarms.

The scope limitation technique is closely related to the recursive multifocal correlation process, because it is responsible for identifying, in a network management center, the alarms that must be considered in each of the correlation processes to be executed in parallel, according to the recursive multifocal approach.

There are several ways to limit the scope of a correlation, for example:

80Temporal limitationIt consists of de�ning a timewindow (which may be �xed or sliding) inside which the alarmswill be considered for correlation. The times to be considered may have as reference thealarms' arrival order or the time at which they were generated (\time stamp"). In the lattercase, as the time window is de�ned, it must also be speci�ed the maximumadmissible delayfor the arrival of an alarm.As time is an important factor in any correlation, the temporal limitation will alwaysbe considered, very often associated to other types of scope limitation.Spatial limitationIt consists of de�ning a physical space inside which the alarms will be considered forcorrelation. This space may correspond to an operation region, to a route or to an exchange,for example.Functional limitationIn this case, the scope limitation allows the selection, from a set of original alarms, of thealarms issued by a given functional segment of the network (that is, a sub-network or anetwork element). Therefore, a subsequent correlation process may take into account onlythe alarms originated from the sub-network in focus.Limitation by Severity DegreeHere, the alarms to be considered must be framed into a speci�ed severity degree range[ITU-T, 1992o].4.2 Introduction to Bayesian NetworksStrictly speaking, for any type of correlation that is considered, the set of alarms to becorrelated will always be subject to errors and omissions (cf. item 2.1.4). Such errors andomissions may be generated both in the network element responsible for the original faultas in elements situated in other points of the managed network; they may also be causedby communication failures or by the network management system itself. Besides that,the simultaneous occurrence of two or more faults may generate an alarm pattern that ischaracteristic of a fault that has not occurred, thus inducing the alarm correlation systemto error. We may conclude, therefore, that uncertainty is inherent to any alarm correlationprocess.

4.2.1 Why Bayesian Networks?

In principle, the multifocal correlation technique allows any of the approaches identified in section 2.2 to be used for correlating alarms within a telecommunications network. The following reasons led to the decision to use Bayesian networks in the example developed in this thesis:

a) Ease of construction. The telecommunications network model proposed in chapter 3 provides a qualitative specification of the functional dependences among the sub-networks, which contains all the information necessary for the construction of the Bayesian network structure [Henrion et al., 1991];

b) The capacity to identify, in polynomial time, all the conditional independence relationships from the information provided by the Bayesian network structure. This allows a great part of the available information to be ignored and attention to be concentrated on the information relevant to the problem in focus [Pearl, 1991], thus reducing the complexity of the alarm correlation process;

c) The capacity to carry out inferences on the present state of a telecommunications network by combining: (1) statistical data empirically collected during the operation of the network; (2) subjective probabilities supplied by specialists; and (3) information (that is, "evidences" or "alarms") received from the telecommunications network in real time [Hood and Ji, 1997];

d) The capacity for non-monotonic reasoning, through which previously obtained conclusions may be withdrawn when new information becomes known;

e) Robustness. Through the evaluation of a Bayesian network (cf. item 4.2.4), it is possible to obtain approximate answers even when the existing information is incomplete or imprecise; as new information becomes available, Bayesian networks allow a corresponding improvement in the precision of the correlation results;

f) Mathematical support. Bayesian networks rest on a solid mathematical foundation [Pearl, 1991], which allows the performance and precision of the model to be analyzed before an implementation is carried out.

With reference to the criteria for comparing alarm correlation alternatives presented in section 2.3, the approach based on Bayesian networks makes the modeling of the object network easy (cf. chapter 3) and is relatively easy to implement (as will be shown in this chapter).

Changes in the object network imply the need for alterations in the network model, followed by corresponding alterations in the Bayesian network and by the subsequent compilation of the knowledge base.

As far as performance is concerned, the evaluation of a Bayesian network, that is, the calculation of the probabilities associated with the variables corresponding to its nodes, given a set of evidence, is an NP-hard problem [Charniak, 1991]. In spite of that, the evaluation may be carried out in reasonable time for networks with hundreds or even thousands of nodes. The reason is the existence, in practical applications, of a large number of conditional independences among the nodes, which, in the context of the present thesis, represent the sub-networks of the telecommunications network (cf. item 4.2.2).

In terms of precision, the behavior of a Bayesian network reflects the quality and the level of detail of its structure, which stems from the object network model. Another factor that affects the precision of the Bayesian alarm correlation process is the quality and the level of detail of the alarms to be correlated. These two factors affect the precision of any alarm correlation process, independently of the adopted approach. A third factor, the precision of the values of the conditional probabilities (cf. item 4.2.2), also contributes to the precision of the correlation process.

Applications

Bayesian networks have been used to solve problems in several areas in which uncertainty is a key factor. [Deng et al., 1993] address fault diagnosis in optical communications networks. [Kirsch and Kroschel, 1994] describe the application of Bayesian networks to fault diagnosis in diesel engines. [Heckerman et al., 1995a] address fault location in complex devices, such as aircraft or trains.
[Fung and Favero, 1995] describe an application of Bayesian networks to the retrieval of information according to the users' areas of interest. [Burnell and Horvitz, 1995] present a system that uses a Bayesian network for debugging very complex computer programs. [Buntine, 1996] addresses the use of Bayesian networks in coding, representing and discovering knowledge, through processes that seek new knowledge on a given domain based on inferences on new data and/or on the knowledge already available [Klösgen, 1996]. [Hood and Ji, 1997] use Bayesian networks for the proactive detection of abnormal behavior in a computer network.

4.2.2 Basic Concepts

As already seen (cf. item 2.2.3), a Bayesian network is a directed acyclic graph in which each node represents a random variable [Hoel et al., 1971] and each edge denotes the existence of a direct causal influence between the variables it connects. The intensity of this influence is quantified by conditional probabilities [Pearl, 1987]. Therefore, the causal connections in Bayesian networks are not absolute, which is rather convenient when it is difficult to attribute deterministic causal relations among the variables, due to the complexity inherent to the system under study.

A random variable may be discrete (when it has a finite or countable number of states) or continuous (when the number of possible states is infinite and non-countable). For example, a random variable called "General Switch", whose possible states are on or off, is a discrete variable (and also binary, for the number of possible states is two). On the other hand, the random variable "Power Voltage" may be defined as continuous, if it can assume any value in the 0 to 120 Volt range, for example. If convenient, a continuous variable may give rise to a discrete variable, by choosing a single value as the representative of all the possible values of the variable within a certain range (cf. example in Table 4.1).

Table 4.1: An Example of How to Transform a Continuous Variable into a Discrete Variable

  Range of the Continuous Variable    Corresponding State of the Discrete Variable
  $0 \le V \le 90$                    AC Fault
  $90 < V \le 110$                    Low Voltage
  $110 < V \le 130$                   Normal Voltage
  $130 < V$                           Overvoltage

Joint Probabilities Distribution

Let $X$ be a set of random variables $\{X_1, X_2, \ldots, X_n\}$. Each combination of the values of these variables defines a configuration or scenario. The number of possible configurations is given by the product $n_1 \times n_2 \times \cdots \times n_n$, where $n_1, n_2, \ldots, n_n$ are the numbers of possible states of the variables $X_1, X_2, \ldots, X_n$. In the particular case in which $X_1, X_2, \ldots, X_n$ are binary variables, the number of possible configurations is $2^n$. The joint probabilities distribution of $X$ is defined as $P(X_1, X_2, \ldots, X_n)$, for all the possible configurations [Charniak, 1991]. Therefore, if $X = \{X_1, X_2\}$, where $X_1$ and $X_2$ are binary variables, the joint distribution will contain the following probabilities, corresponding to the four possible configurations: $P(x_1, x_2)$, $P(x_1, \neg x_2)$, $P(\neg x_1, x_2)$, $P(\neg x_1, \neg x_2)$.

The configurations of a joint distribution are exhaustive and mutually exclusive. Therefore, the sum of all their probabilities must be equal to 1.
Besides that, the probability associated with the occurrence of at least one of two distinct configurations $A$ and $B$ is equal to the sum of the probabilities associated with $A$ and $B$, that is, $P(A \cup B) = P(A) + P(B)$ and $P(A \cap B) = 0$.

Local Probabilities Distributions

A Bayesian network for the set of variables (or "domain") $X = \{X_1, X_2, \ldots, X_n\}$ represents the joint probabilities distribution $P(X_1, X_2, \ldots, X_n)$, with the advantage of drastically reducing the number of probabilities that must be specified. This is possible because a Bayesian network consists of a set of local probabilities associated with each variable, and a set of information on the independence among these variables (given by the network structure). From these data, it is possible to construct the joint distribution [Pearl, 1991].

Figure 4.2 (adapted from figure 1, page 7, of [Knowledge Industries, Inc., 1996a]) shows an extremely simple Bayesian network, constructed to aid in the fault diagnosis of a flashlight. This example is taken as a reference for the introduction of some important concepts.

The network in figure 4.2 shows that there are no direct causal connections between the variables Battery and Bulb, Bulb and Measured Voltage, or Beam and Measured Voltage. These three causal independences built into the Bayesian network reduce the number of probabilities that must be specified from 54 (that is, $2 \times 3 \times 3 \times 3$), in the case of the joint distribution, to 32 (that is, $2 + 3 + 18 + 9$). This reduction may not seem significant for the network of figure 4.2. To illustrate the importance of the built-in causal independences in reducing the number of probabilities to be specified, consider the network of figure 4.3, which contains 16 nodes, corresponding to discrete variables with two or three states each. In this case, without the causal independence information built into the Bayesian network, the number of probabilities to be considered would be $(2 \times 3 \times 3 \times 3)^4 = 8503056$. Nevertheless, due to the causal independences made explicit by the Bayesian network structure, only 158 probabilities (that is, $(2 + 3 + 18 + 9) + (6 + 9 + 18 + 9) \times 3$) must be specified.

Therefore, the complexity associated with the construction of a Bayesian network, measured by the number of probabilities to be specified, grows linearly with the number of network nodes.
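The counts given for the flashlight network can be checked mechanically. The following Python sketch is an illustration only and is not part of the original work: the names `joint_size` and `local_size` are hypothetical, and the parent sets encode the structure described in the text for figure 4.2 (Bulb and Battery are the parents of Beam; Battery is the parent of Measured Voltage).

```python
from math import prod

# Number of states of each variable in the flashlight network of figure 4.2.
STATES = {"LB": 2, "BT": 3, "B": 3, "V": 3}
# Direct predecessors ("parents") of each node; root nodes are omitted.
PARENTS = {"B": ("LB", "BT"), "V": ("BT",)}


def joint_size(states):
    """Entries in the full joint distribution: product of all state counts."""
    return prod(states.values())


def local_size(states, parents):
    """Entries in all local distributions: for each node, its own state
    count times the state counts of its parents (1 for root nodes)."""
    return sum(
        n * prod(states[p] for p in parents.get(node, ()))
        for node, n in states.items()
    )


print(joint_size(STATES))           # 54
print(local_size(STATES, PARENTS))  # 32 = 2 + 3 + 18 + 9
print(joint_size(STATES) ** 4)      # 8503056 for four such groups of nodes
```

The joint count multiplies the state counts of all nodes, whereas the local count only multiplies within each node's family, which is why the second figure stays small when nodes are added.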
Without information on causal independences, the number of probabilities to be specified would grow exponentially with the number of nodes.

Local Conditional Probabilities Distribution

In a Bayesian network, if there is an edge departing from node A and reaching node B, A is said to be a direct predecessor (or "parent") of B; conversely, B is said to be a direct descendant (or "child") of A.

The probabilities associated with a node that has no parents are said to be local prior probabilities, because their specification does not need to take into account the value of any other network variable. The probabilities associated with the other nodes are called local a posteriori probabilities, for each one of them is conditioned on the occurrence of a certain pattern of values of the direct predecessors of the node.

Consider, for example, the network of figure 4.3. The prior probabilities to be specified are: $P(A_1 = a_{11})$; $P(A_1 = a_{12})$; $P(B_1 = b_{11})$; $P(B_1 = b_{12})$; $P(B_1 = b_{13})$. All the others are a posteriori probabilities, as, for example, $P(C_1 = c_{11} \mid A_1 = a_{11}, B_1 = b_{11})$, which is the conditional probability of $C_1$ assuming the value $c_{11}$, given that $A_1$ has the value $a_{11}$ and $B_1$ has the value $b_{11}$.

A local distribution of a posteriori probabilities generally consists of a set of conditional probabilities. For example, the distribution associated with node B of figure 4.2 consists of the set of 18 conditional probabilities shown in the figure. The set of probabilities that constitute a local probability distribution is also called a link matrix [Pearl, 1991].

Figure 4.2: An Example of Bayesian Network (diagram not reproduced). The network has four nodes: Bulb (LB), Battery (BT), Beam (B) and Measured Voltage (V). LB and BT are the parents of B; BT is the parent of V. The local distributions shown in the figure are:

  LB in {OK, Burnt}
    P(LB=OK) = 90%    P(LB=Burnt) = 10%

  BT in {Good, Weak, Discharged}
    P(BT=Good) = 80%    P(BT=Weak) = 10%    P(BT=Discharged) = 10%

  V in {Nominal, Average, Low}
    BT            P(V=Nominal | BT)    P(V=Average | BT)    P(V=Low | BT)
    Good          97%                  1.5%                 1.5%
    Weak          2%                   97%                  1%
    Discharged    1%                   2%                   97%

  B in {Bright, Weak, Dead}
    LB       BT            P(B=Bright | LB, BT)    P(B=Weak | LB, BT)    P(B=Dead | LB, BT)
    OK       Good          99%                     0.5%                  0.5%
    OK       Weak          1%                      90%                   9%
    OK       Discharged    0%                      1%                    99%
    Burnt    Good          0%                      0%                    100%
    Burnt    Weak          0%                      0%                    100%
    Burnt    Discharged    0%                      0%                    100%

Figure 4.3: Example of a Bayesian Network with 16 nodes (diagram not reproduced). For k = 1, ..., 4 the nodes and their states are: Ak in {ak1, ak2}; Bk in {bk1, bk2, bk3}; Ck in {ck1, ck2, ck3}; Dk in {dk1, dk2, dk3}.

Other Alternatives for the Distribution of Local Probabilities

Figure 4.4 presents a fragment of a hypothetical Bayesian network. Supposing that each discrete variable (that is, TO, TR, SS, PD) has four possible values, the specification of the local probabilities distribution for the PD node will contain $4^4 = 256$ probabilities.

Figure 4.4: Fragment of a Bayesian network (diagram not reproduced). The nodes Transparent Optical Network (TO), Terrestrial Radio Systems (TR) and Satellite Systems (SS) are the parents of the node PDH Transmission Network (PD).

Therefore, the size of the link matrix, given by the number of local conditional probabilities to be specified, grows exponentially with the number of parent variables. Furthermore, in the cases in which the parent variables have little in common besides the fact of sharing a child node, it is hard, even for an expert, to evaluate each one of these probabilities.

These and other reasons motivated alternative proposals for the modeling of multicausal interactions in situations where causal independence among the conditioning factors may be assumed.

One of these models is called disjunctive interaction or "Noisy-OR gate". A disjunctive interaction occurs when any one among a set of conditions may cause a certain

event, with a certain probability, and this probability does not decrease when several of these conditions occur simultaneously [Pearl, 1991].

A causal independence node is composed of a combination node and several "noise" nodes, one for each parent node (or "cause"). In the Noisy-OR model, the combination function corresponds to the logical OR applied to the individual effect of each cause and to the effect of a "background" node, which is related to the probabilities distribution of the combination node in the absence of all the causes.

The predecessor nodes of a Noisy-OR node may have any number of states, but one of them must correspond to the absence of the cause associated with the node. The probability distribution of a Noisy-OR node has to be binary.

Noisy-MAX is a model similar in every aspect to the Noisy-OR model, except for the fact that the probabilities distributions are not necessarily binary [Knowledge Industries, Inc., 1996a].

To illustrate the concepts just introduced, consider the PD node of figure 4.4. Table 4.2 presents the possible states of each of the variables involved.

Table 4.2: Definition of the Possible States of the Variables TO, TR, SS and PD

  TO                     TR                     SS                     PD
  Fault (F)              Fault (F)              Fault (F)              Fault (F)
  Critical Alarm (AC)    Critical Alarm (AC)    Critical Alarm (AC)    Critical Alarm (AC)
  Alarm (A)              Alarm (A)              Alarm (A)              Alarm (A)
  Normal (N)             Normal (N)             Normal (N)             Normal (N)

If the Noisy-MAX distribution is adopted for the PD node, only 52 probabilities need to be specified, according to table 4.3: 16 for each of the three parent nodes plus 4 for the background node. This represents a substantial reduction in relation to the 256 probabilities required by the conditional distribution.

Steps for the Construction of a Bayesian Network

Three steps are necessary to construct a Bayesian network [Heckerman, 1996]:

a) To choose the variables and the states of each one of them;

b) To construct the structure of the Bayesian network, that is, the directed acyclic graph containing the information on the independence among variables;

c) To assign probability values (that is, to specify the distribution of local probabilities) for each variable.

In the case of a telecommunications network, a great part of the work needed for the construction of the Bayesian network, namely the choice of the variables and the construction of the structure, will already be done as soon as the network model is available (see chapter 3).

Table 4.3: Probabilities Distribution for Node PD

  Parent node: Transparent Optical Network (TO)
    P(PD=F | TO=F)     P(PD=F | TO=AC)     P(PD=F | TO=A)     P(PD=F | TO=N)
    P(PD=AC | TO=F)    P(PD=AC | TO=AC)    P(PD=AC | TO=A)    P(PD=AC | TO=N)
    P(PD=A | TO=F)     P(PD=A | TO=AC)     P(PD=A | TO=A)     P(PD=A | TO=N)
    P(PD=N | TO=F)     P(PD=N | TO=AC)     P(PD=N | TO=A)     P(PD=N | TO=N)

  Parent node: Terrestrial Radio Systems (TR)
    P(PD=F | TR=F)     P(PD=F | TR=AC)     P(PD=F | TR=A)     P(PD=F | TR=N)
    P(PD=AC | TR=F)    P(PD=AC | TR=AC)    P(PD=AC | TR=A)    P(PD=AC | TR=N)
    P(PD=A | TR=F)     P(PD=A | TR=AC)     P(PD=A | TR=A)     P(PD=A | TR=N)
    P(PD=N | TR=F)     P(PD=N | TR=AC)     P(PD=N | TR=A)     P(PD=N | TR=N)

  Parent node: Satellite Systems (SS)
    P(PD=F | SS=F)     P(PD=F | SS=AC)     P(PD=F | SS=A)     P(PD=F | SS=N)
    P(PD=AC | SS=F)    P(PD=AC | SS=AC)    P(PD=AC | SS=A)    P(PD=AC | SS=N)
    P(PD=A | SS=F)     P(PD=A | SS=AC)     P(PD=A | SS=A)     P(PD=A | SS=N)
    P(PD=N | SS=F)     P(PD=N | SS=AC)     P(PD=N | SS=A)     P(PD=N | SS=N)

  Background
    P(PD=F)    P(PD=AC)    P(PD=A)    P(PD=N)

4.2.3 Alarm Correlation Using Bayesian Networks

For each variable of a Bayesian network, some states corresponding to faults may be defined, whose probability will be evaluated during a diagnosis session. This would be the case, for example, of the states LB=Burnt, B=Weak and BT=Discharged in the network of figure 4.2 [Knowledge Industries, Inc., 1996a].

Any variable of a Bayesian network may also be defined as an observation node, if its state can be observed during a diagnosis session. According to the definitions presented in item 2.1.3, these variables would therefore be capable of emitting alarms. It must be pointed out that a node may be an observation node and, at the same time, its variable may contain fault states.

Definition: Alarm correlation in a Bayesian network consists of the evaluation of the probabilities associated with the occurrence of one or more faults, based on the information received from the system under diagnosis. This information is basically constituted by the alarms generated during the operation of the managed network elements, or obtained as a result of previous correlation processes.

Uncertainty Causing Factors

As previously seen (cf. item 4.2), uncertainty is inherent to the correlation process. In the present case, this uncertainty is due to the possibility of error, which has four main sources:

1. the influence of factors not captured by the managed system model;

2. the imprecision in the attribution of values to the probabilities distributions;

3. the imprecision in the capture and transfer of the alarms. This may be illustrated even in simple systems such as the one of figure 4.2, where errors may occur both in the observation of the beam intensity (due to differences in optical sensitivity among observers) and in the reading of the battery voltage (due to errors in the operation of the voltmeter or to a defect of the device itself);

4. the imprecision in the information obtained as the result of other correlation processes.

4.2.4 Evaluation of Bayesian Networks

It may be demonstrated [Charniak, 1991] that the joint probabilities distribution $P(X_1, X_2, \ldots, X_n)$ of a Bayesian network may be obtained as the product of the local probabilities distributions of each random variable. Consider, for example, the Bayesian network of figure 4.2, in which the joint distribution $P(LB, BT, B, V)$ may be calculated as

\[ P(LB, BT, B, V) = P(LB) \times P(BT) \times P(B \mid LB, BT) \times P(V \mid BT) \]

In this way it is possible to calculate any probability of the distribution, such as, for example:

\[
\begin{aligned}
& P(LB = \mathrm{OK}, BT = \mathrm{Weak}, B = \mathrm{Dead}, V = \mathrm{Average}) \\
&\quad = P(LB = \mathrm{OK}) \times P(BT = \mathrm{Weak}) \times P(B = \mathrm{Dead} \mid LB = \mathrm{OK}, BT = \mathrm{Weak}) \times P(V = \mathrm{Average} \mid BT = \mathrm{Weak}) \\
&\quad = 0.9 \times 0.1 \times 0.09 \times 0.97 = 0.007857 \approx 0.79\%
\end{aligned}
\]

The probability that a set of variables $Y \subseteq X$, consisting of the variables $X_m, \ldots, X_p \in \{X_1, X_2, \ldots, X_n\}$, assumes the configuration $y = \{X_m = x_m, \ldots, X_p = x_p\}$ is given by the sum of all the probabilities of the joint distribution of $X$ in which $X_m = x_m, \ldots, X_p = x_p$ [Henrion et al., 1991].

Taking again the network of figure 4.2 as an example, the probability that, in the modeled system, the battery is good (BT=Good) and the beam is dead (B=Dead) is the sum of the joint probabilities over all values of LB and V:

\[ P(BT = \mathrm{Good}, B = \mathrm{Dead}) = \sum_{lb} \sum_{v} P(LB = lb, BT = \mathrm{Good}, B = \mathrm{Dead}, V = v) \]

Each term is the product $P(LB) \times P(BT) \times P(B = \mathrm{Dead} \mid LB, BT) \times P(V \mid BT)$:

  LB       BT      V          Product                                Value
  OK       Good    Nominal    0.9000 × 0.8000 × 0.0050 × 0.9700      0.003492
  OK       Good    Average    0.9000 × 0.8000 × 0.0050 × 0.0150      0.000054
  OK       Good    Low        0.9000 × 0.8000 × 0.0050 × 0.0150      0.000054
  Burnt    Good    Nominal    0.1000 × 0.8000 × 1.0000 × 0.9700      0.077600
  Burnt    Good    Average    0.1000 × 0.8000 × 1.0000 × 0.0150      0.001200
  Burnt    Good    Low        0.1000 × 0.8000 × 1.0000 × 0.0150      0.001200
                                                          Sum        0.083600

Hence $P(BT = \mathrm{Good}, B = \mathrm{Dead}) = 0.083600 = 8.36\%$.

In the same Bayesian network, the probability that the beam is dead is the sum over all values of LB, BT and V:

\[ P(B = \mathrm{Dead}) = \sum_{lb} \sum_{bt} \sum_{v} P(LB = lb, BT = bt, B = \mathrm{Dead}, V = v) \]

  LB       BT            V          Product                                Value
  OK       Good          Nominal    0.9000 × 0.8000 × 0.0050 × 0.9700      0.003492
  OK       Good          Average    0.9000 × 0.8000 × 0.0050 × 0.0150      0.000054
  OK       Good          Low        0.9000 × 0.8000 × 0.0050 × 0.0150      0.000054
  Burnt    Good          Nominal    0.1000 × 0.8000 × 1.0000 × 0.9700      0.077600
  Burnt    Good          Average    0.1000 × 0.8000 × 1.0000 × 0.0150      0.001200
  Burnt    Good          Low        0.1000 × 0.8000 × 1.0000 × 0.0150      0.001200
  OK       Weak          Nominal    0.9000 × 0.1000 × 0.0900 × 0.0200      0.000162
  OK       Weak          Average    0.9000 × 0.1000 × 0.0900 × 0.9700      0.007857
  OK       Weak          Low        0.9000 × 0.1000 × 0.0900 × 0.0100      0.000081
  Burnt    Weak          Nominal    0.1000 × 0.1000 × 1.0000 × 0.0200      0.000200
  Burnt    Weak          Average    0.1000 × 0.1000 × 1.0000 × 0.9700      0.009700
  Burnt    Weak          Low        0.1000 × 0.1000 × 1.0000 × 0.0100      0.000100
  OK       Discharged    Nominal    0.9000 × 0.1000 × 0.9900 × 0.0100      0.000891
  OK       Discharged    Average    0.9000 × 0.1000 × 0.9900 × 0.0200      0.001782
  OK       Discharged    Low        0.9000 × 0.1000 × 0.9900 × 0.9700      0.086427
  Burnt    Discharged    Nominal    0.1000 × 0.1000 × 1.0000 × 0.0100      0.000100
  Burnt    Discharged    Average    0.1000 × 0.1000 × 1.0000 × 0.0200      0.000200
  Burnt    Discharged    Low        0.1000 × 0.1000 × 1.0000 × 0.9700      0.009700
                                                                Sum        0.200800

Hence $P(B = \mathrm{Dead}) = 0.200800 = 20.08\%$.
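The joint probability and the two marginal probabilities computed so far can be reproduced by brute-force enumeration of the joint distribution. The Python sketch below is illustrative only: the dictionary layout and the names `joint` and `marginal` are assumptions made for this example, while the probability values are those of figure 4.2.

```python
from itertools import product

# Local distributions of the flashlight network (values from figure 4.2).
P_LB = {"OK": 0.90, "Burnt": 0.10}
P_BT = {"Good": 0.80, "Weak": 0.10, "Discharged": 0.10}
# P(B | LB, BT), indexed as P_B[(lb, bt)][b].
P_B = {
    ("OK", "Good"): {"Bright": 0.99, "Weak": 0.005, "Dead": 0.005},
    ("OK", "Weak"): {"Bright": 0.01, "Weak": 0.90, "Dead": 0.09},
    ("OK", "Discharged"): {"Bright": 0.00, "Weak": 0.01, "Dead": 0.99},
    ("Burnt", "Good"): {"Bright": 0.0, "Weak": 0.0, "Dead": 1.0},
    ("Burnt", "Weak"): {"Bright": 0.0, "Weak": 0.0, "Dead": 1.0},
    ("Burnt", "Discharged"): {"Bright": 0.0, "Weak": 0.0, "Dead": 1.0},
}
# P(V | BT), indexed as P_V[bt][v].
P_V = {
    "Good": {"Nominal": 0.97, "Average": 0.015, "Low": 0.015},
    "Weak": {"Nominal": 0.02, "Average": 0.97, "Low": 0.01},
    "Discharged": {"Nominal": 0.01, "Average": 0.02, "Low": 0.97},
}
NAMES = ["LB", "BT", "B", "V"]
DOMAINS = {
    "LB": ["OK", "Burnt"],
    "BT": ["Good", "Weak", "Discharged"],
    "B": ["Bright", "Weak", "Dead"],
    "V": ["Nominal", "Average", "Low"],
}


def joint(lb, bt, b, v):
    """P(LB, BT, B, V) as the product of the local distributions."""
    return P_LB[lb] * P_BT[bt] * P_B[(lb, bt)][b] * P_V[bt][v]


def marginal(**fixed):
    """Sum the joint over every configuration consistent with `fixed`."""
    total = 0.0
    for values in product(*(DOMAINS[n] for n in NAMES)):
        config = dict(zip(NAMES, values))
        if all(config[name] == value for name, value in fixed.items()):
            total += joint(*values)
    return total


print(round(joint("OK", "Weak", "Dead", "Average"), 6))  # 0.007857
print(round(marginal(BT="Good", B="Dead"), 6))           # 0.0836
print(round(marginal(B="Dead"), 6))                      # 0.2008
```

The loop visits all 54 configurations for every query, which is affordable here but grows exponentially with the number of nodes.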

A conditional probability may be calculated by using the formula

\[ P(A \mid E) = \frac{P(A, E)}{P(E)} \]

according to which the probability of the occurrence of $A$, given that $E$ occurred, is the quotient between the probability of the simultaneous occurrence of $A$ and $E$ and the probability of the occurrence of $E$ [Hoel et al., 1971].

Therefore, if a set of evidence $e = \{X_m = x_m, \ldots, X_p = x_p\}$ is known, consisting of all the known values of the random variables of a Bayesian network, where $\{X_m, \ldots, X_p\} \subseteq X = \{X_1, X_2, \ldots, X_n\}$, the probability (or "belief") that a variable $X_k \notin \{X_m, \ldots, X_p\}$ assumes the value $x_k$ is given by

\[ P(X_k = x_k \mid e) = \frac{P(X_k = x_k, e)}{P(e)} \]

For the sake of illustration, consider once again the Bayesian network of figure 4.2. Supposing that $e = \{B = \mathrm{Dead}\}$ is the set of all the known evidence, the belief that the battery is good is given by:

\[ P(BT = \mathrm{Good} \mid B = \mathrm{Dead}) = \frac{P(BT = \mathrm{Good}, B = \mathrm{Dead})}{P(B = \mathrm{Dead})} \]

By using the previously calculated probability values, we have:

\[ P(BT = \mathrm{Good} \mid B = \mathrm{Dead}) = \frac{0.083600}{0.200800} = 0.416335 \approx 41.63\% \]

Suppose now that a new piece of evidence becomes known, namely that the light bulb is burnt. The new belief that the battery is good is calculated next:

\[ P(BT = \mathrm{Good} \mid B = \mathrm{Dead}, LB = \mathrm{Burnt}) = \frac{P(BT = \mathrm{Good}, B = \mathrm{Dead}, LB = \mathrm{Burnt})}{P(B = \mathrm{Dead}, LB = \mathrm{Burnt})} \]

where the numerator is the sum over all values of V:

  LB       BT      V          Product                                Value
  Burnt    Good    Nominal    0.1000 × 0.8000 × 1.0000 × 0.9700      0.077600
  Burnt    Good    Average    0.1000 × 0.8000 × 1.0000 × 0.0150      0.001200
  Burnt    Good    Low        0.1000 × 0.8000 × 1.0000 × 0.0150      0.001200
                                                          Sum        0.080000

Hence $P(BT = \mathrm{Good}, B = \mathrm{Dead}, LB = \mathrm{Burnt}) = 0.080000 = 8.00\%$.

The probability $P(B = \mathrm{Dead}, LB = \mathrm{Burnt})$ is the sum over all values of BT and V:

  LB       BT            V          Product                                Value
  Burnt    Good          Nominal    0.1000 × 0.8000 × 1.0000 × 0.9700      0.077600
  Burnt    Good          Average    0.1000 × 0.8000 × 1.0000 × 0.0150      0.001200
  Burnt    Good          Low        0.1000 × 0.8000 × 1.0000 × 0.0150      0.001200
  Burnt    Weak          Nominal    0.1000 × 0.1000 × 1.0000 × 0.0200      0.000200
  Burnt    Weak          Average    0.1000 × 0.1000 × 1.0000 × 0.9700      0.009700
  Burnt    Weak          Low        0.1000 × 0.1000 × 1.0000 × 0.0100      0.000100
  Burnt    Discharged    Nominal    0.1000 × 0.1000 × 1.0000 × 0.0100      0.000100
  Burnt    Discharged    Average    0.1000 × 0.1000 × 1.0000 × 0.0200      0.000200
  Burnt    Discharged    Low        0.1000 × 0.1000 × 1.0000 × 0.9700      0.009700
                                                                Sum        0.100000

Hence $P(B = \mathrm{Dead}, LB = \mathrm{Burnt}) = 0.100000 = 10.00\%$. Therefore,

\[ P(BT = \mathrm{Good} \mid B = \mathrm{Dead}, LB = \mathrm{Burnt}) = \frac{0.080000}{0.100000} = 0.8000 = 80.00\% \]

This example demonstrates the capacity of Bayesian networks for non-monotonic reasoning: while the only known evidence was that the beam was dead, the belief that the battery was good was 41.63%; once it became known that the light bulb was burnt, the belief was recalculated and went up to 80.00%. This belief grows even more when the information that the measured voltage is within the range considered "Nominal" becomes available:

\[ P(BT = \mathrm{Good} \mid B = \mathrm{Dead}, LB = \mathrm{Burnt}, V = \mathrm{Nominal}) = \frac{P(BT = \mathrm{Good}, B = \mathrm{Dead}, LB = \mathrm{Burnt}, V = \mathrm{Nominal})}{P(B = \mathrm{Dead}, LB = \mathrm{Burnt}, V = \mathrm{Nominal})} \]

where:

\[ P(BT = \mathrm{Good}, B = \mathrm{Dead}, LB = \mathrm{Burnt}, V = \mathrm{Nominal}) = 0.1000 \times 0.8000 \times 1.0000 \times 0.9700 = 0.077600 = 7.76\% \]

and $P(B = \mathrm{Dead}, LB = \mathrm{Burnt}, V = \mathrm{Nominal})$ is the sum over all values of BT:

  LB       BT            V          Product                                Value
  Burnt    Good          Nominal    0.1000 × 0.8000 × 1.0000 × 0.9700      0.077600
  Burnt    Weak          Nominal    0.1000 × 0.1000 × 1.0000 × 0.0200      0.000200
  Burnt    Discharged    Nominal    0.1000 × 0.1000 × 1.0000 × 0.0100      0.000100
                                                                Sum        0.077900

Hence $P(B = \mathrm{Dead}, LB = \mathrm{Burnt}, V = \mathrm{Nominal}) = 0.077900 = 7.79\%$. Therefore,

\[ P(BT = \mathrm{Good} \mid B = \mathrm{Dead}, LB = \mathrm{Burnt}, V = \mathrm{Nominal}) = \frac{0.077600}{0.077900} = 0.996149 \approx 99.61\% \]

A Bayesian network constitutes a complete probabilistic model of a given domain and contains all the information necessary to respond to any probabilistic query involving the variables of this domain [Pearl, 1991]. In the worst case, the response to these queries involves obtaining probabilities through the sum of an exponential number of terms, corresponding to the individual configurations of the Bayesian network. It may be demonstrated that this problem, called probabilistic inference, is NP-hard [Cooper, 1990] [Dagum and Luby, 1993].

Several techniques have been developed to deal with the complexity of probabilistic inference in Bayesian networks. Some of them involve alterations in the network structure (cf. "Noisy-OR" and "Noisy-MAX" models in item 4.2.2); other techniques exploit known characteristics of Bayesian networks or give up the demand for exact results in order to simplify the calculations involved. The description and analysis of these techniques may be found in [Pearl, 1991] and are beyond the scope of this thesis. A good bibliography on the subject is found in [Pearl, 1993] and in [Heckerman and Wellman, 1995].
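The successive belief updates worked through in this item can be reproduced by exhaustive enumeration, which also makes the exponential cost visible, since every query loops over all configurations of the network. The Python sketch below is illustrative only and is not one of the specialized inference techniques cited above: the names `probability` and `belief` and the data layout are assumptions of this example, while the probability values are those of figure 4.2.

```python
from itertools import product

DOMAINS = {
    "LB": ("OK", "Burnt"),
    "BT": ("Good", "Weak", "Discharged"),
    "B": ("Bright", "Weak", "Dead"),
    "V": ("Nominal", "Average", "Low"),
}
P_LB = {"OK": 0.90, "Burnt": 0.10}
P_BT = {"Good": 0.80, "Weak": 0.10, "Discharged": 0.10}
# P(B | LB, BT), in the order (Bright, Weak, Dead).
P_B = {
    ("OK", "Good"): (0.99, 0.005, 0.005),
    ("OK", "Weak"): (0.01, 0.90, 0.09),
    ("OK", "Discharged"): (0.00, 0.01, 0.99),
}
# With a burnt bulb the beam is dead whatever the battery state.
P_B.update({("Burnt", bt): (0.0, 0.0, 1.0) for bt in DOMAINS["BT"]})
# P(V | BT), in the order (Nominal, Average, Low).
P_V = {
    "Good": (0.97, 0.015, 0.015),
    "Weak": (0.02, 0.97, 0.01),
    "Discharged": (0.01, 0.02, 0.97),
}


def joint(c):
    """Joint probability of one full configuration c (a dict)."""
    p_b = P_B[(c["LB"], c["BT"])][DOMAINS["B"].index(c["B"])]
    p_v = P_V[c["BT"]][DOMAINS["V"].index(c["V"])]
    return P_LB[c["LB"]] * P_BT[c["BT"]] * p_b * p_v


def probability(assignment):
    """P(assignment): sum of the joint over all consistent configurations."""
    names = list(DOMAINS)
    total = 0.0
    for values in product(*DOMAINS.values()):
        c = dict(zip(names, values))
        if all(c[k] == v for k, v in assignment.items()):
            total += joint(c)
    return total


def belief(query, evidence):
    """P(query | evidence) = P(query, evidence) / P(evidence)."""
    return probability({**query, **evidence}) / probability(evidence)


evidence = {"B": "Dead"}
print(round(belief({"BT": "Good"}, evidence), 6))  # 0.416335
evidence["LB"] = "Burnt"
print(round(belief({"BT": "Good"}, evidence), 6))  # 0.8
evidence["V"] = "Nominal"
print(round(belief({"BT": "Good"}, evidence), 6))  # 0.996149
```

Each added piece of evidence only changes the dictionary passed to `belief`; the model itself is untouched, which is what allows earlier conclusions to be revised as new information arrives.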

4.2.5 DXpress 2.0™, API-DX™ and WIN-DX™: An Example of Tools for the Development and Evaluation of Bayesian Networks

To correlate alarms in a telecommunications network it may be necessary to use a Bayesian network with hundreds or even thousands of nodes. The development of this network, as well as its subsequent use in making probabilistic inferences, will demand tools adequate to the complexity of the problem.

It is hard to decide, before a careful study, whether an adequate tool will have to be developed or whether the best option is to use an already available commercial product. The presentation of commercial products in this thesis has only illustrative aims, and does not imply an opinion as to the applicability of the products to the solution of any specific problem.

What is DXpress 2.0™

DXpress 2.0™ is a software product developed by the North American company Knowledge Industries, Inc., of Palo Alto, California, to aid in the development and testing of diagnosis applications using Bayesian networks [Knowledge Industries, Inc., 1996a]. The main modules of this product are:

Graphic Editor. It allows the specification of the nodes and of the edges of a Bayesian network, representing respectively the variables of the modeled domain and the qualitative expression of the causal relationships among these variables.

Node Editor. It supports the specification of the data associated with each network node, including: (1) the values allowed for the variable corresponding to the node; (2) the local probabilities distributions; (3) the declaration of the node as an observation node and/or a fault node. In order to reduce the complexity of the network evaluation, only the nodes declared as observation nodes can have the values of the corresponding variables instantiated during a diagnosis session. Similarly, only the nodes designated as fault nodes will have the probabilities of the states of the corresponding variables inferred. Not all the states of a fault variable will have their probabilities inferred during a diagnosis session, but only those states explicitly declared as fault states.

Group Editor. It allows the specification of hierarchical relations among faults or among observations.

Compiler. It transforms the knowledge base generated by DXpress 2.0™ into a file that can be used by the programs API-DX™ and WIN-DX™.

API-DX™ and WIN-DX™

WIN-DX™ is a program used to test diagnosis applications developed with the help of DXpress [Knowledge Industries, Inc., 1996b]. The program has three main windows:

Available Tests. It shows the tests that are available during a diagnosis session.

Possible Faults. It shows a list of possible faults, arranged in decreasing order of probability.

Working Window. It shows all the observations made so far. For testing purposes, the value of these observations may be modified.

API-DX™ is a library of programs, written in C++, that allows diagnostic capability to be added to an application developed by the user.

4.3 Description of the Model

A good model must hide the aspects of a system that are irrelevant for a given objective, preserving only the aspects necessary to the comprehension of the phenomena under study. In the present approach, the managed telecommunications network is partitioned into several sub-networks, according to the model introduced in chapter 3. The objects that constitute each sub-network are not visible externally and, consequently, are not individually considered.

The state of a sub-network is modeled by a random variable whose value is assigned based on information supplied by a network management system and/or by an alarm correlation system. Within each sub-network, the alarm correlation may be done by using one or more of the techniques identified in section 2.2, chosen in view of the peculiarities of the sub-network (cf. section 2.3).

The propagation of alarms among the sub-networks that compose the telecommunications network is modeled through the causal propagation among the nodes of a Bayesian network [Pearl, 1991]. To each node in the Bayesian network corresponds a random variable representing the state of a sub-network in the telecommunications network. Each edge (vi, vj) indicates that the probability information corresponding to node vi must be taken into consideration in the calculation of the local probabilities associated with node vj (cf. section 4.2).

Once the Bayesian network that models the telecommunications network is constructed, the appearance of alarms in a sub-network, whether sent directly by the network elements or obtained as a result of another correlation process, may cause a change of state of the corresponding variable. In this case, through diagnostic inference, the model allows the probabilities associated with the possible causes to be evaluated. On the other hand, through predictive inference it is possible to quantify the propensity of a given fault to cause the emission of an alarm. The model also allows intercausal inference, that is, the quantification of the increase in belief in a possible cause of an observed defect when new evidence is observed that contradicts another possible cause. Through intercausal inference it is also possible to quantify the reduction in this belief when new evidence appears that supports another possible cause [Henrion et al., 1991].

The result of the correlation carried out over the whole telecommunications network may be used to refine automatically the result of the correlation within each sub-network. For this to happen, whenever the value calculated for a fault probability in a sub-network exceeds a previously established threshold, actions should be taken, via the network management system, to collect more information, which will provide a more precise picture of the sub-network state. This may include intensive monitoring mechanisms, such as the one described in [Meira et al., 1991].

4.3.1 Presuppositions

The following presuppositions are assumed and constitute sine qua non conditions for the application of the model:

1. The managed telecommunications network is modeled in conformity with section 3.2 of this thesis;

2. For each sub-network, corresponding to a node in the graph of the general model of the telecommunications network, it is possible to define a discrete random variable whose values are representative of the state of the sub-network as far as the presence of alarms and/or faults is concerned;

3. The prior probabilities related to the variables of each root node (corresponding to the physical transmission media), as well as the local probabilities related to the variables of the other nodes, may be assigned, alternatively:

   a) Through the relative frequencies of the corresponding events, calculated from the data collected by the network management system;

   b) By using relative likeliness, which consists of estimating the probabilities from the subjective judgement of an expert. This method is useful whenever there is not enough data to permit the estimation of the relative frequencies, which may occur because the observed phenomena are infrequent or because sufficient network management resources do not exist;

4. There is an integrated network management system that collects information from which values are assigned, in real time, to the variables mentioned, possibly through alarm correlation at the level of each sub-network.

4.3.2 Construction of the Structure of the Bayesian Network

The first step towards the construction of the structure of a Bayesian network is the modeling of the object network (that is, the telecommunications network) in multi-network layers (cf. item 3.2.4). This model must be refined up to the point at which each sub-network instance corresponds to an individual node in the graph (cf. the example in figure 3.7).

In an object network model, the existence of a directed edge (vi, vj), departing from node vi and reaching node vj (figure 4.5), indicates that vi depends on vj, which means that the effects of a fault in vj may, potentially, be propagated directly to vi. In a Bayesian network, the possibility of direct propagation of the effects of a fault from a node vj to a node vi is modeled by a directed edge (vj, vi).

[Figure 4.5: Two Functionally Dependent Networks]

Therefore, to obtain the structure of a Bayesian network it is necessary to invert the direction of the arrows on the directed edges of the telecommunications network model, that is, to replace each edge (vi, vj) by an edge (vj, vi).

By convention, the edges of a Bayesian network always point from the top down. Because of this, the network is represented with the nodes corresponding to the physical transmission media occupying the upper part of the graph.

4.3.3 Definition of the Variables and their States

A single discrete variable must correspond to each sub-network. The variable has an arbitrary, though finite, number of possible values, so that the states of the sub-network modeled by the variable can be represented at the desired level of detail.
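The edge inversion described in section 4.3.2 can be sketched as a simple transformation over a list of edges. The function names `invert_edges` and `parents`, and the sample dependency edges, are hypothetical and are used here only to illustrate the rule "replace each edge (vi, vj) by an edge (vj, vi)".

```python
# Build the Bayesian network structure by reversing every "depends on" edge
# of the object network model (section 4.3.2). The sample edges below are
# illustrative only.


def invert_edges(dependency_edges):
    """Replace each edge (vi, vj), read 'vi depends on vj', by (vj, vi)."""
    return [(vj, vi) for (vi, vj) in dependency_edges]


def parents(bayes_edges):
    """Map each node to the sorted list of its parents in the Bayesian network."""
    result = {}
    for parent, child in bayes_edges:
        result.setdefault(parent, [])
        result.setdefault(child, []).append(parent)
    return {node: sorted(ps) for node, ps in result.items()}


if __name__ == "__main__":
    # Illustrative fragment: SDH depends on Opt-Net, ATM depends on SDH.
    object_model = [("SDH", "Opt-Net"), ("ATM", "SDH")]
    bayes_edges = invert_edges(object_model)
    print(bayes_edges)
    print(parents(bayes_edges))
```

Nodes that end up with an empty parent list are the root nodes of the Bayesian network, which by the convention above are drawn at the top of the graph.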

Sometimes it may be convenient to associate several discrete variables with a single sub-network. Suppose, for example, that it is desirable to associate with a given sub-network SR the variables V1, V2, ..., Vn, for which SV1, SV2, ..., SVn possible states are defined, respectively. To do so, a variable VSR can be defined with SV1 × SV2 × ... × SVn possible states, each of which is assigned one of the possible combinations of the states of the original variables.

The values defined for a random variable must be collectively exhaustive (that is, all the values that may be assumed by the variable must be defined) and mutually exclusive (that is, a variable can assume, at a given instant, only one of the values defined). As a consequence, if {a1, a2, ..., an} is the set of states defined for a variable and P(a1), P(a2), ..., P(an) are the probabilities associated with each of these states, then P(a1) + P(a2) + ... + P(an) = 1.

The definition of the states of the variables is an important stage in the construction of a Bayesian network and frequently represents, by itself, an interesting modeling challenge [Heckerman and Wellman, 1995].

As an example, the following states could be defined for a random variable named "Exception Condition in the SR Sub-network":

• FAULT-STATE. When there are faults in the SR sub-network, identified by any means, including alarm correlation at the level of the SR sub-network;

• CRITICAL ALARM. When there are no faults in the SR sub-network but there are, among the active alarms in this sub-network, one or more alarms with critical severity level [ITU-T, 1992o];

• ALARM. When there are no faults in the sub-network but there are active alarms, none of which has critical severity level;

• NORMAL. When there are no active alarms and no fault has been identified in the sub-network.

4.3.4 Specification of the Local Probabilities Distributions

Let vi be a network that uses the services of the vj network, as illustrated in figure 4.5. A fault in the vj network may produce consequences on the vi network, but this will not always happen. In fact, some faults in the vj network may happen in parts of the network that do not contribute to the service offered to the vi network. At other times, the fault occurs in a part of the network that has active redundancy, in such a way that the fault is transparent to the vi network. It may also happen that a fault in vj is perceived by vi as a reduction in the performance of the service offered by vj, but not as a fault in this network.

The analytical determination of the degree of dependence of the network vi on the network vj is generally difficult without a deep knowledge of the architecture and functionality of the two networks and of the relationship between them. On the other hand, the occurrence of an exception situation in the vi network as a consequence of an event (for example, a fault) in the vj network may be suitably modeled through a conditional probability, which may be estimated by experts or calculated from experimental data collected by the network management system.

Thus, for each variable of the Bayesian network (equivalent to a sub-network in the telecommunications network model) a local probabilities distribution has to be specified (cf. item 4.2.2).

4.3.5 Dynamics of the Correlation Process

Given the presuppositions established in item 4.3.1, alarm correlation in a telecommunications network consists of five independent processes, which are repeated indefinitely:

1. Alarm acquisition by the network management system;

2. Classification of the alarms received, according to time windows (temporal limitation) and originating sub-network (functional limitation) (cf. item 4.1.2);

3. Correlation at sub-network level, from alarms originally generated by the network elements or obtained through other correlation processes, depending on the correlation topology adopted (cf. item 2.1.7). This process is not mandatory at first, as it can be implemented gradually in each sub-network, according to necessity and taking into consideration the peculiarities of each one;

4. Updating of the values of the random variables corresponding to each sub-network, according to the state of these sub-networks, as given by the alarms received and by the result of the correlations carried out on these alarms;

5. Alarm correlation at the level of the telecommunications network, through the evaluation of the new probabilities associated with the fault states defined for each sub-network, in view of the evidence available so far (cf. item 4.2.4).

4.3.6 Complexity and Performance of Alarm Correlation through Bayesian Networks

In the general case, probabilistic inference through Bayesian networks is an NP-hard problem [Cooper, 1990] [Baase, 1988]. Consequently, no polynomial-time algorithm is known that gives exact solutions for all kinds of Bayesian networks. [Dagum and Luby, 1993] proved that, in the general case, even obtaining approximate solutions is NP-hard.

Therefore, for satisfactory performance to be obtained, inference algorithms must be developed taking into account the most frequently found cases and not the worst case, for which the proposed solutions will probably work only for small problems.

Three important factors may be identified as having an impact on the performance of a probabilistic inference algorithm in Bayesian networks:

1. The total number of edges which, in the worst case, is O(n^2), where n is the number of nodes of the network (cf. figure 4.6);

2. The number of parent nodes per node. As seen in item 4.2.2, the size of the link matrix associated with a node grows exponentially with the number of parent variables;

3. The number of states per random variable. The total number of possible configurations in a Bayesian network with n nodes (or variables) and k states per variable is k^n.

[Figure 4.6: A Bayesian Network with n(n-1)/2 Edges (nodes v1, v2, v3, ..., vn-1, vn)]

4.4 Final Considerations

The solution proposed in this chapter is similar to the approach taken by SIS (System for the Integration of Supervision) [Meira and Lages, 1988] to the problem of integrating supervisory or network management systems. The motivation for developing SIS was not to have yet another data collection or remote command system. Its purpose was, in a few words, to consolidate the information generated by other network management systems, so as to offer the operators an integrated view of the telecommunications plant. Similarly, the correlation model presented here is not intended as an alternative to the alarm correlation approaches presented in section 2.2 but, rather, as a novel solution for alarm correlation across a whole telecommunications network, which uses correlation results offered by other systems and integrates them to obtain a single view of the managed plant.

The following points are left for future work: (1) the extension of the correlation model to the lower levels of the telecommunications system architecture (for example, to the sub-network level) and (2) the use of alarm correlation methods other than the Bayesian network method (for example, fuzzy logic or coding, cf. items 2.2.2 and 2.2.9), both at the level that encompasses the whole telecommunications network and at the lower levels.
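To make the growth rates quoted in section 4.3.6 concrete, the three quantities can be evaluated for small networks. This is a sketch with hypothetical function names; the formulas are the ones given in the text (n(n-1)/2 edges in the worst case, k^n configurations), plus the assumption that a node and all of its parents have k states each, so that a full link matrix has k^(p+1) entries for p parents.

```python
# Growth of the quantities discussed in section 4.3.6.
# Function names are illustrative; the sizes used below are arbitrary examples.


def max_edges(n):
    """Worst-case number of edges in a Bayesian network with n nodes."""
    return n * (n - 1) // 2


def link_matrix_entries(k, num_parents):
    """Entries of a full link matrix when the node and its parents have k states."""
    return k ** (num_parents + 1)


def configurations(n, k):
    """Total number of joint configurations of n variables with k states each."""
    return k ** n


if __name__ == "__main__":
    n, k = 17, 4  # example sizes chosen for illustration
    print("max edges:", max_edges(n))
    for p in range(1, 4):
        print("parents:", p, "link matrix entries:", link_matrix_entries(k, p))
    print("configurations:", configurations(n, k))
```

The exponential terms dominate quickly: adding one parent multiplies the size of a full link matrix by k, and adding one node multiplies the number of configurations by k.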

Chapter 5

A Prototype

This chapter presents a prototype, constructed to demonstrate how the concepts introduced in chapters 3 and 4 may be used in the development of a system for alarm correlation in a telecommunications network.

This chapter is structured in four parts. Section 5.1 describes the object network. Section 5.2 describes the Bayesian network constructed according to the methodology formulated in the previous chapters. Section 5.3 contains experimentation results, based on 16 different cases (or scenarios). Section 5.4 contains some final considerations on the prototype.

5.1 Description of the Object Network

The object network for the prototype is the telecommunications network of figure 3.10. This network was chosen because it includes several of the most important sub-network classes that compose a modern telecommunications network; in this respect it is a "canonical" network, although it does not necessarily correspond to any real case.

Twenty-one sub-network classes are represented in figure 3.10. The object network contains only one instance of each of these classes. As we do not intend to model, in this thesis, the functioning of the user networks (cf. item 3.2.2), these will not be considered from here on (cf. table 5.1).

5.2 Construction of the Bayesian Network

The Bayesian network was constructed according to the methodology formulated in section 4.3, using the DXpress 2.0™ tool (cf. item 4.2.5).

Sub-network   Description
B-ISDN        Network through which the user has access to the B-ISDN (Broadband Integrated Services Digital Network) services
X.25          Network through which the user has access to the packet-switched data communication service, X.25 standard
Cellular      Network through which the user has access to the mobile cellular telephony service
N-ISDN        Network through which the user has access to the N-ISDN (Narrowband Integrated Services Digital Network) services
POTS          Network through which the user has access to the Plain Old Telephone Services (POTS)
ATM           Asynchronous Transfer Mode (ATM) network
Fr-Relay      Frame relay network
IDN           Integrated digital network
A-Switch      Analog switching network
SS-7          ITU-T No. 7 Common Channel Signaling Network
SDH           Synchronous Digital Hierarchy (SDH) network
PDH           Plesiochronous Digital Hierarchy (PDH) network
FDM           Frequency Division Multiplex (FDM) analog network
Opt-Net       Optical cable network
Radio         Physical network constituted by terrestrial radio links
Satellites    Physical network whose links are constituted by satellites and associated terrestrial stations
Coax-cables   Coaxial cable network

Table 5.1: Sub-networks that constitute the Example Network

N1: Opt-Net
States        Probabilities
Normal        0.9950
Alarm         0.0035
Critical      0.0014
Fault         0.0001

Table 5.2: Probabilities Distribution for the Opt-Net Node

5.2.1 Structure

The Bayesian network structure (figure 5.1) was constructed based on the network model (figure 3.10). Opt-Net, Radio, Satellites and Coax-Cables are source nodes (that is, they do not have predecessors). To facilitate the construction of the respective link matrices and to reduce the number of conditional probabilities to be specified, the nodes PDH, FDM, SS-7, Fr-Relay, IDN, A-Switch, X.25, Cellular and POTS, which have two or more predecessors, were defined as Noisy-MAX nodes (cf. item 4.2.2). For each of the remaining nodes, which have just one predecessor, a conditional probabilities distribution has been defined.

5.2.2 Variables and their States

According to item 4.3.3, the choice of the states of each variable is one of the most important stages in the construction of a Bayesian network. Because of this, each sub-network must be studied in detail, in order to identify a set of states that is representative of its functioning and that synthesizes the most important parameters to be observed, as well as the existence of alarms and/or faults. In the present case, since this is an example network, it was not possible to invest much in the study of the several sub-networks; instead, we adopted the same set of states for each of the variables that represent these sub-networks. The chosen states were fault-state, critical alarm, alarm and normal; the corresponding definitions are the ones presented in item 4.3.3.

5.2.3 Local Probabilities Distributions

The local probabilities distributions have been estimated taking into account the type of each node and are presented in tables 5.2 to 5.18. The criterion used for the estimates consisted of simulating the work of an expert in each of the sub-networks, as would happen in a real case.

Each probabilities distribution represents a synthesis of the knowledge available on the behavior of a given sub-network, as far as the occurrence of exception situations is

[Figure 5.1: Structure of the Bayesian Network for the Example Telecommunications Network. Nodes: N1 Opt-Net, N2 Radio, N3 Satellites, N4 Coax-Cables, N5 SDH, N6 PDH, N7 FDM, N8 SS-7, N9 ATM, N10 Fr-Relay, N11 IDN, N12 A-Switch, N13 B-ISDN, N14 X.25, N15 Cellular, N16 N-ISDN, N17 POTS. Nine nodes are labeled "Max".]

N2: Radio
States        Probabilities
Normal        0.9900
Alarm         0.0005
Critical      0.0055
Fault         0.0040

Table 5.3: Distribution of Probabilities for the Radio Node

N3: Satellites
States        Probabilities
Normal        0.9950
Alarm         0.0035
Critical      0.0012
Fault         0.0003

Table 5.4: Distribution of Probabilities for the Satellites Node

N4: Coax-Cables
States        Probabilities
Normal        0.9800
Alarm         0.0000
Critical      0.0000
Fault         0.0200

Table 5.5: Distribution of Probabilities for the Coax-Cables Node

States of     Probabilities, given that Opt-Net state is:
N5: SDH       Normal    Alarm     Critical  Fault
Normal        0.9500    0.0850    0.0500    0.0300
Alarm         0.0350    0.9000    0.9300    0.9000
Critical      0.0148    0.0148    0.0198    0.0697
Fault         0.0002    0.0002    0.0002    0.0003

Table 5.6: Distribution of Probabilities for the SDH Node
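Since each column of a conditional table such as table 5.6 is a probability distribution over the states of the child node, two simple checks can be run on the transcribed values: that every column sums to 1, and what marginal distribution the child has when the parent follows its prior (table 5.2). The sketch below uses hypothetical names (`SDH_GIVEN_OPT_NET`, `bad_columns`, `marginal`) and is not part of the tool used in the thesis.

```python
# Sanity checks on a conditional table, using table 5.6 (SDH given Opt-Net)
# and the prior of table 5.2 (Opt-Net). Names are illustrative.

STATES = ["Normal", "Alarm", "Critical", "Fault"]

# Prior distribution of N1: Opt-Net (table 5.2), in STATES order.
OPT_NET_PRIOR = [0.9950, 0.0035, 0.0014, 0.0001]

# SDH_GIVEN_OPT_NET[parent_state] lists P(SDH = s | Opt-Net = parent_state)
# for s in STATES order, i.e. one column of table 5.6.
SDH_GIVEN_OPT_NET = {
    "Normal":   [0.9500, 0.0350, 0.0148, 0.0002],
    "Alarm":    [0.0850, 0.9000, 0.0148, 0.0002],
    "Critical": [0.0500, 0.9300, 0.0198, 0.0002],
    "Fault":    [0.0300, 0.9000, 0.0697, 0.0003],
}


def bad_columns(table, tol=1e-9):
    """Return the parent states whose column does not sum to 1."""
    return [p for p, column in table.items() if abs(sum(column) - 1.0) > tol]


def marginal(prior, table):
    """P(child = s) = sum over parent states of P(child = s | parent) P(parent)."""
    result = [0.0] * len(STATES)
    for p_parent, parent_state in zip(prior, STATES):
        for i, p_child in enumerate(table[parent_state]):
            result[i] += p_parent * p_child
    return dict(zip(STATES, result))


if __name__ == "__main__":
    print("columns not summing to 1:", bad_columns(SDH_GIVEN_OPT_NET))
    for state, p in marginal(OPT_NET_PRIOR, SDH_GIVEN_OPT_NET).items():
        print(f"P(SDH = {state}) = {p:.6f}")
```

The `marginal` function is a one-edge instance of predictive inference: it propagates the prior of the root node down to its only child.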

States of     Probabilities, given that SDH state is:
N9: ATM       Normal    Alarm     Critical  Fault
Normal        0.9400    0.0850    0.0500    0.0300
Alarm         0.0450    0.9000    0.9300    0.9000
Critical      0.0148    0.0148    0.0197    0.0694
Fault         0.0002    0.0002    0.0003    0.0006

Table 5.7: Distribution of Probabilities for the ATM Node

States of     Probabilities, given that ATM state is:
N13: B-ISDN   Normal    Alarm     Critical  Fault
Normal        0.9200    0.6000    0.2000    0.1000
Alarm         0.0600    0.3700    0.2000    0.1000
Critical      0.0196    0.0295    0.5000    0.2000
Fault         0.0004    0.0005    0.1000    0.6000

Table 5.8: Distribution of Probabilities for the B-ISDN Node

States of     Probabilities, given that IDN state is:
N16: N-ISDN   Normal    Alarm     Critical  Fault
Normal        0.9500    0.8000    0.6000    0.5000
Alarm         0.0400    0.1800    0.2000    0.1000
Critical      0.0098    0.0196    0.1994    0.3000
Fault         0.0002    0.0004    0.0006    0.1000

Table 5.9: Distribution of Probabilities for the N-ISDN Node

States of     Probabilities, given that N1: Opt-Net state is:
N6: PDH       Normal    Alarm     Critical  Fault
Normal        1.0000    0.9000    0.3000    0.1000
Alarm         0.0000    0.0500    0.2500    0.3000
Critical      0.0000    0.0499    0.3500    0.4000
Fault         0.0000    0.0001    0.1000    0.2000

States of     Probabilities, given that N2: Radio state is:
N6: PDH       Normal    Alarm     Critical  Fault
Normal        1.0000    0.9000    0.8000    0.7000
Alarm         0.0000    0.1000    0.1000    0.0500
Critical      0.0000    0.0000    0.0500    0.1500
Fault         0.0000    0.0000    0.0500    0.1000

States of     Probabilities, given that N3: Satellites state is:
N6: PDH       Normal    Alarm     Critical  Fault
Normal        1.0000    0.8000    0.7000    0.6000
Alarm         0.0000    0.2000    0.1000    0.1000
Critical      0.0000    0.0000    0.1000    0.1500
Fault         0.0000    0.0000    0.1000    0.1500

States of     "Leak" Probabilities
N6: PDH       ("Background")
Normal        0.9500
Alarm         0.0400
Critical      0.0080
Fault         0.0020

Table 5.10: Distribution of Probabilities for the PDH Node

States of     Probabilities, given that N2: Radio state is:
N7: FDM       Normal    Alarm     Critical  Fault
Normal        1.0000    0.9000    0.6000    0.1000
Alarm         0.0000    0.0900    0.2000    0.2000
Critical      0.0000    0.0100    0.1500    0.1000
Fault         0.0000    0.0000    0.0500    0.6000

States of     Probabilities, given that N3: Satellites state is:
N7: FDM       Normal    Alarm     Critical  Fault
Normal        1.0000    0.9900    0.9500    0.9000
Alarm         0.0000    0.0100    0.0400    0.0700
Critical      0.0000    0.0000    0.0050    0.0200
Fault         0.0000    0.0000    0.0050    0.0100

States of     Probabilities, given that N4: Coax-Cables state is:
N7: FDM       Normal    Alarm     Critical  Fault
Normal        1.0000    0.1000    0.0500    0.0002
Alarm         0.0000    0.8000    0.5000    0.0018
Critical      0.0000    0.0990    0.4000    0.3500
Fault         0.0000    0.0010    0.0500    0.6480

States of     "Leak" Probabilities
N7: FDM       ("Background")
Normal        0.9000
Alarm         0.0900
Critical      0.0095
Fault         0.0005

Table 5.11: Distribution of Probabilities for the FDM Node

States of     Probabilities, given that N5: SDH state is:
N8: SS-7      Normal    Alarm     Critical  Fault
Normal        1.0000    0.0001    0.0001    0.0001
Alarm         0.0000    0.9900    0.8000    0.2000
Critical      0.0000    0.0090    0.1500    0.4000
Fault         0.0000    0.0009    0.0499    0.3999

States of     Probabilities, given that N6: PDH state is:
N8: SS-7      Normal    Alarm     Critical  Fault
Normal        1.0000    0.9000    0.8000    0.1000
Alarm         0.0000    0.1000    0.0500    0.2000
Critical      0.0000    0.0000    0.1400    0.6500
Fault         0.0000    0.0000    0.0100    0.0500

States of     "Leak" Probabilities
N8: SS-7      ("Background")
Normal        0.0100
Alarm         0.9850
Critical      0.0049
Fault         0.0001

Table 5.12: Distribution of Probabilities for the SS-7 Node

States of     Probabilities, given that N5: SDH state is:
N10: Fr-Relay Normal    Alarm     Critical  Fault
Normal        1.0000    0.9999    0.9500    0.3000
Alarm         0.0000    0.0000    0.0000    0.0000
Critical      0.0000    0.0000    0.0000    0.0000
Fault         0.0000    0.0001    0.0500    0.7000

States of     Probabilities, given that N6: PDH state is:
N10: Fr-Relay Normal    Alarm     Critical  Fault
Normal        1.0000    0.9999    0.9999    0.9700
Alarm         0.0000    0.0000    0.0000    0.0000
Critical      0.0000    0.0000    0.0000    0.0000
Fault         0.0000    0.0001    0.0001    0.0300

States of     "Leak" Probabilities
N10: Fr-Relay ("Background")
Normal        0.9700
Alarm         0.0000
Critical      0.0000
Fault         0.0300

Table 5.13: Distribution of Probabilities for the Fr-Relay Node
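The per-parent blocks and the "leak" column of a Noisy-MAX node such as Fr-Relay (table 5.13) can be combined into a single distribution once the states of the parents are known. The sketch below assumes the usual Noisy-MAX formulation, in which the states are ordered Normal < Alarm < Critical < Fault and the cumulative distributions contributed by each cause and by the leak are multiplied. Whether the tool used in the thesis computes the combination in exactly this way is an assumption, and the names `noisy_max`, `GIVEN_SDH`, `GIVEN_PDH` and `LEAK` are hypothetical.

```python
# Noisy-MAX combination for the Fr-Relay node (table 5.13), assuming the
# usual formulation: P(Y <= y | parents) is the product of the cumulative
# distributions contributed by each parent and by the leak.
from itertools import accumulate

STATES = ["Normal", "Alarm", "Critical", "Fault"]  # assumed severity order

# Columns of table 5.13: distribution over Fr-Relay states per parent state.
GIVEN_SDH = {
    "Normal":   [1.0000, 0.0, 0.0, 0.0000],
    "Alarm":    [0.9999, 0.0, 0.0, 0.0001],
    "Critical": [0.9500, 0.0, 0.0, 0.0500],
    "Fault":    [0.3000, 0.0, 0.0, 0.7000],
}
GIVEN_PDH = {
    "Normal":   [1.0000, 0.0, 0.0, 0.0000],
    "Alarm":    [0.9999, 0.0, 0.0, 0.0001],
    "Critical": [0.9999, 0.0, 0.0, 0.0001],
    "Fault":    [0.9700, 0.0, 0.0, 0.0300],
}
LEAK = [0.9700, 0.0, 0.0, 0.0300]


def noisy_max(columns):
    """Combine per-cause distributions (leak included) into one distribution."""
    cdf = [1.0] * len(STATES)
    for column in columns:
        cdf = [c * f for c, f in zip(cdf, accumulate(column))]
    # Turn the combined cumulative distribution back into point probabilities.
    probs = [cdf[0]] + [cdf[i] - cdf[i - 1] for i in range(1, len(cdf))]
    return dict(zip(STATES, probs))


if __name__ == "__main__":
    dist = noisy_max([GIVEN_SDH["Fault"], GIVEN_PDH["Normal"], LEAK])
    for state, p in dist.items():
        print(f"P(Fr-Relay = {state} | SDH = Fault, PDH = Normal) = {p:.4f}")
```

Under these assumptions, with SDH = Fault and PDH = Normal the probability of Fr-Relay = Fault is 1 - 0.3000 × 0.9700 = 0.709. The advantage of the parametrization is visible in the data structures: two 4 × 4 blocks and one leak column replace a full link matrix with 4 × 4 × 4 entries.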

States of     Probabilities, given that N5: SDH state is:
N11: IDN      Normal    Alarm     Critical  Fault
Normal        1.0000    0.9900    0.9500    0.3000
Alarm         0.0000    0.0100    0.0450    0.4000
Critical      0.0000    0.0000    0.0040    0.2800
Fault         0.0000    0.0000    0.0010    0.0200

States of     Probabilities, given that N6: PDH state is:
N11: IDN      Normal    Alarm     Critical  Fault
Normal        1.0000    0.9900    0.9000    0.5000
Alarm         0.0000    0.0100    0.0500    0.2000
Critical      0.0000    0.0000    0.0400    0.1000
Fault         0.0000    0.0000    0.0100    0.2000

States of     Probabilities, given that N8: SS-7 state is:
N11: IDN      Normal    Alarm     Critical  Fault
Normal        1.0000    0.9000    0.8000    0.4000
Alarm         0.0000    0.0900    0.1000    0.1000
Critical      0.0000    0.0100    0.0500    0.3000
Fault         0.0000    0.0000    0.0500    0.2000

States of     "Leak" Probabilities
N11: IDN      ("Background")
Normal        0.9500
Alarm         0.0400
Critical      0.0000
Fault         0.0100

Table 5.14: Distribution of Probabilities for the IDN Node

States of     Probabilities, given that N6: PDH state is:
N12: A-Switch Normal    Alarm     Critical  Fault
Normal        1.0000    1.0000    0.9000    0.7000
Alarm         0.0000    0.0000    0.0000    0.0500
Critical      0.0000    0.0000    0.0000    0.0500
Fault         0.0000    0.0000    0.1000    0.2000

States of     Probabilities, given that N7: FDM state is:
N12: A-Switch Normal    Alarm     Critical  Fault
Normal        1.0000    0.7000    0.6000    0.1000
Alarm         0.0000    0.2000    0.1000    0.1000
Critical      0.0000    0.0700    0.2000    0.6000
Fault         0.0000    0.0300    0.1000    0.2000

States of     "Leak" Probabilities
N12: A-Switch ("Background")
Normal        0.2000
Alarm         0.7000
Critical      0.0990
Fault         0.0010

Table 5.15: Distribution of Probabilities for the A-Switch Node

States of     Probabilities, given that N9: ATM state is:
N14: X.25     Normal    Alarm     Critical  Fault
Normal        1.0000    0.9990    0.9000    0.8000
Alarm         0.0000    0.0010    0.0100    0.0100
Critical      0.0000    0.0000    0.0800    0.0100
Fault         0.0000    0.0000    0.0100    0.1800

States of     Probabilities, given that N10: Fr-Relay state is:
N14: X.25     Normal    Alarm     Critical  Fault
Normal        1.0000    0.9000    0.8500    0.5000
Alarm         0.0000    0.0900    0.0500    0.0500
Critical      0.0000    0.0100    0.0800    0.0500
Fault         0.0000    0.0000    0.0200    0.4000

States of     "Leak" Probabilities
N14: X.25     ("Background")
Normal        0.9700
Alarm         0.0100
Critical      0.0100
Fault         0.0100

Table 5.16: Distribution of Probabilities for the X.25 Node

States of     Probabilities, given that N11: IDN state is:
N15: Cellular Normal    Alarm     Critical  Fault
Normal        1.0000    0.9500    0.9000    0.8000
Alarm         0.0000    0.0500    0.0500    0.0500
Critical      0.0000    0.0000    0.0499    0.0500
Fault         0.0000    0.0000    0.0001    0.1000

States of     Probabilities, given that N12: A-Switch state is:
N15: Cellular Normal    Alarm     Critical  Fault
Normal        1.0000    0.9500    0.9000    0.7000
Alarm         0.0000    0.0500    0.0200    0.0500
Critical      0.0000    0.0000    0.0500    0.0500
Fault         0.0000    0.0000    0.0300    0.2000

States of     "Leak" Probabilities
N15: Cellular ("Background")
Normal        0.8000
Alarm         0.0800
Critical      0.0700
Fault         0.0500

Table 5.17: Distribution of Probabilities for the Cellular Node

States of     Probabilities, given that N11: IDN state is:
N17: POTS     Normal    Alarm     Critical  Fault
Normal        1.0000    0.9900    0.9800    0.4000
Alarm         0.0000    0.0100    0.0100    0.1000
Critical      0.0000    0.0000    0.0100    0.3000
Fault         0.0000    0.0000    0.0000    0.2000

States of     Probabilities, given that N12: A-Switch state is:
N17: POTS     Normal    Alarm     Critical  Fault
Normal        1.0000    0.9800    0.9000    0.6000
Alarm         0.0000    0.0100    0.0200    0.1000
Critical      0.0000    0.0100    0.0300    0.1500
Fault         0.0000    0.0000    0.0500    0.1500

States of     "Leak" Probabilities
N17: POTS     ("Background")
Normal        0.1000
Alarm         0.5000
Critical      0.2000
Fault         0.2000

Table 5.18: Distribution of Probabilities for the POTS Node

concerned.

The local probabilities distributions also reflect characteristics of the way the sub-networks are managed, as well as some striking properties of their architecture, as can be seen in the examples that follow:

1. For the telecommunications network in question, it has been assumed that network N4: Coax-Cables has no network management features available and is, therefore, unobservable. As a consequence, the probabilities P(N4=Alarm) and P(N4=Critical Alarm) have been set to 0 (zero). The same is true for the N10: Fr-Relay network.

2. Table 5.6 shows that the fault probability in the SDH network varies very little with the presence or absence of faults in the optical network. This could be explained by the existence of alternative routes in the optical network, so that faults in this sub-network could be covered by automatic reconfiguration of the sub-network itself.

3. A fault in the ATM network has a great impact on the B-ISDN network, which indicates that the ATM network possibly has few alternative resources that could minimize the effects of its own faults on the networks that are functionally dependent on it.

A variety of similar inferences could be made about the architecture of the sub-networks and the relationships among them, as well as about the network management systems themselves and/or the alarm correlation within the sub-networks.

There are several ways of interpreting what a "Fault" state may be. For example, whereas for a given sub-network "Fault" may indicate that a significant share of its resources is unavailable, for another sub-network the state with the same name may only indicate that some component has failed, which does not generally affect the services provided by the sub-network. From one sub-network to another the meaning of the expression "significant share" may also vary, due, for example, to the existence of active redundancy of network resources.

Thus, all the knowledge (or lack of knowledge) available on the architecture and operation of the sub-networks, as well as on the functional relationships among sub-networks, is synthesized in the probabilities distributions, which become more and more precise as the knowledge of the managed network grows. This is a striking characteristic of the model.

5.3 Experimentation

The experimentation consisted of simulating the reception of alarms which, in real situations, might have been collected by a network management system. Each set of alarms "received" constitutes a case.

For each new observation (that is, for each new alarm received) the tool used (WIN-DX™, cf. item 4.2.5) recalculates the whole Bayesian network, supplying the new probabilities related to each possible fault (or Problem) in each of the sub-networks that compose the telecommunications network. Nevertheless, in order to facilitate the presentation of the results, the probabilities associated with the possible problems, in each of the cases shown next, refer only to the cumulative occurrence of the whole set of observations.

In each case it is considered that the set of observations is the only information available on the managed system. The values of each of the other variables are considered unknown.

The Possible Problems are presented in decreasing order of probability. The tool lists all the possible problems (even those with zero probability); for the sake of clarity, only the problems with the greatest probabilities are listed in the cases that follow.

5.3.1 Case 1

Observations:
(None)

Possible Problems:
POTS Fault: 21%
Cellular Fault: 6%
Fr-Relay Fault: 3%
X.25 Fault: 2%
Coax-Cables Fault: 2%
FDM Critical Alarm: 2%
FDM Fault: 2%
IDN Fault: 1%
A-Switch Fault: 1%
Radio Fault: 0%

Even in the absence of any alarm, the system indicates a 21% probability of a fault in the POTS network and 6% in the cellular network, besides probabilities of up to 3% for other networks. This is not surprising, since the respective "background" probabilities (that is, the prior fault probabilities) were specified as being, for example, 20% for the POTS network, 5% for the cellular network and 3% for the Fr-Relay network.

A "background" probability represents the prior probability of occurrence of a state, independently of the configuration present in the remainder of the Bayesian network.

5.3.2 Case 2

Observations:
ATM network: Fault

Possible Problems:
ATM Fault: 100%
B-ISDN Fault: 60%
POTS Fault: 21%
X.25 Fault: 20%
Cellular Fault: 6%
Fr-Relay Fault: 3%
Coax-Cables Fault: 2%

5.3.3 Case 3

Observations:
ATM network: Fault
SDH network: Fault

Possible Problems:
SDH Fault: 100%
ATM Fault: 100%
Fr-Relay Fault: 71%
B-ISDN Fault: 60%
X.25 Fault: 42%
SS-7 Fault: 40%
POTS Fault: 23%
IDN Fault: 13%
Cellular Fault: 7%
Coax-Cables Fault: 2%

5.3.4 Case 4

Observations:
SS-7 network: Critical Alarm

Possible Problems:
POTS Fault: 23%
PDH Fault: 18%
IDN Fault: 10%
Cellular Fault: 7%
A-Switch Fault: 6%
123Fr-Relay Fault: 5%FDM Fault: 4%Radio Fault: 4%X.25 Fault: 3%5.3.5 Case 5Observations:SS-7 network: FaultPossible Problems:SS-7 Fault: 100%POTS Fault: 25%IDN Fault: 23%PDH Fault: 13%Fr-Relay Fault: 11%Cellular Fault: 8%SDH Fault: 7%X.25 Fault: 5%A-Switch Fault: 5%5.3.6 Case 6Observations:Opt-Net: FaultPossible Problems:Opt-Net Fault: 100%POTS Fault: 23%PDH Fault: 20%A-Switch Fault: 9%Cellular Fault: 8%IDN Fault: 7%Fr-Relay Fault: 4%X.25 Fault: 3%Coax-Cables Fault: 2%SS-7 Fault: 2%FDM Critical Alarm: 2%FDM Fault: 2%N-ISDN Fault: 1%SDH Fault: 0%5.3.7 Case 7Observations:

124POTS network: FaultPossible Problems:POTS Fault: 100%Cellular Fault: 6%Fr-Relay Fault: 3%A-Switch Fault: 2%5.3.8 Case 8Observations:POTS network: FaultFDM network: FaultPossible Problems:POTS Fault: 100%FDM Fault: 100%Coax-Cables Fault: 80%A-Switch Fault: 26%Radio Fault: 15%Cellular Fault: 12%Fr-Relay Fault: 3%IDN Fault: 2%5.3.9 Case 9Observations:POTS network: FaultFDM network: FaultRadio: Critical AlarmPossible Problems:POTS Fault: 100%FDM Fault: 100%A-Switch Fault: 27%Coax-Cables Fault: 21%Cellular Fault: 12%PDH Fault: 6%IDN Fault: 4%Fr-Relay Fault: 3%X.25 Fault: 2%SS-7 Fault: 0%5.3.10 Case 10Observations:

125POTS network: FaultFDM network: FaultRadio: FaultPossible Problems:POTS Fault: 100%FDM Fault: 100%Radio Fault: 100%A-Switch Fault: 29%Cellular Fault: 13%PDH Fault: 12%IDN Fault: 6%Fr-Relay Fault: 3%Coax-Cables Fault: 3%5.3.11 Case 11Observations:PDH network: FaultFDM network: FaultPossible Problems:PDH Fault: 100%FDM Fault: 100%Radio Fault: 83%A-Switch Fault: 36%POTS Fault: 30%IDN Fault: 24%Cellular Fault: 15%Coax-Cables Fault: 14%Fr-Relay Fault: 6%SS-7 Fault: 5%5.3.12 Case 12Observations:PDH network: FaultFDM network: FaultSatellites network: FaultPossible Problems:PDH Fault: 100%FDM Fault: 100%Satellites Fault: 100%Coax-Cables Fault: 48%A-Switch Fault: 36%

126POTS Fault: 30%IDN Fault: 24%Cellular Fault: 15%Radio Fault: 14%Fr-Relay Fault: 6%SS-7 Fault: 5%5.3.13 Case 13Observations:PDH network: FaultFDM network: FaultSatellites network: NormalPossible Problems:PDH Fault: 100%FDM Fault: 100%Radio Fault: 84%A-Switch Fault: 36%POTS Fault: 30%IDN Fault: 24%Cellular Fault: 15%Coax-Cables Fault: 13%Fr-Relay Fault: 6%SS-7 Fault: 5%5.3.14 Case 14Observations:PDH network: FaultFDM network: FaultSatellites network: NormalRadio network: NormalPossible Problems:PDH Fault: 100%FDM Fault: 100%Coax-Cables Fault: 96%A-Switch Fault: 36%POTS Fault: 30%IDN Fault: 24%Cellular Fault: 15%Fr-Relay Fault: 6%SS-7 Fault: 5%
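
Cases 11 to 14 show the probability of a Coax-Cables fault rising from 14% to 96% once the radio and satellites networks are observed to be normal, which is consistent with the remaining candidate cause absorbing the belief when alternative causes are ruled out. The sketch below reproduces this pattern qualitatively by exact enumeration over a deliberately tiny, hypothetical network. The three-cause structure, the noisy-OR combination, and the values of `PRIOR`, `STRENGTH` and `LEAK` are assumptions made for illustration only and are not the prototype's parameters.

```python
from itertools import product

# Hypothetical prior fault probabilities of three candidate causes.
PRIOR = {"radio": 0.10, "satellites": 0.02, "coax": 0.02}
STRENGTH = 0.9   # assumed chance that one faulty cause produces an FDM fault
LEAK = 0.01      # assumed chance of an FDM fault when no cause is faulty


def p_fdm_fault(state):
    """Noisy-OR probability of an FDM fault given {cause: is_faulty}."""
    p_no_fault = 1.0 - LEAK
    for is_faulty in state.values():
        if is_faulty:
            p_no_fault *= 1.0 - STRENGTH
    return 1.0 - p_no_fault


def posterior(query, evidence):
    """P(query is faulty | FDM fault observed, evidence) by enumeration."""
    causes = list(PRIOR)
    numerator = denominator = 0.0
    for values in product([False, True], repeat=len(causes)):
        state = dict(zip(causes, values))
        # Skip joint states that contradict the observed evidence.
        if any(state[c] != v for c, v in evidence.items()):
            continue
        weight = p_fdm_fault(state)
        for c in causes:
            weight *= PRIOR[c] if state[c] else 1.0 - PRIOR[c]
        denominator += weight
        if state[query]:
            numerator += weight
    return numerator / denominator


before = posterior("coax", {})
after = posterior("coax", {"radio": False, "satellites": False})
print(f"coax fault, FDM fault observed: {before:.2f}")
print(f"coax fault, radio and satellites also observed normal: {after:.2f}")
```

Enumeration of this kind is only practical for very small networks; it is used here because it makes every step of the calculation visible.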

5.3.15 Case 15

Observations:
FDM network: Fault
Satellites network: Normal
Radio network: Normal
PDH network: Normal

Possible Problems:
FDM Fault: 100%
Coax-Cables Fault: 96%
POTS Fault: 25%
A-Switch Fault: 20%
Cellular Fault: 11%
Fr-Relay Fault: 3%

5.3.16 Case 16

Observations:
SS-7 network: Fault
POTS network: Normal
Cellular network: Normal
N-ISDN network: Normal

Possible Problems:
SS-7 Fault: 100%
Fr-Relay Fault: 11%
PDH Fault: 9%
SDH Fault: 6%
IDN Fault: 6%
X.25 Fault: 5%
FDM Fault: 2%

5.4 Final Considerations

Generally speaking, the results of the experiments were broadly satisfactory.

The tools utilized proved adequate for the manipulation of Bayesian networks of the size of the prototype (17 nodes). For larger networks, the choice of suitable tools may require further study, covering both the performance and the ease of development of Bayesian networks.
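
One reason why tool choice may matter for larger networks is that the number of joint configurations grows exponentially with the number of nodes. The following sketch simply counts those configurations; the figure of three states per node is an assumption (the prototype's actual state sets are not restated here), and practical inference tools exploit the network structure rather than enumerating every configuration, so the count is an upper-bound illustration only.

```python
def joint_configurations(num_nodes, states_per_node):
    """Number of joint states of num_nodes variables with equal state counts."""
    return states_per_node ** num_nodes


# 17 nodes as in the prototype; 3 states per node is an assumed value.
for nodes in (17, 34):
    print(nodes, "nodes:", joint_configurations(nodes, 3), "joint states")
```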

Chapter 6

Conclusions

This chapter synthesizes the main contributions of this thesis and its main limitations. It also points out what must still be achieved so that the results obtained may be applied to alarm correlation in a real telecommunications network.

6.1 Summary of the Results Obtained

After the development of some modern network management systems, such as the System for the Integration of Supervision (SIS) [Meira and Lages, 1988], collecting management information on the elements of a telecommunications network ceased to be a great challenge as a research area. Indeed, one of the greatest challenges today consists precisely of reducing, through the correlation of the alarms received, the volume of information available in the management centers, without affecting the semantic content of this information, so as to facilitate the work of the operators of these management centers.

It is known that, in the general case, alarm correlation in a telecommunications network is an NP-complete problem, which means that a polynomial algorithm to solve it is not known (cf. item 2.1.4). Due to this fact, the survey carried out in chapter 2 was not necessarily aimed at finding an exact solution to the problem, but rather approximate solutions that would allow application in real telecommunications networks. In the researched literature, nevertheless, concrete proposals for the solution to the problem have not been found.

To make up for this lack of concrete proposals, a novel model for alarm correlation, which may be applied to any telecommunications network, has been developed (cf. chapter 4).

Another contribution of this thesis consisted of developing a general model of telecommunications networks. The model allows the specification of functional dependences among the sub-networks, or among the sub-network classes, which compose a given telecommunications network, thus facilitating its understanding. The model is robust and may be easily adapted to reflect expansions and architectural or technological alterations in the modeled network (cf. chapter 3).

Several other contributions have been made in this thesis, among which we point out:

a) The organization of the main concepts related to alarm correlation and fault diagnosis in telecommunications networks. The presentation of a panoramic view of alarm correlation in telecommunications networks, through the gathering of the main approaches existing in the literature. The classification of these approaches, according to the methods and algorithms utilized. The comparison among these methods and algorithms, having as a parameter the nature of the application for which the correlation is intended. The review of the main products, solutions and platforms for alarm correlation (cf. chapter 2).

b) The classification of the main approaches found in the literature on telecommunications network modeling, according to the objectives and scope of the models (cf. chapter 3).

c) The modeling of a canonical telecommunications network and the development of a prototype for alarm correlation in this network (cf. chapter 5).

6.2 Limitations

The recursive multifocal correlation technique, which is part of the alarm correlation model developed in this thesis (cf. section 4.1), is characterized by the possibility of executing several parallel correlation processes asynchronously. If, on the one hand, this improves performance, on the other hand it introduces the possibility of inconsistency among the results of the several correlations, which is a limitation of the model.

The alarm correlation model proposed is based on the synthesis of the state of each sub-network in a single variable, whose value is estimated based on the alarm information collected by the network management system.
The value of the variable representing the state of a sub-network is also influenced by information given by alarm correlation systems at the sub-network level itself or at lower levels, including the network element level. Errors or imprecisions in any of these processes affect the precision of the correlation at the telecommunications network level. This is also a limitation of the model presented.

Another limitation consists of the difficulty in obtaining precise values for the probabilities associated with the random variables, whether they are estimated by experts (due to the subjectivity factor) or statistically calculated from data obtained from the regular functioning of the telecommunications network (due to the small size of some samples).
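
The effect of small samples on statistically calculated probabilities can be made concrete with a confidence interval. The sketch below uses the Wilson score interval, a standard technique that this thesis does not itself prescribe; the fault counts are hypothetical and serve only to show how much wider the plausible range is when few observations are available.

```python
import math


def wilson_interval(faults, trials, z=1.96):
    """Approximate 95% Wilson score interval for a fault probability."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = faults / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return centre - half, centre + half


# Hypothetical counts: the same 20% observed rate from 10 and from 1000 periods.
small = wilson_interval(2, 10)
large = wilson_interval(200, 1000)
print("10 observations:   %.3f to %.3f" % small)
print("1000 observations: %.3f to %.3f" % large)
```

With the same observed rate, the interval from the small sample is several times wider, which is the imprecision referred to above.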

6.3 Future Work

In order to enable the application of the concepts formulated in this work to the correlation of the alarms generated in a real telecommunications network, several additional tasks must still be carried out, the main ones of which are listed below.

6.3.1 Telecommunications Network Modeling

By utilizing the methodology presented in chapter 3, all the network classes present in the telecommunications network must be identified and distributed in as many layers as necessary, based on the identification of the functional affinities and the functional dependences among them. Next, for each network class identified, it is necessary to find all the network instances present in the telecommunications network. Finally, the graph that represents the functional dependences among the sub-networks must be constructed, and the variables corresponding to each one of them and the respective possible states must be defined.

6.3.2 Alarm Correlation at Sub-Network Level

The model presented in this thesis does not impose any restriction on the criteria to be adopted in alarm correlation at the sub-network level. This correlation may be a simple process, which merely summarizes the present alarms (for example, classifying them by severity level), or a more elaborate process, involving, for example, the identification of faults from the alarms generated by the sub-network. The more precise the correlation at sub-network level, the more precise the correlation at the level of the whole telecommunications network.

6.3.3 Determination of the Local Probability Distributions

This consists of the specification of a local probability distribution for each variable of the Bayesian network, making use of estimates made by experts or based on experimental data collected by a network management system.

6.3.4 Acquisition of the Real Time Correlation Engine

It is necessary to identify or to develop a tool which will allow the evaluation of the Bayesian network, in real time, with the desired precision and performance.

6.3.5 Extensions to the Model

Two interesting extensions to the correlation model proposed in this thesis are easily identified:

1. The utilization of other methods and algorithms for correlation at the telecommunications network level, in addition to the method adopted in this thesis (that is, Bayesian networks);

2. The detailing of the lower levels in the recursive model, to meet the need for correlation at the lower levels of the telecommunication system architecture.
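
The last step of the modeling task described in section 6.3.1 (constructing the graph of functional dependences and defining one variable, with its possible states, per sub-network) can be sketched as follows. The sub-network names are borrowed from chapter 5, but the dependences listed in `DEPENDS_ON`, the state names in `STATES` and the helper `build_variables` are hypothetical and chosen only for illustration. The sketch assumes Python 3.9 or later for the standard `graphlib` module, and it treats the sub-networks on which a sub-network depends as its parents, which is an assumption about the arc direction.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical functional dependences: each sub-network maps to the
# sub-networks it depends on. Not the dependences of the thesis prototype.
DEPENDS_ON = {
    "POTS": ["PDH", "SS-7"],
    "SS-7": ["PDH"],
    "PDH": ["Radio"],
    "Radio": [],
}

# Assumed state set, named after values seen in the observations of section 5.3.
STATES = ("Normal", "Critical Alarm", "Fault")


def build_variables(depends_on):
    """Return one variable per sub-network, supporting sub-networks first.

    Raises graphlib.CycleError if the dependences are not acyclic, which a
    Bayesian network requires.
    """
    order = list(TopologicalSorter(depends_on).static_order())
    return [
        {"name": name, "states": STATES, "parents": list(depends_on.get(name, []))}
        for name in order
    ]


for variable in build_variables(DEPENDS_ON):
    print(variable["name"], "<-", variable["parents"])
```

Ordering the variables so that supporting sub-networks come first also gives a natural order in which to specify the local probability distributions of section 6.3.3.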

Bibliography

[AGL Systems Inc., 1996] AGL Systems Inc. NOAA Plus 7, 1996. http://www.agl.com/-page plus7.html.

[Appeldorn et al., 1993] Menso Appeldorn, Roberto Kung, and Roberto Saracco. TMN + IN = TINA. IEEE Communications Magazine, pages 78-85, Mar 1993.

[Aydemir and Tanzini, 1996] Buket Aydemir and Joe Tanzini. An ATM network-view model for integrated layer-network management. In IEEE/IFIP 1996 Network Operations and Management Symposium (NOMS'96) [1996], pages 690-699.

[Baase, 1988] Sara Baase. Computer Algorithms: Introduction to Design and Analysis. Addison-Wesley, Reading, Massachusetts, USA, 2nd edition, 1988.

[Bapat, 1994] Subodh Bapat. Object-Oriented Networks: Models for Architecture, Operations, and Management. PTR Prentice Hall, Englewood Cliffs, NJ, USA, 1994.

[Benton, 1952] William Benton, editor. The Works of Aristotle, volume I. Encyclopaedia Britannica, Chicago, USA, 1952. Posterior Analytics, Book I, Chapter 3.

[Benz and Leischner, 1993] Ch. Benz and M. Leischner. A high level specification technique for modeling networks and their environments including semantic aspects. In IFIP International Symposium on Integrated Network Management, III (ISINM'93) [1993], pages 29-43.

[Bouloutas et al., 1994] A. T. Bouloutas, S. Calo, and A. Finkel. Alarm correlation and fault identification in communication networks. IEEE Transactions on Communications, 42(2/3/4):523-533, Feb/Mar/Apr 1994.

[Brugnoni et al., 1993] Simona Brugnoni, Guido Bruno, Roberto Manione, Enrico Montariolo, Elio Paschetta, and Luisella Sisto. An expert system for real time fault diagnosis of the italian telecommunications network. In IFIP International Symposium on Integrated Network Management, III (ISINM'93) [1993], pages 617-628.

[Buntine, 1996] Wray Buntine. Graphical Models for Discovering Knowledge, pages 59-82. In Fayyad et al. [1996], 1996.

[Burnell and Horvitz, 1995] Lisa Burnell and Eric Horvitz. Structure and chance: Melding logic and probability for software debugging. Communications of the ACM, 38(3):31-41, 57, Mar 1995.

[Campbell and Everitt, 1992] L.H. Campbell and H.J. Everitt. A layered approach to network management control. In Network Operations and Management Symposium'92 (NOMS'92) [1992], pages 46-56.

[Case et al., 1990] J. Case, M. Fedor, M. Schoffstall, and J. Davin. A Simple Network Management Protocol (SNMP). Request for Comments 1157, 1990. Internet Engineering Task Force (IETF).

[Chan, 1994] Edward Chan. An object-oriented network model for the development of integrated network management systems. In 26th Southeastern Symposium on System Theory, pages 359-364, Athens, OH, USA, Mar 1994.

[Charniak, 1991] Eugene Charniak. Bayesian networks without tears. AI Magazine, (Winter 1991):50-63, 1991.

[Chen and Liu, 1994] Thomas M. Chen and Steve S. Liu. Management and control functions in ATM switching systems. IEEE Network, 8(4):27-40, July/August 1994.

[Chen and Rao, 1993] Jian-Liang Chen and Nutakki D. Rao. A fuzzy expert system for fault diagnosis in electric distribution systems. In Canadian Conference on Electrical and Computer Engineering - CCECE'93, pages 1283-6 vol. 2, Vancouver, BC, Canada, Sept 1993. IEEE.

[Cooper, 1987] G. F. Cooper. Probabilistic inference using belief networks is NP-Hard. Technical Report KSL-87-27, Medical Computer Science Group, Stanford University, 1987.

[Cooper, 1990] G.F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42:393-405, 1990.

[Cornily et al., 1993] Jean-Michel Cornily, Dennis Doherty, John Ellson, Pramila Mullan, Christine Pageot-Millet, and Thierry Stiphant. Application of ODP modelling techniques to SDH network management. In IEEE Global Telecommunications Conference (GLOBECOM 93) [1993], pages 646-652.

[Covo et al., 1989] A.A. Covo, T.M. Moruzzi, and E.D. Peterson. AI-assisted telecommunications network management. In IEEE Global Telecommunications Conference (GLOBECOM 89), pages 487-491, Dallas, TX, USA, Nov 1989.

[Cronk et al., 1988] R. Cronk, P. Callahan, and L. Bernstein. Rule-based expert systems for network management and operations: an introduction. IEEE Network, 2(5):7-21, 1988.

[Czarnecki et al., 1996] Przemyslaw Czarnecki, Andrzej Jajszczyk, and Marek Wilkosz. Comparison criteria for management information models (MIMs): a way of analyzing MIMs. In IEEE/IFIP 1996 Network Operations and Management Symposium (NOMS'96) [1996], pages 666-673.

[Dagum and Luby, 1993] P. Dagum and M. Luby. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60:141-153, 1993.

[Davis et al., 1982] R. Davis, H. Shrobe, W. Hamscher, K. Wieckert, M. Shirley, and S. Polit. Diagnosis based on description of structure and functions. In National Conference on Artificial Intelligence, pages 137-142, Pittsburgh, PA, 1982.

[de Andrade, 1995] Márcio Migueletto de Andrade. Concepção, projeto e implementação do mecanismo de distribuição de dados do Sistema Integrado de Supervisão - SIS. Master's thesis, Federal University of Minas Gerais, Belo Horizonte, Brazil, 1995.

[de la Fuente et al., 1995] L.A. de la Fuente, M. Kawanishi, M. Wakano, T. Walles, and C. Aurrecoechea. Application of the TINA-C management architecture. In IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95) [1995], pages 424-435.

[de Pricker, 1995] Martin de Pricker. Asynchronous Transfer Mode: Solution for Broadband ISDN. Prentice Hall, London, Great Britain, 3rd edition, 1995.

[Deng et al., 1993] Robert H. Deng, Aurel A. Lazar, and Weiguo Wang. A probabilistic approach to fault diagnosis in linear lightwave networks. In IFIP International Symposium on Integrated Network Management, III (ISINM'93) [1993], pages 697-708.

[Dreo and Valta, 1995] Gabi Dreo and Robert Valta. Using master tickets as a storage for problem-solving expertise. In IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95) [1995], pages 328-340.

[Duffy, 1996] Jim Duffy. Net control start-up gets SMARTS. Network World, January 1996. http://www.smarts.com/news/network world.html.

[Ejiri, 1995] Masayoshi Ejiri. The paradigm shift in telecommunications services and networks. In IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95) [1995], pages 688-699.

[Fatato, 1996] Massimo Fatato. Modeling telecommunications networks' transmission systems. IEEE Communications Magazine, 34(3):40-47, March 1996.

[Fayyad et al., 1996] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthrusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI Press / The MIT Press, Menlo Park, California, USA, 1996.

[Filipiak et al., 1993] J. Filipiak, A. Lombardo, and S. Palazzo. Design of network management architectures for heterogeneous networks using object oriented approach. In IFIP International Symposium on Integrated Network Management, III (ISINM'93) [1993], pages 59-70.

[Fink et al., 1993] Barbara Fink, Heribert Baldus, Marita Möller, and Rolf Kraemer. An integrating architecture for networked systems management. In IEEE International Conference on Communications'93 (ICC 93) [1993], pages 8-12.

[Frey and Lewis, 1997] J. Frey and L. Lewis. Multi-level reasoning for managing distributed enterprises and their networks. In Integrated Network Management V [1997], pages 5-16.

[Fröhlich et al., 1996] Peter Fröhlich, Wolfgang Nejdl, Klaus Jobmann, and Hermann Wietgrefe. Model-based alarm correlation in cellular phone networks. Technical Report, Institut für Rechnergestützte Wissensverarbeitung, University of Hannover, 1996.

[Frontini et al., 1991] M. Frontini, J. Griffin, and S. Towers. A knowledge-based system for fault localization in wide area networks. In IFIP International Symposium on Integrated Network Management, II (ISINM'91) [1991], pages 519-530.

[Frost and Melamed, 1994] Victor S. Frost and Benjamin Melamed. Traffic modeling for telecommunications networks. IEEE Communications Magazine, 32(3):70-81, March 1994.

[Fung and Favero, 1995] Robert Fung and Brendan Del Favero. Applying bayesian networks to information retrieval. Communications of the ACM, 38(3):42-48, 57, Mar 1995.

[Furley, 1996] Nick Furley. OSS architecture framework. In IEEE/IFIP 1996 Network Operations and Management Symposium (NOMS'96) [1996], pages 656-665.

[Gensym Corp., 1995] Gensym Corp. Gensym introduces operations assistant (OPA) toolkit for intelligent network management applications, October 1995. http://www.-gensim.com/recentnews/pr15.htm.

[Gering, 1993] Michael Gering. CMIP versus SNMP. In IFIP International Symposium on Integrated Network Management, III (ISINM'93) [1993], pages 347-359.

[Gillespie and Rees, 1996] Alex Gillespie and Simon Rees. Access network management modeling. IEEE Communications Magazine, 34(3):62-72, March 1996.

[GLO, 1993] IEEE Global Telecommunications Conference (GLOBECOM 93), Houston, TX, USA, Nov 1993.

[Goodman and Latin, 1991] Rodney M. Goodman and Hayes Latin. Automated knowledge acquisition from network management databases. In IFIP International Symposium on Integrated Network Management, II (ISINM'91) [1991], pages 541-549.

[Goodman et al., 1993] Rodney M. Goodman, Barry Ambrose, Hayes Latin, and Sandee Finnell. A hybrid expert system / neural network traffic advice system. In IFIP International Symposium on Integrated Network Management, III (ISINM'93) [1993], pages 607-616.

[Goodman et al., 1995] R.M. Goodman, B.E. Ambrose, H.W. Latin, and C.T. Ulmer. NOAA - an expert system managing the telephone network. In IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95) [1995], pages 316-327.

[Goyal and Worrest, 1988] Shri K. Goyal and Ralph W. Worrest. Expert System Applications to Network Management, pages 3-44. In Liebowitz [1988], 1988.

[Gupta et al., 1985] Madan M. Gupta, Abraham Kandel, Wyllis Bandler, and Jerzy B. Kiszka, editors. Approximate Reasoning in Expert Systems. North-Holland, Amsterdam, The Netherlands, 1985.

[Hall and Magedanz, 1993] Jane Hall and Thomas Magedanz. Uniform modeling of management and telecommunication services in future telecommunication environments based on the ROSA approach. In IFIP International Symposium on Integrated Network Management, III (ISINM'93) [1993], pages 521-532.

[Hall, 1987] N. Green Hall. A Fuzzy Decision Support System for Strategic Planning, pages 77-90. In Sanchez and Zadeh [1987], 1987.

[Hamming, 1950] R. W. Hamming. Error detecting and error correcting codes. Bell System Tech. J., 29:147-160, Apr 1950.

[Hätönen et al., 1996] Kimmo Hätönen, Mika Klemettinen, Heikki Mannila, Pirjo Ronkainen, and Hannu Toivonen. TASA: Telecommunication Alarm Sequence Analyzer or How to enjoy faults in your network. In IEEE/IFIP 1996 Network Operations and Management Symposium (NOMS'96) [1996], pages 520-529.

[Hayes-Roth, 1987] B. Hayes-Roth. Blackboard Systems, pages 73-80. In Shapiro et al. [1987], 1987.

[Heckerman and Wellman, 1995] David Heckerman and Michael P. Wellman. Bayesian networks. Communications of the ACM, 38(3):27-30, Mar 1995.

[Heckerman et al., 1995a] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49-57, Mar 1995.

[Heckerman et al., 1995b] David Heckerman, Abe Mamdani, and Michael P. Wellman. Real-world applications of bayesian networks. Communications of the ACM, 38(3):24-26, Mar 1995.

[Heckerman, 1996] David Heckerman. Bayesian Networks for Knowledge Discovery, pages 273-305. In Fayyad et al. [1996], 1996.

[Hedberg, 1996] Sara Reese Hedberg. AI's impact in telecommunications - today and tomorrow. IEEE Expert, 11(1):6-9, February 1996. Also at http:www.computer.org/-pubs/expert/1996/insights/x10006/insig ht.html.

[Henkind et al., 1987] S.J. Henkind, R.R. Yager, A.M. Benis, and M.C. Harrison. A Clinical Alarm System Using Techniques from Artificial Intelligence and Fuzzy Set Theory, pages 91-104. In Sanchez and Zadeh [1987], 1987.

[Henrion et al., 1991] Max Henrion, John S. Breese, and Eric J. Horvitz. Decision analysis and expert systems. AI Magazine, (Winter 1991):64-91, 1991.

[Hewlett Packard, 1995a] Hewlett Packard. HP announces new functionality in HP OpenView solution framework for managing multiple domains throughout an enterprise, May 1995. http://www.hp.com:80/csopress/95may31b.html.

[Hewlett Packard, 1995b] Hewlett Packard. HP OpenView event correlation for the telecommunications environment: Technology brief, September 1995.

[Hewlett Packard, 1996a] Hewlett Packard. HP introduce la nueva solución "openview event-correlation" y el mejoramiento de plataformas para la industria de telecomunicaciones, May 1996. http://www.dmo.hp.com/latinamerica/colombia/calendar/press/96may29j.html.

[Hewlett Packard, 1996b] Hewlett Packard. HP OpenView event correlation services: Technical evaluation guide, April 1996. Version 1.0.

[Hoel et al., 1971] Paul G. Hoel, Sidney C. Port, and Charles J. Stone. Introduction to Probability Theory. Houghton Mifflin, Boston, USA, 1971.

[Holland et al., 1986] John H. Holland, Keith J. Holyoak, Richard E. Nisbett, and Paul R. Thagard. Induction: Processes of Inference, Learning, and Discovery. The Massachusetts Institute of Technology, Cambridge, USA, 1986.

[Hood and Ji, 1997] C.S. Hood and C. Ji. Automated proactive anomaly detection. In Integrated Network Management V [1997], pages 688-699.

[Houck et al., 1995] K. Houck, S. Calo, and A. Finkel. Towards a practical alarm correlation system. In IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95) [1995], pages 226-237.

[ICC, 1993] IEEE International Conference on Communications'93 (ICC 93), 1993.

[IF Computer, 1996] IF Computer. IF/Projects: Network management, January 1996. http://www.biz.isar.de/ifcomputer/FILES/ifreproj network.html.

[IM9, 1997] IFIP International Symposium on Integrated Network Management, V (IM'97), San Diego, CA, USA, May 1997. Elsevier Science (North-Holland).

[ISI, 1991] IFIP International Symposium on Integrated Network Management, II. Elsevier Science (North-Holland), 1991.

[ISI, 1993] IFIP International Symposium on Integrated Network Management, III (ISINM'93), San Francisco, CA, USA, 1993. Elsevier Science (North-Holland).

[ISI, 1995] IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95), Santa Barbara, CA, USA, 1995. Chapman & Hall.

[ITU-T, 1991a] ITU-T. Recommendation X.710: Common Management Information Service definition for CCITT applications, 1991.

[ITU-T, 1991b] ITU-T. Recommendation X.711: Common Management Information Protocol specification for CCITT applications, 1991.

[ITU-T, 1992a] ITU-T. Recommendation G.774: Synchronous Digital Hierarchy SDH: Management information model for the network element view, September 1992.

[ITU-T, 1992b] ITU-T. Recommendation M.3020: TMN interface specification methodology, October 1992.

[ITU-T, 1992c] ITU-T. Recommendation M.3100: Generic network information model, October 1992.

[ITU-T, 1992d] ITU-T. Recommendation M.3180: Catalogue of TMN management information, October 1992.

[ITU-T, 1992e] ITU-T. Recommendation M.3200: TMN management services: Overview, October 1992.

[ITU-T, 1992f] ITU-T. Recommendation M.3300: TMN management capabilities presented at the F interface, October 1992.

[ITU-T, 1992g] ITU-T. Recommendation M.3400: TMN management functions, October 1992.

[ITU-T, 1992h] ITU-T. Recommendation Q.1201: Principles of intelligent network architecture, October 1992.

[ITU-T, 1992i] ITU-T. Recommendation X.700: Management framework for Open Systems Interconnection (OSI) for CCITT applications, September 1992.

[ITU-T, 1992j] ITU-T. Recommendation X.701: Information technology - Open Systems Interconnection - systems management overview, 1992.

[ITU-T, 1992k] ITU-T. Recommendation X.720: Information technology - Open Systems Interconnection - structure of management information: Management information model, January 1992.

[ITU-T, 1992l] ITU-T. Recommendation X.721: Information technology - Open Systems Interconnection - structure of management information: Definition of management information, 1992.

[ITU-T, 1992m] ITU-T. Recommendation X.722: Information technology - Open Systems Interconnection - structure of management information: Guidelines for the Definition of Managed Objects, 1992.

[ITU-T, 1992n] ITU-T. Recommendation X.730: Information technology - Open Systems Interconnection - systems management: Object management function, 1992.

[ITU-T, 1992o] ITU-T. Recommendation X.733: Information technology - Open Systems Interconnection - systems management: Alarm reporting function, 1992.

[ITU-T, 1992p] ITU-T. Recommendation X.735: Information technology - Open Systems Interconnection - systems management: Log control function, September 1992.

[ITU-T, 1993a] ITU-T. Recommendation G.803: Digital networks: Architectures of transport networks based on the Synchronous Digital Hierarchy (SDH), March 1993.

[ITU-T, 1993b] ITU-T. Recommendation Q.700: Introduction to CCITT signalling system no. 7, March 1993.

[ITU-T, 1993c] ITU-T. Recommendation Q.811: Lower layer protocol profiles for the Q3 interface, March 1993.

[ITU-T, 1993d] ITU-T. Recommendation Q.812: Upper layer protocol profiles for the Q3 interface, March 1993.

[ITU-T, 1993e] ITU-T. Recommendation X.723: Information technology - Open Systems Interconnection - structure of management information: Generic management information, November 1993.

[ITU-T, 1993f] ITU-T. Recommendation X.734: Information technology - Open Systems Interconnection - systems management: Event report management function, 1993.

[ITU-T, 1994a] ITU-T. Recommendation M.3000: Overview of TMN recommendations, October 1994.

[ITU-T, 1994b] ITU-T. Recommendation X.200: Information technology - Open System Interconnection - Basic Reference Model: The Basic Model, July 1994.

[ITU-T, 1995a] ITU-T. Recommendation G.704: Synchronous frame structures used at 1544, 6312, 2048, 8488 and 44736 kbit/s hierarchical levels, July 1995.

[ITU-T, 1995b] ITU-T. Recommendation M.3100: Generic network information model, July 1995.

[ITU-T, 1995c] ITU-T. Recommendation X.790: Data networks and open system communications - trouble management function for ITU applications, November 1995.

[ITU-T, 1996] ITU-T. Recommendation M.3010: Principles for a Telecommunications Management Network, May 1996.

[Jakobson and Weissman, 1993] Gabriel Jakobson and Mark D. Weissman. Alarm correlation. IEEE Network, 7(6):52-59, November 1993.

[Jakobson and Weissman, 1995] G. Jakobson and M. Weissman. Real-time telecommunication network management: extending event correlation with temporal constraints. In IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95) [1995], pages 290-301.

[Kano, 1996] Sadahiko Kano. Open network and internetworking in multi-service provider environment. In IEEE/IFIP 1996 Network Operations and Management Symposium (NOMS'96) [1996], pages 24.4.1-24.4.13.

[Kätker and Paterok, 1997] S. Kätker and M. Paterok. Fault isolation and event correlation for integrated fault management. In Integrated Network Management V [1997], pages 583-596.

[Katzela and Schwartz, 1995] Irene Katzela and Misha Schwartz. Schemes for fault identification in communication networks. IEEE Transactions on Networking, 3(6):753-764, December 1995.

[Katzela et al., 1995] I. Katzela, A.T. Bouloutas, and S.B. Calo. Centralized vs distributed fault localization. In IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95) [1995], pages 250-261.

[Katzela et al., 1996] Irene Katzela, A.T. Bouloutas, and S. Calo. Comparison of distributed fault identification schemes in communication networks. Technical report, IBM Corp., T.J. Watson Research Center, Yorktown Heights, NY, USA, January 1996.

[Kehl and Hopfmüller, 1993] Walter Kehl and Heinrich Hopfmüller. Model-based reasoning for the management of telecommunication networks. In IEEE International Conference on Communications'93 (ICC 93) [1993], pages 13-17.

[Kheradpir et al., 1992] Shaygan Kheradpir, Willis Stinson, Renu Chipalkatti, and Greg Bossert. Managing the network manager. IEEE Communications Magazine, pages 12-21, Jul 1992.

[Kheradpir et al., 1993] Shaygan Kheradpir, Willis Stinson, Jelena Vucetic, and Alexander Gersht. Real-time management of telephone operating company networks: Issues and approaches. IEEE Journal on Selected Areas in Communications, 11(9):1385-1403, December 1993.

[Kirsch and Kroschel, 1994] Harald Kirsch and Khristian Kroschel. Applying bayesian networks to fault diagnosis. In Third IEEE Conference on Control Applications, pages 895-900. IEEE, 1994.

[Kitami, 1996] Kenichi Kitami. Future of Japan's info-communication infrastructures. In IEEE/IFIP 1996 Network Operations and Management Symposium (NOMS'96) [1996], pages 1-8. Separate hand-out.

[Klerer, 1993] S. Mark Klerer. System management information modeling. IEEE Communications Magazine, pages 38-44, May 1993.

[Kliger et al., 1995] S. Kliger, S. Yemini, Y. Yemini, D. Ohsie, and S. Stolfo. A coding approach to event correlation. In IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95) [1995], pages 266-277.

[Klösgen, 1996] Willi Klösgen. Knowledge Discovery in Database Terminology, pages 573-592. In Fayyad et al. [1996], 1996.

[Knowledge Industries, Inc., 1996a] Knowledge Industries, Inc. DXpress 2.0 for Windows and Windows NT, 1996. Product Manual.

[Knowledge Industries, Inc., 1996b] Knowledge Industries, Inc. WIN-DX 2.0 for Windows and Windows NT, 1996. Product Manual.

[Korth and Silberschatz, 1989] Henry F. Korth and Abraham Silberschatz. Sistemas de Bancos de Dados. McGraw-Hill, São Paulo, SP, 1989.

[Lazar et al., 1992] Aurel A. Lazar, Weiguo Wang, and Robert H. Deng. Models and algorithms for network fault detection and identification: A review. In IEEE International Conference on Communications'92 (ICC 92), pages 999-1003, Singapore, November 1992.

[Lebailly et al., 1987] J. Lebailly, R. Martin-Clouaire, and H. Prade. Use of Fuzzy Logic in a Rule-Based System, in Petroleum Geology, pages 125-144. In Sanchez and Zadeh [1987], 1987.

[Lengdell et al., 1996] Magnus Lengdell, Juan Pavon, Masaki Wakano, Martin Chapman, and Motoharu Kawanishi. The TINA network resource model. IEEE Communications Magazine, 34(3):74-79, March 1996.

[Lewis and Dreo, 1993] Lundy Lewis and Gabi Dreo. Extending trouble ticket systems to fault diagnostics. IEEE Network, 7(6):44-51, November 1993.

[Lewis, 1993] Lundy Lewis. A case-based reasoning approach to the resolution of faults in communications networks. In IFIP International Symposium on Integrated Network Management, III (ISINM'93) [1993], pages 671–682.

[Liebowitz, 1988] J. Liebowitz, editor. Expert System Applications to Telecommunications. John Wiley and Sons, New York, NY, USA, 1988.

[Lirov, 1993] Yuval Lirov. Fuzzy logic for distributed systems troubleshooting. In Fuzzy Systems International Conference 1993, pages 986–991. IEEE, 1993.

[Luna and Correa Filho, 1992] Henrique P. L. Luna and Milton Correa Filho. A probabilistic and informational basis to optimize expert systems. Investigación Operativa, 2(3):273–296, June 1992.

[Luna, 1994] Henrique Pacca L. Luna. Sistemas de apoio à decisão, 1994. Department of Computer Science. Federal University of Minas Gerais, Brazil.

[Magedanz, 1995] T. Magedanz. Modeling IN-based service control capabilities as part of TMN-based service management. In IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95) [1995], pages 386–397.

[Magendanz, 1993] Thomas Magendanz. IN and TMN: the basis for future information networking architecture. Computer Communications, 16(5):267–276, May 1993.

[Manione and Montanari, 1995] Roberto Manione and Fabio Montanari. Validation and extension of fault management applications through environment simulation. In IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95) [1995], pages 238–249.

[Manione and Paschetta, 1994] Roberto Manione and Elio Paschetta. An inconsistencies tolerant approach in the fault diagnosis of telecommunications networks. In Network Operations and Management Symposium'94 (NOMS'94) [1994], pages 459–469.

[Mansfield et al., 1993] G. Mansfield, K. Jayanthi, K. Higuchi, Y. Nemoto, and S. Noguchi. The MIKB model for intelligent network management. In IEEE International Conference on Communications'93 (ICC 93) [1993], pages 1210–1214.

[Mead, 1989] Carver Mead. Analog VLSI and Neural Systems. Addison-Wesley, Reading, USA, 1989.

[Meech and Jordon, 1993] J.A. Meech and L.A. Jordon. Development of a self-tuning fuzzy logic controller. Minerals Engineering, 6(2):119–131, 1993.

[Meech and Kumar, 1994] John A. Meech and Sunil Kumar. A Hypermanual on Expert Systems. Canada Centre for Mineral and Energy Technology - CANMET, Ottawa, Canada, 3rd edition, 1994. Hypertext Book.

[Meira and Lages, 1988] Dilmar M. Meira and Newton A.C. Lages. SIS: Um Sistema Integrado de Supervisão para telecomunicações. Technical Report TELEMIG/UFMG/88, Department of Computer Science. Federal University of Minas Gerais, Brazil, Belo Horizonte, Brazil, September 1988.

[Meira and Nogueira, 1997a] Dilmar Malheiros Meira and José Marcos S. Nogueira. Métodos e algoritmos para correlação de alarmes em redes de telecomunicações. In Simpósio Brasileiro de Redes de Computadores, 1997, pages 79–98, São Carlos, SP, Brazil, May 1997.

[Meira and Nogueira, 1997b] Dilmar Malheiros Meira and José Marcos S. Nogueira. Um modelo geral de redes de telecomunicações para aplicações de gerenciamento de falhas. In Simpósio Brasileiro de Telecomunicações, 1997, pages 45–50, Recife, PE, Brazil, September 1997.

[Meira et al., 1991] Dilmar M. Meira, João E.R. Dantas, and Roberto S. Bigonha. Especificação funcional do sistema integrado de supervisão. Technical Report SIS 1101, TELEMIG/DCC-UFMG, Belo Horizonte, Brazil, March 1991.

[Meira et al., 1995] Dilmar M. Meira, José M.S. Nogueira, and Son T. Vuong. On telecommunication network management. In IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing – PACRIM'95, pages 245–249, Victoria, BC, Canada, May 17-19 1995.

[Meira Júnior, 1993] Wagner Meira Júnior. Implementação de redes neuroniais em ambientes paralelos. Master's thesis, Federal University of Minas Gerais, Belo Horizonte, Brazil, 1993.

[Meira, 1995] Dilmar M. Meira. Managing a telecommunication network with SIS. Technical Report DCC 011/95, Department of Computer Science of the Federal University of Minas Gerais, Belo Horizonte, Brazil, 1995.

[Meira, 1997] Dilmar M. Meira. Um survey sobre correlação de alarmes. Technical Report DCC 017/97, Department of Computer Science. Federal University of Minas Gerais, Belo Horizonte, Brazil, July 1997.

[Michalski et al., 1983] R.S. Michalski, J.G. Carbonell, and T.M. Mitchell, editors. Machine Learning: An Artificial Intelligence Approach. Springer-Verlag, Berlin, Germany, 1983.

[Milham and Pai, 1994] David J. Milham and Dinesh Pai. OMNIPoint - the implementation version of Telecommunication Management Networks? In Network Operations and Management Symposium'94 (NOMS'94) [1994], pages 163–173.

[Mines, 1987] James A. Mines. Overview of the Telecommunications Management Network. In IEEE Global Telecommunications Conference (GLOBECOM 87), pages 1245–1248, 1987.

[Möller et al., 1995] M. Möller, S. Tretter, and B. Fink. Intelligent filtering in network management systems. In IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95) [1995], pages 304–315.

[Negoita, 1984] Constantin Virgil Negoita. Expert Systems and Fuzzy Systems. Benjamin/Cummings, Menlo Park, USA, 1984.

[Nilsson, 1980] Nils J. Nilsson. Principles of artificial intelligence. Tioga, Palo Alto, USA, 1980.

[Nogueira and Meira, 1996] José M. S. Nogueira and Dilmar M. Meira. The SIS project: A distributed platform for the integration of telecommunication management systems. In IEEE/IFIP 1996 Network Operations and Management Symposium (NOMS'96) [1996], pages 175–185.

[NOM, 1992] Network Operations and Management Symposium'92 (NOMS'92), 1992.

[NOM, 1994] Network Operations and Management Symposium'94 (NOMS'94), Kissimmee, Florida, USA, Feb 1994.

[NOM, 1996] IEEE/IFIP 1996 Network Operations and Management Symposium (NOMS'96), Kyoto, Japan, April 1996.

[Nussbaumer and Chutani, 1995] Henri Nussbaumer and Sailesh Chutani. On the distributed fault diagnosis of computer networks. In IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95) [1995], page 706.

[Nygate, 1995] Y. A. Nygate. Event correlation using rule and object based techniques. In IFIP/IEEE International Symposium on Integrated Network Management, IV (ISINM'95) [1995], pages 278–289.

[Ohsie et al., 1997] D. Ohsie, A. Mayer, S. Kliger, and S. Yemini. Event modeling with the MODEL language. In Integrated Network Management V [1997], pages 625–637.

[Ohta et al., 1997] Kohei Ohta, Takumi Mori, Nei Kato, Hideaki Sone, Glenn Mansfield, and Yoshiaki Nemoto. Divide and conquer technique for network fault management. In Integrated Network Management V [1997], pages 675–687.

[Okamoto et al., 1996] Satoru Okamoto, Kimio Oguchi, and Ken-ichi Sato. Network architecture and management concepts for optical transport networks. In IEEE/IFIP 1996 Network Operations and Management Symposium (NOMS'96) [1996], pages 1–11.

[OMG and X/Open, 1995] OMG and X/Open. The Common Object Request Broker: Architecture and specification, July 1995. Revision 2.0.

[Oshisanwo and Boyd, 1993] A. Oshisanwo and T. Boyd. Telecommunications information networking architecture. In Telecommunications, 1993, pages 84–89, 1993. IEE Conf. Pub. 371.

[Owen, 1994] Henry L. Owen. Synchronous digital hierarchy network modeling. In Second International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 94), pages 229–233, Durham, NC, USA, Jan 1994.

[Pearl, 1984] Judea Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, Reading, USA, 1984.

[Pearl, 1987] J. Pearl. Bayesian Decision Methods, pages 48–56. Volume 1 of Shapiro et al. [1987], 1987.

[Pearl, 1991] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, USA, 1991. Revised Second Printing.

[Pearl, 1993] Judea Pearl. Belief networks revisited. Artificial Intelligence, 60:141–153, 1993.

[Petermueller, 1996] Willi J. Petermueller. Q3 object models for the management of exchanges. IEEE Communications Magazine, 34(3):48–60, March 1996.

[Pines and Barradas, 1977] José Pines and Ovídio C.M. Barradas. Telecomunicações: Sistemas Multiplex. Livros Técnicos e Científicos, Rio de Janeiro, RJ, 1977.

[Pontailler, 1993] Catherine Pontailler. TMN and new network architectures. IEEE Communications Magazine, pages 84–88, Apr 1993.

[Rich, 1983] Elaine Rich. Artificial Intelligence. McGraw-Hill, New York, USA, 1983.

[Riordan, 1996] Teresa Riordan. A quicker diagnosis for ailing networks. The New York Times, July 22, 1996. http://www.smarts.com/news/nytimes patent article.html.

[Sabin et al., 1997] Mihaela Sabin, Robert D. Russel, and Eugene C. Freuder. Generating diagnostic tools for network fault management. In Integrated Network Management V [1997], pages 700–711.

[Sanchez and Zadeh, 1987] Elie Sanchez and Lotfi Asker Zadeh, editors. Approximate Reasoning in Intelligent Systems, Decision and Control. Pergamon, Oxford, England, 1987.

[Sasisekharan et al., 1993a] R. Sasisekharan, V. Seshadri, and S.M. Weiss. Proactive network maintenance using machine learning. In IEEE Global Telecommunications Conference (GLOBECOM 93) [1993], pages 217–222.

[Sasisekharan et al., 1993b] Raguram Sasisekharan, Yung-Kao Hsu, and David Simen. SCOUT: An approach to automating diagnosis of faults in large scale networks. In IEEE Global Telecommunications Conference (GLOBECOM 93) [1993], pages 212–216.

[Sasisekharan et al., 1994] Raguram Sasisekharan, V. Seshadri, and Sholom M. Weiss. Using machine learning to monitor network performance. In IEEE Conference on Artificial Intelligence Applications, pages 92–98. IEEE, 1994.

[Sasisekharan et al., 1996] Raguram Sasisekharan, V. Seshadri, and Sholom M. Weiss. Data mining and forecasting in large-scale telecommunication networks. IEEE Expert, 11(1):37–43, February 1996.

[Schott et al., 1992] B. Schott, A. Clemm, and U. Hollberg. An ISO/OSI based approach for modeling heterogeneous networks. In Information Network and Data Communication, IV, pages 377–388, 1992.

[Sclavos et al., 1994] Jean Sclavos, Noémie Simoni, and Simon Znaty. Information model: From abstraction to application. In Network Operations and Management Symposium'94 (NOMS'94) [1994], pages 183–195.

[Seagate, 1996] Seagate. Nervecenter Pro: The complete solution for managing network and system behavior, September 1996. http://www.sems.com/Products/West/-nervecenter/Nervecenter.html.

[Shapiro et al., 1987] Stuart C. Shapiro, David Eckroth, and George A. Vallasi, editors. Encyclopedia of Artificial Intelligence. John Wiley & Sons, New York, USA, 1987.

[Shomaly, 1993] R. Shomaly. A model based approach to network, service and customer management systems. BT Technology Journal, 11(3):123–130, Jul 1993.

[Slade, 1991] S. Slade. Case-based reasoning: A research paradigm. AI Magazine, 12(1):42–55, Spring 1991.

[Slawsky and Sassa, 1988] Gary M. Slawsky and D. J. Sassa. Expert Systems for Network Management and Control in Telecommunications at Bellcore, pages 191–199. In Liebowitz [1988], 1988.

[Sloman, 1994] Morris Sloman, editor. Network and Distributed Systems Management. Addison-Wesley, Wokingham, England, 1994.

[Smets et al., 1988] Philippe Smets, Abe Mamdani, Didier Dubois, and Henri Prade, editors. Non-Standard Logics for Automated Reasoning. Academic Press, London, England, 1988.

[Stinson and Kheradpir, 1992] Willis Stinson and Shaygan Kheradpir. A state-based approach to real-time telecommunications network management. In Network Operations and Management Symposium'92 (NOMS'92) [1992], pages 520–532.

[Strang et al., 1993] C. J. Strang, J. G. Callaghan, and A. Walles. An integrated approach to communications management. BT Technology Journal, 11(1):71–78, Jan 1993.

[Sutter and Zeldin, 1988] Mark T. Sutter and Paul E. Zeldin. Designing expert systems for real-time diagnosis of self-correcting networks. IEEE Network, pages 43–51, September 1988.

[System Management ARTS, 1996a] System Management ARTS. InCharge data sheet, September 1996. http://www.smarts.com/products/incharge datasheet.html.

[System Management ARTS, 1996b] System Management ARTS. System Management ARTS awarded patent for "codebook event correlation". http://www.smarts.com/-news/patent release.html, 1996. Press Release.

[Tanenbaum, 1996] Andrew S. Tanenbaum. Computer Networks. Prentice Hall, Upper Saddle River, USA, 3rd edition, 1996.

[Uehara, 1996] José Masaaqui Uehara. Novos conceitos em telecomunicações. Booklet of TELEMIG's course on Telecommunications Strategic Management, Belo Horizonte, MG, 1996.

[Veiga and Meech, 1994] M.M. Veiga and J.A. Meech. Application of fuzzy logic to environmental risk assessment. In Meeting of the Southern Hemisphere on Mineral Technology, IV, pages 355–370, Concepción, Chile, 1994.

[Weiner et al., 1995] Andrew J. Weiner, David A. Thurman, and Christine M. Mitchell. Applying case-based reasoning to aid fault management in supervisory control. In Proceedings of the 1995 IEEE International Conference on Systems, Man and Cybernetics, pages 4213–4218, Vancouver, BC, Canada, 1995. IEEE.

[Widl and Woldegiorgis, 1992] Walter Widl and Kidane Woldegiorgis. In search of managed objects. Ericsson Review, (1-2):34–56, 1992.

[Williamson and Azmoodeh, 1991] G. I. Williamson and M. Azmoodeh. The application of information modelling in the Telecommunications Management Network. BT Technology Journal, 9(3):18–26, Jul 1991.

[Wright et al., 1988] Jon R. Wright, John E. Zielinski, and Elizabeth M. Horton. Expert Systems Development: The ACE System, pages 45–72. In Liebowitz [1988], 1988.

[Wright, 1921] Wright. Correlation and causation. J. Agric. Res., 20:557–585, 1921.

[Yamaguchi et al., 1992] Haruo Yamaguchi, S. Isobe, T. Yamaki, and Y. Yamanaka. Network information modeling for network management. In Network Operations and Management Symposium'92 (NOMS'92) [1992], pages 57–67.

[Yang, 1996] Gary Yang. Introduction to information network architecture. In IEEE/IFIP 1996 Network Operations and Management Symposium (NOMS'96) [1996], pages 3.2.1–3.2.12.

[Yemini et al., 1996] Shaula Alexander Yemini, Shmuel Kliger, Eyal Mozes, Yechiam Yemini, and David Ohsie. High speed and robust event correlation. IEEE Communications Magazine, pages 82–90, May 1996.

[Zadeh, 1965] Lotfi A. Zadeh. Fuzzy sets. Information and Control, 8:338–353, 1965.

[Zadeh, 1988] Lotfi A. Zadeh. Fuzzy logic. Computer, pages 83–93, April 1988.
