
IEEE TRANSACTIONS ON RELIABILITY, VOL. 46, NO. 3, 1997 SEPTEMBER

Proactive Network-Fault Detection

Cynthia S. Hood, Member IEEE, Illinois Institute of Technology, Chicago

Chuanyi Ji, Member IEEE, Rensselaer Polytechnic Institute, Troy

Key Words - Network reliability, Network management, Proactive fault detection, Bayes network, Feature extraction, SNMP agent, MIB variable.

Summary & Conclusions - To improve network reliability & management in today's high-speed communication networks, we propose an intelligent system using adaptive statistical approaches. The system learns the normal behavior (norm) of the network. Deviations from the norm are detected and the information is combined in the probabilistic framework of a Bayes network. The proposed system thereby can detect unknown or unseen faults. As demonstrated on real network data, this method can detect abnormal behavior before a fault actually occurs, giving the network management system (human or automated) the ability to avoid a potentially serious problem.

1. INTRODUCTION

Acronyms¹

AR: auto-regressive process
DAG: directed acyclic graph
IF: interface
IP: internet protocol
MIB: management information base
OSI: open system interconnection
SNMP: simple network management protocol
UDP: user datagram protocol.

High-speed communication networks are increasingly important in society. A key challenge in fulfilling this role, maintaining network availability & reliability, is the responsibility of network management. To meet this challenge, network-management tools & methods must evolve to meet the needs of current & future communication environments [2, 3, 27].

Fault management is the part of network management that is responsible for detecting & identifying network faults. Interest in fault management has increased over the past decade due to the growing number of networks that have become a critical component of the infrastructure of many organizations, thus making faults & downtime very costly. During this same time period, the fault-management problem has also become more difficult. This is due primarily to the dynamic nature and heterogeneity of current networks. Fundamental changes to the network occur much more frequently due to: 1) the growing demands on the network, and 2) the availability of new,

¹The singular & plural of an acronym are always spelled the same.


improved components and applications. With network components & applications developed in an open environment, a network can be configured by mixing & matching several vendors' hardware & software. While this allows the network to use the latest technologies and be customized to the users' needs, it also increases the risk of faults or problems [27].

Current implementation of the fault-management part of network-management systems generally relies on a human network-manager to transfer ips² network expertise into a set of rules, which eventually are translated into threshold levels on the measurement variables being collected. When one or more thresholds are exceeded, an alarm is sent to the network manager. The fault-management system must correlate the alarms to: 1) identify the problem, and 2) take corrective action. Alarm correlation is an open research problem, due largely to the lack of temporal information [27]. Additionally, as the rate of change in networks increases, it becomes more difficult for a human network-manager to maintain a sufficient level of expertise on the network behavior.

Previous research in fault management has covered approaches such as expert systems [12], finite state machines [20], advanced database-techniques [26], and probabilistic approaches [5, 6]. Ref [13] reviews communication-network fault detection & identification. The approaches mentioned in the previous paragraph require specification of the faults to be detected. This limits the performance of these approaches since it is not feasible to specify all possible faults. In addition, changes in network configuration, applications, and traffic can change the types & nature of faults that can occur, thus making modeling of faults more difficult and, in many cases, impractical. Research using learning machines to detect anomalies [16] addresses the issue of fault modeling, but does not provide a method for correlating the information collected in space or time.

The problem we tackle is: Automated fault detection without specifications of faults. We propose an adaptive system for network monitoring. The system learns the normal behavior of each measurement variable. Deviations from the norm are detected and the information gathered is combined in the probabilistic framework of a Bayes³ network. Benefits from this approach include the ability to: 1) detect unknown faults, 2) correlate information

²The pronouns ip, ips, ip correspond to it, its, it; but the p implies person whereas the t implies thing.

³The reader is reminded of the philosophical differences between classical & Bayes probability theory. In the former, probability is used to model the relative frequency of an event, whereas in the latter, probability is used to model the user's degree-of-belief that the event will occur. Thus, in a classical interval estimate, the s-confidence level relates to the frequency with which the estimator produces intervals that cover the true mean, while in the Bayes paradigm, the credibility level relates to the user's degree-of-belief that the true mean lies in the interval.

0018-9529/97$10.00 ©1997 IEEE


in space & time, and 3) detect subtle signs of problems before the actual failure. This allows faults to be detected while they are developing, so that the network manager has time to take corrective action to prevent outages or downtime. Our approach is tested on a computer network. We monitor the variables collected within the SNMP framework. No specialized hardware is required for monitoring.

Section 2 provides background material on network management and Bayes networks. Section 3 describes our intelligent monitoring approach. Section 4 gives detailed information about the data we collected. Section 5 presents experimental results & comparisons.

Notation

mv_ij: MIB variable j of network function nf_i
mv: all MIB variables
w_k: state k of an internal variable
w_1: normal state
w_2: abnormal state.

Other, standard notation is given in "Information for Readers & Authors" at the rear of each issue.

2. BACKGROUND

2.1 Network Management

Network management is a broad term that has been defined by standards bodies to include 5 functional areas: fault, configuration, accounting, performance, and security management.

The network-management system architecture consists of a central network manager along with many agents. The agents reside in various network nodes and collect data [27]. They communicate with the central network manager through a network-management protocol. There are two sets of standardized network-management protocols; we work within the SNMP framework, although the approach can be generalized to OSI networks.

SNMP provides a structure for organizing management information, implemented in the MIB of each agent. The MIB contains a set of variables pertinent to that particular network node. There is a standard set of variables specified in [17], but the equipment manufacturer can add other variables considered important. For this reason, the specific variables at different nodes can vary. Currently, SNMP agents are included in most pieces of equipment for the Internet, making SNMP a convenient way to monitor the network without specialized hardware. The SNMP protocol allows the network manager to query the agents for the values of the MIB variables, which are organized into groups by network function (eg, IP or UDP). More details on the specific variables collected are in section 4.

Therefore, SNMP alone cannot be considered a solution to the fault-management problem. The information provided by SNMP must be further processed & combined for fault detection. The mechanism we use in this work to combine the SNMP information is based on Bayes networks.

2.2 Bayes Networks

Notation

N: a set of nodes (a network)
E: a set of arcs
n ∈ N: a node in N
e ∈ E: a directed arc

A Bayes network is a DAG in which the nodes represent r.v. and certain conditional s-independence assumptions hold [19].

Assumption

1. Conditional s-Independence: Given its parents, each node n is conditionally s-independent of every node in N that is not its descendent.

Figure 1 illustrates assumption #1 for a Bayes network similar to the one used in our system.


HOOD/JI: PROACTIVE NETWORK-FAULT DETECTION 335

Figure 1. Example of s-Independence Assumption for a Particular Bayes Network

Assumption #1 allows us to estimate the conditional probabilities of any of the nodes (or r.v.) in the Bayes network, given the observed information or evidence. Algorithms for estimating these probabilities are in [19]. The efficiency of the algorithm depends on the structure of the Bayes network. The strength of Bayes networks is that they provide a theoretical framework for combining statistical data with prior knowledge about the problem domain. Therefore, they can be useful in practical applications where such prior knowledge can be exactly quantified in a Bayes-rational way.
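As a toy illustration of the Bayes-rule updating that underlies this framework, consider a minimal two-node network, a hidden "health" state and one observed "symptom". All of the probability values below are invented placeholders, not numbers from this paper.

```python
# Illustrative only: posterior inference in a minimal two-node Bayes network
# (hidden health state -> observed symptom) via Bayes rule. The prior and
# likelihood values are made up for demonstration.

def posterior(prior_abnormal, p_sym_given_abnormal, p_sym_given_normal):
    """Pr{health=abnormal | symptom observed} by Bayes rule."""
    num = prior_abnormal * p_sym_given_abnormal
    den = num + (1.0 - prior_abnormal) * p_sym_given_normal
    return num / den

# A rare fault (1% prior) with a strongly indicative symptom still yields a
# modest posterior, showing why several variables must be combined.
p = posterior(prior_abnormal=0.01,
              p_sym_given_abnormal=0.9,
              p_sym_given_normal=0.05)
print(round(p, 4))  # → 0.1538
```

This is why the full network combines evidence from many MIB variables: no single symptom is decisive.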

Bayes networks have been widely used for medical diagnosis [9, 22] and troubleshooting [10]. In communication networks, they have been proposed to diagnose faults in lightwave networks [6]. In [6] other methods have been used for detection; the Bayes networks are used for diagnosis only. Our work uses a Bayes network as a mechanism to combine information from various variables for the purpose of detecting anomalies.

3. MONITORING SYSTEM


Figure 2. Monitoring System

We propose the local monitoring system in figure 2, allowing: 1) each network-node to compose a picture of the network's health, and 2) the node to take corrective action if necessary (local control), as suggested in [8, 15].

The goal of network monitoring is to alert the network manager in a timely manner when something problematic is occurring. To detect abnormalities before they cause serious network problems (when possible), the monitoring device must distinguish normal behavior of the measured information from problematic behavior. To be able to do this, information from both traffic data and the functionality of a network need to be combined & processed. To implement this idea, we designed the monitoring system to have two main components: 1) a feature extractor, and 2) a Bayes network, as shown in figure 2. The feature extractor extracts salient features which can be used to detect abnormal information carried by each MIB variable. The Bayes network combines the information from all the MIB variables to provide a broader picture of the health of the node. By doing so, we can correlate information in time & space. This allows the central network manager to receive a more complete, less noisy picture of each node's view of network health. This can ease the alarm correlation problem as well. Since the Bayes network also incorporates prior knowledge on network functionality, the feature extractor and the Bayes network together can process & combine information from both measurements and network functionality needed for detecting network anomalies.

3.1 Feature Extraction

To explain our approach in feature extraction, we first examine why the commonly-used methods for detecting network problems are insufficient.

Thresholds are the primary method currently used in both practice [15, 24] and research [6, 8] for detecting abnormal behavior. The feature is not the value of the threshold itself, but the information on whether or not the threshold has been exceeded by a particular measurement variable. There can be both upper & lower thresholds, in which case the feature is the information on whether the variable is within the thresholds. One of the difficulties with thresholds is properly setting the threshold level, since thresholds strongly depend on the traffic level. Improperly set, the thresholds might never be exceeded, thereby letting problems go undetected, or might be exceeded too often, thereby flooding the network manager with false alarms.
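The threshold feature described above can be sketched in a few lines; the threshold values and samples here are arbitrary placeholders, since in practice they would depend on the traffic level of the monitored variable.

```python
# Sketch of the threshold feature: the feature is not the threshold value
# itself but whether a sample lies within [lower, upper]. The bounds and
# samples below are arbitrary placeholders.

def threshold_feature(sample, lower, upper):
    """Return True if the sample is within the thresholds (i.e., 'normal')."""
    return lower <= sample <= upper

samples = [40, 55, 120, 10]
alarms = [s for s in samples if not threshold_feature(s, lower=20, upper=100)]
print(alarms)  # → [120, 10]
```

Note that a variance change or a gradual level shift inside [lower, upper] never triggers this feature, which is exactly the limitation figure 3 illustrates.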

While properly set thresholds do a good job of detecting large rises & falls in a measurement variable, more subtle behavior changes are missed. Figure 3 shows two examples of this.

In figure 3a the variance of the signal or measured variable changes; in figure 3b, the level of the signal decreases. These changes are also symptoms of something problematic occurring in the network - this behavior is not normal for the variable. Detection of the more subtle signs of problems can allow corrective action to be taken to avoid a bigger problem. Identification of the problem also becomes easier with a more complete description of the symptoms rather than just the extreme cases.

To detect the more subtle changes in the nature of the measured variables, we use the parameters of a second-order auto-regressive process, AR(2), as features:

Page 4: Proactive Network-Fault Detectionjic.ece.gatech.edu/anomaly-graph.pdf · Proactive Network-Fault Detection Cynthia S. Hood, Member IEEE Chuanyi Ji, Member IEEE ... tion is combined


Figure 3. Examples of Subtle Behavior Not Detected Using Thresholds [the thresholds are shown by dotted lines]

y(t) = a1·y(t-1) + a2·y(t-2) + ε(t).    (1)

Notation

y(t): value of the signal at time t
a1, a2: AR parameters that we use as features
{ε(t)}: white noise process.

These features, like threshold features, are monitored for each measurement variable (in our case, each MIB variable). Each MIB variable is sampled every 15 seconds. From the time-series data, we use a 20-sample window to estimate the AR(2) parameters. This window includes the current sample along with the 19 previous samples. This sliding-window scheme allows us to correlate temporally the current sample with some past samples. The parameters are completely re-estimated each time using least-squares. Using the AR(2) parameters as features, we can detect the more subtle changes in figure 3. Higher order AR processes have also been tried, but no important gain has been observed. Details are in [11].
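The sliding-window least-squares step can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic AR(2) signal and the use of NumPy's `lstsq` are our own assumptions for demonstration.

```python
import numpy as np

# Sketch of the feature extractor: re-estimate AR(2) parameters by
# least-squares over a sliding 20-sample window (the current sample plus
# the 19 previous ones). The synthetic signal is for demonstration only.

def ar2_ls(window):
    """Least-squares estimate of (a1, a2) in y(t) = a1*y(t-1) + a2*y(t-2) + e(t)."""
    y = np.asarray(window, dtype=float)
    X = np.column_stack([y[1:-1], y[:-2]])   # regressors [y(t-1), y(t-2)]
    target = y[2:]                           # y(t)
    (a1, a2), *_ = np.linalg.lstsq(X, target, rcond=None)
    return a1, a2

# Simulate a known AR(2) process with a1 = 0.6, a2 = -0.2.
rng = np.random.default_rng(0)
signal = [0.0, 0.0]
for _ in range(200):
    signal.append(0.6 * signal[-1] - 0.2 * signal[-2] + rng.normal())

W = 20
a1, a2 = ar2_ls(signal[-W:])   # features for the most recent window
print(a1, a2)
```

Over a short 20-sample window the estimates are noisy, which is precisely why the paper treats them as features to be monitored over time rather than as precise model parameters.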

One of the motivations for using AR parameters is simplicity. Ideally, if an underlying stochastic process which governs the data is composed of piece-wise Gaussian processes, the extracted AR parameters can characterize the process well, assuming of course, that the order of the AR process is properly chosen. If the process is non-Gaussian, our AR parameters correspond to the best linear approximation in terms of mean-square-error. Because the problem of feature extraction is closely related to traffic modeling, which is an active research area [14], we believe better features can be obtained as we better understand the characteristics of the MIB variables.

3.2 Bayes Network

Once the features are extracted from each MIB variable, a Bayes network is used to combine information from all the (features of) MIB variables. Two steps are needed to specify a Bayes network completely: 1) determine the structure of the Bayes network, and 2) estimate its parameters, viz, the conditional probabilities of the Bayes network.

To specify the structure, we define the r.v. or nodes in the Bayes network while incorporating prior knowledge on network functionality. There are two types of variables: 1) observed, and 2) not-observed and thus needing to be estimated. The not-observed variables are called internal variables. The observed variables directly correspond to the MIB variables (viz, the features from the MIB variables). The internal variables are defined as: network, IF, IP, and UDP. The IF, IP, and UDP variables correspond to the MIB groups. Logically they represent different types of network functionality. The MIB variables within a group are the measurement variables for that network function. The network variable is defined to correspond to all of the network functionality.

Assumptions

1. The conditional s-independence assumption of section 2.2 holds.

2. The network node is comprised only of the Interface, IP, and UDP functions.

3. Given knowledge of the network health, the network-health functions are s-independent.

4. Given the network-health function, the measurement variables for that function are s-independent and the measurement variables for other network functions do not contribute any additional information.

5. Each internal variable has 2 discrete states: w_1 & w_2.

6. Each observable variable is continuous.

The structure of the Bayes network can be determined based on these internal variables, as shown in figure 4, wherein the arrows between the nodes go from cause to effect. The observed (MIB) variables are the variables at the lowest level.

Figure 4. Bayes Network for Fault Detection

Analogous to medical-diagnosis examples, we consider the observed MIB variables to be symptoms of larger problems. In our model, the health of the network is the most general information estimated and is considered to be an underlying


influence on the rest of the nodes in the Bayes network. The overall health of the network directly influences the health of the three functions of the network (IF, IP, UDP). This is indicated in the model by the arrows from the network r.v. to the IF, IP, UDP r.v. Likewise, for each network function, the health of that function directly influences the values of the individual measurement or MIB variables for that function.

The model has been designed based on these intuitive relationships from the structure of the MIB. The Bayes network model requires assumption #1. Assumptions #2 - #4 are reasonable approximations of real situations, since each of the network functions represents an s-independent functional component of the network. These components can fail s-independently, although there is a relationship between the functions, and serious problems in one component can eventually impact the other components. Since propagation of a fault through the functional components depends on the type & location of the fault (faults can propagate from a low-level function to high-level functions, or vice versa) [25], no prior relationship between these functional components is assumed, given that the overall-network health is known. Therefore we have not assumed a fault-propagation structure in our model. However, assumption #3 simplifies the problem. The MIB variables in general can be s-dependent, and their s-dependence can be complicated and changing with time; this is left for future research.

Once the structure of a Bayes network is determined, we are ready to determine the probabilities that need to be estimated from data. We estimate the following posterior probabilities:

Pr{network = w_k | mv},    (2)

Pr{nf_i = w_k | mv}.    (3)

These two probabilities correspond to the health of the network and of the individual network functions, given the observations. Due to the tree or singly-connected structure of the Bayes network, these probabilities can be calculated efficiently either using the Pearl algorithm [19] or directly, based on assumption #3. The equations for the direct calculation can easily be derived using assumption #3 and Bayes rule [11]. Estimating these posterior probabilities can be reduced to estimating the conditional probabilities:

Pr{mv_ij | nf_i = normal}, for all mv_ij ∈ nf_i,    (4)

Pr{mv_ij | nf_i = abnormal}, for all mv_ij ∈ nf_i.    (5)

These are the conditional probabilities of the observations given the health of the network, and can be directly estimated from data. The following probabilities also need to be determined to specify completely the probabilities in (2) & (3):

Pr{network = w_k},    (6)

Pr{nf_i = w_l | network = w_k}, for l = 1, 2.    (7)

The prior probabilities in (6) and the conditional probabilities in (7) are guessed-at using people's prior knowledge (degree-of-belief) of the network behavior gained from observations and conversations with the network managers. These probabilities remain constant throughout the monitoring process. The conditional probabilities in (4) & (5) can be estimated using the observed MIB variables.
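The direct calculation enabled by the tree structure can be sketched as follows. This is a hedged illustration of combining (4)-(7) by Bayes rule, not the paper's implementation, and every probability value below is an invented placeholder; in the paper, (4)-(5) are estimated from data and (6)-(7) come from prior knowledge.

```python
# Hedged sketch of the "direct" posterior calculation allowed by the
# singly-connected structure and assumptions #3-#4: per-function likelihoods
# are products of per-variable likelihoods, and functions combine
# independently given the network state. All numbers are placeholders.

STATES = ("normal", "abnormal")

prior_network = {"normal": 0.95, "abnormal": 0.05}            # eq (6), invented
p_nf_given_net = {                                            # eq (7), invented
    "normal":   {"normal": 0.98, "abnormal": 0.02},
    "abnormal": {"normal": 0.30, "abnormal": 0.70},
}

def nf_likelihood(mv_likelihoods, nf_state):
    """Product over a function's MIB variables of Pr{mv_ij | nf_i = state}."""
    p = 1.0
    for lik in mv_likelihoods:     # lik: dict state -> likelihood, eqs (4)-(5)
        p *= lik[nf_state]
    return p

def network_posterior(functions):
    """Pr{network = w | mv}: marginalize each nf_i, then normalize."""
    unnorm = {}
    for w in STATES:
        p = prior_network[w]
        for mv_liks in functions.values():
            p *= sum(p_nf_given_net[w][l] * nf_likelihood(mv_liks, l)
                     for l in STATES)
        unnorm[w] = p
    z = sum(unnorm.values())
    return {w: unnorm[w] / z for w in STATES}

# One variable per function; the IF likelihoods suggest off-norm behavior.
obs = {
    "IF":  [{"normal": 0.02, "abnormal": 0.40}],
    "IP":  [{"normal": 0.30, "abnormal": 0.25}],
    "UDP": [{"normal": 0.28, "abnormal": 0.22}],
}
post = network_posterior(obs)
print(post["abnormal"] > prior_network["abnormal"])  # evidence raises the posterior
```

The same marginalization, restricted to one function, gives the per-function posterior (3).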

3.3 Estimating the Probabilities

Is it feasible to estimate directly from data both of the probabilities, given that a network is normal and given that it is abnormal? We first need to examine what type of data is usually available for detecting network anomalies.

Roughly speaking, there are two types of available data: 1) the so-called normal data, which are collected when a network functions normally, and 2) the abnormal data, which correspond to the abnormal behavior of a network. Here the information on the network health can be obtained from the tools currently used to report problems, including reports about certain types of serious network problems that have occurred. We can use these reports as labels for the abnormal data. Since there are many normal data available (networks function normally most of the time), the:

Pr{mv_ij | nf_i = normal}, for all mv_ij ∈ nf_i,

can be estimated accurately from the normal data. The labels are used to eliminate data from time periods when a problem was reported; the rest of the data are used to estimate this probability, ie, this probability is estimated through learning the normal behavior of the network.
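The label-based filtering step can be sketched as follows; the intervals, timestamps, and feature values are invented for illustration, and the density-estimation step itself is omitted.

```python
# Sketch of building the "normal" training set: drop feature samples whose
# timestamps fall inside any reported-problem interval, then estimate
# Pr{mv | nf = normal} from the remainder (estimation itself omitted here).
# Intervals and samples are invented for illustration.

problem_intervals = [(100, 160)]   # (start, end) times from problem reports

def is_normal(t):
    return not any(start <= t <= end for start, end in problem_intervals)

# (time, (a1, a2)) feature samples, one per polling window
samples = [(t, (0.1 * (t % 5), -0.05)) for t in range(0, 300, 10)]
train = [feat for t, feat in samples if is_normal(t)]

print(len(samples), len(train))  # → 30 23
```

Because the labels are accurate but incomplete, some unlabeled problem periods may leak into the training set; with abundant normal data this contaminates the estimate only mildly.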

Labeled abnormal data could be used directly to estimate the probability distribution of each MIB variable, given that its related network function is abnormal. This sounds promising, but in fact is extremely difficult due to the sparse nature of abnormal data. With so few examples of problems, we do not have the variety (many examples of different problems) or the depth (several examples of the same type of problem) to learn effectively the distribution [4]. To alleviate the problem of sparse abnormal data, we treat Pr{mv_ij | nf_i = abnormal} as unknown [21], and use assumption #6.

Assumption

6. There is a uniform distribution over all allowable values for a1, a2.

The allowable values must obey (8) if the estimated AR(2) process is to be stationary [1]:

a1 + a2 < 1,

a2 - a1 < 1,

-1 < a2 < 1.    (8)

Page 6: Proactive Network-Fault Detectionjic.ece.gatech.edu/anomaly-graph.pdf · Proactive Network-Fault Detection Cynthia S. Hood, Member IEEE Chuanyi Ji, Member IEEE ... tion is combined

338 IEEE TRANSACTIONS ON RELIABILITY, VOL. 46, NO. 3, 1997 SEPTEMBER

If the estimated AR(2) parameters fall outside of the allowable range, then set

Pr{mv_ij | nf_i = abnormal} = 0

for that particular MIB variable. Once Pr{mv_ij | nf_i = normal} & Pr{mv_ij | nf_i = abnormal} are obtained, they can be used to determine the posterior probabilities on network health using the common techniques in [7]. Since we are monitoring locally, all of the evidence or probabilities estimated from the observed MIB variables are available to the system. This enables the system to calculate the desired posterior probabilities using a complete & current set of observations.
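The stationarity-region check in (8) is a simple membership test over the AR(2) parameter triangle:

```python
# Membership test for the AR(2) stationarity region (8). Outside this
# region, the abnormal-state likelihood for that variable is set to 0,
# as described above.

def in_stationary_region(a1, a2):
    return (a1 + a2 < 1) and (a2 - a1 < 1) and (-1 < a2 < 1)

print(in_stationary_region(0.6, -0.2),   # inside the triangle
      in_stationary_region(1.5, 0.2))    # outside: a1 + a2 >= 1
# → True False
```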

4. DATA COLLECTION

Data for this work were collected from the RPI Computer Science Department network. The network (figure 5) is composed of several subnetworks (subnets) and 2 routers. The individual machines (workstations, printers) on the subnets are not shown, except for the file-servers (fs1 and fs2) and the data-collection machine (NM).


Figure 5. Configuration of the Monitored Network

Each of the file-servers has approximately half of the disk space for the users of the network; fs1 is also the ftp server. Router 1 is the gateway between this network and the campus network, with all the traffic to & from the campus and the outside world flowing through this router. Router 2 mainly routes the local traffic flowing between the subnetworks. A large portion of this traffic is access from workstations to the file-servers.

Data were collected from Router 2, the internal router, and consisted of 43 MIB variables. The variables can be separated into two types: 1) those that were active (continuously changing in value), and 2) those that were constant for long periods of time (hours or days). This classification of variables does not include the variables that contain configuration-type information. Of the 43 MIB variables collected, there were 14 variables of type #1 and 29 of type #2. In this work we considered only the MIB variables of type #1. The type #2 MIB variables contained very little information since the values of the variables rarely changed.
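The active/constant split can be sketched with a simple change-frequency test; the series and the 1%-change heuristic below are our own invented placeholders, since the paper does not state a numeric criterion.

```python
# Sketch of the active/constant classification: call a variable "active"
# if its polled values change frequently, "constant" otherwise. The
# series and the 1%-change heuristic are invented for illustration.

def classify(series, min_change_frac=0.01):
    changes = sum(1 for a, b in zip(series, series[1:]) if a != b)
    return "active" if changes / (len(series) - 1) >= min_change_frac else "constant"

counters = {
    "ifInOctets": list(range(1000)),   # counter that changes every sample
    "ipOutDiscards": [7] * 1000,       # counter that never changes
}
print({name: classify(vals) for name, vals in counters.items()})
# → {'ifInOctets': 'active', 'ipOutDiscards': 'constant'}
```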

The MIB variables used in this work are:

Interface (if) group: ifInOctets, ifInUcastPkts, ifInNUcastPkts, ifOutOctets, ifOutUcastPkts, ifOutNUcastPkts.

IP group: ipInReceives, ipForwDatagrams, ipInDelivers, ipOutRequests, ipOutDiscards.

UDP group: udpInDatagrams, udpNoPorts, udpOutDatagrams.

The interface-group variables are specific to the interface between Subnet 2 and Router 2. The variables in other groups are general, and depend on all the information flowing in or out of the router.

The variables were collected by polling the MIB of Router 2 every 15 seconds. The machine NM (see figure 5) polled Router 2 by sending an SNMP query. Since the data collection took place on existing equipment, available storage space and the impact of SNMP queries on the router and the network had to be considered in determining the polling frequency. The 15-second polling frequency was adequate for determining behavior changes in this environment, without placing undue burdens on the network in terms of load or storage. The issues we describe with the SNMP queries and the storage space are products of the experimental setup and need not carry over if this approach were implemented on a network. We envision the system in section 3 working as the agent, processing the data as it arrives. This would place no additional requirements for storage or load on the network. Additional processing would be required, but given the methods in section 3, the amount of processing appears to be feasible.

The network-management technique currently used for this network mainly uses the Unix syslog function. Each of the computers in the network runs a syslog process, which sends messages to the network management machine, NM. The messages can contain reports of problems pertaining to the network, to a specific application, or to the machine generating the message. An example of a typical message is:

Dec 5 11:12:34 machine1.cs.rpi.edu unix: NFS server fs1.cs.rpi.edu not responding still trying

These messages can alert the network manager at the time the problem occurs, as long as the communication path between the computer and NM is up. If the path is unavailable, the message is sent when a path becomes available. The time stamp in the message comes from the machine that originated the message (for the example message: machine1.cs.rpi.edu).

Syslog provides a tool for the human network-manager to troubleshoot reported problems, but not all problems can be traced to syslog messages.

The log file on NM that contains all the syslog messages is filtered to eliminate all messages not pertaining to network problems (since we are monitoring only the network). The remaining syslog messages are the labels described in section 3.2. These labels are not complete (not all problems are labeled), but they are accurate.
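The label-extraction step can be sketched as a keyword filter over the log file. The keyword list below is an assumption (the paper does not give its filtering rule); the message format follows the example shown above.

```python
# Sketch: filter the syslog file on NM down to network-related messages,
# which serve as the (incomplete but accurate) fault labels of section 3.2.
# NETWORK_KEYWORDS is an assumed filtering rule, not from the paper.

NETWORK_KEYWORDS = ("NFS server", "not responding", "network")

def network_labels(log_lines, keywords=NETWORK_KEYWORDS):
    """Keep only syslog lines that look network-related."""
    return [line for line in log_lines
            if any(k.lower() in line.lower() for k in keywords)]
```

Application- and machine-specific messages (e.g. quota warnings) fall through the filter, leaving only lines usable as network-fault labels.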



5. EXPERIMENTAL RESULTS

The system proposed for anomaly detection was tested on a problem reported in the log file. Fileserver 1 (fs1) went down on 1995 Dec 05 from 11:10am - 11:17am. As noted in section 4, this fileserver is where approximately half of the users on the network have their disk space, and it is also the ftp server. Therefore the problem affected many users of the network, as well as users from the campus or other internet networks who were trying to ftp something from this site.

Although the exact cause of fs1 failing is not known, an increase in load related to fs1's function as an ftp server was a contributing factor. A few days before fs1 went down, a new version of software that was archived on fs1 was released. The number of ftp requests per day increased by a factor of 10-15.

5.1 Our Results

The necessary conditional probabilities were estimated from a time period on the previous day when no problems occurred (no entries in the log file). The training data consisted of 500 samples of each MIB variable. Since the probabilities are learned from normal data, there are no examples of network faults in the training data.
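The training step amounts to summarizing each variable's problem-free behavior. The paper learns conditional probabilities for a Bayes network; modeling each variable's norm with a sample mean and standard deviation, as below, is a simplification assumed for illustration.

```python
# Sketch of the training step: estimate each MIB variable's "normal"
# statistics from a problem-free window (500 samples in the paper).
# Mean/stdev per variable is an assumed simplification of the paper's
# learned conditional probabilities.

from statistics import mean, stdev

def learn_norm(training):
    """training: dict {variable: list of normal-period samples}.
    Returns {variable: (mean, stdev)} describing the norm."""
    return {name: (mean(vals), stdev(vals)) for name, vals in training.items()}
```

Because the window contains no faults, whatever model is fit here describes only the norm; deviations from it are what the detector later flags.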

The results for Pr{network = abnormal | mv} are shown in figure 6 as a function of time, where mv denotes all the MIB variables in the figure. The asterisks denote the fileserver downtime period.
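The evidence-combination step that produces this posterior can be sketched as follows. The paper fuses per-variable deviations in a Bayes network whose structure comes from prior knowledge; the naive-Bayes fusion below (variables treated as conditionally independent given the network state) is a simplified stand-in, and the prior value is an assumption.

```python
# Simplified sketch of forming Pr{network = abnormal | mv}.
# Naive-Bayes independence and the 0.01 prior are assumptions;
# the paper's actual Bayes network uses a structure from prior knowledge.

def pr_abnormal(likelihood_ratios, prior_abnormal=0.01):
    """likelihood_ratios: for each MIB variable, the ratio
    p(observation | abnormal) / p(observation | normal).
    Returns the posterior probability that the network is abnormal."""
    odds = prior_abnormal / (1.0 - prior_abnormal)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)
```

When every variable looks normal (all ratios near 1), the posterior stays near the prior; a few strongly deviating variables push it past 0.5, producing the peaks visible in figure 6.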

[Figure: Pr{network = abnormal | mv} plotted against time in 15-second increments.]

Figure 6. Network Results Obtained Using Intelligent Monitoring System

Two things in figure 6 are important: the height of a peak (the value of the probability that the network is abnormal, given the MIB variables), and the time at which a large peak occurs. If the height of a peak in figure 6 exceeds 0.5, it indicates a potential network problem important enough to warrant the attention of a network manager. The time of occurrence of a peak indicates the capability of our monitoring system for early detection of faults: intuitively, if a large peak occurs before the down-time of the file server, it indicates early detection of network problems by our system. Figure 6 shows that the first large peak occurred approximately 28 minutes before the system went down. The probability of the network being abnormal decreases after this peak, but there is still some fluctuation. The second large peak (a set of peaks) comes approximately 8 minutes before the system goes down; the probability changes, but remains at a high level for several minutes. This output could be used to alert a network manager of the incident, since the syslog messages were sent only when the fileserver was down. These peaks correspond to the symptoms caused by the large ftp-traffic at fileserver 1. As a result, network performance for users of the monitored network was deteriorating: since information was not flowing as usual, there were more retries, and the behavior of many of the variables changed. Something is also detected during the failure itself, although this detection is more subtle. The traffic level decreases, due to the absence of fileserver-access traffic (except possibly for traffic from workstations trying to determine whether the fileserver is up).

The structure of the network in figure 5 is important. The router that we are monitoring continues to route all other traffic normally. The fileserver being down is not problematic to the router, but we can still detect the problem by monitoring the router: the router sees the anomaly in the network through changes to its MIB variables. Therefore, the results that we have are from the router's view of the network.

5.2 Comparisons

[Figure: number of MIB variables exceeding their thresholds, plotted against time in 15-second increments; panel (b) shows an adaptive threshold.]

Figure 7. Results Obtained Using Thresholds

Since thresholds are commonly used to detect faults, we compared our results to those obtained with a single upper threshold. To combine the information from each MIB variable, we counted the total number of variables exceeding their thresholds at each time instance. The thresholds were calculated in two ways; figure 7(b) shows the results for an adaptive threshold.

Neither threshold method gives a clear indication of a problem before the fileserver goes down, so early detection cannot be ascertained. In addition, neither threshold method detects anything very important during the time the fileserver is down. The performance of the thresholds can be traced to the problems mentioned in section 3.1, viz, the difficulty of setting the thresholds and the inability of thresholds to detect subtle changes in behavior.
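The threshold baseline can be sketched as follows. The exact rules used for figure 7 are not fully recoverable from the text, so the mean + k*stdev form of the adaptive threshold is an assumption; the combination rule (count how many variables exceed their thresholds at each time instance) follows the text.

```python
# Sketch of the adaptive-threshold baseline compared in figure 7.
# Threshold form (mean + k*stdev over a sliding window) is assumed;
# counting exceedances across variables follows the paper.

from statistics import mean, stdev

def count_exceedances(series, k=3.0, window=20):
    """series: dict {variable: list of samples}.  For each time t,
    count the variables whose sample exceeds mean + k*stdev of the
    preceding `window` samples (the adaptive threshold)."""
    n = min(len(v) for v in series.values())
    counts = []
    for t in range(window, n):
        c = 0
        for vals in series.values():
            hist = vals[t - window:t]
            if vals[t] > mean(hist) + k * stdev(hist):
                c += 1
        counts.append(c)
    return counts
```

A single upper threshold of this kind fires only on large excursions in individual variables, which illustrates the limitation noted above: subtle, correlated shifts across many variables can stay under every per-variable threshold while still being abnormal in aggregate.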

5.3 Discussion

It is possible to use adaptive statistical methods to detect network faults without using models of specific faults. The Bayes network provided a theoretical framework within which we could use prior knowledge to determine a structure and learn the normal behavior of the measurement variables. The system was tested on real data involving a fileserver crash on a computer network. It successfully detected something abnormal approximately 8 minutes before the fileserver crashed. The detection was done from a router on the network which was operating properly; the router learned that something was wrong in the network from changes in its MIB variables. With early detection, the network manager can be warned of an impending failure, take corrective action, and avoid the failure and costly downtime. Our approach accomplishes early detection by recognizing deviations from normal behavior in each of the MIB variables.

This work shows one successful example of early detection of network faults. We are testing our methods on more fault cases. A more in-depth study is extending the method to on-line learning and detection of network faults.

ACKNOWLEDGMENT

We thank …ollinger and Roddy Collins for helping us with the data collection, George Nagy for related references, and an anonymous referee for valuable comments. We gratefully acknowledge the support of the US National Science Foundation (ECS-9312594 and CAREER IN-9502518).

REFERENCES

[1] G.E.P. Box, G.M. Jenkins, Time Series Analysis: Forecasting and Control, 1970; Holden-Day.

[2] L.N. Cassell, C. Partridge, "Network management architectures and protocols: Problems and approaches", IEEE J. Selected Areas in Communications, vol 7, 1989 Sep, pp 1104-1114.

[3] F.R.K. Chung, "Reliable software and communication, I: An overview", IEEE J. Selected Areas in Communications, vol 12, 1994 Jan, pp 23-32.

[4] C. Cortes, L.D. Jackel, W.-P. Chiang, "Predicting failures of telecommunication paths: Limits on learning machine accuracy imposed by data quality", Proc. Int'l Workshop on Applications of Neural Networks to Telecommunications 2, 1995; Stockholm.

[5] N. Dawes, J. Altoft, B. Pagurek, "Network diagnosis by reasoning in uncertain nested evidence spaces", IEEE Trans. Communications, vol 43, num 2/3/4, 1995 Feb-Apr, pp 466-476.

[6] R.H. Deng, A.A. Lazar, W. Wang, "A probabilistic approach to fault diagnosis in linear lightwave networks", IEEE J. Selected Areas in Communications, vol 11, 1993 Dec, pp 1438-1448.

[7] R. Duda, P. Hart, Pattern Classification and Scene Analysis, 1973; John Wiley & Sons.

[8] G. Goldszmidt, Y. Yemini, "Distributed management by delegation", Proc. 15th Int'l Conf. Distributed Computing Systems, 1995 Jun.

[9] D. Heckerman, "A tractable algorithm for diagnosing multiple diseases", Proc. Fifth Workshop on Uncertainty in Artificial Intelligence, 1989, pp 174-181; Windsor.

[10] D. Heckerman, J.S. Breese, K. Rommelse, "Decision-theoretic troubleshooting", Comm. ACM, vol 38, 1995 Mar, pp 49-57.

[11] C. Hood, "Intelligent detection for fault management of communication networks", PhD Dissertation, 1997; Rensselaer Polytechnic Institute.

[12] G. Jakobson, M.D. Weissman, "Alarm correlation", IEEE Network, vol 7, 1993 Nov, pp 52-59.

[13] A.A. Lazar, W. Wang, R. Deng, "Models and algorithms for network fault detection and identification: A review", Proc. Int'l Conf. Communications, 1992 Nov; Singapore.

[14] W. Leland, M. Taqqu, W. Willinger, D. Wilson, "On the self-similar nature of Ethernet traffic (extended version)", IEEE/ACM Trans. Networking.

[15] …uco, "Fault management tools for a … network operations environment", IEEE J. Selected Areas in Communications, vol 12, 1994 Aug, pp 1121-1130.

[16] R. Maxion, F. Feather, "A case study of Ethernet anomalies in a distributed computing environment", IEEE Trans. Reliability, vol 39, 1990 Oct, pp 433-443.

[17] K. McCloghrie, M. Rose, "Management information base for network management of TCP/IP-based internets: MIB-II", Request for Comments.


[18] R.E. Neapolitan, Probabilistic Reasoning in Expert Systems: Theory and Algorithms, 1990; John Wiley & Sons.

[19] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1988; Morgan Kaufmann.

[20] I. Rouvellou, "Graph identification techniques applied to network management problems", PhD Dissertation, 1993; Columbia University.

[21] P. Smyth, "Markov monitoring with unknown states", IEEE J. Selected Areas in Communications, vol 12, 1994 Dec, pp 1600-1612.

[22] D.J. Spiegelhalter, A.P. Dawid, S.L. Lauritzen, R.G. Cowell, "Bayesian analysis in expert systems", Statistical Science, vol 8, num 3, 1993 Aug, pp 219-288.

[23] W. Stallings, SNMP, SNMPv2, and CMIP: The Practical Guide to Network Management Standards, 1993; Addison-Wesley.

[24] S. Waldbusser, "Remote network monitoring management information base", Request for Comments 1271, 1991 Nov.

[25] Z. Wang, "Model of network faults", Integrated Network Management I (B. Meandzija and J. Westcott, Eds), 1989; Elsevier Science.

[26] O. Wolfson, S. Sengupta, Y. Yemini, "Managing communication networks by monitoring databases", IEEE Trans. Software Engineering, vol 17, num 9, 1991 Sep, pp 944-953.

[27] Y. Yemini, "A critical survey of network management protocol standards", Telecommunications Network Management into the 21st Century (S. Aidarous and T. Plevyak, Eds), 1994; IEEE Press.

AUTHORS

Dr. Cynthia S. Hood; Dep’t of Computer Science and Applied Math; Illinois Inst. of Technology; Chicago, Illinois 60616 USA. Internet (e-mail): [email protected]

Cynthia S. Hood received the BS (1987) from Rensselaer Polytechnic Institute. She then joined Bellcore, where she was involved in software reliability and SS7 requirements. During this time she completed an ME in Electrical Engineering at Stevens Institute of Technology. She received her PhD (1996) in Computer & Systems Engineering from Rensselaer Polytechnic Institute. She is an Assistant Professor of Computer Science & Engineering at Illinois Institute of Technology, Chicago. Her research interests include network reliability, enterprise network management, fault management, network security, and statistical methods.

Dr. Chuanyi Ji: Dep’t of Electrical, Computer, and System Eng’g; Rensselaer Polytechnic Inst; Troy, New York 12180 USA. Internet (e-mail): [email protected]

Chuanyi Ji received her BS (1983) from Tsinghua University, Beijing, her MS (1986) from the University of Pennsylvania, Philadelphia, and her PhD (1992) from California Institute of Technology, all in Electrical Engineering. She is an Assistant Professor in the Department of Electrical, Computer, and System Engineering at Rensselaer Polytechnic Institute. Her research areas include theoretical & experimental studies in both computer communication networks and machine learning. In computer communication networks, her research interests include network management, and traffic modeling and performance analysis. In machine learning, her research interests are statistical learning theory & algorithms, and their applications in network management and pattern recognition. Chuanyi Ji received the US NSF Early Career Development (CAREER) award in 1995, the Ming-Li scholarship (1989) at Caltech, and was an Honor graduate (1983) of Tsinghua University.

Manuscript TR96-108 received 1996 August 5; revised 1997 March 15

Responsible editor: C.J. Colbourn

Publisher Item Identifier S 0018-9529(97)06909-1

CORRECTION: 1997 MARCH Issue

Efficient Optimization of All-Terminal Reliable Networks, Using an Evolutionary Approach

In [1], an incorrect figure 1 was inadvertently submitted. The correct [1: figure 1] is shown here.

REFERENCE

[1] B. Dengiz, F. Altiparmak, A.E. Smith, "Efficient optimization of all-terminal reliable networks, using an evolutionary approach", IEEE Trans. Reliability, vol 46, 1997 Mar, pp 18-26. [PII S 0018-9529(97)02338-5]

Correction TR97-117 received 1997 May 12

Figure 1. Typical Network