Fault Management – Detection and Diagnosis

Outline

o Fault management functionality
o Event correlations concept
o Techniques

Definitions

A fault may cause hundreds of alarms. We need to be able to do the following:
o Detect the existence of faults
o Locate faults

An alarm
o External manifestation of a fault
  - Generated by components
  - Observable, e.g. via messages

An alarm represents a symptom of a fault.

An event
o An occurrence of interest, e.g. an alarm message

Fault Management Functionalities

Fault detection
o Should be real-time
o Techniques can be based on active schemes (e.g., polling) or event-based schemes (where a system component reports that it has detected a failure).

Fault location
o Is it a link, a system component, or an application component?

Determine corrective actions

Carry out corrective actions and determine their effectiveness

Alarm (Event) Correlation

Alarm explosion
o A single problem might trigger multiple symptoms (e.g., a router is down)

There could be too many alarms for an administrator to handle. Techniques used to help:
o Compression: reduction of multiple occurrences of an alarm into a single alarm
o Count: replacement of a number of occurrences of alarms with a new alarm
o Suppression: inhibiting a low-priority alarm in the presence of a higher-priority alarm
o Boolean: substitution of a set of alarms satisfying a condition with a new alarm
o Root cause determination
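
The first three techniques can be sketched in a few lines of Python. This is only an illustration: the alarm tuples, the priority scale (a higher number means a higher priority), and the function names are assumptions made for the example, not part of any particular management system.

```python
from collections import Counter

# Hypothetical alarm records: (alarm type, source, priority).
ALARMS = [
    ("link_down", "router1", 3),
    ("link_down", "router1", 3),
    ("timeout", "host7", 1),
    ("link_down", "router1", 3),
]

def compress(alarms):
    """Compression: keep one copy of each repeated alarm, preserving order."""
    return list(dict.fromkeys(alarms))

def count(alarms, threshold):
    """Count: replace an alarm seen at least `threshold` times with one summary alarm."""
    tally = Counter(alarms)
    out = []
    for alarm in dict.fromkeys(alarms):
        kind, source, priority = alarm
        if tally[alarm] >= threshold:
            out.append((f"{kind}_x{tally[alarm]}", source, priority))
        else:
            out.extend([alarm] * tally[alarm])
    return out

def suppress(alarms):
    """Suppression: drop alarms whose priority is below the highest one present."""
    if not alarms:
        return []
    top = max(priority for _, _, priority in alarms)
    return [alarm for alarm in alarms if alarm[2] == top]
```

With the sample data, compress(ALARMS) leaves one link_down alarm and one timeout alarm, while suppress(ALARMS) hides the low-priority timeout.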

Faults and Alarms

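[Figure: fault f0 at the top, faults f1 and f2 beneath it, and alarms A1 to A5 at the bottom, with correlations labelled C1, C2 and C3.]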

Faults and Alarms

The previous figure shows that correlation c1 detects the fault f1 and that correlation c2 detects the fault f2.

Correlating c1 and c2 into the correlation c0 allows the diagnosis of the fault f0.

Example

Let a1, a2, a3, a4, a5 be alarms generated by client processes indicating that a client process is not getting a response from a server.

Correlation techniques can be used to show that, since a1, a2 and a3 were generated by client processes trying to contact the same server, that server may be the problem. The same reasoning applies to a4 and a5.

Example

From the perspective of client processes, the servers (at the second level of the previous figure) are at fault.

However, it may be observed that alarms were also generated by these two servers. Their alarms indicate that neither server is getting a response and that both were trying to contact the same server. This is another correlation.

Fault Diagnosis

Major application of alarm correlation (often called event correlation) is fault diagnosis

Useful in fault location

Rule-Based Reasoning

Based on expert systems

Intended to represent heuristic knowledge as rules.

Components
o Knowledge Base (KB): Contains the expert rules that describe the action to be taken when a specific condition occurs, e.g., if-then-else
o Working Memory (WM): Stores information such as the system/network topology and data collected through the monitoring of application and network components.

Rule-Based Reasoning

Components (continued)
o Inference engine: matches the current state of the system (as represented by the monitored data) against the left-hand side of a rule in the knowledge base in order to trigger the action.

The rules are meant to encapsulate expert knowledge.

Why rule-based reasoning?
o Rules are interpreted, which means that rules can be changed without recompiling.
o Since expert knowledge can be wrong and/or incomplete, this feature is very useful.

Rule-Based Reasoning

Operation
o The WM is constantly scanned for facts (e.g., alarms) that can satisfy any of the left-hand sides of the rules.
o If such a rule is found, then the rule “fires”, i.e., the right-hand side is executed.
o The execution may result in facts being inserted into WM.

Example:
o Failed-connection(Y,X) and Failed-connection(X,Z) → faulty(Z).

Used by commercial systems such as Tivoli (from IBM) and HP OpenView.
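
The operation cycle can be illustrated with a very small forward-chaining loop. This is a sketch under stated assumptions: facts are plain tuples, the working memory is a Python set, and only the single example rule (two chained failed connections imply that the last node is faulty) is hard-coded. Real products use their own rule languages.

```python
def infer_faulty(working_memory):
    """Fire the example rule until no new facts are added to working memory."""
    wm = set(working_memory)
    changed = True
    while changed:
        changed = False
        failed = [fact for fact in wm if fact[0] == "failed_connection"]
        for _, y, x in failed:
            for _, x2, z in failed:
                # Left-hand side: failed_connection(Y, X) and failed_connection(X, Z)
                if x == x2:
                    new_fact = ("faulty", z)  # right-hand side of the rule
                    if new_fact not in wm:
                        wm.add(new_fact)
                        changed = True
    return wm


# Hypothetical facts: P1 cannot reach P2, and P2 cannot reach P4.
FACTS = {
    ("failed_connection", "P1", "P2"),
    ("failed_connection", "P2", "P4"),
}
```

Calling infer_faulty(FACTS) inserts the fact ("faulty", "P4") into the returned working memory; P2 is not marked faulty because its own alarm points further down the chain.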

Approaches

o Fault propagation
o Model traversing
o Case-based reasoning

Fault Propagation

Based on models that describe which symptoms will be observed if a specific fault occurs.

Monitors typically collect managed data at network elements and detect out of tolerance conditions, generating appropriate alarms.

An event model is used by a management application to analyze these alarms.

The event model represents knowledge of events and their causal relationships.

Fault Propagation (Coding Approach)

Correlation is concerned with analysis of causal relations among events.

The notation e → f denotes causality of the event f by the event e.

Causality is a partial order between events.

The relation may be described by a causality graph whose directed edges represent causality.

Distinguish between faults (problems) and symptoms.

Nodes of a causality graph may be marked as problems (P) or symptoms (S).

Some symptoms are not directly caused by faults, but rather by other symptoms.

Fault Propagation (Coding Approach)

[Figure: Example Causality Graph, with nodes numbered 1 to 11; the edges are not recoverable from this transcript.]

Fault Propagation (Coding Approach)

The correlation problem
o A correlation p ⇒ s means that problem p can cause a chain of events leading to the symptom s.
o This can be represented by a graph.
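
One way to read this: the correlation graph keeps, for each problem, only the symptoms reachable from it in the causality graph. The sketch below shows that reachability step. The edge list is hypothetical (the edges of the figure are not available here); it was chosen only so that problems 1, 2 and 11 and symptoms 6, 9 and 10 play the same roles as in the example codes.

```python
# Hypothetical causality edges: event -> events it directly causes.
CAUSES = {1: [3], 3: [9, 10], 2: [6], 6: [10], 11: [9, 10]}
PROBLEMS = [1, 2, 11]
SYMPTOMS = [6, 9, 10]

def reachable(graph, start):
    """Return every event reachable from `start` by following causality edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def correlation_graph(graph, problems, symptoms):
    """Map each problem to the sorted list of symptoms it can eventually cause."""
    wanted = set(symptoms)
    return {p: sorted(reachable(graph, p) & wanted) for p in problems}
```

Here symptom 10 is caused by symptom 6 rather than directly by problem 2, yet the correlation graph still links problem 2 to symptom 10.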

Fault Propagation (Coding Approach)

[Figure: A Correlation Graph over nodes 1, 2, 6, 9, 10 and 11; the edges are not recoverable from this transcript.]

Fault Propagation (Coding Approach)

For each fault (problem) p, the correlation graph provides a vector that summarizes the information available about the correlation between p and the symptoms.

This is referred to as the code of the problem.

Alarms may also be described using a vector assigning measures of 1 and 0 to observed and unobserved symptoms.

The alarm correlation problem is that of finding problems whose codes optimally match an observed alarm vector.

Fault Propagation (Coding Approach)

Example codes (look at the correlation graph example); each vector is ordered as (symptom 6, symptom 9, symptom 10)
o 1 = (0,1,1) – This indicates that problem 1 causes symptoms 9 and 10
o 2 = (1,0,1) – This indicates that problem 2 causes symptoms 6 and 10
o 11 = (0,1,1) – This indicates that problem 11 causes symptoms 9 and 10.

Fault Propagation (Coding Approach)

Example alarm vector
o Assume that alarms indicating symptoms 9 and 10 have been observed.
o a = (0,1,1)

We can infer that either 1 or 11 matches the observation a.

These two problems have identical codes and hence are indistinguishable.

The fault management application may have to do additional tests.
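
The matching step can be sketched as a nearest-code search. The codes are the three example codes above; using Hamming distance as the measure of an "optimal" match is an assumption of this sketch, and the function names are made up for illustration.

```python
# Example codes, ordered as (symptom 6, symptom 9, symptom 10).
CODES = {1: (0, 1, 1), 2: (1, 0, 1), 11: (0, 1, 1)}

def hamming(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def best_matches(codes, alarm):
    """Return every problem whose code is closest to the observed alarm vector."""
    best = min(hamming(code, alarm) for code in codes.values())
    return sorted(p for p, code in codes.items() if hamming(code, alarm) == best)
```

For the alarm vector a = (0,1,1), best_matches(CODES, (0, 1, 1)) returns both 1 and 11, which is exactly the ambiguity that additional tests would have to resolve.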

Fault Propagation (Coding Approach)

A Codebook is an array of the vectors just defined.

The number of symptoms associated with a single problem may be very large.
o Sometimes a much smaller set of symptoms is selected to accomplish a desired level of distinction among problems.

Fault Propagation (Coding Approach)

Example Codebook

symptom  p1  p2  p3  p4  p5  p6
1         1   0   0   1   0   1
2         1   1   1   1   0   0
4         1   0   1   0   1   0

Fault Propagation (Coding Approach)

Example Codebook

symptom  p1  p2  p3  p4  p5  p6
1         1   0   0   1   0   1
3         1   1   0   1   0   0
4         1   0   1   0   1   0
6         1   1   1   0   0   1
9         0   1   0   0   1   1
18        0   1   1   1   0   0

Fault Propagation (Coding Approach)

Distinction among problems is measured by the Hamming Distance between their codes

The radius of a codebook is one half of the minimal Hamming distance among codes.

When the radius is 0.5 (a minimal Hamming distance of 1), every pair of codes differs in at least one symptom, so the codebook provides distinction between problems.
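
The radius of the two example codebooks can be checked with a short script. It assumes that each row of a codebook is a symptom and each column is the code of one problem (p1 to p6); the variable and function names are chosen for illustration.

```python
from itertools import combinations

# Rows are symptoms; columns are problems p1..p6.
SMALL_CODEBOOK = {
    1: (1, 0, 0, 1, 0, 1),
    2: (1, 1, 1, 1, 0, 0),
    4: (1, 0, 1, 0, 1, 0),
}
LARGE_CODEBOOK = {
    1: (1, 0, 0, 1, 0, 1),
    3: (1, 1, 0, 1, 0, 0),
    4: (1, 0, 1, 0, 1, 0),
    6: (1, 1, 1, 0, 0, 1),
    9: (0, 1, 0, 0, 1, 1),
    18: (0, 1, 1, 1, 0, 0),
}

def hamming(u, v):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def problem_codes(codebook):
    """Transpose the symptom rows into one code (column) per problem."""
    return list(zip(*codebook.values()))

def radius(codebook):
    """Half of the minimal Hamming distance between any two problem codes."""
    codes = problem_codes(codebook)
    return min(hamming(u, v) for u, v in combinations(codes, 2)) / 2
```

Under this reading, the three-symptom codebook has radius 0.5 and the six-symptom codebook has radius 1.5, so the larger one can still tell the problems apart when a single symptom is lost or spurious.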

Fault Propagation (Coding Approach)

Is this easy to apply to application processes?
o No

Why?
o Applications are dynamic
o The coding approach assumes the system is fairly static.

Model Traversing

Reconstruct fault propagation at run time using relationships between objects

Begins with the managed object that generated the event

Works best when the object relationship is graph-like and easy to obtain, since it must be obtained at run time
o Performance
o Potential parallelism

Weaknesses
o Lack of flexibility
o Not well-structured like fault propagation

Model Traversing

Characteristics
o Event-Driven: The fault management application is passive until an event arrives. This event is the reporting of a symptom.
o Correlation: Decides whether two events result from the same primary fault.
o Relationship Exploration: The fault management application correlates events by detecting special relationships between the source objects of those events.

Model Traversing

Event reports should have the following information:
o Symptom type
o Source
o Target
o etc.

If symptom si’s target is the same as sj’s source then this is an indication that si is a secondary symptom. This allows us to ignore certain alarms.
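
The source/target test can be sketched as a simple filter over event reports. The Event fields mirror the list above; the sample reports and the call structure they imply (P5 calls P1, P1 and P3 call P4) are assumptions made for the example.

```python
from typing import NamedTuple

class Event(NamedTuple):
    symptom: str
    source: str
    target: str

def primary_events(events):
    """Drop secondary symptoms: reports whose target is itself the source of a report."""
    sources = {event.source for event in events}
    return [event for event in events if event.target not in sources]


# Hypothetical reports: each one says that `source` got no response from `target`.
REPORTS = [
    Event("timeout", "P5", "P1"),
    Event("timeout", "P1", "P4"),
    Event("timeout", "P3", "P4"),
]
```

With these reports, the (P5, P1) report is treated as secondary because P1 is itself complaining about P4, leaving the (P1, P4) and (P3, P4) reports for diagnosis.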

Model Traversing

For each event, construct a graph of objects (models) related to the source object of that event.

When two such graphs touch each other, i.e. contain at least one common object, the events which initiated their construction are regarded to be correlated. Possibly these two events are the result of the same fault.

If si is correlated with sj and sj is correlated with sk then through transitivity we can conclude that si is a secondary symptom.

Model Traversing

The process of eliminating symptom reports may result in reports that have the same target.

Example:
o s1 and t
o s2 and t

It might be necessary to construct possible paths of objects between s1 and t as well as s2 and t.

Nodes in common are good candidates for the faults.

Model Traversing

We will now discuss the building of graphs.

The algorithm for building graphs uses relationships between network hardware and software components to search for the root cause of a problem.

Assumes that information about the relationships between the components is available (e.g., through a database).

Assumes that there are functions including these:
o getNextHop(source, target, B): Get the node representing the next entity (that comes after B) in the path between source and target. Note that this may return more than one entity.
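
A minimal sketch of how such a function could drive path construction follows. The relationship table and entity names are invented for illustration, and get_next_hop here is a stand-in for the getNextHop function described above: a real version would query the relationship database and could use source and target to prune candidates.

```python
# Hypothetical relationship table: entity -> entities that can follow it on a path.
RELATIONS = {
    "client": ["hostA"],
    "hostA": ["hub1", "hub2"],
    "hub1": ["hostB"],
    "hub2": ["hostB"],
    "hostB": ["server"],
}

def get_next_hop(source, target, b):
    """Return the entities that can follow `b` on a path from source to target."""
    return RELATIONS.get(b, [])

def build_paths(source, target):
    """Enumerate every loop-free path of objects from source to target."""
    paths, stack = [], [[source]]
    while stack:
        path = stack.pop()
        if path[-1] == target:
            paths.append(path)
            continue
        for nxt in get_next_hop(source, target, path[-1]):
            if nxt not in path:  # avoid revisiting an object
                stack.append(path + [nxt])
    return sorted(paths)
```

Because hostA has two possible next hops, build_paths("client", "server") yields two candidate paths, one through each hub.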

Model Traversing

Example

Assume the following configuration of processes and machines. All machines are connected through the Ethernet.
o P1 is on chocolate; P2 is on peppermint
o P3 is on vanilla; P4 is on strawberry
o P5 is on doublefudge; P6 is on mintchip

Communication is through remote procedure calls. This basically requires that all communication go through a daemon process on the server host’s machine. We will call this rpcd.

Model Traversing

Call structure is depicted in the following graph:

[Figure: call graph over processes P1 to P6; the edges are not recoverable from this transcript.]

Model Traversing

Example

Assume that P4 terminates abnormally, causing a cascade of timeouts.

Correlation will result in focusing on these event reports:
o (P1,P4)
o (P3,P4)

Not enough to diagnose the fault.
o It’s all at the process level.
o There are still many entities or objects to examine, since you do not want everything generating a message.

Model Traversing

Example

Starting with P1, the next component (node) along the path of the connection between P1 and P4 is identified.

Between P1 and P4 are many entities. We will start out with a vertical search, which establishes that P1 is running on a host machine called chocolate.

Model Traversing

chocolate is connected to the hub through an ethernet cable.

The hub is connected through an ethernet connection cable to strawberry, where P4 is running.

Thus we can say that the path is the following:
o P1, chocolate, ethernet connection cable, hub, ethernet connection cable, strawberry, rpcd.strawberry, P4

The path between P3 and P4 is the following:
o P3, vanilla, ethernet connection cable, hub, ethernet connection cable, strawberry, rpcd.strawberry, P4

Model Traversing

Example

This suggests that we can narrow down the problem to the hub, the ethernet connection cable between the hub and strawberry, strawberry, rpcd.strawberry, and P4.

At this point, the fault management application may want to poll for additional information. The polling may check to see if something is up or not. An example is applying the ping operation to the host machine called strawberry.

What if every entity is up? This may indicate that strawberry is overloaded. An indication of an overload can be found by measuring the CPU load.
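
The narrowing step amounts to intersecting the two paths. In the sketch below, the two paths are the ones from P1 and P3 to P4; the cable labels (for example "cable:hub-strawberry") are made-up names used only to tell the individual ethernet cables apart.

```python
PATH_FROM_P1 = [
    "P1", "chocolate", "cable:chocolate-hub", "hub",
    "cable:hub-strawberry", "strawberry", "rpcd.strawberry", "P4",
]
PATH_FROM_P3 = [
    "P3", "vanilla", "cable:vanilla-hub", "hub",
    "cable:hub-strawberry", "strawberry", "rpcd.strawberry", "P4",
]

def common_objects(*paths):
    """Return the objects shared by all paths, in the order of the first path."""
    shared = set(paths[0]).intersection(*paths[1:])
    return [node for node in paths[0] if node in shared]
```

common_objects(PATH_FROM_P1, PATH_FROM_P3) yields the hub, the cable between the hub and strawberry, strawberry, rpcd.strawberry and P4, which are the candidates that polling (for example a ping of strawberry) would then examine.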

Model Traversing

Building the graphs requires structural information and the use of rules.

Model Traversing

Implementation

What management services are needed?

o To detect and report symptoms, one could use application instrumentation.

o The instrumentation library should most likely talk with a management process (or agent).

o The agent sends an event report to the event server.

o The event server may have a set of rules for symptom correlation.

o After correlation, a task may be invoked that does relationship exploration and the final diagnosis.

Model Traversing

Implementation

Information Needed

o Information representing the relationships between hardware components and software components is needed.

o This needs to be stored in a database or a directory service (e.g., X.500).

o An API needs to be defined to retrieve this information.

o Rules can be used to help construct the graph.

Model Traversing

Implementation

Information Needed
o How is the information collected?
o Many different techniques. Examples include:
  - Processes (using instrumentation) may have to register and have their information put into the database.
  - Network information may have to be entered manually.

Model Traversing

Summary

Performs very quickly once the model is built
o Model can be constructed incrementally during normal processing; do not have to wait until failure

Can operate in parallel

Can accommodate multiple events; different starting points can result in the same problem element

Does require a model reflective of run-time
o One that changes too fast is a problem

Case-based Reasoning (CBR)

Objective
o Learn from experience
o Solutions to novel problems
o Avoid extensive maintenance

Basic idea: recall, adapt and execute episodes of former problem-solving in an attempt to deal with a current problem

Case-based Reasoning

Approach

[Figure: boxes labelled Input, Retrieve, Adapt and Process, plus a Case Library.]

Case-Based Reasoning

Strategy

Useful for domains in which a body of knowledge with a case structure exists or is easily obtainable

Case structure:
o Set of fields or “slots”
o Capture “essential” information

Yield discriminators
o Set of fields highly correlated with problems or solutions

Need to find “closest” match
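
The retrieval step can be sketched as a closest-match search over the discriminator slots. Everything concrete here is hypothetical: the slot names, the cases in the library, and the similarity measure (a plain count of matching slots) are assumptions chosen to keep the example short.

```python
# Hypothetical case library: each case has discriminator slots and a stored solution.
CASE_LIBRARY = [
    {"slots": {"symptom": "timeout", "layer": "application", "host_up": True},
     "solution": "restart server process"},
    {"slots": {"symptom": "timeout", "layer": "network", "host_up": False},
     "solution": "check cable and hub port"},
    {"slots": {"symptom": "slow_response", "layer": "application", "host_up": True},
     "solution": "check CPU load"},
]

def similarity(case_slots, query):
    """Count how many slots of the query are matched by the stored case."""
    return sum(case_slots.get(name) == value for name, value in query.items())

def retrieve(library, query):
    """Return the stored case closest to the query (first one wins on ties)."""
    return max(library, key=lambda case: similarity(case["slots"], query))
```

A query such as {"symptom": "timeout", "layer": "network", "host_up": False} retrieves the second case; the adapt step would then decide whether its stored solution can be applied as is.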

Case-Based Reasoning

Adapt

[Figure: boxes labelled Input, Retrieve, Adapt and Process, plus Case Library, Discriminators, Adaptation Techniques and User-based Adaptation.]

Case-based Reasoning

Summary

Needs well-defined cases

Likely to work well when problems are “close” to existing solutions

Problem selecting solutions when “not so close”
o Dangerous in following actions?
o How to adapt?

Summary

Variety of approaches

Mostly applied in network management scenarios
o More controlled?
o Better understanding of problems?

Limited experience in application management