
Dependability Analysis of Distributed Computer Systems with Imperfect Coverage*

Xinyu Zang, Hairong Sun and Kishor S. Trivedi
xzang, hairong, [email protected]
Center for Advanced Computing and Communications
Department of Electrical and Computer Engineering
Duke University
Durham, NC 27708

Abstract

In this paper, a new algorithm based on binary decision diagrams (BDD) is proposed for the dependability analysis of distributed computer systems (DCS) with imperfect coverage. Minimal file spanning trees (MFST) are generated and stored via BDD manipulation. Using multistate concepts, algorithms for BDD generation and evaluation that can deal with imperfect coverage are given. Due to the nature of BDDs, the sum of disjoint products (SDP) can be represented implicitly by a BDD, which avoids the huge storage and high computational complexity otherwise incurred for large systems. Several examples are given to show the efficiency of this algorithm.

Index Terms: Distributed Computer System (DCS), Minimal File Spanning Tree (MFST), Imperfect Coverage, Multistate Component, Reliability Evaluation, Binary Decision Diagram (BDD)

* This research was supported in part by the National Science Foundation under Grant No. EEC9418765, and by the Department of Defense as an enhancement project to the Center for Advanced Computing and Communications at Duke University.

1 Introduction

With the rapid development of computer networking, distributed computer systems (DCS) have become an increasingly attractive and efficient way to share system resources, achieve fault tolerance, and obtain high extensibility and dependability. In a DCS, the successful execution of a distributed program usually requires one or more resources that reside on the hosts of the DCS, e.g. processing elements, memory modules and data files. These hosts are interconnected by communication links, which enable a running distributed program to access resources on remote hosts.

In [10], Kumar et al. model a DCS as an undirected graph G(v, e) in which the vertices represent the hosts and the edges represent the communication links. Fig. 1 shows a model of a DCS. Both vertices and edges are in either an operational or a failed state, and their behaviors are independent. The system resources required by a program, except for processing elements, are simplified to files. A file spanning tree (FST) is defined as a tree that connects the root node (the host that runs the program under consideration) to some other nodes such that its vertices hold all the files required to execute that program. If all FSTs that are supersets of other FSTs are removed from the FST set, the remaining FSTs make up the set of minimal file spanning trees (MFSTs); i.e., an FST is an MFST if no other FST is a subset of it. For instance, program P1 in Fig. 1 works if it can run on node n1 or n4 and can access files {F1, F2, F3}. The MFSTs of program P1 in Fig. 1 are:

1. {n1, e1, n2}
2. {n1, e4, n3, e3, n2}
3. {n4, e5, n3}
4. {n4, e2, n2, e3, n3}

[Figure 1 graphic: an undirected graph with four nodes n1 to n4 and five links e1 to e5; each node is annotated with the programs (PRG) it runs and the files (FA) it holds.]

Figure 1: A four-node DCS

Based on these, two reliability measures, distributed program reliability (DPR) and distributed system reliability (DSR), were defined as:

$$\mathrm{DPR} = \Pr[\text{at least one MFST of a given program is operational}] = \Pr\Big(\bigcup_{j=1}^{n_{mfst}} \mathrm{MFST}_j\Big)$$

where $n_{mfst}$ is the number of MFSTs that run the given program, and

$$\mathrm{DSR} = \Pr[\text{at least one MFST for all programs is operational}] = \Pr\Big(\bigcup_{i=1}^{n_{mfst}} \mathrm{MFST}_i\Big)$$

where $n_{mfst}$ is the number of MFSTs that run all programs. The algorithm to obtain the MFSTs is also given in [10]. After the MFSTs are obtained, a disjoint-product method is applied to obtain the probability expression, i.e. the sum of disjoint products (SDP).
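The DPR definition above can be checked numerically on the four-node example. The following is a minimal sketch, not the SDP method of [10]: it enumerates every up/down state of the nine components and sums the probability of the states in which at least one MFST is fully operational. The function name `dpr_by_enumeration` and the reliability value 0.9 are assumptions made only for illustration; the MFSTs are the four listed above.

```python
from itertools import product

# The four MFSTs of program P1 listed in Section 1 (taken from the text).
MFSTS = [
    {"n1", "e1", "n2"},
    {"n1", "e4", "n3", "e3", "n2"},
    {"n4", "e5", "n3"},
    {"n4", "e2", "n2", "e3", "n3"},
]
COMPONENTS = sorted(set().union(*MFSTS))


def dpr_by_enumeration(mfsts, reliability):
    """Pr[at least one MFST is fully operational], by brute-force enumeration."""
    components = sorted(set().union(*mfsts))
    total = 0.0
    # Every component is independently up (True) or down (False).
    for states in product((True, False), repeat=len(components)):
        up = {c for c, s in zip(components, states) if s}
        prob = 1.0
        for c, s in zip(components, states):
            prob *= reliability[c] if s else 1.0 - reliability[c]
        # The program runs if all components of some MFST are up.
        if any(m <= up for m in mfsts):
            total += prob
    return total


# ASSUMPTION: a uniform reliability of 0.9, chosen only for illustration.
RELIABILITY = {c: 0.9 for c in COMPONENTS}
DPR = dpr_by_enumeration(MFSTS, RELIABILITY)
print(f"DPR = {DPR:.6f}")
```

The enumeration visits 2^n states, so it is only usable as a cross-check on small graphs; this exponential cost is what the SDP and BDD approaches discussed in the paper aim to avoid.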

To compute the DPR of a DCS efficiently, Kumar et al. [9] presented an algorithm that obtains the probability expression directly during the generation of the MFSTs, i.e., in one step. In [8], Kumar and Agrawal generalized this algorithm so that it can deal with the case in which one program can run on multiple nodes. Ke and Wang [7] gave an improved algorithm that addresses imperfect nodes. None of these algorithms, however, can be used directly to obtain the DSR of a DCS.

Another issue that the above algorithms do not consider is imperfect coverage. Generally, the hosts in a DCS can be located at different geographic sites. It is possible that some failures of hosts and communication links are not detected promptly, so that the redundant units (the other hosts and links) cannot be utilized and the distributed program cannot be executed successfully. Dugan and Trivedi addressed this issue in [6]. Doyle et al. introduced the coverage factor into combinatorial models in [5]. To address imperfect coverage in DCS, Lopez-Benitez used Petri nets in [11]. Like other Markov-based models, this method becomes impractical when the number of hosts and links in the DCS is large. In this paper, we use the multistate method developed in [17] to deal with imperfect coverage.

The binary decision diagram (BDD) was first used in VLSI design and verification as an efficient method for manipulating Boolean expressions [2, 1]. Bryant [2] and other researchers showed that, in most cases, BDDs use less memory to represent large Boolean expressions than an explicit representation does. Because BDDs are based on Shannon's decomposition, a reliability evaluation is easily obtained from the BDD format. Some researchers have already used BDDs for the reliability analysis of fault trees [14, 3, 15, 16, 4].

In this paper, a new algorithm based on BDDs is proposed for the dependability analysis of DCS with imperfect coverage. The BDD for a program is generated by searching the MFSTs of that program. The BDD for the DSR is obtained by a logic operation between the BDDs of the individual programs. Then, using multistate concepts, we convert these BDDs to multistate BDDs that can deal with imperfect coverage. The final BDD for a program or for the whole system implicitly represents the SDPs, which avoids the huge storage needed for a large number of SDPs, and it can easily be evaluated to obtain the unreliability. The ordering strategy is also discussed, because the size of a BDD depends heavily on the variable order. Several ordering strategies are provided and compared. A sensitivity analysis is carried out for DCS with imperfect coverage.

The paper is organized as follows. Section 2 presents some preliminary concepts of BDDs. Section 3 reviews the coverage model and the multistate model. Section 4 describes the algorithm. Some examples are provided in Section 5 to show the efficiency of the algorithm. The last section gives the conclusion and future work.

2 Binary Decision Diagrams (BDD)

In [2], Bryant gave the basic definitions for BDDs (also known as function graphs). A subset of general BDDs, reduced ordered binary decision diagrams (ROBDD), was introduced as an efficient means of manipulating Boolean expressions. The authors of [14, 3, 15] applied BDDs to the reliability evaluation of fault trees. Here we review some basic concepts of BDDs.

2.1 Shannon's decomposition and ite format

Theorem 1 (Shannon's decomposition): Let f be a Boolean expression on X, and let x be a variable of X. Then

$$f = x \cdot f_{x=1} + \bar{x} \cdot f_{x=0} \tag{1}$$

where f evaluated at x = v is denoted by $f_{x=v}$.

Shannon's decomposition is the basis for using BDDs. In order to express Shannon's decomposition concisely, the If-Then-Else (ite) format is defined as:

$$f = \mathrm{ite}(x, F_1, F_2) = x \cdot F_1 + \bar{x} \cdot F_2 \tag{2}$$

where $F_1 = f_{x=1}$ and $F_2 = f_{x=0}$.

2.2 BDD

A BDD is a directed acyclic graph (DAG) that is based on Shannon's decomposition. The graph has two sink nodes, labeled 0 and 1, representing the two corresponding constants 0 and 1. Each non-sink node is labeled with a Boolean variable x and has two outgoing edges that represent the two corresponding expressions in Shannon's decomposition. These two edges are called the 0-edge (or else-edge) and the 1-edge (or then-edge). The node linked by the 1-edge represents the Boolean expression when x = 1, i.e., $f_{x=1}$ in Eqn. (1), while the node linked by the 0-edge represents the Boolean expression when x = 0, i.e., $f_{x=0}$ in Eqn. (1). Thus, each non-sink node in a BDD encodes an ite format. One of the key features of a BDD is the disjoint nature of the two subexpressions.

An ordered binary decision diagram (OBDD) is a BDD with the constraint that the variables are ordered and every source-to-sink path in the OBDD visits the variables in ascending order. A reduced ordered binary decision diagram (ROBDD) is an OBDD in which each node represents a distinct Boolean expression.

In practice, ROBDDs are widely used. To generate an ROBDD, the ordering of the variables has to be selected first, and this order is not changed during the generation¹. In this paper, $x_i < x_j$ denotes that variable $x_j$ is behind variable $x_i$ in the order of variables. Fig. 2 shows the ROBDDs for several Boolean expressions.

¹ We do not consider the dynamic reordering of BDDs in this paper.
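The definitions above can be made concrete with a small sketch. It builds an ROBDD by applying Shannon's decomposition variable by variable under a fixed order and shares identical nodes through a unique table. The class `ROBDD` and its methods are hypothetical names introduced here, not part of any BDD package cited in the paper, and the construction is exponential in the number of variables, so it is meant only to illustrate the ite format and the two reduction rules.

```python
from itertools import product


class ROBDD:
    """Tiny ROBDD builder for a fixed variable order (illustrative only)."""

    def __init__(self, order):
        self.order = list(order)  # first variable sits at the root
        self.unique = {}          # (var, then_id, else_id) -> node id
        self.nodes = {}           # node id -> (var, then_id, else_id)

    def ite(self, var, then_id, else_id):
        """Node for ite(var, then, else), applying both reduction rules."""
        if then_id == else_id:    # the variable is irrelevant: skip the node
            return then_id
        key = (var, then_id, else_id)
        if key not in self.unique:            # share structurally equal nodes
            node_id = len(self.unique) + 2    # ids 0 and 1 are the sink nodes
            self.unique[key] = node_id
            self.nodes[node_id] = key
        return self.unique[key]

    def build(self, func, level=0, env=None):
        """Shannon-expand `func` (a callable on a dict of 0/1 values)."""
        env = env or {}
        if level == len(self.order):
            return int(bool(func(env)))
        var = self.order[level]
        high = self.build(func, level + 1, {**env, var: 1})  # f with var = 1
        low = self.build(func, level + 1, {**env, var: 0})   # f with var = 0
        return self.ite(var, high, low)

    def evaluate(self, node_id, env):
        """Follow 1-edges and 0-edges from `node_id` down to a sink node."""
        while node_id > 1:
            var, then_id, else_id = self.nodes[node_id]
            node_id = then_id if env[var] else else_id
        return node_id


# The two expressions used as examples in Fig. 2, with the order a < b < c.
def g_func(v):
    return (v["a"] and v["c"]) or (v["b"] and v["c"])


def h_func(v):
    return (v["a"] and v["b"]) or v["c"]


bdd = ROBDD(["a", "b", "c"])
g = bdd.build(g_func)
h = bdd.build(h_func)
print("non-sink nodes shared by g and h:", len(bdd.nodes))
```

Because both expressions are built in the same `ROBDD` object, the node for variable c is created once and reused, which is the sharing that makes ROBDDs compact.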

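[Figure 2 graphic: two ROBDDs over the variables a, b and c with sink nodes 0 and 1; the sub-BDDs below the root are labeled G1, G2 in panel (a) and H1, H2 in panel (b).]

(a) g = a·c + b·c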

(b) h = a·b + c

Figure 2: BDD representation of Boolean expressions

2.3 Manipulation of BDDs

BDDs represent Boolean expressions graphically, and manipulating them with logical operations is straightforward. For instance, consider the logic operation OR on two Boolean expressions g and h. We first generate the BDDs for g and h using the same ordering of variables. Assume first that the first variable is common to g and h, and denote this common variable by x. Using Eqn. (2), the ite formats of the expressions are

$$g = \mathrm{ite}(x, g_{x=1}, g_{x=0}) = \mathrm{ite}(x, G_1, G_2)$$
$$h = \mathrm{ite}(x, h_{x=1}, h_{x=0}) = \mathrm{ite}(x, H_1, H_2)$$

Then g + h in ite format is:

$$\begin{aligned}
\mathrm{ite}(x, G_1, G_2) + \mathrm{ite}(x, H_1, H_2) &= g + h \\
&= \mathrm{ite}(x, (g + h)_{x=1}, (g + h)_{x=0}) \\
&= \mathrm{ite}(x, (g_{x=1} + h_{x=1}), (g_{x=0} + h_{x=0})) \\
&= \mathrm{ite}(x, (G_1 + H_1), (G_2 + H_2))
\end{aligned} \tag{3}$$

The same procedure is applied recursively to $G_1 + H_1$ and $G_2 + H_2$ until one of the operands becomes a constant expression, 0 or 1, or the operands no longer have their first variable in common.

Next, assume that h does not contain variable x but contains variable y, and that x < y in the order of variables. The ite formats of the expressions are

$$g = \mathrm{ite}(x, g_{x=1}, g_{x=0}) = \mathrm{ite}(x, G_1, G_2)$$
$$h = \mathrm{ite}(y, h_{y=1}, h_{y=0}) = \mathrm{ite}(y, H_1, H_2)$$

Then g + h in ite format is:

$$\begin{aligned}
\mathrm{ite}(x, G_1, G_2) + \mathrm{ite}(y, H_1, H_2) &= g + h \\
&= \mathrm{ite}(x, (g + h)_{x=1}, (g + h)_{x=0}) \\
&= \mathrm{ite}(x, (g_{x=1} + h), (g_{x=0} + h)) \\
&= \mathrm{ite}(x, (G_1 + h), (G_2 + h))
\end{aligned} \tag{4}$$

The above example uses the OR operation, but any other logical operation can be used; the only difference is the truth table applied when one of the operands becomes a constant expression. In practice, a BDD is generated by applying logical operations to variables rather than by using Shannon's decomposition directly.

Fig. 3 shows the OR operation of the two Boolean expressions in Fig. 2. Note that two reductions are made during the operation:

• Because the results of 0 + H1 and G2 + H1 are the same, the node b at the left becomes irrelevant and can be removed.
• Because the 0-edge of the node b at the right is linked to the same subtree as the 0-edge of the node a (after the node b at the left has been removed), these two subtrees can be merged into one.

[Figure 3 graphic: the BDDs of g and h from Fig. 2 are combined by OR; intermediate nodes are labeled G1+H1, G2+H2, G2+H1, 0+H1, H1+G2 and 1+G2, and a REDUCE step yields the final BDD.]

Figure 3: OR operation of two BDDs

3 Coverage model and multistate model

In this section, we review some basic concepts of the coverage model and the multistate model, and present how to use the multistate model to deal with imperfect coverage.

3.1 Coverage model

Generally, the reliability of a system is improved by adding redundant components. However, the failure of a component must be detected promptly so that the redundant components can be utilized. Otherwise, the system may still fail even though there are many redundant components that could replace the failed component [6]. A parameter called coverage was introduced to reflect the ability of the system to recover automatically from the occurrence of a fault during normal system operation. The coverage is defined as:

$$\text{coverage} = \Pr[\text{system recovers} \mid \text{fault occurs}]$$

In a DCS, faults may occur at hosts and communication links. Some of them may not be detected promptly, since the hosts and communication links may be distributed over a wide area. These undetectable faults cause the distributed program to fail. Hence, imperfect coverage needs to be considered in a practical system.
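To make the coverage definition concrete, the sketch below computes the probability that a single component is operational, detectably failed, or undetectably failed at time t. It assumes an exponentially distributed time to failure with constant rate λ and a fixed coverage c; the function name `state_probabilities` and the numeric values in the usage line are assumptions chosen for illustration.

```python
import math


def state_probabilities(failure_rate, coverage, t):
    """State probabilities of one component at time t.

    ASSUMPTION: failures arrive at a constant rate `failure_rate`, and each
    failure is detected with probability `coverage`, independently of t.
    """
    p_fail = 1.0 - math.exp(-failure_rate * t)
    return {
        "operational": 1.0 - p_fail,
        "detectable_failure": coverage * p_fail,
        "undetectable_failure": (1.0 - coverage) * p_fail,
    }


# Hypothetical values: rate 0.01 per time unit, coverage 0.95, t = 10.
example = state_probabilities(0.01, 0.95, 10.0)
for name, value in example.items():
    print(f"{name}: {value:.6f}")
```

With c = 1 the undetectable-failure probability vanishes and the component reduces to the usual two-state model.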

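[Figure 4 graphic: three-state Markov chain with states 1, 2 and 3, labeled operation, undetectable failure and detectable failure, and transition rates cλ and (1-c)λ.]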

Figure 4: Markov chain for imperfect coverage

Because the faults of components (hosts and communication links) must be divided into detectable faults and undetectable faults, a single failure mode per component is not enough. In this paper, each component has two failure modes, corresponding to detectable faults and undetectable faults. Fig. 4 shows a Markov chain representing the behavior of a component.

3.2 Multistate model

In many combinatorial reliability models, components are assumed to be in one of two states: functioning or failed. However, in some applications, more than two states need to be considered. The components in a DCS with imperfect coverage may be in one of three states:

1. operational state;
2. undetectable failed state, which brings the system down;
3. detectable failed state, which does not by itself bring the system down.

In [17], we developed a BDD-based algorithm to solve multistate problems. Each state of a multistate component is represented by a Boolean variable. Because of the dependence among these Boolean variables, a Boolean algebra with restrictions on variables is used to address this dependence. In this paper, we use the same concepts to deal with imperfect coverage in DCS, but in a simplified form, since there are only three states in the coverage model.

Let A be a component (host or link). The three states, operational state, undetectable failed state and detectable failed state, can be represented by three Boolean variables A1, A2 and A3 respectively. $A_i = 1$, i = 1, 2, 3, means that component A is in the corresponding state. According to the restriction rule of Boolean algebra with restrictions on variables,

$$A_1 = \overline{A_2 + A_3} = \bar{A}_2 \cdot \bar{A}_3 \tag{5}$$

This equivalence can be represented in BDD format as shown in Fig. 5.

[Figure 5 graphic: a BDD with nodes A1, A2 and A3 and sink nodes 0 and 1.]

Figure 5: BDD format of the equivalence $A_1 = \bar{A}_2 \cdot \bar{A}_3$

4 BDD algorithm for DCS

4.1 Generation of BDD

We use two steps to generate a BDD for a DCS with imperfect coverage:

1. Generate ordinary BDDs by searching the MFSTs of each program, then use the BDD operation AND to combine these BDDs into a BDD for the whole system.
2. Convert this BDD for the whole DCS to a multistate BDD that reflects the imperfect coverage. If only the DPR is needed, this conversion may be applied only to the BDD of the given program.

bdd_gen(p_id) {
    mfst_bdd = 0
    for (node_i in the set of nodes at which p_id resides) {
        mfst_bdd = mfst_bdd OR bdd_gen_aux(p_id, node_i)
        disable all edges connected to node_i
    }
    re-enable all edges that were disabled in the above loop
    return mfst_bdd
}

bdd_gen_aux(p_id, node) {
    T_bdd = 0
    set node in this_mfst
    for (edge_i in the set of edges starting from all nodes in this_mfst) {
        disable edge_i
        next_node = the other end of edge_i
        if (next_node is already in this_mfst)
            continue;
        if (next_node and this_mfst contain all the required files)
            submfst_bdd = edge_i_bdd
        else
            submfst_bdd = bdd_gen_aux(p_id, next_node) AND edge_i_bdd
        T_bdd = T_bdd OR submfst_bdd
    }
    T_bdd = T_bdd AND node_bdd
    clear node in this_mfst
    re-enable all edge_i that were disabled in the above loop
    return T_bdd
}

Figure 6: Algorithm for generating a BDD for a given program

Fig. 6 shows the algorithm for generating a BDD for a given program. This algorithm is adapted from [8] and [7]. The function bdd_gen_aux(p_id, node) searches the set of MFSTs of a given program from a given node by recursion. We use the Boolean variable $E_i$ to represent edge $e_i$, and $N_i$ to represent node $n_i$. For the program P1 running on n1 in Fig. 1, the BDD obtained from bdd_gen_aux(p_id, node) represents the Boolean expression:

$$N_1(E_1 N_2 + E_4 N_3(E_5 N_4 + E_3 N_2)) \tag{6}$$

Fig. 7 shows the BDD generated for distributed program P1 in Fig. 1.

The BDD generated above is an ordinary BDD that does not contain coverage. To incorporate imperfect coverage, the following two steps are applied to the ordinary BDD:

1. Replace all nodes except the sink nodes in the ordinary BDD by their corresponding equivalent nodes shown in Fig. 5, i.e. for a node that represents component A, use $\bar{A}_2 \cdot \bar{A}_3$ to replace $A_1$.
2. Use the BDD operation AND to merge the variables that represent the undetectable faults of components. Note that only the components that are contained in MFSTs are considered.

After these two steps, the final BDD is able to deal with imperfect coverage. Fig. 8 shows the final BDD that is converted from Fig. 7.

Observing Fig. 8, we see that the nodes representing state 3 of the components have only one parent, namely the variable for state 2, whose 0-edge always connects to sink node 0. Based on this fact, we can remove the nodes whose 1-edge connects to the variable for state 3 of a component. In our implementation of the algorithm, step 1 of the conversion simply replaces the nodes of the original BDD by nodes representing state 3 of the components.
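The search that Fig. 6 performs on BDDs can be mirrored on plain sets to see which trees it discovers. The sketch below is a set-based analogue, not the authors' BDD routine: it grows trees from each root, stops a branch as soon as the required files are held, and then discards every tree that is a proper superset of another. The link endpoints follow from the MFSTs listed in Section 1. The file allocation `FILES` is an assumption: it is one allocation consistent with those MFSTs and was not read from Fig. 1. The function names are hypothetical.

```python
# Link endpoints, inferred from the MFSTs listed in Section 1.
EDGES = {"e1": ("n1", "n2"), "e2": ("n2", "n4"), "e3": ("n2", "n3"),
         "e4": ("n1", "n3"), "e5": ("n3", "n4")}
# ASSUMPTION: one file allocation consistent with those MFSTs.
FILES = {"n1": {"F1", "F2"}, "n2": {"F3", "F4"},
         "n3": {"F2", "F4"}, "n4": {"F1", "F3"}}


def file_spanning_trees(root, required):
    """Grow trees from `root`; a branch stops once all files are held."""
    found = set()
    seen = set()
    stack = [(frozenset([root]), frozenset())]
    while stack:
        nodes, edges = stack.pop()
        if (nodes, edges) in seen:
            continue
        seen.add((nodes, edges))
        held = set().union(*(FILES[n] for n in nodes))
        if required <= held:
            found.add(nodes | edges)   # record the tree as one component set
            continue
        # Extend the tree by one link leading to a node not yet in it.
        for name, (u, v) in EDGES.items():
            for a, b in ((u, v), (v, u)):
                if a in nodes and b not in nodes:
                    stack.append((nodes | {b}, edges | {name}))
    return found


def minimal_fsts(roots, required):
    """Union the FSTs of all roots and keep those with no proper subset."""
    fsts = set().union(*(file_spanning_trees(r, required) for r in roots))
    return {t for t in fsts if not any(other < t for other in fsts)}


# Program P1 can run on n1 or n4 and needs F1, F2 and F3.
MFSTS = minimal_fsts(["n1", "n4"], {"F1", "F2", "F3"})
for tree in sorted(MFSTS, key=lambda t: (len(t), sorted(t))):
    print(sorted(tree))
```

Under the assumed allocation, the search from n1 finds the branches through e1 and through e4 followed by e5 or e3, which is the shape of Eqn. (6); the branch ending at n4 is dropped in the final set because it contains the MFST {n4, e5, n3}.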

[Figure 7 graphic: ordinary BDD over the variables N1 to N4 and E1 to E5 with sink nodes 0 and 1.]

Figure 7: BDD for distributed program P1 in Fig. 1

[Figure 8 graphic: multistate BDD whose nodes are labeled by component and state (N1,2, N1,3, ..., N4,3 and E1,2, E1,3, ..., E5,3) with sink nodes 0 and 1.]

Figure 8: Final BDD for distributed program P1 in Fig. 1

4.2 Unreliability evaluation

Because the Boolean variables that represent the two fault states of the same component are no longer independent of each other, the evaluation algorithm for ordinary BDDs cannot be used to evaluate a BDD with such dependence. In [17], we developed an evaluation algorithm to deal with this dependence. This algorithm can be simplified here because there are only two Boolean variables for each component.

Observing the BDD in Fig. 8, we see that a 0-edge always links two variables that belong to different components. For the 1-edge, however, there are two cases, which need to be evaluated differently:

• the 1-edge links variables of different components;
• the 1-edge links variables of the same component.

Lemma 1 Let BDD G be

$$G = \mathrm{ite}(\bar{X}, G_1, G_2) = \bar{X} \cdot G_1 + X \cdot G_2$$

and

$$G_1 = \mathrm{ite}(\bar{Y}, H_1, H_2) = \bar{Y} \cdot H_1 + Y \cdot H_2$$

then

$$P(G) = \begin{cases}
P(G_1) + P(X)(1 - P(H_1)) & X, Y \text{ are different states of the same component} \\
P(G_1) + P(X)(P(G_2) - P(G_1)) & \text{otherwise}
\end{cases} \tag{7}$$

Proof: First consider the case in which the 1-edge links variables of different components. This case is the same as for an ordinary BDD, so the normal evaluation method for BDDs can be used:

$$\begin{aligned}
P(G) &= P(\mathrm{ite}(\bar{X}, G_1, G_2)) \\
&= P(\bar{X} \cdot G_1 + X \cdot G_2) \\
&= P(\bar{X})P(G_1) + P(X)P(G_2) \\
&= P(G_1) + P(X)(P(G_2) - P(G_1))
\end{aligned} \tag{8}$$

Next consider the case in which the 1-edge links variables of the same component. We assume that X and Y represent different states of component A, i.e. $X = A_2$, $Y = A_3$. Then

$$\begin{aligned}
P(G) &= P(\mathrm{ite}(\bar{X}, G_1, G_2)) \\
&= P(\bar{X} \cdot G_1 + X \cdot G_2) \\
&= P(\bar{X} \cdot (\bar{Y} \cdot H_1 + Y \cdot H_2)) + P(X)P(G_2) \\
&= P(\bar{X}\bar{Y}H_1 + \bar{X}YH_2) + P(X)P(G_2)
\end{aligned} \tag{9}$$

According to the restriction relation among the variables associated with the same component,

$$\bar{X}Y = \bar{A}_2 A_3 = A_3 = Y$$
$$\bar{Y} = \bar{A}_3 = A_1 + A_2 = \bar{A}_2\bar{A}_3 + A_2 = \bar{X}\bar{Y} + X$$

we can obtain

$$\begin{aligned}
P(G_1) &= P(\bar{Y} \cdot H_1 + Y \cdot H_2) \\
&= P((\bar{X} \cdot \bar{Y} + X) \cdot H_1 + \bar{X} \cdot Y \cdot H_2) \\
&= P(\bar{X}\bar{Y}H_1 + \bar{X}YH_2) + P(XH_1) \\
&= P(\bar{X}\bar{Y}H_1 + \bar{X}YH_2) + P(X)P(H_1)
\end{aligned} \tag{10}$$

Because $H_1$ does not contain variables that represent states of the same component as X does, $P(XH_1) = P(X)P(H_1)$. Therefore,

$$P(\bar{X}\bar{Y}H_1 + \bar{X}YH_2) = P(G_1) - P(X)P(H_1) \tag{11}$$

In this case, because undetectable faults always cause system failure, $P(G_2) = 1$. This result can also be observed in Fig. 8: all 0-edges of the Boolean variables that represent the undetectable faults of components are connected to sink node 0. Substituting Eqn. (11) into Eqn. (9), we have

$$\begin{aligned}
P(G) &= P(G_1) - P(X)P(H_1) + P(X) \\
&= P(G_1) + P(X)(1 - P(H_1))
\end{aligned} \tag{12}$$

□

Using recursion, we can calculate $P(G_1)$, $P(G_2)$ and $P(H_1)$ until a sink node is reached, i.e. $G_i = 1$ or $G_i = 0$:

• $G_i = 0$ implies that the system or subsystem represented by $G_i$ is always down, hence $P(G_i) = 1$.
• $G_i = 1$ implies that the system or subsystem represented by $G_i$ is always up, hence $P(G_i) = 0$.

In our implementation, all nodes whose 1-edge connects to a variable representing state 3 of a component are removed. Lemma 1 can then be changed to

Lemma 2 Let BDD G be

$$G = \mathrm{ite}(\bar{X}, G_1, G_2) = \bar{X} \cdot G_1 + X \cdot G_2$$

then

$$P(G) = \begin{cases}
(1 - C_x)P(X) + (1 - P(X))P(G_1) + C_x P(X) P(G_2) & X \text{ is a variable for state 3} \\
P(G_1) + (1 - C_x)P(X)(P(G_2) - P(G_1)) & \text{otherwise}
\end{cases} \tag{13}$$

where $C_x$ is the coverage of the component.

Fig. 9 gives the algorithm for the evaluation of unreliability.

Prob(G) {
    if (G == 1)
        return 0
    else if (G == 0)
        return 1
    else if (computed-table has entry {G, P_G})
        return P_G
    else { /* G = ite(x, G1, G2) */
        P_G1 = Prob(G1), P_G2 = Prob(G2)
        if (x is the variable for state 3 of a component)
            P_G = (1 - Cx) * P(x) + (1 - P(x)) * P_G1 + Cx * P(x) * P_G2
        else
            P_G = P_G1 + (1 - Cx) * P(x) * (P_G2 - P_G1)
    }
    insert_computed_table({G, P_G})
    return P_G
}

Figure 9: Algorithm for evaluating the unreliability of a DCS

4.3 Sensitivity analysis

The purpose of sensitivity analysis (also known as importance measures) is to obtain information about a component's contribution to the system reliability, which is widely useful in system design, failure diagnosis and the minimization of system failure probability. The definition of the sensitivity analysis of events in a fault tree was given in [13]. In this paper, we use the same definition for sensitivity analysis, take imperfect coverage into account as well, and obtain three types of measures for DCS.

4.3.1 Birnbaum's importance measure

Birnbaum's measure of importance provides the probability that the system is in a critical state with respect to component k, so that the failure of component k will then cause the system to fail. It is defined as the partial derivative of the system unreliability with respect to the failure probability of component k, i.e.

$$I^B_k(t) \stackrel{d}{=} \frac{\partial F_{sys}(t)}{\partial F_k(t)} = P[\bar{r}(1_k, \bar{x}) = 1] - P[\bar{r}(0_k, \bar{x}) = 1] \tag{14}$$

where $\bar{r}(\bar{x})$ is the system structure function as a function of the state vector $(\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_n)$, $\bar{x}_i = 1$ means that component $x_i$ has failed, and

$$(1_k, \bar{x}) \stackrel{d}{=} (\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_{k-1}, 1, \bar{x}_{k+1}, \cdots, \bar{x}_n)$$
$$(0_k, \bar{x}) \stackrel{d}{=} (\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_{k-1}, 0, \bar{x}_{k+1}, \cdots, \bar{x}_n)$$

$\bar{r}(\bar{x}) = 1$ represents that the system has failed. $F_{sys}(t)$ and $F_k(t)$ are the failure functions of the system and of component k; the failure function of a component contains both its detectable and its undetectable faults.

Because there are two types of faults that cause different system behavior, $P[\bar{r}(1_k, \bar{x}) = 1]$ should include the probability of system failure caused by both types of faults. There are two ways to calculate Birnbaum's importance measure. One is to use the definition in Eqn. (14) directly: evaluate the BDD once with component k set to failed and once with it set to functioning; the difference between these two evaluations is Birnbaum's importance measure. This method needs to traverse the BDD twice. The alternative is to use the algorithm shown in Fig. 10, which needs to traverse the BDD only once.

4.3.2 Criticality importance measure

The criticality importance measure gives the probability that component k is responsible for system failure before time t. It is defined as

$$I^{CR}_k(t) \stackrel{d}{=} \frac{\partial F_{sys}(t)}{\partial F_k(t)} \cdot \frac{F_k(t)}{F_{sys}(t)} = I^B_k(t) \cdot \frac{F_k(t)}{F_{sys}(t)} \tag{15}$$

To calculate the criticality importance measure, we only need to follow Eqn. (15): use the algorithm shown in Fig. 10 to obtain $I^B_k(t)$ first, then multiply it by $F_k(t)/F_{sys}(t)$.
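Eqns. (14) and (15) can be cross-checked on the four-node example without a BDD. The sketch below is a brute-force enumeration, not the algorithm of Fig. 9 or Fig. 10. It assumes the three-state semantics described in Section 3.2: the system fails if any component is in the undetectable failed state, or if no MFST of P1 is fully operational. The failure probabilities are the ones listed in Table 1 and the coverage 0.90 is one of the values tabulated there; the function names and the `forced` parameter are hypothetical.

```python
from itertools import product

MFSTS = [{"n1", "e1", "n2"}, {"n1", "e4", "n3", "e3", "n2"},
         {"n4", "e5", "n3"}, {"n4", "e2", "n2", "e3", "n3"}]
COMPONENTS = sorted(set().union(*MFSTS))
UP, DETECTED, UNDETECTED = 0, 1, 2


def unreliability(fail_prob, coverage, forced=None):
    """System failure probability by enumerating all three-state vectors.

    `forced` pins some components to a state, which conditions on that state.
    """
    forced = forced or {}
    free = [c for c in COMPONENTS if c not in forced]
    total = 0.0
    for states in product((UP, DETECTED, UNDETECTED), repeat=len(free)):
        state = dict(zip(free, states), **forced)
        prob = 1.0
        for c in free:
            q, cov = fail_prob[c], coverage[c]
            prob *= (1.0 - q, cov * q, (1.0 - cov) * q)[state[c]]
        up = {c for c in COMPONENTS if state[c] == UP}
        # ASSUMPTION: any undetected failure brings the whole system down.
        failed = (UNDETECTED in state.values()
                  or not any(m <= up for m in MFSTS))
        if failed:
            total += prob
    return total


def birnbaum(k, fail_prob, coverage):
    """P[system fails | k failed] - P[system fails | k works], Eqn. (14)."""
    cov = coverage[k]
    given_failed = (
        cov * unreliability(fail_prob, coverage, {k: DETECTED})
        + (1.0 - cov) * unreliability(fail_prob, coverage, {k: UNDETECTED}))
    given_up = unreliability(fail_prob, coverage, {k: UP})
    return given_failed - given_up


def criticality(k, fail_prob, coverage):
    """Birnbaum's measure scaled by F_k / F_sys, Eqn. (15)."""
    f_sys = unreliability(fail_prob, coverage)
    return birnbaum(k, fail_prob, coverage) * fail_prob[k] / f_sys


# Failure probabilities as listed in Table 1; coverage 0.90 for every component.
FAIL_PROB = {"n1": 0.015, "n2": 0.015, "n3": 0.015, "n4": 0.015,
             "e1": 0.010, "e2": 0.015, "e3": 0.020, "e4": 0.010, "e5": 0.015}
COVERAGE = {c: 0.90 for c in COMPONENTS}
for comp in ("n1", "e1"):
    print(comp, birnbaum(comp, FAIL_PROB, COVERAGE),
          criticality(comp, FAIL_PROB, COVERAGE))
```

Since the system unreliability is linear in the failure probability of each single component, the difference computed by `birnbaum` equals the partial derivative in Eqn. (14). The enumeration visits 3^9 states per call, which is feasible only for graphs of this size.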

CrProb(F, xk, flag) { /* F = ite(x, F1, F2) */
    if (F == 1)
        return 0
    else if (F == 0) {
        if (flag == 0)
            return 0
        else
            return 1
    }
    else if ((x is behind xk in the variable order) and (flag == 0))
        return 0
    else if (computed-table has entry {(flag, xk, F), R})
        return R
    else if (x is the variable for component xk) {
        P_F1 = CrProb(F1, xk, 1), P_F2 = CrProb(F2, xk, 1)
        if (x is the variable for state 3 of the component)
            P_F = (1 - C_xk) + C_xk * P_F2 - P_F1
        else
            P_F = P_F2 - P_F1
    } else {
        P_F1 = CrProb(F1, xk, flag), P_F2 = CrProb(F2, xk, flag)
        if (x is the variable for state 3 of the same component)
            P_F = (1 - Cx) * P(x) + (1 - P(x)) * P_F1 + Cx * P(x) * P_F2
        else
            P_F = P_F1 + (1 - Cx) * P(x) * (P_F2 - P_F1)
    }
    insert_computed_table({(flag, xk, F), P_F})
    return P_F
}

Figure 10: Algorithm for Birnbaum's importance measure

4.3.3 Structural importance measure

The structural importance measure allows us to consider the relative importance of the various components when only the structure of the system is known. It is defined as:

$$I^{ST}_k \stackrel{d}{=} \frac{1}{2^{n-1}} \sum_{\bar{x}} [\bar{r}(1_k, \bar{x}) - \bar{r}(0_k, \bar{x})] \tag{16}$$

where $\bar{r}(\bar{x})$ is the system structure function.

Eqn. (16) gives the definition of structural importance, but it is hard to use this formula to compute the measure because $2^{n-1}$ state vectors have to be evaluated. An alternative method is to assign a failure probability of 0.5 to all components $x_i$ (i = 1, 2, ..., n and i ≠ k), and then use the algorithm shown in Fig. 10. The proof of this method was given in [13].

4.3.4 An example

Table 1 shows the importance measures for the reliability graph in Fig. 1.

Comp.  Failure Prob.  Coverage  Birnbaum's Impt.  Criticality Impt.  Structural Impt.
n1     0.015          0.90      1.22612129e-01    1.31928966e-01     1.75511641e-01
                      0.95      7.63365391e-02    1.50313836e-01     2.01533201e-01
                      0.99      3.90352529e-02    2.30713924e-01     2.24437575e-01
n2     0.015          0.90      1.33968564e-01    1.44148334e-01     1.98356172e-01
                      0.95      8.90025201e-02    1.75254346e-01     2.28343796e-01
                      0.99      5.28013095e-02    3.12076813e-01     2.54760184e-01
n3     0.015          0.90      1.30003197e-01    1.39881654e-01     1.98356172e-01
                      0.95      8.45710274e-02    1.66528319e-01     2.28343796e-01
                      0.99      4.79772126e-02    2.83564474e-01     2.54760184e-01
n4     0.015          0.90      1.22401424e-01    1.31702251e-01     1.75511641e-01
                      0.95      7.60884221e-02    1.49825271e-01     2.01533201e-01
                      0.99      3.87541783e-02    2.29052661e-01     2.24437575e-01
e1     0.010          0.90      1.10933942e-01    7.95755983e-02     1.46971797e-01
                      0.95      6.33451243e-02    8.31550112e-02     1.68024363e-01
                      0.99      2.49204652e-02    9.81932635e-02     1.86534505e-01
e2     0.015          0.90      9.89085127e-02    1.06424201e-01     3.31218672e-02
                      0.95      4.98381101e-02    9.81359335e-02     2.75454344e-02
                      0.99      1.01819035e-02    6.01791131e-02     2.13494383e-02
e3     0.020          0.90      9.91641656e-02    1.42265706e-01     4.96382734e-02
                      0.95      5.00930063e-02    1.31517131e-01     4.73052489e-02
                      0.99      1.04567944e-02    8.24051046e-02     4.40150165e-02
e4     0.010          0.90      9.88082058e-02    7.08775147e-02     3.31218672e-02
                      0.95      4.97659544e-02    6.53292348e-02     2.75454344e-02
                      0.99      1.01138966e-02    3.98514435e-02     2.13494383e-02
e5     0.015          0.90      1.10977738e-01    1.19410521e-01     1.46971797e-01
                      0.95      6.33433474e-02    1.24729018e-01     1.68024363e-01
                      0.99      2.48986107e-02    1.47160726e-01     1.86534505e-01

Table 1: Sensitivity analysis of DPR of P1 in Fig. 1

4.4 Ordering strategies

The order of variables is very important for BDD generation. The size of a BDD (the number of nodes in the BDD) depends heavily on the order. However, the problem of computing an ordering that minimizes the size of a BDD is itself a co-NP-complete problem. Previous studies showed that a set of heuristics may be used to select an adequate ordering [2, 12]. However, all these heuristics were proposed for digital circuits that consist of logic gates, and there is no similarity between DCS and logic circuits. New ordering strategies are needed for DCS.

Observing Eqn. (6), we see that this expression can be represented by a fault tree, so that some ordering heuristics developed for fault trees may be adopted for DCS. Inspired by this relationship between fault trees and DCS, we design the ordering strategies for DCS. Eqn. (6) actually gives the trace of searching the set of MFSTs. To avoid traversing the graph twice, we order the components (hosts or links) while we use the algorithm shown in Fig. 6 to obtain the MFSTs, i.e. we put a component in the ordering list when we meet a new variable, or when we need the variable representing this component in a BDD manipulation and this variable is not yet in the list. There are two strategies for putting the variables in the ordering list.

• Strategy 1: always put the variable at the head of the list, so that the list serves as a stack. We push the variable onto the stack when we need this variable in a BDD manipulation.
• Strategy 2: always put the variable at the end of the list, so that the list serves as a queue. We add the variable at the end of the queue when it is a new variable.

For the example in Fig. 1, the two orders generated are:

• Strategy 1: e2 < n1 < e4 < n3 < e3 < n4 < e5 < n2 < e1
• Strategy 2: n1 < n2 < e1 < e4 < n3 < n4 < e5 < e3 < e2

5 Examples

5.1 Effect of coverage

Example: Fig. 11 shows the system topology of a DCS. The program P1 needs files {F1, F2, F3} and P2 needs files {F4, F5, F6}. The failure rate of the hosts is 0.01, and the failure rate of the communication links is 0.015. Fig. 12 shows the reliability of the DCS for different values of coverage.

[Figure 11 graphic: a graph with six nodes n1 to n6 and eight links e1 to e8; nodes are annotated with the programs (P1, P2) they run and the files (FA) they hold.]

Figure 11: System topology of a DCS

5.2 Experimental results

The execution time of our algorithm depends on the system topology, the distribution of programs and files, and the ordering of variables. We use the same benchmarks as given in [7]. Please refer to the Appendix or [7] for the details of these benchmarks. Table 2 shows the experimental results for these benchmarks.

[Figure 12 graphic: reliability (vertical axis, 0 to 1) versus time (horizontal axis, 0 to 100) with one curve for each coverage value c = 1.00, 0.95, 0.90 and 0.85.]

Figure 12: Reliability of the DCS with different coverage

          BDD (O1)               BDD (O2)
Network   # of Nodes  Time(s)    # of Nodes  Time(s)
G8.4      56          0.06       65          0.06
G10.4     72          0.06       73          0.06
G8.6      124         0.07       345         0.10
G8.7      269         0.10       1433        0.25
G10.7     309         0.12       1529        0.38
G8.8      725         0.23       3952        0.74
G10.8     891         0.36       6571        1.83
G10.9     2503        2.10       28473       16.62

Table 2: Experimental results

Observing these results, we can identify some features of the BDD algorithm:

• In most cases, the size of the BDD is not very large if an appropriate order is used.
• The size of the BDD depends heavily on the ordering of the variables.

6 Conclusion

We presented a BDD-based algorithm for the dependability analysis of distributed computer systems with imperfect coverage. Several related issues, such as the generation of BDDs for systems or for a given program, the evaluation of unreliability, sensitivity analysis and ordering strategies, were discussed. Due to the nature of BDDs, this algorithm is efficient in both computation and storage, which makes it possible to study practical and large distributed systems.

Appendix

Gi.j is the benchmark network used in the experiments, j ≤ i, where i is the number of nodes in the network, and j means that n1 to nj are completely connected in G. Fig. 13 shows the benchmark network G8.6. The program is located at n1, and files F1, F3 and F5 are required to execute it. Table 3 gives the file distribution.

Node Name  Files
n1         F1, F2, F3
n2         F2, F3, F4
n3         F3, F4, F5
n4         F4, F5, F6
n5         F5, F6, F7
n6         F6, F7, F8
n7         F1, F7, F8
n8         F1, F2, F8
n9         F3, F7, F8
n10        F1, F4, F7

Table 3: File distribution

[Figure 13 graphic: a network with eight nodes n1 to n8.]

Figure 13: Benchmark network G8.6

References

[1] K. Brace, R. Rudell, and R. Bryant. Efficient implementation of a BDD package. Proc. 27th ACM/IEEE Design Automation Conference, pages 40-45, 1990.

[2] R. Bryant. Graph based algorithms for Boolean function manipulation. IEEE Transactions on Computers, 35(8):677-691, 1986.

[3] O. Coudert and J.C. Madre. Metaprime: An interactive fault-tree analyzer. IEEE Transactions on Reliability, 43(1):121-127, 1994.

[4] S.A. Doyle and J.B. Dugan. Dependability assessment using binary decision diagrams. Proc. 25th International Symposium on Fault-Tolerant Computing, pages 249-258, 1995.

[5] S.A. Doyle, J.B. Dugan, and M. Boyd. Combinatorial-models and coverage: a binary decision diagram (BDD) approach. Proc. 1995 Annual Reliability and Maintainability Symposium, pages 82-89, 1995.

[6] J.B. Dugan and K.S. Trivedi. Coverage modeling for dependability analysis of fault-tolerant systems. IEEE Transactions on Computers, 38(6):775-787, 1989.

[7] W. Ke and S. Wang. Reliability evaluation for distributed computing networks with imperfect nodes. IEEE Transactions on Reliability, 46(3):342-349, 1997.

[8] A. Kumar and D.P. Agrawal. A generalized algorithm for evaluating distributed-program reliability. IEEE Transactions on Reliability, 42(3):416-426, 1993.

[9] A. Kumar, A. Rai, and D.P. Agrawal. On computer communication network reliability under program execution constraints. IEEE Journal on Selected Areas in Communications, 6(8):1393-1399, 1988.

[10] V.K.P. Kumar, S. Hariri, and C.S. Raghavendra. Distributed program reliability analysis. IEEE Transactions on Software Engineering, SE-12(1):42-50, 1986.

[11] N. Lopez-Benitez. Dependability modeling and analysis of distributed programs. IEEE Transactions on Software Engineering, 20(5):345-352, 1994.

[12] M. Bouissou. An ordering heuristic for building binary decision diagrams from fault-trees. Proc. 1996 Annual Reliability and Maintainability Symposium, pages 208-214, 1996.

[13] K.B. Misra. Reliability Analysis and Prediction: A Methodology Oriented Treatment. Elsevier, Amsterdam, The Netherlands, 1992.

[14] A. Rauzy. New algorithms for fault tree analysis. Reliability Engineering and System Safety, 40:203-211, 1993.

[15] R.M. Sinnamon and J.D. Andrews. Improved accuracy in quantitative fault tree analysis. Quality and Reliability Engineering International, 13:285-292, 1997.

[16] R.M. Sinnamon and J.D. Andrews. Improved efficiency in qualitative fault tree analysis. Quality and Reliability Engineering International, 13:293-298, 1997.

[17] X. Zang, H. Sun, and K.S. Trivedi. A BDD-based algorithm for analysis of multistate systems with multistate components. Technical Report, 1998.
