Criteria to evaluate approximate belief network representations in expert systems



Decision Support Systems 15 (1995) 323-350


Sumit Sarkar, Ishwar Murthy

Department of Quantitative Business Analysis, College of Business Administration, Louisiana State University, Baton Rouge, LA 70803, USA

Abstract

The representation of uncertainty, and reasoning in the presence of uncertainty, has become an important area of research in expert systems. Belief networks have been found to provide an effective framework for the representation of uncertainty using probability calculus. Unfortunately, belief propagation techniques for general network structures are computationally intense. In this paper, we present belief network representations that approximate the underlying dependency structure in a problem domain in order to allow efficient propagation of beliefs. An important issue then is one of obtaining the 'best' approximate representation. A criterion is required to measure the closeness of the approximate to the actual. We examine desirable features of measures that compare approximate representations to the actual one. We identify two well-known measures, called the logarithm rule and the quadratic rule, as having special properties for evaluating approximations. We present a new result that shows the equivalence of using the logarithm rule to that of finding the maximum likelihood estimator. Next, we discuss the modeling implications of using the logarithm rule and the quadratic rule in terms of the nature of solutions that are obtained, and the computational effort required to obtain such solutions. Finally, we use a decision theoretic approach to compare such solutions using a common frame of reference. A simple decision problem is modelled as a belief network, and the comparison is performed over a wide range of probability distributions and cost functions. Our results suggest that the logarithm rule is very appropriate for evaluating approximate representations.

Keywords: Belief networks; Expert systems; Probabilistic reasoning; Scoring rules; Approximate representations; Performance analysis

1. Introduction

In recent years, the representation of uncertainty, and reasoning in the presence of uncertainty, has become an important area of research in expert systems. In particular, network structures, called belief networks, have been found to provide an effective framework for the representation of uncertainty using probability calculus [7] [10] [19] [23]. Pearl [24], and Lauritzen and Spiegelhalter [19] have made important contributions towards uncertain reasoning using belief networks. They have presented techniques that can propagate beliefs in such networks in a manner consistent with probability theory. The feasibility of using such networks has been demonstrated for several application areas [1] [2] [14] [42].

(The research was supported in part by a grant from the College of Business, Louisiana State University.)

In order to use belief-network based expert systems, knowledge engineers must address the problem of constructing such networks for the application domains of interest. This is a critical task, since the performance of an expert system will be highly dependent on the accuracy of the knowledge that is represented in the belief network. In storing domain specific knowledge, a belief network representation must include information about all relevant objects in the problem domain and the dependencies between them. Further, an important requirement for expert systems to perform in real-world applications is that the system should make inferences in a reasonably short time. For this to occur, the representation must also allow for efficient manipulation of the information stored in it. In principle, belief network representations can be used to capture all dependencies that may exist across objects of interest. Unfortunately, belief propagation techniques for such general network structures are computationally intense. In fact, Cooper [8] has shown that theoretically accurate probabilistic inference in multiply connected networks is NP-hard. Hence, the current practice is to achieve computational efficiency by approximating the inference process, for instance by using simulation based techniques [4] [11] [15] [35]. In this research we propose an alternate approach wherein we consider belief network representations that are approximate. Such representations allow belief propagation techniques to conform to probability calculus and operate within specified levels of computational complexity.

In any application, there typically exist many feasible approximate representations of the probability distribution describing the underlying dependencies. An important issue then is one of obtaining the 'best' approximate representation. Loosely speaking, the best approximate representation is one that is 'closest' in some sense to the dependency structure underlying the application domain. A criterion is then required to measure the closeness of the approximate to the actual.

Obviously, the best representation depends on the criterion that is used. What are the desirable features of any criterion used? Given a choice of possible measures, which one is most appropriate? These are the questions that this paper seeks to answer.

In this paper, we first discuss the nature of approximate belief network representations, and identify the factors that determine the computational complexity of making inferences in such networks. We then examine desirable features of measures that compare approximate probabilistic representations to the actual one. Functional forms of measures that satisfy these requirements have been identified in the literature on probability assessments. We show that the logarithm rule and the quadratic rule have some special properties for evaluating assessments. We discuss a known result wherein using the logarithm rule is shown to be equivalent to using the I-Divergence measure. In addition, we present a new result, wherein using the logarithm rule is shown to be equivalent to finding the maximum likelihood estimator. We then compare the use of the logarithm rule and the quadratic rule for obtaining approximate representations in two ways. First, we discuss the modeling implications of using each of these measures. This focuses on the nature of solutions that are obtained when using these measures, and the computational effort required to obtain such solutions. The best solution obtained using the logarithm rule is usually different from the one obtained using the quadratic rule. We use a decision theoretic approach to compare such solutions using a common frame of reference. A simple decision problem is modelled as a belief network, and the comparison is performed over a wide range of probability distributions and cost functions.

This paper is organized as follows. In section 2, we present an overview of belief networks, and characterize structures that are amenable to efficient belief propagation techniques. This also serves the purpose of providing the necessary background and motivation for the issues examined in subsequent sections. In section 3, we discuss the desirable properties of measures used to assess probability distributions.


We also demonstrate the equivalence of using the logarithm rule, the I-Divergence measure and the maximum likelihood estimator in the context of this problem. The modeling implications of using the logarithm and quadratic rules are presented in section 4. We compare the performance of representations obtained using the logarithm rule and the quadratic rule respectively in section 5. In section 6, we use a simple example to illustrate how the logarithm rule may be used to evaluate two different approximate structures. A summary of our findings is provided in section 7.

2. Representation and propagation of beliefs using network structures

In this section, we discuss properties of belief networks that make them effective for representing uncertainty in expert systems. We identify topological features of belief networks that characterize the computational effort involved in propagating beliefs in such networks. This provides a basis for classifying the complexity of belief network structures, and helps in identifying desirable approximate representations. Finally, we discuss the role of a measure in obtaining efficient belief network structures.

2.1. Belief networks in expert systems

Belief networks are directed acyclic graphs in which nodes represent propositions, and arcs signify dependencies between the linked propositions (the terms variables and events are used interchangeably with propositions). The belief accorded to different propositions is stated as probabilities (prior or posterior, as the case may be), and the strengths of the dependencies are quantified by conditional probabilities. A collection of propositions with associated dependencies can be conveniently represented using a belief network as shown in Figure 1(a). The nodes denote propositions of interest in the problem domain. For illustration purposes, consider a hypothetical example in which the nodes refer to attributes of mutual funds that may be used to classify different instances of funds (this example is loosely adapted from an example presented in [37]). Each attribute is considered to be categorical. For example, Fund Type may be classified as: Growth, Growth and Income, and Aggressive Growth; Yield classified as: Under 3%, Over 3%; Price-Earnings Ratio classified as: Above Market, Below Market; long term Projected Earnings Growth classified as: Less than 20%, Greater than 20%; and Volatility as: Above Market, Below Market. Each arc between two nodes represents a dependency across these attributes, and the direction of the arc indicates an ordering of the attributes. For instance, in Figure 1(a), nodes Fund Type and Yield are predecessors of Price-Earnings Ratio. This indicates that the dependencies between the attributes Fund Type, Yield and Price-Earnings Ratio are represented by storing the conditional probability associated with each value of Price-Earnings Ratio for all possible values of the variables Fund Type and Yield. The absence of a link between two nodes indicates that the attributes are not directly related.

Fig. 1. Belief networks. (a) Belief network for mutual funds. (b) Equivalent network with symbolic names.


Instead, their dependence is mediated by attributes that lie on the paths connecting them. In probabilistic terms, this means that the two nodes are conditionally independent of each other, given the intermediate nodes on the path between them. In Figure 1(a), the nodes Fund Type and Volatility are shown to be conditionally independent of each other given realizations for the attributes Yield and Price-Earnings Ratio. We should note that if a variable has more than one conditioning event, then, strictly speaking, the belief network represents a hypergraph [25]. In Figure 1(a), node Projected Earnings Growth is dependent on two nodes, Fund Type and Price-Earnings Ratio; in general, this dependency cannot be captured by the individual dependencies of node Projected Earnings Growth on nodes Fund Type and Price-Earnings Ratio, respectively. Figure 1(b) represents an equivalent belief network, with the variables A through E used to represent the attributes in Figure 1(a). For notational convenience, we subsequently use such symbolic variable names. We return to the example for evaluating mutual funds in section 6, where we discuss how different approximate representations are compared using the logarithm measure.

A belief network represents a joint distribution P(X_1, ..., X_n) over the variables of interest X_1, ..., X_n. The chain rule allows joint distributions to be represented as a product of conditional distributions in the following manner:

P(X_1, ..., X_n) = P(X_1) × ∏_{i=2..n} P(X_i | X_1, ..., X_{i−1})

Such a representation is called a product-form representation, and is often written as follows:

P(X_1, ..., X_n) = P(X_1) × ∏_{i=2..n} P(X_i | F(X_i))

Here, F(X_i) refers to the set of variables on which event X_i is conditioned, and is called the parent set for variable X_i. For instance, the belief network shown in Figure 1(b) can be completely specified by specifying the following marginal and conditional distributions for all realizations of the variables: P(A), P(B|A), P(C|A, B), P(D|A, C), P(E|B, C).
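As a purely illustrative sketch, a product-form representation such as the one above can be stored and evaluated with a few lines of code. The structure below mirrors Figure 1(b), but the conditional probability values (and the Python encoding itself) are assumptions made up for illustration, not data from the paper.

    # Hypothetical sketch: evaluating the product-form joint
    # P(A) P(B|A) P(C|A,B) P(D|A,C) P(E|B,C) for binary variables.
    # All CPT values below are invented for illustration.
    parents = {"A": (), "B": ("A",), "C": ("A", "B"), "D": ("A", "C"), "E": ("B", "C")}

    # cpt[var][parent_values] = P(var = 1 | parents = parent_values)
    cpt = {
        "A": {(): 0.5},
        "B": {(0,): 0.3, (1,): 0.7},
        "C": {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.8},
        "D": {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.9},
        "E": {(0, 0): 0.25, (0, 1): 0.5, (1, 0): 0.45, (1, 1): 0.7},
    }

    def joint(assignment):
        """P(X1, ..., Xn) as the product of the P(Xi | F(Xi)) terms."""
        p = 1.0
        for var, pars in parents.items():
            pv = tuple(assignment[q] for q in pars)
            p1 = cpt[var][pv]
            p *= p1 if assignment[var] == 1 else 1.0 - p1
        return p

    print(joint({"A": 1, "B": 0, "C": 1, "D": 1, "E": 0}))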

A belief network is therefore characterized by a structure (or topology), and a set of probability parameters. The structure provides information regarding conditional independence across events represented in the network. The probability parameters (usually expressed as conditional probabilities) quantify the dependence of an event on its conditioning (parent) events.

A belief network can be used to compute the probability of any realization of a set of variables as a result of observing some other variables. Many different schemes have been proposed to propagate beliefs in general network structures. Each of these schemes belongs to one of two classes of propagation techniques - exact propagation of probabilities, or stochastic simulation. Schemes that belong to the former category are presented in [5] [19] [24], while simulation based schemes are presented in [4] [11] [15] [35], among others.

Fig. 2. Belief networks with different connectivity levels. (a) Completely connected. (b) Completely disconnected. (c) Incompletely connected.


2.2. Computational complexity of belief propagation in networks

The computational complexity of belief propagation schemes depends entirely on the structure of the network, and not on the probability parameters themselves. Two extreme instances are the completely connected and the completely disconnected structures. For a completely connected network, each variable is dependent on all its preceding variables (Figure 2a). This corresponds to a structure that is computationally the most intense for making inferences. In a completely disconnected network, none of the variables are conditioned on any other variables (Figure 2b). This implies that the variables are mutually independent, and observing one of the variables to be true will not affect our belief in other variables in any way; hence no computations would be required for inference. Using the chain rule, the product-form representation for the completely connected network may be written as:

P(A, B, C, D, E) = P(A) P(B|A) P(C|A, B) P(D|A, B, C) P(E|A, B, C, D)

By virtue of mutual independence of variables, the completely disconnected network shown in Figure 2(b) is represented as:

P(A, B, C, D, E) = P(A) P(B) P(C) P(D) P(E)

Figure 2(c) is an instance of a belief network that is incompletely connected, i.e. it does not have the full complement of arcs that are feasible. The complexity of propagating beliefs using this structure lies somewhere between the above two extreme cases. This network can be represented as:

P(A, B, C, D, E) = P(A) P(B|A) P(C|A) P(D|A, C) P(E|B, C)

Lauritzen and Spiegelhalter [19] have devised a technique to propagate beliefs that is applicable for any general network structure. Their method is efficient for sparse network structures, although, like all exact techniques, it is of exponential complexity for complete or near-complete networks. It is regarded as one of the efficient techniques that have been developed for arbitrary general network structures [22]. The computational complexity of their scheme is shown to be of the order O(n·r^m), where n is the number of variables in the network, r is the maximum number of realizations that a variable may have, and m is the size of the largest clique [22]. The size of a clique in turn depends on the number of conditioning variables (size of the parent set) that a variable has in the product-form expression. Therefore, if the number of parents for a variable is large, then the computational complexity of performing belief propagation in the network is high.

In general, it is expected that each variable will be conditionally independent of other variables in the network given a set of conditioning variables. The efficiency of update mechanisms will depend on the maximum number of variables that constitute the conditioning set for any term in the product-form, which determines the order of the joint distribution for the network. The product-form expression for the network shown in Figure 2(a) is of order 5, since the parent set for variable E includes all the other variables in the network. Similarly, the product-form expression for the network shown in Figure 2(c) is of order 3, because the largest term in the product-form expression consists of three variables. Therefore, the time taken to perform inferences will be much more for the structure in Figure 2(a) as compared to Figure 2(c), since the complexity of inference mechanisms increases exponentially with the order of the product-form distribution.
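A small sketch of this bookkeeping (with assumed encodings, not the paper's data) shows how the order of a product-form representation, and the number of probability entries it requires, follow directly from the parent sets of the three structures in Figure 2:

    # Hypothetical sketch: the "order" of a product-form representation is the size of
    # its largest component term, i.e. max_i |F(Xi)| + 1.
    complete = {"A": (), "B": ("A",), "C": ("A", "B"),
                "D": ("A", "B", "C"), "E": ("A", "B", "C", "D")}     # Figure 2(a)
    disconnected = {v: () for v in "ABCDE"}                          # Figure 2(b)
    incomplete = {"A": (), "B": ("A",), "C": ("A",),
                  "D": ("A", "C"), "E": ("B", "C")}                  # Figure 2(c)

    def order(parents):
        return max(len(p) for p in parents.values()) + 1

    def table_entries(parents, r=2):
        # number of probabilities stored, assuming r realizations per variable
        return sum(r ** (len(p) + 1) for p in parents.values())

    for name, net in [("complete", complete), ("disconnected", disconnected),
                      ("incomplete", incomplete)]:
        print(name, "order:", order(net), "entries:", table_entries(net))
    # prints orders 5, 1 and 3 respectively, matching the discussion above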

On the other hand, the ability of a belief network to capture the underlying dependencies across variables improves with increasing connectivity of the structure. By allowing a larger number of parents for each variable, one can better represent the dependencies that are inherent among the variables in the network. Clearly, a completely connected network should be able to represent every dependency that exists in the problem domain in an exact fashion. Similarly, a completely disconnected network will not be able to represent any dependencies across variables. Thus, there is a clear trade-off in the richness of representation that is possible using a belief network with given connectivity, and the computational complexity of making inferences in that network.


Performing inferences in completely connected belief networks is too time-consuming to be considered acceptable for most real applications. For instance, when evaluating mutual funds, a fund manager may have to deal with a large number of factors that could affect the performance of a fund. Even if the number of variables were as few as twenty, using the complete joint distribution would not be computationally feasible. Therefore, practically feasible representations are those that are constrained in the number of conditioning variables allowed for each variable in the component terms. We call such representations approximate belief network representations, or in short, approximate representations.

2.3. The role of a measure in obtaining efficient network structures

Traditionally, construction of belief networks has required eliciting from domain experts a belief network topology along with its associated probability parameters. In recent years, researchers have developed techniques to obtain belief network structures from historic databases. In both of these approaches, the choice of an appropriate measure plays an important role in obtaining efficient network structures.

When a belief network is directly obtained from domain experts, many different topologies are usually examined. There are no standard criteria for either evaluating alternate network structures, or identifying the appropriate probability parameters for any given topology. When different structures are considered feasible, then selecting any one among these structures is usually done in an ad-hoc manner. Once a structure is selected, the probability parameters are chosen such that they correspond to the expert's true beliefs. Often, the structure imposes restrictions on what values the parameters can take. In such instances, the parameters are adjusted so that they are as close to the expert's beliefs as possible. This can lead to additional problems in obtaining the final structure. For instance, consider the conditional probabilities associated with the variable D in Figure 2(c). If the selected structure does not completely capture the dependencies in the problem domain, then it may not be possible to choose parameters that will lead to inferences that completely agree with the expert's beliefs. In such instances, parameters associated with other events that are related to D are often modified as well. This adjustment process could propagate to many other nodes in the network, making it even harder to evaluate the goodness of the final representation. The designers of the expert system PROSPECTOR [10] document many instances where the expert specified parameters were modified in order to take advantage of efficient tree structures for propagating beliefs. In order to compare alternate representations, an appropriate evaluation criterion is required. Ideally, we require a measure that will help identify the best structure among different feasible structures, as well as determine the probability parameters that should be used with the chosen structure. The measure should enable us to determine the representation that is closest to the true dependency structure that exists in the problem domain when exact representations are computationally prohibitive.

The construction of belief networks from databases appears to be a promising approach for application areas where large amounts of historic data are easily available. Usually, the objective is to obtain networks with structures that are convenient for making inferences. One commonly used structure is the tree structure, as it is very efficient for propagating beliefs. Techniques to obtain efficient structures require some criterion to determine when one structure better captures the dependencies that are displayed by the observed data as compared to some other structure(s). In essence, this is equivalent to determining the best approximate representation that conforms to the specified structural requirements. For instance, Chow and Liu [6] have addressed the problem of representing a joint distribution over n variables by a distribution that supports a tree structure. They use the I-Divergence measure [18] to compare different tree structures with the actual distribution.


Rebane and Pearl [27] have extended the methodology proposed by Chow and Liu to recover the structure of a singly connected network, using the same I-Divergence measure, from a joint distribution that is known to support such a structure. Herskovitz and Cooper [16] have developed an algorithm, called Kutató, that begins with the assumption of marginal independence among variables, and obtains a network incrementally by adding the arc that results in a belief network with minimum entropy. In Cooper and Herskovitz [9], the authors address the problem of finding the most probable belief network given a database using an algorithm called K2. Smyth and Goodman [37] develop a scheme called ITRULE, that takes sample data in the form of discrete attribute vectors, and generates a set of K best rules (where K is a user-defined parameter). They too use an entropy based measure, called the J-measure, to compare rules. Spirtes, Glymour and Scheines [38] discuss two algorithms that recover belief networks from data by checking for conditional independencies across sets of variables using estimated probabilities. The first one, called the SGS algorithm, is computationally intense. They modify this algorithm, which is then called the PC algorithm, such that it can efficiently discover sparse networks underlying a problem domain. When such networks do not exist, or noise in the data prevents accurate estimation of probabilities, they recommend heuristics similar to those used in K2 for practical implementations. In the related problem of obtaining decision trees from data, Quinlan [26] has developed an algorithm called ID3 that uses an entropy based measure to obtain the best decision tree. Uthurusamy et al. [43] use a quadratic measure to obtain decision trees in their algorithm called INFERULE, and demonstrate the resulting structures to be superior to those obtained by ID3. In all of these examples, the measure plays a critical role in comparing different structures, and obtaining relatively sparse structures that effectively capture the dependency across different events in the domain.

3. Measures to evaluate approximate representations

As discussed earlier, the choice of an appropriate criterion is very important since the best approximate representation will depend on the criterion chosen, and may be different when different measures are used. First, we formally state the problem. Next, we provide an overview of the existing literature on measures used to evaluate approximate probability distributions, and discuss some fundamental desirable properties for an appropriate measure. Finally, we identify two measures, the quadratic and the logarithm measures, as potentially useful measures.

3.1. Generalized problem formulation

The general problem of constructing approximate belief networks can be viewed as one of determining a probability distribution that best approximates the joint distribution underlying the problem domain. Let P(X_1, ..., X_n) be the underlying distribution and P_a(X_1, ..., X_n) be the approximate distribution that is desired. The distribution P(X_1, ..., X_n) is either obtained from an expert, or estimated from data. If there are no constraints on the form of the approximate distribution P_a(X_1, ..., X_n), then it should coincide with the underlying distribution. However, as discussed in Section 2.2, such representations are often inefficient for belief propagation. Feasible approximate distributions are those that are constrained in the following manner. The approximate representation must belong to the family of product-form distributions that are constrained in the number of parents that any variable is allowed to have. If m is the maximum number of conditioning variables allowed, then it must be possible to represent the distribution P_a(X_1, ..., X_n) by the product-form ∏_{i=1..n} P_a(X_i | F(X_i)), where P_a(X_i | F(X_i)) = P_a(X_i) when F(X_i) is empty, for some ordering of the variables, such that max_i {|F(X_i)|} ≤ m. Such distributions will be said to have a connectivity of order m, and the corresponding product-form distributions will be of order m + 1.

In order to determine the best approximate representation for a given m, we need to measure how close the approximate distribution is to the actual one. The best approximate representation P_a(·) is one that is closest to P(·) in terms of some measure of closeness M(P, P_a).


If P_a(·) is identical to P(·), then M(P, P_a) should be minimized, and P_a(·) is an exact representation. The resulting optimization problem is:

Min M(P, P_a)

where P_a = ∏_{i=1..n} P_a(X_i | F(X_i))

s.t. |F(X_i)| ≤ m    {Connectivity Constraints}

Finding the 'best' belief network representation satisfying the connectivity constraints requires determining the topology that supports the best approximation, and the optimal probability parameters associated with that topology.
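For intuition about the size of this search space, the hypothetical sketch below enumerates, for one fixed variable ordering, the parent-set choices allowed by the connectivity constraint; the ordering and the value of m are assumptions chosen only for illustration:

    # Hypothetical sketch: with a fixed ordering, each variable may take any subset of its
    # predecessors of size at most m as its parent set F(Xi).
    from itertools import combinations

    def feasible_parent_sets(order, m):
        sets_per_var = {}
        for i, v in enumerate(order):
            preds = order[:i]
            sets_per_var[v] = [c for k in range(min(m, len(preds)) + 1)
                               for c in combinations(preds, k)]
        return sets_per_var

    choices = feasible_parent_sets(["A", "B", "C", "D", "E"], m=2)
    n_topologies = 1
    for v, opts in choices.items():
        n_topologies *= len(opts)
    print({v: len(opts) for v, opts in choices.items()}, n_topologies)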

3.2. Scoring rules

Approximate probability distributions have been analyzed for the purpose of judging subjective probability assessments made by experts (e.g. weather forecasts by meteorologists). A reasonable measure is a function of the approximate probability distribution and subsequent observations of the actual realizations. Terminologies used for such measures include scoring rules, reward functions [40], incentive functions [20], and scoring systems [36]. We adopt the term scoring rule in this paper.

Scoring rules are designed to (i) evaluate different probability assessments, and (ii) encourage assessors to provide their true ('honest') estimates. Let Y be an uncertain quantity represented by a probability distribution F on an outcome space S, and let E_1, E_2, ..., E_n constitute an n-fold partition of S (i.e., they are a set of n mutually exclusive and exhaustive events). The probability mass in E_j is denoted by p_j, where p_j = P(Y ∈ E_j). The vector p = (p_1, ..., p_n) represents the true probability values, and r = (r_1, ..., r_n) represents the assessor's stated beliefs. The assessment receives a score S_k(r) if the kth event occurs. The expected score for the assessed distribution r is S(p, r), where S(p, r) = Σ_k p_k S_k(r).

A desirable property of such rules is that the score should be maximized when the assessment coincides with the actual. Scoring rules that satisfy this requirement are those for which S(p, p) ≥ S(p, r) for any p and r (i.e., assessments other than p cannot get a higher score than p itself).

Such rules are called proper scoring rules. Scoring rules could also be defined in a way such that a low score is preferred to a high score. In that case, the rule would be proper if the score is minimized by setting r = p.

An assessment is evaluated based on the assessed distribution r, and the event that is realized. Scoring rules, therefore, should be non-decreasing functions of r_k, where k is the event realized (i.e., S_k(r) is non-decreasing in r_k). This ensures that an assessment r'_k will get a higher score than r_k if k is the event realized and r'_k > r_k. There are potentially an infinite number of functions that may serve as proper scoring rules. McCarthy [21], Marschak [20], and Shuford et al. [36], among others, have provided different characterizations for the functional form of proper scoring rules for different situations. Among the various possible proper scoring rules, three have received particular attention in the literature. They are:

• Quadratic scoring rule [3], defined as: S(p, r) = −Σ_k (p_k − r_k)².
• Logarithmic scoring rule [13], defined as: S(p, r) = Σ_k p_k log r_k.
• Spherical scoring rule [28], defined as: S(p, r) = Σ_k p_k r_k / (Σ_k r_k²)^0.5.

It is easily seen that any linear transformation of a proper scoring rule is also a proper rule.
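A minimal sketch of these three rules as expected scores, using made-up distributions, simply restates the definitions above in code:

    # Hypothetical sketch of the quadratic, logarithmic and spherical expected scores S(p, r).
    # Natural logarithms are used; any base gives an equivalent (linearly transformed) rule.
    import math

    def quadratic_score(p, r):
        return -sum((pk - rk) ** 2 for pk, rk in zip(p, r))

    def log_score(p, r):
        return sum(pk * math.log(rk) for pk, rk in zip(p, r) if pk > 0)

    def spherical_score(p, r):
        norm = math.sqrt(sum(rk ** 2 for rk in r))
        return sum(pk * rk for pk, rk in zip(p, r)) / norm

    p = [0.6, 0.3, 0.1]                      # "true" distribution
    for r in ([0.6, 0.3, 0.1], [0.5, 0.4, 0.1]):
        print(r, quadratic_score(p, r), log_score(p, r), spherical_score(p, r))
    # For each rule, the honest assessment r = p achieves the highest expected score.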

3.3. Properties of the quadratic rule

The choice is more limited when additional features are required of a scoring rule. In particular, when it is required that the rule be a function of the discrepancy (p_i − r_i), then the quadratic rule is the only one that satisfies this requirement [31]. Savage [31] also shows that it is the only proper rule that is symmetric over p and r (i.e. S(p, r) = S(r, p)). It is easy to see that the quadratic rule is equal to the negative of the squared second-norm function (the square of the Euclidean distance in the n-dimensional vector space). Therefore, maximizing the quadratic rule is equivalent to minimizing the second-norm vector function. Subsequently, this rule exhibits all the desirable properties of such a norm function, including the triangle inequality property.


This measure has been used in [43] to obtain sparse decision trees from data.

3.4. Properties of the logarithm rule

A different requirement often imposed on a scoring rule is that the score depend only on the probability assigned to the event that is actually realized (called the principle of relevance [40]). For instance, in a three event state space, the two assessments (0.6, 0.3, 0.1) and (0.6, 0.2, 0.2) should receive the same score if the first event was realized (if one of the other two events were to occur, the two assessments would get different scores). It has been shown that the logarithm rule is the only proper scoring rule that satisfies this requirement for any arbitrary n [40] [31]. Proof of uniqueness of the logarithmic rule has been demonstrated in [40].

An important limitation of the logarithm rule is that it is not a norm function [17]. However, in the context of evaluating belief networks, the logarithm rule is equivalent to two important criteria that have been used in practice to learn dependency structures from historical data. These are: (i) the information theoretic measure called the I-Divergence measure [18]; and (ii) the statistical maximum likelihood criterion. We first discuss the well-known equivalence between the I-Divergence measure and the logarithm rule in evaluating belief network structures. Next, we prove that finding the maximum likelihood estimate from among all the feasible solutions is also equivalent to using the logarithm rule.

3.4.1. Logarithm rule and the I-Divergence measure

In communication theory, the I-Divergence measure has been widely used to determine the best estimate of an unknown probability distribution. Subsequently, the I-Divergence measure has been used by Chow and Liu [6] to obtain tree structured representations. Many other researchers have also used this measure in obtaining network structures [12] [27] [44] [45]. It is defined as the difference in the information contained in the actual distribution p and the information contained in the approximate distribution r about the actual distribution p. The I-Divergence measure D(p, r) is expressed as:

D(p, r) = Σ_i p_i log(p_i / r_i) = Σ_i p_i log p_i − Σ_i p_i log r_i

This measure is always positive when the distributions p and r are different, and zero when they are identical [18]. Since the expression Σ_i p_i log p_i does not depend on the approximate representation, it is easy to see that the logarithm scoring rule is a linear transformation of the I-Divergence measure, and minimizing the I-Divergence measure is equivalent to maximizing the logarithm scoring rule. Hence, the best solution obtained using the logarithm rule is also the one that minimizes the difference in information between the approximate and actual representations, respectively.
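The following small sketch, with arbitrary example distributions, illustrates this linear relationship numerically: the I-Divergence equals the constant entropy term Σ_i p_i log p_i minus the logarithm score.

    # Hypothetical sketch: D(p, r) and the logarithm score differ only by a term that does
    # not depend on the approximation r.
    import math

    def i_divergence(p, r):
        return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

    def log_score(p, r):
        return sum(pi * math.log(ri) for pi, ri in zip(p, r) if pi > 0)

    p = [0.5, 0.3, 0.2]
    r = [0.4, 0.4, 0.2]
    entropy_term = sum(pi * math.log(pi) for pi in p)
    print(i_divergence(p, r), entropy_term - log_score(p, r))   # identical values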

Smyth and Goodman [37] use an entropy based measure, which they call the J-measure, in their scheme called ITRULE, that generates a set of K best probabilistic rules from sample data. As pointed out by the authors, the J-measure is a special instance of the I-Divergence measure, and is used to quantify the goodness of a rule of the form "If X = x then Y = y with probability p". In a belief network, when such a rule is included, then corresponding rules for all other realizations of X must also be included. In that case, using the J-measure becomes equivalent to the I-Divergence measure.

3.4.2. Logarithm rule and the maximum likelihood estimate

The maximum likelihood estimator is widely used for estimating distributions in statistical analysis. Cooper and Herskovitz [9] have formulated the problem of constructing the best approximate belief network from data as one of finding the most probable network given a database. Assuming that all feasible network structures are equally likely when no information is available, this is equivalent to finding the belief network (structure with associated parameters) whose joint distribution maximizes the likelihood among all feasible solutions given the data.


We show that the belief network that maximizes the logarithm score is also the maximum likelihood estimator. Consider the problem of determining the belief network that is the maximum likelihood estimator. Let the true distribution that characterizes the problem domain be P(X) = P(x_1, ..., x_n), and let P_a(X) = P_a(x_1, ..., x_n) be some unknown distribution that satisfies the connectivity constraint. When the belief network is induced from historical data, then P(X) is the best estimate of the true underlying distribution. Let the sample data consist of s instances of the set of variables, denoted by X^s = {x^1, ..., x^s}. For instance, we may have the following five data instances for three binary variables: X^s = {(0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 0, 1), (1, 0, 0)}. For the distribution P_a(X), the likelihood function is:

L = ∏_j P_a(X^j) = ∏_j ∏_{i=1..n} P_a(x_i | F(x_i))

Let L' be defined as the log likelihood, i.e. L' = log L. Then:

L' = log ∏_j ∏_{i=1..n} P_a(x_i | F(x_i))

= Σ_j Σ_{i=1..n} log P_a(x_i | F(x_i))

= Σ_{i=1..n} Σ_j log P_a(x_i | F(x_i))

Each instance (x_i, F(x_i)) is a subset of the jth instance in the database. Each such subset corresponds to one of a finite number of realizations for the variables included in the subset. For instance, the pair of variables (x_1, x_2) in the example mentioned earlier can take on four values: {(0, 0), (0, 1), (1, 0), (1, 1)}. Let the total number of such possibilities associated with the set of variables (x_i, F(x_i)) be r_i, and the true and approximate conditional probabilities associated with each realization be denoted as P_k(x_i | F(x_i)) and P_ak(x_i | F(x_i)) for k = 1, ..., r_i. Further, let the total number of occurrences for each realization of (x_i, F(x_i)) be f_k(x_i, F(x_i)), for k = 1, ..., r_i. Then:

Σ_j log P_a(x_i | F(x_i)) = Σ_k f_k(x_i, F(x_i)) log P_ak(x_i | F(x_i))

The estimate for the true underlying joint distribution p_k(x_i, F(x_i)) is obtained by finding, for each feasible realization of (x_i, F(x_i)), the proportion of instances in the database with that realization as compared to the total number of instances, i.e.

p_k(x_i, F(x_i)) = f_k(x_i, F(x_i)) / s

Therefore:

Σ_k f_k(x_i, F(x_i)) log P_ak(x_i | F(x_i))

= s Σ_k (f_k(x_i, F(x_i)) / s) log P_ak(x_i | F(x_i))

= s Σ_k p_k(x_i, F(x_i)) log P_ak(x_i | F(x_i))

Subsequently, we have:

L' = Σ_{i=1..n} s Σ_k p_k(x_i, F(x_i)) log P_ak(x_i | F(x_i))

= s Σ_{i=1..n} Σ_k p_k(x_i, F(x_i)) log P_ak(x_i | F(x_i))

A well-known property of the logarithm transformation is that L' is maximized when L is maximized. Since s is fixed for a given database, we have:

Max L' = Max Σ_{i=1..n} Σ_k p_k(x_i, F(x_i)) log P_ak(x_i | F(x_i))

Next, consider the problem formulation (as stated in section 3.1) when the logarithm rule is used. To obtain the best approximation, we must solve the following optimization problem:

Max Σ_X P(X) log P_a(X)

where P_a = ∏_{i=1..n} P_a(x_i | F(x_i))

s.t. |F(x_i)| ≤ m

The objective function can be manipulated as shown:

Max Σ_X P(X) log ∏_{i=1..n} P_a(x_i | F(x_i))

= Max Σ_X P(X) Σ_{i=1..n} log P_a(x_i | F(x_i))

= Max Σ_X Σ_{i=1..n} P(X) log P_a(x_i | F(x_i))

= Max Σ_{i=1..n} Σ_X P(X) log P_a(x_i | F(x_i))

= Max Σ_{i=1..n} Σ_{x_i, F(x_i)} P(x_i, F(x_i)) log P_a(x_i | F(x_i))

= Max Σ_{i=1..n} Σ_k p_k(x_i, F(x_i)) log P_ak(x_i | F(x_i))

where k is as defined earlier.

The above expression is the same as the one obtained for the maximum likelihood estimator; hence the two criteria are equivalent.
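A brief numerical sketch of this equivalence is given below. The five data instances follow the example above, but the structure and the resulting numbers are assumptions chosen only to show that the log-likelihood equals s times the logarithm score computed with the empirical frequencies.

    # Hypothetical sketch: for a fixed structure, log-likelihood = s * (logarithm score with
    # empirical frequency estimates p_k).  The structure x1 -> x2 -> x3 is assumed.
    import math
    from collections import Counter

    data = [(0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 0, 1), (1, 0, 0)]   # s = 5 instances
    s = len(data)
    parents = {0: (), 1: (0,), 2: (1,)}

    # Empirical counts f_k(xi, F(xi)) and the maximum-likelihood conditional probabilities.
    counts = {i: Counter((tuple(row[j] for j in pars), row[i]) for row in data)
              for i, pars in parents.items()}
    parent_counts = {i: Counter(tuple(row[j] for j in pars) for row in data)
                     for i, pars in parents.items()}
    cond = {i: {(pv, xi): c / parent_counts[i][pv] for (pv, xi), c in counts[i].items()}
            for i in parents}

    log_likelihood = sum(math.log(cond[i][(tuple(row[j] for j in parents[i]), row[i])])
                         for row in data for i in parents)
    log_score = sum((c / s) * math.log(cond[i][key])
                    for i in parents for key, c in counts[i].items())
    print(log_likelihood, s * log_score)       # the two quantities coincide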

4. Modeling features of the quadratic and logarithm scoring rules

In choosing a measure to evaluate approximate representations, a proper scoring rule is clearly very desirable. As discussed in section 3, the quadratic and the logarithm scores have been shown to have some additional desirable properties that make them appropriate for evaluating approximations. Therefore, in this section, we discuss some modeling implications of using these two rules, respectively.

In many applications, an expert (or experts) may be able to specify either one or a small number of alternate topologies for a problem domain. If a unique topology is specified, the problem reduces to one of determining the best set of probability parameters given the topology. When alternate topologies are to be considered, then the best representation for each topology (in terms of the probability statistic) is compared using the chosen measure. We examine how the quadratic and logarithm scores can be used in these circumstances.

4.1. Modeling using the logarithm scoring rule

The logarithm rule leads to a problem formulation with some very attractive properties. We state a result that enables us to obtain the probability parameters relatively easily.

Proposition: When using the logarithm rule, the best set of probability parameters for a given topology are those that preserve the joint probabilities of the component terms for the corresponding product-form representation.

This result follows from the fact that the objective function for the associated optimization problem contains the logarithm of the approximate distribution, which is a product-form, and subsequently may be expressed as the sum of the logarithms of the individual components of the approximate distribution. Since the logarithm rule is a proper scoring rule, the objective function is optimized when the probability parameters associated with each component of the approximate distribution are equal to the corresponding parameters for the actual distribution. A formal proof of this result is shown in [30]. Consider the topology shown in Figure 2(c) as an approximation to the completely connected representation shown in Figure 2(a). By virtue of the above property, the best set of probability parameters obtained using the logarithm rule will satisfy the following conditions:

• P_a(A) = P(A)
• P_a(B | A) = P(B | A)
• P_a(C | A) = P(C | A)
• P_a(D | A, C) = P(D | A, C)
• P_a(E | B, C) = P(E | B, C)


This property holds for any feasible topology that is considered as an approximation. An outcome of this property is that if the topology is specified, then the probability parameters for each component can be easily obtained. The joint distribution for the complete representation is obtained by using the appropriate product-form for the given topology.
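The hypothetical sketch below illustrates the proposition for the topology of Figure 2(c): given any true joint distribution (here an arbitrary made-up one over five binary variables), the logarithm-optimal parameters are simply the corresponding marginal and conditional probabilities of that joint.

    # Hypothetical sketch: logarithm-optimal parameters preserve the component probabilities
    # of the true joint.  The joint itself is randomly generated for illustration.
    import itertools, random

    random.seed(0)
    vals = list(itertools.product([0, 1], repeat=5))          # realizations of (A, B, C, D, E)
    w = [random.random() for _ in vals]
    P = {v: wi / sum(w) for v, wi in zip(vals, w)}             # true joint P(A, B, C, D, E)

    def marginal(joint, idx):
        out = {}
        for v, p in joint.items():
            key = tuple(v[i] for i in idx)
            out[key] = out.get(key, 0.0) + p
        return out

    def conditional(child, given):                             # P(child | given) from the joint
        num, den = marginal(P, given + (child,)), marginal(P, given)
        return {k: num[k] / den[k[:-1]] for k in num}

    A, B, C, D, E = range(5)
    Pa = {"P(A)": marginal(P, (A,)), "P(B|A)": conditional(B, (A,)),
          "P(C|A)": conditional(C, (A,)), "P(D|A,C)": conditional(D, (A, C)),
          "P(E|B,C)": conditional(E, (B, C))}
    print(Pa["P(D|A,C)"])    # these parameters maximize the logarithm score for this topology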

The logarithm rule can also be efficiently applied for those instances where the best representation is to be chosen from one of a small number of alternate structures that are provided by the expert. The best set of probability parameters for each alternative is easily obtained.


The best topology is then selected by evaluating the score for the optimal set of parameters associated with each topology. The number of terms in the evaluation function increases exponentially with the number of variables in the problem domain. For example, if there are n binary variables being considered, then the number of terms to be computed is 2^n. However, useful approximate representations will usually have low order product-forms. This allows the scoring function to be decomposed into smaller components. For the topology shown in Figure 2(c) the score is:

Σ_{A,B,C,D,E} P(A, B, C, D, E) log P_a(A, B, C, D, E)

= Σ_{A,B,C,D,E} P(A, B, C, D, E) log(P_a(A) × P_a(B|A) × P_a(C|A) × P_a(D|A, C) × P_a(E|B, C))

= Σ_A P(A) log P_a(A) + Σ_{A,B} P(A, B) log P_a(B|A) + Σ_{A,C} P(A, C) log P_a(C|A) + Σ_{A,C,D} P(A, C, D) log P_a(D|A, C) + Σ_{B,C,E} P(B, C, E) log P_a(E|B, C)

= Σ_A P(A) log P(A) + Σ_{A,B} P(A, B) log P(B|A) + Σ_{A,C} P(A, C) log P(C|A) + Σ_{A,C,D} P(A, C, D) log P(D|A, C) + Σ_{B,C,E} P(B, C, E) log P(E|B, C)

{follows from the proposition stated earlier}.

The overall score for the representation can be obtained by evaluating each of the component expressions independently. Since the number of variables that are allowed in any component is restricted to m + 1, each expression can be evaluated by computing a relatively small number of terms. Assuming that all variables are binary, the maximum number of terms that need to be evaluated for any one expression will be 2^(m+1). Since there will be a total of n such expressions, the total number of terms to be evaluated is no more than n·2^(m+1). For n much greater than m, it is easy to see that there will be enormous computational savings in using the decomposed version of the evaluation function. When variables are not restricted to be binary, the computational savings will be even greater.
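A short sketch of this decomposition, using an arbitrary made-up joint distribution and the topology of Figure 2(c), verifies that the component-wise evaluation gives exactly the same score as the full summation over the joint.

    # Hypothetical sketch: the logarithm score of a product-form approximation decomposes
    # into one low-order term per variable.  The joint distribution is randomly generated.
    import itertools, math, random

    random.seed(1)
    states = list(itertools.product([0, 1], repeat=5))          # (A, B, C, D, E)
    w = [random.random() for _ in states]
    P = dict(zip(states, [x / sum(w) for x in w]))               # true joint

    def marg(idx):
        out = {}
        for s, p in P.items():
            k = tuple(s[i] for i in idx)
            out[k] = out.get(k, 0.0) + p
        return out

    def cond(child, given):                                      # P(child | given) from the joint
        num, den = marg(given + (child,)), marg(given)
        return {k: num[k] / den[k[:-1]] for k in num}

    A, B, C, D, E = range(5)
    comps = [((A,), marg((A,))), ((A, B), cond(B, (A,))), ((A, C), cond(C, (A,))),
             ((A, C, D), cond(D, (A, C))), ((B, C, E), cond(E, (B, C)))]

    # Full sum over the joint: sum_X P(X) log Pa(X), with Pa(X) the product of its components.
    full = sum(p * sum(math.log(table[tuple(s[i] for i in idx)]) for idx, table in comps)
               for s, p in P.items())
    # Decomposed score: one small summation per component term.
    decomposed = sum(sum(mp * math.log(table[k]) for k, mp in marg(idx).items())
                     for idx, table in comps)
    print(full, decomposed)                                      # identical values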

When the topology for the approximate representation is not specified (e.g. when the network structure is being induced from data), then finding the best representation is a hard problem. All feasible topologies must be considered, and the best logarithm solution for each of these topologies must be compared in order to determine the optimal representation. For each feasible topology, the above property of the logarithm scoring rule is still applicable; however, the number of topologies that need to be considered increases exponentially with the number of variables in the network. The problem appears to be hard, although some special cases have been shown to be tractable [6]. Heuristic techniques will be required to obtain good solutions for such problem instances in general [38].

4.2. Modeling using the quadratic scoring rule

Determining the best solution is more difficult when the quadratic scoring rule is used to evaluate different representations. Unlike when using the logarithm rule, using the quadratic rule does not help in decomposing the objective function into its different components. The product-form nature of the approximate representation has to be enforced by incorporating constraints into the optimization formulation. For a given topology, determining the best solution requires solving the resulting optimization problem. For the topology shown in Figure 2(c), the optimization problem will be:

Min S(P, P_a) = Min Σ_{A,B,C,D,E} (P(A, B, C, D, E) − P_a(A, B, C, D, E))²

s.t. P_a(A, B, C, D, E) = P_a(A) × P_a(B|A) × P_a(C|A) × P_a(D|A, C) × P_a(E|B, C)

for all realizations of the variables.

The problem is one of non-linear optimization, with a quadratic objective function and non-linear constraints. The exact nature of the constraints depends on the product-form of the approximate distribution (the optimization problem formulation is discussed in further detail in [Sarkar, 1993]). Exact analytic solutions do not exist for such problems in general, and numerical approximation techniques must be used to solve them. Such techniques are not guaranteed to find the global optimal solutions. A serious drawback is that the number of terms that are to be computed will grow exponentially with the size of the problem, making it intractable for large problem sizes.
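As a rough illustration only (this is not the authors' IMSL/NCONF formulation), the sketch below fits a quadratic-score approximation for a small three-variable example. The product-form constraint is enforced by construction, since the decision variables are the conditional probabilities themselves, and scipy's general-purpose optimizer (assumed available) stands in for the successive quadratic programming routine; as in the paper, several starting points should be tried because only a local optimum is guaranteed.

    # Hypothetical sketch: minimize sum_X (P(X) - Pa(X))^2 over the parameters of the
    # product form Pa(A) Pa(B|A) Pa(C|A).  The true joint P is randomly generated.
    import itertools, random
    import numpy as np
    from scipy.optimize import minimize

    random.seed(2)
    states = list(itertools.product([0, 1], repeat=3))           # realizations of (A, B, C)
    w = [random.random() for _ in states]
    P = np.array([x / sum(w) for x in w])                        # true joint, in state order

    def Pa(theta):
        # theta = [P(A=1), P(B=1|A=0), P(B=1|A=1), P(C=1|A=0), P(C=1|A=1)]
        pA, pB0, pB1, pC0, pC1 = theta
        out = []
        for a, b, c in states:
            pa = pA if a else 1 - pA
            pb = (pB1 if a else pB0) if b else 1 - (pB1 if a else pB0)
            pc = (pC1 if a else pC0) if c else 1 - (pC1 if a else pC0)
            out.append(pa * pb * pc)
        return np.array(out)

    def sq_loss(theta):
        return float(np.sum((P - Pa(theta)) ** 2))

    # One run from one starting point; in practice restart from several x0 values.
    res = minimize(sq_loss, x0=[0.5] * 5, bounds=[(1e-6, 1 - 1e-6)] * 5, method="L-BFGS-B")
    print(res.x, sq_loss(res.x))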

When alternate topologies are considered, then the best quadratic solution for each topology needs to be obtained, and the scores compared. When no topologies are specified, then finding the best quadratic solution is a very difficult problem. The best quadratic approximations for all feasible topologies have to be compared using the quadratic score. Obtaining the best solution for a given topology requires computations that increase exponentially with the size of the problem. The number of feasible topologies that must be considered also increases in an exponential fashion. Clearly, exact solutions will not be feasible for large problem instances.

5. Experimental comparison of the quadratic and logarithm scoring rules

The logarithm rule is much easier to use than the quadratic rule when the best representation is desired for a given topology, or one among a few topologies is to be selected. The best representation obtained when using the logarithm rule will usually not be the same as the one obtained using the quadratic rule. In this section we compare the quality of the solutions obtained from these rules using a decision theoretic framework. We describe a decision problem that is modeled as a belief network. The decision problem consists of five binary variables which are interrelated. The nature of the dependencies among the variables is shown in Figure 3(a). We assume that the topology for the approximate representation is fixed, and as shown in Figure 3(b). A brief discussion on the choice of the actual and approximate representations is required here. The approximate topology chosen is one that may be easily converted to a tree-structure by incorporating auxiliary variables [29]. Joint distributions that correspond to tree structures are those in which all component terms have at most one conditioning variable. This makes such representations very efficient for propagating beliefs; hence, the structure shown in Figure 3(b) is an efficient one.

Fig. 3. Belief networks for the decision problem. (a) Actual belief network. (b) Approximate topology.


The only difference between the actual and approximate topology is that the arc BE appears in the actual topology and not in the approximate. This enables us to subsequently examine different distributions by varying the strength of the dependency associated with this arc.

We obtain the best representation for a given approximate topology when using the quadratic and logarithm rules, respectively. The solutions obtained when using each of these rules are compared as follows. The node A is considered to be the hypothesis variable. The decision problem is to predict whether the hypothesis variable is true or not when some of the other variables have been observed. When the actual representation is used, then the revised belief that the hypothesis is true is obtained by using the distribution associated with the actual representation. Similarly, when an approximate representation is used, then the revised belief is obtained by using the distribution associated with the approximate representation. When an approximate representation is used instead of the actual one, the decisions made may or may not coincide with ones made with the actual representation. When the decisions made are the same, then no losses are deemed to be incurred. When decisions made are not the same, then the use of the approximate distribution results in losses that are characterized as Type I and Type II losses. We evaluate the expected losses associated with the quadratic and logarithm approximations when different sets of variables are observed, as well as for a wide range of Type I and Type II errors. Subsequently, we perform this analysis for different distributions associated with the actual topology by varying the strength of dependency associated with the arc BE. The expected losses associated with each approximation are summarized and compared.

5.1. The decision problem

The decision problem is to predict whether the hypothesis variable A is true or false. In practice, decisions are made without exact knowledge about A. It may be possible to observe some of the variables B, C, D and E before a decision has to be made about variable A. Knowledge of the value of other variables will affect our belief in event A. For instance, the variables C and E may be observed to be true prior to making a decision. The belief in A is revised to account for this information using the actual and approximate representations, respectively. The posterior probabilities obtained for event A using the different distributions are then used for prediction.

There are two types of errors associated with making a prediction in the absence of perfect information. If A is predicted to be "Not True" when it is actually true, then we have a Type I error, and when A is predicted to be "True" when it is actually not true, we have a Type II error. The costs associated with these wrong decisions are denoted by C_1 and C_2 respectively. If the prediction is correct then there is no cost associated with the decision. The approximate representations are analyzed by evaluating the decisions made when using the approximate distributions respectively and comparing them with the decision made when using the actual distribution. The expected costs when using the approximate and actual distributions are analyzed for different values of C_1 and C_2.
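One plausible reading of this prediction rule, sketched below with hypothetical posterior values and costs, is to predict "True" whenever the expected cost of doing so is lower than that of predicting "Not True"; the actual and approximate representations then disagree exactly when their posteriors fall on opposite sides of the threshold C_2/(C_1 + C_2).

    # Hypothetical sketch of the cost-based prediction rule; all numbers are illustrative.
    def predict(p_true, c1, c2):
        """p_true: revised belief that A is true; c1: Type I cost; c2: Type II cost."""
        cost_if_predict_true = (1 - p_true) * c2       # wrong only if A is actually not true
        cost_if_predict_false = p_true * c1            # wrong only if A is actually true
        return "True" if cost_if_predict_true < cost_if_predict_false else "Not True"

    p_actual, p_approx = 0.55, 0.48        # posteriors from the actual vs. approximate network
    for c1, c2 in [(1, 1), (3, 1), (1, 3)]:
        print(c1, c2, predict(p_actual, c1, c2), predict(p_approx, c1, c2))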

5.2. Actual and approximate distributions

The joint distribution for the actual network in Figure 3(a) can be obtained by specifying the joint distribution P(A, B, C), and the conditional distributions P(E|B, C) and P(D|E, C) for all realizations of the variables. In order to simplify the specification of the complete distribution, the joint distribution for the variables (A, B, C) is chosen identical to that for the variables (D, E, C). The distributions used for the variables (A, B, C) and (C, D, E) are shown in Table 1. Table 2 displays the joint distribution for (B, C, E), which is equivalent to specifying the conditional distribution P(E|B, C). Table 3 displays the resulting complete joint distribution over the five variables. In these tables, an entry of the form P(A) = 0.5 indicates that the probability of event A being true is 0.5, whereas earlier, the notation P(A) has been used to denote the complete distribution for variable A. For notational convenience, we use P(A) to refer to a specific outcome for event A (i.e. event A is true) for the example considered in this section.

Table 1
Joint distributions for (A, B, C) and (C, D, E)

P(A) = 0.5     P(B) = 0.6     P(C) = 0.4
P(AB) = 0.4    P(AC) = 0.3    P(BC) = 0.31   P(ABC) = 0.25

P(C) = 0.4     P(D) = 0.5     P(E) = 0.6
P(CD) = 0.3    P(CE) = 0.31   P(DE) = 0.4    P(CDE) = 0.25

Table 2
Distribution P(B, C, E)

P(B) = 0.6     P(C) = 0.4     P(E) = 0.6
P(BC) = 0.31   P(BE) = 0.38   P(CE) = 0.31   P(BCE) = 0.26

The best quadratic and logarithm solutions are obtained for the approximate topology. The logarithm approximation is easily obtained, since it preserves the marginal distribution P(A) and the conditional distributions P(B|A), P(C|A, B), P(E|C) and P(D|C, E) from the actual network. The quadratic approximation requires solving the optimization problem presented in Section 3.1, in which the objective function is the quadratic score (details of this formulation are discussed in [30]). A non-linear optimization code, called NCONF, from the IMSL library of mathematical routines has been used to obtain the quadratic solution. The routine uses a successive quadratic programming algorithm [32] [33] [41]. Due to the existence of multiple local optima, the procedure could terminate at some locally optimal solution. To reduce the likelihood of using local optima that are not global, the procedure is run with different starting solutions, and the best solution is chosen. In addition, to ensure that the solution obtained is indeed a good one (if not the best), the quadratic score for the IMSL solution is compared with the quadratic score for the logarithm solution. If the quadratic score for the logarithm solution is better than for the IMSL solution, then the IMSL solution is discarded, and more solutions are generated using some other starting points. This is repeated until the IMSL solution obtained has a better quadratic score than the best logarithm solution. This ensures that neither of the approximations being compared dominates the other one for both the quadratic and the logarithm score. The probability masses for the approximate solutions (when the actual distribution is as shown in Table 3) are displayed in Appendix I.
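Because the logarithm approximation simply preserves the conditional distributions of the approximate topology, it can be computed directly from the actual joint distribution. The sketch below is ours, not the authors' code; it assumes the actual joint is available as a Python dict keyed by 0/1 tuples in the order (A, B, C, D, E), and it uses the product form P(A, B, C) · P(E|C) · P(D|C, E) implied by the approximate topology of Figure 3(b).

```python
from itertools import product

def log_approximation(actual):
    """Best logarithm-rule approximation for the topology of Figure 3(b):
    P_t(a, b, c, d, e) = P(a, b, c) * P(e | c) * P(d | c, e),
    with every component taken directly from the actual distribution.
    `actual` maps 0/1 tuples (a, b, c, d, e) to probabilities summing to one."""

    def marginal(dist, positions):
        # Marginal distribution over the variables at the given positions.
        out = {}
        for outcome, prob in dist.items():
            key = tuple(outcome[i] for i in positions)
            out[key] = out.get(key, 0.0) + prob
        return out

    p_abc = marginal(actual, (0, 1, 2))   # P(A, B, C)
    p_c   = marginal(actual, (2,))        # P(C)
    p_ce  = marginal(actual, (2, 4))      # P(C, E)
    p_cde = marginal(actual, (2, 3, 4))   # P(C, D, E)

    approx = {}
    for a, b, c, d, e in product((0, 1), repeat=5):
        p_e_given_c  = p_ce[(c, e)] / p_c[(c,)]
        p_d_given_ce = p_cde[(c, d, e)] / p_ce[(c, e)]
        approx[(a, b, c, d, e)] = p_abc[(a, b, c)] * p_e_given_c * p_d_given_ce
    return approx
```

For the distribution in Table 3, this reproduces the LOG APPROX column of Appendix I; for example, P(A, B, C) = 0.25, P(E|C) = 0.31/0.4 = 0.775 and P(D|C, E) = 0.25/0.31 ≈ 0.806 give P_t(ABCDE) ≈ 0.156.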

Table 3
Complete joint distribution

P(A) = 0.5         P(B) = 0.6         P(C) = 0.4         P(D) = 0.5         P(E) = 0.6
P(AB) = 0.4        P(AC) = 0.3        P(AD) = 0.29035    P(AE) = 0.4        P(BC) = 0.31
P(BD) = 0.32694    P(BE) = 0.38       P(CD) = 0.3        P(CE) = 0.31       P(DE) = 0.4
P(ABC) = 0.25      P(ABD) = 0.23778   P(ABE) = 0.27175   P(ACD) = 0.22624   P(ACE) = 0.23746
P(ADE) = 0.23778   P(BCD) = 0.23746   P(BCE) = 0.26      P(BDE) = 0.27175   P(CDE) = 0.25
P(ABCD) = 0.1915   P(ABCE) = 0.20968  P(ABDE) = 0.2012   P(ACDE) = 0.1915   P(BCDE) = 0.20968
P(ABCDE) = 0.16909

5.3. Costs using the actual and approximate distributions

In order to evaluate the performance of the two approximate solutions, the costs of using the actual and the two approximate distributions have to be determined for the decision problem. We illustrate this with the help of an example. In this example, we assume that the exact distribution is as shown in Table 3, and the approximate distributions are as shown in Appendix I. We further consider an instance where the variables C and E have been observed before a prediction is to be made about the hypothesis variable A. Consider the case when both variables C and E are observed to be true. The posterior belief in variable A being true (i.e. P(A|CE)) is first evaluated using the actual distribution. Using the probabilities for the actual distribution in Table 3, we obtain P(A|CE) = P(ACE)/P(CE) = 0.23746/0.31 = 0.766. The cost of making a decision can then be evaluated using the decision tree shown in Figure 4. At the decision node, the decision maker can do one of two things: predict that A is "True", or that A is "Not True". Subsequent to this decision, A may actually turn out to be true or not.

Fig. 4. Decision Tree with Actual Probabilities.

The prediction that minimizes the cost will be chosen. The cost of predicting A to be "True" is (1 - P(A|CE)) × C2, while the cost of predicting A to be "Not True" is P(A|CE) × C1. Strictly speaking, the costs of prediction are expected costs, based on the probability that A is true or false. However, we reserve the term expected cost for later use, when we find the expectation over the range of all possible values of C2. Therefore, when (1 - P(A|CE)) × C2 < P(A|CE) × C1, we are better off predicting A to be "True"; otherwise we would predict A to be "Not True". A similar analysis is performed for other possible realizations of the observed variables C and E. From the above analysis we see that the decision depends on the ratio of the costs C1 and C2, and not on the absolute costs. Therefore, with no loss of generality, we can set C1 to 1 and analyze the cost for different values of C2. We predict A to be true when (1 - P(A|CE)) × C2 < P(A|CE) × C1, or equivalently when C2 < P(A|CE) × C1/(1 - P(A|CE)) = P(A|CE)/(1 - P(A|CE)) = 0.766/0.234 = 3.273 = Q (where Q is defined as the posterior 'odds' for the hypothesis being true). The cost involved in making such a prediction is (1 - P(A|CE)) × C2 = 0.234 × C2. When C2 > P(A|CE)/(1 - P(A|CE)), we predict A to be "Not True", with a cost equal to P(A|CE) = 0.766. The cost curve as a function of C2 is shown in Figure 5.

Fig. 5. Cost Curve for the Actual Distribution.
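The decision rule above is simple enough to state in a few lines of code. The sketch below is ours (function and variable names are hypothetical); it returns the cost-minimizing prediction and its cost for a given posterior p, with C1 fixed at 1 as in the text.

```python
def prediction_and_cost(p, c2, c1=1.0):
    """Cost-minimizing prediction for the hypothesis A given its posterior
    probability p, Type I cost c1 and Type II cost c2."""
    cost_true = (1.0 - p) * c2      # predict "True": pay c2 if A is actually false
    cost_not_true = p * c1          # predict "Not True": pay c1 if A is actually true
    if cost_true < cost_not_true:   # equivalent to c2 < Q = p / (1 - p)
        return "True", cost_true
    return "Not True", cost_not_true

p = 0.766                               # P(A | C, E) under the actual distribution
print(p / (1.0 - p))                    # posterior odds Q, about 3.273
print(prediction_and_cost(p, 2.0))      # C2 below Q: predict "True", cost 0.468
print(prediction_and_cost(p, 4.0))      # C2 above Q: predict "Not True", cost 0.766
```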

When an approximate distribution is used instead of the actual one, the posterior belief in A will usually be different from the actual one. For instance, when the logarithm approximation is used, the posterior probability is Pt(A|CE) = Pt(ACE)/Pt(CE) = 0.2325/0.31 = 0.75 (these numbers are easily computed from the table shown in Appendix I). The variable A is predicted "True" when C2 < Qt = Pt(A|CE)/(1 - Pt(A|CE)) = 0.75/0.25 = 3, and "Not True" otherwise. When the decision made using the approximate distribution is the same as the decision made using the actual distribution, the cost for prediction is also the same. When the decision is different, the cost is higher for the approximate. In this example, the decision using the approximate distribution is the same as that when using the actual distribution for those values of C2 where C2 < 3 or C2 ≥ 3.273. When 3 < C2 < 3.273, we would predict A to be "Not True" when using the approximate distribution, which would lead to a cost equal to P(A|CE) = 0.766. This is higher than the cost when using the actual distribution (= 0.234 × C2).

The cost curve using the logarithm approximation is shown in Figure 6. The cost curve using the actual distribution is OBG, while that using the approximate distribution is OEAG. The triangle EAB characterizes the loss region for the approximate distribution. AB identifies the range of values for C2 over which a loss occurs, and AE is the maximum loss that may occur when using the approximate distribution. The range AB is the difference between the correct posterior 'odds' Q and the approximate one, which is Qt. The ratio m = 100 × AE/ED is the maximum %loss when using the approximate distribution (expressed as a % of the expected cost when using the actual distribution). For this example, we have AB = Q - Qt = 3.273 - 3 = 0.273, and m = 100 × AE/ED = 100 × ((BC/ED) - 1) = 100 × ((Q/Qt) - 1) = 9.1%. A similar analysis is performed using the quadratic solution. For that solution we have Pq(A|CE) = Pq(ACE)/Pq(CE) = 0.24238/0.3198 = 0.758. Variable A will be predicted "True" when C2 < Pq(A|CE)/(1 - Pq(A|CE)) = 0.758/0.242 = 3.132, and the decision will result in a loss when 3.132 < C2 < 3.273. In this case the range of C2 where a loss occurs is AB = 0.141, and m = 4.5%.

Fig. 6. Cost Curve Using the Logarithm Approximation.
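The two loss parameters introduced above, the loss interval AB and the maximum percentage loss m, depend only on the actual and approximate posterior odds. The sketch below is ours; it takes the larger-to-smaller odds ratio as the natural generalization of m when the approximate odds exceed the actual ones.

```python
def loss_region(p, p_a):
    """Loss interval AB and maximum percentage loss m when an approximate
    posterior p_a replaces the actual posterior p for the hypothesis."""
    Q = p / (1.0 - p)            # actual posterior odds
    Q_a = p_a / (1.0 - p_a)      # approximate posterior odds
    AB = abs(Q - Q_a)            # range of C2 over which the two decisions differ
    m = 100.0 * (max(Q, Q_a) / min(Q, Q_a) - 1.0)
    return AB, m

print(loss_region(0.766, 0.75))     # logarithm approximation: about (0.273, 9.1)
print(loss_region(0.766, 0.758))    # quadratic approximation: about (0.141, 4.5)
```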

5.4. Consolidation of loss parameters

The above analysis is performed for decisions made after different sets of observed variables, and the results are summarized. The range AB and the maximum percentage loss m are two parameters to compare for different approximate distributions. A third parameter we compare is the expected loss expressed as a percentage of the expected cost when using the actual distribution.

We define M, the cumulative maximum percentage loss for a given distribution, as M = max{m}, over all possible sets of observed variables. Thus, M is the maximum of the m's, each of which is itself the maximum percentage loss for a given set of observations. Since we are going to compare the summary statistics subsequently, we use M as one measure of the goodness of an approximation. For the sake of brevity, we drop the qualifier cumulative in the rest of this section.

It is harder to consolidate the range AB, which is the interval of C2 over which some loss is incurred when using the approximate distribution. In order to do so for different problem instances, we interpret this interval as the probability of incurring a loss when using an approximate distribution, and call it the 'Loss Probability' associated with such a distribution.

The loss interval for C2 is translated into the 'Loss Probability' in the following manner. C2 is the ratio of the costs associated with the two types of errors (since C1 has been fixed equal to 1). Each value of C2 refers to that particular ratio of the costs of making Type I and Type II errors, respectively. Its domain is the half-line [0, ∞). For instance, C2 = 1 refers to all instances where C1 = C2. Similarly, C2 = 0.5 corresponds to C2 = 0.5·C1, and C2 = 2 corresponds to C2 = 2·C1.

Consider the following two loss intervals:

Interval 1: [0.5, 1] ≡ [C2 = 0.5·C1, C2 = C1]

Interval 2: [1, 2] ≡ [C2 = C1, C2 = 2·C1]

The ranges of the intervals are 0.5 and 1 for these two cases. This seems to imply that the second interval is more significant than the first. However, since the interval values reflect the ratio of two costs, it seems more appropriate that these two intervals be considered equivalent for evaluating the approximate distributions. A transformation scheme that achieves this objective by using a variable X as a surrogate for C2 is shown below:

$$ X = \begin{cases} C_2 & \text{when } C_2 \le 1 \\ 2 - \frac{1}{C_2} & \text{when } C_2 > 1 \end{cases} $$

The transformation maps the values of C2 in the interval [0, ∞) onto the interval [0, 2] for values of X, and centers the interval for X around 1, with the distance from 1 reflecting the proportionate rather than the absolute difference between the two types of costs. When using this transformation, the effective intervals for the variable X in the above two cases become [0.5, 1] and [1, 1.5] respectively, which translate to the same ranges. Another intuitively appealing feature of this transformation is that the range of X evaluated for the actual and approximate posterior probabilities p = P(A|CE) and pa = Pa(A|CE), respectively, is the same as that for the posterior of the negation of event A, i.e. for P(~A|CE) and Pa(~A|CE).

The values of X that correspond to the two endpoints of the loss interval C2 = Q and C2 = Qa are denoted by Z and Za respectively (thus, X lies in the interval between Z and Za if and only if C2 is in the interval between Q and Qa). Using an approximate distribution results in a loss only when X is in [Z, Za]. Therefore, the probability of incurring a loss when using an approximate distribution is given by the probability that X lies in [Z, Za]. For computational convenience, we assume X to be uniformly distributed over [0, 2]. As a result, the probability of incurring a loss when an approximate distribution is used is given by R = |Z - Za|/2, which we call the 'Loss Probability'. It should be noted that the above transformation is used only to help consolidate the loss interval for different sets of observed variables. The actual loss interval is adequate for evaluating approximate representations in individual cases.
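A sketch of the transformation and the resulting 'Loss Probability' as defined above (our code, not the authors'; X is taken to be uniform on [0, 2] as in the text):

```python
def x_of(c2):
    """Map the cost ratio C2 in [0, infinity) onto X in [0, 2]."""
    return c2 if c2 <= 1.0 else 2.0 - 1.0 / c2

def loss_probability(Q, Q_a):
    """R = |Z - Za| / 2, the probability that X falls in the loss interval."""
    return abs(x_of(Q) - x_of(Q_a)) / 2.0

print(loss_probability(0.766 / 0.234, 3.0))   # about 0.014 for the (C, E) example
```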

We define a third parameter L that captures both the range of C2 over which a loss occurs, as well as the rate of loss, when using approximate solutions. L is defined as the expected loss over the entire range of X (and hence C2), expressed as a percentage of the expected cost using the actual distribution, i.e.:

$$ L = \frac{\text{Expected Cost Using Approximation} - \text{Expected Cost Using Actual}}{\text{Expected Cost Using Actual}} $$

The expected costs using the actual and approximate distributions are given by the areas under the cost curves for the two distributions, respectively. L is a function of Q and Qa, where Qa is the odds that the hypothesis is true when an approximate distribution is used. The expression for L is different for different ranges of values of Q and Qa. The different expressions are shown below, with the derivations presented in Appendix II:

Case I: Both Q and Qa are less than or equal to 1

$$ L = \frac{(Q_a - Q)^2}{Q(4 - Q)} $$

Case II: Both Q and Qa are greater than or equal to 1

$$ L = \frac{\log\left(\frac{Q_a}{Q}\right) + \frac{Q}{Q_a} - 1}{1.5 + \log Q} $$

Case IIIa: Q is less than 1 and Qa is greater than 1

$$ L = \frac{1 + 2\log Q_a + 2\frac{Q}{Q_a}}{Q(4 - Q)} - 1 $$

Case IIIb: Q is greater than 1 and Qa is less than 1

$$ L = \frac{Q_a^2 + Q(4 - 2Q_a)}{3 + 2\log Q} - 1 $$

The loss parameters, when both C and E are observed to be true, are as shown in Table 4. For this case, both Q and Qa are greater than 1, and therefore Case II is used to calculate L.

Table 4
Loss parameters when C and E are observed to be true

                          Expected Loss (L)   Loss Probability (R)   Maximum Loss (M)
Logarithm approximation   0.146%              0.0278                 9.1%
Quadratic approximation   0.03%               0.0137                 4.5%
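The four closed-form expressions are easy to evaluate programmatically. The sketch below is ours; it assumes natural logarithms, which reproduce the expected loss reported in Table 4 for the logarithm approximation (about 0.146%).

```python
import math

def expected_loss_L(Q, Q_a):
    """Expected loss L as a function of the actual odds Q and the approximate
    odds Q_a, using the four cases given above."""
    if Q <= 1.0 and Q_a <= 1.0:                                   # Case I
        return (Q_a - Q) ** 2 / (Q * (4.0 - Q))
    if Q >= 1.0 and Q_a >= 1.0:                                   # Case II
        return (math.log(Q_a / Q) + Q / Q_a - 1.0) / (1.5 + math.log(Q))
    if Q < 1.0 < Q_a:                                             # Case IIIa
        return (1.0 + 2.0 * math.log(Q_a) + 2.0 * Q / Q_a) / (Q * (4.0 - Q)) - 1.0
    return (Q_a ** 2 + Q * (4.0 - 2.0 * Q_a)) / (3.0 + 2.0 * math.log(Q)) - 1.0   # Case IIIb

Q = 0.766 / 0.234                        # actual odds when C and E are observed true
print(100.0 * expected_loss_L(Q, 3.0))   # about 0.146 (percent), as in Table 4
```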

A similar analysis is performed for all feasible sets of observable variables. Table 5 summarizes the loss parameters for different numbers of variables observed before prediction. For the belief network used in the study, there are four observable variables B, C, D and E. Hence, the number of variables that may be observed before prediction can be 0, 1, 2, 3 or 4. Each row in Table 5 displays the loss parameters averaged over all possible realizations for each combination of a given number of variables observed. For instance, when the number of observed variables is two, there are six different combinations of variables that may be observed. For each such combination, there are four distinct realizations. Therefore, the parameters displayed in that particular row in Table 5 are averages over a total of 24 distinct realizations. Similarly, the other rows in the table show the summary values of the loss parameters over all possible realizations.

Table 5
Consolidated loss parameters for the given distribution

# of variables    Expected Loss (L%)    Loss Probability (R)    Max Loss (M%)
observed          Log      Quad         Log       Quad          Log      Quad
0                 0.000    0.006        0.0000    0.0064        0.00     1.30
1                 0.003    0.025        0.0028    0.0088        2.29     7.16
2                 0.107    0.109        0.0136    0.0132        31.69    37.64
3                 0.134    0.126        0.0127    0.0121        42.04    66.31
4                 0.000    0.003        0.0000    0.0016        0.00     1.84
Cum Avg           0.074    0.079        0.009     0.011         -        -

5.5. Loss analysis for different actual distributions

The best logarithm and quadratic solutions are compared over a large number of actual distributions. The different distributions are obtained by varying the strength of the dependency between nodes B and E, since the arc (B, E) is missing from the approximate representation (refer to Figure 3). In the actual distribution the variable E is conditioned on the two variables B and C. Therefore, different dependencies between variables B and E can be obtained by varying the parameters P(BE) and P(BCE). The values that parameters P(BE) and P(BCE) may take are constrained by the probabilities specified for the rest of the distribution. We conduct our comparison by varying the parameter P(BE) over the range [0.3, 0.5] in steps of 0.04. For each such value of P(BE), P(BCE) is varied over [0.22, 0.3] in steps of 0.02 (the parameter P(BE) has a total feasible range of [0.22, 0.6]; however, for values of P(BE) below 0.3 and above 0.5, the corresponding feasible range of P(BCE) is less than [0.22, 0.3]). We note that variables B and E are conditionally independent of each other with respect to C when P(BE) = 0.380417 and P(BCE) = 0.24025 (i.e. the actual distribution does not have arc BE as part of its belief network, and an exact representation is possible).

For each set of values for parameters P(BE) and P(BCE), the complete distribution is generated for the actual problem, and the best logarithm and quadratic solutions are obtained. The loss parameters are evaluated for each of the approximate solutions. The results are aggregated for different values of P(BE), and presented in Table 6. Figures 7 and 8 show the parameters L and R respectively for the approximations obtained when using the two scoring rules.

Table 6
Loss analysis for different distributions

P(BE)   Range of P(BCE)   Expected Loss (L%)    Loss Probability (R)    Max Loss (M%)
                          Log      Quad         Log       Quad          Log       Quad
0.30    [0.22, 0.3]       1.0254   1.0376       0.0385    0.0403        134.26    186.12
0.34    [0.22, 0.3]       0.4738   0.5044       0.0242    0.0269        88.39     143.13
0.38    [0.22, 0.3]       0.2290   0.2466       0.0127    0.0157        67.75     126.22
0.42    [0.22, 0.3]       0.2336   0.2340       0.0155    0.0166        68.44     112.24
0.46    [0.22, 0.3]       0.4866   0.4398       0.0256    0.0265        76.50     108.98
0.50    [0.22, 0.3]       1.0408   0.9010       0.0376    0.0407        115.21    140.16

Fig. 7. Expected Loss L (%) for Logarithm and Quadratic Approximation.

Fig. 8. Loss Probability R for Logarithm and Quadratic Approximation.
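As a quick numerical cross-check of the conditional-independence point noted above (not part of the paper; variable names are ours), the values P(BE) = 0.380417 and P(BCE) = 0.24025 follow directly from the marginals in Tables 1 and 2 when B and E are assumed independent given C:

```python
# Marginals taken from Tables 1 and 2 of the example.
P_C, P_B, P_E = 0.4, 0.6, 0.6
P_BC, P_CE = 0.31, 0.31

P_B_given_C = P_BC / P_C                       # 0.775
P_E_given_C = P_CE / P_C                       # 0.775
P_B_given_notC = (P_B - P_BC) / (1.0 - P_C)    # about 0.4833
P_E_given_notC = (P_E - P_CE) / (1.0 - P_C)    # about 0.4833

# If B and E are conditionally independent given C:
P_BCE = P_B_given_C * P_E_given_C * P_C                            # 0.24025
P_BE = P_BCE + P_B_given_notC * P_E_given_notC * (1.0 - P_C)       # 0.380417
print(P_BE, P_BCE)
```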

The expected loss L, when using the logarithm solution, decreases as P(BE) increases from 0.3 to 0.38, and then increases with increasing P(BE). This is to be expected, since the dependence between B and E is weakest when P(BE) = 0.38, and is stronger for higher and lower values of P(BE). When using the quadratic approximation, the expected loss L is lowest when P(BE) = 0.42; however, it is only marginally higher for P(BE) = 0.38. Overall, the expected loss functions are very similar for the two solutions. The probability of a loss, R, is lower for the logarithm solution over the entire range of P(BE) considered. However, the differences between the logarithm and quadratic solutions are relatively small. The parameter M is also lower for the logarithm approximation over the entire range of P(BE).

These results seem to indicate that, on average, the logarithm solutions perform at least as well as the quadratic solutions, if not better. However, the magnitude of the differences is quite small, and may not be significant. The quadratic solutions used are not necessarily optimal, because of the error inherent in the numerical approximation code that is used. This problem will be faced by practitioners working on real applications as well, since exact solution techniques are currently not available for problems of this nature.

A natural question that may arise is why not use the loss function itself as a rule to determine the best approximation, i.e. an approximation that minimizes the expected loss. When exact loss functions are available for a problem domain, it is clearly desirable to use such functions. However, such functions are often hard to obtain. Even when such functions are available, they are often too complex for meaningful analysis. For instance, obtaining the best representation while using the loss function discussed in the example in section 5 would be computationally extremely hard for large problem instances.

6. An example using the logarithm rule

In section 4, we have shown that the logarithm rule can be used relatively easily to obtain probability parameters for approximate representations. Further, in section 5, we show that the probability parameters obtained when using the logarithm rule perform as well as, if not better than, the parameters obtained using the quadratic rule. In this section, we use a simple example to illustrate how the logarithm rule may be used to evaluate two different approximate structures. We consider the hypothetical problem of evaluating mutual funds that was introduced in section 2. A complete network representation over the five variables could be as shown in Figure 9a. In this example, we consider approximate representations that are of order 3.

Fig. 9. Actual and Approximate Networks for the Mutual Funds Example: (a) Complete Network; (b) Approximation I; (c) Approximation II.

The expert may provide two possible approximate network representations as shown in Figures 9b and 9c, respectively. In order to compare two approximate representations, we first need to obtain the joint distribution for the complete network, i.e. the actual distribution, either from an expert, or estimated from historical data if available. This is because any valid measure evaluates an approximate distribution by measuring its distance from the actual. If the logarithm rule is used, obtaining the joint distribution for a given topology is easy. This is because, when using the logarithm rule, the conditional probability distribution associated with each variable in the approximation is preserved. For example, given the structure in Approximation I, the best approximate distribution for the variable Volatility, conditioned on the two variables Yield and PE Ratio, is equal to the corresponding actual distribution (which is obtained either from the expert, or estimated from data). Similarly, the best approximate distribution for the variable Proj Growth, conditioned on the two variables Fund Type and PE Ratio, is equal to the actual distribution for the variable Proj Growth conditioned on the two variables Fund Type and PE Ratio, and so on. Hence, once the conditional distribution associated with each variable is obtained for a structure, the complete joint distribution can be computed by multiplying the component distributions in accordance with the product-form of that structure. The logarithm measure is then evaluated for the two approximate distributions with respect to the actual distribution (as discussed in section 4.1), and the network with the higher score is selected.
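A sketch of this selection procedure (ours, with hypothetical names) is given below. It assumes that the logarithm score of an approximate joint Pa relative to the actual joint P is the expectation sum over x of P(x) log Pa(x), the form under which the rule is equivalent to the maximum likelihood and I-Divergence criteria discussed earlier; each candidate joint is built as the product of its preserved conditionals, as described above.

```python
import math

def log_score(actual, approx):
    """Expected logarithm score of an approximate joint relative to the actual one;
    both are dicts keyed by the same outcome tuples."""
    return sum(p * math.log(approx[x]) for x, p in actual.items() if p > 0.0)

def select_structure(actual, candidates):
    """Return the label of the candidate approximation (e.g. 'Approximation I' or
    'Approximation II') with the higher logarithm score."""
    return max(candidates, key=lambda label: log_score(actual, candidates[label]))
```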

We note that approximate representations often miss dependencies that exist in the problem domain. For instance, the complete network shown in Figure 9a indicates that Volatility is directly dependent on Fund Type, which is not captured in either of the approximate representations considered. When Approximation I is used in practice, the variable Volatility will affect Fund Type; however, it will do so indirectly, through the variables Yield and PE Ratio. On the other hand, by ignoring this direct dependency (as well as some others shown in Figure 9a), belief updates can be made using the structure in Approximation I far more efficiently than when using the complete network. In summary, accuracy in belief representation is traded off for computational convenience.

7. Conclusions

In this paper we have examined different criteria that may be used to evaluate belief networks. Desirable properties of measures that may be used are discussed, and proper scoring rules are shown to be appropriate. There are many scoring rules that are proper; however, the logarithm and the quadratic rule have been shown to have some additional features that make them very attractive. These two rules were closely examined in the context of evaluating different belief networks and obtaining optimal representations. The logarithm rule was shown to have very good modeling features, in that it can be implemented relatively easily as compared to the quadratic rule. We performed extensive experimentation that compared the solutions obtained using the logarithm rule and the quadratic rule, respectively, using a decision theoretic approach. The solutions obtained when using the logarithm rule were found to be at least as good as the solutions obtained using the quadratic rule. This research clearly suggests that the logarithm rule is very appropriate for evaluating belief networks.

We have also discussed some commonly used measures that are equivalent to using the logarithm rule, viz. the I-Divergence measure and the maximum likelihood criterion. Another criterion that has been considered for evaluating alternate networks is the entropy function [16]. Assuming that the conditional probabilities associated with each component of the product-form distribution are as estimated from data, the authors of [16] develop an algorithm that obtains a network with minimum entropy. While there are some obvious similarities in the functional form of the entropy function and the logarithm rule, the best solutions obtained by these two approaches are not necessarily the same [30]. Using the logarithm rule (or any of the other equivalent criteria) is more appropriate than the minimum entropy approach suggested in [16], since no additional assumptions are required regarding the probability parameters associated with different components of the approximate distribution.

Appendix

Appendix I: Logarithm and quadratic solutions for the distribution in Table 3

Table 7
Logarithm and quadratic solutions for the distribution in Table 3

                   ACTUAL     LOG APPROX   QUAD APPROX
A B C D E          0.16909    0.15625      0.16890
A B C D ~E         0.02240    0.03125      0.02588
A B C ~D E         0.04058    0.03750      0.04128
A B C ~D ~E        0.01792    0.02500      0.02090
A B ~C D E         0.03210    0.03750      0.04039
A B ~C D ~E        0.01418    0.01250      0.01289
A B ~C ~D E        0.02996    0.03500      0.03772
A B ~C ~D ~E       0.07375    0.06500      0.06585
A ~B C D E         0.02240    0.03125      0.02588
A ~B C D ~E        0.01235    0.00625      0.00396
A ~B C ~D E        0.00538    0.00750      0.00632
A ~B C ~D ~E       0.00988    0.00500      0.00320
A ~B ~C D E        0.01418    0.01250      0.01289
A ~B ~C D ~E       0.00364    0.00417      0.00411
A ~B ~C ~D E       0.01324    0.01167      0.01204
A ~B ~C ~D ~E      0.01894    0.02167      0.02102
~A B C D E         0.04058    0.03750      0.04128
~A B C D ~E        0.00538    0.00750      0.00632
~A B C ~D E        0.00974    0.00900      0.01009
~A B C ~D ~E       0.00430    0.00600      0.00511
~A B ~C D E        0.02996    0.03500      0.03772
~A B ~C D ~E       0.01324    0.01167      0.01204
~A B ~C ~D E       0.02797    0.03267      0.03522
~A B ~C ~D ~E      0.06883    0.06067      0.06149
~A ~B C D E        0.01792    0.02500      0.02090
~A ~B C D ~E       0.00988    0.00500      0.00320
~A ~B C ~D E       0.00430    0.00600      0.00511
~A ~B C ~D ~E      0.00790    0.00400      0.00259
~A ~B ~C D E       0.07375    0.06500      0.06585
~A ~B ~C D ~E      0.01894    0.02167      0.02102
~A ~B ~C ~D E      0.06883    0.06067      0.06149
~A ~B ~C ~D ~E     0.09848    0.11267      0.10735


Appendix II: Expected loss when using approximate distributions

The loss parameter L for an approximate distribution is defined as:

$$ L = \frac{\text{Expected Cost Using Approximation} - \text{Expected Cost Using Actual}}{\text{Expected Cost Using Actual}} $$

The cost curves for the actual and approximate distributions are plotted as a function of X (which is a transformation of C2). The expected costs for the actual and approximate distributions are obtained by finding the area under these cost curves. For the actual representation, the cost of prediction, K(p), is:

K(p) = Min{(1 - p) × C2, p},

where p is the posterior probability that event A is true when using the actual representation, i.e.

$$ K(p) = \begin{cases} (1-p) \times C_2 & \text{when } C_2 \le \frac{p}{1-p} = Q \\ p & \text{when } C_2 > Q \end{cases} $$

For approximate representations, pa is the evaluated posterior probability that A is true. Thus:

$$ K(p_a) = \begin{cases} (1-p) \times C_2 & \text{when } C_2 \le \frac{p_a}{1-p_a} = Q_a \\ p & \text{when } C_2 > Q_a \end{cases} $$

X is a transformation of C2, defined as follows:

$$ X = \begin{cases} C_2 & \text{when } C_2 \le 1 \\ 2 - \frac{1}{C_2} & \text{when } C_2 > 1 \end{cases} $$

The cost K is expressed as a function of X, and the expected costs and corresponding losses are evaluated for the different cases considered below.

Case Ia: Q ≤ Qa ≤ 1

The cost curves are expressed as a function of C2 and X respectively, and shown in Figure 10. We have:

$$ K(p) = \begin{cases} (1-p) \times C_2 = (1-p) \times X & \text{when } X \le \frac{p}{1-p} = Q \\ p & \text{when } X > Q \end{cases} $$

The expected cost EK(p) is given by:

$$ EK(p) = \int_{[0,Q]} (1-p) \cdot X \cdot f(X)\,dX + \int_{[Q,2]} p \cdot f(X)\,dX $$

X is assumed to be uniformly distributed over [0, 2]. Therefore:

$$ EK(p) = \int_{[0,Q]} \frac{(1-p)\,X}{2}\,dX + \int_{[Q,2]} \frac{p}{2}\,dX = \frac{(1-p)}{4}\,Q^2 + \frac{p}{2}\,(2 - Q) $$

Fig. 10. Cost curves for Actual and Approximate Distributions when Q ≤ Qa ≤ 1.


Fig. 11. Cost curves for Actual and Approximate Distributions when Qa ≤ Q ≤ 1.

Similarly, the expected cost when using an approximate representation is:

$$ EK(p_a) = \int_{[0,Q_a]} (1-p) \cdot X \cdot f(X)\,dX + \int_{[Q_a,2]} p \cdot f(X)\,dX = \frac{(1-p)}{4}\,Q_a^2 + \frac{p}{2}\,(2 - Q_a) $$

Therefore, we have:

$$ L = \frac{EK(p_a) - EK(p)}{EK(p)} = \frac{\frac{(1-p)}{4}(Q_a^2 - Q^2) + \frac{p}{2}(Q - Q_a)}{\frac{(1-p)}{4}Q^2 + \frac{p}{2}(2-Q)} = \frac{(1-p)(Q_a^2 - Q^2) + 2p(Q - Q_a)}{(1-p)Q^2 + 2p(2-Q)} $$

Since p = Q(1 - p), the numerator equals (1 - p)(Qa^2 - Q^2) + 2(1 - p)Q^2 - 2(1 - p)Q Qa = (1 - p)(Qa - Q)^2, so that

$$ L = \frac{(1-p)(Q_a - Q)^2}{(1-p)Q^2 + 2p(2-Q)} = \frac{(Q_a - Q)^2}{Q^2 + 2Q(2-Q)} = \frac{(Q_a - Q)^2}{Q(4 - Q)} $$

Case Ib: Qa ≤ Q ≤ 1

The expressions for expected costs using the approximate and actual representations are identical to Case Ia, and therefore so is L. The cost curves are shown in Figure 11.

Case IIa: Q ≥ Qa ≥ 1

Since both Q and Qa are greater than or equal to 1, the cost expressed as a function of the transformation X is different from the cost expressed as a function of C2. The values of X that correspond to C2 = Q and C2 = Qa are denoted by Z and Za respectively (e.g. Z = 2 - 1/Q). The cost curves are shown in Figure 12. Here,

$$ K(p) = \begin{cases} (1-p) \times C_2 = (1-p) \times X & \text{when } X \le 1 \\ (1-p) \times C_2 = (1-p) \times \frac{1}{2-X} & \text{when } 1 < X \le Z \\ p & \text{when } X > Z \end{cases} $$

Fig. 12. Cost curves for Actual and Approximate Distributions when Q ≥ Qa ≥ 1.

The expected cost EK(p) is given by:

$$ EK(p) = \int_{[0,1]} (1-p) \cdot X \cdot f(X)\,dX + \int_{[1,Z]} \frac{1-p}{2-X} \cdot f(X)\,dX + \int_{[Z,2]} p \cdot f(X)\,dX $$

$$ = \frac{1-p}{4} + \frac{1-p}{2}\log\frac{1}{2-Z} + \frac{p}{2}(2-Z) = \frac{1-p}{4} + \frac{1-p}{2}\log Q + \frac{p}{2Q} $$

Similarly,

$$ EK(p_a) = \int_{[0,1]} (1-p) \cdot X \cdot f(X)\,dX + \int_{[1,Z_a]} \frac{1-p}{2-X} \cdot f(X)\,dX + \int_{[Z_a,2]} p \cdot f(X)\,dX = \frac{1-p}{4} + \frac{1-p}{2}\log Q_a + \frac{p}{2Q_a} $$

$$ L = \frac{EK(p_a) - EK(p)}{EK(p)} = \frac{\frac{1-p}{2}\log Q_a + \frac{p}{2Q_a} - \frac{1-p}{2}\log Q - \frac{p}{2Q}}{\frac{1-p}{4} + \frac{1-p}{2}\log Q + \frac{p}{2Q}} $$

Using p = Q(1 - p), so that p/(2Q) = (1 - p)/2, this simplifies to

$$ L = \frac{\log\left(\frac{Q_a}{Q}\right) + \frac{Q}{Q_a} - 1}{1.5 + \log Q} $$

Case IIb: Qa ≥ Q ≥ 1

The analysis for this case is identical to that for Case IIa, and so is the expression for L.

Case IIIa: Q < 1 and Qa > 1

The expected cost when using the actual distribution is identical to Case Ia, while the expected cost when using the approximate distribution is identical to that for Case IIa. Therefore,

$$ L = \frac{\frac{1-p}{4} + \frac{1-p}{2}\log Q_a + \frac{p}{2Q_a}}{\frac{1-p}{4}Q^2 + \frac{p}{2}(2-Q)} - 1 = \frac{1 + 2\log Q_a + 2\frac{Q}{Q_a}}{Q(4 - Q)} - 1 $$

Case IIIb: Q > 1 and Qa < 1

Here, the expected cost when using the actual distribution is identical to Case IIa, while the expected cost when using the approximate distribution is identical to that for Case Ia. Therefore,

$$ L = \frac{\frac{1-p}{4}Q_a^2 + \frac{p}{2}(2-Q_a)}{\frac{1-p}{4} + \frac{1-p}{2}\log Q + \frac{p}{2Q}} - 1 = \frac{Q_a^2 + Q(4 - 2Q_a)}{3 + 2\log Q} - 1 $$
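As a numerical cross-check of these closed forms (ours, not part of the paper), the Case IIa expression can be compared against a direct expectation of the cost curve over X uniform on [0, 2]; the helper names below are hypothetical.

```python
import math

def cost_at_x(x, p, threshold):
    """Cost K as a function of X when the decision threshold (in odds) is
    `threshold` but mistakes are priced with the actual posterior p."""
    c2 = x if x <= 1.0 else 1.0 / (2.0 - x)          # invert the X transformation
    return (1.0 - p) * c2 if c2 <= threshold else p

def expected_cost(p, threshold, n=200000):
    # Midpoint approximation of the expectation with X uniform on [0, 2].
    return sum(cost_at_x(2.0 * (i + 0.5) / n, p, threshold) for i in range(n)) / n

p, p_a = 0.766, 0.75
Q, Q_a = p / (1.0 - p), p_a / (1.0 - p_a)
L_numeric = expected_cost(p, Q_a) / expected_cost(p, Q) - 1.0
L_closed = (math.log(Q_a / Q) + Q / Q_a - 1.0) / (1.5 + math.log(Q))
print(L_numeric, L_closed)    # both are close to 0.00146
```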

References

[1] A.M. Agogino and A. Rege, "IDES: Influence Diagram Based Expert System," Mathematical Modelling, Volume 8, pp. 227-233, 1987.


[2] I.A. Beinlich, H.J. Suermondt, R.M. Chavez and G.F. Cooper, "The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks," Proceedings of the Conference on Artificial Intelligence in Medical Care, pp. 247-256, 1989.

[3] G.W. Brier, "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, Vol- ume 78, no. 1, pp. 1-3, January 1958.

[4] R.M. Chavez and G.F. Cooper, "An Empirical Evalua- tion of a Randomized Algorithm for Probabilistic Infer- ence," in Uncertainty in Artificial Intelligence 5, M. Henrion, R.D. Shachter, L. Kanal and J.F. Lemmer (eds.), North Holland, Amsterdam, pp. 191-208, 1990.

[5] P. Cheeseman, "A Method of Computing Generalized Bayesian Probability Values for Expert Systems," Pro- ceedings of the 8th International Joint Conference on Artificial Intelligence, vol. 1, Karlsruhe, West Germany, 1983.

[6] C.K. Chow and C.N. Liu, "Approximating Discrete Prob- ability Distributions with Dependence Trees," IEEE Transactions on Information Theory, vol. IT-14, no. 3, pp. 462-467, May 1968.

[7] G.F. Cooper, "NESTOR: A Computer-Based Medical Diagnostic Aid that Integrates Causal and Probabilistic Knowledge," Ph.D. Dissertation, Stanford University, Stanford, CA, 1984.

[8] G.F. Cooper, "The Computational Complexity of Proba- bilistic Inference Using Bayesian Belief Networks," Arti- ficial Intelligence, vol. 42, pp. 393-405, 1990.

[9] G.F. Cooper and E. Herskovitz, "A Bayesian Method for Constructing Bayesian Belief Networks from Databases," Proceedings from the 7th Annual Conference on Uncer- tainty in Artificial Intelligence, pp. 86-94, 1991.

[10] R.O. Duda, P.E. Hart, K. Konolige and R. Reboh, "A Computer-Based Consultant for Mineral Exploration," Final Report, SRI Projects 6415, SRI International, Menlo Park, California, September 1979.

[11] R. Fung and K.C. Chang, "Weighing and Integrating Evidence for Stochastic Simulation in Bayesian Networks," in Uncertainty in Artificial Intelligence 5, M. Henrion, R.D. Shachter, L. Kanal and J.F. Lemmer (eds.), North Holland, Amsterdam, pp. 209-220, 1990.

[12] D. Geiger, "An Entropy-Based Learning Algorithm of Bayesian Conditional Trees," Uncertainty in Artificial Intelligence: Proceedings of the Eighth Conference, D. Dubois, M.P. Wellman, B. D'Ambrosio and P. Smets (eds.), pp. 92-97, 1992.

[13] I.J. Good, "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, Vol. 14, pp. 107-114, 1952.

[14] D.E. Heckerman, E.J. Horvitz and B.N. Nathwani, "Up- date on the Pathfinder Project," Proceedings of the Sym- posium on Computer Applications in Medical Care, pp. 203-207, 1989.

[15] M. Henrion, "Propagating Uncertainty in Bayesian Net- works by Probabilistic Logic Sampling," in Uncertainty in Artificial Intelligence 2, J.F. Lemmer and L. Kanal (eds.), North Holland, Amsterdam, pp. 149-164, 1988.

[16] E. Herskovitz and G.F. Cooper, "Kutato: An Entropy- Driven System for Construction of Probabilistic Expert Systems from Databases," Uncertainty in Artificial Intel- ligence 6, P.P. Bonnisone, M. Henrion, L.N. Kanal and J.F. Lemmer (eds.), North Holland, Amsterdam, pp. 117-125, 1991.

[17] D. Kazakos and T. Cotsidas, "A Decision Theory Approach to the Approximation of Discrete Probability Densities," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-2, no. 1, pp. 61-67, January 1980.

[18] S. Kullback, Information Theory and Statistics, Wiley: New York, 1959.

[19] S.L. Lauritzen and D.J. Spiegelhalter, "Local Computa- tion with Probabilities in Graphical Structures and Their Applications to Expert Systems," Journal of the Royal Statistical Society B, vol. 50, no. 2, pp. 157-224, 1988.

[20] J. Marschak, "Remarks on the Economics of Informa- tion," In The Contributions to Scientific Research in Management, Los Angeles: University of California, 1959.

[21] J. McCarthy, "Measures of the Value of Information," Proceedings of the National Academy of Sciences, pp. 654-655, 1956.

[22] R.E. Neapolitan, Probabilistic Reasoning in Expert Sys- tems: Theory and Algorithms, John Wiley and Sons, Inc., NY, 1990.

[23] J. Pearl, "Reverend Bayes on Inference Engines: A Dis- tributed Hierarchical Approach," Proceedings of the Na- tional Conference in AI, Pittsburg, pp. 133-36, 1982.

[24] J. Pearl, "Fusion, Propagation, and Structuring in Belief Networks," Artificial Intelligence, vol. 29, pp. 241-288, 1986.

[25] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufman, San Mateo, California, 1988.

[26] J.R. Quinlan, "Induction of Decision Trees," Machine Learning, Vol. 1 (1), pp. 81-106, 1986.

[27] G. Rebane, and J. Pearl, "The Recovery of Causal Poly- trees from Statistical Data," Proceedings of the 3rd Workshop on Uncertainty in AI, Seattle, pp. 222-228, 1987.

[28] T.B. Roby, "Belief States and the Uses of Evidence," Behavioral Science, vol. 10, pp. 255-270, 1965.

[29] S. Sarkar, "Using Tree Structures to Approximate Belief Networks in Expert Systems," Proceedings of the Second Annual Workshop on Information Technologies and Systems, V.C. Storey and A.B. Whinston (eds.), pp. 235-244, 1992.

[30] S. Sarkar and I. Murthy, "Some Theoretical Results in Obtaining Approximate Representations for Belief Net- works," Working Paper, Louisiana State University, Ba- ton Rouge, 1993.

[31] L.J. Savage, "Elicitation of Personal Probabilities and Expectations," Journal of the American Statistical Asso- ciation, vol. 66, no. 336, December 1971.

[32] K. Schittkowski, "Nonlinear Programming Codes," Lec- ture Notes in Economics and Mathematical Systems, 183, Springer-Verlag, Berlin, Germany, 1980.


[33] K. Schittkowski, "On the Convergence of a Sequential Quadratic Programming Method with an Augmented Lagrangian Line Search Function," Mathematik Operationsforschung und Statistik, Serie Optimization, 14, pp. 197-216, 1983.

[34] M.J. Shaw, "Applying Inductive Learning to Enhance Knowledge-Based Expert Systems," Decision Support Systems, vol. 3, pp. 319-332, 1987.

[35] R.D. Shachter and M. Peot, "Simulation Approaches to General Probabilistic Inference on Belief Networks," in Uncertainty in Artificial Intelligence 5, M. Henrion, R.D. Shachter, L. Kanal and J.F. Lemmer (eds.), North Holland, Amsterdam, pp. 221-231, 1990.

[36] E.H. Shuford, A. Albert, and H.E. Massengill, "Admissi- ble Probability Measurement Procedures," Psychome- trika, 31, pp. 125-145, 1966.

[37] P. Smyth and R.M. Goodman, "An Information Theo- retic Approach to Rule Induction from Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 4, no. 4, pp. 301-316, 1992.

[38] P. Spirtes, C. Glymour and R. Scheines, Causation, Prediction, and Search, Springer-Verlag Lecture Notes in Statistics, New York, 1993.

[39] S. Srinivas, S. Russell and A. Agogino, "Automated Construction of Sparse Bayesian Networks from Unstructured Probabilistic Models and Domain Information," Uncertainty in Artificial Intelligence 5, North-Holland, pp. 295-308, 1990.

[40] C.-A.S. Stael von Holstein, "Assessment and Evaluation of Subjective Probability Distributions," The Economic Research Institute at the Stockholm School of Eco- nomics, Stockholm, 1970.

[41] J. Stoer, "Principles of Sequential Quadratic Program- ming Methods for Solving Nonlinear Programs," Compu- tational Mathematical Programming. Edited by K. Schit- tkowski, NATO ASI Series, 15, Springer-Verlag, Berlin, Germany, 1985.

[42] H.J. Suermondt and M.D. Amylon, "Probabilistic Prediction of the Outcome of Bone-Marrow Transplantation," Proceedings of the Symposium on Computer Applications in Medical Care, pp. 208-212, 1989.

[43] R. Uthurusamy, U.M. Fayyad and S. Spangler, "Learning Useful Rules from Inconclusive Data," Knowledge Dis- covery in Databases, G. Piatetsky-Shapiro and W.J. Frawley (eds.), pp. 141-157, 1991.

[44] A.K.C. Wong and C.C. Wang, "Classification of Discrete Biomedical Data with Error Probability Minimax," Proceedings of the Seventh International Conference of the Cybernetics Society, Washington, DC, pp. 19-21, 1977.

[45] S.K.M. Wong and F.C.S. Poon, "Comments on Approxi- mating Discrete Probability Distributions with Depen- dence Trees," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 3, pp. 333-335, March 1989.

Sumit Sarkar is currently Assistant Professor of Management Informa- tion Systems at the College of Busi- ness at Louisiana State University. He received his Ph.D. in Computers and Information Systems from the Uni- versity of Rochester. His current re- search interests are in the areas of expert systems, databases and the economics of information systems. He is a member of ACM and TIMS.

Ishwar Murthy is Associate Professor in the Department of Quantitative Business Analysis at Louisiana State University, Baton Rouge. He received his Ph.D. degree in Management Science from Texas A & M University. His current research interests are in Network Optimization, Multiobjective Optimization and Mathematical Programming Applications in Telecommunications.