
An Algorithmic Framework For Density Estimation Based

Evolutionary Algorithms

Peter A.N. Bosman Dirk [email protected] [email protected]

Department of Computer Science, Utrecht University

P.O. Box 80.089, 3508 TB Utrecht, The Netherlands

December 1999

Abstract

The direct application of statistics to stochastic optimization in evolutionary computation has become increasingly important and prevalent over the last few years. With the introduction of the notion of the Estimation of Distribution Algorithm (EDA), a new line of research has been named. The application area so far has mostly been the same as for the classic genetic algorithms, namely binary vector encoded problems. The most important aspect of the new algorithms is the part where probability densities are estimated. In probability theory, a distinction is made between discrete and continuous distributions and methods. Using the rationale for density estimation based evolutionary algorithms, we present an algorithmic framework for them, named IDEA. This allows us to define such algorithms for vectors of both continuous and discrete random variables, combining techniques from existing EDAs as well as density estimation theory. The emphasis is on techniques for vectors of continuous random variables, for which we present new algorithms in the field of density estimation based evolutionary algorithms, using two different density estimation models.

1 Introduction

Optimization problems are formulated using variables that can assume values from some domain. The allowed combinations of assignments of values from that domain to those variables constitute the search space of the optimization problem. Each such instantiation of every variable is called a feasible solution. Each such solution can be graded according to how good a solution it is for the problem at hand. The objective with respect to the optimization problem can then be defined as finding the optimal feasible solution. The definition of optimality is subject to the problem definition (e.g. minimization or maximization).

The class of optimization problems can be seen as a superclass: every optimization problem resides in some class that is a subclass of the general optimization problem class. An example of such a class of problems are the combinatorial optimization problems (COPs). These problems are in general defined so as to find the minimum subset S ⊆ N, given a set of n weighted elements with |N| = n, such that S ∈ F ⊆ P(N), or in other words such that S lies within the set of feasible solutions F, which is a subset of the powerset of N that contains all possible subsets of N. As the size of the set F usually grows exponentially as a function of n, the search space tends to be very large. Not surprisingly, a lot of well known combinatorial optimization problems are in the class of NP–complete problems. This means that for such a problem we cannot expect that there will ever be a polynomial time algorithm that finds the optimal feasible solution.

Different optimization approaches focus on a subset of optimization problem classes. An algorithm that is often applied to solving NP–complete problems is the branch and bound method (see for instance [27]). Branch and bound is in general an exponential time algorithm in the case of exponentially large search spaces; it traverses the search space systematically to find the optimal solution.


The way that feasible solutions are graded, along with which solutions are to be seen as feasible, codes the structure of the search space. All of this structural information is formulated using the problem variables. Any reasonable optimization problem has structure. Consider for instance the problems in the COP class. The commonly known problems in this class, such as the TSP, clearly have some structure. If we exploit this structure, we can perform optimization faster and obtain better results.

In black box optimization, we have no prior knowledge of the problem structure. The only thing available is a mechanism to grade a feasible solution. Assuming that the underlying problem within the black box has structure, attempting to find out and subsequently use this structure is preferable to traversing the search space by random search.

Searching for and using such structure in optimization has been an active line of research within the field of genetic and evolutionary computation. One of the latest developments in this field is to build and use probabilistic models over stochastic random variables, where one random variable is introduced for every coding variable within the optimization problem. Most of the theory and practice within this line of research has been developed for the class of problems that are encoded using only binary variables. The reason for this is that the new methods have been developed in the field of genetic and evolutionary computation, wherein the genetic algorithm (GA) plays one of the most important historical roles.

Within the standard genetic algorithm [12, 18], strings of binary variables are manipulated using operators inspired by biological evolution. Parts of information are exchanged between solutions as a result of recombination. A selection of the solutions that are maintained in a population at any point in time is appointed to undergo this recombination operation. If the structure of the encoded problem is such that it contains sets of interacting variables (building blocks), it is known that the population size required to solve the problem with a standard genetic algorithm with uniform crossover grows exponentially with the problem size [33]. The uniform crossover operator combines two solutions by swapping the values for every variable with a certain probability.

In the compact GA (cGA) by Harik, Lobo and Goldberg [16], the dynamics of the standard GA are mimicked using only a single probability vector that codes whether a variable should be set to 1. This probability vector as a result resembles a univariate probability distribution over the binary variables. Using such a univariate probability distribution in different ways has, in addition to the cGA, resulted in the PBIL approach by Baluja and Caruana [2] and the UMDA by Muhlenbein and Paaß [23]. These approaches share the problem of not being able to efficiently optimize problems that contain sets of interacting variables.

To take into account interacting variables, Holland [18] already recognized that it would be beneficial to the GA if it were to exploit this so-called linkage information. The inversion operator was proposed to accomplish this, but it has unfortunately turned out to be too slow with respect to the speed of convergence of the GA. Through other approaches, such as the mGA [14], the fmGA [13], the GEMGA [19, 4], the LLGA [21] and the BBF–GA [34], the simple GA was extended to be able to process building blocks. Other approaches have come to focus more on the probability density view of processing the problem variables. An overview of the work in this field was given by Pelikan, Goldberg and Lobo [25]. Our work in this paper contributes along those lines of research.

Staying within the field of problems that are encoded using binary variables, approaches were presented that allow for pairwise interactions between variables in the probability distribution. The reason for this extension to higher order statistics is the equivalent of the quest for the exploitation of linkage in GAs. Because the variables are processed independently when using a univariate distribution, problems that are defined using higher order building blocks cannot be solved efficiently. In these approaches, a variable may be conditionally dependent on exactly one other variable. Allowing different structures for pairwise interactions, the MIMIC algorithm by De Bonet, Isbell and Viola [6], an optimal dependency tree approach by Baluja and Davies [3] and the BMDA algorithm by Pelikan and Muhlenbein [26] were introduced. For all of the methods in this introduction, we leave the details to section 3.

More recently, approaches allowing multivariate interactions were presented. In these approaches, a variable may be conditionally dependent on sets of variables. These methods include the BOA by Pelikan, Goldberg and Cantu–Paz [24], the FDA by Muhlenbein, Mahnig and Rodriguez [22] and the ECGA by Harik [15]. Even though finding the correct structure to capture the sets of interacting variables is very difficult, covering multivariate interactions is the key to solving higher order building block problems and exploiting problem structure [7].

Muhlenbein, Mahnig and Rodriguez [22] first presented a general framework for this type of algorithm, named EDA (Estimation of Distribution Algorithm). This framework is a general optimization algorithm that estimates a distribution every iteration, based upon a collection of solutions, and subsequently samples new solutions from the estimated distribution. In this paper, we make certain steps from the EDA more explicit within a new framework. By doing so, we allow algorithms within the new framework to follow a certain rationale for density estimation based evolutionary algorithms.

Next to presenting this new algorithmic framework, we also present three instance classes for it. One of these instances is the class of empiric probabilities for discrete variables, which follows the approaches mentioned above. The estimation of distributions is however not limited to binary or discrete spaces and can be extended to real spaces. Therefore, we also present two new instance classes for problems defined on real continuous variables.

The simple GA and the approaches discussed so far only regard binary or discrete encoded problems. In the case of real valued problems, evolution strategies (ES) [1] have been shown to be very effective. In the ES algorithm, real variables are treated as real values. Real values can be coded in binary strings, but when such a coding is processed by the GA, a few problems arise. The most important of these is that if two real variables are dependent on each other, all of the binary variables that code these real variables have to be dependent on each other in some way. This gives additional overhead because the bits that code a single real variable also have to be seen as having some sort of dependence amongst each other. In the ES, a new type of mutation is applied that adapts the value of a real variable along a certain direction according to a normal distribution, using correlations with respect to other variables if desired.

There is no direct use of a probability distribution or density estimation in the ES algorithm. However, other approaches have made a first step in this direction. Given the EDA framework, we could say that the work by Servet, Trave–Massuyes and Stern in a first approach [30], the algorithm PBILC by Sebag and Ducoulombier [29], which is a real valued version of PBIL using normal distributions, and the more flexible approach using a mixture model of normal distributions by Gallagher, Fream and Downs [11], are first attempts to expand the building and use of probabilistic models for optimization to real continuous spaces. In all of these approaches however, the real variables are processed independently of each other. In other words, the probability distribution structure that is used in these algorithms is a univariate one. In this paper, we present algorithms that use real density models as well as allow for the modelling of multivariate interactions between variables. All of this is done within the new framework that we present, but can be seen as rather independent of it.

Our goal in this paper is to apply the search for linkage in terms of probabilistic models to continuous spaces. We therefore do not introduce any new way of processing data or linkage information, but show how we can expand the existing techniques to be used in the continuous case. This means that we focus mainly on density estimation. We require density models to use for modelling distributions of continuous random variables. The problem of estimating densities is well studied. An overview of fundamental issues in density estimation has for instance been given by Scott [28] and, in a more application oriented manner, by Bishop [5]. Based on such work, we can employ commonly used density models that have convenient properties in expanding from discrete to continuous density estimation based evolutionary algorithms.

The remainder of this paper is organized as follows. In section 2 we present the general framework for density estimation based evolutionary algorithms. Next, we go over previous work in section 3 and observe what is required to establish those algorithms. We need to know the demands on the density estimation model to be able to use probability theory for defining density estimation based evolutionary algorithms in general. In section 4, we redefine the algorithms for the case of discrete random variables to fit within the new framework. Subsequently, we use the requirements information in section 5 to define new algorithms using two different density estimation models for continuous random variables. The two density estimation models we use in this paper are the normal distribution and the histogram distribution. Notes on the running times of the algorithms and topics for further research are given in section 6. We finish by presenting our conclusions in section 7. We present all of the algorithms in somewhat detailed pseudo–code. Conventions that are used in this paper for writing pseudo–code are given in appendix F. In general, involved mathematical derivations and definitions can be found in the appendices (A through F). Before moving on to present the IDEA algorithmic framework, we first introduce some notation we employ in writing probability theory.

1.1 Probability theory notation

Discrete random variables are variables that can take on only a finite or at most a countablyinfinite number of values. We shall denote such variables by Xi. Using this notation, X or Xj

denotes a vector of random variables Xi. The domain for each Xi is mostly taken to be the sameas well as finite, so we shall denote the domain by DX = {dX

0 , dX1 , . . . , dX

nd−1}. The size of thedomain is thus denoted by nd = |DX |. In a special case, we have nd = 2, d0 = 0 and d1 = 1so that DX = B, the binary domain that was used in most EDAs so far. Now we can define themultivariate joint probability mass function for n discrete random variables Xj0 ,Xj1 , . . . Xjn−1 :

pj0,j1,...,jn−1(dXk0

, dXk1

, . . . , dXkn−1

) = P (Xj0 = dXk0

,Xj1 = dXk1

, . . . , Xjn−1 = dXkn−1

) (1)

such thatnd−1∑k0=0

nd−1∑k1=0

. . .

nd−1∑kn−1=0

pj0,j1,...,jn−1(dXk0

, dXk1

, . . . , dXkn−1

) = 1

where pj0,j1,...,jn−1(·) ≥ 0

For n = 1 this becomes the well known univariate probability mass function pi(dXk ) = P (Xi =

dXk ) so that the sum over all domain variables of the probability function equals 1. As the set DX

is taken to be finite, we can straightforwardly make use of the mapping i ↔ dXi . By storing this

mapping1, the numbers DX = {0, 1, . . . , nd − 1} ⊆ N can be used. This clarifies and simplifiescertain notations and implementations. Therefore, when referring to discrete random variablesin the remainder of the paper, we shall use DX instead of DX . In the one–dimensional case forinstance, this means that we write pj(k) = P (Xj = k) instead of pj(dX

k ) = P (Xj = dXk ). As

a general notation, we usually write P (Xi) for pi, making P (Xi) a function. This notation isstraightforwardly expanded to the multivariate case.
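As a small illustration of this mapping (a Python sketch with made-up domain values of our own choosing), the array-based storage mentioned above could look as follows:

# Hypothetical discrete domain with n_d = 3 values; the names are ours.
domain = ['low', 'medium', 'high']                # index i -> domain value d_i
index_of = {d: i for i, d in enumerate(domain)}   # domain value d_i -> index i

# With this mapping stored, algorithms can work on the integers 0..n_d-1
# and translate back to the actual domain values only when needed.
k = index_of['medium']   # k == 1
value = domain[k]        # value == 'medium'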

Continuous random variables are variables that can take on a continuum of values. We shall denote such variables by Y_i and let Y or Y^j once again be a vector of variables Y_i. The domain is mostly a subset of ℝ, so we shall denote the domain by 𝒟_Y = [d_0^Y, d_1^Y] with d_0^Y < d_1^Y and (d_0^Y, d_1^Y) ∈ ℝ². We can now define the multivariate joint density function f(y_0, y_1, ..., y_{n−1}) for n continuous random variables Y_{j_0}, Y_{j_1}, ..., Y_{j_{n−1}}:

$$\int_{a_0}^{b_0} \int_{a_1}^{b_1} \cdots \int_{a_{n-1}}^{b_{n-1}} f_{j_0,j_1,\ldots,j_{n-1}}(y_0, y_1, \ldots, y_{n-1})\, dy_0\, dy_1 \cdots dy_{n-1} = P((Y_{j_0}, Y_{j_1}, \ldots, Y_{j_{n-1}}) \in A) \tag{2}$$

such that

$$\int_{d_0^Y}^{d_1^Y} \int_{d_0^Y}^{d_1^Y} \cdots \int_{d_0^Y}^{d_1^Y} f_{j_0,j_1,\ldots,j_{n-1}}(y_0, y_1, \ldots, y_{n-1})\, dy_0\, dy_1 \cdots dy_{n-1} = 1$$

where f(·) ≥ 0 and A = [a_0, b_0] × [a_1, b_1] × ... × [a_{n−1}, b_{n−1}] ⊆ (𝒟_Y)^n.

For n = 1 this becomes the well known univariate density function ∫_a^b f_i(y) dy = P(a < Y_i < b), so that the integral of the positive density function over the domain equals 1. As with the discrete variables, we would like to employ a notation P(Y_i) so that, given a point a, P(Y_i)(a) gives the probability that Y_i is set to a, but we cannot do so because P(Y_i = a) = ∫_a^a f_i(y) dy = 0. Since however we do computations on f(y) in the continuous case, we introduce the notation P(Y_i) for f_i, making P(Y_i) a (density) function. This notation is again straightforwardly expanded to the multivariate case.

We may now note that when we discuss random variables in general, regardless of their domain type, we specify them as Z_i and write Z for a vector of variables Z_i. We can now give the definition for the conditional probability of one random variable Z_i being conditionally dependent on n other variables Z_{j_0}, Z_{j_1}, ..., Z_{j_{n−1}}. This is the most fundamental probability expression in the use of density estimation based evolutionary algorithms:

$$P(Z_i \mid Z_{j_0}, Z_{j_1}, \ldots, Z_{j_{n-1}}) = \frac{P(Z_i, Z_{j_0}, Z_{j_1}, \ldots, Z_{j_{n-1}})}{P(Z_{j_0}, Z_{j_1}, \ldots, Z_{j_{n-1}})} \tag{3}$$
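As a small worked example of equation 3, with illustrative probabilities of our own choosing: for two binary variables with P(Z_0 = 1, Z_1 = 1) = 0.3 and P(Z_1 = 1) = 0.5,

$$P(Z_0 = 1 \mid Z_1 = 1) = \frac{P(Z_0 = 1, Z_1 = 1)}{P(Z_1 = 1)} = \frac{0.3}{0.5} = 0.6$$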

2 Optimization using IDEAs

In this section we present the algorithmic framework for evolutionary optimization algorithms using density estimation techniques. We call this framework IDEA. In section 2.1 we specify the IDEA and show its connection with the EDA. In section 2.2, we introduce additional notation and place some remarks on the fundamental part of the IDEA, namely the probability distribution.

2.1 The IDEA framework

The rationale for using distribution estimation in optimization was clearly stated by De Bonet, Isbell and Viola [6]. Assume we have a problem for which the function to be optimized is defined as C(Z) with Z = (Z_0, Z_1, ..., Z_{l−1}), which without loss of generality we seek to minimize. If we have no information on C(Z) in advance, we might as well assume a uniform distribution over the input P(Z). Now denote by P^θ(Z) a probability function that has a uniform distribution over all vectors Z with C(Z) ≤ θ and a probability of 0 over all other vectors:

$$P^{\theta}(Z) = \begin{cases} \frac{1}{|\{Z' \mid C(Z') \leq \theta\}|} & \text{if } C(Z) \leq \theta \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

Sampling from distribution P^θ(Z) would thus give only samples with a function value of at most θ. This means that if we would somehow gain information that allows us to find P^{θ*}(Z) where θ* = min_Z{C(Z)}, a sample drawn from the probability distribution P^{θ*}(Z) would give us an optimal solution vector Z*.

Sampling and estimating distributions is well studied within probability theory. In terms of an iterated algorithm, given a collection of n vectors Z_i (i ∈ {0, 1, ..., n − 1}) at iteration t, denote the largest function value of the best ⌊τn⌋ vectors with τ ∈ [1/n, 1] by θ_t. We then pose a density estimation problem, where given the ⌊τn⌋ vectors, we want to find, or best approximate, the density P^{θ_t}(Z). To this end, we use density estimation techniques from probability theory. In section 3 we will investigate which techniques have been employed so far in previously introduced algorithms for discrete (binary) spaces. Once we have this approximate distribution, we sample from it to find more samples that will hopefully all have a function value lower than θ_t, and select again from the available samples to approximate in the next step P^{θ_{t+1}}(Z) with θ_{t+1} ≤ θ_t. Making these steps explicit, we define the general Iterated Density Estimation Evolutionary Algorithm (IDEA) as follows:


IDEA(n, τ, m, sel(), rep(), ter())
1    Generate a collection of n random vectors: {Z_i | i ∈ {0, ..., n − 1}}
2    Evaluate the function values of the vectors in the collection: C(Z_i) (i ∈ {0, 1, ..., n − 1})
3    Initialize the iteration counter: t ← 0
4    Select ⌊τn⌋ vectors: {Z_i^{(S)} | i ∈ {0, ..., ⌊τn⌋ − 1}} ← sel()   (τ ∈ [1/n, 1])
5    Set θ_t to the worst function value among the selected vectors.
6    Determine the probability distribution P^{θ_t}(Z).
7    Generate m new vectors O by sampling from P^{θ_t}(Z).
8    Incorporate the new vectors O in the collection: rep()
9    Evaluate the new vectors in the collection (⊆ O).
10   Update the iteration counter: t ← t + 1
11   If the termination condition has not been satisfied, go to step 4: if ¬ter() then goto 4
12   Denote the number of required iterations by t_end: t_end ← t

If we refer to the iterations in the IDEA as generations, it becomes clear that the IDEA is a true evolutionary algorithm in the sense that a population of individuals is used. From this population, individuals are selected to generate new offspring with. Using these offspring along with the parent individuals and the current population, a new population is constructed.

In the IDEA algorithm, the distribution P^{θ_t}(Z) determined in step 6 stands for an approximation to the true distribution P^{θ_t}(Z). An approximation is required because the determined distribution is based upon samples. This means that, even though depending on the density model and search algorithm used we might achieve equality with the true distribution, in general this is not the case.

If we set m to (n − ⌊τn⌋), sel() to selection by taking the best ⌊τn⌋ vectors, and rep() to replacing the worst (n − ⌊τn⌋) vectors in the collection by the newly sampled vectors, we can state that θ_{k+1} = θ_k − ε with ε ≥ 0. This assures that the search for θ* is conveyed through a monotonically decreasing series θ_0 ≥ θ_1 ≥ ... ≥ θ_{t_end}. Hopefully we end up having θ_{t_end} = θ*. We call an IDEA with m, sel() and rep() so chosen, a monotonic IDEA.

If we set m in the IDEA to n and take the m new vectors as the new collection, we obtain the EDA algorithm that was originally presented by Muhlenbein, Mahnig and Rodriguez [22] as a general estimation of distribution algorithm. In the EDA however, the probability plateau θ_t cannot be enforced according to the rationale we just presented. Therefore we introduce the IDEA and note that the EDA is an instance of this new framework.
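As an illustration, a minimal Python sketch of a monotonic IDEA main loop under these settings (truncation selection, replacement of the worst vectors); the density model is left abstract as a pair of fit/sample callbacks, and all names, as well as the toy univariate-normal model in the usage example, are ours rather than the paper's:

import math
import random

def monotonic_idea(cost, random_vector, fit, sample, n=100, tau=0.3, t_max=50):
    """Monotonic IDEA sketch: m = n - floor(tau*n), truncation selection,
    replace-worst, so the threshold theta_t never increases."""
    pop = [random_vector() for _ in range(n)]
    pop.sort(key=cost)
    k = int(math.floor(tau * n))             # number of selected vectors
    for t in range(t_max):
        selected = pop[:k]                   # best floor(tau*n) vectors
        theta_t = cost(selected[-1])         # worst value among the selected
        model = fit(selected)                # estimate the distribution
        offspring = [sample(model) for _ in range(n - k)]
        pop = selected + offspring           # replace the worst n - k vectors
        pop.sort(key=cost)
    return pop[0], theta_t

# toy usage: univariate normal model on a 2D sphere function (illustration only)
def fit(sel):
    cols = list(zip(*sel))
    means = [sum(c) / len(c) for c in cols]
    stds = [max(1e-9, (sum((x - m) ** 2 for x in c) / len(c)) ** 0.5)
            for c, m in zip(cols, means)]
    return means, stds

def sample(model):
    means, stds = model
    return [random.gauss(m, s) for m, s in zip(means, stds)]

best, theta = monotonic_idea(lambda y: sum(v * v for v in y),
                             lambda: [random.uniform(-10, 10) for _ in range(2)],
                             fit, sample)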

In figure 1 we give a graphical example of the general idea according to which IDEAs go to work. We have a continuous problem with 𝒟_Y = [−10, 10] and l = 2. The simple function C(Y) we seek to minimize has a uniquely defined minimum at (Y*_1, Y*_2) = (0, 0). The C(Y) function plots demonstrate the idea of having the threshold θ_t for the probability distribution P^{θ_t}(Y). Over the iterations (with t_end = 30) we observe that θ_t goes to the minimum value of C(Y), leading to convergence to the optimal point. The contour plots show the convergence of the distribution of the vectors in the collection. The vectors were determined at the end of each iteration.

New IDEAs have been introduced under different names. The most important discriminant among these algorithms has been the way in which step 6 of the IDEA is performed. The reason for this becomes apparent when looking at the case of discrete random variables. When we attempt to best estimate the complete joint probability, we are required to count, from the selected ⌊τn⌋ vectors, the occurrences of every combination of assignments of every element from D_X to every X_i. This leads to (n_d)^l combinations to check for each selected vector, which is an exponential amount. The number of vectors we would require to justify this estimation would also be enormous. We would be better off trying every possible vector X. The other extreme would be to regard every variable X_i independently. Such has been the approach of algorithms such as PBIL [2] and UMDA [26]. However, just as the notion of linkage was first acknowledged by Holland to be of importance [18], it has been shown by Bosman and Thierens [7] that IDEA algorithms likewise require finding higher order density structures as an approximation to P^{θ_t}(X) in order to find solutions to higher order building block problems. This is why quick and efficient density estimation is of the greatest importance to IDEA algorithms and why multiple different algorithms have been introduced with different types of distribution structures. In section 3 we shall go over these algorithms and note what is required to rationally find such a distribution by observing what has been done so far. First however, we introduce some notation regarding these probability distributions and introduce the general notion of how these distributions are used in IDEAs.

[Figure 1 consists of surface plots of C(Y) showing the threshold θ_t, together with contour and scatter plots of the vectors in the collection, for iterations 0, 1, 2, 5, 10 and 30 over the domain y_0, y_1 ∈ [−10, 10]; the corresponding θ_t values are approximately 0.255, 0.208, 0.129, 0.047, 0.005 and −0.00002.]

Figure 1: An example run of a monotonic IDEA on a simple test function with a fixed probability distribution structure such that P(Y) = P(Y_1|Y_2)P(Y_2), τ = 0.5, n = 30 and m = 15.

[Figure 2 shows two nodes, 0 and 1, with an arc in each direction.]

Figure 2: An unfeasible PDS graph on two variables: P(Z_0|Z_1)P(Z_1|Z_0)

2.2 Probability distribution issues

After updating the probability density structure, we can write this structure in the following general form:

$$P_{\pi}(Z) = \prod_{i=0}^{l-1} P(Z_i \mid Z_{\pi(i)_0}, Z_{\pi(i)_1}, \ldots, Z_{\pi(i)_{|\pi(i)|-1}}) \tag{5}$$

where Z_i represents the i–th random variable in vector Z and π(i) is a function that returns a vector π(i) = (π(i)_0, π(i)_1, ..., π(i)_{|π(i)|−1}) of indices denoting the variables that Z_i is conditionally dependent on.

We call the graph that results when drawing the variables Z_i as nodes and drawing an arc from node i to node j when Z_j is conditionally dependent on Z_i, the Probability Density Structure (PDS) graph. The PDS graph needs to be acyclic. This can be seen from the simple example in the discrete case where we might want to model P(X_0|X_1)P(X_1|X_0) as shown in figure 2, with D_X = 𝔹 = {0, 1}. This is not a valid probability distribution over X. From an IDEA standpoint, even though we will be able to determine the two required probabilities by counting, we will not be able to generate new samples according to this distribution, because we need to set X_1 to 0 or 1 before we can sample X_0, but we also need to set X_0 to 0 or 1 before we can sample X_1. To see that an acyclic graph is sufficient, we note that because P(Z_{j_0}|Z_{j_1}, Z_{j_2}, ..., Z_{j_{n−1}}) · P(Z_{j_1}, Z_{j_2}, ..., Z_{j_{n−1}}) = P(Z_{j_0}, Z_{j_1}, ..., Z_{j_{n−1}}), the complete joint probability can be written as follows (see figure 3 as an example):

$$P(Z_0, Z_1, \ldots, Z_{l-1}) = \prod_{i=0}^{l-1} P(Z_i \mid Z_{i+1}, Z_{i+2}, \ldots, Z_{l-1}) \tag{6}$$

To formalize the notion of having an acyclic graph, we state that we require a vector ω = (ω_0, ω_1, ..., ω_{l−1}), so that we can rewrite the general form of the approximated probability density structure:

$$P_{\pi,\omega}(Z) = \prod_{i=0}^{l-1} P(Z_{\omega_i} \mid Z_{\pi(\omega_i)_0}, Z_{\pi(\omega_i)_1}, \ldots, Z_{\pi(\omega_i)_{|\pi(\omega_i)|-1}}) \tag{7}$$

such that
∀ i∈{0,1,...,l−1} ⟨ω_i ∈ {0, 1, ..., l − 1} ∧ ∀ k∈{0,1,...,l−1}−{i} ⟨ω_i ≠ ω_k⟩⟩
∀ i∈{0,1,...,l−1} ⟨∀ k∈π(ω_i) ⟨k ∈ {ω_{i+1}, ω_{i+2}, ..., ω_{l−1}}⟩⟩

Figure 3: A complete PDS graph on five variables: P(Z_0|Z_1, Z_2, Z_3, Z_4) P(Z_1|Z_2, Z_3, Z_4) P(Z_2|Z_3, Z_4) P(Z_3|Z_4) P(Z_4) = P(Z_0, Z_1, Z_2, Z_3, Z_4)

Even though this definition is tricky and defines which permutations in combination with parent functions are allowed, the construction method for the PDS in the form of (π, ω) will in general only create feasible structures.

We should note that next to finding a distribution, we must also be able to sample from P_{π,ω}(Z). It is clear that because the PDS graph is acyclic, there is an ordering of the variables such that we can always sample a value for a variable that is conditionally dependent only on variables we have already sampled. This ordering is given by the ω_i from equation 7. Note that given the conditional variables, the probability density for a variable takes on a one dimensional form, since it is given in the form of equation 7. From such a one dimensional density it is sometimes straightforward to sample. Still, the method of sampling is dependent on the type of density estimation that is used, so we parameterize this method in the IDEA.
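To illustrate this ordering, a small Python sketch of sampling one new vector given (π, ω) and a per-variable conditional sampler (the names and the toy conditional probabilities are ours; the actual sam() procedures for specific density models follow in later sections):

def sample_vector(omega, pi, sample_conditional, l):
    """Ancestral sampling along the ordering omega: omega[l-1] has no parents,
    and every omega[i] is conditioned only on variables among
    omega[i+1], ..., omega[l-1], which are sampled first."""
    z = [None] * l
    for i in range(l - 1, -1, -1):
        parents = {j: z[j] for j in pi[omega[i]]}    # already sampled values
        z[omega[i]] = sample_conditional(omega[i], parents)
    return z

# toy usage on three binary variables with structure P(Z0|Z2) P(Z1|Z2) P(Z2)
import random
omega, pi = [0, 1, 2], {0: [2], 1: [2], 2: []}
def sample_conditional(i, parents):
    # illustrative conditional probabilities only
    p_one = 0.5 if not parents else (0.8 if parents[2] == 1 else 0.2)
    return 1 if random.random() < p_one else 0
print(sample_vector(omega, pi, sample_conditional, 3))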

Once more we note that step 6 in the IDEA is an important discriminant. A general discrimination can be made between algorithms that use a factorization a priori and the ones that search for a probability density structure. Note that the factorization we refer to is not the full specification of P_{π,ω}(Z), but only of the structure of P_{π,ω}(Z). This structure is fully specified by a pair (π, ω) that is subject to the constraints from equation 7. The actual probability values are found through estimations based on the selected points. This means that even though we might have the correct structure over all points, say the complete joint probability, we still have to find the probability values over the domains that assign probabilities of 1.0 to the assignment of the right domain values to the variables. This estimation always needs to be done. If we denote a search procedure for finding (π, ω) by sea(), we note that algorithms that use an a priori factorized probability distribution simply use a static sea() procedure that always returns the same structure. Therefore we need not introduce two new algorithms, but can suffice by elaborating on the first framework to end up with the final specification of the IDEA:


IDEA(n, τ, m, sel(), rep(), ter(), sea(), est(), sam())
1 ... 5  Identical to IDEA(n, τ, m, sel(), rep(), ter())
6        Determine the probability distribution P^{θ_t}_{π,ω}(Z) in two steps:
6.1          Find a PDS subject to the constraints from equation 7: (π, ω) ← sea()
6.2          Estimate the density functions: {P(Z_{ω_i} | Z_{π(ω_i)_0}, Z_{π(ω_i)_1}, ..., Z_{π(ω_i)_{|π(ω_i)|−1}}) | i ∈ {0, ..., l − 1}} ← est()
7        Generate m new vectors O by sampling from P^{θ_t}_{π,ω}(Z) in three steps:
7.1          O ← ∅
7.2          repeat m times
7.3              O ← O ∪ sam()
8 ... 12 Identical to IDEA(n, τ, m, sel(), rep(), ter())

Concluding, we now have a framework that, given implementations for the functions that are parameters and values for the other parameters, completely specifies density estimation based evolutionary algorithms. This makes it clear exactly which variant of the IDEA is used, as any parameter that is not specified leaves it unclear what the algorithm does. In addition however, add–ons may be introduced on top of the framework. For instance, a local search heuristic can be added, which makes the algorithm hybrid.

3 Finding requirements from existing EDAs: Previous Work

The use of distribution estimation in optimization algorithms has been introduced through various approaches. These approaches use various types of density structures and have mostly been applied to discrete and in particular binary random variables. We seek to investigate which approaches to the search for a PDS have been used. By making this algorithmic part explicit, we isolate what we require from certain distribution models in order to use them in the case of continuous random variables. In all cases, we will find that the metric guiding the search for a PDS is the entropy measure over random variables. We shall write h_i and h_{ij} respectively for the univariate and bivariate joint empirical entropy (either discrete or differential) over variables Z_i and Z_j. In the pseudo–code that we present in the remainder of this paper, we assume that h_i (∀ i∈{0,1,...,l−1}) and h_{ij} (∀ (i,j)∈{0,1,...,l−1}²) are global variables, for which we do not allocate additional memory in the algorithms.
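As an illustration of how such empirical entropies could be computed in the discrete case (a minimal Python sketch under our own naming; the differential variants depend on the continuous density model used and are treated later in the paper):

import math
from collections import Counter

def empirical_entropy(samples, indices):
    """Empirical (joint) entropy, in nats, of the variables listed in
    `indices`, estimated from frequency counts over the selected samples."""
    counts = Counter(tuple(x[i] for i in indices) for x in samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# toy usage: h_0 and h_{01} for four selected binary vectors
selected = [(0, 1, 1), (1, 1, 0), (1, 0, 0), (0, 1, 1)]
h0  = empirical_entropy(selected, [0])      # univariate
h01 = empirical_entropy(selected, [0, 1])   # bivariate joint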

3.1 PBIL

The basic idea in PBIL by Baluja and Caruana [2] is to update a single probability vector P[i], i ∈ {0, 1, ..., l − 1}. This vector has an entry for every binary random variable. Each entry denotes the probability that the corresponding variable should be set to 1. The update procedure is performed by moving the probability vector toward a few (M) of the best vectors in the population. This is achieved by updating each entry independently of the others at the rate of a certain learning parameter η:

for j ← 0 to M − 1 do
    for i ← 0 to l − 1 do
        P[i] ← P[i](1 − η) + (X^{(S)_j})_i η

The requirements on the density model are nothing special, because the density structure is simply the univariate distribution. In other words, we need no special aspects of the density model to compute (π, ω). We do note however that PBIL uses a different approach to updating the distribution. In PBIL, a memory is used, as the information from previous iterations is preserved by multiplication of the probability vector entries by (1 − η). This is different from our IDEA algorithmic model, as we use only the population to update the distribution.
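As an illustration, a minimal Python sketch of this update rule (the names are ours, not PBIL's):

import numpy as np

def pbil_update(P, selected, eta):
    """Move the probability vector P toward each of the M best
    binary vectors in `selected`, at learning rate eta."""
    for x in selected:                 # x is a binary vector of length l
        P = P * (1.0 - eta) + eta * x  # per-entry, independent update
    return P

# toy usage: l = 4, M = 2
P = np.full(4, 0.5)
best = np.array([[1, 0, 1, 1], [1, 1, 0, 1]])
P = pbil_update(P, best, eta=0.1)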


In extensions of the PBIL framework to continuous spaces [30, 29, 11], all variables continue to be regarded independently of one another. Even though the density model for each variable may be different, the requirements on these models are nothing special from the viewpoint of searching for the structure (π, ω). Concluding, there are no requirements for implementing the search algorithm employed by PBIL.

3.2 MIMIC

In MIMIC, a framework by De Bonet, Isbell and Viola [6], all binary variables except for one are conditionally dependent on exactly one other variable. These dependencies are ordered in such a way that the resulting PDS is a chain:

$$P(Z) = \left( \prod_{i=0}^{l-2} P(Z_{j_i} \mid Z_{j_{i+1}}) \right) P(Z_{j_{l-1}}) \tag{8}$$

such that ∀ (i,k)∈{0,1,...,l−1}² ⟨i ≠ k → j_i ≠ j_k⟩

To determine the j_i that constitute the chain, the agreement between the complete joint distribution and P(X) is maximized. This comes down to minimizing the following expression when using the Kullback–Leibler divergence, written using discrete variables X_i:

$$J(X) = \left( \sum_{i=0}^{l-2} H(X_{j_i} \mid X_{j_{i+1}}) \right) + H(X_{j_{l-1}}) \tag{9}$$

In this expression, H(X_i) stands for the entropy of variable X_i. The notion of entropy was first introduced by Shannon when presenting his information theory [31]. It is a measure of the average amount of information conveyed per message. The multivariate definition in n dimensions for discrete random variables X_{j_0}, X_{j_1}, ..., X_{j_{n−1}} is the following:

$$H(X_{j_0}, X_{j_1}, \ldots, X_{j_{n-1}}) = -\sum_{k_0=0}^{n_d-1} \sum_{k_1=0}^{n_d-1} \cdots \sum_{k_{n-1}=0}^{n_d-1} p_{j_0,\ldots,j_{n-1}}(k_0, \ldots, k_{n-1}) \ln\!\big(p_{j_0,\ldots,j_{n-1}}(k_0, \ldots, k_{n-1})\big) \tag{10}$$

Furthermore, we require the notion of conditional entropy. For one discrete random variable X_i conditioned on n other discrete random variables X_{j_0}, X_{j_1}, ..., X_{j_{n−1}}, this definition equals:

$$H(X_i \mid X_{j_0}, X_{j_1}, \ldots, X_{j_{n-1}}) = H(X_i, X_{j_0}, X_{j_1}, \ldots, X_{j_{n-1}}) - H(X_{j_0}, X_{j_1}, \ldots, X_{j_{n-1}}) \tag{11}$$

The definition in equation 10 cannot be used for continuous random variables. To this end, we require the definition of differential entropy, which is the continuous variant of the entropy measure. This entropy measure is usually denoted with a small h instead of a capital H:

$$h(Y_{j_0}, Y_{j_1}, \ldots, Y_{j_{n-1}}) = -\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_{j_0,\ldots,j_{n-1}}(y_0, \ldots, y_{n-1}) \ln\!\big(f_{j_0,\ldots,j_{n-1}}(y_0, \ldots, y_{n-1})\big)\, dy_0\, dy_1 \cdots dy_{n-1} \tag{12}$$

The minimization of the divergence in equation 9 can be done in a greedy way, which is the approach used in MIMIC. First, the variable with minimal unconditional entropy is selected. Then, new variables are subsequently selected by choosing the one with minimal conditional entropy, given the previously selected variable. This gives an O(l²) algorithm (without computing the entropies):

ω_{l−1} ← arg min_j {h̄(Z_j) | j ∈ {0, 1, ..., l − 1}}
π(ω_{l−1}) ← ∅
for i ← l − 2 downto 0 do
    ω_i ← arg min_j {h̄(Z_j, Z_{ω_{i+1}}) − h̄(Z_{ω_{i+1}}) | j ∈ {0, 1, ..., l − 1} − {ω_{i+1}, ω_{i+2}, ..., ω_{l−1}}}
    π(ω_i) ← (ω_{i+1})

In the above, we have written h̄(Z_i) instead of h(Z_i) to remark that the entropy measure used in an actual algorithm, based on the vectors Z^{(S)_j}, is the empirical entropy. The empirical entropy differs from the true entropy in that the empirical entropy is based upon samples instead of upon the true distribution.

Concluding, the one requirement we have on the distribution model in order to apply the MIMIC chain search procedure is that we are able to compute one dimensional and two dimensional (joint) empirical entropies.
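As an illustration, a minimal Python sketch of this greedy chain construction, taking precomputed empirical entropies as input in the spirit of the global h_i and h_{ij} mentioned in section 3 (the function and variable names are ours):

def mimic_chain(h1, h2):
    """Greedy MIMIC chain: h1[j] is the empirical entropy of Z_j and
    h2[i][j] the joint empirical entropy of (Z_i, Z_j).  Returns the
    ordering omega and parent function pi of the chain PDS."""
    l = len(h1)
    omega = [None] * l
    pi = {}
    remaining = set(range(l))
    omega[l - 1] = min(remaining, key=lambda j: h1[j])   # minimal entropy
    pi[omega[l - 1]] = []
    remaining.remove(omega[l - 1])
    for i in range(l - 2, -1, -1):
        prev = omega[i + 1]
        # minimal conditional entropy h(Z_j | Z_prev) = h(Z_j, Z_prev) - h(Z_prev)
        omega[i] = min(remaining, key=lambda j: h2[j][prev] - h1[prev])
        pi[omega[i]] = [prev]
        remaining.remove(omega[i])
    return omega, pi

# toy usage with three variables (entropy values chosen for illustration only)
h1 = [0.9, 0.3, 0.7]
h2 = [[0.9, 1.1, 1.2], [1.1, 0.3, 0.9], [1.2, 0.9, 0.7]]
print(mimic_chain(h1, h2))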

3.3 Optimal Dependency Trees

In the approach using optimal dependency trees by Baluja and Davies [3], all variables except for one may still be conditionally dependent on exactly one other variable, but multiple variables may be dependent on one and the same variable, yielding a tree structure:

$$P(Z) = \left( \prod_{i=0}^{l-2} P(Z_{j_i} \mid Z_{e_i}) \right) P(Z_{j_{l-1}}) \tag{13}$$

such that ∀ (i,k)∈{0,1,...,l−1}² ⟨(i ≠ k → j_i ≠ j_k) ∧ e_i ∈ {j_{i+1}, ..., j_{l−1}}⟩

To determine the j_i that constitute the tree, the Kullback–Leibler divergence is again minimized. In the algorithm, the measure named mutual information is used. The definition of mutual information for two random variables Z_i and Z_j can be expressed in terms of the entropy measure we saw earlier in section 3.2:

$$I(Z_i, Z_j) = h(Z_i) + h(Z_j) - h(Z_i, Z_j) \tag{14}$$

The greater the mutual information, the greater the interaction between variables Z_i and Z_j. As a result, the algorithm builds the tree by maximizing the mutual information measure. The algorithm runs in O(l²) time (without computing the entropies):

bt ← new array of integer with size l − 1
ω_{l−1} ← RandomNumber(l)
π(ω_{l−1}) ← ∅
bt[i] ← ω_{l−1}   (∀ i∈{0,1,...,l−1}−{ω_{l−1}})
for i ← l − 2 downto 0 do
    ω_i ← arg max_j {h̄(Z_j) + h̄(Z_{bt[j]}) − h̄(Z_j, Z_{bt[j]}) | j ∈ {0, 1, ..., l − 1} − {ω_{i+1}, ω_{i+2}, ..., ω_{l−1}}}
    π(ω_i) ← (bt[ω_i])
    if h̄(Z_j) + h̄(Z_{bt[j]}) − h̄(Z_j, Z_{bt[j]}) < h̄(Z_j) + h̄(Z_{ω_i}) − h̄(Z_j, Z_{ω_i})
        then bt[j] ← ω_i   (∀ j∈{0,1,...,l−1}−{ω_i,ω_{i+1},...,ω_{l−1}})

In the above, we have used the algorithm RandomNumber(x), which is assumed to return a (pseudo) random integer from the set {0, 1, ..., x − 1}. We may conclude that, because we only require (empirical) entropy measures, we require nothing new for implementing the optimal dependency tree search procedure in addition to the requirements for implementing the MIMIC chain search procedure. The only requirement is thus to be able to determine h̄(Z_j) and h̄(Z_i, Z_j).
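In the same spirit, a minimal Python sketch of the tree construction (again with precomputed empirical entropies and our own names; the best-match bookkeeping mirrors the bt array above):

import random

def dependency_tree(h1, h2):
    """Greedy optimal dependency tree: repeatedly add the variable j with
    maximal mutual information h(Z_j) + h(Z_bt[j]) - h(Z_j, Z_bt[j]) to its
    current best match bt[j] among the variables already in the tree."""
    l = len(h1)
    omega = [None] * l
    pi = {}
    root = random.randrange(l)
    omega[l - 1] = root
    pi[root] = []
    remaining = set(range(l)) - {root}
    bt = {j: root for j in remaining}            # best match in the tree so far
    for i in range(l - 2, -1, -1):
        mi = lambda j: h1[j] + h1[bt[j]] - h2[j][bt[j]]
        omega[i] = max(remaining, key=mi)
        pi[omega[i]] = [bt[omega[i]]]
        remaining.remove(omega[i])
        for j in remaining:                      # update best matches
            if h1[j] + h1[omega[i]] - h2[j][omega[i]] > mi(j):
                bt[j] = omega[i]
    return omega, pi

# toy usage with the same illustrative entropies as before
h1 = [0.9, 0.3, 0.7]
h2 = [[0.9, 1.1, 1.2], [1.1, 0.3, 0.9], [1.2, 0.9, 0.7]]
print(dependency_tree(h1, h2))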


3.4 BOA

One of the most general approaches, allowing the most flexible network structure, was proposed by Pelikan, Goldberg and Cantu–Paz [24]. The algorithm, named BOA, uses a PDS in which each node may have up to κ successors, allowing variables to be conditionally dependent on sets of variables. Because of the very general structure, we may write this using the same variables as before in equation 7 and with the introduction of one additional constraint:

$$P(Z) = \prod_{i=0}^{l-1} P(Z_{\omega_i} \mid Z_{\pi(\omega_i)_0}, Z_{\pi(\omega_i)_1}, \ldots, Z_{\pi(\omega_i)_{|\pi(\omega_i)|-1}}) \tag{15}$$

such that
∀ i∈{0,1,...,l−1} ⟨ω_i ∈ {0, 1, ..., l − 1} ∧ ∀ k∈{0,1,...,l−1}−{i} ⟨ω_i ≠ ω_k⟩⟩
∀ i∈{0,1,...,l−1} ⟨∀ k∈π(ω_i) ⟨k ∈ {ω_{i+1}, ω_{i+2}, ..., ω_{l−1}}⟩⟩
∀ i∈{0,1,...,l−1} ⟨|π(i)| ≤ κ⟩

As was the case in MIMIC and the approach using optimal dependency trees, a metric is required that should be maximized or minimized so as to find a network structure. In the original implementation of BOA, the so-called K2 metric is used, which is a metric that should be maximized. In contrast to that approach, and more in line with the previously discussed approaches, we use a metric based on conditional entropies. In appendix A, the difference between the K2 metric and the entropy metric is elaborated upon. In the general graph search algorithm that will result, we will attempt to minimize the conditional entropy of the PDS imposed by equation 15:

$$\psi((\pi,\omega)) = \min_{(\pi,\omega)} \left\{ \sum_{i=0}^{l-1} h(X_{\omega_i} \mid X_{\pi(\omega_i)_0}, X_{\pi(\omega_i)_1}, \ldots, X_{\pi(\omega_i)_{|\pi(\omega_i)|-1}}) \right\} \tag{16}$$

Note the obvious similarity with the metric used by MIMIC, which we discussed in section 3.2. It follows from this metric that we again only require the ability to compute empirical entropies. The only difference is that this time we are required to compute multivariate entropies for the applied density model in κ + 1 dimensions.

The only thing that is now missing is the algorithm used in BOA to find a PDS. To this end, three different cases are distinguished for κ. When κ = 0, the PDS regards every variable univariately. In other words, we end up with a fixed network in which ∀ i ⟨π(i) = ∅⟩. In the case of κ = 1, there is a polynomial algorithm by Edmonds [10] to construct the optimal network. A direct combinatorial proof of the correctness of the algorithm was given by Karp [20]. The definitions that are required to understand the algorithm and fill in the details are based on the work by Karp and are clarified in appendix B. If κ > 1, the problem is NP–complete for the proposed metric [17]. Therefore, a greedy algorithm can be used, as has been done in the MIMIC approach. The original implementation of BOA starts with a PDS with ∀ i ⟨π(i) = ∅⟩. Next, from the set of arcs that are not yet in the PDS graph, the arc that moves the metric the most in the direction of the optimum is selected and added to the PDS graph. This process is repeated until the metric can no longer be improved. These three cases constitute a search algorithm that runs in O(κl³) time (without computing the entropies). This time bound is due to the fact that for κ = 0 the running time is O(l), for κ = 1 the running time is O(l³), and for κ > 1 the running time is O(κl³). Along with these running times, we note that Tarjan [32] has given a more efficient implementation of finding optimal branchings for the case when κ = 1, which runs in O(l(log(l))² + l²) time. Compared to the running time of the general graph search algorithm when κ > 1 is taken as a constant, this is of no influence on the overall running time. We are now ready to present the algorithm itself:


if κ = 0
    then for i ← 0 to l − 1 do
        ω_i ← i
        π(ω_i) ← ∅
else if κ = 1
    then P ← ConstructProblem()
        G = (V, A) ← ({0, 1, ..., l − 1}, {0, 1, ..., l − 1}² − ⋃_{i=0}^{l−1}{(i, i)})
        set s(i, j) ← i and t(i, j) ← j
        compute c(i, j) for every arc (i, j) so that min_{(i,j)}{c(i, j) | (i, j) ∈ A} = 1
        P ← (G, s, t, c)
        H = (V, A_H) ← ComputeCriticalGraph(P)
        if acyclic(H)
            then B ← A_H
            else P′ ← ConstructDerivedProblem(P, H)
                B′ ← repeat computation, but start with P′
                B ← optimum branching for P corresponding to B′ for P′
        perform topological sort on B to find (π, ω)
else
    G = (V, A) ← ({0, 1, ..., l − 1}, ∅)
    compute change(i, j) for every arc (i, j)
    ψ ← ∑_{i=0}^{l−1} h̄(Z_i)
    while ¬finished do
        (v_1, v_2) ← arg min_{(i,j)}{change(i, j) | allowed(i, j) ∧ (i, j) ∈ V²}
        if change(v_1, v_2) > 0
            then breakwhile
        mark (v_1, v_2) and arcs (i, j) that would create cycles as not allowed
        A ← A ∪ {(v_1, v_2)}
        if v_2 has κ parents
            then mark each arc (i, v_2), i ∈ {0, 1, ..., l − 1}, as not allowed
        ψ ← ∑_{i=0}^{l−1} h̄(Z_i | {Z_j | (j, i) ∈ A})
        compute change(i, j) for every arc (i, j) with allowed(i, j) = true
    perform topological sort on G to find (π, ω)
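As an illustration of the greedy case κ > 1, a minimal Python sketch in which the conditional-entropy computation is left abstract as a callback and cycle detection is done with a simple reachability test (all names are ours, not BOA's):

def greedy_pds(l, kappa, cond_entropy):
    """Greedily add the arc (i, j) that lowers the metric
    sum_i h(Z_i | parents(Z_i)) the most, while keeping the graph acyclic
    and every node at no more than kappa parents.
    cond_entropy(j, parents) must return the empirical conditional entropy."""
    parents = {i: set() for i in range(l)}

    def creates_cycle(i, j):
        # adding arc i -> j creates a cycle iff i is already reachable from j
        stack, seen = [j], set()
        while stack:
            v = stack.pop()
            if v == i:
                return True
            if v not in seen:
                seen.add(v)
                stack.extend(k for k in range(l) if v in parents[k])
        return False

    while True:
        best, best_gain = None, 0.0
        for j in range(l):
            if len(parents[j]) >= kappa:
                continue
            h_now = cond_entropy(j, parents[j])
            for i in range(l):
                if i == j or i in parents[j] or creates_cycle(i, j):
                    continue
                gain = h_now - cond_entropy(j, parents[j] | {i})
                if gain > best_gain:
                    best, best_gain = (i, j), gain
        if best is None:
            break                      # metric can no longer be improved
        parents[best[1]].add(best[0])
    return parents

# a topological sort of the resulting graph then yields (pi, omega);
# smoke test with a constant (hence uninformative) entropy callback
print(greedy_pds(3, 2, lambda j, par: 1.0))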

3.5 UMDA, BMDA and FDA

The EDA framework has resulted from insights gained through the introduction of algorithms along the line of UMDA, BMDA and FDA. The UMDA by Muhlenbein and Paaß [23] uses the same type of distribution as the PBIL algorithm discussed in section 3.1. The main difference is that a memory is used in PBIL, whereas UMDA mainly uses the collection of vectors. This latter approach is therefore more in concordance with our IDEA approach. It follows from the use of only the univariate distribution that, for the same reasons as for PBIL, we have no additional requirements for the UMDA.

The BMDA by Pelikan and Muhlenbein [26] covers second order interactions, just as is the case for MIMIC and the optimal dependency trees discussed in sections 3.2 and 3.3 respectively. However, the structure used in BMDA is the most general one for second order relations, meaning that the PDS graph has the form of a collection of trees, as used in the optimal dependency trees approach. If we look even closer, we observe that because the BOA from section 3.4 also uses the most general approach for modeling relations with maximum order κ, the model for BMDA equals the model for the BOA with κ = 1. As noted elsewhere [24], the BOA covers both the UMDA and the BMDA. Hence, we refrain from going into further detail on this algorithm.

Finally, the FDA by Muhlenbein, Mahnig and Rodriguez [22] is of importance. In this algorithm, a complete factorization of the distribution, specified by n_s sets s_i with i ∈ {0, 1, ..., n_s − 1}, is taken as input and used to build the PDS:

$$P(Z) = \left( \prod_{i=1}^{n_s-1} P\!\left( \prod_{Z_j \in b_i} Z_j \;\middle|\; \prod_{Z_j \in c_i} Z_j \right) \right) P\!\left( \prod_{Z_j \in b_0} Z_j \right) \tag{17}$$

where

c_i = s_i ∩ d_{i−1}
b_i = s_i − d_{i−1}
d_i = ⋃_{j=0}^{i} s_j if i ≥ 0, and d_i = ∅ otherwise
⋃_{i=0}^{n_s−1} s_i = {Z_0, Z_1, ..., Z_{l−1}}

Using an actual factorization of the problem at hand, FDA can perform optimization very efficiently. As this factorization is given a priori, it can be written in terms of the general definition of equation 7. This means that we have no additional requirements for implementing FDA in our IDEA framework, since the PDS that is used is fixed at the outset. In other words, a sea() function that returns the PDS given as input by the user suffices.
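As a small illustration of how the sets b_i, c_i and d_i follow from a given factorization s_0, ..., s_{n_s−1} (a Python sketch with an example factorization chosen by us):

def fda_sets(s):
    """Given the factorization sets s_i, compute d_i (union so far),
    c_i = s_i ∩ d_{i-1} (conditioning part) and b_i = s_i − d_{i-1} (new part)."""
    b, c, d = [], [], set()
    for si in s:
        si = set(si)
        c.append(si & d)     # overlap with previously covered variables
        b.append(si - d)     # newly introduced variables
        d |= si
    return b, c

# toy factorization over variables 0..4: s_0 = {0,1}, s_1 = {1,2,3}, s_2 = {3,4}
b, c = fda_sets([{0, 1}, {1, 2, 3}, {3, 4}])
# b == [{0,1}, {2,3}, {4}],  c == [set(), {1}, {3}]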

3.6 CGA and ECGA

The compact genetic algorithm (cGA) was introduced by Harik, Lobo and Goldberg [16] in 1997. The main idea behind this algorithm is that it is able to mimic the order–1 behaviour of the simple genetic algorithm (sGA) with uniform crossover. Again, just as in the UMDA and PBIL, for each binary variable an entry in a probability vector is created that is initially set to 0.5. The algorithm then creates tournaments and moves the probability vector toward the winner of the tournament. The way in which the probability vector is adapted depends on the assumed population size of the sGA that the cGA mimics. Again, because a univariate distribution is used, we do not have any additional requirements.

An extension of the cGA, named the extended compact genetic algorithm (ECGA), was given by Harik [15] in 1999. In the ECGA, the PDS consists of joint structures which are all considered univariately. This means that variables can be grouped together in a joint distribution, but they are processed independently of any other variables. Such a PDS follows the idea of having non-overlapping building blocks which must be processed as a whole in order to solve a problem. Such problems have been shown to be solvable by the ECGA in a very efficient manner. The search algorithm for the PDS is a greedy algorithm that seeks to combine two sets of variables into a single new set as long as a certain metric can still be increased. Initially, all of the variables are placed in singleton sets. The metric used is based upon the minimum description length principle from information theory (see for instance [8]). The advantage of this metric is that it penalizes models that are more complex than required. This is important because the density estimation step is an approximation to the true density, both because of the model used and because we are using a limited set of samples.

The metric used in the ECGA (as well as the one used in BMDA) differs from the entropy measure we proposed to use in all of the other cases. The search algorithm is a special case of finding a multivariate PDS. The most general structure is used within the FDA and the BOA. Therefore, even though it deserves full attention, especially with respect to completely decomposable problems, we shall regard this method as being covered by the level of allowed interactions in a general PDS search and focus on that. The ECGA search algorithm, along with the metric used, can however easily be incorporated in the IDEA framework. Even though the metric used has important and interesting properties, we shall refrain from exploring it further in this paper.
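As an illustration only, a minimal Python sketch of such a greedy set-merging search, with the metric to be maximized left abstract as a callback (the names and the toy metric are ours, not the ECGA's MDL-based metric from [15]):

from itertools import combinations

def ecga_partition(l, score):
    """Greedy ECGA-style model search: start from singleton sets and keep
    merging the pair of sets that improves score(partition) the most."""
    partition = [frozenset({i}) for i in range(l)]
    current = score(partition)
    while True:
        best, best_score = None, current
        for a, b in combinations(range(len(partition)), 2):
            merged = (partition[:a] + partition[a + 1:b] + partition[b + 1:]
                      + [partition[a] | partition[b]])
            s = score(merged)
            if s > best_score:
                best, best_score = merged, s
        if best is None:
            return partition
        partition, current = best, best_score

# smoke test: a toy score that (artificially) rewards grouping variables 0 and 1
toy = lambda p: sum(1 for s in p if s == frozenset({0, 1}))
print(ecga_partition(3, toy))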

4 IDEAs for discrete random variables

Having noted the type of information that is required in previously created EDAs for binary random variables, we are ready to revisit that work in terms of the IDEA framework. In this section, we show how the algorithms discussed in section 3 can be implemented in the case of discrete random variables. We directly cast these algorithms in the form of general discrete random variables as introduced in section 1, as opposed to the special case of binary random variables that was implicitly used in most of the discussed algorithms. The running times of the algorithms are stated in section 6.

Even though the incorporation of the pseudo–code for the search procedures that return thePDS (π,ω) in the case of discrete random variables is something of a repetition of earlier work, itserves three purposes:

1. It gives a clear overview of the main ingredients of the algorithms presented in the literature so far by using pseudo–code to clearly specify the operational procedures.

2. It shows how those algorithms can be incorporated within the IDEA framework.

3. It provides a reference to check back with when defining the algorithms for continuous random variables, allowing us to note the similarities and to see how density estimation lies at the heart of these methods.

In the discrete case, we compute probabilities by counting frequencies, as we would in making a histogram. We call these probabilities empiric probabilities and denote them by p(·). The basis of computing probabilities in these algorithms in the discrete case lies within the use of equation 3 and the following:

p_{j_0,j_1,...,j_{n−1}}(k_0, k_1, . . . , k_{n−1}) = m(j, k) / ⌊τn⌋    (18)

where

j = (j_0, j_1, . . . , j_{n−1}),   k = (k_0, k_1, . . . , k_{n−1})

m(ν, λ) = Σ_{q=0}^{⌊τn⌋−1} { 1 if ∀_{i∈{0,1,...,|λ|−1}} ⟨X^{(S)q}_{ν_i} = λ_i⟩,  0 otherwise }

We need to compute the empiric probabilities in order to sample from the conditional probabilities imposed by the PDS. We do this by using bins (b_i and b^p_i, i ∈ {0, 1, . . . , l − 1}) in which we perform a frequency count exactly as imposed by function m(·, ·) from equation 18. This leads us to define algorithm DiEst to estimate the density functions in the discrete case once the PDS is known:

DiEst()
1   for i ← 0 to l − 1 do
1.1     b_{ω_i} ← new array of real in |π(ω_i)| + 1 dimensions with size n_d × n_d × . . . × n_d
1.2     b^p_{ω_i} ← new array of real in |π(ω_i)| dimensions with size n_d × n_d × . . . × n_d
1.3     for (a_0, a_1, . . . , a_{|π(ω_i)|}) ← (0, 0, . . . , 0) to (n_d − 1, n_d − 1, . . . , n_d − 1) in crossproduct do
1.3.1     b_{ω_i}[a_0, a_1, . . . , a_{|π(ω_i)|}] ← 0
1.4     for (a_0, a_1, . . . , a_{|π(ω_i)|−1}) ← (0, 0, . . . , 0) to (n_d − 1, n_d − 1, . . . , n_d − 1) in crossproduct do
1.4.1     b^p_{ω_i}[a_0, a_1, . . . , a_{|π(ω_i)|−1}] ← 0
1.5     for k ← 0 to ⌊τn⌋ − 1 do
1.5.1     a_0 ← X^{(S)k}_{ω_i}
1.5.2     for q ← 0 to |π(ω_i)| − 1 do
1.5.2.1     a_{q+1} ← X^{(S)k}_{π(ω_i)_q}
1.5.3     b_{ω_i}[a_0, a_1, . . . , a_{|π(ω_i)|}] ← b_{ω_i}[a_0, a_1, . . . , a_{|π(ω_i)|}] + 1/⌊τn⌋
1.5.4     b^p_{ω_i}[a_1, a_2, . . . , a_{|π(ω_i)|}] ← b^p_{ω_i}[a_1, a_2, . . . , a_{|π(ω_i)|}] + 1/⌊τn⌋
1.6     ∀_{(k_0,k_1,...,k_{|π(ω_i)|}) ∈ D_X^{|π(ω_i)|+1}} ⟨P(X_{ω_i} = k_0 | X_{π(ω_i)_0} = k_1, . . . , X_{π(ω_i)_{|π(ω_i)|−1}} = k_{|π(ω_i)|}) ← ϕ(ω_i, k_0, k_1, . . . , k_{|π(ω_i)|})⟩
        where ϕ(ω_i, k_0, k_1, . . . , k_{|π(ω_i)|}) = b_{ω_i}[k_0, k_1, . . . , k_{|π(ω_i)|}] / b^p_{ω_i}[k_1, k_2, . . . , k_{|π(ω_i)|}]  if |π(ω_i)| > 0
                                                     b_{ω_i}[k_0, k_1, . . . , k_{|π(ω_i)|}]  otherwise
2   return(P(·))

Note that in step 1.6 of algorithm DiEst the probabilities are assigned to the approximation of the probability mass function. This step does not actually have to be computed, because the required information is stored in the arrays b_i and b^p_i. The only reason it is stated here is to clarify the mapping to be established for sampling from the probability distribution. Note also that the above algorithm takes an exponential amount of time, which can become a problem if the amount of parents gets larger. Sampling itself is straightforward, as the probability information can be accessed directly through the arrays. This is coded in algorithm DiSam:

DiSam()
1   for i ← l − 1 downto 0 do
1.1     ς ← Random01()
1.2     ϑ ← 0
1.3     η ← 1
1.4     if |π(ω_i)| > 0 then
1.4.1     η ← b^p_{ω_i}[X_{π(ω_i)_0}, X_{π(ω_i)_1}, . . . , X_{π(ω_i)_{|π(ω_i)|−1}}]
1.5     for k ← 0 to n_d − 1 do
1.5.1     ϑ ← ϑ + (1/η) b_{ω_i}[k, X_{π(ω_i)_0}, X_{π(ω_i)_1}, . . . , X_{π(ω_i)_{|π(ω_i)|−1}}]
1.5.2     if ς ≤ ϑ then X_{ω_i} ← k; breakfor
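As a small, non-authoritative illustration of the frequency-count idea behind DiEst and DiSam, the Python sketch below estimates the joint and marginal counts for one variable and its parents and then samples the variable conditionally. The helper names (estimate_counts, sample_child) and the representation of a sample as a dictionary from variable index to value are our own assumptions, not part of the pseudo-code above.

import random
from collections import defaultdict

def estimate_counts(samples, child, parents):
    # Frequency counts as in DiEst: empiric probabilities over (child, parents)
    # and over the parents alone, each normalized by the number of samples.
    joint, marg = defaultdict(float), defaultdict(float)
    n = len(samples)
    for s in samples:                      # s maps a variable index to its discrete value
        key = tuple(s[v] for v in [child] + parents)
        joint[key] += 1.0 / n
        marg[key[1:]] += 1.0 / n
    return joint, marg

def sample_child(joint, marg, parent_values, domain):
    # Roulette-wheel selection over p(child | parents), mirroring the loop in DiSam.
    eta = marg[tuple(parent_values)] if parent_values else 1.0
    threshold, acc = random.random(), 0.0
    for k in domain:                       # domain is (0, ..., n_d - 1)
        acc += joint[(k, *parent_values)] / eta
        if threshold <= acc:
            return k
    return domain[-1]                      # guard against rounding error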

Having specified how to estimate the distribution and how to sample from these estimations in the case of discrete random variables, we still have to specify how to find (π,ω). Two special cases arise when we fix the distribution in a certain way. In general, a search procedure that returns some fixed PDS, possibly dependent on user input given beforehand, gives an approach similar to the FDA [22]. The two special cases of such a fixed distribution that we wish to outline beforehand are the univariate distribution and the complete joint distribution. In the univariate distribution, each variable is regarded independently of the others. This leads to an approach in the line of PBIL [2] and UMDA [23]. For problems without linked parameters, these methods can be very efficient. However, when parameters are linked, we require higher order building block processing in order to find efficient and effective optimization algorithms [7]. In the joint distribution, which is the other extreme, all the variables are regarded in a joint fashion, conveying ½l(l − 1) dependencies. This PDS will convey too many dependencies for most problems. We can formalize the two algorithms as follows:

UnivariateDistribution()
1   for i ← 0 to l − 1 do
1.1     ω_i ← i
1.2     π(ω_i) ← ∅
2   return((π,ω))

JointDistribution()
1   for i ← 0 to l − 1 do
1.1     ω_i ← i
1.2     π(ω_i) ← (i + 1, i + 2, . . . , l − 1)
2   return((π,ω))

Note that these two fixed PDS algorithms only need to compute their information once and may return this same information every time anew. The running time of these algorithms thus needs to be counted only once, contrary to the algorithms that search for a PDS. The search algorithms will require their running time every iteration.
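For concreteness, a minimal Python sketch of these two fixed structures follows; representing π as a dictionary and ω as a list is our own choice of data structure, not something prescribed by the framework.

def univariate_structure(l):
    # No dependencies: every variable is its own node with an empty parent list.
    omega = list(range(l))
    pi = {i: [] for i in range(l)}
    return pi, omega

def joint_structure(l):
    # The other extreme: variable i is conditioned on all variables with a higher
    # index, giving the l(l-1)/2 dependencies of the complete joint distribution.
    omega = list(range(l))
    pi = {i: list(range(i + 1, l)) for i in range(l)}
    return pi, omega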

We now present each of the three different search approaches discussed in section 3 in the setting of discrete random variables. The first of these was MIMIC. In order to write down this algorithm, we noted in section 3.2 that we need to be able to compute the entropy. In the case of discrete random variables, combining equations 10 and 18 provides us with enough information to write out algorithm DiChainSea:

DiChainSea()1 a ← new array of integer with size l2 for i ← 0 to l − 1 do

2.1 a[i] ← i2.2 pi ← new array of real with size nd

3 e ← 04 ωl−1 ← a[e]5 for i ← 0 to l − 1 do

5.1 for j ← 0 to nd − 1 do5.1.1 pa[i][j] ← 0

5.2 for j ← 0 to �τn� − 1 do

5.2.1 pa[i][X(S)ja[i] ] ← pa[i][X

(S)ja[i] ] + 1

�τn�5.3 ha[i] ← −∑nd−1

k=0 pa[i][k]ln(pa[i][k])5.4 if ha[i] < hωl−1 then

5.4.1 ωl−1 ← a[i]5.4.2 e ← i

6 π(ωl−1) ← ∅7 a[e] ← a[l − 1]8 for i ← l − 2 downto 0 do

8.1 e ← 08.2 ωi ← a[e]8.3 for q ← 0 to i do

8.3.1 pωi+1a[q] ← new array of real in 2 dimensions with size nd×nd

8.3.2 pa[q]ωi+1 ← new array of real in 2 dimensions with size nd×nd

8.3.3 for j0 ← 0 to nd − 1 do8.3.3.1 for j1 ← 0 to nd − 1 do

8.3.3.1.1 pωi+1a[q][j0, j1] ← 08.3.3.1.2 pa[q]ωi+1 [j1, j0] ← 0

8.3.4 for j ← 0 to �τn� − 1 do

8.3.4.1 pωi+1a[q][X(S)jωi+1 ,X

(S)ja[q] ] ← pωi+1a[q][X

(S)jωi+1 ,X

(S)ja[q] ] + 1

�τn�8.3.4.2 pa[q]ωi+1 [X

(S)ja[q] ,X

(S)jωi+1 ] ← pa[q]ωi+1 [X

(S)ja[q] ,X

(S)jωi+1 ] + 1

�τn�8.3.5 hω[i+1]a[q] ← −∑nd−1

k0=0

∑nd−1k1=0 pωi+1a[q][k0, k1]ln(pωi+1a[q][k0, k1])

8.3.6 ha[q]ωi+1 ← hωi+1a[q]

8.3.7 if ha[q]ωi+1 − hωi+1 < hωiωi+1 − hωi+1 then8.3.7.1 ωi ← a[q]8.3.7.2 e ← q

8.4 π(ωi) ← ωi+1

8.5 a[e] ← a[i]9 for i ← 0 to l − 1 do

9.1 if |π(ωi)| > 0 then9.1.1 bωi

← pωiπ(ωi)0

9.1.2 bpωi

← pπ(ωi)0

9.2 else9.2.1 bωi

← pωi

10 return((π,ω))


Note that from algorithm DiChainSea it follows that we do not need to recompute the arrays b_i and b^p_i in algorithm DiEst. In other words, because line 1.6 of algorithm DiEst was only included to clarify the construction of P, algorithm DiEst can be skipped when using algorithm DiChainSea. This follows from the definition of IDEA, because sea() is executed before est(). This reduces computation time every iteration.
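To clarify the chain construction itself, here is a compact Python sketch of the same greedy, entropy-driven procedure; it assumes the selected samples are available per variable as columns of discrete values, and all function names are illustrative rather than taken from the paper.

import math
from collections import Counter

def entropy(values):
    # Empiric entropy -sum p ln p over the observed frequencies.
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def chain_search(columns):
    # columns: dict mapping a variable index to its list of sample values.
    # Pick the last chain element with minimal entropy, then repeatedly prepend
    # the variable minimizing h(X | previous) = h(X, previous) - h(previous).
    remaining = set(columns)
    order = [min(remaining, key=lambda i: entropy(columns[i]))]
    remaining.discard(order[0])
    while remaining:
        prev = order[-1]
        nxt = min(remaining,
                  key=lambda i: entropy(list(zip(columns[i], columns[prev])))
                              - entropy(columns[prev]))
        order.append(nxt)
        remaining.discard(nxt)
    order.reverse()                        # now order[i] plays the role of omega_i
    pi = {order[i]: [order[i + 1]] for i in range(len(order) - 1)}
    pi[order[-1]] = []
    return pi, order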

The second search procedure is taken from the optimal dependency trees approach. To this end, we again require the entropy measure. Since we have already shown what this measure is, we can immediately write down algorithm DiTreeSea:

DiTreeSea()1 a ← new array of integer with size l2 for i ← 0 to l − 1 do

2.1 a[i] ← i2.2 pi ← new array of real with size nd

3 bt ← new array of integer with size l − 14 e ←RandomNumber(l)5 ωl−1 ← e6 π(ωl−1) ← ∅7 for j ← 0 to nd − 1 do

7.1 pωl−1 [j] ← 08 for j ← 0 to �τn� − 1 do

8.1 pωl−1 [X(S)jωl−1 ] ← pωl−1 [X

(S)jωl−1 ] + 1

�τn�9 hωl−1 ← −∑nd−1

k=0 pωl−1 [k]ln(pωl−1 [k])10 a[e] ← a[l − 1]11 for i ← 0 to l − 2 do

11.1 for j ← 0 to nd − 1 do11.1.1 pa[i][j] ← 0

11.2 for j ← 0 to �τn� − 1 do

11.2.1 pa[i][X(S)ja[i] ] ← pa[i][X

(S)ja[i] ] + 1

�τn�11.3 ha[i] ← −∑nd−1

k=0 pa[i][k]ln(pa[i][k])11.4 bt[a[i]] ← ωl−1

11.5 pbt[a[i]]a[i] ← new array of real in 2 dimensions with size nd×nd

11.6 pa[i]bt[a[i]] ← new array of real in 2 dimensions with size nd×nd

11.7 for j0 ← 0 to nd − 1 do11.7.1 for j1 ← 0 to nd − 1 do

11.7.1.1 pbt[a[i]]a[i][j0, j1] ← 011.7.1.2 pa[i]bt[a[i]][j1, j0] ← 0

11.8 for j ← 0 to �τn� − 1 do

11.8.1 pbt[a[i]]a[i][X(S)jbt[a[i]],X

(S)ja[i] ] ← pbt[a[i]]a[i][X

(S)jbt[a[i]],X

(S)ja[i] ] + 1

�τn�11.8.2 pa[i]bt[a[i]][X

(S)ja[i] ,X

(S)jbt[a[i]]] ← pa[i]bt[a[i]][X

(S)ja[i] ,X

(S)jbt[a[i]]] + 1

�τn�11.9 hbt[a[i]]a[i] ← −∑nd−1

k0=0

∑nd−1k1=0 pbt[a[i]]a[i][k0, k1]ln(pbt[a[i]]a[i][k0, k1])

11.10 ha[i]bt[a[i]] ← hbt[a[i]]a[i]


12 for i ← l − 2 downto 0 do12.1 e ← arg maxj{ha[j] + hbt[a[j]] − ha[j]bt[a[j]] | j ∈ {0, 1, . . . , i}}12.2 ωi ← a[e]12.3 π(ωi) ← bt[ωi]12.4 a[e] ← a[i]12.5 for j ← 0 to i − 1 do

12.5.1 pωia[j] ← new array of real in 2 dimensions with size nd × nd

12.5.2 pa[j]ωi← new array of real in 2 dimensions with size nd × nd

12.5.3 for j0 ← 0 to nd − 1 do12.5.3.1 for j1 ← 0 to nd − 1 do

12.5.3.1.1 pωia[j][j0, j1] ← 012.5.3.1.2 pa[j]ωi

[j1, j0] ← 012.5.4 for k ← 0 to �τn� − 1 do

12.5.4.1 pωia[j][X(S)kωi ,X

(S)ka[j] ] ← pωia[j][X

(S)kωi ,X

(S)ka[j] ] + 1

�τn�12.5.4.2 pa[j]ωi

[X(S)ka[j] ,X

(S)kωi ] ← pa[j]ωi

[X(S)ka[j] ,X

(S)kωi ] + 1

�τn�12.5.5 hωia[j] ← −∑nd−1

k0=0

∑nd−1k1=0 pωia[j][k0, k1]ln(pωia[j][k0, k1])

12.5.6 ha[j]ωi← hωia[j]

12.5.7 Ibest ← ha[j] + hbt[a[j]] − ha[j]bt[a[j]]

12.5.8 Iadd ← ha[j] + hωi− ha[j]ωi

12.5.9 if Ibest < Iadd then12.5.9.1 bt[a[j]] ← ωi

13 for i ← 0 to l − 1 do13.1 if |π(ωi)| > 0 then

13.1.1 bωi← pωiπ(ωi)0

13.1.2 bpωi

← pπ(ωi)0

13.2 else13.2.1 bωi

← pωi

14 return((π,ω))

Note that from algorithm DiTreeSea we have again that we do not need to recompute the arrays b_i and b^p_i in algorithm DiEst.

The third and final search procedure is the BOA approach. As noted in section 3.4, the general algorithm consists of more than one case, depending on the value of κ. This means that, first of all, we may present the general algorithm DiGraphSea for the case of discrete random variables:

DiGraphSea(κ)
1   if κ = 0 then
1.1     (π,ω) ← UnivariateDistribution()
2   else if κ = 1 then
2.1     (π,ω) ← DiGraphSeaExact()
3   else then
3.1     (π,ω) ← DiGraphSeaGreedy(κ)
4   return((π,ω))

When κ = 0, the result is simply the univariate distribution, for which we had already defined algorithm UnivariateDistribution. When κ = 1, we noted in section 3.4 that we can use Edmond's algorithm. Edmond's algorithm requires each arc (i, j) in the graph to have assigned to it a positive real value c((i, j)) as defined in appendix B. The optimum branching is then found with respect to these values. This optimum branching is in our sense a PDS graph where each node has at most 1 parent, meaning that the π(·) vector for all nodes has length at most 1 (∀_{i∈{0,1,...,l−1}} ⟨|π(i)| ≤ 1⟩). Note that Edmond's algorithm computes a maximum branching, whereas we proposed in section 3.4 to minimize the entropy measure. In order to be able to maximize and to have all positive values, we maximize the negative entropy, which is equivalent to minimizing the entropy measure. As the PDS that results holds no information on the measure or metric used, this is completely irrelevant to the result. For Edmond's algorithm, we require only positive real values. To this end, we go over the metric for each of the arcs and find the minimum value. By subtracting this minimum value from the value for each arc, the value c((i, j)) becomes a real value larger than or equal to 0. To get only positive values as required, we additionally add 1 to every value. We can now specify algorithm DiGraphSeaExact that employs Edmond's optimum branching algorithm to find the PDS in the case of discrete random variables:

DiGraphSeaExact()1 define type edarc as (stack of integer, stack of integer, stack of real)2 vp ← new array of vector of integer with size l3 V ← new vector of integer4 A ← new vector of edarc5 for i ← 0 to l − 1 do

5.1 pi ← new array of real with size nd

5.2 for j ← 0 to nd − 1 do5.2.1 pi[j] ← 0

5.3 for j ← 0 to �τn� − 1 do

5.3.1 pi[X(S)ji ] ← pi[X

(S)ji ] + 1

�τn�5.4 hi ← −∑nd−1

k=0 pi[k]ln(pi[k])6 for i ← 0 to l − 1 do

6.1 for j ← i + 1 to l − 1 do6.1.1 pij ← new array of real in 2 dimensions with size nd × nd

6.1.2 pji ← new array of real in 2 dimensions with size nd × nd

6.1.3 for j0 ← 0 to nd − 1 do6.1.3.1 for j1 ← 0 to nd − 1 do

6.1.3.1.1 pij [j0, j1] ← 06.1.3.1.2 pji[j1, j0] ← 0

6.1.4 for k ← 0 to �τn� − 1 do

6.1.4.1 pij [X(S)ki ,X

(S)kj ] ← pij [X

(S)ki ,X

(S)kj ] + 1

�τn�6.1.4.2 pji[X

(S)kj ,X

(S)ki ] ← pji[X

(S)kj ,X

(S)ki ] + 1

�τn�6.2 for j ← 0 to l − 1 do

6.2.1 if i �= j then

6.2.1.1 hji ← −∑nd−1k0=0

∑nd−1k1=0 pji[k0, k1]ln(pji[k0, k1])

6.2.1.2 c ← hji − hi

6.2.1.3 Ss ← new stack of integer6.2.1.4 St ← new stack of integer6.2.1.5 Sc ← new stack of real6.2.1.6 push(Ss, i)6.2.1.7 push(St, j)6.2.1.8 push(Sc,−c)6.2.1.9 A|A| ← (Ss, St, Sc)

7 γ ← min(Ss,St,Sc)∈A{top(Sc)}8 for i ← 0 to |A| − 1 do

8.1 (Ss, St, Sc) ← Ai

8.2 c ← pop(Sc)8.3 c ← c + 1 − γ8.4 push(Sc, c)

9 B ← GraphSeaOptimumBranching(V ,A,l)10 for i ← 0 to |B| − 1 do

10.1 (Ss, St, Sc) ← Bi

10.2 s ← pop(Ss)10.3 t ← pop(St)10.4 vp[t]|vp[t]| ← s


11 (π,ω) ← GraphSeaTopologicalSort(vp)12 for i ← 0 to l − 1 do

12.1 if |π(ωi)| > 0 then12.1.1 bωi

← pωiπ(ωi)0

12.1.2 bpωi

← pπ(ωi)0

12.2 else12.2.1 bωi

← pωi

13 return((π,ω))

Again, we find from algorithm DiGraphSeaExact that we do not need to recompute the arrays b_i and b^p_i in algorithm DiEst. Algorithm DiGraphSeaExact calls other algorithms to compute the optimum branching. These algorithms are independent of the type of density model that is used and concern only the finding of the optimum branching, given c(i, j). Therefore, we need to state them only once. The additional algorithms can be found in appendix E. The running time for algorithm DiGraphSeaExact is O(l³ + l²n_d² + l²τn), which is subsumed by the running time of the greedy graph search algorithm for the case when κ > 1.
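The transformation of the entropy-based costs into the strictly positive weights that the optimum branching routine maximizes can be sketched as follows. This is a hedged illustration in Python; the dictionary-based interface is ours and cost[(i, j)] stands for the conditional entropy attached to the arc from i to j.

def branching_weights(cost):
    # Negate the costs (minimizing entropy becomes maximizing negative entropy)
    # and shift all values so that the smallest weight becomes exactly 1,
    # i.e. subtract the minimum and add 1, as done before calling the
    # optimum branching algorithm.
    weights = {arc: -c for arc, c in cost.items()}
    shift = 1.0 - min(weights.values())
    return {arc: w + shift for arc, w in weights.items()}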

We conclude the summary of the search algorithms for discrete random variables by presenting the greedy graph search algorithm for the case when κ > 1. The outline for the algorithm was given in section 3.4. Combining that outline with the discrete entropy we have already used in the above search algorithms, we can now state algorithm DiGraphSeaGreedy:

DiGraphSeaGreedy(κ)1 a ← new array of boolean in 2 dimensions with size l × l2 vp, vs ← 2 new arrays of vector of integer with size l3 hn, c ← 2 new arrays of real in 2 dimensions with size l × l4 ho ← new array of real with size l5 for i ← 0 to l − 1 do

5.1 pp0i ← new array of real with size nd

5.2 for j ← 0 to nd − 1 do5.2.1 pp

0i[j] ← 05.3 for j ← 0 to �τn� − 1 do

5.3.1 pp0i[X

(S)jωl−1 ] ← pp

0i[X(S)jωl−1 ] + 1

�τn�5.4 for q ← 1 to l − 1 do

5.4.1 ppqi ← new array of real with size nd

5.4.2 for j ← 0 to nd − 1 do5.4.2.1 pp

qi[j] ← pp0i[j]

5.5 ho[i] ← −∑nd−1k=0 pp

0i[k]ln(pp0i[k])

5.6 bi ← pp0i

6 for i ← 0 to l − 1 do6.1 for j ← i + 1 to l − 1 do

6.1.1 pij ← new array of real in 2 dimensions with size nd × nd

6.1.2 pji ← new array of real in 2 dimensions with size nd × nd

6.1.3 for j0 ← 0 to nd − 1 do6.1.3.1 for j1 ← 0 to nd − 1 do

6.1.3.1.1 pij [j0, j1] ← 06.1.3.1.2 pji[j1, j0] ← 0

6.1.4 for k ← 0 to �τn� − 1 do

6.1.4.1 pij [X(S)ki ,X

(S)kj ] ← pij [X

(S)ki ,X

(S)kj ] + 1

�τn�6.1.4.2 pji[X

(S)kj ,X

(S)ki ] ← pji[X

(S)kj ,X

(S)ki ] + 1

�τn�


6.2 for j ← 0 to l − 1 do6.2.1 if i �= j then

6.2.1.1 a[i, j] ← true

6.2.1.2 hn[i, j] ← −∑nd−1k0=0

∑nd−1k1=0 pji[k0, k1]ln(pji[k0, k1])

6.2.1.3 hn[i, j] ← hn[i, j] − ho[i]6.2.1.4 c[i, j] ← hn[i, j] − ho[j]

6.3 a[i, i] ← false7 γ ← l2 − l8 while γ > 0 do

8.1 (v0, v1) ← arg min(i,j){c[i, j] | a[i, j] ∧ (i, j) ∈ {0, 1, . . . , l − 1}2}8.2 if c[v0, v1] > 0 then

8.2.1 breakwhile8.3 γ ← γ − GraphSeaArcAdd(κ, v0, v1, a, vp, vs)8.4 ho[v1] ← hn[v0, v1]8.5 bp

v1← pp

v1v0

8.6 bv1 ← pv1v0

8.7 for i ← 0 to l − 1 do8.7.1 if a[i, v1] then

8.7.1.1 pv1i ← new array of real in |vp[v1]| + 2 dimensionswith size nd × nd × . . . × nd

8.7.1.2 ppv1i ← new array of real in |vp[v1]| + 1 dimensions

with size nd × nd × . . . × nd

8.7.1.3 for (j0, j1, . . . , j|vp[v1]|+1) ← (0, 0, . . . , 0)to (nd − 1, nd − 1, . . . , nd − 1) in crossproduct do

8.7.1.3.1 pv1i[j0, j1, . . . , j|vp[v1]|+1] ← 08.7.1.4 for (j0, j1, . . . , j|vp[v1]|) ← (0, 0, . . . , 0)

to (nd − 1, nd − 1, . . . , nd − 1) in crossproduct do8.7.1.4.1 pp

v1i[j0, j1, . . . , j|vp[v1]|] ← 08.7.1.5 for j ← 0 to �τn� − 1 do

8.7.1.5.1 k0 ← X(S)jv1

8.7.1.5.2 kq ← X(S)jvp[v1]q−1

(∀q∈{1,2,...,|vp[v1]|})

8.7.1.5.3 k|vp[v1]|+1 ← X(S)ji

8.7.1.5.4 pv1i[k0, k1, . . . , k|vp[v1]|+1] ←pv1i[k0, k1, . . . , k|vp[v1]|+1] + 1

�τn�8.7.1.5.5 pp

v1i[k1, k2, . . . , k|vp[v1]|+1] ←pv1i[k1, k2, . . . , k|vp[v1]|+1] + 1

�τn�8.7.1.6 hn[i, j] ← −∑nd−1

k0=0

∑nd−1k1=0 . . .

∑nd−1k|vp[v1]|+1

pv1i[k0, k1, . . . , k|vp[v1]|+1]ln(pv1i[k0, k1, . . . , k|vp[v1]|+1])8.7.1.7 hn[i, j] ← hn[i, j] − (−∑nd−1

k0=0

∑nd−1k1=0 . . .

∑nd−1k|vp[v1]|

ppv1i[k0, k1, . . . , k|vp[v1]|]ln(pp

v1i[k0, k1, . . . , k|vp[v1]|]))8.7.1.8 c[i, j] ← hn[i, v1] − ho[v1]

9 return(GraphSeaTopologicalSort(vp))

Also in the general graph search algorithm DiGraphSeaGreedy, which runs in O(l³κ + l²n_d + l²τn + lκn_d^{κ+1} + lκ²τn) time, we find that we do not need to recompute the arrays b_i and b^p_i in algorithm DiEst. The reason for this in all cases is that the information used to guide the search is largely the same information that is used to build the estimates. In the algorithm, we have called algorithms GraphSeaArcAdd and GraphSeaTopologicalSort. Just as was the case for the exact algorithm for κ = 1, these subalgorithms are independent of the type of density model that is used. More detailed information on these additional algorithms can be found in appendix E.


Figure 4: A uniform and a clustered data set over Y0 and Y1 with DY = [−10, 10].

To complete this section on the algorithms for discrete random variables, we refer to the articles that presented the original algorithms for the results obtained with them. Given the general framework of IDEA and the implementation of the original algorithms, it will probably immediately become clear how a similar algorithm within the IDEA framework would perform.

5 IDEAs for continuous random variables

Having noted the type of information that is required in previously created algorithms in section 3 and having presented the algorithms for discrete random variables in section 4, we are ready to present IDEA strategies in which density estimation for continuous random variables is performed. Density estimation can be seen as model fitting where, given the structure from equation 7, we have to estimate the individual probabilities in the product. In general, we find from probability theory that we can distinguish three important classes of density estimation models: parametric, non–parametric and mixture models. None of these have ever been explicitly mentioned in previous work on algorithms for discrete random variables. A reason for this might be that the frequency count approach is the most straightforward and most precise one in the case of discrete random variables. The distinction between the three classes is classically found in continuous models.

In this section, we concentrate on the continuous model. To provide additional insight into the distribution estimations, we give examples of the resulting density function for two data sets. In both cases we have only two dimensions with DY = [−10, 10]. The data sets are plotted in figure 4, which shows a uniform distribution on the left and a clustered distribution on the right. We shall make figures for the joint distribution that was estimated by using a certain density model. In addition, we shall also make figures for the univariate distribution over both variables Y0 and Y1.

In this paper, we present two approaches with different density models. One approach is from the class of parametric distributions and uses the normal distribution. This approach is presented in section 5.1. The other approach is from the class of non–parametric distributions and uses the histogram distribution, which is closely related to the empiric probabilities for discrete random variables from section 4. The histogram approach is presented in section 5.2. In section 6, some preliminary work and ideas are presented for another model in the class of non–parametric distributions, namely the normal kernel distribution. Also, the relation of that kernel distribution to the normal mixture model distribution, which belongs to the class of mixture models, is presented. Finally, we note again, as we did for the case of discrete random variables, that the running times of the algorithms are stated in section 6.


Figure 5: Fitting the univariate data from the uniform data set with a normal distribution parametric model.

5.1 Parametric model: normal distribution

The first type of distribution estimation model we discuss is the parametric model. In this model, a density function as stated in equation 2 is given beforehand. This means that the general form of the density function is fixed. The goal is to set the parameters for the density function in such a way that the result represents the distribution of the data points as well as possible.

The most widely used parametric density function for real spaces is the density function that underlies the normal distribution, which is a normalized Gaussian. It follows from section 2.2 that we require the conditional density function for one variable, conditioned on the other variables. In appendix C it is shown that this density function for one variable Y0 conditioned on n other variables Y1, Y2, . . . , Yn equals:

f(y0 | y1, y2, . . . , yn) = g(y0, µ0, σ0)    (19)

where

g(y, µ, σ) = (1 / (σ√(2π))) e^{−(y−µ)² / (2σ²)}

σ0 = 1 / √(σ′_{00})

µ0 = (µ0 σ′_{00} − Σ_{i=1}^{n} (y_i − µ_i) σ′_{i0}) / σ′_{00}

In equation 19, we use the notation σ′_{ij} = (Σ−1)(i, j), with Σ the nonsingular covariance matrix over the variables Y0, Y1, . . . , Yn. The density function over the variables taken univariately for the two data sets is plotted in figures 5 and 6. Note that for the uniform data set, the centers are placed just about over 0 as expected, whereas this is not completely the case for the clustered data set. Still, it is clear that there is no great distinction between the two estimated distributions, which shows the obvious deficit of this parametric model in modelling clusters. This can be observed again from the joint distributions as depicted in figure 7. The distributions are quite identical. The difference lies in that the joint distribution over the clustered data set is somewhat skewed and lies more around the diagonal from (−10,−10) to (10, 10) instead of being completely uniform in all directions. By observing the clustered data set in figure 4 again, this can indeed be seen to be slightly the trend for this data set. So even though this model is able to represent the general features of the data, it fails to represent the specific features.

In order to sample from equation 19, we need to compute µ0 and σ0. It becomes clear that to this end, we require the inverse of the covariance matrix, as we need values from Σ−1. Computing the inverse requires O(n³) computing time, where n × n is the size of the matrix to be inverted.


Figure 6: Fitting the univariate data from the clustered data set with a normal distribution parametric model.

Figure 7: Fitting the joint data from the uniform (left) and clustered (right) data sets with a normal distribution parametric model.


However, we observe from equation 19 that we only require entries from the first column of the inverse matrix, meaning that we have to solve the linear system of equations:

Σ_{i=0}^{n−1} σ_{0i} σ′_{i0} = 1   ∧   ∀_{j∈{1,2,...,n−1}} ⟨ Σ_{i=0}^{n−1} σ_{ji} σ′_{i0} = 0 ⟩    (20)

We shall use matrix notation in the remainder of this paper to write systems such as the one from equation 20. The use of matrix notation gives better insight into the linear system to be solved and eases both notation and implementation. In the case of equation 20, we may write:

[ σ_{00}      σ_{01}      · · ·  σ_{0(n−1)}      ] [ σ′_{00}      ]   [ 1 ]
[ σ_{10}      σ_{11}      · · ·  σ_{1(n−1)}      ] [ σ′_{10}      ]   [ 0 ]
[ σ_{20}      σ_{21}      · · ·  σ_{2(n−1)}      ] [ σ′_{20}      ] = [ 0 ]   ≡   Σ(Σ−1(∗, 0)) = I(∗, 0)    (21)
[    ⋮            ⋮                   ⋮          ] [     ⋮        ]   [ ⋮ ]
[ σ_{(n−1)0}  σ_{(n−1)1}  · · ·  σ_{(n−1)(n−1)}  ] [ σ′_{(n−1)0}  ]   [ 0 ]

Solving this system of equations unfortunately also takes O(n³) time, but it can be done faster than computing the complete inverse, especially when the dimension of the matrix becomes rather large. The above linear system of equations is of the form Σ(Σ−1(∗, 0)) = I(∗, 0). The matrix Σ is the non–singular symmetric covariance matrix, where σ_{ij} = Σ(i, j) and σ′_{ij} = Σ−1(i, j). In the algorithms that follow, we shall first have to construct Σ. At this point, we must note that this matrix concerns certain variables and that so far, we have not pointed out exactly which these variables are. We therefore introduce the following notation:

Σ⟨j_0, j_1, j_2, . . . , j_{n−1}⟩ =

[ σ_{j_0 j_0}      σ_{j_0 j_1}      · · ·  σ_{j_0 j_{n−1}}      ]
[ σ_{j_1 j_0}      σ_{j_1 j_1}      · · ·  σ_{j_1 j_{n−1}}      ]
[ σ_{j_2 j_0}      σ_{j_2 j_1}      · · ·  σ_{j_2 j_{n−1}}      ]
[      ⋮                ⋮                       ⋮               ]
[ σ_{j_{n−1} j_0}  σ_{j_{n−1} j_1}  · · ·  σ_{j_{n−1} j_{n−1}}  ]

σ_{ij} (= σ_{ji}) = ( Σ_{k=0}^{⌊τn⌋−1} (Y^{(S)k}_i − µ_i)(Y^{(S)k}_j − µ_j) ) / ⌊τn⌋    (22)

It should be clear that by using equation 22, we have that Σ⟨j_0, j_1, . . . , j_{n−1}⟩(i, k) = σ_{j_i j_k}. In this way, we have filled the positions of the matrix Σ that we use in equation 21 with the actual covariances σ_{ij}. Just as we did for the entropy in the univariate and bivariate case as stated in section 3, we shall assume in the pseudo–code that µ_i (∀_{i∈{0,1,...,l−1}}) and σ_{ij} (∀_{(i,j)∈{0,1,...,l−1}²}) are global variables, for which we do not allocate additional memory in the algorithms. Using these definitions, we can now specify algorithm RePaNoEst to estimate the density functions once the structure is known:

RePaNoEst()1 m ← new array of boolean in 2 dimensions with size l × l2 for i ← 0 to l − 1 do

2.1 for j ← 0 to l − 1 do2.1.1 m[i, j] ← false

2.2 m[i, i] ← true3 for i ← 0 to l − 1 do

3.1 µi ←∑ �τn�−1

k=0 Y(S)k

i

�τn�


4 for i ← 0 to l − 1 do

4.1 σii ←∑ �τn�−1

k=0 (Y(S)k

i −µi)2

�τn�4.2 for j ← 0 to |π(i)| − 1 do

4.2.1 if ¬m[i, π(i)j ] then

4.2.1.1 σiπ(i)j←

∑ �τn�−1k=0 (Y

(S)ki −µi)(Y

(S)k

π(i)j−µπ(i)j

)

�τn�4.2.1.2 σπ(i)ji ← σiπ(i)j

4.2.1.3 m[i, π(i)j ] ← true4.2.1.4 m[π(i)j , i] ← true

4.2.2 for q ← 0 to j − 1 do4.2.2.1 if ¬m[π(i)q, π(i)j ] then

4.2.2.1.1 σπ(i)qπ(i)j←

∑ �τn�−1k=0 (Y

(S)k

π(i)q−µπ(i)q )(Y

(S)k

π(i)j−µπ(i)j

)

�τn�4.2.2.1.2 σπ(i)jπ(i)q

← σπ(i)qπ(i)j

4.2.2.1.3 m[π(i)q, π(i)j ] ← true4.2.2.1.4 m[π(i)j , π(i)q] ← true

5 for i ← 0 to l − 1 do5.1 Find σ′

∗0 in O((|π(ωi)| + 1)3) fromΣ〈ωi, π(ωi)0, π(ωi)1, . . . , π(ωi)|π(ωi)|−1〉(Σ−1(∗, 0)) = I(∗, 0)

5.2 σωi← 1√

σ′00

5.3 µωi← µωi

σ′00−

∑ |π(ωi)|−1k=0 (yπ(ωi)k

−µπ(ωi)k)σ′

(k+1)0

σ′00

5.4 P (Yωi|Yπ(ωi)0 , Yπ(ωi)1 , . . . , Yπ(ωi)|π(ωi)|−1

) ← 1σωi

√2π

e−(yωi

−µωi)2

2(σωi)2

6 return(P (·))

Step 5.1 of algorithm RePaNoEst requires the polynomially bounded time of O((|π(ω_i)| + 1)³) for solving the linear system of equations. This means that if we specify a complete acyclic PDS (which models the complete joint probability), we still have a polynomially bounded algorithm. That algorithm is bounded by Σ_{i=0}^{l−1} O(i³) = O(l⁴). Note that we mentioned earlier in section 2.1 that for the complete joint probability distribution in the case of discrete random variables, the computational requirements would be an exponentially growing function. The reason why this is not the case here is that we are using an estimation instead of counting frequencies for all possible combinations. Indeed, in the case of using histograms, we do end up with an exponentially growing function, as we shall see in section 5.2.
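A brief sketch of this observation, using NumPy as one possible linear solver (our choice; the pseudo-code leaves the solver unspecified): only the first column of the inverse covariance matrix is computed, via a single linear solve rather than a full inversion.

import numpy as np

def first_inverse_column(cov):
    # Solve Sigma x = e_0, so that x[k] corresponds to sigma'_{k0} in equation 21.
    e0 = np.zeros(cov.shape[0])
    e0[0] = 1.0
    return np.linalg.solve(cov, e0)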

Steps 5.2 and 5.3 in algorithm RePaNoEst compute the σ_{ω_i} and µ_{ω_i} parameters for the one–dimensional normal distribution to sample from. In the expression for µ_{ω_i} however, we have variables y_{π(ω_i)_k}. These are function variables in accordance with equation 2. This means that µ_{ω_i} cannot actually be computed until values for those y_{π(ω_i)_k} are given. This is exactly according to the definition of conditionality, where the probability density function is specified given values for the conditional variables. Values for these conditional variables are given when we sample a new vector. This means that when sampling from the distribution generated by algorithm RePaNoEst, we must first actually compute µ_{ω_i}.

Sampling from the distribution thus made requires being able to sample from a normal distribution. Sampling from a "standard" normal distribution Y_i ∼ N(0, 1) (µ = 0, σ = 1) is mostly available, so we shall assume to have an algorithm for this. From the form of the normal distribution it follows that if we set Y′_i = µ + σY_i, variable Y′_i is normally distributed with parameters µ and σ (Y′_i ∼ N(µ, σ²)). We call this one–line algorithm sampleNormalDistribution(µ, σ). Using this primitive, we can now sample a vector Y = (Y0, Y1, . . . , Y_{l−1}) from the distribution made by algorithm RePaNoEst() as follows²:

²Note that the σ′ values here refer to the inverse of the covariance matrix for the variables with indices {ω_i} ∪ π(ω_i), so we need to have stored these values globally in an actual implementation of RePaNoEst().


RePaNoSam()
1   for i ← l − 1 downto 0 do
1.1     µ_{ω_i} ← (µ_{ω_i} σ′_{00} − Σ_{k=0}^{|π(ω_i)|−1} (Y_{π(ω_i)_k} − µ_{π(ω_i)_k}) σ′_{(k+1)0}) / σ′_{00}
1.2     Y_{ω_i} ← sampleNormalDistribution(µ_{ω_i}, σ_{ω_i})

A special case of using the normal distribution parametric model is the use of univariate probabilities. This is modelled by algorithm UnivariateDistribution, which we introduced earlier in section 4. Using the univariate distribution gives an approach that is very similar to the approach named PBILC [29]. All variables are regarded independently of one another.
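Continuing the earlier sketch of the linear solve, conditional sampling according to equation 19 can be illustrated as follows. This is again an assumption-laden illustration: mu is the vector of means ordered as (Y_0, Y_1, . . . , Y_n), sigma_prime_col0 is the first column of the inverse covariance matrix (for instance obtained with first_inverse_column above), and random.gauss supplies the normal sampling primitive.

import random

def sample_conditional_normal(mu, sigma_prime_col0, parent_values):
    # sigma_0 = 1 / sqrt(sigma'_00); the conditional mean is shifted by the
    # deviations of the observed parents from their means, as in RePaNoSam.
    sigma0 = 1.0 / sigma_prime_col0[0] ** 0.5
    shift = sum((y - m) * sp
                for y, m, sp in zip(parent_values, mu[1:], sigma_prime_col0[1:]))
    mu0 = mu[0] - shift / sigma_prime_col0[0]
    return random.gauss(mu0, sigma0)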

As we did in the discrete case, we shall now derive the application of the normal distribution, as an instance of the parametric distribution class, in the search procedures. In order to implement the framework of MIMIC in our parametric setting, we require the differential entropy for the normal distribution. It has been shown [8] that the joint entropy over n normally distributed variables y0, y1, . . . , yn−1 can be written as follows:

h(y0, y1, . . . , yn−1) = ½ (n + ln((2π)^n det Σ))    (23)

It is clear that the entropy of a normally distributed random variable is completely determined by the value of σ. Using the above differential entropy information, we can specify algorithm RePaNoChainSea, which uses the greedy algorithm employed in MIMIC to pick the chain of conditional dependencies, and algorithm RePaNoTreeSea, which builds the tree of conditional dependencies:

RePaNoChainSea()1 a ← new array of integer with size l2 for i ← 0 to l − 1 do

2.1 a[i] ← i3 e ← 04 ωl−1 ← a[e]5 for i ← 0 to l − 1 do

5.1 µa[i] ←∑ �τn�−1

k=0 Y(S)k

a[i]

�τn�

5.2 σa[i]a[i] ←∑ �τn�−1

k=0 (Y(S)k

a[i] −µa[i])2

�τn�5.3 ha[i] ← 1

2 (1 + ln(2πσa[i]a[i]))5.4 if ha[i] < hωl−1 then

5.4.1 ωl−1 ← a[i]5.4.2 e ← i

6 π(ωl−1) ← ∅7 a[e] ← a[l − 1]8 for i ← l − 2 downto 0 do

8.1 e ← 08.2 ωi ← a[e]8.3 for j ← 0 to i do

8.3.1 σωi+1a[j] ←∑ �τn�−1

k=0 (Y (S)kωi+1

−µωi+1 )(Y(S)k

a[j] −µa[j])

�τn�8.3.2 σa[j]ωi+1 ← σωi+1a[j]

8.3.3 hωi+1a[j] ← 12 (2 + ln(4π2(σωi+1ωi+1 σa[j]a[j] − σωi+1a[j]σa[j]ωi+1)))

8.3.4 ha[j]ωi+1 ← hωi+1a[j]

8.3.5 if ha[j]ωi+1 − hωi+1 < hωiωi+1 − hωi+1 then8.3.5.1 ωi ← a[j]8.3.5.2 e ← j

8.4 π(ωi) ← ωi+1

8.5 a[e] ← a[i]9 return((π,ω))


RePaNoTreeSea()1 a ← new array of integer with size l2 for i ← 0 to l − 1 do

2.1 a[i] ← i3 bt ← new array of integer with size l − 14 e ←RandomNumber(l)5 ωl−1 ← e6 π(ωl−1) ← ∅7 µωl−1 ←

∑ �τn�−1k=0 Y (S)k

ωl−1�τn�

8 σωl−1ωl−1 ←∑ �τn�−1

k=0 (Y (S)kωl−1

−µωl−1 )2

�τn�9 hωl−1 ← 1

2 (1 + ln(2πσωl−1ωl−1))10 a[e] ← a[l − 1]11 for i ← 0 to l − 2 do

11.1 µa[i] ←∑ �τn�−1

k=0 Y(S)k

a[i]

�τn�

11.2 σa[i]a[i] ←∑ �τn�−1

k=0 (Y(S)k

a[i] −µa[i])2

�τn�11.3 ha[i] ← 1

2 (1 + ln(2πσa[i]a[i]))11.4 bt[a[i]] ← ωl−1

11.5 σbt[a[i]]a[i] ←∑ �τn�−1

k=0 (Y(S)k

bt[a[i]]−µbt[a[i]])(Y

(S)k

a[i] −µa[i])

�τn�11.6 σa[i]bt[a[i]] ← σbt[a[i]]a[i]

11.7 hbt[a[i]]a[i] ← 12 (2 + ln(4π2(σbt[a[i]]bt[a[i]]σa[i]a[i] − σbt[a[i]]a[i]σa[i]bt[a[i]])))

11.8 ha[i]bt[a[i]] ← hbt[a[i]]a[i]

12 for i ← l − 2 downto 0 do12.1 e ← arg maxj{ha[j] + hbt[a[j]] − ha[j]bt[a[j]] | j ∈ {0, 1, . . . , i}}12.2 ωi ← a[e]12.3 π(ωi) ← bt[ωi]12.4 a[e] ← a[i]12.5 for j ← 0 to i − 1 do

12.5.1 σωia[j] ←∑ �τn�−1

k=0 (Y (S)kωi

−µωi)(Y

(S)k

a[j] −µa[j])

�τn�12.5.2 σa[j]ωi

← σωia[j]

12.5.3 hωia[j] ← 12 (2 + ln(4π2(σωiωi

σa[j]a[j] − σωia[j]σa[j]ωi)))

12.5.4 ha[j]ωi← hωia[j]

12.5.5 Ibest ← ha[j] + hbt[a[j]] − ha[j]bt[a[j]]

12.5.6 Iadd ← ha[j] + hωi− ha[j]ωi

12.5.7 if Ibest < Iadd then12.5.7.1 bt[a[j]] ← ωi

13 return((π,ω))

It follows from both algorithm RePaNoChainSea and algorithm RePaNoTreeSea that we do not need to recompute the values for µ_i and σ_{ij} in algorithm RePaNoEst if one of these search procedures is employed. This follows from the definition of algorithm IDEA for the same reasons as in the case of discrete random variables in section 4.
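The univariate and bivariate differential entropy terms used by these two search procedures follow directly from equation 23; a small illustrative sketch (in Python, with self-chosen function names) is:

import math

def normal_entropy_1d(var):
    # (1/2)(1 + ln(2 * pi * variance))
    return 0.5 * (1.0 + math.log(2.0 * math.pi * var))

def normal_entropy_2d(var_x, var_y, cov_xy):
    # (1/2)(2 + ln((2*pi)^2 * det(Sigma))) with det = var_x * var_y - cov_xy^2
    det = var_x * var_y - cov_xy * cov_xy
    return 0.5 * (2.0 + math.log(4.0 * math.pi * math.pi * det))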

It remains to specify the most involved approach, which is the general graph search approach. First of all, we present the general algorithm RePaNoGraphSea for the real parametric case using a normal distribution, which distinguishes between the different cases of the amount of allowed parents κ:


RePaNoGraphSea(κ)
1   if κ = 0 then
1.1     (π,ω) ← UnivariateDistribution()
2   else if κ = 1 then
2.1     (π,ω) ← RePaNoGraphSeaExact()
3   else then
3.1     (π,ω) ← RePaNoGraphSeaGreedy(κ)
4   return((π,ω))

In section 4, we already elaborated on the implementation of Edmond's algorithm for the case when κ = 1. Because the entropy measures are the same as in the case of algorithms RePaNoChainSea and RePaNoTreeSea, since in all cases only a single parent is allowed, the computation of the cost of each arc is now straightforward. The remainder of the search procedure consists of calling the optimum branching algorithm and deriving the PDS from the results so obtained. Referring to appendix E for the subalgorithms used, we can therefore now present algorithm RePaNoGraphSeaExact:

RePaNoGraphSeaExact()1 define type edarc as(stack of integer, stack of integer, stack of real)2 vp ← new array of vector of integer with size l3 V ← new vector of integer4 A ← new vector of edarc5 for i ← 0 to l − 1 do

5.1 µi ←∑ �τn�−1

k=0 Y(S)k

i

�τn�5.2 σii ←

∑ �τn�−1k=0 (Y

(S)ki −µi)

2

�τn�5.3 V|V | ← i

6 for i ← 0 to l − 1 do6.1 for j ← i + 1 to l − 1 do

6.1.1 σij ←∑ �τn�−1

k=0 (Y(S)k

i −µi)(Y(S)k

j −µj)

�τn�6.1.2 σji ← σij

6.2 for j ← 0 to l − 1 do6.2.1 if i �= j then

6.2.1.1 c ← 12 (2 + ln(4π2σjj σii − σjiσij)) − 1

2 (1 + ln(2πσii))6.2.1.2 Ss ← new stack of integer6.2.1.3 St ← new stack of integer6.2.1.4 Sc ← new stack of real6.2.1.5 push(Ss, i)6.2.1.6 push(St, j)6.2.1.7 push(Sc,−c)6.2.1.8 A|A| ← (Ss, St, Sc)

7 γ ← min(Ss,St,Sc)∈A{top(Sc)}8 for i ← 0 to |A| − 1 do

8.1 (Ss, St, Sc) ← Ai

8.2 c ← pop(Sc)8.3 c ← c + 1 − γ8.4 push(Sc, c)

9 B ← GraphSeaOptimumBranching(V ,A,l)10 for i ← 0 to |B| − 1 do

10.1 (Ss, St, Sc) ← Bi

10.2 s ← pop(Ss)10.3 t ← pop(St)10.4 vp[t]|vp[t]| ← s

11 return(GraphSeaTopologicalSort(vp))


When using algorithm RePaNoGraphSeaExact, the conditional entropy is computed for every arc (i, j) with i ≠ j. This requires computing all variables µ_i and σ_{ij}. This implies that again we do not need to recompute those variables when using algorithm RePaNoEst. The running time for algorithm RePaNoGraphSeaExact is O(l³ + l²τn), which is subsumed by the running time of the greedy graph search algorithm for the case when κ > 1.

To implement the greedy algorithm for the case when κ > 1, we note from the algorithm in section 3.4 that we have to update the entropy measures once an arc is added to the graph. As the entropy measure is a decomposable sum of values, it is of the form Σ_{i=0}^{l−1} α_i. Moreover, the α_i factors are independent of each other, meaning that when we add arc (v0, v1) to the graph, we only have to recompute α_{v1}.

If we now recall equation 23, the joint entropy is a function of the determinant of the covariance matrix of the involved variables. This covariance matrix for vertex v1 is enlarged by adding one additional row and column when we add an arc (v0, v1) to the graph. We can avoid recomputing the complete determinant by using an update rule. To this end, we note that we have the following from linear algebra:

det(Σ⟨j_0, j_1, . . . , j_n⟩) / det(Σ⟨j_0, j_1, . . . , j_{n−1}⟩) = 1 / Σ⟨j_0, j_1, . . . , j_n⟩^{−1}(n, n)    (24)

Therefore, by computing σ′_{∗n} from the linear system of equations Σ⟨j_0, j_1, . . . , j_n⟩(Σ−1(∗, n)) = I(∗, n), we get σ′_{nn}. If we have stored the determinant for the previous state of the vertex to which we are to add another parent in d_o, then by using equation 24 we can determine the new determinant as d_n = d_o / σ′_{nn}. Formalizing the remainder of the concepts from the algorithm in section 3.4, we can now present the greedy algorithm RePaNoGraphSeaGreedy:

RePaNoGraphSeaGreedy(κ)1 a ← new array of boolean in 2 dimensions with size l × l2 vp, vs ← 2 new arrays of vector of integer with size l3 hn, dn

f , dnp , c ← 4 new arrays of real in 2 dimensions with size l × l

4 ho, dof , do

p ← 3 new arrays of real with size l

5 for i ← 0 to l − 1 do

5.1 µi ←∑ �τn�−1

k=0 Y(S)k

i

�τn�5.2 σii ←

∑ �τn�−1k=0 (Y

(S)ki −µi)

2

�τn�5.3 do

f [i] ← σii

5.4 dop[i] ← 0

5.5 ho[i] ← 12 (1 + ln(2πdo

f [i]))6 for i ← 0 to l − 1 do

6.1 for j ← i + 1 to l − 1 do

6.1.1 σij ←∑ �τn�−1

k=0 (Y(S)k

i −µi)(Y(S)k

j −µj)

�τn�6.1.2 σji ← σij

6.2 for j ← 0 to l − 1 do6.2.1 if i �= j then

6.2.1.1 a[i, j] ← true6.2.1.2 dn

f [i, j] ← σjj σii − σjiσij

6.2.1.3 dnp [i, j] ← σii

6.2.1.4 hn[i, j] ← 12 (2 + ln(4π2dn

f [i, j])) − 12 (1 + ln(2πdn

p [i, j]))6.2.1.5 c[i, j] ← hn[i, j] − ho[j]

6.3 a[i, i] ← false7 γ ← l2 − l


8 while γ > 0 do8.1 (v0, v1) ← arg min(i,j){c[i, j] | a[i, j] ∧ (i, j) ∈ {0, 1, . . . , l − 1}2}8.2 if c[v0, v1] > 0 then

8.2.1 breakwhile8.3 γ ← γ − GraphSeaArcAdd(κ, v0, v1, a, vp, vs)8.4 do

f [v1] ← dnf [v0, v1]

8.5 dop[v1] ← dn

p [v0, v1]8.6 ho[v1] ← hn[v0, v1]8.7 for i ← 0 to l − 1 do

8.7.1 if a[i, v1] then8.7.1.1 Find σ′

∗(|vp[v1]|+2) in O((|vp[v1]| + 2)3) from

Σ〈v1, vp[v1]0, vp[v1]1, . . . , vp[v1]|vp[v1]|−1, i〉 ·

(Σ−1(∗, |vp[v1]| + 2)) = I(∗, |vp[v1]| + 2)8.7.1.2 dn

f [i, v1] ← dof [v1]/σ′

(|vp[v1]|+1)(|vp[v1]|+1)

8.7.1.3 hn[i, v1] ← 12 (|vp[v1]| + 2 + ln((2π)|v

p[v1]|+2dnf [i, v1]))

8.7.1.4 Find σ′∗(|vp[v1]|+1) in O((|vp[v1]| + 1)3) from

Σ〈vp[v1]0, vp[v1]1, . . . , vp[v1]|vp[v1]|−1, i〉 ·(Σ−1(∗, |vp[v1]| + 1)) = I(∗, |vp[v1]| + 1)

8.7.1.5 dnp [i, v1] ← do

p[v1]/σ′|vp[v1]||vp[v1]|

8.7.1.6 hn[i, v1] ← hn[i, v1]−12 (|vp[v1]| + 1 + ln((2π)|v

p[v1]|+1dnp [i, v1]))

8.7.1.7 c[i, j] ← hn[i, v1] − ho[v1]9 return(GraphSeaTopologicalSort(vp))

As was the case for the discrete random variables, we do not need to recompute the variables µ_i and σ_{ij} in any case of applying a search algorithm, because they will already have been computed.
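As an aside, the determinant update of equation 24 that RePaNoGraphSeaGreedy relies on can be sketched with a single linear solve (NumPy is our own choice of solver; det_old is assumed to be the stored determinant of the covariance matrix before the extra row and column were appended):

import numpy as np

def updated_determinant(cov_ext, det_old):
    # Last column of the inverse of the extended covariance matrix; its last
    # entry is sigma'_{nn}, and equation 24 gives det_new = det_old / sigma'_{nn}.
    n = cov_ext.shape[0] - 1
    e_last = np.zeros(cov_ext.shape[0])
    e_last[n] = 1.0
    col_last = np.linalg.solve(cov_ext, e_last)
    return det_old / col_last[n]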

5.2 Non–parametric model: histogram distribution

In a non–parametric model, the form of the distribution is not determined at the outset, as is the case for the parametric model. There are no parameters that need to be chosen to best fit a function over data points. Instead, a transformation of the data points constitutes the resulting density function. Note that we may still have parameters, but they do not solely determine the form of the fit. Examples of non–parametric models are histograms and kernel–based methods.

Of all models, the histogram is perhaps the most straightforward way to estimate a probability density. In the continuous case, the histogram can represent the distribution of given data points arbitrarily well by choosing an arbitrarily small bin width. In the case of discrete random variables, the histogram is actually what has been used implicitly so far in all discrete EDA approaches.

If we have to estimate the distribution of the data points Y^{(S)k}_i in a certain range [minY_i, maxY_i⟩, we specify r bins b_i[j], j ∈ {0, 1, . . . , r − 1}, that divide the range into r subsequent and equally sized subranges. Define rangeY_i = maxY_i − minY_i. The basis of the empiric probability that y falls into bin b_i[j] is a frequency count:

b_i[j] = |{ Y^{(S)k}_i | k ∈ {0, 1, . . . , ⌊τn⌋ − 1} ∧ j ≤ r(Y^{(S)k}_i − minY_i)/rangeY_i < j + 1 }|    (25)

An exponentially growing amount of bins of the form r^n is required in the multidimensional case for variables Y_{i_0}, Y_{i_1}, . . . , Y_{i_{n−1}} and corresponding bins b_{i_0 i_1 ... i_{n−1}}[j_0, j_1, . . . , j_{n−1}]. It is clear that r^n bins are required for a PDS with a parent arity of n − 1, as equation 3 dictates that we then have to estimate the density for a probability over n variables. We should be aware of this curse of dimensionality, as it limits the amount of bins we can use when the parent arity goes up.
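A small sketch of the one-dimensional frequency count of equation 25, with the same clipping of the maximum value into the last bin as the pseudo-code below uses (Python, assuming the values are not all identical so that the range is positive):

def bin_index(y, y_min, y_range, r):
    # Equal-width bins on [y_min, y_min + y_range]; the maximum falls into bin r - 1.
    return min(r - 1, int(r * (y - y_min) / y_range))

def histogram(values, r):
    # Normalized frequency count: each sample contributes 1/|values| to its bin.
    y_min, y_max = min(values), max(values)
    y_range = y_max - y_min
    bins = [0.0] * r
    for y in values:
        bins[bin_index(y, y_min, y_range, r)] += 1.0 / len(values)
    return bins, y_min, y_range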


Figure 8: Fitting the univariate data from the uniform data set with a histogram density model, using 20 bins.

In figures 8 and 9, the density function using histograms over the variables taken univariately is plotted. We should be aware of the scale differences between the two sets of graphs. This difference amplifies the non–uniformity of the fit over the clustered data set. As opposed to the resulting fits with the normal distribution in section 5.1, from these figures, as well as from the joint density function in figure 10, it is clear which data set is sampled from a uniform distribution and which is sampled from a clustered distribution. Using histograms is in general a more localized method. Depending on the bin width, more or less detail is conveyed in the resulting density estimation. As the figures show however, we might very well require quite a large amount of bins to obtain a representative estimate of the distribution. If the amount of parents κ in the PDS graph grows, this results in a drastic increase of computation time because of the exponentially growing function r^{κ+1}. In the case of binary random variables, the situation was comparable to using histograms with r = 2. If we however use 10 bins in each dimension, two parents already amount to 10³ = 1000 bins, whereas in the binary case we could go as far as κ = 9 parents to get 2¹⁰ = 1024 bins. Clearly, we are trying to solve different types of problems in the case of discrete and continuous random variables, but we should nevertheless be aware of the underlying problem in using histograms.

In the case of discrete random variables, we don’t have real valued data. We can however setthe minimum and maximum of the range we want to use bins for, to 0 and nd − 1 respectivelyand specify to use nd bins. By doing so, we have defined bins for the discrete case. This stressesagain the strong correspondence between the algorithms for discrete random variables and the useof the histogram density model for continuous random variables.

We now move to specify the algorithms in the continuous case using the non–parametric histogram model. The estimation of the density functions is exactly the empiric probabilities that result from the frequency count. This leads to algorithm ReHiEst to estimate the distribution once the structure (π,ω) is known. The algorithm requires a parameter r for the amount of bins to use in each dimension:


Figure 9: Fitting the univariate data from the clustered data set with a histogram density model, using 20 bins.

Figure 10: Fitting the joint data from the uniform (left) and clustered (right) data sets with a histogram density model, using 20 bins in each dimension.


ReHiEst(r)1 for i ← 0 to l − 1 do

1.1 minYi ← mink∈{0,1,...,�τn�−1}{Y (S)ki }

1.2 maxYi ← maxk∈{0,1,...,�τn�−1}{Y (S)ki }

1.3 rangeYi ← maxYi − minYi

1.4 βYi ← rangeYi

r2 for i ← 0 to l − 1 do

2.1 bωi← new array of real in |π(ωi)| + 1 dimensions

with size r × r × . . . × r2.2 bp

ωi← new array of real in |π(ωi)| dimensions

with size r × r × . . . × r2.3 for (a0, a1, . . . , a|π(ωi)|) ← (0, 0, . . . , 0)

to (r − 1, r − 1, . . . , r − 1) in crossproduct do2.3.1 bωi

[a0, a1, . . . , a|π(ωi)|] ← 02.4 for (a0, a1, . . . , a|π(ωi)|−1) ← (0, 0, . . . , 0)

to (r − 1, r − 1, . . . , r − 1) in crossproduct do2.4.1 bp

ωi[a0, a1, . . . , a|π(ωi)|−1] ← 0

2.5 for k ← 0 to �τn� − 1 do

2.5.1 a0 ← min{r − 1, �(Y (S)kωi − minYωi )/βYωi �}

2.5.2 for q ← 0 to |π(ωi)| − 1 do

2.5.2.1 aq+1 ← min{r − 1, �(Y (S)kπ(ωi)q

− minYπ(ωi)q )/βYπ(ωi)q �}2.5.3 bωi

[a0, a1, . . . , a|π(ωi)|] ← bωi[a0, a1, . . . , a|π(ωi)|] + 1

�τn�2.5.4 bp

ωi[a1, a2, . . . , a|π(ωi)|] ← bp

ωi[a1, a2, . . . , a|π(ωi)|] + 1

�τn�2.6 P (Yωi

|Yπ(ωi)0 , Yπ(ωi)1 , . . . , Yπ(ωi)|π(ωi)|) ←

ϕ(ωi, a0, a1, . . . , a|π(ωi)|) ifyωi

∈ [minYωi ,maxYωi ] ∧∀j〈Yπ(ωi)j

∈ [minYπ(ωi)j ,maxYπ(ωi)j ]〉0 otherwise

where

aj = min{r − 1, aj}

aj =

⌊(yωi

−minYωi )

βYωi

⌋if j = 0⌊

(yπ(ωi)j−1−minYπ(ωi)j−1 )

βYπ(ωi)j−1

⌋otherwise

ϕ(ωi, k0, k1, . . . , k|π(ωi)|) =

{bωi

[k0,k1,...,k|π(ωi)|]bp

ωi[k1,k2,...,k|π(ωi)|]

if |π(ωi)| > 0bωi

[k0, k1, . . . , k|π(ωi)|] otherwise3 return(P (·))

The exponential factor in the running time for algorithm ReHiEst, when we take a fully connected graph, is O(r^{κ+1}), because in step 2.4 at most κ + 1 counters are enumerated in crossproduct. Each enumeration consists of exactly O(r) steps. This can be compared to the algorithm for discrete random variables from section 4, which has an exponential factor of O(n_d^{κ+1}). Just as was the case for algorithms RePaNoEst and DiEst, step 2.6 does not actually have to be computed, because the required information is stored in the arrays b_{ω_i} and b^p_{ω_i}. Step 2.6 does however give us a way to sample from the distribution. Based on the proportionality of the frequencies stored in the bins and a randomly drawn real number between 0 and 1, we select a bin. In the case of discrete random variables, we would return the selected bin as the result for the discrete random variable. Now however, the bin stands for a range of values of a continuous random variable. To account for this, after we have selected a bin, the resulting value for the continuous random variable is a random number drawn from the uniform distribution on the range that the selected bin represents. This is formalized in algorithm ReHiSam:


ReHiSam(r)
1   for i ← l − 1 downto 0 do
1.1     β ← rangeY_{ω_i} / r
1.2     α ← Random01()
1.3     ς ← Random01()
1.4     ϑ ← 0
1.5     for j ← 0 to |π(ω_i)| − 1 do
1.5.1     a_{j+1} ← min{r − 1, ⌊r(Y_{π(ω_i)_j} − minY_{π(ω_i)_j}) / rangeY_{π(ω_i)_j}⌋}
1.6     η ← 1
1.7     if |π(ω_i)| > 0 then
1.7.1     η ← b^p_{ω_i}[a_1, a_2, . . . , a_{|π(ω_i)|}]
1.8     for a_0 ← 0 to r − 1 do
1.8.1     ϑ ← ϑ + (1/η) b_{ω_i}[a_0, a_1, . . . , a_{|π(ω_i)|}]
1.8.2     if ς ≤ ϑ then Y_{ω_i} ← minY_{ω_i} + (a_0 + α)β; breakfor

As in the parametric case, we are now able to create a full algorithm by using UnivariateDistribution, JointDistribution or any other a priori fixed distribution. It may be clear that, for reasons quite similar to those we came across for the algorithms for discrete random variables, the recomputation of the arrays b_i and b^p_i in method ReHiEst is not required after having applied any of the search algorithms we present next.

In order to define the search algorithms, we should remind ourselves again that we need to be able to compute the entropy. In the case of the histogram distribution, this can be done in a similar way as in the case of discrete random variables. In appendix D it is shown that the multivariate entropy measure for continuous random variables that have a histogram distribution equals:

h(Y_{j_0}, Y_{j_1}, . . . , Y_{j_{n−1}}) = −( (∏_{i=0}^{n−1} rangeY_{j_i}) / r^n ) Σ_{k_0=0}^{r−1} Σ_{k_1=0}^{r−1} · · · Σ_{k_{n−1}=0}^{r−1} b_{j_0,j_1,...,j_{n−1}}[k_0, k_1, . . . , k_{n−1}] ln(b_{j_0,j_1,...,j_{n−1}}[k_0, k_1, . . . , k_{n−1}])    (26)

where

b_{j_0,j_1,...,j_{n−1}}[k_0, k_1, . . . , k_{n−1}] = |{ Y^{(S)q}_{j_0} | q ∈ {0, 1, . . . , ⌊τn⌋ − 1} ∧ ∀_{i∈{0,1,...,n−1}} ⟨ k_i ≤ r(Y^{(S)q}_{j_i} − minY_{j_i})/rangeY_{j_i} < k_i + 1 ⟩ }|

We now have enough tools to write out the algorithms for the chain search, the tree search and the graph search. We denote these algorithms as ReHiChainSea, ReHiTreeSea and ReHiGraphSea respectively. When going over the algorithms, note the obvious similarity with the algorithms for discrete random variables from section 4. The running time for algorithm ReHiGraphSeaExact is O(l³ + l²r² + l²τn), which is subsumed by the running time of the greedy graph search algorithm for the case when κ > 1.
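For reference, the one– and two–dimensional instances of this entropy measure, as the search algorithms below use them, can be sketched as follows (assuming the bins already hold the normalized frequencies computed by the pseudo-code, and treating empty bins as contributing zero):

import math

def histogram_entropy_1d(bins, y_range):
    # -(range / r) * sum p ln p over the non-empty bins (equation 26 with n = 1).
    r = len(bins)
    return -(y_range / r) * sum(p * math.log(p) for p in bins if p > 0.0)

def histogram_entropy_2d(bins2d, range_x, range_y):
    # Two-dimensional version used for the bivariate terms in the searches.
    r = len(bins2d)
    cell = (range_x * range_y) / (r * r)
    return -cell * sum(p * math.log(p) for row in bins2d for p in row if p > 0.0)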

ReHiChainSea(r)1 a ← new array of integer with size l2 for i ← 0 to l − 1 do

2.1 a[i] ← i2.2 pi ← new array of real with size r

3 e ← 04 ωl−1 ← a[e]


5 for i ← 0 to l − 1 do

5.1 minYa[i] ← mink∈{0,1,...,�τn�−1}{Y (S)ka[i] }

5.2 maxYa[i] ← maxk∈{0,1,...,�τn�−1}{Y (S)ka[i] }

5.3 rangeYa[i] ← maxYa[i] − minYa[i]

5.4 βYa[i] ← rangeYa[i]

r5.5 for j ← 0 to r − 1 do

5.5.1 pa[i][j] ← 05.6 for j ← 0 to �τn� − 1 do

5.6.1 k ← min{r − 1, �(Y (S)ja[i] − minYa[i])/βYa[i]�}

5.6.2 pa[i][k] ← pa[i][k] + 1�τn�

5.7 ha[i] ← − rangeYa[i]

r

∑r−1k=0 pa[i][k]ln(pa[i][k])

5.8 if ha[i] < hωl−1 then5.8.1 ωl−1 ← a[i]5.8.2 e ← i

6 π(ωl−1) ← ∅7 a[e] ← a[l − 1]8 for i ← l − 2 downto 0 do

8.1 e ← 08.2 ωi ← a[e]8.3 for q ← 0 to i do

8.3.1 pωi+1a[q] ← new array of real in 2 dimensions with size r × r8.3.2 pa[q]ωi+1 ← new array of real in 2 dimensions with size r × r8.3.3 for j0 ← 0 to r − 1 do

8.3.3.1 for j1 ← 0 to r − 1 do8.3.3.1.1 pωi+1a[q][j0, j1] ← 08.3.3.1.2 pa[q]ωi+1 [j1, j0] ← 0

8.3.4 for j ← 0 to �τn� − 1 do

8.3.4.1 k0 ← min{r − 1, �(Y (S)jωi+1 − minYωi+1 )/βYωi+1 �}

8.3.4.2 k1 ← min{r − 1, �(Y (S)ja[q] − minYa[q])/βYa[q]�}

8.3.4.3 pωi+1a[q][k0, k1] ← pωi+1a[q][k0, k1] + 1�τn�

8.3.4.4 pa[q]ωi+1 [k1, k0] ← pa[q]ωi+1 [k1, k0] + 1�τn�

8.3.5 hωi+1a[q] ← − rangeYωi+1 range

Ya[q]

r2 ·∑r−1k0=0

∑r−1k1=0 pωi+1a[q][k0, k1]ln(pωi+1a[q][k0, k1])

8.3.6 ha[q]ωi+1 ← hωi+1a[q]

8.3.7 if ha[q]ωi+1 − hωi+1 < hωiωi+1 − hωi+1 then8.3.7.1 ωi ← a[q]8.3.7.2 e ← q

8.4 π(ωi) ← ωi+1

8.5 a[e] ← a[i]9 for i ← 0 to l − 1 do

9.1 if |π(ωi)| > 0 then9.1.1 bωi

← pωiπ(ωi)0

9.1.2 bpωi

← pπ(ωi)0

9.2 else9.2.1 bωi

← pωi

10 return((π,ω))


ReHiTreeSea(r)
1 a ← new array of integer with size l
2 for i ← 0 to l − 1 do
   2.1 a[i] ← i
   2.2 p_i ← new array of real with size r
3 bt ← new array of integer with size l − 1
4 e ← RandomNumber(l)
5 ω_{l−1} ← e
6 π(ω_{l−1}) ← ∅
7 minY_{ω_{l−1}} ← min_{k∈{0,1,...,⌊τn⌋−1}} {Y(S)^k_{ω_{l−1}}}
8 maxY_{ω_{l−1}} ← max_{k∈{0,1,...,⌊τn⌋−1}} {Y(S)^k_{ω_{l−1}}}
9 rangeY_{ω_{l−1}} ← maxY_{ω_{l−1}} − minY_{ω_{l−1}}
10 βY_{ω_{l−1}} ← rangeY_{ω_{l−1}} / r
11 for j ← 0 to r − 1 do
   11.1 p_{ω_{l−1}}[j] ← 0
12 for j ← 0 to ⌊τn⌋ − 1 do
   12.1 k ← min{r − 1, ⌊(Y(S)^j_{ω_{l−1}} − minY_{ω_{l−1}}) / βY_{ω_{l−1}}⌋}
   12.2 p_{ω_{l−1}}[k] ← p_{ω_{l−1}}[k] + 1/⌊τn⌋
13 h_{ω_{l−1}} ← −(rangeY_{ω_{l−1}} / r) · Σ_{k=0..r−1} p_{ω_{l−1}}[k] ln(p_{ω_{l−1}}[k])
14 a[e] ← a[l − 1]
15 for i ← 0 to l − 2 do
   15.1 minY_{a[i]} ← min_{k∈{0,1,...,⌊τn⌋−1}} {Y(S)^k_{a[i]}}
   15.2 maxY_{a[i]} ← max_{k∈{0,1,...,⌊τn⌋−1}} {Y(S)^k_{a[i]}}
   15.3 rangeY_{a[i]} ← maxY_{a[i]} − minY_{a[i]}
   15.4 βY_{a[i]} ← rangeY_{a[i]} / r
   15.5 for j ← 0 to r − 1 do
      15.5.1 p_{a[i]}[j] ← 0
   15.6 for j ← 0 to ⌊τn⌋ − 1 do
      15.6.1 k ← min{r − 1, ⌊(Y(S)^j_{a[i]} − minY_{a[i]}) / βY_{a[i]}⌋}
      15.6.2 p_{a[i]}[k] ← p_{a[i]}[k] + 1/⌊τn⌋
   15.7 h_{a[i]} ← −(rangeY_{a[i]} / r) · Σ_{k=0..r−1} p_{a[i]}[k] ln(p_{a[i]}[k])
   15.8 bt[a[i]] ← ω_{l−1}
   15.9 p_{bt[a[i]],a[i]} ← new array of real in 2 dimensions with size r × r
   15.10 p_{a[i],bt[a[i]]} ← new array of real in 2 dimensions with size r × r
   15.11 for j0 ← 0 to r − 1 do
      15.11.1 for j1 ← 0 to r − 1 do
         15.11.1.1 p_{bt[a[i]],a[i]}[j0, j1] ← 0
         15.11.1.2 p_{a[i],bt[a[i]]}[j1, j0] ← 0
   15.12 for j ← 0 to ⌊τn⌋ − 1 do
      15.12.1 k0 ← min{r − 1, ⌊(Y(S)^j_{bt[a[i]]} − minY_{bt[a[i]]}) / βY_{bt[a[i]]}⌋}
      15.12.2 k1 ← min{r − 1, ⌊(Y(S)^j_{a[i]} − minY_{a[i]}) / βY_{a[i]}⌋}
      15.12.3 p_{bt[a[i]],a[i]}[k0, k1] ← p_{bt[a[i]],a[i]}[k0, k1] + 1/⌊τn⌋
      15.12.4 p_{a[i],bt[a[i]]}[k1, k0] ← p_{a[i],bt[a[i]]}[k1, k0] + 1/⌊τn⌋
   15.13 h_{bt[a[i]],a[i]} ← −(rangeY_{bt[a[i]]} rangeY_{a[i]} / r²) · Σ_{k0=0..r−1} Σ_{k1=0..r−1} p_{bt[a[i]],a[i]}[k0, k1] ln(p_{bt[a[i]],a[i]}[k0, k1])
   15.14 h_{a[i],bt[a[i]]} ← h_{bt[a[i]],a[i]}
16 for i ← l − 2 downto 0 do
   16.1 e ← arg max_j {h_{a[j]} + h_{bt[a[j]]} − h_{a[j],bt[a[j]]}} (j ∈ {0, 1, . . . , i})
   16.2 ω_i ← a[e]
   16.3 π(ω_i) ← bt[ω_i]
   16.4 a[e] ← a[i]
   16.5 for j ← 0 to i − 1 do
      16.5.1 p_{ω_i,a[j]} ← new array of real in 2 dimensions with size r × r
      16.5.2 p_{a[j],ω_i} ← new array of real in 2 dimensions with size r × r
      16.5.3 for j0 ← 0 to r − 1 do
         16.5.3.1 for j1 ← 0 to r − 1 do
            16.5.3.1.1 p_{ω_i,a[j]}[j0, j1] ← 0
            16.5.3.1.2 p_{a[j],ω_i}[j1, j0] ← 0
      16.5.4 for k ← 0 to ⌊τn⌋ − 1 do
         16.5.4.1 k0 ← min{r − 1, ⌊(Y(S)^k_{ω_i} − minY_{ω_i}) / βY_{ω_i}⌋}
         16.5.4.2 k1 ← min{r − 1, ⌊(Y(S)^k_{a[j]} − minY_{a[j]}) / βY_{a[j]}⌋}
         16.5.4.3 p_{ω_i,a[j]}[k0, k1] ← p_{ω_i,a[j]}[k0, k1] + 1/⌊τn⌋
         16.5.4.4 p_{a[j],ω_i}[k1, k0] ← p_{a[j],ω_i}[k1, k0] + 1/⌊τn⌋
      16.5.5 h_{ω_i,a[j]} ← −(rangeY_{ω_i} rangeY_{a[j]} / r²) · Σ_{k0=0..r−1} Σ_{k1=0..r−1} p_{ω_i,a[j]}[k0, k1] ln(p_{ω_i,a[j]}[k0, k1])
      16.5.6 h_{a[j],ω_i} ← h_{ω_i,a[j]}
      16.5.7 Ibest ← h_{a[j]} + h_{bt[a[j]]} − h_{a[j],bt[a[j]]}
      16.5.8 Iadd ← h_{a[j]} + h_{ω_i} − h_{a[j],ω_i}
      16.5.9 if Ibest < Iadd then
         16.5.9.1 bt[a[j]] ← ω_i
17 for i ← 0 to l − 1 do
   17.1 if |π(ω_i)| > 0 then
      17.1.1 b_{ω_i} ← p_{ω_i,π(ω_i)_0}
      17.1.2 bp_{ω_i} ← p_{π(ω_i)_0}
   17.2 else
      17.2.1 b_{ω_i} ← p_{ω_i}
18 return((π,ω))

ReHiGraphSea(κ, r)
1 if κ = 0 then
   1.1 (π,ω) ← UnivariateDistribution()
2 else if κ = 1 then
   2.1 (π,ω) ← ReHiGraphSeaExact(r)
3 else
   3.1 (π,ω) ← ReHiGraphSeaGreedy(κ, r)
4 return((π,ω))


ReHiGraphSeaExact(r)
1 define type edarc as (stack of integer, stack of integer, stack of real)
2 vp ← new array of vector of integer with size l
3 V ← new vector of integer
4 A ← new vector of edarc
5 for i ← 0 to l − 1 do
   5.1 minY_i ← min_{k∈{0,1,...,⌊τn⌋−1}} {Y(S)^k_i}
   5.2 maxY_i ← max_{k∈{0,1,...,⌊τn⌋−1}} {Y(S)^k_i}
   5.3 rangeY_i ← maxY_i − minY_i
   5.4 βY_i ← rangeY_i / r
   5.5 p_i ← new array of real with size r
   5.6 for j ← 0 to r − 1 do
      5.6.1 p_i[j] ← 0
   5.7 for j ← 0 to ⌊τn⌋ − 1 do
      5.7.1 k ← min{r − 1, ⌊(Y(S)^j_i − minY_i) / βY_i⌋}
      5.7.2 p_i[k] ← p_i[k] + 1/⌊τn⌋
   5.8 h_i ← −(rangeY_i / r) · Σ_{k=0..r−1} p_i[k] ln(p_i[k])
6 for i ← 0 to l − 1 do
   6.1 for j ← i + 1 to l − 1 do
      6.1.1 p_{ij} ← new array of real in 2 dimensions with size r × r
      6.1.2 p_{ji} ← new array of real in 2 dimensions with size r × r
      6.1.3 for j0 ← 0 to r − 1 do
         6.1.3.1 for j1 ← 0 to r − 1 do
            6.1.3.1.1 p_{ij}[j0, j1] ← 0
            6.1.3.1.2 p_{ji}[j1, j0] ← 0
      6.1.4 for k ← 0 to ⌊τn⌋ − 1 do
         6.1.4.1 k0 ← min{r − 1, ⌊(Y(S)^k_i − minY_i) / βY_i⌋}
         6.1.4.2 k1 ← min{r − 1, ⌊(Y(S)^k_j − minY_j) / βY_j⌋}
         6.1.4.3 p_{ij}[k0, k1] ← p_{ij}[k0, k1] + 1/⌊τn⌋
         6.1.4.4 p_{ji}[k1, k0] ← p_{ji}[k1, k0] + 1/⌊τn⌋
   6.2 for j ← 0 to l − 1 do
      6.2.1 if i ≠ j then
         6.2.1.1 h_{ji} ← −(rangeY_j rangeY_i / r²) · Σ_{k0=0..r−1} Σ_{k1=0..r−1} p_{ji}[k0, k1] ln(p_{ji}[k0, k1])
         6.2.1.2 c ← h_{ji} − h_i
         6.2.1.3 Ss ← new stack of integer
         6.2.1.4 St ← new stack of integer
         6.2.1.5 Sc ← new stack of real
         6.2.1.6 push(Ss, i)
         6.2.1.7 push(St, j)
         6.2.1.8 push(Sc, −c)
         6.2.1.9 A_{|A|} ← (Ss, St, Sc)
7 γ ← min_{(Ss,St,Sc)∈A} {top(Sc)}
8 for i ← 0 to |A| − 1 do
   8.1 (Ss, St, Sc) ← A_i
   8.2 c ← pop(Sc)
   8.3 c ← c + 1 − γ
   8.4 push(Sc, c)
9 B ← GraphSeaOptimumBranching(V, A, l)
10 for i ← 0 to |B| − 1 do
   10.1 (Ss, St, Sc) ← B_i
   10.2 s ← pop(Ss)
   10.3 t ← pop(St)
   10.4 vp[t]_{|vp[t]|} ← s
11 (π,ω) ← GraphSeaTopologicalSort(vp)
12 for i ← 0 to l − 1 do
   12.1 if |π(ω_i)| > 0 then
      12.1.1 b_{ω_i} ← p_{ω_i,π(ω_i)_0}
      12.1.2 bp_{ω_i} ← p_{π(ω_i)_0}
   12.2 else
      12.2.1 b_{ω_i} ← p_{ω_i}
13 return((π,ω))

ReHiGraphSeaGreedy(κ, r)
1 a ← new array of boolean in 2 dimensions with size l × l
2 vp, vs ← 2 new arrays of vector of integer with size l
3 hn, c ← 2 new arrays of real in 2 dimensions with size l × l
4 ho ← new array of real with size l
5 for i ← 0 to l − 1 do
   5.1 minY_i ← min_{k∈{0,1,...,⌊τn⌋−1}} {Y(S)^k_i}
   5.2 maxY_i ← max_{k∈{0,1,...,⌊τn⌋−1}} {Y(S)^k_i}
   5.3 rangeY_i ← maxY_i − minY_i
   5.4 βY_i ← rangeY_i / r
   5.5 pp_{0i} ← new array of real with size r
   5.6 for j ← 0 to r − 1 do
      5.6.1 pp_{0i}[j] ← 0
   5.7 for j ← 0 to ⌊τn⌋ − 1 do
      5.7.1 k ← min{r − 1, ⌊(Y(S)^j_i − minY_i) / βY_i⌋}
      5.7.2 pp_{0i}[k] ← pp_{0i}[k] + 1/⌊τn⌋
   5.8 for q ← 1 to l − 1 do
      5.8.1 pp_{qi} ← new array of real with size r
      5.8.2 for j ← 0 to r − 1 do
         5.8.2.1 pp_{qi}[j] ← pp_{0i}[j]
   5.9 ho[i] ← −(rangeY_i / r) · Σ_{k=0..r−1} pp_{0i}[k] ln(pp_{0i}[k])
   5.10 b_i ← pp_{0i}
6 for i ← 0 to l − 1 do
   6.1 for j ← i + 1 to l − 1 do
      6.1.1 p_{ij} ← new array of real in 2 dimensions with size r × r
      6.1.2 p_{ji} ← new array of real in 2 dimensions with size r × r
      6.1.3 for j0 ← 0 to r − 1 do
         6.1.3.1 for j1 ← 0 to r − 1 do
            6.1.3.1.1 p_{ij}[j0, j1] ← 0
            6.1.3.1.2 p_{ji}[j1, j0] ← 0
      6.1.4 for k ← 0 to ⌊τn⌋ − 1 do
         6.1.4.1 k0 ← min{r − 1, ⌊(Y(S)^k_i − minY_i) / βY_i⌋}
         6.1.4.2 k1 ← min{r − 1, ⌊(Y(S)^k_j − minY_j) / βY_j⌋}
         6.1.4.3 p_{ij}[k0, k1] ← p_{ij}[k0, k1] + 1/⌊τn⌋
         6.1.4.4 p_{ji}[k1, k0] ← p_{ji}[k1, k0] + 1/⌊τn⌋
   6.2 for j ← 0 to l − 1 do
      6.2.1 if i ≠ j then
         6.2.1.1 a[i, j] ← true
         6.2.1.2 hn[i, j] ← (−(rangeY_i rangeY_j / r²) · Σ_{k0=0..r−1} Σ_{k1=0..r−1} p_{ji}[k0, k1] ln(p_{ji}[k0, k1])) − ho[i]
         6.2.1.3 c[i, j] ← hn[i, j] − ho[j]
   6.3 a[i, i] ← false
7 γ ← l² − l
8 while γ > 0 do
   8.1 (v0, v1) ← arg min_{(i,j)} {c[i, j] | a[i, j] ∧ (i, j) ∈ {0, 1, . . . , l − 1}²}
   8.2 if c[v0, v1] > 0 then
      8.2.1 breakwhile
   8.3 γ ← γ − GraphSeaArcAdd(κ, v0, v1, a, vp, vs)
   8.4 ho[v1] ← hn[v0, v1]
   8.5 bp_{v1} ← pp_{v1,v0}
   8.6 b_{v1} ← p_{v1,v0}
   8.7 for i ← 0 to l − 1 do
      8.7.1 if a[i, v1] then
         8.7.1.1 p_{v1,i} ← new array of real in |vp[v1]| + 2 dimensions with size r × r × . . . × r
         8.7.1.2 pp_{v1,i} ← new array of real in |vp[v1]| + 1 dimensions with size r × r × . . . × r
         8.7.1.3 for (j0, j1, . . . , j_{|vp[v1]|+1}) ← (0, 0, . . . , 0) to (r − 1, r − 1, . . . , r − 1) in crossproduct do
            8.7.1.3.1 p_{v1,i}[j0, j1, . . . , j_{|vp[v1]|+1}] ← 0
         8.7.1.4 for (j0, j1, . . . , j_{|vp[v1]|}) ← (0, 0, . . . , 0) to (r − 1, r − 1, . . . , r − 1) in crossproduct do
            8.7.1.4.1 pp_{v1,i}[j0, j1, . . . , j_{|vp[v1]|}] ← 0
         8.7.1.5 for j ← 0 to ⌊τn⌋ − 1 do
            8.7.1.5.1 k0 ← min{r − 1, ⌊(Y(S)^j_{v1} − minY_{v1}) / βY_{v1}⌋}
            8.7.1.5.2 kq ← min{r − 1, ⌊(Y(S)^j_{vp[v1]_{q−1}} − minY_{vp[v1]_{q−1}}) / βY_{vp[v1]_{q−1}}⌋} (∀q∈{1,2,...,|vp[v1]|})
            8.7.1.5.3 k_{|vp[v1]|+1} ← min{r − 1, ⌊(Y(S)^j_i − minY_i) / βY_i⌋}
            8.7.1.5.4 p_{v1,i}[k0, k1, . . . , k_{|vp[v1]|+1}] ← p_{v1,i}[k0, k1, . . . , k_{|vp[v1]|+1}] + 1/⌊τn⌋
            8.7.1.5.5 pp_{v1,i}[k1, k2, . . . , k_{|vp[v1]|+1}] ← pp_{v1,i}[k1, k2, . . . , k_{|vp[v1]|+1}] + 1/⌊τn⌋
         8.7.1.6 α ← rangeY_i · ∏_{q=0..|vp[v1]|−1} rangeY_{vp[v1]_q}
         8.7.1.7 hn[i, v1] ← −(rangeY_{v1} α / r^{|vp[v1]|+2}) · Σ_{k0=0..r−1} Σ_{k1=0..r−1} · · · Σ_{k_{|vp[v1]|+1}=0..r−1} p_{v1,i}[k0, k1, . . . , k_{|vp[v1]|+1}] ln(p_{v1,i}[k0, k1, . . . , k_{|vp[v1]|+1}])
         8.7.1.8 hn[i, v1] ← hn[i, v1] − (−(α / r^{|vp[v1]|+1}) · Σ_{k1=0..r−1} Σ_{k2=0..r−1} · · · Σ_{k_{|vp[v1]|+1}=0..r−1} pp_{v1,i}[k1, k2, . . . , k_{|vp[v1]|+1}] ln(pp_{v1,i}[k1, k2, . . . , k_{|vp[v1]|+1}]))
         8.7.1.9 c[i, v1] ← hn[i, v1] − ho[v1]
9 return(GraphSeaTopologicalSort(vp))

6 Discussion

The IDEA framework allows us to implement density estimation based evolutionary algorithms in a way that decomposes certain aspects within these algorithms. This allows us to clearly show how different density models map onto different procedures that are parameters to the algorithmic framework. Next to showing how the framework allows for applying certain density models to optimization, the derived results should be brought into practice. Currently, we are in the process of implementing all of the algorithms presented in this paper. The first aspect to be published shortly is the results of running the proposed algorithms on benchmark tests and comparing them with other methods.

Next to bringing the algorithms into practice in the sense of a runnable program, there are some important additional issues in the use of the IDEA framework in general. For instance, many theoretical questions can be addressed, varying from population size requirements to selection issues and convergence. Here, we make special note of a few other issues that should be taken into account. In section 6.1 we present the running times of the search algorithms presented in this paper and place some important notes. In section 6.2, we give some directions for future work in applying different types of density estimation models.

6.1 A note on the running times

In sections 4 and 5, we have presented algorithms to be used within the IDEA framework. These algorithms take a certain amount of time every iteration. If the search method becomes more involved, the running time required every iteration goes up. This is hopefully accompanied by better results, but we should be aware of how much time is spent on these algorithms. Recalling that l stands for the length of the coding vector of variables, n for the amount of samples in the collection, τ for the truncation percentile, nd for the size of the discrete domain set, r for the amount of bins in the histogram density model and κ for the maximum amount of allowed parents in the PDS for every variable, the running times, without regarding any of these as constants, are the following:

Algorithm                  Complexity
UnivariateDistribution()   O(l)
JointDistribution()        O(l²)
DiChainSea()               O(l²nd² + l²τn)
DiTreeSea()                O(l²nd² + l²τn)
DiGraphSea()               O(l³κ + l²nd + l²τn + lκnd^{κ+1} + lκ²τn)
RePaNoChainSea()           O(l²τn)
RePaNoTreeSea()            O(l²τn)
RePaNoGraphSea()           O(l³κ + l²τn + lκ⁴)
ReHiChainSea()             O(l²r² + l²τn)
ReHiTreeSea()              O(l²r² + l²τn)
ReHiGraphSea()             O(l³κ + l²r + l²τn + lκr^{κ+1} + lκ²τn)
DiEst()                    O(lnd^{κ+1} + lκτn)
RePaNoEst()                O(lκ³ + l² + lτn + κ²τn)
ReHiEst()                  O(lr^{κ+1} + lκτn)
DiSam()                    O(lnd + lκ)
RePaNoSam()                O(lκ)
ReHiSam()                  O(lr + lκ)

If κ, r, τ and nd are seen as constants, the running times of the algorithms are in that case easily derived from the table above.

It is common practice to measure the competence of an algorithm by the amount of evaluations required to reach a certain value. However, when using more sophisticated methods for processing the available solution vectors, such as the methods discussed in this paper, such a measure is not entirely fair unless the running time for evaluating a solution dominates all of the running times above. Therefore, it is better to also account for the time taken by the algorithm to generate new samples. One way of doing this is by recording the time in seconds or milliseconds that the algorithm takes instead of the amount of function evaluations. However, in such a case, results obtained on a certain computer are hard to compare with results from the literature because


the computer(s) used in the literature are very likely to differ in many ways from one's own. A way to compensate for this is to use a factor based on time in which no additional dimensions are introduced in the resulting expression. If we assume that the running of the algorithm is not hindered by the operating system swapping memory or similar interference, this will be a fair comparison (not taking into account the quality of the code). If we have n_e evaluations, a total running time T in (milli)seconds for the algorithm and a total running time T_e that the algorithm spent on performing evaluations, measured with the same precision, a measure that could be used for comparison instead of the amount of evaluations is the following normalized expression:

n̄_e = n_e (T − T_e) / T_e    (27)

Another way would be to have a piece of benchmark code that, for instance, simulates some operations that most evolutionary approaches would perform, meaning some iterations of a lot of integer and real computations. Next, the running time of this benchmark program on the computer on which the tests are run should be recorded as T_b. An alternative to using the amount of evaluations (again, unless the evaluations clearly dominate the running time) is the normalized running time of the algorithm, which is an index to be named according to the benchmark test:

T̄ = T / T_b    (28)
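As a small illustration of the two measures above (a sketch of ours, not from the report; the variable names are hypothetical), the normalized evaluation count of equation 27 and the benchmark-normalized running time of equation 28 can be computed directly from recorded timings:

def normalized_evaluations(n_e, total_time, eval_time):
    # Equation 27: scale the evaluation count by the ratio of
    # non-evaluation time (model building, sampling) to evaluation time.
    return n_e * (total_time - eval_time) / eval_time

def normalized_running_time(total_time, benchmark_time):
    # Equation 28: express the total running time in units of the
    # benchmark program's running time on the same machine.
    return total_time / benchmark_time

# Example: 10000 evaluations, 12.5 s total, 2.5 s spent on evaluations,
# benchmark program took 0.8 s on this machine.
print(normalized_evaluations(10000, 12.5, 2.5))   # 40000.0
print(normalized_running_time(12.5, 0.8))         # 15.625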

The results that we are still to obtain for the algorithms presented in this paper will be subjected to such a measure. In general, it should be noted that because the time spent in finding and using models to cover the interactions between variables may make an optimization algorithm substantially different from other algorithms, the total time taken by the algorithm should be taken into account when drawing conclusions about its efficiency in terms of running time.

6.2 Density estimation models

In this paper, we have presented algorithms that use density estimation techniques in optimization. Moreover, we have presented algorithms for continuous random variables within two different classes, namely the parametric normal distribution and the non-parametric histogram distribution. There are however more density models that are commonly used in practice. Two of these are the normal kernel distribution and the normal mixture model. In sections 6.2.1 and 6.2.2 we give some insights into using these methods within the IDEA framework.

Another thing we would like to note in this discussion is that we have seen that a problem of using histograms is their exponential requirements in computation time. However, if r = 1, the computation time requirements are no longer exponential. Taking r = 1 in the histogram method, we get the uniform distribution, which is somewhat like the Gaussian approach in that it is a very global technique instead of a local one. Because we introduced ranges for the histogram distribution, we expect to get convergence. It would be interesting to see how this very simple approach would work on real problems.

6.2.1 Normal kernel distribution

In section 5.2, we used a non-parametric model for density estimation. Next to that density model, there are other density models with interesting properties. If we for instance again take the normal distribution density function, but now fix the value of σ, center one such density function on every sample point and divide the sum of all of these density functions by the amount of sample points, we get the normal kernel distribution. The use of density functions centered around the sample points is called a kernel method, where the density functions themselves are the kernels. The method of using the Gaussian kernels underlying the normal distribution has some interesting properties.


[Two panels: f(Y0) plotted against Y0 (left) and f(Y1) plotted against Y1 (right).]

Figure 11: Fitting the univariate data from the uniform dataset with a Gaussian kernel model with σ = 0.5.

For instance, by changing the value of σ, we can either get more or less detail in the resulting density estimation. For a very large value of σ, for example, there will hardly be any local features in the resulting density estimation. As with the use of the histogram distribution, we can scale an initially chosen value for σ with the range of the sample points and thereby keep the level of detail constant during the optimization process.

The density function underlying the univariate normal kernel distribution with fixed standard deviation σ, defined on N sample points y(S)^i, i ∈ {0, 1, . . . , N − 1}, is defined by:

f(y) = (1/N) Σ_{i=0..N−1} (1/(σ√(2π))) e^{−(y − y(S)^i)² / (2σ²)}    (29)

The multivariate version of the density function in n dimensions with fixed standard deviations σ_i is defined by:

f(y0, y1, . . . , yn−1) = (1/N) Σ_{i=0..N−1} ((2π)^{−n/2} / ∏_{j=0..n−1} σ_j) e^{−Σ_{j=0..n−1} (y_j − y(S)^i_j)² / (2σ_j²)}    (30)
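To make equations 29 and 30 concrete, the sketch below (ours, not part of the original report) evaluates the univariate normal kernel density and its multivariate product form for fixed standard deviations, given a matrix of sample points:

import numpy as np

def kernel_density_1d(y, samples, sigma):
    # Equation 29: average of N Gaussians with fixed sigma, centred on the samples.
    z = (y - samples) / sigma
    return np.mean(np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi)))

def kernel_density_nd(y, samples, sigmas):
    # Equation 30: product kernel in n dimensions; samples is N x n, sigmas has length n.
    z = (y - samples) / sigmas                   # broadcast over the N kernels
    norm = (2.0 * np.pi) ** (-samples.shape[1] / 2.0) / np.prod(sigmas)
    return np.mean(norm * np.exp(-0.5 * np.sum(z**2, axis=1)))

# Example: density of uniformly drawn test data under a kernel model with sigma = 0.5.
rng = np.random.default_rng(1)
data = rng.uniform(-10.0, 10.0, size=500)
print(kernel_density_1d(0.0, data, 0.5))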

In figures 11 and 12, the density function taken over the variables univariately is plotted. Note that the localized aspect of the kernel method becomes clear immediately. Even though the samples were drawn from a uniform distribution, there are still peaks in the distribution. It is clear that this method is well suited for representing local features of the distribution. Because of this local aspect, the kernels give a much better representation of the clusters, as can be seen in the joint probability distribution using Gaussian kernels, which is depicted in figure 13. The joint distributions are not identical at all.

Equation 3 dictates that we require to know the conditional density function for n variables conditionally dependent on one single variable. It is not clear however whether that density function will be convenient to sample from or whether more sophisticated techniques are required to this end. However, it would be very interesting to investigate the use of this density function because of its properties.

6.2.2 Normal mixture distribution

The kernel method for density estimation has some useful properties, amongst which the possibility to select the amount of detail in the resulting density estimation based on the value for σ.


[Two panels: f(Y0) plotted against Y0 (left) and f(Y1) plotted against Y1 (right).]

Figure 12: Fitting the univariate data from the clustered dataset with a normal kernel distribution model with σ = 0.5.

[Two surface plots of f(Y0, Y1) over the (Y0, Y1) plane.]

Figure 13: Fitting the joint data from the uniform (left) and clustered (right) datasets with a Gaussian kernel model with σ0 = σ1 = 1.0.


However, if we have n data points, we require n kernels. This might become a problem when the conditional density function is hard to sample from, as it will take a lot of time to evaluate it. As a trade-off between the very global model of the normal distribution as a parametric model on the one side and the normal kernel distribution as a non-parametric model on the other side, we have the normal mixture distribution as a mixture model in between. The intuitive idea is to take more than one kernel, but not to take as many kernels as we have data points, nor to necessarily place them over the data points. Furthermore, each density function within such a mixture need not be weighted just as heavily as any other. This gives the following density function underlying the univariate normal mixture distribution defined on M models:

f(y) = Σ_{i=0..M−1} α_i (1/(v_i√(2π))) e^{−(y − c_i)² / (2v_i²)}    (31)

where ∀i∈{0,1,...,M−1}〈0 ≤ α_i ≤ 1〉 and Σ_{i=0..M−1} α_i = 1

In equation 31, c_i and v_i stand for the center and the variance of the i-th normal distribution density function in the mixture, respectively. We use this notation to prevent confusion with the single normal distribution or the normal kernel distribution. To find the centers c_i and the variances v_i, different approaches can be used. One well known method to this end is the EM algorithm (see for instance [5]).

The multivariate version of the density function in n dimensions with y = (y0, y1, . . . , yn−1) is defined by:

f(y0, y1, . . . , yn−1) = Σ_{i=0..M−1} α_i ((2π)^{−n/2} / (det V_i)^{1/2}) e^{−(1/2)(y − c_i)^T V_i^{−1} (y − c_i)}    (32)

where ∀i∈{0,1,...,M−1}〈0 ≤ α_i ≤ 1〉 and Σ_{i=0..M−1} α_i = 1

In equation 32, we have that c_i = (c_{i0}, c_{i1}, . . . , c_{i,n−1}) is the vector of means and V_i = E[(y − c_i)^T (y − c_i)] is the nonsingular symmetric covariance matrix. Both of these are defined with respect to index i, which stands for the i-th multivariate normal distribution density function. The EM algorithm can again be applied to find values for these variables.
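As an illustration of equations 31 and 32 (a sketch of ours, not taken from the paper), the following code evaluates a normal mixture density given mixing coefficients, centers and (co)variance parameters; finding these parameters, for instance with the EM algorithm, is assumed to have been done elsewhere.

import numpy as np

def mixture_density_1d(y, alphas, centers, vs):
    # Equation 31; v_i plays the role of the width of the i-th component.
    comps = np.exp(-0.5 * ((y - centers) / vs) ** 2) / (vs * np.sqrt(2.0 * np.pi))
    return np.sum(alphas * comps)            # requires sum(alphas) == 1

def mixture_density_nd(y, alphas, centers, covariances):
    # Equation 32: weighted sum of multivariate normal densities.
    total = 0.0
    n = len(y)
    for a, c, V in zip(alphas, centers, covariances):
        diff = y - c
        quad = diff @ np.linalg.inv(V) @ diff
        norm = (2.0 * np.pi) ** (-n / 2.0) / np.sqrt(np.linalg.det(V))
        total += a * norm * np.exp(-0.5 * quad)
    return total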

The normal mixture model has convenient properties in that it is a trade-off between the global distribution that is the parametric normal distribution and the local distributions that are the histogram distribution and the normal kernel distribution. The amount of computing time can be regulated because of the limit M on the amount of density functions within the mixture model. Further exploring the possibilities of using the mixture model seems to be very worthwhile.

Still, the same problem arises as with the normal kernel distribution when computing the conditional density function, as it is not yet clear whether that distribution will be straightforward to sample from. This problem in the use of conditional distributions is not present when using univariate distributions, as has been the case so far in density estimation based evolutionary algorithms for real spaces [30, 29, 11]. The mixture model was used by Gallagher, Fream and Downs [11] in a univariate distribution, but the full covariance matrix V_i for density function i in the model was reduced to a matrix with only entries on the diagonal.

7 Conclusions

We have presented the algorithmic framework IDEA for modelling density estimation optimization algorithms. These algorithms make use of density estimation techniques to build a probability distribution over the variables that code a problem in order to perform optimization. To this end, a probability density structure must be found and subsequently used in density estimation. For a set of existing search algorithms, we have presented their application within the IDEA framework, using three different density estimation models. One of these is used in the case of discrete random variables, whereas the other two are used in the case of continuous random variables. This, in combination with the rationale of modelling solutions below a threshold in a probability distribution, shows that the framework is both general and applicable.

Even though testing of the algorithms in the near future is required, together with different density models that may be implemented later on, the new approaches seem promising for the optimization of problems with continuous variables.

References

[1] T. Bäck and H.-P. Schwefel. Evolution strategies I: Variants and their computational implementation. In G. Winter, J. Périaux, M. Galán, and P. Cuesta, editors, Genetic Algorithms in Engineering and Computer Science, Proceedings of the First Short Course EUROGEN'95, pages 111–126. Wiley, 1995.

[2] S. Baluja and R. Caruana. Removing the genetics from the standard genetic algorithm. Technical report, Carnegie Mellon University, 1995.

[3] S. Baluja and S. Davies. Using optimal dependency-trees for combinatorial optimization: Learning the structure of the search space. In D.H. Fisher, editor, Proceedings of the 1997 International Conference on Machine Learning. Morgan Kaufmann Publishers, 1997. Also available as Technical Report CMU-CS-97-107.

[4] S. Bandyopadhyay, H. Kargupta, and G. Wang. Revisiting the GEMGA: Scalable evolutionary optimization through linkage learning. In Proceedings of the 1998 IEEE International Conference on Evolutionary Computation, pages 603–608. IEEE Press, 1998. Also available as Technical Report EECS-97-004, Washington State University, Pullman.

[5] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[6] J.S. De Bonet, C. Isbell, and P. Viola. MIMIC: Finding optima by estimating probability densities. Advances in Neural Information Processing Systems, 9, 1996.

[7] P.A.N. Bosman and D. Thierens. Linkage information processing in distribution estimation algorithms. In W. Banzhaf, J. Daida, A.E. Eiben, M.H. Garzon, V. Honavar, M. Jakiela, and R.E. Smith, editors, Proceedings of the GECCO-1999 Genetic and Evolutionary Computation Conference, pages 60–67. Morgan Kaufmann Publishers, 1999.

[8] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons Inc., 1991.

[9] K. Deb and D.E. Goldberg. Sufficient conditions for deceptive and easy binary functions. Annals of Mathematics and Artificial Intelligence, 10:385–408, 1994.

[10] J. Edmonds. Optimum branchings. J. Res. Nat. Bur. Standards, 71B:233–240, 1967. Reprinted in Math. of the Decision Sciences, Amer. Math. Soc. Lectures in Appl. Math., 11:335–345, 1968.

[11] M. Gallagher, M. Fream, and T. Downs. Real-valued evolutionary optimization using a flexible probability density estimator. In W. Banzhaf, J. Daida, A.E. Eiben, M.H. Garzon, V. Honavar, M. Jakiela, and R.E. Smith, editors, Proceedings of the GECCO-1999 Genetic and Evolutionary Computation Conference, pages 840–846. Morgan Kaufmann Publishers, 1999.

[12] D.E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, 1989.

[13] D.E. Goldberg, K. Deb, H. Kargupta, and G. Harik. Rapid, accurate optimization of difficult problems using fast messy genetic algorithms. In S. Forrest, editor, Proceedings of the Fifth International Conference on Genetic Algorithms, pages 56–64. Morgan Kaufmann Publishers, 1993. Also available as IlliGAL Report 93004.

[14] D.E. Goldberg, B. Korb, and K. Deb. Messy genetic algorithms: Motivation, analysis and first results. Complex Systems, 3:493–530, 1989.

[15] G. Harik. Linkage learning via probabilistic modeling in the ECGA. IlliGAL Technical Report 99010. ftp://ftp-illigal.ge.uiuc.edu/pub/papers/IlliGALs/99010.ps.Z, 1999.

[16] G. Harik, F. Lobo, and D.E. Goldberg. The compact genetic algorithm. In Proceedings of the 1998 IEEE International Conference on Evolutionary Computation, pages 523–528. IEEE Press, 1998.

[17] D. Heckerman, D. Geiger, and D.M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. In R. Lopez de Mantaras and D. Poole, editors, Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pages 293–301. Morgan Kaufmann Publishers, 1994. Also available as Technical Report MSR-TR-94-09.

[18] J.H. Holland. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press, 1975.

[19] H. Kargupta. The gene expression messy genetic algorithm. In Proceedings of the 1996 IEEE International Conference on Evolutionary Computation, pages 631–636. IEEE Press, 1996.

[20] R.M. Karp. A simple derivation of Edmonds' algorithm for optimum branchings. Networks, 1:265–272, 1971.

[21] F.G. Lobo, K. Deb, D.E. Goldberg, G.R. Harik, and L. Wang. Compressed introns in a linkage learning genetic algorithm. In W. Banzhaf, K. Chellapilla, K. Deb, M. Dorigo, D.B. Fogel, M.H. Garzon, D.E. Goldberg, H. Iba, and R. Riolo, editors, Genetic Programming 1998: Proceedings of the Third Annual Conference, pages 551–558. Morgan Kaufmann Publishers, 1998. Also available as IlliGAL Report 97010.

[22] H. Mühlenbein, T. Mahnig, and O. Rodriguez. Schemata, distributions and graphical models in evolutionary optimization. Journal of Heuristics, 5:215–247, 1999.

[23] H. Mühlenbein and G. Paaß. From recombination of genes to the estimation of distributions I. Binary parameters. In A.E. Eiben, T. Bäck, M. Schoenauer, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature – PPSN V, pages 178–187. Springer, 1998.

[24] M. Pelikan, D.E. Goldberg, and E. Cantú-Paz. BOA: The Bayesian optimization algorithm. In W. Banzhaf, J. Daida, A.E. Eiben, M.H. Garzon, V. Honavar, M. Jakiela, and R.E. Smith, editors, Proceedings of the GECCO-1999 Genetic and Evolutionary Computation Conference, pages 525–532. Morgan Kaufmann Publishers, 1999.

[25] M. Pelikan, D.E. Goldberg, and F. Lobo. A survey of optimization by building and using probabilistic models. IlliGAL Technical Report 99018. ftp://ftp-illigal.ge.uiuc.edu/pub/papers/IlliGALs/99018.ps.Z, 1999.

[26] M. Pelikan and H. Mühlenbein. The bivariate marginal distribution algorithm. In R. Roy, T. Furuhashi, K. Chawdry, and K. Pravir, editors, Advances in Soft Computing – Engineering Design and Manufacturing. Springer-Verlag, 1999.

[27] A. Ravindran, D.T. Phillips, and J.J. Solberg. Operations Research: Principles and Practice. John Wiley & Sons Inc., 1987.

[28] D.W. Scott. Multivariate Density Estimation. John Wiley & Sons Inc., 1992.

[29] M. Sebag and A. Ducoulombier. Extending population-based incremental learning to continuous search spaces. In A.E. Eiben, T. Bäck, M. Schoenauer, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature – PPSN V, pages 418–427. Springer, 1998.

[30] I. Servet, L. Travé-Massuyès, and D. Stern. Telephone network traffic overloading diagnosis and evolutionary computation technique. In J.K. Hao, E. Lutton, E. Ronald, M. Schoenauer, and D. Snyers, editors, Proceedings of Artificial Evolution '97, pages 137–144. Springer Verlag, LNCS 1363, 1997.

[31] C.E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, 1948.

[32] R. Tarjan. Finding optimum branchings. Networks, 7:25–35, 1977.

[33] D. Thierens and D.E. Goldberg. Mixing in genetic algorithms. In S. Forrest, editor, Proceedings of the Fifth International Conference on Genetic Algorithms, pages 38–45. Morgan Kaufmann Publishers, 1993.

[34] C.H.M. van Kemenade. Building block filtering and mixing. In Proceedings of the 1998 IEEE International Conference on Evolutionary Computation. IEEE Press, 1998.


A PDS graph search metrics

Given is a PDS:

(π,ω) = (π, (ω0, ω1, . . . , ω_{l−1}))    (33)

such that
∀i∈{0,1,...,l−1}〈ω_i ∈ {0, 1, . . . , l − 1} ∧ ∀k∈{0,1,...,l−1}−{i}〈ω_i ≠ ω_k〉〉
∀i∈{0,1,...,l−1}〈∀k∈π(ω_i)〈k ∈ {ω_{i+1}, ω_{i+2}, . . . , ω_{l−1}}〉〉

The Bayesian Dirichlet metric [17] in the discrete case, which we denote ψ_BD((π,ω)), is defined by:

ψ_BD((π,ω)) = p((π,ω)|ξ) ∏_{i=0..l−1} ∏_{k1=0..nd−1} ∏_{k2=0..nd−1} · · · ∏_{k_{|π(ω_i)|}=0..nd−1}    (34)
   ( m′(π(ω_i), ℓ_i)! / (m′(π(ω_i), ℓ_i) + m(π(ω_i), ℓ_i))! · ∏_{k0=0..nd−1} (m′(ω_i ◦ π(ω_i), k0 ◦ ℓ_i) + m(ω_i ◦ π(ω_i), k0 ◦ ℓ_i))! / m′(ω_i ◦ π(ω_i), k0 ◦ ℓ_i)! )

where
   ℓ_i = (k1, k2, . . . , k_{|π(ω_i)|}) = (ℓ_{i,0}, ℓ_{i,1}, . . . , ℓ_{i,|ℓ_i|−1})
   m(ν,λ) = Σ_{q=0..⌊τn⌋−1} { 1 if ∀r∈{0,1,...,|λ|−1}〈X(S)^q_{ν_r} = λ_r〉 ; 0 otherwise }

In the above, we have used ◦ for the notation of appending an element at the head of a vector: x0 ◦ (x1, x2, . . . , xn) = (x0, x1, . . . , xn). The expression p((π,ω)|ξ) stands for the prior probability of network (π,ω). The function m′(ν,λ) stands for prior information on m(ν,λ). In the BOA implementation [24], the prior information is disregarded:

   p((π,ω)|ξ) = 1
   m′(ν,λ) = 1

Disregarding the prior information results in the so called K2 metric. For numerical purposes, the actual metric that we work with is a logarithm of the K2 metric, which we denote in the case of discrete random variables by ψ_K2((π,ω)), given the same where clause as in equation 34:

ψ_K2((π,ω)) =    (35)
ln( ∏_{i=0..l−1} ∏_{k1=0..nd−1} ∏_{k2=0..nd−1} · · · ∏_{k_{|π(ω_i)|}=0..nd−1} ( (1 / (1 + m(π(ω_i), ℓ_i))!) · ∏_{k0=0..nd−1} (1 + m(ω_i ◦ π(ω_i), k0 ◦ ℓ_i))! ) ) =
Σ_{i=0..l−1} Σ_{k1=0..nd−1} Σ_{k2=0..nd−1} · · · Σ_{k_{|π(ω_i)|}=0..nd−1} ( −Σ_{j=1..1+m(π(ω_i),ℓ_i)} ln(j) + Σ_{k0=0..nd−1} Σ_{j=1..1+m(ω_i ◦ π(ω_i), k0 ◦ ℓ_i)} ln(j) )
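To illustrate equation 35 (a sketch of ours, not from the report; the function and argument names are hypothetical): given the selected discrete samples, the log K2 metric can be accumulated per variable and per parent instantiation from the counts m(·,·), using the log-gamma function, since Σ_{j=1..1+m} ln(j) = ln((1 + m)!).

import numpy as np
from math import lgamma
from itertools import product

def k2_log_metric(X, omega, parents, nd):
    # X: (num_samples x l) integer matrix of selected samples with values in 0..nd-1.
    # omega: variable ordering; parents: dict mapping each variable to a tuple of parents.
    score = 0.0
    for w in omega:
        par = parents[w]
        for inst in product(range(nd), repeat=len(par)):
            mask = np.ones(len(X), dtype=bool)
            for p, v in zip(par, inst):
                mask &= (X[:, p] == v)
            m_par = int(mask.sum())
            score -= lgamma(2 + m_par)           # -ln((1 + m(pi(w), inst))!)
            for v0 in range(nd):
                m_child = int((mask & (X[:, w] == v0)).sum())
                score += lgamma(2 + m_child)     # +ln((1 + m(w ° pi(w), v0 ° inst))!)
    return score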

The MIMIC algorithm [6] uses a metric based upon the Kullback–Leibler divergence to the chain probability distribution structure. The same is done in the optimal dependency trees approach [3], but with respect to the tree probability distribution structure. We find that if we write p(ν,λ) = p_{ν0,ν1,...,ν_{|λ|−1}}(λ0, λ1, . . . , λ_{|λ|−1}), we have in accordance with equation 18 that m(ν,λ) = ⌊τn⌋ p(ν,λ). If we now apply the Kullback–Leibler divergence to the general model from equation 7, we can derive another metric. Staying within the case of discrete random variables, we may write the negative of this metric, which we denote ψ_e((π,ω)) (the Kullback–Leibler divergence is thus equal to −ψ_e((π,ω))), as follows, still keeping the same where clause as in equation 34:

ψ_e((π,ω)) = −Σ_{i=0..l−1} h(X_{ω_i} | X_{π(ω_i)_0}, X_{π(ω_i)_1}, . . . , X_{π(ω_i)_{|π(ω_i)|−1}}) =    (36)
−Σ_{i=0..l−1} ( −Σ_{k0=0..nd−1} Σ_{k1=0..nd−1} · · · Σ_{k_{|π(ω_i)|}=0..nd−1} p(ω_i ◦ π(ω_i), k0 ◦ ℓ_i) ln(p(ω_i ◦ π(ω_i), k0 ◦ ℓ_i))
      − ( −Σ_{k1=0..nd−1} Σ_{k2=0..nd−1} · · · Σ_{k_{|π(ω_i)|}=0..nd−1} p(π(ω_i), ℓ_i) ln(p(π(ω_i), ℓ_i)) ) ) =
Σ_{i=0..l−1} Σ_{k1=0..nd−1} Σ_{k2=0..nd−1} · · · Σ_{k_{|π(ω_i)|}=0..nd−1} ( −p(π(ω_i), ℓ_i) ln(p(π(ω_i), ℓ_i)) + Σ_{k0=0..nd−1} p(ω_i ◦ π(ω_i), k0 ◦ ℓ_i) ln(p(ω_i ◦ π(ω_i), k0 ◦ ℓ_i)) )

The goal when using the K2 metric is to maximize ψ_K2((π,ω)). Using the Kullback–Leibler divergence, according to the above, we can also propose to maximize ψ_e((π,ω)), or in other words to minimize −ψ_e((π,ω)). Introducing the notation ψ̄_e((π,ω)), we can state this minimization problem as follows:

min_{(π,ω)} { ψ̄_e((π,ω)) = −ψ_e((π,ω)) = Σ_{i=0..l−1} h(X_{ω_i} | X_{π(ω_i)_0}, X_{π(ω_i)_1}, . . . , X_{π(ω_i)_{|π(ω_i)|−1}}) }    (37)

As the entropy measure has certain interesting and useful properties, we choose to work with the metric ψ̄_e((π,ω)). However, there seems to be a certain correspondence between the metrics ψ_K2((π,ω)) and ψ_e((π,ω)). In a direct sense, because we have written out the definition of conditional entropy for metric ψ_e((π,ω)), it is clear that a single substitution transforms ψ_K2((π,ω)) into ψ_e((π,ω)). This substitution is the following:

Σ_{j=1..1+m(ν,λ)} ln(j)  ⇝  (m(ν,λ)/⌊τn⌋) ln(m(ν,λ)/⌊τn⌋) = p(ν,λ) ln(p(ν,λ))

We have introduced this substitution by writing ⇝ because it is non-monotonic. However, in the composition with the positive sum over all instances of the considered node, the two metrics might distinguish between two networks in the same manner. If we observe closely, both of the lastly bracketed expressions from ψ_K2((π,ω)) and ψ_e((π,ω)) are a way of expressing the conditionality of a single variable on a set of parent variables. Whereas in ψ_K2((π,ω)) it is a factorial of instance numbers, in ψ_e((π,ω)) it is close to the logarithm of the empirical probabilities. We will not attempt to give an analysis of the correspondence between the two metrics. However, to give some insight into the behaviour of the metrics, we have portrayed some experiments. Consider the following trap functions:

f(X) = { −((1 − d)/(l − 1)) · o(X) + 1 − d   if o(X) < l
       { 1                                    if o(X) = l

Here o(X) stands for the amount of ones in vector X of length l. The maximum value of 1 is achieved for a vector with l ones and the suboptimum of 1 − d is found for a vector with l zeros. When d goes to 0, the problem becomes fully deceptive for some value of d. This means that all schemata of an order smaller than l lead to the suboptimum (deceptive attractor) [9]. We have set l = 5 and have taken only one such subvector as the problem. We then fixed the ordering of the variables to ω = (0, 1, 2, 3, 4). Because the problem does not depend on the ordering of the variables in the vector, this does not restrict the results to a special case. Using this ordering, we enumerated all of the 1024 possible combinations of parents and computed both the K2 metric ψ_K2((π,ω)) and the negative entropy metric ψ_e((π,ω)).


[Two panels; plotted series: K2 and −185·h(Σ) (left), K2 and −975·h(Σ) (right).]

Figure 14: Metrics for τ = 0.2 (left) and τ = 1.0 (right), d = 0.2.

[Two panels; plotted series: K2 and −185·h(Σ) (left), K2 and −975·h(Σ) (right).]

Figure 15: Metrics for τ = 0.2 (left) and τ = 1.0 (right), d = 0.8.

The computation of these metrics is done using a set of samples. To simulate the IDEA approach, we have taken 1000 samples and evaluated them according to the function f(·). We then selected the top 1000τ samples with respect to this measure and computed the metrics based upon this selection. To show the results for a few variants of the problem, we have chosen τ ∈ {0.2, 1.0} and d ∈ {0.2, 0.8}. The results are shown in figures 14 and 15.
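A minimal sketch (ours, not part of the report) of this experimental setup: the trap function defined above, random sampling and truncation selection of the top ⌊τ·1000⌋ samples, after which any of the metrics of this appendix can be computed on the selected set.

import numpy as np

def trap(x, d):
    # Trap function of order l = len(x): optimum at all ones, deceptive attractor at all zeros.
    l, o = len(x), int(np.sum(x))
    return 1.0 if o == l else -(1.0 - d) / (l - 1) * o + 1.0 - d

def select_top(samples, d, tau):
    # Evaluate, sort on fitness and keep the best floor(tau * n) samples (truncation selection).
    f = np.array([trap(x, d) for x in samples])
    order = np.argsort(-f)                        # descending fitness
    return samples[order[: int(np.floor(tau * len(samples)))]]

rng = np.random.default_rng(0)
samples = rng.integers(0, 2, size=(1000, 5))      # 1000 binary vectors, l = 5
selection = select_top(samples, d=0.2, tau=0.2)   # top 200 samples
# selection can now be fed to, e.g., the K2 or entropy metric sketched above.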

B Edmond’s algorithm for optimum branchings

Given is the following:

D = { V = {0, 1, . . . , n − 1}
      A ⊆ V × V
      G = (V, A)
      s : A → V such that s((i, j)) = i
      t : A → V such that t((i, j)) = j
      c : A → R⁺ }    (38)


The functions s(·) and t(·) respectively return the source and target of an arc, whereas function c(·) returns the weight of an arc in the graph. We seek to find a maximum weighted branching B ⊆ A. The weight C(·) of a branching B is the sum of the weights of the arcs in the branching. A set B ⊆ A is called a branching if B does not contain a cycle and no two arcs in B have the same target. We may now formulate problem P(·) as a function of the data D as follows:

P(D) : maximize C(B) = Σ_{i=0..n−1} Σ_{j=0..n−1} c(i, j) x_{ij}    (39)

such that
   B ⊆ A
   ∀(i,j)∈B〈 x_{ij} = { 1 if (i, j) ∈ B ; 0 otherwise } 〉
   ¬∃a=(a0,a1,...,a_{|a|−1})∈B^{|a|}〈 ∀i∈{0,1,...,|a|−2}〈t(a_i) = s(a_{i+1})〉 ∧ t(a_{|a|−1}) = s(a_0) 〉
   ∀(a,a′)∈B×B〈 a ≠ a′ → t(a) ≠ t(a′) 〉

The first step of the algorithm requires the notion of a critical graph. To this end, we introduce predicates CA(·), CG(·) and C′G(·), which stand for the definition of a critical arc, a critical graph and a help predicate for the definition of a critical graph, respectively. The following three definitions hold given the data D from equation 38 and given that we write a critical graph H for graph G = (V, A) as H = (V, A_H):

CA(a) ⇔ c(a) > 0 ∧ ∀a′∈A〈t(a′) = t(a) → c(a′) ≤ c(a)〉 (40)

C′G(H) ⇔ A_H ⊆ A ∧ ∀a_H∈A_H〈CA(a_H)〉 ∧ ∀(a_H,a′_H)∈A_H×A_H〈a_H ≠ a′_H → t(a_H) ≠ t(a′_H)〉    (41)

CG(H) ⇔ C′G(H) ∧ ∀H′=(V,A_{H′})〈C′G(H′) → |A_{H′}| ≤ |A_H|〉    (42)
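A compact sketch (ours, not the paper's pseudo-code) of the critical graph of definitions 40-42: for every node, keep a heaviest incoming arc with strictly positive weight.

def critical_graph(arcs):
    # arcs: iterable of (source, target, weight). Returns the arc set A_H of a critical graph.
    best = {}
    for s, t, c in arcs:
        if c > 0 and (t not in best or c > best[t][2]):
            best[t] = (s, t, c)          # heaviest positive arc entering t
    return list(best.values())

# Example: node 2 keeps arc (1, 2, 5.0); node 1 keeps (0, 1, 2.0); arc (2, 0, -1.0) is dropped.
print(critical_graph([(0, 1, 2.0), (1, 2, 5.0), (0, 2, 3.0), (2, 0, -1.0)]))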

In the algorithm, we have to be able to construct new data D̄ for problem P from data D if the critical graph has cycles. This is done by contracting the nodes in each cycle to a single new vertex. Assuming that the critical graph H contains k cycles C0, C1, . . . , C_{k−1} such that ∀i∈{0,1,...,k−1}〈C_i ⊆ A〉, these cycles concern disjoint sets of vertices: ∀i∈{0,1,...,k−1}〈V_i = {t(a) | a ∈ C_i}〉. We may now define the new data D̄(·) as a function of a critical graph H:

D̄(H) = { V̄ = Ṽ ∪ {z0, z1, . . . , z_{k−1}} such that the z_i ∈ N are new, unused vertex labels
          Ā = {a | a ∈ A ∧ (s(a) ∈ Ṽ ∨ t(a) ∈ Ṽ)}
          Ḡ = (V̄, Ā)
          s̄(a) = { s(a) if s(a) ∈ Ṽ ; z_i if s(a) ∈ V_i }
          t̄(a) = { t(a) if t(a) ∈ Ṽ ; z_i if t(a) ∈ V_i }
          c̄(a) = { c(a) if t(a) ∈ Ṽ ; c(a) − c(f(t(a))) + c(a_i^min) if t(a) ∈ V_i }    (43)

where
   Ṽ = V − ∪_{i=0..k−1} V_i
   ∀i∈{0,1,...,k−1}〈a_i^min = arg min_{a∈C_i} {c(a)}〉
   ∀a_H∈A_H〈f(t(a_H)) = a_H〉

Karp [20] has shown that the following one-to-one correspondence holds between an optimum branching B in P(D) and an optimum branching B̄ in problem P(D̄(H)):

B = B̄ ∪ ( ∪_{i=0..k−1} (C_i − a_i^break) )    (44)


where a_i^break = { a if ∃a∈B̄〈t̄(a) = z_i〉 ; a_i^min otherwise }

Edmond’s algorithm [10] computes a critical graph H, which is a graph H such that CG(H) holds.It has been shown by Karp [20] that if H is acyclic, it is an optimum branching. Otherwise, thealgorithm recursively computes an optimum branching B using data D(H) in problem P(D(H)).By using the correspondence from equation 44, the optimum branching B for the original problemcan be determined.

C Multivariate conditional normal distribution

The univariate density function for the normal distribution is defined by:

f(y) = (1/(σ√(2π))) e^{−(y − µ)² / (2σ²)} = g(y, µ, σ)    (45)

The multivariate density function for the normal distribution in d dimensions with y = (y0, y1, . . . , y_{d−1}) is defined by:

f(y) = ((2π)^{−d/2} / (det Σ)^{1/2}) e^{−(1/2)(y − µ)^T Σ^{−1} (y − µ)}    (46)

In equation 46, we have that µ = (µ0, µ1, . . . , µ_{d−1}) is the vector of means and Σ = E[(y − µ)^T (y − µ)] is the nonsingular covariance matrix, which is symmetric. It is common practice to define σ_{ij} to be the entry in row i and column j in matrix Σ, meaning σ_{ij} = Σ(i, j). Note that it follows from the definition of Σ that σ_{ii} = σ_i². We similarly define σ′_{ij} = (Σ^{−1})(i, j). We may now rewrite matrix equation 46 as an equation involving only element variables (with the exception of the determinant):

f(y0, y1, . . . , y_{d−1}) = ((2π)^{−d/2} / (det Σ)^{1/2}) e^{−(1/2) Σ_{j=0..d−1} (Σ_{i=0..d−1} (y_i − µ_i) σ′_{ij}) (y_j − µ_j)}    (47)

We derive the density function for the conditional normal distribution for a single variable y0 that is dependent on c other variables y1, y2, . . . , yc. In the case of sampling, all µ_i and σ′_{ij} parameters are assumed to be known, as are instance values for y1, y2, . . . , yc. This leaves us with only a single free parameter y0 and the remainder as constants, making the resulting pdf one-dimensional. In a closed form, this means we are searching to find µ̄0 and σ̄0 so that the resulting density function can be expressed as a one-dimensional normalized Gaussian:

f(y0 | y1, y2, . . . , yc) = (1/(σ̄0√(2π))) e^{−(y0 − µ̄0)² / (2σ̄0²)} = g(y0, µ̄0, σ̄0)    (48)

First, we note that from probability theory we have

f(y0 | y1, y2, . . . , yc) = f(y0, y1, y2, . . . , yc) / f(y1, y2, . . . , yc)    (49)

Combining equation 49 with equation 46, we note that we get two covariance matrices, of sizes (c + 1) × (c + 1) and c × c, for the two multivariate expressions in the fraction of equation 49. We shall write the elements of the (c + 1) × (c + 1) matrix as σ^(1)_{ij} and the elements of the c × c matrix as σ^(2)_{ij}. Actually, we require the elements of the inverse matrices, so we shall write these as σ′_{ij} and σ′′_{ij} respectively. Note that from the definition of Σ it follows that σ^(1)_{ij} = σ^(1)_{ji} and even that σ^(1)_{ij} = σ^(2)_{(i−1)(j−1)} when i ≥ 1 ∧ j ≥ 1. However, in general we cannot state that σ′_{ij} = σ′′_{(i−1)(j−1)}. What we may state in general is that because Σ is by definition symmetric, so is Σ^{−1}. This means that σ′_{ij} = σ′_{ji} and that σ′′_{ij} = σ′′_{ji}. It is very important in the following to note that σ′′_{ij} concerns variables y_{i+1} and y_{j+1} in terms of covariance. The mismatch in variable index is because of matrix indexing. We can now derive the required expressions for µ̄0 and σ̄0:

f(y0 | y1, y2, . . . , yc) =
   [ ((2π)^{−(c+1)/2} / (det Σ^(1))^{1/2}) e^{−(1/2) Σ_{j=0..c} (Σ_{i=0..c} (y_i − µ_i)σ′_{ij})(y_j − µ_j)} ]
   / [ ((2π)^{−c/2} / (det Σ^(2))^{1/2}) e^{−(1/2) Σ_{j=1..c} (Σ_{i=1..c} (y_i − µ_i)σ′′_{(i−1)(j−1)})(y_j − µ_j)} ]
   = a e^b

From the form a e^b it must become apparent what the required expressions µ̄0 and σ̄0 are. It is clear to see from the equation in definition 45 that from computing a we will be able to find σ̄0 and that from computing b we will be able to find both µ̄0 and σ̄0. Even though deriving only b would be enough, we will derive both and see whether we arrive at matching results for σ̄0. First, we determine a to get σ̄0:

a = ((2π)^{−(c+1)/2} / (det Σ^(1))^{1/2}) · ((det Σ^(2))^{1/2} / (2π)^{−c/2}) = 1 / ( √(det Σ^(1) / det Σ^(2)) · √(2π) )
⇒ σ̄0 = √(det Σ^(1) / det Σ^(2))

Second, we determine b to get both µ̄0 and σ̄0:

b = (1/2) ( −Σ_{j=0..c} (Σ_{i=0..c} (y_i − µ_i)σ′_{ij})(y_j − µ_j) + Σ_{j=1..c} (Σ_{i=1..c} (y_i − µ_i)σ′′_{(i−1)(j−1)})(y_j − µ_j) )

= (1/2) ( −Σ_{j=1..c} (Σ_{i=1..c} (y_i − µ_i)σ′_{ij})(y_j − µ_j)
          − (y0 − µ0) Σ_{i=1..c} (y_i − µ_i)σ′_{i0}
          − (y0 − µ0)² σ′_{00}
          − (y0 − µ0) Σ_{j=1..c} (y_j − µ_j)σ′_{0j}
          + Σ_{j=1..c} (Σ_{i=1..c} (y_i − µ_i)σ′′_{(i−1)(j−1)})(y_j − µ_j) )

= (1/2) ( −Σ_{j=1..c} (Σ_{i=1..c} (y_i − µ_i)σ′_{ij})(y_j − µ_j)
          − 2y0 Σ_{i=1..c} (y_i − µ_i)σ′_{i0}
          + 2µ0 Σ_{i=1..c} (y_i − µ_i)σ′_{i0}
          − y0² σ′_{00}
          + 2y0µ0 σ′_{00}
          − µ0² σ′_{00}
          + Σ_{j=1..c} (Σ_{i=1..c} (y_i − µ_i)σ′′_{(i−1)(j−1)})(y_j − µ_j) )

Defining
   a0 = −Σ_{j=1..c} (Σ_{i=1..c} (y_i − µ_i)σ′_{ij})(y_j − µ_j)
   a1 = −2 Σ_{i=1..c} (y_i − µ_i)σ′_{i0}
   a2 = 2µ0 Σ_{i=1..c} (y_i − µ_i)σ′_{i0}
   a3 = −σ′_{00}
   a4 = 2µ0 σ′_{00}
   a5 = −µ0² σ′_{00}
   a6 = Σ_{j=1..c} (Σ_{i=1..c} (y_i − µ_i)σ′′_{(i−1)(j−1)})(y_j − µ_j)
   α = −a3
   β = −a1 − a4
   γ = −a0 − a2 − a5 − a6

we may write

b = (1/2)(a0 + a1 y0 + a2 + a3 y0² + a4 y0 + a5 + a6) = −(1/2)(α y0² + β y0 + γ) = −(y0² + (β/α) y0 + γ/α) / (2 (1/α))

So that, when we realize that (y0 − µ̄0)² = y0² − 2µ̄0 y0 + µ̄0², we find that if we have (β/(−2α))² = γ/α (which can be checked to be the case), we may write:
   σ̄0 = √(1/α)
   µ̄0 = √(γ/α) = β/(−2α)


Note that we now have a different expression for σ̄0, namely √(α^{−1}) = √(1/σ′_{00}). It can be checked from linear algebra that indeed (det Σ^(1))/(det Σ^(2)) = 1/σ′_{00}. Finally, we thus arrive at the definition of the required multivariate density function for the conditional normal distribution:

f(y0 | y1, y2, . . . , yc) = g(y0, µ̄0, σ̄0)    (50)

where
   σ̄0 = 1 / √(σ′_{00})
   µ̄0 = (µ0 σ′_{00} − Σ_{i=1..c} (y_i − µ_i)σ′_{i0}) / σ′_{00}
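The end result of this appendix is easy to use in practice. The sketch below (ours, not part of the original text) computes σ̄0 and µ̄0 from the inverse covariance matrix, exactly as in equation 50, and then draws a sample for y0 given values for y1, . . . , yc.

import numpy as np

def conditional_normal_params(mu, sigma_inv, y_rest):
    # mu: means (mu0..muc); sigma_inv: inverse of the (c+1)x(c+1) covariance matrix;
    # y_rest: observed values y1..yc. Returns (mu0_bar, sigma0_bar) as in equation 50.
    s00 = sigma_inv[0, 0]
    sigma0_bar = 1.0 / np.sqrt(s00)
    corr = np.dot(y_rest - mu[1:], sigma_inv[1:, 0])   # sum_i (y_i - mu_i) * sigma'_{i0}
    mu0_bar = mu[0] - corr / s00
    return mu0_bar, sigma0_bar

# Example usage: sample y0 conditionally on y1 and y2.
rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.6, 0.3], [0.6, 1.0, 0.2], [0.3, 0.2, 1.5]])
mu = np.array([0.0, 1.0, -1.0])
mu0, s0 = conditional_normal_params(mu, np.linalg.inv(Sigma), np.array([1.4, -0.5]))
y0 = rng.normal(mu0, s0)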

D Differential entropy for the histogram distribution

As defined in algorithm ReHiEst in section 5.2, the modeled distribution for variable Y_i is defined to be 0 anywhere outside of [minY_i, maxY_i]. Inside that range, it is a step function where the amount of steps equals the amount of bins. First we need a generalized version of equation 25:

b_{j0,j1,...,jn−1}[k0, k1, . . . , k_{n−1}] =    (51)
| { Y(S)^q_{j0} | q ∈ {0, 1, . . . , ⌊τn⌋ − 1} ∧ ∀i∈{0,1,...,n−1}〈 k_i ≤ ((Y(S)^q_{ji} − minY_{ji}) / rangeY_{ji}) · r < k_i + 1 〉 } |

Now, starting from equation 12, for n variables Y_{j0}, Y_{j1}, . . . , Y_{jn−1} that have a histogram distribution as defined by algorithms ReHiEst and ReHiSam, we can derive the differential entropy as follows:

h(Y_{j0}, Y_{j1}, . . . , Y_{jn−1}) =
−∫_{−∞}^{∞} ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f_{j0,j1,...,jn−1}(y0, y1, . . . , y_{n−1}) ln(f_{j0,j1,...,jn−1}(y0, y1, . . . , y_{n−1})) dy0 dy1 . . . dy_{n−1} =
−∫_{minY_{j0}}^{maxY_{j0}} ∫_{minY_{j1}}^{maxY_{j1}} · · · ∫_{minY_{jn−1}}^{maxY_{jn−1}} f_{j0,j1,...,jn−1}(y0, y1, . . . , y_{n−1}) ln(f_{j0,j1,...,jn−1}(y0, y1, . . . , y_{n−1})) dy0 dy1 . . . dy_{n−1} =
   { defining ρ(ι, λ) = minY_ι + (λ/r) rangeY_ι }
−Σ_{k0=0..r−1} Σ_{k1=0..r−1} · · · Σ_{kn−1=0..r−1} ( ∫_{ρ(j0,k0)}^{ρ(j0,k0+1)} ∫_{ρ(j1,k1)}^{ρ(j1,k1+1)} · · · ∫_{ρ(jn−1,kn−1)}^{ρ(jn−1,kn−1+1)} f_{j0,j1,...,jn−1}(y0, . . . , y_{n−1}) ln(f_{j0,j1,...,jn−1}(y0, . . . , y_{n−1})) dy0 dy1 . . . dy_{n−1} ) =
−Σ_{k0=0..r−1} Σ_{k1=0..r−1} · · · Σ_{kn−1=0..r−1} ( (rangeY_{j0}/r)(rangeY_{j1}/r) · · · (rangeY_{jn−1}/r) · b_{j0,j1,...,jn−1}[k0, k1, . . . , k_{n−1}] ln(b_{j0,j1,...,jn−1}[k0, k1, . . . , k_{n−1}]) )

The only fundamental difference with equation 10 for the discrete multivariate entropy lies in the factor ∏_{i=0..n−1} rangeY_{ji} / r^n. In the case of discrete random variables, r = nd and the range is also equal to nd, which makes this factor evaluate to 1 and disappear from the equation. We have arrived at the definition of the multivariate differential entropy for n continuous random variables with a histogram distribution according to algorithms ReHiEst and ReHiSam from section 5:

h(Y_{j0}, Y_{j1}, . . . , Y_{jn−1}) =    (52)
−(∏_{i=0..n−1} rangeY_{ji} / r^n) Σ_{k0=0..r−1} Σ_{k1=0..r−1} · · · Σ_{kn−1=0..r−1} ( b_{j0,j1,...,jn−1}[k0, k1, . . . , k_{n−1}] ln(b_{j0,j1,...,jn−1}[k0, k1, . . . , k_{n−1}]) )
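A sketch (ours) of this expression for an arbitrary number of variables, taking the bin statistics b as relative frequencies, as the ReHi* algorithms of section 5 do when they accumulate 1/⌊τn⌋ per selected sample:

import numpy as np

def histogram_differential_entropy(samples, r):
    # Equation 52 for the n columns of `samples`, with b taken as the relative
    # frequency of each of the r^n bins spanning the data range per dimension.
    n = samples.shape[1]
    ranges = samples.max(axis=0) - samples.min(axis=0)
    counts, _ = np.histogramdd(samples, bins=r)
    b = counts / samples.shape[0]
    nz = b > 0                                   # treat 0 * ln(0) as 0
    return -(np.prod(ranges) / r**n) * np.sum(b[nz] * np.log(b[nz]))

# Example: entropy estimate for two variables of the selected samples with r = 10 bins.
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 2))
print(histogram_differential_entropy(Y, 10))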


E Additional algorithms

Algorithm GraphSeaOptimumBranching computes an optimum branching using Edmonds' algorithm as specified in appendix B. In order to do so, it first computes a critical graph and then checks whether the graph contains cycles. If it does not, the resulting selection of arcs is the optimum branching. If it does contain cycles, the derived problem is created following the definitions in appendix B. The algorithm is then applied recursively to this smaller problem. From the result obtained by the recursive call, the final result is constructed using the correspondence, which is also stated in appendix B. The data used to compute the optimum branching mainly consists of the nodes V and the arcs A of the graph. As the source, target and weight of an arc are subject to change in the derived problem while the actual arcs remain the same, the history of the source, target and weight of an arc during recursive calls is kept on a stack.

GraphSeaOptimumBranching(V, A, z)
1 (V, A_H) ← GraphSeaCriticalGraph(A, z)
2 (C, f) ← GraphSeaCycles(V, A_H, z)
3 if |C| = 0 then
   3.1 B ← A_H
4 else
   4.1 (V, A, z, amin) ← GraphSeaDerivedProblem(V, A, C, f, z)
   4.2 B ← GraphSeaOptimumBranching(V, A, z)
   4.3 B ← GraphSeaCorrespondence(B, A, z, f, C, amin)
5 return(B)

Algorithm GraphSeaCriticalGraph computes a critical graph, given collections V and A of nodes and arcs respectively. To this end, all arcs are considered. At each node v, the arc with target node v and with maximum positive weight is stored. All the arcs so stored constitute the critical graph according to the definition stated in appendix B.

GraphSeaCriticalGraph(A,z)1 C ← new array of edarc with size z2 AH ← new vector of edarc3 for i ← 0 to z − 1 do

3.1 C[i] ← (⊥,⊥,⊥)4 for i ← 0 to |A| − 1 do

4.1 (Ss, St, Sc) ← Ai

4.2 t ← top(St)4.3 c ← top(Sc)4.4 if C[t] = ⊥ then

4.4.1 if c > 0 then4.4.1.1 C[t] ← (Ss, St, Sc)

4.5 else

4.5.1 (Ss′, St′, Sc′) ← C[t]4.5.2 c′ ← top(Sc′)4.5.3 if 0 < c′ ≤ c then

4.5.3.1 C[t] ← (Ss, St, Sc)5 for i ← 0 to z − 1 do

5.1 if C[i] �= ⊥ then5.1.1 AH

|AH | ← C[i]6 return((V,AH))

Algorithm GraphSeaCycles checks whether a critical graph has cycles and returns all of these cycles. To achieve this, a depth-first search is performed at each non-visited node; as soon as a node is visited that has been encountered earlier in a single depth-first run, the nodes are traversed backwards and the cycle is constructed.

GraphSeaCycles(V ,AH ,z)1 f ← new array of edarc with size z2 m,m′ ← 2 new arrays of boolean with size z3 C ← new vector of (vector of integer,vector of edarc)4 for i ← 0 to z − 1 do

4.1 m[i] ← true4.2 m′[i] ← true4.3 f [i] ← (⊥,⊥,⊥)

5 for i ← 0 to |V | − 1 do5.1 m[Vi] ← false5.2 m′[Vi] ← false

6 for i ← 0 to |AH | − 1 do6.1 (Ss, St, Sc) ← AH

i

6.2 t ← top(St)6.3 f [t] ← (Ss, St, Sc)

7 for i ← 0 to |V | − 1 do7.1 if ¬m[Vi] then

7.1.1 (b, V C , AC , vc, bc) ← GraphSeaCyclesVisit(Vi,m,m′,f)7.1.2 if b then

7.1.2.1 C|C| ← (V C , AC)8 return(C, f)

GraphSeaCyclesVisit(v,m[·],m′[·],f [·])1 T ← (false,⊥,⊥,−1,false)2 if m′[v] then

2.1 V C ← new vector of integer2.2 AC ← new vector of edarc2.3 T ← (true, V C , AC , v, true)

3 if ¬m[v] then3.1 m[v] ← true3.2 m′[v] ← true3.3 (Ss, St, Sc) ← f [v]3.4 if Ss �= ⊥ then

3.4.1 s ← top(Ss)3.4.2 (b, V C , AC , vc, bc) ← GraphSeaCyclesVisit(s,m,m′,f)3.4.3 m′[v] ← false3.4.4 if b then

3.4.4.1 if bc then3.4.4.1.1 V C

|V C | ← v

3.4.4.1.2 AC|AC | ← (Ss, St, Sc)

3.4.4.1.3 if vc = v then3.4.3.1.3.1 bc ← false

3.4.4.2 T ← (true, V C , AC , vc, bc)4 return(T )

Algorithm GraphSeaDerivedProblem constructs the derived problem by first determining for each node in V whether it lies in V or in the collection of nodes V_i of some cycle. It then straightforwardly constructs V and A. For each arc in A, the new source s, destination t and cost c are pushed onto the stacks of the arc, so that the tops of the stacks hold the relevant problem data for the next recursive call to algorithm GraphSeaOptimumBranching.


GraphSeaDerivedProblem(V ,A,C,f [·],z)1 ev ← new array of integer with size z

2 amin new array of edarc with size |C|3 V ← new vector of integer4 A ← new vector of edarc5 u ← new vector of (integer, integer, real)6 for i ← 0 to z − 1 do

6.1 ev[i] ← −17 for i ← 0 to |V | − 1 do

7.1 ev[Vi] ← |C|8 for i ← 0 to |C| − 1 do

8.1 (V C , AC) ← Ci

8.2 for j ← 0 to |V C | − 1 do8.2.1 ev[V C

j ] ← i

8.3 amin[i] ← AC0

8.4 for j ← 1 to |AC | − 1 do8.4.1 (Ss, St, Sc) ← AC

j

8.4.2 c ← top(Sc)8.4.3 (Ss′, St′, Sc′) ← amin[i]8.4.4 c′ ← top(Sc′)8.4.5 if c < c′ then

8.4.5.1 amin[i] ← (Ss, St, Sc)9 for i ← 0 to z − 1 do

9.1 if ev[i] = |C| then9.1.1 V |V | ← i

10 z ← z + |C|11 for i ← 0 to |C| − 1 do

11.1 V |V | ← z + i

12 for i ← 0 to |A| − 1 do12.1 (Ss, St, Sc) ← Ai

12.2 s ← top(Ss)12.3 t ← top(St)12.4 c ← top(Sc)12.5 (s, t, c) ← (s, t, c)12.6 if ev[s] = |C| ∨ ev[t] = |C| then

12.6.1 if 0 ≤ ev[s] < |C| then12.6.1.1 s ← z + ev[s]

12.6.2 if 0 ≤ ev[t] < |C| then12.6.2.1 t ← z + ev[t]12.6.2.2 (Ss, St, Sc) ← f [t]12.6.2.3 c ← top(Sc)12.6.2.4 (Ssmin, Stmin

, Scmin) ← amin[ev[t]]12.6.2.5 cmin ← top(Scmin)12.6.2.6 c ← c − c + cmin

12.6.3 u|u| ← (s, t, c)12.6.4 A|A| ← (Ss, St, Sc)

13 for i ← 0 to |A| − 1 do13.1 (Ss, St, Sc) ← Ai

13.2 (s, t, c) ← ui

13.3 push(Ss, s)13.4 push(St, t)13.5 push(Sc, c)

14 return((V ,A, z, amin))


Algorithm GraphSeaCorrespondence computes the resulting optimum branching by first taking all of the arcs from B and then adding arcs from the cycles, while leaving one arc out per cycle to prevent cycles in the resulting branching B. To this end, the source and target of an arc in the original problem are required, as well as the source and target of an arc in the derived problem. As these were stored on a stack, this data is readily available. Finally, directly opposite to the end of algorithm GraphSeaDerivedProblem, the relevant problem data is removed from the top of the stack for each arc in the derived problem.

GraphSeaCorrespondence(B,A,z,f [·],C,amin[·])1 f ← new array of edarc with size |C|2 for i ← 0 to |C| − 1 do

2.1 f [i] ← (⊥,⊥,⊥)3 for i ← 0 to |B| − 1 do

3.1 (Ss, St, Sc) ← Bi

3.2 t ← top(St)3.3 if 0 ≤ t − z < |C| then

3.3.1 f [t − z] ← (Ss, St, Sc)4 B ← {(Ss, St, Sc) | (Ss, St, Sc) ∈ B}5 for i ← 0 to |C| − 1 do

5.1 (V C , AC) ← Ci

5.2 (Ssbreak, Stbreak, Scbreak) ← amin[i]

5.3 sbreak ← top(Ssbreak)5.4 tbreak ← top(Stbreak)5.5 (Ss, St, Sc) ← f [i]5.6 if St �= ⊥ then

5.6.1 t ← pop(St)5.6.2 (Ssbreak, Stbreak

, Scbreak) ← f [top(St)]5.6.3 sbreak ← top(Ssbreak)5.6.4 tbreak ← top(Stbreak)5.6.5 push(St, t)

5.7 for j ← 0 to |AC | − 1 do5.7.1 (Ss, St, Sc) ← AC

j

5.7.2 s ← top(Ss)5.7.3 t ← top(St)5.7.4 if (sbreak, tbreak) �= (s, t) then

5.7.4.1 B|B| ← (Ss, St, Sc)6 for i ← 0 to |A| − 1 do

6.1 (Ss, St, Sc) ← Ai

6.2 pop(Ss)6.3 pop(St)6.4 pop(Sc)

7 return(B)

Algorithm GraphSeaTopologicalSort performs a topological sort for an array vp of vectors that contains for each node its parents. The algorithm places the ordering of the variables in the ω vector as required. Note that the topological sort is performed using the arcs backwards, as the base of an arc points to a parent and the parents have to be on the side with the larger indices in the ω vector.


GraphSeaTopologicalSort(vp[·])
1   m ← new array of boolean with size l
2   for i ← 0 to l − 1 do
    2.1   m[i] ← false
    2.2   π(i) ← vp[i]
3   z ← l − 1
4   for i ← 0 to l − 1 do
    4.1   if ¬m[i] then
          4.1.1   z ← GraphSeaTopologicalSortVisit(i, m, vp, z)
5   return((π, ω))

GraphSeaTopologicalSortVisit(i, m[·], vp[·], z)
1   m[i] ← true
2   for j ← 0 to |vp[i]| − 1 do
    2.1   if ¬m[vp[i]_j] then
          2.1.1   z ← GraphSeaTopologicalSortVisit(vp[i]_j, m, vp, z)
3   ω_z ← i
4   return(z − 1)
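The same parent-first ordering can be sketched in Python; this is only an illustration of the two routines above, with parents[i] playing the role of vp[i] and omega the role of ω.

def topological_sort(parents):
    l = len(parents)
    marked = [False] * l
    omega = [None] * l
    z = l - 1

    def visit(i):
        nonlocal z
        marked[i] = True
        for p in parents[i]:               # visit all parents of i first
            if not marked[p]:
                visit(p)
        omega[z] = i                       # place i below its parents
        z -= 1

    for i in range(l):
        if not marked[i]:
            visit(i)
    return omega

For instance, topological_sort([[1, 2], [], [1]]) yields [0, 2, 1], so every node ends up at a smaller index in omega than each of its parents, exactly as required above.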

Algorithm GraphSeaArcAdd updates which arcs are still allowed to be added to the graph. All arcs that would cause the graph to become cyclic when added, are marked as not allowed. This is achieved by a depth–first search through the successor nodes vs of v1, including v1 itself, and, for each node so encountered, another depth–first search through the predecessor nodes vp of v0, including v0 itself, so as to mark each arc (vs, vp) as not allowed. If the maximum number of parents is reached for node v1, all arcs (i, v1), i ∈ {0, 1, . . . , l − 1}, are marked as not allowed. The algorithm returns the number of arcs that have become newly marked.

GraphSeaArcAdd(κ, v0, v1, a[·,·], vp[·], vs[·])
1   m ← new array of boolean with size l
2   ns, np ← 2 new vectors of integer
3   S ← new stack of integer
4   δ ← 0
5   vs[v0]_|vs[v0]| ← v1
6   vp[v1]_|vp[v1]| ← v0
7   if a[v0, v1] then
    7.1   a[v0, v1] ← false
    7.2   δ ← δ + 1
8   for i ← 0 to l − 1 do
    8.1   m[i] ← false
9   push(S, v1)
10  while ¬empty(S) do
    10.1  v ← pop(S)
    10.2  if ¬m[v] then
          10.2.1  m[v] ← true
          10.2.2  ns_|ns| ← v
          10.2.3  for k ← 0 to |vs[v]| − 1 do
                  10.2.3.1  push(S, vs[v]_k)
11  push(S, v0)
12  while ¬empty(S) do
    12.1  v ← pop(S)
    12.2  if ¬m[v] then
          12.2.1  m[v] ← true
          12.2.2  np_|np| ← v
          12.2.3  for k ← 0 to |vp[v]| − 1 do
                  12.2.3.1  push(S, vp[v]_k)



13  for i ← 0 to |ns| − 1 do
    13.1  for j ← 0 to |np| − 1 do
          13.1.1  if a[ns_i, np_j] then
                  13.1.1.1  a[ns_i, np_j] ← false
                  13.1.1.2  δ ← δ + 1
14  if |vp[v1]| = κ then
    14.1  for i ← 0 to l − 1 do
          14.1.1  if a[i, v1] then
                  14.1.1.1  a[i, v1] ← false
                  14.1.1.2  δ ← δ + 1
15  return(δ)
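For illustration, the same bookkeeping can be written in Python. This is a sketch under hypothetical names: arc_add, allowed, pred and succ stand in for a[·,·], vp[·] and vs[·], and kappa is the maximum number of parents κ.

def arc_add(kappa, v0, v1, allowed, pred, succ):
    succ[v0].append(v1)                    # steps 5 and 6: record the new arc
    pred[v1].append(v0)
    newly_forbidden = 0
    if allowed[v0][v1]:                    # step 7: the added arc itself
        allowed[v0][v1] = False
        newly_forbidden += 1

    def reachable(start, adjacency):
        # Iterative depth-first search collecting start and every node
        # reachable from it through the given adjacency lists (steps 9-12).
        seen, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adjacency[v])
        return seen

    descendants = reachable(v1, succ)      # successors of v1, including v1
    ancestors = reachable(v0, pred)        # predecessors of v0, including v0
    for s in descendants:                  # step 13: any arc from a successor
        for t in ancestors:                # to a predecessor would close a cycle
            if allowed[s][t]:
                allowed[s][t] = False
                newly_forbidden += 1
    if len(pred[v1]) == kappa:             # step 14: parent limit of v1 reached
        for s in range(len(allowed)):
            if allowed[s][v1]:
                allowed[s][v1] = False
                newly_forbidden += 1
    return newly_forbidden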

F Pseudo–code conventions in notation and data usage

All algorithms are written as functions using capital letters, such as Function(). The assignment operation is written as A ← X, assigning X to A. Pseudo–code reserved words appear in boldface, such as while. The scope of structures is related to the indentation and numbering of lines: the scope of a grouping statement in line l is l.∗. The remainder of the conventions concerns the data structures and elementary data types that are assumed to be available:

Elementary data types

Type      Description
boolean   Truth value (element of {false, true})
integer   Element of Z
real      Element of R

Composite data structures

Type      Description
array     Typed array, memory block with fixed size.
          • Starting index: 0.
          • Notation for indexing: A[i].
          • Notation for determining the length of an array: |A|.

vector    Typed dynamic array, memory block with non–fixed size.
          • Starting index: 0.
          • Notation for indexing: V_i.
          • Notation for determining the number of elements in a vector: |V|.
          • Special assignment notation: V_|V| ← X, appends X at the end of
            vector V, increasing its size.

stack     Typed collection of elements with non–fixed size, allowing the
          following operations, all of which run in O(1) time:
          • Empty(S), returns true if stack S contains no elements, false
            otherwise.
          • Push(S, X), places X onto the top of stack S, increasing its size.
          • Pop(S), returns the element that resides on the top of stack S and
            removes it from the stack, decreasing its size.
          • Top(S), returns the element that resides on the top of stack S
            without removing it from the stack.
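As a purely illustrative aside, these conventions map directly onto standard containers in, for instance, Python: a fixed-size array becomes a pre-allocated list, a vector is a list grown with append (the V_|V| ← X notation), and a stack is a list used only through append, pop and [-1], all of which run in (amortized) constant time.

array = [0] * 8                 # array of integer with size 8; |A| is len(array)

vector = []                     # vector of real
vector.append(3.14)             # V_|V| <- 3.14, appends at the end
n = len(vector)                 # |V|, the number of elements

stack = []                      # stack of integer
stack.append(42)                # Push(S, 42)
top_element = stack[-1]         # Top(S)
popped = stack.pop()            # Pop(S)
is_empty = (len(stack) == 0)    # Empty(S)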
