UNIVERSITY OF ALMERÍA
Department of Statistics and Applied Mathematics
PhD thesis
Hybrid Bayesian networks in high
dimensionality frameworks
PhD student
Antonio Fernández Álvarez
Advisors
Antonio Salmerón Cerdán
Rafael Rumí Rodríguez
Almería, March 2011
To all who fight for their dreams.
Acknowledgements
In all honesty, I am pleasantly surprised by all the support I have received
throughout these years both academically and personally. I am privileged that
most of this time I have felt a barrage of positive experiences and feel indebted
to all those who have supported me.
First and foremost, I wish to thank my mentor Antonio Salmerón for believing
in me from the very beginning. His encouragement has helped me through some
of my most difficult moments, and he has taught me to view difficulties as
challenges. Without his help and trust in me, this work would never have been
possible.
I am also thankful for the great support received from my co-mentor Rafael
Rumí. His advice and his friendly attitude have made my work much easier.
I would particularly like to highlight Jens D. Nielsen for his help and for the
good moments shared over two years of working together. I am indebted to him
for everything I learnt. I would also like to thank Ildiko Flesch for her
friendliness and the work done during her short but fruitful stay in Almería.
Other officemates whom I thank for their support and understanding are Sandra
Rodríguez and Carlos Rodríguez.
I shall cherish great memories of the members of the Machine Intelligence
Group of the Computer Science Department, Aalborg University, for their warm
welcome during my research stay there. Special thanks to Thomas D. Nielsen for
supervising my work and for our later collaborations. Thanks to Aderson C.
Pifer and Nicolaj Søndberg for their wonderful hospitality and for making my
stay more pleasant.
I am also very grateful to Helge Langseth for the research collaborations we
have had and his friendly attitude during his stay in Almería.
Other people who deserve my deepest thanks are my colleagues in the Data
Analysis Group and the Department of Statistics and Applied Mathematics for
making daily work fun. In particular Fernando Reche, Inma López, María Morales,
José Cáceres, Carmelo Rodríguez and Irene Martínez.
To Pedro Aguilera and Rosa Fernández for the research collaborations we have
started in the environmental field.
On a personal level, I owe everything to Charo for her longstanding support
and unselfish generosity. I hope that someday I will be able to repay her.
My friends have been a very important daily source of support. Thanks to
my football team mates for the good matches: Antonio Mendoza, Manuel Yuste,
Fernando Pérez, Ignacio Fernández, Antonio García, and everyone else in the
team. Also, to my padel partner Juan Antonio Chaichio. To my friends and former
flat mates Ángel, Carlos and Víctor for making life an enriching experience. In
short, I offer my gratitude to all my friends for always being available.
I would especially like to thank my friend Luis García, whom I have known
since childhood, and who has taught me many of the principles and values I
cherish today.
I give my deepest thanks to my parents Paco and Emilia for their support
and education, and to my brother Paco and his wife Carmen. To them, I owe
everything.
Finally, I dedicate this thesis to the memory of my grandmother Pura for her
humility and love.
Antonio Fernández Álvarez
Almería, February 18, 2011
This dissertation has been supported by the Spanish Ministry of Science and
Innovation through projects TIN2007-67418-C03-02 and TIN2010-20900-C04-02,
and with the FPI scholarship BES-2008-004014.
Contents
Contents v
List of Algorithms ix
I Introduction 1
1 Introduction 3
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Organisation of the dissertation . . . . . . . . . . . . . . . . . . . 4
2 Preliminaries 7
2.1 Bayesian networks . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Hybrid Bayesian networks . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Discretisation . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Conditional Gaussian (CG) distributions . . . . . . . . . . 14
2.2.3 Mixtures of Truncated Exponentials . . . . . . . . . . . . . 16
2.2.4 Mixtures of Polynomials . . . . . . . . . . . . . . . . . . . 19
2.3 State-of-the-art in hybrid Bayesian networks . . . . . . . . . . . . 20
II Theoretical contributions 27
3 Learning models for regression from complete data 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Bayesian networks for classification . . . . . . . . . . . . . . . . . 31
3.3 Bayesian networks for regression . . . . . . . . . . . . . . . . . . . 34
vi CONTENTS
3.4 Regression based on the MTE model . . . . . . . . . . . . . . . . 35
3.5 Filtering the independent variables . . . . . . . . . . . . . . . . . 37
3.6 The naïve Bayes model for regression . . . . . . . . . . . . . . 40
3.7 The tree augmented naïve Bayes regression model . . . . . . . . . 41
3.8 The forest augmented naïve Bayes regression model . . . . . . . . 45
3.9 Regression model based on kDB structure . . . . . . . . . . . . . 46
3.10 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . 51
3.11 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 Learning models for regression from incomplete data 55
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Regression model from incomplete data . . . . . . . . . . . . . . . 56
4.3 The algorithm for learning a regression model from incomplete data 59
4.4 Improving the final estimations by reducing the bias . . . . . . . . 63
4.5 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . 64
4.6 Results discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5 Parametric learning in MTE networks using incomplete data 73
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Translating standard distributions into MTE distributions . . . . 75
5.2.1 Multinomial . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.2 Conditional linear Gaussian . . . . . . . . . . . . . . . . . 76
5.2.3 Logistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 The EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.4 The M-step. Updating rules for the parameter estimates . . . . . 85
5.4.1 Multinomial . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4.2 Conditional linear Gaussian . . . . . . . . . . . . . . . . . 86
5.4.3 Logistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5 The E-step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.6 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6 Approximate inference in MTE networks using importance sampling 99
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3 Approximate propagation using importance sampling . . . . . . . 103
6.3.1 Obtaining a sampling distribution . . . . . . . . . . . . . . 107
6.3.2 Computing multiple probabilities simultaneously . . . . . . 110
6.4 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.5 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . 113
6.5.1 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.5.2 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.5.3 Experiment 3 . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
III Applications 123
7 Species distribution modelling 125
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.2.1 Variables and data set description . . . . . . . . . . . . . . 127
7.2.2 Selection of variables . . . . . . . . . . . . . . . . . . . . . 127
7.2.3 Bayesian classifiers and calibration of models . . . . . . . . 129
7.2.4 Inference in Bayesian classifiers . . . . . . . . . . . . . . . 129
7.2.5 Validation of the models . . . . . . . . . . . . . . . . . . . 130
7.3 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . 131
7.3.1 NB model . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.3.2 TAN model . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.3.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.3.4 Spatial application of the models . . . . . . . . . . . . . . 142
7.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8 Relevance analysis of performance indicators in higher education 147
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.2 Relevance analysis using Bayesian networks . . . . . . . . . . . . 149
8.3 Application to the analysis of performance indicators at the University of Almería . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8.3.1 Relevance analysis for compulsory courses . . . . . . . . . 152
8.3.2 Relevance analysis for optional courses . . . . . . . . . . . 155
8.4 Software for relevance analysis . . . . . . . . . . . . . . . . . . . . 157
8.5 Using the software to construct composite indicators . . . . . . . 160
8.5.1 Generating the rank of descriptions . . . . . . . . . . . . . 161
8.5.2 Generating the composite index from the database . . . . 162
8.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
IV Concluding remarks 165
9 Conclusions and future works 167
Bibliography 171
A Notation and mathematical derivations 191
B Publications 201
List of Algorithms
1 Median of a density function . . . . . . . . . . . . . . . . . . . . . 38
2 MTE-NB regression model . . . . . . . . . . . . . . . . . . . . . . 40
3 Selective MTE-NB regression model . . . . . . . . . . . . . . . . . 42
4 Maximum Spanning Tree (based on Kruskal’s algorithm) . . . . . . 45
5 MTE-TAN regression model . . . . . . . . . . . . . . . . . . . . . . 46
6 Selective MTE-TAN regression model . . . . . . . . . . . . . . . . 47
7 Maximum Spanning Forest (based on Kruskal’s algorithm) . . . . . 48
8 MTE-FAN regression model . . . . . . . . . . . . . . . . . . . . . . 48
9 Selective MTE-FAN regression model . . . . . . . . . . . . . . . . . 49
10 MTE-kDB regression model . . . . . . . . . . . . . . . . . . . . . . 50
11 Bayesian network regression model from missing data . . . . . . . . 60
12 Selective Bayesian network regression model from missing data . . 62
13 Computing a vector of bias to refine the predictions . . . . . . . . . 64
14 An EM algorithm for learning MTE networks from incomplete data. 85
15 PruneMTEPotential (T, α) . . . . . . . . . . . . . . . . . . . . . . 109
16 SamplingDistributions (B, e) . . . . . . . . . . . . . . . . . . . . . 111
17 ApproximateProbabilityPropagation (B, e, P ) . . . . . . . . . . . 114
18 Naïve Bayes classifier with continuous features . . . . . . . . . . 129
19 TAN classifier with continuous features . . . . . . . . . . . . . . . . 130
Part I
Introduction
Chapter 1
Introduction
1.1 Motivation
In recent decades, the complexity and uncertainty of information systems and
the huge amount of data available have rendered traditional analysis obsolete
in many cases, and more sophisticated techniques have become necessary. Within
the field of Artificial Intelligence, Bayesian networks have proven to be a
powerful tool for handling such problems [89].
In recent years, much research on Bayesian networks has focused on efficient
methods for learning and inference in high dimensionality frameworks,
approached from several points of view.
First, in situations where modelling the problem is already difficult due to
its complexity, it is desirable to avoid, as far as possible, any further
approximation in the calculations. In this sense, much effort in the field of
Bayesian networks has been aimed at avoiding the discretisation of continuous
variables, instead dealing with them directly and representing their
probability distributions as accurately as possible. Mixtures of Truncated
Exponentials (MTEs) have been considered an appropriate tool for this purpose,
free of structural restrictions, and they are the main focus of this
dissertation.
It is also very common to face problems with a high number of variables, in
which a selection of them is needed both to reduce the complexity of the model
and, sometimes, to increase the accuracy of the results. In this way, learning
and inference processes are considerably simplified.
Along the same lines, if the MTE network is too complex, exact probability
propagation is computationally hard, and therefore new approximate inference
methods must be investigated.
Advances in hybrid Bayesian networks create opportunities to revisit problems
already solved using other techniques. For example, a regression problem
naturally involves both continuous and discrete variables. Bayesian networks
offer some advantages over classical techniques, such as scalability and the
ability to give a prediction without a full observation of the independent
variables. In this dissertation, several MTE network structures are developed
and applied to solve regression problems.
Also, there are many situations in which missing data are frequent: in large
databases, for example, incomplete records are more likely, and in some
scenarios data acquisition is difficult or even impossible. Therefore, new
methods for learning MTE models from missing data will be developed.
Finally, there are also many other disciplines where much of the theory about
hybrid Bayesian networks has not been applied yet. Hence, we present two works
in the environmental field and in higher education management.
1.2 Organisation of the dissertation
The document is divided into four parts. The first and last are the
Introduction and the Concluding remarks, containing Chapters 1 and 2, and
Chapter 9, respectively. Part II, called Theoretical Contributions, includes
Chapters 3, 4, 5 and 6, and describes the new theoretical and methodological
advances developed in this dissertation. Chapters 7 and 8 form Part III,
called Applications, where two applications of MTEs are presented.
The nine chapters are distributed as follows:
Chapter 1 explains the motivation of the dissertation and how the contents
throughout the document are organised.
Chapter 2 introduces the topic of the dissertation, beginning with the most
general concepts and gradually focusing on those most related to the remaining
chapters. First, the concept of a Bayesian network is explained, and then a
specific type in which discrete and continuous variables coexist, the
so-called hybrid Bayesian networks. Different approaches for their treatment
are explained: discretisation, conditional Gaussian (CG) distributions and
Mixtures of Truncated Exponentials (MTEs), with special emphasis on the
latter, the main theme of the dissertation. Finally, we review the state of
the art in hybrid Bayesian networks.
In Chapter 3 we explore the extension of various kinds of MTE-based Bayesian
network classifiers to regression problems where some of the independent variables
are continuous and some others are discrete.
Chapter 4 addresses the same problem as Chapter 3, but for the case of
incomplete data.
In Chapter 5 we describe an EM-based algorithm for learning the maximum
likelihood parameters of an MTE network when confronted with incomplete data.
In Chapter 6, a new approximate propagation algorithm for MTE networks
based on importance sampling is presented.
The aim of Chapter 7 is to characterise the habitat of an endangered species
(we focus on the spur-thighed tortoise), using several continuous
environmental variables. Two MTE models for this purpose are presented.
Chapter 8 presents a methodology for relevance analysis of performance
indicators in higher education based on the use of Bayesian networks. The MTE
model is applied to construct composite indicators using a Bayesian regression
model implemented in a web application.
Finally, Chapter 9 concludes the dissertation, summarising the main
contributions of the work and future research lines.
Chapter 2
Preliminaries
2.1 Bayesian networks
Bayesian networks [117, 77] are considered one of the most powerful tools for
representing complex systems in which the relationships among the variables
are subject to uncertainty. Their main purpose is to provide a framework for
efficiently reasoning about the system they represent, in the sense of
updating the information about the unobserved variables when new information
is incorporated into the system [76, 143].
We will use uppercase letters to denote random variables, and boldfaced
uppercase letters to denote random vectors, e.g. X = (X1, . . . , Xn), whose
domain will be written as ΩX. By lowercase letters x (or boldfaced x) we
denote some element of ΩX (or ΩX, respectively).
A Bayesian network is a statistical multivariate model for a set of variables
X, which is defined in terms of two components:
• A qualitative component, defined by means of a directed acyclic graph
(DAG), where each vertex represents one of the variables in the model and
the presence of an edge linking two variables indicates the existence of
statistical dependence between them.
• A quantitative component specified through a conditional distribution
p(xi | pa(xi)) for each variable Xi, i = 1, . . . , n given its parents in the
graph, denoted as pa(Xi).
[Figure 2.1 structure: X1 → X2, X1 → X3; X2 → X4, X3 → X4; X3 → X5.]
Figure 2.1: An example of Bayesian network with five variables.
For example, the graph depicted in Figure 2.1 could be the qualitative
component of a Bayesian network for variables X1, . . . , X5. According to the
graph structure, it would be necessary to specify a conditional distribution
for each variable given its parents. In this case, the distributions are
p(x1), p(x2 | x1), p(x3 | x1), p(x4 | x2, x3) and p(x5 | x3).
In what follows, we will describe how the qualitative component encodes the
dependencies among the variables in the model, and how the strength of these
dependencies is determined by the quantitative component, i.e., the conditional
distributions.
Qualitative component of a Bayesian network
One of the most important advantages of Bayesian networks is that the
structure of the associated DAG determines the dependence and independence
relationships among the variables, so that it is possible to find out, without
carrying out any numerical calculations, which variables are relevant or
irrelevant for some other variable of interest.
Figure 2.2 shows, through an example, the three types of connections among
variables. They can be interpreted as follows:
• Serial connections: Information may be transmitted from X1 to X3, unless
the state of the variable X2 is known.
• Diverging connections: Information may be transmitted from X1 to X3,
unless the state of the variable X2 is known.
(a) Serial connection: X1 → X2 → X3.
(b) Diverging connection: X1 ← X2 → X3.
(c) Converging connection: X1 → X2 ← X3.
Figure 2.2: Kinds of connections in a DAG.
• Converging connections: Information may only be transmitted through a
converging connection if information about the state of the variable X2 or
one of its descendants is available.
More formally, the rules for interpreting information flow given the structure
of a Bayesian network are based on the d-separation concept [77]:
Definition 1 (d-separation). Two variables X and Y in a Bayesian network are
d-separated if, for every path between them, there is an intermediate variable
Z such that either:
• the connection at Z is serial or diverging and Z is instantiated, or
• the connection at Z is converging and neither Z nor any of its descendants
has received evidence.
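Definition 1 can be checked mechanically. The sketch below is an assumed implementation of the standard Bayes-ball reachability procedure (not code from the thesis), taking the DAG as a node-to-parents mapping and the evidence as a set of observed variables:

```python
from collections import deque

def d_separated(parents, x, y, evidence):
    """Decide whether x and y are d-separated given the observed set
    `evidence`, in a DAG given as a node -> list-of-parents mapping."""
    children = {v: [] for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].append(v)

    # Ancestors of the evidence (evidence nodes included): needed to
    # decide when a converging connection is opened.
    anc, stack = set(), list(evidence)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents[v])

    # Traverse (node, direction) states; 'up' means we arrived at the
    # node from one of its children, 'down' from one of its parents.
    visited, queue = set(), deque([(x, "up")])
    while queue:
        v, d = queue.popleft()
        if (v, d) in visited:
            continue
        visited.add((v, d))
        if v == y:
            return False                  # an active path reaches y
        if d == "up" and v not in evidence:
            for p in parents[v]:          # serial connection at v
                queue.append((p, "up"))
            for c in children[v]:         # diverging connection at v
                queue.append((c, "down"))
        elif d == "down":
            if v not in evidence:
                for c in children[v]:     # serial connection at v
                    queue.append((c, "down"))
            if v in anc:                  # converging connection opened
                for p in parents[v]:
                    queue.append((p, "up"))
    return True
```

On the network of Figure 2.3, for instance, "Burglary" and "Earthquake" are d-separated a priori, but observing "Alarm" d-connects them.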
We will use a toy example, taken from [75], to explain the transmission of
information in a Bayesian network.
Example 1 (Burglary or earthquake). Mr. Holmes is working in his office when
he receives a phone call from his neighbor Dr. Watson, who tells him that Holmes’
burglar alarm has gone off. Convinced that a burglar has broken into his house,
Holmes rushes to his car and heads for home. On his way, he listens to the radio,
and in the news it is reported that there has been a small earthquake in the area.
Knowing that earthquakes have a tendency to turn burglar alarms on, he returns
to his work.
[Figure 2.3 structure: Burglary → Alarm ← Earthquake; Alarm → WatsonCalls;
Earthquake → RadioNews.]
Figure 2.3: The Bayesian network for the burglary or earthquake example.
The scenario described in Example 1 can be represented by the Bayesian
network in Figure 2.3. The semantics of the three kinds of connections can be
illustrated with this example:
• Serial connections:
– “Burglary” has a causal influence on “Alarm”, which in turn has a
causal influence on “Watson calls”. Therefore, information flows from
“Burglary” to “Watson calls” and vice versa, since knowledge about
one of the variables provides information about the other.
– However, if we observe “Alarm”, any information about the state of
“Burglary” is irrelevant to our belief about “Watson calls” and vice
versa, since once we have certainty about the fact that the alarm has
gone off, the information provided by Watson does not change our
state of belief.
• Diverging connections:
– “Earthquake” has a causal influence on both “Alarm” and “Radio
news”. Therefore, information flows from “Alarm” to “Radio news”
and vice versa, since knowledge about one of the variables provides
information about the other. For instance, if our only knowledge is
that the radio news reported a small earthquake, our belief about the
alarm going off would increase.
– On the other hand, if we observe “Earthquake”, i.e. we have certainty
about that, any information about the state of “Alarm” is irrelevant
for our belief about an earthquake report in the “Radio news” and vice
versa.
• Converging connections:
– “Alarm” is causally influenced by both “Burglary” and “Earthquake”.
However, in this case the last two variables are irrelevant to each other:
If we do not have any information about the alarm, there is no rela-
tionship between the other two variables.
– However, if we observe "Alarm" and "Burglary", then this will affect
our belief about "Earthquake": the burglary explains the alarm, reducing
our belief that an earthquake is the triggering factor, and vice versa.
Quantitative component of a Bayesian network
Once the structure is defined, it is necessary to know how strong the
relations among the variables are. This is achieved through the quantitative
component of the Bayesian network, i.e. the joint probability distribution.
Using the chain rule, the joint probability distribution over a set of
variables X1, . . . , Xn can be expressed as

p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}),   (2.1)

where X1, . . . , Xn is an ordering of the variables consistent with the
graph, i.e. pa(X_i) ⊆ {X_1, . . . , X_{i-1}}.
Taking into account the independencies encoded by the network structure, each
conditional probability in Equation (2.1) can be simplified. Thus, the joint
distribution over all the variables equals the product of the conditional
distributions attached to the nodes:

p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid pa(x_i)).   (2.2)
Note that the induced factorisation makes it possible to represent complex
distributions by a set of simpler ones, and therefore the number of parameters
needed to specify a model is, in general, lower. For instance, the network in
Figure 2.1 is factorised as

p(x_1, x_2, x_3, x_4, x_5) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_1)\,p(x_5 \mid x_3)\,p(x_4 \mid x_2, x_3).   (2.3)
Thus, if all variables in Equation (2.3) are binary, 32 entries are needed to
specify the joint distribution directly, whilst with the induced factorisation
the number of entries is reduced to 22. The more complex the network (in
number of arcs, variables and states), the greater the reduction achieved by
the factorisation. This is crucial for reducing memory requirements.
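The table-size comparison can be reproduced with a minimal sketch, assuming all five variables of Figure 2.1 are binary:

```python
import math

# All five variables of Figure 2.1, assumed binary.
parents = {"X1": [], "X2": ["X1"], "X3": ["X1"],
           "X4": ["X2", "X3"], "X5": ["X3"]}
n_states = {v: 2 for v in parents}

# Full joint table: one entry per configuration of all variables.
joint_entries = math.prod(n_states.values())

# Factorised form: one table per node of size |X_i| * |pa(X_i)|.
factorised_entries = sum(
    n_states[v] * math.prod(n_states[p] for p in parents[v])
    for v in parents)

print(joint_entries, factorised_entries)  # 32 22
```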
Another advantage of the factorisation is related to inference. Assume that Xi
is a variable in which we are interested, and XE is a set of variables whose
values are known. Then, a prediction for the value of Xi given XE can be
obtained by computing the distribution p(xi | xE). This distribution could be
obtained from the joint distribution in Equation (2.1), but the key point is
that there is no need to compute the joint distribution: there are efficient
algorithms that compute p(xi | xE) taking advantage of the factorisation of
the joint distribution imposed by the network structure [99, 143].
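As an illustration only, p(xi | xE) can be obtained by brute-force enumeration of the factorised joint in Equation (2.2). The network and CPT numbers below are invented for the sketch, and real inference engines avoid this exhaustive loop:

```python
import itertools

def conditional(query, evidence, parents, cpt):
    """p(query | evidence) by exhaustive enumeration of the factorised
    joint of Equation (2.2); binary variables only, for brevity."""
    variables = list(parents)
    scores = {}
    for config in itertools.product([0, 1], repeat=len(variables)):
        world = dict(zip(variables, config))
        if any(world[v] != s for v, s in evidence.items()):
            continue                        # inconsistent with evidence
        p = 1.0
        for v in variables:                 # product of p(x_i | pa(x_i))
            key = (world[v],) + tuple(world[u] for u in parents[v])
            p *= cpt[v][key]
        scores[world[query]] = scores.get(world[query], 0.0) + p
    total = sum(scores.values())
    return {state: p / total for state, p in scores.items()}

# A serial network X1 -> X2 -> X3 with hypothetical CPT entries,
# keyed as (child_state, parent_states...).
parents = {"X1": [], "X2": ["X1"], "X3": ["X2"]}
cpt = {"X1": {(0,): 0.6, (1,): 0.4},
       "X2": {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7},
       "X3": {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.25, (1, 1): 0.75}}
posterior = conditional("X1", {"X3": 1}, parents, cpt)
```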
2.2 Hybrid Bayesian networks
Bayesian networks were originally proposed for handling discrete variables,
and nowadays a broad and consolidated theory about them can be found in the
literature. However, in real problems, the simultaneous presence of continuous
and discrete domains is very common.
Definition 2. A Bayesian network is called hybrid when continuous and discrete
random variables coexist simultaneously in the model.
In a hybrid framework, one solution is to discretise the continuous data and
treat them as if they were discrete, so that the existing methods for discrete
variables can be applied. However, discretisation is just an approximation,
and other alternatives were later studied with success.
In this section, several approaches to deal with hybrid Bayesian networks are
explored. First, discretisation is presented as the most extreme solution.
Afterwards, we study frameworks where continuous and discrete variables can be
handled simultaneously without discretisation: the Conditional Gaussian (CG)
model, the Mixtures of Truncated Exponentials (MTE) model, and the Mixtures of
Polynomials (MOP) model.
2.2.1 Discretisation
Most existing algorithms in the literature for learning and inference in
Bayesian networks are only valid for discrete variables. One popular approach
is simply to discretise the domain of the continuous variables [59, 81], which
is a simple (but sometimes inaccurate) solution.
Definition 3. Let X be a continuous random variable with support ΩX and
density function f(x). Let A = {A1, . . . , An} be a partition of ΩX into
intervals Ai, i = 1, . . . , n. A discretisation of X is the process of
building a discrete random variable X′ with support ΩX′ = {1, . . . , n} such
that:

P(X' = i) = \int_{A_i} f(x)\,dx, \quad i = 1, \ldots, n.
After the discretisation, a hybrid Bayesian network is treated as a discrete
one when performing inference and learning. The higher the number of
intervals, the more accurate the approximation. This can be seen in
Figure 2.4, where the standard Gaussian distribution has been approximated by
a discrete probability function using 3, 6, 12 and 24 intervals, respectively.
(a) 3 intervals    (b) 6 intervals    (c) 12 intervals    (d) 24 intervals
Figure 2.4: Discretising a normal density with different number of intervals.
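Definition 3 can be sketched for the standard Gaussian of Figure 2.4. This is an illustrative implementation; folding the tail mass outside [−3, 3] into the two extreme intervals is an assumption not stated in the text:

```python
import math

def std_normal_cdf(x):
    # Phi(x) for the standard Gaussian, via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def discretise_std_normal(n, lo=-3.0, hi=3.0):
    """P(X' = i) = integral of f over A_i (Definition 3), for a
    partition of [lo, hi] into n equal-width intervals; the tail mass
    outside [lo, hi] is folded into the two extreme intervals."""
    width = (hi - lo) / n
    cuts = [lo + i * width for i in range(n + 1)]
    cuts[0], cuts[-1] = -math.inf, math.inf   # absorb both tails
    return [std_normal_cdf(cuts[i + 1]) - std_normal_cdf(cuts[i])
            for i in range(n)]
```

With n = 6 this yields the staircase of Figure 2.4(b); the resulting probabilities sum to one by construction.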
However, using many intervals is computationally costly: when inference is
carried out, the size of the potentials grows exponentially, and the memory
available for storing them is limited.
2.2.2 Conditional Gaussian (CG) distributions
Although discretisation is a technique that can always be applied, there are
some types of variables whose distributions make it possible to operate over
hybrid Bayesian networks in an exact way. These distributions are part of the
conditional Gaussian model [90, 37, 91, 31], explained next.
Definition 4. Let X be a continuous variable in a hybrid Bayesian network, let
Z = (Z_1, . . . , Z_d)^T be its discrete parents, and let
Y = (Y_1, . . . , Y_c)^T be its continuous parents. Conditional linear
Gaussian (CLG) potentials in hybrid Bayesian networks have the form

\phi(x \mid \mathbf{z}, \mathbf{y}) \sim N\!\left(\mu = \mathbf{l}_{\mathbf{z}}^{T}\mathbf{y} + b_{\mathbf{z}},\; \sigma_{\mathbf{z}}^{2}\right),   (2.4)

where z and y are assignments of states to the discrete and continuous parents
of X. For a concrete assignment z, l_z^T is the transpose of the vector of
coefficients of a linear regression model with c values (one for each
continuous parent), b_z is the intercept, and σ_z^2 > 0 is the variance for
variable X.
The conditional mean of a CLG potential depends linearly on the continuous
parent variables, while the variance does not. For each configuration of the
discrete parents of X, a linear function of the continuous parents is
specified as the mean of the conditional distribution of X given its parents,
and a positive real number is specified as its variance.
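A CLG potential as in Equation (2.4) can be sketched as follows; the structure (one discrete parent Z in {0, 1}, two continuous parents Y1, Y2) and all numeric coefficients are invented for illustration:

```python
import math
import random

# Hypothetical CLG potential: z -> (l_z, b_z, sigma2_z), Equation (2.4).
clg = {
    0: ([0.5, -1.0], 2.0, 1.0),
    1: ([1.5, 0.25], -0.5, 4.0),
}

def clg_mean_var(z, y):
    l, b, s2 = clg[z]
    mean = sum(li * yi for li, yi in zip(l, y)) + b  # linear in y
    return mean, s2                                  # variance: y-free

def clg_sample(z, y, rng=random):
    mean, s2 = clg_mean_var(z, y)
    return rng.gauss(mean, math.sqrt(s2))
```

Note that the mean changes with y but the variance depends only on the discrete configuration z, exactly the asymmetry described above.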
The scheme originally developed by Lauritzen [90] allows exact computation of
means and variances in CLG networks; however, this algorithm did not always
compute the exact marginal densities of continuous variables. A new
computational scheme for CLG models was later developed by Lauritzen and
Jensen [91]. This scheme allows the calculation of full local marginals and
also permits conditionally deterministic linear variables, i.e. distributions
where σ_z^2 = 0 in Equation (2.4).
The CLG model has the property that, for any assignment of values to the
discrete variables, the distribution of the continuous variables is
multivariate Gaussian. This is because, given an assignment of the discrete
variables, the conditional probability distributions for the continuous
variables are linear Gaussian models, and when these are combined they produce
a multivariate Gaussian. The joint distribution of all continuous variables in
the network is then a mixture of Gaussians. CLG models cannot accommodate
continuous random variables whose conditional distribution is not Gaussian.
Moreover, CLG models are not valid in frameworks where a discrete variable has
continuous parents, since in that case the parents' domains must be
discretised in some way. This is solved by
the MTE model presented next.
2.2.3 Mixtures of Truncated Exponentials
The CG model is useful in situations in which it is known that the joint
distribution of the continuous variables, for each configuration of the
discrete ones, follows a multivariate Gaussian. However, in practical
applications it is possible to find scenarios where this hypothesis is
violated, in which case another model, like discretisation, should be used.
Since discretisation is equivalent to approximating a target density by a
mixture of uniforms, the accuracy of the final model can be increased if other
functions are used instead of uniforms. A good choice is the family of
exponential functions, since they have high fitting power and are closed under
restriction, marginalisation and combination.
This is the idea behind the so-called Mixtures of Truncated Exponentials
(MTE) model [106].
During the probability inference process, where the posterior distributions of
the variables are obtained given some evidence, the intermediate functions are
not necessarily density functions; therefore, a more general function, called
an MTE potential, needs to be defined as follows:
Definition 5. (MTE potential) Let X be a mixed n-dimensional random vector. Let Z = (Z_1, …, Z_d)^T and Y = (Y_1, …, Y_c)^T be the discrete and continuous parts of X, respectively, with c + d = n. We say that a function f : Ω_X → ℝ_0^+ is a Mixture of Truncated Exponentials potential (MTE potential) if one of the next conditions holds:

i. Z = ∅ and f can be written as

    f(x) = f(y) = a_0 + ∑_{i=1}^{m} a_i e^{b_i^T y}    (2.5)

for all y ∈ Ω_Y, where a_i ∈ ℝ and b_i ∈ ℝ^c, i = 1, …, m.
ii. Z = ∅ and there is a partition D_1, …, D_k of Ω_Y into hypercubes such that f is defined as

    f(x) = f(y) = f_i(y)    if y ∈ D_i,

where each f_i, i = 1, …, k, can be written in the form of Equation (2.5).

iii. Z ≠ ∅ and for each fixed value z ∈ Ω_Z, f_z(y) = f(z, y) can be defined as in ii.
Example 2. The function f defined as

f(y_1, y_2) =
    2 + e^{3y_1+y_2} + e^{y_1+y_2}    if 0 < y_1 ≤ 1, 0 < y_2 < 2,
    1 + e^{y_1+y_2}                   if 0 < y_1 ≤ 1, 2 ≤ y_2 < 3,
    1/4 + e^{2y_1+y_2}                if 1 < y_1 < 2, 0 < y_2 < 2,
    1/2 + 5 e^{y_1+2y_2}              if 1 < y_1 < 2, 2 ≤ y_2 < 3,

is an MTE potential, since all of its parts are MTE potentials.
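A piecewise definition like this translates directly into code. A minimal sketch evaluating the potential (the function name is mine; the constant terms in the third and fourth pieces are read as 1/4 and 1/2):

```python
from math import exp

def mte_example2(y1, y2):
    """Evaluate the MTE potential of Example 2, piece by piece."""
    if 0 < y1 <= 1 and 0 < y2 < 2:
        return 2 + exp(3 * y1 + y2) + exp(y1 + y2)
    if 0 < y1 <= 1 and 2 <= y2 < 3:
        return 1 + exp(y1 + y2)
    if 1 < y1 < 2 and 0 < y2 < 2:
        return 0.25 + exp(2 * y1 + y2)
    if 1 < y1 < 2 and 2 <= y2 < 3:
        return 0.5 + 5 * exp(y1 + 2 * y2)
    return 0.0  # the potential vanishes outside the four hypercubes
```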
Definition 6. (MTE density) An MTE potential f is an MTE density if

    ∑_{z ∈ Ω_Z} ∫_{Ω_Y} f(z, y) dy = 1.
A conditional MTE density can be specified by dividing the domain of the
conditioning variables and specifying an MTE density for the conditioned variable
for each configuration of splits of the conditioning variables.
Example 3. Consider two continuous variables X and Y. A possible conditional MTE density for Y given X is the following:

f(y | x) =
    1.26 − 1.15 e^{0.006y}                       if 0.4 ≤ x < 5, 0 ≤ y < 13,
    1.18 − 1.16 e^{0.0002y}                      if 0.4 ≤ x < 5, 13 ≤ y < 43,
    0.07 − 0.03 e^{−0.4y} + 0.0001 e^{0.0004y}   if 5 ≤ x < 19, 0 ≤ y < 5,
    −0.99 + 1.03 e^{0.001y}                      if 5 ≤ x < 19, 5 ≤ y < 43.
                                                                         (2.6)
Since MTEs are defined over hypercubes, they naturally admit a tree-structured representation. Moral et al. [106] proposed a data structure to represent MTE potentials which is especially appropriate for this kind of conditional densities: the so-called mixed probability trees, or mixed trees for short. The formal definition is as follows:
Definition 7. (Mixed tree) We say that a tree T is a mixed tree if it meets the
following conditions:
i. Every internal node represents a random variable (discrete or continuous).
ii. Every arc outgoing from a continuous variable Y is labeled with an inter-
val of values of Y , so that the domain of Y is the union of the intervals
corresponding to the arcs emanating from Y .
iii. Every discrete variable has a number of outgoing arcs equal to its number
of states.
iv. Each leaf node contains an MTE potential defined on variables in the path
from the root to that leaf.
Mixed trees can represent MTE potentials defined by parts. Each entire
branch in the tree determines one hypercube where the potential is defined, and
the function stored in the leaf of a branch is the definition of the potential on it.
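A mixed tree can be sketched as a small recursive data structure: internal nodes branch on a variable (by state if discrete, by interval if continuous), and leaves hold MTE expressions. The class below is an illustrative sketch, not the implementation of [106]; interval boundaries are simplified to the half-open convention low ≤ v < high.

```python
from math import exp

class MixedTree:
    """Sketch of a mixed probability tree.  `branches` is a list of
    (label, subtree), where a label is either a discrete state or an
    interval (low, high); a leaf stores an MTE expression as a callable
    on a dict assignment of variable names to values."""
    def __init__(self, var=None, branches=None, leaf=None):
        self.var, self.branches, self.leaf = var, branches, leaf

    def evaluate(self, assignment):
        if self.leaf is not None:
            return self.leaf(assignment)
        v = assignment[self.var]
        for label, sub in self.branches:
            if (label[0] <= v < label[1]) if isinstance(label, tuple) else v == label:
                return sub.evaluate(assignment)
        return 0.0  # outside every branch the potential is zero

# A fragment of the tree in Figure 2.5: the z1 = 0 branch.
tree = MixedTree(var="z1", branches=[
    (0, MixedTree(var="y1", branches=[
        ((0, 1), MixedTree(var="y2", branches=[
            ((0, 2), MixedTree(leaf=lambda a: 2 + exp(3 * a["y1"] + a["y2"]))),
            ((2, 3), MixedTree(leaf=lambda a: 1 + exp(a["y1"] + a["y2"]))),
        ])),
    ])),
])
```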
Example 4. Consider the following MTE potential, defined for a discrete variable (Z_1) and two continuous variables (Y_1 and Y_2):

φ(z_1, y_1, y_2) =
    2 + e^{3y_1+y_2}      if z_1 = 0, 0 < y_1 ≤ 1, 0 < y_2 < 2,
    1 + e^{y_1+y_2}       if z_1 = 0, 0 < y_1 ≤ 1, 2 ≤ y_2 < 3,
    1/4 + e^{2y_1+y_2}    if z_1 = 0, 1 < y_1 < 2, 0 < y_2 < 2,
    1/2 + 5 e^{y_1+2y_2}  if z_1 = 0, 1 < y_1 < 2, 2 ≤ y_2 < 3,
    1 + 2 e^{2y_1+y_2}    if z_1 = 1, 0 < y_1 ≤ 1, 0 < y_2 < 2,
    1 + 2 e^{y_1+y_2}     if z_1 = 1, 0 < y_1 ≤ 1, 2 ≤ y_2 < 3,
    1/3 + e^{y_1+y_2}     if z_1 = 1, 1 < y_1 < 2, 0 < y_2 < 2,
    1/2 + e^{y_1−y_2}     if z_1 = 1, 1 < y_1 < 2, 2 ≤ y_2 < 3.
A representation of this potential by means of a mixed probability tree is displayed in Figure 2.5.

Z_1
├─ z_1 = 0: Y_1
│   ├─ y_1 ∈ (0, 1]: Y_2
│   │   ├─ y_2 ∈ (0, 2): 2 + e^{3y_1+y_2}
│   │   └─ y_2 ∈ [2, 3): 1 + e^{y_1+y_2}
│   └─ y_1 ∈ (1, 2): Y_2
│       ├─ y_2 ∈ (0, 2): 1/4 + e^{2y_1+y_2}
│       └─ y_2 ∈ [2, 3): 1/2 + 5 e^{y_1+2y_2}
└─ z_1 = 1: Y_1
    ├─ y_1 ∈ (0, 1]: Y_2
    │   ├─ y_2 ∈ (0, 2): 1 + 2 e^{2y_1+y_2}
    │   └─ y_2 ∈ [2, 3): 1 + 2 e^{y_1+y_2}
    └─ y_1 ∈ (1, 2): Y_2
        ├─ y_2 ∈ (0, 2): 1/3 + e^{y_1+y_2}
        └─ y_2 ∈ [2, 3): 1/2 + e^{y_1−y_2}

Figure 2.5: An example of mixed probability tree.
As with discretisation, the more intervals used to divide the domain of the continuous variables, the better the accuracy of the MTE model, but also the higher its complexity. Furthermore, in the case of MTEs, using more exponential terms within each interval substantially improves the fit to the real model, as we will see in Chapter 5.
2.2.4 Mixtures of Polynomials
A recent research line connected with hybrid Bayesian networks is the Mixtures of
Polynomials (MOPs) proposed in [145]. The idea is to replace the basis function
of the MTE (exponential) by a polynomial.
A one-dimensional function f : ℝ → ℝ is said to be a mixture of polynomials (MOP) function if it is a piecewise function of the form

f(x) =
    a_{0i} + a_{1i} x + a_{2i} x² + … + a_{ni} x^n   for x ∈ A_i, i = 1, …, k,
    0                                                otherwise,
                                                                         (2.7)

where A_1, …, A_k are disjoint intervals in ℝ that do not depend on x, and a_{0i}, …, a_{ni} are constants for all i. We say that f is a k-piece, n-degree MOP function (assuming a_{ni} ≠ 0 for some i).
The main motivation for defining MOP functions is that such functions are
easy to integrate in closed form, and that they are closed under multiplication,
integration, and addition, the main operations in making inferences in hybrid
Bayesian networks. The requirement that each piece is defined on an interval Ai
is also designed to ease the burden of integrating MOP functions.
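The closed-form integration follows from the antiderivative of each monomial, a_j x^j ↦ a_j x^{j+1}/(j+1). A sketch (names and data layout are mine) that integrates a one-dimensional MOP exactly:

```python
def mop_integrate(pieces):
    """Exact integral of a one-dimensional MOP given as a list of
    (coeffs, (lo, hi)), where coeffs = [a0, a1, ..., an] are the
    polynomial coefficients on the interval [lo, hi]."""
    def antiderivative(coeffs, x):
        return sum(a / (j + 1) * x ** (j + 1) for j, a in enumerate(coeffs))
    return sum(antiderivative(c, hi) - antiderivative(c, lo)
               for c, (lo, hi) in pieces)

# A 2-piece MOP: 3x^2 on [0, 1] and 1 - x on [1, 1.5].
pieces = [([0.0, 0.0, 3.0], (0.0, 1.0)), ([1.0, -1.0], (1.0, 1.5))]
```

Closure under addition and multiplication works the same way: adding or multiplying the coefficient lists piece by piece yields another MOP, which is what makes the inference operations stay within the model.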
2.3 State-of-the-art in hybrid Bayesian networks
The purpose of this section is to review the main advances in the literature
regarding hybrid Bayesian networks, mainly focusing on MTEs. So far, there are
three different alternatives to discretisation in the state-of-the-art for working
with hybrid Bayesian networks. The order in which they were proposed is:
• Conditional Linear Gaussian (CLG) model,
• Mixtures of Truncated Exponentials (MTEs), and
• Mixture of Polynomials (MOPs).
One of the earliest algorithms for dealing with hybrid Bayesian networks was
proposed in [90], and later revised in [91]. This algorithm is applied to Bayesian
networks where continuous variables are modeled by conditional linear Gaussian
(CLG) distributions. Its main weakness is that the network does not allow dis-
crete variables to have continuous parents, a dependency that arises in many
domains.
An approximate way to avoid this problem was suggested in [112, 94] with
the so-called augmented CLG networks, which are hybrid Bayesian networks with
conditional linear Gaussian distributions for continuous variables, and which al-
low discrete variables with continuous parents. The idea is to approximate the
product of a Gaussian and a logistic function (discrete variables with continuous
parents) by a variational approximation [112] or by a mixture of Gaussians using
numerical integration [94].
Another solution to the limitations of the CLG model was proposed in [141], where a method for exact inference in hybrid networks is presented. It consists of approximating general hybrid Bayesian networks by a mixture of Gaussian Bayesian networks [153] and then applying the exact algorithm proposed in [91]. The approximation is based on an arc reversal technique [116, 140] to avoid discrete nodes with continuous parents, and on approximating non-Gaussian distributions by Gaussian distributions.
On the other hand, the MTE framework was first proposed in 2001 as an alternative for representing the distributions of hybrid Bayesian networks. The MTE model does not impose any restriction on the interactions among variables (discrete nodes with continuous parents are allowed), and exact probability propagation can be carried out by means of local computation algorithms. This model was formally proposed in [106], where its basic operations were defined and a Markov Chain Monte Carlo propagation algorithm was described to deal with complex networks.
Later on, in [107, 135], an iterative algorithm to estimate MTE distributions from data, based on least squares approximation, was proposed. In 2003, a method to estimate conditional MTE densities using mixed trees was proposed [108], with a criterion for selecting variables during the construction of the tree. Afterwards, once learning MTEs from data had been addressed, [127, 128] proposed structural learning from data with MTEs by means of a hill-climbing algorithm.
Later, in [30], a comparison among different approaches able to deal with hybrid networks was presented. In particular, the behaviour of discretisation, CLG models, mixed trees, and linear deterministic models was compared.
Although exact probability propagation in MTE networks can be accomplished by using standard algorithms, in complex networks the size of the potentials involved grows so much that the propagation becomes infeasible. To overcome this problem, the Penniless propagation algorithm (already proposed for the discrete case) was adapted to MTE networks in [133]. The study in [134] goes further, also considering how to use the Markov Chain Monte Carlo method in approximate propagation. A comparison between both methods reported that the MCMC method is not competitive with Penniless propagation.
MTEs have also been applied to hybrid Bayesian networks with linear [29, 32] and nonlinear deterministic variables [30]. In these works, the operations required for performing inference are developed. Later, in [144], an architecture for solving large general hybrid Bayesian networks with deterministic variables is developed. The problem of arc reversals in hybrid Bayesian networks with deterministic variables is treated in [24].
The work in [61] represents the first approach for learning MTE networks from missing data. A naïve Bayes model for unsupervised data clustering with a hidden class variable is proposed. The proposal is compared with the conditional Gaussian model implemented in the WEKA data mining suite [158].
In [31], MTE potentials that approximate an arbitrary normal density function with any mean and positive variance are presented. In addition, the work in [33] proposes a general solution to the approximation problem above, and shows that the most common density functions can be approximated by an MTE potential, which can always be marginalised in closed form. This advance is very useful, since MTE potentials can then be used for inference in hybrid Bayesian networks that do not fit the restrictive assumptions of the CLG model (no discrete nodes with continuous parents, and Gaussian distributions only).
In [47], an incremental method for building MTE classifiers in domains with very large amounts of data, or for data streams, is proposed. This incremental approach is especially interesting for MTE distributions, where it is not possible to keep the sufficient statistics necessary to estimate the parameters, and therefore the only way to update an MTE potential so far was to re-learn it from scratch.
In [27], two frameworks for handling hybrid Bayesian networks, based on the CG and MTE distributions, are reviewed. In both cases it is studied how inference and learning from data can be carried out. The conclusion is that the CG model relies on a very solid theoretical development and allows efficient inference, but with the restriction that discrete nodes cannot have continuous parents. The MTE model, on the other hand, fits more naturally into local computation schemes for inference, since it is closed under the basic operations used in inference regardless of the structure of the network.
Hybrid Bayesian networks have been applied to regression or prediction problems under the assumption, in a first stage, that the joint distribution of the feature variables and the class is multivariate Gaussian [62]. When the normality assumption is not fulfilled, the regression problem was addressed using kernel densities to model the conditional distributions [57], with poor results. A common restriction of Gaussian and kernel-based models is that they only apply to scenarios in which all the variables are continuous.
Later, in [110, 111], a naïve Bayes regression model based on MTEs, together with a variable selection scheme, was proposed, reporting competitive results with respect to the state-of-the-art methods. In 2007, a tree augmented naïve Bayes for regression was proposed [51], solving the problem of estimating the conditional mutual information, which cannot be analytically obtained for MTEs. The performance of this model was tested in a real-life context related to higher education management, where mixed variables are common. Afterwards, in 2008, previous ideas about regression with MTEs were collected, and a study of the extension of several Bayesian classifier structures to regression problems was carried out [55]. A variable selection scheme was also considered for the proposed models.
Having successfully implemented the MTE networks in regression problems,
the next step was aimed at studying the problem of learning regression models
from missing data. A first approach was adopted in [52], where an iterative algo-
rithm for learning a naïve Bayes model from incomplete data was proposed. Later
on, this idea was extended to TAN models, also considering variable selection,
and a deeper comparison with the state-of-the-art techniques was developed [53].
MTE and Gaussian networks have also been used as approximate models to make inference feasible in models where it is otherwise intractable. For example, in [23, 25], methods for approximating PERT Bayesian networks by Gaussian and MTE networks, respectively, were proposed.
So far, the most prevalent MTE learning methods have estimated the parameters based on least squares estimation [135, 128]. The drawback of this approach is that, by not directly attempting to find the parameter estimates that maximise the likelihood, there is no principled way of performing subsequent model selection using those estimates. In [85, 88], an estimation method that directly aims at learning the parameters of an MTE potential following a maximum likelihood approach is presented. Empirical results demonstrate that the proposed method yields significantly better likelihood results than existing methods.
The approach above focuses on the univariate case, but does not address the conditional MTE specification. A preliminary work in this line [87] has investigated two alternatives for the definition of conditional MTE densities, showing that only the most restrictive one is compatible with standard efficient algorithms for inference in Bayesian networks.
One current research line on MTE-based networks focuses on learning parameters from incomplete data. In [49, 48], an EM-based algorithm for learning maximum likelihood parameters of a general hybrid network is described. The proposed learning procedure is not limited to any distributional family, which is an important advance on this topic.
Moreover, recent work on inference with MTEs addresses approximate probability propagation. In [54], a propagation algorithm based on importance sampling was proposed, with remarkable accuracy with respect to the approximate methods existing in the literature.
Current research directions for hybrid Bayesian networks also focus on Mixtures of Polynomials (MOPs) [145, 146, 142]. A MOP potential is defined as for MTEs, just replacing the basis function of the MTE (exponential) by a polynomial. MOP functions can be easily integrated, and are closed under combination and marginalisation. This allows MOP potentials to be propagated in the extended Shenoy-Shafer architecture. MOP approximations have several advantages over MTE approximations, since they are easier to find, even for multi-dimensional conditional PDFs.
Other, less prominent, alternatives (apart from the CLG model, MTEs and MOPs) that have been applied to hybrid Bayesian networks rely on dynamic discretisation of the continuous variables during inference [81], or on sampling methods to compute approximate marginals [64, 80, 20, 65].
While theoretical developments on hybrid Bayesian networks have emerged in the literature in recent years, a significant number of applied works are also appearing, mostly due to the need for simultaneous treatment of continuous and discrete variables in real problems. In what follows we show some applications of hybrid BNs, focusing on the MTE model, since it is the core of this thesis. In any case, many applications of the CLG model can be found in the literature.
In [28], two applications of MTE networks to finance problems are presented: first, naïve Bayes and TAN models are used to provide a distribution of stock return; second, a Bayesian network is used to determine a return distribution for a portfolio of stocks. In [86], some of the last decade's research on inference in hybrid Bayesian networks is summarised, and the discussion is linked to an example in which a model is developed for explaining and predicting humans' ability to perform specific tasks in a given environment. Hybrid Bayesian networks have also been applied to higher education management in [50], where a methodology for relevance analysis of performance indicators in the management of the University of Almería is developed. MTEs are applied to construct composite indicators by using a Bayesian network regression model.
MTEs have also been applied in environmental sciences, in the study of species predictive distribution modelling. In [2], the habitat of the spur-thighed tortoise is successfully characterised using MTE models.
A recent applied work is [26], where the authors introduce a graphical method for valuing options on real asset investments that allow the investor to switch between different operating modes at a single point in time. The technique uses MTE functions to approximate both the probability density function for project value and the expressions for the option value of each alternative.
Part II
Theoretical contributions
Chapter 3
Learning models for regression from complete data
Abstract

In this chapter we explore the extension of various kinds of Bayesian network classifiers to regression problems where some of the independent variables are continuous and some others are discrete. The goal is to compute the posterior distribution of the dependent variable given the independent ones, and then use that distribution to predict a value for the dependent variable given the observations. The involved distributions are represented as Mixtures of Truncated Exponentials (MTEs). The construction of some of these classifiers requires the use of the conditional mutual information, which cannot be analytically obtained for MTEs. In order to solve this problem, we introduce an unbiased estimator of the conditional mutual information, based on Monte Carlo estimation. We test the performance of the proposed models on different datasets commonly used as benchmarks, showing a competitive performance with respect to the state-of-the-art methods.
3.1 Introduction
In real life applications, it is common to find problems in which the goal is to
predict the value of a variable of interest depending on the values of some other
observable variables. If the variable of interest is discrete, we are faced with a
classification problem, whilst if it is continuous, it is usually called a regression
problem. In classification problems, the variable of interest is called the class and the observable variables are called features, while in regression frameworks, the variable of interest is called the dependent variable and the observable ones are called independent variables.
Bayesian networks [75, 117] have been used both for classification and regression purposes. Their main advantage with respect to other regression models is that it is not necessary to have a full observation of the independent variables to give a prediction for the dependent variable. Also, the model is usually richer from a semantic point of view.
Naïve Bayes models have been applied to regression problems under the assumption that the joint distribution of the independent variables and the dependent variable is multivariate Gaussian [62]. If the normality assumption is not fulfilled, the problem of regression with naïve Bayes models has been approached using kernel densities to model the conditional distributions in the Bayesian network [57], but the obtained results are poor. Furthermore, the use of kernels introduces a high complexity in the model, which can be problematic, especially because standard algorithms for carrying out the computations in Bayesian networks are not valid for kernels. A restriction of Gaussian models is that they only apply to scenarios in which all variables are continuous.
Seeking a more general solution, we are interested in regression problems where the independent variables can be either continuous or discrete. Therefore, the joint distribution cannot be multivariate Gaussian, due to the presence of discrete variables. To solve this problem, a naïve Bayes regression model based on the approximation of the joint distribution by an MTE was proposed [111].
In the same line, the aim of this chapter is to investigate the behaviour of different Bayesian network classifiers when applied to regression problems. The fact that models such as the naïve Bayes are appropriate for classification as well as regression is not surprising, as the nature of both problems is similar: predict the value of a dependent variable given an observation over the independent variables. In all cases we will consider problems where some of the independent variables are continuous while some others are discrete, and therefore we will concentrate on the use of MTEs. More precisely, the starting point is the naïve Bayes model for regression proposed in [111], and the other models will be obtained by increasing
the structural complexity of the underlying Bayesian network. In order not to use
a misleading terminology, from now on we will refer to the observable variables
as features, even if we are in a regression context.
The rest of the chapter is organised as follows. Section 3.2 is devoted to presenting several Bayesian network classifier structures as the basis of the proposed regression models. The use of Bayesian networks for regression is explained in
Section 3.3. The solution of the regression problem using MTEs is described
in Section 3.4. In Section 3.5 we explain how the selection of features in the
proposed selective models can be carried out. The particular regression models
based on Bayesian networks are introduced in Sections 3.6 to 3.9. Section 3.10 is
devoted to the experimental evaluation. The chapter ends with some concluding
remarks in Section 3.11.
3.2 Bayesian networks for classification
A Bayesian network can be used for classification purposes if it contains a class variable Y and a set of feature variables X_1, …, X_n, where an object with observed features x_1, …, x_n will be classified as belonging to the class y* obtained as

    y* = arg max_{y ∈ Ω_Y} f(y | x_1, …, x_n),    (3.1)

where Ω_Y denotes the set of possible values of Y.
Note that f(y | x_1, …, x_n) is proportional to f(y) × f(x_1, …, x_n | y), and therefore, solving the classification problem would require a distribution to be
specified over the n feature variables for each value of the class. The associated
computational cost can be very high. However, using the factorisation determined
by the network, the cost is reduced. Although the ideal would be to build a
network without restrictions on the structure, usually this is not possible due to
the limited data available. Therefore, networks with fixed and simple structures
and specifically designed for classification are used.
The extreme case is the so-called naïve Bayes (NB) structure [44, 58]. It consists of a Bayesian network with a single root node and a set of attributes having only one parent (the root node). The NB model structure is shown in Figure 3.1.

Figure 3.1: Structure of a naïve Bayes model (the root Y with an arc to each feature X_1, …, X_n).
Its name comes from the naive assumption that the feature variables X_1, …, X_n are considered independent given Y. This strong independence assumption is somehow compensated by the reduction in the number of parameters to be estimated from data, since in this case it holds that

    f(y | x_1, …, x_n) ∝ f(y) ∏_{i=1}^{n} f(x_i | y),    (3.2)

which means that, instead of one n-dimensional conditional distribution, n one-dimensional conditional distributions are estimated. Despite this extreme independence assumption, the results are surprisingly good in many cases, and it has become the most widely used Bayesian classifier in the literature.
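The factorisation in Equation (3.2) can be sketched directly; the dictionary layout and names below are illustrative, not the thesis implementation.

```python
def nb_posterior(prior, cond, x):
    """Posterior over a discrete class Y under the naive Bayes
    factorisation f(y | x) ∝ f(y) * prod_i f(x_i | y).
    `prior` maps y -> f(y); `cond[i]` maps (x_i, y) -> f(x_i | y)."""
    scores = {}
    for y, p in prior.items():
        s = p
        for i, xi in enumerate(x):
            s *= cond[i][(xi, y)]
        scores[y] = s
    z = sum(scores.values())          # normalise the proportional scores
    return {y: s / z for y, s in scores.items()}

# Toy model: binary class, one binary feature.
prior = {0: 0.6, 1: 0.4}
cond = [{(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}]
post = nb_posterior(prior, cond, [1])
```

Classification then amounts to taking the arg max of the posterior, as in Equation (3.1).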
However, if some variables are highly correlated, the accuracy of classification would improve if the dependence between them could be included in the network (i.e., links among the features). The impact of relaxing the independence assumption has been studied for classification-oriented Bayesian networks. In what follows, several structures are presented that expand the naïve Bayes structure by permitting each feature to have one more parent besides Y.
A structure in which some dependencies are allowed among the features is the so-called tree augmented naïve Bayes (TAN), which has also been used for classification [58]. The TAN structure is obtained according to this restriction: the features must form a directed tree. Figure 3.2 shows an example of a TAN structure with 4 features. Note that all of them (except the root of the directed tree) have two parents: the variable Y and one feature. The model is richer, since it allows arcs among features, but at the cost of increased complexity, both in learning the graph structure and the associated probabilities.

Figure 3.2: A TAN structure where X2 is the root of the directed tree among the features.
A problem of the TAN model is that some of the introduced links between features may not be necessary, as every feature is forced to be connected to another one. This fact was pointed out by Lucas [96] within the context of classification problems. He proposed to discard unnecessary links, obtaining a structure like the one displayed in Figure 3.3, where, instead of a tree, the features form a forest of directed trees. The resulting classifier is called a forest augmented naïve Bayes (FAN).

Figure 3.3: A forest augmented naïve Bayes structure with 2 trees.
Somewhat more complex is the kDB classifier [136] (see Figure 3.4). This model establishes an upper limit of k feature parents per feature, in addition to the class.

Figure 3.4: A sample 2-DB regression model.

The detailed construction of the previous classifiers, when they are applied to regression, is described in Sections 3.6 to 3.9.
3.3 Bayesian networks for regression
Assume we have a set of variables Y,X1, . . . , Xn, where Y is continuous and the
rest are either discrete or continuous. Regression analysis consists of finding a
model g that explains the response variable Y in terms of the features X1, . . . , Xn,
so that given an assignment of the features, x1, . . . , xn, a prediction about Y can
be obtained as y = g(x1, . . . , xn).
A Bayesian network can be used as a regression model for prediction purposes
following the same ideas as for classification, since both problems are solved in
a similar way. Therefore, the classifier structures presented in Section 3.2 will
be applied for regression. Thus, in order to predict the value of Y for observed features x_1, …, x_n, the conditional density

    f(y | x_1, …, x_n),    (3.3)

is computed, and a numerical prediction for Y is given using the corresponding mean (or the median) as follows:

    y = g(x_1, …, x_n) = E[Y | x_1, …, x_n] = ∫_{Ω_Y} y f(y | x_1, …, x_n) dy,    (3.4)

where Ω_Y represents the domain of Y.
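Equation (3.4) can be approximated numerically for any posterior density; the sketch below (names mine) uses a midpoint rule, although for MTEs the integral is also available in closed form, as discussed in Section 3.4.

```python
def predict_mean(density, lo, hi, n=100000):
    """Approximate E[Y | x] = integral of y * f(y | x) over (lo, hi)
    with a midpoint rule; `density` is the (already normalised)
    posterior density of Y restricted to (lo, hi)."""
    h = (hi - lo) / n
    return h * sum((lo + (k + 0.5) * h) * density(lo + (k + 0.5) * h)
                   for k in range(n))
```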
In any case, regardless of the structure employed, it is necessary that the joint
distribution for Y,X1, . . . , Xn follows a model for which the computation of the
density in Equation (3.3) can be carried out efficiently. As we are interested in
models able to simultaneously handle discrete and continuous variables, we think
that the approach that best meets these requirements is the MTE model.
3.4 Regression based on the MTE model
Once we know how Bayesian networks can be applied to solve regression problems, and that MTEs are an appropriate tool, from now on we concentrate on two tasks:

1. Determining the structure of the network (except for NB).

2. Estimating the MTE densities corresponding to the obtained structure.
Example 5. Consider the following regression model with naïve Bayes structure, where the dependent variable is X and the independent ones are Y and Z, with Y continuous and Z discrete (X is the root node, with arcs to Y and Z):
One example of conditional densities for this regression model is given by the following expressions:

f(x) =
    1.16 − 1.12 e^{−0.02x}   if 0.4 ≤ x < 4,
    0.9 e^{−0.35x}           if 4 ≤ x < 19.
                                          (3.5)
f(y | x) =
    1.26 − 1.15 e^{0.006y}                       if 0.4 ≤ x < 5, 0 ≤ y < 13,
    1.18 − 1.16 e^{0.0002y}                      if 0.4 ≤ x < 5, 13 ≤ y < 43,
    0.07 − 0.03 e^{−0.4y} + 0.0001 e^{0.0004y}   if 5 ≤ x < 19, 0 ≤ y < 5,
    −0.99 + 1.03 e^{0.001y}                      if 5 ≤ x < 19, 5 ≤ y < 43.
                                                                         (3.6)

f(z | x) =
    0.3   if z = 0, 0.4 ≤ x < 5,
    0.7   if z = 1, 0.4 ≤ x < 5,
    0.6   if z = 0, 5 ≤ x < 19,
    0.4   if z = 1, 5 ≤ x < 19.
                                   (3.7)
In this chapter we follow the approach in [111], where a 5-parameter MTE is fitted in each split of the support of the variable, which means that in each split there will be 5 parameters to be estimated from data:

    f(x) = a_0 + a_1 e^{a_2 x} + a_3 e^{a_4 x},    α < x < β.    (3.8)
The reason for using the 5-parameter MTE is that it has shown its ability to fit the most common distributions accurately, while keeping the model complexity and the number of parameters to estimate low [33]. The estimation procedure is based on least squares and is described in [128, 135].
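The least-squares procedure of [128, 135] iterates between the exponents and the linear coefficients: for fixed exponents a_2 and a_4, Equation (3.8) is linear in a_0, a_1, a_3 and can be solved by ordinary least squares. A sketch of that linear subproblem (names mine; the outer iteration over the exponents is omitted):

```python
from math import exp

def fit_mte_linear(xs, ys, a2, a4):
    """Least-squares fit of f(x) = a0 + a1*e^(a2 x) + a3*e^(a4 x) for FIXED
    exponents a2, a4: with basis phi(x) = (1, e^(a2 x), e^(a4 x)) the model
    is linear, so we solve the 3x3 normal equations (Phi^T Phi) c = Phi^T y."""
    phi = [(1.0, exp(a2 * x), exp(a4 * x)) for x in xs]
    M = [[sum(p[i] * p[j] for p in phi) for j in range(3)] for i in range(3)]
    b = [sum(p[i] * y for p, y in zip(phi, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 3):
                M[r][c] -= f * M[col][c]
            b[r] -= f * b[col]
    coef = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):   # back substitution
        coef[r] = (b[r] - sum(M[r][c] * coef[c] for c in range(r + 1, 3))) / M[r][r]
    return coef  # (a0, a1, a3)

# Recover known coefficients from noiseless data on [0, 2].
xs = [k * 0.04 for k in range(51)]
ys = [0.5 + 1.5 * exp(-1.0 * x) + 0.2 * exp(0.5 * x) for x in xs]
a0, a1, a3 = fit_mte_linear(xs, ys, -1.0, 0.5)
```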
The general procedure for obtaining a regression model is, therefore, to fix
one of the structures mentioned in Section 3.2 and to estimate the correspond-
ing conditional distributions using 5-parameter MTEs. Once the model is con-
structed, it can be used to predict the value of the dependent variable given that
the features are observed. The forecasting is carried out by computing the pos-
terior distribution of the dependent variable given the observed values for the
features. A numerical prediction for the class value will be obtained from the
posterior distribution, through its mean or its median. The choice between them
is a problem-dependent. A situation in which the median can be more robust
is when the training data contains outliers, and therefore the mean can be very
Chapter 3. Learning models for regression from complete data 37
biased towards the outliers. The posterior distribution will be computed using
the Variable Elimination algorithm for MTEs.
The expected value of a random variable X with a density defined as in Equation (3.8) is computed as

E[X] = ∫_{−∞}^{∞} x f(x) dx = ∫_{α}^{β} x (a_0 + a_1 e^{a_2 x} + a_3 e^{a_4 x}) dx
     = a_0 (β² − α²)/2
       + (a_1/a_2²) ((a_2 β − 1) e^{a_2 β} − (a_2 α − 1) e^{a_2 α})
       + (a_3/a_4²) ((a_4 β − 1) e^{a_4 β} − (a_4 α − 1) e^{a_4 α}).
If the density is defined by parts, the expected value is the sum of the expression above over each of the parts.
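The closed-form mean is straightforward to code; a sketch (names mine), using ∫ x e^{a x} dx = e^{a x}(a x − 1)/a² on each piece:

```python
from math import exp

def mte_piece_mean(a0, a1, a2, a3, a4, alpha, beta):
    """Contribution of one 5-parameter MTE piece to E[X], i.e. the
    integral of x*(a0 + a1*e^(a2 x) + a3*e^(a4 x)) over (alpha, beta)."""
    t = a0 * (beta ** 2 - alpha ** 2) / 2
    t += a1 / a2 ** 2 * ((a2 * beta - 1) * exp(a2 * beta)
                         - (a2 * alpha - 1) * exp(a2 * alpha))
    t += a3 / a4 ** 2 * ((a4 * beta - 1) * exp(a4 * beta)
                         - (a4 * alpha - 1) * exp(a4 * alpha))
    return t
```

For a density defined by parts, E[X] is the sum of this expression over the parts.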
The expression of the median, however, cannot be obtained in closed form, since the corresponding distribution function cannot be inverted. Instead, it is approximated using the search procedure described in [111] and shown in Algorithm 1, which approximates the median with an error lower than 10^{−3} in terms of probability. The input to the algorithm is the n-part density function, i.e.,

    f(x) = f_i(x),    α_i < x < β_i,    i = 1, …, n,

where each f_i is defined as in Equation (3.8) and the intervals (α_i, β_i) form a partition of the domain of X such that α_{i+1} = β_i.
3.5 Filtering the independent variables
It is a well-known fact in classification and regression that, in general, including
more variables does not necessarily increase the accuracy of the model. It can happen
that some variables are not informative for the prediction task, so that
including them in the model only adds noise to the predictor. Also, unnecessary
variables cause an increase in the number of parameters that need to be determined
from data.
Algorithm 1: Median of a density function
Input: An n-part density f over the interval (α, β), with parts f_i defined on (α_i, β_i), i = 1, ..., n.
Output: An estimation of the median of a random variable with density f, with error lower than 10^{-3} in terms of probability.

found := false
accum := 0
i := 1
while (found == false and i ≤ n) do
    m := ∫_{α_i}^{β_i} f_i(x) dx
    if (accum + m) ≥ 0.5 then
        found := true
    else
        i := i + 1
        accum := accum + m
    end
end
max := β_i
min := α_i
found := false
while (found == false) do
    mid := (max + min)/2
    p := accum + ∫_{α_i}^{mid} f_i(x) dx
    if ⌊0.5 × 1000⌋ == ⌊p × 1000⌋ then
        found := true
    else
        if p > 0.5 then
            max := mid
        else
            min := mid
        end
    end
end
return mid
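A direct transcription of Algorithm 1 can look as follows. This is a sketch: the density is passed as a list of per-part functions with their endpoints, the partial integrals are computed with a simple trapezoidal rule rather than the closed form available for MTEs, and the example density is a hypothetical truncated exponential split into two parts, whose median is ≈ 0.645:

```python
import math

def integrate(f, a, b, n=2000):
    """Composite trapezoidal rule."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

def median(parts):
    """Approximate the median of an n-part density given as (f_i, alpha_i, beta_i)."""
    # First loop of Algorithm 1: locate the part that contains the median.
    accum = 0.0
    for f, alpha, beta in parts:
        m = integrate(f, alpha, beta)
        if accum + m >= 0.5:
            break
        accum += m
    # Second loop: bisection inside that part, stopping when the cumulative
    # probability matches 0.5 up to three decimal places.
    lo, hi = alpha, beta
    while True:
        mid = (lo + hi) / 2
        p = accum + integrate(f, alpha, mid)
        if math.floor(p * 1000) == math.floor(0.5 * 1000):
            return mid
        if p > 0.5:
            hi = mid
        else:
            lo = mid

# Hypothetical 2-part density: a truncated exponential c*e^(-x) on (0, 3),
# with c = 1/(1 - e^(-3)); its median is -ln(1 - 0.5*(1 - e^(-3))) ≈ 0.6447.
c = 1.0 / (1.0 - math.exp(-3.0))
parts = [(lambda x: c * math.exp(-x), 0.0, 1.0),
         (lambda x: c * math.exp(-x), 1.0, 3.0)]
print(round(median(parts), 1))  # → 0.6
```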
There are different approaches to the problem of selecting variables in regression and classification problems:

• The filter approach, which in its simplest formulation consists of establishing a ranking of the variables according to some measure of relevance with respect to the class variable, usually called a filter measure. Then, a threshold for the ranking is selected and the variables below that threshold are discarded.

• The wrapper approach, which proceeds by constructing several models with different sets of feature variables and finally selecting the model with the highest accuracy.

• The filter-wrapper approach [130], which is a mixture of the former two. First, the variables are ordered using a filter measure and then they are incrementally included in or excluded from the model according to that order, so that a variable is included whenever it increases the accuracy of the model.
The selection of the independent variables to be included in the model was
addressed in [111] following a filter-wrapper approach. First, the independent
variables are ordered according to their mutual information with respect to the
class, and then they are included in the model one by one according to the initial
ranking, whenever the inclusion of a new variable increases the accuracy of the
preceding model. The accuracy of the model is measured by the root mean
squared error between the actual values y_1, ..., y_n of the dependent variable and
the estimates ŷ_1, ..., ŷ_n provided by the model for the records in a test database.
The root mean squared error is obtained as [158]

rmse = √( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² ).  (3.9)
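In code, the accuracy measure (3.9) is simply:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual values and model estimates."""
    n = len(actual)
    return math.sqrt(sum((y - yhat) ** 2 for y, yhat in zip(actual, predicted)) / n)

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # sqrt(4/3) ≈ 1.1547
```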
The mutual information between two random variables X and Y is defined as
I(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{XY}(x, y) log₂ [ f_{XY}(x, y) / (f_X(x) f_Y(y)) ] dx dy,  (3.10)
where fXY is the joint density for X and Y , fX is the marginal density for X and
f_Y is the marginal for Y. The mutual information has been successfully applied
as a filter measure in classification problems with continuous features [119].
In the case of using MTEs, the computation of the integral in Equation (3.10)
cannot be obtained in closed form. We will therefore use the estimation procedure
proposed in [111], which is based on the estimator
Î(X, Y) = (1/m) Σ_{i=1}^{m} ( log₂ f_{X|Y}(X_i | Y_i) − log₂ f_X(X_i) ),  (3.11)

for a sample (X_1, Y_1), ..., (X_m, Y_m) of size m drawn from f_{XY}.
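The estimator (3.11) only requires the ability to sample from f_{XY} and to evaluate the conditional and marginal densities. The sketch below is a self-contained illustration in which a hypothetical two-component Gaussian mixture stands in for the MTE densities (the estimator itself is model-agnostic): Y is a binary mixture indicator and X | Y = y is Gaussian.

```python
import math
import random

def normal_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def estimate_mi(mu0, mu1, m=4000, seed=0):
    """Estimator (3.11): (1/m) * sum(log2 f(X|Y) - log2 f(X)) over a sample from f_XY."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        y = rng.random() < 0.5                      # Y ~ Bernoulli(0.5)
        mu = mu1 if y else mu0
        x = rng.gauss(mu, 1.0)                      # X | Y = y  ~  N(mu_y, 1)
        f_cond = normal_pdf(x, mu)
        f_marg = 0.5 * normal_pdf(x, mu0) + 0.5 * normal_pdf(x, mu1)
        total += math.log2(f_cond) - math.log2(f_marg)
    return total / m

print(round(estimate_mi(0.0, 0.0), 3))  # identical components, so I(X, Y) = 0 → 0.0
print(estimate_mi(0.0, 3.0) > 0.5)      # well-separated components, close to 1 bit → True
```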
3.6 The naïve Bayes model for regression
This regression model was proposed in [111]. We describe it here in detail, as
it is the basis of our proposals. The task of estimating this model from data is
simplified in the sense that we do not have to care about the structure of the
underlying network, since it is fixed beforehand, as in Figure 3.1. The detailed
steps for its construction can be found in Algorithm 2.
Algorithm 2: MTE-NB regression model
Input: A database D with variables X_1, ..., X_n, Y.
Output: A NB model with root variable Y and features X_1, ..., X_n, with joint distribution of class MTE.

Construct a new network G with nodes Y, X_1, ..., X_n.
Insert the links Y → X_i, i = 1, ..., n, in G.
Estimate an MTE density for Y, and a conditional MTE density for each X_i, i = 1, ..., n, given Y.
Let P be the set of estimated densities.
Let NB be a Bayesian network with structure G and distributions P.
return NB.
This procedure includes all the available independent variables in the model.
The version in which the independent variables are filtered and selected is called
selective. The variable selection procedure described in Section 3.5 is illustrated
in Figure 3.5 for the NB model. The steps for its construction are presented in
Algorithm 3. The main idea is to start with a model containing the class variable
and one feature variable, namely the one with the highest mutual information with
the dependent variable (i.e., Y and X_{(1)}). Afterwards, the rest of the variables are
included in the model in sequence, according to their mutual information with
Y. In each step, if the included variable increases the accuracy of the model it
is kept; otherwise, it is discarded.
[Figure 3.5 depicts an example run: the features are ordered by I(X_i, Y) as X_2, X_3, X_1, X_4. Starting from the model {Y, X_2} (rmse = 0.15), X_3 is added and kept (rmse = 0.14), X_1 is tried and removed (rmse = 0.145), and X_4 is tried and removed (rmse = 0.143), leaving the final model {Y, X_2, X_3} with rmse = 0.14.]

Figure 3.5: Example of selecting the independent variables in a naïve Bayes regression model.
3.7 The tree augmented naïve Bayes regression model
For the construction of the TAN model we must take into account the dependence
structure among the features. The goal is to find a tree structure containing
them [58], so that the links of the tree connect the variables with the highest degree
of dependence. This task can be solved using a variation of the method proposed
in [22]. The idea is to start with a fully connected graph, labelling the links with
the conditional mutual information between the connected features given the
dependent variable (Y). Afterwards, the tree structure is obtained by computing
the maximum spanning tree of the initial labelled graph (see Algorithm 4). The
weights used in Step 1 of that algorithm correspond to the conditional mutual
information between the linked variables given the dependent variable.

Algorithm 3: Selective MTE-NB regression model
Input: Variables X_1, ..., X_n, Y and a database D for variables X_1, ..., X_n, Y.
Output: Selective NB predictor for the variable Y.

for i := 1 to n do
    Compute I(X_i, Y).
end
Let X_{(1)}, ..., X_{(n)} be a decreasing order of the feature variables according to I(X_{(i)}, Y).
Divide the database D into two sets, one for learning the model (D_l) and the other for testing its accuracy (D_t).
Using Algorithm 2, construct a NB model M with variables Y and X_{(1)} from database D_l.
Let rmse(M) be the estimated accuracy of model M using D_t.
for i := 2 to n do
    Let M_1 be the NB predictor obtained from Algorithm 2 for the variables of M and X_{(i)}.
    Let rmse(M_1) be the estimated accuracy of model M_1 using D_t.
    if rmse(M_1) ≤ rmse(M) then
        M := M_1.
    end
end
return M.
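The greedy filter-wrapper scheme of Algorithm 3 is independent of the underlying model. A generic sketch, where `build_model` and `score` are hypothetical stand-ins for the model construction (Algorithm 2) and the rmse evaluation on D_t, and where the toy relevances and rmse values are made up to mirror the run in Figure 3.5:

```python
def selective_search(features, relevance, build_model, score):
    """Greedy filter-wrapper search: features are tried in decreasing relevance
    order, and a feature is kept only if it does not worsen the score."""
    ranked = sorted(features, key=relevance, reverse=True)
    selected = [ranked[0]]
    best = score(build_model(selected))
    for feat in ranked[1:]:
        candidate = selected + [feat]
        s = score(build_model(candidate))
        if s <= best:                 # lower rmse is better
            selected, best = candidate, s
    return selected, best

# Toy run (relevances and rmse values are hypothetical):
scores = {("X2",): 0.15, ("X2", "X3"): 0.14,
          ("X2", "X3", "X1"): 0.145, ("X2", "X3", "X4"): 0.143}
relevance = {"X2": 4, "X3": 3, "X1": 2, "X4": 1}
sel, best = selective_search(["X1", "X2", "X3", "X4"],
                             lambda f: relevance[f],        # stands in for I(Xi, Y)
                             lambda feats: tuple(feats),    # stands in for Algorithm 2
                             lambda model: scores[model])   # stands in for rmse on Dt
print(sel, best)  # → ['X2', 'X3'] 0.14
```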
The conditional mutual information between two continuous features Xi and
Xj given Y is
I(X_i, X_j | Y) = ∫∫∫ f(x_i, x_j, y) log [ f(x_i, x_j | y) / (f(x_i | y) f(x_j | y)) ] dx_i dx_j dy.  (3.12)
The computation of the conditional mutual information defined in Equa-
tion (3.12) has been addressed for the Conditional Gaussian model [119], but
only in classification contexts, i.e., with variable Y being discrete. For MTEs,
the integral in Equation (3.12) cannot be computed in closed form. Therefore,
we propose to estimate it in a similar way as in [111] for the marginal mutual
information. Our proposal is based on the estimator given in the next proposition.
Proposition 1. Let X_i, X_j and Y be continuous random variables with joint
MTE density f(x_i, x_j, y). Let (X_i^{(1)}, X_j^{(1)}, Y^{(1)}), ..., (X_i^{(m)}, X_j^{(m)}, Y^{(m)}) be a sample of size m drawn from f(x_i, x_j, y). Then,

Î(X_i, X_j | Y) = (1/m) Σ_{k=1}^{m} ( log f(X_i^{(k)} | X_j^{(k)}, Y^{(k)}) − log f(X_i^{(k)} | Y^{(k)}) )  (3.13)

is an unbiased estimator of I(X_i, X_j | Y).
Proof.

E[Î(X_i, X_j | Y)] = E[ (1/m) Σ_{k=1}^{m} ( log f(X_i^{(k)} | X_j^{(k)}, Y^{(k)}) − log f(X_i^{(k)} | Y^{(k)}) ) ]
= E[log f(X_i | X_j, Y)] − E[log f(X_i | Y)]
= E[log f(X_i | X_j, Y) − log f(X_i | Y)] = E[ log ( f(X_i | X_j, Y) / f(X_i | Y) ) ]
= ∫∫∫ f(x_i, x_j, y) log ( f(x_i | x_j, y) / f(x_i | y) ) dx_i dx_j dy
= ∫∫∫ f(x_i, x_j, y) log ( f(x_i | x_j, y) f(x_j | y) / ( f(x_i | y) f(x_j | y) ) ) dx_i dx_j dy
= ∫∫∫ f(x_i, x_j, y) log ( f(x_i, x_j | y) / ( f(x_i | y) f(x_j | y) ) ) dx_i dx_j dy
= I(X_i, X_j | Y). □
Proposition 1 can be extended for the case in which Xi or Xj are discrete by
replacing the corresponding integral by summation.
Therefore, the procedure for estimating the conditional mutual information
consists of getting a sample from f(xi, xj, y) and evaluating Equation (3.13).
Sampling from an MTE density is described in [134].
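A sketch of the estimator of Proposition 1, with hypothetical Gaussian densities standing in for MTEs: here X_j ~ N(0, 1), X_i = X_j + ε with ε ~ N(0, 1), and Y is independent of both, so that f(x_i | x_j, y) = N(x_j, 1), f(x_i | y) = N(0, 2), and the exact value is I(X_i, X_j | Y) = 0.5 ln 2 nats:

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def estimate_cond_mi(m=4000, seed=1):
    """Estimator (3.13): (1/m) * sum(log f(Xi|Xj,Y) - log f(Xi|Y))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        xj = rng.gauss(0.0, 1.0)            # Xj ~ N(0, 1)
        xi = xj + rng.gauss(0.0, 1.0)       # Xi | Xj ~ N(Xj, 1)
        # Y is independent noise in this toy model, so it drops out of both densities.
        total += math.log(normal_pdf(xi, xj, 1.0)) - math.log(normal_pdf(xi, 0.0, 2.0))
    return total / m

# Exact value for this model: I(Xi, Xj | Y) = 0.5 * ln 2 ≈ 0.3466 nats.
print(abs(estimate_cond_mi() - 0.5 * math.log(2)) < 0.1)  # → True
```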
As we are not using the exact value of the conditional mutual information but an
estimate, it is useful to have some guidance on the sample size m required to reach
a given accuracy. Assume, for instance, that we want to estimate I with an error
lower than ε > 0 with probability not lower than δ, that is,

P(|Î − I| < ε) ≥ δ.
Using this expression, we can find a bound for m using Tchebyshev's inequality:

P(|Î − I| < ε) ≥ 1 − Var(Î)/ε² ≥ δ.  (3.14)

We only need to relate m with Var(Î). It holds that

Var(Î) = (1/m) Var( log f(X_i | X_j, Y) − log f(X_i | Y) ).  (3.15)

Therefore, Var(Î) depends on the discrepancy between f(X_i | X_j, Y) and f(X_i | Y). As these distributions are unknown beforehand, we will compute a bound for
m using two fixed distributions with very different shapes, in order to simulate a
case of extreme discrepancy. If we choose f(x_i | x_j, y) = a e^{−x_i} and f(x_i | y) = b e^{x_i},
with α < x_i < β and a, b normalisation constants, we find that

Var(Î) = (1/m) Var( log(a e^{−X_i}) − log(b e^{X_i}) )
       = (1/m) Var( log a − X_i − log b − X_i )
       = (1/m) Var(−2X_i) = (4/m) Var(X_i).

Plugging this into Equation (3.14), we obtain

1 − Var(Î)/ε² ≥ δ  ⇒  1 − 4Var(X_i)/(mε²) ≥ δ  ⇒  m ≥ 4Var(X_i)/((1 − δ)ε²).  (3.16)

If we assume, for instance, that X_i follows a normal distribution with mean 0
and standard deviation 1, and fix δ = 0.9 and ε = 0.1, we obtain

m ≥ 4/(0.1 × 0.01) = 4000.  (3.17)
Thus, in all the experiments described in this chapter we have used a sample of
size m = 4000.
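Equation (3.16) gives the bound directly; a one-function sketch:

```python
def sample_size_bound(var_xi, delta, epsilon):
    """Minimum sample size from (3.16): m >= 4*Var(Xi) / ((1 - delta) * epsilon^2)."""
    return 4.0 * var_xi / ((1.0 - delta) * epsilon ** 2)

# Var(Xi) = 1, delta = 0.9, epsilon = 0.1, as in Equation (3.17):
print(round(sample_size_bound(1.0, 0.9, 0.1)))  # → 4000
```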
The steps to construct a TAN model [51] are described in Algorithm 5. All
the independent variables are included in this model. Here we propose to improve
it by introducing a variable selection scheme analogous to the one used for the
selective naıve Bayes in Section 3.6. The selective TAN model is computed by
Algorithm 6.
Algorithm 4: Maximum Spanning Tree (based on Kruskal's algorithm)
Input: A graph G = (V, E), in which V is the set of vertices and E is the set of links.
Output: Maximum Spanning Tree T.

1. Order the links of E decreasingly by their weights.
2. Let A be an initially empty set of links.
3. T := (V, A).
4. Traverse the links of E in the order of Step 1, adding each link (u, v) to A if and only if it does not create a cycle in T, until A contains |V| − 1 links.
5. return T.
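Algorithm 4 is standard Kruskal with the weight comparison reversed; a compact sketch using a union-find structure for the cycle test (the edge weights below are hypothetical conditional mutual information values):

```python
def maximum_spanning_tree(n, edges):
    """edges: list of (weight, u, v) with vertices 0..n-1.
    Returns the n-1 links of a maximum spanning tree."""
    parent = list(range(n))

    def find(u):                      # union-find with path halving
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for w, u, v in sorted(edges, reverse=True):   # decreasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                  # adding (u, v) creates no cycle
            parent[ru] = rv
            tree.append((u, v))
            if len(tree) == n - 1:
                break
    return tree

# Four features, links labelled with hypothetical conditional MI values:
edges = [(0.9, 0, 1), (0.8, 1, 2), (0.1, 0, 2), (0.7, 2, 3), (0.2, 1, 3)]
print(maximum_spanning_tree(4, edges))  # → [(0, 1), (1, 2), (2, 3)]
```

Stopping the traversal after k links instead of |V| − 1 yields the maximum spanning forest used by the FAN model (Algorithm 7).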
3.8 The forest augmented naïve Bayes regression model
We will consider in this section how to construct a regression model following the
FAN methodology [96] (see Algorithm 8). The first step is to create a maximum
spanning forest following Algorithm 7, with an input parameter k that represents
the number of links included among the features. The construction of each tree
inside the forest is carried out in a similar way to the TAN construction, i.e.,
selecting a random root variable and directing the links visiting all the nodes. An
example can be found in Figure 3.3.
An important detail in the construction of a FAN model is the optimal selection of k. Usually, a low value of k means good efficiency in the parameter
learning but worse accuracy of the model, while a high value of k works the other way
round. In the experiments the value of k has been set to k = ⌊n/2⌋, where n is
the number of features.

Algorithm 5: MTE-TAN regression model
Input: A database D with variables X_1, ..., X_n, Y.
Output: A TAN model with root variable Y and features X_1, ..., X_n, with joint distribution of class MTE.

Construct a complete graph C with nodes X_1, ..., X_n.
Label each link (X_i, X_j) with the conditional mutual information between X_i and X_j given Y, i.e., I(X_i, X_j | Y).
Let T be the maximum spanning tree obtained from C using Algorithm 4.
Direct the links in T in such a way that no node has more than one parent.
Construct a new network G with nodes Y, X_1, ..., X_n and the same links as T.
Insert the links Y → X_i, i = 1, ..., n, in G.
Estimate an MTE density for Y, and a conditional MTE density for each X_i, i = 1, ..., n, given its parents in G.
Let P be the set of estimated densities.
Let TAN be a Bayesian network with structure G and distributions P.
return TAN.
As in the construction of the TAN, the links are labeled with the conditional
mutual information when computing the maximum spanning forest.
The selective version of the FAN regression model corresponds to Algorithm 9.
The procedure is totally analogous to the selective TAN.
3.9 Regression model based on kDB structure
Our last proposal consists of extending the kDB structure, already known in classification contexts [136], to regression problems. The kDB structure is obtained
by forcing the features to form a directed acyclic graph where each variable has
at most k parents besides the class. An example can be found in Figure 3.4. The
method proposed by Sahami [136] to obtain such a structure ranks the features
according to their mutual information with respect to the dependent variable.
Then, the variables are inserted in the directed acyclic graph following that ranking.
Algorithm 6: Selective MTE-TAN regression model
Input: Variables X_1, ..., X_n, Y and a database D for variables X_1, ..., X_n, Y.
Output: Selective TAN regression model for variable Y.

for i := 1 to n do
    Compute I(X_i, Y).
end
Let X_{(1)}, ..., X_{(n)} be a decreasing order of the independent variables according to I(X_{(i)}, Y).
Divide D into two sets, one for learning the model (D_l) and the other for testing its accuracy (D_t).
Using Algorithm 5, construct a TAN model M with variables Y and X_{(1)} from database D_l.
Let rmse(M) be the estimated accuracy of model M using D_t.
for i := 2 to n do
    Let M_1 be the TAN predictor obtained from Algorithm 5 for the variables in M and X_{(i)}.
    Let rmse(M_1) be the estimated accuracy of model M_1 using D_t.
    if rmse(M_1) ≤ rmse(M) then
        M := M_1.
    end
end
return M.
The parents of a new variable are selected among the variables already included
in the graph, choosing the k of them with the highest conditional mutual information
with the new variable given the dependent one.
The regression model we propose here is constructed in an analogous way,
but estimating the mutual information and the conditional mutual information
as described in Section 3.7. The details can be found in Algorithm 10. Note that
the complexity of constructing a kDB model is much higher than the complexity
for NB, TAN and FAN: the cost lies both in the selection of the parents of each
variable and in the estimation of the parameters, since their number is much
higher than in the other cases. For this reason, we do not propose a selective
version of this regression model, as the selection scheme used for the other
models would be too costly from a computational point of view.
Algorithm 7: Maximum Spanning Forest (based on Kruskal's algorithm)
Input: A graph G = (V, E), in which V is the set of vertices and E is the set of links. An integer value k that represents the number of links that the maximum spanning forest will contain.
Output: Maximum Spanning Forest F.

1. Order the links of E decreasingly by their weights.
2. Let A be an initially empty set of links.
3. F := (V, A).
4. Traverse the links of E in the order of Step 1, adding each link (u, v) to A if it does not create a cycle in F, until A contains k links.
5. return F.
Algorithm 8: MTE-FAN regression model
Input: A database D with variables X_1, ..., X_n, Y and an integer value k ∈ [1, n − 2] that represents the number of links that the maximum spanning forest will contain.
Output: A FAN model with root variable Y and features X_1, ..., X_n, with joint distribution of class MTE.

Construct a complete graph C with nodes X_1, ..., X_n.
Label each link (X_i, X_j) with the estimated conditional mutual information between X_i and X_j given Y, i.e., Î(X_i, X_j | Y).
Let F be the maximum spanning forest obtained from C using Algorithm 7, with exactly k links.
For each connected component F_i in the forest, select a random root and direct its links, constructing a tree.
Construct a new network G with nodes Y, X_1, ..., X_n and the links computed in each connected component F_i.
Insert the links Y → X_i, i = 1, ..., n, in G.
Estimate an MTE density for Y, and a conditional MTE density for each X_i, i = 1, ..., n, given its parents in G.
Let P be the set of estimated densities.
Let FAN be a Bayesian network with structure G and distributions P.
return FAN.
Algorithm 9: Selective MTE-FAN regression model
Input: Variables X_1, ..., X_n, Y and a database D for variables X_1, ..., X_n, Y.
Output: Selective FAN predictor for the variable Y.

for i := 1 to n do
    Compute I(X_i, Y).
end
Let X_{(1)}, ..., X_{(n)} be a decreasing order of the independent variables according to I(X_{(i)}, Y).
Divide D into two sets, one for learning the model (D_l) and the other for testing its accuracy (D_t).
Using Algorithm 8, construct a FAN model M with variables Y and X_{(1)} from database D_l.
Let rmse(M) be the estimated accuracy of model M using D_t.
for i := 2 to n do
    Let M_1 be the FAN predictor obtained from Algorithm 8 for the variables in M and X_{(i)}.
    Let rmse(M_1) be the estimated accuracy of model M_1 using D_t.
    if rmse(M_1) ≤ rmse(M) then
        M := M_1.
    end
end
return M.
Algorithm 10: MTE-kDB regression model
Input: Variables X_1, ..., X_n, Y and a database D for them. An integer value k, which is the maximum number of parents allowed.
Output: kDB regression model for variable Y, with a joint distribution of class MTE.

Let G be a graph with nodes Y, X_1, ..., X_n and an empty set of arcs.
Let X_{(1)}, ..., X_{(n)} be a decreasing order of the independent variables according to I(X_{(i)}, Y).
S := {X_{(1)}}.
for i := 2 to n do
    S := S ∪ {X_{(i)}}.
    for j := 1 to min(k, i − 1) do
        Select the variable Z ∈ S \ ({X_{(i)}} ∪ pa(X_{(i)})) with the highest Î(X_{(i)}, Z | Y).
        Add the link Z → X_{(i)} to G.
    end
end
Insert the links Y → X_i, i = 1, ..., n, in G.
Estimate an MTE density for Y, and a conditional MTE density for each X_i, i = 1, ..., n, given its parents in G.
Let P be the set of estimated densities.
Let kDB be a Bayesian network with structure G and distributions P.
return kDB.
3.10 Experimental evaluation
We have implemented all the models proposed in this chapter in the Elvira plat-
form [34]1. For testing the models we have chosen a set of benchmark databases
borrowed from the UCI [11] and StatLib [149] repositories. A description of the
databases used can be found in Table 3.1. In all the cases, we have considered the
options of predicting with the mean and the median of the posterior distribution
of the dependent variable. Regarding the kDB model, we have restricted the
experiments to k = 2.
It was shown in [111] that the selective naïve Bayes regression model was
competitive with what was until then considered the state of the art in regression
using graphical models, namely the so-called M5' algorithm [155], an improved
version of the model tree introduced by Quinlan [125]. The model tree is basically
a decision tree whose leaves contain a regression model rather than a single value,
and whose splitting criterion uses the variance of the values in the database
corresponding to each node rather than the information gain. We chose the M5'
algorithm because it was the state of the art in graphical models for regression [57]
before the introduction of MTEs for regression [111]. Therefore, we have compared
the new models with the NB and the M5' methods. For M5' we have used the
implementation in the Weka software [158].
The results of the comparison are shown in Tables 3.2 and 3.3, where the values
displayed correspond to the root mean squared error of each of the tested models
on the different databases, computed through 10-fold cross validation [150].
The boldfaced numbers represent the best value obtained among all the models,
while the underlined ones are the worst.
We have used Friedman's test [42] to compare the experimental results, finding
that there are no significant differences among the analysed algorithms
(p-value of 0.9961). However, a more detailed analysis of the results in
Tables 3.2 and 3.3 shows how, in general, the simpler models (NB, TAN, FAN)
perform better than the more complex one (2DB). We believe that this
is due to the increase in the number of parameters to estimate from data. Also,
the selective versions are usually more accurate than the models including all
1Available at http://leo.ugr.es/elvira
Database         # records   # continuous vars.   # discrete vars.
abalone              4176            8                   1
bodyfat               251           15                   0
boston housing        452           11                   2
cloud                 107            6                   2
disclosure            661            4                   0
halloffame           1340           12                   2
mte50                  50            3                   1
pollution              59           16                   0
strikes               624            6                   1

Table 3.1: A description of the databases used in the experiments.
Model          abalone   bodyfat   boston housing   cloud
NB(mean)        2.8188    6.5564       6.2449       0.5559
NB(median)      2.6184    6.5880       6.1728       0.5776
SNB(mean)       2.5307    5.1977       4.6903       0.5144
SNB(median)     2.4396    5.2420       4.7857       0.5503
TAN(mean)       2.5165    5.7095       6.9826       0.5838
TAN(median)     2.4382    5.8259       6.8512       0.6199
STAN(mean)      2.4197    4.5885       4.5601       0.5382
STAN(median)    2.3666    4.5820       4.3853       0.5656
FAN(mean)       2.6069    6.0681       6.5530       0.5939
FAN(median)     2.4908    6.1915       6.5455       0.5957
SFAN(mean)      2.4836    4.9123       4.2955       0.5253
SFAN(median)    2.4037    4.9646       4.3476       0.5623
2DB(mean)       3.1348    8.2358       8.4179       1.0448
2DB(median)     3.0993    8.2459       8.3241       1.0315
M5'             2.1296   23.3525       4.1475       0.3764

Table 3.2: Results I of the experiments with the proposed regression models in terms of rmse.
the variables. The behaviour of the SNB model is remarkable, as it obtains the best
results in two experiments and is never the worst. M5' also shows a very good
performance, obtaining the best results in 5 databases, though in one of them it is
the worst model.
3.11 Conclusions
In this chapter we have analysed the performance of well-known Bayesian
network classifiers when applied to regression problems. The experimental analysis
Model          disclosure    halloffame   mte50    pollution   strikes
NB(mean)       196121.4232    186.3826    1.8695    43.0257    503.5635
NB(median)     792717.1428    187.6512    2.0224    44.0839    561.4105
SNB(mean)       93068.5218    160.2282    1.5798    31.1527    438.0894
SNB(median)    797448.1824    170.3880    1.6564    31.2055    582.9391
TAN(mean)      250788.2272    165.3717    2.6251    48.4293    571.9346
TAN(median)    796471.2998    166.5050    2.6718    48.8619    584.9271
STAN(mean)     108822.7386    147.2882    1.5635    38.1018    447.2259
STAN(median)   794381.9068    153.1241    1.6458    36.8456    596.6292
FAN(mean)      228881.7789    179.8712    2.0990    43.9893    525.9861
FAN(median)    798458.2283    181.8235    2.2037    44.2489    560.7986
SFAN(mean)      97936.0322    150.0201    1.5742    36.3944    449.8657
SFAN(median)   794825.1882    157.6618    1.6707    35.6820    597.7038
2DB(mean)       23981.0983    293.8374    2.7460    58.5873    516.2400
2DB(median)    793706.7221    301.4728    2.7598    58.9876    715.0236
M5'             23728.8983     35.3697    2.4718    46.8086    509.1756

Table 3.3: Results II of the experiments with the proposed regression models in terms of rmse.
shows that all the models considered are comparable in terms of accuracy, even
with respect to the very robust M5' method. However, in general we would prefer
to use a Bayesian network for regression rather than M5', at least from a modelling
point of view: being a Bayesian network, the regression model can be embedded
into a more general model, and the global model can then be used for purposes
other than regression. M5', on the contrary, provides a regression tree, which
cannot be used for reasoning about the model it represents.
We think that the methods studied in this chapter can be improved by con-
sidering more elaborate variable selection schemes.
Chapter 4

Learning models for regression from incomplete data
Abstract

In Chapter 3 we addressed the problem of inducing Bayesian network models for
regression from complete databases. In this chapter we face the same problem for the
case of incomplete data. We again use MTEs to represent the joint distribution in the
induced networks. Only two particular Bayesian network structures are considered,
the so-called naïve Bayes (NB) and tree augmented naïve Bayes (TAN), which were
successfully applied in Chapter 3 as regression models when learning from complete
data. We propose an iterative procedure for inducing the models, based on a variation
of the data augmentation method in which the missing values of the explanatory
variables are filled in by simulating from their posterior distributions, while the missing
values of the response variable are generated using the conditional expectation of
the response given the explanatory variables. We also consider the refinement of
both regression models by means of variable selection and bias reduction. We illustrate
the performance of the proposed algorithms through a set of experiments with various
databases.
4.1 Introduction
In Chapter 3 MTEs have been successfully applied to regression problems consid-
ering different underlying network structures [51, 55, 111] obtained from complete
databases. Motivated by the common presence of missing values in databases,
we face the problem of building Bayesian networks for regression from incomplete
data. We propose an iterative algorithm based on a variation of the data augmen-
tation method [151] in which the missing values of the explanatory variables are
filled by simulating from their posterior distributions, while the missing values
of the response variable are generated from its conditional expectation given the
explanatory variables. In this chapter we will focus only on two Bayesian net-
work structures, the so-called naıve Bayes (NB) and tree augmented naıve Bayes
(TAN) [58]. Also, the algorithm is extended to incorporate variable selection
in a similar way as in Sections 3.6 and 3.7. Finally, we introduce a method for
reducing the bias in the predictions that can be used with all the models, regardless
of whether they have been induced from complete or incomplete databases.
The rest of the chapter is organised as follows. Section 4.2 presents the theory
underlying the learning procedure. The new algorithm that operates over missing
values is formally proposed in Section 4.3. Section 4.4 introduces a method for
reducing the bias in the predictions. The behaviour of the algorithm is tested
through two experiments in Section 4.5 and the results are discussed in Sec-
tion 4.6. The chapter ends with the concluding remarks in Section 4.7.
4.2 Regression model from incomplete data
The aim of a regression problem is to find a model g that explains the response
variable Y in terms of the features X_1, ..., X_n, so that, given an assignment
x_1, ..., x_n of the features, a prediction about Y can be obtained as ŷ = g(x_1, ..., x_n).
In order to calculate ŷ, the conditional expectation of the response variable given
the observed explanatory variables is used. Therefore, our regression model will
be

ŷ = g(x_1, ..., x_n) = E[Y | x_1, ..., x_n] = ∫_{Ω_Y} y f(y | x_1, ..., x_n) dy,
where f(y | x1, . . . , xn) is the conditional density of Y given x1, . . . , xn, which we
assume to be of class MTE.
A conditional distribution of class MTE can be represented as in Equation (3.6), where actually a marginal density is given for each element of the
partition of the support of the variables involved. It means that, in each of the
four regions depicted in Equation (3.6), the distribution of the response variable
Y is independent of the explanatory variables.
Therefore, from the point of view of regression, the distribution for the re-
sponse variable Y given an element in a partition of the domain of the explanatory
variablesX1, . . . , Xn, can be regarded as an approximation of the true distribution
of the actual values of Y for each possible assignment of the explanatory variables
in that region of the partition. This fact justifies the selection of E[Y | x1, . . . , xn]as the predicted value for the regression problem, because that value is the one
that best represents all the possible values of Y for that region, in the sense
that it minimises the mean squared error between the actual value of Y and its
predictions y, namely
mse =
∫
ΩY
(y − y)2f(y | x1, . . . , xn)dy, (4.1)
which is known to be minimised for y = E[Y | x1, . . . , xn]. Thus, the key point
to find a regression model of this kind is to obtain a good estimation of the
distribution of Y for each region of values of the explanatory variables. The NB
and TAN models [51, 111] proposed in Chapter 3 estimate that distribution by
fitting a kernel density [132] to the sample and then obtaining an MTE density
from the kernel using least squares [108, 135]. Obtaining such an estimation is
more difficult in the presence of missing values. The first approach to estimating
MTE distributions from incomplete data was developed in the more restricted
setting of unsupervised data clustering [61]. In that case, the only missing values
are on the class variable, which is hidden, while the data about the features are
complete.
Here we are interested in problems where the missing values can appear in the
response variable as well as in the explanatory variables. A first approach to solve
this problem could be to apply the EM algorithm [41], which is a commonly used
tool in semi-supervised learning [115]. However, the application of the method-
ology is problematic because the likelihood function for the MTE model cannot
be optimised in an exact way [85, 135].
Another way of approaching problems with missing values is the so-called
data augmentation (DA) algorithm [151]. The advantage with respect to the EM
algorithm is that DA does not require a direct optimisation of the likelihood
function. Instead, it is based on imputing the missing values by simulating from
the posterior distribution of the missing variables, which is iteratively improved
from an initial estimation based on a random imputation. The DA algorithm
leads to an approximation of the maximum likelihood estimates of the parameters
of the model, as long as the parameters are estimated by maximum likelihood
from the complete database in each iteration. As maximum likelihood estimates
cannot be found in an exact way, we have chosen to use least squares estimation,
as in the NB and TAN regression models proposed in Chapter 3.
Furthermore, as our main goal is to obtain an accurate model for predicting
the response variable Y, we propose to modify the DA algorithm in connection
with the imputation of missing values of Y. The next proposition is the key to
how to proceed in this direction.
Proposition 2. Let Y and Y_S be two continuous, independent and identically
distributed random variables. Then,

E[(Y − Y_S)²] ≥ E[(Y − E[Y])²]. (4.2)
Proof.

E[(Y − Y_S)²] = E[Y² + Y_S² − 2Y·Y_S]
             = E[Y²] + E[Y_S²] − 2E[Y·Y_S]
             = E[Y²] + E[Y_S²] − 2E[Y]·E[Y_S]
             = 2E[Y²] − 2E[Y]²
             = 2(E[Y²] − E[Y]²)
             = 2Var(Y)
             ≥ Var(Y) = E[(Y − E[Y])²].
In the proof we have relied on the fact that both variables are independent
and identically distributed, and therefore the expectation of the product is the
product of the expectations, and the expected value of both variables is the same.
Proposition 2 motivates our proposal for modifying the data augmentation
algorithm, since it proves that imputing the missing values with the conditional
expectation of Y, instead of simulating values for Y (denoted Y_S in the
proposition), reduces the mse of the estimated regression model. Notice that
this holds even if we are able to simulate from the exact distribution of Y
conditional on any configuration on a region of the values of the explanatory variables.
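A quick Monte Carlo check of Proposition 2 (an illustrative sketch, not part of the thesis; the standard Gaussian choice for Y is an arbitrary assumption): imputing with a simulated i.i.d. copy Y_S doubles the expected squared error relative to imputing with E[Y].

```python
import random
import statistics

random.seed(7)
n = 200_000
y  = [random.gauss(0.0, 1.0) for _ in range(n)]   # Y
ys = [random.gauss(0.0, 1.0) for _ in range(n)]   # Y_S, an i.i.d. copy of Y

# E[(Y - Y_S)^2] = 2 Var(Y), twice the error E[(Y - E[Y])^2] of imputing E[Y] = 0
mse_simulation  = statistics.fmean((a - b) ** 2 for a, b in zip(y, ys))
mse_expectation = statistics.fmean(a ** 2 for a in y)
ratio = mse_simulation / mse_expectation   # close to 2
```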
4.3 The algorithm for learning a regression model from incomplete data
Our proposal consists of an algorithm which iteratively learns a regression model
(which can be a NB or a TAN) by imputing the missing values in each iteration
according to the following criterion:
• If the missing value corresponds to the response variable, it is imputed
with the conditional expectation of Y given the values of the explanatory
variables in the same record of the database, computed from the current
regression model.
• Otherwise, the missing cell is imputed by simulating the corresponding vari-
able from its conditional distribution given the values of the other variables
in the same record, computed from the current regression model.
As the imputation requires the existence of a model, for the construction of
the initial model we propose to impute the missing values by simulating from
the marginal distribution of each variable computed from the observed values.
In this way we have obtained better results than using pure random initialisation,
which is the standard way of proceeding in data augmentation [151]. Another
way of proceeding could be to simulate from the conditional distribution of each
explanatory variable given the response, but we rejected this option because the
estimation of the conditional distributions requires more data than the estimation
of the marginals, which can be problematic if the number of missing values is high.
Algorithm 11: Bayesian network regression model from missing data
Input: An incomplete database D for variables Y, X1, . . . , Xn. A test database Dt.
Output: A Bayesian network regression model for response variable Y and explanatory variables X1, . . . , Xn.
1:  for each variable X ∈ {Y, X1, . . . , Xn} do
2:      Learn a univariate distribution fX(x) from its observed values in D.
3:  Create a new database D′ from D by imputing the missing values of each variable X ∈ {Y, X1, . . . , Xn} by simulating from fX(x).
4:  Learn a Bayesian network regression model M′ from D′.
5:  Let srmse′ be the sample root mean squared error of M′ computed using Dt according to Equation (4.3).
6:  srmse := ∞.
7:  while srmse′ < srmse do
8:      M := M′.
9:      srmse := srmse′.
10:     Create a new database D′ from D by imputing the missing values as follows:
11:     for each variable X ∈ {X1, . . . , Xn} do
12:         for each record z in D with missing value for X do
13:             Obtain fX(x | z) by probability propagation in model M.
14:             Impute the missing value for X by simulating from fX(x | z).
15:     for each record z in D with missing value for Y do
16:         Obtain fY(y | z) by probability propagation in model M.
17:         Impute the missing value for Y with E_fY[Y | z].
18:     Re-estimate model M′ from D′.
19:     Let srmse′ be the sample root mean squared error of M′ computed using Dt.
20: return M.
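The control flow of Algorithm 11 can be sketched in miniature as follows. This is an illustration under strong simplifying assumptions, not the thesis's implementation: the toy `fit`/`predict` pair (a least-squares line) stands in for learning an NB/TAN model with MTE densities and probability propagation, and a single explanatory variable is used.

```python
import math
import random

random.seed(1)

def fit(records):
    """Stand-in for 'learn a Bayesian network regression model' (steps 4/18):
    here just a least-squares line y = a + b*x."""
    n = len(records)
    mx = sum(x for x, _ in records) / n
    my = sum(y for _, y in records) / n
    b = sum((x - mx) * (y - my) for x, y in records) / sum((x - mx) ** 2 for x, _ in records)
    return my - b * mx, b

def predict(model, x):
    a, b = model
    return a + b * x

def srmse(model, test):
    """Equation (4.3), computed over complete test records (steps 5/19)."""
    return math.sqrt(sum((y - predict(model, x)) ** 2 for x, y in test) / len(test))

# Toy incomplete database D: None marks a missing cell.
full = [(float(i), 2.0 * i + 1.0 + random.gauss(0.0, 0.3)) for i in range(30)]
incomplete = [(x if i % 4 else None, y if i % 5 else None)
              for i, (x, y) in enumerate(full)]
test = full  # stands in for the test database Dt

obs_x = [x for x, _ in incomplete if x is not None]
obs_y = [y for _, y in incomplete if y is not None]

def impute(model=None):
    """Steps 3 and 10-17: marginal imputation initially, then model-based."""
    out = []
    for x, y in incomplete:
        xi = x if x is not None else random.choice(obs_x)  # simulate explanatory cell
        if y is not None:
            yi = y
        elif model is None:
            yi = random.choice(obs_y)     # initial imputation from the marginal
        else:
            yi = predict(model, xi)       # impute Y with E[Y | x] (key modification)
        out.append((xi, yi))
    return out

model = fit(impute())                     # steps 1-4: initial model M'
best, err = model, srmse(model, test)     # steps 5-6
while True:                               # steps 7-19
    candidate = fit(impute(best))
    candidate_err = srmse(candidate, test)
    if candidate_err >= err:              # stop when the error no longer decreases
        break
    best, err = candidate, candidate_err
```

The loop keeps the previous model as soon as the srmse stops improving, mirroring the `while srmse′ < srmse` test of the algorithm.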
Figure 4.1: Algorithm for learning a regression model from missing data. [Flow diagram omitted: it illustrates the initial marginal imputation of D, the iterative re-imputation (simulating the explanatory variables and imputing Y with E[Y | z]), the re-learning of M, and the srmse-based stopping test.]
Algorithm 12: Selective Bayesian network regression model from missing data
Input: An incomplete database D for variables Y, X1, . . . , Xn. A test database Dt.
Output: A Bayesian network regression model made up of the response variable Y and a subset of explanatory variables S ⊆ {X1, . . . , Xn}.
1:  for i := 1 to n do
2:      Compute I(Xi, Y).
3:  Let X(1), . . . , X(n) be a decreasing order of the feature variables according to I(X(i), Y).
4:  Using Algorithm 11, construct a regression model M with variables Y and X(1) from database D.
5:  Let rmse(M) be the estimated accuracy of model M using Dt.
6:  for i := 2 to n do
7:      Let M1 be the model obtained by Algorithm 11 with the variables of M plus X(i).
8:      Let rmse(M1) be the estimated accuracy of model M1 using Dt.
9:      if rmse(M1) ≤ rmse(M) then
10:         M := M1.
11: return M.
The algorithm (see Algorithm 11) proceeds by imputing the initial database,
learning an initial model and re-imputing the missing cells. Then, a new model
is constructed and, if the mean squared error is reduced, the current model is re-
placed and the process repeated until convergence. As the mse in Equation (4.1)
requires the knowledge of the exact distribution of Y conditional on each con-
figuration of the explanatory variables, we use as error measure the sample root
mean squared error, computed as
srmse = √( (1/m) Σᵢ₌₁ᵐ (yᵢ − ŷᵢ)² ), (4.3)

where m is the sample size, yᵢ is the observed value of Y for record i and ŷᵢ is its
corresponding prediction through the regression model.
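Equation (4.3) translates directly into code (a small sketch; the example values are made up):

```python
import math

def srmse(y_true, y_pred):
    """Sample root mean squared error, Equation (4.3)."""
    m = len(y_true)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / m)

error = srmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])   # sqrt(4/3)
```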
The details are given in Algorithm 11 and graphically represented through
a single example in Figure 4.1. Notice that in Steps 4 and 18 the regression
model is learnt from a complete database, and therefore the existing estimation
methods for MTEs can be used [135, 111]. Also, notice that the algorithm is
valid for any Bayesian network structure, and therefore it is valid for our purpose,
which is to learn a NB or a TAN, just by calling the appropriate procedure in
Steps 4 and 18. For learning the NB regression model [111] we use Algorithm 2,
and for learning the TAN [51], Algorithm 5.
In the same way as in Chapter 3, we have also incorporated variable selection
in the construction of the regression models [55, 111] as described in Algorithm 12.
4.4 Improving the final estimations by reducing
the bias
In existing approaches to using MTEs for regression, the prediction that is used
is a corrected version computed by subtracting an estimated expected bias from
the prediction provided by the model [111]. That is, if Y is the response variable
and Y ∗ is the response variable actually identified by the model, i.e., the one that
corresponds to the estimations provided by the model, then the expected bias is
E[b(Y, Y ∗)] = E[Y − Y ∗], which is estimated as [111]
b̂ = (1/m) Σᵢ₌₁ᵐ (yᵢ − y*ᵢ), (4.4)

where yᵢ and y*ᵢ are the exact values of the response variable and their estimates
in a test database of m records. Finally, the estimates were corrected by giving
y*ᵢ − b̂ as the final estimation for item number i.
We have improved the estimation of the expected bias by detecting homoge-
neous regions in the set of possible values of Y and then estimating a different
expected bias in each region. The domain of the response variable is split using
the k-means clustering algorithm, determining k by exploring the dendrogram.
In this work we have considered a maximum value of k = 4, as we did not reach
any improvement by increasing its value in the experiments carried out.
Therefore, instead of a single estimate of the expected bias, b̂, we now compute
a vector of estimates of the expected bias, b̂ⱼ, j = 1, . . . , k, and the final
estimation given is y*ᵢ − b̂ⱼ₍ᵢ₎, where j(i) denotes the cluster in which y*ᵢ lies.
The procedure for estimating the bias is detailed in Algorithm 13.
Algorithm 13: Computing a vector of biases to refine the predictions
Input: A full database D for variables Y, X1, . . . , Xn. A regression model M.
Output: vBias, a vector of biases.
1:  Run a hierarchical clustering to obtain a dendrogram for the values of Y.
2:  Determine the number of clusters, numBias, using the dendrogram.
3:  Partition D into numBias partitions D1, . . . , DnumBias by clustering Y using the k-means algorithm.
4:  for i := 1 to numBias do
5:      Compute vBias[i] by means of Equation (4.4) using Di and M.
6:  return vBias, a vector of estimated expected biases.
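Algorithm 13 can be sketched as follows. This is an illustrative sketch only: the dendrogram step for choosing numBias is skipped and k is fixed, the tiny 1-D k-means is a stand-in, and the two-region test data (biases +0.5 and −1.0) are made-up assumptions.

```python
import statistics

def kmeans_1d(values, k, iters=25):
    """Minimal 1-D k-means, enough to partition the domain of Y."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda j: abs(v - centers[j]))].append(v)
        centers = [statistics.fmean(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def bias_vector(y_true, y_pred, centers):
    """vBias[j]: Equation (4.4) restricted to the records whose prediction
    falls in cluster j."""
    k = len(centers)
    sums, counts = [0.0] * k, [0] * k
    for yt, yp in zip(y_true, y_pred):
        j = min(range(k), key=lambda c: abs(yp - centers[c]))
        sums[j] += yt - yp
        counts[j] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Two homogeneous regions of Y with different expected biases: +0.5 low, -1.0 high.
y_true = [i / 10 for i in range(40)] + [10.0 + i / 10 for i in range(40)]
y_pred = [y - 0.5 if y < 5.0 else y + 1.0 for y in y_true]

centers = sorted(kmeans_1d(y_true, k=2))
vbias = bias_vector(y_true, y_pred, centers)
```

With a single global bias these two regions would cancel out; the per-cluster vector recovers both.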
This new bias estimation heuristic is not computationally costly, and it provides
important increases in accuracy. Therefore, we have used it in the experiments
reported in Section 4.5.
4.5 Experimental evaluation
In order to test the performance of the proposed regression models, we have
carried out a series of experiments over 16 databases, four of which are artificial
(mte50, extended mte50, tan and extended tan).
The mte50 dataset [111] consists of a random sample of 50 records drawn
from a Bayesian network with NB structure and MTE distributions. The aim of
this network is to represent a situation which is handled in a natural way by the
MTE model. In order to obtain this network, we first simulated a database with
500 records for variables X , Y , Z and W , where X follows a χ2 distribution with
5 degrees of freedom, Y follows a negative exponential distribution with mean
1/X , Z = ⌊X/2⌋, where ⌊·⌋ stands for the integer part function, and W is a
random variable with Beta distribution with parameters p = 1/X and q = 1/X .
Out of that database, a naïve Bayes regression model was constructed using X
Database         Size   # Continuous variables   # Discrete variables
abalone          4176            8                        1
auto-mpg          392            8                        0
bodyfat           251           15                        0
cloud             107            6                        2
concrete         1030            9                        0
forestfires       517           11                        2
housing           506           14                        0
machine           209            8                        1
pollution          59           16                        0
servo             166            1                        4
strikes           624            6                        1
veteran           137            4                        4
mte50              50            3                        1
extended mte50     50            4                        2
tan               500            3                        2
extended tan      500            4                        3

Table 4.1: A description of the databases used in the experiments, indicating their size, number of continuous variables and number of discrete variables.
as response variable, and a sample of size 50 drawn from it using the Elvira
software [34]. Database extended mte50 was obtained from mte50 by adding
two columns independently of the others. One of the columns was drawn by
sampling uniformly from the set {0, 1, 2, 3} and the other by sampling from a
N(4, 3) distribution.
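The simulation just described can be sketched with the standard library alone (a sketch of the sampling scheme, not the Elvira code actually used; the seed is arbitrary):

```python
import math
import random

random.seed(42)

def sample_record():
    x = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(5))  # X ~ chi-squared, 5 d.o.f.
    y = random.expovariate(x)          # negative exponential with mean 1/X (rate X)
    z = math.floor(x / 2.0)            # Z = integer part of X/2
    w = random.betavariate(1.0 / x, 1.0 / x)   # W ~ Beta(p = 1/X, q = 1/X)
    return x, y, z, w

database = [sample_record() for _ in range(500)]
```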
Database tan was constructed in a similar way. We generated a sample of size
1000 for variables X0, . . . , X4, where X0 is a N(3, 2), X1 is a negative exponential
with mean 2 × |X0|, X2 is uniform in the interval (X0, X0 +X1), X3 is sampled
from the set {0, 1, 2, 3} with probability proportional to X0 and X4 has a Poisson
distribution with mean λ = log(|X0 − X1 − X3| + 1). Out of that database, a
TAN regression model [51] was generated, and a sample of size 500 drawn from
it using the Elvira software [34]. Finally, the dataset extended tan was obtained
from tan by adding two independent columns, one of them drawn by sampling
uniformly from the set {0, 1, 2, 3} and the other by sampling from a N(10, 5)
distribution.
The aim of using the two extended databases for mte50 and tan is to test the
performance of the variable selection scheme in two databases where we know for
sure that some of the explanatory variables do not influence the response variable.
The other databases are available in the UCI [11] and StatLib [149] repositories.
A description of the databases used can be found in Table 4.1.
In each database, we randomly inserted missing cells, at rates ranging from
10% to 50%. The missing cells were created incrementally, i.e., a database D
with 20% of missing cells is constructed from the same database with 10% of
missing values, and so on; hence both data sets share the same missing cells in
10% of their positions. Over the resulting databases, we
have run 5 algorithms: NB, TAN, SNB and STAN, where the last two correspond
to the selective versions of NB and TAN. We have also included the M5’ algo-
rithm [155] in the comparison. Regarding the implementation of our regression
models, we have included it in the Elvira software [34].
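The incremental construction of the missing cells can be sketched as follows (an illustrative sketch; the cell positions are randomly chosen here, not those of the experiments):

```python
import random

random.seed(3)

def nested_masks(n_rows, n_cols, rates=(0.10, 0.20, 0.30, 0.40, 0.50)):
    """Each higher-rate mask extends the lower-rate one, so the 20% database
    contains exactly the missing cells of the 10% database plus new ones."""
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    random.shuffle(cells)
    total = len(cells)
    return {rate: set(cells[: round(total * rate)]) for rate in rates}

masks = nested_masks(100, 5)
```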
We have used 10-fold cross validation [150] to estimate the srmse. The missing
cells in the databases were selected before running the cross validation, therefore,
in this case both the training and test databases contain missing cells in each
iteration of the cross validation. We discarded from the test set the records for
which the value of Y was missing. If the missing cells in the test set correspond to
explanatory variables, algorithm M5’ imputes them with the column average for
numeric variables and the column mode for qualitative variables [158]. The regression models
do not require the imputation of the missing explanatory variables in the test set,
as the posterior distribution for Y is computed by probability propagation and
therefore, the variables which are not observed are marginalised out. The results
of the experimental comparison are displayed in Figures 4.2, 4.3 and 4.4. The
values represented correspond to the average srmse computed by 10-fold cross
validation.
We used Friedman’s test [42] to compare the algorithms, which reported statistically
significant differences among them, with a p-value of 2.2 × 10⁻¹⁶. Therefore,
we continued the analysis by carrying out a pairwise comparison, following the
procedure discussed by Garcıa and Herrera [63], based on Nemenyi’s, Holm’s,
Shaffer’s and Bergmann’s tests. The ranking of the algorithms analysed, accord-
ing to Friedman’s statistic, is shown in Table 4.2. Notice that a higher rank
indicates that the algorithm is more accurate, as we are using the rmse as target.
The result of the pairwise comparison is shown in Table 4.3. It can be seen that
SNB and STAN outperform their versions without variable selection. Also, M5’
is outperformed by SNB and STAN. Finally, there is no statistically significant
difference between the two most accurate methods: SNB and STAN. The
conclusions are rather similar regardless of the test used. The only difference is that
Holm’s and Bergmann’s tests also report significant differences between NB and
TAN and between TAN and M5’.
Algorithm   Ranking
NB          2.4688
TAN         1.7917
SNB         4.3021
STAN        3.9896
M5’         2.4479

Table 4.2: Average rankings of the algorithms tested in the experiments using Friedman’s test.
Hypothesis       Nemenyi       Holm          Shaffer       Bergmann
TAN vs. SNB      3.8173E-27    3.8173E-27    3.8173E-27    3.8173E-27
TAN vs. STAN     5.9273E-21    5.3346E-21    3.5564E-21    3.5564E-21
SNB vs. M5’      4.4902E-15    3.5922E-15    2.6942E-15    2.6942E-15
NB vs. SNB       9.4913E-15    6.6439E-15    5.6948E-15    3.7965E-15
STAN vs. M5’     1.4259E-10    8.5557E-11    8.5557E-11    4.2778E-11
NB vs. STAN      2.6655E-10    1.3328E-10    1.0662E-10    5.3310E-11
NB vs. TAN       0.0301        0.0120        0.0120        0.0120
TAN vs. M5’      0.0403        0.0121        0.0121        0.0121
SNB vs. STAN     1             0.3418        0.3418        0.3418
NB vs. M5’       1             0.9273        0.9273        0.9273

Table 4.3: Adjusted p-values for the pairwise comparisons using Nemenyi’s, Holm’s, Shaffer’s and Bergmann’s statistical tests.
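The average ranks of Table 4.2 are computed in the following spirit (a sketch that ignores ties; the toy rmse values are made up):

```python
def average_ranks(rmse_by_dataset):
    """Average Friedman ranks across datasets; the worst (largest) rmse gets
    rank 1, so a higher average rank means a more accurate algorithm, matching
    the convention of Table 4.2. Ties are ignored for simplicity."""
    k = len(next(iter(rmse_by_dataset.values())))
    totals = [0.0] * k
    for errors in rmse_by_dataset.values():
        order = sorted(range(k), key=lambda j: errors[j], reverse=True)
        for rank, j in enumerate(order, start=1):
            totals[j] += rank
    n = len(rmse_by_dataset)
    return [t / n for t in totals]

# Hypothetical rmse values for three algorithms on two datasets.
ranks = average_ranks({"d1": [3.0, 1.0, 2.0], "d2": [2.5, 0.5, 1.5]})
```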
Figure 4.2: Comparison of the different models (NB, TAN, SNB, STAN, M5’) for the data sets abalone, auto-mpg, bodyfat, cloud and concrete, plotting rmse against the percentage of missing values.
Figure 4.3: Comparison of the different models for the data sets forestfires, housing, machine, pollution, servo and strikes, plotting rmse against the percentage of missing values. The legends are the same as in Figure 4.2.
Figure 4.4: Comparison of the different models for the data sets veteran, mte50, extended_mte50, tan and extended_tan, plotting rmse against the percentage of missing values. The legends are the same as in Figure 4.2.
4.6 Results discussion
The experimental evaluation shows a satisfactory behaviour of the proposed re-
gression methods. The selective versions outperform the sophisticated M5’ algo-
rithm. Notice that the M5’ algorithm also incorporates variable selection, through
tree-pruning. The difference between the models based on Bayesian networks and
model trees becomes sharper as the rate of missing values grows. Also, the use
of variable selection always increases the accuracy. The fact that there are no
significant differences between SNB and STAN makes the former preferable, as
it is simpler (it contains fewer parameters).
Finally, consider the line corresponding to M5’ in the graph for database
bodyfat in Figure 4.2. In that case, the error decreases abruptly for 40% and
50% of missing values, which is counterintuitive. We have found out that this is
due to the presence of outliers in the database, which are removed when the rate
of missing values is high. It suggests that M5’ is more sensitive to outliers than
the models based on Bayesian networks.
4.7 Conclusions
In this chapter we have studied the induction of Bayesian network models for
regression from incomplete data sets, based on the use of MTE distributions. We
have considered two well known network structures in classification and regres-
sion: The NB and TAN.
The proposal for handling missing values relies on the data augmentation
algorithm, which iteratively re-estimates a model and imputes the missing values
using it. We have shown that this algorithm can be adapted for the regression
problem by distinguishing the imputation of the response variable, in such a way
that the prediction error is minimised.
We have also studied the problem of variable selection, following the same
ideas as in Chapter 3. The final contribution of this chapter is the method for
improving the accuracy by reducing the bias, which can be incorporated regardless
of whether the model is obtained from complete or incomplete data.
The experiments conducted have shown that the selective versions of the pro-
posed algorithms outperform the robust M5’ scheme, which is not surprising, as
M5’ is mainly designed for continuous explanatory variables, while MTEs are
naturally developed for hybrid domains.
Chapter 5
Parametric learning in MTE
networks using incomplete data
Abstract

Estimating an MTE from data has turned out to be a difficult task. Current methods
suffer from a considerable computational burden as well as the inability to handle
missing values in the training data. In this chapter we describe an EM-based algorithm
for learning the maximum likelihood parameters of an MTE network when confronted
with incomplete data. In order to overcome the computational difficulties we make
certain distributional assumptions about the domain being modeled, thus focusing
on a subclass of the general class of MTE networks. Preliminary empirical results
indicate that the proposed method offers results that are in line with intuition.
5.1 Introduction
One of the major challenges when using probabilistic graphical models for modeling
hybrid domains is to find a representation of the joint distribution that
supports (1) efficient algorithms for exact inference based on local computations
and (2) algorithms for learning the representation from data.
In this chapter we will focus on the learning problem considering MTEs [106]
as a candidate framework. Algorithms for learning marginal and conditional
MTE distributions from complete data have previously been proposed [135, 128,
88, 87]. When faced with incomplete data, in Chapter 4 we considered a data
augmentation technique for learning (tree augmented) naïve MTE networks for
regression [53], but so far no attempt has been made at learning the parameters
of a general MTE network.
The task of learning MTEs from data was initially approached using least
squares estimation [135, 128]. However, this technique does not combine well
with more general model selection problems, as many standard score functions for
model selection, including the Bayesian information criterion (BIC) [138], assume
maximum likelihood (ML) parameter estimates to be available. ML learning of
univariate distributions was introduced in [88], and a first attempt at learning
conditional distributions was made in [87].
In this chapter we propose an EM-based algorithm [41] for learning parameters
in MTE networks from incomplete data. The general problem of learning MTE
networks (also with complete data) is computationally hard [87]: firstly, the
sufficient statistics of a dataset are the dataset itself, and secondly, there are no known
closed-form equations for finding the maximum likelihood (ML) parameters. In
order to circumvent these problems, we focus on domains where the probability
distributions mirror standard parametric families for which ML parameter esti-
mators are known to exist. This implies that instead of trying to directly learn
ML estimates for the MTE distributions, we may consider the ML estimators
for the corresponding parametric families. Hence, we define a generalised EM
algorithm that incorporates the following two observations (corresponding to the
M-step and the E-step, respectively):
i) Using the results of [33, 88] the domain-assumed parametric distributions
can be transformed into MTE distributions.
ii) Using the MTE representation of the domain we can evaluate the expected
sufficient statistics needed for the ML estimators.
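The interplay of the two observations can be illustrated on the simplest possible case: a conditional linear Gaussian Y | X with missing responses, where the M-step has a closed-form ML estimator and the E-step supplies the expected sufficient statistics. This sketch is an illustration of the E/M split only; it replaces propagation in the MTE representation with the exact Gaussian expectation, and the ground-truth parameters are made up.

```python
import random

random.seed(5)

# Ground truth: Y | X = x ~ N(1 + 2x, 0.5^2); about 30% of the y values are missing.
data = []
for _ in range(400):
    x = random.gauss(0.0, 1.0)
    y = 1.0 + 2.0 * x + random.gauss(0.0, 0.5)
    data.append((x, None if random.random() < 0.3 else y))

b, l = 0.0, 0.0   # initial parameters of the assumed linear Gaussian Y | X

for _ in range(30):
    # E-step: expected sufficient statistics; for a missing y, E[Y | x] = b + l*x.
    pairs = [(x, y if y is not None else b + l * x) for x, y in data]
    # M-step: closed-form ML (least-squares) estimates for the linear Gaussian.
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    l = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
    b = my - l * mx
```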
For ease of presentation we shall only consider domains with multinomial,
Gaussian, and logistic functions, but, in principle, the proposed learning proce-
dure is not limited to these distributional families. Note that for these types of
domains exact inference is not possible using the assumed distributional families.
The remainder of the chapter is organised as follows. In Section 5.2 we give
rules for transforming selected parametric distributions into MTEs. From Sec-
tion 5.3 to 5.5 we describe the proposed algorithm. In Section 5.6 we present
some preliminary experimental results. Finally, the conclusion and some ideas
for future research are given in Section 5.7.
For ease of exposition, some irrelevant calculations in Sections 5.4 and 5.5
have been moved to the Appendix. The Appendix also describes the vector
notation used throughout the chapter.
5.2 Translating standard distributions into MTE
distributions
In this section we will consider transformations from selected parametric distri-
butions to MTE distributions.
5.2.1 Multinomial
The conversion from a multinomial into an MTE potential is straightforward,
since a multinomial distribution can be seen as a specific case of an MTE po-
tential [106]. For example, consider two discrete variables X and Z with states
xi, i = 1, . . . , n and zj, j = 1, . . . , d. The multinomial potential P(x | z) defined
as a probability table,

            Z = z1   . . .   Z = zd
X = x1      p11      . . .   p1d
. . .       . . .    . . .   . . .
X = xn      pn1      . . .   pnd
or as a probability tree,
[Probability tree omitted: the root node Z branches on z1, . . . , zd, and each branch leads to a node X whose n leaves hold the probabilities p1j, . . . , pnj.]
can be translated into an MTE potential using a mixed tree structure (see
Subsection 2.2.3), in which only the discrete variables take part. The MTE densities
located in the leaves will have only the independent term with the corresponding
value pij .
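In code, this translation amounts to a mixed tree whose leaves are constant MTE densities (a sketch; the nested-dictionary representation is an assumption of this example, not the thesis's data structure):

```python
# A multinomial potential P(X | Z) given as a probability table.
cpt = {
    "z1": {"x1": 0.2, "x2": 0.8},
    "z2": {"x1": 0.6, "x2": 0.4},
}

def multinomial_to_mte(cpt):
    """Mixed tree over discrete variables only: each leaf is an MTE density
    consisting of the independent term a0 = p_ij alone (no exponential terms)."""
    return {z: {x: {"a0": p, "terms": []} for x, p in row.items()}
            for z, row in cpt.items()}

mte = multinomial_to_mte(cpt)
```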
5.2.2 Conditional linear Gaussian
In [33, 88] methods for obtaining an MTE approximation of a (marginal) Gaussian
distribution are described. Common for both approaches is that the split points
used in the approximations depend on the mean value of the distribution being
modeled. Consider now a variable X with continuous parents Y = (Y1, . . . , Yc)T
and assume that X given Y follows a conditional linear Gaussian distribution¹:

X | Y = y ∼ N(µ = b + lᵀy, σ²).
In the conditional linear Gaussian distribution, the mean value is a weighted linear
combination of the continuous parents. This implies that we cannot directly
obtain an MTE representation of the distribution by following the procedures
of [33, 88]; each part of an MTE potential has to be defined on a hypercube
(see Definition 5 in Subsection 2.2.3), and the split points can therefore not
depend on any of the variables in the potential. Instead we define an MTE
approximation by splitting the domain of the variables Y, ΩY, into hypercubes
D1, . . . , Dk, and specifying an MTE density for X for each of the hypercubes.
For hypercube Dp, p = 1, . . . , k, the mean of the distribution is assumed to be
¹For ease of exposition we will disregard any discrete parent variables in the remainder of this section, since they will only serve to index the parameters of the function.
Figure 5.1: Mixed tree for a Gaussian variable X with two continuous (Gaussian) parents Y1 and Y2. The domains of Y1 and Y2 are split into intervals, and each leaf holds the MTE potential in Equation (5.1) with µ = µp and σ = σX, where µp = b + l1·mid^p_1 + l2·mid^p_2. The parameters in the example are b = 0.5, l1 = 9 and l2 = 5.
constant, i.e., µp = b + l1·mid^p_1 + · · · + lc·mid^p_c, where mid^p_i denotes the midpoint
of Yi in Dp, i ∈ {1, . . . , c} (by defining fixed upper and lower bounds on the
ranges of the continuous variables, the midpoints are always well-defined). Thus,
finding an MTE representation of the conditional linear Gaussian distribution
has been reduced to defining a partitioning D1, . . . , Dk of ΩY and specifying
an MTE representation for a (marginal) Gaussian distribution (with mean µp
and variance σ²) for each of the hypercubes Dp in the partitioning¹. Figure 5.1
shows an example of an MTE representation of a conditional linear Gaussian
distribution for a variable.
In the current implementation we define the partitioning of ΩY based on
¹Note that σ² does not depend on the continuous parents.
equal-frequency binning, and we use the BIC score [138] to choose the number of
bins. To obtain an MTE representation of the (marginal) Gaussian distribution
for each partition in ΩY we follow the procedure of [88]; six MTE candidates for
the domain [−2.5, 2.5] are shown in Figure 5.2 (no split points are used, except
to define the boundary). The 6-term MTE density was selected to be part of the
approximation because it offers a good compromise between fit and complexity.
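The partitioning step can be sketched as follows (equal-frequency split points and hypercube midpoints for a single continuous parent; the BIC-based choice of the number of bins is omitted and k = 4 is fixed by assumption):

```python
import random

random.seed(2)

def equal_frequency_edges(values, n_bins):
    """Split points such that each interval holds roughly the same number of samples."""
    v = sorted(values)
    inner = [v[(i * len(v)) // n_bins] for i in range(1, n_bins)]
    return [v[0]] + inner + [v[-1]]

def midpoints(edges):
    return [(a + b) / 2.0 for a, b in zip(edges, edges[1:])]

# One continuous parent Y1; inside hypercube D_p the mean is frozen at
# mu_p = b + l1 * mid^p_1, with b = 0.5 and l1 = 9 as in Figure 5.1.
y1 = [random.gauss(0.0, 1.0) for _ in range(1000)]
edges = equal_frequency_edges(y1, 4)
mus = [0.5 + 9.0 * m for m in midpoints(edges)]
```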
Notice that this approximation is only positive within the interval [µ − 2.5σ, µ +
2.5σ] (cf. Figure 5.2), and it actually integrates up to 0.9876 in that region, which
means that there is a probability of 0.0124 of finding points outside this interval.
In order to avoid problems with 0 probabilities, we add tails covering the
remaining probability mass of 0.0124. More precisely, we define the normalisation
constant

c = 0.0124 / [ 2(1 − ∫₀^{2.5σ} e⁻ˣ dx) ],

and include the tail

t(x) = c · exp{−(x − µ)}

for the interval above x = µ + 2.5σ in the MTE specification. Similarly, a tail is
also included for the interval below x = µ − 2.5σ. The transformation rule from
Gaussian to MTE therefore becomes

         ⎧ c · exp{x − µ}                                 if x < µ − 2.5σ,
f(x) =   ⎨ σ⁻¹ [ a₀ + Σⱼ₌₁⁶ aⱼ exp{bⱼ(x − µ)/σ} ]         if µ − 2.5σ ≤ x ≤ µ + 2.5σ,    (5.1)
         ⎩ c · exp{−(x − µ)}                              if x > µ + 2.5σ,

where,
where,
Figure 5.2: MTE approximations with 2, 4, 6, 8, 10 and 12 exponential terms, respectively, for the truncated standard Gaussian distribution with support [−2.5, 2.5]. It is difficult to visually distinguish the MTE and the Gaussian for the four latter models.
a0 = 49.8248 a1 = −34.958 b1 = −0.33333
a2 = −34.958 b2 = 0.33333
a3 = 11.7704 b3 = −0.66667
a4 = 11.7704 b4 = 0.66667
a5 = −1.5269 b5 = −1
a6 = −1.5269 b6 = 1
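Equation (5.1) with these coefficients can be evaluated directly. The sketch below (an illustration, not the thesis's implementation) numerically checks that the core plus the two exponential tails integrates to approximately 1:

```python
import math

A0 = 49.8248
A  = [-34.958, -34.958, 11.7704, 11.7704, -1.5269, -1.5269]
B  = [-1.0 / 3, 1.0 / 3, -2.0 / 3, 2.0 / 3, -1.0, 1.0]

def mte_gaussian(x, mu=0.0, sigma=1.0):
    """Equation (5.1): 6-term MTE core on [mu - 2.5s, mu + 2.5s] plus tails."""
    # Normalisation constant c = 0.0124 / (2 (1 - integral_0^{2.5 sigma} e^{-x} dx))
    c = 0.0124 / (2.0 * (1.0 - (1.0 - math.exp(-2.5 * sigma))))
    if x < mu - 2.5 * sigma:
        return c * math.exp(x - mu)
    if x > mu + 2.5 * sigma:
        return c * math.exp(-(x - mu))
    z = (x - mu) / sigma
    return (A0 + sum(a * math.exp(b * z) for a, b in zip(A, B))) / sigma

# Numerical sanity check in the standard Gaussian case.
step = 0.001
total = sum(mte_gaussian(-8.0 + i * step) for i in range(int(16.0 / step) + 1)) * step
```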
Figure 5.3 shows the MTE approximation to a standard Gaussian density
using the 3 pieces specified in Equation (5.1).
Figure 5.3: 3-piece MTE approximation of a standard Gaussian density. The dashed red line represents the Gaussian density and the blue one the MTE approximation.
5.2.3 Logistic
The sigmoid function for a discrete variable X with a single continuous parent Y
is given by
P(X = 1 | Y = y) = 1 / (1 + exp{b + wy}).
In [31] a 4-piece 1-term MTE representation for this function is proposed:

P(X = 1 | Y = y) =
    ⎧ 0                                                    if y < (5 − b)/w,
    ⎨ −0.021704 + 0.521704·c·exp{−0.635w(y − b(w + 1))}    if (5 − b)/w ≤ y ≤ −b/w,
    ⎨ 1.021704 − 0.521704·c⁻¹·exp{0.635w(y − b(w + 1))}    if −b/w < y ≤ (−5 − b)/w,
    ⎩ 1                                                    if y > (−5 − b)/w,
                                                           (5.2)

where c = 0.529936^{b(w²+w+1)}. Note that the MTE representation is 0 or 1 if
y < (5 − b)/w or y > (−5 − b)/w, respectively. The representation can therefore
be inconsistent with the data (i.e., we may have data cases with probability 0),
and we therefore replace the 0 and 1 with ε and 1 − ε, where ε is a small positive
number (ε = 0.0001 was used in the experiments reported in Section 5.6).
Figure 5.4 shows both the logistic function and its MTE approximation for the
potential P(X = 1 | Y = y) in Equation (5.2).
In the general case, where X has continuous parents Y = (Y1, . . . , Yc)ᵀ and discrete parents Z = (Z1, . . . , Zd)ᵀ, for each configuration z of Z the conditional distribution of X given Y is given by

P(X = 1 | Y = y, Z = z) = 1 / (1 + exp{b_z + Σ_{i=1}^{c} w_{i,z} y_i}).    (5.3)
With more than one continuous variable as argument, the logistic function cannot easily be represented by an MTE having the same structure as in Equation (5.2). The problem is that the split points would then be (linear) functions of at least one of the continuous variables, which is not consistent with the MTE framework (see Definition 5 in Subsection 2.2.3). Instead we follow the same procedure as for the conditional linear Gaussian distribution: for each of the continuous variables Yi in Y′ = {Y2, . . . , Yc}, split the variable Yi into a finite set of intervals and use the
Figure 5.4: 4-piece MTE approximation of a logistic function with b = 0 and w = −1. The dashed red line represents the logistic function and the blue one the MTE approximation.
midpoint of the pth interval to represent Yi in that interval. The intervals for the variables in Y′ define a partitioning D1, . . . , Dk of ΩY′ into hypercubes, and for each of these partitions we apply Equation (5.2). That is, for partition Dp we get

P(X = 1 | y, z) = 1 / (1 + exp{b′ + w1 y1}),    (5.4)

where b′ = b + Σ_{k=2}^{c} w_k mid_k^p, with mid_k^p the midpoint representing Y_k in D_p. In the current implementation Y1 is chosen arbitrarily from Y, and the partitioning of the state space of the parent variables is performed as for the conditional linear Gaussian distribution.
Figure 5.5 shows an example of the MTE representation of a logistic variable X with 3 continuous (Gaussian) parents. The information about the logistic parameters of Y1 and Y2 is contained in the parameter b′. Thus, each leaf represents an MTE potential P(X | y3), which is an approximation of the potential P(X | y1, y2, y3) in the same way as in Equation (5.2), where only the logistic variable and one conditioning Gaussian variable were allowed. Note that for this approximation we consider y1 = mid_1^p and y2 = mid_2^p.
Figure 5.5: Mixed tree for a logistic variable X with three continuous (Gaussian) parents Y1, Y2, Y3. The leaves are represented by the MTE potential in Equation (5.2) with y = y3, b = b′ and w = w3. The values for the example are b3 = 0.5, w1 = 0.1 and w2 = 0.8.
5.3 The EM Algorithm
As previously mentioned, deriving an EM algorithm for general MTE networks is
computationally hard because the sufficient statistics of the dataset is the dataset
itself and there is no closed-form solution for estimating the maximum likelihood
parameters. To overcome these computational difficulties we will instead focus on
a subclass of MTE networks, where the conditional probability distributions in
the network mirror selected distributional families. By considering this subclass
of MTE networks we can derive a generalised EM algorithm, where the updating
rules can be specified in closed form.
To be more specific, assume that we have an MTE network for a certain
domain, where the conditional probability distributions in the domain mirror
traditional parametric families with known ML-based updating rules. Based on
the MTE network we can calculate the expected sufficient statistics required by
these rules (the E-step) and by using the transformations described in Section 5.2
we can in turn update the distributions in the MTE network.
The overall learning algorithm is detailed in Algorithm 14, where the domain
in question is represented by the model B. Note that in order to exemplify the
procedure we only consider the multinomial distribution, the Gaussian distribu-
tion, and the logistic distribution. The algorithm is, however, easily extended to
other distribution classes. The algorithm finishes when the following convergence
criterion is satisfied:
| log L(B′_t | D) − log L(B′_{t−1} | D) | < ε,

where L(B′_t | D) is the likelihood of the MTE network B′ given the database D at step t.
The steps of the algorithm are graphically shown through an example in Fig-
ure 5.6.
The transformation rules for the conditional linear Gaussian distribution, the multinomial distribution, and the logistic distribution are already given in Section 5.2. In order to complete the specification of the algorithm, we therefore only need to define the E-step and the M-step for the three types of distributions being considered.
Algorithm 14: An EM algorithm for learning MTE networks from incomplete data.
Input: A parameterised model B over X1, . . . , Xn, and an incomplete database D of cases over X1, . . . , Xn.
Output: An MTE network B′.
1  Initialise the parameter estimates θ_B randomly.
2  repeat
3      Using the current parameter estimates θ_B, represent B as an MTE network B′ (Section 5.2).
4      (E-step) Calculate the expected sufficient statistics required by the M-step using B′ (Section 5.4).
5      (M-step) Use the result of the E-step to calculate new ML parameter estimates θ̂_B for B (Section 5.5).
6      θ_B := θ̂_B.
7  until convergence
8  return B′.
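The control flow of Algorithm 14 can be sketched as follows; `translate`, `e_step` and `m_step` are placeholder callables standing in for the operations the algorithm references, so this is only a schematic skeleton, not the Elvira implementation:

```python
def em_mte(model, data, translate, e_step, m_step, eps=1e-4, max_iter=100):
    """Schematic EM loop for MTE networks (Algorithm 14).

    translate(model)  -> MTE network B'           (translation rules, Section 5.2)
    e_step(mte, data) -> expected statistics      (E-step)
    m_step(stats)     -> (new model, log-likelihood)  (M-step)
    """
    prev_ll = float("-inf")
    for _ in range(max_iter):
        mte = translate(model)          # B  ->  B' in the MTE framework
        stats = e_step(mte, data)       # expected sufficient statistics from B'
        model, ll = m_step(stats)       # ML parameter update + current log-likelihood
        if abs(ll - prev_ll) < eps:     # convergence criterion of the text
            break
        prev_ll = ll
    return translate(model)
```

With a trivial `translate` and complete data, the loop reduces to one maximisation step followed by detection of a stable likelihood.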
5.4 The M-step. Updating rules for the parameter estimates
This section is devoted to calculating the updating rules for the parameter estimates of the considered distributions. Given a database of cases D = {d1, . . . , dN} for variables X1, . . . , Xn, where d_i = (x_1^{(i)}, . . . , x_n^{(i)}), the updating rules are derived based on the expected data-complete log-likelihood function Q:

Q = Σ_{i=1}^{N} E[log f(X1, . . . , Xn) | d_i] = Σ_{i=1}^{N} Σ_{j=1}^{n} E[log f(X_j | pa(X_j)) | d_i].    (5.5)
5.4.1 Multinomial
Let X_j be a discrete variable with only discrete parents Z, that is, Z → X_j,
Figure 5.6: EM algorithm for learning MTE networks from incomplete data. Continuous nodes are represented with double lines and the discrete ones with a single line.
where the conditional distribution for variable X_j, f(x_j | z), is a discrete potential P(x_j | z). Thus, the updating rule for the multinomial parameters is:
θ̂_{j,k,z} := Σ_{i=1}^{N} P(X_j = k, Z = z | d_i) / ( Σ_{k=1}^{|sp(X_j)|} Σ_{i=1}^{N} P(X_j = k, Z = z | d_i) ).    (5.6)
For the particular case in which the variable X_j has no parents, the formula above simplifies to:

θ̂_{j,k} := Σ_{i=1}^{N} P(X_j = k | d_i) / ( Σ_{k=1}^{|sp(X_j)|} Σ_{i=1}^{N} P(X_j = k | d_i) ).    (5.7)
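For illustration, once the E-step has produced the soft counts P(X_j = k, Z = z | d_i) for a fixed configuration z, the update of Equation (5.6) is just a normalised sum over cases; a minimal sketch (the function name is ours):

```python
def multinomial_update(soft_counts):
    """Updating rule of Equation (5.6) for one fixed configuration z.

    soft_counts[i][k] = P(X_j = k, Z = z | d_i);
    returns the estimates theta_{j,k,z} for k = 0..|sp(X_j)|-1.
    """
    n_states = len(soft_counts[0])
    # Numerators of (5.6): sum the posterior mass of each state over all cases.
    sums = [sum(case[k] for case in soft_counts) for k in range(n_states)]
    total = sum(sums)                    # denominator: sum over all states
    return [s / total for s in sums]
```

The returned estimates always sum to one, as required of a multinomial distribution.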
5.4.2 Conditional linear Gaussian
Let X_j be a continuous (Gaussian) variable with discrete parents Z and continuous (Gaussian) parents Y, that is, Z → X_j ← Y, with density function f(x_j | z, y), where

X_j ∼ N( μ = l_{z,j}ᵀ y + b_{z,j}, σ_{z,j}² ).
To ease notation, we shall use l_{z,j} = [l_{z,j}ᵀ, b_{z,j}]ᵀ and y = [yᵀ, 1]ᵀ, so that

l_{z,j}ᵀ y + b_{z,j} = l_{z,j}ᵀ y.    (5.8)
Therefore, the density function for X_j can be written as:

f(x_j | z, y) = (1 / (σ_{z,j} √(2π))) exp{ −(1/2) ((x_j − l_{z,j}ᵀ y) / σ_{z,j})² } ∝ exp{ −(1/2) ((x_j − l_{z,j}ᵀ y) / σ_{z,j})² }.    (5.9)
So, we need to calculate the updating rules for the unknown parameters l_{z,j} and σ_{z,j}. The factor 1/(σ_{z,j}√(2π)) can be considered a constant in the calculation of the updating rule of l_{z,j}, but not in the updating rule of σ_{z,j}.

Since the parameters of the distribution need to be maximised, the updating rules for each one are calculated below by taking the derivative of the function Q with respect to the parameter and then finding the roots of the resulting equation to get the maximum value (intermediate calculations are reported in the Appendix at the end of the chapter).
For the simplest case, in which the variable X_j has no parents, the density function is:

f(x_j) ∝ exp{ −(1/2) ((x_j − μ_j) / σ_j)² }.    (5.10)

In this case the parameters to be estimated are μ_j and σ_j.
Updating rule for μ_j

μ̂_j := (1/N) Σ_{i=1}^{N} E[X_j | d_i]    (5.11)
Updating rule for σ_j

σ̂_j := [ (1/N) ( Σ_{i=1}^{N} E(X_j² | d_i) + N μ̂_j² − 2 μ̂_j Σ_{i=1}^{N} E(X_j | d_i) ) ]^{1/2}    (5.12)
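Given the expected moments E[X_j | d_i] and E[X_j² | d_i] delivered by the E-step, the two rules above reduce to a few sums; a minimal sketch (the function name is ours):

```python
def gaussian_update(ex, ex2):
    """Updating rules (5.11)-(5.12) from expected first and second moments.

    ex[i]  = E[X_j   | d_i]
    ex2[i] = E[X_j^2 | d_i]
    Returns (mu, sigma).
    """
    n = len(ex)
    mu = sum(ex) / n                                        # Equation (5.11)
    var = (sum(ex2) + n * mu ** 2 - 2 * mu * sum(ex)) / n   # Equation (5.12), squared
    return mu, var ** 0.5
```

With complete data the expected moments are simply the observed values x_i and x_i², and the rules collapse to the usual maximum-likelihood mean and (population) standard deviation.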
Updating rule for l_{z,j}

l̂_{z,j} := [ Σ_{i=1}^{N} f(z | d_i) E(Y Yᵀ | d_i, z) ]^{−1} [ Σ_{i=1}^{N} f(z | d_i) E(X_j Y | d_i, z) ]    (5.13)
Updating rule for σ_{z,j}

σ̂_{z,j} := [ (1 / Σ_{i=1}^{N} f(z | d_i)) Σ_{i=1}^{N} f(z | d_i) E[ (X_j − l_{z,j}ᵀ Y)² | d_i, z ] ]^{1/2}    (5.14)
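For a single continuous parent, the augmented vector y = [y, 1]ᵀ turns Equation (5.13) into a 2 × 2 linear system. The sketch below specialises further to complete data, where the expectations reduce to the observed products and f(z | d_i) = 1; it is an illustration only:

```python
def clg_update(xs, ys):
    """Updating rules (5.13)-(5.14), one continuous parent, complete data.

    Returns (slope, intercept, sigma) of X | Y ~ N(slope*y + intercept, sigma^2).
    """
    n = len(xs)
    # A = sum_i E[Y Y^T | d_i] with augmented Y = (y, 1)^T
    a11 = sum(y * y for y in ys)
    a12 = sum(ys)
    a22 = float(n)
    # b-vector = sum_i E[X_j Y | d_i]
    b1 = sum(x * y for x, y in zip(xs, ys))
    b2 = sum(xs)
    # Solve the 2x2 system A l = b by Cramer's rule.
    det = a11 * a22 - a12 * a12
    slope = (a22 * b1 - a12 * b2) / det
    intercept = (a11 * b2 - a12 * b1) / det
    # Equation (5.14): expected squared residual.
    var = sum((x - slope * y - intercept) ** 2 for x, y in zip(xs, ys)) / n
    return slope, intercept, var ** 0.5
```

On noiseless data generated as x = 2y + 1 the rule recovers the slope 2, the intercept 1 and a zero residual standard deviation.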
5.4.3 Logistic
Let X_j be a binary variable with discrete parents Z and continuous (Gaussian) parents Y, that is, Z → X_j ← Y, where

P(x_j | z, y) = σ_{z,j}(y)^{x_j} (1 − σ_{z,j}(y))^{(1 − x_j)},   x_j ∈ {0, 1},    (5.15)

and

σ_{z,j}(y) = 1 / (1 + exp{w_{z,j}ᵀ y + b_{z,j}}),    (5.16)
where w_{z,j} is a vector of coefficients, one for each continuous parent of X_j. To ease notation, we shall use w_{z,j} = [w_{z,j}ᵀ, b_{z,j}]ᵀ and y = [yᵀ, 1]ᵀ, so that

w_{z,j}ᵀ y + b_{z,j} = w_{z,j}ᵀ y.    (5.17)

The calculations for the logistic parameters are explained below, following the same ideas as in Subsection 5.4.2.
Updating rule for w_{z,j}

∂Q/∂w_{z,j} = Σ_{i=1}^{N} P(z | d_i) ∂/∂w_{z,j} E[ log P(X_j | z, Y) | d_i, z ]
            = Σ_{i=1}^{N} P(z | d_i) ∂/∂w_{z,j} E[ X_j log σ_{z,j}(Y) + (1 − X_j) log(1 − σ_{z,j}(Y)) | d_i, z ]
            = Σ_{i=1}^{N} P(z | d_i) [ ∂/∂w_{z,j} E[ X_j log σ_{z,j}(Y) | d_i, z ] + ∂/∂w_{z,j} E[ (1 − X_j) log(1 − σ_{z,j}(Y)) | d_i, z ] ].    (5.18)
Now, for the first part of Equation (5.18) we get

∂/∂w_{z,j} E[ X_j log σ_{z,j}(Y) | d_i, z ]
  = ∂/∂w_{z,j} ( ∫_y P(x_j = 1, y | d_i, z) · 1 · log σ_{z,j}(y) dy + ∫_y P(x_j = 0, y | d_i, z) · 0 · log σ_{z,j}(y) dy )
  = ∂/∂w_{z,j} ∫_y P(x_j = 1, y | d_i, z) log σ_{z,j}(y) dy
  = ∫_y P(x_j = 1, y | d_i, z) ∂/∂w_{z,j} log σ_{z,j}(y) dy.    (5.19)
The derivative can be further expanded by noting that

∂/∂w_{z,j} log σ_{z,j}(y) = (1/σ_{z,j}(y)) ∂σ_{z,j}(y)/∂w_{z,j} = (1/σ_{z,j}(y)) σ_{z,j}(y)(1 − σ_{z,j}(y)) y = (1 − σ_{z,j}(y)) y,    (5.20)
and we therefore get

∂/∂w_{z,j} E(X_j log σ_{z,j}(Y) | d_i, z) = ∫_y P(x_j = 1, y | d_i, z)(1 − σ_{z,j}(y)) y dy.    (5.21)
In a similar way, for the second part of Equation (5.18) we get

∂/∂w_{z,j} E[ (1 − X_j) log(1 − σ_{z,j}(Y)) | d_i, z ]
  = ∂/∂w_{z,j} ( ∫_y P(x_j = 1, y | d_i, z) · 0 · log(1 − σ_{z,j}(y)) dy + ∫_y P(x_j = 0, y | d_i, z) · 1 · log(1 − σ_{z,j}(y)) dy )
  = ∂/∂w_{z,j} ∫_y P(x_j = 0, y | d_i, z) log(1 − σ_{z,j}(y)) dy
  = ∫_y P(x_j = 0, y | d_i, z) ∂/∂w_{z,j} log(1 − σ_{z,j}(y)) dy.    (5.22)
The derivative can be further expanded by noting that

∂/∂w_{z,j} log(1 − σ_{z,j}(y)) = (1/(1 − σ_{z,j}(y))) ∂(1 − σ_{z,j}(y))/∂w_{z,j} = −(1/(1 − σ_{z,j}(y))) σ_{z,j}(y)(1 − σ_{z,j}(y)) y = −σ_{z,j}(y) y,    (5.23)
and we therefore get

∂/∂w_{z,j} E((1 − X_j) log(1 − σ_{z,j}(Y)) | d_i, z) = −∫_y P(x_j = 0, y | d_i, z) σ_{z,j}(y) y dy.    (5.24)
By inserting these expressions back into Equation (5.18) we end up with

∂Q/∂w_{z,j} = Σ_{i=1}^{N} P(z | d_i) [ ∫_y P(x_j = 1, y | d_i, z)(1 − σ_{z,j}(y)) y dy − ∫_y P(x_j = 0, y | d_i, z) σ_{z,j}(y) y dy ].    (5.25)
In order to find this partial derivative we need to evaluate two integrals. However, the combination of the MTE potential P(x_j, y | d_i, z) and the logistic function σ_{z,j}(y) makes these integrals difficult to evaluate. In order to avoid this problem we use the MTE representation of the logistic function specified in Subsection 5.2.3, which allows the integrals to be calculated in closed form.
Furthermore, even with the previous approximation for the integrals, the roots of the resulting equation cannot be found analytically, so no closed-form updating rule for the weight vector is available. Instead one typically resorts to numerical optimisation, such as gradient ascent, for maximising Q. Thus, the gradient ascent updating rule can be expressed as

ŵ_{z,j} := w_{z,j} + γ ∂Q/∂w_{z,j},

where γ > 0 is a small number.
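With complete data the expectations vanish and the gradient-ascent step can be exercised directly. In the sketch below (an illustration, not the thesis implementation) we use the sigmoid convention of Equation (5.16), σ(y) = 1/(1 + exp{wy + b}), for which the complete-data gradient of the log-likelihood works out to Σ_i (σ(y_i) − x_i) y_i:

```python
import math

def sigmoid(w, b, y):
    # sigma_{z,j}(y) = 1 / (1 + exp{w*y + b}), as in Equation (5.16)
    return 1.0 / (1.0 + math.exp(w * y + b))

def logistic_ascent(xs, ys, gamma=0.1, iters=200):
    """Gradient-ascent updating rule w := w + gamma * dQ/dw,
    specialised to one continuous parent and complete data."""
    w = b = 0.0
    for _ in range(iters):
        # Complete-data gradient of the log-likelihood for this convention.
        gw = sum((sigmoid(w, b, y) - x) * y for x, y in zip(xs, ys))
        gb = sum((sigmoid(w, b, y) - x) for x, y in zip(xs, ys))
        w += gamma * gw
        b += gamma * gb
    return w, b
```

On a small separable dataset the iterates move σ towards 1 on the cases labelled 1 and towards 0 on the cases labelled 0, as expected of an ascent method.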
The calculation of ∂Q/∂w_{z,j} is carried out for each configuration of the discrete parents z and returns a vector of values (v1, . . . , vc), where c is the number of continuous parents. Let us look in more detail at the calculation of Equation (5.25). The following part of the expression,

∫_y P(x_j = 1, y | d_i, z)(1 − σ_{z,j}(y)) y dy,
for one specific parent y_i would be calculated as follows:

∫_{y_i} y_i ( ∫_{y \ y_i} P(x_j = 1, y | d_i, z)(1 − σ_{z,j}(y)) d(y \ y_i) ) dy_i,

that is, for each parent y_i it is necessary to compute c − 1 inner integrals and finally compute the expectation as

∫_{y_i} y_i g(y_i) dy_i,

where g(y_i) is an MTE density.
5.5 The E-step
In this section the expected sufficient statistics needed by the updating rules in the M-step are calculated. More specifically, we need to compute the following four expectations in the MTE framework:

1) E(X_j | d_i, z)
2) E(X_j Y | d_i, z)
3) E(Y Yᵀ | d_i, z)
4) E[ (X_j − l_{z,j}ᵀ Y)² | d_i, z ]

To simplify notation, for each configuration z of the discrete parents we calculate:

1) E(X_j | d_i)
2) E(X_j Y | d_i)
3) E(Y Yᵀ | d_i)
4) E[ (X_j − l_{z,j}ᵀ Y)² | d_i ]
All the expectations shown below can be obtained analytically (the intermediate calculations and notation are reported in the Appendix). For the first one, we have that:

E(X_j | d_i) = (a_0 / 2)(x_b² − x_a²) + Σ_{j=1}^{m} (a_j / b_j²) ( exp{b_j x_b}(b_j x_b − 1) − exp{b_j x_a}(b_j x_a − 1) ).    (5.26)
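Each of these expectations is an exact integral of x against an MTE density, so Equation (5.26) is easy to verify numerically; the density below is an arbitrary (unnormalised) illustration:

```python
import math

def expectation_mte(a0, terms, xa, xb):
    """E(X) for f(x) = a0 + sum_j a_j exp(b_j x) on [xa, xb], Equation (5.26)."""
    e = a0 / 2.0 * (xb ** 2 - xa ** 2)
    for a, b in terms:
        # Antiderivative of x*exp(b x) is exp(b x)(b x - 1)/b^2.
        e += a / b ** 2 * (math.exp(b * xb) * (b * xb - 1)
                           - math.exp(b * xa) * (b * xa - 1))
    return e

def expectation_numeric(a0, terms, xa, xb, n=50000):
    # Midpoint-rule check of the same integral.
    h = (xb - xa) / n
    total = 0.0
    for i in range(n):
        x = xa + (i + 0.5) * h
        total += x * (a0 + sum(a * math.exp(b * x) for a, b in terms))
    return total * h
```

The analytic value and the numerical quadrature agree to well beyond the accuracy needed inside the E-step.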
For the second one, we need to calculate a vector of expectations, where the j-th element is E(X_j Y_j | d_i). For simplicity we will denote X_j as X and Y_j as Y. The ranges of the variables will be [x_a, x_b] and [y_a, y_b], respectively.

E(XY | d_i) = (a_0 / 4)(y_b² − y_a²)(x_b² − x_a²) + Σ_{j=1}^{m} (a_j / (c_j² b_j²)) (−exp{b_j y_a} + b_j y_a exp{b_j y_a} + exp{b_j y_b} − b_j y_b exp{b_j y_b}) (−exp{c_j x_a} + c_j x_a exp{c_j x_a} + exp{c_j x_b} − c_j x_b exp{c_j x_b}).    (5.27)
A new version of E(XY | d_i) is shown in Equation (5.28) for the case in which the exponent contains only one variable, that is, b_j = 0:

E(XY | d_i) = (a_0 / 4)(y_b² − y_a²)(x_b² − x_a²) + Σ_{j=1}^{m} (a_j (y_b² − y_a²) / (2 c_j²)) ( exp{c_j x_a} − c_j x_a exp{c_j x_a} − exp{c_j x_b} + c_j x_b exp{c_j x_b} ).    (5.28)
For the third one, we need to calculate a matrix of expectations, where the jk-th element is E(Y_j Y_k | d_i). If j ≠ k, the calculation can be carried out in the same way as in Equation (5.27). When j = k, the expectation is E(Y_j² | d_i) and is calculated as in Equation (5.30) below.
For the fourth one, we have that:

E[ (X_j − l_{z,j}ᵀ Y)² | d_i ] = E(X_j² | d_i) − 2 l_{z,j}ᵀ E(X_j Y | d_i) + E((l_{z,j}ᵀ Y)² | d_i),    (5.29)

where

E(X_j² | d_i) = (a_0 / 3)(x_b³ − x_a³) + Σ_{j=1}^{m} (a_j / b_j³) ( exp{b_j x_b}(b_j x_b (b_j x_b − 2) + 2) − exp{b_j x_a}(b_j x_a (b_j x_a − 2) + 2) ).    (5.30)
The expectation E(X_j Y | d_i) in the second term of Equation (5.29) has been calculated previously in Equation (5.27). For the last term in Equation (5.29), we have that:

E[ (l_{z,j}ᵀ Y)² | d_i ] = E[ (l_{z,j}ᵀ Y)(l_{z,j}ᵀ Y)ᵀ | d_i ] = E[ l_{z,j}ᵀ Y Yᵀ l_{z,j} | d_i ] = l_{z,j}ᵀ E[ Y Yᵀ | d_i ] l_{z,j},    (5.31)

and the calculation of E[ Y Yᵀ | d_i ] has been previously considered.
For calculating the expectations throughout this section we have considered the most general case, in which all the variables involved are unobserved. When some of them are observed, we follow these basic rules:

• if X and Y are both observed, then E(XY | d_i) is replaced by xy;
• if only X is unobserved, then E(XY | d_i) is replaced by y · E(X | d_i), and the other way around if only Y is unobserved.
For the logistic distribution there is no expectation to compute in the E-step, since its updating rule in the M-step only involves a product of MTE functions and no expectations. However, a simple integral does need to be solved in Equation (5.25), since the integrand is an MTE function multiplied by y.
The calculation of the expectation for the multinomial distribution is straightforward and is implicitly carried out inside the M-step in Subsection 5.4.1.
5.6 Experimental results
In order to evaluate the proposed learning method we have generated data from
the Crops network [112]. We sampled six complete datasets containing 50, 100,
500, 1000, 5000, and 10000 cases, respectively, and for each of the datasets we
generated three other datasets with 5%, 10%, and 15% missing data (the data is
missing completely at random [95]), giving a total of 24 training datasets. The
actual data generation was performed using WinBUGS [97].
Figure 5.7: The Crops network, with variables Subsidize, Crop, Price and Buy.
For comparison, we have also learned baseline models using WinBUGS. How-
ever, since WinBUGS does not support learning of multinomial distributions from
incomplete data we have removed all cases where Subsidize is missing from the
datasets.
The learning results are shown in Table 5.1, which lists the average (per
observation) log-likelihood of the model with respect to a test-dataset consisting
of 15000 cases (and defined separately from the training datasets). From the
table we see the expected behaviour: As the size of the training data increases,
the models tend to get better; as the fraction of the data that is missing increases,
the learned models tend to get worse.
The results also show how WinBUGS in general outperforms the algorithm
we propose in this chapter. We believe that one of the reasons is the way we
approximate the tails of the Gaussian distribution in Equation (5.1). As the tails
are thicker than the actual Gaussian tails, the likelihood is lower in the central
parts of the distribution, where most of the samples potentially concentrate.
Another possible reason is the way in which we approximate the CLG distribution.
Recall that when splitting the domain of the parent variable, we take the average
data point in each split to represent the parent, instead of using the actual value.
This approximation tends to give an increase in the estimate of the conditional
variance, as the approximated distribution needs to cover all the training samples.
Obviously, this will later harm the average predictive log-likelihood. Two possible solutions to this problem are i) to increase the number of splits, or ii) to use dynamic discretisation to determine the optimal way to split the parent's domain.
However, both solutions come with a cost in terms of increased computational
complexity, and we consider the tradeoff between accuracy and computational
cost as an interesting topic for future research.
                    ELVIRA (% missing data)            WINBUGS (% missing data)
No. cases      0%       5%      10%      15%        0%       5%      10%      15%
    50      -3.8112  -3.7723  -3.8982  -3.8553   -3.7800  -3.7982  -3.7431  -3.6861
   100      -3.7569  -3.7228  -3.9502  -3.9180   -3.7048  -3.7091  -3.7485  -3.7529
   500      -3.6452  -3.6987  -3.7972  -3.8719   -3.6272  -3.6258  -3.6380  -3.6295
 1 000      -3.6325  -3.7271  -3.8146  -3.8491   -3.6174  -3.6181  -3.6169  -3.6179
 5 000      -3.6240  -3.6414  -3.8056  -3.9254   -3.6136  -3.6141  -3.6132  -3.6144
10 000      -3.6316  -3.6541  -3.7910  -3.8841   -3.6130  -3.6131  -3.6131  -3.6135

Table 5.1: The average log-likelihood for the learned models, calculated per observation on a separate test set.
The algorithm has been implemented in Elvira [34] and the software, the
datasets used in the experiments, and the WinBUGS specifications are all avail-
able from http://elvira.ual.es/MTE-EM.html.
5.7 Conclusions
In this chapter we have proposed an EM-based algorithm for learning MTE net-
works from incomplete data. In order to overcome the computational difficulties
of learning MTE distributions, we focus on a subclass of the MTE networks,
where the distributions are assumed to mirror known parametric families. This
subclass supports a computationally efficient EM algorithm. Preliminary empir-
ical results indicate that the method learns as expected, although not as well as
WinBUGS. In particular, our method seems to struggle when the portion of the
data that is missing increases. We have proposed some remedial actions to this
problem that we will investigate further.
Chapter 6
Approximate inference in MTE networks using importance sampling
Abstract

In this chapter we propose an algorithm for approximate inference in hybrid Bayesian networks where the underlying probability distribution is of class MTE. The algorithm is based on importance sampling simulation. We show how it is able to compute multiple posterior probabilities simultaneously. The behaviour of the new algorithm is experimentally tested and compared with previous methods existing in the literature.
6.1 Introduction
Even though Bayesian networks allow efficient inference algorithms to operate
over them, it is known that exact probabilistic inference is an NP-hard prob-
lem [35]. Furthermore, approximate probabilistic inference is also an NP-hard
problem if a given precision is required [39]. For that reason, approximate algo-
rithms that tradeoff complexity for accuracy have been developed both for discrete
Bayesian networks [21, 17, 137, 18, 109] and for hybrid Bayesian networks with
MTEs [134].
In this chapter we propose an approximate algorithm for inference in hybrid
Bayesian networks with MTEs. The algorithm is based on importance sampling,
and therefore it is an anytime algorithm [126] in the sense that the accuracy
of its results is proportional to the time it is allowed to use for computing the
propagation. We show how our proposal outperforms the previous state-of-the-art
method for approximate inference with MTEs, introduced in [134].
The rest of the chapter is organised as follows. The problem for which we propose a solution is formally posed in Section 6.2. The core of the methodological contributions is in Section 6.3, and the details of the algorithm can be found in Section 6.4. The experimental analysis carried out to test the performance of the algorithm is reported in Section 6.5. The concluding remarks are given in Section 6.6.
6.2 Problem formulation
We are interested in hybrid Bayesian networks, which are defined for a set of variables X that contains discrete and continuous variables. Throughout this chapter we will assume that X = Y ∪ Z, where Y and Z are sets containing only discrete and only continuous variables, respectively.

Inference consists of computing a probability value for a target variable W ∈ X given that the values of some variables E ⊂ X are known. Thus, if we write X = (W, Yᵀ, Zᵀ, Eᵀ)ᵀ, where Y = (Y1, . . . , Yd)ᵀ represents the non-observed discrete variables, Z = (Z1, . . . , Zc)ᵀ represents the non-observed continuous variables and E = (E1, . . . , Ek)ᵀ, then we are interested in calculating

P(a < W < b | E = e) = P(a < W < b, E = e) / φ(e)    (6.1)
if W is a continuous variable. The function φ in the denominator of Equation (6.1) is the marginal over the variables E of the joint distribution in the network. Let φ_X denote the conditional distribution of any variable X in the network. Then, the joint distribution is defined as

φ(w, y, z, e) = φ_W(w | pa(w)) Π_{i=1}^{d} φ_{Y_i}(y_i | pa(y_i)) Π_{j=1}^{c} φ_{Z_j}(z_j | pa(z_j)) Π_{l=1}^{k} φ_{E_l}(e_l | pa(e_l)).    (6.2)
Since our goal is to compute a probability given a fixed value e of the variables E, we will rather be interested in the restriction of the joint distribution to the knowledge that E = e. We will replace any symbol φ in Equation (6.2) by ψ, where the new symbol denotes the former function restricted to e. With this notation, the joint distribution restricted to e can be written as

ψ(w, y, z) = ψ_W(w | pa(w)) Π_{i=1}^{d} ψ_{Y_i}(y_i | pa(y_i)) Π_{j=1}^{c} ψ_{Z_j}(z_j | pa(z_j)) Π_{l=1}^{k} ψ_{E_l}(e_l | pa(e_l)).    (6.3)
So, the numerator in Equation (6.1) can be obtained as

P(a < W < b, E = e) = ∫_a^b ( Σ_{y ∈ Ω_Y} ∫_{Ω_Z} ψ(w, y, z) dz ) dw = ∫_a^b h(w) dw,    (6.4)

where

h(w) = Σ_{y ∈ Ω_Y} ∫_{Ω_Z} ψ(w, y, z) dz.    (6.5)
To finally compute the probability expressed in Equation (6.1), we still have to compute φ(e). This is obtained as

φ(e) = ∫_{Ω_W} ( Σ_{y ∈ Ω_Y} ∫_{Ω_Z} ψ(w, y, z) dz ) dw = ∫_{Ω_W} h(w) dw.    (6.6)
On the other hand, if W is discrete, the probability is formulated as

P(W = w | E = e) = P(W = w, E = e) / φ(e),    (6.7)

where w is any possible value of W. The numerator of Equation (6.7) can be expressed as

P(W = w, E = e) = Σ_{y ∈ Ω_Y} ∫_{Ω_Z} ψ(w, y, z) dz = h(w).    (6.8)
A similar procedure is carried out to compute the denominator of Equation (6.7), which is obtained as

φ(e) = Σ_{w ∈ Ω_W} Σ_{y ∈ Ω_Y} ∫_{Ω_Z} ψ(w, y, z) dz = Σ_{w ∈ Ω_W} h(w).    (6.9)
Hence, calculating the probabilities formulated in Equations (6.1) and (6.7) requires the computation of the expressions in Equations (6.4), (6.6), (6.8) and (6.9). The problem is that in all cases the calculations are carried out over the joint distribution, whose size is exponential in the number of variables in the network. Therefore, if the number of variables is high, it can be difficult or even impossible to represent such a joint distribution in a computer, especially if memory resources are limited. In the next section we propose a solution for approximating the required probabilities while keeping the complexity bounded. The solution is based on the use of the importance sampling technique [129].
6.3 Approximate propagation using importance sampling
We will start off by considering the case in which the target variable, W, is continuous. We can write Equation (6.1) as follows:

P(a < W < b, E = e) = ∫_a^b h(w) dw = ∫_a^b (h(w) / f*(w)) f*(w) dw = E_{f*}[ h(W*) / f*(W*) ],    (6.10)

where f* is a probability density function on (a, b), called the sampling distribution, and W* is a random variable with density f*. Let W*_1, . . . , W*_m be a sample drawn from f*. Then it is easy to prove that
θ̂₁ = (1/m) Σ_{i=1}^{m} h(W*_i) / f*(W*_i)    (6.11)

is an unbiased estimator of P(a < W < b, E = e). This estimation procedure is called importance sampling.
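The estimator θ̂₁ is plain importance sampling, and Equation (6.11) can be exercised on a target with a known answer. In the sketch below h is the standard Gaussian density on (a, b) = (0, 1) and f* is the uniform density on that interval; this is an illustration only, not the sampling-distribution construction of Section 6.3.1:

```python
import math
import random

def importance_sampling(h, f_star, sampler, m):
    """theta_1 = (1/m) * sum h(W_i)/f*(W_i), Equation (6.11)."""
    return sum(h(w) / f_star(w) for w in (sampler() for _ in range(m))) / m

# Target h: standard Gaussian density; interval (a, b) = (0, 1).
h = lambda w: math.exp(-w * w / 2.0) / math.sqrt(2.0 * math.pi)
f_star = lambda w: 1.0                       # uniform density on (0, 1)

random.seed(0)
estimate = importance_sampling(h, f_star, random.random, 100000)
exact = 0.5 * (math.erf(1.0 / math.sqrt(2.0)) - math.erf(0.0))   # Phi(1) - Phi(0)
```

With 100000 samples the estimate falls within a few times 10⁻⁴ of the exact value Φ(1) − Φ(0) ≈ 0.3413, and the error shrinks as the sampling density gets closer to being proportional to h.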
As θ̂₁ is unbiased, the error of the estimation is determined by its variance, which is

Var(θ̂₁) = Var( (1/m) Σ_{i=1}^{m} h(W*_i) / f*(W*_i) ) = (1/m) Var( h(W*) / f*(W*) ).    (6.12)
In order to minimise the variance in the expression above, f* must be selected in such a way that the ratio between h and f* is as constant as possible within the interval (a, b). Actually, the minimum variance is reached when f* is proportional to h in that interval, but that is of no practical value, as we are assuming that h, which is equivalent to the joint distribution, is difficult to handle. Later on we will show in detail a way to obtain an approximation to h while keeping the complexity bounded. Let h* be such an approximation. Then it holds that
f*(w) = h*(w) / ∫_a^b h*(w) dw   if a < w < b,   and f*(w) = 0 otherwise,    (6.13)
is a probability density function within the interval (a, b). Therefore, in order to apply importance sampling to answer our target query, we have to find an approximation, h*, of h and then obtain a sampling distribution from it, according to Equation (6.13). Finally, we can obtain an estimation of P(a < W < b, E = e) using Equation (6.11).
On the other hand, φ(e) can be estimated using importance sampling as well. In principle, a new sample should be generated, since the integration range in this case is the entire domain of W, and not only the interval (a, b). To avoid generating two different samples, we can consider the following density:

f*₂(w) = h*(w) / ∫_{Ω_W} h*(w) dw,    (6.14)

which is a density on Ω_W. From this, we can generate a sample W*_1, . . . , W*_m.
Then, it holds that

δ̂ = (1/m) Σ_{i=1}^{m} h(W*_i) / f*₂(W*_i)    (6.15)

is an unbiased estimator of φ(e).
Now, if we write W*_{(1)}, . . . , W*_{(k)} for the elements of the sample W*_1, . . . , W*_m that fall inside the interval (a, b), then it can be shown that

θ̂₂ = (1/k) Σ_{i=1}^{k} h(W*_{(i)}) / f*₂(W*_{(i)})    (6.16)

is an unbiased estimator of P(a < W < b, E = e).
The next proposition establishes the impact of using the same sample on the accuracy of the estimation.
Proposition 3. Let m, k, θ̂₂ and δ̂ be as in Equations (6.15) and (6.16). Then,

Var(θ̂₂) ≤ (m/k) Var(δ̂) + 1/(2k).    (6.17)

Proof. Let the functions h and f*₂ be as in Equations (6.15) and (6.16). We define ξ, ξ₁ and ξ₂ as

ξ(w) = h(w) / f*₂(w),   w ∈ R,
ξ₁(w) = h(w) I_{(a,b)}(w) / f*₂(w),   w ∈ R,
ξ₂(w) = h(w) I_{R\(a,b)}(w) / f*₂(w),   w ∈ R,

where a, b ∈ R,

I_{(a,b)}(w) = 1 if w ∈ (a, b), and 0 otherwise,

and

I_{R\(a,b)}(w) = 0 if w ∈ (a, b), and 1 otherwise.
It is clear that ξ = ξ₁ + ξ₂ and ξ₁ × ξ₂ = 0. Then,

Var(ξ) = Var(ξ₁ + ξ₂) = Var(ξ₁) + Var(ξ₂) + 2 Cov(ξ₁, ξ₂)
       = Var(ξ₁) + Var(ξ₂) + 2 (E[ξ₁ξ₂] − E[ξ₁]E[ξ₂])
       = Var(ξ₁) + Var(ξ₂) − 2 P(a < W < b, E = e) P(W ∉ (a, b), E = e)
       = Var(ξ₁) + Var(ξ₂) − 2 P(a < W < b, E = e)(1 − P(a < W < b, E = e))
       = Var(ξ₁) + Var(ξ₂) − 2 (P(a < W < b, E = e) − P²(a < W < b, E = e)).
Hence,

Var(ξ₁) = Var(ξ) − Var(ξ₂) + 2 (P(a < W < b, E = e) − P²(a < W < b, E = e)) ≤ Var(ξ) + 1/2,

since Var(ξ₂) ≥ 0 and P(a < W < b, E = e) − P²(a < W < b, E = e) ≤ 1/4. Thus,

(1/m) Var(ξ₁) ≤ (1/m) Var(ξ) + 1/(2m)
⇒ (k/m)(1/k) Var(ξ₁) ≤ (1/m) Var(ξ) + 1/(2m)
⇒ (k/m) Var(θ̂₂) ≤ Var(δ̂) + 1/(2m)
⇒ Var(θ̂₂) ≤ (m/k) Var(δ̂) + 1/(2k). □
Proposition 3 establishes that the variance of θ̂₂ is related to the variance of δ̂ through the inverse of the proportion of elements in the sample that fall within the interval (a, b). This means that using a single sample does not increase the error of the estimation dramatically. Actually, if all the elements in the sample are inside the target interval, then the variance of both estimators is almost the same, as the term 1/(2k) tends to 0 as k increases.
If the target variable is discrete, the procedure is analogous. More precisely, if W is discrete then, from Equation (6.8), it follows that

P(W = w, E = e) = Σ_{w′ ∈ Ω_W} h(w′) I_w(w′) = Σ_{w′ ∈ Ω_W} (h(w′) I_w(w′) / p*(w′)) p*(w′) = E_{p*}[ h(W*) I_w(W*) / p*(W*) ],

where p* is any probability mass function defined on Ω_W, W* is a discrete random variable with distribution p*, and

I_w(x) = 1 if w = x, and 0 otherwise.
The rest of the procedure is identical to the continuous case.
6.3.1 Obtaining a sampling distribution
The error in the estimation procedure described above depends on the variance of the ratio h/f*. Therefore, the best behaviour would be obtained if the sampling distribution is close to h, as mentioned before. Salmerón et al. [137] developed a method for computing an accurate sampling distribution for discrete Bayesian networks. It is based on computing the sampling distribution for a given variable through a process of eliminating the other variables from the set of all the conditional distributions in the network, H = {p(x_i | x_{pa(i)}), i = 1, . . . , n}. The procedure can be adapted to the case of a hybrid Bayesian network as follows.

Let {X1, . . . , Xl} be the set of all the variables in the network, except the target W and the observations E. An elimination order σ is considered and variables are deleted according to that order: X_{σ(1)}, . . . , X_{σ(l)}.
The deletion of a variable X_{σ(i)} consists of marginalising it out from the combination of all the functions in H which are defined for that variable. More precisely, the steps are as follows:

• Let dom(f) denote the set of variables for which the function f is defined.
• Let H_{σ(i)} = {f ∈ H | X_{σ(i)} ∈ dom(f)}.
• Calculate

f_{σ(i)} = Π_{f ∈ H_{σ(i)}} f    (6.18)

and f′_{σ(i)}, defined on dom(f_{σ(i)}) \ {X_{σ(i)}}, by

f′_{σ(i)}(y) = ∫_{Ω_{X_{σ(i)}}} f_{σ(i)}(y, x_{σ(i)}) dx_{σ(i)},   ∀y ∈ Ω_{dom(f_{σ(i)}) \ {X_{σ(i)}}}.    (6.19)
108 6.3. Approximate propagation using importance sampling
• Transform H into (H \ H_{σ(i)}) ∪ {f′_{σ(i)}}.
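For discrete factors stored as dictionaries mapping assignments to values, one deletion step can be sketched as follows (a toy illustration of the update of H above, not the Elvira implementation; the dictionary layout is our own):

```python
from itertools import product

def eliminate(H, var, domains):
    """One deletion step: combine all factors mentioning `var`
    (Equation (6.18)) and marginalise it out (Equation (6.19))."""
    H_var = [f for f in H if var in f["vars"]]
    rest = [f for f in H if var not in f["vars"]]
    new_vars = sorted({v for f in H_var for v in f["vars"]} - {var})
    table = {}
    for assign in product(*(domains[v] for v in new_vars)):
        ctx = dict(zip(new_vars, assign))
        total = 0.0
        for x in domains[var]:                 # marginalise var out
            ctx[var] = x
            p = 1.0
            for f in H_var:                    # combination of the relevant factors
                p *= f["table"][tuple(ctx[v] for v in f["vars"])]
            total += p
        table[assign] = total
    return rest + [{"vars": new_vars, "table": table}]
```

Eliminating A from {P(A), P(B | A)} in this way yields the marginal P(B), as the exact procedure prescribes.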
Note that the integral in Equation (6.19) would be a summation if X_{σ(i)} were discrete. After deleting all the variables X_{σ(1)}, . . . , X_{σ(l)} from the set of distributions H = {p(x_i | x_{pa(i)}), i = 1, . . . , n}, the remaining functions will depend only on W. If all the computations are exact, it was proved in [72] that the remaining function is actually the optimal sampling distribution.
However, the results of the products (see Equation (6.18)) computed in the
process of obtaining the sampling distribution may require a large amount of
space to store, and therefore the algorithm in [137] approximates the result of
the combinations by pruning the probability trees (in our case, mixed trees) used
to represent the potentials. The price to pay is that the sampling distribution is
no longer optimal, and the accuracy of the estimations will depend on the quality
of the approximations. Here we propose a strategy for approximating the MTE
potentials resulting from the products in Equation (6.18). We will explain the
idea by considering an MTE potential defined for a set of continuous variables
Z = (Z1, . . . , Zt)^T as
    φ(z) = a_0 + ∑_{i=1}^{t} a_i e^{b_i^T z}.
The goal is to detect those exponential terms in φ(z) that are almost constant
and remove them. The rationale behind this strategy is that, from the point of
view of simulation, a flat or nearly constant term adds no useful information to
the density, as there is already a constant term, namely a_0.
Thus, we consider a threshold α ∈ (0, 1) and then, for each term g_j(z) =
a_j e^{b_j^T z}, j = 1, . . . , t, in the mixture, if the following condition is satisfied,

    min(g_j(z)) / max(g_j(z)) > α,

then g_j(z) is replaced by the constant

    k_j = ( min(g_j(z)) + max(g_j(z)) ) / 2.                          (6.20)

The closer α is to 1, the more accurate the approximation.
Note that the previous statements hold because the exponential function is
strictly increasing or decreasing on its whole domain, and therefore its maximum
and minimum are always located at the borders of the domain. In this way, the
shape of the function can be controlled.
Summing up, if the j-th term of the mixture is replaced by the constant k_j, the
resulting potential is

    φ(z) = k + k_j + ∑_{i ∈ N, i ≠ j} a_i e^{b_i^T z},

where N = {1, . . . , t}. In fact, MTE potentials are defined on hypercubes.
Therefore, rather than approximating a single potential, after each product the
whole mixed tree representing the resulting potential should be approximated
following this strategy. The detailed procedure can be found in Algorithm 15.
Algorithm 15: PruneMTEPotential (T, α)

Input: A mixed tree T and a threshold α for pruning terms.
Output: Tree T with terms pruned according to α.
1   Let Z be the set of continuous variables of tree T.
2   foreach leaf in T do
3       Let φ(z) = k + ∑_{i=1}^{t} a_i e^{b_i^T z} be the MTE stored in the current leaf.
4       for j := 1 to t do
5           Let a_j e^{b_j^T z} be the j-th term of φ(z).
6           if min(a_j e^{b_j^T z}) / max(a_j e^{b_j^T z}) > α then
7               k_j := ( min(a_j e^{b_j^T z}) + max(a_j e^{b_j^T z}) ) / 2.
8               Remove a_j e^{b_j^T z} from φ(z).
9               Update the independent term k of φ(z) to k + k_j.
10  return T.
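The inner loop of Algorithm 15 can be sketched for a single continuous variable z on an interval [lo, hi]. This is a hedged illustration assuming positive coefficients a_j (as in the density of Section 6.5.3); since each term a_j e^{b_j z} is monotone, its minimum and maximum are attained at the endpoints.

```python
import math

def prune_terms(k, terms, lo, hi, alpha):
    """terms: list of (a_j, b_j) describing a_j * exp(b_j * z) on [lo, hi].
    Returns the updated independent term and the surviving terms."""
    kept = []
    for a, b in terms:
        v0, v1 = a * math.exp(b * lo), a * math.exp(b * hi)
        m, M = min(v0, v1), max(v0, v1)   # endpoint values, exp is monotone
        if m / M > alpha:                 # the term is almost constant
            k += (m + M) / 2.0            # fold its midpoint into k (6.20)
        else:
            kept.append((a, b))
    return k, kept

# The slowly varying first term is folded into the constant; the second stays.
k, kept = prune_terms(0.5, [(1.0, 0.001), (2.0, 3.0)], 0.0, 1.0, 0.95)
print(kept)    # [(2.0, 3.0)]
```

In the full algorithm this check is applied to every leaf of the mixed tree, with the extrema evaluated at the corners of the leaf's hypercube.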
6.3.2 Computing multiple probabilities simultaneously
The procedure described so far is designed to calculate probabilities concerning a
single variable at a time. In this section we show that it can be extended to
allow the calculation of multiple probabilities about different variables at the
same time. The idea is based on the elimination procedure described in
Section 6.3.1.
It is possible to carry out the simulation in the order opposite to the one in which
the variables are deleted. To obtain a value for Xσ(i), the function fσ(i) obtained in
the deletion of this variable is used. This function is defined for the values of
variable Xσ(i) and of other variables already sampled. Function fσ(i) is restricted to
the already obtained values of the variables in dom(fσ(i)) \ {Xσ(i)}, giving rise to a
density function which depends only on Xσ(i). Finally, a value for this variable is
drawn from this density. If all the computations are exact, it was proved in [72]
that the simulation is actually carried out using the optimal density, and we
obtain a sample from the joint distribution of Xσ(1), . . . , Xσ(l).
The details of this procedure are given in Algorithm 16, which computes a
sampling distribution for each unobserved variable in a hybrid Bayesian network.
Later on we will study how to determine the order of the variables in Step 4.
Now let us denote by W1, . . . ,Wn the unobserved variables in the network,
and by E1, . . . , Ek the observed ones. Note that after applying Algorithm 16, if
we set α = 1 in Step 7, then it holds that the true joint probability function is

    f(w_1, . . . , w_n, e_1, . . . , e_k) = ∏_{i=1}^{l} f*_{X_i}.          (6.21)

That is, if we simulate each variable X_i using f*_{X_i}, we would actually obtain
a sample of random vectors (w_1, . . . , w_n, e_1, . . . , e_k) from the true distribution.
Our goal in this section is to calculate a set of probabilities about the unob-
served variables, expressed as P(Wi = wi, E = e) or P(ai < Wi < bi, E = e),
i = 1, . . . , n, depending on whether Wi is discrete or continuous. It can be shown
that we can use the joint sample to estimate the different probabilities separately,
since each individual sample is itself a sufficient statistic for the probability of a
given variable.
Algorithm 16: SamplingDistributions (B, e)

Input: A hybrid BN, B, and an observation e.
Output: A sampling distribution for each variable in the network.
1   Let H := {ψ_{X1}, . . . , ψ_{Xl}} be all the potentials in B restricted to the
    evidence e, represented as mixed trees.
2   S := ∅.
3   for i := 1 to l do
4       Select the next variable to remove, Xi.
5       H_{Xi} := {ψ ∈ H | Xi ∈ dom(ψ)}.
6       f_{Xi} := ∏_{ψ ∈ H_{Xi}} ψ.
7       f*_{Xi} := PruneMTEPotential(f_{Xi}, α).
8       S := S ∪ {f*_{Xi}}.
9       H := H \ H_{Xi}.
10      if Xi is continuous then
11          H := H ∪ {∫_{Xi} f*_{Xi} dxi}.
12      else
13          H := H ∪ {∑_{Xi} f*_{Xi}}.
14  return S.
Let (W_1^{(j)}, . . . , W_n^{(j)}), j = 1, . . . , m, be a sample of size m drawn from the sam-
pling distributions in the set S returned by Algorithm 16. Then

    δ̂ = (1/m) ∑_{j=1}^{m} ψ(W_1^{(j)}, . . . , W_n^{(j)}) / ∏_{i=1}^{n} f*_{W_i}(W_i^{(j)})     (6.22)

is an unbiased estimator of φ(e).
Let (W_1^{(j)*}, . . . , W_n^{(j)*}), j = 1, . . . , r, be the elements from the sample above that
fall into the interval (a_i, b_i) (or for which W_i^{(j)} = w_i in the discrete case), i = 1, . . . , n.
Then

    θ̂_{X_i} = (1/r) ∑_{j=1}^{r} ψ(W_1^{(j)*}, . . . , W_n^{(j)*}) / ∏_{i=1}^{n} f*_{W_i}(W_i^{(j)*})     (6.23)

is an unbiased estimator of P(a_i < W_i < b_i, E = e), i = 1, . . . , n. A similar
result can be derived immediately in the case that W_i is discrete, where the
quantity to estimate is P(W_i = w_i, E = e).
In Equations (6.22) and (6.23), function ψ in the numerator is defined in a
similar way as in Equation (6.3), i.e. the product of conditionals, restricted to
the observations.
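The interplay between the two estimators can be illustrated with a hedged toy sketch in the spirit of Equations (6.22) and (6.23): we estimate P(a < X < b) for the unnormalised target ψ(x) = exp(−x²/2) (a standard normal up to the constant √(2π)) by sampling from a wider normal f*. The self-normalised ratio of accumulated "hit" weights to all weights plays the role of the quotient of the two estimators; none of this is the thesis code.

```python
import math
import random

def f_star_pdf(x, s=1.5):
    """Density of the sampling distribution N(0, s^2)."""
    return math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def psi(x):
    """Unnormalised target density (a standard normal without its constant)."""
    return math.exp(-x * x / 2)

random.seed(0)
a, b, m = -1.0, 1.0, 200_000
num = den = 0.0
for _ in range(m):
    x = random.gauss(0.0, 1.5)           # draw from the sampling distribution
    w = psi(x) / f_star_pdf(x)           # importance weight, as in (6.22)
    den += w
    if a < x < b:                        # the sample "hits" the query interval
        num += w
print(num / den)                         # ≈ 0.683 = P(-1 < X < 1) under N(0,1)
```

Dividing the two accumulated sums by the same sample, as Algorithm 17 does below, is what keeps the variance of the conditional estimate under control.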
6.4 The algorithm
In this section we give the details of the algorithm that implements our proposal
for computing multiple probabilities in hybrid Bayesian networks with MTEs
using importance sampling. First of all, it should be emphasised that Algorithm 16
makes a decision about which variable to remove in each iteration (see Step 4).
That decision influences the complexity of the product in Step 6, since it
determines the set of potentials that will be multiplied. We propose to use a
one-step look-ahead heuristic that selects the variable resulting in the potential
of lowest size after the product in Step 6. The concept of size is given in the
next definition.
Definition 8 (Size of an MTE potential). The size of an MTE potential is defined
as its number of exponential terms, including the independent term.
Example 6. The potential represented in Figure 2.5 has size equal to 16, since
it has 8 leaves, and in each one there is an independent term and one exponential
term, that is, 8× (1 + 1) = 16.
Though it is not possible to know beforehand the exact size of a potential
resulting from a product, an upper bound is given in the next proposition. This
is the bound actually used for deciding the elimination order in Algorithm 16.
Proposition 4 (proposed in [134]). Let T1, . . . ,Th be mixed probability trees,
Yi and Zi the discrete and continuous variables of each one of them, and ni the
highest number of intervals into which the domain of the continuous variables of
Ti is split. Let Ω_{Yi} be the set of possible values of the discrete variable Yi. The
size of the tree T = T1 × T2 × · · · × Th is lower than

    ( ∏_{Yi ∈ ⋃_{i=1}^{h} Yi} |Ω_{Yi}| ) × ( ∏_{j=1}^{h} n_j^{k_j} ) × ( ∏_{j=1}^{h} t_j ),

where tj is the maximum number of exponential terms in each leaf of Tj, and kj
is the number of continuous variables in Tj.
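The bound of Proposition 4 is straightforward to compute. The following hedged sketch summarises each tree by the quantities appearing in the proposition; the representation is illustrative, not the thesis software.

```python
from functools import reduce
from operator import mul

# Each mixed tree is summarised as (discrete_domains, k, n, t): a dict mapping
# each discrete variable to its number of states, the number k of continuous
# variables, the maximum number n of intervals per continuous variable, and
# the maximum number t of exponential terms per leaf.

def size_bound(trees):
    discrete = {}
    for doms, k, n, t in trees:
        discrete.update(doms)                      # union of discrete variables
    disc_part = reduce(mul, discrete.values(), 1)  # product of |Omega_Yi|
    split_part = reduce(mul, (n ** k for doms, k, n, t in trees), 1)
    term_part = reduce(mul, (t for doms, k, n, t in trees), 1)
    return disc_part * split_part * term_part

# Tree 1: discrete Y with 3 states, 2 continuous variables, 2 intervals, 2 terms.
# Tree 2: no discrete variables, 1 continuous variable, 3 intervals, 2 terms.
print(size_bound([({'Y': 3}, 2, 2, 2), ({}, 1, 3, 2)]))   # → 144
```

The heuristic in Step 4 of Algorithm 16 evaluates this bound for every candidate variable and removes the one whose product yields the smallest value.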
At this point, we have all the tools necessary to establish our proposal for
computing multiple probabilities, which is described in Algorithm 17.
6.5 Experimental evaluation
Three experiments were carried out to analyse the performance of the pro-
posed methodology, using five hybrid Bayesian networks. The first three, denoted
artificial1, artificial2 and artificial3, are artificial networks with
41, 77 and 97 variables, respectively, whose structure and parameters were gen-
erated at random, in the same way as the networks used in [134].
The two remaining networks were created by taking the structure of
the alarm [8] and barley [82] networks, which are originally fully discrete, and
making some assumptions about the type of the variables. Out of the 37 and
48 variables in these networks, respectively, 10 were considered discrete
with two states, and the rest were considered continuous with support in the
interval [0, 1]. The domain of each continuous variable was split into two pieces.
The MTE densities associated with each split were defined using 2 exponential
terms, with parameters generated at random as in [134].
For each network, 20% of the variables were observed at random, the remaining
80% being considered as goal variables. For each network, we considered 10 differ-
ent observations. The probabilities to query were also selected at random, with
uniform probability for each value of the discrete variables, and considering an
interval of width equal to 10% of its support for each continuous target variable.
6.5.1 Experiment 1
In this experiment we compared the performance of the Importance Sampling
(IS) algorithm versus the other two approximate propagation methods existing
in the literature for MTE networks: Markov Chain Monte Carlo (MCMC) and
Penniless Propagation (PP) [134].
Algorithm 17: ApproximateProbabilityPropagation (B, e, P)

Input: A hybrid Bayesian network B with variables X. An observation e
about a set of variables E. A list of probabilities P to be calculated, of the
form P(a_i < W_i < b_i | e) if W_i is continuous and P(W_i = w_i | e) otherwise.
Output: Estimations of P(a_i < W_i < b_i | e) or P(W_i = w_i | e).
1   Let W_1, . . . , W_n be the variables in X \ E.
2   S := SamplingDistributions(B, e).
3   Initialise r_i := 0 and P_i := 0, i = 1, . . . , n, and φ(e) := 0.
4   for j := 1 to m do
5       Generate a sample w*_1, . . . , w*_n for variables W_1^{(j)}, . . . , W_n^{(j)} by
        simulating in reverse order to the one used in Algorithm 16, using the
        sampling distributions in S (a procedure for sampling from an MTE
        density is given in [134]).
6       for i := 1 to n do
7           if W_i is continuous then
8               if w*_i ∈ (a_i, b_i) then
9                   P_i := P_i + ψ(w*_1, . . . , w*_n) / ∏_{k=1}^{n} f*_{W_k}(w*_k).
10                  r_i := r_i + 1.
11          else
12              if w*_i = w_i then
13                  P_i := P_i + ψ(w*_1, . . . , w*_n) / ∏_{k=1}^{n} f*_{W_k}(w*_k).
14                  r_i := r_i + 1.
15      φ(e) := φ(e) + ψ(w*_1, . . . , w*_n) / ∏_{k=1}^{n} f*_{W_k}(w*_k).
16  φ(e) := φ(e) / m.
17  P_i := P_i / (r_i × φ(e)), i = 1, . . . , n.
18  return P_1, . . . , P_n.
For each set of observations, the execution time and the error in the estima-
tions were computed. The error was calculated using the χ² divergence, which is
defined as

    χ² = (1/n) ∑_{i=1}^{n} (p̂_i − p_i)² / p_i,

where p_i, i = 1, . . . , n, are the true probabilities for each query, and p̂_i,
i = 1, . . . , n, are their estimations.
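This error measure can be sketched as a small helper, where `true_p` holds the exact query probabilities and `est_p` their estimations (assumed strictly positive true values):

```python
def chi2_divergence(true_p, est_p):
    """chi^2 = (1/n) * sum((est - true)^2 / true), as defined above."""
    n = len(true_p)
    return sum((q - p) ** 2 / p for p, q in zip(true_p, est_p)) / n

print(chi2_divergence([0.5, 0.25, 0.25], [0.4, 0.3, 0.3]))   # ≈ 0.01333
```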
Figures 6.1 and 6.2 show the results of the experiment for the five networks.
The three box plots correspond to the χ2 error, execution time and the rate
error × time obtained for a set of 10 observations. Notice that the outliers have
not been represented in these charts. Each execution of the simulation algorithms
(IS and MCMC) was repeated 10 times, using in both cases a sample of size 500.
The results shown correspond to the average over the ten executions.
In order to simplify the potentials during the propagation, we set a
threshold α = 0.95 for the mixed trees in the IS algorithm (see Section 6.3.1), and
for the PP algorithm we chose the following parameters, taken from [134]: εJoin =
0.05, εDisc = 0.05. We refer the reader to the original reference for a detailed
explanation of the meaning of these parameters. Finally, we limited the maximum
number of exponential terms in the PP algorithm to 2.
The experimental results show that the IS algorithm clearly outperforms
the other two in terms of accuracy, speed and rate error × time for network
artificial3. For networks alarm and barley, the error is again lower for IS,
but in exchange its running time is the worst. This is due to the higher com-
plexity of the potentials involved in these networks, which makes the algorithm
spend considerable time obtaining the sampling distributions. However, the time
invested pays off, as can be seen in the plots corresponding to the
rate error × time, which is better for IS. Therefore, we conclude that this exper-
iment suggests that IS offers the best way of dealing with the tradeoff between
complexity and accuracy when computing multiple probabilities.
Figure 6.1: Box plots of the χ² error, execution time and the rate error × time for the probabilities in networks alarm and barley.
Figure 6.2: Box plots of the χ² error, execution time and the rate error × time for the probabilities in networks artificial1, artificial2 and artificial3.
6.5.2 Experiment 2
The second experiment is devoted to analysing the impact of the sample size and
the execution time on the behaviour of the simulation algorithms, that is, IS
and MCMC. Figures 6.3 and 6.4 show the χ² divergence as a function of sample
size and time for the five networks considered. It can be seen that IS converges
more quickly than MCMC, and also converges to a more accurate solution.
Figure 6.3: χ² error for methods IS and MCMC as a function of the sample size and execution time. Results for the networks alarm and barley.
Figure 6.4: χ² error for methods IS and MCMC as a function of the sample size and execution time. Results for networks artificial1, artificial2 and artificial3.
6.5.3 Experiment 3
Finally, we conducted an experiment aimed at testing the impact of the
pruning method proposed in Section 6.3.1. More precisely, we performed two
tests. In the first one, we ran the algorithm with different α thresholds and
measured the χ² error of the predictions. As in previous experiments, for each
of the 10 observations the algorithm was run 10 times. The results displayed
in Figure 6.5 show the average of the errors obtained. As expected, the error
decreases as we increase the threshold, which means that the pruning criterion
becomes stricter.

In the second test, we plotted an MTE density for different α thresholds.
The density originally had 10 terms (all of them positive). The idea is to see the
impact of removing terms on the shape of the density function. The results are
displayed in Figure 6.6 and show how the shape of the density becomes smoother
as exponential terms are removed.
6.6 Conclusions
We have introduced a method for computing multiple posterior probabilities in
hybrid Bayesian networks with MTEs. The method is based on importance sam-
pling, which makes it an anytime algorithm. The algorithm is able to compute all
the required probabilities using a single sample. We have shown that the variance
remains bounded if the same sample is also used to compute the numerator and
denominator in each conditional probability.

The experiments conducted illustrate the behaviour of the proposed algo-
rithm, and they support the idea that the IS algorithm outperforms the two algo-
rithms previously used for carrying out probabilistic reasoning in hybrid Bayesian
networks with MTEs. Therefore, the methodology introduced in this chapter ex-
pands the class of problems that can be handled using hybrid Bayesian networks
and, more precisely, provides versatility to the MTE model by increasing the
efficiency in solving probabilistic inference tasks.
We expect to continue this research line by developing methods for carrying
out more complex reasoning tasks, for instance finding the most probable
Figure 6.5: χ² error for different levels of pruning. The higher the α threshold, the less pruning is actually carried out. Results for networks alarm, barley, artificial1, artificial2 and artificial3.
Figure 6.6: Several approximations to the same MTE density using different levels for pruning exponential terms.
explanation to an observed fact in terms of a set of target variables, which is called
abductive inference [60].
Part III
Applications
Chapter 7
Species distribution modelling
Abstract

Bayesian networks have been widely used to solve problems in environmental sci-
ences [1] by discretising the continuous domains in order to apply the techniques
developed so far for learning and inference. However, there are few studies dealing
directly with continuous data, even in other areas. In this chapter the naïve Bayes
(NB) and tree augmented naïve Bayes (TAN) classification models based on MTEs
are applied. The aim is to characterise the habitat of the spur-thighed tortoise
(Testudo graeca graeca), using several continuous environmental variables and one
discrete (binary) variable representing the presence or absence of the tortoise. These
models are compared with the fully discrete models, and the results show a better
classification rate for the continuous ones. Therefore, the application of continuous
models instead of discrete ones avoids the loss of statistical information due to the
discretisation. Moreover, the results of the continuous TAN model show a more
spatially accurate distribution of the tortoise. The species is located in the Doñana
Natural Park and in semiarid habitats. The proposed continuous models based on
MTEs are valid for the study of species predictive distribution modelling.
7.1 Introduction
Over the last decade, advances in species predictive distribution modelling have
been paralleled by the evolution and development of geographical informa-
tion systems (GIS), remote sensing, statistical modelling and database manage-
ment [71, 7, 92, 139]. Statistical models relate observations of species, communi-
ties or diversity [101, 66, 15, 70, 157] to environmental predictors, and project the
fitted relationships into geographical space to produce distribution maps [100].
The modelling of species distribution is a useful tool [70] that is widely used
in spatial ecology, biogeography and conservation biology. The models have
contributed significantly to testing biogeographical, ecological and evolutionary
hypotheses [4, 67], to assessing species invasion and proliferation [120], to
modelling rare species distributions [68], to supporting conservation planning and
reserve selection [56, 5], and to studying the impacts of global change [121, 102, 152, 6].
Many statistical techniques have been applied to modelling [71, 157, 16, 118, 46],
including classical statistical models such as generalised linear regressions [69],
generalised additive models [98], generalised regression analysis and spatial pre-
diction (GRASP) [93] or logistic regressions [101]. Recently, machine learning
methods such as classification and regression trees [103, 45] and neural net-
works [105, 40] have also been applied.
Bayesian networks [77] have been applied to solving environmental prob-
lems [1] such as eutrophication in an estuary [13], credal classification in agri-
culture [160], management of endangered species [12, 123], water resources plan-
ning [14] and conservation of dunnarts [147]. Several advantages are gained from
this methodology [154]: suitability for incomplete data sets, the possibility of struc-
tural learning, the combination of different sources of knowledge, explicit treatment
of uncertainty and support for decision analysis, and fast response. However,
most environmental variables are continuous, whilst Bayesian networks usually
build the model over discrete domains, so that continuous variables need to be
discretised first [154]. Discretisation captures only rough characteristics
of the original distribution [59] and entails a loss of statistical information. Thus,
there is a need to develop Bayesian networks that can work with continuous values.
The problem of using continuous and discrete variables simultaneously in-
volves more complex mathematical models. As we discussed in Section 2.2, there
are several techniques in the literature to cope with hybrid variables. In this work
we will concentrate on the use of MTEs.
As described in Chapter 3, Bayesian networks can be used to solve classifi-
cation problems. The most frequently used structures for this purpose are the
naïve Bayes (NB) and the tree augmented naïve Bayes (TAN) [58]. These models
have usually been applied only to discrete variables. Bayesian classifiers bring
significant benefits over traditional statistical techniques: mainly, accurate in-
formation about a target variable can be obtained without requiring complete
observation of all the remaining variables.
The aim of this chapter is to develop NB and TAN classifier structures based
on the MTE model that allow the simultaneous use of continuous and discrete
variables in the same network, without any pre-processing of either the variables
or the data. Continuous environmental variables and presence/absence records of
the spur-thighed tortoise (Testudo graeca graeca) were used to develop the models.
The results are compared with the discrete NB and TAN structures proposed in [3]
and used to characterise the habitat of the tortoise.
7.2 Methodology
7.2.1 Variables and data set description
The study area selected (Figure 7.1) is located in the region of Andalusia (south-
ern Spain). A set of thematic maps of vegetation and land use, lithology and soils
was selected and incorporated into an automatic spatial representation system,
ArcGis 9.2. A 10x10 kilometre grid was superimposed over the thematic maps
to calculate the percentage cover of each variable in each cell. Mean, maximum
and minimum height, slope, temperature and rainfall were also considered. In
this way, a matrix of 988 observations and 176 environmental variables was ob-
tained. The data relating to the presence/absence of the spur-thighed tortoise
(Testudo graeca graeca) in each cell were derived from the Atlas of Amphibians
and Reptiles of Spain [122]. This tortoise is an endangered species [74].
7.2.2 Selection of variables
Since the number of variables described in Section 7.2.1 is excessive, a selection of
the most representative ones is needed [3]. This selection can be done in different
ways within the framework of Bayesian networks [9, 10, 73, 104]. In this case, filter
measures, based on information functions applied to discrete variables (qualitative
or quantitative), were used. The selected measure was the Kullback-Leibler
divergence [84, 83].

Figure 7.1: Location of the study area. A 10x10 kilometre grid was superimposed to calculate the values of the variables.
Since this method is only defined for discrete variables, the continuous variables
were discretised using the k-means clustering algorithm. Three groups, represent-
ing low, medium and high values for each variable, were considered. This process
was carried out using the Elvira GUI software [34].
The discretisation of the continuous variables was taken into account only to
select the final set of variables in our study. Once obtained, they were treated as
continuous variables in order to implement the NB and TAN models.
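The discretisation step can be sketched as one-dimensional k-means with k = 3, mapping a continuous variable into "low", "medium" and "high" groups. This hedged helper is illustrative only; it is not the Elvira implementation used in the study.

```python
def kmeans_1d(values, k=3, iters=100):
    """Lloyd's algorithm on a single continuous variable."""
    svals = sorted(values)
    # spread the initial centres over the sorted sample (quantile-style init)
    centres = [svals[(2 * i + 1) * len(svals) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            groups[nearest].append(v)
        new = [sum(g) / len(g) if g else centres[i] for i, g in enumerate(groups)]
        if new == centres:               # converged
            break
        centres = new
    return centres

values = [0.1, 0.15, 0.2, 5.0, 5.2, 9.8, 10.0, 10.3]
print(sorted(kmeans_1d(values)))         # ≈ [0.15, 5.1, 10.03]
```

Each observation is then labelled low, medium or high according to its nearest centre, which is all the Kullback-Leibler filter stage needs.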
Once the discretisation and the Kullback-Leibler filter measure were applied, 10
variables were selected after consulting an expert. In decreasing order they
are: areas with low vegetation (vegetation cover < 20%), sparse shrubland with
pasture, aridisols soil type, mean rainfall, mean temperature, irrigated woody
crops, marsh with vegetation, sand and dunes, dry woody crops and dry herba-
ceous crops.
7.2.3 Bayesian classifiers and calibration of models
Although the ideal would be to build a network without restrictions on the struc-
ture, this is not possible due to the limited data available. Therefore, networks
with fixed and simple structures and specifically designed for classification have
been used. The extreme case is the NB model [44]. In Algorithm 18 the steps
for constructing a NB classifier with continuous features are shown. In essence,
they consist of building a Bayesian network with a NB structure and estimating
the marginal distribution for the class variable and conditional MTE densities for
the features [107].
Algorithm 18: Naïve Bayes classifier with continuous features

Input: A database D with variables X1, . . . , Xn, Y.
Output: A NB model with root variable Y and features X1, . . . , Xn, with
joint distribution of class MTE.
1   Construct a new network G with nodes Y, X1, . . . , Xn.
2   Insert the links Y → Xi, i = 1, . . . , n, in G.
3   Estimate a discrete distribution for Y, and a conditional MTE density for
    each Xi, i = 1, . . . , n, given its parents in G [135, 131, 128].
4   Let P be the set of estimated distributions.
5   Let NB be a Bayesian network with structure G and distributions P.
6   return NB.
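The idea of Algorithm 18 can be sketched with univariate Gaussians standing in for the conditional MTE densities of Step 3 (a deliberate simplification: fitting MTEs requires the estimation procedures cited above). Data rows are tuples (x1, ..., xn, y); all names are illustrative.

```python
import math
from collections import defaultdict

def fit_nb(rows, n_features):
    """Estimate a class prior and per-class (mean, variance) per feature."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[-1]].append(row[:-1])
    prior, params = {}, {}
    for y, xs in by_class.items():
        prior[y] = len(xs) / len(rows)
        params[y] = []
        for i in range(n_features):
            col = [x[i] for x in xs]
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-6
            params[y].append((mu, var))
    return prior, params

def predict(model, x):
    """Pick the class maximising log prior + sum of log Gaussian densities."""
    prior, params = model
    def log_post(y):
        lp = math.log(prior[y])
        for xi, (mu, var) in zip(x, params[y]):
            lp += -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        return lp
    return max(prior, key=log_post)

rows = [(0.1, 2.0, 'absent'), (0.2, 1.8, 'absent'),
        (0.9, 0.3, 'present'), (1.1, 0.2, 'present')]
model = fit_nb(rows, 2)
print(predict(model, (1.0, 0.25)))   # → 'present'
```

The NB factorisation is what makes prediction possible even when some features are unobserved: the corresponding factors are simply dropped from the sum.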
The steps to build the TAN classifier with continuous features are shown in
Algorithm 19.
The Elvira API [34] was used both for learning the models and for their
validation. It is remarkable that this is the only software in the literature
dealing with hybrid Bayesian networks using MTEs.
7.2.4 Inference in Bayesian classifiers
The goal in this section is to determine all the probabilistic information of the
model, both a priori and a posteriori. Thus, the model can be used to give a true
reflection of the initial data set (a priori model) or to predict the impact (in terms
of probability) of introducing evidence for certain variables (a posteriori model).
For example, if the evidence (the observation that the tortoise is present),
Algorithm 19: TAN classifier with continuous features

Input: A database D with variables X1, . . . , Xn, Y.
Output: A TAN model with root variable Y and features X1, . . . , Xn, with
joint distribution of class MTE.
1   Construct a complete graph C with nodes X1, . . . , Xn.
2   Label each link (Xi, Xj) with the conditional mutual information between
    Xi and Xj given Y [51], i.e.,

    I(Xi, Xj | Y) = ∫_{Ω_Xi} ∫_{Ω_Xj} ∑_{y ∈ Ω_Y} f(xi, xj, y) log [ f(xi, xj | y) / ( f(xi | y) f(xj | y) ) ] dxi dxj.

3   Let T be the maximum spanning tree obtained from C using Algorithm 4
    in Chapter 3.
4   Direct the links in T in such a way that no node has more than one parent.
5   Construct a new network G with nodes Y, X1, . . . , Xn and the same links as T.
6   Insert the links Y → Xi, i = 1, . . . , n, in G.
7   Estimate a discrete distribution for Y, and a conditional MTE density for
    each Xi, i = 1, . . . , n, given its parents in G [135, 131, 128].
8   Let P be the set of estimated distributions.
9   Let TAN be a Bayesian network with structure G and distributions P.
10  return TAN.
P (tortoise = presence) = 1, is set in the class variable, the density functions of
the remaining environmental variables will be modified. In this way, an approxi-
mation to the most probable configuration for the presence of the tortoise can be
obtained.
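The structural part of Algorithm 19 (Steps 1 to 4) can be sketched as follows. This hedged illustration assumes the conditional mutual informations I(Xi, Xj | Y) have already been computed, and uses Kruskal's algorithm as the spanning-tree routine (the thesis relies on Algorithm 4 of Chapter 3 instead).

```python
def tan_structure(n, cmi):
    """cmi[(i, j)] = I(Xi, Xj | Y) for i < j. Returns directed feature links."""
    edges = sorted(cmi.items(), key=lambda kv: -kv[1])   # heaviest edges first
    parent = list(range(n))                              # union-find forest
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    tree = []
    for (i, j), _ in edges:                              # Kruskal on -weights
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    # Direct the links away from feature 0 (the root choice is arbitrary),
    # so that no feature ends up with more than one feature parent.
    adj = {i: [] for i in range(n)}
    for i, j in tree:
        adj[i].append(j)
        adj[j].append(i)
    directed, stack, seen = [], [0], {0}
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                directed.append((u, v))                  # link u -> v
                stack.append(v)
    return directed

cmi = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.5}
print(tan_structure(3, cmi))    # [(0, 1), (1, 2)]
```

Adding the class links Y → Xi on top of this tree (Steps 5-6) completes the TAN skeleton, after which the conditional densities are estimated as in Algorithm 18.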
7.2.5 Validation of the models
The models were tested using k-fold cross validation [150]. This technique is
applied to the initial data set and used to evaluate the quality of a classification
model.
A lazy choice would be holdout validation (k = 1). It is not considered
cross validation as such, since the data never cross. The initial data set is
randomly divided into two subsets: the first one (Dl) is devoted to the training
phase of the model and the second one (Dt) to validating it. Usually, less than
a third of the initial data set is used for Dt.
For a k-value greater than 1, the data set is split into k subsets. In each step,
one subset is assigned to Dt and the remaining k − 1 to Dl. Cross validation is
repeated k times, each time taking a different subset for Dt. This is the approach
followed to test the classifiers presented in this chapter.
A particular and extreme case of k-fold cross validation arises when k coincides
with the number of cases n in the data set. Each model is trained with n−1 cases
and tested with the remaining unused case. This validation method is called
leave-one-out [79]. It returns more accurate results, but its disadvantage is the
need to train as many models as there are cases in the initial data set, and
therefore it is inefficient from a computational point of view.
The output model is constructed by including the entire database in Dl.
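The validation scheme described above can be sketched generically; `train` and `accuracy` are placeholders for any classifier and scoring rule, and the toy "majority class" model below is purely illustrative.

```python
def k_fold_cv(data, k, train, accuracy):
    """Split into k folds, train on k-1, test on the held-out one, average."""
    folds = [data[i::k] for i in range(k)]          # k roughly equal subsets
    scores = []
    for i in range(k):
        D_t = folds[i]                              # test fold
        D_l = [row for j, f in enumerate(folds) if j != i for row in f]
        model = train(D_l)
        scores.append(accuracy(model, D_t))
    return sum(scores) / k

# Toy usage: predict the majority class of the training fold.
data = [(x, 1 if x < 8 else 0) for x in range(20)]
train = lambda rows: max(set(y for _, y in rows),
                         key=[y for _, y in rows].count)
accuracy = lambda m, rows: sum(1 for _, y in rows if y == m) / len(rows)
print(k_fold_cv(data, 5, train, accuracy))          # → 0.6
```

In practice the folds should be shuffled (or stratified by class) before splitting; the slicing above is kept deterministic for clarity.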
7.3 Results and discussion
7.3.1 NB model
The resulting NB model is shown in Figure 7.2. The introduction of the evidence
"presence of tortoise" changes the probability distribution of the features due to
the d-separation criterion (see Definition 1 in Chapter 2).
Figure 7.2: Modelling using a NB structure. MV: marsh with vegetation; DHC: dry herbaceous crops; IWC: irrigated woody crops; SSP: sparse shrubland with pasture; SD: sand and dunes; DC: dry woody crops; AWV: areas with low vegetation; MR: mean rainfall; AS: aridisols; MT: mean temperature.
Variable                         a priori   a posteriori   % change
Marsh with vegetation               3.70         15.90        329
Dry herbaceous crops               22.54         10.06        -56
Irrigated woody crops               6.83          3.39        -50
Sparse shrubland with pasture      17.56         40.00        127
Sand and dunes                      2.47         15.11        510
Dry woody crops                    38.02         17.18        -55
Areas with low vegetation          16.33         42.03        157
Aridisols                           6.85         23.11        237
Mean rainfall                     608.39        586.71      -3.43
Mean temperature                   16.10         16.92          5

Table 7.1: Expected values of the probability distributions a priori and a posteriori for the NB model. The values represent percentage cover except for mean rainfall (mm) and mean temperature (°C).
Figures 7.3 and 7.4 show the density function for each variable, both without
evidence of the tortoise being present (a priori) and with evidence (a posteriori).
Table 7.1 shows the expected values of the marginal density function for each
environmental variable both a priori and a posteriori.
In order to compare the mean results of the variables a priori and a posteriori,
a threshold value was calculated above or below which (depending on whether the
mean decreases or increases) the differences between the a priori density function
and the a posteriori density function are maximised.
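This threshold search is easy to make concrete. In the sketch below, exponential tails stand in for the MTE marginals (an assumption of ours, so the resulting threshold is purely illustrative and does not reproduce the thresholds reported in this section); the grid search itself is generic:

```python
import math

def best_threshold(tail_prior, tail_post, grid):
    """Grid-search the threshold t that maximises the absolute difference
    between the prior and posterior exceedance probabilities P(X > t)."""
    gaps = {t: abs(tail_prior(t) - tail_post(t)) for t in grid}
    t_star = max(gaps, key=gaps.get)
    return t_star, gaps[t_star]

# Exponential stand-ins matched to the prior/posterior means of
# "marsh with vegetation" in Table 7.1 (3.70% vs. 15.90% cover):
def prior_tail(t):
    return math.exp(-t / 3.70)   # P(X > t) under the prior

def post_tail(t):
    return math.exp(-t / 15.90)  # P(X > t) under the posterior

t_star, gap = best_threshold(prior_tail, post_tail, [i / 10 for i in range(1, 400)])
```

For two exponentials the optimum also has a closed form, t* = ln(a/b)/(a − b) for rates a and b, which the grid search approximates.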
The mean value of variable marsh with vegetation is increased by 329%, from
a cover of 3.70% to 15.90%. The function shows that the prior probability of
its value exceeding the 2% threshold is 0.20, whereas the posterior probability
becomes 0.84.
The variable dry herbaceous crops decreases its mean value by 56%. The
function shows that the prior probability of its value exceeding the 5% threshold
is 0.74, whereas a posteriori it becomes 0.23, i.e., it is less likely that the variable
takes a high value (greater than 5).
For the variable irrigated woody crops, the mean value decreases by 50%,
from 6.83% to 3.39%. At first sight both density functions might suggest that
the posterior mean should increase; however, the right tail of the posterior distribution
Figure 7.3: Prior and posterior marginal probability distributions in the NB model for the variables marsh with vegetation, dry herbaceous crops, irrigated woody crops, sparse shrubland with pasture, sand and dunes and dry woody crops.
Figure 7.4: Prior and posterior marginal probability distributions in the NB model for the variables areas with low vegetation, mean rainfall, aridisols and mean temperature.
is very long and carries low probability, which explains this behaviour.
The mean value for the variable sparse shrubland with pasture increases by
127%. The function suggests that the prior probability of its value exceeding the
19% threshold is 0.30, whereas a posteriori it becomes 0.74. In other words, it is
more likely that the variable takes a high value.
The change in the sand and dunes variable is remarkable. The mean value
increases by 510%. The function shows that the prior probability of its value
exceeding the 2% threshold is 0.12, whereas a posteriori it becomes 0.86.
For the variable dry woody crops, the mean value decreases by 55%. The
prior probability that its value is lower than the threshold 35% is 0.54, whereas a
posteriori increases to 0.86, i.e., it is more likely to find low values for this variable
in the case that the presence of the tortoise is recorded.
The mean value of variable areas with low vegetation increases by 157%. The
function shows that the prior probability of its value exceeding the 16% threshold
is 0.29, whereas a posteriori it becomes 0.79.
The variable aridisols increases its mean value by 237%. The prior probability
that its value exceeds the 5% threshold is 0.22, whereas a posteriori it becomes
0.80.
The decrease in the variable mean rainfall is smaller (3.43%). The function shows
that the prior probability of its value being lower than the threshold 437 mm is
0.16, whereas if the presence of the tortoise is known, this probability is 0.40, i.e.,
lower precipitation favours the presence of the tortoise.
For mean temperature, the mean value increases slightly, by 5%. The
probability that the temperature exceeds the threshold value of 17 °C is 0.36
a priori, whereas a posteriori it becomes 0.63. Therefore, slightly higher mean
temperatures favour the presence of the tortoise.
Summing up, the marginal probability distributions of the NB model show
that the tortoise is likely to be found in areas with sand and dunes, marsh with
vegetation, aridisols, areas with low vegetation and sparse shrubland with
pasture. The remaining variables vary by less than 100% between the evidence
and no-evidence cases. The model also suggests that the tortoise is likely to be found
where mean rainfall is lower and mean temperature is slightly higher.
7.3.2 TAN model
Figure 7.5 shows the constructed TAN model. The main difference with respect to
the NB model is the presence of relationships between the features. This increases the
number of arcs in the structure, and hence its complexity, but improves accuracy and expressivity.
(a) Tree showing the relations among the features.
(b) TAN model constructed adding to the previous tree the class variable and links from it to each feature.

Figure 7.5: Sequence followed to obtain the TAN model. MV: marsh with vegetation; DHC: dry herbaceous crops; IWC: irrigated woody crops; SSP: sparse shrubland with pasture; SD: sand and dunes; DC: dry woody crops; AWV: areas with low vegetation; MR: mean rainfall; AS: aridisols; MT: mean temperature.
Figure 7.5 shows the structure of the corresponding TAN model built from a
tree. The procedure consists of adding the tortoise variable and drawing an arc
from it to each environmental variable.
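This construction is the standard TAN procedure: a maximum-weight spanning tree over the features (weighted by conditional mutual information given the class), oriented away from a root, plus an arc from the class to every feature. A minimal sketch, with illustrative CMI weights rather than estimates from the tortoise data:

```python
def build_tan(features, cmi, cls, root):
    """TAN structure: maximum-weight spanning tree over the features
    (Kruskal on CMI weights), oriented away from `root`, plus an arc
    from the class variable `cls` to every feature."""
    parent = {f: f for f in features}
    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    undirected = []
    for edge in sorted(cmi, key=cmi.get, reverse=True):
        a, b = tuple(edge)
        ra, rb = find(a), find(b)
        if ra != rb:                  # heaviest edge joining two components
            parent[ra] = rb
            undirected.append((a, b))
    adj = {f: [] for f in features}
    for a, b in undirected:
        adj[a].append(b)
        adj[b].append(a)
    arcs, seen, stack = [], {root}, [root]
    while stack:                      # orient the tree away from the root
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                arcs.append((u, v))
                seen.add(v)
                stack.append(v)
    return arcs + [(cls, f) for f in features]

# Illustrative weights (not estimates from the tortoise data):
feats = ["SD", "DHC", "SSP"]
cmi = {frozenset({"SD", "DHC"}): 0.9,
       frozenset({"SD", "SSP"}): 0.8,
       frozenset({"DHC", "SSP"}): 0.1}
arcs = build_tan(feats, cmi, "TORTOISE", "SD")
```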
Figures 7.6 and 7.7 show the density functions for each variable, both without
evidence of the tortoise (a priori) and with evidence (a posteriori).
Table 7.2 shows the expected values of the prior and posterior density function
for each environmental variable.
Variable                         a priori   a posteriori   % change
Marsh with vegetation               3.70         15.90        329
Dry herbaceous crops               22.54         10.06        -56
Irrigated woody crops               6.16          4.08        -34
Sparse shrubland with pasture      17.56         40.00        127
Sand and dunes                      2.47         15.11        510
Dry woody crops                    38.02         17.18        -55
Areas with low vegetation          20.70         47.12        228
Aridisols                          15.50         26.62         72
Mean rainfall                     600.81        392.56        -35
Mean temperature                   15.85         16.92          7

Table 7.2: Expected values of the probability distributions a priori and a posteriori for the TAN model. The values represent percentage cover except for mean rainfall (mm) and mean temperature (°C).
In the same way as for the NB, a threshold value has been calculated to see
where the greatest differences between the prior and posterior density functions
lie.
Mean percentage cover for irrigated woody crops decreases by 34%. The
function shows that the prior probability of its value exceeding the 1% threshold
is 0.47, whereas a posteriori it becomes 0.78. The probability that the mean is
lower than the threshold 10% (the same as the NB) goes from 0.72 to 0.91.
For areas with low vegetation, the mean value increases its cover by 228%.
The function suggests that the probability of its value exceeding the threshold of
20% is 0.33 a priori, whereas a posteriori this probability becomes 0.84.
The variable aridisols increases its mean value by 72%. The prior probability
that its value is greater than the threshold 3% is 0.35, whereas a posteriori, the
probability becomes 0.89.
Mean rainfall decreases by 35%. The function shows that the prior probability
of its value being less than the threshold 407 mm is 0.14, whereas a posteriori it
is 0.71. So, it is more likely to find low precipitation associated with the presence
of tortoise.
Figure 7.6: Prior and posterior marginal probability distributions in the TAN model for marsh with vegetation, dry herbaceous crops, irrigated woody crops, sparse shrubland with pasture, sand and dunes and dry woody crops.
Figure 7.7: Prior and posterior marginal probability distributions in the TAN model for areas with low vegetation, mean rainfall, aridisols and mean temperature.
Mean temperature increases slightly, by 7%. The function shows that the prior
probability that the temperature exceeds the threshold of 17 °C is 0.37, whereas
the posterior probability becomes 0.63.
Marsh with vegetation, dry herbaceous crops, sparse shrubland with pasture,
sand and dunes, and dry woody crops show prior and posterior marginal proba-
bility functions similar to the NB model.
Thus, the NB and TAN models show similar distributions both a priori and a
posteriori, but the quantification varies (Tables 7.1 and 7.2). They differ in the
definition of relationships between the features in the TAN model, so that each
variable is influenced not only by the class variable tortoise, but also by the
variables directly connected with it in the network. Five probability distributions
differ between TAN and NB, so the habitat description is slightly different: mean
precipitation decreases by 33% (from 586.71 mm in NB to 392.56 mm in TAN),
cover of irrigated woody crops increases by 20% (from 3.39% in NB to 4.08% in
TAN), areas with low vegetation increases by 12% (from 42.03% in NB to 47.12%
in TAN) and aridisols increases by 15.2% (from 23.11% in NB to 26.62% in TAN).
Table 7.2 identifies the representative variables related to the presence of the
tortoise. In descending order, they are: sand and dunes, marsh with vegetation,
areas with low vegetation and sparse shrubland with pasture. For these variables,
evidence of the tortoise being present implies an increase of more than 100% in
their mean cover.
Aridisols increases by only 72%. Mean rainfall and mean temperature are
important climatic variables in the habitat characterisation, and indicate that
the tortoise’s habitat has a lower mean rainfall and a higher mean temperature.
7.3.3 Validation
Table 7.3 shows the classification rate for the discrete [3] and the continuous
models, using 10-fold cross validation. Classification rates grouped by NB,
TAN, continuous and discrete models, as well as the standard deviation of each
value, are also shown.
Figure 7.8 shows two box plots. The first one represents the values of the
classification rates for the continuous and discrete models. The second one shows
Model                Classification rate   Standard deviation
Discrete NB                0.9362                0.0165
Continuous NB              0.9707                0.0074
Discrete TAN               0.9493                0.0519
Continuous TAN             0.9707                0.0074
NB models                  0.9535                0.0165
TAN models                 0.9600                0.0377
Continuous models          0.9707                0.0072
Discrete models            0.9428                0.0381

Table 7.3: 10-fold cross validation for the discrete and continuous versions of NB and TAN.
the same values for the TAN and NB models.
Figure 7.8: Box plots comparing the classification rate for continuous against discrete models and NB against TAN models.
After applying Lilliefors' test to check the normality of the data, the t-test
was applied to compare the experimental results (see Table 7.4). There are sig-
nificant differences between continuous and discrete models (p-value of 0.0021,
p < 0.05). This difference is due to the loss of statistical information in the
discretisation process. On the other hand, there are no significant differences
between TAN and NB models (p-value of 0.2531, p > 0.05). In general, TAN
models have been shown to be better classifiers than NB models, but with scarce
data (our case) the MTE learning process may lead to a worse classification rate,
which can slightly modify the results of TAN. In any case, Figure 7.8 suggests
that TAN outperforms NB, though not significantly.
Comparison               p-value
Continuous - Discrete     0.0021
TAN - NB                  0.2531

Table 7.4: Statistical differences between the models.
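The t-test itself needs distribution tables; as a self-contained illustration of the same comparison, we substitute a permutation test (a distribution-free alternative to the t-test, our substitution rather than the thesis's method). The fold accuracies below are hypothetical, not the thesis's per-fold values:

```python
import random

def perm_test(xs, ys, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
    pooled = xs + ys
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                       # random relabelling
        a, b = pooled[:len(xs)], pooled[len(xs):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return hits / n_perm                          # estimated p-value

# Hypothetical per-fold accuracies for two model families:
continuous = [0.97, 0.98, 0.96, 0.97, 0.98]
discrete = [0.93, 0.95, 0.91, 0.96, 0.92]
p = perm_test(continuous, discrete, n_perm=5000)
```

A small p indicates that the gap between the two groups is unlikely under random relabelling, mirroring the role of the t-test's p-value in Table 7.4.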
7.3.4 Spatial application of the models
Figures 7.9(a) and 7.9(b) show the probability of the tortoise being present in
Andalusia according to the discrete models developed in [3]. The same is shown in
Figures 7.10(a) and 7.10(b) for the continuous models developed in this chapter.
Figures 7.9(a), 7.9(b), 7.10(a), 7.10(b) clearly indicate the existence of two
populations of tortoise in Andalusia: one located in the southwest and another
in the southeast. The discrete models NB and TAN recognise this pattern, but
show a more dispersed distribution in the region, locating the presence of tor-
toises in less likely inland habitats. The continuous NB model shows a better
characterisation of the habitat, however it includes an area close to the Strait of
Gibraltar, determined by higher precipitation (mean value of 586.71 mm) with
respect to the continuous TAN.
The continuous TAN model corresponds exactly to the presence of the tor-
toise in Andalusia. The probability distributions determined by this model char-
acterise both habitats. In the southwest, the tortoises occur in areas of sandy
substrate alternating with vegetation near marshes. These environmental vari-
ables correspond spatially to the Doñana National Park. In the southeast, the
habitat is semiarid, with sparse shrubland with pasture, areas with low vegeta-
tion and an abundance of aridisol soil types. The model shows that the most
important factors in the distribution of tortoises in the southeast are climate and
vegetation type.
The results obtained in the characterisation of tortoise habitat indicate that
NB and TAN continuous models based on Mixtures of Truncated Exponentials
(a) Discrete NB. (b) Discrete TAN.

Figure 7.9: Probability of presence of the tortoise in the region of Andalusia according to the discrete models [3].
(a) Continuous NB. (b) Continuous TAN.

Figure 7.10: Probability of presence of the tortoise in the region of Andalusia according to the continuous models.
(MTEs) can be applied to species distribution modelling, by allowing the simul-
taneous use of both discrete and continuous variables in the development of the
models.
7.4 Conclusions
We have applied two classification models based on MTEs (NB and TAN) to char-
acterise the habitat of the spur-thighed tortoise (Testudo graeca graeca), using
several continuous environmental variables. The application of the hybrid models
instead of the fully discrete ones has yielded a better classification rate and also
a more spatially accurate distribution of the tortoise. The study also shows that,
according to expert opinion, the results of the continuous TAN model correspond
exactly to the presence of the tortoise in the region of Andalusia.
Chapter 8
Relevance analysis of
performance indicators in higher
education
Abstract

In this chapter we describe a methodology for relevance analysis of performance in-
dicators in higher education based on the use of Bayesian networks. We analyse the
behaviour of the described methodology in a practical case, showing that it is a useful
tool to help decision making when elaborating policies based on performance indica-
tors. The methodology has been implemented in a software tool that interacts with the
Elvira package for graphical models, and that is available to the administration board
at the University of Almería (Spain) through a web interface. The software also im-
plements a new method for constructing composite indicators by using a Bayesian
network regression model based on MTEs.
8.1 Introduction
During the last decades, the way in which the financial support provided by the
administration to public universities is determined has gradually moved to a
system where an increasing part of the funds is obtained depending on the goals
achieved by each institution. The usual way to determine to what extent an
institution has achieved the agreed goals is through the so-called performance
indicators [38]. Sometimes, the term performance is understood in a wide sense,
assuming that a performance indicator is any institutional goal that can be
objectively measured [43].
In order to design efficient policies oriented to increasing the amount of pub-
lic funds, the administration boards of the universities should determine which
variables, under their control, actually have an impact on the value of the perfor-
mance indicators that are ultimately used to compute the funds. This task requires
taking into account a high number of variables of different natures (qualitative
and quantitative), which may have a complex dependence structure. In recent
years, there has been an increasing interest, within the fields of Statistics
and Artificial Intelligence, in handling scenarios in which a high number
of variables take part. One of the most satisfactory solutions is based on the use
of probabilistic graphical models and, more precisely, Bayesian networks [19, 77].
Examples of applications of Bayesian networks in enterprise information systems
can be found in the literature [156].
The main advantage of Bayesian networks is that they have rich semantics
and can be easily interpreted by users without a strong background in
Statistics. From an operational point of view, Bayesian networks provide a
natural framework for relevance analysis and can also be used for prediction
tasks [111].
In this chapter we propose a methodology for relevance analysis of perfor-
mance indicators in higher education based on the use of Bayesian network mo-
dels. We illustrate the appropriateness of the proposed methodology for the par-
ticular case of the University of Almería (Spain). We also describe the decision
support system designed to implement this methodology. The system interacts
with the Elvira platform [34], and provides a web interface that guides the user
through the process of determining the variables that are relevant to a given
performance indicator. Furthermore, the software implements a novel procedure
for constructing composite indicators, based on rankings provided by experts.
Composite indicators [113] are indicators that sum up the information provided
by various indicators of different nature, with the aim of describing, with a sin-
gle number, the performance of an institution. Our proposal is a supervised
algorithm, consisting of creating a database using the rankings provided by the
experts and the corresponding values of the individual indicators. The composite
indicator is then induced from this database through the Bayesian network
regression model [111] explained in Section 3.6.
The rest of the chapter is organised as follows. In Section 8.2 we explain
the fundamentals of the methodology for relevance analysis using Bayesian net-
works. Section 8.3 shows the behaviour of the proposed technique in a
real-world problem. The software developed to implement the methodology
is described in Section 8.4. We describe the procedure to construct composite
indicators in Section 8.5. The chapter ends with conclusions in Section 8.6.
8.2 Relevance analysis using Bayesian networks
One of the most important advantages of Bayesian networks is that the structure
of the associated DAG determines the dependence and independence relationships
among the variables (see the d-separation criterion in Section 2.1), so that it is pos-
sible to find out, without carrying out any numerical calculations, which
variables are relevant or irrelevant for some other variable of interest (for instance,
a performance indicator). More precisely, we will illustrate how relevance
analysis is performed in Bayesian networks through the concept of transmission
of information, so that two variables are irrelevant to each other if no informa-
tion can be transmitted between them. Thus, for instance, we could determine
the variables on which the administration board of a university has
to operate in order to change the value of a performance indicator.
Though relevance analysis can be carried out simply by taking into account the
structure of the network, once the relevant variables for a given performance
indicator have been located, it is necessary to know to what extent the changes
in those variables determine the value of the performance indicator. This is
achieved by using the distributions of the Bayesian network.
Assume Xi is the performance indicator in which we are interested, and E is
a set of variables that can be controlled by the administration board. Then, the
prediction for the value of Xi given E = e can be obtained by computing the
distribution p(xi | E = e), which provides the likelihood of each possible
value of Xi given each possible configuration of E. This distribution can be
obtained from the joint distribution in Equation (2.2).
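In a discrete network, p(xi | E = e) can be obtained by brute-force enumeration of the joint factorisation of Equation (2.2). This is exponential in the number of variables, but it makes the idea concrete; the two-variable network below is a hypothetical toy, not the case-study model:

```python
from itertools import product

def posterior(query, evidence, variables, states, cpts):
    """p(query | evidence) by enumerating the factorised joint
    p(x1,...,xn) = prod_i p(xi | pa(xi)).
    `cpts` maps var -> (parents, f) with f(value, parent_values) -> prob."""
    scores = {}
    for combo in product(*(states[v] for v in variables)):
        assign = dict(zip(variables, combo))
        if any(assign[v] != val for v, val in evidence.items()):
            continue  # configuration inconsistent with the evidence E = e
        p = 1.0
        for v in variables:
            parents, f = cpts[v]
            p *= f(assign[v], tuple(assign[u] for u in parents))
        scores[assign[query]] = scores.get(assign[query], 0.0) + p
    z = sum(scores.values())          # normalise over the query states
    return {x: p / z for x, p in scores.items()}

# Hypothetical toy: class size influences lecturer evaluation.
variables = ["Size", "Eval"]
states = {"Size": ["small", "large"], "Eval": ["good", "poor"]}
cpts = {
    "Size": ((), lambda v, pv: 0.5),
    "Eval": (("Size",), lambda v, pv:
             {"small": {"good": 0.8, "poor": 0.2},
              "large": {"good": 0.3, "poor": 0.7}}[pv[0]][v]),
}
dist = posterior("Size", {"Eval": "good"}, variables, states, cpts)
```

Real systems such as Elvira use junction-tree propagation instead of enumeration, but the quantity computed is the same.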
Once we know how to use a Bayesian network model for relevance analysis,
we must consider how to obtain it. Nowadays, university administration is fully
assisted by computers, so that a large amount of statistical data is available.
More precisely, it is in general possible to obtain databases composed of records
describing items of information that contain the value of some performance indi-
cators together with other variables that can be controlled. For instance, we can
have a record with information about a course (number of students, number of
lecturers, etc.) together with some performance indicator regarding that course
(the success rate, for instance).
There are several algorithms that allow the construction of Bayesian networks
from databases. We will mention two of them that are commonly used: The so-
called K2 [36] and PC [148] algorithms. The K2 algorithm searches within the
space of all Bayesian networks that contain the variables in the database, and
tries to find an optimal network in terms of the likelihood of the database for each
candidate network. On the other hand, the PC algorithm tries to determine the
structure of the network by means of statistical tests of independence. None of
the methods is absolutely superior to the other, so that in practical applications
it is common to construct two networks, one with each algorithm, and then use
the network for which the likelihood of the database is higher. A common feature
of both algorithms is that they operate with qualitative variables, therefore, the
continuous variables must be discretised beforehand. A review on discretisation
methods can be found in [78].
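The selection rule just described (keep the network under which the database is more likely) can be sketched with a simplified maximum-likelihood score; this omits the parameter priors and penalties the actual K2 metric uses, and the function names are ours:

```python
from math import log
from collections import Counter

def log_likelihood(records, structure):
    """Score a candidate DAG: fit every conditional table by relative
    frequencies, then sum log p(record) over the database.
    `structure` maps each variable to its tuple of parents;
    `records` is a list of dicts variable -> value."""
    joint, marg = Counter(), Counter()
    for r in records:
        for v, parents in structure.items():
            pa = tuple(r[p] for p in parents)
            joint[(v, r[v], pa)] += 1   # counts for p(v, pa)
            marg[(v, pa)] += 1          # counts for p(pa)
    ll = 0.0
    for r in records:
        for v, parents in structure.items():
            pa = tuple(r[p] for p in parents)
            ll += log(joint[(v, r[v], pa)] / marg[(v, pa)])
    return ll
```

In practice one would score the two networks returned by K2 and PC on the same database and keep the higher-scoring one.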
There are free software packages that allow the construction of Bayesian net-
works from databases. In this work we have used the Elvira system [34].
8.3 Application to the analysis of performance
indicators at the University of Almerıa
In this section we describe a practical application of the methodology introduced
in Section 8.2, consisting of the analysis of some performance indicators that
are used to compute the amount of public funds received by the University of
Almería.
The starting point is a database with 1345 records and 17 variables regarding
all the courses taught at the University of Almería in the different degree programs
during the academic year 2003-2004. A description of the considered variables can
be found in Table 8.1. The first five variables correspond to academic performance
indicators.
Performance Rate: Ratio between the number of students that succeed in a course and the number of students that go to the exam.
Relative Mark: Average of the marks obtained by each student in the course in relation to the other students' marks.
RepStudents: Percentage of students that repeat a course.
Used Exam Diets: Number of times a student goes to the course exam before passing.
Rate of Diets Used: Number of times a student goes to the course exam divided by the maximum number of trials allowed.
#StudTheLect: Number of students per classroom in theoretical lectures.
#StudPrtLect: Number of students per classroom in practical lectures.
Type Of Course: Whether the course is compulsory or optional.
Semester: The semester, within the degree schedule, in which the course is taught.
#Lecturers: Number of lecturers in the same course.
Lecturer Evaluation: Mark obtained by the lecturer in the students' opinion polls.
Type of Lecturer: Position of the lecturer.
PerctPhD: Percentage of lecturers in the course with a PhD degree.
Degree Program: The degree program in which the course is taught.
FullTimeStud: Percentage of full-time students.
AvgAccessMark: Average marks of the students in the degree obtained in the high school.
P80AccessMark: 80th percentile of the marks of the students in the degree obtained in the high school.

Table 8.1: Description of the variables considered in the case study.
The database has been pre-processed by discretising the continuous variables
using the k-means clustering algorithm, one of the most popular clustering
algorithms in data mining [159], establishing 5 categories for each discretised
variable. We have used the PC and K2 algorithms, obtaining the best model,
in terms of the likelihood of the data, with K2. The resulting network can be
seen in Figure 8.1.
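A one-dimensional k-means discretiser of this kind can be sketched in a few lines (a minimal Lloyd's algorithm of our own; the thesis used an existing implementation):

```python
def kmeans_1d(values, k=5, iters=100):
    """One-dimensional Lloyd's k-means: returns k centroids that can be
    used to discretise a continuous variable into k categories."""
    values = sorted(values)
    # Initialise centroids spread across the sorted sample (assumes k >= 2).
    cents = [values[(len(values) - 1) * i // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - cents[j]))
            clusters[nearest].append(v)
        new = [sum(c) / len(c) if c else cents[j]
               for j, c in enumerate(clusters)]
        if new == cents:   # converged: assignments no longer change
            break
        cents = new
    return cents

def discretise(v, cents):
    """Category of value v = index of its nearest centroid."""
    return min(range(len(cents)), key=lambda j: abs(v - cents[j]))
```

Each continuous column is then replaced by the index of the nearest of its 5 centroids.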
Looking at the structure of the network in that figure, it can be seen that there
are two important variables, Type of Course and Degree Program, which play an
important role in the network, since information can flow from them to all the
performance indicators. We can evaluate the importance of these variables using
Figure 8.1: Bayesian network obtained for the case study.
the quantitative part of the Bayesian network. For instance, if we concentrate
on the variable Type of Course, its influence on two important performance indica-
tors, Performance Rate and Relative Mark, is clear from the conditional
probabilities displayed in Tables 8.2 and 8.3 respectively. The differences in the
distribution of the values of both performance indicators are significant depend-
ing on whether the course is compulsory or optional. This fact suggests that a
separate study of compulsory and optional courses is appropriate.
8.3.1 Relevance analysis for compulsory courses
The Bayesian network obtained using the records in the database concerning
compulsory courses is displayed in Figure 8.2. We can draw the following conclu-
sions:
Performance       Prior         Type of Course
rate              probability   Compulsory   Optional
[0, 0.545)        0.03          0.04         0.03
[0.545, 0.735)    0.10          0.15         0.04
[0.735, 0.855)    0.17          0.25         0.09
[0.855, 0.955)    0.20          0.26         0.13
[0.955, 1]        0.49          0.31         0.71

Table 8.2: Conditional probabilities of Performance Rate given the Type of Course.
Relative          Prior         Type of Course
mark              probability   Compulsory   Optional
[0, 0.195)        0.23          0.28         0.15
[0.195, 0.315)    0.27          0.29         0.24
[0.315, 0.465)    0.24          0.21         0.28
[0.465, 0.775)    0.18          0.14         0.23
[0.775, 1]        0.08          0.07         0.09

Table 8.3: Conditional probabilities of Relative Mark given the Type of Course.
• The structure of the Lecturer board is irrelevant to the rest of the network,
since variables Number of Lecturers, Type of Lecturer and Percentage of
PhDs are disconnected from the rest.
• The evaluation obtained by a lecturer in the opinion polls is fully determined
by the number of students per classroom in theoretical lectures. This is an
important conclusion, since it is common to find poor evaluation results in
large classrooms, which suggests that it is the size of the classroom, rather
than the lecturer's profile, that determines the result of the evaluation.
• Any possible information flow towards the performance indicators goes
through variable Degree Program. It is true that the administration board
Performance       Prior         # of students per classroom in theoretical lectures
Rate              probability   < 25.5   [25.5, 49.5)   [49.5, 79.25)   [79.25, 114.75)   ≥ 114.75
[0, 0.355)        0.20          0.18     0.19           0.20            0.20              0.24
[0.355, 0.535)    0.28          0.26     0.27           0.28            0.28              0.29
[0.535, 0.695)    0.23          0.23     0.23           0.23            0.23              0.22
[0.695, 0.845)    0.18          0.20     0.19           0.18            0.18              0.15
[0.845, 1]        0.11          0.13     0.12           0.11            0.11              0.09

Table 8.4: Performance rate vs. # students in theoretical lectures for compulsory subjects.
Figure 8.2: Bayesian network for compulsory courses.
cannot control the degree program in which a subject is included, but they
can control some characteristics of the degree program, such as the access
mark and the number of students per classroom in theoretical and practical
lectures. The effect of these variables on the performance rate is illustrated
in Tables 8.4, 8.5 and 8.6.
• According to the results in Table 8.4, we can conclude that a good policy for
increasing the performance rate is to establish a maximum number
Performance       Prior         # of students per classroom in practical lectures
Rate              probability   < 23.9   [23.9, 41.65)   [41.65, 68.5)   [68.5, 114.16)   ≥ 114.16
[0, 0.355)        0.20          0.19     0.20            0.21            0.20             0.24
[0.355, 0.535)    0.28          0.27     0.28            0.28            0.28             0.29
[0.535, 0.695)    0.23          0.23     0.23            0.23            0.23             0.24
[0.695, 0.845)    0.18          0.19     0.18            0.18            0.18             0.15
[0.845, 1]        0.11          0.12     0.11            0.11            0.11             0.10

Table 8.5: Performance rate vs. # students in practical lectures for compulsory subjects.
Chapter 8. Relevance analysis of performance indicators in higher education 155
LecturerEvaluation #Lecturers
PerctPhDTypeOfLecturer
UsedExamDiets
RelativeMark
#StudTheLect#StudPrtLect
RateOfDietsUsedRepStudentsFullTimeStudSemester
PerformanceRateDegreeProgramAvgAccessMark
P80AccessMark
Figure 8.3: Bayesian network for optional courses.
of students per classroom not greater than 49.
• The influence of the number of students in practical lectures is less
important, as can be seen in Table 8.5 (the columns are rather similar).
• Finally, the access marks have little impact on the performance rate. Only
when the 80th percentile of the access mark rises above 7.3 points out of 10
can a slight improvement in the performance rate be noticed.
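The class-size effect reported in Table 8.4 can be summarised with a single number per column, namely the expected performance rate obtained by weighting the midpoint of each rate interval by its probability. A minimal sketch (the interval midpoints are our own summary device, not part of the original analysis):

```python
# Summarising the columns of Table 8.4: expected performance rate per
# column, weighting the midpoint of each discretised rate interval by
# its conditional probability.
midpoints = [0.1775, 0.445, 0.615, 0.77, 0.9225]  # midpoints of the 5 rate intervals

# Two columns of Table 8.4 (conditional distributions of the rate).
small = [0.18, 0.26, 0.23, 0.20, 0.13]   # < 25.5 students per classroom
large = [0.24, 0.29, 0.22, 0.15, 0.09]   # >= 114.75 students per classroom

def expected_rate(dist):
    """Expected performance rate under a discretised distribution."""
    return sum(p * m for p, m in zip(dist, midpoints))

print(round(expected_rate(small), 3))  # ~0.563
print(round(expected_rate(large), 3))  # ~0.505
```

The gap between the two expectations (roughly 0.56 versus 0.51) quantifies what the table shows qualitatively: smaller classrooms shift probability mass towards higher performance rates.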
8.3.2 Relevance analysis for optional courses
The Bayesian network obtained using the records in the database concerning
optional courses is displayed in Figure 8.3. Analysing the structure of this
network, we can deduce that the lecturers' profile, including the result of the
evaluation, is irrelevant to the course performance indicators.
The access marks are connected to the performance rate through the degree
program. Their impact on this indicator is quantified in Table 8.7.

Performance     Prior          Average Access Mark
Rate            probability    [5.32, 5.92)   [5.92, 6.17)   [6.17, 6.46)   [6.46, 6.83)   [6.83, 7.71]
[0, 0.355)      0.20           0.19           0.21           0.21           0.19           0.19
[0.355, 0.535)  0.28           0.28           0.28           0.28           0.27           0.27
[0.535, 0.695)  0.23           0.24           0.23           0.23           0.23           0.23
[0.695, 0.845)  0.18           0.18           0.17           0.17           0.18           0.19
[0.845, 1]      0.11           0.11           0.11           0.11           0.12           0.13

Performance     Prior          80th Percentile of the Access Mark
Rate            probability    [5.46, 6.30)   [6.30, 6.65)   [6.65, 7.30)   [7.30, 7.61)   [7.61, 8.5]
[0, 0.355)      0.20           0.19           0.21           0.21           0.20           0.19
[0.355, 0.535)  0.28           0.28           0.28           0.28           0.27           0.27
[0.535, 0.695)  0.23           0.23           0.23           0.23           0.23           0.23
[0.695, 0.845)  0.18           0.18           0.18           0.18           0.18           0.19
[0.845, 1]      0.11           0.11           0.11           0.11           0.12           0.12

Table 8.6: Performance rate vs. student profile for compulsory subjects.

Performance   Prior          Average Access Mark
Rate          probability    [5.32, 6.00)   [6.00, 6.20)   [6.20, 6.44)   [6.44, 6.69)   [6.69, 8.03]
[0, 0.59)     0.20           0.18           0.19           0.22           0.22           0.17
[0.59, 0.71)  0.19           0.17           0.20           0.29           0.19           0.18
[0.71, 0.82)  0.20           0.21           0.21           0.21           0.21           0.18
[0.82, 0.92)  0.20           0.23           0.21           0.19           0.19           0.19
[0.92, 1]     0.21           0.20           0.19           0.19           0.20           0.28

Performance   Prior          80th Percentile of the Access Mark
Rate          probability    [5.37, 6.28)   [6.28, 6.61)   [6.61, 6.92)   [6.92, 7.22)   [7.22, 8.57]
[0, 0.59)     0.20           0.19           0.19           0.21           0.21           0.18
[0.59, 0.71)  0.19           0.18           0.19           0.19           0.19           0.18
[0.71, 0.82)  0.20           0.21           0.21           0.21           0.20           0.18
[0.82, 0.92)  0.20           0.22           0.21           0.20           0.19           0.19
[0.92, 1]     0.21           0.20           0.20           0.20           0.21           0.26

Table 8.7: Performance rate given access marks for optional courses.

The probabilities in Table 8.7 indicate that the best performances are attained for average
access marks above 6.69 points and 80th percentiles above 7.22 points.
The number of students in theoretical lectures is more relevant here than
in the case of compulsory subjects, since it is connected to the relative mark,
the number of exam diets used and the percentage of repeating students. It is
also indirectly connected to the performance rate.
Table 8.8 summarises the probabilities of the indicator Performance Rate given
the number of students in theoretical lectures. It can be observed that the
performance rate is strongly influenced by this variable, with low performances
appearing as the number of students increases. Therefore, any policy oriented
towards decreasing the number of students per classroom yields a significant
improvement in the course performance.

Performance   Prior          # students per classroom in theoretical lectures
Rate          probability    < 10    [10, 18)   [18, 28)   [28, 51)   [51, 137]
[0, 0.59)     0.20           0.18    0.20       0.19       0.21       0.20
[0.59, 0.71)  0.19           0.15    0.19       0.19       0.20       0.19
[0.71, 0.82)  0.20           0.17    0.20       0.21       0.21       0.22
[0.82, 0.92)  0.20           0.19    0.19       0.19       0.21       0.23
[0.92, 1]     0.21           0.31    0.22       0.21       0.17       0.16

Table 8.8: Results for optional subjects given the size of the classrooms in theoretical lectures.

Performance   Prior          # of students per classroom in practical lectures
Rate          probability    < 9     [9, 17)    [17, 25)   [25, 38)   ≥ 38
[0, 0.59)     0.20           0.18    0.20       0.20       0.20       0.20
[0.59, 0.71)  0.19           0.16    0.18       0.19       0.19       0.19
[0.71, 0.82)  0.20           0.17    0.20       0.21       0.21       0.21
[0.82, 0.92)  0.20           0.19    0.19       0.20       0.21       0.22
[0.92, 1]     0.21           0.30    0.23       0.20       0.18       0.17

Table 8.9: Results for optional subjects given the size of the classrooms in practical lectures.
Finally, it can be concluded from the probabilities in Table 8.9 that the
influence of the number of students in practical lectures is not as important as in
the case of theoretical lectures.
8.4 Software for relevance analysis
We have implemented the methodology described above in a software package,
called academic advisor, which provides an intuitive web-based interface appro-
priate for academic staff who are not familiar with Bayesian network models.
The functionality of the academic advisor is based on a client/server Web
architecture. On the client side, users interact with the system through a Web
browser that accesses an interface with data forms. The server side contains most
of the functionality of the application. The system uses the Apache Tomcat 5.5
Servlet/JSP Container as Web server, which runs servlets and generates JSPs
(Java Server Pages). The server also stores the Java classes of the Elvira
program [34] and the Bayesian networks described in Section 8.3.
Servlets and JSPs are two methods for creating dynamic Web pages on the
server side using the Java language. More precisely, JSPs are HTML pages with
special tags and Java code embedded through scripts, which makes it possible to
generate dynamic content. A servlet, on the other hand, is a Java program that
receives requests, processes them and generates a Web page in response. The
structure of the academic advisor is shown in Figure 8.4.
[Figure: clients (Web browsers 1..n) send requests through the Internet to the Tomcat Web server and receive responses; the server hosts JSPs, Java servlet classes and the Elvira Java classes, together with the Bayesian network files for compulsory courses, optional courses and degree programmes.]
Figure 8.4: Structure of the decision support system for relevance analysis.
We can justify the use of these technologies from different points of view.
First, this problem requires constant interaction between the application and
the user. In addition, the client/server approach seems appropriate, since it
makes remote use of the application by several users at the same time possible.
Finally, the use of the Java language allows direct interaction with the Elvira
system, which is implemented in that language.
The interaction process works as follows. The user introduces the input data
using a Web form written in HTML, or in JSP when some processing in Java is
needed. The request is sent to the server, where it activates the corresponding
servlet, which carries out the search by interacting with the underlying Elvira
Java classes and Bayesian network files. Once the information has been processed,
another servlet is in charge of generating an HTML/JSP page that shows the
results to the user (the response).
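The request/response cycle just described can be sketched as follows. This is a Python stand-in for the Java servlet logic actually used, and `query_model` is a hypothetical placeholder for the propagation performed through the Elvira classes:

```python
# Sketch of the request/response cycle: parse the submitted form data,
# query the model layer, and render an HTML page with the results.
# query_model is a hypothetical placeholder, not the real Elvira call.
from urllib.parse import parse_qs

def query_model(target, evidence):
    # Placeholder: the real system runs probability propagation here.
    return {"low": 0.3, "high": 0.7}

def handle_request(query_string):
    form = parse_qs(query_string)
    target = form["target"][0]
    evidence = {k: v[0] for k, v in form.items() if k != "target"}
    posterior = query_model(target, evidence)
    rows = "".join(f"<tr><td>{state}</td><td>{prob:.2f}</td></tr>"
                   for state, prob in posterior.items())
    return f"<html><body><table>{rows}</table></body></html>"

page = handle_request("target=PerformanceRate&Semester=1")
```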
From the point of view of the users, the system can be used for four tasks:
relevance analysis, probability propagation, profile extraction and construction
of composite indicators.
In the relevance analysis module, which is depicted in Figure 8.5, the user
can choose a target variable and the system returns the list of variables that are
directly related to it, according to the Markov blanket [117]. The process can be
repeated in order to detect the relevant variables at a second level in the Bayesian
network, and so on. Finally, the posterior distribution of the target variable given
the selected relevant variables can be obtained.
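Reading the Markov blanket off the structure is a purely graphical operation: it collects the parents, the children, and the children's other parents of the target node. A sketch over a hypothetical edge set (the edges below are illustrative, not those of the learned networks):

```python
# Markov blanket of a target node in a DAG: its parents, its children,
# and the other parents of its children. The edge list is illustrative.
edges = [
    ("DegreeProgram", "PerformanceRate"),
    ("#StudTheLect", "DegreeProgram"),
    ("PerformanceRate", "RateOfDietsUsed"),
    ("Semester", "RateOfDietsUsed"),
]

def markov_blanket(target, edges):
    parents = {u for u, v in edges if v == target}
    children = {v for u, v in edges if u == target}
    spouses = {u for u, v in edges if v in children and u != target}
    return parents | children | spouses

print(sorted(markov_blanket("PerformanceRate", edges)))
```

For the edge set above, the blanket of PerformanceRate contains its parent DegreeProgram, its child RateOfDietsUsed, and the child's other parent Semester.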
Figure 8.5: The relevance analysis screen of the academic advisor.
In the probability propagation module, the user can obtain the posterior prob-
ability distribution of any target variable given an assignment of values to some
other variables (see Figure 8.6).
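For discrete variables, the computation behind this module reduces to multiplying the relevant conditional tables and normalising over the unobserved target. A toy sketch with illustrative numbers (the real system propagates over MTE networks through the Elvira classes):

```python
# Posterior of a target given evidence in a toy network A -> B,
# by enumeration and normalisation. All numbers are illustrative.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}

def posterior_a(b_obs):
    """P(A | B = b_obs) by enumerating and normalising the joint."""
    joint = {a: p_a[a] * p_b_given_a[(a, b_obs)] for a in p_a}
    z = sum(joint.values())
    return {a: p / z for a, p in joint.items()}

print(posterior_a(1))  # approximately {0: 0.36, 1: 0.64}
```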
Figure 8.6: The probability propagation screen of the academic advisor.

The profile extraction module can be seen in Figure 8.7. This module allows
the user to compute a set of explanations for a given fact. For instance, we can
compute the best explanation for a given value of the success rate in terms of the
number of students per classroom and the student/teacher ratio. This tool is very
useful for descriptive purposes, as it allows typical situations to be determined
under given restrictions. The problem of finding a set of explanations, also known
as the MAP problem, is solved using the implementation in Elvira, which
corresponds to the method proposed in [114].
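The underlying task can be sketched by exhaustive enumeration over the explanation variables; the variables and probabilities below are illustrative, whereas the real system relies on the Elvira implementation of the method in [114]:

```python
# Profile extraction as a MAP query: find the most probable
# configuration of the explanation variables given the evidence, by
# exhaustive enumeration. Variables and numbers are illustrative.
from itertools import product

p_size = {0: 0.6, 1: 0.4}    # 0 = small classroom, 1 = large
p_ratio = {0: 0.5, 1: 0.5}   # 0 = low student/teacher ratio, 1 = high
p_success = {(0, 0): 0.8, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.3}

def best_explanation(success_obs):
    """Most probable (size, ratio) configuration given observed success."""
    def score(size, ratio):
        p = p_success[(size, ratio)]
        likelihood = p if success_obs else 1 - p
        return p_size[size] * p_ratio[ratio] * likelihood
    return max(product(p_size, p_ratio), key=lambda c: score(*c))

print(best_explanation(True))   # a successful outcome is best explained
                                # by a small class with a low ratio
```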
The module for constructing composite indicators is shown in Figure 8.8 and
described in Section 8.5.
8.5 Using the software to construct composite
indicators
Composite indicators [113] are indicators that sum up the information provided
by several indicators of a different nature, with the aim of describing the
performance of an institution with a single number.

Figure 8.7: The profile extraction screen of the academic advisor.

The module for constructing composite indicators implements a novel
methodology of a supervised nature. It is supervised in the sense that an expert
must rank different descriptions of the institution in terms of performance. From
that description, the software creates a composite indicator that is computed
from the values of the individual indicators for each description. We give the
details of these two main tasks in the next subsections.
8.5.1 Generating the rank of descriptions

The first step in constructing a composite indicator is to choose a set of individual
indicators X1, . . . , Xk. Then, an expert gives a set of descriptions of the insti-
tution in terms of some observable variables. For each description, the software
computes a list of profiles that explain it and shows them on the screen.
Afterwards, the expert assigns a number between 0 and 1 to each profile,
according to the performance of the institution under the corresponding
description, where a value close to 1 indicates high performance and a value close
to 0 means low performance. A screenshot of this procedure can be seen in
Figure 8.8. After this process, the software has stored a database D with variables
X1, . . . , Xk, Y , where Y is the ranking assigned to each description by the expert.

Figure 8.8: Screen for constructing composite indicators.
8.5.2 Generating the composite index from the database

Using the database D described above, we construct the composite indicator
by means of a naïve Bayes regression model [111] based on the MTE model (see
Section 3.3 for more details). Variable Y will be the composite indicator (response
variable) and X1, . . . , Xk the individual indicators (explanatory variables).
Note that Y is constructed from the rankings given by the expert, which is why
we use the same variable name. As shown in Equation (3.4), the posterior
distribution of Y given X1, . . . , Xk (more precisely, its expectation) will be used
to obtain a prediction y for Y.

After constructing the composite indicator, it is included in the system and
can be computed for every possible combination of values of the individual
indicators.
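The prediction mechanism just described can be sketched as follows, with Gaussian conditionals standing in for the MTE densities of Section 3.3 and illustrative bins, priors and parameters:

```python
# Sketch of the prediction step of a naive Bayes regression model:
# discretise Y, keep P(bin) and P(x_i | bin), and predict with the
# posterior expectation of Y. Gaussian conditionals stand in for the
# MTE densities; all numbers are illustrative.
import math

bins = [0.2, 0.5, 0.8]        # representative value of each Y interval
prior = [0.3, 0.4, 0.3]       # P(bin)
params = [                    # (mean, std) of the explanatory variable per bin
    [(30.0, 10.0)],
    [(50.0, 10.0)],
    [(80.0, 10.0)],
]

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def predict(x):
    """Posterior expectation E[Y | x] under the naive Bayes model."""
    weights = [prior[b] * math.prod(gauss(xi, mu, s)
                                    for xi, (mu, s) in zip(x, params[b]))
               for b in range(len(bins))]
    z = sum(weights)
    return sum(w / z * y for w, y in zip(weights, bins))

print(round(predict([35.0]), 3))  # a low observation yields a low index
```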
8.6 Conclusions
In this chapter we have introduced a methodology for relevance analysis of per-
formance indicators in higher education. We have shown through a case study
that this methodology can support decision making when designing policies
aimed at increasing the public funds received, in settings where funds are
assigned according to performance indicators.
The graphical nature of the model used allows conclusions to be drawn without
interpreting any numerical data, since relevance analysis can be carried out by
taking into account only the structure of the Bayesian network. If the user is also
interested in quantifying the strength of the dependencies among the variables,
this can be achieved using the conditional probability distributions provided by
the Bayesian network.
We have also introduced the academic advisor system, which implements the
proposed methodology making use of the Elvira system, while providing a user-
friendly interface appropriate for academic staff who are not familiar with
Bayesian network models. An important novel feature of this software is that it
allows composite indicators to be constructed easily, using an expert's opinion.
The fact that the software interacts with the Elvira system makes it easy to
update its knowledge base with, for instance, indicator databases corresponding to
forthcoming academic years, or any other database. This is especially interesting
from the point of view of data privacy, as there is no need for external
intervention to update the system. The importance of preserving data privacy
during external intervention in data mining tasks is analysed in [124].
In the near future, we plan to apply Bayesian technology to construct a recom-
mendation system for students, so that they can choose the most appropriate
courses in order to maximise their chances of success.
We also plan to improve the module for constructing composite indicators
by following a semi-supervised approach, in which there is no need to specify the
value of the composite indicator in all the records of the training database.
Part IV
Concluding remarks
Chapter 9
Conclusions and future works
This dissertation is a contribution to the state of the art in the simultaneous
handling of discrete and continuous variables in hybrid Bayesian networks, based
on the Mixtures of Truncated Exponentials (MTE) model proposed in [106]. Most
of the work has focused on the learning problem, but a contribution on inference
has also been made in Chapter 6.
Regarding learning, Chapter 3 addressed the problem of regression using
hybrid Bayesian networks. The main advantage of using Bayesian networks to
solve a regression problem, with respect to classical techniques, is that it is not
necessary to have evidence for the entire set of independent variables to give a
prediction, since inference can be carried out in that case. Another advantage of
applying Bayesian networks in regression is scalability: a Bayesian network
can be included within a larger system, acting as input or output for other models
whose aim is not regression.
Several existing BN classifier structures have been applied to the regression
problem. Instead of selecting the value of the class variable that maximises
the posterior probability of the class given the observations (as in classifica-
tion), we have used the posterior distribution of the dependent variable (which
is continuous) given the independent ones to give a prediction. This prediction
has been computed through the mean or the median of the distribution. The
construction of some of these predictors requires the conditional mutual
information, which cannot be obtained analytically for MTEs. To solve this
problem, we have introduced an unbiased estimator of the conditional mutual
information based on Monte Carlo estimation. The models have been refined
using a variable selection scheme, and the results of the experiments have shown
good behaviour in terms of accuracy in comparison with the very robust
M5' method.
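The Monte Carlo idea behind the conditional mutual information estimator can be sketched on a toy discrete chain Z → X → Y (the real estimator samples from MTE densities; the model below is illustrative). Since each sampled term has expectation equal to I(X; Y | Z), the average is an unbiased estimate:

```python
# Monte Carlo estimation of the conditional mutual information
# I(X; Y | Z): sample (x, y, z) from the model and average
# log p(x, y | z) - log p(x | z) - log p(y | z).
# The chain Z -> X -> Y below is illustrative, not a thesis model.
import math
import random

random.seed(0)

def p_x_given_z(x, z):
    px1 = 0.8 if z else 0.2
    return px1 if x else 1 - px1

def p_y_given_x(y, x):
    py1 = 0.8 if x else 0.2
    return py1 if y else 1 - py1

def p_y_given_z(y, z):
    # Marginalise X out: p(y | z) = sum_x p(x | z) p(y | x)
    return sum(p_x_given_z(x, z) * p_y_given_x(y, x) for x in (False, True))

def sample():
    z = random.random() < 0.5
    x = random.random() < (0.8 if z else 0.2)
    y = random.random() < (0.8 if x else 0.2)
    return x, y, z

def mc_cmi(n=20000):
    total = 0.0
    for _ in range(n):
        x, y, z = sample()
        pxy = p_x_given_z(x, z) * p_y_given_x(y, x)  # p(x, y | z), since Y is independent of Z given X
        total += math.log(pxy) - math.log(p_x_given_z(x, z)) - math.log(p_y_given_z(y, z))
    return total / n

estimate = mc_cmi()  # the exact value for this toy model is about 0.126 nats
```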
Having successfully applied Bayesian networks to regression in Chapter 3,
Chapter 4 addressed the problem for the case of incomplete data. An iterative
procedure for inducing the models was proposed, based on a variation of the data
augmentation method in which the missing values of the explanatory variables
are filled in by simulating from their posterior distributions, while the missing
values of the response variable are generated using the conditional expectation of
the response given the explanatory variables. It has been proved that this way
of computing the prediction minimises the error. Another contribution of that
chapter is a method for improving the accuracy by reducing the bias in the
prediction, which can be incorporated regardless of whether the model is
obtained from complete or incomplete data. The experiments conducted have
shown that the selective versions of the proposed algorithms outperform the
robust M5' scheme, which is not surprising, as M5' is mainly designed for
continuous explanatory variables, while MTEs are naturally suited to hybrid
domains.
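The imputation scheme can be sketched as follows, with an empirical draw standing in for posterior simulation and a hypothetical linear model standing in for the conditional expectation of the response:

```python
# Sketch of the imputation idea: a missing explanatory value is drawn
# (simulated) from a posterior, here simply the empirical distribution
# of the observed values, while a missing response is filled with the
# conditional expectation E[Y | x], which minimises the squared error.
# The linear model y = 2x + 0.05 is an illustrative stand-in.
import random

random.seed(1)

data = [(1.0, 2.1), (2.0, 3.9), (None, 6.1), (3.0, None)]  # (x, y) records

def conditional_expectation(x):
    return 2.0 * x + 0.05  # toy stand-in for E[Y | x] under the model

def impute(records):
    observed_x = [x for x, _ in records if x is not None]
    completed = []
    for x, y in records:
        if x is None:
            x = random.choice(observed_x)   # simulate the explanatory value
        if y is None:
            y = conditional_expectation(x)  # expected response
        completed.append((x, y))
    return completed

completed = impute(data)
```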
Along the same line of parameter learning from missing data, in Chapter 5 we
proposed an EM-based algorithm for learning the maximum likelihood parameters
of an MTE network from incomplete data. In this work any network structure
and any underlying probability distribution is permitted, since transformation
rules for approximating the distributions by MTEs are proposed to make
inference feasible during the E-step of the algorithm. The updating rules for
maximising the likelihood in the M-step, on the other hand, are performed
over the original distributions of the variables. The results of the experiments
show the expected behaviour for different levels of missing data, although the
method is still not competitive in terms of likelihood.
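The overall shape of the EM iteration can be sketched on a two-component Gaussian mixture, used here only as a stand-in for the MTE conditionals: the E-step computes responsibilities under the current parameters, and the M-step re-maximises the likelihood.

```python
# Generic shape of an EM iteration, illustrated on a two-component
# Gaussian mixture standing in for the MTE conditionals.
# E-step: expected component memberships (responsibilities);
# M-step: weighted maximum likelihood updates of the parameters.
import math

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em(xs, iters=50):
    w, mu, s = [0.5, 0.5], [min(xs), max(xs)], [1.0, 1.0]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        r = [[w[k] * gauss(x, mu[k], s[k]) for k in (0, 1)] for x in xs]
        r = [[rk / sum(row) for rk in row] for row in r]
        # M-step: weighted maximum likelihood updates
        for k in (0, 1):
            nk = sum(row[k] for row in r)
            w[k] = nk / len(xs)
            mu[k] = sum(row[k] * x for row, x in zip(r, xs)) / nk
            s[k] = max(0.1, math.sqrt(sum(row[k] * (x - mu[k]) ** 2
                                          for row, x in zip(r, xs)) / nk))
    return w, mu, s

w, mu, s = em([0.1, 0.2, 0.15, 5.0, 5.2, 4.9])  # two well-separated clusters
```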
Regarding inference, in Chapter 6 we proposed an approximate probability
propagation algorithm for hybrid networks based on the MTE model. The
algorithm relies on the importance sampling technique, which has already been
successfully applied to discrete networks. The results obtained represent a
considerable advance, in terms of the error in the computed posterior
probabilities, with respect to the approximate methods in the hybrid Bayesian
networks literature.
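The estimator can be sketched on a toy discrete network A → B: the proposal samples A from its prior and each sample is weighted by the likelihood of the observed value of B (the actual algorithm samples from and weights MTE densities; the numbers below are illustrative):

```python
# Sketch of likelihood-weighted importance sampling for a toy network
# A -> B: sample A from its prior, weight by P(b_obs | A), and
# normalise to approximate P(A = 1 | B = b_obs).
import random

random.seed(0)

p_a1 = 0.4
p_b1_given_a = {0: 0.3, 1: 0.8}

def estimate_posterior_a1(b_obs=1, n=50000):
    num = den = 0.0
    for _ in range(n):
        a = 1 if random.random() < p_a1 else 0
        p_b1 = p_b1_given_a[a]
        w = p_b1 if b_obs == 1 else 1 - p_b1   # importance weight P(b_obs | a)
        num += w * a
        den += w
    return num / den

estimate = estimate_posterior_a1()  # exact value: 0.32 / 0.50 = 0.64
```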
Finally, the dissertation ends with two applications of MTE networks to real
problems. On the one hand, we have used Bayesian networks to characterise
the habitat of the spur-thighed tortoise, using several continuous environmental
variables and one discrete variable representing the presence or absence of the
tortoise. This work represents an advance in the field of application, since few
studies deal directly with continuous data, even in other areas. Bayesian
networks have previously been widely applied to problems in environmental
sciences, but only after discretising the continuous domains so that the learning
and inference techniques developed so far could be applied. The results of the
models show a spatially accurate distribution of the tortoise, and the conclusion
is that the proposed continuous models based on MTEs are valid for species
predictive distribution modelling.
On the other hand, we have applied Bayesian networks to the relevance analysis
of performance indicators in higher education, showing that they are a useful
tool for supporting decision making when elaborating policies based on
performance indicators. The methodology has been implemented in a software
package that interacts with the Elvira package for graphical models and is
available to the administration board of the University of Almería (Spain)
through a web interface. The main contribution of the software is that it
implements a new method for constructing composite indicators using a
Bayesian network regression model like the ones in Chapter 3.
In what follows, we outline some possible future research lines derived
from this dissertation. In Chapters 3 and 4, where Bayesian networks were
applied to regression, more elaborate variable selection schemes could be
considered to improve the accuracy of the estimations. Other classifier
structures could also be included in the study, for instance a semi-naïve Bayes
model.
Regarding parametric learning in Chapter 5, we plan to investigate the
behaviour of the proposed algorithm further, refining the implementation so
that it can deal with more complex networks.
In Chapter 6, we expect to continue this research line by developing methods
for carrying out more complex reasoning tasks, for instance finding the most
probable explanation of an observed fact in terms of a set of target variables,
which is called abductive inference.
With respect to applications, we are still collaborating with researchers in
environmental sciences, applying Bayesian networks, and in particular MTEs, to
environmental modelling: climate change, water resources, species distribution,
etc. In the same line as the academic advisor in Chapter 8, we also plan to
apply Bayesian technology to construct a recommendation system for students,
so that they can choose the most appropriate courses in order to maximise their
chances of success. We also plan to improve the module for constructing
composite indicators by following a semi-supervised approach, in which there is
no need to specify the value of the composite indicator in all the records of the
training database.
Bibliography
[1] P. A. Aguilera, A. Fernandez, R. Fernandez, R. Rumı, and
A. Salmeron. Bayesian networks in environmental modelling. Environ-
mental Modelling & Software (submitted), 2011. 125, 126
[2] P. A. Aguilera, A. Fernandez, F. Reche, and R. Rumı. Hybrid
Bayesian network classifiers: Application to species distribution models.
Environmental Modelling & Software, 25[12]:1630–1639, 2010. 25
[3] P. A. Aguilera, F. Reche, E. Lopez, B. A. Willaarts, A. Castro,
and M. F. Schmitz. Aplicación de las redes bayesianas a la caracterización
del hábitat de la tortuga mora (Testudo graeca graeca) en Andalucía.
In Proceedings of the I Congreso Nacional de Biodiversidad, 2007. 127, 140,
142, 143
[4] R. P. Anderson, A. T. Peterson, and M. Gomez-Laverde. Using
niche-based GIS modelling to test geographic predictions of competitive
exclusion and competitive release in South American pocket mice. Oikos,
98:3–16, 2002. 126
[5] M. B. Araujo, M. Cabeza, W. Thuiller, L. Hannah, and P. H.
Williams. Would climate change drive species out of reserves?. An assess-
ment of existing reserve-selection models. Global Change Biology, 10:1618–
1626, 2004. 126
[6] M. B. Araujo, W. Thuiller, and R. G. Pearson. Climate warming
and the decline of amphibians and reptiles in Europe. Journal of Biogeog-
raphy, 33:1712–1728, 2006. 126
[7] M. P. Austin. Spatial prediction of species distribution: an interface
between ecological theory and statistical modelling. Ecological Modelling,
157:101–118, 2002. 125
[8] I. A. Beinlich, H. J. Suermondt, R. M. Chavez, and G. F.
Cooper. The ALARM monitoring system: A case study with two proba-
bilistic inference techniques for Belief networks. In Second European Con-
ference on Artificial Intelligence in Medicine, 38, pages 247–256, 1989.
113
[9] D. A. Bell and H. Wang. A formalism for relevance and its application
in feature subset selection. Machine Learning, 41:175–195, 2000. 127
[10] M. Ben-Bassat. Use of distance measures, information measures and
error bounds in features evaluation. HandBook of Statistics, 2:773–791,
1982. 127
[11] C. L. Blake and C. J. Merz. UCI repository of machine learn-
ing databases. http://www.ics.uci.edu/~mlearn/MLRepository.html,
1998. University of California, Irvine, Dept. of Information and Computer
Sciences. 51, 66
[12] M. E. Borsuk, P. Reichert, A. Peter, E. Schager, and
P. Burkhardt-Holm. Assessing the decline of brown trout (Salmo
trutta) in swiss rivers using Bayesian probability network. Ecological Mod-
elling, 192:224–244, 2006. 126
[13] M. E. Borsuk, C. A. Stow, and K. H. Reckhow. A Bayesian network
of eutrophication models for synthesis, prediction, and uncertainty analysis.
Ecological Modelling, 173:219–239, 2004. 126
[14] J. Bromley, N. A. Jackson, O. J. Clymer, A. M. Giacomello,
and F. V. Jensen. The use of Hugin® to develop Bayesian networks
as an aid to integrated water resource planning. Environmental Modelling &
Software, 20:231–242, 2005. 126
[15] L. Brotons, W. Thuiller, M. B. Araujo, and A. H. Hirzel.
Presence-absence versus presence only modelling methods for predicting
bird habitat suitability. Ecography, 27:437–448, 2004. 126
[16] M. Burgmann, D. B. Lindenmayer, and J. Elith. Managing land-
scapes for conservation under uncertainty. Ecology, 86:2007–2017, 2005.
126
[17] A. Cano, S. Moral, and A. Salmeron. Penniless propagation in join
trees. International Journal of Intelligent Systems, 15:1027–1059, 2000. 99
[18] A. Cano, S. Moral, and A. Salmeron. Lazy evaluation in Penniless
propagation over join trees. Networks, 39:175–185, 2002. 99
[19] E. Castillo, J. M. Gutierrez, and A. S. Hadi. Expert systems and
probabilistic network models. Springer-Verlag, New York, 1997. 148
[20] K. C. Chang and Z. Tian. Efficient inference for mixed Bayesian net-
works. In Proceedings of the 5th ISIF/IEEE International Conference on
Information Fusion, pages 512–519, 2002. 24
[21] J. Cheng and M. J. Druzdzel. AIS-BN: An adaptive importance sam-
pling algorithm for evidential reasoning in large Bayesian networks. Journal
of Artificial Intelligence Research, 13:155–188, 2000. 99
[22] C. K. Chow and C. N. Liu. Approximating discrete probability distri-
butions with dependence trees. IEEE Transactions on Information Theory,
14:462–467, 1968. 41
[23] E. N. Cinicioglu and P. P. Shenoy. Solving stochastic PERT networks
exactly using hybrid Bayesian networks. In Proceedings of the Seventh
Workshop on Uncertainty Processing (WUPES-06), pages 183–197, 2006.
23
[24] E. N. Cinicioglu and P. P. Shenoy. Arc reversals in hybrid Bayesian
networks with deterministic variables. International Journal of Approxi-
mate Reasoning, 50:763–777, 2009. 22
[25] E. N. Cinicioglu and P. P. Shenoy. Using mixtures of truncated
exponentials for solving stochastic PERT networks. In Proceedings of the
Eighth Workshop on Uncertainty Processing (WUPES-09), pages 269–283,
2009. 23
[26] B. R. Cobb and J. M. Charnes. A graphical method for valuing switch-
ing options. Journal of the Operational Research Society, 61:1596–1606,
2010. 25
[27] B. R. Cobb, R. Rumı, and A. Salmeron. Advances in probabilistic
graphical models, chapter Bayesian networks models with discrete and con-
tinuous variables, pages 81–102. Studies in Fuzziness and Soft Computing.
Springer, 2007. 22
[28] B. R. Cobb, R. Rumı, and A. Salmeron. Predicting stock and portfolio
returns using mixtures of truncated exponentials. ECSQARU’09. Lecture
Notes in Computer Science, 5590:781–792, 2009. 25
[29] B. R. Cobb and P. P. Shenoy. Hybrid Bayesian networks with linear
deterministic variables. In Proceedings of the Proceedings of the Twenty-
First Conference Annual Conference on Uncertainty in Artificial Intelli-
gence (UAI-05), pages 136–144. AUAI Press, 2005. 22
[30] B. R. Cobb and P. P. Shenoy. Nonlinear deterministic relationships in
Bayesian networks. ECSQARU’05. Lecture Notes in Artificial Intelligence,
3571:27–38, 2005. 21, 22
[31] B. R. Cobb and P. P. Shenoy. Inference in hybrid Bayesian networks
with mixtures of truncated exponentials. International Journal of Approx-
imate Reasoning, 41:257–286, 2006. 14, 22, 81
[32] B. R. Cobb and P. P. Shenoy. Operations for inference in continuous
Bayesian networks with linear deterministic variables. International Journal
of Approximate Reasoning, 42:21–36, 2006. 22
[33] B. R. Cobb, P. P. Shenoy, and R. Rumı. Approximating probability
density functions with mixtures of truncated exponentials. Statistics and
Computing, 16:293–308, 2006. 22, 36, 74, 76
[34] Elvira Consortium. Elvira: An Environment for Creating and Using
Probabilistic Graphical Models. In Proceedings of the First European Work-
shop on Probabilistic Graphical Models, pages 222–230, 2002. 51, 65, 66,
96, 128, 129, 148, 150, 157
[35] G. F. Cooper. The computational complexity of probabilistic inference
using Bayesian belief networks. Artificial Intelligence, 42:393–405, 1990. 99
[36] G. F. Cooper and E. Herskovits. A Bayesian method for the induction
of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
150
[37] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegel-
halter. Probabilistic Networks and Expert Systems. Springer, 1999. 14
[38] S. Cuenin. The use of performance indicators in universities: An interna-
tional survey. International Journal of Institutional Management in Higher
Education, 2:117–139, 1987. 148
[39] P. Dagum and M. Luby. Approximating probabilistic inference in
Bayesian belief networks is NP-hard. Artificial Intelligence, 60:141–153,
1993. 99
[40] A. P. Dedecker, P. L. M. Goethals, W. Gabriels, and N. De
Pauw. Optimization of Artificial Neural Network (ANN) model design
for prediction of macroinvertebrates communities in the Zwalm river basin
(Flanders, Belgium). Ecological Modelling, 174:161–173, 2004. 126
[41] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood
from incomplete data via the EM algorithm. Journal of the Royal Statistical
Society B, 39:1–38, 1977. 57, 74
[42] J. Demsar. Statistical comparisons of classifiers over multiple data sets.
Journal of Machine Learning Research, 7:1–30, 2006. 51, 66
[43] F. Dochy, M. Segers, and W. Wijnen. Selecting performance in-
dicators. A proposal as a result of research. In F. Goedegebuure,
F. Maasen, and D. Westerheijden, editors, Peer review and perfor-
mance indicators, pages 135–153. Lemma B.V., 1990. 148
[44] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification.
Wiley Interscience, 2001. 31, 129
[45] S. Dzeroski and D. Drumm. Using regression trees to identify the habi-
tat preference of the sea cucumber (Holothuria leucospilota) on Rarotonga,
Cook Islands. Ecological Modelling, 170:219–226, 2003. 126
[46] J. Elith, C. H. Graham, R. P. Anderson, M. Dudik, S. Fer-
rier, A. Guisan, R. J. Hijmans, F. Huettmann, J. R. Leathwick,
J. Li, L. G. Lohmann, B. A. Loiselle, G. Manion, C. Moritz,
M. Nakamura, Y. Nakazawa, J. McM. Overton, A. T. Peterson,
S. J. Phillips, K. S. Richardson, S. Scachetti-Pereria, R. E.
Schapire, J. Soberon, S. Williams, M. S. Wisz, and N. E. Zim-
mermann. Novel methods to improve prediction of species’ distribution
from occurrence data. Ecography, 29:129–151, 2006. 126
[47] A. Fernandez, I. Flesch, and A. Salmeron. Incremental super-
vised classification for the MTE distribution: a preliminary study. In Actas
del Congreso Nacional de Informatica (CEDI’07), Simposio de Inteligencia
Computacional (SICO’07), pages 217–224, 2007. 22
[48] A. Fernandez, H. Langseth, T. D. Nielsen, and A. Salmeron.
MTE-based parameter learning using incomplete data. Technical report,
Department of Statistics and Applied Mathematics, University of Almerıa,
Spain, 2010. 24
[49] A. Fernandez, H. Langseth, T. D. Nielsen, and A. Salmeron. Pa-
rameter learning in MTE networks using incomplete data. In Proceedings of
the Fifth European Workshop on Probabilistic Graphical Models (PGM’10),
pages 137–145, 2010. 24
[50] A. Fernandez, M. Morales, C. Rodrıguez, and A. Salmeron. A
system for relevance analysis of performance indicators in higher education
using Bayesian networks. Knowledge and Information Systems, In press,
2011. 25
[51] A. Fernandez, M. Morales, and A. Salmeron. Tree augmented
naïve Bayes for regression using mixtures of truncated exponentials: Appli-
cations to higher education management. IDA'07. Lecture Notes in Com-
puter Science, 4723:59–69, 2007. 23, 45, 55, 57, 63, 65, 130
[52] A. Fernandez, J. D. Nielsen, and A. Salmeron. Learning naïve
Bayes regression models with missing data using mixtures of truncated
exponentials. In Proceedings of the Fourth European Workshop on Proba-
bilistic Graphical Models, pages 105–112, 2008. 23
[53] A. Fernandez, J. D. Nielsen, and A. Salmeron. Learning Bayesian
networks for regression from incomplete databases. International Journal
of Uncertainty, Fuzziness and Knowledge Based Systems, 18:69–86, 2010.
23, 74
[54] A. Fernandez, R. Rumı, and A. Salmeron. Answering queries in
hybrid bayesian networks using importance sampling. Decision Support
Systems (submitted), 2011. 24
[55] A. Fernandez and A. Salmeron. Extension of Bayesian network clas-
sifiers to regression problems. IBERAMIA’08. Lecture Notes in Artificial
Intelligence, 5290:83–92, 2008. 23, 55, 63
[56] S. Ferrier. Mapping spatial pattern in biodiversity for regional conserva-
tion planning: where to from here? Systematic Biology, 51:331–363, 2002.
126
[57] E. Frank, L. Trigg, G. Holmes, and I. H. Witten. Technical note:
Naive Bayes for regression. Machine Learning, 41:5–25, 2000. 23, 30, 51
[58] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network
classifiers. Machine Learning, 29:131–163, 1997. 31, 32, 41, 56, 126
[59] N. Friedman and M. Goldszmidt. Discretizing continuous attributes
while learning Bayesian networks. In Proceedings of the 13th International
Conference on Machine Learning (ICML), pages 157–165. Morgan Kauf-
mann Publishers, 1996. 13, 126
[60] J. A. Gámez. Abductive inference in Bayesian networks: A review. In J. A. Gámez, S. Moral, and A. Salmerón, editors, Advances in Bayesian Networks, pages 101–120. Springer Verlag, 2004. 122
[61] J. A. Gámez, R. Rumí, and A. Salmerón. Unsupervised naïve Bayes for data clustering with mixtures of truncated exponentials. In Proceedings of the 3rd European Workshop on Probabilistic Graphical Models (PGM'06), pages 123–132, 2006. 22, 57
[62] J. A. Gámez and A. Salmerón. Predicción del valor genético en ovejas de raza manchega usando técnicas de aprendizaje automático. In Actas de las VI Jornadas de Transferencia de Tecnología en Inteligencia Artificial, pages 71–80. Paraninfo, 2005. 23, 30
[63] S. García and F. Herrera. An extension on “Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research, 9:2677–2694, 2008. 66
[64] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov
chain Monte Carlo in practice. Chapman and Hall, London, UK, 1996. 24
[65] V. Gogate and R. Dechter. Approximate inference algorithms for
hybrid Bayesian networks with discrete constraints. In Proceedings of the
21st Conference on Uncertainty in Artificial Intelligence (UAI-05), pages
209–216, 2005. 24
[66] C. H. Graham, S. Ferrier, F. Huettman, C. Moritz, and A. T.
Peterson. New developments in museum-based informatics and applica-
tions in biodiversity analysis. Trends in Ecology and Evolution, 19:497–503,
2004. 126
[67] C. H. Graham, S. R. Ron, J. C. Santos, C. J. Schneider, and
C. Moritz. Integrating phylogenetics and environmental niche models to
explore speciation mechanisms in dendrobatid frogs. Evolution, 58:1781–
1793, 2004. 126
[68] A. Guisan, O. Broennimann, R. Engler, M. Vust, N. G. Yoccoz, A. Lehmann, and N. E. Zimmermann. Using niche-based models to improve the sampling of rare species. Conservation Biology, 20:501–511, 2006. 126
[69] A. Guisan, S. B. Weiss, and A. D. Weiss. GLM versus CCA spatial modeling of plant species distribution. Plant Ecology, 143:107–122, 1999. 126
[70] A. Guisan and W. Thuiller. Predicting species distribution: offering
more than simple habitats models. Ecology Letters, 8:993–1009, 2005. 126
[71] A. Guisan and N. E. Zimmermann. Predictive habitat distribution
models in ecology. Ecological Modelling, 135:147–186, 2000. 125, 126
[72] L. D. Hernández, S. Moral, and A. Salmerón. A Monte Carlo algorithm for probabilistic propagation in belief networks based on importance sampling and stratified simulation techniques. International Journal of Approximate Reasoning, 18:53–91, 1998. 108, 110
[73] I. Inza, P. Larrañaga, R. Etxeberria, and B. Sierra. Feature subset selection by Bayesian network-based optimization. Artificial Intelligence, 123:157–184, 2000. 127
[74] IUCN. Red list of threatened species (version 2009.1). 2009. 127
[75] F. V. Jensen. Bayesian networks and decision graphs. Springer, 2001. 9,
30
[76] F. V. Jensen, S. L. Lauritzen, and K. G. Olesen. Bayesian updat-
ing in causal probabilistic networks by local computation. Computational
Statistics Quarterly, 4:269–282, 1990. 7
[77] F. V. Jensen and T. D. Nielsen. Bayesian Networks and Decision
Graphs. Springer, 2007. 7, 9, 126, 148
[78] R. Jin, Y. Breitbart, and C. Muoh. Data discretization unification.
Knowledge and Information Systems, 19:1–29, 2009. 150
[79] R. Kohavi. A study of cross-validation and bootstrap for accuracy estima-
tion and model selection. In Proceedings of Fourteenth International Joint
Conference on Artificial Intelligence, pages 1137–1143. Morgan Kaufmann,
1995. 131
[80] D. Koller, U. Lerner, and D. Anguelov. A general algorithm for
approximate inference and its application to hybrid Bayes nets. In Proceed-
ings of the 15th Conference on Uncertainty in Artificial Intelligence, pages
324–333, 1999. 24
[81] D. Kozlov and D. Koller. Nonuniform dynamic discretization in hy-
brid networks. In Proceedings of the 13th Conference on Uncertainty in
Artificial Intelligence, pages 302–313, 1997. 13, 24
[82] K. Kristensen and I. A. Rasmussen. The use of a Bayesian network in the design of a decision support system for growing malting barley without use of pesticides. Computers and Electronics in Agriculture, 33:197–217, 2002. 113
[83] S. Kullback. Information theory and statistics. John Wiley & Son, 1959.
128
[84] S. Kullback and R. A. Leibler. On information and sufficiency. Annals
of Mathematical Statistics, 22:79–86, 1951. 128
[85] H. Langseth, T. D. Nielsen, R. Rumí, and A. Salmerón. Parameter estimation in mixtures of truncated exponentials. In Proceedings of the Fourth European Workshop on Probabilistic Graphical Models (PGM'08), pages 169–176, 2008. 23, 57
[86] H. Langseth, T. D. Nielsen, R. Rumí, and A. Salmerón. Inference in hybrid Bayesian networks. Reliability Engineering and Systems Safety, 94:1499–1509, 2009. 25
[87] H. Langseth, T. D. Nielsen, R. Rumí, and A. Salmerón. Maximum likelihood learning of conditional MTE distributions. ECSQARU'09. Lecture Notes in Artificial Intelligence, 5590:240–251, 2009. 24, 73, 74
[88] H. Langseth, T. D. Nielsen, R. Rumí, and A. Salmerón. Parameter estimation and model selection in mixtures of truncated exponentials. International Journal of Approximate Reasoning, 51:485–498, 2010. 23, 73, 74, 76, 78
[89] P. Larrañaga and S. Moral. Probabilistic graphical models in artificial intelligence. Applied Soft Computing, 11:1511–1528, 2011. 3
[90] S. L. Lauritzen. Propagation of probabilities, means and variances in mixed graphical association models. Journal of the American Statistical Association, 87:1098–1108, 1992. 14, 15, 20
[91] S. L. Lauritzen and F. Jensen. Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11:191–203, 2001. 14, 15, 20, 21
[92] A. Lehmann, J. McC. Overton, and M. P. Austin. Regression
models for spatial prediction: their role for biodiversity and conservation.
Biodiversity and Conservation, 11:2085–2092, 2002. 125
[93] A. Lehmann, J. McM. Overton, and J. R. Leathwick. GRASP: generalized regression analysis and spatial prediction. Ecological Modelling, 160:165–183, 2002. 126
[94] U. Lerner, E. Segal, and D. Koller. Exact inference in networks with discrete children of continuous parents. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI-01), pages 319–328, 2001. 20
[95] R. J. A. Little and D. B. Rubin. Statistical analysis with missing data.
John Wiley & Sons, New York, 1987. 95
[96] P. Lucas. Restricted Bayesian network structure learning. In Proceedings
of the 1st European Workshop on Probabilistic Graphical Models (PGM’02),
pages 217–232, 2002. 33, 45
[97] D. Lunn, A. Thomas, N. Best, and D. J. Spiegelhalter. WinBUGS
- A Bayesian modelling framework: Concepts, structure, and extensibility.
Statistics and Computing, 10:325–337, 2000. 95
[98] M. Luoto, J. Pöyry, R. K. Heikkinen, and K. Saarinen. Uncertainty of bioclimate envelope models based on the geographical distribution of species. Global Ecology and Biogeography, 14:575–584, 2005. 126
[99] A. L. Madsen and F. V. Jensen. Lazy propagation: a junction tree in-
ference algorithm based on lazy evaluation. Artificial Intelligence, 113:203–
245, 1999. 12
[100] R. Maggini, A. Lehmann, E. Zimmermann, and A. Guisan. Im-
proving generalized regression analysis for the spatial prediction of forest
communities. Journal of Biogeography, 33:1729–1749, 2009. 126
[101] S. Manel, H. Ceri Williams, and S. J. Ormerod. Evaluating
presence-absence models in ecology: the need to account for prevalence.
Journal of Applied Ecology, 38:921–931, 2001. 126
[102] G. F. Midgley, L. Hannah, D. Millar, W. Thuiller, and
A. Booth. Developing regional and species-level assessments of climate
change impacts on biodiversity in the Cape floristic region. Biological Con-
servation, 112:87–97, 2003. 126
[103] J. Miller and J. Franklin. Modelling distribution of four vegetation
alliances using generalized linear models and classification trees with spatial
dependence. Ecological Modelling, 157:227–247, 2002. 126
BIBLIOGRAPHY 183
[104] D. Mladenić. Feature selection for dimensionality reduction. In Subspace, Latent Structure and Feature Selection, volume 3940 of Lecture Notes in Computer Science, pages 84–102. Springer, 2006. 127
[105] G. G. Moisen and T. S. Frescino. Comparing five modeling tech-
niques for predicting forest characteristics. Ecological Modelling, 157:209–
225, 2002. 126
[106] S. Moral, R. Rumí, and A. Salmerón. Mixtures of truncated exponentials in hybrid Bayesian networks. ECSQARU'01. Lecture Notes in Artificial Intelligence, 2143:135–143, 2001. 16, 17, 21, 73, 75, 167
[107] S. Moral, R. Rumí, and A. Salmerón. Estimating mixtures of truncated exponentials from data. In Proceedings of the First European Workshop on Probabilistic Graphical Models, pages 156–167, 2002. 21, 129
[108] S. Moral, R. Rumí, and A. Salmerón. Approximating conditional MTE distributions by means of mixed trees. ECSQARU'03. Lecture Notes in Artificial Intelligence, 2711:173–183, 2003. 21, 57
[109] S. Moral and A. Salmerón. Dynamic importance sampling in Bayesian networks based on probability trees. International Journal of Approximate Reasoning, 38:245–261, 2005. 99
[110] M. Morales, C. Rodríguez, and A. Salmerón. Selective naïve Bayes predictor using mixtures of truncated exponentials. In Proceedings of the International Conference on Mathematical and Statistical Modelling (ICMSM'06), 2006. 23
[111] M. Morales, C. Rodríguez, and A. Salmerón. Selective naïve Bayes for regression using mixtures of truncated exponentials. International Journal of Uncertainty, Fuzziness and Knowledge Based Systems, 15:697–716, 2007. 23, 30, 36, 37, 39, 40, 43, 51, 55, 57, 63, 64, 148, 149, 162
[112] K. P. Murphy. A variational approximation for Bayesian networks with discrete and continuous latent variables. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 467–475, 1999. 20, 95
[113] M. Nardo, M. Saisana, A. Saltelli, and S. Tarantola. Handbook
on constructing composite indicators: Methodology and user guide. OECD,
European Commission, Joint Research Centre, 2008. 148, 160
[114] D. Nilsson. An efficient algorithm for finding the M most probable config-
urations in Bayesian networks. Statistics and Computing, 9:159–173, 1998.
160
[115] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-supervised learning. MIT Press, 2006. 57
[116] S. M. Olmsted. On representing and solving decision problems. PhD
thesis, Stanford University, 1983. 21
[117] J. Pearl. Probabilistic reasoning in intelligent systems. Morgan-
Kaufmann (San Mateo), 1988. 7, 30, 159
[118] R. G. Pearson, W. Thuiller, M. B. Araújo, E. Martínez-Meyer, L. Brotons, C. McClean, L. Miles, P. Segurado, T. P. Dawson, and D. C. Lees. Model-based uncertainty in species range prediction. Journal of Biogeography, 33:1704–1711, 2006. 126
[119] A. Pérez, P. Larrañaga, and I. Inza. Supervised classification with conditional Gaussian networks: Increasing the structure complexity from naïve Bayes. International Journal of Approximate Reasoning, 43:1–25, 2006. 40, 42
[120] A. T. Peterson. Predicting the geography of species’ invasions via ecological niche modelling. The Quarterly Review of Biology, 78:419–433, 2003. 126
[121] A. T. Peterson, M. A. Ortega-Huerta, J. J. Bartley, V. Sánchez-Cordero, J. Soberón, R. H. Buddemeier, and D. R. B. Stockwell. Future projections for Mexican fauna under global climate change scenarios. Nature, 416:626–629, 2002. 126
[122] J. M. Pleguezuelos, R. Márquez, and M. Lizana, editors. Atlas y libro rojo de los anfibios y reptiles de España. Dirección General de la Conservación de la Naturaleza–Asociación Herpetológica Española, Madrid, second edition, 2002. In Spanish. 127
[123] C. A. Pollino, A. K. White, and B. T. Hart. Examination of conflicts and improved strategies for the management of an endangered eucalypt species using Bayesian networks. Ecological Modelling, 201:37–59, 2007. 126
[124] L. Qiu, Y. Li, and X. Wu. Protecting business intelligence and customer
privacy while outsourcing data mining tasks. Knowledge and Information
Systems, 17:99–120, 2008. 163
[125] J. R. Quinlan. Learning with continuous classes. In Proceedings of the
5th Australian Joint Conference on Artificial Intelligence, pages 343–348,
Singapore, 1992. 51
[126] F. T. Ramos and F. G. Cozman. Anytime anyspace probabilistic in-
ference. International Journal of Approximate Reasoning, 38:53–80, 2005.
100
[127] V. Romero, R. Rumí, and A. Salmerón. Structural learning of Bayesian networks with mixtures of truncated exponentials. In Proceedings of the 2nd European Workshop on Probabilistic Graphical Models (PGM'04), pages 177–184, Leiden, The Netherlands, 2004. 21
[128] V. Romero, R. Rumí, and A. Salmerón. Learning hybrid Bayesian networks using mixtures of truncated exponentials. International Journal of Approximate Reasoning, 42:54–68, 2006. 21, 23, 36, 73, 74, 129, 130
[129] R. Y. Rubinstein. Simulation and the Monte Carlo Method. Wiley (New York), 1981. 102
[130] R. Ruiz, J. Riquelme, and J. S. Aguilar-Ruiz. Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recognition, 39:2383–2392, 2006. 39
[131] R. Rumí. Modelos de redes bayesianas con variables discretas y continuas. PhD thesis, Universidad de Almería, 2003. 129, 130
[132] R. Rumí. Kernel methods in Bayesian networks. In Proceedings of the 1st International Mediterranean Congress of Mathematics, pages 135–149, 2005. 57
[133] R. Rumí and A. Salmerón. Penniless propagation with mixtures of truncated exponentials. ECSQARU'05. Lecture Notes in Computer Science, 3571:39–50, 2005. 21
[134] R. Rumí and A. Salmerón. Approximate probability propagation with mixtures of truncated exponentials. International Journal of Approximate Reasoning, 45:191–210, 2007. 21, 43, 99, 100, 112, 113, 114, 115
[135] R. Rumí, A. Salmerón, and S. Moral. Estimating mixtures of truncated exponentials in hybrid Bayesian networks. Test, 15:397–421, 2006. 21, 23, 36, 57, 63, 73, 74, 129, 130
[136] M. Sahami. Learning limited dependence Bayesian classifiers. In Second International Conference on Knowledge Discovery in Databases, pages 335–338, 1996. 33, 46
[137] A. Salmerón, A. Cano, and S. Moral. Importance sampling in Bayesian networks using probability trees. Computational Statistics and Data Analysis, 34:387–413, 2000. 99, 107, 108
[138] G. Schwarz. Estimating the dimension of a model. The Annals of Statis-
tics, 6:461–464, 1978. 74, 78
[139] P. Segurado and M. B. Araújo. An evaluation of methods for modelling species distribution. Journal of Biogeography, 31:1555–1568, 2004. 125
[140] R. D. Shachter. Evaluating influence diagrams. Operations Research,
34:871–882, 1986. 21
[141] P. P. Shenoy. Inference in hybrid Bayesian networks using mixtures
of Gaussians. In Proceedings of the 22nd Conference on Uncertainty in
Artificial Intelligence (UAI-06), pages 428–436, 2006. 21
[142] P. P. Shenoy. Some issues in using mixtures of polynomials for inference
in hybrid Bayesian networks. Working Paper, No. 323, School of Business,
University of Kansas, October 2010. 24
[143] P. P. Shenoy and G. Shafer. Axioms for probability and belief function
propagation. In Uncertainty in Artificial Intelligence 4, pages 169–198,
1990. 7, 12
[144] P. P. Shenoy and J. West. Inference in hybrid Bayesian networks with
deterministic variables. ECSQARU’09. Lecture Notes in Computer Science,
5590:46–58, 2009. 22
[145] P. P. Shenoy and J. C. West. Mixtures of polynomials in hybrid
Bayesian networks with deterministic variables. In Proceedings of the 8th
Workshop on Uncertainty Processing (WUPES’09), pages 202–212, 2009.
19, 24
[146] P. P. Shenoy and J. C. West. Inference in hybrid Bayesian networks
using mixtures of polynomials. International Journal of Approximate Rea-
soning, In Press, 2010. 24
[147] C. S. Smith, A. L. Howes, B. Price, and C. A. McAlpine. Using
Bayesian belief network to predict suitable habitat of an endangered mam-
mal – the Julia Creek dunnart (Sminthopsis douglasi). Biological Conser-
vation, 139:333–347, 2007. 126
[148] P. Spirtes, C. Glymour, and R. Scheines. Causation, prediction and search, volume 81 of Lecture Notes in Statistics. Springer Verlag, 1993. 150
[149] StatLib. http://www.statlib.org, 1999. Department of Statistics.
Carnegie Mellon University. 51, 66
[150] M. Stone. Cross-validatory choice and assessment of statistical predic-
tions. Journal of the Royal Statistical Society. Series B (Methodological),
36:111–147, 1974. 51, 66, 130
[151] M. A. Tanner and W. H. Wong. The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82:528–550, 1987. 56, 58, 59
[152] W. Thuiller. Patterns and uncertainties of species’ range shifts under
climate change. Global Change Biology, 10:2020–2027, 2004. 126
[153] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical
Analysis of Finite Mixture Distributions. John Wiley, New York, 1985. 21
[154] L. Uusitalo. Advantages and challenges of Bayesian networks in environ-
mental modelling. Ecological Modelling, 203:312–318, 2007. 126
[155] Y. Wang and I. H. Witten. Induction of model trees for predicting
continuous cases. In Proceedings of the Poster Papers of the European
Conference on Machine Learning, pages 128–137, 1997. 51, 66
[156] Z. Wang, Q. Wang, and D. Wang. Bayesian network based business
information retrieval model. Knowledge and Information Systems, 20:63–
79, 2009. 148
[157] B. A. Wintle, J. Elith, and J. M. Potts. Fauna habitat modelling
and mapping: a review and case study in the Lower Hunter Central Coast
region of NSW. Austral Ecology, 30:719–738, 2005. 126
[158] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning
Tools and Techniques (Second Edition). Morgan Kaufmann, 2005. 22, 39,
51, 66
[159] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg. Top 10 algorithms in data mining. Knowledge and Information Systems, 14:1–37, 2008. 151
[160] M. Zaffalon. Credible classification for environmental problems. Envi-
ronmental Modelling & Software, 20:1003–1012, 2005. 126
Appendix A
Notation and mathematical
derivations
Vector notation for the Gaussian
\[
\mathbf{Z}=\begin{pmatrix}Z_1\\ \vdots\\ Z_d\end{pmatrix}\;\Rightarrow\;\mathbf{Z}^{\mathsf T}=[Z_1,\ldots,Z_d]
\qquad
\mathbf{z}=\begin{pmatrix}z_1\\ \vdots\\ z_d\end{pmatrix}\;\Rightarrow\;\mathbf{z}^{\mathsf T}=[z_1,\ldots,z_d]
\]

\[
\mathbf{Y}=\begin{pmatrix}Y_1\\ \vdots\\ Y_c\end{pmatrix}\;\Rightarrow\;\mathbf{Y}^{\mathsf T}=[Y_1,\ldots,Y_c]
\qquad
\mathbf{y}=\begin{pmatrix}y_1\\ \vdots\\ y_c\end{pmatrix}\;\Rightarrow\;\mathbf{y}^{\mathsf T}=[y_1,\ldots,y_c]
\]

\[
\mathbf{l}_{z,j}=\begin{pmatrix}l^{(1)}_{z,j}\\ \vdots\\ l^{(c)}_{z,j}\end{pmatrix}\;\Rightarrow\;\mathbf{l}^{\mathsf T}_{z,j}=\bigl[l^{(1)}_{z,j},\ldots,l^{(c)}_{z,j}\bigr]
\]

\[
f(x_j\mid z,\mathbf{y})\sim\mathcal{N}\bigl(\mu=\mathbf{l}^{\mathsf T}_{z,j}\mathbf{y}+\eta_{z,j},\;\sigma^2_{z,j}\bigr)=\mathcal{N}\left(\mu=\bigl[l^{(1)}_{z,j},\ldots,l^{(c)}_{z,j}\bigr]\begin{pmatrix}y_1\\ \vdots\\ y_c\end{pmatrix}+\eta_{z,j},\;\sigma^2_{z,j}\right)
\]

To ease notation, the intercept $\eta_{z,j}$ is absorbed into an extended coefficient vector, and $\mathbf{y}$ is extended with a trailing 1:

\[
\mathbf{l}_{z,j}=\bigl[\mathbf{l}^{\mathsf T}_{z,j},\eta_{z,j}\bigr]^{\mathsf T}=\bigl[l^{(1)}_{z,j},\ldots,l^{(c)}_{z,j},\eta_{z,j}\bigr]^{\mathsf T}=\begin{pmatrix}l^{(1)}_{z,j}\\ \vdots\\ l^{(c)}_{z,j}\\ \eta_{z,j}\end{pmatrix}\;\Rightarrow\;\mathbf{l}^{\mathsf T}_{z,j}=\bigl[l^{(1)}_{z,j},\ldots,l^{(c)}_{z,j},\eta_{z,j}\bigr]
\]

\[
\mathbf{y}=[\mathbf{y}^{\mathsf T},1]^{\mathsf T}=[y_1,\ldots,y_c,1]^{\mathsf T}=\begin{pmatrix}y_1\\ \vdots\\ y_c\\ 1\end{pmatrix}\;\Rightarrow\;\mathbf{y}^{\mathsf T}=[y_1,\ldots,y_c,1]
\]

so that

\[
\mathbf{l}^{\mathsf T}_{z,j}\mathbf{y}+\eta_{z,j}=\mathbf{l}^{\mathsf T}_{z,j}\mathbf{y}\;\Rightarrow\;\bigl[l^{(1)}_{z,j},\ldots,l^{(c)}_{z,j}\bigr]\begin{pmatrix}y_1\\ \vdots\\ y_c\end{pmatrix}+\eta_{z,j}=\bigl[l^{(1)}_{z,j},\ldots,l^{(c)}_{z,j},\eta_{z,j}\bigr]\begin{pmatrix}y_1\\ \vdots\\ y_c\\ 1\end{pmatrix}
\]

So,

\[
f(x_j\mid z,\mathbf{y})\sim\mathcal{N}\bigl(\mu=\mathbf{l}^{\mathsf T}_{z,j}\mathbf{y},\;\sigma^2_{z,j}\bigr)=\mathcal{N}\left(\mu=\bigl[l^{(1)}_{z,j},\ldots,l^{(c)}_{z,j},\eta_{z,j}\bigr]\begin{pmatrix}y_1\\ \vdots\\ y_c\\ 1\end{pmatrix},\;\sigma^2_{z,j}\right)
\]
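The augmented-vector convention above (absorbing $\eta_{z,j}$ into $\mathbf{l}_{z,j}$ and appending a 1 to $\mathbf{y}$) can be checked numerically. A minimal sketch with illustrative values (`c`, `l`, `eta`, `y` are placeholders, not taken from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 3                      # number of continuous parents Y_1, ..., Y_c
l = rng.normal(size=c)     # coefficients l_{z,j} (illustrative)
eta = 0.7                  # intercept eta_{z,j}
y = rng.normal(size=c)     # one configuration y of the parents

# Plain form of the conditional mean: l^T y + eta
mu_plain = l @ y + eta

# Extended form: append eta to l and 1 to y, giving a single inner product
l_aug = np.append(l, eta)
y_aug = np.append(y, 1.0)
mu_aug = l_aug @ y_aug

assert np.isclose(mu_plain, mu_aug)
```

The single inner product is what keeps the matrix expressions in the following sections compact.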
Vector notation for the logistic
\[
\mathbf{w}_{z,j}=\begin{pmatrix}w^{(1)}_{z,j}\\ \vdots\\ w^{(c)}_{z,j}\end{pmatrix}\;\Rightarrow\;\mathbf{w}^{\mathsf T}_{z,j}=\bigl[w^{(1)}_{z,j},\ldots,w^{(c)}_{z,j}\bigr]
\]

\[
\sigma_{z,j}(\mathbf{y})=\frac{1}{1+\exp\bigl\{\mathbf{w}^{\mathsf T}_{z,j}\mathbf{y}+b_{z,j}\bigr\}}=\frac{1}{1+\exp\Bigl\{\bigl[w^{(1)}_{z,j},\ldots,w^{(c)}_{z,j}\bigr]\begin{pmatrix}y_1\\ \vdots\\ y_c\end{pmatrix}+b_{z,j}\Bigr\}}
\]

As before, the bias $b_{z,j}$ is absorbed into an extended weight vector:

\[
\mathbf{w}_{z,j}=[\mathbf{w}^{\mathsf T}_{z,j},b_{z,j}]^{\mathsf T}=\bigl[w^{(1)}_{z,j},\ldots,w^{(c)}_{z,j},b_{z,j}\bigr]^{\mathsf T}=\begin{pmatrix}w^{(1)}_{z,j}\\ \vdots\\ w^{(c)}_{z,j}\\ b_{z,j}\end{pmatrix}\;\Rightarrow\;\mathbf{w}^{\mathsf T}_{z,j}=\bigl[w^{(1)}_{z,j},\ldots,w^{(c)}_{z,j},b_{z,j}\bigr]
\]

\[
\mathbf{w}^{\mathsf T}_{z,j}\mathbf{y}+b_{z,j}=\mathbf{w}^{\mathsf T}_{z,j}\mathbf{y}\;\Rightarrow\;\bigl[w^{(1)}_{z,j},\ldots,w^{(c)}_{z,j}\bigr]\begin{pmatrix}y_1\\ \vdots\\ y_c\end{pmatrix}+b_{z,j}=\bigl[w^{(1)}_{z,j},\ldots,w^{(c)}_{z,j},b_{z,j}\bigr]\begin{pmatrix}y_1\\ \vdots\\ y_c\\ 1\end{pmatrix}
\]

\[
\sigma_{z,j}(\mathbf{y})=\frac{1}{1+\exp\bigl\{\mathbf{w}^{\mathsf T}_{z,j}\mathbf{y}\bigr\}}=\frac{1}{1+\exp\Bigl\{\bigl[w^{(1)}_{z,j},\ldots,w^{(c)}_{z,j},b_{z,j}\bigr]\begin{pmatrix}y_1\\ \vdots\\ y_c\\ 1\end{pmatrix}\Bigr\}}
\]
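The same device works for the logistic weights: folding the bias $b_{z,j}$ into $\mathbf{w}_{z,j}$ leaves $\sigma_{z,j}(\mathbf{y})$ unchanged. A sketch with illustrative values:

```python
import numpy as np

def sigma_zj(w, b, y):
    """Logistic mixing weight 1 / (1 + exp{w^T y + b})."""
    return 1.0 / (1.0 + np.exp(w @ y + b))

rng = np.random.default_rng(1)
c = 4
w = rng.normal(size=c)   # weights w_{z,j} (illustrative)
b = -0.3                 # bias b_{z,j}
y = rng.normal(size=c)

# Extended vectors [w^T, b]^T and [y^T, 1]^T give the same value
w_aug = np.append(w, b)
y_aug = np.append(y, 1.0)
val_aug = 1.0 / (1.0 + np.exp(w_aug @ y_aug))

assert np.isclose(sigma_zj(w, b, y), val_aug)
```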
Vector notation for the expectations
\[
\mathrm{E}(X_j\mathbf{Y}\mid d_i)=\mathrm{E}\left(X_j\begin{pmatrix}Y_1\\ \vdots\\ Y_c\\ 1\end{pmatrix}\;\middle|\;d_i\right)=\begin{pmatrix}\mathrm{E}(X_jY_1\mid d_i)\\ \vdots\\ \mathrm{E}(X_jY_c\mid d_i)\\ \mathrm{E}(X_j\mid d_i)\end{pmatrix}
\]

\[
\mathrm{E}(\mathbf{Y}\mathbf{Y}^{\mathsf T}\mid d_i)=\mathrm{E}\left(\begin{pmatrix}Y_1\\ \vdots\\ Y_c\\ 1\end{pmatrix}[Y_1,\ldots,Y_c,1]\;\middle|\;d_i\right)=\begin{pmatrix}\mathrm{E}(Y_1^2\mid d_i)&\cdots&\mathrm{E}(Y_1Y_c\mid d_i)&\mathrm{E}(Y_1\mid d_i)\\ \vdots&\ddots&\vdots&\vdots\\ \mathrm{E}(Y_cY_1\mid d_i)&\cdots&\mathrm{E}(Y_c^2\mid d_i)&\mathrm{E}(Y_c\mid d_i)\\ \mathrm{E}(Y_1\mid d_i)&\cdots&\mathrm{E}(Y_c\mid d_i)&1\end{pmatrix}
\]

\[
2\,\mathbf{l}^{\mathsf T}_{z,j}\mathrm{E}(X_j\mathbf{Y}\mid d_i)=2\,\bigl[l^{(1)}_{z,j},\ldots,l^{(c)}_{z,j},\eta_{z,j}\bigr]\begin{pmatrix}\mathrm{E}(X_jY_1\mid d_i)\\ \vdots\\ \mathrm{E}(X_jY_c\mid d_i)\\ \mathrm{E}(X_j\mid d_i)\end{pmatrix}
\]

\[
\mathbf{l}^{\mathsf T}_{z,j}\mathrm{E}(\mathbf{Y}\mathbf{Y}^{\mathsf T}\mid d_i)\,\mathbf{l}_{z,j}=\bigl[l^{(1)}_{z,j},\ldots,l^{(c)}_{z,j},\eta_{z,j}\bigr]\,\mathrm{E}(\mathbf{Y}\mathbf{Y}^{\mathsf T}\mid d_i)\begin{pmatrix}l^{(1)}_{z,j}\\ \vdots\\ l^{(c)}_{z,j}\\ \eta_{z,j}\end{pmatrix}
\]
Mathematical derivations
All the relevant mathematical derivations used in the calculation of the updating
rules in the M-step, as well as in obtaining the sufficient statistics in the
E-step, are shown next:
\[
\begin{aligned}
\frac{\partial Q}{\partial \mu_j}
&= \sum_{i=1}^{N}\frac{\partial}{\partial\mu_j}\,\mathrm{E}\bigl[\log f(X_j\mid d_i)\bigr]\\
&= \sum_{i=1}^{N}\mathrm{E}\Bigl[\frac{\partial}{\partial\mu_j}\log\exp\Bigl\{-\frac{1}{2}\Bigl(\frac{X_j-\mu_j}{\sigma_j}\Bigr)^2\Bigr\}\Bigm| d_i\Bigr]\\
&= \sum_{i=1}^{N}\mathrm{E}\Bigl[\frac{1}{2\sigma_j^2}\,2(X_j-\mu_j)\Bigm| d_i\Bigr]\\
&= \frac{1}{\sigma_j^2}\sum_{i=1}^{N}\mathrm{E}\bigl[(X_j-\mu_j)\mid d_i\bigr]
= \frac{1}{\sigma_j^2}\sum_{i=1}^{N}\bigl(\mathrm{E}[X_j\mid d_i]-\mu_j\bigr)\\
&= \frac{1}{\sigma_j^2}\Bigl[\sum_{i=1}^{N}\mathrm{E}[X_j\mid d_i]-N\mu_j\Bigr]
\end{aligned}
\tag{A.1}
\]
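Setting (A.1) to zero gives the M-step update $\mu_j=\frac{1}{N}\sum_{i=1}^{N}\mathrm{E}[X_j\mid d_i]$. A sketch with simulated E-step expectations (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
sigma_j = 1.3   # current value of sigma_j (any positive value works here)

# E-step output: one expected value E[X_j | d_i] per record d_i; for a fully
# observed record this is simply the observed x_j (simulated values here).
exp_x = rng.normal(loc=2.0, scale=1.5, size=N)

# M-step update from (A.1): the average of the expected values
mu_hat = exp_x.sum() / N

# The gradient (A.1) indeed vanishes at the updated value
grad = (exp_x.sum() - N * mu_hat) / sigma_j**2
assert np.isclose(grad, 0.0)
```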
\[
\begin{aligned}
\frac{\partial Q}{\partial \mathbf{l}_{z,j}}
&= \sum_{i=1}^{N}\frac{\partial}{\partial\mathbf{l}_{z,j}}\mathrm{E}\bigl[\log f(X_j\mid Z,\mathbf{Y})\mid d_i\bigr]\\
&= \sum_{i=1}^{N}\frac{\partial}{\partial\mathbf{l}_{z,j}}\sum_{z\in Z}\int_{\mathbf{y}}\int_{x_j}f(z,\mathbf{y},x_j\mid d_i)\log f(x_j\mid z,\mathbf{y})\,d\mathbf{y}\,dx_j\\
&= \sum_{i=1}^{N}\frac{\partial}{\partial\mathbf{l}_{z,j}}\int_{\mathbf{y}}\int_{x_j}f(z,\mathbf{y},x_j\mid d_i)\log f(x_j\mid z,\mathbf{y})\,d\mathbf{y}\,dx_j\\
&= \sum_{i=1}^{N}\frac{\partial}{\partial\mathbf{l}_{z,j}}\int_{\mathbf{y}}\int_{x_j}f(\mathbf{y},x_j\mid d_i,z)\,f(z\mid d_i)\log f(x_j\mid z,\mathbf{y})\,d\mathbf{y}\,dx_j\\
&= \sum_{i=1}^{N}f(z\mid d_i)\frac{\partial}{\partial\mathbf{l}_{z,j}}\mathrm{E}\bigl[\log f(X_j\mid z,\mathbf{Y})\mid d_i,z\bigr]\\
&= \sum_{i=1}^{N}f(z\mid d_i)\,\mathrm{E}\Bigl[\frac{\partial}{\partial\mathbf{l}_{z,j}}\log\exp\Bigl\{-\frac{1}{2}\Bigl(\frac{X_j-\mathbf{l}^{\mathsf T}_{z,j}\mathbf{Y}}{\sigma_{z,j}}\Bigr)^2\Bigr\}\Bigm| d_i,z\Bigr]\\
&= \sum_{i=1}^{N}f(z\mid d_i)\,\mathrm{E}\Bigl[-\frac{1}{2}\frac{\partial}{\partial\mathbf{l}_{z,j}}\Bigl(\frac{X_j-\mathbf{l}^{\mathsf T}_{z,j}\mathbf{Y}}{\sigma_{z,j}}\Bigr)^2\Bigm| d_i,z\Bigr]\\
&= \sum_{i=1}^{N}f(z\mid d_i)\,\mathrm{E}\Bigl[\frac{1}{2\sigma^2_{z,j}}\,2\bigl(X_j-\mathbf{l}^{\mathsf T}_{z,j}\mathbf{Y}\bigr)\mathbf{Y}^{\mathsf T}\Bigm| d_i,z\Bigr]\\
&= \frac{1}{\sigma^2_{z,j}}\Bigl[\sum_{i=1}^{N}f(z\mid d_i)\,\mathrm{E}\bigl[\bigl(X_j\mathbf{Y}^{\mathsf T}-\mathbf{l}^{\mathsf T}_{z,j}\mathbf{Y}\mathbf{Y}^{\mathsf T}\bigr)\mid d_i,z\bigr]\Bigr]\\
&= \frac{1}{\sigma^2_{z,j}}\Bigl[\sum_{i=1}^{N}f(z\mid d_i)\,\mathrm{E}(X_j\mathbf{Y}^{\mathsf T}\mid d_i,z)-\mathbf{l}^{\mathsf T}_{z,j}\sum_{i=1}^{N}f(z\mid d_i)\,\mathrm{E}(\mathbf{Y}\mathbf{Y}^{\mathsf T}\mid d_i,z)\Bigr]
\end{aligned}
\tag{A.2}
\]
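Setting (A.2) to zero yields a set of weighted normal equations, $\bigl[\sum_i f(z\mid d_i)\,\mathrm{E}(\mathbf{Y}\mathbf{Y}^{\mathsf T}\mid d_i,z)\bigr]\mathbf{l}_{z,j}=\sum_i f(z\mid d_i)\,\mathrm{E}(X_j\mathbf{Y}\mid d_i,z)$, i.e. a weighted least-squares problem. A sketch assuming complete data, so the expectations collapse to observed values (all names and values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N, c = 400, 2
Y = rng.normal(size=(N, c))
Y_aug = np.hstack([Y, np.ones((N, 1))])   # augmented with the constant 1
l_true = np.array([1.5, -2.0, 0.5])       # [l^(1), l^(2), eta], illustrative
x = Y_aug @ l_true + 0.01 * rng.normal(size=N)
w = rng.uniform(0.1, 1.0, size=N)         # f(z | d_i): soft assignments to z

# Weighted normal equations from (A.2):
#   [sum_i w_i Y_i Y_i^T] l = sum_i w_i x_i Y_i
A = (Y_aug * w[:, None]).T @ Y_aug
b = (Y_aug * w[:, None]).T @ x
l_hat = np.linalg.solve(A, b)

assert np.allclose(l_hat, l_true, atol=0.05)
```

With incomplete data the products $x_i\mathbf{Y}_i$ and $\mathbf{Y}_i\mathbf{Y}_i^{\mathsf T}$ are replaced by the E-step expectations of Appendix A, but the linear system has the same shape.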
\[
\begin{aligned}
\frac{\partial Q}{\partial \sigma_j}
&= \sum_{i=1}^{N}\frac{\partial}{\partial\sigma_j}\mathrm{E}\bigl[\log f(X_j\mid d_i)\bigr]\\
&= \sum_{i=1}^{N}\mathrm{E}\Bigl[\frac{\partial}{\partial\sigma_j}\log\Bigl(\frac{1}{\sigma_j\sqrt{2\pi}}\exp\Bigl\{-\frac{1}{2}\Bigl(\frac{X_j-\mu_j}{\sigma_j}\Bigr)^2\Bigr\}\Bigr)\Bigm| d_i\Bigr]\\
&= \sum_{i=1}^{N}\mathrm{E}\Bigl[\Bigl(\frac{\partial}{\partial\sigma_j}\log\frac{1}{\sigma_j\sqrt{2\pi}}-\frac{1}{2}\frac{\partial}{\partial\sigma_j}\Bigl(\frac{X_j-\mu_j}{\sigma_j}\Bigr)^2\Bigr)\Bigm| d_i\Bigr]\\
&= \sum_{i=1}^{N}\mathrm{E}\Bigl[\Bigl(-\frac{\partial}{\partial\sigma_j}\log\sigma_j-\frac{1}{2}(X_j-\mu_j)^2\frac{\partial}{\partial\sigma_j}\frac{1}{\sigma_j^2}\Bigr)\Bigm| d_i\Bigr]\\
&= \sum_{i=1}^{N}\mathrm{E}\Bigl[\Bigl(-\frac{1}{\sigma_j}-\frac{1}{2}(X_j-\mu_j)^2\,\frac{-2}{\sigma_j^3}\Bigr)\Bigm| d_i\Bigr]\\
&= \sum_{i=1}^{N}\mathrm{E}\Bigl[\Bigl(\frac{(X_j-\mu_j)^2}{\sigma_j^3}-\frac{1}{\sigma_j}\Bigr)\Bigm| d_i\Bigr]
= \sum_{i=1}^{N}\Bigl(\mathrm{E}\Bigl[\frac{(X_j-\mu_j)^2}{\sigma_j^3}\Bigm| d_i\Bigr]-\frac{1}{\sigma_j}\Bigr)\\
&= \frac{1}{\sigma_j^3}\sum_{i=1}^{N}\mathrm{E}\bigl[(X_j-\mu_j)^2\mid d_i\bigr]-\frac{N}{\sigma_j}\\
&= \frac{1}{\sigma_j^3}\Bigl(\sum_{i=1}^{N}\mathrm{E}[X_j^2\mid d_i]+N\mu_j^2-2\mu_j\sum_{i=1}^{N}\mathrm{E}[X_j\mid d_i]\Bigr)-\frac{N}{\sigma_j}
\end{aligned}
\tag{A.3}
\]
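Setting (A.3) to zero gives $\sigma_j^2=\frac{1}{N}\bigl(\sum_i\mathrm{E}[X_j^2\mid d_i]+N\mu_j^2-2\mu_j\sum_i\mathrm{E}[X_j\mid d_i]\bigr)$. With complete data this reduces to the usual maximum-likelihood variance, which a short sketch can confirm (illustrative data):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, scale=2.0, size=1000)
N = len(x)

# With complete data the E-step expectations collapse:
#   E[X_j | d_i] = x_i  and  E[X_j^2 | d_i] = x_i^2.
mu = x.mean()
sigma2_hat = (np.sum(x**2) + N * mu**2 - 2 * mu * np.sum(x)) / N

# This is exactly the (population) maximum-likelihood variance
assert np.isclose(sigma2_hat, np.var(x))
```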
\[
\begin{aligned}
\frac{\partial Q}{\partial \sigma_{z,j}}
&= \sum_{i=1}^{N}f(z\mid d_i)\frac{\partial}{\partial\sigma_{z,j}}\mathrm{E}\bigl[\log f(X_j\mid z,\mathbf{Y})\mid d_i,z\bigr]\\
&= \sum_{i=1}^{N}f(z\mid d_i)\frac{\partial}{\partial\sigma_{z,j}}\mathrm{E}\Bigl[\log\Bigl(\frac{1}{\sigma_{z,j}\sqrt{2\pi}}\exp\Bigl\{-\frac{1}{2}\Bigl(\frac{X_j-\mathbf{l}^{\mathsf T}_{z,j}\mathbf{Y}}{\sigma_{z,j}}\Bigr)^2\Bigr\}\Bigr)\Bigm| d_i,z\Bigr]\\
&= \sum_{i=1}^{N}f(z\mid d_i)\,\mathrm{E}\Bigl[\Bigl(-\frac{1}{2}\frac{\partial}{\partial\sigma_{z,j}}\Bigl(\frac{X_j-\mathbf{l}^{\mathsf T}_{z,j}\mathbf{Y}}{\sigma_{z,j}}\Bigr)^2-\frac{\partial}{\partial\sigma_{z,j}}\log\bigl(\sigma_{z,j}\sqrt{2\pi}\bigr)\Bigr)\Bigm| d_i,z\Bigr]\\
&= \sum_{i=1}^{N}f(z\mid d_i)\,\mathrm{E}\Bigl[\frac{(X_j-\mathbf{l}^{\mathsf T}_{z,j}\mathbf{Y})^2}{\sigma_{z,j}^3}-\frac{1}{\sigma_{z,j}}\Bigm| d_i,z\Bigr]\\
&= \sum_{i=1}^{N}f(z\mid d_i)\,\mathrm{E}\Bigl[\frac{(X_j-\mathbf{l}^{\mathsf T}_{z,j}\mathbf{Y})^2-\sigma_{z,j}^2}{\sigma_{z,j}^3}\Bigm| d_i,z\Bigr]\\
&= \frac{1}{\sigma_{z,j}^3}\sum_{i=1}^{N}f(z\mid d_i)\,\mathrm{E}\bigl[(X_j-\mathbf{l}^{\mathsf T}_{z,j}\mathbf{Y})^2-\sigma_{z,j}^2\mid d_i,z\bigr]\\
&= \frac{1}{\sigma_{z,j}^3}\Bigl[\sum_{i=1}^{N}f(z\mid d_i)\,\mathrm{E}\bigl[(X_j-\mathbf{l}^{\mathsf T}_{z,j}\mathbf{Y})^2\mid d_i,z\bigr]-\sigma_{z,j}^2\sum_{i=1}^{N}f(z\mid d_i)\Bigr]
\end{aligned}
\tag{A.4}
\]
\[
\begin{aligned}
\mathrm{E}(X_j\mid d_i)
&= \int_{x_a}^{x_b}x_j\,f(x_j\mid d_i)\,dx_j
= \int_{x_a}^{x_b}x_j\Bigl(a_0+\sum_{j=1}^{m}a_j\exp\{b_jx_j\}\Bigr)dx_j\\
&= a_0\int_{x_a}^{x_b}x_j\,dx_j+\sum_{j=1}^{m}a_j\int_{x_a}^{x_b}x_j\exp\{b_jx_j\}\,dx_j\\
&= \frac{a_0}{2}\bigl(x_b^2-x_a^2\bigr)+\sum_{j=1}^{m}\frac{a_j}{b_j^2}\Bigl(\exp\{b_jx_b\}(b_jx_b-1)-\exp\{b_jx_a\}(b_jx_a-1)\Bigr)
\end{aligned}
\tag{A.5}
\]
\[
\begin{aligned}
\mathrm{E}(X_j^2\mid d_i)
&= \int_{x_a}^{x_b}x_j^2\,f(x_j\mid d_i)\,dx_j
= \int_{x_a}^{x_b}x_j^2\Bigl(a_0+\sum_{j=1}^{m}a_j\exp\{b_jx_j\}\Bigr)dx_j\\
&= a_0\int_{x_a}^{x_b}x_j^2\,dx_j+\sum_{j=1}^{m}a_j\int_{x_a}^{x_b}x_j^2\exp\{b_jx_j\}\,dx_j\\
&= \frac{a_0}{3}\bigl(x_b^3-x_a^3\bigr)+\sum_{j=1}^{m}\frac{a_j}{b_j^3}\Bigl(\exp\{b_jx_b\}\bigl(b_jx_b(b_jx_b-2)+2\bigr)-\exp\{b_jx_a\}\bigl(b_jx_a(b_jx_a-2)+2\bigr)\Bigr)
\end{aligned}
\tag{A.6}
\]
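The closed forms (A.5) and (A.6) can be sanity-checked against numerical integration. A sketch with an illustrative (unnormalised) MTE potential; the identities hold for any coefficients with $b_j\neq 0$:

```python
import numpy as np

# An MTE potential on [xa, xb]: f(x) = a0 + sum_j a_j exp{b_j x}
# (coefficients chosen for illustration only)
xa, xb = 0.0, 1.0
a0 = 0.4
a = np.array([0.3, -0.1])
b = np.array([1.2, 2.0])

def mte(x):
    return a0 + np.sum(a * np.exp(np.outer(x, b)), axis=1)

# Closed forms (A.5) and (A.6)
ex = (a0 / 2) * (xb**2 - xa**2) + np.sum(
    a / b**2 * (np.exp(b * xb) * (b * xb - 1) - np.exp(b * xa) * (b * xa - 1)))
ex2 = (a0 / 3) * (xb**3 - xa**3) + np.sum(
    a / b**3 * (np.exp(b * xb) * (b * xb * (b * xb - 2) + 2)
                - np.exp(b * xa) * (b * xa * (b * xa - 2) + 2)))

# Trapezoidal-rule check of both integrals
xs = np.linspace(xa, xb, 20001)

def trap(vals):
    return np.sum((vals[1:] + vals[:-1]) * np.diff(xs)) / 2

assert np.isclose(ex, trap(xs * mte(xs)), atol=1e-6)
assert np.isclose(ex2, trap(xs**2 * mte(xs)), atol=1e-6)
```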
\[
\begin{aligned}
\mathrm{E}(XY\mid d_i)
&= \int_{x_a}^{x_b}\int_{y_a}^{y_b}xy\,f(x,y\mid d_i)\,dy\,dx
= \int_{x_a}^{x_b}\int_{y_a}^{y_b}xy\Bigl(a_0+\sum_{j=1}^{m}a_j\exp\{b_jy+c_jx\}\Bigr)dy\,dx\\
&= a_0\int_{x_a}^{x_b}\int_{y_a}^{y_b}xy\,dy\,dx+\sum_{j=1}^{m}a_j\int_{x_a}^{x_b}x\exp\{c_jx\}\,dx\int_{y_a}^{y_b}y\exp\{b_jy\}\,dy\\
&= \frac{a_0}{4}\bigl(y_b^2-y_a^2\bigr)\bigl(x_b^2-x_a^2\bigr)\\
&\quad+\sum_{j=1}^{m}\frac{a_j}{b_j^2c_j^2}\Bigl(\exp\{b_jy_b\}(b_jy_b-1)-\exp\{b_jy_a\}(b_jy_a-1)\Bigr)\Bigl(\exp\{c_jx_b\}(c_jx_b-1)-\exp\{c_jx_a\}(c_jx_a-1)\Bigr)
\end{aligned}
\tag{A.7}
\]
A new version of $\mathrm{E}(XY\mid d_i)$ is shown next for the case in which the exponent contains only one variable, i.e., $b_j=0$:
\[
\begin{aligned}
\mathrm{E}(XY\mid d_i)
&= \int_{x_a}^{x_b}\int_{y_a}^{y_b}xy\,f(x,y\mid d_i)\,dy\,dx
= \int_{x_a}^{x_b}\int_{y_a}^{y_b}xy\Bigl(a_0+\sum_{j=1}^{m}a_j\exp\{c_jx\}\Bigr)dy\,dx\\
&= a_0\int_{x_a}^{x_b}\int_{y_a}^{y_b}xy\,dy\,dx+\sum_{j=1}^{m}a_j\int_{x_a}^{x_b}x\exp\{c_jx\}\,dx\int_{y_a}^{y_b}y\,dy\\
&= \frac{a_0}{4}\bigl(y_b^2-y_a^2\bigr)\bigl(x_b^2-x_a^2\bigr)+\sum_{j=1}^{m}\frac{a_j\bigl(y_b^2-y_a^2\bigr)}{2c_j^2}\Bigl(\exp\{c_jx_b\}(c_jx_b-1)-\exp\{c_jx_a\}(c_jx_a-1)\Bigr)
\end{aligned}
\tag{A.8}
\]
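Because the exponent is linear in each variable, the double integrals in (A.7) and (A.8) factorise into products of one-dimensional integrals. A numerical check of (A.7) with a single-term illustrative potential (all coefficients are placeholders):

```python
import numpy as np

# One-term MTE potential on [xa,xb] x [ya,yb]: f(x,y) = a0 + a1 exp{b1 y + c1 x}
xa, xb, ya, yb = 0.0, 1.0, 0.0, 2.0
a0, a1, b1, c1 = 0.2, 0.5, 0.8, -1.1

def g(t, k):
    # Antiderivative of t e^{k t}: e^{k t} (k t - 1) / k^2
    return np.exp(k * t) * (k * t - 1) / k**2

# Closed form (A.7): the double integral factorises into two 1-D pieces
exy = (a0 / 4) * (yb**2 - ya**2) * (xb**2 - xa**2) \
    + a1 * (g(yb, b1) - g(ya, b1)) * (g(xb, c1) - g(xa, c1))

# Numerical check with a 2-D trapezoidal rule
xs = np.linspace(xa, xb, 801)
ys = np.linspace(ya, yb, 801)
X, Y = np.meshgrid(xs, ys, indexing="ij")
F = X * Y * (a0 + a1 * np.exp(b1 * Y + c1 * X))

def trap(vals, grid):
    return np.sum((vals[..., 1:] + vals[..., :-1]) * np.diff(grid), axis=-1) / 2

num = trap(trap(F, ys), xs)
assert np.isclose(exy, num, atol=1e-4)
```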
Appendix B
Publications
The contents presented in this dissertation are the results of the following publications:

1. A. Fernández, R. Rumí and A. Salmerón (2011). Answering queries in hybrid Bayesian networks using importance sampling. Decision Support Systems (submitted).

2. A. Fernández, M. Morales, C. Rodríguez, and A. Salmerón (2011). A system for relevance analysis of performance indicators in higher education using Bayesian networks. Knowledge and Information Systems (in press).

3. A. Fernández, H. Langseth, T. D. Nielsen, and A. Salmerón (2010). Parameter learning in MTE networks using incomplete data. Proceedings of the Fifth European Workshop on Probabilistic Graphical Models (PGM'10), pages 137–145.

4. P. A. Aguilera, A. Fernández, F. Reche, and R. Rumí (2010). Hybrid Bayesian network classifiers: Application to species distribution models. Environmental Modelling & Software, 25:1630–1639.

5. A. Fernández, J. D. Nielsen, and A. Salmerón (2010). Learning Bayesian networks for regression from incomplete databases. International Journal of Uncertainty, Fuzziness and Knowledge Based Systems, 18:69–86.

6. A. Fernández and A. Salmerón (2008). Extension of Bayesian network classifiers to regression problems. IBERAMIA'08. Lecture Notes in Artificial Intelligence, 5290:83–92.

7. A. Fernández, J. D. Nielsen, and A. Salmerón (2008). Learning naïve Bayes regression models with missing data using mixtures of truncated exponentials. Proceedings of the Fourth European Workshop on Probabilistic Graphical Models (PGM'08), pages 105–112.

8. A. Fernández, M. Morales, and A. Salmerón (2007). Tree Augmented Naïve Bayes for Regression Using Mixtures of Truncated Exponentials: Application to Higher Education Management. IDA'07. Lecture Notes in Computer Science, 4723:59–69.

Other publications whose contents are not included in this dissertation are:

9. P. A. Aguilera, A. Fernández, R. Fernández, R. Rumí, and A. Salmerón (2010). Bayesian networks in environmental modelling. Environmental Modelling & Software (submitted).

10. A. Fernández and A. Salmerón (2008). BayesChess: A computer chess program based on Bayesian networks. Pattern Recognition Letters, 29:1154–1159.

11. A. Fernández, I. Flesch, and A. Salmerón (2007). Incremental supervised classification for the MTE distribution: a preliminary study. Proceedings of the CEDI'07-SICO'07, pages 217–224.

12. A. Fernández and A. Salmerón (2006). BayesChess: programa de ajedrez adaptativo basado en redes bayesianas. Proceedings of the CMPI'06, pages 613–624.