UNIVERSITY OF CALIFORNIA, IRVINE
Fixing and Extending the Multiplicative Approximation Scheme
THESIS
submitted in partial satisfaction of the requirements for the degree of
MASTER OF SCIENCE
in Information and Computer Science
by
Sidharth Shekhar
Thesis Committee:
Professor Alexander Ihler, Chair
Professor Xiaohui Xie
Professor Deva Ramanan
2009
© 2009 Sidharth Shekhar
The thesis of Sidharth Shekhar is approved:
Committee Chair
University of California, Irvine
2009
TABLE OF CONTENTS

LIST OF FIGURES
ACKNOWLEDGMENTS
ABSTRACT OF THE THESIS

1 Introduction
  1.1 Thesis outline
2 Background
  2.1 Notation
  2.2 Graphical Models
    2.2.1 Markov Network
    2.2.2 Bayesian Network
  2.3 Inference in Graphical Models
  2.4 Introduction to Bucket Elimination
3 The Multiplicative Approximation Scheme
  3.1 Preliminaries
  3.2 MAS Basics
  3.3 MAS Error Bounds
    3.3.1 Shift and Renormalize
    3.3.2 No Shifting of Functions
  3.4 Example
  3.5 Tracking ε in an Inference Procedure
  3.6 Need for new bounds
4 MAS Error Bounds using the L∞ norm
  4.1 The L∞ norm
  4.2 Tracking δ in an Inference Procedure
  4.3 New Bounds for MAS
5 Optimizing a Decomposition
  5.1 Optimizing a decomposition with disjoint subsets
  5.2 Optimizing a decomposition with overlapping subsets
6 Bucket Elimination with MAS
  6.1 The DynaDecomp Algorithm
  6.2 Improvements to DynaDecomp
    6.2.1 Using L∞ Bounds with DynaDecomp
    6.2.2 The DynaDecompPlus Algorithm
7 Experiments
  7.1 Experimental Setup
  7.2 Results on Ising Models
  7.3 Results on UAI Data
8 Conclusions
  8.1 Contributions
  8.2 Future Work
Bibliography
LIST OF FIGURES

2.1 A Simple Bayesian Network
2.2 (a) A belief network, (b) its induced graph along o = (A,E,D,C,B), and (c) its induced graph along o = (A,B,C,D,E)
2.3 A trace of algorithm elim-bel
2.4 The MIN-FILL algorithm
3.1 Schematic description of error tracking in an inference procedure using ε-decompositions
4.1 (a) A function f(X) and an example approximation f̂(X); (b) their log-ratio log f(X)/f̂(X), and the error measure δ
4.2 Schematic description of error tracking in an inference procedure using δ-decompositions
5.1 Schematic trace of generating overlapping decompositions
6.1 The DynaDecomp Algorithm
6.2 A trace of algorithm DynaDecomp
6.3 The DynaDecompPlus Algorithm
6.4 A trace of algorithm DynaDecompPlus
7.1 Bounds for logP(E) using DynaDecomp algorithm with no overlapping decompositions allowed
7.2 Bounds for logP(E) using DynaDecomp algorithm with overlapping decompositions allowed
7.3 Bounds for logP(E) using DynaDecompPlus algorithm
7.4 L∞ Bounds for logP(E) using DynaDecompPlus algorithm
7.5 Plot of accuracy versus maximum function size for the DynaDecomp algorithm
7.6 Plot of accuracy versus maximum function size for the DynaDecompPlus algorithm
7.7 L∞ Bounds for logP(E) vs Max Function Size for the DynaDecomp algorithm
7.8 L∞ Bounds for logP(E) vs Max Function Size for the DynaDecompPlus algorithm
7.9 Bounds for logP(E) using DynaDecompPlus algorithm with a maximum function size of 9
7.10 Bounds for logP(E) using DynaDecompPlus algorithm with a maximum function size of 12
7.11 Bounds for logP(E) using DynaDecompPlus algorithm with a maximum function size of 15
ACKNOWLEDGMENTS
I am indebted to all those who have helped me in finishing this thesis. In particular
I would like to thank my advisor, Prof. Alexander Ihler for his advice, his guidance,
his constructive criticism and his comments on the earlier drafts of my thesis. I
would also like to thank my committee members, Prof. Xiaohui Xie and Prof. Deva
Ramanan, for taking the time to review and comment on my work.
I would also like to thank my colleague Drew Frank for some insightful discussions
that greatly assisted my understanding of the problem.
I would also like to thank all my friends in our “IKG” group, in particular Shweta
Akhila and Vivek Kumar Singh, for ensuring that at least once a week I could take
my mind off my work and have some fun.
Last but not least I would like to thank Natasha Flerova for her comments, her
support and for always being there to cheer me up whenever I felt any despair.
ABSTRACT OF THE THESIS
Fixing and Extending the
Multiplicative Approximation Scheme
By
Sidharth Shekhar
Master of Science in Information and Computer Science
University of California, Irvine, 2009
Professor Alexander Ihler, Chair
We analyze the Multiplicative Approximation Scheme (MAS) for inference problems,
developed by Wexler and Meek in [24]. MAS decomposes the intermediate factors of
an inference procedure into factors over smaller sets of variables with a known error
and translates the errors into bounds on the results. We analyze the bounds proposed
in [24] and show that under certain conditions the bounds are incorrect. We then
derive corrected bounds which we show to be theoretically sound. We also derive
new bounds for the results of an inference procedure using an alternate error measure
based on the L∞ norm.
We analyze the methods for optimizing the decomposition of a factor. For decom-
positions containing disjoint sets, Wexler and Meek provide a closed form solution
that uses the L2 norm. We present a method that uses this closed form solution to
generate decompositions containing overlapping sets.
Further, we analyze Wexler and Meek’s DynaDecomp algorithm, which applies MAS
to bucket elimination [6]. We show that DynaDecomp provides little control over the
size of the largest factor generated. We present our own DynaDecompPlus algorithm
which uses a different strategy for decomposition. The new strategy allows us to
guarantee that we will never generate a factor larger than the specified threshold.
Finally we provide experimental results which show that the L∞ bounds are tighter
than the corrected bounds. We demonstrate the superiority of the DynaDecomp-
Plus algorithm by showing how it can perfectly limit the size of the factors whereas
DynaDecomp cannot do so.
Chapter 1
Introduction
Graphical models are a widely used representation framework in a variety of fields,
especially probability theory, statistics, and machine learning. Models such as Bayesian
networks, Markov networks, and constraint networks are commonly used for reasoning
with probabilistic and deterministic information. By using graphs, these models
provide an extremely intuitive way to represent the conditional independence struc-
ture among random variables.
This thesis focuses on the problem of performing bounded approximate inference
in graphical models. In particular, we analyze an approximation scheme called the
Multiplicative Approximation Scheme (MAS) developed by Wexler and Meek in [24].
We provide a brief description of the structure of the thesis along with our contribu-
tions in the following section.
1.1 Thesis outline
The thesis begins by providing relevant background information in chapter 2. Here
we introduce the notation that we will be following throughout the thesis. We then
provide a brief description of graphical models and of performing inference in graphical
models. We also provide a brief introduction to an exact inference algorithm called
bucket elimination [6].
Chapter 3 provides the details of MAS. We present the scheme as originally described
by Wexler and Meek. By providing simple counterexamples, we show that their
proposed bounds are incorrect for marginalization tasks in normalized probability
distributions, such as those represented by Bayesian networks or Markov random
fields. We then fix the error bounds by deriving corrected bounds, which we show
to be theoretically sound under all conditions.
Chapter 4 presents an alternate metric that can be used to measure the error of a
decomposition. The metric we use is based on the L∞ norm. We show how this
metric can be used with MAS and derive new bounds on the accuracy of the results
of an inference procedure.
In the preceding chapters, we assumed that the factored decomposition was already
given to us. Using this assumption, we measured errors and derived bounds on the
results. In chapter 5 we look at the methods for generating the decomposition of a
factor. We present the L2 optimization method developed by Wexler and Meek. It
has a closed form solution if the sets in the decomposition are disjoint. We present
a method that uses this closed form solution to generate decompositions that consist
of overlapping sets.
Chapter 6 looks at the application of MAS to the bucket elimination [6] inference
algorithm. We analyze the DynaDecomp algorithm developed by Wexler and Meek.
We present its major limitation, namely that it provides us with little control over the
size of the largest factor generated. We modify the algorithm and present our own
DynaDecompPlus algorithm. We show how we can easily limit the size of the largest
factor that the algorithm generates by using a smarter strategy for decomposition.
Chapter 7 provides the results of the experimental evaluations of the theories pre-
sented in the previous chapters. We present scenarios wherein the original MAS
bounds fail and wherein the corrected MAS bounds that we derived turn out to be
quite loose and impractical. We then compare the corrected MAS bounds and the
new L∞ bounds that we introduced in chapter 4 and show that the new bounds are
tighter than the corrected bounds. We also compare the performance of the original
DynaDecomp algorithm with our DynaDecompPlus algorithm. We demonstrate the
superiority of the DynaDecompPlus algorithm by showing how it can perfectly limit
the size of the factors whereas DynaDecomp cannot do so. We also look at instances
from the UAI’06 inference evaluation competition [1]. These instances clearly high-
light the problems with the corrected bounds and the DynaDecomp algorithm. The
DynaDecomp algorithm had trouble solving many of these instances because it ran
into memory issues; therefore, we only present the results of running the DynaDecompPlus
algorithm. Comparing the two bounds on these UAI instances, we see that only in some
instances are the new bounds tight and useful. The corrected bounds, however, are
extremely loose on all the instances we ran.
Chapter 8 concludes the thesis and also looks at some possible directions for future
research.
Chapter 2
Background
This chapter provides all of the relevant background information that is required for
this thesis. We begin in section 2.1 by introducing the notation we will be following
throughout the thesis. We follow this with a brief description of graphical models and
inference problems in graphical models in sections 2.2 and 2.3 respectively. Finally
we provide a brief introduction to the bucket elimination algorithm in section 2.4.
2.1 Notation
Unless specified otherwise, we follow the conventions specified below:

1. We will make no distinction between a variable and a set of variables in this
thesis. Uppercase letters will be used to denote both variables and sets
of variables. For example, we can have a variable X1, or a set of variables
X = {X1, X2, . . . , Xn}.
2. Given a function or probability distribution over a set of variables X, F(X) or
P(X), we denote an approximation to this function as F̃(X) or P̃(X).
2.2 Graphical Models
“Graphical models are a marriage between probability theory and graph theory. They
provide a natural tool for dealing with two problems that occur throughout applied
mathematics and engineering – uncertainty and complexity – and in particular they
are playing an increasingly important role in the design and analysis of machine
learning algorithms. Fundamental to the idea of a graphical model is the notion of
modularity – a complex system is built by combining simpler parts. Probability theory
provides the glue whereby the parts are combined, ensuring that the system as a whole
is consistent, and providing ways to interface models to data. The graph theoretic side
of graphical models provides both an intuitively appealing interface by which humans
can model highly-interacting sets of variables as well as a data structure that lends
itself naturally to the design of efficient general-purpose algorithms.
Many of the classical multivariate probabilistic systems studied in fields such as statis-
tics, systems engineering, information theory, pattern recognition and statistical me-
chanics are special cases of the general graphical model formalism – examples include
mixture models, factor analysis, hidden Markov models, Kalman filters and Ising mod-
els. The graphical model framework provides a way to view all of these systems as
instances of a common underlying formalism. This view has many advantages – in
particular, specialized techniques that have been developed in one field can be trans-
ferred between research communities and exploited more widely. Moreover, the graph-
ical model formalism provides a natural framework for the design of new systems.”
— Michael Jordan [13].
Probabilistic graphical models are graphs in which the nodes represent random vari-
ables, and the connectivity of the graph represents conditional independence assump-
tions. Two common types of graphical models are Markov networks, which use undi-
rected graphs, and Bayesian Networks, which use directed acyclic graphs. We describe
these two models in the following subsections.
2.2.1 Markov Network
A Markov network is a graphical model that encodes a probability distribution using
an undirected graph. Formally, we can define a Markov network as follows:
Definition 2.1. Given an undirected graph G = (V,E), a set of random variables
X = {Xv; v ∈ V } form a Markov network with respect to G if the joint density of X
can be factored into a product of functions defined as
P(X) = ∏_{C∈cl(G)} ψC(XC)
where cl(G) is the set of cliques, or fully connected subgraphs, of G.
Markov networks represent conditional independence in the following manner: two
sets of nodes A and B are conditionally independent given a third set C, if all paths
between the nodes in A and B are separated by a node in C. For more information
on conditional independence in Markov networks, see the book by Pearl [20].
One example of a Markov network is the log linear model. The log linear model is
given by
P(X) = (1/Z) exp( ∑_k wk^T · ψk(Xk) )
where Z is called the partition function and is defined as
Z = ∑_X exp( ∑_k wk^T · ψk(Xk) )
The partition function Z normalizes the distribution.
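To make the role of the partition function concrete, the following sketch (not from the thesis; the feature functions and weights are made up, with scalar weights standing in for the vector form wk^T · ψk) builds a tiny log linear model over two binary variables and verifies that dividing by Z yields a normalized distribution:

```python
import itertools
import math

# Illustrative log linear model over two binary variables X1, X2.
# Each entry is (feature function psi_k, weight w_k); values are made up.
features = [
    (lambda x1, x2: x1,      0.5),   # bias on X1
    (lambda x1, x2: x2,     -0.3),   # bias on X2
    (lambda x1, x2: x1 * x2, 1.2),   # pairwise interaction
]

def score(x1, x2):
    """Unnormalized log-probability: sum_k w_k * psi_k(x)."""
    return sum(w * f(x1, x2) for f, w in features)

# Partition function: Z sums exp(score) over all assignments.
Z = sum(math.exp(score(x1, x2))
        for x1, x2 in itertools.product((0, 1), repeat=2))

def prob(x1, x2):
    return math.exp(score(x1, x2)) / Z

total = sum(prob(x1, x2) for x1, x2 in itertools.product((0, 1), repeat=2))
```

Without the division by Z, the four values of exp(score) sum to Z rather than 1; the normalization is what turns the product of local factors into a probability distribution.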
2.2.2 Bayesian Network
A Bayesian network is a graphical model that encodes a probability distribution
using a directed acyclic graph (DAG). Formally, we can define a Bayesian network as
follows:
Definition 2.2. Given a directed acyclic graph G = (V,E) and a set of random
variables X = {Xv; v ∈ V }, G is a Bayesian network if the joint probability density
function can be written as
P(X) = ∏_{v∈V} P(Xv | Xpa(v))
where pa(v) is the set of parents of v.
Bayesian Networks have a more complicated characterization of conditional inde-
pendence than Markov networks, since Bayesian Networks take into account the di-
rectionality of the arcs. For a detailed description of conditional independence in
Bayesian Networks, refer to the book by Pearl [20].
An example of a Bayesian network is shown in Figure 2.1. At each node the con-
ditional probability distribution of the child node given the parent node has been
specified using a table.
Although Bayesian Networks have a more complicated characterization of conditional
[Figure: a Bayesian network over RAIN, SPRINKLER, and GRASS WET, in which RAIN is a parent of SPRINKLER and both RAIN and SPRINKLER are parents of GRASS WET; each node is annotated with its conditional probability table.]

Figure 2.1: A Simple Bayesian Network
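To make Definition 2.2 concrete, the sketch below builds the joint distribution of a sprinkler-style network as a product of local conditionals and checks that it is normalized. The CPT values are illustrative stand-ins and are not guaranteed to match Figure 2.1 exactly:

```python
import itertools

# Illustrative CPTs: P(RAIN), P(SPRINKLER | RAIN), P(GRASS_WET = 1 | S, R).
p_rain = {1: 0.2, 0: 0.8}
p_sprinkler = {0: {1: 0.4, 0: 0.6},      # outer key: rain value
               1: {1: 0.01, 0: 0.99}}
p_wet = {(0, 0): 0.0, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.99}  # key: (s, r)

def joint(r, s, w):
    """P(R, S, W) = P(R) * P(S | R) * P(W | S, R), as in Definition 2.2."""
    pw = p_wet[(s, r)]
    return p_rain[r] * p_sprinkler[r][s] * (pw if w == 1 else 1.0 - pw)

# Summing the product of the local conditionals over all assignments gives 1.
total = sum(joint(r, s, w) for r, s, w in itertools.product((0, 1), repeat=3))
```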
independence than Markov networks, they do have some advantages. The biggest
advantage is that they can represent causality, i.e., one can regard an arc from A to
B as indicating that A “causes” B. For a detailed study of the relationship between
directed and undirected graphical models and how to transform one to the other, we
direct the reader to the book by Pearl [20].
2.3 Inference in Graphical Models
Given a graphical model on a set of variables X, we can partition this set into
disjoint subsets H and E, where H represents the hidden variables and E represents the
evidence. The evidence consists of the set of variables that have been observed to
take on specific values, whereas the hidden variables are the variables for which no
observations are available. Hence, if we have no observations then the evidence set
E = ∅, and the set of hidden variables H = X. Within this setup, there are four
kinds of queries that are commonly made. They are as follows:
1. Likelihood of Evidence: Given a graphical model and some observations, a
natural question is how likely the observations are under the chosen model. In
other words, we would like to compute the probability of the evidence under
the model. This is computed as follows:
P(E) = ∑_H P(H,E)
2. Posterior Marginal probability: Given a variable A ∈ H, we would like to
evaluate the effect the evidence has on variable A, averaging out the effects of
all other variables. This is computed as follows:
P(A) = ∑_{H\A} P(H,E)
3. Most Probable Explanation (MPE): The MPE task is to find an assignment to
the set of hidden variables, denoted as H∗, such that the probability P (H∗) is
given as follows:
P(H∗) = max_H P(H,E)
where the assignment H∗ is defined as:
H∗ = argmax_H P(H,E)
4. Maximum a posteriori (MAP) probabilities: Given a set A ⊂ H, the MAP task
is to find an assignment to the set of variables in A, denoted as A∗, such that
the probability P (A∗) is given as follows:
P(A∗) = max_A ∑_{H\A} P(H,E)
where the assignment A∗ is defined as:
A∗ = argmax_A ∑_{H\A} P(H,E)
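All four queries reduce to sums and maximizations over the joint table, so on a toy model they can be computed by brute force. The sketch below uses an arbitrary illustrative joint over three binary variables with evidence X3 = 1; it is exponential in the number of variables and is only meant to make the definitions concrete:

```python
import itertools

# A toy normalized joint P(X1, X2, X3) over binary variables (values illustrative).
weights = {x: 1 + x[0] + 2 * x[1] + 3 * x[2]
           for x in itertools.product((0, 1), repeat=3)}
Z = sum(weights.values())
P = {x: w / Z for x, w in weights.items()}

# Evidence E: X3 = 1.  Hidden variables H: X1, X2.
hidden = list(itertools.product((0, 1), repeat=2))

# 1. Likelihood of evidence: P(E) = sum over H of P(H, E).
p_e = sum(P[(x1, x2, 1)] for x1, x2 in hidden)

# 2. Posterior marginal of X1: sum out the remaining hidden variable X2.
p_x1 = {x1: sum(P[(x1, x2, 1)] for x2 in (0, 1)) for x1 in (0, 1)}

# 3. MPE: the most probable full assignment of the hidden variables.
mpe = max(hidden, key=lambda h: P[(h[0], h[1], 1)])

# 4. MAP over the subset A = {X1}: maximize after summing out X2.
map_x1 = max((0, 1), key=lambda x1: p_x1[x1])
```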
Performing exact inference in a graphical model has been proven to be NP-Hard [4].
A few exact inference methods are available, such as variable elimination [6, 22],
clique tree propagation [12, 16, 23], and recursive conditioning [5]. However, all
these methods have complexity that is exponential in a graph parameter known as
treewidth [21]. In practice the complexities are represented in terms of a related
graph parameter called the induced width [8]. We provide more information regarding
induced width in section 2.4.
The exponential complexity of performing exact inference has led to the development
of a large number of approximate inference algorithms [15, 25, 14, 9, 10]. With
approximation schemes, it is useful to have error bounds on the result, so that we
have some guarantees on the exact value of the probabilities we are interested in.
Many of the approximation techniques developed do provide such guarantees.
In this thesis we focus on a general scheme called the Multiplicative Approximation
Scheme (MAS) [24]. MAS decomposes the intermediate factors of an inference proce-
dure into factors over smaller sets of variables with a known error. It then translates
the errors into bounds on the accuracy of the results. To test the accuracy and tight-
ness of the MAS bounds, we apply MAS to the bucket elimination algorithm. In the
next section we provide a brief introduction to the bucket elimination algorithm.
2.4 Introduction to Bucket Elimination
Bucket elimination is a framework that provides a unifying view of variable elimina-
tion algorithms for a variety of reasoning tasks.
A bucket elimination algorithm accepts as input an ordered set of variables and
a set of dependencies, such as the probability functions given by a graphical model.
Each variable is then associated with a bucket constructed as follows: all the functions
defined on variable Xi but not on higher index variables are placed into the bucket of
Xi. Once the buckets are created the algorithm processes them from last to first. It
computes new functions by applying an elimination operator to all the functions in the
bucket. The new functions, which summarize the effect of Xi on the rest of the problem,
are placed in the appropriate lower buckets.
The elimination operator depends on the task. For example, if we consider the task of
computing the likelihood of evidence or the posterior marginal, then the desired
elimination operator is summation. To eliminate a variable Xi we only need to look at
the functions that depend on Xi. Therefore, we look at the bucket of Xi and compute ∑_{Xi} F, where F is the product of all functions in the bucket of Xi. Similarly, for an
MPE task, the elimination operator will be maximization. The bucket-elimination
algorithm terminates when all buckets are processed, or when some stopping criterion
is satisfied.
To demonstrate the bucket elimination approach, we consider an example for com-
puting the posterior marginal. We apply the bucket elimination algorithm to the
network in Figure 2.2a, with the query variable A, the ordering o = (A,E,D,C,B),
and evidence E = 0. The posterior marginal P (A|E = 0) is given by
P(A|E = 0) = P(A,E = 0) / P(E = 0)
We can compute the joint probability P (A,E = 0) as
P(A,E = 0) = ∑_{E=0,D,C,B} P(A,B,C,D,E)
           = ∑_{E=0,D,C,B} P(A) P(C|A) P(E|B,C) P(D|A,B) P(B|A)
           = P(A) ∑_{E=0} ∑_D ∑_C P(C|A) ∑_B P(E|B,C) P(D|A,B) P(B|A)
The bucket elimination algorithm computes this sum from right to left using the
buckets, as shown below:
1. bucket B: hB(A,D,C,E) = ∑_B P(E|B,C) P(D|A,B) P(B|A)

2. bucket C: hC(A,D,E) = ∑_C P(C|A) hB(A,D,C,E)

3. bucket D: hD(A,E) = ∑_D hC(A,D,E)

4. bucket E: hE(A) = hD(A,E = 0)

5. bucket A: P(A,E = 0) = P(A) hE(A)
To compute the posterior marginal P (A|E = 0), we can just normalize the result in
bucket A since summing out A is equivalent to calculating P (E = 0). Therefore we
can calculate P (A|E = 0) as
P(A|E = 0) = α P(A) hE(A)
where α is a normalizing constant. A schematic trace of the algorithm is shown in
Figure 2.3.
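The bucket processing above can be sketched in a few lines of generic code. The following is a minimal, illustrative implementation (not the thesis's), representing factors as dictionaries over binary variables; the factor values are arbitrary positive numbers with the same scopes as the example, and E is clamped to 0 before elimination:

```python
import itertools

# A factor is (vars, table): vars is a tuple of names, table maps each
# binary assignment (a tuple aligned with vars) to a nonnegative value.
def multiply(f, g):
    (fv, ft), (gv, gt) = f, g
    av = tuple(dict.fromkeys(fv + gv))  # ordered union of the two scopes
    at = {}
    for x in itertools.product((0, 1), repeat=len(av)):
        a = dict(zip(av, x))
        at[x] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return av, at

def sum_out(f, var):
    fv, ft = f
    rv = tuple(v for v in fv if v != var)
    rt = {}
    for x, val in ft.items():
        key = tuple(xi for v, xi in zip(fv, x) if v != var)
        rt[key] = rt.get(key, 0.0) + val
    return rv, rt

def bucket_eliminate(factors, order):
    """Sum out the variables in `order`, bucket by bucket."""
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        prod = bucket[0]
        for f in bucket[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Factors with the same scopes as the example; values are arbitrary.
pA   = (('A',), {(0,): 0.6, (1,): 0.4})
pBA  = (('B', 'A'), {(0, 0): 0.7, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.8})
pCA  = (('C', 'A'), {(0, 0): 0.5, (0, 1): 0.9, (1, 0): 0.5, (1, 1): 0.1})
pDAB = (('D', 'A', 'B'),
        {x: 0.25 + 0.5 * sum(x) / 3 for x in itertools.product((0, 1), repeat=3)})
pEBC = (('E', 'B', 'C'),
        {x: 0.1 + 0.8 * x[0] * x[1] + 0.05 * x[2] for x in itertools.product((0, 1), repeat=3)})

# Clamp the evidence E = 0, then eliminate B, C, D, leaving a function of A.
evid = (('B', 'C'), {x: pEBC[1][(0,) + x] for x in itertools.product((0, 1), repeat=2)})
scope, marg = bucket_eliminate([pA, pBA, pCA, pDAB, evid], ['B', 'C', 'D'])
```

The result can be checked against the brute-force sum over all assignments, which is exactly what bucket elimination computes, bucket by bucket.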
The performance of bucket elimination algorithms can be predicted using a graph
parameter called induced width [8]. It is defined as follows
Definition 2.3. An ordered graph is a pair (G, o) where G is an undirected graph,
and o = (X1, . . . , Xn) is an ordering of nodes. The width of a node is the number
of the node’s neighbors that precede it in the ordering. The width of an ordering o is
the maximum width over all nodes. The induced width of an ordered graph, denoted
by w∗(o), is the width of the induced ordered graph obtained as follows: nodes are
processed from last to first; when node Xi is processed all its preceding neighbors are
connected. The induced width of a graph, denoted by w∗, is the minimal induced width
over all its orderings.
For example, to compute the induced width along any ordering of the graph in Fig-
ure 2.2a, we need to first convert it into an undirected graph. This can be easily
done as follows: First we traverse the graph and for every two unconnected nodes that
have a common child, we connect them using an undirected edge. In Figure 2.2a this
is shown using the undirected dashed edge. Next we convert all the directed edges
into undirected edges. This undirected graph is known as the moral graph. Figures
2.2b and 2.2c depict the induced graphs of the moral graph in Figure 2.2a along the
orderings o = (A,E,D,C,B) and o′ = (A,B,C,D,E), respectively. The dashed line
here represents the induced edges, which are the edges that were introduced in the
induced ordered graph but were absent in the moral graph. We can see that w∗(o) = 4
and w∗(o′) = 2.
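The induced-width computation in Definition 2.3 is easy to script. Below is an illustrative sketch (not from the thesis) that processes nodes last-to-first and connects each node's preceding neighbors; run on the moral graph of Figure 2.2a (whose edge list is reconstructed here from the factorization in the example) it reproduces w∗(o) = 4 and w∗(o′) = 2:

```python
def induced_width(edges, order):
    """Induced width of an undirected graph along an ordering.

    `order` lists the nodes first-to-last; nodes are processed from last
    to first, and the preceding neighbors of each node are connected.
    """
    adj = {v: set() for v in order}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    pos = {v: i for i, v in enumerate(order)}
    width = 0
    for v in reversed(order):
        prior = {u for u in adj[v] if pos[u] < pos[v]}
        width = max(width, len(prior))
        for u in prior:                 # connect the preceding neighbors
            adj[u] |= prior - {u}
    return width

# Moral graph of the network P(A)P(B|A)P(C|A)P(D|A,B)P(E|B,C):
# the original edges plus the moralizing edge B-C (parents of E).
moral = [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'D'),
         ('B', 'E'), ('C', 'E'), ('B', 'C')]
w1 = induced_width(moral, ['A', 'E', 'D', 'C', 'B'])   # expected 4
w2 = induced_width(moral, ['A', 'B', 'C', 'D', 'E'])   # expected 2
```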
The complexity of bucket elimination algorithms is time and space exponential in
w∗(o) [6]. Therefore, ideally we want to find the variable ordering that has smallest
induced width. Although this problem has been shown to be NP-Hard [2], there are
a few greedy heuristic algorithms that provide good orderings [7].
The greedy heuristic that we use in our experiments is known as the MIN-FILL
algorithm. The reason we choose MIN-FILL is because in most cases it has been
[Figure: three panels (a), (b), (c); panel (a) shows the belief network with factors P(A), P(B|A), P(C|A), P(D|A,B), and P(E|B,C); panels (b) and (c) show the induced graphs, with dashed lines marking the induced edges.]

Figure 2.2: (a) A belief network, (b) its induced graph along o = (A,E,D,C,B), and (c) its induced graph along o = (A,B,C,D,E)
[Figure: a trace of elim-bel showing the buckets of B, C, D, E, and A, the messages hB(A,D,C,E), hC(A,D,E), hD(A,E), and hE(A) passed to lower buckets, and the final result P(A|E = 0).]

Figure 2.3: A trace of algorithm elim-bel
Input: A graph G = (V,E), V = {V1, . . . , Vn}
Output: An ordering o of the nodes in V
for j = n down to 1 do
    R ← a node in V whose neighbor set requires the fewest fill edges
    Put R in position j: o[j] ← R
    Connect R's neighbors: E ← E ∪ {(Vi, Vj)} if (Vi, R) ∈ E and (Vj, R) ∈ E
    Remove R from the resulting graph: V ← V − {R}
end
Return the ordering o

Figure 2.4: The MIN-FILL algorithm
empirically demonstrated to be better than other greedy algorithms [7]. The MIN-FILL
algorithm orders nodes by their min-fill value, which is the number of edges that
must be added to the graph so that the node's neighbor set is fully connected.
The MIN-FILL algorithm is provided in Figure 2.4.
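A compact, illustrative implementation of the greedy rule in Figure 2.4 (ties broken arbitrarily; not the thesis's code) might look like this:

```python
def min_fill_order(edges):
    """Greedy MIN-FILL ordering: repeatedly pick the node whose neighbors
    need the fewest fill edges, connect those neighbors, remove the node.
    Nodes are assigned positions n down to 1, as in Figure 2.4."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def fill_cost(v):
        nbrs = list(adj[v])
        return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
                   if nbrs[j] not in adj[nbrs[i]])

    picked = []
    while adj:
        r = min(adj, key=fill_cost)           # fewest fill edges wins
        nbrs = list(adj[r])
        for i in range(len(nbrs)):            # add the fill edges
            for j in range(i + 1, len(nbrs)):
                adj[nbrs[i]].add(nbrs[j])
                adj[nbrs[j]].add(nbrs[i])
        for u in nbrs:                        # remove r from the graph
            adj[u].discard(r)
        del adj[r]
        picked.append(r)
    picked.reverse()                          # first node picked goes last
    return picked

# The moral graph of the earlier example, as an illustrative input.
moral = [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'D'),
         ('B', 'E'), ('C', 'E'), ('B', 'C')]
order = min_fill_order(moral)
```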
Chapter 3
The Multiplicative Approximation
Scheme
Wexler and Meek recently introduced a new approximation scheme called the Mul-
tiplicative Approximation Scheme (MAS) [24]. The primary goal of this scheme was
to provide good error bounds for computation of likelihood of evidence, marginal
probabilities and the maximum a posteriori probabilities in discrete directed and
undirected graphical models. In section 3.1 we describe the probability distribution
we will be using in our analysis. In section 3.2 we provide the basics of MAS and the
methodology by which the MAS error bounds are calculated.
3.1 Preliminaries
From this point in the thesis, unless specified otherwise, the graphical model we will
consider will be defined on a set of n variables X = {X1, . . . , Xn}, and will encode
the normalized probability distribution
P(X) = (1/Z) ∏_j ψj(Dj)
where Dj ⊆ X and 0 < ψj(Dj) < 1.
For convenience we will be using the notation
φj(Dj) = log ψj(Dj)
Finally, we will be denoting the set of evidence variables as E and the set of hidden
variables as H. Within this setup, the log joint probability can be represented by a
sum of factors as follows:
logP(X) = logP(H,E) = ∑_j φj(Dj)
3.2 MAS Basics
The fundamental operations of MAS are local approximations that are known as ε-
decompositions. Each decomposition has an error associated with it. The errors of
all such local decompositions are later combined into an error bound on the result.
Let us first define an ε-decomposition.
Definition 3.1. Given a set of variables X, and a function φ(X) that assigns real
values to every instantiation of X, a set of m functions φ̃l(Xl), l = 1 . . .m, where
Xl ⊆ X, is an ε-decomposition if⋃lXl = X, and for all instantiations of X the
17
following holds:
1/(1 + ε) ≤ (∑_l φ̃l(Xl)) / φ(X) ≤ 1 + ε    (3.1)
for some ε ≥ 0.
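Definition 3.1 can be checked numerically. The sketch below (illustrative; it assumes all φ values are positive, i.e. the underlying functions satisfy ψ > 1) computes the smallest ε for which a given set of functions is an ε-decomposition:

```python
import itertools

def epsilon_of(phi, parts, domains):
    """Smallest ε such that `parts` is an ε-decomposition of `phi` (Def. 3.1).

    phi: maps a full assignment dict to a positive value.
    parts: list of (vars, fn) pairs; fn takes the values of its own vars.
    domains: dict mapping each variable name to its domain.
    """
    names = sorted(domains)
    eps = 0.0
    for vals in itertools.product(*(domains[v] for v in names)):
        a = dict(zip(names, vals))
        ratio = sum(fn(*(a[v] for v in vs)) for vs, fn in parts) / phi(a)
        eps = max(eps, ratio - 1.0, 1.0 / ratio - 1.0)  # both sides of (3.1)
    return eps

# Example: approximate φ(x, y) = x + y + x*y by the disjoint pieces x and y.
phi = lambda a: a['x'] + a['y'] + a['x'] * a['y']
parts = [(('x',), lambda x: x), (('y',), lambda y: y)]
eps = epsilon_of(phi, parts, {'x': [1, 2], 'y': [1, 2]})
```

Here the worst instantiation is x = y = 2, where the decomposition gives 4 against φ = 8, a ratio of 1/2, so the smallest valid ε is 1.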
With this definition in mind, Wexler and Meek make the following claims:
Claim 3.1. The log of the joint probability P (H,E) can be approximated within a
multiplicative factor of 1+εmax using a set of εj-decompositions, where εmax = maxj εj.
(1/(1 + εmax)) logP(H,E) ≤ log P̃(H,E) ≤ (1 + εmax) logP(H,E)    (3.2)
where log P̃ (H,E) is defined as
log P̃(H,E) = ∑_{j,l} φ̃jl(Djl)
Claim 3.2. For a set A ⊆ H, the expression log ∑_A P(H,E) can be approximated
within a multiplicative factor of 1 + εmax using a set of εj-decompositions.

(1/(1 + εmax)) log ∑_A P(H,E) ≤ log ∑_A P̃(H,E) ≤ (1 + εmax) log ∑_A P(H,E)    (3.3)
To prove their claims, Wexler and Meek make the assumption that all functions satisfy
the property ψj(Dj) > 1, ∀Dj. However, as this is not always the case, they state
that in such scenarios we can first scale every function that does not satisfy the
property (equivalently, shift it in the log domain), i.e., we can generate a function
ψ′j(Dj) = zj · ψj(Dj) such that the property ψ′j(Dj) > 1, ∀Dj holds. Then, at the
end of the procedure, we renormalize by dividing the result by ∏_j zj. Wexler and
Meek make the claim that
doing the shift and renormalize procedure does not affect the results of claim 3.1 and
claim 3.2. In the next section, we derive the MAS error bounds and verify whether
performing the shift and renormalize procedure has no effect on the error bounds.
3.3 MAS Error Bounds
We divide this section into two sub-sections. In the first part, we perform the shift
and renormalize procedure and derive the MAS error bounds. In the second part, we
make the assumption that all the factors satisfy the property 0 < ψj(Dj) < 1. With
this assumption we derive MAS error bounds without performing any shifting.
3.3.1 Shift and Renormalize
Given our normalized probability distribution P (X), we create new functions ψ′j by
multiplying each function ψj with a constant zj such that
zj > 1 / min_{Dj} ψj(Dj)
Thus, we define
ψ′j(Dj) = zj · ψj(Dj)
which has the property that ψ′j(Dj) > 1, ∀Dj.
We now define the function f(X) to be the product of the ψ′js
f(X) = ∏_j ψ′j(Dj) = ∏_j zj · ψj(Dj)
and also define the shifted functions φ′j(Dj) as
φ′j(Dj) = log ψ′j(Dj) = log zj + φj(Dj)
Assume that for each factor ψ′j we have an εj-decomposition. With this assumption
we proceed to derive bounds for the log of the joint probability logP (H,E). We first
consider the approximation to log f(H,E) given by
log f̃(H,E) ≡ log ∏_{j,l} e^{φ̃′jl(Djl)} = ∑_{j,l} φ̃′jl(Djl)
Using the definition of an ε-decomposition (3.1) and the fact that {φ̃′jl} is an
εj-decomposition of φ′j, we have
log f̃(H,E) = ∑_{j,l} φ̃′jl(Djl) ≤ ∑_j (1 + εj) φ′j(Dj)
           ≤ (1 + εmax) ∑_j φ′j(Dj)
To perform the renormalization we divide f̃(H,E) by∏
j zj, which is equivalent to
subtracting∑
j log zj from log f̃(H,E). Hence, we get
log f̃(H,E)−∑j
log zj ≤ (1 + εmax)∑j
φ′j(Dj)−∑j
log zj (3.4)
We have,
P (X) =∏j
ψj(Dj) =f(X)∏j zj
So, we define P̃ (X) as
P̃ (X) =f̃(X)∏j zj
20
Using the definitions of P̃ (X) and φ′j(Dj) in equation (3.4), we get
log P̃ (H,E) ≤ (1 + εmax)∑j
(φj(Dj) + log zj)−∑j
log zj
= (1 + εmax)∑j
φj(Dj) + (1 + εmax)∑j
log zj −∑j
log zj
= (1 + εmax) logP (H,E) + εmax∑j
log zj (3.5)
Equation (3.5) provides us with an upper bound on the log of the joint probability.
Next we will derive a similar lower bound on the log of the joint probability.
Again, using the definition of an ε-decomposition (3.1), we have
\[
\log \tilde f(H,E) = \sum_{j,l} \tilde\phi'_{jl}(D_{jl}) \ge \sum_j \frac{1}{1+\varepsilon_j}\,\phi'_j(D_j) \ge \frac{1}{1+\varepsilon_{\max}} \sum_j \phi'_j(D_j).
\]
Subtracting Σj log zj from both sides and using the definition of φ′j(Dj), we have
\[
\begin{aligned}
\log \tilde P(H,E) &\ge \frac{1}{1+\varepsilon_{\max}} \sum_j \bigl(\phi_j(D_j) + \log z_j\bigr) - \sum_j \log z_j \\
&= \frac{1}{1+\varepsilon_{\max}} \sum_j \phi_j(D_j) + \frac{1}{1+\varepsilon_{\max}} \sum_j \log z_j - \sum_j \log z_j \\
&= \frac{1}{1+\varepsilon_{\max}} \log P(H,E) - \frac{\varepsilon_{\max}}{1+\varepsilon_{\max}} \sum_j \log z_j. \tag{3.6}
\end{aligned}
\]
Equation (3.6) provides us with a lower bound on the log of the joint probability.
Combined with the upper bound on the log of the joint probability (3.5), we can
state the following theorem:
Theorem 3.1. If the log of the joint probability P(H,E) is approximated using a set of εj-decompositions on the factors ψj of P, then we have the following bounds:
\[
\log \tilde P(H,E) \le (1+\varepsilon_{\max}) \log P(H,E) + \varepsilon_{\max} \sum_j \log z_j
\]
\[
\log \tilde P(H,E) \ge \frac{1}{1+\varepsilon_{\max}} \log P(H,E) - \frac{\varepsilon_{\max}}{1+\varepsilon_{\max}} \sum_j \log z_j
\]
where εmax = maxj εj.
Comparing theorem 3.1 with claim 3.1, we see that the bounds differ: the bounds in theorem 3.1 contain extra terms that are multiples of Σj log zj.

Having derived bounds for the joint probability, we proceed to derive bounds for marginal probabilities, i.e., we would like bounds on the expression log ΣA P(H,E), where A ⊆ H.
From theorem 3.1, we have
\[
\log \tilde P(H,E) \le \log \bigl[P(H,E)\bigr]^{(1+\varepsilon_{\max})} + \log \Bigl(\prod_j z_j\Bigr)^{\varepsilon_{\max}}.
\]
We would like to get a relation between ΣA P(H,E) and ΣA P̃(H,E). Therefore, we rearrange the terms on the right hand side of the above equation as follows:
\[
\log \tilde P(H,E) \le \log \Bigl[(P(H,E))^{(1+\varepsilon_{\max})} \cdot \Bigl(\prod_j z_j\Bigr)^{\varepsilon_{\max}}\Bigr]
= \log \Bigl[P(H,E) \cdot \Bigl(\prod_j z_j\Bigr)^{\frac{\varepsilon_{\max}}{1+\varepsilon_{\max}}}\Bigr]^{1+\varepsilon_{\max}}.
\]
Marginalizing out the set A from P̃(H,E), we get
\[
\log \sum_A \tilde P(H,E) \le \log \sum_A \Bigl[P(H,E) \cdot \Bigl(\prod_j z_j\Bigr)^{\frac{\varepsilon_{\max}}{1+\varepsilon_{\max}}}\Bigr]^{1+\varepsilon_{\max}}.
\]
To get the relation between ΣA P(H,E) and ΣA P̃(H,E) we need to move the summation on the right hand side inside the exponentiated part. For this we use the following lemma.
Lemma 3.1. If Cj ≥ 0 and r ≥ 1, then
\[
\sum_j (C_j)^r \le \Bigl(\sum_j C_j\Bigr)^r.
\]
If Cj ≥ 0 and 0 < r ≤ 1, then
\[
\sum_j (C_j)^r \ge \Bigl(\sum_j C_j\Bigr)^r.
\]
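A quick numerical sanity check of lemma 3.1 (the values of Cj and r below are arbitrary nonnegative choices):

```python
C = [0.3, 0.7, 1.2]
for r in (1.0, 2.5, 7.0):                      # r >= 1: sum of powers <= power of sum
    assert sum(c**r for c in C) <= sum(C)**r
for r in (0.2, 0.5, 1.0):                      # 0 < r <= 1: the inequality reverses
    assert sum(c**r for c in C) >= sum(C)**r
```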
Therefore, we have
\[
\begin{aligned}
\log \sum_A \tilde P(H,E) &\le \log \Bigl[\sum_A P(H,E) \cdot \Bigl(\prod_j z_j\Bigr)^{\frac{\varepsilon_{\max}}{1+\varepsilon_{\max}}}\Bigr]^{1+\varepsilon_{\max}} \\
&= \log \Bigl[\Bigl(\prod_j z_j\Bigr)^{\frac{\varepsilon_{\max}}{1+\varepsilon_{\max}}} \sum_A P(H,E)\Bigr]^{1+\varepsilon_{\max}} \\
&= (1+\varepsilon_{\max}) \log \sum_A P(H,E) + \varepsilon_{\max} \sum_j \log z_j. \tag{3.7}
\end{aligned}
\]
Equation (3.7) provides us with an upper bound on the log of the marginal probability.
We now perform a very similar analysis to obtain a lower bound.
Again, from theorem 3.1, we have
\[
\log \tilde P(H,E) \ge \log \bigl[P(H,E)\bigr]^{\frac{1}{1+\varepsilon_{\max}}} - \log \Bigl(\prod_j z_j\Bigr)^{\frac{\varepsilon_{\max}}{1+\varepsilon_{\max}}}.
\]
Rearranging the right hand side of the above equation, we have
\[
\log \tilde P(H,E) \ge \log \Biggl[\frac{(P(H,E))^{\frac{1}{1+\varepsilon_{\max}}}}{\bigl(\prod_j z_j\bigr)^{\frac{\varepsilon_{\max}}{1+\varepsilon_{\max}}}}\Biggr]
= \log \Biggl[\frac{P(H,E)}{\bigl(\prod_j z_j\bigr)^{\varepsilon_{\max}}}\Biggr]^{\frac{1}{1+\varepsilon_{\max}}}.
\]
Summing out a set A ⊆ H, we get
\[
\log \sum_A \tilde P(H,E) \ge \log \sum_A \Biggl[\frac{P(H,E)}{\bigl(\prod_j z_j\bigr)^{\varepsilon_{\max}}}\Biggr]^{\frac{1}{1+\varepsilon_{\max}}}.
\]
Again we would like to move the summation inside the exponentiated part. Using lemma 3.1 we have
\[
\begin{aligned}
\log \sum_A \tilde P(H,E) &\ge \log \Biggl[\sum_A \frac{P(H,E)}{\bigl(\prod_j z_j\bigr)^{\varepsilon_{\max}}}\Biggr]^{\frac{1}{1+\varepsilon_{\max}}} \\
&= \log \Biggl[\frac{1}{\bigl(\prod_j z_j\bigr)^{\varepsilon_{\max}}} \sum_A P(H,E)\Biggr]^{\frac{1}{1+\varepsilon_{\max}}} \\
&= \frac{1}{1+\varepsilon_{\max}} \log \sum_A P(H,E) - \frac{\varepsilon_{\max}}{1+\varepsilon_{\max}} \sum_j \log z_j. \tag{3.8}
\end{aligned}
\]
Equation (3.8) provides us with a lower bound on the log of the marginal probability.
Combined with the upper bound on the log of the marginal probability (3.7), we can
state the following theorem
Theorem 3.2. For a set A ⊆ H, if the expression log ΣA P(H,E) is approximated using a set of εj-decompositions on the factors ψj of P, then we have the following bounds:
\[
\log \sum_A \tilde P(H,E) \le (1+\varepsilon_{\max}) \log \sum_A P(H,E) + \varepsilon_{\max} \sum_j \log z_j
\]
\[
\log \sum_A \tilde P(H,E) \ge \frac{1}{1+\varepsilon_{\max}} \log \sum_A P(H,E) - \frac{\varepsilon_{\max}}{1+\varepsilon_{\max}} \sum_j \log z_j
\]
where εmax = maxj εj.
Theorem 3.2 differs from claim 3.2 in that it has extra terms that are multiples of Σj log zj.
Theorems 3.1 and 3.2 do not by themselves disprove claims 3.1 and 3.2. We disprove these claims in section 3.4, where we provide simple counterexamples that violate claims 3.1 and 3.2 but do not violate theorems 3.1 and 3.2.
3.3.2 No Shifting of Functions
Now let us consider the scenario when all our functions ψj have values that lie between
zero and one. An example of such a distribution is a Bayesian Network with no zero
probabilities. In this situation, we have the following theorem
Theorem 3.3. If 0 < ψj(Dj) < 1, ∀Dj, ∀j, and the log of the joint probability P(H,E) is approximated using a set of εj-decompositions on the factors ψj of P, then we have the following bounds:
\[
\frac{1}{1+\varepsilon_{\max}} \log P(H,E) \ge \log \tilde P(H,E) \ge (1+\varepsilon_{\max}) \log P(H,E) \tag{3.9}
\]
where εmax = maxj εj.

Proof:
\[
\log \tilde P(H,E) \equiv \log \prod_{j,l} e^{\tilde\phi_{jl}(D_{jl})} = \sum_{j,l} \tilde\phi_{jl}(D_{jl})
\]
Now we know that
\[
0 < \psi_j(D_j) < 1 \implies \phi_j(D_j) < 0.
\]
Using the above fact and the definition of an ε-decomposition (3.1), we have
\[
\log \tilde P(H,E) = \sum_{j,l} \tilde\phi_{jl}(D_{jl}) \ge \sum_j (1+\varepsilon_j)\,\phi_j(D_j) \ge (1+\varepsilon_{\max}) \sum_j \phi_j(D_j) = (1+\varepsilon_{\max}) \log P(H,E)
\]
and
\[
\log \tilde P(H,E) = \sum_{j,l} \tilde\phi_{jl}(D_{jl}) \le \sum_j \frac{1}{1+\varepsilon_j}\,\phi_j(D_j) \le \frac{1}{1+\varepsilon_{\max}} \sum_j \phi_j(D_j) = \frac{1}{1+\varepsilon_{\max}} \log P(H,E).
\]
Theorem 3.3 allows us to bound the probability of an assignment. Now we look to bound the marginal probability. From theorem 3.3 we have
\[
\log \tilde P(H,E) \ge \log \bigl[P(H,E)\bigr]^{(1+\varepsilon_{\max})}.
\]
Summing out a set A ⊆ H, we have
\[
\log \sum_A \tilde P(H,E) \ge \log \sum_A \bigl[P(H,E)\bigr]^{1+\varepsilon_{\max}}.
\]
To get a relation between ΣA P(H,E) and ΣA P̃(H,E) we would like to move the summation sign inside the exponentiated part. Unfortunately, since (1 + εmax) > 1, lemma 3.1 can only provide an upper bound on the sum on the right hand side.
Similarly, looking at the upper bound from theorem 3.3, we have
\[
\log \tilde P(H,E) \le \log \bigl[P(H,E)\bigr]^{\frac{1}{1+\varepsilon_{\max}}}.
\]
Summing out a set A ⊆ H, we have
\[
\log \sum_A \tilde P(H,E) \le \log \sum_A \bigl[P(H,E)\bigr]^{\frac{1}{1+\varepsilon_{\max}}}.
\]
Again we try to move the summation inside the exponentiation. This is again not possible because lemma 3.1 can only provide a lower bound on the sum on the right hand side, since 0 < 1/(1 + εmax) ≤ 1.

Hence we are unable to obtain any bounds on the log of the marginal probability ΣA P(H,E).
Until this point we have been unable to prove the claims made by Wexler and Meek.
In the next section we will prove that the claims are actually false by providing simple
counterexamples.
3.4 Example
In our example, we consider a simple model with two binary-valued random variables A and B, whose joint probability is given by

P(A,B):
          B = 0    B = 1
  A = 0   0.3      0.25
  A = 1   0.25     0.2

Let us assume that we find an approximation P̃(A,B) to P(A,B) as follows:

P̃(A,B):
          B = 0    B = 1
  A = 0   0.285    0.2525
  A = 1   0.25     0.2125

Using these two tables and the definition of an ε-decomposition (3.1), we have εmax = 0.0426.
With this value of εmax, we look at the bounds for the joint probability P(A,B = 1). We can see that the inequality specified by claim 3.1,
\[
\frac{1}{1+\varepsilon_{\max}} \log P(A,B{=}1) \le \log \tilde P(A,B{=}1) \le (1+\varepsilon_{\max}) \log P(A,B{=}1),
\]
does not hold, because we have (the middle value in each row is log P̃(A,B = 1))

  A = 0:   −1.329  ≥  −1.376  ≥  −1.445
  A = 1:   −1.544  ≥  −1.549  ≥  −1.678

whereas the inequality specified by theorem 3.3,
\[
\frac{1}{1+\varepsilon_{\max}} \log P(A,B{=}1) \ge \log \tilde P(A,B{=}1) \ge (1+\varepsilon_{\max}) \log P(A,B{=}1),
\]
does hold.
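The failure of claim 3.1 on this example can be verified numerically. The sketch below (our own illustration) takes the per-entry ε as max{ρ, 1/ρ} − 1, where ρ is the ratio of the log-values of P̃ and P, as in definition (3.1).

```python
import math

P  = {(0, 0): 0.3,   (0, 1): 0.25,   (1, 0): 0.25, (1, 1): 0.2}
Pt = {(0, 0): 0.285, (0, 1): 0.2525, (1, 0): 0.25, (1, 1): 0.2125}

def entry_eps(p, pt):
    r = math.log(pt) / math.log(p)     # ratio of log-values for one entry
    return max(r, 1.0 / r) - 1.0

eps_max = max(entry_eps(P[k], Pt[k]) for k in P)   # = 0.0426, as in the text

lp, lpt = math.log(P[(0, 1)]), math.log(Pt[(0, 1)])
claim_31 = lp / (1 + eps_max) <= lpt <= (1 + eps_max) * lp   # claim 3.1
thm_33   = lp / (1 + eps_max) >= lpt >= (1 + eps_max) * lp   # theorem 3.3
assert not claim_31 and thm_33
```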
Similarly, consider the procedure outlined in sub-section 3.3.1, which involves shifting and renormalizing the function. To shift P(A,B) so that all its values are greater than one, we multiply it by the constant z = 5.1. Therefore, we have

F(A,B) = z · P(A,B):
          B = 0    B = 1
  A = 0   1.53     1.275
  A = 1   1.275    1.02

And we have F̃(A,B) = z · P̃(A,B):
          B = 0      B = 1
  A = 0   1.4535     1.28775
  A = 1   1.275      1.08375

Using these two tables and the definition of an ε-decomposition (3.1), we have εmax = 3.0614. Again, we look at the joint probability P(A,B = 1). Once again we
see that the bounds given by claim 3.1,
\[
\frac{1}{1+\varepsilon_{\max}} \log P(A,B{=}1) \le \log \tilde P(A,B{=}1) \le (1+\varepsilon_{\max}) \log P(A,B{=}1),
\]
do not hold, because

  A = 0:   −0.341  ≰  −1.376  ≰  −5.630
  A = 1:   −0.396  ≰  −1.549  ≰  −6.537

whereas the bounds given by theorem 3.1,
\[
\log \tilde P(A,B{=}1) \ge \frac{1}{1+\varepsilon_{\max}} \log P(A,B{=}1) - \frac{\varepsilon_{\max}}{1+\varepsilon_{\max}} \log z,
\]
\[
\log \tilde P(A,B{=}1) \le (1+\varepsilon_{\max}) \log P(A,B{=}1) + \varepsilon_{\max} \log z,
\]
do hold, since

  A = 0:   −1.569  ≤  −1.376  ≤  −0.642
  A = 1:   −1.624  ≤  −1.549  ≤  −1.548
Hence, we can state that claim 3.1 does not hold in general.
Having focused on joint probabilities, we now perform marginalization to obtain bounds on the marginal probability. Referring to our original definition of P(A,B), we can calculate P(B) by marginalizing out A. Therefore, we have

  P(B = 0) = 0.55,   P(B = 1) = 0.45

From P̃(A,B), we have

  P̃(B = 0) = 0.535,   P̃(B = 1) = 0.465

First we consider the case when εmax = 0.0426 was calculated directly using P(A,B) and P̃(A,B). We see that the inequality specified by claim 3.2,
\[
\frac{1}{1+\varepsilon_{\max}} \log P(B) \le \log \tilde P(B) \le (1+\varepsilon_{\max}) \log P(B),
\]
does not hold, because

  B = 0:   −0.5734  >  −0.6255  <  −0.6233
  B = 1:   −0.7659  <  −0.7657  >  −0.8325

(the left inequality fails for B = 0 and the right inequality fails for B = 1).
Next we consider the case when εmax = 3.0614 was calculated using the shift and renormalize method. We see that the inequality given by claim 3.2 still does not hold, because

  B = 0:   −0.1472  ≰  −0.6255  ≰  −2.4281
  B = 1:   −0.1966  ≰  −0.7657  ≰  −3.2431

whereas the bounds given by theorem 3.2,
\[
\frac{1}{1+\varepsilon_{m}} \log P(B) - \frac{\varepsilon_{m}}{1+\varepsilon_{m}} \log z \le \log \tilde P(B) \le (1+\varepsilon_{m}) \log P(B) + \varepsilon_{m} \log z,
\]
where εm = εmax, do hold because

  B = 0:   −1.3753  ≤  −0.6255  ≤  2.5597
  B = 1:   −1.4247  ≤  −0.7657  ≤  1.7447
Hence, we can state that claim 3.2 does not hold in general.
3.5 Tracking ε in an Inference Procedure

In the course of an inference procedure, we come across three major operations:

1. Sum out a variable: We have a function f(X) and we want to marginalize a subset A ⊆ X. In section 3.3, we considered this scenario and presented the effects it has on the error ε.

2. Multiply two functions: Assume that we have two functions f and g. Let f̃ be an approximation to f with error ε1 and let g̃ be an approximation to g with error ε2. We would like the error of f̃ · g̃ with respect to f · g.

3. Compounded approximation: Say that we have an approximation f̃ to a function f with error ε1. Now assume that we get an approximation f̂ to f̃ with error ε2. We would like the error of f̂ with respect to the original function f.
Let us first consider the case of multiplying two functions. From the definition of an ε-decomposition (3.1), we have
\[
\frac{1}{1+\varepsilon_1} \le \frac{\log \tilde f}{\log f} \le 1+\varepsilon_1
\qquad\text{and}\qquad
\frac{1}{1+\varepsilon_2} \le \frac{\log \tilde g}{\log g} \le 1+\varepsilon_2.
\]
From these two equations we can say
\[
\frac{1}{1+\varepsilon_1} \log f + \frac{1}{1+\varepsilon_2} \log g \le \log \tilde f + \log \tilde g \le (1+\varepsilon_1) \log f + (1+\varepsilon_2) \log g.
\]
Hence, we have
\[
\frac{1}{1+\max\{\varepsilon_1,\varepsilon_2\}} (\log f + \log g) \le \log \tilde f + \log \tilde g \le (1+\max\{\varepsilon_1,\varepsilon_2\})(\log f + \log g).
\]
We can rewrite the above equation as
\[
\frac{1}{1+\max\{\varepsilon_1,\varepsilon_2\}} \le \frac{\log \tilde f \tilde g}{\log f g} \le 1+\max\{\varepsilon_1,\varepsilon_2\}.
\]
Therefore the error of f̃ · g̃ with respect to f · g is given by ε = max{ε1, ε2}.
Now we consider the case of performing a compounded approximation. Since f̃ is an approximation to f with error ε1, we have
\[
\frac{1}{1+\varepsilon_1} \le \frac{\log \tilde f}{\log f} \le 1+\varepsilon_1.
\]
Also, since f̂ is an approximation to f̃ with error ε2, we have
\[
\frac{1}{1+\varepsilon_2} \le \frac{\log \hat f}{\log \tilde f} \le 1+\varepsilon_2.
\]
Multiplying these two equations, we have
\[
\frac{1}{1+\varepsilon_1} \cdot \frac{1}{1+\varepsilon_2} \le \frac{\log \tilde f}{\log f} \cdot \frac{\log \hat f}{\log \tilde f} \le (1+\varepsilon_1)(1+\varepsilon_2).
\]
Simplifying the above equation we get
\[
\frac{1}{1+(\varepsilon_1+\varepsilon_2+\varepsilon_1\varepsilon_2)} \le \frac{\log \hat f}{\log f} \le 1+(\varepsilon_1+\varepsilon_2+\varepsilon_1\varepsilon_2).
\]
Therefore the error of f̂ with respect to f is given by ε = ε1 + ε2 + ε1ε2.
To understand how the error can be tracked during inference, we provide a schematic description of the error tracking procedure in Figure 3.1. We start with four functions (fa, fb, fc, fd), each associated with zero error, ri = 1. Every multiplication operation is denoted by edges directed from the nodes S, representing the multiplied functions, to a node t representing the resulting function, with error rt = maxs∈S rs. An ε-decomposition, on the other hand, has a single source node s with an associated error rs, representing the decomposed function, and several target nodes T, each with error rt = (1 + ε)rs for every t ∈ T. Therefore, the error associated with the result of the entire procedure is the error associated with the final node fo in the graph. In our example we make the assumption that ε1 > ε2 and that 1 + ε1 < (1 + ε2)(1 + ε3).
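The two tracking rules can be sketched as a pair of helper functions. The topology and error values below are hypothetical, chosen only so that ε1 > ε2 and 1 + ε1 < (1 + ε2)(1 + ε3), mirroring the situation in Figure 3.1.

```python
def r_decompose(r, eps):
    # each target of an eps-decomposition carries the error factor (1 + eps) * r
    return (1.0 + eps) * r

def r_multiply(rs):
    # multiplying functions keeps the largest error factor
    return max(rs)

e1, e2, e3 = 0.30, 0.20, 0.15                       # hypothetical decomposition errors
branch1 = r_decompose(1.0, e1)                      # one decomposition: 1 + e1
branch2 = r_decompose(r_decompose(1.0, e2), e3)     # two compounded decompositions
r_out = r_multiply([branch1, branch2])
assert abs(r_out - (1 + e2) * (1 + e3)) < 1e-12     # (1.2)(1.15) = 1.38 > 1.30
```

Note that the compounded branch carries (1 + ε2)(1 + ε3) = 1 + (ε2 + ε3 + ε2ε3), matching the compounding rule derived above.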
Figure 3.1: Schematic description of error tracking in an inference procedure using ε-decompositions. Starting errors are ra = rb = rc = rd = 1; decompositions introduce factors (1 + ε1), (1 + ε2) and (1 + ε3), and the final node carries ro = max{1 + ε1, (1 + ε2)(1 + ε3)} = (1 + ε2)(1 + ε3).
3.6 Need for new bounds
The corrected bounds of theorems 3.1 and 3.2 all contain the term Σj log zj in some form. The value of this term depends on the functions in our model that have ψj(Dj) < 1. As a result it can become arbitrarily large when dealing with small probabilities. For example, if we have a probability value of 0.01 we will have to multiply that factor by a value greater than 100. This value will then dominate the error in our bounds and hence lead to very loose bounds.
We saw an example of this in section 3.4. The corrected upper bound on the log marginal probability was positive, which merely says that the marginal probability is at most one. Although not seen in our examples, in our experiments we will show instances wherein the corrected lower bound for a probability is practically zero.
Since the corrected bounds resulting from the multiplicative error measure 3.1 are
often loose, we are motivated to investigate alternate error measures. In chapter 4
we propose one such measure, based on the L∞ norm. We show how this measure
can be used to compute the error of a local decomposition and then aggregated into
a bound on the result.
Chapter 4
MAS Error Bounds using the L∞ norm
In chapter 3, section 3.6, we stated the need to investigate alternate error measures
since the corrected bounds resulting from the multiplicative error measure are often
quite loose. Here we use a measure that is based on the L∞ norm. Using the L∞
error measure, we derive new bounds on the accuracy of the results of an inference
procedure.
In section 4.1 we introduce the L∞ norm and show how it can be used to compute
the error of a local decomposition. Then, in section 4.2 we show how to track the
new error measure in an inference procedure. Finally, in section 4.3, we derive new
bounds for the log joint probability and log marginal probability under MAS.
4.1 The L∞ norm
The L∞ norm can be used to measure the difference between two functions. Suppose we have a function f(X) and an approximation f̃(X) to f(X). The L∞ norm of the difference of logs is defined as
\[
\bigl\| \log f(X) - \log \tilde f(X) \bigr\|_\infty = \sup_X \bigl| \log f(X) - \log \tilde f(X) \bigr|.
\]
One of the problems with using the L∞ norm to measure error is that optimizing the L∞ norm to get the best f̃(X) is hard. So, the strategy we use is the following: assuming we have some approximation f̂(X) to f(X), we perform a scalar shift by α on log f̂(X) so as to minimize the L∞ norm. This means that we replace our approximation f̂(X) with a new approximation f̃(X) given by
\[
\tilde f(X) = \frac{\hat f(X)}{\alpha}.
\]
We define our error measure δ as
\[
\delta = \sup_X \bigl| \log f(X) - \log \tilde f(X) \bigr|,
\]
which is equivalent to
\[
\delta = \inf_\alpha \sup_X \Biggl| \log f(X) - \log \frac{\hat f(X)}{\alpha} \Biggr|.
\]
The error measure δ we defined is equal to the log of the dynamic range, a metric introduced by Ihler et al. in [11]. An illustration of how δ can be calculated is shown in Figure 4.1. It shows a function f(X) with its approximation f̂(X), and depicts the log-error log f/f̂, the error measure δ and the minimizing α.
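For a function tabulated over finitely many states, δ and the minimizing α have a simple closed form: the optimal shift centers the log-ratio, so δ is half the range of log f − log f̂. A small sketch with illustrative values:

```python
import math

f     = [0.30, 0.25, 0.25, 0.20]          # hypothetical function values
f_hat = [0.285, 0.2525, 0.25, 0.2125]     # an approximation to f

d = [math.log(a) - math.log(b) for a, b in zip(f, f_hat)]
delta = (max(d) - min(d)) / 2.0           # the minimized L-infinity error
log_alpha = -(max(d) + min(d)) / 2.0      # the minimizing shift, as log(alpha)

# after shifting, every pointwise log-error is within delta
assert all(abs(di + log_alpha) <= delta + 1e-12 for di in d)
```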
Figure 4.1: (a) A function f(X) and an example approximation f̂(X); (b) their log-ratio log f(X)/f̂(X), and the error measure δ.
With this error measure, we now define a δ-decomposition as follows:

Definition 4.1. Given a set of variables X, and a function φ(X) that assigns real values to every instantiation of X, a set of m functions φ̂l(Xl), l = 1 . . . m, where Xl ⊆ X, is a δ-decomposition if ∪l Xl = X, and for all instantiations of X the following holds:
\[
\delta = \inf_\alpha \sup_X \Biggl| \log \alpha + \phi(X) - \sum_l \hat\phi_l(X_l) \Biggr| \tag{4.1}
\]
Having defined the new error measure δ, in the next section we analyze how the
different operations of an inference procedure affect δ.
4.2 Tracking δ in an Inference Procedure
As in chapter 3, we consider how the three major operations of an inference pro-
cedure, summing out a set of variables, multiplying two functions and compounded
approximations, affect the new error measure δ.
First we consider the operation of summing out a set of variables. The problem is stated as follows: given a function f(X) and an approximation f̃(X) with error δ, we want the error of ΣA f̃(X) with respect to ΣA f(X), where A ⊆ X.
38
We have
∣∣∣log f(X)− log f̃(X)∣∣∣ ≤ δ ∀X
From the above equation we have
1
eδ≤ f(X)
f̃(X)≤ eδ ∀X
Therefore, we have
f̃(X)
eδ≤ f(X) f(X) ≤ eδf̃(X)
Summing out the set A, we get
∑A
f̃(X)
eδ≤∑A
f(X)∑A
f(X) ≤∑A
eδf̃(X)
Since δ is a constant we have
1
eδ
∑A
f̃(X) ≤∑A
f(X)∑A
f(X) ≤ eδ∑A
f̃(X)
Therefore we have
1
eδ≤∑
A f(X)∑A f̃(X)
≤ eδ
which is equivalent to
∣∣∣∣∣log∑A
f(X)− log∑A
f̃(X)
∣∣∣∣∣ ≤ δ
Therefore, summing out a set of variables A ⊆ X has an error of at most δ for the
approximation.
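This invariance under summation is easy to check numerically. The tables below are arbitrary positive values chosen for illustration:

```python
import math

f       = [2.0, 3.0, 5.0, 7.0]     # hypothetical positive function values
f_tilde = [2.2, 2.8, 5.4, 6.5]     # an approximation with pointwise error delta

delta = max(abs(math.log(a / b)) for a, b in zip(f, f_tilde))
marginal_error = abs(math.log(sum(f) / sum(f_tilde)))
assert marginal_error <= delta     # summing everything out cannot increase the error
```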
Next, consider multiplying two functions. We have a function f with approximation f̃ and error δ1, and another function g with approximation g̃ and error δ2. We want the error of f̃ · g̃ with respect to f · g. We have
\[
\bigl| \log f - \log \tilde f \bigr| \le \delta_1, \qquad \bigl| \log g - \log \tilde g \bigr| \le \delta_2.
\]
We are interested in
\[
\bigl| \log fg - \log \tilde f \tilde g \bigr| = \bigl| \log f - \log \tilde f + \log g - \log \tilde g \bigr|.
\]
By the triangle inequality we have
\[
\bigl| \log fg - \log \tilde f \tilde g \bigr| \le \bigl| \log f - \log \tilde f \bigr| + \bigl| \log g - \log \tilde g \bigr| \le \delta_1 + \delta_2.
\]
Therefore the error of f̃ · g̃ with respect to f · g is at most δ = δ1 + δ2.
Finally, consider the case of a compounded approximation. Assume we have an approximation f̃ to f with error δ1. Now assume that we get an approximation f̂ to f̃ with error δ2. We want the error of f̂ with respect to f. We have
\[
\bigl| \log f - \log \tilde f \bigr| \le \delta_1, \qquad \bigl| \log \tilde f - \log \hat f \bigr| \le \delta_2.
\]
We are interested in
\[
\bigl| \log f - \log \hat f \bigr| = \bigl| \log f - \log \tilde f + \log \tilde f - \log \hat f \bigr|.
\]
By the triangle inequality we have
\[
\bigl| \log f - \log \hat f \bigr| \le \bigl| \log f - \log \tilde f \bigr| + \bigl| \log \tilde f - \log \hat f \bigr| \le \delta_1 + \delta_2.
\]
Therefore the error of f̂ with respect to f is at most δ = δ1 + δ2.
To illustrate the error tracking for δ-decompositions in an inference procedure, we
provide a schematic trace in Figure 4.2. Since the error measure δ has been shown to
be at most additive, we only need to keep track of the error in each decomposition
and add all the errors at the end. In Figure 4.2 there are three decompositions that
take place: fa is decomposed with an error δ1, fg is decomposed with an error δ2 and
fj is decomposed with an error δ3. Therefore, the error on the result of the inference
procedure is at most δ1 + δ2 + δ3.
In the next section we use the definition of a δ-decomposition to derive new bounds for the log joint probability and the log marginal probability.
4.3 New Bounds for MAS
There are two scenarios for us to consider:

1. P(X) is a normalized probability distribution, i.e.,
\[
P(X) = \frac{1}{Z} \prod_j \psi_j(D_j), \qquad D_j \subseteq X,
\]
and we know the value of the normalization constant Z.

2. P(X) is not normalized and we need to compute the normalizing constant Z.
Figure 4.2: Schematic description of error tracking in an inference procedure using δ-decompositions. Three decompositions occur, with errors δ1, δ2 and δ3, so the total error is δ1 + δ2 + δ3.
First we consider the case when P(X) is a normalized probability distribution. We start by deriving bounds for the log of the joint probability P(H,E). Assume that for each factor ψj we have a δ-decomposition. Then, we have
\[
\Biggl| \phi_j(D_j) - \sum_l \tilde\phi_{jl}(D_{jl}) \Biggr| \le \delta_j \quad \forall j.
\]
We have shown in section 4.2 that multiplication of functions causes the errors to at most add. Since multiplying the original factors is equivalent to summing the logs of the factors, we have
\[
\Biggl| \sum_j \phi_j(D_j) - \sum_{j,l} \tilde\phi_{jl}(D_{jl}) \Biggr| \le \sum_j \delta_j.
\]
Since P(X) is normalized we have
\[
\Biggl| \Bigl(\sum_j \phi_j(D_j) - \log Z\Bigr) - \Bigl(\sum_{j,l} \tilde\phi_{jl}(D_{jl}) - \log Z\Bigr) \Biggr| \le \sum_j \delta_j.
\]
We define P̃(H,E) as
\[
\tilde P(H,E) = \frac{1}{Z} \prod_{j,l} e^{\tilde\phi_{jl}(D_{jl})}.
\]
Therefore, we have
\[
\bigl| \log P(H,E) - \log \tilde P(H,E) \bigr| \le \sum_j \delta_j.
\]
From the above equation we can state the following theorem:

Theorem 4.1. If the log of the joint probability P(H,E) is approximated using a set of δj-decompositions on the factors ψj of P and P is a normalized distribution, then
\[
\log P(H,E) - \sum_j \delta_j \le \log \tilde P(H,E) \le \log P(H,E) + \sum_j \delta_j. \tag{4.2}
\]
Having derived bounds for the joint probability, we proceed to derive bounds for marginal probabilities. From theorem 4.1 we have
\[
\bigl| \log P(H,E) - \log \tilde P(H,E) \bigr| \le \sum_j \delta_j.
\]
In section 4.2 we showed that for a set A ⊆ X we have
\[
\Biggl| \log \sum_A f(X) - \log \sum_A \tilde f(X) \Biggr| \le \delta.
\]
Therefore, summing out a set A ⊆ H from P(H,E), we get
\[
\Biggl| \log \sum_A P(H,E) - \log \sum_A \tilde P(H,E) \Biggr| \le \sum_j \delta_j.
\]
From the above equation we have the following theorem:

Theorem 4.2. For a set A ⊆ H, if the expression log ΣA P(H,E) is approximated using a set of δj-decompositions on the factors ψj of P and P is a normalized distribution, then
\[
\log \sum_A P(H,E) - \sum_j \delta_j \le \log \sum_A \tilde P(H,E) \le \log \sum_A P(H,E) + \sum_j \delta_j. \tag{4.3}
\]
We now consider the case when P(X) is not a normalized distribution. Assume we have a function f(X) and an approximation f̃(X) to f(X) with error δ. We have
\[
\bigl| \log f(X) - \log \tilde f(X) \bigr| \le \delta.
\]
Assume that we cannot calculate the real normalization constant Z = ΣX f(X). We can calculate an approximation to the normalization constant as Z̃ = ΣX f̃(X). We want a relation between our normalized approximation f̃(X)/Z̃ and the real normalized function f(X)/Z, i.e., we want to find bounds for
\[
\Biggl| \log \frac{f(X)}{Z} - \log \frac{\tilde f(X)}{\tilde Z} \Biggr|. \tag{4.4}
\]
In section 4.2 we showed that for a set A ⊆ X we have
\[
\Biggl| \log \sum_A f(X) - \log \sum_A \tilde f(X) \Biggr| \le \delta.
\]
Taking A = X (since X ⊆ X), we have
\[
\Biggl| \log \sum_X f(X) - \log \sum_X \tilde f(X) \Biggr| \le \delta.
\]
Substituting Z = ΣX f(X) and Z̃ = ΣX f̃(X), we have
\[
\bigl| \log Z - \log \tilde Z \bigr| \le \delta.
\]
We can rewrite the normalized expression of equation (4.4) as
\[
\Biggl| \log \frac{f(X)}{Z} - \log \frac{\tilde f(X)}{\tilde Z} \Biggr| = \bigl| \log f(X) - \log \tilde f(X) + \log \tilde Z - \log Z \bigr|.
\]
By the triangle inequality we have
\[
\Biggl| \log \frac{f(X)}{Z} - \log \frac{\tilde f(X)}{\tilde Z} \Biggr| \le \bigl| \log f(X) - \log \tilde f(X) \bigr| + \bigl| \log \tilde Z - \log Z \bigr| \le 2\delta.
\]
Hence, performing the normalization increases the error by a factor of at most two.
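A numerical sketch of the 2δ bound, with illustrative tables; note that the normalized error can indeed exceed δ, but never 2δ:

```python
import math

f       = [2.0, 3.0, 5.0]          # hypothetical unnormalized function
f_tilde = [2.3, 2.7, 4.6]          # an approximation with pointwise error delta

delta = max(abs(math.log(a / b)) for a, b in zip(f, f_tilde))
Z, Z_tilde = sum(f), sum(f_tilde)
normalized_error = max(abs(math.log((a / Z) / (b / Z_tilde)))
                       for a, b in zip(f, f_tilde))
assert normalized_error <= 2 * delta   # normalizing at most doubles the error
```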
Now, assuming that for each factor ψj of P we have a δ-decomposition, we have
\[
\Biggl| \sum_j \phi_j(D_j) - \sum_{j,l} \tilde\phi_{jl}(D_{jl}) \Biggr| \le \sum_j \delta_j.
\]
Since we do not know the real normalization constant Z, we normalize the result to get
\[
\Biggl| \Bigl(\sum_j \phi_j(D_j) - \log Z\Bigr) - \Bigl(\sum_{j,l} \tilde\phi_{jl}(D_{jl}) - \log \tilde Z\Bigr) \Biggr| \le 2 \sum_j \delta_j.
\]
In this scenario we define P̃(H,E) as
\[
\tilde P(H,E) = \frac{1}{\tilde Z} \prod_{j,l} e^{\tilde\phi_{jl}(D_{jl})}.
\]
Therefore, we have
\[
\bigl| \log P(H,E) - \log \tilde P(H,E) \bigr| \le 2 \sum_j \delta_j.
\]
From the above equation we have the following theorem:

Theorem 4.3. If the log of the joint probability P(H,E) is approximated using a set of δj-decompositions on the factors ψj of P and P is not a normalized distribution, then
\[
\log P(H,E) - 2\sum_j \delta_j \le \log \tilde P(H,E) \le \log P(H,E) + 2\sum_j \delta_j. \tag{4.5}
\]
Since we have shown that summing out a set of variables from a function does not
affect the error of the approximation, we can state the theorem for the log of the
marginal probabilities as follows:
Theorem 4.4. For a set A ⊆ H, if the expression log ΣA P(H,E) is approximated using a set of δj-decompositions on the factors ψj of P and P is not a normalized distribution, then
\[
\log \sum_A P(H,E) - 2\sum_j \delta_j \le \log \sum_A \tilde P(H,E) \le \log \sum_A P(H,E) + 2\sum_j \delta_j. \tag{4.6}
\]
Chapter 5
Optimizing a Decomposition
In chapters 3 and 4, we provided methods to calculate error bounds given a particular
approximation. In this chapter we present methods for finding a decomposition of a
function.
Ideally we would like to find the decomposition of a function that has the smallest value of the error measure ε or δ. There are two problems to consider: how to select good subsets {X1, . . . , Xm} for the decomposition, and, given the decomposition {X1, . . . , Xm}, how to select the functions {φ̃1(X1), . . . , φ̃m(Xm)}.
Choosing good subsets is an interesting but difficult problem. In this thesis, we take
the option of choosing subsets {X1, . . . , Xm} at random and then find good values
for the functions {φ̃1(X1), . . . , φ̃m(Xm)}.
Given a decomposition {X1, . . . , Xm}, the ideal way to choose {φ̃1(X1), . . . , φ̃m(Xm)} would be to choose the functions that minimize our error measure. Considering the error measure ε defined in chapter 3, the problem we would like to solve is
\[
\min_{(\tilde\phi_1,\dots,\tilde\phi_m)} \max_X \Biggl\{ \frac{\sum_i \tilde\phi_i(X_i)}{\phi(X)},\ \frac{\phi(X)}{\sum_i \tilde\phi_i(X_i)} \Biggr\}. \tag{5.1}
\]
If we use the error measure δ defined in chapter 4, then the problem would be
\[
\min_{(\tilde\phi_1,\dots,\tilde\phi_m)} \sup_X \Biggl| \phi(X) - \sum_i \tilde\phi_i(X_i) \Biggr|. \tag{5.2}
\]
Since optimizing either of these two objective functions is not easy, we look to optimize some similar measure in the hope that we will obtain a small value of our error measure. Wexler and Meek suggest the L2 norm, because for a vector X, ‖X‖2 ≥ ‖X‖∞. The L2 norm objective for a decomposition is defined as follows:
\[
\min_{(\tilde\phi_1,\dots,\tilde\phi_m)} \sqrt{ \sum_X \Biggl[ \Bigl(\sum_i \tilde\phi_i(X_i)\Bigr) - \phi(X) \Biggr]^2 }. \tag{5.3}
\]
We can choose disjoint or overlapping subsets for the decomposition. In section 5.1 we consider the case wherein we are restricted to choosing disjoint subsets, i.e., each φ̃i is defined on a non-overlapping subset of the variables. More formally, we choose sets {X1, . . . , Xm} such that
\[
X_i \cap X_j = \emptyset \quad \forall\, i \ne j,\ \ i,j = 1,\dots,m.
\]
Then in section 5.2 we remove the restriction of disjoint subsets and consider the
more general case of allowing sets that overlap with each other.
5.1 Optimizing a decomposition with disjoint subsets
The L2 norm is computationally convenient for disjoint subsets. The reason for this
is that there exists a closed form solution to equation (5.3) when the sets are disjoint.
Wexler and Meek obtained the closed form solution as follows. First they remove the
square root since the square root function is monotonic for positive values. Hence the
objective function becomes
\[
\min_{(\tilde\phi_1,\dots,\tilde\phi_m)} \sum_X \Biggl[ \Bigl(\sum_i \tilde\phi_i(X_i)\Bigr) - \phi(X) \Biggr]^2.
\]
Next, Wexler and Meek differentiate the equation with respect to each φ̃k(Xk) and set the derivative to zero. The resulting set of equations is under-constrained, so they add the arbitrary constraint
\[
\sum_X \tilde\phi_i(X_i) = \frac{\sum_X \phi(X)}{m}
\]
to make the system well posed. Solving the resulting set of equations, they obtain the closed form solution
\[
\tilde\phi_k(X_k) = \frac{\sum_{X \setminus X_k} \phi(X)}{\prod_{i \ne k} |X_i|} - \frac{(m-1) \sum_X \phi(X)}{m\,|X|}. \tag{5.4}
\]
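As a sketch, for a function over two binary variables decomposed into m = 2 disjoint singleton subsets, equation (5.4) reduces to row/column means minus a share of the grand total. The table values below are hypothetical:

```python
# phi tabulated as phi[x1][x2]; m = 2 subsets {X1}, {X2}; |X| = 4, |Xi| = 2
phi = [[1.0, 2.0], [4.0, 3.0]]
total = sum(map(sum, phi))

# closed form (5.4): sum out the other variable, divide by its size,
# and subtract (m-1)/(m*|X|) times the grand total
phi1 = [sum(phi[x1]) / 2.0 - total / 8.0 for x1 in range(2)]
phi2 = [sum(phi[x1][x2] for x1 in range(2)) / 2.0 - total / 8.0
        for x2 in range(2)]

approx = [[phi1[x1] + phi2[x2] for x2 in range(2)] for x1 in range(2)]
resid = [[phi[i][j] - approx[i][j] for j in range(2)] for i in range(2)]

# the least-squares additive fit leaves residuals summing to zero along
# every row and every column
assert all(abs(sum(row)) < 1e-9 for row in resid)
assert all(abs(resid[0][j] + resid[1][j]) < 1e-9 for j in range(2))
```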
5.2 Optimizing a decomposition with overlapping subsets
When we remove the restriction of disjoint subsets, we cannot find a closed form
solution for minimizing the L2 norm. So, we present a method that can generate
overlapping decompositions from disjoint decompositions.
Given a function φ(X) we would like to generate functions defined over overlapping
sets of variables. We create overlapping decompositions by first generating functions
over two or more different sets of disjoint decompositions and then combining them
together. We explain the procedure by considering two sets of disjoint decompositions
{X11, . . . , X1p} and {X21, . . . , X2q}.
First we generate a disjoint decomposition over {X11, . . . , X1p} for φ(X). To calculate
the values for {φ̃11(X11), . . . , φ̃1p(X1p)} we can use the closed form solution from
equation (5.4). We now compute the residual R(X) as
\[
R(X) = \phi(X) - \sum_{i=1}^{p} \tilde\phi_{1i}(X_{1i}).
\]
Next we generate a disjoint decomposition over {X21, . . . , X2q} for R(X) where the
sets {X21, . . . , X2q} are different from the sets {X11, . . . , X1p}. To calculate the values
of {R̃21(X21), . . . , R̃2q(X2q)} we again use the closed form solution from equation (5.4).
We now take the union of the sets {φ̃11(X11), . . . , φ̃1p(X1p)} and {R̃21(X21), . . . , R̃2q(X2q)} to generate the combined set of functions {φ̃11(X11), . . . , φ̃1p(X1p), R̃21(X21), . . . , R̃2q(X2q)}. If this combined set contains a function over a set Xj and another function over a set Xk such that Xj ⊆ Xk, then we multiply the two functions together to generate a single new function. After performing this multiplication operation on the combined set we get an overlapping decomposition {φ̂1(X1), . . . , φ̂m(Xm)}.
To make the procedure of generating overlapping decompositions clearer, we provide
the schematic trace of an example in Figure 5.1. We have a function φ that is
defined over three variables X1, X2, X3 for which we would like to generate overlapping
decompositions. We look at all possible decompositions that can be made except for
the trivial decomposition of three functions over single variables. At the first stage,
we decompose φ which can be done in three different ways. At the second stage
we decompose the residual. For each choice we made in the first stage we have
two possible choices at this stage. Finally we combine all the functions together to
generate the required overlapping decomposition. For each path down the tree from
root to leaf, we compute the values of ε and δ for the resulting decomposition.
Having provided a theoretical analysis of MAS including new ways to calculate error
bounds and new ways to optimize ε-decompositions, we now shift our focus to the
application of MAS to inference algorithms. In the next chapter, we look at the
application of MAS to bucket elimination [6]. We analyze the algorithm proposed
by Wexler and Meek called DynaDecomp and discuss some of its limitations. We
then present our own DynaDecompPlus algorithm which modifies the DynaDecomp
algorithm so as to overcome its limitations.
Figure 5.1: Schematic trace of generating overlapping decompositions of a function φ(X1, X2, X3). Each root-to-leaf path first decomposes φ, then decomposes the residual, and finally combines the pieces; the resulting errors range from ε = 0.3816, δ = 0.4289 for the best paths to ε = 0.4265, δ = 0.6360 for the worst.
Chapter 6
Bucket Elimination with MAS
Many existing inference algorithms for graphical models compute and utilize multi-
plicative factors during the course of the inference procedure. As a result the multi-
plicative approximation scheme can be applied to these algorithms. In this chapter
we look at one such application, using MAS with bucket elimination [6].
In chapter 2 (section 2.4) we provided an introduction to the bucket elimination framework. In section 6.1 we introduce and analyze the algorithm developed by Wexler and Meek, called DynaDecomp, and present its major limitation. Then, in section 6.2, we introduce our own algorithm, called DynaDecompPlus, which improves on the DynaDecomp algorithm by providing additional control over the computational complexity.
6.1 The DynaDecomp Algorithm
DynaDecomp is an algorithm developed by Wexler and Meek and presented in [24].
It performs variable elimination using MAS. The strategy that DynaDecomp uses for
Input: A model for n variables X = {X1, . . . , Xn} and functions ψj(Dj ⊆ X) that encode P(X) = ∏j ψj(Dj); a set E = X \ H of observed variables; an elimination order R over the variables in H; a threshold M.
Output: The log-likelihood log P(E); an error ε.

Initialize: ε = 0; F ← {ψj(Dj)}
for i = 1 to n do
    k ← R[i]
    T ← all functions f ∈ F that contain variable Xk
    F ← F \ T
    Eliminate Xk from the functions in bucket T: f′ ← ΣXk ∏fj∈T fj
    if |f′| > M then
        Decompose f′ into a set F̃ with an error εf′
        F ← F ∪ F̃
        ε ← max{ε, εf′}
    else
        F ← F ∪ {f′}
    end
end
Multiply all constant functions in F and store the result in p
return log p, ε

Figure 6.1: The DynaDecomp Algorithm
decomposition is a simple one: If the size of a function exceeds a predefined threshold,
decompose the function. The pseudocode for DynaDecomp is presented in Figure 6.1.
DynaDecomp begins exactly like bucket elimination. Given an elimination order we
first assign the functions to their respective buckets and then start processing the
buckets one after the other. The difference is that while we are at bucket i, if the
function generated after eliminating variable Xi, hXi , has a size greater than a pre-
defined threshold M , we decompose this function hXi into a set of smaller functions.
A schematic trace of DynaDecomp is presented in Figure 6.2.
In our example we set the threshold M = 3, i.e., we will decompose any function
that is defined over more than three variables. In Figure 6.2 we see that the function
hB(A,D,C,E) generated after eliminating variable B is defined over four variables.
Hence, it is decomposed into two smaller functions h̃B1 (D,C) and h̃B2 (A,E). The
[Figure: bucket-elimination trace. Bucket B holds P(E|B,C), P(D|A,B), and P(B|A); eliminating B yields hB(A,D,C,E), which is decomposed into h̃B1(D,C) and h̃B2(A,E). The later buckets (C, D, E, with evidence E = 0) produce h̃C(A,D), h̃D(A), and h̃E(A), and bucket A combines them with P(A) into P̃(A|E = 0).]
Figure 6.2: A trace of algorithm DynaDecomp
algorithm then proceeds as in standard bucket elimination and in the end we get an
approximation P̃ (A|E = 0) to P (A|E = 0).
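The scope bookkeeping in this trace can be sketched in a few lines. The following is an illustrative Python sketch of our own (the thesis implementation was in Matlab): factors are reduced to their variable scopes, a function's size is the number of variables in its scope, and `split_scope` stands in for an ε-decomposition by simply halving a scope rather than optimizing the split.

```python
# Simplified DynaDecomp skeleton: factors are just variable scopes (sets);
# the real tables and the epsilon computation are omitted for clarity.

def split_scope(scope):
    """Stand-in for an epsilon-decomposition: split an oversized scope
    into two disjoint halves, chosen arbitrarily rather than optimized."""
    vs = sorted(scope)
    return [set(vs[:len(vs) // 2]), set(vs[len(vs) // 2:])]

def dynadecomp_scopes(factors, order, max_size):
    """Trace the scopes DynaDecomp creates for elimination order `order`.
    Returns the scope generated after each elimination step."""
    factors = [set(f) for f in factors]
    created = []
    for x in order:
        # Bucket for x: all factors whose scope contains x.
        bucket = [f for f in factors if x in f]
        factors = [f for f in factors if x not in f]
        # Multiplying the bucket and summing out x yields this scope:
        new_scope = set().union(*bucket) - {x}
        created.append(new_scope)
        if len(new_scope) > max_size:
            factors.extend(split_scope(new_scope))
        else:
            factors.append(new_scope)
    return created
```

Running it on the factors of Figure 6.2 with M = 3 and order (B, C, D, E) reproduces the behavior discussed below: eliminating B already creates a function over the four variables {A, C, D, E} before the threshold test can fire.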
One thing to notice is that even though we wanted to restrict ourselves to functions over three variables, the algorithm generated a function over five variables when we multiplied all the functions in bucket B. Hence, even though the DynaDecomp algorithm does provide savings later on in the inference procedure, it still generates functions larger than the specified threshold. As a result, it becomes quite hard to control the time and space complexity of DynaDecomp.
In order to provide greater control over the time and space complexity we modify
DynaDecomp to create our own algorithm called DynaDecompPlus. In the next
section we provide the details of our algorithm.
6.2 Improvements to DynaDecomp
The major issue with the DynaDecomp algorithm is that it is very difficult to control
the largest function size generated by the algorithm. Therefore we developed a mod-
ified version of DynaDecomp called DynaDecompPlus which ensures that functions
larger in size than the given threshold are never generated.
Before presenting the details of DynaDecompPlus, we describe the small changes that
have to be made to DynaDecomp so that we can use δ to track the error instead of ε.
6.2.1 Using L∞ Bounds with DynaDecomp
Modifying DynaDecomp to use δ for error tracking is simple. The algorithm pro-
ceeds exactly as in Figure 6.1. The only difference is that to decompose the func-
tion f ′ into the set of functions F̃ , we use a δ-decomposition (4.1) instead of an
ε-decomposition (3.1) to calculate the error. Hence the error update changes from ε ← max{ε, εf′} to δ ← δ + δf′. With these changes, we can use the L∞ bounds derived in chapter 4 to bound the returned result, log p.
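The difference between the two update rules is tiny but worth making explicit. A minimal Python illustration (our own, not the thesis code):

```python
# The epsilon bound keeps only the worst single decomposition error,
# while the delta (L-infinity) bound adds the errors of all decompositions.

def accumulate_eps(errors):
    """epsilon = max over the per-decomposition errors eps_f'."""
    eps = 0.0
    for e in errors:
        eps = max(eps, e)
    return eps

def accumulate_delta(errors):
    """delta = sum over the per-decomposition errors delta_f'."""
    delta = 0.0
    for d in errors:
        delta += d
    return delta
```

Since δ accumulates additively, every decomposition contributes to the final bound, whereas ε only records the single worst decomposition.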
6.2.2 The DynaDecompPlus Algorithm
The DynaDecompPlus algorithm is presented in Figure 6.3. The basic principle be-
hind the algorithm is that instead of decomposing the function resulting from pro-
cessing a particular bucket, decompositions are performed while processing a bucket.
The idea is that before we multiply all the functions in a bucket, we check to see if the
resulting function will exceed our threshold. If it does then instead of multiplying all
the functions, we multiply as many functions as we can without exceeding the thresh-
Input: A model for n variables X = {X1, . . . , Xn} and functions ψj(Dj ⊆ X) that
    encode P(X) = ∏j ψj(Dj); a set E = X\H of observed variables; an
    elimination order R over the variables in H; a threshold M.
Output: The log-likelihood log P(E); an error δ.
Initialize: δ = 0; F ← {ψj(Dj)}
for i = 1 to n do
    k ← R[i]
    T ← all functions f ∈ F that contain variable Xk
    F ← F \ T
    a ← length(T); f′ ← 1; S ← ∅
    for j = 1 to a do
        if multiplying f′ and T[j] would result in a function of size greater than M then
            S ← S ∪ T[j]
        else
            f′ ← the multiplication of f′ and T[j]
        end
    end
    if S is not empty then
        (δf′, F̃) ← decompose(S, f′)
        (f′, F̃) ← subsume functions(f′, F̃)
        δ ← δ + δf′
    end
    Sum out variable Xk from f′: f′ ← ∑Xk f′
    F ← F ∪ {f′}
    F ← F ∪ F̃
end
Multiply all constant functions in F and put the result in p
return log p, δ
Figure 6.3: The DynaDecompPlus Algorithm
[Figure: bucket-elimination trace. Bucket B holds P(E|B,C), P(D|A,B), and P(B|A); P(D|A,B) and P(B|A) are decomposed into h̃1(D,A), h̃2(B), h̃3(B), and h̃4(A) before eliminating B, which yields h̃B(E,C). The later buckets (C, D, E, with evidence E = 0) produce h̃C(E,A), h̃D(A), and h̃E(A), and bucket A combines them with P(A) and h̃4(A) into P̃(A|E = 0).]
Figure 6.4: A trace of algorithm DynaDecompPlus
old. The rest of the functions are then decomposed and the variable is eliminated at
the end.
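The packing step just described can be written down compactly. Below is an illustrative Python sketch under our own simplifications (factors reduced to their variable scopes, a product's size measured as the number of variables in the union of the scopes); it is not the thesis's Matlab implementation:

```python
# Sketch of the bucket-packing step of DynaDecompPlus.

def pack_bucket(bucket, max_size):
    """Greedily multiply factors without letting the product's scope
    exceed max_size; return (packed_scope, leftovers_to_decompose)."""
    product = set()
    leftovers = []
    for scope in bucket:
        if len(product | scope) > max_size:
            leftovers.append(scope)   # the set S in Figure 6.3
        else:
            product |= scope          # multiply the factor into f'
    return product, leftovers
```

On bucket B of Figure 6.4 with M = 3, this sketch keeps P(E|B,C) in the product and sets aside P(D|A,B) and P(B|A) for decomposition, matching the trace.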
To make this clearer, we look at the trace of the algorithm presented in Figure 6.4.
The problem we are trying to solve is P (A|E = 0). But we do not want to generate
any functions greater than size three. So given this threshold and the elimination
order o = (A,E,C,D,B), we start the DynaDecompPlus algorithm. At the very first
bucket B, we see that if we multiply all the functions in this bucket, we will generate
a function hB(A,B,C,D,E) which is greater than our threshold. So instead, we
choose to decompose the functions P (D|A,B) and P (B|A). By performing the
decomposition early and then proceeding with bucket elimination we can see that we
never generated any function larger than size three. If we compare this to the trace of DynaDecomp in Figure 6.2, which also used a maximum size of three, we see that because DynaDecomp first multiplied all the functions in bucket B, it actually generated a function of size five.
Note on subsume functions
The DynaDecompPlus algorithm calls a method known as subsume functions. This method takes a set of functions and, whenever some function is defined over a set of variables that is a subset of another function's variables, multiplies the two functions together. For example, if the set contained a function f1(A,B,C,D) and another function f2(B,D), subsume functions would multiply these two functions and produce a new result f′1(A,B,C,D).
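A possible implementation of subsume functions, written in Python for illustration (the representation of a factor as a scope plus a table keyed by assignments over the sorted scope is our own assumption, not the thesis code):

```python
# Illustrative subsume_functions: a factor is (scope: frozenset,
# table: dict mapping assignment tuples over sorted(scope) to values).

def subsume_functions(factors):
    """Multiply any factor whose scope is a subset of another factor's
    scope into that larger factor; return the reduced factor list."""
    factors = list(factors)
    changed = True
    while changed:
        changed = False
        for i, (si, ti) in enumerate(factors):
            for j, (sj, tj) in enumerate(factors):
                if i != j and si <= sj:
                    # Multiply factor i into factor j pointwise.
                    vs_j = sorted(sj)
                    idx = [vs_j.index(v) for v in sorted(si)]
                    new_tj = {a: v * ti[tuple(a[k] for k in idx)]
                              for a, v in tj.items()}
                    factors[j] = (sj, new_tj)
                    del factors[i]
                    changed = True
                    break
            if changed:
                break
    return factors
```

The loop restarts after every merge, so chains of subsumptions (a function absorbed into one that is itself later absorbed) are handled.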
In the next chapter we provide the results of a variety of experiments we performed
that compare the performance of DynaDecomp and DynaDecompPlus. We also com-
pare the tightness of the corrected MAS bounds derived in chapter 3 with the L∞
bounds derived in chapter 4.
Chapter 7
Experiments
In this chapter we present results of a variety of experiments that we performed in
order to evaluate the theoretical results derived in the previous chapter. In section 7.1
we describe the experimental setup we used to perform our experiments. Next, in
section 7.2 we show results that were obtained using simple Ising models. Finally,
in section 7.3 we show the results obtained using some instances from the UAI 2006
inference competition.
7.1 Experimental Setup
All the experiments focus on the inference problem of computing the probability of
evidence. The probability distribution we consider is always normalized or we know
the true normalization constant.
The experiments were run on a desktop system containing two Intel R© CoreTM2 Duo
processors running at 3 GHz, with 4 GB of main memory, running the 64-bit Ubuntu
operating system. The programming environment was Matlab. In order to code the
algorithms, some functions from the Bayes Net Toolbox [18] were used.
To determine the elimination order we used the MIN-FILL heuristic. The subsets for
a decomposition were chosen at random. To optimize a decomposition, we used the
L2 optimization method discussed in chapter 5.
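For reference, the MIN-FILL heuristic can be sketched as follows; this is the standard greedy formulation written in Python for illustration, not the exact routine we used:

```python
# MIN-FILL: at each step, eliminate the variable whose removal would add
# the fewest fill-in edges to the interaction graph of the model.

def min_fill_order(variables, scopes):
    """Return a greedy min-fill elimination order for the factor scopes."""
    # Build the interaction graph: variables sharing a scope are adjacent.
    adj = {v: set() for v in variables}
    for s in scopes:
        for u in s:
            adj[u] |= set(s) - {u}
    order = []
    remaining = set(variables)
    while remaining:
        def fill_cost(v):
            # Count pairs of v's surviving neighbors not already connected.
            nbrs = sorted(adj[v] & remaining)
            return sum(1 for i in range(len(nbrs))
                       for j in range(i + 1, len(nbrs))
                       if nbrs[j] not in adj[nbrs[i]])
        v = min(sorted(remaining), key=fill_cost)
        # Add the fill-in edges among v's neighbors, then remove v.
        nbrs = adj[v] & remaining
        for u in nbrs:
            adj[u] |= nbrs - {u}
        remaining.remove(v)
        order.append(v)
    return order
```

Ties are broken deterministically here by sorting; in practice random tie-breaking is also common.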
7.2 Results on Ising Models
We considered a 15×15 Ising model with binary nodes and random pairwise potentials. For evidence, we randomly assigned values to about 20% of the nodes. Since we
use random splits for decomposition, we ran each algorithm ten times for each value
of the threshold and chose the result that had the smallest error value.
First we compare the performance of the corrected MAS bounds derived in chapter 3
versus the L∞ MAS bounds derived in chapter 4. Figure 7.1 shows the result of
running the DynaDecomp algorithm as described in [24] for different values of the
threshold parameter. The clear square represents the exact value of logP (E) while
the green circle shows the value of the approximation log P̃ (E). The black line on
the left shows the corrected bounds whereas the red line on the right shows the L∞
bounds. We see that the L∞ bounds are tighter than the corrected bounds for all
values of the threshold.
Next, we check if using overlapping decompositions with DynaDecomp results in
tighter bounds. Figure 7.2 shows the result of running the DynaDecomp algorithm
with overlapping decompositions. The bounds in Figure 7.2 are marginally tighter
than the bounds in Figure 7.1 which suggests that using overlapping decompositions
gives slightly better results than using disjoint decompositions. We also see that the
L∞ bounds are again tighter than the corrected bounds for all values of the threshold.
[Plot: log P(E) versus the maximum size specified]
Figure 7.1: Bounds for logP (E) using DynaDecomp algorithm with no overlappingdecompositions allowed
[Plot: log P(E) versus the maximum size specified]
Figure 7.2: Bounds for logP (E) using DynaDecomp algorithm with overlapping de-compositions allowed
Now we look at the performance of the corrected bounds and the L∞ bounds on
the DynaDecompPlus algorithm. Figure 7.3 shows the results of running DynaDe-
compPlus for different values of the threshold parameter. Since the L∞ bounds in
Figure 7.3 are not very clear, we plotted just the L∞ bounds in Figure 7.4. Again
we see the trend that the L∞ bounds are tighter than the corrected bounds. In fact,
in all the experiments we ran the L∞ bounds were much tighter than the corrected
bounds.
Having compared the performance of the bounds, we now focus on the comparison
between DynaDecomp and DynaDecompPlus. We first look at the control the algo-
rithms provide over the time and space complexities.
Figure 7.5 is a plot of the accuracy of the result versus the maximum function size
created by DynaDecomp. The plot shows multiple runs of DynaDecomp at each
value for the threshold parameter. We can see that even when we set the threshold
at M = 3, DynaDecomp still generated functions over 7 to 9 variables. This makes
it hard to control the time and space complexity of DynaDecomp. For example,
assuming that we have binary variables and want to limit ourselves to functions not
larger than 210 values, we have no easy way of setting the threshold parameter M
because as seen in Figure 7.5 there is no simple relation between the threshold value
and the maximum function size created by DynaDecomp.
Figure 7.6 is a plot of the accuracy of the result versus the maximum function size
created by DynaDecompPlus. It shows that DynaDecompPlus never generates a
function that is larger than the threshold specified. Thus, looking at our earlier
example, if we wanted to limit ourselves to functions not larger than 210 values,
we just have to set the threshold parameter M = 10. Comparing Figure 7.6 with
Figure 7.5 we can see that DynaDecompPlus provides much greater control over the
time and space complexities than DynaDecomp.
[Plot: log P(E) versus the maximum size specified]
Figure 7.3: Bounds for logP (E) using DynaDecompPlus algorithm
[Plot: log P(E) versus the maximum size specified]
Figure 7.4: L∞ Bounds for logP (E) using DynaDecompPlus algorithm
[Plot: log(approx) − log(exact) versus maximum function size; one series per threshold, maxsize = 3 through 10]
Figure 7.5: Plot of accuracy versus maximum function size for the DynaDecompalgorithm
[Plot: log(approx) − log(exact) versus maximum function size; one series per threshold, maxsize = 5 through 15]
Figure 7.6: Plot of accuracy versus maximum function size for the DynaDecompPlusalgorithm
[Plot: log P(E) versus maximum function size]
Figure 7.7: L∞ Bounds for logP (E) vs Max Function Size for the DynaDecompalgorithm
[Plot: log P(E) versus maximum function size]
Figure 7.8: L∞ Bounds for logP (E) vs Max Function Size for the DynaDecompPlusalgorithm
Having shown that DynaDecompPlus provides greater control over the complexity
than DynaDecomp, we now look at the performance of DynaDecompPlus and DynaDecomp for the same complexity. Figure 7.7 shows a plot of the L∞ bounds against
the maximum function size that DynaDecomp generated. Figure 7.8 shows a plot of
the L∞ bounds against the maximum function size that DynaDecompPlus generated
for the same problem. We can see that for the same complexity DynaDecompPlus is
more accurate and generates tighter bounds than DynaDecomp.
7.3 Results on UAI Data
In the previous section we considered only simple Ising models. To compare the
performance of the bounds and the two algorithms on more complex models, we ran
experiments on seventeen instances from the UAI 2006 inference competition. The
instances were taken from the UAI06 Problem Repository available at [1]. We chose
instances that had no zero probabilities and to compute the exact solution, we used
the aolibPE package for exact inference in Bayesian Networks available at [17].
We first tried to run the DynaDecomp algorithm. However, for the heuristic ordering
we chose, and a range of threshold values from M = 3 to M = 15, it failed to solve
any of the instances as it kept running out of memory.
Next, we ran the DynaDecompPlus algorithm for different values of the threshold
parameter. We provide the results for thresholds of 9, 12 and 15 in Figure 7.9, Fig-
ure 7.10 and Figure 7.11 respectively. We can see that as we increase the size of the
threshold the bounds get tighter. We again see that the L∞ bounds are much tighter
than the corrected bounds. In fact the corrected bounds are extremely loose. We do
not see the ends of the corrected bounds because we limited the Y -axis to lie between
−200 and 200 for plotting purposes.
Although the L∞ bounds are tighter than the corrected bounds, we see that in most
cases even they are not tight enough for the log probability of evidence. For example,
we see that in many cases the upper bound on logP (E) is greater than zero and
the lower bound is about −100. This states that the probability of evidence lies
somewhere between e^−100 and 1, which is not very useful.
From these experiments we can conclude that the corrected MAS bounds from chap-
ter 3 are not useful even for simple Ising models and are quite unusable for any
complex models. The L∞ bounds on the other hand work reasonably well for simple
models but as the models get more complicated they too result in not very useful
bounds.
With regards to the algorithms, we can see that DynaDecomp does a poor job of
controlling the time and space complexities. For complex models it has trouble per-
forming inference even for small values of the threshold parameter. DynaDecompPlus
on the other hand can efficiently control the time and space complexities and can per-
form inference with good accuracy.
[Plot: log P(E) versus problem number]
Figure 7.9: Bounds for logP (E) using DynaDecompPlus algorithm with a maximumfunction size of 9
[Plot: log P(E) versus problem number]
Figure 7.10: Bounds for logP (E) using DynaDecompPlus algorithm with a maximumfunction size of 12
[Plot: log P(E) versus problem number]
Figure 7.11: Bounds for logP (E) using DynaDecompPlus algorithm with a maximumfunction size of 15
Chapter 8
Conclusions
The research presented in this thesis focused on a specific approximation scheme,
called the Multiplicative Approximation Scheme (MAS), that was introduced by
Wexler and Meek. It is a general approximation scheme that can be applied to
inference algorithms so as to perform bounded approximate inference in graphical
models.
8.1 Contributions
We analyzed the multiplicative approximation scheme and proved that in reality the
bounds that were claimed by Wexler and Meek turn out to be incorrect when dealing
with normalized probability distributions. We then proceeded to derive the correct
bounds for these situations.
We also introduced a new way to calculate the error of a local decomposition. Our
method used a measure that was based on the L∞ norm. Using the L∞ error measure,
we derived new bounds on the results of an inference procedure.
We also analyzed the ε-decomposition optimization strategy. Wexler and Meek had
provided a closed form solution using L2 optimization on disjoint subsets. We pro-
vided a method that allowed us to generate overlapping subsets in our decompositions
using the closed form solution for disjoint subsets.
We then proceeded to analyze the algorithm developed by Wexler and Meek called Dy-
naDecomp, which is the application of MAS to bucket elimination. We demonstrated
the limitations of DynaDecomp particularly with respect to the lack of control over
how much space the algorithm would require. We then modified DynaDecomp and
developed our own DynaDecompPlus algorithm that gave us the necessary control
over the space the algorithm needed.
Finally we provided experimental evidence that demonstrated how in practice the
corrected bounds for MAS turn out to be very loose whereas our L∞ bounds were
much tighter. We also showed how using overlapping decompositions resulted in
tighter bounds than using disjoint decompositions. We also compared the workings
of DynaDecomp with DynaDecompPlus and showed how DynaDecompPlus provided
much better control over space utilization. We also showed how it could solve prob-
lems that DynaDecomp had trouble solving.
8.2 Future Work
With the extensions we have provided, MAS still has room for additional improve-
ments that can be pursued in the future.
The biggest room for improvement is with the local decomposition. For example, say
we had a function defined over five variables, f(X1, X2, X3, X4, X5), and we would like
to decompose it into functions over no more than three variables, there
are many different options for us to try. We could create five single variable functions
f̃i(Xi) or we might choose a disjoint partition f̃1(X1, X3, X4) & f̃2(X2, X5) or we might
create an overlapping partition of the form f̃1(X1, X2, X3) & f̃2(X3, X4, X5) and so
on. Currently we make this decision by choosing the subsets at random. Therefore, it
would be worthwhile to explore some more sophisticated techniques to choose these
subsets.
Ideally we would like to choose the subsets such that the value of the error measure ε
or δ for the decomposition is the smallest. We could consider a brute force approach
of trying all possible decompositions, but with there being an exponential number of
decompositions to try, this would be infeasible for all but the most trivial cases.
One possible approach is to use mutual information, which measures the mutual dependence of two variables, or conditional mutual information, the mutual information of two variables conditioned on a third. There
has been recent work done by Narasimhan and Bilmes [19] and by Chechetka and
Guestrin [3] that uses conditional mutual information to learn graphical models with
bounded tree-width. Some of the ideas presented in these papers could be used to
provide a solution to our problem of finding good decompositions.
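As a concrete starting point, the pairwise mutual information that such a method would rank can be computed directly from a normalized joint distribution over two variables. The sketch below is our own Python illustration, not a method taken from the cited papers:

```python
import math

def pairwise_mi(joint):
    """Mutual information I(X;Y) in nats, from a joint distribution
    given as a dict mapping (x, y) pairs to probabilities."""
    # Marginalize to get p(x) and p(y).
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # I(X;Y) = sum p(x,y) log[ p(x,y) / (p(x) p(y)) ].
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)
```

Variable pairs with high mutual information would then be kept in the same subset of a decomposition, so that the factorization discards only weak dependencies.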
Bibliography
[1] UAI 2006 problem repository in Ergo file format. http://graphmod.ics.uci.edu/repos/pe/uaicomp06/, 2006.

[2] S. Arnborg. Efficient algorithms for combinatorial problems with bounded decomposability – a survey. BIT, 25(1):2–23, 1985.

[3] A. Chechetka and C. Guestrin. Efficient principled learning of thin junction trees. In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, December 2007.

[4] G. F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks (research note). Artificial Intelligence, 42(2-3):393–405, 1990.

[5] A. Darwiche. Recursive conditioning. Artificial Intelligence, 126(1-2):5–41, 2001.

[6] R. Dechter. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113(1-2):41–85, 1999.

[7] R. Dechter. Constraint Processing. Morgan Kaufmann Publishers, 340 Pine Street, Sixth Floor, San Francisco, CA 94104-3205, 2003.

[8] R. Dechter and J. Pearl. Network-based heuristics for constraint-satisfaction problems. Artificial Intelligence, 34(1):1–38, 1987.

[9] R. Dechter and I. Rish. Mini-buckets: A general scheme for bounded inference. Journal of the ACM, 50(2):107–153, 2003.

[10] M. Henrion. Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Proceedings of the 2nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-86), New York, NY, 1986. Elsevier Science.

[11] A. T. Ihler, J. W. Fisher III, and A. S. Willsky. Loopy belief propagation: Convergence and effects of message errors. Journal of Machine Learning Research, 6:905–936, May 2005.

[12] F. V. Jensen, S. L. Lauritzen, and K. G. Olesen. Bayesian updating in causal probabilistic networks by local computations. Computational Statistics Quarterly, 4:269–282, 1990.

[13] M. Jordan, editor. Learning in Graphical Models. The MIT Press, 1998.

[14] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In Learning in Graphical Models, pages 105–161. MIT Press, Cambridge, MA, USA, 1999.

[15] F. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, February 2001.

[16] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50(2):157–224, 1988.

[17] R. Mateescu. aolibPE package for exact inference in Bayesian networks. http://graphmod.ics.uci.edu/group/aolibPE, 2008.

[18] K. Murphy. Bayes Net Toolbox for Matlab. http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html, 2007.

[19] M. Narasimhan and J. Bilmes. PAC-learning bounded tree-width graphical models. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 410–417, Arlington, Virginia, United States, 2004. AUAI Press.

[20] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, September 1988.

[21] N. Robertson and P. D. Seymour. Graph minors. III. Planar tree-width. Journal of Combinatorial Theory, Series B, 36(1):49–64, 1984.

[22] R. D. Shachter, B. D'Ambrosio, and B. A. Del Favero. Symbolic probabilistic inference in belief networks. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 126–131, Boston, Massachusetts, United States, 1990.

[23] G. R. Shafer and P. P. Shenoy. Probability propagation. Annals of Mathematics and Artificial Intelligence, 2(1-4):327–351, 1990.

[24] Y. Wexler and C. Meek. MAS: A multiplicative approximation scheme for probabilistic inference. In Advances in Neural Information Processing Systems 21, pages 1761–1768. 2009.

[25] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. In Exploring Artificial Intelligence in the New Millennium, pages 239–269. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.