UNIVERSITY OF CALIFORNIA, IRVINE
Fixing and Extending the Multiplicative Approximation Scheme
THESIS
submitted in partial satisfaction of the requirements for the degree of
MASTER OF SCIENCE
in Information and Computer Science
by
Sidharth Shekhar
Thesis Committee:
Professor Alexander Ihler, Chair
Professor Xiaohui Xie
Professor Deva Ramanan
2009
© 2009 Sidharth Shekhar
The thesis of Sidharth Shekhar is approved:
Committee Chair
University of California, Irvine
2009
TABLE OF CONTENTS

LIST OF FIGURES
ACKNOWLEDGMENTS
ABSTRACT OF THE THESIS

1 Introduction
  1.1 Thesis outline
2 Background
  2.1 Notation
  2.2 Graphical Models
    2.2.1 Markov Network
    2.2.2 Bayesian Network
  2.3 Inference in Graphical Models
  2.4 Introduction to Bucket Elimination
3 The Multiplicative Approximation Scheme
  3.1 Preliminaries
  3.2 MAS Basics
  3.3 MAS Error Bounds
    3.3.1 Shift and Renormalize
    3.3.2 No Shifting of Functions
  3.4 Example
  3.5 Tracking ε in an Inference Procedure
  3.6 Need for new bounds
4 MAS Error Bounds using the L∞ norm
  4.1 The L∞ norm
  4.2 Tracking δ in an Inference Procedure
  4.3 New Bounds for MAS
5 Optimizing a Decomposition
  5.1 Optimizing a decomposition with disjoint subsets
  5.2 Optimizing a decomposition with overlapping subsets
6 Bucket Elimination with MAS
  6.1 The DynaDecomp Algorithm
  6.2 Improvements to DynaDecomp
    6.2.1 Using L∞ Bounds with DynaDecomp
    6.2.2 The DynaDecompPlus Algorithm
7 Experiments
  7.1 Experimental Setup
  7.2 Results on Ising Models
  7.3 Results on UAI Data
8 Conclusions
  8.1 Contributions
  8.2 Future Work
Bibliography
LIST OF FIGURES

2.1 A Simple Bayesian Network
2.2 (a) A belief network, (b) its induced graph along o = (A,E,D,C,B), and (c) its induced graph along o = (A,B,C,D,E)
2.3 A trace of algorithm elim-bel
2.4 The MIN-FILL algorithm
3.1 Schematic description of error tracking in an inference procedure using ε-decompositions
4.1 (a) A function f(X) and an example approximation f̂(X); (b) their log-ratio log f(X)/f̂(X), and the error measure δ
4.2 Schematic description of error tracking in an inference procedure using δ-decompositions
5.1 Schematic trace of generating overlapping decompositions
6.1 The DynaDecomp Algorithm
6.2 A trace of algorithm DynaDecomp
6.3 The DynaDecompPlus Algorithm
6.4 A trace of algorithm DynaDecompPlus
7.1 Bounds for logP(E) using DynaDecomp algorithm with no overlapping decompositions allowed
7.2 Bounds for logP(E) using DynaDecomp algorithm with overlapping decompositions allowed
7.3 Bounds for logP(E) using DynaDecompPlus algorithm
7.4 L∞ Bounds for logP(E) using DynaDecompPlus algorithm
7.5 Plot of accuracy versus maximum function size for the DynaDecomp algorithm
7.6 Plot of accuracy versus maximum function size for the DynaDecompPlus algorithm
7.7 L∞ Bounds for logP(E) vs Max Function Size for the DynaDecomp algorithm
7.8 L∞ Bounds for logP(E) vs Max Function Size for the DynaDecompPlus algorithm
7.9 Bounds for logP(E) using DynaDecompPlus algorithm with a maximum function size of 9
7.10 Bounds for logP(E) using DynaDecompPlus algorithm with a maximum function size of 12
7.11 Bounds for logP(E) using DynaDecompPlus algorithm with a maximum function size of 15
ACKNOWLEDGMENTS
I am indebted to all those who have helped me in finishing this thesis. In particular
I would like to thank my advisor, Prof. Alexander Ihler for his advice, his guidance,
his constructive criticism and his comments on the earlier drafts of my thesis. I
would also like to thank my committee members, Prof. Xiaohui Xie and Prof. Deva
Ramanan, for taking the time to review and comment on my work.
I would also like to thank my colleague Drew Frank for some insightful discussions
that greatly assisted my understanding of the problem.
I would also like to thank all my friends in our “IKG” group, in particular Shweta
Akhila and Vivek Kumar Singh, for ensuring that at least once a week I could take
my mind off my work and have some fun.
Last but not least I would like to thank Natasha Flerova for her comments, her
support and for always being there to cheer me up whenever I felt any despair.
ABSTRACT OF THE THESIS
Fixing and Extending the
Multiplicative Approximation Scheme
By
Sidharth Shekhar
Master of Science in Information and Computer Science
University of California, Irvine, 2009
Professor Alexander Ihler, Chair
We analyze the Multiplicative Approximation Scheme (MAS) for inference problems,
developed by Wexler and Meek in [24]. MAS decomposes the intermediate factors of
an inference procedure into factors over smaller sets of variables with a known error
and translates the errors into bounds on the results. We analyze the bounds proposed
in [24] and show that under certain conditions the bounds are incorrect. We then
derive corrected bounds which we show to be theoretically sound. We also derive
new bounds for the results of an inference procedure using an alternate error measure
based on the L∞ norm.
We analyze the methods for optimizing the decomposition of a factor. For decom-
positions containing disjoint sets, Wexler and Meek provide a closed form solution
that uses the L2 norm. We present a method that uses this closed form solution to
generate decompositions containing overlapping sets.
Further, we analyze Wexler and Meek’s DynaDecomp algorithm, which applies MAS
to bucket elimination [6]. We show that DynaDecomp provides little control over the
size of the largest factor generated. We present our own DynaDecompPlus algorithm
which uses a different strategy for decomposition. The new strategy allows us to
guarantee that we will never generate a factor larger than the specified threshold.
Finally we provide experimental results which show that the L∞ bounds are tighter
than the corrected bounds. We demonstrate the superiority of the DynaDecomp-
Plus algorithm by showing how it can perfectly limit the size of the factors whereas
DynaDecomp cannot do so.
Chapter 1
Introduction
Graphical models are a widely used representation framework in a variety of fields,
especially probability theory, statistics, and machine learning. Models such as Bayesian
networks, Markov networks, and constraint networks are commonly used for reasoning
with probabilistic and deterministic information. By using graphs, these models
provide an extremely intuitive way to represent the conditional independence struc-
ture among random variables.
This thesis focuses on the problem of performing bounded approximate inference
in graphical models. In particular, we analyze an approximation scheme called the
Multiplicative Approximation Scheme (MAS) developed by Wexler and Meek in [24].
We provide a brief description of the structure of the thesis along with our contribu-
tions in the following section.
1.1 Thesis outline
The thesis begins by providing relevant background information in chapter 2. Here
we introduce the notation that we will be following throughout the thesis. We then
provide a brief description of graphical models and of performing inference in graphical
models. We also provide a brief introduction to an exact inference algorithm called
bucket elimination [6].
Chapter 3 provides the details of MAS. We present the scheme as originally described
by Wexler and Meek. By providing simple counterexamples, we show that their
proposed bounds are incorrect for marginalization tasks in normalized probability
distributions, such as those represented by Bayesian networks or Markov random
fields. We then fix the error bounds by deriving corrected bounds, which we show
to be theoretically sound under all conditions.
Chapter 4 presents an alternate metric that can be used to measure the error of a
decomposition. The metric we use is based on the L∞ norm. We show how this
metric can be used with MAS and derive new bounds on the accuracy of the results
of an inference procedure.
In the preceding chapters, we assumed that the factored decomposition was already
given to us. Using this assumption, we measured errors and derived bounds on the
results. In chapter 5 we look at the methods for generating the decomposition of a
factor. We present the L2 optimization method developed by Wexler and Meek. It
has a closed form solution if the sets in the decomposition are disjoint. We present
a method that uses this closed form solution to generate decompositions that consist
of overlapping sets.
Chapter 6 looks at the application of MAS to the bucket elimination [6] inference
algorithm. We analyze the DynaDecomp algorithm developed by Wexler and Meek.
We present its major limitation, namely that it provides us with little control over the
size of the largest factor generated. We modify the algorithm and present our own
DynaDecompPlus algorithm. We show how we can easily limit the size of the largest
factor that the algorithm generates by using a smarter strategy for decomposition.
Chapter 7 provides the results of the experimental evaluations of the theories pre-
sented in the previous chapters. We present scenarios wherein the original MAS
bounds fail and wherein the corrected MAS bounds that we derived turn out to be
quite loose and impractical. We then compare the corrected MAS bounds and the
new L∞ bounds that we introduced in chapter 4 and show that the new bounds are
tighter than the corrected bounds. We also compare the performance of the original
DynaDecomp algorithm with our DynaDecompPlus algorithm. We demonstrate the
superiority of the DynaDecompPlus algorithm by showing how it can perfectly limit
the size of the factors whereas DynaDecomp cannot do so. We also look at instances
from the UAI’06 inference evaluation competition [1]. These instances clearly high-
light the problems with the corrected bounds and the DynaDecomp algorithm. The
DynaDecomp algorithm had trouble solving many of these instances because it ran
into memory issues; therefore, we only present the results of running the DynaDecompPlus
algorithm. Comparing the two bounds on these UAI instances, we see that only in some
instances are the new bounds tight and useful. The corrected bounds, however, are
extremely loose on all the instances we ran.
Chapter 8 concludes the thesis and also looks at some possible directions for future
research.
Chapter 2
Background
This chapter provides all of the relevant background information that is required for
this thesis. We begin in section 2.1 by introducing the notation we will be following
throughout the thesis. We follow this with a brief description of graphical models and
inference problems in graphical models in sections 2.2 and 2.3 respectively. Finally
we provide a brief introduction to the bucket elimination algorithm in section 2.4.
2.1 Notation
Unless specified otherwise, we follow the conventions specified below:

1. We will make no distinction between a variable and a set of variables in this
thesis. Uppercase letters will be used to denote both variables and sets
of variables. For example, we can have a variable X1, or a set of variables
X = {X1, X2, . . . , Xn}.
2. Given a function or probability distribution over a set of variables X, F(X) or
P(X), we denote an approximation to this function as F̃(X) or P̃(X).
2.2 Graphical Models
“Graphical models are a marriage between probability theory and graph theory. They
provide a natural tool for dealing with two problems that occur throughout applied
mathematics and engineering – uncertainty and complexity – and in particular they
are playing an increasingly important role in the design and analysis of machine
learning algorithms. Fundamental to the idea of a graphical model is the notion of
modularity – a complex system is built by combining simpler parts. Probability theory
provides the glue whereby the parts are combined, ensuring that the system as a whole
is consistent, and providing ways to interface models to data. The graph theoretic side
of graphical models provides both an intuitively appealing interface by which humans
can model highly-interacting sets of variables as well as a data structure that lends
itself naturally to the design of efficient general-purpose algorithms.
Many of the classical multivariate probabilistic systems studied in fields such as statis-
tics, systems engineering, information theory, pattern recognition and statistical me-
chanics are special cases of the general graphical model formalism – examples include
mixture models, factor analysis, hidden Markov models, Kalman filters and Ising mod-
els. The graphical model framework provides a way to view all of these systems as
instances of a common underlying formalism. This view has many advantages – in
particular, specialized techniques that have been developed in one field can be trans-
ferred between research communities and exploited more widely. Moreover, the graph-
ical model formalism provides a natural framework for the design of new systems.”
— Michael Jordan [13].
Probabilistic graphical models are graphs in which the nodes represent random vari-
ables, and the connectivity of the graph represents conditional independence assump-
tions. Two common types of graphical models are Markov networks, which use undi-
rected graphs, and Bayesian Networks, which use directed acyclic graphs. We describe
these two models in the following subsections.
2.2.1 Markov Network
A Markov network is a graphical model that encodes a probability distribution using
an undirected graph. Formally, we can define a Markov network as follows:
Definition 2.1. Given an undirected graph G = (V,E), a set of random variables
X = {Xv; v ∈ V } form a Markov network with respect to G if the joint density of X
can be factored into a product of functions defined as
P(X) = ∏_{C∈cl(G)} ψC(XC)
where cl(G) is the set of cliques, or fully connected subgraphs, of G.
Markov networks represent conditional independence in the following manner: two
sets of nodes A and B are conditionally independent given a third set C, if all paths
between the nodes in A and B are separated by a node in C. For more information
on conditional independence in Markov networks, see the book by Pearl [20].
One example of a Markov network is the log linear model. The log linear model is
given by
P(X) = (1/Z) exp( ∑_k wk^T · ψk(Xk) )
where Z is called the partition function and is defined as
Z = ∑_X exp( ∑_k wk^T · ψk(Xk) )
The partition function Z normalizes the distribution.
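To make the role of the partition function concrete, the following sketch (not from the thesis; the feature functions and weights are made up, with scalar weights standing in for the vector form wk^T · ψk) builds a tiny log linear model over two binary variables and verifies that dividing by Z yields a normalized distribution:

```python
import itertools
import math

# Illustrative log linear model over two binary variables X1, X2.
# Each entry is (feature function psi_k, weight w_k); values are made up.
features = [
    (lambda x1, x2: x1,      0.5),   # bias on X1
    (lambda x1, x2: x2,     -0.3),   # bias on X2
    (lambda x1, x2: x1 * x2, 1.2),   # pairwise interaction
]

def score(x1, x2):
    """Unnormalized log-probability: sum_k w_k * psi_k(x)."""
    return sum(w * f(x1, x2) for f, w in features)

# Partition function: Z sums exp(score) over all assignments.
Z = sum(math.exp(score(x1, x2))
        for x1, x2 in itertools.product((0, 1), repeat=2))

def prob(x1, x2):
    return math.exp(score(x1, x2)) / Z

total = sum(prob(x1, x2) for x1, x2 in itertools.product((0, 1), repeat=2))
```

Without the division by Z, the four values of exp(score) sum to Z rather than 1; the normalization is what turns the product of local factors into a probability distribution.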
2.2.2 Bayesian Network
A Bayesian network is a graphical model that encodes a probability distribution
using a directed acyclic graph (DAG). Formally, we can define a Bayesian network as
follows:
Definition 2.2. Given a directed acyclic graph G = (V,E) and a set of random
variables X = {Xv; v ∈ V }, G is a Bayesian network if the joint probability density
function can be written as
P(X) = ∏_{v∈V} P(Xv | Xpa(v))
where pa(v) is the set of parents of v.
Bayesian Networks have a more complicated characterization of conditional inde-
pendence than Markov networks, since Bayesian Networks take into account the di-
rectionality of the arcs. For a detailed description of conditional independence in
Bayesian Networks, refer to the book by Pearl [20].
An example of a Bayesian network is shown in Figure 2.1. At each node the con-
ditional probability distribution of the child node given the parent node has been
specified using a table.
Although Bayesian Networks have a more complicated characterization of conditional
[Figure: a Bayesian network over RAIN, SPRINKLER, and GRASS WET, in which RAIN is a parent of SPRINKLER and both RAIN and SPRINKLER are parents of GRASS WET; each node is annotated with its conditional probability table.]

Figure 2.1: A Simple Bayesian Network
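To make Definition 2.2 concrete, the sketch below builds the joint distribution of a sprinkler-style network as a product of local conditionals and checks that it is normalized. The CPT values are illustrative stand-ins and are not guaranteed to match Figure 2.1 exactly:

```python
import itertools

# Illustrative CPTs: P(RAIN), P(SPRINKLER | RAIN), P(GRASS_WET = 1 | S, R).
p_rain = {1: 0.2, 0: 0.8}
p_sprinkler = {0: {1: 0.4, 0: 0.6},      # outer key: rain value
               1: {1: 0.01, 0: 0.99}}
p_wet = {(0, 0): 0.0, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.99}  # key: (s, r)

def joint(r, s, w):
    """P(R, S, W) = P(R) * P(S | R) * P(W | S, R), as in Definition 2.2."""
    pw = p_wet[(s, r)]
    return p_rain[r] * p_sprinkler[r][s] * (pw if w == 1 else 1.0 - pw)

# Summing the product of the local conditionals over all assignments gives 1.
total = sum(joint(r, s, w) for r, s, w in itertools.product((0, 1), repeat=3))
```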
independence than Markov networks, they do have some advantages. The biggest
advantage is that they can represent causality, i.e., one can regard an arc from A to
B as indicating that A “causes” B. For a detailed study of the relationship between
directed and undirected graphical models and how to transform one to the other, we
direct the reader to the book by Pearl [20].
2.3 Inference in Graphical Models
Given a graphical model on a set of variables X, we can partition this set into
disjoint subsets H and E, where H represents the hidden variables and E represents the
evidence. The evidence consists of the set of variables that have been observed to
take on specific values, whereas the hidden variables are the variables for which no
observations are available. Hence, if we have no observations then the evidence set
E = ∅, and the set of hidden variables H = X. Within this setup, there are four
kinds of queries that are commonly made. They are as follows:
1. Likelihood of Evidence: Given a graphical model and some observations, a
natural question is how likely the observations are under the chosen model. In
other words, we would like to compute the probability of the evidence under
the model. This is computed as follows:
P(E) = ∑_H P(H,E)
2. Posterior Marginal probability: Given a variable A ∈ H, we would like to
evaluate the effect the evidence has on variable A, averaging out the effects of
all other variables. This is computed as follows:
P(A) = ∑_{H\A} P(H,E)
3. Most Probable Explanation (MPE): The MPE task is to find an assignment to
the set of hidden variables, denoted as H∗, such that the probability P (H∗) is
given as follows:
P(H∗) = max_H P(H,E)
where the assignment H∗ is defined as:
H∗ = argmax_H P(H,E)
4. Maximum a posteriori (MAP) probabilities: Given a set A ⊂ H, the MAP task
is to find an assignment to the set of variables in A, denoted as A∗, such that
the probability P (A∗) is given as follows:
P(A∗) = max_A ∑_{H\A} P(H,E)
where the assignment A∗ is defined as:
A∗ = argmax_A ∑_{H\A} P(H,E)
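All four queries reduce to sums and maximizations over the joint table, so on a toy model they can be computed by brute force. The sketch below uses an arbitrary illustrative joint over three binary variables with evidence X3 = 1; it is exponential in the number of variables and is only meant to make the definitions concrete:

```python
import itertools

# A toy normalized joint P(X1, X2, X3) over binary variables (values illustrative).
weights = {x: 1 + x[0] + 2 * x[1] + 3 * x[2]
           for x in itertools.product((0, 1), repeat=3)}
Z = sum(weights.values())
P = {x: w / Z for x, w in weights.items()}

# Evidence E: X3 = 1.  Hidden variables H: X1, X2.
hidden = list(itertools.product((0, 1), repeat=2))

# 1. Likelihood of evidence: P(E) = sum over H of P(H, E).
p_e = sum(P[(x1, x2, 1)] for x1, x2 in hidden)

# 2. Posterior marginal of X1: sum out the remaining hidden variable X2.
p_x1 = {x1: sum(P[(x1, x2, 1)] for x2 in (0, 1)) for x1 in (0, 1)}

# 3. MPE: the most probable full assignment of the hidden variables.
mpe = max(hidden, key=lambda h: P[(h[0], h[1], 1)])

# 4. MAP over the subset A = {X1}: maximize after summing out X2.
map_x1 = max((0, 1), key=lambda x1: p_x1[x1])
```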
Performing exact inference in a graphical model has been proven to be NP-Hard [4].
A few exact inference methods are available, such as variable elimination [6, 22],
clique tree propagation [12, 16, 23], and recursive conditioning [5]. However, all
these methods have complexity that is exponential in a graph parameter known as
treewidth [21]. In practice the complexities are represented in terms of a related
graph parameter called the induced width [8]. We provide more information regarding
induced width in section 2.4.
The exponential complexity of performing exact inference has led to the development
of a large number of approximate inference algorithms [15, 25, 14, 9, 10]. With
approximation schemes, it is useful to have error bounds on the result, so that we
have some guarantees on the exact value of the probabilities we are interested in.
Many of the approximation techniques developed do provide such guarantees.
In this thesis we focus on a general scheme called the Multiplicative Approximation
Scheme (MAS) [24]. MAS decomposes the intermediate factors of an inference proce-
dure into factors over smaller sets of variables with a known error. It then translates
the errors into bounds on the accuracy of the results. To test the accuracy and tight-
ness of the MAS bounds, we apply MAS to the bucket elimination algorithm. In the
next section we provide a brief introduction to the bucket elimination algorithm.
2.4 Introduction to Bucket Elimination
Bucket elimination is a framework that provides a unifying view of variable elimina-
tion algorithms for a variety of reasoning tasks.
A bucket elimination algorithm accepts as input an ordered set of variables and
a set of dependencies, such as the probability functions given by a graphical model.
Each variable is then associated with a bucket constructed as follows: all the functions
defined on variable Xi but not on higher index variables are placed into the bucket of
Xi. Once the buckets are created the algorithm processes them from last to first. It
computes new functions by applying an elimination operator to all the functions in the
bucket. The new functions, which summarize the effect of Xi on the rest of the problem,
are placed in the appropriate lower buckets.
The elimination operator depends on the task. For example, if we consider the task of
computing the likelihood of evidence or the posterior marginal, then the desired
elimination operator is summation. To eliminate a variable Xi we only need to look at
the functions that depend on Xi. Therefore, we look at the bucket of Xi and compute ∑_{Xi} F, where F is the product of all functions in the bucket of Xi. Similarly, for an
MPE task, the elimination operator will be maximization. The bucket-elimination
algorithm terminates when all buckets are processed, or when some stopping criterion
is satisfied.
To demonstrate the bucket elimination approach, we consider an example for com-
puting the posterior marginal. We apply the bucket elimination algorithm to the
network in Figure 2.2a, with the query variable A, the ordering o = (A,E,D,C,B),
and evidence E = 0. The posterior marginal P (A|E = 0) is given by
P(A|E = 0) = P(A,E = 0) / P(E = 0)
We can compute the joint probability P (A,E = 0) as
P(A,E = 0) = ∑_{E=0,D,C,B} P(A,B,C,D,E)
           = ∑_{E=0,D,C,B} P(A) P(C|A) P(E|B,C) P(D|A,B) P(B|A)
           = P(A) ∑_{E=0} ∑_D ∑_C P(C|A) ∑_B P(E|B,C) P(D|A,B) P(B|A)
The bucket elimination algorithm computes this sum from right to left using the
buckets, as shown below:
1. bucket B: hB(A,D,C,E) = ∑_B P(E|B,C) P(D|A,B) P(B|A)

2. bucket C: hC(A,D,E) = ∑_C P(C|A) hB(A,D,C,E)

3. bucket D: hD(A,E) = ∑_D hC(A,D,E)

4. bucket E: hE(A) = hD(A,E = 0)

5. bucket A: P(A,E = 0) = P(A) hE(A)
To compute the posterior marginal P (A|E = 0), we can just normalize the result in
bucket A since summing out A is equivalent to calculating P (E = 0). Therefore we
can calculate P (A|E = 0) as
P(A|E = 0) = α P(A) hE(A)
where α is a normalizing constant. A schematic trace of the algorithm is shown in
Figure 2.3.
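The bucket processing above can be sketched in a few lines of generic code. The following is a minimal, illustrative implementation (not the thesis's), representing factors as dictionaries over binary variables; the factor values are arbitrary positive numbers with the same scopes as the example, and E is clamped to 0 before elimination:

```python
import itertools

# A factor is (vars, table): vars is a tuple of names, table maps each
# binary assignment (a tuple aligned with vars) to a nonnegative value.
def multiply(f, g):
    (fv, ft), (gv, gt) = f, g
    av = tuple(dict.fromkeys(fv + gv))  # ordered union of the two scopes
    at = {}
    for x in itertools.product((0, 1), repeat=len(av)):
        a = dict(zip(av, x))
        at[x] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return av, at

def sum_out(f, var):
    fv, ft = f
    rv = tuple(v for v in fv if v != var)
    rt = {}
    for x, val in ft.items():
        key = tuple(xi for v, xi in zip(fv, x) if v != var)
        rt[key] = rt.get(key, 0.0) + val
    return rv, rt

def bucket_eliminate(factors, order):
    """Sum out the variables in `order`, bucket by bucket."""
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        prod = bucket[0]
        for f in bucket[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Factors with the same scopes as the example; values are arbitrary.
pA   = (('A',), {(0,): 0.6, (1,): 0.4})
pBA  = (('B', 'A'), {(0, 0): 0.7, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.8})
pCA  = (('C', 'A'), {(0, 0): 0.5, (0, 1): 0.9, (1, 0): 0.5, (1, 1): 0.1})
pDAB = (('D', 'A', 'B'),
        {x: 0.25 + 0.5 * sum(x) / 3 for x in itertools.product((0, 1), repeat=3)})
pEBC = (('E', 'B', 'C'),
        {x: 0.1 + 0.8 * x[0] * x[1] + 0.05 * x[2] for x in itertools.product((0, 1), repeat=3)})

# Clamp the evidence E = 0, then eliminate B, C, D, leaving a function of A.
evid = (('B', 'C'), {x: pEBC[1][(0,) + x] for x in itertools.product((0, 1), repeat=2)})
scope, marg = bucket_eliminate([pA, pBA, pCA, pDAB, evid], ['B', 'C', 'D'])
```

The result can be checked against the brute-force sum over all assignments, which is exactly what bucket elimination computes, bucket by bucket.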
The performance of bucket elimination algorithms can be predicted using a graph
parameter called induced width [8]. It is defined as follows
Definition 2.3. An ordered graph is a pair (G, o) where G is an undirected graph,
and o = (X1, . . . , Xn) is an ordering of nodes. The width of a node is the number
of the node’s neighbors that precede it in the ordering. The width of an ordering o is
the maximum width over all nodes. The induced width of an ordered graph, denoted
by w∗(o), is the width of the induced ordered graph obtained as follows: nodes are
processed from last to first; when node Xi is processed all its preceding neighbors are
connected. The induced width of a graph, denoted by w∗, is the minimal induced width
over all its orderings.
For example, to compute the induced width along any ordering of the graph in Fig-
ure 2.2a, we need to first convert it into an undirected graph. This can be easily
done as follows: First we traverse the graph and for every two unconnected nodes that
have a common child, we connect them using an undirected edge. In Figure 2.2a this
is shown using the undirected dashed edge. Next we convert all the directed edges
into undirected edges. This undirected graph is known as the moral graph. Figures
2.2b and 2.2c depict the induced graphs of the moral graph in Figure 2.2a along the
orderings o = (A,E,D,C,B) and o′ = (A,B,C,D,E), respectively. The dashed line
here represents the induced edges, which are the edges that were introduced in the
induced ordered graph but were absent in the moral graph. We can see that w∗(o) = 4
and w∗(o′) = 2.
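The induced-width computation in Definition 2.3 is easy to script. Below is an illustrative sketch (not from the thesis) that processes nodes last-to-first and connects each node's preceding neighbors; run on the moral graph of Figure 2.2a (whose edge list is reconstructed here from the factorization in the example) it reproduces w∗(o) = 4 and w∗(o′) = 2:

```python
def induced_width(edges, order):
    """Induced width of an undirected graph along an ordering.

    `order` lists the nodes first-to-last; nodes are processed from last
    to first, and the preceding neighbors of each node are connected.
    """
    adj = {v: set() for v in order}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    pos = {v: i for i, v in enumerate(order)}
    width = 0
    for v in reversed(order):
        prior = {u for u in adj[v] if pos[u] < pos[v]}
        width = max(width, len(prior))
        for u in prior:                 # connect the preceding neighbors
            adj[u] |= prior - {u}
    return width

# Moral graph of the network P(A)P(B|A)P(C|A)P(D|A,B)P(E|B,C):
# the original edges plus the moralizing edge B-C (parents of E).
moral = [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'D'),
         ('B', 'E'), ('C', 'E'), ('B', 'C')]
w1 = induced_width(moral, ['A', 'E', 'D', 'C', 'B'])   # expected 4
w2 = induced_width(moral, ['A', 'B', 'C', 'D', 'E'])   # expected 2
```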
The complexity of bucket elimination algorithms is time and space exponential in
w∗(o) [6]. Therefore, ideally we want to find the variable ordering that has smallest
induced width. Although this problem has been shown to be NP-Hard [2], there are
a few greedy heuristic algorithms that provide good orderings [7].
The greedy heuristic that we use in our experiments is known as the MIN-FILL
algorithm. The reason we choose MIN-FILL is because in most cases it has been
[Figure: three panels (a), (b), (c); panel (a) shows the belief network with factors P(A), P(B|A), P(C|A), P(D|A,B), and P(E|B,C); panels (b) and (c) show the induced graphs, with dashed lines marking the induced edges.]

Figure 2.2: (a) A belief network, (b) its induced graph along o = (A,E,D,C,B), and (c) its induced graph along o = (A,B,C,D,E)
[Figure: a trace of elim-bel showing the buckets of B, C, D, E, and A, the messages hB(A,D,C,E), hC(A,D,E), hD(A,E), and hE(A) passed to lower buckets, and the final result P(A|E = 0).]

Figure 2.3: A trace of algorithm elim-bel
Input: A graph G = (V,E), V = {V1, . . . , Vn}
Output: An ordering o of the nodes in V
for j = n down to 1 do
    R ← a node in V whose neighbor set requires the fewest fill edges
    Put R in position j: o[j] ← R
    Connect R's neighbors: E ← E ∪ {(Vi, Vj)} if (Vi, R) ∈ E and (Vj, R) ∈ E
    Remove R from the resulting graph: V ← V − {R}
end
Return the ordering o

Figure 2.4: The MIN-FILL algorithm
empirically demonstrated to be better than other greedy algorithms [7]. The MIN-FILL
algorithm orders nodes by their min-fill value, which is the number of edges that
must be added to the graph so that the node's neighbor set is fully connected.
The MIN-FILL algorithm is provided in Figure 2.4.
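A compact, illustrative implementation of the greedy rule in Figure 2.4 (ties broken arbitrarily; not the thesis's code) might look like this:

```python
def min_fill_order(edges):
    """Greedy MIN-FILL ordering: repeatedly pick the node whose neighbors
    need the fewest fill edges, connect those neighbors, remove the node.
    Nodes are assigned positions n down to 1, as in Figure 2.4."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def fill_cost(v):
        nbrs = list(adj[v])
        return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
                   if nbrs[j] not in adj[nbrs[i]])

    picked = []
    while adj:
        r = min(adj, key=fill_cost)           # fewest fill edges wins
        nbrs = list(adj[r])
        for i in range(len(nbrs)):            # add the fill edges
            for j in range(i + 1, len(nbrs)):
                adj[nbrs[i]].add(nbrs[j])
                adj[nbrs[j]].add(nbrs[i])
        for u in nbrs:                        # remove r from the graph
            adj[u].discard(r)
        del adj[r]
        picked.append(r)
    picked.reverse()                          # first node picked goes last
    return picked

# The moral graph of the earlier example, as an illustrative input.
moral = [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'D'),
         ('B', 'E'), ('C', 'E'), ('B', 'C')]
order = min_fill_order(moral)
```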
Chapter 3
The Multiplicative Approximation
Scheme
Wexler and Meek recently introduced a new approximation scheme called the Mul-
tiplicative Approximation Scheme (MAS) [24]. The primary goal of this scheme was
to provide good error bounds for computation of likelihood of evidence, marginal
probabilities and the maximum a posteriori probabilities in discrete directed and
undirected graphical models. In section 3.1 we describe the probability distribution
we will be using in our analysis. In section 3.2 we provide the basics of MAS and the
methodology by which the MAS error bounds are calculated.
3.1 Preliminaries
From this point in the thesis, unless specified otherwise, the graphical model we will
consider will be defined on a set of n variables X = {X1, . . . , Xn}, and will encode
the normalized probability distribution
P(X) = (1/Z) ∏_j ψj(Dj)
where Dj ⊆ X and 0 < ψj(Dj) < 1.
For convenience we will be using the notation
φj(Dj) = log ψj(Dj)
Finally, we will be denoting the set of evidence variables as E and the set of hidden
variables as H. Within this setup, the log joint probability can be represented by a
sum of factors as follows:
logP(X) = logP(H,E) = ∑_j φj(Dj)
3.2 MAS Basics
The fundamental operations of MAS are local approximations that are known as ε-
decompositions. Each decomposition has an error associated with it. The errors of
all such local decompositions are later combined into an error bound on the result.
Let us first define an ε-decomposition.
Definition 3.1. Given a set of variables X, and a function φ(X) that assigns real
values to every instantiation of X, a set of m functions φ̃l(Xl), l = 1 . . .m, where
Xl ⊆ X, is an ε-decomposition if⋃lXl = X, and for all instantiations of X the
17
following holds:
1/(1 + ε) ≤ (∑_l φ̃l(Xl)) / φ(X) ≤ 1 + ε    (3.1)
for some ε ≥ 0.
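Definition 3.1 can be checked numerically. The sketch below (illustrative; it assumes all φ values are positive, i.e. the underlying functions satisfy ψ > 1) computes the smallest ε for which a given set of functions is an ε-decomposition:

```python
import itertools

def epsilon_of(phi, parts, domains):
    """Smallest ε such that `parts` is an ε-decomposition of `phi` (Def. 3.1).

    phi: maps a full assignment dict to a positive value.
    parts: list of (vars, fn) pairs; fn takes the values of its own vars.
    domains: dict mapping each variable name to its domain.
    """
    names = sorted(domains)
    eps = 0.0
    for vals in itertools.product(*(domains[v] for v in names)):
        a = dict(zip(names, vals))
        ratio = sum(fn(*(a[v] for v in vs)) for vs, fn in parts) / phi(a)
        eps = max(eps, ratio - 1.0, 1.0 / ratio - 1.0)  # both sides of (3.1)
    return eps

# Example: approximate φ(x, y) = x + y + x*y by the disjoint pieces x and y.
phi = lambda a: a['x'] + a['y'] + a['x'] * a['y']
parts = [(('x',), lambda x: x), (('y',), lambda y: y)]
eps = epsilon_of(phi, parts, {'x': [1, 2], 'y': [1, 2]})
```

Here the worst instantiation is x = y = 2, where the decomposition gives 4 against φ = 8, a ratio of 1/2, so the smallest valid ε is 1.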
With this definition in mind, Wexler and Meek make the following claims:
Claim 3.1. The log of the joint probability P (H,E) can be approximated within a
multiplicative factor of 1+εmax using a set of εj-decompositions, where εmax = maxj εj.
(1/(1 + εmax)) logP(H,E) ≤ log P̃(H,E) ≤ (1 + εmax) logP(H,E)    (3.2)
where log P̃ (H,E) is defined as
log P̃(H,E) = ∑_{j,l} φ̃jl(Djl)
Claim 3.2. For a set A ⊆ H, the expression log ∑_A P(H,E) can be approximated
within a multiplicative factor of 1 + εmax using a set of εj-decompositions.

(1/(1 + εmax)) log ∑_A P(H,E) ≤ log ∑_A P̃(H,E) ≤ (1 + εmax) log ∑_A P(H,E)    (3.3)
To prove their claims, Wexler and Meek make the assumption that all functions satisfy
the property ψj(Dj) > 1, ∀Dj. However, as this is not always the case, they state
that in such scenarios we can first scale every function that does not satisfy the
property (equivalently, shift it in the log domain), i.e., we can generate a function
ψ′j(Dj) = zj · ψj(Dj) such that the property ψ′j(Dj) > 1, ∀Dj holds. Then, at the
end of the procedure, we renormalize by dividing the result by ∏_j zj. Wexler and
Meek make the claim that
doing the shift and renormalize procedure does not affect the results of claim 3.1 and
claim 3.2. In the next section, we derive the MAS error bounds and verify whether
performing the shift and renormalize procedure has no effect on the error bounds.
3.3 MAS Error Bounds
We divide this section into two sub-sections. In the first part, we perform the shift
and renormalize procedure and derive the MAS error bounds. In the second part, we
make the assumption that all the factors satisfy the property 0 < ψj(Dj) < 1. With
this assumption we derive MAS error bounds without performing any shifting.
3.3.1 Shift and Renormalize
Given our normalized probability distribution P (X), we create new functions ψ′j by
multiplying each function ψj with a constant zj such that
zj > 1 / min_{Dj} ψj(Dj)
Thus, we define
ψ′j(Dj) = zj · ψj(Dj)
which has the property that ψ′j(Dj) > 1, ∀Dj.
We now define the function f(X) to be the product of the ψ′js
f(X) = ∏_j ψ′j(Dj) = ∏_j zj · ψj(Dj)
and also define the shifted functions φ′j(Dj) as
φ′j(Dj) = log ψ′j(Dj) = log zj + φj(Dj)
Assume that for each factor ψ′j we have an εj-decomposition. With this assumption
we proceed to derive bounds for the log of the joint probability logP (H,E). We first
consider the approximation to log f(H,E) given by
log f̃(H,E) ≡ log ∏_{j,l} e^{φ̃′jl(Djl)} = ∑_{j,l} φ̃′jl(Djl)
Using the definition of an ε-decomposition (3.1) and the fact that {φ̃′jl} is an
εj-decomposition of φ′j, we have
log f̃(H,E) = ∑_{j,l} φ̃′jl(Djl) ≤ ∑_j (1 + εj) φ′j(Dj)
           ≤ (1 + εmax) ∑_j φ′j(Dj)
To perform the renormalization we divide f̃(H,E) by∏
j zj, which is equivalent to
subtracting∑
j log zj from log f̃(H,E). Hence, we get
log f̃(H,E)−∑j
log zj ≤ (1 + εmax)∑j
φ′j(Dj)−∑j
log zj (3.4)
We have,
P (X) =∏j
ψj(Dj) =f(X)∏j zj
So, we define P̃ (X) as
P̃ (X) =f̃(X)∏j zj
20
Using the definitions of P̃ (X) and φ′j(Dj) in equation (3.4), we get
log P̃ (H,E) ≤ (1 + εmax)∑j
(φj(Dj) + log zj)−∑j
log zj
= (1 + εmax)∑j
φj(Dj) + (1 + εmax)∑j
log zj −∑j
log zj
= (1 + εmax) logP (H,E) + εmax∑j
log zj (3.5)
Equation (3.5) provides us with an upper bound on the log of the joint probability.
Next we will derive a similar lower bound on the log of the joint probability.
Again, using the definition of an ε-decomposition (3.1), we have
\[
\log \tilde f(H,E) = \sum_{j,l} \tilde\phi'_{jl}(D_{jl}) \ge \sum_j \frac{1}{1+\varepsilon_j}\,\phi'_j(D_j) \ge \frac{1}{1+\varepsilon_{\max}} \sum_j \phi'_j(D_j).
\]
Subtracting Σj log zj from both sides and using the definition of φ′j(Dj), we have
\[
\begin{aligned}
\log \tilde P(H,E) &\ge \frac{1}{1+\varepsilon_{\max}} \sum_j \bigl(\phi_j(D_j) + \log z_j\bigr) - \sum_j \log z_j \\
&= \frac{1}{1+\varepsilon_{\max}} \sum_j \phi_j(D_j) + \frac{1}{1+\varepsilon_{\max}} \sum_j \log z_j - \sum_j \log z_j \\
&= \frac{1}{1+\varepsilon_{\max}} \log P(H,E) - \frac{\varepsilon_{\max}}{1+\varepsilon_{\max}} \sum_j \log z_j. \tag{3.6}
\end{aligned}
\]
Equation (3.6) provides us with a lower bound on the log of the joint probability.
Combined with the upper bound on the log of the joint probability (3.5), we can
state the following theorem:
Theorem 3.1. If the log of the joint probability P(H,E) is approximated using a set of εj-decompositions on the factors ψj of P, then we have the following bounds:
\[
\log \tilde P(H,E) \le (1+\varepsilon_{\max}) \log P(H,E) + \varepsilon_{\max} \sum_j \log z_j
\]
\[
\log \tilde P(H,E) \ge \frac{1}{1+\varepsilon_{\max}} \log P(H,E) - \frac{\varepsilon_{\max}}{1+\varepsilon_{\max}} \sum_j \log z_j
\]
where εmax = maxj εj.
Comparing theorem 3.1 with claim 3.1, we see that the bounds differ: the bounds in theorem 3.1 contain extra terms that are multiples of Σj log zj.

Having derived bounds for the joint probability, we proceed to derive bounds for marginal probabilities, i.e., we would like bounds on the expression log ΣA P(H,E), where A ⊆ H.
From theorem 3.1, we have
\[
\log \tilde P(H,E) \le \log \bigl[P(H,E)\bigr]^{(1+\varepsilon_{\max})} + \log \Bigl(\prod_j z_j\Bigr)^{\varepsilon_{\max}}.
\]
We would like to get a relation between ΣA P(H,E) and ΣA P̃(H,E). Therefore, we rearrange the terms on the right hand side of the above equation as follows:
\[
\log \tilde P(H,E) \le \log \Bigl[(P(H,E))^{(1+\varepsilon_{\max})} \cdot \Bigl(\prod_j z_j\Bigr)^{\varepsilon_{\max}}\Bigr]
= \log \Bigl[P(H,E) \cdot \Bigl(\prod_j z_j\Bigr)^{\frac{\varepsilon_{\max}}{1+\varepsilon_{\max}}}\Bigr]^{1+\varepsilon_{\max}}.
\]
Marginalizing out the set A from P̃(H,E), we get
\[
\log \sum_A \tilde P(H,E) \le \log \sum_A \Bigl[P(H,E) \cdot \Bigl(\prod_j z_j\Bigr)^{\frac{\varepsilon_{\max}}{1+\varepsilon_{\max}}}\Bigr]^{1+\varepsilon_{\max}}.
\]
To get the relation between ΣA P(H,E) and ΣA P̃(H,E) we need to move the summation on the right hand side inside the exponentiated part. For this we use the following lemma.
Lemma 3.1. If Cj ≥ 0 and r ≥ 1, then
\[
\sum_j (C_j)^r \le \Bigl(\sum_j C_j\Bigr)^r.
\]
If Cj ≥ 0 and 0 < r ≤ 1, then
\[
\sum_j (C_j)^r \ge \Bigl(\sum_j C_j\Bigr)^r.
\]
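A quick numerical sanity check of lemma 3.1 (the values of Cj and r below are arbitrary nonnegative choices):

```python
C = [0.3, 0.7, 1.2]
for r in (1.0, 2.5, 7.0):                      # r >= 1: sum of powers <= power of sum
    assert sum(c**r for c in C) <= sum(C)**r
for r in (0.2, 0.5, 1.0):                      # 0 < r <= 1: the inequality reverses
    assert sum(c**r for c in C) >= sum(C)**r
```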
Therefore, we have
\[
\begin{aligned}
\log \sum_A \tilde P(H,E) &\le \log \Bigl[\sum_A P(H,E) \cdot \Bigl(\prod_j z_j\Bigr)^{\frac{\varepsilon_{\max}}{1+\varepsilon_{\max}}}\Bigr]^{1+\varepsilon_{\max}} \\
&= \log \Bigl[\Bigl(\prod_j z_j\Bigr)^{\frac{\varepsilon_{\max}}{1+\varepsilon_{\max}}} \sum_A P(H,E)\Bigr]^{1+\varepsilon_{\max}} \\
&= (1+\varepsilon_{\max}) \log \sum_A P(H,E) + \varepsilon_{\max} \sum_j \log z_j. \tag{3.7}
\end{aligned}
\]
Equation (3.7) provides us with an upper bound on the log of the marginal probability.
We now perform a very similar analysis to obtain a lower bound.
Again, from theorem 3.1, we have
\[
\log \tilde P(H,E) \ge \log \bigl[P(H,E)\bigr]^{\frac{1}{1+\varepsilon_{\max}}} - \log \Bigl(\prod_j z_j\Bigr)^{\frac{\varepsilon_{\max}}{1+\varepsilon_{\max}}}.
\]
Rearranging the right hand side of the above equation, we have
\[
\log \tilde P(H,E) \ge \log \Biggl[\frac{(P(H,E))^{\frac{1}{1+\varepsilon_{\max}}}}{\bigl(\prod_j z_j\bigr)^{\frac{\varepsilon_{\max}}{1+\varepsilon_{\max}}}}\Biggr]
= \log \Biggl[\frac{P(H,E)}{\bigl(\prod_j z_j\bigr)^{\varepsilon_{\max}}}\Biggr]^{\frac{1}{1+\varepsilon_{\max}}}.
\]
Summing out a set A ⊆ H, we get
\[
\log \sum_A \tilde P(H,E) \ge \log \sum_A \Biggl[\frac{P(H,E)}{\bigl(\prod_j z_j\bigr)^{\varepsilon_{\max}}}\Biggr]^{\frac{1}{1+\varepsilon_{\max}}}.
\]
Again we would like to move the summation inside the exponentiated part. Using lemma 3.1 we have
\[
\begin{aligned}
\log \sum_A \tilde P(H,E) &\ge \log \Biggl[\sum_A \frac{P(H,E)}{\bigl(\prod_j z_j\bigr)^{\varepsilon_{\max}}}\Biggr]^{\frac{1}{1+\varepsilon_{\max}}} \\
&= \log \Biggl[\frac{1}{\bigl(\prod_j z_j\bigr)^{\varepsilon_{\max}}} \sum_A P(H,E)\Biggr]^{\frac{1}{1+\varepsilon_{\max}}} \\
&= \frac{1}{1+\varepsilon_{\max}} \log \sum_A P(H,E) - \frac{\varepsilon_{\max}}{1+\varepsilon_{\max}} \sum_j \log z_j. \tag{3.8}
\end{aligned}
\]
Equation (3.8) provides us with a lower bound on the log of the marginal probability.
Combined with the upper bound on the log of the marginal probability (3.7), we can
state the following theorem
Theorem 3.2. For a set A ⊆ H, if the expression log ΣA P(H,E) is approximated using a set of εj-decompositions on the factors ψj of P, then we have the following bounds:
\[
\log \sum_A \tilde P(H,E) \le (1+\varepsilon_{\max}) \log \sum_A P(H,E) + \varepsilon_{\max} \sum_j \log z_j
\]
\[
\log \sum_A \tilde P(H,E) \ge \frac{1}{1+\varepsilon_{\max}} \log \sum_A P(H,E) - \frac{\varepsilon_{\max}}{1+\varepsilon_{\max}} \sum_j \log z_j
\]
where εmax = maxj εj.
Theorem 3.2 differs from claim 3.2 in that it has extra terms that are multiples of Σj log zj.
Theorems 3.1 and 3.2 do not by themselves disprove claims 3.1 and 3.2. We disprove these claims in section 3.4, where we provide simple counterexamples that violate claims 3.1 and 3.2 but do not violate theorems 3.1 and 3.2.
3.3.2 No Shifting of Functions
Now let us consider the scenario when all our functions ψj have values that lie between
zero and one. An example of such a distribution is a Bayesian Network with no zero
probabilities. In this situation, we have the following theorem
Theorem 3.3. If 0 < ψj(Dj) < 1, ∀Dj, ∀j, and the log of the joint probability P(H,E) is approximated using a set of εj-decompositions on the factors ψj of P, then we have the following bounds:
\[
\frac{1}{1+\varepsilon_{\max}} \log P(H,E) \ge \log \tilde P(H,E) \ge (1+\varepsilon_{\max}) \log P(H,E) \tag{3.9}
\]
where εmax = maxj εj.

Proof:
\[
\log \tilde P(H,E) \equiv \log \prod_{j,l} e^{\tilde\phi_{jl}(D_{jl})} = \sum_{j,l} \tilde\phi_{jl}(D_{jl})
\]
Now we know that
\[
0 < \psi_j(D_j) < 1 \implies \phi_j(D_j) < 0.
\]
Using the above fact and the definition of an ε-decomposition (3.1), we have
\[
\log \tilde P(H,E) = \sum_{j,l} \tilde\phi_{jl}(D_{jl}) \ge \sum_j (1+\varepsilon_j)\,\phi_j(D_j) \ge (1+\varepsilon_{\max}) \sum_j \phi_j(D_j) = (1+\varepsilon_{\max}) \log P(H,E)
\]
and
\[
\log \tilde P(H,E) = \sum_{j,l} \tilde\phi_{jl}(D_{jl}) \le \sum_j \frac{1}{1+\varepsilon_j}\,\phi_j(D_j) \le \frac{1}{1+\varepsilon_{\max}} \sum_j \phi_j(D_j) = \frac{1}{1+\varepsilon_{\max}} \log P(H,E).
\]
Theorem 3.3 allows us to bound the probability of an assignment. Now we look to bound the marginal probability. From theorem 3.3 we have
\[
\log \tilde P(H,E) \ge \log \bigl[P(H,E)\bigr]^{(1+\varepsilon_{\max})}.
\]
Summing out a set A ⊆ H, we have
\[
\log \sum_A \tilde P(H,E) \ge \log \sum_A \bigl[P(H,E)\bigr]^{1+\varepsilon_{\max}}.
\]
To get a relation between ΣA P(H,E) and ΣA P̃(H,E) we would like to move the summation sign inside the exponentiated part. Unfortunately, since (1 + εmax) > 1, lemma 3.1 can only provide an upper bound on the sum on the right hand side.
Similarly, looking at the upper bound from theorem 3.3, we have
\[
\log \tilde P(H,E) \le \log \bigl[P(H,E)\bigr]^{\frac{1}{1+\varepsilon_{\max}}}.
\]
Summing out a set A ⊆ H, we have
\[
\log \sum_A \tilde P(H,E) \le \log \sum_A \bigl[P(H,E)\bigr]^{\frac{1}{1+\varepsilon_{\max}}}.
\]
Again we try to move the summation inside the exponentiation. This is again not possible because lemma 3.1 can only provide a lower bound on the sum on the right hand side, since 0 < 1/(1 + εmax) ≤ 1.

Hence we are unable to obtain any bounds on the log of the marginal probability ΣA P(H,E).
Until this point we have been unable to prove the claims made by Wexler and Meek.
In the next section we will prove that the claims are actually false by providing simple
counterexamples.
3.4 Example
In our example, we consider a simple model with two binary-valued random variables A and B, whose joint probability is given by

P(A,B):
          B = 0    B = 1
  A = 0   0.3      0.25
  A = 1   0.25     0.2

Let us assume that we find an approximation P̃(A,B) to P(A,B) as follows:

P̃(A,B):
          B = 0    B = 1
  A = 0   0.285    0.2525
  A = 1   0.25     0.2125

Using these two tables and the definition of an ε-decomposition (3.1), we have εmax = 0.0426.
With this value of εmax, we look at the bounds for the joint probability P(A,B = 1). We can see that the inequality specified by claim 3.1,
\[
\frac{1}{1+\varepsilon_{\max}} \log P(A,B{=}1) \le \log \tilde P(A,B{=}1) \le (1+\varepsilon_{\max}) \log P(A,B{=}1),
\]
does not hold, because we have (the middle value in each row is log P̃(A,B = 1))

  A = 0:   −1.329  ≥  −1.376  ≥  −1.445
  A = 1:   −1.544  ≥  −1.549  ≥  −1.678

whereas the inequality specified by theorem 3.3,
\[
\frac{1}{1+\varepsilon_{\max}} \log P(A,B{=}1) \ge \log \tilde P(A,B{=}1) \ge (1+\varepsilon_{\max}) \log P(A,B{=}1),
\]
does hold.
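The failure of claim 3.1 on this example can be verified numerically. The sketch below (our own illustration) takes the per-entry ε as max{ρ, 1/ρ} − 1, where ρ is the ratio of the log-values of P̃ and P, as in definition (3.1).

```python
import math

P  = {(0, 0): 0.3,   (0, 1): 0.25,   (1, 0): 0.25, (1, 1): 0.2}
Pt = {(0, 0): 0.285, (0, 1): 0.2525, (1, 0): 0.25, (1, 1): 0.2125}

def entry_eps(p, pt):
    r = math.log(pt) / math.log(p)     # ratio of log-values for one entry
    return max(r, 1.0 / r) - 1.0

eps_max = max(entry_eps(P[k], Pt[k]) for k in P)   # = 0.0426, as in the text

lp, lpt = math.log(P[(0, 1)]), math.log(Pt[(0, 1)])
claim_31 = lp / (1 + eps_max) <= lpt <= (1 + eps_max) * lp   # claim 3.1
thm_33   = lp / (1 + eps_max) >= lpt >= (1 + eps_max) * lp   # theorem 3.3
assert not claim_31 and thm_33
```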
Similarly, consider the procedure outlined in sub-section 3.3.1, which involves shifting and renormalizing the function. To shift P(A,B) so that all its values are greater than one, we multiply it by the constant z = 5.1. Therefore, we have

F(A,B) = z · P(A,B):
          B = 0    B = 1
  A = 0   1.53     1.275
  A = 1   1.275    1.02

And we have F̃(A,B) = z · P̃(A,B):
          B = 0      B = 1
  A = 0   1.4535     1.28775
  A = 1   1.275      1.08375

Using these two tables and the definition of an ε-decomposition (3.1), we have εmax = 3.0614. Again, we look at the joint probability P(A,B = 1). Once again we
see that the bounds given by claim 3.1,
\[
\frac{1}{1+\varepsilon_{\max}} \log P(A,B{=}1) \le \log \tilde P(A,B{=}1) \le (1+\varepsilon_{\max}) \log P(A,B{=}1),
\]
do not hold, because

  A = 0:   −0.341  ≰  −1.376  ≰  −5.630
  A = 1:   −0.396  ≰  −1.549  ≰  −6.537

whereas the bounds given by theorem 3.1,
\[
\log \tilde P(A,B{=}1) \ge \frac{1}{1+\varepsilon_{\max}} \log P(A,B{=}1) - \frac{\varepsilon_{\max}}{1+\varepsilon_{\max}} \log z,
\]
\[
\log \tilde P(A,B{=}1) \le (1+\varepsilon_{\max}) \log P(A,B{=}1) + \varepsilon_{\max} \log z,
\]
do hold, since

  A = 0:   −1.569  ≤  −1.376  ≤  −0.642
  A = 1:   −1.624  ≤  −1.549  ≤  −1.548
Hence, we can state that claim 3.1 does not hold in general.
Having focused on joint probabilities, we now perform marginalization to obtain bounds on the marginal probability. Referring to our original definition of P(A,B), we can calculate P(B) by marginalizing out A. Therefore, we have

  P(B = 0) = 0.55,   P(B = 1) = 0.45

From P̃(A,B), we have

  P̃(B = 0) = 0.535,   P̃(B = 1) = 0.465

First we consider the case when εmax = 0.0426 was calculated directly using P(A,B) and P̃(A,B). We see that the inequality specified by claim 3.2,
\[
\frac{1}{1+\varepsilon_{\max}} \log P(B) \le \log \tilde P(B) \le (1+\varepsilon_{\max}) \log P(B),
\]
does not hold, because

  B = 0:   −0.5734  >  −0.6255  <  −0.6233
  B = 1:   −0.7659  <  −0.7657  >  −0.8325

(the left inequality fails for B = 0 and the right inequality fails for B = 1).
Next we consider the case when εmax = 3.0614 was calculated using the shift and renormalize method. We see that the inequality given by claim 3.2 still does not hold, because

  B = 0:   −0.1472  ≰  −0.6255  ≰  −2.4281
  B = 1:   −0.1966  ≰  −0.7657  ≰  −3.2431

whereas the bounds given by theorem 3.2,
\[
\frac{1}{1+\varepsilon_{m}} \log P(B) - \frac{\varepsilon_{m}}{1+\varepsilon_{m}} \log z \le \log \tilde P(B) \le (1+\varepsilon_{m}) \log P(B) + \varepsilon_{m} \log z,
\]
where εm = εmax, do hold because

  B = 0:   −1.3753  ≤  −0.6255  ≤  2.5597
  B = 1:   −1.4247  ≤  −0.7657  ≤  1.7447
Hence, we can state that claim 3.2 does not hold in general.
3.5 Tracking ε in an Inference Procedure

In the course of an inference procedure, we come across three major operations:

1. Sum out a variable: We have a function f(X) and we want to marginalize a subset A ⊆ X. In section 3.3, we considered this scenario and presented the effects it has on the error ε.

2. Multiply two functions: Assume that we have two functions f and g. Let f̃ be an approximation to f with error ε1 and let g̃ be an approximation to g with error ε2. We would like the error of f̃ · g̃ with respect to f · g.

3. Compounded approximation: Say that we have an approximation f̃ to a function f with error ε1. Now assume that we get an approximation f̂ to f̃ with error ε2. We would like the error of f̂ with respect to the original function f.
Let us first consider the case of multiplying two functions. From the definition of an ε-decomposition (3.1), we have
\[
\frac{1}{1+\varepsilon_1} \le \frac{\log \tilde f}{\log f} \le 1+\varepsilon_1
\qquad\text{and}\qquad
\frac{1}{1+\varepsilon_2} \le \frac{\log \tilde g}{\log g} \le 1+\varepsilon_2.
\]
From these two equations we can say
\[
\frac{1}{1+\varepsilon_1} \log f + \frac{1}{1+\varepsilon_2} \log g \le \log \tilde f + \log \tilde g \le (1+\varepsilon_1) \log f + (1+\varepsilon_2) \log g.
\]
Hence, we have
\[
\frac{1}{1+\max\{\varepsilon_1,\varepsilon_2\}} (\log f + \log g) \le \log \tilde f + \log \tilde g \le (1+\max\{\varepsilon_1,\varepsilon_2\})(\log f + \log g).
\]
We can rewrite the above equation as
\[
\frac{1}{1+\max\{\varepsilon_1,\varepsilon_2\}} \le \frac{\log \tilde f \tilde g}{\log f g} \le 1+\max\{\varepsilon_1,\varepsilon_2\}.
\]
Therefore the error of f̃ · g̃ with respect to f · g is given by ε = max{ε1, ε2}.
Now we consider the case of performing a compounded approximation. Since f̃ is an approximation to f with error ε1, we have
\[
\frac{1}{1+\varepsilon_1} \le \frac{\log \tilde f}{\log f} \le 1+\varepsilon_1.
\]
Also, since f̂ is an approximation to f̃ with error ε2, we have
\[
\frac{1}{1+\varepsilon_2} \le \frac{\log \hat f}{\log \tilde f} \le 1+\varepsilon_2.
\]
Multiplying these two equations, we have
\[
\frac{1}{1+\varepsilon_1} \cdot \frac{1}{1+\varepsilon_2} \le \frac{\log \tilde f}{\log f} \cdot \frac{\log \hat f}{\log \tilde f} \le (1+\varepsilon_1)(1+\varepsilon_2).
\]
Simplifying the above equation we get
\[
\frac{1}{1+(\varepsilon_1+\varepsilon_2+\varepsilon_1\varepsilon_2)} \le \frac{\log \hat f}{\log f} \le 1+(\varepsilon_1+\varepsilon_2+\varepsilon_1\varepsilon_2).
\]
Therefore the error of f̂ with respect to f is given by ε = ε1 + ε2 + ε1ε2.
To understand how the error can be tracked during inference, we provide a schematic description of the error tracking procedure in Figure 3.1. We start with four functions (fa, fb, fc, fd), each associated with zero error, ri = 1. Every multiplication operation is denoted by edges directed from the nodes S, representing the multiplied functions, to a node t representing the resulting function, with error rt = maxs∈S rs. An ε-decomposition, on the other hand, has a single source node s with an associated error rs, representing the decomposed function, and several target nodes T, each with error rt = (1 + ε)rs for every t ∈ T. Therefore, the error associated with the result of the entire procedure is the error associated with the final node fo in the graph. In our example we make the assumption that ε1 > ε2 and that 1 + ε1 < (1 + ε2)(1 + ε3).
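The two tracking rules can be sketched as a pair of helper functions. The topology and error values below are hypothetical, chosen only so that ε1 > ε2 and 1 + ε1 < (1 + ε2)(1 + ε3), mirroring the situation in Figure 3.1.

```python
def r_decompose(r, eps):
    # each target of an eps-decomposition carries the error factor (1 + eps) * r
    return (1.0 + eps) * r

def r_multiply(rs):
    # multiplying functions keeps the largest error factor
    return max(rs)

e1, e2, e3 = 0.30, 0.20, 0.15                       # hypothetical decomposition errors
branch1 = r_decompose(1.0, e1)                      # one decomposition: 1 + e1
branch2 = r_decompose(r_decompose(1.0, e2), e3)     # two compounded decompositions
r_out = r_multiply([branch1, branch2])
assert abs(r_out - (1 + e2) * (1 + e3)) < 1e-12     # (1.2)(1.15) = 1.38 > 1.30
```

Note that the compounded branch carries (1 + ε2)(1 + ε3) = 1 + (ε2 + ε3 + ε2ε3), matching the compounding rule derived above.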
Figure 3.1: Schematic description of error tracking in an inference procedure using ε-decompositions. Starting errors are ra = rb = rc = rd = 1; decompositions introduce factors (1 + ε1), (1 + ε2) and (1 + ε3), and the final node carries ro = max{1 + ε1, (1 + ε2)(1 + ε3)} = (1 + ε2)(1 + ε3).
3.6 Need for new bounds
The corrected bounds of theorems 3.1 and 3.2 all contain the term Σj log zj in some form. The value of this term depends on the functions in our model that have ψj(Dj) < 1. As a result it can become arbitrarily large when dealing with small probabilities. For example, if we have a probability value of 0.01 we will have to multiply that factor by a value greater than 100. This value will then dominate the error in our bounds and hence lead to very loose bounds.
We saw an example of this in section 3.4. The corrected upper bound on the log marginal probability was positive, which merely says that the marginal probability is at most one. Although not seen in our examples, in our experiments we will show instances wherein the corrected lower bound for a probability is practically zero.
Since the corrected bounds resulting from the multiplicative error measure 3.1 are
often loose, we are motivated to investigate alternate error measures. In chapter 4
we propose one such measure, based on the L∞ norm. We show how this measure
can be used to compute the error of a local decomposition and then aggregated into
a bound on the result.
Chapter 4
MAS Error Bounds using the L∞ norm
In chapter 3, section 3.6, we stated the need to investigate alternate error measures
since the corrected bounds resulting from the multiplicative error measure are often
quite loose. Here we use a measure that is based on the L∞ norm. Using the L∞
error measure, we derive new bounds on the accuracy of the results of an inference
procedure.
In section 4.1 we introduce the L∞ norm and show how it can be used to compute
the error of a local decomposition. Then, in section 4.2 we show how to track the
new error measure in an inference procedure. Finally, in section 4.3, we derive new
bounds for the log joint probability and log marginal probability under MAS.
4.1 The L∞ norm
The L∞ norm can be used to measure the difference between two functions. Suppose we have a function f(X) and an approximation f̃(X) to f(X). The L∞ norm of the difference of logs is defined as
\[
\bigl\| \log f(X) - \log \tilde f(X) \bigr\|_\infty = \sup_X \bigl| \log f(X) - \log \tilde f(X) \bigr|.
\]
One of the problems with using the L∞ norm to measure error is that optimizing the L∞ norm to get the best f̃(X) is hard. So, the strategy we use is the following: assuming we have some approximation f̂(X) to f(X), we perform a scalar shift by α on log f̂(X) so as to minimize the L∞ norm. This means that we replace our approximation f̂(X) with a new approximation f̃(X) given by
\[
\tilde f(X) = \frac{\hat f(X)}{\alpha}.
\]
We define our error measure δ as
\[
\delta = \sup_X \bigl| \log f(X) - \log \tilde f(X) \bigr|,
\]
which is equivalent to
\[
\delta = \inf_\alpha \sup_X \Biggl| \log f(X) - \log \frac{\hat f(X)}{\alpha} \Biggr|.
\]
The error measure δ we defined is equal to the log of the dynamic range, a metric introduced by Ihler et al. in [11]. An illustration of how δ can be calculated is shown in Figure 4.1. It shows a function f(X) with its approximation f̂(X), and depicts the log-error log f/f̂, the error measure δ and the minimizing α.
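For a function tabulated over finitely many states, δ and the minimizing α have a simple closed form: the optimal shift centers the log-ratio, so δ is half the range of log f − log f̂. A small sketch with illustrative values:

```python
import math

f     = [0.30, 0.25, 0.25, 0.20]          # hypothetical function values
f_hat = [0.285, 0.2525, 0.25, 0.2125]     # an approximation to f

d = [math.log(a) - math.log(b) for a, b in zip(f, f_hat)]
delta = (max(d) - min(d)) / 2.0           # the minimized L-infinity error
log_alpha = -(max(d) + min(d)) / 2.0      # the minimizing shift, as log(alpha)

# after shifting, every pointwise log-error is within delta
assert all(abs(di + log_alpha) <= delta + 1e-12 for di in d)
```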
Figure 4.1: (a) A function f(X) and an example approximation f̂(X); (b) their log-ratio log f(X)/f̂(X), and the error measure δ.
With this error measure, we now define a δ-decomposition as follows:

Definition 4.1. Given a set of variables X, and a function φ(X) that assigns real values to every instantiation of X, a set of m functions φ̂l(Xl), l = 1 . . . m, where Xl ⊆ X, is a δ-decomposition if ∪l Xl = X, and for all instantiations of X the following holds:
\[
\delta = \inf_\alpha \sup_X \Biggl| \log \alpha + \phi(X) - \sum_l \hat\phi_l(X_l) \Biggr| \tag{4.1}
\]
Having defined the new error measure δ, in the next section we analyze how the
different operations of an inference procedure affect δ.
4.2 Tracking δ in an Inference Procedure
As in chapter 3, we consider how the three major operations of an inference pro-
cedure, summing out a set of variables, multiplying two functions and compounded
approximations, affect the new error measure δ.
First we consider the operation of summing out a set of variables. The problem is stated as follows: given a function f(X) and an approximation f̃(X) with error δ, we want the error of ΣA f̃(X) with respect to ΣA f(X), where A ⊆ X.
38
We have
∣∣∣log f(X)− log f̃(X)∣∣∣ ≤ δ ∀X
From the above equation we have
1
eδ≤ f(X)
f̃(X)≤ eδ ∀X
Therefore, we have
f̃(X)
eδ≤ f(X) f(X) ≤ eδf̃(X)
Summing out the set A, we get
∑A
f̃(X)
eδ≤∑A
f(X)∑A
f(X) ≤∑A
eδf̃(X)
Since δ is a constant we have
1
eδ
∑A
f̃(X) ≤∑A
f(X)∑A
f(X) ≤ eδ∑A
f̃(X)
Therefore we have
1
eδ≤∑
A f(X)∑A f̃(X)
≤ eδ
which is equivalent to
∣∣∣∣∣log∑A
f(X)− log∑A
f̃(X)
∣∣∣∣∣ ≤ δ
Therefore, summing out a set of variables A ⊆ X has an error of at most δ for the
approximation.
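This invariance under summation is easy to check numerically. The tables below are arbitrary positive values chosen for illustration:

```python
import math

f       = [2.0, 3.0, 5.0, 7.0]     # hypothetical positive function values
f_tilde = [2.2, 2.8, 5.4, 6.5]     # an approximation with pointwise error delta

delta = max(abs(math.log(a / b)) for a, b in zip(f, f_tilde))
marginal_error = abs(math.log(sum(f) / sum(f_tilde)))
assert marginal_error <= delta     # summing everything out cannot increase the error
```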
Next, consider multiplying two functions. We have a function f with approximation f̃ and error δ1, and another function g with approximation g̃ and error δ2. We want the error of f̃ · g̃ with respect to f · g. We have
\[
\bigl| \log f - \log \tilde f \bigr| \le \delta_1, \qquad \bigl| \log g - \log \tilde g \bigr| \le \delta_2.
\]
We are interested in
\[
\bigl| \log fg - \log \tilde f \tilde g \bigr| = \bigl| \log f - \log \tilde f + \log g - \log \tilde g \bigr|.
\]
By the triangle inequality we have
\[
\bigl| \log fg - \log \tilde f \tilde g \bigr| \le \bigl| \log f - \log \tilde f \bigr| + \bigl| \log g - \log \tilde g \bigr| \le \delta_1 + \delta_2.
\]
Therefore the error of f̃ · g̃ with respect to f · g is at most δ = δ1 + δ2.
Finally, consider the case of a compounded approximation. Assume we have an approximation f̃ to f with error δ1. Now assume that we get an approximation f̂ to f̃ with error δ2. We want the error of f̂ with respect to f. We have
\[
\bigl| \log f - \log \tilde f \bigr| \le \delta_1, \qquad \bigl| \log \tilde f - \log \hat f \bigr| \le \delta_2.
\]
We are interested in
\[
\bigl| \log f - \log \hat f \bigr| = \bigl| \log f - \log \tilde f + \log \tilde f - \log \hat f \bigr|.
\]
By the triangle inequality we have
\[
\bigl| \log f - \log \hat f \bigr| \le \bigl| \log f - \log \tilde f \bigr| + \bigl| \log \tilde f - \log \hat f \bigr| \le \delta_1 + \delta_2.
\]
Therefore the error of f̂ with respect to f is at most δ = δ1 + δ2.
To illustrate the error tracking for δ-decompositions in an inference procedure, we
provide a schematic trace in Figure 4.2. Since the error measure δ has been shown to
be at most additive, we only need to keep track of the error in each decomposition
and add all the errors at the end. In Figure 4.2 there are three decompositions that
take place: fa is decomposed with an error δ1, fg is decomposed with an error δ2 and
fj is decomposed with an error δ3. Therefore, the error on the result of the inference
procedure is at most δ1 + δ2 + δ3.
In the next section we use the definition of a δ-decomposition to derive new bounds for the log joint probability and the log marginal probability.
4.3 New Bounds for MAS
There are two scenarios for us to consider:

1. P(X) is a normalized probability distribution, i.e.,
\[
P(X) = \frac{1}{Z} \prod_j \psi_j(D_j), \qquad D_j \subseteq X,
\]
and we know the value of the normalization constant Z.

2. P(X) is not normalized and we need to compute the normalizing constant Z.
Figure 4.2: Schematic description of error tracking in an inference procedure using δ-decompositions. Three decompositions occur, with errors δ1, δ2 and δ3, so the total error is δ1 + δ2 + δ3.
First we consider the case when P(X) is a normalized probability distribution. We start by deriving bounds for the log of the joint probability P(H,E). Assume that for each factor ψj we have a δ-decomposition. Then, we have
\[
\Biggl| \phi_j(D_j) - \sum_l \tilde\phi_{jl}(D_{jl}) \Biggr| \le \delta_j \quad \forall j.
\]
We have shown in section 4.2 that multiplication of functions causes the errors to at most add. Since multiplying the original factors is equivalent to summing the logs of the factors, we have
\[
\Biggl| \sum_j \phi_j(D_j) - \sum_{j,l} \tilde\phi_{jl}(D_{jl}) \Biggr| \le \sum_j \delta_j.
\]
Since P(X) is normalized we have
\[
\Biggl| \Bigl(\sum_j \phi_j(D_j) - \log Z\Bigr) - \Bigl(\sum_{j,l} \tilde\phi_{jl}(D_{jl}) - \log Z\Bigr) \Biggr| \le \sum_j \delta_j.
\]
We define P̃(H,E) as
\[
\tilde P(H,E) = \frac{1}{Z} \prod_{j,l} e^{\tilde\phi_{jl}(D_{jl})}.
\]
Therefore, we have
\[
\bigl| \log P(H,E) - \log \tilde P(H,E) \bigr| \le \sum_j \delta_j.
\]
From the above equation we can state the following theorem:

Theorem 4.1. If the log of the joint probability P(H,E) is approximated using a set of δj-decompositions on the factors ψj of P and P is a normalized distribution, then
\[
\log P(H,E) - \sum_j \delta_j \le \log \tilde P(H,E) \le \log P(H,E) + \sum_j \delta_j. \tag{4.2}
\]
Having derived bounds for the joint probability, we proceed to derive bounds for marginal probabilities. From theorem 4.1 we have
\[
\bigl| \log P(H,E) - \log \tilde P(H,E) \bigr| \le \sum_j \delta_j.
\]
In section 4.2 we showed that for a set A ⊆ X we have
\[
\Biggl| \log \sum_A f(X) - \log \sum_A \tilde f(X) \Biggr| \le \delta.
\]
Therefore, summing out a set A ⊆ H from P(H,E), we get
\[
\Biggl| \log \sum_A P(H,E) - \log \sum_A \tilde P(H,E) \Biggr| \le \sum_j \delta_j.
\]
From the above equation we have the following theorem:

Theorem 4.2. For a set A ⊆ H, if the expression log ΣA P(H,E) is approximated using a set of δj-decompositions on the factors ψj of P and P is a normalized distribution, then
\[
\log \sum_A P(H,E) - \sum_j \delta_j \le \log \sum_A \tilde P(H,E) \le \log \sum_A P(H,E) + \sum_j \delta_j. \tag{4.3}
\]
We now consider the case when P(X) is not a normalized distribution. Assume we have a function f(X) and an approximation f̃(X) to f(X) with error δ. We have
\[
\bigl| \log f(X) - \log \tilde f(X) \bigr| \le \delta.
\]
Assume that we cannot calculate the real normalization constant Z = ΣX f(X). We can calculate an approximation to the normalization constant as Z̃ = ΣX f̃(X). We want a relation between our normalized approximation f̃(X)/Z̃ and the real normalized function f(X)/Z, i.e., we want to find bounds for
\[
\Biggl| \log \frac{f(X)}{Z} - \log \frac{\tilde f(X)}{\tilde Z} \Biggr|. \tag{4.4}
\]
In section 4.2 we showed that for a set A ⊆ X we have
\[
\Biggl| \log \sum_A f(X) - \log \sum_A \tilde f(X) \Biggr| \le \delta.
\]
Taking A = X (since X ⊆ X), we have
\[
\Biggl| \log \sum_X f(X) - \log \sum_X \tilde f(X) \Biggr| \le \delta.
\]
Substituting Z = ΣX f(X) and Z̃ = ΣX f̃(X), we have
\[
\bigl| \log Z - \log \tilde Z \bigr| \le \delta.
\]
We can rewrite the normalized expression of equation (4.4) as
\[
\Biggl| \log \frac{f(X)}{Z} - \log \frac{\tilde f(X)}{\tilde Z} \Biggr| = \bigl| \log f(X) - \log \tilde f(X) + \log \tilde Z - \log Z \bigr|.
\]
By the triangle inequality we have
\[
\Biggl| \log \frac{f(X)}{Z} - \log \frac{\tilde f(X)}{\tilde Z} \Biggr| \le \bigl| \log f(X) - \log \tilde f(X) \bigr| + \bigl| \log \tilde Z - \log Z \bigr| \le 2\delta.
\]
Hence, performing the normalization increases the error by a factor of at most two.
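A numerical sketch of the 2δ bound, with illustrative tables; note that the normalized error can indeed exceed δ, but never 2δ:

```python
import math

f       = [2.0, 3.0, 5.0]          # hypothetical unnormalized function
f_tilde = [2.3, 2.7, 4.6]          # an approximation with pointwise error delta

delta = max(abs(math.log(a / b)) for a, b in zip(f, f_tilde))
Z, Z_tilde = sum(f), sum(f_tilde)
normalized_error = max(abs(math.log((a / Z) / (b / Z_tilde)))
                       for a, b in zip(f, f_tilde))
assert normalized_error <= 2 * delta   # normalizing at most doubles the error
```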
Now, assuming that for each factor ψj of P we have a δ-decomposition, we have
\[
\Biggl| \sum_j \phi_j(D_j) - \sum_{j,l} \tilde\phi_{jl}(D_{jl}) \Biggr| \le \sum_j \delta_j.
\]
Since we do not know the real normalization constant Z, we normalize the result to get
\[
\Biggl| \Bigl(\sum_j \phi_j(D_j) - \log Z\Bigr) - \Bigl(\sum_{j,l} \tilde\phi_{jl}(D_{jl}) - \log \tilde Z\Bigr) \Biggr| \le 2 \sum_j \delta_j.
\]
In this scenario we define P̃(H,E) as
\[
\tilde P(H,E) = \frac{1}{\tilde Z} \prod_{j,l} e^{\tilde\phi_{jl}(D_{jl})}.
\]
Therefore, we have
\[
\bigl| \log P(H,E) - \log \tilde P(H,E) \bigr| \le 2 \sum_j \delta_j.
\]
From the above equation we have the following theorem:

Theorem 4.3. If the log of the joint probability P(H,E) is approximated using a set of δj-decompositions on the factors ψj of P and P is not a normalized distribution, then
\[
\log P(H,E) - 2\sum_j \delta_j \le \log \tilde P(H,E) \le \log P(H,E) + 2\sum_j \delta_j. \tag{4.5}
\]
Since we have shown that summing out a set of variables from a function does not
affect the error of the approximation, we can state the theorem for the log of the
marginal probabilities as follows:
Theorem 4.4. For a set A ⊆ H, if the expression log ΣA P(H,E) is approximated using a set of δj-decompositions on the factors ψj of P and P is not a normalized distribution, then
\[
\log \sum_A P(H,E) - 2\sum_j \delta_j \le \log \sum_A \tilde P(H,E) \le \log \sum_A P(H,E) + 2\sum_j \delta_j. \tag{4.6}
\]
Chapter 5
Optimizing a Decomposition
In chapters 3 and 4, we provided methods to calculate error bounds given a particular
approximation. In this chapter we present methods for finding a decomposition of a
function.
Ideally we would like to find the decomposition of a function that has the smallest value of the error measure ε or δ. There are two problems to consider: how to select good subsets {X1, . . . , Xm} for the decomposition, and, given the decomposition {X1, . . . , Xm}, how to select the functions {φ̃1(X1), . . . , φ̃m(Xm)}.
Choosing good subsets is an interesting but difficult problem. In this thesis, we take
the option of choosing subsets {X1, . . . , Xm} at random and then find good values
for the functions {φ̃1(X1), . . . , φ̃m(Xm)}.
Given a decomposition {X1, . . . , Xm}, the ideal way to choose {φ̃1(X1), . . . , φ̃m(Xm)} would be to choose the functions that minimize our error measure. Considering the error measure ε defined in chapter 3, the problem we would like to solve is
\[
\min_{(\tilde\phi_1,\dots,\tilde\phi_m)} \max_X \Biggl\{ \frac{\sum_i \tilde\phi_i(X_i)}{\phi(X)},\ \frac{\phi(X)}{\sum_i \tilde\phi_i(X_i)} \Biggr\}. \tag{5.1}
\]
If we use the error measure δ defined in chapter 4, then the problem would be
\[
\min_{(\tilde\phi_1,\dots,\tilde\phi_m)} \sup_X \Biggl| \phi(X) - \sum_i \tilde\phi_i(X_i) \Biggr|. \tag{5.2}
\]
Since optimizing either of these two objective functions is not easy, we look to optimize some similar measure in the hope that we will obtain a small value of our error measure. Wexler and Meek suggest the L2 norm, because for a vector X, ‖X‖2 ≥ ‖X‖∞. The L2 norm objective for a decomposition is defined as follows:
\[
\min_{(\tilde\phi_1,\dots,\tilde\phi_m)} \sqrt{ \sum_X \Biggl[ \Bigl(\sum_i \tilde\phi_i(X_i)\Bigr) - \phi(X) \Biggr]^2 }. \tag{5.3}
\]
We can choose disjoint or overlapping subsets for the decomposition. In section 5.1 we consider the case wherein we are restricted to choosing disjoint subsets, i.e., each φ̃i is defined on a non-overlapping subset of the variables. More formally, we choose sets {X1, . . . , Xm} such that
\[
X_i \cap X_j = \emptyset \quad \forall\, i \ne j,\ \ i,j = 1,\dots,m.
\]
Then in section 5.2 we remove the restriction of disjoint subsets and consider the
more general case of allowing sets that overlap with each other.
5.1 Optimizing a decomposition with disjoint subsets
The L2 norm is computationally convenient for disjoint subsets. The reason for this
is that there exists a closed form solution to equation (5.3) when the sets are disjoint.
Wexler and Meek obtained the closed form solution as follows. First they remove the
square root since the square root function is monotonic for positive values. Hence the
objective function becomes
\[
\min_{(\tilde\phi_1,\dots,\tilde\phi_m)} \sum_X \Biggl[ \Bigl(\sum_i \tilde\phi_i(X_i)\Bigr) - \phi(X) \Biggr]^2.
\]
Next, Wexler and Meek differentiate the equation with respect to each φ̃k(Xk) and set the derivative to zero. The resulting set of equations is under-constrained, so they add the arbitrary constraint
\[
\sum_X \tilde\phi_i(X_i) = \frac{\sum_X \phi(X)}{m}
\]
to make the system well posed. Solving the resulting set of equations, they obtain the closed form solution
\[
\tilde\phi_k(X_k) = \frac{\sum_{X \setminus X_k} \phi(X)}{\prod_{i \ne k} |X_i|} - \frac{(m-1) \sum_X \phi(X)}{m\,|X|}. \tag{5.4}
\]
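As a sketch, for a function over two binary variables decomposed into m = 2 disjoint singleton subsets, equation (5.4) reduces to row/column means minus a share of the grand total. The table values below are hypothetical:

```python
# phi tabulated as phi[x1][x2]; m = 2 subsets {X1}, {X2}; |X| = 4, |Xi| = 2
phi = [[1.0, 2.0], [4.0, 3.0]]
total = sum(map(sum, phi))

# closed form (5.4): sum out the other variable, divide by its size,
# and subtract (m-1)/(m*|X|) times the grand total
phi1 = [sum(phi[x1]) / 2.0 - total / 8.0 for x1 in range(2)]
phi2 = [sum(phi[x1][x2] for x1 in range(2)) / 2.0 - total / 8.0
        for x2 in range(2)]

approx = [[phi1[x1] + phi2[x2] for x2 in range(2)] for x1 in range(2)]
resid = [[phi[i][j] - approx[i][j] for j in range(2)] for i in range(2)]

# the least-squares additive fit leaves residuals summing to zero along
# every row and every column
assert all(abs(sum(row)) < 1e-9 for row in resid)
assert all(abs(resid[0][j] + resid[1][j]) < 1e-9 for j in range(2))
```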
5.2 Optimizing a decomposition with overlapping subsets
When we remove the restriction of disjoint subsets, we cannot find a closed form
solution for minimizing the L2 norm. So, we present a method that can generate
overlapping decompositions from disjoint decompositions.
Given a function φ(X) we would like to generate functions defined over overlapping
sets of variables. We create overlapping decompositions by first generating functions
over two or more different sets of disjoint decompositions and then combining them
together. We explain the procedure by considering two sets of disjoint decompositions
{X11, . . . , X1p} and {X21, . . . , X2q}.
First we generate a disjoint decomposition over {X11, . . . , X1p} for φ(X). To calculate
the values for {φ̃11(X11), . . . , φ̃1p(X1p)} we can use the closed form solution from
equation (5.4). We now compute the residual R(X) as
\[
R(X) = \phi(X) - \sum_{i=1}^{p} \tilde\phi_{1i}(X_{1i}).
\]
Next we generate a disjoint decomposition over {X21, . . . , X2q} for R(X) where the
sets {X21, . . . , X2q} are different from the sets {X11, . . . , X1p}. To calculate the values
of {R̃21(X21), . . . , R̃2q(X2q)} we again use the closed form solution from equation (5.4).
We now take the union of the sets {φ̃11(X11), . . . , φ̃1p(X1p)} and {R̃21(X21), . . . , R̃2q(X2q)} to generate the combined set of functions {φ̃11(X11), . . . , φ̃1p(X1p), R̃21(X21), . . . , R̃2q(X2q)}. If this combined set contains a function over a set Xj and another function over a set Xk such that Xj ⊆ Xk, then we multiply the two functions together to generate a single new function. After performing this multiplication operation on the combined set we get an overlapping decomposition {φ̂1(X1), . . . , φ̂m(Xm)}.
To make the procedure of generating overlapping decompositions clearer, we provide
the schematic trace of an example in Figure 5.1. We have a function φ that is
defined over three variables X1, X2, X3 for which we would like to generate overlapping
decompositions. We look at all possible decompositions that can be made except for
the trivial decomposition of three functions over single variables. At the first stage,
we decompose φ which can be done in three different ways. At the second stage
we decompose the residual. For each choice we made in the first stage we have
two possible choices at this stage. Finally we combine all the functions together to
generate the required overlapping decomposition. For each path down the tree from
root to leaf, we compute the values of ε and δ for the resulting decomposition.
Having provided a theoretical analysis of MAS including new ways to calculate error
bounds and new ways to optimize ε-decompositions, we now shift our focus to the
application of MAS to inference algorithms. In the next chapter, we look at the
application of MAS to bucket elimination [6]. We analyze the algorithm proposed
by Wexler and Meek called DynaDecomp and discuss some of its limitations. We
then present our own DynaDecompPlus algorithm which modifies the DynaDecomp
algorithm so as to overcome its limitations.
Figure 5.1: Schematic trace of generating overlapping decompositions of a function φ(X1, X2, X3). Each root-to-leaf path first decomposes φ, then decomposes the residual, and finally combines the pieces; the resulting errors range from ε = 0.3816, δ = 0.4289 for the best paths to ε = 0.4265, δ = 0.6360 for the worst.
Chapter 6
Bucket Elimination with MAS
Many existing inference algorithms for graphical models compute and utilize multi-
plicative factors during the course of the inference procedure. As a result the multi-
plicative approximation scheme can be applied to these algorithms. In this chapter
we look at one such application, using MAS with bucket elimination [6].
In chapter 2 (section 2.4) we provided an introduction to the bucket elimination framework. In section 6.1 we introduce and analyze the algorithm developed by Wexler and Meek, called DynaDecomp, and present its major limitation. Then, in section 6.2, we introduce our own algorithm, called DynaDecompPlus, which improves on the DynaDecomp algorithm by providing additional control over the computational complexity.
6.1 The DynaDecomp Algorithm
DynaDecomp is an algorithm developed by Wexler and Meek and presented in [24].
It performs variable elimination using MAS. The strategy that DynaDecomp uses for
Input: A model for n variables X = {X1, . . . , Xn} and functions ψj(Dj ⊆ X) that encode P(X) = ∏j ψj(Dj); a set E = X \ H of observed variables; an elimination order R over the variables in H; a threshold M.
Output: The log-likelihood log P(E); an error ε.

Initialize: ε = 0; F ← {ψj(Dj)}
for i = 1 to n do
    k ← R[i]
    T ← all functions f ∈ F that contain variable Xk
    F ← F \ T
    Eliminate Xk from the functions in bucket T: f′ ← ΣXk ∏fj∈T fj
    if |f′| > M then
        Decompose f′ into a set F̃ with an error εf′
        F ← F ∪ F̃
        ε ← max{ε, εf′}
    else
        F ← F ∪ {f′}
    end
end
Multiply all constant functions in F and store the result in p
return log p, ε

Figure 6.1: The DynaDecomp Algorithm
decomposition is a simple one: If the size of a function exceeds a predefined threshold,
decompose the function. The pseudocode for DynaDecomp is presented in Figure 6.1.
DynaDecomp begins exactly like bucket elimination. Given an elimination order we
first assign the functions to their respective buckets and then start processing the
buckets one after the other. The difference is that while we are at bucket i, if the
function generated after eliminating variable Xi, hXi , has a size greater than a pre-
defined threshold M , we decompose this function hXi into a set of smaller functions.
A schematic trace of DynaDecomp is presented in Figure 6.2.
In our example we set the threshold M = 3, i.e., we will decompose any function
that is defined over more than three variables. In Figure 6.2 we see that the function
hB(A,D,C,E) generated after eliminating variable B is defined over four variables.
Hence, it is decomposed into two smaller functions h̃B1 (D,C) and h̃B2 (A,E). The
[Figure: bucket-elimination trace. Bucket B holds P(E|B,C), P(D|A,B), and P(B|A); eliminating B yields hB(A,D,C,E), which is decomposed into h̃B1(D,C) and h̃B2(A,E). The later buckets (C, D, E, with evidence E = 0) produce h̃C(A,D), h̃D(A), and h̃E(A), and bucket A combines them with P(A) into P̃(A|E = 0).]
Figure 6.2: A trace of algorithm DynaDecomp
algorithm then proceeds as in standard bucket elimination and in the end we get an
approximation P̃ (A|E = 0) to P (A|E = 0).
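The scope bookkeeping in this trace can be sketched in a few lines. The following is an illustrative Python sketch of our own (the thesis implementation was in Matlab): factors are reduced to their variable scopes, a function's size is the number of variables in its scope, and `split_scope` stands in for an ε-decomposition by simply halving a scope rather than optimizing the split.

```python
# Simplified DynaDecomp skeleton: factors are just variable scopes (sets);
# the real tables and the epsilon computation are omitted for clarity.

def split_scope(scope):
    """Stand-in for an epsilon-decomposition: split an oversized scope
    into two disjoint halves, chosen arbitrarily rather than optimized."""
    vs = sorted(scope)
    return [set(vs[:len(vs) // 2]), set(vs[len(vs) // 2:])]

def dynadecomp_scopes(factors, order, max_size):
    """Trace the scopes DynaDecomp creates for elimination order `order`.
    Returns the scope generated after each elimination step."""
    factors = [set(f) for f in factors]
    created = []
    for x in order:
        # Bucket for x: all factors whose scope contains x.
        bucket = [f for f in factors if x in f]
        factors = [f for f in factors if x not in f]
        # Multiplying the bucket and summing out x yields this scope:
        new_scope = set().union(*bucket) - {x}
        created.append(new_scope)
        if len(new_scope) > max_size:
            factors.extend(split_scope(new_scope))
        else:
            factors.append(new_scope)
    return created
```

Running it on the factors of Figure 6.2 with M = 3 and order (B, C, D, E) reproduces the behavior discussed below: eliminating B already creates a function over the four variables {A, C, D, E} before the threshold test can fire.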
One thing to notice is that even though we wanted to restrict ourselves to functions over three variables, the algorithm generated a function over five variables when we multiplied all the functions in bucket B. Hence, even though the DynaDecomp algorithm does provide savings later on in the inference procedure, it still generates functions larger than the specified threshold. As a result, it becomes quite hard to control the time and space complexity of DynaDecomp.
In order to provide greater control over the time and space complexity we modify
DynaDecomp to create our own algorithm called DynaDecompPlus. In the next
section we provide the details of our algorithm.
6.2 Improvements to DynaDecomp
The major issue with the DynaDecomp algorithm is that it is very difficult to control
the largest function size generated by the algorithm. Therefore we developed a mod-
ified version of DynaDecomp called DynaDecompPlus which ensures that functions
larger in size than the given threshold are never generated.
Before presenting the details of DynaDecompPlus, we describe the small changes that
have to be made to DynaDecomp so that we can use δ to track the error instead of ε.
6.2.1 Using L∞ Bounds with DynaDecomp
Modifying DynaDecomp to use δ for error tracking is simple. The algorithm pro-
ceeds exactly as in Figure 6.1. The only difference is that to decompose the func-
tion f ′ into the set of functions F̃ , we use a δ-decomposition (4.1) instead of an
ε-decomposition (3.1) to calculate the error. Hence the error update changes from ε ← max{ε, εf′} to δ ← δ + δf′. With these changes, we can use the L∞ bounds derived in chapter 4 to bound the returned result, log p.
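The difference between the two update rules is tiny but worth making explicit. A minimal Python illustration (our own, not the thesis code):

```python
# The epsilon bound keeps only the worst single decomposition error,
# while the delta (L-infinity) bound adds the errors of all decompositions.

def accumulate_eps(errors):
    """epsilon = max over the per-decomposition errors eps_f'."""
    eps = 0.0
    for e in errors:
        eps = max(eps, e)
    return eps

def accumulate_delta(errors):
    """delta = sum over the per-decomposition errors delta_f'."""
    delta = 0.0
    for d in errors:
        delta += d
    return delta
```

Since δ accumulates additively, every decomposition contributes to the final bound, whereas ε only records the single worst decomposition.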
6.2.2 The DynaDecompPlus Algorithm
The DynaDecompPlus algorithm is presented in Figure 6.3. The basic principle be-
hind the algorithm is that instead of decomposing the function resulting from pro-
cessing a particular bucket, decompositions are performed while processing a bucket.
The idea is that before we multiply all the functions in a bucket, we check to see if the
resulting function will exceed our threshold. If it does then instead of multiplying all
the functions, we multiply as many functions as we can without exceeding the thresh-
Input: A model for n variables X = {X1, . . . , Xn} and functions ψj(Dj ⊆ X) that
    encode P(X) = ∏j ψj(Dj); a set E = X\H of observed variables; an
    elimination order R over the variables in H; a threshold M.
Output: The log-likelihood log P(E); an error δ.
Initialize: δ = 0; F ← {ψj(Dj)}
for i = 1 to n do
    k ← R[i]
    T ← all functions f ∈ F that contain variable Xk
    F ← F \ T
    a ← length(T); f′ ← 1; S ← ∅
    for j = 1 to a do
        if multiplying f′ and T[j] would result in a function of size greater than M then
            S ← S ∪ T[j]
        else
            f′ ← the multiplication of f′ and T[j]
        end
    end
    if S is not empty then
        (δf′, F̃) ← decompose(S, f′)
        (f′, F̃) ← subsume functions(f′, F̃)
        δ ← δ + δf′
    end
    Sum out variable Xk from f′: f′ ← ∑Xk f′
    F ← F ∪ {f′}
    F ← F ∪ F̃
end
Multiply all constant functions in F and put the result in p
return log p, δ
Figure 6.3: The DynaDecompPlus Algorithm
[Figure: bucket-elimination trace. Bucket B holds P(E|B,C), P(D|A,B), and P(B|A); P(D|A,B) and P(B|A) are decomposed into h̃1(D,A), h̃2(B), h̃3(B), and h̃4(A) before eliminating B, which yields h̃B(E,C). The later buckets (C, D, E, with evidence E = 0) produce h̃C(E,A), h̃D(A), and h̃E(A), and bucket A combines them with P(A) and h̃4(A) into P̃(A|E = 0).]
Figure 6.4: A trace of algorithm DynaDecompPlus
old. The rest of the functions are then decomposed and the variable is eliminated at
the end.
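The packing step just described can be written down compactly. Below is an illustrative Python sketch under our own simplifications (factors reduced to their variable scopes, a product's size measured as the number of variables in the union of the scopes); it is not the thesis's Matlab implementation:

```python
# Sketch of the bucket-packing step of DynaDecompPlus.

def pack_bucket(bucket, max_size):
    """Greedily multiply factors without letting the product's scope
    exceed max_size; return (packed_scope, leftovers_to_decompose)."""
    product = set()
    leftovers = []
    for scope in bucket:
        if len(product | scope) > max_size:
            leftovers.append(scope)   # the set S in Figure 6.3
        else:
            product |= scope          # multiply the factor into f'
    return product, leftovers
```

On bucket B of Figure 6.4 with M = 3, this sketch keeps P(E|B,C) in the product and sets aside P(D|A,B) and P(B|A) for decomposition, matching the trace.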
To make this clearer, we look at the trace of the algorithm presented in Figure 6.4.
The problem we are trying to solve is P (A|E = 0). But we do not want to generate
any functions greater than size three. So given this threshold and the elimination
order o = (A,E,C,D,B), we start the DynaDecompPlus algorithm. At the very first
bucket B, we see that if we multiply all the functions in this bucket, we will generate
a function hB(A,B,C,D,E) which is greater than our threshold. So instead, we
choose to decompose the functions P (D|A,B) and P (B|A). By performing the
decomposition early and then proceeding with bucket elimination we can see that we
never generated any function larger than size three. If we compare this to the trace of DynaDecomp in Figure 6.2, which also used a maximum size of three, we see that because DynaDecomp first multiplied all the functions in bucket B, it actually generated a function of size five.
Note on subsume functions
The DynaDecompPlus algorithm calls a method known as subsume functions. This method takes a set of functions and, whenever some function is defined over a set of variables that is a subset of another function's variables, multiplies the two functions together. For example, if the set contained a function f1(A,B,C,D) and another function f2(B,D), subsume functions would multiply these two functions and produce a new result f′1(A,B,C,D).
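A possible implementation of subsume functions, written in Python for illustration (the representation of a factor as a scope plus a table keyed by assignments over the sorted scope is our own assumption, not the thesis code):

```python
# Illustrative subsume_functions: a factor is (scope: frozenset,
# table: dict mapping assignment tuples over sorted(scope) to values).

def subsume_functions(factors):
    """Multiply any factor whose scope is a subset of another factor's
    scope into that larger factor; return the reduced factor list."""
    factors = list(factors)
    changed = True
    while changed:
        changed = False
        for i, (si, ti) in enumerate(factors):
            for j, (sj, tj) in enumerate(factors):
                if i != j and si <= sj:
                    # Multiply factor i into factor j pointwise.
                    vs_j = sorted(sj)
                    idx = [vs_j.index(v) for v in sorted(si)]
                    new_tj = {a: v * ti[tuple(a[k] for k in idx)]
                              for a, v in tj.items()}
                    factors[j] = (sj, new_tj)
                    del factors[i]
                    changed = True
                    break
            if changed:
                break
    return factors
```

The loop restarts after every merge, so chains of subsumptions (a function absorbed into one that is itself later absorbed) are handled.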
In the next chapter we provide the results of a variety of experiments we performed
that compare the performance of DynaDecomp and DynaDecompPlus. We also com-
pare the tightness of the corrected MAS bounds derived in chapter 3 with the L∞
bounds derived in chapter 4.
Chapter 7
Experiments
In this chapter we present results of a variety of experiments that we performed in
order to evaluate the theoretical results derived in the previous chapter. In section 7.1
we describe the experimental setup we used to perform our experiments. Next, in
section 7.2 we show results that were obtained using simple Ising models. Finally,
in section 7.3 we show the results obtained using some instances from the UAI 2006
inference competition.
7.1 Experimental Setup
All the experiments focus on the inference problem of computing the probability of
evidence. The probability distribution we consider is always normalized or we know
the true normalization constant.
The experiments were run on a desktop system containing two Intel R© CoreTM2 Duo
processors running at 3 GHz, with 4 GB of main memory, running the 64-bit Ubuntu
operating system. The programming environment was Matlab. In order to code the
algorithms, some functions from the Bayes Net Toolbox [18] were used.
To determine the elimination order we used the MIN-FILL heuristic. The subsets for
a decomposition were chosen at random. To optimize a decomposition, we used the
L2 optimization method discussed in chapter 5.
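For reference, the MIN-FILL heuristic can be sketched as follows; this is the standard greedy formulation written in Python for illustration, not the exact routine we used:

```python
# MIN-FILL: at each step, eliminate the variable whose removal would add
# the fewest fill-in edges to the interaction graph of the model.

def min_fill_order(variables, scopes):
    """Return a greedy min-fill elimination order for the factor scopes."""
    # Build the interaction graph: variables sharing a scope are adjacent.
    adj = {v: set() for v in variables}
    for s in scopes:
        for u in s:
            adj[u] |= set(s) - {u}
    order = []
    remaining = set(variables)
    while remaining:
        def fill_cost(v):
            # Count pairs of v's surviving neighbors not already connected.
            nbrs = sorted(adj[v] & remaining)
            return sum(1 for i in range(len(nbrs))
                       for j in range(i + 1, len(nbrs))
                       if nbrs[j] not in adj[nbrs[i]])
        v = min(sorted(remaining), key=fill_cost)
        # Add the fill-in edges among v's neighbors, then remove v.
        nbrs = adj[v] & remaining
        for u in nbrs:
            adj[u] |= nbrs - {u}
        remaining.remove(v)
        order.append(v)
    return order
```

Ties are broken deterministically here by sorting; in practice random tie-breaking is also common.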
7.2 Results on Ising Models
We considered a 15×15 Ising model with binary nodes and random pairwise potentials. For evidence, we randomly assigned values to about 20% of the nodes. Since we
use random splits for decomposition, we ran each algorithm ten times for each value
of the threshold and chose the result that had the smallest error value.
First we compare the performance of the corrected MAS bounds derived in chapter 3
versus the L∞ MAS bounds derived in chapter 4. Figure 7.1 shows the result of
running the DynaDecomp algorithm as described in [24] for different values of the
threshold parameter. The clear square represents the exact value of logP (E) while
the green circle shows the value of the approximation log P̃ (E). The black line on
the left shows the corrected bounds whereas the red line on the right shows the L∞
bounds. We see that the L∞ bounds are tighter than the corrected bounds for all
values of the threshold.
Next, we check if using overlapping decompositions with DynaDecomp results in
tighter bounds. Figure 7.2 shows the result of running the DynaDecomp algorithm
with overlapping decompositions. The bounds in Figure 7.2 are marginally tighter
than the bounds in Figure 7.1 which suggests that using overlapping decompositions
gives slightly better results than using disjoint decompositions. We also see that the
L∞ bounds are again tighter than the corrected bounds for all values of the threshold.
[Plot: log P(E) versus the maximum size specified]
Figure 7.1: Bounds for logP (E) using DynaDecomp algorithm with no overlappingdecompositions allowed
[Plot: log P(E) versus the maximum size specified]
Figure 7.2: Bounds for logP (E) using DynaDecomp algorithm with overlapping de-compositions allowed
Now we look at the performance of the corrected bounds and the L∞ bounds on
the DynaDecompPlus algorithm. Figure 7.3 shows the results of running DynaDe-
compPlus for different values of the threshold parameter. Since the L∞ bounds in
Figure 7.3 are not very clear, we plotted just the L∞ bounds in Figure 7.4. Again
we see the trend that the L∞ bounds are tighter than the corrected bounds. In fact,
in all the experiments we ran the L∞ bounds were much tighter than the corrected
bounds.
Having compared the performance of the bounds, we now focus on the comparison
between DynaDecomp and DynaDecompPlus. We first look at the control the algo-
rithms provide over the time and space complexities.
Figure 7.5 is a plot of the accuracy of the result versus the maximum function size
created by DynaDecomp. The plot shows multiple runs of DynaDecomp at each
value for the threshold parameter. We can see that even when we set the threshold
at M = 3, DynaDecomp still generated functions over 7 to 9 variables. This makes
it hard to control the time and space complexity of DynaDecomp. For example,
assuming that we have binary variables and want to limit ourselves to functions not
larger than 210 values, we have no easy way of setting the threshold parameter M
because as seen in Figure 7.5 there is no simple relation between the threshold value
and the maximum function size created by DynaDecomp.
Figure 7.6 is a plot of the accuracy of the result versus the maximum function size
created by DynaDecompPlus. It shows that DynaDecompPlus never generates a
function that is larger than the threshold specified. Thus, looking at our earlier
example, if we wanted to limit ourselves to functions not larger than 210 values,
we just have to set the threshold parameter M = 10. Comparing Figure 7.6 with
Figure 7.5 we can see that DynaDecompPlus provides much greater control over the
time and space complexities than DynaDecomp.
[Plot: log P(E) versus the maximum size specified]
Figure 7.3: Bounds for logP (E) using DynaDecompPlus algorithm
[Plot: log P(E) versus the maximum size specified]
Figure 7.4: L∞ Bounds for logP (E) using DynaDecompPlus algorithm
[Plot: log(approx) − log(exact) versus maximum function size; one series per threshold, maxsize = 3 through 10]
Figure 7.5: Plot of accuracy versus maximum function size for the DynaDecompalgorithm
[Plot: log(approx) − log(exact) versus maximum function size; one series per threshold, maxsize = 5 through 15]
Figure 7.6: Plot of accuracy versus maximum function size for the DynaDecompPlusalgorithm
[Plot: log P(E) versus maximum function size]
Figure 7.7: L∞ Bounds for logP (E) vs Max Function Size for the DynaDecompalgorithm
[Plot: log P(E) versus maximum function size]
Figure 7.8: L∞ Bounds for logP (E) vs Max Function Size for the DynaDecompPlusalgorithm
Having shown that DynaDecompPlus provides greater control over the complexity
than DynaDecomp, we now look at the performance of DynaDecompPlus and DynaDecomp for the same complexity. Figure 7.7 shows a plot of the L∞ bounds against
the maximum function size that DynaDecomp generated. Figure 7.8 shows a plot of
the L∞ bounds against the maximum function size that DynaDecompPlus generated
for the same problem. We can see that for the same complexity DynaDecompPlus is
more accurate and generates tighter bounds than DynaDecomp.
7.3 Results on UAI Data
In the previous section we considered only simple Ising models. To compare the
performance of the bounds and the two algorithms on more complex models, we ran
experiments on seventeen instances from the UAI 2006 inference competition. The
instances were taken from the UAI06 Problem Repository available at [1]. We chose
instances that had no zero probabilities and to compute the exact solution, we used
the aolibPE package for exact inference in Bayesian Networks available at [17].
We first tried to run the DynaDecomp algorithm. However, for the heuristic ordering
we chose, and a range of threshold values from M = 3 to M = 15, it failed to solve
any of the instances as it kept running out of memory.
Next, we ran the DynaDecompPlus algorithm for different values of the threshold
parameter. We provide the results for thresholds of 9, 12 and 15 in Figure 7.9, Fig-
ure 7.10 and Figure 7.11 respectively. We can see that as we increase the size of the
threshold the bounds get tighter. We again see that the L∞ bounds are much tighter
than the corrected bounds. In fact the corrected bounds are extremely loose. We do
not see the ends of the corrected bounds because we limited the Y -axis to lie between
−200 and 200 for plotting purposes.
Although the L∞ bounds are tighter than the corrected bounds, we see that in most
cases even they are not tight enough for the log probability of evidence. For example,
we see that in many cases the upper bound on logP (E) is greater than zero and
the lower bound is about −100. This states that the probability of evidence lies
somewhere between e^−100 and 1, which is not very useful.
From these experiments we can conclude that the corrected MAS bounds from chap-
ter 3 are not useful even for simple Ising models and are quite unusable for any
complex models. The L∞ bounds on the other hand work reasonably well for simple
models but as the models get more complicated they too result in not very useful
bounds.
With regards to the algorithms, we can see that DynaDecomp does a poor job of
controlling the time and space complexities. For complex models it has trouble per-
forming inference even for small values of the threshold parameter. DynaDecompPlus
on the other hand can efficiently control the time and space complexities and can per-
form inference with good accuracy.
[Plot: log P(E) versus problem number]
Figure 7.9: Bounds for logP (E) using DynaDecompPlus algorithm with a maximumfunction size of 9
[Plot: log P(E) versus problem number]
Figure 7.10: Bounds for logP (E) using DynaDecompPlus algorithm with a maximumfunction size of 12
[Plot: log P(E) versus problem number]
Figure 7.11: Bounds for logP (E) using DynaDecompPlus algorithm with a maximumfunction size of 15
Chapter 8
Conclusions
The research presented in this thesis focused on a specific approximation scheme,
called the Multiplicative Approximation Scheme (MAS), that was introduced by
Wexler and Meek. It is a general approximation scheme that can be applied to
inference algorithms so as to perform bounded approximate inference in graphical
models.
8.1 Contributions
We analyzed the multiplicative approximation scheme and proved that in reality the
bounds that were claimed by Wexler and Meek turn out to be incorrect when dealing
with normalized probability distributions. We then proceeded to derive the correct
bounds for these situations.
We also introduced a new way to calculate the error of a local decomposition. Our
method used a measure that was based on the L∞ norm. Using the L∞ error measure,
we derived new bounds on the results of an inference procedure.
We also analyzed the ε-decomposition optimization strategy. Wexler and Meek had
provided a closed form solution using L2 optimization on disjoint subsets. We pro-
vided a method that allowed us to generate overlapping subsets in our decompositions
using the closed form solution for disjoint subsets.
We then proceeded to analyze the algorithm developed by Wexler and Meek called Dy-
naDecomp, which is the application of MAS to bucket elimination. We demonstrated
the limitations of DynaDecomp particularly with respect to the lack of control over
how much space the algorithm would require. We then modified DynaDecomp and
developed our own DynaDecompPlus algorithm that gave us the necessary control
over the space the algorithm needed.
Finally we provided experimental evidence that demonstrated how in practice the
corrected bounds for MAS turn out to be very loose whereas our L∞ bounds were
much tighter. We also showed how using overlapping decompositions resulted in
tighter bounds than using disjoint decompositions. We also compared the workings
of DynaDecomp with DynaDecompPlus and showed how DynaDecompPlus provided
much better control over space utilization. We also showed how it could solve prob-
lems that DynaDecomp had trouble solving.
8.2 Future Work
With the extensions we have provided, MAS still has room for additional improve-
ments that can be pursued in the future.
The biggest room for improvement is with the local decomposition. For example, say
we had a function defined over five variables, f(X1, X2, X3, X4, X5), and we would like
to decompose it into functions over no more than three variables, there
are many different options for us to try. We could create five single variable functions
f̃i(Xi) or we might choose a disjoint partition f̃1(X1, X3, X4) & f̃2(X2, X5) or we might
create an overlapping partition of the form f̃1(X1, X2, X3) & f̃2(X3, X4, X5) and so
on. Currently we make this decision by choosing the subsets at random. Therefore, it
would be worthwhile to explore some more sophisticated techniques to choose these
subsets.
Ideally we would like to choose the subsets such that the value of the error measure ε
or δ for the decomposition is the smallest. We could consider a brute force approach
of trying all possible decompositions, but with there being an exponential number of
decompositions to try, this would be infeasible for all but the most trivial cases.
One possible approach is to use mutual information, which measures the mutual dependence of two variables, or conditional mutual information, the mutual information of two variables conditioned on a third. There
has been recent work done by Narasimhan and Bilmes [19] and by Chechetka and
Guestrin [3] that uses conditional mutual information to learn graphical models with
bounded tree-width. Some of the ideas presented in these papers could be used to
provide a solution to our problem of finding good decompositions.
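As a concrete starting point, the pairwise mutual information that such a method would rank can be computed directly from a normalized joint distribution over two variables. The sketch below is our own Python illustration, not a method taken from the cited papers:

```python
import math

def pairwise_mi(joint):
    """Mutual information I(X;Y) in nats, from a joint distribution
    given as a dict mapping (x, y) pairs to probabilities."""
    # Marginalize to get p(x) and p(y).
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # I(X;Y) = sum p(x,y) log[ p(x,y) / (p(x) p(y)) ].
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)
```

Variable pairs with high mutual information would then be kept in the same subset of a decomposition, so that the factorization discards only weak dependencies.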
Bibliography
[1] UAI 2006 problem repository in Ergo file format. http://graphmod.ics.uci.edu/repos/pe/uaicomp06/, 2006.

[2] S. Arnborg. Efficient algorithms for combinatorial problems with bounded decomposability – a survey. BIT, 25(1):2–23, 1985.

[3] A. Chechetka and C. Guestrin. Efficient principled learning of thin junction trees. In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, December 2007.

[4] G. F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks (research note). Artificial Intelligence, 42(2-3):393–405, 1990.

[5] A. Darwiche. Recursive conditioning. Artificial Intelligence, 126(1-2):5–41, 2001.

[6] R. Dechter. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113(1-2):41–85, 1999.

[7] R. Dechter. Constraint Processing. Morgan Kaufmann Publishers, 340 Pine Street, Sixth Floor, San Francisco, CA 94104-3205, 2003.

[8] R. Dechter and J. Pearl. Network-based heuristics for constraint-satisfaction problems. Artificial Intelligence, 34(1):1–38, 1987.

[9] R. Dechter and I. Rish. Mini-buckets: A general scheme for bounded inference. Journal of the ACM, 50(2):107–153, 2003.

[10] M. Henrion. Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Proceedings of the 2nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-86), New York, NY, 1986. Elsevier Science.

[11] A. T. Ihler, J. W. Fisher III, and A. S. Willsky. Loopy belief propagation: Convergence and effects of message errors. Journal of Machine Learning Research, 6:905–936, May 2005.

[12] F. V. Jensen, S. L. Lauritzen, and K. G. Olesen. Bayesian updating in causal probabilistic networks by local computations. Computational Statistics Quarterly, 4:269–282, 1990.

[13] M. Jordan, editor. Learning in Graphical Models. The MIT Press, 1998.

[14] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In Learning in Graphical Models, pages 105–161. MIT Press, Cambridge, MA, USA, 1999.

[15] F. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, February 2001.

[16] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50(2):157–224, 1988.

[17] R. Mateescu. aolibPE package for exact inference in Bayesian networks. http://graphmod.ics.uci.edu/group/aolibPE, 2008.

[18] K. Murphy. Bayes Net Toolbox for Matlab. http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html, 2007.

[19] M. Narasimhan and J. Bilmes. PAC-learning bounded tree-width graphical models. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 410–417, Arlington, Virginia, United States, 2004. AUAI Press.

[20] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, September 1988.

[21] N. Robertson and P. D. Seymour. Graph minors. III. Planar tree-width. Journal of Combinatorial Theory, Series B, 36(1):49–64, 1984.

[22] R. D. Shachter, B. D'Ambrosio, and B. A. Del Favero. Symbolic probabilistic inference in belief networks. In Proceedings of the Eighth National Conference on Artificial Intelligence, pages 126–131, Boston, Massachusetts, United States, 1990.

[23] G. R. Shafer and P. P. Shenoy. Probability propagation. Annals of Mathematics and Artificial Intelligence, 2(1-4):327–351, 1990.

[24] Y. Wexler and C. Meek. MAS: A multiplicative approximation scheme for probabilistic inference. In Advances in Neural Information Processing Systems 21, pages 1761–1768. 2009.

[25] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. In Exploring Artificial Intelligence in the New Millennium, pages 239–269. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.