UNIVERSITY OF CALIFORNIA, IRVINE DISSERTATION DOCTOR OF …dechter/publications/r165.pdf ·...

UNIVERSITY OF CALIFORNIA,

IRVINE

Sampling Algorithms for Probabilistic Graphical Models with Determinism

DISSERTATION

submitted in partial satisfaction of the requirements

for the degree of

DOCTOR OF PHILOSOPHY

in Information and Computer Sciences

by

Vibhav Giridhar Gogate

Dissertation Committee:

Professor Rina Dechter, Chair

Professor Alex Ihler

Professor Padhraic Smyth

Professor Max Welling

2009

DEDICATION

To my lovely wife Tripti,

my son Neeraj,

my parents,

my sister Kirti, her husband Ajay,

and my niece Anisha.

ii

TABLE OF CONTENTS

Page

LIST OF FIGURES vi

LIST OF TABLES ix

ACKNOWLEDGMENTS xii

CURRICULUM VITAE xiv

ABSTRACT OF THE DISSERTATION xvii

1 Introduction 1

1.1 Thesis Outline and Contributions . . . . . . . . . . . . . . . . . . . . . . . 6

1.2 Overview of Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . 9

1.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.2.2 Graph concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.2.3 Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.2.4 Markov Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.2.5 Constraint Networks . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.2.6 Propositional Satisfiability . . . . . . . . . . . . . . . . . . . . . . 18

1.2.7 Mixed Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.3 Exact Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

1.3.1 Cluster Tree Elimination . . . . . . . . . . . . . . . . . . . . . . . 23

1.3.2 Cutset Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . 26

1.4 Approximate Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

1.4.1 Iterative Join Graph Propagation . . . . . . . . . . . . . . . . . . . 27

1.4.2 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 33

1.4.3 Markov Chain Monte Carlo schemes . . . . . . . . . . . . . . . . 43

1.4.4 Rao-Blackwellised sampling . . . . . . . . . . . . . . . . . . . . . 50

2 Hybrid Dynamic Mixed Networks for modeling Transportation routines 53

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

2.2 Hybrid Dynamic Mixed networks . . . . . . . . . . . . . . . . . . . . . . 55

2.3 The transportation model . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

2.4 Constructing a proposal distribution using Iterative Join Graph Propagation 60

2.5 Eliminating and Reducing Rejection . . . . . . . . . . . . . . . . . . . . . 64

iii

2.5.1 Sampling from the backtrack-free distribution using adaptive con-

sistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

2.5.2 Reducing Rejection using IJGP-sampling . . . . . . . . . . . . . . 70

2.6 Iterative Join Graph Propagation for HDMNs . . . . . . . . . . . . . . . . 72

2.6.1 Hybrid IJGP(i) for inference in Hybrid Mixed Networks . . . . . . 72

2.6.2 IJGP(i)-S for inference in sequential domains . . . . . . . . . . . . 73

2.7 Rao-Blackwellised Particle Filtering for HDMNs . . . . . . . . . . . . . . 78

2.8 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

2.8.1 Finding destination or goal of a person . . . . . . . . . . . . . . . . 81

2.8.2 Finding the route taken by the person . . . . . . . . . . . . . . . . 83

2.9 Related Work and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 84

3 SampleSearch: A scheme that searches for consistent samples 87

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

3.2 The SampleSearch Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 91

3.2.1 The Sampling Distribution of SampleSearch . . . . . . . . . . . . 93

3.2.2 Approximating QF (x) . . . . . . . . . . . . . . . . . . . . . . . . 99

3.2.3 Incorporating Advanced Search Techniques in SampleSearch . . . . 106

3.3 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

3.3.1 SampleSearch with w-cutset and IJGP . . . . . . . . . . . . . . . . 108

3.3.2 Other Competing Schemes . . . . . . . . . . . . . . . . . . . . . . 111

3.3.3 Results for Weighted Counts . . . . . . . . . . . . . . . . . . . . . 114

3.3.4 Results for the Posterior Marginal Tasks . . . . . . . . . . . . . . . 125

3.3.5 Summary of Experimental Evaluation . . . . . . . . . . . . . . . . 132

3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

4 Studies in Solution Sampling 142

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

4.2 Background and Related work . . . . . . . . . . . . . . . . . . . . . . . . 145

4.2.1 Basic Notation and Definitions . . . . . . . . . . . . . . . . . . . . 145

4.2.2 Earlier work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

4.3 Using IJGP for solution sampling . . . . . . . . . . . . . . . . . . . . . . . 149

4.4 The SampleSearch scheme for solution sampling . . . . . . . . . . . . . . 152

4.4.1 The Sampling Distribution of SampleSearch . . . . . . . . . . . . 154

4.5 SampleSearch-MH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

4.5.1 Improved SampleSearch-MH . . . . . . . . . . . . . . . . . . . . 159

4.6 SampleSearch-SIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

4.6.1 Extensions of basic SampleSearch-SIR . . . . . . . . . . . . . . . 162

4.6.2 Discussion on related work . . . . . . . . . . . . . . . . . . . . . . 164

4.7 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

4.7.1 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . 169

4.7.2 Results for p = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

4.7.3 Predicting which sampling scheme to use . . . . . . . . . . . . . . 180

4.7.4 Results for p > 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

iv

5 Lower Bounding weighted counts using the Markov Inequality 196

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

5.3 Markov Inequality based Lower Bounds . . . . . . . . . . . . . . . . . . . 200

5.3.1 The Minimum scheme . . . . . . . . . . . . . . . . . . . . . . . . 201

5.3.2 The Average Scheme . . . . . . . . . . . . . . . . . . . . . . . . . 203

5.3.3 The Maximum scheme . . . . . . . . . . . . . . . . . . . . . . . . 204

5.3.4 Using the Martingale Inequalities . . . . . . . . . . . . . . . . . . 205


5.4.1 The Algorithms Evaluated . . . . . . . . . . . . . . . . . . . . . . 209

5.4.2 Results on networks having no determinism . . . . . . . . . . . . . 213

5.4.3 Results on networks having determinism . . . . . . . . . . . . . . 215

5.4.4 Summary of Experimental Results . . . . . . . . . . . . . . . . . . 222

5.5 Conclusion and Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

6 AND/OR Importance Sampling 223

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

6.2 AND/OR search spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

6.3 AND/OR Tree importance sampling . . . . . . . . . . . . . . . . . . . . . 230

6.3.1 Estimating Expectation by Parts . . . . . . . . . . . . . . . . . . . 231

6.3.2 Estimating weighted counts using an AND/OR sample tree . . . . . 235

6.4 Variance Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

6.4.1 Remarks on Variance Reduction . . . . . . . . . . . . . . . . . . . 253

6.5 Estimating Sample mean in AND/OR graphs . . . . . . . . . . . . . . . . 254

6.6 AND/OR w-cutset sampling . . . . . . . . . . . . . . . . . . . . . . . . . 257

6.6.1 The algorithm and its properties . . . . . . . . . . . . . . . . . . . 261

6.6.2 Variance Hierarchy and Complexity . . . . . . . . . . . . . . . . . 264


6.7.1 The Algorithms evaluated . . . . . . . . . . . . . . . . . . . . . . 266

6.7.2 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . 267

6.7.3 Results on the Grid Networks . . . . . . . . . . . . . . . . . . . . 268

6.7.4 Results on Linkage networks . . . . . . . . . . . . . . . . . . . . . 275

6.7.5 Results on 4-Coloring Problems . . . . . . . . . . . . . . . . . . . 278

6.8 Discussion and Related work . . . . . . . . . . . . . . . . . . . . . . . . . 279

6.8.1 Relation to other graph-based variance reduction schemes . . . . . 279

6.8.2 Hoeffding’s U -statistics . . . . . . . . . . . . . . . . . . . . . . . 279

6.8.3 Problem with large sample sizes . . . . . . . . . . . . . . . . . . . 280

6.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281

7 Conclusions 283

7.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

7.2 Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . . . 286

Bibliography 288

v

LIST OF FIGURES

Page

1.1 Figure showing (a) Directed Acyclic Graph (DAG), (b) Moral graph of

DAG in (a), (c) Induced graph along the ordering (A,E,D,B,C), (d) In-

duced graph along the ordering (A,B,C,D,E) . . . . . . . . . . . . . . . 12

1.2 Figure showing two valid tree decompositions for the moral graph in Figure

1.1(b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.3 (a) An example Bayesian network, (b) An example CPT P (D|B,C) and

(C) Moral graph of the Bayesian network shown in (a) . . . . . . . . . . . 15

1.4 (a) An example 3 × 3 square Grid Markov network (ising model) and (b)

An example potential H6(D,E) . . . . . . . . . . . . . . . . . . . . . . . 16

1.5 (a) An example constraint network showing a three coloring problem (b)

The set of solutions of the constraint network . . . . . . . . . . . . . . . . 17

1.6 Execution of CTE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

1.7 Constructing a Join-graph decomposition using schematic mini-buckets . . 32

1.8 Join-graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.1 Car travel activity model . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

2.2 Figure showing an example two slice Dynamic Bayesian network. The

shaded nodes form the Forward interface (adapted from Murphy [98]) . . . 75

2.3 Schematic illustration of the Procedure used for creating join-graphs and

join-trees for HDMNs. I indicates a set of interface nodes while N indicates

non-interface nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

2.4 Route prediction. On the left is the person’s current location derived from

the GPS device on his car and on the right is the route the person is most

likely to take based on his past history predicted by our model. . . . . . . . 85

3.1 A full OR search tree given a set of constraints and a proposal distribution. . 95

3.2 Five possible traces of SampleSearch which lead to the sample (A = 0, B = 2 , C = 0 ). The children of each node are specified from left to

right in the order in which they are generated. . . . . . . . . . . . . . . . . 95

3.3 (a) Three DFS-traces (b) Combined information from the three DFS-traces

given in (a) and (c) Two possible approximations of I(B|A = 1) . . . . . . 101

3.4 Chart showing the scope of our experimental study . . . . . . . . . . . . . 108

3.5 Time versus solution counts for two sample Latin square instances. . . . . . 117

3.6 Time versus solution counts for two sample Langford instances. . . . . . . 119

3.7 A fragment of a Bayesian network used in genetic linkage analysis. . . . . 121

vi

3.8 Convergence of probability of evidence as a function of time for two sample

Linkage instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

3.9 Time versus Hellinger distance for two sample Linkage instances. . . . . . 127

3.10 Time versus Hellinger distance for two sample Friends and Smokers net-

works. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

3.11 Time versus Hellinger distance for two sample Mastermind networks. . . . 137

3.12 Time versus Hellinger distance for two sample Grid instances with deter-

ministic ratio=50%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138


ministic ratio=75%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139


ministic ratio=90%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

3.15 Time versus Hellinger distance for two sample Logistics planning instances. 141

4.1 Search tree for the given SAT formula F annotated with probabilities for

sampling solutions from a uniform distribution. . . . . . . . . . . . . . . . 148

4.2 (a) A distribution Q expressed as a probability tree and (b) Backtrack-free

distribution QF of Q w.r.t. F . . . . . . . . . . . . . . . . . . . . . . . . . . 155

4.3 Time versus Hellinger distance plots for Circuit instances 2bitmax 6 and

2bitcomp 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

4.4 Time versus Hellinger distance plots for Circuit instances ssa7552-158 and

ssa7552-159 instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

4.5 Time versus Hellinger distance plots for two 3-coloring instances with 100

and 200 vertices respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 176

4.6 Time versus Hellinger distance plots for Logistics planning instances log-1

and log-2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

4.7 Time versus Hellinger distance plots for Logistics planning instances log-3

and log-4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

4.8 Time versus Hellinger distance plot for a Logistics planning instance log-5. 180

4.9 Time versus Hellinger distance plot for Grid pebbling instances of size 10

and 15. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

4.10 Time versus Hellinger distance plot for Grid pebbling instances of size 20

and 25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

4.11 Time versus Hellinger distance plot for Grid pebbling instance of size 30 . . 183

4.12 Time versus co-efficient of variation of SampleSearch for the Logistics and

3-coloring benchmarks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

4.13 Time versus co-efficient of variation of SampleSearch for the Grid pebbling

and Circuit benchmarks. . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

4.14 Time versus Hellinger distance plot for SampleSearch as a function of p for

log-4 and Grid pebbling instance of size 25. . . . . . . . . . . . . . . . . . 191

4.15 Time versus Hellinger distance plot for SampleSearch-SIR with replace-

ment as a function of p for log-4 and Grid pebbling instance of size 25. . . . 192

4.16 Time versus Hellinger distance plot for SampleSearch-SIR without replace-

ment as a function of p for log-4 and Grid pebbling instance of size 25. . . . 193

vii

4.17 Time versus Hellinger distance plot for SampleSearch-MH as a function of

p for log-4 and Grid pebbling instance of size 25. . . . . . . . . . . . . . . 194

6.1 AND/OR search spaces for graphical models . . . . . . . . . . . . . . . . 227

6.2 Assigning weights to OR-to-AND arcs of an AND/OR search tree . . . . . 228

6.3 A Bayesian network and its CPTs . . . . . . . . . . . . . . . . . . . . . . 230

6.4 A pseudo tree of the Bayesian network given in Figure 6.3(a) in which each

variable is annotated with its bucket function. . . . . . . . . . . . . . . . . 235

6.5 Figure showing four samples arranged on an AND/OR sample tree. . . . . 237

6.6 Value computation on an AND/OR sample tree . . . . . . . . . . . . . . . 239

6.7 The possible pseudo trees for n = 1, 2, 3 excluding a permutation of the

variables, where n is the number of variables. . . . . . . . . . . . . . . . . 249

6.8 Figure showing 4 samples arranged on an OR sample tree, AND/OR sam-

ple tree and AND/OR sample graph. The AND/OR sample tree and graph

are based on the pseudo tree given in Figure 6.1(b). . . . . . . . . . . . . . 254

6.9 (a) Example primal graph of a mixed network (b) Pseudo tree (c) Start

pseudo tree of OR w-cutset tree sampling and (d) Start pseudo tree of

AND/OR w-cutset tree sampling. . . . . . . . . . . . . . . . . . . . . . . . 258

6.10 Variance Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265

6.11 Complexity of various AND/OR importance sampling schemes and their

w-cutset generalizations. . . . . . . . . . . . . . . . . . . . . . . . . . . . 265

6.12 Time versus log relative error for importance sampling, w-cutset impor-

tance sampling and their AND/OR tree and graph generalizations for Grid

instances with deterministic ratio = 50%. . . . . . . . . . . . . . . . . . . 270








tance sampling and their AND/OR tree and graph generalizations for two

sample Linkage instances from the UAI 2006 evaluation. . . . . . . . . . . 274


tance sampling and their AND/OR tree and graph generalizations for two

sample Linkage instances from the UAI 2008 evaluation. . . . . . . . . . . 276

viii

LIST OF TABLES

Page

2.1 Goal prediction accuracy for Model-1. Each cell in the last four columns

contains the prediction accuracy for the corresponding combination of the

inference scheme (row) and the learning algorithm (column). Given a

learned model, the column ‘Time’ reports the average time required by

each inference algorithm for predicting the goal. . . . . . . . . . . . . . . . 82











2.4 False positives (FP) and False negatives for routes taken by a person (FN) . 84

3.1 Query types handled by various solvers. . . . . . . . . . . . . . . . . . . . 114

3.2 Solution counts output by SampleSearch, ApproxCount, SampleCount and

Relsat after 10 hours of CPU time for Latin Square instances. When the ex-

act counts are not known, the entries for SampleCount and SS/LB contain

the lower bounds computed by combining their respective sample weights

with the Markov inequality based Average and Martingale schemes. . . . . 115

3.3 Solution counts output by SampleSearch, ApproxCount, SampleCount and

Relsat after 10 hours of CPU time for Langford’s problem instances. When

the exact counts are not known, the entries for SampleCount and SS/LB

contain the lower bounds computed by combining their respective sam-

ple weights with the Markov inequality based Average and Martingale

schemes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

3.4 Lower Bounds on Solution counts output by SampleSearch/LB, Sample-

Count and Relsat after 10 hours of CPU time for FPGA routing instances.

The entries for SampleCount and SS/LB contain the lower bounds com-

puted by combining their respective sample weights with the Markov in-

equality based Average and Martingale schemes. . . . . . . . . . . . . . . 120

ix

3.5 Probability of evidence computed by VEC, EDBP and SampleSearch after

3 hours of CPU time for Linkage instances from the UAI 2006 evaluation. . 121

3.6 Partition function computed by VEC, EDBP and SampleSearch after 3

hours of CPU time for Linkage instances from the UAI 2008 evaluation. . . 124

3.7 Probability of evidence computed by VEC, EDBP and SampleSearch after

3 hours of CPU time for relational instances. . . . . . . . . . . . . . . . . . 125

3.8 Table showing Hellinger distance of SampleSearch, IJGP, EPIS and EDBP

for Linkage instances from the UAI 2006 evaluation after 3 hours of CPU

time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128


for Relational instances after 3 hours of CPU time. . . . . . . . . . . . . . 128


for Grid networks after 3 hours of CPU time. . . . . . . . . . . . . . . . . 130


for Logistics planning instances after 3 hours of CPU time. . . . . . . . . . 132

4.1 Table showing the two parameters (Number of Xors and Size of Xors) used

by XorSample for each problem instance. ‘X’ indicates that no parameter

that satisfies our criteria was found. . . . . . . . . . . . . . . . . . . . . . 167

4.2 Table showing Hellinger distance of SampleSat, XorSample, SampleSearch,

SIRwR, SIRwoR and MH after 10 hrs of CPU time. . . . . . . . . . . . . . 171

4.3 Table showing the number of samples generated per second by SampleSat,

XorSample and SampleSearch. . . . . . . . . . . . . . . . . . . . . . . . . 172

4.4 Table showing the number of solutions output per second by SampleSearch

as a function of p. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

4.5 Table showing the Hellinger distance of SampleSearch as a function of pafter 10 hrs of CPU time. . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

4.6 Table showing the Hellinger distance of SampleSearch-SIR with replace-

ment as a function of p after 10 hrs of CPU time. . . . . . . . . . . . . . . 188

4.7 Table showing the Hellinger distance of SampleSearch-SIR without re-

placement as a function of p after 10 hrs of CPU time. . . . . . . . . . . . . 189

4.8 Table showing the Hellinger distance of SampleSearch-MH as a function

of p after 10 hrs of CPU time. . . . . . . . . . . . . . . . . . . . . . . . . . 190

5.1 Table showing the log-relative error ∆ of bound propagation and four ver-

sions of Markov-LB combined with IJGP-sampling for Bayesian networks

having no determinism after 2 minutes of CPU time. . . . . . . . . . . . . 214

5.2 Table showing the log-relative error ∆ of Relsat and four versions of Markov-

LB combined with SampleSearch and SampleCount respectively for Latin

Square instances after 10 hours of CPU time. . . . . . . . . . . . . . . . . 216


LB combined with SampleSearch and SampleCount respectively for Lang-

ford instances after 10 hours of CPU time. . . . . . . . . . . . . . . . . . . 217

x


LB combined with SampleSearch and SampleCount respectively for FPGA

routing instances after 10 hours of CPU time. . . . . . . . . . . . . . . . . 218

5.5 Table showing the log-relative error ∆ of VEC and four versions of Markov-

LB combined with SampleSearch for Linkage instances from the UAI 2006

evaluation after 3 hours of CPU time. . . . . . . . . . . . . . . . . . . . . 219


LB combined with SampleSearch for Linkage instances from the UAI 2008

evaluation after 3 hours of CPU time. . . . . . . . . . . . . . . . . . . . . 220


LB combined with SampleSearch for relational instances after 3 hours of

CPU time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

6.1 Table showing the log-relative error ∆ of importance sampling, w-cutset

importance sampling and their AND/OR tree and graph variants for the

Grid instances after 1 hour of CPU time. . . . . . . . . . . . . . . . . . . . 269

6.2 Table showing the log relative error ∆ of importance sampling, w-cutset


Linkage instances from UAI 2006 evaluation after 1 hour of CPU time. . . 275



Markov Linkage instances from UAI 2008 evaluation after 1 hour of CPU

time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277


importance sampling and their AND/OR tree and graph variants on Joseph

Culberson’s flat graph coloring instances after 1 hour of CPU time. . . . . . 278

xi

ACKNOWLEDGMENTS

First I would like to thank Rina Dechter, my advisor through most of my graduate career.

She has shown me a vision of what Artificial Intelligence in general, and what probabilistic

and constraint reasoning in particular, can and should be. I am graduating with a clear idea

of what a scientific researcher should be because of Rina’s example. Her high standards,

hard working practices and unfettered commitment to excellence have made a profound

impact on me and have challenged me to do the same. Needless to say that most of the

work presented in the thesis benefited greatly from her comments. Without her, this thesis

and most of my papers would have been utterly incomprehensible. I would like to thank

other members of my committee – Alex Ihler, Padhraic Smyth and Max Welling – for their

time and comments, for their constant encouragement and for their enormous help on job

search.

I would also like to thank Professor Sharad Mehrotra who has served as my mentor ever

since I have met him. When funding was tight, he ‘rescued’ my Ph.D. by supporting me

through a large NSF grant titled RESCUE (award numbers IIS-0412854, IIS-0331707)

which stands for “Responding to Crisis and Unexpected Events”. Sharad always took a

keen interest in my research and his insights and broad knowledge helped me enormously.

My graduate study was helped supported by the NSF grants IIS-0086529, IIS-0412854,

IIS-0331707 and IIS-0412854, by the NIH grant R01-HG004175-02, and by the Donald

Bren School of Information and Computer Science at University of California, Irvine.

I also wish to express my gratitude to our graduate student advisors Kris Bolcer, Milena

Wypchlak and Gina Anzivino and to all ICS faculty for creating an inspiring environment in

our department. Many thanks to current and former members of Rina’s group: Dan Frost,

Kalev Kask, Bozhena Bidyuk, Robert Mateescu, Radu Marinescu, Lars Otten, Emma Rol-

lon, Tuan Nguyen and Natasha Flerova for many countless interesting conversations. Fel-

xii

low graduate students in our department, both past and present, have been a great help to

me: Jonathan Hutchins, Ramaswamy Hariharan, Guy Yosiphon, Srikanth Agaram, Bijit

Hore, Ravi Chandra Jammalamadaka, Vidhya Balasubramanian, Chaitanya Desai, Chai-

tanya Chemudugunta, Arthur Asuncion and Ian Porteous.

Next, I would like to thank Professor Elise Turner, my Master’s advisor from University

of Maine, Orono and her husband Professor Roy Turner. Elise and Roy introduced me to

Artificial Intelligence and taught me “how to think like a computer scientist”. Elise passed

away in August, 2008 and she will be greatly missed.

This thesis (obviously) would not have been possible without the endless love and support

of my wife, Tripti. She stood firmly with me through the highs and lows of working on

a Ph.D. She always patiently listened to all my ideas, wrote some GUI interfaces in Java

and C++ to help sell my research and more importantly made sure that I’m at least 30 lb

heavier than what I was when we got married.

Finally, I would like to thank my parents, my sister Kirti and her husband Ajay for their

support. They always encouraged me to do what I loved to do.

xiii

CURRICULUM VITAE


EDUCATION

Ph.D. Information and Computer Science, 2009

Donald Bren School of Information and Computer Science

University of California, Irvine

Dissertation: Sampling Algorithms for Probabilistic Graphical models

with Determinism

Advisor: Rina Dechter

M.S. Computer Science, 2002

University of Maine, Orono

B.E. Computer Engineering, 1999

Mumbai University, Maharashtra, India.

PUBLICATIONS

[1] Vibhav Gogate and Rina Dechter: Approximate Solution Sampling (and Counting)

on AND/OR Spaces. In 14th International Conference on Principles and Practice

of Constraint Programming (CP) 2008. Pages 534-538.

[2] Vibhav Gogate and Rina Dechter: AND/OR Importance Sampling. In 24th Con-

ference on Uncertainty in Artificial Intelligence (UAI) 2008: 212-219.

xiv

[3] Vibhav Gogate and Rina Dechter: Studies in Solution Sampling. In 23rd Confer-

ence on Artificial Intelligence (AAAI) 2008. Pages 271-276.

[4] Vibhav Gogate, Bozhena Bidyuk and Rina Dechter: Studies in Lower Bounding

Probability of Evidence using the Markov Inequality. In 23rd Conference on Un-

certainty in Artificial Intelligence (UAI) 2007. Pages 141-148.

[5] Vibhav Gogate and Rina Dechter: Approximate Counting by Sampling the

Backtrack-free Search Space. In 22nd Conference on Artificial Intelligence

(AAAI) 2007. Pages 198-203.

[6] Vibhav Gogate and Rina Dechter: SampleSearch: A Scheme that Searches for

Consistent Samples. In Proceedings of the Eleventh International Conference on

Artificial Intelligence and Statistics (AISTATS) 2007. Pages 147-154.

[7] Vibhav Gogate and Rina Dechter: A New Algorithm for Sampling CSP Solutions

Uniformly at Random. In 12th International Conference on Principles and Practice

of Constraint Programming (CP) 2006. Pages 711-715.

[8] Vibhav Gogate and Rina Dechter: Approximate Inference Algorithms for Hybrid

Bayesian Networks with Discrete Constraints. In 21st Conference on Uncertainty

in Artificial Intelligence (UAI) 2005. Pages 209-216.

[9] Vibhav Gogate, Rina Dechter, Bozhena Bidyuk, Craig Rindt and James Marca:

Modeling Transportation Routines using Hybrid Dynamic Mixed Networks. In

21st Conference on Uncertainty in Artificial Intelligence (UAI) 2005. Pages 217-

224.

xv

[10] Vibhav Gogate and Rina Dechter: A Complete Anytime Algorithm for Treewidth.

In 20th Conference on Uncertainty in Artificial Intelligence (UAI) 2004. Pages

201-208.

[11] Kalev Kask, Rina Dechter and Vibhav Gogate: Counting-Based Look-Ahead

Schemes for Constraint Satisfaction. In 10th International Conference on Prin-

ciples and Practice of Constraint Programming (CP) 2004. Pages 317-331.

xvi

ABSTRACT OF THE DISSERTATION

Sampling Algorithms for Probabilistic Graphical Models with Determinism

By


Doctor of Philosophy in Information and Computer Sciences

University of California, Irvine, 2009

Professor Rina Dechter, Chair

Mixed constraint and probabilistic graphical models occur quite frequently in many real

world applications. Examples include: genetic linkage analysis, functional/software veri-

fication, target tracking and activity modeling. Query answering and in particular proba-

bilistic inference on such graphical models is computationally hard often requiring expo-

nential time in the worst case. Therefore in practice sampling algorithms are widely used

for providing an approximate answer. In presence of deterministic dependencies or hard

constraints, however, sampling has to overcome some principal challenges. In particular,

importance sampling type schemes suffer from what is known as the rejection problem in

that samples having zero weight may be generated with probability arbitrarily close to one

yielding useless results. On the other hand, Markov Chain Monte Carlo techniques do not

converge at all often yielding highly inaccurate estimates.

In this thesis, we address these problems in a two fold manner. First, we utilize research

done in constraint satisfaction and satisfiability communities for processing constraints to

reduce or eliminate rejection. Second, mindful of the time overhead in sample generation

due to determinism, we both make and utilize advances in statistical estimation theory to

xvii

make the “most” out of the generated samples.

Utilizing constraint satisfaction and satisfiability research, we propose two classes of sam-

pling algorithms - one based on consistency enforcement and the other based on systematic

search. The consistency enforcement class of algorithms work by shrinking the domains

of random variables, by strengthening constraints, or by creating new ones, so that some or

all zeros in the problem space can be removed. This improves convergence because of di-

mensionality reduction and also reduces rejection because many zero weight samples will

not be generated. Our systematic search based techniques called SampleSearch manage

the rejection problem by interleaving sampling with backtracking search. In this scheme,

when a sample is supposed to be rejected, the algorithm continues instead with systematic

backtracking search until a strictly positive-weight sample is generated. The strength of

this scheme is that any state-of-the-art constraint satisfaction or propositional satisfiability

search algorithm can be used with minor modifications. Through large scale experimental

evaluation, we show that SampleSearch outperforms all state-of-the-art schemes when a

significant amount of determinism is present in the graphical model.

Subsequently, we combine SampleSearch with known statistical techniques such as Sam-

pling Importance Resampling and Metropolis Hastings yielding efficient algorithms for

sampling solutions from a uniform distribution over the solutions of a Boolean satisfiability

formula. Unlike state-of-the-art algorithms, our SampleSearch-based algorithms guarantee

convergence in the limit.

As to statistical estimation, we make two distinct contributions. First, we propose sev-

eral new statistical inequalities extending the one-sample Markov inequality to multiple

samples which can be used in conjunction with SampleSearch to probabilistically lower

bound likelihood tasks over mixed networks. Second, we present a novel framework called

“AND/OR importance sampling” which generalizes the process of computing sample mean

by exploiting AND/OR search spaces for graphical models. Specifically we provide a spec-

xviii

trum of AND/OR sample means which are defined on the same set of samples but derive

different estimates trading variance with time. At one end is the AND/OR sample tree

mean which has smaller variance than the conventional OR sample tree mean and has the

same time complexity. At the other end is the AND/OR graph sample mean which has even

lower variance but has higher time and space complexity. We demonstrate empirically that

AND/OR sample means are far closer to the exact answer than the conventional OR sample

mean.

xix

Chapter 1

Introduction

Representation and reasoning are fundamental computer science concepts. Representation

involves translating real world knowledge into a model intended for computer processing

while reasoning involves solving problems that arise in the real world by querying the

computer model. In this thesis, we will focus on a widely used graph based knowledge

representation and reasoning framework for dealing with uncertainty called Probabilistic

Graphical models.

Probabilistic graphical models such as Bayesian and Markov networks use the well stud-

ied language of probability theory for modeling real world uncertain phenomenon. Since

its advent in the late 17th century, probability theory has been widely used in many fields

such as biology, statistics and physics; to name a few. The semantics of probability theory

are quite natural: the world consists of random variables and the joint distribution over the

variables represents complete knowledge about them. Unfortunately, storing the joint dis-

tribution is exponential in the number of random variables and therefore clearly infeasible

for any realistic application.

Probabilistic graphical models such as Bayesian and Markov networks address these con-

1

cerns by representing the joint distribution compactly using a graph-based representation.

In a Bayesian network, the joint distribution is represented by a Directed Acyclic Graph

(DAG) where each node in the DAG corresponds to a random variable and is associated

with a Conditional Probability Distribution (CPTs) of the variable given its parents in the

DAG. The joint distribution of a Bayesian network is a product of its CPTs. In a Markov

network, the joint distribution is represented using an undirected graph, with potential func-

tions defined on the cliques in the graph. The joint distribution represented by a Markov

network is the normalized product of the potential functions.

In this thesis, we will focus on Bayesian and Markov networks which may contain also sig-

nificant deterministic relationships. Deterministic relationships can occur either in the form

of zeros inside the CPTs or in the potential functions, or may come in the form of explicit

external constraints that restrict the values that the random variables can take. Throughout

the thesis, we will use the framework of mixed networks [92] to present our new algorith-

mic contributions. The mixed network framework was introduced for the purpose of ad-

dressing the representational and computational aspects of representing deterministic and

probabilistic information in graphical models. The central aim is to allow exploiting the

power of constraint processing for efficient probabilistic inference.

Constraint networks and propositional theories are common frameworks for modeling de-

terministic information. A constraint network uses the declarative language of constraints

for representing knowledge. In a constraint network, we have a set of variables which can

be assigned values from a finite domain and a set of constraints which restrict these assign-

ments. The main task is to find a solution or a model i.e. assign values to all variables

without violating any of the constraints. Another is to count all such assignments referred

to as model or solution counting. The constraints can be associated with an undirected

graph called the constraint graph in which nodes represent variables and edges connect any

two variables involved in a constraint. Thus, a constraint network is a deterministic graph-

2

ical model. One example of a constraint network is graph coloring. Here, the variables are

vertices, the values are the available colors and the constraints enforce the condition that

no two adjacent vertices can be assigned the same color. The constraint graph in this case

is the graph to be colored.

It should be noted however that the use of constraints within the confines of a mixed net-

work adds nothing new to the representation power of a Bayesian or a Markov network. In

principle, constraints and in general a constraint network can be easily incorporated into a

Bayesian or Markov network. In case of a Markov network, one can specify each constraint

as a 0/1 potential function in which a 1 indicates allowed tuples while a 0 indicates other-

wise (and which may add additional edges in the undirected graph). In a Bayesian network,

we can embed a constraint by modeling it as a Conditional Probability Table (CPT). In this

approach, one has to add a new variable for each constraint that is perceived as its effect

(child node) in the corresponding causal relationship and then to clamp its value to true i.e.

set it as evidence.

The main shortcoming, however, of any of the above approaches is computational. Con-

straints have special properties that render them computationally attractive. When con-

straints are disguised as probabilistic relationships, their computational benefits may be

hard to exploit. In particular, the power of constraint inference and constraint propagation

may not be brought to bear. Therefore in mixed networks, the identity of the respective re-

lationships, as constraints or probabilities are maintained explicitly, so that their respective

computational power and semantic differences can be made vivid and easy to exploit.

Perhaps the most fundamental issue for any graphical model is the problem of inference.

Given a (mixed) graphical model that represents some joint probability distribution P(X)

over a set of variables X, we are required to answer a specific query relative to an event

E = e, which is an assignment of values to a subset of variables; also called as evidence.

Examples of such queries are computing the probability of evidence P (E = e) (also called

3

as likelihood computation) or computing the posterior marginal distribution P (X|E = e)

(also called as belief updating). For example, in medical diagnosis [110] one is interested

in finding the probability that a person has a disease given a collection of symptoms; or in

activity modeling [86] one is interested in assessing the likelihood that a person is going

home given his current location and the time of the day.

Inference algorithms are developed to answer queries over graphical models. The algo-

rithms can be exact or approximate. Exact algorithms however are applicable to networks

which admit a relatively simple graph representation having low treewidth. When the graph

structure is too complex, approximate algorithms must be used.

Approximate algorithms for graphical models are of two primary types: (i) bounded mes-

sage passing based or (ii) sampling based. They approximate the two types of exact al-

gorithms which are tree-clustering based or search based. Examples of message passing

approximate schemes are mini bucket elimination [39], loopy belief propagation [106] and

generalized belief propagation [132] while the main types of sampling schemes are impor-

tance sampling or Markov Chain Monte Carlo sampling. Since the focus of this thesis is on

sampling schemes, we will provide some general description in the following paragraphs.

A detailed overview of all the schemes is given in section 1.4.

The main idea in importance sampling type algorithms is to express the likelihood and

marginal estimation tasks as the expected value of some random variable whose proba-

bility distribution is selected by the algorithm designer. The only requirement is that the

special distribution called the proposal (or importance or trial) distribution should be easy

to sample from (typically in a sequential variable by variable manner). Importance sam-

pling then generates samples from the proposal distribution and aims to approximate the

true expected value (also called the true average) by a weighted average over the samples

(also called sample average). The sample average has several desirable properties like: (a)

unbiasedness, namely that the expected value of the sample average equals the true average,

4

and (b) the mean squared error between the sample average and the true average decreases

with the number of samples.

Markov Chain Monte Carlo (MCMC) sampling schemes are based on an utterly different

principle. If we are able to generate samples from the posterior distributionP(X|e), then we

can estimate the marginal probability P (X = x|e) for any variable X ∈ X as the number

of times X = x appears in the generated samples. Unfortunately, P(X|e) is typically very

difficult to sample from and therefore MCMC schemes construct a Markov chain that has

P(X|e) as its equilibrium or stationary distribution. A Markov chain consists of a sequence

of states and a transition-rule for moving from state x to state y. The state of the chain after

a large number of steps is then used as a sample from the desired posterior distribution.

The quality of the samples improves as a function of the number of steps. It is usually

not hard to construct Markov chains for a graphical model (see Section 1.4.3). However,

convergence to the posterior distribution depends upon the Markov chain being ergodic

requiring a positive probability of moving from a state x to a state y in a finite number of

steps.

In presence of hard constraints, however, sampling becomes much harder, in the sense that

some guarantees that hold for a strictly positive probability space break down. Specifi-

cally, given a bounded number of evidence nodes, a small real number ǫ < 1, and a strictly

positive Bayesian network, importance sampling yields a randomized polynomial time ap-

proximation (polynomial in the number of evidence nodes and ǫ) of the posterior marginal

distribution with relative error ǫ [22]. However, in presence of determinism guaranteeing

an approximation with relative error ǫ is NP-hard.

Indeed, it is well known empirically that importance sampling and MCMC schemes per-

form quite poorly in presence of determinism. In particular, importance sampling suffers

from the so-called rejection problem. The rejection problem occurs when the proposal dis-

tribution does not faithfully capture all the zeros or constraints in the probability distribu-

5

tion of interest. Consequently, samples generated from the proposal distribution may have

zero weight and will be essentially rejected because they contribute nothing to the sample

average. In case of Markov Chain Monte Carlo techniques like Gibbs sampling, guarantees

like ergodicity break down due to determinism rendering the space disconnected, meaning

that not all states are reachable from a given current state. In such cases, MCMC techniques

may not even converge.

The contribution of this thesis is in investigating and addressing these deficiencies of sam-

pling based techniques in presence of determinism. Over the course of the next five chap-

ters, we will present a range of sampling techniques which will mitigate the rejection and

address convergence issues under different sets of input assumptions and which will be

specialized for different queries such as computing the posterior marginals, computing the

probability of evidence and generating samples from the posterior distribution.

1.1 Thesis Outline and Contributions

Our approach to managing sampling in presence of determinism has two components. First,

we utilize research done in the constraint satisfaction and satisfiability communities for pro-

cessing constraints to reduce or eliminate rejection and to improve convergence. Second,

mindful of the time overhead in sample generation due to determinism, we both make and

utilize advances in statistical estimation theory to make the “most” out of the generated

samples.

We propose two classes of sampling algorithms for reducing or eliminating rejection - one

based on consistency enforcement and the other based on systematic search.

The consistency enforcement approach presented in Chapter 2 works by tightening the do-

mains of the random variables, by strengthening the constraints, or by creating new ones, so

6

that some or all zeros in the problem space can be removed. This reduces the problem space

on which sampling is carried out which not only improves convergence but also reduces or

altogether eliminates rejection because either some or all zeros, which would otherwise be

sampled, are removed. We applied our new importance sampling technique to learn a dy-

namic Bayesian network which models car travel activities of individuals using GPS data.

Given a time of day and current location via the persons’ GPS device, the Bayesian net-

work can be used to predict the person’s destination and the route to the destination given

his current location.

In Chapter 3, we present a family of algorithms called SampleSearch which combine

systematic backtracking search with sampling. Unlike consistency enforcement schemes

which reduce rejection in a pre-processing manner, SampleSearch handles the rejection

problem during the sampling process itself. In this scheme, when a sample is about to

be rejected due to inconsistency, the algorithm treats it as a dead-end during search, and

instead of aborting, it backtracks as is common in systematic backtracking search, until a

strictly positive weight sample is generated. The strength of this scheme is that any state-

of-the-art constraint satisfaction or propositional satisfiability search algorithm can be used

with minor modifications. However, SampleSearch by itself cannot facilitate unbiased or

asymptotically unbiased estimators because search introduces a bias which is not articu-

lated by the underlying proposal distribution. We therefore characterize this bias, show-

ing that the sampling distribution of SampleSearch is the backtrack-free distribution of

the original proposal distribution which helps derive unbiased or asymptotically unbiased

estimators. Through large scale experimental evaluation, we show that SampleSearch out-

performs all state-of-the-art schemes when a substantial amount of determinism is present

in the graphical model.

In Chapter 4, we focus on the solution sampling task which requires generating solutions

of a constraint satisfaction problem or a satisfiability formula, such that each solution is

7

equally likely. While we can use SampleSearch to generate random solution samples, the

distribution over the generated solution samples would converge to the backtrack free dis-

tribution which is biased away from the uniform distribution over the solutions. We correct

this bias using well known statistical techniques such as Sampling/Importance Resampling

and Metropolis-Hastings which guarantee convergence to the uniform distribution in the

limit. Such guarantees are not available for other state-of-the-art schemes. Our extensive

experimental evaluation demonstrates that our solution sampling techniques substantially

outperform alternative approaches both in terms of speed of sample generation and in terms

of accuracy.

Chapter 5 presents lower bounding schemes for desired queries such as probability of ev-

idence and solution counting based on the well known Markov inequality. A straight for-

ward application of Markov inequality, however, yields bad estimates - a fact which is well

known in statistics. We show that Markov inequality is weak because its estimates are

based only on one sample. We derive several new statistical inequalities based on multiple

samples which guarantee that the lower bound will likely increase as more samples are

drawn. Our empirical results capitalize on the strength of SampleSearch demonstrating the

virtue of our lower bounding schemes in the context of mixed networks.

In Chapter 6, we introduce a family of alternative statistics for estimating the sample av-

erage based on the AND/OR search spaces for graphical models. AND/OR search spaces

capture problem decomposition in search using AND nodes. We show that problem de-

composition uncovered through the AND/OR search space yields a family of sample means

having smaller variance. The new family is defined on the same set of samples but yields

different estimates that trade time with variance. At one end of the spectrum we have the

AND/OR sample tree mean which has the same time complexity as conventional impor-

tance sampling but has lower mean squared error or variance. At the other end is AND/OR

sample graph mean which requires more time and space but has the lowest mean squared

8

error or variance. Via large scale experimental evaluation, we demonstrate that given the

same amount of time the AND/OR sample means are far closer to the true mean than con-

ventional sample mean (output by importance sampling).

Finally in Chapter 7, we conclude by surveying the contributions of this thesis, and outline

directions for future research.

1.2 Overview of Graphical Models

Many practical applications such as genetic linkage analysis [46, 1], car travel activity mod-

eling [86, 53] functional verification [4, 31], target tracking [105], machine vision [45, 85],

medical diagnosis [94, 110] and music parsing [111] typically involve a large collection of

random variables, which interact with each other in a non trivial manner. Consequently,

representing these applications using a joint probability distribution is infeasible because

of exponential memory requirements. For example, a joint distribution over 1000 random

binary variables requires 21000 − 1 entries.

Graphical models such as Bayesian and Markov networks address these concerns by repre-

senting the joint distribution compactly in a factored form using a graph-based representa-

tion. In a Bayesian network, the joint distribution is represented by a Directed acyclic graph

(DAG) with every node in the DAG corresponding to a random variable that is associated

with a conditional probability distribution of the variable given its parents in the DAG. In a

Markov network, the joint distribution is compactly represented using an undirected graph,

with potential functions defined on the cliques in the graph. The joint distribution repre-

sented by a Markov network is defined as the normalized product of the potential functions.

In this section, we introduce and compare several different families of graphical models,

such as Bayesian networks, Markov networks, constraint networks and the mixed networks

9

framework which contains both probabilistic information as well as constraints.

1.2.1 Notation

A discrete graphical model consists of a set of variables, each of which can take a value

from a finite domain and a set of functions defined over the variables. We denote variables

by upper case letters (e.g. X,Y, . . .) and values of variables by the corresponding lower

case letters (e.g. x, y, . . .). Sets of variables are denoted by bold upper case letters, (e.g.

X = X1, . . . , Xn) while the corresponding values are denoted by bold lower case letters

(e.g. x = x1, . . . , xn). X = x denotes an assignment of value to a variable while X = x

denotes an assignment of values to all variables in the set. We denote by Di the set of

possible values of Xi (also called as the domain of Xi) and by DX the Cartesian product

D1 × D2 × . . .Dn of the domains of all variables in the set X.

∑x∈X denotes the sum over the possible values of variables in X, namely,

∑x1∈X1

∑x2∈X2

. . .∑

xn∈Xn. The expected value EQ[X] of a random variable X with respect to a distri-

bution Q is defined as: EQ[X] =∑

x∈X xQ(x). The variance VQ[X] of X is defined as:

VQ[X] =∑

x∈X(x − EQ[X])2Q(x). Often for clarity, we will write EQ[X] as E[X] and

VQ[X] as V [X].

We denote the projection of an assignment x to a set S ⊆ X by xS. Given an assignment y

and z to the sub-sets Y and Z of X, such that X = Y∪Z, x = (y, z) denotes the composition

of y and z.

1.2.2 Graph concepts

We describe some common notations, definitions and results from graph theory. For more

information see [44, 20, 106, 29].

10

DEFINITION 1 (Directed and Undirected Graph). A directed graph is a pairG = (X,E),

where X = X1, . . . , Xn is a set of vertices, and E = (Xi, Xj)|Xi, Xj ∈ X is the set

of directed edges. An undirected graph is defined similarly to a directed graph, but there is

no directionality associated with the edges i.e (Xi, Xj) is equal to (Xj, Xi).

A clique is a set of nodes in G for which all pairs are connected by an edge. If the entire

graph forms a clique, it is said to be complete. A path between nodes X0, Xd is a sequence

of distinct nodes (X0, X1, . . . , Xd−1, Xd) such that for all l = 1, . . . , d, (Xl−1, Xl) ∈ E. A

cycle, is a path which starts and ends with the same node i.e. X0 = Xd. If there is a path

between every pair of nodes in G, it is connected. If an edge joins two non-consecutive

vertices within some cycle, then it is called a chord.

A graph that has no undirected cycles is called a tree. A chordal graph is an undirected

graph which has no chord less cycle.

Given a directed edge (Xi, Xj) ∈ E, Xi is called the parent of Xj and Xj is called the

child of Xi. The set of parents of a node Xi is denoted by either pa(Xi) or pai while the

set of children are denoted by chi(Xi) or chii. A node Xj is called the descendant of Xi iff

there is a directed path starting at Xi and ending at Xj . Similarly, a node Xj is called the

ancestor of Xi iff Xi is the descendant of Xj . A leaf node has no descendants while a root

node has no ancestors. A directed graph is acyclic (also called as Directed Acyclic Graph

or a DAG) if it has no directed cycles.

DEFINITION 2 (Moral Graph). A moral graph of a directed graph is an undirected graph

obtained by connecting all parents of a node to each other and removing direction.

DEFINITION 3 (Induced Width). An ordered graph is a pair (G, o) where G is an undi-

rected graph, and o = (X1, ..., Xn) is an ordering of the nodes. The width of a node X is

the number of the neighbors of X that precede it in the ordering. The width of an ordering

o, is the maximum width over all nodes. The induced width of an ordered graph, wG(o),

11

A

B D

C E(a)

A

B D

C E(b)

C

B

D

E

A(c)

E

D

C

B

A(d)

Figure 1.1: Figure showing (a) Directed Acyclic Graph (DAG), (b) Moral graph of DAG

in (a), (c) Induced graph along the ordering (A,E,D,B,C), (d) Induced graph along the

ordering (A,B,C,D,E)

is the width of the induced ordered graph obtained as follows: nodes are processed from

last to first; when node Xi is processed, all its preceding neighbors are connected. The

induced width of a graph, denoted by t∗ (or w∗), is the minimum induced width over all its

orderings.

EXAMPLE 1. Figure 1.1 (a) shows a directed acyclic graph with 5 vertices. Figure 1.1 (b)

shows the moral graph of the directed acyclic graph of Figure 1.1 (a). Figure 1.1 (c) shows

the induced graph along the ordering (A,E,D,B,C), whose induced width is 3, while

Figure 1.1 (d) shows the induced graph along the ordering (A,B,C,D,E) whose induced

width is 2.

DEFINITION 4 (Tree Decomposition and Treewidth). A tree decomposition of an undi-

rected graph G(X,E) is a pair (T,Ψ) where T = (V,F) is a tree and Ψ = ψV |V ∈ V is

a family of subsets of X such that the following conditions are satisfied:

1. ∪ψV ∈ΨψV = X.

2. For each edge (Xi, Xj) ∈ E, there exists a ψV ∈ Ψ such that bothXi andXj belong

to ψV .

12

A,B,D,E

A,B,C

(a)

B,D,E

A,B,D A,B,C

(b)

Figure 1.2: Figure showing two valid tree decompositions for the moral graph in Figure

1.1(b)

3. For all Xj ∈ X, there is a set of nodes V ∈ V|Xj ∈ ψV that forms the connected

sub-tree of T (running intersection property).

The treewidth of a tree-decomposition is given by: maxV ∈V(|ψV | − 1). The treewidth of a

graph G denoted by t∗ equals the minimum treewidth over all possible tree decompositions

of G.

Tree decomposition and induced graphs are related as follows. If we connect the maximal

cliques of the induced graph in a tree structure that obeys the running intersection property,

we get a tree decomposition. In this case, the induced width of the induced graph is equal

to the treewidth of the tree decomposition [29]. The task of finding the treewidth of a graph

is NP-complete [2]. In the past two decades, substantial research has focused on designing

exact and approximate algorithms for finding the treewidth [74, 76, 79, 52, 42].

EXAMPLE 2. Figure 1.2 (a) and (b) show two valid tree decompositions for the moral

undirected graph in Figure 1.1 (b). The treewidth of the two tree decompositions is 3 and

2 respectively. The tree decompositions in Figure 1.2 (a) and (b) are obtained from the

induced graphs given in Figures 1.1(c) and (d) respectively by connecting the maximum

cliques in a tree structure such that they satisfy the running intersection property.

13

1.2.3 Bayesian Networks

DEFINITION 5 (Graphical Models). A discrete graphical model G is a 3-tuple 〈X,D,F〉

where X = X1, . . . , Xn is a finite set of variables, D = D1, . . . ,Dn is a finite set of

domains where Di is the domain of variable Xi and F = F1, . . . , Fn is a finite set of

discrete-valued functions. Each function Fi is defined over a subset of variables Si called

its scope and is denoted by scope(Fi). The graphical model represents a product of all of

its functions.

Each graphical model is associated with a primal graph which captures the dependencies

present in the model.

DEFINITION 6 (Primal Graph). The primal graph of a graphical model G = 〈X,D,F〉 is

an undirected graph G(X,E) which has variables of G as its vertices and an edge between

two variables that appear in the scope of a function.

DEFINITION 7 (Bayesian or Belief Networks). A Bayesian network is a graphical model

B = 〈X,D,G,P〉 where G = (X,E) is a directed acyclic graph over the set of variables X.

The functions P = P1, . . . , Pn are conditional probability tables Pi = P (Xi|pai), where

pai = scope(Pi) \ Xi is the set of parents of Xi in G. The primal graph of a Bayesian

network is same as the moral graph of G. When the entries of the CPTs are 0 and 1 only,

they are called deterministic or functional CPTs. An evidence E = e is an instantiated

subset of variables.

A Bayesian network represents the joint probability distribution given by

PB(X) =∏n

i=1 P (Xi|pai) and therefore can be used to answer any query defined over the

joint distribution. In this thesis, we consider two queries: (a) computing the probability of

evidence P (E = e) and (b) computing the posterior marginal distribution P (Xi|E = e) for

each variable Xi ∈ X.

14

A

B C

E D

GF

(a)

D B C P(D|B,C)

0 0 0 0.2

1 0 0 0.8

0 0 1 0.7

1 0 1 0.3

0 1 0 0.1

1 1 0 0.9

0 1 1 0.6

1 1 1 0.4

(b)

A

B C

E D

GF

(c)

Figure 1.3: (a) An example Bayesian network, (b) An example CPT P (D|B,C) and (C)

Moral graph of the Bayesian network shown in (a)

EXAMPLE 3. Figure 1.3 (a) shows an example Bayesian network over seven variables

A,B,C,D,E, F,G. The network depicts structural relationships between different vari-ables. The conditional probability tables (CPTs) associated with variables A, B, C, D, E,

F and G are P (A), P (B|A), P (C|A), P (D|B,C), P (E|A,B), P (F |E) and P (G|D) re-

spectively. An example CPT for P (D|B,C) is given in Figure 1.3(b). Figure 1.3 (c) shows

the network’s moral graph which coincides with the primal graph. The Bayesian network

represents the joint distribution:

P (A, B, C, D, E, F, G) = P (A)P (B|A)P (C|A)P (D|B, C)P (E|A, B)P (F |E)P (G|D)

1.2.4 Markov Networks

DEFINITION 8 (Markov Networks). A Markov network is a graphical model T = 〈

X,D,H〉 where H= H1, . . . , Hm is a set of potential functions where each potential

Hi is a non-negative real-valued function defined over variables Si. The Markov network

15

A

D

B

E

C

F

G H I

H1(A,B) H2(B,C)

H3(A

,D)

H4(B

,E)

H5(C

,F)

H8(D

,G)

H9(E

,H)

H10(F

,I)

H6(D,E) H7(E,F)

H11(G,H) H12(H,I)

(a)

D E H6(D,E)

0 0 20.2

0 1 12

1 0 23.4

1 1 11.7

(b)

Figure 1.4: (a) An example 3 × 3 square Grid Markov network (ising model) and (b) An

example potential H6(D,E)

represents a joint distribution over the variables X given by:

P (x) =1

Z

m∏

i=1

Hi(x) where Z =∑

x∈X

m∏

i=1

Hi(x)

where the normalizing constant Z is often referred to as the partition function.

The primary queries over Markov networks are computing the posterior marginal distribu-

tion over all variables Xi ∈ X and finding the partition function.

EXAMPLE 4. Figure 1.4 shows a 3 × 3 square grid Markov network with 9 variables

A,B,C,D,E, F,G,H, I. The twelve potentials are: H1(A,B), H2(B,C), H3(A,D),

H4(B,E), H5(C,F ), H6(C,D), H7(D,E), H8(D,G), H9(E,H), H10(F, I), H11(G,H)

and

H12(H, I). The Markov network represents the probability distribution formed by taking a

product of these twelve functions and then normalizing. Namely,

P (A,B, . . . , I) =1

Z

12∏

i=1

Hi

where Z is the partition function.

16

A

CB

≠ ≠

≠

Red, Green, Blue

Red, Green, Blue Red, Green, Blue

(a)

A B C

Red Blue Green

Red Green Blue

Blue Green Red

Blue Red Green

Green Red Blue

Green Blue Red

(b)

Figure 1.5: (a) An example constraint network showing a three coloring problem (b) The

set of solutions of the constraint network

1.2.5 Constraint Networks

DEFINITION 9 (Constraint Networks). A constraint network is a graphical model R =

〈X,D,C〉 where C = C1, . . . , Cm is a set of constraints. Each constraint Ci is a 0/1

function defined over a subset of variables Si, called the scopes of the constraints. Given

an assignment Si = si, a constraint is satisfied if Ci(si) = 1. A constraint can also

be expressed by a pair 〈Ri,Si〉 where Ri is a relation defined over the variables Si and

contains all tuples Si = si for which Ci(si) = 1. The primal graph of a constraint network

is called the constraint graph.

The primary query over a constraint network is to decide whether it has a solution i.e. to find

an assignment X = x to all variables such that all constraints are satisfied or to prove that

no such assignment exists. The constraint network represents its set of solutions. Another

important query is that of counting the number of solutions of the constraint network.

EXAMPLE 5. Figure 1.5(a) shows a graph coloring problem that can be modeled by a con-

straint network. The task is to color each vertex in such a way that no two vertices that

share an edge have the same color. To encode the problem as a constraint network, we de-

17

fine three variables A,B,C. The domain of each variable is Red,Green,Blue. The

constraints require variables sharing an edge to have different colors and can be modeled

as binary “not-equal constraints”. Figure 1.5(b) shows the set of solutions of the constraint

network.

1.2.6 Propositional Satisfiability

A special case of a constraint satisfaction problem is a propositional formula in conjunctive

normal form (CNF). A formula F in conjunctive normal form (CNF) is a conjunction of

clauses Cl1, . . . , Clt (denoted as a set Cl1, . . . , Clt) where a clause is a disjunction of

literals (propositions or their negations). For example, Cl = (P ∨ ¬Q ∨ ¬R) is a clause,

where P , Q and R are propositions, and P , ¬Q and ¬R are literals. The set of models

or solutions of a formula F is the set of all truth assignments to all its symbols that do

not violate any clause. Common queries in SAT are satisfiability i.e. finding a model or

proving that none exists; also called the SAT problem, and model counting i.e. counting

the number of models or counts.

Propositional satisfiability can be defined as a constraint network, where propositions cor-

respond to variables with bi-valued domains 0, 1, and constraints are represented by

clauses (for instance, clause (¬A∨B) allows all tuples over (A,B) except (A = 1, B = 0)).

1.2.7 Mixed Networks

Throughout the thesis, we will use the framework of mixed networks defined in [36, 92].

Mixed networks represent all the deterministic information explicitly in the form of con-

straints facilitating the use of constraint processing techniques developed over the past three

decades for efficient probabilistic inference. This framework includes Bayesian, Markov

18

and constraint networks as a special case. Therefore, many inference tasks become equiv-

alent when we consider a mixed network view allowing a unifying treatment of all these

problems within a single framework. For example, problems such as computing the prob-

ability of evidence in a Bayesian network, the partition function in a Markov network and

counting solutions of a constraint network can be expressed as weighted counting over

mixed networks.

DEFINITION 10 (Mixed Network). [36, 92] A mixed network is a four-tupleM = 〈 X, D,

F, C〉 where X = X1, . . . , Xn is a set of random variables, D = D1, . . . ,Dn is a set

of domains where Di is the domain of Xi, F = F1, , . . . , Fm is a set of non-negative real

valued functions where each Fi is defined over a subset of variables Si ⊂ X (its scope) and

C = C1, . . . , Cp is a set of constraints (or 0/1 functions).

A mixed network represents a joint distribution over X given by:

PM(x) =

1Z

∏mi=1 Fi(x) if x ∈ sol(C)

0 otherwise

where sol(C) is the set of solutions of C and Z =∑

x∈sol(C)

∏mi=1 Fi(x) is the normalizing

constant.

The primal graph of a mixed network has variables as its vertices and an edge between any

two variables than appear in the scope of a function F ∈ F or a constraint C ∈ C.

We can define several queries over the mixed network. In this thesis, however we will focus

on the following two queries:

DEFINITION 11 (The Weighted Counting Task). Given a mixed networkM= 〈 X, D, F,

19

C〉, the weighted counting task is to compute the normalization constant given by:

Z =∑

x∈Sol(C)

m∏

i=1

Fi(x) (1.1)

where sol(C) is the set of solutions of the constraint portion C. Equivalently, if we represent

the constraints in C as 0/1 functions, we can rewrite Z as:

Z =∑

x∈X

m∏

i=1

Fi(x)

p∏

j=1

Cj(x) (1.2)

We will refer to Z as weighted counts.

DEFINITION 12 (Marginal task). Given a mixed networkM = 〈X,D,F,C〉, the marginal

task is to compute the marginal distribution at each variable. Namely, for each variableXi

and xi ∈ Di, compute:

P (xi) =∑

x∈X

δxi(x)PM(x), where δxi

(x) =

1 if Xi is assigned the value xi in x

0 otherwise

To be able to use the constraint portion of the mixed network more effectively, for the

remainder of the thesis, we require that all zero probabilities in the mixed network are also

represented as constraints. It is easy to define such a network as we show below. The new

constraints are redundant though.

DEFINITION 13 (Modified Mixed network). Given a mixed networkM = 〈X,D,F,C〉,

a modified mixed network is a four-tupleM′ = 〈X,D,F,C′〉 where C′ = C ∪ FCimi=1

where

FCi(Si = si) =

0 if Fi(si) = 0

1 Otherwise

(1.3)

FCi can be expressed as a relation. It is sometimes called the flat constraints of the prob-

20

ability function.

Clearly, the modified mixed network M ′ and the original mixed network M are equivalent

in that PM′ = PM.

It is easy to see that the weighted counts over a mixed network specialize to (a) the prob-

ability of evidence in a Bayesian network, (b) the partition function in a Markov network

and (c) the number of solutions of a constraint network. The marginal problem can express

the posterior marginals in a Bayesian or Markov network.

ENCODING 1 (Bayesian network with evidence as a mixed network). A Bayesian net-

work B = 〈X,D,G,P〉 and evidence (E1 = e1, . . . , Ek = ek) is a mixed networkM =

〈X,D,F,C〉 where F = P1, . . . , Pn are the conditional probability tables (CPTs) and

C = Ci where

Ci(e′i) =

1 e′i = ei

0 otherwise

It is trivial to prove that:

PROPOSITION 1. Given a Bayesian network B = 〈X,D,G,P〉, evidence (E1 = e1 , . . . ,

Ek = ek) and a mixed network M = 〈X,D,F,C〉 encoded according to Encoding 1,

weighted counts computed overM is equal to computing the probability of evidence in

B while the posterior marginals ofM equal the posterior marginals over each variable

Xi ∈ X of B given the evidence.

ENCODING 2 (Markov network as a mixed network). A Markov network T = 〈X, D,

G, H〉 is a mixed networkM = 〈X,D,F,C〉 where F = H1, . . . , Hm is a set of po-

tential functions and C contains a single global constraint GC in which all variable value

combinations are allowed. Namely, GC(x) = 1


21

PROPOSITION 2. Given a Markov network T = 〈X,D, G,H〉 and a mixed networkM =

〈X,D,F,C〉 encoded according to Encoding 2, the weighted counts computed overM are

equal to the partition function of T and the posterior marginals overM are equal to the

posterior marginals over T .

ENCODING 3 (Constraint network as a mixed network). A constraint network R =

〈X,D,C〉 is a mixed networkM = 〈X,D,F,C〉 where F contains a single unit function F1

whose scope is X, namely F1(x) = 1.


PROPOSITION 3. Given a constraint network R = 〈X,D,C〉 and a mixed networkM =

〈X,D,F,C〉 encoded according to Encoding 3, the weighted counts ofM are equal the

number of solutions ofR.

1.3 Exact Inference

A mixed network represents knowledge about the world as a joint distribution. The process

of answering queries posed on this knowledge base is called inference. Unfortunately, it

turns out that the inference in a graphical model is NP-hard.

THEOREM 1. Computing a solution to the weighted counting and marginal problems is

NP-hard.

Proof. See [19] for a proof.

In spite of this worst case hardness result, in some cases, we can take advantage of some

structural graph based properties like treewidth of the graphical model to perform exact

inference efficiently. In this section, we review some of these schemes.

22

1.3.1 Cluster Tree Elimination

DEFINITION 14 (Cluster-tree or Join-tree). LetM = 〈X, D, C, F〉 be a mixed network.

A cluster-tree or a join-tree ofM is a triple JT =< T, χ, ψ >, where T = (V, E) is a

tree, and χ and ψ are labeling functions which associate with each vertex V ∈ V two sets,

variable label χ(V ) ⊆ X and function label ψ(V ) ⊆ C ∪ F.

1. For each functionH ∈ C∪F , there is exactly one vertex V ∈ V such thatH ∈ ψ(V )

and scope(H) ⊆ χ(V ).

2. For each variableXi ∈ X, the set V ∈ V|Xi ∈ χ(V ) induces a connected sub-tree

of T . The connectedness requirement is also called the running intersection property.

Let (U, V ) ∈ E be an edge of T . The separator of U and V is defined as sep(U, V ) =

χ(U) ∩ χ(V ). The eliminator of U and V is defined as elim(U, V ) = χ(U)− sep(U, V ).

Cluster-tree elimination (CTE) [32, 72] is a message-passing algorithm on a cluster tree, the

nodes of which are called clusters, each associated with variable and function subsets (their

labels). CTE computes two messages for each edge (one in each direction), from each node

to its neighbors, in two passes, from the leaves to the root and from the root to the leaves.

The message that cluster U sends to cluster V is as follows. The cluster takes a product

of all its own functions (in its label), with all the messages received from its neighbors

excluding V and then marginalizes the resulting function relative to the eliminator between

U and V . This yields a message defined on the separator between U and V .

The complexity of CTE is time exponential in the maximum size of variable subsets and

space exponential in the maximum size of the separators. Join-tree and junction-tree al-

gorithms for constraint and Bayesian networks are instances of CTE. For more details see

[32, 72, 33].

23

A

B

C D E

F

G(a)

1 χ(1) = A,B,C, ψ(1) = P (A), P (B|A), P (C|A,B)

2 χ(2) = B,C,D, F, ψ(2) = P (D|B), P (F |C,D)

3 χ(3) = B,E, F, ψ(3) = P (E|B,F )

4 χ(4) = E,F,G, ψ(4) = P (G|E,F )

(b)

1

m1→2(b, c) =P

a P (a)P (b|a)P (c|a, b)

m2→1(b, c) =P

d,f P (d|b)P (f |c, d)m3→2(b, f)

2

m2→3(b, f) =P

c,d P (d|b)P (f |c, d)m1→2(b, c)

m3→2(b, f) =P

e P (e|b, f)m4→3(e, f)

3

m3→4(e, f) =P

b P (e|b, f)m2→3(b, f)

m4→3(e, f) = P (G = g|e, f)

4(c)

Figure 1.6: a) A Bayesian network; b) A join-tree decomposition; c)Execution of CTE.

Formally, given a cluster tree of the network, the message propagation over this tree can be

synchronized. We select any one cluster as the root of the tree and propagate messages up

and down the tree, where a message is defined as follows:

DEFINITION 15 (Message). Given a join tree JT =< G(V,E), χ, ψ >, a message from

a vertex A ∈ V to a vertex B ∈ V such that (A,B) ∈ E, denoted by mA→B is defined as

follows. Let Y = sep(A,B), Z = elim(A,B) and NG(A) be the set of vertices adjacent to

A in G.

mA→B(y) =∑

z∈Z

∏

F∈ψ(A)

F (y, z)∏

D∈NG(A)\B

mD→A(y, z)

EXAMPLE 6. Figure 1.6 (a) describes a Bayesian network and a join-tree decomposition

24

for it (b). Figure 1.6(c) shows the trace of running CTE. mU→V is a message that cluster

U sends to cluster V .

Assuming that the maximum number of variables in a cluster is w + 1 and the maximum

domain size is d, the time and space required to process one cluster isO(d(w+1)). Therefore,

the complexity of CTE is O(nd(w+1)).

We can compute the posterior marginals for any variable Xi ∈ χ(A) by taking a product of

all functions and messages in the cluster A and marginalizing out all other variables:

P (xi) =∑

z∈χ(A)\Xi

∏

F∈ψ(A)

F (z, xi)∏

D∈NG(A)

mD→A(z, xi)

(1.4)

Similarly, we can compute the weighted counts by simply selecting any cluster A and

marginalizing out all the variables:

Z =∑

z∈χ(A)

∏

F∈ψ(A)

F (z)∏

D∈NG(A)

mD→A(z)

(1.5)

Bucket elimination is a special case of cluster-tree elimination where messages are passed

from leaves to root along a bucket-tree [28]. Given a variable ordering, the algorithm

partitions functions into buckets, each associated with a single variable, corresponding to

clusters in a join-tree. A function is placed in the bucket of its latest argument in the

ordering. The algorithm processes each bucket, top-down, from the last variable to the first,

by a variable elimination procedure that computes a new function using combination and

marginalization operators. The new function is placed in the closest lower bucket whose

variable appears in the new function’s scope. In a generalized elimination scheme, known

as bucket-tree elimination, a second pass along the bucket tree can update every bucket in

the tree [28]. This scheme is similar to CTE.

25

1.3.2 Cutset Conditioning

When the treewidth of the mixed network is large, cluster tree elimination is infeasible

because of its excessive memory requirements. In general, we can usually wait for some

extra time for an algorithm to terminate but can do nothing when space limits are exceeded,

unless we buy more memory. Therefore, schemes which trade time with memory known

as conditioning were introduced in the graphical models literature. The main idea in this

scheme is to select a subset of variables K ⊆ X, called a cutset or the conditioning set, and

obtain posterior marginals for any node Xi ∈ X \K by:

P (xi) =∑

k∈K

P (xi|k)P (k) (1.6)

where P (xi|k) is the marginal probability of xi given K = k and P (k) is the probability

of the assignment K = k. The probabilities P (xi|k) and P (k) can be computed using the

cluster tree elimination algorithm described in the previous section. Equation 1.6 implies

that we can compute P (xi) by enumerating all instantiations over K then summing up over

the results.

The main idea in cutset conditioning schemes is to select K in such a way that the treewidth

of the remaining graph is bounded by a constant and can be formalized using the notion of

w-cutset defined below:

DEFINITION 16 (w-cutset). Given a graph G(X,E), a w-cutset is a subset of variables

K ⊆ X such that the treewidth of the graph G′ obtained by removing all the vertices in K

from G is bounded by w.

Given a w-cutset of size c, each instantiation of cluster tree elimination runs in O((n −

c) × exp(w)) time and space and therefore the overall time and space requirements of the

scheme are O(exp(|DK|)× (n− c)× exp(w)) and O((n− c)× exp(w)) respectively.

26

It is well-known that the minimum induced width t∗ of the network is always less than

or equal to the size of the smallest w-cutset . Namely, t∗ + 1 ≤ |C| for any C. Thus,

inference algorithms (e.g., cluster tree elimination) are never worse and are often better

than cutset conditioning time-wise. However, when t∗ is large we must resort to cutset

conditioning search, trading space for time. The optimal solution is a hybrid search and

inference approach that conditions on the smallest w-cutset K such that the induced width

w of the graph conditioned on K is small enough to permit exact inference.

1.4 Approximate Inference

1.4.1 Iterative Join Graph Propagation

Iterative Join-Graph Propagation (IJGP) [33] can be perceived as an iterative version of

cluster tree elimination. It applies the same message-passing to join-graphs rather than

join-trees, iteratively. A join graph is a decomposition of functions into a graph of clusters

(rather than a tree) that satisfies the running intersection property. It can thus be thought of

as a relaxation of the join tree (or a tree decomposition) in which the requirement that the

clusters should form a tree is relaxed. The IJGP class of algorithms generalizes loopy be-

lief propagation [106, 99]. These algorithms are not guaranteed to converge, nor have

bounded accuracies, however they have been demonstrated to be useful approximation

methods [33, 132]. While CTE is exact, and requires only 2 iterations, IJGP tends to

improve its performance with additional iterations. The size of clusters in a join-graph can

be far smaller than the treewidth and is only restricted by the function’s scopes. IJGP can

be parameterized by i which controls the cluster size in the join-graph, yielding a class of

algorithms denoted by IJGP(i) whose complexity is exponential in i, that allow a trade-off

between accuracy and complexity. As i increases, accuracy generally increases. When i is

27

big enough to allow a tree-structure, IJGP(i) coincides with CTE and becomes exact.

Next, we formally define join graphs along with an algorithmic description of IJGP. For

more details, see [33].

DEFINITION 17 ((minimal) Edge-labeled Join-Graph). [33] Given a mixed networkM

= 〈 X, D, F, C〉, an edge-labeled join-graph is a four-tuple JG =< G,χ, ψ, θ >, where

G = (V,E) is a graph, and χ and ψ are labeling functions which associate with each

vertex U ∈ V two sets, variable label χ(U) ⊆ X and function label ψ(U) ⊆ F ∪ C and θ

associates with each edge (A,B) ∈ E, the set θ(A,B) ⊆ X such that:

1. For each functionH ∈ F∪C, there is exactly one vertex U ∈ V such thatH ∈ ψ(U),

and the scope(H) ⊆ χ(U) and

2. (edge-connectedness) For each edge (A,B), θ(A,B) ⊆ χ(A) ∩ χ(B), such that

∀Xi ∈ X, any two clusters containing Xi can be connected by a path whose every

edge label includes Xi.

Finally, an edge-labeled join-graph is minimal if no variable can be deleted from any label

while still satisfying the edge-connectedness property.

DEFINITION 18 (separator and eliminator of edge-labeled join-graphs). Given two ad-

jacent verticesA andB of JG, the separator ofA andB is defined as sep(A,B) = θ(A,B),

and the eliminator of A with respect to B is elim(A,B) = χ(A)− θ(A,B). The separator

width is max(A,B) |sep(A,B)|.

Hence forth, for brevity, we will refer to edge-labeled join graphs simply as join-graphs.

DEFINITION 19 (Message). Given a join graph JG =< G(V,E), χ, ψ, θ >, a message

mA→B from a vertex A ∈ V to a vertex B ∈ V such that (A,B) ∈ E is defined over

28

Algorithm 1: Iterative Join Graph Propagation

Input: A mixed networkM = (X,D,F,C) and a partial assignment S = s to a

subset of variables, an integer i, a tolerated error ǫ and integer max iterOutput: A Join Graph with clusters containing original functions and constraints

fromM along with messages received from neighbors.

Construct a join graph JG = (G(V,E), χ, ψ, θ) from theM such that each cluster1

of JG has at most i variables;

Select an Activation schedule which contains all (two-way) edges2

d = (A1, B1), . . . , (A2∗|E|, B2∗|E|).;

// Let m(j)A→B be the message during the j-th iteration of

IJGP

j=0, KLD=0 ;3

Initialize all messages m0A→B to the uniform distribution.;4

repeat5

j=j+1, KLD=0 ;6

for each (A,B) in d do7

// Let Y = θ(A,B)

Compute the message m(j)A→B .8

m(j)A→B(y) = α

∑

z∈elim(A,B)

∏

F∈ψ(A)

F (y, z)∏

D∈NG(A)\B

m(j−1)D→A(y, z)

KLD=KLD+KL-Distance(m(j)A→B,m

(j−1)A→B);9

until KLD < ǫ OR j ≥ max iter ;10

variables Y = θ(A,B) as follows.

mA→B(y) = α∑

z∈elim(A,B)

∏

F∈ψ(A)

F (y, z)∏

D∈NG(A)\B

mD→A(y, z)

where NG(A) is set of vertices adjacent to A in G and α is the normalization constant

which ensures that each message is a proper probability distribution.

The pseudo-code of IJGP is given in Algorithm 1. It takes as input a mixed network,

an i-bound i, a tolerated message error ǫ and an integer max iter. The algorithm first

constructs a join graph having at most i variables in each cluster in step 1 using Algorithm

IJGP-structuring and then iteratively passes messages between adjacent clusters in steps

29

2-10. The iterative process is stopped either when the sum of the KL distance between the

messages in two consecutive iterations is less than some threshold ǫ (indicating that the

process has converged to a fixed-point [132]) or when the maximum number of iterations

have been reached. The output of IJGP(i) is a collection of functions and messages, each

bounded exponentially by i.

Constructing Join Graphs bounded by i

Given a join-graph JG = 〈G(V,E, χ, ψ, θ〉, the accuracy and complexity of the (iterative)

join-graph propagation algorithm depends on two different widths: joinwidth of JG (de-

fined as maxA∈V|χ(A)|) (this determines the complexity of processing one cluster) and

treewidth of G (which may affect the speed of convergence of iterative join-graph propa-

gation) as defined next.

Algorithm 2: Join-Graph Structuring

Input: A mixed networkM = 〈X,D,F,C〉, an Integer iOutput: A Join graph in which each cluster has at most i variables.

Apply procedure schematic mini-bucket(i);1

Associate each resulting mini-bucket with a node in the join-graph, the variables of2

the nodes are those appearing in the mini-bucket, the original functions are those in

the mini-bucket ;

Keep the edges created by the procedure (called out-edges) and label them by the3

regular separator;

Connect the mini-bucket clusters belonging to the same bucket in a chain by4

in-edges labeled by the single variable of the bucket;

DEFINITION 20 (external and internal widths). Given an edge-labeled join-graph JG =

〈G(V,E), χ, ψ, θ〉 of a mixed network M = 〈X,D,F,C〉, the internal width of JG is

maxV ∈V|χ(V )|, while the external width of JG is the treewidth of G.

Intuitively, the external width measures how close the join-graph is to a join-tree and there-

fore minimizing it would improve accuracy and speed of convergence. Algorithm Join-

30

Procedure 3: Schematic Mini-Bucket(i)Order the variables from X1 to Xn minimizing (heuristically) induced-width, and

associate a bucket for each variable ;

Place each CPT in the bucket of the highest index variable in its scope;

for j = n to 1 doPartition the functions in bucket(Xj) into mini-buckets having at most ivariables;

For each mini-bucket mb create a new scope-function (message) f where

scope(f) = X|X ∈ mb − Xi and place scope(f) in the bucket of its highest

variable;

Maintain an edge between mb and the mini-bucket (created later) of f ;

Graph Structuring describes a heuristic scheme for constructing join-graphs having mini-

mal external width. Given a mixed networkM = 〈X,D,F,C〉, and a bounding parameter i

the algorithm returns a join-graph JG ofM whose internal width is bounded by i. The al-

gorithm applies the procedure Schematic Mini-Bucket(i) (see Procedure 3). The procedure

only traces the scopes of the functions that would be generated by the full mini-bucket pro-

cedure, avoiding actual computation. The procedure ends with a collection of mini-bucket

trees, each rooted in the mini-bucket of the first variable. Each of these trees is minimally

edge-labeled. Then, in-edges labeled with only one variable are introduced, and they are

added only to obtain the running intersection property between branches of these trees.

PROPOSITION 4. [33] Algorithm Join-Graph Structuring(i) generates a minimal edge-

labeled join-graph having bound i.

EXAMPLE 7. Figure 1.7(a) shows the trace of procedure schematic mini-buckets with i-

bound of 3 and Figure 1.7(b) shows a join graph created from this trace. The only cluster

partitioned is that of F into two scopes (FCD) and (BF), connected by an in-edge labeled

with F.

EXAMPLE 8. Figure 1.8 shows a range of edge-labeled join-graphs. On the left extreme we

have a graph with smaller clusters, but more cycles. This is the type of graph Iterative Belief

Propagation (IBP) [99, 106] works on. On the right extreme we have a tree decomposition,

31

(b)(a)

CDB

CAB

BA

A

CB

P(D|B)

P(C|A,B)

P(A)

BA

P(B|A)

FCD

P(F|C,D)

GFE

EBF

BF

EF

P(E|B,F)

P(G|F,E)

B

CD

BF

A

F

G: (GFE)

E: (EBF) (EF)

F: (FCD) (BF)

D: (DB) (CD)

C: (CAB) (CB)

B: (BA) (AB) (B)

A: (A) (A)

Figure 1.7: Constructing a Join-graph decomposition using schematic mini-buckets

which has no cycles but has bigger clusters. In between, there could be a number of join-

graphs where maximum cluster size can be traded for number of cycles. Intuitively, the

graphs on the left present less complexity for join-graph algorithms because the cluster

size is small, but they are also likely to be less accurate. The graphs on the right side are

computationally more complex, because of the larger cluster size, but are likely to be more

accurate.

We can use the output of IJGP to estimate the posterior marginals Pi(xi) (the estimate is

denoted by Qi(xi) ) as follows. Let A be the cluster in the join graph such that Xi ∈ χ(A).

Qi(xi) = α∑

z∈χ(A)\Xi

∏

F∈ψ(A)

F (z, xi)∏

B∈NG(A)

mB→A(z, xi)

(1.7)

where α is the normalization constant which ensures that Qi(Xi) is a proper probability

distribution.

32

A

ABDE

FGI

ABC

BCE

GHIJ

CDEF

FGH

C

H

A C

A AB BC

BE

C

CDE CE

FH

FFG GH H

GI

A

ABDE

FGI

ABC

BCE

GHIJ

CDEF

FGH

C

H

A

AB BC

CDE CE

H

FF GH

GI

ABCDE

FGI

BCE

GHIJ

CDEF

FGH

BC

DE CE

FF GH

GI

ABCDE

FGHI GHIJ

CDEF

CDE

F

GHI

more accuracy

less complexity

Figure 1.8: Join-graphs

1.4.2 Importance Sampling

Importance sampling [90, 50] is a general Monte Carlo simulation technique which can be

used for estimating various statistics of a given target distribution. Since it is often hard to

sample from the target distribution, the main idea is to generate samples from another easy-

to-simulate distribution Q called the proposal (or trial or importance) distribution and then

(as was shown) estimate various statistics over the target distribution by a weighted sum

over the samples. The weight of a sample is the ratio between the probability of generating

the sample from the target distribution and its probability based on the proposal distribution.

In this subsection, we describe how the weighted counts and posterior marginals can be

approximated via importance sampling. We first describe how to generate samples from Q.

If Q is given in a product form 1: Q(X) =∏n

i=1Qi(Xi|X1, . . . , Xi−1), we can generate a

full sample from Q easily as follows. For i = 1 to n, sample Xi = xi from the conditional

distribution Q(Xi|X1 = x1, . . . , Xi−1 = xi−1) and set Xi = xi.This is often referred to as

1Through out the thesis, we assume that the proposal distribution is specified in a product form.

33

an ordered Monte Carlo sampler.

Thus, when we say that Q is easy to sample from, we assume that Q can be expressed in a

product form and can be specified in polynomial space, namely,

Q(X) =n∏

i=1

Qi(Xi|X1, . . . , Xi−1) =n∏

i=1

Qi(Xi|Yi) (1.8)

where Yi ⊆ X1, . . . , Xi−1. The size of the set Yi is assumed to be bounded by a

constant.

Estimating the weighted counts by importance sampling

We start with some required definitions:

DEFINITION 21 (An estimator). An estimator is a function of data (or samples) that pro-

duces an estimate for an unknown parameter or statistics of the distribution that produced

the data (or samples).

DEFINITION 22 (Unbiased Estimator). Given a probability distributionQ and a statistics

θ ofQ, an estimator θN which is based onN random samples drawn fromQ, is an unbiased

estimator of θ if EQ[θN ] = θ. Similarly, an estimator θN which is based on N random

samples drawn from Q, is an asymptotically unbiased estimator of θ if limN→∞ EQ[θN ] =

θ. Clearly, all unbiased estimators are asymptotically unbiased.

Note that, we will denote an unbiased estimator of a statistics θ by θ, an asymptotically

unbiased estimator by θ and an arbitrary estimator by θ.

The notion of unbiasedness and asymptotic unbiasedness is important because it helps to

characterize the performance of an estimator which we explain briefly below ( for more

34

details see [118]). The mean-squared error of an estimator θ is given by:

MSE(θ) = EQ[(θ − θ)2] (1.9)

= EQ[θ2]− 2EQ[θ]θ + θ2 (1.10)

=[EQ[θ

2]− EQ[θ]2

]+[EQ[θ]2 − 2EQ[θ]θ + θ2

](1.11)

The bias of θ is given by:

BQ[θ] = EQ[θ]− θ

The variance of θ is given by:

VQ[θ] = EQ[θ2]− EQ[θ]2

From the definitions of bias, variance and mean-squared error, we have:

MSE(θ) =[EQ[θ

2]− E[θ]2

]+[E[θ]2 − 2EQ[θ]θ + θ2

]= VQ[θ] +

[BQ[θ]

]2(1.12)

In other words, the mean squared error of an estimator is equal to bias squared plus variance

[118]. In case of an unbiased estimator, the bias equals zero and therefore one can reduce

the mean squared error of an unbiased estimator by just reducing its variance. In case of

an asymptotically unbiased estimator, the bias of the estimator goes to zero as the number

of samples tend to infinity. However, for a finite sample size it may have a non-zero bias.

Although, in principle an unbiased estimator seems to be better than an asymptotically

unbiased estimator, the latter may have lower MSE than the former because it may have

lower variance. In general, one would prefer an unbiased estimator over an asymptotically

unbiased estimator because its MSE can be controlled by just controlling the variance.

35

Next, we show how the weighted counts can be estimated using importance sampling.

Consider the expression for weighted counts (see Definition 1.13).

Z =∑

x∈X

m∏

i=1

Fi(x)

p∏

j=1

Cj(x) (1.13)

If we have a proposal distribution Q(X) such that∏m

i=1 Fi(x)∏p

j=1Cj(x) > 0→ Q(x) >

0, we can rewrite Equation 1.13 as follows:

Z =∑

x∈X

∏mi=1 Fi(x)

∏pj=1Cj(x)

Q(x)Q(x) = EQ

[∏mi=1 Fi(x)

∏pj=1Cj(x)

Q(x)

](1.14)

Given independent and identically distributed (i.i.d.) samples (x1, . . . , xN) generated from

Q, we can estimate Z by:

ZN =1

N

N∑

k=1

∏mi=1 Fi(x

k)∏p

j=1Cj(xk)

Q(xk)=

1

N

N∑

k=1

w(xk) (1.15)

where

w(xi) =

∏mi=1 Fi(x

i)∏p

j=1Cj(xi)

Q(xi)

is the weight of sample xi. By definition, the variance of the weights is:

VQ[w(x)] =∑

x∈X

(w(x)− Z)2Q(x) (1.16)

We can estimate the variance of ZN by (see for example [118]):

VQ[ZN ] =1

N(N − 1)

N∑

k=1

(w(xk)− ZN

)2

(1.17)

36

and it can be shown that VQ[ZN ] is an unbiased estimator of VQ[ZN ], namely,

EQ[VQ[ZN ]] = VQ[ZN ]

1. EQ[ZN ] = Z i.e. ZN is unbiased.

2. limN→∞ ZN = Z , with probability 1 (follows from the central limit theorem).

3. EQ

[VQ[ZN ]

]= VQ[ZN ] = VQ[w(x)]/N

Therefore, VQ[ZN ] can be reduced by either increasing the number of samples N or by

reducing the variance of the weights. From [118] it follows that that if Q ∝ ∏mi=1 Fi(x)

∏pj=1 Cj(x), then ∀ N ZN = Z, thereby qualitatively identifying an optimal variance

estimator. However, making Q ∝ ∏mi=1 Fi(x)

∏pj=1Cj(x) is NP-hard and therefore in

order to have a small MSE in practice, it is recommended [88] that Q must be as “close” as

possible to the function it tries to approximate which in our case is∏m

i=1 Fi(x)∏p

j=1Cj(x).

Estimating the marginals by importance sampling

The posterior marginals are defined as:

P (xi) =∑

x∈X

δxi(x)PM(x) (1.18)

where PM is defined by:

PM(x) =1

Z

m∏

i=1

Fi(x)

p∏

j=1

Cj(x) (1.19)

Given a proposal distribution Q(x) satisfying PM(x) > 0 → Q(x) > 0, we can rewrite

37

Equation 1.18 as follows:

P (xi) =∑

x∈X

δxi(x)PM(x)

Q(x)Q(x) = EQ

[δxi

(x)PM(x)

Q(x)

](1.20)

Given independent and identically distributed (i.i.d.) samples (x1, . . . , xN) generated from

Q, we can estimate P (xi) by:

PN (xi) =1

N

N∑

k=1

δxi(xk)PM(xk)

Q(xk)=

1

N

N∑

k=1

δxi(xk)

∏mi=1 Fi(x

k)∏pj=1 Cj(x

k)

ZQ(xk)(1.21)

Unfortunately, Equation 1.21, while an unbiased estimator of P (xi) cannot be evaluated

because Z is not known. We can however sacrifice unbiasedness and estimate P (xi) by

using properly weighted samples [88].

DEFINITION 23 ( Properly weighted samples). A set of weighted samples xk, w(xk)Nk=1

drawn from a distribution G are said to be properly weighted with respect to a distribution

π if for any discrete function H ,

EG[H(xk)w(xk)] = cEπ[H(x)]

where c is a normalization constant common to all samples.

Given the weighted set of samples, we can estimate Eπ[H(x)] as:

Eπ[H(x)] =

∑Nk=1H(xk)w(xk)∑N

k=1w(xk)

Substituting Equation 1.19 in Equation 1.20, we have:

P (xi) =1

ZEQ

[δxi

(x)∏m

i=1 Fi(x)∏p

j=1Cj(x)

Q(x)

](1.22)

38

It is easy to prove that [88]:

PROPOSITION 5. Given w(x) =δxi

(x)Qm

i=1 Fi(x)Qp

j=1 Cj(x)

Q(x), the set of samples xk, w(xk)Nk=1

are properly weighted with respect to PM.

Therefore, we can estimate P (xi) as follows:

PN(xi) =

∑Nk=1w(xk)δxi

(xk)∑N

k=1w(xk)(1.23)

It is easy to prove that limN→∞ E[PN(xi)] = P (xi) i.e. it is asymptotically unbiased.

Therefore, by weak law of large numbers the sample average PN(xi) converges almost

surely to P (xi) as N →∞. Namely,

limN→∞

PN(xi) = P (xi) , with probability 1

Also it can be shown that in order to have small estimation error, the proposal distribution

Q should be as close as possible to the target distribution PM [88].

The Algorithm

Algorithm 4 presents an importance sampling scheme for estimating the weighted counts

and marginals. The algorithm maintains two mutable variables Z and P [j][k] which store

the estimate of weighted counts and posterior marginals respectively. Note that j indexes

the variables j = 1, . . . , n and k indexes the values in the domains of variable Xj , k =

1, . . . , |Dj|. In steps 1-4, we initialize Z and all P [j][k] to zero. Steps 5-9 contain the

main loop of the algorithm. In this loop, we first generate a sample (xi1, . . . , xin) from

the proposal distribution Q (step 6) and then compute its weight wi in step 7. Then, we

update Z and the appropriate marginal probability P [j] by adding the weight wi of the

current sample to their current value. Finally, after the required N samples are generated

39

Algorithm 4: Importance Sampling

Input: A mixed networkM = 〈X,D,C,F〉, a proposal distribution Q, and number of samples NOutput: Estimate of posterior marginals and weighted counts.

Z = 0 ;1

for j = 1 to n do2

for k = 1 to Di do3

P [j][k] = 0.0 ;4

for i = 1 to N do5

Generate a sample xi = (X1 = xi1, . . . ,Xn = xi

n) from Q using the ordered Monte Carlo6

sampler.;

Calculate the importance weight7

wi =

∏m

j=1 Fj(xi)∏p

k=1 Ck(xi)

Q(xi)

Z = Z + wi ;

for j = 1 to n do8

P [j][xij ] = P [j][xi

j ] + wi ;9

Z = Z / N ;10

for j = 1 to n do11

Normalize P[j] ;12

Return Z as a estimate of the counting problem and P as a estimate of the marginal problem;13

the algorithm returns Z/N as an estimate of the weighted counts and normalized value of

P [j] for j = 1, . . . , n as an estimate of the marginals.

Importance sampling schemes in Literature

The obvious question one has to answer in order to use importance sampling is how to

choose the proposal distribution Q.

Likelihood weighting (LW) is the most simplest form of importance sampling in Bayesian

networks which uses a slight modification of the prior distribution (distribution without ev-

idence) as the proposal distribution. Given a Bayesian network B = 〈X,D,P〉with evidence

E = e, the proposal distribution of LW defined over the non-evidence variables Y = X \ E

is given by:

Q(y) =∏

Xi∈Y

P (Xi = yXi|ypa(Xi)

, epa(Xi)) (1.24)

40

where, as mentioned earlier, the notation yXidenotes the restriction of the assignment

Y = y to the set Xi.

We can generate samples from Q(Y) by processing the nodes in topological order. If the

node is not an evidence node, we sample a value for it based on its CPT and the value

already sampled for its parents. If the node is observed we set it to its observed value and

continue. The process just described generates one sample. Repeating it N times generates

the required N samples.

Recall that the efficiency of importance sampling depends on how close the proposal dis-

tribution is to the posterior distribution. When all the evidence is in the root, Equation 1.24

represents exactly the posterior distribution and LW yields an optimal sampling scheme.

On the other hand, if all the evidence is in the leaf nodes, LW samples from the prior distri-

bution. If the evidence is very unlikely, the prior distribution is a very bad approximation

of the posterior distribution. In this case, LW may yield very bad estimates and show poor

convergence.

To deal with the case of unlikely evidence in the leaf nodes, backward sampling [48] at-

tempts to change the sampling order so that evidence variables are sampled earlier. In

[97], the variables are sampled in reverse elimination order; the sampling distribution of

a variable is obtained by computing a product of all the functions in the variables bucket

and summing out all other variables. However, bucket elimination is not possible for larger

functions and therefore [97] suggest to approximate large functions by a probability tree. In

[133], the output of loopy belief propagation is used to construct the proposal distribution

in topological order.

Another area of research, which is orthogonal to selecting the best proposal distribution Q

is adaptive importance sampling which tries to dynamically update Q based on the gener-

ated samples [16, 101]. The updating step is performed every l samples where l is usually

41

selected heuristically or empirically. Since initial proposal distribution, no matter how well

chosen, is often very different from the posterior distribution, dynamic updating can sub-

stantially improve the convergence of importance sampling. Examples of adaptive impor-

tance sampling include methods such as self-importance sampling and heuristic importance

sampling [121], and, more recently, AIS-BN [16], and EPIS [133] and dynamic importance

sampling [97]. The main challenge here is in selecting the update step, so that as more and

more samples are drawn, the updated proposal distribution gets closer and closer to the

posterior distribution. The procedures for updating the sampling probabilities vary. In the

following, we describe the AIS-BN scheme in more detail.

AIS-BN algorithm is based on the observation that if we could sample each node in topo-

logical order from distribution P (Xi|pa(Xi), e), then the resulting sample would be drawn

from the posterior distribution P (X|e). Since this distribution is unknown for any vari-

able that has observed descendants, AIS-BN initializes the proposal distribution Q to some

Q0(X) defined by a collection of sampling distributions Q0i (Xi|pa(Xi), e).

The objective of AIS-BN is to update each distribution Qki (Xi|pa(Xi), e) so that the next

sampling distribution Qk+1i (Xi|pa(Xi), e) will be closer to the required posterior distribu-

tion P(Xi|pa(Xi), e) than Qki (Xi|pa(Xi), e).

The updating formula, applied after generating every l samples, is as follows:

Qk+1i (xi|pa(Xi), e) = Qk

i (xi|pa(Xi), e) + α(k)(Pr′(xi|pa(Xi), e)−Qki (xi|pa(Xi), e)) (1.25)

where α(k) is a positive function that determines the learning rate and Pr′(xi|pa(Xi), e) is

an estimate of P(xi|pa(Xi), e) based on the last l samples. When α(k) = 0 (lower bound),

the importance function is not updated. When α(k) = 1 (upper bound), the old function

is discarded so that Qk+1i (xi|pa(Xi), e) = Pr′(xi|pa(Xi), e). The convergence speed is

directly related to α(k) and there are various trade-offs in choosing an appropriate α(k).

If α(k) is small, the convergence will be slow due to the large number of updating steps

42

needed to reach a local minimum. On the other hand, if it is large, convergence rate will be

initially very fast, but the algorithm will eventually start to oscillate and thus may not reach

a minimum.

1.4.3 Markov Chain Monte Carlo schemes

Markov Chains

A Markov chain is a stochastic process which evolves over time. We will denote by

x(0), x(1), . . ., the states of the system at discrete time steps starting at 0 and extending to

infinity. The distribution from which x(0) is drawn is called the initial distribution denoted

by Q(x(0)). The infinite sequence x(0), x(1), . . . is called a Markov Chain if it satisfies the

following Markov property:

A(x(t+1) = y|x(t) = x, . . . , x(0) = z) = A(x(t+1) = y|x(t) = x) (1.26)

In other words, the distribution of x(t+1) given all states up to t only depends on x(t). If

the transition probability A(x(t+1) = y|x(t) = x) does not change over time, then it is often

expressed as the transition function T (y|x). Because T (y|x) is a conditional distribution, it

has to satisfy the following property:

∑

y

T (y|x) = 1 (1.27)

The most important property of a Markov chain that we rely upon is its forgetfulness prop-

erty. Namely, after a Markov chain has evolved for a period of time, the current state is

nearly independent of the starting state. This “fixed point” of a Markov chain is a distribu-

tion called the stationary or invariant distribution.

43

DEFINITION 24 (Stationary distribution). An invariant or stationary distribution of a

Markov Chain is a distribution that once reached remains forever. Namely, P(x) is an

invariant distribution iff for every state x, the following condition is satisfied:

P(y) =∑

x

P(x)A(y|x)

In this thesis, we are interested in Markov chains that converge eventually to the stationary

distribution P regardless of the initial distribution A. The property of ergodicity forms the

backbone of such chains, which we define below:

DEFINITION 25 (Ergodicity). A Markov chain is said to be aperiodic if the maximum

common divider of the number of steps it takes for the chain to come back to the start state

is equal to one. A Markov chain is irreducible if the chain has nonzero probability (density)

to move from one position in the state space to any other position in a finite number of

steps. An aperiodic, irreducible Markov chain is called ergodic.

Aperiodicity means that the chain does not have regular loops where after every k steps we

return to the same state. Irreducibility guarantees that we can get to a state y from another

state x with non-zero probability and thus, will be able to visit (in the limit of infinite

samples) all statistically important regions of the state space. The conditions are almost

always satisfied as long as all probabilities are positive. As already mentioned, ergodicity

is important because it ensures convergence to the stationary distribution. Formally,

THEOREM 2 (Convergence). An ergodic Markov Chain converges to its unique stationary

distribution regardless of the initial distribution.


44

Constructing MCMC techniques

The main idea in MCMC type of simulation algorithms in the context of graphical models

is as follows. We start from some initial full assignment (or state according to Markov

chain terminology) x(0) to all (non-evidence) variables, and then use the Markov transi-

tion function to change this current state. The Markov transition function is selected in

such a way that it satisfies the following properties to ensure convergence to the posterior

distribution P of the graphical model:

1. the stationary distribution of the Markov Chain is the required posterior distribution

P of the graphical model

2. the resulting Markov chain is ergodic.

The question is how to ensure that the two aforementioned properties are satisfied. The

task may sound daunting but in the context of graphical models is quite easy to achieve. In

the next two sub subsections, we overview two alternative constructions.

Metropolis Hastings Method

The main idea in the Metropolis Hastings method is to generate a large number of samples

from a proposal function T (y|x) and then employ some acceptance/rejection criteria to

“thin down” the samples, namely, only accept a subset of “good” samples which guarantee

convergence. The only restriction imposed on T (y|x) is that T (y|x) > 0 implies that

T (x|y) > 0 (which is similar to the importance sampling criteria described Section 1.4.2).

The Metropolis algorithm then implements the following iteration given a current state x(t):

• Draw y from the proposal distribution T (y|x(t)).

45

• Draw a real number p from the uniform distribution [0, 1] and update:

x(t+1) =

y if p ≤ r(y|x(t))

x(t) otherwise

(1.28)

where r(y|x) is given by:

r(y|x) = min

1,P(y)T (x|y)

P(x)T (y|x)

(1.29)

Next, we show that the Metropolis Hastings method prescribes a Markov transition rule

whose stationary distribution equals the posterior distribution P . Let A(y|x) be the transi-

tion function of the algorithm. To prove that the distribution is stationary (see Definition

24), we have to prove that:

P(y) =∑

x

P(x)A(y|x) (1.30)

Fortunately, we can check a rather easier condition called detailed balance to prove that P

is the stationary distribution, which is given by:

P(x)A(y|x) = P(y)A(x|y) (1.31)

It should be clear that the detailed balance condition ensures invariance or convergence to

the stationary distribution as shown below:

∑

x

P(x)A(y|x) =∑

x

P(y)A(y|x) = P(y)∑

x

A(y|x) = P(y) (1.32)

The Markov transition probability of the Metropolis Hastings method is given by:

A(y|x) = T (y|x)r(y|x) = T (y|x)min

1,P(y)T (y|x)

P(x)T (y|x)

(1.33)

46

THEOREM 3. The Metropolis Hastings method satisfies the detailed balance condition:

P(x)A(y|x) = P(y)A(x|y)

Proof.

P(x)A(y|x) = P(x)

[T (y|x)min

1,P(y)T (y|x)

P(x)T (y|x)

](1.34)

= minP(x)T (y|x),P(y)T (x|y) (1.35)

= P(y)

[T (x|y)min

1,P(x)T (x|y)

P(y)T (x|y)

](1.36)

= P(y)A(x|y) (1.37)

The Metropolis Hastings method is so general that one can use any positive function as

T (y|x). However, as in importance sampling, the rate of convergence to the posterior

distribution will depend upon how close T (y|x) is to the posterior distribution.

When the graphical model has zero probabilities or in general for a mixed networkM =

〈X,D,F,C〉, however, there is a small caveat in generating the initial sample x(0). As

before, one can use any positive function as T (y|x) but we have to ensure that P(x(0)) > 0

to ensure that transitions are made only in the space of solutions of C.

Gibbs Sampling

In this subsection, we present the Gibbs sampling algorithm specialized for Bayesian net-

works. Gibbs sampling starts with generating some random assignment to all variables

(the assignment is typically generated using importance sampling). Then at each step j, we

randomly choose Xi and sample a value for it, given the current assignment to the other

47

variables. Formally, the sample is generated from:

P(Xi|x(j)1 , . . . , x

(j)i−1, x

(j)i+1, . . . , x

(j)n )

For simplicity, we introduce some new notation. Let x(j)−i = (x

(j)1 , . . . , x

(j)i−1, x

(j)i+1, . . . , x

(j)n ).

In Bayesian networks, the conditional distribution P(Xi|x(j)−i ) is dependent only on the

assignment to the Markov blanket mai of variable Xi, where Markov blanket is defined as

the set containing Xi, its parents, its children, and parents of children. Let x−i,maidenote

the restriction of the assignment x−i to mai. Thus,

P(Xi|x(j)−i ) = P(Xi|x(j)

−i,mai)

Given an assignment to the Markov blanket of Xi, we can obtain the sampling probability

distribution P(Xi|x(j)−i,mai

) using the following equation [106]:

P(Xi|x(j)−i,mai

) ∝ P (Xi|x(j)−i,pa(Xi)

)∏

Xk∈chi(Xi)

P (x(j)−i,Xk

|x(j)−i,pa(Xk)) (1.38)

Algorithm 5 describes the Gibbs sampling scheme. We can see that generating a full new

sample requires O(n × r) multiplication steps where r is the maximum family size and n

is the number of variables.

Assuming it takes approximately K samples for a Markov chain to converge to its station-

ary distribution, the first K samples may be discarded to ensure that the collected samples

properly represent distribution P(X|e). The time spent computing K discarded samples is

referred to as burn-in time. However, determining K is hard [88]. In general, the “burn in

time ” is optional in the sense that the convergence of the estimates to the correct posterior

marginals does not depend on it.

48

Algorithm 5: Gibbs sampling

Input: A belief network B = 〈X,D,P〉, and evidence E = e.

Output: A set of samples x(t)Tt=1

// Initialize:

Assign random value Xi = x(1)i to each variable Xi. Assign evidence variables their1

observed values.;

// Generate samples

for t = 1 to T do2

// Generate a new sample x(t).

for each Xi ∈ X \ E do3

// compute a new value x(t)i

Compute distribution P(Xi|x(t)−i,mai

) using Equation 1.38;4

Sample and set Xi = x(t)i from P(Xi|x(t)

−i,mai);5

A sufficient condition to ensure that Gibbs sampling has an ergodic Markov chain is that

whenever we sample Xi there is a non-zero probability for it to be assigned to any one of

its possible values. This ensures that we have a positive probability to get from any state

to any state in some number of moves. In particular, if there are no zero entries in the

CPTs the chain is ergodic. Furthermore, if the chain is ergodic it is trivial to verify that the

stationary distribution of the chain is the posterior distribution P .

The crucial factor in analyzing Markov Chains is the mixing rate , i.e., the rate with which a

chain converges to the stationary distribution. Recall that we start the chain with the burn in

phase. If the mixing rate is very low the burn in phase must be very long to avoid generating

samples from the wrong distribution. Thus, we might need to generate a very large number

of samples without being able to use them. Furthermore, when the mixing rate is low, the

samples we generate after the burn in phase are very correlated and we might need a very

large number of them in order to get a reliable estimate.

Unfortunately, the task of determining the mixing rate of a Gibbs sampler is quite difficult.

Thus, the questions of when we can start using the samples and how many samples are

needed for a reliable estimate are hard and make the use of Gibbs sampling a non-trivial

49

task. Also, another problem with Gibbs sampling is that it cannot be used to generate

samples from mixed networks. This is because the sampling space of a mixed network is

not ergodic.

1.4.4 Rao-Blackwellised sampling

Recall that the mean squared error and therefore the quality of an unbiased estimator is

directly proportional to the variance of the weights. Therefore, a vast amount of literature

has been devoted to reducing variance (see for example [118, 88] and the references there

in). In this subsection, we will review the Rao-Blackwellised importance sampling scheme

which based on the Rao-Blackwell theorem reduces the variance by combining sampling

with exact inference.

THEOREM 4 (Rao-Blackwell Theorem). Let F (Y,Z) be a function andQ(Y,Z) be a prob-

ability distributions which satisfies the importance sampling property that for any assign-

ment (Y = y,Z = z), F (y, z) > 0⇒ Q(y, z) > 0. Then,

VQ

[F (y, z)

Q(y, z)

]≥ VQ

[F (y)

Q(y)

]

where Q(y) =∑

zQ(y, z) and F (y) =∑

z F (y, z).


Rao-Blackwellisation [88, 118] can be used to estimate the weighted counts Z as follows.

Assume that the set of variables X is partitioned into two sets K and R and let Q(K) be a

50

proposal distribution defined over K. Then,

Z =∑

x∈X

m∏

i=1

Fi(x)

p∏

j=1

Cj(x) (1.39)

=∑

k∈K

∑

r∈R

m∏

i=1

Fi(k, r)

p∏

j=1

Cj(k, r) (1.40)

=∑

k∈K

∑

r∈R

∏mi=1 Fi(k, r)

∏pj=1Cj(k, r)

Q(k)Q(k) (1.41)

= E

[∑

r∈R

∏mi=1 Fi(k, r)

∏pj=1Cj(k, r)

Q(k)

](1.42)

Given N independent samples (k1, . . . ,kN) generated from Q(K), we can estimate Z in

an unbiased fashion by replacing the expectation in Equation 1.42 by the sample average

as given below:

Zrao−blackwell =1

N

N∑

i=1

∑r∈R

∏mj=1 Fj(r,K = ki)

∏pa=1Ca(r,K = ki)

Q(ki)(1.43)

The assumption here is that∑

r∈R


∏pa=1Ca(r,K = ki) can be com-

puted efficiently (using exact inference). It follows from the Rao-Blackwell theorem that

the variance of Zrao−blackwell is less than the variance of conventional importance sampling

estimator Z given in Equation 1.15 when the two are based on the same set of samples over

the sub-set K of variables.

w-cutset sampling [7] is an elegant implementation of the Rao-Blackwellisation idea. In

this scheme, given a mixed networkM = 〈X,D,F,C〉 and aw-cutset K (see Definition 16)

ofM, we sample the variables in K and compute using Bucket Elimination [28] the exact

value of∑

r∈R


∏pa=1Ca(r,K = ki) for each sample K = ki. Because

the time and space complexity of Bucket Elimination is exponential in the treewidth of the

graph, it is obvious that given a w-cutset, Bucket Elimination can be carried out efficiently

51

in polynomial time (exponential in the constant w).

PROPOSITION 6. Given a mixed networkM = 〈X,D,F,C〉 having n variables and a w-

cutset ofM of size c, the time and space complexity of w-cutset importance sampling is

O((n− c)×N × exp(w) + c×N) and O((n− c)× exp(w)) respectively where N is the

number of samples.


It is clear from Proposition 6 that each w-cutset sample is heavy in the sense that we require

more time (and space) to compute its weight. Therefore, we expect that the conventional

estimator Z given in Equation 1.15 will be based on a larger sample size than Zrao−blackwell

and consequently w-cutset (Rao-Blackwellised) sampling may not always have lower vari-

ance. Thus, w-cutset sampling presents interesting time versus variance tradeoff. In prac-

tice, it seems that w-cutset sampling is well worth the extra computational complexity in

that it usually yields more accurate estimates than conventional importance sampling [7].

52

Chapter 2

Hybrid Dynamic Mixed Networks for

modeling Transportation routines

2.1 Introduction

Modeling sequential real-life domains requires the ability to represent both probabilistic

and deterministic information. Hybrid Dynamic Bayesian Network (HDBN) were recently

proposed for modeling such domains [83]. In essence, these are factored representation of

Markov processes that allow discrete and continuous variables. Such probabilistic frame-

works which focus on handling uncertain information, represent constraints as probabilis-

tic entities and thus do not advertise their presence. This may have negative computational

consequences. In this chapter, we address this issue by extending the mixed networks

framework to dynamic environments through the model of dynamic Bayesian networks

(DBN) accommodating both discrete and continuous variables. The resulting framework is

called Hybrid Dynamic mixed networks (HDMN). We address the algorithmic issues that

emerge and demonstrate the potential of our approach on a complex dynamic domain of a

53

person’s transportation routines.

Our motivation for developing HDMNs as a modeling framework is a range of problems in

the transportation literature that depend upon reliable estimates of the prevailing demand

for travel over various time scales. At one end, there is a pressing need for accurate and

complete estimation of the global origins and destinations (O-D) matrix at any given time

for an entire urban area. Such estimates are used in both urban planning applications [122]

and integrated traffic control systems based upon dynamic traffic assignment techniques

[107]. Even the most advanced techniques, however, are hamstrung by their reliance upon

out-dated, pencil-and-paper travel surveys and sparsely distributed detectors in the trans-

portation system. We view the increasing proliferation of powerful mobile computing de-

vices as an opportunity to remedy this situation. If even a small sample of the traveling

public agreed to collect their travel data and make that data publicly available, transporta-

tion management systems could significantly improve their operational efficiency. At the

other end of the spectrum, personal traffic assistants running on the mobile devices could

help travelers re-plan their travel when the routes they typically use are impacted by failures

in the system arising from accidents or natural disasters. A common starting point for these

problems is to develop an efficient formulation for learning and inferring individual traveler

routines like traveler’s destination and his route to destination from raw data points. This

specific application was also considered in Liao et al. [86] and we use their probabilistic

model with minor modifications for our purposes. The main focus of this chapter is on the

algorithmic aspects.

In this chapter, we extend several algorithmic schemes that were shown to be effective for

dynamic Bayesian networks such as generalized belief propagation [132, 33, 9, 67] and

Rao-Blackwellised Particle Filtering (RBPF) [40, 47] to accommodate and exploit discrete

constraints in the presence of continuous probabilistic functions. The chapter describes the

first steps that we took and will serve to provide a base-line starting point for the algorithms

54

that we will develop in the rest of the chapters. As we will show extending generalized be-

lief propagation to handle constraints is easy, extension to continuous variables is somewhat

more intricate but still straightforward. However, the primary challenge is to have effective

importance sampling based techniques like particle filtering in the presence of constraints;

because of the rejection problem.

The chapter starts by describing the framework of HDMN (section 2.2) and the transporta-

tion domain modeling within this framework (section 2.3). The algorithms for processing

hybrid models are described in three stages. Section 2.4 shows how importance sampling

can use a proposal distribution obtained from the output of Iterative Join Graph Propa-

gation (IJGP) [33] within any graphical model. Section 2.5 addresses the rejection prob-

lem associated with the proposal distribution of importance sampling and proposes the use

of constraint based pre-processing techniques to mitigate rejection. Sections 2.6 and 2.7

extend IJGP and IJGP-based importance sampling respectively to the full framework of

HDMNs. Finally, we present experimental results in section 2.8 on our application domain

and conclude in section 2.9.

The research presented in this chapter is based in part on [53, 60].

2.2 Hybrid Dynamic Mixed networks

We use the term hybrid for graphical models that have both discrete and continuous vari-

ables and mixed for graphical models that have mixed constraint and probabilistic depen-

dencies.

DEFINITION 26 (Hybrid Bayesian networks (HBN)). [81] Hybrid Bayesian networks

(HBN) are graphical models defined by a tuple B = 〈X,D,G,P〉, where X is the set of

variables partitioned into discrete and continuous ones X = Γ⋃

∆, where ∆ are discrete

55

variables andΓ are continuous variables. G is a directed acyclic graph (DAG) over X. P =

P1, ..., Pn is a set of conditional probability distributions (CPD). The graph structure

G is restricted in that continuous variables cannot have discrete variables as their child

nodes. The conditional distributions of discrete variables are expressed in a tabular form

as usual, whereas the conditional distribution of continuous variables are given by a linear

Gaussian model: P (Xi|I = i,Z = z) = N(α(i) + β(i)T × z, γ(i)), Xi ∈ Γ where Z and

I are the set of continuous and discrete parents of Xi respectively, N(µ, σ2) is a normal

distribution with mean µ and variance σ2, γ(i) and α(i) are real numbers and β(i) is a

vector of the same dimension as |Z|. The network represents a joint distribution over all its

variables given by a product of all its CPDs.

The joint distribution represented by a Hybrid Bayesian network is a mixture of multi-

variate Gaussians where every mixture component corresponds to an assignment x∆ to all

the discrete variables.

The mixed networks framework [92, 36] for augmenting Bayesian networks with con-

straints can immediately be applied to HBNs yielding the hybrid mixed networks (HMNs).

Formally,

DEFINITION 27 (Hybrid Mixed networks (HMN)). Given a HBN B = (X,D, G,P)

that expresses the joint probability PB and given a constraint network R = (∆,D,C)

that expresses a set of solutions sol(C), an HMN is a pair M = (B,R). The dis-

crete variables and their domains are shared by B and R. We assume that R is con-

sistent. The hybrid mixed network M = (B,R) represents the conditional probability

PM(x) = PB(x) if x∆ ∈ sol(C) and 0 otherwise.

Dynamic probabilistic networks [98] are discrete time Markov models whose state-space

and transition functions are expressed in a factored form using two-slice probabilistic net-

works. They are defined by a prior P (X0) and a state transition function by the two-slice

56

dynamic network expressing P (Xt+1|Xt). Hybrid dynamic probabilistic networks also in-

clude continuous variables, while Hybrid dynamic mixed networks accommodate both con-

tinuous variables and explicit constraints. Formally,

DEFINITION 28 (Hybrid Dynamic mixed networks (HDMN)). A Hybrid Dynamic Mixed

Network (HDMN) is a pair (M0,M→), defined over a set of variables X = X1, ..., Xn,

whereM0 is an HMN defined over X representing P (X0).M→ is a 2-slice network defin-

ing the stochastic process P (Xt+1|Xt). The 2-time-slice hybrid mixed network (2-THMN)

is an HMNM = (B,R) defined over X′ ∪ X′′

such that X′

and X′′

are identical to X.

The acyclic graph of the probabilistic portion is restricted so that nodes in X′

are root

nodes. The constraints are defined the usual way. The 2-THMN represents the conditional

distribution P (X′′|X′

).

The semantics of any dynamic network can be understood by unrolling the network to

T time-slices. Namely, P (X0:t) = P (X0) ∗∏T

t=1 P (Xt|Xt−1) where each probabilistic

component can be factored in the usual way, yielding a regular HMN over T copies of the

state variables.

The most common task over dynamic probabilistic models is filtering and prediction. Fil-

tering is the task of determining the belief state P (Xt|e0:t) where Xt is the set of variables

at time t and e0:t are the observations accumulated from time-slices 0 to t. Performing

filtering in dynamic networks is often called online inference or sequential inference.

Filtering can be accomplished in principle by unrolling the dynamic model and using any

state-of-the art exact or approximate reasoning algorithm. For example, one could use the

join-tree clustering algorithm. When all variables in a dynamic network are discrete, the

treewidth of the dynamic model is at most O(n) where n is the number of variables in a

time-slice (see [98] for details). However, when continuous Gaussian variables are mixed

with discrete ones, the treewidth of HDMN is at least O(T ) when T is the number of

57

dt wt dt+1 wt+1

gt gt+1

ft ft+1

rt rt+1

vt lt vt+1 lt+1

yt yt+1

Figure 2.1: Car travel activity model

.

time slices. This is because to maintain correctness of join-tree propagation, we have to

eliminate all continuous variables before the discrete ones which creates a clique between

discrete variables from time slices 1, . . . , T . Thus exact inference is clearly infeasible in dy-

namic graphical models having both discrete and continuous variables [81, 83]. Therefore

the applicable approximate inference algorithms for hybrid dynamic networks are either

sampling-based such as Particle Filtering or propagation-based such as generalized belief

propagation.

2.3 The transportation model

The primary motivating domain we use is the transportation model. In this section, we

describe the application of HDMNs to the problem of inferring car travel activity of indi-

viduals. The main query of interest is predicting where a traveler is likely to go and what

his/her route to the destination is likely to be, given the current location of the traveler’s car.

58

This application was described in [86] and our goal is extend their work and study algo-

rithms that are relevant to such domain using the general modeling framework we defined

and which can be useful for a variety of applications.

Figure 2.1 shows a HDMN model for modeling the car travel activity of individuals. Note

that the directed links express the probabilistic relationships while the undirected (bold)

edges express the constraints.

Similar to [86], we consider the roads as a Graph G(V,E) where the vertices V corre-

spond to intersections while the edges E correspond to segments of roads between in-

tersections. The variables in the model are all indexed by time index t as follows. The

variables dt and wt represent the information about time-of-day and day-of-week respec-

tively at time t. The variable dt, the day at time t, is a discrete variable and has four values

(morning, afternoon, evening, night) while the variable wt, the week at time t, has two

values (weekend, weekday). Variable gt represents the persons next goal at time t (e.g.

his office, home etc). We consider a location where the person spends significant amount

of time as a proxy for a goal as in [86]. These locations are determined through a prepro-

cessing step by noting the locations in which the dwell-time is greater than a threshold (15

minutes). Once such locations are determined, we cluster those that are in close proximity

to simplify the goal set. A goal can be thought of as a set of edges E1 ⊂ E in our graph

representation. The route level rt represents the route taken by the person moving from one

goal to the next. We arbitrarily set the number of values it can take to |gt|2 i.e., the model

can identify |gt|2 distinct routes that the individual uses to navigate. The person’s location

lt and velocity vt at time t are estimated from the GPS reading yt at time t. ft is a counter

(essentially goal duration) that governs goal switching. The location lt is represented in the

form of a two-tuple (A,N(µ, σ2)) where A = (V1, V2), A ∈ E and V1, V2 ∈ V is an edge of

the map G(V,E) and N(µ, σ2) is a Gaussian whose mean is equal to the distance between

the person’s current position on A and one of the intersections, say V1.

59

The probabilistic dependencies in the model are same as those used in [86] except for

two new variables, time-of-day and day-of-week, which determine the person’s goal and

the GPS reading. The constraints in the model are as follows. We assume that a person

switches his goal from one time slice to another when he is near a goal or moving away

from a goal but not when he is on a goal location. In this latter case, goal switching is forced

when a specified maximum time at that goal, or duration D, is reached. These assumptions

of switching between goals is modeled using the following constraints between the current

location, the current goal, the next goal and the switching counters:

1. If lt−1 = gt−1 and Ft−1 = 0 Then Ft = D

2. If lt−1 = gt−1 and Ft−1 > 0 Then Ft = Ft−1 − 1,

3. If lt−1 6= gt−1 and Ft−1 = 0 Then Ft = 0,

4. If lt−1 6= gt−1 and Ft−1 > 0 Then Ft = 0,

5. If Ft−1 > 0 and Ft = 0 Then gt is given by P (gt|gt−1),

6. If Ft−1 = 0and Ft = 0 Then gt is same as gt−1,

7. If Ft−1 > 0 and Ft > 0 gt is same as gt−1 and

8. If Ft−1 = 0 and Ft > 0 gt is given by P (gt|gt−1).

2.4 Constructing a proposal distribution using Iterative

Join Graph Propagation

To perform inference in HDMNs, we will use an importance sampling algorithm for dy-

namic domains called particle filtering. As noted earlier, one of the main building block

60

of importance sampling is its proposal distribution. In this section we will show how to

construct a proposal distribution using Iterative Join Graph Propagation (IJGP ) (described

in subsection 1.4.1 of Chapter 1).

The proposal distribution is defined in a product form: Q(X) =∏n

i=1 Qi(Xi |X1, . . . , Xn)

along an ordering o = (X1, . . . , Xn) of variables. Recall that (see for example [118, 88]),

the optimal proposal distribution is one in which each component Qi(Xi|X1, . . . , Xi−1)

is equal to the target distribution Pi(Xi|X1, . . . , Xi−1) (namely the posterior distribution

in our case). Since Pi(Xi|X1, . . . , Xi−1) is NP-hard to construct, in practice, we should

use a good approximation of Pi(Xi|X1, . . . , Xi−1) as a substitute. It was shown that IJGP

[33] yields a very good approximation of Pi(Xi|X1, . . . , Xi−1) and therefore it is an ideal

candidate for constructing Q.

We now describe a straightforward way of constructing the components Qi(Xi| X1, . . . ,

Xi−1) from the output of Iterative Join Graph propagation (IJGP). Given an ordering o =

(X1, . . . , Xn) of the variables to be sampled, we can use IJGP dynamically during the

sampling process to create the conditional distribution Qi(Xi|x1, . . . , xi−1) or we can use

IJGP once in a pre-processing manner. Clearly, the latter would be more efficient but less

accurate. There is also a spectrum of choices in between trading time for accuracy (we will

present a scheme that exploits this spectrum in Chapter 4). In order to minimize overhead,

we investigate here the choice with the least overhead, namely, constructing the proposal

distribution obtained by running IJGP once prior to sampling.

Algorithm IJGP-sampling is given in Algorithm 6. As noted, the proposal distribution is

pre-computed before sampling by running IJGP just once. Given a mixed networkM =

〈X,D,F,C〉, the output of IJGP (see Algorithm 1 in Chapter 1) is a join graph JG =

(G(V,E), χ, ψ, θ) in which each cluster A ∈ V contains the original functions ψ(A) ⊆F ∪ C and messages mB→A received from all its neighbors B ∈ NG(A). Each component

Qi(Xi|X1, . . . , Xi−1) can be constructed from JG as follows. Let A be the cluster of JG

61

containing Xi, namely Xi ∈ χ(A). Let Y = χ(A) ∩ X1, . . . , Xi−1. Then,

Qi(xi|y) = α∑

z∈χ(A)\X1,...,Xi

∏

F∈ψ(A)

F (z, xi, y)∏

B∈NG(A)

mB→A(z, xi, y)

(2.1)

where α is the normalization constant which ensures that Qi(Xi|y) is a proper probability

distribution. Because there could be many clusters in the join graph that mention Xi, we

choose heuristically the cluster which mentions Xi and has the largest intersection with

X1, . . . , Xi−1 (ties are broken randomly). We will refer to this heuristic as the max

heuristic 1. We can show that:

PROPOSITION 7. The time and space complexity of IJGP-sampling (Algorithm 6) is O(h×

exp(i) + n × N) and O(h × exp(i)) respectively where h is the number of clusters in

the join-graph of IJGP, i is the i-bound of IJGP, N is the number of samples and n is the

number of variables.

Proof. The time complexity of IJGP (step 1) is O(h× exp(i)) [33]. Because Qi(Xi|xi−1)

is computed in step 4 by consulting a cluster whose size is bounded exponentially by i, the

time and space complexity of computing Qi(Xi|xi−1) is O(exp(i)). Therefore, to generate

all components of Q, we require O(n × exp(i)) time. Given Q, the importance sampling

step has a time complexity of O(nN). Therefore, the overall time complexity is O(h ×

exp(i)+n× exp(i)+n×N). Since h ≥ n, the overall time complexity is given by O(h×

exp(i) + n×N). The space required by IJGP and Q is O(h× exp(i)) and O(n× exp(i))

respectively and therefore the overall space complexity is O(h× exp(i)).

1The max heuristic was chosen based on a preliminary empirical study in which we compared it with a

random heuristic (in which we randomly select a cluster that mentions Xi) and found that the max heuristic

was more accurate.

62

Algorithm 6: IJGP(i)-sampling

Input: A mixed networkM = 〈X,D,F,C〉 and an ordering of variables

o = (X1, . . . , Xn), Integers i and N .

Output: Estimates of posterior marginals or weighted counts

Run Algorithm IJGP(i) (given in Algorithm 1) on the mixed networkM;1

// The output of IJGP is a join graph JG = (G(V,E), χ, ψ, θ)in which each cluster A ∈ V contains the original

functions ψ(A) ⊆ F ∪ C and messages mB→A received from

all its neighbors B ∈ NG(A).// Proposal distribution construction

for i=1 to n do2

Select a cluster A in JG that mentions Xi and has the largest number of3

variables common with X1, . . . , Xi−1;Compute Qi(Xi|X1, . . . , Xi−1) from A as follows. Let4

Y = χ(A) ∩ X1, . . . , Xi−1

Qi(xi|y) = α∑

z∈χ(A)\X1,...,Xi

∏

F∈ψ(A)

F (z, xi, y)∏

B∈NG(A)

mB→A(z, xi, y)

// Sampling phase

Run Importance sampling with (M, Q,N) as input (see Algorithm 4).5

63

2.5 Eliminating and Reducing Rejection

Given a mixed networkM = 〈X,D,F,C〉, a proposal distributionQ defined over X suffers

from the rejection problem if the probability of generating a sample from Q that violates

constraints of PM expressed in C is relatively high. When a sample x violates some con-

straints in C, its weight w(x) is zero and it is effectively rejected from the sample average

estimate given by Equation 1.15. In an extreme case, if the probability of generating a

rejected sample is arbitrarily close to one, then even after generating a huge number of

samples, the estimate of weighted counts (given by Equation 1.15) would be zero (and the

estimate of marginals given by Equation 1.23 would be an ill-defined ratio of 0/0); making

importance sampling impractical.

Obviously, if Q encodes all the zeros inM, then we would have no rejection.

DEFINITION 29 (Zero Equivalence). A distribution P is zero equivalent with a distribution

P ′, iff their flat constraint networks are equivalent. Namely, they have the same set of

consistent solutions.

Clearly then, given a mixed networkM = 〈X,D,F,C〉 representing PM and a proposal

distribution Q = Q1, . . . , Qn which is zero equivalent to PM, every sample x generated

from Q satisfies PM(x) > 0 (i.e. no sample generated from Q is rejected).

Because a proposal distribution Q is expressed in a product form: Q(X) =∏n

i=1 Qi(Xi|

X1, . . . , Xi−1) along o = 〈X1, . . . , Xn〉, we can make Q zero equivalent to PM by modi-

fying its components Qi(Xi|X1, . . . , Xi−1) along o. To accomplish that, we have to make

the set Q1, . . . , Qn backtrack-free along o w.r.t. C. The following definitions formalize

this notion.

DEFINITION 30 (consistent and globally consistent partial sample). Given a set of con-

straints C defined over a set of variables X = X1, . . . , Xn, a partial sample (x1, . . . , xi)

64

is said to be consistent if it does not violate any relevant constraint in C. A partial sample

(x1, . . . , xi) is said to be globally consistent if it can be extended to a solution of C (i.e. it

can be extended to a full assignment to all n variables that satisfies all constraints in C).

Note that a consistent partial sample may not be globally consistent.

DEFINITION 31 (Backtrack-free distribution of Q w.r.t. C). Given a mixed network

M = 〈X,D,F,C〉 and a proposal distribution Q = Q1, . . . , Qn representing Q(X) =∏n

i=1Qi(Xi|X1, . . . , Xi−1) along an ordering o, the backtrack-free distribution QF =

QF1 , . . . , Q

Fn ofQ along ow.r.t. CwhereQF (X) =

∏ni=1Q

Fi (Xi|X1, . . . , Xi−1) is defined

by:

QFi (xi|x1, . . . , xi−1)

= αQi(xi|x1, . . . , xi−1) if (x1, . . . , xi) is globally consistent w.r.t C

= 0 otherwise.

where α is a normalization constant.

Let xi−1 = (x1, . . . , xi−1) and let Bxi−1

i = x′i ∈ Di|(x1, . . . , xi−1, x′i) is not globally

consistent w.r.t. C . Then, α can be expressed by:

α =1

1−∑x′i∈B

xi−1i

Qi(x′i|x1, . . . , xi−1)

By definition, a proposal distribution Q1, . . . , Qn is backtrack-free w.r.t. itself along o.

The modified proposal distribution defined above (Definition 31) takes a proposal distri-

bution that is backtrack-free relative to itself (as a flat constraint network) and modifies its

components to yield a distribution that is backtrack-free relative to PM (as a flat constraint

network).

Given a mixed networkM = 〈X,D,F,C〉 and a proposal distribution Q = Q1, . . . , Qn

along o, we now show how to generate samples from the backtrack-free distribution QF =

QF1 , . . . , Q

Fn of Q w.r.t. C. Algorithm 7 assumes that we have an oracle which takes

65

Algorithm 7: Sampling from the Backtrack-free distribution

Input: A mixed networkM = 〈X,D,F,C〉, a proposal distribution Q along an

ordering o and an oracle

Output: A full sample (x1, . . . , xn) from the backtrack free distribution QF of Qx = φ;1

for i=1 to n do2

QFi (Xi|x) = Qi(Xi|x);3

for each xi ∈ Di do4

y = x ∪ xi;5

if oracle says that y is not globally consistent w.r.t C then6

QFi (xi|x) = 0 ;7

Normalize QFi (Xi|x) and generate a sample Xi = xi from it;8

x = x ∪ xi;9

return x10

a partial assignment (x1, . . . , xi) and a constraint satisfaction problem 〈X,D,C〉 as input

and answers “yes” if the assignment is globally consistent and “no” otherwise. Given a

partial assignment (x1, . . . , xi−1), the algorithm constructs QFi (Xi|x1, . . . , xi−1) and sam-

ples a value for Xi as follows. QFi (Xi|x1, . . . , xi−1) is initialized to Qi(Xi|x1, . . . , xi−1).

Then, for each extension (x1, . . . , xi−1, xi) to Xi, it checks using the oracle, whether

(x1, . . . , xi−1, xi) is globally consistent relative to C. If not, it sets QFi (xi|x1, . . . , xi−1)

to zero, normalizes QFi (xi|x1, . . . , xi−1) and generates a sample from it. Repeating this

process along the order (X1, . . . , Xn) yields a single sample from QF . Note that for each

sample, the oracle should be invoked a maximum of O(n×d) times where n is the number

of variables and d is the maximum domain size.

Given samples (x1, . . . , xN) generated from QF , we can unbiasedly estimate Z (defined in

Equation 1.2) by replacing Q by QF in Equation 1.15. We get:

ZN =1

N

N∑

k=1

∏mi=1 Fi(x

k)∏p

j=1Cj(xk)

QF (xk)=

1

N

N∑

k=1

wF (xk) (2.2)

66

where

wF (x) =

∏mi=1 Fi(x)

∏pj=1Cj(x)

QF (x)(2.3)

is the backtrack-free weight of the sample.

Similarly, we can estimate the posterior marginals by replacing the weightw(x) in Equation

1.23 with the backtrack free weight wF (x).

PN(xi) =

∑Nk=1w

F (xk)δxi(xk)

∑Nk=1w

F (xk)(2.4)

Clearly, ZN defined in Equation 2.2 is an unbiased estimate of Z while PN(xi) defined in

Equation 2.4 is an asymptotically unbiased estimate of the posterior marginals P (xi).

2.5.1 Sampling from the backtrack-free distribution using adaptive

consistency

One of the methods that can be used to generate the backtrack-free distribution is adaptive

consistency [29].

DEFINITION 32 (Directional i-consistency). [29] A constraint network R = (X,D,C) is

said to be directional i-consistent along an ordering o = (X1, . . . , Xn) if every consistent

assignment to i−1 variables can be consistently extended to another variable that is higher

in the order. A network is strong directional i-consistent if it is directional j-consistent for

every j ≤ i.

It can be shown that:

PROPOSITION 8. [29] If the ordered constraint graph of a constraint network R has a

width (see Definition 3 in Chapter 1) of i − 1 along o and if R is also strong directional

67

i-consistent, thenR is backtrack-free along o.

Because the original constraint network may not satisfy the desired relationship between

width and local consistency, we can increase the level of strong directional consistency

until it matches the width of the network. Adaptive consistency [38] implements this idea.

Given an ordering o, adaptive consistency establishes directional i-consistency recursively,

changing levels for each variable to adapt to the changing width of the currently processed

variable. The method works because when a variable is processed, its final induced width

is determined and a matching level of consistency can be achieved. After running adaptive

consistency, we can assemble every solution in a backtrack-free manner along o in linear

time.

We can use adaptive consistency to sample from the backtrack-free distribution as described

in Algorithm ADC-sampling given as Algorithm 8. The algorithm first runs adaptive con-

sistency to make the constraint portion C strong directional i-consistent along o. Given

an ordering o, the algorithm places each constraint into the bucket of the variable that ap-

pears latest in its scope. Subsequently, buckets are processed in reverse order. A bucket is

processed by solving the subproblem and recording its solutions as a new constraint. The

newly generated constraint is placed in the bucket of its latest variable.

In the sampling step, given a partial assignment (x1, . . . , xi−1), the algorithm constructs

QFi (Xi|x1, . . . , xi−1) as follows. For each extension (x1, . . . , xi−1, xi) to Xi, it checks

whether (x1, . . . , xi−1, xi) is consistent w.r.t. to the constraints in the bucket of Xi. If it is

not consistent, it sets QFi (xi|x1, . . . , xi−1) to zero. Then it normalizes QF

i (xi|x1, . . . , xi−1)

and generates a sample from it. Repeating the process along the order (X1, . . . , Xn) yields

a single sample.

THEOREM 5. [29] Every sample generated by ADC-sampling is globally consistent and is

generated from the backtrack-free distribution of Q relative to C.

68

Algorithm 8: ADC-sampling

Input: A proposal distribution Q(X) =∏n

i=1Qi(Xi|X1, . . . , Xi−1), a mixed

networkM = (X,D,F,C), an integer NOutput: N globally consistent samples from Q// Adaptive Consistency step

Partition constraints in C into bucket1 . . . , bucketn as follows;1

for i=n down to 1 do2

Put in bucketi all unplaced constraints mentioning Xi3

for p down to 1 do4

for all constraints C1, . . . , Cj in bucketp do do5

A = ∪ji=1Ci − Xp;6

CA = πA(⊲⊳ji=1 Ci);7

Add CA to the bucket of the latest variable in scope A. Let Xi be the latest8

variable in A. Update: bucketi = bucketi ∪ CA;

// Sampling Step

for j=1 to N do9

x = φ;10

for i=1 to n do11

QFi (Xi) = Qi(Xi|x);12

for each value xi ∈ Di do13

If (x, xi) violates any constraints in bucketi QFi (xi) = 0;14

Normalize Qi and sample a value xi from QFi (Xi);15

x = x ∪ xi;16

Output x;17

69

THEOREM 6. The time and space complexity of ADC sampling isO(n×exp(w)+O(nN))

where n is the number of variables, w is the induced width along o and N is the number of

samples.

2.5.2 Reducing Rejection using IJGP-sampling

Clearly, ADC-sampling is infeasible when the induced width along o is large. In such cases,

we could use bounded directional consistency techniques like directional i-consistency,

where i is a constant. This will not eliminate rejection but would certainly reduce it.

In principle, we can use mini-bucket elimination [39] to achieve partial directional i-

consistency. Interestingly, if the join graph of IJGP is constructed using the schematic

mini-buckets scheme (see Procedure 3 in Chapter 1) [33], then IJGP achieves a stronger

level of partial directional i-consistency than mini-buckets (see [35]).

We describe a sampling scheme which uses this partial directional i-consistency power

of IJGP to reduce rejection in Algorithm 9. Because our primary concern is minimizing

rejection, we select an ordering o that minimizes the induced width w.r.t. the constraint

portion of the mixed network in Step 1. Then, we create a join graph JG along o using the

schematic mini-buckets scheme ( see Procedure 3 in Chapter 1), run IJGP on JG and cre-

ate a proposal distribution from its output using the max heuristic described in Algorithm

6. Finally, in steps 4-13, we generate the required samples. The only difference between

ADC-sampling and the current algorithm is that we update the proposal distribution us-

ing the zeros derived from the output of IJGP instead of adaptive consistency in steps 9

and 10 (consistency checking step). Formally, as proved in [35], given a partial sample

(x1, . . . , xi) if the marginal probability denoted by FA(x1, . . . , xi) is zero at any cluster A,

then (x1, . . . , xi) is inconsistent. We can therefore safely set QCi (xi|x1, . . . , xi−1) to zero

whenever FA(x1, . . . , xi) is zero. Obviously. given samples generated using Algorithm 9,

70

Algorithm 9: IJGP-sampling with consistency checking

Input: A mixed networkM = (X,D,F,C), and integer N and iOutput: N samples from QC

Select an ordering o = (X1, . . . , Xn) minimizing heuristically induced width w.r.t.1

C;

Create a i-bounded join graph JG using Algorithm Join-Graph structuring (see2

Algorithm 2 in Chapter 1) by running the schematic mini-buckets scheme along o;Run IJGP(i) on JG and create a proposal distribution Q along o from the output of3

IJGP(i) using the max heuristic outlined in Algorithm 6;

// Sampling Phase

for j=1 to N do4

x = φ;5

for i=1 to n do6

QCi (Xi) = Qi(Xi|x);7

for each value xi ∈ Di do8

// Consistency Checking Step

for all clusters A corresponding to the mini-buckets of Xi do9

// Let FA(x, xi) denote the marginal over

cluster AIf FA(x, xi) = 0 set QC

i (Xi) = 0;10

Normalize Qi and sample a value xi from QCi (Xi);11

x = x ∪ xi;12

Output x;13

71

we have to use QC instead of Q in Equations 1.15 and 1.23 to maintain unbiasedness and

asymptotic unbiasedness respectively.

In the next two sections, we extend the algorithms IJGP and IJGP-sampling to Hybrid

Dynamic Mixed Networks (HDMNs).

2.6 Iterative Join Graph Propagation for HDMNs

We first extend Iterative Join Graph Propagation (IJGP) to Hybrid Mixed Networks which

contain both discrete and continuous variables yielding hybrid IJGP. Then, we present al-

gorithm IJGP-S which extends hybrid IJGP to dynamic networks. These extensions are

straight-forward and can be easily derived by combining results from [81, 33, 67].

2.6.1 Hybrid IJGP(i) for inference in Hybrid Mixed Networks

To extend IJGP(i) to hybrid networks, we have to use the marginalization and product

operators specified in [81] for constructing messages. This is because when continuous

variables are mixed with discrete ones it yields a conditional Gaussian, which has its own

unique properties (see [81] for details). Second, Gaussian nodes can be processed in poly-

nomial time and therefore the complexity of processing each cluster is exponential only in

the number of its discrete variables. We capture this property using the notion of adjusted-

i-bound.

DEFINITION 33 (Adjusted i-bound). The adjusted i-bound of a join graph JG = 〈 G (V

, E) , χ , ψ , θ 〉 of a Hybrid Mixed NetworkM = (B = 〈 X ,D , G, P 〉 ,R = 〈∆, D , C 〉

is ai = maxU∈V |χ(U) ∩∆| − 1.

Henceforth, we will use i-bound to mean adjusted i-bound.

72

2.6.2 IJGP(i)-S for inference in sequential domains

Next, we will extend hybrid IJGP for performing filtering in Hybrid Dynamic Mixed Net-

works yielding IJGP(i)-S where “S” denotes that the algorithm performs online or sequen-

tial inference.

Constructing the join graph of IJGP(i)-S

Murphy [98] describes a scheme called the interface algorithm for constructing join-trees

in a dynamic network. Interface is defined as a set of nodes in the current time slice that

have an edge with nodes in the next time slice. For example, the shaded nodes in Figure

2.2 show the interface nodes for the given DBN. Murphy [98] showed that we can create a

join tree in an online manner by simply pasting together join-trees of a so-called 1− 12-slice

graph associated with each time slice via the interface nodes. Figure 2.3(a) illustrates this

construction. A 1 − 12-slice graph at time t is an undirected graph constructed from the

moral graph over the nodes in time slices t − 1 and t as follows. We first form a clique of

all interface nodes at time t− 1 and t. Then, we remove all non-interface nodes from time

slice t− 1 yielding a 1− 12-slice graph.

We adapt the interface algorithm for constructing i-bounded join graphs as follows. We

create join-graphs over the 1 − 12-slice graph and paste them together using the interface

cluster. If the interface cluster has more than i + 1 variables, then it is split into smaller

clusters such that the new clusters have less than i+1 variables. This construction procedure

is illustrated in Figure 2.3(b).

73

Message Passing

In IJGP(i)-S, we perform the same message passing as the join-tree algorithm for Dy-

namic Bayesian networks over a join graph. If all variables are discrete, we can compute

P (XT |e0:T ) (i.e. perform filtering) by passing messages from left to right over the join-tree

as follows (this is same as sending messages from leaves to the root in cluster tree elimi-

nation where the root is a cluster in time slice T and all leaves are present in time slice 0).

At each time-slice t, we are given a distribution over the interface cluster It−1 denoted by

P (It−1|e0:t) which is passed from the previous time slice t − 1. Then, we pass messages

from the leaves to the root of the join-tree at time slice t and project a message onto the the

interface cluster It which is then passed to the time-slice t+ 1 and so on. To compute mes-

sages at time slice t, we do not need the message received by t− 1 from t− 2 and therefore

we do not need to keep the whole unrolled network in memory yielding a memory efficient

online filtering scheme.

In IJGP-S, we pass messages in a join graph in a similar manner. Namely, at each time-

slice t, we are given a distribution over all the interface clusters It−1,1, . . . , It−1,j denoted by

P ′(It−1,1|e0:t) , . . . , P′(It−1,j|e0:t) which is passed from the previous time slice t− 1. Then

we iteratively pass messages in the join-graph at time slice t until convergence or until a

maximum number of iterations have been reached and project a message onto the interface

clusters It,1, . . . , It,j which is then passed to the time-slice t+ 1 and so on.

Hybrid IJGP-S is similarly defined except that we use Lauritzen’s operators [81] for mes-

sage passing and use adjusted i-bound for creating the join-graphs.

We summarize the complexity of IJGP-S in the following theorem:

THEOREM 7. The time complexity of IJGP(i)-S is O(h × exp(i) × j3 × T ) where h is

the number of clusters of the join graph at each time slice, j is the number of continuous

variables in a time slice and T is the number of time slices. The space complexity of

74

X11 X2

1 X12 X2

2

X31 X3

2

X41 X4

2

Figure 2.2: Figure showing an example two slice Dynamic Bayesian network. The shaded

nodes form the Forward interface (adapted from Murphy [98])

.

IJGP(i)-S is O(h× exp(i)).

Proof. The time complexity of constructing a message using Lauritzen’s operators [81] in

a cluster containing i discrete variables and j continuous variables is O(exp(i)×j3). Since

there are T time slices and h clusters in each time slice, the total number of clusters is

bounded by O(h × T ). Therefore, the overall time complexity is O(h × exp(i) × j3 ×

T ). Because, we do not utilize clusters of the join graph from time slice t − 2 when we

perform message passing in time slice t, we only have to keep O(h) clusters in memory.

Because each cluster requires only O(exp(i)) space, the overall space complexity is O(h×

exp(i)).

75

I0

N0

I1

N1

I2

N2

I0 I1

Slice 0 Slice 1 Slice 2

Interface Interface

I0

I1

(a) Join tree construction

Slice 0 Slice 1

Interface Interface

Slice 2

I0,1I0,0

N0,0 N0,1

N0,2

I0,0

I0,1

I1,0

I1,1

I1,0

I1,1I0,1

I0,0

I1,0

I1,1

I2,0

I2,1

N1,0 N1,1

N1,2

N2,0 N2,1

N2,2

(b) Join graph construction

Figure 2.3: Schematic illustration of the Procedure used for creating join-graphs and join-

trees for HDMNs. I indicates a set of interface nodes while N indicates non-interface

nodes.

76

Algorithm 10: IJGP(i)-RBPF

Input: A Hybrid Dynamic Mixed Network (X,D, G,P,C)0:T and a observation

sequence e0:T Integers N and wOutput: P (XT |e0:T )for t = 0 to T do1

// 1. Sequential Importance Sampling step

// 1.1 Rao-Blackwellisation step

Partition the Variables Xt into discrete ∆t and continuous Γt variables;2

// 1.2 IJGP step

Construct a join graph JG over the variables Xt in time-slice t using the3

interface method outlined in section 2.6;

Run hybrid IJGP(i) on JG;4

Construct a proposal distribution Qt(∆t|et,∆t−1,Γt−1) over the discrete5

variables ∆t from the output of IJGP(i) using the max heuristic outlined in

Algorithm 6;

// 1.3 Sampling step

for i = 1 to N do6

Generate a sample δit from Qt(∆t|et, δit−1, γit−1) using (the sampling phase7

of) Algorithm 9;

// Exact Step

Use join-tree-propagation to compute a distribution over Γt given δit,8

δit−1, γit−1 and et denoted by γit .;

Compute the importance weight wit of (δit, γit)9

Normalize the importance weights to form wit.;10

// 2. Resampling or Selection Step

Re-sample N samples from δit, γit according to the normalized importance11

weights wit to obtain new N random samples.;

77

2.7 Rao-Blackwellised Particle Filtering for HDMNs

In this section, we will present IJGP-based Rao-Blackwellised Particle filtering algorithm

(IJGP-RBPF) for Hybrid Dynamic mixed networks. The main idea in IJGP-RBPF is to use

IJGP for constructing a proposal distribution within the Rao-Blackwellised Particle filtering

scheme [40, 47] and use its partial i-consistency power to reduce rejection (see Algorithm

9). The algorithm is straight-forward to derive and we present it here for completeness

sake. We begin with a review of Rao-Blackwellised particle filtering.

Particle filtering uses a weighted set of samples or particles to approximate the filtering

distribution P (Xt|e0:t). The main idea is to approximate the distribution sequentially by

first creating a set of weighted samples which approximate the distribution P (X0|e0) at the

first time slice, then using these samples to construct a new set of weighted samples which

approximate the distribution P (X1|e0, e1), and iterating this procedure until we reach the

required time slice t for which the filtering distribution is desired. Particle filtering uses

an appropriate (importance) proposal distribution Qt(Xt|X0:t−1, e0:t) at each time slice to

generate the set of weighted samples. Formally, given samples x1t , . . . , x

Nt drawn from

Qt, the distribution at Xt is approximated using wit, xitNi=1, where wit is the importance

weight of sample xit. It is often desirable to have the samples in unweighted form. For

instance if importance weights are close to zero, only a few of them would dominate the

sample-based estimates and after a few iterations all but one particle will have negligible

weight. This is called the degeneracy or sample depletion problem [41, 73] which yields

an inefficient particle filter in which a large computational effort is devoted to updating

particles whose contribution to the approximation is almost zero. To mitigate this problem,

we can generate unweighted samples using a resampling step in which we generate N

samples from wit, xitNi=1 by sampling each xit with probability proportional to wit.

Particle filtering often shows poor performance in high-dimensional spaces and its per-

78

formance can be improved by sampling from a sub-space using the Rao-Blackwell (RB)

theorem (and the particle filtering is called Rao-Blackwellised Particle Filtering (RBPF)).

Specifically, the state Xt is divided into two sets: Yt and Zt such that only variables in

set Yt are sampled from the proposal distribution Qt(Yt|Y0:t,Z0:t, e0:t) while the distribu-

tion on Zt is computed analytically given a sample Yt = yt. The complexity of RBPF is

proportional to the complexity of exact inference step for each sample ykt .

Since exact inference can be done in polynomial time if a hybrid dynamic mixed network

(HDMN) contains only continuous variables, a straightforward application of RBPF to

HDMNs is to sample only the discrete variables in each time slice and apply exact inference

to the continuous variables [47, 83]. Given a sampled assignment to the discrete variables,

the continuous variables have a normal distribution which can be computed exactly. Each

particle is thus heavy containing an assignment to the discrete variables and a Gaussian

over all the continuous variables.

The IJGP-RBPF (i) scheme is given in Algorithm 10. The algorithm has two main steps:

Sequential importance sampling (SIS) step and resampling step. Only the SIS step differs

from standard RBPF described in [40]. SIS is divided into three sub-steps: (1) In the Rao-

Blackwellisation step, we first partition the variables Xt in a 2THMN into discrete ∆t and

continuous variables Γt. (2) Then, in the IJGP step, we use hybrid IJGP(i) introduced in the

previous section (Section 2.6) to generate a proposal distribution Qt(∆t|et,∆t−1,Γt−1).

(3) Finally, in the sampling step, we first generate N samples over the discrete variables

from Qt using the sampling phase of IJGP with consistency checking scheme given by

Algorithm 9. Then, we compute a distribution over the continuous variables Γt exactly

using join-tree propagation [81]2.

THEOREM 8. The time complexity of IJGP-RBPF(i) isO([n×N×j3+h×exp(i)×j3]×T )

where n is the number of discrete variables in each time slice, h is the number of clusters

2Note that given an assignment to all the discrete variables, join-tree propagation over the continuous

variables is equivalent to a Kalman filter [116].

79

of the join graph at each time slice, j is the number of continuous variables in a time slice,

N is the number of samples or particles generated at each time slice and T is the number

of time slices. The space complexity of IJGP-RBPF(i) is O((n+ j2)×N + h× exp(i)).

Proof. From Theorem 7, the time complexity of running IJGP is O(h× exp(i)× j3 × T ).

To generate N samples on n discrete variables, it takes O(N × n) time at each time slice.

To compute the weight of each sample via exact inference on j continuous variables, using

Lauritzen’s join-tree propagation scheme [81], we need O(j3) time. Therefore, the time

required to generate N samples at each time slice is is O(n × N × j3). Thus, the overall

time complexity is O([n×N × j3 + h× exp(i)× j3]× T ).

The space complexity of running IJGP(i) is O(h× exp(i)). Note that each particle consists

of a full assignment to all the discrete variables and a multi-variate Gaussian distribution

over all the continuous variables. To store each multi-variate Gaussian over j variables, we

require O(j2) space and to store an assignment over n discrete variables, we need O(n)

space. Therefore, to store N particles, we need O((n+ j2)×N) space yielding an overall

space complexity of O((n+ j2)×N + h× exp(i)).

2.8 Experimental Results

The test data consists of a log of GPS readings collected by a post doctoral researcher in

the Institute of Transportation Science at UCI. The test data was collected over a six month

period at intervals of 1-5 seconds each. The data consist of the current time, date, location

and velocity of the person’s travel. The location is given as latitude and longitude pairs.

The data was first divided into individual routes taken by the person and the HDMN model

was learned using the Monte Carlo version of the EM algorithm [86, 84].

We used the first three months’ data as our training set while the remaining data was used

80

as a test set. TIGER/Line files available from the US Census Bureau formed the graph

on which the data was snapped. As specified earlier our aim is two-fold: (a) Finding the

destination or goal of a person given the current location and (b) Finding the route taken by

the person towards the destination or goal.

To evaluate our inference and learning algorithms, we use three HDMN models. Model-

1 is the model shown in Figure 2.1. Model-2 is the model given in Figure 2.1 when the

variables wt and dt are removed from each time-slice. Model-3 is the base-model which

tracks the person without any high-level information and is constructed from Figure 2.1 by

removing the variables wt, dt, gt and rt from each time-slice.

We used four inference algorithms within the EM-learning scheme [86, 84]. Since EM-

learning uses inference as a sub-step, we have four different learning algorithms. We call

these algorithms as IJGP-S(1), IJGP-S(2) and IJGP-RBPF(1,N) and IJGP-RBPF(2,N) re-

spectively. Note that the algorithm IJGP-S(i) (described in Section 2.6) uses i as the i-

bound. IJGP-RBPF(i,N) (described in Section 2.7) uses i as the i-bound for IJGP(i) and N

is the number of particles at each time slice. Three values of N were used: 100, 200 and

500. For the EM-learning, N was 500. Experiments were run on a Pentium-4 2.4 GHz ma-

chine with 2G of RAM. Note that for Model-3, we only use IJGP-RBPF(1) and IJGP(1)-S

because the maximum i-bound in this model is bounded by 1 .

2.8.1 Finding destination or goal of a person

The results for goal prediction with various combinations of models, learning and inference

algorithms are shown in Tables 2.1, 2.2 and 2.3. We define the prediction accuracy as

the number of goals predicted correctly. Learning was performed off-line. Our slowest

learning algorithm IJGP-RBPF(2) used almost 5 days of CPU time for Model-1, and almost

4 days for Model-2—significantly less than the period over which the data was collected.

81

The column ‘Time’ in Tables 1, 2 and 3 shows the average time in seconds required for

performing prediction by the inference algorithms on a given learnt model while the other

entries indicate the accuracy for each combination of inference and learning algorithms.

We can see that Model-1 achieves the highest prediction accuracy of 84% while Model-2

and Model-3 achieve prediction accuracies of 77% and 68% respectively or lower.

For Model-1, we see that the learning algorithms that use IJGP-RBPF(2) and IJGP(2)-S

yield an average accuracy of 83% and 81% respectively. For Model-2, we see that the

average accuracy of learning algorithms that use IJGP-RBPF(2) and IJGP(2)-S is 76%

and 75% respectively. Therefore, IJGP-RBPF(2) and IJGP(2)-S are the best performing

learning algorithms.

For Model-1 and Model-2, to verify which inference algorithm yields the best accuracy

given a learned model, we see that IJGP(2)-S is the most cost-effective alternative in terms

time versus accuracy while IJGP-RBPF yields the best accuracy.

LEARNING

IJGP-RBPF IJGP-S

N Inference Time (i=1) (i=2) (i=1) (i=2)

100 IJGP-RBPF(1) 12.3 78 80 79 80

100 IJGP-RBPF(2) 15.8 81 84 78 81

200 IJGP-RBPF(1) 33.2 80 84 77 82

200 IJGP-RBPF(2) 60.3 80 84 76 82

500 IJGP-RBPF(1) 123.4 81 84 80 82

500 IJGP-RBPF(2) 200.12 84 84 81 82

IJGP(1)-S 9 79 79 77 79

IJGP(2)-S 34.3 74 84 78 82

Average 79.625 82.875 78.25 81.25

Table 2.1: Goal prediction accuracy for Model-1. Each cell in the last four columns con-

tains the prediction accuracy for the corresponding combination of the inference scheme

(row) and the learning algorithm (column). Given a learned model, the column ‘Time’

reports the average time required by each inference algorithm for predicting the goal.

82

2.8.2 Finding the route taken by the person

To see how our models predict a person’s route, we use the following method. We first run

our inference algorithm on the learned model and predict the route that the person is likely

to take. Then, we super-impose this route on the actual route taken by the person and count

the number of roads that were not taken by the person but were in the predicted route (i.e.

the false positives), and the number of roads that were taken by the person but were not in

the actual route (i.e. the false negatives). The two measures are reported in Table 2.4 for

the best performing learning models in each category: viz IJGP-RBPF(2) for Model-1 and

Model-2 and IJGP-RBPF(1) for Model-3. As we can see Model-1 and Model-2 have the

best route prediction accuracy (given by low false positives and false negatives).

Finally, in Figure 2.4, we show a typical execution of our system. On the left is the person’s

current location derived from the GPS device on his car and on the right is the route the

person is most likely to take based on his past history (predicted by our model).

LEARNING

IJGP-RBPF IJGP-S

N Inference Time (i=1) (i=2) (i=1) (i=2)

100 IJGP-RBPF(1) 8.3 73 73 71 73

100 IJGP-RBPF(2) 14.5 76 76 71 75

200 IJGP-RBPF(1) 23.4 76 77 71 75

200 IJGP-RBPF(2) 31.4 76 77 71 76

500 IJGP-RBPF(1) 40.08 76 77 71 76

500 IJGP-RBPF(2) 51.87 76 77 71 76

IJGP(1)-S 6.34 71 73 71 74

IJGP(2)-S 10.78 76 76 72 76

Average 75 75.75 71.125 75.125





83

LEARNING

N Inference Time IJGP-RBPF(1) IJGP(1)-S

100 IJGP-RBPF(1) 2.2 68 61

200 IJGP-RBPF(1) 4.7 67 64

500 IJGP-RBPF(1) 12.45 68 63

IJGP(1)-S 1.23 66 62

Average 67.25 62.5





2.9 Related Work and Conclusion

Lioa et al. [86] and Patterson et al. [104] describe a model based on AHMEM and HMM

respectively for inferring high-level behavior from GPS-data. Our model goes beyond their

model by representing two new variables day-of-week and time-of-day which improves the

accuracy by about 6%.

A mixed network framework for representing deterministic and uncertain information was

presented in [34, 80, 36]. These previous works also describe exact inference algorithms

for mixed networks with the restriction that all variables should be discrete. Our work

goes beyond these previous works in that we describe approximate inference algorithms

for the mixed network framework, allow continuous Gaussian nodes with certain restric-

tions and model discrete-time stochastic processes. The approximate inference algorithms

Model1 Model2 Model3

N INFERENCE FP/FN FP/FN FP/FN

IJGP(1) 33/23 39/34 60/55

IJGP(2) 31/17 39/33

100 IJGP-RBPF(1) 33/21 39/33 60/54

200 IJGP-RBPF(1) 33/21 39/33 58/43

100 IJGP-RBPF(2) 32/22 42/33

200 IJGP-RBPF(2) 31/22 38/33

Table 2.4: False positives (FP) and False negatives for routes taken by a person (FN)

84

Figure 2.4: Route prediction. On the left is the person’s current location derived from the

GPS device on his car and on the right is the route the person is most likely to take based

on his past history predicted by our model.

called IJGP(i) described in [33] handles only discrete variables. In our work, we extend

this algorithm to include Gaussian variables and discrete constraints. We also develop a

sequential version of this algorithm for dynamic models.

Particle Filtering is a very attractive research area [40]. Particle Filtering in HDMNs can

be inefficient if non-solutions of constraint portion have high probability of being sampled.

We show how to alleviate this difficulty by performing IJGP(i) before sampling. This

algorithm IJGP-RBPF yields the best performance in our settings and might prove to be

useful in applications in which particle filtering is preferred.

To conclude, in this chapter, we introduced a new importance sampling algorithm called

IJGP-sampling which uses the output of Iterative Join Graph Propagation (IJGP) to con-

struct a proposal distribution. IJGP is a good choice for constructing the proposal distri-

bution because it not only yields a very good approximation to the posterior distribution

but also achieves directional i-consistency which can substantially reduce the amount of

rejection.

85

Subsequently, we introduced a new modeling framework called HDMNs, a representation

that handles discrete-time stochastic processes with deterministic and probabilistic infor-

mation on both continuous and discrete variables in a systematic way. We also proposed

a IJGP-based algorithm called IJGP(i)-S for approximate inference in this framework, and

proposed a class of Rao-Blackwellised particle filtering algorithm, IJGP-RBPF, which uses

IJGP-sampling as a backbone, for effective sampling in HDMNs in the presence of con-

straints.

The HDMN framework and the two inference schemes were then applied to modeling

travel behavior demonstrating their usefulness on a complex and highly relevant real life

domain. This domain is attractive because travel behavior is highly routinized, but at the

same time is highly variable. People may make daily trips between home and work, but

those trips are varied in time and space, and interspersed with other trips. Eventually, the

goal of this research is to be able to infer exactly the routine portions of travel behavior, and

also to be able to infer exactly when routines have been broken. This goal has applicability

to designing travel behavior surveys, as well as to designing in-vehicle or hand-held travel

assistance devices.

86

Chapter 3

SampleSearch: A scheme that searches

for consistent samples

3.1 Introduction

The chapter presents a scheme called SampleSearch for performing effective importance

sampling based inference over mixed probabilistic and deterministic networks [34, 80,

36, 92]. As mentioned earlier, the mixed networks framework encompasses probabilistic

graphical models such as Bayesian and Markov networks [106], and deterministic graphi-

cal models such as constraint networks [29]. We focus on weighted counting and posterior

marginal queries over mixed networks. Weighted counts express the probability of evidence

of a Bayesian network, the partition function of a Markov network and the number of so-

lutions of a constraint network. Marginals seek the marginal distribution of each variable,

also called as belief updating or posterior estimation in a Bayesian or Markov network.

It is straightforward to design importance sampling algorithms [90, 118, 50] for approxi-

mately answering counting and marginal queries because both are variants of summation

87

problems for which importance sampling was designed (see Section 1.4.2). In particular,

weighted counts is the sum of a function over some domain while a marginal is a ratio

between two sums. The main idea in importance sampling is to transform a summation

into an expectation using a special distribution called the proposal (or importance or trial)

distribution from which it would be easy to sample from. Importance sampling then gener-

ates samples from the proposal distribution and approximates the expectation (also called

true average or true mean) by a weighted average over the samples (also called sample

average or sample mean). The sample average can be shown to be an unbiased estimate of

the original summation, and therefore importance sampling yields an unbiased estimate of

the weighted counts. In case of marginals, importance sampling has to compute a ratio of

two unbiased estimates yielding an asymptotically unbiased estimate only.

In presence of hard constraints or zero probabilities, however, importance sampling may

suffer from the rejection problem (see Section 2.5). The rejection problem occurs when

the proposal distribution does not faithfully capture all the zero probabilities or all the

constraints in the given mixed network. Consequently, a number of samples generated

from the proposal distribution may have zero weight and do not contribute to the sample

mean. In some cases, the probability of generating a rejected sample can be arbitrarily

close to 1 yielding completely wrong estimates of both weighted counts and marginals.

Although the pre-processing using consistency enforcement approach presented in Chap-

ter 2 reduces the amount of rejection, for some mixed networks having a hard constraint

portion, consistency enforcement may not be powerful enough and may still allow a large

number of rejected samples. In this chapter, we propose a more powerful scheme called

SampleSearch which with some additional computation manages rejection during the sam-

pling process itself.

SampleSearch, as the name suggests combines systematic backtracking search with Monte

Carlo sampling. In this scheme, when a sample is supposed to be rejected, the algorithm

88

continues instead with a randomized backtracking search until a sample with non-zero

weight is found. In this formulation, the problem of generating a non-zero weight sample

is equivalent to the problem of finding a solution to a satisfiability (SAT) or a constraint

satisfaction problem (CSP). SAT and CSPs are NP-Complete problems and therefore the

idea of generating a sample by solving an NP-Complete problem may seem inefficient.

However, recently SAT/CSP solvers have achieved unprecedented success and are able to

solve some large industrial problems having as many as a million variables within a few

seconds [69, 112]. Therefore, a rather radical idea of solving a constant number of NP-

complete problems to approximate a #P-complete problem such as weighted counting is no

longer unreasonable.

We show that SampleSearch generates samples from a modification of the proposal distri-

bution which is backtrack-free. Recall that (see Section 2.5) the backtrack-free distribution

can be obtained by removing all partial assignments which lead to a zero weight sample.

Namely, the backtrack-free distribution is zero whenever the exact distribution from which

we wish to sample is zero. We propose two schemes to compute the backtrack-free prob-

ability of the generated samples which is required for computing the sample weights. The

first is a computationally intensive method which involves invoking a CSP or a SAT solver

O(n × d) times where n is the number of variables and d is the maximum domain size.

The second scheme approximates the backtrack-free probability by consulting information

gathered during SampleSearch’s operation. This latter scheme has several desirable prop-

erties: (i) it runs in linear time, (ii) it yields an asymptotically unbiased estimate and (iii) it

can provide upper and lower bounds on the exact backtrack-free probability.

Finally, we present an extensive empirical evaluation demonstrating the power of Sample-

Search. We conducted experiments on three tasks: (a) counting models of a SAT formula

(b) computing the probability of evidence in a Bayesian network and the partition func-

tion of a Markov network, and (c) computing posterior marginals in Bayesian and Markov

89

networks.

For model counting, we compared against three approximate solution counters: Approx-

Count [130], SampleCount [62] and Relsat [115]. Our experiments show that on most in-

stances, given the same time bound SampleSearch yields solution counts which are closer

to the true counts by a few orders of magnitude compared with the other schemes.

For the problem of computing the probability of evidence or the partition function, we com-

pared SampleSearch with the any time scheme of Variable Elimination and Conditioning

(VEC) [28], and an advanced generalized belief propagation scheme called Edge Deletion

Belief Propagation (EDBP) [18]. We show that on most instances the estimates output by

SampleSearch were more accurate than those output by EDBP. VEC solved some instances

exactly, however on the remaining instances it was substantially inferior.

For the posterior marginal tasks, we experimented with linkage analysis benchmarks, with

partially deterministic grid benchmarks, with relational benchmarks and with logistics

planning benchmarks. Here, we compared the accuracy of SampleSearch in terms of a

standard distance measure called the Hellinger distance against three other schemes: two

generalized belief propagation schemes of Iterative Join Graph Propagation [33] and Edge

Deletion belief propagation [18] and an adaptive importance sampling scheme called Ev-

idence Pre-propagation Importance sampling (EPIS) [133]. Again, we found that except

for the grid instances, SampleSearch consistently yields estimates having smaller error (or

Hellinger distance) than the competition.

Thus, via a large scale experimental evaluation, we conclude that SampleSearch consis-

tently yields very good approximations. In particular, on instances which have a substan-

tial amount of determinism SampleSearch yields an order of magnitude improvement over

state-of-the-art schemes.


90

The rest of the chapter is organized as follows. In Section 3.2, we present the SampleSearch

scheme. In Section 3.3, we present experimental results and we conclude in Section 3.4.

3.2 The SampleSearch Scheme

In a nutshell, SampleSearch incorporates systematic backtracking search into the ordered

Monte Carlo sampler (see Section 1.4.2) so that all full samples are solutions of the con-

straint portion of the mixed network but it does not insist on backtrack-freeness of the

search process itself. We will sketch our ideas using the most basic form of systematic

search: chronological backtracking, emphasizing that the scheme can work with any ad-

vanced systematic search scheme. In our empirical work, we will indeed use advanced

search schemes such as minisat [125].

Given a mixed networkM = 〈X,D,F,C〉 and a proposal distribution Q(X), a traditional

ordered Monte Carlo sampler samples variables along the order o = (X1, . . . , Xn) from Q

and rejects a partial sample (x1, . . . , xi) if it violates any constraints in C. Upon rejecting

a sample, the sampler starts sampling anew from the first variable in the ordering. Sample-

Search instead modifies the conditional probability as Qi(Xi = xi|x1, . . . , xi−1) = 0 (to

reflect that (x1, . . . , xi) is not consistent), normalizes the distribution Qi(Xi|x1, . . . , xi−1)

and re-samples Xi from the normalized distribution. The newly sampled value may be

consistent in which case the algorithm proceeds to variable Xi+1 or it may be inconsistent.

If we repeat the process we may reach a point where Qi(Xi|x1, . . . , xi−1) is 0 for all val-

ues of Xi. In this case, (x1, . . . , xi−1) is inconsistent and therefore SampleSearch revises

the distribution at Xi−1 by setting Qi−1(Xi−1 = xi−1|x1, . . . , xi−2) = 0, normalizes Qi−1

and re-samples a new value for Xi−1 and so on. SampleSearch repeats this process until a

consistent full sample that satisfies all the constraints in C is generated.

91

Algorithm 11: SampleSearch

Input: A mixed networkM = 〈X,D,F,C〉, an initial proposal distribution

Q(X) =∏n

i=1Qi(Xi|X1, . . . , Xi−1) along an ordering o = (X1, . . . , Xn)Output: A consistent full sample x = (x1, . . . , xn)SET i=1, D′

i = Di (copy domains), Q′1(X1) = Q1(X1) (copy distribution), x = ∅;1

while 1 ≤ i ≤ n do2

// Forward phase

if D′i is not empty then3

Sample Xi = xi from Q′i and remove it from D′

i;4

if (x1, . . . , xi) violates any constraint in C then5

SET Q′i(Xi = xi|x1, . . . , xi−1) = 0 and normalize Q′

i;6

Goto step 3.;7

x = x ∪ xi, i = i+ 1, D′i = Di, Q

′i(Xi|x1, . . . , xi−1) = Qi(Xi|x1, . . . , xi−1);8

// Backward phase

else9

x = x\xi−1.;10

SET Q′i−1(Xi−1 = xi−1|x1, . . . , xi−2) = 0 and normalize11

Q′i−1(Xi−1|x1, . . . , xi−2);

SET i = i− 1;12

if i = 0 then13

return inconsistent;14

else15

return x;16

92

The pseudo-code for SampleSearch is given in Algorithm 11. It is a depth first backtrack-

ing search (DFS) over the state space of consistent partial assignments searching for a so-

lution to a constraint satisfaction problem 〈X,D,C〉, whose value ordering is stochastically

guided by Q that evolves during search. The updated distribution that guides the search is

maintained at Q′. The first phase is a forward phase in which the variables are sampled

in sequence and a current partial sample (or assignment) is extended by sampling a value

xi for the next variable Xi using the current distribution Q′i. If for all values xi ∈ Di,

Q′i(xi|x1, . . . , xi−1) = 0, then SampleSearch backtracks to the previous variable Xi−1

(backward phase) and updates the distributionQ′i−1 by settingQ′

i−1(xi−1|x1, . . . , xi−2) = 0

and normalizing Q′i−1 and continues.

3.2.1 The Sampling Distribution of SampleSearch

Let us denote by I =∏n

i=1 Ii(Xi|X1, . . . , Xi−1) the sampling distribution of SampleSearch

along the ordering o = (X1, . . . , Xn). We will show that I coincides with the backtrack-

free distribution QF of Q w.r.t. C (see Definition 31 in Chapter 2). To recap, given a mixed

networkM = 〈X,D,F,C〉 and a proposal distributionQ = Q1, . . . , Qn along an ordering

o, the backtrack-free distribution is defined as QF = QF1 , . . . , Q

Fn where:

QFi (xi|x1, . . . , xi−1)

= αQi(xi|x1, . . . , xi−1) if (x1, . . . , xi) is globally consistent w.r.t. C

= 0 otherwise.(3.1)

Let xi−1 = (x1, . . . , xi−1) and let Bxi−1

i = x′i ∈ Di|(x1, . . . , xi−1, x′i) is not globally

consistent w.r.t. C . Then, α can be expressed by:

α =1

1−∑x′i∈B

xi−1i

Qi(x′i|x1, . . . , xi−1)

THEOREM 9 (Main Result). Given a mixed networkM = 〈X,D,F,C〉 and an input pro-

posal distribution Q = Q1, . . . , Qn along o, SampleSearch generates independent and

93

identically distributed samples from the backtrack-free probability distribution QF of Q

w.r.t. C, i.e. ∀ i QFi = Ii.

To prove this theorem, we need the following proposition.

PROPOSITION 9. Given a mixed networkM = 〈X,D,F,C〉, an initial proposal distribu-

tion Q = Q1, . . . , Qn and a globally consistent partial assignment (x1, . . . , xi−1) w.r.t.

C, SampleSearch samples values without replacement from the domain of variableXi until

a globally consistent sample (x1, . . . , xi−1, xi) is generated.

Proof. Consider a globally inconsistent extension (x1, . . . , x′i) of (x1, . . . , xi−1). Because

SampleSearch is systematic, if (x1, . . . , x′i) is sampled then SampleSearch would eventu-

ally detect its inconsistency by not being able to extend it to a solution. At this point, it will

set Q′i(x

′i|x1, . . . , xi−1) = 0 either in step 6 or step 11 and normalize Q′

i. In other words, x′i

is sampled just once yielding sampling without replacement from Q′i(Xi|x1, . . . , xi−1). On

the other hand, if a globally consistent extension (x1, . . . , xi) is sampled, SampleSearch

will always extend it to a full sample that is consistent.

The following example demonstrates how the backtrack free distribution is constructed.

EXAMPLE 9. Consider the complete search tree corresponding to the proposal distribu-

tion and to the constraints given in Figure 3.1. The inconsistent partial assignments are

grounded in the figure. Each arc is labeled with the probability of generating the child node

from Q given an assignment from the root node to its parent. Consider the full assignment

(A = 0, B = 2, C = 0). The five different ways in which this assignment could be gen-

erated by SampleSearch (called as DFS-traces) are shown in Figure 3.2. In the following,

we show how to compute the probability IB(B = 2|A = 0) i.e. the probability of sampling

B = 2 given A = 0. Given A = 0, the events that could lead to sampling B = 2 are shown

in Figure 3.2, (a) 〈B = 2〉|A = 0 (b) 〈B = 0, B = 2〉|A = 0 (c) 〈B = 3, B = 0〉|A = 0

94

Root

A=0 A=2

0.2

A=1

B=0 B=3B=1 B=2 B=0 B=3B=1 B=2 B=0 B=3B=1 B=2

0.3 0.3

0.4 0.20.1

0.4 0.2

0.1 0.3

0.4 0.2

0.1

0.1

0.7

C=0 C=1C=1 C=0 C=0 C=1C=1 C=0 C=0 C=1C=1 C=0

0.7 0.3 0.7 0.30.7

0.3 0.7 0.3 0.7 0.3 0.7 0.3

Proposal Distribution Q

Q=Q(A)*Q(B|A)*Q(C|A,B)

Q(A)=(0.1,0.7,0.2)

Q(B|A)=Q(B)=(0.3,0.4,0.2,0.1)

Q(C|A,B)=Q(C)=(0.7,0.3)

Constraints

A≠B, A=1→B ≠0

B=3→C ≠0, B=3→C ≠1

A=1→B≠3, A=2→B≠3

Figure 3.1: A full OR search tree given a set of constraints and a proposal distribution.

Root

A=0

B=2

C=0

(a)

Root

A=0

B=0 B=2

C=0

(b)

Root

A=0

B=3 B=2

C=0

(c)

Root

A=0

B=3 B=2B=0

C=0

(d)

Root

A=0

B=0 B=2B=3

C=0

(e)

Figure 3.2: Five possible traces of SampleSearch which lead to the sample (A = 0 , B = 2, C = 0 ). The children of each node are specified from left to right in the order in which

they are generated.

95

(d) 〈B = 0, B = 3, B = 2〉|A = 0 and (e) 〈B = 3, B = 0, B = 2〉|A = 0. The

notation 〈B = 3, B = 0, B = 2〉|A = 0 means that given A = 0, the states were sam-

pled in the order from left to right (B = 3, B = 0, B = 2). Clearly, the probability

of IB(B = 2|A = 0) is the sum over the probability of these events. Let us now com-

pute the probability of the event 〈B = 3, B = 0, B = 2〉|A = 0. The probability of

sampling B = 3|A = 0 from Q(B|A = 0) = (0.3, 0.4, 0.2, 0.1) is 0.1. The assignment

(A = 0, B = 3) is inconsistent and therefore the distribution Q(B|A = 0) is changed

by SampleSearch to Q′(B|A = 0) = (0.3/0.9, 0.4/0.9, 0.2/0.9, 0) = (3/9, 4/9, 2/9, 0).

Subsequently, the probability of sampling B = 0 from Q′ is 3/9. However, the assignment

(A = 0, B = 0) is also globally inconsistent and therefore the distribution is changed to

Q′′(B|A = 0) ∝ (0, 4/9, 2/9, 0) = (0, 2/3, 1/3, 0). Next, the probability of sampling B =

2 from Q′′ is 1/3. Therefore, the probability of the event 〈B = 3, B = 0, B = 2〉|A = 0

is 0.1 × (3/9) × (1/3) = 1/90. By calculating the probabilities of the remaining events

using the approach described above and taking the sum, one can verify that the probability

of sampling B = 2 given A = 0 i.e. IB(B = 2|A = 0) = 1/3.

We will now show that:

PROPOSITION 10. Given a mixed networkM = 〈X,D,F,C〉, an initial proposal distri-

bution Q = Q1, . . . , Qn and a globally consistent partial assignment (x1, . . . , xi−1, xi)

w.r.t. C, the probability Ii(xi|x1, . . . , xi−1) of sampling xi given (x1, . . . , xi−1) using Sam-

pleSearch is directly proportional to Qi(xi|x1, . . . , xi−1). Namely,

Ii(xi|x1, . . . , xi−1) ∝ Qi(xi|x1, . . . , xi−1)

Proof. The proof is obtained by deriving a general expression for Ii(xi|x1, . . . , xi−1), sum-

ming the probabilities of all events that can lead to this desired partial sample. Consider

a globally consistent partial assignment xi−1 = (x1, . . . , xi−1). Let us assume that the

domain of the next variable Xi given xi−1, denoted by Dxi−1

i is partitioned into Dxi−1

i =

96

Rxi−1

i ∪ Bxi−1

i where Rxi−1

i = xi ∈ Dxi−1

i |(x1, . . . , xi−1, xi) is globally consistent and

Bxi−1

i = Dxi−1

i \ Rxi−1

i .

We introduce some notation. Let Bxi−1

i = xi,1, . . . , xi,q. Let j = 1, . . . , 2q index the

sequence of all subsets of Bxi−1

i with Bxi−1

i,j denoting the j-th element of this sequence.

Let π(Bxi−1

i,j ) denote the sequence of all permutations of Bxi−1

i,j with πk(Bxi−1

i,j ) denoting the

k-th element of this sequence. Finally, let Pr(πk(Bxi−1

i,j ), xi|xi−1) be the probability of

generating xi and πk(Bxi−1

i,j ) given xi−1.

The probability of sampling xi ∈ Rxi−1

i given xi−1 is:

Ii(xi|xi−1) =2q∑

j=1

|π(Bxi−1i,j )|∑

k=1

Pr(πk(Bxi−1

i,j ), xi|xi−1) (3.2)

where, Pr(πk(Bxi−1

i,j ), xi|xi−1) is given by:

Pr(πk(Bxi−1

i,j ), xi|xi−1) = Pr(πk(Bxi−1

i,j )|xi−1)Pr(xi|πk(Bxi−1

i,j ),xi−1) (3.3)

Substituting Equation 3.3 in Equation 3.2, we get:


j=1


k=1

Pr(πk(Bxi−1

i,j )|xi−1)Pr(xi|πk(Bxi−1

i,j ),xi−1) (3.4)

where Pr(xi|πk(Bxi−1

i,j ),xi−1) is given by an expression that takes the modified proposal

distribution into account.

Pr(xi|πk(Bxi−1

i,j ),xi−1) =Qi(xi|xi−1)

1−∑x′i∈B

xi−1i,j

Qi(x′i|xi−1)(3.5)

97

From Equations 3.4 and 3.5, we get:


j=1


k=1

Qi(xi|xi−1)

1−∑x′i∈B

xi−1i,j

Qi(x′i|xi−1)Pr(πk(B

xi−1

i,j )|xi−1) (3.6)

Qi(xi|xi−1) does not depend on the indices j and k in Equation 3.6 and therefore we can

rewrite Equation 3.6 as:

Ii(xi|xi−1) = Qi(xi|xi−1)

2q∑

j=1


k=1

Pr(πk(Bxi−1

i,j )|xi−1)

1−∑x′i∈B

xi−1i,j

Qi(x′i|xi−1)

(3.7)

The term enclosed in brackets in Equation 3.7 does not depend on xi and therefore it

follows that if (x1, . . . , xi−1, xi) is globally consistent:

Ii(xi|xi−1) ∝ Qi(xi|xi−1) (3.8)

which is what we wanted to prove.

We now have the necessary components to prove Theorem 9:

Proof of Theorem 9. From Proposition 9, Ii(xi|xi−1) equals zero iff xi is not globally con-

sistent and from Proposition 10, for all other values, Ii(xi|xi−1) ∝ Qi(xi|xi−1). Therefore,

the normalization constant equals 1−∑x′i∈B

xi−1i

Qi(x′i|xi−1). Consequently,

Ii(xi|xi−1) =Qi(xi|xi−1)

1−∑x′i∈B

xi−1i

Qi(x′i|xi−1)(3.9)

The right hand side of Equation 3.9 is by definition equal to QFi (xi|xi−1) (see Equation

3.1).

98

Computing QF (x)

Given a set of i.i.d. samples (x1 = (x11, . . . , x

1n), . . . ,x

N = (xN1 , . . . , xNn )) generated by

SampleSearch from the backtrack free distribution QF of Q w.r.t. C, all we need in order

to compute the estimates ZN and PN(xi) is to compute the appropriate weights so that we

can substitute them in Equations 2.2 and 2.4. To derive these weights, we need to know the

probability QF (x) of the given sample x.

From Equation 3.1, we see that to compute QFi (Xi|xi−1), we have to determine the set

Bxi−1

i = x′i ∈ Dxi−1

i |(x1, . . . , xi−1, x′i) is not globally consistent . One way to accomplish

that, as described in Algorithm 7 in Chapter 2 is to use a complete CSP oracle. The oracle

should be invoked a maximum of n× (d− 1) times where n is the number of variables and

d is the maximum domain size. Methods such as adaptive consistency [29] (see Algorithm

8 in Chapter 2) or an exact CSP solver can be used as oracles. But then, we have gained

nothing by SampleSearch if ultimately we need to use the oracle almost the same number

of times as the sampling method presented in Algorithm 7. We will next show how to

mitigate this computational problem by approximating the backtrack-free probabilities on

the fly during the execution of SampleSearch. We will show that this process still maintains

some desired statistical guarantees.

3.2.2 Approximating QF (x)

During the process of generating the sample x, SampleSearch may have discovered one or

more values in the set Bxi−1

i and thus we can build an approximation for QFi (xi|xi−1) as

follows. Let Axi−1

i ⊆ Bxi−1

i be the set of values in the domain of Xi that were proved to be

inconsistent given xi−1 while generating a sample x. We use the set Axi−1

i to compute an

99

approximation T Fi (xi|xi−1) of QFi (xi|xi−1) as follows:

T Fi (xi|xi−1) =Qi(xi|xi−1)

1−∑x′i∈A

xi−1i

Qi(x′i|xi−1)(3.10)

Finally we compute T F (x) =∏n

i=1 TFi (xi|xi−1). However, T F (x) does not guarantee

asymptotic unbiasedness when replacing QF (x) for computing the weight wF (x) in Equa-

tion 2.3.

To remedy the situation, we can store each sample (x1, . . . , xn) and all its partial assign-

ments (x1, . . . , xi−1, x′i) that were proved inconsistent during each trace of an independent

execution of SampleSearch called DFS-traces (for example, Figure 3.2 shows the five DFS-

traces that could generate the sample (A = 0, B = 2, C = 0)). After executing Sample-

Search N times generating N samples, we can use all the stored DFS-traces to compute an

approximation of QF (x) as illustrated in the following example.

EXAMPLE 10. Consider the three traces given in Figure 3.3 (a). We can combine the

information from the three traces as shown in Figure 3.3(b). Consider the assignment

(A = 1, B = 2). The backtrack-free probability of generating B = 2 given A = 1 requires

the knowledge of all the values of B which are inconsistent. Based on the combined traces,

we know thatB = 0 andB = 1 are inconsistent (givenA = 1) but we do not know whether

B = 3 is consistent or not because it is not explored (indicated by ”???” in Figure 3.3(b)).

Setting the unexplored nodes to either inconsistent or consistent give us the two different

approximations as shown in Figure 3.3(c).

Generalizing Example 10, we consider two bounding approximations denoted by UFN and

LFN respectively (the approximation is indexed to denote dependence onN ) which are based

on setting each unexplored node in the combined N traces to consistent or inconsistent

respectively. As we will show, these approximations can be used to bound the sample

100

Root

A=1

B=0 B=2

Root

A=1

B=1 B=2

Root

A=0

B=0 B=2

0.7 0.7 0.1

0.3 0.2 0.4 0.2 0.3 0.2

C=0 C=1 C=0

0.7 0.70.3

(a)

Root

A=0

B=0 B=2

A=1

B=1 B=3 B=0 B=2B=1 B=3

?????????

0.1 0.7

0.30.4

0.2 0.10.3

0.4 0.20.1

C=0 C=0 C=1

?????????0.7 0.7 0.3

(b)

A=1

B=0 B=2B=1 B=3

0

0 10

Approximation with “???”

nodes set to inconsistent

A=1

B=0 B=2B=1 B=3

0

0 .66.33

Approximation with “???”

nodes set to consistent

(c)

Figure 3.3: (a) Three DFS-traces (b) Combined information from the three DFS-traces

given in (a) and (c) Two possible approximations of I(B|A = 1)

101

mean ZN from above and below respectively 1.

DEFINITION 34 (Upper and Lower Approximations of QF by UFN and LFN ). Given a

mixed networkM = 〈X,D,F,C〉, an initial proposal distribution Q = Q1, . . . , Qn, a

combined sample tree generated from N independent runs of SampleSearch and a partial

sample xi−1 = (x1, . . . , xi−1) generated in one of the N independent runs, we define two

sets:

• Axi−1

N,i ⊆ Bxi−1

i = xi ∈ Dxi−1

i | (x1, . . . , xi−1, xi) was proved to be inconsistent

during the N independent runs of SampleSearch .

• Cxi−1

N,i ⊆ Dxi−1

i = xi ∈ Dxi−1

i | (x1, . . . , xi−1, xi) was not explored during the N

independent runs of SampleSearch .

We can set all the nodes in Cxi−1

N,i (i.e. the nodes which are not explored) either to consistent

or inconsistent yielding:

UFN (x) =

n∏

i=1

UFN,i(xi|xi−1) where

UFN,i(xi|xi−1) =

Qi(xi|xi−1)

1−∑x′i∈A

xi−1N,i

Qi(x′i|xi−1)(3.11)

LFN(x) =n∏

i=1

LFN,i(xi|xi−1) where

LFN,i(xi|xi−1) =Qi(xi|xi−1)

1−∑x′i∈A

xi−1N,i

∪Cxi−1N,i

Qi(x′i|xi−1)(3.12)

1Note that it is easy to envision other approximations in which we designate some unexplored nodes

as consistent while others as inconsistent based on the domain knowledge or via some other Monte Carlo

estimate. We consider the two extreme options because they usually work well in practice and bound the

sample mean from above and below.

102

It is clear that as N grows, the sample tree grows and therefore more inconsistencies will

be discovered and as N →∞, all inconsistencies will be discovered making the respective

sets approach Axi−1

N,i = Bxi−1

i and Cxi−1

N,i = φ. Clearly then,

PROPOSITION 11. limN→∞ UFN (x) = limN→∞ LFN(x) = QF (x)

As before, given a set of i.i.d. samples (x1 = (x11, . . . , x

1n), . . . ,x

N = (xN1 , . . . , xNn )) gen-

erated by SampleSearch, we can estimate the weighted counts Z using the two statistics

UFN (x) and LFN(x) by:

ZUN =

1

N

N∑

k=1

∏mi=1 Fi(x

k)∏p

j=1Cj(xk)

UFN (xk)

=1

N

N∑

k=1

wUN(xk) (3.13)

where

wUN(xk) =

∏mi=1 Fi(x

k)∏p

j=1Cj(xk)

UFN (xk)

is the weight of the sample based on the combined sample tree using the upper approxima-

tion UFN .

ZLN =

1

N

N∑

k=1

∏mi=1 Fi(x

k)∏p

j=1Cj(xk)

LFN(xk)=

1

N

N∑

k=1

wLN(xk) (3.14)

where

wLN(xk) =

∏mi=1 Fi(x

k)∏p

j=1Cj(xk)

LFN(xk)

is the weight of the sample based on combined sample tree using the lower approximation

LFN .

Similarly, for marginals, we can develop the statistics.

PUN (xi) =

∑Nk=1w

UN(xk)δxi

(xk)∑N

k=1wUN(xk)

(3.15)

103

and

PLN(xi) =

∑Nk=1w

LN(xk)δxi

(xk)∑N

k=1wLN(xk)

(3.16)

Clearly,

THEOREM 10. ZLN ≤ ZN ≤ ZU

N .

Proof. Because, Bxi−1

i ⊆ Axi−1

N,i ∪Cxi−1

N,i , we have:

∑

x′i∈Bxi−1i

Qi(x′i|xi−1) ≤

∑

x′i∈Axi−1N,i

∪Cxi−1N,i

Qi(x′i|xi−1) (3.17)

∴ 1−∑

x′i∈Bxi−1i

Qi(x′i|xi−1) ≥ 1−

∑

x′i∈Axi−1N,i

∪Cxi−1N,i

Qi(x′i|xi−1) (3.18)

∴

Qi(xi|xi−1)

1−∑x′i∈B

xi−1i

Qi(x′i|xi−1)≤ Qi(xi|xi−1)

1−∑x′i∈A

xi−1N,i

∪Cxi−1N,i

Qi(x′i|xi−1)(3.19)

∴ QFi (xi|xi−1) ≤ LFN,i(xi|xi−1) (3.20)

∴

n∏

i=1

QFi (xi|xi−1) ≤

n∏

i=1

LFN,i(xi|xi−1) (3.21)

∴ QF (x) ≤ LFN(x) (3.22)

∴

∏mi=1 Fi(x)

∏pj=1Cj(x)

QF (x)≥

∏mi=1 Fi(x)

∏pj=1Cj(x)

LFN(x)(3.23)

∴ wF (x) ≥ wFL (x) (3.24)

∴

1

N

N∑

k=1

wF (xk) ≥ 1

N

N∑

k=1

wFL (xk) (3.25)

∴ ZN ≥ ZLN (3.26)

Similarly, by using Axi−1

N,i ⊆ Bxi−1

i , it is easy to prove that ZFN ≤ ZU

N .

We can prove that:

THEOREM 11. The estimates ZUN and ZL

N of Z given in Equations 3.13 and 3.14 respec-

tively are asymptotically unbiased. Similarly, the estimates PUN (xi) and P

LN(xi) of P (xi)

given in Equations 3.15 and 3.16 respectively are asymptotically unbiased.

104

Proof. From Proposition 11, it follows that UFN and LFN in the limit of infinite samples

coincide with the backtrack-free distribution QF . Therefore,

limN→∞

wLN(x) = limN→∞

∏mi=1 Fi(x)

∏pj=1Cj(x)

LFN(x)(3.27)

=

∏mi=1 Fi(x)

∏pj=1Cj(x)

QF (x)(3.28)

= wF (x) (3.29)

Therefore,

limN→∞

EQ

[1

N

N∑

k=1

wL(x)

]= lim

N→∞

1

N

∑

x∈X

wLN(x)Q(x)N∑

k=1

(1) (3.30)

=1

N×N lim

N→∞

∑

x∈X

wLN(x)Q(x) (3.31)

=∑

x∈X

wF (x)Q(x) . . . (From Equation 3.29) (3.32)

= Z (3.33)

Similarly, we can prove that the estimator based on UFN in Equation 3.13 is asymptotically

unbiased by replacing wLN(x) with wUN(x) in Equations 3.30-3.33.

Finally, because the estimates PUN (xi) and PL

N(xi) of P (xi) given in Equations 3.15 and

3.16 respectively are ratios of two asymptotically unbiased estimators, by definition, they

are asymptotically unbiased too.

PROPOSITION 12. Given N samples output by SampleSearch for a mixed networkM =

〈X,D,F,C〉, the space and time complexity of computing ZLN , Z

UN , P

LN(xi) and PU

N (xi)

given in Equations 3.14, 3.13, 3.16 and 3.15 is O(N × d× n).

Proof. Because we store all full solutions (x1 , . . . , xn) and all partial assignments (x1 , . . .

, xi−1 , x′i) that were proved inconsistent during the N executions of SampleSearch, we

105

require an additionalO(N×n×d) space to store the combined sample tree used to estimate

Z and the marginals. Similarly, because we compute a sum or their ratios by visiting all

nodes of this combined sample tree, the time complexity is also O(N × d× n)

In summary, we presented two approximations to the backtrack-free probability QF which

bound the sample mean ZN . We proved that the two approximations yield an asymptoti-

cally unbiased estimate of the weighted counts and marginals. They will also enable trading

bias with variance as we discuss next.

Bias-Variance Tradeoff

As pointed in Section 1.4.2, the mean squared error of an estimator can be reduced by

either controling the bias or by increasing the number of samples. The estimators ZUN and

ZLN have more bias than the unbiased estimator ZF

N (which has a bias of zero but requires an

exact CSP solver). However, we expect the estimators ZUN and ZL

N to allow larger sample

size than ZFN . Moreover, ZU

N and ZLN bound ZF

N from above and below and therefore the

absolute distance |ZUN − ZL

N | can be used to estimate their bias. If |ZUN − ZL

N | is small

enough, then we can expect ZUN and ZL

N to perform better than ZFN because they can be

based on a larger sample size.

3.2.3 Incorporating Advanced Search Techniques in SampleSearch

Theorem 9 is applicable to any search procedure that is systematic i.e. once the search

procedure encounters an assignment (x1, . . . , xi), it will either prove that the assignment

is inconsistent or return with a full consistent sample extending (x1, . . . , xi). Therefore,

we can use any advanced systematic search technique [29] instead of naive backtracking

within SampleSearch and easily show that:

106

PROPOSITION 13. Given a mixed networkM = 〈X,D,F,C〉 and an initial proposal distri-

butionQ = Q1, . . . , Qn, SampleSearch augmented with any systematic advanced search

technique generates independent and identically distributed samples from the backtrack-

free probability distribution QF of Q w.r.t. C.

While advanced search techniques would not change the sampling distribution of Sam-

pleSearch, they can have a significant impact on its time complexity. In particular, since

SAT solvers are very efficient, we can represent the constraints in the mixed network using

CNF expressions 2 and use minisat [125] as our SAT solver. However, we have to make

minisat [125] (or any other state-of-the-art SAT solver like RSAT [108]) systematic via the

following changes:

• Turn off random restarts and far backtracks The use of restarts and far backtracks

makes a SAT solver non-systematic and therefore they cannot be used.

• Change variable and value Ordering We change the variable ordering to respect

the structure of the input proposal distribution Q, namely given Q(X) =∏n

i=1 Qi

(Xi| X1 , . . . , Xi−1), we order variables as (X1, . . . , Xn). Also, at each decision

point, variable Xi is assigned a value xi by sampling it from Qi(Xi |x1 , . . . , xi−1).

3.3 Empirical Evaluation

In this section, we describe results of our extensive empirical study evaluating the perfor-

mance of SampleSearch and comparing with other approaches. We conducted experiments

on three tasks: (a) counting models of SAT formula, (b) computing probability of evidence

and partition function in a Bayesian and Markov network respectively and (c) computing

2 It is easy to convert any (relational) constraint network to a CNF formula (see [128] for details). In our

implementation, we use the direct encoding described in [128].

107

1. Satisfiability Model

counting

1. Latin Square instances

2. Langford instances

3. FPGA-routing instances

1. SampleSearch

2. ApproxCount

3. SampleCount

4. RELSAT

1. Probability of evidence

in a Bayesian network

2. Partition function of a

Markov network

1. Linkage instances

2. Relational instances

1. SampleSearch

2. Edge Deletion Belief Propagation

3. Variable Elimination and

Conditioning

Posterior marginal

computation on

1. Bayesian networks

2. Markov networks

1. SampleSearch


3. Iterative Join Graph Propagation

4. Evidence Pre-propagated

Importance sampling

1. Linkage instances

2. Relational instances

3. Logistics planning

instances

4. Grid networks

Benchmarks Tasks Competing schemes

Figure 3.4: Chart showing the scope of our experimental study

posterior marginals in a Bayesian and Markov network. The benchmarks and competing

techniques for each task type are shown in Figure 3.4.

In the next subsection, we present implementation details of SampleSearch and other com-

peting schemes. In Subsection 3.3.3, we focus on the three weighted counting tasks while

in Subsection 3.3.4, we focus on the posterior marginals task.

3.3.1 SampleSearch with w-cutset and IJGP

In our experiments, we show how SampleSearch operates within the framework of w-

cutset sampling [7] (see Section 1.4.4). Thus, we run SampleSearch only on a subset of

the variables that form a w-cutset. We use IJGP to generate a proposal distribution and we

used minisat [125] as a search scheme within SampleSearch. Below, we outline the details.

• The Proposal distribution: As in IJGP-sampling (see Algorithm 6 in Chapter 2),

we obtain Q = Q1, . . . , Qn from the output of Iterative Join graph propagation

108

Algorithm 12: SampleSearch: Implementation Details

Input: A mixed networkM = 〈X,D,F,C〉, integers i, N and w.

Output: A set of N samples globally consistent with respect to C

Create a min-fill ordering o = (X1, . . . , Xn).;1

Create a join-graph JG with i-bound i along o using the join-graph structuring2

algorithm given in Algorithm 2 in Chapter 1 and run IJGP on it.;

Create a w-cutset K ⊆ X. Let K = K1, . . . , Kt;3

Create a proposal distribution Q(K) =∏t

i=1Qi(Ki|K1, . . . , Ki−1) from the4

messages and functions in JG using the max heuristic described in Section 2.4 of

Chapter 2. To recap, in this scheme, we find a cluster A in JG that mentions Ki and

has the largest number of variables common with the previous variables

K1, . . . , Ki−1. Then, we construct Qi(Ki|K1, . . . , Ki−1) by marginalizing out all

variables not mentioned in K1, . . . , Ki from the marginal over the variables of A;

for i=1 to N do do5

Apply minisat based SampleSearch on the mixed network restricted to K with6

proposal distribution Q(K) to get a sample k.;

Output k.;7

(IJGP) [33]. Recall that IJGP is a generalized belief propagation [132] technique

for approximating the posterior distribution in graphical models (for more details see

Section 1.4.1). It runs the same message passing as join-tree propagation on a join

graph rather than a tree. The time and space complexity of IJGP can be controlled by

an i-bound (IJGP is exponential in its i-bound) and its accuracy generally increases

with the i-bound. In our experiments, for every instance, we select the maximum

i-bound that can be accommodated by 512 MB of space as follows.

Note that the space required by a message (or a function) is the product of the domain

sizes of the variables in its scope. Given an i-bound, we can create a join graph using

the join graph structuring scheme outlined in Algorithm 2 in Chapter 1 and compute,

in advance, the space required by IJGP by summing over the space required by indi-

vidual messages 3. We iterate from i = 1 until the space bound is surpassed. This

ensures that IJGP terminates in a reasonable amount of time and requires bounded

space.

3Note that we can do this without constructing the messages explicitly

109

• w-cutset sampling: We combined SampleSearch with w-cutset sampling [7] (see

Section 1.4.4) as follows. We partition the variables X into two sub-sets K and R

such that the treewidth of the graphical model restricted to R after removing K is

bounded by w. We only sample variables in K using a proposal distribution Q(K)

(see Equation 1.43 in Chapter 1). The weight of each sample K = k is then given by

a ratio between the weighted counts of the sub-problem over R given k and Q(K).

The exact counts can be computed using an exact inference technique such as bucket

elimination. It was demonstrated that the higher the w-bound [7], the lower the

sampling variance. Here also, we select the maximum w such that the resulting

bucket elimination algorithm uses less than 512 MB of space. We can choose the

appropriate w by using a similar iterative scheme to the one described above.

• Variable Ordering: We use the min-fill ordering for creating the join-graph for IJGP

because it was shown to be a good heuristic for finding tree decompositions with low

treewidth. Sampling is performed in reverse min-fill ordering.

The details of SampleSearch embedded in w-cutset sampling and IJGP-based proposal

is given in Algorithm 12. The algorithm takes as input a mixed network and integer i,

w and N which specify the i-bound for IJGP, w for creating a w-cutset and the num-

ber of samples N respectively 4. In steps 1-2, the algorithm creates a join graph along

the min fill ordering and runs IJGP. Then, in step 3, it computes a w-cutset K for the

mixed network. In step 4, the algorithm creates a proposal distribution over the w-cutset K,

Q(K) =∏t

i=1Qi(Ki|K1, . . . , Ki−1) from the output of IJGP using the max-heuristic de-

scribed in Section 2.4 in Chapter 2. Finally, the algorithm executes minisat based Sample-

Search on the mixed network restricted to the w-cutset to generate the required N samples.

Note that we will refer to the estimates of SampleSearch generated using the upper and

lower approximations of the backtrack-free probability given by Equations 3.13 and 3.14,

4This is done after we determine the i-bound and the w for the w-cutset

110

as SampleSearch/UB and SampleSearch/LB respectively (or SS/UB or SS/LB in short).

3.3.2 Other Competing Schemes

We experimented also with the following schemes.

1. Iterative Join Graph Propagation (IJGP)

In our experiments, we used an anytime version of IJGP (for more details see Section 1.4.1)

in which we start with i-bound of 1, run IJGP until convergence or 10 iterations whichever

is earlier. Then we increase the i-bound by one and reconstruct the join graph. We do

this until one the following conditions is met: (a) i equals the treewidth in which case

IJGP yields exact marginals or (b) the 2 GB space limit is reached or (c) the prescribed

time-bound is reached.

3. ApproxCount and SampleCount

Wei and Selman [131] introduced an approximate solution counting scheme called Ap-

proxCount. ApproxCount is based on the formal result of [126] that if one can sample

uniformly (or close to it) from the set of solutions of a SAT formula F , then one can ex-

actly count (or approximate with a good estimate) the number of solutions of F . Consider a

SAT formula F with S solutions. If we are able to sample solutions uniformly, then we can

exactly compute the fraction of the number of solutions, denoted by γ that have a variable

X set to True or 1 (and similarly to False or 0). If γ is greater than zero, we can set X

to that particular value and simplify F to F ′. The estimate of the number of solutions is

now equal to the product of 1γ

and the number of solutions of F ′. Then, we recursively

repeat the process, leading to a series of multipliers, until all variables are assigned a value

or until the conditioned formula is easy for exact model counters like cachet [119]. To

reduce the variance, [131] suggest to set the selected variable to a value that occurs more

111

often in the sample. In this scheme, the fraction for each variable branching is selected

via a solution sampling method called SampleSat [130], which is an extension of the well-

known local search SAT solver Walksat [120]. We experimented with an anytime version

of ApproxCount in which we report the cumulative average accumulated over several runs.

SampleCount [62] employs ApproxCount to yield a probabilistic lower bound on the num-

ber of solutions. SampleCount reduces the variance in ApproxCount by branching on vari-

ables which are more balanced i.e. variables having multipliers close to 2. Unlike Ap-

proxCount, SampleCount assigns a value to a variable by sampling it with probability 0.5

yielding an unbiased estimate of the solution counts. We experimented with an anytime

version of SampleCount in which we report the unbiased cumulative averages over several

runs.

In our experiments, we used an implementation of ApproxCount and SampleCount avail-

able from the respective authors [130, 62]. Following the recommendations made in pre-

vious work by the authors [62], we use the following parameters for ApproxCount and

SampleCount: (a) Number of samples for SampleSat NS = 20, (b) Number of variables

remaining to be assigned a value before running Cachet NR = 100 and (c) local search

cutoff α = 100K.

4. Evidence Pre-propagation Importance sampling

The Evidence Pre-propagation Importance Sampling (EPIS) algorithm is an adaptive im-

portance sampling algorithm for computing marginals in Bayesian networks [133]. The

algorithm uses loopy belief propagation [106, 99] to construct the proposal distribution.

We use the implementation of EPIS included in the Smile genie library available publicly.


Edge deletion belief propagation (EDBP) [18] is an approximation algorithm for comput-

112

ing posterior marginals and for computing probability of evidence. EDBP solves exactly

a simplified version of the original problem, obtained by deleting some of the edges of

the problem graph. Deleted edges are selected based on two criteria : quality of approx-

imation and complexity of computation (tree-width reduction) which is parameterized by

an integer k, called the k-bound. Subsequently, information loss from the lost dependen-

cies is compensated for by using several heuristic techniques (see [18] for details). The

implementation of this scheme is available from the authors [18].

6. Variable Elimination + Conditioning (VEC)

When a problem having a high treewidth is encountered, variable or bucket elimination may

be unsuitable, primarily because of its extensive memory demand. To alleviate the space

complexity, we can use the w-cutset conditioning scheme (see Section 1.3.2). Namely, we

condition or instantiate enough variables (or the w-cutset) so that the remaining problem

after removing the instantiated variables can be solved exactly using bucket elimination

[28]. In our experiments we select the w-cutset in such a way that bucket elimination would

require less than 1.5GB of space. Exact weighted counts can be computed by summing

over the exact solution output by bucket elimination for all possible instantiations of the

w-cutset. When VEC is terminated before completion, it outputs a partial sum yielding

a lower bound on the weighted counts. The implementation of this scheme is available

publicly from our software website [30].

7. Relsat

Relsat [115] is an exact algorithm for counting solutions of a satisfiability problem. When

Relsat is stopped before completion, it yields a lower bound on the number of solutions.

The implementation of Relsat is available publicly from the authors web site [115].

Table 3.1 summarizes different query types that can be handled by the various solvers. A

’√

’ indicates that the algorithm is able to approximately estimate the query while a lack of

113

Problem Type SampleSearch IJGP EDBP EPIS-BN VEC SampleCount

ApproxCount

Relsat

Bayesian networks P (e)√ √ √

Markov Networks Z√ √ √

Bayesian networks Mar√ √ √ √

Markov networks Mar√ √ √

Model counting√ √

Z: partition function, P(e): probability of evidence and Mar: posterior marginals.

Table 3.1: Query types handled by various solvers.

√indicates otherwise.

3.3.3 Results for Weighted Counts

Satisfiability instances

For the task of weighted counts, we evaluate the algorithms on formulas from three do-

mains: (a) normalized Latin square problems, (b) Langford problems, (c) FPGA-Routing

instances. We ran each algorithm for 10 hours on each instance. We will refer to the

implemented algorithms as solvers.

Notation in Tables

The first column in each table (see Table 3.2 for example) gives the name of the instance.

The second column provides various statistical information about the instance such as:

(i) number of variables (n), (ii) average domain size (d), (iii) number of clauses (c) or

number of evidence variables (e) and (iv) the treewidth of the instance (w) (computed

using the min-fill heuristic). The third column provides the exact answer for the problem

if available while the remaining columns display the actual output produced by the various

solvers when terminated at the specified time-bound. The solver(s) giving the best results

is highlighted in each row. A “*” next to the output of a solver indicates that it solved the

114

Problem 〈n, k, c, w〉 Exact Sample Approx REL SS SS

Count Count SAT /LB /UB

ls8-norm.cnf 〈512, 2, 5584, 255〉 5.40E11 5.15E+11 3.52E+11 2.44E+08 5.91E+11 5.91E+11




ls12-norm.cnf 〈1728, 2, 28968, 1044〉 2.23E+43 6.54E+39 1.26E+08 1.25E+43 1.25E+43

ls13-norm.cnf 〈2197, 2, 40079, 1558〉 3.20E+54 1.41E+50 9.32E+07 1.15E+55 1.15E+55

ls14-norm.cnf 〈2744, 2, 54124, 1971〉 5.08E+65 5.96E+60 3.71E+07 1.24E+70 1.24E+70

ls15-norm.cnf 〈3375, 2, 71580, 2523〉 3.12E+79 2.82E+72 2.06E+07 2.03E+83 2.03E+83

ls16-norm.cnf 〈4096, 2, 92960, 2758〉 7.68E+95 2.04E+85 X 2.08E+98 2.08E+98

Table 3.2: Solution counts output by SampleSearch, ApproxCount, SampleCount and Rel-

sat after 10 hours of CPU time for Latin Square instances. When the exact counts are not

known, the entries for SampleCount and SS/LB contain the lower bounds computed by

combining their respective sample weights with the Markov inequality based Average and

Martingale schemes.

problem exactly while a “X” indicates that no solution was output.

When the exact weighted counts are not known, we compare the lower bounds on the

weighted counts Z obtained by combining the estimates output by SampleSearch/LB and

SampleCount with the Markov inequality based average and Martingale schemes; which

we will present in Chapter 5. These lower bounding schemes take as input: (a) a set of

unbiased sample weights and (b) a real number 0 < α < 1 and output a lower bound on

Z that is correct with probability greater than α. In our experiments, we set α = 0.99.

SampleSearch/UB cannot be used to lower bound Z because it outputs upper bounds on

the unbiased sample weights (except when its estimate equals the one output by Sample-

Search/LB). Likewise, ApproxCount cannot be used to lower bound Z because it is not

unbiased. Finally, note that Relsat and VEC always yield a lower bound on the weighted

counts (with probability one). When we compare lower bounds, the higher the lower bound,

the better the solver is.

Latin Square instances

Our first set of benchmark instances come from the normalized Latin squares domain. A

Latin square of order s is an s × s table filled with s numbers from 1, . . . , s in such a

way that each number occurs exactly once in each row and exactly once in each column.

115

In a normalized Latin square the first row and column are fixed. The task here is to count

the number of normalized Latin squares of a given order. The Latin squares were modeled

as SAT formulas using the extended encoding given in [61]. The exact counts for these

formulas are known up to order 11 (see [114] for details).

Table 3.2 shows the results. ApproxCount and Relsat underestimate the counts by sev-

eral orders of magnitude. On the other hand, SampleSearch/UB, SampleSearch/LB and

SampleCount yield very good estimates close to the true counts. The estimates of Sample-

Count are only slightly worse than SampleSearch/UB and SampleSearch/LB. The counts

output by SampleSearch/UB and SampleSearch/LB are the same for all instances indicat-

ing that the sample mean is accurately estimated by the upper and lower approximations

of the backtrack-free distribution (see the discussion on bias versus variance in Subsection

3.2.1). On Latin squares of size 12 through 15, the exact counts are not known and we

compare lower bounding ability of SampleSearch/LB and SampleCount. From Table 3.2,

we can see that SampleSearch/LB yields far better lower bounds than SampleCount as the

problem size increases.

Finally, in Figure 3.5 (a) and (b), we show how the approximations change with time for

the two largest instances for which the exact counts are known. Here, we can clearly see

the superior convergence of SampleSearch/LB, SampleSearch/UB and SampleCount over

other approaches.

Langford instances

Our second set of benchmark instances come from the Langford’s problem domain. The

problem is parameterized by its (integer) size denoted by s. Given a set of s numbers

1, 2, . . . , s, the problem is to produce a sequence of length 2s such that each i ∈ 1,

2, . . . , s appears twice in the sequence and the two occurrences of i are exactly i apart

from each other. This problem is satisfiable only if n is 0 or 3 modulo 4. We encoded

116

1e+06

1e+08

1e+10

1e+12

1e+14

1e+16

1e+18

1e+20

1e+22

1e+24

1e+26

0 5000 10000 15000 20000 25000 30000 35000 40000

Num

ber

of

Solu

tions

Time in seconds

Solution Counts vs Time for ls10-norm.cnf

ExactSampleCount

ApproxCountRelsat

SS/LBSS/UB

(a)

100000

1e+10

1e+15

1e+20

1e+25

1e+30

1e+35

0 5000 10000 15000 20000 25000 30000 35000 40000

Num

ber

of

Solu

tions

Time in seconds

Solution Counts vs Time for ls11-norm.cnf

ExactSampleCount

ApproxCountRelsat

SS/LBSS/UB

(b)

Figure 3.5: Time versus solution counts for two sample Latin square instances.

117

Problem 〈n, k, c, w〉 Exact Sample Approx REL SS SS

Count Count SAT /LB /UB

lang12 〈576, 2, 13584, 383〉 2.16E+5 1.93E+05 2.95E+04 2.16E+05 2.16E+05 2.16E+05

lang16 〈1024, 2, 32320, 639〉 6.53E+08 5.97E+08 8.22E+06 6.28E+06 6.51E+08 6.99E+08

lang19 〈1444, 2, 54226, 927〉 5.13E+11 9.73E+10 6.87E+08 8.52E+05 6.38E+11 7.31E+11

lang20 〈1600, 2, 63280, 1023〉 5.27E+12 1.13E+11 3.99E+09 8.55E+04 2.83E+12 3.45E+12

lang23 〈2116, 2, 96370, 1407〉 7.60E+15 7.53E+14 3.70E+12 X 4.17E+15 4.19E+15

lang24 〈2304, 2, 109536, 1535〉 9.37E+16 1.17E+13 4.15E+11 X 8.74E+15 1.40E+16

lang27 〈2916, 2, 156114, 1919〉 4.38E+16 1.32E+14 X 2.41E+19 2.65E+19

Table 3.3: Solution counts output by SampleSearch, ApproxCount, SampleCount and Rel-

sat after 10 hours of CPU time for Langford’s problem instances. When the exact counts

are not known, the entries for SampleCount and SS/LB contain the lower bounds computed

by combining their respective sample weights with the Markov inequality based Average

and Martingale schemes.

the Langford problem as a CNF instance using the channeling SAT encoding described in

[129].

Table 3.3 presents the results. ApproxCount and Relsat severely under estimate the true

counts except on the instance of size 12 (lang12 in Table 3.3) which Relsat solves exactly.

SampleCount is inferior to SampleSearch/UB and SampleSearch/LB by several orders of

magnitude. SampleSearch/UB is slightly better than SampleSearch/LB. Unlike the Latin

square instances, the solution counts output by SampleSearch/LB and SampleSearch/UB

are different for large problems but the difference is small. The exact counts for the Lang-

ford instance of size 27 is not known (see [96] for details). For this instance, we compare

the lower bounds. We observe that the lower bound output by SampleSearch/LB are better

by three orders of magnitude as compared to SampleCount.

Finally, in Figure 3.6 (a) and (b), we show how the approximations change with time for

the two largest instances for which the exact counts are known. Here, we clearly see the

superior anytime performance of SampleSearch/LB and SampleSearch/UB.

118

1e+08

1e+09

1e+10

1e+11

1e+12

1e+13

1e+14

1e+15

1e+16

0 5000 10000 15000 20000 25000 30000 35000 40000

Num

ber

of

Solu

tions

Time in seconds

Solution Counts vs Time for lang23.cnf

ExactSampleCount

ApproxCountSS/LB

SS/UB

(a)

1e+11

1e+12

1e+13

1e+14

1e+15

1e+16

1e+17

0 5000 10000 15000 20000 25000 30000 35000 40000

Num

ber

of

Solu

tions

Time in seconds

Solution Counts vs Time for lang24.cnf

ExactSampleCount

ApproxCountSS/LB

SS/UB

(b)

Figure 3.6: Time versus solution counts for two sample Langford instances.

119

Problem 〈n, k, c, w〉 Exact SampleCount Relsat SS/LB

9symml gr 2pin w6 〈2604, 2, 36994, 413〉 3.36E+51 3.41E+32 3.06E+53

9symml gr rcs w6 〈1554, 2, 29119, 613〉 8.49E+84 3.36E+72 2.80E+82

alu2 gr rcs w8 〈4080, 2, 83902, 1470〉 1.21E+206 1.88E+56 1.69E+235

apex7 gr 2pin w5 〈1983, 2, 15358, 188〉 5.83E+93 4.83E+49 2.33E+94

apex7 gr rcs w5 〈1500, 2, 11695, 290〉 2.17E+139 3.69E+46 9.64E+133

c499 gr 2pin w6 〈2070, 2, 22470, 263〉 X 2.78E+47 2.18E+55

c499 gr rcs w6 〈1872, 2, 18870, 462〉 2.41E+87 7.61E+54 1.29E+84

c880 gr rcs w7 〈4592, 2, 61745, 1024〉 1.50E+278 1.42E+43 7.16E+255

example2 gr 2pin w6 〈3603, 2, 41023, 350〉 3.93E+160 7.35E+38 7.33E+160

example2 gr rcs w6 〈2664, 2, 27684, 476〉 4.17E+265 1.13E+73 6.85E+250

term1 gr 2pin w4 〈746, 2, 3964, 31〉 X 2.13E+35 6.90E+39

term1 gr rcs w4 〈808, 2, 3290, 57〉 X 1.17E+49 7.44E+55

too large gr rcs w7 〈3633, 2, 50373, 1069〉 X 1.46E+73 1.05E+182

too large gr rcs w8 〈4152, 2, 57495, 1330〉 X 1.02E+64 5.66E+246

vda gr rcs w9 〈6498, 2, 130997, 2402〉 X 2.23E+92 5.08E+300

Table 3.4: Lower Bounds on Solution counts output by SampleSearch/LB, SampleCount

and Relsat after 10 hours of CPU time for FPGA routing instances. The entries for Sam-

pleCount and SS/LB contain the lower bounds computed by combining their respective

sample weights with the Markov inequality based Average and Martingale schemes.

FPGA Routing instances

The FPGA routing instances are constructed by reducing FPGA (Field Programmable Gate

Array) detailed routing problems into a satisfiability problem. The instances were gener-

ated by Gi-Joon Nam and were used in the SAT 2002 competition [123]. Table 3.4 presents

the results. The exact solution counts are not known and therefore we compare lower

bounds. Relsat usually underestimates the solution counts by several orders of magnitude.

SampleSearch/LB yields higher lower bounds than SampleCount and Relsat on 10 out of

the 15 instances. On the remaining five instances SampleCount yields higher lower bounds

than SampleSearch/LB.

Linkage networks

The Linkage networks are generated by converting biological linkage analysis data into a

Bayesian or Markov network.

120

Problem 〈n, k, e, w〉 Exact VEC EDBP SS/LB SS/UB

BN 69 〈777, 7, 78, 47〉 5.28E-054 1.93E-61 2.39E-57 3.00E-55 3.00E-55

BN 70 〈2315, 5, 159, 87〉 2.00E-71 7.99E-82 6.00E-79 1.21E-73 1.21E-73

BN 71 〈1740, 6, 202, 70〉 5.12E-111 7.05E-115 1.01E-114 1.28E-111 1.28E-111

BN 72 〈2155, 6, 252, 86〉 4.21E-150 1.32E-153 9.21E-155 4.73E-150 4.73E-150

BN 73 〈2140, 5, 216, 101〉 2.26E-113 6.00E-127 2.24E-118 2.00E-115 2.00E-115

BN 74 〈749, 6, 66, 45〉 3.75E-45 3.30E-48 5.84E-48 2.13E-46 2.13E-46

BN 75 〈1820, 5, 155, 92〉 5.88E-91 5.83E-97 3.10E-96 2.19E-91 2.19E-91

BN 76 〈2155, 7, 169, 64〉 4.93E-110 1.00E-126 3.86E-114 1.95E-111 1.95E-111

Table 3.5: Probability of evidence computed by VEC, EDBP and SampleSearch after 3

hours of CPU time for Linkage instances from the UAI 2006 evaluation.

L11p L11m

X11

L21p L21m

X21

L31p L31m

X31

S11p S11m

L12p L12m

X12

L22p L22m

X22

L32p L32m

X32

S12p S12m

Figure 3.7: A fragment of a Bayesian network used in genetic linkage analysis.

Linkage analysis is a statistical method for mapping genes onto a chromosome [102]. This

is very useful in practice for identifying disease genes. The input is an ordered list of loci

L1, . . . , Lk+1 with allele frequencies at each locus and a pedigree with some individuals

typed at some loci. The goal of linkage analysis is to evaluate the likelihood of a candidate

vector [θ1, . . . , θk] of recombination fractions for the input pedigree and locus order. The

component θi is the candidate recombination fraction between the loci Li and Li+1.

The pedigree data can be represented as a Bayesian network with three types of random

121

variables: genetic loci variables which represent the genotypes of the individuals in the

pedigree (two genetic loci variables per individual per locus, one for the paternal allele and

one for the maternal allele), phenotype variables, and selector variables which are auxiliary

variables used to represent the gene flow in the pedigree. Figure 3.7 represents a fragment

of a network that describes parents-child interactions in a simple 2-loci analysis. The ge-

netic loci variables of individual i at locus j are denoted by Li,jp and Li,jm. Variables Xi,j

, Si,jp and Si,jm denote the phenotype variable, the paternal selector variable and the ma-

ternal selector variable of individual i at locus j, respectively. The conditional probability

tables that correspond to the selector variables are parameterized by the recombination ratio

θ. The remaining tables contain only deterministic information. It can be shown that given

the pedigree data, computing the likelihood of the recombination fractions is equivalent to

computing the probability of evidence on the Bayesian network that model the problem

(for more details consult [46]).

We first evaluate the solvers on Linkage (Bayesian) networks used in the UAI 2006 eval-

uation [8]. Table 3.5 contains the results. We see that SampleSearch/UB and Sample-

Search/LB are very accurate usually yielding a few orders of magnitude improvement over

VEC and EDBP. Because the estimates output by SampleSearch/UB and SampleSearch/LB

are the same on all instances, they yield an exact approximation of the sample mean. Fig-

ures 3.8 contains time versus accuracy plots for two linkage instances. We see superior

anytime performance of both SampleSearch schemes as compared with VEC and EDBP.

In Table 3.6, we present the results on Linkage instances encoded as Markov networks.

These instances were used in the UAI 2008 evaluation [24]. We were able to compute ex-

act solutions of 11 instances out of 20 using ACE [13] (which is not an anytime scheme).

We see that VEC (as an anytime scheme) and EDBP exactly solve 10 and 6 instances re-

spectively as indicated by a ∗ in Table 3.6. The instances for which exact solutions are

known generally have smaller treewidth (< 30). On these instances, SampleSearch/LB and

122

1e-88

1e-86

1e-84

1e-82

1e-80

1e-78

1e-76

1e-74

1e-72

1e-70

0 2000 4000 6000 8000 10000 12000

Pro

bab

ilit

y o

f E

vid

ence

Time in seconds

Probability of Evidence vs Time for BN_70, num-vars= 2315

Exact VEC EDBP SS/LB SS/UB

(a)

1e-135

1e-130

1e-125

1e-120

1e-115

1e-110

1e-105

0 2000 4000 6000 8000 10000 12000

Pro

bab

ilit

y o

f E

vid

ence

Time in seconds

Probability of Evidence vs Time for BN_76, num-vars= 2155

Exact VEC EDBP SS/LB SS/UB

(b)

Figure 3.8: Convergence of probability of evidence as a function of time for two sample

Linkage instances

123

Problem 〈n, k, e, w〉 Exact SS/LB SS/UB VEC EDBP

pedigree18 〈1184, 1, 0, 26〉 7.18E-79 7.39E-79 7.39E-79 7.18E-79* 7.18E-79*

pedigree1 〈334, 2, 0, 20〉 7.81E-15 7.81E-15 7.81E-15 7.81E-15 7.81E-15*

pedigree20 〈437, 2, 0, 25〉 2.34E-30 2.31E-30 2.31E-30 2.34E-30* 6.19E-31

pedigree23 〈402, 1, 0, 26〉 2.78E-39 2.76E-39 2.76E-39 2.78E-39* 1.52E-39

pedigree25 〈1289, 1, 0, 38〉 1.69E-116 1.69E-116 1.69E-116 1.69E-116* 1.69E-116*

pedigree30 〈1289, 1, 0, 27〉 1.84E-84 1.90E-84 1.90E-84 1.85E-84* 1.85E-84*

pedigree37 〈1032, 1, 0, 25〉 2.63E-117 1.18E-117 1.18E-117 2.63E-117* 5.69E-124

pedigree38 〈724, 1, 0, 18〉 5.64E-55 3.80E-55 3.80E-55 5.65E-55* 8.41E-56

pedigree39 〈1272, 1, 0, 29〉 6.32E-103 6.29E-103 6.29E-103 6.32E-103* 6.32E-103*

pedigree42 〈448, 2, 0, 23〉 1.73E-31 1.73E-31 1.73E-31 1.73E-31* 8.91E-32

pedigree19 〈793, 2, 0, 23〉 6.76E-60 6.76E-60 1.597E-60 3.35E-60

pedigree31 〈1183, 2, 0, 45〉 2.08E-70 2.08E-70 1.67E-76 1.34E-70

pedigree34 〈1160, 1, 0, 59〉 3.84E-65 3.84E-65 2.58E-76 4.30E-65

pedigree13 〈1077, 1, 0, 51〉 7.03E-32 7.03E-32 2.17E-37 6.53E-32

pedigree40 〈1030, 2, 0, 49〉 1.25E-88 1.25E-88 2.45E-91 7.02E-17

pedigree41 〈1062, 2, 0, 52〉 4.36E-77 4.36E-77 4.33E-81 1.09E-10

pedigree44 〈811, 1, 0, 29〉 3.39E-64 3.39E-64 2.23E-64 7.69E-66

pedigree51 〈1152, 1, 0, 51〉 2.47E-74 2.47E-74 5.56E-85 6.16E-76

pedigree7 〈1068, 1, 0, 56〉 1.33E-65 1.33E-65 1.66E-72 2.93E-66

pedigree9 〈1118, 2, 0, 41〉 2.93E-79 2.93E-79 8.00E-82 3.13E-89

Table 3.6: Partition function computed by VEC, EDBP and SampleSearch after 3 hours of

CPU time for Linkage instances from the UAI 2008 evaluation.

SampleSearch/UB deviate only slightly from the true value of the partition function and

on four instances for which EDBP does not output the exact answer, the estimates output

by SampleSearch/LB and SampleSearch/UB are better than EDBP. On instances with large

treewidth (> 30) and for which exact results are not known, we compare the lower bound-

ing capability of VEC and SampleSearch. The lower bounds output by SampleSearch are

superior to VEC on all the 9 instances. EDBP does not yield guaranteed lower bounds but

we see that in 6 out of 9 instances, it yields an approximation which is smaller than the

lower bound.

Relational Instances

The relational instances are generated by grounding the relational Bayesian networks using

the primula tool (see Chavira et al. [14] for more information). We experimented with ten

Friends and Smoker networks and six mastermind networks from this domain which have

124

Problem 〈n, k, e, w〉 Exact SS/LB SS/UB VEC EDBP

fs-04 〈262, 2, 226, 12〉 1.52E-05 8.11E-06 8.11E-06 1.53E-05* X

fs-07 〈1225, 2, 1120, 35〉 9.80E-17 2.23E-16 2.23E-16 1.78E-15* X

fs-10 〈3385, 2, 3175, 71〉 7.88E-31 2.49E-32 2.49E-32 X X

fs-13 〈7228, 2, 6877, 119〉 1.33E-51 3.26E-55 3.26E-55 X X

fs-16 〈13240, 2, 12712, 171〉 8.63E-78 6.04E-79 6.04E-79 X X

fs-19 〈21907, 2, 21166, 243〉 2.12E-109 1.62E-114 1.62E-114 X X

fs-22 〈33715, 2, 32725, 335〉 2.00E-146 4.88E-147 4.88E-147 X X

fs-25 〈49150, 2, 47875, 431〉 7.18E-189 2.67E-189 2.67E-189 X X

fs-28 〈68698, 2, 67102, 527〉 9.82E-237 4.53E-237 4.53E-237 X X

fs-29 〈76212, 2, 74501, 559〉 6.81E-254 9.44E-255 9.44E-255 X X

mastermind 03 08 03 〈1220, 2, 48, 20〉 9.79E-8 9.87E-08 9.87E-08 9.79E-08* X




mastermind 04 08 04 〈2616, 2, 64, 33〉 2.20E-08 1.84E-08 1.84E-08 1.21E-08 X



mastermind 10 08 03 〈2606, 2, 48, 56〉 1.92E-07 5.01E-07 5.01E-07 7.79E-08 X

Table 3.7: Probability of evidence computed by VEC, EDBP and SampleSearch after 3

hours of CPU time for relational instances.

between 262 to 76,212 variables. Table 3.7 summarizes the results. These networks are

quite large and EDBP does not output any answer in 3 hours of CPU time. VEC solves 2

friends and smokers networks exactly while on the remaining instances, it fails to output

any answer. The estimates computed by SampleSearch/LB and SampleSearch/UB on the

other hand are very close to the exact value of probability of evidence. This shows that

SampleSearch LB/UB are more scalable than VEC and EDBP. VEC solves exactly six out

of the eight mastermind instances while on the remaining two instances VEC is worse than

SampleSearch/UB and SampleSearch/LB.

3.3.4 Results for the Posterior Marginal Tasks

We experimented with the following four benchmark domains: (a) The linkage instances

(b) The relational instances and (c) The grid instances and (d) The logistics planning in-

stances. For this task, we measure the accuracy of the solvers using average Hellinger

distance [77]. Given a mixed network with n variables, let P (Xi) and A(Xi) denote the

125

exact and approximate marginals for a variable Xi, then the Hellinger distance is defined

as:

Hellinger distance =

∑ni=1

12

∑xi∈Di

(√P (xi)−

√A(xi))

2

n(3.34)

We chose Hellinger distance because as pointed out in [77], it is superior to other choices

such as the Kullback-Leibler (KL) distance, the mean squared error and the relative error

when zero probabilities are present in the graphical model. We do not use KL distance

because it lies between 0 and ∞ and in practice when the exact marginals are close to 0,

precision errors may cause the approximation to become zero, yielding infinite KL distance.

Hellinger distance is between 0 and 1 and lower bounds the KL distance.

We did compute the error using other distance measures like the mean squared error and

the relative error. All error measures show similar trends, with Hellinger distance being the

most discriminative.

Finally, note that for the marginal task, SampleSearch/LB and SampleSearch/UB output

the same marginals for all benchmarks that we experimented with and therefore we do not

distinguish between them. This however, implies that our lower and upper approximations

of the backtrack free probability are indeed strong. Therefore, for the rest of this subsection,

we will refer to SampleSearch/LB and SampleSearch/UB as SampleSearch.

Linkage instances

We report the average Hellinger distance between exact and approximate marginals in Ta-

ble 3.8. EPIS which is also an importance sampling technique like SampleSearch does

not generate even a single consistent sample in 3 hours of CPU time and therefore its av-

erage Hellinger distance is 1. SampleSearch is more accurate than IJGP which in turn is

more accurate than EDBP on 7 out of 8 instances. We can clearly see the relationship be-

tween treewidth and the performance of propagation based and sampling based techniques.

126

0.001

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Aver

age

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for BN_70, num-vars= 2315

SampleSearchIJGP

EPISEDBP

(a)

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Aver

age

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for BN_76, num-vars= 2155

SampleSearchIJGP

EPISEDBP

(b)

Figure 3.9: Time versus Hellinger distance for two sample Linkage instances.

127

Problem 〈n,K, e, w〉 SampleSearch IJGP EPIS EDBP

BN 69 〈777, 7, 78, 47〉 9.4E-04 3.2E-02 1 8.0E-02

BN 70 〈2315, 5, 159, 87〉 2.6E-03 3.3E-02 1 9.6E-02

BN 71 〈1740, 6, 202, 70〉 5.6E-03 1.9E-02 1 2.5E-02

BN 72 〈2155, 6, 252, 86〉 3.6E-03 7.2E-03 1 1.3E-02

BN 73 〈2140, 5, 216, 101〉 2.1E-02 2.8E-02 1 6.1E-02

BN 74 〈749, 6, 66, 45〉 6.9E-04 4.3E-06 1 4.3E-02

BN 75 〈1820, 5, 155, 92〉 8.0E-03 6.2E-02 1 9.3E-02

BN 76 〈2155, 7, 169, 64〉 1.8E-02 2.6E-02 1 2.7E-02

Table 3.8: Table showing Hellinger distance of SampleSearch, IJGP, EPIS and EDBP for

Linkage instances from the UAI 2006 evaluation after 3 hours of CPU time.


fs-04 〈262, 2, 226, 12〉 5.4E-05 4.6E-08 1 6.4E-02

fs-07 〈1225, 2, 1120, 35〉 1.4E-02 1.6E-02 1 3.0E-02

fs-10 〈3385, 2, 3175, 71〉 1.2E-02 6.3E-03 1 2.7E-02

fs-13 〈7228, 2, 6877, 119〉 2.0E-02 6.5E-03 1 2.3E-02

fs-16 〈13240, 2, 12712, 171〉 1.2E-03 6.8E-03 1 1.7E-02

fs-19 〈21907, 2, 21166, 243〉 3.1E-03 8.8E-03 1 1

fs-22 〈33715, 2, 32725, 335〉 2.5E-03 8.6E-03 1 1

fs-25 〈49150, 2, 47875, 431〉 2.5E-03 8.4E-03 1 1

fs-28 〈68698, 2, 67102, 527〉 1.3E-03 7.4E-03 1 1

fs-29 〈76212, 2, 74501, 559〉 1.9E-03 7.0E-03 1 1

mastermind 03 08 03 〈1220, 2, 48, 20〉 1.1E-03 3.8E-02 1 3.8E-01

mastermind 03 08 04 〈2288, 2, 64, 30〉 1.1E-02 4.4E-02 1 1

mastermind 03 08 05 〈3692, 2, 80, 42〉 4.0E-02 3.2E-02 1 1

mastermind 04 08 04 〈2616, 2, 64, 33〉 3.1E-02 3.5E-02 1 1

mastermind 05 08 03 〈1616, 2, 48, 28〉 1.0E-02 3.6E-02 1 4.0E-02

mastermind 06 08 03 〈1814, 2, 48, 31〉 4.7E-03 3.3E-02 5.6E-01 3.2E-01

mastermind 10 08 03 〈2606, 2, 48, 56〉 3.9E-02 5.3E-02 1 8.3E-02


Relational instances after 3 hours of CPU time.

When the treewidth is small (on BN 74.uai), a propagation based scheme like IJGP is more

accurate than SampleSearch but as the treewidth increases, there is one to two orders of

magnitude difference in the Hellinger distance. Finally in Figure 3.9, we demonstrate the

superior anytime performance of SampleSearch compared with other solvers.

128

Relational Instances

As described before, the relational instances are generated by grounding the relational

Bayesian networks using the primula tool [14]. We experimented again with the 10 Friends

and Smoker networks and 6 mastermind networks from this domain. Table 3.9 shows the

Hellinger distance after 3 hours of CPU time for each solver.

On the small friends and smoker networks, fs-04 to fs-13, IJGP performs better than Sam-

pleSearch. However, on large networks which have between 13240 and 76212 variables,

and treewidth between 12712 to 74501, SampleSearch performs better than IJGP. EPIS is

not able to generate a single consistent sample in 3 hours of CPU time as indicated by

Hellinger distance of 1. EDBP is slightly worse than IJGP and runs out of memory on

large instances, as indicated by a Hellinger distance of 1.

On the mastermind networks, SampleSearch is the best performing scheme followed by

IJGP. EPIS fails to output even a single consistent sample in 3 hours on 6 out of the 7

instances. EDBP is slightly worse than IJGP on 5 out of the 6 instances. Figures 3.10

and 3.11 show the anytime performance of the solvers demonstrating clear superiority of

SampleSearch.

Grid Networks

The Grid Bayesian networks benchmarks are available from the authors of Cachet [119]. A

grid Bayesian network is a s× s grid, where there are two directed edges from a node to its

neighbors right and down. The upper-left node is a source, and the bottom-right node is a

sink. The sink node is the evidence node. The deterministic ratio p is a parameter specifying

the fraction of nodes that are deterministic (functional in this case), that is, whose values

are determined given the values of their parents. The grid instances are designated as p−s.

129


50-12-5 〈144, 2, 1, 16〉 4.3E-04 3.2E-07 2.6E-04 2.5E-02

50-14-5 〈196, 2, 1, 20〉 4.9E-04 1.8E-02 1.2E-04 4.0E-02

50-15-5 〈225, 2, 1, 23〉 4.9E-04 1.0E-02 2.3E-04 6.1E-02

50-17-5 〈289, 2, 1, 25〉 8.0E-04 2.1E-02 2.0E-04 3.6E-03

50-18-5 〈324, 2, 1, 27〉 9.3E-04 1.9E-02 3.0E-04 2.1E-03

50-19-5 〈361, 2, 1, 28〉 1.1E-03 3.4E-02 4.0E-04 3.4E-04

75-16-5 〈256, 2, 1, 24〉 6.5E-04 2.5E-07 1.7E-04 7.8E-02

75-17-5 〈289, 2, 1, 25〉 1.4E-03 2.6E-07 2.7E-04 1.2E-03

75-18-5 〈324, 2, 1, 27〉 1.2E-03 3.9E-02 2.0E-04 5.0E-03

75-19-5 〈361, 2, 1, 28〉 9.0E-03 4.3E-02 2.5E-04 6.7E-05

75-20-5 〈400, 2, 1, 30〉 6.2E-04 3.1E-07 1.9E-04 1.7E-02

75-21-5 〈441, 2, 1, 32〉 1.9E-03 2.9E-07 2.8E-04 1.5E-02

75-22-5 〈484, 2, 1, 35〉 3.2E-03 2.3E-02 2.6E-04 2.0E-02

75-23-5 〈529, 2, 1, 35〉 2.0E-03 4.8E-02 2.3E-04 2.4E-02

75-24-5 〈576, 2, 1, 38〉 8.4E-03 4.3E-02 2.6E-04 3.5E-02

75-26-5 〈676, 2, 1, 44〉 2.4E-02 5.1E-02 3.5E-04 5.1E-02

90-20-5 〈400, 2, 1, 30〉 1.6E-03 2.7E-07 2.5E-04 3.7E-02

90-22-5 〈484, 2, 1, 35〉 4.6E-04 2.8E-07 1.5E-04 5.1E-02

90-23-5 〈529, 2, 1, 35〉 2.8E-04 3.2E-07 3.9E-04 1.9E-02

90-24-5 〈576, 2, 1, 38〉 5.0E-04 3.9E-07 3.5E-04 2.8E-02

90-25-5 〈625, 2, 1, 39〉 2.6E-07 2.7E-07 3.4E-04 4.6E-02

90-26-5 〈676, 2, 1, 44〉 1.0E-03 1.9E-06 2.3E-04 3.9E-02

90-34-5 〈1156, 2, 1, 65〉 8.6E-04 1.8E-07 3.9E-04 4.1E-02

90-38-5 〈1444, 2, 1, 69〉 1.6E-02 4.3E-07 1.7E-03 1.6E-01


Grid networks after 3 hours of CPU time.

130

For example, the instance 50 − 18 indicates a grid of size 18 in which 50% of the nodes

are deterministic or functional. Table 3.10 shows the Hellinger distance after 3 hours of

CPU time for each solver. Time versus approximation error plots are shown for six sample

instances in Figures 3.12 through 3.14.

On grids with deterministic ratio of 50%, EPIS is the best performing scheme on all but

one instance. SampleSearch is second best while EDBP is slightly better than IJGP. There

is two orders of magnitude difference between SampleSearch and EDBP/IJGP while there

is one order of magnitude difference between EPIS and SampleSearch.

On grids with deterministic ratio of 75%, IJGP is best on four out of the six smaller grids

(up to size 21). EPIS dominates on the larger grids (size 22-26). Again, we see that there

is an order of magnitude difference between EPIS and SampleSearch and it gets larger as

the size of the grid increases.

On grids with deterministic ratio of 90%, IJGP is again the best performing scheme. EPIS

is slightly better than SampleSearch while EDBP is the least accurate scheme.

Note that both EPIS and SampleSearch are importance sampling schemes. The poor perfor-

mance of SampleSearch as compared to EPIS is due to the overhead of solving a satisfiabil-

ity problem via search to generate a sample. Consequently, SampleSearch generates fewer

samples than EPIS which uses relevancy-based reasoning [87] to determine the inconsis-

tencies as a pre-processing step5. This highlights one of the advantages of using inference

or reasoning to determine the inconsistencies before sampling rather than combining search

and sampling. Inference schemes are however not scalable because of their extensive time

and memory requirements when the treewidth of the constraint portion is large.

5Unfortunately, the EPIS program does not output the number of consistent samples that were used in

computing the marginals and therefore we cannot experimentally compare the actual difference in sample

size between SampleSearch and EPIS.

131


log-1.uai 〈4724, 2, 3785, 22〉 2.2E-05 0* 1 1

log-2.uai 〈26114, 2, 24777, 51〉 8.6E-04 9.8E-03 1 1

log-3.uai 〈30900, 2, 29487, 56〉 1.2E-04 7.5E-03 1 1

log-4.uai 〈23266, 2, 20963, 52〉 2.3E-02 1.8E-01 1 1

log-5.uai 〈32235, 2, 29534, 51〉 8.6E-03 1.2E-02 1 1


Logistics planning instances after 3 hours of CPU time.

Logistics Planning instances

Our last domain is that of logistics planning. Given prior probabilities on actions and facts,

the task is to compute marginal distribution of each variable. Goals and initial conditions

are observed true. Bayesian networks are generated from the plan graphs, where additional

nodes (all observed false) are added to represent mutex, action-effect and preconditions of

actions. These benchmarks are available from the authors of Cachet [119].

Table 3.11 summarizes the results. Both EPIS and EDBP fail to output any answer after 3

hours of CPU time. IJGP solves the log-1 instance exactly as indicated by a * in Table 3.11

while on the remaining instances, SampleSearch is more accurate than IJGP. Finally, in Fig-

ure 3.15, we demonstrate the superior anytime performance of SampleSearch as compared

with the other schemes.

3.3.5 Summary of Experimental Evaluation

To summarize, for satisfiability model counting, we compared SampleSearch with three

other approximate solution counters available in literature: ApproxCount [130], Sample-

Count [62] and Relsat [115] on three benchmarks: (a) Latin Square instances (b) Langford

instances and (c) FPGA-routing instances. We found that on most instances, SampleSearch

yields solution counts which are closer to the true counts by a few orders of magnitude than

those output by SampleCount and by several orders of magnitude than those output by Ap-

132

proxCount and Relsat.

For the problem of computing the probability of evidence in a Bayesian network and

the partition function in a Markov network, we compared SampleSearch with two other

schemes available in literature: Variable Elimination and Conditioning (VEC) [28], and

an advanced generalized belief propagation scheme called Edge Deletion Belief Propaga-

tion (EDBP) [18] on two benchmarks: (a) linkage analysis benchmarks and (b) relational

Bayesian network benchmarks. We found that on most instances the estimates output by

SampleSearch were closer to the exact answer than those output by EDBP. VEC solved

some instances exactly, while on the remaining instances, VEC was substantially inferior.

For the posterior marginal tasks, we experimented with linkage analysis benchmarks, par-

tially deterministic grid benchmarks, relational benchmarks and logistics planning bench-

marks. Here, we compared the accuracy of SampleSearch in terms of a distance measure

called the Hellinger distance with three other schemes: two generalized belief propagation

schemes of Iterative Join Graph Propagation [33] and Edge Deletion Belief Propagation

[18] and an adaptive importance sampling scheme called Evidence Pre-propagated Impor-

tance Sampling (EPIS) [133]. Again, we found that except on the grid instances, Sample-

Search consistently yields estimates having smaller Hellinger distance than the competi-

tion. On the grid instances, IJGP and EPIS are the best schemes in terms of the Hellinger

distance.

3.4 Conclusion

In this chapter we presented the SampleSearch scheme for performing effective importance

sampling based inference in mixed probabilistic and deterministic graphical models. It is

well known that on such graphical models, importance sampling performs quite poorly be-

133

cause of the rejection problem. SampleSearch remedies the rejection problem by interleav-

ing random sampling with systematic backtracking. Specifically, when sampling variables

one by one via logic sampling [106], instead of rejecting a sample when its inconsistency

is detected, SampleSearch backtracks to the previous variable, modifies the proposal dis-

tribution to reflect the inconsistency and continues this process until a consistent sample is

found.

We showed that SampleSearch can be viewed as a systematic search technique whose value

selection is stochastically guided by sampling from a distribution. This view enables us to

integrate any systematic SAT/CSP solver within SampleSearch (with minor modifications).

Indeed, in our experiments, we used an advanced SAT solver called minisat [125]. Thus,

advances in the systematic search community whose primary focus is solving “yes/no” type

NP-complete problems can be leveraged through SampleSearch for approximating much

harder #P-complete problems in Bayesian inference.

We showed that SampleSearch samples from the backtrack-free distribution, which is basi-

cally a modification of the proposal distribution from which all inconsistent partial assign-

ments are removed. When the backtrack-free probability is too complex to compute, we

proposed two approximations, which bound the backtrack-free probability from above and

below and yield asymptotically unbiased estimates of the weighted counts and marginals.

We demonstrated experimentally that these approximations were very accurate on most

benchmarks.

We performed an extensive empirical evaluation on several benchmark graphical models

and our results clearly demonstrate that SampleSearch is consistently superior to other

state-of-the-art schemes on domains having a substantial amount of determinism.

Specifically, on probabilistic graphical models, we showed that a state-of-the-art impor-

tance sampling technique of Evidence Pre-propagated Importance Sampling (EPIS) [133]

134

which reasons about determinism in a limited way is unable to generate a single consis-

tent sample on several hard linkage analysis and relational benchmarks. In such cases,

SampleSearch is the only alternative importance sampling technique to date.

SampleSearch is also superior to generalized belief propagation schemes like Iterative Join

Graph Propagation (IJGP) [33] and Edge Deletion Belief Propagation (EDBP) [18]. In

theory, these propagation techniques are anytime, whose approximation quality can be

improved by increasing their i-bound. However, their time and space complexity is ex-

ponential in i and in practice, beyond a certain i-bound (typically > 25), their memory

requirement becomes a major bottleneck. Consequently, as we saw, on most benchmarks

IJGP and EDBP quickly converge to an estimate which they are unable to improve with

time. SampleSearch, being an importance sampling technique improves with time, and as

we demonstrated yields superior anytime performance than IJGP and EDBP.

Finally, on the problem of counting solutions of a SAT/CSP, we showed that SampleSearch

is slightly better than the recently proposed SampleCount [62] technique and substantially

better than ApproxCount [130] and Relsat [115].

135

0.001

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Aver

age

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for fs-28, num-vars= 68698

SampleSearchIJGP

EPIS

(a)

0.001

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Aver

age

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for fs-29, num-vars= 76212

SampleSearchIJGP

EPIS

(b)

Figure 3.10: Time versus Hellinger distance for two sample Friends and Smokers networks.

136

0.001

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Aver

age

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for mastermind_03_08_03-0015, num-vars= 1220

SampleSearchIJGP

EPISEDBP

(a)

0.001

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Aver

age

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for mastermind_06_08_03-0015, num-vars= 1814

SampleSearchIJGP

EPISEDBP

(b)

Figure 3.11: Time versus Hellinger distance for two sample Mastermind networks.

137

0.0001

0.001

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Av

erag

e H

elli

nger

Dis

tance

Time in seconds

Approximation Error vs Time for 50-18-5, num-vars= 324

SampleSearchIJGP

EPISEDBP

(a)

0.0001

0.001

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Aver

age

Hel

linger

Dis

tance

Time in seconds


SampleSearchIJGP

EPISEDBP

(b)

Figure 3.12: Time versus Hellinger distance for two sample Grid instances with determin-

istic ratio=50%.

138

0.0001

0.001

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Av

erag

e H

elli

nger

Dis

tance

Time in seconds


SampleSearchIJGP

EPISEDBP

(a)

0.0001

0.001

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Aver

age

Hel

linger

Dis

tance

Time in seconds


SampleSearchIJGP

EPISEDBP

(b)


istic ratio=75%.

139

1e-07

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Aver

age

Hel

linger

Dis

tance

Time in seconds


SampleSearchIJGP

EPISEDBP

(a)

1e-07

1e-06

1e-05

0.0001

0.001

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Aver

age

Hel

linger

Dis

tance

Time in seconds


SampleSearchIJGP

EPISEDBP

(b)


istic ratio=90%.

140

0.0001

0.001

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Aver

age

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for log-2, num-vars= 26114

SampleSearchIJGP

EPIS

(a)

0.0001

0.001

0.01

0.1

1

0 2000 4000 6000 8000 10000 12000

Aver

age

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for log-3, num-vars= 30900

SampleSearchIJGP

EPIS

(b)

Figure 3.15: Time versus Hellinger distance for two sample Logistics planning instances.

141

Chapter 4

Studies in Solution Sampling

4.1 Introduction

In this chapter, we extend the SampleSearch with IJGP scheme presented in Chapter 3 for

sampling solutions uniformly from a Boolean satisfiability problem. The primary motiva-

tion for solution sampling comes from applications in the field of functional verification

[134, 31] and first order probabilistic models [113, 95].

In the functional verification field for instance, the main vehicle for the verification of

large and complex hardware designs is the simulation of a large number of random test

programs for diagnosis of faults [4, 134]. Hardware designs can be modeled as Boolean

satisfiability problems in which case the test programs are its solutions. Typically, the

number of solutions of the modeled program could be as large as 101000 and the number of

selected test programs are in the range of 105 or 106. Naturally, the best test generator is

the one that would uniformly sample the space of test programs so that every fault in the

circuit has equal probability of being tested.

142

Another application for solution sampling is to perform inference in first-order probabilis-

tic models such as Markov logic networks [113] and Bayesian logic [95]. A Markov logic

network (or MLN) is a probabilistic language which applies the ideas of a Markov network

to first-order logic. Markov logic networks (MLNs) generalize first-order logic, where

in a certain limit, all unsatisfiable statements have a probability of zero, and all entailed

formulas have probability one. Popular Markov Chain Monte Carlo (MCMC) inference

techniques in MLNs such as MC-SAT [109] assume the ability to uniformly sample so-

lutions of a SAT formula. Therefore, advances in random solution sampling would have

impact on the accuracy of inference schemes for MLNs.

Theoretically, the solution sampling task is closely related to the well-known #P-Complete

problem of counting solutions of a satisfiability problem. In fact, it is known [31, 70] that

if one can sample uniformly from the set of solutions of a satisfiability problem then one

can design a highly accurate method for counting solutions of a satisfiability problem.

Conversely, one can also design a highly efficient method to solve the solution sampling

task from an exact counting algorithm.

This general principle was exploited in [31] in which the authors show how an exact count-

ing method like Bucket Elimination can be used to solve the solution sampling task. The

main idea here is to express the uniform distributionP in a product formP = P1, . . . ,Pn

or a Bayesian network and then sample from the product form using the standard logic sam-

pling algorithm [106]. The main drawback of Bucket Elimination is that it is time and space

exponential in the treewidth and is infeasible when the treewidth is large.

Therefore, in order to make their solution sampling schemes practical, [31] propose to use

approximate solution counters based on mini-bucket elimination (MBE) [39] instead of

Bucket Elimination. In this chapter, we propose to use a more powerful scheme called

Iterative Join Graph propagation (IJGP) instead of MBE. The main idea in these MBE

and IJGP based schemes is to construct a product-form distribution Q = Q1, . . . , Qn

143

that approximates the product form P = P1, . . . ,Pn of the uniform distribution over

the solutions. However, both IJGP and MBE schemes have two issues. First, because Q

is an approximation of P , solutions generated from Q are biased. Second, just like the

proposal distribution in importance sampling, Q suffers from the rejection problem in that

it may generate non-solutions. Clearly, we can easily use the SampleSearch scheme pre-

sented in Chapter 3 to remedy rejection. However, even with SampleSearch, the problem

of generating biased solution samples still remains in that SampleSearch samples from

the backtrack-free distribution QF of Q which can be quite far from the required uniform

distribution over the solutions.

The main contribution of this chapter is in correcting this deficiency by augmenting Sam-

pleSearch with statistical techniques of Sampling/Importance Resampling (SIR) and the

Metropolis-Hastings (MH) method yielding SampleSearch-SIR and SampleSearch-MH re-

spectively. Our new techniques operate in two phases. In the first phase, they use Sample-

Search to generate a set of solution samples and in the second phase they thin down the

samples by accepting only a subset of good samples. By carefully selecting this acceptance

criteria, we can ensure that the generated samples converge in the limit to the required

uniform distribution over the solutions. SampleSearch-MH and SampleSearch-SIR are the

first to have such convergence guarantees, not available for state-of-the-art schemes such

as SampleSat [130] and XorSample [63].

We present empirical evaluation, comparing with SampleSat [130] and XorSample [63].

We ran each scheme for the same amount of time and evaluated their performance using

two criteria: (i) throughput, measured as the number of solutions generated per second and

(ii) sampling accuracy measured using the Hellinger distance [77] between the uniform

distribution and the sampling distribution of the generated random solutions. Our results

demonstrate that SampleSearch is competitive with SampleSat and XorSample in terms

of accuracy and throughput while the schemes we introduce here, SampleSearch-MH and

144

SampleSearch-SIR are usually more accurate.


The rest of the chapter is organized as follows. In Section 4.2 we present preliminaries and

background. In Section 4.3, we present the IJGP based solution sampling scheme while in

Section 4.4, we review the SampleSearch scheme. In Sections 4.5 and 4.6, we present the

SampleSearch-MH and SampleSearch-SIR schemes respectively. Experimental results are

presented in section 4.7 and we conclude in section 4.8.

4.2 Background and Related work

In this section, we describe the notation and terminology, define the solution sampling

problem, and describe earlier work.

4.2.1 Basic Notation and Definitions

We denote a SAT formula represented in conjunctive normal form (CNF) by F (X,C) where

X = X1, . . . , Xn and C = C1, . . . , Cm denote the sets of variables and clauses re-

spectively. Each variable Xi can take two values from the set 0, 1 (or False, True).

A literal is a variable Xi or its negation ¬Xi. A clause is a disjunction (denoted by ∨)

of literals and the Formula F is a conjunction (denoted by ∧) of clauses. For example,

F = (X1 ∨¬X2)∧ (¬X1 ∨¬X3 ∨X4)∧ (X2 ∨X4) is a CNF formula with four variables

X1, X2, X4, X4 and three clauses.

A clause Ci is satisfied by an assignment x if at least one literal in Ci is set to 1 (True).

(Note that if a variable Xi is assigned the value 0 (False) then it implies that the literal

¬Xi is assigned a value 1 (True) and vice versa.) If x satisfies all clauses of F , then x is a

145

solution or a model of F .

Given an assignment xi, the assignment xi denotes the negation of xi. For example, if xi is

True, then xi is False and vice versa.

With each clause Ci, we associate a 0/1 function Ci defined on the variables corresponding

to the literals of Ci called its scope and is denoted by S(Ci):

Ci(x) =

1 If clause Ci is satisfied by x

0 Otherwise

(4.1)

Similarly, with a formula F , we associate a 0/1 function F whose scope is the set X of

variables and is defined as:

F(x) =

1 If x satisfies all clauses of F

0 Otherwise

(4.2)

It is easy to see that:

F(x) =m∏

i=1

Ci(x) (4.3)

Given a SAT formula F and an integer M , the solution sampling task is to generate M

solutions from F such that each solution is generated with equal probability. Formally,

DEFINITION 35 (The Solution Sampling task). Given a SAT formula F (X,C) and an

integerM , let #sol be the number of solutions of F given by:

#sol =∑

x∈X

F(x) =∑

x∈X

m∏

i=1

Ci(x) (4.4)

146

and let P be a uniform distribution over the solutions of F , where

P(x) =

1#sol

If x is a solution of F

0 Otherwise

then the solution sampling task is to generateM solution samples from P .

4.2.2 Earlier work

A brute-force way to sample solutions uniformly is to first generate and store all solutions

and then generate M solutions from this stored set such that each solution is sampled with

equal probability; but this is clearly impractical.

A more reasonable approach developed by Dechter et al. [31] is presented as Algorithm

13. We first express the uniform distribution P(x) over the solutions in a product form

(or a Bayesian network): P(x) =∏n

i=1Pi(xi|x1, . . . , xi−1). Then, we use a standard

Monte Carlo (MC) sampler, also called logic sampling [106] to sample along the ordering

o = 〈X1, . . . , Xn〉 implied by the product form. Namely, at each step, given a partial

assignment (x1, . . . , xi−1) to the previous i− 1 variables, we assign a value to variable Xi

by sampling it from the distribution Pi(Xi|x1, . . . , xi−1). Repeating this process for each

variable in the sequence produces a solution sample (x1, . . . , xn). To generate the required

M solutions from P , all we have to do now is repeat the above process M times.

It was shown that:

PROPOSITION 14. [31] The probability Pi(Xi = xi|x1, . . . , xi−1) is equal to the ratio

between the number of solutions that the partial assignment (x1, . . . , xi) participates in

and the number of solutions that (x1, . . . , xi−1) participates in.

Proposition 14 implies that a solution counting algorithm can be used to generate the sam-

147

Algorithm 13: Monte-Carlo Sampler (F,O,P)

Input: Formula F , An Ordering o = (X1, . . . , Xn) and the uniform distribution over

the solutions P =∏n

i=1 Pi(xi|x1, . . . , xi−1)Output: M Solutions drawn from the uniform distribution over the solutions of Ffor j = 1 toM do1

x = φ;2

for i = 1 to n do3

r = a random real number in (0, 1) ;4

if r < Pi(Xi = 0|x) then5

x = x ∪ (Xi = 0);6

else7

x = x ∪ (Xi = 1)8

Output x9

Root

C=0 C=1 C=0 C=1 C=0 C=1 C=0 C=1

B=0 B=1 B=0 B=1

A=0 A=1

3/4 1/4

2/3 1/3

1

0

1/2 1/2 10 0 0 0

1

SAT formula F = (A∨¬B ∨¬C)∧ (¬A∨B ∨C)∧ (¬A∨¬B ∨C)∧ (¬A∨¬B ∨¬C)

Figure 4.1: Search tree for the given SAT formula F annotated with probabilities for sam-

pling solutions from a uniform distribution.

pling probabilities for sampling solutions.

EXAMPLE 11. Figure 4.1 shows a complete search tree for the given SAT formula F . Each

arc from a parent Xi−1 = xi−1 to a child Xi = xi in the search tree is labeled with

the ratio of the number of solutions below the child (i.e. the number of solutions that

(x1, . . . , xi) participates in) and the number of solutions below the parent (i.e. the number

of solutions that (x1, . . . , xi−1) participates in) yielding the probability Pi(Xi = xi|X1 =

x1, . . . , Xi−1 = xi−1). Given a sequence of random numbers 0.2, 0.3, 0.6, the reader

can verify that the solution highlighted in bold in Figure 4.1 will be generated first by the

ordered Monte Carlo sampler.

148

Dechter et al. [31] used a bucket elimination based solution counting scheme to compute

Pi(Xi|x1, . . . , xi−1). In this scheme, bucket elimination is run just once as a pre-processing

step yielding a representation from which Pi(Xi|x1, . . . , xi−1) can be computed for any

partial assignment (x1, . . . , xi−1) to the previous i − 1 variables by performing a constant

time table look-up (For details see [31]). However, the time and space complexity of this

pre-processing scheme is exponential in the induced width of the graph along the given

ordering, and when the induced width is large, bucket elimination is impractical. Clearly,

any state-of-the-art counting algorithms (e.g. Cachet [119]) can also be used to compute

Pi(Xi = xi|x1, . . . , xi−1) but the properties of utilizing these methods for sampling may

be different.

To address the exponential blow up in large induced-width problems, Dechter et al. [31]

proposed to approximate each component Pi(Xi|x1, . . . , xi−1) using the polynomial mini-

bucket elimination scheme (MBE) [39]. In the next section, we propose to compute the

approximation denoted by Q using Iterative Join Graph propagation (IJGP) rather than

MBE. The quality of the sampling scheme depends upon the accuracy of Q (i.e. how close

Q is to P). Since IJGP was empirically shown to approximate P better than MBE [33], it

is likely to be a better choice.

4.3 Using IJGP for solution sampling

To recap, Iterative Join-Graph Propagation (IJGP) [33] (see Section 1.4.1) is a class of gen-

eralized belief propagation algorithm [132] that applies the message passing algorithm of

join-tree clustering to join-graphs iteratively. A join graph is a decomposition of functions

into a graph of clusters (rather than a tree) that satisfies the running intersection property.

IJGP uses a parameter i which controls the cluster size of the join-graph, yielding a class of

algorithms (IJGP(i)) whose complexity is exponential in i, that allow a trade-off between

149

Algorithm 14: ALGORITHM IJGP(i,p)-SAMPLING

Input: A SAT Formula F (X,C), an ordering o = (X1, . . . , Xn), Integers i,p ∈ 0, 100 and M

Output: M randomly generated solutions.

Run IJGP(i) on the formula F and compute a proposal distribution1

Q = Q1, . . . , Qn from its output using Equation 4.5;

Randomly select p/100× n variables without replacement from X and store them in2

a set Xp;

repeat3

Initialize s to the null assignment;4

for j = 1 to n do5

Q′j = Qj;6

for j=1 to n do7

if Xj ∈ Xp then8

F ′=Unit-propagate(F ,s);9

Run IJGP on F ′ and recompute Q′j, . . . , Q

′n from IJGP’s output using10

Equation 4.5 ;

r=random number between 0 and 1;11

if r < Q′j(Xj = 0|s) then12

s = s ∪ (Xj = 0) ;13

else14

s = s ∪ (Xj = 1);15

// Check if s is inconsistent

If s is not consistent according to F , then goto step 3.;16

Output s;17

until M ≤ 0 ;18

150

accuracy and complexity. As i increases, accuracy generally increases. When i is big

enough to allow a tree-structure, IJGP(i) coincides with join-tree clustering and becomes

exact.

Given a SAT Formula F (X,C), the output of IJGP (see Algorithm 1 in Chapter 1) is a

join graph JG = (G(V,E), χ, ψ, θ) in which each cluster A ∈ V contains the original

clauses ψ(A) ⊆ C and messages mB→A received from all its neighbors B ∈ NG(A). Each

component Qi(Xi|X1, . . . , Xi−1) can be constructed from JG as follows. Let A be the

cluster of JG containing Xi, namely Xi ∈ χ(A). Let Y = χ(A) ∩ X1, . . . , Xi−1. Then,

Qi(xi|y) = α∑

z∈χ(A)\X1,...,Xi

∏

C∈ψ(A)

C(z, xi, y)∏

B∈NG(A)

mB→A(z, xi, y)

(4.5)

where α is the normalization constant which ensures that Qi(Xi|y) is a proper probability

distribution. Because there could be many clusters in the join graph that mention Xi, as

mentioned in Chapter 2, we choose heuristically the cluster which mentions Xi and has the

largest intersection with X1, . . . , Xi−1 (ties are broken randomly).

Algorithm 14 presents a IJGP-based technique for sampling solutions from a Satisfiabil-

ity formula F . We introduce here a parameter p for controlling the recomputation of the

sampling distribution periodically. Given a SAT formula F (X,C) and an ordering o =

(X1, . . . , Xn) of variables, we first run IJGP and create the componentsQ = Q1, . . . , Qn

of the sampling distribution using Equation 4.5. Then, we randomly select p% of the vari-

ables and designate them as checkpoints at which IJGP will be rerun. We sample variables

one by one using the latest proposal distribution component Qj until a checkpoint variable

Xj is encountered. At each check-point, we re-run IJGP on the Formula F ′ obtained from

F by conditioning on the current assignment to the previous j− 1 variables and recompute

the components Qj, . . . , Qn for all the unassigned variables. The primary reason for in-

troducing p ∈ 0, 100 is to explore a time versus accuracy tradeoff. As we increase p, the

accuracy of Q improves but so does the time required to generate a sample.

151

PROPOSITION 15. The time complexity and space complexity of IJGP(i,p)-sampling is

O(h × n × p100× exp(i) × N) and O(h × exp(i)) respectively where n is the number

of variables, h is the number of clusters in the join graph of IJGP and N is the number of

solution samples.

Proof. To generate a sample, IJGP is executed O(n × p100

) times. The time complexity

of running IJGP is O(h × exp(i)). Therefore, to generate N samples, we would require

O(h× n× p100× exp(i)×N) time.

The space required by IJGP andQ isO(h×exp(i)) andO(n×exp(i)) respectively. Because

h ≥ n, the overall space complexity is O(h× exp(i)).

Another issue with IJGP(i,p) sampling is Step 16 of the algorithm in which a partial (or full)

sample which is not consistent is rejected. Obviously, we can use SampleSearch presented

in Chapter 3 in a straight forward manner to manage this problem. For completeness sake,

we present this scheme next.

4.4 The SampleSearch scheme for solution sampling

Instead of returning with a sample that is inconsistent, SampleSearch progressively re-

vises the inconsistent sample via backtracking search until a solution is found. Since the

focus of this chapter is on SAT formulas, we use the conventional backtracking procedure

for SAT which is the DPLL algorithm [26].

SampleSearch with DPLL is presented as Algorithm 15. It takes as input a formula F ,

an ordering o = 〈X1, . . . , Xn〉 of variables and a distribution Q(X) =∏n

i=1 Qi (Xi| X1,

. . . , Xi−1) along that ordering. Given a partial assignment (x1, ..., xi−1) already generated,

the next variable in the ordering Xi is selected and its value Xi = xi is sampled from the

152

Algorithm 15: SampleSearch(F = (X,C), Q, o)

Input: A SAT formula F in CNF form, a distribution Q and an ordering o of

variables.

Output: 1 if a solution sample is generated ( plus the solution stored in F ) and 0otherwise

if there is an empty clause in F then1

Return 0;2

if all clauses in F are unit then3

Return 1;4

Select the earliest variable Xi in o not yet assigned a value;5

for every unit clause C in F do6

Unit-propagate(F,C);7

Sample Xi = xi from Qi(Xi|x1, . . . , xi−1);8

return SampleSearch(F ∧ xi, Q, o) OR SampleSearch(F ∧ xi, Q, o);9

conditional distribution Qi(Xi|x1, . . . , xi−1). Then the algorithm applies unit-propagation

with the new unit clause Xi = xi created over the formula F . If no empty clause is

generated, the algorithm proceeds to the next variable. Otherwise, the algorithm tries Xi =

xi, performs unit propagation and either proceeds forward (if no empty clause generated)

or it backtracks. On termination, the output of SampleSearch is a solution to F (assuming

one exists).

The distribution Q can be constructed using the output of IJGP described in the previous

section or by any other scheme. The only restriction we have is that Q must satisfy P(x) >

0 → Q(x) > 0 to guarantee that every solution is sampled with non-zero probability

(similar to importance sampling).

Next, we characterize the sampling distribution of SampleSearch which will be used to

quantify how far the generated solutions are from the uniform distribution over the solu-

tions.

153

4.4.1 The Sampling Distribution of SampleSearch

As shown in Chapter 3, SampleSearch generates independent and identically distributed

samples from the backtrack-free distribution denoted by QF . In this section, we recap the

definition of the backtrack-free distribution relative to a SAT formula. We will also provide

an alternative proof for the claim that SampleSearch generates samples from the backtrack-

free distribution. Because of the restriction to binary domains as in a SAT formula, our

proof is much simpler and more intuitive than the proof presented in Chapter 3.

DEFINITION 36 (The Backtrack-free distribution of Q w.r.t. F ). Given a distribu-

tion Q(X) in the product form Q(X) =∏n

i=1Qi(Xi|X1, . . . , Xi−1), an ordering o =

〈X1, . . . , Xn〉 and a SAT formula F (X,C), the backtrack-free distribution QF is factored

into QF (x) =∏n

i=1QFi (xi|x1, . . . , xi−1) where Q

Fi (xi|x1, . . . , xi−1) is defined as follows:

1. QFi (xi|x1, . . . , xi−1) = 0 if (x1, . . . , xi−1, xi) cannot be extended to a solution of F .

2. QFi (xi|x1, . . . , xi−1) = 1 if (x1, . . . , xi−1, xi) can be extended to a solution of F but

(x1, . . . , xi−1, xi) cannot.

3. QFi (xi|x1, . . . , xi−1) = Qi(xi|x1, . . . , xi−1) if both (x1, . . . , xi−1, xi) and (x1 , . . . ,

xi−1, xi) can be extended to a solution of F .

EXAMPLE 12. Figure 4.2(a) shows a distribution Q expressed as a probability tree. Each

arc from a parent to a child in the probability tree is labeled with the conditional probability

of the child given an assignment from the root to the parent. Figure 4.2(b) shows the

backtrack-free distribution of Q w.r.t. the given SAT formula F . QF is constructed from Q

as follows. Given a parent and its two children one which participates in a solution and the

other which does not, the edge-label of the child participating in a solution is changed to 1

while the edge-label of the other child which does not participate in a solution is changed

to 0. If both children participate in a solution, the edge labels are not changed.

154

Root

C=0 C=1 C=0 C=1 C=0 C=1 C=0 C=1

B=0 B=1 B=0 B=1

A=0 A=1

0.8 0.2

0.6 0.4

0.1

0.5

0.5 0.5 0.80.9 0.2 0.7 0.3

0.5

(a)

Root

C=0 C=1 C=0 C=1 C=0 C=1 C=0 C=1

B=0 B=1 B=0 B=1

A=0 A=1

0.8 0.2

0.6 0.4

1

0

0.5 0.5 10 0 0 0

1

(b)

SAT formula F = (A∨¬B ∨¬C)∧ (¬A∨B ∨C)∧ (¬A∨¬B ∨C)∧ (¬A∨¬B ∨¬C)

Figure 4.2: (a) A distribution Q expressed as a probability tree and (b) Backtrack-free

distribution QF of Q w.r.t. F .

THEOREM 12. Given a distribution Q and a formula F , the sampling distribution of

SampleSearch is the backtrack-free distribution QF of Q w.r.t. F .

Proof. Let I(x) =∏n

i=1 Ii(xi|x1, . . . , xi−1) be the factored sampling distribution of Sam-

pleSearch (F,Q, o). We will prove that for any arbitrary partial assignment (x1, . . . , xi−1,

xi), Ii(xi|x1, . . . , xi−1) = QFi (xi|x1, . . . , xi−1). We consider three cases corresponding to

the definition of the backtrack-free distribution (see Definition 36):

Case (1): (x1, . . . , xi−1, xi) cannot be extended to a solution. Since, all samples generated

by SampleSearch are solutions of F , Ii(xi|x1, . . . , xi−1) = 0 = QFi (xi|x1, . . . , xi−1).

Case (2): (x1, . . . , xi−1, xi) can be extended to a solution but (x1, . . . , xi−1, xi) cannot

be extended to a solution. Since SampleSearch is a systematic search procedure, if it

explores a partial assignment (x1, . . . , xk) that can be extended to a solution of F , it is

guaranteed to return a full sample (solution) extending this partial assignment. Otherwise,

155

if (x1, . . . , xk) is not part of any solution of F then SampleSearch will prove this incon-

sistency before it will finish generating a full sample. Consequently, if (x1, . . . , xi−1, xi)

is sampled first, a full assignment (a solution) extending (x1, . . . , xi) will be returned and

if (x1, . . . , xi) is sampled first, SampleSearch will detect that (x1, . . . , xi−1, xi) cannot be

extended to a solution and eventually explore (x1, . . . , xi−1, xi). Therefore, in both cases if

(x1, . . . , xi−1) is explored by SampleSearch, a solution extending (x1, . . . , xi−1, xi) will

definitely be generated i.e. Ii(xi|x1, . . . , xi−1) = 1 = QFi (xi|x1, . . . , xi−1).

Case (3): Both (x1, . . . , xi−1, xi) and (x1, . . . , xi−1, xi) can be extended to a solution.

Since SampleSearch is a systematic search procedure, if it samples a partial assignment

(x1, . . . , xi−1, xi) it will return a solution extending (x1, . . . , xi−1, xi) without sampling

(x1, . . . , xi−1, xi). Therefore, the probability of sampling xi given (x1, . . . , xi−1) equals

Qi(xi|x1, . . . , xi−1) which equals QFi (xi|x1, . . . , xi−1).

Note that the only property of backtracking search that we have used in the proof is its

systematic nature. Therefore, as pointed out in Chapter 3, if we replace naive backtracking

search by any systematic SAT solver such as minisat [125] for example, Theorem 12 would

still hold. The only modifications we have to make are: (a) use static variable ordering 1,

(b) use value ordering based on the proposal distribution Q. Consequently:

COROLLARY 1. Given a Formula F , a distribution Q and an ordering o, any systematic

SAT solver replacing DPLL in SampleSearch will generate independent and identically

distributed solution samples from the backtrack-free distribution QF .

While SampleSearch samples from the backtrack-free distributionQF , it still does not solve

the solution sampling problem because QF may still be quite far from the uniform distribu-

tion over the solutions. In the next two sections, we show how to augment SampleSearch

1 We can also use some restricted dynamic variable orderings and still maintain correctness of Theorem

12

156

with ideas presented in the statistics literature − the Metropolis-Hastings method (MH)

and the Sampling/Importance Resampling (SIR) theory so that in the limit the sampling

distribution of the resulting techniques is the target uniform distribution over the solutions.

4.5 SampleSearch-MH

As explained in Chapter 1, the main idea in the Metropolis-Hastings (MH) simulation algo-

rithm [65] is to generate a Markov Chain whose limiting or stationary distribution is equal

to the target distribution P . A Markov chain consists of a sequence of states and a pro-

posal distribution T (y|x) for moving from state x to state y. Given a proposal distribution

T (y|x), an acceptance function 0 ≤ r(y|x) ≤ 1 and a current state xt, the Metropolis-

Hastings algorithm works as follows:

1. Draw y from T (y|xt).

2. Draw p uniformly from [0, 1] and update

xt+1 =

y if p ≤ r(y|xt)

xt otherwise

Hastings [65] proved that the stationary distribution of the above Markov Chain converges

to P when it satisfies the detailed balance condition. Formally,

THEOREM 13. [65] The stationary distribution of the Markov Chain generated using the

two steps outlined above converges to P , when the following detailed balanced condition

is satisfied:

P(x)A(y|x) = P(y)A(x|y) (4.6)

where A(x|y) = T (x|y)r(x|y) is called the transition function.

157

Algorithm 16: SampleSearch−MH(F,Q,O,M)

Input: A SAT formula F , A distribution Q, An ordering O, Integer M .

Output: M solution samples

x0 = SampleSearch(F,O,Q);1

for t = 0 toM − 1 do2

y = SampleSearch(F,O,Q);3

Generate a random number p in the interval [0, 1];4

if p ≤ QF (xt)QF (xt)+QF (y)

then5

xt+1 = y;6

else7

xt+1 = xt;8

Algorithm 16 uses the Metropolis Hastings algorithm (MH) [65] for solution sampling.

We first generate a solution sample y using SampleSearch and then accept the solution

with probabilityQF (x)

QF (x)+QF (y). We can show that:

PROPOSITION 16. The Markov Chain of SampleSearch-MH (Algorithm 16) satisfies the

detailed balance condition.

Proof. Since our task is to generate samples from a uniform distribution over the solutions,

for any two solution samples x and y, we have

P(x) = P(y)

Therefore, the detailed balance condition given in Equation 4.6 reduces to:

A(y|x) = A(x|y) (4.7)

Because each sample is generated independently fromQF and we accept it with probability

158

QF (x)QF (x)+QF (y)

, the transition function A(y|x) is given by:

A(y|x) = QF (y)QF (x)

QF (x) +QF (y)(4.8)

Similarly, the transition function A(x|y) is given by:

A(x|y) = QF (x)QF (y)

QF (x) +QF (y)(4.9)

From Equations 4.8 and 4.9, we have A(y|x) = A(x|y) as required.

It follows from Proposition 16 and the Metropolis-Hastings theory that the stationary dis-

tribution of the Markov Chain generated by SampleSearch-MH is equal to the uniform

distribution over the solutions. Therefore, if we run SampleSearch-MH long enough (also

called as the burn-in time), the samples produced by it can be regarded as following the

uniform distribution over the solutions. Integrating MH with SampleSearch opens the way

for applying various improvements to MH proposed in the statistics literature. We describe

next the integration of an improved MH algorithm with SampleSearch.

4.5.1 Improved SampleSearch-MH

In SampleSearch-MH, we generate a state (or a sample) from the backtrack-free distribu-

tion and subsequently a decision is made whether to accept the sample based on an accep-

tance function. In Improved SampleSearch-MH (Algorithm 17), we first widen the accep-

tance by generating multiple samples instead of just one and then accept a good sample

from these multiple options. Such methods are referred to as multiple trial MH [89]. Given

159

Algorithm 17: Improved-SampleSearch-MH (F,Q,O,M, k)

Input: A formula F , A distribution Q, An ordering O, Integer M and kOutput: M solutions

Generate x0 using SampleSearch(F,O,Q);1

for t = 0 toM − 1 do2

Y = φ;3

for j = 1 to k do4

yj = SampleSearch(F,O,Q);5

Y = Y ∪ yj;6

ComputeW =∑k

j=11

QF (yj);7

Select y from the set Y by sampling each element yj in Y with probability8

proportional to 1QF (yj)

;

Generate a random number p in the interval [0, 1];9

if p ≤ WW− 1

QF (y)+ 1

QF (xt)

then10

xt+1 = y;11

else12

xt+1 = xt;13

a current sample x, the algorithm generates k candidates y1, . . . ,yk from the backtrack-free

distribution QF in the usual SampleSearch style. It then selects a (single) sample y from

the k candidates with probability proportional to 1/QF (yj). Sample y is then accepted

according to the following acceptance function:

r(y|x) =W

W − 1QF (y)

+ 1QF (x)

where W =k∑

j=1

1

QF (yj)(4.10)

Using results from [89], it is easy to prove that:

PROPOSITION 17. [89] The Markov Chain generated by Improved-SampleSearch-MH sat-

isfies the detailed balance condition.

In summary,

THEOREM 14. The stationary distribution of SampleSearch-MH and Improved

SampleSearch-MH is the uniform distribution over the solutions.

160

Proof. Proof follows from Propositions 16 and 17 and the Metropolis-Hastings theory [65].

4.6 SampleSearch-SIR

Algorithm 18: SampleSearch− SIR(F,Q, o,N,M)

Input: A formula F , A distribution Q, An ordering O, Integers M and N , N > MOutput: M solution samples

Generate N i.i.d. samples A = x1, . . . ,xN by executing SampleSearch(F,O,Q) N1

times;

Compute importance weights w(x1) = 1QF (x1)

, . . . , w(xN) = 1QF (xN )

for each2

sample where QF is the backtrack-free distribution;

Normalize the importance weights using w(xi) = w(xi)/∑N

j=1w(xj);3

// Resampling Step

Generate M i.i.d. samples y1, . . . ,yM from A by sampling each sample xi with4

probability w(xi).;

We now discuss our second scheme in which we augment SampleSearch with

Sampling/Importance Resampling (SIR) [117] which will yield the SampleSearch-SIR

technique. Standard SIR [117] aims at drawing random samples from a target distribution

P(x) by using a given distribution Q(x) which satisfies P(x) > 0 ⇒ Q(x) > 0 (which

is the same requirement as the importance sampling one). First, a set of independent and

identically distributed random samples A = (x1, . . . ,xN) are drawn from a proposal dis-

tribution Q(x). Second, a possibly smaller number of samples B = (y1, . . . ,yM) are

drawn from A with sample probabilities which are proportional to the weights w(xi) =

P(xi)/Q(xi) (this step is referred to as the re-sampling step). The samples from SIR will,

consist of independent draws from P as N →∞ (see [117] for details).

SampleSearch augmented with SIR is described in Algorithm 18. Here, a set A = (x1, . . . ,

xN) of solution samples (N > M ) is generated from SampleSearch. Then, M samples are

drawn from A with sample probabilities proportional to w(xi) = 1/QF (xi).

161

THEOREM 15. As N → ∞, the samples generated by SampleSearch-SIR consist of inde-

pendent draws from the uniform distribution P over the solutions of F .

Proof. From the SIR theory [117], we know that SampleSearch-SIR would i.i.d. samples

from the uniform distribution P over the solutions of F as N →∞ if the following condi-

tions hold:

• C.1 The N samples are generated i.i.d from some distribution QF (satisfied trivially

from Theorem 9).

• C.2 The M samples are generated in the re-sampling step by sampling each sample

x with probability proportional to P(x)/QF (x)

Because, P is a uniform distribution over the solutions, we have

P(x)

QF (x)∝ 1

QF (x)(4.11)

Since theM samples are generated in the re-sampling step of SampleSearch-SIR by resam-

pling each generated solution with probability∝ 1QF (x)

, condition C.2 is also satisfied.

Note that Theorem 15 does not make any assumptions on the type or quality of QF . There-

fore, even with a bad proposal distribution we are guaranteed to get independent samples

in the limit. However, for a finite sample size, the quality of the proposal distribution is the

most important factor.

4.6.1 Extensions of basic SampleSearch-SIR

The integration of SampleSearch within the SIR framework allows using various improve-

ments to the SIR framework presented in the statistics literature over the past decade. In

162

this subsection, we consider two such improvements: (a) sampling without replacement

[49] and (b) Improved SIR [124].

Sampling without replacement

SampleSearch-SIR uses sampling with replacement in the re-sampling step i.e. the same

solution sample may be drawn twice. Instead we could use sampling without replacement

so that each solution will appear only once in the final set of samples. Sampling without

replacement helps avoid replicates when QF is a poor approximation of P . In this situ-

ation we may have a few very large weights and many small weights which causes the

large weight samples to appear multiple times in the final set of samples. It was shown by

[49] that sampling without replacement yields a more desirable intermediate approxima-

tion somewhere between the starting (proposal) QF and the target distribution P . In the

experimental section, we will revisit this issue and show how a simple statistical test based

on co-efficient of variation [88] is able to predict which sampling method is likely to have

better accuracy.

Improved SampleSearch-SIR

Under certain restrictions, [124] proved that the convergence of SIR is proportional to

O(1/N). To speed up this convergence to O(1/N2), they proposed the Improved SIR

framework. For our purposes, Improved SIR only changes the weights during resampling

step as follows.

In case of sampling with replacement the Improved SampleSearch-SIR weighs each sample

as:

w(xi) ∝ 1

S−i ×QF (xi)where S−i =

M∑

j=1

1

QF (xj)− 1

QF (xi)(4.12)

163

The first draw of Improved SampleSearch-SIR without replacement is specified by the

weights in Equation 4.12. For the k-th draw, k > 1, the distribution of w is modified to:

w(xk) ∝ 1

QF (xk)(∑N

j=11

QF (xj)−∑k−1

j=11

QF (xj)− k 1

QF (xk))

(4.13)

4.6.2 Discussion on related work

Theorems 14 and 15 are important because they guarantee that as the sample size increases,

the samples drawn by SampleSearch-SIR and SampleSearch-MH would converge to the

uniform distribution over the solutions. Therefore, we can expect that the sampling error

of both schemes will decrease as the number of samples is increased and will disappear

in the limit. To the best of our knowledge, the state-of-the-art schemes in literature: (a)

SampleSat [130] and (b) XorSample [63] do not have such guarantees. In particular, it

is difficult to characterize the sampling distribution of SampleSat. It is possible to make

SampleSat sample uniformly by setting the noise parameter appropriately but doing so

was shown by [130] to compromise significantly the time required to generate a solution

sample making SampleSat impractical. The XorSample algorithm [63] can be adjusted

to generate samples that can be within a constant factor of the uniform distribution by

increasing its slack parameter α. However, if α is fixed, XorSample may not generate

solutions from a uniform distribution [63]. Finally, although the samples drawn from the

scheme based on bucket elimination [31] do converge to the uniform distribution over the

solutions, it is clearly not practical when the tree-width is large.

164

4.7 Experimental Results

We evaluated the performance of (1) SampleSearch, (2) SampleSearch-SIR with replace-

ment (3) SampleSearch-SIR without replacement and (4) SampleSearch-MH on four do-

mains: (a) circuit, (b) coloring, (c) logistics planning and (d) grid pebbling. The initial

distribution Q of SampleSearch is generated using IJGP(i,p) (recall that we re-compute Q

by executing IJGP periodically using the parameter p). For each scheme, we experimented

with 5 values: 0, 10, 20, 50, 100 of p. We also experimented with two other schemes

available in literature: Wei et al.’s SampleSat scheme [130] and Gomes et al.’s XorSample

scheme [63].

Next, we will describe the implementation details of each scheme.

1. SampleSearch

SampleSearch is implemented via the following components:

• Search: As already mentioned, the search component in SampleSearch was imple-

mented using the minisat solver [125].

• i-bound of IJGP(i,p): We use the iterative scheme outlined in Section 3.3.1 to select

the maximum i-bound that can be accommodated by 512 MB of space. This was

done to ensure that IJGP terminates in a reasonable amount of time.

• w-cutset sampling: Following the w-cutset idea [7], we partition the variables X

into two sets K and R such that the treewidth of the graphical model induced by R

after removing K is bounded by w. We only sample the set of variables K using

SampleSearch and then exactly sample the variables in R given K = k by using the

Bucket Elimination based exact sampling scheme described in [31]. Again, we use

165

the iterative scheme outlined in Section 3.3.1 to select the maximum w such that the

Bucket Elimination based exact sampling scheme uses only 512MB space.

• Variable Ordering:We use the min-fill ordering for creating the join-graph for IJGP

because it was shown to be a good heuristic for finding tree decompositions with low

treewidth. Sampling is performed in reverse min-fill ordering.

2. SampleSearch-MH

We experimented with the Improved SampleSearch-MH scheme given in Algorithm 17.

We set k to 10 (set arbitrarily). For brevity, henceforth we will refer to SampleSearch-MH

as MH.

3. SampleSearch-SIR with and without replacement

We experimented with the improved SampleSearch-SIR with and without replacement

schemes outlined in section 4.6. We use a resampling ratio M/N = 0.1 = 10%. The

choice of the resampling ratio of 10% is arbitrary. For brevity, henceforth we will refer to

SampleSearch-SIR with replacement as SIRwr and SampleSearch-SIR without replacement

as SIRwor.

4. SampleSat

SampleSat [130] is based on the popular local search solver called Walksat [120]. Walksat

starts with a random assignment to all variables and then flips a variable at each iteration by

mixing local search style greedy moves with random walk moves. The inherent random-

ness due to random walk moves often leads the algorithm to different solutions in different

runs. This however provides a biased sampling of solution space. To reduce this bias, Wei

166

Instance #Vars #Cl #Sol #Xors(s) Size of Xors (k)

Circuit

ssa-7558-158 1363 3034 2.56e+31 85 20

ssa-7558-159 1363 3032 7.65e+33 X X

2bitcomp 5 125 310 9.84e+15 96 20

2bitmax 6 252 766 2.06e+29 53 20

Coloring

gcoloring-100 300 1117 1.8E+9 17 150

gcoloring-200 600 2237 1.28e+13 41 35

Logistics

log-1 939 3785 5.64E+20 59 20

log-2 1337 24777 3.23E+10 26 50

log-3 1413 29487 2.80E+11 27 50

log-4 2303 20963 2.34E+28 65 25

log-5 2701 29534 7.24E+38 102 20

Pebbling

grid-pbl-0010 110 191 5.93e+23 70 5

grid-pbl-0015 240 436 3.01e+54 X X

grid-pbl-0020 420 781 5.06e+95 X X

grid-pbl-0025 650 1226 1.81e+151 X X

grid-pbl-0030 930 1771 1.54e+218 X X

Table 4.1: Table showing the two parameters (Number of Xors and Size of Xors) used by

XorSample for each problem instance. ‘X’ indicates that no parameter that satisfies our

criteria was found.

and Selman [130] explore a hybrid strategy in which they select with probability p a ran-

dom walk style move and with probability 1 − p a fixed temperature simulated annealing

(SA) move (the authors suggest setting the annealing temperature to 1). More specifically,

for each SA move, they consider a randomly selected neighbor of the current assignment.

When the neighbor satisfies the same or more clauses as the current assignment, the algo-

rithm moves to the neighbor. When the neighbor satisfies fewer clauses than the current

assignment, the algorithm moves to the neighbor with probability e−Cost where Cost is

the decrease in the number of satisfied clauses. Note that the simulated annealing moves,

when used in excess tend to make the algorithm significantly slower. Therefore, Wei and

Selman [130] suggest setting p = 0.5. In our experiments, we used an implementation of

SampleSat available from the authors website 2.

2http://www.cs.cornell.edu/∼sabhar/

167

5. XorSample

XorSample [63] works by adding random Xor (parity) constraints to the original SAT for-

mula. According to a formal result by Valiant and Vazirani [127], for a formula with n

variables, each Xor constraint of length n/2 would cut the solution space of satisfying as-

signments approximately by half. Therefore, in expectation, if we add s Xor constraints of

length n/2 for a formula with 2s solutions, we will have a SAT instance with exactly one

solution. Then one can use any SAT solver like minisat to generate the surviving solution.

Repeating this processM times yields the required near uniform solution samples. In prac-

tice, however, adding Xor constraints of length n/2 i.e. large Xors is infeasible. Large Xor

constraints cause poor constraint propagation making the underlying SAT solver quite inef-

ficient. Therefore Gomes et al. [63] consider the use of small Xors. Small Xors, however,

result in an algorithm of not as high a quality as with large Xors in that in most iterations,

no solutions would survive after adding s constraints. Gomes et al. [63] suggest to over-

come this issue by adding fewer than s small Xor constraints with the hope that more than

one solution survives and then uniformly sampling the surviving solutions (if any) by using

an exact sampling algorithm. In this scheme, assuming that the new formula obtained after

adding < s (short) Xor constraints has mc surviving solutions, the algorithm would output

the ith surviving solution by choosing i uniformly from 1, 2, . . . ,mc.

The implementation available from the authors3 is not a stand-alone implementation of

XorSample. What the authors provide is a script that takes two arguments: (a) the number

s of Xors used and (b) the size k of Xors as input and generates s random Xors of size

k. We implemented XorSample on top of this implementation as follows. Note that s and

k are problem dependent which were determined as follows. For each problem, we first

fix k = n/2 and search for s by iteratively increasing it until the following two conditions

are satisfied. Let F ′ be the formula obtained by adding s Xor constraints of size k to the

3http://www.cs.cornell.edu/∼sabhar/

168

Algorithm 19: XorSample

Input: A SAT formula F , Integers k (Size of Xors) and s (number of Xors)

Output: A solution sample x

Add s Xors of size k to F yielding a new Formula F ′;1

Use Cachet [119] to count the number of solutions of F ′;2

// Let mc be the number of solutions of F ′

if mc > 1 million then3

Goto Step 1;4

i = random number between 1 and mc;5

Output the ith solution of F ′ using Relsat solution enumerator [115].;6

original formula.

1. F ′ is easy enough so that solution counting techniques like Cachet [119] and Relsat

[115] count its solutions in a reasonable amount of time (1hr).

2. F ′ has less than 1 million solutions so that the Relsat solution enumerator terminates

in a reasonable amount of time. The process of enumerating solutions is more com-

plex than counting and is quite expensive if too many solutions survive. Therefore,

we restrict ourselves to one million solutions.

Given k, if no value of s satisfies the two conditions described above, we decrease k by 1

and continue. The parameters returned by our search for k and s are reported in Table 4.1.

In some cases, however, no combination of s and k satisfies the two conditions mentioned

above indicating that running XorSample on the particular formula is impractical (indicated

by a ‘X’ in Table 4.1).

Once we fix k and s, we can generate a sample from a Formula F using Algorithm 19.

4.7.1 Evaluation Criteria

We evaluate the quality of various algorithms using two criteria:

169

1. Accuracy: We measure the accuracy of the solvers using the Hellinger distance [77].

Given a formula F with n variables, let P (Xi) and A(Xi) denote the exact and ap-

proximate marginal distribution respectively of variable Xi, then the Hellinger dis-

tance is defined as:

Hellinger distance =

∑ni=1

12

∑xi∈Di

(√P (xi)−

√A(xi))

2

n(4.14)

The exact marginal for each variable Xi is defined as: P (xi) = |Sxi|/|S| where Sxi

is the set of solutions of F having Xi = xi and S is the set of all solutions. The

number of solutions of F were computed using Cachet [119]. After running various

sampling algorithms, we get a set of solution samples φ from which we compute the

approximate marginal distribution as: A(xi) = |φ(xi)|/|φ| where φ(xi) is the subset

of solutions in φ having Xi = xi. We chose Hellinger distance because as pointed

out in [77], it is superior to other choices such as the Kullback-Leibler (KL) distance,

mean squared error and relative error when the marginal probabilities are close to

zero. We do not use KL distance because it lies between 0 and∞ and in practice it

may equal∞. For example, if P (xi) > 0 and xi is not sampled, A(xi) would be zero

and the KL distance would be infinite. Hellinger distance always lies between 0 and 1

and always yields a lower bound on the KL distance. We did compute the error using

other distance measures like the mean squared error and the relative error. All error

measures show similar trends, with Hellinger distance being the most discriminative.

2. Throughput: Throughput is defined as the number of random solutions generated per

second.

Note that accuracy is the primary performance measure because our target is to sample

solutions uniformly from the SAT formula and not specifically generating solutions quickly.

Throughput only serves to show whether a particular scheme is practical and scalable as

170

Instance n k SampleSat XorSample SampleSearch SIRwr SIRwor MH

Circuit

ssa7552-158.cnf 1363 3034 1.14E-01 2.61E-01 8.41E-02 1.25E-03 4.00E-03 1.07E-03

ssa7552-159.cnf 1363 3032 9.57E-02 1.00E+00 5.88E-02 1.40E-03 3.46E-03 1.09E-03

2bitcomp 5.cnf 125 310 2.65E-02 3.51E-01 1.02E-01 2.09E-03 3.42E-02 2.54E-03

2bitmax 6.cnf 252 766 3.11E-02 2.57E-01 8.10E-02 5.63E-02 5.99E-02 4.31E-02

Coloring

gcoloring-100.cnf 300 1417 2.96E-03 1.50E-01 6.98E-02 9.21E-04 1.94E-03 4.78E-04

gcoloring-200.cnf 600 2237 7.37E-03 1.03E-01 6.59E-02 7.21E-04 7.32E-03 8.40E-04

Logistics

log-1.cnf 939 3785 3.61E-02 5.93E-01 1.64E-02 8.75E-04 4.28E-03 1.02E-03

log-2.cnf 1337 24777 1.52E-01 8.60E-02 1.06E-01 1.65E-02 7.06E-02 1.33E-02

log-3.cnf 1413 29487 1.33E-01 7.76E-02 1.04E-01 2.23E-03 5.28E-02 2.59E-03

log-4.cnf 2303 20963 2.29E-01 1.39E-01 2.39E-01 8.07E-02 2.22E-01 6.92E-02

log-5.cnf 2701 29534 1.48E-01 7.54E-02 1.24E-01 2.20E-01 9.90E-02 1.09E-01

Pebbling

sat-grid-pbl-0010.cnf 110 191 8.38E-02 1.00E+00 1.60E-02 1.16E-03 1.77E-03 9.04E-04





Table 4.2: Table showing Hellinger distance of SampleSat, XorSample, SampleSearch,

SIRwR, SIRwoR and MH after 10 hrs of CPU time.

well as to quantify time vs accuracy tradeoffs.

In our evaluation, we had to use SAT instances whose solutions can be counted in a rela-

tively small amount of time because to compute the Hellinger distance (see Equation 4.14)

we have to count solutions to n + 1 SAT problems for each formula having n variables.

Therefore, even though we are able to generate solution samples from large SAT instances,

we are not able to evaluate the accuracy of our schemes on these instances. In general, the

time required to generate a solution sample via our SampleSearch strategy is roughly the

same as the time required by state-of-the-art solvers like minisat [125] and RSAT [108].

4.7.2 Results for p = 0

Table 4.2 summarizes the results of running each algorithm for exactly 10 hrs per problem

instance on these benchmarks. Columns 1 contains the instance name, columns 2 and

171

Problem #Var #Cl SampleSat XorSample SampleSearch

Circuit

ssa-158 1363 3034 6.67E+03 1.51E-01 1.67E+04

ssa-159 1363 3032 8.89E+03 0.00E+00 1.67E+04

2bitmax 6 125 310 8.06E+04 4.10E+00 1.00E+05

2bitcomp 5 252 766 6.94E+04 8.50E+00 1.00E+05

Coloring

gcoloring-100 300 1117 3.89E+03 1.98E+00 1.42E+04

gcoloring-200 600 2237 1.36E+03 5.68E-01 1.42E+03

Logistics

log-1 939 3785 1.25E+04 9.44E-01 2.00E+04

log-2 1337 24777 1.39E+02 5.36E-01 3.06E+03

log-3 1413 29487 9.72E+00 3.15E-01 2.98E+03

log-4 2303 20963 9.56E+00 4.14E-02 2.95E+03

log-5 2701 29534 9.10E-01 2.50E-03 2.22E+03

Pebbling

grid-10 110 191 5.00E+04 36 3.06E+03

grid-15 240 436 3.33E+04 0.00E+00 8.33E+03

grid-20 420 781 1.25E+04 0.00E+00 5.56E+03

grid-25 650 1226 3.33E+04 0.00E+00 3.33E+03

grid-30 930 1771 8.33E+03 0.00E+00 2.58E+02

Table 4.3: Table showing the number of samples generated per second by SampleSat, Xor-

Sample and SampleSearch.

172

3 report the number of variables and clauses for the instance, columns 4 to 9 report the

Hellinger distance after 10 hrs per instance for the solvers. Table 4.3 shows the number of

solution samples generated per second by SampleSearch, XorSample and SampleSat. Note

that the number of samples of SIR and MH schemes equals 10% of that of SampleSearch

and are therefore not shown in Table 4.3.

From Tables 4.2 and 4.3, we can make the following observations. First, on most instances

the Hellinger distance of all SIR and MH schemes is less than pure SampleSearch, Sam-

pleSat and XorSample. Second, on most instances, SampleSearch has lower Hellinger

distance and generates more samples than SampleSat.

Results on the Circuit instances

Time versus Hellinger distance plots for the circuit instances are shown in Figures 4.3(a)

and (b) and Figures 4.4(a) and (b). We observe that MH and SIR with replacement are the

best performing schemes on 3 instances while SampleSat performs better on the 2bitmax 6

instance. SampleSearch is more accurate than SampleSat and XorSample on the two ssa-

7552-* instances while SampleSat is more accurate than SampleSearch on 2bitcomp 5 and

2bitmax 6 instances.

Results on Graph Coloring instances

Figure 4.5 shows the anytime performance of the solvers over two graph coloring instances.

MH and SIR with replacement are the best performing schemes. On the instance having

100 vertices (see Figure 4.5(a)), MH and SIR with and without replacement have bet-

ter sampling error than pure SampleSearch, XorSample and SampleSat. On the instance

having 200 vertices (see Figure 4.5(b)), we see that SampleSat is slightly better than SIR

without replacement but is worse than both SIR with replacement and MH. Pure Sample-

173

0.001

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for 2bitcomp_5.cnf

SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(a)

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for 2bitmax_6.cnf

SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(b)

Figure 4.3: Time versus Hellinger distance plots for Circuit instances 2bitmax 6 and 2bit-

comp 5.

174

0.001

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for ssa7552-158.cnf

SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(a)

0.001

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for ssa7552-159.cnf

SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(b)

Figure 4.4: Time versus Hellinger distance plots for Circuit instances ssa7552-158 and

ssa7552-159 instances.

175

0.0001

0.001

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for gcoloring-100.cnf

SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(a)

0.0001

0.001

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for gcoloring-200.cnf

SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(b)

Figure 4.5: Time versus Hellinger distance plots for two 3-coloring instances with 100 and

200 vertices respectively.

176

Search is the worst performing scheme. Note that because finding a solution is relatively

easy on these instances, poor performance of SampleSearch does not translate down to SIR

and MH schemes because it is offset by the relatively large sample size.

Results on Logistics planning instances

Time versus Hellinger distance plots for the logistics instances are shown in Figures 4.6,

4.7 and 4.8. SIR with replacement and MH are more accurate than other schemes on log-1,

log-2 and log-3 and log-4 instances but on the log-5 instance, they are worse than pure

SampleSearch. On the log-5 instance, SampleSearch-SIR without replacement is the better

than SampleSearch. XorSample has the best accuracy for log-5 instance while on other

instances it is worse than MH and SIR schemes. SampleSat is worse than SampleSearch

on 3 instances.

Results on Grid Pebbling instances

Figures 4.9, 4.10 and 4.11 show a plot of time versus Hellinger distance on the grid pebbling

instances. On the smaller Grid Pebbling benchmarks of size 10, 15 and 20, we observe

that SIR with replacement and MH are the best performing schemes while on the larger

instances of size 25 and 30, SIR without replacement performs better than other competing

techniques. XorSample is able to generate solution samples on only the smallest pebbling

instance of size 10. On this instance, it not only generates far less samples as compared

with other competing schemes but also has lower accuracy. SampleSearch is slightly more

accurate than SampleSat on 4 out of 5 instances.

177

0.0001

0.001

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for log-1.cnf

SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(a)

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(b)

Figure 4.6: Time versus Hellinger distance plots for Logistics planning instances log-1 and

log-2.

178

0.001

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(a)

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(b)

Figure 4.7: Time versus Hellinger distance plots for Logistics planning instances log-3 and

log-4.

179

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(a)

Figure 4.8: Time versus Hellinger distance plot for a Logistics planning instance log-5.

4.7.3 Predicting which sampling scheme to use

In this subsection, we discuss a simple statistical test based on co-efficient of variation [88]

to predict which SampleSearch based sampling scheme would be better. Co-efficient of

variation measures the dispersion of the probability distribution and can be used to esti-

mate the variance of the normalized weights. Given a weight vector (w1, . . . , wN) for the

samples generated by SampleSearch, the co-efficient of variation is defined as:

cv2(w) =

∑Ni=1(w

i − w)2

(N − 1)w2 (4.15)

where w is the average of the weight vector.

cv2 measures how the samples output by SampleSearch compare to those drawn from the

uniform distribution over the solutions. When it is large, it indicates that the samples are

not as accurate yielding high variance while when small, it indicates good accuracy.

180

0.0001

0.001

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds

Approximation Error vs Time for sat-grid-pbl-0010.cnf

SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(a)

0.001

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(b)

Figure 4.9: Time versus Hellinger distance plot for Grid pebbling instances of size 10 and

15.

181

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(a)

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(b)

Figure 4.10: Time versus Hellinger distance plot for Grid pebbling instances of size 20 and

25

182

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


SampleSatXorSample

SampleSearch

SIRwrSIRwor

MH

(a)

Figure 4.11: Time versus Hellinger distance plot for Grid pebbling instance of size 30

Figure 4.12 and 4.13 show the change in cv2 of SampleSearch’s weights with time for each

benchmark category. We find that the performance of SIR and MH schemes is quite cor-

related with the co-efficient of variation. When the co-efficient of variation is greater than

10000, we find that SIR without replacement is more accurate than SIR with replacement

and MH schemes while when it is below 10000, SIR with replacement and MH are more

accurate than SIR without replacement. For example, in case of log-5 and grid-pbl-0030

instances, the co-efficient of variation is greater than 10000 and SIR without replacement

is better than SIR with replacement (see Figures 4.12(a) and 4.8).

4.7.4 Results for p > 0

In this subsection, we study the impact of p on the accuracy of SampleSearch, MH and SIR

methods.

183

1

10

100

1000

0 5000 10000 15000 20000 25000 30000 35000 40000

Co-e

ffic

ient

of

Var

iati

on s

quar

ed

Time in seconds

Logistics instances: co-efficient of Variation squared versus time

log-1 log-2 log-3 log-4 log-5

(a)

1

10

100

1000

10000

100000

1e+06

0 5000 10000 15000 20000 25000 30000 35000 40000

Co-e

ffic

ient

of

Var

iati

on s

quar

ed

Time in seconds

Graph coloring instances: co-efficient of Variation squared versus time

flat-100 flat-200

(b)

Figure 4.12: Time versus co-efficient of variation of SampleSearch for the Logistics and

3-coloring benchmarks.

184

0.1

1

10

100

1000

0 5000 10000 15000 20000 25000 30000 35000 40000

Co-e

ffic

ient

of

Var

iati

on s

quar

ed

Time in seconds

Grid Pebbling instances: co-efficient of Variation squared versus time

grid-pbl-10grid-pbl-15

grid-pbl-20grid-pbl-25

grid-pbl-30

(a)

1

10

100

1000

0 5000 10000 15000 20000 25000 30000 35000 40000

Co-e

ffic

ient

of

Var

iati

on s

quar

ed

Time in seconds

Circuit instances: co-efficient of Variation squared versus time

ssa7552-158ssa7552-159

2bitcomp\_52bitmax\_6

(b)

Figure 4.13: Time versus co-efficient of variation of SampleSearch for the Grid pebbling

and Circuit benchmarks.

185

SampleSearch

Problem #Var #Cl P=0 P=10 P=20 P=50 P=100

Circuit

ssa-158 1363 3034 16667 1.50 0.94 0.40 0.22

ssa-159 1363 3032 16667 1.39 0.86 0.33 0.17

2bitmax 6 125 310 100000 75.94 34.23 22.31 10.36

2bitcomp 5 252 766 100000 9.06 5.70 2.31 1.11

Coloring

gcoloring-100 300 1117 14167 4.94 3.22 1.08 0.56

gcoloring-200 600 2237 1417 1.39 0.75 0.31 0.16

Logistics

log-1 939 3785 20000 7.33 4.98 1.92 1.11

log-2 1337 24777 3060 0.22 0.14 0.06 0.00

log-3 1413 29487 2980 0.50 0.31 0.13 0.06

log-4 2303 20963 2950 0.86 0.47 0.14 0.06

log-5 2701 29534 2220 0.08 0.00 0.00 0.00

Pebbling

grid-10 110 191 3060 81.14 81.00 81.08 52.83

grid-15 240 436 8330 11.75 7.22 3.33 2.04

grid-20 420 781 5560 6.33 1.44 0.75 0.36

grid-25 650 1226 3330 1.08 0.67 0.25 0.11

grid-30 930 1771 258 0.17 0.08 0.00 0.00

Table 4.4: Table showing the number of solutions output per second by SampleSearch as a

function of p.

186

SampleSearch

Instance #Var #Cl p=0 p=10 p=20 p=50 p=100

Circuit

ssa7552-158.cnf 1363 3034 8.41E-02 1.01E-01 1.01E-01 1.02E-01 1.06E-01

ssa7552-159.cnf 1363 3032 5.88E-02 8.86E-02 8.92E-02 8.96E-02 8.61E-02

2bitcomp 5.cnf 125 310 1.02E-01 9.34E-02 9.34E-02 9.38E-02 9.39E-02

2bitmax 6.cnf 252 766 8.10E-02 7.54E-02 7.60E-02 7.51E-02 7.43E-02

Coloring

gcoloring-100.cnf 300 1417 6.98E-02 5.35E-02 5.37E-02 5.31E-02 5.11E-02


Logistics

log-1.cnf 939 3785 1.64E-02 2.39E-02 2.40E-02 2.36E-02 2.35E-02

log-2.cnf 1337 24777 1.06E-01 9.02E-02 1.00E-01 9.76E-02 1.08E-01

log-3.cnf 1413 29487 1.04E-01 9.88E-02 9.95E-02 9.49E-02 9.95E-02

log-4.cnf 2303 20963 2.39E-01 1.73E-01 1.71E-01 1.77E-01 1.71E-01

log-5.cnf 2701 29534 1.24E-01 1.05E-01 1.13E-01 X X

Pebbling

sat-grid-pbl-0010.cnf 110 191 1.60E-02 5.24E-03 5.24E-03 5.24E-03 5.41E-03





Table 4.5: Table showing the Hellinger distance of SampleSearch as a function of p after

10 hrs of CPU time.

187

SIRwr


Circuit

ssa7552-158.cnf 1363 3034 1.25E-03 2.72E-02 3.11E-02 4.11E-02 6.18E-02

ssa7552-159.cnf 1363 3032 1.40E-03 5.32E-02 5.85E-02 7.78E-02 1.01E-01


2bitmax 6.cnf 252 766 5.63E-02 1.37E-01 2.30E-01 3.58E-01 1.13E-01

Coloring



Logistics

log-1.cnf 939 3785 8.75E-04 6.05E-03 9.62E-03 8.57E-03 1.89E-02

log-2.cnf 1337 24777 1.65E-02 7.72E-02 1.21E-01 1.57E-01 1.66E-02

log-3.cnf 1413 29487 2.23E-03 4.61E-02 5.66E-02 8.10E-02 1.09E-01

log-4.cnf 2303 20963 8.07E-02 1.34E-01 1.52E-01 1.49E-01 1.72E-01

log-5.cnf 2701 29534 2.20E-01 2.25E-01 2.80E-02 X X

Pebbling






Table 4.6: Table showing the Hellinger distance of SampleSearch-SIR with replacement as

a function of p after 10 hrs of CPU time.

Tables 4.4 and 4.5 show how increasing p affects the throughput and accuracy of Sample-

Search respectively. Figure 4.14 shows how the accuracy varies with time for two sample

instances. As expected, on an average the accuracy increases slightly with p but throughput

decreases significantly. Low values of p (≤ 20) pay off more than high values of p in that

accuracy increases at a higher rate when we move from p = 0 to p = 20 and remains

mostly constant there after while throughput reduces substantially.

Tables 4.6, 4.7 and 4.8 show how increasing p affects the accuracy of SIR with replacement,

SIR without replacement and MH respectively. Figures 4.15, 4.16 and 4.17 show how the

accuracy varies with time for the SIR and MH schemes for two sample instances. We see

that accuracy does not improve with p. This is because the accuracy of MH and SIR is

dependent not only on the distance between the backtrack-free distribution and the uniform

188

SIRwor


Circuit

ssa7552-158.cnf 1363 3034 4.00E-03 3.75E-02 4.33E-02 5.26E-02 6.26E-02

ssa7552-159.cnf 1363 3032 3.46E-03 3.52E-02 4.39E-02 4.28E-02 6.20E-02


2bitmax 6.cnf 252 766 5.99E-02 4.53E-02 4.60E-02 4.36E-02 5.05E-02

Coloring



Logistics

log-1.cnf 939 3785 4.28E-03 7.13E-03 7.50E-03 8.88E-03 1.07E-02

log-2.cnf 1337 24777 7.06E-02 6.79E-02 7.72E-02 8.19E-02 9.23E-02

log-3.cnf 1413 29487 5.28E-02 4.91E-02 5.40E-02 4.96E-02 5.65E-02

log-4.cnf 2303 20963 2.22E-01 1.43E-01 1.31E-01 1.57E-01 1.58E-01

log-5.cnf 2701 29534 9.90E-02 8.99E-02 1.01E-01 X X

Pebbling






Table 4.7: Table showing the Hellinger distance of SampleSearch-SIR without replacement

as a function of p after 10 hrs of CPU time.

189

MH


Circuit

ssa7552-158.cnf 1363 3034 1.07E-03 3.30E-02 3.73E-02 3.73E-02 7.78E-02

ssa7552-159.cnf 1363 3032 1.09E-03 3.85E-02 4.91E-02 5.41E-02 6.64E-02


2bitmax 6.cnf 252 766 4.31E-02 7.54E-02 7.29E-02 1.34E-01 8.42E-02

Coloring



Logistics

log-1.cnf 939 3785 1.02E-03 6.11E-03 1.28E-02 8.75E-03 1.57E-02

log-2.cnf 1337 24777 1.33E-02 7.83E-02 9.66E-02 1.28E-01 1.44E-02

log-3.cnf 1413 29487 2.59E-03 4.27E-02 3.65E-02 4.82E-02 7.58E-02

log-4.cnf 2303 20963 6.92E-02 9.17E-02 1.08E-01 1.28E-01 1.60E-01

log-5.cnf 2701 29534 1.09E-01 1.04E-01 3.16E-02 X X

Pebbling






Table 4.8: Table showing the Hellinger distance of SampleSearch-MH as a function of pafter 10 hrs of CPU time.

190

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


p=0p=10p=20

p=50p=100

(a)

0.01

0.1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


p=0p=10p=20

p=50p=100

(b)

Figure 4.14: Time versus Hellinger distance plot for SampleSearch as a function of p for

log-4 and Grid pebbling instance of size 25.

191

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


p=0p=10p=20

p=50p=100

(a)

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


p=0p=10p=20

p=50p=100

(b)

Figure 4.15: Time versus Hellinger distance plot for SampleSearch-SIR with replacement

as a function of p for log-4 and Grid pebbling instance of size 25.

192

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


p=0p=10p=20

p=50p=100

(a)

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


p=0p=10p=20

p=50p=100

(b)

Figure 4.16: Time versus Hellinger distance plot for SampleSearch-SIR without replace-

ment as a function of p for log-4 and Grid pebbling instance of size 25.

193

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


p=0p=10p=20

p=50p=100

(a)

0.01

0.1

1

0 5000 10000 15000 20000 25000 30000 35000 40000

Hel

linger

Dis

tance

Time in seconds


p=0p=10p=20

p=50p=100

(b)

Figure 4.17: Time versus Hellinger distance plot for SampleSearch-MH as a function of pfor log-4 and Grid pebbling instance of size 25.

194

distribution over the solutions but also on the number of samples. Both quantities decrease

with p and we find that improving SampleSearch’s accuracy at the expense of generating

less samples does not pay off. In general, our empirical study suggests using low values of

p ≤ 20 in conjunction with MH and SIR.

4.8 Conclusion

The chapter introduces two schemes of SampleSearch-MH and SampleSearch-SIR for gen-

erating random, uniformly distributed solutions from a Boolean satisfiability formula. The

origin for this task is the use of satisfiability based methods for random test program gener-

ation in the functional verification field and for performing MCMC inference in first order

probabilistic models.

SampleSearch-MH and SampleSearch-SIR combine SampleSearch with statistical tech-

niques of Metropolis Hastings (MH) and Sampling / Importance Resampling (SIR) respec-

tively and guarantee that in the limit of infinite samples the distribution over the generated

solution samples is the uniform distribution over the solutions. These guarantees are sig-

nificant because state-of-the-art schemes of SampleSat [130] and XorSample [63] do not

have them.

We provided a thorough empirical evaluation of SampleSearch, SampleSearch-MH and

SampleSearch-SIR with SampleSat and XorSample. Our results showed conclusively that

SampleSearch-MH and SampleSearch-SIR are superior to both SampleSat and XorSample

in terms of accuracy.

We also studied experimentally the use of a parameter p ∈ 0, 100 which guides re-

computation of the sampling distribution by executing IJGP periodically. We found that

values of p < 20 are more cost-effective.

195

Chapter 5

Lower Bounding weighted counts using

the Markov Inequality

5.1 Introduction

Approximating weighted counting tasks such as computing the probability of evidence in a

Bayesian network, the partition function of a Markov network and counting the number of

solutions of a constraint satisfaction or a Boolean satisfiability problem, even with known

error bounds is NP-hard [21]. In this chapter, we address this hard problem by proposing

schemes that yield a high confidence lower bound on the weighted counts but do not have

any guarantees of relative or absolute error.

Previous work on bounding the weighted counts comprises of deterministic approximations

[39, 82, 5] and sampling based randomized approximations [15, 22]. An approximation

algorithm for computing the lower bound is deterministic if it is always guaranteed to

output a lower bound. On the other hand, an approximation algorithm is randomized if the

approximation fails with a known probability δ > 0. The work in this chapter falls under

196

the class of randomized approximations.

Randomized approximations [15, 22] use known inequalities such as the Chebyshev and

the Hoeffding inequalities [68] for lower (and upper) bounding the weighted counts. The

Chebyshev and Hoeffding inequalities provide bounds on how the sample mean of N inde-

pendently and identically distributed random variables deviates from the actual mean. The

main idea in [15, 22], which is in some sense similar to importance sampling [118, 50]

is to express the problem of computing the weighted counts as the problem of computing

the mean (or expected value) of independent random variables and then use the mean over

the sampled random variables to bound the deviation from the actual mean. The problem

with these previous approaches is that the number of samples required to guarantee high

confidence lower (or upper) bounds is inversely proportional to the weighted counts (or the

actual mean). Therefore, if the counts are arbitrarily small (e.g. < 10−20), a large number

of samples (approximately 1019) are required to guarantee the correctness of the bounds.

In this chapter, we address this problem by focusing on the Markov inequality which is

independent of N . Recently, the Markov inequality was used to lower bound the number

of solutions of a satisfiability formula [62] showing good empirical results. We adapt this

scheme to compute lower bounds on weighted counts and extend it in the following ways.

First, we address one of the well known concerns in statistics literature that the Markov

inequality is quite weak and yields bad approximations. We argue that the Markov inequal-

ity is weak because it is based on a single sample and in fact good lower bounds can be

obtained by extending it to multiple samples. Specifically, we propose three new schemes:

one based on sample average and two based on the martingale theory [10] which utilize the

maximum weight from the generated samples. Our new schemes guarantee that as more

samples are drawn, the lower bound is likely to increase. Second, we show how we can

estimate the weighted counts efficiently on mixed constraint and probabilistic graphical

models by applying the Markov inequality on top of SampleSearch (see Chapter 3).

197

We provide thorough empirical results demonstrating the potential of our new scheme.

On the probabilistic networks we compared against state-of-the-art deterministic approx-

imations such as Variable elimination and Conditioning (VEC) [28] and bound propaga-

tion [82, 6]. For the task of lower bounding the number of models of a satisfiability

formula, we compared against the Relsat [115] scheme which yields a deterministic ap-

proximation and a randomized approximation called SampleCount [62] which can also be

combined with our extensions of the Markov inequality. Our results clearly show that our

new randomized approximations based on the Markov inequality are more scalable than

deterministic approximations like VEC, Relsat and bound propagation and in most cases

yield highly accurate lower bounds which are closer to the exact answer than those output

by these other schemes. Also on most instances, SampleSearch yields higher lower bounds

as compared with SampleCount.

The research presented in this chapter is based in part on [51].

The rest of this chapter is organized as follows. In Section 5.2, we describe preliminaries

and previous work. In Section 5.3, we present our basic lower bounding scheme and sev-

eral enhancements. Experimental results are presented in Section 5.4 and we conclude in

Section 5.5.

5.2 Background

In this section, we present previous work by Dagum and Luby [22] and Cheng [15] to

approximate the weighted counts with a known relative error ǫ.

Given a mixed networkM = 〈X,D,F,C〉, consider the expression for weighted counts:

Z =∑

x∈X

m∏

i=1

Fi(x)

p∏

j=1

Cj(x) (5.1)

198

Recall that in importance sampling, given N independent and identically distributed (i.i.d.)

samples (x1, . . . , xN) generated from a proposal distribution Q(X) satisfying∏m

i=1 Fi(x)∏p

j=1Cj(x) > 0 → Q(x) > 0, the weighted counts are estimated in an unbiased fashion

by:

ZN =1

N

N∑

k=1

∏mi=1 Fi(x

k)∏p

j=1Cj(xk)

Q(xk)=

1

N

N∑

k=1

w(xk) (5.2)

where

w(x) =

∏mi=1 Fi(x)

∏pj=1Cj(x)

Q(x)(5.3)

is the weight of sample x.

Dagum and Luby [22] provide an upper bound on the number of samples N required to

guarantee that for any ǫ, δ > 0, the estimate ZN approximates Z with relative error ǫ with

probability at least 1− δ. Formally,

Pr[Z(1− ǫ) ≤ ZN ≤ Z(1 + ǫ)] > 1− δ (5.4)

when N satisfies:

N ≥ 4

Zǫ2ln

2

δ(5.5)

These bounds were later improved by [15] yielding:

N ≥ 1

Z

1

(1 + ǫ)ln(1 + ǫ)− ǫln2

δ(5.6)

In both these bounds (Equations 5.5 and 5.6 ) N is inversely proportional to Z and there-

fore when Z is small, a large number of samples are required to achieve an acceptable

confidence level (1− δ) > 0.99.

A bound onN is required because [22, 15] insist on approximating Z with a known relative

199

error ǫ. If we relax this relative error requirement, we can use just one sample and the

Markov inequality to obtain a high confidence lower bound on Z. Furthermore, we can

improve the lower bound with more samples, as we demonstrate in the next section.

5.3 Markov Inequality based Lower Bounds

PROPOSITION 18 (Markov Inequality). For any random variable X and a real number

r > 1, Pr(X > rE[X]) < 1r.

The Markov inequality states that the probability that a random variable is r times its ex-

pected value is less than or equal to 1/r. The weight of each sample generated by impor-

tance sampling is a random variable. Because the expected value of the weight equals the

weighted counts Z ( see Equation 5.1 ), it is straightforward to see that given a real number

r > 1, the probability that the weight of a sample is greater than r times Z is less than 1/r.

Alternately, the weight of the sample divided by r is a lower bound on Z with probability

greater than 1 − 1/r. Formally, given a sample x drawn independently from a proposal

distribution Q, we have:

Pr(w(x) > r × Z) <1

r(5.7)

Rearranging Equation 5.7, we get:

Pr

(w(x)

r< Z

)> 1− 1

r(5.8)

Equation 5.8 can be used to probabilistically lower bound Z as shown in the following

example.

EXAMPLE 13. Let r = 100 and let w(x) be the weight of a sample drawn independently

from a proposal distribution Q. Then from Equation 5.8,w(x)100

is a lower bound on Z with

200

probability greater than 1− (1/100) = 0.99.

The lower bound based on the Markov inequality uses just one sample. In the following,

we consider ways in which multiple samples could be utilized to improve the lower bound.

We formalize the probabilistic lower bounding problem as follows:

DEFINITION 37 (Probabilistic Lower Bounding problem). Given N samples (x1 , . . . ,

xN) drawn independently from a proposal distribution Q, such that E[w(xi)] = Z for

i ∈ 1, . . . , N and a constant 0 < α < 1, the probabilistic lower bounding problem

is to output a statistic ZN (based on the N samples) which is a lower bound on Z with

probability greater than α.

The minimum scheme was proposed to address this probabilistic lower bounding problem.

5.3.1 The Minimum scheme

This scheme due to Gomes et al. [62] uses the minimum over the sample weights to com-

pute a lower bound on Z. Although this scheme was originally introduced in the context

of lower bounding the number of solutions to a satisfiability (SAT) problem, we can easily

modify it to compute a lower bound on the weighted counts as we show next.

THEOREM 16 (minimum scheme). Given N samples (x1, . . . , xN) drawn independently

from a proposal distribution Q such that E[w(xi)] = Z for i = 1, . . . , N and a constant

0 < α < 1,

Pr

[minNi=1

[w(xi)

β

]< Z

]> α, where β =

(1

1− α

) 1N

201

Proof. Consider an arbitrary sample xi. From the Markov inequality, we get:

Pr

[w(xi)

β> Z

]< 1/β (5.9)

Since, the generated N samples are independent, the probability that the minimum over

them is also a lower bound is given by:

Pr

[minNi=1

[w(xi)

β

]> Z

]< 1/βN (5.10)

Rearranging Equation 5.10, we get:

Pr

[minNi=1

[w(xi)

β

]< Z

]> 1− 1

βN(5.11)

Substituting β =(

11−α

) 1N in 1− 1

βN , we get:

1− 1

βN= 1− 1

((1

1−α

) 1N

)N

= 1− 11

1−α

= 1− (1− α)

= α (5.12)

Therefore, from Equations 5.11 and 5.12, we have

Pr

[minki=1

[w(xi)

β

]< Z

]> α (5.13)

Algorithm 20 describes the minimum scheme based on Theorem 16. The algorithm first

calculates β based on the value of α andN . It then returns the minimum ofw(xi)β

(minCount

202

Algorithm 20: Minimum-scheme

Input: A mixed networkM = 〈X,D,F,C〉, proposal distribution Q, integer N and a

real number 0 < α < 1Output: Lower Bound on Z that is correct with probability greater than αminCount←∞;1

β =(

11−α

) 1N ;2

for i = 1 to N do3

Generate a sample xi from Q ;4

IF minCount > w(xi)β

THEN minCount = w(xi)β

;5

Return minCount;6

in Algorithm 20) over the N samples.

The problem with the minimum scheme is that because it computes a minimum over the

sample weights, unless the variance of the weights is very small, we expect the lower bound

to decrease with increase in the number of samples N . On the other hand, a good property

of the minimum scheme is that with more samples, the divisor β = 1

(1−α)1N

decreases,

thereby possibly increasing the lower bound. Next, we present our average scheme for

computing a probabilistic lower bound which avoids this problem.

5.3.2 The Average Scheme

An obvious scheme is to use the unbiased importance sampling estimator ZN given in

Equation 5.2. Because EQ[ZN ] = Z, from the Markov inequalitybZN

βwhere β = 1

1−αis a

lower bound of Z with probability greater than α. Formally,

Pr

[ZNβ

< Z

]> α, where β =

1

1− α (5.14)

As more samples are drawn the average is likely to stabilize and will usually be larger than

the minimum value. However, unlike the minimum scheme in which the divisor β decreases

with increase in the sample size thereby increasing the lower bound, the divisor β in the

203

average scheme remains constant. As a consequence, for example, if all the generated

samples have the same weight (or almost the same weight), the lower bound due to the

minimum scheme would be greater than the lower bound output by the average scheme.

However, in practice the variance is typically never close to zero and therefore the average

scheme is likely to better than the minimum scheme.

5.3.3 The Maximum scheme

We can even use the maximum instead of the average over the N i.i.d samples as shown in

the following Lemma.

LEMMA 1 (maximum scheme). GivenN samples (x1, . . . ,xN) drawn independently from

a proposal distribution Q such that E[w(xi)] = Z for i = 1, . . . , N and a constant 0 <

α < 1,

Pr

[maxNi=1w(xi)

β< Z

]> α, where β =

1

1− α 1N

Proof. From Markov inequality, we have:

Pr

[w(xi)

β< Z

]> 1− 1

β(5.15)

Given a set of N independent events such that each event occurs with probability > (1 −

1/β), the probability that all events occur is > (1 − 1/β)N . In other words, given N

independent samples such that the weight of each sample is a lower bound on Z with

probability > (1 − 1/β), the probability that all samples are a lower bound on Z is >

(1− 1/β)N . Consequently,

Pr

[maxNi=1w(xi)

β< Z

]>

(1

1− β

)N(5.16)

204

Substituting the value of β in(

11−β

)N, we have:

(1

1− β

)N=

1

1− 1

1−α1N

N

= (1− (1− α 1N ))N

= α (5.17)

From Equations 5.16 and 5.17, we get:

Pr

[maxNi=1w(xi)

β< Z

]> α (5.18)

The problem with the maximum scheme is that increasing the number of samples increases

β and consequently the lower bound decreases. However, when only a few samples are

available and the variance of the weights w(xi) is large, the maximum value is likely to be

larger than the sample average and obviously the minimum.

5.3.4 Using the Martingale Inequalities

Another approach to utilize the maximum over the N samples is to use the martingale

inequalities.

DEFINITION 38 (Martingale). A sequence of random variables X1, . . . , XN is a martin-

gale with respect to another sequence Y1, . . . , YN defined on a common probability space

Ω iff E[Xi|Y1, . . . , Yi−1] = Xi−1 for all i.

It is easy to see that given i.i.d. samples (x1, . . . ,xN) generated from Q, the sequence

205

Λ1, . . . ,ΛN , where Λp =∏p

i=1w(xi)Z

forms a martingale as shown below:

E[Λp|x1, . . . ,xp−1] = E

[Λp−1 ∗

w(xp)

Z|x1, . . . ,xp−1

]

= Λp−1 ∗ E

[w(xp)

Z|x1, . . . ,xp−1

]

Because E[w(xp)Z|x1, . . . ,xp−1] = 1, we have E[Λp|x1, . . . ,xp−1] = Λp−1 as required. The

expected value E[Λ1] = 1 and for such martingales which have a mean of 1, Breiman [10]

provides the following extension of the Markov inequality:

Pr(maxNi=1Λi > β) <1

β(5.19)

and therefore,

Pr

([maxNi=1

i∏

j=1

w(xj)

Z

]> β

)<

1

β(5.20)

From Inequality 5.20, we can prove that:

THEOREM 17 (Random permutation scheme). Given N samples (x1, . . . ,xN) drawn

independently from a proposal distribution Q such that E[w(xi)] = Z for i = 1, . . . , N

and a constant 0 < α < 1,

Pr

maxNi=1

(1

β

i∏

j=1

w(xj)

)1/i

< Z

> α, where β =

1

1− α

Proof. From Inequality 5.20, we have:

Pr

([maxNi=1

i∏

j=1

w(xj)

Z

]> β

)<

1

β(5.21)

206

Rearranging Inequality 5.21, we have:

Pr

maxNi=1

(1

β

i∏

j=1

w(xj)

)1/i

< Z

> 1− 1

β= α (5.22)

Therefore, given N samples, the following quantity

maxNi=1

(1

β

i∏

j=1

w(xj)

)1/i

where β =1

1− α

is a lower bound on Z with a confidence greater than α. In general one could use any

randomly selected permutation of the samples (x1, . . . ,xN) and apply inequality 5.20. We

therefore call this scheme as the random permutation scheme.

Another related extension of Markov inequality for martingales deals with the order statis-

tics of the samples. Letw(x(1))Z≤ w(x(2))

Z≤ . . . ≤ w(x(N))

Zbe the order statistics of the

sample. Using martingale theory, Kaplan [71] proved that the random variable

Θ∗ = maxNi=1

i∏

j=1

w(x(N−j+1))

Z ×(Ni

)

satisfies the inequality Pr(Θ∗ > k) < 1/k. Therefore,

Pr

([maxNi=1

i∏

j=1

w(x(N−j+1))

Z ×(Ni

)]> β

)<

1

β(5.23)

From Inequality 5.23, we can prove that:

THEOREM 18 (Order Statistics scheme). Given an order statistics of the weightsw(x(1))Z≤

w(x(2))Z≤ . . . ≤ w(x(N))

Zof N samples (x1, . . . ,xN) drawn independently from a proposal

207

distribution Q, such that E[w(xi)] = Z for i = 1, . . . , N and a constant 0 < α < 1,

Pr

maxNi=1

(1

β

i∏

j=1

w(x(N−j+1))(Ni

))1/i

< Z

> α, where β =

1

1− α

Proof. From Inequality 5.23, we have:

Pr

([maxNi=1

i∏

j=1

w(x(N−j+1))

Z ×(Ni

)]> β

)<

1

β(5.24)

Rearranging Inequality 5.24, we have:

Pr

maxNi=1

(1

β

i∏

j=1

w(x(N−j+1))(Ni

))1/i

< Z

> 1− 1

β= α (5.25)

Thus, given N samples, the following quantity

maxNi=1

(1

β

i∏

j=1

w(x(N−j+1))(Ni

))1/i

, where β =1

1− α

is a lower bound on Z with probability greater than α. Because the lower bound is based

on the order statistics, we call this scheme as the order statistics scheme.

To summarize, we have proposed five schemes that generalize the Markov inequality to

multiple samples: (1) The minimum scheme, (2) The average scheme, (3) The maximum

scheme, (4) The martingale random permutation scheme and (5) The martingale order

statistics scheme.

All these schemes can be used with any sampling scheme that outputs unbiased sample

weights to yield a probabilistic lower bound on the weighted counts.

208


Because SampleSearch samples from the backtrack-free distribution, it is easy to see that

to use the lower bounding schemes on the samples output by SampleSearch, all we have to

do is replace the weight w(x) of sample x by its backtrack-free weight wF (x) (see Chapter

3). Interestingly, we can even use the lower backtrack free approximation wLN(x) instead

of wF (x) because a lower bound of a lower bound is also a lower bound, and as we already

proved in Chapter 3, wLN(x) is a lower bound of wF (x).

We evaluate the performance of the probabilistic lower bounding schemes presented in this

chapter using the IJGP-sampling and SampleSearch schemes presented in this thesis (see

Chapters 2 and 3 respectively). We also compare against deterministic approximations,

which always yield a lower bound (with probability one) such as Bound propagation and

its improvements [5], Variable elimination and conditioning (VEC) [28] and Relsat [115].

We conducted experiments on three weighted counting tasks: (a) Satisfiability Model

counting, (b) Computing probability of evidence in a Bayesian network and (c) computing

the partition function of a Markov network. Our experimental data clearly demonstrates

that our new lower bounding schemes are more accurate, robust and scalable than deter-

ministic approximations such as VEC, Relsat and Bound propagation yielding far better

(higher) lower bounds on large instances.

5.4.1 The Algorithms Evaluated

We experimented with the following schemes. Hence forth, we call our new probabilistic

lower bounding schemes as Markov-LB.

209

Markov-LB with SampleSearch and IJGP-sampling: We used the same implementa-

tion of IJGP-sampling and SampleSearch described in Chapters 2 and 3 respectively. We

used IJGP-sampling on networks which have no determinism (or a small amount) while on

networks with substantial amount of determinism, we used the SampleSearch scheme.

Bound Propagation with Cut-set Conditioning We also experimented with the state of

the art any-time bounding scheme that combines sampling-based cut-set conditioning and

bound propagation [82] and which is a part of Any-Time Bounds framework for bounding

posterior marginals [5]. Given a subset of variables C ⊂ X\E, we can compute P (e)

exactly as follows:

P (e) =k∑

i=1

P (ci, e) (5.26)

The lower bound on P (e) is obtained by computing P (ci, e) for h high probability tuples of

C (selected through sampling) and bounding the remaining probability mass by computing

a lower bound PL(c1, ..., cq, e) on P (c1, ..., cq, e), q < |C|, for a polynomial number of

partially instantiated tuples of subset C, resulting in:

P (e) ≥h∑

i=1

P (ci, e) +k′∑

i=1

PLBP (ci

1, ..., ciq, e) (5.27)

where lower bound PLBP (c1, ..., cq, e) is obtained using bound propagation. Although

bound propagation bounds marginal probabilities, it can be used to bound any joint proba-

bility P (z) as follows:

PLBP (z) =

∏

i

PLBP (zi|z1, ..., zi−1) (5.28)

where lower bound PLBP (zi|z1, ..., zi−1) is computed directly by bound propagation. We

use here the same variant of bound propagation described in [6] that is used by the Any-

Time Bounds framework. The lower bound obtained by Equation 5.27 can be improved by

exploring a larger number of tuples h. After generating h tuples by sampling, we can stop

the computation at any time after bounding p < k′ out of k′ partially instantiated tuples and

210

produce the result.

In our experiments we run the bound propagation with cut-set conditioning scheme until

convergence or until a stipulated time bound has expired. Finally, we should note that

the bound propagation with cut-set conditioning scheme provides deterministic lower and

upper bounds on P (e) while our Markov-LB scheme provides only a lower bound and it

may fail with a probability δ ≤ 0.01.

Variable Elimination and Conditioning (VEC) When a problem having a high treewidth

is encountered, variable or bucket elimination may be unsuitable, primarily because of its

extensive memory demand. To alleviate the space complexity, we can use the w-cutset con-

ditioning scheme (see Section 1.3.2). Namely, we condition or instantiate enough variables

(or the w-cutset) so that the remaining problem after removing the instantiated variables

can be solved exactly using bucket elimination [28]. In our experiments we select the w-

cutset in such a way that bucket elimination would require less than 1.5GB of space. Exact

weighted counts can be computed by summing over the exact solution output by bucket

elimination for all possible instantiations of the w-cutset. When VEC is terminated before

completion, it outputs a partial sum yielding a lower bound on the weighted counts. The

implementation of VEC is available publicly from our software website [30].

Markov-LB with SampleCount

We use the same implementation of SampleCount [62] as described in Chapter 3. To recap,

SampleCount outputs unbiased solution counts of a Boolean Satisfiability formula. There-

fore, it can be easily combined with all Markov-LB schemes yielding the Markov-LB with

SampleCount scheme.

Relsat

As mentioned in Chapter 3, Relsat [115] is an exact algorithm for counting solutions of

211

a satisfiability problem. If stopped before termination, Relsat yields a lower bound on

the number of solutions. The implementation of Relsat is available publicly from the first

authors website [115].

We experimented with four versions of Markov-LB (run with SampleSearch, SampleCount

and IJGP-Sampling): (a) Markov-LB as given in Algorithm 20, (b) Markov-LB with the

average scheme, (c) Markov-LB with the martingale random permutation scheme and (d)

Markov-LB with the martingale order statistics scheme. Note that the maximum scheme

is subsumed by the Markov-LB with the martingale order statistics scheme. In all our

experiments, we set α = 0.99.

Evaluation Criteria

On each instance, we compared the log relative error between the exact value of probability

of evidence (or the solution counts for satisfiability problems) and the lower bound gener-

ated by the respective techniques. Formally, if Z is the actual probability of evidence (or

solution counts) and Z is the approximate probability of evidence (or solution counts), the

log-relative error denoted by ∆ is given by:

∆ =log(Z)− log(Z)

log(Z)(5.29)

When the exact results are not known, we use the highest lower bound reported by the

schemes as a substitute for Z in Equation 5.29. We compute the log relative error instead

of the usual relative error because when the probability of evidence is extremely small

(< 10−10) or when the solution counts are large (e.g. > 1010) the relative error between

the exact and the approximate probability of evidence will be arbitrarily close to 1 and we

would need a large number of digits to determine the best performing scheme.

212

Notation in Tables

The first column in each table (see for example Table 5.1) gives the name of the instance.

The second column provides raw statistical information about the instance such as: (i)

number of variables (n), (ii) average domain size (d), (iii) number of clauses (c) or number

of evidence variables (e) and (iv) the treewidth of the instance (w). The third column

provides the exact answer for the problem if available while the remaining columns display

the output produced by the various schemes after the specified time-bound. The columns

Min, Avg, Per and Ord give the log-relative-error ∆ for the minimum, the average, the

martingale random permutation and the martingale order statistics schemes respectively.

The final column Best LB reports the best lower bound reported by all the schemes whose

log-relative error is highlighted by bold in each row.

We organize our results in two parts. We first consider networks which do not have deter-

minism and compare Bound propagation and its improvements with IJGP-sampling based

Markov-LB schemes. Then, we consider networks which have determinism and compare

SampleSearch based Markov-LB with Variable elimination and Conditioning for proba-

bilistic networks and with SampleCount for Boolean satisfiability problems.

5.4.2 Results on networks having no determinism

Table 5.1 summarizes the results. We ran each algorithm for 2 minutes. We see that our

new strategy of Markov-LB scales well with problem size and provides good quality high-

confidence lower bounds on most problems. It clearly outperforms the bound propagation

with cut-set conditioning scheme. We discuss the results in detail below.

Non-deterministic Alarm networks The Alarm networks are one of the earliest belief

networks designed by medical experts for monitoring patients in intensive care. The evi-

213

Markov-LB with Bound

IJGP-sampling propa-

Problem 〈n, d, e, w〉 Exact Min Avg Per Ord gation Best

P(e) ∆ ∆ ∆ ∆ ∆ LB

Alarm

BN 3 〈100, 2, 36〉 2.8E-13 0.157 0.031 0.040 0.059 0.090 1.1E-13

BN 4 〈100, 2, 51〉 3.6E-18 0.119 0.023 0.040 0.045 0.025 1.4E-18

BN 5 〈125, 2, 55〉 1.8E-19 0.095 0.020 0.021 0.030 0.069 7.7E-20

BN 6 〈125, 2, 71〉 4.3E-26 0.124 0.016 0.024 0.030 0.047 1.6E-26

BN 11 〈125, 2, 46〉 8.0E-18 0.185 0.023 0.061 0.064 0.102 3.3E-18

CPCS

CPCS-360-1 〈360, 2, 20〉 1.3E-25 0.012 0.012 0.000 0.001 0.002 1.3E-25

CPCS-360-2 〈360, 2, 30〉 7.6E-22 0.045 0.015 0.010 0.010 0.000 7.6E-22

CPCS-360-3 〈360, 2, 40〉 1.2E-33 0.010 0.009 0.000 0.000 0.000 1.2E-33

CPCS-360-4 〈360, 2, 50〉 3.4E-38 0.022 0.009 0.002 0.000 0.000 3.4E-38

CPCS-422-1 〈422, 2, 20〉 7.2E-21 0.028 0.016 0.001 0.001 0.002 6.8E-21

CPCS-422-2 〈422, 2, 30〉 2.7E-57 0.005 0.005 0.000 0.000 0.000 2.7E-57

CPCS-422-3 〈422, 2, 40〉 6.9E-87 0.003 0.003 0.000 0.000 0.001 6.9E-87

CPCS-422-4 〈422, 2, 50〉 1.4E-73 0.007 0.004 0.000 0.000 0.001 1.3E-73

Random

BN 94 〈53, 50, 6〉 4.0E-11 0.235 0.029 0.063 0.025 0.028 2.2E-11

BN 96 〈54, 50, 5〉 2.1E-09 0.408 0.036 0.095 0.013 0.131 1.6E-09

BN 98 〈57, 50, 6〉 1.9E-11 0.131 0.024 0.013 0.024 0.147 1.4E-11

BN 100 〈58, 50, 8〉 1.6E-14 0.521 0.022 0.079 0.041 0.134 8.1E-15

BN 102 〈76, 50, 15〉 1.5E-26 0.039 0.007 0.007 0.012 0.056 9.4E-27

Table 5.1: Table showing the log-relative error ∆ of bound propagation and four versions of

Markov-LB combined with IJGP-sampling for Bayesian networks having no determinism

after 2 minutes of CPU time.

214

dence in these networks was set at random. These networks have between 100-125 binary

nodes. We can see that Markov-LB with IJGP-sampling is slightly superior to the bound

propagation based scheme accuracy-wise. Among the different versions of Markov-LB

with IJGP-sampling, the average scheme performs better than the martingale schemes. The

minimum scheme is the worst performing scheme.

The CPCS networks The CPCS networks are derived from the Computer-based Patient

Case Simulation system [110]. The nodes of CPCS networks correspond to diseases and

findings and conditional probabilities describe their correlations. The CPCS360b and

CPCS422b networks have 360 and 422 variables respectively. We report results on the

two networks with 20,30,40 and 50 randomly selected evidence nodes. We see that the

lower bounds reported by the bound propagation based scheme are slightly better than

Markov-LB with IJGP-sampling on the CPCS360b networks. However, on the CPCS422b

networks, Markov-LB with IJGP-sampling gives higher lower bounds. The martingale

schemes (the random permutation and the order statistics) give higher lower bounds than

the average scheme. Again, the minimum scheme is the weakest.

Random networks The random networks are randomly generated graphs available from

the UAI 2006 evaluation web site. The evidence nodes are generated at random. The

networks have between 53 and 76 nodes and the maximum domain size is 50. We see that

Markov-LB is better than the bound propagation based scheme on all random networks.

The random permutation and the order statistics martingale schemes are slightly better

than the average scheme on most instances.

5.4.3 Results on networks having determinism

In this subsection, we report on experiments for networks which have determinism. We

experimented with five benchmark domains: (a) Latin square instances, (b) Langford in-

215

Markov-LB with Markov-LB with REL

SampleSearch SampleCount SAT

Problem 〈n, k, c, w〉 Exact Min Avg Per Ord Min Avg Per Ord Best

∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ LB

ls8-norm 〈512, 2, 5584, 255〉 5.40E+11 0.387 0.012 0.068 0.095 0.310 0.027 0.090 0.090 0.344 3.88E+11

ls9-norm 〈729, 2, 9009, 363〉 3.80E+17 0.347 0.021 0.055 0.070 0.294 0.030 0.097 0.074 0.579 1.59E+17

ls10-norm 〈1000, 2, 13820, 676〉 7.60E+24 0.304 0.002 0.077 0.044 0.237 0.016 0.054 0.050 0.710 6.93E+24

ls11-norm 〈1331, 2, 20350, 956〉 5.40E+33 0.287 0.023 0.102 0.026 0.227 0.036 0.094 0.034 0.783 7.37E+34

ls12-norm 〈1728, 2, 28968, 1044〉 0.251 0.007 0.045 0.011 0.232 0.000 0.079 0.002 0.833 3.23E+43

ls13-norm 〈2197, 2, 40079, 1558〉 0.250 0.005 0.080 0.000 0.194 0.015 0.087 0.044 0.870 1.26E+55

ls14-norm 〈2744, 2, 54124, 1971〉 0.174 0.010 0.057 0.000 0.140 0.043 0.065 0.026 0.899 2.72E+67

ls15-norm 〈3375, 2, 71580, 2523〉 0.189 0.015 0.080 0.000 0.130 0.053 0.077 0.062 0.923 4.84E+82

ls16-norm 〈4096, 2, 92960, 2758〉 0.158 0.000 0.055 0.001 0.108 0.030 0.053 0.007 X 1.16E+97

Table 5.2: Table showing the log-relative error ∆ of Relsat and four versions of Markov-

LB combined with SampleSearch and SampleCount respectively for Latin Square instances

after 10 hours of CPU time.

stances, (c) FPGA routing instances, (d) Linkage instances and (e) Relational instances.

The task of interest on the first three domains is counting solutions while the task of inter-

est on the final two domains is computing probability of evidence. All these domains were

also used to evaluate SampleSearch in Chapter 3.

Results on Satisfiability model counting

For model counting, we evaluate the lower bounding power of Markov-LB with Sample-

Search and Markov-LB with SampleCount [62]. We ran both algorithms for 10 hours on

each instance.

Results on Latin Square instances

Table 5.2 shows the detailed results. The exact counts for Latin square instances are known

only up to order 11. As pointed out earlier, when the exact results are not known, we use

the highest lower bound reported by the schemes as a substitute for Z in Equation 5.29.

Among the different versions of Markov-LB with SampleSearch, we see that the average

scheme performs better than the martingale order statistics scheme on 5 out of 8 instances

while the martingale order statistics scheme is superior on the other 3 instances. The min-

216


Ex SampleSearch SampleCount SAT

Problem 〈n, k, c, w〉 act Min Avg Per Ord Min Avg Per Ord Best

∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ LB

lang12 〈576, 2, 13584, 383〉 2.16E+05 0.464 0.051 0.128 0.171 0.455 0.067 0.103 0.175 0.000 2.16E+05

lang16 〈1024, 2, 32320, 639〉 6.53E+08 0.475 0.008 0.106 0.131 0.378 0.019 0.097 0.023 0.365 7.68E+08

lang19 〈1444, 2, 54226, 927〉 5.13E+11 0.405 0.041 0.109 0.095 0.420 0.156 0.219 0.200 0.636 1.70E+11

lang20 〈1600, 2, 63280, 1023〉 5.27E+12 0.411 0.031 0.150 0.102 0.424 0.217 0.188 0.123 0.685 2.13E+12

lang23 〈2116, 2, 96370, 1407〉 7.60E+15 0.389 0.058 0.119 0.100 0.418 0.215 0.284 0.211 X 9.15E+14

lang24 〈2304, 2, 109536, 1535〉 9.37E+16 0.258 0.076 0.043 0.054 0.283 0.220 0.203 0.220 X 1.74E+16

lang27 〈2916, 2, 156114, 1919〉 0.261 0.000 0.093 0.107 0.364 0.264 0.291 0.267 X 7.67E+19

Table 5.3: Table showing the log-relative error ∆ of Relsat and four versions of Markov-LB

combined with SampleSearch and SampleCount respectively for Langford instances after

10 hours of CPU time.

imum scheme is the weakest scheme while the martingale random permutation scheme is

between the minimum scheme and the average and martingale order statistics scheme.

Among the different versions of Markov-LB with SampleCount, we see very similar per-

formance.

SampleSearch with Markov-LB generates better lower bounds than SampleCount with

Markov-LB on 6 out of the 8 instances. The lower bounds output by Relsat are sev-

eral orders of magnitude lower than those output by Markov-LB with SampleSearch and

Markov-LB with SampleCount.

Results on Langford instances

Table 5.3 shows the results. Among the different versions of Markov-LB with Sample-

Search, we see again the superiority of the average scheme. The martingale order statistics

and random permutation schemes are the second and the third best respectively. Among

the different versions of SampleCount based Markov-LB, we see a similar trend where the

average scheme performs better than other schemes on 6 out of the 7 instances.

Markov-LB with SampleSearch outperforms Markov-LB with SampleCount on 6 out of the

7 instances. The lower bounds output by Relsat are inferior by several orders of magnitude

to the Markov-LB based lower bounds except on the lang12 instance which RESLAT solves

217


Ex- SampleSearch SampleCount SAT

Problem 〈n, k, c, w〉 act Min Avg Per Ord Min Avg Per Ord Best

∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ ∆ LB

9symml gr 2pin w6 〈2604, 2, 36994, 413〉 0.192 0.000 0.075 0.006 0.087 0.073 0.076 0.075 0.491 2.76E+53

9symml gr rcs w6 〈1554, 2, 29119, 613〉 0.237 0.016 0.117 0.023 0.117 0.060 0.041 0.009 0.000 9.95E+84

alu2 gr rcs w8 〈4080, 2, 83902, 1470〉 0.224 0.097 0.152 0.102 0.000 0.906 0.023 0.345 0.762 1.47E+235

apex7 gr 2pin w5 〈1983, 2, 15358, 188〉 0.158 0.003 0.073 0.000 0.064 0.023 0.047 0.036 0.547 2.71E+93

apex7 gr rcs w5 〈1500, 2, 11695, 290〉 0.228 0.037 0.118 0.038 0.099 0.000 0.028 0.008 0.670 3.04E+139

c499 gr 2pin w6 〈2070, 2, 22470, 263〉 0.262 0.012 0.092 0.000 X X X X 0.376 6.84E+54

c499 gr rcs w6 〈1872, 2, 18870, 462〉 0.310 0.046 0.164 0.043 0.083 0.042 0.062 0.000 0.391 1.07E+88

c880 gr rcs w7 〈4592, 2, 61745, 1024〉 0.223 0.110 0.142 0.110 0.000 0.000 0.000 0.003 0.845 1.37E+278

example2 gr 2pin w6 〈3603, 2, 41023, 350〉 0.112 0.000 0.026 0.000 0.005 0.005 0.005 0.005 0.756 2.78E+159

example2 gr rcs w6 〈2664, 2, 27684, 476〉 0.176 0.050 0.079 0.054 0.056 0.005 0.000 0.005 0.722 1.47E+263

term1 gr 2pin w4 〈746, 2, 3964, 31〉 0.199 0.000 0.077 0.002 X X X X 0.141 7.68E+39

term1 gr rcs w4 〈808, 2, 3290, 57〉 0.252 0.000 0.090 0.017 X X X X 0.175 4.97E+55

too large gr rcs w7 〈3633, 2, 50373, 1069〉 0.156 0.026 0.073 0.000 X X X X 0.608 7.73E+182

too large gr rcs w8 〈4152, 2, 57495, 1330〉 0.147 0.000 0.038 0.020 X X X X 0.750 8.36E+246

vda gr rcs w9 〈6498, 2, 130997, 2402〉 0.088 0.009 0.030 0.000 X X X X 0.749 5.04E+300

Table 5.4: Table showing the log-relative error ∆ of Relsat and four versions of Markov-LB

combined with SampleSearch and SampleCount respectively for FPGA routing instances

after 10 hours of CPU time.

exactly.

Results on FPGA routing instances

Table 5.4 shows the results for FPGA routing instances. We see a similar behavior to the

Langford and Latin square instances in that the average and the martingale order statistics

schemes are better than other schemes with the average scheme performing the best. Sam-

pleSearch based Markov-LB yields better lower bounds than SampleCount based Markov-

LB on 11 out of the 17 instances. As in the other benchmarks, the lower bounds output by

Relsat are inferior by several orders of magnitude.

Results on Linkage instances

The Linkage instances that we experimented with in Chapter 3 are generated by converting

biological linkage analysis data into a Bayesian or Markov network.

Table 5.5 show the results for linkage instances used in the UAI 2006 evaluation [8]. Here,

we compare Markov-LB with SampleSearch with VEC. The bound propagation scheme [5]

218

Markov-LB with

SampleSearch VEC Best

Problem 〈n, k, c, w〉 Exact Min Avg Per Ord LB

∆ ∆ ∆ ∆ ∆BN 69.uai 〈777, 7, 78, 47〉 5.28E-54 0.082 0.029 0.031 0.034 0.140 1.56E-55

BN 70.uai 〈2315, 5, 159, 87〉 2.00E-71 0.275 0.035 0.101 0.046 0.147 6.24E-74

BN 71.uai 〈1740, 6, 202, 70〉 5.12E-111 0.052 0.009 0.019 0.017 0.035 5.76E-112

BN 72.uai 〈2155, 6, 252, 86〉 4.21E-150 0.021 0.002 0.004 0.007 0.023 2.38E-150

BN 73.uai 〈2140, 5, 216, 101〉 2.26E-113 0.172 0.020 0.059 0.026 0.121 1.19E-115

BN 74.uai 〈749, 6, 66, 45〉 3.75E-45 0.233 0.035 0.035 0.049 0.069 1.09E-46

BN 75.uai 〈1820, 5, 155, 92〉 5.88E-91 0.077 0.005 0.024 0.019 0.067 1.98E-91

BN 76.uai 〈2155, 7, 169, 64〉 4.93E-110 0.109 0.015 0.043 0.018 0.153 1.03E-111

Table 5.5: Table showing the log-relative error ∆ of VEC and four versions of Markov-LB

combined with SampleSearch for Linkage instances from the UAI 2006 evaluation after 3

hours of CPU time.

does not work on instances having determinism and therefore we do not report on it here.

We clearly see that SampleSearch based Markov-LB yields higher lower bounds than VEC.

Remember, however that the lower bounds output by VEC are correct (with probability 1)

while the lower bounds output by Markov-LB are correct with probability > 0.99. We see

that the average scheme is the best performing scheme.

Table 5.6 reports the results on Linkage instances encoded as Markov networks, used in the

UAI 2008 evaluation [24]. VEC solves 10 instances exactly. On these instances, the lower

bound output by SampleSearch based Markov-LB are quite accurate as evidenced by the

small log relative error. On instances which VEC does not solve exactly, we clearly see

that Markov-LB with SampleSearch yields higher lower bounds than VEC.

Comparing between different versions of Markov-LB, we see that the average scheme is

overall the best performing scheme. The Martingale order statistics scheme is the second

best scheme while the Martingale random permutation scheme is the third best.

219

Markov-LB with



∆ ∆ ∆ ∆ ∆pedigree18.uai 〈1184, 1, 0, 26〉 7.18E-79 0.062 0.004 0.011 0.016 0.000 7.18E-79

pedigree1.uai 〈334, 2, 0, 20〉 7.81E-15 0.034 0.020 0.020 0.020 0.000 7.81E-15

pedigree20.uai 〈437, 2, 0, 25〉 2.34E-30 0.208 0.010 0.011 0.029 0.000 2.34E-30

pedigree23.uai 〈402, 1, 0, 26〉 2.78E-39 0.093 0.007 0.016 0.019 0.000 2.78E-39

pedigree25.uai 〈1289, 1, 0, 38〉 2.12E-119 0.006 0.022 0.019 0.019 0.024 1.69E-116

pedigree30.uai 〈1289, 1, 0, 27〉 4.03E-88 0.014 0.039 0.039 0.035 0.042 1.85E-84

pedigree37.uai 〈1032, 1, 0, 25〉 2.63E-117 0.031 0.005 0.005 0.006 0.000 2.63E-117

pedigree38.uai 〈724, 1, 0, 18〉 5.64E-55 0.197 0.010 0.024 0.023 0.000 5.65E-55

pedigree39.uai 〈1272, 1, 0, 29〉 6.32E-103 0.039 0.003 0.001 0.007 0.000 7.96E-103

pedigree42.uai 〈448, 2, 0, 23〉 1.73E-31 0.024 0.009 0.007 0.010 0.000 1.73E-31

pedigree19.uai 〈793, 2, 0, 23〉 0.158 0.018 0.000 0.031 0.011 3.67E-59

pedigree31.uai 〈1183, 2, 0, 45〉 0.059 0.000 0.003 0.011 0.083 1.03E-70

pedigree34.uai 〈1160, 1, 0, 59〉 0.211 0.006 0.000 0.012 0.174 4.34E-65

pedigree13.uai 〈1077, 1, 0, 51〉 0.175 0.000 0.038 0.023 0.163 2.94E-32

pedigree40.uai 〈1030, 2, 0, 49〉 0.126 0.000 0.036 0.008 0.025 4.26E-89

pedigree41.uai 〈1062, 2, 0, 52〉 0.079 0.000 0.012 0.010 0.049 2.29E-77

pedigree44.uai 〈811, 1, 0, 29〉 0.045 0.002 0.007 0.009 0.000 2.23E-64

pedigree51.uai 〈1152, 1, 0, 51〉 0.150 0.003 0.027 0.000 0.139 1.01E-74

pedigree7.uai 〈1068, 1, 0, 56〉 0.127 0.000 0.019 0.009 0.101 6.42E-66

pedigree9.uai 〈1118, 2, 0, 41〉 0.072 0.000 0.009 0.009 0.028 1.41E-79


combined with SampleSearch for Linkage instances from the UAI 2008 evaluation after 3

hours of CPU time.

220

Markov-LB with



∆ ∆ ∆ ∆ ∆fs-01.uai 〈10, 2, 7, 2〉 5.00E-01 0.000 0.000 0.000 0.000 0.000 5.00E-01

fs-04.uai 〈262, 2, 226, 12〉 1.53E-05 0.116 0.116 0.116 0.116 0.000 1.53E-05

fs-07.uai 〈1225, 2, 1120, 35〉 9.80E-17 0.028 0.004 0.014 0.016 0.079 1.78E-15

fs-10.uai 〈3385, 2, 3175, 71〉 7.89E-31 0.071 0.064 0.064 0.065 X 9.57E-33

fs-13.uai 〈7228, 2, 6877, 119〉 1.34E-51 0.077 0.077 0.077 0.077 X 1.69E-55

fs-16.uai 〈13240, 2, 12712, 171〉 8.64E-78 0.085 0.019 0.048 0.025 X 3.04E-79

fs-19.uai 〈21907, 2, 21166, 243〉 2.13E-109 0.051 0.050 0.050 0.050 X 8.40E-115

fs-22.uai 〈33715, 2, 32725, 335〉 2.00E-146 0.053 0.006 0.022 0.009 X 2.51E-147

fs-25.uai 〈49150, 2, 47875, 431〉 7.18E-189 0.050 0.005 0.026 0.004 X 1.57E-189

fs-28.uai 〈68698, 2, 67102, 527〉 9.83E-237 0.231 0.017 0.023 0.011 X 4.53E-237

fs-29.uai 〈76212, 2, 74501, 559〉 6.82E-254 0.259 0.101 0.201 0.027 X 9.44E-255

mastermind 03 08 03 〈1220, 2, 48, 20〉 9.79E-08 0.283 0.039 0.034 0.096 0.000 9.79E-08

mastermind 03 08 04 〈2288, 2, 64, 30〉 8.77E-09 0.562 0.045 0.145 0.131 0.000 8.77E-09

mastermind 03 08 05 〈3692, 2, 80, 42〉 8.90E-11 0.432 0.041 0.021 0.095 0.000 1.44E-10

mastermind 04 08 03 〈1418, 2, 48, 22〉 8.39E-08 0.297 0.041 0.072 0.082 0.000 8.39E-08

mastermind 04 08 04 〈2616, 2, 64, 33〉 2.20E-08 0.640 0.026 0.155 0.103 0.034 1.38E-08

mastermind 05 08 03 〈1616, 2, 48, 28〉 5.30E-07 0.625 0.062 0.188 0.185 0.000 5.30E-07

mastermind 06 08 03 〈1814, 2, 48, 31〉 1.80E-08 0.510 0.058 0.193 0.175 0.000 1.80E-08

mastermind 10 08 03 〈2606, 2, 48, 56〉 1.92E-07 0.839 0.058 0.297 0.162 0.058 7.90E-08


combined with SampleSearch for relational instances after 3 hours of CPU time.

Results on Relational instances

To recap, the relational instances are generated by grounding the relational Bayesian net-

works using the primula tool (see Chavira et al. [14] for more information). In Table 5.7,

we report the results on instances with 10 Friends and Smoker networks and 6 mastermind

networks from this domain which have between 262 to 76,212 variables. On the 11 friends

and smokers network, we can see that as the problems get larger the lower bounds out-

put by Markov-LB with SampleSearch are higher than VEC. This clearly indicates that

Markov-LB with SampleSearch is more scalable than VEC. VEC solves exactly six out

of the eight mastermind instances while on the remaining two instances Markov-LB with

SampleSearch yields higher lower bounds than VEC.

221

5.4.4 Summary of Experimental Results

Based on our large scale experimental evaluation, we see that applying Markov inequality

and its generalization to multiple samples generated by SampleSearch and IJGP-Sampling

is more scalable than deterministic approaches for lower bounding the weighted counts

like Variable elimination and conditioning (VEC), Relsat and improved bound propagation.

Among the different versions of Markov-LB, we find that the average and martingale order

statistics schemes consistently yield higher lower bounds and should be used.

5.5 Conclusion and Summary

In this chapter, we proposed a randomized approximation algorithm, Markov-LB for com-

puting high confidence lower bounds on weighted counting tasks such as computing the

probability of evidence in a Bayesian network, counting the number of solutions of a con-

straint network and computing the partition function of a Markov network. Markov-LB

is based on importance sampling and the Markov inequality. Since a straight-forward ap-

plication of the Markov inequality may lead to poor lower bounds, we proposed several

improved measures such as the average scheme which utilizes the sample average and mar-

tingale schemes which utilize the maximum values from the sample weights. A highlight

of Markov-LB is that it can be used with any scheme that outputs unbiased (or a lower

bound on the unbiased) sample weights. We showed that Markov-LB applied to the IJGP-

sampling and SampleSearch schemes presented in this thesis provide high confidence good

quality lower bounds on most instances. Furthermore, the lower bounds are often better

than those ouput by state-of-the-art schemes such as bound propagation [6], SampleCount

[62], Relsat [115] and variable elimination and conditioning.

222

Chapter 6

AND/OR Importance Sampling

6.1 Introduction

Importance sampling [118] is a general scheme which can be used to approximate various

weighted counting tasks defined over graphical models such as computing the probability

of evidence in a Bayesian network, computing the partition function of a Markov network

and counting the number of solutions of a constraint network. The main idea in importance

sampling is to first transform the counting or summation problem to an expectation prob-

lem using a special distribution called the proposal or importance distribution and then to

estimate the counts by a weighted average over the generated samples, also called the sam-

ple mean. As pointed out in Chapter 1, the quality of estimation of importance sampling is

highly dependent on the variance of the weights or the sample mean and therefore over the

past few years significant research effort has been devoted to reducing variance [118, 88].

In this chapter, we propose a family of variance reduction schemes in the context of graph-

ical models called AND/OR importance sampling.

The central idea in AND/OR importance sampling is to exploit problem decomposition in-

223

troduced by the conditional independencies in the graphical model. Recently, graph based

problem decomposition was introduced in the context of systematic search in graphical

models [23, 3, 37], using the notion of AND/OR search spaces, and was shown to substan-

tially reduce search time; sometimes exponentially. The usual way of performing search is

to systematically go over all possible instantiations of the variables, which can be organized

in an OR search tree. In AND/OR search, additional AND nodes are interleaved with OR

nodes to capture decomposition into conditionally independent subproblems.

We can organize the generated samples as covering a part of a full AND/OR search tree

yielding an AND/OR sample tree. Likewise, the OR sample tree is the portion of the full

OR search tree that is covered by the samples. The main intuition in moving from OR

space to AND/OR space is that at AND nodes, we can combine samples in independent

components to yield a virtual increase in the sample size. For example, ifX is conditionally

independent of Y given Z, we can consider N samples of X independently from those of

Y given Z = z, thereby yielding an effective sample size of N2 instead of the input N .

Since the variance reduces as the number of samples increases (see for example [118]), the

sample mean computed over the AND/OR sample tree has lower variance than the sample

mean computed over the OR sample tree.

We can take this idea a step further and look at the AND/OR search graph [37] as the target

for compiling the given set of samples. Since the AND/OR search graph is smaller than the

AND/OR search tree, generating a partial cover of the AND/OR graph by the current set of

samples yields the AND/OR sample graph mean with an even reduced variance. However,

due to caching, computing the AND/OR sample graph mean is O(w∗) (where w∗ is the

treewidth) times more expensive time-wise and O(N) times more expensive space wise.

Finally, we combine AND/OR sampling with the w-cutset sampling scheme [7]. Since

(based on the Rao-Blackwell theorem [12]) w-cutset sampling reduces the variance by

combining sampling with exact inference; it leads to additional variance reduction.

224

We provide a thorough empirical evaluation of all the new estimators. We compare the

impact of exploiting varying levels of graph decompositions via (a) AND/OR tree, (b)

AND/OR graph and (c) their w-cutset generalizations on a variety of benchmark probabilis-

tic networks. We demonstrate that exploiting more decomposition improves the accuracy

of the estimates as a function of time in most cases. In particular, AND/OR tree sampling

usually yields better estimates than (conventional) OR tree sampling, AND/OR graph sam-

pling is superior to AND/OR tree sampling and w-cutset sampling (OR and AND/OR)

yields additional improvement. Our results clearly show that the scheme that is the most

aggressive in exploiting decomposition - AND/OR w-cutset graph importance sampling, is

consistently superior.


The rest of the chapter is organized as follows. In the next section, we present preliminaries

on AND/OR search spaces for graphical models. In Section 6.3, we present AND/OR tree

sampling and in Section 6.4, we prove that it yields lower variance than pure importance

sampling. AND/OR graph sampling is presented in Section 6.5 and AND/OR w-cutset

sampling is presented in Section 6.6. Section 6.7 presents empirical results and related

work is presented in Section 6.8. We conclude in Section 6.9.

6.2 AND/OR search spaces

Given a mixed networkM = 〈X,D,F,C〉, we can compute the weighted counts (see Def-

inition 11 in Chapter 1) by search, by accumulating probabilities over the search space

of instantiated variables. In the simplest case, the algorithm traverses an OR search tree,

whose nodes represent states in the space of partial assignments. This traditional search

space does not capture any of the structural properties of the underlying graphical model,

225

however. Introducing AND nodes into the search space can capture the conditional inde-

pendencies in the graphical model.

The AND/OR search space is a well known problem solving approach developed in the

area of heuristic search [100], that exploits the problem structure to decompose the search

space. The AND/OR search space for graphical models was introduced in [37]. It is guided

by a pseudo tree that spans the original graphical model.

DEFINITION 39 (Pseudo Tree). Given an undirected graph G = (V,E′), a directed rooted

tree T = (V,E) defined on all its nodes is called pseudo tree if any edge in E′ which is not

included in E is a back-arc, namely it connects a node to an ancestor in T .

DEFINITION 40 (AND/OR search tree). Given a mixed networkM = 〈X,D,F,C〉, its

primal graph G and a pseudo tree T of G, the associated AND/OR search tree, has alter-

nating levels of AND and OR nodes. The OR nodes are labeled Xi and correspond to the

variables. The AND nodes are labeled xi and correspond to the value assignments. The

structure of the AND/OR search tree is based on T . The root of the AND/OR search tree is

an OR node labeled by the root of T .

The children of an OR node Xi are AND nodes labeled with assignments xi that are con-

sistent with the assignments (x1, . . . , xi−1) relative to C along the path from the root. In-

consistent assignments are not extended.

The children of an AND node xi are OR nodes labeled with the children of Xi in T .

DEFINITION 41 (Solution Subtree). A solution subtree of a labeled AND/OR search tree

(or graph) contains the root node. For every OR node it contains one of its child nodes and

for each of its AND nodes it contains all its child nodes.

Semantically, the OR states represent alternative assignments, whereas the AND states

represent problem decomposition into independent subproblems, all of which need to be

226

solved. When the pseudo tree is a chain, the AND/OR search tree coincides with the regular

OR search tree.

C

B

D

A

≠

≠

≠

0,1,2

0,1,2

0,1,2

0,1,2

(a) 3 color-

ing problem

C

B D

A

(b) Pseudo tree

2

B

0 1

2 1 2 0

D

0 1

D

0 1

A A A A A A A A

C

0

B

1

D

1

A

2

D

2

A

1

B

0

D

0

A

2

D

2

A

2 0 2 0 1 0 1 1 2 0 1 1 2 1 2 0 2 0 20 1 2 10

(c) OR tree

2

B D

0 101

C

0 1

B D B D

1 2 1 202 0 2

2 0 1 1

A

1 2

A

0 2

A

0

A

1

A

2

A

0

(d) AND/OR tree

2

B D

0 101

C

0 1

D D

1 2 0 2

1 2

2

0

B

1

A

0

2

A

1

B

0

A

2

(e) AND/OR graph

Figure 6.1: AND/OR search spaces for graphical models

EXAMPLE 14. Figure 6.1(a) shows a constraint network for a 3-coloring problem over 4

variables. A possible pseudo tree is given in Figure 6.1(b). Figure 6.1(c) shows an OR tree

and 6.1(d) shows an AND/OR tree.

In [92, 36] it was shown how the posterior probability distribution associated with a mixed

network M = 〈X,D,F,C〉 can be expressed using AND/OR search spaces, and how

queries of interest, such as computing the weighted counts, can be computed by a depth-

first search traversal. All we need is to annotate the OR-to-AND arcs with weights derived

from the relevant constraints C and functions F, such that the product of weights on the arc

227

of any solution subtree i.e. a full assignment x is equal to the weight∏m

i=1 Fi(x) of that

solution.

DEFINITION 42 (Weighted AND/OR tree). Given a mixed networkM = 〈X,D,F,C〉

and its AND/OR tree along some pseudo tree T , the weight w(n,m) of an arc from an OR

node n to an AND node m such that n is labeled with Xi and m is labeled with xi, is the

product of all functions F ∈ F which become fully instantiated by the last assignment from

the root to Xi. A weighted AND/OR tree is the AND/OR tree annotated with weights.

A

D

B C

E

(a) Belief

Network

A

D

B

CE

(b) Pseudo tree

A B E=0 E=1

0 0 1 0

0 1 .5 .5

1 0 .7 .3

1 1 0 1

A B=0 B=1

0 1 0

1 .1 .9

A C=0 C=1

0 .2 .8

1 .7 .3

A P(A)

0 .6

1 .4

B C D=0 D=1

0 0 .2 .8

0 1 .1 .9

1 0 .3 .7

1 1 .5 .5

P(E | A,B)P(D | B,C)

P(B | A) P(C | A)P(A)

(c) CPTs

0

A

B

0

E C

0

D

0 1

1

D

0 1

0

1

B

0

E C

0

D

0 1

1

D

0 1

0 1

1

E C

0

D

0 1

1

D

0 1

1

.8 .9 .7.8 .9 .5

1 .7.2 .8 .7 .3 .7 .3

1 .1 .9

.6 .4

.3 1

.2 .1 .2 .1 .3 .5

(d) Weighted AND/OR Search tree

Figure 6.2: Assigning weights to OR-to-AND arcs of an AND/OR search tree

EXAMPLE 15. Figure 6.2 shows how to assign weights to an AND/OR tree. Figure 6.2

(a) shows a Bayesian network, 6.2 (b) is a pseudo tree, and 6.2(c) shows the conditional

probability tables. Figure 6.2(d) shows the weighted AND/OR search tree.

The probability of evidence or weighted counts can be computed by traversing a weighted

AND/OR tree in a DFS manner and computing the value of all nodes from leaves to the

228

root [37, 92], as defined next. The value of a node is the weighted counts of the sub-tree

that it roots.

DEFINITION 43 (Value of a node for computing the weighted counts). The value of a

node is defined recursively as follows. The value of leaf AND nodes are “1” and the value

of leaf OR nodes are “0”. Let chi(n) denote the children and v(n) denote the value of an

OR or an AND node n respectively. If n is an OR node then:

v(n) =∑

n′∈chi(n)

v(n′)w(n, n′)

If n is an AND node then:

v(n) =∏

n′∈chi(n)

v(n′)

The value of the root node is equal to the weighted counts Z.

The AND/OR search tree may contain nodes that root identical subproblems. These nodes

are unifiable and can be merged yielding a search graph whose size is smaller than the

AND/OR search tree. Traversing the AND/OR search graph requires additional memory,

however. A depth first search algorithm can cache previously computed results, and retrieve

them when the same sub-problem is encountered. Some unifiable nodes can be identified

based on their context which express the set of ancestor variables in the pseudo tree that

completely determine a conditioned subproblem [37].

DEFINITION 44 (context). Given a pseudo tree T (X,E) and a primal graph G(X,E′), the

context of a node Xi in T denoted by contextT (Xi) is the set of ancestors of Xi in T ,

ordered descendingly, that are connected in G to Xi or to descendants of Xi.

DEFINITION 45 (context minimal AND/OR graph). Given an AND/OR search tree, two

OR nodes n1 and n2 are context unifiable if they have the same variable label Xi and

the assignments of their contexts are identical. In other words, if y and z are the partial

229

assignment of variables along the path to n1 and n2 respectively, and if their restriction

to the context of Xi satisfy ycontextT (Xi)= zcontextT (Xi), then n1 and n2 are unifiable. The

context minimal AND/OR graph is obtained from the AND/OR search tree by merging all

the context unifiable OR nodes.

EXAMPLE 16. Figure 6.1(e) shows a context minimal AND/OR graph created from the

AND/OR tree of Figure 6.1(d) by merging all context unifiable nodes.

In the next five sections, we describe AND/OR importance sampling and its w-cutset gen-

eralizations; which are the main contributions of this chapter.

6.3 AND/OR Tree importance sampling

P(X|Z) X=0 X=1 X=2

Z=0 0.3 0.4 0.3

Z=1 0.2 0.7 0.1

P(Y|Z) Y=0 Y=1 Y=2

Z=0 0.5 0.1 0.4

Z=1 0.2 0.6 0.2

P(A|X) A=0 A=1

X=0 0.1 0.9

X=1 0.2 0.8

X=2 0.6 0.4

P(B|Y) B=0 B=1

Y=0 0.2 0.9

Y=1 0.7 0.8

Y=2 0.1 0.4

P(Z) Z=0 Z=1

0.8 0.2Z

XY

A B

Evidence A=0, B=0

Figure 6.3: A Bayesian network and its CPTs

We start by discussing computing expectation by parts; which forms the backbone of

AND/OR importance sampling. We then present our first version based on AND/OR tree

search.

230

6.3.1 Estimating Expectation by Parts

In Equation 1.14 (in Chapter 1), the expectation of a function defined over a set of variables

is computed by summing over the Cartesian product of the domains of all variables. This

method is clearly inefficient because it does not take into account the inherent decomposi-

tion in the graphical model as we illustrate below.

Consider the tree graphical model given in Figure 6.3. Let A = a and B = b be the

evidence. By definition, the probability of evidence P (a, b) is given by:

P (a, b) =∑

xyz∈XY Z

P (z)P (x|z)P (a|x)P (y|z)P (b|y) (6.1)

Let Q(Z,X, Y ) = Q(Z)Q(X|Z)Q(Y |Z) be a proposal distribution. We can express

P (a, b) in terms of Q as:

P (a, b) =∑

xyz∈XY Z

P (z)P (x|z)P (a|x)P (y|z)P (b|y)

Q(z)Q(x|z)Q(y|z)Q(z)Q(x|z)Q(y|z) (6.2)

We can now apply some simple symbolic manipulations, and rewrite Equation 6.2 as:

P (a, b) =∑

z∈Z

P (z)Q(z)

Q(z)

∑

x∈X

P (x|z)P (a|x)Q(x|z)

Q(x|z)

∑

y∈Y

P (y|z)P (b|y)Q(y|z)

Q(y|z)(6.3)

By definition of conditional expectation1:

E

[P (x|z)P (a|x)

Q(x|z) | z]

=∑

x∈X

P (x|z)P (a|x)Q(x|z)Q(x|z) (6.4)

1Because, the expectation is always taken with respect to the component of the proposal distribution in

the denominator, we write EQ[X] as E[X] for clarity.

231

and

E

[P (y|z)P (b|y)

Q(y|z) | z]

=∑

y∈Y

P (y|z)P (b|y)Q(y|z)Q(y|z) (6.5)

Substituting Equations 6.4 and 6.5 in Equation 6.3, we get:

P (a, b) =∑

z∈Z

P (z)

Q(z)E

[P (x|z)P (a|x)

Q(x|z) | z]

E

[P (y|z)P (b|y)

Q(y|z) | z]Q(z) (6.6)

By definition (of expectation), we can rewrite Equation 6.6 as:

P (a, b) = E

[P (z)

Q(z)E

[P (x|z)P (a|x)

Q(x|z) | z]

E

[P (y|z)P (b|y)

Q(y|z) | z]]

(6.7)

We will refer to Equations of the form 6.7 as expectation by parts. If the domain size of

all variables is d = 3, for example, computing P (a, b) using Equation 6.2 would require

summing over d3 = 33 = 27 terms while computing P (a, b) using Equation 6.7 would

require summing over d+ d2 + d2 = 3 + 32 + 32 = 21 terms.

We will now describe how to estimate P (a, b) using Equation 6.7. Assume that we are given

N samples (z1, x1, y1), . . . , (zN , xN , yN) generated from Q. Let 0, 1 be the domain of

Z and let Z = 0 and Z = 1 be sampled N0 and N1 times respectively. We define two

sets S(j) = k|k ∈ 1, . . . , N and zk = j for j ∈ 0, 1 which store the indices of the

samples in which the value j is assigned to Z.

We can unbiasedly estimate E

[P (x|z)P (a|x)

Q(x|z)| z]

and E

[P (y|z)P (b|y)

Q(y|z)| z]

by replacing the ex-

pectation by the sample average. The unbiased estimates denoted by gX(Z = j) and

232

gY (Z = j); j ∈ 0, 1 are given by:

gX(Z = j) =1

Nj

∑

i∈S(j)

P (xi|Z = j)P (a|xi)Q(xi|Z = j)

, (6.8)

gY (Z = j) =1

Nj

∑

i∈S(j)

P (yi|Z = j)P (b|yi)Q(yi|Z = j)

(6.9)

Substituting the unbiased estimates for E

[P (x|z)P (a|x)

Q(x|z)| z]

and E

[P (y|z)P (b|y)

Q(y|z)| z]

in Equa-

tion 6.7, we get the following unbiased estimate of P (a, b):

P (a, b) = E

[P (z)

Q(z)gX(Z = j)gY (Z = j)

](6.10)

Given samples (z1, . . . , zN) generated from Q(Z), we can unbiasedly estimate P (a, b) by

replacing the expectation in Equation 6.10 by the sample average as given below:

Pao−is(a, b) =1

N

N∑

i=1

P (zi)

Q(zi)gX(zi)gY (zi) (6.11)

where Pao−is(a.b) stands for an AND/OR estimate of P (a, b). Based on our assumption

that Z = 0 and Z = 1 are sampled N0 and N1 times respectively, we can collect together

all samples in which the value Z = j, j ∈ 0, 1 is generated and rewrite Equation 6.10 as:

Pao−is(a, b) =1

N

1∑

j=0

NjP (Z = j)

Q(Z = j)gX(Z = j)gY (Z = j) (6.12)

It is easy to show that E[Pao−is(a, b)] = P (a, b), namely Pao−is is unbiased.

233

ZP(Z)

P(Y|Z)P(b|Y)P(X|Z)P(a|X)

X Y

Figure 6.4: A pseudo tree of the Bayesian network given in Figure 6.3(a) in which each

variable is annotated with its bucket function.

over two separate quantities defined over the random variables X|Z = z and Y |Z = z

respectively from the generated samples. While in conventional importance sampling, we

estimate only one quantity defined over the joint random variable XY |Z = z using the

generated samples. Because the samples for X|Z = z and Y |Z = z are considered inde-

pendently in Equation 6.12, Nj samples drawn over the joint random variable XY |Z = z

in Equation 6.15 correspond to Nj × Nj = N2j virtual samples in Equation 6.12. Con-

sequently, because of a larger sample size our new estimation technique will likely have

lower error than the conventional approach.

6.3.2 Estimating weighted counts using an AND/OR sample tree

In this subsection, we generalize the idea of estimating expectation by parts using an

AND/OR search tree [37]. We will define an AND/OR sample tree which is a restric-

tion of the full AND/OR search space to the generated samples. On this AND/OR sample

tree, we define a new sample mean and show that it yields an unbiased estimate of the

weighted counts. We start with some required definitions.

DEFINITION 46 (Bucket function). Given a mixed network M = 〈X,D,F,C〉 and a

rooted pseudo tree T (X,E), the bucket function ofXi relative to T , denoted by BT,Xiis the

product of all functions and constraints inM which mention Xi but do not mention any

variables that are descendants of Xi in T .

EXAMPLE 17. Figure 6.4 shows a possible pseudo tree over the non evidence variables of

235

the Bayesian network given in Figure 6.3. Each variable in the pseudo tree is annotated

with its bucket function. Note that after instantiating the evidence variable A to a, the CPT

P (A|X) yields a function P (a|X) defined over X . Similarly, after instantiating B to the

value b, the CPT P (B|Y ) yields a function P (b|Y ) defined over Y . Therefore, by definition,

the bucket function at X is the product of P (a|X) and P (X|Z) while the bucket function

at Y is the product of P (b|Y ) and P (Y |Z). The bucket function of Z is P (Z).

DEFINITION 47 (AND/OR Sample Tree). Given a mixed networkM = 〈X,D,F,C〉, a

pseudo tree T (X,E), a proposal distribution along the pseudo tree Q(X) =∏n

i=1 Qi(Xi|

Yi) such that Yi ⊆ contextT (Xi)2, a sequence of samples S and a complete AND/OR

search tree φAOT , an AND/OR sample tree SAOT is obtained from φAOT by removing all

nodes and corresponding edges which do not appear in S.

A path from the root of SAOT to a node n is denoted by πn. If n is an OR node labeled

with Xi or an AND node labeled with xi, the path will be denoted by πn(Xi) or πn(xi)

respectively. The assignment sequence along the path πn, denoted by A(πn) is the set of

assignments associated with the sequence of AND nodes along πn:

A(πn(Xi)) = x1, . . . , xi−1

A(πn(xi)) = x1, . . . , xi

The set of variables associated with OR nodes along the path πn is denoted by V (πn):

V (πn(Xi)) = X1, . . . , Xi−1, V (πn(xi)) = X1, . . . , Xi. Clearly, if an OR node n is

labeled Xi then V (πn) is the set of variables mentioned on the path from the root to Xi in

the pseudo tree T , denoted by pathT (Xi).

2For simplicity, we assume that Yi is a subset of context of Xi. When the proposal distribution is specified

externally, it may not obey this constraint. In that case, we construct a pseudo tree from a graph G obtained

by combining the primal graphs of the proposal distribution and the mixed network.

236

Z

0

X

2

1

Y

0 1

X

21

Y

0 1

<1.6,2> <0.4,2>

<0.16,1>

<0.36,1><0.2,1>

<0.14,1> <0.28,1> <0.84,1>

<0.08,1><0.12,1>

16.05.0

)2.0)(4.0(

)0|1(

)1|0()0|1(==

==

====

ZXQ

XAPZXP

4.05.0

2.0

)1(

)1(==

=

=

ZQ

ZP

0 1 2 0 2

Samples: (a) Z = 0, X = 1, Y = 0, (b) Z = 0, X = 2, Y = 1,

(c) Z = 1, X = 1, Y = 1 and (d) Z = 1, X = 2, Y = 0.

Figure 6.5: Figure showing four samples arranged on an AND/OR sample tree.

The arc-label for an OR node n to an AND node m in SAOT , where Xi labels n and xi

labelsm, is a pair 〈w(n,m),#(n,m)〉 where:

• w(n,m) =BT,Xi

(xi,A(πn))

Qi(xi|A(πn))is called the weight of the arc.

• #(n,m) is the frequency of the arc. Namely, it is equal to the number of times the

partial assignment A(πm) occurs in S.

where BT,Xibe the bucket function of Xi (see Definition 46).

We define an OR sample tree as an AND/OR sample tree whose pseudo tree is a chain.

EXAMPLE 18. Consider again the Bayesian network given in Figure 6.3. Assume that

the proposal distribution is given by Q(X,Y, Z) = Q(X)Q(Y )Q(Z) where Q(X), Q(Y )

and Q(Z) are uniform distributions. Figure 6.5 shows a full AND/OR search tree over

the Bayesian network and four hypothetical random samples drawn from Q. The AND/OR

sample tree is obtained by removing the dotted edges and nodes which are not sampled from

the full AND/OR search tree. Each arc from an OR node to an AND node in the AND/OR

237

sample tree is labeled with appropriate frequencies and weights according to Definition 47.

Figure 6.5 shows the derivation of weights for two arcs.

We can now compute an approximation of node values by mimicking the value computation

on the AND/OR sample tree.

DEFINITION 48 (Value of a node). Given an AND/OR sample tree (or graph) SAOT and

an initialization function ψl(V (πl)) which defines the value of each leaf AND node l, the

value of a node n, denoted by v(n) is defined recursively as follows. The value of leaf AND

node l is given by the initialization function ψl and the value of leaf OR node is 0. If n is

an AND node then:

v(n) =∏

n′∈chi(n)

v(n′)

and if n is an OR node then

v(n) =

∑n′∈chi(n) #(n, n′)× w(n, n′)× v(n′)

∑n′∈chi(n) #(n, n′)

We will show that the value of an OR node n is equal to an unbiased estimate of the

conditional expectation of the subproblem conditioned on the assignment from the root to n

(see Theorem 19). Note that unlike previous work on AND/OR search spaces [37] in which

the leaf AND nodes are always initialized to one, we require an initialization function to

accommodate w-cutset extensions to AND/OR sample trees presented in Section 6.6.

DEFINITION 49 (AND/OR Sample Tree Mean). Given an AND/OR sample tree SAOT

with arcs labeled according to Definition 47, the AND/OR Sample Tree mean denoted by

Zao−tree is the value of the root node of SAOT , where the value of each leaf AND node of

SAOT is initialized to “1”.

EXAMPLE 19. The calculations involved in computing the AND/OR sample tree mean are

shown in Figure 6.6. Each node in Figure 6.6 is marked with a value that is computed

238

Z

0

X

2

1

1

Y

0 1

X

21

Y

0 1

<1.6,2> <0.4,2>

<0.16,1> <0.36,1><0.2,1>

<0.14,1><0.28,1> <0.84,1>

<0.08,1><0.12,1>

0.262

0.360.16=

+

0.172

0.140.2=

+0.46

2

0.840.08=

+0.2

2

0.120.28=

+

0.09246.0*2.0 =

0.044217.0*26.0 =

05376.04

)092.04.02(0.0442)1.6(2=

××+××

Figure 6.6: Value computation on an AND/OR sample tree

recursively using Definition 49. The value of OR nodes X and Y given Z = j ∈ 0, 1

is equal to gX(Z = j) and gY (Z = j) respectively defined in Equations 6.8 and 6.9. The

value of the root node labeled by Z is equal to the AND/OR sample tree mean which is

equal to the sample mean computed by parts in Equation 6.12.

THEOREM 19. The AND/OR sample tree mean Zao−tree is an unbiased estimate of the

weighted counts Z.

Proof. We prove by induction that the value of any OR node n is an unbiased estimate of

the conditional expectation of the subproblem rooted at n (conditioned on the assignment

from the root to n).

Consider the expression for weighted counts:

Z =∑

x∈X

m∏

i=1

Fi(x)

p∏

j=1

Cj(x) (6.17)

Let T (X,E) be a pseudo tree, pathT (Xi) be the set of variables along the path from root

up to node Xi (note that pathT (Xi) does not include Xi) in T and BT,Xibe the bucket

239

function (see Definition 46) of Xi w.r.t. T . For any full assignment x, we have:

n∏

i=1

BT,Xi(xi, xpathT (Xi)) =

m∏

i=1

Fi(x)

p∏

j=1

Cj(x) (6.18)

To recap our notation, xpathT (Xi) is the projection of the assignment x onto the subset

pathT (Xi) of X.

Substituting Equation 6.18 into Equation 6.17, we get:

Z =∑

x∈X

n∏

i=1

BT,Xi(xi, xpathT (Xi)) (6.19)

Let Q(X) =∏n

i=1Qi(Xi|Yi) be the proposal distribution where Yi ⊆ contextT (Xi).

Because contextT (Xi) ⊆ pathT (Xi), we can express Q as:

Q(X) =n∏

i=1

Qi(Xi|Yi) =n∏

i=1

Qi(Xi|pathT (Xi)) (6.20)

We can express Z in Equation 6.19 in terms of Q as:

Z =∑

x∈X

n∏

i=1

BT,Xi(xpathT (Xi))

Qi(xi|xpathT (Xi))Qi(xi|xpathT (Xi)) (6.21)

Using the notation xi = (x1, . . . , xi) and xi,pathT (Xj) as the projection of xi on pathT (Xj)

and migrating the functions to the left of summation variables which it does not reference,

we can rewrite Equation 6.21 as:

Z =∑

x1∈X1

BT,X1(x1)

Q1(x1)Q1(x1)× . . .×

∑

xi∈Xi

BT,Xi(xi, xi−1,pathT (Xi))

Qi(xi|xi−1,pathT (Xi))Qi(xi|xi−1,pathT (Xi))× . . .×

∑

xn∈Xn

BT,Xi(xi, xn−1,pathT (Xn))

Qn(xn|xn−1,pathT (Xn))Qn(xn|xn−1,pathT (Xn)) (6.22)

240

Using the definition of expectation and conditional expectation, we can rewrite Equation

6.22 as:

Z = E

[BT,X1(x1)

Q1(x1)× . . .× E

[BT,Xi

(xi, xi−1,pathT (Xi))

Qi(xi|xi−1,pathT (Xi))× . . .×

E

[BT,Xn(xn, xn−1,pathT (Xn))

Qn(xn|xn−1,pathT (Xn))

∣∣∣∣ xn−1,pathT (Xn)

]. . .

∣∣∣∣ xi−1,pathT (Xi)

]. . .

](6.23)

Let chi(Xi) be the set of children of Xi in the pseudo tree T and let us denote the com-

ponent of conditional expectation at a node Xi, along an assignment xi−1,pathT (Xi) by

VXi(Xi|xi−1,pathT (Xi)). VXi

(Xi|xi−1,pathT (Xi)) can be recursively defined as follows:

VXi(Xi|xi−1,pathT (Xi)) = E

[BT,Xi

(xi, xi−1,pathT (Xi))

Qi(xi|xi−1,pathT (Xi))×

∏

Xj∈chi(Xi)

VXj(Xj |xi, xi−1,pathT (Xi))

∣∣∣∣∣∣xi−1,pathT (Xi)

(6.24)

It is easy to see that Z equals VX1(X1), namely,

Z = E

BT,X1(X1)

Q1(X1)

∏

Xj∈chi(X1)

VXj(Xj|x1)

= VX1(X1) (6.25)

We will now derive an unbiased estimate of VXi(Xi| xi−1,pathT (Xi)). Assume that for all

Xj ∈ chi(Xi) in T , we have an unbiased estimate of VXj(Xj|xi, xi−1,pathT (Xi)) denoted by

vXj(Xj| xi, xi−1,pathT (Xi)). Assume that given xi−1,pathT (Xi), we have generatedN samples

(x1i , . . . , x

Ni ) from Qi(Xi|xi−1,pathT (Xi)). By replacing the (conditional) expectation by its

241

sample average, we get the following unbiased estimate of VXi(Xi| xi−1,pathT (Xi)).

vXi(Xi|xi−1,pathT (Xi)) =

1

N

N∑

a=1

BT,Xi(xai , xi−1,pathT (Xi))

Qi(xai |xi−1,pathT (Xi))

∏

Xj∈chi(Xi)

vXj(Xj |xai , xi−1,pathT (Xi))

(6.26)

Assume that the domain of Xi is xi,1, . . . , xi,k. Also, assume that each value xi,j is sam-

pled Ni,j times. By collecting together all the samples in which the value xi,j is generated

and substituting N =∑k

a=1Ni,a, we can rewrite Equation 6.26 as:

vXi(Xi|xi−1,pathT (Xi)) =

∑k

a=1 Ni,aBT,Xi

(xi,a,xi−1,pathT (Xi))

Qi(xi,a|xi−1,pathT (Xi))

∏Xj∈chi(Xi)

vXj(Xj |xi,a, xi−1,pathT (Xi))

∑k

a=1 Ni,a

(6.27)

Next, we show that given an AND/OR sample tree SAOT and the same samples from which

vXi(Xi|xi−1,pathT (Xi)) is derived, the value of an OR node n in SAOT labeled by Xi such

that A(πn) = xi−1,pathT (Xi) is equal to vXi(Xi|xi−1,pathT (Xi)). Let us denote the kth child

AND node by mk. By definition the frequencies and weights of the arcs from n to ma are

given by:

#(n,ma) = Ni,a

w(n,ma) =BT,Xi

(xi,a, A(πn))

Qi(xi,a|A(πn))=BT,Xi

(xi,a, xi−1,pathT (Xi))

Qi(xi,a|xi−1,pathT (Xi))

By definition, the value of each AND node ma is given by:

v(ma) =∏

n′∈chi(ma)

v(n′)

242

Similarly by definition, the value of the OR node n is given by:

v(n) =

∑ka=1 #(n,ma)× w(n,ma)× v(ma)∑k

a=1 #(n,ma)

=

∑ka=1 #(n,ma)× w(n,ma)×

∏n′∈chi(ma) v(n

′)∑k

a=1 #(n,ma)(6.28)

Substituting the expressions for #(n,ma) and w(n,ma) in Equation 6.28, we get:

v(n) =

∑ka=1Ni,a

BT,Xi(xi,a,xi−1,pathT (Xi)

)

Qi(xi,a,xi−1,pathT (Xi))

∏n′∈chi(ma) v(n

′)∑k

a=1Ni,a

(6.29)

Assuming v(n′) = vXj(Xj|xi,a, xi−1,pathT (Xi)), we can see that the right hand sides of

Equations 6.27 and 6.29 are equal yielding v(n) = vXi(Xi|xi−1,pathT (Xi)). Namely, we

have proved that if the value of a child OR node n′ of a child AND node of an OR node n is

equal to an unbiased estimate of the conditional expectation of the sub-problem rooted at n′,

then the value of the OR node n is also an unbiased estimate of the conditional expectation

of the sub-problem rooted at n.

Since this result is true for any OR node, the value of the root OR node is equal to an

unbiased estimate of Z = VX1(X1), which is what we wanted to prove.

We look now at the special case of an OR tree.

THEOREM 20. Given a mixed networkM = 〈X,D,F,C〉, a pseudo tree T , a proposal

distribution Q(X) =∏n

i=1Qi(Xi|contextT (Xi)), N i.i.d. samples drawn from Q and a

chain pseudo tree T ′ obtained by any topological linearization order o of T , the AND/OR

sample tree mean computed on an OR sample tree based on T ′ is same as the conventional

importance sampling estimate Z defined in Equation 1.15.

243

Proof. We prove this theorem by induction over the nodes of the pseudo tree T (X,E).

Base Case: Here we prove that the statement of the theorem is true for n = 1. Assume that

T has only one variableX1. In this case, the chain pseudo tree T ′ obtained by any topologi-

cal linearization order o of T coincides with T . Given samples S = (x11, . . . , x

N1 ) generated

from a proposal distribution Q1(X1) and bucket function BT ′,X1(X1) (see Definition 46),

the conventional importance sampling estimate is:

Z =1

N

N∑

i=1

BT ′,X1(xi1)

Q1(xi1)(6.30)

Let x1,1, . . . , x1,k be the domain of X1 and N1,j be the number of times the value x1,j

appears in S, j ∈ 1, . . . , k. Then, by collecting together all the samples in which the

value x1,j is generated and substituting N =∑k

a=1N1,a, we can rewrite Equation 6.30 as:

Z =

∑ka=1N1,a

BT ′,X1(x1,a)

Q1(x1,a)∑ka=1N1,a

(6.31)

Since T ′ has just one node, the OR sample tree based on T has just one OR node denoted

by n. Let m1, . . . ,mk be the child AND nodes of n. By definition, the value of all leaf

AND nodes is 1, while the weight and frequency of the arcs (n,ma) are given by:

#(n,ma) = N1,a

w(n,ma) =BT ′,X1(x1,a)

Q1(x1,a)

244

Also, by definition, the value of the OR node n is given by:

v(n) =

∑ka=1 #(n,ma)w(n,ma)∑k

a=1 #(n,ma)

=

∑ka=1N1,a

BT ′,X1(x1,a)

Q1(x1,a)∑ka=1N1,a

(6.32)

From, Equations 6.31 and 6.32, we have Z = v(n) which proves the base case. Next, we

prove the induction case.

Induction case: In this case, we assume that the statement of the theorem is true for n vari-

ables X1, . . . , Xn and then prove that it is also true for n+ 1 variables X1, . . . , Xn+1.

Consider a pseudo tree T over n + 1 variables with Xn+1 as the root. Let T ′ be the chain

pseudo tree corresponding to the topological linearization order o of T . By definition, both

T and T ′ have the same root node Xn+1.

Given samples S = ((x1n, x

1n+1) , . . . , (x

Nn , x

Nn+1)) generated from the proposal distribution

Q(Xn, Xn+1), the conventional importance sampling estimate is given by:

Z =1

N

N∑

i=1

∏n+1j=1 BT ′,Xj

(xin, xin+1)∏n+1

j=1 Qj(xin, xin+1)

(6.33)

Let Xn+1 have k values in its domain given by xn+1,1, . . . , xn+1,k and Nn+1,j be the

number of times the value xn+1,j appears in S. Let S(xn+1,j) ⊆ S be the subset of all

samples which mention the value xn+1,j . Then, by collecting together all the samples in

which the value xn+1,j is generated and substituting N =∑k

a=1Nn+1,a, we can rewrite

Equation 6.33 as:

245

Z =

∑ka=1Nn+1,a

BT ′,Xn+1(xn+1,a)

Qn+1(xn+1,a)

[1

Nn+1,a

∑xk∈S(xn+1,j)

Qnj=1BT ′,Xj

(xkn,xn+1,a)

Qnj=1Qj(xk

n|xn+1,a)

]

∑ka=1Nn+1,a

(6.34)

Without loss of generality, let X1 be the child of Xn+1 in T ′. From the induction case

assumption, the quantity in the brackets in Equation 6.34 is equal to the value of the OR

node labeled by X1 given xn+1,a. Let the OR node be denoted by ra. We can rewrite

Equation 6.34 as:

Z =

∑ka=1Nn+1,a

BT ′,Xn+1(xn+1,a)

Qn+1(xn+1,a)v(ra)

∑ka=1Nn+1,a

(6.35)

Consider the root OR node denoted by r of the OR sample tree which is labeled by Xn+1.

r has k child AND nodes m1, . . . ,mk which in turn have one child OR node each. By

definition, the value of an AND node is the product of the values of all its child nodes.

Since each AND node ma a = 1, . . . , k has only one child OR node denoted by ra in an

OR sample tree, the value of ma is equal to the value of ra. Namely,

v(ma) = v(ra)

By definition, the weights of the arcs between r and ma for a = 1, . . . , k are given by:

#(r,ma) = Nn+1,a

w(r,ma) =BT ′,Xn+1(xn+1,a)

Qn+1(xn+1,a)

246

By definition, the value of the root OR node r denoted by v(r) is given by:

v(r) =

∑ka=1 #(r,ma)× w(r,ma)× v(ma)∑k

a=1 #(r,ma)=

∑ka=1Nn+1,a

BT ′,Xn+1(xn+1,a)

Qn+1(xn+1,a)v(ra)

∑ka=1Nn+1,a

(6.36)

From Equations 6.35 and 6.36, we have Z = v(r), which proves the induction case. There-

fore, from the principle of induction, the proof follows.

Algorithm 21: AND/OR Tree Importance Sampling

Input: A Mixed Network 〈X,D,F,C〉, a pseudo tree T (X, E) and a proposal distribution

Q(X) =∏ni=1 Q(Xi|context(Xi))

Output: AND/OR sample tree mean

Generate samples x1, . . . ,xN from Q;1

Build an AND/OR sample tree SAOT for the samples x1, . . . ,xN ;2

Initialize all labeling functions 〈w(n, m), #(n, m)〉 on each arc from an OR node n to an3

AND node m using Definition 47;

// Start: Value computation phase

Initialize the value of all leaf OR nodes n to 0 and leaf AND nodes to 1;4

for every node n from leaves to the root of SAOT do5

Let chi(n) denote the child nodes of node n ;6

// denote value of a node by v(n)if n is an AND node then7

v(n) =∏n′∈chi(n) v(n′) ;8

else9

10

v(n) =

∑n′∈chi(n) #(n, n′)× w(n, n′)× v(n′)

∑n′∈chi(n) #(n, n′)

.

// End: Value computation phase

return v(root node of SAOT )11

Algorithm AND/OR tree importance sampling is presented as Algorithm 21. In steps 1-3,

the algorithm generates samples from Q and stores them on an AND/OR sample tree. The

algorithm then computes the AND/OR sample tree mean over the AND/OR sample tree

recursively from leaves to the root in steps 4− 10 (value computation phase).

We summarize the complexity of Algorithm 21 in the following theorem.

247

THEOREM 21 (Complexity of AND/OR tree importance sampling). GivenN samples, a

mixed network with n variables and a pseudo tree of depth h, the time and space complexity

of computing the AND/OR sample tree mean is O(nN) and O(h) respectively.

Proof. Because only N full samples are generated, the number of nodes of the AND/OR

sample tree SAOT is bounded by O(nN). Because each node of SAOT is processed only

once during the value computation phase of Algorithm 21 (with each node processed in

constant time) the overall time complexity is O(nN). We could perform a depth first

search traversal of the AND/OR sample tree (i.e. build it on the fly). In this case, we only

have to store the current search path, whose maximum size is bounded by the depth h of

the pseudo tree. Therefore, the space complexity is O(h).

In summary, we defined AND/OR and OR sample tree mean and showed that they yield an

unbiased estimate of the weighted counts Z. We also proved that conventional importance

sampling mean can be derived over an OR sample tree. We provided an algorithm for

computing the AND/OR sample tree mean and proved that it has the same time complexity

as conventional importance sampling. In the next section, we will prove that AND/OR

importance sampling is indeed a more powerful approach than OR importance sampling in

that it has lower variance.

6.4 Variance Reduction

In this section, we prove that the AND/OR sample tree mean has lower (or equal) variance

than that of the OR sample tree mean (which in turn is equal to conventional sample mean

given by Equation 1.15).

THEOREM 22 (Variance Reduction). Given a pseudo tree TAO of a mixed networkM =

〈X,D,F,C〉, a chain pseudo tree TOR obtained from any topological linearization order o

248

Z Z Z Z

n = 1 X X X Y

(a) n = 2 Y n = 3, case 2

(b) n = 3, case 1 (d)

(c)

Figure 6.7: The possible pseudo trees for n = 1, 2, 3 excluding a permutation of the

variables, where n is the number of variables.

of TAO, and a set of samples S = (x1, . . . , xN), the variance of the AND/OR sample tree

mean on the AND/OR sample tree of S along TAO is always smaller than or equal to the

variance of the OR sample tree mean on the OR sample tree of S along TOR.

Proof. We prove this theorem by an induction over the number of variables n of TAO.

Base case, n=1,2,3: Trivial to prove for n = 1, 2 and case 1 for n = 3 (see Figures

6.7(a), (b) and (c)). In these cases, the pseudo trees form a chain and therefore the OR and

AND/OR sample tree means are equal.

Next, we consider the second case for n = 3 corresponding to the pseudo tree TAO given in

Figure 6.7(d). The OR pseudo tree TOR corresponding to a possible topological lineariza-

tion order o of TAO is the one in Figure 6.7(c).

Without loss of generality, let z1, z2 be the domain of Z and let Q(Z) × Q(X|Z) ×

Q(Y |Z) be the proposal distribution. Let us assume that we have generated N samples

from Q and each value zj of Z appears Nj times in the generated N samples. Let Sj =

(xj,1, yj,1), . . . , (xj,Nj , yj,Nj) be the set of samples on X , Y given Z = zj , j ∈ 1, 2.

Let BTOR,X and BTOR,Y be the bucket functions (see Definition 46) associated with TOR

and let BTAO,X and BTAO,Y be the bucket functions associated with TAO.

249

Consider the OR sample tree SORT relative to the samples S = Sj| j ∈ 1, 2 and

TOR. By definition, in the OR sample tree SORT , the value of the AND node labeled by zj ,

denoted by aj is given by:

vSORT(aj) =

1

Nj

Nj∑

i=1

BTOR,Y (yj,i, zj)

Q(yj,i|zj)BTOR,X(xj,i, zj)

Q(xj,i|zj)(6.37)

Consider the AND/OR sample tree SAOT relative to the samples S = Sj| j ∈ 1, 2

and TAO. By definition, in the AND/OR sample tree SAOT , the value of the AND node bj

labeled by zj is given by:

vSAOT(bj) =

1

Nj

Nj∑

i=1

BTAO,Y (yj,i, zj)

Q(yj,i|zj)

1

Nj

Nj∑

i=1

BTOR,X(xj,i, zj)

Q(xj,i|zj)

(6.38)

Based on [64], we will show that variance of vSORT(aj) is greater than or equal to the vari-

ance of vSAOT(bj). Goodman [64] proved that the variance of product of two independent

random variablesX and Y , denoted by V [XY ] can be expressed in terms of their variances

V [X], V [Y ] and expected values E[X] and E[Y ] respectively as follows:

V (XY ) = V [X]E[Y ]2 + V [Y ]E[X]2 + V [Y ]V [X] (6.39)

Given random samples (x1, . . . , xN) and (y1, . . . , yN), Goodman [64] considered the fol-

lowing two unbiased estimators of XY :

xy =

(1

N

N∑

i=1

xi

)(1

N

N∑

i=1

yi

)(6.40)

xy =1

N

N∑

i=1

xiyi (6.41)

Let V [xy] and V [xy] denote the variances of the two unbiased estimators respectively.

250

Goodman [64] showed that:

V [xy] =V [X]E[Y ]2

N+V [Y ]E[X]2

N+V [Y ]V [X]

N2(6.42)

V [xy] =V [X]E[Y ]2

N+V [Y ]E[X]2

N+V [Y ]V [X]

N(6.43)

Notice that Equations 6.42 and 6.43 differ only in the last term. In Equation 6.42, the last

term is divided by N2 while in Equation 6.43, the last term is divided by N . Therefore, if

N > 1, then V [xy] < V [xy].

Since vSORT(aj) has the same form as the estimator xy while vSAOT

(bj) has the same form

as the estimator xy, the variances of vSORT(aj) and vSAOT

(bj) obey:

V [vSORT(aj)] =

1

Nj

[E

[BTOR,Y (y, zj)

Q(y|zj)

∣∣∣∣ zj]2

V

[BTOR,X(x, zj)

Q(x|zj)

∣∣∣∣ zj]

+

E

[BTOR,X(x, zj)

Q(x|zj)

∣∣∣∣ zj]2

V

[BTOR,Y (y, zj)

Q(y|zj)

∣∣∣∣ zj]

+

V

[BTOR,X(x, zj)

Q(x|zj)

∣∣∣∣ zj]V

[BTOR,Y (y, zj)

Q(y|zj)

∣∣∣∣ zj] ]

(6.44)

V [vSAOT(bj)] =

1

Nj

[E

[BTAO,Y (y, zj)

Q(y|zj)

∣∣∣∣ zj]2

V

[BTAO,X(x, zj)

Q(x|zj)

∣∣∣∣ zj]

+

E

[BTAO,X(x, zj)

Q(x|zj)

∣∣∣∣ zj]2

V

[BTAO,Y (y, zj)

Q(y|zj)

∣∣∣∣ zj]

+

V

[BTAO,X(x, zj)

Q(x|zj)

∣∣∣∣ zj]V

[BTAO,Y (y, zj)

Q(y|zj)

∣∣∣∣ zj]/Nj

](6.45)

By definition (see Definition 46), BTOR,X(x, zj) = BTAO,X(x, zj) and BTOR,Y (y, zj) =

251

BTAO,Y (y, zj). Therefore, from Equations 6.44 and 6.45, we can see that if Nj > 1,

V [vSAOT(bj)] < V [vSORT

(aj)]. This proves the base case for n = 3.

Induction case: Here, we assume that the theorem holds for any pseudo tree TAO having n

variables or fewer and then we prove that the theorem also holds for any pseudo tree having

n+ 1 variables.

Consider a pseudo tree TAO having n + 1 variables. Let TOR be the pseudo tree obtained

from any topological linearization order o of TAO. Assume that the root node of TAO and

therefore that of TOR is labeled by Z. Without loss of generality, let Z contain two child

nodes labeled by X and Y in TAO and let X be the child of Z and Y be the child of X

in TOR. Let z1, z2 be the domain of Z. Let us assume that each value zj j ∈ 1, 2 is

sampled Nj times. Let Sj = (xj,1, yj,1), . . . , (xj,Nj , yj,Nj) be the samples on X , Y given

Z = zj , j ∈ 1, 2. Consider the OR sample tree SORT relative to S = Sj| j ∈ 1, 2

and TOR. By definition, in the OR sample tree SORT , the value of the AND node labeled

by zj , denoted by cj is given by:

vSORT(cj) =

1

Nj

Nj∑

i=1

BTOR,Y (yj,i, zj)

Q(yj,i|zj)BTOR,X(xj,i, zj)

Q(xj,i|zj)× vORT (X|xj,i, yj,i, zj) (6.46)

where vORT (X|xj,i, yj,i, zj) is the estimate of conditional expectation of the sub-problem

given xj,i, yj,i, zj in SORT .

By definition, in the AND/OR sample tree SAOT , the value of the AND node dj labeled by

zj is given by:

vSAOT(dj) =

1

Nj

Nj∑

i=1

BTAO,Y (yj,i, zj)

Q(yj,i|zj)

1

Nj

Nj∑

i=1

BTOR,X(xj,i|zj)Q(xj,i|zj)

× vAOT (X|xj,i, yj,i, zj)

(6.47)

where vAOT (X|xj,i, yj,i, zj) is the estimate of conditional expectation of the sub-problem

given xj,i, yj,i, zj in SAOT .

252

From the inductive assumption, the variance of vAOT (X|xj,i, yj,i, zj) is less than or equal to

vORT (X|xj,i, yj,i, zj). Assuming a weak case of equality, namely V [vAOT (X|xj,i, yj,i, zj)] =

V [vORT (X|xj,i, yj,i, zj)], we can ignore these two quantities from Equations 6.46 and 6.47.

Doing so, the expression for vSORT(cj) given in Equation 6.46 has the same form as the

expression for vSORT(aj) given in Equation 6.37. Similarly, the expression for vSAOT

(dj)

given in Equation 6.47 has the same form as vSAOT(bj) given in Equation 6.38. As already

proved in the base case, if Nj > 1, then V [vSAOT(bj)] < V [vSORT

(aj)] and therefore it

follows that V [vSAOT(dj)] < V [vSORT

(cj)]. This proves the induction case and the proof

follows.

6.4.1 Remarks on Variance Reduction

From the proof of Theorem 22, it is easy to see that if each value Z = zj is sampled only

once i.e. Nj = 1 then there is no variance reduction. We can tie variance reduction to

the difference between the (virtual) number of AND/OR tree and OR tree samples (defined

below).

DEFINITION 50 (Virtual AND/OR Tree samples). Given an AND/OR sample tree SAOT ,

its virtual samples are all the solution subtrees of the AND/OR sample tree (see Definition

41).

EXAMPLE 20. Figure 6.8(a) and Figure 6.8(b) show four samples arranged on an OR

sample tree and an AND/OR sample tree respectively. The four samples correspond to

8 virtual samples (solution subtrees) on an AND/OR sample tree. The AND/OR sample

tree includes for example the assignment (C = 0, B = 2, D = 1, A = 1) which is not

represented in the OR sample tree.

253

C

0

B

1

D

1

A

0

2

D

2

A

1

1

B

0

D

0

A

2 0

2

D

2

A

0 1 2 0

(a) OR sample tree

C

0 1

B D B D

1 2 1 202 0 2

A

0

A

1

A

2

A

0

(b) AND/OR sample tree

C

0 1

D D

1 2 0 22

0

B

1

A

0

2

A

1

B

0

A

2

(c) AND/OR sample graph

Figure 6.8: Figure showing 4 samples arranged on an OR sample tree, AND/OR sample

tree and AND/OR sample graph. The AND/OR sample tree and graph are based on the

pseudo tree given in Figure 6.1(b).

It is trivial to show that:

PROPOSITION 19. The number of virtual AND/OR tree samples is greater than or equal to

the number of OR (tree) samples.

So clearly, if the number of virtual samples rooted at an AND node labeled by xi is equal

to the number of OR samples, it does not matter whether xi is a part of an OR tree or an

AND/OR tree. Variance reduction therefore only occurs at AND nodes which are sampled

more than once and which have at least two child nodes.

To summarize, in this section, we proved that the variance of AND/OR sample tree mean

is less than or equal to the variance of the OR sample tree mean. We showed how variance

reduction can be tied to the difference between the number of virtual AND/OR tree samples

and the original number of samples.

6.5 Estimating Sample mean in AND/OR graphs

Next, we describe a more powerful strategy for estimating sample mean in the AND/OR

space by moving from AND/OR trees to AND/OR graphs [37]. The idea is similar to

254

AND/OR graph search in that we merge nodes in an AND/OR sample tree, which are

unifiable based on context (see Definition 44), to form an AND/OR sample graph. This can

result in an even larger increase in the number of virtual samples.

DEFINITION 51 (AND/OR Sample Graph). Given an AND/OR sample tree SAOT , the

AND/OR sample graph SAOG is obtained from SAOT by merging all OR nodes that have

the same context.

EXAMPLE 21. Figure 6.8(c) shows an AND/OR sample graph obtained from the AND/OR

sample tree in Figure 6.8(b) by merging all context unifiable nodes.

DEFINITION 52 (AND/OR Sample Graph mean Zao−graph). The AND/OR sample graph

mean, denoted by Zao−graph is the value of the root node of SAOG (see Definition 48) with

initialization functions ψl(V (πl)) = 1 for all leaf AND nodes.

DEFINITION 53 (Virtual AND/OR Graph samples). The set of virtual AND/OR graph

samples is the set of solution subtrees of the AND/OR sample graph.

EXAMPLE 22. Figure 6.8(b) and Figure 6.8(c) show four OR samples arranged over an

AND/OR sample tree and an AND/OR sample graph respectively. The 8 virtual AND/OR

tree samples (solution subtrees) correspond to 12 virtual AND/OR graph samples. The

AND/OR sample graph includes for example the sample (C = 0, B = 2, D = 1, A = 0)

which is not part of the virtual samples of the AND/OR sample tree.

It is trivial to show that:

PROPOSITION 20. The number of virtual AND/OR graph samples is greater than or equal

to the number of virtual AND/OR tree samples if both are based on the same underlying

pseudo tree.

Since the AND/OR graph captures more virtual samples, the variance of AND/OR sample

graph mean may be smaller than the variance of AND/OR sample tree mean. Formally,

255

THEOREM 23. The variance of AND/OR sample graph mean Zao−graph is less than or equal

to that of AND/OR sample tree mean Zao−tree.

Proof. Given a pseudo tree T , in an AND/OR sample graph SAOG, the value of an OR

node, denoted by nAOG and labeled by Xi is computed using a subset of the samples that

have the same assignment to the contextT (Xi) of Xi while in an AND/OR sample tree

SAOT , the value of the corresponding OR node nAOT is computed using a subset of the

samples that have the same assignment to all variables along the path from root to Xi,

denoted by pathT (Xi). Because, contextT (Xi) ⊆ pathT (Xi), the value of nAOG is based

on a larger (or equal) number of samples compared with the value of nAOT . Because of

its larger virtual sample size, from Equation 6.45, the variance of the value of nAOG is less

than (or equal to) the variance of the value of nAOT .

The algorithm for computing the AND/OR sample graph mean is identical to that of

AND/OR sample tree mean (Steps 4-10 of Algorithm 21). Obviously, the only difference is

that we store the samples and perform value computations over an AND/OR sample graph

instead of an AND/OR sample tree.

THEOREM 24 (Complexity of computing AND/OR graph sample mean). Given a graphi-

cal model with n variables, a pseudo tree T with maximum context size (treewidth) w∗ and

N samples, the time complexity of AND/OR graph sampling is O(nNw∗) while its space

complexity is O(nN).

Proof. Let Xj be the child node of Xi in T . Given N samples and maximum context size

w∗, the number of edges emanating from AND nodes corresponding to Xi to OR nodes

labeled by Xj in the AND/OR sample graph is bounded by O(Nw∗). Since each such

edge is visited just once in the value computation phase, the overall time complexity is

256

O(nNw∗). To store N samples it takes O(nN) space and therefore the space complexity

is O(nN).

6.6 AND/OR w-cutset sampling

We now show how the w-cutset sampling framework [7] can be combined with AND/OR

importance sampling to reduce variance even further. Recall from Section 1.4.4 that the w-

cutset sampling framework facilitates the use of the Rao-Blackwell theorem (see Theorem

4 in Chapter 1). To recap, given a w-cutset K (see Definition 16) of a mixed network

M = 〈X,D,F,C〉, a proposal distribution Q(K) defined over the w-cutset, and a set of

samples (k1, . . . ,kN) generated from Q, the w-cutset estimate of weighted counts Z is

given by:

Zor−wc−tree =1

N

N∑

i=1

∑r∈R


∏pk=1Ck(r,K = ki)

Q(ci)(6.48)

From the Rao-Blackwell theorem, it follows that the variance of Zor−wc−tree is less than

or equal to the variance of Zor−tree when the two are based on the same samples on the

w-cutset variables.

We now illustrate how w-cutset sampling can be easily combined with AND/OR sampling

in the following example.

EXAMPLE 23. Consider the primal graph in Figure 6.9 (a). The minimal 1-cutset or cycle

cutset contains three nodes A,B,C. Given a proposal distribution Q(A,B,C) = Q(A)

Q(B|A) Q(C|A), in 1-cutset sampling the variables A, B and C are sampled as if they

form a chain pseudo tree as in Figure 6.9(c). Bucket elimination is applied on the remaining

network to compute the sample weight. However once A is sampled, the remaining sub-

problem is split into two components and therefore as presented in [91] we can organize

257

A

B C

D

E G

F

(a)

A

B C

D

E G

F

(b)

A

B

C

(c)

A

B C

(d)

Figure 6.9: (a) Example primal graph of a mixed network (b) Pseudo tree (c) Start pseudo

tree of OR w-cutset tree sampling and (d) Start pseudo tree of AND/OR w-cutset tree

sampling.

the 1-cutset into two portions as in the start pseudo tree given in Figure 6.9(d). Thus,

the main idea is to arrange the samples on an AND/OR sample tree restricted over the 1-

cutset variables A,B,C, perform exact bucket elimination separately on the sub-problems

rooted at B and C respectively in the pseudo tree of Figure 6.9(b) and then compute a new

AND/OR w-cutset sample tree mean over the AND/OR w-cutset sample tree.

DEFINITION 54 (start and full pseudo trees). [91] Given a pseudo tree T (V,E), a di-

rected rooted tree T ′(V’,E’) where V’ ⊆ V and E′ ⊆ E is a start pseudo tree of T if it has

the same root as T and is a connected subgraph of T . T is called the full pseudo tree of its

start pseudo tree T ′.

EXAMPLE 24. The pseudo tree given in Figure 6.9(d) is a start pseudo tree of the (full)

pseudo tree given in Figure 6.9(b).

DEFINITION 55 (AND/OR w-cutset). [91] Given a mixed networkM = 〈 X, D, F, C 〉,

an AND/OR w-cutset is a pair 〈T ′,K〉 where K ⊆ X is a w-cutset ofM and T ′(K,E) is a

start pseudo tree defined over K.

DEFINITION 56 (AND/OR w-cutset search tree). Given a mixed networkM = 〈 X, D,

258

F, C 〉 , an AND/OR w-cutset 〈T ′,K〉, a full pseudo tree T of T ′ and an AND/OR search

tree φAOT based on T andM, the AND/OR w-cutset search tree φAOWT is obtained from

φAOT by removing all nodes and edges that are not included in K.

DEFINITION 57 (AND/OR w-cutset sample tree). Given a mixed networkM = 〈 X, D,

F, C 〉, an AND/OR w-cutset 〈T ′,K〉, a full pseudo tree T of T ′, a proposal distribution

along the pseudo tree Q(X) =∏n

i=1Qi(Xi|Yi) such that Yi ⊆ contextT (Xi), a sequence

of samples ST ′ and a complete AND/OR w-cutset search tree φAOWT defined relative to

M, 〈T ′,K〉 and T ′, an AND/OR w-cutset sample tree SAOWT is obtained from φAOWT by

removing all edges and corresponding nodes which do not appear in ST ′ . The arc-labels

from an OR node n to an AND node m in SAOWT are defined in a similar way to those of

an AND/OR sample tree (see Definition 47). Let Xi be the label of n and xi be the label of

m, then the arc-label from n tom is a pair 〈w(n,m),#(n,m)〉 where:

• w(n,m) =BT,Xi

(xi,A(πn))

Qi(xi|A(πn))is called the weight of the arc.

• #(n,m) is the frequency of the arc. Namely, it is equal to the number of times the

partial assignment A(πm) occurs in ST ′ .

where BT,Xibe the bucket function of Xi w.r.t. T (see Definition 46). Note that the Bucket

function is defined relative to the full pseudo tree T and not the start pseudo tree T ′.

DEFINITION 58 (AND/OR w-cutset sample tree mean Zao−wc−tree). Given a mixed net-

workM = 〈X,D,F,C〉, an AND/OR w-cutset 〈T ′,K〉, a full pseudo tree T of T ′, a proposal

distribution along the pseudo tree Q(X) =∏n

i=1Qi(Xi|Yi) such that Yi ⊆ contextT (Xi),

a sequence of samples ST ′ , an AND/OR w-cutset sample tree SAOWT defined relative to

M, 〈T ′,K〉 and T and an initialization function ψ(V (πl)) given by Equation 6.49 for each

leaf AND node l, labeled with xi , the AND/OR w-cutset sample tree mean is the value (see

Definition 48) of the root node of SAOWT .

259

Algorithm 22: AND/OR w-cutset Importance Sampling

Input: A mixed networkM, and an integer wOutput: AND/OR w-cutset sample mean

Construct an AND/OR w-cutset (see Definition 55) 〈T ′,K〉 ofM. Let T be the full1

pseudo tree of T ′;

Construct a proposal distribution Q(K) =∏|K|

i=1Qi(Ki|Yi) where2

Yi ⊆ contextT (Xi);Generate samples k1, . . . ,kN from Q;3

Build a AND/OR w-cutset sample tree SAOWT for the samples k1, . . . ,kN (see4

Definition 57);

Initialize all labeling functions 〈w,#〉 on each arc from an OR-node n to an AND5

node n′ using Definition 57;

for all leaf AND nodes l of SAOWT do6

// Let l be labeled by xiCompute v(l) =

∑u∈lpathT (Xi)

∏Xj∈lpathT (Xi)

BT,Xj(u, A(πl)) using Bucket7

elimination [28];

Initialize all leaf OR nodes to 0;8

for every node n from leaves to the root do9

// Let chi(n) denote the child nodes of node nif n is an AND node then10

v(n) =∏

n′∈chi(n) v(n′);11

else12

13

v(n) =

∑n′∈chi(n)(#(n, n′)w(n, n′)v(n′))

∑n′∈chi(n) #(n, n′)

return v(root node of SAOWT )14

ψ(A(πl)) =∑

u∈lpathT (Xi)

∏

Xj∈lpathT (Xi)

BT,Xj(u, A(πl)) (6.49)

where lpathT (Xi) is the set of variables along the path fromXi to the leaf in the full pseudo

tree T .

260

6.6.1 The algorithm and its properties

The AND/OR w-cutset importance sampling scheme is presented in Algorithm 22. Given

a mixed networkM = 〈X,D,F,C〉 and an integer w, we generate an AND/OR w-cutset

〈T ′,K〉 ofM. Then, we create a proposal distribution Q(K) over the w-cutset along the

start pseudo tree structure and generateN samples. We store these samples on the AND/OR

w-cutset sample tree over K and perform value computations as follows. For each leaf AND

node l, its value is computed by exact bucket elimination on the sub-tree of T rooted at l.

Then, the algorithm computes the AND/OR sample tree mean recursively by performing

product at each AND node and weighted average at each OR node (steps 9-13). Finally,

the algorithm returns the value of the root-node.

From the proof of Theorem 19, it follows trivially that:

THEOREM 25. The AND/OR w-cutset sample mean is an unbiased estimate of the weighted

counts.

From the Rao-Blackwell theorem, it follows that:

THEOREM 26. The variance of AND/OR w-cutset sample tree mean is less than or equal

to the variance of AND/OR sample tree mean.

Next, we prove that:

THEOREM 27. Given a mixed network M = 〈X,D,F,C〉, an AND/OR w-cutset 〈T ′,K〉

ofM, a full pseudo tree T of T ′, a chain pseudo tree T ′OR obtained by any topological

linearization order of T ′ and a full pseudo tree TOR of T ′OR, the variance of Zao−wc−tree

based on T , 〈T ′,K〉 andM is less than or equal to the variance of Zor−wc−tree based on

TOR, 〈T ′OR,K〉 andM.

261

Proof. To prove this theorem, we show that AND/OR w-cutset sample tree is a special case

of an AND/OR sample tree and OR w-cutset sample tree is a special case of an OR sample

tree. Therefore, from Theorem 22, it follows that the variance of AND/OR w-cutset sample

tree mean is smaller than or equal to OR w-cutset sample tree mean.

Consider an AND/OR w-cutset sample tree SAOWT defined relative toM, 〈T ′,K〉 and T .

Let l be a leaf AND node of SAOWT , labeled by xi. For the purpose of discussion, let

U = lpathT (Xi) and U = u be an assignment. We define the perfect proposal distribution,

denoted by P(lpathT (Xi)|V (πl)) as follows.

P(u|A(πl)) = α∏

Xj∈lpathT (Xi)

BT,Xj(u, A(πl))

where BT,Zjis a bucket function of Xj relative to T (see Definition 46) and α is a normal-

ization constant given by:

α =1∑

u∈lpathT (Xi)

∏Xj∈lpathT (Xi)

BT,Xj(u, A(πl))

If we draw a sample ui from P(U|A(πl)), then, by definition, the weight of ui is given by:

wP(ui) =

∏Xj∈lpathT (Xi)

BT,Xj(ui, A(πl))

P(ui|k)(6.50)

=

∏Xj∈lpathT (Xi)

BT,Xj(ui, A(πl))

α∏

Xj∈lpathT (Xi)BT,Xj

(ui, A(πl))(6.51)

=1

α(6.52)

=∑

u∈lpathT (Xi)

∏

Xj∈lpathT (Xi)

BT,Xj(u, A(πl)) (6.53)

From Equations 6.53 and 6.49, it follows that the value of a leaf AND node l of SAOWT is

262

equal to the weight of any sample ui drawn from P . Namely,

v(l) = wP(ui) (6.54)

Assume that SAOT is a full AND/OR sample tree relative to T which is obtained from

SAOWT by adding a single sample from P to each leaf AND node of SAOWT . We will

now prove that SAOT and SAOWT are equivalent in the sense that the AND/OR sample tree

mean computed on SAOT is equal to the AND/OR w-cutset sample tree mean computed

on SAOWT . Consider the AND node g in SAOT corresponding to the leaf AND node l in

SAOWT . By construction, to prove that SAOT is equivalent to SAOWT , all we need to prove

is that the value of l is equal to the value of g. Because, we only added one sample at l

in SAOWT to form SAOT , the value of g in SAOT equals the weight of the added sample,

namely

v(g) = wP(ui) (6.55)

Therefore, from Equations 6.54 and 6.55, we get v(l) = v(g), as required.

Thus, we can create an AND/OR sample tree SAOT which is equivalent to a given AND/OR

w-cutset sample tree SAOWT by adding a sample from P to each leaf node of SAOWT .

Similarly, we can create an OR sample tree SORT which is equivalent to the OR w-cutset

sample tree SORWT by adding a sample from P to each leaf node of SORWT .

From Theorem 22, the variance of AND/OR sample tree mean computed on SAOT is less

than or equal to the variance of the AND/OR sample tree mean computed on SORT . Since

these full trees are equivalent to their w-cutset counterparts, it follows that the variance

of AND/OR w-cutset sample tree mean computed on SAOWT is less than or equal to the

variance of the AND/OR w-cutset sample tree mean computed on SORWT .

263

As in full AND/OR sampling, we can restrict the AND/OR sample graph mean to the start

pseudo tree. Clearly, we will get additional reduction in variance.

DEFINITION 59 (AND/OR w-cutset sample graph and mean). Given an AND/OR w-

cutset sample tree SAOWT , the AND/OR w-cutset sample graph SAOWG is obtained from

SAOWT by context-based merging. The AND/OR w-cutset sample graph mean, denoted by

Zao−wc−graph is AND/OR sample mean computed over SAOWG.

The algorithm for computing AND/OR w-cutset graph mean is identical to Algorithm 22.

The only difference is that we store all the generated samples on an AND/OR w-cutset

sample graph rather than the AND/OR w-cutset sample tree. We can prove that:

THEOREM 28. The variance of AND/OR w-cutset sample graph mean Zao−wc−graph is less

than or equal to that of AND/OR w-cutset sample tree mean Zao−wc−tree.

Proof. The proof of this theorem is along the similar lines as Theorem 27 in that we con-

struct an equivalent AND/OR sample graph SAOG from the AND/OR w-cutset sample

graph SAOWG and an equivalent AND/OR sample tree SAOT from the AND/OR w-cutset

sample tree SAOWT by adding a sample from a perfect proposal distribution. Then, the

proof follows from Theorem 23.

6.6.2 Variance Hierarchy and Complexity

Theorems 22, 23, 27 and 28 along with the Rao-Blackwell theorem help establish the

variance hierarchy shown in Figure 6.10. The main assumption is that all sample means

are based on the same set of samples and the same (full) pseudo tree. Semantically, given a

w-cutset where 0 < w < w∗ (w∗ is the treewidth), the directed arcs in Figure 6.10 indicate

that the variance of the child node is smaller than (or equal to) the variance of the parent. We

see that the variance of AND/OR sample tree mean is incomparable with w-cutset sample

264

IS

w-cutset ISAND/OR

Tree IS

AND/OR w-cutset

Tree IS

AND/OR

Graph IS

AND/OR w-cutset

Graph IS

Figure 6.10: Variance Hierarchy

Sampling Scheme Time Complexity Space Complexity

OR tree O(nN) O(1)AND/OR tree O(nN) O(h)

AND/OR graph O(nNw∗) O(nN)OR w-cutset tree O(cN + (n− c)Nexp(w)) O((n− c)exp(w))

AND/OR w-cutset tree O(cN + (n− c)Nexp(w)) O(h + (n− c)exp(w))AND/OR w-cutset graph O(cNw∗ + (n− c)Nexp(w)) O(cN + (n− c)exp(w))

(N: number of samples, n: number of variables, w: w-cutset bound, c: number of variables in the w-cutset,

h: height of pseudo tree and w∗: treewidth)

Figure 6.11: Complexity of various AND/OR importance sampling schemes and their

w-cutset generalizations.

mean and that AND/OR w-cutset sample graph mean has the lowest variance. Finally,

variance reduction obviously comes at extra computational cost. As we move down the

variance hierarchy, the time and space complexity of compiling the various means typically

increases except when moving from OR tree schemes to AND/OR tree schemes and their

respective w-cutset generalizations. The complexities are depicted in Figure 6.11.

265


We provide empirical evaluation which demonstrates the properties of the various schemes.

In particular, we show how moving from OR space sampling to AND/OR space sampling

improves performance. We evaluate the performance as an anytime scheme. Namely, we

measure accuracy as a function of time.

6.7.1 The Algorithms evaluated

We use the SampleSearch with IJGP(i) scheme presented in Chapter 3 (see Algorithm 12)

for evaluating AND/OR sampling.

We experimented with six versions of SampleSearch based importance sampling:

1. OR tree importance sampling (or-tree-IS) introduced in [53] which computes the OR

sample tree mean.

2. AND/OR tree importance sampling (ao-tree-IS), which computes the AND/OR sam-

ple tree mean.

3. AND/OR graph importance sampling (ao-graph-IS) which computes the AND/OR

graph sample tree mean.

4. OR w-cutset tree importance sampling (or-wc-tree-IS) introduced in [7, 53] which

computes the OR w-cutset sample tree mean.

5. AND/OR w-cutset tree importance sampling (ao-wc-tree-IS) which computes the

AND/OR w-cutset tree sample mean and

6. AND/OR w-cutset graph importance sampling (ao-wc-graph-IS) which computes

AND/OR w-cutset graph sample mean.

266

We set the w of w-cutset to 5 to ensure that the bucket elimination component of w-cutset

sampling does not run out of memory and terminates in a reasonable amount of time. The

underlying scheme for generating the samples is identical in all schemes and is done via

SampleSearch. What changes is the method of accumulating the samples and deriving the

estimates.

6.7.2 Evaluation Criteria

We experimented with three sets of benchmarks: (a) Grid networks, (b) Linkage networks,

and (c) 4-coloring problems. On each instance, we compare the log relative error (see

Chapter 5) between the exact probability of evidence (or the solution counts for 4-coloring

problems) and the approximate one. If Z is the exact value and Z is the approximate value,

the log-relative error denoted by ∆ is defined as:

∆ =log(Z)− log(Z)

log(Z)(6.56)

When the exact results are not known, we compute a lower bound on the probability of

evidence (or the solution counts) by the martingale and average heuristics described in

Chapter 5. Recall that the lower bounding scheme takes a set of unbiased sample weights

and a confidence 0 < α < 1 as input, and outputs a probabilistic lower bound on the

weighted counts, which are correct with probability greater than α. We use α = 0.99 in all

our experiments which means our lower bounds are correct with probability greater than

0.99. Again, we use the highest lower bound available as a substitute for Z in Equation

6.56.

As mentioned in Chapter 5, we compute the log relative error instead of the usual relative

error because when the probability of evidence is extremely small (< 10−10) or when the

solution counts are large (e.g. > 1010) the relative error between the exact and the approx-

267

imate answer will be arbitrarily close to 1 and we would need a large number of digits to

distinguish between the results.

Notation in Tables

Tables 6.1, 6.2, 6.3 and 6.4 presents the results. For each instance, in column 2, we report

the number of variables (N), average domain size (k), number of evidence nodes or con-

straints (E), treewidth w∗ and w-cutset size (c). The third column reports the exact value

of probability of evidence or solution counts for that instance; if known. The remaining

columns (from Column 4 onwards) contain the log-relative error of the approximate solu-

tion output by the competing schemes in 1 hour of CPU time. Next, we discuss the results

for each benchmark in more detail. The best results are indicated by bold in each row.

6.7.3 Results on the Grid Networks

The grid networks are available from the authors of Cachet - a SAT model counter [119].

A grid Bayesian network is a s × s grid, where there are two directed edges from a node

to its neighbors right and down. The upper-left node is a source, and the bottom-right

node is a sink. The sink node is the evidence node whose marginal probability is to be

determined. The deterministic ratio p is a parameter specifying the fraction of nodes that

are deterministic, that is, whose values are determined given the values of their parents.

The grid instances are designated as p − s. For example, the instance 50 − 18 indicates a

grid of size 18 in which 50% of the nodes are deterministic.

Table 6.1 shows the log-relative error computed by the various schemes after 1 hour of

CPU time. Time versus log relative error plots for six sample grid instances are shown in

Figures 6.12, 6.13 and 6.14 respectively.

On grids with deterministic ratio of 50% (see Table 6.1 and Figures 6.12(a) and 6.12(b)),

268

Problem 〈n, k,E, t∗, c〉 Exact or- ao- ao- or-wc- ao-wc- ao-wc-

tree-IS tree-IS graph-IS tree-IS tree-IS graph-IS

∆ ∆ ∆ ∆ ∆ ∆50% Ratio

50-17-5 〈289, 2, 1, 25, 45〉 7.71E-01 2.96E+00 2.54E+00 1.03E-01 3.80E-02 2.41E-02 2.92E-02

50-18-5 〈324, 2, 1, 27, 53〉 4.14E-01 3.49E-01 4.87E-01 2.19E-02 1.91E-02 1.98E-02 2.13E-02

50-19-5 〈361, 2, 1, 28, 60〉 2.21E-01 7.72E-02 1.06E-01 2.62E-02 8.79E-03 8.95E-03 9.68E-03

50-20-5 〈400, 2, 1, 30, 70〉 5.69E-01 2.38E+00 2.84E+00 1.62E-01 5.22E-02 2.74E-02 2.55E-02

75% Ratio

75-16-5 〈256, 2, 1, 24, 18〉 6.16E-01 1.05E-01 4.22E-02 1.48E-02 3.77E-03 3.14E-03 5.96E-03

75-17-5 〈289, 2, 1, 25, 23〉 2.25E-02 3.00E-02 6.53E-03 2.60E-03 2.00E-03 1.70E-03 1.79E-03

75-18-5 〈324, 2, 1, 27, 39〉 1.89E-01 1.51E-01 6.77E-02 8.56E-03 3.80E-03 9.46E-04 5.84E-04

75-19-5 〈361, 2, 1, 28, 38〉 5.84E-01 9.55E-01 2.24E+00 1.24E-01 1.43E-02 5.77E-03 1.90E-04

75-20-5 〈400, 2, 1, 30, 33〉 9.93E-01 5.60E+01 2.26E+01 1.88E+00 1.59E+00 1.44E+00 1.51E+00

75-21-5 〈441, 2, 1, 32, 43〉 1.83E-01 1.19E-01 5.97E-02 6.82E-03 5.16E-02 2.39E-03 1.30E-02

75-22-5 〈484, 2, 1, 35, 47〉 4.38E-01 6.19E-01 6.55E-01 2.07E-01 3.74E-03 9.58E-03 8.94E-04

75-23-5 〈529, 2, 1, 35, 65〉 3.48E-01 6.90E-01 4.92E-01 1.36E-01 3.59E-02 7.40E-03 3.83E-03

75-26-5 〈676, 2, 1, 44, 82〉 2.64E-01 1.78E+00 1.22E+00 1.54E-01 1.75E-01 1.92E-01 1.75E-01

90% Ratio

90-21-5 〈441, 2, 1, 32, 7〉 6.54E-01 1.21E-01 2.93E-02 1.87E-02 2.65E-03 1.57E-03 1.65E-03

90-22-5 〈484, 2, 1, 35, 21〉 8.31E-01 1.16E-01 2.10E-01 7.48E-02 3.65E-02 3.37E-02 3.58E-02

90-23-5 〈529, 2, 1, 35, 13〉 1.85E-01 2.20E-02 6.57E-03 1.90E-03 2.64E-04 3.13E-04 4.69E-05

90-24-5 〈576, 2, 1, 38, 10〉 2.62E-01 3.18E-02 2.04E-02 2.33E-02 2.29E-04 1.87E-04 8.39E-05

90-25-5 〈625, 2, 1, 39, 7〉 3.03E-01 2.39E-02 7.49E-02 1.38E-02 3.73E-05 1.70E-04 5.39E-05

90-26-5 〈676, 2, 1, 44, 21〉 3.33E-01 1.53E-01 4.61E-02 4.20E-02 1.18E-02 5.87E-03 5.60E-03

90-34-5 〈1156, 2, 1, 65, 18〉 8.59E-02 1.54E-01 3.88E-02 7.77E-02 9.95E-03 5.51E-03 7.88E-03

90-38-5 〈1444, 2, 1, 69, 57〉 1.41E-01 5.78E-01 1.77E-01 5.02E-02 2.96E-01 2.74E-01 3.73E-01

90-42-5 〈1764, 2, 1, 83, 58〉 6.55E-01 2.15E+00 2.60E-01 3.64E-02 2.77E-04 1.43E-02 3.90E-02

90-50-5 〈2500, 2, 1, 108, 85〉 1.91E-01 2.36E+00 7.20E-01 2.35E-01 2.00E-01 1.36E-01 1.60E-01

Table 6.1: Table showing the log-relative error ∆ of importance sampling, w-cutset impor-

tance sampling and their AND/OR tree and graph variants for the Grid instances after 1

hour of CPU time.

269

0.001

0.01

0.1

1

10

0 500 1000 1500 2000 2500 3000 3500 4000

Log R

elat

ive

Err

or

Time in seconds

Log Relative error Error vs Time for 50-17-5, num-vars= 289

or-tree-ISao-tree-IS

ao-graph-ISor-wc-tree-IS

ao-wc-tree-ISao-wc-graph-IS

(a)

0.0001

0.001

0.01

0.1

1

10

0 500 1000 1500 2000 2500 3000 3500 4000

Log R

elat

ive

Err

or

Time in seconds





(b)

Figure 6.12: Time versus log relative error for importance sampling, w-cutset importance

sampling and their AND/OR tree and graph generalizations for Grid instances with deter-

ministic ratio = 50%.

270

schemes that employ w-cutset sampling (the last 3 columns in Table 6.1) are superior. We

observe that the log relative error does not decrease when moving from OR tree sampling to

AND/OR tree sampling. However, when moving from AND/OR tree sampling to AND/OR

graph sampling, the log relative error decreases by one order of magnitude (see Figure

6.12(a) and (b)). Comparing w-cutset sampling schemes, we see that there is a very minor

difference in their respective log relative errors. In other words, AND/OR w-cutset tree and

AND/OR w-cutset graph sampling do not yield any additional variance reduction beyond

what is achieved by OR w-cutset tree sampling.

On grids with deterministic ratio of 75% (see Table 6.1 and Figures 6.13(a) and 6.13(b)),

we see that on most instances ao-wc-tree-IS and ao-wc-graph-IS are slightly better than ao-

graph-IS and or-wc-tree-IS. ao-tree-IS and or-tree-IS are the worst performing schemes.

For example in Figures 6.13(a) and (b), ao-tree-IS and or-tree-IS are on the top indicating

larger log relative error. Thus, moving from AND/OR tree sampling to graph sampling

results in appreciable variance reduction but moving from OR and AND/OR w-cutset tree

sampling to AND/OR w-cutset graph sampling does not yield additional variance reduc-

tions.

On grids with deterministic ratio of 90% (see Table 6.1 and Figures 6.14(a) and 6.14(b)), we

see a similar picture to the grids with deterministic ratio of 75% in that ao-wc-tree-IS and

ao-wc-graph-IS are slightly better than ao-graph-IS and or-wc-tree-IS schemes. ao-tree-IS

and or-tree-IS schemes are the worst performing schemes.

Overall, we find that on the grids instances, ao-wc-graph-IS and ao-wc-tree-IS schemes

are slightly better than or-wc-tree-IS and ao-graph-IS schemes and substantially better than

or-tree-IS and ao-tree-IS schemes. Namely, moving from AND/OR tree to AND/OR graph

pays off substantially in schemes which do not employ w-cutset sampling but not as much

in schemes that use w-cutset sampling. w-cutset by itself has a significant impact.

271

0.001

0.01

0.1

1

0 500 1000 1500 2000 2500 3000 3500 4000

Log R

elat

ive

Err

or

Time in seconds





(a)

0.001

0.01

0.1

1

10

0 500 1000 1500 2000 2500 3000 3500 4000

Log R

elat

ive

Err

or

Time in seconds





(b)




272

1e-06

1e-05

0.0001

0.001

0.01

0.1

0 500 1000 1500 2000 2500 3000 3500 4000

Log R

elat

ive

Err

or

Time in seconds





(a)

0.0001

0.001

0.01

0.1

1

0 500 1000 1500 2000 2500 3000 3500 4000

Log R

elat

ive

Err

or

Time in seconds





(b)




273

0.001

0.01

0.1

0 500 1000 1500 2000 2500 3000 3500 4000

Log R

elat

ive

Err

or

Time in seconds

Log Relative error Error vs Time for BN_71, num-vars= 1740


ao-graph-ISor-wc-IS


(a)

0.01

0.1

1

0 500 1000 1500 2000 2500 3000 3500 4000

Log R

elat

ive

Err

or

Time in seconds

Log Relative error Error vs Time for BN_76, num-vars= 2155


ao-graph-ISor-wc-IS


(b)


sampling and their AND/OR tree and graph generalizations for two sample Linkage in-

stances from the UAI 2006 evaluation.

274



∆ ∆ ∆ ∆ ∆ ∆BN 69 〈777, 7, 78, 47, 59〉 5.28E-54 2.26E-02 2.46E-02 2.43E-02 2.42E-02 2.34E-02 4.22E-03

BN 70 〈2315, 5, 159, 87, 98〉 2.00E-71 6.32E-02 7.25E-02 5.12E-02 8.18E-02 5.36E-02 2.62E-02

BN 71 〈1740, 6, 202, 70, 139〉 5.12E-111 6.74E-02 5.51E-02 2.35E-02 8.58E-02 9.46E-03 1.21E-02

BN 72 〈2155, 6, 252, 86, 88〉 4.21E-150 3.19E-02 4.61E-02 2.46E-03 6.12E-02 1.41E-03 2.63E-03

BN 73 〈2140, 5, 216, 101, 149〉 2.26E-113 1.18E-01 1.12E-01 4.55E-02 1.58E-01 3.54E-02 3.95E-02

BN 74 〈749, 6, 66, 45, 72〉 3.75E-45 5.34E-02 4.31E-02 2.87E-02 8.08E-02 2.83E-02 2.76E-02

BN 75 〈1820, 5, 155, 92, 131〉 5.88E-91 4.47E-02 8.15E-02 4.73E-02 7.28E-02 4.20E-02 7.60E-03

BN 76 〈2155, 7, 169, 64, 239〉 4.93E-110 1.07E-01 1.39E-01 6.95E-02 1.13E-01 5.03E-02 2.26E-02

BN 77 〈1020, 9, 135, 22, 97〉 6.88E-79 1.06E-01 9.38E-02 8.26E-02 1.24E-01 6.75E-02 3.27E-02

Table 6.2: Table showing the log relative error ∆ of importance sampling, w-cutset impor-

tance sampling and their AND/OR tree and graph variants for the Linkage instances from

UAI 2006 evaluation after 1 hour of CPU time.

6.7.4 Results on Linkage networks

The Linkage instances that we experimented with in Chapter 3 are generated by converting

biological linkage analysis data into a Bayesian or Markov network. These networks have

between 777-2315 nodes with an average domain size of 9 or less. The results of our ex-

periments with these networks are shown in Table 6.2 and anytime performances for two

sample linkage instances are shown in Figure 6.15. We observe that on the 6 out of the

9 instances, ao-wc-graph-IS is more accurate than ao-wc-tree-IS which in turn is substan-

tially more accurate than or-wc-tree-IS. Comparing schemes which do not employ w-cutset

sampling, ao-graph-IS is substantially better than ao-tree-IS and or-tree-IS schemes. Thus,

we see that exploiting more problem decomposition using AND/OR graph sampling pays

off more when used with conventional importance sampling than with w-cutset importance

sampling. However, the scheme that utilizes the most decomposition - AND/OR w-cutset

graph importance sampling consistently outperforms other schemes and whenever it is the

second or the third best, the difference between it and the best scheme is very minor (for

example see time versus log relative error plots for two sample Linkage networks given in

Figure 6.15).

275

0.0001

0.001

0.01

0.1

1

0 500 1000 1500 2000 2500 3000 3500 4000

Log R

elat

ive

Err

or

Time in seconds

Log Relative error Error vs Time for pedigree19, num-vars= 793




(a)

1e-05

0.0001

0.001

0.01

0.1

0 500 1000 1500 2000 2500 3000 3500 4000

Log R

elat

ive

Err

or

Time in seconds

Log Relative error Error vs Time for pedigree39, num-vars= 1272




(b)


sampling and their AND/OR tree and graph generalizations for two sample Linkage in-

stances from the UAI 2008 evaluation.

276

Problem 〈n, k,E, t∗, w〉 Exact or- ao- ao- or-wc- ao-wc- ao-wc-


∆ ∆ ∆ ∆ ∆ ∆pedigree18 〈1184, 1, 0, 26, 72〉 4.19E-79 3.17E-02 3.44E-02 3.20E-03 4.30E-02 3.49E-04 3.02E-04

pedigree19 〈793, 2, 0, 23, 102〉 1.59E-60 1.32E-01 1.28E-01 5.41E-02 8.92E-02 1.79E-03 2.97E-03








pedigree31 〈1183, 2, 0, 45, 118〉 1.09E-01 1.31E-01 4.15E-02 8.34E-02 0.00E+00 2.93E-04

pedigree34 〈1160, 1, 0, 59, 104〉 2.12E-01 1.47E-01 8.37E-02 8.09E-02 4.83E-04 0.00E+00

pedigree13 〈1077, 1, 0, 51, 98〉 3.93E-01 3.93E-01 5.66E-02 9.11E-02 1.51E-04 0.00E+00

pedigree41 〈1062, 2, 0, 52, 95〉 1.12E-01 5.06E-02 8.23E-04 5.04E-02 0.00E+00 3.15E-04

pedigree44 〈811, 1, 0, 29, 64〉 3.16E-02 3.08E-02 2.27E-03 1.90E-02 0.00E+00 4.63E-06

pedigree51 〈1152, 1, 0, 51, 106〉 9.22E-02 6.39E-02 2.26E-02 4.31E-02 9.35E-05 0.00E+00

pedigree7 〈1068, 1, 0, 56, 90〉 7.86E-02 9.98E-02 2.31E-02 4.61E-02 4.38E-04 0.00E+00

pedigree9 〈1118, 2, 0, 41, 80〉 3.29E-02 3.19E-02 0.00E+00 8.25E-02 9.74E-03 1.01E-02

Table 6.3: Table showing the log-relative error ∆ of importance sampling, w-cutset im-

portance sampling and their AND/OR tree and graph variants for the Markov Linkage

instances from UAI 2008 evaluation after 1 hour of CPU time.

Markov Linkage networks

The Markov linkage instances were used in the UAI 2008 evaluation. These are linkage

Bayesian networks in which evidence is instantiated yielding an un-normalized Bayesian

network. These networks are quite hard and only a handful of problems were solved exactly

in the UAI 2008 evaluation. As pointed earlier, because the exact value of probability of

evidence is not available on some instances, we compare lower bounds and use the highest

lower bound as a substitute for Z in Equation 5.29 for computing the log relative error.

Table 6.3 shows the log relative error of various schemes after 1 hour of CPU time. In Fig-

ure 6.16, we show time versus log relative error plots for two sample instances. We observe

that ao-wc-graph-IS, ao-wc-tree-IS and ao-graph-IS schemes have lower log relative error

than other schemes, often outperforming the competition by an order of magnitude. Of the

three schemes that do not use w-cutset, we observe that ao-tree-IS is better than or-tree-

IS on most instances while or-wc-tree-IS is usually better than ao-tree-IS and or-tree-IS.

277



∆ ∆ ∆ ∆ ∆ ∆4-coloring1 〈400, 2, 0, 71, 309〉 3.82E-03 4.05E-03 4.51E-03 6.00E-03 2.35E-03 0.00E+00

4-coloring2 〈400, 2, 0, 95, 315〉 1.23E-02 9.54E-03 7.64E-03 3.38E-02 3.63E-02 0.00E+00

4-coloring3 〈800, 2, 0, 144, 617〉 2.86E-03 4.58E-03 2.32E-03 2.41E-02 2.38E-02 0.00E+00

4-coloring4 〈800, 2, 0, 191, 620〉 2.13E-02 5.06E-03 2.19E-02 1.79E-02 4.69E-03 0.00E+00

4-coloring5 〈1200, 2, 0, 304, 925〉 2.98E-02 2.81E-02 5.85E-02 5.70E-02 3.89E-02 0.00E+00

4-coloring6 〈1200, 2, 0, 338, 929〉 3.43E-02 2.72E-02 2.63E-03 3.17E-03 2.09E-03 0.00E+00

Table 6.4: Table showing the log-relative error ∆ of importance sampling, w-cutset impor-

tance sampling and their AND/OR tree and graph variants on Joseph Culberson’s flat graph

coloring instances after 1 hour of CPU time.

ao-wc-graph-IS yields the best approximation.

6.7.5 Results on 4-Coloring Problems

Our final domain is that of 4-coloring problems generated using Joseph Culberson’s flat

graph coloring generator 3. Here, we are interested in counting the number of solutions of

the graph coloring instance. Table 6.4 shows the log relative error of the estimate output by

various schemes after 1 hour of CPU time. As described earlier, because exact results are

not known, we compare lower bounds. We observe that AND/OR tree and graph sampling

schemes yield estimates having lower log relative error than OR tree sampling schemes.

Among the AND/OR schemes, AND/OR w-cutset graph sampling (ao-wc-graph-IS) yields

the best lower bound on all instances.

3Available at http://www.cs.ualberta.ca/∼joe/Coloring/

278

6.8 Discussion and Related work

6.8.1 Relation to other graph-based variance reduction schemes

The work presented here is related to the work by [66, 75, 27] who perform sampling based

inference on a junction tree. The main idea in these papers is to perform message passing

on a junction tree by substituting messages which are too hard to compute exactly by their

sampling-based approximations. Kjærulff [75] and Dawid [27] use Gibbs sampling while

Hernandez et al. [66] use importance sampling to approximate the messages. Similar to

some recent works on Rao-Blackwellised sampling such as [7, 103, 53], variance reduction

is achieved in these junction tree based sampling schemes because of some exact computa-

tions, as dictated by the Rao-Blackwell theorem. AND/OR estimation, however, does not

require exact computations to achieve variance reduction. In fact, as we show, variance

reduction due to Rao-Blackwellisation is orthogonal to the variance reduction achieved by

AND/OR estimation and therefore the two could be combined to achieve more variance

reduction. Also, unlike our work which focuses on probability of evidence or the weighted

counting problem, the focus of these aforementioned papers was on the belief updating

task.

6.8.2 Hoeffding’s U -statistics

AND/OR-estimates are also closely related to cross match estimates [78] which are based

on Hoeffding’s U -statistics. To derive cross-match estimates, the original function over a

set of variables is divided into several marginal functions which are defined only on a subset

of variables. Then, each marginal function is sampled independently and the cross-match

sample mean is derived by considering all possible combinations of the samples. For exam-

ple, if there are k marginal functions and m samples are taken over each function, the cross

279

match sample mean is computed over mk combinations. It was shown in [78] that the cross

match sample mean has lower variance than the conventional sample mean, similar to our

work. The only caveat in cross match estimates is that it requires exponentially more time

O(mk) to compute the estimates as compared to O(m) for conventional estimates; making

their direct application infeasible for large values of k. So the authors suggest resampling

from the possible O(mk) samples with the hope that the estimates based on the resampled

samples would have lower variance than the conventional one. Unlike cross match esti-

mates, the most complex AND/OR estimates are only w∗ times more expensive time wise,

where w∗ is the treewidth, as compared to the conventional estimates, and therefore do not

require the extra resampling step.

6.8.3 Problem with large sample sizes

Given that the space complexity of the AND/OR graph based schemes isO(nN), the reader

may think that as more samples are drawn our algorithms would run out of memory. One

can, however, perform multi-stage (adaptive) sampling to circumvent this problem. Here,

at each stage we stop storing samples when a pre-specified memory limit is reached. Then

AND/OR sample graph mean is computed from the stored samples and the samples are

thrown away, repeating the process until the stipulated time bound expires or enough sam-

ples are drawn. The final sample mean is then simply the average of sample means com-

puted at each stage. By linearity of expectation, the final sample mean is unbiased and

obviously would have lower variance than the conventional sample mean.

280

6.9 Conclusion

The primary contribution of this chapter is in viewing importance sampling based estima-

tion in the context of AND/OR search spaces rather than the usual OR search spaces for

graphical models [37]. Specifically, we viewed sampling as a partial exploration of the

full AND/OR search space yielding an AND/OR sample tree. We defined a new AND/OR

sample mean on the samples organized on the AND/OR sample tree and showed that it is

an unbiased estimate of the weighted counts.

We showed that the AND/OR sample tree mean has lower variance than the conventional

OR sample tree mean because it yields more virtual samples by taking advantage of the

conditional independencies in the graphical model. Furthermore, because the AND/OR

sample tree mean has the same time complexity and only slightly more space overhead

than the OR sample tree mean; it should always be preferred.

The AND/OR sample tree was extended into a graph by merging identical subtrees, which

is analogous to extending AND/OR tree search to AND/OR graph search [37]. We showed

that the AND/OR sample graph yields more virtual samples than the AND/OR sample tree

and therefore reduces variance even further. However, computing the AND/OR sample

graph mean requires more time and space and thus there is a tradeoff.

Subsequently, we showed that the AND/OR sampling scheme can be combined with the w-

cutset sampling scheme [7] to reduce variance even further. The combination of AND/OR

and w-cutset sampling yields a family of sample means which trade variance with time and

space as summarized in Figures 6.10 and 6.11.

We focused our empirical investigation on the task of computing probability of evidence

or partition function in a Bayesian and Markov network respectively. The main aim of our

evaluation was to compare the impact of exploiting varying levels of graph decompositions

281

via (a) OR tree (b) AND/OR tree, (c) AND/OR graph and (d) their w-cutset generalizations

on the accuracy of the estimates. Our results show conclusively that (a) AND/OR tree sam-

pling usually yields better estimates than (conventional) OR tree sampling, (b) AND/OR

graph sampling is superior to AND/OR tree sampling and (c) w-cutset sampling (OR and

AND/OR) is superior to non w-cutset sampling.

282

Chapter 7

Conclusions

In this chapter, we conclude the dissertation with a survey of our principal contributions

and outline several promising avenues for further research.

Our research addresses the problems associated with performing sampling based approx-

imate inference over mixed probabilistic and deterministic networks. In particular, be-

cause of determinism, importance sampling algorithms suffer from the rejection problem

in that samples with zero weight are generated with probability arbitrarily close to one

while Markov Chain Monte Carlo techniques do not converge at all yielding very poor

performance.

7.1 Contributions

The first contribution of this dissertation is a scheme called IJGP-sampling which uses the

output of a generalized belief propagation scheme called Iterative Join Graph Propagation

(IJGP) [33] to construct a proposal distribution for importance sampling and uses the partial

directional i-consistency power of IJGP to reduce rejection. We applied this algorithm to

283

perform inference in a dynamic mixed network that modeled travel routines of individuals.

The second contribution is the SampleSearch scheme which combines systematic back-

tracking search with random sampling and completely eliminates the rejection problem.

We showed that the bias introduced by mixing search with sampling can be characterized

using the backtrack-free distribution and thus the samples generated by SampleSearch can

be used for performing approximate inference with desirable statistical guarantees such as

unbiasedness and asymptotic unbiasedness. We also showed that SampleSearch can be

viewed as a systematic search technique whose value selection is stochastically guided by

sampling from a distribution. This view enabled us to utilize state-of-the-art systematic

SAT/CSP solvers such as minisat [125] for performing search within SampleSearch. Thus,

advances in the systematic search community whose primary focus is solving “yes/no” type

NP-complete problems can be leveraged through SampleSearch for approximating much

harder #P-complete problems in Bayesian inference. We performed an extensive empirical

evaluation on several benchmark graphical models and our results clearly demonstrate that

SampleSearch is consistently superior to other state-of-the-art schemes on domains having

a substantial amount of determinism.

The third contribution is two schemes of SampleSearch-MH and SampleSearch-SIR for

sampling solutions from a uniform distribution over the solutions of a Boolean satisfia-

bility problem. This solution sampling task is motivated by applications in fields such as

functional/software verification [134, 31] and first order probabilistic models [113, 95].

SampleSearch-MH and SampleSearch-SIR combine SampleSearch with statistical tech-

niques of Metropolis Hastings (MH) and Sampling / Importance Resampling (SIR) respec-

tively and guarantee that in the limit of infinite samples the distribution over the generated

solution samples is the uniform distribution over the solutions. Such convergence guaran-

tees are not available for state-of-the-art schemes such as SampleSat [130] and XorSample

[63]. Via a thorough empirical evaluation, we showed conclusively that our new schemes

284

are superior to both SampleSat and XorSample in terms of accuracy as a function of time.

Our fourth contribution is a randomized approximation algorithm called Markov-LB for

computing high confidence lower bounds on the weighted counting tasks such as proba-

bility of evidence and the partition function. Markov-LB is based on importance sampling

and the Markov inequality. We argued that Markov inequality is quite weak (a fact well

known in the statistics literature) because it is based on a single sample. To mitigate this

deficiency, we proposed to extend the Markov inequality to multiple samples yielding three

new schemes: one based on computing the sample average of the generated samples and

two based on the martingale theory [10] which utilize the maximum statistics from the

generated samples. We showed via a large-scale experimental evaluation that Markov-LB

applied to IJGP-sampling and SampleSearch schemes presented in this thesis outperforms

state-of-the-art schemes such as bound propagation [6], SampleCount [62], Relsat [115]

and variable elimination and conditioning (VEC) and provides good quality high confi-

dence lower bounds on most instances.

Our final contribution is a new sampling framework called AND/OR importance sampling

which generalizes conventional OR space importance sampling which is impervious to

problem decomposition to the AND/OR space [37] which is sensitive to problem decom-

position. Our generalization yields a family of estimators which are based on the same

set of samples and which trade variance (and therefore accuracy) with time and space by

utilizing different levels of problem decomposition. At one end is the AND/OR tree es-

timator which has smaller variance than the conventional OR tree estimator and has the

same time complexity. At the other end is the AND/OR graph estimator which has even

lower variance than the AND/OR tree estimator but has higher time and space complexity.

Subsequently, we showed that the AND/OR sampling scheme can be combined with an-

other sampling technique which utilizes graph decomposition called w-cutset sampling [7]

yielding further variance reduction. Via an extensive experimental evaluation, we showed

285

that exploiting decomposition aggressively by combining AND/OR graph sampling and w-

cutset sampling is well worth the extra computational complexity in that the combination

usually yields more accurate estimates.

7.2 Directions for Future Research

Our investigation leaves plenty of room for additional improvements that can be pursued

in the future.

Incremental SAT solving

To generate samples using SampleSearch, we solve the same SAT/CSP problem multiple

times. Therefore, various goods and no-goods (i.e., knowledge about the problem space)

discovered while generating one sample can be used to speed-up the search for a solution

while generating the next sample. How to achieve this in a principled and structured way

is an important theoretical and practical question. Some initial related research on solving

a series of similar SAT problems which differ only slightly from one another has appeared

in the bounded model checking community [43] and can be applied to improve Sample-

Search’s performance.

Adaptive Importance Sampling

A second line of future research is to use adaptive importance sampling [17, 101, 133, 97]

within SampleSearch and AND/OR importance sampling. In adaptive importance sam-

pling, one updates the proposal distribution based on the generated samples; so that with

every update the proposal gets closer and closer to the desired posterior distribution. Be-

286

cause we already store the DFS traces of the generated samples in SampleSearch, one could

use them to dynamically update and learn the proposal distribution. Also, because the opti-

mal proposal distribution decomposes according to an AND/OR tree or graph, we can even

utilize graph decompositions to facilitate learning.

Exploiting independencies uncovered while sampling

The AND/OR sampling framework utilizes decomposition or conditional independencies

uncovered by the primal graph associated with a probabilistic model. It is well known that

the graph captures only a fraction of the actual conditional independencies and therefore

AND/OR sampling utilizes decomposition in a sub-optimal manner. New conditional inde-

pendencies could be discovered while sampling through the AND/OR space; which could

then be utilized to get better estimates as dictated by the AND/OR sampling theory. How

to utilize these new independencies on the fly as well as how to guide sampling to look for

such independencies is still an open question.

Approximate Compilation

When knowledge is represented as a graphical model, the representation requires only poly-

nomial space but exact inference is time (and/or space) exponential. Compilation turns this

problem upside down in that knowledge is represented by using as much space as possible

so that it takes only polynomial time for typical inference tasks. Examples of compila-

tion frameworks are Ordered Binary Decision Diagrams (OBDDs) [11], Decomposable

Normal Negation form (DNNF) [25] and its variants and AND/OR multi-valued decision

diagrams [93]. The problem with these exact compilation frameworks is that their space

requirements are excessive and impractical. What we actually need is an approximate

compilation framework which provides good enough answers based on the application.

287

SampleSearch and AND/OR sampling can be easily modified to approximately compile a

graphical model. Obviously, some research issues that need to be addressed are whether we

can provide guarantees on the quality and error of the estimates derived from the compiled

structure.

288

Bibliography

[1] D. Allen and A. Darwiche. Rc link: Genetic linkage analysis using Bayesian net-

works. International Journal of Approximate Reasoning, 48(2):499–525, 2008.

[2] S. Arnborg and A. Proskourowski. Linear time algorithms for NP-hard problems

restricted to partial k-trees. Discrete and Applied Mathematics, 23:11–24, 1989.

[3] R. Bayardo and D. P. Miranker. On the space-time trade-off in solving constraint

satisfaction problems. In Fourteenth International Joint Conference on Artificial

Intelligence (IJCAI), pages 558–562, 1995.

[4] J. Bergeron. Writing Testbenches: Functional Verification of HDL Models. Kluwer

Academic Publishers, 2000.

[5] B. Bidyuk and R. Dechter. An anytime scheme for bounding posterior beliefs. In

Proceedings of the Twenty-First AAAI Conference on Artificial Intelligence, pages

1095–1100, 2006.

[6] B. Bidyuk and R. Dechter. Improving bound propagation. In Proceedings of the

17th European Conference on Artificial Intelligence (ECAI), pages 342–346, 2006.

[7] B. Bidyuk and R. Dechter. Cutset sampling for Bayesian networks. Journal of

Artificial Intelligence Research (JAIR), 28:1–48, 2007.

[8] J. Bilmes and R. Dechter. Evaluation of Probabilistic Inference Systems of UAI’06.

Available online at http://ssli.ee.washington.edu/ bilmes/uai06InferenceEvaluation/,

2006.

[9] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In

Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence

(UAI), pages 33–42, 1998.

[10] L. Breiman. Probability. Addison-Wesley, 1968.

[11] R. E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE

Transancations on Computers, 35(8):677–691, 1986.

[12] G. Casella and C. P. Robert. Rao-Blackwellisation of sampling schemes. Biometrika,

83(1):81–94, 1996.

289

[13] M. Chavira and A. Darwiche. On probabilistic inference by weighted model count-

ing. Artificial Intelligence, 172(6–7):772–799, April 2008.

[14] M. D. Chavira, A. Darwiche, and M. Jaeger. Compiling relational Bayesian networks

for exact inference. International Journal of Approximate Reasoning, 42(1-2):4–20,

2006.

[15] J. Cheng. Sampling algorithms for estimating the mean of bounded random vari-

ables. In Computational Statistics, pages 1–23. Volume 16(1), 2001.

[16] J. Cheng and M. J. Druzdzel. AIS-BN: An adaptive importance sampling algorithm

for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence

Research (JAIR), 13:155–188, 2000.

[17] J.-F. Cheng. Iterative Decoding. PhD thesis, California Institute of Technology

(Electrical Engineering), 1997.

[18] A. Choi and A. Darwiche. An edge deletion semantics for belief propagation and

its practical impact on approximation quality. In Proceedings of The Twenty-First

National Conference on Artificial Intelligence (AAAI), pages 1107–1114, 2006.

[19] M. C. Cooper. An optimal k-consistency algorithm. Artificial Intelligence, 41(1):89–

95, 1990.

[20] R. G. Cowell, P. A. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic

Networks and Expert Systems (Information Science and Statistics). Springer, New

York, May 2003.

[21] P. Dagum and M. Luby. Approximating probabilistic inference in Bayesian belief

networks is NP-hard. Artificial Intelligence, 60(1):141–153, 1993.

[22] P. Dagum and M. Luby. An optimal approximation algorithm for Bayesian inference.

Artificial Intelligence, 93:1–27, 1997.

[23] A. Darwiche. Recursive conditioning. Artificial Intelligence, 126(1-2):5–41, 2001.

[24] A. Darwiche, R. Dechter, A. Choi, V. Gogate, and L. Otten. Results

from the Probablistic Inference Evaluation of UAI’08. Available online at:

http://graphmod.ics.uci.edu/uai08/Evaluation/Report, 2008.

[25] A. Darwiche and P. Marquis. A knowledge compilation map. Journal of Artificial

Intelligence Research (JAIR), 17:229–264, 2002.

[26] M. Davis, G. Logemann, and D. Loveland. A machine program for theorem proving.

Communications of the ACM, 5:394–397, 1962.

[27] A. P. Dawid, U. Kjaerulff, and S. L. Lauritzen. Hybrid propagation in junction trees.

In Advances in Intelligent Computing (IPMU), pages 85–97, 1994.

290

[28] R. Dechter. Bucket elimination: A unifying framework for reasoning. Artificial

Intelligence, 113:41–85, 1999.

[29] R. Dechter. Constraint Processing. Morgan Kaufmann, 2003.

[30] R. Dechter, V. Gogate, L. Otten, R. Marinescu, and R. Mateescu. Graphical model

algorithms at UC Irvine. website: http://graphmod.ics.uci.edu/group/Software,

2009.

[31] R. Dechter, K. Kask, E. Bin, and R. Emek. Generating random solutions for con-

straint satisfaction problems. In Proceedings of the Eighteenth National Conference

on Artificial Intelligence (AAAI), pages 15–21, 2002.

[32] R. Dechter, K. Kask, and J. Larrosa. A general scheme for multiple lower-bound

computation in constraint optimization. Seventh International Conference on Prin-

ciples and Practice of Constraint Programming (CP), pages 346–360, 2001.

[33] R. Dechter, K. Kask, and R. Mateescu. Iterative join graph propagation. In Proceed-

ings of the 18th Conference in Uncertainty in Artificial Intelligence (UAI), pages

128–136. Morgan Kaufmann, August 2002.

[34] R. Dechter and D. Larkin. Hybrid processing of beliefs and constraints. In Proceed-

ings of the 17th Conference in Uncertainty in Artificial Intelligence (UAI), pages

112–119, 2001.

[35] R. Dechter and R. Mateescu. A simple insight into iterative belief propagation’s

success. Proceedings of the 19th Conference in Uncertainty in Artificial Intelligence

(UAI), pages 175–183, 2003.

[36] R. Dechter and R. Mateescu. Mixtures of deterministic-probabilistic networks and

their and/or search space. In Proceedings of the 20th Annual Conference on Uncer-

tainty in Artificial Intelligence (UAI), pages 120–129, 2004.

[37] R. Dechter and R. Mateescu. AND/OR search spaces for graphical models. Artificial

Intelligence, 171(2-3):73–106, 2007.

[38] R. Dechter and J. Pearl. Tree clustering for constraint networks. Artificial Intelli-

gence, pages 353–366, 1989.

[39] R. Dechter and I. Rish. Mini-buckets: A general scheme for bounded inference.

Journal of ACM, 50(2):107–153, 2003.

[40] A. Doucet, N. de Freitas, K. P. Murphy, and S. J. Russell. Rao-blackwellised particle

filtering for dynamic Bayesian networks. In Proceedings of the 16th Conference on

Uncertainty in Artificial Intelligence (UAI), pages 176–183, 2000.

[41] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling meth-

ods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.

291

[42] P. A. Dow and R. E. Korf. Best-first search for treewidth. In Proceedings of the

Twenty-Second AAAI Conference on Artificial Intelligence, pages 1146–1151, 2007.

[43] N. Een and N. Sorensson. Temporal induction by incremental SAT solving. Elec-

tronic Notes in Theoretical Computer Science, 89(4):543–560, 2003.

[44] S. Even. Graph algorithms. In Computer Science Press, 1979.

[45] P. W. Fieguth, W. C. Karl, and A. S. Willsky. Efficient multiresolution counter-

parts to variational methods for surface reconstruction. Computer Vision and Image

Understanding, 70(2):157–176, 1998.

[46] M. Fishelson and D. Geiger. Optimizing exact genetic linkage computations. In

Proceedings of the seventh annual international conference on Research in compu-

tational molecular biology (RECOMB), pages 114–121, 2003.

[47] N. D. Freitas. Rao-blackwellised particle filtering for fault diagnosis. In In IEEE

Aerospace Conference, pages 1767–1772, 2001.

[48] R. M. Fung and K.-C. Chang. Weighing and integrating evidence for stochastic

simulation in Bayesian networks. In Proceedings of the Fifth Annual Conference on


[49] A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis. Chapman &

Hall/CRC, 2nd edition, June 1995.

[50] J. Geweke. Bayesian inference in econometric models using Monte Carlo integra-

tion. Econometrica, 57(6):1317–39, 1989.

[51] V. Gogate, B. Bidyuk, and R. Dechter. Studies in lower bounding probability of

evidence using the Markov inequality. In Proceedings of 23rd Conference on Un-

certainty in Artificial Intelligence (UAI), pages 141–148, 2007.

[52] V. Gogate and R. Dechter. A complete anytime algorithm for treewidth. In Pro-

ceedings of 20th Conference on Uncertainty in Artificial Intelligence (UAI), pages

201–208, 2004.

[53] V. Gogate and R. Dechter. Approximate inference algorithms for hybrid Bayesian

networks with discrete constraints. In Proceedings of the 21st Annual Conference

on Uncertainty in Artificial Intelligence (UAI), pages 209–216, 2005.

[54] V. Gogate and R. Dechter. A new algorithm for sampling CSP solutions uniformly at

random. In Proceedings of 12th International Conference on Principles and Prac-

tices of Constraint Programming (CP), pages 711–715, 2006.

[55] V. Gogate and R. Dechter. Approximate counting by sampling the backtrack-free

search space. In Proceedings of 22nd Conference on Artificial Intelligence (AAAI),

pages 198–203, 2007.

292

[56] V. Gogate and R. Dechter. Samplesearch: A scheme that searches for consistent

samples. Proceedings of the 11th Conference on Artificial Intelligence and Statistics

(AISTATS), pages 147–154, 2007.

[57] V. Gogate and R. Dechter. AND/OR Importance Sampling. In In 23rd Conference

on Artificial Intelligence (AAAI), pages 212–219, 2008.

[58] V. Gogate and R. Dechter. Approximate solution sampling (and counting) on and/or

spaces. In Proceedings of 14th International Conference on Principles and Practice

of Constraint Programming (CP), pages 534–538, 2008.

[59] V. Gogate and R. Dechter. Studies in solution sampling. In Proceedings of the

Twenty-Third AAAI Conference on Artificial Intelligence (AAAI), pages 271–276,

2008.

[60] V. Gogate, R. Dechter, B. Bidyuk, C. Rindt, and J. Marca. Modeling transportation

routines using hybrid dynamic mixed networks. In Proceedings of the 21st Annual

Conference on Uncertainty in Artificial Intelligence (UAI), pages 217–224, 2005.

[61] C. Gomes and D. Shmoys. Completing quasigroups or latin squares: A structured

graph coloring problem. In Proceedings of the Computational Symposium on Graph

Coloring and Extensions, 2002.

[62] C. P. Gomes, J. Hoffmann, A. Sabharwal, and B. Selman. From sampling to model

counting. In Proceedings of the 20th International Joint Conference on Artificial

Intelligence (IJCAI), pages 2293–2299, 2007.

[63] C. P. Gomes, A. Sabharwal, and B. Selman. Near-uniform sampling of combinatorial

spaces using xor constraints. In Advances in Neural Information Processing Systems

(NIPS), pages 481–488, 2006.

[64] L. A. Goodman. On the exact variance of products. Journal of the American Statis-

tical Association, 55(292):708–713, 1960.

[65] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their

applications. Biometrika, 57(1):97–109, April 1970.

[66] L. D. Hernandez and S. Moral. Mixing exact and importance sampling propagation

algorithms in dependence graphs. International Journal of Approximate Reasoning,

12(8):553–576, 1995.

[67] T. Heskes and O. Zoeter. Expectation propagation for approximate inference in

dynamic Bayesian networks. In Proceedings of 18th Conference of Uncertainty in

Artificial Intelligence, UAI, pages 216–223, 2002.

[68] W. Hoeffding. Probability inequalities for sums of bounded random variables. Jour-

nal of the American Statistical Association, 58(301):13–30, 1963.

293

[69] H. Hoos and T. Stutzle. Satlib: An online resource for research on SAT. In H. van

Maaren I. P. Gent and T. Walsh, editors, SAT2000, pages 283–292. IOS Press, 2000.

[70] M. R. Jerrum, L. G. Valiant, and V. V. Vazirani. Random generation of combinatorial

structures from a uniform. Theoretical Computer Science, 43(2-3):169–188, 1986.

[71] H. M. Kaplan. A method of one-sided nonparametric inference for the mean of a

nonnegative population. The American Statistician, 41(2):157–158, 1987.

[72] K. Kask, R. Dechter, J. Larrosa, and A. Dechter. Unifying tree decompositions for

reasoning in graphical models. Artificial Intelligence, 166(1-2):165–193, August

2005.

[73] G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state

space models. Journal of Computational and Graphical Statistics, 5:1–25, 1996.

[74] U. Kjaerulff. Optimal decomposition of probabilistic networks by simulated anneal-

ing. Statistics and Computing, 2:7–17, 1992.

[75] U. Kjærulff. Hugs: Combining exact inference and Gibbs sampling in junction

trees. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence

(UAI), pages 368–375, 1995.

[76] T. Kloks. Treewidth: Computations and Approximations. Springer-Verlag New

York, Incorporated, 1994.

[77] G. Kokolakis and P. Nanopoulos. Bayesian multivariate micro-aggregation under

the hellinger distance criterion. Research in official statistics, 4:117–125, 2001.

[78] Kong, Augustine, Liu, Jun S., and Wong, Wing Hung. The properties of the cross-

match estimate and split sampling. The Annals of Statistics, 25(6):2410–2432, dec

1997.

[79] A. M. Koster, H. L. Bodlaender, and S. P. M. van Hoesel. Treewidth: Computational

experiments. Technical report, Universiteit Utrecht, 2001.

[80] D. Larkin and R. Dechter. Bayesian inference in the presence of determinism. In

Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS),

2003.

[81] S. Lauritzen. Propagation of probabilities, means, and variances in mixed graphical

association models. Journal of the American Statistical Association, 87(420):1098–

1108, 1992.

[82] M. Leisink and B. Kappen. Bound propagation. Journal of Artificial Intelligence

Research, 19:139–154, 2003.

[83] U. Lerner. Hybrid Bayesian Networks for Reasoning about complex systems. PhD

thesis, Stanford University, 2002.

294

[84] R. Levine and G. Casella. Implementations of the Monte Carlo EM Algorithm.

Journal of Computational and Graphical Statistics, 10:422–439, 2001.

[85] F. Li and P. Perona. A Bayesian hierarchical model for learning natural scene cat-

egories. In Computer Vision and Pattern Recognition (CVPR), volume 2, pages

524–531, 2005.

[86] L. Liao, D. J. Patterson, D. Fox, and H. Kautz. Learning and inferring transportation

routines. Artificial Intelligence, 171(5-6):311–331, 2007.

[87] Y. Lin and M. J. Druzdzel. Relevance-based incremental belief updating in Bayesian

networks. International Journal of Pattern Recognition and Artificial Intelligence

(IJPRAI), 13(2):285–295, 1999.

[88] J. Liu. Monte-Carlo strategies in scientific computing. Springer-Verlag, New York,

2001.

[89] Liu, Jun S., Liang, Faming, and Wong, Wing Hung. The multiple-try method and

local optimization in metropolis sampling. Journal of the American Statistical As-

sociation, 95(449):121–134, mar 2000.

[90] A. W. Marshall. The use of multi-stage sampling schemes in Monte Carlo computa-

tions. In Symposium on Monte Carlo Methods, pages 123–140, 1956.

[91] R. Mateescu and R. Dechter. And/or cutset conditioning. In Proceedings of the 19th

International Joint Conference on Artificial Intelligence (IJCAI), pages 230–235,

2005.

[92] R. Mateescu and R. Dechter. Mixed deterministic and probabilistic networks. An-

nals of Mathematics and Artificial Intelligence (AMAI); Special Issue: Probabilistic

Relational Learning (to appear), 2009.

[93] R. Mateescu, R. Dechter, and R. Marinescu. And/or multi-valued decision diagrams

(aomdds) for graphical models. Journal of Artificial Intelligence Research (JAIR),

33:465–519, 2008.

[94] B. Middleton, S. Michael Shwe M., D. Heckerman, D. Harold Lehmann M., and

G. Cooper. Probabilistic diagnosis using a reformulation of the internist-1/qmr

knowledge base. Medicine, 30:241–255, 1991.

[95] B. Milch, B. Marthi, S. J. Russell, D. Sontag, D. L. Ong, and A. Kolobov. Blog:

Probabilistic models with unknown objects. In Proceedings of the 19th International

Joint Conference on Artificial Intelligence (IJCAI), pages 1352–1359, 2005.

[96] J. E. Miller. Langford’s problem. http://www.lclark.edu/∼miller/langford.html,2006.

[97] S. Moral and A. Salmeron. Dynamic importance sampling in Bayesian net-

works based on probability trees. International Journal of Approximate Reasoning,

38(3):245–261, 2005.

295

[98] K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning.

PhD thesis, University of California Berkeley, 2002.

[99] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate

inference: An empirical study. In In Proceedings of the Fifteenth Conference on


[100] N. J. Nillson. Principles of Artificial Intelligence. Tioga, Palo Alto, Ca, 1980.

[101] L. Ortiz and L. Kaelbling. Adaptive importance sampling for estimation in struc-

tured domains. In In Proceedings of the 16th Annual Conference on Uncertainty in

Artificial Intelligence (UAI), pages 446–454, 2000.

[102] J. Ott. Analysis of Human Genetic Linkage. The Johns Hopkins University Press,

Baltimore, Maryland, 1999.

[103] M. A. Paskin. Sample propagation. In Advances in Neural Information Processing

Systems (NIPS), 2003.

[104] D. J. Patterson, L. Liao, D. Fox, and H. Kautz. Inferring high-level behavior from

low-level sensors. In Proceedings of UBICOMP 2003: The Fifth International Con-

ference on Ubiquitous Computing, pages 73–89, 2003.

[105] V. Pavlovic, J. M. Rehg, T.-J. Cham, and K. P. Murphy. A dynamic Bayesian network

approach to figure tracking using learned dynamic models. In The Proceedings of

the Seventh IEEE International Conference on Computer Vision, volume 1, pages

94–101, 1999.

[106] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

[107] S. Peeta and A. K. Zilaskopoulos. Foundations of dynamic traffic assignment: The

past, the present and the future. Networks and Spatial Economics, 1:233–265, 2001.

[108] K. Pipatsrisawat and A. Darwiche. Rsat 2.0: SAT solver description. Technical Re-

port D–153, Automated Reasoning Group, Computer Science Department, UCLA,

2007.

[109] H. Poon and P. Domingos. Sound and efficient inference with probabilistic and

deterministic dependencies. In The Twenty-First National Conference on Artificial

Intelligence (AAAI), pages 458–463, 2006.

[110] M. Pradhan, G. Provan, B. Middleton, and M. Henrion. Knowledge engineering for

large belief networks. In Proceedings of the 10th Canadian Conference on Artificial

Intelligence, pages 484–490, 1994.

[111] C. Raphael. A hybrid graphical model for rhythmic parsing. Artificial Intelligence,

137(1-2):217–238, May 2002.

[112] W. Resource. SAT competitions. http://www.satcompetition.org/.

296

[113] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-

2):107–136, 2006.

[114] T. Ritter. Latin squares: A literature survey. Available online at:

http://www.ciphersbyritter.com/RES/LATSQ.HTM, 2003.

[115] J. Roberto J. Bayardo and J. D. Pehoushek. Counting models using connected

components. In Proceedings of 17th National Conference on Artificial Intelligence

(AAAI), pages 157–162, 2000.

[116] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural

Computation, 11:305–345, 1997.

[117] D. B. Rubin. The calculation of posterior distributions by data augmentation: Com-

ment: A noniterative sampling/importance resampling alternative to the data aug-

mentation algorithm for creating a few imputations when fractions of missing infor-

mation are modest: The SIR Algorithm. Jornal of the American Statistical Associa-

tion, 82(398):543–546, 1987.

[118] R. Y. Rubinstein. Simulation and the Monte Carlo Method. John Wiley & Sons Inc.,

1981.

[119] T. Sang, P. Beame, and H. Kautz. Heuristics for fast exact model counting. In Eighth

International Conference on Theory and Applications of Satisfiability Testing (SAT),

pages 226–240, 2005.

[120] B. Selman, H. Kautz, and B. Cohen. Noise strategies for local search. In Proceedings

of the Eleventh National Conference on Artificial Intelligence, pages 337–343, 1994.

[121] R. D. Shachter and M. A. Peot. Simulation approaches to general probabilistic in-

ference on belief networks. In Proceedings of the Fifth Annual Conference on Un-

certainty in Artificial Intelligence (UAI), pages 221–234, 1990.

[122] H. D. Sherali, A. Narayanan, and R. Sivanandan. Estimation of origin-destination

trip-tables based on a partial set of traffic link volumes. Transportation Research,

Part B: Methodological, pages 815–836, 2003.

[123] L. Simon, D. L. Berre, and E. Hirsch. The SAT 2002 competition. Annals of Math-

ematics and Artificial Intelligence(AMAI), 43:307–342, 2005.

[124] O. Skare, E. Bolviken, and L. Holden. Improved sampling-importance resampling

and reduced bias importance sampling. Scandinavian Journal of Statistics, 30:719–

737, 2003.

[125] N. Sorensson and N. Een. Minisat v1.13-a SAT solver with conflict-clause mini-

mization. In SAT 2005 competition, 2005.

[126] L. G. Valiant. The complexity of enumeration and reliability problems. Siam Journal

of Computation, 8(3):105–117, 1987.

297

[127] L. G. Valiant and V. V. Vazirani. NP is as easy as detecting unique solutions. In

STOC ’85: Proceedings of the seventeenth annual ACM symposium on Theory of

computing, pages 458–463, New York, NY, USA, 1985.

[128] T. Walsh. SAT v CSP. In Proceedings of the 6th International Conference on Princi-

ples and Practice of Constraint Programming, pages 441–456, London, UK, 2000.

Springer-Verlag.

[129] T. Walsh. Permutation problems and channelling constraints. In Proceedings of

the 8th International Conference on Logic Programming and Automated Reasoning

(LPAR), pages 377–391, 2001.

[130] W. Wei, J. Erenrich, and B. Selman. Towards efficient sampling: Exploiting random

walk strategies. In Proceedings of the Nineteenth National Conference on Artificial

Intelligence, pages 670–676, 2004.

[131] W. Wei and B. Selman. A new approach to model counting. In Proceedings of Eighth

International Conference on Theory and Applications of Satisfiability Testing (SAT),

pages 324–339, 2005.

[132] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free energy approximations

and generalized belief propagation algorithms. IEEE Transactions on Information

Theory, 51:2282–2312, 2004.

[133] C. Yuan and M. J. Druzdzel. Importance sampling algorithms for Bayesian net-

works: Principles and performance. Mathematical and Computer Modelling, 43(9-

10):1189–1207, 2006.

[134] J. Yuan, K. Shultz, C. Pixley, H. Miller, and A. Aziz. Modeling design constraints

and biasing in simulation using BDDs. In Proceedings of the 1999 IEEE/ACM in-

ternational conference on Computer-aided design (ICCAD), pages 584–590, 1999.

298

UNIVERSITY OF CALIFORNIA, IRVINE DISSERTATION DOCTOR OF …dechter/publications/r165.pdf ·...

Documents

Transcript of UNIVERSITY OF CALIFORNIA, IRVINE DISSERTATION DOCTOR OF …dechter/publications/r165.pdf ·...