Multiscale Conditional Random Fields for Machine Vision
by
David Duvenaud
B. Sc. Hons., University of Manitoba, 2006
A THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
THE FACULTY OF GRADUATE STUDIES
(Computer Science)
The University Of British Columbia
(Vancouver)
July 2010
© David Duvenaud, 2010
Abstract
We develop a single joint model which can classify images and label super-pixels,
based on tree-structured conditional random fields (CRFs) derived from a hierar-
chical image segmentation, extending previous work by Reynolds and Murphy,
and Plath and Toussaint. We show how to train this model in a weakly-supervised
fashion, in which some of the images only have captions specifying which ob-
jects are present; this information is propagated down the tree and thus provides
weakly labeled data at the leaves, which can be used to improve the performance
of the super-pixel classifiers. After training, information can be propagated from
the super-pixels up to the root-level image classifier (although this does not seem
to help in practice compared to just using root-level features). We compare two
kinds of tree: the standard one with pairwise potentials, and one based on noisy-or
potentials, which better matches the semantics of the recursive partitioning used
to create the tree. However, we do not find any significant difference between the
two.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Image Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Multi-scale Approaches . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Previous Approaches to Multiscale CRFs . . . . . . . . . 3
1.3.2 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.3 Under-segmentation vs Over-segmentation . . . . . . . . 5
1.4 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Pascal VOC Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5.1 Image Features . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Supervised and Semi-Supervised Learning . . . . . . . . . . . . . 7
1.7 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Pairwise Trees and Learning . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Model Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Local Evidence Potentials . . . . . . . . . . . . . . . . . 10
2.1.2 Independent Model . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Pairwise Potentials . . . . . . . . . . . . . . . . . . . . . 11
2.1.4 Regularization of Image Feature Weights . . . . . . . . . 12
2.1.5 Regularization of Pairwise Potentials . . . . . . . . . . . 12
2.2 Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Computational Issues . . . . . . . . . . . . . . . . . . . . 14
2.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Cross-validation . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . 15
2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Noisy-Or Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1 Motivation for Noisy-Or Tree Models . . . . . . . . . . . . . . . 20
3.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 Computing Expected Complete Likelihood . . . . . . . . 21
3.4 Evidence Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5.1 Learning the Noisy-Or Failure Rate Parameter . . . . . . 26
3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1 Comparing Models . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1.1 Independent Model . . . . . . . . . . . . . . . . . . . . . 30
4.2 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Performance Comparison . . . . . . . . . . . . . . . . . . . . . . 31
4.3.1 Pixel-level Accuracy . . . . . . . . . . . . . . . . . . . . 31
4.3.2 Global-level Accuracy . . . . . . . . . . . . . . . . . . . 31
4.3.3 Remedies . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 Qualitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 Improving Performance . . . . . . . . . . . . . . . . . . . . . . . 37
4.6 Introducing an Oracle . . . . . . . . . . . . . . . . . . . . . . . . 39
5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . 41
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1.1 Structure-Adaptive Potentials . . . . . . . . . . . . . . . 41
5.1.2 Bounding Box Data . . . . . . . . . . . . . . . . . . . . 43
5.1.3 Combined Grid and Tree Structures . . . . . . . . . . . . 43
5.1.4 Joint Learning Over All Classes . . . . . . . . . . . . . . 44
5.1.5 Large-Scale Experiments . . . . . . . . . . . . . . . . . . 44
5.2 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . 44
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
List of Figures
Figure 1.1 An example of a tree-structured CRF built on an exact recur-
sive segmentation of an image. . . . . . . . . . . . . . . . . . 3
Figure 1.2 An example of a CRF defined for the presence of the sheep
class over multiple scales of a recursively segmented image. . 4
Figure 1.3 The maximum attainable accuracy given the segmentation used. 6
Figure 1.4 Example Multi-scale segmentations from the VOC 2008 dataset.
Rows one to four: Image segmentation at progressively finer
levels of detail. Bottom row: Pixel-level class labels. . . . . . 8
Figure 2.1 Pixel-level test accuracy on the tree model. White bars indicate
performance on fully-labeled data, black bars indicate perfor-
mance after additional semi-supervised training. . . . . . . . 16
Figure 2.2 Global-level test accuracy on the tree model. White bars in-
dicate performance on fully-labeled data, black bars indicate
performance after additional semi-supervised training. . . . . 17
Figure 2.3 Change in pixel-level test accuracy after training with partially
labeled data. . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Figure 2.4 Change in global-level test accuracy after training with par-
tially labeled data. . . . . . . . . . . . . . . . . . . . . . . . 18
Figure 2.5 A plot of the change in test error versus the pixel-level accu-
racy on the test set after supervised training, when the global
node was set to the true value. . . . . . . . . . . . . . . . . . 19
Figure 3.1 An example of belief propagation. Left: A situation in which
there is no local evidence for a class being present, except in
one leaf node. Middle: Marginals after BP in a pairwise tree.
Right: Marginals after BP in a noisy-or tree. . . . . . . . . . 22
Figure 3.2 Left: A situation in which there is local evidence in two adja-
cent leaf nodes. Middle: Marginals after BP in a pairwise tree.
Right: Marginals after BP in a noisy-or tree. . . . . . . . . . 23
Figure 3.3 Left: A situation in which there is strong evidence at the global
scale, and weak local evidence at one of the leaf nodes. Mid-
dle: Marginals after BP in a pairwise tree. Right: Marginals
after BP in a noisy-or tree. . . . . . . . . . . . . . . . . . . . 23
Figure 3.4 An example of belief propagation and evidence flow in a noisy-
or tree, trained on real data. Node size is proportional to prob-
ability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Figure 3.5 An example of belief propagation and evidence flow in the
pairwise tree model, trained on real data. Node size is pro-
portional to probability. . . . . . . . . . . . . . . . . . . . . 24
Figure 3.6 The segmentation of the image used in figures 3.5 and 3.4. . . 25
Figure 3.7 Pixel-level test accuracy on the noisy-or model. White bars
indicate performance on fully-labeled data, black bars indicate
performance after additional semi-supervised training. . . . . 27
Figure 3.8 Global-level test accuracy on the noisy-or model. White bars
indicate performance on fully-labeled data, black bars indicate
performance after additional semi-supervised training. . . . . 28
Figure 3.9 Change in test pixel-level accuracy after training with partially
labeled data. . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Figure 3.10 Change in test global-level test accuracy after training with
partially labeled data. . . . . . . . . . . . . . . . . . . . . . . 29
Figure 4.1 Pixel-level test accuracy on the three models, with and without
semi-supervised training. . . . . . . . . . . . . . . . . . . . 32
Figure 4.2 Global-level test accuracy on the three models, with and with-
out semi-supervised training. . . . . . . . . . . . . . . . . . . 33
Figure 4.3 Pixel-level test accuracy across all models. . . . . . . . . . . 34
Figure 4.4 Global-level test accuracy across all models. . . . . . . . . . . 34
Figure 4.5 Mean test cross entropy over all nodes across all models. Lower
is better. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Figure 4.6 Left: Histogram of the number of neighbours of each node
in the training set. Nodes with one neighbour are necessar-
ily leaves. Right: Histogram of the number of neighbours of
global-level nodes in the training set . . . . . . . . . . . . . . 35
Figure 4.7 Mean cross-entropy per node versus the number of neighbours
of a node on the training set. . . . . . . . . . . . . . . . . . . 36
Figure 4.8 Detecting a dog. Top left: Original image. Top center: Seg-
mentation at bottom level. Top right: True pixel labels. Bot-
tom left: Pixel probabilities for independent model. Bottom
center: Pixel probabilities for pairwise model. Bottom right:
Pixel probabilities for noisy-or model. . . . . . . . . . . . . . 38
Figure 4.9 Detecting a person. Top left: Original image. Top center: Seg-
mentation at bottom level. Top right: True pixel labels. Bot-
tom left: Pixel probabilities for independent model. Bottom
center: Pixel probabilities for pairwise model. Bottom right:
Pixel probabilities for noisy-or model. . . . . . . . . . . . . . 39
Figure 4.10 Pixel-level test accuracy across all models, including the case
where the global-level nodes were clamped to their true value. 40
Figure 5.1 Learned αg versus the number of neighbours. . . . . . . . . . 42
Acknowledgments
There are many people who made my time at UBC both instructive and fruitful.
First and foremost, my advisor Kevin Murphy, for his support and for his
patience in steering my ideas towards useful research. I'd like to thank Nando
de Freitas for his many helpful ideas. I'd like to profusely thank Benjamin Marlin
for sharing the recursive segmentation and feature extraction code, as well as for
being an example to me.
I’d like to thank Kevin Swersky and Bo Chen for raising the bar, Mark Schmidt
for showing me not to be afraid to start at square one, and Emtiyaz Khan for keep-
ing me going through it all.
Chapter 1
Introduction
1.1 Motivation
A central problem in learning object localization models is that pixel-labeled images are rare and costly to produce. In contrast, data that have only weak labeling information, such as captions, are relatively abundant. In the extreme case, one may wish to train a model on a massive set of images drawn from an automated search. This is essentially the data collector's dream: to generate a training dataset for a cat detector by simply performing a web image search for 'cat'.
Such datasets can easily be used to train an image classifier. However, in order
to train a model to perform object localization on such data, we require a unified
model of object presence at both the scale of the whole image as well as the scale
of individual super-pixels.
1.2 Image Models
For the tasks of image classification and object localization, we are interested in predicting whether a given image region contains an example of a given class c somewhere within it. For image classification or object detection, the region of interest is the entire image. For image segmentation or object localization, the regions of interest are on the scale of pixels or super-pixels.
We can cast both of these tasks as inference problems in the following way. We
define random variables Y1,Y2, ...,Yn denoting the presence or absence of class c in
image regions 1...n. We are then interested in computing P(Y1,Y2, ...,Yn|x) where x
are the image features. We may then assume some structure on the conditional
probability P(Y1,Y2, ...,Yn|x) in order to make learning and inference tractable.
Such structured conditional models are called Conditional Random Fields (CRFs).
In this work, we further develop the use of CRFs which are structured in
such a way as to combine evidence from multiple scales of segmentation. This
model structure allows us to perform semi-supervised learning to take advantage
of weakly labeled data as described above. There are two additional motivations
for this model structure. First, it can effectively incorporate evidence from other
image classifiers to perform better localization on unlabeled images (as shown in
Chapter 4). Second, it can combine evidence from multiple scales and image loca-
tions in a simple, consistent way, allowing one model to do both classification and
localization.
Figure 1.1 shows an example of a multi-scale CRF defined over multiple scales
of an image.
1.3 Multi-scale Approaches
Recent work on image classification and segmentation has shown that incorporating evidence from multiple scales is an effective strategy [14], [18], [19]. This can be explained by the fact that labels at different scales must agree with each other to some degree. Evidence about segments at one level of detail can help in classifying segments at neighbouring levels of detail. For example, knowing that there is a person in a small segment of the image means that there is also a person in any larger segment containing the first segment.

This observation motivates the idea of combining evidence at multiple scales of the image by performing segmentation at different scales, estimating a class's presence or absence separately for each segment, and then combining these local estimates through a conditional random field.
Figure 1.1: An example of a tree-structured CRF built on an exact recursive segmentation of an image.
1.3.1 Previous Approaches to Multiscale CRFs
The construction of multi-scale CRFs was previously demonstrated in [18] and
[19]. In these models, each image was segmented at multiple scales, and each
segment was assigned a label node designating whether the class of interest was
contained in that segment. A node at a given level of detail was connected to the
node at the next coarser level of detail according to the degree of overlap between
their corresponding image regions. This procedure results in a tree-structured CRF
with one image-level node representing the presence or absence of a class in the
image, as well as one node for each segment at the finest level.
In these works, the image segmentations at different levels of detail were constructed independently of one another. This has the advantage that each layer's segmentation can be computed separately, but the disadvantage that the segments at one level of detail may not significantly overlap with the segments at the next coarser level.

Figure 1.2: An example of a CRF defined for the presence of the sheep class over multiple scales of a recursively segmented image.
1.3.2 Our Approach
As opposed to constructing image segmentations at different scales independently,
we construct a recursive image segmentation. We define a multi-scale segmentation
to be recursive when each segment at a given level is contained in exactly one other
segment at the next coarser level of detail.
To produce the exact recursive segmentation used by our model, we first segment the image at the finest spatial scale using a Quick-shift-based super-pixel algorithm [27]. We then run a sparse affinity propagation clustering algorithm [6] to
cluster image regions. We only consider merging adjacent image regions at each
step. The similarity between regions is simply the L2 distance between their mean
colors. We perform several rounds of clustering, corresponding to increasingly
coarse levels of detail. We stop when there are 6 or fewer segments, which are
merged in a final step. Figure 1.4 shows some example segmentations using this
algorithm.
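The coarsening step described above can be sketched as a greedy agglomerative merge: at each round, repeatedly fuse the pair of adjacent regions whose mean colors are closest in L2 distance, recording which coarse region each fine region falls into. This is a simplified stand-in for the sparse affinity propagation clustering used in the thesis; `merge_level`, the uniform region weights, and the region representation are illustrative assumptions.

```python
import numpy as np

def merge_level(colors, edges, target):
    """One coarsening round (illustrative): greedily merge the adjacent pair
    of regions with the smallest L2 distance between mean colors until only
    `target` regions remain. Returns (coarse_colors, parent), where parent[i]
    is the index of the coarse region containing fine region i."""
    n = len(colors)
    colors = [np.asarray(c, dtype=float) for c in colors]
    weight = [1.0] * n                   # region sizes (uniform in this sketch)
    parent = list(range(n))              # current cluster id of each fine region
    alive = set(range(n))
    edges = {frozenset(e) for e in edges if len(set(e)) == 2}
    while len(alive) > target and edges:
        # pick the closest adjacent pair of live clusters
        i, j = sorted(min(edges, key=lambda e: np.linalg.norm(
            colors[min(e)] - colors[max(e)])))
        # merge cluster j into cluster i: size-weighted mean color
        w = weight[i] + weight[j]
        colors[i] = (weight[i] * colors[i] + weight[j] * colors[j]) / w
        weight[i] = w
        alive.discard(j)
        # rewire adjacency: edges touching j now touch i; self-loops vanish
        new_edges = set()
        for e in edges:
            e2 = frozenset(i if v == j else v for v in e)
            if len(e2) == 2:
                new_edges.add(e2)
        edges = new_edges
        parent = [i if p == j else p for p in parent]
    remap = {c: t for t, c in enumerate(sorted(alive))}
    return [colors[c] for c in sorted(alive)], [remap[p] for p in parent]
```

Calling this repeatedly, feeding each round's coarse regions back in, yields the recursive hierarchy: every fine region has exactly one parent at the next coarser level.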
As in previous work, we constructed the CRF by connecting nodes representing regions with maximal overlap. In our case, however, each region is completely contained in exactly one other region by design. We refer to the containing region as the parent region, and to the regions contained within it as its children.
One benefit of constructing a recursive image segmentation is that the resulting
label structure allows the use of factors joining parent and child nodes that have
stricter agreement constraints than a typical pairwise-factored CRF. In Chapter 3,
we explore the use of such “noisy-or” factors in tree-structured CRFs.
1.3.3 Under-segmentation vs Over-segmentation
Constructing this recursive segmentation raises the question: What level of detail
is appropriate for the finest level of segmentation?
The cost of over-segmentation is that our CRF will be unnecessarily large,
slowing down inference. However, inference in this model is linear in the number
of nodes in the CRF.
The cost of under-segmentation is that the bottom-layer segments will contain
pixels from classes other than the task of interest. This will put an upper bound on
the accuracy of our model’s pixel-level labels.
In the experiments performed in this thesis, we truncated the recursive segmen-
tation at four levels of recursion, leaving approximately 30 segments per image at
the finest level of detail. Figure 1.3 shows the maximum attainable VOC accuracy
(defined below) for this four-layer deep recursive segmentation is 50.2%. For com-
parison, the best-performing algorithm on the PASCAL VOC 2008 segmentation
challenge achieved a mean accuracy of 25.4%.
1.4 Performance Metrics
Performance is measured for both tasks by the accuracy a, defined in the VOC 2008 challenge as

a = \frac{tp}{tp + fp + fn} \qquad (4.1)
[Bar charts: maximum attainable percent accuracy for Mean, Background, Aeroplane, Bicycle, Bird, Boat, Bottle, Bus, Car, Cat, Chair, Cow, Dining Table, Dog, Horse, Motorbike, Person, Potted Plant, Sheep, Sofa, Train, and TV/Monitor.]
Figure 1.3: The maximum attainable accuracy given the segmentation used.
where tp, fp, and fn mean true positive, false positive, and false negative, re-
spectively [18]. True positive is the number of foreground image pixels that are
correctly predicted. False positive is the number of background pixels that are
incorrectly predicted. False negative is the number of foreground pixels that are in-
correctly predicted. Here, “foreground” refers to the parts of the image containing
the class of interest.
Note that this performance measure effectively ignores true negatives, which
generally comprise the majority of all predictions for this task. This measure is
useful, but is quite different from the objective function maximized by our learning
algorithms.
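The metric above can be computed directly from a predicted mask and the ground-truth labels. The encoding below (1 = foreground, 0 = background, -1 = "don't care") is an assumption for illustration, not the actual VOC mask format; the point is that don't-care pixels are excluded and true negatives never enter the score.

```python
import numpy as np

def voc_accuracy(pred, true):
    """VOC accuracy a = tp / (tp + fp + fn) for a single class.
    pred: boolean mask of predicted foreground pixels.
    true: mask with 1 = foreground, 0 = background, -1 = "don't care"
    (an assumed label encoding). Don't-care pixels are ignored entirely,
    and true negatives do not appear in the score."""
    pred = np.asarray(pred, dtype=bool)
    true = np.asarray(true)
    care = true != -1
    tp = np.sum(pred & (true == 1) & care)   # foreground correctly predicted
    fp = np.sum(pred & (true == 0) & care)   # background wrongly predicted
    fn = np.sum(~pred & (true == 1) & care)  # foreground missed
    denom = tp + fp + fn
    return tp / denom if denom else 0.0
```

Because true negatives are excluded, predicting "absent" everywhere scores zero whenever any foreground exists, which is one reason this measure diverges from the training objective discussed later.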
1.5 Pascal VOC Dataset
The pixel-labeled data used for these experiments were gathered from the training and validation sets of the PASCAL Visual Object Classes (VOC) Challenge 2008 dataset [4]. This dataset contains approximately 1000 images, each approximately 500×500 pixels, in which each pixel is either assigned to one of 20 classes, assigned to the "background" class, or labeled as "don't care". The predicted labels of pixels labeled "don't care" do not count towards the accuracy score.
1.5.1 Image Features
The image features used were: colour histograms (100 dimensions), histograms of oriented gradients (200 dimensions) [2], textons (500 dimensions) [30], and the 5×5 discretized location of the segment (25 dimensions). With a bias term added, each feature vector had 826 dimensions. However, the model is somewhat agnostic to the image features computed, and allows the use of different feature vectors at different levels in the segmentation hierarchy. For instance, one may want to use GIST [16] features at the highest spatial scale, as in [15]. However, exploratory experiments did not show a significant difference in performance when replacing the top-level features with a GIST vector.
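A sketch of assembling the 826-dimensional per-segment feature vector described above. The individual descriptors (colour histogram, HOG, textons, location) are assumed to be computed elsewhere; `segment_features` is a hypothetical helper showing only the concatenation and the bias term.

```python
import numpy as np

def segment_features(colour_hist, hog, textons, location, grid=5):
    """Concatenate the per-segment descriptors described above into one
    vector: colour histogram (100), HOG (200), textons (500), discretized
    grid location (grid*grid = 25), plus a trailing bias term, for a total
    of 826 dimensions."""
    parts = [np.asarray(colour_hist), np.asarray(hog),
             np.asarray(textons), np.asarray(location)]
    expected = [100, 200, 500, grid * grid]
    assert [p.size for p in parts] == expected, "unexpected descriptor sizes"
    return np.concatenate([p.ravel() for p in parts] + [np.ones(1)])  # bias last
```

Swapping the top level to GIST, as mentioned above, would only change which descriptors are concatenated at that layer; the model itself is indifferent.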
1.6 Supervised and Semi-Supervised Learning
We conducted two sets of experiments: In the first, models were trained using
pixel-labeled training data. Given an image that is fully labeled at the pixel level,
we can compute the labels for regions at all spatial scales. Thus, this phase of
learning is fully supervised.
In the second set of experiments, the pixel-labeled training data was augmented
with an equal amount of “caption-only” data, which was unlabeled at every level
of the hierarchy except at the global level. Data labeled in this way corresponds
to knowing only whether or not a certain type of object is found anywhere in the
image. Because all but one node per image is unlabeled, we call this regime “semi-
supervised” training. Here belief propagation is used to compute the expected
sufficient statistics of the unlabeled nodes conditioned on the global label, which
can then be used to train parameters at finer spatial scales.
Figure 1.4: Example multi-scale segmentations from the VOC 2008 dataset. Rows one to four: Image segmentation at progressively finer levels of detail. Bottom row: Pixel-level class labels.
1.7 Thesis Contributions
The main contribution of this thesis is to demonstrate a concrete way in which caption-only data can be used to train a localization model. In doing so, we develop several refinements to the class of multi-scale image CRFs:
• We develop the use of recursive image segmentation.
• Using the recursive segmentation, we introduce the noisy-or structured tree
model.
• In contrast to previous efforts [15], [18], [19], which first learn local image patch classifiers and then connect them with a CRF, we show how to learn the parameters of the local classifiers in concert, as a structured prediction problem.
Chapter 2
Pairwise Trees and Learning
In this chapter we precisely define a tree-structured conditional random field with
pairwise potentials. We give formulas for the likelihood, show how learning can
be done via the Expectation-Maximization algorithm, and show results on fully
observed and semi-supervised learning.
2.1 Model Semantics
The image segments at all levels of detail can be denoted by S^{(r)}, for some integer r. The model contains one label node Y_c^{(r)} for each class c and each element of the recursive image partition S^{(r)}. Setting Y_c^{(r)} = 1 is interpreted as meaning that the image region defined by segment S^{(r)} contains part of an object from class c, while setting Y_c^{(r)} = 0 is interpreted as meaning that the image region S^{(r)} does not contain part of an object from class c.
2.1.1 Local Evidence Potentials
The local evidence log-potential for node Y^{(r)} in this model depends linearly on the feature vector x^{(r)} for the region S^{(r)}. We define the local evidence log-potential in Equation 1.1, where W_c^l are the feature-to-label weights for class c and segmentation level l.
\phi_f\left(y_c^{(r)}, x^{(r)}\right) = y_c^{(r)} \left(x^{(r)}\right)^{\top} W_c^{l} \qquad (1.1)
Weights are shared across all nodes in a given level of detail l of the segmentation tree, within each object class. In the experiments below, the weight vectors W_c^2 and W_c^3 were also constrained to be equal.
2.1.2 Independent Model
As a baseline, we can consider an image model consisting solely of these local
evidence potentials. In this “independent” model, every region label is predicted
separately, and the model becomes equivalent to a per-region logistic regression
on each region’s image features. The likelihood of a node label assignment y is as
follows:
P(Y = y \mid x) = \frac{1}{Z} \prod_{l=1}^{L} \prod_{(i) \in N_l} \exp\left( \phi_f\left(y^{(i)}, x^{(i)}\right) \right) \qquad (1.2)
Here L is the number of layers in the tree, and (i) ∈ N_l denotes the nodes in layer l. The parameters of the independent model can be trained using only fully-labeled data.
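Since the local log-potential is linear in y and y is binary, normalizing Equation 1.2 independently for each region over y ∈ {0, 1} gives P(y = 1 | x) = σ(xᵀW), the logistic function of the feature score. A minimal sketch of the resulting per-region logistic regression:

```python
import numpy as np

def region_marginals(X, W):
    """Independent-model marginals for one class at one level. With the
    local log-potential phi_f(y, x) = y * (x @ W) and y in {0, 1},
    normalization yields P(y = 1 | x) = sigmoid(x @ W), i.e. a logistic
    regression applied to each region separately.
    X: (n_regions, d) feature matrix; W: (d,) weight vector."""
    return 1.0 / (1.0 + np.exp(-(X @ W)))
```

This baseline is also the first stage of training below: the weights W are initialized by fitting exactly this model on the fully-labeled images.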
2.1.3 Pairwise Potentials
In Equation 1.3 we define the pairwise potential between neighbouring nodes. The pairwise potentials depend only on a 2×2 table of parameters θ, indexed by the values taken by the nodes at each end of the potential.
\phi_{\text{pair}}\left(y_c^{(r_1)}, y_c^{(r_2)}\right) = \theta\left(y_c^{(r_1)}, y_c^{(r_2)}\right) \qquad (1.3)
In our experiments, three sets of pairwise parameters were learned: One set
for the potentials connecting global nodes to their children, one for the potentials
connecting nodes in the middle layers to their children, and a third for the potentials
connecting middle-layer nodes to bottom-layer nodes.
2.1.4 Regularization of Image Feature Weights
The weights of the local evidence log-potentials are learned with a simple L2 regularizer λ on the image feature weights W. To avoid choosing a separate hyperparameter for each weight group, we choose one regularization setting, kept fixed across the different weight groups. This raises the question of how to scale the regularization as the number of training examples changes.
The different weight groups have widely varying numbers of training instances. For instance, the nodes at the bottom (pixel level) of the trees have many more training examples than the nodes at the top (global level), since each image has only one top-level node and many bottom-level nodes. Following the Bayesian interpretation of regularized optimization as MAP estimation, we view the regularizer as a prior, and do not scale it with the number of training examples. This can be interpreted as giving the same prior to each weight group's parameters, and then conditioning on varying amounts of evidence.
2.1.5 Regularization of Pairwise Potentials
In these experiments, no regularization was used for the pairwise potential parameters, as it was unclear how to scale this regularization with respect to the regularization of the node potentials. Using a separate hyperparameter would have increased the computational cost of cross-validation significantly. However, this was not expected to cause significant over-fitting, since in each model only three sets of four pairwise parameters were learned.
Note that if the pairwise links are regularized separately from the node poten-
tial parameters, strong enough regularization of the pairwise potential parameters
effectively makes the pairwise tree model equivalent to the independent model.
2.2 Likelihood
The likelihood of observing a particular configuration of label nodes y given feature vector x is defined as:
P(Y = y \mid x) = \frac{1}{Z} \prod_{l=1}^{L} \prod_{(i) \in N_l} \exp\left( \phi_f\left(y^{(i)}, x^{(i)}\right) + \phi_{\text{pair}}\left(y^{(i)}, y^{\text{parent}(i)}\right) \right) \qquad (2.4)
Here parent(i) denotes the index of the parent of node i. As a special case, the root has no parent node, and φ_pair = 0.
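The unnormalized log of Equation 2.4 can be evaluated in a single pass over the nodes, adding one local evidence term per node and one pairwise term per parent edge. A sketch, assuming one shared 2×2 table θ rather than the three per-layer-pair tables learned in the experiments:

```python
import numpy as np

def tree_log_score(y, X, W_level, level, parent, theta):
    """Unnormalized log-probability of a full label assignment under the
    pairwise tree model (Equation 2.4, before dividing by Z), for one class.
    y: list of {0,1} labels per node; X: per-node feature vectors;
    level[i]: layer of node i, indexing W_level; parent[i]: parent index,
    or -1 at the root; theta: a single shared 2x2 pairwise table
    (a simplification; the thesis learns one table per layer pair)."""
    score = 0.0
    for i in range(len(y)):
        score += y[i] * (X[i] @ W_level[level[i]])   # phi_f: local evidence
        if parent[i] >= 0:
            score += theta[y[i], y[parent[i]]]       # phi_pair: parent edge
    return score
```

Computing Z itself requires summing this score over all label assignments, which is what the belief propagation discussed in the next section does efficiently.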
2.3 Learning
First, let y^mis denote the missing labels and y^obs the observed labels. The marginal probability of observing y^obs can be obtained by summing over all joint configurations of the missing labels y^mis.
P\left(y^{\text{obs}} \mid x\right) = \sum_{y^{\text{mis}}} P\left(y^{\text{obs}}, y^{\text{mis}} \mid x\right) \qquad (3.5)
We also define the posterior probability of the missing labels given the observed
labels:
P\left(y^{\text{mis}} \mid y^{\text{obs}}, x\right) = \frac{P\left(y^{\text{mis}}, y^{\text{obs}} \mid x\right)}{P\left(y^{\text{obs}} \mid x\right)} \qquad (3.6)
The expected complete log-likelihood over all training examples is as follows:
E[\mathcal{L}] = \sum_{n=1}^{N} \sum_{y_n^{\text{mis}}} P\left(y_n^{\text{mis}} \mid y_n^{\text{obs}}, x_n\right) \log P\left(y_n^{\text{mis}}, y_n^{\text{obs}} \mid x_n\right) \qquad (3.7)
We now show how the gradient of the expected complete log-likelihood with respect to the feature weights W_c^l can be computed using these two quantities (the L2 regularization term is omitted for clarity):
\frac{\partial E[\mathcal{L}]}{\partial W_c^l} = \sum_{n=1}^{N} \sum_{y_n^{\text{mis}}} P\left(y_n^{\text{mis}} \mid y_n^{\text{obs}}, x_n\right) \sum_{(i) \in N_l} \left( \frac{\partial \phi_f\left(y_{cn}^{(i)}, x_n^{(i)}\right)}{\partial W_c^l} - \sum_{y'} P(y' \mid x_n) \frac{\partial \phi_f\left(y_c'^{(i)}, x_n^{(i)}\right)}{\partial W_c^l} \right)

= \sum_{n=1}^{N} \sum_{(i) \in N_l} \left( E_{P\left(y_{cn}^{(i)} \mid y_n^{\text{obs}}, x_n\right)}\left[ y_{cn}^{(i)} x_n^{(i)} \right] - E_{P\left(y_c'^{(i)} \mid x_n\right)}\left[ y_c'^{(i)} x_n^{(i)} \right] \right) \qquad (3.8)

= \sum_{n=1}^{N} \sum_{(i) \in N_l} \left( P\left(y_{cn}^{(i)} = 1 \mid y_n^{\text{obs}}, x_n\right) - P\left(y_c'^{(i)} = 1 \mid x_n\right) \right) x_n^{(i)} \qquad (3.9)
Here N is the number of training examples, and N_l is the set of nodes in layer l of example n. To apply L2 regularization, we add the term −2λ W_c^l to the derivative.
We find that the gradient is proportional to the difference between the marginals computed after clamping all observed nodes to their true values and the marginals computed with no nodes observed. We can obtain exact marginals in time linear in the number of nodes by using belief propagation [17].
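A minimal sum-product belief propagation sketch for computing the exact node marginals on such a tree. It assumes binary labels, topologically ordered nodes (parent before child), and a single shared edge potential table, unlike the per-layer potentials used in the experiments; it works in the linear domain, so it is only numerically safe for small toy trees. The clamped and unclamped marginals appearing in the gradient can both be obtained this way (clamping a node amounts to giving its disallowed value a very large negative log-potential).

```python
import numpy as np

def tree_marginals(phi, parent, psi):
    """Exact sum-product BP on a tree of binary nodes, linear in the number
    of nodes. phi: (n, 2) local log-potentials; parent[i]: parent index
    (-1 for the root; nodes ordered so parent[i] < i); psi: (2, 2) edge
    potential table psi[y_child, y_parent], shared across edges for
    simplicity. Returns an (n, 2) array of node marginals."""
    n = len(parent)
    pot = np.exp(phi)                       # node potentials
    children = [[] for _ in range(n)]
    for i in range(1, n):
        children[parent[i]].append(i)
    up = np.ones((n, 2))                    # message from node i to its parent
    for i in reversed(range(n)):            # leaves-to-root pass
        belief = pot[i] * np.prod([up[c] for c in children[i]], axis=0)
        if parent[i] >= 0:
            up[i] = psi.T @ belief          # sum out y_i against psi[y_i, y_par]
    down = np.ones((n, 2))                  # message from the parent into i
    marg = np.zeros((n, 2))
    for i in range(n):                      # root-to-leaves pass
        pre = pot[i] * down[i] * np.prod([up[c] for c in children[i]], axis=0)
        marg[i] = pre / pre.sum()
        for c in children[i]:
            down[c] = psi @ (pre / up[c])   # divide out c's own upward message
    return marg
```

Running this twice per example, once with the observed labels clamped and once without, and differencing the resulting marginals gives exactly the quantities in Equations 3.9 and 3.11.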
The gradient of the likelihood with respect to the pairwise parameters θ_c(a, b) has a similar form:
\frac{\partial E[\mathcal{L}]}{\partial \theta_c(a,b)} = \sum_{n=1}^{N} \sum_{(i,j)} \left( E_{P\left(y_{cn}^{(i)}, y_{cn}^{(j)} \mid y_n^{\text{obs}}, x_n\right)}\left[ I\left[y_{cn}^{(i)} = a, y_{cn}^{(j)} = b\right] \right] - E_{P\left(y_{cn}'^{(i)}, y_{cn}'^{(j)} \mid x_n\right)}\left[ I\left[y_{cn}'^{(i)} = a, y_{cn}'^{(j)} = b\right] \right] \right) \qquad (3.10)

= \sum_{n=1}^{N} \sum_{(i,j)} \left( P\left(y_{cn}^{(i)} = a, y_{cn}^{(j)} = b \mid y_n^{\text{obs}}, x_n\right) - P\left(y_{cn}^{(i)} = a, y_{cn}^{(j)} = b \mid x_n\right) \right) \qquad (3.11)

Here the sum over (i, j) ranges over the pairs of neighbouring nodes whose edge uses the parameter set θ_c.
The gradient with respect to θ_c(a, b) is simply the expected difference in the number of times we would observe [y_c^{(i)} = a, y_c^{(j)} = b] under the clamped versus the unclamped distributions.
2.3.1 Computational Issues
Learning was broken into three stages as follows:
1. The image feature weights W , initialized to zero, were trained in the inde-
pendent model by supervised training on fully labeled training images.
2. The pairwise factors φpair were added to the CRF, and the feature weights W
along with the pairwise parameters θ were learned by supervised training on
the fully labeled training examples.
3. Caption-only data was added to the dataset, and the model was trained in a
semi-supervised way using the E-M algorithm shown above.1
2.4 Experimental Setup
In these experiments, we balanced the dataset for each class separately by removing approximately 80% of the images that did not contain the class of interest.
2.4.1 Cross-validation
Error bars depicting one standard error were produced by conducting experiments
on five training/test splits of the data. Within each split, the L2 regularization pa-
rameter λ was chosen by nested cross-validation: Each training set was split into
five inner training/validation splits. For both the supervised case and the semi-
supervised case, the setting of λ that had the best average accuracy on the valida-
tion set was chosen to train the model on the whole training set for that fold.
Each outer fold had 400 fully-labeled training examples, 400 caption-only
training examples, and 200 test examples.
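The nested cross-validation above can be sketched as follows; `fit` and `accuracy` are stand-ins for the model-training and VOC-scoring routines, and the fold construction (a simple random permutation split) is illustrative.

```python
import numpy as np

def choose_lambda(train_idx, lambdas, fit, accuracy, n_inner=5, seed=0):
    """Nested cross-validation sketch for the L2 setting: split the outer
    training fold into `n_inner` inner train/validation splits, fit a model
    at each candidate lambda, and return the lambda with the best mean
    validation accuracy. `fit(train_indices, lam)` returns a trained model;
    `accuracy(model, val_indices)` scores it (both are assumed helpers)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(train_idx), n_inner)

    def mean_acc(lam):
        scores = []
        for k in range(n_inner):
            val = folds[k]
            tr = np.concatenate([folds[j] for j in range(n_inner) if j != k])
            scores.append(accuracy(fit(tr, lam), val))
        return np.mean(scores)

    return max(lambdas, key=mean_acc)
```

The winning λ is then used to retrain on the entire outer training fold, once for the supervised case and once for the semi-supervised case.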
2.4.2 Evaluation Criteria
At test time, for each example, posterior marginals are computed for each node in the tree, conditioned on the image feature vectors. Node marginals at the bottom and top of the trees were thresholded at 0.5 to produce the final pixel-level and image-level classifications, respectively.
1 The function minimizer used in the M step was minFunc by Mark Schmidt, which implements the L-BFGS algorithm. This software is available at http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html
[Bar charts: per-class percent accuracy.]
Figure 2.1: Pixel-level test accuracy on the tree model. White bars indicate performance on fully-labeled data, black bars indicate performance after additional semi-supervised training.
2.5 Results

Figures 2.1 through 2.4 show the pixel-level and global-level accuracy after both super-
vised and semi-supervised training. In these results, “accuracy” refers to the VOC
accuracy metric defined in Chapter 1, and the “absolute percentage change” repre-
sents the percent accuracy after semi-supervised training minus the percent accu-
racy before semi-supervised training.
2.6 Discussion

The mean improvement in accuracy after semi-supervised training is statistically
significant, but varies considerably between classes. Besides noise, how can we
explain the decrease in performance in some classes? Note that while the accuracy
did not always improve, the accuracy measure used here is quite different from the
Figure 2.2: Global-level test accuracy on the tree model. White bars indicate performance on fully-labeled data, black bars indicate performance after additional semi-supervised training.
training objective. As well, in order for the pixel-level accuracy to improve, the
model must already be somewhat competent at predicting pixel-level labels given
the global label. This is because the weight gradient is a function of the difference
in node marginals between the case where the global-level node is unobserved,
and the case where it is observed. If the bottom-level marginals do not change
significantly (or in the correct direction) when the global-level node is clamped,
the model weights will not change.
Figure 2.5 supports this explanation. Plotted is the change in pixel-level accu-
racy after semi-supervised training, versus the pixel-level accuracy after supervised
training when the top-level node was clamped to the true value. This figure shows
that the pixel-level accuracy did not improve in classes which had poor localiza-
tion performance after supervised training. We can also see that the best-localized
classes also enjoyed the biggest improvement in accuracy from semi-supervised
Figure 2.3: Change in pixel-level test accuracy after training with partially labeled data.
Figure 2.4: Change in global-level test accuracy after training with partially labeled data.
Figure 2.5: A plot of the change in test error versus the pixel-level accuracy on the test set after supervised training, when the global node was set to the true value.
training.
We leave a discussion of global-level accuracy until Chapter 4.
Chapter 3
Noisy-Or Factors
3.1 Motivation for Noisy-Or Tree Models

If we were to learn a fully-parameterized factor over a group of child nodes and
their common parent, what sort of factor would we learn? If we had segmented the
image recursively, then we would observe a parent node to be on if and only if at
least one child node was on. Thus the maximum likelihood solution for a factor
joining parents and children would be one that puts probability mass only on states
where the parent is on if and only if at least one child is on.

This factor would have the same semantics as a logical OR-gate, and, as noted
by [17], its probabilistic analogue, the noisy-or factor, has many desirable proper-
ties.
3.2 Definition

The noisy-or factor has the following semantics: the parent1 node y_p turns on with
probability θ independently for each child y_i that is turned on, where i ranges from
1 to g, the number of children of y_p. Thus the noisy-or log-potential can be defined
as:

1 Here we are using “parent” and “child” to denote relative position in the image segmentation, not in the sense of a Directed Acyclic Graph.
φ_no(y_p, y_1, . . . , y_g) = y_p log( 1 − ∏_{i=1}^{g} (1 − θ)^{y_i} ) + (1 − y_p) ∑_{i=1}^{g} y_i log(1 − θ)    (3.1)
In a form that is easier to read, we can replace the success rate θ with the failure
rate q = 1−θ :
exp( φ_no(y_p, y_1, . . . , y_g) ) = [ 1 − ∏_{i=1}^{g} q^{y_i} ]^{y_p} [ ∏_{i=1}^{g} q^{y_i} ]^{(1−y_p)}    (3.2)
As shown in [17], messages for a noisy-or factor can be computed in time linear
in the number of children in the factor, giving belief propagation in this model the
same time complexity as the pairwise model.
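As a hedged illustration (our own code, not from the thesis), the log-potential of Equation (3.1) can be evaluated directly, and exponentiating it recovers the product form of Equation (3.2):

```python
import math

def noisy_or_log_potential(y_p, children, theta):
    """Noisy-or log-potential phi_no of Eq. (3.1).

    Each child y_i that is on turns the parent on independently with
    probability theta, i.e. fails to do so with probability q = 1 - theta.
    (A math domain error is raised for the zero-probability state
    y_p = 1 with no child on.)
    """
    q = 1.0 - theta
    prod_fail = q ** sum(children)      # prod_i q**y_i
    if y_p == 1:
        return math.log(1.0 - prod_fail)  # parent on: at least one success
    return math.log(prod_fail)            # parent off: every trial failed

# Consistency with the product form of Eq. (3.2), with two children on:
theta = 0.2
on = math.exp(noisy_or_log_potential(1, [1, 0, 1], theta))   # 1 - q**2
off = math.exp(noisy_or_log_potential(0, [1, 0, 1], theta))  # q**2
```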
3.3 Likelihood

The likelihood of the noisy-or model is similar to that of the pairwise model. Essentially, each set of pairwise potentials between a parent and all of its children is replaced by one noisy-or factor:
P(Y = y | x) = (1/Z) ∏_{l=1}^{L} ∏_{i∈N_l} exp( φ_f(y^{(i)}, x^{(i)}) + φ_no(y^{(i)}, y_{children(i)}) )    (3.3)

where, as a special case, nodes at the bottom layer of the tree have no children,
and φ_no = 0.
3.3.1 Computing Expected Complete Likelihood
To compute the expected complete likelihood of the noisy-or factors conditioned on
local evidence at each of the child nodes, we could simply use the normal junction-
tree algorithm. However, using this method, we must sum over all possible states
Figure 3.1: An example of belief propagation. Left: A situation in which there is no local evidence for a class being present, except in one leaf node. Middle: Marginals after BP in a pairwise tree. Right: Marginals after BP in a noisy-or tree.
of each group of parents and children. The length of this sum is exponential in
the number of children and may be prohibitively slow. Fortunately, the expected
likelihood can be calculated in linear time.
To see that this is the case, consider computing the expected likelihood of a
family of nodes y_p, y_1, . . . , y_c, each with local evidence P(y_i | e_i) representing the
contribution from the unary potentials. Note that computing the sum over all child
nodes when the parent is off has a factorized form:
∑_{y_1,...,y_c} P(y_p = 0 | y_1, . . . , y_c) P(y_1, . . . , y_c | e) = ∑_{y_1,...,y_c} ∏_{i=1}^{c} q^{y_i} P(y_i | e_i)    (3.5)
Bringing sums inside of products, we obtain the efficient form:
∑_{y_1,...,y_c} P(y_p = 0 | y_1, . . . , y_c) P(y_1, . . . , y_c | e) = ∏_{i=1}^{c} ∑_{y_i} q^{y_i} P(y_i | e_i)    (3.6)
Since P(yp = 1|y1, ...,yc) = 1− P(yp = 0|y1, ...,yc), we can compute every
quantity needed efficiently. The normalization constant P(e) can be computed ef-
ficiently in the same manner.
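To illustrate (a sketch under our own naming, not the thesis code), the exponential-time sum of Equation (3.5) and the linear-time product of Equation (3.6) compute the same quantity:

```python
import itertools

def parent_off_brute(q, p_child):
    """Eq. (3.5): enumerate all 2**c child states and sum
    prod_i q**y_i * P(y_i | e_i)."""
    total = 0.0
    for ys in itertools.product([0, 1], repeat=len(p_child)):
        term = 1.0
        for y_i, p_on in zip(ys, p_child):
            local = p_on if y_i == 1 else 1.0 - p_on  # P(y_i | e_i)
            term *= (q ** y_i) * local
        total += term
    return total

def parent_off_fast(q, p_child):
    """Eq. (3.6): push each sum over y_i inside the product, O(c) time."""
    total = 1.0
    for p_on in p_child:
        total *= (1.0 - p_on) + q * p_on  # sum over y_i of q**y_i P(y_i | e_i)
    return total
```

The quantity for the parent being on follows from one minus this, as in the text.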
3.4 Evidence Flow

In Figure 3.1 we can observe the effects of evidence flowing upwards from the
pixel-level labels to the image-level labels. We see that the pairwise tree sends
Figure 3.2: Left: A situation in which there is local evidence in two adjacent leaf nodes. Middle: Marginals after BP in a pairwise tree. Right: Marginals after BP in a noisy-or tree.

Figure 3.3: Left: A situation in which there is strong evidence at the global scale, and weak local evidence at one of the leaf nodes. Middle: Marginals after BP in a pairwise tree. Right: Marginals after BP in a noisy-or tree.
evidence to its neighbours, regardless of spatial distance, while the noisy-or tree
only sends evidence to its parents.
In Figure 3.2 we can observe the effects of strong evidence in two adjacent
nodes. This is equivalent to an object being divided into more than one segment.
The pairwise tree significantly increases its confidence that objects are present at
other leaf nodes in the image, while the noisy-or tree does not significantly change
its marginals.
In Figure 3.3 we can observe the manner in which evidence flows down the noisy-
or tree. Given strong evidence that a class is present somewhere in the image,
and weak evidence that it is present at one location, the pairwise tree adjusts its
probability strongly everywhere. The noisy-or tree only adjusts its probability in
the regions containing weak evidence.
In Figures 3.4 and 3.5, we can observe the behavior of the two models on a
Figure 3.4: An example of belief propagation and evidence flow in a noisy-or tree, trained on real data. Panels: true labels, local evidence, tree marginals, clamped tree marginals. Node size is proportional to probability.
Figure 3.5: An example of belief propagation and evidence flow in the pairwise tree model, trained on real data. Panels: true labels, local evidence, tree marginals, clamped tree marginals. Node size is proportional to probability.
Figure 3.6: The segmentation of the image used in Figures 3.4 and 3.5.
real example. The local evidence potentials are unintuitively small, since they have
been calibrated to be combined together across the tree. The most striking feature
of these figures is the difference in the tree marginals before and after the global
node has been clamped to the true value.
In these examples, we can again see evidence flowing down the tree from the
root, and observe that in the noisy-or model, evidence tends to flow down only one
branch of the tree, while in the pairwise model, it tends to flow down all branches
to some degree.
3.5 Training

The gradients for the image feature weights W are identical to those of the pairwise
tree model once the node marginals have been computed, and can be estimated with
the same E-M algorithm as for the pairwise trees.
3.5.1 Learning the Noisy-Or Failure Rate Parameter
In a fully observed, recursively segmented image, the maximum likelihood esti-
mate for the failure rate parameter q will always be zero, since a parent node will
be observed to be on if and only if a child is on. However, on partially observed
data, this is not necessarily the case.
In initial experiments, the parameter q was learned in parallel with the fea-
ture weights W , but as the model converged, the learned q parameter again tended
towards zero. For the experiments below, this parameter was fixed to 0.01.
3.6 Results

Figures 3.7, 3.8 and 3.9 show the results of a set of experiments on the noisy-or
model identical to those performed on the pairwise model. We observe a similar
increase in accuracy after semi-supervised training as in the pairwise model. A
detailed comparison is given in Chapter 4.
Figure 3.7: Pixel-level test accuracy on the noisy-or model. White bars indicate performance on fully-labeled data, black bars indicate performance after additional semi-supervised training.
Figure 3.8: Global-level test accuracy on the noisy-or model. White bars indicate performance on fully-labeled data, black bars indicate performance after additional semi-supervised training.
Figure 3.9: Change in pixel-level test accuracy after training with partially labeled data.
Figure 3.10: Change in global-level test accuracy after training with partially labeled data.
Chapter 4
Model Comparison
4.1 Comparing Models

In this chapter we compare the performance of the two tree-structured models, as
well as that of the independent model.
4.1.1 Independent Model
As a baseline, we compare our tree-structured CRF models to the independent
model. This model has the same node potential factors as the tree models, but with
no factors connecting nodes. Thus, every region label is predicted separately, and
the model becomes equivalent to a per-region logistic regression on each region’s
image features. The parameters of the independent model are trained on only the
fully-labeled data.
4.2 Performance Measures

Below we show pixel-level labeling VOC accuracy, global-level VOC accuracy,
and cross-entropy for the three models. Here the cross-entropy is measured be-
tween the true node labels and the node marginals, over all nodes in all trees T_1, . . . , T_N:
− (1/N) ∑_{n=1}^{N} ∑_{i∈T_n} ∑_{y=0}^{1} P_true(y_{ni} = y) log P(y_{ni} = y)    (4.1)
Note that the cross entropy is proportional to the training objective function of
the independent model, but not of the tree models.
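Concretely (our own sketch; the variable names are assumptions), for deterministic true labels the mean cross-entropy of Equation (4.1) reduces to the average negative log-marginal assigned to the true label:

```python
import math

def mean_cross_entropy(true_labels, marginals):
    """Eq. (4.1) for 0/1 true labels: since P_true puts all of its mass on
    the observed label, only -log P(y_ni = true label) survives the inner sum.

    true_labels, marginals: parallel per-tree lists; marginals hold P(y_ni = 1).
    The outer average is over the N trees, as in the equation.
    """
    total = 0.0
    for labels, margs in zip(true_labels, marginals):
        for y, p_on in zip(labels, margs):
            total += -math.log(p_on if y == 1 else 1.0 - p_on)
    return total / len(true_labels)

# One tree with two nodes, true labels [1, 0]:
ce = mean_cross_entropy([[1, 0]], [[0.9, 0.2]])
```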
4.3 Performance Comparison

Figures 4.3, 4.4, and 4.5 show mean performance across all 21 classes in the VOC
dataset, averaged over 5 folds. Error bars represent one standard error.
4.3.1 Pixel-level Accuracy
Figure 4.3 compares model performance with respect to pixel-level accuracy. The
difference between the three models in the supervised setting is statistically
insignificant. However, a paired t-test shows a significant improvement in accuracy
after semi-supervised training for both the pairwise and noisy-or models.
4.3.2 Global-level Accuracy
Figure 4.4 shows a surprising result: The independent model performs slightly
better at the global classification task than the supervised-only tree models. This is
surprising, since the independent global label prediction is equivalent to a logistic
regression on the global-level features of the image, and has no access to image
features at finer scales.
How can we explain the poor global-level performance of the tree models,
given that their pixel level accuracy was better? We propose two possible explana-
tions.
Our first possible explanation is that the tree models’ training objective is heav-
ily weighted towards matching the labels at the bottom levels of the tree as opposed
to the top of the tree, since there are many more bottom-layer nodes than global-
layer nodes. To the extent that there is any trade-off between how well a particular
set of parameters models these two groups, we would expect the model to tend to
choose parameters that accurately model the pixel-level labels at the expense of the
global-level labels. The independent model faces no such trade-off, since it models
the global-level and pixel-level nodes of the tree independently.
Our second, related explanation is that the tree-based models face a trade-
off between modeling nodes with small numbers of neighbours, versus modeling
Figure 4.1: Pixel-level test accuracy on the three models, with and without semi-supervised training.
Figure 4.2: Global-level test accuracy on the three models, with and without semi-supervised training.
Figure 4.3: Pixel-level test accuracy across all models.
Figure 4.4: Global-level test accuracy across all models.
nodes with many neighbours, and that this trade-off again favours nodes near the
bottom of the tree, since there are many more such nodes. Figure 4.6 compares the
distribution of degree of all nodes versus only global-level nodes.
Figure 4.7 shows the cross-entropy on the training set versus the number of
neighbours a node has. We see that the cross-entropy of the tree models is lowest
(best fitting the data) for nodes with a small number of neighbours, and increases
with the number of neighbours.
To verify that nodes with higher numbers of edges are not simply harder to
model for some reason, we also plot the performance of the independent model
versus the number of neighbours that each node would
Figure 4.5: Mean test cross-entropy over all nodes across all models. Lower is better.
Figure 4.6: Left: Histogram of the number of neighbours of each node in the training set. Nodes with one neighbour are necessarily leaves. Right: Histogram of the number of neighbours of global-level nodes in the training set.
Figure 4.7: Mean cross-entropy per node versus the number of neighbours of a node on the training set.
have had if the graph were connected. We see that the performance in this case is
roughly independent of the number of neighbours.
While both tree models show roughly the same trend, there still remains the
question of why the noisy-or tree matches the node marginals much more poorly
on the training set than the pairwise model. This may be due to the fact that the
pairwise model has 3 sets of free parameters with which to calibrate the pairwise
potentials between nodes, while the noisy-or model has no free parameters to adjust
its joint potentials.
4.3.3 Remedies
If our first explanation of the poor global-level performance is correct, then a sim-
ple way to improve the performance of the model at the global-level task would
be to weight the likelihood contribution of the global node more strongly during
training.
If our second explanation is correct, then the global-level performance can be
improved in several ways. First, we may consider constraining the recursive seg-
mentation such that the resulting tree has a restricted range of node degrees. This
approach has the disadvantage that it makes the segment joining task computa-
tionally difficult, and constrains the segmenter’s ability to group similar regions.
Alternatively, we may wish to parameterize the joint factors in a way that takes
into account the varying number of neighbours a node has. One approach, which
retains the linear-time computation of the pairwise and noisy-or factors, is outlined
in Chapter 5.
4.4 Qualitative Evaluation

In Figures 4.8 and 4.9, we compare the pixel-level probabilities among the three
models for two images in the test set. In both of the images shown, the independent
model finds several disparate patches which match the class of interest. In the tree
models, this evidence is combined at higher levels in the tree, and we observe a
smoother labeling at the pixel level.
4.5 Improving Performance

The performance of the tree models in these experiments is unimpressive relative
to the state of the art. However, there are several reasons to expect that significantly
better performance can be achieved, at the cost of slower training times.1
• The recursive segmentation can be made finer. In the experiments performed
above, the recursive segmentation was only four levels deep, leaving rela-
tively large segments at the finest scale. Allowing a finer segmentation would
raise the upper bound on pixel-level accuracy caused by under-segmentation.
• The training set can be left unbalanced. In the experiments above, the training datasets were balanced by removing approximately 80% of images that did not contain the class of interest.

1 In the experiments above, the tree models take approximately 2 hours to train per class, for a given setting of the hyperparameters. The main bottleneck in training the model is in performing inference at each step on each of the partially-labeled examples. However, this inference step can be computed in parallel over all examples.

Figure 4.8: Detecting a dog. Top left: Original image. Top center: Segmentation at bottom level. Top right: True pixel labels. Bottom left: Pixel probabilities for independent model. Bottom center: Pixel probabilities for pairwise model. Bottom right: Pixel probabilities for noisy-or model.
• Greater care can be put into selecting image features for the node potentials.
• The number of unlabeled examples can be increased relatively easily. To
gather training data for the “dog” image model, for example, it suffices to
merely find images that somewhere contain a dog, with no further labeling
required. Note that these models can be trained on images with probabilistic
labels, allowing the use of automatically gathered datasets where it is known
that some percentage of images do not contain the class of interest.
• Performance on the pairwise model could potentially be improved by reg-
ularizing the edge-potential parameters (of which there were twelve in the
pairwise tree model used). However, this would entail another hyperparam-
eter on which to perform cross-validation.
Figure 4.9: Detecting a person. Top left: Original image. Top center: Segmentation at bottom level. Top right: True pixel labels. Bottom left: Pixel probabilities for independent model. Bottom center: Pixel probabilities for pairwise model. Bottom right: Pixel probabilities for noisy-or model.
4.6 Introducing an Oracle

Although one motivation of tree-structured CRF models is that they can perform
simultaneous image classification and object localization, it is certainly the case
that a model dedicated to image classification may perform better at that task than
our model. However, we can easily incorporate evidence from a separate image
classification algorithm into our model by introducing an extra factor at the global-
level node. This evidence would then be propagated by belief propagation down to
the bottom-level nodes, where pixel-level accuracy may be improved.
Following [18], we calculate an upper bound on the possible performance boost
in pixel-level accuracy attainable by incorporating evidence from a better global-
level classifier. In Figure 4.10 we show the pixel-level accuracy of the models in the
setting where an oracle tells us the correct image classification. We see that both
Figure 4.10: Pixel-level test accuracy across all models, including the case where the global-level nodes were clamped to their true value.
models obtain a large increase in accuracy when combined with an oracle. This
result is consistent with results in [18], who also report a large boost in pixel-level
accuracy by coupling with an oracle.
We also observe that the pairwise model receives a much greater boost in ac-
curacy from the oracle than the noisy-or model does. The cause of this result is unclear.
One possible explanation is suggested by Figure 3.3, where, upon clamping the
root node to be on, the pairwise model increases node marginals among all the
bottom-level nodes, while the noisy-or model increases node marginals only at
bottom-level nodes whose marginals are already high. This behavior suggests that
the noisy-or model might be inappropriately “explaining away” the oracle’s being
on by increasing the marginal probability of only a small number of bottom-level
nodes.
Chapter 5
Conclusions and Future Work
In this chapter, we discuss some well-motivated possible extensions of our work,
and attempt to characterize the different models studied.
5.1 Future Work
5.1.1 Structure-Adaptive Potentials
As mentioned in Chapter 4, the tree models had difficulty modeling the marginals
of nodes with high degree. To address this, we may wish to learn a more flexi-
ble factor computing the joint probability of neighbouring nodes. For instance, we
could learn a tabular factor with a different potential for each setting of parents
and children. However, without introducing additional structure to these factors,
this approach has two major disadvantages: It would no longer be possible to com-
pute messages or likelihoods efficiently, and we would need to learn a number of
parameters exponential in the number of children.
We can attempt to correct the problem of the parent’s local evidence being
either over- or under-weighted in the following way: We can redefine the local ev-
idence factors to be scaled according to a parameter αg that varies with the number
of children g a node has.
For the pairwise model, the unary potential for node r was defined as:
Figure 5.1: Learned αg versus the number of neighbours.
φ_f(y_c^{(r)}, x^{(r)}) = y_c^{(r)} (x^{(r)})^T W_c^l    (5.1)
The new unary potential would be defined as:
φ_f(y_c^{(r)}, x^{(r)}) = α_g y_c^{(r)} (x^{(r)})^T W_c^l    (5.2)

where α_g is the reweighting parameter for nodes with g children.
In a preliminary experiment, we first learned W as normal, then optimized α
in a second step by numerical differentiation of the expected data log-likelihood.
Because so few nodes have more than 9 children, we tied together αg for all values
of g > 9.
We would expect αg to increase monotonically with the number of children,
and this is what we see in Figure 5.1. After α has been learned, the local evidence
is weighted more strongly in nodes that have many neighbours.
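A minimal sketch of the reweighted potential of Equation (5.2) (our own code; the tying of all α_g for g > 9 follows the text, while the function and parameter names are illustrative):

```python
def scaled_unary_potential(y, x, w, alpha_by_g, g):
    """alpha_g * y * (x . w), as in Eq. (5.2).

    alpha_by_g maps a node's number of children g to its learned scale;
    all nodes with more than 9 children share one tied parameter,
    stored here under the key 10.
    """
    alpha = alpha_by_g[min(g, 10)]
    dot = sum(x_j * w_j for x_j, w_j in zip(x, w))
    return alpha * y * dot

# A node with g = 3 children, feature vector x and weights w:
alphas = {g: 1.0 + 0.1 * g for g in range(11)}
pot = scaled_unary_potential(1, [1.0, 2.0], [0.5, 0.25], alphas, g=3)
```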
5.1.2 Bounding Box Data
One major advantage to using noisy-or tree models over pairwise tree models is
that they can more consistently incorporate information from bounding boxes in
an image.
When viewed as information about an image, all a bounding box tells us is that
a certain area of an image contains a certain class. In a pairwise model, the most
straightforward way to incorporate this evidence into the model is by asserting
that all of the pixels in the bounded area belong to the specified class, with some
probability higher than they would have otherwise had.
In the noisy-or model, we can incorporate bounding box information consistently
by setting a node containing the entire bounding box to be “on”. If the shape
of the smallest region containing the bounding box closely matches the shape of
the bounding box, we will be effectively incorporating the information provided by
the bounding box, while giving the model more flexibility in assigning individual
pixel labels.
These properties of noisy-or models suggest a set of experiments incorporating
bounding-box labeled image data into the semi-supervised training phase.
5.1.3 Combined Grid and Tree Structures
The most popular form of CRFs on images is the grid-structured CRF, where neigh-
bouring pixels or regions are connected. One problem inherent in multi-scale tree
models is that evidence may be unable to flow directly between neighbouring re-
gions in the image in this way. We can combine these models by adding pairwise
connections between neighbouring nodes at each level of the tree. The disadvan-
tage to adding potentials to the tree model is that we lose the ability to do efficient
exact inference, and to compute the expected data log-likelihood exactly. However,
using loopy belief propagation, we can compute approximate marginals as well as
an approximate gradient of the expected log-likelihood, allowing us to optimize
the parameters using gradient descent.
5.1.4 Joint Learning Over All Classes
If we are willing to sacrifice exact inference, there is another interesting extension
of the tree models: Connecting the trees of all classes together at the pixel-level
nodes, where a mutual exclusion between classes is enforced. In this way, if one
class is successfully identified in part of an image, the remaining classes can receive
evidence that they are not contained in that region. This effect could be expected to
improve both the efficiency of semi-supervised learning and test-time localization
performance.
5.1.5 Large-Scale Experiments
Of the 400 caption-only VOC 2008 images used for semi-supervised learning, for
most classes only 20-50 images in that set actually contained the class of interest.
Given the improvement in performance obtained by adding only this small num-
ber of examples to the training set, it seems worth noting that a large weakly-labeled
dataset could easily be constructed for a small number of classes, to evaluate the
effectiveness of adding yet more caption-only data.
5.2 Concluding Remarks

We find that multi-scale CRFs are indeed effective at using weakly-labeled data for
semi-supervised learning. However, we found no significant improvement through
the use of the noisy-or factor on the tasks studied.
Bibliography
[1] T. Cour, F. Benezit, and J. Shi. Spectral segmentation with multiscale graphdecomposition. In Computer Vision and Pattern Recognition, volume 2,pages 1124–1131, 2005. → pages
[2] N. Dalai, B. Triggs, R. I. Alps, and F. Montbonnot. Histograms of orientedgradients for human detection. Computer Vision and Pattern Recognition,2005. CVPR 2005. IEEE Computer Society Conference on, 1, 2005. →pages 7
[3] L. Du, L. Ren, D. Dunson, and L. Carin. A bayesian model for simultaneousimage clustering, annotation and object segmentation. In Y. Bengio,D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors,Advances in Neural Information Processing Systems 22, pages 486–494.2009. → pages
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman.The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results.http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html. → pages7
[5] X. Feng, C. K. I. Williams, and S. N. Felderhof. Combining belief networksand neural networks for scene segmentation. IEEE Trans. Pattern Anal.Mach. Intell., 24(4):467–483, 2002. → pages
[6] B. Frey and D. Dueck. Mixture modeling by affinity propagation. Advancesin neural information processing systems, 18:379, 2006. → pages 4
[7] H. Harzallah, F. Jurie, and C. Schmid. Combining efficient objectlocalization and image classification. 2009. → pages
45
[8] X. He and R. S. Zemel. Learning hybrid models for image annotation withpartially labeled data. In Advances in Neural Information ProcessingSystems, 2008. → pages
[9] T. S. Jaakkola and M. I. Jordan. Variational probabilistic inference and theqmr-dt network. Journal of Artificial Intelligence Research, 10:291–322,1999. → pages
[10] J. Lafferty, A. Mccallum, and F. Pereira. Conditional random fields:Probabilistic models for segmenting and labeling sequence data. In Proc.18th International Conf. on Machine Learning, pages 282–289, 2001. →pages
[11] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep beliefnetworks for scalable unsupervised learning of hierarchical representations.In ICML, 2009. → pages
[12] L. Liao, D. Fox, and H. A. Kautz. Hierarchical conditional random fields forgps-based activity recognition. In ISRR, pages 487–506, 2005. → pages
[13] D. G. Lowe. Object recognition from local scale-invariant features. In The Proceedings of the Seventh IEEE International Conference on Computer Vision, pages 1150–1157, 1999. → pages
[14] K. Murphy, A. Torralba, and W. Freeman. Using the forest to see the trees: a graphical model relating features, objects and scenes. Advances in Neural Information Processing Systems, 16, 2003. → pages 2
[15] K. Murphy, A. Torralba, D. Eaton, and W. Freeman. Object detection and localization using local and global features. Toward Category-Level Object Recognition, pages 382–400, 2006. → pages 7, 9
[16] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001. → pages 7
[17] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988. → pages 14, 20, 21
[18] N. Plath, M. Toussaint, and S. Nakajima. Multi-class image segmentation using conditional random fields and global classification. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 817–824. ACM, 2009. → pages 2, 3, 6, 9, 39, 40
[19] J. Reynolds and K. Murphy. Figure-ground segmentation using a hierarchical conditional random field. In Fourth Canadian Conference on Computer and Robot Vision, 2007. CRV'07, pages 175–182, 2007. → pages 2, 3, 9
[20] X. He, R. S. Zemel, and M. Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, pages 695–702, 2004. → pages
[21] M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for L1 regularization: A comparative study and two new approaches. In ECML '07: Proceedings of the 18th European Conference on Machine Learning, pages 286–297, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-74957-8. → pages
[22] M. Schmidt, K. Murphy, G. Fung, and R. Rosales. Structure learning in random fields for heart motion abnormality detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008. → pages
[23] P. Schnitzspan, M. Fritz, S. Roth, and B. Schiele. Discriminative structure learning of hierarchical representations for object detection. In CVPR, pages 2238–2245, 2009. → pages
[24] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905, 2000. → pages
[25] A. J. Storkey and C. K. I. Williams. Image modelling with position-encoding dynamic trees. IEEE Trans. Pattern Anal. Mach. Intell., 25:859–871, 2003. → pages
[26] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In ECCV, volume 5303 of Lecture Notes in Computer Science, pages 582–595. Springer, 2008. → pages
[27] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008. → pages 4
[28] A. Vedaldi and S. Soatto. Quick shift and kernel methods for mode seeking. Computer Vision – ECCV 2008, pages 705–718, 2008. → pages
[29] J. Verbeek and B. Triggs. Scene segmentation with CRFs learned from partially labeled images. In Advances in Neural Information Processing Systems, 2007. → pages
[30] P. Wu, B. S. Manjunath, S. Newsam, and H. Shin. A texture descriptor for image retrieval and browsing. In IEEE Workshop on Content-Based Access of Image and Video Libraries, pages 3–7, 1999. → pages 7
[31] J. Zhu, E. P. Xing, and B. Zhang. Partially observed maximum entropy discrimination Markov networks. In Advances in Neural Information Processing Systems, 2008. → pages