Multiscale Conditional Random Fields for Machine Vision
by
David Duvenaud
B. Sc. Hons., University of Manitoba, 2006
A THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
THE FACULTY OF GRADUATE STUDIES
(Computer Science)
The University Of British Columbia
(Vancouver)
July 2010
© David Duvenaud, 2010
Abstract
We develop a single joint model which can classify images and label super-pixels,
based on tree-structured conditional random fields (CRFs) derived from a hierar-
chical image segmentation, extending previous work by Reynolds and Murphy,
and Plath and Toussaint. We show how to train this model in a weakly-supervised
fashion, in which some of the images only have captions specifying which ob-
jects are present; this information is propagated down the tree and thus provides
weakly labeled data at the leaves, which can be used to improve the performance
of the super-pixel classifiers. After training, information can be propagated from
the super-pixels up to the root-level image classifier (although this does not seem
to help in practice compared to just using root-level features). We compare two
kinds of tree: the standard one with pairwise potentials, and one based on noisy-or
potentials, which better matches the semantics of the recursive partitioning used
to create the tree. However, we do not find any significant difference between the
two.
Table of Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Image Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Multi-scale Approaches . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Previous Approaches to Multiscale CRFs . . . . . . . . . 3
1.3.2 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.3 Under-segmentation vs Over-segmentation . . . . . . . . 5
1.4 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Pascal VOC Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5.1 Image Features . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Supervised and Semi-Supervised Learning . . . . . . . . . . . . . 7
1.7 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Pairwise Trees and Learning . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Model Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Local Evidence Potentials . . . . . . . . . . . . . . . . . 10
2.1.2 Independent Model . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Pairwise Potentials . . . . . . . . . . . . . . . . . . . . . 11
2.1.4 Regularization of Image Feature Weights . . . . . . . . . 12
2.1.5 Regularization of Pairwise Potentials . . . . . . . . . . . 12
2.2 Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Computational Issues . . . . . . . . . . . . . . . . . . . . 14
2.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Cross-validation . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Evaluation Criteria . . . . . . . . . . . . . . . . . . . . 15
2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Noisy-Or Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1 Motivation for Noisy-Or Tree Models . . . . . . . . . . . . . . . 20
3.2 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 Computing Expected Complete Likelihood . . . . . . . . 21
3.4 Evidence Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5.1 Learning the Noisy-Or Failure Rate Parameter . . . . . . 26
3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1 Comparing Models . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1.1 Independent Model . . . . . . . . . . . . . . . . . . . . . 30
4.2 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Performance Comparison . . . . . . . . . . . . . . . . . . . . . . 31
4.3.1 Pixel-level Accuracy . . . . . . . . . . . . . . . . . . . . 31
4.3.2 Global-level Accuracy . . . . . . . . . . . . . . . . . . . 31
4.3.3 Remedies . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 Qualitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 Improving Performance . . . . . . . . . . . . . . . . . . . . . . . 37
4.6 Introducing an Oracle . . . . . . . . . . . . . . . . . . . . . . . . 39
5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . 41
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1.1 Structure-Adaptive Potentials . . . . . . . . . . . . . . . 41
5.1.2 Bounding Box Data . . . . . . . . . . . . . . . . . . . . 43
5.1.3 Combined Grid and Tree Structures . . . . . . . . . . . . 43
5.1.4 Joint Learning Over All Classes . . . . . . . . . . . . . . 44
5.1.5 Large-Scale Experiments . . . . . . . . . . . . . . . . . . 44
5.2 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . 44
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
List of Figures
Figure 1.1 An example of a tree-structured CRF built on an exact recur-
sive segmentation of an image. . . . . . . . . . . . . . . . . . 3
Figure 1.2 An example of a CRF defined for the presence of the sheep
class over multiple scales of a recursively segmented image. . 4
Figure 1.3 The maximum attainable accuracy given the segmentation used. 6
Figure 1.4 Example Multi-scale segmentations from the VOC 2008 dataset.
Rows one to four: Image segmentation at progressively finer
levels of detail. Bottom row: Pixel-level class labels. . . . . . 8
Figure 2.1 Pixel-level test accuracy on the tree model. White bars indicate
performance on fully-labeled data, black bars indicate perfor-
mance after additional semi-supervised training. . . . . . . . 16
Figure 2.2 Global-level test accuracy on the tree model. White bars in-
dicate performance on fully-labeled data, black bars indicate
performance after additional semi-supervised training. . . . . 17
Figure 2.3 Change in pixel-level test accuracy after training with partially
labeled data. . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Figure 2.4 Change in global-level test accuracy after training with par-
tially labeled data. . . . . . . . . . . . . . . . . . . . . . . . 18
Figure 2.5 A plot of the change in test error versus the pixel-level accu-
racy on the test set after supervised training, when the global
node was set to the true value. . . . . . . . . . . . . . . . . . 19
Figure 3.1 An example of belief propagation. Left: A situation in which
there is no local evidence for a class being present, except in
one leaf node. Middle: Marginals after BP in a pairwise tree.
Right: Marginals after BP in a noisy-or tree. . . . . . . . . . 22
Figure 3.2 Left: A situation in which there is local evidence in two adja-
cent leaf nodes. Middle: Marginals after BP in a pairwise tree.
Right: Marginals after BP in a noisy-or tree. . . . . . . . . . 23
Figure 3.3 Left: A situation in which there is strong evidence at the global
scale, and weak local evidence at one of the leaf nodes. Mid-
dle: Marginals after BP in a pairwise tree. Right: Marginals
after BP in a noisy-or tree. . . . . . . . . . . . . . . . . . . . 23
Figure 3.4 An example of belief propagation and evidence flow in a noisy-
or tree, trained on real data. Node size is proportional to prob-
ability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Figure 3.5 An example of belief propagation and evidence flow in the
pairwise tree model, trained on real data. Node size is pro-
portional to probability. . . . . . . . . . . . . . . . . . . . . 24
Figure 3.6 The segmentation of the image used in figures 3.5 and 3.4. . . 25
Figure 3.7 Pixel-level test accuracy on the noisy-or model. White bars
indicate performance on fully-labeled data, black bars indicate
performance after additional semi-supervised training. . . . . 27
Figure 3.8 Global-level test accuracy on the noisy-or model. White bars
indicate performance on fully-labeled data, black bars indicate
performance after additional semi-supervised training. . . . . 28
Figure 3.9 Change in test pixel-level accuracy after training with partially
labeled data. . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Figure 3.10 Change in test global-level test accuracy after training with
partially labeled data. . . . . . . . . . . . . . . . . . . . . . . 29
Figure 4.1 Pixel-level test accuracy on the three models, with and without
semi-supervised training. . . . . . . . . . . . . . . . . . . . 32
Figure 4.2 Global-level test accuracy on the three models, with and with-
out semi-supervised training. . . . . . . . . . . . . . . . . . . 33
Figure 4.3 Pixel-level test accuracy across all models. . . . . . . . . . . 34
Figure 4.4 Global-level test accuracy across all models. . . . . . . . . . . 34
Figure 4.5 Mean test cross entropy over all nodes across all models. Lower
is better. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Figure 4.6 Left: Histogram of the number of neighbours of each node
in the training set. Nodes with one neighbour are necessar-
ily leaves. Right: Histogram of the number of neighbours of
global-level nodes in the training set . . . . . . . . . . . . . . 35
Figure 4.7 Mean cross-entropy per node versus the number of neighbours
of a node on the training set. . . . . . . . . . . . . . . . . . . 36
Figure 4.8 Detecting a dog. Top left: Original image. Top center: Seg-
mentation at bottom level. Top right: True pixel labels. Bot-
tom left: Pixel probabilities for independent model. Bottom
center: Pixel probabilities for pairwise model. Bottom right:
Pixel probabilities for noisy-or model. . . . . . . . . . . . . . 38
Figure 4.9 Detecting a person. Top left: Original image. Top center: Seg-
mentation at bottom level. Top right: True pixel labels. Bot-
tom left: Pixel probabilities for independent model. Bottom
center: Pixel probabilities for pairwise model. Bottom right:
Pixel probabilities for noisy-or model. . . . . . . . . . . . . . 39
Figure 4.10 Pixel-level test accuracy across all models, including the case
where the global-level nodes were clamped to their true value. 40
Figure 5.1 Learned αg versus the number of neighbours. . . . . . . . . . 42
Acknowledgments
There are many people who made my time at UBC both instructive and fruitful.
First and foremost, my advisor Kevin Murphy, for his support and for his
patience in steering my ideas towards useful research. I'd like to thank Nando
de Freitas for his many helpful ideas. I'd like to profusely thank Benjamin Marlin
for sharing the recursive segmentation and feature extraction code, as well as for
being an example to me.
I’d like to thank Kevin Swersky and Bo Chen for raising the bar, Mark Schmidt
for showing me not to be afraid to start at square one, and Emtiyaz Khan for keep-
ing me going through it all.
Chapter 1
Introduction
1.1 Motivation
A central problem in learning object localization models is that pixel-labeled images are rare and costly to produce. In contrast, data that have only weak labeling information, such as captions, are relatively abundant. In the extreme case, one may wish to train a model on a massive set of images drawn from an automated search. This is essentially the data collector's dream: to generate a training dataset for a cat detector by simply performing a web image search for 'cat'.
Such datasets can easily be used to train an image classifier. However, in order
to train a model to perform object localization on such data, we require a unified
model of object presence at both the scale of the whole image as well as the scale
of individual super-pixels.
1.2 Image Models
For the tasks of image classification and object localization, we are interested in predicting whether a given image region contains an example of a given class c somewhere within it. For image classification or object detection, the region of interest is the entire image. For image segmentation or object localization, the regions of interest are on the scale of pixels or super-pixels.
We can cast both of these tasks as inference problems in the following way. We
define random variables Y1,Y2, ...,Yn denoting the presence or absence of class c in
image regions 1...n. We are then interested in computing P(Y1,Y2, ...,Yn|x) where x
are the image features. We may then assume some structure on the conditional
probability P(Y1,Y2, ...,Yn|x) in order to make learning and inference tractable.
Such structured conditional models are called Conditional Random Fields (CRFs).
In this work, we further develop the use of CRFs which are structured in
such a way as to combine evidence from multiple scales of segmentation. This
model structure allows us to perform semi-supervised learning to take advantage
of weakly labeled data as described above. There are two additional motivations
for this model structure. First, it can effectively incorporate evidence from other
image classifiers to perform better localization on unlabeled images (as shown in
Chapter 4). Second, it can combine evidence from multiple scales and image loca-
tions in a simple, consistent way, allowing one model to do both classification and
localization.
Figure 1.1 shows an example of a multi-scale CRF defined over multiple scales
of an image.
1.3 Multi-scale Approaches
Recent work on image classification and segmentation has shown that incorporating evidence from multiple scales is an effective strategy [14], [18], [19]. This can be explained by the fact that labels at different scales must agree with each other to some degree. Evidence about segments at one level of detail can help in classifying segments at neighbouring levels of detail. For example, knowing that there is a person in a small segment of the image means that there is also a person in any larger segment containing the first segment.

This observation motivates the idea of combining evidence at multiple scales of the image by performing segmentation at different scales, estimating a class's presence or absence separately for each segment, and then combining these local estimates through a conditional random field.
Figure 1.1: An example of a tree-structured CRF built on an exact recursive segmentation of an image.
1.3.1 Previous Approaches to Multiscale CRFs
The construction of multi-scale CRFs was previously demonstrated in [18] and
[19]. In these models, each image was segmented at multiple scales, and each
segment was assigned a label node designating whether the class of interest was
contained in that segment. A node at a given level of detail was connected to the
node at the next coarser level of detail according to the degree of overlap between
their corresponding image regions. This procedure results in a tree-structured CRF
with one image-level node representing the presence or absence of a class in the
image, as well as one node for each segment at the finest level.
In these works, the image segmentations at different levels of detail were constructed independently of one another. This has the advantage that each layer's segmentation can be computed separately, but the disadvantage that the segments at one level of detail may not significantly overlap with the segments at the next coarser level.

Figure 1.2: An example of a CRF defined for the presence of the sheep class over multiple scales of a recursively segmented image.
1.3.2 Our Approach
As opposed to constructing image segmentations at different scales independently,
we construct a recursive image segmentation. We define a multi-scale segmentation
to be recursive when each segment at a given level is contained in exactly one other
segment at the next coarser level of detail.
To produce the exact recursive segmentation used by our model, we first segment the image at the finest spatial scale using a Quick-shift-based super-pixel algorithm [27]. We then run a sparse affinity propagation clustering algorithm [6] to
cluster image regions. We only consider merging adjacent image regions at each
step. The similarity between regions is simply the L2 distance between their mean
colors. We perform several rounds of clustering, corresponding to increasingly
coarse levels of detail. We stop when there are 6 or fewer segments, which are
merged in a final step. Figure 1.4 shows some example segmentations using this
algorithm.
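The coarsening step described above can be sketched as a greedy agglomerative merge: at each round, repeatedly fuse the pair of adjacent regions whose mean colors are closest in L2 distance, recording which coarse region each fine region falls into. This is a simplified stand-in for the sparse affinity propagation clustering used in the thesis; `merge_level`, the uniform region weights, and the region representation are illustrative assumptions.

```python
import numpy as np

def merge_level(colors, edges, target):
    """One coarsening round (illustrative): greedily merge the adjacent pair
    of regions with the smallest L2 distance between mean colors until only
    `target` regions remain. Returns (coarse_colors, parent), where parent[i]
    is the index of the coarse region containing fine region i."""
    n = len(colors)
    colors = [np.asarray(c, dtype=float) for c in colors]
    weight = [1.0] * n                   # region sizes (uniform in this sketch)
    parent = list(range(n))              # current cluster id of each fine region
    alive = set(range(n))
    edges = {frozenset(e) for e in edges if len(set(e)) == 2}
    while len(alive) > target and edges:
        # pick the closest adjacent pair of live clusters
        i, j = sorted(min(edges, key=lambda e: np.linalg.norm(
            colors[min(e)] - colors[max(e)])))
        # merge cluster j into cluster i: size-weighted mean color
        w = weight[i] + weight[j]
        colors[i] = (weight[i] * colors[i] + weight[j] * colors[j]) / w
        weight[i] = w
        alive.discard(j)
        # rewire adjacency: edges touching j now touch i; self-loops vanish
        new_edges = set()
        for e in edges:
            e2 = frozenset(i if v == j else v for v in e)
            if len(e2) == 2:
                new_edges.add(e2)
        edges = new_edges
        parent = [i if p == j else p for p in parent]
    remap = {c: t for t, c in enumerate(sorted(alive))}
    return [colors[c] for c in sorted(alive)], [remap[p] for p in parent]
```

Calling this repeatedly, feeding each round's coarse regions back in, yields the recursive hierarchy: every fine region has exactly one parent at the next coarser level.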
As in previous work, we constructed the CRF by connecting nodes representing regions with maximal overlap. In our case, however, each region is completely contained in exactly one other region by design. We refer to the containing region as the parent region, and to the regions contained within it as its children.
One benefit of constructing a recursive image segmentation is that the resulting
label structure allows the use of factors joining parent and child nodes that have
stricter agreement constraints than a typical pairwise-factored CRF. In Chapter 3,
we explore the use of such “noisy-or” factors in tree-structured CRFs.
1.3.3 Under-segmentation vs Over-segmentation
Constructing this recursive segmentation raises the question: What level of detail
is appropriate for the finest level of segmentation?
The cost of over-segmentation is that our CRF will be unnecessarily large,
slowing down inference. However, inference in this model is linear in the number
of nodes in the CRF.
The cost of under-segmentation is that the bottom-layer segments will contain
pixels from classes other than the task of interest. This will put an upper bound on
the accuracy of our model’s pixel-level labels.
In the experiments performed in this thesis, we truncated the recursive segmen-
tation at four levels of recursion, leaving approximately 30 segments per image at
the finest level of detail. Figure 1.3 shows the maximum attainable VOC accuracy
(defined below) for this four-layer deep recursive segmentation is 50.2%. For com-
parison, the best-performing algorithm on the PASCAL VOC 2008 segmentation
challenge achieved a mean accuracy of 25.4%.
1.4 Performance Metrics
Performance is measured for both tasks by the accuracy a, defined in the VOC 2008 challenge as

a = \frac{tp}{tp + fp + fn} \qquad (4.1)
[Bar charts: maximum attainable percent accuracy for Mean, Background, Aeroplane, Bicycle, Bird, Boat, Bottle, Bus, Car, Cat, Chair, Cow, Dining Table, Dog, Horse, Motorbike, Person, Potted Plant, Sheep, Sofa, Train, and TV/Monitor.]
Figure 1.3: The maximum attainable accuracy given the segmentation used.
where tp, fp, and fn mean true positive, false positive, and false negative, re-
spectively [18]. True positive is the number of foreground image pixels that are
correctly predicted. False positive is the number of background pixels that are
incorrectly predicted. False negative is the number of foreground pixels that are in-
correctly predicted. Here, “foreground” refers to the parts of the image containing
the class of interest.
Note that this performance measure effectively ignores true negatives, which
generally comprise the majority of all predictions for this task. This measure is
useful, but is quite different from the objective function maximized by our learning
algorithms.
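The metric above can be computed directly from a predicted mask and the ground-truth labels. The encoding below (1 = foreground, 0 = background, -1 = "don't care") is an assumption for illustration, not the actual VOC mask format; the point is that don't-care pixels are excluded and true negatives never enter the score.

```python
import numpy as np

def voc_accuracy(pred, true):
    """VOC accuracy a = tp / (tp + fp + fn) for a single class.
    pred: boolean mask of predicted foreground pixels.
    true: mask with 1 = foreground, 0 = background, -1 = "don't care"
    (an assumed label encoding). Don't-care pixels are ignored entirely,
    and true negatives do not appear in the score."""
    pred = np.asarray(pred, dtype=bool)
    true = np.asarray(true)
    care = true != -1
    tp = np.sum(pred & (true == 1) & care)   # foreground correctly predicted
    fp = np.sum(pred & (true == 0) & care)   # background wrongly predicted
    fn = np.sum(~pred & (true == 1) & care)  # foreground missed
    denom = tp + fp + fn
    return tp / denom if denom else 0.0
```

Because true negatives are excluded, predicting "absent" everywhere scores zero whenever any foreground exists, which is one reason this measure diverges from the training objective discussed later.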
1.5 Pascal VOC Dataset
The pixel-labeled data used for these experiments were gathered from the training and validation sets of the PASCAL Visual Object Classes (VOC) Challenge 2008 dataset [4]. This dataset contains approximately 1000 images, each approximately 500×500 pixels, in which each pixel is either assigned to one of 20 classes, assigned to the "background" class, or labeled as "don't care". The predicted labels of pixels labeled "don't care" do not count towards the accuracy score.
1.5.1 Image Features
The image features used were: colour histograms (100 dimensions), histograms of oriented gradients (200 dimensions) [2], textons (500 dimensions) [30], and the 5×5 discretized location of the segment (25 dimensions). With a bias term added, each feature vector had 826 dimensions. However, the model is somewhat agnostic to the image features computed, and allows the use of different feature vectors at different levels in the segmentation hierarchy. For instance, one may want to use GIST [16] features at the highest spatial scale, as in [15]. However, exploratory experiments did not show a significant difference in performance when replacing the top-level features with a GIST vector.
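A sketch of assembling the 826-dimensional per-segment feature vector described above. The individual descriptors (colour histogram, HOG, textons, location) are assumed to be computed elsewhere; `segment_features` is a hypothetical helper showing only the concatenation and the bias term.

```python
import numpy as np

def segment_features(colour_hist, hog, textons, location, grid=5):
    """Concatenate the per-segment descriptors described above into one
    vector: colour histogram (100), HOG (200), textons (500), discretized
    grid location (grid*grid = 25), plus a trailing bias term, for a total
    of 826 dimensions."""
    parts = [np.asarray(colour_hist), np.asarray(hog),
             np.asarray(textons), np.asarray(location)]
    expected = [100, 200, 500, grid * grid]
    assert [p.size for p in parts] == expected, "unexpected descriptor sizes"
    return np.concatenate([p.ravel() for p in parts] + [np.ones(1)])  # bias last
```

Swapping the top level to GIST, as mentioned above, would only change which descriptors are concatenated at that layer; the model itself is indifferent.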
1.6 Supervised and Semi-Supervised Learning
We conducted two sets of experiments: In the first, models were trained using
pixel-labeled training data. Given an image that is fully labeled at the pixel level,
we can compute the labels for regions at all spatial scales. Thus, this phase of
learning is fully supervised.
In the second set of experiments, the pixel-labeled training data was augmented
with an equal amount of “caption-only” data, which was unlabeled at every level
of the hierarchy except at the global level. Data labeled in this way corresponds
to knowing only whether or not a certain type of object is found anywhere in the
image. Because all but one node per image is unlabeled, we call this regime “semi-
supervised” training. Here belief propagation is used to compute the expected
sufficient statistics of the unlabeled nodes conditioned on the global label, which
can then be used to train parameters at finer spatial scales.
Figure 1.4: Example multi-scale segmentations from the VOC 2008 dataset. Rows one to four: Image segmentation at progressively finer levels of detail. Bottom row: Pixel-level class labels.
1.7 Thesis Contributions
The main contribution of this thesis is to demonstrate a concrete way in which caption-only data can be used to train a localization model. In doing so, we develop several refinements to the class of multi-scale image CRFs:
• We develop the use of recursive image segmentation.
• Using the recursive segmentation, we introduce the noisy-or structured tree
model.
• In contrast to previous efforts [15], [18], [19], which first learn local image patch classifiers and then connect them with a CRF, we show how to learn the parameters of the local classifiers in concert, as a structured prediction problem.
Chapter 2
Pairwise Trees and Learning
In this chapter we precisely define a tree-structured conditional random field with
pairwise potentials. We give formulas for the likelihood, show how learning can
be done via the Expectation-Maximization algorithm, and show results on fully
observed and semi-supervised learning.
2.1 Model Semantics
The image segments at all levels of detail can be denoted by S^{(r)}, for some integer r. The model contains one label node Y_c^{(r)} for each class c and each element of the recursive image partition S^{(r)}. Setting Y_c^{(r)} = 1 is interpreted as meaning that the image region defined by segment S^{(r)} contains part of an object from class c, while setting Y_c^{(r)} = 0 is interpreted as meaning that the image region S^{(r)} does not contain part of an object from class c.
2.1.1 Local Evidence Potentials
The local evidence log-potential for node Y^{(r)} in this model depends linearly on the feature vector x^{(r)} for the region S^{(r)}. We define the local evidence log-potential in Equation 1.1, where W_c^l are the feature-to-label weights for class c and segmentation level l.
\phi_f\left(y_c^{(r)}, x^{(r)}\right) = y_c^{(r)} \left(x^{(r)}\right)^{\top} W_c^{l} \qquad (1.1)
Weights are shared across all nodes in a given level of detail l of the segmentation tree, within each object class. In the experiments below, the weight vectors W_c^2 and W_c^3 were also constrained to be equal.
2.1.2 Independent Model
As a baseline, we can consider an image model consisting solely of these local
evidence potentials. In this “independent” model, every region label is predicted
separately, and the model becomes equivalent to a per-region logistic regression
on each region’s image features. The likelihood of a node label assignment y is as
follows:
P(Y = y \mid x) = \frac{1}{Z} \prod_{l=1}^{L} \prod_{(i) \in N_l} \exp\left( \phi_f\left(y^{(i)}, x^{(i)}\right) \right) \qquad (1.2)
Here L is the number of layers in the tree, and (i) ∈ N_l denotes the nodes in layer l. The parameters of the independent model can be trained using only fully-labeled data.
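Since the local log-potential is linear in y and y is binary, normalizing Equation 1.2 independently for each region over y ∈ {0, 1} gives P(y = 1 | x) = σ(xᵀW), the logistic function of the feature score. A minimal sketch of the resulting per-region logistic regression:

```python
import numpy as np

def region_marginals(X, W):
    """Independent-model marginals for one class at one level. With the
    local log-potential phi_f(y, x) = y * (x @ W) and y in {0, 1},
    normalization yields P(y = 1 | x) = sigmoid(x @ W), i.e. a logistic
    regression applied to each region separately.
    X: (n_regions, d) feature matrix; W: (d,) weight vector."""
    return 1.0 / (1.0 + np.exp(-(X @ W)))
```

This baseline is also the first stage of training below: the weights W are initialized by fitting exactly this model on the fully-labeled images.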
2.1.3 Pairwise Potentials
In Equation 1.3 we define the pairwise potential between neighbouring nodes. The pairwise potentials depend only on a 2×2 table of parameters θ, indexed by the values taken by the nodes at each end of the potential.
\phi_{\text{pair}}\left(y_c^{(r_1)}, y_c^{(r_2)}\right) = \theta\left(y_c^{(r_1)}, y_c^{(r_2)}\right) \qquad (1.3)
In our experiments, three sets of pairwise parameters were learned: One set
for the potentials connecting global nodes to their children, one for the potentials
connecting nodes in the middle layers to their children, and a third for the potentials
connecting middle-layer nodes to bottom-layer nodes.
2.1.4 Regularization of Image Feature Weights
The weights of the local evidence log-potentials are learned with a simple L2 regularizer λ on the image feature weights W. To avoid choosing a separate hyperparameter for each weight group, we choose one regularization setting, kept fixed across the different weight groups. This raises the question of how to scale the regularization as the number of training examples changes.
The different weight groups have widely varying numbers of training instances. For instance, the nodes at the bottom (pixel level) of the trees have many more training examples than the nodes at the top (global level), since each image has only one top-level node and many bottom-level nodes. Following the Bayesian interpretation of regularized optimization as MAP estimation, we view the regularizer as a prior, and do not scale it with the number of training examples. This can be interpreted as giving the same prior to each weight group's parameters, and then conditioning on varying amounts of evidence.
2.1.5 Regularization of Pairwise Potentials
In these experiments, no regularization was used for the pairwise potential parameters, as it was unclear how to scale this regularization with respect to the regularization of the node potentials. Using a separate hyperparameter would have increased the computational cost of cross-validation significantly. However, this was not expected to cause significant over-fitting, since in each model only three sets of four pairwise parameters were learned.
Note that if the pairwise links are regularized separately from the node poten-
tial parameters, strong enough regularization of the pairwise potential parameters
effectively makes the pairwise tree model equivalent to the independent model.
2.2 Likelihood
The likelihood of observing a particular configuration of label nodes y given feature vector x is defined as:
P(Y = y \mid x) = \frac{1}{Z} \prod_{l=1}^{L} \prod_{(i) \in N_l} \exp\left( \phi_f\left(y^{(i)}, x^{(i)}\right) + \phi_{\text{pair}}\left(y^{(i)}, y^{\text{parent}(i)}\right) \right) \qquad (2.4)
Here parent(i) denotes the index of the parent of node i. As a special case, the root has no parent node, and φ_pair = 0.
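The unnormalized log of Equation 2.4 can be evaluated in a single pass over the nodes, adding one local evidence term per node and one pairwise term per parent edge. A sketch, assuming one shared 2×2 table θ rather than the three per-layer-pair tables learned in the experiments:

```python
import numpy as np

def tree_log_score(y, X, W_level, level, parent, theta):
    """Unnormalized log-probability of a full label assignment under the
    pairwise tree model (Equation 2.4, before dividing by Z), for one class.
    y: list of {0,1} labels per node; X: per-node feature vectors;
    level[i]: layer of node i, indexing W_level; parent[i]: parent index,
    or -1 at the root; theta: a single shared 2x2 pairwise table
    (a simplification; the thesis learns one table per layer pair)."""
    score = 0.0
    for i in range(len(y)):
        score += y[i] * (X[i] @ W_level[level[i]])   # phi_f: local evidence
        if parent[i] >= 0:
            score += theta[y[i], y[parent[i]]]       # phi_pair: parent edge
    return score
```

Computing Z itself requires summing this score over all label assignments, which is what the belief propagation discussed in the next section does efficiently.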
2.3 Learning
First, let y^mis denote the missing labels and y^obs the observed labels. The marginal probability of observing y^obs can be obtained by summing over all joint configurations of the missing labels y^mis.
P\left(y^{\text{obs}} \mid x\right) = \sum_{y^{\text{mis}}} P\left(y^{\text{obs}}, y^{\text{mis}} \mid x\right) \qquad (3.5)
We also define the posterior probability of the missing labels given the observed
labels:
P\left(y^{\text{mis}} \mid y^{\text{obs}}, x\right) = \frac{P\left(y^{\text{mis}}, y^{\text{obs}} \mid x\right)}{P\left(y^{\text{obs}} \mid x\right)} \qquad (3.6)
The expected complete log-likelihood over all training examples is as follows:
E[\mathcal{L}] = \sum_{n=1}^{N} \sum_{y_n^{\text{mis}}} P\left(y_n^{\text{mis}} \mid y_n^{\text{obs}}, x_n\right) \log P\left(y_n^{\text{mis}}, y_n^{\text{obs}} \mid x_n\right) \qquad (3.7)
We now show how the gradient of the expected complete log-likelihood with respect to the feature weights W_c^l can be computed using these two quantities (the L2 regularization term is omitted for clarity):
\frac{\partial E[\mathcal{L}]}{\partial W_c^l} = \sum_{n=1}^{N} \sum_{y_n^{\text{mis}}} P\left(y_n^{\text{mis}} \mid y_n^{\text{obs}}, x_n\right) \sum_{(i) \in N_l} \left( \frac{\partial \phi_f\left(y_{cn}^{(i)}, x_n^{(i)}\right)}{\partial W_c^l} - \sum_{y'} P(y' \mid x_n) \frac{\partial \phi_f\left(y_c'^{(i)}, x_n^{(i)}\right)}{\partial W_c^l} \right)

= \sum_{n=1}^{N} \sum_{(i) \in N_l} \left( E_{P\left(y_{cn}^{(i)} \mid y_n^{\text{obs}}, x_n\right)}\left[ y_{cn}^{(i)} x_n^{(i)} \right] - E_{P\left(y_c'^{(i)} \mid x_n\right)}\left[ y_c'^{(i)} x_n^{(i)} \right] \right) \qquad (3.8)

= \sum_{n=1}^{N} \sum_{(i) \in N_l} \left( P\left(y_{cn}^{(i)} = 1 \mid y_n^{\text{obs}}, x_n\right) - P\left(y_c'^{(i)} = 1 \mid x_n\right) \right) x_n^{(i)} \qquad (3.9)
Here N is the number of training examples, and N_l is the set of nodes in layer l of example n. To apply L2 regularization, we add the term −2λ W_c^l to the derivative.
We find that the gradient is proportional to the difference between the marginals computed after clamping all observed nodes to their true values and the marginals computed with no nodes observed. We can obtain exact marginals in time linear in the number of nodes by using belief propagation [17].
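A minimal sum-product belief propagation sketch for computing the exact node marginals on such a tree. It assumes binary labels, topologically ordered nodes (parent before child), and a single shared edge potential table, unlike the per-layer potentials used in the experiments; it works in the linear domain, so it is only numerically safe for small toy trees. The clamped and unclamped marginals appearing in the gradient can both be obtained this way (clamping a node amounts to giving its disallowed value a very large negative log-potential).

```python
import numpy as np

def tree_marginals(phi, parent, psi):
    """Exact sum-product BP on a tree of binary nodes, linear in the number
    of nodes. phi: (n, 2) local log-potentials; parent[i]: parent index
    (-1 for the root; nodes ordered so parent[i] < i); psi: (2, 2) edge
    potential table psi[y_child, y_parent], shared across edges for
    simplicity. Returns an (n, 2) array of node marginals."""
    n = len(parent)
    pot = np.exp(phi)                       # node potentials
    children = [[] for _ in range(n)]
    for i in range(1, n):
        children[parent[i]].append(i)
    up = np.ones((n, 2))                    # message from node i to its parent
    for i in reversed(range(n)):            # leaves-to-root pass
        belief = pot[i] * np.prod([up[c] for c in children[i]], axis=0)
        if parent[i] >= 0:
            up[i] = psi.T @ belief          # sum out y_i against psi[y_i, y_par]
    down = np.ones((n, 2))                  # message from the parent into i
    marg = np.zeros((n, 2))
    for i in range(n):                      # root-to-leaves pass
        pre = pot[i] * down[i] * np.prod([up[c] for c in children[i]], axis=0)
        marg[i] = pre / pre.sum()
        for c in children[i]:
            down[c] = psi @ (pre / up[c])   # divide out c's own upward message
    return marg
```

Running this twice per example, once with the observed labels clamped and once without, and differencing the resulting marginals gives exactly the quantities in Equations 3.9 and 3.11.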
The gradient of the likelihood with respect to the pairwise parameters θ_c(a, b) has a similar form:
\frac{\partial E[\mathcal{L}]}{\partial \theta_c(a,b)} = \sum_{n=1}^{N} \sum_{(i,j)} \left( E_{P\left(y_{cn}^{(i)}, y_{cn}^{(j)} \mid y_n^{\text{obs}}, x_n\right)}\left[ I\left[y_{cn}^{(i)} = a, y_{cn}^{(j)} = b\right] \right] - E_{P\left(y_{cn}'^{(i)}, y_{cn}'^{(j)} \mid x_n\right)}\left[ I\left[y_{cn}'^{(i)} = a, y_{cn}'^{(j)} = b\right] \right] \right) \qquad (3.10)

= \sum_{n=1}^{N} \sum_{(i,j)} \left( P\left(y_{cn}^{(i)} = a, y_{cn}^{(j)} = b \mid y_n^{\text{obs}}, x_n\right) - P\left(y_{cn}^{(i)} = a, y_{cn}^{(j)} = b \mid x_n\right) \right) \qquad (3.11)

Here the sum over (i, j) ranges over the pairs of neighbouring nodes whose edge uses the parameter set θ_c.
The gradient with respect to θ_c(a, b) is simply the expected difference in the number of times we would observe [y_c^{(i)} = a, y_c^{(j)} = b] under the clamped versus the unclamped distributions.
2.3.1 Computational Issues
Learning was broken into three stages as follows:
1. The image feature weights W , initialized to zero, were trained in the inde-
pendent model by supervised training on fully labeled training images.
2. The pairwise factors φpair were added to the CRF, and the feature weights W
along with the pairwise parameters θ were learned by supervised training on
the fully labeled training examples.
3. Caption-only data was added to the dataset, and the model was trained in a
semi-supervised way using the E-M algorithm shown above.1
2.4 Experimental Setup
In these experiments, we balanced the dataset for each class separately by removing approximately 80% of the images that did not contain the class of interest.
2.4.1 Cross-validation
Error bars depicting one standard error were produced by conducting experiments
on five training/test splits of the data. Within each split, the L2 regularization pa-
rameter λ was chosen by nested cross-validation: Each training set was split into
five inner training/validation splits. For both the supervised case and the semi-
supervised case, the setting of λ that had the best average accuracy on the valida-
tion set was chosen to train the model on the whole training set for that fold.
Each outer fold had 400 fully-labeled training examples, 400 caption-only
training examples, and 200 test examples.
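The nested cross-validation above can be sketched as follows; `fit` and `accuracy` are stand-ins for the model-training and VOC-scoring routines, and the fold construction (a simple random permutation split) is illustrative.

```python
import numpy as np

def choose_lambda(train_idx, lambdas, fit, accuracy, n_inner=5, seed=0):
    """Nested cross-validation sketch for the L2 setting: split the outer
    training fold into `n_inner` inner train/validation splits, fit a model
    at each candidate lambda, and return the lambda with the best mean
    validation accuracy. `fit(train_indices, lam)` returns a trained model;
    `accuracy(model, val_indices)` scores it (both are assumed helpers)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(train_idx), n_inner)

    def mean_acc(lam):
        scores = []
        for k in range(n_inner):
            val = folds[k]
            tr = np.concatenate([folds[j] for j in range(n_inner) if j != k])
            scores.append(accuracy(fit(tr, lam), val))
        return np.mean(scores)

    return max(lambdas, key=mean_acc)
```

The winning λ is then used to retrain on the entire outer training fold, once for the supervised case and once for the semi-supervised case.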
2.4.2 Evaluation Criteria
At test time, for each example, posterior marginals are computed for each node in the tree, conditioned on the image feature vectors. Node marginals at the bottom and top of the trees were thresholded at 0.5 to produce the final pixel-level and image-level classifications, respectively.
1 The function minimizer used in the M step was minFunc by Mark Schmidt, which implements the L-BFGS algorithm. This software is available at http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html
[Bar charts: per-class percent accuracy.]
Figure 2.1: Pixel-level test accuracy on the tree model. White bars indicate performance on fully-labeled data, black bars indicate performance after additional semi-supervised training.
2.5 Results

Figures 2.1 through 2.4 show the pixel-level and global-level accuracy after both super-
vised and semi-supervised training. In these results, “accuracy” refers to the VOC
accuracy metric defined in Chapter 1, and the “absolute percentage change” repre-
sents the percent accuracy after semi-supervised training minus the percent accu-
racy before semi-supervised training.
2.6 Discussion

The mean improvement in accuracy after semi-supervised training is statistically
significant, but varies considerably between classes. Besides noise, how can we
explain the decrease in performance in some classes? Note that while the accuracy
did not always improve, the accuracy measure used here is quite different from the
Figure 2.2: Global-level test accuracy on the tree model. White bars indicate performance on fully-labeled data, black bars indicate performance after additional semi-supervised training.
training objective. As well, in order for the pixel-level accuracy to improve, the
model must already be somewhat competent at predicting pixel-level labels given
the global label. This is because the weight gradient is a function of the difference
in node marginals between the case where the global-level node is unobserved,
and the case where it is observed. If the bottom-level marginals do not change
significantly (or in the correct direction) when the global-level node is clamped,
the model weights will not change.
Figure 2.5 supports this explanation. Plotted is the change in pixel-level accu-
racy after semi-supervised training, versus the pixel-level accuracy after supervised
training when the top-level node was clamped to the true value. This figure shows
that the pixel-level accuracy did not improve in classes which had poor localiza-
tion performance after supervised training. We can also see that the best-localized
classes also enjoyed the biggest improvement in accuracy from semi-supervised
Figure 2.3: Change in pixel-level test accuracy after training with partially labeled data.
Figure 2.4: Change in global-level test accuracy after training with partially labeled data.
Figure 2.5: A plot of the change in test error versus the pixel-level accuracy on the test set after supervised training, when the global node was set to the true value.
training.
We leave a discussion of global-level accuracy until Chapter 4.
Chapter 3
Noisy-Or Factors
3.1 Motivation for Noisy-Or Tree Models

If we were to learn a fully-parameterized factor over a group of child nodes and
their common parent, what sort of factor would we learn? If we had segmented the
image recursively, then we would observe a parent node to be on if and only if at
least one child node was on. Thus the maximum likelihood solution for a factor
joining parents and children would be one that puts probability mass only on states
where the parent is on if and only if at least one child is on.

This factor would have the same semantics as a logical OR-gate, and, as noted
by [17], its probabilistic analogue, the noisy-or factor, has many desirable proper-
ties.
3.2 Definition

The noisy-or factor has the following semantics: the parent1 node y_p turns on with
probability θ independently for each child y_i that is turned on, where i ranges from
1 to g, the number of children of y_p. Thus the noisy-or log-potential can be defined
as:

1 Here we are using “parent” and “child” to denote relative position in the image segmentation, not in the sense of a Directed Acyclic Graph.
φ_no(y_p, y_1, . . . , y_g) = y_p log( 1 − ∏_{i=1}^{g} (1 − θ)^{y_i} ) + (1 − y_p) ∑_{i=1}^{g} y_i log(1 − θ)    (3.1)
In a form that is easier to read, we can replace the success rate θ with the failure
rate q = 1−θ :
exp( φ_no(y_p, y_1, . . . , y_g) ) = [ 1 − ∏_{i=1}^{g} q^{y_i} ]^{y_p} [ ∏_{i=1}^{g} q^{y_i} ]^{(1−y_p)}    (3.2)
As shown in [17], messages for a noisy-or factor can be computed in time linear
in the number of children in the factor, giving belief propagation in this model the
same time complexity as the pairwise model.
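As a hedged illustration (our own code, not from the thesis), the log-potential of Equation (3.1) can be evaluated directly, and exponentiating it recovers the product form of Equation (3.2):

```python
import math

def noisy_or_log_potential(y_p, children, theta):
    """Noisy-or log-potential phi_no of Eq. (3.1).

    Each child y_i that is on turns the parent on independently with
    probability theta, i.e. fails to do so with probability q = 1 - theta.
    (A math domain error is raised for the zero-probability state
    y_p = 1 with no child on.)
    """
    q = 1.0 - theta
    prod_fail = q ** sum(children)      # prod_i q**y_i
    if y_p == 1:
        return math.log(1.0 - prod_fail)  # parent on: at least one success
    return math.log(prod_fail)            # parent off: every trial failed

# Consistency with the product form of Eq. (3.2), with two children on:
theta = 0.2
on = math.exp(noisy_or_log_potential(1, [1, 0, 1], theta))   # 1 - q**2
off = math.exp(noisy_or_log_potential(0, [1, 0, 1], theta))  # q**2
```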
3.3 Likelihood

The likelihood of the noisy-or model is similar to that of the pairwise model. Essentially, each set of pairwise potentials between a parent and all of its children is replaced by one noisy-or factor:
P(Y = y | x) = (1/Z) ∏_{l=1}^{L} ∏_{i∈N_l} exp( φ_f(y^{(i)}, x^{(i)}) + φ_no(y^{(i)}, y_{children(i)}) )    (3.3)

where, as a special case, nodes at the bottom layer of the tree have no children,
and φ_no = 0.
3.3.1 Computing Expected Complete Likelihood
To compute the expected complete likelihood of the noisy-or factors conditioned on
local evidence at each of the child nodes, we could simply use the normal junction-
tree algorithm. However, using this method, we must sum over all possible states
Figure 3.1: An example of belief propagation. Left: A situation in which there is no local evidence for a class being present, except in one leaf node. Middle: Marginals after BP in a pairwise tree. Right: Marginals after BP in a noisy-or tree.
of each group of parents and children. The length of this sum is exponential in
the number of children and may be prohibitively slow. Fortunately, the expected
likelihood can be calculated in linear time.
To see that this is the case, consider computing the expected likelihood of a
family of nodes y_p, y_1, . . . , y_c, each with local evidence P(y_i | e_i) representing the
contribution from the unary potentials. Note that computing the sum over all child
nodes when the parent is off has a factorized form:
∑_{y_1,...,y_c} P(y_p = 0 | y_1, . . . , y_c) P(y_1, . . . , y_c | e) = ∑_{y_1,...,y_c} ∏_{i=1}^{c} q^{y_i} P(y_i | e_i)    (3.5)
Bringing sums inside of products, we obtain the efficient form:
∑_{y_1,...,y_c} P(y_p = 0 | y_1, . . . , y_c) P(y_1, . . . , y_c | e) = ∏_{i=1}^{c} ∑_{y_i} q^{y_i} P(y_i | e_i)    (3.6)
Since P(yp = 1|y1, ...,yc) = 1− P(yp = 0|y1, ...,yc), we can compute every
quantity needed efficiently. The normalization constant P(e) can be computed ef-
ficiently in the same manner.
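To illustrate (a sketch under our own naming, not the thesis code), the exponential-time sum of Equation (3.5) and the linear-time product of Equation (3.6) compute the same quantity:

```python
import itertools

def parent_off_brute(q, p_child):
    """Eq. (3.5): enumerate all 2**c child states and sum
    prod_i q**y_i * P(y_i | e_i)."""
    total = 0.0
    for ys in itertools.product([0, 1], repeat=len(p_child)):
        term = 1.0
        for y_i, p_on in zip(ys, p_child):
            local = p_on if y_i == 1 else 1.0 - p_on  # P(y_i | e_i)
            term *= (q ** y_i) * local
        total += term
    return total

def parent_off_fast(q, p_child):
    """Eq. (3.6): push each sum over y_i inside the product, O(c) time."""
    total = 1.0
    for p_on in p_child:
        total *= (1.0 - p_on) + q * p_on  # sum over y_i of q**y_i P(y_i | e_i)
    return total
```

The quantity for the parent being on follows from one minus this, as in the text.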
3.4 Evidence Flow

In Figure 3.1 we can observe the effects of evidence flowing upwards from the
pixel-level labels to the image-level labels. We see that the pairwise tree sends
Figure 3.2: Left: A situation in which there is local evidence in two adjacent leaf nodes. Middle: Marginals after BP in a pairwise tree. Right: Marginals after BP in a noisy-or tree.

Figure 3.3: Left: A situation in which there is strong evidence at the global scale, and weak local evidence at one of the leaf nodes. Middle: Marginals after BP in a pairwise tree. Right: Marginals after BP in a noisy-or tree.
evidence to its neighbours, regardless of spatial distance, while the noisy-or tree
only sends evidence to its parents.
In Figure 3.2 we can observe the effects of strong evidence in two adjacent
nodes. This is equivalent to an object being divided into more than one segment.
The pairwise tree significantly increases its confidence that objects are present at
other leaf nodes in the image, while the noisy-or tree does not significantly change
its marginals.
In Figure 3.3 we can observe the manner in which evidence flows down the noisy-
or tree. Given strong evidence that a class is present somewhere in the image,
and weak evidence that it is present at one location, the pairwise tree adjusts its
probability strongly everywhere. The noisy-or tree only adjusts its probability in
the regions containing weak evidence.
In Figures 3.4 and 3.5, we can observe the behavior of the two models on a
Figure 3.4: An example of belief propagation and evidence flow in a noisy-or tree, trained on real data. Panels: true labels, local evidence, tree marginals, clamped tree marginals. Node size is proportional to probability.
Figure 3.5: An example of belief propagation and evidence flow in the pairwise tree model, trained on real data. Panels: true labels, local evidence, tree marginals, clamped tree marginals. Node size is proportional to probability.
Figure 3.6: The segmentation of the image used in Figures 3.4 and 3.5.
real example. The local evidence potentials are unintuitively small, since they have
been calibrated to be combined together across the tree. The most striking feature
of these figures is the difference in the tree marginals before and after the global
node has been clamped to the true value.
In these examples, we can again see evidence flowing down the tree from the
root, and observe that in the noisy-or model, evidence tends to flow down only one
branch of the tree, while in the pairwise model, it tends to flow down all branches
to some degree.
3.5 Training

The gradients for the image feature weights W are identical to those of the pairwise
tree model once the node marginals have been computed, and can be estimated with
the same E-M algorithm as for the pairwise trees.
3.5.1 Learning the Noisy-Or Failure Rate Parameter
In a fully observed, recursively segmented image, the maximum likelihood esti-
mate for the failure rate parameter q will always be zero, since a parent node will
be observed to be on if and only if a child is on. However, on partially observed
data, this is not necessarily the case.
In initial experiments, the parameter q was learned in parallel with the fea-
ture weights W , but as the model converged, the learned q parameter again tended
towards zero. For the experiments below, this parameter was fixed to 0.01.
3.6 Results

Figures 3.7, 3.8 and 3.9 show the results of a set of experiments on the noisy-or
model identical to those performed on the pairwise model. We observe a similar
increase in accuracy after semi-supervised training as in the pairwise model. A
detailed comparison is given in Chapter 4.
Figure 3.7: Pixel-level test accuracy on the noisy-or model. White bars indicate performance on fully-labeled data, black bars indicate performance after additional semi-supervised training.
Figure 3.8: Global-level test accuracy on the noisy-or model. White bars indicate performance on fully-labeled data, black bars indicate performance after additional semi-supervised training.
Figure 3.9: Change in pixel-level test accuracy after training with partially labeled data.
Figure 3.10: Change in global-level test accuracy after training with partially labeled data.
Chapter 4
Model Comparison
4.1 Comparing Models

In this chapter we compare the performance of the two tree-structured models, as
well as that of the independent model.
4.1.1 Independent Model
As a baseline, we compare our tree-structured CRF models to the independent
model. This model has the same node potential factors as the tree models, but with
no factors connecting nodes. Thus, every region label is predicted separately, and
the model becomes equivalent to a per-region logistic regression on each region’s
image features. The parameters of the independent model are trained on only the
fully-labeled data.
4.2 Performance Measures

Below we show pixel-level labeling VOC accuracy, global-level VOC accuracy,
and cross-entropy for the three models. Here the cross-entropy is measured be-
tween the true node labels and the node marginals, over all nodes in all trees T_1, . . . , T_N:
− (1/N) ∑_{n=1}^{N} ∑_{i∈T_n} ∑_{y=0}^{1} P_true(y_{ni} = y) log P(y_{ni} = y)    (4.1)
Note that the cross entropy is proportional to the training objective function of
the independent model, but not of the tree models.
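Concretely (our own sketch; the variable names are assumptions), for deterministic true labels the mean cross-entropy of Equation (4.1) reduces to the average negative log-marginal assigned to the true label:

```python
import math

def mean_cross_entropy(true_labels, marginals):
    """Eq. (4.1) for 0/1 true labels: since P_true puts all of its mass on
    the observed label, only -log P(y_ni = true label) survives the inner sum.

    true_labels, marginals: parallel per-tree lists; marginals hold P(y_ni = 1).
    The outer average is over the N trees, as in the equation.
    """
    total = 0.0
    for labels, margs in zip(true_labels, marginals):
        for y, p_on in zip(labels, margs):
            total += -math.log(p_on if y == 1 else 1.0 - p_on)
    return total / len(true_labels)

# One tree with two nodes, true labels [1, 0]:
ce = mean_cross_entropy([[1, 0]], [[0.9, 0.2]])
```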
4.3 Performance Comparison

Figures 4.3, 4.4, and 4.5 show mean performance across all 21 classes in the VOC
dataset, averaged over 5 folds. Error bars represent one standard error.
4.3.1 Pixel-level Accuracy
Figure 4.3 compares model performance with respect to pixel-level accuracy. The
difference between the three models in the supervised setting is statistically
insignificant. However, a paired t-test shows a significant improvement in accuracy
after semi-supervised training for both the pairwise and noisy-or models.
4.3.2 Global-level Accuracy
Figure 4.4 shows a surprising result: The independent model performs slightly
better at the global classification task than the supervised-only tree models. This is
surprising, since the independent global label prediction is equivalent to a logistic
regression on the global-level features of the image, and has no access to image
features at finer scales.
How can we explain the poor global-level performance of the tree models,
given that their pixel level accuracy was better? We propose two possible explana-
tions.
Our first possible explanation is that the tree models’ training objective is heav-
ily weighted towards matching the labels at the bottom levels of the tree as opposed
to the top of the tree, since there are many more bottom-layer nodes than global-
layer nodes. To the extent that there is any trade-off between how well a particular
set of parameters models these two groups, we would expect the model to tend to
choose parameters that accurately model the pixel-level labels at the expense of the
global-level labels. The independent model faces no such trade-off, since it models
the global-level and pixel-level nodes of the tree independently.
Our second, related explanation is that the tree-based models face a trade-
off between modeling nodes with small numbers of neighbours, versus modeling
Figure 4.1: Pixel-level test accuracy on the three models, with and without semi-supervised training.
Figure 4.2: Global-level test accuracy on the three models, with and without semi-supervised training.
Figure 4.3: Pixel-level test accuracy across all models.
Figure 4.4: Global-level test accuracy across all models.
nodes with many neighbours, and that this trade-off again favours nodes near the
bottom of the tree, since there are many more such nodes. Figure 4.6 compares the
distribution of degree of all nodes versus only global-level nodes.
Figure 4.7 shows the cross-entropy on the training set versus the number of
neighbours a node has. We see that the cross-entropy of the tree models is lowest
(best fitting the data) for nodes with a small number of neighbours, and increases
with the number of neighbours.
To verify that nodes with higher numbers of edges are not simply harder to
model for some reason, we also plot the performance of the independent model
versus the number of neighbours that each node would
Figure 4.5: Mean test cross-entropy over all nodes across all models. Lower is better.
Figure 4.6: Left: Histogram of the number of neighbours of each node in the training set. Nodes with one neighbour are necessarily leaves. Right: Histogram of the number of neighbours of global-level nodes in the training set.
Figure 4.7: Mean cross-entropy per node versus the number of neighbours of a node on the training set.
have had if the graph were connected. We see that the performance in this case is
roughly independent of the number of neighbours.
While both tree models show roughly the same trend, there still remains the
question of why the noisy-or tree matches the node marginals much more poorly
on the training set than the pairwise model. This may be due to the fact that the
pairwise model has 3 sets of free parameters with which to calibrate the pairwise
potentials between nodes, while the noisy-or model has no free parameters to adjust
its joint potentials.
4.3.3 Remedies
If our first explanation of the poor global-level performance is correct, then a sim-
ple way to improve the performance of the model at the global-level task would
be to weight the likelihood contribution of the global node more strongly during
training.
If our second explanation is correct, then the global-level performance can be
improved in several ways. First, we may consider constraining the recursive seg-
mentation such that the resulting tree has a restricted range of node degrees. This
approach has the disadvantage that it makes the segment joining task computa-
tionally difficult, and constrains the segmenter’s ability to group similar regions.
Alternatively, we may wish to parameterize the joint factors in a way that takes
into account the varying number of neighbours a node has. One approach, which
retains the linear-time computation of the pairwise and noisy-or factors, is outlined
in Chapter 5.
4.4 Qualitative Evaluation

In Figures 4.8 and 4.9, we compare the pixel-level probabilities among the three
models for two images in the test set. In both of the images shown, the independent
model finds several disparate patches which match the class of interest. In the tree
models, this evidence is combined at higher levels in the tree, and we observe a
smoother labeling at the pixel level.
4.5 Improving Performance

The performance of the tree models in these experiments is unimpressive relative
to the state of the art. However, there are several reasons to expect that significantly
better performance can be achieved, at the cost of slower training times.1
• The recursive segmentation can be made finer. In the experiments performed
above, the recursive segmentation was only four levels deep, leaving rela-
tively large segments at the finest scale. Allowing a finer segmentation would
raise the upper bound on pixel-level accuracy caused by under-segmentation.
• The training set can be left unbalanced. In the experiments above, the training datasets were balanced by removing approximately 80% of images that did not contain the class of interest.

1 In the experiments above, the tree models take approximately 2 hours to train per class, for a given setting of the hyperparameters. The main bottleneck in training the model is in performing inference at each step on each of the partially-labeled examples. However, this inference step can be computed in parallel over all examples.

Figure 4.8: Detecting a dog. Top left: Original image. Top center: Segmentation at bottom level. Top right: True pixel labels. Bottom left: Pixel probabilities for independent model. Bottom center: Pixel probabilities for pairwise model. Bottom right: Pixel probabilities for noisy-or model.
• Greater care can be put into selecting image features for the node potentials.
• The number of unlabeled examples can be increased relatively easily. To
gather training data for the “dog” image model, for example, it suffices to
merely find images that somewhere contain a dog, with no further labeling
required. Note that these models can be trained on images with probabilistic
labels, allowing the use of automatically gathered datasets where it is known
that some percentage of images do not contain the class of interest.
• Performance on the pairwise model could potentially be improved by reg-
ularizing the edge-potential parameters (of which there were twelve in the
pairwise tree model used). However, this would entail another hyperparam-
eter on which to perform cross-validation.
Figure 4.9: Detecting a person. Top left: Original image. Top center: Segmentation at bottom level. Top right: True pixel labels. Bottom left: Pixel probabilities for independent model. Bottom center: Pixel probabilities for pairwise model. Bottom right: Pixel probabilities for noisy-or model.
4.6 Introducing an Oracle

Although one motivation of tree-structured CRF models is that they can perform
simultaneous image classification and object localization, it is certainly the case
that a model dedicated to image classification may perform better at that task than
our model. However, we can easily incorporate evidence from a separate image
classification algorithm into our model by introducing an extra factor at the global-
level node. This evidence would then be propagated by belief propagation down to
the bottom-level nodes, where pixel-level accuracy may be improved.
Following [18], we calculate an upper bound on the possible performance boost
in pixel-level accuracy attainable by incorporating evidence from a better global-
level classifier. In Figure 4.10 we show the pixel-level accuracy of the models in the
setting where an oracle tells us the correct image classification. We see that both
Figure 4.10: Pixel-level test accuracy across all models, including the case where the global-level nodes were clamped to their true value.
models obtain a large increase in accuracy when combined with an oracle. This
result is consistent with results in [18], who also report a large boost in pixel-level
accuracy by coupling with an oracle.
We also observe that the pairwise model receives a much greater boost in ac-
curacy from the oracle than the noisy-or model does. The cause of this result is unclear.
One possible explanation is suggested by Figure 3.3, where, upon clamping the
root node to be on, the pairwise model increases node marginals among all the
bottom-level nodes, while the noisy-or model increases node marginals only at
bottom-level nodes whose marginals are already high. This behavior suggests that
the noisy-or model might be inappropriately “explaining away” the oracle’s being
on by increasing the marginal probability of only a small number of bottom-level
nodes.
Chapter 5
Conclusions and Future Work
In this chapter, we discuss some well-motivated possible extensions of our work,
and attempt to characterize the different models studied.
5.1 Future Work
5.1.1 Structure-Adaptive Potentials
As mentioned in Chapter 4, the tree models had difficulty modeling the marginals
of nodes with high degree. To address this, we may wish to learn a more flexi-
ble factor computing the joint probability of neighbouring nodes. For instance, we
could learn a tabular factor with a different potential for each setting of parents
and children. However, without introducing additional structure to these factors,
this approach has two major disadvantages: It would no longer be possible to com-
pute messages or likelihoods efficiently, and we would need to learn a number of
parameters exponential in the number of children.
We can attempt to correct the problem of the parent’s local evidence being
either over- or under-weighted in the following way: We can redefine the local ev-
idence factors to be scaled according to a parameter αg that varies with the number
of children g a node has.
For the pairwise model, the unary potential for node r was defined as:
Figure 5.1: Learned αg versus the number of neighbours.
φ_f(y_c^{(r)}, x^{(r)}) = y_c^{(r)} (x^{(r)})^T W_c^l    (5.1)
The new unary potential would be defined as:
φ_f(y_c^{(r)}, x^{(r)}) = α_g y_c^{(r)} (x^{(r)})^T W_c^l    (5.2)

where α_g is the reweighting parameter for nodes with g children.
In a preliminary experiment, we first learned W as normal, then optimized α
in a second step by numerical differentiation of the expected data log-likelihood.
Because so few nodes have more than 9 children, we tied together αg for all values
of g > 9.
We would expect αg to increase monotonically with the number of children,
and this is what we see in Figure 5.1. After α has been learned, the local evidence
is weighted more strongly in nodes that have many neighbours.
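A minimal sketch of the reweighted potential of Equation (5.2) (our own code; the tying of all α_g for g > 9 follows the text, while the function and parameter names are illustrative):

```python
def scaled_unary_potential(y, x, w, alpha_by_g, g):
    """alpha_g * y * (x . w), as in Eq. (5.2).

    alpha_by_g maps a node's number of children g to its learned scale;
    all nodes with more than 9 children share one tied parameter,
    stored here under the key 10.
    """
    alpha = alpha_by_g[min(g, 10)]
    dot = sum(x_j * w_j for x_j, w_j in zip(x, w))
    return alpha * y * dot

# A node with g = 3 children, feature vector x and weights w:
alphas = {g: 1.0 + 0.1 * g for g in range(11)}
pot = scaled_unary_potential(1, [1.0, 2.0], [0.5, 0.25], alphas, g=3)
```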
5.1.2 Bounding Box Data
One major advantage to using noisy-or tree models over pairwise tree models is
that they can more consistently incorporate information from bounding boxes in
an image.
When viewed as information about an image, all a bounding box tells us is that
a certain area of an image contains a certain class. In a pairwise model, the most
straightforward way to incorporate this evidence into the model is by asserting
that all of the pixels in the bounded area belong to the specified class, with some
probability higher than they would have otherwise had.
In the noisy-or model, we can incorporate bounding box information consistently
by setting a node containing the entire bounding box to be “on”. If the shape
of the smallest region containing the bounding box closely matches the shape of
the bounding box, we will be effectively incorporating the information provided by
the bounding box, while giving the model more flexibility in assigning individual
pixel labels.
These properties of noisy-or models suggest a set of experiments incorporating
bounding-box labeled image data into the semi-supervised training phase.
5.1.3 Combined Grid and Tree Structures
The most popular form of CRFs on images is the grid-structured CRF, where neigh-
bouring pixels or regions are connected. One problem inherent in multi-scale tree
models is that evidence may be unable to flow directly between neighbouring re-
gions in the image in this way. We can combine these models by adding pairwise
connections between neighbouring nodes at each level of the tree. The disadvan-
tage to adding potentials to the tree model is that we lose the ability to do efficient
exact inference, and to compute the expected data log-likelihood exactly. However,
using loopy belief propagation, we can compute approximate marginals as well as
an approximate gradient of the expected log-likelihood, allowing us to optimize
the parameters using gradient descent.
5.1.4 Joint Learning Over All Classes
If we are willing to sacrifice exact inference, there is another interesting extension
of the tree models: Connecting the trees of all classes together at the pixel-level
nodes, where a mutual exclusion between classes is enforced. In this way, if one
class is successfully identified in part of an image, the remaining classes can receive
evidence that they are not contained in that region. This effect could be expected to
improve both the efficiency of semi-supervised learning and test-time localization
performance.
5.1.5 Large-Scale Experiments
Of the 400 caption-only VOC 2008 images used for semi-supervised learning, for
most classes only 20-50 images in that set actually contained the class of interest.
Given the improvement in performance obtained by adding only this small num-
ber of examples to the training set, it seems worth noting that a large weakly-labeled
dataset could easily be constructed for a small number of classes, to evaluate the
effectiveness of adding yet more caption-only data.
5.2 Concluding Remarks

We find that multi-scale CRFs are indeed effective at using weakly-labeled data for
semi-supervised learning. However, we found no significant improvement through
the use of the noisy-or factor on the tasks studied.
Bibliography
[1] T. Cour, F. Benezit, and J. Shi. Spectral segmentation with multiscale graphdecomposition. In Computer Vision and Pattern Recognition, volume 2,pages 1124–1131, 2005. → pages
[2] N. Dalai, B. Triggs, R. I. Alps, and F. Montbonnot. Histograms of orientedgradients for human detection. Computer Vision and Pattern Recognition,2005. CVPR 2005. IEEE Computer Society Conference on, 1, 2005. →pages 7
[3] L. Du, L. Ren, D. Dunson, and L. Carin. A bayesian model for simultaneousimage clustering, annotation and object segmentation. In Y. Bengio,D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors,Advances in Neural Information Processing Systems 22, pages 486–494.2009. → pages
[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman.The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results.http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html. → pages7
[5] X. Feng, C. K. I. Williams, and S. N. Felderhof. Combining belief networksand neural networks for scene segmentation. IEEE Trans. Pattern Anal.Mach. Intell., 24(4):467–483, 2002. → pages
[6] B. Frey and D. Dueck. Mixture modeling by affinity propagation. Advancesin neural information processing systems, 18:379, 2006. → pages 4
[7] H. Harzallah, F. Jurie, and C. Schmid. Combining efficient objectlocalization and image classification. 2009. → pages
45
[8] X. He and R. S. Zemel. Learning hybrid models for image annotation withpartially labeled data. In Advances in Neural Information ProcessingSystems, 2008. → pages
[9] T. S. Jaakkola and M. I. Jordan. Variational probabilistic inference and theqmr-dt network. Journal of Artificial Intelligence Research, 10:291–322,1999. → pages
[10] J. Lafferty, A. Mccallum, and F. Pereira. Conditional random fields:Probabilistic models for segmenting and labeling sequence data. In Proc.18th International Conf. on Machine Learning, pages 282–289, 2001. →pages
[11] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep beliefnetworks for scalable unsupervised learning of hierarchical representations.In ICML, 2009. → pages
[12] L. Liao, D. Fox, and H. A. Kautz. Hierarchical conditional random fields forgps-based activity recognition. In ISRR, pages 487–506, 2005. → pages
[13] D. G. Lowe. Object recognition from local scale-invariant features. In The Proceedings of the Seventh IEEE International Conference on Computer Vision, pages 1150–1157, 1999. → pages
[14] K. Murphy, A. Torralba, and W. Freeman. Using the forest to see the trees: a graphical model relating features, objects and scenes. Advances in Neural Information Processing Systems, 16, 2003. → pages 2
[15] K. Murphy, A. Torralba, D. Eaton, and W. Freeman. Object detection and localization using local and global features. Toward Category-Level Object Recognition, pages 382–400, 2006. → pages 7, 9
[16] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001. → pages 7
[17] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988. → pages 14, 20, 21
[18] N. Plath, M. Toussaint, and S. Nakajima. Multi-class image segmentation using conditional random fields and global classification. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 817–824. ACM, 2009. → pages 2, 3, 6, 9, 39, 40
[19] J. Reynolds and K. Murphy. Figure-ground segmentation using a hierarchical conditional random field. In Fourth Canadian Conference on Computer and Robot Vision, 2007. CRV'07, pages 175–182, 2007. → pages 2, 3, 9
[20] X. He, R. S. Zemel, and M. Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, pages 695–702, 2004. → pages
[21] M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for L1 regularization: A comparative study and two new approaches. In ECML '07: Proceedings of the 18th European Conference on Machine Learning, pages 286–297, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-74957-8. → pages
[22] M. Schmidt, K. Murphy, G. Fung, and R. Rosales. Structure learning in random fields for heart motion abnormality detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008. → pages
[23] P. Schnitzspan, M. Fritz, S. Roth, and B. Schiele. Discriminative structure learning of hierarchical representations for object detection. In CVPR, pages 2238–2245, 2009. → pages
[24] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905, 2000. → pages
[25] A. J. Storkey and C. K. I. Williams. Image modelling with position-encoding dynamic trees. IEEE Trans. Pattern Anal. Mach. Intell., 25:859–871, 2003. → pages
[26] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In ECCV, volume 5303 of Lecture Notes in Computer Science, pages 582–595. Springer, 2008. → pages
[27] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008. → pages 4
[28] A. Vedaldi and S. Soatto. Quick shift and kernel methods for mode seeking. Computer Vision – ECCV 2008, pages 705–718, 2008. → pages
[29] J. Verbeek and B. Triggs. Scene segmentation with CRFs learned from partially labeled images. In Advances in Neural Information Processing Systems, 2007. → pages
[30] P. Wu, B. S. Manjunath, S. Newsam, and H. Shin. A texture descriptor for image retrieval and browsing. In IEEE Workshop on Content-Based Access of Image and Video Libraries, pages 3–7, 1999. → pages 7
[31] J. Zhu, E. P. Xing, and B. Zhang. Partially observed maximum entropy discrimination Markov networks. In Advances in Neural Information Processing Systems, 2008. → pages