Hierarchical Kernel Stick-Breaking Process
for Multi-Task Image Analysis
Qi An, Chunping Wang, Ivo Shterev, Eric Wang, †David Dunson and Lawrence Carin
Department of Electrical and Computer Engineering
†Institute of Statistics and Decision Sciences
Duke University
Durham, NC 27708-0291
{qa,cw36,is33,ew28,lcarin}@ece.duke.edu, [email protected]
Abstract
The kernel stick-breaking process (KSBP) is employed to segment general imagery, imposing the
condition that patches (small blocks of pixels) that are spatially proximate are more likely to be
associated with the same cluster (segment). The number of clusters is not set a priori and is inferred
from the hierarchical Bayesian model. Further, KSBP is integrated with a shared Dirichlet process
prior to simultaneously model multiple images, inferring their inter-relationships. This latter application
may be useful for sorting and learning relationships between multiple images. The Bayesian inference
algorithm is based on a hybrid of variational Bayesian analysis and local sampling. In addition to
providing details on the model and associated inference framework, example results are presented for
several image-analysis problems.
I. INTRODUCTION
The segmentation of general imagery is a problem of longstanding interest. There have been
numerous techniques developed for this purpose, including K-means and associated vector quan-
tization methods [7, 21], statistical mixture models [25], graph-diffusion techniques [26], as
well as spectral clustering [27]. This list of existing methods is not exhaustive, although these
methods share attributes associated with most existing algorithms. First, the clustering is based
on the features of the image (at the pixel level or for small contiguous patches of pixels), and
when clustering these features one typically does not account for their physical location within
the image (although the location may be appended as a feature component). Secondly, the
segmentation or clustering of images is typically performed one image at a time, and therefore
there is no attempt to relate the segments of one image to segments in other images (i.e., to
learn inter-relationships between multiple images). The latter concept may be important when
considering a new unlabeled image with the goal of relating components of it to images labeled
previously, and it may also be of interest when sorting or searching a database of images. Finally,
in many of the techniques cited above one must a priori set the number of anticipated segments
or clusters. The techniques developed in this paper seek to perform clustering or segmentation in
a manner that explicitly accounts for the physical locations of the features within the image, and
multiple images are segmented simultaneously (termed “multi-task learning”) to infer their inter-
relationships. Moreover, the analysis is performed in a semi-parametric manner, in the sense that
the number of segments or clusters is not set a priori, and is inferred from the data. There has
been recent research wherein spatial information has been exploited when clustering [14, 15, 28],
but that segmentation has been performed one image at a time, and therefore not in a multi-task
setting. Another recent paper of relevance to the method considered here is [32].
To address these goals within a statistical setting, we employ a class of hierarchical models related
to the Dirichlet process (DP) [2, 12, 13, 23]. The Dirichlet process is a statistical prior that may
be summarized succinctly as follows. Assume that the nth patch is represented by feature vector
x_n, and that the total image is composed of N such feature vectors {x_n}_{n=1}^{N}. The feature vector
associated with each patch is assumed drawn from a parametric distribution f(φ_n), where φ_n
represents the parameters associated with the nth feature vector. In a DP-based hierarchical
model for generation of the feature vectors we have
x_n | φ_n ∼ f(φ_n)   (independently)
φ_n | G ∼ G   (i.i.d.)                                            (1)
G | α, G_0 ∼ DP(α G_0)
The DP is characterized by the non-negative parameter α and the “base” measure G0. To make
the clustering properties of (1) more apparent, we recall the stick-breaking construction of DP
developed by Sethuraman [31]. Specifically, in [31] it was shown that a draw G from DP(α G_0)
may be expressed as
G = Σ_{h=1}^{∞} π_h δ_{θ_h}

π_h = V_h ∏_{l=1}^{h-1} (1 − V_l)                                 (2)

V_h ∼ Beta(1, α)   (i.i.d.)

θ_h ∼ G_0   (i.i.d.)
This is termed a “stick-breaking” representation of DP because one sequentially breaks off
“sticks” of length π_h from an original stick of unit length (Σ_{h=1}^{∞} π_h = 1). As a consequence
of the properties of the distribution Beta(1, α), for relatively small α it is likely that only a
relatively small set of sticks πh will have appreciable weight/size, and therefore when drawing
parameters φn from the associated G it is probable multiple φn will share the same “atoms” θh
(those associated with the large-amplitude sticks). The parameter α therefore plays an important
role in defining the number of clusters that are constituted, and therefore in practice one typically
places a non-informative Gamma prior on α [36].
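To make the construction in (2) concrete, a truncated version of Sethuraman's stick-breaking can be sketched numerically; the truncation level, the standard-normal base measure, and the value of α below are illustrative assumptions, not values used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_stick_breaking(alpha, base_draw, truncation=100):
    """Truncated stick-breaking approximation of G ~ DP(alpha, G0):
    pi_h = V_h * prod_{l<h} (1 - V_l), with V_h ~ Beta(1, alpha)."""
    V = rng.beta(1.0, alpha, size=truncation)
    pi = V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))
    theta = np.array([base_draw() for _ in range(truncation)])
    return pi, theta

# Base measure G0: standard normal (an illustrative choice).
pi, theta = dp_stick_breaking(alpha=1.0, base_draw=lambda: rng.normal())

# For small alpha, a few sticks consume most of the unit stick, so draws
# from G concentrate on a few atoms theta_h (i.e., a few clusters).
print(pi[:5], round(pi.sum(), 6))
```

The leftover stick length after T sticks has expectation (α/(1+α))^T, so the truncated weights sum to nearly one for moderate T.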
The form of the model in (1) and (2) imposes the prior belief that the feature vectors {x_n}_{n=1}^{N}
associated with an image should cluster, and the data are used to infer the most probable
clustering distribution, via the posterior distribution on the parameters {φ_n}_{n=1,N}. Such semi-
parametric clustering has been studied successfully in many settings [11, 12, 30, 36]. However,
there are two limitations of such a model, with these defining the focus of this paper. First,
while the model in (1) and (2) captures our belief that the feature vectors should cluster,
it does not impose our additional belief that the probability that two feature vectors are in
the same cluster should increase as their physical location within the image becomes more
proximate; this is an important factor when one is interested in segmenting an image into
contiguous regions. Secondly, typical semi-parametric clustering has been performed one image
or dataset at a time, and here we wish to cluster multiple images simultaneously, to infer the
inter-relationships between clusters in different images, thereby inferring the inter-relationships
between the associated multiple images themselves.
As an extension of the DP-based mixture model, we here consider the recently developed kernel-
stick-breaking process (KSBP) [10], introduced by Dunson and Park. As detailed below, this
model is similar to that in (2), but now the stick-breaking process is augmented to impose the
prior belief associated with spatially proximate patches (that the associated feature vectors are
more likely to be contained within the same cluster). As the name suggests, a kernel is employed
to quantify the closeness of patches spatially, and it is also used to infer a representative set
of atoms that capture the variation in the feature vectors {xn}n=1,N . While this is a relatively
sophisticated model, its component parts are represented in terms of simple/standard distributions,
and therefore inference is relatively efficient. In [10] a Markov chain Monte Carlo (MCMC)
sampler was used to estimate the posterior on the model parameters. In the work considered
here we are interested in relatively large data sets, and therefore we develop an inference engine
that exploits ideas from variational Bayesian analysis [3, 4, 16, 36].
There are problems for which one may wish to perform segmentation on multiple images
simultaneously, with the goal of inferring the inter-relationships between the different images.
This is referred to as multi-task learning (MTL) [34, 36, 37], where here each “task” corresponds
to clustering feature vectors from a particular image. There are at least three applications of MTL
in the context of image analysis: (i) one may have a set of images, some of which are labeled,
and others of which are unlabeled, and by performing an MTL analysis on all of the images one
may infer labels for segments of the unlabeled images, by drawing upon the relationships to the
labeled imagery; (ii) by inferring the inter-relationships between the different images, one may
sort the images as well as sort components within the images; (iii) one may identify abnormal
images and locations within an image in an unsupervised manner, by flagging those locations
that are allocated to a texture that is locally rare, based on a collection of images.
As indicated above the DP is a convenient framework for semi-parametric clustering of multiple
“tasks”. Moreover, the KSBP algorithm is also in the same family of hierarchical statistical
models. Therefore, as presented below, it is convenient to simultaneously cluster/segment multiple
images by linking the multiple associated KSBP models by an overarching DP. Note that we
argued against using the DP for analysis of the individual images because the DP does not impose
our additional belief about the relationship of feature vectors as a function of their location within
the image. In the same manner, if one had additional information about the multiple images (e.g.,
the times and/or locations at which they were measured), this may also be incorporated by using a
more-sophisticated model than the DP, to link the task-dependent KSBPs. For example, methods
that have taken into account the time-dependence of the observed data include [9, 22]. Methods
of this type could be used to replace the DP if spatial or temporal prior knowledge was available
with regard to the multiple images, but here such is not assumed.
Concerning the feature vectors xn, we adopt the independent feature subspace analysis (ISA)
method proposed by Hyvarinen and Hoyer [19], which projects the pixel values to a subspace
spanned by basis vectors, and uses the norms of the projection coefficients as features. The ISA
features have proven to be an appealing representation for image analysis, and encouraging results
are also presented here. However, the basic statistical framework developed here is applicable
to general features.
The remainder of the paper is organized as follows. In Section II we review the basic kernel
stick-breaking process (KSBP) model, and in Section III we extend this to multi-task KSBP
(hierarchical KSBP, or H-KSBP) for analysis of multiple images. Example results are presented
in Section IV. Conclusions and future research directions are discussed in Section V.
II. KERNEL STICK-BREAKING PROCESS
A. KSBP prior for image processing
The stick-breaking representation of the Dirichlet process (DP) was summarized in (2), and this
has served as the basis of a number of generalizations of the DP. The limitation of the DP for the
image-processing application of interest is that the clustering or segmentation assumes that the
feature vectors are exchangeable. This means that the location of the features within the image
may be exchanged arbitrarily, and the DP clustering will not change. It is desirable to exploit
the known spatial position of the features within the image, and impose that it is more likely
for feature vectors to be clustered together if they are physically proximate (this corresponds
to removing the exchangeability assumption in DP). Goals of this type have motivated many
modifications to the DP. The dependent DP (DDP) proposed by MacEachern [22] assumes a
fixed set of weights, π, while allowing the atoms θ = {θ1, · · · , θN} to vary with the predictor
x according to a stochastic process. Alternatively, rather than employing fixed weights, Griffin
and Steel [17] and Duan et al. [8] developed models to allow the weights to change with the
predictor. Griffin and Steel’s approach incorporates the dependency by allowing a predictor-
dependent ordering of the weights in the stick-breaking construction, while Duan et al. use a
multivariate beta distribution.
Dunson and Park [10] have proposed the kernel stick-breaking process (KSBP), which is partic-
ularly attractive for image-processing applications. In [10] the KSBP was presented in a more-
general setting than considered here; the following discussion focuses specifically on the image-
processing application. Rather than simply considering the feature vectors {xn}n=1,N , we now
consider {xn, rn}n=1,N , where rn is tied to the location of the pixel or block of pixels used
to constitute feature vector xn. We let K(r, r′, ψ) → [0, 1] define a bounded kernel function
with parameter ψ, where r and r′ represent general locations in the image of interest. One
may choose to place a prior on the kernel parameter ψ; this issue is revisited below. A draw
Gr ∼ KSBP (a, b, ψ,Go, H) from a KSBP prior is a function of position r (and valid for all
r), and is represented as
G_r = Σ_{h=1}^{∞} π_h(r; V_h, Γ_h, ψ) δ_{θ_h}

π_h(r; V_h, Γ_h, ψ) = V_h K(r, Γ_h, ψ) ∏_{l=1}^{h-1} [1 − V_l K(r, Γ_l, ψ)]

V_h ∼ Beta(a, b)

Γ_h ∼ H

θ_h ∼ G_0                                                         (3)
Dunson and Park [10] prove the validity of Gr as a probability measure. Comparing (2) and
(3), both priors take the general form of a stick-breaking representation, while the KSBP prior
possesses several interesting properties. For example, the stick weights πh(r; Vh, Γh, ψ) are a
function of r. Therefore, although the atoms {θh}h=1,∞ are the same for all r, the weights
πh(r; Vh, Γh, ψ) effectively shift the probabilities of different θh based on r. The Γh serve to
localize in r regions (clusters) in which the weights πh(r; Vh, Γh, ψ) are relatively constant, with
the size of these regions tied to the kernel parameter ψ.
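The location dependence of the weights in (3) is easy to see in a truncated numerical sketch; the Gaussian kernel (introduced in Section II-B), the truncation level, the 64 × 64 image extent, and the parameter values below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def ksbp_weights(r, V, Gamma, psi):
    """Location-dependent KSBP stick weights pi_h(r; V_h, Gamma_h, psi),
    with kernel K(r, Gamma, psi) = exp(-psi * ||r - Gamma||^2)."""
    K = np.exp(-psi * np.sum((Gamma - r) ** 2, axis=1))
    VK = V * K
    return VK * np.cumprod(np.concatenate(([1.0], 1.0 - VK[:-1])))

T = 50                                   # truncation level (assumption)
V = rng.beta(1.0, 1.0, size=T)           # a = 1, b = alpha = 1
Gamma = rng.uniform(0.0, 64.0, size=(T, 2))  # H uniform over the image
psi = 0.05                               # kernel width (assumed value)

r1 = np.array([10.0, 10.0])
r2 = np.array([11.0, 10.0])              # spatially proximate to r1
r3 = np.array([60.0, 60.0])              # distant from r1

pi1, pi2, pi3 = (ksbp_weights(r, V, Gamma, psi) for r in (r1, r2, r3))

# Nearby locations see nearly identical weights over the shared atoms,
# while distant locations favor different atoms.
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(pi1, pi2), cos(pi1, pi3))
```

Note that the weights at a given r need not sum to one under truncation; the leftover mass shrinks as T grows.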
If f(φn) is the parametric model (with parameter φn) responsible for the generation of xn, we
now assume that the augmented data {xn, rn}n=1,N are generated as
x_n ∼ f(φ_n)   (independently)

φ_n ∼ G_{r_n}   (independently)

G_r ∼ KSBP(a, b, ψ, G_0, H)                                       (4)
As (but one) example, f(φn) may represent a Gaussian distribution, and φn may represent the
associated mean vector and covariance matrix. In this representation the vector Γh defines the
localized region in the image at which particular atoms θh are likely to be drawn, with the
probability of a given atom tied to πh(r; Vh, Γh, ψ). The generative model in (4) states that two
feature vectors that come from the same region in the image (defined via r) will have similar
πh(r; Vh, Γh, ψ), and therefore they are likely to share the same atoms θh. The settings of a and
b control how much similarity there will be in drawn atoms for a given spatial cluster centered
about a particular Γh. If we set a = 1 and b = α, analogous to the DP, small α will impose
that Vh is likely to be near one, and therefore only a relatively small number of atoms θh are
likely to be dominant for a given cluster spatial center Γh. On the other hand, if two features are
generated from distant parts of a given image, the associated atoms θh that may be prominent
for each feature vector are likely to be different, and therefore it is of relatively low probability
that these feature vectors would have been generated via the same parameters φ. It is possible
that the model may infer two distinct and widely separated clusters/segments with the similar
dominant parameters (atoms); if the KSBP base distribution Go is a draw from a DP (as it will be
below when we consider processing of multiple images), then two distinct and widely separated
clusters/segments may have identical atoms.
For the case a = 1 and b = α, which we consider below, we employ the notation G_r ∼
KSBP(α, ψ, G_0, H); we emphasize that this draw is not meant to be valid for any one r, but
for all r. In practice we may also place a non-informative Gamma prior on α. Below we
will also assume that f(φ) corresponds to a multivariate Gaussian distribution, in the manner
indicated above.
B. Spatial correlation properties
As indicated above, the functional form of the kernel function is important and needs to be
chosen carefully. A commonly used kernel is given as K(r, Γ, ψ) = exp (−ψ‖r − Γ‖2) for
ψ > 0, which allows the associated stick weight to change continuously from V_h ∏_{l=1}^{h-1} (1 − V_l)
to 0, conditional on the distance between r and Γ. By choosing a kernel we are also implicitly
imposing the dependency between the priors of two samples, Gr and Gr′ . Specifically, both priors
are encouraged to share the same atoms θh if r and r′ are close, with this discouraged otherwise.
Dunson and Park [10] derive the correlation coefficient between two probability measures Gr
and Gr′ to be
corr{G_r, G_{r′}} = [ Σ_{h=1}^{∞} π_h(r; V_h, Γ_h, ψ) π_h(r′; V_h, Γ_h, ψ) ] / ( { Σ_{h=1}^{∞} π_h(r; V_h, Γ_h, ψ)² }^{1/2} { Σ_{h=1}^{∞} π_h(r′; V_h, Γ_h, ψ)² }^{1/2} ),     (5)
The coefficient in (5) approaches unity in the limit as r → r′. Since the correlation is a strong
function of the kernel parameter ψ, below we will consider a distinct ψh for each Γh, with these
drawn from a prior distribution. This implies that the spatial extent within the image over which
a given component is important will vary as a function of the location (to accommodate textural
regions of different sizes).
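Under the same kind of truncated sketch, the correlation coefficient in (5) can be evaluated numerically; the Gaussian kernel, truncation level, and parameter values below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def ksbp_weights(r, V, Gamma, psi):
    # pi_h(r; V_h, Gamma_h, psi) = V_h K(r, Gamma_h, psi) prod_{l<h} [1 - V_l K(r, Gamma_l, psi)]
    K = np.exp(-psi * np.sum((Gamma - r) ** 2, axis=1))
    VK = V * K
    return VK * np.cumprod(np.concatenate(([1.0], 1.0 - VK[:-1])))

def corr_G(r, rp, V, Gamma, psi):
    """Truncated evaluation of the correlation coefficient in (5)."""
    a = ksbp_weights(r, V, Gamma, psi)
    b = ksbp_weights(rp, V, Gamma, psi)
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

T = 200                                  # truncation level (assumption)
V = rng.beta(1.0, 1.0, size=T)
Gamma = rng.uniform(0.0, 64.0, size=(T, 2))
psi = 0.05                               # kernel width (assumed value)

r = np.array([20.0, 20.0])
cs = [corr_G(r, r + np.array([d, 0.0]), V, Gamma, psi) for d in (0.1, 5.0, 40.0)]
print(cs)  # the correlation decays as the separation d grows
```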
III. MULTI-TASK IMAGE SEGMENTATION WITH A HIERARCHICAL KSBP
We now consider the problem for which we wish to jointly segment M images, where each
image has an associated set of feature vectors with location information, in the sense dis-
cussed above. Aggregating the data across the M images, we have the set of feature vectors
{xnm, rnm}n=1,Nm; m=1,M . The image sizes may be different, and therefore the number of feature
vectors Nm may vary between images. The premise of the model discussed below is that the
cluster or segment characteristics may be similar between multiple images, and the inference of
these inter-relationships may be of value. Note that the assumption is that sharing of clusters
may be of relevance for the feature vectors xnm, but not for the associated locations rnm (the
characteristics of feature-vector clusters may be similar between two images, but the associated
clusters may reside in different locations within the respective images).
A. Model
A relatively simple means of sharing feature-vector clusters between the different images is to
let each image be processed with a separate KSBP (αm, ψm, Gm, Hm). To achieve the desired
sharing of feature-vector clusters between the different images, we impose that Gm ≡ G and G is
drawn G ∼ DP (γ, Go). Recalling the stick-breaking form of a draw from DP (γ,Go), we have
G = Σ_{h=1}^{∞} π_h δ_{θ_h}, in the sense summarized in (2). The discrete form of G is very important, for it
implies that the different Gm will share the same set of discrete atoms {θh}h=1,∞. It is interesting
to note that for the case in which the kernel parameter ψ is set such that K(r, Γh, ψ) → 1,
this model reduces to the hierarchical Dirichlet process (HDP) [33]. Therefore, the principal
difference between the HDP and the hierarchical KSBP (H-KSBP) is that for the latter we
impose that it is probable that feature vectors extracted from proximate regions in the image are
likely to be clustered together.
Assuming that f(φ) corresponds to a Gaussian distribution, the H-KSBP model is represented
as
x_{nm} ∼ N(φ_{nm})   (independently)

φ_{nm} ∼ G_{r_{nm}}   (independently)                             (6)

G_{r_{nm}} ∼ KSBP(α_m, ψ_m, G, H_m)   (i.i.d.)

G ∼ DP(γ, G_0)
Assume that G is composed of the atoms {θh}h=1,∞, from the perspective of the stick-breaking
representation in (2). These same atoms are shared across all {Grnm}m=1,M drawn from the
associated KSBPs, but with respective stick weights unique to the different images, and a
function of position within a given image; in (6) we again emphasize that the draw from
KSBP (αm, ψm, G, Hm) is valid for all r, and we here evaluate it for all {rnm}m=1,M . The
sharing of atoms {θh}h=1,∞ imposes the belief that feature vectors from clusters associated with
multiple images may have been drawn from the same Gaussian distribution N (φ), where φ
represents the mean vector and covariance of the Gaussian; in practice the base distribution
Go in (6) corresponds to a normal-Wishart distribution. The posterior inference allows one to
infer which clusters of features are unique to a particular image, and which clusters are shared
between multiple images. The density functions Hm are tied to the support of the mth image,
and in practice this is set as uniform across the image extent. The distinct αm, for each of which
a Gamma hyper-prior may be imposed, encourages that the number of clusters (segments) may
vary between the different images, although one may simply wish to set αm = α for all M
tasks.
For notational convenience, in (6) it was assumed that the kernel parameter ψm varied between
tasks, but was fixed for all sticks within a given task; this is overly restrictive. In the implemen-
tation that follows the parameter ψ may vary across tasks and across the task-specific KSBP
sticks.
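The generative mechanism just described (shared global atoms, task-specific location-dependent sticks) can be sketched compactly; the Gaussian atoms with identity covariance, the 2-D feature space, the Gaussian kernel, and all truncation levels and hyper-parameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

K, T, M = 20, 30, 3                  # DP truncation, KSBP truncation, #tasks
gamma, alpha, psi = 1.0, 1.0, 0.05   # assumed hyper-parameter values

# Global DP G: atoms theta_k (here 2-D Gaussian means, theta_k ~ G0) and
# stick weights beta_k, shared by every task.
theta = rng.normal(0.0, 5.0, size=(K, 2))
bp = rng.beta(1.0, gamma, size=K)
beta = bp * np.cumprod(np.concatenate(([1.0], 1.0 - bp[:-1])))
beta = beta / beta.sum()             # renormalize the truncated sticks

# Task-level KSBPs: each task m has its own V_hm, Gamma_hm, and atom
# indicators t_hm, but the atoms themselves come from the shared set theta.
tasks = []
for m in range(M):
    V = rng.beta(1.0, alpha, size=T)            # V_hm ~ Beta(1, alpha_m)
    Gamma = rng.uniform(0.0, 64.0, size=(T, 2)) # Gamma_hm ~ H_m (uniform)
    t = rng.choice(K, size=T, p=beta)           # t_hm ~ sum_k beta_k delta_k
    tasks.append((V, Gamma, t))

def draw_patch(r, V, Gamma, t):
    """Sample z_nm from the location-dependent sticks, then x_nm ~ N(theta, I)."""
    Kr = np.exp(-psi * np.sum((Gamma - r) ** 2, axis=1))
    VK = V * Kr
    pi = VK * np.cumprod(np.concatenate(([1.0], 1.0 - VK[:-1])))
    pi = pi / pi.sum()               # truncation: renormalize the finite sticks
    z = rng.choice(T, p=pi)
    return rng.normal(theta[t[z]], 1.0)

x = draw_patch(np.array([10.0, 10.0]), *tasks[0])
print(x)  # a 2-D feature vector drawn near one of the shared atoms
```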
B. Posterior inference
For inference purposes, we truncate the number of sticks in the KSBP to T , and the number of
sticks in the truncated DP to K (the truncation properties of the stick-breaking representation are
discussed in [20], where here we must also take into account the properties of the kernel). Due
to the discreteness of G = Σ_{k=1}^{K} β_k δ_{θ_k}, each draw of the KSBP, G_{r_{nm}} = Σ_{h=1}^{T} π_{hm} δ_{φ_{hm}}, can
only take its atoms {φ_{hm}}_{h=1,T; m=1,M} from the K unique possible values {θ_k}_{k=1,K}; when drawing
atoms φ_{hm} from G, the respective probabilities for {θ_k}_{k=1,K} are given by {β_k}_{k=1,K}, and
for a given r_{nm} the respective probabilities for the different {φ_{hm}}_{h=1,T; m=1,M} are defined by
{π_{hm}}_{h=1,T; m=1,M}. In order to reflect the correspondences between the data and atoms explicitly,
we further introduce two auxiliary indicator variables: z_{nm}, indicating with which component of
the KSBP the feature vector x_{nm} is associated, and t_{hm}, indicating with which mixing component
θ_k the atom φ_{hm} is associated.
With this specification we can represent our H-KSBP mixture model via a stick-breaking characterization
x_{nm} ∼ N(φ_{nm})   (independently)

φ_{nm} = φ_{z_{nm} m}

z_{nm} ∼ Σ_{h=1}^{T} π_{nm,h} δ_h   (independently)

π_{nm,h} = V_{hm} K(r_{nm}, Γ_{hm}, ψ_{hm}) ∏_{l=1}^{h-1} [1 − V_{lm} K(r_{nm}, Γ_{lm}, ψ_{lm})]

φ_{hm} = θ_{t_{hm}}

t_{hm} ∼ Σ_{k=1}^{K} β_k δ_k   (independently)                    (7)

β_k = β′_k ∏_{l=1}^{k-1} (1 − β′_l)

V_{hm} ∼ Beta(1, α_m)   (i.i.d.)

Γ_{hm} ∼ H_m   (i.i.d.)

β′_k ∼ Beta(1, γ)   (i.i.d.)

θ_k ∼ G_0   (i.i.d.)
for m = 1, · · · , M; n = 1, · · · , N_m; h = 1, · · · , T and k = 1, · · · , K. Note that we employ distinct
kernel widths ψhm, the details for which are addressed in the Appendix. In the expressions above
and hereafter we use a bold letter to denote a set of variables with different indices, for example
πnm = {πnm,h}h=1,T and β = {βk}k=1,K . With the conditional distributions above, we provide
a graphical representation of the proposed H-KSBP model in Figure 1.
The Markov chain Monte Carlo (MCMC) method is a powerful and general simulation tool
for Bayesian inference, and one may readily implement the H-KSBP with Gibbs sampling,
Fig. 1. A graphical representation of the H-KSBP mixture model.
as an extension of the formulation in [10]. However, for the large-scale problems of interest
here we alternatively employ variational Bayesian (VB) inference. The VB method has proven
to be a relatively fast (compared to MCMC) and accurate inference tool for many models and
applications [1, 4, 24]. To employ VB, a conjugate prior is required for all variables in the model.
In the proposed model, however, we cannot obtain a closed form for the variational posterior
distribution of the node V_{hm}, because of the kernel function. Alternatively, motivated by
the Monte Carlo Expectation Maximization (MCEM) algorithm [35], where the intractable E-
step in the Expectation-Maximization (EM) algorithm is approximated with sets of Monte Carlo
samples, we develop a Monte Carlo Variational Bayesian (MCVB) inference algorithm. We
generally follow the variational Bayesian inference steps to iteratively update the variational
distributions, and therefore push the lower bound closer to the model evidence. For those
nodes that are not available in closed form, we obtain Monte Carlo samples in each iteration
through MCMC routines such as Gibbs and Metropolis-Hastings samplers. The resulting MCVB
algorithm combines the benefits of both MCMC and VB, and has proven to be effective for the
examples we have considered (some of which are presented here).
Given the H-KSBP mixture model detailed in Section III-A, we derive a Monte Carlo variational
Bayesian approach to infer the variables of interest. Following standard variational Bayesian
inference [16], we seek a lower bound for the log model evidence log p(Data) by integrating
over all hidden variables and model parameters:
log p({x_{nm}}, {r_{nm}}) = log ∫ p({x_{nm}}, {r_{nm}}, β, φ, V, Γ, t, z) dΘ

  ≥ ∫ q(β, φ, V, Γ, t, z) log [ p({x_{nm}}, {r_{nm}}, β, φ, V, Γ, t, z) / q(β, φ, V, Γ, t, z) ] dΘ        (8)

  ≈ ∫ q(β) q(φ) q(V) q(Γ) q(t) q(z) log [ p({x_{nm}}, {r_{nm}}, β, φ, V, Γ, t, z) / ( q(β) q(φ) q(V) q(Γ) q(t) q(z) ) ] dΘ        (9)

  ≡ LB(q(Θ))
where Θ is the set including all variables of interest, the q(·) are variational posterior distributions
with specified functional forms, LB(q(Θ)) is defined to be the lower bound on the log model
evidence (sometimes referred to as the negative variational free energy), (8) follows from Jensen's
inequality, and (9) results from the factorization assumption over the variational distributions.
Since the log model evidence is independent of the variational distributions, we can iteratively
update the variational distributions to increase the lower bound, which is equivalent to minimizing
the Kullback-Leibler divergence between q(β, φ, V, Γ, t, z) and the true posterior
p(β, φ, V, Γ, t, z | {x_{nm}}, {r_{nm}}). Equality is achieved if and only if the variational
distributions are equal to the true posterior distributions.
The lower bound, LB(q(Θ)), is a functional of the variational distributions. To make the
computation analytical, we assume the variational distributions take specified forms and iterate
over every variable until convergence. All the updates are analytical except for V_{hm}, which is
estimated with the samples from its conditional posterior distributions. The update equations for
the proposed model are given in the Appendix.
C. Convergence
To monitor the convergence of our MCVB algorithm, we computed the lower bound LB(q(Θ))
on the log model evidence at each iteration. Because of the sampling of some variables (see
Appendix), the lower bound does not in general increase monotonically, but we observed in all
experiments that the lower bound increases sequentially for the first several iterations, with
generally small fluctuations after it has converged to the local optimal solution. An example of
this phenomenon is presented below.
IV. EXPERIMENTAL RESULTS
We have applied the H-KSBP multi-task image-segmentation algorithm to both synthetic and
real images. We first present results on synthesized imagery, wherein we compare KSBP-based
clustering of a single image with associated DP-based clustering. This comparison shows in
a simple manner the advantage of imposing spatial-proximity constraints when performing
clustering and segmentation. We next consider H-KSBP as applied to actual imagery, taken from
a widely utilized database. In that analysis we consider the class of textures/clusters inferred by
the algorithm, and how these are distributed across different classes of imagery. We also examine
the utility of H-KSBP for sorting this database of imagery. These H-KSBP results are compared
with analogous results realized via HDP. The hyper-parameters in the model for the examples
that follow are set as follows: τ_{10} = 10^{-2}, τ_{20} = 10^{-2}, τ_{30} = 3 × 10^{-2}, τ_{40} = 3 × 10^{-2}, µ = 0,
η_0 = 1, w* = d + 2, and Σ* = 5 × I. The discrete priors for Γ and ψ are set to be uniform over all
candidates.
A. Single-image segmentation, comparing KSBP and DP
In this simple illustrative example, each feature vector is associated with a particular pixel, and
the feature is simply a real number, corresponding to its intensity; the pixel location is the
auxiliary information within the KSBP, while this information is not employed by the DP-based
segmentation algorithm. Since both DP and KSBP are semi-parametric, we do not a priori set
the number of clusters. Figure 2 shows the original image and the segmentation results of both
algorithms. In Figure 2(a) we note that there are five contiguous regions for which the intensities
are similar. There is a background region with a relatively fixed intensity, and within this are
four distinct contiguous sub-regions, and of these there are pairs for which the intensities are
comparable. The data in Figure 2(a) were generated as follows. Each pixel in each region is
generated independently as a draw from a Gaussian distribution; the standard deviation of each
of the Gaussians is 10, and the background has mean intensity 5, and the two pairs are generated
with mean intensities of 40 and 60. The color bar in Figure 2(a) denotes the pixel amplitudes.
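Data of this form can be regenerated in a few lines; the image size and the sub-region shapes and placements below are illustrative assumptions, since only the means and standard deviation are specified above.

```python
import numpy as np

rng = np.random.default_rng(4)

# A synthetic image in the spirit of Figure 2(a): a background (mean 5)
# containing four contiguous sub-regions, two with mean 40 and two with
# mean 60; every pixel is an independent Gaussian draw with std 10.
img = rng.normal(5.0, 10.0, size=(64, 64))          # background region
for (r0, c0), mean in [((8, 8), 40.0), ((8, 40), 60.0),
                       ((40, 8), 60.0), ((40, 40), 40.0)]:
    img[r0:r0 + 16, c0:c0 + 16] = rng.normal(mean, 10.0, size=(16, 16))

print(img.shape)  # (64, 64)
```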
The DP and KSBP segmentation results are shown in Figures 2(b) and 2(c), respectively. In the
latter results a distinct color is associated with distinct cluster parameters (mean and precision of
a Gaussian). In the DP results we note that the four subregions are generally properly segmented,
but there is significant speckle in the background region. The KSBP segmentation exhibits
far less speckle. Further, in the KSBP results there are five distinct clusters (dominant
KSBP sticks), where in the DP results there are principally three distinct sticks (in the DP,
the spatially separated segments with the same features are treated as one region, while in the
KSBP each contiguous region is represented by its own stick).

Fig. 2. A synthetic image example: (a) original synthetic image, (b) image-segmentation results of the DP-based model, and (c) image-segmentation results of the KSBP-based model.

While the KSBP yields five
distinct prominent sticks, there are only three distinct atoms, consistent with the representation
in Figure 2(a).
In the next set of results, on real imagery, we employ the H-KSBP algorithm, and therefore at
the task level segmentation is performed as in Figure 2(c). Alternatively, using the HDP model [33],
at the task level one employs clustering of the form in Figure 2(b). The relative performance of
H-KSBP and HDP is analyzed.
B. Feature extraction for real imagery
Within the subsequent image analysis we employ features constituted by the independent feature
subspace analysis (ISA) technique, developed by Hyvarinen and Hoyer [19]. These features
have proven to be relatively shift or translation invariant. In brief, the ISA feature extraction
process is composed of two steps: (i) We employ patches of images as training data, to estimate
several independent feature subspaces via a modification of the independent component analysis
(ICA) [6]. Each feature subspace is represented as a set of orthogonal basis vectors, i.e., wi,
i = 1, · · · , n, where n is the dimension of the subspace. (ii) The feature F of a new data input
I is computed as the norm of the projections on the feature subspaces:

F_j(I) = Σ_{i=1}^{n} 〈w_i, I〉²,   for j = 1, · · · , J             (10)
where J is the number of independent feature subspaces and 〈·〉 is the inner product. Interesting
invariance properties [19] enable the ISA features to be widely applicable to many types of
images. As indicated below, a set of images is selected from a database for H-KSBP analysis; the
set of images used for the aforementioned training constitute a separate set of images randomly
selected from this same database. For the results considered here n = 10 and J = 50.
C. H-KSBP applied to a set of real images
We test the H-KSBP model on a subset of images from the Microsoft Research Cambridge database.1 There
are seven types of images in this database: buildings, clouds, countryside, faces, fireworks, offices
and urban. Twenty images are randomly selected from the database for each type, yielding a
total of 140 images (the image sizes vary from 384×225 to 640×480 pixels among the
different images). Typical images are shown in Figure 3. To capture textural information within
the features, we first divide each image into contiguous, non-overlapping 24×24-pixel patches
and then extract ISA features from each patch; color images are considered, and the RGB colors
are handled within the ISA feature extraction as in [18]. To learn the ISA independent
feature subspaces, we randomly select 150 patches from the 140 images across the seven classes,
and these 150 image patches are used for basis training. The posterior on the H-KSBP (and
HDP) model parameters is inferred via the proposed MCVB algorithm, processing all 140
images simultaneously; as discussed in Section II, the HDP analysis is performed by a special
1The data are available at http://research.microsoft.com/vision/cambridge/recognition/
Fig. 3. Sample images used in the analysis.
setting of the H-KSBP parameters. The VB algorithm is deterministic given a specified
initialization; the proposed MCVB is, however, stochastic because of the random samples
involved. Moreover, given the large amount of data, we must randomly initialize the stick
memberships, which may cause convergence to local-optimal solutions. We therefore perform
the experiment ten times and average the results, to mitigate the influence of randomness and
to make the results less sensitive to the VB initialization.
Borrowing the successful “bag of words” assumption from text analysis [5], we assume each image
is a bag of atoms; this yields a measurable quantification of the inter-relationships between
images: similar images should share similar distributions over the mixture components. An
important aspect of the H-KSBP algorithm is that while in text analysis the “bag of words” may
be set a priori, here the “bag of atoms” is inferred from the data itself, within the clustering
Fig. 4. Matrix on the usage of atoms across the different images (atoms indexed along the vertical axis; images indexed along the horizontal axis, ordered as clouds, buildings, countryside, faces, fireworks, urban and offices). The size of each box represents the relative frequency with which a particular atom is manifested in a given image. These results are computed via H-KSBP.
process. Related concepts have been employed previously in image analysis [29], but in that
work one had to set the canonical set of image atoms (shapes) a priori, which is somewhat ad
hoc. The learning of these canonical atoms, across the multiple images, is an important aspect
of the proposed method. As the HDP is a special case of H-KSBP, it also may be used to
infer a “bag of atoms”, albeit without explicitly accounting for the spatial structure of the
segmentation.
As an example, for the data considered, we show one realization of H-KSBP in Figure 4. In
the figure, we display the canonical atom usage across all 140 images. Figure 4 is a count matrix,
where each square represents the relative number of counts in a given image for a particular
atom (atoms are indexed along the vertical axis in Figure 4). As demonstrated in this figure, the
distinct image classes tend to employ distinct ensembles of atom types, which is important
for the desired sorting of image classes.
Given the matrix depicted in Figure 4, it is of interest to examine the character of the atoms in
the imagery; examples of different atoms are depicted in Figure 5.
Fig. 5. Demonstration of different atoms as inferred by an example run of the H-KSBP algorithm. Each row of the figure corresponds to one atom (the 4th, 31st, 39th, 38th, 11th, 33rd and 30th atoms are shown). Every two images form a set, with the original image at left and the areas assigned to the particular atom shown at right.
Figure 5 gives a representation of most of the atoms. For example, the 4th, 31st and 39th atoms
are associated with clouds and sky; the 38th atom principally models buildings; and the 11th
atom is associated with trees and grasses. In performing the experiment, we also noticed that it
was relatively easy to segment clouds, fireworks, countryside and urban images, but harder to
obtain contiguous segments within office images (these typically have far more detail, and fewer
large regions of smooth texture; this may be less an issue of the H-KSBP than of the features
employed). An example of this difficulty is observable in Figure 6, as office images
are composed of many different atoms. Fortunately, the office images still tend to share similar
Fig. 6. Representative set of segmentation results, comparing H-KSBP and HDP. While these two algorithms tend to generally yield comparable segmentations for the images considered, the H-KSBP is generally more sensitive to details, with this sometimes yielding better segmentations (e.g., the top-left and bottom-right results).
usage of atoms so that they can be grouped together (sorted) when quantifying similarities
between images based on the histogram over atoms (discussed next). Example segmentation
results on images are given in Figure 6.
The results in Figure 6, in which both H-KSBP and HDP segmentation results are presented,
demonstrate general properties observed in the analysis of the images considered here: (i) the
segmentation characteristics of HDP were generally good, but on some occasions they were
markedly worse (less detailed) than those of H-KSBP; and (ii) the H-KSBP was generally more
sensitive to detailed textural differences in the images, thereby generally inferring a larger
number of principal atoms (an increased number of large sticks).
Fig. 7. The confusion matrix over image types (clouds, buildings, countryside, faces, fireworks, urban and offices), generated using H-KSBP.
To demonstrate the image-sorting potential of the H-KSBP, we compute the Kullback-Leibler
(KL) divergence between the atom histograms of any two images, averaging histograms of
the form in Figure 4 over ten random MCVB initializations. For each image, we rank its
similarity to all other images based on the associated KL divergence. Performance is addressed
quantitatively as follows. For each of the 140 images, we quantify via the KL divergence its
similarity to each of the other 139 images, thereby obtaining an ordered list. In Figure 7 we
present a confusion matrix, which represents the fraction of the top-ten members of this ordered
list that are within the same class (among the seven classes) as the image under test.
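The KL-based ranking described above can be sketched as follows; the smoothing constant and the (asymmetric) direction of the divergence are our own assumptions, as the text does not specify them:

```python
import numpy as np

def kl(p, q, eps=1e-10):
    """KL divergence between two discrete histograms, with additive
    smoothing (eps) so that zero counts do not produce infinities."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def rank_by_similarity(hists, i):
    """Order all other images by increasing KL divergence from image i's
    atom-usage histogram (most similar first)."""
    d = [(kl(hists[i], h), j) for j, h in enumerate(hists) if j != i]
    return [j for _, j in sorted(d)]

# Toy atom histograms for three images over three atoms:
hists = np.array([[8.0, 1.0, 1.0], [7.0, 2.0, 1.0], [1.0, 1.0, 8.0]])
order = rank_by_similarity(hists, 0)
```

For image 0 the toy ranking puts image 1 (similar atom usage) ahead of image 2 (very different atom usage), mirroring how similar image classes share similar distributions over atoms.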
As demonstrated in Figure 7, the H-KSBP performs well in distinguishing clouds, faces and
fireworks images. The buildings and urban images often share similar atoms, mainly
representing buildings, and therefore these are somewhat confused (reasonably so, it is felt). The
office images are often confused with other relatively complex scenes. Some typical image-ranking
results are given in Figure 8; clearly, for this dataset, the clouds are the most distinctive class type,
for they are always the least related to any of the other image classes. It was found that the HDP
produced sorting results very similar to those of H-KSBP (e.g., the associated confusion
matrix for HDP is similar to that in Figure 7), and therefore the HDP sorting results are omitted
for brevity. This indicates that while in some cases the HDP segmentation results are inferior
to those of H-KSBP, in general the ability of HDP and H-KSBP to sort images is comparable
(at least for the set of images considered).
The H-KSBP results on the 140-image database were computed in non-optimized Matlab™
software, on a PC with a 3 GHz CPU and 2 GB of memory. One run of the MCVB code required
about 3 hours for 80 iterations, with typically 40-50 iterations required to achieve
convergence. The H-KSBP and HDP algorithms had comparable computation times.
V. CONCLUSIONS
The kernel stick-breaking process (KSBP) [10] has been extended for use in image segmentation.
The KSBP algorithm explicitly imposes the belief that feature vectors that are generated from
proximate locations in an image are more likely to be associated with the same image segment.
The algorithm is semi-parametric in the sense that a parametric model is defined for generation of
the feature vectors (here a multivariate Gaussian), but the number of components in the associated
mixture model is assumed unknown and inferred from the data. The KSBP extends the stick-
breaking representation of the Dirichlet process (DP), in that it imposes spatially-dependent stick
weights, with a shared set of parameter “atoms” employed throughout the image.
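The spatially-dependent stick construction summarized above can be sketched as follows, assuming the Gaussian kernel K(r, Γ; ψ) = exp(−ψ‖r − Γ‖²); the values of V, Γ and ψ below are illustrative stand-ins, not inferred quantities:

```python
import numpy as np

def ksbp_weights(r, V, Gamma, psi):
    """Spatially-dependent stick weights at location r:
       pi_h(r) = V_h K(r, Gamma_h) * prod_{l<h} (1 - V_l K(r, Gamma_l)),
    with Gaussian kernel K(r, Gamma) = exp(-psi * ||r - Gamma||^2).
    Patches near a stick's basis location Gamma_h see a larger weight,
    which is what encourages spatially proximate patches to share a segment."""
    K = np.exp(-psi * np.sum((Gamma - r) ** 2, axis=1))
    probs = V * K
    remain = np.cumprod(np.concatenate(([1.0], 1.0 - probs[:-1])))
    return probs * remain

rng = np.random.default_rng(1)
V = rng.beta(1.0, 2.0, size=20)             # stick fractions (illustrative)
Gamma = rng.uniform(0, 100, size=(20, 2))   # kernel basis locations
pi = ksbp_weights(np.array([50.0, 50.0]), V, Gamma, psi=1e-3)
```

With a truncated number of sticks the weights sum to at most one; as ψ → 0 the kernel approaches 1 everywhere and the construction reduces to ordinary (spatially independent) stick-breaking, consistent with the HDP special case noted later.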
We have also extended this KSBP image-processing algorithm to the simultaneous segmentation
of multiple images. In this setting a KSBP is employed for each of the images under test, and
these KSBPs share a common base distribution. Specifically, the base distribution corresponds
Fig. 8. Sample image sorting results, as generated by H-KSBP. The top-left image is the original image, followed by the five most similar images and then the five most dissimilar images.
to a draw from a DP. Since such a draw is, with probability one, composed of an infinite set
of discrete atoms, this implies that all of the KSBPs share the same set of atoms. Therefore,
this imposes the belief that the clusters/segments associated with any given image are likely
related to clusters/segments associated with other images. The hierarchical KSBP (H-KSBP)
infers these cluster inter-relationships, and therefore infers the inter-relationships between the
images themselves. Therefore, the H-KSBP algorithm, with appropriate image features, offers
the opportunity to sort or organize multiple images. In the special case for which the kernel
parameters are set such that the sticks are spatial-position independent, the H-KSBP algorithm
reduces to the hierarchical Dirichlet process (HDP) [33].
In the problems of interest here the number of images under test may be large, and therefore an
MCMC analysis [10] is often computationally intractable. Therefore, we here have considered
an augmented variational Bayesian (VB) inference framework. Almost all nodes on the graphical
model of the H-KSBP are appropriate for VB inference (they possess the required conjugate
relationships); however, the presence of the kernel does require sampling, and therefore we have
implemented a hybrid sampling-VB inference algorithm, which has performed effectively in the
tests considered. To mitigate sensitivity of VB to initialization considerations and local-optimal
solutions, the final inference results are taken as the average of multiple runs, with different
initializations.
In the context of segmenting single images, while the DP-based method often worked well for the
tests considered, the KSBP algorithm typically was better, in the sense that it was more sensitive
to subtle variations in the textures of the image segments. The KSBP typically employed more
atoms (large sticks) than the DP-based segmentation algorithm, apparently to capture these more-subtle
textural differences. The same generally superior segmentation performance of H-KSBP
was observed relative to HDP, when segmenting multiple images simultaneously. In addition to
segmenting multiple images, the H-KSBP and HDP algorithms also yield information about the
inter-relationships between the images, based on the underlying sharing mechanisms inferred
among the associated clusters. For the images considered, it was found that the H-KSBP and
HDP yielded very similar sorting results.
Concerning future research, one typically measures imagery with a known temporal ordering.
One might anticipate that the degree of similarity in the images will be enhanced if they are
measured at more proximate times. The H-KSBP and HDP models assume that the multiple
images under test are exchangeable (their order is assumed irrelevant). To account for the order
with which the images are measured, and to further encourage sharing among images measured
more closely in time, the Dirichlet process from which the base distributions of the H-KSBP
and HDP are drawn may be generalized. Examples of how DP has been generalized for such
purposes are discussed in (for example) [9, 17, 22]. In fact, the KSBP may itself be employed
to generalize DP to account for time, where now rather than using spatial location as auxiliary
data one may use time.
ACKNOWLEDGMENTS
The research reported here was supported by the Department of Energy, NA-22. L.C. also thanks
Dr. Lawrence Chilton of PNNL for several motivating discussions.
REFERENCES
[1] M.J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby
Computational Neuroscience Unit, University College London, 2003.
[2] D. Blackwell and J.B. MacQueen. Ferguson distributions via Pólya urn schemes. The Annals
of Statistics, 1:353–355, 1973.
[3] D. Blei and M.I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian
Analysis, 1(1):121–144, 2005.
[4] D.M. Blei and M.I. Jordan. Variational methods for the Dirichlet process. In Proc. the
21st International Conference on Machine Learning, 2004.
[5] D.M. Blei and J.D. Lafferty. Correlated topic models. In Advances in Neural Information
Processing System, volume 18, 2005.
[6] P. Comon. Independent component analysis — a new concept? Signal Processing, 36:287–
314, 1994.
[7] C. Ding and X. He. K-means clustering via principal component analysis. In Proc. the
International Conference on Machine Learning, pages 225–232, 2004.
[8] J. Duan, M. Guindani, and A.E. Gelfand. Generalized spatial Dirichlet process models.
Biometrika, 2007.
[9] D.B. Dunson. Bayesian dynamic modeling of latent trait distributions. Biostatistics, 7:551–
568, 2006.
[10] D.B. Dunson and J.-H. Park. Kernel stick-breaking process. Biometrika, accepted for
publication.
[11] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures.
Journal of the American Statistical Association, 90:577–588, 1995.
[13] T.S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics,
1, 1973.
[14] M.A. Figueiredo. Bayesian image segmentation using Gaussian field priors. In Energy
Minimization Methods in Computer Vision and Pattern Recognition, pages 74–89. Berlin-
Heidelberg: Springer-Verlag, 2005.
[15] M.A. Figueiredo, D.S. Cheng, and V. Murino. Clustering under prior knowledge with
application to image segmentation. In Advances in Neural Information Processing System,
volume 19, 2007.
[16] Z. Ghahramani and M.J. Beal. Propagation algorithms for variational Bayesian learning.
In Advances in Neural Information Processing Systems 13, 2001.
[17] J.E. Griffin and M.F.J. Steel. Order-based dependent Dirichlet processes. Journal of the
American Statistical Association, in press, 2006.
[18] P. Hoyer and A. Hyvarinen. Independent component analysis applied to feature extraction
from colour and stereo images. Network: Computation in Neural Systems, 11(3):191–210,
2000.
[19] A. Hyvarinen and P. Hoyer. Emergence of phase- and shift-invariant features by
decomposition of natural images into independent feature subspaces. Neural Computation,
12(7):1705–1720, 2000.
[20] H. Ishwaran and L.F. James. Gibbs sampling methods for stick-breaking priors. Journal
of the American Statistical Association, 96:161–173, 2001.
[21] T. Kohonen. Self-Organizing Maps. Springer-Verlag, 1997.
[22] S.N. MacEachern. Dependent nonparametric processes. In ASA Proceedings of the Section
on Bayesian Statistical Science, Alexandria, VA, 1999. American Statistical Association.
[23] S.N. MacEachern and P. Müller. Estimating mixture of Dirichlet process models. Journal
of Computational and Graphical Statistics, 7:223–238, 1998.
[24] D.J.C. MacKay. Ensemble learning for hidden Markov models.
http://www.inference.phy.cam.ac.uk/mackay/abstracts/ensemblePaper.html, 1997.
[25] G.J. McLachlan and K.E. Basford. Mixture Models: Inference and Applications to
Clustering. Marcel Dekker, 1988.
[26] B. Nadler, S. Lafon, R.R. Coifman, and I.G. Kevrekidis. Diffusion maps, spectral clustering
and eigenfunctions of Fokker-Planck operators. In Advances in Neural Information
Processing Systems 18, 2006.
[27] A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In
Advances in Neural Information Processing Systems 13, 2001.
[28] P. Orbanz and J. M. Buhmann. Smooth image segmentation by nonparametric Bayesian
inference. In ECCV, volume 1, pages 444–457, 2006.
[29] P. Quelhas, F. Monay, J.-M. Odobez, D. Gatica-Perez, and T. Tuytelaars. A thousand words
in a scene. IEEE Trans. Pattern Analysis and Machine Intelligence, 9:1575–1589, 2007.
[30] C.E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information
Processing Systems, volume 12, pages 554–560, 2000.
[31] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650,
1994.
[32] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky. Describing visual scenes
using transformed Dirichlet processes. In NIPS 18, pages 1297–1304. MIT Press, 2006.
[33] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. Journal
of the American Statistical Association, 101:1566–1582, 2006.
[34] S. Thrun and J. O'Sullivan. Discovering structure in multiple learning tasks: The TC
algorithm. In Proc. the 13th International Conference on Machine Learning, 1996.
[35] G.C.G. Wei and M.A. Tanner. A Monte Carlo implementation of the EM algorithm and the
poor man's data augmentation algorithms. Journal of the American Statistical Association,
85:699–704, 1990.
[36] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with
Dirichlet process priors. Journal of Machine Learning Research, 8:35–63, 2007.
[37] K. Yu, V. Tresp, and S. Yu. A nonparametric hierarchical Bayesian framework for
information filtering. In Proc. the 27th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, 2004.
APPENDIX
Update q(z): Given the approximate distributions of the other variables, the variational
distribution q(z) can be derived as

q(z_{nm}) = \mathrm{Mult}(\xi_{nm,1}, \cdots, \xi_{nm,T}),  (1)

s_{nm,h} \propto V_{hm} \exp(-\psi_m \| r_{nm} - \Gamma_{hm} \|^2) \exp\Big( \sum_{l<h} \log\big(1 - V_{lm} \exp(-\psi_m \| r_{nm} - \Gamma_{lm} \|^2)\big) \Big)
    \times \exp\Big( \sum_{k=1}^{K} \upsilon_{hm,k} \Big( \frac{1}{2}\Big( \sum_{r=1}^{d} \Psi\Big(\frac{w_*^{new} + 1 - r}{2}\Big) + d \log 2 + \log|\Sigma_*^{new}| \Big)
    - \frac{1}{2}(x_{nm} - \mu_{0k}^{new})^T w_*^{new} \Sigma_*^{new} (x_{nm} - \mu_{0k}^{new}) - \frac{d}{2\eta_k^{new}} \Big) \Big),  (2)

\xi_{nm,h} = \frac{s_{nm,h}}{\sum_{h'=1}^{T} s_{nm,h'}},  (3)

where \langle\cdot\rangle denotes the expectation over all variables except the one being updated, \Psi(\cdot) is the
digamma function, and d is the dimensionality of the feature vector. Equation (2) results from
the sampling of V_{hm}, where the expectation reduces to an evaluation over the samples.
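In practice the unnormalized terms in (2) are accumulated in the log domain, so the normalization in (3) is best done with a log-sum-exp step; this is a standard numerical device, our own addition rather than something specified in the text:

```python
import numpy as np

def normalize_log_weights(log_s):
    """Turn unnormalized log weights log s_{nm,h} into multinomial
    probabilities xi_{nm,h} = s_h / sum_h' s_h', stably: subtracting the
    max before exponentiating avoids underflow to all-zero weights."""
    m = np.max(log_s, axis=-1, keepdims=True)
    w = np.exp(log_s - m)
    return w / w.sum(axis=-1, keepdims=True)

# Naive exp() of these would underflow to 0/0; the shifted version is exact:
log_s = np.array([-1000.0, -1001.0, -1002.0])
xi = normalize_log_weights(log_s)
```

The result is invariant to any constant shift of the log weights, which is exactly the freedom the proportionality in (2) leaves.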
Update q(t): Similar to the update for z, each stick indicator variable t_{hm} has a posterior in
the form of a multinomial distribution:

q(t_{hm}) = \mathrm{Mult}(\upsilon_{hm,1}, \cdots, \upsilon_{hm,K}),  (4)

u_{hm,k} \propto \exp(\Psi(a_k) - \Psi(a_k + b_k)) \prod_{l<k} \exp(\Psi(b_l) - \Psi(a_l + b_l))
    \times \exp\Big( \sum_{n} \xi_{nm,h} \Big( \frac{1}{2}\Big( \sum_{r=1}^{d} \Psi\Big(\frac{w_*^{new} + 1 - r}{2}\Big) + d \log 2 + \log|\Sigma_*^{new}| \Big)
    - \frac{1}{2}(x_{nm} - \mu_k^{new})^T w_*^{new} \Sigma_*^{new} (x_{nm} - \mu_k^{new}) - \frac{d}{2\eta_k^{new}} \Big) \Big),  (5)

\upsilon_{hm,k} = \frac{u_{hm,k}}{\sum_{k'=1}^{K} u_{hm,k'}}.  (6)
Sample V: As indicated above, there is no closed form for the variational distribution of V. We
therefore employ an MCMC method to sample V from its conditional posterior distribution. We
introduce two auxiliary variables, A_{nm,h} \sim \mathrm{Bernoulli}(V_{hm}) and B_{nm,h} \sim \mathrm{Bernoulli}(K(r_{nm}, \Gamma_{hm}, \psi_m)),
with z_{nm} = \min\{h : A_{nm,h} = B_{nm,h} = 1\}. We can then alternate between

(i) sampling (A_{nm,h}, B_{nm,h}) from the conditional distribution given z_{nm}:

p(A_{nm,h} = B_{nm,h} = 0 \mid z_{nm}) = \frac{(1 - V_{hm})(1 - K(r_{nm}, \Gamma_{hm}, \psi_m))}{1 - V_{hm} K(r_{nm}, \Gamma_{hm}, \psi_m)},

p(A_{nm,h} = 0, B_{nm,h} = 1 \mid z_{nm}) = \frac{(1 - V_{hm}) K(r_{nm}, \Gamma_{hm}, \psi_m)}{1 - V_{hm} K(r_{nm}, \Gamma_{hm}, \psi_m)},

p(A_{nm,h} = 1, B_{nm,h} = 0 \mid z_{nm}) = \frac{V_{hm}(1 - K(r_{nm}, \Gamma_{hm}, \psi_m))}{1 - V_{hm} K(r_{nm}, \Gamma_{hm}, \psi_m)},

for h = 1, 2, \cdots, z_{nm} - 1 and n = 1, 2, \cdots, M, with A_{nm,h} = B_{nm,h} = 1 for h = z_{nm} and
n = 1, 2, \cdots, M; and

(ii) updating V_{hm} by sampling from the conditional posterior distribution

V_{hm} \sim \mathrm{Beta}\Big(1 + \sum_{n: z_{nm} \ge h} A_{nm,h},\; \alpha + \sum_{n: z_{nm} \ge h} (1 - A_{nm,h})\Big).  (7)
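Steps (i) and (ii) can be sketched for a single stick index h as follows; the shapes and bookkeeping are simplified relative to the full model (one image, scalar V_h, precomputed kernel values), and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_A_given_z(V_h, K_h, z, h):
    """Step (i): draw the auxiliary A_{n,h} given assignments z_n.
    For h == z_n, A = B = 1 by construction; for h < z_n, A is drawn
    from the allowed (A, B) configurations; only A is kept, since only
    A enters the Beta update in step (ii)."""
    N = len(z)
    A = np.zeros(N, dtype=int)
    denom = 1.0 - V_h * K_h          # normalizer 1 - V_h K(.)
    for n in range(N):
        if z[n] == h:
            A[n] = 1
        elif z[n] > h:
            # P(A=1, B=0 | z_n > h); remaining mass covers the A = 0 cases.
            p_A1 = V_h * (1.0 - K_h[n]) / denom[n]
            A[n] = rng.random() < p_A1
    return A

def sample_V(A, z, h, alpha):
    """Step (ii): conjugate Beta update (7) from the auxiliaries,
    restricted to observations with z_n >= h."""
    active = z >= h
    return rng.beta(1 + A[active].sum(), alpha + (1 - A[active]).sum())

z = np.array([0, 1, 2, 2, 3])        # illustrative stick assignments
K_h = np.full(len(z), 0.5)           # kernel evaluations at each location
A = sample_A_given_z(0.4, K_h, z, h=1)
V_new = sample_V(A, z, h=1, alpha=1.0)
```

The auxiliary variables restore conjugacy: conditioned on A, the update for V_h is an ordinary Beta draw, which is what makes this sampling step cheap inside the otherwise-VB loop.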
Update q(β): The update of q(β) is straightforward given the other variational distributions:

q(\beta'_k) = \mathrm{Beta}(a_k, b_k),  (8)

a_k = 1 + \sum_{m,h} \upsilon_{hm,k},  (9)

b_k = \gamma + \sum_{m,h} \sum_{l>k} \upsilon_{hm,l},  (10)

and the stick length \beta_k can be represented as

\beta_k = \beta'_k \prod_{l<k} (1 - \beta'_l).  (11)
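The map from the Beta posteriors (8)-(10) to expected stick lengths via (11) can be sketched as follows (a minimal numerical illustration, using the Beta mean E[β'_k] = a_k/(a_k + b_k)):

```python
import numpy as np

def expected_sticks(a, b):
    """Expected stick lengths under q: E[beta'_k] = a_k / (a_k + b_k),
    then beta_k = beta'_k * prod_{l<k} (1 - beta'_l) as in (11)."""
    bp = a / (a + b)
    remain = np.cumprod(np.concatenate(([1.0], 1.0 - bp[:-1])))
    return bp * remain

a = np.array([2.0, 3.0, 1.0])   # illustrative Beta parameters
b = np.array([4.0, 3.0, 1.0])
beta = expected_sticks(a, b)
```

With a finite truncation the expected sticks are positive and sum to less than one; the residual mass belongs to the unrepresented tail of the stick-breaking construction.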
Update q(θ): The distribution of θ is composed of two parts, one for the mean vector and
one for the precision matrix. Given the distributions of both indicators and samples, we obtain
an analytical form for q(θ):

q(\theta_k) = N(\mu_{0k}^{new}, \eta_k^{new} \Sigma_k) \, W(w_*^{new}, \Sigma_*^{new}),  (12)

\mu_{0k}^{new} = \frac{N_k \bar{x}_k + \eta_0 \mu_0}{N_k + \eta_0}, \quad \eta_k^{new} = \eta_0 + N_k, \quad w_*^{new} = w_* + N_k,

\Sigma_*^{new} = \Big( \Sigma_*^{-1} + \sum_{m,n} q(t_{z_{nm} m} = k)(x_{nm} - \bar{x}_k)(x_{nm} - \bar{x}_k)^T
    + \frac{N_k \eta_0}{N_k + \eta_0} (\bar{x}_k - \mu_0)(\bar{x}_k - \mu_0)^T \Big)^{-1},  (13)

where

N_k = \sum_{m,n} q(t_{z_{nm} m} = k), \quad \bar{x}_k = \frac{\sum_{m,n} q(t_{z_{nm} m} = k) \, x_{nm}}{N_k}.
Update q(Γ): Ideally the basis location \Gamma_{hm} would be estimated continuously over the entire
space; however, this results in non-conjugacy. We hence place a discrete prior on \Gamma_{hm}, i.e.,
p(\Gamma_{hm}) = \sum_r a_r \delta_{\Gamma_r}, and the associated posterior takes the form

q(\Gamma_{hm}) = \sum_r a_r^{new} \delta_{\Gamma_r},  (14)

a_r^{new} \propto a_r \exp\Big( \sum_n \xi_{nm,h} (\log V_{hm} - \psi_m \| r_{nm} - \Gamma_r \|^2) \Big)
    \exp\Big( \sum_n \sum_{l>h} \xi_{nm,l} \log\big(1 - V_{hm} \exp(-\psi_m \| r_{nm} - \Gamma_r \|^2)\big) \Big),  (15)

where \{\Gamma_r\} constitutes a grid of candidate locations (for imagery, with discrete pixels, this is
quite realistic and practical). For computational efficiency, we fix the number of candidate basis
locations, and in each iteration candidates with very low weights are dropped and replaced with
new locations sampled uniformly at random. Even though q(\Gamma_{hm}) can be derived analytically,
it severely complicates the computation of the other variables; we therefore instead sample
\Gamma_{hm} from its approximate posterior distribution.
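Sampling Γ_hm from the discrete posterior (14)-(15), together with the candidate-refreshing heuristic described above, might be sketched as follows; the low-weight threshold and grid extent are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_discrete_location(weights, candidates):
    """Draw Gamma_hm from a discrete posterior over candidate grid points,
    with probabilities proportional to the (unnormalized) weights a_r^new."""
    p = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=p)]

def refresh_candidates(weights, candidates, low=1e-3, extent=(100.0, 100.0)):
    """Replace low-weight candidates with fresh uniform draws, keeping the
    total number of candidates fixed (as described in the text).
    `low` and `extent` are illustrative choices, not values from the paper."""
    out = candidates.copy()
    dead = weights / weights.sum() < low
    out[dead] = rng.uniform(0, 1, size=(int(dead.sum()), 2)) * np.asarray(extent)
    return out

cands = rng.uniform(0, 100, size=(8, 2))   # candidate pixel locations
w = rng.random(8)                          # unnormalized posterior weights
g = sample_discrete_location(w, cands)
cands2 = refresh_candidates(w, cands)
```

Discretizing the support trades a small approximation for conjugate-style updates over a finite label set, and the refresh step keeps the grid from becoming permanently stuck in unpromising regions.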
Up to now, the hyper-parameters α, γ and ψ_m have been assumed constant during inference of
the other parameters. However, the model performance may be sensitive to the settings of these
hyper-parameters; we therefore relax this assumption and derive update equations for them,
after placing non-informative priors.
Update q(α): We place a gamma prior on α with parameters \tau_{10} and \tau_{20}, and the variational
posterior distribution q(α) is

q(\alpha) = \mathrm{Gamma}(\tau_1, \tau_2),  (16)

\tau_1 = \tau_{10} + M(T - 1), \quad \tau_2 = \tau_{20} - \sum_{m=1}^{M} \sum_{h=1}^{T} \log(1 - V_{hm}).  (17)
Update q(γ): Similar to the update of α, we also place a gamma prior on γ, and the approximate
posterior distribution is

q(\gamma) = \mathrm{Gamma}(\tau_3, \tau_4),  (18)

\tau_3 = \tau_{30} + K - 1, \quad \tau_4 = \tau_{40} - \sum_{k=1}^{K} (\Psi(b_k) - \Psi(a_k + b_k)).  (19)
Update ψ: Since the image size and object size can vary dramatically between images, in
practice no universal kernel width is applicable to all images and atoms. To adaptively choose
the kernel width for each image and atom, we place a discrete prior on \psi_{hm}, p(\psi_{hm}) = \sum_r b_r \delta_{\psi_r}.
As with the inference of \Gamma_{hm}, we sample \psi_{hm} to facilitate the computation:

q(\psi_{hm}) = \sum_r b_r^{new} \delta_{\psi_r},  (20)

b_r^{new} \propto b_r \exp\Big( \sum_n \xi_{nm,h} (\log V_{hm} - \psi_r \| r_{nm} - \Gamma_{hm} \|^2) \Big)
    \exp\Big( \sum_n \sum_{l>h} \xi_{nm,l} \log\big(1 - V_{hm} \exp(-\psi_r \| r_{nm} - \Gamma_{hm} \|^2)\big) \Big),  (21)

where \{\psi_r\} are candidate kernel widths.