Hierarchical Kernel Stick-Breaking Process
for Multi-Task Image Analysis
Qi An, Chunping Wang, Ivo Shterev, Eric Wang, †David Dunson and Lawrence Carin
Department of Electrical and Computer Engineering
†Institute of Statistics and Decision Sciences
Duke University
Durham, NC 27708-0291
{qa,cw36,is33,ew28,lcarin}@ece.duke.edu, [email protected]
Abstract
The kernel stick-breaking process (KSBP) is employed to segment general imagery, imposing the
condition that patches (small blocks of pixels) that are spatially proximate are more likely to be
associated with the same cluster (segment). The number of clusters is not set a priori and is inferred
from the hierarchical Bayesian model. Further, KSBP is integrated with a shared Dirichlet process
prior to simultaneously model multiple images, inferring their inter-relationships. This latter application
may be useful for sorting and learning relationships between multiple images. The Bayesian inference
algorithm is based on a hybrid of variational Bayesian analysis and local sampling. In addition to
providing details on the model and associated inference framework, example results are presented for
several image-analysis problems.
I. INTRODUCTION
The segmentation of general imagery is a problem of longstanding interest. There have been
numerous techniques developed for this purpose, including K-means and associated vector quan-
tization methods [7, 21], statistical mixture models [25], graph-diffusion techniques [26], as
well as spectral clustering [27]. This list of existing methods is not exhaustive, although these
methods share attributes associated with most existing algorithms. First, the clustering is based
on the features of the image (at the pixel level or for small contiguous patches of pixels), and
when clustering these features one typically does not account for their physical location within
the image (although the location may be appended as a feature component). Secondly, the
segmentation or clustering of images is typically performed one image at a time, and therefore
there is no attempt to relate the segments of one image to segments in other images (i.e., to
learn inter-relationships between multiple images). The latter concept may be important when
considering a new unlabeled image with the goal of relating components of it to images labeled
previously, and it may also be of interest when sorting or searching a database of images. Finally,
in many of the techniques cited above one must a priori set the number of anticipated segments
or clusters. The techniques developed in this paper seek to perform clustering or segmentation in
a manner that explicitly accounts for the physical locations of the features within the image, and
multiple images are segmented simultaneously (termed “multi-task learning”) to infer their inter-
relationships. Moreover, the analysis is performed in a semi-parametric manner, in the sense that
the number of segments or clusters is not set a priori, and is inferred from the data. There has
been recent research wherein spatial information has been exploited when clustering [14, 15, 28],
but that segmentation has been performed one image at a time, and therefore not in a multi-task
setting. Another recent paper of relevance to the method considered here is [32].
To address these goals within a statistical setting, we employ a class of hierarchical models related
to the Dirichlet process (DP) [2, 12, 13, 23]. The Dirichlet process is a statistical prior that may
be summarized succinctly as follows. Assume that the nth patch is represented by feature vector
x_n, and that the total image is composed of N such feature vectors {x_n}_{n=1}^{N}. The feature vector
associated with each patch is assumed drawn from a parametric distribution f(φ_n), where φ_n
represents the parameters associated with the nth feature vector. In a DP-based hierarchical
model for generation of the feature vectors we have
x_n | φ_n ∼ f(φ_n)   (independently)
φ_n | G ∼ G   (i.i.d.)                                            (1)
G | α, G_0 ∼ DP(α G_0)
The DP is characterized by the non-negative parameter α and the “base” measure G0. To make
the clustering properties of (1) more apparent, we recall the stick-breaking construction of DP
developed by Sethuraman [31]. Specifically, in [31] it was shown that a draw G from DP(α G_0)
may be expressed as
G = Σ_{h=1}^{∞} π_h δ_{θ_h}

π_h = V_h ∏_{l=1}^{h-1} (1 − V_l)                                 (2)

V_h ∼ Beta(1, α)   (i.i.d.)

θ_h ∼ G_0   (i.i.d.)
This is termed a “stick-breaking” representation of DP because one sequentially breaks off
“sticks” of length π_h from an original stick of unit length (Σ_{h=1}^{∞} π_h = 1). As a consequence
of the properties of the distribution Beta(1, α), for relatively small α it is likely that only a
relatively small set of sticks πh will have appreciable weight/size, and therefore when drawing
parameters φn from the associated G it is probable multiple φn will share the same “atoms” θh
(those associated with the large-amplitude sticks). The parameter α therefore plays an important
role in defining the number of clusters that are constituted, and therefore in practice one typically
places a non-informative Gamma prior on α [36].
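To make the construction in (2) concrete, a truncated version of Sethuraman's stick-breaking can be sketched numerically; the truncation level, the standard-normal base measure, and the value of α below are illustrative assumptions, not values used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_stick_breaking(alpha, base_draw, truncation=100):
    """Truncated stick-breaking approximation of G ~ DP(alpha, G0):
    pi_h = V_h * prod_{l<h} (1 - V_l), with V_h ~ Beta(1, alpha)."""
    V = rng.beta(1.0, alpha, size=truncation)
    pi = V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))
    theta = np.array([base_draw() for _ in range(truncation)])
    return pi, theta

# Base measure G0: standard normal (an illustrative choice).
pi, theta = dp_stick_breaking(alpha=1.0, base_draw=lambda: rng.normal())

# For small alpha, a few sticks consume most of the unit stick, so draws
# from G concentrate on a few atoms theta_h (i.e., a few clusters).
print(pi[:5], round(pi.sum(), 6))
```

The leftover stick length after T sticks has expectation (α/(1+α))^T, so the truncated weights sum to nearly one for moderate T.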
The form of the model in (1) and (2) imposes the prior belief that the feature vectors {x_n}_{n=1}^{N}
associated with an image should cluster, and the data are used to infer the most probable
clustering distribution, via the posterior distribution on the parameters {φ_n}_{n=1,N}. Such semi-
parametric clustering has been studied successfully in many settings [11, 12, 30, 36]. However,
there are two limitations of such a model, with these defining the focus of this paper. First,
while the model in (1) and (2) captures our belief that the feature vectors should cluster,
it does not impose our additional belief that the probability that two feature vectors are in
the same cluster should increase as their physical location within the image becomes more
proximate; this is an important factor when one is interested in segmenting an image into
contiguous regions. Secondly, typical semi-parametric clustering has been performed one image
or dataset at a time, and here we wish to cluster multiple images simultaneously, to infer the
inter-relationships between clusters in different images, thereby inferring the inter-relationships
between the associated multiple images themselves.
As an extension of the DP-based mixture model, we here consider the recently developed kernel-
stick-breaking process (KSBP) [10], introduced by Dunson and Park. As detailed below, this
model is similar to that in (2), but now the stick-breaking process is augmented to impose the
prior belief associated with spatially proximate patches (that the associated feature vectors are
more likely to be contained within the same cluster). As the name suggests, a kernel is employed
to quantify the closeness of patches spatially, and it is also used to infer a representative set
of atoms that capture the variation in the feature vectors {xn}n=1,N . While this is a relatively
sophisticated model, its component parts are represented in terms of simple/standard distributions,
and therefore inference is relatively efficient. In [10] a Markov chain Monte Carlo (MCMC)
sampler was used to estimate the posterior on the model parameters. In the work considered
here we are interested in relatively large data sets, and therefore we develop an inference engine
that exploits ideas from variational Bayesian analysis [3, 4, 16, 36].
There are problems for which one may wish to perform segmentation on multiple images
simultaneously, with the goal of inferring the inter-relationships between the different images.
This is referred to as multi-task learning (MTL) [34, 36, 37], where here each “task” corresponds
to clustering feature vectors from a particular image. There are at least three applications of MTL
in the context of image analysis: (i) one may have a set of images, some of which are labeled,
and others of which are unlabeled, and by performing an MTL analysis on all of the images one
may infer labels for segments of the unlabeled images, by drawing upon the relationships to the
labeled imagery; (ii) by inferring the inter-relationships between the different images, one may
sort the images as well as sort components within the images; (iii) one may identify abnormal
images and locations within an image in an unsupervised manner, by flagging those locations
that are allocated to a texture that is locally rare, based on a collection of images.
As indicated above the DP is a convenient framework for semi-parametric clustering of multiple
“tasks”. Moreover, the KSBP algorithm is also in the same family of hierarchical statistical
models. Therefore, as presented below, it is convenient to simultaneously cluster/segment multiple
images by linking the multiple associated KSBP models by an overarching DP. Note that we
argued against using the DP for analysis of the individual images because the DP does not impose
our additional belief about the relationship of feature vectors as a function of their location within
the image. In the same manner, if one had additional information about the multiple images (e.g.,
the times and/or locations at which they were measured), this may also be incorporated by using a
more-sophisticated model than the DP, to link the task-dependent KSBPs. For example, methods
that have taken into account the time-dependence of the observed data include [9, 22]. Methods
of this type could be used to replace the DP if spatial or temporal prior knowledge was available
with regard to the multiple images, but here such is not assumed.
Concerning the feature vectors xn, we adopt the independent feature subspace analysis (ISA)
method proposed by Hyvarinen and Hoyer [19], which projects the pixel values to a subspace
spanned by basis vectors, and uses the norms of the projection coefficients as features. The ISA
features have proven to be an appealing representation for image analysis, and encouraging results
are also presented here. However, the basic statistical framework developed here is applicable
to general features.
The remainder of the paper is organized as follows. In Section II we review the basic kernel
stick-breaking process (KSBP) model, and in Section III we extend this to multi-task KSBP
(hierarchical KSBP, or H-KSBP) for analysis of multiple images. Example results are presented
in Section IV. Conclusions and future research directions are discussed in Section V.
II. KERNEL STICK-BREAKING PROCESS
A. KSBP prior for image processing
The stick-breaking representation of the Dirichlet process (DP) was summarized in (2), and this
has served as the basis of a number of generalizations of the DP. The limitation of the DP for the
image-processing application of interest is that the clustering or segmentation assumes that the
feature vectors are exchangeable. This means that the location of the features within the image
may be exchanged arbitrarily, and the DP clustering will not change. It is desirable to exploit
the known spatial position of the features within the image, and impose that it is more likely
for feature vectors to be clustered together if they are physically proximate (this corresponds
to removing the exchangeability assumption in DP). Goals of this type have motivated many
modifications to the DP. The dependent DP (DDP) proposed by MacEachern [22] assumes a
fixed set of weights, π, while allowing the atoms θ = {θ1, · · · , θN} to vary with the predictor
x according to a stochastic process. Alternatively, rather than employing fixed weights, Griffin
and Steel [17] and Duan et al. [8] developed models to allow the weights to change with the
predictor. Griffin and Steel’s approach incorporates the dependency by allowing a predictor-
dependent ordering of the weights in the stick-breaking construction, while Duan et al. use a
multivariate beta distribution.
Dunson and Park [10] have proposed the kernel stick-breaking process (KSBP), which is partic-
ularly attractive for image-processing applications. In [10] the KSBP was presented in a more-
general setting than considered here; the following discussion focuses specifically on the image-
processing application. Rather than simply considering the feature vectors {xn}n=1,N , we now
consider {xn, rn}n=1,N , where rn is tied to the location of the pixel or block of pixels used
to constitute feature vector xn. We let K(r, r′, ψ) → [0, 1] define a bounded kernel function
with parameter ψ, where r and r′ represent general locations in the image of interest. One
may choose to place a prior on the kernel parameter ψ; this issue is revisited below. A draw
Gr ∼ KSBP (a, b, ψ,Go, H) from a KSBP prior is a function of position r (and valid for all
r), and is represented as
G_r = Σ_{h=1}^{∞} π_h(r; V_h, Γ_h, ψ) δ_{θ_h}

π_h(r; V_h, Γ_h, ψ) = V_h K(r, Γ_h, ψ) ∏_{l=1}^{h-1} [1 − V_l K(r, Γ_l, ψ)]

V_h ∼ Beta(a, b)

Γ_h ∼ H

θ_h ∼ G_0                                                         (3)
Dunson and Park [10] prove the validity of Gr as a probability measure. Comparing (2) and
(3), both priors take the general form of a stick-breaking representation, while the KSBP prior
possesses several interesting properties. For example, the stick weights πh(r; Vh, Γh, ψ) are a
function of r. Therefore, although the atoms {θh}h=1,∞ are the same for all r, the weights
πh(r; Vh, Γh, ψ) effectively shift the probabilities of different θh based on r. The Γh serve to
localize in r regions (clusters) in which the weights πh(r; Vh, Γh, ψ) are relatively constant, with
the size of these regions tied to the kernel parameter ψ.
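The location dependence of the weights in (3) is easy to see in a truncated numerical sketch; the Gaussian kernel (introduced in Section II-B), the truncation level, the 64 × 64 image extent, and the parameter values below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def ksbp_weights(r, V, Gamma, psi):
    """Location-dependent KSBP stick weights pi_h(r; V_h, Gamma_h, psi),
    with kernel K(r, Gamma, psi) = exp(-psi * ||r - Gamma||^2)."""
    K = np.exp(-psi * np.sum((Gamma - r) ** 2, axis=1))
    VK = V * K
    return VK * np.cumprod(np.concatenate(([1.0], 1.0 - VK[:-1])))

T = 50                                   # truncation level (assumption)
V = rng.beta(1.0, 1.0, size=T)           # a = 1, b = alpha = 1
Gamma = rng.uniform(0.0, 64.0, size=(T, 2))  # H uniform over the image
psi = 0.05                               # kernel width (assumed value)

r1 = np.array([10.0, 10.0])
r2 = np.array([11.0, 10.0])              # spatially proximate to r1
r3 = np.array([60.0, 60.0])              # distant from r1

pi1, pi2, pi3 = (ksbp_weights(r, V, Gamma, psi) for r in (r1, r2, r3))

# Nearby locations see nearly identical weights over the shared atoms,
# while distant locations favor different atoms.
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(pi1, pi2), cos(pi1, pi3))
```

Note that the weights at a given r need not sum to one under truncation; the leftover mass shrinks as T grows.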
If f(φn) is the parametric model (with parameter φn) responsible for the generation of xn, we
now assume that the augmented data {xn, rn}n=1,N are generated as
x_n ∼ f(φ_n)   (independently)

φ_n ∼ G_{r_n}   (independently)

G_r ∼ KSBP(a, b, ψ, G_0, H)                                       (4)
As (but one) example, f(φn) may represent a Gaussian distribution, and φn may represent the
associated mean vector and covariance matrix. In this representation the vector Γh defines the
localized region in the image at which particular atoms θh are likely to be drawn, with the
probability of a given atom tied to πh(r; Vh, Γh, ψ). The generative model in (4) states that two
feature vectors that come from the same region in the image (defined via r) will have similar
πh(r; Vh, Γh, ψ), and therefore they are likely to share the same atoms θh. The settings of a and
b control how much similarity there will be in drawn atoms for a given spatial cluster centered
about a particular Γh. If we set a = 1 and b = α, analogous to the DP, small α will impose
that Vh is likely to be near one, and therefore only a relatively small number of atoms θh are
likely to be dominant for a given cluster spatial center Γh. On the other hand, if two features are
generated from distant parts of a given image, the associated atoms θh that may be prominent
for each feature vector are likely to be different, and therefore it is of relatively low probability
that these feature vectors would have been generated via the same parameters φ. It is possible
that the model may infer two distinct and widely separated clusters/segments with the similar
dominant parameters (atoms); if the KSBP base distribution Go is a draw from a DP (as it will be
below when we consider processing of multiple images), then two distinct and widely separated
clusters/segments may have identical atoms.
For the case a = 1 and b = α, which we consider below, we employ the notation G_r ∼
KSBP(α, ψ, G_0, H); we emphasize that this draw is not meant to be valid for any one r, but
for all r. In practice we may also place a non-informative Gamma prior on α. Below we
will also assume that f(φ) corresponds to a multivariate Gaussian distribution, in the manner
indicated above.
B. Spatial correlation properties
As indicated above, the functional form of the kernel function is important and needs to be
chosen carefully. A commonly used kernel is given as K(r, Γ, ψ) = exp (−ψ‖r − Γ‖2) for
ψ > 0, which allows the associated stick weight to change continuously from V_h ∏_{l=1}^{h-1} (1 − V_l)
to 0, conditional on the distance between r and Γ. By choosing a kernel we are also implicitly
imposing the dependency between the priors of two samples, Gr and Gr′ . Specifically, both priors
are encouraged to share the same atoms θh if r and r′ are close, with this discouraged otherwise.
Dunson and Park [10] derive the correlation coefficient between two probability measures Gr
and Gr′ to be
corr{G_r, G_{r′}} = [ Σ_{h=1}^{∞} π_h(r; V_h, Γ_h, ψ) π_h(r′; V_h, Γ_h, ψ) ] / ( { Σ_{h=1}^{∞} π_h(r; V_h, Γ_h, ψ)² }^{1/2} { Σ_{h=1}^{∞} π_h(r′; V_h, Γ_h, ψ)² }^{1/2} ),     (5)
The coefficient in (5) approaches unity in the limit as r → r′. Since the correlation is a strong
function of the kernel parameter ψ, below we will consider a distinct ψh for each Γh, with these
drawn from a prior distribution. This implies that the spatial extent within the image over which
a given component is important will vary as a function of the location (to accommodate textural
regions of different sizes).
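Under the same kind of truncated sketch, the correlation coefficient in (5) can be evaluated numerically; the Gaussian kernel, truncation level, and parameter values below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def ksbp_weights(r, V, Gamma, psi):
    # pi_h(r; V_h, Gamma_h, psi) = V_h K(r, Gamma_h, psi) prod_{l<h} [1 - V_l K(r, Gamma_l, psi)]
    K = np.exp(-psi * np.sum((Gamma - r) ** 2, axis=1))
    VK = V * K
    return VK * np.cumprod(np.concatenate(([1.0], 1.0 - VK[:-1])))

def corr_G(r, rp, V, Gamma, psi):
    """Truncated evaluation of the correlation coefficient in (5)."""
    a = ksbp_weights(r, V, Gamma, psi)
    b = ksbp_weights(rp, V, Gamma, psi)
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

T = 200                                  # truncation level (assumption)
V = rng.beta(1.0, 1.0, size=T)
Gamma = rng.uniform(0.0, 64.0, size=(T, 2))
psi = 0.05                               # kernel width (assumed value)

r = np.array([20.0, 20.0])
cs = [corr_G(r, r + np.array([d, 0.0]), V, Gamma, psi) for d in (0.1, 5.0, 40.0)]
print(cs)  # the correlation decays as the separation d grows
```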
III. MULTI-TASK IMAGE SEGMENTATION WITH A HIERARCHICAL KSBP
We now consider the problem for which we wish to jointly segment M images, where each
image has an associated set of feature vectors with location information, in the sense dis-
cussed above. Aggregating the data across the M images, we have the set of feature vectors
{xnm, rnm}n=1,Nm; m=1,M . The image sizes may be different, and therefore the number of feature
vectors Nm may vary between images. The premise of the model discussed below is that the
cluster or segment characteristics may be similar between multiple images, and the inference of
these inter-relationships may be of value. Note that the assumption is that sharing of clusters
may be of relevance for the feature vectors xnm, but not for the associated locations rnm (the
characteristics of feature-vector clusters may be similar between two images, but the associated
clusters may reside in different locations within the respective images).
A. Model
A relatively simple means of sharing feature-vector clusters between the different images is to
let each image be processed with a separate KSBP (αm, ψm, Gm, Hm). To achieve the desired
sharing of feature-vector clusters between the different images, we impose that Gm ≡ G and G is
drawn G ∼ DP (γ, Go). Recalling the stick-breaking form of a draw from DP (γ,Go), we have
G = Σ_{h=1}^{∞} π_h δ_{θ_h}, in the sense summarized in (2). The discrete form of G is very important, for it
implies that the different Gm will share the same set of discrete atoms {θh}h=1,∞. It is interesting
to note that for the case in which the kernel parameter ψ is set such that K(r, Γh, ψ) → 1,
this model reduces to the hierarchical Dirichlet process (HDP) [33]. Therefore, the principal
difference between the HDP and the hierarchical KSBP (H-KSBP) is that for the latter we
impose that it is probable that feature vectors extracted from proximate regions in the image are
likely to be clustered together.
Assuming that f(φ) corresponds to a Gaussian distribution, the H-KSBP model is represented
as
x_{nm} ∼ N(φ_{nm})   (independently)

φ_{nm} ∼ G_{r_{nm}}   (independently)                             (6)

G_{r_{nm}} ∼ KSBP(α_m, ψ_m, G, H_m)   (i.i.d.)

G ∼ DP(γ, G_0)
Assume that G is composed of the atoms {θh}h=1,∞, from the perspective of the stick-breaking
representation in (2). These same atoms are shared across all {Grnm}m=1,M drawn from the
associated KSBPs, but with respective stick weights unique to the different images, and a
function of position within a given image; in (6) we again emphasize that the draw from
KSBP (αm, ψm, G, Hm) is valid for all r, and we here evaluate it for all {rnm}m=1,M . The
sharing of atoms {θh}h=1,∞ imposes the belief that feature vectors from clusters associated with
multiple images may have been drawn from the same Gaussian distribution N (φ), where φ
represents the mean vector and covariance of the Gaussian; in practice the base distribution
Go in (6) corresponds to a normal-Wishart distribution. The posterior inference allows one to
infer which clusters of features are unique to a particular image, and which clusters are shared
between multiple images. The density functions Hm are tied to the support of the mth image,
and in practice this is set as uniform across the image extent. The distinct αm, for each of which
a Gamma hyper-prior may be imposed, encourages that the number of clusters (segments) may
vary between the different images, although one may simply wish to set αm = α for all M
tasks.
For notational convenience, in (6) it was assumed that the kernel parameter ψm varied between
tasks, but was fixed for all sticks within a given task; this is overly restrictive. In the implemen-
tation that follows the parameter ψ may vary across tasks and across the task-specific KSBP
sticks.
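The generative mechanism just described (shared global atoms, task-specific location-dependent sticks) can be sketched compactly; the Gaussian atoms with identity covariance, the 2-D feature space, the Gaussian kernel, and all truncation levels and hyper-parameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

K, T, M = 20, 30, 3                  # DP truncation, KSBP truncation, #tasks
gamma, alpha, psi = 1.0, 1.0, 0.05   # assumed hyper-parameter values

# Global DP G: atoms theta_k (here 2-D Gaussian means, theta_k ~ G0) and
# stick weights beta_k, shared by every task.
theta = rng.normal(0.0, 5.0, size=(K, 2))
bp = rng.beta(1.0, gamma, size=K)
beta = bp * np.cumprod(np.concatenate(([1.0], 1.0 - bp[:-1])))
beta = beta / beta.sum()             # renormalize the truncated sticks

# Task-level KSBPs: each task m has its own V_hm, Gamma_hm, and atom
# indicators t_hm, but the atoms themselves come from the shared set theta.
tasks = []
for m in range(M):
    V = rng.beta(1.0, alpha, size=T)            # V_hm ~ Beta(1, alpha_m)
    Gamma = rng.uniform(0.0, 64.0, size=(T, 2)) # Gamma_hm ~ H_m (uniform)
    t = rng.choice(K, size=T, p=beta)           # t_hm ~ sum_k beta_k delta_k
    tasks.append((V, Gamma, t))

def draw_patch(r, V, Gamma, t):
    """Sample z_nm from the location-dependent sticks, then x_nm ~ N(theta, I)."""
    Kr = np.exp(-psi * np.sum((Gamma - r) ** 2, axis=1))
    VK = V * Kr
    pi = VK * np.cumprod(np.concatenate(([1.0], 1.0 - VK[:-1])))
    pi = pi / pi.sum()               # truncation: renormalize the finite sticks
    z = rng.choice(T, p=pi)
    return rng.normal(theta[t[z]], 1.0)

x = draw_patch(np.array([10.0, 10.0]), *tasks[0])
print(x)  # a 2-D feature vector drawn near one of the shared atoms
```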
B. Posterior inference
For inference purposes, we truncate the number of sticks in the KSBP to T , and the number of
sticks in the truncated DP to K (the truncation properties of the stick-breaking representation are
discussed in [20], where here we must also take into account the properties of the kernel). Due
to the discreteness of G = Σ_{k=1}^{K} β_k δ_{θ_k}, each draw of the KSBP, G_{r_{nm}} = Σ_{h=1}^{T} π_{hm} δ_{φ_{hm}}, can
only take its atoms {φ_{hm}}_{h=1,T; m=1,M} from the K unique possible values {θ_k}_{k=1,K}; when drawing
atoms φ_{hm} from G, the respective probabilities for {θ_k}_{k=1,K} are given by {β_k}_{k=1,K}, and
for a given r_{nm} the respective probabilities for the different {φ_{hm}}_{h=1,T; m=1,M} are defined by
{π_{hm}}_{h=1,T; m=1,M}. In order to reflect the correspondences between the data and atoms explicitly,
we further introduce two auxiliary indicator variables: z_{nm}, indicating with which component of
the KSBP the feature vector x_{nm} is associated, and t_{hm}, indicating with which mixing component
θ_k the atom φ_{hm} is associated.
With this specification we can represent our H-KSBP mixture model via a stick-breaking characterization
x_{nm} ∼ N(φ_{nm})   (independently)

φ_{nm} = φ_{z_{nm} m}

z_{nm} ∼ Σ_{h=1}^{T} π_{nm,h} δ_h   (independently)

π_{nm,h} = V_{hm} K(r_{nm}, Γ_{hm}, ψ_{hm}) ∏_{l=1}^{h-1} [1 − V_{lm} K(r_{nm}, Γ_{lm}, ψ_{lm})]

φ_{hm} = θ_{t_{hm}}

t_{hm} ∼ Σ_{k=1}^{K} β_k δ_k   (independently)                    (7)

β_k = β′_k ∏_{l=1}^{k-1} (1 − β′_l)

V_{hm} ∼ Beta(1, α_m)   (i.i.d.)

Γ_{hm} ∼ H_m   (i.i.d.)

β′_k ∼ Beta(1, γ)   (i.i.d.)

θ_k ∼ G_0   (i.i.d.)
for m = 1, · · · , M; n = 1, · · · , N_m; h = 1, · · · , T and k = 1, · · · , K. Note that we employ distinct
kernel widths ψhm, the details for which are addressed in the Appendix. In the expressions above
and hereafter we use a bold letter to denote a set of variables with different indices, for example
πnm = {πnm,h}h=1,T and β = {βk}k=1,K . With the conditional distributions above, we provide
a graphical representation of the proposed H-KSBP model in Figure 1.
The Markov chain Monte Carlo (MCMC) method is a powerful and general simulation tool
for Bayesian inference, and one may readily implement the H-KSBP with Gibbs sampling,
Fig. 1. A graphical representation of the H-KSBP mixture model.
as an extension of the formulation in [10]. However, for the large-scale problems of interest
here we alternatively employ variational Bayesian (VB) inference. The VB method has proven
to be a relatively fast (compared to MCMC) and accurate inference tool for many models and
applications [1, 4, 24]. To employ VB, a conjugate prior is required for all variables in the model.
In the proposed model, however, we cannot obtain a closed form for the variational posterior
distribution of the node V_{hm}, because of the kernel function. Alternatively, motivated by
the Monte Carlo Expectation Maximization (MCEM) algorithm [35], where the intractable E-
step in the Expectation-Maximization (EM) algorithm is approximated with sets of Monte Carlo
samples, we develop a Monte Carlo Variational Bayesian (MCVB) inference algorithm. We
generally follow the variational Bayesian inference steps to iteratively update the variational
distributions, and therefore push the lower bound closer to the model evidence. For those
nodes that are not available in closed form, we obtain Monte Carlo samples in each iteration
through MCMC routines such as Gibbs and Metropolis-Hastings samplers. The resulting MCVB
algorithm combines the benefits of both MCMC and VB, and has proven to be effective for the
examples we have considered (some of which are presented here).
Given the H-KSBP mixture model detailed in Section III-A, we derive a Monte Carlo variational
Bayesian approach to infer the variables of interest. Following standard variational Bayesian
inference [16], we seek a lower bound for the log model evidence log p(Data) by integrating
over all hidden variables and model parameters:
log p({x_{nm}}, {r_{nm}}) = log ∫ p({x_{nm}}, {r_{nm}}, β, φ, V, Γ, t, z) dΘ

  ≥ ∫ q(β, φ, V, Γ, t, z) log [ p({x_{nm}}, {r_{nm}}, β, φ, V, Γ, t, z) / q(β, φ, V, Γ, t, z) ] dΘ        (8)

  ≈ ∫ q(β) q(φ) q(V) q(Γ) q(t) q(z) log [ p({x_{nm}}, {r_{nm}}, β, φ, V, Γ, t, z) / ( q(β) q(φ) q(V) q(Γ) q(t) q(z) ) ] dΘ        (9)

  ≡ LB(q(Θ))
where Θ is the set including all variables of interest, the q(·) are variational posterior distributions
with specified functional forms, LB(q(Θ)) is defined to be the lower bound on the log model
evidence (sometimes referred to as the negative variational free energy), (8) follows from Jensen's
inequality, and (9) results from the factorization assumption over the variational distributions.
Since the log model evidence is independent of the variational distributions, we can iteratively
update the variational distributions to increase the lower bound, which is equivalent to minimizing
the Kullback-Leibler divergence between q(β, φ, V, Γ, t, z) and the true posterior
p(β, φ, V, Γ, t, z | {x_{nm}}, {r_{nm}}). Equality is achieved if and only if the variational
distributions are equal to the true posterior distributions.
The lower bound, LB(q(Θ)), is a functional of the variational distributions. To make the
computation analytical, we assume the variational distributions take specified forms and iterate
over every variable until convergence. All the updates are analytical except for V_{hm}, which is
estimated with the samples from its conditional posterior distributions. The update equations for
the proposed model are given in the Appendix.
C. Convergence
To monitor the convergence of our MCVB algorithm, we computed the lower bound LB(q(Θ))
on the log model evidence at each iteration. Because of the sampling of some variables (see
Appendix), the lower bound does not in general increase monotonically, but we observed in all
experiments that the lower bound increases sequentially for the first several iterations, with
generally small fluctuations after it has converged to the local optimal solution. An example of
this phenomenon is presented below.
IV. EXPERIMENTAL RESULTS
We have applied the H-KSBP multi-task image-segmentation algorithm to both synthetic and
real images. We first present results on synthesized imagery, wherein we compare KSBP-based
clustering of a single image with associated DP-based clustering. This comparison shows in
a simple manner the advantage of imposing spatial-proximity constraints when performing
clustering and segmentation. We next consider H-KSBP as applied to actual imagery, taken from
a widely utilized database. In that analysis we consider the class of textures/clusters inferred by
the algorithm, and how these are distributed across different classes of imagery. We also examine
the utility of H-KSBP for sorting this database of imagery. These H-KSBP results are compared
with analogous results realized via HDP. The hyper-parameters in the model for the examples
that follow are set as follows: τ_{10} = 10^{-2}, τ_{20} = 10^{-2}, τ_{30} = 3 × 10^{-2}, τ_{40} = 3 × 10^{-2}, µ = 0,
η_0 = 1, w* = d + 2, and Σ* = 5 × I. The discrete priors for Γ and ψ are set to be uniform over all
candidates.
A. Single-image segmentation, comparing KSBP and DP
In this simple illustrative example, each feature vector is associated with a particular pixel, and
the feature is simply a real number, corresponding to its intensity; the pixel location is the
auxiliary information within the KSBP, while this information is not employed by the DP-based
segmentation algorithm. Since both DP and KSBP are semi-parametric, we do not a priori set
the number of clusters. Figure 2 shows the original image and the segmentation results of both
algorithms. In Figure 2(a) we note that there are five contiguous regions for which the intensities
are similar. There is a background region with a relatively fixed intensity, and within this are
four distinct contiguous sub-regions, and of these there are pairs for which the intensities are
comparable. The data in Figure 2(a) were generated as follows. Each pixel in each region is
generated independently as a draw from a Gaussian distribution; the standard deviation of each
of the Gaussians is 10, and the background has mean intensity 5, and the two pairs are generated
with mean intensities of 40 and 60. The color bar in Figure 2(a) denotes the pixel amplitudes.
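Data of this form can be regenerated in a few lines; the image size and the sub-region shapes and placements below are illustrative assumptions, since only the means and standard deviation are specified above.

```python
import numpy as np

rng = np.random.default_rng(4)

# A synthetic image in the spirit of Figure 2(a): a background (mean 5)
# containing four contiguous sub-regions, two with mean 40 and two with
# mean 60; every pixel is an independent Gaussian draw with std 10.
img = rng.normal(5.0, 10.0, size=(64, 64))          # background region
for (r0, c0), mean in [((8, 8), 40.0), ((8, 40), 60.0),
                       ((40, 8), 60.0), ((40, 40), 40.0)]:
    img[r0:r0 + 16, c0:c0 + 16] = rng.normal(mean, 10.0, size=(16, 16))

print(img.shape)  # (64, 64)
```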
The DP and KSBP segmentation results are shown in Figures 2(b) and 2(c), respectively. In the
latter results a distinct color is associated with distinct cluster parameters (mean and precision of
a Gaussian). In the DP results we note that the four subregions are generally properly segmented,
but there is significant speckle in the background region. The KSBP segmentation exhibits
far less speckle. Further, in the KSBP results there are five distinct clusters (dominant
KSBP sticks), where in the DP results there are principally three distinct sticks (in the DP,
the spatially separated segments with the same features are treated as one region, while in the
KSBP each contiguous region is represented by its own stick).

Fig. 2. A synthetic image example: (a) original synthetic image, (b) image-segmentation results of the DP-based model, and (c) image-segmentation results of the KSBP-based model.

While the KSBP yields five
distinct prominent sticks, there are only three distinct atoms, consistent with the representation
in Figure 2(a).
In the next set of results, on real imagery, we employ the H-KSBP algorithm, and therefore at
the task level segmentation is performed as in Figure 2(c). Alternatively, using the HDP model [33],
at the task level one employs clustering of the form in Figure 2(b). The relative performance of
H-KSBP and HDP is analyzed.
B. Feature extraction for real imagery
Within the subsequent image analysis we employ features constituted by the independent feature
subspace analysis (ISA) technique, developed by Hyvarinen and Hoyer [19]. These features
have proven to be relatively shift or translation invariant. In brief, the ISA feature extraction
process is composed of two steps: (i) We employ patches of images as training data, to estimate
several independent feature subspaces via a modification of the independent component analysis
(ICA) [6]. Each feature subspace is represented as a set of orthogonal basis vectors, i.e., wi,
i = 1, · · · , n, where n is the dimension of the subspace. (ii) The feature F of a new data input
I is computed as the norm of the projections on the feature subspaces:

F_j(I) = Σ_{i=1}^{n} 〈w_i, I〉²,   for j = 1, · · · , J             (10)
where J is the number of independent feature subspaces and 〈·〉 is the inner product. Interesting
invariance properties [19] enable the ISA features to be widely applicable to many types of
images. As indicated below, a set of images is selected from a database for H-KSBP analysis; the
set of images used for the aforementioned training constitute a separate set of images randomly
selected from this same database. For the results considered here n = 10 and J = 50.
C. H-KSBP applied to a set of real images
We test the H-KSBP model on a subset of images from the Microsoft Research Cambridge database.1 There
are seven types of images in this database: buildings, clouds, countryside, faces, fireworks, offices
and urban. Twenty images are randomly selected from the database for each type, yielding a
total of 140 images (the image sizes vary from 384×225 to 640×480 pixels among the
different images). Typical images are shown in Figure 3. To capture textural information within
the features, we first divide each image into contiguous, non-overlapping 24×24-pixel patches
and then extract ISA features from each patch; color images are considered, and the RGB colors
are handled within the ISA feature extraction as in [18]. To learn the ISA independent
feature subspaces, we randomly select 150 patches from the 140 images across the seven classes,
and these 150 image patches are used for basis training. The posterior on the H-KSBP (and
HDP) model parameters is inferred via the proposed MCVB algorithm, processing all 140
images simultaneously; as discussed in Section II, the HDP analysis is performed by a special
1The data are available at http://research.microsoft.com/vision/cambridge/recognition/
Fig. 3. Sample images used in the analysis.
setting of the H-KSBP parameters. The VB algorithm is deterministic given a specified
initialization; the proposed MCVB is, however, stochastic because of the random samples
involved. Moreover, given the large amount of data, we must randomly initialize the stick
memberships, which may cause convergence to local-optimal solutions. We therefore perform
the experiment ten times and average the results, to mitigate the influence of randomness and
to make the results less sensitive to the VB initialization.
Borrowing the successful “bag of words” assumption from text analysis [5], we assume each image
is a bag of atoms; this yields a measurable quantification of the inter-relationships between
images: similar images should share similar distributions over the mixture components. An
important aspect of the H-KSBP algorithm is that while in text analysis the “bag of words” may
be set a priori, here the “bag of atoms” is inferred from the data itself, within the clustering
Fig. 4. Matrix on the usage of atoms across the different images (atoms indexed along the vertical axis; images indexed along the horizontal axis, ordered as clouds, buildings, countryside, faces, fireworks, urban and offices). The size of each box represents the relative frequency with which a particular atom is manifested in a given image. These results are computed via H-KSBP.
process. Related concepts have been employed previously in image analysis [29], but in that
work one had to set the canonical set of image atoms (shapes) a priori, which is somewhat ad
hoc. The learning of these canonical atoms, across the multiple images, is an important aspect
of the proposed method. As the HDP is a special case of H-KSBP, it also may be used to
infer a “bag of atoms”, albeit without explicitly accounting for the spatial structure of the
segmentation.
As an example, for the data considered, we show one realization of H-KSBP in Figure 4. In
the figure, we display the canonical atom usage across all 140 images. Figure 4 is a count matrix,
where each square represents the relative number of counts in a given image for a particular
atom (atoms are indexed along the vertical axis in Figure 4). As demonstrated in this figure, the
distinct image classes tend to employ distinct ensembles of atom types, which is important
for the desired sorting of image classes.
Given the matrix depicted in Figure 4, it is of interest to examine the character of the atoms in
the imagery; examples of different atoms are depicted in Figure 5.
Fig. 5. Demonstration of different atoms as inferred by an example run of the H-KSBP algorithm. Each row of the figure corresponds to one atom (the 4th, 31st, 39th, 38th, 11th, 33rd and 30th atoms are shown). Every two images form a set, with the original image at left and the areas assigned to the particular atom shown at right.
Figure 5 gives a representation of most of the atoms. For example, the 4th, 31st and 39th atoms
are associated with clouds and sky; the 38th atom principally models buildings; and the 11th
atom is associated with trees and grasses. In performing the experiment, we also noticed that it
was relatively easy to segment clouds, fireworks, countryside and urban images, but harder to
obtain contiguous segments within office images (these typically have far more detail, and fewer
large regions of smooth texture; this may be less an issue of the H-KSBP than of the features
employed). An example of this difficulty is observable in Figure 6, as office images
are composed of many different atoms. Fortunately, the office images still tend to share similar
Fig. 6. Representative set of segmentation results, comparing H-KSBP and HDP. While these two algorithms tend to generally yield comparable segmentations for the images considered, the H-KSBP is generally more sensitive to details, with this sometimes yielding better segmentations (e.g., the top-left and bottom-right results).
usage of atoms so that they can be grouped together (sorted) when quantifying similarities
between images based on the histogram over atoms (discussed next). Example segmentation
results on images are given in Figure 6.
The results in Figure 6, in which both H-KSBP and HDP segmentation results are presented,
demonstrate general properties observed in the analysis of the images considered here: (i) the
segmentation characteristics of HDP were generally good, but on some occasions they were
markedly worse (less detailed) than those of H-KSBP; and (ii) the H-KSBP was generally more
sensitive to detailed textural differences in the images, thereby generally inferring a larger
number of principal atoms (an increased number of large sticks).
Fig. 7. The confusion matrix over image types (clouds, buildings, countryside, faces, fireworks, urban and offices), generated using H-KSBP.
To demonstrate the image-sorting potential of the H-KSBP, we compute the Kullback-Leibler
(KL) divergence between the atom histograms of any two images, averaging histograms of
the form in Figure 4 over ten random MCVB initializations. For each image, we rank its
similarity to all other images based on the associated KL divergence. Performance is addressed
quantitatively as follows. For each of the 140 images, we quantify via the KL divergence its
similarity to each of the other 139 images, thereby obtaining an ordered list. In Figure 7 we
present a confusion matrix, which represents the fraction of the top-ten members of this ordered
list that are within the same class (among the seven classes) as the image under test.
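The KL-based ranking described above can be sketched as follows; the smoothing constant and the (asymmetric) direction of the divergence are our own assumptions, as the text does not specify them:

```python
import numpy as np

def kl(p, q, eps=1e-10):
    """KL divergence between two discrete histograms, with additive
    smoothing (eps) so that zero counts do not produce infinities."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def rank_by_similarity(hists, i):
    """Order all other images by increasing KL divergence from image i's
    atom-usage histogram (most similar first)."""
    d = [(kl(hists[i], h), j) for j, h in enumerate(hists) if j != i]
    return [j for _, j in sorted(d)]

# Toy atom histograms for three images over three atoms:
hists = np.array([[8.0, 1.0, 1.0], [7.0, 2.0, 1.0], [1.0, 1.0, 8.0]])
order = rank_by_similarity(hists, 0)
```

For image 0 the toy ranking puts image 1 (similar atom usage) ahead of image 2 (very different atom usage), mirroring how similar image classes share similar distributions over atoms.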
As demonstrated in Figure 7, the H-KSBP performs well in distinguishing clouds, faces and
fireworks images. The buildings and urban images often share similar atoms, mainly
representing buildings, and therefore these are somewhat confused (reasonably so, it is felt). The
office images are often confused with other relatively complex scenes. Some typical image-ranking
results are given in Figure 8; clearly, for this dataset, the clouds are the most distinctive class type,
for they are always the least related to any of the other image classes. It was found that the HDP
produced sorting results very similar to those of H-KSBP (e.g., the associated confusion
matrix for HDP is similar to that in Figure 7), and therefore the HDP sorting results are omitted
for brevity. This indicates that while in some cases the HDP segmentation results are inferior
to those of H-KSBP, in general the ability of HDP and H-KSBP to sort images is comparable
(at least for the set of images considered).
The H-KSBP results on the 140-image database were computed in non-optimized Matlab™
software, on a PC with a 3 GHz CPU and 2 GB of memory. One run of the MCVB code required
about 3 hours for 80 iterations, with typically 40-50 iterations required to achieve
convergence. The H-KSBP and HDP algorithms had comparable computation times.
V. CONCLUSIONS
The kernel stick-breaking process (KSBP) [10] has been extended for use in image segmentation.
The KSBP algorithm explicitly imposes the belief that feature vectors that are generated from
proximate locations in an image are more likely to be associated with the same image segment.
The algorithm is semi-parametric in the sense that a parametric model is defined for generation of
the feature vectors (here a multivariate Gaussian), but the number of components in the associated
mixture model is assumed unknown and inferred from the data. The KSBP extends the stick-
breaking representation of the Dirichlet process (DP), in that it imposes spatially-dependent stick
weights, with a shared set of parameter “atoms” employed throughout the image.
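The spatially-dependent stick construction summarized above can be sketched as follows, assuming the Gaussian kernel K(r, Γ; ψ) = exp(−ψ‖r − Γ‖²); the values of V, Γ and ψ below are illustrative stand-ins, not inferred quantities:

```python
import numpy as np

def ksbp_weights(r, V, Gamma, psi):
    """Spatially-dependent stick weights at location r:
       pi_h(r) = V_h K(r, Gamma_h) * prod_{l<h} (1 - V_l K(r, Gamma_l)),
    with Gaussian kernel K(r, Gamma) = exp(-psi * ||r - Gamma||^2).
    Patches near a stick's basis location Gamma_h see a larger weight,
    which is what encourages spatially proximate patches to share a segment."""
    K = np.exp(-psi * np.sum((Gamma - r) ** 2, axis=1))
    probs = V * K
    remain = np.cumprod(np.concatenate(([1.0], 1.0 - probs[:-1])))
    return probs * remain

rng = np.random.default_rng(1)
V = rng.beta(1.0, 2.0, size=20)             # stick fractions (illustrative)
Gamma = rng.uniform(0, 100, size=(20, 2))   # kernel basis locations
pi = ksbp_weights(np.array([50.0, 50.0]), V, Gamma, psi=1e-3)
```

With a truncated number of sticks the weights sum to at most one; as ψ → 0 the kernel approaches 1 everywhere and the construction reduces to ordinary (spatially independent) stick-breaking, consistent with the HDP special case noted later.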
We have also extended this KSBP image-processing algorithm to the simultaneous segmentation
of multiple images. In this setting a KSBP is employed for each of the images under test, and
these KSBPs share a common base distribution. Specifically, the base distribution corresponds
Fig. 8. Sample image sorting results, as generated by H-KSBP. The top-left image is the original image, followed by the five most similar images and then the five most dissimilar images.
to a draw from a DP. Since such a draw is, with probability one, composed of an infinite set
of discrete atoms, this implies that all of the KSBPs share the same set of atoms. Therefore,
this imposes the belief that the clusters/segments associated with any given image are likely
related to clusters/segments associated with other images. The hierarchical KSBP (H-KSBP)
infers these cluster inter-relationships, and therefore infers the inter-relationships between the
images themselves. Therefore, the H-KSBP algorithm, with appropriate image features, offers
the opportunity to sort or organize multiple images. In the special case for which the kernel
parameters are set such that the sticks are spatial-position independent, the H-KSBP algorithm
reduces to the hierarchical Dirichlet process (HDP) [33].
In the problems of interest here the number of images under test may be large, and therefore an
MCMC analysis [10] is often computationally intractable. Therefore, we here have considered
an augmented variational Bayesian (VB) inference framework. Almost all nodes on the graphical
model of the H-KSBP are appropriate for VB inference (they possess the required conjugate
relationships); however, the presence of the kernel does require sampling, and therefore we have
implemented a hybrid sampling-VB inference algorithm, which has performed effectively in the
tests considered. To mitigate sensitivity of VB to initialization considerations and local-optimal
solutions, the final inference results are taken as the average of multiple runs, with different
initializations.
In the context of segmenting single images, while the DP-based method often worked well for the
tests considered, the KSBP algorithm typically was better, in the sense that it was more sensitive
to subtle variations in the textures of the image segments. The KSBP typically employed more
atoms (large sticks) than the DP-based segmentation algorithm, apparently to capture these more-subtle
textural differences. The same generally superior segmentation performance of H-KSBP
was observed relative to HDP, when segmenting multiple images simultaneously. In addition to
segmenting multiple images, the H-KSBP and HDP algorithms also yield information about the
inter-relationships between the images, based on the underlying sharing mechanisms inferred
among the associated clusters. For the images considered, it was found that the H-KSBP and
HDP yielded very similar sorting results.
Concerning future research, one typically measures imagery with a known temporal ordering.
One might anticipate that the degree of similarity in the images will be enhanced if they are
measured at more proximate times. The H-KSBP and HDP models assume that the multiple
images under test are exchangeable (their order is assumed irrelevant). To account for the order
with which the images are measured, and to further encourage sharing among images measured
more closely in time, the Dirichlet process from which the base distributions of the H-KSBP
and HDP are drawn may be generalized. Examples of how DP has been generalized for such
purposes are discussed in (for example) [9, 17, 22]. In fact, the KSBP may itself be employed
to generalize DP to account for time, where now rather than using spatial location as auxiliary
data one may use time.
ACKNOWLEDGMENTS
The research reported here was supported by the Department of Energy, NA-22. L.C. also thanks
Dr. Lawrence Chilton of PNNL for several motivating discussions.
REFERENCES
[1] M.J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby
Computational Neuroscience Unit, University College London, 2003.
[2] D. Blackwell and J.B. MacQueen. Ferguson distributions via Pólya urn schemes. The Annals
of Statistics, 1:353–355, 1973.
[3] D. Blei and M.I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian
Analysis, 1(1):121–144, 2005.
[4] D.M. Blei and M.I. Jordan. Variational methods for the Dirichlet process. In Proc. the
21st International Conference on Machine Learning, 2004.
[5] D.M. Blei and J.D. Lafferty. Correlated topic models. In Advances in Neural Information
Processing System, volume 18, 2005.
[6] P. Comon. Independent component analysis — a new concept? Signal Processing, 36:287–
314, 1994.
[7] C. Ding and X. He. K-means clustering via principal component analysis. In Proc. the
International Conference on Machine Learning, pages 225–232, 2004.
[8] J. Duan, M. Guindani, and A.E. Gelfand. Generalized spatial Dirichlet process models.
Biometrika, 2007.
[9] D.B. Dunson. Bayesian dynamic modeling of latent trait distributions. Biostatistics, 7:551–
568, 2006.
[10] D.B. Dunson and J.-H. Park. Kernel stick-breaking process. Biometrika, accepted for
publication.
[11] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures.
Journal of the American Statistical Association, 90:577–588, 1995.
[13] T.S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics,
1, 1973.
[14] M.A. Figueiredo. Bayesian image segmentation using Gaussian field priors. In Energy
Minimization Methods in Computer Vision and Pattern Recognition, pages 74–89. Berlin-
Heidelberg: Springer-Verlag, 2005.
[15] M.A. Figueiredo, D.S. Cheng, and V. Murino. Clustering under prior knowledge with
application to image segmentation. In Advances in Neural Information Processing System,
volume 19, 2007.
[16] Z. Ghahramani and M.J. Beal. Propagation algorithms for variational Bayesian learning.
In Advances in Neural Information Processing Systems 13, 2001.
[17] J.E. Griffin and M.F.J. Steel. Order-based dependent Dirichlet processes. Journal of the
American Statistical Association, in press, 2006.
[18] P. Hoyer and A. Hyvarinen. Independent component analysis applied to feature extraction
from colour and stereo images. Network: Computation in Neural Systems, 11(3):191–210,
2000.
[19] A. Hyvarinen and P. Hoyer. Emergence of phase- and shift-invariant features by
decomposition of natural images into independent feature subspaces. Neural Computation,
12(7):1705–1720, 2000.
[20] H. Ishwaran and L.F. James. Gibbs sampling methods for stick-breaking priors. Journal
of the American Statistical Association, 96:161–173, 2001.
[21] T. Kohonen. Self-Organizing Maps. Springer-Verlag, 1997.
[22] S.N. MacEachern. Dependent nonparametric processes. In ASA Proceedings of the Section
on Bayesian Statistical Science, Alexandria, VA, 1999. American Statistical Association.
[23] S.N. MacEachern and P. Müller. Estimating mixture of Dirichlet process models. Journal
of Computational and Graphical Statistics, 7:223–238, 1998.
[24] D.J.C. MacKay. Ensemble learning for hidden Markov models.
http://www.inference.phy.cam.ac.uk/mackay/abstracts/ensemblePaper.html, 1997.
[25] G.J. McLachlan and K.E. Basford. Mixture Models: Inference and Applications to
Clustering. Marcel Dekker, 1988.
[26] B. Nadler, S. Lafon, R.R. Coifman, and I.G. Kevrekidis. Diffusion maps, spectral clustering
and eigenfunctions of Fokker-Planck operators. In Advances in Neural Information
Processing Systems 18, 2006.
[27] A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In
Advances in Neural Information Processing Systems 13, 2001.
[28] P. Orbanz and J. M. Buhmann. Smooth image segmentation by nonparametric Bayesian
inference. In ECCV, volume 1, pages 444–457, 2006.
[29] P. Quelhas, F. Monay, J.-M. Odobez, D. Gatica-Perez, and T. Tuytelaars. A thousand words
in a scene. IEEE Trans. Pattern Analysis and Machine Intelligence, 9:1575–1589, 2007.
[30] C.E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information
Processing Systems, volume 12, pages 554–560, 2000.
[31] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650,
1994.
[32] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky. Describing visual scenes
using transformed Dirichlet processes. In NIPS 18, pages 1297–1304. MIT Press, 2006.
[33] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. Journal
of the American Statistical Association, 101:1566–1582, 2006.
[34] S. Thrun and J. O'Sullivan. Discovering structure in multiple learning tasks: The TC
algorithm. In Proc. the 13th International Conference on Machine Learning, 1996.
[35] G.C.G. Wei and M.A. Tanner. A Monte Carlo implementation of the EM algorithm and the
poor man's data augmentation algorithms. Journal of the American Statistical Association,
85:699–704, 1990.
[36] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with
Dirichlet process priors. Journal of Machine Learning Research, 8:35–63, 2007.
[37] K. Yu, V. Tresp, and S. Yu. A nonparametric hierarchical Bayesian framework for
information filtering. In Proc. the 27th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, 2004.
APPENDIX
Update q(z): Given the approximate distributions of the other variables, the variational
distribution q(z) can be derived as

q(z_{nm}) = \mathrm{Mult}(\xi_{nm,1}, \cdots, \xi_{nm,T}),  (1)

s_{nm,h} \propto V_{hm} \exp(-\psi_m \| r_{nm} - \Gamma_{hm} \|^2) \exp\Big( \sum_{l<h} \log\big(1 - V_{lm} \exp(-\psi_m \| r_{nm} - \Gamma_{lm} \|^2)\big) \Big)
    \times \exp\Big( \sum_{k=1}^{K} \upsilon_{hm,k} \Big( \frac{1}{2}\Big( \sum_{r=1}^{d} \Psi\Big(\frac{w_*^{new} + 1 - r}{2}\Big) + d \log 2 + \log|\Sigma_*^{new}| \Big)
    - \frac{1}{2}(x_{nm} - \mu_{0k}^{new})^T w_*^{new} \Sigma_*^{new} (x_{nm} - \mu_{0k}^{new}) - \frac{d}{2\eta_k^{new}} \Big) \Big),  (2)

\xi_{nm,h} = \frac{s_{nm,h}}{\sum_{h'=1}^{T} s_{nm,h'}},  (3)

where \langle\cdot\rangle denotes the expectation over all variables except the one being updated, \Psi(\cdot) is the
digamma function, and d is the dimensionality of the feature vector. Equation (2) results from
the sampling of V_{hm}, where the expectation reduces to an evaluation over the samples.
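In practice the unnormalized terms in (2) are accumulated in the log domain, so the normalization in (3) is best done with a log-sum-exp step; this is a standard numerical device, our own addition rather than something specified in the text:

```python
import numpy as np

def normalize_log_weights(log_s):
    """Turn unnormalized log weights log s_{nm,h} into multinomial
    probabilities xi_{nm,h} = s_h / sum_h' s_h', stably: subtracting the
    max before exponentiating avoids underflow to all-zero weights."""
    m = np.max(log_s, axis=-1, keepdims=True)
    w = np.exp(log_s - m)
    return w / w.sum(axis=-1, keepdims=True)

# Naive exp() of these would underflow to 0/0; the shifted version is exact:
log_s = np.array([-1000.0, -1001.0, -1002.0])
xi = normalize_log_weights(log_s)
```

The result is invariant to any constant shift of the log weights, which is exactly the freedom the proportionality in (2) leaves.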
Update q(t): Similar to the update for z, each stick indicator variable t_{hm} has a posterior in
the form of a multinomial distribution:

q(t_{hm}) = \mathrm{Mult}(\upsilon_{hm,1}, \cdots, \upsilon_{hm,K}),  (4)

u_{hm,k} \propto \exp(\Psi(a_k) - \Psi(a_k + b_k)) \prod_{l<k} \exp(\Psi(b_l) - \Psi(a_l + b_l))
    \times \exp\Big( \sum_{n} \xi_{nm,h} \Big( \frac{1}{2}\Big( \sum_{r=1}^{d} \Psi\Big(\frac{w_*^{new} + 1 - r}{2}\Big) + d \log 2 + \log|\Sigma_*^{new}| \Big)
    - \frac{1}{2}(x_{nm} - \mu_k^{new})^T w_*^{new} \Sigma_*^{new} (x_{nm} - \mu_k^{new}) - \frac{d}{2\eta_k^{new}} \Big) \Big),  (5)

\upsilon_{hm,k} = \frac{u_{hm,k}}{\sum_{k'=1}^{K} u_{hm,k'}}.  (6)
Sample V: As indicated above, there is no closed form for the variational distribution of V. We
therefore employ an MCMC method to sample V from its conditional posterior distribution. We
introduce two auxiliary variables, A_{nm,h} \sim \mathrm{Bernoulli}(V_{hm}) and B_{nm,h} \sim \mathrm{Bernoulli}(K(r_{nm}, \Gamma_{hm}, \psi_m)),
with z_{nm} = \min\{h : A_{nm,h} = B_{nm,h} = 1\}. We can then alternate between

(i) sampling (A_{nm,h}, B_{nm,h}) from the conditional distribution given z_{nm}:

p(A_{nm,h} = B_{nm,h} = 0 \mid z_{nm}) = \frac{(1 - V_{hm})(1 - K(r_{nm}, \Gamma_{hm}, \psi_m))}{1 - V_{hm} K(r_{nm}, \Gamma_{hm}, \psi_m)},

p(A_{nm,h} = 0, B_{nm,h} = 1 \mid z_{nm}) = \frac{(1 - V_{hm}) K(r_{nm}, \Gamma_{hm}, \psi_m)}{1 - V_{hm} K(r_{nm}, \Gamma_{hm}, \psi_m)},

p(A_{nm,h} = 1, B_{nm,h} = 0 \mid z_{nm}) = \frac{V_{hm}(1 - K(r_{nm}, \Gamma_{hm}, \psi_m))}{1 - V_{hm} K(r_{nm}, \Gamma_{hm}, \psi_m)},

for h = 1, 2, \cdots, z_{nm} - 1 and n = 1, 2, \cdots, M, with A_{nm,h} = B_{nm,h} = 1 for h = z_{nm} and
n = 1, 2, \cdots, M; and

(ii) updating V_{hm} by sampling from the conditional posterior distribution

V_{hm} \sim \mathrm{Beta}\Big(1 + \sum_{n: z_{nm} \ge h} A_{nm,h},\; \alpha + \sum_{n: z_{nm} \ge h} (1 - A_{nm,h})\Big).  (7)
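Steps (i) and (ii) can be sketched for a single stick index h as follows; the shapes and bookkeeping are simplified relative to the full model (one image, scalar V_h, precomputed kernel values), and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_A_given_z(V_h, K_h, z, h):
    """Step (i): draw the auxiliary A_{n,h} given assignments z_n.
    For h == z_n, A = B = 1 by construction; for h < z_n, A is drawn
    from the allowed (A, B) configurations; only A is kept, since only
    A enters the Beta update in step (ii)."""
    N = len(z)
    A = np.zeros(N, dtype=int)
    denom = 1.0 - V_h * K_h          # normalizer 1 - V_h K(.)
    for n in range(N):
        if z[n] == h:
            A[n] = 1
        elif z[n] > h:
            # P(A=1, B=0 | z_n > h); remaining mass covers the A = 0 cases.
            p_A1 = V_h * (1.0 - K_h[n]) / denom[n]
            A[n] = rng.random() < p_A1
    return A

def sample_V(A, z, h, alpha):
    """Step (ii): conjugate Beta update (7) from the auxiliaries,
    restricted to observations with z_n >= h."""
    active = z >= h
    return rng.beta(1 + A[active].sum(), alpha + (1 - A[active]).sum())

z = np.array([0, 1, 2, 2, 3])        # illustrative stick assignments
K_h = np.full(len(z), 0.5)           # kernel evaluations at each location
A = sample_A_given_z(0.4, K_h, z, h=1)
V_new = sample_V(A, z, h=1, alpha=1.0)
```

The auxiliary variables restore conjugacy: conditioned on A, the update for V_h is an ordinary Beta draw, which is what makes this sampling step cheap inside the otherwise-VB loop.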
Update q(β): The update of q(β) is straightforward given the other variational distributions:

q(\beta'_k) = \mathrm{Beta}(a_k, b_k),  (8)

a_k = 1 + \sum_{m,h} \upsilon_{hm,k},  (9)

b_k = \gamma + \sum_{m,h} \sum_{l>k} \upsilon_{hm,l},  (10)

and the stick length \beta_k can be represented as

\beta_k = \beta'_k \prod_{l<k} (1 - \beta'_l).  (11)
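The map from the Beta posteriors (8)-(10) to expected stick lengths via (11) can be sketched as follows (a minimal numerical illustration, using the Beta mean E[β'_k] = a_k/(a_k + b_k)):

```python
import numpy as np

def expected_sticks(a, b):
    """Expected stick lengths under q: E[beta'_k] = a_k / (a_k + b_k),
    then beta_k = beta'_k * prod_{l<k} (1 - beta'_l) as in (11)."""
    bp = a / (a + b)
    remain = np.cumprod(np.concatenate(([1.0], 1.0 - bp[:-1])))
    return bp * remain

a = np.array([2.0, 3.0, 1.0])   # illustrative Beta parameters
b = np.array([4.0, 3.0, 1.0])
beta = expected_sticks(a, b)
```

With a finite truncation the expected sticks are positive and sum to less than one; the residual mass belongs to the unrepresented tail of the stick-breaking construction.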
Update q(θ): The distribution of θ is composed of two parts, one for the mean vector and
one for the precision matrix. Given the distributions of both indicators and samples, we obtain
an analytical form for q(θ):

q(\theta_k) = N(\mu_{0k}^{new}, \eta_k^{new} \Sigma_k) \, W(w_*^{new}, \Sigma_*^{new}),  (12)

\mu_{0k}^{new} = \frac{N_k \bar{x}_k + \eta_0 \mu_0}{N_k + \eta_0}, \quad \eta_k^{new} = \eta_0 + N_k, \quad w_*^{new} = w_* + N_k,

\Sigma_*^{new} = \Big( \Sigma_*^{-1} + \sum_{m,n} q(t_{z_{nm} m} = k)(x_{nm} - \bar{x}_k)(x_{nm} - \bar{x}_k)^T
    + \frac{N_k \eta_0}{N_k + \eta_0} (\bar{x}_k - \mu_0)(\bar{x}_k - \mu_0)^T \Big)^{-1},  (13)

where

N_k = \sum_{m,n} q(t_{z_{nm} m} = k), \quad \bar{x}_k = \frac{\sum_{m,n} q(t_{z_{nm} m} = k) \, x_{nm}}{N_k}.
Update q(Γ): Ideally the basis location \Gamma_{hm} would be estimated continuously over the entire
space; however, this results in non-conjugacy. We hence place a discrete prior on \Gamma_{hm}, i.e.,
p(\Gamma_{hm}) = \sum_r a_r \delta_{\Gamma_r}, and the associated posterior takes the form

q(\Gamma_{hm}) = \sum_r a_r^{new} \delta_{\Gamma_r},  (14)

a_r^{new} \propto a_r \exp\Big( \sum_n \xi_{nm,h} (\log V_{hm} - \psi_m \| r_{nm} - \Gamma_r \|^2) \Big)
    \exp\Big( \sum_n \sum_{l>h} \xi_{nm,l} \log\big(1 - V_{hm} \exp(-\psi_m \| r_{nm} - \Gamma_r \|^2)\big) \Big),  (15)

where \{\Gamma_r\} constitutes a grid of candidate locations (for imagery, with discrete pixels, this is
quite realistic and practical). For computational efficiency, we fix the number of candidate basis
locations, and in each iteration candidates with very low weights are dropped and replaced with
new locations sampled uniformly at random. Even though q(\Gamma_{hm}) can be derived analytically,
it severely complicates the computation of the other variables; we therefore instead sample
\Gamma_{hm} from its approximate posterior distribution.
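Sampling Γ_hm from the discrete posterior (14)-(15), together with the candidate-refreshing heuristic described above, might be sketched as follows; the low-weight threshold and grid extent are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_discrete_location(weights, candidates):
    """Draw Gamma_hm from a discrete posterior over candidate grid points,
    with probabilities proportional to the (unnormalized) weights a_r^new."""
    p = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=p)]

def refresh_candidates(weights, candidates, low=1e-3, extent=(100.0, 100.0)):
    """Replace low-weight candidates with fresh uniform draws, keeping the
    total number of candidates fixed (as described in the text).
    `low` and `extent` are illustrative choices, not values from the paper."""
    out = candidates.copy()
    dead = weights / weights.sum() < low
    out[dead] = rng.uniform(0, 1, size=(int(dead.sum()), 2)) * np.asarray(extent)
    return out

cands = rng.uniform(0, 100, size=(8, 2))   # candidate pixel locations
w = rng.random(8)                          # unnormalized posterior weights
g = sample_discrete_location(w, cands)
cands2 = refresh_candidates(w, cands)
```

Discretizing the support trades a small approximation for conjugate-style updates over a finite label set, and the refresh step keeps the grid from becoming permanently stuck in unpromising regions.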
Up to now, the hyper-parameters α, γ and ψ_m have been assumed constant during inference of
the other parameters. However, the model performance may be sensitive to the settings of these
hyper-parameters; we therefore relax this assumption and derive update equations for them,
after placing non-informative priors.
Update q(α): We place a gamma prior on α with parameters \tau_{10} and \tau_{20}, and the variational
posterior distribution q(α) is

q(\alpha) = \mathrm{Gamma}(\tau_1, \tau_2),  (16)

\tau_1 = \tau_{10} + M(T - 1), \quad \tau_2 = \tau_{20} - \sum_{m=1}^{M} \sum_{h=1}^{T} \log(1 - V_{hm}).  (17)
Update q(γ): Similar to the update of α, we also place a gamma prior on γ, and the approximate
posterior distribution is

q(\gamma) = \mathrm{Gamma}(\tau_3, \tau_4),  (18)

\tau_3 = \tau_{30} + K - 1, \quad \tau_4 = \tau_{40} - \sum_{k=1}^{K} (\Psi(b_k) - \Psi(a_k + b_k)).  (19)
Update ψ: Since the image size and object size can vary dramatically between images, in
practice no universal kernel width is applicable to all images and atoms. To adaptively choose
the kernel width for each image and atom, we place a discrete prior on \psi_{hm}, p(\psi_{hm}) = \sum_r b_r \delta_{\psi_r}.
As with the inference of \Gamma_{hm}, we sample \psi_{hm} to facilitate the computation:

q(\psi_{hm}) = \sum_r b_r^{new} \delta_{\psi_r},  (20)

b_r^{new} \propto b_r \exp\Big( \sum_n \xi_{nm,h} (\log V_{hm} - \psi_r \| r_{nm} - \Gamma_{hm} \|^2) \Big)
    \exp\Big( \sum_n \sum_{l>h} \xi_{nm,l} \log\big(1 - V_{hm} \exp(-\psi_r \| r_{nm} - \Gamma_{hm} \|^2)\big) \Big),  (21)

where \{\psi_r\} are candidate kernel widths.