
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 57, NO. 10, OCTOBER 2009

Hidden Markov Models With Stick-Breaking Priors
John Paisley, Student Member, IEEE, and Lawrence Carin, Fellow, IEEE

Abstract—The number of states in a hidden Markov model (HMM) is an important parameter that has a critical impact on the inferred model. Bayesian approaches to addressing this issue include the nonparametric hierarchical Dirichlet process, which does not extend to a variational Bayesian (VB) solution. We present a fully conjugate, Bayesian approach to determining the number of states in an HMM, which does have a variational solution. The infinite-state HMM presented here utilizes a stick-breaking construction for each row of the state transition matrix, which allows for a sparse utilization of the same subset of observation parameters by all states. In addition to our variational solution, we discuss retrospective and collapsed Gibbs sampling methods for MCMC inference. We demonstrate our model on a music recommendation problem containing 2250 pieces of music from the classical, jazz, and rock genres.

Index Terms—Hidden Markov models (HMM), hierarchical Bayesian modeling, music analysis, variational Bayes (VB).

    I. INTRODUCTION

THE hidden Markov model (HMM) [28] is an effective means for statistically representing sequential data, and has been used in a variety of applications, including the modeling of music and speech [4], [5], target identification [10], [29], and bioinformatics [15]. These models are often trained using the maximum-likelihood Baum-Welch algorithm [14], [28], where the number of states is preset and a point estimate of the parameters is returned. However, this can lead to over- or underfitting if the underlying state structure is not modeled correctly.

Bayesian approaches to automatically inferring the number of states in an HMM have been investigated in the MCMC setting using reversible jumps [9], as well as a nonparametric, infinite-state model that utilizes the hierarchical Dirichlet process (HDP) [31]. This latter method has proven effective [27], but a lack of conjugacy between the two levels of Dirichlet processes prohibits fast variational inference [6], making this approach computationally prohibitive when modeling very large data sets.

To address this issue, we propose the stick-breaking hidden Markov model (SB-HMM), a fully conjugate prior for an infinite-state HMM that has a variational solution. In this model,

Manuscript received July 21, 2008; accepted May 12, 2009. First published June 10, 2009; current version published September 16, 2009. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Marcelo G. S. Bruno.

The authors are with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TSP.2009.2024987

each row of the transition matrix is given an infinite-dimensional stick-breaking prior [18], [19], which has the nice property of sparseness on the same subset of locations. Therefore, for an ordered set of states, as we have with an HMM, a stick-breaking prior on the transition probabilities encourages sparse usage of the same subset of states.

To review, the general definition of a stick-breaking construction of a probability vector [19], \pi = (\pi_1, \pi_2, \ldots), is as follows:

    \pi_i = V_i \prod_{j=1}^{i-1} (1 - V_j), \quad V_i \sim \mathrm{Beta}(a_i, b_i)    (1)

for i = 1, \ldots, d and a set of nonnegative, real parameters a = (a_1, \ldots, a_d) and b = (b_1, \ldots, b_d). In this definition, d can be finite or infinite, with the finite case also known as the generalized Dirichlet distribution [12], [34]. When infinite, different settings of a and b result in a variety of well-known priors reviewed in [19]. For the model presented here, we are interested in using the stick-breaking prior on the ordered set of states of an HMM. The stick-breaking process derives its name from an analogy to iteratively breaking the proportion V_i from the remainder of a unit-length stick, \prod_{j<i}(1 - V_j). As with the standard Dirichlet distribution discussed below, this prior distribution is conjugate to the multinomial distribution parameterized by \pi and, therefore, can be used in variational inference [6].
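As an illustration of the construction in (1), the short Python sketch below (our own, not from the paper; the truncation level and the choice a_i = 1, b_i = 5 are made up for the example) draws a probability vector from a truncated stick-breaking construction and shows how, in expectation, the mass concentrates on the lowest-indexed components.

import numpy as np

def stick_breaking(a, b, rng=None):
    """Draw pi from a finite stick-breaking construction with Beta(a_i, b_i)
    breaks; the last weight absorbs the remaining stick (GDD truncation)."""
    rng = np.random.default_rng() if rng is None else rng
    a, b = np.asarray(a, float), np.asarray(b, float)
    V = rng.beta(a, b)                      # stick-breaking proportions
    V[-1] = 1.0                             # close the construction
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * remaining                    # pi_i = V_i * prod_{j<i}(1 - V_j)

# Example: 20 components, a_i = 1, b_i = 5 favors low-indexed components
pi = stick_breaking(np.ones(20), 5.0 * np.ones(20))
print(pi.round(3), pi.sum())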

Another prior on a discrete probability vector is the finite symmetric Dirichlet distribution [20], [24], which we call the standard Dirichlet prior. This distribution is denoted as Dir(\alpha/K, \ldots, \alpha/K), where \alpha is a nonnegative real parameter and K is the dimensionality of the probability vector \pi. This distribution can be viewed as the standard prior for the weights of a mixture model (the relevance of the mixture model to the HMM is discussed in Section III), as it is the more popular of the two conjugate priors to the multinomial distribution. However, for variational inference, the parameter \alpha must be set due to conjugacy issues, which can impact the inferred model structure. In contrast, for the case where a_i = 1, the model in (1) allows for a gamma prior on each b_i, removing the need to select this value a priori.

This nonparametric uncovering of the underlying state structure will be shown to improve model inference, as demonstrated on a music recommendation problem, where we analyze 2250 pieces of music from the classical, jazz, and rock genres. We will show that our model has comparable or improved performance over other approaches, including the standard finite Dirichlet distribution prior, while using substantially fewer states. This performance is quantified via measures of the quality of the top ten recommendations for all songs using the log-likelihood ratio between models as a distance metric.

We motivate our desire to simplify the underlying state structure on practical grounds. Computing likelihood ratios with HMMs via the forward algorithm requires integrating out the underlying state transitions, which scales quadratically with the number of states. For use in an environment where many likelihood ratios are to be calculated, it is therefore undesirable to utilize HMMs with superfluous state structure. The large number of pairwise likelihood calculations in our music recommendation problem will show a substantial time savings. Furthermore, when many HMMs are to be built in a short period of time, a fast inference algorithm, such as variational inference, may be sought. For example, when analyzing a large and evolving database of digital music, it may be desirable to perform HMM design quickly on each piece. When computing pairwise relationships between the pieces, computational efficiency manifested by a parsimonious HMM state representation is of interest.

The remainder of the paper is organized as follows: We review the HMM in Section II. Section III provides a review of HMMs with finite-Dirichlet and Dirichlet process priors. In Section IV, we review the stick-breaking process and its finite representation as a generalized Dirichlet distribution, followed by the presentation of the stick-breaking HMM formulation. Section V discusses two MCMC inference methods and variational inference for the SB-HMM. For MCMC in particular, we discuss retrospective and collapsed Gibbs samplers, which serve as benchmarks to validate the accuracy of the simpler and much faster variational inference algorithm. Finally, we demonstrate our model on synthesized and music data in Section VI, and conclude in Section VII.

    II. THE HMM

The HMM [28] is a generative statistical representation of sequential data, where an underlying state transition process operating as a Markov chain selects state-dependent distributions from which observations are drawn. That is, for a sequence of length T, a state sequence s = (s_1, \ldots, s_T) is drawn from the transition distribution p(s_t | s_{t-1}), as well as an observation sequence x = (x_1, \ldots, x_T), drawn from some distribution p(x_t | \theta_{s_t}), where \theta_s is the set of parameters for the distribution indexed by state s. An initial-state distribution, \pi, is also defined to begin the sequence.

For an HMM, the state transitions are discrete and therefore have multinomial distributions, as does the initial-state probability mass function (pmf). We assume the model is ergodic, or that any state can reach any other state in a finite number of steps. Therefore, states can be revisited in a single sequence and the observation distributions reused. For concreteness, we define the observation distributions to be discrete, multinomial of dimension M. Traditionally, the number of states associated with an HMM is initialized and fixed [28]; this is a nuisance parameter that can lead to overfitting if set too high or underfitting if set too low. However, to complete our initial definition we include this variable, S, with the understanding that it is not fixed, but to be inferred from the data.

With this, we define the parameters of a discrete HMM:

A: an S x S matrix where entry a_{ij} represents the transition probability from state i to state j;

B: an S x M matrix where entry b_{ij} represents the probability of observing symbol j from state i;

\pi: an S-dimensional probability mass function for the initial state.

The data generating process may be represented as

    s_t \sim \mathrm{Mult}(a_{s_{t-1},1}, \ldots, a_{s_{t-1},S}), \qquad x_t \sim \mathrm{Mult}(b_{s_t,1}, \ldots, b_{s_t,M}).

The observed and hidden data, x and s, respectively, combine to form the “complete data,” for which we may write the complete-data likelihood

    p(x, s \mid A, B, \pi) = \pi_{s_1} b_{s_1 x_1} \prod_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t x_t}

with the data likelihood p(x \mid A, B, \pi) obtained by integrating over the states using the forward algorithm [28].
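For concreteness, the following short Python sketch (ours, not from the paper) computes the data likelihood via the standard scaled forward recursion for the discrete HMM defined above; the names A, B, and pi mirror the parameter definitions given here.

import numpy as np

def log_likelihood(x, A, B, pi):
    """Forward algorithm for a discrete HMM.
    x  : sequence of integer observations in {0, ..., M-1}
    A  : (S, S) transition matrix, B: (S, M) emission matrix, pi: (S,) initial pmf.
    Returns log p(x | A, B, pi); each step is O(S^2)."""
    log_p = 0.0
    alpha = pi * B[:, x[0]]                  # alpha_1(i) = pi_i * b_{i, x_1}
    log_p += np.log(alpha.sum()); alpha /= alpha.sum()   # scale to avoid underflow
    for t in range(1, len(x)):
        alpha = (alpha @ A) * B[:, x[t]]     # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij b_{j, x_t}
        log_p += np.log(alpha.sum()); alpha /= alpha.sum()
    return log_p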

    III. THE HMM WITH DIRICHLET PRIORS

It is useful to analyze the properties of prior distributions to assess what can be expected in the inference process. For HMMs, the priors typically given to all multinomial parameters are standard Dirichlet distributions. Below, we review these common Dirichlet priors for finite and infinite-state HMMs. We also note that the priors we are concerned with are for the state transition probabilities, or the rows of the A matrix. We assume separate M-dimensional standard Dirichlet priors for all rows of the B matrix throughout this paper (since the M-dimensional observation alphabet is assumed known, as is typical), though this may be generalized to other data-generating modalities.

    A. The Dirichlet Distribution

The standard Dirichlet distribution is conjugate to the multinomial distribution and is written as

    \mathrm{Dir}(\pi \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \prod_{i=1}^{K} \pi_i^{\alpha_i - 1}    (2)

with the mean and variance of an element, \pi_i, being

    E[\pi_i] = \frac{\alpha_i}{\sum_j \alpha_j}, \quad \mathrm{Var}[\pi_i] = \frac{\alpha_i (\sum_j \alpha_j - \alpha_i)}{(\sum_j \alpha_j)^2 (\sum_j \alpha_j + 1)}.    (3)

This distribution can be viewed as a continuous density on the (K-1)-dimensional simplex, since \pi_i \geq 0 for all i and \sum_i \pi_i = 1. When the parameters \alpha_i are less than one, draws of \pi are sparse, meaning that most of the probability mass is located on a subset of \pi. When seeking a sparse state representation, one might find this prior appealing and suggest that the dimensionality, K, be made large (much larger than the anticipated number of states), allowing the natural sparseness of the prior to uncover the correct state number. However, we discuss the issues with this approach in Section III-C.
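To see this sparsity effect numerically, the brief Python sketch below (an illustration under our own choices of K and \alpha, not values taken from the paper) draws from a symmetric Dirichlet with small concentration parameters and reports how many components carry 99% of the mass.

import numpy as np

rng = np.random.default_rng(0)
K, alpha = 50, 1.0                          # hypothetical: 50 dimensions, concentration alpha
pi = rng.dirichlet(np.full(K, alpha / K))   # symmetric Dir(alpha/K, ..., alpha/K)
sorted_pi = np.sort(pi)[::-1]
n99 = np.searchsorted(np.cumsum(sorted_pi), 0.99) + 1
print(f"components holding 99% of the mass: {n99} of {K}")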

    B. The Dirichlet Process

The Dirichlet process [17] is the extension of the standard Dirichlet distribution to a measurable space and takes two inputs: a scalar, nonnegative precision, \alpha, and a base distribution G_0 defined over the measurable space \Theta. For a partition of \Theta into disjoint sets, B_1, \ldots, B_K, where \cup_k B_k = \Theta, the Dirichlet process is defined as

    (G(B_1), \ldots, G(B_K)) \sim \mathrm{Dir}(\alpha G_0(B_1), \ldots, \alpha G_0(B_K)).    (4)

Because G_0 is a continuous distribution, the Dirichlet process is infinite-dimensional, i.e., K \to \infty. A draw, G, from a Dirichlet process is expressed G \sim \mathrm{DP}(\alpha G_0) and can be written in the form

    G = \sum_{i=1}^{\infty} p_i \, \delta_{\theta_i}.

The location \theta_i associated with a measure p_i is drawn from G_0. In the context of an HMM, each \theta_i represents the parameters for the distribution associated with state i from which an observation is drawn. The key difference between the distribution of Section III-A and the Dirichlet process is the infinite extent of the Dirichlet process. Indeed, the finite-dimensional Dirichlet distribution is a good approximation to the Dirichlet process under certain parameterizations [20].

To link the notation of (2) with (4), we note that \alpha_k = \alpha G_0(B_k). When these are uniformly parameterized, choosing K \to \infty means that \alpha_k \to 0 for all k. Therefore, the expectation, as well as the variance, of each \pi_k goes to zero. Ferguson [17] showed that this distribution remains well defined in this extreme circumstance. Later, Sethuraman [30] showed that we can still draw explicitly from this distribution.

1) Constructing a Draw From the Dirichlet Process: Sethuraman’s constructive definition of a Dirichlet prior allows one to draw directly from the infinite-dimensional Dirichlet process and is one instance of a stick-breaking prior included in [19]. It is written as

    G = \sum_{i=1}^{\infty} \pi_i \, \delta_{\theta_i}, \quad \pi_i = V_i \prod_{j=1}^{i-1} (1 - V_j)    (5)

    V_i \sim \mathrm{Beta}(1, \alpha), \quad \theta_i \sim G_0.    (6)

From Sethuraman’s definition, the effect of \alpha on a draw from the Dirichlet process is clear. When \alpha \to 0, the entire stick is broken and allocated to one component, resulting in a degenerate measure at a random component with location drawn from G_0. Conversely, as \alpha \to \infty, the breaks become infinitesimally small and G converges to the empirical distribution of the individual draws from G_0, reproducing G_0 itself.
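A truncated version of Sethuraman's construction is easy to simulate. The Python sketch below (our illustration; the Gaussian base distribution, truncation level, and \alpha value are arbitrary choices) draws an approximate G ~ DP(\alpha G_0) and shows how a small \alpha concentrates mass on a few atoms.

import numpy as np

def dp_draw(alpha, truncation, base_sampler, rng):
    """Approximate draw from DP(alpha * G0) via a truncated Sethuraman construction."""
    V = rng.beta(1.0, alpha, size=truncation)
    V[-1] = 1.0                                          # absorb the remaining stick
    weights = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    atoms = base_sampler(truncation)                     # theta_i ~ G0, iid
    return weights, atoms

rng = np.random.default_rng(1)
w, th = dp_draw(alpha=0.5, truncation=100,
                base_sampler=lambda n: rng.normal(0.0, 1.0, n), rng=rng)
print("weight on the 3 largest atoms:", np.sort(w)[-3:].sum())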

When the space of interest for the DP prior is not the data itself, but rather a parameter, \theta, for a distribution, f(x \mid \theta), from which the data is drawn, a Dirichlet process mixture model results [2] with the following generative process:

    G \sim \mathrm{DP}(\alpha G_0), \quad \theta_n \sim G, \quad x_n \sim f(x \mid \theta_n).    (7)

    C. The Dirichlet Process HMM

Dirichlet process mixture modeling is relevant because HMMs may be viewed as a special case of state-dependent mixture modeling where each mixture shares the same support, but has different mixing weights. These mixture models are linked in that the component selected at time t, from which we draw observation x_t, also indexes the mixture model to be used at time t+1. Using the notation of Section II, and letting b_j denote the jth row of B, we can write the state-dependent mixture model of a discrete HMM as

    x_t \sim \sum_{j=1}^{S} a_{s_{t-1} j} \, \mathrm{Mult}(b_j)    (8)

where it is assumed that the initial state has been selected from the pmf, \pi. From this formulation, we see that we are justified in modeling each state as an independent mixture model with the condition that respective states share the same observation parameters, B. For unbounded state models, one might look to model each transition as a Dirichlet process. However, a significant problem arises when this is done. Specifically, in Sethuraman’s representation of the Dirichlet process, each row, a_i, of the infinite state transition matrix would be drawn as follows:

    a_i = \sum_{k=1}^{\infty} \pi_{ik} \, \delta_{\theta_{ik}}, \quad \pi_{ik} = V_{ik} \prod_{l=1}^{k-1} (1 - V_{il})    (9)

    V_{ik} \sim \mathrm{Beta}(1, \alpha), \quad \theta_{ik} \sim G_0    (10)

where \pi_{ik} is the kth component of the infinite vector \pi_i. However, with each \theta_{ik} indexing a state, it becomes clear that since \theta_{ik} \neq \theta_{jl} with probability one for (i, k) \neq (j, l) when G_0 is continuous, the probability of a transition to a previously visited state is zero. Because G_0 is continuous, this approach for solving an infinite-state HMM is therefore impractical.

1) The Hierarchical Dirichlet Process HMM: To resolve this issue, though developed for more general applications than the HMM, Teh et al. [31] proposed the hierarchical Dirichlet process (HDP), in which the base distribution, G_0, over \Theta is itself drawn from a Dirichlet process, making G_0 almost surely discrete. The formal notation is as follows:

    G_0 \sim \mathrm{DP}(\gamma H), \quad G_i \sim \mathrm{DP}(\alpha G_0), \quad i = 1, 2, \ldots    (11)

The HDP is a two-level approach, where the distribution on the points in \Theta is shifted from the continuous H to the discrete, but infinite, G_0. In this case, multiple draws G_i will have substantial weight on the same set of atoms (i.e., states). We can show this by writing the top-level DP in stick-breaking form and truncating at K, allowing us to write the second-level DP explicitly:

    G_0 = \sum_{k=1}^{K} \beta_k \, \delta_{\theta_k}, \quad \theta_k \sim H    (12)

    (a_{i1}, \ldots, a_{iK}) \sim \mathrm{Dir}(\alpha \beta_1, \ldots, \alpha \beta_K)    (13)

    G_i = \sum_{k=1}^{K} a_{ik} \, \delta_{\theta_k}    (14)

where \delta_{\theta_k} is a probability measure at location \theta_k. Therefore, the HDP formulation effectively selects the number of states and their observation parameters via the top-level DP and uses the mixing weights as the prior for a second-level Dirichlet distribution from which the transition probabilities are drawn. The lack of conjugacy between these two levels means that a truly variational solution [6] does not exist.
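The following Python sketch (our illustration of the truncated construction above, not code from [31]; the values of \gamma, \alpha, and K are arbitrary) draws truncated top-level weights and then samples each transition row from the induced Dirichlet distribution, so that all rows share the same K atoms.

import numpy as np

def truncated_hdp_rows(gamma, alpha, K, n_rows, rng):
    """Truncated HDP: top-level stick-breaking weights beta, then each
    row a_i ~ Dir(alpha * beta) puts its mass on the same K shared states."""
    V = rng.beta(1.0, gamma, size=K)
    V[-1] = 1.0
    beta = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))   # cf. (12)
    return rng.dirichlet(alpha * beta, size=n_rows)                # cf. (13)

rng = np.random.default_rng(2)
A = truncated_hdp_rows(gamma=1.0, alpha=5.0, K=20, n_rows=4, rng=rng)
print(A.argmax(axis=1))   # rows tend to place their largest weight on the same shared states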

    IV. THE HMM WITH STICK-BREAKING PRIORS

A more general class of priors for a discrete probability distribution is the stick-breaking prior [19]. As discussed above, these are priors of the form

    \pi_i = V_i \prod_{j=1}^{i-1} (1 - V_j), \quad V_i \sim \mathrm{Beta}(a_i, b_i)    (15)

for i = 1, 2, \ldots and the parameters a = (a_1, a_2, \ldots) and b = (b_1, b_2, \ldots). This stick-breaking process is more general than that for drawing from the Dirichlet process, in which both the beta-distributed random variables and the atoms associated with the resulting weights are drawn simultaneously. The representation of (15) is more flexible in its parametrization, and in that the weights have been effectively detached from the locations in this process. As written in (15), the process is infinite; however, when a and b terminate at some finite number d, with V_d = 1, the result is a draw from a generalized Dirichlet distribution (GDD) [21], [34]. Since the necessary truncation of the variational model means that we are using a GDD prior for the state transitions, we briefly review this prior here.

For the GDD, the density function of V is

    f(V) = \prod_{i=1}^{d-1} \frac{\Gamma(a_i + b_i)}{\Gamma(a_i)\Gamma(b_i)} V_i^{a_i - 1} (1 - V_i)^{b_i - 1}.    (16)

Following a change of variables from V to \pi, the density of \pi is written as

    f(\pi) = \prod_{i=1}^{d-1} \frac{\Gamma(a_i + b_i)}{\Gamma(a_i)\Gamma(b_i)} \pi_i^{a_i - 1} \Big(1 - \sum_{j=1}^{i} \pi_j\Big)^{c_i}, \quad c_i = b_i - a_{i+1} - b_{i+1} \;(i < d-1), \; c_{d-1} = b_{d-1} - 1    (17)

which has a mean and variance for an element \pi_i of

    E[\pi_i] = \frac{a_i}{a_i + b_i} \prod_{j=1}^{i-1} \frac{b_j}{a_j + b_j}, \quad \mathrm{Var}[\pi_i] = E[\pi_i]\Big( \frac{a_i + 1}{a_i + b_i + 1} \prod_{j=1}^{i-1} \frac{b_j + 1}{a_j + b_j + 1} - E[\pi_i] \Big).    (18)

When b_i = a_{i+1} + b_{i+1} for i < d - 1, and leaving b_{d-1} = a_d, the GDD reverts to the standard Dirichlet distribution of (2).

We will refer to a construction of \pi from the infinite process of (15) as SB(a, b) and from (17) as GDD(a, b).

For a set of integer-valued observations, z_1, \ldots, z_n, the posteriors of the respective priors are parameterized by a_i' = a_i + n_i and b_i' = b_i + \sum_{j > i} n_j, where n_i = \sum_{m=1}^{n} I(z_m = i) and I(\cdot) is an indicator function that equals one when the argument is true and zero otherwise and is used to count the number of times the random variables are equal to values of interest. This posterior calculation will be used in deriving update equations for MCMC and variational inference.
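As a small illustration of this conjugate update (our sketch; the prior values and counts are made up for the example), the Python function below maps observation counts to the posterior Beta parameters of each stick-breaking proportion.

import numpy as np

def sb_posterior(a, b, counts):
    """Conjugate update for stick-breaking / GDD priors:
    a_i' = a_i + n_i,   b_i' = b_i + sum_{j>i} n_j."""
    counts = np.asarray(counts, float)
    tail = np.concatenate((np.cumsum(counts[::-1])[::-1][1:], [0.0]))  # sum_{j>i} n_j
    return a + counts, b + tail

# Example: d = 5 components, a_i = 1, b_i = 2, counts from 10 observations
a_post, b_post = sb_posterior(np.ones(5), 2.0 * np.ones(5), [4, 3, 2, 1, 0])
print(a_post, b_post)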

    A. SB-HMM

For the SB-HMM, we model each row of the infinite state transition matrix, A, with a stick-breaking prior of the form of (15), with the state-dependent parameters drawn iid from a base measure G_0:

    a_{ji} = V_{ji} \prod_{k=1}^{i-1} (1 - V_{jk}), \quad V_{ji} \sim \mathrm{Beta}(a_i, b_i), \quad \theta_j \sim G_0    (19)

for j = 1, 2, \ldots and i = 1, 2, \ldots, and where a = (a_1, a_2, \ldots) and b = (b_1, b_2, \ldots). The initial-state pmf, \pi, is also constructed according to an infinite stick-breaking construction. The key difference in this formulation is that the state transition pmf, a_j, for state j has an infinite stick-breaking prior on the domain of the positive integers, which index the states. That is, the random variable, V_{ji}, is defined to correspond to the portion broken from the remainder of the unit-length stick belonging to state j, which defines the transition probability from state j to state i. The corresponding state-dependent parameters, \theta_i, are then drawn separately, effectively detaching the construction of A from the construction of B. This is in contrast with Dirichlet process priors, where these two matrices are linked in their construction, thus requiring an HDP solution as previously discussed.
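To make the construction concrete, the Python sketch below (ours; the truncation level, the choice a_i = 1 with a fixed b, and the alphabet size are illustrative assumptions rather than the paper's settings) builds a truncated transition matrix whose rows are independent stick-breaking draws over the same ordered set of states and simulates a short observation sequence.

import numpy as np

def sb_transition_matrix(S, b, rng):
    """Each row is an independent truncated stick-breaking draw with
    V_ji ~ Beta(1, b_i); all rows share the same ordered state indices."""
    V = rng.beta(1.0, b, size=(S, S))
    V[:, -1] = 1.0
    stick = np.concatenate((np.ones((S, 1)), np.cumprod(1.0 - V[:, :-1], axis=1)), axis=1)
    return V * stick

rng = np.random.default_rng(3)
S, M, T = 15, 8, 50                        # truncation, alphabet size, sequence length
A = sb_transition_matrix(S, b=2.0 * np.ones(S), rng=rng)
B = rng.dirichlet(np.ones(M), size=S)      # observation pmfs drawn iid
pi0 = sb_transition_matrix(1, 2.0 * np.ones(S), rng)[0]

s = rng.choice(S, p=pi0)
visited, x = {s}, []
for _ in range(T):
    x.append(rng.choice(M, p=B[s]))
    s = rng.choice(S, p=A[s])
    visited.add(s)
print("states visited out of", S, ":", len(visited))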

We observe that this distribution requires an infinite parametrization. However, to simplify this we propose the following generative process:

    a_{ji} = V_{ji} \prod_{k=1}^{i-1} (1 - V_{jk}), \quad V_{ji} \sim \mathrm{Beta}(1, \alpha_i), \quad \alpha_i \sim \mathrm{Gamma}(c, d), \quad \theta_j \sim G_0    (20)

where we have fixed a_i = 1 for all i and changed b_i to \alpha_i to emphasize the similar function of this variable in our model as that found in the finite Dirichlet model. Doing this, we can exploit the conjugacy of the gamma distribution to the Beta(1, \alpha) distribution to provide further flexibility of the model. As in the Dirichlet process, the value of \alpha has a significant impact in our model, as does the setting of the gamma hyperparameters. For example, the posterior of a single \alpha_i, given the n Beta(1, \alpha_i) random variables that it governs, is

    p(\alpha_i \mid V) = \mathrm{Gamma}\Big(c + n, \; d - \sum_{j=1}^{n} \ln(1 - V_{ji})\Big).    (21)


Allowing c, d \to 0, the posterior expectation is E[\alpha_i] \approx n / (-\sum_j \ln(1 - V_{ji})). We see that when V_{ji} is near one, a large portion of the remaining stick is broken, which drives the value of \alpha_i down, with the opposite occurring when V_{ji} is near zero. Therefore, one can think of the gamma prior on each \alpha_i as acting as a faucet that is opened and closed in the inference process.

Therefore, the value of the hyperparameter d is important to the model. If d is allowed to be arbitrarily small, then it is possible for \alpha_i to grow very large, effectively turning off a state transition, which may encourage sparseness in a particular transition pmf, a_j, but not over the entire state structure of the model. With d > 0, the value of d has the effect of ensuring that, under the conditional posterior of \alpha_i,

    E[\alpha_i \mid V] \leq \frac{c + n}{d}    (22)

which will mitigate this effect. The SB-HMM has therefore been designed to accommodate an infinite number of states, with the statistical property that only a subset will be used with high probability.

    B. The Marginal SB Construction

To further analyze the properties of this prior, we discuss the process of integrating out the state transition probabilities by integrating out the random variables, V, of the stick-breaking construction. This will be used in the collapsed Gibbs sampler discussed in the next section. To illustrate this idea with respect to the Dirichlet process, we recall that marginalizing G yields what is called the Chinese restaurant process (CRP) [1]. In sampling from this process, observation n+1 is drawn according to the distribution

    \theta_{n+1} \mid \theta_1, \ldots, \theta_n \sim \sum_{i} \frac{n_i}{\alpha + n} \, \delta_{\theta_i} + \frac{\alpha}{\alpha + n} \, G_0    (23)

where n_i is the number of previous observations that used parameter \theta_i. If the new component is selected, a new parameter is drawn from G_0. In the CRP analogy, n_i represents the number of customers sitting at table i. A new customer sits at this table with probability proportional to n_i, or at a new table with probability proportional to \alpha. The empirical distribution of (23) converges to G as n \to \infty. Below, we detail a similar process for drawing samples from a single, marginalized stick-breaking construction, with this naturally extended to the state transitions of the HMM.
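The predictive rule in (23) is easy to simulate. This short Python sketch (ours, with an arbitrary concentration value) generates cluster assignments from a CRP and illustrates the rich-get-richer behavior.

import numpy as np

def crp_assignments(n, alpha, rng):
    """Sequentially sample table assignments from a Chinese restaurant process."""
    counts, z = [], []
    for _ in range(n):
        probs = np.array(counts + [alpha], float)
        probs /= probs.sum()                     # P(table i) ∝ n_i, P(new table) ∝ alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                     # open a new table
        else:
            counts[k] += 1
        z.append(k)
    return z, counts

z, counts = crp_assignments(200, alpha=2.0, rng=np.random.default_rng(4))
print("tables used:", len(counts), "occupancy:", counts)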

In (15), we discussed the process for constructing \pi from the infinite stick-breaking construction using beta-distributed random variables, V_1, V_2, \ldots. The process of sampling from \pi is also related to V in the following way: Instead of drawing integer-valued samples, z \sim \mathrm{Mult}(\pi), one can alternatively sample from \pi by drawing random variables y_i \sim \mathrm{Bernoulli}(V_i) for i = 1, 2, \ldots and setting z = \min\{i : y_i = 1\}. That is, by sampling sequentially from \mathrm{Bernoulli}(V_1), \mathrm{Bernoulli}(V_2), \ldots, and terminating when a one is drawn, setting z to that respective index. Fig. 1(a) contains a visualization of this process using a tree structure, where we sample paths until terminating at a leaf node.

Fig. 1. The tree structure for drawing samples from (a) an infinite stick-breaking construction and (b) the marginalized stick-breaking construction. The parameters for (a) are drawn exactly and fixed. The parameters of (b) evolve with the data and converge as n \to \infty. Beginning at the top of the tree, paths are sequentially chosen according to a biased coin flip, terminating at a leaf node.

We now consider sampling from the marginalized version of this distribution by sequentially sampling from a set of two-dimensional Pólya urn models. As with the urn model for the DP, we can sample observation n+1 from a marginalized beta distribution having prior hyperparameters a and b and conditioned on n observations as follows:

    p(y_{n+1} \mid y_1, \ldots, y_n) = \frac{(a + n_1)^{I(y_{n+1} = 1)} \, (b + n_0)^{I(y_{n+1} = 0)}}{a + b + n}    (24)

where n_1 is the number of previous samples taking value one, n_0 = n - n_1, and I(\cdot) is the indicator function defined earlier. Equation (24) indicates that y_{n+1} = 1 with probability (a + n_1)/(a + b + n), with the corresponding probability for y_{n+1} = 0. This equation arises by performing the following integration over the random variable, V:

    p(y_{n+1} \mid y_1, \ldots, y_n) = \int_0^1 V^{I(y_{n+1}=1)} (1 - V)^{I(y_{n+1}=0)} \, \mathrm{Beta}(V \mid a + n_1, b + n_0) \, dV    (25)

which equals (a + n_1)/(a + b + n) if y_{n+1} = 1 and (b + n_0)/(a + b + n) if y_{n+1} = 0.


As with the CRP, this converges to a sample from the beta distribution as n \to \infty. Fig. 1(b) contains the extension of this to the stick-breaking process. As shown, each point of intersection of the tree follows a local marginalized beta distribution, with paths along this tree selected according to these distributions. The final observation is the leaf node at which this process terminates. In keeping with other restaurant analogies, we tell the following intuitive story to accompany this picture.

Person n+1 walks down a street containing many Chinese restaurants. As he approaches each restaurant, he looks inside to see how many people are present. He also hears the noise coming from other restaurants down the road and is able to perfectly estimate the total number of people who have chosen to eat at one of the restaurants he has not yet seen. He chooses to enter restaurant i with probability (1 + n_i)/(1 + \alpha_i + n_i + m_i) and to continue walking down the road with probability (\alpha_i + m_i)/(1 + \alpha_i + n_i + m_i), where n_i is the number of people in restaurant i and m_i is the number of people in restaurants still down the road; these are probabilities nearly proportional to the number of people in restaurant i and in restaurants still down the road, respectively. If he reaches the end of the populated restaurants, he continues to make decisions according to the prior shared by all people. This analogy can perhaps be called the “Chinese restaurant district.”

To extend this idea to the HMM, we note that each leaf node represents a state, and therefore sampling a state at time t according to this process also indexes the parameters that are to be used for sampling the subsequent state. Therefore, each of the trees in Fig. 1 is indexed by a state, with its own count statistics. This illustration provides a good indication of the natural sparseness of the stick-breaking prior, as these binary sequences are more likely to terminate on a lower indexed state, with the probability of a leaf node decreasing as the state index increases.
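A minimal simulation of this marginalized stick-breaking ("Chinese restaurant district") sampler is sketched below in Python (our illustration; the Beta(1, alpha) parametrization with a single shared alpha value is an assumption made for brevity), sampling states by walking down the tree of Fig. 1(b).

import numpy as np

def crd_sample(counts, alpha, rng):
    """Draw one state from the marginalized stick-breaking construction.
    counts[i] = number of previous draws equal to state i; Beta(1, alpha) breaks."""
    i = 0
    while True:
        if i == len(counts):
            counts.append(0)                        # walk past all populated restaurants
        n_here = counts[i]
        n_beyond = sum(counts[i + 1:])
        p_enter = (1.0 + n_here) / (1.0 + alpha + n_here + n_beyond)
        if rng.random() < p_enter:
            counts[i] += 1
            return i
        i += 1

rng = np.random.default_rng(5)
counts = []
draws = [crd_sample(counts, alpha=2.0, rng=rng) for _ in range(100)]
print("occupancy by state index:", counts)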

    V. INFERENCE FOR THE SB-HMM

In this section, we discuss Markov chain Monte Carlo (MCMC) and variational Bayes (VB) inference methods for the SB-HMM. For Gibbs sampling MCMC, we derive a retrospective sampler that is similar to [25] in that it allows the state number to grow and shrink as needed. We also derive a collapsed inference method [22], where we marginalize over the state transition probabilities. This process is similar to other urn models [21], including that for the Dirichlet process [11].

A. MCMC Inference for the SB-HMM: A Retrospective Gibbs Sampler

We define S to be the total number of used states following an iteration. Our sampling methods require that only the utilized states be monitored for any given iteration, along with state S+1 drawn from the base distribution, a method which accelerates inference.

The full posterior of our model can be written as p(A, B, \pi, \alpha, s \mid x), where x represents the observed sequences and s their hidden state sequences. Following Bayes’ rule, we sample each parameter from its full conditional posterior as follows:

1) Sample the gamma-distributed hyperparameter from its respective conditional posterior:

    (26)

Draw new values for the observation parameters \theta_{S+1} and the new row of A from the prior.

2) Construct the state transition probabilities of A from their stick-breaking conditional posteriors:

    V_{ij} \sim \mathrm{Beta}\Big(1 + n_{ij}, \; \alpha_j + \sum_{k > j} n_{ik}\Big), \quad a_{ij} = V_{ij} \prod_{k=1}^{j-1} (1 - V_{ik})    (27)

where, summing over sequences, n_{ij} = \sum_l \sum_t I(s_{t-1}^{(l)} = i, s_t^{(l)} = j), the count of transitions from state i to state j. Set the weight of state S+1 in each row to the remaining stick length, a_{i,S+1} = \prod_{j=1}^{S} (1 - V_{ij}), and draw the (S+1)-dimensional probability vector for the new state from the GDD prior.

3) Sample the innovation parameters for states i = 1, \ldots, S+1 from their gamma-distributed posteriors:

    (28)

4) Sample \pi from its stick-breaking conditional posterior:

    V_i \sim \mathrm{Beta}\Big(1 + n_i, \; \alpha_i + \sum_{k > i} n_k\Big), \quad \pi_i = V_i \prod_{k=1}^{i-1} (1 - V_k)    (29)

for i = 1, \ldots, S, with n_i = \sum_l I(s_1^{(l)} = i), the number of times a sequence begins in state i. Set \pi_{S+1} = \prod_{i=1}^{S} (1 - V_i).

5) Sample each row, b_i, of the observation matrix B from its Dirichlet conditional posterior:

    b_i \sim \mathrm{Dir}(\beta_1 + m_{i1}, \ldots, \beta_M + m_{iM})    (30)

where, summing over sequences, m_{ij} = \sum_l \sum_t I(x_t^{(l)} = j, s_t^{(l)} = i), the number of times symbol j is observed while in state i, and \beta denotes the Dirichlet prior parameters. Draw a new atom \theta_{S+1} from the base distribution G_0.

6) Sample each new state sequence from the conditional posterior, defined for a given sequence as

    p(s \mid x, A, B, \pi) \propto \pi_{s_1} b_{s_1 x_1} \prod_{t=2}^{T} a_{s_{t-1} s_t} b_{s_t x_t}.    (31)

7) Set S equal to the number of unique states drawn in 6) and prune away any unused states.

This process iterates until convergence [3], at which point each uncorrelated sample of (1)–(6) can be considered a sample from the full posterior. In practice, we find that it is best to initialize with a large number of states and let the algorithm prune away, rather than allow for the state number to grow from a small number, a process that can be very time consuming. We also mention that, as the transition weights of the model are ordered in expectation, a label switching procedure may be employed [25], [26], where the indices are reordered based on the state membership counts calculated from s such that these counts are decreasing with an increasing index number.
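A compact sketch of the reordering and pruning step described above (our code; the zero-occupancy pruning threshold and the variable names are assumptions) is:

import numpy as np

def relabel_and_prune(A, B, state_seqs):
    """Reorder states by decreasing membership count and drop unused ones."""
    S = A.shape[0]
    counts = np.bincount(np.concatenate(state_seqs), minlength=S)
    order = np.argsort(-counts)                    # most-used states first
    keep = order[counts[order] > 0]                # prune states with no members
    remap = {old: new for new, old in enumerate(keep)}
    A_new = A[np.ix_(keep, keep)]
    A_new /= A_new.sum(axis=1, keepdims=True)      # renormalize rows after pruning
    B_new = B[keep]
    seqs_new = [np.array([remap[s] for s in seq]) for seq in state_seqs]
    return A_new, B_new, seqs_new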

B. MCMC Inference for the SB-HMM: A Collapsed Gibbs Sampler

It is sometimes useful to integrate out parameters in a model, thus collapsing the structure and reducing the amount of randomness in the model [22]. This process can lead to faster convergence to the stationary posterior distribution. In this section, we discuss integrating out the infinite state transition probabilities, A, which removes the randomness in the construction of the A matrix. These equations are a result of the discussion in Section IV-B. The following modifications can be made to the MCMC sampling method of Section V-A to perform collapsed Gibbs sampling:

1) To sample \alpha, sample the stick-breaking variables V as in Step 2) of Section V-A using the previous values for s. This is only for the purpose of resampling \alpha as in Step 1) of Section V-A. See [16] for a similar process for marginalized DP mixtures.

2) The transition probability, a_{ij}, is now constructed using the marginalized beta probabilities

    a_{ij} = \frac{1 + n_{ij}}{1 + \alpha_j + n_{ij} + m_{ij}} \prod_{k=1}^{j-1} \frac{\alpha_k + m_{ik}}{1 + \alpha_k + n_{ik} + m_{ik}}, \quad m_{ik} = \sum_{l > k} n_{il}.

A similar process must be undertaken for the construction of \pi. Furthermore, we mention that the state sequences can also be integrated out using the forward-backward algorithm for additional collapsing of the model structure.

    C. VB Inference for the SB-HMM

Variational Bayesian inference [6], [32] is motivated by

    \ln p(X) = \ln \int p(X, \Phi) \, d\Phi    (32)

which can be rewritten as

    \ln p(X) = L(Q) + \mathrm{KL}\big(Q(\Phi) \,\|\, p(\Phi \mid X)\big)    (33)

where \Phi represents the model parameters and hidden data, X the observed data, Q an approximating density to be determined, and

    L(Q) = \int Q(\Phi) \ln \frac{p(X, \Phi)}{Q(\Phi)} \, d\Phi.    (34)

The goal is to best approximate the true posterior p(\Phi \mid X) by minimizing \mathrm{KL}(Q \| p(\Phi \mid X)). Due to the fact that \ln p(X) does not depend on Q, this can be done by maximizing L(Q). This requires a factorization of the distributions, or

    Q(\Phi) = \prod_i Q_i(\Phi_i).    (35)

A general method for performing variational inference for conjugate-exponential Bayesian networks, outlined in [32], is as follows: For a given node in a graph, write out the posterior as though everything were known, take the natural logarithm, take the expectation with respect to all unknown parameters, and exponentiate the result. Since it requires computational resources comparable to the expectation-maximization algorithm, variational inference is fast. However, the deterministic nature of the algorithm requires that we truncate to a fixed state number (the truncation level). As will be seen in the posterior, however, only a subset of these states will contain substantial weight. We call this truncated version the truncated stick-breaking HMM (TSB-HMM).

1) VB-E Step: For the VB-E step, we calculate the variational expectation with respect to all unknown parameters. For a given sequence, we can write

    (36)

To aid in the cleanliness of the notation, we first provide the general variational equations for drawing from the generalized Dirichlet distribution, which we recall is the distribution that results following the necessary truncation of the variational model. Consider a truncation to K dimensions and let n_i be the expected number of observations from component i for a given iteration. With posterior parameters a_i' = a_i + n_i and b_i' = b_i + \sum_{j > i} n_j, the variational equations can be written as

    \langle \ln V_i \rangle = \psi(a_i') - \psi(a_i' + b_i'), \quad \langle \ln (1 - V_i) \rangle = \psi(b_i') - \psi(a_i' + b_i'), \quad \langle \ln \pi_i \rangle = \langle \ln V_i \rangle + \sum_{j=1}^{i-1} \langle \ln (1 - V_j) \rangle    (37)

where \psi(\cdot) represents the digamma function. See [8] for further discussion. We use the above steps with the appropriate count values inserted for n_i to calculate the variational expectations for A and \pi,

where a_j is the jth row of the transition matrix. This requires use of the expectation \langle \alpha_i \rangle = c_i'/d_i', where c_i' and d_i' are the posterior parameters of q(\alpha_i). The variational equation for q(\alpha_i) is

    q(\alpha_i) = \mathrm{Gamma}\Big(c + S, \; d - \sum_{j=1}^{S} \langle \ln(1 - V_{ji}) \rangle \Big).    (38)

Fig. 2. Synthesized Data—State occupancy as a function of Gibbs sample number. A threshold set to 99% indicates that most of the data resides in the true number of states, while a subset attempts to innovate.

The values for the expected counts are outputs of the forward-backward algorithm from the previous iteration. Given these values for \langle \ln \pi \rangle, \langle \ln A \rangle, and \langle \ln B \rangle, the forward-backward algorithm is employed as usual.
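As an illustration of the updates in (37) (our sketch; the prior values a_i = 1, b_i = 2 and the expected-count vector are made up for the example), the expected log stick-breaking weights can be computed as:

import numpy as np
from scipy.special import digamma

def expected_log_pi(a, b, counts):
    """E[ln pi_i] under the variational GDD posterior with
    a_i' = a_i + n_i and b_i' = b_i + sum_{j>i} n_j (cf. (37))."""
    counts = np.asarray(counts, float)
    tail = np.concatenate((np.cumsum(counts[::-1])[::-1][1:], [0.0]))
    a_post, b_post = a + counts, b + tail
    e_log_v = digamma(a_post) - digamma(a_post + b_post)
    e_log_1mv = digamma(b_post) - digamma(a_post + b_post)
    return e_log_v + np.concatenate(([0.0], np.cumsum(e_log_1mv[:-1])))

print(expected_log_pi(np.ones(5), 2.0 * np.ones(5), [10.2, 5.1, 1.0, 0.1, 0.0]).round(3))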

2) VB-M Step: Updating of the variational posteriors in the VB-M step is a simple updating of the sufficient statistics obtained from the VB-E step. They are as follows:

    (39)

where the primed quantities are the respective posterior parameters calculated using the appropriate expected count values. This process iterates until convergence to a locally optimal solution.

    VI. EXPERIMENTAL RESULTS

We demonstrate our model on both synthetic and digital music data. For synthetic data, this is done on a simple HMM using MCMC and VB inference. We then provide a comparison of our inference methods on a small-scale music problem, which is done to help motivate our choice of VB inference for a large-scale music recommendation problem, where we analyze 2250 pieces of music from the classical, jazz, and rock genres. For this problem, our model will demonstrate comparable or better performance than several algorithms while using substantially fewer states than the finite Dirichlet HMM approach.

Fig. 3. Synthesized Data—State occupancy as a function of sample for collapsed Gibbs sampling. The majority of the data again resides in the true state number, with less innovation.

    A. MCMC Results on Synthesized Data

We synthesize data from the following HMM to demonstrate the effectiveness of our model in uncovering the underlying state structure:

    (40)

From this model were generated sequences totaling 1500 observations. We placed gamma priors on each \alpha_i and Dirichlet priors on the rows of the observation matrix. We select the gamma shape parameter to be an arbitrarily small number. The setting of the rate parameter, d, is a value that requires some care, and we were motivated by setting a bound of the form in (22), as discussed in Section IV-A. We ran 10,000 iterations using the Gibbs sampling methods outlined in Sections V-A and V-B and plot the results in Figs. 2 and 3. Because we are not sampling from this chain, but are only interested in the inferred state number, we do not distinguish between burn-in and collection iterations.

It is observed that the state value does not converge exactly to the true number, but continually tries to innovate around the true state number. To give a sense of the underlying data structure, a threshold is set, and the minimum number of states containing at least 99% of the data is plotted in red. This value is calculated for each iteration by counting the state memberships from the sampled sequences, sorting in decreasing order and finding the resulting distribution. This is intended to emphasize that almost all of the data clusters into the correct number of states, whereas if only one of the 1500 observations selects to innovate, this is included in the total state number of the blue line. As can be seen, the red line converges more tightly to the correct state number. In Fig. 3, we see that collapsing the model structure shows even tighter convergence with less innovation [22].

We mention that the HDP solution also identifies the correct number of states for this problem, but as we are more interested in the VB solution, we do not present these results here. Our main interest here is in validating our model using MCMC inference.


TABLE I. A LIST OF MUSIC PIECES USED BY GENRE. THE DATA IS INTENDED TO CLUSTER BY GENRE, AS WELL AS BY SUBGENRE GROUPS OF SIZE FOUR

Fig. 4. The number of occupied states as a function of iteration number averaged over 500 runs. As can be seen, the TSB-HMM converges more quickly to the simplified model.

Fig. 5. The negative log-likelihood ratio between the inferred HMM and the true HMM displayed as a per-observation average. The TSB-HMM is shown to be invariant to state initialization, while the EM-HMM experiences a degradation in quality as the number of states deviates from the true number.

    B. Variational Results on Synthesized Data

Using the same sequences and parametrization as above, we then built 500 TSB-HMM models, truncating to 30 states and defining convergence to be when the fractional change in the lower bound falls below a preset tolerance. In Fig. 4, we show the number of states as a function of iteration number averaged over 500 runs for the TSB-HMM compared with the standard Dirichlet HMM models with a fixed setting of \alpha.

For this toy problem, both models converge to the same state number, but the TSB-HMM converges noticeably faster. We also observe that the variational solution does not converge exactly to the true state number on average, but slightly overestimates the number of states. This may be due to the locally optimal nature of the solution. In the large-scale example, we will see that these two models do not converge to the same state numbers; future research is required to discover why this may be.

To emphasize the impact that the inferred state structure can have on the quality of the model, we compare our results with the EM-HMM [7]. For each state initialization shown in Fig. 5, we built 100 models using the two approaches and compared the quality of the resulting models by calculating the negative log-likelihood ratio between the inferred model and the ground truth model using a data sequence of length 5,000 generated from the true model. The more similar two HMMs are statistically, the smaller this value should be. As Fig. 5 shows, for the EM algorithm this “distance” increases with an increasingly incorrect state initialization, indicating poorer quality models due to overfitting. Our model is shown to be relatively invariant to this initialization.

C. A Comparison of MCMC and Variational Inference Methods on Music Data

To assess the relative quality of our MCMC and VB approaches, we consider a small-scale music recommendation problem. We select eight pieces each from the classical, jazz, and rock genres for a total of 24 pieces (see Table I) that are intended to cluster by genre. Furthermore, the first and last four pieces within each genre are also selected to cluster together, though not as distinctly.

From each piece of music, we first extracted 20-dimensional MFCC features [13], ten per second. Using these features, we constructed a global codebook using k-means clustering, with which we quantized each piece of music. For each inference method, we placed gamma priors on each \alpha_i and adaptively set the Dirichlet prior on the observation statistics to be proportional to the empirical pmf of observations for the piece of music. For the retrospective and


Fig. 6. Kernel maps for: (a) retrospective MCMC; (b) collapsed MCMC; (c) VB (averaged over 10 runs). The larger the box, the larger the similarity between any two pieces of music. The VB results are very consistent with the MCMC results.

collapsed MCMC methods, we used 7000 burn-in iterations to ensure proper convergence [3] and 3000 collection iterations, sampling every 300 iterations. Also, ten TSB-HMM models are built for each piece of music using the same convergence measure as in the previous section.

To assess quality, we consider the problem of music recommendation where, for a particular piece of music, other pieces are recommended based on the sorting of a distance metric. In this case, we use a distance based on the log-likelihood ratio between two HMMs, which is obtained by using data generated from each model as follows:

    (41)

where X_i is a set of sequences generated from the indicated HMM; in this paper, we use the original signal to calculate this distance and the expectation of the posterior to represent the models. For this small problem, the likelihood was averaged over the multiple HMMs before calculating the ratio.
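The exact form of (41) is not reproduced here; as a hedged illustration, the Python sketch below implements one common symmetrized log-likelihood-ratio distance between two HMMs (our choice of normalization, which may differ from the paper's), reusing the log_likelihood forward-algorithm function sketched in Section II.

def llr_distance(x1, hmm1, x2, hmm2):
    """Symmetric log-likelihood-ratio 'distance' between two HMMs, using the
    data associated with each model (a common form; the normalization is ours).
    Assumes log_likelihood(x, A, B, pi) from the earlier forward-algorithm sketch."""
    A1, B1, pi1 = hmm1
    A2, B2, pi2 = hmm2
    d12 = (log_likelihood(x1, A1, B1, pi1) - log_likelihood(x1, A2, B2, pi2)) / len(x1)
    d21 = (log_likelihood(x2, A2, B2, pi2) - log_likelihood(x2, A1, B1, pi1)) / len(x2)
    return 0.5 * (d12 + d21)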

Fig. 6 displays the kernel maps of these distances, where we use the radial basis function with a kernel width set to the 10% quantile of all distances to represent proximity, meaning larger boxes indicate greater similarity. We observe that the performance is consistent for all inference methods and uniformly good. The genres cluster clearly and the subgenres cluster as a whole, with greater similarity between other pieces in the same genre. In Fig. 7, we show the kernel map for one VB run, indicating consistency with the average. We also show a plot of the top four recommendations for each piece of music for this VB run, showing with more clarity the clustering by subgenre. From these results, we believe that we can be confident in the results provided in the next section, where we only consider VB inference and build only one HMM per piece of music. We also mention that this same experiment was performed with the HDP-iHMM of Section III-C-1 and produced similar results using MCMC sampling methods. However, as previously discussed, we cannot consider this method for fast variational inference.

    D. Experimental Results on a 2250 Piece Music Database

For our large-scale analysis, we used a personal database of 2250 pieces of music, 750 each in the classical, jazz, and rock genres. Using two minutes selected from each piece of music,

Fig. 7. (a) Kernel map for one typical VB run, indicating good consistency with the average variational run. (b) A graphic of the four closest pieces of music. The performance indicates an ability to group by genre as well as subgenre.

we extracted ten 20-dimensional MFCC feature vectors per second and quantized using a global codebook of size 100, again constructing this codebook using the k-means algorithm on a sampled subset of the feature vectors. We built a TSB-HMM on each quantized sequence with a truncation level of 50 and using the prior settings of the previous section. For inference, we devised the following method to significantly accelerate convergence: Following each iteration, we check the expected number of observations from each state and prune those states that have smaller than one expected observation. Doing so, we were able to significantly reduce inference times for both the TSB-HMMs and the finite Dirichlet HMMs.
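This pruning rule is simple to implement; a minimal Python sketch (ours; the occupancy array gamma is assumed to hold the per-state expected counts from the VB-E step, and the parameter names are hypothetical) is:

import numpy as np

def prune_states(gamma, params, min_count=1.0):
    """Drop states whose expected occupancy falls below min_count.
    gamma  : (S,) expected number of observations assigned to each state
    params : dict of arrays whose leading dimension indexes states"""
    keep = np.where(gamma >= min_count)[0]
    return keep, {name: arr[keep] for name, arr in params.items()}

# Hypothetical usage after a VB-E step:
# keep, params = prune_states(expected_counts, {"A_rows": A, "B": B})
# A = params["A_rows"][:, keep]   # also drop pruned columns of the transition matrix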

As previously mentioned, the parameter \alpha has a significant impact on inference for the finite Dirichlet HMM (here also initialized to 50 states). This is seen in the box plots of Fig. 8, where the number of utilized states increases with an increasing \alpha. We also observe in the histogram of Fig. 9 that the state usage of our TSB-HMM reduces to levels unreachable in the finite Dirichlet model. Therefore, provided that the reduced complexity of our inferred models does not degrade the quality of the HMMs, our model is able to provide a more compact representation of the sequential properties of the data.

This compactness can be important in the following way: As noted in the introduction, the calculation of distances using the approach of the previous section scales quadratically with the number of states. Therefore, for very large databases of music, superfluous states can waste significant processing power when recommending music in this way. For example, the distance calculations for our problem took 14 hours, 22 minutes using the finite Dirichlet HMM and 13 hours for our TSB-HMM; much of this time was due to the same computational overhead. We believe that this time savings can be a significant benefit for very large scale problems, including those beyond music modeling.

Below, we compare the performance of our model with several other methods. We compare with the traditional VB-HMM [24] under two settings of the Dirichlet parameter, as well as the maximum-likelihood EM-HMM [7] with a state initialization of 50. We also compare with a simple calculation of the KL-divergence between the empirical codebook usage of two pieces of music, equivalent to building an HMM with one state. Another comparison is with the Gaussian mixture model using VB inference [33], termed the VB-GMM, where we use a 50-dimensional Dirichlet prior on the mixing weights. This model is built on the original feature vectors and an equivalent distance metric is used.


Fig. 8. (a) A box plot of the state usage for the finite Dirichlet HMMs of 2,250 pieces of music as a function of the parameter \alpha with 50 initialized states. (b) Box plots of the smallest number of states containing at least 99% of the data as a function of \alpha. As can be seen, the state usage for the standard variational HMM is very sensitive to this parameter.

To assess quality, in Tables II–IV, we ask a series of questions regarding the top 10 recommendations for each piece of music and tabulate the probability of a recommendation meeting these criteria. We believe that these questions give a good overview of the quality of the recommendations on both large and fine scales. For example, the KL-divergence is able to recommend music well by genre, but on the subgenre level the performance is noticeably worse; the VB-GMM also performs significantly worse. A possible explanation for these inferior results is the lack of sequential modeling of the data. We also notice that the performance of the finite Dirichlet model is not consistent for various settings. As \alpha increases, we see a degradation of performance, which we attribute to overfitting due to the increased state usage. This overfitting is most clearly highlighted in the EM implementation of the HMM with the state number initialized to 50.

Fig. 9. (a) A histogram of the number of occupied states using a gamma prior on \alpha and a state initialization of 50, as well as (b) the smallest number of states containing at least 99% of the data. The reduction in states is greater than that for any parametrization of the Dirichlet model.

TABLE II. THE PROBABILITY THAT A RECOMMENDATION IN THE TOP 10 IS OF THE SAME GENRE

TABLE III. CLASSICAL—THE PROBABILITY THAT A RECOMMENDATION IN THE TOP 10 IS OF THE SAME SUBGENRE

TABLE IV. JAZZ—THE PROBABILITY THAT A TOP 10 RECOMMENDATION FOR SAXOPHONISTS JOHN COLTRANE, CHARLIE PARKER, AND SONNY ROLLINS IS FROM ONE OF THESE SAME THREE ARTISTS. HARD ROCK—THE PROBABILITY THAT A TOP 10 RECOMMENDATION FOR HARD ROCK GROUPS JIMI HENDRIX AND LED ZEPPELIN IS FROM ONE OF THESE SAME TWO ARTISTS. THE BEATLES—THE PROBABILITY THAT A TOP 10 BEATLES RECOMMENDATION IS ANOTHER BEATLES SONG

We emphasize that we do not conclude on the basis of our experiments that our model is superior to the finite Dirichlet HMM, but rather that our prior provides the ability to infer a greater range in the underlying state structure than possible with the finite Dirichlet approach, which in certain cases, such as our music recommendation problem, may be desirable. Additionally, though we infer the number of states through the use of these priors, the hyperparameter settings of those priors, specifically the gamma hyperparameters, may still need to be tuned to the problem at hand. Though it can be argued that a wider range in state structure can be obtained by a smaller truncation of the Dirichlet priors used on the transition probabilities, we believe that this is a less principled approach, as it prohibits the data from controlling the state clustering that naturally arises under the given prior. Doing so would also not be in the spirit of nonparametric inference, which is our motivation here. Rather, we could see our model being potentially useful in verifying whether a simpler state structure is or is not more appropriate than what the Dirichlet approach naturally infers in the VB setting, while still allowing for this state structure to be inferred nonparametrically.

    VII. CONCLUSION

We have presented an infinite-state HMM that utilizes the stick-breaking construction to simplify model complexity by reducing state usage to the amount necessary to properly model the data. This was aided by the use of gamma priors on the parameters of the beta distributions used for each break, which act as faucets allowing data to pass to higher state indices. The efficacy of our model was demonstrated in the MCMC and VB settings on synthesized data, as well as on a music recommendation problem, where we showed that our model performs as well as, or better than, a variety of other algorithms.

We mention that this usage of gamma priors can be applied to infinite mixture modeling as well. The stick-breaking construction of Sethuraman, which uses a Beta(1, \alpha) prior on the stick-breaking proportions, only allows for a single gamma prior to be placed on the shared \alpha. This is necessary to be theoretically consistent with the Dirichlet process, but can be too restrictive in the inference process due to the “left-bias” of the prior. If this were to be relaxed and separate gamma priors were to be placed on each \alpha_i for each break i, this problem would be remedied, though the resulting model would no longer be a Dirichlet process.

REFERENCES

[1] D. Aldous, “Exchangeability and related topics,” in École d’été de probabilités de Saint-Flour XIII—1983. Berlin, Germany: Springer, 1985, pp. 1–198.

[2] C. E. Antoniak, “Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems,” Ann. Statist., vol. 2, pp. 1152–1174, 1974.

[3] K. B. Athreya, H. Doss, and J. Sethuraman, “On the convergence of the Markov chain simulation method,” Ann. Statist., vol. 24, pp. 69–100, 1996.

[4] J. J. Aucouturier and M. Sandler, “Segmentation of musical signals using hidden Markov models,” in Proc. 110th Convention Audio Eng. Soc., 2001.

[5] L. R. Bahl, F. Jelinek, and R. L. Mercer, “A maximum likelihood approach to continuous speech recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 5, no. 2, pp. 179–190, 1983.

[6] M. J. Beal, “Variational algorithms for approximate Bayesian inference,” Ph.D. thesis, Gatsby Computat. Neurosci. Unit, Univ. College London, London, U.K., 2003.

[7] J. A. Bilmes, “A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models,” Univ. Calif., Berkeley, Tech. Rep. 97-021, 1998.

[8] D. M. Blei and M. I. Jordan, “Variational inference for Dirichlet process mixtures,” Bayesian Anal., vol. 1, no. 1, pp. 121–144, 2006.

[9] R. J. Boys and D. A. Henderson, “A comparison of reversible jump MCMC algorithms for DNA sequence segmentation using hidden Markov models,” Comput. Sci. Statist., vol. 33, pp. 35–49, 2001.

[10] P. K. Bharadwaj, P. R. Runkle, and L. Carin, “Target identification with wave-based matched pursuits and hidden Markov models,” IEEE Trans. Antennas Propag., vol. 47, pp. 1543–1554, 1999.

[11] D. Blackwell and J. B. MacQueen, “Ferguson distributions via Pólya urn schemes,” Ann. Statist., vol. 1, pp. 353–355, 1973.

[12] R. J. Connor and J. E. Mosimann, “Concepts of independence for proportions with a generalization of the Dirichlet distribution,” J. Amer. Statist. Assoc., vol. 64, pp. 194–206, 1969.

[13] S. Davis and P. Mermelstein, “Comparison of parametric representations of monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust. Speech Signal Process., vol. 28, pp. 357–366, 1980.

[14] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. Royal Statist. Soc. B, vol. 39, no. 1, pp. 1–38, 1977.

[15] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, U.K.: Cambridge Univ. Press, 1999.

[16] M. D. Escobar and M. West, “Bayesian density estimation and inference using mixtures,” J. Amer. Statist. Assoc., vol. 90, no. 430, pp. 577–588, 1995.

[17] T. Ferguson, “A Bayesian analysis of some nonparametric problems,” Ann. Statist., vol. 1, pp. 209–230, 1973.

[18] P. Halmos, “Random Alms,” Ann. Math. Statist., vol. 15, pp. 182–189, 1944.

[19] H. Ishwaran and L. F. James, “Gibbs sampling methods for stick-breaking priors,” J. Amer. Statist. Assoc., vol. 96, pp. 161–173, 2001.

[20] H. Ishwaran and M. Zarepour, “Dirichlet prior sieves in finite normal mixtures,” Statistica Sinica, vol. 12, pp. 941–963, 2002.

[21] N. Johnson and S. Kotz, Urn Models and Their Applications, ser. Series in Probabil. Math. Statist. New York: Wiley, 1977.

[22] J. S. Liu, “The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem,” J. Amer. Statist. Assoc., vol. 89, pp. 958–966, 1994.

[23] S. MacEachern and P. Mueller, “Efficient MCMC schemes for robust model extensions using encompassing Dirichlet process mixture models,” in Robust Bayesian Anal., F. Ruggeri and D. Rios Insua, Eds. New York: Springer-Verlag, 2000.

[24] D. J. C. MacKay, “Ensemble learning for hidden Markov models,” Cavendish Lab., Univ. Cambridge, Cambridge, U.K., Tech. Rep., 1997.

[25] O. Papaspiliopoulos and G. O. Roberts, “Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models,” Biometrika, vol. 95, no. 1, pp. 169–186, 2008.

[26] I. Porteous, A. Ihler, P. Smyth, and M. Welling, “Gibbs sampling for (coupled) infinite mixture models in the stick-breaking representation,” in Proc. Conf. Uncertainty in Artif. Intell., Pittsburgh, PA, 2006.

[27] K. Ni, Y. Qi, and L. Carin, “Multi-aspect target detection with the infinite hidden Markov model,” J. Acoust. Soc. Amer., 2007, to appear.

[28] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, 1989.

[29] P. R. Runkle, P. K. Bharadwaj, L. Couchman, and L. Carin, “Hidden Markov models for multiaspect target classification,” IEEE Trans. Signal Process., vol. 47, pp. 2035–2040, 1999.

[30] J. Sethuraman, “A constructive definition of Dirichlet priors,” Statistica Sinica, vol. 4, pp. 639–650, 1994.

[31] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical Dirichlet processes,” J. Amer. Statist. Assoc., vol. 101, no. 476, pp. 1566–1581, 2006.

[32] J. Winn and C. M. Bishop, “Variational message passing,” J. Mach. Learn. Res., vol. 6, pp. 661–694, 2005.

[33] J. Winn, “Variational message passing and its applications,” Ph.D. dissertation, Inference Group, Cavendish Lab., Univ. Cambridge, Cambridge, U.K., 2004.

[34] T. T. Wong, “Generalized Dirichlet distribution in Bayesian analysis,” Appl. Math. Computat., vol. 97, pp. 165–181, 1998.

John Paisley (S’08) received the B.S. and M.S. degrees in electrical and computer engineering from Duke University, Durham, NC, in 2004 and 2007, respectively.

He is currently pursuing the Ph.D. degree in electrical and computer engineering at Duke University. His research interests include Bayesian hierarchical modeling and machine learning.

Lawrence Carin (SM’96–F’01) was born March 25, 1963, in Washington, DC. He received the B.S., M.S., and Ph.D. degrees in electrical engineering from the University of Maryland, College Park, in 1985, 1986, and 1989, respectively.

In 1989, he joined the Electrical Engineering Department, Polytechnic University, Brooklyn, NY, as an Assistant Professor, and became an Associate Professor in 1994. In September 1995, he joined the Electrical Engineering Department, Duke University, Durham, NC, where he is now the William H. Younger Professor of Engineering. He is a cofounder of Signal Innovations Group, Inc. (SIG), where he serves as the Director of Technology. His current research interests include signal processing, sensing, and machine learning.

Dr. Carin was an Associate Editor of the IEEE TRANSACTIONS ON ANTENNAS AND PROPAGATION from 1995 to 2004. He is a member of the Tau Beta Pi and Eta Kappa Nu honor societies.