
U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models

Jaemo Sung¹, Sung-Yang Bang¹, Seungjin Choi¹, and Zoubin Ghahramani²

¹ Department of Computer Science, POSTECH, Republic of Korea
{emtidi, sybang, seungjin}@postech.ac.kr

² Gatsby Computational Neuroscience Unit, University College London,
17 Queen Square, London WC1N 3AR, England
[email protected]

Abstract. In this paper we consider latent variable models and introduce a new U-likelihood concept for estimating the distribution over hidden variables. One can derive an estimate of parameters from this distribution. Our approach differs from the Bayesian and Maximum Likelihood (ML) approaches. It gives an alternative to Bayesian inference when we don't want to define a prior over parameters, and an alternative to the ML method when we want a better estimate of the distribution over hidden variables. As a practical implementation, we present a U-updating algorithm, based on mean field theory, to approximate the distribution over hidden variables from the U-likelihood. This algorithm captures some of the correlations among hidden variables by estimating reaction terms. Those reaction terms are found to penalize the likelihood. We show that the U-updating algorithm becomes the EM algorithm as a special case in the large sample limit. The useful behavior of our method is confirmed for the case of a mixture of Gaussians by comparison with the EM algorithm.

1 Introduction

Latent variable models are important tools for probabilistic methods and have wide applications in machine learning, computer vision, pattern recognition, and speech processing, to name a few. The Bayesian and the Maximum Likelihood (ML) approaches have been extensively studied for learning such models in the past decades.

In Bayesian inference [1], we define a prior over parameters P(θ), and from this all inference is automatically performed. In particular, using this prior we can compute the marginal probability of the data set Y = {y_1, . . . , y_n} and hidden variable set X = {x_1, . . . , x_n}:

P(Y, X) = ∫ P(Y, X | θ) P(θ) dθ.   (1)



We can further marginalize out the hidden variables to get the marginal probability of just the data set Y:

P(Y) = ∑_X P(Y, X),   (2)

assuming all hidden variables are discrete. The Bayesian approach basically provides a way of solving the overfitting problem by eliminating model parameters, integrating over them.

In certain settings it may be undesirable to define a prior over parameters. For example, many statisticians don't like the subjective nature of Bayesian inference, even though all modelling contains an element of subjectivity. Moreover, in some cases it is very difficult to define one's prior belief about the model parameters. This motivates the use of the ML method for parameter estimation.

Starting from the likelihood of the parameters, which is the probability of the data set Y given the parameters, in the ML approach one can find the parameters which maximize the likelihood function given by

L(θ) = ∑_X P(Y, X | θ).   (3)

We wish to find the parameters that maximize the likelihood: θ∗ = arg max_θ L(θ). From this estimate of the parameters, we can find the distribution over hidden variables P(X | Y, θ∗), regarding θ∗ as the true parameters. The fundamental problem of the ML approach is overfitting, since it considers only a single estimate of the ML parameters, whereas the Bayesian approach solves this problem by integrating over parameters.

If we don't want a Bayesian approach, we can still eliminate the parameters, not by marginalizing over them as in (1), but by maximizing over them. This is the key idea of our work. By doing so, we obtain a method somewhat analogous to the Bayesian approach without specifying a parameter prior.

In this paper, we introduce a new concept, U-likelihood, to infer the distribution over hidden variables, which differs from the Bayesian and the ML approaches. We obtain the U-likelihood, a quantity analogous to the marginal probability of the data set in (2), by marginalizing the maximum of the complete-data likelihood over hidden variables. We show that the U-likelihood can be bounded using a variational method [2] and that this gives the joint distribution over hidden variables Q(X) (Sec. 3). As in the Bayesian and the ML approaches, the exact Q(X) is intractable to compute for large data sets. As a practical implementation, we introduce a U-updating algorithm, which iteratively solves mean field equations for hidden variables under Q(X). We show that the U-updating algorithm estimates the reaction of all the other hidden variables and that it penalizes the likelihood term to alleviate the overfitting problem. The EM algorithm appears as a special case of the U-updating algorithm in the large sample limit (Sec. 4). We demonstrate the useful behavior of our U-updating algorithm, compared to the EM algorithm, through the example of mixtures of Gaussians on synthetic and real data sets (Sec. 5).


2 General Framework

Throughout this paper, we assume that a data set Y = {y_1, . . . , y_n} of n data points is always given. Let X = {x_1, . . . , x_n} denote the hidden variable set. Allowing y_i and x_i to be multidimensional, we assume that each complete data point (y_i, x_i) is drawn IID from a sampling distribution parameterized by a parameter vector θ, written P(y_i, x_i | θ). For simplicity, we focus here on discrete hidden variables x_i, but the case of continuous hidden variables is handled straightforwardly by replacing sums with integrals.

For Bayesian inference, we can form a lower bound of log P(Y) in (2) for any Q(X) using Jensen's inequality:

log P(Y) ≥ ∑_X Q(X) log [P(Y, X) / Q(X)] ≡ F_B(Q(X)),   (4)

and then we can find Q(X) by maximizing F_B. The maximization of F_B is equivalent to the minimization of the Kullback-Leibler divergence between Q(X) and P(X | Y) = P(Y, X)/P(Y). Therefore, at the maxima of F_B, Q(X) gives the exact P(X | Y). However, for most models of interest this is intractable to compute. For example, for a mixture model with m components, the sum ∑_X in (2) contains m^n terms. As practical implementations, MCMC methods [3], Expectation Propagation (EP) [4], and variational Bayes (VB) [5] have been introduced, but we will not tackle them in this paper.

For the ML approach, we can form a lower bound of the log likelihood in a similar way to (4):

log L(θ) ≥ ∑_X Q(X) log [P(Y, X | θ) / Q(X)] ≡ F_L(Q(X), θ).   (5)

The maximization of F_L is equivalent to the maximization of L, since if (Q∗, θ∗) occurs at a maximum of F_L, then θ∗ occurs at a maximum of L(θ) and Q∗(X) becomes P(X | Y, θ∗). Since the global maximization of F_L is intractable in most cases, as in the Bayesian approach, the well-known EM algorithm [6] is used as a practical implementation: it alternately maximizes F_L with respect to Q and θ, holding the other fixed. Refer to [6, 7] for more details on the EM algorithm.
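
To make the alternating maximization concrete, here is a minimal sketch (not from the paper) of EM for a one-dimensional mixture of two Gaussians with the variances fixed to 1, so that only the means and mixing proportions are updated; the data and initial values below are placeholders.

import numpy as np

def em_mog_1d(y, mu, w, n_iters=200):
    """Minimal EM for a 1-D mixture of Gaussians with variances fixed to 1.

    E-step: Q_i(x_i = k) = P(x_i = k | y_i, theta)   (responsibilities)
    M-step: closed-form maximizer of the expected complete-data log likelihood
    """
    y = np.asarray(y, dtype=float)
    mu, w = np.asarray(mu, dtype=float), np.asarray(w, dtype=float)
    for _ in range(n_iters):
        # E-step: responsibilities proportional to w_k * N(y_i; mu_k, 1)
        log_r = np.log(w) - 0.5 * (y[:, None] - mu[None, :]) ** 2
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing proportions and means
        nk = r.sum(axis=0)
        w = nk / len(y)
        mu = (r * y[:, None]).sum(axis=0) / nk
    return mu, w

# Toy usage with placeholder data and initial guesses
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
print(em_mog_1d(y, mu=[-1.0, 1.0], w=[0.5, 0.5]))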

3 U-Likelihood

If we don't want a Bayesian approach, we can still eliminate the parameters, not by marginalizing over them as in (1), but by maximizing over them. We start by defining the U-function, which is the maximum of the complete-data likelihood:

U(Y, X) ≡ max_θ P(Y, X | θ) = P(Y, X | θ(Y, X)) > 0,   (6)

where θ(Y, X) denotes the ML parameter estimator, a function of the complete-data set, defined by

θ(Y, X) ≡ arg max_θ P(Y, X | θ).   (7)


The U-function is analogous to the marginal probability in (1), except that it maximizes over parameters rather than integrating over them. Another way to think about it is that instead of a parameter prior, we substitute into (1) a delta function at the ML parameter estimate on the complete-data set, i.e. P(θ) = δ(θ − θ(Y, X)). This is certainly not coherent from the point of view of Bayesian inference, since the prior cannot depend on the data, but we will see some of the interesting properties of this approach.
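
Spelling out that substitution (a one-line check using the sifting property of the delta function):

\[
\int P(Y, X \mid \theta)\,\delta\bigl(\theta - \theta(Y, X)\bigr)\,d\theta
  \;=\; P\bigl(Y, X \mid \theta(Y, X)\bigr) \;=\; U(Y, X),
\]

which is exactly (6).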

We can take the U-function and marginalize out the hidden variable set X:

U(Y) ≡ ∑_X U(Y, X).   (8)

We call this quantity the U-likelihood; it is analogous to P(Y) in (2). Another view is that it forms an upper bound on the likelihood function: U ≥ L∗ ≥ L(θ), where L∗ = max_θ L(θ) denotes the maximum likelihood value. Analogous to (4), we can lower bound it for any distribution Q(X) over hidden variables:

log U(Y) ≥ ∑_X Q(X) log [U(Y, X) / Q(X)] ≡ F_U(Q(X)).   (9)

We can use the optimal Q(X) maximizing F_U as the joint conditional distribution over hidden variables given the data set, P(X | Y). The next theorem shows the form of Q(X) at the maxima of F_U.

Theorem 1. The optimal joint distribution Q(X) maximizing the lower bound F_U(Q(X)) is of the form

Q(X) = U(Y, X) / U(Y).   (10)

Proof: Let Q′(X) = U(Y, X)/U(Y). Then the Kullback-Leibler divergence between Q(X) and Q′(X) is given by KL[Q‖Q′] = log U(Y) − F_U(Q). By Gibbs' inequality, KL[Q‖Q′] ≥ 0 with equality if and only if Q(X) = Q′(X), so F_U(Q) ≤ log U(Y) and F_U(Q) is maximized at Q(X) = Q′(X). □

We illustrate the joint distribution Q(X) in (10) when the data set Y consists of 12 data points generated from a mixture of two Gaussians. The true X∗ is a binary vector with 12 components. Figure 1 plots log Q(X) as a function of the Manhattan distance from the true X∗. This demonstrates that Q(X) tends to give higher probability to hidden states that are similar to the true states.
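
For small n, the distribution Q(X) in (10) can be evaluated exactly by brute force. The sketch below is a simplified 1-D analogue of the Fig. 1 setup (an assumption on our part, not the paper's 2-D experiment): with the component variances fixed to 1, the inner maximization over θ = ({µ_k}, {w_k}) has the usual closed-form complete-data ML solution, and all 2^12 assignments can be enumerated.

import itertools
import numpy as np

def log_u(y, x, m=2):
    """log U(Y, X) = log max_theta P(Y, X | theta) for a 1-D MoG with unit variances.

    Complete-data ML solution: w_k = n_k / n, mu_k = mean of the points assigned to k.
    """
    y, x = np.asarray(y, dtype=float), np.asarray(x)
    n = len(y)
    total = 0.0
    for k in range(m):
        yk = y[x == k]
        nk = len(yk)
        if nk == 0:
            continue  # an empty component contributes no factors
        mu_k = yk.mean()
        total += nk * np.log(nk / n)                      # sum_i log w_{x_i}
        total += (-0.5 * nk * np.log(2 * np.pi)
                  - 0.5 * ((yk - mu_k) ** 2).sum())       # sum_i log N(y_i; mu_k, 1)
    return total

# Enumerate all 2^12 assignments and normalize: Q(X) = U(Y, X) / U(Y)
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-3, 1, 6), rng.normal(3, 1, 6)])  # 12 points, 1-D analogue of Fig. 1
states = list(itertools.product(range(2), repeat=len(y)))
log_us = np.array([log_u(y, s) for s in states])
q = np.exp(log_us - log_us.max())
q /= q.sum()
best = states[int(np.argmax(q))]
print("most probable assignment:", best, "Q =", float(q.max()))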

We have seen the relationship of the U-likelihood to Bayesian inference. We can also see a simple relationship to maximum likelihood methods:

Maximum likelihood:  L∗ = max_θ ∑_X P(Y, X | θ),   (11)

U-likelihood:  U = ∑_X max_θ P(Y, X | θ),   (12)

where we here dropped the data dependency. The former gives a single value of the model parameters θ∗, from which a distribution over hidden variables can be derived: P(X | Y, θ∗). The latter no longer gives a single value of parameters. However, it may give a better estimate of the distribution over hidden variables, Q(X), which captures some of the correlations among hidden variables. From this distribution Q(X) one can derive an estimate of parameters.

Fig. 1. Demonstration of Q(X) in Theorem 1 given a data set Y of 12 data points generated from a mixture of two Gaussians, N([−3, 0], I) and N([3, 0], I). (a) The data set Y. (b) log Q(X) as a function of the Manhattan distance from the true state, ∑_{i=1}^{12} |x_i − x_i^∗|. Each dot indicates one of the 2^12 possible configurations of X. The symmetry stems from the label-switching non-identifiability of the mixture of Gaussians.
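
Incidentally, the upper bound U ≥ L∗ noted earlier in this section can be read off directly from (11) and (12), since a maximum of a sum never exceeds the sum of the term-wise maxima:

\[
L^{*} \;=\; \max_{\theta} \sum_{X} P(Y, X \mid \theta)
      \;\le\; \sum_{X} \max_{\theta} P(Y, X \mid \theta) \;=\; U .
\]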

We outline some of the possible advantages of the U-likelihood approach over the Bayesian and the ML approaches:

1. High-dimensional integrals like (1), required for Bayesian inference, can be intractable. For many models, the optimum of θ given the complete-data set (Y, X) is a simple function of the sufficient statistics, so no explicit optimization is necessary to compute (6).

2. Many researchers may not wish to define a prior over parameters. The U-likelihood method provides an alternative.

3. Optimizing over parameters in the ML method is often fraught with local optima. By optimizing out the parameters, the U-likelihood method is sometimes found to have better convergence properties than the ML method; that is, it can find good solutions without falling into local optima as often. We show this empirically.

The distribution Q(X) in (10) may be intractable to compute, except for small n, since it requires all possible configurations of X. As a practical implementation, we will use a mean field approximation and present the U-updating algorithm as an alternative to the EM algorithm in the next section.


4 U-Updating Algorithm

We start by considering the case where the sampling distribution of a complete data point (y_i, x_i) is in the exponential family, with the form

P(y_i, x_i | θ) = f(s_i(y_i, x_i)) g(θ) exp{φ(θ)^T s_i(y_i, x_i)},   (13)

where φ(θ) is a vector of natural parameters and s_i(y_i, x_i) is a vector of sufficient statistics. The normalizing constant is denoted by g(θ). Probability distributions in the exponential family have been widely used in latent variable models such as mixtures of Gaussians, factor analysis, hidden Markov models, state-space models, and so on. For the case of the exponential family, the complete-data likelihood depends on the complete-data set only through the sufficient statistics:

P(Y, X | θ) = [ ∏_{i=1}^n f(s_i(y_i, x_i)) ] g(θ)^n exp{φ(θ)^T s(Y, X)},   (14)

where s(Y, X) = ∑_{i=1}^n s_i(y_i, x_i). Moreover, a closed-form solution for θ and for the U-function always exists as a function of the sufficient statistics: θ(Y, X) = θ(s(Y, X)) and U(Y, X) = U(s(Y, X)).

The mean field theory [8], originally from statistical physics, has been widely used in the machine learning community to approximate joint distributions in graphical models when exact inference is intractable because of highly coupled interactions among variables. Consider the marginal distribution over x_i:

Q(x_i) = ∑_{X\i} Q(X) = (1/U(Y)) ∑_{X\i} U(s(Y, X)),   (15)

where X\i denotes the subset of hidden variables with x_i excluded: X\i = X \ x_i. In general, the exact calculation of Q(x_i) is intractable since it requires all possible realizations of X\i. Assuming weak dependencies among hidden variables, the mean field theory suggests that the influence of the other hidden variables s_j(y_j, x_j) on the marginal distribution Q(x_i) can be approximated by their expected values 〈s_j(y_j, x_j)〉. This leads to the mean field distributions Q_i(x_i):

Q_i(x_i) ≡ (1/U_i) U(s_i(x_i)) ≈ Q(x_i),   (16)

where s_i(x_i) = s_i(y_i, x_i) + ∑_{j=1, j≠i}^n 〈s_j(y_j, x_j)〉 and U_i = ∑_{x_i} U(s_i(x_i)) is the normalizing constant. The joint distribution Q(X) is approximated by the factored form over all mean field distributions: Q(X) ≈ ∏_{i=1}^n Q_i(x_i).

Moreover, the expected sufficient statistics 〈s(Y, X)〉 can be obtained by solving self-consistent equations, called mean field equations, which are stationarity conditions:

〈s_i(y_i, x_i)〉 = ∑_{x_i} s_i(y_i, x_i) Q_i(x_i).   (17)


Table 1. U-updating algorithm and EM algorithm

U-updating algorithm:
  Initialize 〈s〉^(0) = ∑_{i=1}^n 〈s_i(y_i, x_i)〉^(0).
  Set 〈s〉^(1) = 〈s〉^(0).
  Repeat t = 1, 2, . . . until convergence:
    Repeat i = 1, . . . , n:
      Update Q_i^(t)(x_i) ∝ U(s_i^(t)(x_i)),
        where s_i^(t)(x_i) = 〈s〉^(t) + s_i(y_i, x_i) − 〈s_i(y_i, x_i)〉^(t−1).
      Refine 〈s〉^(t) ← 〈s〉^(t) + 〈s_i(y_i, x_i)〉^(t) − 〈s_i(y_i, x_i)〉^(t−1),
        with 〈s_i(y_i, x_i)〉^(t) taken under the new Q_i^(t)(x_i).
    End (Repeat)
    Set 〈s〉^(t+1) = 〈s〉^(t).
  End (Repeat)

EM algorithm:
  Initialize θ^(0).
  Repeat t = 1, 2, . . . until convergence:
    E-step: Update Q_i^(t)(x_i) = P(x_i | y_i, θ^(t)) for all i = 1, . . . , n.
    M-step: Estimate θ^(t+1) = θ(〈s〉^(t)),
      with 〈s〉^(t) = ∑_{i=1}^n 〈s_i(y_i, x_i)〉^(t) under the new {Q_i^(t)(x_i)}.
  End (Repeat)

Therefore, the distribution Q_i(x_i) in (16) can be computed by an iterative procedure that solves the mean field equations in (17). This iterative procedure is referred to as the U-updating algorithm and gives an alternative to the EM algorithm.
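
A rough sketch of the generic procedure in Table 1 is given below, under the assumption that a closed-form log U(s) of the total sufficient statistic is available (as it is in the exponential-family case above); the callable `log_u` and the data layout are placeholders, not part of the paper.

import numpy as np

def u_updating(log_u, suff_stats, n_sweeps=50, tol=1e-6):
    """Generic U-updating sweep (cf. Table 1).

    log_u:      callable returning log U(s) for a total sufficient-statistic vector s
    suff_stats: array of shape (n, n_states, d) with suff_stats[i, k] = s_i(y_i, x_i = k)
    """
    suff_stats = np.asarray(suff_stats, dtype=float)
    n, n_states, _ = suff_stats.shape
    Q = np.full((n, n_states), 1.0 / n_states)        # mean-field marginals Q_i(x_i)
    exp_si = np.einsum('ik,ikd->id', Q, suff_stats)   # <s_i(y_i, x_i)>
    total = exp_si.sum(axis=0)                        # <s(Y, X)> = sum_i <s_i>
    for _ in range(n_sweeps):
        max_change = 0.0
        for i in range(n):
            # candidate totals s_i(x_i) = <s> + s_i(y_i, x_i) - <s_i(y_i, x_i)>
            cand = total[None, :] - exp_si[i][None, :] + suff_stats[i]
            log_q = np.array([log_u(s) for s in cand])
            log_q -= log_q.max()
            q = np.exp(log_q)
            q /= q.sum()                              # Q_i(x_i) proportional to U(s_i(x_i))
            new_exp = q @ suff_stats[i]
            total += new_exp - exp_si[i]              # refine <s(Y, X)> in place
            max_change = max(max_change, float(np.abs(q - Q[i]).max()))
            Q[i], exp_si[i] = q, new_exp
        if max_change < tol:
            break
    return Q, total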

Table 1 summarizes the U-updating algorithm in comparison with the EM algorithm. In order to estimate the ML parameter θ^(t+1) in the M-step, the EM algorithm requires the distributions P(x_i | y_i, θ^(t)) from the E-step, which are built on the ML parameter θ^(t) and may therefore be overfitted to the data set at the previous iteration. As a result, overfitting effects may accumulate over the iterations of the EM algorithm. The U-updating algorithm alleviates this overfitting-accumulation problem by estimating the reaction of all the other hidden variables, which penalizes the likelihood, and so it can give a better distribution Q_i(x_i) than the EM algorithm. We can simply use all the Q_i(x_i) resulting from the U-updating algorithm to estimate the parameters, as in the M-step of the EM algorithm.

In order to see how the U-updating algorithm penalizes the likelihood, decompose the U-function:

U(s_i(x_i)) = α_i(x_i) β_i(x_i),   (18)

where α_i(x_i) = P(s_i(y_i, x_i) | θ(s_i(x_i))) and β_i(x_i) = ∏_{j=1, j≠i}^n ρ(〈s_j(y_j, x_j)〉 | θ(s_i(x_i))), given by

ρ(〈s_j(y_j, x_j)〉 | θ(s_i(x_i))) = f(〈s_j〉) g(θ(s_i(x_i))) exp{φ(θ(s_i(x_i)))^T 〈s_j〉}.

The term α_i(x_i) is the likelihood on the complete data point i. The term β_i(x_i) can be interpreted as the reaction of all the other hidden variables via the expected values 〈s_j(y_j, x_j)〉. When computing Q_i(x_i), the U-updating algorithm therefore penalizes the likelihood α_i(x_i) by estimating the reaction β_i(x_i) of the other hidden variables, which captures some of the correlations among hidden variables.


The U-updating algorithm generalizes the EM algorithm: if we ignore the reaction term β_i(x_i) in (18), it reduces to the EM algorithm. The following theorem states the behavior of the U-updating algorithm in the large sample limit.

Theorem 2. For the case of the exponential family, the U-updating algorithm is equivalent to the EM algorithm in the limit of large samples.

Proof: In the large sample limit, the sufficient statistic s(Y, X) becomes insensitive to any single hidden variable: s_i(y_i, x_i) + ∑_{j=1, j≠i}^n 〈s_j(y_j, x_j)〉 ≈ 〈s(Y, X)〉 as n → ∞. Therefore, in the large sample limit, the reaction term β_i(x_i) becomes a constant and the Q_i(x_i) of the U-updating algorithm becomes the distribution resulting from the E-step of the EM algorithm:

Q_i(x_i) = P(y_i, x_i | θ(〈s〉)) / ∑_{x_i′} P(y_i, x_i′ | θ(〈s〉)) = P(x_i | y_i, θ(〈s〉)).   (19)

From the fact that θ(〈s〉) gives the ML parameter in the M-step of the EM algorithm, it follows that the U-updating algorithm is equivalent to the EM algorithm in the large sample limit. □

5 Numerical Experiments

5.1 Mixture of Gaussians

For a p-dimensional observation vector y_i ∈ R^p, the mixture model [9, 10] of m components with parameter θ is generally defined as P(y_i | θ) = ∑_{k=1}^m P(y_i | x_i = k, θ) P(x_i = k | θ), where x_i ∈ {1, . . . , m} denotes the hidden variable indicating which mixture component is in charge of generating y_i. The components are labelled by k. Although our method can be applied to an arbitrary mixture model, for simplicity we consider the case of Gaussian components. In this case, the mixture model parameterized by θ = ({µ_k}, {Σ_k}, {w_k}) is given by P(y_i | θ) = ∑_{k=1}^m N(y_i; µ_k, Σ_k) w_k, where w_k = P(x_i = k | θ) is the mixing proportion satisfying ∑_{k=1}^m w_k = 1, and N(y_i; µ_k, Σ_k) = P(y_i | x_i = k, θ) denotes the kth Gaussian component distribution with mean vector µ_k and covariance matrix Σ_k. The sampling distribution of the mixture of Gaussians is given by

P(y_i, x_i | θ) = ∏_{k=1}^m [N(y_i; µ_k, Σ_k) w_k]^{δ_k(x_i)},   (20)

where δ_k(x_i) denotes the Kronecker delta, given by δ_k(x_i) = 1 for x_i = k and δ_k(x_i) = 0 for x_i ≠ k.

Let (Y, X) denote the complete data set of n IID observations, where Y = {y_1, . . . , y_n} and X = {x_1, . . . , x_n}. Since the sampling distribution P(y_i, x_i | θ) is in the exponential family, the complete-data likelihood P(Y, X | θ) and the ML parameter estimator θ(Y, X) are functions of the sufficient statistics.
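
For completeness, drawing complete data points (y_i, x_i) from the sampling distribution (20) is straightforward; the parameter values below are placeholders, not those of the experiments in Sec. 5.2.

import numpy as np

def sample_complete_data(n, w, mus, sigmas, seed=0):
    """Draw n complete data points (y_i, x_i) from the MoG sampling distribution (20)."""
    rng = np.random.default_rng(seed)
    w, mus, sigmas = np.asarray(w), np.asarray(mus), np.asarray(sigmas)
    x = rng.choice(len(w), size=n, p=w)                                # x_i ~ Categorical(w)
    y = np.array([rng.multivariate_normal(mus[k], sigmas[k]) for k in x])
    return y, x

# Placeholder 2-D example with two components
Y, X = sample_complete_data(n=200, w=[0.5, 0.5],
                            mus=[[-3.0, 0.0], [3.0, 0.0]],
                            sigmas=[np.eye(2), np.eye(2)])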


Table 2. U-updating algorithm: mixture of Gaussians

  Initialize 〈γ_k〉^(0) = ∑_{i=1}^n 〈δ_k(x_i)〉^(0),
             〈ξ_k〉^(0) = ∑_{i=1}^n 〈δ_k(x_i)〉^(0) y_i,
             〈λ_k〉^(0) = ∑_{i=1}^n 〈δ_k(x_i)〉^(0) y_i y_i^T,
    where 〈δ_k(x_i)〉^(0) = Q_i^(0)(x_i = k).
  Set 〈γ_k〉^(1) = 〈γ_k〉^(0), 〈ξ_k〉^(1) = 〈ξ_k〉^(0) and 〈λ_k〉^(1) = 〈λ_k〉^(0).
  Repeat t = 1, 2, 3, . . . until convergence:
    Repeat i = 1, . . . , n:
      1) Update Q_i^(t)(x_i) ∝ ∏_{k=1}^m ( γ_k^(t)(x_i)^{1+p/2} |C_k^(t)(x_i)|^{−1/2} )^{γ_k^(t)(x_i)},
         where γ_k^(t)(x_i) = 〈γ_k〉^(t) + [δ_k(x_i) − 〈δ_k(x_i)〉^(t−1)],
               ξ_k^(t)(x_i) = 〈ξ_k〉^(t) + [δ_k(x_i) − 〈δ_k(x_i)〉^(t−1)] y_i,
               λ_k^(t)(x_i) = 〈λ_k〉^(t) + [δ_k(x_i) − 〈δ_k(x_i)〉^(t−1)] y_i y_i^T,
               C_k^(t)(x_i) = λ_k^(t)(x_i) − γ_k^(t)(x_i)^{−1} ξ_k^(t)(x_i) ξ_k^(t)(x_i)^T.
      2) Refine sufficient statistics
             〈γ_k〉^(t) ← 〈γ_k〉^(t) + [〈δ_k(x_i)〉^(t) − 〈δ_k(x_i)〉^(t−1)],
             〈ξ_k〉^(t) ← 〈ξ_k〉^(t) + [〈δ_k(x_i)〉^(t) − 〈δ_k(x_i)〉^(t−1)] y_i,
             〈λ_k〉^(t) ← 〈λ_k〉^(t) + [〈δ_k(x_i)〉^(t) − 〈δ_k(x_i)〉^(t−1)] y_i y_i^T,
         with 〈δ_k(x_i)〉^(t) = Q_i^(t)(x_i = k).
    End (Repeat)
    Set 〈γ_k〉^(t+1) = 〈γ_k〉^(t), 〈ξ_k〉^(t+1) = 〈ξ_k〉^(t) and 〈λ_k〉^(t+1) = 〈λ_k〉^(t).
  End (Repeat)

Therefore, the U-function is also a function of the sufficient statistics s(Y, X) = ({γ_k, ξ_k, λ_k}):

U({γ_k, ξ_k, λ_k}) = c ∏_{k=1}^m ( γ_k^{1+p/2} |C_k|^{−1/2} )^{γ_k},   (21)

where C_k = λ_k − γ_k^{−1} ξ_k ξ_k^T and

γ_k = ∑_{i=1}^n δ_k(x_i),   ξ_k = ∑_{i=1}^n δ_k(x_i) y_i,   λ_k = ∑_{i=1}^n δ_k(x_i) y_i y_i^T,   (22)

and c is a constant. Using 〈δ_k(x_i)〉 = Q_i(x_i = k), we present the U-updating algorithm for the mixture of Gaussians in Table 2. We can simply obtain the estimate of the parameters as θ∗ = θ({〈γ_k〉, 〈ξ_k〉, 〈λ_k〉}) under all the Q_i(x_i) resulting from the U-updating algorithm, as in the M-step of the EM algorithm, where the ML parameter estimator is given by

θ({γ_k, ξ_k, λ_k}) = ({ w_k = γ_k/n,  µ_k = ξ_k/γ_k,  Σ_k = C_k/γ_k }).   (23)
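
A compact sketch of Table 2 together with (21)-(23) might look as follows. It is not the authors' code: Q_i is initialized from random responsibilities rather than from k-means (which the experiments in Sec. 5.2 use), the product in (21) is evaluated in the log domain to avoid overflow, and a small ridge eps·I is added to each C_k before taking determinants purely as a numerical safeguard; these choices are our assumptions.

import numpy as np

def log_u_mog(gam, xi, lam, eps=1e-8):
    """log U({gamma_k, xi_k, lambda_k}) up to an additive constant, cf. (21).

    gam: (m,), xi: (m, p), lam: (m, p, p). The eps ridge is a numerical
    safeguard of ours, not part of the algorithm.
    """
    m, p = xi.shape
    total = 0.0
    for k in range(m):
        C = lam[k] - np.outer(xi[k], xi[k]) / gam[k] + eps * np.eye(p)
        logdet = np.linalg.slogdet(C)[1]
        total += gam[k] * ((1.0 + p / 2.0) * np.log(gam[k]) - 0.5 * logdet)
    return total

def u_updating_mog(Y, m, n_sweeps=100, tol=1e-6, seed=0):
    """U-updating algorithm for a mixture of m Gaussians (cf. Table 2)."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    rng = np.random.default_rng(seed)
    Q = rng.dirichlet(np.ones(m), size=n)               # Q_i(x_i = k), random init
    # expected sufficient statistics <gamma_k>, <xi_k>, <lambda_k>, cf. (22)
    gam = Q.sum(axis=0)
    xi = Q.T @ Y
    lam = np.einsum('ik,ia,ib->kab', Q, Y, Y)
    for _ in range(n_sweeps):
        max_change = 0.0
        for i in range(n):
            y, yy = Y[i], np.outer(Y[i], Y[i])
            log_q = np.empty(m)
            for c in range(m):                          # candidate assignment x_i = c
                d = np.eye(m)[c] - Q[i]                 # delta_k(x_i) - <delta_k(x_i)>
                log_q[c] = log_u_mog(gam + d,
                                     xi + d[:, None] * y,
                                     lam + d[:, None, None] * yy)
            log_q -= log_q.max()
            q = np.exp(log_q)
            q /= q.sum()                                # Q_i(x_i) proportional to U(s_i(x_i))
            d = q - Q[i]                                # refine the expected sufficient statistics
            gam += d
            xi += d[:, None] * y
            lam += d[:, None, None] * yy
            max_change = max(max_change, float(np.abs(d).max()))
            Q[i] = q
        if max_change < tol:
            break
    # parameter estimate theta({<gamma_k>, <xi_k>, <lambda_k>}), cf. (23)
    w = gam / n
    mu = xi / gam[:, None]
    Sigma = lam / gam[:, None, None] - np.einsum('ka,kb->kab', mu, mu)
    return Q, w, mu, Sigma

Swapping the random initialization for k-means responsibilities, as in the experiments below, is a one-line change.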

5.2 Numerical Results

In order to demonstrate the U-updating algorithm in comparison with the EM algorithm, we first used a data set of 800 data points generated from a mixture of 6 well-clustered Gaussian components having equal mixing proportions w_k but different volumes. Both algorithms started from the same initial guess obtained by the k-means algorithm. Figure 2 shows that the U-updating algorithm alleviates the overfitting in comparison with the EM algorithm. Although the models were more complicated than the true model (m = 6), the U-updating algorithm demonstrated that all of the learned distributions (m = 6, 9, 12, 16) were very similar to the true distribution. However, for the EM algorithm, the more complicated the model we considered, the more overfitted the distribution that resulted.

Fig. 2. Results on a mixture of 6 well-clustered Gaussian components: (a) the true Gaussian-mixture distribution, where brighter regions indicate higher probability; (b) 800 data points generated from the true distribution; (c) and (d) distributions learned by the U-updating and EM algorithms, respectively, when the models have m = 6, 9, 12, 16 components.

Fig. 3. Intermediate log likelihood values, subtracted from the maximum value, for the U-updating and the EM algorithms in the case of m = 16 on the data set shown in Figure 2.

Fig. 4. Distributions learned by the U-updating algorithm (left column) and the EM algorithm (right column) on the acidity data set (155 data points) and the galaxy data set (82 data points) when the models have m = 2, 4, 6 components.

As a practical issue, overfitting leads to slow convergence. Figure 3 shows the convergence curves in terms of the log likelihood subtracted from its maximum value in the case of m = 16. By penalizing the likelihood term α_i with the reaction term β_i, the U-updating algorithm achieved much faster convergence than the EM algorithm, by a factor of more than three: it met the convergence threshold, √(∑_{k=1}^m |w_k^(t) − w_k^(t−1)|²) < 10^−4, after 153 iterations, whereas the EM algorithm met the same threshold after 563 iterations.
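
In code, this stopping rule is just the Euclidean norm of the change in the mixing-proportion vector (here `w_new` and `w_old` are assumed to be arrays holding the current and previous estimates):

import numpy as np

def converged(w_new, w_old, threshold=1e-4):
    # sqrt(sum_k |w_k(t) - w_k(t-1)|^2) < 1e-4
    return np.linalg.norm(np.asarray(w_new) - np.asarray(w_old)) < threshold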

Next, we used real data sets, the acidity and galaxy data sets described in [10]. Figure 4 shows the learned distributions when the models have 2, 4, and 6 components, and Table 3 shows the optimized mixing proportions w_k when the model has 6 components.

Table 3. Optimized mixing proportions (w_k) of the learned models having 6 components

                       Component             1      2      3      4      5      6
  Acidity data set     U-updating algorithm  0.295  0.259  0.187  0.086  0.086  0.086
                       EM algorithm          0.386  0.188  0.171  0.169  0.073  0.013
  Galaxy data set      U-updating algorithm  0.267  0.267  0.267  0.083  0.058  0.058
                       EM algorithm          0.403  0.277  0.171  0.085  0.037  0.026

6 Conclusions

In this paper, we introduced the U-likelihood approach for learning latent variable models, which differs from the Bayesian and the ML approaches. We presented some advantages of our approach over them in Sec. 3. Our U-likelihood method gives an alternative to Bayesian inference and the ML method when we don't want to use these. As a practical implementation, we presented the U-updating algorithm to compute the distribution over hidden variables, which was found to penalize the likelihood by estimating the reaction of the other hidden variables and to alleviate the overfitting-accumulation problem of the EM algorithm.

We leave some issues for future work: 1) How can we approximate Q(X) in (10) more accurately than with the U-updating algorithm? 2) How can we perform model selection in the framework of the U-likelihood? 3) Comparison with Bayesian approaches, e.g. EP and VB.

References

1. A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 1995.

2. M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

3. R. M. Neal. Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 1993.

4. T. Minka. Expectation Propagation for approximate Bayesian inference. In Proc. Uncertainty in Artificial Intelligence, 2001.

5. Z. Ghahramani and M. J. Beal. Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems, volume 13. MIT Press, 2001.

6. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38, 1977.

7. R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998.

8. M. Opper and D. Saad, editors. Advanced Mean Field Methods: Theory and Practice. MIT Press, 2001.

9. G. McLachlan and D. Peel. Finite Mixture Models. Wiley-Interscience, 2000.

10. S. Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society B, 59:731–792, 1997.