
Chapter 14

EM Algorithms

14.1 Introduction

In Chapter 8, we discussed methods for maximizing the log-likelihood function. As models become more complex, maximization by these methods becomes more difficult. Several issues contribute to the difficulty. First, greater flexibility and realism in a model is usually attained by increasing the number of parameters. However, the procedures in Chapter 8 require that the gradient be calculated with respect to each parameter, which becomes increasingly time consuming as the number of parameters rises. The Hessian, or approximate Hessian, must be calculated and inverted; with a large number of parameters, the inversion can be difficult numerically. Also, as the number of parameters grows, the search for the maximizing values is over a larger-dimensioned space, such that locating the maximum requires more iterations. In short, each iteration takes longer and more iterations are required.

Second, the log-likelihood (LL) function for simple models is often approximately quadratic, such that the procedures in Chapter 8 operate effectively. As the model becomes more complex, however, the LL function usually becomes less like a quadratic, at least in some regions of the parameter space. This issue can manifest itself in two ways. The iterative procedure can get “stuck” in the nonquadratic areas of the LL function, taking tiny steps without much improvement in the LL. Or the procedure can repeatedly “bounce over” the maximum, taking large steps in each iteration but without being able to locate the maximum.


Another issue arises that is more fundamental than the count of parameters and the shape of the LL function. Usually, a researcher specifies a more general, and hence complex, model because the researcher wants to rely less on assumptions and, instead, obtain more information from the data. However, the goal of obtaining more information from the data is inherently at odds with simplicity of estimation.

Expectation-Maximization (EM) algorithms are procedures for maximizing a LL function when standard procedures are numerically difficult or infeasible. The procedure was introduced by Dempster, Laird, and Rubin (1977) as a way of handling missing data. However, it is applicable far more generally and has been used successfully in many fields of statistics. McLachlan and Krishnan (1997) provide a review of applications. In the field of discrete choice modeling, EM algorithms have been used by Bhat (1997a) and Train (2008a,b).

The procedure consists of defining a particular Expectation and then Maximizing it (hence the name). This expectation is related to the LL function in a way that we will describe, but it differs in a way that facilitates maximization. The procedure is iterative, starting at some initial value for the parameters and updating the values in each iteration. The updated parameters in each iteration are the values that maximize the expectation in that particular iteration. As we will show, repeated maximization of this function converges to the maximum of the LL function itself.

In this chapter, we describe the EM algorithm in general and develop specific algorithms for discrete choice models with random coefficients. We show that the EM algorithm can be used to estimate very flexible distributions of preferences, including nonparametric specifications that can asymptotically approximate any true underlying distribution. We apply the methods in a case study of consumers' choice between hydrogen and gas-powered vehicles.

14.2 General procedure

In this section we describe the EM procedure in a highly general way so as to elucidate its features. In subsequent sections, we apply the general procedure to specific models. Let the observed dependent variables be denoted collectively as y, representing the choices or sequence of choices for an entire sample of decision-makers. The choices depend on observed explanatory variables that, for notational convenience, we do not explicitly denote. The choices also depend on data that are missing, denoted collectively as z. Since the values of these missing data are not observed, the researcher specifies a distribution that represents the values that the missing data could take. For example, if the income of some sampled individuals is missing, the distribution of income in the population can be a useful specification for the distribution of the missing income values. The density of the missing data is denoted f(z | θ), which depends in general on parameters θ to be estimated.

The behavioral model relates the observed and missing data to the choices. This model predicts the choices that would arise if the missing data were actually observed instead of missing. This behavioral model is denoted P(y | z, θ), where θ are parameters that may overlap or extend those in f. (For notational compactness, we use θ to denote all the parameters to be estimated, including those entering f and those entering P.) Since, however, the missing data are in fact missing, the probability of the observed choices, based on the information that the researcher observes, is the integral of the conditional probability over the density of the missing data:¹

P(y | θ) = ∫ P(y | z, θ) f(z | θ) dz

The density of the missing data, f(z | θ), is used to predict the observed choices and hence does not depend on y. However, we can obtain some information about the missing data by observing the choices that were made. For example, in vehicle choice, if a person's income is missing but the person is observed to have bought a Mercedes, one can infer that this person's income is likely to be above average. Let us define h(z | y, θ) as the density of the missing data conditional on the observed choices in the sample. This conditional density is related to the unconditional density through Bayes' identity:

h(z | y, θ) = P(y | z, θ) f(z | θ) / P(y | θ)

Stated succinctly: the density of z conditional on the observed choices is proportional to the unconditional density of z times the probability of the observed choices given this z. The denominator is simply the normalizing constant, equal to the integral of the numerator. This concept of a conditional distribution should be familiar to readers from Chapter 11.

¹We assume in this expression that z is continuous, such that the unconditional probability is an integral. If z is discrete, or a mixture of continuous and discrete variables, then the integration is replaced with a sum over the discrete values, or a combination of integrals and sums.

Now consider estimation. The log-likelihood function is based on the information that the researcher has, which does not include the missing data. The LL function is:

LL(θ) = log P(y | θ) = log ( ∫ P(y | z, θ) f(z | θ) dz ).

In principle this function can be maximized using the procedures described in Chapter 8. However, as we will see, it is often much easier to maximize LL in a different way.

The procedure is iterative, starting with an initial value of the parameters and updating them in a way to be described. Let the trial value of the parameters in a given iteration be denoted θt. Let us define a new function at θt that is related to LL but utilizes the conditional density h. This new function is:

E(θ | θt) = ∫ h(z | y, θt) log( P(y | z, θ) f(z | θ) ) dz,

where the conditional density h is calculated using the current trial value of the parameters, θt. This function has a specific meaning. Note that the part on the far right, P(y | z, θ)f(z | θ), is the joint probability of the observed choices and the missing data. The log of this joint probability is the log-likelihood of the observed choices and the missing data combined. This joint log-likelihood is integrated over a density, namely h(z | y, θt). Our function E is therefore an expectation of the joint log-likelihood of the missing data and observed choices. It is a specific expectation, namely, the expectation over the density of the missing data conditional on the observed choices. Since the conditional density of z depends on the parameters, this density is calculated using the values θt. Stated equivalently, E is the weighted average of the joint log-likelihood, using h(z | y, θt) as weights.

The EM procedure consists of repeatedly maximizing E. Starting with some initial value of the parameters, the parameters are updated in each iteration by the formula:

θt+1 = argmaxθ E(θ | θt). (14.1)


In each iteration, the current values of the parameters, θt, are used to calculate the weights h, and then the weighted joint log-likelihood is maximized. The name EM derives from the fact that the procedure utilizes an expectation that is maximized.

It is important to recognize the dual role of the parameters in E. First, the parameters enter the joint likelihood of the observed choices and the missing data, P(y | z, θ)f(z | θ). Second, the parameters enter the conditional density of the missing data, h(z | y, θ). The function E is maximized with respect to the former, holding the latter constant. That is, E is maximized over the θ entering P(y | z, θ)f(z | θ), holding the θ that enters the weights h(z | y, θ) at its current value θt. To denote this dual role, E(θ | θt) is expressed as a function of θ, the argument over which maximization is performed, given θt, the value used in the weights that are held fixed during maximization.
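To make the iteration in equation (14.1) concrete, the following Python sketch shows the generic loop. It is only a skeleton: the function maximize_expectation, the tolerance, and the iteration cap are illustrative choices, not part of the procedure itself.

import numpy as np

def em(theta0, maximize_expectation, tol=1e-8, max_iter=500):
    """Generic EM loop for equation (14.1).

    maximize_expectation(theta_t) performs the M step: it returns
    argmax over theta of E(theta | theta_t), with the weights h held
    at theta_t while the maximization is carried out.
    """
    theta_t = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_next = np.asarray(maximize_expectation(theta_t), dtype=float)
        # Stop when the parameters barely change between iterations.
        if np.max(np.abs(theta_next - theta_t)) < tol:
            return theta_next
        theta_t = theta_next
    return theta_t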

Under very general conditions, the iterations defined by equation (14.1) converge to the maximum of LL. Boyles (1983) and Wu (1983) provide formal proofs. I provide an intuitive explanation in the next section. However, readers who are interested in seeing examples of the algorithm first can proceed directly to section 14.3.

14.2.1 Why the EM algorithm works

The relation of the EM algorithm to the log-likelihood function can be explained in three steps. Each step is a bit opaque, but the three combined provide a startlingly intuitive understanding.

Step 1: Adjust E to equal LL at θt

E(θ | θt) is not the same as LL(θ). To aid comparison between them, let us add a constant to E(θ | θt) that is equal to the difference between the two functions at θt:

E∗(θ | θt) = E(θ | θt) + [LL(θt) − E(θt | θt)]

The term in brackets is constant with respect to θ, and so maximization of E∗ is the same as maximization of E itself. Note, however, that by construction, E∗(θ | θt) = LL(θ) at θ = θt.


Step 2: Note that the derivative with respect to θ is the same for E∗ and LL when evaluated at θ = θt

Consider the derivative of E∗(θ | θt) with respect to its argument θ:

dE∗(θ | θt)/dθ = dE(θ | θt)/dθ
 = ∫ h(z | y, θt) [ d log( P(y | z, θ) f(z | θ) ) / dθ ] dz
 = ∫ h(z | y, θt) [ 1 / ( P(y | z, θ) f(z | θ) ) ] [ d( P(y | z, θ) f(z | θ) ) / dθ ] dz.

Now evaluate this derivative at θ = θt:

dE∗(θ | θt)/dθ |θt
 = ∫ h(z | y, θt) [ 1 / ( P(y | z, θt) f(z | θt) ) ] [ d( P(y | z, θ) f(z | θ) ) / dθ ]θt dz
 = ∫ [ P(y | z, θt) f(z | θt) / P(y | θt) ] [ 1 / ( P(y | z, θt) f(z | θt) ) ] [ d( P(y | z, θ) f(z | θ) ) / dθ ]θt dz
 = ∫ [ 1 / P(y | θt) ] [ d( P(y | z, θ) f(z | θ) ) / dθ ]θt dz
 = [ 1 / P(y | θt) ] ∫ [ d( P(y | z, θ) f(z | θ) ) / dθ ]θt dz
 = [ d log P(y | θ) / dθ ]θt
 = [ dLL(θ) / dθ ]θt.

At θ = θt, the two functions, E∗ and LL, have the same slope.

Step 3: Note that E∗ ≤ LL for all θ

This relation can be shown as follows:

LL(θ) = log P(y | θ)    (14.2)
 = log ∫ P(y | z, θ) f(z | θ) dz
 = log ∫ [ P(y | z, θ) f(z | θ) / h(z | y, θt) ] h(z | y, θt) dz
 ≥ ∫ h(z | y, θt) log[ P(y | z, θ) f(z | θ) / h(z | y, θt) ] dz    (14.3)
 = ∫ h(z | y, θt) log( P(y | z, θ) f(z | θ) ) dz − ∫ h(z | y, θt) log( h(z | y, θt) ) dz
 = E(θ | θt) − ∫ h(z | y, θt) log( h(z | y, θt) ) dz
 = E(θ | θt) − ∫ h(z | y, θt) log( h(z | y, θt) P(y | θt) / P(y | θt) ) dz
 = E(θ | θt) + ∫ h(z | y, θt) log P(y | θt) dz − ∫ h(z | y, θt) log( h(z | y, θt) P(y | θt) ) dz
 = E(θ | θt) + log P(y | θt) ∫ h(z | y, θt) dz − ∫ h(z | y, θt) log( h(z | y, θt) P(y | θt) ) dz
 = E(θ | θt) + log P(y | θt) − ∫ h(z | y, θt) log( h(z | y, θt) P(y | θt) ) dz    (14.4)
 = E(θ | θt) + LL(θt) − ∫ h(z | y, θt) log( P(y | z, θt) f(z | θt) ) dz    (14.5)
 = E(θ | θt) + LL(θt) − E(θt | θt)
 = E∗(θ | θt).

The inequality in line (14.3) is due to Jensen's inequality, which states that log(E(x)) ≥ E(log(x)). In our case, x is the statistic P(y | z, θ)f(z | θ) / h(z | y, θt), and the expectation is over the density h(z | y, θt). An example of this inequality is shown in Figure 14.1, where the averages are over two values labeled a and b. The average of log(a) and log(b) is the midpoint of the dotted line that connects these two points on the log curve. The log evaluated at the average of a and b is log((a + b)/2), which is above the midpoint of the dotted line. Jensen's inequality is simply a result of the concave shape of the log function.

Line (14.4) is obtained because the density h integrates to 1. Line (14.5) is obtained by substituting h(z | y, θt) = P(y | z, θt)f(z | θt)/P(y | θt) within the log and then cancelling the P(y | θt) terms.

Figure 14.1: Example of Jensen's inequality.

Combine Results to Compare E∗ and LL

Figure 14.2 shows E∗(θ | θt) and LL(θ) in the appropriate relation to each other. As we have shown, these two functions are equal and have the same slope at θ = θt. These results imply that the two functions are tangent to each other at θ = θt. We have also shown that E∗(θ | θt) ≤ LL(θ) for all θ. Consistent with this relation, E∗ is drawn below LL(θ) in the graph at all points except θt, where they are the same.

Figure 14.2: Relation of E∗ and LL.

The EM algorithm maximizes E∗(θ | θt) to find the next trial value of θ. The maximizing value is shown as θt+1. As the graph indicates, the log-likelihood function is necessarily higher at the new parameter value, θt+1, than at the original value, θt. As long as the derivative of the log-likelihood function is not already zero at θt, maximizing E∗(θ | θt) raises LL(θ).² Each iteration of the EM algorithm raises the log-likelihood function until the algorithm converges at the maximum of the log-likelihood function.

²In fact, any increase in E∗(θ | θt) leads to an increase in LL(θ).

14.2.2 Convergence

Convergence of the EM algorithm is usually defined as a sufficiently small change in the parameters (e.g., Levine and Casella, 2001) or in the log-likelihood function (e.g., Weeks and Lange, 1989, and Aitkin and Aitkin, 1996). These criteria need to be used with care, since the EM algorithm can move slowly near convergence. Ruud (1991) shows that the convergence statistic in section 8.4 can be used with the gradient and Hessian of E instead of LL. However, calculating this statistic can be more computationally intensive than the iterations of the EM algorithm itself, and can be infeasible in some cases.

14.2.3 Standard errors

There are three ways that standard errors can be calculated. First, once the maximum of LL(θ) has been found with the EM algorithm, the standard errors can be calculated from LL the same as if the log-likelihood function had been maximized directly. The procedures in Section 8.6 are applicable: the asymptotic standard errors can be calculated from the Hessian or from the variance of the observation-specific gradients (i.e., the scores), calculated from LL(θ) evaluated at the estimate θ̂.

A second option arises from the result we obtained in step 2 above. We showed that E and LL have the same gradient at θ = θt. At convergence, the value of θ does not change from one iteration to the next, such that θ̂ = θt+1 = θt. Therefore, at θ̂, the derivatives of these two functions are the same. This fact implies that the scores can be calculated from E rather than LL. If E takes a more convenient form than LL, as is usually the case when applying an EM algorithm, this alternative calculation can be attractive.

A third option is the bootstrap, also discussed in Section 8.6. Under this option, the EM algorithm is applied numerous times, using a different sample of the observations each time. In many contexts for which EM algorithms are applied, bootstrapped standard errors are more feasible and useful than the asymptotic formulas. The case study in the last section provides an example.
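As a minimal sketch of this bootstrap option, the snippet below reruns a user-supplied EM estimator on resampled data. The function name estimate_fn, the data layout, and the number of bootstrap samples are illustrative assumptions, not prescriptions from the text.

import numpy as np

def bootstrap_se(estimate_fn, data, n_boot=20, seed=0):
    """Bootstrapped standard errors for an EM estimator.

    estimate_fn(data) must run the EM algorithm to convergence on the
    supplied observations and return the parameter vector. data is an
    array whose first dimension indexes observations (decision-makers).
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        resample = data[rng.integers(0, n, size=n)]  # sample n rows with replacement
        estimates.append(estimate_fn(resample))
    return np.std(np.array(estimates), axis=0, ddof=1)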

14.3 Examples of EM Algorithms

We describe in this section several types of discrete choice models whose LL functions are difficult to maximize directly but are easy to estimate with EM algorithms. The purpose of the discussion is to provide concrete examples of how EM algorithms are specified as well as to illustrate the value of the approach.

14.3.1 Discrete Mixing Distribution with Fixed Points

One of the issues that arises with mixed logit models (or, indeed, with any mixed model) is the appropriate specification of the mixing distribution. It is customary to use a convenient distribution, such as normal or lognormal. However, it is doubtful that the true distribution of coefficients takes a mathematically convenient form. More flexible distributions are useful, where flexibility means that the specified distribution can take a wider variety of shapes, depending on the values of its parameters.

Usually, greater flexibility is attained by including more parameters. In nonparametric estimation, a family of distributions is specified that has the property that the distribution becomes more flexible as the number of parameters rises. By allowing the number of parameters to rise with sample size, the nonparametric estimator is consistent for any true distribution. The term “nonparametric” is a bit of a misnomer in this context: “super-parametric” is perhaps more appropriate, since the number of parameters is usually larger than under standard specifications and rises with sample size to gain ever-greater flexibility.

The large number of parameters in nonparametric estimation makes direct maximization of the LL function difficult. In many cases, however, an EM algorithm can be developed that facilitates estimation considerably. The current section presents one such case.

Consider a mixed logit with an unknown distribution of coefficients. Any distribution can be approximated arbitrarily closely by a discrete distribution with a sufficiently large number of points. We can use this fact to develop a nonparametric estimator of the mixing distribution, using an EM algorithm for estimation.

Let the density of coefficients be represented by C points, with βc being the c-th point. The location of these points is assumed (for the current procedure) to be fixed, and the mass at each point (i.e., the share of the population at each point) constitutes the parameters to be estimated. One way to select the points is to specify a maximum and minimum for each coefficient and create a grid of evenly spaced points between the maxima and minima. For example, suppose there are five coefficients and the range between the minimum and maximum of each coefficient is represented by 10 evenly spaced points. The 10 points in each dimension create a grid of 10⁵ = 100,000 points in the five-dimensional space. The parameters of the model are the shares of the population at each of the 100,000 points. As we will see, estimation of such a large number of parameters is quite feasible with an EM algorithm. By increasing the number of points, the grid becomes ever finer, such that the estimated shares at the points can approximate any underlying distribution.
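For concreteness, a grid of fixed points of this kind can be built as in the following sketch; the coefficient ranges shown are hypothetical placeholders, not values from the text.

import numpy as np
from itertools import product

# Ten evenly spaced values per coefficient; with five coefficients this
# yields 10**5 = 100,000 grid points, one row per point beta_c.
# The (min, max) ranges below are hypothetical.
ranges = [(-3.0, 0.0), (-5.0, 0.0), (0.0, 2.0), (0.0, 4.0), (0.0, 4.0)]
axes = [np.linspace(lo, hi, 10) for lo, hi in ranges]
points = np.array(list(product(*axes)))   # shape (100000, 5)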

The utility that agent n obtains from alternative j is:

Unj = βnxnj + εnj

where ε is iid extreme value. The random coefficients have the discrete distribution described above, with sc being the share of the population at point βc. The distribution is expressed functionally as:

f(βn) = s1 if βn = β1,
        s2 if βn = β2,
        ...
        sC if βn = βC,
        0 otherwise,

where the shares sum to one: Σc sc = 1. For convenience, denote the shares collectively as the vector s = ⟨s1, . . . , sC⟩.³

³This specification can be considered a type of latent class model, where there are C classes, the coefficients of people in class c are βc, and sc is the share of the population in class c. However, the term “latent class model” usually refers to a model in which the locations of the points are parameters as well as the shares. We consider this more traditional form in our next example.


Conditional on βn = βc for some c, the choice model is a standard logit:

Lni(βc) = e^{βc xni} / Σj e^{βc xnj}

Since βn is not known for each person, the choice probability is a mixed logit, mixed over the discrete distribution of βn:

Pni(s) = Σc sc Lni(βc)

The log-likelihood function is LL(s) = Σn log Pnin(s), where in is the chosen alternative of agent n.

The LL can be maximized directly to estimate the shares s. With a large number of classes, as would usually be needed to flexibly represent the true distribution, this direct maximization can be difficult. However, an EM algorithm can be utilized for this model that is amazingly simple, even with hundreds of thousands of points.

The “missing data” in this model are the coefficients of each agent. The distribution f gives the share of the population with each coefficient value. However, as discussed in Chapter 11, a person's choices reveal information about their coefficients. Conditional on person n choosing alternative in, the probability that the person has coefficients βc is given by Bayes' identity:

h(βc | in, s) = sc Lnin(βc) / Pnin(s)

This conditional distribution is used in the EM algorithm. In particular, the expectation for the EM algorithm is:

E(s | st) = Σn Σc h(βc | in, st) log( sc Lnin(βc) )

Since log( sc Lnin(βc) ) = log(sc) + log Lnin(βc), this expectation can be rewritten as two parts:

E(s | st) = Σn Σc h(βc | in, st) log(sc) + Σn Σc h(βc | in, st) log Lnin(βc).

This expectation is to be maximized with respect to the parameters s. Note, however, that the second part on the right does not depend on s: it depends only on the coefficients βc, which are fixed points in this nonparametric procedure. Maximization of the above formula is therefore equivalent to maximization of just the first part:

E(s | st) = Σn Σc h(βc | in, st) log(sc).

This function is very easy to maximize. In particular, the maximizing value of sc, accounting for the constraint that the shares sum to 1, is:

sc^{t+1} = Σn h(βc | in, st) / Σn Σc′ h(βc′ | in, st)

Using the nomenclature from the general description of EM algorithms, the h(βc | in, st) are weights, calculated at the current values of the shares, st. The updated share for class c is the sum of the weights at point c as a share of the sum of the weights at all points.

This EM algorithm is implemented in the following steps:

1. Define the points βc for c = 1, . . . , C.

2. Calculate the logit formula for each person at each point: Lni(βc) ∀ n, c.

3. Specify initial values of the share at each point, labeled collectively as s0. It is convenient to set the initial shares to sc = 1/C ∀ c.

4. For each person and each point, calculate the probability of the person having those coefficients conditional on the person's choice, using the initial shares s0 as the unconditional probabilities: h(βc | in, s0) = s0c Lnin(βc)/Pnin(s0). Note that the denominator is the sum over points of the numerator. For convenience, label this calculated value h0nc.

5. Update the population share at point c as s1c = Σn h0nc / Σn Σc′ h0nc′.

6. Repeat steps 4 and 5 using the updated shares in lieu of the original starting values. Continue repeating until convergence. (A code sketch of these steps follows below.)

This procedure does not require calculation of any gradients or inversion of any Hessian, as the procedures in Chapter 8 utilize for direct maximization of LL. Moreover, the logit probabilities are calculated only once (in step 2), rather than in each iteration. The iterations consist of recalibrating the shares at each point, which is simple arithmetic.


Because so little calculation is needed for each point, the procedure can be implemented with a very large number of points. For example, the application in Train (2008a) included over 200,000 points, and yet estimation took only about 30 minutes. In contrast, it is doubtful that direct maximization by the methods in Chapter 8 would even have been feasible, since it would entail inverting a 200,000 × 200,000 Hessian.
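The six steps above translate almost line for line into code. The following Python sketch implements them under the assumption that the logit probabilities of each person's observed choice have already been computed at every grid point (step 2); the function and variable names are illustrative, not from the text.

import numpy as np

def em_fixed_points(logit_probs, max_iter=1000, tol=1e-10):
    """EM updates for the shares at fixed coefficient points.

    logit_probs: array of shape (N, C) whose (n, c) entry is the logit
    probability of person n's observed choice evaluated at point beta_c,
    i.e. L_nin(beta_c), computed once before the iterations begin.
    """
    N, C = logit_probs.shape
    shares = np.full(C, 1.0 / C)                    # step 3: uniform starting shares
    for _ in range(max_iter):
        # Step 4: conditional probabilities h_nc = s_c L_nin(beta_c) / P_nin(s).
        weighted = logit_probs * shares             # (N, C)
        h = weighted / weighted.sum(axis=1, keepdims=True)
        # Step 5: updated share = sum of weights at point c / sum of all weights.
        new_shares = h.sum(axis=0) / h.sum()
        if np.max(np.abs(new_shares - shares)) < tol:
            return new_shares
        shares = new_shares                         # step 6: repeat until convergence
    return shares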

14.3.2 Discrete Mixing Distribution with Points as Parameters

We can modify the previous model by treating the coefficients, βc for each c, as parameters to be estimated rather than fixed points. The parameters of the model are then the location and share of the population at each point. Label these parameters collectively as θ = ⟨sc, βc, c = 1, . . . , C⟩. This specification is often called a latent class model: the population consists of C distinct classes, with all people within a class having the same coefficients but coefficients differing across classes. The parameters of the model are the share of the population in each class and the coefficients for each class.

The missing data are the class membership of each person. The expectation for the EM algorithm is the same as for the previous specification except that now the βc's are treated as parameters:

E(θ | θt) = Σn Σc h(βc | in, st) log( sc Lnin(βc) ).

Note that each set of parameters enters only one term inside the log: the vector of shares s does not enter any of the Lnin(βc)'s, and each βc enters only Lnin(βc) for class c. Maximization of this function is therefore equivalent to separate maximization of each of the following functions:

E(s | θt) = Σn Σc h(βc | in, st) log(sc)    (14.6)

and for each c:

E(βc | θt) = Σn h(βc | in, st) log Lnin(βc).    (14.7)

The maximum of (14.6) is attained, as before, at:

sc^{t+1} = Σn h(βc | in, st) / Σn Σc′ h(βc′ | in, st)


For updating the coefficients βc, note that (14.7) is the log-likelihood function for a standard logit model, with each observation weighted by h(βc | in, st). The updated values of βc are obtained by estimating a standard logit model with each person providing an observation that is weighted appropriately. The same logit estimation is performed for each class c, but with different weights for each class.

The EM algorithm is implemented in these steps:

1. Specify initial values of the share and coefficients in each class, labeled β0c ∀ c and s0. It is convenient to set the initial shares to 1/C. I have found that starting values for the coefficients can easily be obtained by partitioning the sample into C groups and running a logit on each group.⁴

2. For each person and each class, calculate the probability of being in that class conditional on the person's choice, using the initial shares s0 as the unconditional probabilities: h(βc | in, s0) = s0c Lnin(βc)/Pnin(s0). Note that the denominator is the sum over classes of the numerator. For convenience, label this calculated value h0nc.

3. Update the share in class c as s1c = Σn h0nc / Σn Σc′ h0nc′.

4. Update the coefficients for each class c by estimating a logit model with person n weighted by h0nc. A total of C logit models are estimated, using the same observations but different weights in each.

5. Repeat steps 2-4 using the updated shares s and coefficients βc ∀ c in lieu of the original starting values. Continue repeating until convergence. (A code sketch of one iteration follows below.)
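Here is a minimal sketch of one such iteration, assuming the data are stored as an attributes array X of shape (N, J, K) and a vector of chosen-alternative indices; scipy's general-purpose minimizer stands in for whatever weighted logit routine the researcher's software provides, and all names are illustrative.

import numpy as np
from scipy.optimize import minimize

def logit_prob_chosen(beta, X, chosen):
    """Standard logit probability of each person's chosen alternative.
    X: (N, J, K) attributes; chosen: (N,) chosen-alternative indices."""
    v = X @ beta                                        # (N, J) utilities
    v = v - v.max(axis=1, keepdims=True)                # numerical stability
    p = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)
    return p[np.arange(len(chosen)), chosen]

def weighted_logit(X, chosen, weights, beta_start):
    """Maximize the weighted logit log-likelihood in equation (14.7)."""
    neg_ll = lambda b: -np.sum(weights * np.log(logit_prob_chosen(b, X, chosen)))
    return minimize(neg_ll, beta_start, method="BFGS").x

def em_iteration(X, chosen, shares, betas):
    """One EM update of the class shares and class coefficients (steps 2-4)."""
    C = len(shares)
    # Step 2: h_nc = s_c L_nin(beta_c) / sum over c' of s_c' L_nin(beta_c').
    L = np.column_stack([logit_prob_chosen(betas[c], X, chosen) for c in range(C)])
    weighted = L * shares
    h = weighted / weighted.sum(axis=1, keepdims=True)
    # Step 3: updated shares; each row of h sums to one, so this maximizes (14.6).
    new_shares = h.mean(axis=0)
    # Step 4: one weighted logit per class, with weights h[:, c].
    new_betas = [weighted_logit(X, chosen, h[:, c], betas[c]) for c in range(C)]
    return new_shares, new_betas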

An advantage of this approach is that the researcher can implement nonparametric estimation, with the class shares and coefficients in each class treated as parameters, using any statistical package that includes a logit estimation routine. For a given number of classes, this procedure takes considerably longer than the previous one, since C logit models must be estimated in each iteration. However, far fewer classes are probably needed with this approach than with the previous one to adequately represent the true distribution, since this procedure estimates the “best” location of the points (i.e., the coefficients) while the previous one takes the points as fixed.

⁴Note that these groups do not represent a partitioning of the sample into classes. The classes are latent and so such partitioning is not possible. Rather, the goal is to obtain C sets of starting values for the coefficients of the C classes. These starting values must not be the same for all classes since, if they were the same, the algorithm would perform the same calculations for each class and return for all classes the same shares and updated estimates in each iteration. An easy way to obtain C different sets of starting values is to divide the sample into C groups and estimate a logit on each group.

Bhat (1997a) developed an EM algorithm for a model that is similar to this one, except that the class shares sc are not parameters in themselves but rather are specified to depend on the demographics of the person. The EM algorithm that he used replaces our step 3 with a logit model of class membership, with the conditional probability of class membership serving as the dependent variable. His application demonstrates the overriding point of this chapter: that EM algorithms can readily be developed for many forms of complex choice models.

14.3.3 Normal Mixing Distribution with Full Covariance

We pointed out in Chapter 12 that a mixed logit with full covariance among coefficients can be difficult to estimate by standard maximization of the LL function, due both to the large number of covariance parameters and to the fact that the LL is highly non-quadratic. Train (2008b) developed an EM algorithm that is very simple and fast for mixed logits with full covariance. The algorithm takes the following very simple form:

1. Specify initial values for the mean and covariance of the coefficients in the population.

2. For each person, take draws from the population distribution using this initial mean and covariance.

3. Weight each person's draws by that person's conditional density of the draws.

4. Calculate the mean and covariance of the weighted draws for all people. These become the updated mean and covariance of the coefficients in the population.


5. Repeat steps 2-4 using the updated mean and covariance, and continue repeating until there is no further change (to a tolerance) in the mean and covariance.

The converged values are the estimates of the population mean and covariance. No gradients are required. All that is needed for estimation of a model with fully correlated coefficients is to repeatedly take draws using the previously calculated mean and covariance, weight the draws appropriately, and calculate the mean and covariance of the weighted draws.

The procedure is readily applicable when the coefficients are normally distributed, or when they are transformations of jointly normal terms. Following the notation in Train and Sonnier (2005), the utility that agent n obtains from alternative j is:

Unj = αnxnj + εnj

where the random coefficients are transformations of normally distributed terms: αn = T(βn), with βn distributed normally with mean b and covariance W. The transformation allows considerable flexibility in the choice of distribution. For example, a lognormal distribution is obtained by specifying the transformation αn = exp(βn). An Sb distribution, which has an upper and lower bound, is obtained by specifying αn = exp(βn)/(1 + exp(βn)). Of course, if the coefficient is itself normal, then αn = βn. The normal density is denoted φ(βn | b, W).

Conditional on βn, the choice probabilities are logit:

Lni(βn) = e^{T(βn) xni} / Σj e^{T(βn) xnj}

Since βn is not known, the choice probability is a mixed logit, mixed over the distribution of βn:

Pni(b, W) = ∫ Lni(β) φ(β | b, W) dβ

The log-likelihood function is LL(b, W) = Σn log Pnin(b, W), where in is the chosen alternative of agent n.

As discussed in Section 12.7, classical estimation of this model by standard maximization of LL is difficult, and this difficulty is one of the reasons for using Bayesian procedures. However, an EM algorithm can be applied that is considerably easier than standard maximization and makes classical estimation about as convenient as Bayesian estimation for this model.

The missing data for the EM algorithm are the βn of each person. The density φ(β | b, W) is the distribution of β in the population. For the EM algorithm, we use the conditional distribution for each person. By Bayes' identity, the density of β conditional on alternative i being chosen by person n is h(β | i, b, W) = Lni(β) φ(β | b, W) / Pni(b, W). The expectation for the EM algorithm is:

E(b, W | bt, Wt) = Σn ∫ h(β | in, bt, Wt) log( Lnin(β) φ(β | b, W) ) dβ

Note that Lnin(β) does not depend on the parameters b and W. Maximization of this expectation with respect to the parameters is therefore equivalent to maximization of:

E(b, W | bt, Wt) = Σn ∫ h(β | in, bt, Wt) log( φ(β | b, W) ) dβ

The integral inside this expectation does not have a closed form. However, we can approximate the integral through simulation. Substituting the definition of h(·) and rearranging, we have:

E(b, W | bt, Wt) = Σn ∫ [ Lnin(β) / Pnin(bt, Wt) ] log( φ(β | b, W) ) φ(β | bt, Wt) dβ

The expectation over φ is simulated by taking R draws from φ(β | bt, Wt) for each person, with βnr labeling the r-th draw for person n. The simulated expectation is:

E(b, W | bt, Wt) = Σn Σr wtnr log( φ(βnr | b, W) ) / R

where the weights are:

wtnr = Lnin(βnr) / [ (1/R) Σr′ Lnin(βnr′) ]

This simulated expectation takes a familiar form: it is the log-likelihood function for a sample of draws from a normal distribution, with each draw weighted by wtnr.⁵ The maximum likelihood estimator of the mean and covariance of a normal distribution, given a weighted sample of draws from that distribution, is simply the weighted mean and covariance of the sampled draws.

⁵The division by R can be ignored since it does not affect the maximization, in the same way that division by sample size N is omitted from E.

The updated mean is:

bt+1 = (1/NR) Σn Σr wtnr βnr

and the updated covariance matrix is

Wt+1 = (1/NR) Σn Σr wtnr (βnr − bt+1)(βnr − bt+1)′.

Note that Wt+1 is necessarily positive definite, as required for a covariance matrix, since it is constructed as the covariance of the draws.

The EM algorithm is implemented as follows:

1. Specify initial values for the mean and covariance, labeled b0 and W0.

2. Create R draws for each of the N people in the sample as β0nr = b0 + chol(W0)ηnr, where chol(W0) is the lower-triangular Cholesky factor of W0 and ηnr is a conforming vector of iid standard normal draws.

3. For each draw for each person, calculate the logit probability of the person's observed choice: Lnin(β0nr).

4. For each draw for each person, calculate the weight w0nr = Lnin(β0nr) / [ (1/R) Σr′ Lnin(β0nr′) ].

5. Calculate the weighted mean and covariance of the N × R draws β0nr, r = 1, . . . , R, n = 1, . . . , N, using weights w0nr. The weighted mean and covariance are the updated parameters b1 and W1.

6. Repeat steps 2-5 using the updated mean b and covariance W in lieu of the original starting values. Continue repeating until convergence. (A code sketch of one update follows below.)
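For illustration, one full update (steps 2-5) might look like the sketch below, which assumes the identity transformation αn = βn (i.e., normally distributed coefficients) and an attributes array X of shape (N, J, K); the names and array layout are ours, not the author's.

import numpy as np

def em_step_normal(b, W, X, chosen, R, rng):
    """One EM update of the population mean b and covariance W.
    X: (N, J, K) attributes; chosen: (N,) chosen-alternative indices."""
    N, J, K = X.shape
    # Step 2: R draws per person from N(b, W) via the Cholesky factor.
    eta = rng.standard_normal((N, R, K))
    beta = b + eta @ np.linalg.cholesky(W).T            # (N, R, K)
    # Step 3: logit probability of each observed choice at each draw
    # (identity transformation T(beta) = beta assumed here).
    v = np.einsum('nrk,njk->nrj', beta, X)
    v = v - v.max(axis=2, keepdims=True)                # numerical stability
    p = np.exp(v) / np.exp(v).sum(axis=2, keepdims=True)
    L = p[np.arange(N)[:, None], np.arange(R)[None, :], chosen[:, None]]   # (N, R)
    # Step 4: weights w_nr = L_nr / ((1/R) * sum over r' of L_nr').
    w = L / L.mean(axis=1, keepdims=True)
    # Step 5: weighted mean and covariance of the draws.
    b_new = (w[..., None] * beta).sum(axis=(0, 1)) / (N * R)
    dev = beta - b_new
    W_new = np.einsum('nr,nrk,nrl->kl', w, dev, dev) / (N * R)
    return b_new, W_new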

This procedure can be implemented without invoking any estimation software, simply by taking draws, calculating logit formulas to construct the weights, and then calculating the weighted mean and covariance of the draws. A researcher can estimate a mixed logit with full covariance, and with coefficients that are possibly transformations of normals, with just these simple steps.

Train (2008a) shows that this procedure can be generalized to a finite mixture of normals, where β is drawn from any of C normals with different means and covariances. The probability of drawing β from each normal (i.e., the share of the population whose coefficients are described by each normal distribution) is a parameter along with the means and covariances. Any distribution can be approximated by a finite mixture of normals, with a sufficient number of underlying normals. By allowing the number of normals to rise with sample size, the approach becomes a form of nonparametric estimation of the true mixing distribution. The EM algorithm for this type of nonparametric estimation combines the concepts of the current section on normal distributions with full covariance and the concepts from the immediately previous section on discrete distributions.

One last note is useful. At convergence, the derivatives of E(b, W | b, W) provide scores that can be used to estimate the asymptotic standard errors of the estimates. In particular, the gradient with respect to b is:

dE(b, W | b, W)/db = Σn [ (1/R) Σr wnr W⁻¹(βnr − b) ]

and the gradient with respect to W is:

dE(b, W | b, W)/dW = Σn [ (1/R) Σr wnr ( −(1/2) W⁻¹ + (1/2) W⁻¹(βnr − b)(βnr − b)′ W⁻¹ ) ],

where the weights wnr are calculated at the estimated values of b and W. The terms in brackets are the scores for each person, which we can collect into a vector labeled sn. The variance of the scores is V = Σn sn sn′ / N. The asymptotic covariance of the estimator is then calculated as V⁻¹/N, as discussed in section 8.6.
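As a sketch of this score-based calculation (shown here for the mean b only, under the same assumptions and naming as the earlier sketch; the scores for the elements of W are analogous):

import numpy as np

def asymptotic_cov_b(b, W, beta_draws, weights):
    """Score-based asymptotic covariance of the estimated mean b.

    beta_draws: (N, R, K) draws used at convergence; weights: (N, R)
    weights w_nr evaluated at the estimates b and W.
    """
    N, R, K = beta_draws.shape
    W_inv = np.linalg.inv(W)
    dev = beta_draws - b                                            # (N, R, K)
    # Score of person n w.r.t. b: (1/R) * sum_r w_nr W^{-1} (beta_nr - b).
    scores = np.einsum('nr,nrk,kl->nl', weights, dev, W_inv) / R    # (N, K)
    V = scores.T @ scores / N                                       # variance of the scores
    return np.linalg.inv(V) / N                                     # asymptotic covariance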

14.4 Case Study: Demand for Hydrogen Cars

Train (2008a) examined buyers' preferences for hydrogen-powered vehicles, using several of the EM algorithms that we have described. We will describe one of his estimated models as an illustration of the procedure. A survey of new car buyers in Southern California was conducted to assess the importance that these buyers place on several issues that are relevant for hydrogen vehicles, such as the availability of refueling stations. Each respondent was presented with a series of ten stated-preference experiments. In each experiment the respondent was offered a choice among three alternatives: the conventional-fuel vehicle (CV) that the respondent had recently purchased and two alternative-fueled vehicles (AVs) with specified attributes. The respondent was asked to evaluate the three options, stating which they considered best and which worst. The attributes represented relevant features of hydrogen vehicles, but the respondents were not told that the alternative fuel was hydrogen, so as to avoid any preconceptions that respondents might have developed with respect to hydrogen vehicles. The attributes included in the experiments are:

• Fuel cost (FC), expressed as percent difference from the CV. In estimation, the attribute was scaled as a share, such that a fuel cost of 50 percent less than the conventional vehicle enters as −0.5, and 50 percent more enters as 0.5.

• Purchase price (PP), expressed as percent difference from the CV, scaled analogously to fuel cost when entering the model.

• Driving radius (DR): the farthest distance from home that one is able to travel and then return, starting on a full tank of fuel. As defined, driving radius is one-half of the vehicle's range. In the estimated models, DR was scaled in hundreds of miles.

• Convenient medium distance destinations (CMDD): the percent of destinations within the driving radius that “require no advanced planning because you can refuel along the way or at your destination,” as opposed to destinations that “require refueling (or at least estimating whether you have enough fuel) before you leave to be sure you can make the round-trip.” This attribute reflects the distribution of potential destinations and refueling stations within the driving radius, recognizing that the tank will not always be full when starting. In the estimated models, it was entered as a share, such that, e.g., 50 percent enters as 0.50.

• Possible long distance destinations (PLDD): the percent of destinations beyond the driving radius that are possible to reach because refueling is possible, as opposed to destinations that cannot be reached due to limited station coverage. This attribute reflects the extent of refueling stations outside the driving radius and their proximity to potential driving destinations. It entered the models scaled analogously to CMDD.

• Extra time to local stations (ETLS): additional one-way travel time, beyond the time typically required to find a conventional fuel station, that is required to get to an alternative fuel station in the local area. ETLS was defined as having values of 0, 3 and 10 minutes in the experiments; however, in preliminary analysis, it was found that respondents considered 3 minutes to be no inconvenience (i.e., equivalent to 0 minutes). In the estimated models, therefore, a dummy variable for whether ETLS is 10 was entered, rather than ETLS itself.

In the experiments, the CV that the respondent had purchased was described as having a driving radius of 200 miles, CMDD and PLDD equal to 100 percent, and, by definition, ETLS, FC and PP of 0.

As stated above, the respondent was asked to identify the best and worst of the three alternatives, thereby providing a ranking of the three. Conditional on the respondent's coefficients, the ranking probabilities were specified with the “exploded logit” formula (as described in section 7.3.1). By this formulation, the probability of the ranking is the logit probability of the first choice from the three alternatives in the experiment, times the logit probability of the second choice from the two remaining alternatives. This probability is mixed over the distribution of coefficients, whose parameters were estimated.
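As a small sketch of that exploded logit formula for a single experiment with three alternatives (the function name and the utility vector v are illustrative):

import numpy as np

def exploded_logit_prob(v, best, worst):
    """Ranking probability for three alternatives: the logit probability of
    the best alternative among all three, times the logit probability of the
    middle alternative among the two that remain once the best is removed.
    v: (3,) utilities; best, worst: indices of the best and worst alternatives."""
    v = np.asarray(v, dtype=float)
    middle = ({0, 1, 2} - {best, worst}).pop()
    p_first = np.exp(v[best]) / np.exp(v).sum()
    p_second = np.exp(v[middle]) / np.exp(v[[middle, worst]]).sum()
    return p_first * p_second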

All three methods that we described above were applied. We concentrate on the method in section 14.3.2, since it provides a succinct illustration of the power of the EM algorithm. For this method, there are C classes of buyers, and the coefficients βc and the share sc of the population in each class are treated as parameters.

Train (2008a) estimated the model with several different numbers of classes, ranging from 1 class (which is a standard logit) up to 30 classes. Table 14.1 gives the log-likelihood value for these models. Increasing the number of classes improves LL considerably, from −7884.6 with 1 class to −5953.4 with 30 classes. Of course, a larger number of classes entails more parameters, and the question arises of whether the improved fit is “worth” the extra parameters. In situations such as this, it is customary to evaluate the models by the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC).⁶ The values of these statistics are also given in Table 14.1. The AIC is lowest (best) with 25 classes, and the BIC, which penalizes extra parameters more heavily than the AIC, is lowest with 8 classes.

For the purposes of evaluating the EM algorithm, it is useful to know that these models were estimated with run times of about 1.5 minutes per class, from initial values to convergence. This means that the model with 30 classes, which has 239 parameters,⁷ was estimated in only 45 minutes.

Table 14.1: Mixed Logit Models with Discrete Distribution of Coefficients and Different Numbers of Classes

Classes   Log-Likelihood   Parameters       AIC       BIC
   1          -7884.6           7        15783.2   15812.8
   5          -6411.5          39        12901.0   13066.0
   6          -6335.3          47        12764.6   12963.4
   7          -6294.4          55        12698.8   12931.5
   8          -6253.9          63        12633.8   12900.3
   9          -6230.4          71        12602.8   12903.2
  10          -6211.4          79        12580.8   12915.0
  15          -6124.5         119        12487.0   12990.4
  20          -6045.1         159        12408.2   13080.8
  25          -5990.7         199        12379.4   13221.3
  30          -5953.4         239        12384.8   13395.9

Table 14.2 presents the estimates for the model with 8 classes, which is best by the BIC. The model with 25 classes, which is best by the AIC, provides even greater detail but is not given, for the sake of brevity. As shown in Table 14.2, the largest of the 8 classes is the last one, with 25 percent. This class has a large, positive coefficient for CV, unlike all the other classes. This class apparently consists of people who prefer their CV over AVs even when the AV has the same attributes, perhaps because of the uncertainty associated with new technologies. Other distinguishing features of classes are evident. For example, class 3 cares far more about purchase price than the other classes, while class 1 places more importance on fuel cost than the other classes.

⁶See, e.g., Mittelhammer et al. (2000, section 18.5) for a discussion of information criteria. The AIC (Akaike, 1974) is −2LL + 2K, where LL is the value of the log-likelihood and K is the number of parameters. The BIC, also called the Schwarz (1978) criterion, is −2LL + log(N)K, where N is the sample size.

⁷7 coefficients and 1 share for each of 30 classes, with one class share determined by the constraint that the shares sum to one.

Table 14.2: Model with Eight Classes

Class:                  1        2        3        4
Shares:             0.107    0.179    0.115   0.0699
Coefficients:
  Fuel cost        -3.546   -2.576   -1.893   -1.665
  Purchase price   -2.389   -5.318   -12.13    0.480
  Driving radius    0.718    0.952    0.199    0.472
  CMDD              0.662    1.156    0.327    1.332
  PLDD              0.952    2.869    0.910    3.136
  ETLS=10 dummy    -1.469   -0.206   -0.113   -0.278
  CV dummy         -1.136   -0.553   -0.693   -2.961

Class:                  5        6        7        8
Shares:             0.117    0.077    0.083    0.252
Coefficients:
  Fuel cost        -1.547   -0.560   -0.309   -0.889
  Purchase price   -2.741   -1.237   -1.397   -2.385
  Driving radius    0.878    0.853    0.637    0.369
  CMDD              0.514    3.400   -0.022    0.611
  PLDD              0.409    3.473    0.104    1.244
  ETLS=10 dummy     0.086   -0.379   -0.298   -0.265
  CV dummy         -3.916   -2.181   -0.007    2.656

Table 14.3 gives the mean and standard deviation of the coefficients over the eight classes. As Train (2008a) points out, these means and standard deviations are similar to those obtained with a more standard mixed logit with normally distributed coefficients (shown in his paper but not repeated here). This result indicates that the use of numerous classes in this application, which the EM algorithm makes possible, provides greater detail in explaining differences in preferences while maintaining very similar summary statistics.

It would be difficult to calculate standard errors from asymptotic formulas for this model (i.e., as the inverse of the estimated Hessian), since the number of parameters is so large. Also, we are interested in summary statistics, like the mean and standard deviation of the coefficients over all classes, given in Table 14.3. Deriving standard errors for these summary statistics from asymptotic formulas for the covariance of the parameters themselves would be computationally difficult. Instead, standard errors can readily be calculated for this model by bootstrap. Given the speed of the EM algorithm in this application, bootstrapping is quite feasible. Also, bootstrapping automatically provides standard errors for the summary statistics, by calculating the summary statistics for each bootstrap estimate and taking their standard deviations.

Table 14.3: Summary Statistics for Coefficients

                      Means             Std devs
                   Est.      SE       Est.      SE
Fuel cost        -1.648   0.141     0.966   0.200
Purchase price   -3.698   0.487     3.388   0.568
Driving radius    0.617   0.078     0.270   0.092
CMDD              0.882   0.140     0.811   0.126
PLDD              1.575   0.240     1.098   0.178
ETLS=10 dummy    -0.338   0.102     0.411   0.089
CV dummy         -0.463   1.181     2.142   0.216

Table 14.4: Standard Errors for Class 1

                   Est.      SE
Share             0.107   0.0566
Fuel cost        -3.546   2.473
Purchase price   -2.389   6.974
Driving radius    0.718   0.404
CMDD              0.662   1.713
PLDD              0.952   1.701
ETLS=10 dummy    -1.469   0.956
CV dummy         -1.136   3.294

The standard errors for the summary statistics are given in Table 14.3, based on 20 bootstrapped samples. Standard errors are not given in Table 14.2 for each class's parameters. Instead, Table 14.4 gives standard errors for class 1, which is illustrative of all the classes. As the table shows, the standard errors for the class 1 parameters are large. These large standard errors are expected and arise from the fact that the labeling of classes in this model is arbitrary. Suppose, as an extreme but illustrative example, that two different bootstrap samples give the same estimates for two classes but with their order changed (i.e., the estimates for class 1 becoming the estimates for class 2, and vice versa). In this case, the bootstrapped standard errors for the parameters of both classes rise even though the model for these two classes together is exactly the same. Summary statistics avoid this issue. All but one of the means are statistically significant, with the CV dummy obtaining the only insignificant mean. All of the standard deviations are significantly different from zero.

Train (2008a) also estimated two other models on these data using EM algorithms: (1) a model with a discrete distribution of coefficients where the points are fixed and the share at each point is estimated, using the procedure in section 14.3.1 above, and (2) a model with a discrete mixture of two normal distributions with full covariance, using a generalization of the procedure in section 14.3.3 above. The flexibility of EM algorithms to accommodate a wide variety of complex models is the reason they are worth learning. They enhance the researcher's ability to build specially tailored models that closely fit the reality of the situation and the goals of the research, which has been the overriding objective of this book.


Bibliography

Aitkin, M. and I. Aitkin (1996), ‘A hybrid EM/Gauss-Newton algorithm for maximum likelihood in mixture distributions’, Statistics and Computing 6, 127–130.

Akaike, H. (1974), ‘A new look at the statistical model identification’, IEEE Transactions on Automatic Control 19(6), 716–723.

Bhat, C. (1997a), ‘An endogenous segmentation mode choice model with an application to intercity travel’, Transportation Science 31, 34–48.

Boyles, R. (1983), ‘On the convergence of the EM algorithm’, Journal of the Royal Statistical Society B 45, 47–50.

Dempster, A., N. Laird and D. Rubin (1977), ‘Maximum likelihood from incomplete data via the EM algorithm’, Journal of the Royal Statistical Society B 39, 1–38.

Levine, R. and G. Casella (2001), ‘Implementation of the Monte Carlo EM algorithm’, Journal of Computational and Graphical Statistics 10, 422–439.

McLachlan, G. and T. Krishnan (1997), The EM Algorithm and Extensions, Wiley, New York.

Mittelhammer, R., G. Judge and D. Miller (2000), Econometric Foundations, Cambridge University Press, New York.

Ruud, P. (1991), ‘Extensions of estimation methods using the EM algorithm’, Journal of Econometrics 49, 305–341.

Schwarz, G. (1978), ‘Estimating the dimension of a model’, The Annals of Statistics 6, 461–464.

Train, K. (2008a), ‘EM algorithms for nonparametric estimation of mixing distributions’, Journal of Choice Modelling 1, 40–69.

Train, K. (2008b), ‘A recursive estimator for random coefficient models’, working paper, Department of Economics, University of California, Berkeley.

Train, K. and G. Sonnier (2005), Mixed logit with bounded distributions of correlated partworths, in R. Scarpa and A. Alberini, eds, ‘Applications of Simulation Methods in Environmental and Resource Economics’, Springer, Dordrecht, pp. 117–134.

Weeks, D. and K. Lange (1989), ‘Trials, tribulations, and triumphs of the EM algorithm in pedigree analysis’, Journal of Mathematics Applied in Medicine and Biology 6, 209–232.

Wu, C. (1983), ‘On the convergence properties of the EM algorithm’, Annals of Statistics 11, 95–103.