
Senior Thesis in Mathematics

Sampling from High Dimensional Distributions

Author: Bill DeRose

Advisor: Dr. Gabe Chandler

Submitted to Pomona College in Partial Fulfillment of the Degree of Bachelor of Arts

April 3, 2015


Contents

1 Introduction
   1.1 Introduction
   1.2 Related works

2 Random Variable Generation
   2.1 Random Variable Generation
      2.1.1 The Inverse Transform
      2.1.2 Acceptance-Rejection Sampling

3 Markov Chains
   3.1 Markov Chains
   3.2 Properties of Markov Chains
      3.2.1 Irreducibility
      3.2.2 Aperiodicity
      3.2.3 Stationarity
      3.2.4 Ergodicity

4 Markov Chain Monte Carlo
   4.1 Monte Carlo Methods
   4.2 Metropolis Hastings
      4.2.1 Bias-Variance Trade-off
      4.2.2 Approximate Metropolis-Hastings
   4.3 Slice Sampling
      4.3.1 Auxiliary Variable MCMC
      4.3.2 Uniform Ergodicity of the Slice Sampler
   4.4 Gibbs Sampling
   4.5 Hamiltonian Monte Carlo
      4.5.1 Hamiltonian Dynamics
      4.5.2 HMC
   4.6 Summary

5 Conclusion


Chapter 1

Introduction

1.1 Introduction

The challenges we face in computational statistics are due to the incredible advance of technology in the past 100 years. In a world where human suffering is the daily reality of so many, we should be so lucky to wrestle with the problems of algorithm design and implementation. Despite the advances in statistics over the past century, random number generation remains an active field of research. From machine learning and artificial intelligence, to the simulation of protein formation, the ability to draw from probability distributions has wide ranging applications.

But what exactly is a random number, and what is randomness? More importantly, how can an algorithm take a finite number of deterministic steps to produce something random? Often, humans delude themselves into seeing randomness where there is none – they detect a signal in the noise.

Figure 1.1: Which is random?


The image on the left of Figure 1.1 depicts genuine randomness. The points on the right are too evenly spaced for it to be truly random. In actuality, each point on the left represents the location of a star in our galaxy while the points on the right represent the location of glowworms on the ceiling of a cave in New Zealand. The glowworms spread themselves out to reduce competition for food amongst themselves. The seemingly uniform distribution is the result of a non-random force.

So how do we go about generating images like those on the left? We begin with a little cheat and assume the existence of a random number generator that allows us to sample U ∼ Uniform([0, 1]). Though we will not discuss methods for drawing uniformly from the unit interval, their importance to us cannot be overstated.

In practice, exact inference is often either impossible (e.g. provably non-integrable functions) or intractable (e.g. high dimensional integration) and we must turn to approximations. This work explores Monte Carlo methods as one approach to numerical approximation.

Example 1.1 (Numeric Integration) We wish to evaluate an integral Q = ∫_a^b f(x) dx. From calculus, we know f_avg = Q/(b − a) ⇒ Q = (b − a) f_avg. By the LLN, we can choose X_1, . . . , X_n uniformly in [a, b] to approximate

f_avg ≈ (1/n) ∑_{i=1}^{n} f(X_i) ⇒ Q ≈ ((b − a)/n) ∑_{i=1}^{n} f(X_i).
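As a quick illustration, the sketch below applies this estimator in R to the arbitrarily chosen integrand sin on [0, π], whose integral is 2.

mc.integrate <- function(f, a, b, n = 100000) {
  x <- runif(n, a, b)      # X_1, ..., X_n uniform on [a, b]
  (b - a) * mean(f(x))     # (b - a) times the average of f(X_i)
}

mc.integrate(sin, 0, pi)   # close to the true value 2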

1.2 Related works

Many of the algorithms we cover are versions of the Metropolis algorithm which first appeared in [9] and was eventually generalized by Hastings in [6]. Though the naming of the algorithm has been contested (Metropolis merely oversaw the research), we refer to the algorithm as Metropolis-Hastings for historical reasons. Regardless of naming conventions, we are indebted to Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller for their work on the original paper outlining the Metropolis algorithm.

The term “Monte Carlo” was coined because some of the first applications were to card games, like those in the Monte Carlo Casino in Monaco. A Monte Carlo algorithm is simply an algorithm whose output is random. Such Monte Carlo simulations were important to the development of the Manhattan project (after all, Metropolis worked at Los Alamos during WWII) and remain an important tool of modern statistical physics.

Though the Gibbs sampler was introduced by brothers Stuart and Donald Geman in 1984 [5], these sorts of numerical sampling techniques did not enter the mainstream until the early 1990’s, arguably because of the advent of the personal computer and wider access to computational power. The Gibbs sampler appeared nearly two decades before Neal’s slice sampler [11], though we cover them in reverse chronological order because the latter provides a nice motivation for the former. We draw heavily from [10], [13], [12] in proving the uniform ergodicity of the 2D slice sampler. Neal’s work appears again in the Hamiltonian Monte Carlo [2], which uses gradient information to explore state space more efficiently.

The contemporary results presented are mostly due to the proliferation of big data which, in the Bayesian setting, necessitates the ability to sample from a posterior distribution with billions of data points. Sequential hypothesis testing allows us to reduce some of the overhead required for Metropolis-Hastings [7]. As a general introduction to each of these sampling techniques, [1] has proven invaluable.

Page 6: thesis_final_draft

Chapter 2

Random Variable Generation

2.1 Random Variable Generation

Assuming we may draw U ∼ Unif([0, 1]), what other distributions can we generate from? It turns out we can draw X ∼ Bernoulli(p) by letting X ← 1(U < p). If X_1, . . . , X_n i.i.d.∼ Bern(p) then ∑_{i=1}^{n} X_i = Y ∼ Bin(n, p). However, a more general approach called the inverse transform allows us to draw from any 1D density whose closed form cdf we may write down.
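A small sketch of these two constructions, with R's runif standing in for the assumed uniform generator (p and n are arbitrary choices):

p <- 0.3
n <- 20
x <- as.integer(runif(1) < p)   # X <- 1(U < p), a single Bernoulli(p) draw
y <- sum(runif(n) < p)          # sum of n i.i.d. Bernoulli(p) draws ~ Bin(n, p)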

2.1.1 The Inverse Transform

Definition 2.1 Suppose X is a random variable with probability distribution function (pdf) fX. We denote by FX the cumulative distribution function (cdf) where

FX(a) = Pr(X ≤ a).

Cumulative distribution functions are nonnegative, increasing, right-continuous functions with lim_{a→−∞} FX(a) = 0 and lim_{a→∞} FX(a) = 1.

Definition 2.2 For an increasing function F on R, the pseudoinverse of F, denoted F_p^{−1}, is the function such that

F_p^{−1}(u) = inf{a : F(a) ≥ u}.

If F is strictly increasing then F_p^{−1} ≡ F^{−1}.

With these definitions in hand, we have the tools to generate random variates from any distribution with a computable generalized inverse.

Lemma 2.3 Let F(a) be a cdf and U ∼ Unif([0, 1]). If X = F_p^{−1}(U) then X has cdf F(a).


Proof Let F, U, and X be as given in the lemma. Then

Pr(X ≤ a) = Pr(F_p^{−1}(U) ≤ a)
          = Pr(U ≤ F(a))
          = F(a),

where the second equality follows from the fact that F is increasing.

We now see the importance of being able to draw uniformly from the unit interval. In fact, if U is not truly uniform on [0, 1], then the inverse transform method fails to sample from the correct distribution. However, to use the inverse transform we must explicitly write down the cumulative distribution function and efficiently compute its generalized inverse. As we will see in Example 2.6, this is not always possible.

Example 2.4 We wish to draw X ∼ Exp(λ) using the inverse transform method. The cdf of an exponential random variable is given by

FX(a) = 1 − exp(−λa).

Solving for the inverse yields

U = 1 − exp(−λa)
log(1 − U) = −λa
−λ^{−1} log(1 − U) = a

So that if U ∼ Unif([0, 1]) then −λ^{−1} log(1 − U) = X ∼ Exp(λ).
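A quick check of this transform in R (a sketch; the rate λ = 2 and the sample size are arbitrary choices for the illustration):

lambda <- 2
u <- runif(10000)
x <- -log(1 - u) / lambda                 # X = -lambda^{-1} log(1 - U)

hist(x, breaks = 50, prob = TRUE,
     main = "Exp(2) using ITF", xlab = "X")
lines(seq(0, 5, 0.01), dexp(seq(0, 5, 0.01), rate = lambda), col = "blue")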

Example 2.5 Recall that the pdf of a Cauchy random variable, X, is

fX(s) = 1 / (π(1 + s²)).

Given U ∼ Unif([0, 1]), we find a transformation Y = r(U) such that Y has a Cauchy distribution. We begin by finding the cdf of X:

FX(a) = ∫_{−∞}^{a} 1 / (π(1 + s²)) ds
      = (1/π) [tan^{−1}(s)]_{−∞}^{a}
      = (1/π) lim_{n→−∞} [tan^{−1}(a) − tan^{−1}(n)]
      = (1/π) [tan^{−1}(a) + π/2]
      = tan^{−1}(a)/π + 1/2


To compute the desired transformation, we have:

U = tan^{−1}(a)/π + 1/2
π(U − 1/2) = tan^{−1}(a)
tan(π(U − 1/2)) = a

So that Y = r(U) = tan(π(U − 1/2)) ∼ Cauchy.

cauchies <- tan(pi * (runif(10000) - 0.5))

hist(cauchies[abs(cauchies) <= 500], prob = TRUE,
     breaks = 2000, xlim = c(-20, 20), ylim = c(0, 0.35),
     main = "Cauchy r.v. using ITF", xlab = "X")
lines(seq(-20, 20, 0.2), dcauchy(seq(-20, 20, 0.2)),
      col = "blue")

[Figure: histogram titled “Cauchy r.v. using ITF” produced by the code above, with the Cauchy density overlaid (x-axis: X, y-axis: Density).]


Example 2.6 Even in 1D, there exist densities whose cdf we cannot write down. For example, the cumulative distribution function of the standard normal distribution cannot be expressed in a closed form:

Φ(x) = (1/√(2π)) ∫_{−∞}^{x} exp(−z²/2) dz

Clearly, we must develop other methods that do not rely as strongly on nice analytic properties of our target distribution.

2.1.2 Acceptance-Rejection Sampling

Much of this section stems from the idea that if fX is the target distribution, we may write

fX(x) = ∫_0^{fX(x)} 1 du.

Here, fX appears as the marginal density of the joint distribution

(X,U) ∼ Unif({(x, u) : 0 < u < fX(x)}).

Introducing the auxiliary variable U allows us to sample from our target distribution by drawing uniformly from the area under the curve of fX and ignoring the auxiliary coordinate.

Theorem 2.7 (The Fundamental Theorem of Simulation) Simulating X ∼ fX is equivalent to simulating

(X,U) ∼ Unif({(x, u) : 0 < u < fX(x)}).

Actually sampling from the joint distribution of (X,U) introduces difficulty, though, because sampling X ∼ fX and U|X ∼ Unif([0, fX(X)]) defeats the purpose of introducing the auxiliary variable. If we could sample X ∼ fX in the first place, we would already be done.

The solution is to generate pairs (X,U) from a superset and accept them if they satisfy the constraint. For instance, suppose the 1D density fX is bounded by m and the support of fX, denoted supp fX, is [c, d]. Sampling pairs

(X,U) ∼ Unif({(x, u) : 0 ≤ u ≤ fX(x)})

is equivalent to simulating X ∼ Unif([c, d]), U|X ∼ Unif([0, m]), and accepting the pair if 0 < U < fX(X). It is easily shown that this does indeed sample from the desired distribution:

Pr(X ≤ a) = Pr(X ≤ a | U ≤ fX(X))

 = Pr(X ≤ a, U ≤ fX(X)) / Pr(U ≤ fX(X))

 = [∫_c^a ∫_0^{fX(z)} (1/(d − c)) · (1/m) du dz] / [∫_c^d ∫_0^{fX(z)} (1/(d − c)) · (1/m) du dz]

 = [∫_c^a ∫_0^{fX(z)} du dz] / [∫_c^d ∫_0^{fX(z)} du dz]

 = ∫_c^a fX(z) dz / ∫_c^d fX(z) dz

 = ∫_c^a fX(z) dz

 = FX(a). ∎
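As a quick R sketch of this bounded-rectangle case (not an example from the text; the Beta(2, 2) target, whose support is [0, 1] and whose maximum is m = 1.5, is chosen only for illustration):

Draw.Rect <- function() {
  repeat {
    x <- runif(1, 0, 1)            # X ~ Unif([c, d]) = Unif([0, 1])
    u <- runif(1, 0, 1.5)          # U | X ~ Unif([0, m]) with m = 1.5
    if (u < dbeta(x, 2, 2)) {      # accept if the pair lies under the curve
      return(x)
    }
  }
}

samples <- replicate(10000, Draw.Rect())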

This computation was made easier by the fact that both fX and supp fX were bounded. In situations where this is not the case, we can no longer use a rectangle as the superset from which we draw candidates. Instead, we use some other probability distribution g(x) that may be readily sampled from. Such a distribution is called a proposal distribution and must satisfy

M · g(x) ≥ fX(x), M ≥ 1,∀x ∈ supp fX .

We formalize this notion in the following theorem.

Theorem 2.8 (The Acceptance-Rejection Theorem) Let g be a probability distribution that satisfies

M · g(x) ≥ fX(x)

for some M ≥ 1 and for all x ∈ supp fX. Then, to simulate X ∼ fX, it is sufficient to simulate

Y ∼ g and U |Y ∼ Unif([0,M · g(Y )])

and let X ← Y if U ≤ fX(Y ).

Proof Sampling Y ∼ g, U|Y ∼ Unif([0, M · g(Y)]), and letting X ← Y if U ≤ fX(Y) generates X ∼ fX:

Pr(X ∈ A) = Pr(Y ∈ A | U ≤ fX(Y))

 = [∫_A ∫_0^{fX(z)} g(z) · (1/(M g(z))) du dz] / [∫_{supp fX} ∫_0^{fX(z)} g(z) · (1/(M g(z))) du dz]

 = ∫_A fX(z) dz / ∫_{supp fX} fX(z) dz

 = ∫_A fX(z) dz ∎

The proposals used in acceptance-rejection sampling come from g(Y) and are accepted with probability fX(Y) / (M · g(Y)), so the probability we accept any given proposal is then

Pr(accept) = ∫ g(y) · fX(y) / (M · g(y)) dy
           = (1/M) ∫ fX(y) dy
           = 1/M.

The larger M is, the more points we must reject before accepting a proposal.

For efficiency’s sake, we want M = sup fX(x)/g(x) to ensure the highest possible acceptance rate. This leads directly to the Acceptance-Rejection algorithm, which is a realization of Theorem 2.8:

Algorithm 1 AR Sampling

1: procedure Acceptance-Rejection
2:    Draw Y ∼ g, U ∼ Unif([0, M · g(Y)])
3:    Let X ← Y if U ≤ fX(Y), else return to 2.
4: end procedure

Example 2.9 Given Y ∼ Cauchy we use acceptance-rejection to generate X ∼ Exp(1/2). To use AR with a proposal distribution g(x), we must ensure M · g(x) ≥ fX(x) ⇒ M ≥ fX(x)/g(x) for all x ∈ supp fX. Ideally, M is as close to 1 as possible:

M ≥ sup_{x≥0} fX(x)/g(x) ≈ 3.629

We confine our maximization to the positive reals because the target distribution only has support on the positive reals. The maximum is attained at x = 2 + √3. Using M = 3.629 yields


Draw.AR <- function() {
  repeat {
    proposal <- rcauchy(1)
    u <- runif(1, 0, 3.629 * dcauchy(proposal))
    if (u <= dexp(proposal, rate = 1 / 2)) {
      return(proposal)
    }
  }
}

x <- seq(0, 15, 0.1)
hist(replicate(10000, Draw.AR()), breaks = 100, prob = TRUE,
     xlab = "X", main = "Exp(1/2) using AR")
lines(x, dexp(x, rate = 1 / 2), col = "blue")

[Figure: histogram titled “Exp(1/2) using AR” produced by the code above, with the Exp(1/2) density overlaid (x-axis: X, y-axis: Density).]


Chapter 3

Markov Chains

3.1 Markov Chains

Definition 3.1 A sequence of random variables X_1, . . . , X_n, denoted (X_n), is a Markov chain if

Pr(Xn+1|Xn, Xn−1, . . . , X1) = Pr(Xn+1|Xn) (3.1)

Example 3.2 A random walk is a Markov chain that satisfies

Xn+1 = Xn + εn

where ε_n is generated independently of the current state. If the distribution of ε_n is symmetric about 0, we call this a symmetric random walk. In section 4.2 we will see how random walks are used in MCMC algorithms.

Every Markov chain has an initial distribution, π_0, and a transition kernel K. The state space, denoted X, is the set of possible values X_i may take on at each step in the Markov chain.

Definition 3.3 A transition kernel is a function K defined on X × B(X) such that

• ∀x ∈ X ,K(x, ·) is a probability measure;

• ∀A ∈ B(X ),K(·, A) is measurable.

where B denotes the σ-algebra defined on the set X .

When the state space is discrete, the transition kernel is a matrix K where

Kij = Pr(Xn+1 = Xj |Xn = Xi).

In the continuous case, the transition kernel denotes a conditional density where Pr(x′ ∈ A | x) = ∫_A K(x, x′) dx′. A Markov chain is said to be time homogeneous if K(X_{n+1} | X_n) is independent of n.


We restrict our study almost entirely to time homogeneous Markov chains. An example of a time heterogeneous Markov chain is the simulated annealing algorithm, whose transition kernel changes with the “temperature” of the system. Time heterogeneity is a key property of simulated annealing because it allows us to explore the entire state space when the temperature is high, but restricts our moves when the temperature is low. The algorithm is inspired by annealing in metallurgy where the process is used to temper or harden metals and glass by heating them to a high temperature and gradually cooling them, allowing the material to reach a low-energy crystalline state [14].

Given a transition matrix for a discrete Markov chain and an initial distribution π_0, the distribution of X_1 is obtained by matrix multiplication

π_1 = π_0 K.

Similarly, X_n ∼ π_n = π_0 K^n. Notice that once the initial state is specified, the behavior of the chain is entirely dependent on K.
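As a small numerical sketch in R (the two-state transition matrix below is a made-up example, not one from the text), iterating π_n = π_0 K^n:

K <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)   # each row sums to 1
pi.n <- c(1, 0)                                    # pi_0: start in state 1

for (i in 1:50) {
  pi.n <- pi.n %*% K                               # one transition per iteration
}
pi.n                                               # approaches the stationary (2/3, 1/3)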

Definition 3.4 Consider A ∈ B(X). The first n for which the chain enters the set A is denoted by

τ_A = inf{n ≥ 1 : X_n ∈ A}

and is called the stopping time at A. By convention, τ_A = ∞ if X_n ∉ A for every n. Associated with the set A, we also define

η_A = ∑_{n=1}^{∞} 1(X_n ∈ A),

the number of times the chain enters A.

Example 3.5 In a zero-sum coin tossing game, the payoff to player b is +1 if a heads appears and −1 if a tails appears. Similarly, the payoff to player c is +1 if a tails appears and −1 if a heads appears. Let X_n be the sum of the gains of player b after n rounds of the game. The infinite dimensional transition matrix, K, has zeros on the diagonal since player b must either lose or gain a point on each round. Furthermore, K has upper and lower sub-diagonals equal to 1/2 because we are flipping a fair coin. Assuming that player b begins with B dollars and player c begins with C dollars,

τ_1 = inf{n : X_n ≤ −B} and τ_2 = inf{n : X_n ≥ C}

represent, respectively, the ruin of players b and c. The probability of bankruptcy for player b is then Pr(τ_1 < τ_2).
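A quick simulation sketch of this example (an illustration added here; the starting fortunes B = 5 and C = 10 are arbitrary):

Ruin.Prob <- function(B, C, nsim = 10000) {
  ruined <- replicate(nsim, {
    x <- 0
    while (x > -B && x < C) {
      x <- x + sample(c(-1, 1), 1)   # fair coin: player b wins or loses 1
    }
    x <= -B                          # TRUE if player b went bankrupt first
  })
  mean(ruined)
}

Ruin.Prob(5, 10)   # close to the classical gambler's-ruin answer C / (B + C) = 2/3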

3.2 Properties of Markov Chains

3.2.1 Irreducibility

Irreducibility is an important property of Markov chains which guarantees that regardless of the current state of the chain, it is possible to reach any other state in a finite number of transitions. In the discrete case, irreducibility also tells us the transition matrix cannot be broken down into smaller matrices (i.e. the transition graph is connected).

Definition 3.6 Given a measure φ, a Markov chain with transition kernel K(·) is φ-irreducible if for every A ∈ B(X) such that φ(A) > 0, Pr(τ_A < ∞) > 0 regardless of the initial state.

Irreducibility, together with aperiodicity, a property introduced in the following subsection, allows us to make strong analytic arguments about the convergence of Markov chains.

3.2.2 Aperiodicity

We define the period of a state x ∈ X to be

d(x) = gcd{m ≥ 1 : K^m(x, x) > 0}

If d(x) ≥ 2, we say x is periodic with period d(x). A state is aperiodic if it has period 1. An irreducible chain is aperiodic if each state has period 1.

Example 3.7 A Markov chain with period n is given by the block matrix

P = [ 0    P_1  0    · · ·  0
      0    0    P_2  · · ·  0
      ...                  ...
      0    0    0    · · ·  P_{n−1}
      P_n  0    0    · · ·  0 ]

where each P_i is a stochastic matrix and P is irreducible.

3.2.3 Stationarity

Definition 3.8 A Markov chain (X_n) has stationary distribution π if X_n ∼ π ⇒ X_{n+1} ∼ π.

For MCMC methods to be of any use to us, we must be able to reason about the asymptotic behavior of Markov chains. The distribution of X_n as n → ∞ is called the limiting distribution. Ideally, we would like some guarantee that, regardless of initial conditions, the limiting distribution of a Markov chain is also its stationary distribution.

The general approach with MCMC algorithms is to initialize and run a Markov chain for a sufficient number of steps to draw samples approximately from the desired stationary distribution. It is common to ignore some number of samples at the beginning, and then consider only every nth sample (for independence) when computing an expectation.
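A small sketch of that practice in R (the burn-in and thinning values are arbitrary, and `chain` merely stands in for the output of any MCMC sampler):

chain <- rnorm(10000)     # stand-in for correlated draws X_1, ..., X_N
burn <- 500               # discard the first 500 draws
thin <- 10                # then keep every 10th draw
kept <- chain[seq(burn + 1, length(chain), by = thin)]
mean(kept)                # estimate of an expectation under the stationary distribution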


3.2.4 Ergodicity

How do we know when the limiting distribution of a Markov chain is the stationary distribution? The Ergodic Theorem tells us just this.

Theorem 3.9 (The Ergodic Theorem) Let (X_n) be a Markov chain with stationary distribution π. If the chain is φ-irreducible and aperiodic, then for all measurable sets A, lim_{n→∞} Pr(X_n ∈ A) = π(A).

That is, the limiting distribution of irreducible, aperiodic Markov chains is always the stationary distribution. An even stronger guarantee of convergence exists, but to get there we must introduce more terminology.

Definition 3.10 The Markov chain (X_n) has an atom α ∈ B(X) if there exists an associated non-zero measure µ such that

K(x,A) = µ(A), ∀x ∈ α,∀A ∈ B(X ).

The definition of a small set follows naturally and will be used in our definition of one of the strongest forms of convergence, uniform ergodicity.

Definition 3.11 A set C is small if there exists an m > 0 and a nonzero measure µ_m such that

K^m(x, A) ≥ µ_m(A)

for all x ∈ C and for all A ∈ B(X).

Definition 3.12 The Markov chain (X_n) is uniformly ergodic if

lim_{n→∞} sup_{x∈X} ‖P^n(x, ·) − π‖_TV = 0

where ‖ · ‖_TV denotes the total variation norm.

In showing uniform ergodicity, we will make use of the following theorem.

Theorem 3.13 (Doeblin’s Condition) The following are equivalent:

(a) (X_n) is uniformly ergodic;

(b) there exist R < ∞ and r > 1 such that

‖P^n(x, ·) − π‖_TV < R r^{−n}, ∀x ∈ X;

(c) (X_n) is aperiodic and X is a small set;

(d) (X_n) is aperiodic and there exist a small set C and a real κ > 1 such that

sup_{x∈X} E_x[κ^{τ_C}] < ∞


If the whole space X is small, there exists a probability distribution, φ, on X, constants ε, δ > 0, and n such that, if φ(A) > ε then

inf_{x∈X} K^n(x, A) > δ, ∀A ∈ B(X).

We see here the relation between analytic limits and uniform ergodicity, giving us a feel for just how strong the guarantee of convergence is. Now that we have covered enough of the basic vocabulary of Markov chains we may begin our survey of MCMC sampling algorithms.


Chapter 4

Markov Chain Monte Carlo

4.1 Monte Carlo Methods

Although the sampling techniques discussed in Chapter 2 work well, they are not flawless. The inverse transform method fails beyond 1-dimension and even then it requires us to write down the closed form cdf of the target distribution. Acceptance-rejection can be used in any dimension we like, but as dimensionality increases it becomes more difficult to find good proposal distributions. We turn now to Markov Chain Monte Carlo (MCMC) simulations because they ameliorate many of these issues.

Monte Carlo simulations allow us to approximate the probability of certain outcomes by running a large number of trials to obtain an empirical distribution of possible events. Markov Chain Monte Carlo simulations use Markov chains whose stationary distribution is the target distribution we wish to sample from. The oldest MCMC algorithm, and the one we choose to cover first, is the Metropolis-Hastings algorithm.

4.2 Metropolis Hastings

At this point, we may readily sample from most distributions covered in an introductory probability course. However, when faced with the task of drawing from a non-standard distribution, we will need more powerful tools at our disposal. For instance, in Bayesian statistics we would often like to sample from the posterior distribution of a parameter to compute its expected value.

At a high level, Metropolis-Hastings samples from a target distribution fX by drawing from a proposal distribution g (“easy” to sample) and accepting if it looks like it came from fX (“hard” to sample). At step T in the algorithm, in which the current state is X_T, we draw a candidate/proposal X∗ ∼ g(X|X_T) and let X_{T+1} = X∗ with probability

A(X∗, X_T) = min(1, [fX(X∗) g(X_T|X∗)] / [fX(X_T) g(X∗|X_T)]).


Otherwise, let X_{T+1} = X_T.

We notice two things about the acceptance probability. First, the Metropolis-Hastings algorithm only requires we know fX up to a normalizing constant. Second, if g is symmetric the acceptance probability becomes

A(X∗, X_T) = min(1, fX(X∗) / fX(X_T)),

which implies we always accept a candidate that is more probable, and accept candidates randomly otherwise. The acceptance probability combines concepts from steepest ascent and random walk algorithms which help prevent getting stuck in local maxima. Following Algorithm 2 ensures the stationary distribution of the Markov chain is fX.

Algorithm 2 MH Sampling

1: procedure Metropolis-Hastings   Input: Current state: X_T ∼ fX
2:    Draw X∗ ∼ g(X|X_T), U ∼ Unif([0, 1])
3:    Compute acceptance probability P_a = A(X∗, X_T)
4:    If U < P_a set X_{T+1} ← X∗, otherwise set X_{T+1} ← X_T

5: end procedure
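A minimal R sketch of Algorithm 2 in the symmetric random-walk case (a Gaussian proposal centered on the current state); the mixture target at the end mirrors Example 4.1 and is used only for illustration:

Metropolis.RW <- function(f, x0, nsample, sd = 1) {
  x <- numeric(nsample)
  x[1] <- x0
  for (i in 2:nsample) {
    proposal <- rnorm(1, mean = x[i - 1], sd = sd)
    # symmetric proposal, so the ratio reduces to f(X*) / f(X_T)
    if (runif(1) < f(proposal) / f(x[i - 1])) {
      x[i] <- proposal
    } else {
      x[i] <- x[i - 1]
    }
  }
  return(x)
}

mixture <- function(x) 0.5 * dnorm(x, -2, 1) + 0.5 * dnorm(x, 2, 1)
draws <- Metropolis.RW(mixture, x0 = 0, nsample = 2000)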

Make no mistake, Metropolis-Hastings is no free lunch. The proposal distribution must be chosen carefully and presents difficulties in higher dimensions where our intuition and imagination fail us. This is especially the case when using a non-symmetric proposal distribution. For this reason, we restrict our study of Metropolis-Hastings solely to the symmetric, random walk case. A common (symmetric) proposal distribution is a Gaussian centered on the current state. It is also typical for the proposal distribution’s variance to be chosen to be on the same order of magnitude as the smallest variance of the target distribution.

Figure 4.1: Contours of a bivariate normal target distribution (red) and symmetric proposal distribution with standard deviation ρ (blue).

Consider Figure 4.1, where the 2D target distribution exhibits a strong correlation between components. To achieve a high acceptance ratio, the standard deviation of the proposal distribution must be kept on the same order of magnitude as σmin. Otherwise, our proposals will be from all over the space and we would rarely accept any move. The random walk behavior also means that to explore the length of the distribution, a distance of roughly σmax, it takes on the order of (σmax/σmin)² steps, because a random walk only travels a distance proportional to √n in n steps. If our target distribution is pinched in one dimension and elongated in another, the Metropolis-Hastings algorithm offers poor convergence properties.

Example 4.1 Suppose we wish to sample from the 2-dimensional mixture of normals whose contours are shown in Figure 4.2 (bottom), alongside its 1-dimensional analogue (top). Figure 4.3 shows that in 2 dimensions, the first coordinate of the points sampled using a standard Metropolis algorithm appears to mix well early, but clearly displays difficulty jumping between modes. Figure 4.4a and Figure 4.4b suggest that in 5, 10, and higher dimensions the problem is only exacerbated.

Figure 4.2: The one-dimensional analogue (top) of the 2D target distribution (bottom) (µ_1 = −2, µ_2 = 2, σ_1 = σ_2 = 1).


Figure 4.3: Mixing of the first coordinate, X_1, from a 2D Metropolis sample.

Figure 4.4: First coordinate of points sampled from a Metropolis random walk in 5D (a) and 10D (b).

4.2.1 Bias-Variance Trade-off

Implicit in our handling of MCMC lies the desire for unbiased draws from some stationary distribution, π. In many practical applications, it is too computationally intensive to draw enough samples to estimate a parameter, θ̂, or the expectation of a function, E[f(X)], with sufficiently low variance. If we allow for some bias in our draws from the stationary distribution, the task of simulation is made easier.

The mean square error in our estimate is a measure of both bias and variance, MSE = B² + V. When drawing from a posterior density over billions of data points, unbiased Markov chains incur significant computational costs. As a result, the variance of these approximations is high because we can only collect small samples in a fixed amount of time.

Alternatively, we can simulate from a slightly biased stationary distribution π_ε, where ε is a parameter that controls the bias we allow in our simulation [7]. As ε increases it becomes easier to simulate draws from π_ε. Given infinite time we should let ε = 0 and run the chain to draw infinitely many samples. However, when given limited or finite wall-clock time it may be advantageous to tolerate some bias in return for lowering variance by either collecting larger samples or mixing better.

4.2.2 Approximate Metropolis-Hastings

As we alluded to earlier, in Bayesian inference it is often the case that we wish to find the expectation of a parameter θ with respect to a posterior distribution, f(θ). Given a dataset of N observations X_N = {x_1, . . . , x_N}, which we model with a distribution f(x|θ) and prior ρ(θ), we want to sample from the posterior density

f(θ) ∝ ρ(θ) ∏_{i=1}^{N} f(x_i|θ)

to estimate θ̂. If our data is minimally sufficient and if X_N contains billions of points, then evaluating f(·) at least once in the Metropolis-Hastings acceptance ratio is a costly O(N) operation for a single bit of information.

By reformulating step 4 of Algorithm 2 as a statistical test of significance, we can reduce some of the overhead incurred by unbiased MCMC. In standard Metropolis-Hastings we accept the proposal θ∗ if U < P_a, otherwise we stay where we are. This condition is equivalent to checking if

U < [f(θ∗) g(θ_T|θ∗)] / [f(θ_T) g(θ∗|θ_T)]

[U g(θ∗|θ_T) ρ(θ_T)] / [g(θ_T|θ∗) ρ(θ∗)] < ∏_{i=1}^{N} f(x_i|θ∗) / ∏_{i=1}^{N} f(x_i|θ_T)

(1/N) log[U g(θ∗|θ_T) ρ(θ_T) / (g(θ_T|θ∗) ρ(θ∗))] < (1/N) ∑_{i=1}^{N} l_i where l_i = log f(x_i|θ∗) − log f(x_i|θ_T)

µ_0 < µ

where in the last step we substitute µ_0 on the left-hand side and µ on the right-hand side for notational convenience.


The costly computation that may have previously required the evaluation of a posterior density over billions of points is equivalent to testing whether the mean of a finite population {l_1, . . . , l_N} is greater than some constant µ_0 that does not depend on the data. This makes it easy to frame the check as a sequential hypothesis test: randomly draw a mini-batch of size n < N without replacement from X_N and compute its mean, l̄. If the difference between l̄ and µ_0 is significantly larger than the standard deviation of l̄ and if µ_0 < l̄ then θ∗ is accepted, otherwise we stay put. If significance is not achieved, we add more observations to the mini-batch and re-run until significance is achieved. Significance will eventually be achieved and the sequential hypothesis test will terminate because when n = N the standard deviation of l̄ is 0 because l̄ is the population mean, µ.

Formally, we can test the hypotheses

H0 : µ0 ≤ µ vs H1 : µ0 > µ

where the sample mean, l̄, and the sample standard deviation, s_l, are given as

l̄ = (1/n) ∑_{i=1}^{n} l_i,

s_l² = n( mean(l_i²) − (l̄)² ) / (n − 1),

the standard deviation of l̄ is estimated to be

s = (s_l/√n) √(1 − (n − 1)/(N − 1)),

and the test statistic is

t = (l̄ − µ_0) / s.

For large enough n, we claim t follows a standard Student-t distribution with n − 1 degrees of freedom when µ = µ_0. To determine if the difference between µ_0 and µ is significant, we compute the p-value as p = 1 − φ_{n−1}(|t|) where φ_{n−1}(·) is the cdf of the Student-t distribution with n − 1 degrees of freedom. If p is less than the α level of our test, then we can reject H_0 and conclude µ_0 ≠ µ. The pseudocode below as well as a more detailed proof of the distribution of t may be found in [7].
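As a small R sketch of these quantities for a single mini-batch (the l_i values, µ_0 = 0, and the full-data size N below are all hypothetical stand-ins):

l <- rnorm(100, mean = 0.02, sd = 0.1)              # stand-in l_i values
n <- length(l); N <- 1e6; mu0 <- 0
l.bar <- mean(l)
s.l <- sd(l)                                         # sd(l)^2 equals n(mean(l^2) - l.bar^2)/(n - 1)
s <- s.l / sqrt(n) * sqrt(1 - (n - 1) / (N - 1))     # estimated sd of l.bar
t.stat <- (l.bar - mu0) / s
p <- 1 - pt(abs(t.stat), df = n - 1)                 # compare with the test's alpha level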

We are often able to make confident decisions considering only n < N data points in the posterior. Though we introduce bias in the form of the α level of the test, we make up for this by drawing more samples from the stationary distribution. For error bounds on the estimates produced, a description of optimal sequential test design, and illustrative examples, see [7]. In the following section we cover the slice sampling algorithm, which may be conceptualized as a higher dimensional analogue to the inverse transform. Interestingly, an approximate slice sampler also exists [4].


Algorithm 3 Approximate MH Test

procedure Approx. MH   Input: θ_T, θ∗, ε, µ_0, X_N, m   Output: accept
   Initialize estimated means l̄ ← 0 and l² ← 0
   n ← 0, done ← false
   Draw U ∼ Unif([0, 1])
   while not done do
      Draw mini-batch X of size min(m, N − n) w/o replacement from X_N and set X_N ← X_N \ X
      Update l̄ and l² using X, and n ← n + |X|
      Compute δ ← 1 − φ_{n−1}(|(l̄ − µ_0)/s|)
      if δ < ε then
         accept ← true if µ_0 < l̄ and false otherwise
         done ← true
      end if
   end while
end procedure

4.3 Slice Sampling

Unlike Metropolis-Hastings, the slice sampler does not require the selection of a proposal distribution nor does it require any convexity properties, as some adaptive acceptance-rejection methods do. In practice, however, slice sampling is not entirely free of hyperparameter selection.

In the univariate case, the slice sampler transitions from a point (X, U) under the curve of fX to another point (X′, U′) under the curve of fX in such a way that the stationary distribution of (X, U) is the uniform distribution over the area under the curve of fX [8].

The pseudocode in Algorithm 4 outlines the 2D case.

Algorithm 4 2D Slice Sampler

1: procedure Slice sample   Input: X_T ∈ supp fX
2:    Draw U ∼ Unif([0, 1])
3:    Draw X_{T+1} ∼ Unif({x : fX(x) ≥ U · fX(X_T)})
4: end procedure

Many important details are left out but a full implementation may be found in Figure 4.5. The problem of drawing from the exact level sets of the distribution in step 3 can be intractable when fX is complex enough. We have adapted Neal’s slice sampling algorithm from [11] and naively expand out from X_T using an arbitrarily chosen step size until a suitable interval is found. If we were able to sample perfectly from the slice under the curve, there would be no rejected samples. The idea of learning or predicting these level sets is intriguing, and to my knowledge, has not been attempted.


Slice.Sample <- function(x0, f, nsample, step = 1) {
  x <- x0
  for (i in 2:nsample) {
    u <- runif(1, 0, f(x[i - 1]))
    # step out from the current point until the interval covers the slice
    lower <- x[i - 1] - 1
    upper <- x[i - 1] + 1
    while (u < f(lower)) {
      lower <- lower - step
    }
    while (u < f(upper)) {
      upper <- upper + step
    }
    # sample uniformly from the interval, shrinking it toward the current
    # point whenever a proposal falls outside the slice
    repeat {
      x.proposal <- runif(1, lower, upper)
      if (u < f(x.proposal)) {
        x[i] <- x.proposal
        break
      } else if (x.proposal < x[i - 1]) {
        lower <- x.proposal
      } else {
        upper <- x.proposal
      }
    }
  }
  return(x)
}

Figure 4.5: Naive implementation of the slice sampler

Example 4.2 We use the slice sampler to draw from a tri-modal mixture of normals defined in the target function below. The issue of finding correct level sets becomes apparent, as we might not expand our interval out far enough to jump modes.


target <- function(x) {
  return(0.25 * dnorm(x, -2, 0.3) +
         0.50 * dnorm(x, 0, 0.3) +
         0.25 * dnorm(x, 2, 0.3))
}

hist(Slice.Sample(1, target, 10000, 1),
     breaks = 100, prob = TRUE, ylim = c(0, 0.7),
     main = "Trimodal Mixture of Normal", xlab = "X")
x <- seq(-10, 10, length = 1000)
lines(x, target(x), col = "blue")


Figure 4.6: The result of slice sampling a trimodal normal distribution.

4.3.1 Auxiliary Variable MCMC

The slice sampler introduces an auxiliary variable, an approach we revisit with the Hamiltonian Monte Carlo, that is marginalized out to produce the desired distribution. Using the Fundamental Theorem of Simulation, we are able to draw samples from fX by drawing samples uniformly under the curve of fX.

Let Q be the area under the curve of fX, so that (X, U) ∼ Unif({(x, u) : 0 < u < fX(x)}) has constant joint density 1/Q:

f_{(X,U)}(x, u) = (1/Q) 1(0 ≤ u ≤ fX(x)).


This implies the marginal distribution of X is

∫ f_{(X,U)}(x, u) du = (1/Q) ∫_0^{fX(x)} du = fX(x)/Q. ∎

As Algorithm 4 suggests, we alternate between sampling X and U. To see that the general slice sampler preserves the uniform distribution over the area under the curve of fX, note that if X_T ∼ fX and U_{T+1} ∼ Unif([0, fX(X_T)]) then

(X_T, U_{T+1}) ∼ fX(X_T) · 1(0 ≤ U_{T+1} ≤ fX(X_T)) / fX(X_T) ∝ 1(0 ≤ U_{T+1} ≤ fX(X_T)).

If X_{T+1} ∼ Unif(A_{T+1}) = Unif({X_{T+1} : 0 ≤ U_{T+1} ≤ fX(X_{T+1})}) then

(X_T, U_{T+1}, X_{T+1}) ∼ fX(X_T) · [1(0 ≤ U_{T+1} ≤ fX(X_T)) / fX(X_T)] · [1(0 ≤ U_{T+1} ≤ fX(X_{T+1})) / µ(A_{T+1})],

where µ(A_{T+1}) denotes the Lebesgue measure of the set. Marginalizing out X_T gives

f(U_{T+1}, X_{T+1}) ∝ ∫ 1(0 ≤ U_{T+1} ≤ fX(x)) · [1(0 ≤ U_{T+1} ≤ fX(X_{T+1})) / µ(A_{T+1})] dx
   = [1(0 ≤ U_{T+1} ≤ fX(X_{T+1})) / µ(A_{T+1})] ∫ 1(0 ≤ U_{T+1} ≤ fX(x)) dx
   = 1(0 ≤ U_{T+1} ≤ fX(X_{T+1})),

so that if we begin with X_T ∼ fX then the updates that generate X_{T+1} and U_{T+1} preserve the uniform distribution under the curve of fX.

4.3.2 Uniform Ergodicity of the Slice Sampler

We now discuss the convergence properties of the slice sampler in the simple 2D case. In the ensuing calculations we denote by µ(ω) the Lebesgue measure of the set

A_ω = {x : 0 ≤ ω ≤ fX(x)}.

To gain insight into how the slice sampler behaves asymptotically, we look to the cdf of the transition kernel. More specifically, we look at the probability that fX(X_{T+1}) ≤ η given that we are currently at X_T and fX(X_T) = ν.

Pr(fX(X_{T+1}) ≤ η | fX(X_T) = ν) = ∫∫ [1(0 ≤ ω ≤ ν)/ν] · [1(ω ≤ fX(x) ≤ η)/µ(ω)] dω dx,


where we first draw ω uniformly on [0, ν] and then draw X_{T+1} uniformly on A_ω. Simplifying further gives

Pr(fX(X_{T+1}) ≤ η | fX(X_T) = ν)
   = (1/ν) ∫ [1(0 ≤ ω ≤ ν)/µ(ω)] ∫ 1(ω ≤ fX(x) ≤ η) dx dω
   = (1/ν) ∫ 1(0 ≤ ω ≤ ν) · [µ(ω) − µ(η)]/µ(ω) dω
   = (1/ν) ∫_0^{min(η,ν)} [µ(ω) − µ(η)]/µ(ω) dω
   = (1/ν) ∫_0^{ν} max(1 − µ(η)/µ(ω), 0) dω,

which tells us the convergence properties of the slice sampler are totally dependent on the measure, µ. Now, for the main result, which we owe to Tierney and Mira [10] who, under boundedness conditions, established the following lemma.

Lemma 4.3 If fX and supp fX are bounded, the 2D slice sampler is uniformly ergodic.

Proof Without loss of generality, assume that fX is bounded by 1 and that supp fX = [0, 1]. To prove uniform ergodicity, we will show that supp fX is a small set so that we may invoke Doeblin’s condition. Let

ξ(ν) = Pr(fX(X_{T+1}) ≤ η | fX(X_T) = ν).

Notice that ω > η implies µ(η) ≥ µ(ω), so the integrand above vanishes for ω > η. Further, when ν ≥ η,

ξ(ν) = (1/ν) ∫_0^{η} (1 − µ(η)/µ(ω)) dω

is decreasing in ν since ν only appears in the denominator outside of the integral. When ν ≤ η we recognize

ξ(ν) = (1/ν) ∫_0^{ν} (1 − µ(η)/µ(ω)) dω

as the expected value of the function 1 − µ(η)/µ(ω) where ω ∼ Unif([0, ν]). The larger ω, the smaller µ(ω) is; we conclude that 1 − µ(η)/µ(ω) is decreasing in ω and thus the expectation, ξ(ν), is decreasing in ν.

Therefore ξ(ν) is decreasing in ν for all η. Intuitively, it would not make sense if ξ(ν) were increasing in ν because it would imply our Markov chain is not spending enough time in the modes. If ξ(ν) were increasing in ν then the larger ν the more likely we are to end up below some threshold (away from the mode). For the proof to be complete, we must establish bounds on the cdf of the transition kernel. The minimum occurs when ν = 1:

lim_{ν→1} ξ(ν) = ∫_0^{η} (1 − µ(η)/µ(ω)) dω,

which is bounded above by ∫_0^{η} 1 dω = η and below by 0. The maximum is given by L’Hopital’s rule:

lim_{ν→0} ξ(ν) = lim_{ν→0} [∫_0^{ν} (1 − µ(η)/µ(ω)) dω] / ν
            = lim_{ν→0} (1 − µ(η)/µ(ν))
            = 1 − µ(η).

1 − µ(η) is bounded above by 1 and below by 0 because the support is [0, 1]. Once we have found nondegenerate upper and lower bounds on the cdf of the transition kernel, it is not difficult to derive Doeblin’s condition. The entire support of fX is thus a small set and uniform ergodicity follows.

This proof serves to remind us that rigorous results are not easy to come by in MCMC. We must work hard to ensure the methods we employ do indeed sample from the desired target distribution. We have thus introduced the slice sampler, given a rudimentary implementation of it, and discussed its convergence properties in the simple 2D case. Next, we cover the Gibbs sampler which extends the slice sampler’s idea of alternately sampling variables conditioned on one another.

4.4 Gibbs Sampling

In this section, we consider sampling from the multivariate distribution f(x) = f(X_1, . . . , X_n). Each step of the Gibbs sampling algorithm replaces a single value, say X_i, by sampling from the distribution conditioned on everything but X_i, namely f_{X_i}(X_i|x_{−i}). That is, we replace X_i with a value drawn from f_{X_i}(X_i|x_{−i}), where X_i denotes the ith component of the vector x and x_{−i} denotes the vector x without the ith component. The deterministic scan Gibbs sampler is expressed rather nicely in Algorithm 5.

Each Gibbs step loops through x and replaces each component with a sample drawn from the correct conditional distribution using the most up-to-date values. In the context of Metropolis-Hastings, x_{−i} remains unchanged when we draw X_i so the proposal distribution is f_{x∗}(x∗|x_{−i}). We also have that x∗_{−i} = x_{−i}, and f_x(x) = f_{X_i}(X_i|x_{−i}) f_{x_{−i}}(x_{−i}), so the Metropolis-Hastings acceptance probability is

A(x∗, x) = [f_{X_i}(X∗_i|x∗_{−i}) f_{x_{−i}}(x∗_{−i}) f_{X_i}(X_i|x∗_{−i})] / [f_{X_i}(X_i|x_{−i}) f_{x_{−i}}(x_{−i}) f_{X_i}(X∗_i|x_{−i})] = 1.


Algorithm 5 Gibbs Sampling

1: procedure Gibbs Step   Input: x = (X_1, . . . , X_n)   Output: x∗
2:    Draw X∗_1 ∼ f_{X_1}(X_1|X_2, . . . , X_n)
3:    Draw X∗_2 ∼ f_{X_2}(X_2|X∗_1, X_3, . . . , X_n)
4:    ...
5:    Draw X∗_n ∼ f_{X_n}(X_n|X∗_1, X∗_2, . . . , X∗_{n−1})
6:    return x∗ ← (X∗_1, . . . , X∗_n)
7: end procedure

Thus, when dealing with high dimensional distributions, if we have access to the conditional distributions (which is often the case in Bayesian networks), the Gibbs sampler never rejects a proposal.

Example 4.4 Say we wish to draw points (X, Y) where X, Y ∼ Exp(λ). Below, we implement a deterministic scan Gibbs sampler that draws from a bounded 2D exponential distribution. We bound/truncate the points we draw for graphical simplicity.

Exp.Bounded <- function(rate, B) {
  repeat {
    x <- rexp(1, rate)
    if (x <= B) {
      return(x)
    }
  }
}

Gibbs.Sampler <- function(M, B) {
  mat <- matrix(ncol = 2, nrow = M)
  x <- 1; y <- 1
  mat[1, ] <- c(x, y)
  for (i in 2:M) {
    x <- Exp.Bounded(y, B)
    y <- Exp.Bounded(x, B)
    mat[i, ] <- c(x, y)
  }
  return(mat)
}

mat <- Gibbs.Sampler(1000, 10)

layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))
plot(mat, main = "Joint Distribution", xlab = expression("X"[1]),
     ylab = expression("X"[2]), ylim = c(0, 10), xlim = c(0, 10))
hist(mat[, 1], main = expression("Marginal dist. of X"[1]),
     xlab = expression("X"[1]), prob = TRUE, breaks = 30)
hist(mat[, 2], main = expression("Marginal dist. of X"[2]),
     xlab = expression("X"[2]), prob = TRUE, breaks = 30)

[Figure: scatterplot of the joint distribution of (X_1, X_2) and histograms of the marginal distributions of X_1 and X_2, produced by the code above.]

Example 4.5 Here we use a random scan Gibbs sampler to approximate the probability that a point drawn uniformly from the unit hypersphere in 6 dimensions is at least a distance of 0.9 from the origin. Our algorithm begins at the origin and then randomly chooses a coordinate to replace. Given (X_1, . . . , X_n) in the n-dimensional unit hypersphere we choose a random coordinate to update (WLOG, say X_1) and sample it uniformly such that

‖x‖ ≤ 1
X_1² + . . . + X_n² ≤ 1
X_1² ≤ 1 − (X_2² + . . . + X_n²)
X_1 ≤ √(1 − (X_2² + . . . + X_n²))


But square roots are always positive, so we must also flip a fair coin to determine the sign. More explicitly,

X_i|x_{−i} ∼ Unif([−√(1 − ∑_{j≠i} X_j²), √(1 − ∑_{j≠i} X_j²)]).

Euclidean.Norm <- function(x) {
  return(sqrt(sum(x ^ 2)))
}

Gibbs.Hypersphere.Conditional <- function(x) {
  if (runif(1) <= 0.5) {
    return(-1 * runif(1, min = 0, max = sqrt(1 - sum(x ^ 2))))
  }
  return(runif(1, min = 0, max = sqrt(1 - sum(x ^ 2))))
}

Random.Scan.Gibbs.Hypersphere <- function(x = rep(0, 6)) {
  idx <- sample(1:6, 1)
  x[idx] <- Gibbs.Hypersphere.Conditional(x[-idx])
  return(x)
}

Hypersphere.MC <- function(steps = 100, f.sample) {
  x <- rep(0, 6)  # start at origin
  for (i in 1:(0.1 * steps)) {
    x <- f.sample(x)
  }

  data <- matrix(0, ncol = length(x), nrow = steps)
  for (i in 1:steps) {
    x <- f.sample(x)
    data[i, ] <- x
  }
  return(data)
}

draws <- replicate(10,
                   Hypersphere.MC(steps = 5000,
                                  Random.Scan.Gibbs.Hypersphere))
counts <- apply(draws, MARGIN = 3, FUN = apply, 1, Euclidean.Norm)
p <- mean(counts >= 0.9)
s <- sd(counts >= 0.9) / sqrt(length(counts))

We find the probability that a uniform point drawn from the unit hypersphere in 6 dimensions is at least 0.9 from the origin is 0.469 ± 0.002.


4.5 Hamiltonian Monte Carlo

Originally introduced in 1987 as the Hybrid Monte Carlo [3], what we refer to as the Hamiltonian Monte Carlo (HMC) combines Hamiltonian dynamics and the Metropolis algorithm to propose large changes in state (e.g. jumping from mode to mode in a single iteration) while maintaining a high acceptance probability. HMC interprets x as a position and introduces an auxiliary variable to simulate Hamiltonian mechanics on phase space. But first, we introduce the basic vocabulary of Hamiltonian dynamics.

4.5.1 Hamiltonian Dynamics

Hamiltonian dynamics is a reformulation of classical Newtonian mechanics in which a particle is described by a position vector x and a momentum vector p. We associate with our position and momentum a total energy H(x, p) = U(x) + K(p) called the Hamiltonian of our system. H(x, p) is the sum of the potential energy associated with x and the kinetic energy associated with p. We often take the kinetic energy to be

K(p) = (1/2)‖p‖₂²

which corresponds to simulating Hamiltonian dynamics on a Euclidean manifold. Exploring the effects of alternate kinetic energies is beyond the scope of this text, however one can imagine simulating the dynamics on a Riemannian manifold instead. The choice of potential energy, we will see, depends on the target distribution we wish to sample from.

Given a position and momentum, the system evolves according to Hamilton’s equations:

dp/dt = −∂H/∂x and dx/dt = ∂H/∂p.

Energy is conserved, so a particle whose movement is governed by Hamiltonian dynamics travels along level sets of constant energy in the joint, or phase, space. Although H remains invariant, the values of x and p change over time. By simulating the dynamics of a system over a finite time period, we are able to make large changes to x and avoid random walk behavior.

Example 4.6 (A One-Dimensional Example) Consider the simple case in which the Hamiltonian of our system is defined as follows:

H(x, p) = U(x) + K(p), U(x) = x²/2, K(p) = p²/2.

The resulting dynamics evolve according to the equations

dp/dt = −x, dx/dt = p.


The solutions to these equations have the following form, for some constants r and a:

x(t) = r cos(a + t), p(t) = −r sin(a + t),

which corresponds to a rotation by t radians clockwise around the origin in the (x, p) plane.

4.5.2 HMC

If we consider the joint distribution over states (x, p) with total energy H(x, p), i.e.

P(x, p) ∝ exp(−H(x, p)),

we realize that simply starting at some point (x_0, p_0) and running the dynamics does not sample ergodically from P. To see this, notice that the dynamics only explore level sets of constant energy. All states in the set {(x, p) : H(x, p) ≠ H(x_0, p_0)} are unreachable. To construct an ergodic Markov chain, we need to perturb the value of H while keeping P invariant. Conceptually, we want to jump between level sets of constant energy to explore the space. Adding a Gibbs step where we draw p ∼ P(p|x) accomplishes just this. Our job is made even simpler by the independence of x and p, which follows from the factorization of P as

P(x, p) ∝ exp(−U(x)) exp(−K(p)).

Marginalizing out x yields P(p) ∝ exp(−K(p)), which implies p ∼ exp(−‖p‖₂²/2), which we recognize as the pdf of a standard normal random variable. Applying the same thinking to x, we see that U(x) = − log(fX(x)) implies x ∼ fX, giving x the desired marginal distribution.

An algorithm begins to emerge: starting at some point (x, p) in phase space, simulate Hamiltonian dynamics for a finite number of steps, and end in a new state (x∗, p∗). The proposal is accepted with probability

min(1, exp(H(x, p) − H(x∗, p∗))).

By the conservation of energy, we should always accept such proposals. Sometimes, errors in our numeric simulation of the dynamics prevent this from happening. In our experiments we used Radford Neal’s code that appears in Chapter 5 of [2] and is available online at http://www.cs.utoronto.ca/~radford/ham-mcmc-simple.

Algorithm 6 Hamiltonian Monte Carlo Sampler

1: procedure HMC   Input: x ∼ fX
2:    Draw p ∼ Norm(0, 1), U ∼ Unif([0, 1])
3:    Simulate Hamiltonian dynamics to get (x∗, p∗) ∼ P
4:    Compute acceptance probability P_a = min(1, exp(H(x, p) − H(x∗, p∗)))
5:    if U < P_a then return x∗
6:    else return x
7:    end if
8: end procedure
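As a minimal 1D sketch of one such update, the step below simulates the dynamics with the leapfrog integrator; log.f and grad.log.f are assumed to be supplied by the user, and the step size eps and number of leapfrog steps L are hypothetical tuning parameters rather than values from the text.

HMC.Step <- function(x, log.f, grad.log.f, eps = 0.1, L = 20) {
  p <- rnorm(1)                                    # Gibbs step: fresh momentum
  x.new <- x
  p.new <- p + 0.5 * eps * grad.log.f(x.new)       # half step for momentum
  for (l in 1:L) {
    x.new <- x.new + eps * p.new                   # full step for position
    if (l < L) {
      p.new <- p.new + eps * grad.log.f(x.new)     # full step for momentum
    }
  }
  p.new <- p.new + 0.5 * eps * grad.log.f(x.new)   # final half step
  H.old <- -log.f(x) + 0.5 * p^2                   # H = U + K with U = -log f
  H.new <- -log.f(x.new) + 0.5 * p.new^2
  if (runif(1) < exp(H.old - H.new)) x.new else x  # Metropolis correction
}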


Example 4.7 Suppose we wish to sample from the 1D bimodal distribution from Example 4.1. Although we have touted the performance of HMC in high dimensions, we restrict ourselves to a 1D density so that the joint phase space may be visualized, as below. We begin somewhere in the space (Figure 4.7a), simulate Hamiltonian dynamics for some number of steps, and accept the proposal state (Figure 4.7b). We must be cautious in our choice of L and ε, as it is not difficult to imagine an instance where the simulated particle returns to its starting position after a finite number of iterations. Choosing the stepsize, ε, at random before simulating the particle’s path can prevent this type of behavior.

Figure 4.7: Contours of the joint (X, P) phase space, showing the starting state (a) and the accepted proposal state after simulating the dynamics (b).

4.6 Summary

Thus ends our exploration of Markov Chain Monte Carlo sampling methods. As you may have noticed, algorithms that sample from high dimensional distributions are seldom written once and used forever. Instead, they require attention to detail and a tested dedication to writing correct code. Even once the practitioner has chosen an algorithm most applicable to their setting, it may require days or weeks of tuning and testing hyperparameter combinations to achieve the desired convergence. However, the four approaches to MCMC presented in this work (random walk, Metropolis-Hastings, auxiliary variables, Gibbs sampling) comprise the vast majority of the practitioner’s toolbox.


Chapter 5

Conclusion

We began with the question of how to generate randomness and have concentrated largely on algorithms that do just that: spit out randomness. This merely scratches the surface of the work being done on Monte Carlo methods. We can now answer real questions faced by statisticians, economists, mathematicians, and nuclear physicists. We can theorize models based on our beliefs, collect data, and determine, through simulation, whether our observations are in line with our predictions or if they can be considered “extreme”, “weird”, or “outlying”. Prerequisite to all of this is the ability to sample uniformly from the unit interval. We are reminded of the power of the Fundamental Theorem of Simulation and how ultimately, all of our problems are reduced to sampling uniformly.


Bibliography

[1] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

[2] Steve Brooks. Handbook of Markov Chain Monte Carlo. CRC Press/Taylor & Francis, Boca Raton, 2011.

[3] Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

[4] Christopher DuBois, Anoop Korattikara, Max Welling, and Padhraic Smyth. Approximate Slice Sampling for Bayesian Posterior Inference. In Artificial Intelligence and Statistics, 2014.

[5] Stuart Geman and Donald Geman. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.

[6] W. Keith Hastings. Monte Carlo Sampling Methods Using Markov Chains and their Applications. Biometrika, 57(1):97–109, 1970.

[7] Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget. arXiv preprint arXiv:1304.5299, 2013.

[8] David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

[9] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[10] Antonietta Mira and Luke Tierney. Efficiency and Convergence Properties of Slice Samplers. Scandinavian Journal of Statistics, 29(1):1–12, 2002.

[11] Radford M. Neal. Slice Sampling. Annals of Statistics, pages 705–741, 2003.

[12] Christian Robert. Monte Carlo Statistical Methods. Springer, New York, 2004.

[13] Gareth O. Roberts and Jeffrey S. Rosenthal. On Convergence Rates of Gibbs Samplers for Uniform Distributions. The Annals of Applied Probability, 8(4):1291–1302, 1998.

[14] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach (3rd Edition). Prentice Hall, 2009.