Transcript of Chapter 07: Monte Carlo Methods

LEARNING AND INFERENCE IN GRAPHICAL MODELS

Chapter 07: Monte Carlo Methods

Dr. Martin Lauer

University of Freiburg, Machine Learning Lab

Karlsruhe Institute of Technology, Institute of Measurement and Control Systems


References for this chapter

◮ Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 11, Springer, 2006

◮ Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan, An Introduction to MCMC for Machine Learning, In: Machine Learning, vol. 50, no. 1–2, pp. 5–43, 2003

◮ Christian P. Robert and George Casella, Monte Carlo Statistical Methods, Springer, 1999

◮ Radford M. Neal, Slice sampling, In: Annals of Statistics, vol. 31, no. 3, pp. 705–767, 2003

◮ Darrall Henderson, Sheldon H. Jacobson, and Alan W. Johnson, The Theory and Practice of Simulated Annealing, In: Fred Glover and Gary A. Kochenberger (eds.), Handbook of Metaheuristics, Springer, 2003


References for this chapter

◮ Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller, Equation of State Calculations by Fast Computing Machines, In: The Journal of Chemical Physics, vol. 21, pp. 1087–1092, 1953

◮ W. Keith Hastings, Monte Carlo Sampling Methods Using Markov Chains and Their Applications, In: Biometrika, vol. 57, pp. 97–109, 1970

◮ Stuart Geman and Donald Geman, Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, In: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, pp. 721–741, 1984

◮ Donald E. Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Addison-Wesley, 1997

◮ William Feller, An Introduction to Probability Theory and Its Applications, vol. 1, Wiley, 1968


Monte Carlo inference

◮ many tasks in probability theory deal with terms of the form ∫ f(x)p(x)dx

• e.g. expectation value: ∫ x p(x)dx

• e.g. variance: ∫ x^2 p(x)dx − (∫ x p(x)dx)^2

• e.g. expected risk: ∫ risk(x) p(x)dx

• e.g. expected gain: ∫ gain(x) p(x)dx

◮ but:

• the integral is often not tractable analytically

• p(·) is often not known explicitly

◮ hence: replace the analytical calculation by a numerical approach → Monte Carlo approach


Monte Carlo inference

◮ basic idea: calculate with random samples instead of pdfs

p(·) −→ sample {x1, . . . , xN} ∼ p, then ∫ f(x)p(x)dx is approximated by (1/N) ∑_{i=1}^N f(xi)

◮ as long as N is large enough, (1/N) ∑_{i=1}^N f(xi) is a good approximation of ∫ f(x)p(x)dx

◮ but: you need a random number generator for p
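
To make this concrete, here is a minimal Python sketch of the Monte Carlo approximation; the choices f(x) = x^2 and p = N(0, 1) are hypothetical examples, for which the exact value ∫ f(x)p(x)dx = 1 is known:

import numpy as np

rng = np.random.default_rng(0)

# draw {x1, ..., xN} ~ p (here: standard normal) and average f over the sample
N = 100_000
x = rng.standard_normal(N)
estimate = np.mean(x**2)     # (1/N) * sum_i f(xi) with f(x) = x^2
print(estimate)              # close to the exact value E[X^2] = 1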


Random number generators

◮ random number generators for the uniform distribution U(0, 1): many different algorithms exist, cf. the book of Knuth

◮ quantile trick:

• assume F(x) = ∫_{−∞}^{x} p(t)dt is known (“cumulative distribution function”, cdf)

• if u is a random sample element from U(0, 1), then F^{−1}(u) is a random sample element of p

• F^{−1} is called the “quantile function”

◮ distribution-specific transformation tricks:

• e.g. sampling from a Gaussian: assume u1, u2 are independent random samples from U(0, 1). Then v1 = √(−2 log u1) · sin(2πu2) and v2 = √(−2 log u1) · cos(2πu2) are independent random variables from N(0, 1)
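
As an illustration, a short Python sketch of both tricks; the exponential distribution (cdf F(x) = 1 − e^{−λx}, hence F^{−1}(u) = −log(1 − u)/λ) is a hypothetical example chosen because its quantile function has a closed form:

import numpy as np

rng = np.random.default_rng(0)

# quantile trick: exponential distribution with rate lam
lam = 2.0
u = rng.uniform(0.0, 1.0, size=10_000)
x_exp = -np.log(1.0 - u) / lam          # F^{-1}(u) applied to U(0,1) samples

# Box-Muller transform: two U(0,1) samples give two independent N(0,1) samples
u1 = rng.uniform(0.0, 1.0, size=10_000)
u2 = rng.uniform(0.0, 1.0, size=10_000)
v1 = np.sqrt(-2.0 * np.log(u1)) * np.sin(2.0 * np.pi * u2)
v2 = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

print(x_exp.mean(), v1.std(), v2.std())  # roughly 1/lam, 1, 1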


Random number generators

◮ what can we do if we do not find any trick? → accept-reject sampling

◮ assume:

• we want to sample from a distribution with pdf p

• we own a random number generator for a distribution with pdf q

• we know a constant M such that M · q(x) ≥ p(x) for all x

• how can we use the random number generator for q to sample from p?

[figure: the densities p and q, with the values p(x1), q(x1) and p(x2), q(x2) marked at two example points x1, x2]


Accept-reject sampling

[figure: target density p(x) lying below the scaled envelope M·q(x)]

◮ sample from q yields x

◮ accept x with probability p(x) / (M·q(x))

◮ otherwise, reject x

the set of all accepted sample elements x yields a sample from p, since

q(x) · p(x)/(M·q(x)) = (1/M) p(x) ∝ p(x)

on average the algorithm accepts only a ratio of 1/M of all sample elements
→ choose an appropriate q so that M remains small

Extension: accept-reject sampling works even if p is only known up to a constant factor
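
A minimal Python sketch of the scheme, assuming a hypothetical target p(x) = 6x(1 − x) on [0, 1] (the Beta(2, 2) density, maximal value 1.5) and the proposal q = U(0, 1), so M = 1.5 suffices:

import numpy as np

rng = np.random.default_rng(0)

def p(x):
    return 6.0 * x * (1.0 - x)   # Beta(2,2) density on [0,1], max value 1.5

M = 1.5                          # guarantees M*q(x) >= p(x) since q(x) = 1
samples = []
while len(samples) < 10_000:
    x = rng.uniform(0.0, 1.0)               # candidate from q
    if rng.uniform() < p(x) / M:            # accept with prob p(x)/(M*q(x))
        samples.append(x)

print(np.mean(samples))          # close to 0.5, the mean of Beta(2,2)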


Example: robot localization

◮ robot is located within an area of size 1 m × 1 m

◮ it has sensors to measure the distance to the corners of the field

[figure: robot at position ~x inside the unit square, with distances d1, . . . , d4 to the corner points ~e1, . . . , ~e4]

◮ Bayesian network

[graph: σ and ~x are parents of the observed distances d1, d2, d3, d4]

◮ distributions

~x ∼ U([0, 1] × [0, 1])

di | ~x ∼ N(||~x − ~ei||, σ^2)

~x | d1, d2, d3, d4 ∼ ?


Example: robot localization

◮ calculating the posterior

p(~x | d1, d2, d3, d4) ∝ p(d1|~x) p(d2|~x) p(d3|~x) p(d4|~x) · p(~x)
∝ exp(−(1/(2σ^2)) ∑_{i=1}^{4} (||~x − ~ei|| − di)^2) · I[0≤x1,x2≤1] ≤ 1

◮ apply accept-reject sampling with M = 1, q(~x) = I[0≤x1,x2≤1]
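
A Python sketch of this sampler; the corner coordinates, the true position, σ, and the simulated measurements are hypothetical choices for illustration:

import numpy as np

rng = np.random.default_rng(0)

corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # ~e1..~e4
true_x = np.array([0.3, 0.2])
sigma = 0.05
d = np.linalg.norm(corners - true_x, axis=1) + rng.normal(0.0, sigma, 4)

def posterior_unnorm(x):
    # exp(-1/(2 sigma^2) sum_i (||x - e_i|| - d_i)^2), always <= 1
    r = np.linalg.norm(corners - x, axis=1) - d
    return np.exp(-0.5 * np.sum(r**2) / sigma**2)

accepted = []
while len(accepted) < 100:
    x = rng.uniform(0.0, 1.0, size=2)        # proposal q: uniform on the square
    if rng.uniform() < posterior_unnorm(x):  # M = 1
        accepted.append(x)

print(np.mean(accepted, axis=0))             # concentrates near true_x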


Example: robot localization

◮ results

[three scatter plots over the unit square, each showing 100 accepted samples:
~x = (0.5, 0.5): rejected 3627, accepted 100;
~x = (0.3, 0.2): rejected 3087, accepted 100;
~x = (0, 0): rejected 5792, accepted 100]

◮ question: how can we create more efficient sampling schemes?


Side topic: Markov chains and their properties


Markov chains

Definitions:

◮ A Markov model or Markov chain is a Bayesian network which is organized as a chain of random variables Xi where Xi+1 solely depends on Xi.

X1 → X2 → · · · → Xi → · · · → Xn

◮ The set of values that Xi can take is called the state set S.

◮ The transition between subsequent states is given by a transition kernel T(Xi+1 | Xi)

• T(Xi+1 | Xi) is a conditional probability if the state set is discrete

• T(Xi+1 | Xi) is a conditional density if the state set is continuous

For the moment, we focus on Markov chains with finite, discrete state set.

◮ A Markov chain is homogeneous if the transition kernel T is invariant w.r.t. time


Markov chains

◮ A transition diagram for a homogeneous Markov chain is a directed graph with one vertex for each state and an edge between vertex u and v if T(v|u) > 0

[transition diagrams over the states s1, . . . , s4 for three example chains: Example A, Example B, Example C]


Markov chains

◮ A homogeneous Markov chain is irreducible if for all states u, v there exists a sequence of states s1, . . . , sn with u = s1 and sn = v such that T(si+1|si) > 0.

◮ The period of a state s is given by gcd{N ∈ N | there exist s1, . . . , sN−1 with T(s1|s) > 0 and T(s|sN−1) > 0 and T(si+1|si) > 0 for all i ∈ {1, . . . , N − 2}}

◮ A homogeneous Markov chain is aperiodic if the period of every state is 1.

[transition diagrams of Examples A, B, C repeated from the previous slide]


Ergodic Markov chains

◮ Given a homogeneous Markov chain with discrete, finite state set S, we can arrange all transition probabilities in a transition matrix M with Mi,j = T(sj|si). Hence, each row of M is the probability vector of a categorical distribution over S.

◮ Given a transition matrix M and a categorical distribution over the state set with probability vector (row vector) ~w, we obtain the distribution of successor states by ~w · M.

◮ A categorical distribution with probability vector ~w is a stationary distribution of a Markov chain with transition matrix M if ~w · M = ~w.

◮ A homogeneous Markov chain with discrete, finite state set S is ergodic if lim_{k→∞} M^k exists, all rows of lim_{k→∞} M^k are identical, and lim_{k→∞} M^k does not contain zeros. The rows of lim_{k→∞} M^k then form the probability vector of a stationary categorical distribution over S.
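
These notions can be checked numerically; a small Python sketch with a hypothetical 2-state transition matrix:

import numpy as np

# hypothetical transition matrix of a homogeneous 2-state chain
M = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# for an ergodic chain the rows of M^k converge to the stationary vector
Mk = np.linalg.matrix_power(M, 100)
print(Mk)          # both rows approx. [0.8, 0.2]

# verify stationarity: w * M = w
w = Mk[0]
print(w @ M)       # again approx. [0.8, 0.2]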


Ergodic Markov chains

[transition diagrams of Examples A, B, C repeated from above]

MA =
( 3/10  7/10  0     0    )
( 0     0     9/10  1/10 )
( 1/2   0     0     1/2  )
( 0     0     1     0    )

MB =
( 0     1     0     0    )
( 0     0     0     1    )
( 1/2   0     0     1/2  )
( 0     0     1     0    )

MC =
( 1     0     0     0    )
( 0     0     1/3   2/3  )
( 1/2   0     0     1/2  )
( 0     0     1     0    )

Which of these Markov chains are ergodic?


Ergodic Markov chains

◮ Theorem: if a homogeneous Markov chain with discrete, finite state set is irreducible and aperiodic, it is also ergodic.
Proof: see literature, e.g. Feller 1968

◮ What happens if we sample a very long sequence from an ergodic Markov chain?

• the first part of the sample will depend on the initial state (burn-in phase)

• after burn-in, the sample is drawn from the stationary distribution of the Markov chain

• the sample elements are dependent on each other


Ergodic Markov chains

Good and bad mixing behavior of a Markov chain

MA =
( 1/2  1/2 )
( 1/2  1/2 )

MB =
( 99/100   1/100  )
(  1/100  99/100  )

[two-state transition diagram over s1 and s2]

Both Markov chains share the same stationary distribution; however, mixing is very different.
E.g. a random sample from chain A:
s1, s1, s2, s1, s2, s2, s2, s1, s1, s2, s1, s1, s2, s2, s1, s2, . . .

and a sample from chain B:
s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s2, s2, s2, s2, s2, s2, s2,

s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2,

s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s2, s1, s1, s1,

s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, s1, . . .
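
The difference in mixing can be reproduced with a small Python simulation of the two chains, counting state switches as a crude measure of mixing:

import numpy as np

rng = np.random.default_rng(0)

def simulate(M, steps=10_000):
    # simulate a 2-state chain (0 = s1, 1 = s2) with transition matrix M
    s = [0]
    for _ in range(steps):
        s.append(rng.choice(2, p=M[s[-1]]))
    return np.array(s)

MA = np.array([[0.5, 0.5], [0.5, 0.5]])
MB = np.array([[0.99, 0.01], [0.01, 0.99]])

for name, M in (("A", MA), ("B", MB)):
    s = simulate(M)
    print(name, "time in s2:", s.mean(), "switches:", (s[1:] != s[:-1]).sum())
# both chains spend about half the time in s2, but chain A switches state
# thousands of times while chain B switches only rarely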


Markov chains with continuous state space

If S ⊆ R^d we have to replace the transition matrix by a transition kernel T(X_{t+1} | X_t), i.e. a conditional probability density.

E.g. T(v|u) = (1/√(2π)) · e^{−(1/2)(u+1−v)^2}

Now, a stationary distribution is a pdf p(·) with ∫ T(v|u) · p(u)du = p(v)

Most results (especially those about ergodic chains) can be transferred from discrete state spaces to continuous state spaces. For details, cf. the book of Christian P. Robert & George Casella, Monte Carlo Statistical Methods, Springer, 1999.


Designing Markov chains

We want to create a Markov chain with a specific stationary distribution p(·). How can we design the transition kernel?

Theorem: If a transition kernel T meets the detailed balance equation

T(v|u) · p(u) = T(u|v) · p(v)

for all states u, v ∈ S, then p is a stationary distribution of T. In this case T is called reversible.

Proof: ∫ T(v|u)p(u)du = ∫ T(u|v)p(v)du = (∫ T(u|v)du) · p(v) = p(v)
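
A small numerical illustration in Python: we pick a hypothetical 3-state target p, build a reversible kernel from a symmetric "flow" matrix F with F[u, v] = T(v|u) · p(u), and verify detailed balance and stationarity:

import numpy as np

p = np.array([0.2, 0.3, 0.5])            # hypothetical target distribution

# symmetric flow matrix with row sums equal to p
F = np.array([[0.10, 0.04, 0.06],
              [0.04, 0.20, 0.06],
              [0.06, 0.06, 0.38]])

T = F / p[:, None]                       # T(v|u) = F[u,v]/p(u); rows sum to 1

flow = T * p[:, None]                    # flow[u,v] = T(v|u) p(u)
print(np.allclose(flow, flow.T))         # detailed balance: True
print(p @ T)                             # equals p, i.e. p is stationary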


Designing Markov chains

Theorem: If T1 and T2 are transition kernels with stationary distribution p, then T = T2 ◦ T1 is a transition kernel with stationary distribution p. T is defined as

T(v|u) = ∫ T2(v|w) · T1(w|u)dw

[chain diagram: X1 →T1 X′1 →T2 X2 →T1 X′2 →T2 X3 →T1 X′3 →T2 X4 → · · ·]

Proof:
∫ T(v|u)p(u)du = ∫∫ T2(v|w) T1(w|u)dw p(u)du
= ∫ T2(v|w) (∫ T1(w|u)p(u)du) dw
= ∫ T2(v|w) p(w)dw = p(v)


Designing Markov chains

Theorem: If T1 and T2 are transition kernels with stationary distribution p and 0 < q < 1, then T = q · T1 + (1 − q) · T2 is a transition kernel with stationary distribution p. T is defined as T(v|u) = q · T1(v|u) + (1 − q) · T2(v|u)

[chain diagram: at each step the transition is drawn from T1 with probability q or from T2 with probability 1 − q]

Proof:
∫ T(v|u)p(u)du = ∫ (q · T1(v|u) + (1 − q) · T2(v|u)) p(u)du
= q · ∫ T1(v|u)p(u)du + (1 − q) · ∫ T2(v|u)p(u)du
= q · p(v) + (1 − q) · p(v) = p(v)


Markov Chain Monte Carlo Sampling


Markov chain Monte Carlo

◮ task

• we want to sample from a distribution p

• standard sampling tricks are not applicable

◮ basic idea:

• design a Markov chain with stationary distribution p

• sample from Markov chain. Reject initial sample elements

• obtain dependent sample from target distribution p

[sketch of a chain trajectory starting at x0: during burn-in the distribution depends on the initial state; afterwards the chain samples from the almost stationary target distribution]

◮ approach is known as Markov chain Monte Carlo sampling (MCMC)

• Metropolis-Hastings algorithm (Metropolis, 1953), (Hastings, 1970)

• Gibbs sampling (Geman and Geman, 1984)

• Slice sampling (Neal, 2003)

Metropolis-Hastings algorithm

◮ basic idea:

• sample candidates for successor state using a distribution q

• apply detailed balance equation to calculate an acceptance probability

◮ principle:

[figure: target density p with proposal densities q(·|xt) and q(·|zt) around the current state xt and the candidate zt]

transition:

• sample zt ∼ q(·|xt)

• set x_{t+1} = zt with probability min{1, (p(zt) · q(xt|zt)) / (p(xt) · q(zt|xt))}

• otherwise set x_{t+1} = xt

◮ the acceptance probability simplifies to min{1, p(z)/p(x)} if q is symmetric (Metropolis algorithm)
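
A compact Python sketch of the Metropolis algorithm with a symmetric Gaussian random-walk proposal; the bimodal unnormalized target is a hypothetical example:

import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    # target known only up to a constant: two Gaussian bumps
    return np.exp(-0.5 * (x - 2.0)**2) + 0.5 * np.exp(-0.5 * (x + 2.0)**2)

x = 0.0
chain = []
for t in range(50_000):
    z = x + rng.normal(0.0, 1.0)     # symmetric proposal q(z|x)
    if rng.uniform() < min(1.0, p_unnorm(z) / p_unnorm(x)):
        x = z                        # accept candidate, otherwise keep x
    chain.append(x)

samples = np.array(chain[5_000:])    # discard burn-in
print(samples.mean())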


Metropolis-Hastings algorithm

The transition kernel of the Metropolis-Hastings algorithm is

T(v|u) = q(v|u) · A(v|u) + δ(v − u) · ∫ q(w|u) · (1 − A(w|u))dw

with A(v|u) = min{1, (p(v) · q(u|v)) / (p(u) · q(v|u))}

Lemma: The Metropolis-Hastings transition kernel meets the detailed balance equation.

Proof: → blackboard

Remark: Metropolis-Hastings also works if the target probability is only known up to a normalization constant


Example: robot localization revisited

◮ robot localization example solved with the Metropolis-Hastings algorithm
sample distribution q: ~z ∼ ~x + U([−0.1, 0.1] × [−0.1, 0.1])

◮ created samples (each 200 elements):

[three scatter plots over the unit square:
~x = (0.5, 0.5): 93 candidates rejected;
~x = (0.3, 0.2): 82 candidates rejected;
~x = (0, 0): 106 candidates rejected]


Gibbs sampling

◮ sampling over bivariate distribution ~x = (x1, x2)

◮ Metropolis-Hastings with q(z1, z2 | x1, x2) = I[x1=z1] · p(z2|x1)

(p(z1, z2) · q(x1, x2 | z1, z2)) / (p(x1, x2) · q(z1, z2 | x1, x2))
= (p(z1, z2) · I[x1=z1] · p(x2|z1)) / (p(x1, x2) · I[x1=z1] · p(z2|x1))
= (p(x1, z2) · p(x2|x1)) / (p(x1, x2) · p(z2|x1))
= (p(z2|x1) · p(x1) · p(x2|x1)) / (p(x2|x1) · p(x1) · p(z2|x1)) = 1

i.e. x1 is clamped while x2 ∼ p(x2|x1) is sampled

◮ analogously: clamp x2 and sample x1 ∼ p(x1|x2)

◮ Gibbs sampling: concatenate both steps


Example: uniform distribution over a parabola

◮ sample from a uniform distribution over a frustum of a parabola

◮ p(x1, x2) ∝ I[−2≤x1≤2] · I[x1^2≤x2≤4]

x2 | x1 ∼ U(x1^2, 4)

x1 | x2 ∼ U(−√x2, √x2)

[figure: the parabola region with a Gibbs trajectory alternating between vertical and horizontal moves through the points xt, x̃(t), x_{t+1}, x̃(t+1), x_{t+2}]
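
The two conditionals translate directly into a short Python Gibbs sampler for this example:

import numpy as np

rng = np.random.default_rng(0)

# Gibbs sampling from the uniform distribution over
# {(x1, x2) : -2 <= x1 <= 2, x1^2 <= x2 <= 4}
x1, x2 = 0.0, 1.0                    # arbitrary valid starting point
samples = []
for t in range(10_000):
    x2 = rng.uniform(x1**2, 4.0)                   # x2 | x1 ~ U(x1^2, 4)
    x1 = rng.uniform(-np.sqrt(x2), np.sqrt(x2))    # x1 | x2 ~ U(-sqrt, +sqrt)
    samples.append((x1, x2))

samples = np.array(samples[1_000:])  # drop burn-in
print(samples.mean(axis=0))          # x1-mean near 0 by symmetry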


Gibbs sampling

generalization to multivariate distributions p(x1, . . . , xd):

◮ sample x1^{(t+1)} ∼ p(x1 | x2^{(t)}, . . . , xd^{(t)})

◮ sample x2^{(t+1)} ∼ p(x2 | x1^{(t+1)}, x3^{(t)}, . . . , xd^{(t)})

...

◮ sample xd^{(t+1)} ∼ p(xd | x1^{(t+1)}, . . . , x_{d−1}^{(t+1)})


Example: bearing-only tracking

◮ observing a moving object from a fixed position

◮ object moves with constant velocity

◮ for every point in time, the observer senses the angle of observation, but only sometimes the distance to the object

◮ distributions:

~x0 ∼ N(~a, R)

~v ∼ N(~b, S)

~yi | ~x0, ~v ∼ N(~x0 + ti~v, σ^2 I)

ri = ||~yi||

~wi = ~yi / ||~yi||

[figure: object movement with constant velocity; the observer senses the angle of observation, while the distance to the object is unknown]

[Bayesian network with plate over i = 1, . . . , n: ti, σ, ~x0, ~v are parents of ~yi, which determines ~wi and ri]


Example: bearing-only tracking

◮ conditional distributions for ~x0, ~v:

~x0 | ~v, (~yi), (ti) ∼ N( ((n/σ^2) I + R^{−1})^{−1} ((1/σ^2) ∑(~yi − ti~v) + R^{−1}~a), ((n/σ^2) I + R^{−1})^{−1} )

~v | ~x0, (~yi), (ti) ∼ N( ((1/σ^2) ∑ti^2 I + S^{−1})^{−1} ((1/σ^2) ∑ti(~yi − ~x0) + S^{−1}~b), ((1/σ^2) ∑ti^2 I + S^{−1})^{−1} )

◮ conditional distribution for ri:

p(~yi | ~x0, ~v, ti) ∝ exp{−(1/(2σ^2)) ||~x0 + ti~v − ~yi||^2}

p(~yi | ~x0, ~v, ti, ~wi) ∝ exp{−(1/(2σ^2)) ||~x0 + ti~v − ~yi||^2} · I[~yi ‖ ~wi]

p(ri | ~x0, ~v, ti, ~wi) ∝ exp{−(1/(2σ^2)) ||~x0 + ti~v − ri~wi||^2}
= exp{−(1/(2σ^2)) (||~x0 + ti~v||^2 − 2ri · ~wi^T(~x0 + ti~v) + ri^2)}
∝ exp{−(1/(2σ^2)) (ri − ~wi^T(~x0 + ti~v))^2}

⇒ ri | ~x0, ~v, ti, ~wi ∼ N(~wi^T(~x0 + ti~v), σ^2)


Example: bearing-only tracking

◮ Gibbs sampling with non-informative priors (R^{−1} = S^{−1} = 0):

~x0 ∼ N( (1/n) ∑(~yi − ti~v), (σ^2/n) I )

~v ∼ N( ∑ti(~yi − ~x0) / ∑ti^2, (σ^2/∑ti^2) I )

ri ∼ N( ~wi^T(~x0 + ti~v), σ^2 )

◮ results and Matlab demo

[plot: tracking result, iteration = 1000]


Gibbs sampling for Gaussian mixture

[Bayesian network with plates over j = 1, . . . , k and i = 1, . . . , n: hyperparameters m0, r0, a0, b0 above µj, sj; ~w above Zi; Zi, µj, sj are parents of Xi]

µj ∼ N(m0, r0)

sj ∼ Γ^{−1}(a0, b0)

~w ∼ D(~β)

Zi | ~w ∼ C(~w)

Xi | Zi, µ_{Zi}, s_{Zi} ∼ N(µ_{Zi}, s_{Zi})


Gibbs sampling for Gaussian mixture

calculate conditionals for Gibbs sampling using the results about conjugate distributions (chapter 2)

~w | z1, . . . , zn, ~β ∼ D(β1 + n1, . . . , βk + nk) with nj = |{i | zi = j}|

µj | x1, . . . , xn, z1, . . . , zn, sj, m0, r0 ∼ N( (sj m0 + r0 ∑_{i|zi=j} xi) / (sj + nj r0), (r0 sj) / (sj + nj r0) )

sj | x1, . . . , xn, z1, . . . , zn, µj, a0, b0 ∼ Γ^{−1}( a0 + nj/2, b0 + (1/2) ∑_{i|zi=j} (xi − µj)^2 )

zi = j | ~w, xi, µ1, . . . , µk, s1, . . . , sk ∼ C(hi,1, . . . , hi,k)

with hi,j ∝ (wj / √(2π sj)) · e^{−(xi−µj)^2 / (2 sj)}
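
A Python sketch of one possible implementation of this Gibbs sweep for a 1-D mixture; the data, hyperparameters, and initialization are hypothetical (k = 2 for brevity):

import numpy as np

rng = np.random.default_rng(0)

x = np.concatenate([rng.normal(0.0, 0.3, 300), rng.normal(3.0, 0.5, 200)])
n, k = len(x), 2
m0, r0, a0, b0 = 0.0, 100.0, 1.0, 1.0       # close to non-informative
beta = np.ones(k)

z = rng.integers(0, k, size=n)              # random initial assignments
mu = rng.normal(0.0, 1.0, size=k)
s = np.ones(k)                              # component variances

for sweep in range(500):
    nj = np.array([(z == j).sum() for j in range(k)])
    w = rng.dirichlet(beta + nj)            # w | z
    for j in range(k):
        xj = x[z == j]
        denom = s[j] + nj[j] * r0           # mu_j | ...
        mu[j] = rng.normal((s[j] * m0 + r0 * xj.sum()) / denom,
                           np.sqrt(r0 * s[j] / denom))
        # s_j | ... ~ inverse-gamma; sample via 1 / Gamma(shape, scale=1/rate)
        s[j] = 1.0 / rng.gamma(a0 + nj[j] / 2.0,
                               1.0 / (b0 + 0.5 * ((xj - mu[j])**2).sum()))
    # z_i | ... ~ categorical with h_{i,j} proportional to w_j N(x_i; mu_j, s_j)
    h = w / np.sqrt(2 * np.pi * s) * np.exp(-0.5 * (x[:, None] - mu)**2 / s)
    h /= h.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(k, p=h[i]) for i in range(n)])

print(w, mu, s)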


Gibbs sampling for Gaussian mixture

Example → Matlab demo

[plot: sampled mixture; iteration = 116, k = 3, n = 1000]

Plot shows sampled mixture after 1000 iterations of Gibbs sampling. Sample of size 1000 is taken from a uniform distribution. Priors were set close to non-informativity.


Slice sampling

We want to sample from a distribution with density p(x)

Extend the distribution by a second variable u:

p′(x, u) = { 1 if 0 ≤ u ≤ p(x); 0 otherwise }

p′ is a pdf since it is nonnegative and ∫∫ p′(x, u)du dx = 1

[figure: density p(x) over x, and the region under its graph on which p′(x, u) is the uniform density]

Apply Gibbs sampling to p′:

u | x ∼ U(0, p(x))
x | u ∼ U{x′ | p(x′) ≥ u}

We obtain a sample of p′. Since p is a marginal of p′, we obtain a sample of p by sampling from p′ and forgetting about the ui.
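
A minimal Python sketch, assuming the target p(x) = 2x on [0, 1] (the same density used in the simulated-annealing example later), chosen because the slice {x′ | p(x′) ≥ u} = [u/2, 1] has a closed form:

import numpy as np

rng = np.random.default_rng(0)

x = 0.5                                  # valid starting point
samples = []
for t in range(20_000):
    u = rng.uniform(0.0, 2.0 * x)        # u | x ~ U(0, p(x)) with p(x) = 2x
    x = rng.uniform(u / 2.0, 1.0)        # x | u ~ U{x' : p(x') >= u} = U(u/2, 1)
    samples.append(x)

print(np.mean(samples))                  # close to E[X] = 2/3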


Slice sampling

Executing slice sampling on the example:

[figure: slice-sampling trajectory visiting (x1, u1), (x2, u1), (x2, u2), (x3, u2), (x3, u3), (x4, u3), (x4, u4)]

The crucial point in slice sampling is whether it is possible to determine the set {x′ | p(x′) ≥ u} efficiently.

Slice sampling can also be used if p is only known up to a normalization factor.


Simulated annealing

Can we use MCMC if we want to calculate the MAP estimator of a distribution p?

Observation:

Consider the sequence of densities

p(x), (1/Z2)(p(x))^2, (1/Z3)(p(x))^3, . . .

What is the limit lim_{ν→∞} (1/Zν) p^ν?

Example:

p(x) = { 2x if 0 ≤ x ≤ 1; 0 otherwise }

(p(x))^ν = { (2x)^ν if 0 ≤ x ≤ 1; 0 otherwise }

Zν = ∫_0^1 (2x)^ν dx = 2^ν / (ν + 1)

lim_{ν→∞} (1/Zν) p^ν(x) = lim_{ν→∞} { (ν + 1)x^ν if 0 ≤ x ≤ 1; 0 otherwise } = δ(x − 1)


Simulated annealing

In general, if a density p(x) has a single global maximum at x = xmax, then the sequence (1/Zν)(p(x))^ν converges pointwise to δ(x − xmax).

Hence, the larger ν is, the more an MCMC sampler will focus on a small surrounding of xmax.

Let us build a Metropolis-Hastings sampler with a symmetric proposal distribution q(z|x). The acceptance probability is min{1, (p(z)/p(x))^ν}.

To be consistent with the literature, let us define t = 1/ν. Hence, the acceptance probability is min{1, (p(z)/p(x))^{1/t}}. t is called the temperature.

Idea: while applying the Metropolis algorithm, decrease the temperature t slowly over time. → simulated annealing


Simulated annealing

Goal: find the MAP of a probability distribution with density function p

1. initialize x arbitrarily
2. initialize temperature t = 1
3. repeat
4.    sample a candidate z ∼ q(·|x)
5.    calculate the acceptance probability A = min{1, (p(z)/p(x))^{1/t}}
6.    with probability A
7.       set x ← z
8.    endif
9.    decrease temperature t slightly
10. until convergence
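
A direct Python transcription of this pseudocode; the unnormalized target, proposal width, and cooling factor are hypothetical choices:

import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    # hypothetical multimodal target with its global maximum near x = 2
    return np.exp(-0.5 * (x - 2.0)**2) + 0.5 * np.exp(-0.5 * (x + 2.0)**2)

x = -3.0                       # 1. initialize x arbitrarily
t = 1.0                        # 2. initialize temperature
while t > 1e-3:                # 3./10. loop until (approximate) convergence
    z = x + rng.normal(0.0, 0.5)              # 4. symmetric proposal q(.|x)
    ratio = p_unnorm(z) / p_unnorm(x)
    A = 1.0 if ratio >= 1.0 else ratio**(1.0 / t)   # 5. acceptance probability
    if rng.uniform() < A:                     # 6./7. accept with probability A
        x = z
    t *= 0.999                 # 9. slow geometric cooling scheme

print(x)                       # close to the MAP estimate near x = 2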


Simulated annealing

Simulated annealing is guaranteed to find the MAP estimate with probability 1 if

◮ the Markov chain generated by the proposal distribution q is ergodic for any choice of t > 0

◮ the cooling scheme is sufficiently slow

Proof idea:

◮ since q generates an ergodic Markov chain, it will sample from the full distribution (1/Z_{1/t})(p(x))^{1/t} after a certain burn-in period if we keep t constant

◮ the smaller t is, the more time the Markov chain will stay in the close surrounding of xmax

◮ since (1/Z_{1/t})(p(x))^{1/t} → δ(x − xmax), the Markov chain will converge to xmax

Background remark: Simulated annealing was motivated by the physical annealing of solids.

→ Matlab demo: robot localization, Metropolis algorithm vs. simulated annealing


Summary

◮ Monte Carlo approximation

◮ accept-reject sampling

• example: robot localization

◮ Markov chains

• ergodic Markov chains

• design of transition kernels

◮ Metropolis-Hastings algorithm

• example: robot localization

◮ Gibbs sampling

• example: bearing-only tracking

• example: Gaussian mixture

◮ slice sampling

◮ simulated annealing
