MCMC (Part II)
By Marc Sobel

Monte Carlo Exploration

Suppose we want to optimize a complicated distribution f(*). We assume 'f' is known up to a multiplicative constant of proportionality. Newton-Raphson says that we can pick a point nearer a mode by using the transformation:

x_new = x_old + ∂/∂x log f(x_old)
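A sketch of this mode-seeking step (in Python; the one-dimensional target and the small step factor 0.1 are invented for illustration, since the slide leaves the step size implicit):

```python
def grad_log_f(x):
    # Hypothetical target f(x) proportional to exp(-(x - 3)**2 / 2):
    # d/dx log f(x) = -(x - 3), so the mode sits at x = 3.
    return -(x - 3.0)

x = 0.0                            # x_old
for _ in range(100):
    # x_new = x_old + step * d/dx log f(x_old)
    x = x + 0.1 * grad_log_f(x)
```

After the loop, x has climbed to within a small tolerance of the mode at 3; without noise, repeated application simply converges to the nearest mode.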
Langevin Algorithms

Monte Carlo demands that we explore the distribution rather than simply moving toward a mode. Therefore, we can introduce a noise factor via:

x_new = x_old + (σ²/2) ∂/∂x log f(x_old) + σ Z,   Z ~ N(0, 1)

(Note that we have replaced 'ε' by σ.) We can just use it as is or combine it with a Metropolis-Hastings step.
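A minimal sketch of the noisy update (Python; the standard-normal target and the value of σ are stand-ins chosen for the demo, not taken from the slides):

```python
import random

def grad_log_f(x):
    return -x          # stand-in target: log f(x) = -x**2 / 2 (standard normal)

random.seed(0)
sigma = 0.5
x = 0.0
samples = []
for _ in range(10000):
    z = random.gauss(0.0, 1.0)
    # x_new = x_old + (sigma^2 / 2) * d/dx log f(x_old) + sigma * Z
    x = x + (sigma**2 / 2.0) * grad_log_f(x) + sigma * z
    samples.append(x)
```

Unlike the pure gradient step, the chain now wanders over the whole target rather than collapsing onto the mode: its sample mean hovers near 0 while its spread stays strictly positive.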
Langevin Algorithm with Metropolis Hastings

The move probability is:

Pr(move) = min{ 1, [f(x_new) / f(x_old)] · exp( −‖x_old − x_new − (σ²/2) ∂/∂x log f(x_new)‖² / (2σ²) ) / exp( −‖x_new − x_old − (σ²/2) ∂/∂x log f(x_old)‖² / (2σ²) ) }
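The acceptance step can be sketched in Python as follows (the standard-normal target and all tuning constants are illustrative assumptions):

```python
import math, random

def log_f(x):
    return -x**2 / 2.0                 # stand-in target (standard normal)

def grad_log_f(x):
    return -x

def log_q(x_to, x_from, sigma):
    # log of the Langevin proposal density:
    # x_to ~ N(x_from + (sigma^2/2) * grad_log_f(x_from), sigma^2)
    mean = x_from + (sigma**2 / 2.0) * grad_log_f(x_from)
    return -(x_to - mean)**2 / (2.0 * sigma**2)

random.seed(0)
sigma, x = 1.0, 0.0
n_moves, n = 0, 20000
samples = []
for _ in range(n):
    x_new = x + (sigma**2 / 2.0) * grad_log_f(x) + sigma * random.gauss(0.0, 1.0)
    # Pr(move) = min{1, [f(x_new) q(x_old | x_new)] / [f(x_old) q(x_new | x_old)]}
    log_ratio = (log_f(x_new) - log_f(x)
                 + log_q(x, x_new, sigma) - log_q(x_new, x, sigma))
    if random.random() < math.exp(min(0.0, log_ratio)):
        x, n_moves = x_new, n_moves + 1
    samples.append(x)
```

With the correction step the chain targets f exactly, so for this stand-in the empirical variance comes out near the target's variance of 1.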
Extending the Langevin to a Hybrid Monte Carlo Algorithm

Instead of moving based entirely on the gradient (with noise added on) we could add 'kinetic energy' via:

x_new = x_old + p;   p_new = p_old + ∂/∂x log f(x_new)

Iterate this algorithm.
Matlab Code for Hybrid MC: A total of Tau steps along the constant energy path

% gradE(x) is user-supplied: the gradient of the energy E(x) = -log(f(x))
g = gradE(x);                  % set gradient
E = -log(f(x));                % set energy
for i = 1:L
    p = randn(size(x));        % draw a fresh momentum
    H = p'*p/2 + E;            % current Hamiltonian
    gnew = g; xnew = x;
    for tau = 1:Tau
        p = p - epsilon*gnew/2;    % make a half step in p
        xnew = xnew + epsilon*p;   % make an x step
        gnew = gradE(xnew);        % update the gradient
        p = p - epsilon*gnew/2;    % make another half step in p
    end
    Enew = -log(f(xnew));          % find the new energy
    Hnew = p'*p/2 + Enew;          % find the new Hamiltonian
    dH = Hnew - H;
    if rand < exp(-dH), Accept = 1; else Accept = 0; end
    if Accept == 1
        x = xnew; g = gnew; E = Enew;   % keep the proposed state
    end
end
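The same leapfrog loop transcribed into Python as a self-contained sketch (the standard-normal target, step size epsilon, and trajectory length Tau are stand-ins chosen for the demo):

```python
import math, random

def log_f(x):
    return -x**2 / 2.0        # stand-in target: standard normal

def gradE(x):
    return x                  # gradient of the energy E(x) = -log f(x)

random.seed(1)
epsilon, Tau, L = 0.2, 10, 5000
x = 0.0
samples = []
for _ in range(L):
    p = random.gauss(0.0, 1.0)            # fresh momentum
    H = p * p / 2.0 - log_f(x)            # current Hamiltonian
    xnew, gnew = x, gradE(x)
    for _ in range(Tau):                  # Tau leapfrog steps
        p -= epsilon * gnew / 2.0         # half step in p
        xnew += epsilon * p               # full step in x
        gnew = gradE(xnew)                # update gradient
        p -= epsilon * gnew / 2.0         # second half step in p
    Hnew = p * p / 2.0 - log_f(xnew)
    if random.random() < math.exp(-(Hnew - H)):   # Metropolis accept
        x = xnew
    samples.append(x)
```

Because the leapfrog integrator nearly conserves H, almost all proposals are accepted, and the samples reproduce the target's mean and variance.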
Example

log(f(x)) = x² + a² − log(cosh(ax));   k(p) = p²
Project

Use Hybrid MC to sample from a multimodal multivariate density. Does it improve simulation?
Monte Carlo Optimization: Feedback, random updates, and maximization

Can Monte Carlo help us search for the optimum value of a function? We've already talked about simulated annealing. There are other methods as well.
Random Updates to get to the optimum

Suppose we return to the problem of finding modes: Let ζ denote a uniform random variable on the unit sphere, and α_x, β_x are determined by numerical analytic considerations (see Duflo 1998). (We don't get stuck using this.)

x_new = x_old + α_x ∂/∂x log f(x_old) + β_x ζ
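A Python sketch of the update (the quadratic target, the 1/t schedules for α_x and β_x, and the ±1 "unit sphere" in one dimension are all invented stand-ins, not Duflo's actual prescriptions):

```python
import random

def grad_log_f(x):
    return -(x - 1.0)      # stand-in target with its mode at x = 1

random.seed(0)
x = 5.0                    # deliberately start far from the mode
for t in range(1, 5001):
    alpha = 1.0 / t                       # shrinking gradient step
    beta = 1.0 / t                        # shrinking noise scale
    zeta = random.choice([-1.0, 1.0])     # uniform on the unit sphere (1-D)
    x = x + alpha * grad_log_f(x) + beta * zeta
```

The noise term β_x ζ is what keeps the search from getting stuck at a local mode in harder problems; as the schedules shrink, x settles onto the mode.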
Optimization of a function depending on the data

Minimize the (two-way) KLD between a density q(x) and a Gaussian mixture f = Σ_i α_i φ(x − θ_i) using samples. The two-way KLD is:

∫ q(x) log [q(x)/f(x|θ)] dx + ∫ f(x|θ) log [f(x|θ)/q(x)] dx

We can minimize this by first sampling X_1,…,X_n from q, and then sampling Y_1,…,Y_n from s_0(x) (assuming it contains the support of the f's) and minimizing.
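Both integrals can be estimated by plain Monte Carlo. Here is an illustrative Python check with q and f taken to be the unit-variance Gaussians N(0, 1) and N(1, 1), where each directed KLD equals 1/2 in closed form (for this check we sample Y directly from f rather than importance sampling from s_0):

```python
import math, random

def log_q(x):   # q = N(0, 1)
    return -x**2 / 2.0 - 0.5 * math.log(2.0 * math.pi)

def log_f(x):   # f = N(1, 1), a one-component stand-in for the mixture
    return -(x - 1.0)**2 / 2.0 - 0.5 * math.log(2.0 * math.pi)

random.seed(0)
n = 50000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]   # X_i ~ q
ys = [random.gauss(1.0, 1.0) for _ in range(n)]   # Y_j ~ f
kld_q_to_f = sum(log_q(v) - log_f(v) for v in xs) / n
kld_f_to_q = sum(log_f(v) - log_q(v) for v in ys) / n
two_way = kld_q_to_f + kld_f_to_q                 # close to 1/2 + 1/2 = 1
```

In the actual optimization problem, f depends on the mixture parameters θ, which is exactly why the next slide replaces direct sampling from f with importance sampling from s_0.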
Example (two-way) KLD

Monte Carlo rules dictate that we can't sample from a distribution which depends on the parameters we want to optimize. Hence we importance sample the second KLD equation using s_0. We also employ an EM-type step involving latent variables Z:

(1/n) Σ_{i=1}^n Σ_l log [q(x_i)/f(x_i|θ)] P(Z = l | x_i, θ) + (1/m) Σ_{j=1}^m Σ_l [f_l(y_j)/s_0(y_j)] log [f_l(y_j)/q(y_j)] P(Z = l | y_j, θ)
Prior Research

We (Dr. Latecki, Dr. Lakaemper, and I) minimized the one-way KLD between a nonparametric density q and a Gaussian mixture (paper pending):

KLD[1] = ∫ q(x) log [q(x)/f(x|θ)] dx

But note that for mixture models which put large weight on places where the NPD is not well-supported, minimizing may not give you the best possible result.
Project

Use this formulation to minimize the KLD distance between q (e.g., a nonparametric density based on a data set) and a Gaussian mixture.
General Theorem in Monte Carlo Optimization

One way of finding an optimal value for a function f(θ), defined on a closed bounded set, is as follows: Define a distribution

h(θ) ∝ exp{λ f(θ)}

for a parameter λ which we let tend to infinity. If we then simulate θ_1,…,θ_n ≈ h(θ), then

max_{1≤i≤n} f(θ_i) → max_θ f(θ)
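A Python illustration: rejection-sample from h(θ) ∝ exp{λ f(θ)} for a hypothetical objective on [0, 1] (the objective, λ, and sample sizes are invented for the demo) and read off the best simulated θ:

```python
import math, random

def f(theta):
    return -(theta - 0.7)**2     # hypothetical objective, maximized at 0.7

random.seed(0)
lam = 200.0
thetas = []
# exp(lam * f(theta)) <= 1 on [0, 1], so rejection sampling is easy:
while len(thetas) < 2000:
    t = random.random()                       # proposal ~ Uniform(0, 1)
    if random.random() < math.exp(lam * f(t)):
        thetas.append(t)                      # accepted draw, theta_i ~ h
best = max(thetas, key=f)
```

For large λ, h piles its mass near the maximizer, so the best of the simulated θ_i lands very close to 0.7.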
Monte Carlo Optimization

Observe (X_1,…,X_n | θ) ≈ L(X|θ). Simulate θ_1,…,θ_n from the prior distribution π(θ). Define the posterior (up to a constant of proportionality) by l(θ|X). It follows that

θ̂_{λ,n} = [ Σ_{i=1}^n θ_i exp{λ l(θ_i|X)} ] / [ Σ_{i=1}^n exp{λ l(θ_i|X)} ]

converges to the MLE as n and λ tend to infinity. The proof uses the Laplace approximation (see Robert (1993)).
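A Python sketch of the weighted-average estimator (the Gaussian data, the flat prior on [0, 4], and λ = 5 are illustrative assumptions; for a normal mean the MLE is just the sample mean):

```python
import math, random

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(50)]
mle = sum(data) / len(data)               # MLE of a normal mean

def log_lik(theta):
    return sum(-(v - theta)**2 / 2.0 for v in data)

lam = 5.0
thetas = [random.uniform(0.0, 4.0) for _ in range(20000)]   # theta_i ~ prior
log_w = [lam * log_lik(t) for t in thetas]
m = max(log_w)                            # subtract the max for stability
w = [math.exp(lw - m) for lw in log_w]
theta_hat = sum(t * wi for t, wi in zip(thetas, w)) / sum(w)
```

theta_hat lands very close to mle; raising λ concentrates the weights even more tightly around the likelihood's maximizer.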
Exponential Family Example

Let X ~ exp{λθx − λψ(θ)}, and θ ~ π. The resulting estimator is the posterior mean E(θ | λ, x), which tends to the MLE as λ → ∞.
Possible Example

It is known that maximum likelihood estimators for the parameters in a k-component mixture model are hard to compute. If, instead of maximizing the likelihood, we treat the mixture as a Bayesian model together with a scale parameter λ and an indifference prior, we can (typically) use Gibbs sampling to sample from this model. Letting λ tend to infinity leads to our being able to construct MLE's.
Project

Implement an algorithm to find the MLE for a simple 3-component mixture model. (Use Robert (1993).)