
Markov Chain Monte Carlo Sampling

Sargur Srihari
srihari@cedar.buffalo.edu


Topics

1. Limitations of simple Monte Carlo
2. Markov Chains
3. Basic Metropolis Algorithm
4. Metropolis-Hastings Algorithm


Limitations of plain Monte Carlo

• Direct (unconditional) sampling
  – Hard to capture rare events in high-dimensional spaces
  – Infeasible for MRFs unless we know the partition function Z
• Rejection sampling, importance sampling
  – Do not work well if the proposal q(x) is very different from p(x)
  – Yet constructing a q(x) similar to p(x) can be difficult
• Making a good proposal usually requires knowledge of the analytic form of p(x) – but if we had that, we wouldn't even need to sample!
• Intuition of MCMC
  – Instead of a fixed proposal q(x), use an adaptive proposal


Markov Chain Monte Carlo (MCMC)

• Simple Monte Carlo methods (rejection sampling and importance sampling) are for evaluating expectations of functions
  – They suffer from severe limitations, particularly with high dimensionality
• MCMC is a very general and powerful framework
  – Markov refers to the sequence of samples rather than the model being Markovian
  – Allows sampling from a large class of distributions
  – Scales well with dimensionality
  – MCMC has its origin in statistical physics (Metropolis 1949)


MCMC and Adaptive Proposals

• MCMC: instead of a fixed q(x), use q(x'|x), where x' is the new state being sampled and x is the previous sample
• As x changes, q(x'|x) changes as a function of x'

[Figure: left, importance sampling from p(x) with a bad fixed proposal q(x); right, MCMC sampling from p(x) with an adaptive proposal q(x'|x) that moves with the chain: q(x2|x1), q(x3|x2), q(x4|x3)]


Markov chains

• A first-order Markov chain is a sequence of random variables z(1),…,z(M) such that the conditional independence property holds:
  p(z(m+1) | z(1),…,z(m)) = p(z(m+1) | z(m))
  – i.e., each sample depends only on the previous sample
• Represented as a chain in a directed graph
• A Markov chain is specified by
  – the distribution of the initial variable p(z(0))
  – the conditional (transition) probabilities Tm(z(m), z(m+1)) = p(z(m+1) | z(m))
• A Markov chain is homogeneous if the transition probabilities are the same for all m


Markov Chain

• A sequence of random variables S0, S1, S2,…, with each Si ∈ {1,2,…,d} taking one of d possible values representing the state of a system
  – The initial state is distributed according to p(S0)
  – Subsequent states are generated from a conditional distribution that depends only on the previous state, i.e., Si is distributed according to p(Si | Si−1)

[Figure: a Markov chain over three states; the weighted directed edges indicate the probabilities of transitioning to a different state]
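To make this concrete, here is a minimal NumPy sketch of a homogeneous chain over d = 3 states. The transition matrix T and the initial distribution are assumed values for illustration, not taken from the figure.

import numpy as np

rng = np.random.default_rng(0)

# Assumed transition matrix: row i holds p(S_t = j | S_{t-1} = i); each row sums to 1
T = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
p0 = np.array([1.0, 0.0, 0.0])   # initial distribution p(S0): start in state 0

def simulate_chain(n_steps):
    """Draw a trajectory S0, S1, ..., S_{n_steps-1} from the chain."""
    s = rng.choice(3, p=p0)
    states = [s]
    for _ in range(n_steps - 1):
        s = rng.choice(3, p=T[s])   # next state depends only on the current state
        states.append(s)
    return np.array(states)

traj = simulate_chain(10_000)
# Long-run state frequencies approach the chain's stationary distribution
print(np.bincount(traj, minlength=3) / len(traj))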


Idea of MCMC

• Construct a Markov chain whose states are joint assignments to the variables in the model
  – and whose stationary distribution equals the model probability p


Metropolis-Hastings

• The user specifies a transition kernel q(x'|x) and an acceptance probability A(x'|x)
• M-H algorithm
  – Draw a sample x' from q(x'|x), where x is the previous sample
  – The new sample x' is accepted or rejected with probability A(x'|x)
• The acceptance probability is

  A(x'|x) = min( 1, p(x') q(x|x') / [ p(x) q(x'|x) ] )

• It encourages moves towards more likely points in the distribution
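This update is a few lines of code. Below is a sketch in Python; the function names and signatures are my own illustrative choices, and log densities are used for numerical stability (the slides work with densities directly).

import numpy as np

rng = np.random.default_rng(0)

def mh_step(x, log_p, propose, log_q):
    """One Metropolis-Hastings step.

    log_p(x)     : log of the (possibly unnormalized) target density p
    propose(x)   : draws x' ~ q(x'|x)
    log_q(xp, x) : evaluates log q(xp|x)
    """
    x_new = propose(x)
    # log of the M-H ratio  p(x') q(x|x') / [ p(x) q(x'|x) ]
    log_ratio = (log_p(x_new) + log_q(x, x_new)) - (log_p(x) + log_q(x_new, x))
    if np.log(rng.uniform()) < min(0.0, log_ratio):   # accept with probability A(x'|x)
        return x_new, True
    return x, False                                   # reject: chain stays at x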


Acceptance probability

• A(x'|x) is like a ratio of importance sampling weights
  – p(x')/q(x'|x) is the importance weight for x'
  – p(x)/q(x|x') is the importance weight for x
  – We divide the importance weight for x' by that of x
  – Notice that we only need the ratio p(x')/p(x), rather than p(x') or p(x) separately, so an unnormalized target suffices
• A(x'|x) ensures that, after sufficient draws, samples come from the true distribution p(x)

  A(x'|x) = min( 1, p(x') q(x|x') / [ p(x) q(x'|x) ] )


The M-H Algorithm (1)

• Let q(x'|x) be a Gaussian centered on x
• We are trying to sample from a bimodal p(x)

[Figure: initialize x(0)]


The M-H Algorithm (2)

[Figure: initialize x(0); draw, accept x(1)]


The M-H Algorithm (3)

[Figure: initialize x(0); draw, accept x(1); draw, accept x(2)]


The M-H Algorithm (4)

[Figure: initialize x(0); draw, accept x(1); draw, accept x(2); draw but reject, set x(3) = x(2)]


The M-H Algorithm (5)

[Figure: initialize x(0); draw, accept x(1); draw, accept x(2); draw but reject, set x(3) = x(2); draw, accept x(4)]


The M-H Algorithm (6)

[Figure: initialize x(0); draw, accept x(1); draw, accept x(2); draw but reject, set x(3) = x(2); draw, accept x(4); draw, accept x(5)]
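This walkthrough can be reproduced with the mh_step sketch above. The bimodal target below is an assumed equal mixture of two Gaussians, and the proposal is a Gaussian centered on the current sample, as in the slides; the specific parameter values are illustrative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def log_p(x):
    # Assumed bimodal target: 0.5 N(-2, 1) + 0.5 N(2, 1)
    return np.log(0.5 * norm.pdf(x, -2, 1) + 0.5 * norm.pdf(x, 2, 1))

sigma = 1.0                                                # proposal std dev
propose = lambda x: x + sigma * rng.normal()               # x' ~ N(x, sigma^2)
log_q = lambda xp, x: norm.logpdf(xp, loc=x, scale=sigma)  # Gaussian centered on x

x, samples = 0.0, []
for t in range(5_000):
    x, accepted = mh_step(x, log_p, propose, log_q)   # on rejection, x(t+1) = x(t)
    samples.append(x)
# After a burn-in period, samples approximate draws from the bimodal p(x)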


Basic Metropolis Algorithm

• As with rejection and importance sampling, use a proposal distribution (a simpler distribution)
• Maintain a record of the current state z(t)
• The proposal distribution q(z|z(t)) depends on the current state (the next sample depends on the previous one)
  – E.g., q(z|z(t)) is a symmetric Gaussian with mean z(t) and a small variance
• Thus the sequence of samples z(1), z(2),… forms a Markov chain
• Write p(z) = p̃(z)/Zp, where p̃(z) is readily evaluated even if Zp is not
• At each cycle, generate a candidate z* and test it for acceptance


Metropolis Algorithm

• Assumes a simple proposal distribution that is symmetric, i.e., q(zA|zB) = q(zB|zA) for all zA, zB
  – e.g., an isotropic Gaussian
• A new sample z* drawn from q(z|z(t)) is accepted with probability

  A(z*, z(t)) = min( 1, p̃(z*) / p̃(z(t)) )

  – Done by choosing u ~ U(0,1) and accepting if A(z*, z(t)) > u
• If accepted, then z(t+1) = z*
• Otherwise:
  – z* is discarded,
  – z(t+1) is set to z(t), and
  – another candidate is drawn from q(z|z(t+1))

[Figure: Metropolis sampling of a Gaussian p(z) whose one-standard-deviation contour is shown, using an isotropic Gaussian proposal q(z) with std dev 0.2; accepted steps in green, rejected steps in red]
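Because the proposal is symmetric, the q terms cancel and only the ratio p̃(z*)/p̃(z(t)) is needed, so the normalizer Zp never has to be computed. A minimal sketch of this cycle follows; the 2-D correlated Gaussian target echoes the figure, but its covariance values and the run length are assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target p~(z): a correlated 2-D Gaussian (covariance assumed for illustration)
cov_inv = np.linalg.inv(np.array([[1.0, 0.95],
                                  [0.95, 1.0]]))

def p_tilde(z):
    return np.exp(-0.5 * z @ cov_inv @ z)    # normalizer Zp deliberately omitted

def metropolis_step(z, scale=0.2):
    z_star = z + scale * rng.normal(size=z.shape)   # symmetric isotropic Gaussian proposal
    A = min(1.0, p_tilde(z_star) / p_tilde(z))      # q terms cancel by symmetry
    if rng.uniform() < A:                           # i.e., accept if A > u, u ~ U(0,1)
        return z_star, True
    return z, False                                 # reject: repeat the current state

z, n_accept = np.zeros(2), 0
for t in range(5_000):
    z, ok = metropolis_step(z)
    n_accept += ok
print("acceptance rate:", n_accept / 5_000)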


Inefficiency of Random Walks

• Consider a simple random walk over a state space z consisting of the integers, with probabilities
  p(z(t+1) = z(t)) = 0.5          (stay in the same state)
  p(z(t+1) = z(t) + 1) = 0.25     (increase state by 1)
  p(z(t+1) = z(t) − 1) = 0.25     (decrease state by 1)
• If the initial state is z(1) = 0, the expected state at time t is zero: E[z(t)] = 0
• Similarly, E[(z(t))²] = t/2, so after t steps the distance traveled is only proportional to √t
• Random walks are inefficient at exploring the state space
• MCMC algorithms try to avoid random walk behavior
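A quick simulation bears out both moments; the trial and step counts below are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n_trials, t_max = 10_000, 400

# Per-step increments: -1, 0, +1 with probabilities 0.25, 0.5, 0.25
steps = rng.choice([-1, 0, 1], size=(n_trials, t_max), p=[0.25, 0.5, 0.25])
z = steps.cumsum(axis=1)                        # z(t) for every trial

print(np.mean(z[:, -1]))                        # ~ 0, matching E[z(t)] = 0
print(np.mean(z[:, -1] ** 2), t_max / 2)        # ~ t/2, matching E[(z(t))^2]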


Metropolis-Hastings Algorithm

• Generalizes the Metropolis algorithm
  – The proposal distribution is no longer required to be a symmetric function of its arguments, i.e., q(zA|zB) ≠ q(zB|zA) in general
1. At step τ, in which the current state is z(τ), draw a sample z* from the distribution qk(z|z(τ))
2. Accept with probability Ak(z*, z(τ)), where

  Ak(z*, z(τ)) = min( 1, p(z*) qk(z(τ)|z*) / [ p(z(τ)) qk(z*|z(τ)) ] )

  – k labels the set of possible transitions considered
• One can show that p(z) is an invariant distribution of the Markov chain defined by the Metropolis-Hastings algorithm, since detailed balance is satisfied
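Detailed balance, p(z) T(z, z') = p(z') T(z', z), can be verified numerically on a small discrete example. In the sketch below the three-state target p and the uniform proposal are made-up values; the check works because min(p(z)q(z'|z), p(z')q(z|z')) is symmetric in z and z'.

import numpy as np

p = np.array([0.2, 0.5, 0.3])        # assumed target distribution over 3 states
q = np.full((3, 3), 1.0 / 3.0)       # proposal q[z, z'] = q(z'|z): uniform

# M-H transition matrix T(z, z') = q(z'|z) A(z'|z) for z' != z
A = np.minimum(1.0, (p[None, :] * q.T) / (p[:, None] * q))
T = q * A
T[np.diag_indices(3)] = 0.0
T[np.diag_indices(3)] = 1.0 - T.sum(axis=1)   # rejection mass stays on the diagonal

flow = p[:, None] * T                # flow[z, z'] = p(z) T(z, z')
assert np.allclose(flow, flow.T)     # detailed balance holds
assert np.allclose(p @ T, p)         # hence p is invariant under the chain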


Choice of Proposal Distribution for Metropolis-Hastings

• A common choice is a Gaussian centered on the current state
• There is a trade-off in choosing its scale ρ:
  – to keep the rejection rate low, ρ should be of order σmin, the smallest standard deviation of the target
  – the number of steps needed to obtain an independent sample is then of order (σmax/σmin)²

[Figure: an isotropic Gaussian proposal distribution of scale ρ superimposed on a correlated multivariate Gaussian target with standard deviations σmin and σmax]
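This trade-off is easy to observe empirically. The sketch below uses an elongated (axis-aligned, values assumed) Gaussian target and compares acceptance rates for proposal scales well below, near, and well above σmin:

import numpy as np

rng = np.random.default_rng(0)

# Elongated 2-D Gaussian target: sigma_max = 3.0 (x-axis), sigma_min = 0.3 (y-axis), assumed
cov_inv = np.diag([1 / 3.0**2, 1 / 0.3**2])

def p_tilde(z):
    return np.exp(-0.5 * z @ cov_inv @ z)

for scale in (0.03, 0.3, 3.0):    # rho << sigma_min, rho ~ sigma_min, rho >> sigma_min
    z, n_accept = np.zeros(2), 0
    for _ in range(5_000):
        z_star = z + scale * rng.normal(size=2)
        if rng.uniform() < min(1.0, p_tilde(z_star) / p_tilde(z)):
            z, n_accept = z_star, n_accept + 1
    print(f"rho = {scale}: acceptance rate ~ {n_accept / 5_000:.2f}")

Tiny steps are almost always accepted but explore the elongated direction slowly; large steps are mostly rejected. Setting ρ near σmin is the practical compromise the slide describes.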