Download - Dealing with intractability: Recent Bayesian Monte Carlo methods for dealing with intractable likelihoods

BackgroundABC methods for generative models

MC2 type methodsState-Space models, PMCMC

SMC2

Dealing with intractability: recent advances inBayesian Monte Carlo methods for intractable

likelihoods

N. CHOPIN1

CREST-ENSAE

1joint work with S. BARTHELME, P.E. JACOB, & O.PAPASPILIOPOULOS

N. CHOPIN Intractability 1/ 54



SMC2

Outline

1 Background

2 ABC methods for generative models

3 MC2 type methods

4 State-Space models, PMCMC

5 SMC2




SMC2

Tractable models

For a prototypic Bayesian model, defined by (a) prior p(θ), and (b)likelihood p(y |θ), a standard approach is to sample from theposterior

p(θ|y) ∝ p(θ)p(y |θ).

using the Metropolis-Hastings algorithm:

Metropolis-Hastings

From current point θn1 Sample θp ∼ T (θn, dθ

p)

2 With probability 1 ∧ r , take θn+1 = θp, otherwise θn+1 = θn,where

r =p(θp)p(y |θp)T (θp, θn)

p(θn)p(y |θn)T (θn, θp)

This generates a Markov chain which leaves the posterior invariant.N. CHOPIN Intractability 3/ 54



SMC2

Intractable models

This generic approach cannot be applied in the followingsituations:

1 The likelihood reads p(y |θ) = C (θ)hθ(y), where C (θ) is anintractable normalising constant; e.g. log-linear models, Isingmodels.

2 The likelihood p(y |θ) is an intractable integral

p(y |θ) =

∫Xp(y , x |θ) dx

of a tractable integrand; e.g. state-space models.

3 The likelihood is even more complicate, because itcorresponds to some generative process (scientific models).

Solutions to these problems involve auxiliary variables.




SMC2

Outline

1 Background


3 MC2 type methods


5 SMC2




SMC2

Example of a generative model: reaction times

Subject must choose between k alternatives. Evidence ej(t) infavour of choice j follows a Brownian motion with drift:

τdej(t) = mjdt + dW jt .

Decision is taken when one evidence “wins the race”; see plot.

0 50 100 150

time (ms)

Threshold for A

Threshold for B

Evidence for A

Evidence for B




SMC2

ABC methods for generative models

ABC stands for “Approximate Bayesian Computation”. In suchalgorithms, the auxiliary variable is an artificial dataset y ∼ p(y |θ).Denote the actual dataset y?. Consider the simple rejectionalgorithm:

Basic ABC

Repeat

1 Sample θ ∼ p(θ).

2 Sample y ∼ p(y |θ).

3 Accept with probability Kε(‖s(y)− s(y?)‖).

where Kε(x) = K (x/ε), K is a kernel function, and s is a vector of“summary statistics”.




SMC2

ABC target

This algorithm samples from:

πε(θ, y) ∝ p(θ)p(y |θ)Kε(‖s(y)− s(y?)‖).

and the marginal πε(θ)→ p(θ|s(y?)) as ε→ 0.

If s is sufficient, then the limit is the true posteriorp(θ|s(y?)) = p(θ|y?), but this is rarely possible unfortunately.




SMC2

MCMC-ABC

One can instead derive a MCMC algorithm that sample from thesame distribution.

MCMC-ABC

From current point (θn, yn)

1 Sample θp ∼ T (θn, dθp).

2 Sample yp ∼ p(y |θp).

3 With probability 1 ∧ r , take (θn+1, yn+1) = (θp, yp), otherwise(θn+1, yn+1) = (θn, yn), where

r =p(θp)Kε(‖s(yp)− s(y?)‖)T (θp, θn)

p(θn)Kε(‖s(yn)− s(y?)‖)T (θn, θp)




SMC2

Remarks on the KDE interpretation of ABC

Having sampled N pairs (θi , y i ) from p(θ)p(y |θ), choosing εessentially amounts to choosing the bandwidth of a KDE. Thereare some specific aspects that may deserve some investigationhowever:

1 The objective is to approximate a conditional density, that isp(θ|s(y?)). (But approximating p(s(y?)) may be interestingtoo.)

2 The marginal distribution of the simulated θ’s is known.

3 Could we use a bandwidth matrix instead?




SMC2

Parametric interpretation of ABC

It would be great to take s(y) = y . In that way, the ABC posteriorcould be interpreted as the posterior distribution of the samemodel, but corrupted with noise (of size ε). See the followingpaper for a fast (EP) approximation of such an ABC posterior:

Barthelme, S. and Chopin, N. (2011). ABC-EP: ExpectationPropagation for Likelihood-free Bayesian Computation, ICML2011, L. Getoor and T. Scheffer (eds), 289-296. (see alsoarXiv:1107.5959).




SMC2

ABC: summary

We use ABC for very challenging models (generative/scientificmodels). We pay a heavy price for this:

1 First level of approximation is p(θ|y?) ≈ p(θ|s(y?))(althought not in ABC-EP).

2 Second level of approximation is p(θ|s(y?)) ≈ πε(θ).

3 Huge CPU cost (but less in ABC-EP).

4 ABC-EP cannot be used in all situations.

In the rest of the talk, we will deal with milder problems, and wewill be able to avoid approximations.




SMC2

Outline

1 Background


3 MC2 type methods


5 SMC2




SMC2

Basic framework

Imagine a model such that

p(y |θ) =

∫p(x , y |θ) dx

is intractable, but can be approximated by the following unbiasedMC estimate:

p(y |θ) =1

N

N∑j=1

p(x j , y |θ)

qθ(x j)

where the x j ’s are N points sampled from the (user-chosen)proposal distribution qθ.




SMC2

Naive question

Can we simply replace p(y |θ) by p(y |θ)? i.e.

MC2

From current point θn (plus p(y |θn) from previous iteration)

1 Sample θp ∼ T (θn, dθp)

2 Sample x1:N ∼ qθp so as to compute p(y |θp).

3 With probability 1 ∧ r , set θn+1 = θp, otherwise θn+1 = θnwith


p(θn)p(y |θn)T (θn, θp).




SMC2

Answer: yes, and the algorithm is exact!

More precisely, this algorithm is a correct Metropolis step withrespect to the following extended distribution:

π(θ, x1:N) ∝ p(θ)N∏j=1

qθ(x j)

1

N

N∑j=1

p(x j , y |θ)

qθ(x j)

which is such that the marginal distribution of θ is precisely thetrue posterior distribution:∫

π(θ, x1:N) dx1:N = p(θ|y).




SMC2

Outline

1 Background


3 MC2 type methods


5 SMC2




SMC2

State Space Models

A system of equations

Hidden states (Markov): p(x1|θ) = µθ(x1) and for t ≥ 1

p(xt+1|x1:t , θ) = p(xt+1|xt , θ) = fθ(xt+1|xt)

Observations:

p(yt |y1:t−1, x1:t−1, θ) = p(yt |xt , θ) = gθ(yt |xt)

Parameter: θ ∈ Θ, prior p(θ). We observe y1:T = (y1, . . . yT ),T might be large (≈ 104). x and θ will also be of severaldimensions.

There are several interesting models for which fθ cannot be writtenin closed form (but it can be simulated).




SMC2

State Space Models

Some interesting distributions

Bayesian inference focuses on:

static: p(θ|y1:T ) dynamic: p(θ|y1:t) , t ∈ 1 : T

Filtering (traditionally) focuses on:

∀t ∈ [1,T ] pθ(xt |y1:t)

Smoothing (traditionally) focuses on:

∀t ∈ [1,T ] pθ(xt |y1:T )




SMC2

Examples

Population growth model{yt = xt + σwεt

log xt+1 = log xt + b0 + b1(xt)b2 + σεηt

θ = (b0, b1, b2, σε, σW ).




SMC2

Examples

Stochastic Volatility (Levy-driven models)

Observations (“log returns”):

yt = µ+ βvt + v1/2t εt , t ≥ 1

Hidden states (“actual volatility” - integrated process):

vt+1 =1

λ(zt − zt+1 +

k∑j=1

ej)




SMC2

Examples

. . . where the process zt is the “spot volatility”:

zt+1 = e−λzt +k∑

j=1

e−λ(t+1−cj )ej

k ∼ Poi(λξ2/ω2

)c1:k

iid∼ U(t, t + 1) ei :kiid∼ Exp

(ξ/ω2

)The parameter is θ ∈ (µ, β, ξ, ω2, λ), and xt = (vt , zt)

′.

See the results




SMC2

Why are those models challenging?

. . . It is effectively impossible to compute the likelihood

p(y1:T |θ) =

[∫XT

p(y1:T |x1:T , θ)p(x1:T |θ)dx1:T

]Similarly, all other inferential quantities are impossible to compute.




SMC2

Problems with MCMC approaches

Metropolis-Hastings:1 p(θ|y1:T ) cannot be evaluated point-wise (marginal MH)2 p(x1:T , θ|y1:T ) are high-dimensional and it is hard to design

reasonable proposals

Gibbs sampler (updates states and parameters):1 The hidden states x1:T are typically very correlated and it is

hard to update them efficiently in a block2 Parameters and latent variables highly correlated

Common: they are not designed to recover the wholesequence π(x1:t , θ | y1:t)




SMC2

Particle filters

Consider the simplified problem of targeting

pθ(xt+1|y1:t+1)

This sequence of distributions is approximated by a sequence ofweighted particles which are properly weighted using importancesampling, mutated/propagated according to the system dynamics,and resampled to control the variance.

Below we give a pseudo-code version. Any operation involving thesuperscript n must be understood as performed for n = 1 : Nx ,where Nx is the total number of particles.




SMC2

Step 1: At iteration t = 1,

(a) Sample xn1 ∼ q1,θ(·).

(b) Compute and normalise weights

w1,θ(xn1 ) =µθ(xn1 )gθ(y1|xn1 )

q1,θ(xn1 ), W n

1,θ =w1,θ(xn1 )∑Ni=1 w1,θ(x i1)

.

Step 2: At iteration t = 2 : T

(a) Sample the index ant−1 ∼M(W 1:Nxt−1,θ) of the ancestor

(b) Sample xnt ∼ qt,θ(·|xant−1

t−1 ).

(c) Compute and normalise weights

wt,θ(xant−1

t−1 , xnt ) =

fθ(xnt |xant−1

t−1 )gθ(yt |xnt )

qt,θ(xnt |xant−1

t−1 ), W n

t,θ =wt,θ(x

ant−1

t−1 , xnt )∑Nx

i=1 wt,θ(xait−1

t−1 , xit)




SMC2

Particle filtering

timeFigure: Three weighted trajectories x1:t at time t.




SMC2

Particle filtering

timeFigure: Three proposed trajectories x1:t+1 at time t + 1.




SMC2

Particle filtering

timeFigure: Three reweighted trajectories x1:t+1 at time t + 1




SMC2

Observations

At each t, (w(i)t , x

(i)1:t)Nx

i=1 is a particle approximation ofpθ(xt |y1:t).

Resampling to avoid degeneracy. If there were no interactionbetween particles there would be typically polynomial or worseincrease in the variance of weights

Taking qθ = fθ simplifies weights, but mainly yields a feasiblealgorithm when fθ can only be simulated.




SMC2

Unbiased likelihood estimator

A by-product of PF output is that

ZNt =

T∏t=1

(1

Nx

Nx∑i=1

w(i)t

)

is an unbiased estimator of the likelihood Zt = p(y1:t |θ) for all t.

Whereas consistency of the estimator is immediate to check,unbiasedness is subtle, see e.g Proposition 7.4.1 in Del Moral. Thevariance of this estimator grows typically linealy with T (and notexponentially) because of lack of independence.




SMC2

PSMC

Breakthrough paper of Andrieu et al. (2011), based on theunbiasedness of the PF estimate of the likelihood.

Marginal PMCMC

From current point θn (and current PF estimate p(y |θn)):

1 Sample θp ∼ T (θn, dθp)

2 Run a PF so as to obtain p(y |θp), an unbiased estimate ofp(y |θp).

3 With probability 1 ∧ r , set θn+1 = θp, otherwise θn+1 = θnwith


p(θn)p(y |θn)T (θn, θp)




SMC2

Outline

1 Background


3 MC2 type methods


5 SMC2




SMC2

Objectives

1 to derive sequentially

p(θ, x1:t |y1:t), p(y1:t), for all t ∈ {1, . . . ,T}

2 to obtain a black box algorithm (automatic calibration).




SMC2

Main tools of our approach

Particle filter algorithms for state-space models (this will be toestimate the likelihood, for a fixed θ).

Iterated Batch Importance Sampling for sequential Bayesianinference for parameters (this will be the theoretical algorithmwe will try to approximate).

Both are sequential Monte Carlo (SMC) methods




SMC2

IBIS

SMC method for particle approximation of the sequence p(θ | y1:t)for t = 1 : T . PF is not going to work here by just pretending thatθ is a dynamic process with zero (or small) variance. Recall thepath degeneracy problem.

In the next slide we give the pseudo-code of the IBIS algorithm.Operations with superscript m must be understood as operationsperformed for all m ∈ 1 : Nθ, where Nθ is the total number ofθ-particles.




SMC2

Sample θm from p(θ) and set ωm ← 1. Then, at time t = 1, . . . ,T

(a) Compute the incremental weights and their weightedaverage

ut(θm) = p(yt |y1:t−1, θm), Lt =

1∑Nθm=1 ω

m×

Nθ∑m=1

ωmut(θm),

(b) Update the importance weights,

ωm ← ωmut(θm). (1)

(c) If some degeneracy criterion is fulfilled, sample θm

independently from the mixture distribution

1∑Nθm=1 ω

m

Nθ∑m=1

ωmKt (θm, ·) .

Finally, replace the current weighted particle system:

(θm, ωm)← (θm, 1).N. CHOPIN Intractability 37/ 54



SMC2

Observations

Cost of lack of ergodicity in θ: the occasional MCMC move

Still, in regular problems resampling happens at diminishingfrequency (logarithmically)

Kt is an MCMC kernel invariant wrt π(θ | y1:t). Itsparameters can be chosen using information from currentpopulation of θ-particles

Lt is a MC estimator of the model evidence

Infeasible to implement for state-space models: intractableincremental weights, and MCMC kernel




SMC2

Our algorithm: SMC2

We provide a generic (black box) algorithm for recovering thesequence of parameter posterior distributions, but as well filtering,smoothing and predictive.

We give next a pseudo-code; the code seems to only track theparameter posteriors, but actually it does all other jobs.Superficially, it looks an approximation of IBIS, but in fact it doesnot produce any systematic errors (unbiased MC).




SMC2

Sample θm from p(θ) and set ωm ← 1. Then, at timet = 1, . . . ,T ,

(a) For each particle θm, perform iteration t of the PF: If

t = 1, sample independently x1:Nx ,m1 from ψ1,θm , and

compute

p(y1|θm) =1

Nx

Nx∑n=1

w1,θ(xn,m1 );

If t > 1, sample(x1:Nx ,mt , a1:Nx ,m

t−1

)from ψt,θm

conditional on(x1:Nx ,m1:t−1 , a1:Nx ,m

1:t−2

), and compute

p(yt |y1:t−1, θm) =1

Nx

Nx∑n=1

wt,θ(xan,mt−1,m

t−1 , xn,mt ).




SMC2

(b) Update the importance weights,

ωm ← ωmp(yt |y1:t−1, θm)

(c) If some degeneracy criterion is fulfilled, sample(θm, x1:Nx ,m

1:t , a1:Nx1:t−1

)independently from

1∑Nθm=1 ω

m

Nθ∑m=1

ωmKt

{(θm, x1:Nx ,m

1:t , a1:Nx ,m1:t−1

), ·}

Finally, replace current weighted particle system:

(θm, x1:Nx ,m1:t , a1:Nx ,m

1:t−1 , ωm)← (θm, x1:Nx ,m

1:t , a1:Nx ,m1:t−1 , 1)




SMC2

Observations

It appears as approximation to IBIS. For Nx =∞ it is IBIS.

However, no approximation is done whatsoever. Thisalgorithm really samples from p(θ|y1:t) and all otherdistributions of interest. One would expect an increase of MCvariance over IBIS.

The validity of algorithm is essentially based on two results: i)the particles are weighted due to unbiasedness of PF estimatorof likelihood; ii) the MCMC kernel is appropriately constructedto maintain invariance wrt to an expanded distribution whichadmits those of interest as marginals; it is a Particle MCMCkernel.

The algorithm does not suffer from the path degeneracyproblem due to the MCMC updates




SMC2

The MCMC step

(a) Sample θ from proposal kernel, θ ∼ T (θ, d θ).

(b) Run a new PF for θ: sample independently(x1:Nx

1:t , a1:Nx1:t−1) from ψt,θ, and compute

Zt(θ, x1:Nx1:t , a1:Nx

1:t−1).

(c) Accept the move with probability

1 ∧p(θ)Zt(θ, x

1:Nx1:t , a1:Nx

1:t−1)T (θ, θ)

p(θ)Zt(θ, x1:Nx1:t , a1:Nx

1:t−1)T (θ, θ).

It can be shown that this is a standard Hastings-Metropolis kernelwith proposal

qθ(θ, x1:Nx1:t , a1:Nx

1:t ) = T (θ, θ)ψt,θ(x1:Nx1:t , a1:Nx

1:t )

invariant wrt to an extended distribution πt(θ, x1:Nx1:t , a1:Nx

1:t−1).




SMC2

Some advantages of the algorithm

Immediate estimates of filtering and predictive distributions

Immediate and sequential estimator of model evidence

Easy recovery of smoothing distributions

Principled framework for automatic calibration of Nx

Population Monte Carlo advantages




SMC2

Numerical illustrations: SV

Time

Squ

ared

obs

erva

tions

0

2

4

6

8

200 400 600 800 1000

(a)

Iterations

Acc

epta

nce

rate

s

0.0

0.2

0.4

0.6

0.8

1.0

0 200 400 600 800 1000

(b)

Iterations

Nx

100

200

300

400

500

600

700

800

0 200 400 600 800 1000

(c)

Figure: Squared observations (synthetic data set), acceptance rates, andillustration of the automatic increase of Nx .

See the model




SMC2


µ

Den

sity

0

2

4

6

8

T = 250

−1.0 −0.5 0.0 0.5 1.0

T = 500

−1.0 −0.5 0.0 0.5 1.0

T = 750

−1.0 −0.5 0.0 0.5 1.0

T = 1000

−1.0 −0.5 0.0 0.5 1.0

Figure: Concentration of the posterior distribution for parameter µ.




SMC2


Multifactor model

yt = µ+βvt+v1/2t εt+ρ1

k1∑j=1

e1,j+ρ2

k2∑j=1

e2,j−ξ(wρ1λ1+(1−w)ρ2λ2)

where vt = v1,t + v2,t , and (vi , zi )i=1,2 are following the samedynamics with parameters (wiξ,wiω

2, λi ) and w1 = w ,w2 = 1− w .




SMC2


Time

Squ

ared

obs

erva

tions

5

10

15

20

100 200 300 400 500 600 700

(a)

Iterations

Evi

denc

e co

mpa

red

to th

e on

e fa

ctor

mod

el−2

0

2

4

100 200 300 400 500 600 700

variableMulti factor without leverageMulti factor with leverage

(b)

Figure: S&P500 squared observations, and log-evidence comparisonbetween models (relative to the one-factor model).




SMC2

Final Remarks

A powerful framework

A generic algorithm for sequential estimation and stateinference in state space models: only requirements are to beable (a) to simulate the Markov transition fθ(xt |xt−1), and (b)to evaluate the likelihood term gθ(yt |xt).

The article is available on arXiv and our web pages

A package is available at:

http://code.google.com/p/py-smc2/.


http://code.google.com/p/py-smc2/



SMC2

Appendix




SMC2

Why does it work? - Intuition for t = 1

At time t = 1, the algorithm generates variables θm from the priorp(θ), and for each θm, the algorithm generates vectors x1:Nx ,m

1 ofparticles, from ψ1,θm(x1:Nx

1 ).




SMC2

Thus, the sampling space is Θ×XNx , and the actual “particles” ofthe algorithm are Nθ independent and identically distributed copiesof the random variable (θ, x1:Nx

1 ), with density:

p(θ)ψ1,θ(x1:Nx1 ) = p(θ)

Nx∏n=1

q1,θ(xn1 ).




SMC2

Then, these particles are assigned importance weightscorresponding to the incremental weight functionZ1(θ, x1:Nx

1 ) = N−1x

∑Nxn=1 w1,θ(xn1 ).

This means that, at iteration 1, the target distribution of thealgorithm should be defined as:

π1(θ, x1:Nx1 ) = p(θ)ψ1,θ(x1:Nx

1 )×Z1(θ, x1:Nx

1 )

p(y1),

where the normalising constant p(y1) is easily deduced from theproperty that Z1(θ, x1:Nx

1 ) is an unbiased estimator of p(y1|θ).




SMC2

Direct substitutions yield

π1(θ, x1:Nx1 ) =

p(θ)

p(y1)

Nx∏i=1

q1,θ(x i1)

{1

Nx

Nx∑n=1

µθ(xn1 )gθ(y1|xn1 )

q1,θ(xn1 )

}

=1

Nx

Nx∑n=1

p(θ)

p(y1)µθ(xn1 )gθ(y1|xn1 )

Nx∏

i=1,i 6=n

q1,θ(x i1)

and noting that, for the triplet (θ, x1, y1) of random variables,

p(θ)µθ(x1)gθ(y1|x1) = p(θ, x1, y1) = p(y1)p(θ|y1)p(x1|y1, θ)

one finally gets that:

π1(θ, x1:Nx1 ) =

p(θ|y1)

Nx

Nx∑n=1

p(xn1 |y1, θ)

Nx∏

i=1,i 6=n

q1,θ(x i1)

.




SMC2

By a simple induction, one sees that the target density πt atiteration t ≥ 2 should be defined as:

πt(θ, x1:Nx1:t , a1:Nx

1:t−1) = p(θ)ψt,θ(x1:Nx1:t , a1:Nx

1:t−1)×Zt(θ, x

1:Nx1:t , a1:Nx

1:t−1)

p(y1:t)

and the following Proposition




SMC2

Proposition

The probability density πt may be written as:

πt(θ, x1:Nx1:t , a1:Nx

1:t−1) = p(θ|y1:t)

× 1

Nx

Nx∑n=1

p(xn1:t |θ, y1:t)Nt−1x

Nx∏i=1

i 6=hnt (1)

q1,θ(x i1)

×

t∏

s=2

Nx∏i=1

i 6=hnt (s)

Wais−1

s−1,θqs,θ(x is |xais−1

s−1 )