
Sequential Monte Carlo Methods

Frank Schorfheide
University of Pennsylvania

EABCN Training School

May 10, 2016



Introduction

• Unfortunately, “standard” MCMC can be inaccurate, especially in medium- and large-scale DSGE models, e.g., when

• disentangling the importance of internal versus external propagation mechanisms;

• determining the relative importance of shocks.

• Previously: modify MCMC algorithms to overcome weaknesses: blocking of parameters; tailoring of (mixture) proposal densities.

• Now, we use sequential Monte Carlo (SMC) (more precisely, sequential importance sampling) instead:

• better suited to handle irregular and multimodal posteriors associated with large DSGE models;

• algorithms can be easily parallelized.

• SMC = importance sampling on steroids. We build on
• theoretical work: Chopin (2004); Del Moral, Doucet, and Jasra (2006);
• applied work: Creal (2007); Durham and Geweke (2011, 2012).


Importance Sampling

• Approximate π(·) by using a different, tractable density g(θ) that is easy to sample from.

• For more general problems, the posterior density may be non-normalized, so we write

π(θ) = p(Y|θ) p(θ) / p(Y) = f(θ) / ∫ f(θ) dθ.

• Importance sampling is based on the identity

E_π[h(θ)] = ∫ h(θ) π(θ) dθ = [ ∫_Θ h(θ) (f(θ)/g(θ)) g(θ) dθ ] / [ ∫_Θ (f(θ)/g(θ)) g(θ) dθ ].

• The ratio

w(θ) = f(θ) / g(θ)

is called the (unnormalized) importance weight.


Importance Sampling

1 For i = 1, ..., N, draw θ^i iid∼ g(θ) and compute the unnormalized importance weights

w^i = w(θ^i) = f(θ^i) / g(θ^i).

2 Compute the normalized importance weights

W^i = w^i / ( (1/N) Σ_{j=1}^N w^j ).

An approximation of E_π[h(θ)] is given by

h̄_N = (1/N) Σ_{i=1}^N W^i h(θ^i).
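For concreteness, a minimal NumPy/SciPy sketch of these two steps; the target kernel f, the Student-t proposal g, and the function h are illustrative placeholders, not objects from the slides:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 100_000

# Placeholder ingredients: a N(0,1) kernel as the non-normalized target f,
# a fat-tailed Student-t proposal g, and the test function h(theta) = theta^2.
f = lambda th: np.exp(-0.5 * th ** 2)
h = lambda th: th ** 2

# Step 1: draw from g and compute unnormalized weights w^i = f(theta^i) / g(theta^i).
theta = stats.t.rvs(df=5, size=N, random_state=rng)
w = f(theta) / stats.t.pdf(theta, df=5)

# Step 2: normalize the weights and form h_bar = (1/N) sum_i W^i h(theta^i).
W = w / w.mean()
h_bar = np.mean(W * h(theta))
print(h_bar)   # close to E_pi[theta^2] = 1 for the N(0,1) target
```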


Illustration

If the θ^i's are draws from g(·), then

E_π[h] ≈ [ (1/N) Σ_{i=1}^N h(θ^i) w(θ^i) ] / [ (1/N) Σ_{i=1}^N w(θ^i) ],   w(θ) = f(θ) / g(θ).

[Figure: target density f and two proposal densities g1, g2 (left panel); the corresponding importance weights f/g1 and f/g2 (right panel).]


Accuracy

• Since we are generating iid draws from g(θ), it's fairly straightforward to derive a CLT:

• It can be shown that

√N (h̄_N − E_π[h]) ⇒ N(0, Ω(h)),   where Ω(h) = V_g[ (π/g)(h − E_π[h]) ].

• Using a crude approximation (see, e.g., Liu (2008)), we can factorize Ω(h) as follows:

Ω(h) ≈ V_π[h] ( V_g[π/g] + 1 ).

The approximation highlights that the larger the variance of the importance weights, the less accurate the Monte Carlo approximation, relative to the accuracy that could be achieved with an iid sample from the posterior.

• Users often monitor the effective sample size

ESS = N V_π[h] / Ω(h) ≈ N / (1 + V_g[π/g]).
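In practice the ESS is computed from the simulated weights themselves; a small helper, using the standard sample analogue of the formula above:

```python
import numpy as np

def effective_sample_size(w):
    """ESS from unnormalized weights via the sample analogue
    (sum_i w^i)^2 / sum_i (w^i)^2 of N / (1 + V_g[pi/g])."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.square(w).sum()
```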


From Importance Sampling to Sequential Importance Sampling

• In general, it's hard to construct a good proposal density g(θ),

• especially if the posterior has several peaks and valleys.

• Idea - Part 1: it might be easier to find a proposal density for

π_n(θ) = [p(Y|θ)]^{φ_n} p(θ) / ∫ [p(Y|θ)]^{φ_n} p(θ) dθ = f_n(θ) / Z_n,

at least if φ_n is close to zero.

• Idea - Part 2: we can try to turn a proposal density for π_n into a proposal density for π_{n+1} and iterate, letting φ_n → φ_{N_φ} = 1.
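This tempering recursion is easy to set up in code; a minimal sketch of the schedule φ_n = (n/N_φ)^λ used in the illustrations below and the tempered log posterior kernel ln f_n(θ) = φ_n ln p(Y|θ) + ln p(θ), where loglik and logprior stand in for user-supplied functions:

```python
import numpy as np

def tempering_schedule(n_phi, lam=2.0):
    """phi_n = (n / N_phi)^lambda for n = 0, ..., N_phi;
    lam = 1 is linear, lam > 1 puts more stages near phi = 0."""
    return (np.arange(n_phi + 1) / n_phi) ** lam

def log_tempered_kernel(theta, phi, loglik, logprior):
    """log f_n(theta) = phi * log p(Y|theta) + log p(theta)."""
    return phi * loglik(theta) + logprior(theta)
```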


Illustration:

• Our state-space model:

y_t = [1 1] s_t,
s_t = [[θ_1², 0], [(1 − θ_1²) − θ_1θ_2, (1 − θ_1²)]] s_{t−1} + [1, 0]' ε_t.

• Innovation: ε_t ∼ iid N(0, 1).

• Prior: uniform on the square 0 ≤ θ_1 ≤ 1 and 0 ≤ θ_2 ≤ 1.

• Simulate T = 200 observations given θ = [0.45, 0.45]', which is observationally equivalent to θ = [0.89, 0.22]'.
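A minimal simulation of this model (a sketch with my own function names; the observational equivalence is a property of the implied autocovariances, so the two parameter values generate distributionally identical data, not identical sample paths):

```python
import numpy as np

def simulate(theta, T=200, rng=None):
    """Simulate y_1, ..., y_T from the stylized state-space model above,
    starting the state at zero."""
    rng = rng or np.random.default_rng(0)
    th1, th2 = theta
    Phi = np.array([[th1 ** 2, 0.0],
                    [(1.0 - th1 ** 2) - th1 * th2, 1.0 - th1 ** 2]])
    s = np.zeros(2)
    y = np.empty(T)
    for t in range(T):
        s = Phi @ s + np.array([1.0, 0.0]) * rng.standard_normal()
        y[t] = s.sum()   # y_t = [1 1] s_t
    return y

# The two observationally equivalent parameter values from the slide:
y = simulate([0.45, 0.45])
```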


Illustration: Tempered Posteriors of θ1

[Figure: tempered posteriors of θ_1 across stages n = 10, 20, ..., 50.]

π_n(θ) = [p(Y|θ)]^{φ_n} p(θ) / ∫ [p(Y|θ)]^{φ_n} p(θ) dθ = f_n(θ) / Z_n,   φ_n = (n/N_φ)^λ.


Illustration: Posterior Draws

[Figure: posterior draws in the (θ_1, θ_2) unit square.]


SMC Algorithm: A Graphical Illustration

[Figure: schematic of the SMC recursion, cycling through C, S, and M steps across the tempering stages φ_0, φ_1, φ_2, φ_3.]

• π_n(θ) is represented by a swarm of particles {θ_n^i, W_n^i}_{i=1}^N:

h̄_{n,N} = (1/N) Σ_{i=1}^N W_n^i h(θ_n^i) → E_{π_n}[h(θ_n)] a.s.

• C is Correction; S is Selection; and M is Mutation.


SMC Algorithm

1 Initialization. (φ_0 = 0.) Draw the initial particles from the prior:

θ_1^i iid∼ p(θ) and W_1^i = 1, i = 1, ..., N.

2 Recursion. For n = 1, ..., N_φ,

1 Correction. Reweight the particles from stage n − 1 by defining the incremental weights

w̃_n^i = [p(Y|θ_{n−1}^i)]^{φ_n − φ_{n−1}}   (1)

and the normalized weights

W̃_n^i = w̃_n^i W_{n−1}^i / ( (1/N) Σ_{j=1}^N w̃_n^j W_{n−1}^j ),   i = 1, ..., N.   (2)

An approximation of E_{π_n}[h(θ)] is given by

h̃_{n,N} = (1/N) Σ_{i=1}^N W̃_n^i h(θ_{n−1}^i).   (3)

2 Selection. (See the next slide.)


SMC Algorithm

1 Initialization.

2 Recursion. For n = 1, ..., N_φ,

1 Correction.

2 Selection. (Optional resampling.) Let {θ̂_n^i}_{i=1}^N denote N iid draws from a multinomial distribution characterized by support points and weights {θ_{n−1}^i, W̃_n^i}_{i=1}^N, and set W_n^i = 1.

An approximation of E_{π_n}[h(θ)] is given by

ĥ_{n,N} = (1/N) Σ_{i=1}^N W_n^i h(θ̂_n^i).   (4)

3 Mutation. Propagate the particles {θ̂_n^i, W_n^i} via N_MH steps of an MH algorithm with transition density θ_n^i ∼ K_n(θ_n | θ̂_n^i; ζ_n) and stationary distribution π_n(θ). An approximation of E_{π_n}[h(θ)] is given by

h̄_{n,N} = (1/N) Σ_{i=1}^N h(θ_n^i) W_n^i.   (5)
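Putting the three steps together, here is a compact NumPy sketch of the recursion for a generic posterior. The callables loglik, logprior, and prior_draw are user-supplied placeholders; resampling is multinomial and triggered by an ESS threshold, and the mutation is a single-block random-walk MH step. This is one common variant of the algorithm above, not the exact tuned implementation behind the slides:

```python
import numpy as np

def smc(loglik, logprior, prior_draw, n_particles=1000, n_phi=50, lam=2.0,
        n_mh=1, c=0.5, ess_frac=0.5, rng=None):
    """Likelihood-tempering SMC: correction, (optional) selection, mutation.
    prior_draw(N, rng) must return an (N, d) array of prior draws."""
    rng = rng or np.random.default_rng(0)
    phi = (np.arange(n_phi + 1) / n_phi) ** lam            # tempering schedule
    theta = np.asarray(prior_draw(n_particles, rng), float)
    ll = np.array([loglik(th) for th in theta])            # log-likelihoods
    lp = np.array([logprior(th) for th in theta])          # log priors
    W = np.ones(n_particles)                               # weights, mean one
    log_mdd = 0.0
    for n in range(1, n_phi + 1):
        # Correction: incremental weights w~_n^i = p(Y|theta)^(phi_n - phi_{n-1})
        incr = np.exp((phi[n] - phi[n - 1]) * ll)
        log_mdd += np.log(np.mean(incr * W))               # ln p_hat(Y) piece
        W = incr * W / np.mean(incr * W)
        # Selection: multinomial resampling once the ESS drops too low
        if W.sum() ** 2 / np.square(W).sum() < ess_frac * n_particles:
            idx = rng.choice(n_particles, size=n_particles, p=W / W.sum())
            theta, ll, lp, W = theta[idx], ll[idx], lp[idx], np.ones(n_particles)
        # Mutation: n_mh random-walk MH steps targeting pi_n
        cov = np.atleast_2d(np.cov(theta, rowvar=False, aweights=W))
        chol = np.linalg.cholesky(cov + 1e-10 * np.eye(theta.shape[1]))
        for _ in range(n_mh):
            prop = theta + c * rng.standard_normal(theta.shape) @ chol.T
            ll_prop = np.array([loglik(th) for th in prop])
            lp_prop = np.array([logprior(th) for th in prop])
            acc = np.log(rng.uniform(size=n_particles)) < (
                phi[n] * (ll_prop - ll) + lp_prop - lp)
            theta[acc], ll[acc], lp[acc] = prop[acc], ll_prop[acc], lp_prop[acc]
    return theta, W, log_mdd
```

Note the design choice: weights are kept normalized to mean one, so the per-stage average of incremental weights feeds directly into the marginal data density estimate discussed later.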


Remarks

• Correction step:
• reweights particles from iteration n − 1 to create an importance sampling approximation of E_{π_n}[h(θ)].

• Selection step: the resampling of the particles
• (good) equalizes the particle weights and thereby increases the accuracy of subsequent importance sampling approximations;
• (not good) adds a bit of noise to the MC approximation.

• Mutation step:
• adapts the particles to the posterior π_n(θ);
• imagine we don't do it: then we would be using draws from the prior p(θ) to approximate the posterior π(θ), which can't be good!

[Figure: tempered posteriors of θ_1 across stages n, repeating the earlier illustration.]


Theoretical Properties

• Goal: a strong law of large numbers (SLLN) and a central limit theorem (CLT) as N → ∞ for every iteration n = 1, ..., N_φ.

• Regularity conditions:
• proper prior;
• bounded likelihood function;
• 2 + δ posterior moments of h(θ).

• Idea of proof (Chopin, 2004): proceed recursively.
• Initialization: SLLN and CLT for iid random variables, because we sample from the prior.
• Assume that the n − 1 approximation (with normalized weights) yields

√N ( (1/N) Σ_{i=1}^N h(θ_{n−1}^i) W_{n−1}^i − E_{π_{n−1}}[h(θ)] ) ⇒ N(0, Ω_{n−1}(h)).

• Show that

√N ( (1/N) Σ_{i=1}^N h(θ_n^i) W_n^i − E_{π_n}[h(θ)] ) ⇒ N(0, Ω_n(h)).


Theoretical Properties: Correction Step

• Suppose that the n − 1 approximation (with normalized weights) yields

√N ( (1/N) Σ_{i=1}^N h(θ_{n−1}^i) W_{n−1}^i − E_{π_{n−1}}[h(θ)] ) ⇒ N(0, Ω_{n−1}(h)).

• Then

√N ( [ (1/N) Σ_{i=1}^N h(θ_{n−1}^i) [p(Y|θ_{n−1}^i)]^{φ_n−φ_{n−1}} W_{n−1}^i ] / [ (1/N) Σ_{i=1}^N [p(Y|θ_{n−1}^i)]^{φ_n−φ_{n−1}} W_{n−1}^i ] − E_{π_n}[h(θ)] ) ⇒ N(0, Ω̃_n(h)),

where

Ω̃_n(h) = Ω_{n−1}( v_{n−1}(θ) (h − E_{π_n}[h]) ),   v_{n−1}(θ) = [p(Y|θ)]^{φ_n−φ_{n−1}} Z_{n−1}/Z_n.

• This step relies on likelihood evaluations from iteration n − 1 that are already stored in memory.


Theoretical Properties: Selection / Resampling

• After resampling by drawing from an iid multinomial distribution, we obtain

√N ( (1/N) Σ_{i=1}^N h(θ̂_n^i) W_n^i − E_{π_n}[h] ) ⇒ N(0, Ω̂_n(h)),

where

Ω̂_n(h) = Ω̃_n(h) + V_{π_n}[h].

• Disadvantage of resampling: it adds noise.

• Advantage of resampling: it equalizes the particle weights, reducing the variance of v_n(θ) in Ω̃_{n+1}(h) = Ω̂_n( v_n(θ) (h − E_{π_{n+1}}[h]) ).


Theoretical Properties: Mutation

• We are using the Markov transition kernel K_n(θ|θ̂) to transform the draws θ̂_n^i into draws θ_n^i.

• To preserve the distribution of the θ_n^i's, it has to be the case that

π_n(θ) = ∫ K_n(θ|θ̂) π_n(θ̂) dθ̂.

• It can be shown that the overall asymptotic variance after the mutation is the sum of
• the variance of the approximation of the conditional mean E_{K_n(·|θ̂_{n−1})}[h(θ)], which is given by

Ω̂_n( E_{K_n(·|θ̂_{n−1})}[h(θ)] );

• a weighted average of the conditional variance V_{K_n(·|θ̂_{n−1})}[h(θ)]:

∫ W_{n−1}(θ̂_{n−1}) v_{n−1}(θ̂_{n−1}) V_{K_n(·|θ̂_{n−1})}[h(θ)] π_{n−1}(θ̂_{n−1}) dθ̂_{n−1}.

• This step is easily parallelizable.


More on Transition Kernel in Mutation Step

• Transition kernel K_n(θ|θ̂_n; ζ_n): generated by running N_MH steps of a Metropolis-Hastings algorithm.

• Lessons from DSGE model MCMC:
• blocking of parameters can reduce the persistence of the Markov chain;
• a mixture proposal density avoids “getting stuck.”

• Blocking: partition the parameter vector θ_n into N_blocks equally sized blocks, denoted by θ_{n,b}, b = 1, ..., N_blocks. (We generate the blocks for n = 1, ..., N_φ randomly prior to running the SMC algorithm.)

• Example: random-walk proposal density:

ϑ_b | (θ_{n,b,m−1}^i, θ_{n,−b,m}^i, Σ*_{n,b}) ∼ N( θ_{n,b,m−1}^i, c_n² Σ*_{n,b} ).
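A sketch of the blocking and the blocked random-walk proposal; the helper names are mine, and Sigma stands in for a particle-based covariance estimate such as Σ*_n:

```python
import numpy as np

def random_blocks(d, n_blocks, rng):
    """Randomly partition the parameter indices 0, ..., d-1 into
    (roughly) equally sized blocks, fixed before the SMC run."""
    return np.array_split(rng.permutation(d), n_blocks)

def blocked_rw_proposals(theta, blocks, Sigma, c, rng):
    """For each block b, propose vartheta_b ~ N(theta_b, c^2 Sigma[b, b])
    while holding the remaining blocks fixed; the accept/reject decision
    happens block by block in the surrounding MH loop."""
    out = []
    for b in blocks:
        Sb = Sigma[np.ix_(b, b)]
        out.append((b, rng.multivariate_normal(theta[b], c ** 2 * Sb)))
    return out
```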


Adaptive Choice of ζ_n = (Σ*_n, c_n)

• Infeasible adaption:
• Let Σ*_n = V_{π_n}[θ].
• Adjust the scaling factor according to

c_n = c_{n−1} f( 1 − R_{n−1}(ζ_{n−1}) ),

where R_{n−1}(·) is the population rejection rate from iteration n − 1 and

f(x) = 0.95 + 0.10 e^{16(x−0.25)} / (1 + e^{16(x−0.25)}).

• Feasible adaption: use output from stage n − 1 to replace ζ_n by ζ̂_n:
• Use particle approximations of E_{π_n}[θ] and V_{π_n}[θ] based on {θ_{n−1}^i, W̃_n^i}_{i=1}^N.
• Use the actual rejection rate from stage n − 1 to calculate

ĉ_n = ĉ_{n−1} f( 1 − R̂_{n−1}(ζ̂_{n−1}) ).

• Result: under suitable regularity conditions, replacing ζ_n by ζ̂_n, where √N (ζ̂_n − ζ_n) = O_p(1), does not affect the asymptotic variance of the MC approximation.
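The adaption rule is easy to code; a sketch (the 0.25 target and the functional form of f are from the slide, the helper names are mine):

```python
import numpy as np

def f_adapt(x):
    """f(x) = 0.95 + 0.10 e^{16(x-0.25)} / (1 + e^{16(x-0.25)}):
    roughly 0.95 for acceptance rates below the 0.25 target and
    roughly 1.05 above it, so the scale drifts toward the target."""
    e = np.exp(16.0 * (x - 0.25))
    return 0.95 + 0.10 * e / (1.0 + e)

def update_scale(c_prev, rejection_rate):
    """c_n = c_{n-1} f(1 - R_{n-1}), fed with the (empirical)
    rejection rate from stage n-1."""
    return c_prev * f_adapt(1.0 - rejection_rate)
```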


Adaption of SMC Algorithm for Stylized State-Space Model

[Figure: three panels over stages n = 1, ..., 50: the acceptance rate, the scale parameter c, and the effective sample size.]

Notes: The dashed line in the top panel indicates the target acceptance rate of 0.25.


Convergence of SMC Approximation for Stylized State-Space Model

[Figure: N V[θ_1] and N V[θ_2] as a function of the number of particles N = 10,000, ..., 40,000, with error bands.]

Notes: The figure shows N V[θ_j] for each parameter as a function of the number of particles N. V[θ_j] is computed based on N_run = 1,000 runs of the SMC algorithm with N_φ = 100. The width of the bands is (2 · 1.96) √(3/N_run) (N V[θ_j]).


More on Resampling

• So far, we have used multinomial resampling. It's fairly intuitive, and it is straightforward to obtain a CLT.

• But: multinomial resampling is not particularly efficient.

• The book contains a section on alternative resampling schemes (stratified resampling, residual resampling, ...).

• These alternative techniques are designed to achieve a variance reduction.

• Most resampling algorithms are not parallelizable because they rely on the normalized particle weights.
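As an example of the lower-variance alternatives, a sketch of stratified resampling: one uniform is drawn in each of N equal subintervals of [0, 1] and mapped through the CDF of the normalized weights:

```python
import numpy as np

def stratified_resample(W, rng):
    """Stratified resampling: draw one uniform in each interval
    ((i-1)/N, i/N] and invert the weight CDF; this has lower variance
    than multinomial resampling."""
    N = len(W)
    u = (np.arange(N) + rng.uniform(size=N)) / N
    cdf = np.cumsum(W / np.sum(W))
    return np.searchsorted(cdf, u)   # indices of the resampled particles
```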


Application 1: Small Scale New Keynesian Model

• We will take a look at the effect of various tuning choices on accuracy:

• tempering schedule λ: λ = 1 is linear, λ > 1 is convex;

• number of stages N_φ versus number of particles N;

• number of blocks in the mutation step versus number of particles.


Effect of λ on Inefficiency Factors InEff_N[θ]

[Figure: hairs of InEff_N[θ], on a log scale from 10^0 to 10^4, as a function of λ = 1, ..., 8.]

Notes: The figure depicts hairs of InEff_N[θ] as a function of λ. The inefficiency factors are computed based on N_run = 50 runs of the SMC algorithm. Each hair corresponds to a DSGE model parameter.


Number of Stages N_φ vs Number of Particles N

[Figure: inefficiency factors by parameter (ρ_g, σ_r, ρ_r, ρ_z, σ_g, κ, σ_z, π^(A), ψ_1, γ^(Q), ψ_2, r^(A), τ) for the configurations N_φ = 400, N = 250; N_φ = 200, N = 500; N_φ = 100, N = 1000; N_φ = 50, N = 2000; N_φ = 25, N = 4000.]

Notes: Plot of V[θ]/V_π[θ] for a specific configuration of the SMC algorithm. The inefficiency factors are computed based on N_run = 50 runs of the SMC algorithm. N_blocks = 1, λ = 2, N_MH = 1.


Number of Blocks N_blocks in Mutation Step vs Number of Particles N

[Figure: inefficiency factors by parameter (ρ_g, σ_g, σ_z, σ_r, ρ_r, κ, ρ_z, π^(A), ψ_1, τ, γ^(Q), r^(A), ψ_2) for the configurations N_blocks = 4, N = 250; N_blocks = 2, N = 500; N_blocks = 1, N = 1000.]

Notes: Plot of V[θ]/V_π[θ] for a specific configuration of the SMC algorithm. The inefficiency factors are computed based on N_run = 50 runs of the SMC algorithm. N_φ = 100, λ = 2, N_MH = 1.


A Few Words on Posterior Model Probabilities

• Posterior model probabilities:

π_{i,T} = π_{i,0} p(Y_{1:T}|M_i) / Σ_{j=1}^M π_{j,0} p(Y_{1:T}|M_j),

where

p(Y_{1:T}|M_i) = ∫ p(Y_{1:T}|θ_{(i)}, M_i) p(θ_{(i)}|M_i) dθ_{(i)}.

• For any model:

ln p(Y_{1:T}|M_i) = Σ_{t=1}^T ln ∫ p(y_t|θ_{(i)}, Y_{1:t−1}, M_i) p(θ_{(i)}|Y_{1:t−1}, M_i) dθ_{(i)}.

• The marginal data density p(Y_{1:T}|M_i) arises as a by-product of SMC.
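Given log marginal data densities for a set of models, the posterior probabilities follow directly; a small helper (hypothetical names, with the usual max-subtraction to avoid underflow when exponentiating):

```python
import numpy as np

def posterior_model_probs(log_mdd, prior_probs):
    """pi_{i,T} = pi_{i,0} p(Y|M_i) / sum_j pi_{j,0} p(Y|M_j),
    computed from log marginal data densities."""
    x = np.log(np.asarray(prior_probs, float)) + np.asarray(log_mdd, float)
    x -= x.max()                     # stabilize before exponentiating
    p = np.exp(x)
    return p / p.sum()
```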


Marginal Likelihood Approximation

• Recall the incremental weights w̃_n^i = [p(Y|θ_{n−1}^i)]^{φ_n−φ_{n−1}}.

• Then

(1/N) Σ_{i=1}^N w̃_n^i W_{n−1}^i ≈ ∫ [p(Y|θ)]^{φ_n−φ_{n−1}} ( [p(Y|θ)]^{φ_{n−1}} p(θ) / ∫ [p(Y|θ)]^{φ_{n−1}} p(θ) dθ ) dθ = ∫ [p(Y|θ)]^{φ_n} p(θ) dθ / ∫ [p(Y|θ)]^{φ_{n−1}} p(θ) dθ.

• Thus,

∏_{n=1}^{N_φ} ( (1/N) Σ_{i=1}^N w̃_n^i W_{n−1}^i ) ≈ ∫ p(Y|θ) p(θ) dθ.
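In code, the estimate is just an accumulation of logged stage averages (the log_mdd accumulation in the SMC sketch earlier does exactly this on the fly):

```python
import numpy as np

def log_mdd(stage_means):
    """Given the per-stage averages (1/N) sum_i w~_n^i W_{n-1}^i for
    n = 1, ..., N_phi, return ln p_hat(Y) = sum_n ln(stage average)."""
    return float(np.sum(np.log(np.asarray(stage_means, dtype=float))))
```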


SMC Marginal Data Density Estimates

             N_φ = 100                     N_φ = 400
N        Mean(ln p(Y))  SD(ln p(Y))    Mean(ln p(Y))  SD(ln p(Y))
500         -352.19       (3.18)         -346.12        (0.20)
1,000       -349.19       (1.98)         -346.17        (0.14)
2,000       -348.57       (1.65)         -346.16        (0.12)
4,000       -347.74       (0.92)         -346.16        (0.07)

Notes: The table shows the mean and standard deviation of log marginal data density estimates as a function of the number of particles N, computed over N_run = 50 runs of the SMC sampler with N_blocks = 4, λ = 2, and N_MH = 1.


Application 2: Estimation of Smets and Wouters (2007) Model

• Benchmark macro model; has been estimated many (many) times.

• “Core” of many larger-scale models.

• 36 estimated parameters.

• RWMH: 10 million draws (5 million discarded); SMC: 500 stages with 12,000 particles.

• We run the RWMH (using a particular version of a parallelized MCMC) and the SMC algorithm on 24 processors for the same amount of time.

• We estimate the SW model twenty times using RWMH and SMC and get essentially identical results.


Application 2: Estimation of Smets and Wouters (2007) Model

• More interesting question: how does the quality of posterior simulators change as one makes the priors more diffuse?

• Replace Beta by Uniform distributions; increase the variances of parameters with Gamma and Normal priors by a factor of 3.


SW Model with DIFFUSE Prior: Estimation Stability, RWMH (black) versus SMC (red)

[Figure: spread across repeated runs of the estimates for l, ι_w, µ_p, µ_w, ρ_w, ξ_w, r_π; RWMH in black, SMC in red.]


A Measure of Effective Number of Draws

• Suppose we could generate iid N_eff draws from the posterior; then the Monte Carlo estimate Ê_π[θ] would satisfy

Ê_π[θ] approx∼ N( E_π[θ], (1/N_eff) V_π[θ] ).

• We can measure the variance of Ê_π[θ] by running the SMC and RWMH algorithms repeatedly.

• Then,

N_eff ≈ V_π[θ] / V[ Ê_π[θ] ].
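A sketch of this calculation from repeated runs (names are illustrative; posterior_var would itself be estimated, e.g., from pooled draws across runs):

```python
import numpy as np

def effective_draws(run_means, posterior_var):
    """N_eff ~= V_pi[theta] / V[E_hat_pi[theta]]: run_means collects the
    posterior-mean estimates from repeated runs of one sampler,
    posterior_var is an estimate of the posterior variance of theta."""
    return posterior_var / np.var(np.asarray(run_means, float), ddof=1)
```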


Effective Number of Draws

                     SMC                          RWMH
Parameter    Mean   STD(Mean)  N_eff      Mean   STD(Mean)  N_eff
σ_l          3.06     0.04      1058      3.04     0.15       60
l           -0.06     0.07       732     -0.01     0.16      177
ι_p          0.11     0.00       637      0.12     0.02       19
h            0.70     0.00       522      0.69     0.03        5
Φ            1.71     0.01       514      1.69     0.04       10
r_π          2.78     0.02       507      2.76     0.03      159
ρ_b          0.19     0.01       440      0.21     0.08        3
ϕ            8.12     0.16       266      7.98     1.03        6
σ_p          0.14     0.00       126      0.15     0.04        1
ξ_p          0.72     0.01        91      0.73     0.03        5
ι_w          0.73     0.02        87      0.72     0.03       36
µ_p          0.77     0.02        77      0.80     0.10        3
ρ_w          0.69     0.04        49      0.69     0.09       11
µ_w          0.63     0.05        49      0.63     0.09       11
ξ_w          0.93     0.01        43      0.93     0.02        8


A Closer Look at the Posterior: Two Modes

Parameter        Mode 1    Mode 2
ξ_w               0.844     0.962
ι_w               0.812     0.918
ρ_w               0.997     0.394
µ_w               0.978     0.267
Log Posterior   -804.14   -803.51

• Mode 1 implies that wage persistence is driven by extremely persistent exogenous wage markup shocks.

• Mode 2 implies that wage persistence is driven by endogenous amplification of shocks through the wage Calvo and indexation parameters.

• SMC is able to capture the two modes.


A Closer Look at the Posterior: Internal (ξ_w) versus External (ρ_w) Propagation

[Figure: scatter plot of posterior draws in the (ξ_w, ρ_w) plane.]


Stability of Posterior Computations: RWMH (black) versus SMC (red)

[Figure: across repeated runs, estimates of the probabilities P(ξ_w > ρ_w), P(ρ_w > µ_w), P(ξ_w > µ_w), P(ξ_p > ρ_p), P(ρ_p > µ_p), P(ξ_p > µ_p).]


Marginal Data Density

• Bayesian model evaluation is based on the marginal data density

p_n(Y) = ∫ [p(Y|θ)]^{φ_n} p(θ) dθ.

• Recall that the marginal data density arises as a by-product of the importance sampler, as the product of average incremental weights:

p̂_n(Y) = Ẑ_n = ∏_{j=1}^n ( (1/N) Σ_{i=1}^N w̃_j^i W_{j−1}^i ).

Algorithm, Method          MDD Estimate   Standard Deviation
SMC, Particle Estimate        -873.46           0.24
RWM, Harmonic Mean            -874.56           1.20


Application 3: News Shock Model of Schmitt-Grohe and Uribe (2012)

• Real business cycle model;

• investment adjustment costs;

• variable capacity utilization;

• Jaimovich-Rebelo (2009) preferences;

• permanent and stationary neutral and investment-specific technology shocks, government spending, wage markup, and preference shocks.

• Data: real GDP, consumption, investment, government spending, hours, TFP, and price-of-investment growth rates from 1955:Q2 to 2006:Q4.


Key Model Features

• News shocks: each of the 7 exogenous processes has two anticipated components:

z_t = ρ z_{t−1} + ε⁰_t + ε⁴_{t−4} + ε⁸_{t−8},

where ε⁰_t is unanticipated, ε⁴_{t−4} is anticipated 4 quarters ahead, and ε⁸_{t−8} is anticipated 8 quarters ahead. (A simulation sketch follows below.)

• Jaimovich-Rebelo preferences: households have CRRA preferences over the consumption bundle V_t:

V_t = C_t − bC_{t−1} − ψ h_t^θ S_t.

S_t is a geometric average of current and past habit-adjusted consumption levels:

S_t = (C_t − bC_{t−1})^γ (S_{t−1})^{1−γ}.

γ → 0: GHH preferences; the wealth effect on labor is small.

γ → 1: standard preferences.
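A minimal simulation of one such exogenous process (a sketch with my own names; the timing convention is that ε⁴ and ε⁸ are drawn 4 and 8 quarters before they move z_t):

```python
import numpy as np

def simulate_news_process(rho, sig0, sig4, sig8, T=200, rng=None):
    """Simulate z_t = rho z_{t-1} + eps0_t + eps4_{t-4} + eps8_{t-8}: the
    anticipated components enter agents' information sets before they hit."""
    rng = rng or np.random.default_rng(0)
    eps0 = sig0 * rng.standard_normal(T)
    eps4 = sig4 * rng.standard_normal(T)   # drawn at t, hits at t + 4
    eps8 = sig8 * rng.standard_normal(T)   # drawn at t, hits at t + 8
    z = np.zeros(T)
    for t in range(1, T):
        z[t] = (rho * z[t - 1] + eps0[t]
                + (eps4[t - 4] if t >= 4 else 0.0)
                + (eps8[t - 8] if t >= 8 else 0.0))
    return z
```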


Effective Number of Draws

                     SMC                          RWMH
Parameter    Mean   STD(Mean)  N_eff      Mean   STD(Mean)  N_eff
σ⁴_zi        3.14     0.04      4190      3.08     0.24      108
σ⁸_g         0.41     0.01      1830      0.41     0.03      108
σ⁸_zi        5.59     0.09      1124      5.68     0.30      102
θ            4.13     0.02       671      4.14     0.05      146
σ⁰_zi       12.27     0.09       640     12.36     0.09      616
σ⁰_µ         1.04     0.04       626      0.92     0.04      625
σ⁰_g         0.62     0.01       609      0.60     0.03      111
κ            9.32     0.05       578      9.33     0.09      208
σ⁴_ζ         2.43     0.09       406      2.44     0.04     2066
σ⁰_ζ         3.82     0.10       384      3.80     0.22       73
σ⁸_ζ         2.65     0.11       335      2.62     0.18      126
σ⁴_µ         4.26     0.24        49      4.33     0.49       12
σ⁸_µ         1.36     0.24        46      1.34     0.49       11


Histogram of Wage Markup Process Parameters: RWMH (black) versus SMC (red)

[Figure: four histogram panels for ρ_µ, σ⁰_µ, σ⁴_µ, and σ⁸_µ.]


Histogram of Preference Parameter γ: RWMH (black) versus SMC (red)

[Figure: histograms of γ on [0, 1].]

γ ≈ 0.6 implies that the importance of anticipated shocks for hours is substantially diminished.


An Alternative Prior for γ

• SGU use a Uniform prior for γ.

• We now change the prior to Beta(2, 1) (its density is a straight line between 0 and 1).


Histogram of Preference Parameter γ, Alternative Prior: RWMH (black) versus SMC (red)

[Figure: histograms of γ on [0, 1] under the Beta(2, 1) prior.]


Anticipated Shocks' Variance Shares for Hours: RWMH (black) versus SMC (red)

[Figure: two panels, "SGU Prior" and "Beta(2, 1) Prior", each showing the distribution of the anticipated shocks' variance share for hours on [0, 1].]