Transcript of the slide deck "An Overview of Parallel and Distributed MCMC Methods" by Cheng Li (DSAP, National University of Singapore).

Page 1

An Overview of Parallel and Distributed MCMC Methods

Cheng Li

DSAP, National University of Singapore

Page 2

Outline

Divide and Conquer Bayes

Numerical Performance

Theory

Improvement

Open Problems

Page 3

Bayesian Analysis

Given the data Y^{(n)}, choosing a model (likelihood) p(Y^{(n)} | θ) and a prior π(θ), the posterior is

  \pi_n(\theta \mid Y^{(n)}) = \frac{p(Y^{(n)} \mid \theta)\,\pi(\theta)}{\int_\Theta p(Y^{(n)} \mid \theta)\,\pi(\theta)\,d\theta} = \frac{p(Y^{(n)} \mid \theta)\,\pi(\theta)}{p(Y^{(n)})}

Θ is the parameter space (a metric space, usually Euclidean).

- θ could be moderate to high-dimensional.
- The integral in the denominator is usually intractable.
- The model (likelihood) may contain latent variables.
- Two classes of approaches:
  1. Posterior approximations (Laplace approximation, Variational Bayes, Expectation Propagation)
  2. Posterior sampling (MCMC, HMC, etc.)
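As a toy illustration of the intractable denominator (my own example, not from the slides, assuming a 1-d Normal-mean model), the normalizing integral can be brute-forced on a grid; it is exactly this computation that becomes infeasible for moderate to high-dimensional θ:

```python
import numpy as np

# Assumed toy model: Y_i ~ N(theta, 1), prior theta ~ N(0, 5^2).
rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=100)

theta_grid = np.linspace(-5.0, 5.0, 2001)             # grid over the parameter space
log_lik = np.sum(-0.5 * (y[None, :] - theta_grid[:, None]) ** 2, axis=1)
log_prior = -0.5 * (theta_grid / 5.0) ** 2
unnorm = np.exp(log_lik - log_lik.max() + log_prior)  # stabilize before exponentiating

dx = theta_grid[1] - theta_grid[0]
posterior = unnorm / (unnorm.sum() * dx)              # grid version of the denominator
print("posterior mean ~", np.sum(theta_grid * posterior) * dx)
```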

Page 4

Big Data Challenges

- Big data poses multiple challenges to Bayesian analysis.
- Big n, big p, ...
- For posterior sampling: slow computation, bad mixing, ...

Page 5

Scalable Bayes

For i.i.d. Y^{(n)},

  \pi_n(\theta \mid Y^{(n)}) = \frac{p(Y^{(n)} \mid \theta)\,\pi(\theta)}{\int_\Theta p(Y^{(n)} \mid \theta)\,\pi(\theta)\,d\theta}, \qquad \log p(Y^{(n)} \mid \theta) = \sum_{i=1}^{n} \log p(Y_i \mid \theta)

Difficulty:

- In every update of θ, we have to scan all n observations and calculate log p(Y_i | θ) for i = 1, ..., n (see the sketch below).
- This happens for both posterior approximation methods and posterior sampling methods.
- MCMC is a sequential algorithm. It is not parallelizable in nature.
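A minimal random-walk Metropolis sketch (my own illustration, with an assumed toy Normal model) makes the O(n) per-update scan visible:

```python
import numpy as np

# Every accept/reject step below scans all n observations: this O(n) per-update
# cost is what divide-and-conquer methods try to avoid.
rng = np.random.default_rng(1)
y = rng.normal(1.0, 1.0, size=100_000)          # assumed toy data: Y_i ~ N(theta, 1)

def log_post(theta):
    # full-data log-likelihood (an O(n) scan) plus a flat prior on theta
    return -0.5 * np.sum((y - theta) ** 2)

theta, lp, draws = 0.0, log_post(0.0), []
for _ in range(1_000):
    prop = theta + 0.01 * rng.normal()          # random-walk proposal
    lp_prop = log_post(prop)                    # O(n) scan on every iteration
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    draws.append(theta)
print("posterior mean estimate:", np.mean(draws[500:]))
```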

Page 6

Distributed Bayes

- In this talk, we discuss developments in distributed Bayesian methods, in particular divide-and-conquer (D&C) Bayes.
- The basic idea of D&C Bayes:
  - The whole dataset is partitioned into many subsets.
  - Every subset can be handled by a machine. A posterior sampler is run on each subset in parallel.
- Advantage: simple, feasible, and efficient.
- Disadvantage: often provides an approximate posterior instead of the exact posterior.

Page 7

Divide and Conquer

Frequentist: kernel ridge regression (Zhang, Duchi, and Wainwright 15' JMLR; Yang, Pilanci, and Wainwright 15' AOS; etc.)

Page 8

Divide and Conquer Bayes

Idea: Run posterior samplers in parallel on disjoint subsets of the data, and combine the subset posterior distributions into a global posterior.

Page 9

Divide and Conquer Bayes

- Consensus Monte Carlo (CMC, Huang and Gelman 05', Scott et al. 13')
- Weierstrass sampler (WS, Wang and Dunson 14')
- Nonparametric and semiparametric density product (NDP, SDP, Neiswanger et al. 14')
- Median posterior (Minsker et al. 14')
- Wasserstein posterior (WASP, Srivastava et al. 15')
- Posterior interval estimation (PIE, Li et al. 16')
- Random partition trees (Wang and Dunson 15')
- Likelihood inflated sampling algorithm (LISA, Entezari et al. 16')
- Gaussian process based method (Nemeth and Sherlock 16')
- Double-parallel Monte Carlo (DPMC, Xue and Liang 17')

Page 10

Divide and Conquer Bayes

When the data are all independent,

  \text{Posterior} \propto \text{Likelihood} \times \text{Prior} = \prod_{j=1}^{K} (\text{$j$th subset likelihood}) \times \text{prior}

  \propto \prod_{j=1}^{K} \left[(\text{$j$th subset likelihood}) \times \text{prior}^{1/K}\right] \qquad \text{(CMC, WS, NDP, SDP)}

  \propto \prod_{j=1}^{K} \left[(\text{$j$th subset likelihood})^{K} \times \text{prior}\right]^{1/K} \qquad \text{(WASP, PIE, LISA, DPMC)}

Page 11

Modified Prior or Likelihood

- Modifying the prior by using prior^{1/K} (CMC, NDP, SDP):
  - The full posterior is the product of all subset posteriors.
  - No need to change the complicated likelihood.
  - Issues: not invariant to model reparametrization; each subset posterior has a much larger variance than the full posterior.
- Modifying the likelihood by using (jth subset likelihood)^K (WASP, PIE, DPMC):
  - Invariant to model reparametrization.
  - Every subset posterior becomes a "noisy" approximation to the full posterior.
  - The full posterior is the geometric mean of all subset posterior densities.
  - Issues: difficult to modify complicated likelihoods with many layers of latent variables.

Page 12

Notation

- The full i.i.d. dataset is partitioned into K subsets, with equal subset sample size m. The total sample size is n = Km, and Y^{(n)} = (Y_{[1]}, ..., Y_{[j]}, ..., Y_{[K]}).
- The jth subset posterior in CMC, NDP, SDP is defined by

  \pi_m^{K}(\theta \mid Y_{[j]}) = \frac{p(Y_{[j]} \mid \theta)\,\pi^{1/K}(\theta)}{\int_\Theta p(Y_{[j]} \mid \theta)\,\pi^{1/K}(\theta)\,d\theta}. \quad (1)

- The jth subset posterior in WASP, PIE, LISA, DPMC is defined by

  \pi_m^{K}(\theta \mid Y_{[j]}) = \frac{[p(Y_{[j]} \mid \theta)]^{K}\,\pi(\theta)}{\int_\Theta [p(Y_{[j]} \mid \theta)]^{K}\,\pi(\theta)\,d\theta}. \quad (2)

- We only have discrete draws from the K subset posterior distributions. No analytical form is available. (A conjugate sketch of (1) and (2) follows.)
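To make (1) and (2) concrete, here is a minimal sketch (my own illustration, assuming a conjugate Normal-mean model Y_i ~ N(θ, 1) with prior N(0, τ²), so both subset posteriors have closed forms): with a diffuse prior, the variance of (1) is roughly K times the full-posterior variance, while (2) roughly matches it.

```python
import numpy as np

# Assumed conjugate model: Y_i ~ N(theta, 1), prior theta ~ N(0, tau2).
# Raising the likelihood to the power K, or the prior to the power 1/K,
# preserves conjugacy, so (1) and (2) are both Normal here.
rng = np.random.default_rng(2)
K, m, tau2 = 20, 500, 100.0
n = K * m
y = rng.normal(1.0, 1.0, size=n).reshape(K, m)   # K subsets of size m

def normal_posterior(weighted_sum, precision, prior_var):
    post_var = 1.0 / (precision + 1.0 / prior_var)
    return post_var * weighted_sum, post_var      # (mean, variance)

m1, v1 = normal_posterior(y[0].sum(), m, K * tau2)       # (1): prior^(1/K) = N(0, K*tau2)
m2, v2 = normal_posterior(K * y[0].sum(), K * m, tau2)   # (2): likelihood raised to power K
mf, vf = normal_posterior(y.sum(), n, tau2)              # full posterior, for comparison
print(f"(1) var {v1:.2e}   (2) var {v2:.2e}   full var {vf:.2e}")
```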

Page 13

D&C Bayes Algorithm in 3 Steps

For the parameter of interest θ ∈ R^d:

- Step 1: Modify either the prior or the likelihood at the subset level.
- Step 2: Run posterior samplers on the K subsets of data, on K machines in parallel; draw posterior samples of θ from each subset posterior.
- Step 3: Combine the K subset posterior distributions.
- Output: a combined approximate posterior distribution of θ.

For independent data, D&C Bayes methods are often asymptotically valid for approximating the true posterior of θ given the full data.
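A minimal end-to-end sketch of the three steps (my own illustration; `run_subset_sampler` is a hypothetical stand-in for a real MCMC routine, and the combination step is naive draw-averaging purely to show the plumbing):

```python
import numpy as np
from multiprocessing import Pool

def run_subset_sampler(args):
    # Hypothetical stand-in for one subset's MCMC run (Step 2): here we draw
    # exactly from the Normal subset posterior (2) under a flat prior.
    y_j, K, T = args
    m = len(y_j)
    rng = np.random.default_rng()
    return rng.normal(y_j.mean(), np.sqrt(1.0 / (K * m)), size=T)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    K, m, T = 10, 1000, 2000
    subsets = np.array_split(rng.normal(1.0, 1.0, size=K * m), K)   # partition the data
    with Pool(processes=K) as pool:                                  # Step 2: in parallel
        subset_draws = pool.map(run_subset_sampler, [(s, K, T) for s in subsets])
    combined = np.mean(subset_draws, axis=0)                         # Step 3: naive combine
    print("combined posterior mean:", combined.mean())
```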

Page 14

Consensus Monte Carlo

Huang and Gelman 05', Scott et al. 13'; R package parallelMCMCcombine.

- Pretend that all the subset posteriors are Gaussian (approximately correct under the Bernstein-von Mises theorem).
- Let Ω_j be the sample covariance matrix of \pi_m^{K}(\theta \mid Y_{[j]}), j = 1, ..., K.
- If θ_j denotes a draw from \pi_m^{K}(\theta \mid Y_{[j]}), then CMC calculates

  \bar{\theta} = \Big(\sum_{j=1}^{K} \Omega_j^{-1}\Big)^{-1} \sum_{j=1}^{K} \Omega_j^{-1}\,\theta_j.

- This combination is close to exact when the subset posteriors are exactly Gaussian (a sketch follows).
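A minimal sketch of the combination formula above on stored draws (my own illustration; the R package parallelMCMCcombine cited on this slide provides the real implementation):

```python
import numpy as np

def consensus_combine(subset_draws):
    # subset_draws[j]: (T x d) array of draws from the j-th subset posterior
    precisions = [np.linalg.inv(np.cov(d, rowvar=False)) for d in subset_draws]
    total = np.linalg.inv(np.sum(precisions, axis=0))     # (sum_j Omega_j^{-1})^{-1}
    combined = np.empty_like(subset_draws[0])
    for t in range(combined.shape[0]):                    # weight the t-th draw of each subset
        combined[t] = total @ np.sum(
            [P @ d[t] for P, d in zip(precisions, subset_draws)], axis=0)
    return combined

# Toy check with Gaussian "subset posteriors" (illustrative assumption):
rng = np.random.default_rng(4)
draws = [rng.multivariate_normal([1.0, -1.0], 0.1 * np.eye(2), size=2000) for _ in range(5)]
print("combined mean:", consensus_combine(draws).mean(axis=0))
```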

Page 15

Kernel-based Method: NDP and SDP

Nonparametric and semiparametric density product (Neiswanger et al. 14'); R package parallelMCMCcombine.

- Use kernel density estimation to fit every subset posterior, and then multiply the estimates together.
- Given posterior draws \{\theta_{jt}\}_{t=1}^{T} for j = 1, ..., K,

  \hat{\pi}_m^{K}(\theta \mid Y_{[j]}) = T^{-1} \sum_{t=1}^{T} \frac{N_d(\theta \mid \theta_{jt}, h^2 I_d)\, N_d(\theta \mid \mu_j, \Omega_j)}{N_d(\theta_{jt} \mid \mu_j, \Omega_j)}, \qquad \hat{\pi}_{SDP}(\theta \mid Y) = \prod_{j=1}^{K} \hat{\pi}_m^{K}(\theta \mid Y_{[j]}).

- Eventually, \hat{\pi}_{SDP}(\theta \mid Y) can be written in the mixture form

  \hat{\pi}_{SDP}(\theta \mid Y) = \sum_{t_1=1}^{T} \cdots \sum_{t_K=1}^{T} w_{t\cdot}\, N_d(\theta \mid \mu_{t\cdot}, \Omega_{t\cdot}).

- With T^K mixture components, the complexity is high in K (but can be reduced to linear in K).
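A 1-d sketch of the density-product idea (my own illustration on a shared evaluation grid; the actual NDP/SDP algorithms sample the product mixture directly rather than using a grid):

```python
import numpy as np

def kde_on_grid(draws, grid, h):
    # Gaussian KDE: average of Normal kernels centered at the draws
    z = (grid[:, None] - draws[None, :]) / h
    return np.exp(-0.5 * z**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(5)
K, T, h = 5, 2000, 0.05
grid = np.linspace(0.0, 2.0, 4001)
subset_draws = [rng.normal(1.0, 0.3, size=T) for _ in range(K)]   # toy subset posteriors

# multiply the K density estimates on the grid, then renormalize
log_prod = sum(np.log(kde_on_grid(d, grid, h) + 1e-300) for d in subset_draws)
dens = np.exp(log_prod - log_prod.max())
dens /= np.trapz(dens, grid)
print("density-product posterior mean:", np.trapz(grid * dens, grid))
```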

Page 16

Kernel-based Method: Weierstrass Sampler

Weierstrass sampler (WS, Wang and Dunson 14')

- The subset posteriors are multiplied together via the Weierstrass transformation:

  W_h[\pi_j](\theta) = \int \frac{1}{\sqrt{2\pi h^2}} \exp\Big(-\frac{(\theta-\xi)^2}{2h^2}\Big)\, \pi_m^{K}(\xi \mid Y_{[j]})\, d\xi, \qquad \lim_{h \to 0} W_h[\pi_j] = \pi_m^{K},

  \pi(\theta \mid Y) \propto \prod_{j=1}^{K} \pi_m^{K}(\theta \mid Y_{[j]}) \approx \prod_{j=1}^{K} W_h[\pi_j](\theta) \propto \int \prod_{j=1}^{K} \frac{1}{\sqrt{2\pi h^2}} \exp\Big(-\frac{(\theta-\xi_j)^2}{2h^2}\Big)\, \pi_m^{K}(\xi_j \mid Y_{[j]})\, d\xi_1 \cdots d\xi_K.

- WS targets the augmented model with (θ, ξ_1, ..., ξ_K):

  \pi_h(\theta, \xi_1, \ldots, \xi_K \mid Y) \propto \prod_{j=1}^{K} \frac{1}{\sqrt{2\pi h^2}} \exp\Big(-\frac{(\theta-\xi_j)^2}{2h^2}\Big)\, \pi_m^{K}(\xi_j \mid Y_{[j]}).
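A rough Gibbs-style sketch of this augmented target (my own illustration, not the sampler of Wang and Dunson 14'): under the Gaussian kernels, θ | ξ ~ N(mean(ξ), h²/K) exactly, while each ξ_j-conditional is approximated here by importance resampling from stored subset draws.

```python
import numpy as np

# theta | xi  ~ N(mean(xi), h^2/K)                                 (exact conditional)
# xi_j  | theta: resample stored subset draws with weights
#                proportional to exp(-(theta - draw)^2 / (2 h^2))  (approximation)
rng = np.random.default_rng(6)
K, T, h, iters = 5, 2000, 0.05, 500
subset_draws = [rng.normal(1.0, 0.3, size=T) for _ in range(K)]  # toy subset posteriors

xi = np.array([d[0] for d in subset_draws])
thetas = []
for _ in range(iters):
    theta = rng.normal(xi.mean(), h / np.sqrt(K))
    for j, d in enumerate(subset_draws):
        w = np.exp(-0.5 * ((theta - d) / h) ** 2)
        xi[j] = rng.choice(d, p=w / w.sum())
    thetas.append(theta)
print("WS sketch posterior mean:", np.mean(thetas[100:]))
```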

Page 17

Geometric Center Methods

Median posterior (Minsker et al. 14'), WASP (Srivastava et al. 15'), PIE (Li et al. 16')

- Use the modified likelihood instead of the modified prior.
- Approximate the full posterior using a geometric center of the subset posteriors.
- The full posterior is

  \pi(\theta \mid Y) \propto \left[\prod_{j=1}^{K} \pi_m^{K}(\theta \mid Y_{[j]})\right]^{1/K}.

- This geometric mean is difficult to calculate from discrete subset MCMC draws.
- Instead, we use the Wasserstein barycenter as a replacement.

Page 18

Geometric Center Methods

Example: logistic regression using WASP.

[Figure: bivariate posterior contours of (θ_1, θ_2), overlaying the full-data MCMC, the subset posteriors, and WASP.]

Page 19

Wasserstein Distance

- Wasserstein-p (W_p) distance between two measures μ and ν:

  W_p(\mu, \nu) = \inf\left\{\left[\mathbb{E}\,\|X - Y\|^p\right]^{1/p} : \text{law}(X) = \mu,\ \text{law}(Y) = \nu\right\}.

- Wasserstein-p barycenter of K probability measures ν_1, ..., ν_K:

  \bar{\mu} = \arg\min_{\mu \in \mathcal{P}_p} \frac{1}{K} \sum_{j=1}^{K} W_p^p(\mu, \nu_j),

  where \mathcal{P}_p is the space of all probability measures with finite pth moment.
- If the underlying space is R, then for two distributions F_1, F_2:

  W_p(F_1, F_2) = \left\{\int_0^1 \left|F_1^{-1}(u) - F_2^{-1}(u)\right|^p du\right\}^{1/p}.
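The 1-d formula translates directly into a few lines on empirical quantiles; a minimal sketch (my own illustration; the quantile-grid resolution is an arbitrary choice):

```python
import numpy as np

def wasserstein_p(x, y, p=2, n_grid=1000):
    # W_p between two samples via the quantile formula on the real line
    u = (np.arange(n_grid) + 0.5) / n_grid         # quantile levels in (0, 1)
    qx, qy = np.quantile(x, u), np.quantile(y, u)  # empirical F_1^{-1}, F_2^{-1}
    return np.mean(np.abs(qx - qy) ** p) ** (1.0 / p)

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, size=5000)
y = rng.normal(0.5, 1.0, size=5000)
print("W_2 estimate:", wasserstein_p(x, y))        # ~0.5 for these two Normals
```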

Page 20

Median Posterior

Median posterior (Minsker et al. 14’), R package Mposterior.

- The median posterior is the W_1 barycenter of the K subset posteriors.
- It is proposed based on two motivations:
  - faster convergence rate of the geometric median than individual estimators;
  - robustness to outliers.
- It uses kernel embedding: every subset posterior is represented by a kernel-based expansion in some Hilbert space H, and the W_1 distance is defined on H.
- The returned approximate posterior distribution is a linear combination of subset posteriors, i.e.

  \hat{\pi}(\theta \mid Y) = \sum_{j=1}^{K} w_j\, \pi_m^{K}(\theta \mid Y_{[j]}).

- Weiszfeld's algorithm is used to search for the optimal weights w_1, ..., w_K of the geometric median in H (a Euclidean sketch follows).
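A minimal Euclidean sketch of Weiszfeld's iteration (my own illustration on raw points; Minsker et al. run the analogous update with RKHS distances between embedded subset posteriors, which yields the weights w_j):

```python
import numpy as np

def geometric_median(points, iters=100, eps=1e-9):
    # Weiszfeld's algorithm: iteratively reweighted averaging
    y = points.mean(axis=0)                        # start at the ordinary mean
    for _ in range(iters):
        dist = np.linalg.norm(points - y, axis=1)
        w = 1.0 / np.maximum(dist, eps)            # inverse-distance weights
        w /= w.sum()
        y = (w[:, None] * points).sum(axis=0)
    return y, w

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])  # one gross outlier
med, w = geometric_median(pts)
print("geometric median:", med, " weights:", np.round(w, 3))        # outlier downweighted
```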

Page 21

Wasserstein Posterior (WASP)

Wasserstein posterior (WASP, Srivastava et al. 15'); GitHub code available.

- WASP is the W_2 barycenter of the K subset posteriors.
- WASP is motivated by the nice computational property of the W_2 barycenter of discrete distributions: it can be cast as an optimization problem and solved using linear programming, related to optimal transport.
- The WASP of K subset posteriors can be accurately approximated by

  \hat{\pi}(\cdot \mid Y) = \sum_{j=1}^{K} \sum_{t=1}^{T} w_{jt}\, \delta_{\theta_{jt}}(\cdot),

  where the weights w_{jt} need to be optimized, and \delta_{\theta_{jt}}(\cdot) is the Dirac measure at the subset posterior draw θ_{jt}.

Page 22

Posterior Interval Estimation (PIE)

Posterior Interval Estimation (PIE, Li et al. 16'); GitHub code available.

- Also computes the W_2 barycenter of the K subset posteriors.
- Simplifies the WASP computation for the posterior of one-dimensional functionals of θ.
- PIE finds the approximate posterior by averaging the subset posterior quantiles:

  \hat{Q}_{\pi}(u \mid Y^{(n)}) = \frac{1}{K} \sum_{j=1}^{K} Q_{\pi_m^{K}}(u \mid Y_{[j]}), \qquad u \in (0, 1).
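A minimal sketch of this quantile-averaging step on stored 1-d subset draws (my own illustration):

```python
import numpy as np

def pie_combine(subset_draws, n_grid=1000):
    # average the K empirical quantile functions at each level u in (0, 1);
    # the returned values are equally weighted quantiles of the W_2 barycenter
    u = (np.arange(n_grid) + 0.5) / n_grid
    return np.mean([np.quantile(d, u) for d in subset_draws], axis=0)

rng = np.random.default_rng(8)
subset_draws = [rng.normal(1.0 + 0.05 * j, 0.3, size=2000) for j in range(10)]  # toy subsets
combined = pie_combine(subset_draws)
lo, hi = np.quantile(combined, [0.025, 0.975])
print(f"combined 95% credible interval: [{lo:.3f}, {hi:.3f}]")
```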

Page 23

Divide and Conquer Bayes

PIE (Li, Srivastava, and Dunson 16'): average the subset posterior empirical quantiles, leading to the "Wasserstein-2 barycenter" as an approximation to the full-data posterior.

Page 24

Outline

Divide and Conquer Bayes

Numerical Performance

Theory

Improvement

Open Problems

Page 25

Example: Linear Mixed Effects (lme) Model

- y_i | β, u_i, τ² ∼ N_{s_i}(X_i β + Z_i u_i, τ² I_{s_i}), u_i ∼ N_r(0, Σ).
- s_i is the number of observations for individual i (i = 1, ..., n); s = Σ_{i=1}^n s_i.
- X_i ∈ R^{s_i × p} and Z_i ∈ R^{s_i × r} are observed; β is the fixed effect; u_i is the random effect for individual i.
- (p, r) = (4, 3) and (80, 6), so that the number of parameters in β and Σ is 10 and about 100, respectively.
- n = 6000, s = 10^5, and the s_i are randomly allocated.
- Criterion: bivariate posterior distributions for the covariance parameters in Σ.
- Methods compared: CMC, SDP, WASP, SGLD (Welling and Teh 11'), SA (Lee and Wand 16'), ADVI (Kucukelbir et al. 15'), and full-data MCMC (gold standard).
- K = 10, 20 for CMC, SDP, WASP.

Page 26

Example: lme Model (r = 3)

[Figure: bivariate posterior contours for (σ_12, σ_13), (σ_12, σ_23), and (σ_13, σ_23), for K = 10 and K = 20 with r = 3; legend: ADVI, CMC, MCMC, SA, SDP, SGLD (2000), SGLD (4000), WASP.]

✓ CMC, SDP, WASP, SA, and ADVI agree with the full-data MCMC.
✗ SGLD with batch sizes 2000 and 4000 failed to find the true parameters.

Page 27

Example: lme Model (r = 6)

[Figure: bivariate posterior contours for (σ_12, σ_13), (σ_12, σ_23), and (σ_13, σ_23), for K = 10 and K = 20 with r = 6; legend: ADVI, CMC, MCMC, SA, SDP, SGLD (2000), SGLD (4000), WASP.]

✓ CMC, SDP, WASP, and SA agree with the full-data MCMC.
✗ ADVI and SGLD with batch sizes 2000 and 4000 failed to find the true parameters.

Page 28

Example: lme Model for Movie Lens Data

[Figure: bivariate posterior contours for (σ_12, σ_13), (σ_12, σ_14), (σ_12, σ_15), and (σ_12, σ_16); legend: ADVI, CMC, MCMC, SA, SDP, SGLD (2000), SGLD (4000), WASP.]

✓ Only WASP agrees with the full-data MCMC.
✗ CMC, SDP, SA, ADVI, and SGLD with batch sizes 2000 and 4000 failed to find the true parameters.

Page 29

Outline

Divide and Conquer Bayes

Numerical Performance

Theory

Improvement

Open Problems

Page 30

Convergence Rate Result

For the median posterior (p = 1) and WASP (p = 2),

  \mathbb{E}_{Y} \int \|\theta - \theta_0\|^p\, \pi(\theta \mid Y)\, d\theta \le C\, \frac{\log^c m}{m}.

Here m is the subset sample size.

- This holds for regular parametric models under very general conditions. For WASP, the data are only required to be independent, not necessarily i.i.d.
- The rate is close to optimal if K = O(log^a n), but becomes suboptimal when K = O(n^b) for 0 < b < 1. The optimal rate for the full posterior is O(1/n).
- The fundamental issue is that the combination step incurs a bias, which slows the convergence rate from O(1/n) down to about O(1/m).

Page 31

Effects of Stochastic Approximation

- Blue: subset posteriors; red: full posterior.
- Dashed lines: usual subset posteriors with no rescaling. Solid lines: stochastic approximations.


Page 36

Bernstein-von Mises Result

- For PIE, let ξ = ξ(θ) be a 1-d functional. Then

  W_2\big(\pi_{PIE}(\xi \mid Y^{(n)}),\; N(\xi;\, \bar{\xi},\, [n I_\xi(\theta_0)]^{-1})\big) = O(1/n),

  W_2\big(\pi(\xi \mid Y^{(n)}),\; N(\xi;\, \hat{\xi},\, [n I_\xi(\theta_0)]^{-1})\big) = O(1/n),

  W_2\big(\pi_{PIE}(\xi \mid Y^{(n)}),\; \pi(\xi \mid Y^{(n)})\big) = O(1/m).

- \bar{\xi} is the average of the subset MLEs; \hat{\xi} is the global MLE of ξ. Their difference is of order 1/m.
- This bias is intrinsic to almost all D&C Bayes methods, including CMC, WASP, and PIE.

Page 37

Double Parallel MC

Double-parallel Monte Carlo (DPMC, Xue and Liang 17')

- Recenter the subset posteriors by subtracting the posterior biases, then perform simple averaging:

  \hat{\pi}(\theta \mid Y) = \frac{1}{K} \sum_{j=1}^{K} \pi_m^{K}\big(\theta + (\bar{\mu} - \bar{\mu}_j) \mid Y_{[j]}\big), \qquad \bar{\mu}_j = T^{-1} \sum_{t=1}^{T} \theta_{jt}, \quad \bar{\mu} = K^{-1} \sum_{j=1}^{K} \bar{\mu}_j.

- In addition to the parallelization in D&C Bayes, DPMC has another layer of parallelization from the population stochastic approximation Monte Carlo (pop-SAMC) sampler used on each subset.
- The algorithm is implemented in OpenMP. (A recentering sketch on stored draws follows.)
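A minimal sketch of the recentering-and-averaging step applied to stored draws (my own illustration; the pop-SAMC layer of DPMC is not shown):

```python
import numpy as np

def dpmc_combine(subset_draws):
    # shift each subset's draws so all subset means coincide at the grand mean, then pool
    mus = np.array([d.mean(axis=0) for d in subset_draws])    # subset posterior means
    mu_bar = mus.mean(axis=0)                                 # grand mean over subsets
    return np.concatenate([d + (mu_bar - mu_j) for d, mu_j in zip(subset_draws, mus)])

rng = np.random.default_rng(9)
# toy subset posteriors whose means are jittered around 1.0
subset_draws = [rng.normal(1.0 + 0.1 * rng.normal(), 0.3, size=(2000, 1)) for _ in range(10)]
pooled = dpmc_combine(subset_draws)
print("DPMC sketch mean:", pooled.mean(), " sd:", pooled.std())
```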

Page 38

Outline

Divide and Conquer Bayes

Numerical Performance

Theory

Improvement

Open Problems

Page 39

Improve the Accuracy of D&C Bayes

- Although all the aforementioned methods provide reasonably good approximations to the full posterior, they are still approximate and not exact.
- Furthermore, they may have difficulty when the posteriors are not "regularly" shaped (when justification through asymptotics does not work).
- Can we fit the posterior more accurately, without resorting to asymptotics?
- Nemeth and Sherlock 16': find a good approximate posterior and then perform importance sampling.

Page 40

GP-based Method

Merge subset posteriors using Gaussian processes (Nemeth and Sherlock 16')

- Model each subset posterior density as a Gaussian process (GP).
- Fit a GP model to each log \pi_m^{K}(\theta \mid Y_{[j]}), j = 1, ..., K, with quadratic mean functions.
- Pretend that all GP models are independent. Add the K GPs together by adding their mean functions and covariance functions. The approximate posterior \pi_E(\theta) is then log-normally distributed for each fixed θ.
- Run an HMC sampler to sample from \pi_E(\theta); obtain draws \{\theta_t\}_{t=1}^{T}.
- For estimating E[h(θ) | Y], reweight \{\theta_t\}_{t=1}^{T} using self-normalized importance sampling weights w_t \propto \pi(\theta_t \mid Y) / \pi_E(\theta_t), where the true posterior \pi(\theta_t \mid Y) is evaluated by multiplying \pi_m^{K}(\theta_t \mid Y_{[j]}) across j = 1, ..., K. (A sketch of this step follows.)
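A sketch of the final reweighting step (my own illustration; `log_subset_dens` and `log_pi_E` are hypothetical callables standing in for the evaluated log subset posterior densities and the fitted approximate posterior):

```python
import numpy as np

def snis_estimate(h, draws, log_subset_dens, log_pi_E):
    # self-normalized importance sampling from pi_E toward the product posterior
    log_w = np.array([sum(f(t) for f in log_subset_dens) - log_pi_E(t) for t in draws])
    log_w -= log_w.max()                     # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()                             # self-normalized weights
    return np.sum(w * np.array([h(t) for t in draws]))

# Toy check: K = 5 subset "posterior" kernels N(1, 1), proposal pi_E = N(0, 1);
# the product posterior is N(1, 1/5), so the estimate should be near 1.
rng = np.random.default_rng(10)
draws = rng.normal(0.0, 1.0, size=5000)
log_subset = [lambda t: -0.5 * (t - 1.0) ** 2 for _ in range(5)]
log_pi_E = lambda t: -0.5 * t**2
print("E[theta | Y] estimate:", snis_estimate(lambda t: t, draws, log_subset, log_pi_E))
```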

Page 41

Example: Warped Gaussian Distribution

GP-HMC fits better than CMC, NDP, SDP, WS.

Page 42

Example: 2-Component Mixture of Gaussians

GP-HMC still fits better than CMC, NDP, SDP, WS.

Page 43

Outline

Divide and Conquer Bayes

Numerical Performance

Theory

Improvement

Open Problems

Page 44

Open Problems

- Big p in the data: some success (split-and-merge BVS, Song and Liang 15'), but in general very few results.
- How to determine the best partition.
- Scalability at the latent-variable level.
- Dependent data (spatial, temporal).
- Hogwild! Gibbs; D&C Bayes with limited communication.

Thank you! Questions?
