Divide and Conquer Bayes Numerical Performance Theory Improvement Open Problems
An Overview of Parallel and Distributed MCMC Methods
Cheng Li
DSAP, National University of Singapore
Outline
Divide and Conquer Bayes
Numerical Performance
Theory
Improvement
Open Problems
Bayesian Analysis
Given the data Y^(n), a model (likelihood) p(Y^(n) | θ), and a prior π(θ), the posterior is

  π_n(θ | Y^(n)) = p(Y^(n) | θ) π(θ) / ∫_Θ p(Y^(n) | θ) π(θ) dθ = p(Y^(n) | θ) π(θ) / p(Y^(n)).

Θ is the parameter space (a metric space, usually Euclidean).
- θ could be moderate to high-dimensional.
- The integral in the denominator is usually intractable.
- The model (likelihood) may contain latent variables.
- Two classes of approaches:
  1. Posterior approximations (Laplace approximation, variational Bayes, expectation propagation)
  2. Posterior sampling (MCMC, HMC, etc.)
Big Data Challenges
- Big data poses multiple challenges to Bayesian analysis.
- Big n, big p, ...
- For posterior sampling: slow computation, bad mixing, ...
Scalable Bayes
For i.i.d. Y^(n),

  π_n(θ | Y^(n)) = p(Y^(n) | θ) π(θ) / ∫_Θ p(Y^(n) | θ) π(θ) dθ,   log p(Y^(n) | θ) = Σ_{i=1}^n log p(Y_i | θ).

Difficulty:
- In every update of θ, we have to scan all n observations and calculate log p(Y_i | θ) for i = 1, ..., n.
- This happens for both posterior approximation methods and posterior sampling methods.
- MCMC is a sequential algorithm; it is not parallelizable by nature.
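The O(n) per-iteration cost can be made concrete with a minimal sketch (all numbers and the Gaussian-mean model are hypothetical, not from the talk): a single Metropolis step needs full-data scans of the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = rng.normal(loc=1.0, scale=1.0, size=n)  # i.i.d. data with unknown mean

def full_log_lik(theta):
    # One evaluation must scan all n observations:
    # log p(Y^(n) | theta) = sum_i log p(Y_i | theta)
    return -0.5 * np.sum((y - theta) ** 2)  # up to an additive constant

# A single Metropolis step evaluates the full-data log-likelihood at the
# current state and at the proposal, so the cost per iteration is O(n).
theta, step = 0.0, 0.01
prop = theta + step * rng.normal()
log_accept_ratio = full_log_lik(prop) - full_log_lik(theta)  # flat prior
accept = np.log(rng.uniform()) < log_accept_ratio
```

Because each iteration depends on the previous state, these scans cannot simply be spread over iterations in parallel.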
Distributed Bayes
- In this talk, we discuss developments in distributed Bayesian methods, in particular divide-and-conquer (D&C) Bayes.
- The basic idea of D&C Bayes:
  - The whole dataset is partitioned into many subsets.
  - Every subset can be handled by one machine. A posterior sampler is run on each subset in parallel.
- Advantage: simple, feasible, and efficient.
- Disadvantage: often provides an approximate posterior instead of the exact posterior.
Divide and Conquer
Frequentist: kernel ridge regression (Zhang, Duchi and Wainwright 15' JMLR; Yang, Pilanci, and Wainwright 15' AOS, etc.)
Divide and Conquer Bayes
Idea: Run posterior samplers in parallel on disjoint subsets of the data, and combine the subset posterior distributions into a global posterior.
Divide and Conquer Bayes
- Consensus Monte Carlo (CMC, Huang and Gelman 05', Scott et al. 13')
- Weierstrass sampler (WS, Wang and Dunson 14')
- Nonparametric and semiparametric density product (NDP, SDP, Neiswanger et al. 14')
- Median posterior (Minsker et al. 14')
- Wasserstein posterior (WASP, Srivastava et al. 15')
- Posterior interval estimation (PIE, Li et al. 16')
- Random partition trees (Wang and Dunson 15')
- Likelihood inflated sampling algorithm (LISA, Entezari et al. 16')
- Gaussian process based method (Nemeth and Sherlock 16')
- Double-parallel Monte Carlo (DPMC, Xue and Liang 17')
Divide and Conquer Bayes
When the data are all independent,

  Posterior ∝ Likelihood × Prior
            ∝ [∏_{j=1}^K (jth subset likelihood)] × prior
            ∝ ∏_{j=1}^K [jth subset likelihood × prior^{1/K}]          (CMC, WS, NDP, SDP)
            ∝ ∏_{j=1}^K [(jth subset likelihood)^K × prior]^{1/K}      (WASP, PIE, LISA, DPMC)
Modified Prior or Likelihood
- Modifying the prior by using prior^{1/K} (CMC, NDP, SDP):
  - The full posterior is the product of all subset posteriors.
  - No need to change the complicated likelihood.
  - Issues: not invariant to model reparametrization; each subset posterior has a much larger variance than the full posterior.
- Modifying the likelihood by using (jth subset likelihood)^K (WASP, PIE, DPMC):
  - Invariant to model reparametrization.
  - Every subset posterior becomes a "noisy" approximation to the full posterior.
  - The full posterior is the geometric mean of all subset posterior densities.
  - Issues: difficult to modify complicated likelihoods with many layers of latent variables.
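The variance behavior of the two modifications can be checked in closed form in a conjugate normal-normal sketch (all numbers hypothetical): with prior^{1/K} a subset posterior is roughly K times more diffuse than the full posterior, while with likelihood^K its spread matches the full posterior.

```python
import numpy as np

# Conjugate sketch: y_i ~ N(theta, 1), prior theta ~ N(0, tau2), so every
# posterior variance has the closed form 1 / (data precision + prior precision).
n, K = 10_000, 10
m = n // K            # equal subset sample size
tau2 = 100.0

def post_var(n_obs, prior_var):
    return 1.0 / (n_obs / 1.0 + 1.0 / prior_var)

full_var = post_var(n, tau2)               # full-data posterior variance
raised_prior_var = post_var(m, K * tau2)   # subset with prior^{1/K}: prior precision / K
raised_lik_var = post_var(K * m, tau2)     # subset with likelihood^K: data precision x K

ratio_prior = raised_prior_var / full_var  # roughly K: much more diffuse
ratio_lik = raised_lik_var / full_var      # roughly 1: a "noisy" full posterior
```

The likelihood^K rescaling is exactly the stochastic-approximation trick used by WASP, PIE, LISA, and DPMC.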
Notation
- The full i.i.d. dataset is partitioned into K subsets with equal subset sample size m; the total sample size is n = Km. Y^(n) = {Y^[1], ..., Y^[j], ..., Y^[K]}.
- The jth subset posterior in CMC, NDP, SDP is defined by

  π_m(θ | Y^[j]) = p(Y^[j] | θ) π^{1/K}(θ) / ∫_Θ p(Y^[j] | θ) π^{1/K}(θ) dθ.   (1)

- The jth subset posterior in WASP, PIE, LISA, DPMC is defined by

  π_m(θ | Y^[j]) = [p(Y^[j] | θ)]^K π(θ) / ∫_Θ [p(Y^[j] | θ)]^K π(θ) dθ.   (2)

- We only have discrete draws from the K subset posterior distributions; no analytical form is available.
D&C Bayes Algorithm in 3 Steps
For the parameter of interest θ ∈ R^d:
- Step 1: Modify either the prior or the likelihood at the subset level.
- Step 2: Run posterior samplers on the K subsets of data, on K machines in parallel, and draw posterior samples from each subset posterior.
- Step 3: Combine the K subset posterior distributions.
- Output: a combined approximate posterior distribution of θ.

For independent data, D&C Bayes methods are often asymptotically valid for approximating the true posterior of θ given the full data.
Consensus Monte Carlo
Huang and Gelman 05', Scott et al. 13'; R package parallelMCMCcombine.
- Pretend that all the subset posteriors are Gaussian (approximately correct under the Bernstein–von Mises theorem).
- Let Ω_j be the sample covariance matrix of π_m(θ | Y^[j]), j = 1, ..., K.
- If θ_j denotes a draw from π_m(θ | Y^[j]), then CMC calculates

  θ̄ = (Σ_{j=1}^K Ω_j^{-1})^{-1} Σ_{j=1}^K Ω_j^{-1} θ_j.

- This combination is close to exact when the subset posteriors are exactly Gaussian.
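A minimal sketch of the precision-weighted combination rule, on hypothetical stand-in subset draws (the clouds below are simulated, not real subset posteriors):

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 10, 500

# Stand-in subset posterior draws: K clouds of 2-d samples with slight offsets.
subset_draws = [rng.normal(loc=0.01 * j, scale=1.0, size=(T, 2)) for j in range(K)]

def consensus_combine(draws_list):
    # Weight the t-th draw of each subset by the inverse sample covariance of
    # that subset posterior, then form the precision-weighted average.
    precisions = [np.linalg.inv(np.cov(draws.T)) for draws in draws_list]
    total_inv = np.linalg.inv(sum(precisions))
    out = np.empty_like(draws_list[0])
    for t in range(draws_list[0].shape[0]):
        weighted = sum(P @ draws[t] for P, draws in zip(precisions, draws_list))
        out[t] = total_inv @ weighted
    return out

theta_consensus = consensus_combine(subset_draws)
```

Note the combined draws are much more concentrated than any single subset's draws, reflecting the pooling of information across the K subsets.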
Kernel-based Method: NDP and SDP
Nonparametric and semiparametric density product (Neiswanger et al. 14'); R package parallelMCMCcombine.
- Use kernel density estimation to fit every subset posterior, and then multiply the estimates together.
- Given posterior draws {θ_jt}_{t=1}^T for j = 1, ..., K,

  π_m(θ | Y^[j]) = T^{-1} Σ_{t=1}^T N_d(θ | θ_jt, h² I_d) N_d(θ | μ_j, Ω_j) / N_d(θ_jt | μ_j, Ω_j),
  π_SDP(θ | Y) = ∏_{j=1}^K π_m(θ | Y^[j]).

- Eventually, π_SDP(θ | Y) can be written in the form

  π_SDP(θ | Y) = Σ_{t_1=1}^T ... Σ_{t_K=1}^T w_{t·} N_d(θ | μ_{t·}, Ω_{t·}).

- Complexity is high in K (but can be reduced to linear in K).
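The density-product idea can be sketched on a grid in one dimension (a simplification of NDP: the draws, bandwidth, and grid below are all hypothetical, and the full mixture-form combination is replaced by pointwise multiplication of plain KDEs):

```python
import numpy as np

rng = np.random.default_rng(2)
K, T, h = 5, 400, 0.1

# Stand-in 1-d subset posterior draws, roughly overlapping.
subset_draws = [rng.normal(loc=0.05 * j, scale=1.0, size=T) for j in range(K)]
grid = np.linspace(-4.0, 4.0, 801)
dx = grid[1] - grid[0]

def log_kde(draws, grid, h):
    # log of a Gaussian kernel density estimate evaluated on the grid
    z = (grid[:, None] - draws[None, :]) / h
    dens = np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))
    return np.log(dens + 1e-300)  # floor avoids log(0) in the tails

# Density product: sum the K log-KDEs pointwise, then renormalize on the grid.
log_prod = sum(log_kde(d, grid, h) for d in subset_draws)
log_prod -= log_prod.max()  # stabilize before exponentiating
density = np.exp(log_prod)
density /= density.sum() * dx
```

Working in logs is essential here: the product of K densities underflows quickly as K grows.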
Kernel-based Method: Weierstrass Sampler
Weierstrass sampler (WS, Wang and Dunson 14')
- The subset posteriors are multiplied together via the Weierstrass transformation:

  W_h[π_j] = ∫ (√(2π) h)^{-1} exp(−(θ − ξ)²/(2h²)) π_m(ξ | Y^[j]) dξ,   lim_{h→0} W_h[π_j] = π_m,

  π(θ | Y) = ∏_{j=1}^K π_m(θ | Y^[j]) ≈ ∏_{j=1}^K W_h[π_j]
           ∝ ∫ ∏_{j=1}^K (√(2π) h)^{-1} exp(−(θ − ξ_j)²/(2h²)) π_m(ξ_j | Y^[j]) dξ_1 ... dξ_K.

- WS targets the augmented model with (θ, ξ_1, ..., ξ_K):

  π_h(θ, ξ_1, ..., ξ_K | Y) ∝ ∏_{j=1}^K (√(2π) h)^{-1} exp(−(θ − ξ_j)²/(2h²)) π_m(ξ_j | Y^[j]).
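The transform itself is just a convolution with a N(0, h²) kernel, so W_h[π_j] → π_j as h → 0. A grid-based numerical check on a Gaussian stand-in for one subset posterior (all numbers hypothetical):

```python
import numpy as np

grid = np.linspace(-3.0, 3.0, 1201)
dx = grid[1] - grid[0]

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def weierstrass(pi_vals, h):
    # numerical convolution of the density with the N(0, h^2) kernel
    kernel = normal_pdf(grid[:, None], grid[None, :], h ** 2)
    return (kernel * pi_vals[None, :]).sum(axis=1) * dx

pi_j = normal_pdf(grid, 0.5, 0.04)  # a N(0.5, 0.2^2) subset posterior stand-in
# sup-norm error of W_h[pi_j] against pi_j shrinks as h -> 0
errs = [np.max(np.abs(weierstrass(pi_j, h) - pi_j)) for h in (0.5, 0.1, 0.02)]
```

For a Gaussian input the transform has a closed form (variance inflated by h²), which is what the shrinking error reflects.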
Geometric Center Methods
Median posterior (Minsker et al. 14'), WASP (Srivastava et al. 15'), PIE (Li et al. 16')
- Use the modified likelihood instead of the modified prior.
- Approximate the full posterior using different geometric centers.
- The full posterior is

  π(θ | Y) ∝ [∏_{j=1}^K π_m(θ | Y^[j])]^{1/K}.

- The geometric mean here is difficult to calculate based on discrete subset MCMC draws.
- Instead, we use the Wasserstein barycenter as a replacement.
Geometric Center Methods
Example: logistic regression using WASP

[Figure: bivariate posterior contours of (θ1, θ2), comparing the full-data MCMC posterior, the K subset posteriors, and the WASP.]
Wasserstein Distance
- Wasserstein-p (W_p) distance between two measures μ and ν:

  W_p(μ, ν) = inf { [E ‖X − Y‖^p]^{1/p} : law(X) = μ, law(Y) = ν }.

- Wasserstein-p barycenter of K probability measures ν_1, ..., ν_K:

  μ̄ = arg min_{μ ∈ P_p} (1/K) Σ_{j=1}^K W_p^p(μ, ν_j),

  where P_p is the space of all probability measures with finite pth moment.
- If the underlying space is R, then for two distributions F_1, F_2:

  W_p(F_1, F_2) = [∫_0^1 |F_1^{-1}(u) − F_2^{-1}(u)|^p du]^{1/p}.
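The one-dimensional quantile formula above gives a direct way to estimate W_p from samples; a minimal sketch (the samples and the fixed quantile levels are hypothetical choices):

```python
import numpy as np

def wasserstein_p_1d(x, y, p):
    # W_p in 1-d: L_p distance between the empirical quantile functions,
    # approximated on a fixed set of quantile levels in (0, 1).
    u = (np.arange(1, 201) - 0.5) / 200
    qx = np.quantile(x, u)
    qy = np.quantile(y, u)
    return np.mean(np.abs(qx - qy) ** p) ** (1.0 / p)

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=5000)
b = rng.normal(1.0, 1.0, size=5000)  # same shape, shifted by 1
w2 = wasserstein_p_1d(a, b, p=2)     # should be close to the shift, 1
```

For a pure location shift between two distributions of the same shape, W_p equals the size of the shift, which makes this a convenient sanity check.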
Median Posterior
Median posterior (Minsker et al. 14'); R package Mposterior.
- The median posterior is the W_1 barycenter (geometric median) of the K subset posteriors.
- It is proposed based on two motivations:
  - Faster convergence rate of the geometric median than individual estimators.
  - Robustness to outliers.
- It uses kernel embedding: every subset posterior is represented by a kernel-based expansion in some Hilbert space H, and the W_1 distance is defined on H.
- The returned approximate posterior distribution is a linear combination of subset posteriors, i.e. π(θ | Y) = Σ_{j=1}^K w_j π_m(θ | Y^[j]).
- Weiszfeld's algorithm is used to search for the optimal weights w_1, ..., w_K for the geometric median in H.
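Weiszfeld's algorithm is a simple fixed-point iteration; a sketch on plain points in R² (stand-ins for the embedded subset posteriors in H, with hypothetical coordinates) shows the robustness motivation: the geometric median resists an outlying point that drags the mean away.

```python
import numpy as np

def weiszfeld(points, iters=100, eps=1e-9):
    # Fixed-point iteration for the geometric median of K points:
    # repeatedly re-average the points with weights 1 / distance-to-current.
    x = points.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(points - x, axis=1)
        w = 1.0 / np.maximum(d, eps)  # eps guards against division by zero
        x = (w[:, None] * points).sum(axis=0) / w.sum()
    return x

# Three clustered "subset posteriors" and one outlier
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
gm = weiszfeld(pts)      # stays near the clustered points
mean = pts.mean(axis=0)  # the mean is dragged toward the outlier
```

In the median posterior the same iteration runs on kernel embeddings rather than raw points, but the weighting structure is identical.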
Wasserstein Posterior (WASP)
Wasserstein posterior (WASP, Srivastava et al. 15'); GitHub code available.
- WASP is the W_2 barycenter of the K subset posteriors.
- WASP is motivated by a nice computational property of the W_2 barycenter of discrete distributions: it can be cast as an optimization problem and solved using linear programming, related to optimal transport.
- The WASP of K subset posteriors can be accurately approximated by

  π̄(· | Y) = Σ_{j=1}^K Σ_{t=1}^T w_jt δ_{θ_jt}(·),

  where the weights w_jt need to be optimized, and δ_{θ_jt}(·) is the Dirac measure at the subset posterior draw θ_jt.
Posterior Interval Estimation (PIE)
Posterior Interval Estimation (PIE, Li et al. 16'); GitHub code available.
- Also computes the W_2 barycenter of the K subset posteriors.
- Simplifies the WASP computation for the posterior of 1-dimensional functionals of θ.
- PIE finds the approximate posterior by averaging the subset posterior quantiles:

  Q̄_π(u | Y^(n)) = (1/K) Σ_{j=1}^K Q_{π_m}(u | Y^[j]),   u ∈ (0, 1).
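The quantile-averaging step is trivially parallel and cheap; a minimal sketch on hypothetical stand-in subset draws of a 1-d functional ξ:

```python
import numpy as np

rng = np.random.default_rng(4)
K, T = 10, 1000

# Stand-in subset posterior draws: each roughly N(mu_j, 1) with jittered centers.
subset_draws = [rng.normal(loc=0.1 * rng.normal(), scale=1.0, size=T)
                for _ in range(K)]

u = np.linspace(0.005, 0.995, 199)  # quantile levels
subset_quantiles = np.array([np.quantile(d, u) for d in subset_draws])

# PIE: the combined quantile function is the plain average of the K subset
# posterior quantile functions -- the W2 barycenter in one dimension.
pie_quantiles = subset_quantiles.mean(axis=0)

# e.g. an approximate 95% posterior interval for xi:
lo = np.interp(0.025, u, pie_quantiles)
hi = np.interp(0.975, u, pie_quantiles)
```

Averaging quantile functions preserves monotonicity, so the combined quantiles always define a valid distribution.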
Divide and Conquer Bayes
PIE (Li, Srivastava, and Dunson 16'): average the subset posterior empirical quantiles, leading to the Wasserstein-2 barycenter as an approximation to the full-data posterior.
Outline
Divide and Conquer Bayes
Numerical Performance
Theory
Improvement
Open Problems
Example: Linear Mixed Effects (lme) Model
- y_i | β, u_i, τ² ∼ N_{s_i}(X_i β + Z_i u_i, τ² I_{s_i}), u_i ∼ N_r(0, Σ).
- s_i is the number of observations for individual i (i = 1, ..., n); s = Σ_{i=1}^n s_i.
- X_i ∈ R^{s_i×p} and Z_i ∈ R^{s_i×r} are observed; β is the fixed effects and u_i is the random effects for individual i.
- (p, r) = (4, 3), (80, 6), such that the number of parameters in β and Σ is 10 and 100.
- n = 6000, s = 10^5, and the s_i are randomly allocated.
- Criteria: bivariate posterior distributions for the covariance parameters in Σ.
- Methods to compare: CMC, SDP, WASP, SGLD (Welling and Teh 11'), SA (Lee and Wand 16'), ADVI (Kucukelbir et al. 15'), and full-data MCMC (the gold standard).
- K = 10, 20 for CMC, SDP, WASP.
Example: lme Model (r = 3)

[Figure: bivariate posterior contours for (σ12, σ13), (σ12, σ23), and (σ13, σ23) at K = 10 and K = 20, r = 3; legend: ADVI, CMC, MCMC, SA, SDP, SGLD (2000), SGLD (4000), WASP.]

✓ CMC, SDP, WASP, SA, and ADVI agree with the full-data MCMC.
✗ SGLD with batch sizes 2000 and 4000 failed to find the true parameters.
Example: lme Model (r = 6)

[Figure: bivariate posterior contours for (σ12, σ13), (σ12, σ23), and (σ13, σ23) at K = 10 and K = 20, r = 6; legend: ADVI, CMC, MCMC, SA, SDP, SGLD (2000), SGLD (4000), WASP.]

✓ CMC, SDP, WASP, and SA agree with the full-data MCMC.
✗ ADVI and SGLD with batch sizes 2000 and 4000 failed to find the true parameters.
Example: lme Model for MovieLens Data

[Figure: bivariate posterior contours for (σ12, σ13), (σ12, σ14), (σ12, σ15), and (σ12, σ16); legend: ADVI, CMC, MCMC, SA, SDP, SGLD (2000), SGLD (4000), WASP.]

✓ Only WASP agrees with the full-data MCMC.
✗ CMC, SDP, SA, ADVI, and SGLD with batch sizes 2000 and 4000 failed to find the true parameters.
Outline
Divide and Conquer Bayes
Numerical Performance
Theory
Improvement
Open Problems
Convergence Rate Result
For the median posterior (p = 1) and WASP (p = 2),

  E_Y ∫ ‖θ − θ_0‖^p π̄(θ | Y) dθ ≤ C log^c m / m.

Here m is the subset sample size.
- This holds for regular parametric models under very general conditions. For WASP, the data are only required to be independent, not necessarily i.i.d.
- The rate is close to optimal if K = O(log^a n), but it becomes suboptimal when K = O(n^b) for 0 < b < 1. The optimal rate for the full posterior is O(1/n).
- The fundamental issue is that the combination step incurs bias, which slows the convergence rate from O(1/n) to about O(1/m).
Effects of Stochastic Approximation
- Blue: subset posteriors; Red: full posterior.
- Dashed lines: the usual subset posteriors with no rescaling. Solid lines: stochastic approximations.
Bernstein von Mises Result
- For PIE, let ξ = ξ(θ) be a 1-d functional. Then

  W_2(π_PIE(ξ | Y^(n)), N(ξ; ξ̄, [n I_ξ(θ_0)]^{-1})) = O(1/n),
  W_2(π(ξ | Y^(n)), N(ξ; ξ̂, [n I_ξ(θ_0)]^{-1})) = O(1/n),
  W_2(π_PIE(ξ | Y^(n)), π(ξ | Y^(n))) = O(1/m).

- ξ̄ is the average of the subset MLEs and ξ̂ is the global MLE of ξ; their difference is of order 1/m.
- This bias is intrinsic to almost all D&C Bayes methods, including CMC, WASP, and PIE.
Double Parallel MC
Double parallel MC (DPMC, Xue and Liang 17')
- Recenter the subset posteriors by subtracting the subset posterior biases, and then perform simple averaging:

  π̄(θ | Y) = (1/K) Σ_{j=1}^K π_m(θ + (μ̄ − μ_j) | Y^[j]),
  μ_j = T^{-1} Σ_{t=1}^T θ_jt,   μ̄ = K^{-1} Σ_{j=1}^K μ_j.

- In addition to the parallelization in D&C Bayes, DPMC has another layer of parallelization from the population stochastic approximation Monte Carlo (pop-SAMC) sampler.
- The algorithm is implemented in OpenMP.
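The recentering step can be sketched in a few lines on hypothetical 1-d subset draws with artificial subset-specific biases; shifting every subset to the grand mean removes the between-subset spread before pooling.

```python
import numpy as np

rng = np.random.default_rng(5)
K, T = 8, 600

# Stand-in subset draws: subset j is centered at theta0 + b_j (hypothetical
# biases), with small within-subset spread.
theta0 = 2.0
biases = rng.normal(scale=0.3, size=K)
subset_draws = [theta0 + biases[j] + 0.1 * rng.normal(size=T) for j in range(K)]

mu_j = np.array([d.mean() for d in subset_draws])  # subset posterior means
mu_bar = mu_j.mean()                               # grand mean

# Recenter each subset's draws at the grand mean, then pool:
recentered = np.concatenate([d + (mu_bar - mu_j[j])
                             for j, d in enumerate(subset_draws)])
```

Because recentering zeroes out the between-subset component of the pooled variance, the recentered pool is never more dispersed than the raw pool.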
Outline
Divide and Conquer Bayes
Numerical Performance
Theory
Improvement
Open Problems
Improve the Accuracy of D&C Bayes
- Although all the aforementioned methods provide reasonably good approximations to the full posterior, they are still approximate, not exact.
- Furthermore, they may have difficulty when the posteriors are not "regularly" shaped (when justification through asymptotics does not work).
- Can we fit the posterior more accurately, without resorting to asymptotics?
- Nemeth and Sherlock 16': find a good approximate posterior and then perform importance sampling.
GP based Method
Merge subset posteriors using Gaussian processes (Nemeth and Sherlock 16')
- Model each log subset posterior density as a Gaussian process (GP).
- Fit a GP model to each log π_m(θ | Y^[j]), j = 1, ..., K, with quadratic mean functions.
- Pretend that all GP models are independent, and add the K GPs together by adding their mean functions and covariance functions. The approximate posterior π_E(θ) is then log-normally distributed for each fixed θ.
- Run an HMC sampler to sample from π_E(θ); obtain draws {θ_t}_{t=1}^T.
- To estimate E[h(θ) | Y], reweight {θ_t}_{t=1}^T using self-normalized importance sampling weights w_t = π(θ_t | Y)/π_E(θ_t), where the true posterior π(θ_t | Y) is evaluated by multiplying π_m(θ_t | Y^[j]) across j = 1, ..., K.
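The self-normalized reweighting step can be sketched in isolation (the GP fit is omitted; both log densities below are simple hypothetical stand-ins for the merged approximation π_E and the product-form target π):

```python
import numpy as np

rng = np.random.default_rng(6)

# Draws from the approximation pi_E = N(0.2, 1); the target is pi = N(0, 1).
draws = rng.normal(loc=0.2, scale=1.0, size=20_000)

def log_pi(theta):    # stand-in for sum_j log pi_m(theta | Y[j])
    return -0.5 * theta ** 2

def log_pi_E(theta):  # stand-in for the merged GP approximation
    return -0.5 * (theta - 0.2) ** 2

log_w = log_pi(draws) - log_pi_E(draws)
w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
w /= w.sum()                     # self-normalized importance weights

post_mean = np.sum(w * draws)    # reweighted estimate of E[theta | Y]
```

The closer π_E is to π, the flatter the weights and the larger the effective sample size; a poor π_E shows up as a handful of dominant weights.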
Example: Warped Gaussian Distribution
GP-HMC fits better than CMC, NDP, SDP, WS.
Example: 2-Component Mixture of Gaussians
GP-HMC still fits better than CMC, NDP, SDP, WS.
Outline
Divide and Conquer Bayes
Numerical Performance
Theory
Improvement
Open Problems
Open Problems
- Big p in the data: some success (split-and-merge BVS, Song and Liang 15'), but in general very few results
- How to determine the best partition
- Scalability on the latent-variable level
- Dependent data (spatial, temporal)
- Hogwild! Gibbs; D&C Bayes with limited communication

Thank you! Questions?