A Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions
Wei Deng
Department of Mathematics
Purdue University, West Lafayette, IN
[email protected]

Guang Lin
Departments of Mathematics & School of Mechanical Engineering
Purdue University, West Lafayette, IN
[email protected]

Faming Liang*
Department of Statistics
Purdue University, West Lafayette, IN
[email protected]
Abstract
We propose an adaptively weighted stochastic gradient Langevin dynamics algorithm (SGLD), so-called contour stochastic gradient Langevin dynamics (CSGLD), for Bayesian learning in big data statistics. The proposed algorithm is essentially a scalable dynamic importance sampler, which automatically flattens the target distribution such that the simulation for a multi-modal distribution can be greatly facilitated. Theoretically, we prove a stability condition and establish the asymptotic convergence of the self-adapting parameter to a unique fixed point, regardless of the non-convexity of the original energy function; we also present an error analysis for the weighted averaging estimators. Empirically, the CSGLD algorithm is tested on multiple benchmark datasets, including CIFAR10 and CIFAR100. The numerical results indicate its superiority over the existing state-of-the-art algorithms in training deep neural networks.
1 Introduction
AI safety has long been an important issue in the deep learning community. A promising solution to the problem is Markov chain Monte Carlo (MCMC), which leads to asymptotically correct uncertainty quantification for deep neural network (DNN) models. However, traditional MCMC algorithms [Metropolis et al., 1953, Hastings, 1970] are not scalable to the big datasets that deep learning models rely on, although they have achieved significant successes in many scientific areas such as statistical physics and bioinformatics. It was not until the study of stochastic gradient Langevin dynamics (SGLD) [Welling and Teh, 2011] that the scalability issue encountered in Monte Carlo computing for big data problems was resolved. Ever since, a variety of scalable stochastic gradient Markov chain Monte Carlo (SGMCMC) algorithms have been developed based on strategies such as Hamiltonian dynamics [Chen et al., 2014, Ma et al., 2015, Ding et al., 2014], Hessian approximation [Ahn et al., 2012, Li et al., 2016, Simsekli et al., 2016], and higher-order numerical schemes [Chen et al., 2015, Li et al., 2019]. Despite their theoretical guarantees in statistical inference [Chen et al., 2015, Teh et al., 2016, Vollmer et al., 2016] and non-convex optimization [Zhang et al., 2017, Raginsky et al., 2017, Xu et al., 2018], these algorithms often converge slowly, which makes them hard to use for efficient uncertainty quantification in many AI safety problems.
To develop more efficient SGMCMC algorithms, we seek inspiration from traditional MCMC algorithms, such as simulated annealing [Kirkpatrick et al., 1983], parallel tempering [Swendsen and Wang, 1986, Geyer, 1991], and flat histogram algorithms [Berg and Neuhaus, 1991, Wang and Landau, 2001]. In particular, simulated annealing decays the temperature to increase the hitting probability of the global optima [Mangoubi and Vishnoi, 2018], which, however, often gets stuck in a local optimum with a fast cooling schedule. Parallel tempering swaps the positions of neighboring Markov chains according to an acceptance-rejection rule. However, under the mini-batch setting, it often requires a large correction, which is known to deteriorate its performance [Deng et al., 2020a]. The flat histogram algorithms, such as the multicanonical [Berg and Neuhaus, 1991] and Wang-Landau [Wang and Landau, 2001] algorithms, were first proposed to sample discrete states of Ising models by yielding a flat histogram in the energy space, and were later extended into a general dynamic importance sampling algorithm, the so-called stochastic approximation Monte Carlo (SAMC) algorithm [Liang, 2005, Liang et al., 2007, Liang, 2009]. Theoretical studies [Lelièvre et al., 2008, Liang, 2010, Fort et al., 2015] support the efficiency of the flat histogram algorithms in Monte Carlo computing for small data problems. However, it remains unclear how to adapt the flat histogram idea to accelerate the convergence of SGMCMC and thus ensure efficient uncertainty quantification for AI safety problems.

*To whom correspondence should be addressed: Faming Liang.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

arXiv:2010.09800v1 [stat.ML] 19 Oct 2020
This paper proposes the so-called contour stochastic gradient Langevin dynamics (CSGLD) algorithm,which successfully extends the flat histogram idea to SGMCMC. Like the SAMC algorithm [Liang,2005, Liang et al., 2007, Liang, 2009], CSGLD works as a dynamic importance sampling algorithm,which adaptively adjusts the target measure at each iteration and accounts for the bias introducedthereby by importance weights. However, theoretical analysis for the two types of dynamic importancesampling algorithms can be quite different due to the fundamental difference in their transition kernels.We proceed by justifying the stability condition for CSGLD based on the perturbation theory, andestablishing ergodicity of CSGLD based on newly developed theory for the convergence of adaptiveSGLD. Empirically, we test the performance of CSGLD through a few experiments. It achievesremarkable performance on some synthetic data, UCI datasets, and computer vision datasets such asCIFAR10 and CIFAR100.
2 Contour stochastic gradient Langevin dynamics
Suppose we are interested in sampling from a probability measure π(x) with the density given by

    π(x) ∝ exp(−U(x)/τ), x ∈ X,   (1)

where X denotes the sample space, U(x) is the energy function, and τ is the temperature. It is known that when U(x) is highly non-convex, SGLD can mix very slowly [Raginsky et al., 2017]. To accelerate the convergence, we exploit the flat histogram idea in SGLD.
Suppose that we have partitioned the sample space X into m subregions based on the energy function U(x): X1 = {x : U(x) ≤ u1}, X2 = {x : u1 < U(x) ≤ u2}, ..., X_{m−1} = {x : u_{m−2} < U(x) ≤ u_{m−1}}, and Xm = {x : U(x) > u_{m−1}}, where −∞ < u1 < u2 < ··· < u_{m−1} < ∞ are specified by the user. For convenience, we set u0 = −∞ and um = ∞. Without loss of generality, we assume u_{i+1} − u_i = Δu for i = 1, ..., m − 2. We propose to simulate from a flattened density
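To make the partition concrete, the index of the subregion containing a state can be computed directly from its energy. The following is a minimal sketch (the function name and argument layout are our own, not from the paper), assuming equal bandwidth Δu above the first cutoff u1:

```python
import math

def partition_index(energy, u1, delta_u, m):
    """Return the index J in {1, ..., m} of the energy subregion containing
    `energy`, for boundaries u_i = u1 + (i - 1) * delta_u, i = 1, ..., m - 1."""
    if energy <= u1:
        return 1  # X1 = {x : U(x) <= u1}
    j = int(math.ceil((energy - u1) / delta_u)) + 1
    return min(j, m)  # Xm absorbs everything above u_{m-1}
```

For example, with u1 = 1, Δu = 1, and m = 5, an energy of 2.5 falls in the third subregion (u2 < U ≤ u3), and any energy above u4 maps to the last subregion.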
    ϖ_Ψθ(x) ∝ π(x) / Ψθ^ζ(U(x)),   (2)

where ζ > 0 is a hyperparameter controlling the geometric property of the flattened density (see Figure 1(a) for an illustration), and θ = (θ(1), θ(2), ..., θ(m)) is an unknown vector taking values in the space

    Θ = {(θ(1), θ(2), ..., θ(m)) : 0 < θ(1), θ(2), ..., θ(m) < 1 and Σ_{i=1}^m θ(i) = 1}.   (3)
2.1 A naΓ―ve contour SGLD
It is known that if we set†

    (i) ζ = 1 and Ψθ(U(x)) = Σ_{i=1}^m θ(i) 1_{u_{i−1} < U(x) ≤ u_i},
    (ii) θ(i) = θ*(i), where θ*(i) = ∫_{Xi} π(x) dx for i ∈ {1, 2, ..., m},   (4)

†1_A is an indicator function that takes value 1 if the event A occurs and 0 otherwise.
the algorithm will act like the SAMC algorithm [Liang et al., 2007], yielding a flat histogram in the space of energy (see the pink curve in Fig. 1(b)). Theoretically, such a density flattening strategy enables a sharper logarithmic Sobolev inequality and accelerates the convergence of simulations [Lelièvre et al., 2008, Fort et al., 2015]. However, such a density flattening setting only works under the framework of the Metropolis algorithm [Metropolis et al., 1953]. A naïve application of the step function in formula (4)(i) to SGLD results in

    ∂ log Ψθ(u)/∂u = (1/Ψθ(u)) ∂Ψθ(u)/∂u = 0 almost everywhere,

which leads to the vanishing-gradient problem for SGLD. Calculating the gradient for the naïve contour SGLD, we have

    ∇x log ϖ_Ψθ(x) = −[1 + ζτ ∂ log Ψθ(u)/∂u] ∇xU(x)/τ = −∇xU(x)/τ.

As such, the naïve algorithm behaves like SGLD and fails to simulate from the flattened density (2).
2.2 How to resolve the vanishing gradient
To tackle this issue, we propose to set Ψθ(u) as a piecewise continuous function:

    Ψθ(u) = Σ_{i=1}^m (θ(i−1) e^{(log θ(i) − log θ(i−1)) (u − u_{i−1})/Δu}) 1_{u_{i−1} < u ≤ u_i},   (5)

where θ(0) is fixed to θ(1) for simplicity. A direct calculation shows that

    ∇x log ϖ_Ψθ(x) = −[1 + ζτ ∂ log Ψθ(u)/∂u] ∇xU(x)/τ
                   = −[1 + ζτ (log θ(J(x)) − log θ((J(x)−1) ∨ 1))/Δu] ∇xU(x)/τ,   (6)
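Under the piecewise-continuous Ψθ of (5), the bracket term in (6) reduces to a finite difference of log θ across adjacent subregions. A minimal sketch (the function name is ours; θ is the probability vector over the m subregions, indexed from 1 in the paper's notation):

```python
import math

def gradient_multiplier(theta, j, zeta, tau, delta_u):
    """Bracket term of Eq. (6): 1 + zeta * tau *
    (log theta(J) - log theta((J - 1) v 1)) / delta_u, for 1-based index j."""
    j_prev = max(j - 1, 1)  # (J(x) - 1) v 1: theta(0) is fixed to theta(1)
    return 1.0 + zeta * tau * (math.log(theta[j - 1]) - math.log(theta[j_prev - 1])) / delta_u
```

Note that for j = 1 the multiplier is exactly 1 (since θ(0) = θ(1)), and it becomes negative when θ(J) is much smaller than θ(J−1), which is precisely the mechanism that later drives the sampler toward higher energy regions.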
where J(x) ∈ {1, 2, ..., m} denotes the index of the subregion that x belongs to, i.e., u_{J(x)−1} < U(x) ≤ u_{J(x)}.§ Since θ is unknown, we propose to estimate it on the fly under the framework of stochastic approximation [Robbins and Monro, 1951]. Provided that a scalable transition kernel Π_{θk}(xk, ·) is available and the energy function U(x) on the full data can be efficiently evaluated, the weighted density ϖ_Ψθ(x) can be simulated by iterating between the following steps:

    (i) Simulate x_{k+1} from Π_{θk}(xk, ·), which admits ϖ_{θk}(x) as the invariant distribution;
    (ii) θ_{k+1}(i) = θ_k(i) + ω_{k+1} θ_k^ζ(J(x_{k+1})) (1_{i=J(x_{k+1})} − θ_k(i)) for i ∈ {1, 2, ..., m},   (7)

where θ_k denotes a working estimate of θ at the k-th iteration. We expect that in the long run, such an algorithm can achieve an optimization-sampling equilibrium, such that θ_k converges to the fixed point θ* and the random vector x_k converges weakly to the distribution ϖ_Ψθ*(x).

To make the algorithm scalable to big data, we propose to adopt the Langevin transition kernel for drawing samples at each iteration, for which a mini-batch of data can be used to accelerate the computation. In addition, we observe that evaluating U(x) on the full data can be quite expensive for big data problems, while it is free to obtain the stochastic energy U(x) when evaluating the stochastic gradient ∇xU(x) due to the nature of auto-differentiation [Paszke et al., 2017]. For this reason, we propose a biased index J(x), where u_{J(x)−1} < (N/n) U(x) ≤ u_{J(x)}, N is the sample size of the full dataset, and n is the mini-batch size. Let {ε_k}_{k=1}^∞ and {ω_k}_{k=1}^∞ denote the learning rates and step sizes for SGLD and stochastic approximation, respectively. Given the above notation, the proposed algorithm is presented in Algorithm 1; it can be viewed as a scalable Wang-Landau algorithm for deep learning and big data problems.
2.3 Related work
Compared to the existing MCMC algorithms, the proposed algorithm has a few innovations:

First, CSGLD is an adaptive MCMC algorithm based on the Langevin transition kernel instead of the Metropolis transition kernel [Liang et al., 2007, Fort et al., 2015]. As a result, the existing convergence theory for the Wang-Landau algorithm does not apply. To resolve this issue, we first
Β§Formula (6) shows a practical numerical scheme. An alternative is presented in the supplementary material.
Algorithm 1 Contour SGLD Algorithm. One can conduct a resampling step from the pool of importance samples according to the importance weights to obtain the original distribution.

    [1.] (Data subsampling) Simulate a mini-batch of data of size n from the whole dataset of size N; compute the stochastic gradient ∇xU(x_k) and the stochastic energy U(x_k).

    [2.] (Simulation step) Sample x_{k+1} using the SGLD algorithm based on x_k and θ_k, i.e.,

        x_{k+1} = x_k − ε_{k+1} (N/n) [1 + ζτ (log θ_k(J(x_k)) − log θ_k((J(x_k)−1) ∨ 1))/Δu] ∇xU(x_k) + √(2τε_{k+1}) w_{k+1},   (8)

    where w_{k+1} ∼ N(0, I_d), d is the dimension, ε_{k+1} is the learning rate, and τ is the temperature.

    [3.] (Stochastic approximation) Update the estimate of θ(i) for i = 1, 2, ..., m by setting

        θ_{k+1}(i) = θ_k(i) + ω_{k+1} θ_k^ζ(J(x_{k+1})) (1_{i=J(x_{k+1})} − θ_k(i)),   (9)

    where 1_{i=J(x_{k+1})} is an indicator function that equals 1 if i = J(x_{k+1}) and 0 otherwise.
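As a concrete illustration, the three steps of Algorithm 1 can be sketched on the paper's one-dimensional Gaussian mixture example from Section 4.1 (full-batch here, so N/n = 1; ζ, τ, ε, Δu, and the step-size schedule follow that section, while the partition cutoffs, injected-noise scale, and helper names are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def U(x):
    """Energy of pi(x) = 0.4 N(-6, 1) + 0.6 N(4, 1), up to an additive constant."""
    return -np.log(0.4 * np.exp(-0.5 * (x + 6.0) ** 2) + 0.6 * np.exp(-0.5 * (x - 4.0) ** 2))

def stochastic_grad(x, noise_std=0.1):
    """Numerical gradient of U plus injected Gaussian noise (stand-in for mini-batch noise)."""
    h = 1e-5
    return (U(x + h) - U(x - h)) / (2 * h) + noise_std * rng.standard_normal()

m, u1, du = 50, 0.0, 1.0            # 50 subregions of bandwidth du; first cutoff u1 is our choice
zeta, tau, lr = 0.75, 1.0, 0.1      # hyperparameters as in Section 4.1
theta = np.full(m, 1.0 / m)         # initial estimate theta_0: uniform over subregions

def J(x):
    e = U(x)
    return 1 if e <= u1 else min(int(np.ceil((e - u1) / du)) + 1, m)

x, samples = 0.0, []
for k in range(1, 5001):
    # Step 2 (simulation): SGLD move with the adaptive gradient multiplier of Eq. (8).
    j = J(x)
    mult = 1.0 + zeta * tau * (np.log(theta[j - 1]) - np.log(theta[max(j - 1, 1) - 1])) / du
    x = x - lr * mult * stochastic_grad(x) + np.sqrt(2.0 * tau * lr) * rng.standard_normal()
    # Step 3 (stochastic approximation): update theta via Eq. (9).
    omega = 1.0 / (k ** 0.6 + 100.0)
    j = J(x)
    indicator = np.zeros(m)
    indicator[j - 1] = 1.0
    theta += omega * theta[j - 1] ** zeta * (indicator - theta)
    samples.append((x, theta[j - 1] ** zeta))  # keep each sample with its importance weight
```

Note that the update (9) preserves Σ_i θ(i) = 1 exactly, so θ stays in the simplex without renormalization; a resampling or weighted-averaging step over `samples` then recovers expectations under the original π.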
prove a stability condition for CSGLD based on the perturbation theory, and then verify regularity conditions for the solution of the Poisson equation so that the fluctuations of the mean-field system induced by CSGLD are controlled, which eventually ensures the convergence of CSGLD.

Second, the use of the stochastic index J(x) avoids the evaluation of U(x) on the full data and thus significantly accelerates the computation of the algorithm, although it leads to a small bias, depending on the mini-batch size n, in parameter estimation. Compared to other methods, such as using a fixed sub-dataset to estimate U(x), the implementation is much simpler. Moreover, by combining the variance reduction of the noisy energy estimators [Deng et al., 2020b], the bias also decreases to zero asymptotically as ε → 0.

Third, unlike the existing SGMCMC algorithms [Welling and Teh, 2011, Chen et al., 2014, Ma et al., 2015], CSGLD works as a dynamic importance sampler which flattens the target distribution and reduces the energy barriers for the sampler to traverse between different regions of the energy landscape (see Fig. 1(a) for an illustration). The sampling bias introduced thereby is accounted for by the importance weight θ^ζ(J(·)). Interestingly, CSGLD possesses a self-adjusting mechanism to ease escapes from local traps, which is similar to the self-repulsive dynamics [Ye et al., 2020] and can be explained as follows. Let's assume that the sampler gets trapped in a local optimum at iteration k. Then CSGLD will automatically increase the multiplier of the stochastic gradient (i.e., the bracket term of (8)) at iteration k + 1 by increasing the value of θ_k(J(x)), while decreasing the components of θ_k corresponding to the other subregions. This adjustment will continue until the sampler moves away from the current subregion. Then, in the following several iterations, the multiplier might become negative in the neighboring subregions of the local optimum due to the increased value of θ(J(x)), which continues to help drive the sampler to higher energy regions and thus escape from the local trap. That is, in order to escape from local traps, CSGLD is sometimes forced to move toward higher energy regions by changing the sign of the stochastic gradient multiplier! This is a very attractive feature for simulations of multi-modal distributions.
3 Theoretical study of the CSGLD algorithm
In this section, we study the convergence of the CSGLD algorithm under the framework of stochastic approximation and show its ergodicity based on the weighted averaging estimators.
3.1 Convergence analysis
Following the tradition of stochastic approximation analysis, we rewrite the updating rule (9) as

    θ_{k+1} = θ_k + ω_{k+1} H(θ_k, x_{k+1}),   (10)

where H(θ, x) = (H_1(θ, x), ..., H_m(θ, x)) is a random field function with

    H_i(θ, x) = θ^ζ(J(x)) (1_{i=J(x)} − θ(i)), i = 1, 2, ..., m.   (11)
Notably, H(θ, x) works under an empirical measure ϖ_θ(x), which approximates the invariant measure ϖ_Ψθ(x) ∝ π(x)/Ψθ^ζ(U(x)) asymptotically as ε → 0 and n → N. As shown in Lemma 1, we have the mean-field equation

    h(θ) = ∫_X H(θ, x) ϖ_θ(x) dx = Z_θ^{−1} (θ* + ε β(θ) − θ) = 0,   (12)

where θ* = (∫_{X1} π(x) dx, ∫_{X2} π(x) dx, ..., ∫_{Xm} π(x) dx), Z_θ is the normalizing constant, β(θ) is a perturbation term, and ε is a small error depending on ε, n, and m. The mean-field equation implies that for any ζ > 0, θ_k converges to a small neighbourhood of θ*. By applying perturbation theory and setting the Lyapunov function V(θ) = (1/2)‖θ* − θ‖², we can establish the stability condition:

Lemma 1 (Stability). Given a small enough ε (learning rate), a large enough n (batch size) and m (partition number), there is a constant φ = inf_θ Z_θ^{−1} > 0 such that the mean-field function h(θ) satisfies

    ∀θ ∈ Θ, ⟨h(θ), θ − θ*⟩ ≤ −φ‖θ − θ*‖² + O(ε + 1/m + δ_n(θ)),

where δ_n(·) is a bias term depending on the batch size n and decays to 0 as n → N.
Together with the tool of the Poisson equation [Benveniste et al., 1990, Andrieu et al., 2005], which controls the fluctuation of H(θ, x) − h(θ), we can establish the convergence of θ_k in Theorem 1, whose proof is given in the supplementary material.

Theorem 1 (L² convergence rate). Given Assumptions 1-5 (given in the Appendix), a small enough learning rate ε_k, a large partition number m, and a large batch size n, θ_k converges to θ* such that

    E[‖θ_k − θ*‖²] = O(ω_k + sup_{i≥k₀} ε_i + 1/m + sup_{i≥k₀} δ_n(θ_i)),

where k₀ is some large enough integer, θ* = (∫_{X1} π(x) dx, ∫_{X2} π(x) dx, ..., ∫_{Xm} π(x) dx), and δ_n(·) is a bias term depending on the batch size n and decays to 0 as n → N.
3.2 Ergodicity and dynamic importance sampler
CSGLD belongs to the class of adaptive MCMC algorithms, but its transition kernel is based on SGLD instead of the Metropolis algorithm. As such, the ergodicity theory for traditional adaptive MCMC algorithms [Roberts and Rosenthal, 2007, Andrieu and Éric Moulines, 2006, Fort et al., 2011, Liang, 2010] is not directly applicable. To tackle this issue, we conduct the following theoretical study. First, rewrite (8) as

    x_{k+1} = x_k − ε(∇x L(x_k, θ*) + Υ(x_k, θ_k, θ*)) + N(0, 2ετI),   (13)

where ∇x L(x_k, θ*) = (N/n)[1 + (ζτ/Δu)(log θ*(J(x_k)) − log θ*((J(x_k)−1) ∨ 1))] ∇xU(x_k), the bias term Υ(x_k, θ_k, θ*) = ∇x L(x_k, θ_k) − ∇x L(x_k, θ*), and ∇x L(x_k, θ_k) = (N/n)[1 + (ζτ/Δu)(log θ_k(J(x_k)) − log θ_k((J(x_k)−1) ∨ 1))] ∇xU(x_k). The order of the bias is figured out in Lemma C1 in the supplementary material based on the results of Theorem 1.
Next, we show how the empirical mean (1/k) Σ_{i=1}^k f(x_i) deviates from the posterior mean ∫_X f(x) ϖ_Ψθ*(x) dx. Note that this is a direct application of Theorem 2 of Chen et al. [2015] by treating ∇x L(x, θ*) as the stochastic gradient of a target distribution and Υ(x, θ, θ*) as the bias of the stochastic gradient. Moreover, considering that ϖ_θ*(x) ∝ π(x)/θ*^ζ(J(x)) → ϖ_Ψθ*(x) as m → ∞ based on Lemma B4 in the supplementary material, we have the following lemma.

Lemma 2 (Convergence of the Averaging Estimators). Suppose Assumptions 1-6 (in the supplementary material) hold. For any bounded function f, we have

    |E[(Σ_{i=1}^k f(x_i))/k] − ∫_X f(x) ϖ_Ψθ*(dx)| = O(1/(kε) + √ε + (Σ_{i=1}^k ω_k)/k + 1/√m + sup_{i≥k₀} √δ_n(θ_i)),

where ϖ_Ψθ*(x) = (1/Z_θ*) π(x)/θ*^ζ(J(x)) and Z_θ* = Σ_{i=1}^m ∫_{Xi} π(x) dx / θ*(i)^ζ.
Finally, we consider the problem of estimating the quantity ∫_X f(x) π(x) dx. Recall that π(x) is the target distribution that we would like to make inference for. To estimate this quantity, we naturally consider the weighted averaging estimator

    Σ_{i=1}^k θ_i^ζ(J(x_i)) f(x_i) / Σ_{i=1}^k θ_i^ζ(J(x_i)),

by treating θ_i^ζ(J(x_i)) as the dynamic importance weight of the sample x_i for i = 1, 2, ..., k. The convergence of this estimator is established in Theorem 2, which can be proved by repeatedly applying Theorem 1 and Lemma 2, with the details given in the supplementary material.

Theorem 2 (Convergence of the Weighted Averaging Estimators). Suppose Assumptions 1-6 hold. For any bounded function f, we have

    |E[Σ_{i=1}^k θ_i^ζ(J(x_i)) f(x_i) / Σ_{i=1}^k θ_i^ζ(J(x_i))] − ∫_X f(x) π(dx)| = O(1/(kε) + √ε + (Σ_{i=1}^k ω_k)/k + 1/√m + sup_{i≥k₀} √δ_n(θ_i)).

The bias of the weighted averaging estimator decreases if one applies a larger batch size, a finer sample space partition, a smaller learning rate ε, and smaller step sizes {ω_k}_{k≥0}. Admittedly, the order of this bias is slightly larger than the O(1/(kε) + ε) achieved by the standard SGLD. We note that this is necessary, as simulating from the flattened distribution ϖ_Ψθ* often leads to a much faster convergence; see, e.g., the green curve versus the purple curve in Fig. 1(c).
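In practice, the weighted averaging estimator is a one-liner once each sample x_i has been stored together with its weight θ_i^ζ(J(x_i)). A minimal sketch (the function and argument names are ours):

```python
import numpy as np

def weighted_average(f_values, weights):
    """Weighted averaging estimator: sum_i w_i f(x_i) / sum_i w_i,
    where w_i = theta_i^zeta(J(x_i)) is the dynamic importance weight of x_i."""
    w = np.asarray(weights, dtype=float)
    f = np.asarray(f_values, dtype=float)
    return float(np.sum(w * f) / np.sum(w))
```

With equal weights this reduces to the plain empirical mean; samples drawn in the oversampled high-energy subregions carry small θ^ζ and are down-weighted accordingly, which removes the bias introduced by flattening.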
4 Numerical studies
4.1 Simulations of multi-modal distributions
A Gaussian mixture distribution. The first numerical study tests the performance of CSGLD on a Gaussian mixture distribution π(x) = 0.4 N(−6, 1) + 0.6 N(4, 1). In each experiment, the algorithm was run for 10⁷ iterations. We fix the temperature τ = 1 and the learning rate ε = 0.1. The step size for stochastic approximation follows ω_k = 1/(k^0.6 + 100). The sample space is partitioned into 50 subregions with Δu = 1. The stochastic gradients are simulated by injecting additional random noise following N(0, 0.01) into the exact gradients. For comparison, SGLD is chosen as the baseline algorithm and implemented with the same setup as CSGLD. We repeat the experiments 10 times and report the average and the associated standard deviation.
We first assume that θ* is known and plot the energy functions for both π(x) and ϖ_Ψθ* with different values of ζ. Fig. 1(a) shows that the original energy function has a rather large energy barrier which strongly affects the communication between the two modes of the distribution. In contrast, CSGLD samples from a modified energy function, which yields a flattened landscape and reduced energy barriers. For example, with ζ = 0.75, the energy barrier for this example is greatly reduced from 12 to as small as 2. Consequently, the local trap problem can be greatly alleviated. Regarding the bizarre peaks around x = 4, we leave the study to the supplementary material.
[Figure 1 shows three panels: (a) the original energy function versus the modified energy functions for ζ = 0.5, 0.75, and 1 ("Original v.s. trial densities"), with the large barrier of the original energy and the small barrier of the modified energy marked; (b) the estimates of θ versus the ground truth θ* for CSGLD with ζ = 0.5, 0.75, and 1 ("θ's estimates and histograms"); (c) the estimation errors of SGLD, CSGLD, and KSGLD over iterations ("Estimation errors").]

Figure 1: Comparison between SGLD and CSGLD: Fig. 1(b) presents only the first 12 partitions for an illustrative purpose; KSGLD in Fig. 1(c) is implemented by assuming θ* is known.
Fig. 1(b) summarizes the estimates of θ* with ζ = 0.75, which match the ground truth value of θ* very well. Notably, we see that θ*(i) decays exponentially fast as the partition index i increases, which
indicates an exponentially decreasing probability of visiting high energy regions and a severe local trap problem. CSGLD tackles this issue by adaptively updating the transition kernel or, equivalently, the invariant distribution, such that the sampler moves like a "random walk" in the space of energy. In particular, setting ζ = 1 leads to a flat histogram of energy (for the samples produced by CSGLD).
To explore the performance of CSGLD in quantity estimation with the weighted averaging estimator, we compare CSGLD (ζ = 0.75) with SGLD and KSGLD in estimating the posterior mean ∫_X x π(x) dx, where KSGLD was implemented by assuming θ* is known and sampling from ϖ_Ψθ* directly. Each algorithm was run 10 times, and we recorded the mean absolute estimation error along the iterations. As shown in Fig. 1(c), the estimation error of SGLD decays quite slowly and rarely converges due to the high energy barrier. On the contrary, KSGLD converges much faster, which shows the advantage of sampling from a flattened distribution ϖ_Ψθ*. Admittedly, θ* is unknown in practice. CSGLD instead adaptively updates its invariant distribution while optimizing the parameter θ until an optimization-sampling equilibrium is reached. In the early period of the run, CSGLD converges slightly more slowly than KSGLD, but soon it becomes as efficient as KSGLD.
Finally, we compare the sample paths of CSGLD and SGLD. As shown in Fig. 2(a), SGLD tends to be trapped in a deep local optimum for an exponentially long time. CSGLD, in contrast, possesses a self-adjusting mechanism for escaping from local traps. In the early period of a run, CSGLD might suffer from a similar local-trap problem as SGLD (see Fig. 2(b)). In this case, the components of θ corresponding to the current subregion will increase very fast, eventually rendering a smaller or even negative stochastic gradient multiplier which bounces the sampler back to high energy regions. To illustrate the process, we plot a bouncy zone and an absorbing zone in Fig. 2(c). The bouncy zone enables the sampler to "jump" over large energy barriers to explore other modes. As the run continues, θ_k converges to θ*. Fig. 2(d) shows that larger bouncy "jumps" (in red lines) can potentially be induced in the bouncy zone, which occurs in both local and global optima. Due to the self-adjusting mechanism, CSGLD has the local trap problem much alleviated.
[Figure 2 shows four panels of sample trajectories: (a) SGLD paths, with an energy barrier greater than 10 marked; (b) CSGLD paths in the early period, with the absorbing zone and bouncy zone marked; (c) CSGLD paths in the middle period, with the absorbing zone and bouncy zone marked; (d) CSGLD paths in the late period.]
Figure 2: Sample trajectories of SGLD and CSGLD: plots (a) and (c) are implemented with 100,000 iterations, a thinning factor of 100, and ζ = 0.75, while plot (b) utilizes a thinning factor of 10.
A synthetic multi-modal distribution. We next simulate from a distribution π(x) ∝ e^{−U(x)}, where U(x) = Σ_{i=1}^2 (x(i)² − 10 cos(1.2π x(i)))/3 and x = (x(1), x(2)). We compare CSGLD with SGLD, replica exchange SGLD (reSGLD) [Deng et al., 2020a], and SGLD with cyclic learning rates (cycSGLD) [Zhang et al., 2020], and detail the setups in the supplementary material. Fig. 3(a) shows that the distribution contains nine important modes, where the center mode has the largest probability mass and the four modes on the corners have the smallest. We see in Fig. 3(b) that SGLD spends too much time in local regions and only identifies three modes. cycSGLD has a better ability to explore the distribution by leveraging large learning rates cyclically. However, as illustrated in Fig. 3(c), such a mechanism is still not efficient enough to resolve the local trap issue for this problem. reSGLD includes a high-temperature process to encourage exploration and allows interactions between the two processes via appropriate swaps. We observe in Fig. 3(d) that reSGLD obtains both the exploration and exploitation abilities and yields a much better result. However, the noisy energy estimator may hinder the swapping efficiency, and it becomes difficult to estimate a few modes on the corners. As to our algorithm, CSGLD first simulates the importance samples and then recovers the original distribution according to the importance weights. We notice that the samples from CSGLD can traverse freely in the parameter space and eventually achieve a remarkable performance, as shown in Fig. 3(e).
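The synthetic energy above is straightforward to reproduce; a small sketch of U (our own code, reading the formula as the per-coordinate sum (x(i)² − 10 cos(1.2π x(i)))/3, which we take to be the intended grouping):

```python
import math

def U(x):
    """U(x) = sum over i = 1, 2 of (x_i^2 - 10 cos(1.2 pi x_i)) / 3."""
    return sum((xi ** 2 - 10.0 * math.cos(1.2 * math.pi * xi)) / 3.0 for xi in x)
```

The cosine term carves periodic wells into the quadratic bowl, producing the grid of nine prominent modes described above; U is symmetric under sign flips of either coordinate.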
(a) Ground truth (b) SGLD (c) cycSGLD (d) reSGLD (e) CSGLD
Figure 3: Simulations of a multi-modal distribution. A resampling scheme is used for CSGLD.
4.2 UCI data
We tested the performance of CSGLD on the UCI regression datasets. For each dataset, we normalized all features and randomly selected 10% of the observations for testing. Following [Hernandez-Lobato and Adams, 2015], we modeled the data using a Multi-Layer Perceptron (MLP) with a single hidden layer of 50 hidden units. We set the mini-batch size to n = 50 and trained the model for 5,000 epochs. The learning rate was set to 5e-6 and the default L2-regularization coefficient to 1e-4. For all the datasets, we used the stochastic energy (N/n)U(x) to evaluate the partition index. We set the energy bandwidth Δu = 100. We fine-tuned the temperature τ and the hyperparameter ζ. For a fair comparison, each algorithm was run 10 times with fixed seeds for each dataset. In each run, the performance of the algorithm was evaluated by averaging over 50 models, where the averaging estimator was used for SGD and SGLD and the weighted averaging estimator was used for CSGLD. As shown in Table 1, SGLD outperforms the stochastic gradient descent (SGD) algorithm on most datasets due to the advantage of a sampling algorithm in obtaining more informative modes. Since all these datasets are small, there is only very limited potential for improvement. Nevertheless, CSGLD still consistently outperforms all the baselines, including SGD and SGLD.
The contour strategy proposed in this paper can be naturally extended to SGHMC [Chen et al., 2014, Ma et al., 2015] without affecting the theoretical results. In what follows, we adopted a numerical method proposed by Saatci and Wilson [2017] to avoid extra hyperparameter tuning. We set the momentum term to 0.9 and simply inherited all the other parameter settings used in the above experiments. In this setting, we compare the contour SGHMC (CSGHMC) with the baselines, including M-SGD (momentum SGD) and SGHMC. The comparison indicates that some improvements can be achieved by including the momentum.
Table 1: Algorithm evaluation using average root-mean-square error and its standard deviation.
Dataset                  Energy       Concrete     Yacht        Wine
Hyperparameters (τ/ζ)    1/1          5/1          1/2.5        5/10
SGD                      1.13±0.07    4.60±0.14    0.81±0.08    0.65±0.01
SGLD                     1.08±0.07    4.12±0.10    0.72±0.07    0.63±0.01
CSGLD                    1.02±0.06    3.98±0.11    0.69±0.06    0.62±0.01
M-SGD                    0.95±0.07    4.32±0.27    0.73±0.08    0.71±0.02
SGHMC                    0.77±0.06    4.25±0.19    0.66±0.07    0.67±0.02
CSGHMC                   0.76±0.06    4.15±0.20    0.72±0.09    0.65±0.01
4.3 Computer vision data
This section compares only CSGHMC with M-SGD and SGHMC due to the popularity of momentum in accelerating computation for computer vision datasets. We keep partitioning the sample space according to the stochastic energy N/n U(x), where a mini-batch of size n is randomly chosen from the full dataset of size N at each iteration. Notably, such a strategy significantly accelerates the computation of CSGHMC. As a result, CSGHMC has almost the same computational cost as SGHMC and SGD. To reduce the bias associated with the stochastic energy, we choose a large batch size n = 1,000. For more discussion of the hyperparameter settings, we refer readers to Section D in the supplementary material.
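Under this setup, the partition index of a sample could be computed from the stochastic energy as follows (a minimal sketch assuming an even energy grid of bandwidth ∆u starting at u_1; the function name and clipping are our assumptions):

```python
import math

def partition_index(stoch_energy, u1, delta_u, m):
    """Map a stochastic energy value (N/n) * U(x) to a subregion index in
    {1, ..., m}, assuming an even energy grid of bandwidth delta_u starting
    at u1 (grid layout and clipping are illustrative choices)."""
    j = math.ceil((stoch_energy - u1) / delta_u)
    return min(max(j, 1), m)
```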
CIFAR10 is a standard computer vision dataset with 10 classes and 60,000 images, of which 50,000 images were used for training and the rest for testing. We modeled the data using a ResNet of 20 layers (ResNet20) [He et al., 2016]. In particular, for CSGHMC, we considered a partition of the energy space into 200 subregions, where the energy bandwidth was set to ∆u = 1000. We trained the model for a total of 1000 epochs and evaluated the model every ten epochs based on two criteria, namely, best point estimate (BPE) and Bayesian model average (BMA). We repeated each experiment 10 times and report in Table 2 the average prediction accuracy and the corresponding standard deviation.
In the first set of experiments, all the algorithms utilized a fixed learning rate ε = 2e-7 and a fixed temperature τ = 0.01 under the Bayesian setting. SGHMC performs quite similarly to M-SGD, both obtaining around 90% accuracy in BPE and 92% in BMA. Notably, in this case, simulated annealing is not applied to any of the algorithms, and achieving the state of the art is quite difficult. However, BMA still consistently outperforms BPE, implying the great potential of advanced MCMC techniques in deep learning. Instead of simulating from π(x) directly, CSGHMC adaptively simulates from a flattened distribution ϖ_{θ⋆} and adjusts the sampling bias by dynamic importance weights. As a result, the weighted averaging estimators obtain an improvement of as much as 0.8% on BMA. In addition, the flattened distribution facilitates optimization, and the increase in BPE is quite significant.
In the second set of experiments, we employed a decaying schedule on both learning rates and temperatures (if applicable) to obtain simulated annealing effects. For the learning rate, we fixed it at 2e-6 in the first 400 epochs and then decayed it by a factor of 1.01 at each epoch. For the temperature, we consistently decayed it by a factor of 1.01 at each epoch. We call the resulting algorithms saM-SGD, saSGHMC, and saCSGHMC, respectively. Table 2 shows that the performance of all algorithms increases quite significantly, where the fine-tuned baselines already obtain state-of-the-art results. Nevertheless, saCSGHMC further improves BPE by 0.25% and slightly improves the highly optimized BMA by nearly 0.1%.
The CIFAR100 dataset has 100 classes, each of which contains 500 training images and 100 testing images. We follow a setup similar to CIFAR10, except that ∆u is set to 5000. For M-SGD, BMA can be better than BPE by as much as 5.6%. CSGHMC leads to an improvement of 3.5% on BPE and 2% on BMA, which further demonstrates the superiority of advanced MCMC techniques. Table 2 also shows that, with the help of both simulated annealing and importance sampling, saCSGHMC outperforms the highly optimized baselines by almost 1% accuracy on BPE and 0.7% on BMA. The significant improvements show the advantage of the proposed method in training DNNs.
Table 2: Experiments on CIFAR10 & 100 using ResNet20, where BPE and BMA are short for best point estimate and Bayesian model average, respectively.
Algorithms     CIFAR10 BPE    CIFAR10 BMA    CIFAR100 BPE    CIFAR100 BMA
M-SGD          90.02±0.06     92.03±0.08     61.41±0.15      67.04±0.12
SGHMC          90.01±0.07     91.98±0.05     61.46±0.14      66.43±0.11
CSGHMC         90.87±0.04     92.85±0.05     63.97±0.21      68.94±0.23
saM-SGD        93.83±0.07     94.25±0.04     69.18±0.13      71.83±0.12
saSGHMC        93.80±0.06     94.24±0.06     69.24±0.11      71.98±0.10
saCSGHMC       94.06±0.07     94.33±0.07     70.18±0.15      72.67±0.15
5 Conclusion
We have proposed CSGLD as a general scalable Monte Carlo algorithm for both simulation and optimization tasks. CSGLD automatically adjusts the invariant distribution during simulations to facilitate escaping from local traps and traversing the entire energy landscape. The sampling bias introduced thereby is accounted for by dynamic importance weights. We proved a stability condition for the mean-field system induced by CSGLD, together with the convergence of its self-adapting parameter θ to a unique fixed point θ⋆. We established the convergence of a weighted averaging estimator for CSGLD. The bias of the estimator decreases as we employ a finer partition, a larger mini-batch size, and smaller learning rates and step sizes. We tested CSGLD and its variants on several examples, which show their great potential in deep learning and big data computing.
Broader Impact
Our algorithm ensures AI safety by providing more robust predictions and helps build a safer environment. It is an extension of the flat histogram algorithms from the Metropolis kernel to the Langevin kernel and paves the way for future research on various dynamic importance samplers and adaptive biasing force (ABF) techniques for big data problems. The Bayesian community and researchers in the area of Monte Carlo methods will enjoy the benefit of our work. To the best of our knowledge, there are no clear negative societal consequences, and no one is put at a disadvantage.
Acknowledgment
Liang's research was supported in part by the grants DMS-2015498, R01-GM117597 and R01-GM126089. Lin acknowledges the support from NSF (DMS-1555072, DMS-1736364), BNL Subcontract 382247, W911NF-15-1-0562, and DE-SC0021142.
References

Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring. In Proc. of the International Conference on Machine Learning (ICML), 2012.

Christophe Andrieu and Éric Moulines. On the Ergodicity Properties of Some Adaptive MCMC Algorithms. Annals of Applied Probability, 16:1462–1505, 2006.

Christophe Andrieu, Éric Moulines, and Pierre Priouret. Stability of Stochastic Approximation under Verifiable Conditions. SIAM J. Control Optim., 44(1):283–312, 2005.

Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive Algorithms and Stochastic Approximations. Berlin: Springer, 1990.

Bernd A. Berg and T. Neuhaus. Multicanonical Algorithms for First Order Phase Transitions. Physics Letters B, 267(2):249–253, 1991.

Changyou Chen, Nan Ding, and Lawrence Carin. On the Convergence of Stochastic Gradient MCMC Algorithms with High-order Integrators. In Advances in Neural Information Processing Systems (NeurIPS), pages 2278–2286, 2015.

Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic Gradient Hamiltonian Monte Carlo. In Proc. of the International Conference on Machine Learning (ICML), 2014.

Umut Simsekli, Roland Badeau, A. Taylan Cemgil, and Gaël Richard. Stochastic Quasi-Newton Langevin Monte Carlo. In Proc. of the International Conference on Machine Learning (ICML), pages 642–651, 2016.

Wei Deng, Xiao Zhang, Faming Liang, and Guang Lin. An Adaptive Empirical Bayesian Method for Sparse Deep Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Wei Deng, Qi Feng, Liyao Gao, Faming Liang, and Guang Lin. Non-Convex Learning via Replica Exchange Stochastic Gradient MCMC. In Proc. of the International Conference on Machine Learning (ICML), 2020a.

Wei Deng, Qi Feng, Georgios Karagiannis, Guang Lin, and Faming Liang. Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction. arXiv:2010.01084, 2020b.

Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian Sampling using Stochastic Gradient Thermostats. In Advances in Neural Information Processing Systems (NeurIPS), pages 3203–3211, 2014.

G. Fort, E. Moulines, and P. Priouret. Convergence of Adaptive and Interacting Markov Chain Monte Carlo Algorithms. Annals of Statistics, 39:3262–3289, 2011.
G. Fort, B. Jourdain, E. Kuhn, T. Lelièvre, and G. Stoltz. Convergence of the Wang-Landau Algorithm. Math. Comput., 84(295):2297–2327, 2015.

Charles J. Geyer. Markov Chain Monte Carlo Maximum Likelihood. Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pages 156–163, 1991.

W. K. Hastings. Monte Carlo Sampling Methods using Markov Chains and Their Applications. Biometrika, 57:97–109, 1970.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Jose Miguel Hernandez-Lobato and Ryan Adams. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. In Proc. of the International Conference on Machine Learning (ICML), volume 37, pages 1861–1869, 2015.

Scott Kirkpatrick, C. D. Gelatt Jr., and Mario P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.

T. Lelièvre, M. Rousset, and G. Stoltz. Long-time Convergence of an Adaptive Biasing Force Method. Nonlinearity, 21:1155–1181, 2008.

Chunyuan Li, Changyou Chen, David Carlson, and Lawrence Carin. Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks. In Proc. of the National Conference on Artificial Intelligence (AAAI), pages 1788–1794, 2016.

Xuechen Li, Denny Wu, Lester Mackey, and Murat A. Erdogdu. Stochastic Runge-Kutta Accelerates Langevin Monte Carlo and Beyond. In Advances in Neural Information Processing Systems (NeurIPS), pages 7746–7758, 2019.

Faming Liang. A Generalized Wang–Landau Algorithm for Monte Carlo Computation. Journal of the American Statistical Association, 100(472):1311–1327, 2005.

Faming Liang. On the Use of Stochastic Approximation Monte Carlo for Monte Carlo Integration. Statistics and Probability Letters, 79:581–587, 2009.

Faming Liang. Trajectory Averaging for Stochastic Approximation MCMC Algorithms. The Annals of Statistics, 38:2823–2856, 2010.

Faming Liang, Chuanhai Liu, and Raymond J. Carroll. Stochastic Approximation in Monte Carlo Computation. Journal of the American Statistical Association, 102:305–320, 2007.

Yi-An Ma, Tianqi Chen, and Emily B. Fox. A Complete Recipe for Stochastic Gradient MCMC. In Advances in Neural Information Processing Systems (NeurIPS), 2015.

Oren Mangoubi and Nisheeth K. Vishnoi. Convex Optimization with Unbounded Nonconvex Oracles using Simulated Annealing. In Proc. of Conference on Learning Theory (COLT), 2018.

J. C. Mattingly, A. M. Stuart, and D. J. Higham. Ergodicity for SDEs and Approximations: Locally Lipschitz Vector Fields and Degenerate Noise. Stochastic Processes and their Applications, 101:185–232, 2002.

Jonathan C. Mattingly, Andrew M. Stuart, and M. V. Tretyakov. Convergence of Numerical Time-Averaging and Stationary Measures via Poisson Equations. SIAM Journal on Numerical Analysis, 48:552–577, 2010.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 21:1087–1091, 1953.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex Learning via Stochastic Gradient Langevin Dynamics: a Nonasymptotic Analysis. In Proc. of Conference on Learning Theory (COLT), June 2017.

Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. Annals of Mathematical Statistics, 22:400–407, 1951.

Gareth O. Roberts and Jeff S. Rosenthal. Coupling and Ergodicity of Adaptive Markov Chain Monte Carlo Algorithms. Journal of Applied Probability, 44:458–475, 2007.

Yunus Saatci and Andrew G. Wilson. Bayesian GAN. In Advances in Neural Information Processing Systems (NeurIPS), pages 3622–3631, 2017.

Issei Sato and Hiroshi Nakagawa. Approximation Analysis of Stochastic Gradient Langevin Dynamics by Using Fokker-Planck Equation and Ito Process. In Proc. of the International Conference on Machine Learning (ICML), 2014.

Robert H. Swendsen and Jian-Sheng Wang. Replica Monte Carlo Simulation of Spin-Glasses. Phys. Rev. Lett., 57:2607–2609, 1986.

Yee Whye Teh, Alexandre Thiéry, and Sebastian Vollmer. Consistency and Fluctuations for Stochastic Gradient Langevin Dynamics. Journal of Machine Learning Research, 17:1–33, 2016.

Eric Vanden-Eijnden. Introduction to Regular Perturbation Theory. Slides, 2001. URL https://cims.nyu.edu/~eve2/reg_pert.pdf.

Sebastian J. Vollmer, Konstantinos C. Zygalakis, and Yee Whye Teh. Exploration of the (Non-)Asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics. Journal of Machine Learning Research, 17(159):1–48, 2016.

Fugao Wang and D. P. Landau. Efficient, Multiple-range Random Walk Algorithm to Calculate the Density of States. Physical Review Letters, 86(10):2050–2053, 2001.

Max Welling and Yee Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proc. of the International Conference on Machine Learning (ICML), pages 681–688, 2011.

Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Mao Ye, Tongzheng Ren, and Qiang Liu. Stein Self-Repulsive Dynamics: Benefits From Past Samples. arXiv:2002.09070v1, 2020.

Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning. In Proc. of the International Conference on Learning Representations (ICLR), 2020.

Yuchen Zhang, Percy Liang, and Moses Charikar. A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics. In Proc. of Conference on Learning Theory (COLT), pages 1980–2022, 2017.

Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random Erasing Data Augmentation. ArXiv e-prints, 2017.
Supplementary Material for "A Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions"
The supplementary material is organized as follows: Section A provides a review of the related methodologies, Section B proves the stability condition and the convergence of the self-adapting parameter, Section C establishes the ergodicity of the contour stochastic gradient Langevin dynamics (CSGLD) algorithm, and Section D provides more discussion of the algorithm.
A Background on stochastic approximation and Poisson equation
A.1 Stochastic approximation
Stochastic approximation [Benveniste et al., 1990] provides a standard framework for the development of adaptive algorithms. Given a random field function H(θ, x), the goal of the stochastic approximation algorithm is to find the solution to the mean-field equation h(θ) = 0, i.e., solving

h(θ) = ∫_X H(θ, x) ϖ_θ(dx) = 0,

where x ∈ X ⊂ R^d, θ ∈ Θ ⊂ R^m, H(θ, x) is a random field function, and ϖ_θ(x) is a distribution function of x depending on the parameter θ. The stochastic approximation algorithm works by repeating the following iterations:
(1) Draw x_{k+1} ∼ Π_{θ_k}(x_k, ·), where Π_{θ_k}(x_k, ·) is a transition kernel that admits ϖ_{θ_k}(x) as the invariant distribution;

(2) Update θ_{k+1} = θ_k + ω_{k+1} H(θ_k, x_{k+1}) + ω²_{k+1} ρ(θ_k, x_{k+1}), where ρ(·, ·) denotes a bias term.
The algorithm differs from the Robbins–Monro algorithm [Robbins and Monro, 1951] in that x is simulated from a transition kernel Π_{θ_k}(·, ·) instead of the exact distribution ϖ_{θ_k}(·). As a result, a Markov state-dependent noise H(θ_k, x_{k+1}) − h(θ_k) is generated, which requires some regularity conditions to control the fluctuation Σ_k Π^k_θ(H(θ, x) − h(θ)). Moreover, it supports a more general form in which a bounded bias term ρ(·, ·) is allowed without affecting the theoretical properties of the algorithm.
A.2 Poisson equation
Stochastic approximation generates a nonhomogeneous Markov chain {(x_k, θ_k)}_{k=1}^∞, for which the convergence theory can be studied based on the Poisson equation

µ_θ(x) − Π_θ µ_θ(x) = H(θ, x) − h(θ),

where Π_θ(x, A) is the transition kernel for any Borel subset A ⊂ X and µ_θ(·) is a function on X. The solution to the Poisson equation exists when the following series converges:

µ_θ(x) := Σ_{k≥0} Π^k_θ(H(θ, x) − h(θ)).
That is, the consistency of the estimator θ can be established by controlling the perturbations of Σ_{k≥0} Π^k_θ(H(θ, x) − h(θ)) via imposing some regularity conditions on µ_θ(·). Towards this goal, Benveniste et al. [1990] gave the following regularity conditions on µ_θ(·) to ensure the convergence of the adaptive algorithm:

There exist a function V: X → [1, ∞) and a constant C such that for all θ, θ′ ∈ Θ,

‖Π_θ µ_θ(x)‖ ≤ C V(x), ‖Π_θ µ_θ(x) − Π_{θ′} µ_{θ′}(x)‖ ≤ C ‖θ − θ′‖ V(x), E[V(x)] < ∞,

which requires only first-order smoothness. In contrast, the ergodicity theory of Mattingly et al. [2010] and Vollmer et al. [2016] relies on the much stronger 4th-order smoothness.
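For intuition, the series solution of the Poisson equation can be verified numerically on a toy ergodic chain (a hypothetical two-state example of ours, not taken from the paper):

```python
import numpy as np

# Toy verification of the series solution on a two-state ergodic chain.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])              # transition kernel Pi_theta
H = np.array([1.0, -1.0])               # random-field values at the two states

# stationary distribution pi: left eigenvector of P for eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()
h = pi @ H                              # mean field h(theta)

# mu = sum_{k>=0} P^k (H - h); the series converges since the chain is ergodic
mu, term = np.zeros(2), H - h
for _ in range(500):
    mu += term
    term = P @ term

# Poisson equation: mu - P mu = H - h
assert np.allclose(mu - P @ mu, H - h, atol=1e-8)
```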
B Stability and convergence analysis for CSGLD
B.1 CSGLD algorithm
To make the theory more general, we slightly extend CSGLD by allowing a higher-order bias term. The resulting algorithm works by iterating between the following two steps:
(1) Sample x_{k+1} = x_k − ε_k ∇_x L̃(x_k, θ_k) + N(0, 2ε_k τ I), (S1)

(2) Update θ_{k+1} = θ_k + ω_{k+1} H̃(θ_k, x_{k+1}) + ω²_{k+1} ρ(θ_k, x_{k+1}), (S2)

where ε_k is the learning rate, ω_{k+1} is the step size, and ∇_x L̃(x, θ) is the stochastic gradient given by

∇_x L̃(x, θ) = (N/n) [1 + (ζτ/∆u)(log θ(J̃(x)) − log θ((J̃(x) − 1) ∨ 1))] ∇_x Ũ(x), (14)

H̃(θ, x) = (H̃_1(θ, x), …, H̃_m(θ, x)) is a random field function with

H̃_i(θ, x) = θ^ζ(J̃(x)) (1_{i=J̃(x)} − θ(i)), i = 1, 2, …, m, (15)

for some constant ζ > 0, and ρ(θ_k, x_{k+1}) is a bias term.
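Steps (S1)-(S2) with (14)-(15) can be sketched as one iteration of the sampler below (our own simplified rendering; the even energy grid, index clipping, and function names are assumptions, and the paper's interpolation/resampling details are omitted):

```python
import numpy as np

def csgld_step(x, theta, grad_U_batch, U_batch, u1, delta_u, m,
               eps, omega, zeta, tau, N_over_n, rng):
    """One CSGLD iteration following (S1)-(S2), with the gradient
    multiplier of (14) and the random field of (15) (sketch)."""
    def index(y):  # subregion index from the stochastic energy (N/n) U(y)
        return int(np.clip(np.ceil((N_over_n * U_batch(y) - u1) / delta_u), 1, m))

    J = index(x)
    # gradient multiplier from (14): 1 + (zeta*tau/du)(log th(J) - log th((J-1) v 1))
    mult = 1.0 + zeta * tau / delta_u * (
        np.log(theta[J - 1]) - np.log(theta[max(J - 2, 0)]))
    # (S1): Langevin step with stochastic gradient at temperature tau
    x = x - eps * N_over_n * mult * grad_U_batch(x) \
          + np.sqrt(2.0 * eps * tau) * rng.standard_normal(x.shape)
    # (S2): stochastic approximation update of theta via (15)
    J = index(x)
    e = np.zeros(m)
    e[J - 1] = 1.0
    theta = theta + omega * theta[J - 1] ** zeta * (e - theta)
    return x, theta
```

Note that the update in (S2) preserves Σ_i θ(i) = 1 when θ is initialized as a probability vector.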
B.2 Convergence of parameter estimation
To establish the convergence of θ_k, we make the following assumptions:

Assumption A1 (Compactness). The space Θ is compact such that inf_Θ θ(i) > 0 for any i ∈ {1, 2, …, m}. There exists a large constant Q > 0 such that for any θ ∈ Θ and x ∈ X,

‖θ‖ ≤ Q, ‖H(θ, x)‖ ≤ Q, ‖ρ(θ, x)‖ ≤ Q. (16)
To simplify the proof, we consider a slightly stronger assumption such that inf_Θ θ(i) > 0 holds for any i ∈ {1, 2, …, m}. To relax this assumption, we refer interested readers to Fort et al. [2015], where the recurrence property was proved for the sequence {θ_k}_{k≥1} of a similar algorithm. Such a property guarantees that θ_k visits a desired compact set often enough, rendering the convergence of the sequence.
Assumption A2 (Smoothness). U(x) is M-smooth; that is, there exists a constant M > 0 such that for any x, x′ ∈ X,

‖∇_x U(x) − ∇_x U(x′)‖ ≤ M ‖x − x′‖. (17)
Smoothness is a standard assumption in the study of the convergence of SGLD; see, e.g., Raginsky et al. [2017] and Xu et al. [2018].

Assumption A3 (Dissipativity). There exist constants m > 0 and b ≥ 0 such that for any x ∈ X and θ ∈ Θ,

⟨∇_x L(x, θ), x⟩ ≥ m ‖x‖² − b. (18)

This assumption ensures that samples move towards the origin regardless of the initial point, which is standard in proving the geometric ergodicity of dynamical systems; see, e.g., Mattingly et al. [2002], Raginsky et al. [2017], and Xu et al. [2018].

Assumption A4 (Gradient noise). The stochastic gradient is unbiased, that is,

E[∇_x Ũ(x_k) − ∇_x U(x_k)] = 0;

in addition, there exist some constants M > 0 and B > 0 such that

E[‖∇_x Ũ(x_k) − ∇_x U(x_k)‖²] ≤ M² ‖x_k‖² + B²,

where the expectation E[·] is taken with respect to the distribution of the noise component included in ∇_x Ũ(x).
Lemma B1 establishes a stability condition for CSGLD, which implies potential convergence of θ_k.

Lemma B1 (Stability). Suppose that Assumptions A1-A4 hold. For any θ ∈ Θ,

⟨h(θ), θ − θ⋆⟩ ≤ −φ ‖θ − θ⋆‖² + O(δ_n(θ) + ε + 1/m),

where φ = inf_θ Z^{-1}_θ > 0, θ⋆ = (∫_{X_1} π(x)dx, ∫_{X_2} π(x)dx, …, ∫_{X_m} π(x)dx), and δ_n(·) is a bias term depending on the batch size n such that δ_n(·) → 0 as n → N.
Proof. Let ϖ_{Ψ̃_θ}(x) ∝ π(x)/Ψ̃^ζ_θ(U(x)) denote a theoretical invariant measure of SGLD, where Ψ̃_θ(u) is a fixed piecewise continuous function given by

Ψ̃_θ(u) = Σ_{i=1}^m θ(i−1) e^{(log θ(i) − log θ(i−1)) (u − u_{i−1})/∆u} 1_{u_{i−1} < u ≤ u_i}, (19)

the full data is used in determining the indexes of subregions, and the learning rate converges to zero. In addition, we define a piecewise-constant function

Ψ_θ(u) = Σ_{i=1}^m θ(i) 1_{u_{i−1} < u ≤ u_i},

and a theoretical measure ϖ_{Ψ_θ}(x) ∝ π(x)/θ^ζ(J(x)). Obviously, as the sample space partition becomes finer and finer, i.e., u_1 → u_min, u_{m−1} → u_max and m → ∞, we have ‖Ψ̃_θ − Ψ_θ‖ → 0 and ‖ϖ_{Ψ̃_θ}(x) − ϖ_{Ψ_θ}(x)‖ → 0, where u_min and u_max denote the minimum and maximum of U(x), respectively. Without loss of generality, we assume u_max < ∞; otherwise, u_max can be set to a value such that π({x: U(x) > u_max}) is sufficiently small.
For each i ∈ {1, 2, …, m}, the random field H̃_i(θ, x) = θ^ζ(J̃(x))(1_{i=J̃(x)} − θ(i)) is a biased estimator of H_i(θ, x) = θ^ζ(J(x))(1_{i=J(x)} − θ(i)). Let δ_n(θ) = E[H̃(θ, x) − H(θ, x)] denote the bias, which is caused by the mini-batch evaluation of the energy and decays to 0 as n → N.
First, let us compute the mean field h(θ) with respect to the empirical measure ϖ_θ(x):

h_i(θ) = ∫_X H̃_i(θ, x) ϖ_θ(x) dx = ∫_X H_i(θ, x) ϖ_θ(x) dx + δ_n(θ)
       = ∫_X H_i(θ, x) [ϖ_{Ψ_θ}(x) − ϖ_{Ψ_θ}(x) + ϖ_{Ψ̃_θ}(x) − ϖ_{Ψ̃_θ}(x) + ϖ_θ(x)] dx + δ_n(θ), (20)

where the three groups of terms give rise to the integrals I1 = ∫_X H_i(θ, x) ϖ_{Ψ_θ}(x) dx, I2 = ∫_X H_i(θ, x)(ϖ_{Ψ̃_θ}(x) − ϖ_{Ψ_θ}(x)) dx, and I3 = ∫_X H_i(θ, x)(ϖ_θ(x) − ϖ_{Ψ̃_θ}(x)) dx.
For the term I1, we have

∫_X H_i(θ, x) ϖ_{Ψ_θ}(x) dx = (1/Z_θ) ∫_X θ^ζ(J(x)) (1_{i=J(x)} − θ(i)) π(x)/θ^ζ(J(x)) dx
= Z^{-1}_θ [Σ_{k=1}^m ∫_{X_k} π(x) 1_{k=i} dx − θ(i) Σ_{k=1}^m ∫_{X_k} π(x) dx]
= Z^{-1}_θ [θ⋆(i) − θ(i)], (21)

where Z_θ = Σ_{i=1}^m ∫_{X_i} π(x) dx / θ(i)^ζ denotes the normalizing constant of ϖ_{Ψ_θ}(x).
Next, let us consider the integrals I2 and I3. By Lemma B4 and the boundedness of H(θ, x), we have

∫_X H_i(θ, x)(ϖ_{Ψ̃_θ}(x) − ϖ_{Ψ_θ}(x)) dx = O(1/m). (22)

For the term I3, we have for any fixed θ,

∫_X H_i(θ, x)(ϖ_θ(x) − ϖ_{Ψ̃_θ}(x)) dx = O(δ_n(θ)) + O(ε), (23)

where δ_n(·) uniformly decays to 0 as n → N and the order of O(ε) follows from Theorem 6 of Sato and Nakagawa [2014].
Plugging (21), (22) and (23) into (20), we have

h_i(θ) = Z^{-1}_θ [ε̃ β_i(θ) + θ⋆(i) − θ(i)], (24)

where ε̃ = O(δ_n(θ) + ε + 1/m) and β_i(θ) is a bounded term such that Z^{-1}_θ ε̃ β_i(θ) = O(δ_n(θ) + ε + 1/m).
To solve the ODE system with small disturbances, we consider standard techniques in perturbation theory. According to the fundamental theorem of perturbation theory [Vanden-Eijnden, 2001], we can obtain the solution to the mean-field equation h(θ) = 0:

θ(i) = θ⋆(i) + ε̃ β_i(θ⋆) + O(ε̃²), i = 1, 2, …, m, (25)

which is a stable point in a small neighbourhood of θ⋆.
Considering the positive definite function V(θ) = ½ ‖θ⋆ − θ‖² for the mean-field system h(θ) = Z^{-1}_θ (ε̃ β(θ) + θ⋆ − θ) = Z^{-1}_θ (θ⋆ − θ) + O(ε̃), we have

⟨h(θ), ∇V(θ)⟩ = ⟨h(θ), θ − θ⋆⟩ = −Z^{-1}_θ ‖θ − θ⋆‖² + O(ε̃) ≤ −φ ‖θ − θ⋆‖² + O(δ_n(θ) + ε + 1/m),

where φ = inf_θ Z^{-1}_θ > 0 by the compactness Assumption A1. This concludes the proof. □
The following is a restatement of Lemma 1 of Deng et al. [2019], which holds for any θ in the compact space Θ.

Lemma B2 (Uniform L2 bounds). Suppose Assumptions A1, A3 and A4 hold. Given a small enough learning rate, sup_{k≥1} E[‖x_k‖²] < ∞.
Lemma B3 (Solution of Poisson equation). Suppose that Assumptions A1-A4 hold. There is a solution µ_θ(·) on X to the Poisson equation

µ_θ(x) − Π_θ µ_θ(x) = H(θ, x) − h(θ). (26)

In addition, for all θ, θ′ ∈ Θ, there exists a constant C such that

E[‖Π_θ µ_θ(x)‖] ≤ C, E[‖Π_θ µ_θ(x) − Π_{θ′} µ_{θ′}(x)‖] ≤ C ‖θ − θ′‖. (27)

Proof. The lemma can be proved based on Theorem 13 of Vollmer et al. [2016], whose conditions can be easily verified for CSGLD given Assumptions A1-A4 and Lemma B2. The details are omitted. □
Now we are ready to prove the first main result on the convergence of θ_k. The technical lemmas are listed in Section B.3.

Assumption A5 (Learning rate and step size). The learning rate {ε_k}_{k∈N} is a positive non-increasing sequence of real numbers satisfying the conditions

lim_k ε_k = 0, Σ_{k=1}^∞ ε_k = ∞.

The step size {ω_k}_{k∈N} is a positive decreasing sequence of real numbers such that

ω_k → 0, Σ_{k=1}^∞ ω_k = +∞, lim inf_{k→∞} 2φ ω_k/ω_{k+1} + (ω_{k+1} − ω_k)/ω²_{k+1} > 0. (28)

According to Benveniste et al. [1990], we can choose ω_k := A/(k^α + B) for some α ∈ (1/2, 1] and some suitable constants A > 0 and B > 0.

Theorem 3 (L2 convergence rate). Suppose Assumptions A1-A5 hold. For a sufficiently large value of m, a sufficiently small learning-rate sequence {ε_k}_{k=1}^∞, and a sufficiently small step-size sequence {ω_k}_{k=1}^∞, {θ_k}_{k=0}^∞ converges to θ⋆ in L2-norm such that

E[‖θ_k − θ⋆‖²] = O(ω_k + sup_{i≥k_0} ε_i + 1/m + sup_{i≥k_0} δ_n(θ_i)),

where k_0 is a sufficiently large constant, and δ_n(θ) is a bias term decaying to 0 as n → N.
Proof. Consider the iterations

θ_{k+1} = θ_k + ω_{k+1}(H(θ_k, x_{k+1}) + ω_{k+1} ρ(θ_k, x_{k+1})).

Define T_k = θ_k − θ⋆. By subtracting θ⋆ from both sides and taking the squared L2 norm, we have

‖T_{k+1}‖² = ‖T_k‖² + ω²_{k+1} ‖H(θ_k, x_{k+1}) + ω_{k+1} ρ(θ_k, x_{k+1})‖² + 2ω_{k+1} ⟨T_k, H(θ_k, x_{k+1}) + ω_{k+1} ρ(θ_k, x_{k+1})⟩,

where we denote the last inner product by D.
First, by Lemma B5, there exists a constant G = 4Q²(1 + Q²) such that

‖H(θ_k, x_{k+1}) + ω_{k+1} ρ(θ_k, x_{k+1})‖² ≤ G(1 + ‖T_k‖²). (29)
Next, by the Poisson equation (26), we have

D = ⟨T_k, H(θ_k, x_{k+1}) + ω_{k+1} ρ(θ_k, x_{k+1})⟩
  = ⟨T_k, h(θ_k) + µ_{θ_k}(x_{k+1}) − Π_{θ_k} µ_{θ_k}(x_{k+1}) + ω_{k+1} ρ(θ_k, x_{k+1})⟩
  = ⟨T_k, h(θ_k)⟩ + ⟨T_k, µ_{θ_k}(x_{k+1}) − Π_{θ_k} µ_{θ_k}(x_{k+1})⟩ + ⟨T_k, ω_{k+1} ρ(θ_k, x_{k+1})⟩
  =: D1 + D2 + D3.
For the term D1, by Lemma B1, we have

E[⟨T_k, h(θ_k)⟩] ≤ −φ E[‖T_k‖²] + O(δ_n(θ_k) + ε_k + 1/m).

For convenience, in the following we denote O(δ_n(θ_k) + ε_k + 1/m) by ∆_k.
To deal with the term D2, we make the following decomposition:

D2 = ⟨T_k, µ_{θ_k}(x_{k+1}) − Π_{θ_k} µ_{θ_k}(x_k)⟩ + ⟨T_k, Π_{θ_k} µ_{θ_k}(x_k) − Π_{θ_{k−1}} µ_{θ_{k−1}}(x_k)⟩ + ⟨T_k, Π_{θ_{k−1}} µ_{θ_{k−1}}(x_k) − Π_{θ_k} µ_{θ_k}(x_{k+1})⟩
   =: D21 + D22 + D23.
(i) From the Markov property, µ_{θ_k}(x_{k+1}) − Π_{θ_k} µ_{θ_k}(x_k) forms a martingale difference sequence:

E[⟨T_k, µ_{θ_k}(x_{k+1}) − Π_{θ_k} µ_{θ_k}(x_k)⟩ | F_k] = 0, (D21)

where F_k is the σ-field generated by {θ_0, x_1, θ_1, x_2, …, x_k, θ_k}.

(ii) By the regularity of the solution of the Poisson equation in (27) and Lemma B6, we have

E[‖Π_{θ_k} µ_{θ_k}(x_k) − Π_{θ_{k−1}} µ_{θ_{k−1}}(x_k)‖] ≤ C ‖θ_k − θ_{k−1}‖ ≤ 2QC ω_k. (30)

Using the Cauchy–Schwarz inequality, (30) and the compactness of Θ in Assumption A1, we have

E[⟨T_k, Π_{θ_k} µ_{θ_k}(x_k) − Π_{θ_{k−1}} µ_{θ_{k−1}}(x_k)⟩] ≤ E[‖T_k‖] · 2QC ω_k ≤ 4Q²C ω_k ≤ 5Q²C ω_{k+1}, (D22)

where the last inequality follows from Assumption A5 and holds for a sufficiently large k.
(iii) For the last term of D2,

⟨T_k, Π_{θ_{k−1}} µ_{θ_{k−1}}(x_k) − Π_{θ_k} µ_{θ_k}(x_{k+1})⟩
= (⟨T_k, Π_{θ_{k−1}} µ_{θ_{k−1}}(x_k)⟩ − ⟨T_{k+1}, Π_{θ_k} µ_{θ_k}(x_{k+1})⟩) + (⟨T_{k+1}, Π_{θ_k} µ_{θ_k}(x_{k+1})⟩ − ⟨T_k, Π_{θ_k} µ_{θ_k}(x_{k+1})⟩)
= (z_k − z_{k+1}) + ⟨T_{k+1} − T_k, Π_{θ_k} µ_{θ_k}(x_{k+1})⟩,

where z_k = ⟨T_k, Π_{θ_{k−1}} µ_{θ_{k−1}}(x_k)⟩. By the regularity assumption (27) and Lemma B6,

E[⟨T_{k+1} − T_k, Π_{θ_k} µ_{θ_k}(x_{k+1})⟩] ≤ E[‖θ_{k+1} − θ_k‖] · E[‖Π_{θ_k} µ_{θ_k}(x_{k+1})‖] ≤ 2QC ω_{k+1}. (D23)
Regarding D3, since ρ(θ_k, x_{k+1}) is bounded, applying the Cauchy–Schwarz inequality gives

E[⟨T_k, ω_{k+1} ρ(θ_k, x_{k+1})⟩] ≤ 2Q² ω_{k+1}. (D3)
Finally, adding (29), D1, D21, D22, D23 and D3 together, it follows that for a constant C_0 = G + 10Q²C + 4QC + 4Q²,

E[‖T_{k+1}‖²] ≤ (1 − 2ω_{k+1}φ + Gω²_{k+1}) E[‖T_k‖²] + C_0 ω²_{k+1} + 2∆_k ω_{k+1} + 2 E[z_k − z_{k+1}] ω_{k+1}. (31)
Moreover, from (16) and (27), E[|z_k|] is upper bounded by

E[|z_k|] = E[⟨T_k, Π_{θ_{k−1}} µ_{θ_{k−1}}(x_k)⟩] ≤ E[‖T_k‖] E[‖Π_{θ_{k−1}} µ_{θ_{k−1}}(x_k)‖] ≤ 2QC. (32)
According to Lemma B7, we can choose λ_0 and k_0 such that

E[‖T_{k_0}‖²] ≤ ψ_{k_0} = λ_0 ω_{k_0} + (1/φ) sup_{i≥k_0} ∆_i,

which satisfies the conditions (43) and (44) of Lemma B9. Applying Lemma B9 leads to

E[‖T_k‖²] ≤ ψ_k + E[Σ_{j=k_0+1}^k Λ^k_j (z_{j−1} − z_j)], (33)

where ψ_k = λ_0 ω_k + (1/φ) sup_{i≥k_0} ∆_i for all k > k_0. Based on (32) and the increasing property of Λ^k_j in Lemma B8, we have
E|Σ_{j=k_0+1}^k Λ^k_j (z_{j−1} − z_j)| = E|Σ_{j=k_0+1}^{k−1} (Λ^k_{j+1} − Λ^k_j) z_j − 2ω_k z_k + Λ^k_{k_0+1} z_{k_0}|
≤ Σ_{j=k_0+1}^{k−1} 2(Λ^k_{j+1} − Λ^k_j) QC + E[|2ω_k z_k|] + 2Λ^k_{k_0+1} QC
≤ 2(Λ^k_k − Λ^k_{k_0+1}) QC + 2Λ^k_k QC + 2Λ^k_k QC
≤ 6Λ^k_k QC. (34)
Given ψ_k = λ_0 ω_k + (1/φ) sup_{i≥k_0} ∆_i, which satisfies the conditions (43) and (44) of Lemma B9, it follows from (33) and (34) that the following inequality holds for any k > k_0:

E[‖T_k‖²] ≤ ψ_k + 6Λ^k_k QC = (λ_0 + 12QC) ω_k + (1/φ) sup_{i≥k_0} ∆_i = λ ω_k + (1/φ) sup_{i≥k_0} ∆_i,

where λ = λ_0 + 12QC, λ_0 = (2G sup_{i≥k_0} ∆_i + 2C_0 φ)/(C_1 φ), C_1 = lim inf_{k→∞} 2φ ω_k/ω_{k+1} + (ω_{k+1} − ω_k)/ω²_{k+1} > 0, C_0 = G + 10Q²C + 4QC + 4Q², and G = 4Q²(1 + Q²). □
B.3 Technical lemmas
Lemma B4. Suppose Assumption A1 holds, and u_1 and u_{m−1} are fixed such that Ψ(u_1) > ν and Ψ(u_{m−1}) > 1 − ν for some small constant ν > 0. For any bounded function f(x), we have

∫_X f(x)(ϖ_{Ψ_θ}(x) − ϖ_{Ψ̃_θ}(x)) dx = O(1/m). (35)
Proof. Recall that ϖ_{Ψ_θ}(x) = (1/Z_θ) π(x)/θ^ζ(J(x)) and ϖ_{Ψ̃_θ}(x) = (1/Z_{Ψ̃_θ}) π(x)/Ψ̃^ζ_θ(U(x)). Since f(x) is bounded, it suffices to show

∫_X |(1/Z_θ) π(x)/θ^ζ(J(x)) − (1/Z_{Ψ̃_θ}) π(x)/Ψ̃^ζ_θ(U(x))| dx
≤ ∫_X |(1/Z_θ) π(x)/θ^ζ(J(x)) − (1/Z_θ) π(x)/Ψ̃^ζ_θ(U(x))| dx + ∫_X |(1/Z_θ) π(x)/Ψ̃^ζ_θ(U(x)) − (1/Z_{Ψ̃_θ}) π(x)/Ψ̃^ζ_θ(U(x))| dx
= (1/Z_θ) Σ_{i=1}^m ∫_{X_i} |π(x)/θ^ζ(i) − π(x)/Ψ̃^ζ_θ(U(x))| dx + |1/Z_θ − 1/Z_{Ψ̃_θ}| Σ_{i=1}^m ∫_{X_i} π(x)/Ψ̃^ζ_θ(U(x)) dx
=: I1 + I2 = O(1/m), (36)

where Z_θ = Σ_{i=1}^m ∫_{X_i} π(x)/θ(i)^ζ dx, Z_{Ψ̃_θ} = Σ_{i=1}^m ∫_{X_i} π(x)/Ψ̃^ζ_θ(U(x)) dx, and Ψ̃_θ(u) is the piecewise continuous function defined in (19).
By Assumption A1, inf_Θ θ(i) > 0 for any i. Further, by the mean-value theorem, which implies |x^ζ − y^ζ| ≲ |x − y| z^{ζ−1} for any ζ > 0, x ≤ y and z ∈ [x, y], we have

I1 = (1/Z_θ) Σ_{i=1}^m ∫_{X_i} |θ^ζ(i) − Ψ̃^ζ_θ(U(x))| / (θ^ζ(i) Ψ̃^ζ_θ(U(x))) π(x) dx ≲ (1/Z_θ) Σ_{i=1}^m ∫_{X_i} |Ψ̃_θ(u_{i−1}) − Ψ̃_θ(u_i)| / θ^ζ(i) π(x) dx
≤ max_i |Ψ̃_θ(u_i − ∆u) − Ψ̃_θ(u_i)| · (1/Z_θ) Σ_{i=1}^m ∫_{X_i} π(x)/θ^ζ(i) dx = max_i |Ψ̃_θ(u_i − ∆u) − Ψ̃_θ(u_i)| ≲ ∆u = O(1/m),

where the last inequality follows from a Taylor expansion, and the last equality follows since u_1 and u_{m−1} are fixed. Similarly, we have

I2 = |1/Z_θ − 1/Z_{Ψ̃_θ}| Z_{Ψ̃_θ} = |Z_{Ψ̃_θ} − Z_θ| / Z_θ ≤ (1/Z_θ) Σ_{i=1}^m ∫_{X_i} |π(x)/θ^ζ(i) − π(x)/Ψ̃^ζ_θ(U(x))| dx = I1 = O(1/m).

The proof can then be concluded by combining the orders of I1 and I2. □

Lemma B5. Given sup_{k≥1} ω_k ≤ 1, there exists a constant G = 4Q²(1 + Q²) such that

‖H(θ_k, x_{k+1}) + ω_{k+1} ρ(θ_k, x_{k+1})‖² ≤ G(1 + ‖θ_k − θ⋆‖²). (37)
Proof. According to the compactness condition in Assumption A1, we have

‖H(θ_k, x_{k+1})‖² ≤ Q²(1 + ‖θ_k‖²) = Q²(1 + ‖θ_k − θ⋆ + θ⋆‖²) ≤ Q²(1 + 2‖θ_k − θ⋆‖² + 2Q²). (38)

Therefore, using (38), we can show that for the constant G = 4Q²(1 + Q²),

‖H(θ_k, x_{k+1}) + ω_{k+1} ρ(θ_k, x_{k+1})‖² ≤ 2‖H(θ_k, x_{k+1})‖² + 2ω²_{k+1} ‖ρ(θ_k, x_{k+1})‖²
≤ 2Q²(1 + 2‖θ_k − θ⋆‖² + 2Q²) + 2Q²
≤ 2Q²(2 + 2Q² + (2 + 2Q²)‖θ_k − θ⋆‖²)
≤ G(1 + ‖θ_k − θ⋆‖²). □
Lemma B6. Given sup_{k≥1} ω_k ≤ 1, we have

‖θ_k − θ_{k−1}‖ ≤ 2ω_k Q. (39)

Proof. Following the update θ_k − θ_{k−1} = ω_k H(θ_{k−1}, x_k) + ω²_k ρ(θ_{k−1}, x_k), we have

‖θ_k − θ_{k−1}‖ = ‖ω_k H(θ_{k−1}, x_k) + ω²_k ρ(θ_{k−1}, x_k)‖ ≤ ω_k ‖H(θ_{k−1}, x_k)‖ + ω²_k ‖ρ(θ_{k−1}, x_k)‖.

By the compactness condition in Assumption A1 and sup_{k≥1} ω_k ≤ 1, (39) follows. □

Lemma B7. There exist constants λ_0 and k_0 such that for all λ ≥ λ_0 and k > k_0, the sequence {ψ_k}_{k≥1}, where ψ_k = λ ω_k + (1/φ) sup_{i≥k_0} ∆_i, satisfies

ψ_{k+1} ≥ (1 − 2ω_{k+1}φ + Gω²_{k+1}) ψ_k + C_0 ω²_{k+1} + 2∆_k ω_{k+1}. (40)
Proof. Replacing ψ_k with λ ω_k + (1/φ) sup_{i≥k_0} ∆_i in (40), it suffices to show

λ ω_{k+1} + (1/φ) sup_{i≥k_0} ∆_i ≥ (1 − 2ω_{k+1}φ + Gω²_{k+1}) (λ ω_k + (1/φ) sup_{i≥k_0} ∆_i) + C_0 ω²_{k+1} + 2∆_k ω_{k+1},

which is equivalent to proving

λ (ω_{k+1} − ω_k + 2ω_k ω_{k+1} φ − G ω_k ω²_{k+1}) ≥ (1/φ) sup_{i≥k_0} ∆_i (−2ω_{k+1}φ + Gω²_{k+1}) + C_0 ω²_{k+1} + 2∆_k ω_{k+1}.

Given the step-size condition in (28), we have

ω_{k+1} − ω_k + 2ω_k ω_{k+1} φ ≥ C_1 ω²_{k+1},

where C_1 = lim inf_{k→∞} 2φ ω_k/ω_{k+1} + (ω_{k+1} − ω_k)/ω²_{k+1} > 0. Combining this with −sup_{i≥k_0} ∆_i ≤ −∆_k, it suffices to prove

λ (C_1 − G ω_k) ω²_{k+1} ≥ ((G/φ) sup_{i≥k_0} ∆_i + C_0) ω²_{k+1}. (41)

It is clear that for a large enough k_0 and λ_0 such that ω_{k_0} ≤ C_1/(2G) and λ_0 = (2G sup_{i≥k_0} ∆_i + 2C_0 φ)/(C_1 φ), the desired conclusion (41) holds for all k ≥ k_0 and λ ≥ λ_0. □
The following lemma is a restatement of Lemma 25 (page 247) of Benveniste et al. [1990].

Lemma B8. Suppose $k_0$ is an integer satisfying
$$\inf_{k > k_0}\, \frac{\omega_{k+1} - \omega_k}{\omega_k\omega_{k+1}} + 2\delta - G\omega_{k+1} > 0$$
for some constant $G$. Then for any $k > k_0$, the sequence $\{\Lambda_k^K\}_{k=k_0,\ldots,K}$ defined below is increasing and upper bounded by $2\omega_K$:
$$\Lambda_k^K = \begin{cases} 2\omega_k \prod_{j=k}^{K-1}\big(1 - 2\omega_{j+1}\delta + G\omega_{j+1}^2\big), & \text{if } k < K, \\[4pt] 2\omega_k, & \text{if } k = K. \end{cases} \quad (42)$$
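The behavior of the sequence $\Lambda_k^K$ in (42) can be sanity-checked numerically. The sketch below is only illustrative: the step sizes $\omega_k = 10/(k^{0.8}+100)$ and the constants $\delta = 1$, $G = 0.1$ are assumptions chosen to satisfy the step-size condition of the lemma, not the paper's experimental settings.

```python
# Illustrative (assumed) constants: omega_k = 10/(k^0.8 + 100), delta = 1,
# G = 0.1 satisfy inf_k (omega_{k+1}-omega_k)/(omega_k*omega_{k+1}) + 2*delta - G*omega_{k+1} > 0.
def omega(k):
    return 10.0 / (k ** 0.8 + 100.0)

delta, G = 1.0, 0.1

def Lambda(k, K):
    """Lambda_k^K = 2*omega_k * prod_{j=k}^{K-1} (1 - 2*omega_{j+1}*delta + G*omega_{j+1}^2);
    the empty product at k = K gives Lambda_K^K = 2*omega_K."""
    prod = 1.0
    for j in range(k, K):
        w = omega(j + 1)
        prod *= 1.0 - 2.0 * w * delta + G * w * w
    return 2.0 * omega(k) * prod

k0, K = 100, 200
seq = [Lambda(k, K) for k in range(k0, K + 1)]
assert all(a <= b + 1e-15 for a, b in zip(seq, seq[1:]))  # increasing in k
assert abs(seq[-1] - 2.0 * omega(K)) < 1e-12              # boundary value 2*omega_K
```

Under these step sizes each factor $1 - 2\omega_{j+1}\delta + G\omega_{j+1}^2$ lies in $(0, 1)$, which is what makes the sequence increasing in $k$ and bounded by its terminal value $2\omega_K$.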
Lemma B9. Let $\{\psi_k\}_{k > k_0}$ be a sequence that satisfies, for all $k > k_0$,
$$\psi_{k+1} \ge \psi_k\big(1 - 2\omega_{k+1}\delta + G\omega_{k+1}^2\big) + C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1}, \quad (43)$$
and assume there exists a $k_0$ such that
$$\mathbb{E}\big[\|T_{k_0}\|^2\big] \le \psi_{k_0}. \quad (44)$$
Then for all $k > k_0$, we have
$$\mathbb{E}\big[\|T_k\|^2\big] \le \psi_k + \sum_{j=k_0+1}^{k} \Lambda_j^k (z_{j-1} - z_j). \quad (45)$$
Proof. We prove the result by induction. Assuming (45) holds at step $k$ and applying (31), we have
$$\begin{aligned}
\mathbb{E}\big[\|T_{k+1}\|^2\big] \le{}& \big(1 - 2\omega_{k+1}\delta + \omega_{k+1}^2 G\big)\Big(\psi_k + \sum_{j=k_0+1}^{k}\Lambda_j^k(z_{j-1} - z_j)\Big) \\
&+ C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1} + 2\omega_{k+1}\mathbb{E}[z_k - z_{k+1}].
\end{aligned}$$
Applying (40) and Lemma B8, respectively, we have
$$\begin{aligned}
\mathbb{E}\big[\|T_{k+1}\|^2\big] &\le \psi_{k+1} + \big(1 - 2\omega_{k+1}\delta + \omega_{k+1}^2 G\big)\sum_{j=k_0+1}^{k}\Lambda_j^k(z_{j-1} - z_j) + 2\omega_{k+1}\mathbb{E}[z_k - z_{k+1}] \\
&\le \psi_{k+1} + \sum_{j=k_0+1}^{k}\Lambda_j^{k+1}(z_{j-1} - z_j) + \Lambda_{k+1}^{k+1}\mathbb{E}[z_k - z_{k+1}] \\
&\le \psi_{k+1} + \sum_{j=k_0+1}^{k+1}\Lambda_j^{k+1}(z_{j-1} - z_j).
\end{aligned}$$
C Ergodicity and dynamic importance sampler

Our interest is to analyze the deviation between the weighted averaging estimator $\frac{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i)) f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i))}$ and the posterior expectation $\int_{\mathcal{X}} f(x)\pi(dx)$ for a bounded function $f$. To accomplish this analysis, we first study the convergence of the posterior sample mean $\frac{1}{k}\sum_{i=1}^k f(x_i)$ to the expectation $\bar f = \int_{\mathcal{X}} f(x)\varpi_{\widetilde\theta_\star}(x)dx$, and then extend it to $\int_{\mathcal{X}} f(x)\varpi_{\theta_\star}(x)dx$. The key tool for establishing the ergodic theory is still the Poisson equation, which is used to characterize the fluctuation between $f(x)$ and $\bar f$:
$$\mathcal{L}g(x) = f(x) - \bar f, \quad (46)$$
where $g(x)$ is the solution of the Poisson equation and $\mathcal{L}$ is the infinitesimal generator of the Langevin diffusion,
$$\mathcal{L}g := \langle \nabla g, \nabla L(\cdot, \theta_\star)\rangle + \tau \nabla^2 g.$$
By imposing the following regularity conditions on the function $g(x)$, we can control the perturbation of $\frac{1}{k}\sum_{i=1}^k f(x_i) - \bar f$ and establish convergence of the weighted averaging estimator.

Assumption A6 (Regularity). $g(x)$ is sufficiently smooth, and there exists a function $\mathcal{V}(x)$ such that $\|D^k g\| \lesssim \mathcal{V}^{p_k}(x)$ for some constants $p_k > 0$, where $k \in \{0, 1, 2, 3\}$. In addition, $\mathcal{V}^p$ has a bounded expectation, i.e., $\sup_x \mathbb{E}[\mathcal{V}^p(x)] < \infty$, and $\mathcal{V}$ is smooth, i.e., $\sup_{s \in (0,1)} \mathcal{V}^p(sx + (1-s)y) \lesssim \mathcal{V}^p(x) + \mathcal{V}^p(y)$ for all $x, y \in \mathcal{X}$ and $p \le 2\max_k\{p_k\}$.
For stronger but verifiable conditions, we refer readers to Vollmer et al. [2016]. In what follows, we present a lemma, which is largely adapted from Theorem 2 of Chen et al. [2015] with a fixed learning rate $\varepsilon$.

Lemma C1 (Convergence of the Averaging Estimators). Suppose Assumptions A1-A6 hold. For any bounded function $f$,
$$\left|\mathbb{E}\Big[\frac{\sum_{i=1}^k f(x_i)}{k}\Big] - \int_{\mathcal{X}} f(x)\varpi_{\theta_\star}(x)dx\right| = O\left(\frac{1}{k\varepsilon} + \sqrt{\varepsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \ge k_0}\sqrt{\delta_n(\theta_i)}\right),$$
where $k_0$ is a sufficiently large constant, $\varpi_{\theta_\star}(x) \propto \frac{\pi(x)}{\theta_\star^{\zeta}(J(x))}$, and $\frac{\sum_{i=1}^k \omega_i}{k} = o\big(\frac{1}{\sqrt{k}}\big)$ as implied by Assumption A5.
Proof. We first rewrite the CSGLD algorithm as follows:
$$\begin{aligned}
x_{k+1} &= x_k - \varepsilon_k \nabla_x \widetilde L(x_k, \theta_k) + \mathcal{N}(0, 2\varepsilon_k\tau I) \\
&= x_k - \varepsilon_k\big(\nabla_x \widetilde L(x_k, \theta_\star) + \Upsilon(x_k, \theta_k, \theta_\star)\big) + \mathcal{N}(0, 2\varepsilon_k\tau I),
\end{aligned}$$
where $\nabla_x \widetilde L(x, \theta) = \frac{N}{n}\big[1 + \frac{\zeta\tau}{\Delta u}\big(\log\theta(\widetilde J(x)) - \log\theta((\widetilde J(x) - 1)\vee 1)\big)\big]\nabla_x \widetilde U(x)$, $\nabla_x L(x, \theta)$ is as defined in Section B.1, and the bias term is given by $\Upsilon(x_k, \theta_k, \theta_\star) = \nabla_x \widetilde L(x_k, \theta_k) - \nabla_x \widetilde L(x_k, \theta_\star)$.

By Assumption A2, we have $\|\nabla_x U(x)\| = \|\nabla_x U(x) - \nabla_x U(x_\star)\| \lesssim \|x - x_\star\| \le \|x\| + \|x_\star\|$ for some optimum $x_\star$. The $L^2$ upper bound in Lemma B2 then implies that $\nabla_x U(x)$ has a bounded second moment. Combining this with Assumption A4, we have $\mathbb{E}[\|\nabla_x U(x)\|^2] < \infty$. Further, by Eve's law (i.e., the variance decomposition formula), it is easy to derive that $\mathbb{E}[\|\nabla_x \widetilde U(x)\|] < \infty$. Then, by the triangle inequality and Jensen's inequality,
$$\begin{aligned}
\|\mathbb{E}[\Upsilon(x_k, \theta_k, \theta_\star)]\| &\le \mathbb{E}\big[\|\nabla_x \widetilde L(x_k, \theta_k) - \nabla_x \widetilde L(x_k, \theta_\star)\|\big] + \mathbb{E}\big[\|\nabla_x \widetilde L(x_k, \theta_\star) - \nabla_x L(x_k, \theta_\star)\|\big] \\
&\lesssim \mathbb{E}[\|\theta_k - \theta_\star\|] + O(\delta_n(\theta_\star)) \le \sqrt{\mathbb{E}[\|\theta_k - \theta_\star\|^2]} + O(\delta_n(\theta_\star)) \\
&\le O\left(\sqrt{\omega_k + \varepsilon + \frac{1}{m} + \sup_{i \ge k_0}\delta_n(\theta_i)}\right),
\end{aligned} \quad (47)$$
where Assumption A1 and Theorem 3 are used to derive the smoothness of $\nabla_x \widetilde L(x, \theta)$ with respect to $\theta$, and $\delta_n(\theta) = \mathbb{E}[\widetilde H(\theta, x) - H(\theta, x)]$ is the bias caused by the mini-batch evaluation of $U(x)$.
The ergodic average based on biased gradients and a fixed learning rate is a direct result of Theorem 2 of Chen et al. [2015] by imposing the regularity condition A6. By simulating from $\varpi_{\widetilde\theta_\star}(x) \propto \frac{\pi(x)}{\widetilde\theta_\star^{\zeta}(U(x))}$ and combining (47) and Theorem 3, we have
$$\begin{aligned}
\left|\mathbb{E}\Big[\frac{\sum_{i=1}^k f(x_i)}{k}\Big] - \int_{\mathcal{X}} f(x)\varpi_{\widetilde\theta_\star}(x)dx\right| &\le O\left(\frac{1}{k\varepsilon} + \varepsilon + \frac{\sum_{j=1}^k\|\mathbb{E}[\Upsilon(x_j, \theta_j, \theta_\star)]\|}{k}\right) \\
&\lesssim O\left(\frac{1}{k\varepsilon} + \varepsilon + \frac{\sum_{j=1}^k\sqrt{\omega_j + \varepsilon + \frac{1}{m} + \sup_{i \ge k_0}\delta_n(\theta_i)}}{k}\right) \\
&\le O\left(\frac{1}{k\varepsilon} + \sqrt{\varepsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \ge k_0}\sqrt{\delta_n(\theta_i)}\right),
\end{aligned}$$
where the last inequality follows by repeatedly applying the inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ and the inequality $\sum_{i=1}^k\sqrt{\omega_i} \le \sqrt{k}\sqrt{\sum_{i=1}^k\omega_i}$.
For any bounded function $f(x)$, we have $\big|\int_{\mathcal{X}} f(x)\varpi_{\widetilde\theta_\star}(x)dx - \int_{\mathcal{X}} f(x)\varpi_{\theta_\star}(x)dx\big| = O\big(\frac{1}{m}\big)$ by Lemma B4. By the triangle inequality, we have
$$\left|\mathbb{E}\Big[\frac{\sum_{i=1}^k f(x_i)}{k}\Big] - \int_{\mathcal{X}} f(x)\varpi_{\theta_\star}(x)dx\right| \le O\left(\frac{1}{k\varepsilon} + \sqrt{\varepsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \ge k_0}\sqrt{\delta_n(\theta_i)}\right),$$
which concludes the proof.
Finally, we are ready to show the convergence of the weighted averaging estimator $\frac{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i)) f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i))}$ to the posterior mean $\int_{\mathcal{X}} f(x)\pi(dx)$.

Theorem 4 (Convergence of the Weighted Averaging Estimators). Suppose Assumptions A1-A6 hold. For any bounded function $f$, we have
$$\left|\mathbb{E}\Big[\frac{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i)) f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i))}\Big] - \int_{\mathcal{X}} f(x)\pi(dx)\right| = O\left(\frac{1}{k\varepsilon} + \sqrt{\varepsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \ge k_0}\sqrt{\delta_n(\theta_i)}\right).$$
Proof. Applying the triangle inequality and $|\mathbb{E}[x]| \le \mathbb{E}[|x|]$, we have
$$\begin{aligned}
&\left|\mathbb{E}\Big[\frac{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i)) f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i))}\Big] - \int_{\mathcal{X}} f(x)\pi(dx)\right| \\
&\le \underbrace{\mathbb{E}\left[\left|\frac{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i)) f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i))} - \frac{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i)) f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))}\right|\right]}_{I_1} + \underbrace{\mathbb{E}\left[\left|\frac{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i)) f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))} - Z_{\theta_\star}\frac{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i)) f(x_i)}{k}\right|\right]}_{I_2} \\
&\quad + \underbrace{\mathbb{E}\left[\frac{Z_{\theta_\star}}{k}\sum_{i=1}^k \big|\theta_i^{\zeta}(J(x_i)) - \theta_\star^{\zeta}(J(x_i))\big| \cdot |f(x_i)|\right]}_{I_3} + \underbrace{\left|\mathbb{E}\Big[\frac{Z_{\theta_\star}}{k}\sum_{i=1}^k \theta_\star^{\zeta}(J(x_i)) f(x_i)\Big] - \int_{\mathcal{X}} f(x)\pi(dx)\right|}_{I_4}.
\end{aligned}$$
For the term $I_1$, consider the bias $\delta_n(\theta) = \mathbb{E}[\widetilde H(\theta, x) - H(\theta, x)]$ as defined in the proof of Lemma B1, which decreases to 0 as $n \to N$. By applying the mean-value theorem, we have
$$\begin{aligned}
I_1 &= \mathbb{E}\left[\left|\frac{\big(\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i))f(x_i)\big)\big(\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))\big) - \big(\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))f(x_i)\big)\big(\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i))\big)}{\big(\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i))\big)\big(\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))\big)}\right|\right] \\
&\lesssim \sup_i \delta_n(\theta_i)\,\mathbb{E}\left[\frac{\big(\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))|f(x_i)|\big)\big(\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))\big)}{\big(\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i))\big)\big(\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))\big)}\right] = O\Big(\sup_i \delta_n(\theta_i)\Big),
\end{aligned} \quad (48)$$
where the boundedness of $f$ and $\inf_{\Theta}\theta^{\zeta}(i) > 0$ ensure that the last expectation is bounded.
For the term $I_2$, by the boundedness of $\Theta$ and $f$ and the assumption $\inf_{\Theta}\theta^{\zeta}(i) > 0$, we have
$$\begin{aligned}
I_2 &= \mathbb{E}\left[\left|\frac{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i)) f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))}\Big(1 - Z_{\theta_\star}\frac{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))}{k}\Big)\right|\right] \lesssim \mathbb{E}\left[\left|Z_{\theta_\star}\frac{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))}{k} - 1\right|\right] \\
&= \mathbb{E}\left[\left|Z_{\theta_\star}\sum_{i=1}^m \frac{\sum_{j=1}^k\big(\theta_j^{\zeta}(i) - \theta_\star^{\zeta}(i) + \theta_\star^{\zeta}(i)\big)\mathbf{1}_{J(x_j)=i}}{k} - 1\right|\right] \\
&\le \underbrace{\mathbb{E}\left[Z_{\theta_\star}\sum_{i=1}^m \frac{\sum_{j=1}^k\big|\theta_j^{\zeta}(i) - \theta_\star^{\zeta}(i)\big|\mathbf{1}_{J(x_j)=i}}{k}\right]}_{I_{21}} + \underbrace{\mathbb{E}\left[\left|Z_{\theta_\star}\sum_{i=1}^m \theta_\star^{\zeta}(i)\frac{\sum_{j=1}^k \mathbf{1}_{J(x_j)=i}}{k} - 1\right|\right]}_{I_{22}}.
\end{aligned}$$
For $I_{21}$, by first applying the inequality $|x^{\zeta} - y^{\zeta}| \le \zeta|x - y|z^{\zeta-1}$, which holds for any $\zeta > 0$, $x \le y$, and $z \in [x, y]$ by the mean-value theorem, and then applying the Cauchy-Schwarz inequality, we have
$$I_{21} \lesssim \frac{1}{k}\mathbb{E}\left[\sum_{j=1}^k\sum_{i=1}^m \big|\theta_j^{\zeta}(i) - \theta_\star^{\zeta}(i)\big|\right] \lesssim \frac{1}{k}\mathbb{E}\left[\sum_{j=1}^k\sum_{i=1}^m \big|\theta_j(i) - \theta_\star(i)\big|\right] \lesssim \frac{1}{\sqrt{k}}\sqrt{\sum_{j=1}^k \mathbb{E}\big[\|\theta_j - \theta_\star\|^2\big]}, \quad (49)$$
where the compactness of $\Theta$ has been used in deriving the second inequality.
For $I_{22}$, consider the following relation:
$$1 = \sum_{i=1}^m \int_{\mathcal{X}_i} \pi(x)dx = \sum_{i=1}^m \int_{\mathcal{X}_i} \theta_\star^{\zeta}(i)\frac{\pi(x)}{\theta_\star^{\zeta}(i)}dx = Z_{\theta_\star}\int_{\mathcal{X}} \sum_{i=1}^m \theta_\star^{\zeta}(i)\mathbf{1}_{J(x)=i}\,\varpi_{\theta_\star}(x)dx.$$
Then we have
$$\begin{aligned}
I_{22} &= \mathbb{E}\left[\left|Z_{\theta_\star}\sum_{i=1}^m \theta_\star^{\zeta}(i)\frac{\sum_{j=1}^k \mathbf{1}_{J(x_j)=i}}{k} - Z_{\theta_\star}\int_{\mathcal{X}}\sum_{i=1}^m \theta_\star^{\zeta}(i)\mathbf{1}_{J(x)=i}\,\varpi_{\theta_\star}(x)dx\right|\right] \\
&= Z_{\theta_\star}\mathbb{E}\left[\left|\frac{1}{k}\sum_{j=1}^k\Big(\sum_{i=1}^m \theta_\star^{\zeta}(i)\mathbf{1}_{J(x_j)=i}\Big) - \int_{\mathcal{X}}\Big(\sum_{i=1}^m \theta_\star^{\zeta}(i)\mathbf{1}_{J(x)=i}\Big)\varpi_{\theta_\star}(x)dx\right|\right] \\
&= O\left(\frac{1}{k\varepsilon} + \sqrt{\varepsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \ge k_0}\sqrt{\delta_n(\theta_i)}\right),
\end{aligned} \quad (50)$$
where the last equality follows from Lemma C1 since the step function $\sum_{i=1}^m \theta_\star^{\zeta}(i)\mathbf{1}_{J(x)=i}$ is integrable.
For $I_3$, by the boundedness of $f$, the mean-value theorem, and the Cauchy-Schwarz inequality, we have
$$I_3 \lesssim \mathbb{E}\left[\frac{1}{k}\sum_{i=1}^k \big|\theta_i^{\zeta}(J(x_i)) - \theta_\star^{\zeta}(J(x_i))\big|\right] \lesssim \frac{1}{k}\mathbb{E}\left[\sum_{j=1}^k\sum_{i=1}^m \big|\theta_j(i) - \theta_\star(i)\big|\right] \lesssim \frac{1}{\sqrt{k}}\sqrt{\sum_{j=1}^k \mathbb{E}\big[\|\theta_j - \theta_\star\|^2\big]}. \quad (51)$$
For the last term $I_4$, we first decompose $\int_{\mathcal{X}} f(x)\pi(dx)$ into $m$ disjoint regions to facilitate the analysis:
$$\int_{\mathcal{X}} f(x)\pi(dx) = \int_{\cup_{j=1}^m \mathcal{X}_j} f(x)\pi(dx) = \sum_{j=1}^m \int_{\mathcal{X}_j} \theta_\star^{\zeta}(j)f(x)\frac{\pi(dx)}{\theta_\star^{\zeta}(j)} = Z_{\theta_\star}\int_{\mathcal{X}} \sum_{j=1}^m \theta_\star^{\zeta}(j)f(x)\mathbf{1}_{J(x)=j}\,\varpi_{\theta_\star}(x)dx. \quad (52)$$
Plugging (52) into the last term $I_4$, we have
$$\begin{aligned}
I_4 &= \left|\mathbb{E}\Big[\frac{Z_{\theta_\star}}{k}\sum_{i=1}^k\sum_{j=1}^m \theta_\star^{\zeta}(j)f(x_i)\mathbf{1}_{J(x_i)=j}\Big] - \int_{\mathcal{X}} f(x)\pi(dx)\right| \\
&= Z_{\theta_\star}\left|\mathbb{E}\Big[\frac{1}{k}\sum_{i=1}^k\Big(\sum_{j=1}^m \theta_\star^{\zeta}(j)f(x_i)\mathbf{1}_{J(x_i)=j}\Big)\Big] - \int_{\mathcal{X}}\Big(\sum_{j=1}^m \theta_\star^{\zeta}(j)f(x)\mathbf{1}_{J(x)=j}\Big)\varpi_{\theta_\star}(x)dx\right|. \quad (53)
\end{aligned}$$
Applying Lemma C1 to the function $\sum_{j=1}^m \theta_\star^{\zeta}(j)f(x)\mathbf{1}_{J(x)=j}$ yields
$$\left|\mathbb{E}\Big[\frac{1}{k}\sum_{i=1}^k\sum_{j=1}^m \theta_\star^{\zeta}(j)f(x_i)\mathbf{1}_{J(x_i)=j}\Big] - \int_{\mathcal{X}}\Big(\sum_{j=1}^m \theta_\star^{\zeta}(j)f(x)\mathbf{1}_{J(x)=j}\Big)\varpi_{\theta_\star}(x)dx\right| = O\left(\frac{1}{k\varepsilon} + \sqrt{\varepsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \ge k_0}\sqrt{\delta_n(\theta_i)}\right). \quad (54)$$
Plugging (54) into (53) and combining the bounds on $I_1$, $I_{21}$, $I_{22}$, $I_3$ with Theorem 3, we have
$$\left|\mathbb{E}\Big[\frac{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i)) f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde J(x_i))}\Big] - \int_{\mathcal{X}} f(x)\pi(dx)\right| = O\left(\frac{1}{k\varepsilon} + \sqrt{\varepsilon} + \sqrt{\frac{\sum_{i=1}^k\omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \ge k_0}\sqrt{\delta_n(\theta_i)}\right),$$
which concludes the proof of the theorem.
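In practice, the weighted averaging estimator of Theorem 4 is computed directly from the trajectory. The sketch below is a minimal illustration under assumed inputs (trajectory values $f(x_i)$, the stored weight vectors $\theta_i$, and subregion indices, all hypothetical); the function name `weighted_average` is ours, not the paper's code.

```python
import numpy as np

def weighted_average(f_vals, theta_hist, J_idx, zeta):
    """Weighted averaging estimator of Theorem 4:
    sum_i theta_i^zeta(J(x_i)) f(x_i) / sum_i theta_i^zeta(J(x_i)).

    f_vals:     (k,) values f(x_i) along the trajectory
    theta_hist: (k, m) self-adapting weights theta_i at each iteration
    J_idx:      (k,) 0-based subregion indices J(x_i)
    """
    w = theta_hist[np.arange(len(J_idx)), J_idx] ** zeta
    return np.sum(w * f_vals) / np.sum(w)

# Toy check: with a uniform theta the estimator reduces to the plain average.
f_vals = np.array([1.0, 2.0, 3.0, 4.0])
theta = np.full((4, 10), 0.1)
J_idx = np.array([0, 3, 5, 9])
assert np.isclose(weighted_average(f_vals, theta, J_idx, zeta=0.75), f_vals.mean())
```

The weights $\theta_i^{\zeta}(\widetilde J(x_i))$ undo the flattening applied during sampling, which is what makes CSGLD a dynamic importance sampler rather than a plain SGLD variant.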
D More discussions on the algorithm
D.1 An alternative numerical scheme
In addition to the numerical scheme used in (6) and (8) in the main body, we can also consider the following scheme:
$$x_{k+1} = x_k - \varepsilon_{k+1}\frac{N}{n}\left[1 + \zeta\tau\frac{\log\theta_k\big((\widetilde J(x_k)+1)\wedge m\big) - \log\theta_k\big(\widetilde J(x_k)\big)}{\Delta u}\right]\nabla_x \widetilde U(x_k) + \sqrt{2\tau\varepsilon_{k+1}}\,w_{k+1}.$$
Such a scheme leads to a similar theoretical result and a better treatment of $\widetilde\theta(\cdot)$ for the subregions that contain stationary points.
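The difference between the two schemes is only in the gradient multiplier: scheme (8) uses a backward difference of the log-weights, while the alternative uses a forward difference. A minimal sketch, assuming 0-based subregion indices and hypothetical function names:

```python
import numpy as np

# Sketch of the two gradient multipliers, assuming log-weights log_theta over
# m subregions, energy bandwidth du, and constants zeta, tau (illustrative only).
def multiplier_backward(log_theta, J, zeta, tau, du):
    # scheme (8): 1 + zeta*tau*(log theta(J) - log theta((J-1) v 1)) / du
    return 1.0 + zeta * tau * (log_theta[J] - log_theta[max(J - 1, 0)]) / du

def multiplier_forward(log_theta, J, zeta, tau, du):
    # alternative scheme: 1 + zeta*tau*(log theta((J+1) ^ m) - log theta(J)) / du
    m = len(log_theta)
    return 1.0 + zeta * tau * (log_theta[min(J + 1, m - 1)] - log_theta[J]) / du

# With a flat theta both multipliers reduce to 1, i.e., plain SGLD.
flat = np.zeros(10)
assert multiplier_backward(flat, 5, zeta=0.75, tau=1.0, du=0.25) == 1.0
assert multiplier_forward(flat, 5, zeta=0.75, tau=1.0, du=0.25) == 1.0
```

The two multipliers agree whenever the log-weights are locally linear in the subregion index; they differ near kinks of $\theta$, which is where the subregions containing stationary points sit.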
D.2 Bizarre peaks in the Gaussian mixture distribution

A bizarre peak always indicates that there is a stationary point of the same energy somewhere in the sample space, since the sample space is partitioned according to the energy function in CSGLD. For example, we study a mixture distribution with asymmetric modes, $\pi(x) = \frac{1}{6}N(-6, 1) + \frac{5}{6}N(4, 1)$. Figure 4 shows a bizarre peak at a point $x$: although $x$ is not a local minimum, it has the same energy as $-6$, which is a local minimum. Note that in CSGLD, $x$ and $-6$ belong to the same subregion.
[Figure 4 here: energy versus the sample space (x from −12.5 to 10), showing the original energy together with the modified energies for ζ = 0.5, ζ = 0.75, and ζ = 1.]

Figure 4: Explanation of bizarre peaks.
D.3 Simulations of multi-modal distributions

We run all the algorithms for 200,000 iterations and assume that the stochastic energy and gradient follow Gaussian distributions with a variance of 0.1. We include an additional quadratic regularizer $(\|x\|^2 - 7)\mathbf{1}_{\|x\|^2 > 7}$ to keep the samples in the center region. We use a constant learning rate of 0.001 for SGLD, reSGLD, and CSGLD; for cycSGLD, we adopt cyclic cosine learning rates with an initial learning rate of 0.005 and 20 cycles. The temperature is fixed at 1 for all the algorithms, except for the high-temperature process of reSGLD, which employs a temperature of 3. In particular, for CSGLD we choose the step size $\omega_k = \min\{0.003, 10/(k^{0.8} + 100)\}$ for learning the latent vector. We partition the sample space into 100 subregions, each with an energy bandwidth of 0.25, and choose $\zeta = 0.75$.
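For concreteness, the following is a minimal 1-d sketch of the CSGLD setup above (Gaussian-mixture target, noisy energy and gradient with variance 0.1, 100 subregions of bandwidth 0.25, $\zeta = 0.75$, and the stated step sizes). The numerical gradient, the assumed energy lower bound $u_{\min} = 0$, the hard clip standing in for the quadratic regularizer, and the shortened run length are our simplifications, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def U(x):
    """Energy of pi(x) = (1/6) N(-6, 1) + (5/6) N(4, 1)."""
    p = (np.exp(-0.5 * (x + 6.0) ** 2) / 6.0
         + 5.0 * np.exp(-0.5 * (x - 4.0) ** 2) / 6.0) / np.sqrt(2.0 * np.pi)
    return -np.log(p)

def grad_U(x, h=1e-5):
    # central-difference gradient, for brevity only
    return (U(x + h) - U(x - h)) / (2.0 * h)

m, du, u_min = 100, 0.25, 0.0      # 100 subregions, energy bandwidth 0.25
zeta, tau, eps = 0.75, 1.0, 0.001
theta = np.full(m, 1.0 / m)        # self-adapting latent vector
x = 0.0

for k in range(1, 10_001):         # 200,000 iterations in the experiments; shortened here
    u = U(x) + rng.normal(0.0, np.sqrt(0.1))       # noisy energy
    g = grad_U(x) + rng.normal(0.0, np.sqrt(0.1))  # noisy gradient
    J = int(np.clip((u - u_min) / du, 0, m - 1))   # subregion index
    mult = 1.0 + zeta * tau * (np.log(theta[J]) - np.log(theta[max(J - 1, 0)])) / du
    x = float(np.clip(x - eps * mult * g + np.sqrt(2.0 * eps * tau) * rng.normal(),
                      -10.0, 10.0))                # hard clip in place of the regularizer
    omega = min(0.003, 10.0 / (k ** 0.8 + 100.0))  # step size for theta
    theta = theta + omega * theta[J] ** zeta * ((np.arange(m) == J) - theta)

assert np.isfinite(x) and abs(theta.sum() - 1.0) < 1e-6
```

Note that the $\theta$ update preserves $\sum_i \theta(i) = 1$ and keeps every component strictly positive, so the log-weights in the multiplier remain well defined throughout.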
D.4 Extension to scenarios with a high $\zeta$

In some complex experiments (e.g., computer vision) with a large loss, the fixed point $\theta_\star$ can be very close to the vector $(1, 0, \ldots, 0)$, i.e., the first subregion contains almost all the probability mass, if the sample space is not appropriately partitioned. As a result, estimating $\theta(i)$'s for the high-energy subregions can be quite difficult due to the limitations of floating-point arithmetic. If a small value of $\zeta$ is used, the gradient multiplier $1 + \zeta\tau\frac{\log\theta_\star(i) - \log\theta_\star((i-1)\vee 1)}{\Delta u}$ is close to 1 for any $i$, and the algorithm performs similarly to SGLD, except with different weights. When a large value of $\zeta$ is used, the convergence of $\theta_k$ to $\theta_\star$ can become relatively slow. To tackle this issue, we include a high-order bias term in the stochastic approximation as follows:
$$\theta_{k+1}(i) = \theta_k(i) + \omega_{k+1}\big(\theta_k^{\zeta}(\widetilde J(x_{k+1})) + \omega_{k+1}\mathbf{1}_{i \ge \widetilde J(x_{k+1})}\varrho\big)\big(\mathbf{1}_{i=\widetilde J(x_{k+1})} - \theta_k(i)\big), \quad (55)$$
for $i = 1, 2, \ldots, m$, where $\varrho$ is a constant. As shown earlier, our convergence theory allows the inclusion of such a high-order bias term. In simulations, the high-order bias term $\omega_{k+1}^2\mathbf{1}_{i \ge \widetilde J(x_{k+1})}\varrho$ penalizes the higher-energy regions more heavily, and thus accelerates the convergence of $\theta_k$ toward the pattern $(1, 0, 0, \ldots, 0)$, especially in the early period.
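Update (55) can be sketched as a single vectorized step (0-based indices, hypothetical function name). With a $\zeta$ as large as in D.4, $\theta^{\zeta}$ underflows to zero in floating point, so the $\omega^2\varrho$ bias term alone drives the update, which is exactly the behavior the extension is designed to exploit:

```python
import numpy as np

def update_theta(theta, J, omega, zeta, rho):
    """One stochastic-approximation step (55) with the high-order bias term:
    theta(i) <- theta(i) + omega*(theta(J)^zeta + omega*1{i>=J}*rho)*(1{i==J} - theta(i)),
    using 0-based subregion indices."""
    m = len(theta)
    i = np.arange(m)
    step = omega * (theta[J] ** zeta + omega * (i >= J) * rho)
    return theta + step * ((i == J).astype(float) - theta)

theta = np.full(10, 0.1)
# With zeta = 1e6 (as for CIFAR10), theta(J)^zeta underflows to 0 and only the
# omega^2 * rho term acts: the visited subregion gains mass, higher-energy
# subregions lose mass, and lower-energy subregions are untouched.
new = update_theta(theta, J=7, omega=0.01, zeta=1e6, rho=1.0)
assert new[3] == 0.1   # below J: no change in this step
assert new[7] > 0.1    # visited subregion gains mass
assert new[8] < 0.1    # higher-energy subregions are penalized
```

This matches the stated effect of the bias term: it pushes $\theta_k$ toward the pattern $(1, 0, \ldots, 0)$ in the early period even when $\theta^{\zeta}$ is numerically zero.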
In all computations for the computer vision examples, we set the momentum coefficient to 0.9 and the weight decay to 25, and employed the data augmentation scheme of Zhong et al. [2017]. In addition, for CSGHMC and saCSGHMC, we set $\omega_k = \frac{10}{k^{0.75} + 1000}$ and $\varrho = 1$ in (55) for both CIFAR10 and CIFAR100, and set $\zeta = 1 \times 10^6$ for CIFAR10 and $\zeta = 3 \times 10^6$ for CIFAR100.
D.5 Number of partitions

A fine partition leads to a smaller discretization error, but it may increase the risk of instability. In particular, it can cause large bouncy jumps around optima, where a large negative learning rate may arise, i.e., $\frac{\log\theta(2) - \log\theta(1)}{\Delta u} \ll 0$ in formula (8). Empirically, we suggest partitioning the sample space into a moderate number of subregions, e.g., 10-1000, to balance stability against discretization error.
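The trade-off is easiest to see through the index map itself. The sketch below (with an assumed energy lower bound $u_{\min}$ and the hypothetical name `subregion_index`) shows how the same energy value lands in different subregions under a fine versus a coarse partition:

```python
import numpy as np

def subregion_index(u, u_min, du, m):
    """Map an energy value u to its subregion index J(x) in {1, ..., m},
    given an (assumed) energy lower bound u_min and bandwidth du."""
    return int(np.clip(np.floor((u - u_min) / du) + 1, 1, m))

# The same energy falls into subregion 15 under a fine partition (du = 0.25)
# but subregion 4 under a coarse one (du = 1.0). Finer partitions track the
# energy landscape more closely but make adjacent log-weight differences,
# and hence the gradient multiplier, noisier.
assert subregion_index(3.6, u_min=0.0, du=0.25, m=100) == 15
assert subregion_index(3.6, u_min=0.0, du=1.0, m=25) == 4
```

A coarser partition averages more samples per subregion, stabilizing the estimates $\theta(i)$ at the cost of a larger discretization error in the flattened density.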