
A Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions

Wei Deng
Department of Mathematics
Purdue University
West Lafayette, IN, USA
[email protected]

Guang Lin
Departments of Mathematics & School of Mechanical Engineering
Purdue University
West Lafayette, IN, USA
[email protected]

Faming Liang βˆ—
Department of Statistics
Purdue University
West Lafayette, IN, USA
[email protected]

Abstract

We propose an adaptively weighted stochastic gradient Langevin dynamics (SGLD) algorithm, the so-called contour stochastic gradient Langevin dynamics (CSGLD) algorithm, for Bayesian learning in big data statistics. The proposed algorithm is essentially a scalable dynamic importance sampler, which automatically flattens the target distribution such that the simulation of a multi-modal distribution can be greatly facilitated. Theoretically, we prove a stability condition and establish the asymptotic convergence of the self-adapting parameter to a unique fixed point, regardless of the non-convexity of the original energy function; we also present an error analysis for the weighted averaging estimators. Empirically, the CSGLD algorithm is tested on multiple benchmark datasets, including CIFAR10 and CIFAR100. The numerical results indicate its superiority over the existing state-of-the-art algorithms in training deep neural networks.

1 Introduction

AI safety has long been an important issue in the deep learning community. A promising solution to the problem is Markov chain Monte Carlo (MCMC), which leads to asymptotically correct uncertainty quantification for deep neural network (DNN) models. However, traditional MCMC algorithms [Metropolis et al., 1953, Hastings, 1970] are not scalable to the big datasets that deep learning models rely on, although they have achieved significant successes in many scientific areas such as statistical physics and bioinformatics. It was not until the study of stochastic gradient Langevin dynamics (SGLD) [Welling and Teh, 2011] that the scalability issue encountered in Monte Carlo computing for big data problems was resolved. Ever since, a variety of scalable stochastic gradient Markov chain Monte Carlo (SGMCMC) algorithms have been developed based on strategies such as Hamiltonian dynamics [Chen et al., 2014, Ma et al., 2015, Ding et al., 2014], Hessian approximation [Ahn et al., 2012, Li et al., 2016, Simsekli et al., 2016], and higher-order numerical schemes [Chen et al., 2015, Li et al., 2019]. Despite their theoretical guarantees in statistical inference [Chen et al., 2015, Teh et al., 2016, Vollmer et al., 2016] and non-convex optimization [Zhang et al., 2017, Raginsky et al., 2017, Xu et al., 2018], these algorithms often converge slowly, which makes them difficult to use for efficient uncertainty quantification in many AI safety problems.

To develop more efficient SGMCMC algorithms, we seek inspiration from traditional MCMC algorithms, such as simulated annealing [Kirkpatrick et al., 1983], parallel tempering [Swendsen and Wang, 1986, Geyer, 1991], and the flat histogram algorithms [Berg and Neuhaus, 1991, Wang and Landau, 2001].

βˆ—To whom correspondence should be addressed: Faming Liang.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

arXiv:2010.09800v1 [stat.ML] 19 Oct 2020


In particular, simulated annealing proposes to decay temperatures to increase the hitting probability of the global optima [Mangoubi and Vishnoi, 2018]; however, it often gets stuck in a local optimum under a fast cooling schedule. Parallel tempering proposes to swap the positions of neighboring Markov chains according to an acceptance-rejection rule. However, under the mini-batch setting, it often requires a large correction, which is known to deteriorate its performance [Deng et al., 2020a]. The flat histogram algorithms, such as the multicanonical [Berg and Neuhaus, 1991] and Wang-Landau [Wang and Landau, 2001] algorithms, were first proposed to sample discrete states of Ising models by yielding a flat histogram in the energy space, and were later extended into a general dynamic importance sampling algorithm, the so-called stochastic approximation Monte Carlo (SAMC) algorithm [Liang, 2005, Liang et al., 2007, Liang, 2009]. Theoretical studies [LeliΓ¨vre et al., 2008, Liang, 2010, Fort et al., 2015] support the efficiency of the flat histogram algorithms in Monte Carlo computing for small data problems. However, it remains unclear how to adapt the flat histogram idea to accelerate the convergence of SGMCMC and thus enable efficient uncertainty quantification for AI safety problems.

This paper proposes the so-called contour stochastic gradient Langevin dynamics (CSGLD) algorithm, which successfully extends the flat histogram idea to SGMCMC. Like the SAMC algorithm [Liang, 2005, Liang et al., 2007, Liang, 2009], CSGLD works as a dynamic importance sampling algorithm, which adaptively adjusts the target measure at each iteration and accounts for the bias introduced thereby by importance weights. However, the theoretical analysis for the two types of dynamic importance sampling algorithms can be quite different due to the fundamental difference in their transition kernels. We proceed by justifying the stability condition for CSGLD based on perturbation theory, and by establishing the ergodicity of CSGLD based on a newly developed theory for the convergence of adaptive SGLD. Empirically, we test the performance of CSGLD through a few experiments. It achieves remarkable performance on some synthetic data, UCI datasets, and computer vision datasets such as CIFAR10 and CIFAR100.

2 Contour stochastic gradient Langevin dynamics

Suppose we are interested in sampling from a probability measure Ο€(x) with the density given by

    Ο€(x) ∝ exp(βˆ’U(x)/Ο„), x ∈ X,    (1)

where X denotes the sample space, U(x) is the energy function, and Ο„ is the temperature. It is known that when U(x) is highly non-convex, SGLD can mix very slowly [Raginsky et al., 2017]. To accelerate the convergence, we exploit the flat histogram idea in SGLD.

Suppose that we have partitioned the sample space X into m subregions based on the energy function U(x): X_1 = {x : U(x) ≀ u_1}, X_2 = {x : u_1 < U(x) ≀ u_2}, ..., X_{mβˆ’1} = {x : u_{mβˆ’2} < U(x) ≀ u_{mβˆ’1}}, and X_m = {x : U(x) > u_{mβˆ’1}}, where βˆ’βˆž < u_1 < u_2 < Β·Β·Β· < u_{mβˆ’1} < ∞ are specified by the user. For convenience, we set u_0 = βˆ’βˆž and u_m = ∞. Without loss of generality, we assume u_{i+1} βˆ’ u_i = Ξ”u for i = 1, ..., mβˆ’2. We propose to simulate from a flattened density

    Ο–_{Ξ¨_ΞΈ}(x) ∝ Ο€(x) / Ξ¨_ΞΈ^ΞΆ(U(x)),    (2)

where ΢ > 0 is a hyperparameter controlling the geometric property of the flattened density (see Figure 1(a) for an illustration), and θ = (θ(1), θ(2), ..., θ(m)) is an unknown vector taking values in the space

    Θ = {(ΞΈ(1), ΞΈ(2), ..., ΞΈ(m)) | 0 < ΞΈ(1), ΞΈ(2), ..., ΞΈ(m) < 1 and βˆ‘_{i=1}^m ΞΈ(i) = 1}.    (3)
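To make the flattening concrete, the following Python sketch (our own minimal illustration; the helper names and the uniform energy grid are assumptions, not code from the paper) maps an energy value to its subregion index J(x) and evaluates the flattened density with the piecewise-constant choice Ξ¨_ΞΈ(u) = ΞΈ(J(x)):

```python
import numpy as np

# A minimal illustration of the energy-space partition and density (2);
# the names and the uniform grid below are our own assumptions.
m = 6                                                 # number of subregions
u = np.array([-np.inf, 0., 1., 2., 3., 4., np.inf])   # u_0 < u_1 < ... < u_m, Delta_u = 1
theta = np.full(m, 1.0 / m)                           # a point in the simplex Theta
zeta = 0.75

def subregion_index(energy):
    """Return J(x) in {0, ..., m-1} such that u_{J-1} < U(x) <= u_J."""
    return int(np.searchsorted(u[1:-1], energy, side="left"))

def flattened_density(pi_x, energy):
    """varpi(x) proportional to pi(x) / Psi_theta(U(x))^zeta, here with
    the piecewise-constant choice Psi_theta(u) = theta(J(x))."""
    return pi_x / theta[subregion_index(energy)] ** zeta
```

Dividing by ΞΈ^ΞΆ(J(x)) deflates high-mass (low-energy) bins and inflates rarely visited (high-energy) bins, which is exactly the barrier-reduction effect illustrated in Figure 1(a).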

2.1 A naΓ―ve contour SGLD

It is known that if we set†

    (i)  ΞΆ = 1 and Ξ¨_ΞΈ(U(x)) = βˆ‘_{i=1}^m ΞΈ(i) 1_{u_{iβˆ’1} < U(x) ≀ u_i},

    (ii) ΞΈ(i) = ΞΈβ‹†(i), where ΞΈβ‹†(i) = ∫_{X_i} Ο€(x)dx for i ∈ {1, 2, ..., m},    (4)

†1_A is an indicator function that takes value 1 if the event A occurs and 0 otherwise.


the algorithm will act like the SAMC algorithm [Liang et al., 2007], yielding a flat histogram in the space of energy (see the pink curve in Fig. 1(b)). Theoretically, such a density-flattening strategy enables a sharper logarithmic Sobolev inequality and accelerates the convergence of simulations [LeliΓ¨vre et al., 2008, Fort et al., 2015]. However, such a density-flattening setting only works under the framework of the Metropolis algorithm [Metropolis et al., 1953]. A naΓ―ve application of the step function in formula (4)(i) to SGLD results in

    βˆ‚ log Ξ¨_ΞΈ(u)/βˆ‚u = (1/Ξ¨_ΞΈ(u)) βˆ‚Ξ¨_ΞΈ(u)/βˆ‚u = 0 almost everywhere,

which leads to a vanishing-gradient problem for SGLD. Calculating the gradient for the naΓ―ve contour SGLD, we have

    βˆ‡_x log Ο–_{Ξ¨_ΞΈ}(x) = βˆ’[1 + ΞΆΟ„ βˆ‚ log Ξ¨_ΞΈ(u)/βˆ‚u] βˆ‡_x U(x)/Ο„ = βˆ’βˆ‡_x U(x)/Ο„.

As such, the naΓ―ve algorithm behaves like SGLD and fails to simulate from the flattened density (2).

2.2 How to resolve the vanishing gradient

To tackle this issue, we propose to set Ξ¨_ΞΈ(u) as a piecewise continuous function:

    Ξ¨_ΞΈ(u) = βˆ‘_{i=1}^m ΞΈ(iβˆ’1) e^{(log ΞΈ(i) βˆ’ log ΞΈ(iβˆ’1)) (u βˆ’ u_{iβˆ’1})/Ξ”u} 1_{u_{iβˆ’1} < u ≀ u_i},    (5)

where ΞΈ(0) is fixed to ΞΈ(1) for simplicity. A direct calculation shows that

    βˆ‡_x log Ο–_{Ξ¨_ΞΈ}(x) = βˆ’[1 + ΞΆΟ„ βˆ‚ log Ξ¨_ΞΈ(u)/βˆ‚u] βˆ‡_x U(x)/Ο„
                      = βˆ’[1 + (ΞΆΟ„/Ξ”u)(log ΞΈ(J(x)) βˆ’ log ΞΈ((J(x) βˆ’ 1) ∨ 1))] βˆ‡_x U(x)/Ο„,    (6)

where J(x) ∈ {1, 2, ..., m} denotes the index of the subregion that x belongs to, i.e., u_{J(x)βˆ’1} < U(x) ≀ u_{J(x)}.Β§ Since ΞΈ is unknown, we propose to estimate it on the fly under the framework of stochastic approximation [Robbins and Monro, 1951]. Provided that a scalable transition kernel Ξ _{ΞΈ_k}(x_k, Β·) is available and the energy function U(x) on the full data can be efficiently evaluated, the weighted density Ο–_{Ξ¨_ΞΈ}(x) can be simulated by iterating between the following steps:

    (i)  Simulate x_{k+1} from Ξ _{ΞΈ_k}(x_k, Β·), which admits Ο–_{ΞΈ_k}(x) as the invariant distribution;

    (ii) Update ΞΈ_{k+1}(i) = ΞΈ_k(i) + Ο‰_{k+1} ΞΈ_k^ΞΆ(J(x_{k+1})) (1_{i=J(x_{k+1})} βˆ’ ΞΈ_k(i)) for i ∈ {1, 2, ..., m},    (7)

where ΞΈ_k denotes a working estimate of ΞΈ at the k-th iteration. We expect that, in the long run, such an algorithm can achieve an optimization-sampling equilibrium such that ΞΈ_k converges to the fixed point θ⋆ and the random vector x_k converges weakly to the distribution Ο–_{Ξ¨_θ⋆}(x).

To make the algorithm scalable to big data, we propose to adopt the Langevin transition kernel for drawing samples at each iteration, for which a mini-batch of data can be used to accelerate computation. In addition, we observe that evaluating U(x) on the full data can be quite expensive for big data problems, while the stochastic energy Ũ(x) comes for free when evaluating the stochastic gradient βˆ‡_x Ũ(x) due to the nature of auto-differentiation [Paszke et al., 2017]. For this reason, we propose a biased index J(x), where u_{J(x)βˆ’1} < (N/n) Ũ(x) ≀ u_{J(x)}, N is the sample size of the full dataset, and n is the mini-batch size. Let {Ξ΅_k}_{k=1}^∞ and {Ο‰_k}_{k=1}^∞ denote the learning rates and step sizes for SGLD and stochastic approximation, respectively. Given the above notation, the proposed method is presented in Algorithm 1, which can be viewed as a scalable Wang–Landau algorithm for deep learning and big data problems.

2.3 Related work

Compared to the existing MCMC algorithms, the proposed algorithm has a few innovations:

First, CSGLD is an adaptive MCMC algorithm based on the Langevin transition kernel instead of the Metropolis transition kernel [Liang et al., 2007, Fort et al., 2015]. As a result, the existing convergence theory for the Wang-Landau algorithm does not apply. To resolve this issue, we first prove a stability condition for CSGLD based on perturbation theory, and then verify regularity conditions for the solution of the Poisson equation so that the fluctuations of the mean-field system induced by CSGLD are controlled, which eventually ensures the convergence of CSGLD.

Β§Formula (6) shows a practical numerical scheme. An alternative is presented in the supplementary material.


Algorithm 1 Contour SGLD Algorithm. One can conduct a resampling step from the pool of importance samples according to the importance weights to obtain the original distribution.

[1.] (Data subsampling) Simulate a mini-batch of data of size n from the whole dataset of size N; compute the stochastic gradient βˆ‡_x Ũ(x_k) and the stochastic energy Ũ(x_k).

[2.] (Simulation step) Sample x_{k+1} using the SGLD algorithm based on x_k and ΞΈ_k, i.e.,

    x_{k+1} = x_k βˆ’ Ξ΅_{k+1} (N/n) [1 + (ΞΆΟ„/Ξ”u)(log ΞΈ_k(J(x_k)) βˆ’ log ΞΈ_k((J(x_k) βˆ’ 1) ∨ 1))] βˆ‡_x Ũ(x_k) + √(2τΡ_{k+1}) w_{k+1},    (8)

where w_{k+1} ∼ N(0, I_d), d is the dimension, Ξ΅_{k+1} is the learning rate, and Ο„ is the temperature.

[3.] (Stochastic approximation) Update the estimates of ΞΈ(i) for i = 1, 2, ..., m by setting

    ΞΈ_{k+1}(i) = ΞΈ_k(i) + Ο‰_{k+1} ΞΈ_k^ΞΆ(J(x_{k+1})) (1_{i=J(x_{k+1})} βˆ’ ΞΈ_k(i)),    (9)

where 1_{i=J(x_{k+1})} is an indicator function that equals 1 if i = J(x_{k+1}) and 0 otherwise.
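For readers who prefer code, one iteration of Algorithm 1 can be written as the following Python sketch. This is a minimal re-expression under our own helper names (partition_index, stoch_grad, stoch_energy are assumptions), not the authors' implementation:

```python
import numpy as np

def partition_index(u_hat, delta_u, m, u1=0.0):
    """Map a stochastic energy estimate to a subregion index in {0, ..., m-1},
    assuming a uniform grid starting at u_1 (our simplification)."""
    return int(np.clip(np.ceil((u_hat - u1) / delta_u), 0, m - 1))

def csgld_step(x, theta, k, stoch_grad, stoch_energy, N, n, m, delta_u,
               zeta=0.75, tau=1.0, eps=0.1, omega=None):
    """One CSGLD iteration; stoch_grad/stoch_energy are mini-batch estimates."""
    w = omega(k) if omega else 1.0 / (k ** 0.6 + 100)

    # [1] Data subsampling: stochastic gradient and stochastic energy
    #     computed on a mini-batch of size n from N observations.
    g = stoch_grad(x)
    J = partition_index((N / n) * stoch_energy(x), delta_u, m)

    # [2] Simulation step (8): SGLD with the adaptive multiplier, which
    #     may turn negative and bounce the sampler out of local traps.
    mult = 1.0 + (zeta * tau / delta_u) * (
        np.log(theta[J]) - np.log(theta[max(J - 1, 0)]))
    x = x - eps * (N / n) * mult * g \
          + np.sqrt(2.0 * tau * eps) * np.random.randn(*np.shape(x))

    # [3] Stochastic approximation (9): move theta toward the bin visited
    #     by the new sample x_{k+1}; theta stays on the simplex since the
    #     update (one-hot minus theta) sums to zero.
    J_new = partition_index((N / n) * stoch_energy(x), delta_u, m)
    step = -theta.copy()
    step[J_new] += 1.0
    theta = theta + w * theta[J_new] ** zeta * step
    return x, theta
```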


Second, the use of the stochastic index J(x) avoids the evaluation of U(x) on the full data and thus significantly accelerates the computation of the algorithm, although it leads to a small bias in parameter estimation, depending on the mini-batch size n. Compared to other methods, such as using a fixed sub-dataset to estimate U(x), the implementation is much simpler. Moreover, combined with the variance reduction of the noisy energy estimators [Deng et al., 2020b], the bias also decreases to zero asymptotically as Ξ΅ β†’ 0.

Third, unlike the existing SGMCMC algorithms [Welling and Teh, 2011, Chen et al., 2014, Ma et al., 2015], CSGLD works as a dynamic importance sampler, which flattens the target distribution and reduces the energy barriers for the sampler to traverse between different regions of the energy landscape (see Fig. 1(a) for an illustration). The sampling bias introduced thereby is accounted for by the importance weight ΞΈ^ΞΆ(J(Β·)). Interestingly, CSGLD possesses a self-adjusting mechanism to ease escapes from local traps, which is similar to the self-repulsive dynamics [Ye et al., 2020] and can be explained as follows. Let's assume that the sampler gets trapped in a local optimum at iteration k. Then CSGLD will automatically increase the multiplier of the stochastic gradient (i.e., the bracket term of (8)) at iteration k + 1 by increasing the value of ΞΈ_k(J(x)), while decreasing the components of ΞΈ_k corresponding to other subregions. This adjustment continues until the sampler moves away from the current subregion. Then, in the following several iterations, the multiplier may become negative in the neighboring subregions of the local optimum due to the increased value of ΞΈ(J(x)), which continues to drive the sampler to higher energy regions and thus escape from the local trap. That is, in order to escape from local traps, CSGLD is sometimes forced to move toward higher energy regions by changing the sign of the stochastic gradient multiplier! This is a very attractive feature for simulations of multi-modal distributions.

3 Theoretical study of the CSGLD algorithm

In this section, we study the convergence of the CSGLD algorithm under the framework of stochastic approximation and establish its ergodicity based on weighted averaging estimators.

3.1 Convergence analysis

Following the tradition of stochastic approximation analysis, we rewrite the updating rule (9) as

    ΞΈ_{k+1} = ΞΈ_k + Ο‰_{k+1} H(ΞΈ_k, x_{k+1}),    (10)

where H(θ, x) = (H_1(θ, x), ..., H_m(θ, x)) is a random field function with

    H_i(ΞΈ, x) = ΞΈ^ΞΆ(J(x)) (1_{i=J(x)} βˆ’ ΞΈ(i)), i = 1, 2, ..., m.    (11)

4

Page 5: A Contour Stochastic Gradient Langevin Dynamics Algorithm ...

Notably, H(ΞΈ, x) works under an empirical measure Ο–_ΞΈ(x), which approximates the invariant measure Ο–_{Ξ¨_ΞΈ}(x) ∝ Ο€(x)/Ξ¨_ΞΈ^ΞΆ(U(x)) asymptotically as Ξ΅ β†’ 0 and n β†’ N. As shown in Lemma 1, we have the mean-field equation

    h(ΞΈ) = ∫_X H(ΞΈ, x) Ο–_ΞΈ(x) dx = Z_ΞΈ^{βˆ’1} (θ⋆ + Ρ Ξ²(ΞΈ) βˆ’ ΞΈ) = 0,    (12)

where θ⋆ = (∫_{X_1} Ο€(x)dx, ∫_{X_2} Ο€(x)dx, ..., ∫_{X_m} Ο€(x)dx), Z_ΞΈ is the normalizing constant, Ξ²(ΞΈ) is a perturbation term, and Ξ΅ is a small error depending on the learning rate Ξ΅, n, and m. The mean-field equation implies that for any ΞΆ > 0, ΞΈ_k converges to a small neighbourhood of θ⋆. By applying perturbation theory and setting the Lyapunov function V(ΞΈ) = (1/2) ‖θ⋆ βˆ’ ΞΈβ€–Β², we can establish the stability condition:

Lemma 1 (Stability). Given a small enough Ξ΅ (learning rate), a large enough n (batch size), and a large enough m (partition number), there is a constant Ο† = inf_ΞΈ Z_ΞΈ^{βˆ’1} > 0 such that the mean-field function h(ΞΈ) satisfies

    βˆ€ ΞΈ ∈ Θ: γ€ˆh(ΞΈ), ΞΈ βˆ’ θ⋆〉 ≀ βˆ’Ο† β€–ΞΈ βˆ’ ΞΈβ‹†β€–Β² + O(Ξ΅ + 1/m + Ξ΄_n(ΞΈ)),

where Ξ΄_n(Β·) is a bias term depending on the batch size n and decaying to 0 as n β†’ N.

Together with the tool of Poisson equation [Benveniste et al., 1990, Andrieu et al., 2005], which controls the fluctuation of H(ΞΈ, x) βˆ’ h(ΞΈ), we can establish the convergence of ΞΈ_k in Theorem 1, whose proof is given in the supplementary material.

Theorem 1 (LΒ² convergence rate). Given Assumptions 1-5 (given in the Appendix), a small enough learning rate Ξ΅_k, a large partition number m, and a large batch size n, ΞΈ_k converges to θ⋆ such that

    E[β€–ΞΈ_k βˆ’ ΞΈβ‹†β€–Β²] = O(Ο‰_k + sup_{iβ‰₯k_0} Ξ΅_i + 1/m + sup_{iβ‰₯k_0} Ξ΄_n(ΞΈ_i)),

where k_0 is some large enough integer, ΞΈβ‹† = (∫_{X_1} Ο€(x)dx, ∫_{X_2} Ο€(x)dx, ..., ∫_{X_m} Ο€(x)dx), and Ξ΄_n(Β·) is a bias term depending on the batch size n and decaying to 0 as n β†’ N.

3.2 Ergodicity and dynamic importance sampler

CSGLD belongs to the class of adaptive MCMC algorithms, but its transition kernel is based on SGLD instead of the Metropolis algorithm. As such, the ergodicity theory for traditional adaptive MCMC algorithms [Roberts and Rosenthal, 2007, Andrieu and Moulines, 2006, Fort et al., 2011, Liang, 2010] is not directly applicable. To tackle this issue, we conduct the following theoretical study. First, rewrite (8) as

    x_{k+1} = x_k βˆ’ Ξ΅ (βˆ‡_x L(x_k, θ⋆) + Ξ₯(x_k, ΞΈ_k, θ⋆)) + N(0, 2Ρτ I),    (13)

where βˆ‡_x L(x_k, θ⋆) = (N/n)[1 + (ΞΆΟ„/Ξ”u)(log θ⋆(J(x_k)) βˆ’ log θ⋆((J(x_k) βˆ’ 1) ∨ 1))] βˆ‡_x Ũ(x_k), the bias term Ξ₯(x_k, ΞΈ_k, θ⋆) = βˆ‡_x L(x_k, ΞΈ_k) βˆ’ βˆ‡_x L(x_k, θ⋆), and βˆ‡_x L(x_k, ΞΈ_k) = (N/n)[1 + (ΞΆΟ„/Ξ”u)(log ΞΈ_k(J(x_k)) βˆ’ log ΞΈ_k((J(x_k) βˆ’ 1) ∨ 1))] βˆ‡_x Ũ(x_k). The order of the bias is figured out in Lemma C1 in the supplementary material based on the results of Theorem 1.

Next, we show how the empirical mean (1/k) βˆ‘_{i=1}^k f(x_i) deviates from the posterior mean ∫_X f(x) Ο–_{Ξ¨_θ⋆}(x)dx. Note that this is a direct application of Theorem 2 of Chen et al. [2015], obtained by treating βˆ‡_x L(x, θ⋆) as the stochastic gradient of a target distribution and Ξ₯(x, ΞΈ, θ⋆) as the bias of the stochastic gradient. Moreover, considering that Ο–_{Ψ̄_θ⋆}(x) ∝ Ο€(x)/θ⋆^ΞΆ(J(x)) β†’ Ο–_{Ξ¨_θ⋆} as m β†’ ∞, based on Lemma B4 in the supplementary material, we have the following lemma.

Lemma 2 (Convergence of the Averaging Estimators). Suppose Assumptions 1-6 (in the supplementary material) hold. For any bounded function f, we have

    | E[ (βˆ‘_{i=1}^k f(x_i))/k ] βˆ’ ∫_X f(x) Ο–_{Ψ̄_θ⋆}(dx) | = O( 1/(kΞ΅) + √Ρ + √(βˆ‘_{i=1}^k Ο‰_i)/k + 1/√m + sup_{iβ‰₯k_0} √(Ξ΄_n(ΞΈ_i)) ),

where Ο–_{Ψ̄_θ⋆}(x) = (1/Z_θ⋆) Ο€(x)/θ⋆^ΞΆ(J(x)) and Z_θ⋆ = βˆ‘_{i=1}^m ∫_{X_i} Ο€(x)dx / θ⋆(i)^ΞΆ.

5

Page 6: A Contour Stochastic Gradient Langevin Dynamics Algorithm ...

Finally, we consider the problem of estimating the quantity ∫_X f(x)Ο€(x)dx. Recall that Ο€(x) is the target distribution for which we would like to make inference. To estimate this quantity, we naturally consider the weighted averaging estimator

    βˆ‘_{i=1}^k ΞΈ_i^ΞΆ(J(x_i)) f(x_i) / βˆ‘_{i=1}^k ΞΈ_i^ΞΆ(J(x_i)),

which treats ΞΈ_i^ΞΆ(J(x_i)) as the dynamic importance weight of the sample x_i for i = 1, 2, ..., k. The convergence of this estimator is established in Theorem 2, which can be proved by repeatedly applying Theorem 1 and Lemma 2, with the details given in the supplementary material.

Theorem 2 (Convergence of the Weighted Averaging Estimators). Suppose Assumptions 1-6 hold. For any bounded function f, we have

    | E[ βˆ‘_{i=1}^k ΞΈ_i^ΞΆ(J(x_i)) f(x_i) / βˆ‘_{i=1}^k ΞΈ_i^ΞΆ(J(x_i)) ] βˆ’ ∫_X f(x)Ο€(dx) | = O( 1/(kΞ΅) + √Ρ + √(βˆ‘_{i=1}^k Ο‰_i)/k + 1/√m + sup_{iβ‰₯k_0} √(Ξ΄_n(ΞΈ_i)) ).

The bias of the weighted averaging estimator decreases if one applies a larger batch size, a finer partition of the sample space, a smaller learning rate Ξ΅, and smaller step sizes {Ο‰_k}_{kβ‰₯0}. Admittedly, the order of this bias is slightly larger than the O(1/(kΞ΅) + Ξ΅) achieved by standard SGLD. We note that this is necessary, as simulating from the flattened distribution Ο–_{Ξ¨_θ⋆} often leads to a much faster convergence; see e.g. the green curve vs. the purple curve in Fig. 1(c).
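A minimal sketch of how the weighted averaging estimator and the optional resampling step mentioned in Algorithm 1 could be computed from a CSGLD run is given below (function names and array layout are our assumptions):

```python
import numpy as np

def weighted_average(f_vals, bin_idx, theta_traj, zeta=0.75):
    """Weighted averaging estimator of the integral of f under pi: sample
    x_i carries importance weight theta_i^zeta(J(x_i)). Here f_vals[i] =
    f(x_i), bin_idx[i] = J(x_i), theta_traj[i] = theta at iteration i."""
    w = np.array([theta_traj[i][bin_idx[i]] ** zeta
                  for i in range(len(f_vals))])
    return float(np.sum(w * np.asarray(f_vals)) / np.sum(w))

def resample(samples, bin_idx, theta, zeta=0.75, size=1000, rng=None):
    """Optional resampling step (cf. Algorithm 1): draw from the pool of
    importance samples with probability proportional to theta^zeta(J(x_i))
    to recover the original distribution pi."""
    rng = rng or np.random.default_rng()
    w = theta[np.asarray(bin_idx)] ** zeta
    idx = rng.choice(len(samples), size=size, p=w / w.sum())
    return np.asarray(samples)[idx]
```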

4 Numerical studies

4.1 Simulations of multi-modal distributions

A Gaussian mixture distribution  The first numerical study tests the performance of CSGLD on a Gaussian mixture distribution Ο€(x) = 0.4 N(βˆ’6, 1) + 0.6 N(4, 1). In each experiment, the algorithm was run for 10^7 iterations. We fix the temperature Ο„ = 1 and the learning rate Ξ΅ = 0.1. The step size for stochastic approximation follows Ο‰_k = 1/(k^{0.6} + 100). The sample space is partitioned into 50 subregions with Ξ”u = 1. The stochastic gradients are simulated by injecting additional random noise following N(0, 0.01) into the exact gradients. For comparison, SGLD is chosen as the baseline algorithm and implemented with the same setup as CSGLD. We repeat each experiment 10 times and report the average and the associated standard deviation.
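A minimal driver reproducing this setup, reusing the csgld_step sketch from Section 2.3, might look as follows (a sketch under our assumptions; N = n = 1 encodes the exact-gradient-plus-noise setting, and we run fewer iterations than the paper's 10^7 for brevity):

```python
import numpy as np
from scipy.stats import norm

def pi(x):
    return 0.4 * norm.pdf(x, -6, 1) + 0.6 * norm.pdf(x, 4, 1)

def U(x):                      # energy of the mixture target
    return -np.log(pi(x))

def grad_U(x, h=1e-4):         # numerical gradient suffices in this 1-d toy
    return (U(x + h) - U(x - h)) / (2 * h)

m, delta_u, x, theta = 50, 1.0, 0.0, np.full(50, 1.0 / 50)
for k in range(1, 10**5 + 1):
    noisy_grad = lambda z: grad_U(z) + np.random.normal(0.0, 0.1)  # var 0.01
    x, theta = csgld_step(x, theta, k, noisy_grad, U, N=1, n=1,
                          m=m, delta_u=delta_u, zeta=0.75, tau=1.0, eps=0.1)
```

Note that the default step-size schedule inside csgld_step already matches Ο‰_k = 1/(k^{0.6} + 100) above.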

We first assume that θ⋆ is known and plot the energy functions of both Ο€(x) and Ο–_{Ξ¨_θ⋆} for different values of ΞΆ. Fig. 1(a) shows that the original energy function has a rather large energy barrier, which strongly hinders the communication between the two modes of the distribution. In contrast, CSGLD samples from a modified energy function, which yields a flattened landscape and reduced energy barriers. For example, with ΞΆ = 0.75, the energy barrier for this example is greatly reduced from 12 to as small as 2. Consequently, the local trap problem can be greatly alleviated. We defer the study of the bizarre peaks around x = 4 to the supplementary material.

[Figure 1 image: (a) the original energy function with a large barrier vs. the modified energy functions for ΞΆ = 0.5, 0.75, 1 with a small barrier; (b) θ⋆ vs. its CSGLD estimates and the energy histograms for ΞΆ = 0.5, 0.75, 1 over the partition index; (c) estimation errors of SGLD, CSGLD, and KSGLD over iterations.]

Figure 1: Comparison between SGLD and CSGLD: Fig. 1(b) presents only the first 12 partitions for illustrative purposes; KSGLD in Fig. 1(c) is implemented by assuming θ⋆ is known.

Fig. 1(b) summarizes the estimates of θ⋆ with ΢ = 0.75, which match the ground-truth values of θ⋆ very well. Notably, we see that θ⋆(i) decays exponentially fast as the partition index i increases, which


indicates an exponentially decreasing probability of visiting high energy regions and hence a severe local trap problem. CSGLD tackles this issue by adaptively updating the transition kernel, or equivalently the invariant distribution, such that the sampler moves like a β€œrandom walk” in the space of energy. In particular, setting ΞΆ = 1 leads to a flat histogram of energy (for the samples produced by CSGLD).

To explore the performance of CSGLD in quantity estimation with the weighted averaging estimator, we compare CSGLD (ΞΆ = 0.75) with SGLD and KSGLD in estimating the posterior mean ∫_X x Ο€(x)dx, where KSGLD is implemented by assuming θ⋆ is known and sampling from Ο–_{Ξ¨_θ⋆} directly. Each algorithm was run 10 times, and we recorded the mean absolute estimation error along iterations. As shown in Fig. 1(c), the estimation error of SGLD decays quite slowly and hardly converges, due to the high energy barrier. On the contrary, KSGLD converges much faster, which shows the advantage of sampling from the flattened distribution Ο–_{Ξ¨_θ⋆}. Admittedly, θ⋆ is unknown in practice. CSGLD instead adaptively updates its invariant distribution while optimizing the parameter ΞΈ until an optimization-sampling equilibrium is reached. In the early period of the run, CSGLD converges slightly more slowly than KSGLD, but soon it becomes as efficient as KSGLD.

Finally, we compare the sample paths and learning behavior of CSGLD and SGLD. As shown in Fig. 2(a), SGLD tends to be trapped in a deep local optimum for an exponentially long time. CSGLD, in contrast, possesses a self-adjusting mechanism for escaping from local traps. In the early period of a run, CSGLD may suffer from a local-trap problem similar to that of SGLD (see Fig. 2(b)). In this case, the component of ΞΈ corresponding to the current subregion increases very fast, eventually rendering a smaller or even negative stochastic gradient multiplier that bounces the sampler back to higher energy regions. To illustrate the process, we plot a bouncy zone and an absorbing zone in Fig. 2(c). The bouncy zone enables the sampler to β€œjump” over large energy barriers to explore other modes. As the run continues, ΞΈ_k converges to θ⋆. Fig. 2(d) shows that larger bouncy β€œjumps” (in red lines) can potentially be induced in the bouncy zone, which occurs in both local and global optima. Due to the self-adjusting mechanism, CSGLD has the local trap problem much alleviated.

[Figure 2 image: (a) SGLD paths, trapped behind an energy barrier > 10; (b) CSGLD paths (early period), with absorbing and bouncy zones annotated; (c) CSGLD paths (middle period), with absorbing and bouncy zones annotated; (d) CSGLD paths (late period).]

Figure 2: Sample trajectories of SGLD and CSGLD: plots (a) and (c) are implemented with 100,000 iterations, a thinning factor of 100, and ΞΆ = 0.75, while plot (b) utilizes a thinning factor of 10.

A synthetic multi-modal distribution  We next simulate from a distribution Ο€(x) ∝ e^{βˆ’U(x)}, where U(x) = βˆ‘_{i=1}^2 [x(i)Β² βˆ’ 10 cos(1.2Ο€ x(i))]/3 and x = (x(1), x(2)). We compare CSGLD with SGLD, replica exchange SGLD (reSGLD) [Deng et al., 2020a], and SGLD with cyclic learning rates (cycSGLD) [Zhang et al., 2020], and detail the setups in the supplementary material. Fig. 3(a) shows that the distribution contains nine important modes, where the center mode has the largest probability mass and the four modes at the corners have the smallest mass. We see in Fig. 3(b) that SGLD spends too much time in local regions and identifies only three modes. cycSGLD has a better ability to explore the distribution by leveraging large learning rates cyclically. However, as illustrated in Fig. 3(c), such a mechanism is still not efficient enough to resolve the local trap issue for this problem. reSGLD includes a high-temperature process to encourage exploration and allows interactions between the two processes via appropriate swaps. We observe in Fig. 3(d) that reSGLD obtains both exploration and exploitation abilities and yields a much better result. However, the noisy energy estimator may hinder the swapping efficiency, and it remains difficult to estimate a few modes at the corners. As for our algorithm, CSGLD first simulates the importance samples and then recovers the original distribution according to the importance weights. We notice that the samples from CSGLD traverse freely in the parameter space and eventually achieve a remarkable performance, as shown in Fig. 3(e).
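For concreteness, this energy function and its gradient can be written out directly (a straightforward transcription of the displayed formula):

```python
import numpy as np

def U(x):
    """U(x) = sum over i=1,2 of [x(i)^2 - 10 cos(1.2*pi*x(i))] / 3."""
    x = np.asarray(x, dtype=float)         # x = (x(1), x(2))
    return float(np.sum((x**2 - 10.0 * np.cos(1.2 * np.pi * x)) / 3.0))

def grad_U(x):
    """Coordinate-wise derivative: (2x + 12*pi*sin(1.2*pi*x)) / 3."""
    x = np.asarray(x, dtype=float)
    return (2.0 * x + 12.0 * np.pi * np.sin(1.2 * np.pi * x)) / 3.0
```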


[Figure 3 images: (a) Ground truth, (b) SGLD, (c) cycSGLD, (d) reSGLD, (e) CSGLD.]

Figure 3: Simulations of a multi-modal distribution. A resampling scheme is used for CSGLD.

4.2 UCI data

We tested the performance of CSGLD on the UCI regression datasets. For each dataset, we normalized all features and randomly selected 10% of the observations for testing. Following [Hernandez-Lobato and Adams, 2015], we modeled the data using a Multi-Layer Perceptron (MLP) with a single hidden layer of 50 hidden units. We set the mini-batch size n = 50 and trained the model for 5,000 epochs. The learning rate was set to 5Γ—10^{βˆ’6} and the default L2-regularization coefficient was 1Γ—10^{βˆ’4}. For all the datasets, we used the stochastic energy (N/n)Ũ(x) to evaluate the partition index. We set the energy bandwidth to Ξ”u = 100. We fine-tuned the temperature Ο„ and the hyperparameter ΞΆ. For a fair comparison, each algorithm was run 10 times with fixed seeds for each dataset. In each run, the performance of the algorithm was evaluated by averaging over 50 models, where the averaging estimator was used for SGD and SGLD and the weighted averaging estimator was used for CSGLD. As shown in Table 1, SGLD outperforms the stochastic gradient descent (SGD) algorithm on most datasets, due to the advantage of a sampling algorithm in identifying more informative modes. Since all these datasets are small, there is only limited potential for improvement. Nevertheless, CSGLD still consistently outperforms all the baselines, including SGD and SGLD.

The contour strategy proposed in this paper can be naturally extended to SGHMC [Chen et al., 2014, Ma et al., 2015] without affecting the theoretical results. In what follows, we adopted a numerical method proposed by Saatci and Wilson [2017] to avoid extra hyperparameter tuning. We set the momentum term to 0.9 and simply inherited all the other parameter settings used in the above experiments. In this case, we compare the contour SGHMC (CSGHMC) with the baselines, including M-SGD (momentum SGD) and SGHMC. The comparison indicates that some improvements can be achieved by including momentum.

Table 1: Algorithm evaluation using average root-mean-square error and its standard deviation.

Dataset                  Energy       Concrete     Yacht        Wine
Hyperparameters (Ο„/ΞΆ)    1/1          5/1          1/2.5        5/10
SGD                      1.13Β±0.07    4.60Β±0.14    0.81Β±0.08    0.65Β±0.01
SGLD                     1.08Β±0.07    4.12Β±0.10    0.72Β±0.07    0.63Β±0.01
CSGLD                    1.02Β±0.06    3.98Β±0.11    0.69Β±0.06    0.62Β±0.01
M-SGD                    0.95Β±0.07    4.32Β±0.27    0.73Β±0.08    0.71Β±0.02
SGHMC                    0.77Β±0.06    4.25Β±0.19    0.66Β±0.07    0.67Β±0.02
CSGHMC                   0.76Β±0.06    4.15Β±0.20    0.72Β±0.09    0.65Β±0.01

4.3 Computer vision data

This section compares only CSGHMC with M-SGD and SGHMC, due to the popularity of momentum in accelerating computation on computer vision datasets. We keep partitioning the sample space according to the stochastic energy (N/n)Ũ(x), where a mini-batch of data of size n is randomly chosen from the full dataset of size N at each iteration. Notably, such a strategy significantly accelerates the computation of CSGHMC. As a result, CSGHMC has almost the same computational cost as SGHMC and SGD. To reduce the bias associated with the stochastic energy, we choose a large batch size n = 1,000. For more discussion of the hyperparameter settings, we refer readers to Section D in the supplementary material.


CIFAR10 is a standard computer vision dataset with 10 classes and 60,000 images, of which 50,000 images were used for training and the rest for testing. We modeled the data using a Resnet of 20 layers (Resnet20) [He et al., 2016]. In particular, for CSGHMC, we considered a partition of the energy space into 200 subregions, where the energy bandwidth was set to Ξ”u = 1000. We trained the model for a total of 1000 epochs and evaluated the model every ten epochs based on two criteria, namely, the best point estimate (BPE) and the Bayesian model average (BMA). We repeated each experiment 10 times and report in Table 2 the average prediction accuracy and the corresponding standard deviation.

In the first set of experiments, all the algorithms utilized a fixed learning rate Ξ΅ = 2Γ—10^{βˆ’7} and a fixed temperature Ο„ = 0.01 under the Bayesian setting. SGHMC performs quite similarly to M-SGD, both obtaining around 90% accuracy in BPE and 92% in BMA. Notably, in this case, simulated annealing is not applied to any of the algorithms, and achieving the state of the art is quite difficult. However, BMA still consistently outperforms BPE, implying the great potential of advanced MCMC techniques in deep learning. Instead of simulating from Ο€(x) directly, CSGHMC adaptively simulates from a flattened distribution Ο–_θ⋆ and adjusts the sampling bias by dynamic importance weights. As a result, the weighted averaging estimators obtain an improvement as large as 0.8% on BMA. In addition, the flattened distribution facilitates optimization, and the increase in BPE is quite significant.

In the second set of experiments, we employed a decaying schedule on both learning rates and temperatures (if applicable) to obtain simulated annealing effects. For the learning rate, we fixed it at 2Γ—10^{βˆ’6} in the first 400 epochs and then decayed it by a factor of 1.01 at each epoch. For the temperature, we consistently decayed it by a factor of 1.01 at each epoch. We call the resulting algorithms saM-SGD, saSGHMC, and saCSGHMC, respectively. Table 2 shows that the performance of all the algorithms increases quite significantly, where the fine-tuned baselines already obtain state-of-the-art results. Nevertheless, saCSGHMC further improves BPE by 0.25% and slightly improves the highly optimized BMA by nearly 0.1%.

The CIFAR100 dataset has 100 classes, each of which contains 500 training images and 100 testing images. We follow a similar setup as for CIFAR10, except that Ξ”u is set to 5000. For M-SGD, BMA can be better than BPE by as much as 5.6%. CSGHMC leads to an improvement of 3.5% on BPE and 2% on BMA, which further demonstrates the superiority of advanced MCMC techniques. Table 2 also shows that, with the help of both simulated annealing and importance sampling, saCSGHMC outperforms the highly optimized baselines by almost 1% accuracy on BPE and 0.7% on BMA. These significant improvements show the advantage of the proposed method in training DNNs.

Table 2: Experiments on CIFAR10 & CIFAR100 using Resnet20, where BPE and BMA are short for best point estimate and Bayesian model average, respectively.

Algorithms    CIFAR10 BPE    CIFAR10 BMA    CIFAR100 BPE    CIFAR100 BMA
M-SGD         90.02Β±0.06     92.03Β±0.08     61.41Β±0.15      67.04Β±0.12
SGHMC         90.01Β±0.07     91.98Β±0.05     61.46Β±0.14      66.43Β±0.11
CSGHMC        90.87Β±0.04     92.85Β±0.05     63.97Β±0.21      68.94Β±0.23
saM-SGD       93.83Β±0.07     94.25Β±0.04     69.18Β±0.13      71.83Β±0.12
saSGHMC       93.80Β±0.06     94.24Β±0.06     69.24Β±0.11      71.98Β±0.10
saCSGHMC      94.06Β±0.07     94.33Β±0.07     70.18Β±0.15      72.67Β±0.15

5 Conclusion

We have proposed CSGLD as a general scalable Monte Carlo algorithm for both simulation and optimization tasks. CSGLD automatically adjusts its invariant distribution during simulations to facilitate escaping from local traps and traversing the entire energy landscape. The sampling bias introduced thereby is accounted for by dynamic importance weights. We proved a stability condition for the mean-field system induced by CSGLD, together with the convergence of its self-adapting parameter ΞΈ to a unique fixed point θ⋆. We also established the convergence of a weighted averaging estimator for CSGLD. The bias of the estimator decreases as we employ a finer partition, a larger mini-batch size, and smaller learning rates and step sizes. We tested CSGLD and its variants on a few examples, which show their great potential in deep learning and big data computing.


Broader Impact

Our algorithm contributes to AI safety by providing more robust predictions and thus helps build a safer environment. It extends the flat histogram algorithms from the Metropolis kernel to the Langevin kernel and paves the way for future research on various dynamic importance samplers and adaptive biasing force (ABF) techniques for big data problems. The Bayesian community and researchers in the area of Monte Carlo methods will enjoy the benefit of our work. To the best of our knowledge, no negative societal consequences are apparent, and no one is put at a disadvantage.

Acknowledgment

Liang's research was supported in part by the grants DMS-2015498, R01-GM117597 and R01-GM126089. Lin acknowledges the support from NSF (DMS-1555072, DMS-1736364), BNL Subcontract 382247, W911NF-15-1-0562, and DE-SC0021142.

References

Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring. In Proc. of the International Conference on Machine Learning (ICML), 2012.

Christophe Andrieu and Γ‰ric Moulines. On the Ergodicity Properties of Some Adaptive MCMC Algorithms. Annals of Applied Probability, 16:1462–1505, 2006.

Christophe Andrieu, Γ‰ric Moulines, and Pierre Priouret. Stability of Stochastic Approximation under Verifiable Conditions. SIAM J. Control Optim., 44(1):283–312, 2005.

Albert Benveniste, Michel MΓ©tivier, and Pierre Priouret. Adaptive Algorithms and Stochastic Approximations. Berlin: Springer, 1990.

Bernd A. Berg and T. Neuhaus. Multicanonical Algorithms for First Order Phase Transitions. Physics Letters B, 267(2):249–253, 1991.

Changyou Chen, Nan Ding, and Lawrence Carin. On the Convergence of Stochastic Gradient MCMC Algorithms with High-order Integrators. In Advances in Neural Information Processing Systems (NeurIPS), pages 2278–2286, 2015.

Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic Gradient Hamiltonian Monte Carlo. In Proc. of the International Conference on Machine Learning (ICML), 2014.

Umut Simsekli, Roland Badeau, A. Taylan Cemgil, and GaΓ«l Richard. Stochastic Quasi-Newton Langevin Monte Carlo. In Proc. of the International Conference on Machine Learning (ICML), pages 642–651, 2016.

Wei Deng, Xiao Zhang, Faming Liang, and Guang Lin. An Adaptive Empirical Bayesian Method for Sparse Deep Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Wei Deng, Qi Feng, Liyao Gao, Faming Liang, and Guang Lin. Non-Convex Learning via Replica Exchange Stochastic Gradient MCMC. In Proc. of the International Conference on Machine Learning (ICML), 2020a.

Wei Deng, Qi Feng, Georgios Karagiannis, Guang Lin, and Faming Liang. Accelerating Convergence of Replica Exchange Stochastic Gradient MCMC via Variance Reduction. arXiv:2010.01084, 2020b.

Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian Sampling using Stochastic Gradient Thermostats. In Advances in Neural Information Processing Systems (NeurIPS), pages 3203–3211, 2014.

G. Fort, E. Moulines, and P. Priouret. Convergence of Adaptive and Interacting Markov Chain Monte Carlo Algorithms. Annals of Statistics, 39:3262–3289, 2011.

G. Fort, B. Jourdain, E. Kuhn, T. LeliΓ¨vre, and G. Stoltz. Convergence of the Wang-Landau Algorithm. Math. Comput., 84(295):2297–2327, 2015.

Charles J. Geyer. Markov Chain Monte Carlo Maximum Likelihood. Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pages 156–163, 1991.

W.K. Hastings. Monte Carlo Sampling Methods using Markov Chains and Their Applications. Biometrika, 57:97–109, 1970.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Jose Miguel Hernandez-Lobato and Ryan Adams. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. In Proc. of the International Conference on Machine Learning (ICML), volume 37, pages 1861–1869, 2015.

Scott Kirkpatrick, C. D. Gelatt Jr., and Mario P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.

T. LeliΓ¨vre, M. Rousset, and G. Stoltz. Long-time Convergence of an Adaptive Biasing Force Method. Nonlinearity, 21:1155–1181, 2008.

Chunyuan Li, Changyou Chen, David Carlson, and Lawrence Carin. Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks. In Proc. of the National Conference on Artificial Intelligence (AAAI), pages 1788–1794, 2016.

Xuechen Li, Denny Wu, Lester Mackey, and Murat A. Erdogdu. Stochastic Runge-Kutta Accelerates Langevin Monte Carlo and Beyond. In Advances in Neural Information Processing Systems (NeurIPS), pages 7746–7758, 2019.

Faming Liang. A Generalized Wang–Landau Algorithm for Monte Carlo Computation. Journal of the American Statistical Association, 100(472):1311–1327, 2005.

Faming Liang. On the Use of Stochastic Approximation Monte Carlo for Monte Carlo Integration. Statistics and Probability Letters, 79:581–587, 2009.

Faming Liang. Trajectory Averaging for Stochastic Approximation MCMC Algorithms. The Annals of Statistics, 38:2823–2856, 2010.

Faming Liang, Chuanhai Liu, and Raymond J. Carroll. Stochastic Approximation in Monte Carlo Computation. Journal of the American Statistical Association, 102:305–320, 2007.

Yi-An Ma, Tianqi Chen, and Emily B. Fox. A Complete Recipe for Stochastic Gradient MCMC. In Advances in Neural Information Processing Systems (NeurIPS), 2015.

Oren Mangoubi and Nisheeth K. Vishnoi. Convex Optimization with Unbounded Nonconvex Oracles using Simulated Annealing. In Proc. of Conference on Learning Theory (COLT), 2018.

J.C. Mattingly, A.M. Stuart, and D.J. Higham. Ergodicity for SDEs and Approximations: Locally Lipschitz Vector Fields and Degenerate Noise. Stochastic Processes and their Applications, 101:185–232, 2002.

Jonathan C. Mattingly, Andrew M. Stuart, and M.V. Tretyakov. Convergence of Numerical Time-Averaging and Stationary Measures via Poisson Equations. SIAM Journal on Numerical Analysis, 48:552–577, 2010.

N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 21:1087–1091, 1953.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic Differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.

Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex Learning via Stochastic Gradient Langevin Dynamics: a Nonasymptotic Analysis. In Proc. of Conference on Learning Theory (COLT), June 2017.

Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. Annals of Mathematical Statistics, 22:400–407, 1951.

Gareth O. Roberts and Jeff S. Rosenthal. Coupling and Ergodicity of Adaptive Markov Chain Monte Carlo Algorithms. Journal of Applied Probability, 44:458–475, 2007.

Yunus Saatci and Andrew G. Wilson. Bayesian GAN. In Advances in Neural Information Processing Systems (NeurIPS), pages 3622–3631, 2017.

Issei Sato and Hiroshi Nakagawa. Approximation Analysis of Stochastic Gradient Langevin Dynamics by Using Fokker-Planck Equation and Ito Process. In Proc. of the International Conference on Machine Learning (ICML), 2014.

Robert H. Swendsen and Jian-Sheng Wang. Replica Monte Carlo Simulation of Spin-Glasses. Phys. Rev. Lett., 57:2607–2609, 1986.

Yee Whye Teh, Alexandre ThiΓ©ry, and Sebastian Vollmer. Consistency and Fluctuations for Stochastic Gradient Langevin Dynamics. Journal of Machine Learning Research, 17:1–33, 2016.

Eric Vanden-Eijnden. Introduction to Regular Perturbation Theory. Slides, 2001. URL https://cims.nyu.edu/~eve2/reg_pert.pdf.

Sebastian J. Vollmer, Konstantinos C. Zygalakis, and Yee Whye Teh. Exploration of the (Non-)Asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics. Journal of Machine Learning Research, 17(159):1–48, 2016.

Fugao Wang and D. P. Landau. Efficient, Multiple-range Random Walk Algorithm to Calculate the Density of States. Physical Review Letters, 86(10):2050–2053, 2001.

Max Welling and Yee Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proc. of the International Conference on Machine Learning (ICML), pages 681–688, 2011.

Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Mao Ye, Tongzheng Ren, and Qiang Liu. Stein Self-Repulsive Dynamics: Benefits From Past Samples. arXiv:2002.09070v1, 2020.

Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning. In Proc. of the International Conference on Learning Representations (ICLR), 2020.

Yuchen Zhang, Percy Liang, and Moses Charikar. A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics. In Proc. of Conference on Learning Theory (COLT), pages 1980–2022, 2017.

Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random Erasing Data Augmentation. ArXiv e-prints, 2017.


Supplementary Material for β€œA Contour Stochastic Gradient Langevin Dynamics Algorithm for Simulations of Multi-modal Distributions”

The supplementary material is organized as follows: Section A reviews the related methodologies, Section B proves the stability condition and the convergence of the self-adapting parameter, Section C establishes the ergodicity of the contour stochastic gradient Langevin dynamics (CSGLD) algorithm, and Section D provides further discussion of the algorithm.

A Background on stochastic approximation and Poisson equation

A.1 Stochastic approximation

Stochastic approximation [Benveniste et al., 1990] provides a standard framework for the development of adaptive algorithms. Given a random field function H(ΞΈ, x), the goal of the stochastic approximation algorithm is to find the solution to the mean-field equation h(ΞΈ) = 0, i.e., solving

    h(ΞΈ) = ∫_X H(ΞΈ, x) Ο–_ΞΈ(dx) = 0,

where x ∈ X βŠ‚ R^d, ΞΈ ∈ Θ βŠ‚ R^m, H(ΞΈ, x) is a random field function, and Ο–_ΞΈ(x) is a distribution function of x depending on the parameter ΞΈ. The stochastic approximation algorithm works by repeating the following iterations:

    (1) Draw x_{k+1} ∼ Ξ _{ΞΈ_k}(x_k, Β·), where Ξ _{ΞΈ_k}(x_k, Β·) is a transition kernel that admits Ο–_{ΞΈ_k}(x) as the invariant distribution;

    (2) Update ΞΈ_{k+1} = ΞΈ_k + Ο‰_{k+1} H(ΞΈ_k, x_{k+1}) + Ο‰_{k+1}Β² ρ(ΞΈ_k, x_{k+1}), where ρ(Β·, Β·) denotes a bias term.

The algorithm differs from the Robbins–Monro algorithm [Robbins and Monro, 1951] in that x is simulated from a transition kernel Ξ _{ΞΈ_k}(Β·, Β·) instead of the exact distribution Ο–_{ΞΈ_k}(Β·). As a result, a Markov state-dependent noise H(ΞΈ_k, x_{k+1}) βˆ’ h(ΞΈ_k) is generated, which requires some regularity conditions to control the fluctuation βˆ‘_k Ξ _ΞΈ^k(H(ΞΈ, x) βˆ’ h(ΞΈ)). Moreover, it supports a more general form in which a bounded bias term ρ(Β·, Β·) is allowed without affecting the theoretical properties of the algorithm.
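As a toy illustration of this recursion (our own example, not from the paper), the following snippet solves h(ΞΈ) = E[X] βˆ’ ΞΈ = 0 with H(ΞΈ, x) = x βˆ’ ΞΈ, where the transition kernel degenerates to drawing fresh samples:

```python
import numpy as np

# Toy stochastic approximation: estimate theta such that E[X] - theta = 0.
rng = np.random.default_rng(0)
mu, theta = 3.0, 0.0
for k in range(1, 10_000):
    x = rng.normal(mu, 1.0)           # step (1): draw x_{k+1}
    omega = 1.0 / (k + 100)           # step sizes: sum diverges, sum of squares converges
    theta += omega * (x - theta)      # step (2), with zero bias term rho
# theta now approximates mu = 3.0
```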

A.2 Poisson equation

Stochastic approximation generates a nonhomogeneous Markov chain {(x_k, ΞΈ_k)}_{k=1}^∞, for which the convergence theory can be studied based on the Poisson equation

    ¡_ΞΈ(x) βˆ’ Ξ _ΞΈ ¡_ΞΈ(x) = H(ΞΈ, x) βˆ’ h(ΞΈ),

where Ξ _ΞΈ(x, A) is the transition kernel for any Borel subset A βŠ‚ X and ¡_ΞΈ(Β·) is a function on X. The solution to the Poisson equation exists when the following series converges:

    ¡_ΞΈ(x) := βˆ‘_{kβ‰₯0} Ξ _ΞΈ^k (H(ΞΈ, x) βˆ’ h(ΞΈ)).

That is, the consistency of the estimator ΞΈ can be established by controlling the perturbations of βˆ‘_{kβ‰₯0} Ξ _ΞΈ^k(H(ΞΈ, x) βˆ’ h(ΞΈ)) through regularity conditions imposed on ¡_ΞΈ(Β·). Towards this goal, Benveniste et al. [1990] gave the following regularity conditions on ¡_ΞΈ(Β·) to ensure the convergence of the adaptive algorithm: there exist a function V : X β†’ [1, ∞) and a constant C such that for all ΞΈ, ΞΈβ€² ∈ Θ,

    ‖Π_ΞΈ ¡_ΞΈ(x)β€– ≀ C V(x),  ‖Π_ΞΈ ¡_ΞΈ(x) βˆ’ Ξ _{ΞΈβ€²} ¡_{ΞΈβ€²}(x)β€– ≀ C β€–ΞΈ βˆ’ ΞΈβ€²β€– V(x),  E[V(x)] < ∞,

which requires only first-order smoothness. In contrast, the ergodicity theory of Mattingly et al. [2010] and Vollmer et al. [2016] relies on the much stronger fourth-order smoothness.

13

Page 14: A Contour Stochastic Gradient Langevin Dynamics Algorithm ...

B Stability and convergence analysis for CSGLD

B.1 CSGLD algorithm

To make the theory more general, we slightly extend CSGLD by allowing a higher-order bias term. The resulting algorithm works by iterating between the following two steps:

(1) Sample $x_{k+1} = x_k - \epsilon_k \nabla_x \widetilde{L}(x_k, \theta_k) + \mathcal{N}(0, 2\epsilon_k \tau I)$,  (S1)

(2) Update $\theta_{k+1} = \theta_k + \omega_{k+1} H(\theta_k, x_{k+1}) + \omega_{k+1}^2 \rho(\theta_k, x_{k+1})$,  (S2)

where $\epsilon_k$ is the learning rate, $\omega_{k+1}$ is the step size, and $\nabla_x \widetilde{L}(x, \theta)$ is the stochastic gradient given by
$$\nabla_x \widetilde{L}(x, \theta) = \frac{N}{n}\left[1 + \frac{\zeta\tau}{\Delta u}\left(\log \theta(J(x)) - \log \theta\left((J(x) - 1) \vee 1\right)\right)\right] \nabla_x \widetilde{U}(x), \quad (14)$$
with $\nabla_x \widetilde{U}(x)$ denoting the mini-batch estimate of the energy gradient; $H(\theta, x) = (H_1(\theta, x), \ldots, H_m(\theta, x))$ is a random field function with
$$H_i(\theta, x) = \theta^{\zeta}(J(x))\left(1_{i = J(x)} - \theta(i)\right), \quad i = 1, 2, \ldots, m, \quad (15)$$
for some constant $\zeta > 0$, and $\rho(\theta_k, x_{k+1})$ is a bias term.
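For illustration, one iteration of (S1)-(S2) can be sketched as below. This is a minimal sketch, not the authors' implementation: `grad_U` and `U_hat` stand for the mini-batch gradient and energy estimates, the partition is assumed to be an equispaced energy grid starting at `u_min` with bandwidth $\Delta u$, and indices are 0-based, so $(J(x)-1)\vee 1$ becomes `max(J - 1, 0)`.

```python
import numpy as np

def subregion_index(u, u_min, delta_u, m):
    """0-based index J(x) of the energy subregion containing energy value u."""
    return int(np.clip((u - u_min) // delta_u, 0, m - 1))

def csgld_step(x, theta, grad_U, U_hat, u_min, delta_u, m,
               eps, omega, zeta=0.75, tau=1.0, N=1, n=1):
    # (S1): Langevin move with the adaptively weighted stochastic gradient of (14)
    J = subregion_index(U_hat(x), u_min, delta_u, m)
    mult = 1.0 + zeta * tau / delta_u * (np.log(theta[J]) - np.log(theta[max(J - 1, 0)]))
    x_new = x - eps * (N / n) * mult * grad_U(x) \
            + np.sqrt(2.0 * eps * tau) * np.random.randn(*np.shape(x))
    # (S2): stochastic approximation update with the random field (15) at x_{k+1}
    J_new = subregion_index(U_hat(x_new), u_min, delta_u, m)
    H = theta[J_new] ** zeta * ((np.arange(m) == J_new).astype(float) - theta)
    theta_new = theta + omega * H
    # Simple safeguard for the positivity required by Assumption A1 (not part of (S2))
    theta_new = np.maximum(theta_new, 1e-10)
    return x_new, theta_new
```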

B.2 Convergence of parameter estimation

To establish the convergence of $\theta_k$, we make the following assumptions:

Assumption A1 (Compactness). The space $\Theta$ is compact such that $\inf_{\Theta} \theta(i) > 0$ for any $i \in \{1, 2, \ldots, m\}$. There exists a large constant $Q > 0$ such that for any $\theta \in \Theta$ and $x \in \mathcal{X}$,
$$\|\theta\| \leq Q, \quad \|H(\theta, x)\| \leq Q, \quad \|\rho(\theta, x)\| \leq Q. \quad (16)$$

To simplify the proof, we consider a slightly stronger assumption, namely that $\inf_{\Theta} \theta(i) > 0$ holds for any $i \in \{1, 2, \ldots, m\}$. To relax this assumption, we refer interested readers to Fort et al. [2015], where the recurrence property was proved for the sequence $\{\theta_k\}_{k \geq 1}$ of a similar algorithm. Such a property guarantees that $\theta_k$ visits a desired compact set often enough, which renders the convergence of the sequence.

Assumption A2 (Smoothness). $U(x)$ is $M$-smooth; that is, there exists a constant $M > 0$ such that for any $x, x' \in \mathcal{X}$,
$$\|\nabla_x U(x) - \nabla_x U(x')\| \leq M\|x - x'\|. \quad (17)$$
Smoothness is a standard assumption in the study of the convergence of SGLD; see e.g. Raginsky et al. [2017], Xu et al. [2018].

Assumption A3 (Dissipativity). There exist constants $\tilde{m} > 0$ and $b \geq 0$ such that for any $x \in \mathcal{X}$ and $\theta \in \Theta$,
$$\langle \nabla_x L(x, \theta), x \rangle \leq b - \tilde{m}\|x\|^2. \quad (18)$$
This assumption ensures that the samples move towards the origin regardless of the initial point, which is standard in proving the geometric ergodicity of dynamical systems; see e.g. Mattingly et al. [2002], Raginsky et al. [2017], Xu et al. [2018].

Assumption A4 (Gradient noise). The stochastic gradient is unbiased, that is,
$$\mathbb{E}\left[\nabla_x \widetilde{U}(x_k) - \nabla_x U(x_k)\right] = 0;$$
in addition, there exist some constants $M > 0$ and $B > 0$ such that
$$\mathbb{E}\left[\|\nabla_x \widetilde{U}(x_k) - \nabla_x U(x_k)\|^2\right] \leq M^2\|x_k\|^2 + B^2,$$
where the expectation $\mathbb{E}[\cdot]$ is taken with respect to the distribution of the noise component included in $\nabla_x \widetilde{U}(x)$.


Lemma B1 establishes a stability condition for CSGLD, which implies the potential convergence of $\theta_k$.

Lemma B1 (Stability). Suppose that Assumptions A1-A4 hold. For any $\theta \in \Theta$,
$$\langle h(\theta), \theta - \theta_{\star} \rangle \leq -\varphi\|\theta - \theta_{\star}\|^2 + O\left(\delta_n(\theta) + \epsilon + \frac{1}{m}\right),$$
where $\varphi = \inf_{\theta} Z_{\theta}^{-1} > 0$, $\theta_{\star} = \left(\int_{\mathcal{X}_1} \pi(x)dx, \int_{\mathcal{X}_2} \pi(x)dx, \ldots, \int_{\mathcal{X}_m} \pi(x)dx\right)$, and $\delta_n(\cdot)$ is a bias term depending on the batch size $n$ such that $\delta_n(\cdot) \to 0$ as $n \to N$.

Proof. Let $\varpi_{\Psi_{\theta}}(x) \propto \frac{\pi(x)}{\Psi_{\theta}^{\zeta}(U(x))}$ denote a theoretical invariant measure of SGLD, where $\Psi_{\theta}(u)$ is a fixed piecewise continuous function given by
$$\Psi_{\theta}(u) = \sum_{i=1}^m \left(\theta(i-1)\, e^{(\log\theta(i) - \log\theta(i-1))\frac{u - u_{i-1}}{\Delta u}}\right) 1_{u_{i-1} < u \leq u_i}, \quad (19)$$
the full data is used in determining the indexes of subregions, and the learning rate converges to zero. In addition, we define the piecewise constant function
$$\bar{\Psi}_{\theta}(u) = \sum_{i=1}^m \theta(i)\, 1_{u_{i-1} < u \leq u_i},$$
and a theoretical measure $\varpi_{\bar{\Psi}_{\theta}}(x) \propto \frac{\pi(x)}{\theta^{\zeta}(J(x))}$. Obviously, as the sample space partition becomes finer and finer, i.e., $u_1 \to u_{\min}$, $u_{m-1} \to u_{\max}$ and $m \to \infty$, we have $\|\bar{\Psi}_{\theta} - \Psi_{\theta}\| \to 0$ and $\|\varpi_{\bar{\Psi}_{\theta}}(x) - \varpi_{\Psi_{\theta}}(x)\| \to 0$, where $u_{\min}$ and $u_{\max}$ denote the minimum and maximum of $U(x)$, respectively. Without loss of generality, we assume $u_{\max} < \infty$; otherwise, $u_{\max}$ can be set to a value such that $\pi(\{x : U(x) > u_{\max}\})$ is sufficiently small.

For each $i \in \{1, 2, \ldots, m\}$, the random field $\widetilde{H}_i(\theta, x) = \theta^{\zeta}(\widetilde{J}(x))\left(1_{i = \widetilde{J}(x)} - \theta(i)\right)$, computed with the mini-batch energy estimate and its subregion index $\widetilde{J}(x)$, is a biased estimator of $H_i(\theta, x)$ in (15). Let $\delta_n(\theta) = \mathbb{E}[\widetilde{H}(\theta, x) - H(\theta, x)]$ denote the bias, which is caused by the mini-batch evaluation of the energy and decays to 0 as $n \to N$.

First, let us compute the mean field $h(\theta)$ with respect to the empirical measure $\varpi_{\theta}(x)$:
$$h_i(\theta) = \int_{\mathcal{X}} \widetilde{H}_i(\theta, x)\,\varpi_{\theta}(x)\,dx = \int_{\mathcal{X}} H_i(\theta, x)\,\varpi_{\theta}(x)\,dx + \delta_n(\theta)$$
$$= \int_{\mathcal{X}} H_i(\theta, x)\Big[\underbrace{\varpi_{\bar{\Psi}_{\theta}}(x)}_{I_1}\ \underbrace{-\ \varpi_{\bar{\Psi}_{\theta}}(x) + \varpi_{\Psi_{\theta}}(x)}_{I_2}\ \underbrace{-\ \varpi_{\Psi_{\theta}}(x) + \varpi_{\theta}(x)}_{I_3}\Big]\,dx + \delta_n(\theta). \quad (20)$$

For the term $I_1$, we have
$$\int_{\mathcal{X}} H_i(\theta, x)\,\varpi_{\bar{\Psi}_{\theta}}(x)\,dx = \frac{1}{Z_{\theta}}\int_{\mathcal{X}} \theta^{\zeta}(J(x))\left(1_{i=J(x)} - \theta(i)\right)\frac{\pi(x)}{\theta^{\zeta}(J(x))}\,dx$$
$$= Z_{\theta}^{-1}\left[\sum_{k=1}^m \int_{\mathcal{X}_k} \pi(x)\,1_{k=i}\,dx - \theta(i)\sum_{k=1}^m \int_{\mathcal{X}_k} \pi(x)\,dx\right] = Z_{\theta}^{-1}\left[\theta_{\star}(i) - \theta(i)\right], \quad (21)$$
where $Z_{\theta} = \sum_{i=1}^m \frac{\int_{\mathcal{X}_i} \pi(x)\,dx}{\theta^{\zeta}(i)}$ denotes the normalizing constant of $\varpi_{\bar{\Psi}_{\theta}}(x)$.

Next, let us consider the integrals $I_2$ and $I_3$. By Lemma B4 and the boundedness of $H(\theta, x)$, we have
$$\int_{\mathcal{X}} H_i(\theta, x)\left(-\varpi_{\bar{\Psi}_{\theta}}(x) + \varpi_{\Psi_{\theta}}(x)\right) dx = O\left(\frac{1}{m}\right). \quad (22)$$
For the term $I_3$, we have for any fixed $\theta$,
$$\int_{\mathcal{X}} H_i(\theta, x)\left(-\varpi_{\Psi_{\theta}}(x) + \varpi_{\theta}(x)\right) dx = O(\delta_n(\theta)) + O(\epsilon), \quad (23)$$
where $\delta_n(\cdot)$ uniformly decays to 0 as $n \to N$ and the order $O(\epsilon)$ follows from Theorem 6 of Sato and Nakagawa [2014].


Plugging (21), (22) and (23) into (20), we have
$$h_i(\theta) = Z_{\theta}^{-1}\left[\bar{\epsilon}\beta_i(\theta) + \theta_{\star}(i) - \theta(i)\right], \quad (24)$$
where $\bar{\epsilon} = O\left(\delta_n(\theta) + \epsilon + \frac{1}{m}\right)$ and $\beta_i(\theta)$ is a bounded term such that $Z_{\theta}^{-1}\bar{\epsilon}\beta_i(\theta) = O\left(\delta_n(\theta) + \epsilon + \frac{1}{m}\right)$.

To solve the ODE system with small disturbances, we consider standard techniques in perturbation theory. According to the fundamental theorem of perturbation theory [Vanden-Eijnden, 2001], we can obtain the solution to the mean-field equation $h(\theta) = 0$:
$$\theta(i) = \theta_{\star}(i) + \bar{\epsilon}\beta_i(\theta_{\star}) + O(\bar{\epsilon}^2), \quad i = 1, 2, \ldots, m, \quad (25)$$
which is a stable point in a small neighbourhood of $\theta_{\star}$.

Considering the positive definite function $\mathcal{V}(\theta) = \frac{1}{2}\|\theta_{\star} - \theta\|^2$ for the mean-field system $h(\theta) = Z_{\theta}^{-1}(\bar{\epsilon}\beta(\theta) + \theta_{\star} - \theta) = Z_{\theta}^{-1}(\theta_{\star} - \theta) + O(\bar{\epsilon})$, we have
$$\langle h(\theta), \nabla\mathcal{V}(\theta)\rangle = \langle h(\theta), \theta - \theta_{\star}\rangle = -Z_{\theta}^{-1}\|\theta - \theta_{\star}\|^2 + O(\bar{\epsilon}) \leq -\varphi\|\theta - \theta_{\star}\|^2 + O\left(\delta_n(\theta) + \epsilon + \frac{1}{m}\right),$$
where $\varphi = \inf_{\theta} Z_{\theta}^{-1} > 0$ by the compactness Assumption A1. This concludes the proof.

The following is a restatement of Lemma 1 of Deng et al. [2019], which holds for any $\theta$ in the compact space $\Theta$.

Lemma B2 (Uniform $L^2$ bounds). Suppose Assumptions A1, A3 and A4 hold. Then, given a small enough learning rate, $\sup_{k \geq 1}\mathbb{E}[\|x_k\|^2] < \infty$.

Lemma B3 (Solution of Poisson equation). Suppose that Assumptions A1-A4 hold. There is a solution $\mu_{\theta}(\cdot)$ on $\mathcal{X}$ to the Poisson equation
$$\mu_{\theta}(x) - \Pi_{\theta}\mu_{\theta}(x) = H(\theta, x) - h(\theta). \quad (26)$$
In addition, for all $\theta, \theta' \in \Theta$, there exists a constant $C$ such that
$$\mathbb{E}[\|\Pi_{\theta}\mu_{\theta}(x)\|] \leq C, \quad \mathbb{E}[\|\Pi_{\theta}\mu_{\theta}(x) - \Pi_{\theta'}\mu_{\theta'}(x)\|] \leq C\|\theta - \theta'\|. \quad (27)$$

Proof. The lemma can be proved based on Theorem 13 of Vollmer et al. [2016], whose conditions can be easily verified for CSGLD given Assumptions A1-A4 and Lemma B2. The details are omitted.

Now we are ready to prove the first main result on the convergence of $\theta_k$. The technical lemmas are listed in Section B.3.

Assumption A5 (Learning rate and step size). The learning rate $\{\epsilon_k\}_{k \in \mathbb{N}}$ is a positive non-increasing sequence of real numbers satisfying
$$\lim_k \epsilon_k = 0, \quad \sum_{k=1}^{\infty} \epsilon_k = \infty.$$
The step size $\{\omega_k\}_{k \in \mathbb{N}}$ is a positive decreasing sequence of real numbers such that
$$\omega_k \to 0, \quad \sum_{k=1}^{\infty} \omega_k = +\infty, \quad \liminf_{k \to \infty}\, 2\varphi\frac{\omega_k}{\omega_{k+1}} + \frac{\omega_{k+1} - \omega_k}{\omega_{k+1}^2} > 0. \quad (28)$$
According to Benveniste et al. [1990], we can choose $\omega_k := \frac{A}{k^{\alpha} + B}$ for some $\alpha \in (\frac{1}{2}, 1]$ and some suitable constants $A > 0$ and $B > 0$.

Theorem 3 ($L^2$ convergence rate). Suppose Assumptions A1-A5 hold. For a sufficiently large value of $m$, a sufficiently small learning rate sequence $\{\epsilon_k\}_{k=1}^{\infty}$, and a sufficiently small step size sequence $\{\omega_k\}_{k=1}^{\infty}$, $\{\theta_k\}_{k=0}^{\infty}$ converges to $\theta_{\star}$ in $L^2$-norm such that
$$\mathbb{E}\left[\|\theta_k - \theta_{\star}\|^2\right] = O\left(\omega_k + \sup_{i \geq k_0}\epsilon_i + \frac{1}{m} + \sup_{i \geq k_0}\delta_n(\theta_i)\right),$$
where $k_0$ is a sufficiently large constant, and $\delta_n(\theta)$ is a bias term decaying to 0 as $n \to N$.


Proof. Consider the iterations
$$\theta_{k+1} = \theta_k + \omega_{k+1}\left(H(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\right).$$
Define $T_k = \theta_k - \theta_{\star}$. Subtracting $\theta_{\star}$ from both sides and taking the squared norm, we have
$$\|T_{k+1}\|^2 = \|T_k\|^2 + \omega_{k+1}^2\|H(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\|^2 + 2\omega_{k+1}\underbrace{\langle T_k,\, H(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\rangle}_{D}.$$

First, by Lemma B5, there exists a constant $G = 4Q^2(1 + Q^2)$ such that
$$\|H(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\|^2 \leq G(1 + \|T_k\|^2). \quad (29)$$

Next, by the Poisson equation (26), we have
$$D = \langle T_k, H(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\rangle = \langle T_k, h(\theta_k) + \mu_{\theta_k}(x_{k+1}) - \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\rangle$$
$$= \underbrace{\langle T_k, h(\theta_k)\rangle}_{D_1} + \underbrace{\langle T_k, \mu_{\theta_k}(x_{k+1}) - \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle}_{D_2} + \underbrace{\langle T_k, \omega_{k+1}\rho(\theta_k, x_{k+1})\rangle}_{D_3}.$$

For the term $D_1$, by Lemma B1, we have
$$\mathbb{E}\left[\langle T_k, h(\theta_k)\rangle\right] \leq -\varphi\mathbb{E}[\|T_k\|^2] + O\left(\delta_n(\theta_k) + \epsilon_k + \frac{1}{m}\right).$$
For convenience, in what follows we denote $O\left(\delta_n(\theta_k) + \epsilon_k + \frac{1}{m}\right)$ by $\Delta_k$.

To deal with the term $D_2$, we make the following decomposition:
$$D_2 = \underbrace{\langle T_k, \mu_{\theta_k}(x_{k+1}) - \Pi_{\theta_k}\mu_{\theta_k}(x_k)\rangle}_{D_{21}} + \underbrace{\langle T_k, \Pi_{\theta_k}\mu_{\theta_k}(x_k) - \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\rangle}_{D_{22}} + \underbrace{\langle T_k, \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k) - \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle}_{D_{23}}.$$

(i) From the Markov property, $\mu_{\theta_k}(x_{k+1}) - \Pi_{\theta_k}\mu_{\theta_k}(x_k)$ forms a martingale difference sequence, so
$$\mathbb{E}\left[\langle T_k, \mu_{\theta_k}(x_{k+1}) - \Pi_{\theta_k}\mu_{\theta_k}(x_k)\rangle \,\middle|\, \mathcal{F}_k\right] = 0, \quad (D_{21})$$
where $\mathcal{F}_k$ is the $\sigma$-filtration formed by $\{\theta_0, x_1, \theta_1, x_2, \cdots, x_k, \theta_k\}$.

(ii) By the regularity of the solution of the Poisson equation in (27) and Lemma B6, we have
$$\mathbb{E}\left[\|\Pi_{\theta_k}\mu_{\theta_k}(x_k) - \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\|\right] \leq C\|\theta_k - \theta_{k-1}\| \leq 2QC\omega_k. \quad (30)$$
Using the Cauchy–Schwarz inequality, (30) and the compactness of $\Theta$ in Assumption A1, we have
$$\mathbb{E}\left[\langle T_k, \Pi_{\theta_k}\mu_{\theta_k}(x_k) - \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\rangle\right] \leq \mathbb{E}[\|T_k\|]\cdot 2QC\omega_k \leq 4Q^2C\omega_k \leq 5Q^2C\omega_{k+1}, \quad (D_{22})$$
where the last inequality follows from Assumption A5 and holds for a large enough $k$.

(iii) For the last term of $D_2$,
$$\langle T_k, \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k) - \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle = \left(\langle T_k, \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\rangle - \langle T_{k+1}, \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle\right) + \left(\langle T_{k+1}, \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle - \langle T_k, \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle\right)$$
$$= (z_k - z_{k+1}) + \langle T_{k+1} - T_k, \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle,$$
where $z_k = \langle T_k, \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\rangle$. By the regularity assumption (27) and Lemma B6,
$$\mathbb{E}\langle T_{k+1} - T_k, \Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\rangle \leq \mathbb{E}[\|\theta_{k+1} - \theta_k\|]\cdot\mathbb{E}[\|\Pi_{\theta_k}\mu_{\theta_k}(x_{k+1})\|] \leq 2QC\omega_{k+1}. \quad (D_{23})$$

Regarding $D_3$, since $\rho(\theta_k, x_{k+1})$ is bounded, applying the Cauchy–Schwarz inequality gives
$$\mathbb{E}\left[\langle T_k, \omega_{k+1}\rho(\theta_k, x_{k+1})\rangle\right] \leq 2Q^2\omega_{k+1}. \quad (D_3)$$


Finally, adding (29), $D_1$, $D_{21}$, $D_{22}$, $D_{23}$ and $D_3$ together, it follows that for the constant $C_0 = G + 10Q^2C + 4QC + 4Q^2$,
$$\mathbb{E}\left[\|T_{k+1}\|^2\right] \leq (1 - 2\omega_{k+1}\varphi + G\omega_{k+1}^2)\mathbb{E}\left[\|T_k\|^2\right] + C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1} + 2\mathbb{E}[z_k - z_{k+1}]\omega_{k+1}. \quad (31)$$
Moreover, from (16) and (27), $\mathbb{E}[|z_k|]$ is upper bounded by
$$\mathbb{E}[|z_k|] = \mathbb{E}\left[\langle T_k, \Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\rangle\right] \leq \mathbb{E}[\|T_k\|]\,\mathbb{E}[\|\Pi_{\theta_{k-1}}\mu_{\theta_{k-1}}(x_k)\|] \leq 2QC. \quad (32)$$

According to Lemma B7, we can choose $\lambda_0$ and $k_0$ such that
$$\mathbb{E}[\|T_{k_0}\|^2] \leq \psi_{k_0} = \lambda_0\omega_{k_0} + \frac{1}{\varphi}\sup_{i \geq k_0}\Delta_i,$$
which satisfies the conditions (43) and (44) of Lemma B9. Applying Lemma B9 leads to
$$\mathbb{E}\left[\|T_k\|^2\right] \leq \psi_k + \mathbb{E}\left[\sum_{j=k_0+1}^k \Lambda_j^k(z_{j-1} - z_j)\right], \quad (33)$$

where $\psi_k = \lambda_0\omega_k + \frac{1}{\varphi}\sup_{i \geq k_0}\Delta_i$ for all $k > k_0$. Based on (32) and the increasing property of $\Lambda_j^k$ in Lemma B8, we have
$$\mathbb{E}\left|\sum_{j=k_0+1}^k \Lambda_j^k(z_{j-1} - z_j)\right| = \mathbb{E}\left|\sum_{j=k_0+1}^{k-1}(\Lambda_{j+1}^k - \Lambda_j^k)z_j - 2\omega_k z_k + \Lambda_{k_0+1}^k z_{k_0}\right|$$
$$\leq \sum_{j=k_0+1}^{k-1} 2(\Lambda_{j+1}^k - \Lambda_j^k)QC + \mathbb{E}[|2\omega_k z_k|] + 2\Lambda_k^k QC \leq 2(\Lambda_k^k - \Lambda_{k_0}^k)QC + 2\Lambda_k^k QC + 2\Lambda_k^k QC \leq 6\Lambda_k^k QC. \quad (34)$$

Given $\psi_k = \lambda_0\omega_k + \frac{1}{\varphi}\sup_{i \geq k_0}\Delta_i$, which satisfies the conditions (43) and (44) of Lemma B9, it follows from (33) and (34) that the following inequality holds for any $k > k_0$:
$$\mathbb{E}[\|T_k\|^2] \leq \psi_k + 6\Lambda_k^k QC = (\lambda_0 + 12QC)\omega_k + \frac{1}{\varphi}\sup_{i \geq k_0}\Delta_i = \lambda\omega_k + \frac{1}{\varphi}\sup_{i \geq k_0}\Delta_i,$$
where $\lambda = \lambda_0 + 12QC$, $\lambda_0 = \frac{2G\sup_{i \geq k_0}\Delta_i + 2C_0\varphi}{C_1\varphi}$, $C_1 = \liminf 2\varphi\frac{\omega_k}{\omega_{k+1}} + \frac{\omega_{k+1} - \omega_k}{\omega_{k+1}^2} > 0$, $C_0 = G + 10Q^2C + 4QC + 4Q^2$, and $G = 4Q^2(1 + Q^2)$.

B.3 Technical lemmas

Lemma B4. Suppose Assumption A1 holds, and $u_1$ and $u_{m-1}$ are fixed such that $\Psi(u_1) > \nu$ and $\Psi(u_{m-1}) > 1 - \nu$ for some small constant $\nu > 0$. For any bounded function $f(x)$, we have
$$\int_{\mathcal{X}} f(x)\left(\varpi_{\Psi_{\theta}}(x) - \varpi_{\bar{\Psi}_{\theta}}(x)\right) dx = O\left(\frac{1}{m}\right). \quad (35)$$

Proof. Recall that $\varpi_{\bar{\Psi}_{\theta}}(x) = \frac{1}{Z_{\theta}}\frac{\pi(x)}{\theta^{\zeta}(J(x))}$ and $\varpi_{\Psi_{\theta}}(x) = \frac{1}{Z_{\Psi_{\theta}}}\frac{\pi(x)}{\Psi_{\theta}^{\zeta}(U(x))}$. Since $f(x)$ is bounded, it suffices to show
$$\int_{\mathcal{X}}\left|\frac{1}{Z_{\theta}}\frac{\pi(x)}{\theta^{\zeta}(J(x))} - \frac{1}{Z_{\Psi_{\theta}}}\frac{\pi(x)}{\Psi_{\theta}^{\zeta}(U(x))}\right| dx$$
$$\leq \int_{\mathcal{X}}\left|\frac{1}{Z_{\theta}}\frac{\pi(x)}{\theta^{\zeta}(J(x))} - \frac{1}{Z_{\theta}}\frac{\pi(x)}{\Psi_{\theta}^{\zeta}(U(x))}\right| dx + \int_{\mathcal{X}}\left|\frac{1}{Z_{\theta}}\frac{\pi(x)}{\Psi_{\theta}^{\zeta}(U(x))} - \frac{1}{Z_{\Psi_{\theta}}}\frac{\pi(x)}{\Psi_{\theta}^{\zeta}(U(x))}\right| dx$$
$$= \underbrace{\frac{1}{Z_{\theta}}\sum_{i=1}^m\int_{\mathcal{X}_i}\left|\frac{\pi(x)}{\theta^{\zeta}(i)} - \frac{\pi(x)}{\Psi_{\theta}^{\zeta}(U(x))}\right| dx}_{I_1} + \underbrace{\sum_{i=1}^m\left|\frac{1}{Z_{\theta}} - \frac{1}{Z_{\Psi_{\theta}}}\right|\int_{\mathcal{X}_i}\frac{\pi(x)}{\Psi_{\theta}^{\zeta}(U(x))}\, dx}_{I_2} = O\left(\frac{1}{m}\right), \quad (36)$$
where $Z_{\theta} = \sum_{i=1}^m\int_{\mathcal{X}_i}\frac{\pi(x)}{\theta^{\zeta}(i)}\, dx$, $Z_{\Psi_{\theta}} = \sum_{i=1}^m\int_{\mathcal{X}_i}\frac{\pi(x)}{\Psi_{\theta}^{\zeta}(U(x))}\, dx$, and $\Psi_{\theta}(u)$ is the piecewise continuous function defined in (19).

By Assumption A1, $\inf_{\Theta}\theta(i) > 0$ for any $i$. Further, by the mean-value theorem, which implies $|x^{\zeta} - y^{\zeta}| \lesssim |x - y|z^{\zeta - 1}$ for any $\zeta > 0$, $x \leq y$ and $z \in [x, y] \subset [u_1, \infty)$, we have
$$I_1 = \frac{1}{Z_{\theta}}\sum_{i=1}^m\int_{\mathcal{X}_i}\left|\frac{\theta^{\zeta}(i) - \Psi_{\theta}^{\zeta}(U(x))}{\theta^{\zeta}(i)\Psi_{\theta}^{\zeta}(U(x))}\right|\pi(x)\, dx \lesssim \frac{1}{Z_{\theta}}\sum_{i=1}^m\int_{\mathcal{X}_i}\frac{|\Psi_{\theta}(u_{i-1}) - \Psi_{\theta}(u_i)|}{\theta^{\zeta}(i)}\pi(x)\, dx$$
$$\leq \max_i|\Psi_{\theta}(u_i - \Delta u) - \Psi_{\theta}(u_i)|\,\frac{1}{Z_{\theta}}\sum_{i=1}^m\int_{\mathcal{X}_i}\frac{\pi(x)}{\theta^{\zeta}(i)}\, dx = \max_i|\Psi_{\theta}(u_i - \Delta u) - \Psi_{\theta}(u_i)| \lesssim \Delta u = O\left(\frac{1}{m}\right),$$
where the second-to-last inequality follows by Taylor expansion, and the last equality follows as $u_1$ and $u_{m-1}$ are fixed. Similarly, we have

$$I_2 = \left|\frac{1}{Z_{\theta}} - \frac{1}{Z_{\Psi_{\theta}}}\right| Z_{\Psi_{\theta}} = \frac{|Z_{\Psi_{\theta}} - Z_{\theta}|}{Z_{\theta}} \leq \frac{1}{Z_{\theta}}\sum_{i=1}^m\int_{\mathcal{X}_i}\left|\frac{\pi(x)}{\theta^{\zeta}(i)} - \frac{\pi(x)}{\Psi_{\theta}^{\zeta}(U(x))}\right| dx = I_1 = O\left(\frac{1}{m}\right).$$

The proof can then be concluded by combining the orders of $I_1$ and $I_2$.

Lemma B5. Given $\sup\{\omega_k\}_{k=1}^{\infty} \leq 1$, there exists a constant $G = 4Q^2(1 + Q^2)$ such that
$$\|H(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\|^2 \leq G(1 + \|\theta_k - \theta_{\star}\|^2). \quad (37)$$

Proof. According to the compactness condition in Assumption A1, we have
$$\|H(\theta_k, x_{k+1})\|^2 \leq Q^2(1 + \|\theta_k\|^2) = Q^2(1 + \|\theta_k - \theta_{\star} + \theta_{\star}\|^2) \leq Q^2(1 + 2\|\theta_k - \theta_{\star}\|^2 + 2Q^2). \quad (38)$$
Therefore, using (38), we can show that for the constant $G = 4Q^2(1 + Q^2)$,
$$\|H(\theta_k, x_{k+1}) + \omega_{k+1}\rho(\theta_k, x_{k+1})\|^2 \leq 2\|H(\theta_k, x_{k+1})\|^2 + 2\omega_{k+1}^2\|\rho(\theta_k, x_{k+1})\|^2$$
$$\leq 2Q^2(1 + 2\|\theta_k - \theta_{\star}\|^2 + 2Q^2) + 2Q^2 \leq 2Q^2\left(2 + 2Q^2 + (2 + 2Q^2)\|\theta_k - \theta_{\star}\|^2\right) \leq G(1 + \|\theta_k - \theta_{\star}\|^2).$$

Lemma B6. Given $\sup\{\omega_k\}_{k=1}^{\infty} \leq 1$, we have
$$\|\theta_k - \theta_{k-1}\| \leq 2\omega_k Q. \quad (39)$$
Proof. Following the update $\theta_k - \theta_{k-1} = \omega_k H(\theta_{k-1}, x_k) + \omega_k^2\rho(\theta_{k-1}, x_k)$, we have
$$\|\theta_k - \theta_{k-1}\| = \|\omega_k H(\theta_{k-1}, x_k) + \omega_k^2\rho(\theta_{k-1}, x_k)\| \leq \omega_k\|H(\theta_{k-1}, x_k)\| + \omega_k^2\|\rho(\theta_{k-1}, x_k)\|.$$
By the compactness condition in Assumption A1 and $\sup\{\omega_k\}_{k=1}^{\infty} \leq 1$, (39) can be derived.

Lemma B7. There exist constants $\lambda_0$ and $k_0$ such that for all $\lambda \geq \lambda_0$ and $k > k_0$, the sequence $\{\psi_k\}_{k=1}^{\infty}$, where $\psi_k = \lambda\omega_k + \frac{1}{\varphi}\sup_{i \geq k_0}\Delta_i$, satisfies
$$\psi_{k+1} \geq (1 - 2\omega_{k+1}\varphi + G\omega_{k+1}^2)\psi_k + C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1}. \quad (40)$$

Proof. Replacing $\psi_k$ with $\lambda\omega_k + \frac{1}{\varphi}\sup_{i \geq k_0}\Delta_i$ in (40), it suffices to show
$$\lambda\omega_{k+1} + \frac{1}{\varphi}\sup_{i \geq k_0}\Delta_i \geq (1 - 2\omega_{k+1}\varphi + G\omega_{k+1}^2)\left(\lambda\omega_k + \frac{1}{\varphi}\sup_{i \geq k_0}\Delta_i\right) + C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1},$$
which is equivalent to proving
$$\lambda\left(\omega_{k+1} - \omega_k + 2\omega_k\omega_{k+1}\varphi - G\omega_k\omega_{k+1}^2\right) \geq \frac{1}{\varphi}\sup_{i \geq k_0}\Delta_i\left(-2\omega_{k+1}\varphi + G\omega_{k+1}^2\right) + C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1}.$$


Given the step size condition in (28), we have
$$\omega_{k+1} - \omega_k + 2\omega_k\omega_{k+1}\varphi \geq C_1\omega_{k+1}^2,$$
where $C_1 = \liminf 2\varphi\frac{\omega_k}{\omega_{k+1}} + \frac{\omega_{k+1} - \omega_k}{\omega_{k+1}^2} > 0$. Combining this with $\Delta_k \leq \sup_{i \geq k_0}\Delta_i$, it suffices to prove
$$\lambda(C_1 - G\omega_k)\omega_{k+1}^2 \geq \left(\frac{G}{\varphi}\sup_{i \geq k_0}\Delta_i + C_0\right)\omega_{k+1}^2. \quad (41)$$
It is clear that for a large enough $k_0$ such that $\omega_{k_0} \leq \frac{C_1}{2G}$ and $\lambda_0 = \frac{2G\sup_{i \geq k_0}\Delta_i + 2C_0\varphi}{C_1\varphi}$, the desired conclusion (41) holds for all $k \geq k_0$ and $\lambda \geq \lambda_0$.

The following lemma is a restatement of Lemma 25 (page 247) of Benveniste et al. [1990].

Lemma B8. Suppose $k_0$ is an integer satisfying $\inf_{k > k_0}\frac{\omega_{k+1} - \omega_k}{\omega_k\omega_{k+1}} + 2\varphi - G\omega_{k+1} > 0$ for some constant $G$. Then for any $k > k_0$, the sequence $\{\Lambda_k^K\}_{k=k_0,\ldots,K}$ defined below is increasing and upper bounded by $2\omega_k$:
$$\Lambda_k^K = \begin{cases} 2\omega_k\prod_{j=k}^{K-1}(1 - 2\omega_{j+1}\varphi + G\omega_{j+1}^2) & \text{if } k < K, \\ 2\omega_k & \text{if } k = K. \end{cases} \quad (42)$$

Lemma B9. Let $\{\psi_k\}_{k > k_0}$ be a series that satisfies the following inequality for all $k > k_0$:
$$\psi_{k+1} \geq \psi_k\left(1 - 2\omega_{k+1}\varphi + G\omega_{k+1}^2\right) + C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1}, \quad (43)$$
and assume there exists a $k_0$ such that
$$\mathbb{E}\left[\|T_{k_0}\|^2\right] \leq \psi_{k_0}. \quad (44)$$
Then for all $k > k_0$, we have
$$\mathbb{E}\left[\|T_k\|^2\right] \leq \psi_k + \mathbb{E}\left[\sum_{j=k_0+1}^k \Lambda_j^k(z_{j-1} - z_j)\right]. \quad (45)$$

Proof. We prove the result by induction. Assuming (45) holds at iteration $k$ and applying (31), we have
$$\mathbb{E}\left[\|T_{k+1}\|^2\right] \leq (1 - 2\omega_{k+1}\varphi + \omega_{k+1}^2 G)\left(\psi_k + \mathbb{E}\left[\sum_{j=k_0+1}^k \Lambda_j^k(z_{j-1} - z_j)\right]\right) + C_0\omega_{k+1}^2 + 2\Delta_k\omega_{k+1} + 2\omega_{k+1}\mathbb{E}[z_k - z_{k+1}].$$
Combining (40) and Lemma B8, respectively, we have
$$\mathbb{E}\left[\|T_{k+1}\|^2\right] \leq \psi_{k+1} + (1 - 2\omega_{k+1}\varphi + \omega_{k+1}^2 G)\,\mathbb{E}\left[\sum_{j=k_0+1}^k \Lambda_j^k(z_{j-1} - z_j)\right] + 2\omega_{k+1}\mathbb{E}[z_k - z_{k+1}]$$
$$\leq \psi_{k+1} + \mathbb{E}\left[\sum_{j=k_0+1}^k \Lambda_j^{k+1}(z_{j-1} - z_j)\right] + \Lambda_{k+1}^{k+1}\mathbb{E}[z_k - z_{k+1}] \leq \psi_{k+1} + \mathbb{E}\left[\sum_{j=k_0+1}^{k+1} \Lambda_j^{k+1}(z_{j-1} - z_j)\right].$$

C Ergodicity and dynamic importance sampler

Our interest is to analyze the deviation between the weighted averaging estimator $\frac{1}{k}\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))f(x_i)$ and the posterior expectation $\int_{\mathcal{X}} f(x)\pi(dx)$ for a bounded function $f$, where $\widetilde{J}(x_i)$ denotes the subregion index evaluated with the mini-batch energy estimate. To accomplish this analysis, we first study the convergence of the posterior sample mean $\frac{1}{k}\sum_{i=1}^k f(x_i)$ to the expectation $\bar{f} = \int_{\mathcal{X}} f(x)\varpi_{\Psi_{\theta_{\star}}}(dx)$ and then extend it to $\int_{\mathcal{X}} f(x)\varpi_{\bar{\Psi}_{\theta_{\star}}}(dx)$. The key tool for establishing the ergodic theory is still the Poisson equation, which is used to characterize the fluctuation between $f(x)$ and $\bar{f}$:
$$\mathcal{L}g(x) = f(x) - \bar{f}, \quad (46)$$
where $g(x)$ is the solution of the Poisson equation, and $\mathcal{L}$ is the infinitesimal generator of the Langevin diffusion
$$\mathcal{L}g := \langle \nabla g, \nabla L(\cdot, \theta_{\star})\rangle + \tau\nabla^2 g.$$
By imposing the following regularity conditions on the function $g(x)$, we can control the perturbations of $\frac{1}{k}\sum_{i=1}^k f(x_i) - \bar{f}$ and enable the convergence of the weighted averaging estimators.

Assumption A6 (Regularity). $g(x)$ is sufficiently smooth, and there exists a function $\mathcal{V}(x)$ such that $\|D^k g\| \lesssim \mathcal{V}^{p_k}(x)$ for some constants $p_k > 0$, where $k \in \{0, 1, 2, 3\}$. In addition, $\mathcal{V}^p$ has a bounded expectation, i.e., $\sup_x \mathbb{E}[\mathcal{V}^p(x)] < \infty$, and $\mathcal{V}$ is smooth, i.e., $\sup_{s \in (0,1)} \mathcal{V}^p(sx + (1 - s)y) \lesssim \mathcal{V}^p(x) + \mathcal{V}^p(y)$ for all $x, y \in \mathcal{X}$ and $p \leq 2\max_k\{p_k\}$.

For stronger but verifiable conditions, we refer readers to Vollmer et al. [2016]. In what follows, we present a lemma, which is mainly adapted from Theorem 2 of Chen et al. [2015] with a fixed learning rate $\epsilon$.

Lemma C1 (Convergence of the Averaging Estimators). Suppose Assumptions A1-A6 hold. For any bounded function $f$,
$$\left|\mathbb{E}\left[\frac{\sum_{i=1}^k f(x_i)}{k}\right] - \int_{\mathcal{X}} f(x)\varpi_{\bar{\Psi}_{\theta_{\star}}}(x)\, dx\right| = O\left(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k \omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \geq k_0}\sqrt{\delta_n(\theta_i)}\right),$$
where $k_0$ is a sufficiently large constant, $\varpi_{\bar{\Psi}_{\theta_{\star}}}(x) \propto \frac{\pi(x)}{\theta_{\star}^{\zeta}(J(x))}$, and $\frac{\sum_{i=1}^k \omega_i}{k} = o\left(\frac{1}{\sqrt{k}}\right)$ as implied by Assumption A5.

Proof. We rewrite the CSGLD algorithm as follows:
$$x_{k+1} = x_k - \epsilon_k\nabla_x\widetilde{L}(x_k, \theta_k) + \mathcal{N}(0, 2\epsilon_k\tau I) = x_k - \epsilon_k\left(\nabla_x\bar{L}(x_k, \theta_{\star}) + \Upsilon(x_k, \theta_k, \theta_{\star})\right) + \mathcal{N}(0, 2\epsilon_k\tau I),$$
where $\nabla_x\bar{L}(x, \theta) = \frac{N}{n}\left[1 + \frac{\zeta\tau}{\Delta u}\left(\log\theta(J(x)) - \log\theta((J(x) - 1)\vee 1)\right)\right]\nabla_x U(x)$ is the full-data counterpart, $\nabla_x\widetilde{L}(x, \theta)$ is as defined in Section B.1, and the bias term is given by $\Upsilon(x_k, \theta_k, \theta_{\star}) = \nabla_x\widetilde{L}(x_k, \theta_k) - \nabla_x\bar{L}(x_k, \theta_{\star})$.

By Assumption A2, we have $\|\nabla_x U(x)\| = \|\nabla_x U(x) - \nabla_x U(x_{\star})\| \lesssim \|x - x_{\star}\| \leq \|x\| + \|x_{\star}\|$ for some optimum $x_{\star}$. Then the $L^2$ upper bound in Lemma B2 implies that $\nabla_x U(x)$ has a bounded second moment. Combining Assumption A4, we have $\mathbb{E}\left[\|\nabla_x U(x)\|^2\right] < \infty$. Further, by Eve's law (i.e., the variance decomposition formula), it is easy to derive that $\mathbb{E}\left[\|\nabla_x\widetilde{U}(x)\|^2\right] < \infty$. Then, by the triangle inequality and Jensen's inequality,
$$\|\mathbb{E}[\Upsilon(x_k, \theta_k, \theta_{\star})]\| \leq \mathbb{E}\left[\|\nabla_x\widetilde{L}(x_k, \theta_k) - \nabla_x\widetilde{L}(x_k, \theta_{\star})\|\right] + \mathbb{E}\left[\|\nabla_x\widetilde{L}(x_k, \theta_{\star}) - \nabla_x\bar{L}(x_k, \theta_{\star})\|\right]$$
$$\lesssim \mathbb{E}[\|\theta_k - \theta_{\star}\|] + O(\delta_n(\theta_{\star})) \leq \sqrt{\mathbb{E}[\|\theta_k - \theta_{\star}\|^2]} + O(\delta_n(\theta_{\star})) \leq O\left(\sqrt{\omega_k + \epsilon + \frac{1}{m} + \sup_{i \geq k_0}\delta_n(\theta_i)}\right), \quad (47)$$
where Assumption A1 and Theorem 3 are used to derive the smoothness of $\nabla_x\widetilde{L}(x, \theta)$ with respect to $\theta$, and $\delta_n(\theta) = \mathbb{E}[\widetilde{H}(\theta, x) - H(\theta, x)]$ is the bias caused by the mini-batch evaluation of $U(x)$.

The ergodic average based on biased gradients and a fixed learning rate is a direct result of Theorem 2 of Chen et al. [2015] under the regularity condition A6. Since the scheme simulates from $\varpi_{\Psi_{\theta_{\star}}}(x) \propto \frac{\pi(x)}{\Psi_{\theta_{\star}}^{\zeta}(U(x))}$, combining (47) and Theorem 3 yields
$$\left|\mathbb{E}\left[\frac{\sum_{i=1}^k f(x_i)}{k}\right] - \int_{\mathcal{X}} f(x)\varpi_{\Psi_{\theta_{\star}}}(x)\, dx\right| \leq O\left(\frac{1}{k\epsilon} + \epsilon + \frac{\sum_{i=1}^k \|\mathbb{E}[\Upsilon(x_i, \theta_i, \theta_{\star})]\|}{k}\right)$$
$$\lesssim O\left(\frac{1}{k\epsilon} + \epsilon + \frac{\sum_{i=1}^k\sqrt{\omega_i + \epsilon + \frac{1}{m} + \sup_{i \geq k_0}\delta_n(\theta_i)}}{k}\right) \leq O\left(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k \omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \geq k_0}\sqrt{\delta_n(\theta_i)}\right),$$
where the last inequality follows by repeatedly applying the inequality $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$ and the Cauchy–Schwarz inequality $\sum_{i=1}^k\sqrt{\omega_i} \leq \sqrt{k\sum_{i=1}^k \omega_i}$.

For any bounded function $f(x)$, we have $\left|\int_{\mathcal{X}} f(x)\varpi_{\Psi_{\theta_{\star}}}(x)\, dx - \int_{\mathcal{X}} f(x)\varpi_{\bar{\Psi}_{\theta_{\star}}}(x)\, dx\right| = O\left(\frac{1}{m}\right)$ by Lemma B4. By the triangle inequality, we have
$$\left|\mathbb{E}\left[\frac{\sum_{i=1}^k f(x_i)}{k}\right] - \int_{\mathcal{X}} f(x)\varpi_{\bar{\Psi}_{\theta_{\star}}}(x)\, dx\right| \leq O\left(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k \omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \geq k_0}\sqrt{\delta_n(\theta_i)}\right),$$
which concludes the proof.

Finally, we are ready to show the convergence of the weighted averaging estimator $\frac{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))}$ to the posterior mean $\int_{\mathcal{X}} f(x)\pi(dx)$.

Theorem 4 (Convergence of the Weighted Averaging Estimators). Assume Assumptions A1-A6 hold. For any bounded function $f$, we have
$$\left|\mathbb{E}\left[\frac{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))}\right] - \int_{\mathcal{X}} f(x)\pi(dx)\right| = O\left(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k \omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \geq k_0}\sqrt{\delta_n(\theta_i)}\right).$$

Proof. Applying the triangle inequality and $|\mathbb{E}[x]| \leq \mathbb{E}[|x|]$, we have
$$\left|\mathbb{E}\left[\frac{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))}\right] - \int_{\mathcal{X}} f(x)\pi(dx)\right|$$
$$\leq \underbrace{\mathbb{E}\left[\left|\frac{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))} - \frac{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))}\right|\right]}_{I_1} + \underbrace{\mathbb{E}\left[\left|\frac{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))} - Z_{\theta_{\star}}\frac{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))f(x_i)}{k}\right|\right]}_{I_2}$$
$$+ \underbrace{\mathbb{E}\left[\frac{Z_{\theta_{\star}}}{k}\sum_{i=1}^k\left|\theta_i^{\zeta}(J(x_i)) - \theta_{\star}^{\zeta}(J(x_i))\right|\cdot|f(x_i)|\right]}_{I_3} + \underbrace{\left|\mathbb{E}\left[\frac{Z_{\theta_{\star}}}{k}\sum_{i=1}^k \theta_{\star}^{\zeta}(J(x_i))f(x_i)\right] - \int_{\mathcal{X}} f(x)\pi(dx)\right|}_{I_4},$$
where $\widetilde{J}(x_i)$ denotes the subregion index evaluated with the mini-batch energy estimate and $J(x_i)$ its full-data counterpart.

For the term $I_1$, consider the bias $\delta_n(\theta) = \mathbb{E}[\widetilde{H}(\theta, x) - H(\theta, x)]$ as defined in the proof of Lemma B1, which decreases to 0 as $n \to N$. Applying the mean-value theorem, we have
$$I_1 = \mathbb{E}\left[\left|\frac{\left(\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))f(x_i)\right)\left(\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))\right) - \left(\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))f(x_i)\right)\left(\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))\right)}{\left(\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))\right)\left(\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))\right)}\right|\right] \lesssim \sup_i\delta_n(\theta_i) = O\left(\sup_i\delta_n(\theta_i)\right). \quad (48)$$


For the term $I_2$, by the boundedness of $\Theta$ and $f$ and the assumption $\inf_{\Theta}\theta^{\zeta}(i) > 0$, we have
$$I_2 = \mathbb{E}\left[\left|\frac{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))}\left(1 - Z_{\theta_{\star}}\frac{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))}{k}\right)\right|\right] \lesssim \mathbb{E}\left[\left|Z_{\theta_{\star}}\frac{\sum_{i=1}^k \theta_i^{\zeta}(J(x_i))}{k} - 1\right|\right]$$
$$= \mathbb{E}\left[\left|Z_{\theta_{\star}}\sum_{i=1}^m\frac{\sum_{j=1}^k\left(\theta_j^{\zeta}(i) - \theta_{\star}^{\zeta}(i) + \theta_{\star}^{\zeta}(i)\right)1_{J(x_j)=i}}{k} - 1\right|\right]$$
$$\leq \underbrace{\mathbb{E}\left[Z_{\theta_{\star}}\sum_{i=1}^m\frac{\sum_{j=1}^k\left|\theta_j^{\zeta}(i) - \theta_{\star}^{\zeta}(i)\right|1_{J(x_j)=i}}{k}\right]}_{I_{21}} + \underbrace{\mathbb{E}\left[\left|Z_{\theta_{\star}}\sum_{i=1}^m \theta_{\star}^{\zeta}(i)\frac{\sum_{j=1}^k 1_{J(x_j)=i}}{k} - 1\right|\right]}_{I_{22}}.$$

For $I_{21}$, by first applying the inequality $|x^{\zeta} - y^{\zeta}| \leq \zeta|x - y|z^{\zeta - 1}$ for any $\zeta > 0$, $x \leq y$ and $z \in [x, y]$, based on the mean-value theorem, and then applying the Cauchy–Schwarz inequality, we have
$$I_{21} \lesssim \frac{1}{k}\mathbb{E}\left[\sum_{j=1}^k\sum_{i=1}^m\left|\theta_j^{\zeta}(i) - \theta_{\star}^{\zeta}(i)\right|\right] \lesssim \frac{1}{k}\mathbb{E}\left[\sum_{j=1}^k\sum_{i=1}^m\left|\theta_j(i) - \theta_{\star}(i)\right|\right] \lesssim \frac{1}{k}\sqrt{\sum_{j=1}^k\mathbb{E}\left[\|\theta_j - \theta_{\star}\|^2\right]}, \quad (49)$$
where the compactness of $\Theta$ has been used in deriving the second inequality.

For $I_{22}$, considering the relation
$$1 = \sum_{i=1}^m\int_{\mathcal{X}_i}\pi(x)\, dx = \sum_{i=1}^m\int_{\mathcal{X}_i}\theta_{\star}^{\zeta}(i)\frac{\pi(x)}{\theta_{\star}^{\zeta}(i)}\, dx = Z_{\theta_{\star}}\int_{\mathcal{X}}\sum_{i=1}^m\theta_{\star}^{\zeta}(i)1_{J(x)=i}\,\varpi_{\bar{\Psi}_{\theta_{\star}}}(x)\, dx,$$
we have
$$I_{22} = \mathbb{E}\left[\left|Z_{\theta_{\star}}\sum_{i=1}^m\theta_{\star}^{\zeta}(i)\frac{\sum_{j=1}^k 1_{J(x_j)=i}}{k} - Z_{\theta_{\star}}\int_{\mathcal{X}}\sum_{i=1}^m\theta_{\star}^{\zeta}(i)1_{J(x)=i}\,\varpi_{\bar{\Psi}_{\theta_{\star}}}(x)\, dx\right|\right]$$
$$= Z_{\theta_{\star}}\mathbb{E}\left[\left|\frac{1}{k}\sum_{j=1}^k\left(\sum_{i=1}^m\theta_{\star}^{\zeta}(i)1_{J(x_j)=i}\right) - \int_{\mathcal{X}}\left(\sum_{i=1}^m\theta_{\star}^{\zeta}(i)1_{J(x)=i}\right)\varpi_{\bar{\Psi}_{\theta_{\star}}}(x)\, dx\right|\right]$$
$$= O\left(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k \omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \geq k_0}\sqrt{\delta_n(\theta_i)}\right), \quad (50)$$
where the last equality follows from Lemma C1, as the step function $\sum_{i=1}^m\theta_{\star}^{\zeta}(i)1_{J(x)=i}$ is integrable.

For $I_3$, by the boundedness of $f$, the mean-value theorem and the Cauchy–Schwarz inequality, we have
$$I_3 \lesssim \mathbb{E}\left[\frac{1}{k}\sum_{i=1}^k\left|\theta_i^{\zeta}(J(x_i)) - \theta_{\star}^{\zeta}(J(x_i))\right|\right] \lesssim \frac{1}{k}\mathbb{E}\left[\sum_{j=1}^k\sum_{i=1}^m\left|\theta_j(i) - \theta_{\star}(i)\right|\right] \lesssim \frac{1}{k}\sqrt{\sum_{j=1}^k\mathbb{E}\left[\|\theta_j - \theta_{\star}\|^2\right]}. \quad (51)$$

For the last term $I_4$, we first decompose $\int_{\mathcal{X}} f(x)\pi(dx)$ over the $m$ disjoint regions to facilitate the analysis:
$$\int_{\mathcal{X}} f(x)\pi(dx) = \int_{\cup_{j=1}^m\mathcal{X}_j} f(x)\pi(dx) = \sum_{j=1}^m\int_{\mathcal{X}_j}\theta_{\star}^{\zeta}(j)f(x)\frac{\pi(dx)}{\theta_{\star}^{\zeta}(j)} = Z_{\theta_{\star}}\int_{\mathcal{X}}\sum_{j=1}^m\theta_{\star}^{\zeta}(j)f(x)1_{J(x)=j}\,\varpi_{\bar{\Psi}_{\theta_{\star}}}(x)(dx). \quad (52)$$


Plugging (52) into the last term $I_4$, we have
$$I_4 = \left|\mathbb{E}\left[\frac{Z_{\theta_{\star}}}{k}\sum_{i=1}^k\sum_{j=1}^m\theta_{\star}^{\zeta}(j)f(x_i)1_{J(x_i)=j}\right] - \int_{\mathcal{X}} f(x)\pi(dx)\right|$$
$$= Z_{\theta_{\star}}\left|\mathbb{E}\left[\frac{1}{k}\sum_{i=1}^k\left(\sum_{j=1}^m\theta_{\star}^{\zeta}(j)f(x_i)1_{J(x_i)=j}\right)\right] - \int_{\mathcal{X}}\left(\sum_{j=1}^m\theta_{\star}^{\zeta}(j)f(x)1_{J(x)=j}\right)\varpi_{\bar{\Psi}_{\theta_{\star}}}(x)(dx)\right|. \quad (53)$$

Applying Lemma C1 to the function $\sum_{j=1}^m\theta_{\star}^{\zeta}(j)f(x)1_{J(x)=j}$ yields
$$\left|\mathbb{E}\left[\frac{1}{k}\sum_{i=1}^k\left(\sum_{j=1}^m\theta_{\star}^{\zeta}(j)f(x_i)1_{J(x_i)=j}\right)\right] - \int_{\mathcal{X}}\left(\sum_{j=1}^m\theta_{\star}^{\zeta}(j)f(x)1_{J(x)=j}\right)\varpi_{\bar{\Psi}_{\theta_{\star}}}(x)(dx)\right| = O\left(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k \omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \geq k_0}\sqrt{\delta_n(\theta_i)}\right). \quad (54)$$

Plugging (54) into (53) and combining $I_1$, $I_{21}$, $I_{22}$, $I_3$ and Theorem 3, we have
$$\left|\mathbb{E}\left[\frac{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))f(x_i)}{\sum_{i=1}^k \theta_i^{\zeta}(\widetilde{J}(x_i))}\right] - \int_{\mathcal{X}} f(x)\pi(dx)\right| = O\left(\frac{1}{k\epsilon} + \sqrt{\epsilon} + \sqrt{\frac{\sum_{i=1}^k \omega_i}{k}} + \frac{1}{\sqrt{m}} + \sup_{i \geq k_0}\sqrt{\delta_n(\theta_i)}\right),$$
which concludes the proof of the theorem.
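As a usage note, the weighted averaging estimator of Theorem 4 is straightforward to compute from a stored trajectory. The sketch below assumes arrays `f_vals`, `thetas`, and `Js` collected along the run; these names are illustrative and not from the original text.

```python
import numpy as np

def weighted_posterior_mean(f_vals, thetas, Js, zeta=0.75):
    """Self-normalized estimator sum_i theta_i^zeta(J(x_i)) f(x_i) / sum_i theta_i^zeta(J(x_i)).

    f_vals : shape (k,)   values f(x_i) along the trajectory
    thetas : shape (k, m) self-adapting vectors theta_i
    Js     : shape (k,)   0-based integer subregion indices J(x_i)
    """
    w = thetas[np.arange(len(Js)), Js] ** zeta  # importance weights theta_i^zeta(J(x_i))
    return float(np.sum(w * f_vals) / np.sum(w))
```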

D More discussions on the algorithm

D.1 An alternative numerical scheme

In addition to the numerical scheme used in (6) and (8) in the main body, we can also consider the following numerical scheme:
$$x_{k+1} = x_k - \epsilon_{k+1}\frac{N}{n}\left[1 + \zeta\tau\frac{\log\theta_k\left((J(x_k) + 1)\wedge m\right) - \log\theta_k\left(J(x_k)\right)}{\Delta u}\right]\nabla_x\widetilde{U}(x_k) + \sqrt{2\tau\epsilon_{k+1}}\,w_{k+1}.$$
Such a scheme leads to a similar theoretical result and a better treatment of $\Psi_{\theta}(\cdot)$ for the subregions that contain stationary points.
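The two schemes differ only in which log-difference enters the gradient multiplier. The following small sketch contrasts them with 0-based indices; the forward-difference index $(J(x_k)+1)\wedge m$ is our reading of the formula above and should be treated as an assumption.

```python
import numpy as np

def grad_multiplier(theta, J, zeta, tau, delta_u, forward=False):
    """Adaptive gradient multiplier: backward difference as in (8)/(14),
    or the forward-difference variant of Section D.1 (assumed indexing)."""
    m = len(theta)
    if forward:
        lo, hi = J, min(J + 1, m - 1)   # log theta((J+1) ^ m) - log theta(J)
    else:
        lo, hi = max(J - 1, 0), J       # log theta(J) - log theta((J-1) v 1)
    return 1.0 + zeta * tau / delta_u * (np.log(theta[hi]) - np.log(theta[lo]))
```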

D.2 Bizarre peaks in the Gaussian mixture distribution

A bizarre peak always indicates that there is a stationary point of the same energy somewhere else in the sample space, since the sample space is partitioned according to the energy function in CSGLD. For example, we study a mixture distribution with asymmetric modes $\pi(x) = \frac{1}{6}N(-6, 1) + \frac{5}{6}N(4, 1)$. Figure 4 shows a bizarre peak at a point $x$. Although $x$ is not a local minimum, it has the same energy as "$-6$", which is a local minimum. Note that in CSGLD, $x$ and "$-6$" belong to the same subregion.

[Figure 4: Explanation of bizarre peaks. The plot shows the original energy of $\pi(x) = \frac{1}{6}N(-6, 1) + \frac{5}{6}N(4, 1)$ over the sample space, together with the modified energies for $\zeta = 0.5$, $\zeta = 0.75$, and $\zeta = 1$.]


D.3 Simulations of multi-modal distributions

We run all the algorithms for 200,000 iterations and assume that the stochastic energy and gradient estimates follow Gaussian distributions with a variance of 0.1. We include an additional quadratic regularizer $(\|x\|^2 - 7)1_{\|x\|^2 > 7}$ to keep the samples in the center region. We use a constant learning rate of 0.001 for SGLD, reSGLD, and CSGLD; we adopt cyclic cosine learning rates with an initial learning rate of 0.005 and 20 cycles for cycSGLD. The temperature is fixed at 1 for all the algorithms, except for the high-temperature process of reSGLD, which employs a temperature of 3. For CSGLD in particular, we choose the step size $\omega_k = \min\{0.003, 10/(k^{0.8} + 100)\}$ for learning the latent vector. We fix 100 partitions, each with an energy bandwidth of 0.25, and choose $\zeta = 0.75$. A sketch of this setup in code follows.
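The sketch below collects the settings above; the `energy` function and the cosine schedule are written in the usual forms and are illustrative, not the authors' code.

```python
import numpy as np

n_iter = 200_000
noise_var = 0.1                     # variance of the Gaussian noise on energy/gradient

def energy(x):
    """Negative log-density of pi(x) = 1/6 N(-6,1) + 5/6 N(4,1), plus the regularizer."""
    p = (np.exp(-(x + 6.0) ** 2 / 2) / 6.0
         + 5.0 * np.exp(-(x - 4.0) ** 2 / 2) / 6.0) / np.sqrt(2.0 * np.pi)
    reg = (x ** 2 - 7.0) * (x ** 2 > 7.0)   # (||x||^2 - 7) 1_{||x||^2 > 7}
    return -np.log(p) + reg

lr_const = 1e-3                     # SGLD, reSGLD, CSGLD
cycle_len = n_iter // 20            # 20 cosine cycles for cycSGLD
def lr_cyclic(k, lr0=0.005):
    return lr0 / 2.0 * (np.cos(np.pi * ((k - 1) % cycle_len) / cycle_len) + 1.0)

def omega(k):                       # CSGLD step size for the latent vector
    return min(0.003, 10.0 / (k ** 0.8 + 100.0))

m, delta_u, zeta = 100, 0.25, 0.75  # partition count, energy bandwidth, exponent
temperature = 1.0                   # 3.0 for the high-temperature chain of reSGLD
```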

D.4 Extension to the scenarios with high-ΞΆ

In some complex experiments (e.g., computer vision) with a high-loss function, the fixed point $\theta_{\star}$ can be very close to the vector $(1, 0, \ldots, 0)$, i.e., the first subregion contains almost all the probability mass, if the sample space is not appropriately partitioned. As a result, estimating the $\theta(i)$'s for the high-energy subregions can be quite difficult due to the limitation of floating-point arithmetic. If a small value of $\zeta$ is used, the gradient multiplier $1 + \zeta\tau\frac{\log\theta_{\star}(i) - \log\theta_{\star}((i-1)\vee 1)}{\Delta u}$ is close to 1 for any $i$, and the algorithm performs similarly to SGLD, except with different weights. When a large value of $\zeta$ is used, the convergence of $\theta_k$ to $\theta_{\star}$ can become relatively slow. To tackle this issue, we include a high-order bias term in the stochastic approximation as follows:
$$\theta_{k+1}(i) = \theta_k(i) + \omega_{k+1}\left(\theta_k^{\zeta}(J(x_{k+1})) + \omega_{k+1}1_{i \geq J(x_{k+1})}\rho\right)\left(1_{i = J(x_{k+1})} - \theta_k(i)\right), \quad (55)$$
for $i = 1, 2, \ldots, m$, where $\rho$ is a constant. As shown earlier, our convergence theory allows the inclusion of such a high-order bias term. In simulations, the high-order bias term $\omega_{k+1}^2 1_{i \geq J(x_{k+1})}\rho$ penalizes the higher-energy regions more, and thus accelerates the convergence of $\theta_k$ toward the pattern $(1, 0, 0, \ldots, 0)$, especially in the early period. A sketch of this update in code is given below.
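The following is a minimal sketch of the update (55), with 0-based subregion indices and illustrative names.

```python
import numpy as np

def theta_update_high_order(theta, J, omega, zeta, rho=1.0):
    """Eq. (55): theta_{k+1}(i) = theta_k(i)
    + omega * (theta_k^zeta(J) + omega * 1_{i >= J} * rho) * (1_{i = J} - theta_k(i))."""
    m = len(theta)
    idx = np.arange(m)
    gain = theta[J] ** zeta + omega * (idx >= J) * rho      # high-order bias enters here
    return theta + omega * gain * ((idx == J).astype(float) - theta)
```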

In all computations for the computer vision examples, we set the momentum coefficient to 0.9 and the weight decay to 25, and employed the data augmentation scheme of Zhong et al. [2017]. In addition, for CSGHMC and saCSGHMC, we set $\omega_k = \frac{10}{k^{0.75} + 1000}$ and $\rho = 1$ in (55) for both CIFAR10 and CIFAR100, and set $\zeta = 1 \times 10^6$ for CIFAR10 and $3 \times 10^6$ for CIFAR100.

D.5 Number of partitions

A fine partition leads to a smaller discretization error, but it may increase the risk of instability. In particular, it can induce large bouncy jumps around optima (a large negative learning rate, i.e., $\frac{\log\theta(2) - \log\theta(1)}{\Delta u} \ll 0$ in formula (8), may arise there). Empirically, we suggest partitioning the sample space into a moderate number of subregions, e.g., 10-1000, to balance stability against discretization error.
