This article was downloaded by: [Purdue University], on 24 September 2014, at 10:57. Publisher: Taylor & Francis.

Journal of the American Statistical Association. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uasa20

To cite this article: Faming Liang, Yichen Cheng & Guang Lin (2014), Simulated Stochastic Approximation Annealing for Global Optimization With a Square-Root Cooling Schedule, Journal of the American Statistical Association, 109:506, 847-863, DOI: 10.1080/01621459.2013.872993. To link to this article: http://dx.doi.org/10.1080/01621459.2013.872993

Accepted author version posted online: 06 Jan 2014. Published online: 13 Jun 2014.


Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA

Simulated Stochastic Approximation Annealing for Global Optimization With a Square-Root Cooling Schedule

Faming LIANG, Yichen CHENG, and Guang LIN

Simulated annealing has been widely used in the solution of optimization problems. As is well known, simulated annealing cannot guarantee that the global optima are located unless a logarithmic cooling schedule is used. However, the logarithmic cooling schedule is so slow that the required CPU time is prohibitive. This article proposes a new stochastic optimization algorithm, the so-called simulated stochastic approximation annealing algorithm, which is a combination of simulated annealing and the stochastic approximation Monte Carlo algorithm. Under the framework of stochastic approximation, it is shown that the new algorithm can work with a cooling schedule in which the temperature decreases much faster than in the logarithmic cooling schedule, for example, a square-root cooling schedule, while still guaranteeing that the global optima are reached as the temperature tends to zero. The new algorithm has been tested on a few benchmark optimization problems, including feed-forward neural network training and protein folding. The numerical results indicate that the new algorithm can significantly outperform simulated annealing and other competitors. Supplementary materials for this article are available online.

KEY WORDS: Local trap; Markov chain Monte Carlo; Simulated annealing; Stochastic approximation Monte Carlo; Stochastic optimization.

1. INTRODUCTION

Simulated annealing (Kirkpatrick, Gelatt, and Vecchi 1983) has gained popularity in stochastic optimization since its publication in 1983. During the past three decades, it has been applied to many optimization problems encountered in areas such as computer very large-scale integration (VLSI) design, image processing, and molecular physics and chemistry. See Tan (2008) for an overview and recent developments of this algorithm.

Let U(x), x ∈ X, denote the objective function that one wants to minimize, where X is the domain of U(x). Simulated annealing is based on the fact that minimizing U(x) is equivalent to sampling from the Boltzmann distribution fτ∗(x) ∝ exp(−U(x)/τ∗) at a very small value (close to 0) of τ∗. In terms of simulated annealing, U(x) is called the energy function and τ∗ is called the temperature. To sample from fτ∗(x), Kirkpatrick, Gelatt, and Vecchi (1983) suggested simulating from a sequence of Boltzmann distributions, fτ1(x), fτ2(x), . . . , fτm(x), in a sequential manner, where the temperatures τ1, . . . , τm form a decreasing ladder τ1 > τ2 > · · · > τm = τ∗ > 0 with τ∗ ≈ 0 and τ1 reasonably large, such that most uphill Metropolis–Hastings (MH) moves at that level can be accepted. The simulation at high-temperature levels aims to provide a good initial sample, hopefully a point in the attraction basin of the global minima of U(x), for the simulation at low-temperature levels. In summary, the simulated annealing algorithm can be described as follows:

Faming Liang is Professor (E-mail: [email protected]), and Yichen Cheng is Graduate Student (E-mail: [email protected]), Department of Statistics, Texas A&M University, College Station, TX 77843. Guang Lin is Research Scientist, Pacific Northwest National Laboratory, 902 Battelle Boulevard, P.O. Box 999, MSIN K7-90, Richland, WA 99352 (E-mail: [email protected]). Liang's research was partially supported by grants from the National Science Foundation (DMS-1106494 and DMS-1317131) and the award (KUS-C1-016-04) made by King Abdullah University of Science and Technology (KAUST). The authors thank the editor, associate editor, and three referees for their constructive comments, which have led to significant improvement of this article.

Algorithm 1.1 (Simulated annealing).

1. Initialize the simulation at temperature τ1 and an arbitrary sample x0 ∈ X.

2. At each temperature τi, simulate from the distribution fτi(x) for ni iterations using the MH sampler. Pass the final sample to the next lower temperature level as the initial sample.
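As a concrete sketch, Algorithm 1.1 can be written in a few lines of Python (an illustrative toy, not the authors' code: the quadratic energy, geometric temperature ladder, and proposal scale are all hypothetical choices):

```python
import math
import random

def simulated_annealing(U, x0, temps, n_iter=200, scale=0.5, seed=0):
    """Algorithm 1.1: run an MH sampler at each temperature of a
    decreasing ladder, passing the final state down the ladder."""
    rng = random.Random(seed)
    x, best = x0, x0
    for tau in temps:                       # step 2: loop over the ladder
        for _ in range(n_iter):
            y = x + rng.gauss(0.0, scale)   # random-walk proposal
            # MH acceptance for the Boltzmann density exp(-U(x)/tau)
            if math.log(rng.random()) < (U(x) - U(y)) / tau:
                x = y
            if U(x) < U(best):
                best = x
    return best

# Illustrative energy function with global minimum at x = 2
U = lambda x: (x - 2.0) ** 2
temps = [2.0 * 0.9 ** i for i in range(50)]  # geometric cooling ladder
x_min = simulated_annealing(U, x0=-5.0, temps=temps)
```

The final state of each temperature level seeds the next, exactly as the two steps above prescribe.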

The main difficulty in using simulated annealing is choosing the cooling schedule. One cooling schedule of theoretical interest is the logarithmic cooling schedule, in which the temperature at the tth iteration is set at the order of 1/log(t) (the temperature needs to be at least constant/log(t)). This cooling schedule ensures that the simulation converges to the global minima of U(x) with probability 1 (Geman and Geman 1984; Haario and Saksman 1991). In practice, however, it is so slow that the required running time is prohibitive. A linear or geometric cooling schedule is commonly used but, as shown in Holley, Kusuoka, and Stroock (1989), such schedules do not guarantee that the global minima are reached.

In this article, we propose a variant of simulated annealing, the so-called simulated stochastic approximation annealing algorithm or, in short, stochastic approximation annealing (SAA), for global optimization. The SAA algorithm is a combination of simulated annealing and the stochastic approximation Monte Carlo (SAMC) algorithm (Liang, Liu, and Carroll 2007). The latter is itself an adaptive Markov chain Monte Carlo (MCMC) algorithm and can be used for general-purpose Monte Carlo simulations. Although SAMC falls into the class of adaptive MCMC algorithms, it is different from conventional adaptive MCMC algorithms, such as the adaptive MH algorithm (Haario, Saksman, and Tamminen 2001). During simulations, the conventional adaptive MCMC algorithms adapt the proposal distribution based on the past samples, while SAMC, which is

© 2014 American Statistical Association
Journal of the American Statistical Association
June 2014, Vol. 109, No. 506, Theory and Methods
DOI: 10.1080/01621459.2013.872993


rooted in the Wang–Landau algorithm (Wang and Landau 2001), adapts the target distribution based on the past samples. Liang (2009a) showed that SAMC is essentially a dynamic importance sampling algorithm. The SAA algorithm, as a combination of simulated annealing and SAMC, adapts its target distribution at each iteration based on both the past samples and a prespecified temperature ladder. Under the framework of stochastic approximation (Robbins and Monro 1951), we show that SAA can work with a cooling schedule in which the temperature decreases much faster than in the logarithmic cooling schedule, while guaranteeing that the global minima are reached when the temperature tends to 0. For example, SAA can employ a square-root cooling schedule, which sets the temperature for the tth iteration at the order of 1/√t.
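The practical gap between the two schedules is easy to quantify: under a logarithmic schedule c/log(t) the temperature barely moves over a million iterations, while a square-root schedule c/√t reaches low temperatures quickly (c = 1 is an arbitrary illustrative constant):

```python
import math

c = 1.0  # illustrative constant; any c > 0 gives the same contrast
schedule = {}
for t in (10, 10**3, 10**6):
    schedule[t] = (c / math.log(t),    # logarithmic cooling
                   c / math.sqrt(t))   # square-root cooling

tau_log, tau_sqrt = schedule[10**6]
# After a million iterations the logarithmic schedule is still warm
# while the square-root schedule is orders of magnitude colder.
```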

The remainder of this article is organized as follows. Section 2 describes the SAA algorithm, Section 3 studies its convergence, and Section 4 discusses some of its implementation issues. Section 5 illustrates the performance of the SAA algorithm using two simple examples. Section 6 compares the performance of SAA, simulated annealing, and some other optimization algorithms on three benchmark optimization problems, namely, the rational-expectations model, feed-forward neural networks, and protein folding. Section 7 concludes the article with a brief discussion.

2. SIMULATED STOCHASTIC APPROXIMATION ANNEALING ALGORITHM

Suppose that we want to minimize an energy function U(x) by simulating from the Boltzmann distribution fτ∗(x) ∝ exp{−U(x)/τ∗} at a very low temperature τ∗ (close to 0). To accomplish this task, we propose the so-called SAA algorithm, which is described as follows.

Let E1, . . . , Em denote a partition of the sample space X based on the energy function:

E1 = {x : U(x) ≤ u1}, E2 = {x : u1 < U(x) ≤ u2}, . . . , Em−1 = {x : um−2 < U(x) ≤ um−1}, Em = {x : U(x) > um−1},  (1)

where u1 < u2 < · · · < um−1 are prespecified numbers. In general, m should be greater than 1. To motivate the SAA algorithm, we assume for the time being that none of the subregions is empty. In practice, some of the subregions may be empty due to an inappropriate choice of the cut points u1, u2, . . . , um−1; how to deal with empty subregions will be discussed in Section 4. Given a partition of the sample space, SAA seeks to draw samples from the distribution

fw∗,τ∗(x) ∝ ∑_{i=1}^m (πi e^{−U(x)/τ∗} / w∗^(i)) I(x ∈ Ei),  (2)

where w∗ = (w∗^(1), . . . , w∗^(m)), w∗^(i) = ∫_{Ei} e^{−U(x)/τ∗} dx, the πi's specify the desired sampling frequency for each of the subregions, and I(·) is the indicator function. The πi's satisfy the constraints πi > 0 for all i and ∑_{i=1}^m πi = 1. If w∗^(1), . . . , w∗^(m) were known, sampling from fw∗,τ∗(x) would lead to a "random walk" in the space of subregions (by regarding each subregion as a point), with each subregion Ei being sampled with a frequency proportional to πi. This ensures that the lowest energy subregion can be reached by SAA in a sufficiently long run, and thus samples can be drawn from the neighborhood of the global energy minima when τ∗ is close to 0.
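In code, the partition (1) amounts to mapping a sample's energy to a subregion index; a minimal helper (the cut points below are arbitrary illustrative values) might look like:

```python
import bisect

def subregion_index(u, cutpoints):
    """Return the 1-based index i of the subregion E_i containing a
    sample with energy u, for cut points u_1 < u_2 < ... < u_{m-1}.
    E_1 = {U <= u_1}, E_i = {u_{i-1} < U <= u_i}, E_m = {U > u_{m-1}}."""
    return bisect.bisect_left(cutpoints, u) + 1

# With cut points (u_1, u_2, u_3) = (1, 2, 3) there are m = 4 subregions
cuts = [1.0, 2.0, 3.0]
indices = [subregion_index(u, cuts) for u in (0.5, 1.0, 2.5, 10.0)]
```

`bisect_left` places an energy exactly equal to a cut point uk into Ek, matching the "≤" on the upper boundary of each subregion in (1).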

Since w∗^(1), . . . , w∗^(m) are generally unknown, SAA provides an adaptive mechanism to estimate their values. As aforementioned, SAA is a combination of the simulated annealing and SAMC algorithms. Like simulated annealing, SAA is run along a prespecified temperature sequence. Let {τt, t = 1, 2, . . .} denote such a temperature sequence, which is deterministic, positive, and nonincreasing with limit limt→∞ τt = τ∗. Like SAMC and other stochastic approximation algorithms (see, e.g., Benveniste, Metivier, and Priouret 1990; Chen 2002), SAA updates the estimate of w∗^(1), . . . , w∗^(m) iteratively, with the step size controlled by a gain factor sequence. Let {γt, t = 1, 2, . . .} denote the gain factor sequence, which is deterministic, positive, and nonincreasing with limit limt→∞ γt = 0. How to specify the sequences {τt} and {γt} will be detailed in Section 3 [see Condition (A3)].

Let θ∗^(i) = log(w∗^(i)/πi) for i = 1, . . . , m, let θ∗ = (θ∗^(1), . . . , θ∗^(m)), let θt denote the working estimator of θ∗ at iteration t, and let Θ denote the space of θt. Under appropriate conditions, we show that the sequence {θt} remains in a compact set (see Theorem 3.1). Let {Mk, k = 0, 1, . . .} be a sequence of positive numbers increasingly diverging to infinity, which work as truncation bounds of {θt}. Let σt be a counter for the number of truncations up to iteration t, with σ0 = 0. Let θ̄0 be a fixed point in Θ. Setting θ0 = θ̄0, SAA iterates as follows:

Algorithm 2.1 (SAA algorithm).

1. (Sampling) Simulate a sample Xt+1 with a single MH update, which starts with Xt and leaves the following distribution invariant:

fθt,τt+1(x) ∝ ∑_{i=1}^m exp{−U(x)/τt+1 − θt^(i)} I(x ∈ Ei),  (3)

where I(·) is the indicator function.

2. (θ-updating) Set

θt+1/2 = θt + γt+1 Hτt+1(θt, xt+1),  (4)

where Hτt+1(θt, xt+1) = et+1 − π, et+1 = (I(xt+1 ∈ E1), . . . , I(xt+1 ∈ Em)), and π = (π1, . . . , πm).

3. (Truncation) If ‖θt+1/2‖ ≤ Mσt, set θt+1 = θt+1/2; otherwise, set θt+1 = θ̄0 and σt+1 = σt + 1. Here ‖·‖ denotes the Euclidean norm, here and throughout the article.

In the SAA algorithm, the dependence of the function Hτ(θ, X) on θ and τ is implicit through the sample X. To make the algorithm comply with the traditional notation of stochastic approximation algorithms, θ and τ are still included in this function. Since the distribution (3) is invariant to the transformation θt ← θt + c for a constant vector c = (c, . . . , c), we may let θt undergo a θ-normalization step at the end of a run; that is, choose a vector c such that ∑_{i=1}^m exp(θn^(i) + c) = Z, where n is the total number of iterations of the run and Z is a prespecified positive constant (for example, 1, 100, or any other number preferred by the user), and then set θn ← θn + c.
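A minimal runnable sketch of Algorithm 2.1 on a one-dimensional toy energy is given below (the energy function, cut points, and all constants are illustrative assumptions, not the authors' settings; the truncation step is omitted because θ stays bounded on this toy problem):

```python
import math
import random

def saa(U, cutpoints, x0, n_iter=20000, tau_star=0.01,
        C1=2.0, C2=5.0, zeta=0.6, scale=1.0, seed=0):
    """Sketch of Algorithm 2.1: alternate sampling and theta-updating."""
    rng = random.Random(seed)
    m = len(cutpoints) + 1
    pi = [1.0 / m] * m                       # desired sampling frequencies
    theta = [0.0] * m
    x, best = x0, x0

    def region(u):                           # subregion index of energy u
        return sum(u > c for c in cutpoints)

    for t in range(1, n_iter + 1):
        gamma = C1 / t ** zeta               # gain factor, as in Eq. (18)
        tau = C2 / math.sqrt(t) + tau_star   # square-root cooling, Eq. (18)
        # Step 1 (sampling): one MH update leaving Eq. (3) invariant
        y = x + rng.gauss(0.0, scale)
        i, j = region(U(x)), region(U(y))
        log_r = -(U(y) - U(x)) / tau - (theta[j] - theta[i])
        if math.log(rng.random()) < log_r:
            x, i = y, j
        # Step 2 (theta-updating): theta <- theta + gamma * (e_{t+1} - pi)
        for k in range(m):
            theta[k] += gamma * ((1.0 if k == i else 0.0) - pi[k])
        if U(x) < U(best):
            best = x
    return best, theta

# Illustrative bimodal energy: local minimum near x = -2, global near x = 4
U = lambda x: -math.exp(-(x - 4.0) ** 2) - 0.5 * math.exp(-(x + 2.0) ** 2)
best, theta = saa(U, cutpoints=[-0.9, -0.6, -0.3], x0=-2.0)
```

Started in the basin of the shallower minimum, the chain escapes it because visits to a subregion raise its θ component and push the sampler elsewhere, while the square-root schedule sharpens the distribution around the low-energy subregions.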


A remarkable feature of the SAA algorithm is that it possesses a self-adjusting mechanism that operates based on the past samples. Mathematically, if a subregion Ei is visited at iteration t, θ^(i) will increase, θt+1^(i) ← θt^(i) + γt+1(1 − πi), so that this subregion has a decreased probability of being visited at iteration t + 1. On the other hand, for those subregions Ej (j ≠ i) that are not visited at iteration t, θ^(j) will decrease, θt+1^(j) ← θt^(j) − γt+1πj, so that each of these subregions has an increased probability of being visited at iteration t + 1. The change of θt leads to an update of the invariant distribution for the next iteration. The self-adjusting mechanism distinguishes the SAA algorithm from simulated annealing. For simulated annealing, the change of the invariant distribution is solely determined by the temperature ladder, while for SAA, the change of the invariant distribution is determined by both the temperature ladder and the past samples. The advantage of SAA will be much clearer in the next section, where it is shown that under a square-root cooling schedule, SAA can achieve convergence to the global energy minima similar to that of simulated annealing. See Section 3.2 for the details.

An existing algorithm related to SAA is space annealing SAMC, which was proposed by Liang (2007a) for improving the performance of SAMC in optimization. The space annealing algorithm works by shrinking the sample space with iterations and can be described as follows. Suppose that the sample space has been partitioned as in Equation (1) with u1, . . . , um−1 arranged in ascending order. Let κ(u) denote the index of the subregion that a sample x with energy u belongs to; for example, if x ∈ Ej, then κ(U(x)) = j. Let X^(t) denote the sample space at iteration t. Space annealing SAMC starts with X^(1) = ∪_{i=1}^m Ei, and then iteratively sets the sample space as

X^(t) = ∪_{i=1}^{κ(u^(t)_min + ℵ)} Ei,  (5)

where u^(t)_min is the minimum energy value obtained by iteration t, and ℵ is a user-specified parameter. A major shortcoming of this algorithm is that it tends to get trapped in local energy minima when ℵ is small and the proposal is relatively local.

Compared to space annealing SAMC, SAA also shrinks its sample space with iterations, but in a soft way: it gradually biases sampling toward the local energy minima of each subregion by lowering the temperature with iterations. This strategy of sample space shrinkage reduces the risk of getting trapped in local energy minima. Theoretically, SAA is also of more interest than space annealing SAMC. For space annealing SAMC, it is hard to establish theoretical results concerning convergence toward the global energy minima beyond the weak convergence of MCMC samples toward their stationary distribution. In contrast, a much stronger result can be obtained for SAA: from the perspective of practical applications, SAA achieves essentially the same convergence toward the global energy minima as simulated annealing. This will be detailed in Section 3.2.

3. CONVERGENCE OF THE SAA ALGORITHM

In this section, we first give a brief review of stochastic approximation algorithms and then prove the convergence of the SAA algorithm under appropriate conditions.

3.1 A Brief Review of Stochastic Approximation Algorithms

Robbins and Monro (1951) introduced the stochastic approximation algorithm for solving the integration equation

h(θ) = ∫_X H(θ, x) fθ(x) dx = 0,  (6)

where θ ∈ Θ ⊂ R^{dθ} is a parameter vector and fθ(x), x ∈ X ⊂ R^{dx}, is a density function dependent on θ. The stochastic approximation algorithm proceeds iteratively as follows:

Algorithm 3.1 (Stochastic approximation).

(a) Draw a sample Xt+1 ∼ fθt(x), where t indexes the iteration.

(b) Set θt+1 = θt + γt+1 H(θt, Xt+1).
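As a classical instance of Algorithm 3.1 (an illustrative toy, not from the article): with H(θ, x) = x − θ and X drawn from a distribution with mean μ, the mean-field equation h(θ) = μ − θ = 0 has the root θ∗ = μ, and the iteration with γt = 1/t reduces to a running average of the samples:

```python
import random

rng = random.Random(1)
mu = 3.0          # illustrative target mean, so the root is theta_* = 3
theta = 0.0
for t in range(1, 200001):
    x = rng.gauss(mu, 1.0)          # step (a): draw X_{t+1} ~ f(x)
    gamma = 1.0 / t                 # gain factors: sum diverges, squares sum
    theta += gamma * (x - theta)    # step (b): H(theta, x) = x - theta
# theta is now a consistent estimate of the root theta_* = mu
```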

During the past six decades, this algorithm has been widely studied and applied in various areas, including stochastic optimization and system control. Recently, the stochastic approximation algorithm has been used with Markov chain Monte Carlo (MCMC), with Step (a) replaced by an MCMC sampling step:

(a′) Draw a sample Xt+1 with a Markov transition kernel Pθt(Xt, ·), starting with Xt and admitting fθt(x) as the invariant distribution.

The resulting algorithm is called the stochastic approximation MCMC (SAMCMC) algorithm, which is also known in the literature as stochastic approximation with Markov state-dependent noise. The SAMCMC algorithm has been successfully applied to many problems of general interest, such as parameter estimation for incomplete data problems (Younes 1989; Gu and Kong 1998) and marginal density estimation (Liang 2007b). Many adaptive MCMC algorithms can also be formulated under the framework of SAMCMC, for example, the adaptive MH algorithms (Andrieu and Robert 2001; Haario, Saksman, and Tamminen 2001; Andrieu and Thoms 2008) and the Wang–Landau-type algorithms (Wang and Landau 2001; Liang, Liu, and Carroll 2007; Atchade and Liu 2010). For the adaptive MH algorithms, the proposal distribution is parameterized by θ, which may include the mean vector and covariance matrix, and the function H(θ, X) specifies how the proposal distribution is updated as a new sample X arrives. For the Wang–Landau-type algorithms, the invariant distribution is parameterized through sample space partitioning, with θ representing a function of the normalizing constant of each of the partitioned subregions. The SAA algorithm goes a step further than the Wang–Landau-type algorithms, in that the invariant distribution is parameterized based on both sample space partitioning and a prespecified temperature sequence. This casts the optimization problem in a combined framework of stochastic approximation and simulated annealing.

The convergence of the stochastic approximation algorithm is often studied by rewriting Step (b) as

θt+1 = θt + γt+1[h(θt) + ξt+1],  (7)

where h(θt) = ∫_X H(θt, x) fθt(x) dx is called the mean-field function, and ξt+1 = H(θt, Xt+1) − h(θt) is called the observation noise. To establish the convergence of the algorithm,


one often needs to impose restrictive conditions on the mean-field function and the observation noise. For example, one may impose an upper bound on the growth rate of the mean-field function and assume the observation noise to be mutually independent. To weaken these assumptions, Chen and Zhu (1986) proposed an expanding truncation stochastic approximation algorithm. The convergence of a variant of this algorithm was further studied in Andrieu, Moulines, and Priouret (2005) under verifiable conditions.

To prove the convergence of SAA, we adopt the technique developed in Chen and Zhu (1986). This is detailed in the next subsection.

3.2 Convergence of the SAA Algorithm

The SAA algorithm can be formulated as a SAMCMC algorithm with the goal of solving the integration equation

hτ∗(θ) = ∫ Hτ∗(θ, x) fθ,τ∗(x) dx = 0,  (8)

where fθ,τ∗(x) denotes a density function dependent on θ and the limiting temperature τ∗. Instead of directly solving this equation using conventional stochastic approximation algorithms, SAA works by solving a system of equations defined along the temperature sequence {τt}:

hτt(θ) = ∫ Hτt(θ, x) fθ,τt(x) dx = 0, t = 1, 2, . . . ,  (9)

where fθ,τt(x) is a density function dependent on θ and the temperature τt. In solving this system of equations, the solution to the equation at temperature τt works as an initial guess of the solution to the equation at temperature τt+1. Intuitively, if the temperature sequence {τt} does not decrease too fast, SAA should perform similarly to the original stochastic approximation algorithm; that is, the convergence θt → θ∗ still holds under appropriate conditions, where θ∗ denotes a solution to the target Equation (8).

To facilitate the theoretical analysis of this algorithm, we define the mean-field function

hτt+1(θt) = ∫_X Hτt+1(θt, x) fθt,τt+1(x) dx,

and the observation noise

ξt+1 = Hτt+1(θt, Xt+1) − hτt+1(θt),

in the standard way of stochastic approximation. Let T denote the space of the temperature τ. For mathematical simplicity, we treat τ as a continuous variable and assume that T is compact; to be precise, we set T = [τ∗, τ1]. For the parameter θ, we assume the space Θ = R^m, where m is the number of subregions.

For a general SAMCMC algorithm, convergence can usually be established under appropriate conditions on the mean-field function, observation noise, and gain factor sequence; see, for example, Benveniste, Metivier, and Priouret (1990), Chen (2002), and Andrieu, Moulines, and Priouret (2005). For SAA, additional conditions on the temperature sequence are needed. These conditions are studied in order as follows.

3.2.1 Conditions on Mean-Field Function. A straightforward calculation shows that the mean-field function of SAA is given by

hτ(θ) = ∫ Hτ(θ, x) fθ,τ(x) dx = (Sτ^(1)(θ)/Sτ(θ) − π1, . . . , Sτ^(m)(θ)/Sτ(θ) − πm),  (10)

for any fixed value of θ ∈ Θ and τ ∈ T, where Sτ^(i)(θ) = ∫_{Ei} e^{−U(x)/τ} dx / e^{θ^(i)} and Sτ(θ) = ∑_{i=1}^m Sτ^(i)(θ). It is easy to see that hτ(θ) is bounded and continuously differentiable with respect to both θ and τ on the space Θ × T.
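For a small discrete toy problem, the mean-field function (10) can be evaluated exactly, which makes its root easy to check: hτ(θ) vanishes when exp(θ^(i)) ∝ w^(i)/πi, with w^(i) the Boltzmann weight of subregion Ei (the energies, cut point, and temperature below are arbitrary illustrative values):

```python
import math

# Toy state space {0,...,9} with energies 0.0, 0.1, ..., 0.9, split into
# m = 2 subregions by the cut point u_1 = 0.45.
energies = [0.1 * k for k in range(10)]
regions = [0 if u <= 0.45 else 1 for u in energies]
tau, pi = 0.5, [0.5, 0.5]

def h(theta):
    """Mean-field function (10): expected visit frequencies minus pi."""
    s = [0.0, 0.0]                               # S_tau^{(i)}(theta)
    for u, i in zip(energies, regions):
        s[i] += math.exp(-u / tau - theta[i])
    total = s[0] + s[1]                          # S_tau(theta)
    return [s[0] / total - pi[0], s[1] / total - pi[1]]

# The root: theta_*^{(i)} = log(w^{(i)}/pi_i), w^{(i)} = sum_{E_i} e^{-u/tau}
w = [0.0, 0.0]
for u, i in zip(energies, regions):
    w[i] += math.exp(-u / tau)
theta_star = [math.log(w[i] / pi[i]) for i in range(2)]
residual = h(theta_star)    # both components vanish (up to rounding)
```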

Further, we define

vτ(θ) = (1/2) ∑_{i=1}^m (Sτ^(i)(θ)/Sτ(θ) − πi)²,  (11)

which is the so-called Lyapunov function in the literature of stochastic approximation. It is easy to see that vτ(θ) is nonnegative, upper bounded, and continuously differentiable with respect to both θ and τ. Let ∇θ and ∇τ denote the gradient operators of vτ(θ) with respect to θ and τ, respectively. It follows from the derivation of Liang, Liu, and Carroll (2007, p. 318) that ∇θvτ(θ) is bounded over Θ × T. Since T is a compact set, ∇τvτ(θ) is also bounded over Θ × T, provided that U(x) has finite mean with respect to fτ(x). In addition, the second partial derivatives of vτ(θ) (with respect to θ and τ) are also bounded over Θ × T, provided that U(x) has finite variance with respect to fτ(x). In summary, SAA satisfies the following conditions, which, as in the literature, are called stability conditions.

(A1) (Stability conditions)

(i) The function hτ(θ) is bounded and continuously differentiable with respect to both θ and τ, and there exists a nonnegative, upper bounded, and continuously differentiable function vτ(θ) such that for any Δ > δ > 0,

sup_{δ ≤ d((θ,τ),L) ≤ Δ} ∇θᵀvτ(θ) hτ(θ) < 0,  (12)

where L = {(θ, τ) : hτ(θ) = 0, θ ∈ Θ, τ ∈ T} is the zero set of hτ(θ) and d(z, S) = inf_y{‖z − y‖ : y ∈ S}. Further, the set v(L) = {vτ(θ) : (θ, τ) ∈ L} is nowhere dense.

(ii) Both ∇θvτ(θ) and ∇τvτ(θ) are bounded over Θ × T. In addition, for any compact set K ⊂ Θ, there exists a constant 0 < c < ∞ such that

sup_{(θ,θ′)∈K×K, τ∈T} ‖∇θvτ(θ) − ∇θvτ(θ′)‖ ≤ c‖θ − θ′‖,
sup_{θ∈K, (τ,τ′)∈T×T} ‖∇θvτ(θ) − ∇θvτ′(θ)‖ ≤ c|τ − τ′|,
sup_{θ∈K, (τ,τ′)∈T×T} ‖hτ(θ) − hτ′(θ)‖ ≤ c|τ − τ′|.  (13)

Condition (12) can be verified as in Liang, Liu, and Carroll (2007). Note that for general SAMCMC algorithms, the stability conditions can be much relaxed: for example, the mean-field function is only required to be locally bounded, the Lyapunov function can be unbounded, and (A1)-(ii) can be removed. Due to the strong stability conditions, the proof of the convergence of SAA can be simplified compared to other SAMCMC algorithms. We also note that to make SAA satisfy (12), the sample


space partition should guarantee that there are at least two nonempty subregions. Otherwise, vτ(θ) will be reduced to a constant and thus ∇θᵀvτ(θ) hτ(θ) ≡ 0 for all θ.

3.2.2 Conditions on Observation Noise. In the literature, some authors choose to impose conditions directly on the observation noise; see, for example, Kushner and Clark (1978), Kulkarni and Horn (1995), and Chen (2002). These conditions are usually very weak but difficult to verify. An alternative is to impose conditions on the Markov transition kernel, which lead to the required conditions on the observation noise. For example, Andrieu, Moulines, and Priouret (2005) assumed that the Markov transition kernel satisfies drift and minorization conditions such that the resulting Markov chain is V-uniformly ergodic (see, e.g., Meyn and Tweedie 2009), where the function V : X → [1, ∞) is called the drift function. In this article, we assume that for any θ ∈ Θ and τ ∈ T, the Markov transition kernel Pθ,τ satisfies the Doeblin condition, which is equivalent to assuming that the resulting Markov chain is uniformly ergodic (Nummelin 1984, Theorem 6.15).

(A2) (Doeblin condition). For any given θ ∈ Θ and τ ∈ T, the Markov transition kernel Pθ,τ is irreducible and aperiodic. In addition, there exist an integer l, 0 < δ < 1, and a probability measure ν such that for any compact subset K ⊂ Θ,

inf_{θ∈K, τ∈T} P^l_{θ,τ}(x, A) ≥ δν(A), ∀x ∈ X, ∀A ∈ B_X,

where B_X denotes the Borel set of X; that is, the whole support X is a small set for each kernel Pθ,τ, θ ∈ K and τ ∈ T.

Uniform ergodicity is slightly stronger than V-uniform ergodicity, but it serves just right for SAA, for which the function Hτ(θ, X) is bounded, and thus the mean-field function and observation noise are bounded. Note that if the drift function V(x) ≡ 1, then V-uniform ergodicity reduces to uniform ergodicity. To verify (A2), one may assume that X is compact, U(x) is bounded on X, and the proposal distribution q(x, y) satisfies the local positive condition:

(Q) There exist δq > 0 and εq > 0 such that, for every x ∈ X, |x − y| ≤ δq ⇒ q(x, y) ≥ εq.

Then condition (A2) holds, following Roberts and Tweedie (1996, Theorem 2.2), where it is shown that if the target distribution is bounded away from 0 and ∞ on every compact set of its support X, then the MH chain with a proposal satisfying (Q) is irreducible and aperiodic, and every nonempty compact set is a small set. Proposals satisfying the local positive condition can also be easily designed for both continuous and discrete systems. For continuous systems, q(x, y) can be set to a random-walk Gaussian proposal, y ∼ N(x, σ²I_{dx}), where σ² can be calibrated to yield a desired acceptance rate, for example, 0.2–0.4. For discrete systems, q(x, y) can be set to a discrete distribution defined on a neighborhood of x. Besides the single-step MH move, the multiple-step MH move, the Gibbs sampler, and the Metropolis-within-Gibbs sampler can also be shown to satisfy condition (A2) under appropriate conditions; see, for example, Rosenthal (1995, Lemma 7) and Liang (2009b) for the proofs. Note that to satisfy (A2), X is not necessarily compact; Rosenthal (1995) gave one example for which the sample space is unbounded, yet the Markov chain is uniformly ergodic.

3.2.3 Conditions on Gain Factor and Temperature Sequences. For SAA, we consider the following conditions for the gain factor sequence {γt} and the temperature sequence {τt}:

(A3) (Conditions on {γt} and {τt})

(i) The sequence {γt} is positive, nonincreasing, and satisfies the following conditions:

  ∑_{t=1}^∞ γt = ∞,   (γ_{t+1} − γt)/γt = O(γ_{t+1}^ι),   ∑_{t=1}^∞ γt^{(1+ι′)/2}/√t < ∞,   (14)

for some ι ∈ [1, 2) and ι′ ∈ (0, 1).

(ii) The sequence {τt} is positive and nonincreasing and satisfies the following conditions:

  lim_{t→∞} τt = τ∗,   τt − τ_{t+1} = o(γt),   ∑_{t=1}^∞ γt |τt − τ_{t−1}|^{ι″} < ∞,   (15)

for some ι″ ∈ (0, 1), and

  ∑_{t=1}^∞ γt |τt − τ∗| < ∞.   (16)

As shown in Chen (2002, p. 134), the condition ∑_{t=1}^∞ γt^{(1+ι′)/2}/√t < ∞ implies

  ∑_{t=1}^∞ γt^{1+ι′} < ∞,   (17)

which is often assumed in studying the convergence of stochastic approximation algorithms. Condition (15) implies that {τt} cannot decrease too fast, and it should be set according to the gain factor sequence {γt}. Condition (16) also rules out settings in which {τt} converges to a point with a big gap from τ∗. For the sequences {γt} and {τt}, one can typically set

  γt = C1/t^ς,   τt = C2/√t + τ∗,   (18)

for some constants C1 > 0, C2 > 0, and ς ∈ (0.5, 1]. Then it is easy to verify that Equation (18) satisfies (A3).
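As a quick numerical illustration (a sketch, not part of the original analysis), the sequences in Equation (18) can be generated and one of the key requirements of (A3) checked directly, namely that the temperature drop τt − τ_{t+1} is o(γt). The constants C1, C2, and ς below are arbitrary choices within the allowed ranges.

```python
import math

# Hypothetical constants for illustration; Equation (18) only requires
# C1, C2 > 0 and varsigma in (0.5, 1].
C1, C2, tau_star, varsigma = 1.0, 1.0, 0.01, 0.75

def gamma(t):
    """Gain factor sequence of Equation (18)."""
    return C1 / t**varsigma

def tau(t):
    """Square-root cooling schedule of Equation (18)."""
    return C2 / math.sqrt(t) + tau_star

# Check tau_t - tau_{t+1} = o(gamma_t): the ratio should vanish as t grows.
ratios = [(tau(t) - tau(t + 1)) / gamma(t) for t in (10, 1000, 100000)]
```

Since τt − τ_{t+1} ≈ C2/(2t^{3/2}) while γt = C1/t^ς with ς ≤ 1, the ratio behaves like t^{ς−3/2} and decays to zero.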

Under the above conditions, we have the following theorems concerning the convergence of {θt}. Theorem 3.1 shows that {θt} remains in a compact subset of Θ.

Theorem 3.1. Assume that T is compact and conditions (A1)–(A3) hold. If θ0 used in the SAA algorithm is such that sup_{τ∈T} v_τ(θ0) < inf_{‖θ‖=c0, τ∈T} v_τ(θ) for some c0 > 0 and ‖θ0‖ < c0, then the number of truncations in SAA is almost surely finite; that is, {θt} remains in a compact subset of Θ almost surely.

The proof of Theorem 3.1 follows the proof of Theorem 2.2.1 of Chen (2002), but with some modifications related to the observation noise, mean-field function, and Lyapunov function.


852 Journal of the American Statistical Association, June 2014

The details of the proof can be found in the supplementary material (available online).

Theorem 3.2. Assume the conditions of Theorem 3.1 hold. Then, as t → ∞,

  d(θt, L_{τ∗}) → 0, a.s.,

where L_{τ∗} = {θ ∈ Θ : h_{τ∗}(θ) = 0} and d(z, S) = inf_y{‖z − y‖ : y ∈ S}.

The proof of Theorem 3.2 is largely a reproduction of the proof of Theorem 5.5 of Andrieu, Moulines, and Priouret (2005), but with some modifications to accommodate the temperature sequence {τt}. The details of the proof can be found in the supplementary material (available online). As one can see from the proofs of these two theorems, the expanding truncation weakens the condition on the Markov transition kernel for SAA. Without this technique, a more restrictive condition may need to be assumed; for example, one may assume that Θ is compact or that the Doeblin condition holds uniformly over the space Θ × T. The former is usually less acceptable, and the latter is usually difficult to verify.

If we solve the equation h_{τ∗}(θ) = 0 directly, where h_{τ∗}(θ) is given in Equation (10), then any solution θ∗ = (θ∗^(1), . . . , θ∗^(m)) ∈ L_{τ∗} can be expressed in the form

  θ∗^(i) = C + log( ∫_{Ei} exp(−U(x)/τ∗) dx ) − log(πi),   i = 1, . . . , m,

where C represents an arbitrary constant. Since the probability density/mass function f_{θ,τ}(x) is invariant to the transformation θ ← θ + c for a constant vector c, all the solutions in L_{τ∗} lead to the same limiting density function f_{θ∗,τ∗}(x). Since f_{θ,τ}(x) is a continuous function of θ and τ, Theorem 3.2 implies that as t → ∞,

  f_{θt,τ_{t+1}}(x) → f_{θ∗,τ∗}(x), a.s.,   (19)

that is, the convergence established in Theorem 3.2 is equivalent to a pointwise convergence in terms of the density function f_{θ∗,τ∗}(x). However, for stochastic optimization problems, the convergence (19) is not enough. Moreover, since SAA falls into the class of adaptive MCMC algorithms, it is unclear whether X_{t+1} ∼ f_{θt,τ_{t+1}}(x) holds. To address this issue, we establish the following strong law of large numbers.

Theorem 3.3. (SLLN) Assume the conditions of Theorem 3.1 hold. Let x1, . . . , xn denote a set of samples simulated by SAA in n iterations. Let g: X → R be a measurable function that is bounded and integrable with respect to f_{θ,τ}(x). Then,

  (1/n) ∑_{k=1}^n g(xk) → ∫_X g(x) f_{θ∗,τ∗}(x) dx, a.s.

The proof of this theorem can be found in the supplementary material (available online). Let u∗_i = min_{x∈Ei} U(x) denote the minimum of U(x) on the subregion Ei. Then u∗_1 corresponds to the global minimum value of U(x) over X, provided that E1 is nonempty. Let J(xk) denote the index of the subregion to which the sample xk belongs; that is, J(xk) = i if xk ∈ Ei. Then we have the following corollary.

Corollary 3.1. Assume the conditions of Theorem 3.1 hold. Let x1, . . . , xt denote a set of samples simulated by SAA in t iterations. Then, for any ε > 0, as t → ∞,

  [ 1 / ∑_{k=1}^t I(J(xk) = i) ] ∑_{k=1}^t I(U(xk) ≤ u∗_i + ε & J(xk) = i)
    → ∫_{{x: U(x) ≤ u∗_i + ε} ∩ Ei} e^{−U(x)/τ∗} dx / ∫_{Ei} e^{−U(x)/τ∗} dx, a.s.,   (20)

for i = 1, . . . , m, where I(·) denotes an indicator function.

Note that the left side of Equation (20) represents an estimator of the conditional probability P(U(Xt) ≤ u∗_i + ε | J(Xt) = i), where Xt denotes the sample drawn by SAA at iteration t. Denote the left side of Equation (20) by P̂(U(Xt) ≤ u∗_i + ε | J(Xt) = i). The right side of Equation (20) represents the conditional probability P(U(X) ≤ u∗_i + ε | J(X) = i), where X denotes a sample drawn by SAA at θt ≡ θ∗ and τt ≡ τ∗. With a little abuse of notation, we denote the right side of Equation (20) by P_{θ∗,τ∗}(U(X) ≤ u∗_i + ε | J(X) = i). Therefore, by (A3) and Corollary 3.1, we have the following result: As t → ∞, τt → τ∗ and

  P̂(U(Xt) ≤ u∗_i + ε | J(Xt) = i) → P_{θ∗,τ∗}(U(X) ≤ u∗_i + ε | J(X) = i), a.s.,   (21)

for i = 1, . . . , m. Moreover, if τ∗ goes to 0, then

  P_{θ∗,τ∗}(U(X) ≤ u∗_i + ε | J(X) = i) → 1,   i = 1, . . . , m.   (22)

This is a quite standard result in the literature of simulated annealing. For the case that U(x) is discrete, Equation (22) can be proved by following the proof of Theorem 1 of Villalobos-Arias, Coello Coello, and Hernandez-Lerma (2006); for the case that U(x) is continuous, Equation (22) can be proved by following the proof of Theorem 2.3 of Dekkers and Aarts (1991) under the assumptions that the number of local minima of U(x) is finite and U(x) is uniformly continuous on X. Note that the uniform continuity condition can be simply satisfied by restricting X to a compact set, as is often done in optimization.

On the other hand, by Equation (3), P(U(Xt) ≤ u∗_i + ε | J(Xt) = i) is a continuous function of θ_{t−1} and τt. Therefore, it follows from Theorem 3.2 and (A3) that as t → ∞,

  P(U(Xt) ≤ u∗_i + ε | J(Xt) = i) → P_{θ∗,τ∗}(U(X) ≤ u∗_i + ε | J(X) = i), a.s.,   (23)

for i = 1, 2, . . . , m. Combining Equations (21) and (23) yields that as t → ∞,

  P̂(U(Xt) ≤ u∗_i + ε | J(Xt) = i) − P(U(Xt) ≤ u∗_i + ε | J(Xt) = i) → 0, a.s.;

that is, P̂(U(Xt) ≤ u∗_i + ε | J(Xt) = i) is a strongly consistent estimator of P(U(Xt) ≤ u∗_i + ε | J(Xt) = i).

This analysis implies that, as the number of iterations becomes large, SAA is able to locate the minima of each subregion simultaneously in a single run if τ∗ is small enough. In practice, the sampling in SAA can be biased to low-energy subregions by choosing an appropriate desired sampling distribution. Corollary 3.1 also distinguishes the SAA algorithm from simulated annealing. For simulated annealing, as shown in Haario and Saksman (1991), it can achieve the following convergence with a logarithmic cooling schedule: For any ε > 0,

  P(U(Xt) ≤ u∗_1 + ε) → 1, a.s.,   (24)

as t → ∞. This represents a much stronger convergence result than Equation (22). As a trade-off, SAA can work with a cooling schedule in which the temperature decreases much faster than in the logarithmic cooling schedule, such as the square-root cooling schedule. However, from the perspective of practical applications, Equations (22) and (24) are almost equivalent: Both allow one to identify a sequence of samples that converges to the global energy minima of U(x) in probability. In practice, SAA can often work better than simulated annealing, because SAA possesses a self-adjusting mechanism, as explained in Section 3.1, which makes it immune to local traps.

4. IMPLEMENTATION ISSUES OF THE SAA ALGORITHM

Suppose that the sample space has been partitioned into m subregions E1, . . . , Em as in Equation (1). In proving the convergence of SAA, we have assumed that none of the subregions is empty, for reasons of mathematical simplicity. In practice, there are many ways to make this assumption hold, for example, applying the adaptive partitioning method proposed by Troster and Dellago (2005). In what follows, we describe a modified SAA algorithm for which only nonempty subregions are considered in determining whether to undergo a truncation step. It is therefore equivalent to working on a partition without empty subregions.

Let St denote the collection of indices of the subregions from which a sample has been proposed by iteration t; that is, St contains the indices of all subregions that are known to be nonempty by iteration t. Let θt^{St} denote the subvector of θt corresponding to the elements of St, and let Θ^{St} denote the subspace of Θ corresponding to the elements of St. One iteration of the modified SAA algorithm is as follows.

Algorithm 4.1 (SAA algorithm with empty subregions).

1. (Proposal) Generate a candidate sample Y according to the proposal distribution. If Y ∈ X and J(Y) ∉ St, set S_{t+1} ← St ∪ {J(Y)}. Otherwise, set S_{t+1} ← St.

2. (MH-update) Perform a single-step MH move with the candidate sample Y and the target distribution as defined in Equation (3).

3. (θ-update) Update θt to θ_{t+1/2} as in Equation (4) for all subregions E1, . . . , Em.

4. (Truncation) If ‖θ_{t+1/2}^{St}‖ ≤ M_{σt}, set θ_{t+1} = θ_{t+1/2}; otherwise, set θ_{t+1} = θ0 and σ_{t+1} = σt + 1.
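One iteration of Algorithm 4.1 can be sketched in code. Since Equations (3) and (4) are not reproduced in this excerpt, the sketch assumes the SAMC-type working density f_{θ,τ}(x) ∝ exp(−U(x)/τ − θ^{(J(x))}) in the MH ratio and the update θ^{(i)} ← θ^{(i)} + γ(I(x ∈ Ei) − πi); treat both as illustrative readings rather than verbatim transcriptions of the paper.

```python
import math, random

def saa_iteration(x, theta, S, sigma_idx, U, J, propose, q_ratio,
                  pi, gamma, tau, M, theta0):
    """One sketched iteration of Algorithm 4.1 (see the hedges above).

    S plays the role of S_t, sigma_idx indexes the truncation bounds {M_sigma_t},
    and J maps a sample to its 0-based subregion index.
    """
    m = len(theta)
    # 1. (Proposal) Draw a candidate; record its subregion as nonempty.
    y = propose(x)
    S = S | {J(y)}
    # 2. (MH-update) Accept/reject under the assumed working density.
    log_r = (-(U(y) - U(x)) / tau + theta[J(x)] - theta[J(y)]
             + math.log(q_ratio(x, y)))       # q(y,x)/q(x,y); 1 if symmetric
    if random.random() < math.exp(min(0.0, log_r)):
        x = y
    # 3. (theta-update) Stochastic approximation step for all subregions.
    theta_half = [theta[i] + gamma * ((1.0 if J(x) == i else 0.0) - pi[i])
                  for i in range(m)]
    # 4. (Truncation) Restart from theta0 if the visited components blew up.
    norm = math.sqrt(sum(theta_half[i] ** 2 for i in S))
    if norm <= M[sigma_idx]:
        theta = theta_half
    else:
        theta, sigma_idx = list(theta0), sigma_idx + 1
    return x, theta, S, sigma_idx
```

In a real run, gamma and tau would be the t-th entries of the gain factor and temperature sequences described in the recommendations below.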

Since Equation (3) is invariant to the transformation θt ← θt + c for a constant vector c, the desired sampling distribution actually applied in the modified SAA algorithm is πi + πe for i ∈ S∞, where πe = ∑_{j∉S∞} πj / |S∞|, S∞ denotes the limiting set of St, and |S∞| is the cardinality of S∞. Given this observation, a general expression for θ∗ is given by

  θ∗^(i) = { C + log( ∫_{Ei} exp(−U(x)/τ∗) dx ) − log(πi + πe),  if Ei ≠ ∅,
           { −∞,                                                if Ei = ∅,

for i = 1, . . . , m, where θ∗^(i) denotes the ith component of θ∗, and C is a constant. The value of C can be determined through a θ-normalization step as described in Section 2.

We note that in the θ-update step of Algorithm 4.1, θt can also be updated only for the components corresponding to the elements of St; that is,

3′. (θ-update) For all i ∈ St, set θ_{t+1/2}^(i) = θt^(i) + γ_{t+1}(I(x_{t+1} ∈ Ei) − πi).

For this version of the algorithm, a general expression for θ∗ is

  θ∗^(i) = { C + log( ∫_{Ei} exp(−U(x)/τ∗) dx ) − log(πi + πe),  if Ei ≠ ∅,
           { θ0^(i),                                             if Ei = ∅,

where θ0^(i) denotes the ith component of the starting point θ0. Compared to Algorithm 4.1, this modification of the θ-update step ensures that {θt} remains in a compact set, although this is not necessary for the components corresponding to the empty subregions.

In addition to empty subregions, several other issues also need to be taken care of for an effective implementation of SAA, including the choices of the sample space partition, the proposal distribution, the gain factor sequence, the temperature sequence, the desired sampling distribution, and the sequence of truncation bounds {M_{σt}}. Regarding these choices, we make the following recommendations:

• Sample space partitioning. For an effective implementation of SAA, the sample space should be partitioned appropriately. First of all, to ensure the convergence of SAA, the partition should guarantee that there are at least two nonempty subregions. The parameters u1 and u_{m−1} can be chosen by trial and error. In simulations, we usually choose u1 so small that E1 is an empty set, and choose u_{m−1} so large that SAA can quickly move out of Em to other subregions. Note that the simulation usually starts with a point in Em, and may get trapped in Em for a long time if u_{m−1} is too small.

In general, increasing the number of subregions increases the self-adjusting ability and thus reduces the risk of getting trapped in local energy minima. On the other hand, increasing the number of subregions increases the volume of the effective sample space and, as a result, more CPU time is needed to reach the global energy minima. In practice, how to partition the sample space may depend on one's budget of CPU time. If a long run of SAA is affordable, one may choose a fine partition; otherwise, a coarse partition can be chosen. With a good partition, the sampler should be able to traverse freely between different subregions.

• Proposal distribution. To facilitate the transition between different local energy minima, we recommend a proposal for which the step size increases with the energy level. This allows SAA to move through high-energy regions quickly and explore low-energy regions in detail. This proposal has been used in the protein folding example of this article. For the other examples, for the sake of comparison with existing literature results, constant step-size proposal distributions are still used.

• Gain factor sequence. In this article, the gain factor sequence is set in the form

  γt = ( T0 / max(t, T0) )^ς,   t = 1, 2, . . .,   (25)

for some T0 > 0 and ς ∈ (0.5, 1]. The gain factor sequence largely controls the self-adjusting ability of SAA: the bigger the gain factor, the stronger the self-adjusting ability. To reduce the risk of getting trapped in local energy minima, SAA prefers a large gain factor sequence; that is, a large value of T0 and a small value of ς can be chosen in Equation (25).

• Temperature sequence. In this article, the temperature sequence is set in the form

  τt = τh √( T0′ / max(t, T0′) ) + τ∗,   t = 1, 2, . . .,   (26)

for some τh > 0, T0′ > 0, and τ∗ > 0. Here, τh is called the highest temperature. Like simulated annealing, SAA prefers a high value of τh: a high value of τh flattens the energy landscape of the system at the early stage of the simulation, and thus facilitates the transition between different local energy minima. As for simulated annealing, we suggest setting τh such that almost all within-subregion moves can be accepted. Unlike τh, τ∗ seems not very important for the performance of SAA.

• Desired sampling distribution. We recommend setting the desired sampling distribution in the form

  πi = e^{−ζ(i−1)} / ∑_{k=1}^m e^{−ζ(k−1)},   i = 1, 2, . . . , m,   (27)

which is governed by a tunable parameter ζ ≥ 0. A large value of ζ biases the sampling toward the low-energy subregions, while ζ = 0 gives the uniform desired sampling distribution. To facilitate the transition between different local energy minima, a small value of ζ, for example, 0.05 or 0.1, is often used. This setting forces SAA to visit high-energy regions frequently and thus avoid being trapped in a local energy minimum.

• Truncation bounds {M_{σt}}. As discussed previously, the expanding truncation is of interest mainly in theory. In practice, to avoid frequent truncations, one may set M_{σt} to very large values for all t. For example, we set M_{k+1} = 10^10 M_k with M0 = 10^200 in this article. Under this setting, no truncations occurred in the simulations of our examples.
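The three recommended settings above are easy to generate programmatically. The following sketch builds the gain factor sequence (25), the temperature sequence (26), and the desired sampling distribution (27); the default constants are those used in the 10-state example of Section 5.1 and are otherwise arbitrary.

```python
import math

def gain_factor(t, T0=1000, varsigma=1.0):
    """Equation (25): stays at 1 until T0, then decays like t**(-varsigma)."""
    return (T0 / max(t, T0)) ** varsigma

def temperature(t, tau_h=1.0, T0p=10, tau_star=0.01):
    """Equation (26): square-root cooling from tau_h + tau_star down to tau_star."""
    return tau_h * math.sqrt(T0p / max(t, T0p)) + tau_star

def desired_distribution(m, zeta=0.05):
    """Equation (27): geometric weights biased toward low-energy subregions."""
    w = [math.exp(-zeta * (i - 1)) for i in range(1, m + 1)]
    s = sum(w)
    return [wi / s for wi in w]

pi = desired_distribution(5, zeta=0.0)   # zeta = 0 gives the uniform distribution
```

With the Section 5.1 settings (τh = 1, T0′ = 10, τ∗ = 0.01), `temperature(5 * 10**7)` reproduces the end temperature 0.0104472 reported there.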

5. TWO ILLUSTRATIVE EXAMPLES

This section supplies two examples: one illustrates the convergence of {θt}, and the other illustrates the performance of the SAA algorithm in minimizing a function with multiple local minima.

Table 1. The unnormalized mass function of the 10-state distribution

x      1    2    3    4    5    6    7    8    9    10
P(x)   5   100   40   1   125   75   1   150   50   20

5.1 A 10-State Distribution

In this example, we illustrate the convergence of the SAA algorithm. The distribution of this example consists of 10 states with the unnormalized mass function P(x) = exp(−U(x)) given in Table 1, where U(x) denotes the energy function to be minimized, as in Section 3.

The sample space X = {1, 2, . . . , 10} was partitioned according to the mass function into five subregions: E1 = {8}, E2 = {2, 5}, E3 = {6, 9}, E4 = {3}, and E5 = {1, 4, 7, 10}. The proposal used in the MH step is a stochastic matrix, each row of which was generated independently from the Dirichlet distribution Dir(1, . . . , 1). The gain factor sequence was chosen in the form (25) with T0 = 1000 and ς = 1.0. The temperature sequence was chosen in the form (26) with τh = 1, T0′ = 10, and τ∗ = 0.01. The desired sampling distribution is uniform; that is, ζ = 0 in Equation (27). We employed this setting of the desired sampling distribution because our goal here is to check the validity of SAA rather than to optimize.

SAA was run five times independently, and each run consisted of n = 5 × 10^7 iterations; therefore, the end temperature of each run is τn = 0.0104472. Table 2 examines the convergence of θt, where θt has been normalized such that Z = ∑_{i=1}^5 exp(θt^(i)) = ∑_{x∈X} P(x) = 567 [log(Z) = 6.3404]. We chose this normalization for convenience, as P(x) is given in Table 1. This example shows the validity of the SAA algorithm: As τt → τ∗, it produces correct estimates of θ∗; meanwhile, the relative sampling frequency of each subregion converges to the desired sampling frequency. It is interesting to point out that for τt > τ∗, the sampling tends to be biased toward the low-energy subregions. This is because the normalizing constants w1, . . . , wm of the different subregions change in an unbalanced way as the temperature decreases. However, this bias is preferred from the perspective of minimization.
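The "log(w)" row of Table 2 can be reproduced directly from Table 1: at the end temperature τn, w_i = ∑_{x∈Ei} P(x)^{1/τn} since P(x) = exp(−U(x)), and the normalization adds the constant that makes ∑_i exp(θ^(i)) = 567. A short sketch, working in the log domain since the unnormalized w_i overflow double precision:

```python
import math

tau_n = 0.0104472                      # end temperature of each run
P = {1: 5, 2: 100, 3: 40, 4: 1, 5: 125, 6: 75, 7: 1, 8: 150, 9: 50, 10: 20}
parts = [[8], [2, 5], [6, 9], [3], [1, 4, 7, 10]]   # E1, ..., E5

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

# log w_i = log sum_{x in E_i} exp(-U(x)/tau_n), with P(x) = exp(-U(x))
log_w = [logsumexp([math.log(P[x]) / tau_n for x in E]) for E in parts]

# Normalize so that sum_i exp(theta_i) = 567, as in Table 2
c = math.log(567) - logsumexp(log_w)
theta = [lw + c for lw in log_w]
# theta matches the log(w) row of Table 2 to about four decimals
```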

Figure 1 shows the sampling path of SAA obtained at the end of a run, where the simulation has been thinned by a factor of 10,000. It indicates that as the temperature decreases, SAA can still move freely between different subregions, but the sampling focuses only on the high-density states of each subregion. For this example, the high-density states are 8, 5, 6, 3, and 10 for the subregions E1, . . . , E5, respectively. This feature implies that SAA can be effective for optimization: It can locate the minima of each subregion simultaneously while being immune to the local-trap problem.

5.2 A Function With Multiple Local Minima

Consider minimizing the function

  U(x) = −{x1 sin(20x2) + x2 sin(20x1)}² cosh{sin(10x1)x1} − {x1 cos(10x2) − x2 sin(10x1)}² cosh{cos(20x2)x2},

where x = (x1, x2) ∈ [−1.1, 1.1]². This example is modified from Example 5.3 of Robert and Casella (2004). Figure 2(a) shows that U(x) has a multitude of local energy minima separated by high-energy barriers. The global minimum energy value is −8.12465, which is located at (−1.0445, −1.0084) and (1.0445, −1.0084).

Table 2. Convergence of θt for the 10-state distribution

Subregion     E1        E2          E3          E4           E5
log(w)      6.3404   −11.1113    −60.0072    −120.1772    −186.5248
θ̄n          6.3404   −11.1116    −60.0009    −120.1687    −186.5044
S.D.          0      6.26 × 10^-3  2.28 × 10^-3  6.01 × 10^-3   8.16 × 10^-3
Freq       20.29%    20.23%      20.05%      19.84%       19.6%

NOTE: log(wi) is the true value of θn calculated at the end temperature 0.0104472, θ̄n is the average of θn over five independent runs, S.D. is the standard deviation of θ̄n, and Freq is the averaged relative sampling frequency of each subregion. The standard deviation of the frequency is nearly 0.

To apply SAA to this example, the sample space was partitioned as in Equation (1) with m = 41, where the ui's form an arithmetic sequence with u1 = −8.0 and u40 = −0.2. The proposal distribution is a Gaussian random walk, q(xt, ·) = N2(xt, 0.25²I2). The gain factor sequence is set in the form of Equation (25) with T0 = 2000 and ς = 1.0, and the temperature sequence is set in the form of Equation (26) with τh = 0.5, T0′ = 200, and τ∗ = 0.01. To make the problem more difficult, τh was set to a very small value. SAA was initialized at (1.0, 1.0), which is close to a local minimum of U(x), and run for 10^5 iterations. After thinning by a factor of 100, 1000 samples were collected from the run. Figure 2(b) shows the evolving path of the 1000 samples.
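The objective and the partition-index map J(·) for this setup can be written down directly. The subregion convention assumed below (E1 = {x : U(x) ≤ u1}, Ei = {x : u_{i−1} < U(x) ≤ ui} for 1 < i < m, and Em = {x : U(x) > u_{m−1}}) follows Equation (1), which is not reproduced in this excerpt, so treat it as an illustrative reading:

```python
import math
from bisect import bisect_left

def U(x1, x2):
    """Energy function of Section 5.2 (modified from Robert and Casella 2004)."""
    a = x1 * math.sin(20 * x2) + x2 * math.sin(20 * x1)
    b = x1 * math.cos(10 * x2) - x2 * math.sin(10 * x1)
    return (-a * a * math.cosh(math.sin(10 * x1) * x1)
            - b * b * math.cosh(math.cos(20 * x2) * x2))

# m = 41 subregions; u_1, ..., u_40 arithmetic from -8.0 to -0.2 (step 0.2)
u = [-8.0 + 0.2 * i for i in range(40)]

def J(x1, x2):
    """Index (1-based) of the subregion containing x."""
    return bisect_left(u, U(x1, x2)) + 1
```

At the reported global minimizer (1.0445, −1.0084), U is below u1 = −8.0, so the point falls in the lowest-energy subregion E1, while the starting region of a typical run, for example (0, 0) with U = 0, is E41.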

For comparison, simulated annealing was also applied to this example. Two different cooling schedules were tried: the square-root and geometric cooling schedules. The former was exactly the same as the one used in SAA. With this cooling schedule, the temperature ladder consisted of 10^5 levels, and only one iteration was performed at each temperature level. The run started at the same point (1.0, 1.0) as SAA. Figure 2(c) shows the evolving path of 1000 samples collected at equally spaced time points from the run. For the geometric cooling schedule, the temperature ladder was set as follows:

  τ_{i+1} = ρτi,   i = 1, 2, . . . , m,

where τ1 = τh, ρ = 0.997244, and the number of temperature levels is m = 1000. This is a rather common setting for simulated annealing, especially for the value of ρ. The resulting lowest temperature from this schedule is the same as in the square-root cooling schedule. The algorithm started at the same point (1.0, 1.0) as SAA and then iterated for 100 iterations at each of the 1000 temperature levels. Figure 2(d) shows the evolving path of 1000 samples collected at the last iteration of each temperature level.
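As a quick numerical check (a sketch; it assumes the square-root schedule is Equation (26) with the settings above, τh = 0.5, T0′ = 200, τ∗ = 0.01), the two schedules indeed end at essentially the same temperature, up to the rounding of the geometric ratio 0.997244:

```python
import math

# Square-root schedule (Equation (26)) after 1e5 iterations
tau_sqrt_end = 0.5 * math.sqrt(200 / 1e5) + 0.01

# Geometric schedule: 1000 levels, tau_{i+1} = 0.997244 * tau_i, tau_1 = 0.5
tau_geo_end = 0.5 * 0.997244 ** 999

# Both schedules end near 0.032
```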

The comparison indicates that simulated annealing tends to get trapped in local energy minima, while SAA does not. For this example, even though the starting temperature is very low, SAA can still traverse the energy landscape and locate the two global minima very quickly.

Later, each of the above three algorithms was run 1000 times for this example. The numerical results are summarized in Table 3. The comparison indicates that SAA is superior to simulated annealing for this example: even with only 20,000 iterations, SAA produces much lower energy values than simulated annealing with 10^5 iterations.

6. BENCHMARK OPTIMIZATION PROBLEMS

In this section, we test SAA on some benchmark optimization problems, including the rational-expectations model (Hoffman and Schmidt 1981), feed-forward neural networks, and protein folding.


Figure 1. A thinned sample path of SAA for the 10-state distribution.



Figure 2. (a) Contour of U(x), (b) sample path of SAA, (c) sample path of simulated annealing with a square-root cooling schedule, and (d) sample path of simulated annealing with a geometric cooling schedule. The white circles show the global minima of U(x).

6.1 Rational-Expectations Model

This example is created based on the rational-expectations model encountered in economic research (Hoffman and Schmidt 1981). The rational-expectations model is specified by the following system of equations:

  y_{t1} = g_{t1}(α, β, γ) + ε_{t1},
  y_{t2} = g_{t2}(α, β, γ) + ε_{t2},
  z_{t1} = γ1 z_{t−1,1} + u_{t1},
  z_{t2} = γ2 z_{t−1,2} + u_{t2},
  z_{t3} = γ3 z_{t−1,3} + u_{t3},   (28)

where α = (α11, α12, α13, α21, α22, α23), β = (β11, β12, β21, β22), and γ = (γ1, γ2, γ3) are unknown parameters, ε_{ti} and u_{tj} are normal random errors, and g_{t1}(α, β, γ) and g_{t2}(α, β, γ) are specified as follows:

  g_{t1}(α, β, γ) = α11 z_{t1} + α12 z_{t2} + α13 z_{t3} + β11 E_{t−1}y_{t1} + β12 E_{t−1}y_{t2},
  g_{t2}(α, β, γ) = α21 z_{t1} + α22 z_{t2} + α23 z_{t3} + β21 E_{t−1}y_{t1} + β22 E_{t−1}y_{t2},
  E_{t−1}y_{t1} = (1/Δ)(1 − β11){α11γ1 z_{t−1,1} + α12γ2 z_{t−1,2} + α13γ3 z_{t−1,3}
                    + β12(1 − β22)^{−1}[α21γ1 z_{t−1,1} + α22γ2 z_{t−1,2} + α23γ3 z_{t−1,3}]},
  E_{t−1}y_{t2} = (1/Δ)(1 − β11){α21γ1 z_{t−1,1} + α22γ2 z_{t−1,2} + α23γ3 z_{t−1,3}
                    + β21(1 − β11)^{−1}[α11γ1 z_{t−1,1} + α12γ2 z_{t−1,2} + α23γ3 z_{t−1,3}]},
  Δ = 1 − β12β21(1 − β11)(1 − β22).

Table 3. Comparison of SAA and simulated annealing for the multimodal example

              Average of minimum energy values^a
             20,000        40,000        60,000        80,000        100,000       Prop^b   CPU^c
SAA         −8.1145       −8.1198       −8.1214       −8.1223       −8.1229       92.0%    0.17
           (3.0 × 10^-4)  (1.5 × 10^-4) (1.0 × 10^-4) (7.5 × 10^-5) (5.9 × 10^-5)
SA^d (sr)   −5.9227       −5.9255       −5.9265       −5.9269       −5.9271        3.5%    0.14
           (1.3 × 10^-2)  (1.3 × 10^-2) (1.3 × 10^-2) (1.3 × 10^-2) (1.3 × 10^-2)
SA^e (geo)  −6.5534       −6.5598       −6.5611       −6.5617       −6.5620       30.7%    0.13
           (3.3 × 10^-2)  (3.3 × 10^-2) (3.3 × 10^-2) (3.3 × 10^-2) (3.3 × 10^-2)

NOTE: ^a The average (over 1000 runs) of minimum energy values found during the first 20,000, 40,000, 60,000, 80,000, and 100,000 iterations; the standard deviations of the averages are given in parentheses. ^b The proportion of runs with minimum energy values less than −8.12. ^c CPU time (in seconds) cost by a single run on a 3.0 GHz personal computer. ^d Simulated annealing with a square-root cooling schedule. ^e Simulated annealing with a geometric cooling schedule.


Table 4. Comparison of SAA, annealing evolutionary SAMC, space annealing SAMC, simulated annealing, and the real-coded genetic algorithm for minimization of the function (29)

Algorithm                     Setting      Minimum    Maximum    Average    SD     Cost
SAA                           τh = 2000    62,693.5   62,701     62,694.4   0.37   1.85 × 10^7
                              τh = 1000    62,693.5   63,268     62,755.7   34     1.85 × 10^7
                              τh = 500     62,693.6   63,354     62,788.6   40     1.85 × 10^7
Annealing evolutionary SAMC   ℵ = 5000     62,694     62,984     62,815     19     1.85 × 10^7
                              ℵ = 15,000   62,694     63,078     62,813     29     1.85 × 10^7
Space annealing SAMC          ℵ = 10,000   62,694     63,893     62,864     71     1.85 × 10^7
                              ℵ = 20,000   62,694     63,370     62,940     64     1.85 × 10^7
Simulated annealing           τh = 2000    62,700     64,779     63,501     118    1.85 × 10^7
                              τh = 1000    62,695     64,242     63,543     84     1.85 × 10^7
                              τh = 500     62,699     63,956     63,412     84     1.85 × 10^7
Genetic algorithm             n = 50       64,585     66,703     65,140     122    1.9 × 10^7
                              n = 100      64,579     65,299     64,822     65     1.9 × 10^7
                              n = 200      64,579     65,288     64,861     63     1.9 × 10^7

NOTE: Let vi denote the best function value produced in the ith run. The numbers in the Minimum, Maximum, Average, and SD columns are calculated, respectively, as min_{1≤i≤20} vi, max_{1≤i≤20} vi, ∑_{i=1}^{20} vi/20, and the standard deviation of the average. Cost: the number of function evaluations in each run. The results of the latter four algorithms are from Liang (2011).

The parameters of the model can be estimated by minimizing the function

  U(α, β, γ) = ∑_{t=1}^T (y_{t1} − g_{t1}(α, β, γ))² + ∑_{t=1}^T (y_{t2} − g_{t2}(α, β, γ))² + ∑_{i=1}^3 ∑_{t=1}^T (z_{ti} − γi z_{t−1,i})²,   (29)

on the space |αij| ≤ 10, |βij| ≤ 10, |γi| ≤ 1, and |z0i| ≤ 10. As in Dorsey and Mayer (1995), we work on a dataset consisting of 40 observations simulated under the setting αij = 1 and βij = 0.2 for all i and j, γ1 = 0.1, γ2 = 0.2, γ3 = 0.8, ε_{t1} ∼ N(0, 25²), ε_{t2} ∼ N(0, 1), u_{t1} ∼ N(0, 36²), u_{t2} ∼ N(0, 4²), and u_{t3} ∼ N(0, 9²). Dorsey and Mayer (1995) assessed the computational difficulty of this problem by running the Marquardt–Levenberg gradient algorithm from 50 different points randomly chosen in the search space. They reported that the algorithm failed to converge in 49 of the 50 runs because of either singularities or floating-point overflow, and produced a suboptimal solution on the one run that did converge.

To apply SAA to this example, the sample space was partitioned as in Equation (1) with m = 1001, where u1, . . . , u_{m−1} form an arithmetic sequence with u1 = 625.2 and u_{m−1} = 825. The gain factor sequence was set as in Equation (25) with T0 = 5 × 10^6 and ς = 0.55, and the temperature sequence was set as in Equation (26) with T0′ = 5, τ∗ = 2, and τh = 2000. The desired sampling distribution was set as in Equation (27) with ζ = 0.05, which biases sampling toward low-energy subregions. Two types of proposals, hit-and-run and K-point operators, were used. The hit-and-run operator proceeds as follows:

(a) Generate a random direction vector v from the uniform distribution on the surface of a dx-dimensional unit sphere.

(b) Generate a random number r ∼ N(0, σl²) and set x∗ = x + rv, where x denotes the current sample, x∗ denotes the proposed sample, and σl is the step size, which is calibrated such that the proposal has a reasonable overall acceptance rate.

The hit-and-run algorithm was proposed by Smith (1984) and further refined by Chen and Schmeiser (1996). As pointed out by Berger (1993), the hit-and-run algorithm is particularly useful for problems with a sharply constrained sample space. In the K-point operator, K components of x are randomly selected to undergo a modification by K Gaussian random variables drawn from N(0, σp²), where σp is the step size of the K-point operator. That is, for a selected component xj, the proposal is x∗_j = xj + rj, where rj ∼ N(0, σp²). For this example, only the 1-point and 2-point operators were used. Let ρ1, ρ2, and ρ3 denote the respective probabilities of the hit-and-run, 1-point, and 2-point operators at each iteration. For this example, we set σl = σp = 1.5, ρ1 = 0.5, and ρ2 = ρ3 = 0.25.
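The two operators can be sketched as follows (a minimal version; the rejection of proposals that fall outside the constrained space, and the mixing probabilities ρ1, ρ2, ρ3, are omitted):

```python
import math, random

def hit_and_run(x, sigma_l=1.5):
    """Hit-and-run: uniform direction on the unit sphere, Gaussian step size."""
    v = [random.gauss(0.0, 1.0) for _ in x]          # isotropic Gaussian vector
    norm = math.sqrt(sum(vi * vi for vi in v))
    v = [vi / norm for vi in v]                      # normalized: uniform direction
    r = random.gauss(0.0, sigma_l)
    return [xi + r * vi for xi, vi in zip(x, v)]

def k_point(x, K, sigma_p=1.5):
    """K-point operator: perturb K randomly chosen components of x."""
    y = list(x)
    for j in random.sample(range(len(x)), K):
        y[j] += random.gauss(0.0, sigma_p)
    return y
```

Normalizing an isotropic Gaussian vector is the standard way to draw a direction uniformly on the sphere, matching step (a) above.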

SAA was run 20 times independently. Each run consisted of 1.85 × 10^7 iterations and cost about 212 s of CPU time on a 3.0 GHz computer. The results are summarized in Table 4 and Figure 3. To assess the effect of τh on the performance of SAA, the algorithm was also run 20 times with τh = 500 and with τh = 1000, keeping the settings of the other parameters unchanged. The results suggest that SAA is quite robust to the choice of τh, while slightly preferring a higher value of τh. As one can see from the "SD" column of Table 4, a high value of τh makes the performance of SAA more robust across runs.

The same dataset has been analyzed by Liang (2011) using avariety of optimization methods, including simulated annealing,space annealing SAMC, annealing evolutionary SAMC, and thereal-coded genetic algorithm (RGA). For simulated annealing,Liang (2011) tried three different values of τh = 500, 1000, and2000. The temperature ladder decreased at a rate of 0.95. Theproposal distribution was the same as that used in SAA. For eachvalue of τh, simulated annealing was run 20 times independently.Each run consisted of 100 stages and consisted of 1.85 × 105 it-erations. Annealing evolutionary SAMC is a population versionof space annealing SAMC, which allows more advanced oper-ators, such as crossover between different chains, to be used in


858 Journal of the American Statistical Association, June 2014

Figure 3. Average progression curves of the best function values (over 20 runs), plotted against the number of function evaluations, and their 95% confidence regions (shaded areas) produced by SAA (τ_h = 2000), annealing evolutionary SAMC (abbreviated as AESAMC, ℵ = 5000), space annealing SAMC (abbreviated as ASAMC, ℵ = 20,000), simulated annealing (abbreviated as SA, τ_high = 2000), and the real-coded genetic algorithm (abbreviated as RGA, n = 100) for minimizing function (29). The results of the latter four algorithms are from Liang (2011). Panel (a) compares SAA with ASAMC, SA, and RGA; panel (b) compares SAA with AESAMC.

optimization. Although annealing evolutionary SAMC is more efficient than space annealing SAMC, it can also get trapped in a local energy minimum due to its hard sample-space shrinkage strategy. Both algorithms were run 20 times in Liang (2011) with different values of ℵ, and they employed the same proposals. Their results are given in Table 4.

The genetic algorithm has many variants, each covering different applications and aspects. The RGA variant is described in Ali, Khompatraporn, and Zabinsky (2005) and has been tested on a variety of continuous global optimization problems. The results indicate that it is effective and comparable or superior to other stochastic optimization algorithms, such as controlled random search (Price 1983; Ali and Storey 1994) and differential evolution (Storn and Price 1997). For this example, Liang (2011) tried three different population sizes: 50, 100, and 150. For each population size, the number of individuals to be updated at each generation was set to one-fifth of the population size, and the number of generations was chosen such that the total number of function evaluations in each run was 1.9 × 10^7. RGA was run 20 times independently, with the results given in Table 4.

Table 4 indicates that SAA can be much more efficient than simulated annealing, space annealing SAMC, annealing evolutionary SAMC, and the genetic algorithm for this example. As shown in Figure 3(a), SAA is consistently better than simulated annealing, space annealing SAMC, and the genetic algorithm over the whole path of the simulations. Annealing evolutionary SAMC can outperform SAA in the early stage of the simulation due to its population effect, that is, multiple starting points and the use of population-based operators. However, as the simulation goes on, SAA tends to outperform annealing evolutionary SAMC. As mentioned above, annealing evolutionary SAMC can get trapped in a local energy minimum due to its hard sample-space shrinkage strategy, while SAA does not suffer from this difficulty.

6.2 Feed-Forward Neural Networks

Consider the problem of learning a mapping for a given dataset {(z_k, y_k) : k = 1, ..., N}, where z_k and y_k represent the independent and dependent variables, respectively. If no structural information is assumed for the mapping, a feed-forward neural network can be used. The feed-forward neural network is also known as the multiple-layer perceptron (MLP), which has been shown to be a universal approximator (Cybenko 1989; Hornik 1991); that is, an MLP network with a single layer of a finite number of hidden units and appropriate activation functions can approximate any continuous function arbitrarily well on compact subsets of R^n. Figure 4 depicts the structure of an MLP with a single hidden layer and a single output unit, for which the MLP approximator can be written as

f(z_k | x) = ϕ_o( α_0 + ∑_{j=1}^p γ_j z_{kj} + ∑_{i=1}^M α_i ϕ_h( β_{i0} + ∑_{j=1}^p β_{ij} z_{kj} ) ),    (30)

where M is the number of hidden units, p is the number of input units, z_k = (z_{k1}, ..., z_{kp}), x = {α_0, β_{i0}, α_i, β_{ij}, γ_j; i = 1, ..., M; j = 1, ..., p} is the collection of the weights on MLP connections, and ϕ_h(·) and ϕ_o(·) are the activation functions of the hidden units and the output unit, respectively. The weights α_i, β_{ij}, and γ_j correspond to the connections from the ith hidden unit to the output unit, from the jth input unit to the ith hidden unit, and from the jth input unit to the output unit, respectively. In Equation (30), the bias unit is treated as a special input unit, indexed by 0 and having a constant input value of 1. The connections from input units to the output unit are called shortcut connections. A popular choice for ϕ_h(·) is the sigmoid function ϕ_h(z) = 1/(1 + e^{−z}). The choice of ϕ_o(·) is problem dependent. For regression problems, ϕ_o(·) is usually set to the identity function ϕ_o(z) = z; for classification problems, it is usually set to the sigmoid function.
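As a concrete reading of Equation (30), the forward pass of the single-hidden-layer MLP with shortcut connections can be sketched as follows (the array shapes and names are our own illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(zk, alpha0, gamma, alpha, beta0, beta,
                phi_h=sigmoid, phi_o=lambda z: z):
    """Evaluate Equation (30) for one input vector zk of length p.

    alpha0 : scalar output bias
    gamma  : (p,) shortcut weights, input -> output
    alpha  : (M,) hidden -> output weights
    beta0  : (M,) hidden-unit biases
    beta   : (M, p) input -> hidden weights
    """
    hidden = phi_h(beta0 + beta @ zk)                 # M hidden activations
    return phi_o(alpha0 + gamma @ zk + alpha @ hidden)
```

Here phi_o defaults to the identity, matching the regression setting described in the text.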


Liang, Cheng, and Lin: Simulated Stochastic Approximation Annealing for Global Optimization with a Square-Root Cooling Schedule 859

Figure 4. A fully connected one-hidden-layer MLP network with four input units (I1, I2, I3, I4), one bias unit (B), three hidden units (H1, H2, H3), and one output unit (O). The arrows show the direction of data feeding.

The problem of MLP training is to minimize the objective function

U(x) = ∑_{k=1}^N ( f(z_k | x) − y_k )² + λ [ ∑_{i=0}^M α_i² + ∑_{i=1}^M ∑_{j=0}^p β_{ij}² + ∑_{j=1}^p γ_j² ],    (31)

by choosing appropriate weights x, where the second term is the regularization term and λ is the regularization parameter. Including the regularization term in the objective function can usually improve the generalization performance of the MLP. In this article, we assume that the value of M is given. In general, M can be determined using a cross-validation method or under a Bayesian framework.
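Under the same naming conventions (ours, for illustration), the regularized objective (31) for a whole dataset can be computed as:

```python
import numpy as np

def mlp_objective(Z, y, alpha0, gamma, alpha, beta0, beta, lam=0.0):
    """Equation (31): penalized sum of squared errors for an MLP with
    sigmoid hidden units and identity output (regression case).

    Z : (N, p) inputs; y : (N,) targets; lam : regularization parameter.
    """
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    hidden = sigmoid(beta0 + Z @ beta.T)           # (N, M) hidden activations
    f = alpha0 + Z @ gamma + hidden @ alpha        # (N,) network outputs
    penalty = (alpha0**2 + np.sum(alpha**2) + np.sum(beta0**2)
               + np.sum(beta**2) + np.sum(gamma**2))
    return np.sum((f - y) ** 2) + lam * penalty
```

Setting lam = 0 recovers the unpenalized sum of squared errors, the choice made for the two-spiral experiment in the text.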

As is well known, the energy landscape of the MLP is rugged. Conventional MLP training algorithms, such as back-propagation (Rumelhart, Hinton, and Williams 1986), BFGS (Broyden 1970; Fletcher 1970; Goldfarb 1970; Shanno 1970), and simulated annealing (e.g., Amato et al. 1991; Owen and Abunawass 1993), tend to converge to a local energy minimum near the starting point. Consequently, the information contained in the training data may not be learned sufficiently. Liang (2007a) applied space annealing SAMC to this problem with great success. Space annealing SAMC produced perfect solutions for some benchmark feed-forward neural network training problems, such as the famous n-parity and two-spiral problems (Lang and Witbrock 1989).

The two-spiral problem is to learn an MLP that distinguishes between points on two intertwined spirals, as shown in Figure 5. This problem has been solved using MLPs with multiple hidden layers (with or without shortcut connections). Lang and Witbrock (1989) solved the problem using a 2-5-5-5-1 MLP with shortcut connections (138 trainable connection weights). Fahlman and Lebiere (1990) solved the problem using cascade-correlation networks with 12–19 hidden units, the smallest network having 114 connections. Wong and Liang (1997) solved the problem using a 2-14-4-1 MLP without shortcut connections. This problem is very difficult for the one-hidden-layer MLP, as it requires the MLP to learn a highly nonlinear separation of the input space. Baum and Lang (1991) reported that a solution could be found using a 2-50-1 back-propagation MLP, but only when the MLP had been initialized with queries. Liang (2007a) solved the problem with a 2-30-1 MLP without shortcut connections, which consists of 121 connections.

In this article, we tested SAA on this problem by training the same MLP as in Liang (2007a), where λ in Equation (31) was set to 0. This choice of λ gives the energy function U(x) a known global minimum value of 0. In SAA, the sample space was restricted to the region X = [−50, 50]^d as in Liang (2007a) and was partitioned as in Equation (1) with m = 76, where u_1, ..., u_{m−1} form an arithmetic sequence with u_1 = 0.2 and u_{m−1} = 15. The gain factor sequence was set in the form of Equation (25)

Figure 5. Classification maps learned by SAA with an MLP of 30 hidden units. The black and white points show the training data for two intertwined spirals. (a) Classification map learned in one run of SAA. (b) Classification map averaged over 20 runs. This figure shows the success of SAA in optimization of complex functions.


Table 5. Comparison of SAA, space annealing SAMC, simulated annealing, and BFGS for the two-spiral example

Algorithm                Mean     SD     Minimum  Maximum  Proportion  Iteration (×10^6)
SAA                      0.341    0.099  0.201    2.04     18          5.82
Space annealing SAMC     0.620    0.191  0.187    3.23     15          7.07
Simulated annealing-1    17.485   0.706  9.02     22.06    0           10.0
Simulated annealing-2    6.433    0.450  3.03     11.02    0           10.0
BFGS                     15.50    0.899  10.00    24.00    0           —

NOTE: v_i denotes the minimum energy value obtained in the ith run for i = 1, ..., 20; "Mean" is the average of the v_i; "SD" is the standard deviation of "Mean"; "Minimum" = min_{i=1,...,20} v_i; "Maximum" = max_{i=1,...,20} v_i; "Proportion" = #{i : v_i ≤ 0.21}; and "Iteration" is the average number of iterations performed in each run. Simulated annealing-1 employs the linear cooling schedule, and simulated annealing-2 employs the geometric cooling schedule with a decreasing rate of 0.9863.

with T_0 = 5 × 10^6 and ς = 0.55, and the temperature sequence was set in the form of Equation (26) with T′_0 = 100, τ* = 0.2, and τ_h = 10. The desired sampling distribution was set in the form of Equation (27) with ζ = 0.1. The proposals used were the hit-and-run and 1-point operators, exactly the same as those used in Liang (2007a). SAA was run 20 times independently. Each run started with a random configuration generated from N_d(0, 0.01²) and stopped when a solution with energy U(x) < 0.21 had been located or the maximum number of 10^7 iterations had been reached. The results are summarized in Table 5. For comparison, the results of space annealing SAMC, simulated annealing, and BFGS from Liang (2007a) are also included in Table 5. In simulated annealing, the highest temperature was set to 10, the lowest temperature was set to 0.01, and the temperature ladder consisted of 500 levels.

The comparison indicates that SAA outperforms space annealing SAMC, simulated annealing, and BFGS for this example. SAA found perfect solutions in 18 out of 20 runs with an average of 5.82 × 10^6 iterations, while simulated annealing and BFGS failed to find perfect solutions in all runs. Although space annealing SAMC is also able to find perfect solutions, it needs more iterations on average and has a lower success rate than SAA.

6.3 Protein Folding

In biophysics, predicting the native structure of a protein from its sequence has traditionally been formulated as an optimization problem: finding the coordinates of atoms such that the potential energy of the protein is minimized. Although the problem is simply formulated, it is extremely hard to solve, for two reasons. First, the dimension of the system is usually high, of the same order as the number of atoms involved in the protein. Second, the energy landscape of the system is rugged; it can be characterized by a multitude of local energy minima separated by high-energy barriers. Given the complexity of the problem, there has been increasing interest in recent years in understanding the relevant mechanics of protein folding by studying simplified models, such as the AB model (Stillinger, Head-Gordon, and Hirschfeld 1993) and the HP model (Dill 1985). In these simplified models, each amino acid is treated as a point particle, and thus the dimension of the system is much reduced. However, even for these highly simplified models, it is far from trivial to predict the native structure for a given sequence (see, e.g., Hsu, Mehra, and Grassberger 2003; Liang 2004; Lee et al. 2008). In this article, only the AB model is studied.

The AB model consists of only two types of monomers, A and B, which behave as hydrophobic (σ_i = +1) and hydrophilic (σ_i = −1) monomers, respectively. The monomers are linked by rigid bonds of unit length to form linear chains living in two- or three-dimensional space. For the purpose of demonstration, only the two-dimensional (2D) model is studied in this article. In the 2D-AB model, the shape of an N-mer can be specified by N − 2 bond angles. The energy function consists of two types of contributions, bond angle and Lennard-Jones, and is given by

U(x) = ∑_{i=1}^{N−2} (1/4)(1 − cos x_{i,i+1}) + 4 ∑_{i=1}^{N−2} ∑_{j=i+2}^N [ r_{ij}^{−12} − C_2(σ_i, σ_j) r_{ij}^{−6} ],    (32)

where x = (x_{1,2}, ..., x_{N−2,N−1}), x_{i,j} ∈ [−π, π] is the angle between the ith and jth bond vectors, and r_{ij} is the distance between monomers i and j. The constant C_2(σ_i, σ_j) is +1, +1/2, and −1/2 for AA, BB, and AB pairs, respectively. These values give strong attraction between AA pairs, weak attraction between BB pairs, and weak repulsion between A and B pairs. In the literature, the AB model is often investigated with the Fibonacci sequence (e.g., Stillinger and Head-Gordon 1995; Hsu, Mehra, and Grassberger 2003), which is defined recursively by

S_0 = A, S_1 = B, S_{i+1} = S_{i−1} ⊕ S_i,

where ⊕ is the concatenation operator. The length of the sequence is given by the Fibonacci number N_{i+1} = N_{i−1} + N_i.
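For concreteness, the Fibonacci sequences and the energy (32) can be sketched as follows (a hedged illustration; the 2D coordinate construction from the bond angles is our own convention):

```python
import numpy as np

def ab_fibonacci(i):
    """S_0 = 'A', S_1 = 'B', S_{i+1} = S_{i-1} concatenated with S_i."""
    seqs = ["A", "B"]
    while len(seqs) <= i:
        seqs.append(seqs[-2] + seqs[-1])
    return seqs[i]

def ab_energy(angles, seq):
    """2D-AB energy of Equation (32).

    angles : (N-2,) bond angles x_{i,i+1}; seq : string over {'A', 'B'}.
    """
    N = len(seq)
    sigma = np.where(np.array(list(seq)) == "A", 1.0, -1.0)
    # Unit-length bonds: the first bond points along the x-axis and each
    # bond angle turns the direction of the next bond (our convention).
    theta = np.concatenate(([0.0], np.cumsum(angles)))
    bonds = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    coords = np.vstack([[0.0, 0.0], np.cumsum(bonds, axis=0)])
    e_angle = np.sum(1.0 - np.cos(angles)) / 4.0
    e_lj = 0.0
    for i in range(N - 2):                      # non-adjacent pairs only
        for j in range(i + 2, N):
            r = np.linalg.norm(coords[i] - coords[j])
            if sigma[i] == sigma[j]:
                c2 = 1.0 if sigma[i] == 1.0 else 0.5   # AA or BB pair
            else:
                c2 = -0.5                              # AB pair
            e_lj += r**-12 - c2 * r**-6
    return e_angle + 4.0 * e_lj
```

For example, len(ab_fibonacci(6)) is 13, the shortest sequence studied here.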

In this article, we considered the sequences with lengths 13, 21, and 34. For all these sequences, the sample space was partitioned as in Equation (1), where u_1, ..., u_{m−1} form an arithmetic sequence with a common difference of 0.1. For the 13-mer sequence, we set m = 41, u_1 = −3.9, and u_40 = 0, the gain factor sequence in the form of Equation (25) with T_0 = 10^6 and ς = 0.55, and the temperature sequence in the form of Equation (26) with T′_0 = 1000, τ* = 0.01, and τ_h = 5. The desired sampling distribution was set in the form of Equation (27) with ζ = 0.1. The proposals included the hit-and-run, 1-point, and 2-point operators, and each was selected with equal probability at each iteration. The step sizes of the three operators were set to √2·s/4, s/4, and s/4, respectively, where s = J(x)/10 and J(x) denotes the index of the subregion that x belongs to. Under this setting, the step size increases with the energy level, and thus SAA can move through the high-energy region quickly and explore the low-energy region in detail. SAA was run 20 times independently.
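The partition-indexed step-size rule just described can be sketched as follows, using the 13-mer setting (m = 41, u_1 = −3.9, common difference 0.1); the function names are ours:

```python
import math
import numpy as np

m = 41
# Boundaries u_1, ..., u_40: arithmetic sequence from -3.9 to 0 in steps of 0.1.
u = np.round(np.arange(-3.9, 0.0 + 1e-9, 0.1), 10)

def subregion_index(energy):
    """J(x): 1-based index of the energy subregion containing x."""
    return int(np.searchsorted(u, energy)) + 1

def operator_step_sizes(energy):
    """Step sizes sqrt(2)s/4, s/4, s/4 for the hit-and-run, 1-point, and
    2-point operators, with s = J(x)/10 as described in the text."""
    s = subregion_index(energy) / 10.0
    return math.sqrt(2) * s / 4.0, s / 4.0, s / 4.0
```

Low-energy configurations thus get small proposal steps (fine local exploration), while high-energy ones get large steps.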


Table 6. Comparison of SAA and simulated annealing for the 2D-AB models

                 SAA                                        Simulated annealing
N    Post(a)     Average(b)          Best(c)      Average(b)          Best(c)
13   −3.2941     −3.2833 (0.0011)    −3.2881      −3.1775 (0.0018)    −3.2012
21   −6.1980     −6.1578 (0.0020)    −6.1712      −5.9809 (0.0463)    −6.1201
34   −10.8060    −10.3396 (0.0555)   −10.7689     −9.5845 (0.1260)    −10.5240

NOTE: (a) The minimum energy value obtained by SAA (subject to a post-conjugate gradient minimization procedure starting from the best configurations found in each run). (b) The averaged minimum energy value sampled by the algorithm and the standard deviation of the average. (c) The minimum energy value sampled by the algorithm in all runs.

Each run started with a configuration randomly chosen in the domain [−π, π]^{N−2} and consisted of 10^7 iterations.

For the 21-mer sequence, the sample space was partitioned with m = 81, u_1 = −7.9, and u_80 = 0. The gain factor sequence, the temperature sequence, the desired sampling distribution, and the proposal distribution were the same as for the 13-mer sequence. SAA was also run 20 times independently for this sequence, but each run consisted of 2 × 10^7 iterations. For the 34-mer sequence, the sample space was partitioned into m = 111 subregions with u_1 = −10.9 and u_110 = 0. The gain factor sequence was set in the form of Equation (25) with T_0 = 5 × 10^6 and ς = 0.55. The temperature sequence, the desired sampling distribution, and the proposal distribution were the same as those used for the 13-mer and 21-mer sequences. SAA was run 30 times independently, and each run consisted of 2.5 × 10^7 iterations. The results are summarized in Table 6.

Figure 6 shows the minimum energy configurations found by SAA (subject to post-conjugate gradient optimization) for the 13-mer, 21-mer, and 34-mer sequences. From these configurations, it is easy to see that the hydrophobic (A) monomers tend to form a hydrophobic core (in the 13-mer sequence) or clusters of typically 4-5 monomers (in the 21-mer and 34-mer sequences). It is remarkable that the post-optimization energy values reported in Table 6 all agree with the putative ground energy values reported for these sequences; see, for example, Lee et al. (2008). This indicates the success of SAA in finding ground energy configurations for these protein sequences.

For comparison, simulated annealing was also applied to this example. The highest temperature was set to 5, as in SAA. The temperature ladder consisted of 500 levels, where the temperature decreases geometrically at a rate of 0.99. For the 13-mer, 21-mer, and 34-mer sequences, there were 20,000, 40,000, and 50,000 iterations, respectively, at each temperature level. Under this setting, SAA and simulated annealing had the same number of energy evaluations in each run. The proposal distribution used in simulated annealing was exactly the same as that used in SAA. The results are summarized in Table 6. The comparison indicates that SAA significantly outperforms simulated annealing for this example. For each protein sequence, the average minimum energy value sampled by SAA is significantly lower than that sampled by simulated annealing, and the best energy value sampled by SAA is consistently lower than that sampled by simulated annealing.
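The geometric temperature ladder used for the simulated annealing baseline can be generated as follows (a small sketch; the function name is ours):

```python
def geometric_ladder(t_high=5.0, levels=500, rate=0.99):
    """Ladder of 500 temperatures starting at t_high and decreasing
    geometrically at the given rate, as in the comparison above."""
    return [t_high * rate**k for k in range(levels)]
```

The lowest level here is 5.0 × 0.99^499, roughly 0.03.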

7. DISCUSSION

In this article, we have developed the SAA algorithm for global optimization. The SAA algorithm is a combination of simulated annealing and the SAMC algorithm. Under the framework of stochastic approximation, we show that SAA can work with a cooling schedule in which the temperature decreases much faster than in the logarithmic cooling schedule, for example, a square-root cooling schedule, while still guaranteeing that the global energy minima are reached as the temperature tends to 0. The SAA algorithm has been compared with simulated

Figure 6. Minimum energy configurations produced by SAA (subject to post-conjugate gradient optimization) for (a) the 13-mer sequence with energy value −3.2941, (b) the 21-mer sequence with energy value −6.1980, and (c) the 34-mer sequence with energy −10.8060. The solid and open circles indicate the hydrophobic and hydrophilic monomers, respectively.


annealing and some other optimization algorithms on a few benchmark optimization problems, including the famous feed-forward neural network training and protein folding problems. The numerical results consistently indicate that SAA outperforms simulated annealing and the other competitors.
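To make the speed difference concrete, here is a hedged sketch contrasting a generic square-root schedule with the classical logarithmic one (the paper's actual schedule is its Equation (26); the constants and function names here are our own):

```python
import math

def sqrt_cooling(t, tau_star=0.01, c=5.0):
    """Generic square-root schedule: tau_t = tau_star + c / sqrt(t)."""
    return tau_star + c / math.sqrt(t)

def log_cooling(t, c=5.0):
    """Classical logarithmic schedule: tau_t = c / log(t + 1)."""
    return c / math.log(t + 1)

# After 10^6 iterations the square-root schedule is already near its
# limiting temperature, while the logarithmic schedule is still far above
# it: logarithmic cooling reaches low temperatures impractically slowly.
```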

Compared with simulated annealing, an added advantage of SAA is its self-adjusting mechanism, which enables it to escape from local traps. Compared with space annealing SAMC, which is also the main competitor of SAA, SAA shrinks its sample space in a soft way, gradually biasing the sampling toward the local energy minima of each subregion by lowering the temperature with iterations. This strategy of sample-space shrinkage reduces the risk that SAA gets trapped in local energy minima. Although both SAA and space annealing SAMC work on a partitioned sample space, the numbers of subregions they use are not directly comparable. The number of subregions used in space annealing SAMC is actually a random variable, which can usually be reduced to a very small number within a few thousand iterations. Since this method tends to get trapped in local energy minima, it prefers to start with a large sample space and thus a large number of subregions. In SAA, by contrast, the number of subregions is fixed throughout the run. As discussed in Section 4, SAA prefers a fine partition of the sample space, subject to the constraint of CPU time.

The SAA algorithm can be extended in various ways. For example, it can be run with multiple samples simulated at each iteration, and it can also be run with a population of SAMC chains, as in annealing evolutionary SAMC. In the latter case, crossover operators can be incorporated into the simulations, and the performance of SAA can be expected to improve further due to the population effect.

Finally, we note that the SAA algorithm potentially provides a more general framework of stochastic approximation than the current stochastic approximation MCMC algorithms. By including an additional control parameter τ_t, stochastic approximation may find new applications or improve its performance in existing applications. For this general SAA algorithm, the function H_τ(θ, X) may no longer be bounded. In this case, its convergence can be studied under the assumption that the Markov transition kernel is V-uniformly ergodic.

SUPPLEMENTARY MATERIALS

Proofs of Theorems 3.1, 3.2, and 3.3.

[Received September 2012. Revised October 2013.]

REFERENCES

Ali, M. M., Khompatraporn, C., and Zabinsky, Z. B. (2005), "A Numerical Evaluation of Several Stochastic Algorithms on Selected Continuous Global Optimization Test Problems," Journal of Global Optimization, 31, 635–672. [858]

Ali, M. M., and Storey, C. (1994), "Modified Controlled Random Search Algorithms," International Journal of Computer Mathematics, 54, 229–235. [858]

Amato, S., Apolloni, B., Caporali, G., Madesani, U., and Zanaboni, A. (1991), "Simulated Annealing Approach in Backpropagation," Neurocomputing, 3, 207–220. [859]

Andrieu, C., Moulines, E., and Priouret, P. (2005), "Stability of Stochastic Approximation Under Verifiable Conditions," SIAM Journal on Control and Optimization, 44, 283–312. [850,851,852]

Andrieu, C., and Robert, C. P. (2001), "Controlled MCMC for Optimal Sampling," Technical Report 0125, Cahiers de Mathematiques du Ceremade, Universite Paris-Dauphine. [849]

Andrieu, C., and Thoms, J. (2008), "A Tutorial on Adaptive MCMC," Statistics and Computing, 18, 343–373. [849]

Atchade, Y. F., and Liu, J. S. (2010), "The Wang-Landau Algorithm for Monte Carlo Computation in General State Spaces," Statistica Sinica, 20, 209–233. [849]

Baum, E. B., and Lang, K. J. (1991), "Constructing Hidden Units Using Examples and Queries," in Advances in Neural Information Processing Systems (3), eds. R. P. Lippmann, J. E. Moody, and D. S. Touretzky, San Mateo: Morgan Kaufmann, pp. 904–910. [859]

Benveniste, A., Metivier, M., and Priouret, P. (1990), Adaptive Algorithms and Stochastic Approximations, New York: Springer-Verlag. [848,850]

Berger, J. O. (1993), "The Present and Future of Bayesian Multivariate Analysis," in Multivariate Analysis: Future Directions, ed. C. R. Rao, Amsterdam: North-Holland, pp. 25–53. [857]

Broyden, C. G. (1970), "The Convergence of a Class of Double Rank Minimization Algorithms, Parts I and II," IMA Journal of Applied Mathematics, 6, 76–90 and 222–231. [859]

Chen, H. F. (2002), Stochastic Approximation and Its Applications, Dordrecht: Kluwer Academic Publishers. [848,850,851]

Chen, H. F., and Zhu, Y. M. (1986), "Stochastic Approximation Procedures With Randomly Varying Truncations," Scientia Sinica, Series A, 29, 914–926. [850]

Chen, M.-H., and Schmeiser, B. W. (1996), "General Hit-and-Run Monte Carlo Sampling for Evaluating Multidimensional Integrals," Operations Research Letters, 19, 161–169. [857]

Cybenko, G. (1989), "Approximations by Superpositions of Sigmoidal Functions," Mathematics of Control, Signals, and Systems, 2, 303–314. [858]

Dekkers, A., and Aarts, E. (1991), "Global Optimization and Simulated Annealing," Mathematical Programming, 50, 367–393. [852]

Dill, K. A. (1985), "Theory for the Folding and Stability of Globular Proteins," Biochemistry, 24, 1501–1509. [860]

Dorsey, R. E., and Mayer, W. J. (1995), "Genetic Algorithms for Estimation Problems With Multiple Optima, Non-differentiability, and Other Irregular Features," Journal of Business and Economic Statistics, 13, 53–66. [857]

Fahlman, S. E., and Lebiere, C. (1990), "The Cascade-Correlation Learning Architecture," in Advances in Neural Information Processing Systems 2, ed. D. S. Touretzky, San Mateo, CA: Morgan Kaufmann, pp. 524–532. [859]

Fletcher, R. (1970), "A New Approach to Variable Metric Algorithms," Computer Journal, 13, 317–322. [859]

Geman, S., and Geman, D. (1984), "Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741. [847]

Goldfarb, D. (1970), "A Family of Variable Metric Methods Derived by Variational Means," Mathematics of Computation, 24, 23–26. [859]

Gu, M. G., and Kong, F. H. (1998), "A Stochastic Approximation Algorithm With Markov Chain Monte Carlo Method for Incomplete Data Estimation Problems," Proceedings of the National Academy of Sciences USA, 95, 7270–7274. [849]

Haario, H., and Saksman, E. (1991), "Simulated Annealing Process in General State Space," Advances in Applied Probability, 23, 866–893. [847,853]

Haario, H., Saksman, E., and Tamminen, J. (2001), "An Adaptive Metropolis Algorithm," Bernoulli, 7, 223–242. [847,849]

Hoffman, D. L., and Schmidt, P. (1981), "Testing the Restrictions Implied by the Rational Expectations Hypothesis," Journal of Econometrics, 15, 265–287. [855,856]

Holley, R. A., Kusuoka, S., and Stroock, D. W. (1989), "Asymptotics of the Spectral Gap With Applications to the Theory of Simulated Annealing," Journal of Functional Analysis, 83, 333–347. [847]

Hornik, K. (1991), "Approximation Capabilities of Multilayer Feedforward Networks," Neural Networks, 4, 251–257. [858]

Hsu, H.-P., Mehra, V., and Grassberger, P. (2003), "Structure Optimization in an Off-Lattice Protein Model," Physical Review E, 68, 037703. [860]

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983), "Optimization by Simulated Annealing," Science, 220, 671–680. [847]

Kulkarni, S. R., and Horn, C. (1995), "Necessary and Sufficient Conditions for Convergence of Stochastic Approximation Algorithms Under Arbitrary Disturbances," in Proceedings of the 34th Conference on Decision & Control, New Orleans, LA, pp. 3843–3848. [851]

Kushner, H. K., and Clark, D. S. (1978), Stochastic Approximation Methods for Constrained and Unconstrained Systems, New York: Springer. [851]

Lang, K. J., and Witbrock, M. J. (1989), "Learning to Tell Two Spirals Apart," in Proceedings of the 1988 Connectionist Models, eds. D. Touretzky, G. Hinton, and T. Sejnowski, San Mateo, CA: Morgan Kaufmann, pp. 52–59. [859]


Lee, J., Joo, K., Kim, S.-Y., and Lee, J. (2008), "Reexamination of Structure Optimization of Off-lattice Protein AB Models by Conformational Space Annealing," Journal of Computational Chemistry, 29, 2479–2484. [860,861]

Liang, F. (2004), "Annealing Contour Monte Carlo for Structure Optimization in an Off-lattice Protein Model," Journal of Chemical Physics, 120, 6756–6763. [860]

——— (2007a), "Annealing Stochastic Approximation Monte Carlo for Neural Network Training," Machine Learning, 68, 201–233. [849,859]

——— (2007b), "Continuous Contour Monte Carlo for Marginal Density Estimation With an Application to a Spatial Statistical Model," Journal of Computational and Graphical Statistics, 16, 608–632. [849]

——— (2009a), "On the Use of Stochastic Approximation Monte Carlo for Monte Carlo Integration," Statistics & Probability Letters, 79, 581–587. [848]

——— (2009b), "Improving SAMC Using Smoothing Methods: Theory and Applications to Bayesian Model Selection Problems," The Annals of Statistics, 37, 2626–2654. [851]

——— (2011), "Annealing Evolutionary Stochastic Approximation Monte Carlo for Global Optimization," Statistics and Computing, 21, 375–393. [857,858]

Liang, F., Liu, C., and Carroll, R. J. (2007), "Stochastic Approximation in Monte Carlo Computation," Journal of the American Statistical Association, 102, 305–320. [847,849,850]

Meyn, S., and Tweedie, R. L. (2009), Markov Chains and Stochastic Stability (2nd ed.), Cambridge: Cambridge University Press. [851]

Nummelin, E. (1984), General Irreducible Markov Chains and Nonnegative Operators, Cambridge: Cambridge University Press. [851]

Owen, C. B., and Abunawass, A. M. (1993), "Applications of Simulated Annealing to the Backpropagation Model Improves Convergence," SPIE Proceedings, 1966, pp. 269–276. [859]

Price, W. L. (1983), "Global Optimization by Controlled Random Search," Journal of Optimization Theory and Applications, 40, 333–348. [858]

Robbins, H., and Monro, S. (1951), "A Stochastic Approximation Method," Annals of Mathematical Statistics, 22, 400–407. [848,849]

Robert, C. P., and Casella, G. (2004), Monte Carlo Statistical Methods (2nd ed.), New York: Springer. [854]

Roberts, G. O., and Tweedie, R. L. (1996), "Geometric Convergence and Central Limit Theorems for Multidimensional Hastings and Metropolis Algorithms," Biometrika, 83, 95–110. [851]

Rosenthal, J. S. (1995), "Minorization Conditions and Convergence Rates for Markov Chain Monte Carlo," Journal of the American Statistical Association, 90, 558–566. [851]

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986), "Learning Internal Representations by Backpropagating Errors," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Vol. 1), eds. D. E. Rumelhart and J. L. McClelland, Cambridge, MA: MIT Press, pp. 318–362. [859]

Shanno, D. F. (1970), "Conditioning of Quasi-Newton Methods for Function Minimization," Mathematics of Computation, 24, 647–656. [859]

Smith, R. L. (1984), "Efficient Monte-Carlo Procedures for Generating Points Uniformly Distributed Over Bounded Regions," Operations Research, 32, 1296–1308. [857]

Stillinger, F. H., and Head-Gordon, T. (1995), "Collective Aspects of Protein Folding Illustrated by a Toy Model," Physical Review E, 52, 2872–2877. [860]

Stillinger, F. H., Head-Gordon, T., and Hirschfeld, C. L. (1993), "Toy Model for Protein Folding," Physical Review E, 48, 1469–1477. [860]

Storn, R., and Price, K. (1997), "Differential Evolution: A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces," Journal of Global Optimization, 11, 341–359. [858]

Tan, C. M. (2008), Simulated Annealing, Vienna, Austria: In-Tech. [847]

Troster, A., and Dellago, C. (2005), "Wang-Landau Sampling With Self-Adaptive Range," Physical Review E, 71, 066705. [853]

Villalobos-Arias, M., Coello Coello, C. A., and Hernandez-Lerma, O. (2006), "Asymptotic Convergence of a Simulated Annealing Algorithm for Multiobjective Optimization Problems," Mathematical Methods of Operations Research, 64, 353–362. [852]

Wang, F., and Landau, D. P. (2001), "Efficient, Multiple-Range Random Walk Algorithm to Calculate Density of States," Physical Review Letters, 86, 2050–2053. [848,849]

Wong, W. H., and Liang, F. (1997), "Dynamic Weighting in Monte Carlo and Optimization," Proceedings of the National Academy of Sciences USA, 94, 14220–14224. [859]

Younes, L. (1989), "Parametric Inference for Imperfectly Observed Gibbsian Fields," Probability Theory and Related Fields, 82, 625–645. [849]
