Deterministic Annealing for Clustering: Tutorial and Computational Aspects
Pratik M. Parekh, Dimitrios Katselis, Carolyn L. Beck, Srinivasa M. Salapaka
Abstract— The deterministic annealing (DA) method, used for the solution of several nonconvex problems, offers the ability to avoid shallow local minima of a given cost surface and to minimize the cost function even when there are many local minima. The method is established in a probabilistic framework through basic information-theoretic techniques such as maximum entropy and random coding. It arises naturally in the context of statistical mechanics by the emulation of a physical process whereby a solid is slowly cooled and at zero temperature assumes its minimum energy configuration. In this paper, we first present some background material on the DA method and its connections to statistical physics and rate-distortion theory. A computational complexity analysis is then presented for a given temperature schedule. The case study focuses on the geometric cooling law T(t) = ρT(t−1), 0 < ρ < 1, where T(t) is the temperature at time t.
I. INTRODUCTION
Combinatorial resource allocation problems arise in a large number of applications such as facility location problems, data compression, strategy planning, model aggregation, and locational optimization. These problems are typically cast as optimization formulations whose cost surfaces are nonconvex with many local minima; therefore, finding the global minimum is a prohibitive task. Due to the nonconvexity of these cost surfaces, many gradient descent methods get trapped in poor local minima, depending on the initialization point. An immediate remedy is to use multiple initialization points and to choose the lowest achievable cost value as the potential global minimum [1]. Clearly, depending on the structure of the cost surface and due to the combinatorial nature of the resource allocations, such an approach is computationally impractical. In this respect, the simulated annealing (SA) and deterministic annealing (DA) algorithms are effective [1], [2], [3]. The motivating idea behind these methods originates from statistical mechanics: SA and DA emulate a physical process whereby a solid is first heated to its melting point and then is slowly cooled, at a rate dictated by its heat transport speed, to finally reach its minimum energy configuration [2], [4]. The SA algorithm corresponds to the evolution of a discrete-time inhomogeneous Markov chain (MC) and, by starting from an initial point, a random
P. M. Parekh is with the Coordinated Science Laboratory and the Mechanical Engineering Department, University of Illinois at Urbana-Champaign, Urbana, IL 61801-2925. Email: [email protected]
D. Katselis and C. L. Beck are with the Coordinated Science Laboratory and the Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801-2925. Emails: {katselis,beck3}@illinois.edu
S. M. Salapaka is with the Mechanical Engineering Department, University of Illinois at Urbana-Champaign, Urbana, IL 61801-2925. Email: [email protected]
walk of bounded or unbounded variance triggers the search of the corresponding state-space and may converge in probability to the minimum energy state. In a physical annealing process, this procedure corresponds to a Brownian motion of a particle. However, the cooling schedule is critical to the performance of SA. Assuming a given random process, cooling at a fast rate will most probably lead to a nonglobal minimum at zero temperature, while cooling at a slow rate corresponds to a waste of resources. In [5], it was shown that convergence in probability to the global minimum can be achieved if T(t) = O((log t)^−1), where T(t) is the temperature at time t. This result was sharpened by Hajek [6], who showed that the SA algorithm converges in probability if and only if lim_{t→∞} T(t) = 0 and

∑_{t=1}^∞ exp(−d*/T(t)) = ∞

for some appropriately defined positive constant d*, leading to cooling schedules of the form T(t) = d/log t for any d ≥ d*. Such schedules are generally nonadmissible for real-world applications. Moreover, the results in [3], [5] use a Gaussian assumption on the aforementioned random walk, leading to bounded-variance steps. In [4], it was shown that, using Cauchy-distributed (infinite-variance) perturbations, cooling schedules of order O(t^−1) can be achieved. On the other hand, the DA algorithm alleviates the random walk aspect of SA by replacing the stochastic search with an expectation. At the same time, the DA inherits the positive annealing attributes of SA to better identify the global minima of a cost surface; more importantly, it allows for geometric cooling laws, leading to performance that is typically significantly faster than that of SA.
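To make the schedule gap concrete, the following sketch compares the two families after a fixed number of steps; the constants d = 1, T(0) = 1 and ρ = 0.9 are ours, chosen only for illustration:

```python
import math

def t_log(t, d=1.0):
    """Hajek-type logarithmic schedule T(t) = d / log t (illustrative d)."""
    return d / math.log(t + 1)

def t_geom(t, t0=1.0, rho=0.9):
    """Geometric schedule T(t) = rho * T(t-1), i.e. T(t) = T(0) * rho**t."""
    return t0 * rho ** t
```

After 100 steps the geometric schedule is already several orders of magnitude colder than the logarithmic one, which is why DA reaches low temperatures in practical time while a d/log t schedule does not.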
To promote the understanding of DA, we note that at high temperatures the cost surface is convex under some mild assumptions. Therefore, it can be safely assumed that the initial minimum is unique, thus global. Through an appropriate cooling schedule, the DA aims at tracking the global minimum of the cost surface as the temperature is lowered and the surface gradually assumes its nonconvex form. The DA has been successfully used in various applications such as clustering, source compression/vector quantization, graph-theoretic optimization and image analysis [1], [7], [8], [9], [10]. It shares connections with the computation of rate-distortion functions in information theory [8], [14], [15]. More explicitly, it has similarities with the alternating-minimization Blahut-Arimoto (BA) algorithm for the computation of rate-distortion functions and channel capacities [16], [17]. Furthermore, due to the increased interest in tracking algorithms for surveillance and military applications [18], [19], the DA algorithm has been used in the context of the dynamic coverage control problem [9], [11], simultaneous locational optimization and multihop routing [20], and in the aggregation of graphs and Markov chains [10]. For these problems, an understanding of the computational effort required by the algorithms is important for efficient implementations.

2015 American Control Conference, Palmer House Hilton, July 1-3, 2015, Chicago, IL, USA
978-1-4799-8684-2/$31.00 ©2015 AACC
In this paper, we first review the basic principles of the DA algorithm and its connections with statistical mechanics and rate-distortion theory. We then provide a computational analysis of the DA algorithm in usual topological spaces, focusing on the geometric cooling schedule T(t) = ρT(t−1), 0 < ρ < 1. A numerical study of the DA algorithm is then performed to promote a better understanding of its practical behavior.
II. DA: THE ALGORITHM
To draw connections with rate-distortion theory later on, we use in the following terminology that applies to both the rate-distortion and the clustering setups. We assume that we have a source that produces a sequence of i.i.d. input vectors x1, x2, ..., xN according to some distribution p(x), x ∈ X. We also assume that there exists an encoding function y(x) that maps the input vector x to the best reproduction codevector in some finite set Y. In the clustering setup, the input vectors correspond to the training set, while Y corresponds to the set of some appropriately defined cluster centroids. To clarify the appropriateness notion of the centroids, a distortion definition is necessary. A distortion mapping d : X × Y → R+ quantifies the cost of associating an input vector x with a specific output vector y. It is also assumed that max_{x∈X, y∈Y} d(x, y) < ∞. The DA algorithm for clustering proceeds as follows:
1) Set the limits Kmax and Tmin.
2) Initialize: T > 2λmax(Cx), K = 1, y1 = ∑_{x∈X} x p(x) and p(y1) = 1. Here, λmax(Cx) denotes the maximum eigenvalue of the input covariance matrix Cx.
3) If K < Kmax, create codevectors according to y′j := yj + δ and y′′j := yj − δ, where δ is a random perturbation. Set p(y′j) := p(yj)/2 and p(y′′j) := p(yj)/2.
4) Update all 2K codevectors:

yi = ∑_x p(x) p(yi|x) x / p(yi),   (3)

p(yi|x) = p(yi) exp(−‖x − yi‖²/T) / ∑_{j=1}^{2K} p(yj) exp(−‖x − yj‖²/T),   (4)

p(yi) = ∑_x p(x) p(yi|x).   (5)
5) Convergence Test: If not satisfied, go to step 4).
6) If T < Tmin, perform the last iteration at T = 0 and STOP.
7) Cooling Step: T ← ρT, ρ ∈ (0, 1).
8) Go to step 3).

The temperature initialization is appropriate, since it has been shown in [1] that above this value no codevector splitting occurs. The idea of codevector splitting or phase transition will be further explained in the next section.
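The loop in steps 2)-8) can be sketched in code. The following is a minimal, illustrative implementation for scalar data with uniform p(x) = 1/N; all names (da_cluster, k_max, the perturbation magnitudes) are ours, the final T = 0 pass of step 6) is omitted, and the codevector re-perturbation at each temperature update follows the variant discussed in the Remark at the end of this section. It is a sketch, not the implementation evaluated in the paper.

```python
import math
import random

def da_cluster(xs, k_max=2, t_min=1e-3, rho=0.9, n_max=200, tol=1e-9):
    """Illustrative DA loop for scalar data with uniform p(x) = 1/N."""
    n = len(xs)
    px = 1.0 / n
    # Step 2): one codevector at the mean, T above twice the data
    # variance (the scalar analogue of 2*lambda_max(Cx)).
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    t = 2.0 * var + 1.0
    ys, pys = [mean], [1.0]
    while t > t_min:
        if len(ys) < k_max:
            # Step 3): duplicate each codevector with +/- perturbations.
            deltas = [random.uniform(1e-4, 1e-3) for _ in ys]
            ys = ([y + d for y, d in zip(ys, deltas)]
                  + [y - d for y, d in zip(ys, deltas)])
            pys = [p / 2.0 for p in pys] * 2
        else:
            # Remark variant: re-perturb at each temperature update so
            # coincident codevectors can split at a phase transition.
            ys = [y + random.uniform(-1e-4, 1e-4) for y in ys]
        # Step 4): iterate eqs. (3)-(5) until the test of Step 5) passes.
        for _ in range(n_max):
            assoc = []  # assoc[i][j] = p(y_j | x_i), eq. (4)
            for x in xs:
                w = [p * math.exp(-((x - y) ** 2) / t)
                     for y, p in zip(ys, pys)]
                z = sum(w)
                assoc.append([wi / z for wi in w])
            pys = [sum(px * a[j] for a in assoc)
                   for j in range(len(ys))]                       # eq. (5)
            new_ys = [sum(px * a[j] * x for a, x in zip(assoc, xs)) / pys[j]
                      for j in range(len(ys))]                    # eq. (3)
            done = max(abs(a - b) for a, b in zip(new_ys, ys)) < tol
            ys = new_ys
            if done:
                break
        t *= rho  # Step 7): geometric cooling
    return ys, pys
```

On two well-separated scalar clusters, the two codevectors first coincide at the global mean and then split toward the cluster means once the temperature drops below the critical value.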
Convergence test in Step 5): The convergence test can be implemented in various ways, e.g., as the norm of the difference of subsequent codevectors falling below a predefined threshold, or the difference of successive values of the implicit objective function that the DA algorithm minimizes, called the free energy, falling below a predefined tolerance. In our analysis, the test used is of the form |F(y^(n)) − F(y^(n−1))| ≤ α, where F denotes the objective function and α < 1 is chosen to be sufficiently small. In any case, for worst-case computational considerations we impose an empirical upper bound, nmax, on the maximum number of iterations in Step 4), which is an implicit function of the clustering model parameters. nmax is considered sufficiently large to be a legitimate overestimate of the iterations needed for the convergence of Step 4).
Remark: The reader may note that the DA algorithm for clustering in this paper is a variant of the mass-constrained implementation of the DA in [1]. Step 7) in the mass-constrained implementation, which calculates the critical temperatures, can be expensive in higher dimensions, since it corresponds to the solution of an eigenvalue problem. Therefore, it can be replaced by a simple perturbation. In this case, we always keep two codevectors at each location and perturb them when the temperature is updated. Above the critical temperature, these codevectors will merge naturally in Step 4). They will split when the system undergoes a phase transition.
III. CONNECTIONS TO STATISTICAL MECHANICS AND RATE-DISTORTION THEORY
Consider a system of n particles, which can be in various microstates. Each microstate is designated by a vector x = [x1, x2, ..., xn]^T, where the elements can be positions, angular momenta or spins. With each state x we associate a Hamiltonian, i.e., an energy function H(x) [15], [21]. Then, in thermal equilibrium the probability of occurrence of x is given by the Boltzmann-Gibbs distribution w(x) = exp(−βH(x))/Zn(β). Here, β = 1/(κT), where κ is Boltzmann's constant and T is the temperature. Moreover, Zn(β) is the partition function, expressed as Zn(β) = ∑_x exp(−βH(x)). Through the partition function, an important physical quantity is defined, namely the Helmholtz free energy F = −ln Zn(β)/β. The Boltzmann-Gibbs distribution can be obtained as the maximum entropy distribution under an energy constraint. In this setup, β corresponds to a Lagrange multiplier balancing the entropy and the average energy contributions in F. In the clustering context, the Hamiltonian corresponds to the average distortion D and it turns out that F = D − TH, where H is the Shannon entropy given by the formula H(X, Y) = −∑_{x,y} p(x, y) log p(x, y) [1]. Here, X, Y correspond to random vectors in X and Y, respectively.

The question that arises is how the maximum entropy principle emerges in the DA context. By the second law of thermodynamics, the entropy of an isolated system can only increase. This leads to the conclusion that the entropy is maximized at thermal equilibrium. When the system is not isolated, the maximum entropy principle is replaced by the minimum free energy principle, asserting that F cannot increase and therefore reaches its minimum at thermal equilibrium [22]. The DA algorithm is based on the minimization of the average distortion D subject to an assumption on the allowed randomness of the solution, which is dictated by the maximum entropy principle. The annealing process on the temperature plays the role of an external force or of exchanging energy between the system and its environment. As the temperature is lowered sufficiently slowly, the system is always kept at equilibrium and therefore at zero temperature it assumes its minimum energy configuration, which is the best hard clustering solution.
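The minimum free energy principle can be checked numerically. The sketch below verifies that, for fixed codevectors, the Gibbs association attains a lower F = D − TH than other associations. The names are ours, and the setup is deliberately simplified: uniform p(x), squared-error distortion, the unweighted association p(y|x) ∝ exp(−d(x,y)/T), and the constant H(X) term dropped (so H here is the conditional entropy H(Y|X)).

```python
import math

def free_energy(q, xs, ys, t):
    """F = D - T*H(Y|X) for association probabilities q[i][j] ~ q(y_j|x_i),
    with uniform p(x) and squared-error distortion."""
    n = len(xs)
    d_avg = sum(q[i][j] * (xs[i] - ys[j]) ** 2
                for i in range(n) for j in range(len(ys))) / n
    h_cond = -sum(q[i][j] * math.log(q[i][j])
                  for i in range(n) for j in range(len(ys))
                  if q[i][j] > 0) / n
    return d_avg - t * h_cond

def gibbs(xs, ys, t):
    """The maximum-entropy (Gibbs) association, minimizing F above."""
    out = []
    for x in xs:
        w = [math.exp(-((x - y) ** 2) / t) for y in ys]
        z = sum(w)
        out.append([wi / z for wi in w])
    return out
```

Both the uniform association and the hard (zero-entropy) assignment yield a strictly larger free energy than the Gibbs association at positive temperature.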
We now turn our attention to rate-distortion theory. The basic problem is to determine the minimum expected distortion D achievable at a particular rate R, given a source distribution p(x) and a distortion measure d(·, ·). One of the most critical results in this theory is that joint descriptions of sequences of random variables are always more efficient than describing each random variable separately, even when the random variables defining the sequence are independent [14]. For an i.i.d. source with distribution p(x) and bounded distortion function, a basic theorem in rate-distortion theory states that the corresponding rate-distortion function R(D) can be computed as follows [14]:

R(D) = min_{q(y|x): ∑_{x,y} p(x)q(y|x)d(x,y) ≤ D} I(X; Y).   (6)

Here, I(X; Y) denotes the mutual information. Clearly, R(D) corresponds to the minimum achievable rate at distortion D.
The rate-distortion function R(D) can also be expressed in terms of the relative entropy as follows [14]:

R(D) = min_{q∈B} min_{p∈A} D(p ‖ q).   (7)

Here, A is the set of joint distributions with marginal p(x) that satisfy the distortion constraints and B is the set of product distributions p(x)r(y) with arbitrary r(y). These function sets are convex and therefore alternating-minimization (AM) approaches can be applied to compute R(D). Since the relative entropy corresponds to a Bregman distance, such AM approaches, formulating generalized projections onto convex sets (POCS), are guaranteed to converge [23]. In the information-theoretic framework, the AM method computing R(D) is the Blahut-Arimoto (BA) algorithm [16], [17]. More explicitly, the BA algorithm assumes an initial output distribution r(y) and computes the q(y|x) that minimizes the mutual information as

q(y|x) = r(y) exp(−µd(x,y)) / ∑_y r(y) exp(−µd(x,y)),   (8)

where µ > 0 is a user-defined parameter. It then updates r(y) as r(y) = ∑_x p(x)q(y|x). These two steps are repeated till convergence. The reader may observe the similarities between these two steps and Step 4) in the DA algorithm.
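The two BA steps can be sketched as follows; this is a minimal illustration with our own naming (blahut_arimoto) and a fixed iteration count in place of a convergence test:

```python
import math

def blahut_arimoto(px, dist, mu, iters=200):
    """Alternate eq. (8) for q(y|x) with r(y) = sum_x p(x) q(y|x).
    px[i] = p(x_i), dist[i][j] = d(x_i, y_j), mu > 0 fixed."""
    ny = len(dist[0])
    r = [1.0 / ny] * ny  # initial output distribution
    for _ in range(iters):
        q = []
        for i in range(len(px)):
            w = [r[j] * math.exp(-mu * dist[i][j]) for j in range(ny)]
            z = sum(w)
            q.append([wi / z for wi in w])                    # eq. (8)
        r = [sum(px[i] * q[i][j] for i in range(len(px)))
             for j in range(ny)]                              # update r(y)
    return q, r
```

For a symmetric binary source with Hamming distortion, the output distribution stays uniform and each input is associated mostly with its own reproduction symbol, with the sharpness controlled by µ.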
Finally, to connect the dots between statistical mechanics, rate-distortion theory and the DA algorithm, we note that if w(x) is the Boltzmann-Gibbs distribution characterizing the microstate x and π(x) is an arbitrary distribution on x, it can be shown that [24]

D(π(x) ‖ w(x)) = β(Fπ − Fw),   (9)

where Fφ denotes the Helmholtz free energy defined with respect to the distribution φ. This expression certifies the correspondences of both statistical mechanics and rate-distortion theory with the DA framework.
Remark: A last notion that needs to be clarified is that of phase transitions. For the system of n particles in statistical physics, the free energy for usual models, such as the Ising or the liquid-gas models, can be expressed via mean-field approximations by the formula F = −ln Zn(β)/β [22]. The Landau theory of phase transitions explains how the shape of the free-energy surface changes as the temperature changes, leading to splittings of existing optima. This attribute is inherited by the DA algorithm. When a phase transition occurs, some of the current codevectors split.

IV. DA: COMPUTATIONAL ASPECTS
In this section, we present some mild worst-case estimates of the computational complexity of the previously presented DA algorithm under some simplifying assumptions. The analysis assumes that N and Kmax are fixed and that the training points are d-dimensional. It can be shown that the gradient of F with respect to yj is as follows:

∂F/∂yj = 2 ∑_{x∈X} p(x) p(yj|x)(yj − x),   (10)

which gives

y_j^{n+1} = y_j^n − (1/(2p(y_j^n))) ∂F/∂yj.   (11)

In eq. (11), n represents the iteration count in Step 4) and p(y_j^n) is given by eq. (5). The iterates are computed using eq. (11), which corresponds to a descent method with descent direction d_n = −p(y_j^n)^{−1} ∂F/∂yj. Clearly, d_n^T ∂F/∂yj ≤ 0, with equality holding when ∂F/∂yj = 0; hence, Step 4) converges [13].
We are now ready to examine the number of steps of the DA algorithm required for clustering in terms of basic operations (additions, subtractions, multiplications), each counting as one floating point operation, flop². We also count divisions as four flops [7].
Step 1: No cost is associated with this step.
Step 2: The computation of y1 for N training vectors requires Nd multiplications for the formation of the products xp(x) and (N − 1)d additions for the evaluation of the sum. Therefore, the total cost of this step is (2N − 1)d flops.
Step 3: Assume that we fix the temperature and that the number of available codevectors is K < Kmax. For the formation of the y′j's, we require 2Kd additions. For the evaluation of the corresponding probabilities, we need 2K divisions. Therefore, the total cost is 2K(d + 4) flops. Moreover, note that when we are not at a phase transition temperature, the duplicated codevectors will be merged together in Step 4). The total cost with respect to the evolution of K is discussed at the end of this section.
Step 4: Fix the iteration index. For the computation of yi we require N multiplications for the formation of the products p(x)p(yi|x), Nd multiplications for the formation of p(x)p(yi|x)x, (N − 1)d additions for the computation of the sum and a division for the final computation of yi. This computation is performed for 2K codevectors, yielding a total cost of 2K(N + 2Nd − d + 4) flops. We now fix x. For p(yi|x), note that the numerator is one of the terms appearing in the denominator. Each individual term can be stored to accelerate the code. We therefore focus on the denominator. For each exponent, we require d subtractions, d multiplications, d − 1 additions and 1 division, i.e., 3d + 3 flops. The evaluation of an exponential requires 8 flops [7]. Thus, the total cost per exponential becomes 24(d + 1) flops.
²In this paper, 1 flop = 1 basic operation, although the meaning of a flop is slightly different in computer architecture.
Multiplication of the exponential with p(yi) adds a flop. Hence, for each summand of the denominator we require 24d + 25 flops. Moreover, we have 2K summands in the denominator and, for each yi and fixed x, the denominator remains the same, with the numerator being one of the summands in the denominator. Hence, the computation of all summands appearing in the denominator requires 2K(24d + 25) flops. Also, for computing all the values related to p(yi|x), we require 2K − 1 additions. Therefore, the computation of the denominator of each p(yi|x) for a fixed x requires (2K(24d + 25) + 2K − 1) = (48Kd + 52K − 1) flops. Now, with all the terms available, we just need 2K divisions to calculate all p(yi|x)'s for a fixed x, which amounts to 8K more flops. Hence, for all p(yi|x)'s with a fixed x, we require (48Kd + 60K − 1) flops. Then, for all x, we have to repeat the same calculations N times, which amounts to (48NKd + 60NK − N) flops.

Finally, for each p(yi) we require N multiplications and N − 1 additions, i.e., 2N − 1 flops, leading to 4NK − 2K flops for all 2K codevectors. Combining all computations, we have (48NKd + 60NK − N) + (2K(N + 2Nd − d + 4)) + (4NK − 2K) = (52NKd + 66NK + 6K − N − 2Kd) flops.
Therefore, Step 4) requires nmax(52NKd + 66NK + 6K − N − 2Kd) flops in the worst case.
Step 5: In the worst case, nmax tests will be performed for testing convergence. It can be seen that, for the calculation of the objective function, the number of flops required is proportional to N. Hence, the number of flops required for this step is proportional to nmaxN.
Step 6: No cost is associated with this step.
Step 7: Assume that the initial temperature is chosen to be 2λmax(Cx) + δ for some small positive constant δ. The total number of temperature values is such that ρ^m(2λmax(Cx) + δ) ≤ Tmin, yielding m ≥ M = ⌈ln(Tmin/(2λmax(Cx) + δ))/ln ρ⌉. Therefore, the cost of this step is M flops.
Step 8: No cost is associated with this step.
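The count M of temperature values is easy to compute; a small helper (the name n_cooling_steps is ours):

```python
import math

def n_cooling_steps(t0, t_min, rho):
    """M = ceil( ln(t_min / t0) / ln(rho) ): the smallest m such that
    rho**m * t0 <= t_min, as in Step 7 above."""
    return math.ceil(math.log(t_min / t0) / math.log(rho))
```

With the simulation values used in Section V (an initial temperature of 1.25 × 10⁴, Tmin = 0.25 and ρ = 0.91), this gives M = 115 temperature values.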
To finish the analysis, we have to take into account the annealing process in all steps and the evolution of K in Steps 3), 4) and 5). To this end, we assume that Kmax is achievable within our temperature schedule. Suppose that the sequence of critical temperature values is T^c_1, T^c_2, T^c_3, .... Then, for all intermediate temperature values in (T^c_1, 2λmax(Cx) + δ] the value of K is always 1, for all intermediate temperature values in (T^c_2, T^c_1] the value of K is always 2, etc., due to the merging of codevectors in Step 4). If σ1 is the number of temperature values in (T^c_1, 2λmax(Cx) + δ], σ2 is the number of temperature values in (T^c_2, T^c_1], etc., we have a sequence of cardinalities σ1, σ2, ..., σKmax, σKmax+1, where σKmax+1 is the number of temperature values remaining till the end of the annealing process, for which K = Kmax. Note that this is the case no matter whether we have more critical temperatures after T^c_Kmax, since Step 3) is no longer executed. Clearly, σi is an increasing function of ρ for all i. With these definitions, we now focus on Steps 3), 4) and 5).
Step 3: It is easy to verify that the total cost of this step is

2(d + 4) ∑_{j=1}^{Kmax−1} j σj.
Step 4: The cost of this step can be easily seen to be:

∑_{j=1}^{Kmax} σj nmax(52Njd + 66Nj + 6j − N − 2jd) + σKmax+1 nmax(52NKmaxd + 66NKmax + 6Kmax − N − 2Kmaxd).   (12)
Step 5: The cost of this step turns out to be

∑_{j=1}^{Kmax} 4σj nmax j(3d − 1) + 4σKmax+1 nmax Kmax(3d − 1).   (13)
Summing the flops for the individual steps leads to a worst-case upper estimate of the investigated computational complexity. Setting σmax = max{σ1, σ2, ..., σKmax+1}, it can be easily seen that the worst-case complexity behaves as O(σmax nmax N Kmax² d).
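For concreteness, eqs. (12) and (13) and the Step 3) total can be combined into a single estimator. This is only a packaging sketch; the function name, the argument list and the convention sigma = [σ1, ..., σKmax+1] are ours:

```python
def worst_case_flops(n, k_max, d, n_max, sigma):
    """Total worst-case flops: Step 3 cost 2(d+4)*sum_{j<Kmax} j*sigma_j,
    plus eq. (12) for Step 4 and eq. (13) for Step 5."""
    def step4(j):
        return n_max * (52 * n * j * d + 66 * n * j + 6 * j - n - 2 * j * d)
    def step5(j):
        return 4 * n_max * j * (3 * d - 1)
    total = 2 * (d + 4) * sum(j * sigma[j - 1] for j in range(1, k_max))
    total += sum(sigma[j - 1] * step4(j) for j in range(1, k_max + 1))
    total += sigma[k_max] * step4(k_max)    # sigma_{Kmax+1} term of eq. (12)
    total += sum(sigma[j - 1] * step5(j) for j in range(1, k_max + 1))
    total += sigma[k_max] * step5(k_max)    # sigma_{Kmax+1} term of eq. (13)
    return total
```

The dominant term grows as σmax nmax N Kmax² d, in agreement with the order estimate above.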
Remark: Ideas for the reduction of the DA's computational complexity have been proposed in [7]. The key points in the analysis therein are to use a small number of neighboring codevectors for the update of each codevector in Step 4) and to replace the Gibbs distribution by a simpler, fuzzy-like membership function that is more easily computable and at the same time sufficiently accurate.
V. SIMULATIONS
In this section we perform a numerical study of the DA algorithm. The data points and the codevectors are assumed to be in R². We choose to implement the convergence test in Step 5) based on the difference between successive values of the free energy falling below a predefined threshold, which is assumed to be 10⁻⁷. The temperature schedule is initialized to 1.25 × 10⁴ and Tmin is set to 0.25 in all simulations. The cooling factor ρ assumes the value 0.91. For all plots, the DA algorithm is executed on multiple data sets.
In Fig. 1, the minimum, average and maximum numbers of iterations required for the convergence of Step 4) in the DA algorithm versus the number of data points are demonstrated. The comparison is performed over 100 realizations of the data points. N is varied in the range [100, 500] and Kmax = 10 is kept constant. We observe that the average and minimum numbers of iterations do not change significantly with an increase in N. The maximum, or worst-case, number of iterations to convergence among the 100 runs shows a very slight linear increase with increasing N.
Fig. 2 demonstrates the same quantities as Fig. 1, with the difference that N is now kept constant at 300, while Kmax (the maximum number of codevectors) is varied in the range [5, 25]. Again, 100 different data sets are employed. We observe that all curves show an increasing trend with increasing Kmax. Clearly, the slopes in all cases are higher than the corresponding slopes in Fig. 1. This means that the number of iterations for Step 4) is more sensitive to an increase in Kmax than in N.
Fig. 1. Minimum, average and maximum numbers of iterations required for the convergence of Step 4) in the DA algorithm vs the number of data points.

Fig. 2. Minimum, average and maximum numbers of iterations required for the convergence of Step 4) in the DA algorithm vs the allowed number of codevectors Kmax.
To check the exact behavior of the curves in Fig. 2 with respect to large orders of magnitude for Kmax, in Fig. 3 Kmax is varied over the wide range [1, 150]. In this plot, N is kept constant at 10000. Clearly, the numbers of iterations are highly dependent on Kmax.
Fig. 4 demonstrates the minimum, maximum and average execution times of the DA algorithm when N is allowed to vary in the range [200, 10000] and Kmax is kept fixed at the value 10. To this end, the Matlab function cputime was used. As demonstrated, the curves show a steady increase.
Finally, in Fig. 5 we fix N at 10000 and vary Kmax, demonstrating the same curves as in Fig. 4. As Kmax increases, the computational times show a rapid increase. Furthermore, comparing Figs. 4 and 5, we observe that the execution times are much more sensitive to an increase in Kmax than in N.
VI. CONCLUSIONS
In this paper, the DA algorithm for clustering was revisited. Several connections with statistical physics and rate-distortion theory were discussed. Upper bounds on the computational complexity of the algorithm were derived under quite mild assumptions about the data sets. The presented theoretical aspects were accompanied by a numerical study of the behavior of the algorithm to complete the treatment. Future work will aim at the determination of tighter computational complexity characterizations for the DA algorithm.

Fig. 3. Minimum, average and maximum numbers of iterations required for the convergence of Step 4) in the DA algorithm vs the allowed number of codevectors Kmax.

Fig. 4. Kmax = 10: Minimum, average and maximum time required for the DA algorithm vs the number of data points N.

Fig. 5. N = 10000: Minimum, average and maximum time required for the DA algorithm vs Kmax.
REFERENCES
[1] K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems", Proceedings of the IEEE, vol. 86, no. 11, pp. 2210–2239, Nov. 1998.
[2] D. Bertsimas, J. Tsitsiklis, "Simulated Annealing", Statistical Science, vol. 8, no. 1, pp. 10–15, 1993.
[3] B. Hajek, "A Tutorial Survey of Theory and Applications of Simulated Annealing", Proceedings of the 24th Conf. on Decision and Control, pp. 755–760, NY, 1985.
[4] H. H. Szu, R. L. Hartley, "Nonconvex Optimization by Fast Simulated Annealing", Proceedings of the IEEE, vol. 75, no. 11, pp. 1538–1540, Nov. 1987.
[5] S. Geman, D. Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. PAMI-6, no. 6, pp. 721–741, Nov. 1984.
[6] B. Hajek, "Cooling Schedules for Optimal Annealing", Math. Oper. Res., vol. 13, pp. 311–329, 1988.
[7] K. Demirciler, A. Ortega, "Reduced-Complexity Deterministic Annealing for Vector Quantizer Design", EURASIP Journal on Applied Sig. Proc., vol. 2005, no. 12, pp. 1807–1820, 2005.
[8] K. Rose, "A Mapping Approach to Rate-Distortion Computation and Analysis", IEEE Trans. on Inf. Theory, vol. 40, no. 6, pp. 1939–1952, Nov. 1994.
[9] P. Sharma, S. M. Salapaka, C. L. Beck, "Entropy-Based Framework for Dynamic Coverage and Clustering Problems", IEEE Trans. on Automatic Control, vol. 57, no. 1, pp. 135–150, Jan. 2012.
[10] Y. Xu, S. M. Salapaka, C. L. Beck, "Aggregation of Graph Models and Markov Chains by Deterministic Annealing", IEEE Trans. on Automatic Control, vol. 59, no. 10, pp. 2807–2812, Oct. 2014.
[11] Y. Xu, S. M. Salapaka, C. L. Beck, "Clustering and Coverage Control for Systems With Acceleration-Driven Dynamics", IEEE Trans. on Automatic Control, vol. 59, no. 5, pp. 1342–1347, May 2014.
[12] P. Sharma, S. M. Salapaka, C. L. Beck, "A Scalable Approach to Combinatorial Library Design for Drug Discovery", J. Chem. Inf. Model., vol. 48, no. 1, pp. 27–41, 2008.
[13] S. M. Salapaka, A. Khalak, M. A. Dahleh, "Constraints on locational optimization problems", Proceedings of the 42nd Conf. on Decision and Control, vol. 2, pp. 1741–1746, 2003.
[14] T. M. Cover, J. A. Thomas, Elements of Information Theory, John Wiley & Sons, Inc., 1991.
[15] E. T. Jaynes, "Information Theory and Statistical Mechanics", Physical Review, vol. 108, no. 2, pp. 171–190, Oct. 1957.
[16] S. Arimoto, "An Algorithm for Calculating the Capacity of an Arbitrary Discrete Memoryless Channel", IEEE Trans. on Inf. Theory, vol. IT-18, pp. 14–20, Jan. 1972.
[17] R. E. Blahut, "Computation of Channel Capacity and Rate-Distortion Functions", IEEE Trans. on Inf. Theory, vol. IT-18, pp. 460–473, July 1972.
[18] J. Cortés, S. Martinez, T. Karatas, F. Bullo, "Coverage Control for Mobile Sensing Networks", IEEE Trans. Robot. Autom., vol. 20, no. 2, pp. 243–255, Feb. 2004.
[19] E. Frazzoli, F. Bullo, "Decentralized Algorithms for Vehicle Routing in a Stochastic Time-Varying Environment", Proceedings of Conf. on Decision and Control, 2004, pp. 3357–3363.
[20] N. V. Kale, S. M. Salapaka, "Maximum Entropy Principle-Based Algorithm for Simultaneous Resource Location and Multihop Routing in Multiagent Networks", IEEE Trans. on Mobile Computing, vol. 11, no. 4, pp. 591–602, Apr. 2012.
[21] W. Krauth, Statistical Mechanics: Algorithms and Computations, Oxford University Press, 2006.
[22] D. Arovas, Lecture Notes on Thermodynamics and Statistical Mechanics, available at: http://www-physics.ucsd.edu/students/courses/spring2010/physics210a/LECTURES/210_COURSE.pdf
[23] C. Byrne, "Iterative Projection onto Convex Sets using Multiple Bregman Distances", Inverse Problems, vol. 15, pp. 1295–1313, 1999.
[24] G. B. Bagci, "Some remarks on Rényi relative entropy in a thermostatistical framework", arXiv:cond-mat/0703008v2.