
An Introduction to Deterministic Annealing

Fabrice Rossi

SAMM – Université Paris 1

2012


Optimization

I in machine learning:
  I optimization is one of the central tools
  I methodology:
    I choose a model with some adjustable parameters
    I choose a goodness of fit measure of the model to some data
    I tune the parameters in order to maximize the goodness of fit
  I examples: artificial neural networks, support vector machines, etc.
I in other fields:
  I operational research: project planning, routing, scheduling, etc.
  I design: antennas, wings, engines, etc.
  I etc.

Easy or Hard?

I is optimization computationally difficult?
I the convex case is relatively easy:
  I \min_{x \in C} J(x) with C convex and J convex
  I polynomial algorithms (in general)
  I why?
    I a local minimum is global
    I the local tangent hyperplane is a global lower bound
I the non convex case is hard:
  I multiple minima
  I no local to global inference
  I NP hard in some cases


In machine learning

I convex case:
  I linear models with(out) regularization (ridge, lasso)
  I kernel machines (SVM and friends)
  I nonlinear projections (e.g., semidefinite embedding)
I non convex:
  I artificial neural networks (such as multilayer perceptrons)
  I vector quantization (a.k.a. prototype based clustering)
  I general clustering
  I optimization with respect to meta parameters:
    I kernel parameters
    I discrete parameters, e.g., feature selection
I in this lecture, non convex problems:
  I combinatorial optimization
  I mixed optimization


Particular cases

I Pure combinatorial optimization problems:
  I a solution space M: finite but (very) large
  I an error measure E from M to R
  I goal: solve
      M^* = \arg\min_{M \in \mathcal{M}} E(M)
  I example: graph clustering
I Mixed problems:
  I a solution space M × R^p
  I an error measure E from M × R^p to R
  I goal: solve
      (M^*, W^*) = \arg\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W)
  I example: clustering in R^n


Graph clustering

Goal: find an optimal clustering of a graph with N nodes into K clusters

I M: set of all partitions of {1, . . . , N} into K classes (the asymptotic behavior of |M| is roughly K^{N-1} for a fixed K with N → ∞)
I assignment matrix: a partition in M is described by an N × K matrix M such that M_{ik} ∈ {0, 1} and \sum_{k=1}^K M_{ik} = 1 (see the sketch below)
I many error measures are available:
  I graph cut measures (node normalized, edge normalized, etc.)
  I modularity
  I etc.

Graph clustering

I original graph
I four clusters

[figure: the example graph, with nodes labelled 1 to 34, and its partition into four clusters]

Graph visualization

I 2386 persons: unreadable
I maximal modularity clustering in 39 clusters
I hierarchical display

[figures: the full graph of 2386 persons (legend: woman, bisexual man, heterosexual man), the clustered view with clusters labelled 1 to 39, and the hierarchical display]

Vector quantization

I N observations (x_i)_{1 ≤ i ≤ N} in R^n
I M: set of all partitions
I quantization error:
    E(M, W) = \sum_{i=1}^N \sum_{k=1}^K M_{ik} \|x_i - w_k\|^2
I the "continuous" parameters are the prototypes w_k ∈ R^n
I the assignment matrix notation is equivalent to the standard formulation
    E(M, W) = \sum_{i=1}^N \|x_i - w_{k(i)}\|^2,
  where k(i) is the index of the cluster to which x_i is assigned (both forms are computed in the sketch below)
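A minimal numerical sketch (illustration only, with made-up random data) of the equivalence between the assignment matrix form and the k(i) form of the quantization error:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))                          # N = 10 observations in R^2
W = rng.normal(size=(3, 2))                           # K = 3 prototypes

d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)   # ||x_i - w_k||^2, shape (N, K)
labels = d2.argmin(axis=1)                            # k(i): nearest prototype
M = np.eye(3)[labels]                                 # N x K assignment matrix

E_matrix = (M * d2).sum()                             # sum_i sum_k M_ik ||x_i - w_k||^2
E_direct = ((X - W[labels]) ** 2).sum()               # sum_i ||x_i - w_{k(i)}||^2
assert np.isclose(E_matrix, E_direct)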

Example

[figures: a two-dimensional data set plotted in the (X1, X2) plane, used as a vector quantization example]

Combinatorial optimization

I a research field by itself
I many problems are NP hard... and one therefore relies on heuristics or specialized approaches:
  I relaxation methods
  I branch and bound methods
  I stochastic approaches:
    I simulated annealing
    I genetic algorithms
I in this presentation: deterministic annealing


Outline

Introduction

Mixed problems
  Soft minimum
  Computing the soft minimum
  Evolution of β

Deterministic Annealing
  Annealing
  Maximum entropy
  Phase transitions
  Mass constrained deterministic annealing

Combinatorial problems
  Expectation approximations
  Mean field annealing
  In practice


Optimization strategies

I mixed problem transformation
    (M^*, W^*) = \arg\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W)
I remove the continuous part
    \min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{M \in \mathcal{M}} \Big( M \mapsto \min_{W \in \mathbb{R}^p} E(M, W) \Big)
I or the combinatorial part
    \min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{W \in \mathbb{R}^p} \Big( W \mapsto \min_{M \in \mathcal{M}} E(M, W) \Big)
I this is not alternate optimization


Alternate optimization

I elementary heuristics:
  1. start with a random configuration M^0 ∈ M
  2. compute W^i = \arg\min_{W \in \mathbb{R}^p} E(M^{i-1}, W)
  3. compute M^i = \arg\min_{M \in \mathcal{M}} E(M, W^i)
  4. go back to 2 until convergence
I e.g., the k-means algorithm for vector quantization (sketched in code below):
  1. start with a random partition M^0 ∈ M
  2. compute the optimal prototypes with
      W^i_k = \frac{1}{\sum_{j=1}^N M^{i-1}_{jk}} \sum_{j=1}^N M^{i-1}_{jk} x_j
  3. compute the optimal partition with M^i_{jk} = 1 if and only if k = \arg\min_{1 \le l \le K} \|x_j - W^i_l\|^2
  4. go back to 2 until convergence
I alternate optimization converges to a local minimum

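A minimal k-means sketch following the alternate optimization heuristic above (illustration only, not the author's code; empty clusters are simply re-seeded with a random observation):

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))        # random initial partition M^0
    for _ in range(n_iter):
        # step 2: optimal prototypes = cluster means (re-seed empty clusters)
        W = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                      else X[rng.integers(len(X))] for k in range(K)])
        # step 3: optimal partition = nearest prototype
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # convergence
            break
        labels = new_labels
    return W, labels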

Combinatorial first

I let us consider the combinatorial first approach:
    \min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{W \in \mathbb{R}^p} \Big( W \mapsto \min_{M \in \mathcal{M}} E(M, W) \Big)
I not attractive a priori, as W ↦ \min_{M \in \mathcal{M}} E(M, W):
  I has no reason to be convex
  I has no reason to be C1
I vector quantization example:
    F(W) = \sum_{i=1}^N \min_{1 \le k \le K} \|x_i - w_k\|^2
  is neither convex nor C1


Example

Clustering elements of R into 2 clusters (we will come back to this example).

[figure: the data points on the real line]

[figure: contour plot of ln F(W) over (w1, w2); multiple local minima and singularities: minimizing F(W) is difficult]

Soft minimum

I a soft minimum approximation of \min_{M \in \mathcal{M}} E(M, W) (see the sketch below)
    F_\beta(W) = -\frac{1}{\beta} \ln \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))
  solves the regularity issue: if E(M, ·) is C1, then F_\beta(W) is also C1
I if M^*(W) is such that for all M ≠ M^*(W), E(M, W) > E(M^*(W), W), then
    \lim_{\beta \to \infty} F_\beta(W) = E(M^*(W), W) = \min_{M \in \mathcal{M}} E(M, W)

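A minimal numerical sketch of the soft minimum (illustration only; the toy error values are made up), including its convergence to the true minimum as β grows and the bound stated on the Discussion slide below:

import numpy as np

def soft_min(errors, beta):
    # F_beta = -(1/beta) * ln sum_M exp(-beta * E(M)), with a stabilized log-sum-exp
    e = np.asarray(errors, dtype=float)
    m = (-beta * e).max()
    return -(m + np.log(np.exp(-beta * e - m).sum())) / beta

errors = np.array([3.0, 1.2, 0.7, 2.5])          # toy values of E(M, W) over M
for beta in [0.01, 0.1, 1.0, 10.0, 100.0]:
    f = soft_min(errors, beta)
    # bound: -ln|M|/beta <= F_beta - min <= 0
    assert errors.min() - np.log(len(errors)) / beta <= f <= errors.min() + 1e-12
    print(beta, f)                               # tends to min(errors) = 0.7 as beta grows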

Example

[figures: E(M, W) as a function of M, then, for β = 10^-2, 10^-1.5, 10^-1, 10^-0.5, 10^0, 10^0.5, 10^1, 10^1.5 and 10^2, the weights exp(-β E(M, W)) compared to exp(-β E(M*, W)), and finally the corresponding values of β on a logarithmic axis from 10^-2 to 10^2]

Discussion

I positive aspects:
  I if E(M, W) is C1 with respect to W for all M, then F_\beta is C1
  I \lim_{\beta \to \infty} F_\beta(W) = \min_{M \in \mathcal{M}} E(M, W)
  I ∀ β > 0 (a short derivation of this bound follows below),
      -\frac{1}{\beta} \ln |\mathcal{M}| \le F_\beta(W) - \min_{M \in \mathcal{M}} E(M, W) \le 0
  I when β is close to 0, F_\beta is generally easy to minimize
I negative aspects:
  I how to compute F_\beta(W) efficiently?
  I how to solve \min_{W \in \mathbb{R}^p} F_\beta(W)?
  I why would that work in practice?
I philosophy: where does F_\beta(W) come from?

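The two-sided bound above can be checked in a few lines; this derivation is added here for completeness and is not part of the original slides. Write m(W) = \min_{M \in \mathcal{M}} E(M, W). Every term of the sum is at most e^{-\beta m(W)}, and the term M = M^*(W) alone already contributes e^{-\beta m(W)}, so

    e^{-\beta m(W)} \;\le\; \sum_{M \in \mathcal{M}} e^{-\beta E(M, W)} \;\le\; |\mathcal{M}|\, e^{-\beta m(W)} .

Applying -\frac{1}{\beta} \ln(\cdot), which reverses the inequalities, gives

    m(W) - \frac{1}{\beta} \ln |\mathcal{M}| \;\le\; F_\beta(W) \;\le\; m(W),

i.e. -\frac{1}{\beta} \ln |\mathcal{M}| \le F_\beta(W) - \min_{M \in \mathcal{M}} E(M, W) \le 0, as stated above.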


Computing Fβ(W)

I in general, computing F_\beta(W) is intractable: exhaustive calculation of E(M, W) on the whole set M
I a particular case: clustering with an additive cost, i.e.
    E(M, W) = \sum_{i=1}^N \sum_{k=1}^K M_{ik} e_{ik}(W),
  where M is an assignment matrix (and, e.g., e_{ik}(W) = \|x_i - w_k\|^2)
I then
    F_\beta(W) = -\frac{1}{\beta} \sum_{i=1}^N \ln \sum_{k=1}^K \exp(-\beta e_{ik}(W))
  with computational cost O(NK) (compared to ≈ K^{N-1}); see the sketch below

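A minimal sketch (illustration only, with made-up random data) comparing the O(NK) formula with a brute-force sum over all K^N assignments; the two only match because the cost is additive, and the brute force is feasible only because N and K are tiny:

import numpy as np
from itertools import product

rng = np.random.default_rng(1)
N, K, beta = 5, 3, 2.0
X = rng.normal(size=(N, 2))
W = rng.normal(size=(K, 2))
e = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)    # e_ik(W), shape (N, K)

# additive-cost formula: F_beta = -(1/beta) * sum_i ln sum_k exp(-beta * e_ik)
a = (-beta * e).max(axis=1, keepdims=True)            # stabilization
F_fast = -(a.ravel() + np.log(np.exp(-beta * e - a).sum(axis=1))).sum() / beta

# brute force over the whole set M
Z = sum(np.exp(-beta * e[np.arange(N), list(assign)].sum())
        for assign in product(range(K), repeat=N))
F_slow = -np.log(Z) / beta
assert np.isclose(F_fast, F_slow)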

Proof sketch

I let Z_\beta(W) be the partition function given by
    Z_\beta(W) = \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))
I assignments in M are independent and the sum can be rewritten
    Z_\beta(W) = \sum_{M_{1\cdot} \in C_K} \cdots \sum_{M_{N\cdot} \in C_K} \exp(-\beta E(M, W)),
  with C_K = {(1, 0, . . . , 0), . . . , (0, . . . , 0, 1)}
I then (the last step is spelled out below)
    Z_\beta(W) = \prod_{i=1}^N \sum_{M_{i\cdot} \in C_K} \exp\Big(-\beta \sum_k M_{ik} e_{ik}(W)\Big)
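The remaining step, not spelled out on the slide, is immediate: each row M_{i·} has exactly one nonzero entry, so the inner sum enumerates the K possible choices,

    Z_\beta(W) = \prod_{i=1}^{N} \sum_{k=1}^{K} \exp(-\beta\, e_{ik}(W)),

and therefore

    F_\beta(W) = -\frac{1}{\beta} \ln Z_\beta(W) = -\frac{1}{\beta} \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \exp(-\beta\, e_{ik}(W)).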

Independence

I the additive cost corresponds to some form of conditional independence
I given the prototypes, observations are assigned independently to their optimal clusters
I in other words: the global optimal assignment is the concatenation of the individual optimal assignments
I this is what makes alternate optimization tractable in the K-means despite its combinatorial aspect:
    \min_{M \in \mathcal{M}} \sum_{i=1}^N \sum_{k=1}^K M_{ik} \|x_i - w_k\|^2

Minimizing Fβ(W)

I assume E to be C1 with respect to W; then
    \nabla F_\beta(W) = \frac{\sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))\, \nabla_W E(M, W)}{\sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))}
I at a minimum, ∇F_\beta(W) = 0 ⇒ solve the equation or use gradient descent
I same calculation problem as for F_\beta(W): in general, an exhaustive scan of M is needed
I if E is additive
    \nabla F_\beta(W) = \sum_{i=1}^N \frac{\sum_{k=1}^K \exp(-\beta e_{ik}(W))\, \nabla_W e_{ik}(W)}{\sum_{k=1}^K \exp(-\beta e_{ik}(W))}


Fixed point scheme

I a simple strategy to solve ∇F_\beta(W) = 0
I starting from a random value of W:
  1. compute (expectation phase)
      \mu_{ik} = \frac{\exp(-\beta e_{ik}(W))}{\sum_{l=1}^K \exp(-\beta e_{il}(W))}
  2. keeping the \mu_{ik} constant, solve for W (maximization phase)
      \sum_{i=1}^N \sum_{k=1}^K \mu_{ik} \nabla_W e_{ik}(W) = 0
  3. loop to 1 until convergence
I generally converges to a minimum of F_\beta(W) (for a fixed β)


Vector quantization

I if e_{ik}(W) = \|x_i - w_k\|^2 (vector quantization),
    \nabla_{w_k} F_\beta(W) = 2 \sum_{i=1}^N \mu_{ik}(W) (w_k - x_i),
  with
    \mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^K \exp(-\beta \|x_i - w_l\|^2)}
I setting ∇F_\beta(W) = 0 leads to
    w_k = \frac{1}{\sum_{i=1}^N \mu_{ik}(W)} \sum_{i=1}^N \mu_{ik}(W) x_i
  (a fixed point sketch in code follows below)
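A minimal sketch (illustration only) of the fixed point / EM-like scheme for vector quantization at a fixed β: soft memberships, then prototype updates as weighted means, iterated until stabilization.

import numpy as np

def soft_memberships(X, W, beta):
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)     # ||x_i - w_k||^2
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    mu = np.exp(logits)
    return mu / mu.sum(axis=1, keepdims=True)                # mu_ik

def fixed_point(X, W, beta, tol=1e-8, max_iter=500):
    for _ in range(max_iter):
        mu = soft_memberships(X, W, beta)
        W_new = (mu.T @ X) / mu.sum(axis=0)[:, None]         # weighted means
        if np.abs(W_new - W).max() < tol:
            return W_new, mu
        W = W_new
    return W, mu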

Links with the K-means

I if \mu_{il} = \delta_{k(i)=l}, we obtain the prototype update rule of the K-means
I in the K-means, we have
    k(i) = \arg\min_{1 \le l \le K} \|x_i - w_l\|^2
I in deterministic annealing, we have
    \mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^K \exp(-\beta \|x_i - w_l\|^2)}
I this is a soft minimum version of the crisp rule of the K-means

Links with EM

I isotropic Gaussian mixture with a unique and fixed variance ε
    p_k(x \mid w_k) = \frac{1}{(2\pi\varepsilon)^{p/2}} e^{-\frac{1}{2\varepsilon} \|x - w_k\|^2}
I given the mixing coefficients δ_k, the responsibilities are
    P(x_i \in C_k \mid x_i, W) = \frac{\delta_k e^{-\frac{1}{2\varepsilon} \|x_i - w_k\|^2}}{\sum_{l=1}^K \delta_l e^{-\frac{1}{2\varepsilon} \|x_i - w_l\|^2}}
I identical update rules for W
I β can be seen as an inverse variance: it quantifies the uncertainty about the clustering results

So far...

I if E(M, W) = \sum_{i=1}^N \sum_{k=1}^K M_{ik} e_{ik}(W), one tries to reach \min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) using
    F_\beta(W) = -\frac{1}{\beta} \sum_{i=1}^N \ln \sum_{k=1}^K \exp(-\beta e_{ik}(W))
I F_\beta(W) is smooth
I a fixed point EM-like scheme can be used to minimize F_\beta(W) for a fixed β
I but:
  I the k-means algorithm is also an EM-like algorithm: it does not reach a global optimum
  I how to handle β → ∞?



Limit case β → 0

I limit behavior of the EM scheme:
  I \mu_{ik} = 1/K for all i and k (and all W!)
  I W is therefore a solution of
      \sum_{i=1}^N \sum_{k=1}^K \nabla_W e_{ik}(W) = 0
  I no iteration needed
I vector quantization:
  I we have
      w_k = \frac{1}{N} \sum_{i=1}^N x_i
  I each prototype is the center of mass of the data: only one cluster!
I in general, the case β → 0 is easy (unique minimum under mild hypotheses)


Example

Back to the clustering of elements of R into 2 clusters.

[figures: contour plots over (w1, w2) of the original problem min_M E(M, W), of the simple soft minimum with β = 10^-1, and of a more complex soft minimum with β = 0.5]

Increasing β

I when β increases, F_\beta(W) converges to \min_{M \in \mathcal{M}} E(M, W)
I optimizing F_\beta(W) becomes more and more complex:
  I local minima
  I rougher and rougher: F_\beta(W) remains C1 but with large values for the gradient at some points
I path following strategy (homotopy):
  I for an increasing series (β_l)_l with β_0 = 0
  I initialize W^*_0 = \arg\min_W F_0(W) for β = 0 by solving \sum_{i=1}^N \sum_{k=1}^K \nabla_W e_{ik}(W) = 0
  I compute W^*_l via the EM scheme for β_l, starting from W^*_{l-1}
I a similar strategy is used in interior point algorithms


Example

Clustering 6 elements of R into 2 classes: x = 1.00, 2.00, 3.00, 7.00, 7.50, 8.25

Behavior of Fβ(W)

[figures: contour plots over (w1, w2) of min_M E(M, W), then of Fβ(W) for β = 10^-2, 10^-1.75, 10^-1.5, 10^-1.25, 10^-1, 10^-0.75, 10^-0.5, 10^-0.25 and 1, and finally of min_M E(M, W) again for comparison]

One step closer to the final algorithm

I given an increasing series (β_l)_l with β_0 = 0:
  1. compute W^*_0 such that \sum_{i=1}^N \sum_{k=1}^K \nabla_W e_{ik}(W^*_0) = 0
  2. for l = 1 to L:
     2.1 initialize W^*_l to W^*_{l-1}
     2.2 compute \mu_{ik} = \frac{\exp(-\beta_l e_{ik}(W^*_l))}{\sum_{m=1}^K \exp(-\beta_l e_{im}(W^*_l))}
     2.3 update W^*_l such that \sum_{i=1}^N \sum_{k=1}^K \mu_{ik} \nabla_W e_{ik}(W^*_l) = 0
     2.4 go back to 2.2 until convergence
  3. use W^*_L as an estimate of \arg\min_{W \in \mathbb{R}^p} \big( \min_{M \in \mathcal{M}} E(M, W) \big)
I this is (up to some technical details) the deterministic annealing algorithm for minimizing E(M, W) = \sum_{i=1}^N \sum_{k=1}^K M_{ik} e_{ik}(W); a sketch in code follows below
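A minimal end-to-end sketch (illustration only, not the author's implementation) of this algorithm for vector quantization, with e_ik(W) = ||x_i - w_k||^2. The geometric β schedule and the tiny perturbation of the prototypes at each β are choices of this sketch; the perturbation anticipates the phase transition discussion later in the presentation.

import numpy as np

def deterministic_annealing(X, K, betas, inner_iter=200, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    W = np.tile(X.mean(axis=0), (K, 1))           # beta -> 0 solution: the global mean
    for beta in betas:
        W = W + 1e-6 * rng.normal(size=W.shape)   # small noise to leave unstable fixed points
        for _ in range(inner_iter):
            d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
            logits = -beta * d2
            logits -= logits.max(axis=1, keepdims=True)
            mu = np.exp(logits)
            mu /= mu.sum(axis=1, keepdims=True)            # memberships mu_ik
            W_new = (mu.T @ X) / mu.sum(axis=0)[:, None]   # weighted means
            converged = np.abs(W_new - W).max() < tol
            W = W_new
            if converged:
                break
    return W, mu

# usage sketch (hypothetical data):
# W, mu = deterministic_annealing(X, K=4, betas=np.logspace(-2, 2, 40))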

What's next?

I general topics:
  I why the name annealing?
  I why should that work? (and other philosophical issues)
I specific topics:
  I technical "details" (e.g., annealing schedule)
  I generalization:
    I non additive criteria
    I general combinatorial problems


Annealing

I from Wikipedia:
    Annealing is a heat treatment wherein a material is altered, causing changes in its properties such as strength and hardness. It is a process that produces conditions by heating to above the re-crystallization temperature and maintaining a suitable temperature, and then cooling.
I in our context, we'll see that:
  I T = 1/(k_B β) acts as a temperature and models some thermal agitation
  I E(W, M) is the energy of a configuration of the system, while F_β(W) corresponds to the free energy of the system at a given temperature
  I increasing β reduces the thermal agitation


Simulated Annealing

I a classical combinatorial optimization algorithm for computing \min_{M \in \mathcal{M}} E(M)
I given an increasing series (β_l)_l (sketched in code below):
  1. choose a random initial configuration M_0
  2. for l = 1 to L
     2.1 take a small random step from M_{l-1} to build M^c_l (e.g., change the cluster of an object)
     2.2 if E(M^c_l) < E(M_{l-1}), then set M_l = M^c_l
     2.3 otherwise, set M_l = M^c_l with probability exp(-β_l (E(M^c_l) - E(M_{l-1}))), and M_l = M_{l-1} in the other case
  3. use M_L as an estimate of \arg\min_{M \in \mathcal{M}} E(M)
I naive vision:
  I always accept improvements
  I "thermal agitation" allows escaping local minima
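A minimal simulated annealing sketch (illustration only) for a generic clustering criterion over label vectors; the proposal simply moves one randomly chosen object to a random cluster, and E is any user-supplied error function:

import numpy as np

def simulated_annealing(E, N, K, betas, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=N)              # random initial configuration
    energy = E(labels)
    for beta in betas:                            # increasing beta = decreasing temperature
        cand = labels.copy()
        cand[rng.integers(N)] = rng.integers(K)   # small random step
        delta = E(cand) - energy
        if delta < 0 or rng.random() < np.exp(-beta * delta):
            labels, energy = cand, energy + delta # accept (always when it improves)
    return labels, energy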

Statistical physics

I consider a system with state space M
I denote E(M) the energy of the system when in state M
I at thermal equilibrium with the environment, the probability for the system to be in state M is given by the Boltzmann (Gibbs) distribution
    P_T(M) = \frac{1}{Z_T} \exp\left(-\frac{E(M)}{k_B T}\right)
  with T the temperature, k_B Boltzmann's constant, and
    Z_T = \sum_{M \in \mathcal{M}} \exp\left(-\frac{E(M)}{k_B T}\right)

Back to annealing

I physical analogy:
  I maintain the system at thermal equilibrium:
    I set a temperature
    I wait for the system to settle at this temperature
  I slowly decrease the temperature:
    I works well in real systems (e.g., crystallization)
    I allows the system to explore the state space
I computer implementation:
  I direct computation of P_T(M) (or related quantities)
  I sampling from P_T(M)

Simulated Annealing Revisited

I simulated annealing samples from P_T(M)!
I more precisely: the asymptotic distribution of M_l for a fixed β is given by P_{1/(k_B β)}
I how? Metropolis-Hastings Markov Chain Monte Carlo
I principle:
  I P is the target distribution
  I Q(· | ·) is the proposal distribution (sampling friendly)
  I start with x^t, get x' from Q(x | x^t)
  I set x^{t+1} to x' with probability
      \min\left(1, \frac{P(x')\, Q(x^t \mid x')}{P(x^t)\, Q(x' \mid x^t)}\right)
  I keep x^{t+1} = x^t when this fails


Simulated Annealing Revisited

I in SA, Q is the random local perturbation
I major feature of Metropolis-Hastings MCMC:
    \frac{P_T(M')}{P_T(M)} = \exp\left(-\frac{E(M') - E(M)}{k_B T}\right)
  so Z_T is not needed
I underlying assumption: symmetric proposals
    Q(x^t \mid x') = Q(x' \mid x^t)
I rationale:
  I sampling directly from P_T(M) for small T is difficult
  I track likely areas during cooling

Mixed problems

I Simulated Annealing applies to combinatorial problems
I in mixed problems, this would mean removing the continuous part
    \min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{M \in \mathcal{M}} \Big( M \mapsto \min_{W \in \mathbb{R}^p} E(M, W) \Big)
I in the vector quantization example
    E(M) = \sum_{i=1}^N \sum_{k=1}^K M_{ik} \left\| x_i - \frac{1}{\sum_{j=1}^N M_{jk}} \sum_{j=1}^N M_{jk} x_j \right\|^2
I no obvious direct relation to the proposed algorithm

Statistical physics

I thermodynamic (free) energy: the useful energy available in a system (a.k.a. the one that can be extracted to produce work)
I the Helmholtz (free) energy is F_T = U − TS, where U is the internal energy of the system and S its entropy
I one can show that
    F_T = -k_B T \ln Z_T
I the natural evolution of a system is to reduce its free energy
I deterministic annealing mimics the cooling process of a system by tracking the evolution of the minimal free energy

Gibbs distribution

I still no use of P_T...
I important property (illustrated by the numerical sketch below): for f a function defined on M, with
    E_{P_T}(f(M)) = \frac{1}{Z_T} \sum_{M \in \mathcal{M}} f(M) \exp\left(-\frac{E(M)}{k_B T}\right),
  we have
    \lim_{T \to 0} E_{P_T}(f(M)) = f(M^*),
  where M^* = \arg\min_{M \in \mathcal{M}} E(M)
I useful to track global information

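A minimal numerical illustration (not from the slides; the toy energies and f values are made up) of the property above: on a small enumerable state space, the expectation of f under the Gibbs distribution tends to f(M*) as T goes to 0 (k_B is set to 1 here).

import numpy as np

energies = np.array([2.0, 0.3, 1.1, 0.9])        # E(M) for 4 toy states
f_values = np.array([10.0, -1.0, 4.0, 7.0])      # f(M) for the same states
f_star = f_values[energies.argmin()]             # f(M*) = -1.0

for T in [10.0, 1.0, 0.1, 0.01]:
    w = np.exp(-(energies - energies.min()) / T) # Boltzmann weights
    p = w / w.sum()
    print(T, (p * f_values).sum())               # approaches f_star as T shrinks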

Membership functions

I in the EM-like phase, we have
    \mu_{ik} = \frac{\exp(-\beta e_{ik}(W))}{\sum_{l=1}^K \exp(-\beta e_{il}(W))}
I this does not come out of thin air (a short derivation follows below):
    \mu_{ik} = E_{P_{\beta,W}}(M_{ik}) = \sum_{M \in \mathcal{M}} M_{ik} P_{\beta,W}(M)
I \mu_{ik} is therefore the probability for x_i to belong to cluster k under the Gibbs distribution
I in the limit β → ∞, the \mu_{ik} peak to Dirac-like membership functions: they give the "optimal" partition

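Derivation of the identity above (added for completeness, using the factorization of the partition function from the earlier proof sketch): for the additive cost E(M, W) = \sum_{i,k} M_{ik} e_{ik}(W), the Gibbs distribution factorizes over the rows of M, so the marginal distribution of row i is

    P_{\beta,W}(M_{i\cdot} = c_k) = \frac{\exp(-\beta\, e_{ik}(W))}{\sum_{l=1}^{K} \exp(-\beta\, e_{il}(W))},

where c_k denotes the binary vector with a single 1 in position k. Since M_{ik} \in \{0, 1\}, its expectation under P_{\beta,W} is exactly this marginal probability, which is the \mu_{ik} used in the EM-like phase.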

Example

The same 6 elements of R clustered into 2 classes: x = 1.00, 2.00, 3.00, 7.00, 7.50, 8.25

Membership functions

[figures: for a sequence of increasing β values, the current position in the (w1, w2) landscape and the membership probabilities µ_ik of the six points, shown as bar plots over x = 1, 2, 3, 7, 7.5, 8.25; two of the snapshots are labelled "before phase transition" and "after phase transition"]

Summary

I two principles:
  I physical systems tend to reach a state of minimal free energy at a given temperature
  I when the temperature is slowly decreased, a system tends to reach a well organized state
I annealing algorithms tend to reproduce this slow cooling behavior
I simulated annealing:
  I uses Metropolis-Hastings MCMC to sample the Gibbs distribution
  I easy to implement, but needs a very slow cooling
I deterministic annealing:
  I direct minimization of the free energy
  I computes expectations of global quantities with respect to the Gibbs distribution
  I aggressive cooling is possible, but the Z_T computation must be tractable


Maximum entropy

I another interpretation/justification of DA
I reformulation of the problem: find a probability distribution on M which is "regular" and gives a low average error
I in other words: find P_W such that
  I \sum_{M \in \mathcal{M}} P_W(M) = 1
  I \sum_{M \in \mathcal{M}} E(W, M) P_W(M) is small
  I the entropy of P_W, -\sum_{M \in \mathcal{M}} P_W(M) \ln P_W(M), is high
I entropy plays a regularization role
I this leads to the minimization of
    \sum_{M \in \mathcal{M}} E(W, M) P_W(M) + \frac{1}{\beta} \sum_{M \in \mathcal{M}} P_W(M) \ln P_W(M),
  where β sets the trade-off between fitting and regularity


Maximum entropy

I some calculations (sketched below) show that the minimum over P_W is given by
    P_{\beta,W}(M) = \frac{1}{Z_{\beta,W}} \exp(-\beta E(M, W)),
  with
    Z_{\beta,W} = \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))
I plugged back into the optimization criterion, we end up with the soft minimum
    F_\beta(W) = -\frac{1}{\beta} \ln \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))
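Sketch of the "some calculations" (added for completeness, not part of the original slides): minimize over P_W the Lagrangian

    \mathcal{L}(P_W, \lambda) = \sum_{M} E(W, M) P_W(M) + \frac{1}{\beta} \sum_{M} P_W(M) \ln P_W(M) + \lambda \Big( \sum_{M} P_W(M) - 1 \Big).

Setting the derivative with respect to P_W(M) to zero gives E(W, M) + \frac{1}{\beta} (\ln P_W(M) + 1) + \lambda = 0, hence P_W(M) \propto \exp(-\beta E(W, M)), and the normalization constraint fixes the constant to 1/Z_{\beta,W}. Substituting P_{\beta,W} back into the criterion,

    \sum_{M} E(W, M) P_{\beta,W}(M) + \frac{1}{\beta} \sum_{M} P_{\beta,W}(M) \ln P_{\beta,W}(M) = -\frac{1}{\beta} \ln Z_{\beta,W} = F_\beta(W).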

So far...

I three derivations of deterministic annealing:
  I soft minimum
  I thermodynamics inspired annealing
  I maximum entropy principle
I all of them boil down to two principles:
  I replace the crisp optimization problem in the M space by a soft one
  I track the evolution of the solution when the crispness of the approximation increases
I now the technical details...


Fixed point

I remember the EM phase:
    \mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^K \exp(-\beta \|x_i - w_l\|^2)}
    w_k = \frac{1}{\sum_{i=1}^N \mu_{ik}(W)} \sum_{i=1}^N \mu_{ik}(W) x_i
I this is a fixed point method, i.e., W = U_β(W) for a data dependent U_β
I the fixed point is generally stable, except during a phase transition (a numerical stability check is sketched below)
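A rough numerical sketch (not from the slides) of such a stability check for vector quantization: estimate the Jacobian of U_β at the current fixed point by finite differences and look at its spectral radius; a value above 1 means the fixed point has become unstable, which signals a phase transition.

import numpy as np

def U_beta(W_flat, X, K, beta):
    W = W_flat.reshape(K, -1)
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)
    mu = np.exp(logits)
    mu /= mu.sum(axis=1, keepdims=True)
    return ((mu.T @ X) / mu.sum(axis=0)[:, None]).ravel()

def spectral_radius_at(W, X, beta, eps=1e-5):
    K = W.shape[0]
    w0 = W.ravel()
    J = np.empty((w0.size, w0.size))
    for j in range(w0.size):                      # finite-difference Jacobian, column by column
        d = np.zeros_like(w0)
        d[j] = eps
        J[:, j] = (U_beta(w0 + d, X, K, beta) - U_beta(w0 - d, X, K, beta)) / (2 * eps)
    return np.abs(np.linalg.eigvals(J)).max()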

Phase transition

[figures: the (w1, w2) landscape and the membership probabilities of the six example points, just before and just after a phase transition]

Stability

[figure: two views of the Fβ(W) landscape over (w1, w2) around the fixed point]

Stability

I unstable fixed point:
  I even during a phase transition, the exact previous fixed point can remain a fixed point
  I but we want the transition to take place: this is a new cluster birth!
  I fortunately, the previous fixed point is unstable during a phase transition
I modified EM phase:
  1. initialize W*_l to W*_{l−1} + ε
  2. compute µ_ik = exp(−β e_ik(W*_l)) / ∑_{m=1}^K exp(−β e_im(W*_l))
  3. update W*_l such that ∑_{i=1}^N ∑_{k=1}^K µ_ik ∇_W e_ik(W*_l) = 0
  4. go back to 2 until convergence
I the noise is important to avoid missing a phase transition

Annealing strategy

I neglecting symmetries, there are K − 1 phase transitions for K clusters
I tracking W* between transitions is easy
I trade-off:
  I slow annealing: long running time, but no transition is missed
  I fast annealing: quicker, but with a higher risk of missing a transition
I but critical temperatures can be computed:
  I via a stability analysis of the fixed points
  I corresponds to a linear approximation of the fixed point equation U_β around a fixed point

Critical temperatures

I for vector quantization
I first temperature:
  I the fixed point stability is related to the variance of the data
  I the critical β is 1/(2λ_max) where λ_max is the largest eigenvalue of the covariance matrix of the data
I in general:
  I the stability of the whole system is related to the stability of each cluster
  I the µ_ik play the role of membership functions for each cluster
  I the critical β for cluster k is 1/(2λ^k_max) where λ^k_max is the largest eigenvalue of

    ∑_i µ_ik (x_i − w_k)(x_i − w_k)^T
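A small numpy sketch of the critical β of one cluster (names are mine; the slide's weighted sum is used literally, although some presentations normalize it by the total membership of the cluster):

import numpy as np

def critical_beta(X, mu_k, w_k):
    # critical inverse temperature 1 / (2 * lambda_max) for cluster k
    # X: (N, p) data, mu_k: (N,) memberships of cluster k, w_k: (p,) prototype
    Xc = X - w_k
    C = (mu_k[:, None] * Xc).T @ Xc        # sum_i mu_ik (x_i - w_k)(x_i - w_k)^T
    lam_max = np.linalg.eigvalsh(C).max()  # largest eigenvalue
    return 1.0 / (2.0 * lam_max)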

A possible final algorithm

1. initialize W¹ = (1/N) ∑_{i=1}^N x_i
2. for l = 2 to K:
   2.1 compute the critical β_l
   2.2 initialize W^l to W^{l−1}
   2.3 for t values of β around β_l:
       2.3.1 add some small noise to W^l
       2.3.2 compute µ_ik = exp(−β‖x_i − W^l_k‖²) / ∑_{m=1}^K exp(−β‖x_i − W^l_m‖²)
       2.3.3 compute W^l_k = (∑_{i=1}^N µ_ik x_i) / (∑_{i=1}^N µ_ik)
       2.3.4 go back to 2.3.2 until convergence
3. use W^K as an estimation of arg min_{W∈R^p} (min_{M∈ℳ} E(M,W)) and µ_ik as the corresponding M
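A compact Python sketch of this scheme (names are mine; a fixed increasing schedule of β values is used instead of the computed critical temperatures, so this only approximates the algorithm above):

import numpy as np

def deterministic_annealing(X, K, betas, n_iter=100, noise=1e-3, seed=0):
    # X: (N, p) data, K: number of prototypes, betas: increasing inverse temperatures
    rng = np.random.default_rng(seed)
    W = np.tile(X.mean(axis=0), (K, 1))                  # all prototypes start at the data mean
    mu = np.full((X.shape[0], K), 1.0 / K)
    for beta in betas:
        W = W + noise * rng.standard_normal(W.shape)     # noise to reveal transitions
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
            logits = -beta * d2
            logits -= logits.max(axis=1, keepdims=True)
            mu = np.exp(logits)
            mu /= mu.sum(axis=1, keepdims=True)
            W_new = (mu.T @ X) / mu.sum(axis=0)[:, None]
            if np.allclose(W, W_new, atol=1e-8):
                W = W_new
                break
            W = W_new
    return W, mu

# usage: W, mu = deterministic_annealing(X, K=4, betas=np.geomspace(0.1, 10.0, 30))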

Computational cost

I N observations in R^p and K clusters
I prototype storage: Kp
I µ matrix storage: NK
I one EM iteration costs: O(NKp)
I we need at most K − 1 full EM runs: O(NK²p)
I compared to the K-means:
  I more storage
  I no simple distance calculation tricks: all distances must be computed
  I roughly corresponds to K − 1 k-means, not taking into account the number of distinct β considered during each phase transition
  I neglecting the eigenvalue analysis (in O(p³))...


Collapsed prototypes

I when β → 0, only one cluster
I waste of computational resources: K identical calculations
I this is a general problem:
  I each phase transition adds new clusters
  I prior to that, prototypes are collapsed
  I but we need them to track cluster birth...
I a related problem: uniform cluster weights
I in a Gaussian mixture

  P(x_i ∈ C_k | x_i, W) = δ_k exp(−(1/2ε)‖x_i − w_k‖²) / ∑_{l=1}^K δ_l exp(−(1/2ε)‖x_i − w_l‖²)

  prototypes are weighted by the mixing coefficients δ

Mass constrained DA

I we introduce prototype weights δ_k and plug them into the soft minimum formulation

  F_β(W, δ) = −(1/β) ∑_{i=1}^N ln ∑_{k=1}^K δ_k exp(−β e_ik(W))

I F_β is minimized under the constraint ∑_{k=1}^K δ_k = 1
I this leads to

  µ_ik(W) = δ_k exp(−β‖x_i − w_k‖²) / ∑_{l=1}^K δ_l exp(−β‖x_i − w_l‖²)

  δ_k = (1/N) ∑_{i=1}^N µ_ik(W)

  w_k = (1/(N δ_k)) ∑_{i=1}^N µ_ik(W) x_i
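A numpy sketch of one mass constrained update (the function name is mine):

import numpy as np

def mcda_step(X, W, delta, beta):
    # X: (N, p) data, W: (K, p) prototypes, delta: (K,) weights, beta: inverse temperature
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    logits = -beta * d2 + np.log(delta)[None, :]          # delta_k exp(-beta ||x_i - w_k||^2)
    logits -= logits.max(axis=1, keepdims=True)
    mu = np.exp(logits)
    mu /= mu.sum(axis=1, keepdims=True)                   # mu_ik
    delta_new = mu.mean(axis=0)                           # delta_k = (1/N) sum_i mu_ik
    W_new = (mu.T @ X) / (len(X) * delta_new)[:, None]    # w_k = sum_i mu_ik x_i / (N delta_k)
    return W_new, delta_new, mu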

MCDA

I increases the similarity with EM for Gaussian mixtures:
  I exactly the same algorithm for a fixed β
  I isotropic Gaussian mixture with variance 1
I however:
  I the variance is generally a parameter in the Gaussian mixtures
  I different goals: minimal distortion versus maximum likelihood
  I no support for other distributions in DA

Cluster birth monitoring

I avoid wasting computational resources:
  I consider a cluster C_k with a prototype w_k
  I prior to a phase transition in C_k:
    I duplicate w_k into w′_k
    I apply some noise to both prototypes
    I split δ_k equally between the two prototypes (δ_k/2 each)
  I apply the EM-like algorithm and monitor ‖w_k − w′_k‖
  I accept the new cluster if ‖w_k − w′_k‖ becomes “large”
I two strategies:
  I eigenvalue based analysis (costly, but quite accurate)
  I opportunistic: always maintain duplicate prototypes and promote diverging ones

Final algorithm

1. initialize W¹ = (1/N) ∑_{i=1}^N x_i
2. for l = 2 to K:
   2.1 compute the critical β_l
   2.2 initialize W^l to W^{l−1}
   2.3 duplicate the prototype of the critical cluster and split the associated weight
   2.4 for t values of β around β_l:
       2.4.1 add some small noise to W^l
       2.4.2 compute µ_ik = δ_k exp(−β‖x_i − W^l_k‖²) / ∑_{m=1}^K δ_m exp(−β‖x_i − W^l_m‖²)
       2.4.3 compute δ_k = (1/N) ∑_{i=1}^N µ_ik
       2.4.4 compute W^l_k = (1/(N δ_k)) ∑_{i=1}^N µ_ik x_i
       2.4.5 go back to 2.4.2 until convergence
3. use W^K as an estimation of arg min_{W∈R^p} (min_{M∈ℳ} E(M,W)) and µ_ik as the corresponding M
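A small helper sketching step 2.3, the duplication of the critical prototype with an even weight split (names are mine):

import numpy as np

def split_cluster(W, delta, k, noise=1e-3, rng=None):
    # duplicate prototype k (slightly perturbed) and give half of its weight to the copy
    rng = np.random.default_rng() if rng is None else rng
    w_new = W[k] + noise * rng.standard_normal(W.shape[1])
    W = np.vstack([W, w_new[None, :]])
    delta = np.append(delta, delta[k] / 2.0)
    delta[k] /= 2.0
    return W, delta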

Example

  [figure: a simple two-dimensional dataset (axes X1 and X2)]

Example

  [figure: error evolution and cluster births — the error E as a function of β]

Clusters

  [figures: a sequence of plots of the simple dataset (axes X1 and X2) showing the clusters obtained, one slide per step]

Summary

I deterministic annealing addresses combinatorial optimization by smoothing the cost function and tracking the evolution of the solution while the smoothing progressively vanishes
I motivated by:
  I heuristic (soft minimum)
  I statistical physics (Gibbs distribution)
  I information theory (maximal entropy)
I in practice:
  I rather simple multiple EM-like algorithm
  I a bit tricky to implement (phase transition, noise injection, etc.)
  I excellent results (frequently better than those obtained by K-means)


Combinatorial only

I the situation is quite different for a combinatorial problem

  arg min_{M∈ℳ} E(M)

I no soft minimum heuristic
I no direct use of the Gibbs distribution P_T; one needs to rely on E_{P_T}(f(M)) for interesting f
I calculation of P_T is not tractable:
  I tractability is related to independence
  I if E(M) decomposes as E(M_1) + . . . + E(M_D), the sums ∑_{M_1} . . . ∑_{M_D} factorize and independent optimization can be done on each variable...

Graph clustering

I an undirected graph with N nodes and weight matrix A (A_ij is the weight of the connection between nodes i and j)
I maximal modularity clustering:
  I node degree k_i = ∑_{j=1}^N A_ij, total weight m = (1/2) ∑_{i,j} A_ij
  I null model P_ij = k_i k_j / (2m), B_ij = (1/2m)(A_ij − P_ij)
  I assignment matrix M
  I modularity of the clustering

    Mod(M) = ∑_{i,j} ∑_k M_ik M_jk B_ij

I difficulty: the coupling M_ik M_jk (non linearity)
I in the sequel E(M) = −Mod(M)
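A short numpy sketch of these definitions (function names are mine; A is assumed symmetric and M is a 0/1 assignment matrix):

import numpy as np

def modularity_matrix(A):
    # B_ij = (A_ij - k_i k_j / (2m)) / (2m)
    k = A.sum(axis=1)
    two_m = A.sum()
    return (A - np.outer(k, k) / two_m) / two_m

def modularity(A, M):
    # Mod(M) = sum_{i,j} sum_k M_ik M_jk B_ij = trace(M^T B M)
    B = modularity_matrix(A)
    return float(np.trace(M.T @ B @ M))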

Strategy for DA

I choose some interesting statistics:
  I E_{P_T}(f(M))
  I for instance E_{P_T}(M_ik) for clustering
I compute an approximation of E_{P_T}(f(M)) at high temperature T
I track the evolution of the approximation while lowering T
I use lim_{T→0} E_{P_T}(f(M)) = f(arg min_{M∈ℳ} E(M)) to claim that the approximation converges to the interesting limit
I rationale: approximating E_{P_T}(f(M)) for a low T is probably difficult; we use the path following strategy to ease the task

Example

  [figures: a one-dimensional example — the data, then the probability as a function of x for β = 10^−1, 10^−0.75, ..., 10^2 (the distribution becomes more and more peaked as β grows), and a final summary plot as a function of β]

Expectation approximation

I computing

  E_{P_Z}(f(Z)) = ∫ f(Z) dP_Z

  has many applications
I for instance, in machine learning:
  I performance evaluation (generalization error)
  I EM for probabilistic models
  I Bayesian approaches
I two major numerical approaches:
  I sampling
  I variational approximation

Sampling

I (strong) law of large numbers

  lim_{N→∞} (1/N) ∑_{i=1}^N f(z_i) = ∫ f(Z) dP_Z

  if the z_i are independent samples of Z
I many extensions, especially to dependent samples (slower convergence)
I MCMC methods: build a Markov chain with stationary distribution P_Z and take the average of f on a trajectory
I applied to E_{P_T}(f(M)): this is simulated annealing!
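A toy illustration of the sampling approach (the distribution and the function f are mine, chosen only to show the law of large numbers at work):

import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E[f(Z)] for Z ~ N(0, 1) and f(z) = z**2 (exact value: 1)
z = rng.standard_normal(100_000)
print(np.mean(z ** 2))  # close to 1 by the law of large numbers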

Variational approximation

I main idea:
  I the intractability of the calculation of E_{P_Z}(f(Z)) is induced by the complexity of P_Z
  I let’s replace P_Z by a simpler distribution Q_Z...
  I ...and E_{P_Z}(f(Z)) by E_{Q_Z}(f(Z))
I calculus of variations:
  I optimization over functional spaces
  I in this context: choose an optimal Q_Z in a space of probability measures 𝒬 with respect to a goodness of fit criterion between P_Z and Q_Z
I probabilistic context:
  I simple distributions: factorized distributions
  I quality criterion: Kullback-Leibler divergence

Variational approximation

I assume Z = (Z_1, . . . , Z_D) ∈ R^D and f(Z) = ∑_{i=1}^D f_i(Z_i)
I for a general P_Z, the structure of f cannot be exploited in E_{P_Z}(f(Z))
I variational approximation:
  I choose 𝒬 a set of tractable distributions on R
  I solve

    Q* = arg min_{Q=(Q_1×...×Q_D)∈𝒬^D} KL(Q||P_Z)   with   KL(Q||P_Z) = −∫ ln (dP_Z/dQ) dQ

  I approximate E_{P_Z}(f(Z)) by

    E_{Q*}(f(Z)) = ∑_{i=1}^D E_{Q*_i}(f_i(Z_i))

Application to DA

I general form: P_β(M) = (1/Z_β) exp(−βE(M))
I intractable because E(M) is not linear in M = (M_1, . . . , M_D)
I variational approximation:
  I replace E by F(W, M) defined by

    F(W, M) = ∑_{i=1}^D W_i M_i

  I use for Q_{W,β}

    Q_{W,β} = (1/Z_{W,β}) exp(−βF(W, M))

  I optimize W via KL(Q_{W,β}||P_β)

Fitting the approximation

I a complex expectation calculation is replaced by a new optimization problem
I we have

  KL(Q_{W,β}||P_β) = ln Z_β − ln Z_{W,β} + β E_{Q_{W,β}}(E(M) − F(W, M))

I tractable by factorization on F, except for Z_β
I solved by ∇_W KL(Q_{W,β}||P_β) = 0
  I tractable because Z_β does not depend on W
  I generally solved via a fixed point approach (EM-like again)
I similar to variational approximation in EM or Bayesian methods

Example

I in the case of (graph) clustering, we use

  F(W, M) = ∑_{i=1}^N ∑_{k=1}^K W_ik M_ik

I a crucial (general) property is that, when i ≠ j,

  E_{Q_{W,β}}(M_ik M_jl) = E_{Q_{W,β}}(M_ik) E_{Q_{W,β}}(M_jl)

I a (long and boring) series of equations leads to the general mean field equations

  ∂E_{Q_{W,β}}(E(M)) / ∂W_jl = ∑_{k=1}^K (∂E_{Q_{W,β}}(M_jk) / ∂W_jl) W_jk,   ∀ j, l

Example

I classical EM-like scheme to solve the mean field equations:
  1. we have

     E_{Q_{W,β}}(M_ik) = exp(−βW_ik) / ∑_{l=1}^K exp(−βW_il)

  2. keep E_{Q_{W,β}}(M_ik) fixed and solve the simplified mean field equations for W
  3. update E_{Q_{W,β}}(M_ik) based on the new W and loop on 2 until convergence
I in the graph clustering case

  W_jk = 2 ∑_i E_{Q_{W,β}}(M_ik) B_ij
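A numpy sketch of one pass of this scheme for graph clustering (names are mine, and the sign convention of the slides is taken as given):

import numpy as np

def mean_field_step(B, W, beta):
    # B: (N, N) modularity matrix (symmetric), W: (N, K) mean fields, beta: inverse temperature
    logits = -beta * W
    logits -= logits.max(axis=1, keepdims=True)
    mu = np.exp(logits)
    mu /= mu.sum(axis=1, keepdims=True)   # E_Q(M_ik)
    W_new = 2.0 * (B @ mu)                # W_jk = 2 sum_i E_Q(M_ik) B_ij (B symmetric)
    return W_new, mu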

Summary

I deterministic annealing for combinatorial optimization:
  I given an objective function E(M)
  I choose a linear parametric approximation F(W, M) of E(M), with the associated distribution Q_{W,β} = (1/Z_{W,β}) exp(−βF(W, M))
  I write the mean field equations, i.e.

    ∇_W (− ln Z_{W,β} + β E_{Q_{W,β}}(E(M) − F(W, M))) = 0

  I use an EM-like algorithm to solve the equations:
    I given W, compute E_{Q_{W,β}}(M_ik)
    I given E_{Q_{W,β}}(M_ik), solve the equations
I back to our classical questions:
  I why would that work?
  I how to do that in practice?


Physical interpretation

I back to the (generalized) Helmholtz (free) energy

  F_β = U − S/β

I the free energy can be generalized to any distribution Q on the states, by

  F_{β,Q} = E_Q(E(M)) − H(Q)/β,

  where H(Q) = −∑_{M∈ℳ} Q(M) ln Q(M) is the entropy of Q
I the Boltzmann-Gibbs distribution minimizes the free energy, i.e. F_β ≤ F_{β,Q}
I in fact we have

  F_{β,Q} = F_β + (1/β) KL(Q||P),

  where P is the Gibbs distribution
I minimizing KL(Q||P) ≥ 0 over Q corresponds to finding the best upper bound of the free energy over a class of distributions
I in other words: given that the system's states must be distributed according to a distribution in 𝒬, find the most stable distribution

Mean field

I E(M) is difficult to handle because of coupling (dependencies), while F(W, M) is linear and corresponds to a de-coupling in which dependencies are replaced by mean effects
I example:
  I modularity

    E(M) = ∑_i ∑_k M_ik ∑_j M_jk B_ij

  I approximation

    F(W, M) = ∑_i ∑_k M_ik W_ik

  I the complex influence of M_ik on E is replaced by a single parameter W_ik: the mean effect associated with a change in M_ik


In practice

I homotopy again:
  I at high temperature (β → 0):
    I E_{Q_{W,β}}(M_ik) = 1/K
    I the fixed point equation is generally easy to solve; for instance, in graph clustering

      W_jk = (2/K) ∑_i B_ij

  I slowly decrease the temperature
  I use the mean field at the previous, higher temperature as a starting point for the fixed point iterations
I phase transitions:
  I stable versus unstable fixed points (eigenanalysis)
  I noise injection and mass constrained version

A graph clustering algorithm

1. initialize W¹_jk = (2/K) ∑_i B_ij
2. for l = 2 to K:
   2.1 compute the critical β_l
   2.2 initialize W^l to W^{l−1}
   2.3 for t values of β around β_l:
       2.3.1 add some small noise to W^l
       2.3.2 compute µ_ik = exp(−βW_ik) / ∑_{m=1}^K exp(−βW_im)
       2.3.3 compute W_jk = 2 ∑_i µ_ik B_ij
       2.3.4 go back to 2.3.2 until convergence
3. threshold µ_ik into an optimal partition of the original graph

I mass constrained approach variant
I stability based transition detection (slower annealing)

Karate

I clustering of a simple graph
I Zachary’s Karate club
I four clusters

  [figure: the Karate club graph, nodes 1–34]

Modularity

  [figure: evolution of the modularity during annealing, as a function of β]

Phase transitions

I first phase: 2 clusters
I second phase: 4 clusters

  [figures: the clusters obtained after each phase]


DA Roadmap

How to apply deterministic annealing to a combinatorial optimization problem E(M)?

1. define a linear mean field approximation F(W, M)
2. specialize the mean field equations, i.e. compute

   ∂E_{Q_{W,β}}(E(M)) / ∂W_jl = ∂/∂W_jl ( (1/Z_{W,β}) ∑_M E(M) exp(−βF(W, M)) )

3. identify a fixed point scheme to solve the mean field equations, using E_{Q_{W,β}}(M_i) as a constant if needed
4. wrap the corresponding EM scheme in an annealing loop: this is the basic algorithm
5. analyze the fixed point scheme to find stability conditions: this leads to an advanced annealing schedule

Summary

I deterministic annealing addresses combinatorial optimization by solving a related but simpler problem and by tracking the evolution of the solution while the simpler problem converges to the original one
I motivated by:
  I heuristic (soft minimum)
  I statistical physics (Gibbs distribution)
  I information theory (maximal entropy)
I works mainly because of the solution following strategy (homotopy): do not solve a difficult problem from scratch, but rather start from a good guess of the solution

Summary

I difficulties:
  I boring and technical calculations (when facing a new problem)
  I a bit tricky to implement (phase transition, noise injection, etc.)
  I tends to be slow in practice: computing exp is very costly even on modern hardware (roughly 100 flops)
I advantages:
  I versatile
  I appears as a simple EM-like algorithm embedded in an annealing loop
  I very interesting intermediate results
  I excellent final results in practice

References

T. Hofmann and J. M. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1–14, January 1997.

K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239, November 1998.

F. Rossi and N. Villa. Optimizing an organized modularity measure for topographic graph clustering: a deterministic annealing approach. Neurocomputing, 73(7–9):1142–1163, March 2010.