
An Introduction to Deterministic Annealing

Fabrice Rossi

SAMM – Université Paris 1

2012


Optimization

I in machine learning:
  I optimization is one of the central tools
  I methodology:
    I choose a model with some adjustable parameters
    I choose a goodness of fit measure of the model to some data
    I tune the parameters in order to maximize the goodness of fit
  I examples: artificial neural networks, support vector machines, etc.
I in other fields:
  I operational research: project planning, routing, scheduling, etc.
  I design: antennas, wings, engines, etc.
  I etc.

Easy or Hard?

I is optimization computationally difficult?
I the convex case is relatively easy:
  I \min_{x \in C} J(x) with C convex and J convex
  I polynomial algorithms (in general)
  I why?
    I a local minimum is global
    I the local tangent hyperplane is a global lower bound
I the non convex case is hard:
  I multiple minima
  I no local to global inference
  I NP hard in some cases


In machine learning

I convex case:
  I linear models with(out) regularization (ridge, lasso)
  I kernel machines (SVM and friends)
  I nonlinear projections (e.g., semidefinite embedding)
I non convex:
  I artificial neural networks (such as multilayer perceptrons)
  I vector quantization (a.k.a. prototype based clustering)
  I general clustering
  I optimization with respect to meta parameters:
    I kernel parameters
    I discrete parameters, e.g., feature selection
I in this lecture, non convex problems:
  I combinatorial optimization
  I mixed optimization


Particular cases

I Pure combinatorial optimization problems:
  I a solution space M: finite but (very) large
  I an error measure E from M to R
  I goal: solve
      M^* = \arg\min_{M \in \mathcal{M}} E(M)
  I example: graph clustering
I Mixed problems:
  I a solution space M × R^p
  I an error measure E from M × R^p to R
  I goal: solve
      (M^*, W^*) = \arg\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W)
  I example: clustering in R^n


Graph clustering

Goal: find an optimal clustering of a graph with N nodes into K clusters

I M: set of all partitions of {1, . . . , N} into K classes (the asymptotic behavior of |M| is roughly K^{N-1} for a fixed K with N → ∞)
I assignment matrix: a partition in M is described by an N × K matrix M such that M_{ik} ∈ {0, 1} and \sum_{k=1}^K M_{ik} = 1 (see the sketch below)
I many error measures are available:
  I graph cut measures (node normalized, edge normalized, etc.)
  I modularity
  I etc.

Graph clustering

I original graph
I four clusters

[figure: the example graph, with nodes labelled 1 to 34, and its partition into four clusters]

Graph visualization

I 2386 persons: unreadable
I maximal modularity clustering in 39 clusters
I hierarchical display

[figures: the full graph of 2386 persons (legend: woman, bisexual man, heterosexual man), the clustered view with clusters labelled 1 to 39, and the hierarchical display]

Vector quantization

I N observations (x_i)_{1 ≤ i ≤ N} in R^n
I M: set of all partitions
I quantization error:
    E(M, W) = \sum_{i=1}^N \sum_{k=1}^K M_{ik} \|x_i - w_k\|^2
I the "continuous" parameters are the prototypes w_k ∈ R^n
I the assignment matrix notation is equivalent to the standard formulation
    E(M, W) = \sum_{i=1}^N \|x_i - w_{k(i)}\|^2,
  where k(i) is the index of the cluster to which x_i is assigned (both forms are computed in the sketch below)
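A minimal numerical sketch (illustration only, with made-up random data) of the equivalence between the assignment matrix form and the k(i) form of the quantization error:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))                          # N = 10 observations in R^2
W = rng.normal(size=(3, 2))                           # K = 3 prototypes

d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)   # ||x_i - w_k||^2, shape (N, K)
labels = d2.argmin(axis=1)                            # k(i): nearest prototype
M = np.eye(3)[labels]                                 # N x K assignment matrix

E_matrix = (M * d2).sum()                             # sum_i sum_k M_ik ||x_i - w_k||^2
E_direct = ((X - W[labels]) ** 2).sum()               # sum_i ||x_i - w_{k(i)}||^2
assert np.isclose(E_matrix, E_direct)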

Example

[figures: a two-dimensional data set plotted in the (X1, X2) plane, used as a vector quantization example]

Combinatorial optimization

I a research field by itself
I many problems are NP hard... and one therefore relies on heuristics or specialized approaches:
  I relaxation methods
  I branch and bound methods
  I stochastic approaches:
    I simulated annealing
    I genetic algorithms
I in this presentation: deterministic annealing


Outline

Introduction

Mixed problems
  Soft minimum
  Computing the soft minimum
  Evolution of β

Deterministic Annealing
  Annealing
  Maximum entropy
  Phase transitions
  Mass constrained deterministic annealing

Combinatorial problems
  Expectation approximations
  Mean field annealing
  In practice


Optimization strategies

I mixed problem transformation
    (M^*, W^*) = \arg\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W)
I remove the continuous part
    \min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{M \in \mathcal{M}} \Big( M \mapsto \min_{W \in \mathbb{R}^p} E(M, W) \Big)
I or the combinatorial part
    \min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{W \in \mathbb{R}^p} \Big( W \mapsto \min_{M \in \mathcal{M}} E(M, W) \Big)
I this is not alternate optimization


Alternate optimization

I elementary heuristics:
  1. start with a random configuration M^0 ∈ M
  2. compute W^i = \arg\min_{W \in \mathbb{R}^p} E(M^{i-1}, W)
  3. compute M^i = \arg\min_{M \in \mathcal{M}} E(M, W^i)
  4. go back to 2 until convergence
I e.g., the k-means algorithm for vector quantization (sketched in code below):
  1. start with a random partition M^0 ∈ M
  2. compute the optimal prototypes with
      W^i_k = \frac{1}{\sum_{j=1}^N M^{i-1}_{jk}} \sum_{j=1}^N M^{i-1}_{jk} x_j
  3. compute the optimal partition with M^i_{jk} = 1 if and only if k = \arg\min_{1 \le l \le K} \|x_j - W^i_l\|^2
  4. go back to 2 until convergence
I alternate optimization converges to a local minimum

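A minimal k-means sketch following the alternate optimization heuristic above (illustration only, not the author's code; empty clusters are simply re-seeded with a random observation):

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))        # random initial partition M^0
    for _ in range(n_iter):
        # step 2: optimal prototypes = cluster means (re-seed empty clusters)
        W = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                      else X[rng.integers(len(X))] for k in range(K)])
        # step 3: optimal partition = nearest prototype
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # convergence
            break
        labels = new_labels
    return W, labels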

Combinatorial first

I let us consider the combinatorial first approach:
    \min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{W \in \mathbb{R}^p} \Big( W \mapsto \min_{M \in \mathcal{M}} E(M, W) \Big)
I not attractive a priori, as W ↦ \min_{M \in \mathcal{M}} E(M, W):
  I has no reason to be convex
  I has no reason to be C1
I vector quantization example:
    F(W) = \sum_{i=1}^N \min_{1 \le k \le K} \|x_i - w_k\|^2
  is neither convex nor C1


Example

Clustering elements of R into 2 clusters (we will come back to this example).

[figure: the data points on the real line]

[figure: contour plot of ln F(W) over (w1, w2); multiple local minima and singularities: minimizing F(W) is difficult]

Soft minimum

I a soft minimum approximation of \min_{M \in \mathcal{M}} E(M, W) (see the sketch below)
    F_\beta(W) = -\frac{1}{\beta} \ln \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))
  solves the regularity issue: if E(M, ·) is C1, then F_\beta(W) is also C1
I if M^*(W) is such that for all M ≠ M^*(W), E(M, W) > E(M^*(W), W), then
    \lim_{\beta \to \infty} F_\beta(W) = E(M^*(W), W) = \min_{M \in \mathcal{M}} E(M, W)

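A minimal numerical sketch of the soft minimum (illustration only; the toy error values are made up), including its convergence to the true minimum as β grows and the bound stated on the Discussion slide below:

import numpy as np

def soft_min(errors, beta):
    # F_beta = -(1/beta) * ln sum_M exp(-beta * E(M)), with a stabilized log-sum-exp
    e = np.asarray(errors, dtype=float)
    m = (-beta * e).max()
    return -(m + np.log(np.exp(-beta * e - m).sum())) / beta

errors = np.array([3.0, 1.2, 0.7, 2.5])          # toy values of E(M, W) over M
for beta in [0.01, 0.1, 1.0, 10.0, 100.0]:
    f = soft_min(errors, beta)
    # bound: -ln|M|/beta <= F_beta - min <= 0
    assert errors.min() - np.log(len(errors)) / beta <= f <= errors.min() + 1e-12
    print(beta, f)                               # tends to min(errors) = 0.7 as beta grows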

Example

[figures: E(M, W) as a function of M, then, for β = 10^-2, 10^-1.5, 10^-1, 10^-0.5, 10^0, 10^0.5, 10^1, 10^1.5 and 10^2, the weights exp(-β E(M, W)) compared to exp(-β E(M*, W)), and finally the corresponding values of β on a logarithmic axis from 10^-2 to 10^2]

Discussion

I positive aspects:
  I if E(M, W) is C1 with respect to W for all M, then F_\beta is C1
  I \lim_{\beta \to \infty} F_\beta(W) = \min_{M \in \mathcal{M}} E(M, W)
  I ∀ β > 0 (a short derivation of this bound follows below),
      -\frac{1}{\beta} \ln |\mathcal{M}| \le F_\beta(W) - \min_{M \in \mathcal{M}} E(M, W) \le 0
  I when β is close to 0, F_\beta is generally easy to minimize
I negative aspects:
  I how to compute F_\beta(W) efficiently?
  I how to solve \min_{W \in \mathbb{R}^p} F_\beta(W)?
  I why would that work in practice?
I philosophy: where does F_\beta(W) come from?

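The two-sided bound above can be checked in a few lines; this derivation is added here for completeness and is not part of the original slides. Write m(W) = \min_{M \in \mathcal{M}} E(M, W). Every term of the sum is at most e^{-\beta m(W)}, and the term M = M^*(W) alone already contributes e^{-\beta m(W)}, so

    e^{-\beta m(W)} \;\le\; \sum_{M \in \mathcal{M}} e^{-\beta E(M, W)} \;\le\; |\mathcal{M}|\, e^{-\beta m(W)} .

Applying -\frac{1}{\beta} \ln(\cdot), which reverses the inequalities, gives

    m(W) - \frac{1}{\beta} \ln |\mathcal{M}| \;\le\; F_\beta(W) \;\le\; m(W),

i.e. -\frac{1}{\beta} \ln |\mathcal{M}| \le F_\beta(W) - \min_{M \in \mathcal{M}} E(M, W) \le 0, as stated above.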


Computing Fβ(W)

I in general, computing F_\beta(W) is intractable: exhaustive calculation of E(M, W) on the whole set M
I a particular case: clustering with an additive cost, i.e.
    E(M, W) = \sum_{i=1}^N \sum_{k=1}^K M_{ik} e_{ik}(W),
  where M is an assignment matrix (and, e.g., e_{ik}(W) = \|x_i - w_k\|^2)
I then
    F_\beta(W) = -\frac{1}{\beta} \sum_{i=1}^N \ln \sum_{k=1}^K \exp(-\beta e_{ik}(W))
  with computational cost O(NK) (compared to ≈ K^{N-1}); see the sketch below

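A minimal sketch (illustration only, with made-up random data) comparing the O(NK) formula with a brute-force sum over all K^N assignments; the two only match because the cost is additive, and the brute force is feasible only because N and K are tiny:

import numpy as np
from itertools import product

rng = np.random.default_rng(1)
N, K, beta = 5, 3, 2.0
X = rng.normal(size=(N, 2))
W = rng.normal(size=(K, 2))
e = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)    # e_ik(W), shape (N, K)

# additive-cost formula: F_beta = -(1/beta) * sum_i ln sum_k exp(-beta * e_ik)
a = (-beta * e).max(axis=1, keepdims=True)            # stabilization
F_fast = -(a.ravel() + np.log(np.exp(-beta * e - a).sum(axis=1))).sum() / beta

# brute force over the whole set M
Z = sum(np.exp(-beta * e[np.arange(N), list(assign)].sum())
        for assign in product(range(K), repeat=N))
F_slow = -np.log(Z) / beta
assert np.isclose(F_fast, F_slow)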

Proof sketch

I let Z_\beta(W) be the partition function given by
    Z_\beta(W) = \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))
I assignments in M are independent and the sum can be rewritten
    Z_\beta(W) = \sum_{M_{1\cdot} \in C_K} \cdots \sum_{M_{N\cdot} \in C_K} \exp(-\beta E(M, W)),
  with C_K = {(1, 0, . . . , 0), . . . , (0, . . . , 0, 1)}
I then (the last step is spelled out below)
    Z_\beta(W) = \prod_{i=1}^N \sum_{M_{i\cdot} \in C_K} \exp\Big(-\beta \sum_k M_{ik} e_{ik}(W)\Big)
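The remaining step, not spelled out on the slide, is immediate: each row M_{i·} has exactly one nonzero entry, so the inner sum enumerates the K possible choices,

    Z_\beta(W) = \prod_{i=1}^{N} \sum_{k=1}^{K} \exp(-\beta\, e_{ik}(W)),

and therefore

    F_\beta(W) = -\frac{1}{\beta} \ln Z_\beta(W) = -\frac{1}{\beta} \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \exp(-\beta\, e_{ik}(W)).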

Independence

I the additive cost corresponds to some form of conditional independence
I given the prototypes, observations are assigned independently to their optimal clusters
I in other words: the global optimal assignment is the concatenation of the individual optimal assignments
I this is what makes alternate optimization tractable in the K-means despite its combinatorial aspect:
    \min_{M \in \mathcal{M}} \sum_{i=1}^N \sum_{k=1}^K M_{ik} \|x_i - w_k\|^2

Minimizing Fβ(W)

I assume E to be C1 with respect to W; then
    \nabla F_\beta(W) = \frac{\sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))\, \nabla_W E(M, W)}{\sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))}
I at a minimum, ∇F_\beta(W) = 0 ⇒ solve the equation or use gradient descent
I same calculation problem as for F_\beta(W): in general, an exhaustive scan of M is needed
I if E is additive
    \nabla F_\beta(W) = \sum_{i=1}^N \frac{\sum_{k=1}^K \exp(-\beta e_{ik}(W))\, \nabla_W e_{ik}(W)}{\sum_{k=1}^K \exp(-\beta e_{ik}(W))}


Fixed point scheme

I a simple strategy to solve ∇F_\beta(W) = 0
I starting from a random value of W:
  1. compute (expectation phase)
      \mu_{ik} = \frac{\exp(-\beta e_{ik}(W))}{\sum_{l=1}^K \exp(-\beta e_{il}(W))}
  2. keeping the \mu_{ik} constant, solve for W (maximization phase)
      \sum_{i=1}^N \sum_{k=1}^K \mu_{ik} \nabla_W e_{ik}(W) = 0
  3. loop to 1 until convergence
I generally converges to a minimum of F_\beta(W) (for a fixed β)


Vector quantization

I if e_{ik}(W) = \|x_i - w_k\|^2 (vector quantization),
    \nabla_{w_k} F_\beta(W) = 2 \sum_{i=1}^N \mu_{ik}(W) (w_k - x_i),
  with
    \mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^K \exp(-\beta \|x_i - w_l\|^2)}
I setting ∇F_\beta(W) = 0 leads to
    w_k = \frac{1}{\sum_{i=1}^N \mu_{ik}(W)} \sum_{i=1}^N \mu_{ik}(W) x_i
  (a fixed point sketch in code follows below)
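A minimal sketch (illustration only) of the fixed point / EM-like scheme for vector quantization at a fixed β: soft memberships, then prototype updates as weighted means, iterated until stabilization.

import numpy as np

def soft_memberships(X, W, beta):
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)     # ||x_i - w_k||^2
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    mu = np.exp(logits)
    return mu / mu.sum(axis=1, keepdims=True)                # mu_ik

def fixed_point(X, W, beta, tol=1e-8, max_iter=500):
    for _ in range(max_iter):
        mu = soft_memberships(X, W, beta)
        W_new = (mu.T @ X) / mu.sum(axis=0)[:, None]         # weighted means
        if np.abs(W_new - W).max() < tol:
            return W_new, mu
        W = W_new
    return W, mu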

Links with the K-means

I if \mu_{il} = \delta_{k(i)=l}, we obtain the prototype update rule of the K-means
I in the K-means, we have
    k(i) = \arg\min_{1 \le l \le K} \|x_i - w_l\|^2
I in deterministic annealing, we have
    \mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^K \exp(-\beta \|x_i - w_l\|^2)}
I this is a soft minimum version of the crisp rule of the K-means

Links with EM

I isotropic Gaussian mixture with a unique and fixed variance ε
    p_k(x \mid w_k) = \frac{1}{(2\pi\varepsilon)^{p/2}} e^{-\frac{1}{2\varepsilon} \|x - w_k\|^2}
I given the mixing coefficients δ_k, the responsibilities are
    P(x_i \in C_k \mid x_i, W) = \frac{\delta_k e^{-\frac{1}{2\varepsilon} \|x_i - w_k\|^2}}{\sum_{l=1}^K \delta_l e^{-\frac{1}{2\varepsilon} \|x_i - w_l\|^2}}
I identical update rules for W
I β can be seen as an inverse variance: it quantifies the uncertainty about the clustering results

So far...

I if E(M, W) = \sum_{i=1}^N \sum_{k=1}^K M_{ik} e_{ik}(W), one tries to reach \min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) using
    F_\beta(W) = -\frac{1}{\beta} \sum_{i=1}^N \ln \sum_{k=1}^K \exp(-\beta e_{ik}(W))
I F_\beta(W) is smooth
I a fixed point EM-like scheme can be used to minimize F_\beta(W) for a fixed β
I but:
  I the k-means algorithm is also an EM-like algorithm: it does not reach a global optimum
  I how to handle β → ∞?



Limit case β → 0

I limit behavior of the EM scheme:
  I \mu_{ik} = 1/K for all i and k (and all W!)
  I W is therefore a solution of
      \sum_{i=1}^N \sum_{k=1}^K \nabla_W e_{ik}(W) = 0
  I no iteration needed
I vector quantization:
  I we have
      w_k = \frac{1}{N} \sum_{i=1}^N x_i
  I each prototype is the center of mass of the data: only one cluster!
I in general, the case β → 0 is easy (unique minimum under mild hypotheses)


Example

Back to the clustering of elements of R into 2 clusters.

[figures: contour plots over (w1, w2) of the original problem min_M E(M, W), of the simple soft minimum with β = 10^-1, and of a more complex soft minimum with β = 0.5]

Increasing β

I when β increases, F_\beta(W) converges to \min_{M \in \mathcal{M}} E(M, W)
I optimizing F_\beta(W) becomes more and more complex:
  I local minima
  I rougher and rougher: F_\beta(W) remains C1 but with large values for the gradient at some points
I path following strategy (homotopy):
  I for an increasing series (β_l)_l with β_0 = 0
  I initialize W^*_0 = \arg\min_W F_0(W) for β = 0 by solving \sum_{i=1}^N \sum_{k=1}^K \nabla_W e_{ik}(W) = 0
  I compute W^*_l via the EM scheme for β_l, starting from W^*_{l-1}
I a similar strategy is used in interior point algorithms


Example

Clustering 6 elements of R into 2 classes: x = 1.00, 2.00, 3.00, 7.00, 7.50, 8.25

Behavior of Fβ(W)

[figures: contour plots over (w1, w2) of min_M E(M, W), then of Fβ(W) for β = 10^-2, 10^-1.75, 10^-1.5, 10^-1.25, 10^-1, 10^-0.75, 10^-0.5, 10^-0.25 and 1, and finally of min_M E(M, W) again for comparison]

One step closer to the final algorithm

I given an increasing series (β_l)_l with β_0 = 0:
  1. compute W^*_0 such that \sum_{i=1}^N \sum_{k=1}^K \nabla_W e_{ik}(W^*_0) = 0
  2. for l = 1 to L:
     2.1 initialize W^*_l to W^*_{l-1}
     2.2 compute \mu_{ik} = \frac{\exp(-\beta_l e_{ik}(W^*_l))}{\sum_{m=1}^K \exp(-\beta_l e_{im}(W^*_l))}
     2.3 update W^*_l such that \sum_{i=1}^N \sum_{k=1}^K \mu_{ik} \nabla_W e_{ik}(W^*_l) = 0
     2.4 go back to 2.2 until convergence
  3. use W^*_L as an estimate of \arg\min_{W \in \mathbb{R}^p} \big( \min_{M \in \mathcal{M}} E(M, W) \big)
I this is (up to some technical details) the deterministic annealing algorithm for minimizing E(M, W) = \sum_{i=1}^N \sum_{k=1}^K M_{ik} e_{ik}(W); a sketch in code follows below
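A minimal end-to-end sketch (illustration only, not the author's implementation) of this algorithm for vector quantization, with e_ik(W) = ||x_i - w_k||^2. The geometric β schedule and the tiny perturbation of the prototypes at each β are choices of this sketch; the perturbation anticipates the phase transition discussion later in the presentation.

import numpy as np

def deterministic_annealing(X, K, betas, inner_iter=200, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    W = np.tile(X.mean(axis=0), (K, 1))           # beta -> 0 solution: the global mean
    for beta in betas:
        W = W + 1e-6 * rng.normal(size=W.shape)   # small noise to leave unstable fixed points
        for _ in range(inner_iter):
            d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
            logits = -beta * d2
            logits -= logits.max(axis=1, keepdims=True)
            mu = np.exp(logits)
            mu /= mu.sum(axis=1, keepdims=True)            # memberships mu_ik
            W_new = (mu.T @ X) / mu.sum(axis=0)[:, None]   # weighted means
            converged = np.abs(W_new - W).max() < tol
            W = W_new
            if converged:
                break
    return W, mu

# usage sketch (hypothetical data):
# W, mu = deterministic_annealing(X, K=4, betas=np.logspace(-2, 2, 40))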

What's next?

I general topics:
  I why the name annealing?
  I why should that work? (and other philosophical issues)
I specific topics:
  I technical "details" (e.g., annealing schedule)
  I generalization:
    I non additive criteria
    I general combinatorial problems


Annealing

I from Wikipedia:
    Annealing is a heat treatment wherein a material is altered, causing changes in its properties such as strength and hardness. It is a process that produces conditions by heating to above the re-crystallization temperature and maintaining a suitable temperature, and then cooling.
I in our context, we'll see that:
  I T = 1/(k_B β) acts as a temperature and models some thermal agitation
  I E(W, M) is the energy of a configuration of the system, while F_β(W) corresponds to the free energy of the system at a given temperature
  I increasing β reduces the thermal agitation


Simulated Annealing

I a classical combinatorial optimization algorithm for computing \min_{M \in \mathcal{M}} E(M)
I given an increasing series (β_l)_l (sketched in code below):
  1. choose a random initial configuration M_0
  2. for l = 1 to L
     2.1 take a small random step from M_{l-1} to build M^c_l (e.g., change the cluster of an object)
     2.2 if E(M^c_l) < E(M_{l-1}), then set M_l = M^c_l
     2.3 otherwise, set M_l = M^c_l with probability exp(-β_l (E(M^c_l) - E(M_{l-1}))), and M_l = M_{l-1} in the other case
  3. use M_L as an estimate of \arg\min_{M \in \mathcal{M}} E(M)
I naive vision:
  I always accept improvements
  I "thermal agitation" allows escaping local minima
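A minimal simulated annealing sketch (illustration only) for a generic clustering criterion over label vectors; the proposal simply moves one randomly chosen object to a random cluster, and E is any user-supplied error function:

import numpy as np

def simulated_annealing(E, N, K, betas, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=N)              # random initial configuration
    energy = E(labels)
    for beta in betas:                            # increasing beta = decreasing temperature
        cand = labels.copy()
        cand[rng.integers(N)] = rng.integers(K)   # small random step
        delta = E(cand) - energy
        if delta < 0 or rng.random() < np.exp(-beta * delta):
            labels, energy = cand, energy + delta # accept (always when it improves)
    return labels, energy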

Statistical physics

I consider a system with state space M
I denote E(M) the energy of the system when in state M
I at thermal equilibrium with the environment, the probability for the system to be in state M is given by the Boltzmann (Gibbs) distribution
    P_T(M) = \frac{1}{Z_T} \exp\left(-\frac{E(M)}{k_B T}\right)
  with T the temperature, k_B Boltzmann's constant, and
    Z_T = \sum_{M \in \mathcal{M}} \exp\left(-\frac{E(M)}{k_B T}\right)

Back to annealing

I physical analogy:
  I maintain the system at thermal equilibrium:
    I set a temperature
    I wait for the system to settle at this temperature
  I slowly decrease the temperature:
    I works well in real systems (e.g., crystallization)
    I allows the system to explore the state space
I computer implementation:
  I direct computation of P_T(M) (or related quantities)
  I sampling from P_T(M)

Simulated Annealing Revisited

I simulated annealing samples from P_T(M)!
I more precisely: the asymptotic distribution of M_l for a fixed β is given by P_{1/(k_B β)}
I how? Metropolis-Hastings Markov Chain Monte Carlo
I principle:
  I P is the target distribution
  I Q(· | ·) is the proposal distribution (sampling friendly)
  I start with x^t, get x' from Q(x | x^t)
  I set x^{t+1} to x' with probability
      \min\left(1, \frac{P(x')\, Q(x^t \mid x')}{P(x^t)\, Q(x' \mid x^t)}\right)
  I keep x^{t+1} = x^t when this fails


Simulated Annealing Revisited

I in SA, Q is the random local perturbation
I major feature of Metropolis-Hastings MCMC:
    \frac{P_T(M')}{P_T(M)} = \exp\left(-\frac{E(M') - E(M)}{k_B T}\right)
  so Z_T is not needed
I underlying assumption: symmetric proposals
    Q(x^t \mid x') = Q(x' \mid x^t)
I rationale:
  I sampling directly from P_T(M) for small T is difficult
  I track likely areas during cooling

Mixed problems

I Simulated Annealing applies to combinatorial problems
I in mixed problems, this would mean removing the continuous part
    \min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{M \in \mathcal{M}} \Big( M \mapsto \min_{W \in \mathbb{R}^p} E(M, W) \Big)
I in the vector quantization example
    E(M) = \sum_{i=1}^N \sum_{k=1}^K M_{ik} \left\| x_i - \frac{1}{\sum_{j=1}^N M_{jk}} \sum_{j=1}^N M_{jk} x_j \right\|^2
I no obvious direct relation to the proposed algorithm

Statistical physics

I thermodynamic (free) energy: the useful energy available in a system (a.k.a. the one that can be extracted to produce work)
I the Helmholtz (free) energy is F_T = U − TS, where U is the internal energy of the system and S its entropy
I one can show that
    F_T = -k_B T \ln Z_T
I the natural evolution of a system is to reduce its free energy
I deterministic annealing mimics the cooling process of a system by tracking the evolution of the minimal free energy

Gibbs distribution

I still no use of P_T...
I important property (illustrated by the numerical sketch below): for f a function defined on M, with
    E_{P_T}(f(M)) = \frac{1}{Z_T} \sum_{M \in \mathcal{M}} f(M) \exp\left(-\frac{E(M)}{k_B T}\right),
  we have
    \lim_{T \to 0} E_{P_T}(f(M)) = f(M^*),
  where M^* = \arg\min_{M \in \mathcal{M}} E(M)
I useful to track global information

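A minimal numerical illustration (not from the slides; the toy energies and f values are made up) of the property above: on a small enumerable state space, the expectation of f under the Gibbs distribution tends to f(M*) as T goes to 0 (k_B is set to 1 here).

import numpy as np

energies = np.array([2.0, 0.3, 1.1, 0.9])        # E(M) for 4 toy states
f_values = np.array([10.0, -1.0, 4.0, 7.0])      # f(M) for the same states
f_star = f_values[energies.argmin()]             # f(M*) = -1.0

for T in [10.0, 1.0, 0.1, 0.01]:
    w = np.exp(-(energies - energies.min()) / T) # Boltzmann weights
    p = w / w.sum()
    print(T, (p * f_values).sum())               # approaches f_star as T shrinks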

Membership functions

I in the EM-like phase, we have
    \mu_{ik} = \frac{\exp(-\beta e_{ik}(W))}{\sum_{l=1}^K \exp(-\beta e_{il}(W))}
I this does not come out of thin air (a short derivation follows below):
    \mu_{ik} = E_{P_{\beta,W}}(M_{ik}) = \sum_{M \in \mathcal{M}} M_{ik} P_{\beta,W}(M)
I \mu_{ik} is therefore the probability for x_i to belong to cluster k under the Gibbs distribution
I in the limit β → ∞, the \mu_{ik} peak to Dirac-like membership functions: they give the "optimal" partition

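Derivation of the identity above (added for completeness, using the factorization of the partition function from the earlier proof sketch): for the additive cost E(M, W) = \sum_{i,k} M_{ik} e_{ik}(W), the Gibbs distribution factorizes over the rows of M, so the marginal distribution of row i is

    P_{\beta,W}(M_{i\cdot} = c_k) = \frac{\exp(-\beta\, e_{ik}(W))}{\sum_{l=1}^{K} \exp(-\beta\, e_{il}(W))},

where c_k denotes the binary vector with a single 1 in position k. Since M_{ik} \in \{0, 1\}, its expectation under P_{\beta,W} is exactly this marginal probability, which is the \mu_{ik} used in the EM-like phase.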

Example

The same 6 elements of R clustered into 2 classes: x = 1.00, 2.00, 3.00, 7.00, 7.50, 8.25

Membership functions

[figures: for a sequence of increasing β values, the current position in the (w1, w2) landscape and the membership probabilities µ_ik of the six points, shown as bar plots over x = 1, 2, 3, 7, 7.5, 8.25; two of the snapshots are labelled "before phase transition" and "after phase transition"]

Summary

I two principles:
  I physical systems tend to reach a state of minimal free energy at a given temperature
  I when the temperature is slowly decreased, a system tends to reach a well organized state
I annealing algorithms tend to reproduce this slow cooling behavior
I simulated annealing:
  I uses Metropolis-Hastings MCMC to sample the Gibbs distribution
  I easy to implement, but needs a very slow cooling
I deterministic annealing:
  I direct minimization of the free energy
  I computes expectations of global quantities with respect to the Gibbs distribution
  I aggressive cooling is possible, but the Z_T computation must be tractable


Maximum entropy

I another interpretation/justification of DA
I reformulation of the problem: find a probability distribution on M which is "regular" and gives a low average error
I in other words: find P_W such that
  I \sum_{M \in \mathcal{M}} P_W(M) = 1
  I \sum_{M \in \mathcal{M}} E(W, M) P_W(M) is small
  I the entropy of P_W, -\sum_{M \in \mathcal{M}} P_W(M) \ln P_W(M), is high
I entropy plays a regularization role
I this leads to the minimization of
    \sum_{M \in \mathcal{M}} E(W, M) P_W(M) + \frac{1}{\beta} \sum_{M \in \mathcal{M}} P_W(M) \ln P_W(M),
  where β sets the trade-off between fitting and regularity


Maximum entropy

I some calculations (sketched below) show that the minimum over P_W is given by
    P_{\beta,W}(M) = \frac{1}{Z_{\beta,W}} \exp(-\beta E(M, W)),
  with
    Z_{\beta,W} = \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))
I plugged back into the optimization criterion, we end up with the soft minimum
    F_\beta(W) = -\frac{1}{\beta} \ln \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))
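Sketch of the "some calculations" (added for completeness, not part of the original slides): minimize over P_W the Lagrangian

    \mathcal{L}(P_W, \lambda) = \sum_{M} E(W, M) P_W(M) + \frac{1}{\beta} \sum_{M} P_W(M) \ln P_W(M) + \lambda \Big( \sum_{M} P_W(M) - 1 \Big).

Setting the derivative with respect to P_W(M) to zero gives E(W, M) + \frac{1}{\beta} (\ln P_W(M) + 1) + \lambda = 0, hence P_W(M) \propto \exp(-\beta E(W, M)), and the normalization constraint fixes the constant to 1/Z_{\beta,W}. Substituting P_{\beta,W} back into the criterion,

    \sum_{M} E(W, M) P_{\beta,W}(M) + \frac{1}{\beta} \sum_{M} P_{\beta,W}(M) \ln P_{\beta,W}(M) = -\frac{1}{\beta} \ln Z_{\beta,W} = F_\beta(W).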

So far...

I three derivations of deterministic annealing:
  I soft minimum
  I thermodynamics inspired annealing
  I maximum entropy principle
I all of them boil down to two principles:
  I replace the crisp optimization problem in the M space by a soft one
  I track the evolution of the solution when the crispness of the approximation increases
I now the technical details...


Fixed point

I remember the EM phase:
    \mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^K \exp(-\beta \|x_i - w_l\|^2)}
    w_k = \frac{1}{\sum_{i=1}^N \mu_{ik}(W)} \sum_{i=1}^N \mu_{ik}(W) x_i
I this is a fixed point method, i.e., W = U_β(W) for a data dependent U_β
I the fixed point is generally stable, except during a phase transition (a numerical stability check is sketched below)
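A rough numerical sketch (not from the slides) of such a stability check for vector quantization: estimate the Jacobian of U_β at the current fixed point by finite differences and look at its spectral radius; a value above 1 means the fixed point has become unstable, which signals a phase transition.

import numpy as np

def U_beta(W_flat, X, K, beta):
    W = W_flat.reshape(K, -1)
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)
    mu = np.exp(logits)
    mu /= mu.sum(axis=1, keepdims=True)
    return ((mu.T @ X) / mu.sum(axis=0)[:, None]).ravel()

def spectral_radius_at(W, X, beta, eps=1e-5):
    K = W.shape[0]
    w0 = W.ravel()
    J = np.empty((w0.size, w0.size))
    for j in range(w0.size):                      # finite-difference Jacobian, column by column
        d = np.zeros_like(w0)
        d[j] = eps
        J[:, j] = (U_beta(w0 + d, X, K, beta) - U_beta(w0 - d, X, K, beta)) / (2 * eps)
    return np.abs(np.linalg.eigvals(J)).max()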

Phase transition

[figures: the (w1, w2) landscape and the membership probabilities of the six example points, just before and just after a phase transition]

Stability

[figure: two views of the Fβ(W) landscape over (w1, w2) around the fixed point]

Stability

I unstable fixed point:
  I even during a phase transition, the exact previous fixed point can remain a fixed point
  I but we want the transition to take place: this is a new cluster birth!
  I fortunately, the previous fixed point is unstable during a phase transition
I modified EM phase:
  1. initialize W*_l to W*_{l−1} + ε
  2. compute µ_ik = exp(−β e_ik(W*_l)) / ∑_{m=1}^K exp(−β e_im(W*_l))
  3. update W*_l such that ∑_{i=1}^N ∑_{k=1}^K µ_ik ∇_W e_ik(W*_l) = 0
  4. go back to 2 until convergence
I the noise is important to avoid missing a phase transition

Annealing strategy

I neglecting symmetries, there are K − 1 phase transitions for K clusters
I tracking W* between transitions is easy
I trade-off:
  I slow annealing: long running time, but no transition is missed
  I fast annealing: quicker, but with a higher risk of missing a transition
I but critical temperatures can be computed:
  I via a stability analysis of the fixed points
  I corresponds to a linear approximation of the fixed point equation U_β around a fixed point

Critical temperatures

I for vector quantization
I first temperature:
  I the fixed point stability is related to the variance of the data
  I the critical β is 1/(2λ_max) where λ_max is the largest eigenvalue of the covariance matrix of the data
I in general:
  I the stability of the whole system is related to the stability of each cluster
  I the µ_ik play the role of membership functions for each cluster
  I the critical β for cluster k is 1/(2λ^k_max) where λ^k_max is the largest eigenvalue of

    ∑_i µ_ik (x_i − w_k)(x_i − w_k)^T
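A small numpy sketch of the critical β of one cluster (names are mine; the slide's weighted sum is used literally, although some presentations normalize it by the total membership of the cluster):

import numpy as np

def critical_beta(X, mu_k, w_k):
    # critical inverse temperature 1 / (2 * lambda_max) for cluster k
    # X: (N, p) data, mu_k: (N,) memberships of cluster k, w_k: (p,) prototype
    Xc = X - w_k
    C = (mu_k[:, None] * Xc).T @ Xc        # sum_i mu_ik (x_i - w_k)(x_i - w_k)^T
    lam_max = np.linalg.eigvalsh(C).max()  # largest eigenvalue
    return 1.0 / (2.0 * lam_max)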

A possible final algorithm

1. initialize W¹ = (1/N) ∑_{i=1}^N x_i
2. for l = 2 to K:
   2.1 compute the critical β_l
   2.2 initialize W^l to W^{l−1}
   2.3 for t values of β around β_l:
       2.3.1 add some small noise to W^l
       2.3.2 compute µ_ik = exp(−β‖x_i − W^l_k‖²) / ∑_{m=1}^K exp(−β‖x_i − W^l_m‖²)
       2.3.3 compute W^l_k = (∑_{i=1}^N µ_ik x_i) / (∑_{i=1}^N µ_ik)
       2.3.4 go back to 2.3.2 until convergence
3. use W^K as an estimation of arg min_{W∈R^p} (min_{M∈ℳ} E(M,W)) and µ_ik as the corresponding M
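A compact Python sketch of this scheme (names are mine; a fixed increasing schedule of β values is used instead of the computed critical temperatures, so this only approximates the algorithm above):

import numpy as np

def deterministic_annealing(X, K, betas, n_iter=100, noise=1e-3, seed=0):
    # X: (N, p) data, K: number of prototypes, betas: increasing inverse temperatures
    rng = np.random.default_rng(seed)
    W = np.tile(X.mean(axis=0), (K, 1))                  # all prototypes start at the data mean
    mu = np.full((X.shape[0], K), 1.0 / K)
    for beta in betas:
        W = W + noise * rng.standard_normal(W.shape)     # noise to reveal transitions
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
            logits = -beta * d2
            logits -= logits.max(axis=1, keepdims=True)
            mu = np.exp(logits)
            mu /= mu.sum(axis=1, keepdims=True)
            W_new = (mu.T @ X) / mu.sum(axis=0)[:, None]
            if np.allclose(W, W_new, atol=1e-8):
                W = W_new
                break
            W = W_new
    return W, mu

# usage: W, mu = deterministic_annealing(X, K=4, betas=np.geomspace(0.1, 10.0, 30))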

Computational cost

I N observations in R^p and K clusters
I prototype storage: Kp
I µ matrix storage: NK
I one EM iteration costs: O(NKp)
I we need at most K − 1 full EM runs: O(NK²p)
I compared to the K-means:
  I more storage
  I no simple distance calculation tricks: all distances must be computed
  I roughly corresponds to K − 1 k-means, not taking into account the number of distinct β considered during each phase transition
  I neglecting the eigenvalue analysis (in O(p³))...


Collapsed prototypes

I when β → 0, only one cluster
I waste of computational resources: K identical calculations
I this is a general problem:
  I each phase transition adds new clusters
  I prior to that, prototypes are collapsed
  I but we need them to track cluster birth...
I a related problem: uniform cluster weights
I in a Gaussian mixture

  P(x_i ∈ C_k | x_i, W) = δ_k exp(−(1/2ε)‖x_i − w_k‖²) / ∑_{l=1}^K δ_l exp(−(1/2ε)‖x_i − w_l‖²)

  prototypes are weighted by the mixing coefficients δ

Mass constrained DA

I we introduce prototype weights δ_k and plug them into the soft minimum formulation

  F_β(W, δ) = −(1/β) ∑_{i=1}^N ln ∑_{k=1}^K δ_k exp(−β e_ik(W))

I F_β is minimized under the constraint ∑_{k=1}^K δ_k = 1
I this leads to

  µ_ik(W) = δ_k exp(−β‖x_i − w_k‖²) / ∑_{l=1}^K δ_l exp(−β‖x_i − w_l‖²)

  δ_k = (1/N) ∑_{i=1}^N µ_ik(W)

  w_k = (1/(N δ_k)) ∑_{i=1}^N µ_ik(W) x_i
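A numpy sketch of one mass constrained update (the function name is mine):

import numpy as np

def mcda_step(X, W, delta, beta):
    # X: (N, p) data, W: (K, p) prototypes, delta: (K,) weights, beta: inverse temperature
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    logits = -beta * d2 + np.log(delta)[None, :]          # delta_k exp(-beta ||x_i - w_k||^2)
    logits -= logits.max(axis=1, keepdims=True)
    mu = np.exp(logits)
    mu /= mu.sum(axis=1, keepdims=True)                   # mu_ik
    delta_new = mu.mean(axis=0)                           # delta_k = (1/N) sum_i mu_ik
    W_new = (mu.T @ X) / (len(X) * delta_new)[:, None]    # w_k = sum_i mu_ik x_i / (N delta_k)
    return W_new, delta_new, mu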

MCDA

I increases the similarity with EM for Gaussian mixtures:
  I exactly the same algorithm for a fixed β
  I isotropic Gaussian mixture with variance 1
I however:
  I the variance is generally a parameter in the Gaussian mixtures
  I different goals: minimal distortion versus maximum likelihood
  I no support for other distributions in DA

Cluster birth monitoring

I avoid wasting computational resources:
  I consider a cluster C_k with a prototype w_k
  I prior to a phase transition in C_k:
    I duplicate w_k into w′_k
    I apply some noise to both prototypes
    I split δ_k equally between the two prototypes (δ_k/2 each)
  I apply the EM-like algorithm and monitor ‖w_k − w′_k‖
  I accept the new cluster if ‖w_k − w′_k‖ becomes “large”
I two strategies:
  I eigenvalue based analysis (costly, but quite accurate)
  I opportunistic: always maintain duplicate prototypes and promote diverging ones

Final algorithm

1. initialize W¹ = (1/N) ∑_{i=1}^N x_i
2. for l = 2 to K:
   2.1 compute the critical β_l
   2.2 initialize W^l to W^{l−1}
   2.3 duplicate the prototype of the critical cluster and split the associated weight
   2.4 for t values of β around β_l:
       2.4.1 add some small noise to W^l
       2.4.2 compute µ_ik = δ_k exp(−β‖x_i − W^l_k‖²) / ∑_{m=1}^K δ_m exp(−β‖x_i − W^l_m‖²)
       2.4.3 compute δ_k = (1/N) ∑_{i=1}^N µ_ik
       2.4.4 compute W^l_k = (1/(N δ_k)) ∑_{i=1}^N µ_ik x_i
       2.4.5 go back to 2.4.2 until convergence
3. use W^K as an estimation of arg min_{W∈R^p} (min_{M∈ℳ} E(M,W)) and µ_ik as the corresponding M
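A small helper sketching step 2.3, the duplication of the critical prototype with an even weight split (names are mine):

import numpy as np

def split_cluster(W, delta, k, noise=1e-3, rng=None):
    # duplicate prototype k (slightly perturbed) and give half of its weight to the copy
    rng = np.random.default_rng() if rng is None else rng
    w_new = W[k] + noise * rng.standard_normal(W.shape[1])
    W = np.vstack([W, w_new[None, :]])
    delta = np.append(delta, delta[k] / 2.0)
    delta[k] /= 2.0
    return W, delta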

Example

  [figure: a simple two-dimensional dataset (axes X1 and X2)]

Example

  [figure: error evolution and cluster births — the error E as a function of β]

Clusters

  [figures: a sequence of plots of the simple dataset (axes X1 and X2) showing the clusters obtained, one slide per step]

Summary

I deterministic annealing addresses combinatorial optimization by smoothing the cost function and tracking the evolution of the solution while the smoothing progressively vanishes
I motivated by:
  I heuristic (soft minimum)
  I statistical physics (Gibbs distribution)
  I information theory (maximal entropy)
I in practice:
  I rather simple multiple EM-like algorithm
  I a bit tricky to implement (phase transition, noise injection, etc.)
  I excellent results (frequently better than those obtained by K-means)


Combinatorial only

I the situation is quite different for a combinatorial problem

  arg min_{M∈ℳ} E(M)

I no soft minimum heuristic
I no direct use of the Gibbs distribution P_T; one needs to rely on E_{P_T}(f(M)) for interesting f
I calculation of P_T is not tractable:
  I tractability is related to independence
  I if E(M) decomposes as E(M_1) + . . . + E(M_D), the sums ∑_{M_1} . . . ∑_{M_D} factorize and independent optimization can be done on each variable...

Graph clustering

I an undirected graph with N nodes and weight matrix A (A_ij is the weight of the connection between nodes i and j)
I maximal modularity clustering:
  I node degree k_i = ∑_{j=1}^N A_ij, total weight m = (1/2) ∑_{i,j} A_ij
  I null model P_ij = k_i k_j / (2m), B_ij = (1/2m)(A_ij − P_ij)
  I assignment matrix M
  I modularity of the clustering

    Mod(M) = ∑_{i,j} ∑_k M_ik M_jk B_ij

I difficulty: the coupling M_ik M_jk (non linearity)
I in the sequel E(M) = −Mod(M)
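A short numpy sketch of these definitions (function names are mine; A is assumed symmetric and M is a 0/1 assignment matrix):

import numpy as np

def modularity_matrix(A):
    # B_ij = (A_ij - k_i k_j / (2m)) / (2m)
    k = A.sum(axis=1)
    two_m = A.sum()
    return (A - np.outer(k, k) / two_m) / two_m

def modularity(A, M):
    # Mod(M) = sum_{i,j} sum_k M_ik M_jk B_ij = trace(M^T B M)
    B = modularity_matrix(A)
    return float(np.trace(M.T @ B @ M))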

Strategy for DA

I choose some interesting statistics:
  I E_{P_T}(f(M))
  I for instance E_{P_T}(M_ik) for clustering
I compute an approximation of E_{P_T}(f(M)) at high temperature T
I track the evolution of the approximation while lowering T
I use lim_{T→0} E_{P_T}(f(M)) = f(arg min_{M∈ℳ} E(M)) to claim that the approximation converges to the interesting limit
I rationale: approximating E_{P_T}(f(M)) for a low T is probably difficult; we use the path following strategy to ease the task

Example

  [figures: a one-dimensional example — the data, then the probability as a function of x for β = 10^−1, 10^−0.75, ..., 10^2 (the distribution becomes more and more peaked as β grows), and a final summary plot as a function of β]

Expectation approximation

I computing

  E_{P_Z}(f(Z)) = ∫ f(Z) dP_Z

  has many applications
I for instance, in machine learning:
  I performance evaluation (generalization error)
  I EM for probabilistic models
  I Bayesian approaches
I two major numerical approaches:
  I sampling
  I variational approximation

Sampling

I (strong) law of large numbers

  lim_{N→∞} (1/N) ∑_{i=1}^N f(z_i) = ∫ f(Z) dP_Z

  if the z_i are independent samples of Z
I many extensions, especially to dependent samples (slower convergence)
I MCMC methods: build a Markov chain with stationary distribution P_Z and take the average of f on a trajectory
I applied to E_{P_T}(f(M)): this is simulated annealing!
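A toy illustration of the sampling approach (the distribution and the function f are mine, chosen only to show the law of large numbers at work):

import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E[f(Z)] for Z ~ N(0, 1) and f(z) = z**2 (exact value: 1)
z = rng.standard_normal(100_000)
print(np.mean(z ** 2))  # close to 1 by the law of large numbers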

Variational approximation

I main idea:
  I the intractability of the calculation of E_{P_Z}(f(Z)) is induced by the complexity of P_Z
  I let’s replace P_Z by a simpler distribution Q_Z...
  I ...and E_{P_Z}(f(Z)) by E_{Q_Z}(f(Z))
I calculus of variations:
  I optimization over functional spaces
  I in this context: choose an optimal Q_Z in a space of probability measures 𝒬 with respect to a goodness of fit criterion between P_Z and Q_Z
I probabilistic context:
  I simple distributions: factorized distributions
  I quality criterion: Kullback-Leibler divergence

Variational approximation

I assume Z = (Z_1, . . . , Z_D) ∈ R^D and f(Z) = ∑_{i=1}^D f_i(Z_i)
I for a general P_Z, the structure of f cannot be exploited in E_{P_Z}(f(Z))
I variational approximation:
  I choose 𝒬 a set of tractable distributions on R
  I solve

    Q* = arg min_{Q=(Q_1×...×Q_D)∈𝒬^D} KL(Q||P_Z)   with   KL(Q||P_Z) = −∫ ln (dP_Z/dQ) dQ

  I approximate E_{P_Z}(f(Z)) by

    E_{Q*}(f(Z)) = ∑_{i=1}^D E_{Q*_i}(f_i(Z_i))

Application to DA

I general form: P_β(M) = (1/Z_β) exp(−βE(M))
I intractable because E(M) is not linear in M = (M_1, . . . , M_D)
I variational approximation:
  I replace E by F(W, M) defined by

    F(W, M) = ∑_{i=1}^D W_i M_i

  I use for Q_{W,β}

    Q_{W,β} = (1/Z_{W,β}) exp(−βF(W, M))

  I optimize W via KL(Q_{W,β}||P_β)

Fitting the approximation

I a complex expectation calculation is replaced by a new optimization problem
I we have

  KL(Q_{W,β}||P_β) = ln Z_β − ln Z_{W,β} + β E_{Q_{W,β}}(E(M) − F(W, M))

I tractable by factorization on F, except for Z_β
I solved by ∇_W KL(Q_{W,β}||P_β) = 0
  I tractable because Z_β does not depend on W
  I generally solved via a fixed point approach (EM-like again)
I similar to variational approximation in EM or Bayesian methods

Example

I in the case of (graph) clustering, we use

  F(W, M) = ∑_{i=1}^N ∑_{k=1}^K W_ik M_ik

I a crucial (general) property is that, when i ≠ j,

  E_{Q_{W,β}}(M_ik M_jl) = E_{Q_{W,β}}(M_ik) E_{Q_{W,β}}(M_jl)

I a (long and boring) series of equations leads to the general mean field equations

  ∂E_{Q_{W,β}}(E(M)) / ∂W_jl = ∑_{k=1}^K (∂E_{Q_{W,β}}(M_jk) / ∂W_jl) W_jk,   ∀ j, l

Example

I classical EM-like scheme to solve the mean field equations:
  1. we have

     E_{Q_{W,β}}(M_ik) = exp(−βW_ik) / ∑_{l=1}^K exp(−βW_il)

  2. keep E_{Q_{W,β}}(M_ik) fixed and solve the simplified mean field equations for W
  3. update E_{Q_{W,β}}(M_ik) based on the new W and loop on 2 until convergence
I in the graph clustering case

  W_jk = 2 ∑_i E_{Q_{W,β}}(M_ik) B_ij
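A numpy sketch of one pass of this scheme for graph clustering (names are mine, and the sign convention of the slides is taken as given):

import numpy as np

def mean_field_step(B, W, beta):
    # B: (N, N) modularity matrix (symmetric), W: (N, K) mean fields, beta: inverse temperature
    logits = -beta * W
    logits -= logits.max(axis=1, keepdims=True)
    mu = np.exp(logits)
    mu /= mu.sum(axis=1, keepdims=True)   # E_Q(M_ik)
    W_new = 2.0 * (B @ mu)                # W_jk = 2 sum_i E_Q(M_ik) B_ij (B symmetric)
    return W_new, mu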

Summary

I deterministic annealing for combinatorial optimization:
  I given an objective function E(M)
  I choose a linear parametric approximation F(W, M) of E(M), with the associated distribution Q_{W,β} = (1/Z_{W,β}) exp(−βF(W, M))
  I write the mean field equations, i.e.

    ∇_W (− ln Z_{W,β} + β E_{Q_{W,β}}(E(M) − F(W, M))) = 0

  I use an EM-like algorithm to solve the equations:
    I given W, compute E_{Q_{W,β}}(M_ik)
    I given E_{Q_{W,β}}(M_ik), solve the equations
I back to our classical questions:
  I why would that work?
  I how to do that in practice?


Physical interpretation

I back to the (generalized) Helmholtz (free) energy

  F_β = U − S/β

I the free energy can be generalized to any distribution Q on the states, by

  F_{β,Q} = E_Q(E(M)) − H(Q)/β,

  where H(Q) = −∑_{M∈ℳ} Q(M) ln Q(M) is the entropy of Q
I the Boltzmann-Gibbs distribution minimizes the free energy, i.e. F_β ≤ F_{β,Q}
I in fact we have

  F_{β,Q} = F_β + (1/β) KL(Q||P),

  where P is the Gibbs distribution
I minimizing KL(Q||P) ≥ 0 over Q corresponds to finding the best upper bound of the free energy over a class of distributions
I in other words: given that the system's states must be distributed according to a distribution in 𝒬, find the most stable distribution

Mean field

I E(M) is difficult to handle because of coupling (dependencies), while F(W, M) is linear and corresponds to a de-coupling in which dependencies are replaced by mean effects
I example:
  I modularity

    E(M) = ∑_i ∑_k M_ik ∑_j M_jk B_ij

  I approximation

    F(W, M) = ∑_i ∑_k M_ik W_ik

  I the complex influence of M_ik on E is replaced by a single parameter W_ik: the mean effect associated with a change in M_ik


In practice

I homotopy again:
  I at high temperature (β → 0):
    I E_{Q_{W,β}}(M_ik) = 1/K
    I the fixed point equation is generally easy to solve; for instance, in graph clustering

      W_jk = (2/K) ∑_i B_ij

  I slowly decrease the temperature
  I use the mean field at the previous, higher temperature as a starting point for the fixed point iterations
I phase transitions:
  I stable versus unstable fixed points (eigenanalysis)
  I noise injection and mass constrained version

A graph clustering algorithm

1. initialize W¹_jk = (2/K) ∑_i B_ij
2. for l = 2 to K:
   2.1 compute the critical β_l
   2.2 initialize W^l to W^{l−1}
   2.3 for t values of β around β_l:
       2.3.1 add some small noise to W^l
       2.3.2 compute µ_ik = exp(−βW_ik) / ∑_{m=1}^K exp(−βW_im)
       2.3.3 compute W_jk = 2 ∑_i µ_ik B_ij
       2.3.4 go back to 2.3.2 until convergence
3. threshold µ_ik into an optimal partition of the original graph

I mass constrained approach variant
I stability based transition detection (slower annealing)

Karate

I clustering of a simple graph
I Zachary’s Karate club
I four clusters

  [figure: the Karate club graph, nodes 1–34]

Modularity

  [figure: evolution of the modularity during annealing, as a function of β]

Phase transitions

I first phase: 2 clusters
I second phase: 4 clusters

  [figures: the clusters obtained after each phase]


DA Roadmap

How to apply deterministic annealing to a combinatorial optimization problem E(M)?

1. define a linear mean field approximation F(W, M)
2. specialize the mean field equations, i.e. compute

   ∂E_{Q_{W,β}}(E(M)) / ∂W_jl = ∂/∂W_jl ( (1/Z_{W,β}) ∑_M E(M) exp(−βF(W, M)) )

3. identify a fixed point scheme to solve the mean field equations, using E_{Q_{W,β}}(M_i) as a constant if needed
4. wrap the corresponding EM scheme in an annealing loop: this is the basic algorithm
5. analyze the fixed point scheme to find stability conditions: this leads to an advanced annealing schedule

Summary

I deterministic annealing addresses combinatorial optimization by solving a related but simpler problem and by tracking the evolution of the solution while the simpler problem converges to the original one
I motivated by:
  I heuristic (soft minimum)
  I statistical physics (Gibbs distribution)
  I information theory (maximal entropy)
I works mainly because of the solution following strategy (homotopy): do not solve a difficult problem from scratch, but rather start from a good guess of the solution

Summary

I difficulties:
  I boring and technical calculations (when facing a new problem)
  I a bit tricky to implement (phase transition, noise injection, etc.)
  I tends to be slow in practice: computing exp is very costly even on modern hardware (roughly 100 flops)
I advantages:
  I versatile
  I appears as a simple EM-like algorithm embedded in an annealing loop
  I very interesting intermediate results
  I excellent final results in practice

References

T. Hofmann and J. M. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1–14, January 1997.

K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239, November 1998.

F. Rossi and N. Villa. Optimizing an organized modularity measure for topographic graph clustering: a deterministic annealing approach. Neurocomputing, 73(7–9):1142–1163, March 2010.