Genetic algorithms (GA) for clustering
![Page 1: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/1.jpg)
Genetic algorithms (GA) for clustering
Pasi Fränti
Clustering Methods: Part 2e
Speech and Image Processing Unit, School of Computing
University of Eastern Finland
![Page 2: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/2.jpg)
General structure
Genetic Algorithm:
  Generate S initial solutions
  REPEAT Z iterations
    Select best solutions
    Create new solutions by crossover
    Mutate solutions
  END-REPEAT
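The loop above can be sketched as a generic GA skeleton in Python. This is a minimal illustration only; the callbacks `generate`, `evaluate`, `crossover`, and `mutate` are hypothetical placeholders for the problem-specific components discussed on the following slides.

```python
import random

def genetic_algorithm(data, S, Z, generate, evaluate, crossover, mutate):
    """Generic GA skeleton: population of S solutions, Z generations.
    generate/evaluate/crossover/mutate are problem-specific callbacks."""
    population = [generate(data) for _ in range(S)]
    for _ in range(Z):
        # keep the best solutions (elitist survivor selection)
        population.sort(key=evaluate)
        survivors = population[:max(2, S // 2)]
        # refill the population by crossing random pairs of survivors
        children = []
        while len(survivors) + len(children) < S:
            a, b = random.sample(survivors, 2)
            children.append(mutate(crossover(a, b)))
        population = survivors + children
    return min(population, key=evaluate)
```

With toy callbacks (e.g. real-valued solutions, fitness = distance to a target), the skeleton already converges; clustering plugs in codebooks and distortion instead.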
![Page 3: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/3.jpg)
Components of GA
• Representation of solution
• Selection method
• Crossover method
• Mutation
The crossover method is the most critical!
![Page 4: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/4.jpg)
Representation of solution
• Partition (P):
  – Optimal centroids can be calculated from P.
  – Only local changes can be made.
• Codebook (C):
  – Optimal partition can be calculated from C.
  – Calculating P takes O(NM) time, which is slow.
• Combined (C, P):
  – Both data structures are needed anyway.
  – Computationally more efficient.
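The two conversions mentioned above can be sketched in plain Python (illustrative helper names): the optimal partition is obtained from C by a nearest-centroid search costing O(NM) distance evaluations, and the optimal centroids from P by averaging each cluster.

```python
def optimal_partition(data, centroids):
    """Map each point to its nearest centroid: O(N*M) distance evaluations."""
    def dist2(x, c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return [min(range(len(centroids)), key=lambda j: dist2(x, centroids[j]))
            for x in data]

def optimal_centroids(data, partition, m):
    """Compute each cluster's centroid (mean vector) from the partition."""
    dim = len(data[0])
    sums = [[0.0] * dim for _ in range(m)]
    counts = [0] * m
    for x, j in zip(data, partition):
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += x[d]
    return [[s / counts[j] for s in sums[j]] if counts[j] else None
            for j in range(m)]
```

Keeping both C and P (the combined representation) avoids recomputing the expensive partition step on every evaluation.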
![Page 5: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/5.jpg)
Selection method
• Selects which solutions will be used in crossover for generating new solutions.
• Main principle: good solutions should be used rather than weak ones.
• Two main strategies:
  – Roulette wheel selection
  – Elitist selection
• The exact implementation is not so important.
![Page 6: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/6.jpg)
Roulette wheel selection
• Select two candidate solutions for the crossover randomly.
• The probability of a solution being selected is weighted according to its distortion:

  w(C_i, P_i) = 1 / distortion(C_i, P_i)

  p(C_i, P_i) = w(C_i, P_i) / Σ_{j=1..S} w(C_j, P_j)
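This weighting can be sketched in Python (hypothetical function name; `random.choices` performs the actual wheel spin):

```python
import random

def roulette_select(population, distortions, k=2):
    """Select k parents with probability inversely proportional to distortion."""
    weights = [1.0 / d for d in distortions]   # w_i = 1 / distortion_i
    total = sum(weights)
    probs = [w / total for w in weights]       # p_i = w_i / sum_j w_j
    # random.choices draws with replacement according to the given weights
    return random.choices(population, weights=probs, k=k)
```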
![Page 7: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/7.jpg)
Elitist selection
Elitist approach using zigzag scanning among the best solutions.

• Main principle: select all possible pairs among the best candidates.

SelectNextPair → (i, j):
  REPEAT
    IF (i + j) MOD 2 = 0 THEN i ← max(1, i − 1); j ← j + 1
    ELSE j ← max(1, j − 1); i ← i + 1
  UNTIL i ≠ j
  RETURN (i, j)
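The zigzag scan can be written as a generator. This is one interpretation of the pseudocode, assuming the scan starts from (1, 1) and pairs on the diagonal (i = j) are skipped:

```python
from itertools import islice

def zigzag_pairs(i=1, j=1):
    """Enumerate index pairs by zigzag-scanning the (i, j) grid,
    skipping the diagonal, as in the elitist selection pseudocode."""
    while True:
        if (i + j) % 2 == 0:
            i = max(1, i - 1)
            j = j + 1
        else:
            j = max(1, j - 1)
            i = i + 1
        if i != j:
            yield (i, j)

first = list(islice(zigzag_pairs(), 5))
# → [(1, 2), (2, 1), (3, 1), (1, 3), (1, 4)]
```

The scan thus favors pairs involving the very best solutions (small indices) before moving outward.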
![Page 8: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/8.jpg)
Crossover methods
Different variants for crossover:
• Random crossover
• Centroid distance
• Pairwise crossover
• Largest partitions
• PNN

Local fine-tuning:
• All methods give a new allocation of the centroids.
• Local fine-tuning must be done by K-means.
• Two iterations of K-means are enough.
![Page 9: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/9.jpg)
Random crossover
Solution 1 + Solution 2
Select M/2 centroids randomly from each of the two parents.
![Page 10: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/10.jpg)
New solution:

How to create a new solution?
• Pick M/2 randomly chosen cluster centroids from each of the two parents in turn.

How many solutions are there?
• For the example below (M = 4), there are 36 ways to create a new solution.

What is the probability of selecting a good one?
• Not high: a few combinations are good (but still need K-means); most are bad. See the statistics below.

Some possibilities (M = 4; M = number of clusters):

  Parent A   Parent B   Rating
  c2, c4     c1, c4     Optimal
  c1, c2     c3, c4     Good (K-means)
  c2, c3     c2, c3     Bad

Rough statistics: Optimal: 1, Good: 7, Bad: 28.

[Figure: parent solutions A and B, each showing data points and centroids c1–c4]
![Page 11: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/11.jpg)
[Figure: parent solutions A and B (centroids c1–c4) and three child solutions: optimal, good, and bad]
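Random crossover itself is a one-liner; this sketch (illustrative name; parents are lists of centroid vectors) picks M/2 centroids from each parent:

```python
import random

def random_crossover(parent_a, parent_b, m):
    """Create a child codebook by picking M/2 centroids from each parent."""
    half = m // 2
    return random.sample(parent_a, half) + random.sample(parent_b, m - half)
```

As the statistics above suggest, most children produced this way are poor starting points and rely on the subsequent K-means fine-tuning.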
![Page 12: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/12.jpg)
Centroid distance crossover [Pan, McInnes, Jack, 1995: Electronics Letters; Scheunders, 1996: Pattern Recognition Letters]
• For each centroid, calculate its distance to the center point of the entire data set.
• Sort the centroids according to this distance.
• Divide them into two sets: central vectors (the M/2 closest) and distant vectors (the M/2 furthest).
• Take the central vectors from one codebook and the distant vectors from the other.
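The four steps above can be sketched as follows (hypothetical function name; `variant` selects which halves are mixed, matching variants (a) and (b) on the next slide):

```python
def centroid_distance_crossover(parent_a, parent_b, data, variant="a"):
    """Split each parent's centroids into 'central' and 'distant' halves
    by distance to the data set's mean, then mix halves across parents."""
    dim = len(data[0])
    center = [sum(x[d] for x in data) / len(data) for d in range(dim)]

    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(c, center))

    def split(parent):
        ordered = sorted(parent, key=dist2)
        half = len(parent) // 2
        return ordered[:half], ordered[half:]   # (central, distant)

    central_a, distant_a = split(parent_a)
    central_b, distant_b = split(parent_b)
    if variant == "a":
        return central_a + distant_b
    return distant_a + central_b
```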
![Page 13: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/13.jpg)
New solution:
• Variant (a): take the central vectors from parent solution A and the distant vectors from parent solution B, OR
• Variant (b): take the distant vectors from parent solution A and the central vectors from parent solution B.

Worked example (M = 4; Ced = centroid of the entire data set):
1) Distances d(c_i, Ced):
   A: d(c4, Ced) < d(c2, Ced) < d(c1, Ced) < d(c3, Ced)
   B: d(c1, Ced) < d(c3, Ced) < d(c2, Ced) < d(c4, Ced)
2) Sort the centroids according to the distance:
   A: c4, c2, c1, c3   B: c1, c3, c2, c4
3) Divide into two sets:
   A: central vectors c4, c2; distant vectors c1, c3
   B: central vectors c1, c3; distant vectors c2, c4

[Figure: parent solutions A and B with data points, centroids c1–c4, and the data-set centroid Ced]
![Page 14: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/14.jpg)
Child – variant (a): central vectors from parent A + distant vectors from parent B.
Child – variant (b): distant vectors from parent A + central vectors from parent B.

[Figure: the two child solutions with data points, centroids c1–c4, and the data-set centroid Ced]
![Page 15: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/15.jpg)
Pairwise crossover [Fränti et al, 1997: The Computer Journal]
Greedy approach:
• For each centroid, find its nearest centroid in the other parent solution that has not yet been used.
• From each pair, select one of the two centroids randomly.

Small improvement:
• There is no reason to treat the parents as separate solutions.
• Take the union of all centroids.
• Make the pairing independent of the parent.
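The improved variant (union of all centroids, pairing independent of parent) can be sketched as below; the greedy nearest-neighbor pairing is illustrative, not the authors' exact implementation:

```python
import random

def pairwise_crossover(parent_a, parent_b):
    """Greedily pair nearest centroids in the union of both parents,
    then keep one randomly chosen centroid from each pair."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    pool = list(parent_a) + list(parent_b)
    child = []
    while pool:
        c = pool.pop(0)
        if not pool:              # odd leftover: keep it as-is
            child.append(c)
            break
        nearest = min(range(len(pool)), key=lambda k: dist2(c, pool[k]))
        partner = pool.pop(nearest)
        child.append(random.choice((c, partner)))
    return child
```

Because near-duplicate centroids from the two parents get paired with each other, the child keeps one representative per region instead of two.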
![Page 16: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/16.jpg)
Initial parent solutions
Pairwise crossover example
MSE = 8.79×10⁹
MSE = 11.92×10⁹
![Page 17: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/17.jpg)
Pairwise crossover example
Pairing between parent solutions
MSE = 7.34×10⁹
![Page 18: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/18.jpg)
Pairing without restrictions
MSE = 4.76×10⁹
Pairwise crossover example
![Page 19: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/19.jpg)
Largest partitions [Fränti et al, 1997: The Computer Journal]
• Select the centroids that represent the largest clusters.
• Selection is done in a greedy manner.
• (Illustration to appear later.)
![Page 20: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/20.jpg)
PNN crossover for GA [Fränti et al, 1997: The Computer Journal]

[Figure: Initial 1 and Initial 2 → Union (combined solution) → After PNN]
![Page 21: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/21.jpg)
The PNN crossover method (1) [Fränti, 2000: Pattern Recognition Letters]

CrossSolutions(C1, P1, C2, P2) → (Cnew, Pnew)
  Cnew ← CombineCentroids(C1, C2)
  Pnew ← CombinePartitions(P1, P2)
  Cnew ← UpdateCentroids(Cnew, Pnew)
  RemoveEmptyClusters(Cnew, Pnew)
  PerformPNN(Cnew, Pnew)

CombineCentroids(C1, C2) → Cnew
  Cnew ← C1 ∪ C2

CombinePartitions(Cnew, P1, P2) → Pnew
  FOR i ← 1 TO N DO
    IF ||x_i − c1[p1_i]||² < ||x_i − c2[p2_i]||² THEN
      pnew_i ← p1_i
    ELSE
      pnew_i ← p2_i
  END-FOR
![Page 22: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/22.jpg)
The PNN crossover method (2)
UpdateCentroids(Cnew, Pnew) → Cnew
  FOR j ← 1 TO |Cnew| DO
    cnew_j ← CalculateCentroid(Pnew, j)

PerformPNN(Cnew, Pnew)
  FOR i ← 1 TO |Cnew| DO
    q_i ← FindNearestNeighbor(c_i)
  WHILE |Cnew| > M DO
    a ← FindMinimumDistance(Q)
    b ← q_a
    MergeClusters(c_a, p_a, c_b, p_b)
    UpdatePointers(Q)
  END-WHILE
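PerformPNN can be illustrated with a simplified merge loop on (centroid, cluster-size) pairs. The merge cost n_a·n_b/(n_a+n_b)·||c_a − c_b||² is the standard PNN merge criterion; this sketch rescans all pairs per merge (O(M²) each step) and omits the nearest-neighbor pointer table Q used in the pseudocode:

```python
def perform_pnn(centroids, counts, m):
    """Merge clusters pairwise-nearest-neighbor style until m remain.
    Merge cost: n_a*n_b/(n_a+n_b) * ||c_a - c_b||^2."""
    cents = [list(c) for c in centroids]
    ns = list(counts)

    def cost(a, b):
        d2 = sum((p - q) ** 2 for p, q in zip(cents[a], cents[b]))
        return ns[a] * ns[b] / (ns[a] + ns[b]) * d2

    while len(cents) > m:
        # find the cheapest pair to merge (full O(M^2) scan per step)
        a, b = min(((i, j) for i in range(len(cents))
                    for j in range(i + 1, len(cents))),
                   key=lambda ij: cost(*ij))
        total = ns[a] + ns[b]
        # merged centroid is the size-weighted mean of the two centroids
        cents[a] = [(ns[a] * p + ns[b] * q) / total
                    for p, q in zip(cents[a], cents[b])]
        ns[a] = total
        del cents[b], ns[b]
    return cents, ns
```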
![Page 23: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/23.jpg)
Importance of K-means (random crossover)
[Figure: distortion (160–260) vs. generation (0–50) on Bridge; curves show the best and worst solutions, with and without k-means.]
![Page 24: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/24.jpg)
Effect of crossover method (with k-means iterations)

[Figure: distortion (160–190) vs. generation (0–50) on Bridge; curves: Random, Cent.dist., Pairwise, Largest partitions, PNN.]
![Page 25: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/25.jpg)
Effect of crossover method (with k-means iterations)

[Figure: distortion (1.25–1.50) vs. generation (0–50) on binary data (Bridge2); curves: Random, Cent.dist., Pairwise, Largest partitions, PNN.]
![Page 26: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/26.jpg)
Mutations
• The purpose is to make small random changes to the solutions.
• A mutation happens with a small probability.
• A sensible approach: change the location of one centroid by a random swap!
• The role of mutations is to simulate local search.
• If mutations are needed, the crossover method is not very good.
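The random swap described above can be sketched as follows (hypothetical name; with probability `rate`, one randomly chosen centroid is replaced by a randomly chosen data point):

```python
import random

def random_swap_mutation(centroids, data, rate=0.05):
    """With small probability, replace one randomly chosen centroid
    by a randomly chosen data point (the 'random swap')."""
    child = [tuple(c) for c in centroids]
    if random.random() < rate:
        child[random.randrange(len(child))] = tuple(random.choice(data))
    return child
```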
![Page 27: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/27.jpg)
Effect of k-means and mutations
[Figure: distortion (160–180) vs. number of iterations (0–50) on Bridge; curves: Mutations + K-means, PNN crossover + K-means, Random crossover + K-means, PNN.]

• Mutations alone are better than random crossover!
• K-means improves the result but is not vital.
![Page 28: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/28.jpg)
Pseudo code of GAIS [Virmajoki & Fränti, 2006: Pattern Recognition]
GeneticAlgorithm(X) → (C, P)
  FOR i ← 1 TO Z DO
    C_i ← RandomCodebook(X)
    P_i ← OptimalPartition(X, C_i)
  SortSolutions(C, P)
  REPEAT
    {C, P} ← CreateNewSolutions({C, P})
    SortSolutions(C, P)
  UNTIL no improvement

CreateNewSolutions({C, P}) → {Cnew, Pnew}
  Cnew_1, Pnew_1 ← C_1, P_1
  FOR i ← 2 TO Z DO
    (a, b) ← SelectNextPair
    Cnew_i, Pnew_i ← Cross(C_a, P_a, C_b, P_b)
    IterateK-Means(Cnew_i, Pnew_i)

Cross(C1, P1, C2, P2) → (Cnew, Pnew)
  Cnew ← CombineCentroids(C1, C2)
  Pnew ← CombinePartitions(P1, P2)
  Cnew ← UpdateCentroids(Cnew, Pnew)
  RemoveEmptyClusters(Cnew, Pnew)
  IS(Cnew, Pnew)

CombineCentroids(C1, C2) → Cnew
  Cnew ← C1 ∪ C2

CombinePartitions(Cnew, P1, P2) → Pnew
  FOR i ← 1 TO N DO
    IF ||x_i − c1[p1_i]||² < ||x_i − c2[p2_i]||² THEN pnew_i ← p1_i
    ELSE pnew_i ← p2_i
  END-FOR

UpdateCentroids(Cnew, Pnew) → Cnew
  FOR j ← 1 TO |Cnew| DO
    cnew_j ← CalculateCentroid(Pnew, j)
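As a toy end-to-end illustration of this structure (all names hypothetical; random crossover and plain k-means stand in for the IS crossover and the full GAIS machinery):

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def partition(data, cents):
    """Nearest-centroid assignment for every data point."""
    return [min(range(len(cents)), key=lambda j: dist2(x, cents[j]))
            for x in data]

def update_centroids(data, part, m, old):
    """Recompute cluster means; keep the old centroid for empty clusters."""
    cents = []
    for j in range(m):
        members = [x for x, pj in zip(data, part) if pj == j]
        if members:
            dim = len(data[0])
            cents.append(tuple(sum(x[d] for x in members) / len(members)
                               for d in range(dim)))
        else:
            cents.append(old[j])
    return cents

def distortion(data, cents, part):
    return sum(dist2(x, cents[j]) for x, j in zip(data, part)) / len(data)

def kmeans_iterations(data, cents, g=2):
    """G iterations of k-means fine-tuning."""
    for _ in range(g):
        part = partition(data, cents)
        cents = update_centroids(data, part, len(cents), cents)
    return cents

def gais(data, m, z=10, generations=20, g=2):
    """GA sketch: population of Z codebooks; each generation keeps the best
    and fills the rest by crossing top solutions, fine-tuned by k-means."""
    pop = [kmeans_iterations(data, random.sample(data, m), g)
           for _ in range(z)]
    def score(c):
        return distortion(data, c, partition(data, c))
    for _ in range(generations):
        pop.sort(key=score)
        children = [pop[0]]                    # elitism: best survives
        while len(children) < z:
            a, b = random.sample(pop[:max(2, z // 2)], 2)
            child = random.sample(a, m // 2) + random.sample(b, m - m // 2)
            children.append(kmeans_iterations(data, child, g))
        pop = children
    return min(pop, key=score)
```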
![Page 29: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/29.jpg)
PNN vs. IS crossovers
Further improvement of about 1%
[Figure: MSE (160–166) vs. number of iterations (0–50) on Bridge; curves: IS crossover + K-means, IS crossover, PNN crossover, PNN crossover + K-means.]
![Page 30: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/30.jpg)
Optimized GAIS variants
GAIS short (optimized for speed):
- Create new generations only as long as the best solution keeps improving (T = *).
- Use a small population size (Z = 10).
- Apply two iterations of k-means (G = 2).

GAIS long (optimized for quality):
- Create a large number of generations (T = 100).
- Use a large population size (Z = 100).
- Iterate k-means relatively long (G = 10).
![Page 31: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/31.jpg)
Comparison of algorithms (MSE; image sets: Bridge, House, Miss America; Birch data sets: B1–B3; synthetic data sets: S1–S4; time in seconds on Bridge):

| Method | Bridge | House | Miss America | B1 | B2 | B3 | S1 | S2 | S3 | S4 | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 251.32 | 12.12 | 8.34 | 14.44 | 35.73 | 8.20 | 78.55 | 72.91 | 55.42 | 47.05 | <1 |
| K-means (aver.) | 179.87 | 7.81 | 5.96 | 5.52 | 7.99 | 2.53 | 20.53 | 20.91 | 21.37 | 16.78 | 5 |
| K-means (best) | 176.95 | 7.35 | 5.93 | 5.13 | 6.87 | 2.16 | 13.23 | 16.07 | 18.96 | 15.71 | 50 |
| SOM | 173.63 | 7.59 | 5.92 | 13.50 | 10.03 | 15.18 | 20.11 | 13.28 | 21.10 | 15.71 | 376 |
| FCM | 178.39 | 7.79 | 6.22 | 5.02 | 5.29 | 2.48 | 8.92 | 13.28 | 16.89 | 15.71 | 166 |
| Split | 170.22 | 6.18 | 5.40 | 4.81 | 2.29 | 1.91 | 8.95 | 13.33 | 17.50 | 16.01 | 13 |
| Split + k-means | 165.77 | 6.06 | 5.28 | 4.64 | 2.28 | 1.91 | 8.92 | 13.28 | 16.92 | 15.77 | 17 |
| RLS | 164.64 | 5.96 | 5.28 | 4.64 | 2.28 | 1.86 | 8.92 | 13.28 | 16.89 | 15.71 | 1146 |
| Split-n-Merge | 163.81 | 5.98 | 5.19 | 4.64 | 2.28 | 1.93 | 8.92 | 13.28 | 16.91 | 15.75 | 85 |
| SR (average) | 162.45 | 6.02 | 5.27 | 4.84 | 3.39 | 1.99 | 9.52 | 13.68 | 17.31 | 15.80 | 213 |
| SR (best) | 161.96 | 5.98 | 5.25 | 4.76 | 3.12 | 1.98 | 8.93 | 13.28 | 16.89 | 15.71 | 2130 |
| PNN | 168.92 | 6.27 | 5.36 | 4.73 | 2.28 | 1.96 | 8.93 | 13.44 | 17.70 | 17.52 | 272 |
| PNN + k-means | 165.04 | 6.07 | 5.24 | 4.64 | 2.28 | 1.88 | 8.92 | 13.28 | 16.89 | 16.87 | 285 |
| GKM – fast 10 | 164.12 | 5.94 | 5.34 | 4.64 | 2.28 | 1.92 | 8.92 | 13.28 | 16.89 | 15.71 | 91721 |
| IS | 163.38 | 6.09 | 5.19 | 4.70 | 2.28 | 1.89 | 8.92 | 13.29 | 16.96 | 15.79 | 717 |
| IS + k-means | 162.38 | 6.02 | 5.17 | 4.64 | 2.28 | 1.86 | 8.92 | 13.28 | 16.89 | 15.71 | 719 |
| GA (k-means) | 174.91 | 6.61 | 5.54 | 6.58 | 5.96 | 2.45 | 11.66 | 15.99 | 19.22 | 16.14 | 654 |
| GA (PNN) | 162.37 | 5.92 | 5.17 | 4.98 | 2.28 | 1.98 | 8.92 | 13.28 | 16.89 | 15.71 | 404 |
| SAGA | 161.22 | 5.86 | 5.10 | 4.64 | 2.28 | 1.86 | 8.92 | 13.28 | 16.89 | 15.71 | 74554 |
| GAIS (short) | 161.59 | 5.92 | 5.11 | 4.64 | 2.28 | 1.86 | 8.92 | 13.28 | 16.89 | 15.72 | 1311 |
| GAIS (long) | 160.73 | 5.89 | 5.07 | 4.64 | 2.28 | 1.86 | 8.92 | 13.28 | 16.89 | 15.71 | 387533 |
![Page 32: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/32.jpg)
Variation of the result
[Figure: frequency histogram of final MSE (160–190) over repeated runs, for k-means, PNN, IS, IS + k-means, and GAIS.]
![Page 33: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/33.jpg)
Time vs. quality comparison (Bridge)
[Figure: MSE (160–190) vs. time (1–100,000 s, log scale); methods: repeated K-means, RLS, PNN, IS, GAIS, SAGA.]
![Page 34: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/34.jpg)
Conclusions
• The best clustering is obtained by GA.
• The crossover method is the most important component.
• Mutations are not needed.
![Page 35: Genetic algorithms (GA) for clustering](https://reader035.fdocuments.net/reader035/viewer/2022062305/56815940550346895dc68045/html5/thumbnails/35.jpg)
References
1. P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
2. P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, January 2000.
3. P. Fränti, J. Kivijärvi, T. Kaukoranta and O. Nevalainen, "Genetic algorithms for large scale clustering problems", The Computer Journal, 40 (9), 547-554, 1997.
4. J. Kivijärvi, P. Fränti and O. Nevalainen, "Self-adaptive genetic algorithm for clustering", Journal of Heuristics, 9 (2), 113-129, 2003.
5. J.S. Pan, F.R. McInnes and M.A. Jack, "VQ codebook design using genetic algorithms", Electronics Letters, 31, 1418-1419, August 1995.
6. P. Scheunders, "A genetic Lloyd-Max quantization algorithm", Pattern Recognition Letters, 17, 547-556, 1996.