K-means++
K-means++ Seeding Algorithm, Implementation in MLDemos

Renaud Richardet, Brain Mind Institute
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
K-means

• K-means is a widely used clustering technique.
• Initialization: blind random selection from the input data.
• Drawback: very sensitive to the choice of initial cluster centers (seeds).
• The resulting local optimum can be arbitrarily bad with respect to the objective function, compared to the globally optimal clustering.
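This sensitivity is easy to reproduce. Below is a minimal pure-Python sketch (names and dataset are illustrative, not the MLDemos code) of Lloyd's k-means with blind random seeding, run from many seeds on a 4-squares-style toy set; the spread of final RSS values shows how much the seeds matter.

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((u - v) ** 2 for u, v in zip(a, b))

def rss(points, centers):
    # Residual sum of squares: the k-means objective.
    return sum(min(dist2(x, c) for c in centers) for x in points)

def kmeans(points, k, rng, iters=50):
    # Blind random seeding: k distinct data points chosen uniformly.
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: dist2(x, centers[j]))
            clusters[j].append(x)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [tuple(sum(vals) / len(c) for vals in zip(*c)) if c
                   else centers[j] for j, c in enumerate(clusters)]
    return centers

# Toy set: four tight groups of four points, one per quadrant (n=16).
pts = [(sx * a, sy * b) for sx in (1, -1) for sy in (1, -1)
       for a in (1, 2) for b in (1, 2)]
scores = [rss(pts, kmeans(pts, 4, random.Random(s))) for s in range(30)]
print(min(scores), max(scores))  # spread illustrates seed sensitivity
```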
K-means++

• A seeding technique for k-means from Arthur and Vassilvitskii [2007].
• Idea: spread the k initial cluster centers away from each other.
• O(log k)-competitive with the optimal clustering.
• Substantial convergence-time speedups (empirical).
Algorithm

Notation:
• c ∈ C: a cluster center
• x ∈ X: a data point
• D(x): distance between x and the nearest cluster center that has already been chosen
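Using this notation, the k-means++ selection rule of Arthur and Vassilvitskii [2007] chooses the first center uniformly at random and each subsequent center with probability proportional to D(x)². A minimal Python sketch (not the MLDemos C++ code; function names are illustrative):

```python
import random

def kmeanspp_seeds(points, k, rng=None):
    """Pick k seeds with D(x)^2 weighting (sketch of the k-means++ rule)."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # D(x)^2: squared distance from x to the nearest chosen center.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
              for x in points]
        # Draw the next center with probability proportional to D(x)^2,
        # by walking the cumulative weights past a random threshold.
        r = rng.uniform(0.0, sum(d2))
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers

# Two tight pairs, far apart: the D^2 weighting makes the second seed
# land in the opposite pair with high probability.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = kmeanspp_seeds(pts, 2, random.Random(1))
print(seeds)
```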
Implementation

• Based on Apache Commons Math's KMeansPlusPlusClusterer and Arthur's [2007] implementation.
• Implemented directly in MLDemos' core.
Implementation Test Dataset: 4 squares (n=16)
Expected: 4 nice clusters
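The slides do not list the dataset's coordinates, but judging from the points printed in the sample output that follows, the 4-squares set appears to be every sign combination of coordinates drawn from {1, 2}. A hypothetical reconstruction (an inference, not confirmed by the slides), together with the RSS of the expected one-cluster-per-quadrant optimum:

```python
# Hypothetical reconstruction of the "4 squares" test set (n=16),
# inferred from coordinates such as [-2.0; 2.0] and [1.0; 1.0] in the
# sample output: all sign combinations with coordinates in {1.0, 2.0}.
points = [(sx * a, sy * b)
          for sx in (1.0, -1.0) for sy in (1.0, -1.0)
          for a in (1.0, 2.0) for b in (1.0, 2.0)]

# Expected optimum: one cluster per quadrant, centroid at (±1.5, ±1.5);
# each point then sits 0.5 (squared distance) from its centroid.
centroids = [(sx * 1.5, sy * 1.5) for sx in (1, -1) for sy in (1, -1)]
rss = sum(min(sum((u - v) ** 2 for u, v in zip(p, c)) for c in centroids)
          for p in points)
print(len(points), rss)  # 16 points, RSS 8.0 for this reconstruction
```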
Sample Output

```
1: first cluster center 0 at rand: x=4 [-2.0; 2.0]
1: initial minDist for 0  [-1.0;-1.0] = 10.0
1: initial minDist for 1  [ 2.0; 1.0] = 17.0
1: initial minDist for 2  [ 1.0;-1.0] = 18.0
1: initial minDist for 3  [-1.0;-2.0] = 17.0
1: initial minDist for 5  [ 2.0; 2.0] = 16.0
1: initial minDist for 6  [ 2.0;-2.0] = 32.0
1: initial minDist for 7  [-1.0; 2.0] = 1.0
1: initial minDist for 8  [-2.0;-2.0] = 16.0
1: initial minDist for 9  [ 1.0; 1.0] = 10.0
1: initial minDist for 10 [ 2.0;-1.0] = 25.0
1: initial minDist for 11 [-2.0;-1.0] = 9.0
[…]
2: picking cluster center 1 --------------
3: distSqSum=3345.0
3: random index 1532.706909
4: new cluster point: x=6 [2.0;-2.0]
```
Sample Output (2)

```
4: updating minDist for 0  [-1.0;-1.0] = 10.0
4: updating minDist for 1  [ 2.0; 1.0] = 9.0
4: updating minDist for 2  [ 1.0;-1.0] = 2.0
4: updating minDist for 3  [-1.0;-2.0] = 9.0
4: updating minDist for 5  [ 2.0; 2.0] = 16.0
4: updating minDist for 7  [-1.0; 2.0] = 25.0
4: updating minDist for 8  [-2.0;-2.0] = 16.0
4: updating minDist for 9  [ 1.0; 1.0] = 10.0
4: updating minDist for 10 [ 2.0;-1.0] = 1.0
4: updating minDist for 11 [-2.0;-1.0] = 17.0
[…]
2: picking cluster center 2 -----------------
3: distSqSum=961.0
3: random index 103.404701
4: new cluster point: x=1 [2.0;1.0]
4: updating minDist for 0 [-1.0;-1.0] = 13.0
[…]
```
Evaluation on Test Dataset

• 200 clustering runs, each with and without k-means++ initialization.
• Measured RSS (intra-class variance).
• K-means reached the optimal clustering 115 times (57.5%).
• K-means++ reached the optimal clustering 182 times (91%).
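RSS here is the residual sum of squares, the quantity k-means minimizes: the sum over all points of the squared distance to the nearest cluster center. A minimal sketch of the measure:

```python
def rss(points, centers):
    """Residual sum of squares (intra-class variance, unnormalized):
    sum over points of the squared distance to the nearest center."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
        for x in points)

# Two points, one center midway between them: each contributes 1.0.
print(rss([(0.0, 0.0), (2.0, 0.0)], [(1.0, 0.0)]))  # 2.0
```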
Comparison of the frequency distribution of RSS values between k-means and k-means++ on the evaluation dataset (n=200).
Evaluation on Real Dataset

• UCI's Water Treatment Plant data set: daily sensor measurements from an urban waste water treatment plant (n=396, d=38).
• Sampled 500 clustering runs each for k-means and k-means++ with k=13, and recorded the RSS.
• The difference is highly significant (P < 0.0001).
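The slides report P < 0.0001 but do not name the significance test used. As one illustration of how such a comparison can be made, here is a permutation test on the difference of mean RSS between the two samples (the RSS values below are made up for the demo, not the study's data):

```python
import random

def perm_test(a, b, n_perm=999, rng=None):
    """Two-sided permutation test on the difference of sample means.
    (An illustrative choice; the slides do not name the test used.)"""
    rng = rng or random.Random(0)
    obs = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Reshuffle the pooled values and re-split into two groups.
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= obs:
            hits += 1
    # Add-one smoothing keeps the estimate away from an exact zero.
    return (hits + 1) / (n_perm + 1)

# Made-up RSS samples with clearly different means:
km   = [20.0, 24.0, 22.0, 25.0, 23.0, 26.0, 21.0, 24.0]
kmpp = [8.0, 8.0, 9.0, 8.0, 10.0, 8.0, 9.0, 8.0]
print(perm_test(km, kmpp))  # small p-value: difference is significant
```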
Comparison of the frequency distribution of RSS values between k-means and k-means++ on the UCI real-world dataset (n=500).
Alternative Seeding Algorithms

• There is extensive research into seeding techniques for k-means.
• Steinley [2007] evaluated 12 different techniques (omitting k-means++) and recommends multiple random starting points for general use.
• Maitra [2011] evaluated 11 techniques (including k-means++), but was unable to provide recommendations based on nine standard real-world datasets.
• On simulated datasets, Maitra recommends Milligan's [1980] or Mirkin's [2005] seeding technique, and Bradley's [1998] when the dataset is very large.
Conclusions and Future Work

• Using a synthetic test dataset and a real-world dataset, we showed that our implementation of the k-means++ seeding procedure in the MLDemos software package yields a significant reduction in RSS.
• A short literature survey revealed that many seeding procedures exist for k-means, and that some alternatives to k-means++ might yield even larger improvements.
References

• Arthur, D. & Vassilvitskii, S.: "k-means++: The advantages of careful seeding". Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035 (2007).
• Bahmani, B., Moseley, B., Vattani, A., Kumar, R. & Vassilvitskii, S.: "Scalable k-means++". Unpublished working paper available at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf (2012).
• Bradley, P. S. & Fayyad, U. M.: "Refining initial points for k-means clustering". Proc. 15th International Conf. on Machine Learning, 91–99 (1998).
• Maitra, R., Peterson, A. D. & Ghosh, A. P.: "A systematic evaluation of different methods for initializing the k-means clustering algorithm". Unpublished working paper available at http://apghosh.public.iastate.edu/files/IEEEclust2.pdf (2011).
• Milligan, G. W.: "The validation of four ultrametric clustering algorithms". Pattern Recognition, vol. 12, 41–50 (1980).
• Mirkin, B.: "Clustering for data mining: A data recovery approach". Chapman and Hall (2005).
• Steinley, D. & Brusco, M. J.: "Initializing k-means batch clustering: A critical evaluation of several techniques". Journal of Classification 24, 99–121 (2007).