
Second International Workshop on Knowledge Discovery and Data Mining (WKDD 2009), DOI 10.1109/WKDD.2009.85

K-means Optimization Algorithm for Solving Clustering Problem

Jinxin Dong, Minyong Qi

College of Computer Science, Liaocheng University, Liaocheng, Shandong, 252059, China

e-mail: [email protected]

Abstract—The basic K-means algorithm is sensitive to the initial centres and easily gets stuck at local optima. To address these problems, a new clustering algorithm based on simulated annealing is proposed. The algorithm treats clustering as an optimization problem: bisecting K-means first splits the dataset into k clusters, and simulated annealing is then run with the sum of distances between each pattern and its centre, computed from the bisecting K-means result, as the objective function. To mitigate the shortcomings of simulated annealing, namely long computation time and low efficiency, a new data structure called a sequence list is introduced. Experimental results show the feasibility and validity of the proposed algorithm.

Keywords-simulated annealing; clustering; K-means algorithm; intelligent optimization; initial centre

I. INTRODUCTION

Data clustering is used frequently in many applications, such as data mining[1], vector quantization[2], pattern recognition[3], speaker recognition[4] and fault detection[5]. A cluster is a set of entities which are alike, while entities from different clusters are not alike; a cluster is an aggregation of points in the test space such that the distance between any two points in the cluster is less than the distance between any point in the cluster and any point not in it; and clusters may be described as connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by regions containing a relatively low density of points[6]. There are many methods for clustering, such as partitioning methods, hierarchical methods, density-based methods, grid-based methods and model-based methods[7]. The K-means clustering algorithm, which is also called the generalized Lloyd algorithm, is a typical clustering algorithm.

K-means is simple and easy to use in practice. Its time complexity is O(nkt), where n is the number of objects, k is the number of clusters and t is the number of iterations, so it is generally regarded as very fast. But it has several drawbacks:

• The result of the K-means algorithm depends on the initial clustering centres; different seed points can produce different clusters and may even lead to no usable solution.

• The algorithm may terminate at a local minimum.

II. IMPROVED K-MEANS ALGORITHM

A. K-means algorithm

The K-means algorithm is one of the most popular clustering algorithms and is used in a variety of domains. The main idea of K-means is as follows: first choose K patterns as initial centres, where K is a user-specified parameter giving the number of final clusters; assign each point to its closest centre to form K clusters; then recompute the centre of each cluster; repeat the assignment and recomputation until no cluster changes, that is, until the centres remain the same. Furthermore, in the bisecting variant, to obtain K clusters the set of all patterns is first split into two clusters, one of these clusters is selected and split again, and so on, until K clusters have been produced[8]. A minimal sketch of both variants is given below.
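The following is a minimal sketch, not the authors' implementation, of the basic K-means loop and the bisecting variant described above, written with NumPy; all function and variable names are illustrative.

```python
import numpy as np

def kmeans(data, k, iters=100, seed=None):
    """Basic K-means: pick k random patterns as centres, then alternate
    assignment and centre recomputation until the labels stop changing."""
    rng = np.random.default_rng(seed)
    centres = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    labels = np.full(len(data), -1)
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):      # no cluster changed: converged
            break
        labels = new_labels
        for j in range(k):                          # recompute each non-empty centre
            if np.any(labels == j):
                centres[j] = data[labels == j].mean(axis=0)
    return labels, centres

def bisecting_kmeans(data, k, seed=None):
    """Bisecting K-means: repeatedly split the cluster with the largest
    within-cluster squared error into two with 2-means, until k clusters exist."""
    clusters = [np.arange(len(data))]               # start with one cluster of all points
    while len(clusters) < k:
        costs = [((data[idx] - data[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(costs)))   # split the "worst" cluster
        sub_labels, _ = kmeans(data[idx], 2, seed=seed)
        clusters += [idx[sub_labels == 0], idx[sub_labels == 1]]
    labels = np.empty(len(data), dtype=int)
    for j, idx in enumerate(clusters):
        labels[idx] = j
    return labels
```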

B. ISA algorithm

Simulated annealing (SA) is a stochastic search and optimization algorithm based on the Monte Carlo iteration strategy[9]. The name and inspiration come from annealing in metallurgy, a technique involving heating and controlled cooling of a material to increase the size of its crystals and reduce their defects. The heat causes the atoms to become unstuck from their initial positions, a local minimum of the internal energy, and to wander randomly through states of higher energy; the slow cooling then gives them more chances of finding configurations with lower internal energy than the initial one. By analogy with this physical process, each step of the SA algorithm replaces the current solution by a random "nearby" solution, chosen with a probability that depends on the difference between the corresponding objective values and on a global parameter T (called the temperature) that is gradually decreased during the process. The dependency is such that the current solution changes almost randomly when T is large, but increasingly "downhill" as T goes to zero. The allowance for "uphill" moves saves the method from becoming stuck at local minima, which are the bane of greedier methods. SA is applicable to many problems and is easy to implement, but its computation time is long, so its efficiency is low, and it is difficult to satisfy a strict convergence condition in practice.
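As a reference for the discussion above, here is a minimal sketch of the generic SA step (not the authors' ISA): a random neighbour is proposed and accepted with probability min(1, exp(-Δ/T)), and T is lowered by a geometric schedule. The energy, neighbour and schedule arguments are placeholders to be supplied by the problem at hand.

```python
import math
import random

def simulated_annealing(state, energy, neighbour, t0=1.0, cooling=0.95, steps=1000):
    """Generic SA loop: propose a random nearby state, accept it with
    probability min(1, exp(-delta/T)), and gradually decrease T."""
    e = energy(state)
    best_state, best_e = state, e
    t = t0
    for _ in range(steps):
        candidate = neighbour(state)
        delta = energy(candidate) - e
        # downhill moves are always taken; uphill moves are taken with a
        # probability that shrinks as T goes to zero
        if delta <= 0 or random.random() < math.exp(-delta / t):
            state, e = candidate, e + delta
            if e < best_e:
                best_state, best_e = state, e
        t *= cooling                     # geometric cooling schedule
    return best_state, best_e
```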

To overcome these shortcomings of the SA algorithm, an improved simulated annealing (ISA) is given. Its basic idea is to record a number of balance points in a new structure named a sequence list during the cooling process; when one annealing process finishes, the temperatures, states and sampling lengths of all balance points are taken from the list and a new annealing process begins, and this is repeated until the list is empty.
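One way to read the sequence list is as a bounded list of "balance points", each recording the temperature, the state reached at that temperature and the sampling length used there, from which finished runs are restarted until the list is empty. The sketch below is only an interpretation of that description, and every name in it is hypothetical.

```python
from collections import namedtuple

# A "balance point": temperature, the state reached at it, and the sampling
# length used there (an interpretation of the paper's description).
BalancePoint = namedtuple("BalancePoint", "temperature state sampling_length")

class SequenceList:
    """Bounded list of balance points recorded during an annealing run.
    When a run finishes, stored points are taken back out one by one and
    each one seeds a new annealing run, until the list is empty."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.items = []

    def record(self, point):
        if len(self.items) == self.capacity:   # full: drop the oldest point
            self.items.pop(0)
        self.items.append(point)

    def next_restart(self):
        """Pop the most recently stored balance point, or None if empty."""
        return self.items.pop() if self.items else None
```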

C. Improved K-means algorithm

If the K clustering centres, named centpattern, and the pattern category vector, named patterns, are regarded as the solution of the clustering problem, the distance is the Euclidean distance, and clustering is regarded as an optimization problem, then the goal of clustering is to minimize the sum of all distances between each pattern and its closest clustering centre; that is, the objective function is:

D = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2 .  (1)

Here, C_i is the i-th cluster, c_i is the i-th cluster centre, x is a pattern in C_i, and dist denotes the Euclidean distance between two objects. To minimize the above equation we can differentiate it; suppose the data are one-dimensional:

\frac{\partial D}{\partial c_k} = \frac{\partial}{\partial c_k} \sum_{i=1}^{K} \sum_{x \in C_i} (x - c_i)^2 .  (2)

That is:

\frac{\partial D}{\partial c_k} = \sum_{x \in C_k} 2(c_k - x) .  (3)

Setting the above derivative equal to 0, the centpattern is:

\mathrm{centpattern} = \left( \frac{1}{m_1} \sum_{x \in C_1} x,\ \frac{1}{m_2} \sum_{x \in C_2} x,\ \ldots,\ \frac{1}{m_K} \sum_{x \in C_K} x \right) .  (4)

Here, m_k denotes the number of patterns in the k-th cluster, so each optimal clustering centre is simply the mean of the patterns assigned to it; a minimal sketch of this objective and centre update is given below.
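The following is a minimal NumPy sketch of the objective D in Eq. (1) and the centre update in Eq. (4); the function names are illustrative, not from the paper's implementation.

```python
import numpy as np

def objective_d(data, labels, centres):
    """Eq. (1): sum over all clusters of the squared Euclidean distance
    between each pattern and the centre of its cluster."""
    return float(((data - centres[labels]) ** 2).sum())

def update_centres(data, labels, k):
    """Eq. (4): each centre is the mean of the patterns assigned to it,
    where m_k is the number of patterns in cluster k."""
    return np.vstack([data[labels == j].mean(axis=0) for j in range(k)])
```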

The details of the new clustering algorithm are given by the following steps:

Step 1: Initialize the list of clusters to contain a single cluster consisting of all points; at first there is only one cluster in the list.

Step 2: Remove from the list of clusters the cluster whose distance D is largest. In this cluster, select two random points as initial clustering centres and assign every point to its closest centre, so that two clusters are formed; then recompute the clustering centres until centpattern no longer changes. Repeat this 2-means split iternum times, choose the pair of clusters whose total distance D is minimum, and insert these two clusters into the list of clusters. Repeat this bisecting until the list of clusters contains K clusters.

Step 3: So that new solutions can be accepted at the start, compute the value D of the objective function for the last iteration of Step 2 and set the initial temperature T0 = D; the initial centre vector is centpattern and the initial pattern category vector is patterns.

Let T be a new list (the sequence list), let N be the length of T, and let p be a pointer into T with initial value NULL. Let best_aim be the optimal value of the objective function, bestpattern the optimal clustering result and m_center the optimal clustering centres; let better_aim be the current optimal value of the objective function, betterpattern the current optimal clustering result and tm_center the current optimal clustering centres. Their initial values are the result of Step 2.

Step 4: Randomly change the cluster label of one pattern, so that a new clustering result is formed; recompute the clustering centres and the objective function, and decide whether to accept the new clustering result according to the acceptance condition. If the new objective value D < better_aim, set better_aim = D, betterpattern to the new clustering result and tm_center to its clustering centres. If T is not full, insert the objective value, pattern vector and clustering centre vector of the new clustering result into T and increment p; if T is full, shift the last N-1 elements forward by one position, insert the objective value, pattern vector and clustering centre vector of the new result into the last position of T, and set i to 0.

Step 5: Decrease the temperature according to the cooling function and increment i; if the iteration number has not been reached, go to Step 4.

Step 6: If best_aim > better_aim, set best_aim = better_aim, bestpattern = betterpattern and m_center = tm_center, decrement p, and set the reannealing starting point pt = p. If best_aim ≤ better_aim, set p = pt - 1. If p is 0, the algorithm finishes: best_aim is the optimal value of the objective function, bestpattern is the optimal clustering result and m_center is the clustering centre vector. Otherwise, restore the current state to the better_aim, betterpattern and tm_center that p points to in T, and go to Step 4.
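Below is a simplified sketch of the annealing refinement in Steps 3-6, assuming the Metropolis acceptance condition and geometric cooling given in Section III; for brevity it omits the sequence-list reannealing bookkeeping (the list T and the pointers p and pt), and all names are illustrative.

```python
import math
import numpy as np

def cluster_centres(data, labels, k):
    # Eq. (4): each centre is the mean of its assigned patterns
    return np.vstack([data[labels == j].mean(axis=0) for j in range(k)])

def objective(data, labels, centres):
    # Eq. (1): total squared distance of every pattern to its centre
    return float(((data - centres[labels]) ** 2).sum())

def anneal_labels(data, labels, k, iters=2000, lam=0.95, seed=None):
    """Refine a clustering by randomly relabelling one pattern at a time,
    accepting moves by the Metropolis rule and cooling T geometrically."""
    rng = np.random.default_rng(seed)
    centres = cluster_centres(data, labels, k)
    d = objective(data, labels, centres)
    best_labels, best_d = labels.copy(), d
    t = d                                         # Step 3: initial temperature T0 = D
    for _ in range(iters):
        i = int(rng.integers(len(data)))
        if np.count_nonzero(labels == labels[i]) == 1:
            continue                              # do not empty a cluster
        cand = labels.copy()                      # Step 4: move one pattern to a random cluster
        cand[i] = int(rng.integers(k))
        cand_centres = cluster_centres(data, cand, k)
        cand_d = objective(data, cand, cand_centres)
        delta = cand_d - d
        # accept downhill moves always, uphill moves with probability exp(-delta/T)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            labels, centres, d = cand, cand_centres, cand_d
            if d < best_d:                        # keep the best result seen so far
                best_labels, best_d = labels.copy(), d
        t *= lam                                  # Step 5: cooling, T = lambda * T
    return best_labels, best_d
```

For example, on IRIS one could first run bisecting K-means with K = 3 and then pass its labels to anneal_labels to reproduce the spirit of the proposed procedure.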

III. EXPERIMENTAL RESULTS

The test dataset is IRIS from the UCI machine learning repository. Three algorithms are evaluated: K-means, bisecting K-means and the new clustering algorithm.

Firstly, the basic K-means is run 20 times with random initialization; three clustering centres are generated each time. The result is shown in Table I.

Here SSE is the sum of squared errors between each pattern and its closest clustering centre. Suppose the IRIS data are indexed from 0 to 149. From the program output we can see that, with random initialization, in a better run the clustering centre indices can be (0, 50, 100), while in a worse run they can be (13, 140, 46). Table I illustrates that the quality of basic K-means depends on the initial clustering centres.

Secondly, the bisecting clustering algorithm is run 20 times; each run randomly generates two groups of clustering centres, and every group contains two clustering centre indices. The result is shown in Table II.


From the program output we can see that, with random initialization, in a better run the two groups of clustering centre indices can be (0, 60) and (43, 66), while in a worse run they can be (91, 78) and (36, 15). Table II illustrates that the result of bisecting K-means does not depend on the initial clustering centres, but its accuracy needs to be improved.

Finally, the new clustering algorithm is run 20 times. In the new algorithm, the acceptance condition is the key to the capability of probabilistic jumping, which allows the algorithm to avoid convergence to a local minimum; the acceptance condition can be:

\min\{1, \exp(-\Delta / t)\} > \mathrm{random}[0,1] .  (5)

where t is the temperature and Δ is the difference between the new value of the clustering objective function and the old one. The cooling function is:

T = \lambda T .  (6)

Here λ = 0.95 and the length of the sequence list is 10. The result is shown in Table III.

The accuracy over the 20 runs is 90%; only the SSE differs between runs. From the program output we can see that in a better run the two groups of clustering centre indices can be (0, 97) and (12, 37), while in a worse run they can be (63, 50) and (96, 77). Table III illustrates that the new algorithm does not depend on the initial centres and that its clustering accuracy is very high.

TABLE I. THE RESULT OF BASIC K-MEANS

Running condition    SSE           Accuracy
better               97.204566     89.33%
worse                124.022385    57.33%
average              108.339812    76.25%

TABLE II. THE RESULT OF BISECTING K-MEANS

Running condition    SSE           Accuracy
better               99.104216     87.33%
worse                99.021460     86.67%
average              99.071112     86.93%

TABLE III. THE RESULT OF THE NEW ALGORITHM

Running condition    SSE           Accuracy
better               97.09161      90%
worse                97.146661     90%
average              97.109591     90%

From the above analysis, the new algorithm also chooses clustering centres at random, but instead of choosing K random centres at once it chooses 2 random centres and runs the bisection K-1 times; at each run the dataset is split into two clusters and the cluster with the largest SSE is chosen to be split next. The improved simulated annealing is then run on the bisecting result, so the quality of the clustering is improved greatly.

To illustrate the clustering quality, Fig. 1 gives the ideal scatter plot of IRIS and Fig. 2 gives the scatter plot of IRIS after running the new algorithm. Here Δ, □ and O denote Setosa, Versicolor and Virginica, respectively.

Figure 1. Scatter plot of original IRIS

Figure 2. Scatter plot of IRIS after running the new clustering algorithm


IV. CONCLUSIONS

The traditional K-means algorithm depends on the initial clustering centres and may converge to a local minimum. To solve this problem, a new clustering algorithm based on simulated annealing is given. The experimental results illustrate that the accuracy of the new algorithm is higher and that it does not depend on the initial clustering centres; it is feasible and efficient. It also has a shortcoming: in the new algorithm, K must be given in advance.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (60874075) and the Natural Science Foundation of Liaocheng University (X061041).

REFERENCES

[1] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, Menlo Park: AAAI/MIT Press, 1996.

[2] A. Gersho and R.M. Gray, Vector Quantization and Signal Compression, Boston: Kluwer Academic, 1992.

[3] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, New York: John Wiley & Sons, 1973.

[4] D. Liu and F. Kubala, "Online Speaker Clustering," Proc. IEEE Conference on Acoustics, Speech, and Signal Processing, vol. 1, 2004, pp. 333-336.

[5] P. Hojen-Sorensen, N. de Freitas, and T. Fog, "On-line probabilistic classification and particle filters," Proc. IEEE Signal Processing Society Workshop, vol. 1, 2000, pp. 386-395.

[6] A.K. Jain, Algorithms for Clustering Data, New York: Prentice Hall, 1988.

[7] J. Han and M. Kamber, Data Mining: Concepts and Techniques, San Francisco: Morgan Kaufmann Publishers, 2000.

[8] M. Steinbach, G. Karypis, and V. Kumar, "A Comparison of Document Clustering Techniques," Proc. 6th International Conference on Knowledge Discovery and Data Mining, Boston, 2000.

[9] S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, "Optimization by Simulated Annealing," Science, vol. 220, no. 4598, 1983, pp. 671-680.

[10] M. Laszlo and S. Mukherjee, "A genetic algorithm that exchanges neighboring centers for k-means clustering," Pattern Recognition Letters, vol. 28, no. 16, 2007, pp. 2359-2366.

[11] K. Kim and H. Ahn, "A recommender system using GA K-means clustering in an online shopping market," Expert Systems with Applications, vol. 34, no. 2, 2008, pp. 1200-1209.

[12] G.P. Papamichail and D.P. Papamichail, "The k-means range algorithm for personalized data clustering in e-commerce," European Journal of Operational Research, vol. 177, no. 3, 2007, pp. 1400-1408.

[13] D. Qiu and A.C. Tamhane, "A comparative study of the K-means algorithm and the normal mixture model for clustering: Univariate case," Journal of Statistical Planning and Inference, vol. 137, no. 11, 2007, pp. 3722-3740.
