
An improved algorithm for subspace clustering applied to image segmentation

Amel Boulemnadjel, Fella Hachouf
Laboratoire d'automatique et de robotiques, Département d'Electronique
Université Mentouri de Constantine, Route Ain El-Bey, 25000 Constantine, Algérie
[email protected], [email protected]

Abstract: This paper presents a new algorithm for subspace clustering of high-dimensional data. It is an iterative algorithm based on the minimization of an objective function. A major weakness of existing subspace clustering algorithms is that almost all of them are developed from within-cluster information only, or by employing both within-cluster and between-cluster information; either way, cluster density information is lost. The new objective function is developed by integrating the separation and the compactness of clusters, and the cluster density is also introduced into the compactness term. The experimental results confirm that the proposed algorithm gives good results on different types of images while reducing the runtime.

Keywords: subspace clustering, within-cluster, between-clusters, density, runtime.

1. Introduction

Cluster analysis seeks to discover groups, or clusters, of similar objects. The objects are usually represented as a vector of measurements, or a point in multidimensional space. Traditional clustering algorithms consider all of the dimensions of an input dataset in an attempt to learn as much as possible about each object described. However, in high-dimensional data, many of the dimensions are often irrelevant. These irrelevant dimensions can confuse clustering algorithms by hiding clusters in noisy data. In very high dimensions, it is common for all of the objects in a dataset to be nearly equidistant from each other, completely masking the clusters.

Subspace clustering has been proposed to overcome this challenge, and it has been studied extensively in recent years [1-3]. Subspace clustering determines the clusters that form in different subspaces; this approach handles multidimensional data better than other methods [4, 5]. Subspace clustering [6] is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. Often in high-dimensional data, many dimensions are irrelevant and can mask existing clusters in noisy data.

The two main categories of subspace clustering algorithms are hard subspace clustering and soft subspace clustering. Hard subspace clustering methods [6] were the first to be studied extensively for the clustering of high-dimensional data. These methods localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, possibly overlapping subspaces [6]. There are two major branches of subspace clustering, distinguished by their search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results [7]. Bottom-up approaches find dense regions in low-dimensional spaces and combine them to form clusters. An example of a bottom-up method is CLIQUE [8], one of the first algorithms proposed to find clusters within subspaces of a dataset: it first partitions the whole data space into non-overlapping rectangular units, then searches for dense units and merges them to form clusters (see the sketch below). PROCLUS [9] was the first top-down subspace clustering algorithm: it samples the data, selects a set of k medoids, and iteratively improves the clustering using a three-phase approach consisting of initialization, iteration, and cluster refinement.
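To illustrate the bottom-up idea, the following minimal Python sketch performs the first CLIQUE-style step, counting points per axis-aligned grid cell and keeping the dense ones; the function name, grid resolution, and density threshold are illustrative assumptions, not CLIQUE's actual parameters.

```python
import numpy as np
from collections import Counter

def dense_units(X, bins=10, min_points=5):
    """First CLIQUE-style step: partition every dimension into equal-width
    intervals and keep the grid cells holding at least min_points points.
    bins and min_points are illustrative, not CLIQUE's settings."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # Map each point to the index of the rectangular unit it falls in.
    cells = np.floor((X - mins) / (maxs - mins + 1e-12) * bins).astype(int)
    counts = Counter(map(tuple, cells))
    return {cell for cell, count in counts.items() if count >= min_points}

# Toy usage: dense cells of 500 random 2-D points.
X = np.random.rand(500, 2)
print(len(dense_units(X)))
```

A full CLIQUE implementation would then merge adjacent dense units, per dimension subset, into clusters; the sketch only shows the dense-unit detection that the merging step builds on.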

While the exact subspaces are identified in hard subspace clustering, soft subspace clustering assigns a weight to each dimension during the clustering process to measure the contribution of that dimension to the formation of a particular cluster. Soft subspace clustering can be divided into two main categories: fuzzy weighting subspace clustering and entropy weighting subspace clustering. Many soft subspace clustering algorithms have been developed and applied to different areas [10-12], but their performance can be further enhanced. A major weakness of soft subspace clustering algorithms is that almost all of them are developed from within-class information only, e.g. the commonly used within-cluster compactness. These algorithms are expected to be improved if more discrimination information is added.


A novel clustering technique called enhanced soft subspace clustering (ESSC) was proposed in [13], employing both within-cluster and between-cluster information. Its experimental studies on texture images, where the shape and the density of the clusters are similar across clusters, gave good results; but a problem occurs in this algorithm when the shape and the density change from one cluster to another. Most clustering techniques use the distance or similarity between objects as a measure to build clusters, but in high-dimensional spaces, distances between points become relatively uniform. In such cases, the density of clusters may give better results [14, 15]. Density-based subspace clustering is one more approach. The authors in [16] propose a mathematical formulation for the density of points in subspaces, but again the density of subspaces is measured using a hypercube of fixed width w, so it suffers from similar problems. Another efficient density-connected approach is given in [17]. In [18], a method using the concept of density is presented; this grid-based approach can detect the shape and position of clusters in the subspace. SUBCLU [17] achieves a better clustering quality but requires a higher runtime. All existing subspace clustering algorithms use input parameters that can be seen as constraints; for example, CLIQUE [8] uses a threshold on the minimum density a cluster can have, yet small changes in these parameters might completely change the resulting clustering.

In this paper, we propose a clustering method based on the optimization of an objective function. The proposed objective function contains two terms: the weighted within-cluster compactness and the weighted between-cluster separation. The cluster density is introduced in the compactness term without increasing the runtime; this is achieved by a suitable preprocessing for the extraction of the data features to classify. Updating all the randomly selected parameters eliminates their influence on the resulting clustering.

The paper is organized as follows: after this introduction, Section 2 presents the proposed objective function. As we test our approach on different types of images, Section 3 provides an overview of the methods used for extracting the features. Experimental results are presented and discussed in Section 4. The paper ends with a conclusion.

2. Objective function

The proposed objective function is obtained by extending the objective function of the enhanced soft subspace clustering (ESSC) algorithm [13], defined in eq. (1):

J_{ESSC}(u,v,w) = \sum_{i=1}^{c}\sum_{j=1}^{N} u_{ij}^{m}\sum_{k=1}^{D} w_{ik}(x_{jk}-v_{ik})^{2} + \gamma\sum_{i=1}^{c}\sum_{k=1}^{D} w_{ik}\ln w_{ik} - \eta\sum_{i=1}^{c}\sum_{j=1}^{N} u_{ij}^{m}\sum_{k=1}^{D} w_{ik}(v_{ik}-v_{0k})^{2}    (1)

where:

c: the number of clusters.
N: the number of data points.
D: the number of features.
v: the cluster-center matrix.
w: the weight matrix.
u: the fuzzy partition matrix.
m: the fuzziness exponent; γ and η are regularization parameters balancing the entropy and separation terms.
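For concreteness, a minimal NumPy sketch of eq. (1) is given below. The array layout (x: N×D data, u: N×c memberships, v: c×D centers, w: c×D positive weights, v0: the D-dimensional reference center) and the default parameter values are illustrative assumptions, not settings from the ESSC paper.

```python
import numpy as np

def essc_objective(x, u, v, w, v0, m=2.0, gamma=1.0, eta=0.1):
    """Evaluate J_ESSC of eq. (1).
    x: (N, D) data; u: (N, c) fuzzy memberships; v: (c, D) centers;
    w: (c, D) positive feature weights; v0: (D,) reference center."""
    um = u ** m                                        # u_ij^m
    d2 = (x[:, None, :] - v[None, :, :]) ** 2          # (x_jk - v_ik)^2, shape (N, c, D)
    compactness = np.einsum('jc,jcd,cd->', um, d2, w)  # within-cluster term
    entropy = gamma * np.sum(w * np.log(w))            # weight-entropy regularizer
    separation = eta * np.einsum('jc,cd,cd->', um, (v - v0) ** 2, w)
    return compactness + entropy - separation          # separation is subtracted
```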

The modification we propose in this function is based in part on the introduction of the density n(i) into the compactness term of eq. (1), which is replaced by eq. (2):

J_{cfm}(u,v) = \sum_{i=1}^{c}\sum_{j=1}^{N} u_{ij}^{m}\sum_{k=1}^{D} w_{ik}(x_{jk}-v_{ik})^{2} / (n(i)+1)    (2)

The weighted entropy term is also deleted and replaced by computing the local entropy in the initialization phase. This reduces the runtime while still keeping the entropy information, whose value is updated at each iteration. The objective function developed for the proposed algorithm is thus given by eq. (3):

J_{fm}(u,v) = \sum_{i=1}^{c}\sum_{j=1}^{N} u_{ij}^{m}\sum_{k=1}^{D} w_{ik}(x_{jk}-v_{ik})^{2} / (n(i)+1) - \eta\sum_{i=1}^{c}\sum_{j=1}^{N} u_{ij}^{m}\sum_{k=1}^{D} w_{ik}(v_{ik}-v_{0k})^{2}    (3)
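Under the same assumed array layout as the first sketch, the only changes in eq. (3) are that each cluster's compactness is divided by n(i)+1 and the entropy term is dropped; n is assumed here to be a length-c vector of cluster densities.

```python
import numpy as np

def jfm_objective(x, u, v, w, v0, n, m=2.0, eta=0.1):
    """Evaluate J_fm of eq. (3). Same layout as essc_objective;
    n: (c,) densities n(i), scaling each cluster's compactness by 1/(n(i)+1)."""
    um = u ** m
    d2 = (x[:, None, :] - v[None, :, :]) ** 2
    compactness = np.einsum('jc,jcd,cd->', um, d2 / (n[None, :, None] + 1.0), w)
    separation = eta * np.einsum('jc,cd,cd->', um, (v - v0) ** 2, w)
    return compactness - separation
```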

By using a Lagrange multiplier, we deduce the u and v that minimize the objective function of eq. (3), given respectively by eqs. (4) and (5):

u_{ij} = \left[\sum_{k=1}^{D} w_{ik}(x_{jk}-v_{ik})^{2}/(n(i)+1) - \eta\sum_{k=1}^{D} w_{ik}(v_{ik}-v_{0k})^{2}\right]^{1/(1-m)} \Big/ \sum_{i'=1}^{c}\left[\sum_{k=1}^{D} w_{i'k}(x_{jk}-v_{i'k})^{2}/(n(i')+1) - \eta\sum_{k=1}^{D} w_{i'k}(v_{i'k}-v_{0k})^{2}\right]^{1/(1-m)}    (4)

with the constraint \sum_{i=1}^{c} u_{ij} - 1 = 0, and

v_{ik} = \sum_{j=1}^{N} u_{ij}^{m}\left(x_{jk}/(n(i)+1) - \eta v_{0k}\right) \Big/ \sum_{j=1}^{N} u_{ij}^{m}\left(1/(n(i)+1) - \eta\right)    (5)
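A sketch of the two update rules follows, under the same assumptions as the objective sketches; the small floor on the distance term and the requirement that η stay below 1/(n(i)+1), so the denominator of eq. (5) remains positive, are our own guards, not conditions stated in the paper.

```python
import numpy as np

def update_u(x, v, w, v0, n, m=2.0, eta=0.1, floor=1e-10):
    """Membership update of eq. (4): u_ij proportional to d_ij^(1/(1-m)),
    normalized so that sum_i u_ij = 1; d_ij is the bracketed term of eq. (4)."""
    d2 = (x[:, None, :] - v[None, :, :]) ** 2
    d = np.einsum('jcd,cd->jc', d2 / (n[None, :, None] + 1.0), w) \
        - eta * np.einsum('cd,cd->c', (v - v0) ** 2, w)[None, :]
    d = np.maximum(d, floor)                   # guard: keep the base positive
    u = d ** (1.0 / (1.0 - m))
    return u / u.sum(axis=1, keepdims=True)

def update_v(x, u, v0, n, m=2.0, eta=0.1):
    """Center update of eq. (5); assumes eta < 1/(n(i)+1) so the
    denominator stays positive."""
    um = u ** m                                                    # (N, c)
    num = (um.T @ x) / (n[:, None] + 1.0) - eta * um.sum(0)[:, None] * v0
    den = um.sum(0)[:, None] * (1.0 / (n[:, None] + 1.0) - eta)
    return num / den
```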

The proposed method can be described by the following algorithm:

Algorithm

Step 1. Initialization:
- Input: the number of clusters c and the parameters η, ε.
- Randomly initialize the cluster centers v0 and set the initial weight matrix w0.
- Extract the image features: matrix G of size (N × D).

Step 2. Processing:
repeat
  1. Compute the partition matrix u using eq. (4).
  2. Compute the cluster-center matrix v using eq. (5).
  3. Update the density n(i), the cluster center v0, and the weight matrix w using eq. (6):
     w_{ik} = Entp(i,k) / \sum_{k'=1}^{D} Entp(i,k')    (6)
until ||u(t+1) - u(t)|| ≤ ε

Step 3. Clustering: assign each input x to its nearest cluster center v.
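Putting the pieces together, the main loop might look as follows. It reuses update_u and update_v from the previous sketch; since the paper does not spell out how n(i) and v0 are recomputed, relative cluster sizes and the mean of the current centers are used as stand-ins, and a uniform weight matrix replaces the entropy weights of eq. (6).

```python
import numpy as np

def cluster(x, c, m=2.0, eta=0.1, eps=1e-4, max_iter=100, seed=0):
    """Main loop of the proposed algorithm (sketch); update_u and update_v
    are the functions from the previous sketch."""
    rng = np.random.default_rng(seed)
    N, D = x.shape
    v = x[rng.choice(N, size=c, replace=False)]   # random initial centers (Step 1)
    v0 = x.mean(axis=0)                           # reference center (one plausible reading)
    w = np.full((c, D), 1.0 / D)                  # uniform stand-in for eq. (6) weights
    u = np.full((N, c), 1.0 / c)
    n = np.zeros(c)                               # densities n(i), refined each iteration
    for _ in range(max_iter):
        u_new = update_u(x, v, w, v0, n, m, eta)
        v = update_v(x, u_new, v0, n, m, eta)
        labels = u_new.argmax(axis=1)
        # Relative cluster sizes as a stand-in for the unspecified density n(i);
        # keeping n(i) < 1/eta - 1 also keeps eq. (5) well defined.
        n = np.bincount(labels, minlength=c) / N
        v0 = v.mean(axis=0)                       # update v0 (assumption)
        if np.abs(u_new - u).max() <= eps:        # stopping test of Step 2
            break
        u = u_new
    return u_new.argmax(axis=1)                   # Step 3: hard assignment
```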

3. Texture feature extraction

Texture analysis is important in many applications of computer image analysis for classification or segmentation of images based on local spatial variations of intensity or color. A successful classification or segmentation requires an efficient description of image texture. Important applications include industrial and biomedical surface inspection, for example for defects and disease, ground classification and segmentation of satellite, aerial or medical imagery, segmentation of textured regions in document analysis, and content-based access to image databases.

A wide variety of techniques for describing image texture have been proposed [19]. For example, statistical methods analyze the spatial distribution of gray values by computing local features at each point in the image and deriving a set of statistics from the distributions of these local features. The most widely used statistical features are the co-occurrence features of Haralick (1973) [20].

In this paper we use the first six of these parameters, namely contrast, homogeneity, correlation, energy, angular second moment, and entropy; edge detection is also used, for seven features in total. Gabor filters were used in the ESSC algorithm [13] to extract the features of the texture images: a filter bank with six orientations (one every 30°) and five frequencies starting from 0.46 is created, and 30-dimensional features are extracted at each pixel by applying the filter bank to the texture images, against the seven-dimensional features used in the proposed method.
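As an illustration of this seven-dimensional feature extraction, the sketch below computes five of the co-occurrence statistics with scikit-image, derives the GLCM entropy by hand (graycoprops does not expose it), and appends a Sobel edge response; the window size, gray-level quantization, and single GLCM offset are our assumptions, not the paper's settings, and the per-pixel loop is written for clarity rather than speed.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from skimage.filters import sobel

def pixel_features(img, win=7, levels=32):
    """Per-pixel 7-D texture features: five GLCM statistics, GLCM entropy,
    and a Sobel edge response. win, levels, and the (distance 1, angle 0)
    GLCM offset are illustrative choices."""
    img = img.astype(float)
    img_q = (img / (img.max() + 1e-12) * (levels - 1)).astype(np.uint8)
    edges = sobel(img)
    h, w = img.shape
    r = win // 2
    feats = np.zeros((h, w, 7))
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = img_q[y - r:y + r + 1, x - r:x + r + 1]
            glcm = graycomatrix(patch, [1], [0], levels=levels, normed=True)
            p = glcm[:, :, 0, 0]
            feats[y, x, :5] = [graycoprops(glcm, prop)[0, 0] for prop in
                               ('contrast', 'homogeneity', 'correlation',
                                'energy', 'ASM')]
            feats[y, x, 5] = -np.sum(p[p > 0] * np.log(p[p > 0]))  # entropy
            feats[y, x, 6] = edges[y, x]
    return feats.reshape(-1, 7)   # the N x D feature matrix G of Step 1
```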

4. Results and discussion

The compactness has been computed using the proposed algorithm and the ESSC algorithm [13] on different types of images with different textures and different sizes. The results are shown in Figure 1.a. The value of the compactness given by the proposed function is clearly lower than that of the ESSC objective function for all image sizes; a small value of this term means that the clusters are compact.

The images obtained with the two objective functions are given in Figures 2 and 3. In Figures 2.b.1 and 2.b.2 a part of the land is not visible, unlike in Figures 2.c.1 and 2.c.2. In Figure 3.c.3, it is easy to separate the spleen (6), liver (2), stomach (4), and pancreas; the regions labeled (1) are also detected, the contours are well delimited, the lung (8) is clear and homogeneous, and the liquid (5) around the spleen is detected and takes a different class from the spleen. In this image, according to the radiologist, it is easy to make a diagnosis with high accuracy. However, in Figure 3.b.3, some organs like the spleen and stomach are separated, but we notice that the vertebral body (3) does not take its true form as it does in Figure 3.c.3; it is in contact with the liver.

Figure 1. (a) Compactness according to the size of the images; (b) runtime for the two methods.


Figure 2. (a) Original images; (b) ESSC results; (c) proposed method results.

Figure 3. (a) Original image; (b) ESSC results; (c) proposed method results.

In Figure 3.b.3 the lungs are detected but are not homogeneous, and the pancreas is not fully detected; the liver is placed in the same class as the spleen. This false anatomy may lead to a false diagnosis, as it can hide other anomalies, so the radiologist confirms that this image cannot provide any useful diagnostic information.

Not only has the visual analysis of the clustering shown that our method is the most suitable; quantitative measures also confirm the advantage of our modification, as shown above for the compactness value. Runtime is also one of the most interesting criteria for assessing the quality of a clustering method on a database of images of different sizes. We computed the runtime for the two algorithms (see Figure 1.b): both curves show the runtime increasing with image size, but with a smaller slope for the proposed method.

Conclusions

In this paper we have proposed a subspace clustering method. The proposed objective function produces good results and also reduces the runtime. The cluster density introduced in the objective function allows detecting clusters of different shapes. The choice of the feature dimensions and the updating of the input parameters are important for the resulting clustering. The implementation of this algorithm has improved the results compared with other existing methods.

References

[1] C. Bouveyron, C. Brunet. Classification automatique dans les sous-espaces discriminants de Fisher. 41èmes Journées de Statistique, SFdS, Bordeaux, 2009.
[2] J. Guan, Y. Gan, H. Wang. Discovering pattern-based subspace clusters by pattern tree. Knowledge-Based Systems, vol. 22, 569-579, 2009.
[3] S. Boutemedjet, D. Ziou, N. Bouguila. Model-based subspace clustering of non-Gaussian data. Neurocomputing, vol. 73, 1730-1739, 2010.
[4] M. Talibi-Alaoui, et al. Classification des images couleurs par association des transformations morphologiques aux cartes de Kohonen. CARI, Hammamet, 83-90, 2004.
[5] K.-L. Wu, M.-S. Yang. Mean shift-based clustering. Pattern Recognition, 3035-3052, 2007.
[6] L. Parsons, E. Haque, H. Liu. Subspace clustering for high dimensional data: a review. SIGKDD Explorations, vol. 6, no. 1, 90-105, 2004.


[7] L. Parsons, E. Haque, H. Liu. Evaluating subspace clustering algorithms. In Workshop on Clustering High Dimensional Data and its Applications, SIAM Int. Conf. on Data Mining, 48-56, 2004.
[8] R. Agrawal, et al. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM SIGMOD Int. Conf. on Management of Data, ACM Press, 94-105, 1998.
[9] C. Aggarwal, et al. Fast algorithms for projected clustering. In Proc. 1999 ACM SIGMOD Int. Conf. on Management of Data, ACM Press, 61-72, 1999.
[10] R. Vidal, et al. An algebraic geometric approach to the identification of a class of linear hybrid systems. In Proc. Conf. on Decision and Control, 167-172, 2003.
[11] A. Yang, et al. Unsupervised segmentation of natural images via lossy data compression. Computer Vision and Image Understanding, vol. 110, no. 2, 212-225, 2008.
[12] R. Vidal, R. Tron, R. Hartley. Multiframe motion segmentation with missing data using power factorization and GPCA. Int. J. Computer Vision, vol. 79, no. 1, 85-105, 2008.
[13] Z. Deng, et al. Enhanced soft subspace clustering integrating within-cluster and between-cluster information. Pattern Recognition, vol. 43, 767-781, 2010.
[14] J. Sunita, K. Parag. Intelligent subspace clustering: a density based clustering approach for high dimensional dataset. World Academy of Science, Engineering and Technology, vol. 55, 69-73, 2009.
[15] R.W. Sembiring, J.M. Zain. Cluster evaluation of density based subspace clustering. Journal of Computing, vol. 2, issue 11, ISSN 2151-9617, Nov. 2010.
[16] C.M. Procopiuc, et al. A Monte Carlo algorithm for fast projective clustering. In Proc. ACM SIGMOD Conf. on Management of Data, 418-427, 2002.
[17] K. Kailing, H.P. Kriegel, P. Kröger. Density-connected subspace clustering for high-dimensional data. In Proc. 4th SIAM Int. Conf. on Data Mining, 246-257, 2004.
[18] M. Ester, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, 291-316, 1996.
[19] F.M. Khellah. Texture classification using dominant neighborhood structure. IEEE Transactions on Image Processing, vol. 20, no. 11, 3270-3279, November 2011.
[20] R. Haralick, et al. Textural features for image classification. IEEE Transactions on Systems, Man and Cybernetics, vol. 3, no. 6, 610-621, 1973.
