
IEICE TRANS. INF. & SYST., VOL.E97–D, NO.4 APRIL 2014

PAPER Special Section on Data Engineering and Information Management

Multimode Image Clustering Using Optimal Image Descriptor

Nasir AHMED†a) and Abdul JALIL†b), Members

SUMMARY Manifold learning based image clustering models are usually employed at the local level to deal with images sampled from a nonlinear manifold. Multimode patterns in image data matrices can vary from nominal to significant due to images with different expressions, pose, illumination, or occlusion variations. We show that manifold learning based image clustering models are unable to achieve well separated images at the local level for image datasets with significant multimode data patterns, because the gray level image features used in these clustering models are not able to capture the local neighborhood structure effectively for multimode image datasets. In this study, we use a nearest neighborhood quality (NNQ) measure based criterion to improve the local neighborhood structure in terms of correct nearest neighbors of images locally. We found Gist to be the optimal image descriptor among the HOG, Gist, SUN, SURF, and TED image descriptors based on an overall maximum NNQ measure on 10 benchmark image datasets. We observed significant performance improvement for recently reported clustering models such as Spectral Embedded Clustering (SEC) and Nonnegative Spectral Clustering with Discriminative Regularization (NSDR) using the proposed approach. Experimentally, a significant overall performance improvement of 10.5% (clustering accuracy) and 9.2% (normalized mutual information) on 13 benchmark image datasets is observed for the SEC and NSDR clustering models. Further, the overall computational cost of the SEC model is reduced by 19%, and the clustering performance for challenging outdoor natural image databases is significantly improved by using the proposed NNQ measure based optimal image representations.
key words: multimode image clustering, spectral embedded clustering, nonnegative spectral constraint, NNQ measure, Gist image descriptor

1. Introduction

Cluster analysis is an important tool in a variety of scientific areas including, e.g., pattern recognition, economics, bioinformatics, document clustering, and information retrieval. Multimode clustering deals with practical problems where we need the simultaneous clustering of objects and variables (rows and columns of a data matrix), with suitable generalizations for multi-dimensional data matrices. In recent years, multimode clustering has become an important challenge, e.g., in market-basket analysis, text mining, microarrays, and recommender system analysis. Usually, manifold learning based clustering methods have been proposed to deal with multimode patterns in data matrices [1]–[6].

Image clustering can be defined as the optimal partitioning of images into different groups such that the images belonging to the same group are more similar to each other, and images from two different groups share the maximum difference.

Manuscript received July 10, 2013.
Manuscript revised October 3, 2013.
†The authors are with the Department of Computer and Information Sciences, Pakistan Institute of Engineering and Applied Sciences, Islamabad, Pakistan.

a) E-mail: [email protected]
b) E-mail: [email protected]

DOI: 10.1587/transinf.E97.D.743

An image dataset can consist of images with different expressions, pose, illumination, or occlusion variations: nominal within-class variation is present in image datasets that contain images with different expressions only, while significant within-class variation is present in datasets that contain images with pose, illumination, and occlusion variations. The distribution of all images in a class is Gaussian-like (unimodal) for image datasets with nominal within-class variation, while it is non-Gaussian (multimodal) for image datasets with significant within-class variations. Outdoor natural image datasets have significant within-class variation, and image clustering of such image datasets is thus a challenging problem in computer vision [7].

The clustering problem is intrinsically hard to formulate. The classical Kmeans algorithm is based on optimizing a simple objective that minimizes the spread around centroids, but it can give poor solutions due to its simple assumption about cluster structure. Linear discriminant analysis (LDA) utilizes discriminant information. The initial works iteratively employed Kmeans to generate the cluster labels for LDA and exploited LDA to select the most discriminative subspace for Kmeans clustering [8], [9]. The discriminative Kmeans (DisKmeans) algorithm [10] was proposed, which simplified the iterative procedures into a trace maximization problem. However, LDA gives undesired results for multimodal image datasets [11] due to its global nature when evaluating scatter matrices. Sugiyama proposed supervised local Fisher discriminant analysis (LFDA) [12] to learn the manifold structure of a multimodal distribution at the local level. In LFDA [12], the k-nearest neighbors of a data point are first searched within its class using class label information, and then optimization is performed to preserve the multimodal structure. However, the LFDA approach cannot be used in the unsupervised image clustering problem due to its supervised nature.

The manifold assumption is that two nearby data points in the high-density region of a low-dimensional data manifold have the same cluster label. The well-known manifold learning based spectral clustering (SC) algorithm normalized cut (NCut) [1] and its extension k-way NCut [13] have achieved promising clustering performance in image segmentation and many other applications. In local learning based clustering techniques [3]–[5], for each image, a local clique was constructed using the k-nearest neighbors of the image, and a local discriminant model was devised to evaluate the clustering performance of images in the local clique.



A unified objective function was used to maximize the sum of local discriminant scores from all the local cliques. The eigenvalue decomposition method was then used to obtain the relaxed continuous valued solution of the cluster assignment matrix, which was further discretized to obtain the binary cluster assignment matrix for all samples. Yang et al. devised the local discriminant model and global integration (LDMGI) image clustering model [5] based on discriminant analysis using the manifold assumption.
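For concreteness, the following is a minimal sketch of the relax-and-discretize pipeline described above, assuming the local models have already been integrated into a single graph Laplacian L (for a minimization objective the relaxed solution is spanned by the bottom eigenvectors; Kmeans stands in here for the discretization step, and spectral rotation [13] could be used instead):

```python
# Minimal sketch (not the authors' code): relax-and-discretize for a
# globally integrated Laplacian L (n x n, symmetric positive semidefinite).
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def relax_and_discretize(L, n_clusters):
    # Relaxed continuous solution: eigenvectors of the smallest eigenvalues.
    _, vecs = eigh(L)                  # eigenvalues in ascending order
    A = vecs[:, :n_clusters]           # relaxed cluster assignment matrix
    # Discretize the mixed-sign continuous solution into hard labels.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(A)
```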

Nie et al. proposed the spectral embedded clustering (SEC) model [6], in which the optimization problem of SC [13]–[17] was improved by adding a linearity regularization at the local level [18]. Let X = {x_1, x_2, ..., x_n} be the image dataset to be clustered, where x_i ∈ R^f (1 ≤ i ≤ n) represents the ith image, f is the image feature dimension, and n denotes the total number of images. In the SEC model [6], for each image x_i, a local clique ℵ_k(x_i) was constructed comprising the k (= 5) nearest neighbor images of x_i, and a local spectral clustering model was devised to evaluate the clustering results for the images in ℵ_k(x_i). The overall clustering results were obtained by globally integrating over all the local cliques. The optimization problem arg min_A Tr[A^T L_g A] [13]–[17] was improved to arg min_A Tr[A^T (Σ_{i=1}^n L_i + μ L_g) A] [6], where

    L_g = H_n − X^T (X X^T + γ_g I)^{−1} X,          (1)

    L_i = H_k − X_i^T (X_i X_i^T + γ_l I)^{−1} X_i,   (2)

and H_k = I − (1/k) 1_k 1_k^T ∈ R^{k×k} is a matrix for centering the data, X_i is the data matrix of the ith local clique, μ, γ_g, and γ_l are regularization parameters, and A is the scaled cluster assignment matrix [6]. Further, clustering results for SEC/SC [6] were also reported, in which the Laplacian of k-way NCut [13] is utilized in place of L_i (as given in (2)). However, in all these clustering approaches [5], [6], [10], [13], [19], spectral rotation [13] or Kmeans is performed to discretize the cluster indicator matrix from the relaxed continuous solution. Yang et al. showed that the signs of the relaxed continuous solution are mixed and such results may deviate severely from the true solution, so they imposed an explicit nonnegative constraint for a more accurate solution during the relaxation [19]. They improved clustering performance by using this nonnegative constraint and proposed the nonnegative spectral clustering with discriminative regularization (NSDR) algorithm [19]. Recently, Huang et al. conducted a comparative study of spectral rotation and Kmeans by imposing an additional orthonormal constraint to better approximate the optimal continuous solution [20] to the graph cut objective functions.
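As a concrete reading of Eqs. (1) and (2), the following is a small numpy sketch (a hedged illustration, not the authors' code) that builds L_g and L_i from a feature matrix X of size f × n, with `gamma_g` and `gamma_l` as the regularization parameters:

```python
import numpy as np

def centering_matrix(m):
    """H_m = I - (1/m) 1 1^T, which removes the mean over m samples."""
    return np.eye(m) - np.ones((m, m)) / m

def global_laplacian(X, gamma_g):
    """L_g = H_n - X^T (X X^T + gamma_g I)^{-1} X, as in Eq. (1)."""
    f, n = X.shape
    M = X.T @ np.linalg.inv(X @ X.T + gamma_g * np.eye(f)) @ X
    return centering_matrix(n) - M

def local_laplacian(Xi, gamma_l):
    """L_i for one local clique of k neighbor images (Eq. (2))."""
    f, k = Xi.shape
    M = Xi.T @ np.linalg.inv(Xi @ Xi.T + gamma_l * np.eye(f)) @ Xi
    return centering_matrix(k) - M
```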

We found that the pixel value based image features used in the recently reported SEC [6] and NSDR [19] clustering models are not able to capture and utilize the local neighborhood structure effectively for image datasets with significant within-class variation. Cha showed that the Minkowski metric (or Lp-norm) used in content-based image retrieval to measure similarity between image pairs cannot adequately capture the characteristics of the human visual system or the nonlinear relationships in contextual information given by images in a collection [21]. Siagian et al. described a simple context-based scene recognition algorithm that can differentiate outdoor scenes from various sites on a college campus by capturing the gist of the scene into a low-dimensional signature vector [22]. Li et al. investigated the Gist image descriptor with Kmeans clustering [23]. Cai et al. studied how to integrate heterogeneous image features by performing multi-modal spectral clustering on unlabeled images [24]. Ebert et al. explored promising directions to improve the results of object recognition by looking at the local topology between images and feature combination strategies [7]. They defined a nearest neighborhood quality (NNQ) measure that, for a given number of nearest neighbors k, counts the correct nearest neighbors for each image and averages these accuracies over all images, and showed that the better the neighborhood structure in terms of correct nearest neighbors, the better the overall performance of the object recognition model.

We improved the local neighborhood structure based on the maximum NNQ measure for the k-nearest neighbor based approach by selecting an optimal image descriptor among recently reported image descriptors such as Histograms of Oriented Gradients (HOG) [35], [36], modeling the shape of the scene (Gist) [27], Texture Edge Descriptors (TED) [37], Speeded-up Robust Features (SURF) [38], and Saliency Using Natural Statistics (SUN) [39]. For each image descriptor, we obtained image features, computed the corresponding NNQ measure, and obtained an overall NNQ measure on 10 image datasets. Based on our experiment, we found Gist to be the optimal image descriptor, achieving the maximum overall NNQ measure over all other image descriptors. We thus exploited the effectiveness of the Gist image descriptor for the recently reported SEC [6] and NSDR [19] clustering models, where the corresponding clustering results are reported as SEC-Gist and NSDR-Gist, respectively.

We compared the clustering performance of Kmeans, DisKmeans [10], NCut [13], LDMGI [5], SEC [6], NSDR [19], SEC-Gist, and NSDR-Gist on 13 benchmark image databases, namely JAFFE [25], ORL [28], COIL20 [29], yalefaces [30], COREL [31], Caltech faces [32], AR [26], Scenes [27], Caltech101 [33], Flowers [34], Pointing04 [40], Out [41], and PIE [42], which contain images with different expressions, pose, illumination, or occlusion variations. Clustering performance is evaluated in terms of clustering accuracy (ACC) and normalized mutual information (NMI). Using the proposed NNQ measure based image representation approach, a significant overall performance improvement of 10.8% (ACC) and 9.4% (NMI) for the SEC model [6] and 10.1% (ACC) and 9.0% (NMI) for the NSDR model [19] is observed on all 13 image databases. Further, the SEC model using Gist image features is computationally efficient, and its overall computational cost is reduced by 19% as compared with the SEC model [6] using pixel value based image features.

The remainder of this paper is organized as follows. In Sect. 2, the motivation of the proposed work is discussed. Discussion about image representation and similarity measure is given in Sect. 3. Results and discussion are presented in Sect. 4. Finally, the conclusions are drawn in Sect. 5.


Fig. 1 Representative images from one class of JAFFE [25], AR [26], and Scenes [27] image databases.

Fig. 2 Histogram of class distribution of (a) JAFFE [25], (b) ORL [28], (c) COIL20 [29], (d) yalefaces [30], (e) COREL [31], (f) Caltech faces [32], (g) AR [26], (h) Scenes [27], (i) Caltech101 [33], and (j) Flowers [34] image databases.


2. Motivation

The motivation of the proposed work is discussed in Sects. 2.1 and 2.2. In Sect. 2.1, we discuss nominal to significant within-class variation in image databases. In Sect. 2.2, we show that gray level image features cannot achieve well separated images at the local level for image datasets with significant within-class variation.

2.1 Within-Class Variation and Histogram of Class Distribution in an Image Database

We show some representative images from one class of the JAFFE [25], AR [26], and Scenes [27] image databases in Fig. 1, using gray level image features, to illustrate within-class variations in an image database due to images with different expressions, pose, illumination, and occlusion variations. The JAFFE image database contains 213 images from 10 classes, in which the first 23 images correspond to class 1, the next 22 images correspond to class 2, and so on. Here, the order of classes in an image dataset is irrelevant and can be chosen arbitrarily. For the AR image database, we used images from 100 classes (50 men and 50 women), where each class consists of 26 images. The Scenes image database consists of 8 different categories of urban and natural scenes, and each category consists of 80 images. Each class of the JAFFE image database consists of images of different facial expressions, while the AR image database contains images with different facial expressions, illumination, and occlusion variations. Each class of the outdoor natural Scenes image database contains images with pose, illumination, and occlusion variations.

Within-class variation for the JAFFE and AR image databases is different. For the JAFFE image dataset, within-class variation is present due to images of different expressions, while for the AR image database, it is present due to images of different facial expressions, illumination, and occlusion variations. Thus, within-class variation is nominal for the JAFFE image database, while it is significant for the AR and Scenes image databases. The histogram of all images in one class of each image dataset is shown in Fig. 2, where a Gaussian-like unimodal distribution is observed for the JAFFE image dataset, corresponding to its nominal within-class variation.


Table 1 Class separation at local level using k-nearest neighbors approach.

Dataset  Within-class  Within-class  Image  Class of images in ℵk(xi)  Diff  Class separation
         variation     distribution  no.    x0  x1  x2  x3  x4               at local level
JAFFE    Nominal       Unimodal      n1     1   1   1   1   1          0     100%
                                     n2     1   1   1   1   1          0
                                     n3     1   1   1   1   1          0
                                     n4     1   1   1   1   1          0
                                     n5     1   1   1   1   1          0
                                     n6     1   1   1   1   1          0
                                     n7     1   1   1   1   1          0
                                     n8     1   1   1   1   1          0
                                     n9     1   1   1   1   1          0
                                     n10    1   1   1   1   1          0
Scenes   Significant   Multimodal    n1     1   1   1   3   6          2     63%
                                     n2     1   1   1   1   5          1
                                     n3     1   1   1   1   3          1
                                     n4     1   1   1   1   1          0
                                     n5     1   6   3   1   1          2
                                     n6     1   1   1   6   4          2
                                     n7     1   6   1   1   1          1
                                     n8     1   6   2   2   6          4
                                     n9     1   1   1   1   1          0
                                     n10    1   6   6   1   1          2

Distinct multimodal distributions can be observed for the yalefaces, COREL, Caltech faces, AR, Scenes, Caltech101, and Flowers image databases, which contain images with pose, illumination, or occlusion variations.

2.2 Class Separation at Local Level Using k-Nearest Neighbor Approach

To illustrate the shortcoming of gray level image features in achieving well separated images at the local level, we present in Table 1 simulation results of constructing ℵ_k(x_i) for the first 10 images from each of the JAFFE and Scenes image databases. Let n_j (1 ≤ j ≤ 10) denote the jth image. The class number of each image in ℵ_k(x_i) is shown in Table 1, along with the total number of neighbors that belong to classes different from that of x_0 (shown as Diff), where columns x_0 through x_{k−1} correspond to the 5 neighbors of each image n_j. A good separation is achieved if all neighbors belong to the same class, and hence yield a small Diff; a short sketch of this computation follows. For JAFFE, which has nominal within-class variation, images are well separated at the local level using gray level image features. However, optimal class separation at the local level is not achieved for the Scenes image database, which has significant within-class variation. Hence, the benefit of recently reported clustering models such as SEC [6] and NSDR [19] cannot be fully utilized for image datasets with significant within-class variation, because images are not well separated at the local level, and an effective strategy should be used to learn multimode patterns in data matrices at the local level for optimal class separation of images using the k-nearest neighbor based approach.
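The Diff column of Table 1 can be reproduced with a few lines of code. Below is a minimal Python sketch under the assumption that neighbors are found by Euclidean distance on the feature vectors (the paper does not spell out the distance; `features` and `labels` are hypothetical arrays holding the image features and ground-truth classes):

```python
import numpy as np

def local_diff(features, labels, k=5):
    """For each image, count how many of its k nearest neighbors
    belong to a different class ('Diff' in Table 1)."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # exclude the image itself
    nn = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest neighbors
    return np.sum(labels[nn] != labels[:, None], axis=1)

# Class separation at the local level is then the average fraction of
# correct neighbors, e.g.:
# separation = 100.0 * (1.0 - local_diff(features, labels).mean() / 5)
```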

3. Nearest Neighborhood Quality Measure and Optimal Image Representation

The clustering results of [6], [19] were reported using pixel value based image features that were obtained by resizing the original images using an image interpolation approach (I2A). In our work, we obtained image features using the TED [37], SUN [39], SURF [38], HOG [36], and Gist [27] image descriptors. The image feature dimensions obtained through the TED and SUN image descriptors are the same as that of the original image, which we further reduced using the I2A approach. For the HOG, SURF, and Gist image features, each image in an image database is first reduced using the I2A approach to the image feature dimension reported in [6], [19]. Then, a total of 64 image features are obtained using the SURF image descriptor. We used 15 orientation bins and a cell size of 3 × 3 pixels to extract the HOG image descriptors, and thus the total number of image features for each image dataset is 15 × 9 = 135. In the Gist image descriptor, the Gabor filter banks are created with three scales, with 8, 8, and 4 orientations respectively. So the filter banks are formed with 20 filters, whose filter transfer functions are computed according to the size of the images, and the total number of image features obtained for each image dataset is 512.

We improved the local neighborhood structure using the maximum NNQ measure in order to achieve performance improvement for image datasets with significant within-class variation. We selected an optimal image descriptor among TED [37], SUN [39], SURF [38], HOG [36], and Gist [27]. For each image descriptor, we show simulation results of computing the NNQ measure for each image dataset in Table 2, in which an overall NNQ measure on all image datasets is also shown. The maximum overall NNQ measure is obtained for the Gist image descriptor over all other image descriptors on all image datasets: Gist obtained a maximum overall NNQ measure of 68%, while the nearest competitor is HOG with an overall NNQ measure of 63%. A sketch of this selection rule is given below.
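In code, the selection amounts to averaging the per-dataset NNQ scores of each descriptor and taking the argmax. The sketch below assumes the same Euclidean k-NN search as in the Sect. 2.2 sketch; `extractors` is a hypothetical mapping from descriptor name to feature extraction function:

```python
import numpy as np

def nnq(features, labels, k=5):
    """NNQ measure [7]: average over images of the fraction of the
    k nearest neighbors that share the image's ground-truth class."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    return float(np.mean(labels[nn] == labels[:, None]))

def select_descriptor(datasets, extractors, k=5):
    """Pick the descriptor with the maximum overall (mean) NNQ measure."""
    overall = {name: np.mean([nnq(extract(images), labels, k)
                              for images, labels in datasets])
               for name, extract in extractors.items()}
    return max(overall, key=overall.get)   # 'Gist' in our experiments
```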

The block diagram of the proposed approach is shown in Fig. 3, where some representative images from 6 out of 100 classes of the AR [26] image database are shown; these clearly show the pose, illumination, and occlusion variations in the images of the multimodal AR image dataset.


Table 2 Nearest neighborhood quality (NNQ) measure for each image database.

Dataset              I2A [6]  SURF  TED  SUN  HOG  Gist
JAFFE                98       38    99   98   99   99
ORL                  89       38    80   84   89   97
COIL20               98       68    97   98   96   98
yalefaces            71       25    70   77   76   78
COREL                43       24    40   44   64   70
Caltech faces        29       07    35   20   32   25
AR                   34       29    83   58   49   75
Scenes               30       27    28   35   68   69
Caltech101           19       11    20   37   36   40
Flowers              19       13    10   19   25   29
Overall NNQ Measure  53       28    56   57   63   68

Fig. 3 Block diagram of proposed SEC-Gist image clustering model.

Table 3 Database descriptions and image feature dimension for each image database.

Dataset        Sample  Class   Image size  Variation present                          Image feature dimension
               size    number                                                         I2A [6],[19]/TED/SUN  SURF  HOG  Gist
JAFFE          213     10      256 × 256   Expression                                 676                   64    135  512
ORL            400     40      112 × 92    Expression, session                        644                   64    135  512
COIL20         1440    20      128 × 128   Expression, pose                           1024                  64    135  512
yalefaces      165     15      320 × 243   Expression, illumination                   1024                  64    135  512
COREL          2620    20      120 × 80    Pose, occlusion                            1200                  64    135  512
Caltech faces  413     19      896 × 592   Expression, illumination, occlusion        2072                  64    135  512
AR             2600    100     120 × 165   Expression, illumination, occlusion        1200                  64    135  512
Scenes         2688    8       256 × 256   Pose, illumination, occlusion              676                   64    135  512
Caltech101     1127    20      260 × 300   Pose, illumination, occlusion              1200                  64    135  512
Flowers        1360    17      650 × 500   Pose, illumination, occlusion              1344                  64    135  512
Pointing04     2790    15      384 × 288   Pose                                       1120                  64    135  512
Out            3330    90      640 × 480   Pose                                       1344                  64    135  512
PIE            3020    5       640 × 480   Expression, pose, illumination, occlusion  1200                  64    135  512

For each image dataset, optimal image features are obtained using the TED [37], SUN [39], SURF [38], HOG [36], and Gist [27] image descriptors, and the corresponding NNQ measure for each image descriptor is computed. The decision to choose the best image descriptor among all image descriptors is made on the basis of the overall maximum NNQ measure on all image databases. In this work, over 21,000 samples from 13 benchmark image datasets were used to test the clustering performance of the SEC [6] and NSDR [19] clustering models using the proposed NNQ measure based optimal image representation. Detailed descriptions in terms of sample size, number of classes, original image size, variation present, and image feature dimension for each image dataset are given in Table 3.


3.1 Performance Evaluation Metrics

In this work, clustering accuracy (ACC) and normalized mutual information (NMI) are used for performance evaluation.

3.1.1 Clustering Accuracy (ACC)

For the ith image, let s_i be the clustering result from the clustering algorithm and t_i be the ground truth label. The ACC is defined as follows:

    ACC = (Σ_{i=1}^n δ(t_i, map(s_i))) / n        (3)

where n is the total number of images, δ(t, s) = 1 if t = s (and 0 otherwise), and map(s_i) is the optimal mapping function that permutes the clustering labels to match the ground truth labels. The optimal mapping can be obtained by using the Kuhn-Munkres algorithm [43]. A larger ACC indicates better performance.
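A minimal sketch of Eq. (3) follows, assuming the ground-truth and cluster labels are integers in {0, ..., c−1}; the optimal map(·) is obtained with the Kuhn-Munkres (Hungarian) algorithm, here via SciPy:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(t, s):
    """ACC of Eq. (3): t = ground-truth labels, s = cluster labels."""
    t, s = np.asarray(t), np.asarray(s)
    c = int(max(t.max(), s.max())) + 1
    # w[i, j] = number of samples assigned to cluster i with true class j
    w = np.zeros((c, c), dtype=int)
    for si, ti in zip(s, t):
        w[si, ti] += 1
    rows, cols = linear_sum_assignment(-w)   # maximize the total matches
    return w[rows, cols].sum() / len(t)
```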

3.1.2 Normalized Mutual Information (NMI)

NMI is another widely used measure for evaluating clustering results. For two arbitrary variables P and Q, NMI is defined as follows:

    NMI(P, Q) = I(P, Q) / √(H(P) H(Q))        (4)

where I(P, Q) represents the mutual information between P and Q, and H(P) and H(Q) denote the entropies of P and Q, respectively. NMI(P, Q) equals 1 if P is identical to Q, and it becomes 0 if P is entirely dissimilar from Q. Let g_h be the total number of samples in the hth ground truth class (1 ≤ h ≤ c) and g_m be the number of samples in cluster C_m (1 ≤ m ≤ c) obtained by the clustering algorithm. NMI is computed as

    NMI = (Σ_{h=1}^c Σ_{m=1}^c g_{h,m} log(n·g_{h,m} / (g_h g_m))) / √((Σ_{h=1}^c g_h log(g_h/n)) (Σ_{m=1}^c g_m log(g_m/n)))        (5)

where g_{h,m} is the number of samples in the intersection between cluster C_m and the hth ground truth class. A larger NMI indicates a better clustering result.
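The following is a minimal numpy sketch of Eq. (5) (a hedged illustration, not the authors' code; it assumes at least two distinct classes and clusters so that the denominator is nonzero):

```python
import numpy as np

def nmi(t, s):
    """NMI of Eq. (5): t = ground-truth labels, s = cluster labels."""
    t, s = np.asarray(t), np.asarray(s)
    n = len(t)
    classes, clusters = np.unique(t), np.unique(s)
    # contingency counts g_{h,m}
    g = np.array([[np.sum((t == h) & (s == m)) for m in clusters]
                  for h in classes], dtype=float)
    gh, gm = g.sum(axis=1), g.sum(axis=0)   # class and cluster sizes
    nz = g > 0                              # skip empty cells (0 log 0 = 0)
    num = np.sum(g[nz] * np.log(n * g[nz] / np.outer(gh, gm)[nz]))
    den = np.sqrt(np.sum(gh * np.log(gh / n)) * np.sum(gm * np.log(gm / n)))
    return num / den
```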

4. Results and Discussion

We discuss the clustering performance of Kmeans, DisKmeans [10], NCut [13], LDMGI [5], SEC [6], NSDR [19], SEC-Gist, and NSDR-Gist on 13 benchmark image datasets. To reduce statistical variation, the process of random initialization was repeated 20 times during simulation. The average results over the 20 runs are reported for all image databases in terms of mean ACC ± std and mean NMI ± std, where std represents the standard deviation around the mean. The maximum clustering performance over the 20 runs is reported as best ACC and best NMI. The value of the image clustering parameter k [5], [6], [13], [19] is set to 5, and the optimal value of γ [10], λ [5], γ_g [6], [19], γ_l [6], and σ [13], [19] is selected from the set {10^−8, 10^−6, 10^−4, 10^−2, 10^0, 10^2, 10^4, 10^6, 10^8} as reported in [5]. In addition, the parameter μ (SEC [6] and NSDR [19]) is tuned over the set {10^−9, 10^−6, 10^−3, 10^0, 10^3, 10^6, 10^9} as reported in [6].
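For reference, the two parameter grids can be written down directly; the selection loop is sketched with a hypothetical `run_and_score` evaluator, since the paper does not specify the tuning driver:

```python
# Parameter grids of Sect. 4 (exponents step by 2 and 3, respectively).
gamma_grid = [10.0 ** e for e in range(-8, 9, 2)]   # 1e-08 ... 1e+08
mu_grid    = [10.0 ** e for e in range(-9, 10, 3)]  # 1e-09 ... 1e+09

# Hypothetical tuning loop:
# best = max(((g, m) for g in gamma_grid for m in mu_grid),
#            key=lambda gm: run_and_score(*gm))     # run_and_score: assumed
```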

4.1 Clustering Performance Comparison

Significant performance improvement can be observed in Table 4 and Table 5 by using Gist image features over the previous gray level image features for the SEC [6] and NSDR [19] clustering models. A significant performance improvement of 25.9% (mean ACC) and 26.8% (mean NMI) is observed on the multimodal Scenes image dataset. The overall performance on all 13 image datasets of Kmeans, DisKmeans [10], NCut [13], LDMGI [5], SEC [6], NSDR [19], SEC-Gist, and NSDR-Gist is 38.4%, 41.3%, 47.7%, 47.6%, 50.3%, 51.4%, 61.4%, and 61.6% in terms of mean ACC, and 45.4%, 49.2%, 52.4%, 53.1%, 54.0%, 54.7%, 63.6%, and 63.6% in terms of mean NMI. Thus, we achieved a significant performance improvement of 10.8% (mean ACC) and 9.4% (mean NMI) for the SEC model [6] and 10.1% (mean ACC) and 9.0% (mean NMI) for the NSDR model [19] using the proposed NNQ measure based optimal image representation.

We show the clusters formed using NCut [13], LDMGI [5], SEC [6], and SEC-Gist for 5 classes from each of the JAFFE, yalefaces, and Scenes image databases. We represented images with their mean values, as shown in the first column of Fig. 4, which shows nominal within-class variation for JAFFE while significant within-class variation can be observed for the yalefaces and Scenes image databases. Using Gist image features, 100% cluster separation is obtained on the JAFFE image database for the SEC model [6], and significant improvement in cluster separation is achieved for the yalefaces and Scenes image databases. The clustering result on Pointing04 is different from that reported in [5] because, in our work, the 1120-dimensional feature vector is not normalized.

4.2 Generalization of Proposed Image Representation Approach

We selected the Gist image descriptor as a result of a test on 10 image databases (given in Table 2), which shows that the set of images used for selecting Gist over other descriptors is quite large. This suggests that the choice of Gist is generic and not dependent on the particular domain of images used. However, we further obtained clustering performance using the proposed NNQ measure based image representation for the SEC [6] and NSDR [19] clustering models on 3 test image databases, namely Pointing04 [40], Out [41], and PIE [42], to show that the set of image databases used for selecting the Gist descriptor and the test set used for comparing clustering results are separate. Clustering results for SEC-Gist and NSDR-Gist on these 3 test image databases are also shown in Table 4 and Table 5.


Table 4 Performance comparison of Kmeans, DisKmeans [10], NCut [13], LDMGI [5], SEC [6], NSDR [19], SEC-Gist, and NSDR-Gist clustering models in terms of best ACC and mean ACC ± std.

(a) Best ACC
Dataset        Kmeans  DisKmeans [10]  NCut [13]  LDMGI [5]  SEC [6]  NSDR [19]  SEC-Gist  NSDR-Gist
JAFFE          83.1    91.1            96.2       96.4       96.2     96.7       100       100
ORL            73.0    74.5            79.0       81.7       86.0     85.0       93.8      93.8
COIL20         67.3    71.9            82.5       84.1       85.2     84.4       86.7      86.7
yalefaces      61.2    63.6            70.3       67.3       73.9     73.8       98.2      98.8
COREL          35.3    36.8            36.4       37.8       37.6     37.7       48.5      49.2
Caltech faces  26.4    36.1            27.9       27.1       29.1     29.8       34.1      33.7
AR             14.2    17.8            18.7       18.5       49.5     52.1       50.6      50.4
Scenes         26.0    29.5            25.8       26.2       27.4     28.3       53.9      53.9
Caltech101     18.2    21.0            17.8       18.9       20.4     19.8       31.9      31.7
Flowers        20.5    20.5            17.3       18.5       17.1     18.1       25.6      25.3
Pointing04     53.3    61.6            65.1       66.7       64.9     71.9       78.3      78.3
Out            41.2    42.9            57.7       54.6       59.0     59.4       83.6      83.9
PIE            22.9    24.7            31.1       36.5       37.9     36.0       34.3      36.3
Overall Mean   41.7    45.5            48.1       48.8       52.6     53.3       63.0      63.2

(b) Mean ACC ± std
Dataset        Kmeans    DisKmeans [10]  NCut [13]  LDMGI [5]  SEC [6]   NSDR [19]  SEC-Gist  NSDR-Gist
JAFFE          75.2±6.0  80.5±7.2        96.2±0     96.4±0     91.3±1.2  96.5±0.2   100±0     100±0
ORL            66.5±3.4  68.8±2.9        78.0±0.5   81.0±0.4   82.8±1.8  83.1±1.9   91.9±1.2  92.6±0.6
COIL20         60.0±3.9  60.5±5.6        82.5±0     84.1±0     82.8±2.0  81.9±2.5   83.6±2.2  82.8±1.9
yalefaces      56.2±4.0  57.2±5.0        70.2±0.3   65.4±2.6   71.0±2.1  71.5±2.4   97.5±2.0  98.2±0.4
COREL          35.0±0.2  36.1±0.8        35.6±0.3   35.1±1.0   36.7±0.7  36.8±0.7   46.5±1.0  46.8±1.2
Caltech faces  24.2±1.0  32.1±2.5        26.8±0.7   25.1±0.7   27.8±0.8  28.5±0.8   32.4±1.5  32.2±0.9
AR             13.4±0.6  16.2±1.3        18.6±0.1   18.3±0.1   47.9±1.1  48.2±1.8   48.4±1.7  48.1±1.8
Scenes         24.3±0.7  27.6±1.6        25.7±0.1   26.2±0     26.0±1.0  26.7±0.7   51.9±1.7  51.9±1.7
Caltech101     17.4±0.7  19.6±0.9        16.9±0.4   18.7±0.2   18.8±0.7  18.9±0.7   30.3±1.5  30.1±1.3
Flowers        18.0±0.8  19.6±0.7        17.1±0.2   18.0±0.3   16.8±0.3  17.5±0.4   24.1±1.1  24.5±0.6
Pointing04     48.6±2.5  56.1±3.5        65.1±0     65.0±0.1   63.7±1.9  66.7±2.9   78.3±0    78.3±0
Out            40.0±1.1  40.9±1.3        57.6±0.2   54.0±0.8   57.7±0.7  58.1±0.9   83.4±0.2  83.5±0.2
PIE            20.8±1.1  21.7±2.0        29.5±1.3   32.1±2.6   30.6±3.5  33.8±1.4   30.3±3.4  32.0±2.9
Overall Mean   38.4±2.0  41.3±2.7        47.7±0.3   47.6±0.7   50.3±1.4  51.4±1.3   61.4±1.3  61.6±1.0

Table 5 Performance comparison of Kmeans, DisKmeans [10], NCut [13], LDMGI [5], SEC [6], NSDR [19], SEC-Gist, and NSDR-Gist clustering models in terms of best NMI and mean NMI ± std.

(a) Best NMI
Dataset        Kmeans  DisKmeans [10]  NCut [13]  LDMGI [5]  SEC [6]  NSDR [19]  SEC-Gist  NSDR-Gist
JAFFE          98.5    96.2            95.6       96.7       95.6     96.4       100       100
ORL            87.9    89.5            89.4       92.2       92.3     92.1       97.2      97.2
COIL20         79.5    81.4            90.0       94.8       90.1     90.1       92.6      92.6
yalefaces      67.7    75.4            73.0       71.3       73.1     73.1       98.0      98.5
COREL          40.5    37.9            40.3       40.7       41.9     41.9       51.4      51.0
Caltech faces  32.0    47.8            34.9       33.7       39.2     39.6       39.5      39.0
AR             43.3    44.6            45.1       45.1       68.7     67.9       72.4      72.3
Scenes         9.5     11.4            10.3       13.1       11.6     11.4       38.2      38.2
Caltech101     20.6    21.5            19.3       19.4       20.4     20.4       33.1      32.9
Flowers        17.5    19.1            16.2       16.1       16.9     17.3       21.0      21.2
Pointing04     53.7    73.8            77.9       78.0       76.3     79.6       85.0      85.0
Out            71.8    72.4            83.9       83.8       83.8     83.8       95.5      95.4
PIE            1.6     6.4             10.1       10.8       12.9     12.1       17.5      20.9
Overall Mean   48.0    52.1            52.8       53.5       55.6     55.8       64.7      64.9

(b) Mean NMI ± std
Dataset        Kmeans    DisKmeans [10]  NCut [13]  LDMGI [5]  SEC [6]   NSDR [19]  SEC-Gist  NSDR-Gist
JAFFE          85.0±3.1  90.3±3.4        95.6±0     96.7±0     92.0±0.9  96.1±0.4   100±0     100±0
ORL            85.5±1.5  87.0±1.0        89.0±0.2   92.1±0.2   90.9±0.6  91.0±0.7   96.5±0.4  96.7±0.4
COIL20         76.5±1.5  76.8±2.4        89.9±0     94.7±0     89.1±1.1  88.8±1.2   92.0±0.5  92.0±0.4
yalefaces      65.0±2.0  69.5±3.1        72.7±0.5   70.3±1.4   72.1±0.8  72.0±1.0   97.3±1.3  97.8±0.4
COREL          39.6±0.9  36.7±0.9        40.3±0     40.0±0.2   41.3±0.4  41.4±0.4   50.5±0.6  50.1±0.6
Caltech faces  30.8±1.1  44.0±0.9        34.1±0.6   32.9±0.4   37.3±0.9  37.5±0.6   38.2±1.1  37.9±0.7
AR             43.1±0.3  44.0±0.4        44.9±0.1   45.0±0.1   67.8±0.6  67.9±0.1   70.5±1.3  70.2±1.2
Scenes         8.1±0.6   10.9±0.5        10.3±0     13.1±0     10.5±1.0  10.5±0.5   37.3±0.8  37.3±0.8
Caltech101     18.7±0.8  20.0±0.8        18.8±0.2   19.1±0.2   19.4±0.6  19.6±0.5   32.2±0.6  32.0±0.9
Flowers        16.1±0.6  17.2±0.9        15.7±0.2   15.7±0.3   16.1±0.4  16.7±0.3   19.9±0.6  20.2±0.6
Pointing04     51.3±1.4  68.4±2.4        77.9±0     78.0±0     75.5±0.8  76.5±1.7   84.9±0    84.9±0
Out            70.8±0.7  72.0±0.3        83.6±0.2   83.6±0.2   83.2±0.4  83.4±0.3   95.2±0.3  95.2±0.2
PIE            0.3±0.7   2.4±3.3         8.1±1.8    8.7±1.8    6.6±3.4   9.4±2.2    12.8±5.3  12.9±5.0
Overall Mean   45.4±1.2  49.2±1.6        52.4±0.3   53.1±0.4   54.0±0.9  54.7±0.8   63.6±1.0  63.6±0.9

These results clearly show that the proposed NNQ measure based image representation achieves significant performance improvement over the pixel value based image features used in the clustering models [5], [6], [10], [13], [19].

4.3 Image Clustering of Outdoor Natural Image Datasets

With our proposed NNQ measure based image representation approach, significant performance improvement of SEC [6] is observed for outdoor natural image datasets such as Scenes [27] and Flowers [34]. Clustering performance on the Scenes image dataset is improved from 26.0% to 51.9% and from 10.5% to 37.3% in terms of mean ACC and mean NMI, respectively. Similarly, the performance improvement on the Flowers image dataset is from 16.8% to 24.1% and from 16.1% to 19.7%. The reason behind this significant performance improvement is that we achieved maximally well separated images at the local level using the optimal image descriptor with the maximum NNQ measure.

4.4 Temporal Cost Comparison

We conducted a temporal cost comparison of the SEC model [6] using gray level and Gist image features for each image database. For example, for the AR image database, the computational cost of computing the Gist image features for a single image is 0.085 seconds, and a total of 2600 images are used in this study, so the computational cost for the AR image database is 3.670 minutes. In this way, the total computational cost for computing Gist image features for all 10 image databases is 19.245 minutes, as shown in Table 6. The total computational cost, in minutes, for the SEC [6] and SEC-Gist models is 48.066 and 19.685, respectively. Thus, the overall mean computational cost for the SEC model using Gist image features is reduced from 4.807 to 3.893 minutes, which is a 19% reduction in computational cost. All simulations were performed on a Dell OptiPlex 990 desktop with an i7 processor and 16 GB memory.


Fig. 4 Comparisons of image cluster separation for 5 classes of the JAFFE [25], AR [26], and Scenes [27] image databases using the NCut [13], LDMGI [5], SEC [6], and SEC-Gist models.

Table 6 Comparison of computational cost (minutes) of SEC [6] and SEC-Gist models.

Model         JAFFE  yalefaces  ORL    COIL20  Caltech faces  Flowers  Scenes  Caltech101  AR     COREL   Total cost  Overall mean
SEC [6]       0.123  0.273      0.210  3.021   6.37           5.106    4.97    8.265       9.254  10.474  48.066      4.807
SEC-Gist      0.085  0.065      0.169  2.087   0.177          0.970    3.772   4.165       3.228  4.967   19.685      3.893
Gist feature  0.311  0.320      0.724  2.297   1.283          2.051    2.699   1.821       3.670  4.099   19.245

5. Conclusion

Multimode patterns in image data matrices can vary from nominal to significant due to images with different expressions, pose, illumination, or occlusion variations. We showed that manifold learning based image clustering models that use pixel value based image features are not able to capture and utilize the local neighborhood structure effectively for image datasets with significant multimode data patterns. We improved the local neighborhood structure significantly by selecting an image descriptor among recently reported image descriptors based on the overall maximum nearest neighborhood quality (NNQ) measure. We observed Gist to be the optimal image descriptor, achieving the overall maximum NNQ measure on 10 benchmark image datasets. We used Gist image features for the recently reported SEC [6] and NSDR [19] clustering models. Experimentally, a significant overall performance improvement of 10.8% (mean ACC) and 9.4% (mean NMI) for the SEC model [6] and 10.1% (mean ACC) and 9.0% (mean NMI) for the NSDR model [19] is achieved using the proposed NNQ measure based optimal image representation. Further, the overall computational cost of the SEC model is reduced by 19%, and clustering performance is significantly improved for challenging outdoor natural image databases using the proposed NNQ measure based optimal image representations.

Acknowledgement

This research is supported by the Higher Education Commission of Pakistan under the indigenous PhD scholarship program 17-5-5(Ps5-217)/HEC/Sch/2010.

References

[1] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol.22, no.8, pp.888–905, 2000.
[2] T. Zhang, B. Fang, Y.Y. Tang, Z. Shang, and B. Xu, “Generalized discriminant analysis: A matrix exponential approach,” IEEE Trans. Syst. Man Cybern. B, vol.40, no.1, pp.2761–2773, 2010.
[3] M. Wu and B. Scholkopf, “A local learning approach for clustering,” Neural Information Processing Systems (NIPS), pp.1529–1536, 2006.
[4] L.K. Saul and S.T. Roweis, “Think globally, fit locally: Unsupervised learning of low dimensional manifolds,” J. Machine Learning Research, vol.4, pp.119–155, 2003.
[5] Y. Yang, D. Xu, F. Nie, S. Yan, and Y. Zhuang, “Image clustering using local discriminant models and global integration,” IEEE Trans. Image Process., vol.19, no.10, pp.2761–2773, 2010.
[6] F. Nie, Z. Zeng, I.W. Tsang, D. Xu, and C. Zhang, “Spectral embedded clustering: A framework for in-sample and out-of-sample spectral clustering,” IEEE Trans. Neural Netw., vol.22, no.11, pp.1796–1808, 2011.
[7] S. Ebert, D. Larlus, and B. Schiele, “Extracting structures in image collections for object recognition,” 11th European Conference on Computer Vision (ECCV), pp.720–733, 2010.
[8] F.D. la Torre and T. Kanade, “Discriminative cluster analysis,” International Conference on Machine Learning (ICML), pp.241–248, 2006.
[9] C. Ding and T. Li, “Adaptive dimension reduction using discriminant analysis and k-means clustering,” International Conference on Machine Learning (ICML), pp.521–528, 2007.
[10] J. Ye, Z. Zhao, and M. Wu, “Discriminative k-means for clustering,” Neural Information Processing Systems (NIPS), pp.1649–1656, 2008.
[11] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., ch. Feature Extraction and Linear Mapping for Classification, Academic, San Diego, CA, 1990.
[12] M. Sugiyama, “Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis,” J. Machine Learning Research, vol.8, pp.1027–1061, 2007.
[13] S.X. Yu and J. Shi, “Multiclass spectral clustering,” International Conference on Computer Vision (ICCV), pp.313–319, 2003.
[14] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol.22, no.8, pp.888–905, 2000.
[15] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, “Spectral grouping using the Nystrom method,” IEEE Trans. Pattern Anal. Mach. Intell., vol.26, no.2, pp.214–225, 2004.
[16] M. Filippone, F. Camastra, F. Masulli, and S. Rovetta, “A survey of kernel and spectral methods for clustering,” Pattern Recognit., vol.41, no.1, pp.176–190, 2008.
[17] Y. Yang, Y.T. Zhuang, F. Wu, and Y.H. Pan, “Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval,” IEEE Trans. Multimedia, vol.10, no.3, pp.437–446, 2008.
[18] F. Nie, D. Xu, I.W. Tsang, and C. Zhang, “Spectral embedded clustering,” International Joint Conference on Artificial Intelligence (IJCAI), Pasadena, CA, pp.1181–1186, 2009.
[19] Y. Yang, H.T. Shen, F. Nie, R. Ji, and X. Zhou, “Nonnegative spectral clustering with discriminative regularization,” Twenty-Fifth AAAI Conference on Artificial Intelligence (AAAI-11), pp.555–560, 2011.
[20] J. Huang, F. Nie, and H. Huang, “Spectral rotation versus k-means in spectral clustering,” Twenty-Seventh AAAI Conference on Artificial Intelligence (AAAI-13), pp.431–437, 2013.
[21] G.H. Cha, “A context-aware similarity search for a handwritten digit image database,” The Computer Journal, vol.53, no.8, pp.1291–1301, 2010.
[22] C. Siagian and L. Itti, “Rapid biologically-inspired scene classification using features shared with visual attention,” IEEE Trans. Pattern Anal. Mach. Intell., vol.29, no.2, pp.300–312, 2007.
[23] X. Li, C. Wu, C. Zach, S. Lazebnik, and J.M. Frahm, “Modeling and recognition of landmark image collections using iconic scene graphs,” 10th European Conference on Computer Vision (ECCV), Marseille, France, pp.427–440, 2008.
[24] X. Cai, F. Nie, H. Huang, and F. Kamangar, “Heterogeneous image feature integration via multi-modal spectral clustering,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1977–1984, 2011.
[25] M.J. Lyons, J. Budynek, and S. Akamatsu, “Automatic classification of single facial images,” IEEE Trans. Pattern Anal. Mach. Intell., vol.21, no.12, pp.1357–1362, 1999.
[26] A. Martinez and R. Benavente, The AR Face Database, CVC Technical Report No.24, Computer Vision Center (CVC), Ohio State University, 1998.
[27] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” Int. J. Comput. Vis., vol.42, no.3, pp.145–175, 2001.
[28] F. Samaria and A. Harter, “Parameterisation of a stochastic model for human face identification,” Proc. 2nd IEEE Workshop on Applications of Computer Vision, 1994.
[29] S.A. Nene, S.K. Nayar, and H. Murase, Columbia Object Image Library (COIL-20), 1996.
[30] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection,” IEEE Trans. Pattern Anal. Mach. Intell., vol.19, no.7, pp.711–720, 1997.
[31] J.Z. Wang, J. Li, and G. Wiederhold, “SIMPLIcity: Semantics-sensitive integrated matching for picture libraries,” IEEE Trans. Pattern Anal. Mach. Intell., vol.23, no.9, pp.947–963, 2001.
[32] M. Weber, Frontal Face Dataset, Computational Vision Group, California Institute of Technology, 2012.
[33] F. Li, M. Andreetto, and M.A. Ranzato, Caltech101, Computational Vision Group, California Institute of Technology, 2003.
[34] M.E. Nilsback and A. Zisserman, “A visual vocabulary for flower classification,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1447–1454, 2006.
[35] D.G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol.60, no.2, pp.91–110, 2004.
[36] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.886–893, 2005.
[37] N. Armanfard, M. Komeili, and E. Kabir, “TED: A texture-edge descriptor for pedestrian detection in video sequences,” Pattern Recognit., vol.45, pp.983–992, 2012.
[38] H. Bay, A. Ess, T. Tuytelaars, and L.V. Gool, “Speeded-up robust features (SURF),” Comput. Vis. Image Understand., vol.110, pp.346–359, 2008.
[39] L. Zhang, M.H. Tong, T.K. Marks, H. Shan, and G.W. Cottrell, “SUN: A Bayesian framework for saliency using natural statistics,” J. Vision, vol.8, pp.1–20, 2008.
[40] N. Gourier, D. Hall, and J.L. Crowley, “Estimating face orientation from robust detection of salient facial features,” International Workshop on Visual Observation of Deictic Gestures (ICPR), 2004.
[41] J.C.J. Chen, Out Database, Robotics Lab, National Cheng Kung University (NCKU), 2011.
[42] R. Gross, PIE Database, The Robotics Institute, Carnegie Mellon University, 2012.
[43] C.H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Dover, New York, 1998.

Nasir Ahmed received the M.Sc. in Physics from the University of Agriculture, Faisalabad, Pakistan, in 2000 and the M.S. in Information Technology from the Pakistan Institute of Engineering and Applied Sciences, Islamabad, Pakistan, in 2004. He is currently pursuing the PhD degree at the same university. His research interests include signal and image processing, machine learning, and pattern recognition.

Abdul Jalil received the M.Sc. in Electronics and the M.Phil. in Signal Processing from Quaid-i-Azam University, Islamabad, Pakistan, in 1986 and 2000, respectively, the PhD in Signal and Image Processing from M.A. Jinnah University, Islamabad, Pakistan, in 2006, and a Post-Doc in Medical Image Analysis from the University of Sussex, UK, in 2009. He is currently a Professor with the Pakistan Institute of Engineering and Applied Sciences (PIEAS), Islamabad, Pakistan. His current research interests include texture analysis and matching, medical image analysis, and the application of different imaging techniques.