
ORIGINAL ARTICLE

Semi-supervised clustering for gene-expression data in multiobjective optimization framework

Abhay Kumar Alok • Sriparna Saha • Asif Ekbal

Received: 10 June 2014 / Accepted: 18 January 2015 / Published online: 15 February 2015

© Springer-Verlag Berlin Heidelberg 2015

Abstract Studying the patterns hidden in gene expression data helps to understand the functionality of genes. However, because of the large number of genes and the complexity of biological networks, it is difficult to study the resulting mass of data, which often consists of millions of measurements. Clustering techniques are applied to reveal natural structures and to identify interesting patterns in a given gene expression data set. Semi-supervised classification is a new direction in machine learning that uses a large amount of unlabeled data together with a few labeled data, and it generally performs better than unsupervised classification. To the best of our knowledge, however, no existing work solves the gene expression data clustering problem using semi-supervised classification techniques. In the current paper we attempt to solve this problem using a multiobjective optimization based semi-supervised classification technique, with the aim of attaining good-quality partitions from few labeled data. The labeled data are generated by first applying the Fuzzy C-means clustering technique. To determine the partitioning automatically, multiple cluster centers corresponding to a cluster are encoded in the form of a string. The quality of an obtained partitioning is measured by the values of five objective functions. The effectiveness of the proposed semi-supervised clustering technique is demonstrated on five publicly available benchmark gene expression data sets. Comparisons with existing techniques for gene expression data clustering show that the proposed method is the most effective of those compared. Statistical and biological significance tests have also been carried out.

Keywords Gene expression data clustering · Semi-supervised classification · Multiobjective optimization · Cluster validity index · AMOSA

1 Introduction

Owing to the invention of DNA (deoxyribonucleic acid) microarray technology, it has become feasible to examine the expression levels of thousands of genes at a time during different ongoing biological processes and across collections of related samples. Application areas of microarray technology include gene expression profiling, medical diagnosis, and bio-medicine [1, 22, 39]. Usually, gene expression values are measured at different time points during a biological experiment. A microarray gene expression data set is represented as a 2D matrix $A = [f_{ij}]$ of size $c \times t$, where $c$ is the number of genes and $t$ is the number of time points. Each element $f_{ij}$ gives the expression level of the $i$th gene at the $j$th time point. To identify sets of genes exhibiting similar expression profiles, clustering or unsupervised learning [7, 33, 51] is generally used.

Clustering, also termed unsupervised learning, is the procedure of grouping data items into different partitions or clusters in such a way that data items belonging to the same group are similar to each other according to some similarity criterion, while data items belonging to different groups are dissimilar according to the same criterion [1]. In supervised classification, actual class labels of some data points are available. These

A. K. Alok (✉) · S. Saha · A. Ekbal
Computer Science Engineering, Indian Institute of Technology, Patna, India


Int. J. Mach. Learn. & Cyber. (2017) 8:421–439

DOI 10.1007/s13042-015-0335-8


labeled data are used to build a model, which is then used to assign class labels to unknown samples. The main difficulty in supervised classification is generating the labeled data, which is both time consuming and expensive. In contrast, plenty of unlabeled data are easily accessible. Unsupervised classification uses these unlabeled data to form the partitioning, which depends on the data distribution and its intrinsic properties. Semi-supervised classification lies halfway between supervised and unsupervised classification [3, 9, 11, 14, 15]. Here, in addition to the unlabeled data, a small amount of labeled data is also available, and it helps to improve the clustering result. Owing to this improved performance, semi-supervised classification has many applications in pattern recognition, document classification, data mining, information retrieval, image categorization, and gene function classification. Literature studies show that semi-supervised clustering is much more effective than unsupervised classification techniques [3, 9, 11, 14, 15]. As gene expression data clustering is an important problem in bioinformatics with applications in the diagnosis of diseases, it would be beneficial to apply sophisticated techniques such as semi-supervised clustering to it. However, the existing literature shows no effort to solve the gene expression data clustering problem using semi-supervised classification. Inspired by this, in the current paper we aim to solve the gene expression data clustering problem using a semi-supervised clustering technique, with the motivation of obtaining improved results that will help to diagnose some diseases more accurately. A new multiobjective optimization based semi-supervised classification technique is developed for this purpose.

Semi-supervised classification requires some labeled data to fine-tune the partitioning obtained from the unlabeled data. For gene expression data it is difficult to generate the labeled information. In the current paper we have utilized a well-known clustering technique, Fuzzy C-means [10], to generate some labeled data. Semi-supervised clustering has two different objectives: (i) the partitioning should optimize some cluster quality measures, and (ii) the partitioning should satisfy the class labels of the available labeled data. To satisfy both objectives simultaneously, the use of multiobjective optimization (MOO) [8] is proposed. Center based encoding is used to represent clusters. A single cluster is divided into several non-overlapping hyperspherical sub-clusters, and the centers of these sub-clusters are used to represent the cluster. AMOSA (archived multiobjective simulated annealing) [8], a newly developed simulated annealing based multiobjective optimization technique, is used as the underlying MOO technique. Five objective functions are optimized simultaneously by AMOSA. The first four are internal cluster validity indices based on unsupervised properties of the data set. The last is an external cluster validity index, which captures the compatibility of the obtained partitioning with the available supervised information.

The performance of the multiobjective simulated annealing based semi-supervised clustering technique (Semi-GenClustMOO) has been tested on five well-known, publicly available real-life gene expression data sets, viz., Yeast sporulation, Arabidopsis Thaliana, Rat CNS, Yeast cell cycle, and Human fibroblasts serum. The proposed Semi-GenClustMOO clustering technique is compared with MO-fuzzy [51], MOGA clustering [7], GenClustMOO [49], FCM [10], a single objective GA based clustering technique [41], Self Organising Map (SOM) [59], Chinese Restaurant Clustering (CRC) [45], and hierarchical average linkage clustering [62]. Statistical significance tests have also been performed to show the superiority of the Semi-GenClustMOO clustering technique. We have also conducted biological significance tests to show that the clusters formed are biologically relevant.

The current paper is unique in the following ways:

– To the best of our knowledge, this is the first attempt to apply semi-supervised classification to the gene expression data clustering problem.

– A novel technique is devised to generate some labeled data from a given unlabeled data set without the help of any human annotator. In general, external knowledge about a data set is generated with the help of human annotators. In the current paper, we have proposed a method which can utilize any popular fuzzy clustering technique to generate the labeled data. Here fuzzy C-means clustering is used for this purpose.

– A multiobjective approach is developed to solve the semi-supervised classification problem. A new encoding strategy, in which multiple centers correspond to a particular cluster, is used to represent a partitioning.

– Five objective functions capturing different data properties are optimized simultaneously using the search capability of a newly developed simulated annealing based multiobjective optimization technique, namely AMOSA (archived multiobjective simulated annealing). The first four objective functions capture unsupervised data properties. The last one checks the compatibility of the obtained partitioning with the original class labels.


– Different mutation strategies are used to handle the new encoding strategy.

– Results are shown on five benchmark gene expression data sets. In all cases, the proposed technique outperforms seven existing techniques for gene expression data clustering. It also outperforms two recently developed multiobjective clustering techniques for gene expression data. This in turn shows the utility of semi-supervised classification for solving the gene expression data clustering problem.

– The obtained results are verified biologically and statistically.

The rest of the paper is organized as follows. Section 2 discusses some existing techniques for gene expression data clustering and semi-supervised classification. The procedure for generating labeled data from the set of unlabeled data is illustrated in Sect. 3. Section 4 discusses the different steps of the proposed multiobjective semi-supervised clustering technique. Experimental results are discussed in detail in Sect. 5, and finally Sect. 6 concludes the paper.

2 Background

In this section we discuss existing techniques for gene expression data clustering and semi-supervised classification.

2.1 Gene expression data clustering

Traditional approaches to genomic research concentrate on the local examination and collection of data for a single gene. Microarray technology, however, enables the expression levels of tens of thousands of genes to be monitored simultaneously. There are mainly two types of microarray experiments [35]: cDNA microarrays [53] and oligonucleotide arrays (abbreviated oligo chips) [38].

During a microarray experiment, a large number of DNA sequences (genes, cDNA clones, or expressed sequence tags) are assessed under multiple conditions. Examples of conditions are a time series during a biological process (e.g., the Yeast cell cycle) or a collection of different tissue samples (e.g., normal versus cancerous tissues). In general no distinction is made among DNA sequences; they are uniformly termed ‘‘genes’’. Similarly, all kinds of experimental conditions are termed ‘‘samples’’.

Microarray technology has been successfully applied to problems in many areas, such as medical diagnosis, bio-medicine, and gene expression profiling. In general, the gene expression values during a biological experiment are determined at different time points [42]. A microarray gene expression data set consisting of $x$ genes and $y$ time points is generally arranged in a 2D matrix $G = [g_{ij}]$ of size $x \times y$. Each element $g_{ij}$ gives the expression level of the $i$th gene at the $j$th time point. In recent years, many new techniques have been developed in the literature to deal with gene expression data [30, 56, 60, 67].

Some noise and missing values can be present in the original gene expression matrix obtained from the scanning process. There may also be systematic variations arising from the experimental procedure. Data pre-processing is therefore a necessary step before any clustering technique is applied. Missing values can be estimated using various methods [63], after which the data are normalized. Any clustering technique can then be applied to the processed gene expression data.

Clustering is a widely used microarray analysis tool. Genes with similar expression profiles can be detected by using clustering techniques. Clustering methods divide a set of n objects into K partitions depending on some similarity/dissimilarity metric. Here the number of clusters, K, is not known a priori. The application of clustering techniques aids the understanding of gene function, gene regulation, cellular processes, and subtypes of cells. Genes with similar expression patterns (co-expressed genes) can be grouped into a single cluster of genes with similar cellular functions. This helps to predict the functionality of genes about which no information was previously known [22, 61]. Regulatory motifs specific to each gene cluster can be recognized by searching for common DNA sequences at the promoter regions of genes within the same cluster [35]. Cis-regulatory elements can also be proposed based on this information [12, 61]. Hypotheses regarding the mechanism of the transcriptional regulatory network can be formed by inferring regulation from the clustering of gene expression data [21]. Finally, sub-cell types can be identified based on the clustering results on expression profiles. Such information was difficult to identify using traditional morphology-based approaches [1].

Clustering methods have been widely used for the discovery of cancer subtypes [1, 22, 39]. Bioinformaticians have developed many novel clustering techniques suited to gene expression data, but medical scientists prefer to use traditional clustering techniques for this particular problem [4]. Using modern microarray technologies, molecular signatures of cancer cells can be measured [58]. One of the most important and useful exploratory analyses of microarray data is to apply clustering techniques to the cancer/patient samples (tissues) [58]. The basic goal is to determine groups of samples sharing similar expression patterns, which can help to discover novel cancer subtypes.


Clustering methods have been used extensively for solving these kinds of problems. Several clustering techniques have been proposed by bioinformaticians that take into account different internal characteristics of gene expression data, for example the presence of noise and missing values and the high-dimensional nature of the data sets [13, 37, 58]. These clustering techniques are tested on publicly available data from previously published clinical studies. In [18, 57] the authors used the K-means clustering technique to solve the gene expression data clustering problem. Recently, many new clustering algorithms [28, 29] have been proposed for gene expression data clustering which overcome the drawbacks of the K-means algorithm. Self organizing maps are used to cluster gene expression data in [59]. These algorithms perform better than the K-means algorithm. Eisen et al. [22] developed an agglomerative algorithm called UPGMA (Unweighted Pair Group Method with Arithmetic Mean) for gene expression data clustering and proposed a method to represent the clustered data set graphically. Alon et al. [2] devised an algorithm that partitions the genes using a divisive approach, called the deterministic-annealing algorithm (DAA) [47]. Eisen’s method became very popular among biologists and is the most widely used tool in gene expression data analysis [2, 22, 32]. However, the structure of the hierarchical dendrogram is sensitive to the data, which is the main weakness of these conventional agglomerative approaches. High computational complexity is another drawback of hierarchical clustering techniques. In [55], the authors used CLICK, a graph theory based clustering technique, to cluster publicly available gene expression data and compared their approach with the SOM based method [59] and Eisen’s hierarchical method [22]. In [34], a new clustering technique named DHC (a density-based, hierarchical clustering method) is proposed to identify co-expressed gene groups from gene expression data. Several model based clustering approaches for gene expression data have been proposed in [25, 27, 68], which provide statistical frameworks to model the gene expression data.

In recent years the problem of gene expression data clustering has been modeled as a multiobjective optimization problem, and many multiobjective optimization based clustering techniques have been developed to solve it. In [51], a simulated annealing based multiobjective fuzzy clustering technique, MO-fuzzy, is developed for gene expression data clustering. It uses AMOSA as the underlying optimization technique, but no supervised information is used to fine-tune the partitioning. In [7] a genetic algorithm based multiobjective clustering technique, MOGA clustering, is developed for gene expression data clustering. It uses a genetic algorithm based multiobjective optimization technique, NSGA-II, as the underlying optimization strategy. In [44], a multiobjective interactive clustering technique is developed for gene expression data clustering. This approach interactively takes input from a human decision maker (DM) during execution and adaptively learns from that input to obtain the final set of validity measures along with the final clustering result. In [20], a Fuzzy C-means based multiobjective clustering technique is developed that clusters gene expression data by optimizing multiple objectives. In [24], an algorithm for cluster analysis is developed that integrates aspects of cluster ensembles and multiobjective clustering. It is based on a Pareto-based multiobjective genetic algorithm with a special crossover operator and uses clustering validation measures as objective functions. The approach is then applied to some gene expression data sets. In [43], a multiobjective clustering technique is first applied to the gene expression data. The solutions on the final Pareto front are then combined by a post-processing technique based on a support vector machine.

2.2 Semi-supervised learning

Semi-supervised classification is a new direction in pattern recognition and machine learning [9, 11, 15]. It is widely used in real-life domains such as text processing and bioinformatics, where sufficient labeled data are scarce. In supervised learning the main goal is to build a classifier which is then used to assign class labels to unknown samples; thus the main goal is prediction. In unsupervised learning the major objective is to discover the natural grouping in the data; thus the main goal is description. Semi-supervised learning utilizes unlabeled data and a few labeled data for prediction. Here the given data set can be divided into two parts: (i) data points for which the actual class information is known, and (ii) data points for which no class labels are known. The supervised information can also be provided in other forms, such as must-link constraints (two points should belong to the same cluster) and cannot-link constraints (two points should not belong to the same cluster). Semi-supervised learning is, in general, approached in two different ways. (i) Unsupervised learning with labeled information added as constraints. Here the additional labeled information can be utilized while initializing the cluster centers, during the assignment of points to different clusters, or during objective function calculation. (ii) Supervised learning with additional information on the distribution of the examples. This interpretation is more appropriate if the final aim is to predict the class label of an unknown sample, but it is not applicable if the number and the nature of the classes are not known in advance. In the current paper we have solved the semi-supervised learning problem using the first view: we have devised an unsupervised classification technique which also takes the available supervised information into account.

In any semi-supervised classification technique [50], the target is to satisfy two major objectives: the obtained partitioning of points into clusters should be of good quality, and the available supervised information should not be violated. To measure these two properties, two types of cluster validity indices are used: internal and external. An internal validity index utilizes intrinsic properties of the data items, while an external validity index utilizes the supervised information given in the form of class labels of data items. Different internal cluster validity indices [5, 40, 52, 66] exist in the literature. Most of them try to make clusters as compact as possible (i.e., to minimize within-cluster scatter) and to maximize the separation between clusters. There are different ways of computing cluster compactness and separation. Different internal cluster validity indices are used to capture clusters of different kinds, such as hyperspherical shapes, symmetrical shapes, and overlapping structures. In the current paper four objective functions are used as internal cluster validity indices. The first is a Euclidean distance based cluster validity index, the I-index [40], which uses the maximum distance between any two cluster centers as the separation between clusters. The second is the Sym-index [5], which uses a newly developed point symmetry based distance to compute cluster compactness. The next two cluster validity indices, the XB-index [66] and the FCM-index [10], use concepts from fuzzy logic to compute cluster compactness. These two indices try to obtain overlapping structures from a given data set. The adjusted Rand index (ARI) [52] is used as the external cluster validity index.
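To make the external index concrete, the following is a minimal sketch of the adjusted Rand index in its standard pair-counting form. The function name and the handling of degenerate inputs are assumptions made for illustration; in the semi-supervised setting described above, it would be evaluated only on the points whose labels are available.

```python
from collections import Counter
from math import comb
from typing import Sequence


def adjusted_rand_index(true_labels: Sequence[int], pred_labels: Sequence[int]) -> float:
    """Adjusted Rand index between two labelings of the same points."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # assumption: treat trivial inputs as perfect agreement
    # Contingency counts n_ij and the marginal counts of both partitions.
    cells = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Same grouping under renamed labels, and a grouping that splits every class.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index depends only on which points are grouped together, so renaming the cluster labels does not change its value.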

2.3 Multiobjective optimization

As semi-supervised clustering requires the optimization of more than one objective function, multiobjective optimization (MOO) is required to solve such problems. MOO has a different perspective from single objective optimization (SOO). In SOO a single objective function is optimized, whereas in MOO several objective functions are optimized together. SOO provides a single final solution, while MOO provides a set of solutions on the final Pareto optimal front. All the solutions produced by a MOO based technique are equally important, and none of them dominates another.

The concept of domination is an important aspect of MOO. In case of maximization of objectives, a solution $x_i$ is said to dominate $x_j$ if $\forall k \in \{1, 2, \ldots, O\},\ OB_k(x_i) \ge OB_k(x_j)$ and $\exists k \in \{1, 2, \ldots, O\}$ such that $OB_k(x_i) > OB_k(x_j)$, where $OB_k$ denotes the $k$th of the $O$ objective functions. Among a set of solutions $SOL$, the non-dominated set of solutions $SOL'$ contains those which are not dominated by any member of the set $SOL$. The non-dominated set of the entire search space $S$ is called the globally Pareto-optimal set or Pareto front. In general, a MOO algorithm outputs a set of solutions not dominated by any solution encountered by it.
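The domination test and the extraction of a non-dominated set can be sketched as follows. This is an illustrative sketch assuming every objective is maximized, as in the definition above; the function names and the sample objective vectors are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a dominates b when every objective is maximized."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(solutions: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the solutions that no other member of the list dominates."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]


# Hypothetical objective vectors (two objectives, both maximized).
points = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0), (0.5, 0.5)]
front = non_dominated(points)
```

Here `(1.5, 3.0)` is dominated by `(2.0, 4.0)` and `(0.5, 0.5)` is dominated by every other point, so neither belongs to the non-dominated set.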

3 Generation of labeled data for semi-supervised

clustering

In the current paper we have solved the gene expression data clustering problem using semi-supervised classification techniques. In semi-supervised classification, some labeled data are available in addition to the unlabeled data. For gene expression data, however, it is difficult to generate the labeled information, and doing so with the help of human annotators is both time consuming and costly. In the current paper we have therefore proposed a way of generating the labeled data without help from any human annotator. We first execute the popular Fuzzy C-means (FCM) clustering technique [10] to obtain a partitioning of the available gene expression data set. The points which attain the highest membership values with respect to a given cluster are selected as the labeled data. Note that points with the highest membership values are the most certain points within a particular cluster: high membership values signify that these points are not boundary points but core points of the cluster. Any other clustering technique could have been used instead of Fuzzy C-means. We have opted for Fuzzy C-means because it quantifies the membership values and is very popular in the field of pattern recognition.

Fuzzy C-means (FCM) [10] is a very popular clustering technique, widely used in pattern recognition, which incorporates fuzzy logic: a single data point may belong to two or more clusters. FCM is based on the minimization of the single objective function

$$J_i = \sum_{j=1}^{N} \sum_{c=1}^{C} u_{c,j}^{m}\, D^{2}(z_c, x_j), \qquad 1 \le m \le \infty \tag{1}$$

Here, $N$ is the number of genes, $C$ is the total number of clusters, $u_{c,j}$ denotes the fuzzy membership of the $j$th gene in the $c$th cluster, and $m$ is the fuzzy exponent. Further, $x_j$ denotes the $j$th gene, $z_c$ is the center of the $c$th cluster, and $D(z_c, x_j)$ is the distance of gene $x_j$ from the cluster center $z_c$.



Initially, the FCM algorithm randomly picks $C$ cluster centers. Then, in every iteration, it evaluates the fuzzy membership of each gene with respect to every cluster according to the following equation:

$$u_{c,i} = \frac{\left(\frac{1}{D(z_c, x_i)}\right)^{\frac{1}{m-1}}}{\sum_{j=1}^{C} \left(\frac{1}{D(z_j, x_i)}\right)^{\frac{1}{m-1}}}, \qquad 1 \le c \le C,\ 1 \le i \le N \tag{2}$$

where $D(z_c, x_i)$ and $D(z_j, x_i)$ represent the distances between $x_i$ and $z_c$, and between $x_i$ and $z_j$, respectively. After the fuzzy membership of each gene has been evaluated, the cluster centers are updated with the help of the following equation:

$$z_c = \frac{\sum_{i=1}^{N} u_{c,i}^{m}\, x_i}{\sum_{i=1}^{N} u_{c,i}^{m}}, \qquad 1 \le c \le C \tag{3}$$
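The alternation between the membership update of Eq. (2) and the center update of Eq. (3) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: it assumes Euclidean distance for $D$, uses the exponent $1/(m-1)$ exactly as written in Eq. (2) (many presentations of FCM use $2/(m-1)$ instead), picks the initial centers among the data points, and stops when no center moves by more than a tolerance. The function names, the toy data, and the handling of a point that coincides with a center are assumptions.

```python
import math
import random
from typing import List, Tuple

Point = List[float]


def distance(a: Point, b: Point) -> float:
    """Euclidean distance (assumed choice for D)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def memberships(data: List[Point], centers: List[Point], m: float) -> List[List[float]]:
    """Eq. (2): u[c][i] is the membership of point i in cluster c."""
    u = [[0.0] * len(data) for _ in centers]
    for i, x in enumerate(data):
        d = [distance(z, x) for z in centers]
        zeros = [c for c, v in enumerate(d) if v == 0.0]
        if zeros:
            # Assumption: a point lying on one or more centers belongs fully to them.
            for c in zeros:
                u[c][i] = 1.0 / len(zeros)
            continue
        w = [(1.0 / v) ** (1.0 / (m - 1.0)) for v in d]
        total = sum(w)
        for c in range(len(centers)):
            u[c][i] = w[c] / total
    return u


def update_centers(data: List[Point], u: List[List[float]], m: float) -> List[Point]:
    """Eq. (3): each center is the u^m-weighted mean of all points."""
    dim = len(data[0])
    centers = []
    for row in u:
        weights = [v ** m for v in row]
        total = sum(weights)
        centers.append([sum(w * x[k] for w, x in zip(weights, data)) / total
                        for k in range(dim)])
    return centers


def fuzzy_c_means(data: List[Point], n_clusters: int, m: float = 2.0,
                  tol: float = 1e-6, max_iter: int = 200,
                  seed: int = 0) -> Tuple[List[Point], List[List[float]]]:
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(data, n_clusters)]
    for _ in range(max_iter):
        u = memberships(data, centers, m)
        new_centers = update_centers(data, u, m)
        shift = max(distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # centers no longer change
            break
    return centers, memberships(data, centers, m)


# Toy data: two well-separated groups of three points each.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, u = fuzzy_c_means(data, n_clusters=2)
```

By construction, the memberships of each point sum to one over the clusters, which is the property the later selection of high-membership points relies on.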

The two steps mentioned above, evaluation of the fuzzy memberships and re-computation of the cluster centers, are repeated until the cluster centers no longer change. The final membership values are then considered for each cluster individually. For each cluster c, we sort all the points by their membership values, select the top 10C % of the points with the highest membership values with respect to that cluster center, and assign the cluster label c to them. Here C is the total number of clusters. Note that here 10 % of the points from each cluster are selected based on the membership values. Points with the highest membership values are the most certain points within that cluster, whereas points with membership values below 0.5 are the most uncertain points. Fuzzy C-means is a widely used technique for data clustering. The literature shows that, for data sets with overlapping clusters, the points with the highest membership values with respect to a particular cluster are the most central points of that cluster. The class label information of those points is therefore expected to be the most reliable. This information is used as the supervised information in the proposed semi-supervised clustering technique.
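The selection of labeled points from the final membership values can be sketched as follows. This is a hedged illustration: the function name is hypothetical, the share of points taken per cluster is left as a free `fraction` parameter rather than fixed to the percentage stated above, and the rule for a point that ranks highly in two clusters (keep the cluster with the larger membership) is an assumption not taken from the text.

```python
from typing import Dict, List


def select_labeled_points(u: List[List[float]], fraction: float) -> Dict[int, int]:
    """Map point index -> cluster label for the most confidently assigned points.

    u[c][i] is the fuzzy membership of point i in cluster c.
    """
    n_points = len(u[0])
    per_cluster = max(1, round(fraction * n_points))
    labels: Dict[int, int] = {}
    for c, row in enumerate(u):
        # Sort the points of this cluster by decreasing membership value.
        ranked = sorted(range(n_points), key=lambda i: row[i], reverse=True)
        for i in ranked[:per_cluster]:
            # Assumption: on a conflict, keep the cluster with the higher membership.
            if i not in labels or row[i] > u[labels[i]][i]:
                labels[i] = c
    return labels


# Hypothetical memberships of six points in two clusters.
u = [
    [0.9, 0.8, 0.6, 0.2, 0.1, 0.3],
    [0.1, 0.2, 0.4, 0.8, 0.9, 0.7],
]
labels = select_labeled_points(u, fraction=0.34)
```

With these values the two most certain points of each cluster receive a label, and the remaining points stay unlabeled.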

4 Semi-supervised multiobjective clustering

algorithm: Semi-GenClustMOO

In this paper we have proposed a multiobjective solution to the semi-supervised clustering problem. The proposed algorithm, Semi-GenClustMOO, is a generalized multiobjective framework for semi-supervised clustering, and it can be combined with any multiobjective optimization technique. In the current paper we have used the archived multiobjective simulated annealing based technique, AMOSA [8], as the underlying optimization algorithm. Semi-GenClustMOO extends GenClustMOO [49] (a multi-center based multiobjective clustering technique) to handle semi-supervised clustering. A single cluster is represented by ‘‘multiple centers’’ in the form of a string. We assume that each cluster consists of several non-overlapping hyperspherical sub-clusters. The centers of these sub-clusters are then encoded in a state to represent a particular cluster. AMOSA [8], a newly developed simulated annealing based optimization technique, is used to optimize the five objective functions simultaneously. The first four objective functions quantify unsupervised properties, and the last one quantifies the supervised information. We assume that supervised information is available for only a few of the data items. The flowchart of the Semi-GenClustMOO algorithm is shown in Fig. 1.
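One way to picture the multi-center encoding is sketched below. It is only an illustration under stated assumptions: the state is taken to be a flat list of sub-cluster centers with a fixed number of centers per cluster, and each point is assigned to the cluster that owns its nearest sub-cluster center. Neither the fixed count nor the nearest-center rule is specified in the text above, and all names and values are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def decode_assignments(data: List[Point], state: List[Point],
                       centers_per_cluster: int) -> List[int]:
    """Assign each point to the cluster owning its nearest sub-cluster center.

    Assumed layout: positions k*p .. (k+1)*p - 1 of `state` hold the
    p sub-cluster centers that together represent cluster k.
    """
    labels = []
    for x in data:
        nearest = min(range(len(state)), key=lambda s: euclidean(state[s], x))
        labels.append(nearest // centers_per_cluster)
    return labels


# Hypothetical state: two clusters, each represented by two sub-cluster centers.
state = [(0.0, 0.0), (0.0, 4.0), (10.0, 0.0), (10.0, 4.0)]
data = [(0.0, 1.0), (0.0, 3.5), (9.0, 0.0), (10.0, 5.0)]
labels = decode_assignments(data, state, centers_per_cluster=2)
```

Because several centers stand for one cluster, an elongated or otherwise non-spherical cluster can be covered by a chain of small hyperspherical pieces.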

4.1 The SA based MOO algorithm: AMOSA

AMOSA [8], the archived multiobjective simulated annealing based technique, is used here as the underlying optimization strategy. It generalizes the probabilistic metaheuristic simulated annealing (SA) using the concepts of multiobjective optimization (MOO). Simulated annealing is a search technique for solving difficult optimization problems, which is based on the principles of statistical mechanics [36]. SA can not only replace exhaustive search to save time and resources, but also converge to the global optimum if annealed sufficiently slowly [26]. Although the single objective version of SA has been quite popular, its utility in the multiobjective case was limited because of its search-from-a-point nature. Recently Bandyopadhyay et al. developed an efficient multiobjective version of SA called AMOSA [8] that overcomes this limitation.

In AMOSA (archived multiobjective simulated annealing) [8], which is a multiobjective version of SA, several concepts have been newly integrated. It utilizes the concept of an archive, as used in [69], in which the non-dominated solutions seen so far are stored. Two limits are kept on the size of the archive: a hard or strict limit denoted by HL, and a larger, soft limit denoted by SL, where SL > HL. The non-dominated solutions are stored in the archive as and when they are generated. If some members of the archive become dominated by a new solution in the process, they are removed. If at some point the size of the archive exceeds a specified value, then the clustering process, described below, is invoked.
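The archive bookkeeping described above can be sketched as follows. This is a simplified illustration that stores only objective vectors and assumes maximization; it also assumes that the "specified value" triggering the size reduction is the soft limit SL, which the text does not state explicitly. The reduction step itself is not sketched, and all names are hypothetical.

```python
from typing import List, Tuple

Objectives = Tuple[float, ...]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a dominates b, assuming every objective is maximized."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def add_to_archive(archive: List[Objectives], candidate: Objectives) -> bool:
    """Store candidate if no member dominates it; return True if it was added."""
    if candidate in archive or any(dominates(m, candidate) for m in archive):
        return False
    # Members that the new solution dominates are removed.
    archive[:] = [m for m in archive if not dominates(candidate, m)]
    archive.append(candidate)
    return True


def needs_reduction(archive: List[Objectives], soft_limit: int) -> bool:
    """Assumed trigger: the archive has grown beyond the soft limit SL."""
    return len(archive) > soft_limit


archive = [(1.0, 5.0), (3.0, 1.0)]
add_to_archive(archive, (2.0, 4.0))            # non-dominated, so it is stored
overflow = needs_reduction(archive, soft_limit=2)
add_to_archive(archive, (2.0, 6.0))            # removes the members it dominates
rejected = not add_to_archive(archive, (1.0, 1.0))
```

Keeping two limits lets the archive grow freely up to SL between reductions, so the reduction step does not have to run after every insertion.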

In AMOSA, the initial temperature is set to Tmax. Then,

one of the points is randomly selected from the archive.

This is taken as the current-pt, or the initial solution. The

current-pt is perturbed to generate a new solution called

new-pt, and its objective functions are computed. The

domination status of the new-pt is checked with respect to

the current-pt and the solutions in the archive. A new quantity called the amount of domination, $\Delta dom_{a,b}$, between two solutions a and b is defined as follows:

$$\Delta dom_{a,b} = \prod_{i=1,\; f_i(a) \neq f_i(b)}^{M} \frac{|f_i(a) - f_i(b)|}{R_i} \tag{4}$$

where $f_i(a)$ and $f_i(b)$ are the ith objective values of the two solutions and $R_i$ is the corresponding range of the objective function computed from the individuals in the population. M is the number of objectives. Based on the domination status between the new-pt, current-pt and the points in the archive, different cases may arise. These are discussed in detail in [8], and are briefly mentioned here for the sake of completeness.
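The amount of domination of Eq. 4 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name and the argument layout (one list of objective values per solution, one list of ranges) are assumptions made here and are not taken from the AMOSA source code.

```python
from typing import Sequence


def amount_of_domination(fa: Sequence[float], fb: Sequence[float],
                         ranges: Sequence[float]) -> float:
    """Eq. 4: product of |f_i(a) - f_i(b)| / R_i over the objectives that differ."""
    product = 1.0
    for fa_i, fb_i, r_i in zip(fa, fb, ranges):
        if fa_i != fb_i:  # objectives with identical values are skipped
            product *= abs(fa_i - fb_i) / r_i
    return product


# Hypothetical objective vectors with M = 3; the second objective is equal
# for both solutions and therefore does not enter the product.
value = amount_of_domination([1.0, 2.0, 5.0], [3.0, 2.0, 4.0], [4.0, 1.0, 2.0])
```

Note that in this sketch two solutions with identical objective vectors give an empty product, i.e. the value 1.0.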

Case 1: new-pt is either dominated by the current-pt, or it is nondominating with respect to the current-pt but some points in the archive dominate the new-pt. Suppose new-pt is dominated by a total of k points (including current-pt and points in the archive). This case is demonstrated in Fig. 2 (the points D, E, F, G and H in the figure signify the content of the archive at any instant, while the other points illustrate different cases that may arise with respect to the archive) where F represents the current-pt and B represents the new-pt. Then a quantity $\Delta dom_{avg}$ is computed as

$$\Delta dom_{avg} = \frac{\sum_{i=1}^{k} \Delta dom_{i,\,new\text{-}pt} + \Delta dom_{current\text{-}pt,\,new\text{-}pt}}{k+1}.$$

The new-pt is accepted as current-pt with a probability

$$p_{qs} = \frac{1}{1 + e^{\Delta dom_{avg}/T}}. \tag{5}$$

Note that $\Delta dom_{avg}$ denotes the average amount of domination of the new-pt by $(k+1)$ points, namely, the current-pt and k points of the archive. Also, as k increases, $\Delta dom_{avg}$ will increase since here the dominating points that are farther away from the new-pt are contributing to its value.
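As a hedged illustration of Case 1, the following Python sketch computes the average amount of domination and the acceptance probability of Eq. 5. It reads the exponent of Eq. 5 as the average amount of domination divided by the temperature T, which is an assumption about the printed formula; the exact form should be checked against [8]. The function name and arguments are hypothetical.

```python
import math
from typing import Sequence


def case1_acceptance_probability(dom_by_archive: Sequence[float],
                                 dom_by_current: float,
                                 temperature: float) -> float:
    """Probability of accepting new-pt when it is dominated (Case 1).

    dom_by_archive: amounts of domination of new-pt by the k archive points.
    dom_by_current: amount of domination of new-pt by current-pt.
    """
    k = len(dom_by_archive)
    dom_avg = (sum(dom_by_archive) + dom_by_current) / (k + 1)
    exponent = dom_avg / temperature  # assumed reading of Eq. 5
    if exponent > 700.0:              # exp() would overflow; probability is ~0
        return 0.0
    return 1.0 / (1.0 + math.exp(exponent))


# Hypothetical values: two dominating archive points and the current-pt.
p = case1_acceptance_probability([0.2, 0.4], 0.3, temperature=0.3)
```

Under this reading, the probability of accepting a dominated new-pt shrinks as the temperature falls, so the search becomes increasingly greedy towards the end of the annealing.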

Fig. 1 Working principle of

Semi-GenClustMOO algorithm

Fig. 2 Pareto-optimal front and different domination examples



Case 2: Neither the current-pt nor the points in the

archive dominate the new-pt. This can be demonstrated

with different examples shown in Fig. 2, e.g., F represents

the current-pt and E represents the new-pt, G represents the

current-pt and I represents the new-pt, F represents the

current-pt and I represents the new-pt. For all these cases,

accept the new-pt as the current-pt. If there are any points

in the archive which are dominated by new-pt, remove

them from the archive. Add new-pt in the archive. If

archive size crosses the SL, apply single linkage clustering

to reduce its size to HL.

Case 3: new-pt dominates the current-pt but k points in the archive dominate the new-pt. This case can be demonstrated using Fig. 2 where A represents the current-pt and B represents the new-pt. Here the minimum of the differences of domination amounts between the new-pt and the k points of the archive, denoted by $\Delta dom_{min}$, is computed. The point from the archive that corresponds to the minimum difference is selected as the current-pt with

$$\text{probability} = \frac{1}{1 + \exp(\Delta dom_{min})}. \tag{6}$$

Otherwise the new-pt is selected as the current-pt. This may be considered as an informed reseeding of the annealer, which takes place only if the archive point is accepted.

The above process is repeated iter times for each temperature (temp). Temperature is reduced to α × temp, using the cooling rate α, till the minimum temperature, Tmin, is attained. The process thereafter stops, and the archive contains the final non-dominated solutions.
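To give a feeling for the cost of this schedule, the sketch below counts the temperature levels and the total number of perturbations. It assumes that the loop runs as long as the temperature is strictly greater than Tmin, which is one possible reading of "till the minimum temperature is attained"; the function name is hypothetical.

```python
from typing import Tuple


def annealing_schedule(t_max: float, t_min: float, alpha: float,
                       iters: int) -> Tuple[int, int]:
    """Return (number of temperature levels, total number of perturbations)."""
    levels = 0
    temp = t_max
    while temp > t_min:   # assumed stopping rule
        levels += 1       # 'iters' perturbations are carried out at this level
        temp *= alpha     # geometric cooling: temp <- alpha * temp
    return levels, levels * iters


# Parameter values reported in Sect. 5.3: Tmax = 100, Tmin = 0.00001,
# cooling rate 0.8 and iter = 50.
levels, total = annealing_schedule(100.0, 1e-5, 0.8, 50)
```

With the parameter values of Sect. 5.3 and this stopping rule, the schedule has 73 temperature levels and therefore 3650 perturbations in total.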

It has been demonstrated in [8] that the performance of

AMOSA is better than that of NSGA-II [19] and some

other well-known MOO algorithms. The pseudo-code of

the AMOSA algorithm is shown in Fig. 3.

4.2 State representation and archive initialization

In Semi-GenClustMOO, a set of real numbers represents the state of AMOSA. These real numbers are the coordinates of the centers of the partitions, so that AMOSA can identify the appropriate set of cluster centers and the respective partitionings of the data items. Suppose a state encodes the centers of K clusters, and each cluster is further divided into C sub-clusters. Then the length of that state will be $K \times C \times d$, where d is the dimension of the data set. For example, let a state contain K = 2 clusters, each divided into C = 10 sub-clusters, and let the dimension of the data set be d = 2. The jth sub-cluster of the ith cluster is represented by $c_{i,j} = (cx_{i,j}, cy_{i,j})$, and the entire state will look like

$$(cx_{1,1}, cy_{1,1}, cx_{1,2}, cy_{1,2}, \ldots, cx_{1,10}, cy_{1,10}, cx_{2,1}, cy_{2,1}, \ldots, cx_{2,10}, cy_{2,10}).$$

The number of clusters is initialized as $K_i = (\mathrm{rand}() \bmod (K_{max} - 1)) + 2$, where $K_i$ is the total number of whole clusters encoded in string i of the archive, $K_{max}$ is the upper limit of the number of clusters and rand() is a function which returns an integer. So the number of initial clusters ranges between 2 and $K_{max}$.

The initialization step consists of three sub-steps. One third of the solutions of the archive are initialized randomly, using a minimum center based distance criterion. The single linkage clustering technique [23] is used to initialize another one third of the solutions for different values of the number of clusters (K). These solutions work well when clusters in data sets are well separated. To initialize the last one third of the solutions, the K-means algorithm is applied with different values of the number of clusters (K). These solutions work well for data sets having hyperspherical shaped clusters. After forming the initial partitions, C sub-cluster centers are chosen for each cluster. These sub-cluster centers are then encoded in the string to represent a particular partitioning. So the total number of centers encoded in that string is $C \times K$.
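The state layout described above can be illustrated with a small Python sketch that draws the number of clusters and splits a flat state into K clusters of C sub-cluster centers each. The helper names are hypothetical, and the modulo form of the cluster-count rule is an assumption inferred from the stated range of 2 to Kmax.

```python
import random
from typing import List, Tuple

Center = Tuple[float, ...]


def random_cluster_count(k_max: int, rng: random.Random) -> int:
    """K_i = (rand() mod (K_max - 1)) + 2, hence 2 <= K_i <= K_max."""
    return (rng.randrange(1 << 30) % (k_max - 1)) + 2


def decode_state(state: List[float], c: int, d: int) -> List[List[Center]]:
    """Split a flat state of length K*C*d into K lists of C d-dimensional centers."""
    block = c * d  # number of real values that encode one whole cluster
    if block <= 0 or len(state) % block != 0:
        raise ValueError("state length must be a multiple of C * d")
    clusters = []
    for start in range(0, len(state), block):
        chunk = state[start:start + block]
        clusters.append([tuple(chunk[j:j + d]) for j in range(0, block, d)])
    return clusters


# The example of the text: K = 2 clusters, C = 10 sub-clusters, d = 2,
# so the state holds 2 * 10 * 2 = 40 real numbers (dummy values here).
state = [float(i) for i in range(40)]
clusters = decode_state(state, c=10, d=2)
```

Here `clusters[i][j]` plays the role of the jth sub-cluster center of the ith cluster, with 0-based indices.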

4.3 Assignment of points

Here, each sub-cluster is considered as a separate cluster for the assignment process. Assume that each state has K clusters and each cluster is divided into C sub-clusters. Assignment is based on the minimum Euclidean distance criterion. A data point $y_j$ is assigned to the $(v, t)$th sub-cluster, where

$$(v, t) = \arg\min_{p = 1, \ldots, K;\; n = 1, \ldots, C} d_e(z_p^n, y_j).$$

Here $z_p^n$ is the nth sub-cluster center of the pth cluster and $d_e$ is the Euclidean distance, so v is the index of the selected cluster and t the index of the selected sub-cluster within it. Thereafter, the partition matrix can be formulated as follows: $X[(v-1) \times C + t][j] = 1$ and $X[c][j] = 0$ for all other $c = 1, \ldots, K \times C$.
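The assignment step can be sketched as follows. The function name and data layout (nested lists of center tuples) are assumptions for illustration only, and indices are 0-based, so the row $(v-1) \times C + t$ of the text becomes `v * C + t`.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def assign_points(points: Sequence[Point],
                  centers: Sequence[Sequence[Point]]):
    """Assign each point to its nearest sub-cluster center (Euclidean distance).

    centers[p][n] is the nth sub-cluster center of the pth cluster.
    Returns the (K*C) x N binary partition matrix and the (v, t) pairs.
    """
    k, c = len(centers), len(centers[0])
    partition = [[0] * len(points) for _ in range(k * c)]
    labels: List[Tuple[int, int]] = []
    for j, y in enumerate(points):
        v, t = min(((p, n) for p in range(k) for n in range(c)),
                   key=lambda pn: math.dist(centers[pn[0]][pn[1]], y))
        partition[v * c + t][j] = 1  # 0-based version of row (v-1)*C + t
        labels.append((v, t))
    return partition, labels


# Hypothetical toy data: K = 2 clusters with C = 2 sub-cluster centers each.
centers = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 10.0), (11.0, 10.0)]]
points = [(0.9, 0.1), (10.2, 9.9)]
partition, labels = assign_points(points, centers)
```

Each column of the partition matrix contains exactly one 1, since every point belongs to exactly one sub-cluster.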

4.4 Objective function

Five objective functions are considered for the purpose of

optimization. First four objective functions are some

internal cluster validity indices which rely on some

intrinsic properties of the data sets. Last one measures the

violation of available supervised information. This is also

called an external cluster validity index. These five

objective functions are Sym-index [5], I-index [40], XB-

index [66], FCM-index [10], and adjusted rand-index [52].

The details of the objective functions are now provided in

the supplementary material.



4.4.1 Other steps

Here, the other steps of Semi-GenClustMOO clustering

technique are similar to those of GenClustMOO [49]

algorithm. To evaluate the five objective functions, the

sub-cluster centers are joined to generate the whole

partitioning. Thereafter, the above mentioned five

objective functions are evaluated on these obtained

partitionings. To optimize these five objective functions

simultaneously, AMOSA is used as the underlying

optimization technique. Three types of mutation opera-

tions have been used. In the first type of mutation

operation, total number of clusters present in a state is

reduced by 1. In the second type, total number of clusters

present in a state is increased by 1, and finally in the third

type of mutation operation, cluster centers present in a

particular state are modified by some values. If any

string is selected for mutation, then any one of the above

mentioned mutation operations is applied on it with

uniform probability.
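The three mutation types can be sketched as below. The text does not specify how a new cluster is created or how centers are modified, so this sketch makes several loudly stated assumptions: a new cluster takes C randomly chosen data points as its sub-cluster centers, the third operation adds Gaussian noise of standard deviation `sigma` to the centers of one cluster, and when a deletion or insertion would leave the range [2, Kmax] the perturbation is applied instead. All names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Center = Tuple[float, ...]
State = List[List[Center]]  # state[p][n]: nth sub-cluster center of cluster p


def mutate(state: State, data: Sequence[Center], k_max: int,
           rng: random.Random, sigma: float = 0.1) -> State:
    """Apply one of the three mutation types, chosen with uniform probability."""
    new_state = [list(cluster) for cluster in state]  # leave the input untouched
    c = len(new_state[0])
    op = rng.choice(("delete", "insert", "perturb"))
    if op == "delete" and len(new_state) > 2:
        # type 1: the number of clusters is reduced by 1
        del new_state[rng.randrange(len(new_state))]
    elif op == "insert" and len(new_state) < k_max:
        # type 2: the number of clusters is increased by 1 (assumed seeding)
        new_state.append([tuple(rng.choice(data)) for _ in range(c)])
    else:
        # type 3: centers of one cluster are modified (assumed Gaussian noise)
        p = rng.randrange(len(new_state))
        new_state[p] = [tuple(x + rng.gauss(0.0, sigma) for x in center)
                        for center in new_state[p]]
    return new_state


# Hypothetical toy state with K = 2 clusters and C = 2 sub-cluster centers.
data = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 6.0)]
state = [[(0.0, 0.0), (1.0, 1.0)], [(5.0, 5.0), (6.0, 6.0)]]
mutated = mutate(state, data, k_max=4, rng=random.Random(1))
```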

Fig. 3 The AMOSA algorithm [51] (source code is available at: http://www.isical.ac.in/~sriparna_r)

4.4.2 Selection of the best solution

In MOO, the final Pareto optimal front consists of a set of

non-dominated solutions [50]. Each solution is responsible

for providing a partitioning information for a given data

set. Each non-dominated solution is equally important and

none of these dominates each other. However the user may

sometimes want only a single solution. In the current paper

we have adopted the following method to select a single

solution from the final Pareto optimal front. For the

supervised information, using the method mentioned in

Sect. 3, we have generated class labels of 10 % data points.

We have computed the ARI index [52] value for these

10 % data points for each solution on the final Pareto front.

Finally we have selected the solution with the highest value

of ARI index. The use of 10 % labeled data helps to

evaluate the goodness of each solution with respect to

ground truth information.
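The selection rule above can be sketched in Python as follows. The adjusted Rand index is implemented here from its standard pair-counting definition; the function names, the representation of a partitioning as a list of cluster labels and the toy data are assumptions for illustration and do not come from the authors' code.

```python
from collections import Counter
from math import comb
from typing import List, Sequence


def adjusted_rand_index(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Adjusted Rand index of two labelings of the same (at least two) points."""
    total_pairs = comb(len(labels_a), 2)
    sum_ij = sum(comb(v, 2) for v in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    sum_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def select_best_solution(front: List[List[int]], labeled_idx: Sequence[int],
                         labeled_classes: Sequence[int]) -> List[int]:
    """Pick the Pareto-front partitioning with the highest ARI on the labeled points."""
    return max(front, key=lambda part: adjusted_rand_index(
        [part[i] for i in labeled_idx], labeled_classes))


# Hypothetical front of three partitionings of 8 points; points 0, 2, 4 and 6
# are assumed to carry class labels.
front = [[0] * 8,
         [5, 9, 5, 9, 7, 9, 7, 9],
         [0, 0, 1, 1, 0, 0, 1, 1]]
best = select_best_solution(front, [0, 2, 4, 6], [0, 0, 1, 1])
```

In the toy example the second partitioning reproduces the labeled classes exactly (up to a renaming of the clusters) and is therefore selected.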

5 Experiments

This section discusses the experiments conducted to assess the utility of the proposed Semi-GenClustMOO clustering method. The obtained results are analyzed both biologically and statistically. The obtained partitioning results are also compared with the results of existing clustering techniques for gene expression data.

5.1 Data sets used for experiment

Pre-processed datasets have been downloaded from the

site.1 The short descriptions of the data sets are given in

Table 1.

Yeast sporulation In this data set [17], expression values of 6118 genes are measured at seven time points during the sporulation process of budding yeast. The data are log-transformed. Genes whose expression levels did not change sufficiently during the sporulation process are ignored and are not considered further. For this selection, the root mean square of the log-transformed expression ratios is computed for each gene, and a root mean square value of 1.6 is used as the threshold. Finally, a total of 474 genes out of 6118 are selected.

Yeast cell cycle Here, in this data set [16], approxi-

mately 6000 gene expression levels are considered over

two cell cycles and 17 time points. The genes whose

expression levels did not change sufficiently are ignored

and, finally 384 genes are selected among 6000 genes.

Arabidopsis Thaliana Here, in this data set [46] only 138

gene expression levels are considered. The expression level

measurements have been conducted over 8 time points.

Human fibroblasts serum Here, in this data set [32]

expression levels of 8613 human genes are considered. The

expression level measurements have been conducted over

12 time points. It consists of 13 dimensions including 12

time points and an unsynchronized sample. The genes

whose expression levels did not change sufficiently are

ignored and, finally 517 genes are selected among 8613

genes. This data is again log2-transformed.

Rat CNS Here, in this data set [64] expression levels of

112 genes have been considered. Reverse transcription

coupled PCR method has been used for expression level

measurement. This measurement has been conducted over

9 time points during rat central nervous system

development.

5.2 Performance metrics

To validate the performance of the clustering algorithms, mainly two validity indices, the Silhouette index (S(C)) [48] and ARI, are considered. The value of the S(C) index lies between -1 and +1, and good partitioning results correspond to high positive values of S(C). In addition, for the purpose of visualization, two methods have been used: the cluster profile plot [22, 42] and the Eisen plot [42].

5.2.1 Eisen plot

In an Eisen plot [22, 42], the gene expression value at a particular time point is represented in a natural way, as shown in Fig. 5. Each cell of the data matrix is given a color similar to the color of the corresponding spot on the microarray. Higher expression levels are denoted by shades of red, low expression levels by shades of green, and the absence of differential expression by colors towards black. In this paper, before drawing the Eisen plot, genes have been ordered in such a way that genes belonging to a particular cluster are placed one after another. White colored blank rows are used to identify the cluster boundaries.

5.2.2 Cluster profile plot

With the help of a cluster profile plot [42], it is possible to visualize the normalized gene expression values of the obtained gene clusters with respect to different conditions like time points. Before plotting, the average gene expression value of each gene cluster is calculated at the different time points. Thereafter, the standard deviation of each gene cluster is calculated. Finally, the gene expression values of a cluster are plotted along with their average expression values and standard deviation, which can be shown as a black line. The cluster profile plot for Yeast sporulation data is shown in Fig. 6.

1 http://anirbanmukhopadhyay.50webs.com/mogasvm.html

5.3 Discussion of results

Semi-GenClustMOO clustering technique is applied on

five real-life publicly available benchmark data sets. These

data sets are Arabidopsis Thaliana, Yeast cell cycle, Human

fibroblasts serum, Yeast sporulation and Rat CNS data. The

partitioning results obtained after application of this clus-

tering technique on these five data sets are shown in this

paper. In the proposed clustering technique, a recently

developed simulated annealing based multiobjective opti-

mization technique, AMOSA [8] is used as the underlying

optimization technique. After a thorough sensitivity study,

the parameters of the proposed algorithm are selected.

These are: SL = 400, HL = 300, iter = 50, Tmax = 100,

Tmin = 0.00001 and cooling rate, α = 0.8. A discussion

regarding the parameters values of AMOSA is presented in

[8]. Inspired by this discussion, we have selected the

parameter values in the current paper. Increasing the Tmax

value does not improve the results further. Similarly

increasing the value of iter does not change the results. In

order to generate the labeled data, first Fuzzy C-means

clustering technique is applied on all the gene expression

data sets. The membership plots after application of Fuzzy

C-means clustering technique for all the data sets are

shown in Fig. 4a–e, respectively.

We have computed the Silhouette index and ARI values

of the final partitionings identified by the proposed Semi-

GenClustMOO technique. Those values are reported in

Table 2. The proposed Semi-GenClustMOO clustering

technique is able to automatically detect the number of

clusters from a data set. The automatically identified

number of clusters for different gene expression data sets

are reported in Table 2.

Results obtained by Semi-GenClustMOO clustering

technique are also compared with MO-fuzzy [51], fuzzy

MOGA clustering [7], GenClustMOO [49], FCM [10],

single objective GA based clustering technique [41], Self

Organising Map (SOM) [59], Chinese Restaurant Cluster-

ing (CRC) [45], and Hierarchical average linkage cluster-

ing technique [62]. Each of the algorithms is applied on

each data set five times. We have executed the algorithms

with default parameter values as specified in the corre-

sponding papers. Obtained clustering results are verified

after conducting several statistical and biological signifi-

cance tests. Mean Silhouette index values of the partiti-

onings for various data sets obtained by different clustering

techniques are shown in Table 3. Results reveal that the proposed Semi-GenClustMOO clustering technique performs the best for almost all data sets. For four of the five data sets the Silhouette index values attained by the proposed Semi-GenClustMOO clustering technique are the highest among all the techniques compared; the exception is Yeast cell cycle, where average linkage attains a slightly higher value (0.4378 against 0.4353). In [7, 51], some multiobjective

clustering techniques are proposed to solve the gene

expression data clustering problem. In [51], AMOSA was

used as the underlying optimization technique and for the

assignment of genes to different clusters a newly developed

point symmetry based distance [5] is utilized. In [7], a

multiobjective genetic algorithm based technique, NSGA-

II [19], is used as the underlying optimization technique

and Euclidean distance based membership matrix computation is conducted. The GenClustMOO [49] clustering technique is similar to our proposed approach but does not use the extra 10 % labeled information: it optimizes only the four internal objective functions, Sym-index, I-index, XB-index and FCM-index. Note that the proposed semi-supervised clustering technique using the search capability of AMOSA outperforms all these recent multiobjective techniques for clustering the gene expression data in terms of Silhouette index. This suggests that the use of 10 % labeled data as the supervised information helps the proposed technique to determine good quality partitions.

In order to visually demonstrate the results of Semi-

GenClustMOO clustering, Fig. 5 shows the Eisen plots of

the partitionings obtained by the proposed clustering

technique for all the gene expression data sets. From the

color representation of the Eisen plot, we can see that

similar colors are grouped together, denoting that within a

particular cluster all genes are having similar gene

expression profiles because they produce similar color

patterns.

The cluster profile plots (Fig. 6) are also drawn based on

the obtained clustering results by the proposed algorithm.

These plots also demonstrate how the expression profiles

for the different groups of genes differ from each other,

while the profiles within a group are reasonably similar.

Here cluster profile plots obtained by the proposed algorithm are shown only for Yeast sporulation data in Fig. 6. For Arabidopsis Thaliana, Yeast cell cycle, Rat CNS and Human fibroblasts serum, cluster profile plots are shown in the supplementary file. The proposed technique performs better compared to the other clustering methods mainly because of the following reasons. First, this is a multiobjective semi-supervised clustering method. Simultaneous optimization of multiple cluster validity measures helps to handle clusters with different characteristics. This in turn produces high quality solutions representing different possible partitionings. Secondly, the strength of semi-supervised classification is utilized efficiently along with multiobjective optimization. Here we have devised a novel technique to generate some labeled data without using any human annotator. This information is then used in the proposed multiobjective based semi-supervised classification technique to fine tune the obtained partitionings.

Table 1 Data set description

Data set                   Number of genes   Total features
Yeast sporulation          474               7
Yeast cell cycle           384               17
Arabidopsis Thaliana       138               8
Human fibroblasts serum    517               13
Rat CNS                    112               9

Fig. 4 Fuzzy membership plots after application of FCM algorithm over data sets a Arabidopsis Thaliana, b Yeast cell cycle, c Rat CNS, d Human fibroblasts serum, e Yeast sporulation

Table 2 Number of clusters (K) automatically determined by the proposed Semi-GenClustMOO technique, and the ARI and Silhouette index values of the optimum partitionings identified by the proposed Semi-GenClustMOO clustering technique when applied on several gene expression data sets

Data set      K   ARI      Silhouette
Sporulation   6   0.8584   0.6786
Cell cycle    5   0.7962   0.4353
Arabidopsis   4   0.8316   0.4310
Serum         6   0.6611   0.4112
Rat CNS       6   0.6023   0.4772

Table 3 Mean values of Silhouette index, S(C), corresponding to the partitionings identified by different gene expression clustering techniques (K is the number of clusters)

Algorithm          Sporulation    Cell cycle     Thaliana       Serum          Rat CNS
                   K    S(C)      K    S(C)      K    S(C)      K    S(C)      K    S(C)
Semi-GenClustMOO   6    0.6786    5    0.4353    4    0.4310    6    0.4112    6    0.5027
MO-fuzzy           6    0.5877    5    0.4342    4    0.4194    6    0.4073    6    0.4977
MOGA               6    0.5754    5    0.4232    4    0.4023    6    0.3874    6    0.4832
GenClustMOO        6    0.6037    5    0.4253    4    0.4154    6    0.4078    6    0.4993
FCM                7    0.4696    6    0.3856    4    0.3665    8    0.3125    5    0.4132
SGA                6    0.5712    5    0.4232    4    0.3854    6    0.3445    4    0.4492
Average linkage    6    0.5023    4    0.4378    5    0.3162    4    0.3576    6    0.4142
SOM                6    0.5794    6    0.3862    5    0.2352    6    0.3352    5    0.4354
CRC                8    0.5623    5    0.4275    4    0.3965    10   0.3254    4    0.4576

In order to show the conflicting nature of the objective functions, the final Pareto optimal fronts obtained by the Semi-GenClustMOO clustering technique for different gene expression data sets are shown in Fig. 7a–e. Note that as we have optimized a total of five objective functions simultaneously, each solution of the final Pareto optimal front has five different objective function values. Since it is not possible to show the five dimensional Pareto front, we have projected it on three dimensions. The different solutions of the final Pareto front are shown with three objective function values, Sym-index, I-index and ARI-index. These figures show that for all the data sets multiple solutions are obtained on the final Pareto optimal front. This indicates that the objective functions used in the current algorithm are conflicting in nature.

Fig. 5 Eisen Plot for a Arabidopsis Thaliana, b Yeast cell cycle, c Rat CNS, d Human fibroblasts serum, e Yeast sporulation after application of Semi-GenClustMOO clustering technique

Fig. 6 Cluster profile plot for Yeast sporulation data obtained after application of Semi-GenClustMOO clustering method

The objective functions used in the current paper conflict with each other. The first four are internal cluster validity indices which capture different properties of data partitionings and differ in the way they capture cluster goodness: the first is based on the symmetry property of the clusters, the second on Euclidean distance, and the third and fourth on the concepts of fuzzy logic. The first objective function, Sym-index, tries to determine symmetrically shaped, well separated clusters. The second objective function, I-index, favors clusters that are compact in terms of Euclidean distance while maximizing the maximum separation between any two cluster centers. The third objective function, XB-index, helps to determine overlapping clusters where the minimum distance between any two cluster centers is maximized. The fourth objective function, FCM-index, is based on the concepts of fuzzy logic; it helps to detect overlapping clusters. The fifth objective is an external cluster validity index which checks how well the partitioning obtained by the proposed clustering technique matches the available labeled information. This helps to utilize the available supervised information for reshaping the obtained clusters. Thus the five objective functions capture different data properties and there is a trade-off among them. AMOSA is used to optimize the five objective functions, and the final Pareto optimal fronts identified by the proposed technique show the set of trade-off solutions.

Fig. 7 Pareto optimal fronts obtained by Semi-GenClustMOO clustering technique for a Yeast sporulation, b Rat CNS, c Yeast cell cycle, d Arabidopsis Thaliana and e Human fibroblasts serum data sets

In order to measure the quality of the obtained Pareto

optimal front by the proposed approach, Semi-GenClust-

MOO, we have computed the Purity [6, 31] and Minimal

Spacing [6] measurements. The measure named Purity

[6, 31] is used to compare the solutions obtained using

different MOO strategies. It calculates the fraction of

solutions from a particular method that remains nondomi-

nated when the final front solutions obtained from all the

algorithms are combined. A value near to 1(0) indicates

better (poorer) performance.

Purity Let $N \geq 2$ MOO strategies be applied on a given problem. Let $r_i = |R_i^1|$, $i = 1, 2, \ldots, N$, be the number of rank one solutions obtained from the ith MOO strategy. Now combine all these solutions as $R^* = \bigcup_{i=1}^{N} \{R_i^1\}$. Thereafter a ranking scheme is applied on the solutions in $R^*$ and a new set of rank one solutions, $R^{1*}$, is obtained. Let $r_i^*$ be the number of rank one solutions of the ith strategy that are present in $R^{1*}$, i.e., $r_i^* = |\{\gamma \mid \gamma \in R_i^1 \text{ and } \gamma \in R^{1*}\}|$. Thereafter, the Purity measure for the ith MOO strategy, $P_i$, is defined as:

$$P_i = \frac{r_i^*}{r_i}, \quad i = 1, 2, \ldots, N. \tag{7}$$
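The Purity measure can be sketched in Python as follows. For illustration only, the sketch assumes that all objectives are to be minimized and that each front is non-empty and already contains only rank one solutions of its own strategy; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Solution = Tuple[float, ...]


def dominates(a: Solution, b: Solution) -> bool:
    """True if a dominates b (all objectives minimized, by assumption)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def purity(fronts: Sequence[Sequence[Solution]]) -> List[float]:
    """Eq. 7: for each strategy, the fraction of its rank one solutions that
    remain non-dominated in the union of all fronts."""
    combined = [s for front in fronts for s in front]
    values = []
    for front in fronts:
        survivors = sum(
            1 for s in front if not any(dominates(o, s) for o in combined))
        values.append(survivors / len(front))
    return values


# Hypothetical two-objective fronts produced by two MOO strategies.
front_a = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
front_b = [(3.0, 3.0), (1.0, 5.0), (5.0, 0.5)]
p_a, p_b = purity([front_a, front_b])
```

In this toy example all solutions of the first strategy survive in the combined front, whereas two of the three solutions of the second strategy are dominated by solutions of the first.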

The second measure named Minimal Spacing reflects the

uniformity of the solutions over the non-dominated front.

Minimal Spacing Schott [54] suggested a relative distance measure between consecutive solutions on the obtained nondominated front. It is defined as

$$S = \sqrt{\frac{1}{|Q|} \sum_{i=1}^{|Q|} \left(d_i - \bar{d}\right)^2} \tag{8}$$

where $d_i = \min_{k \in Q,\, k \neq i} \sum_{m=1}^{M} |f_m^i - f_m^k|$, $f_m^i$ (or $f_m^k$) is the mth objective value of the ith (or kth) solution on the final nondominated solution set Q, and $\bar{d} = \sum_{i=1}^{|Q|} d_i / |Q|$ is the mean value of all the $d_i$. M is the number of objective functions. Thus when the solutions are uniformly spaced, the corresponding distance measure becomes low. So, for solutions that are evenly spread over the nondominated front, the value of S should be low.
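Eq. 8 translates directly into a short Python function. The sketch below follows the formula exactly as printed in this section; the function name is hypothetical and the front is assumed to contain at least two solutions.

```python
import math
from typing import Sequence, Tuple

Solution = Tuple[float, ...]


def spacing(front: Sequence[Solution]) -> float:
    """Eq. 8: standard deviation of the nearest-neighbour distances d_i,
    where distances are sums of absolute objective differences."""
    q = len(front)
    d = [min(sum(abs(a - b) for a, b in zip(fi, fk))
             for k, fk in enumerate(front) if k != i)
         for i, fi in enumerate(front)]
    d_bar = sum(d) / q
    return math.sqrt(sum((x - d_bar) ** 2 for x in d) / q)


# Hypothetical two-objective fronts: the first is uniformly spaced, the
# second has a gap between its last two solutions.
uniform = spacing([(0.0, 3.0), (1.0, 2.0), (2.0, 1.0), (3.0, 0.0)])
uneven = spacing([(0.0, 3.0), (1.0, 2.0), (3.0, 0.0)])
```

For the uniformly spaced toy front all nearest-neighbour distances are equal, so S is zero, while the uneven front gives a strictly positive value.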

Smaller values of Minimal Spacing indicate better per-

formance. Table 4 shows the values of Purity and Minimal

Spacing measures of the obtained Pareto optimal fronts by

the proposed technique when combined with the solutions

obtained by other gene expression data clustering tech-

niques. Here after application of other multiobjective

optimization based clustering techniques like MO-fuzzy

and MOGA, we obtain a set of partitioning solutions on the

final Pareto optimal front. The five cluster validity indices

used as the objective functions of the proposed clustering

technique, Sym-index, I-index, XB-index, FCM-index, and

adjusted rand-index are calculated for those partitioning

solutions. Adjusted rand-index is calculated over the same

10 % data as used in the current paper. Similarly after

execution of a single objective based clustering technique,

only a single partitioning solution is obtained. The five

objective functional values are calculated for this solution.

All the solutions obtained by all the methods used in the

current paper are then combined and compared with

respect to the five objective functions. The fraction of

solutions which is still non-dominated with respect to all

the solutions is used as the purity value of that particular

method. These results show that the solutions generated by

the proposed Semi-GenClustMOO clustering technique are

of good quality. Minimum values of MinimalSpacing

indicate that solutions are well-spread over the final Pareto

optimal front. Similarly high values of Purity indicate that

the obtained solutions by the Semi-GenClustMOO clus-

tering technique are pure or globally non-dominating.

5.4 Biological significance

To establish the biological relevance of a cluster, the Gene Ontology (GO) annotation database (footnote 2) is searched for statistically significant GO terms. Three structured vocabularies (ontologies) are used to measure the functional enrichment of a group of genes. These describe the associated biological processes, molecular functions and cellular components. A cumulative hypergeometric distribution is used to determine the degree of functional enrichment (p-value). It gives the probability of observing at least a given number of genes of a particular cluster that are involved in a given GO term. So, for a particular GO category and a cluster of size K, the probability p of having at least n genes of the cluster in that category can be formulated according to the following equation [42, 61]:

$$p = 1 - \sum_{j=0}^{n-1} \frac{\binom{t}{j}\binom{l-t}{K-j}}{\binom{l}{K}} \tag{9}$$

where t represents the total number of genes belonging to a particular category, while l is the number of genes belonging to the genome. After evaluation of the p-value for each GO category, we can verify the statistical significance of the genes in a cluster. With this test, we can quantify how well the genes belonging to a particular cluster match the different GO categories. The p-value of the obtained cluster for a particular GO category will be close to 0 if most of the genes belonging to the cluster possess the same biological function.

2 http://db.yeastgenome.org/cgi-bin/GO/goTermFinder

Table 4 Purity and Minimal Spacing measurements of the solutions obtained by the proposed Semi-GenClustMOO clustering technique

Data set      Minimal Spacing   Purity
Sporulation   0.0548            0.9873
Cell cycle    0.0675            0.9583
Thaliana      0.0609            0.9637
Serum         0.1334            0.8723
Rat CNS       0.0769            0.9285

In this paper, for yeast sporulation data set, biological

significance test has been performed at 1 % significance

level. Here we have tested the clustering results obtained

by all the algorithms used in this paper biologically. For the

proposed Semi-GenClustMOO clustering technique, all the

6 clusters obtained are biologically significant whereas for

MO-fuzzy, MOGA, FCM, SGA, Average linkage, SOM

and CRC the number of biologically significant clusters are

6, 6, 4, 6, 4, 4 and 6, respectively. In Table 5, top three

most relevant GO terms for genes of individual clusters

along with their respective p-values have been shown.

These results are obtained after applying the Semi-GenClustMOO clustering technique on the Sporulation data set. Here, to evaluate the GO terms, we consider all p-values ≤ 0.01. The biological significance test reveals that the proposed Semi-GenClustMOO clustering technique generates biologically relevant and functionally enriched clusters in case of gene expression data clustering.
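The enrichment p-value of Eq. 9 is the upper tail of a hypergeometric distribution and can be computed exactly with integer binomial coefficients. The sketch below uses the standard form of that tail, in which the denominator counts the ways of choosing the K genes of the cluster from the l genes of the genome; the function name and the toy numbers are hypothetical and are not taken from the experiments.

```python
from math import comb


def enrichment_p_value(n: int, cluster_size: int, t: int, l: int) -> float:
    """Probability that at least n of the cluster's genes fall in a GO category.

    n: genes of the cluster annotated with the category (n <= cluster_size)
    cluster_size: number of genes in the cluster (K in Eq. 9)
    t: genes of the genome annotated with the category
    l: total number of genes in the genome
    """
    lower_tail = sum(comb(t, j) * comb(l - t, cluster_size - j)
                     for j in range(n))  # j = 0, ..., n - 1
    return 1.0 - lower_tail / comb(l, cluster_size)


# Toy genome of l = 10 genes, t = 4 of them in the category, cluster of size 3
# in which all n = 3 genes belong to the category.
p_toy = enrichment_p_value(n=3, cluster_size=3, t=4, l=10)
```

In the toy example the cluster consists entirely of genes from the category, which can happen in only 4 of the 120 possible clusters of size 3, so the p-value is 4/120.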

5.5 Statistical significance

To show that the results obtained by the proposed Semi-GenClustMOO algorithm are statistically significant, Wilcoxon's rank sum test [65] is used. In Table 6, the comparative p-values of the Semi-GenClustMOO algorithm with respect to the other algorithms are reported. The null hypothesis states that there is no significant difference between the median values of the two groups, whereas the alternative hypothesis states that there is a significant difference between them. Here we have used the Silhouette index value to measure the performance of the individual clustering techniques. In Table 6, all the p-values are less than 0.05 (5 % significance level). This is strong evidence against the null hypothesis, indicating that the better median values of the performance metric produced by Semi-GenClustMOO are statistically significant and have not occurred by chance.
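For small samples such as five runs per algorithm, the rank sum test can be carried out exactly by enumerating all ways of splitting the pooled ranks. The following sketch is one possible two-sided exact variant, written for illustration; it is not necessarily the implementation used for Table 6, and the Silhouette values in the example are made up.

```python
from itertools import combinations
from typing import Sequence


def rank_sum_exact_p(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sided exact p-value of the Wilcoxon rank sum test (small samples)."""
    pooled = sorted(list(sample_a) + list(sample_b))
    rank_of = {}
    i = 0
    while i < len(pooled):  # tied values share the average of their ranks
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1, ..., j
        i = j
    ranks = [rank_of[x] for x in pooled]
    n_a = len(sample_a)
    mean = n_a * (len(pooled) + 1) / 2  # expected rank sum under the null
    observed = abs(sum(rank_of[x] for x in sample_a) - mean)
    splits = list(combinations(ranks, n_a))
    extreme = sum(1 for s in splits if abs(sum(s) - mean) >= observed - 1e-9)
    return extreme / len(splits)


# Made-up Silhouette values of five runs of two hypothetical algorithms.
runs_a = [0.61, 0.62, 0.63, 0.64, 0.65]
runs_b = [0.51, 0.52, 0.53, 0.54, 0.55]
p_value = rank_sum_exact_p(runs_a, runs_b)
```

With five runs per algorithm there are 252 possible splits of the ranks, so the smallest two-sided p-value this exact test can produce is 2/252, which is obtained when every run of one algorithm is better than every run of the other, as in the toy example.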

6 Conclusions

In this paper the problem of gene expression data clustering is solved using a multiobjective optimization based semi-supervised clustering technique, namely Semi-GenClustMOO. To the best of our knowledge this is the first attempt at solving the gene expression data clustering problem using semi-supervised classification techniques. A recently developed multiobjective optimization technique based on the concepts of simulated annealing, AMOSA, is utilized as the underlying optimization technique. In order to represent a particular cluster in the form of a solution, multiple centers are used. For the purpose of assignment, sub-clusters are considered individually. In order to determine the optimal partitioning automatically, five objective functions are optimized simultaneously using the search capability of AMOSA. The first four objective functions are internal cluster validity indices: Sym-index, I-index, XB index and FCM index. The fifth objective function is an external cluster validity index, namely the adjusted Rand index. Here, we assume that for each data set, class label information for 10 % of the data points is available as the supervised information. For gene expression data, it is difficult to generate the labeled information. In order to generate the labeled data without the help of any human annotator, we first execute another popular clustering technique, Fuzzy C-means, on the given data set. The data points with the highest membership values with respect to a particular cluster are the core points of that cluster. These core points are used as the available labeled information for the proposed semi-supervised classification technique. The proposed technique is applied to solve clustering problems on five benchmark gene expression data sets. The quality of the obtained partitioning results is measured using four internal cluster validity indices and one external cluster validity index. Partitions are also verified biologically and statistically. The effectiveness of the Semi-GenClustMOO clustering technique has been compared with that of the MO-fuzzy, MOGA, FCM, SOM and CRC clustering techniques. Results show that the proposed semi-supervised classification technique is more effective than the existing techniques for gene expression data clustering.
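The way labeled data are derived from Fuzzy C-means memberships can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `select_core_points`, the toy membership matrix, and the choice to split the labeled budget across clusters in proportion to cluster size are hypothetical and are not specified in the text above.

```python
def select_core_points(membership, fraction=0.10):
    """Pick the most confidently assigned points of each cluster.

    membership[i][k] is the fuzzy membership of point i in cluster k,
    for example the membership matrix produced by Fuzzy C-means.
    Returns {point_index: cluster_label} for roughly `fraction` of the points.
    """
    n_clusters = len(membership[0])

    # Hard-assign each point to the cluster in which its membership is highest.
    by_cluster = {k: [] for k in range(n_clusters)}
    for i, row in enumerate(membership):
        k = max(range(n_clusters), key=lambda c: row[c])
        by_cluster[k].append((row[k], i))

    labeled = {}
    for k, members in by_cluster.items():
        if not members:
            continue
        # Assumption: each cluster contributes in proportion to its size,
        # with at least one core point per non-empty cluster.
        quota = max(1, round(fraction * len(members)))
        for _, i in sorted(members, reverse=True)[:quota]:
            labeled[i] = k
    return labeled


# Toy membership matrix for 10 points and 2 clusters (hypothetical values).
memberships = [
    [0.90, 0.10], [0.80, 0.20], [0.95, 0.05], [0.60, 0.40], [0.70, 0.30],
    [0.20, 0.80], [0.10, 0.90], [0.45, 0.55], [0.30, 0.70], [0.05, 0.95],
]
print(select_core_points(memberships, fraction=0.2))
```

The returned labels would then play the role of the supervised information against which the external validity index is computed.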

In future we would like to propose other semi-supervised clustering techniques based on other optimization techniques such as genetic algorithms or differential evolution, and then evaluate these techniques for gene expression data clustering. Another important direction for future work is the application of the proposed semi-supervised clustering technique to other domains such as satellite image segmentation.

References

1. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X et al (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403(6769):503–511

2. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceed Natl Acad Sci 96(12):6745–6750

3. Altun Y, Belkin M, Mcallester DA (2005) Maximum margin semi-supervised learning for structured variables. In: Advances in neural information processing systems, pp 33–40

4. Bandyopadhyay S (2007) Analysis of biological data: a soft computing approach. World Scientific

5. Bandyopadhyay S, Saha S (2008) A point symmetry-based clustering technique for automatic evolution of clusters. Knowl Data Eng IEEE Trans 20(11):1441–1457

6. Bandyopadhyay S, Pal SK, Aruna B (2004) Multiobjective GAs, quantitative indices, and pattern classification. Syst Man Cybern Part B Cybern IEEE Trans 34(5):2088–2099

7. Bandyopadhyay S, Mukhopadhyay A, Maulik U (2007) An improved algorithm for clustering gene expression data. Bioinformatics 23(21):2859–2865

8. Bandyopadhyay S, Saha S, Maulik U, Deb K (2008) A simulated annealing-based multiobjective optimization algorithm: AMOSA. Evol Comput IEEE Trans 12(3):269–283

9. Basu S, Banerjee A, Mooney RJ (2004) Active semi-supervision for pairwise constrained clustering

10. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers

11. Bilenko M, Basu S, Mooney RJ (2004) Integrating constraints and metric learning in semi-supervised clustering. In: Proceedings of the twenty-first international conference on Machine learning, ACM, pp 81–88

12. Brazma A, Vilo J (2000) Gene expression data analysis. FEBS lett 480(1):17–24

13. Brunet JP, Tamayo P, Golub TR, Mesirov JP (2004) Metagenes and molecular pattern discovery using matrix factorization. Proceed Natl Acad Sci 101(12):4164–4169

14. Chapelle O, Zien A (2004) Semi-supervised classification by low density separation. In: AISTATS

15. Chapelle O, Scholkopf B, Zien A et al (2006) Semi-supervised learning. MIT Press, Cambridge

Table 5 Three most significant GO terms of the six individual clusters of the Yeast sporulation data and their p-values, after application of the Semi-GenClustMOO clustering technique

Cluster    Significant GO term                             GO ID       p-value
Cluster1   Carboxylic acid metabolic process               GO:0019752  2.25E-05
           Oxoacid metabolic process                       GO:0043436  3.09E-05
           Organic acid metabolic process                  GO:0006082  3.17E-05
Cluster2   Single-organism cellular process                GO:0044763  0.00019
           Single-organism process                         GO:0044699  0.00086
           Cell cycle process                              GO:0022402  7.62E-19
Cluster3   Gene expression                                 GO:0010467  2.95E-06
           Cellular component biogenesis                   GO:0044085  4.47E-08
           Ribonucleoprotein complex biogenesis            GO:0022613  6.84E-16
Cluster4   Carbohydrate metabolic process                  GO:0005975  4.56E-09
           Single-organism carbohydrate metabolic process  GO:0044724  7.06E-07
           Generation of precursor metabolites and energy  GO:0006091  3.39E-07
Cluster5   Single-organism cellular process                GO:0044763  0.00077
           Single-organism process                         GO:0044699  0.00638
           Cell cycle process                              GO:0022402  3.15E-24
Cluster6   Metabolic process                               GO:0008152  1.16E-06
           Cellular metabolic process                      GO:0044237  3.04E-06
           Organic substance metabolic process             GO:0071704  1.57E-06

Table 6 p-values produced by Wilcoxon's rank sum test comparing Semi-GenClustMOO with other algorithms

Data set     MO-fuzzy  MOGA      FCM       SGA       SOM       CRC
Sporulation  3.21E-03  3.57E-03  7.12E-05  3.77E-04  3.82E-04  3.22E-03
Cell cycle   1.29E-03  2.42E-05  5.60E-05  3.20E-04  2.44E-04  3.11E-04
Arabidopsis  1.11E-03  2.29E-03  6.42E-05  4.30E-03  2.12E-03  2.01E-03
Serum        2.10E-03  3.41E-03  6.21E-04  3.4E-03   2.62E-04  2.44E-03
Rat CNS      1.72E-03  2.71E-04  5.67E-05  3.43E-04  2.76E-03  2.71E-03



16. Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ et al (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Mol cell 2(1):65–73

17. Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, Herskowitz I (1998) The transcriptional program of sporulation in budding yeast. Science 282(5389):699–705

18. De Smet F, Mathys J, Marchal K, Thijs G, De Moor B, Moreau Y (2002) Adaptive quality-based clustering of gene expression profiles. Bioinformatics 18(5):735–746

19. Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197

20. Dembele D (2008) Multi-objective optimization for clustering 3-way gene expression data. Adv Data Anal Cl 2(3):211–225

21. D'haeseleer P, Wen X, Fuhrman S, Somogyi R (1998) Mining the gene expression matrix: Inferring gene relationships from large scale gene expression data. In: Information processing in cells and tissues, Springer, pp 203–212

22. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proceed Natl Acad Sci 95(25):14863–14868

23. Everitt B (1974/1993) Cluster analysis. Halsted Press

24. Faceli K, de Souto MC, de Araujo DS, de Carvalho AC (2009) Multi-objective clustering ensemble for gene expression data analysis. Neurocomputing 72(13):2763–2774

25. Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41(8):578–588

26. Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Patt Anal Mach Intell IEEE Trans 6:721–741

27. Ghosh D, Chinnaiyan AM (2002) Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18(2):275–286

28. Herwig R, Poustka AJ, Muller C, Bull C, Lehrach H, O'Brien J (1999) Large-scale clustering of cDNA-fingerprinting data. Genome Res 9(11):1093–1105

29. Heyer LJ, Kruglyak S, Yooseph S (1999) Exploring expression data: identification and analysis of coexpressed genes. Genome Res 9(11):1106–1115

30. Hu Q, Pan W, An S, Ma P, Wei J (2010) An efficient gene selection technique for cancer recognition based on neighborhood mutual information. Int J Mach Learn Cybern 1(1–4):63–74

31. Ishibuchi H, Murata T (1998) A multi-objective genetic local search algorithm and its application to flowshop scheduling. Syst Man Cybern Part C Appl Rev IEEE Trans 28(3):392–403

32. Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JC, Trent JM, Staudt LM, Hudson J, Boguski MS et al (1999) The transcriptional program in the response of human fibroblasts to serum. Science 283(5398):83–87

33. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

34. Jiang D, Pei J, Zhang A (2003) DHC: a density-based hierarchical clustering method for time series gene expression data. In: Proceedings of Bioinformatics and Bioengineering. Third IEEE Symposium, pp 393–400

35. Jiang D, Tang C, Zhang A (2004) Cluster analysis for gene expression data: a survey. Knowl Data Eng IEEE Trans 16(11):1370–1386

36. Kirkpatrick S, Gelatt CD, Vecchi MP et al (1983) Optimization by simulated annealing. Science 220(4598):671–680

37. Liu L, Hawkins DM, Ghosh S, Young SS (2003) Robust singular value decomposition analysis of microarray data. Proceed Natl Acad Sci 100(23):13167–13172

38. Lockhart D, Dong H, Byrne M, Follettie M, Gallo M, Chee M, Mittmann M, Wang C, Kobayashi M, Horton H et al (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotech 14(13):1675–1680

39. Lockhart DJ, Winzeler EA (2000) Genomics, gene expression and DNA arrays. Nature 405(6788):827–836

40. Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. Patt Anal Mach Intell IEEE Trans 24(12):1650–1654

41. Maulik U, Bandyopadhyay S (2003) Fuzzy partitioning using a real-coded variable-length genetic algorithm for pixel classification. Geosci Remote Sens IEEE Trans 41(5):1075–1081

42. Maulik U, Mukhopadhyay A, Bandyopadhyay S (2009) Combining Pareto-optimal clusters using supervised learning for identifying co-expressed genes. BMC Bioinform 10(1):27

43. Mukhopadhyay A, Bandyopadhyay S, Maulik U (2010) Multi-class clustering of cancer subtypes through SVM based ensemble of Pareto-optimal solutions for gene marker identification. PloS one 5(11):e13803

44. Mukhopadhyay A, Maulik U, Bandyopadhyay S (2013) An interactive approach to multiobjective clustering of gene expression patterns. Biomed Eng IEEE Trans 60(1):35–41

45. Qin ZS (2006) Clustering microarray gene expression data using weighted Chinese restaurant process. Bioinformatics 22(16):1988–1997

46. Reymond P, Weber H, Damond M, Farmer EE (2000) Differential gene expression in response to mechanical wounding and insect feeding in Arabidopsis. Plant Cell Online 12(5):707–719

47. Rose K (1998) Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE 86(11):2210–2239

48. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

49. Saha S, Bandyopadhyay S (2013) A generalized automatic clustering algorithm in a multiobjective framework. Appl Soft Comput 13(1):89–108

50. Saha S, Ekbal A, Alok AK (2012) Semi-supervised clustering using multiobjective optimization. In: Hybrid Intelligent Systems (HIS), 12th International Conference, IEEE, pp 360–365

51. Saha S, Ekbal A, Gupta K, Bandyopadhyay S (2013) Gene expression data clustering using a multiobjective symmetry based clustering technique. Comput Biol Med 43(11):1965–1977

52. Santos JM, Embrechts M (2009) On the use of the adjusted rand index as a metric for evaluating supervised classification. In: Artificial Neural Networks-ICANN, Springer, pp 175–184

53. Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270(5235):467–470

54. Schott JR (1995) Fault tolerant design using single and multicriteria genetic algorithm optimization. Tech Rep DTIC Doc

55. Sharan R, Shamir R (2000) CLICK: a clustering algorithm with applications to gene expression analysis. Proceed Int Conf Intell Syst Mol Biol 8:16

56. Sharma A, Imoto S, Miyano S, Sharma V (2012) Null space based feature selection method for gene expression data. Int J Mach Learn Cybern 3(4):269–276

57. Sherlock G (2000) Analysis of large-scale gene expression data. Curr Opin Immunol 12(2):201–205

58. de Souto MC, Costa IG, de Araujo DS, Ludermir TB, Schliep A (2008) Clustering cancer gene expression data: a comparative study. BMC Bioinform 9(1):497

59. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proceed Natl Acad Sci 96(6):2907–2912



60. Tang VT, Yan H (2012) Noise reduction in microarray gene expression data based on spectral analysis. Int J Mach Learn Cybern 3(1):51–57

61. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM (1999) Systematic determination of genetic network architecture. Nature Genet 22(3):281–285

62. Tou JT, Gonzalez RC (1974) Pattern recognition principles. Addison-Wesley, Reading

63. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525

64. Wen X, Fuhrman S, Michaels GS, Carr DB, Smith S, Barker JL, Somogyi R (1998) Large-scale temporal gene expression mapping of central nervous system development. Proceed Natl Acad Sci 95(1):334–339

65. Wilcoxon F, Katti S, Wilcox RA (1963) Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. American Cyanamid Comp

66. Xie XL, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Patt Anal Mach Intell 13(8):841–847

67. Xu X (2013) Enhancing gene expression clustering analysis using tangent transformation. Int J Mach Learn Cybern 4(1):31–40

68. Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10):977–987

69. Zitzler E, Laumanns M, Thiele L (2001) SPEA2: Improving the strength Pareto evolutionary algorithm

