A Novel Approach for Biclustering GeneExpression Data Using Modular Singular ValueDecomposition

8/12/2019 A Novel Approach for Biclustering GeneExpression Data Using Modular Singular ValueDecomposition

1/12

A Novel Approach for Biclustering Gene

Expression Data Using Modular Singular Value

Decomposition

V.N. Manjunath Aradhya1,2, Francesco Masulli1,3, and Stefano Rovetta1

1 Dept. of Computer and Information Sciences University of Genova,Via Dodecaneso 35, 16146 Genova, Italy{aradhya,masulli,ste}@disi.unige.it

2 Dept. of ISE, Dayananda Sagar College of Engg, Bangalore, India - 5600783 Sbarro Institute for Cancer Research and Molecular Medicine,

Center for Biotechnology, Temple University, BioLife Science Bldg.,

1900 N 12th Street Philadelphia, PA 19122 USA

Abstract. Clustering is a method of unsupervised learning, and a com-mon technique for statistical data analysis used in many fields, includingmachine learning, data mining, pattern recognition, image analysis andbioinformatics. Recently, biclustering (or co-clustering), performing si-multaneous clustering on the row and column dimensions of the datamatrix, has been shown to be remarkably effective in a variety of ap-plications. In this paper we propose a novel approach to biclustering

gene expression data based on Modular Singular Value Decomposition(Mod-SVD). Instead of applying SVD directly on on data matrix, theproposed approach computes SVD on modular fashion. Experiments con-ducted on synthetic and real dataset demonstrated the effectiveness ofthe algorithm in gene expression data.

1 Introduction

Clustering is the unsupervised classification of patterns (observations, data items,

or feature vectors) into groups (clusters). The clustering problem has been ad-dressed in many contexts and by researchers in many disciplines; this reflects itsbroad appeal and usefulness as one of the steps in exploratory data analysis [18].Inspite of mature statistical literature on clustering, DNA microarray data arethe triggering aid for the development of multiple new methods. The clustersproduced by these methods reflect the global patterns of expression data, butan interesting cellular process for most cases may be only involved in a subsetof genes co-expressed only under a subset of conditions. Discovering such localexpression patterns may be the key to uncovering many genetic pathways that

are not apparent otherwise. Therefore, it is highly desirable to move beyondthe clustering paradigm, and to develop approaches capable of discovering localpatterns in microarray data [6,9].

In recent development, biclustering is one of the promising innovations in thefield of clustering. Hartigans [14] so-called direct clustering is the inspira-tion for the introduction of the term biclustering to gene expression analysis by

F. Masulli, L. Peterson, and R. Tagliaferri (Eds.): CIBB 2009, LNBI 6160, pp. 254 265,2010.c Springer-Verlag Berlin Heidelberg 2010


2/12

A Novel Approach for Biclustering Gene Expression Data 255

Cheng and Church [6]. Biclustering, also called co-clustering, is the problem ofsimultaneously clustering rows and columns of a data matrix. Unlike clusteringwhich seeks similar rows or columns, co-clustering seeks blocks (or co-clusters)of rows and columns that are inter-related. In recent years, biclustering has re-

ceived a lot of attention in several practical applications such as simultaneousclustering of documents and words in text mining [16], genes and experimen-tal conditions in bioinformatics[6,19], tokens and contexts in natural languageprocessing [27], among others.

Many approaches to biclustering gene expression data have been proposed todate (and the problem is known to be NP-complete). Existing biclustering meth-ods include-biclusters[6,32], gene shaving[15], flexible overlapped biclusteringalgorithm (FLOC) [33], order preserving sub-matrix (OPSM) [3], interrelatedtwo-way clustering (ITWC)[29], coupled two-way clustering (CTWC)[13], and

spectral biclustering [19], statistical algorithmic model (SAMBA) [28], iterativesignature algorithm[17], a fast divide-and-conquer algorithm (Bimax) [9], andmaximum similarity biclustering (MSBE)[22]. A detailed survey on biclusteringalgorithms for biological data analysis can be found in [23]; the paper presentsa comprehensive survey on the models, methods and applications in the field ofbiclustering algorithms. Another interesting paper on comparison and evaluationof biclustering methods for gene expression data is described in [9]. The papercompares five prominent biclustering methods with respect to their capabilityof identifying groups of (locally) co-expressed genes; hierarchical clustering and

a baseline biclustering algorithm.A different approaches to biclustering gene expression data based on Hough

transform, possibilistic approach, fuzzy, genetic algorithm can also be seen in lit-erature. A geometric biclustering algorithm based on the Hough Transform (HT)for analysis of large scale microarray data is presented in [35]. The HT is per-formed in a two-dimensional (2-D) space of column pairs and coherent columnsare then combined iteratively to form larger and larger biclusters. This reducesthe computational complexity considerably and makes it possible to analyzelarge-scale microarray data. An HT-based algorithm has also been developed to

analyze three-color microarray data, which involve three experimental conditions[36]. However, the original HT-based algorithm becomes ineffective, in terms ofboth computing time and storage space, as the number of conditions increases.Another approach based on geometric interpretation of the biclustering problemis described in [12]. An implementation of the geometric framework using theFast Hough transform for hyperplane detection can be used to discover biologi-cally significant subsets of genes under subsets of conditions for microarray dataanalysis.

Work on possibilistic approach for biclustering microarray data is presented

in[5]. This approach obtain potentially-overlapping biclusters, the PossibilisticSpectral Biclustering algorithm (PSB), based on Fuzzy technology and spectralclustering. The method is tested on S. cerevisiae cell cycle expression data andon a human cancer dataset. An approach to the biclustering problem usingthe Possibilistic Clustering paradigm is described in [11]; this method finds one


3/12

256 V.N.M. Aradhya, F. Masulli, and S. Rovetta

bicluster at a time, assigning a membership to the bicluster for each gene and foreach condition. Possibilistic clustering is tested on the Yeast database, obtainingfast convergence and good quality solutions.

Genetic Algorithm (GA) have also been used in[24]to maximize homogene-

ity in a single bicluster. Given the complexity of the biclustering problem, themethod adopts a global search method, namely genetic algorithms. A multi-criterion approach is also used to optimize two fitness functions implementingthe criteria of bicluster homogeneity (Mean Squared Residue) and size [7]. Com-paring fuzzy approaches to biclustering is described in [10]. Fuzzy central clus-tering approaches to biclustering show very interesting properties, such as speedin finding a solution, stability, capability of obtaining large and homogeneousbiclusters are discussed.

SVD based methods have also been used in order to obtain biclusters in

gene expression data and also in many potential applications [8,21]. ApplyingSVD directly on the data may obtain biclusters, but obtaining large numberof biclusters and functionally enriched with GO categories is still a challengingproblem. Hence in this paper we propose a method based on modular SVD forbiclustering in gene expression data. By doing this, local features of genes andconditions can be extracted efficiently in order to obtain better biclusters.

The organization of the paper is as follows: in Sect 2, we explain the proposedModular SVD based method. In Sect 3, we perform experiments on syntheticand real dataset. Finally conclusions are drawn at the end.

2 Proposed Method

In this section we describe our proposed method which is based on modular (datasubsets) SVD. The SVD is one of the most important and powerful tool usedin numerical signal processing. It is employed in a variety of signal processingapplications, such as spectrum analysis, filter design, system identification, etc.SVD based methods has also been used in order to obtain biclusters in gene

expression data and also in many potential applications [8,21]. Applying SVDdirectly on the data may obtain biclusters, but obtaining efficient biclusters ondata is still a challenging problem. The standard SVD based method may not bevery effective under different conditions, since it considers the global informationof gene and conditions and represents them with a set of weights.

Hence in this work, we made an attempt at overcoming the aforementionedproblem by partitioning a gene expression data into several smaller subsets ofdata and then SVD is applied to each of the sub-data separately. The three mainsteps involved in our method, named M-SVD Biclustering, are:

1. An original whole data set denoted by a matrix is partitioned into a set ofequally sized sub-data in a non-overlapping way.

2. SVD is performed on each of such data subsets.3. At last, a single global feature is synthesized by concatenating each data

data subsets.


4/12


In this work, we partitioned the data along rows in a non-overlapping way, whichis as shown in figure 1. The size of each partition is around 20% of the originalmatrix.

Fig. 1.Procedure of applying SVD on partitioned data

Each step of the algorithm is explained in detail as follows:

Data Partition: Let us consider am nmatrixA, which containm genes andnconditions. Now, this matrixA is partitioned intoKd-dimensional sub-matricesof similar sizes in a non-overlapping way, where A = (A1, A2,...,AK) with Ak

being the sub-data ofA. Figure 1 shows a partition procedure for a given datamatrix. Note that a partition ofAcan be obtained in many different ways e.g.,selection groups of continuous rows or groups of continuous columns, or alsorandomly sampling some rows or some columns. In this work, we selected groupof continuous rows for experiment purpose.

Apply SVD onKsub-data: Now according to the second step, conventional SVDis applied on Ksub-pattern. The SVD provides a factorization for all matrices,even matrices that are not square or have repeated eigenvalues. In general, thetheory of SVD states that any matrix A of size m n can be factorized into aproduct of unitary matrices and a diagonal matrix, as follows [20]:

A= U VT (1)

where U mm is unitary,V nn is unitary, and mn has the form=diag(1, 2, .....p), where p is the minimum value of m or n. The diagonal


5/12


elements of are called the singular values of A and are usually ordered indescending manner. The SVD has the eigenvectors of AAT in U and of ATAin V.

This decomposition may be represented as shown in figure 2:

Fig. 2.Illustration of SVD method

The diagonal ofcontains the singular values of A. They are real and nonnegative numbers. The upper part (green strip) contains the positive singularvalues. There are r positive singular values, where r is the rank of A. The lowerpart of the diagonal (gray strip) contains the (n - r), or vanishing, singularvalues.

SVD is known to be more robust than usual eigen-vectors of covariance ma-trix. This is because the robustness is determined by the directional vectorsrather than mere scalar quantity like magnitudes (singular values stored in ).Since U and V matrices are inherently orthogonal in nature, these directions areencoded in U and V matrices. This is unlike the eigenvectors which need not beorthogonal. Hence, a small perturbation like noise has very little effect to disturbthe orthogonal properties encoded in the U and V matrices. This could be themain reason for the robust behavior of the SVD.

Finally, from each of the data partitions, we would expect that the eigenvec-tors corresponding to the largest eigenvalue would provide the optimal clusters.However, we also observed that an eigenvector with smaller eigenvalues couldyield clusters. Instead of clustering each eigenvector individually, we performclustering by applying k-means to the data projected to the best three or foureigenvectors. In general, it can be described as shown below:

Apply SVD on each partitioned data Project the data for assigned number of eigenvalues

Apply K-means algorithm to the projected data using desired number ofclusters.

Finally, find the biclusters for specified number of clusters.


6/12


Steps to obtain biclusters

[U S V] = svd(partion data) cluster index = kmeans(projected (U & V), de-

sired number of clusters) cluster data = cluster index(x,y)

More formally, the proposed method is presented in the form of Algorithm asshown below.

Algorithm: Modular SVD

Input: Gene Expression Data Output: Homogeneity H and Biclusters size N Steps:

1: Acquire gene expression matrix and generate K number of d-dimensional sub-data in a non-overlapping way and reshaped into Knmatrix A= (A1, A2,...,AK) with Ak being the sub-data ofA.

2: Apply standard SVD method to each sub-data obtained from the Step1.

3: Perform final clustering step by applying the k-means to the dataprojected to get the best three or four eigenvectors.

4: Repeat this procedure for all the partition present in the gene expres-sion data.

5: At last, computation of Heterogeneity H and size N, are done usingthe resultant biclusters obtained from each partition matrix.

Algorithm ends

3 Experiment Results and Comparative Study

In this section we experimentally evaluate the proposed method with pertainingto synthetic and standard dataset. The proposed algorithm has been coded in Rlanguage on Pentium IV 2 GHz with 756 MB of RAM under Windows platform.Following[6], we are interested in the largest biclustes Nfrom DNA microarraydata that do not exceed as assigned homogeneity constraint. To this aim we canutilize the definition of biclustering average heterogeneity H. The size N of abiclusters is usually defined as the number of cells in the gene expression matrixXbelonging to it that is the product of cardinality ng =|g| and nc =|c| (here

g, c refers to genes and conditions respectively):

N=ng nc (2)

herengandncare the number of columns and number of rows ofX, respectively.


7/12


We can defineH andGas the Sum-squared residue and Mean-squared residuerespectively, two related quantities that measures the biclusters and heterogeneity:

H=

ig

jc

d2ij (3)

G= 1

N

ig

jc

d2ij = H

N (4)

whered2ij = (xij+ xIJ xiJ xIj )/n,xIJ,xIj andxiJare the biclusters mean,the row mean and the column mean ofX for the selected genes and conditionsrespectively.

3.1 Results on Synthetic Dataset

We compared our results with a standard spectral method[19]using a syntheticand a standard real dataset. To obtain the synthetic dataset, we generated ma-trices with random values and the size of the matrices varied from 100 10(rows columns). Table 1 shows the experimental results of heterogeneity andbicluster size obtained using standard spectral and proposed method. From theresult we can notice that spectral method failed to find bicluster, however theproposed method successfully extracted bicluster from the data.

We also tested our method on another synthetic data set. To this aim we

generated matrices with random numbers, on which 3 biclusters were similar,with dimensions ranging from 3-5 rows and 5-7 columns. Heterogeneity Handbiclusters sizeNare tabulated in Table 2. The obtained heterogeneity from boththe methods is competitive. However if we consider bicluster size, the proposedmethod achieved better size compared to standard spectral method. Figure 3shows the example of biclusters extracted using the proposed method for syn-thetic dataset.

From the above two results, it is ascertained that proposed M-SVD basedmethod performs better compared to standard spectral method both in hetero-

geneity and bicluster size.

Table 1. Heterogeneity and size N for Synthetic Dataset of size 100 10

Methods Heterogeneity (H) Size (N)Spectral[19] M-SVD-BC 0.90 70

Table 2. Heterogeneity (H) and size N for Synthetic Dataset of size 1010

Methods Heterogeneity (H) Size (N)

Spectral[19] 7.1 30M-SVD-BC 7.21 51


8/12


Fig. 3. (a):A synthetic dataset with multiple biclusters of different patterns (b-d):andthe biclusters extracted

3.2 Results on Real Dataset

For the experiments on real data, we tested our proposed method on the standarddataset of Bicat Yeast, SyntrenEcoli and Yeast. Data structure with informationabout the expression levels of 419 probesets over 70 conditions follow Affymetrixprobeset notation in Bicat Yeast dataset[2]. Affymetrix data files are normallyavailable in DAT, CEL, CHP and EXP formats. Date containing in CDF files

can also be used and contain the information about which probes belong towhich probe set. For more information on file formats refer [1]. Results pertain-ing to this dataset are reported in Table 3. Another important experiment onperformance index, defined as the ratio of N/G is also considered. The largestthe ratio, the better the performance. Figure 4 shows the relationship betweenthe two quality criteria obtained by the proposed and Spectral[19] method re-spectively. The first column in the graph represents the bicluster size (N) forvarying number of principal components (PCs). The second column shows thevalue of performance index, defined as the ratio N/G.

Table 3. H and N for standard Bicat Yeast Dataset


Spectral[19] 0.721 680M-SVD-BC 0.789 2840

Another data structure with information about the expression levels of 200

genes over 20 conditions from transcription regulatory network is also used inour experiment [26]. Detailed description about Syntren can be found in [4].The results of heterogeneity and bicluster size is tabulated in Table 4. The ho-mogeneity obtained from spectral method is high compared to proposed method.However the proposed method obtained less homogeneity and achieved betterbicluster size compared to spectral method.


9/12


Fig. 4.Performance comparison of bicluster techniques: Size (N) and ratio N/G

Table 4. H and Nfor standard Syntren E. coli Dataset


Spectral[19] 19.95 16M-SVD-BC 7.07 196

We also applied our algorithm to the yeast dataset which is genomic databasecomposed by 2884 genes and 17 conditions [30]. The original microarray geneexpression data can be obtained from the web site http://arep.med.harvard.edu/biclustering/yeast.matrix. Results are expressed as both the homogeneityHandbicluster size N. Table5 shows the H and Nobtained from well known algo-rithms on Yeast dataset. Here we compared our method with Deterministic Bi-clustering with Frequent pattern mining [34], Flexible Overlapped Clusters[33],Cheng and Church [6], Single and Multi objective Genetic Algorithms [25,7],Fuzzy Co-clustering with Ruspinis condition [31], and Spectral [19]. From the

results it is ascertained that the proposed modular SVD extracts better biclustersize compared to other well known techniques.

Table 5. H and N for standard Yeast Dataset


DBF[34] 115 4000FLOC [33] 188 2000

Cheng and Church[6] 204 4485

Single-objective GA[7] 52.9 1408Multi-objective GA[25] 235 14828

FCR [31] 973.9 15174Spectral [19] 201 12320

Proposed Method 259 18845


10/12


From the above observations and results it is clear that applying SVD onmodular approach yields better performance compared to standard approaches.

4 Conclusion

In this paper, a novel approach for biclustering gene expression data based onModular SVD is proposed. Instead of applying SVD directly to the data, theproposed method uses modular approach to extract biclusters. The standardSVD based method may not be very effective under different conditions of gene,since it considers the global information of gene and conditions and representsthem with a set of weights. While applying SVD on modular way, local featuresof genes and conditions can be extracted efficiently in order to obtain better

biclusters. The efficiency of the proposed method is tested on synthetic as wellas real dataset demonstrated the effectiveness of the algorithm when comparedwith well known existing ones.

References

1. http://www.affymetrix.com/analysis/index.affix2. Barkow, S., Bleuler, S., Prelic, A., Zimmermann, P., Zitzler, E.: Bicat: A biclus-

tering analysis toolbox. Bioinformatics 19, 12821283 (2006)

3. Ben-Dor, A., Chor, B., Karp, R., Yakhini, Z.: Discovering local structure in geneexpression data: the order-preserving sub-matrix problem. In: Proceedings of theSixth Annual International Conference on Computational Biology, pp. 4957. ACMPress, New York (2002)

4. Blucke, Leemput, Naudts, Remortel, Ma, Verschoren, Moor, Marchal: Syntren: agenerator of synthetic gene expression data for design and analysis of structurelearning algorithm. BMC Bioinformatics 7, 116 (2006)

5. Cano, C., Adarve, L., Lopez, J., Blanco, A.: Possibilistic approach for biclusteringmicroarray data. Computers in Biology and Medicine 37, 14261436 (2007)

6. Cheng, Y., Church: Biclustering of expression data. In: Proceedings of the Intl

Conf. on intelligent Systems and Molecular Biology, pp. 93103 (2000)7. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective

genetic algorithm: Nsga-ii. IEEE Trans. on Evolutionary Computatation 6, 182197 (2002)

8. Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graphpartitioning. In: Proceedings of the 7th ACM SIGKDD, pp. 269274 (2001)

9. Prelic, A., et al.: A systematic comparison and evaluation of biclustering methodsfor gene expression data. Bioinformatics 22, 11221129 (2006)

10. Filippone, M., Masulli, F., Rovetta, S., Zini, L.: Comparing fuzzy approaches to bi-clustering. In: Proceedings of International Meeting on Computational Intelligence

for Bioinformatics and Biostatistics, CIBB (2008)11. Filippone, M., Masulli, F., Stefano, R.: Possibilistic approach to biclustering: An

application to oligonucleotide microarray data analysis. In: Proceedings of theComputational Methods in System Biology, pp. 312322 (2006)

12. Gan, X., Alan, Yan, H.: Discovering biclusters in gene expression data based onhigh dimensional linear geometries. BMC Bioinformatics 9, 209223 (2008)
http://www.affymetrix.com/analysis/index.affixhttp://www.affymetrix.com/analysis/index.affix


11/12


13. Getz, G., Levine, E., Domany, E.: Coupled two-way clustering analysis of genemicroarray data. In: Proceedings of National Academy of Science, 1207912084(2000)

14. Hartigan, J.A.: direct clustering of a data matrix. Journal of the American Statis-

tical Association 67, 123129 (1972)15. Hastie, T., Levine, E., Domany, E.: Gene shaving as a method for identifyingdistinct set of genes with similar expression patterns. Genome Biology 1, 0003.10003.21 (2000)

16. Mallela, S., Dhillon, I., Modha, D.: Information-theoretic co-clustering. In: In Pro-ceedings of the 9th International Conference on Knowledge Discovery and DataMining (KDD), pp. 8998 (2003)

17. Ihmels, J., Bergmann, S., Barkai, N.: Defining transcription modules using large-scale gene expression data. Bioinformatics 20, 19932003 (2004)

18. Jain, A.K., Murthy, M.N., Flynn, P.J.: Data clustering: A review. ACM Computing

Surveys 31, 26432319. Kluger, Y., Basri, Chang, Gerstein: Spectral biclustering of microarray data: Co-

clustering genes and conditions. Genome Research 13, 703716 (2003)

20. Lay, D.C.: Linear Algebra and its Applications. Addison-Wesley, Reading (2002)

21. Li, Z., Lu, X., Shi, W.: Process variation dimension reduction based on svd. In:Proceedings of the Intl Symposium on Circuits and Systems, pp. 672675 (2003)

22. Liu, X., Wang, L.: Computing the maximum similarity bi-clusters of gene expres-sion data. Bioinformatics 23, 5056 (2007)

23. Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis:A survey. IEEE & ACM Trans. on Computational Biology and Bioinformatics 1,2445 (2004)

24. Mitra, S., Banka, H.: Mulit-objective evolutionary biclustering of gene expressiondata. Pattern Recognition 39, 24642477 (2006)

25. Mitra, S., Banka, H.: Multi-objective evolutionary biclustering of gene expressiondata. Pattern Recognition 39, 24642477 (2006)

26. Orr, S.: Network motifs in the transcriptional regulation network of escherichiacoli. Nature Genetics 31, 6468 (2002)

27. Rohwer, R., Freitag, D.: Towards full automation of lexicon construction. In: HLT-NAACL 2004: Workshop on Computational Lexical Semantics, pp. 916 (2004)

28. Tanay, A., Sharan, R., Shamir, R.: Discovering statistically significant biclustersin gene expression data. Bioinformatics 18, 136144 (2002)

29. Tang, C., Zhang, L., Zhang, A., Ramanathan, M.: Interrelated two-way cluster-ing: an unsupervised approach for gene expression data analysis. In: Proceedingsof the Second Annual IEEE International Symposium on Bioinformatics and Bio-engineering, BIBE, pp. 4148 (2001)

30. Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., Church, G.M.: Systematicdetermination of genetic network architecture. Nature Genetics 22, 281285 (1999)

31. Tjhi, W.C., Chen, L.: A partitioning based algorithm to fuzzy co-cluster documentsand words. Pattern Recognition Letters 27, 151159 (2006)

32. Yang, J., Wang, W., Wang, H., Yu, P.: -cluster: capturing subspace correlation ina large data set. In: Proceedings of the 18th IEEE International Conference DataEngineering, pp. 517528 (2002)

33. Yang, J., Wang, W., Wang, H., Yu, P.: Enhanced biclustering on expression data.In: Proceedings of the Third IEEE Conference on Bioinformatics and Bioengineer-ing, pp. 321327 (2003)


12/12


34. Zhang, Z., Teo, A., Ooi, B.: Mining deterministic biclusters in gene expression data.In: Proceedings of the 4th IEEE Symposium on Bioinformatics and Bioengineering,p. 283 (2004)

35. Zhao, H., Alan, Xie, X., Yan, H.: A new geometric biclustering algorithm basedon the Hough transform for analysis of large scale microarray data. Journal ofTheoretical Biology 251, 264274 (2008)

36. Zhao, H., Yan, H.: Hough feature, a novel method for assessing drug effects inthree-color cdna microarray experiments. BMC Bioinformatics 8, 256 (2007)

A Novel Approach for Biclustering GeneExpression Data Using Modular Singular ValueDecomposition

Documents

Transcript of A Novel Approach for Biclustering GeneExpression Data Using Modular Singular ValueDecomposition