Gene Ontology Friendly Biclustering of Expression Profiles

Jinze Liu (1), Wei Wang (1), and Jiong Yang (2)

(1) Department of Computer Science, University of North Carolina, Chapel Hill, NC 27599
{liuj, weiwang}@cs.unc.edu

(2) Department of Computer Science, University of Illinois, Urbana-Champaign, IL 61801

[email protected]

Abstract

The soundness of clustering in the analysis of gene expression profiles and in gene function prediction rests on the hypothesis that genes with similar expression profiles are strongly correlated in their functions in biological activities. Gene Ontology (GO) has become a well accepted standard for organizing gene function categories. Different gene function categories in GO can have very sophisticated relationships, such as 'part of' and 'overlapping'. Until now, no clustering algorithm has generated gene clusters whose relationships naturally reflect those of the gene function categories in the GO hierarchy. The failure to resemble these relationships may reduce the confidence of clustering in gene function prediction. In this paper, we present a new clustering technique, Smart Hierarchical Tendency Preserving clustering (SHTP-clustering), based on a bicluster model, the Tendency Preserving cluster (TP-Cluster). By directly incorporating Gene Ontology information into the clustering process, the SHTP-clustering algorithm yields a TP-cluster tree within which any subtree can be well mapped to a part of the GO hierarchy. Our experiments on yeast cell cycle data demonstrate that this method is efficient and effective in generating biologically relevant TP-Clusters.

Keywords: Gene Ontology, Gene expression profiles, Biclustering, Tendency Preserving.

1 Introduction

The advent of DNA microarray technologies has revolutionized the experimental study of gene expression. Thousands of genes are routinely probed in a parallel fashion. Gene expression data present both great opportunities and challenges. They serve as valuable clues to understanding the genetic behaviors of life, while the complexity of the underlying mechanisms underscores the potential complexity of analyzing the gene expression data.

With the advance of microarray technology, data analysis techniques have been intensively studied as well. Clustering is one of the most popular approaches to analyzing gene expression data without prior knowledge. Several representative algorithmic techniques have been developed and tested for clustering gene expression data, including but not limited to hierarchical clustering [7], self-organizing maps [10], and graph theoretic approaches, e.g., CLICK [16]. The applicability of clustering to gene function prediction is based on the hypothesis that similar expression profiles imply a functional relation [4] in biological activities. As a result, the quality of the clusters is often evaluated by their correlation with known gene function groups. Those algorithms typically generate a small number of disjoint clusters whose sizes are usually much larger than that of most function categories in GO.

Although these studies have been successful in showing that genes participating in the same biological process have similar expression profiles, several reasons prevent traditional clustering analysis from solving the core issues of modelling biological ontology relationships (Shatkay et al. [17]). First, the expression levels of a set of biologically related genes might only show coherency under a subset of conditions. Therefore, clustering over all dimensions (conditions) may separate biologically related genes from each other. Secondly, grouping genes into disjoint clusters may preclude genes participating in multiple biological activities from being grouped properly [22]. Finally, assume we have a large cluster containing over 200 genes that includes most of the genes from a small function family, e.g., around 10 genes. It is very difficult to tell whether this occurs by chance or not. Therefore, annotating a large gene cluster with a small function family is usually infeasible. Traditional clustering algorithms that generate large clusters only help with the annotation of relatively larger but less specific function categories.

Biclustering (or subspace clustering) might be an answer to the above problems. Compared with traditional clustering algorithms, biclustering is capable of discovering gene expression patterns embedded in only a subset of conditions. In addition, clustering under different sets of conditions generates overlapping biclusters. Cheng and Church [6] were among the pioneers in introducing the concept of biclustering. Their biclusters are based on uniformity criteria, and a greedy algorithm is developed to discover them. Plaid [11] is another model that captures the approximate uniformity of a submatrix in gene expression data and looks for patterns where genes differ in their expression levels by a constant vector. Ben-Dor et al. [2] discussed approaches for the unsupervised identification of patterns in expression data that distinguish two subclasses of a tissue on the basis of a supporting set of genes that offer accurate classification. Tanay et al. [20] defined a bicluster as a subset of genes that jointly respond across a subset of conditions. Ben-Dor et al. introduced the OPSM (order preserving submatrix) model [3] to discover a subset of genes identically ordered among a subset of conditions. Probabilistic models were the basis of the work presented above. With probabilistic models, only a limited number of valid clusters may be discovered, and a seed usually has to be selected manually before a cluster is generated. The inability to reveal the complete set of biclusters hinders the systematic study of the relationships between the biclusters and the function categories in biological activities.

In this paper, we present a biclustering algorithm, Smart Hierarchical Tendency Preserving clustering (SHTP-clustering), which directly incorporates Gene Ontology information into the clustering process. A cluster is a Tendency Preserving cluster (TP-Cluster) if it has a maximal subset of genes that have a strictly coherent tendency along a subset of conditions. A TP-Cluster is a Smart Tendency Preserving cluster (STP-Cluster) if the enrichments of function categories within the cluster are statistically significant. Our goal is to obtain a tree of STP-Clusters whose hierarchical relationships match the hierarchical organization of the gene function categories in GO. The clusters may overlap and may vary in size depending on their levels in the STP-cluster tree. We design an algorithm that automatically constructs a Hierarchical Tendency Preserving clustering tree (TP-cluster tree) in a very compact fashion. We evaluate the mapping from the hierarchical GO relationships to those of the STP-Clusters. The assessment, in turn, is used to guide the mining of STP-Clusters. Our experiments demonstrate that, by directly incorporating gene ontology information into the clustering process, we are able to efficiently and effectively discover biologically relevant TP-Clusters.

The remainder of the paper is organized as follows. Section 2 introduces some preliminary knowledge on GO. Section 3 defines the TP-Cluster and its GO annotation. Section 4 presents the SHTP-clustering algorithm in detail. An extensive performance study is reported in Section 5. Section 6 concludes the paper and discusses some future work.

2 Preliminaries

The GO Consortium was formed to integrate efforts to regulate the vocabulary of various genomic databases of diverse species in such a way that it shows the essential features shared by all organisms [23]. GO has three ontology files corresponding to its three categories, namely molecular function, biological process and cellular component. A directed acyclic graph can be obtained for each category with GO terms as nodes. Figure 1 presents a screen shot of the top levels of the gene ontology. At the first level, known genes are classified into three categories, i.e., Molecular Function (MF), Cellular Component (CC) and Biological Process (BP).

Figure 1. Schema of GO annotation terms.

Formally, the GO hierarchy is naturally described as a directed acyclic graph (DAG), GO = <V, E>, where V is a set of gene function descriptions (GO terms) and E is a binary relation on V such that the genes with functions described by vj are a subset of the genes with functions described by vi, denoted vj ⪯ vi, if and only if there exists a path (vi, vi+1, ..., vj−1, vj) such that (vm−1, vm) ∈ E for m = i + 1, i + 2, ..., j − 1, j. A term's relationship with its ancestor is also described as "part of" or "specific", which means that the set of genes annotated with a GO term is also a subset of the genes annotated with its ancestor GO term.

Nevertheless, to fit GO into our model, we transform the original directed graph of GO into our desired form, an ordered tree. Note that the same GO term may occur several times in an ontology file. From a biological viewpoint, these occurrences should be considered distinct because the location of the term in the hierarchy (i.e., the path from the root to the term) is far more important than the term itself.

Let D be the universe of genes and let a: D → 2^|V| be a function annotating each gene with a set of GO terms at the most specific level of the gene ontology. Given a set of GO terms G = {v1, v2, ..., vt}, a gene is called a known gene if there exists a GO term v, v ∈ G, such that the set of gene-term pairs {(x, v) | x ∈ D and u ∈ a(x) and u ⪯ v and v ∈ G} is not empty. Otherwise, the gene is called an unknown gene. Unknown genes are either genes without annotation or genes with annotations beyond the scope of the given GO term set G.
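The known/unknown distinction can be illustrated with a small sketch. It assumes the hierarchy is stored as a child-to-parents map and that the ordering relation means "equal to or a descendant of"; the toy terms, gene names and the helper names `ancestors` and `is_known` are hypothetical and are not taken from the GO files.

```python
from typing import Dict, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the term itself plus every term reachable via parent edges."""
    seen = {term}
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), set()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def is_known(gene: str,
             annotation: Dict[str, Set[str]],
             parents: Dict[str, Set[str]],
             G: Set[str]) -> bool:
    """A gene is known if one of its annotated terms u has some v in G
    among its ancestors (or is itself in G)."""
    return any(ancestors(u, parents) & G for u in annotation.get(gene, set()))


# Hypothetical toy hierarchy (child -> parents) and annotations.
parents = {
    "cell growth": {"cellular process"},
    "cell communication": {"cellular process"},
    "regulation of cell growth": {"cell growth"},
}
annotation = {
    "g1": {"regulation of cell growth"},
    "g2": {"cell communication"},
    "g3": set(),  # no annotation at all
}
G = {"cell growth"}
```

Here `g1` is known with respect to G, while `g2` (annotated outside the scope of G) and `g3` (not annotated) are unknown.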

3 TP-Cluster Model and Ontology Interpretation

Let D be the universe of the n participating genes in a microarray experiment and let A be the m conditions under which the gene expression levels are measured. The whole gene expression database can be represented as a data matrix M, where Mij is the expression level of gene i under condition j (0 < i ≤ n, 0 < j ≤ m).

3.1 Tendency Preserving Cluster Model

We are interested in TP-Clusters, in which a subset of the genes in D exhibits a coherent tendency on a subset of conditions T of A.

Definition 3.1 Let O be a subset of genes in the database D, O ⊆ D. Let T be a subset of conditions, T ⊆ A. Let R: T × O × 2^|A| → I be the function that assigns the rank of a gene i's condition j to be r, if the expression value of gene i under condition j is the r-th lowest value among those under all the conditions in T. (O, T) forms a TP-Cluster (Tendency Preserving Cluster) if ∀ i, j (i, j ∈ O), ∀ a (a ∈ T), R(i, a, T) = R(j, a, T), and ∀ k (k ∈ D − O), ∀ l (l ∈ O), ∃ b (b ∈ T), R(k, b, T) ≠ R(l, b, T).

gID   a      b     c      d     sequence
1     4002   284   4108   228   dbac
2     401    281   120    298   cbda
3     401    292   109    238   cdba
4     280    318   37     215   cdab

Table 1. An example dataset.

Definition 3.1 first defines the rank function R. Based on the rank function, a TP-Cluster is defined as a subset of genes that have consistent ranks along a subset of conditions. In addition, a TP-Cluster is defined to be maximal in that adding any additional gene from the database would violate the rank coherence within the cluster. This property distinguishes our model from the OPSM model, which also searches for a subset of genes identically ordered along a subset of conditions [3]. An OPSM might not be maximal.

For example, in Table 1, we say that the gene set {2, 3} forms a TP-Cluster along the subset of conditions {a, c, d}, since the ranks of the three conditions for both genes are the same, i.e., (3, 1, 2).

Next, we show that each TP-Cluster can be mapped onto an ordered sequence of condition labels by imposing a consistent order on the conditions, such as monotonically increasing or decreasing.

Definition 3.2 Given a TP-Cluster C with gene set O and condition set T, we say that a sequence of conditions S represents C in a monotonically increasing order if S = π(T), where the function π places each condition a in T at the position R(a) in sequence S.

For example, for cluster {2, 3} × {a, c, d} in Table 1, the sequence of conditions that represents the monotonically increasing order is cda. For condition c, its rank is 1. Therefore, its position in the sequence is 1.
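The rank function and the sequence representation can be sketched on the values of Table 1. The function names `sequence`, `ranks` and `is_rank_coherent` are illustrative choices, and the sketch assumes that no two conditions of a gene share the same expression value (ties are not addressed in the definitions).

```python
from typing import Dict, Iterable

# Expression values from Table 1 (gene ID -> value per condition).
TABLE1: Dict[int, Dict[str, float]] = {
    1: {"a": 4002, "b": 284, "c": 4108, "d": 228},
    2: {"a": 401, "b": 281, "c": 120, "d": 298},
    3: {"a": 401, "b": 292, "c": 109, "d": 238},
    4: {"a": 280, "b": 318, "c": 37, "d": 215},
}


def sequence(gene: int, conditions: str) -> str:
    """Order the given conditions by increasing expression value of the gene."""
    row = TABLE1[gene]
    return "".join(sorted(conditions, key=lambda c: row[c]))


def ranks(gene: int, conditions: str) -> Dict[str, int]:
    """Rank of each condition (1 = lowest value) within the condition subset."""
    order = sequence(gene, conditions)
    return {c: order.index(c) + 1 for c in conditions}


def is_rank_coherent(genes: Iterable[int], conditions: str) -> bool:
    """True if all genes induce the same ordering on the condition subset.
    This checks coherence only, not the maximality required of a TP-Cluster."""
    return len({sequence(g, conditions) for g in genes}) == 1
```

For instance, `ranks(2, "acd")` gives rank 3 for a, 1 for c and 2 for d, and `sequence(2, "acd")` gives `cda`, in line with the examples above.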

Definition 3.3 Given two TP-Clusters C1 and C2 with condition sets T1 and T2 respectively, we say that C1 is an ancestor of C2 if π(T1) is a prefix of the sequence π(T2).

For example, in Table 1, the sequence representing cluster C1 = {2, 3} × {a, c, d} is cda. The sequence representing cluster C2 = {2, 3, 4} × {c, d} is cd. Since cd is a prefix of cda, we say that C2 is an ancestor cluster of C1.

Based on the mapping from TP-Clusters to sequences, we are able to organize the TP-Clusters into a prefix tree. We will introduce an algorithm that builds the TP-cluster tree in a very compact fashion in Section 4.

The following lemma illustrates the 'part-of' relationship that holds between two TP-Clusters of which one is an ancestor of the other.

Lemma 3.1 Let C and C′ be two TP-Clusters in the database D. Let O and O′ be the gene sets of C and C′ respectively. If C′ is an ancestor of C, then O ⊆ O′.

Proof 3.1 Since C′ is an ancestor of C, π(T′) must be a prefix of π(T), based on Definition 3.3. ∀ g (g ∈ O), g supports π(T). Thus, g must support any prefix of π(T), including π(T′). Therefore, g ∈ O′. Since ∀ g, g ∈ O implies g ∈ O′, O ⊆ O′.

Clusters C1 and C2 in the previous example follow this property: C2 is an ancestor of C1 and {2, 3, 4} ⊇ {2, 3}.
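Lemma 3.1 can be checked mechanically on the sequences of Table 1. The sketch below assumes that a gene "supports" a condition sequence when that sequence is a subsequence of the gene's full ordering; the helper names `supports` and `gene_set` and the choice of the patterns `cd` and `cdb` are illustrative.

```python
from typing import Dict, Set

# Full condition orderings of the four genes, as listed in Table 1.
SEQUENCES: Dict[int, str] = {1: "dbac", 2: "cbda", 3: "cdba", 4: "cdab"}


def supports(row_sequence: str, pattern: str) -> bool:
    """True if pattern is a (not necessarily contiguous) subsequence of row_sequence."""
    it = iter(row_sequence)
    # Each membership test consumes the iterator up to the matched character,
    # so later pattern characters must appear further to the right.
    return all(c in it for c in pattern)


def gene_set(pattern: str) -> Set[int]:
    """All genes whose ordering contains the pattern, i.e. the maximal gene set."""
    return {g for g, s in SEQUENCES.items() if supports(s, pattern)}


parent_genes = gene_set("cd")   # cluster whose sequence is a prefix of "cdb"
child_genes = gene_set("cdb")   # one-condition extension of "cd"
```

Extending the sequence by one condition can only shrink the supporting gene set, so `child_genes` is a subset of `parent_genes`.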

3.2 The TP-cluster tree

The TP-cluster tree is analogous to a prefix tree of a predefined set of sequences. However, it differs in its interpretation of each node and of the parent-child relationship. Each node in a TP-cluster tree represents a unique TP-Cluster. The root node corresponds to the null space. The nodes at level m correspond to m-dimensional TP-Clusters. The TP-Cluster at a node is related to its immediate parent by being part of the parent cluster. Each TP-Cluster other than the null root is a 1-dimensional extension of its parent cluster. In order to elucidate the structure of the TP-cluster tree, we give a complete TP-cluster tree of three conditions in Figure 2, where each TP-Cluster is represented by a sequence. This tree is 'complete' since there does not exist another TP-Cluster not included in the tree. The figure contains a three-level tree structure that corresponds to 1-, 2- and 3-dimensional TP-Clusters. Each node u in the TP-cluster tree is represented by the path from the root of the TP-cluster tree to u. For example, the TP-Cluster with two conditions {b, c} ordered increasingly as (bc) will be put at the node −1bc. The gene set associated with each node in the TP-cluster tree is omitted in the figure.

[Figure 2 node labels by level. Level 0 (root): -1. Level 1: a, b, c. Level 2: ab, ac, ba, bc, cb, ca. Level 3: abc, acb, bac, bca, cba, cab.]

Figure 2. A complete TP-Cluster tree given a condition space A = {a, b, c}.

Definition 3.4 The TP-cluster tree is a hierarchical arrangement of TP-Clusters with the following properties: 1) The tree is rooted at level 0 with −1. 2) Each node at level m corresponds to an m-dimensional TP-Cluster represented by a length-m sequence. 3) Each node at level (m + 1) is a 1-dimensional extension of its immediate ancestor and corresponds to a length-(m + 1) sequence.

What we are interested in is the hierarchical relationship among a set of TP-Clusters. Investigating these relationships may help us predict the behavior of higher dimensional clusters based on the lower dimensional ones.

3.3 Annotation of a Gene Cluster

In this subsection, we present the annotation of a gene cluster based on GO. We first introduce the P-value to assess the significance of a particular function group within a cluster.

The hypergeometric distribution is used to model the probability of observing at least k genes from a cluster of n genes by chance in a category containing f genes from a total genome size of g genes. The P-value is given by

$$P = 1 - \sum_{i=0}^{k} \frac{\binom{f}{i}\binom{g-f}{n-i}}{\binom{g}{n}}.$$

The test measures whether a cluster is enriched with genes from a particular category to a greater extent than would be expected by chance. For example, if the majority of genes in a cluster have the same biological function, then it is unlikely that this happens by chance and the category's P-value would be close to 0. Adopting the Bonferroni correction [15] for multiple independent hypotheses, 0.01/Na is used as the default threshold θp to measure the significance of the P-value.
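As a rough illustration, the enrichment P-value can be computed with exact integer binomials. The sketch below computes the upper tail P(X ≥ k), which matches the "at least k" wording and corresponds to summing i from 0 to k − 1; the formula as printed sums up to k, so the boundary convention here should be read as an assumption. The function names are illustrative, and Na is assumed to be the number of hypotheses (categories) tested.

```python
from math import comb


def enrichment_p_value(k: int, n: int, f: int, g: int) -> float:
    """Probability of seeing at least k category genes in a cluster of n genes,
    when the category has f genes out of a genome of g genes (hypergeometric)."""
    # Sum the probabilities of observing 0 .. k-1 category genes.
    below = sum(comb(f, i) * comb(g - f, n - i) for i in range(min(k, n + 1)))
    return 1.0 - below / comb(g, n)


def bonferroni_threshold(num_hypotheses: int, alpha: float = 0.01) -> float:
    """Default significance threshold alpha / Na (Na assumed to be the number of tests)."""
    return alpha / num_hypotheses
```

As a small check, with g = 10, f = 3, n = 4 and k = 2 the tail probability is (3·21 + 1·7)/210 = 1/3.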


To annotate a cluster, the P-value of each category present in the cluster is computed first. Given a cutoff P-value threshold θp, categories whose P-values are larger than θp are eliminated without further consideration. The result is a set of significant GO function categories V = {v1, v2, ..., vt}. There are two naive ways to annotate a cluster with the set of significant function categories V. One method is to keep all significant function categories as annotation candidates. However, the annotation might become ambiguous if the genes in one cluster are assigned too many function categories. The other way is to annotate a cluster with the category that has the smallest P-value. Choosing the most significant category to represent the cluster is reasonable. However, this simplicity comes at the expense of discarding useful information, such as the distribution of the subcategories of the most significant category.

In our method, we adopt a middle way between the two extremes. We use an appropriate subtree of the GO tree to annotate a cluster. The subtree is rooted at the node of the most significant category and includes all of its significant reachable subcategories. The annotation is formally defined as the Ontology SubTree (OST) in Definition 3.5.

Definition 3.5 Given a cluster C, its significant function categories V = {v1, v2, ..., vt}, and the directed ontology tree G = <V, E>, the Ontology SubTree (OST) H = <V′, E′> representing cluster C is defined as follows: 1. The root of H is the function category vr, 0 < r ≤ t, where P(vr, C) = min 0<i≤t (P(vi, C)). 2. ∀ v′ ∈ V′, there exists a path L (L ⊆ E) leading from vr to v′. 3. ∀ v′1, v′2 ∈ V′, if ∃ e′, e′ ∈ E, and e′ connects v′1 and v′2, then e′ ∈ E′.

First, given the hierarchical structure of GO, a gene of a function family will always be a member of the function family's ancestors. Therefore, the OST is rooted at the most significant category. The less significant and less specific ancestor function categories are omitted.

Secondly, although the children of vr are not as significant as vr in cluster C, it is still possible that a further split of the cluster may reveal the coherence of the more specific categories of vr. Consequently, an OST representing C is a maximal connected tree rooted at the most significant category in C.

Figure 3 shows a set of significant function categories of a cluster organized in a tree structure. To determine the OST representing this cluster, we first find the location of the most significant function group, which in this case is cell growth, with log(P-value) = −7. We then discard its parent category, cellular process, and its sibling, cell communication, which have higher P-values. The resulting OST is the subtree rooted at cell growth.

[Figure 3 node labels: Cellular Process, log(P-value) = −3; Cell Communication, log(P-value) = −2; Cell Growth, log(P-value) = −7; Cell Expansion, log(P-value) = −3; Regulation of Cell Growth, log(P-value) = −3.]

Figure 3. An example of an OST representing a cluster. The categories and P-values are shown at each node. The OST is the subtree under the curve.
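The OST extraction of Definition 3.5 can be sketched as a traversal from the most significant category. Several points are assumptions made for illustration: the ontology is given as a parent-to-children map, only significant categories may appear on the path from the root, the logarithms in Figure 3 are taken as base 10, and Cell Expansion and Regulation of Cell Growth are assumed to be children of Cell Growth. The name `extract_ost` is hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

OST = Tuple[str, Set[str], Set[Tuple[str, str]]]  # (root, nodes, edges)


def extract_ost(children: Dict[str, List[str]],
                p_value: Dict[str, float],
                theta_p: float) -> Optional[OST]:
    """Return the subtree rooted at the most significant category, containing
    every significant category reachable from it through significant nodes."""
    significant = {v for v, p in p_value.items() if p <= theta_p}
    if not significant:
        return None  # the cluster is not functionally enriched
    root = min(significant, key=lambda v: p_value[v])
    nodes, edges, stack = {root}, set(), [root]
    while stack:
        v = stack.pop()
        for c in children.get(v, []):
            if c in significant and c not in nodes:
                nodes.add(c)
                edges.add((v, c))
                stack.append(c)
    return root, nodes, edges


# Figure 3 example; the tree shape below Cell Growth is an assumption.
children = {
    "cellular process": ["cell growth", "cell communication"],
    "cell growth": ["cell expansion", "regulation of cell growth"],
}
p_value = {
    "cellular process": 1e-3,
    "cell communication": 1e-2,
    "cell growth": 1e-7,
    "cell expansion": 1e-3,
    "regulation of cell growth": 1e-3,
}
root, nodes, edges = extract_ost(children, p_value, theta_p=1e-2)
```

With these inputs the root is cell growth and the parent and sibling categories are left out, as described in the text.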

Definition 3.6 Given a cluster C and a P-value threshold θp, we say that C is functionally enriched if there exists an OST representing the cluster.

Definition 3.7 defines the ⪯ relationship between two OSTs.

Definition 3.7 Given two OSTs H1 and H2, we say that H1 ⪯ H2 if the root node of H1 appears as a node of H2.

For example, Figure 4 contains two clusters' OSTs. We say that H2 ⪯ H1 since we can find the root node of H2, cellular growth, in H1.

3.4 Mapping the TP-cluster tree onto GO Hierarchy

A child-parent relationship in the GO hierarchy is "part-of" and "more specific than". In other words, the genes in the child term should be more similar and consistent than those in the parent term. Here we assume that the subset of conditions on which the genes of a child term stay close is larger than that of its parent term in the GO hierarchy. This is exactly the child-parent relationship in a TP-cluster tree. Through this shared child-parent relationship, we unite the two hierarchies. Next, we use the child-parent relationship of the gene ontology to evaluate the child-parent relationship in the TP-cluster tree.


[Figure 4 node labels. (A) H1: Cellular Process, log(P-value) = −6; Cell Death, log(P-value) = −5. (B) H2: Cell Expansion, log(P-value) = −3; Regulation of Cell Growth, log(P-value) = −3.]

Figure 4. An example of two OSTs H1 and H2, H2 ⪯ H1.

Definition 3.8 Let C be a TP-Cluster and C′ be one of C's descendants. Let H be C's OST, and let H′ be C′'s OST. C′ is a biological descendant of C if H′ ⪯ H.

For instance, C2 represented by H2 in Figure 4 is a biological descendant of C1 represented by H1.

Problem Statement Let D be a database with gene set O and condition set A. Given a threshold θp for category enrichment and the GO files, our goal is to extract a biologically relevant hierarchy of enriched TP-Clusters.

Since genes and conditions in the expression data correspond to rows and columns in the expression matrix, we may use the two sets of terms interchangeably in the following sections.
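Definitions 3.7 and 3.8 reduce to two simple checks, sketched below. The dictionary layout, the function names and the concrete sequences and category sets are hypothetical; in particular, the node set given for the parent OST is an assumption chosen so that it contains the root of the child OST.

```python
from typing import Dict


def ost_precedes(h_inner: Dict, h_outer: Dict) -> bool:
    """Definition 3.7: the root of h_inner appears as a node of h_outer."""
    return h_inner["root"] in h_outer["nodes"]


def is_biological_descendant(child: Dict, parent: Dict) -> bool:
    """Definition 3.8: child is a TP-cluster descendant of parent (the parent's
    sequence is a proper prefix of the child's) and the child's OST precedes
    the parent's OST."""
    is_descendant = (len(child["sequence"]) > len(parent["sequence"])
                     and child["sequence"].startswith(parent["sequence"]))
    return is_descendant and ost_precedes(child["ost"], parent["ost"])


# Hypothetical clusters; category sets are assumptions for illustration only.
parent_cluster = {
    "sequence": "cd",
    "ost": {"root": "cellular process",
            "nodes": {"cellular process", "cell growth", "cell death"}},
}
child_cluster = {
    "sequence": "cdb",
    "ost": {"root": "cell growth",
            "nodes": {"cell growth", "cell expansion", "regulation of cell growth"}},
}
```

In this toy setting `child_cluster` is a biological descendant of `parent_cluster`, but not the other way round.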

4 Construction of Biologically Relevant TP-cluster tree

In this section, we present the algorithm to build an STP-cluster tree. In order to illustrate the development of an STP-cluster tree, we start from the development of a TP-cluster tree (HTP-clustering). Ontology-based pruning techniques are then added to the HTP-clustering process to extract the STP-cluster tree (SHTP-clustering).

4.1 HTP-clustering

The HTP-clustering constructs the TP-cluster tree by suffix concatenation in conjunction with extracting only biologically relevant TP-Clusters. The inputs to the HTP-clustering include the database D, GO, and the function enrichment threshold θp. The TP-cluster tree is constructed hierarchically in a top-down fashion, along which the dataset D is partitioned. The HTP-clustering adopts a depth-first pre-order traversal to build the TP-cluster tree. We prefer the depth-first order over the breadth-first order because it minimizes the amount of storage needed at each level to develop clusters at the next level. The depth-first traversal still guarantees the correctness of the result because, for each node, the construction of its subtree is independent of the construction of its siblings.

The HTP-clustering process can be summarized in two steps:

1. We first preprocess the data. Each row in the data matrix is converted to an ordered sequence of column labels based on the rank function in Definition 3.1. These sequences are the inputs to the next step. An initial prefix tree containing the sequence of every gene in the database is constructed.

2. The ontology information of genes is fed into the TP-cluster tree at the root level. The initial prefix tree is recursively visited and developed in depth-first order to reveal all frequent subsequences, which represent TP-Clusters. Ontology-based pruning is performed when visiting each node.

We focus on the second step, which is the more challenging and important part of the mining process. The data structure representing the TP-cluster tree is defined below.

1. It consists of one root labelled "-1" and a set of subtrees as the children of the root;

2. Each node (except the root) has four entries: the entry value, a link to its first child node, a link to its next sibling node, and the list of gene IDs, each of which has a suffix corresponding to the path from the root to this node. In other words, the gene IDs are only recorded at the node that marks the end of a common subsequence.

We use the dataset in Table 1 in the following example to illustrate the suffix concatenation step during the tree construction process.
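The node layout described above (entry value, first-child link, next-sibling link, gene ID list) can be sketched as follows, together with the construction of the initial prefix tree from the Table 1 sequences. The class and function names are illustrative, and the sketch covers only the initial tree, not the suffix concatenation step.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    value: str                               # condition label, "-1" for the root
    first_child: Optional["Node"] = None
    next_sibling: Optional["Node"] = None
    gene_ids: List[int] = field(default_factory=list)


def find_or_add_child(parent: Node, value: str) -> Node:
    """Walk the sibling list of parent's children; append a new node if absent."""
    prev, cur = None, parent.first_child
    while cur is not None:
        if cur.value == value:
            return cur
        prev, cur = cur, cur.next_sibling
    node = Node(value)
    if prev is None:
        parent.first_child = node
    else:
        prev.next_sibling = node
    return node


def build_initial_tree(sequences: Dict[int, str]) -> Node:
    """Insert every gene's full sequence; record the gene ID at the path's end."""
    root = Node("-1")
    for gene_id, seq in sequences.items():
        node = root
        for label in seq:
            node = find_or_add_child(node, label)
        node.gene_ids.append(gene_id)
    return root


def child_values(node: Node) -> List[str]:
    """Labels of a node's children in sibling order (helper for inspection)."""
    out, cur = [], node.first_child
    while cur is not None:
        out.append(cur.value)
        cur = cur.next_sibling
    return out


root = build_initial_tree({1: "dbac", 2: "cbda", 3: "cdba", 4: "cdab"})
```

For Table 1 the root has two children, d (from gene 1) and c (shared by genes 2, 3 and 4), and the c node branches into b and d.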

Example 4.1 For the sequences in Table 1, the initial prefix tree representing the whole database is presented in Figure 5 (A), and the suffix concatenation upon visiting the first node "-1" is illustrated in Figure 5 (B).

[Figure 5 panels: (A) Initial tree; (B) First suffix concatenations at level 1.]

Figure 5. The illustration of suffix tree concatenation.

Let us denote the node currently being visited as the active node. Given an active node in the TP-cluster tree construction process, for example, the root "-1" in Figure 5 (B), the suffixes to be inserted into the subtree of "-1" are those inside the rectangular box shown in Figure 5 (A). The concatenation of the suffixes to the current active node is done by merging the suffix tree of the active node with the corresponding subtree one level below the active node. For example, suffix tree "-1cd" in (A) is merged with "-1d". The generated subtree is shown as the "-1d" subtree in (B). (B) is the resulting tree after the visit of the node "-1". The same procedure is applied recursively in depth-first order to construct the TP-cluster tree. For example, after the first node visit at the root "-1", the next node to be visited is "-1c" and the suffixes inside the rectangular box in Figure 5 (B) are the next set of suffixes to be inserted. The TP-Cluster algorithm without any biological assessment is presented in Algorithm growTree.

Algorithm growTree(H, depth)
Input: H: the root of the initial tree
Output: TP-Clusters existing in H
(* Grow patterns on the initial TP-Cluster H *)
    if H = nil
        return;
    Hchild ← H's first child;
    for any sub-tree subH of H
        do insertSubTree(subH, H);
    growTree(Hchild, depth + 1);
    growTree(H's next sibling, depth);
    return.

The correctness of the construction of the TP-cluster tree is established in Lemma 4.1.

Lemma 4.1 Given a database D, the TP-cluster tree contains all TP-Clusters embedded in the database.

Rationale: According to Definition 3.2, each TP-Cluster corresponds to a unique sequence of conditions. Therefore, proving Lemma 4.1 is equivalent to proving that the TP-cluster tree contains all frequent subsequences of the set of sequences representing the rows in the database. Given any subsequence S′, we want to prove that all sequences containing S′ will be projected onto the path that corresponds to S′. Given any sequence S = x1x2x3x4 ... xn, we want to show that all subsequences of S can be found on a path starting from the root. S is inserted into the tree during the initiation procedure. Then, given any subsequence SS = xixj ... xs (1 ≤ i, s ≤ n), we can obtain SS by the following steps. First, at node xi, insert the suffix xixi+1 ... xn. Now, in the subtree rooted at xi, node xj can be found because it lies along the path xixi+1 ... xn that is inserted in the first step. Similarly, we insert the suffix xj ... xn. As a result, we get the path xixjxj+1 ... xn. By repeating the same procedure until we insert the suffix starting with xs, we get the path xixj ... xs. Since the path representing a subsequence is unique, all sequences containing S′ fall on the node corresponding to S′. The TP-cluster tree contains all the subsequences or, equivalently, all TP-Clusters embedded in database D.

4.2 SHTP-clustering

Based on the above HTP-clustering, we develop the SHTP-clustering algorithm by incorporating ontology knowledge into the clustering process. The ontology information serves the following two purposes: (1) the assessment of the function enrichments of a cluster; (2) guidance in selecting the subset of conditions critical to a function category. (Along that subset of conditions, a significant number of genes in that function category may stay close.) These two uses of ontology information are transformed into two pruning techniques in the SHTP-clustering algorithm.


The first pruning technique is based on the distribution of function groups in a cluster. For any cluster C, we expect that there exists at least one function category in C that is statistically significant. Given a cluster C and the distribution of categories, we use the following lemma for early detection of the potential appearance of significant function categories.

Lemma 4.2 Let C be a cluster, let V = {v1, v2, ..., vt} be a set of function categories and let S be a counter vector in which each element si records the number of genes of category vi that appear in C. Let the minimum number of genes required in a cluster be nr and let θp be the P-value threshold. ∀ vi, vi ∈ V, let Ci′ be a cluster of size nr that contains min(si, nr) genes in function category vi. If ∀ i, P(si, Ci′) > θp, then C will not become an enriched cluster.

Proof 4.1 ∀ vi, vi ∈ V, we have P(si, Ci′) ≤ P(si, C) based on the property of the P-value, i.e., the P-value increases as the number of genes in the same cluster increases. According to the condition in the lemma, we have θp < P(si, Ci′) ≤ P(si, C). Hence, according to Definition 3.6, C will not be functionally enriched in any of the function categories in V.

This pruning technique takes at most O(|V|) time, by scanning through each of the function categories in V and computing the smallest P-value that might occur.
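The early-detection test of Lemma 4.2 can be sketched as a single pass over the categories. The sketch assumes the hypergeometric upper-tail P-value of Section 3.3 (with the "at least k" convention) and uses hypothetical names (`tail_p`, `may_become_enriched`) and toy numbers; the category sizes and genome size are inputs that the lemma leaves implicit.

```python
from math import comb
from typing import Dict


def tail_p(k: int, n: int, f: int, g: int) -> float:
    """P(X >= k) for a cluster of n genes, category of f genes, genome of g genes."""
    below = sum(comb(f, i) * comb(g - f, n - i) for i in range(min(k, n + 1)))
    return 1.0 - below / comb(g, n)


def may_become_enriched(counts: Dict[str, int],
                        category_sizes: Dict[str, int],
                        n_r: int, genome_size: int, theta_p: float) -> bool:
    """Best case per category: a cluster of the minimum size n_r that retains
    min(s_i, n_r) category genes. If even that P-value exceeds theta_p for
    every category, the cluster can be pruned."""
    for category, s_i in counts.items():
        k = min(s_i, n_r)
        if tail_p(k, n_r, category_sizes[category], genome_size) <= theta_p:
            return True
    return False


# Toy numbers (assumptions): genome of 100 genes, minimum cluster size 5.
sizes = {"big": 20, "small": 5}
prune_me = may_become_enriched({"big": 1}, sizes, n_r=5,
                               genome_size=100, theta_p=0.01)
keep_me = may_become_enriched({"big": 1, "small": 5}, sizes, n_r=5,
                              genome_size=100, theta_p=0.01)
```

A cluster holding a single gene of a large category cannot become enriched at this threshold, whereas a cluster that contains all five genes of a small category still can.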

The second technique is to use the OST extracted in a parent cluster to guide the selection of its descendant TP-Clusters, by favoring the biological children clusters defined in Definition 3.8. Our criterion is based on the hypothesis that TP-Clusters in a higher dimensional space are enriched in more specific categories.

Criterion 4.1 Let C and C′ be two clusters and let C be the parent of C′ in the TP-Cluster hierarchy. We say that the development of C′ is not viable if OST_C′ ⋠ OST_C, i.e., if C′ is not a biological descendant of C in the sense of Definition 3.8.

The extraction of the OST from a cluster also takes O(|V|) time if we represent each gene category with a GO code [12]. At most two scans of the ordered GO codes are necessary to generate the OST. Combining the two pruning techniques, we apply the following procedure at each node of the traversal.

1. Evaluate the prediction potential of the cluster corresponding to this node. If it has no potential to become a functionally enriched cluster according to Lemma 4.2, stop further development of this node and its descendants, then go to the next node in the predefined traversal order. Otherwise, go to step 2.

2. Extract the OST of the cluster. If the OST is not biologically viable according to Criterion 4.1, stop further development of this node and its descendants. Go to the next node in the predefined traversal order.

We present the SHTP-clustering algorithm for extracting biologically relevant TP-Clusters in Algorithm smartGrowTree. Its major differences from the HTP-clustering algorithm are the recursive feeding and pruning of the OST structure, and the cluster evaluation and pruning based on the significance of the OST.

Algorithm smartGrowTree(H, nc, nr, depth, parentOST)
Input: H: the root of the initial tree.
Output: TP-Clusters existing in H, the original OST.
(* Grow patterns on the initial TP-Cluster H. *)
    if H = nil
        return;
    Hchild ← H's first child;
    Hsib ← H's next sibling;
    for any sub-tree subH of H
        do insertSubTree(subH, H);
    curOST = extractOST(H);
    if (curOST is not empty)
        if (curOST ⪯ parentOST)
            smartGrowTree(Hchild, nc, nr, depth + 1, curOST);
        else smartGrowTree(Hsib, nc, nr, depth, parentOST);
    else
        potential = evalFunction(H, parentOST)
        if (potential = good)
            smartGrowTree(Hchild, nc, nr, depth + 1, curOST);
        else smartGrowTree(Hsib, nc, nr, depth, parentOST);
    return.

Analysis of HTP-clustering and SHTP-clustering construction. For both HTP-clustering and SHTP-clustering, only one scan of the entire data matrix is needed during the clustering. Each row is first converted into a sequence of column labels. The sequences are then inserted into the prefix tree. In the initial tree structure, sequences with the same prefix naturally fall onto the same path from the root to the node corresponding to the end of the prefix. To be memory efficient, the row/gene IDs associated with each path are only recorded at the node marking the end of the longest common prefix shared by these sequences. The depth-first pre-order traversal is then applied to the prefix tree to generate a TP-cluster tree. Pruning techniques based on ontology knowledge further produce an STP-cluster tree without generating the complete TP-cluster tree.
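A minimal Python sketch of the initial prefix-tree construction follows. It assumes that a row is converted to a sequence by ordering its column labels by increasing expression value; the actual conversion rule is defined earlier in the paper and is only approximated here. `PrefixNode`, `row_to_sequence`, and `build_prefix_tree` are illustrative names. For simplicity, the sketch records each row ID at the node that ends its full sequence rather than at the end of the longest common prefix.

```python
from typing import Dict, List, Sequence


class PrefixNode:
    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}
        self.rows: List[int] = []  # IDs of rows whose sequence ends here


def row_to_sequence(values: Sequence[float]) -> List[int]:
    """Column labels ordered by increasing expression value (assumed rule)."""
    return sorted(range(len(values)), key=lambda c: values[c])


def build_prefix_tree(matrix: Sequence[Sequence[float]]) -> PrefixNode:
    """Insert every row's column sequence in a single scan of the matrix."""
    root = PrefixNode()
    for row_id, values in enumerate(matrix):
        node = root
        for col in row_to_sequence(values):
            node = node.children.setdefault(col, PrefixNode())
        node.rows.append(row_id)
    return root


# Rows 0 and 1 share the same ordering of columns; row 2 differs.
matrix = [
    [0.1, 0.9, 0.5],
    [1.0, 7.0, 3.0],
    [0.8, 0.2, 0.4],
]
tree = build_prefix_tree(matrix)
```

Rows with the same ordering of columns end at the same node, so a single path carries all genes that share that expression tendency.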

Both the time and space complexities of the two algorithms are exponential in the worst case, which is inherent to an NP-hard problem. Given a gene expression matrix of size n × m, the size of the tree in the worst case is $\sum_{s=1}^{m} s! \binom{m}{s}$. However, since we traverse the tree depth-first and the part of the tree that has already been traversed is not needed for future mining, it can be deleted and its space reused. At level i (i ≠ 0), we only need to keep (m − i + 1) nodes. Therefore, the maximal space allocated at any time during the run is limited to $O(n \sum_{i=1}^{m} (m - i + 1)) = O(n \cdot m^2)$.
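As a quick arithmetic check of the two quantities above, the following Python sketch evaluates the worst-case tree size (the sum over s = 1..m of s! times "m choose s") and the number of nodes kept during the depth-first traversal (the sum over i = 1..m of m − i + 1, which equals m(m + 1)/2). The function names are illustrative.

```python
from math import comb, factorial


def worst_case_tree_size(m: int) -> int:
    """Sum over s = 1..m of s! * C(m, s): ordered selections of s distinct columns."""
    return sum(factorial(s) * comb(m, s) for s in range(1, m + 1))


def nodes_kept_during_dfs(m: int) -> int:
    """Sum over levels i = 1..m of (m - i + 1), which equals m * (m + 1) / 2."""
    return sum(m - i + 1 for i in range(1, m + 1))


for m in (3, 5, 18):
    print(m, worst_case_tree_size(m), nodes_kept_during_dfs(m))
```

For m = 18 columns the second quantity is 171 nodes, so the working set grows quadratically in m even though the full tree grows factorially.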

SHTP-clustering is more space and time efficient than HTP-clustering because the ontology-guided pruning techniques cause fewer biologically irrelevant TP-Clusters to be generated. The pruning effects are largely determined by the relationship between a TP-Cluster and the significance of its underlying functional categories.

5 Evaluation

Our experiments demonstrate the applicability of the SHTP-clustering algorithm in clustering biologically related genes with effective pruning techniques based on GO. The results are evaluated through the comparison of HTP-clustering and SHTP-clustering, and through the mapping between the TP-cluster tree and the GO hierarchy. The algorithm was implemented in C and executed on a Linux machine with a 700 MHz CPU and 2 GB of main memory.

Both the HTP-clustering and SHTP-clustering algorithms are tested on the yeast cell cycle data of Spellman et al. [19]. The study monitored the expression levels of 6,218 S. cerevisiae putative gene transcripts (genes) measured at 10-minute intervals over two cell cycles (160 minutes) with 18 time points. Spellman et al. identified 799 genes that are cell cycle regulated. We used the expression levels of these 799 genes across the 18 time points as the original input matrix. The clustering procedure groups together genes on the basis of their common expression tendency across a subset of time points.

To assess the biological relevance of the clusters, we use GO and the P-value to evaluate whether a cluster has significant enrichment in one or more function groups. The ontology of the 799 yeast genes was downloaded from the Gene Ontology Consortium [23] in February 2004. We use functions from the three categories: molecular function (MF), cellular component (CC), and biological process (BP). We extract categories between ontology level 2 and level 5 with a family size of at least 5. The discovered TP-Clusters in each level of the hierarchy are evaluated for enrichment with any of those function categories.
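The enrichment test can be illustrated with a hypergeometric tail probability, a common choice for scoring GO enrichment. This is a sketch under that assumption rather than the exact P-value definition used in this paper; `enrichment_p_value` and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(total: int, in_category: int, cluster_size: int, hits: int) -> float:
    """P(X >= hits) when cluster_size genes are drawn without replacement
    from `total` genes, of which `in_category` belong to the category."""
    upper = min(in_category, cluster_size)
    denom = comb(total, cluster_size)
    tail = sum(
        comb(in_category, k) * comb(total - in_category, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / denom


# Hypothetical counts: 799 genes, a category with 40 of them,
# and a cluster of 20 genes containing 8 members of that category.
p = enrichment_p_value(total=799, in_category=40, cluster_size=20, hits=8)
theta_p = 1e-4  # significance threshold
print(p < theta_p)
```

A cluster would be called enriched for a category when this probability falls below the chosen threshold θp.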

Types   #Known genes   #Categories (#genes > 5)   #Anno per gene
MF      370            16                         0.77
CC      616            48                         3.4
BP      538            38                         5.72

Table 2. Statistics for the three categories.

Figure 6. The performance of the SHTP-clustering varying nr and θp. a) Response time (s) versus nr, with one curve for each of θp = 10^-3, 10^-4, 10^-5, 10^-6, 10^-7. b) The distribution of the clusters: percentage of OST-pruned, enriched, and poor clusters versus log(θp).

5.1 Performance

The first set of experiments used the SHTP-clustering algorithm with the cellular component ontology category to evaluate performance while varying the parameters nr and θp. As shown in Figure 6 a), the response time of the SHTP-clustering algorithm decreases as the significance threshold decreases and as the minimum number of rows increases. According to the pruning strategy of Lemma 4.1, a stricter significance threshold (a smaller θp) allows clusters with poor functional implication to be dropped early, and more early pruning yields a shorter response time. The parameter nr helps to prune clusters through the size limitation. Applying the same algorithm to the other two categories exhibits the same trend when varying nr and θp.

Figure 7. The comparison between SHTP-clustering and HTP-clustering. a) The comparison of response time: response time (s) versus log(θp) for HTP and SHTP with nr = 50 and nr = 70. b) The comparison of enriched clusters and total clusters: number of clusters versus log(θp) for SHTP-Enriched, HTP-Enriched, SHTP-Total, and HTP-Total.

Figure 8. The comparison of performance of SHTP-clustering among three categories: response time (s) versus log(θp) for BP, CC, and MF.

Figure 6 b) presents the distribution of the generated clusters in three categories: clusters that are not enriched, enriched clusters, and enriched clusters that do not follow their parent's OST according to Criterion 4.1. The percentage of non-enriched clusters increases significantly as θp decreases, which also explains the performance gain of SHTP-clustering. In addition, the percentage of clusters pruned due to Criterion 4.1 drops significantly relative to the percentage of enriched clusters as the significance threshold decreases. This may indicate that the more significant the enrichment of a cluster, the higher the probability that its OST points in the right direction for selecting the biologically appropriate biclusters.

The second set of experiments, shown in Figure 7 a), compares the SHTP-clustering algorithm with the HTP-clustering algorithm. For each algorithm, we ran two tests with different settings of nr. We observe a significant and consistent improvement of the SHTP-clustering algorithm over HTP-clustering, especially when θp is relatively low. The response time of SHTP-clustering can be as short as 1/4 of that of the HTP-clustering algorithm.

Figure 7 b) compares the number of enriched clusters and the total number of clusters for the HTP-clustering and SHTP-clustering algorithms. Clearly, the HTP-clustering algorithm generates a large number of TP-Clusters, of which only a small percentage are enriched. Compared with HTP-clustering, SHTP-clustering generates fewer than half as many TP-Clusters and almost the same number of enriched TP-Clusters. Overall, SHTP-clustering improves the performance with minimal loss of enriched clusters.

Figure 8 compares the response times for the three available ontology files, i.e., MF, CC, and BP. We observe a clear trend: the experiments using the biological process category consistently take more time than the other two. This can be explained by the data in Table 2. The average number of BP categories that a gene has is 5.7, which is much higher than that of either the cellular component or the molecular function files. With fewer categories but more gene annotations, the distribution of function groups in a cluster has a higher probability of being concentrated in one or more function groups rather than being evenly distributed. As a result, fewer functional clusters are pruned and, hence, the response time is longer. In addition, this is consistent with the hypothesis that similar gene expression profiles may indicate a functional relation in biological process [4]. As a result, more time is spent generating a larger number of significantly enriched clusters than with the other two ontology files.


Overall, our experiments show that the ontology-based pruning is effective in reducing the search space of biclustering. In addition, the response time of our algorithm is influenced by the two input parameters and by the distribution of genes in each category of the ontology.

5.2 Mapping between GO and the TP-cluster tree

We present a generic example of hierarchically organized clusters that map to a hierarchical substructure of GO.

In Figure 9, (A) presents a three-level hierarchy of TP-Clusters, while (B) shows the corresponding OSTs. The gene ontology path summarizing the relationships among all the function categories appearing in (B) is "Nucleoside → DNA metabolism → DNA repair".

The root cluster C01 in (A) is the largest cluster, with 71 genes. However, it has the smallest number of conditions shared by all genes in the cluster, i.e., (4, 15, 13, 8). Its OST, shown at the top of the hierarchy in (B), is rooted at the category Nucleoside. As we go down the hierarchy of clusters in (A), clusters tend to contain fewer genes but share a larger number of consistent conditions. In addition, the OST of a cluster is likely to lie in the subtree of the OST of its parent cluster. For example, the root cluster C01 is split into two smaller overlapping clusters, C11 and C12, featuring the enriched function "DNA metabolism", which is a subcategory of Nucleoside. OST_C11 and OST_C12 suggest that the two clusters in level one have a more significant grouping at a deeper level in the GO hierarchy than cluster C01. A further clustering of cluster C12 into C21, with six conditions, again signifies an even deeper function group, i.e., "DNA repair".
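The parent and child relationship between the OSTs in this example can be checked mechanically. The Python sketch below encodes only the three-category chain named above as a child-to-parent map, which is a simplification because GO is a directed acyclic graph in which a term may have several parents, and tests whether one category lies in the subtree of another. `GO_PARENT`, `in_subtree`, `ost_root`, and `cluster_parent` are illustrative names.

```python
from typing import Dict, Optional

# Child -> parent links for the chain discussed in the text (toy fragment of GO).
GO_PARENT: Dict[str, Optional[str]] = {
    "Nucleoside": None,
    "DNA metabolism": "Nucleoside",
    "DNA repair": "DNA metabolism",
}


def in_subtree(category: str, ancestor: str) -> bool:
    """True if `category` equals `ancestor` or is one of its descendants."""
    current: Optional[str] = category
    while current is not None:
        if current == ancestor:
            return True
        current = GO_PARENT.get(current)
    return False


# Root category of each cluster's OST, as described for Figure 9.
ost_root = {"C01": "Nucleoside", "C11": "DNA metabolism",
            "C12": "DNA metabolism", "C21": "DNA repair"}
cluster_parent = {"C11": "C01", "C12": "C01", "C21": "C12"}

for child, parent in cluster_parent.items():
    print(child, parent, in_subtree(ost_root[child], ost_root[parent]))
```

Every child cluster in the example passes the check, while the reverse direction (for instance, Nucleoside within the subtree of DNA repair) does not.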

This example illustrates the connection between the ontology hierarchy and the STP-cluster tree. Our experiments demonstrate that it is possible that only a subset of conditions matters for an ontology category. In addition, the deeper the level of a category within the GO hierarchy, the more conditions there are under which the genes in that category have similar expression profiles.

6 Conclusions and Future Work

Clustering of gene expression data has been used for gene function prediction based on the hypothesis that similar expression profiles indicate a functional relation in biological activities. However, traditional clustering algorithms are weak in modelling the hierarchy of GO because they cannot produce a hierarchy of overlapping clusters of various sizes. To overcome these problems, we propose to use a hierarchy of TP-Clusters to match the hierarchy of GO. We present a biclustering algorithm guided by GO, SHTP-clustering, which efficiently and effectively extracts the biologically relevant gene clusters. Our experiments on yeast gene expression data demonstrate the effectiveness of the ontology-based pruning techniques. In future work, we will use the generated STP-cluster tree for effective classification of unknown genes.

References

[1] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. J Comput Biol, 6(3-4):281-97.

[2] A. Ben-Dor, N. Friedman, and Z. Yakhini. Class discovery in gene expression data. In RECOMB, 2001.

[3] A. Ben-Dor, B. Chor, R. Karp, and Z. Yakhini. Discovering Local Structure in Gene Expression Data: The Order-Preserving Submatrix Problem. In RECOMB, 2002.

[4] M.P.S. Brown, W.N. Grundy, N. Cristianini, C.W. Sugnet, T.S. Furey, M. Ares, and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machine. Proc. Natl Acad. Sci. USA, 97, 262-267.

[5] I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, M. Raffeld, Z. Yakhini, A. Ben-Dor, E. Dougherty, J. Kononen, L. Bubendorf, W. Fehrle, S. Pittaluga, S. Gruvberger, N. Loman, O. Johannsson, H. Olsson, B. Wilfond, G. Sauter, O.-P. Kallioniemi, A. Borg, and J. Trent. Gene-expression profiles in hereditary breast cancer. NEJM, 344:539-548, 2001.

[6] Y. Cheng and G. Church. Biclustering of expression data. In ISMB, 2000.

[7] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A, 95(25):14863-8, 1998.

[8] T. R. Hvidsten, A. Lagreid, and J. Komorowski. Learning rule-based models of biological process from gene expression time profiles using Gene Ontology. Bioinformatics, 2003.

[9] G. Getz, E. Levine, and E. Domany. Coupled two-way clustering analysis of gene microarray data. Proc. Natl Acad. Sci. USA, 97, 12079-12084, 2000.

[10] S. Kaski, J. Nikkilä, and G. Wong. Analysis and Visualization of Gene Expression Data Using Self-Organizing Maps. Proceedings of NSIP-01, IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, 2001.


(A) Expression profiles of the TP-Cluster subtrees (B) Corresponding OSTs of clusters in (A)

Figure 9. An example of mapping from a hierarchy of TP-Clusters to their OSTs. For each cluster in (A), the rows correspond to the genes while the columns correspond to the conditions.

[11] L. Lazzeroni and A. Owen. Plaid models for gene expression data. 2000.

[12] S.G. Lee, J.U. Hur, and Y.S. Kim. A graph-theoretic modeling on GO space for biological interpretation of gene clusters. Bioinformatics, 2004.

[13] J. Liu and W. Wang. Subspace clustering by tendency in high dimensional space. In ICDM, 2003.

[14] E. Segal, B. Taskar, A. Gasch, N. Friedman, and D. Koller. Rich probabilistic models for gene expression. Bioinformatics, 17, S243-S252, 2001.

[15] J. P. Shaffer. Multiple Hypothesis Testing. Ann. Rev. Psych., 46, 561-584, 1995.

[16] R. Sharan and R. Shamir. Click: A clustering algorithm with applications to gene expression analysis. In ISMB, pages 307-316, 2000.

[17] H. Shatkay, S. Edwards, W. Wilbur, and M. Boguski. Genes, themes and microarrays: using information retrieval for large-scale gene analysis. In ISMB, pages 317-328, 2000.

[18] G. Sherlock. Analysis of large-scale gene expression data. Curr. Opin. Immunol., 12, 201-205.

[19] P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein, and B. Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9:3273-3297, 1998.

[20] A. Tanay, R. Sharan, and R. Shamir. Discovering statistically significant biclusters in gene expression data. Bioinformatics, Vol. 18, pages S136-S144, 2002.

[21] S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church. Yeast micro data set. http://arep.med.harvard.edu/biclustering/yeast.matrix, 2000.

[22] L.F. Wu, T.R. Hughes, A.P. Davierwala, M.D. Robinson, R. Stoughton, and S.J. Altschuler. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nature Genetics, 31:255-265, 2002.

[23] Gene Ontology Consortium, www.geneontology.org.
