Functional transcriptomes: comparative analysis of biological pathways and processes in eukaryotes to infer genetic networks among transcripts
Hidemasa Bono* and Yasushi Okazaki†

Microarray technology enables us to monitor large changes in transcripts at any given time. The compilation of these data makes possible the comparison of such gene expression data on a genome-wide scale. As comparisons of genome sequence data yield new biological insights, comparative analyses of transcriptome data also promise new discoveries regarding metabolic pathways and cellular processes. The coordinated expression of genes shows that these genes physically interact with each other or are part of the same cascade. We have produced one of the largest expression profiles of adult mice and developmental tissues. These data, as well as the data on yeast from previous reports, were used to see whether coordinated expression (with high correlation coefficient) is closely coupled to the actual cascade on the pathway map.

Addresses
Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
*e-mail: [email protected]
†e-mail: [email protected]

Current Opinion in Structural Biology 2002, 12:355–361

0959-440X/02/$ (see front matter)
© 2002 Elsevier Science Ltd. All rights reserved.

Abbreviations
EC number: Enzyme Commission number
GO: Gene Ontology
KEGG: Kyoto Encyclopedia of Genes and Genomes
ORF: open reading frame
UPGMA: unweighted pair group method with arithmetic mean

Introduction
The availability of the complete genome sequences of tens of model organisms and the draft version of the human genome sequence [1,2] has completely changed analyses of the genetic information encoded by genomes. By focusing on their similarities and differences, comparing the genome sequences of multiple organisms reveals new biological insights that cannot be found when only the sequence encoded by a single genome is studied [3,4]. For example, Tatusov et al. [5] compared in detail the protein products, often called open reading frames (ORFs), encoded by the Haemophilus influenzae and Escherichia coli genomes in order to deduce important aspects of the largely uncharacterized metabolism of H. influenzae. Furthermore, Watanabe et al. [6] compared the genome locations of orthologous genes among four bacterial genomes (H. influenzae, Mycoplasma genitalium, E. coli and Bacillus subtilis). These investigators found several regions that were highly conserved among all four bacteria, with the longest conserved region comprising the S10, spc and alpha operons.

Microarray technology enables us to monitor multiple whole transcriptomes simultaneously [7]. The first definitive example of this approach assessed the metabolic reprogramming that occurs during the diauxic shift in yeast [8]. Another important feature of this particular publication was that all the data presented were accessible through the Internet as supplementary data, thereby facilitating subsequent, more advanced analyses. The main features of transcriptome data are that they are dynamic and time-dependent, whereas genome sequence data are static and robust. This implies that the notion of comparison is a more basic feature of the transcriptome data. In this review, we will focus on the expression profiles of genes involved in the metabolic pathway of glycolysis in yeast and mouse to see whether the coordinated expression of genes is closely coupled to the cascade on the pathway map. The assignment of EC (Enzyme Commission) number to each gene is the first step towards this type of analysis. Although the EC numbers of yeast genes were well assigned after the completion of the yeast (Saccharomyces cerevisiae) genome sequence, mouse EC number assignment has been slower. We have made the largest mouse full-length cDNA collections (RIKEN full-length cDNAs) so far.
Using these resources, we have also produced the expression profiles of various adult tissues and developmental stages of the embryo. The assignment of EC number to each gene is a somewhat laborious step. We will describe the strategy we have used to assign EC numbers to the RIKEN full-length cDNAs and show the link between the coexpression of these genes and the cascade on the pathway map.

Systematic analyses of gene function using microarray data
As the genome sequences of various organisms have been obtained, the possible functions of their ORFs have been predicted systematically in light of their sequence similarity to other functionally known genes. However, the function of many ORFs cannot be inferred from only a sequence similarity search. Approaches using protein sequence motifs and/or the three-dimensional structure, predicted in light of the deduced amino acid sequence of the ORF, have also been used to predict the function of the corresponding gene. Unfortunately, these approaches have not greatly improved our understanding because they are primarily based on sequence information from the genome. If the functional prediction process was performed inappropriately, its results might lead to the creation and propagation of functional assignment errors [9]. For inferring the gene functions of ORFs that do not have any significant sequence similarity to known genes, gene expression information can be an alternative unique resource.

[Figure 1: image not reproduced. Thresholds: <0.6, 0.6, 0.7, 0.8, 0.9.]

Eisen et al. [10] published the results of an analysis of gene expression profiles taken from 79 experiments in S. cerevisiae. Although the microarray covered only 2469 well-annotated ORFs, this first large and publicly available gene expression data set prompted many bioinformaticians to analyze the information. Some of these researchers developed clustering algorithms specifically designed for microarray data, and others tested the feasibility of predicting gene function from gene expression data by adopting the same clustering algorithms used to infer the function of ORFs according to sequence similarity [11].

Such a large set of microarray data was helpful in ‘calibrating’ the results of the cluster analysis of gene expression data from microarray experiments. Simple hierarchical clustering algorithms (i.e. single linkage, complete linkage and the unweighted pair group method with arithmetic mean [UPGMA]) were tested initially. Several other methods, including nonhierarchical clustering methods (e.g. k-means and self-organizing maps [SOMs]), have been reviewed [12]. UPGMA seemed to be best for checking the functional coupling within these clusters because this method yields gene clusters of the appropriate size [13,14]. The largest cluster resulting from the UPGMA clustering analysis of the yeast data by Eisen et al. [10] comprised ribosomal proteins, and the second largest cluster corresponded to ORFs that encode enzymes of the glycolysis pathway (Figure 1). The assignment of EC numbers to the ORFs in yeast was taken from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [15•]. This database is a key resource for enzymes because it describes their functions and enables the mapping of enzymatic genes to their respective metabolic pathways. The ORFs in the glycolysis pathway that have corresponding EC numbers were summarized and color-coded according to the threshold for UPGMA (see Figure 1, upper and middle). The gene expression patterns of the ORFs in the orthologous gene table from KEGG (Figure 1, middle), as well as those of the ORFs in the gene expression cluster, were extracted from the entire data set to show the functional features of this pathway (Figure 1, bottom). Significant pattern differences were found for ORFs YOL056W and YDL021W, both of which encode phosphoglycerate mutase (EC 5.4.2.1). The gene names of these are GPM3 and GPM2, respectively. These ORFs have been described as nonfunctional homologs of GPM1 (YKL152C), even though they have conserved key amino acid residues involved in catalysis [16]. This type of functional analysis is readily explored by comparing transcriptome data.
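The clustering step described above can be illustrated with a minimal sketch. It groups expression profiles by UPGMA (average linkage) on the Pearson correlation coefficient and stops merging once the mean correlation between the two closest clusters falls below a threshold. The function names, the toy profiles and the 0.8 threshold are assumptions chosen for illustration; this is not the implementation used in [13,14].

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def upgma_clusters(profiles, threshold):
    """Merge clusters by average linkage until no pair of clusters has
    a mean pairwise correlation of at least `threshold`."""
    clusters = [[name] for name in profiles]

    def avg_corr(c1, c2):
        # UPGMA similarity: arithmetic mean over all cross-cluster gene pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), avg_corr(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]


# Hypothetical toy profiles: two genes rising and two genes falling.
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.1, 2.9, 4.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [3.9, 3.2, 1.8, 1.1],
}
clusters = upgma_clusters(toy_profiles, threshold=0.8)
print(clusters)
```

Raising the threshold yields smaller, tighter clusters and lowering it yields larger ones, which is the sense in which a threshold can be tuned to give clusters of a useful size.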

Although genes without functional descriptions were not reported in the above analysis by Eisen et al. [10], including data for functionally unassigned genes when searching for coexpression clusters will help to identify or infer the functions of these genes. As the assignment of EC numbers to metabolic enzymes in yeast is already largely complete, it is easy to map ORFs to appropriate positions in the metabolic pathway. For higher organisms, such as human, mouse or rat, however, EC number assignment is yet to be completed. We have been engaged in the mouse encyclopedia project and have developed a system that assigns functional annotation to the cDNAs. The details of the functional annotation system for the mouse cDNAs (FANTOM) have been reported elsewhere [17,18••]. The systematic assignment of a functional description to genes has been proposed by the Gene Ontology (GO) Consortium [19••]. We will briefly review the GO and the EC assignment strategy for mouse cDNA clones below.

Figure 1 legend
The second largest cluster from UPGMA cluster analysis of the yeast data (see text for details) mapped to the glycolysis pathway (top). In the top pathway and middle table, the similarity of the gene expression pattern is represented in color code (red for high similarity and green for low similarity). The gene expression pattern (bottom: red for up-regulated and green for down-regulated) was drawn using a tool from the EPTable software (http://eptable.sourceforge.net/).

Figure 2
A scheme for making gene associations between anonymous transcripts in the RIKEN mouse cDNA microarray and EC numbers (GO terms). To systematically draw a metabolic pathway map from the RIKEN cDNA clone set, EC number was primarily used. Because GO terms can be converted to EC numbers for enzymatic genes, GO terms were initially assigned in two ways. One was to follow links in the database (gray arrow: links from RIKEN clone identifier to MGD and from MGD to GO) and the other was to assign GO terms using a sequence similarity search (black arrow) against GO-supporting databases (SWISS-PROT and InterPro). The KEGG/GENES database, enriched in EC number assignment, was directly used to predict enzymatic genes with EC numbers (dashed arrow).
[Diagram not reproduced. Nodes: RIKEN 19K set (microarray), MGD, GO, EC.]

[Figure 3: image not reproduced. Color-coded expression matrix of RIKEN clones annotated to glycolysis-related enzymes (for example phosphofructokinase, aldolase, triosephosphate isomerase, glyceraldehyde-3-phosphate dehydrogenase, phosphoglycerate mutase, enolase, pyruvate kinase and lactate dehydrogenase) across adult tissue and embryonic-stage samples (for example kidney, brain, heart, liver, testis and staged embryos).]

Gene ontology as a computerized description of gene function
Many ORFs in yeast were well studied and then annotated in databases such as SGD (Saccharomyces Genome Database) [20] after completion of the genome sequence in 1996 [21]. The brief description of a gene (often called the ‘functional annotation’) can be key to finding new biological insights from the clusters generated by various algorithms. These functional annotations are obtained primarily from biological publications. As a result, the annotations are ‘human friendly’ but ‘computer unfriendly’ in some cases. For example, the text of these annotations can contain misspelled words and various synonyms. Therefore, the text should have a controlled vocabulary and the classification of this vocabulary should be sufficiently general to be applicable to various organisms. Toward this goal, the GO Consortium is committed to maintaining a controlled vocabulary for describing gene function; this terminology has become the de facto standard for the functional annotation of genes [19••]. During the first meeting addressing the functional annotation of mouse cDNAs (FANTOM), GO was used actively to annotate the mouse transcriptome [17,18••].

The proper assignment of GO terms to functionally known genes represented on microarray slides will help to further infer the roles of functionally unknown genes when they are in the same coexpression clusters. A convenient way of making this association list is to use as a reference either the well-curated gene association tables that are maintained by the GO Consortium or a GO-supporting database for model organisms (http://www.geneontology.org/). Figure 2 shows how we make such gene associations for RIKEN cDNA clones in arrays. As shown here, FANTOM-DB, which is the database of the RIKEN mouse encyclopedia, and MGD (Mouse Genome Database) were used to assign GO terms [22,23]. Direct assignments of GO terms are also made in light of information on protein sequences in SWISS-PROT [24], and on protein motifs and domains in InterPro [25], both of which are GO-supporting databases.
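The assignment scheme of Figure 2 can be pictured as a chain of table lookups. The sketch below is a simplified illustration, assuming each route is available as an in-memory dictionary; the function name, the table names and every identifier in the toy tables are hypothetical placeholders rather than real database records. Only the EC numbers themselves are taken from enzymes named in this review.

```python
def assign_ec(clone_id, clone_to_mgd, mgd_to_go, similarity_go, go_to_ec, direct_ec):
    """Collect EC numbers for one clone by the three routes of Figure 2."""
    go_terms = set()
    # Route 1 (database links): clone identifier -> MGD entry -> GO terms.
    mgd_id = clone_to_mgd.get(clone_id)
    if mgd_id is not None:
        go_terms.update(mgd_to_go.get(mgd_id, ()))
    # Route 2 (sequence similarity): GO terms transferred from database hits.
    go_terms.update(similarity_go.get(clone_id, ()))
    # GO terms that describe enzymatic functions are converted to EC numbers.
    ec_numbers = {go_to_ec[term] for term in go_terms if term in go_to_ec}
    # Route 3 (direct): EC numbers predicted from KEGG/GENES-style matches.
    ec_numbers.update(direct_ec.get(clone_id, ()))
    return sorted(ec_numbers)


# Hypothetical toy tables; none of these identifiers are real accessions.
clone_to_mgd = {"clone_1": "mgd_1"}
mgd_to_go = {"mgd_1": ["go_aldolase"]}
similarity_go = {"clone_1": ["go_aldolase"], "clone_2": ["go_mutase"]}
go_to_ec = {"go_aldolase": "4.1.2.13", "go_mutase": "5.4.2.1"}
direct_ec = {"clone_3": ["2.7.1.11"]}

tables = (clone_to_mgd, mgd_to_go, similarity_go, go_to_ec, direct_ec)
for clone in ("clone_1", "clone_2", "clone_3", "clone_4"):
    print(clone, assign_ec(clone, *tables))
```

Collecting the routes into a set means that a clone reached by more than one route is still reported once per EC number, and a clone reached by none is returned with an empty list and so stays visibly unassigned.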

Once GO terms are assigned to all the ORFs in a genome, we can carry out functional analysis using the GO hierarchy. Another hurdle is the exploration of this hierarchy. By using GO as a common language for gene function, comparing multiple species via GO terms becomes feasible. But how can we explore the hierarchical structure of the ontology? There is no commonly accepted way to analyze it, which could be a problem when we use GO in the near future. As shown in Figure 2, EC numbers have also been assigned to study all the metabolic pathways [5,26]. Using this information, we can analyze the expression profiles of metabolic pathways.
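One simple way to explore such a hierarchy, offered here only as an illustrative sketch, is to treat the ontology as a directed acyclic graph in which a term may have several parents, and to compare two annotations through the ancestor terms they share. The function names and the toy ontology below are assumptions made for illustration; the term names are chosen to be readable and are not quoted from the GO files.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def shared_terms(term_a, term_b, parents):
    """Terms (including the inputs themselves) that cover both annotations."""
    covers_a = ancestors(term_a, parents) | {term_a}
    covers_b = ancestors(term_b, parents) | {term_b}
    return covers_a & covers_b


# Hypothetical toy ontology: child term -> list of parent terms.
toy_parents = {
    "aldolase activity": ["lyase activity"],
    "lyase activity": ["catalytic activity"],
    "kinase activity": ["transferase activity"],
    "transferase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
}
print(sorted(shared_terms("aldolase activity", "kinase activity", toy_parents)))
```

Two genes annotated with different specific terms can then still be grouped at the level of the most specific ancestor they share, which is one way of comparing annotations across species.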

Comparative analysis of metabolic transcriptomes
Following a concept similar to that used by Eisen et al. [10] to compile data, Miki et al. [27••] used the RIKEN 19K mouse microarray to complete comprehensive mammalian transcriptome data analyses of principal adult tissues and embryonic developmental stages. Although many difficulties remain in associating transcripts with genes mapped to pathways, the reconstruction of the mouse metabolic pathway could be achieved using the gene expression data presented in [27••]. Figure 3 shows the hierarchical clustering of mouse transcripts mapped to the glycolysis pathway.

Compared with the expression analysis of yeast ORFs (Figure 1), genes mapped to the mouse pathway were not expressed in a concerted way. However, some features can be deduced from these gene expression patterns. Although the experiments shown on the x-axis at the bottom of Figures 1 and 3, representing the experimental conditions used to produce the two large data sets, are not the same, the core enzymes that were clustered together by gene expression similarity showed high correspondence between mouse and yeast. These results suggest that the use of different organs or developmental stages in mouse provided experimental conditions similar to those used in yeast, for which different culture conditions or time courses were used. In Figure 3, the five genes colored blue-green were all derived from the testis library, showing a testis-specific metabolic pathway that is different from that of other organs. Hence, the clustering of genes can provide some insights into the tissue-specific regulation of genes. Although we focused only on the glycolysis pathway, comparative transcriptomes with GO terms may also give new insights into the analyses of the function of paralogues and/or orthologues, as have been widely studied by sequence analyses. As the public repository of gene expression data grows [28•], comparative analyses will be done more easily and precisely. Furthermore, currently unknown cascades could be explored by using a variety of transcriptome data [29•] and by the addition of interactome data for protein–protein interactions [30•,31•] in order to expand our knowledge of gene regulatory networks.
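The correspondence between coexpression clusters from two organisms can be quantified in several ways. The sketch below shows one simple possibility, assuming that genes in each organism have already been mapped to EC numbers: it reports the EC numbers shared by two clusters and their Jaccard overlap. The function names and gene identifiers are hypothetical, the Jaccard index is a choice made here for illustration rather than the measure used in this review, and the toy data do not reproduce Figures 1 or 3.

```python
def ec_set(cluster, gene_to_ec):
    """EC numbers covered by the genes in one coexpression cluster."""
    return {ec for gene in cluster for ec in gene_to_ec.get(gene, ())}


def ec_overlap(cluster_a, ec_map_a, cluster_b, ec_map_b):
    """Shared EC numbers and Jaccard overlap of two clusters."""
    a = ec_set(cluster_a, ec_map_a)
    b = ec_set(cluster_b, ec_map_b)
    union = a | b
    jaccard = len(a & b) / len(union) if union else 0.0
    return sorted(a & b), jaccard


# Hypothetical toy clusters and EC assignments for two organisms.
yeast_cluster = ["yeast_gene_1", "yeast_gene_2", "yeast_gene_3"]
yeast_ec = {
    "yeast_gene_1": ["4.1.2.13"],
    "yeast_gene_2": ["5.4.2.1"],
    "yeast_gene_3": ["2.7.1.11"],
}
mouse_cluster = ["mouse_clone_1", "mouse_clone_2", "mouse_clone_3"]
mouse_ec = {
    "mouse_clone_1": ["4.1.2.13"],
    "mouse_clone_2": ["5.4.2.1"],
    "mouse_clone_3": ["1.1.1.1"],
}

shared, jaccard = ec_overlap(yeast_cluster, yeast_ec, mouse_cluster, mouse_ec)
print(shared, jaccard)
```

Comparing clusters through EC numbers rather than gene names sidesteps the need for a one-to-one orthologue table, at the cost of treating isozymes with the same EC number as interchangeable.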

Figure 3 legend
The hierarchical clustering of genes assigned to the glycolysis pathway using microarray data [27••]. Tissue-specific isozymes were clustered differently (genes specifically up-regulated in muscle and testis are colored yellow and blue-green, respectively). The relative gene expression ratio is depicted according to the color scale shown at the top (gray indicates low threshold).


Conclusions
By making use of GO terms as the common language for assigning gene function, comparative analyses become more feasible in many ways. By integrating a variety of data, such as transcriptome and interactome data, the functional assignment of a gene reflects not only the comparative analysis of nucleotide and/or peptide sequences, but also gene and protein expression data. It might be more difficult, however, to interpret these results from the biological point of view. Although the optimal hybridization conditions differ markedly between yeast and mouse, preliminary comparative analysis of yeast and mouse transcriptome data for the well-understood glycolysis pathway showed that the gene expression clusters of core enzymes are conserved. There are, of course, some hurdles to overcome to achieve more sophisticated analyses of the entire gene network. These include increasing the sensitivity and reproducibility of the experiments, using a common reference in cDNA microarray experiments, and establishing a method for normalizing the data. These points are being actively discussed by the community in this field [28•]. A more precise description of the gene network could be achieved by obtaining multipoint time course data. By taking more points for various stimuli, genes with coordinated function could be clustered more tightly. These data can also give some ideas concerning the limiting step of the metabolic pathways.
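To make the normalization hurdle concrete, the following sketch shows one widely used convention for two-channel arrays: each spot is expressed as a log2 ratio of sample to common-reference intensity, and each array is then centered on its median. This is an assumed example of a normalization scheme, not a method prescribed in this review; the function names and intensity values are hypothetical.

```python
from math import log2
from statistics import median


def log_ratios(sample, reference):
    """Per-spot log2 ratio of sample intensity to common-reference intensity."""
    return [log2(s / r) for s, r in zip(sample, reference)]


def median_center(values):
    """Shift one array's log ratios so that their median becomes zero."""
    m = median(values)
    return [v - m for v in values]


# Hypothetical intensities for three spots on one array.
sample_intensity = [200.0, 400.0, 800.0]
reference_intensity = [100.0, 100.0, 100.0]

ratios = log_ratios(sample_intensity, reference_intensity)
normalized = median_center(ratios)
print(ratios, normalized)
```

Because every array is measured against the same reference and centered in the same way, profiles from different tissues or time points become comparable before correlation coefficients are computed.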

Acknowledgements
We thank Mitsuteru S Nakao for initial discussions of microarray data analysis and critical reading of the manuscript, and the RIKEN Genome Exploration Research Project members for preparing the cDNA clones and generating the expression profile data. This study was supported by a research grant to the RIKEN Genome Exploration Research Project from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government and by ACT-JST (Research and Development for Applying Advanced Computational Science and Technology) of the Japan Science and Technology Corporation (JST).

References and recommended readingPapers of particular interest, published within the annual period of review,have been highlighted as:

• of special interest••of outstanding interest

1. McPherson JD, Marra M, Hillier L, Waterston RH, Chinwalla A,Wallis J, Sekhon M, Wylie K, Mardis ER, Wilson RK et al.: A physicalmap of the human genome. Nature 2001, 409:934-941.

2. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG,Smith HO, Yandell M, Evans CA, Holt RA et al.: The sequence ofthe human genome. Science 2001, 291:1304-1351.

3. Himmelreich R, Plagens H, Hilbert H, Reiner B, Herrmann R:Comparative analysis of the genomes of the bacteriaMycoplasma pneumoniae and Mycoplasma genitalium. NucleicAcids Res 1997, 25:701-712.

4. Bono H, Goto S, Fujibuchi W, Ogata H, Kanehisa M: Systematicprediction of orthologous units of genes in the complete genomes.Genome Inform Ser Workshop Genome Inform 1998, 9:32-40.

5. Tatusov RL, Mushegian AR, Bork P, Brown NP, Hayes WS,Borodovsky M, Rudd KE, Koonin EV: Metabolism and evolution of Haemophilus influenzae deduced from awhole-genome comparison with Escherichia coli. Curr Biol 1996,6:279-291.

6. Watanabe H, Mori H, Itoh T, Gojobori T: Genome plasticity as a paradigm of eubacteria evolution. J Mol Evol 1997,44:S57-S64.

7. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoringof gene expression patterns with a complementary DNAmicroarray. Science 1995, 270:467-470.

8. DeRisi JL, Iyer VR, Brown PO: Exploring the metabolic and geneticcontrol of gene expression on a genomic scale. Science 1997,278:680-686.

9. Bork P, Koonin EV: Predicting functions from protein sequences—where are the bottlenecks? Nat Genet 1998, 18:313-318.

10. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis anddisplay of genome-wide expression patterns. Proc Natl Acad SciUSA 1998, 95:14863-14868.

11. Editorial: Connecting the dots. Report for The Microarray Meeting:1999; Phoenix, Arizona. Nat Genet 1999, 23:249-252.

12. Quackenbush J: Computational analysis of microarray data. NatRev Genet 2001, 2:418-427.

13. Bono H, Nakao M, Kanehisa M: Cluster analysis of genome-wideexpression profiles to predict gene functions with KEGG[abstract]. Nat Genet Suppl 1999, 23:33-34.

14. Nakao M, Bono H, Kawashima S, Kamiya T, Sato K, Goto S,Kanehisa M: Genome-scale gene expression analysis andpathway reconstruction in KEGG. Genome Inform Ser WorkshopGenome Inform 1999, 10:94-103.

15. Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases • at GenomeNet. Nucleic Acids Res 2002, 30:42-46.A unique resource for metabolic pathway analysis. Although this databasefocuses on microbial pathways, it can easily be extended to higher organisms.

16. Heinisch JJ, Muller S, Schluter E, Jacoby J, Rodicio R: Investigationof two yeast genes encoding putative isoenzymes ofphosphoglycerate mutase. Yeast 1998, 14:203-213.

17. Quackenbush J: Viva la revolution! A report from the FANTOM meeting. Nat Genet 2000, 26:255-256.

18. •• The RIKEN Genome Exploration Research Group Phase II Team, the FANTOM Consortium: Functional annotation of a full-length mouse cDNA collection. Nature 2001, 409:685-690.

This paper reports the sequences of 21,076 RIKEN mouse cDNA clones and their functional annotations, which were actively discussed in the FANTOM meeting. The FANTOM meeting is organized to assign functional annotation to the mouse cDNAs and to establish a standard for functional annotation strategy.

19. •• Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29.

This paper describes computational ways to establish a controlled vocabulary for describing the functions of genes.

20. Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G et al.: Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res 2002, 30:69-72.

21. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M et al.: Life with 6000 genes. Science 1996, 274:546, 563-567.

22. Bono H, Kasukawa T, Furuno M, Hayashizaki Y, Okazaki Y: FANTOM DB: database of functional annotation of RIKEN mouse cDNA clones. Nucleic Acids Res 2002, 30:116-118.

23. Blake JA, Richardson JE, Bult CJ, Kadin JA, Eppig JT: The Mouse Genome Database (MGD): the model organism database for the laboratory mouse. Nucleic Acids Res 2002, 30:113-115.

24. Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 2000, 28:45-48.

25. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD et al.: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res 2001, 29:37-40.

26. Bono H, Ogata H, Goto S, Kanehisa M: Reconstruction of amino acid biosynthesis pathways from the complete genome sequence. Genome Res 1998, 8:203-210.

27. •• Miki R, Kadota K, Bono H, Mizuno Y, Tomaru Y, Carninci P, Itoh M, Shibata K, Kawai J, Konno H et al.: Delineating developmental and metabolic pathways in vivo by expression profiling using the RIKEN set of 18,816 full-length enriched mouse cDNA arrays. Proc Natl Acad Sci USA 2001, 98:2199-2204.

This paper reports the RIKEN 19K mouse microarray, which completes comprehensive mammalian transcriptome data analyses of principal adult tissues and embryonic developmental stages.

28. • Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC et al.: Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 2001, 29:365-371.

This paper outlines public guidelines to help develop microarray databases and data management software.

29. • Bono H, Kasukawa T, Hayashizaki Y, Okazaki Y: READ: RIKEN Expression Array Database. Nucleic Acids Res 2002, 30:211-213.

This paper reports the development of the RIKEN microarray database. In addition to simple searches against the RIKEN mouse set described in this review (the whole embryonic body at E17.5 was used as the reference for this microarray experiment), the gene expression neighbors (that is, coexpressed genes) in the RIKEN mouse set are available on the Internet (http://read.gsc.riken.go.jp/).

30. • Ge H, Liu Z, Church GM, Vidal M: Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet 2001, 29:482-486.

These authors studied the correlation between transcriptome and interactome data. They showed that the function of Sno and Snz proteins can be assigned more precisely by the combination of these two types of data.

31. • Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression data with protein-protein interactions. Genome Res 2002, 12:37-46.

This paper reports an investigation into the relationship between protein–protein interactions and mRNA expression level. The authors showed that the subunits of protein complexes are highly correlated in their expression profiles, whereas the protein interaction data from yeast two-hybrid screens show only weak correlation.
