A Summary of Genomic Databases: Overview and...

18
A Summary of Genomic Databases: Overview and Discussion Erika De Francesco 2 , Giuliana Di Santo 1 , Luigi Palopoli 1 , Simona E. Rombo 1 1 Exeura, Via Pedro Alvares Cabrai, Edificio Manhattan, 87036 C.da Lecco, Rende (CS), Italy [email protected] 2 Universit` a della Calabria, Via Pietro Bucci, 87036 Rende, Italy [email protected], [email protected], [email protected] Summary. In the last few years both the amount of electronically stored biological data and the number of biological data repositories grew up significantly (today, more than eight hundred can be counted thereof). In spite of the enormous amount of available resources, a user may be disoriented when he/she searches for specific data. Thus, the accurate analysis of biological data and repositories turn out to be useful to obtain a systematic view of biological database structures, tools and contents and, eventually, to facilitate the access and recovery of such data. In this chapter, we propose an analysis of genomic databases, which are databases of fundamental importance for the research in bioinformatics. In particular, we provide a small catalog of 74 selected genomic databases, analyzed and classified by considering both their biological contents and their technical features (e.g, how they may be queried, the database schemas they are based on, the different data formats, etc.). We think that such a work may be an useful guide and reference for everyone needing to access and to retrieve information from genomic databases. 1 The Biological Databases Scenario In such an evolving field like the biological one, having continuously new data to store and to analyze seems natural. Thus, the amount of data stored in biological databases is growing very quickly [12], and the number of relevant biological data sources can be nowadays estimated in about 968 units [34]. The availability of so many data repositories is indisputably an important resource but, on the other hand, opens new quests, as to effectively access and retrieve available data. Indeed, the distributed nature of biological knowledge, and the heterogeneity of biological data and sources, can often cause trouble to the user trying specific demands. And, in fact, not only such data sources are so numerous, but they use different kinds of representations, access methods and various features and formats are offered. A.S. Sidhu et al. (Eds.): Biomedical Data and Alications, SCI 224, pp. 37–54. springerlink.com c Springer-Verlag Berlin Heidelberg 2009

Transcript of A Summary of Genomic Databases: Overview and...

Page 1: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

A Summary of Genomic Databases: Overviewand Discussion

Erika De Francesco2, Giuliana Di Santo1, Luigi Palopoli1, Simona E.Rombo1

1 Exeura, Via Pedro Alvares Cabrai, Edificio Manhattan,87036 C.da Lecco, Rende (CS), [email protected]

2 Universita della Calabria, Via Pietro Bucci, 87036 Rende, [email protected], [email protected],[email protected]

Summary. In the last few years both the amount of electronically stored biologicaldata and the number of biological data repositories grew up significantly (today,more than eight hundred can be counted thereof). In spite of the enormous amount ofavailable resources, a user may be disoriented when he/she searches for specific data.Thus, the accurate analysis of biological data and repositories turn out to be usefulto obtain a systematic view of biological database structures, tools and contentsand, eventually, to facilitate the access and recovery of such data. In this chapter,we propose an analysis of genomic databases, which are databases of fundamentalimportance for the research in bioinformatics. In particular, we provide a smallcatalog of 74 selected genomic databases, analyzed and classified by consideringboth their biological contents and their technical features (e.g, how they may bequeried, the database schemas they are based on, the different data formats, etc.).We think that such a work may be an useful guide and reference for everyone needingto access and to retrieve information from genomic databases.

1 The Biological Databases Scenario

In such an evolving field like the biological one, having continuously new datato store and to analyze seems natural. Thus, the amount of data stored inbiological databases is growing very quickly [12], and the number of relevantbiological data sources can be nowadays estimated in about 968 units [34]. Theavailability of so many data repositories is indisputably an important resourcebut, on the other hand, opens new quests, as to effectively access and retrieveavailable data. Indeed, the distributed nature of biological knowledge, and theheterogeneity of biological data and sources, can often cause trouble to theuser trying specific demands. And, in fact, not only such data sources are sonumerous, but they use different kinds of representations, access methods andvarious features and formats are offered.

A.S. Sidhu et al. (Eds.): Biomedical Data and Alications, SCI 224, pp. 37–54.

springerlink.com c© Springer-Verlag Berlin Heidelberg 2009

Page 2: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

38 Erika De Francesco, Giuliana Di Santo, Luigi Palopoli, Simona E. Rombo

In this scenario, a documented analysis of on-line available material interms not only of biological contents, but also considering the diverse types ofexploited data formats and representations, the different ways in which suchdatabases may be queried (e.g., by graphical interaction or by text forms) orthe various database schemas they are based on, can be useful to ease users’activities spent towards fulfilling their information needs. With this aim inmind, we started from the about 968 biological databases listed in [34], wherea classification reflecting some specific biological views of data is provided,and we focused on a specific biological database category, that of genomicdatabases, analyzing them w.r.t. database technologies and tools (which hasnot been considered in previous works, such as [31, 34]).

As a result of this work, in this chapter we will discuss the 74 genomicdatabases we have analyzed, specifying the main features they have in commonand the differences among them in terms of both biological and databasetechnologies oriented points of views. In particular, we used five different andorthogonal criteria of analysis, that are:

• typologies of recoverable data (e.g., genomic segments or clone/contig re-gions);

• database schema types (e.g., Genolist schema or Chado schema);• query types (e.g., simple queries or batch queries);• query methods (e.g., graphical interaction based or query language based

methods);• export formats (e.g., flat files or XML formats).

Section 2 is devoted to the illustration of such an analysis, and a subsec-tion for each particular criterium under which the selected genomic databaseshave been analyzed in detail is included. To facilitate the full understanding ofthe discussion, the mentioned classification criteria would be also graphicallyillustrated by some graphs. In this way, given one of the analyzed databases,it would be possible to easily know in a synthetic way all the informationcaptured by the described criteria. An on-line version of such an analysis isavailable at http://siloe.deis.unical.it/ biodbSurvey/, where a num-bered list of all the databases we analyzed is provided (as also reported here inthe Appendix) and the classifications corresponding to the five criteria illus-trated above are graphically reported. In particular, for each criterium, eachsub-entry is associated to a list of databases; clicking on one of the databases,a table will be shown where all the main features of that database are sum-marized (see Section 2.6 for more details).

The two following subsections summarize some basic biological notionsand illustrate the main biological database classification commonly adoptedin the literature.

1.1 Some (very) basic biology

The genome encodes all the information of the Deoxyribonucleic Acid (DNA),the macromolecule containing the genetic heritage of each living being. DNAis a sequence of symbols on an alphabet of four characters, that are, A, T ,C and G, representing the four nucleotides Adenine, Thymine, Cytosine and

Page 3: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

A Summary of Genomic Databases: Overview and Discussion 39

Guanine. In genomic sequences, three kinds of subsequences can be distin-guished: i) genic subsequences, coding for protein expression; ii) regulatorysubsequences, placed upstream or downstream the gene of which they influ-ence the expression; iii) subsequences apparently not related to any function.Each gene is associated to a functional result that, usually, is a protein. Proteinsynthesis can be summarized in two main steps: transcription and translation.During transcription, the DNA genic sequence is copied into a Messenger Ri-bonucleic Acid (RNAm), delivering needed information to the synthesis ap-paratus. During translation, the RNAm is translated into a protein, that is, asequence of symbols on an alphabet of 20 characters, each denoting an aminoacid and each corresponding to a nucleotide triplet. Therefore, even changingone single nucleotide in a genic subsequence may cause a change in the corre-sponding protein sequence. Proteins as, more in general, macromolecules, arecharacterized by different levels of information, related to the elementary unitsconstituting them and to their spatial disposition. Such levels are known asstructures. In particular, the biological functions of macromolecules also de-pend on their three-dimensional shapes, usually named tertiary structures.Moreover, behind the information encoded in the genomic sequence, additiveknowledge about biological functions of genes and molecules can be attainedby experimental techniques of computational biology. Such techniques are de-voted, for example, to build detailed pathway maps, that are, directed graphsrepresenting metabolic reactions of cellular signals [14].

1.2 Biological database classification

As pointed out in [31], all the biological databases may be classified, w.r.t.their biological contents, in the following categories (see also Figure 1):

• Macromolecular Databases: contain information related to the three mainmacromolecules classes, that are, DNA, RNA and proteins.– DNA Databases: describe DNA sequences.– RNA Databases: describe RNA sequences.– Protein Databases: store information on proteins under three different

points of view:· Protein Sequence Databases: describe amino acid sequences.· Protein Structure Databases: store information on protein struc-

tures, that are, protein shapes, charges and chemical features, char-acterizing their biological functions.

· Protein Motif and Domain Databases: store sequences and struc-tures of particular protein portions (motifs and domains) to whichspecific functional meanings can be associated, grouped by biolog-ical functions, cellular localizations, etc.

• Genomic Databases: contain data related to the genomic sequencing ofdifferent organisms, and gene annotations;– Human Genome Databases: include information on the human gene

sequencing.– Model Organism Databases (MOD): store data coming from the se-

quencing projects of model organisms (such as, e.g., MATDB); theyare also intended to support the Human Genome Project (HGP).

Page 4: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

40 Erika De Francesco, Giuliana Di Santo, Luigi Palopoli, Simona E. Rombo

– Other Organism Databases: store information derived from sequencingprojects not related to HGP.

– Organelle Databases: store genomic data of cellular organelles, suchas mitochondria, having their own genome, distinct from the nucleargenome.

– Virus Databases: store virus genomes.• Experiment Databases: contain results of lab experiments; these can be

grouped in three main classes:– Microarray Databases: store data deriving from microarray experi-

ments, useful to assess the variability of the gene expression.– 2D PAGE Databases: contain data on proteins identified by two-

dimensional polyacrylamide gel electrophoresis (2-D PAGE), an exper-imental technique used to physically separate and distinguish proteins.

– Other Experiment Databases: store results of specific experimentaltechniques, such as Quantitative PCR Primer Database [23].

• Immunological Databases: store information associated to immunologic re-sponses.

• Gene Databases: contain sequence data of DNA codifying subsequences,that are, the genes.

• Transcript Databases: store the set of transcribed genomic sequences ofseveral organisms (that are, those genomic sequences characterizing or-gans’ and tissues’ cellular properties).

• Pathway Databases: contain information on metabolic reaction networksof cellular signals.

• Polymorphism and mutation Databases: report data on polymorphic vari-ations and mutations.

• Meta-Database: store descriptions and links about (selected groups of)biological databases;

• Literature Databases: these are archives storing papers published on pre-mier biological journals.

• Other Databases: store data related to specific application contexts, suchas the DrugBank [10].

It is worth pointing out that the above classes in some cases overlap. As anexample, EcoCyc [40] is to be looked at both as a genomic and as a pathwaydatabase. Note that this is intrinsic to the structure of the biological context.In fact, biological data sources contain data which can be classified underdifferent perspectives, due to the various origins and goals of the associatedrepositories.

2 Genomic databases analysis

This section focuses on genomic databases, that represent one among the mostimportant biological database categories. Differently from gene databases,containing only coding DNA sequences, genomic databases contain also noncoding intergenic sequences.

Page 5: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

A Summary of Genomic Databases: Overview and Discussion 41

Biological database

categories

Macromolecular Databases

Pathway databases

Protein databases

Polymorphism and mutation databases

Experiment databases

Literature databases

Genomic Databases

Meta Database (database of database)

Protein structure databases

Protein motif and domine databases

Microarray databases

2D PAGE databases

Other organism databases

Model organism databases

Other experiments databases

Transcript databases

Organelle databases

Gene databases

DNA databases

RNA databases

Other databases

Virus genome databases

Human genome databases

Immunological databases

Protein sequence databases

Biological database

categories

Macromolecular Databases

Pathway databases

Protein databases

Polymorphism and mutation databases

Experiment databases

Literature databases

Genomic Databases

Meta Database (database of database)

Protein structure databases

Protein motif and domine databases

Microarray databases

2D PAGE databases

Other organism databases

Model organism databases

Other experiments databases

Transcript databases

Organelle databases

Gene databases

DNA databases

RNA databases

Other databases

Virus genome databases

Human genome databases

Immunological databases

Protein sequence databases

Fig. 1. Biological database classification.

In the following, we discuss a selected subset of genomic repositories (74units) and illustrate in detail their differences and common features accordingto five criteria of analysis, that are, the typologies of recoverable data, thedatabase schema types, the query types, the query methods and, finally, theexport formats.

For each classification criterium, the corresponding classification and thelist of associated databases will be also graphically illustrated. In particu-lar, in the graphs that follow, classes are represented by rectangles, whereascircles indicate sets of databases. Databases are indicated by numbers, ac-cording to the lists found in the Appendix and at http://siloe.deis.unical

.it/biodbSurvey/.

Page 6: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

42 Erika De Francesco, Giuliana Di Santo, Luigi Palopoli, Simona E. Rombo

2.1 Recoverable Data

Fig. 2. Recoverable data typologies.

Genomic databases contain a large set of data types. Some archives reportonly the sequence, the function and the organism corresponding to a givenportion of genome, other ones contain also detailed information useful forbiological or clinical analysis. As shown in Figure 2, data that are recoverablefrom genomic databases can been distinguished in six classes:

• Genomic segments: include all the nucleotide subsequences which aremeaningful from a biological point of view, such as genes, clone/clontigsequences, polymorphisms, control regions, motifs and structural featuresof chromosomes. In detail:

Page 7: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

A Summary of Genomic Databases: Overview and Discussion 43

– Genes: are DNA subsequences originating functional product such asproteins and regulatory elements. All genomic databases contain infor-mation about genes.

– Clone/contig regions: are sequences of in vitro copies of DNA regions,used to clone the DNA sequence of a given organism. Examples ofdatabases storing information about clone/contig regions are ArkDB[1], SGD [32] and ZFIN [47].

– Polymorphisms: are frequent changes in nucleotide sequences, usuallycorresponding to a phenotypic change (e.g., the variety of eye colors inthe population). As an example, BeetleBase [3] contains informationabout polymorphisms.

– Control regions: are DNA sequences regulating the gene expression. Afew databases store information about control regions. Among themthere are Ensembl [13], FlyBase [42], SGD [32], and WormBase [28].

– Motifs (or patterns): are specific DNA segments having important func-tions and repeatedly occurring in the genome of a given organism. Ex-amples of databases containing information about motifs are AureoList[2], Leproma [20] and VEGA [38].

– Structural features: can be further distinguished in telomeres, cen-tromeres and repeats.· Telomeres: are the terminal regions of chromosomes. Genolevures

[33], PlasmoDB [30] and SGD [32] store information about telom-eres.

· Centromeres: represent the conjunction between short arms andlong arms of the chromosome. BGI-RISe [51], PlasmoDB [30], RAD[24] and SGD [32] are databases containing information about cen-tromeres.

· Repeats: are repetitions of short nucleotide segments, repeatedlyoccurring in some of the regions of the chromosome. Among theothers, CADRE [7], PlasmoDB [30] and ToxoDB [37] store infor-mation about repeats.

• Maps: result from projects that produced the sequencing and mappingof the DNA of diverse organisms, such as the Human Genome Project[50]. In particular, genetic maps give information on the order in whichgenes occur in the genome, providing only an estimate of genes distances,whereas physical maps give more precise information on the physical dis-tances among genes. GOBASE [45] and BeetleBase [3] are examples ofdatabases storing maps.

• Variations and mutations: A nucleotide change is a mutation if it occurswith low frequency in the population (as opposed to polymorphisms). Mu-tations may cause alterations in the protein tertiary structures, inducingpathologic variations of the associated biological functions. These infor-mation are stored in databases such as, for example, FlyBase [42], andPlasmoDB [30].

• Pathways: describe interactions of sets of genes, or of proteins, or ofmetabolic reactions involved in the same biological function. Pathwaysare stored in Bovilist [5] and Leproma [20], for example.

Page 8: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

44 Erika De Francesco, Giuliana Di Santo, Luigi Palopoli, Simona E. Rombo

• Expression data: are experimental data about the different levels of expres-sion of genes. The levels of such an expression are related to the quantityof genic product of genes. Information about expression data are storedin, e. g., CandidaDB [6], PlasmoDB [30], SGD [32] and ToxoDB [37].

• Bibliographic references: are repositories of (sometimes, links to) relevantbiological literature. Most genomic databases contain bibliographic refer-ences.

2.2 Database Schemas

Most genomic databases are relational, even if relevant examples exist basedon the object-oriented or the object-relational model (see e.g., [28, 49]). Wedistinguished four different types of database schema designed “specifically”to manage biological data, from other unspecific database schemas, which aremostly generic relational schemas, whose structure is anyways independentfrom the biological nature of data.

Thus, w.r.t. database schemas, we grouped genomic databases in fiveclasses, as also illustrated in Figure 3; more in detail, we referred to:

Fig. 3. Database schemas.

• The Genomics Unified Schema (GUS) [16]: this is a relational schemasuitable for a large set of biological information, including genomic data,genic expression data, protein data. Databases based on this schema areCryptoDB [35], PlasmoDB [30] and TcruziDB [27].

• The Genolist Schema [43]: this is a relational database schema developedto manage bacteria genomes. In the Genolist Schema, genomic data arepartitioned in subsequences, each associated to specific features. With it,partial and efficient data loading is possible (query and update operations),since the update of sequences and associated features only require accessingand analyzing the interested chromosomic fragments. Databases based onthe genolist schema are, for example, AureoList [2], LegioList [19] andPyloriGene [22].

• The Chado Schema [8]: this is a relational schema having an extensiblemodular structure, and is used in some model organism databases such asBeetleBase [3] and FlyBase [42].

Page 9: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

A Summary of Genomic Databases: Overview and Discussion 45

• The Pathway Tools Schema [39]: this is an object schema, used in Path-way/Genome databases. It is based on an ontology defining a large set ofclasses (there are about 1350 of them), attributes and relations to model bi-ological data, such as metabolic pathways, enzymatic functions, genes, pro-moters and genic regulation mechanisms. Databases based on this schemaare BioCyc [4], EcoCyc [11] and HumanCyc [18].

• Unspecific Schema refers to databases which do not correspond to anyof the specific groups illustrated above; in particular, most of them arebased on a relational schema, without any special adaptation to managebiological data.

2.3 Query Types

Fig. 4. Query types.

In Figure 4, query types supported by the analyzed databases are illus-trated. By simple querying it is possible to recover data satisfying some stan-dard search parameters such as, e.g., gene names, functional categories andothers. Batch queries consist of bunches of simple queries that are simulta-neously processed. The answers to such queries result as a combination ofthe answers obtained for the constituent simple queries. Analysis queries aremore complex and, somehow, more typical of the biological domain. They con-sist in retrieving data based on similarities (similarity queries) and patterns(pattern search queries). The former ones take in input a DNA or a protein(sub)sequence, and return those sequences found in the database that are themost similar to the input sequence. The latter ones take in input a pattern pand a DNA sequence s and return those subsequences of s, which turn out tobe most strongly related to the input pattern p. As an example, consider thepattern TATA, that is, a pattern denoting a regulative sequence upstream thegenic sequence of every organism: filling the pattern search query form withit on a database such as CandidaDB [6], and allowing at most one mismatch,311 genic subsequences extracted from the Candida (a fungus) DNA and con-taining the TATA box in evidence are returned. Both sequence similarity andpattern search queries exploit heuristic search algorithms such as BLAST [29].

Page 10: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

46 Erika De Francesco, Giuliana Di Santo, Luigi Palopoli, Simona E. Rombo

2.4 Query Methods

A further analysis criterium we considered is related to methods used toquery available databases, as illustrated in Figure 5. We distinguished fourmain classes of query methods: text based methods, graphical interaction basedmethods, sequence based methods and query language based methods.

Fig. 5. Query methods.

The most common methods are the text based ones, further grouped intwo categories: free text and forms. In both cases, the query can be formulatedspecifying some keywords. With the free text methods, the user can specifysets of words, also combining them by logical operators. Such keywords are tobe searched for in specific and not a-priori determined fields and tables of thedatabase. With forms, searching starts by specifying the values to look for thatare associated to attributes of the database. Multi-attributes searches can beobtained by combining the sub-queries by logical operators. As an example,ArkDB [1] supports both free text and forms queries.

Graphical interaction based methods are also quite common in genomicdatabase access systems. In fact, a large set of genomic information can be vi-sualized by physical and genetic maps, whereby queries can be formulated byinteracting with graphical objects representing the annotated genomic data.Graphical support tools have been indeed developed such as, e.g., the GenericGenome Browser (GBrowser) [48], which allows the user to navigate interac-tively through a genome sequence, to select specific regions and to recoverthe correspondent annotations and information. Databases such as CADRE[7] and HCVDB [17], for example, allow the user to query their repository bygraphical interaction.

Sequence based queries rely on the computation of similarity among ge-nomic sequences and patterns. In this case, the input is a nucleotides or amino-acid sequence and data mining algorithms, such as in [29], are exploited toprocess the query, so that alignments and similar information are retrievedand returned.

Query language based methods exploit DBMS-native query languages(e.g., SQL-based ones), allowing a “low level” interaction with the data- bases.

Page 11: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

A Summary of Genomic Databases: Overview and Discussion 47

At the moment, few databases (WormBase [28] and GrainGenes [15]) supportthem.

2.5 Result Formats

In genomic databases, several formats are adopted to represent query results.Web interfaces usually provide answers encoded in HTML, but other formatsare often available as well. In the following discussion, we shall refer to theformats adopted for representing query results and not to those used for ftp-downloadable data.

Fig. 6. Result formats.

Flat files are semi-structured text files where each information class is re-ported on one or more than one consecutive lines, identified by a code usedto characterize the annotated attributes. The most common flat files formatsare the GenBank Flat File (GBFF) [41] and the European Molecular BiologyLaboratory Data Library Format [44]. These formats represent basically thesame information contents, even if with some syntactic differences. Databasessupporting the GenBank Flat File are, for example, CADRE [7] and Ensembl[13], whereas databases supporting the EMBL format are CADRE [7], En-sembl [13], and ToxoDB [37].

The comma/tab and the attribute-value separated files are similar to flatfiles, but featuring some mildly stronger form of structure. The former ones,supported for example by BioCyc [4], TAIR [46] and HumanCyc [18], repre-sent data in a tabular format, separating attributes and records by special

Page 12: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

48 Erika De Francesco, Giuliana Di Santo, Luigi Palopoli, Simona E. Rombo

characters (e.g., commas and tabulations). The latter ones (supported for ex-ample by BioCyc [4], EcoCyc [11] and HumanCyc [18]) organize data as pairs<attribute, value>.

In other cases, special formats have been explicitly conceived for biologi-cal data. An example is the FASTA format [44], commonly used to representsequence data (it is exploited in all genomic databases, except for ColiBase[9], PhotoList [21] and StreptoPneumoList [25]). Each entry of a FASTA doc-ument consists in three components: a comment line, that is optional andreports brief information about the sequence and the GenBank entry code;the sequence, represented as a string on the alphabet {A, C, G, T} of thenucleotide symbols, and a character denoting the end of the sequence.

More recently, in order to facilitate the spreading of information in hetero-geneous contexts, the XML format has been sometimes supported. The mainXML-based standards in the biological context are the System Biology MarkupLanguage (SBML) [36] and the Biological Pathway Exchange (BioPAX) [26].SMBL is used to export and exchange query results such as pathways of cel-lular signals, gene regulation and so on. The main data types here are derivedfrom XML-Schema. Similarly, BioPAX has been designed to integrate differentpathways resources via a common data format. BioPAX specifically consistsof two parts: the BioPAX ontology, that provides an abstract representationof concepts related to biological pathways, and the BioPAX format, definingthe syntax of data representation according to definitions and relations repre-sented in the ontology. BioCyc [4], EcoCyc [11] and HumanCyc [18] supportboth SBML and BioPAX formats.

Figure 6 summarizes the described data formats.

2.6 The on-line synthetic description

We remark that the main goal of our investigation has been that of screeninga significant subset of the available on-line biological resources, in order toindividuate some common directions of analysis, at least for a sub-category ofthem (i.e., the genomic databases). Thus, as one of the main results of suchan analysis, we provide a synthetic description of each of the 74 analyzeddatabases according to the considered criteria. All such synthetic descriptionsare freely available at http://siloe.deis.unical.it/biodbSurvey/.

In Figure 7, an example of the synthetic description we adopted is illus-trated. In particular, a form is exploited where both the acronym and thefull name of the database under analysis are showed. We consider the Cryp-toDB [35] (Cryptosporidium Genome Resources) database, that is, the genomedatabase of the AIDS-related apicomplexan-parasite, as reported in the briefdescription below the full name in the form. Also the web-link to the con-sidered database is provided, along with its database features. In particular,CryptoDB is a relational DBMS based on the Genomic Unified Schema (illus-trated above, in Section 2.2). All features corresponding to the other analysiscriteria are reported. In detail, we highlight that CryptoDB contains informa-tion on genes, clone/contig sequences, polymorphism and motifs, and that itstores physical maps, pathways and bibliographic references as well. It sup-ports simple queries and also analysis queries based on similarity search, and

Page 13: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

A Summary of Genomic Databases: Overview and Discussion 49

Fig. 7. An example of the database synthetic descriptions available athttp://siloe.deis.unical.it/biodbSurvey/.

allows graphical interaction and sequence based query methods. Additionalinformation are contained in the form about the search criteria supportedby each database. In particular, CryptoDB can be queried by using bothkeywords and map positions as search parameters (map positions require thespecification of the positions of chromosomic regions). The result formats sup-ported by CryptoDB are HTML (as all the other genomic databases), FASTAand tab-separated files. It allows the user to download data stored in its repos-itory, but not to submit new data, as specified in the last two field (“submit”and “download”) of the form.

3 Concluding Remarks

In this chapter we presented the main results of an analysis we have conductedon 74 genomic databases. Such an analysis has been done mainly focusing ona database technology point of view, by choosing five specific criteria underwhich all the considered databases have been investigated. Thus, the con-

Page 14: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

50 Erika De Francesco, Giuliana Di Santo, Luigi Palopoli, Simona E. Rombo

sidered databases have been classified according to such criteria and theirsynthetic descriptions are given, with the aim of providing a preliminary in-strument to access and manage more easily the enormous amount of datacoming from the biological repositories.

References

1. Arkdb. http://www.thearkdb.org/arkdb//index.jsp.2. Aureolist. http://genolist.pasteur.fr/AureoList.3. Beetlebase website. http://www.bioinformatics.ksu.edu/BeetleBase.4. Biocyc website. http://biocyc.org.5. Bovilist world-wide web server. http://genolist.pasteur.fr/BoviList.6. Candidadb: Database dedicated to the analysis of the genome of the human

fungal pathogen, candida albicans. http://genolist.pasteur.fr/CandidaDB.7. Central aspergilus data repository. www.cadre.man.ac.uk.8. Chado: The gmod database schema. http://www.gmod.org/chado.9. Colibase. http://colibase.bham.ac.uk.

10. Drugbank database. http://redpoll.pharmacy.ualber- ta.ca/drugbank.11. Ecocyc website. http://ecocyc.org.12. Embl nucl. seq. db statistics. http://www3.ebi.ac.uk/ Services/DBStats.13. Ensembl. www.ensembl.org.14. From sequence to function. an introduction to the kegg project.

http://www.genome.jp/kegg/docs/intro.html.15. Graingenes: a database for triticeae and avena.

http://wheat.pw.usda.gov/GG2/index.shtml.16. Gus: The gen. unified schema docum. http://www.gusdb.org/ documenta-

tion.php.17. Hcv polyprotein. http://hepatitis.ibcp.fr/.18. Humancyc website. http://humancyc.org.19. Legiolist. http://genolist.pasteur.fr/LegioList.20. Leproma. http://genolist.pasteur.fr/Leproma.21. Photolist. http://genolist.pasteur.fr/PhotoList.22. Pylorigene. http://genolist.pasteur.fr/PyloriGene.23. Quant. pcr primer db (qppd). http://web.ncifcrf.gov/rtp/gel/ primerdb.24. Rice annotation database. http://golgi.gs.dna.affrc.go.jp/SY-1102/rad.25. Streptopneumolist. http://genolist.pasteur.fr/StreptoPneumoList.26. System biology markup language (sbml). http://www.smbl.org.27. Tcruzidb: An integrated trypanosoma cruzi genome resource.

http://tcruzidb.org/.28. Wormbase: Db on biology and genome of c.elegans. http://www.wormbase.org/.29. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local

alignment search tool. J. of Mol. Biology, 215(3):403–410, 1990.30. A. Bahl and et. al. Plasmodb: the plasmodium genome resource. a database

integrating experimental and computational data. Nucl. Ac. Res., 31(1):212–215, 2003.

31. A.D. Baxevanis. The molecular biology database collection: an update compi-lation of biological database resource. Nucl. Ac. Res., 29(1):1–10, 2001.

32. J.M. Cherry, C. Ball, S. Weng, G. Juvik, R. Schmidt, C. Adler, B. Dunn,S. Dwight, L. Riles, R.K. Mortimer, and D. Botstein.

Page 15: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

A Summary of Genomic Databases: Overview and Discussion 51

33. F. Iragne E. Beyne M. Nikolski D. Sherman, P. Durrens and J.-L. Souciet.Genolevures complete genomes provide data and tools for comparative genomicsof hemiascomycetous yeasts. Nucl. Ac. Res., 34(Database issue):D432–D435,2006.

34. M. Y. Galperin. The molecular biology database collection: 2007 update. Nucl.Ac. Res., 35(Database issue):D3–D4, 2007.

35. M. Heiges, H. Wang, E. Robinson, C. Aurrecoechea, X. Gao, N. Kaluskar,P. Rhodes, S. Wang, C. Z. He, Y. Su, J. Miller, E. Kraemer, and J. C. Kissinger.Cryptodb: a cryptosporidium bioinformatics resource update. Nucl. Ac. Res.,34(Database issue):D419–D422, 2006.

36. M. Hucka, A. Finney, and H.M. Sauro. The system biology markup lan-guage (sbml): a medium for representation and exchange of biochemical networkmodel. Bioinformatics, 19(4):524–531, 2003.

37. L. Li I. T. Paulsen J. C. Kissinger, B. Gajria and D. S. Roos. Toxodb: accessingthe toxoplasma gondii genome. Nucl. Ac. Res., 31(1):234–236, 2003.

38. J. G. R. Gilbert K. Jekosch S. Keenan P. Meidl S. M. Searle J. Stalker R. StoreyS. Trevanion L. Wilming J. L. Ashurst, C.-K. Chen and T. Hubbard. Thevertebrate genome annotation (vega) database. Nucl. Ac. Res., 33(DatabaseIssue):D459–D465, 2005.

39. P. D. Karp. An ontology for biological function based on molecular interactions.Bioinformatics, 16:269–285, 2000.

40. I.M. Keseler, J. Collado-Vides, S. Gama-Castro, J. Ingraham, S. Paley, I.T.Paulsen, M. Peralta-Gil, and P.D. Karp. Ecocyc: a comprehensive databaseresource for escherichia coli. Nucl. Ac. Res., 33(Database issue):D334–D337,2005.

41. Z. Lacroix and T. Critchlow. Bioinformatics: Managing Scientific Data. 2005.42. V.B. Strelets P. Zhang W.M Gelbart M.A. Crosby, J.L. Goodman and the Fly-

Base Consortium. Flybase: genomes by the dozen. Nucl. Ac. Res., 35:D486–D491, 2007.

43. I. Moszer, L. M. Jones, S. Moreira, C. Fabry, and A. Danchin. Subtilist: The ref.database for the bacillus subtilist genome. Nucl. Ac. Res., 30(1):62–65, 2002.

44. D. W. Mount. Bioinformatics: Sequence and Genome Analysis. Cold SpringHarbor Laboratory Press, second edition, 2004.

45. Zhang Yue Yang LiuSong Wang Eric Marie Veronique Lang B. Franz O’Brien,Emmet A. and Gertraud Burger. Gobase - a database of organelle and bacterialgenome information. Nucl. Ac. Res., 34:D697–D699, 2006.

46. S.Y. Rhee, W. Beavis, T.Z. Berardini, G. Chen, D. Dixon, A. Doyle, M. Garcia-Hernandez, E. Huala, G. Lander, M. Montoya, N. Miller, L.A. Mueller,S. Mundodi, L. Reiser, J. Tacklind, D.C. Weems, Y. Wu, I. Xu, D. Yoo, J. Yoon,and P. Zhang. The arabidopsis information resource (tair): a model organismdatabase providing a centralized, curated gateway to arabidopsis biology, re-search materials and communit. Nucl. Ac. Res., 31(1):224–228, 2003.

47. J. Sprague, L. Bayraktaroglu, D. Clements, T. Conlin, D. Fashena, K. Frazer,M. Haendel, D. Howe, P. Mani, S. Ramachandran, E. Segerdell K. Schaper,P. Song, S. Taylor B. Sprunger, C. Van Slyke, and M. Westerfield. The zebrafishinformation network: the zebrafish model organism database. Nucl. Acids Res.,34:D581–D585, 2006.

48. L. D. Stein, C. Mungall, S. Shu, M. Caudy, M. Mangone, A. Day, E. Nickerson,J.E. Stajich, T.W. Harris, A. Arva, and S. Lewis. The generic genome browser:A building block for a model organism system database. Gen. Res., 12(10):1599–1610, 2002.

49. S. Twigger, J. Lu, M. Shimoyama, and et al. D. Chen. Rat genome database(rgd): mapping disease onto the genome. Nucl. Ac. Res., 30(1):125–128, 2002.

Page 16: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

52 Erika De Francesco, Giuliana Di Santo, Luigi Palopoli, Simona E. Rombo

50. J.C. Venter and et al. The sequence of the human genome. Science,291(5507):1304–1351, 2001.

51. W. Zhao, J. Wang, X. He, X. Huang, Y. Jiao, M. Dai, S. Wei, J. Fu, Y. Chen,X. Ren, Y. Zhang, P. Ni, J. Zhang, S. Li, J. Wang, G. K. Wong, H. Zhao,J. Yu, H. Yang, and J. Wang. Bgi-ris: an integrated information resource andcomparative analysis workbench for rice genomics. Nucl. Acids Res., 32:D377–D382, 2004.

Page 17: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

A Summary of Genomic Databases: Overview and Discussion 53

Appendix

1: ArkDB, http://www.thearkdb.org

2: AureoList, http://genolist.pasteur.fr/AureoList

3: BeetleBase, http://www.bioinformatics.ksu.edu/BeetleBase

4: BGI-RISe, http://rise.genomics.org.cn

5: BioCyc, http://biocyc.org

6: Bovilist, http://genolist.pasteur.fr/BoviList

7: Brassica ASTRA, http://hornbill.cspp.latrobe.edu.au/cgi-binpub/index.pl

8: CADRE, http://www.cadre.man.ac.uk

9: CampyDB, http://campy.bham.ac.uk

10: CandidaDB, http://genolist.pasteur.fr/CandidaDB

11: CGD, http://www.candidagenome.org

12: ClostriDB, http://clostri.bham.ac.uk

13: coliBase, http://colibase.bham.ac.uk

14: Colibri, http://genolist.pasteur.fr

15: CryptoDB, http://cryptodb.org

16: CyanoBase, http://www.kazusa.or.jp/cyano

17: CyanoList, http://genolist.pasteur.fr/CyanoList

18: DictyBase, http://dictybase.org

19: EcoCyc, http://ecocyc.org

20: EMGlib, http://pbil.univ-lyon1.fr/emglib/emglib.htm

21: Ensembl, http://www.ensembl.org

22: FlyBase, http://flybase.bio.indiana.edu

23: Full-Malaria, http://fullmal.ims.u-tokyo.ac.jp

24: GabiPD, http://gabi.rzpd.de

25: GDB, http://www.gdb.org

26: Genolevures, http://cbi.labri.fr/Genolevures

27: GIB, http://gib.genes.nig.ac.jp

28: GOBASE, http://megasun.bch.umontreal.ca/gobase

29: GrainGenes, http://wheat.pw.usda.gov/GG2/index.shtml

30: Gramene, http://www.gramene.org

31: HCVDB, http://hepatitis.ibcp.fr

32: HOWDY, http://www-alis.tokyo.jst.go.jp/HOWDY

33: HumanCyc, http://humancyc.org

34: INE, http://rgp.dna.affrc.go.jp/giot/INE.html

35: LegioList, http://genolist.pasteur.fr/LegioList

36: Leproma, http://genolist.pasteur.fr/Leproma

37: LeptoList, http://bioinfo.hku.hk/LeptoList

38: LIS, http://www.comparative-legumes.org

39: ListiList, http://genolist.pasteur.fr/ListiList

40: MaizeGDB, http://www.maizegdb.org

41: MATDB, http://mips.gsf.de/proj/thal/db

42: MBGD, http://mbgd.genome.ad.jp

43: MGI, http://www.informatics.jax.org

44: MitoDat, http://www-lecb.ncifcrf.gov/mitoDat/#searching

45: MjCyc, http://maine.ebi.ac.uk:1555/MJNRDB2/server.htm

46: Microscope, http://www.genoscope.cns.fr/agc/mage/wwwpkgdb/MageHome/

Page 18: A Summary of Genomic Databases: Overview and Discussionmath.unipa.it/rombo/files/publications/chapter09c_draft.pdf · both their biological contents and their technical features (e.g,

54 Erika De Francesco, Giuliana Di Santo, Luigi Palopoli, Simona E. Rombo

index.php?webpage=microscope

47: MolliGen, http://cbi.labri.fr/outils/molligen

48: MtDB, http://www.medicago.org/MtDB

49: MypuList, http://genolist.pasteur.fr/MypuList

50: NCBI Genomes Databases, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome

51: NCBI Viral Genomes, http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html

52: OGRe, http://drake.physics.mcmaster.ca/ogre

53: Oryzabase, http://www.shigen.nig.ac.jp/rice/oryzabase

54: PEC, http://shigen.lab.nig.ac.jp/ecoli/pec

55: PhotoList, http://genolist.pasteur.fr/PhotoList

56: PlantGDB, http://www.plantgdb.org

57: PlasmoDB, http://plasmodb.org

58: PyloriGene, http://genolist.pasteur.fr/PyloriGene

59: RAD, http://golgi.gs.dna.affrc.go.jp/SY-1102/rad

60: RGD, http://rgd.mcw.edu

61: SagaList, http://genolist.pasteur.fr/SagaList

62: ScoCyc, http://scocyc.jic.bbsrc.ac.uk:1555

63: SGD, http://www.yeastgenome.org

64: StreptoPneumoList, http://genolist.pasteur.fr/StreptoPneumoList

65: SubtiList, http://genolist.pasteur.fr/SubtiList

66: TAIR, http://www.arabidopsis.org

67: TcruziDB, http://tcruzidb.org

68: TIGR-CMR, http://www.tigr.org/CMR

69: ToxoDB, http://toxodb.org

70: TubercuList, http://genolist.pasteur.fr/TubercuList

71: UCSC, http://genome.ucsc.edu

72: VEGA, http://vega.sanger.ac.uk

73: WormBase, http://www.wormbase.org

74: ZFIN, http://zfin.org