Pathways and genomes databases in bioinformatics

Presented by Sarwat Bashir (Bioinformatics, 8th semester), Shaheed Benazir Bhutto Women University, Peshawar

description

An integration of different research papers on pathway and genome databases in bioinformatics.

Transcript of Pathways and genomes databases in bioinformatics

  • 1. Presented by Sarwat Bashir (Bioinformatics, 8th semester), Shaheed Benazir Bhutto Women University, Peshawar

2. SECONDARY DATABASES IN BIOINFORMATICS
Secondary databases store data derived from the analysis or treatment of primary data, such as secondary structures, hydrophobicity plots, and domains.
http://www.imb-jena.de/~rake/Bioinformatics_WEB/databases_classification.html

3. THE BIOINFORMATICS SECONDARY DATABASES
Secondary databases are further divided into four categories according to the information they contain: sequence-related information, genome-related information, structure-related information, and pathway information.

4. Metabolic Pathway and Protein Function Databases
A pathway database (DB) describes biochemical pathways, reactions, and enzymes, and supports the modeling and simulation of a biopathway.

5. GENOME DATABASES
These databases collect organism genome sequences, annotate them (add descriptions), analyze them, and provide public access.
Some add curation of the experimental literature to improve computed annotations.
These databases may hold the genomes of many species or of a single model organism.

6. PAGED: a pathway and gene-set enrichment database to enable molecular phenotype discoveries
Abstract
Background: Pathway and gene-set enrichment analysis has evolved into the study of high-throughput functional genomics over the past decade. Researchers have begun to combine pathway and gene-set enrichment analysis with network module-based approaches to identify crucial relationships between different molecular mechanisms.
Methods: To meet the new challenge of molecular phenotype discovery, the authors developed an online database, the Pathway And Gene Enrichment Database (PAGED), to enable comprehensive searches for disease-specific pathways, gene signatures, microRNA targets, and network modules by integrating gene-set-based prior knowledge as molecular patterns from multiple levels: the genome, transcriptome, post-transcriptome, and proteome.
Huang et al. PAGED: a pathway and gene-set enrichment database to enable molecular phenotype discoveries. BMC Bioinformatics 2012, 13(Suppl 15):S2. http://www.biomedcentral.com/1471-2105/13/S15/S2

7. Cont.
Results: The online database the authors developed, PAGED (http://bio.informatics.iupui.edu/PAGED), is by far the most comprehensive public compilation of gene sets.
In its current release, PAGED contains a total of 25,242 gene sets, 61,413 genes, 20 organisms, and 1,275,560 records from five major categories.
Beyond its size, the advantage of PAGED lies in the exploration of relationships between gene sets as gene-set association networks (GSANs).

8. Introduction to PAGED
Biological pathways have provided natural sources of molecular mechanisms for developing diagnosis, treatment, and prevention strategies for complex diseases.
Gene-set enrichment methods analyze the activity of thousands of genes together instead of analyzing genes one at a time.
The analysis reveals associations between genotypes and phenotypes, which are called molecular profiles or molecular phenotypes.
Other biological pathway databases are heterogeneous and lack annotations.
Unlike candidate pathway analysis, genome-wide pathway analysis does not require prior biological knowledge.
PAGED can reveal the interactions across the different databases.
Gene signature data from the transcriptome level offer a complementary source of information to complete pathway knowledge.

9. The division of pathway analysis
Pathway analysis is divided into three generations of approaches:
First generation: over-representation analysis (ORA) approach.
Second generation: functional class scoring (FCS) approach.
Third generation: pathway topology (PT) approach.
Multi-level, multi-scale, knowledge-guided enrichment analysis can enable molecular phenotype discovery for specific human diseases.
The acquisition of prior knowledge and systems modeling poses a challenge for developing tools that go beyond third-generation pathway analysis for disease-specific molecular profiling.
To meet the new challenges of molecular phenotype discovery, the Pathway And Gene Enrichment Database (PAGED) was developed.

10. The benefits of integrated database (PAGED)
This new database can provide the following benefits to biological researchers.
First, it contains disease-gene association data curated and integrated from the Online Mendelian Inheritance in Man (OMIM) database and the Genetic Association Database (GAD); it therefore has the potential to assist human disease studies.
Second, it contains all currently compiled gene signatures in the Molecular Signatures Database (MSigDB) and the Gene Signatures Database (GeneSigDB).
Third, it further integrates microRNA targets from the miRecords database, signaling pathways, protein interaction networks, and transcription factor/gene regulatory networks, partially based on data integrated from the Human Pathway Database (HPD) and the Human Annotated and Predicted Protein Interaction (HAPPI) database.
It integrates the following database versions, which were the latest available: OMIM (Feb. 2012), GAD (Aug. 2011), GeneSigDB (v. 4.0, Sept. 2011), MSigDB (v. 3.0, Sept. 2010), HPD (2009), HAPPI (v. 1.4), and miRecords (Nov. 2010).

11. The advantages of this Research
The advantage of this work is that it captures the relationships between pathways, gene signatures, microRNA targets, and/or network modules.
These gene-set-based relationships can be visualized as a gene-set association network (GSAN), which provides a roadmap for molecular phenotype discovery for specific human diseases.
The authors demonstrate how to query PAGED to discover crucial pathways, gene signatures, and gene network modules that are disease-specific.

12. Methods
Data sources: Figure 1 gives an overview of the data integration process.
Gene-set data were collected, extracted, and integrated from five major categories.
The pathway data came from HPD, which has integrated 999 human biological pathways from five curated sources: KEGG, PID, BioCarta, Reactome, and Protein Lounge.
The genome-level disease-gene relationships came from OMIM and GAD.
The transcriptome-level gene signatures came from MSigDB and GeneSigDB.
The post-transcriptome-level microRNA data came from miRecords.
The proteome-level data came from an integrated protein interaction database, HAPPI, which has integrated the HPRD, BIND, MINT, STRING, and OPHID databases.

13. (Figure slide; image not included in this transcript.)

14. Gene-set data integration
Gene sets are treated as all groups of genes, including disease-associated genes, pathway genes, gene signatures, microRNA-targeted genes, and PPI sub-network modules.
The raw files curated from those data sources come in various formats, including plain text, XML, and tables.
The authors wrote Perl/Java parsers to convert them into a common tab-delimited text format to ensure syntactic-level data compatibility.
To integrate across different databases, they mapped the gene/protein IDs in all databases to official gene symbols.
The gene-set data are stored in a backend Oracle 11g relational database.
All records of gene-set members are represented by official gene symbols.
All PAGED gene sets were assigned unique PAGED-specific identifiers.

15. Online software design
The PAGED platform follows a multi-tiered design architecture.
The backend was implemented as PL/SQL packages on an Oracle 11g database server.
The PAGED application middleware was implemented on the Oracle Application Express (APEX) server, which bridges the Apache web server and the Oracle database server.

16. Gene-set similarity measurement
The similarity score S(i,j) of two different gene sets is defined by a formula that was shown as an image on the slide and is not preserved in this transcript; its terms are described as follows.
Pi and Pj denote two different gene sets, while |Pi| and |Pj| are the numbers of genes in each of these two gene sets.
Their intersection Pi ∩ Pj denotes the common set of genes, while the size of their union Pi ∪ Pj is calculated as |Pi| + |Pj| − |Pi ∩ Pj|.
α is a weight coefficient in [0, 1], which balances the contributions of the overlap (left term SL) and the cover (right term SR).
SL is the well-known Jaccard coefficient, which is often used to evaluate the similarity between two sets.
When a larger gene set covers a smaller one, their similarity score is expected to be high enough to identify them.

17. Microarray data
For gene expression data analysis, the authors show how to discover crucial pathways, gene signatures, and gene network modules specific to disease functional genomics.
They downloaded a microarray dataset from the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/).
This microarray dataset compared the transcriptome data of collected adenomas with those of normal tissue from the same individuals.

18. Differential gene-set expressions
ABS_FC denotes the absolute value of the fold change for each gene.
Differential gene-set expression is then defined from these values.
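The gene-set similarity score described on slide 16 can be illustrated with a short Python sketch. The formula itself is missing from the transcript, so the form used here is an assumption that matches the slide's description: a weighted sum of the Jaccard coefficient (SL) and a cover term (SR), where the cover term is taken to be the shared genes divided by the size of the smaller set. The function name, the default weight of 0.5, and the example gene symbols are hypothetical and serve only as an illustration.

```python
def gene_set_similarity(set_i, set_j, alpha=0.5):
    """Return alpha * S_L + (1 - alpha) * S_R for two gene sets.

    S_L is the Jaccard coefficient. S_R is an ASSUMED cover term
    (shared genes divided by the size of the smaller set); the slide
    does not preserve the exact formula.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    set_i, set_j = set(set_i), set(set_j)
    if not set_i or not set_j:
        return 0.0  # an empty gene set shares nothing with any other set
    shared = len(set_i & set_j)
    # |Pi U Pj| = |Pi| + |Pj| - |Pi n Pj|, as stated on the slide
    union = len(set_i) + len(set_j) - shared
    s_left = shared / union                          # overlap (Jaccard)
    s_right = shared / min(len(set_i), len(set_j))   # cover (assumed form)
    return alpha * s_left + (1.0 - alpha) * s_right


# Illustrative gene sets: the larger set fully covers the smaller one.
small = {"TP53", "KRAS"}
large = {"TP53", "KRAS", "APC", "MYC"}
print(gene_set_similarity(small, large, alpha=0.5))  # 0.75
```

With the weight set to 1.0 the score reduces to the plain Jaccard coefficient (0.5 for this pair), while lower weights reward the case the slide highlights, in which a larger gene set covers a smaller one.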
NORM_ABS_FC: the p*-norm of ABS_FC over all the available differential gene expressions in a gene set.

19. Gene-set association network (GSAN) construction
To visualize the relationships between gene sets, a gene-set association network (GSAN) is defined as a network of associations between different gene sets, in which the network elements are represented as follows:
Node: gene set
Edge: association between two gene sets
Node size: gene-set scale (the number of genes in each gene set)
Node color: differential gene-set expression (NORM_ABS_FC)
Node line color: gene-set data source
Edge width: similarity score

20. Results
Database content statistics: Table 1 lists the detailed statistics for each data source and the overlap between each pair.

21. Gene-set scale distributions
Gene-set scale distributions for PAGED molecule data.
A gene-set scale refers to the number of molecules (i.e., genes) involved in a given gene set.
The distributions are plotted on a log scale for both the x-axis and the y-axis.
The linear trend line in red represents the linear regression of the PAGED distribution.

22. (Figure slide; image not included in this transcript.)

23. Cont.
An overview of the core functionality of the online PAGED website:
(A) the PAGED home page, providing search by either disease name or gene list;
(B) a webpage containing the list of gene sets retrieved as a result of a disease query;
(C) a webpage containing the list of gene sets retrieved as a result of a gene list query;
(D) an advanced search page in which the user can either search by disease name or upload a gene list to search;
(E) a browse page listing the gene sets, their data sources, and their numbers of genes.

24. (Figure slide; image not included in this transcript.)

25. Discussion
In the near future, improved gene-set similarity algorithms will be introduced that use a global PPI network to calculate the distance between gene sets.
This would provide a more robust measurement.
For web interface development, the plan is to add a disease browsing function based on disease ontology and a network visualization function to show the gene-set associations dynamically.
The final goal is to perform multi-scale network modeling for molecular phenotype discoveries by integrating differential expressions with pathway and network topologies.
The current release of PAGED provides a solid foundation for developing third-generation pathway analysis tools.

26.
Conclusions
The authors developed PAGED, an online database that provides the most comprehensive public compilation of gene sets.
In the current release, PAGED contains a total of 25,242 gene sets, 61,413 genes, 20 organisms, and 1,275,560 records from five major categories: pathway data from HPD, genome-level disease data from OMIM and GAD, transcriptome-level gene signatures from MSigDB and GeneSigDB, post-transcriptome microRNA data from miRecords, and proteome-level data from HAPPI.
The results cover the number of overlapping genes between each pair of data sources, the gene-set scale distribution, and a case study in colorectal cancer.
The current PAGED software can help users address a wide range of gene-set-related questions in human disease biology studies.

27. MGD: the Mouse Genome Database
ABSTRACT
The Mouse Genome Database (MGD) (http://www.informatics.jax.org) is one component of a community database resource for the laboratory mouse, a key model organism for interpreting the human genome and for understanding human biology.
MGD strives to provide an extensively integrated information resource with experimental details annotated from both the literature and online genomic data sources.
MGD presents the consensus representation of genotype (sequence) to phenotype information, including highly detailed information about genes and gene products.
The primary foci of integration are representations of relationships between genes, sequences, and phenotypes.
MGD collaborates with other bioinformatics groups to curate a definitive set of information about the laboratory mouse.
Recent developments include a general implementation of database structures for controlled vocabularies and the integration of a phenotype classification system.
Judith A. Blake et al. MGD: the Mouse Genome Database. Nucleic Acids Research, 2003, Vol.
31, No. 1, 193-195. DOI: 10.1093/nar/gkg047

28. INTRODUCTION
The Mouse Genome Database (MGD) provides integrated information on mouse genes, genetic markers, and genomic features, as well as information on molecular segments (probes, primers, cDNA clones, BACs, and YACs), mutant phenotypes, comparative mapping data, graphical displays of linkage, cytogenetic and physical maps, experimental mapping data, and strain distribution patterns for recombinant inbred strains (RIs) and cross haplotypes.
MGD is updated daily and provides several new data manipulation and display tools.
MGD is one component of the Mouse Genome Informatics (MGI) database resource (http://www.informatics.jax.org), located at The Jackson Laboratory (http://www.jax.org).

29. IMPROVEMENTS DURING 2002
Implementation of phenotype classifications
A broad, high-level set of phenotype terms has been developed and employed to classify phenotype data in MGD.
This defined vocabulary of 105 terms can be used to search, group, compare, and analyze phenotypes.
These phenotype classification terms appear on the Alleles and Phenotypes Query Form (Fig. 1) and on the Genes and Markers Query Form.
The complete list of terms and their accession IDs is also available by FTP.
On each form, there is a link to the phenotype classification terms, complete with definitions and examples.
Users of the MGI database can select one or more terms from the list to search for records associated with a particular phenotype, in combination with many other parameters on the forms.
In addition, text-based searches for more specific phenotypic terms remain available.

30. (Figure slide; image not included in this transcript.)

31.
Improvements to the MGI:GO browser
The MGI GO Browser (http://www.informatics.jax.org/searches/GO_form.shtml) allows database users to access genes in MGI using functional annotation terms from the GO.
This browser was developed in conjunction with the Gene Expression Database (GXD).
The GO Browser can be accessed from gene detail or query pages as well as directly from the MGI menus.

32. Availability of MGI:GO files in various formats
MGI gene-to-GO annotations are updated daily.
Various files for the MGI genes/markers with their GO associations are publicly available.
These files are updated each time MGI submits a new gene association file to the GO web site (http://www.geneontology.org) and can be accessed on the MGI FTP server (ftp://www.informatics.jax.org/pub/informatics/reports/gene_association.mgi).
A file of all the GO terms used by MGI in the annotation of genes and gene products is also available.
MGI also provides the GO database with a file of MGI gene : SWISS-PROT associations.
This information is incorporated into the GO database and thus enables users to recover mouse sequence data as a result of a semantic search against the GO database (http://www.godatabase.org/cgi-bin/go.cgi).

33. IMPLEMENTATION
MGD is implemented in the Sybase relational database system, version 12.5.
A large set of CGI scripts and Java Servlets mediate the user's interaction with the database.
For computational users, direct SQL access can be requested through User Support.
User-requested database reports and a number of widely used data files (generated daily) are available on the FTP site (ftp://ftp.informatics.jax.org).

34.
CITING MGD
The following citation format is suggested when referring to datasets specific to the MGD component of MGI: Mouse Genome Database (MGD), Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, Maine (URL: http://www.informatics.jax.org).
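
The gene association files mentioned on slide 32 are tab-delimited text, so a gene-to-GO lookup can be built with a few lines of Python. The sketch below assumes the standard GO Annotation File (GAF) column order (gene symbol in column 3, qualifier in column 4, GO ID in column 5) and comment lines that start with "!". The function name and the sample records are hypothetical placeholders, not real MGI data, and the layout of the actual downloaded file should be checked before relying on this.

```python
import csv
from collections import defaultdict


def read_gene_to_go(lines):
    """Map each gene symbol to the set of GO IDs annotated to it.

    Assumes GAF-style tab-delimited rows; lines starting with "!" are
    header comments. Annotations qualified with NOT are skipped.
    """
    data_lines = (
        line for line in lines
        if line.strip() and not line.startswith("!")
    )
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    gene_to_go = defaultdict(set)
    for row in reader:
        if len(row) < 5:
            continue  # not a complete annotation row
        symbol, qualifier, go_id = row[2], row[3], row[4]
        if "NOT" in qualifier.split("|"):
            continue  # negative annotation: gene is NOT associated
        gene_to_go[symbol].add(go_id)
    return dict(gene_to_go)


# Placeholder records for illustration only (not real MGI annotations).
sample = [
    "!gaf-version: 2.2\n",
    "MGI\tMGI:0000001\tGeneA\t\tGO:0000001\tMGI:ref1\tIDA\t\tP\n",
    "MGI\tMGI:0000001\tGeneA\t\tGO:0000002\tMGI:ref2\tIEA\t\tF\n",
    "MGI\tMGI:0000002\tGeneB\tNOT\tGO:0000001\tMGI:ref3\tIMP\t\tP\n",
]
annotations = read_gene_to_go(sample)
print(sorted(annotations["GeneA"]))
```

To use it on a downloaded file, pass an open file handle, for example `read_gene_to_go(open(path, encoding="utf-8"))`, where `path` points to a local copy of the gene association file.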