mBRCs and MICROBIAL DATABASES INTERCONNECTION DATA
Alexander Vasilenko, Svetlana Ozerskaya, Oleg Stupar VKM, IBPM RAS
The general goals to reach
Microbial culture collections and projects such as MIRRI and GBRCN aim to meet the needs of biomedicine, agriculture, biotechnology, and science. Data integration is one of the key tools for this, and the field of integration is Life Science data.
There are three big data sources in Life Science: (1) databases, (2) publications, (3) datasets. As far as we know, databases are the main structured data holdings in this list, so this report looks mostly at the integration opportunities for them: integration of mBRC microbial data with Life Science databases, first of all with the databases used in biomedicine, pharma, agriculture, and bioremediation.
Potential help for the Life Science world (mBRC contributions):
1. Assurance of repeatability of experimental data
2. Resolving nomenclatural issues related to microorganisms
3. Strain-specific characters
General CC - life science connection data
(1) In culture collections* 708 culture collections in WDCM/CCinfo
150 online catalogues (67 in EU)
(2) Life science
2076 databases collected online, 807 with microbial data
807 = 117(Mo) +483(Sp) + 207(St)
- Mo: microorganisms are covered (bacteria, fungi, yeasts, archaea, protists, microalgae, viruses), but no species names are given,
- Sp: species names are given, but no strains,
- St: strains are covered.
The value 207 (St) is about 26% of 807. In fact this means minimal interconnection between the left side and the right side: (1) for each database we recorded only the highest level found in the list Mo, Sp, St, (2) more than 50% of the strains were not in the culture collections, (3) we never found the address of the strains, (4) in Life Science databases each strain is a separate record, unlike the StrainInfo histories (next picture)
* Plus WDCM, Straininfo, CABRI and regional mBRC networks
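The split of the 807 microbial databases into the three levels can be checked with a few lines of Python. This is only a sketch of the arithmetic on this slide; the variable names are illustrative.

```python
# Level counts as reported on the slide
counts = {"Mo": 117, "Sp": 483, "St": 207}

total = sum(counts.values())

# Share of each level, rounded to whole percent
shares = {level: round(100 * n / total) for level, n in counts.items()}

print(total)
print(shares)
```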
General integration schema
[Diagram: culture collections (CC1, CC2, ..., CCn), MICRO-IS, infrastructures (Infrastructure_1, Infrastructure_2, ..., Infrastructure_m) and Life Science databases (DB_1, DB_2, ..., DB_k)]
Potentially this data integration could mean the following tasks:
1. To make mBRC data visible and accessible from partner Life Science databases,
2. To make partner database records visible and accessible from the mBRC aggregated catalogue,
in two formats:
a. for human access,
b. for computer programs.
Life Science databases inspected
The total number of life science database names or references found in this study is more than 12 800. More than 5500 database references were inspected manually (plus a group of 7625 databases in the BioCyc system, each of which presents metabolic pathways and their operons for one bacterial strain). The total number of life science databases collected that are visible online is 2076; 807 of them hold microbial data (plus the 7625 bacterial databases in BioCyc).
Main sources inspected:
• MB (1802 entries, http://metadatabase.org/wiki/Help:Browsing, 26.12.2015),
• Biosharing (724 databases, https://www.biosharing.org/, 26.12.2015),
• BioMedBridges (814 databases, http://wwwdev.ebi.ac.uk/fgpt/toolsui/, 27.12.2015),
• Pathguide (363 database names, http://www.pathguide.org/, 2013),
• ELIXIR list (579 entries, https://bio.tools/?q=database, 28.1.2016),
• ExPASy (85 + 665 databases, http://www.expasy.org/old_links, 12.2.2016)
Database parameters collected (an example)
• Unique identifier: BIODBCORE-000515
• Database acronym: Pfam
• Database name: Sanger Pfam Mirror
• Database URL: http://pfam.sanger.ac.uk/
• Access level: Open
• Practical domain: health, winemaking, baking, brewing
• Microbial level: st
• Year of the last correction: 2015
• Developer/Owner: UK, EMBL-EBI
• Comment
• Orientation: bacteriophages, viruses, bacteria
• Properties: protein
• Search by: OMIM ID, PubMed ID, ...
• Ontologies list: GO
• Partner databases: CATH, CDD, Europe PMC, HGNC, InterPro, iPfam, MEROPS, NCBI Gene, NCBI Taxonomy, OMIM, PDBe, PDBj, PDBsum, PMC, PRINTS, PROSITE, PubMed, RCSB PDB, RefSeq, SCOP, SMART, SUPERFAMILY, UCSC, UniProtKB
• Program interface: WEB UI, RESTful interface
Databases by properties (number of databases, property)
215 genome, 145 protein, 88 chemistry, 78 one taxon, 36 pathway, 32 RNA, 22 biodiversity, 20 enzyme, 20 taxonomy, 14 peptide, 14 pharmacology, 13 publications, 11 drug, 9 cell, 8 image, 8 ribosome, 8 web-portal, 6 antibody, 6 antimicrobic, 6 metabolite, 6 molecules, 6 toxicology
5 5 5 4 4 4 4 3 3 3 3 3 3 3 2 2 2 2 2 2 2
carbohydrates pathogenic structure biological activities of small molecules lipid metabolom promoter interactome plasmid structure biomolecule terminology toxin veterinary antibiotic resistance bacteriophages barcodes biodegradation biomolecules collection Cyanobacteria immunogenetics
2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
immunology map mtDNA patent pathogenic mo phylogenetica phylogeny stem cell transport virulence factors acetilation allergenic allergenic molecules antibiotics ascomycetes bacteriocins Biocatalysis/Biodegradation Carbon chemical compounds crop protection
One Taxon section (78 total)
Class: Mollicutes 1
Family: xylariaceae 1
Group: trichomycetes 1
Genus (18): Ashbya 1, Aspergillus 2, Bacillus 2, Canadensis 1, Candida 1, Corynebacterium 1, Legionella 1, Listeria 1, Mycobacterium 2, Prochlorococcus 1, Pseudomonas 1, Saccharomyces 2, Streptococcus 2
Species (57): Arabidopsis thaliana 1, Bacillus cereus 5, Bacillus subtilis 2, Buchnera aphidicola 1, Chlamydia trachomatis 1, Dictyostelium discoideum 1, Escherichia coli 15, Fusarium graminearum 1, Helicobacter pylori 1, Magnaporthe grisea 2, Mycobacterium leprae 1, Mycobacterium marinum 1, Mycobacterium smegmatis 1, Mycobacterium tuberculosis 1, Mycobacterium ulcerans 1, Mycoplasma genitalium 1, Mycoplasma pulmonis 1, Myxococcus xanthus 1, Neurospora crassa 4, Pyrococcus abyssi 1, Saccharomyces cerevisiae 8, Schizosaccharomyces pombe 1, Sporisorium reilianum 1, Staphylococcus aureus subsp. aureus 1, Toxoplasma gondii 1, Ustilago hordei 1, Ustilago maydis 1
127 databases for specific organism types (they also keep microbial data):
animal 3, archaea 3, bacteria 13, drosophila 3, fungi 23, human 12, plant 10, protists 1, vertebrates 1, viruses 44, yeast 14
Biggest database producer: BESC (BioEnergy Science Center)
BioCyc pathway/genome databases: 7667 databases in total (http://www.biocyc.org/biocyc-pgdb-list.shtml)
Group 1: 7 databases: EcoCyc, MetaCyc, HumanCyc, AraCyc, YeastCyc, LeishCyc, TrypanoCyc
Group 2: 41 databases generated by program, with curation done, each covering one organism or strain:
Agrobacterium fabrum C58; Anopheles gambiae; Aurantimonas manganoxydans SI85-9A1; Bacillus anthracis Ames; Bacillus subtilis 168; Bacteroides thetaiotaomicron VPI-5482; Candidatus Cardinium hertigii; Candidatus Evansia muelleri; Candidatus Portiera aleyrodidarum BT-QVLC; Caulobacter crescentus CB15; Caulobacter crescentus NA1000; Chlamydomonas reinhardtii; Clostridium saccharoperbutylacetonicum ATCC 27021; Cryptosporidium hominis TU502; Cryptosporidium parvum Iowa; Drosophila melanogaster; Escherichia coli B str REL606; Escherichia coli CFT073; Escherichia coli K-12 substr W3110; Escherichia coli O157:H7 str EDL933; Eubacterium rectale ATCC 33656; Helicobacter pylori 26695; Listeria monocytogenes 10403S; Methylosinus trichosporium OB3b; Mus musculus; Mycobacterium tuberculosis CDC1551; Mycobacterium tuberculosis H37Rv; Penicillium chrysogenum Wisconsin 54-1255; Peptoclostridium difficile 630; Plasmodium berghei ANKA; Plasmodium chabaudi; Plasmodium falciparum 3D7; Plasmodium vivax Sal-1; Plasmodium yoelii 17XNL; Schistosoma mansoni; Shigella flexneri 2a str 2457T; Streptomyces coelicolor A3(2); Synechococcus elongatus PCC 7942; Thalassiosira pseudonana CCMP1335; Toxoplasma gondii ME49; Vibrio cholerae O1 biovar El Tor str N16961
Group 3: 7625 databases, each covering one bacterial strain, with no curation yet
Second-biggest database producer: EMBL-EBI (98 databases)
ArrayExpress
ASD
ASTD
ATD
BioModels
BioSamples
Cellular Phenotype Db
ChEBI
ChEMBL
CluSTr
CSA
DGVa
DNAtraffic
DrugPort
e!Ensembl
e!Ensembl S. cerevisiae
e!EnsemblBacteria
e!EnsemblCat
e!EnsemblChicken
e!EnsemblChimpanzee
e!EnsemblCow
e!EnsemblDog
e!EnsemblFugu
e!EnsemblFungi
e!EnsemblGenomes
e!EnsemblGorilla
e!EnsemblHorse
e!EnsemblMetazoa
e!EnsemblMouse
e!EnsemblPig
e!EnsemblPlants
e!EnsemblProtists
e!EnsemblRabbit
e!EnsemblZebrafish
EGA
EMBL
EMBL-EBI
EMDB
ENA
Ensembl
Enzyme Portal
Enzyme Structures
EVA
Expression Atlas
FunTree
GeneDB
GWAS Catalog
HGNC
HipSci
IGSR
IMEx
IMGT/HLA
IntAct
IntEnz
InterPro
IPD
IPD-ESTDAB
IPD-HPA
IPD-KIR
IPD-MHC
logRECOORD
MACiE
MEROPS
MetaboLights
Metal MACiE
MicroCosm
MIRIAM collection
MTBLS
NRNL1
NRNL2
NRPL1
NRPL2
OLDERADO
PANDIT
PDBe
PDBe EM Resources
PDBeChem
PDBsum
Pfam
Pfam
PhenoDigm
PICR
PomBase
PRIDE
PROCOGNATE
Reactome
RECOORD
Rfam
RNAcentral
SAS
SRS@EMBL-EBI
SureChEMBL
TreeFam
UniChem
UniProt-GOA
UniSave
VASCO
VectorBase
NCBI (71 databases)
Assembly
BioProject
BioSample
BioSystems
Bookshelf
CCDS
CDD
ClinGen
ClinVar
Clone DB
COGs
dbEST
dbGaP
dbGSS
dbMHC
dbProbe
dbSNP
dbSTS
dbVar
Dengue virus database
ECRbase
Epigenomics
Genbank
Gene
Genetic Codes
Genome
GEO
GEO DataSets
GEO Profiles
GTR
Histone
HIV-1
Homologene
IBIS
Influenza Virus Resource
MapViewer
MedGen
MEDLINE
MeSH
MMDB
NCBI
NCBI taxonomy
NCBI Trace Archives
NLM Catalog
Nucleotide
OMIM
Organelle genomes
Plant Genome Central
PMC
PopSet
PRK
Probe
Protein
Protein Clusters
PubChem
PubChem BioAssay
PubChem Compound
PubChem Substance
PubMed
PubMed Health
RefSeq
RefSeqGene
Retroviruses
SKY/M-FISH and CGH
SRA
Structure
TPA
UniGene
UniVec
Viral genomes
Virus Variation
Biggest database partner list: UniProtKB
See http://www.uniprot.org/database/
Allergome, ArachnoServer, Bgee, BindingDB, BioCyc, BioGrid, BioMuta, BRENDA, CAZy, CCDS, CGD, ChEMBL, ChiTaRS, CleanEx, CollecTF, COMPLUYEAST-2DPAGE, ConoServer, CTD, dbSNP, DDBJ, DEPOD, dictyBase, DIP, DisProt, DMDM, DNASU, DrugBank, EchoBASE, EcoGene, eggNOG, ENA,
Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, ENZYME, ESTHER, euHCVdb, EuPathDB, ET, ExpressionAtlas, FlyBase, GenAtlas, GenBank, Gene3D, GeneCards, GeneDB, GeneFarm, NCBI Gene, GeneReviews, Ensembl Genomes, Genevisible, GeneWiki, GenoList, GenomeRNAi, GPCRDB, Gramene, GuidetoPHARMACOLOGY, H-InvDB, HAMAP,
HGNC, HOGENOM, HOVERGEN, HPA, HUGE, IMGT, InParanoid, IntAct, InterPro, iPTMnet, KEGG, LegioList, Leproma, MaizeGDB, MalaCards, MaxQB, MEROPS, MGI, Micado, OMIM (1), MINT, MobiDB, ModBase, MoonProt, mycoCLAP, NextBio, neXtProt, OMA, Orphanet, OrthoDB, PANTHER,
PATRIC, PDBe, PDBj, PDBsum, PeptideAtlas, PeroxiBase, Pfam, PharmGKB, PhosphoSite, PhylomeDB, PIR, PIRSF, PomBase, PRIDE, PRINTS, ProDom, ProMEX, PROSITE, PMP, Proteomes, ProtoNet, PseudoCAP, RCSB-PDB, Reactome, REBASE, RefSeq, REPRODUCTION-2DPAGE, RGD, Rouge, SABIO-RK,
SBKB, SGD, SignaLink, SMART, SMR, SOURCE, STRING, SUPFAM, SWISS-2DPAGE, SwissLipids, TAIR, TCDB, TIGRFAMs, TreeFam, TubercuList, UCD-2DPAGE, UCSC, UniCarbKB, UniGene, UniPathway, VectorBase, WBParaSite, World-2DPAGE, WormBase, Xenbase, ZFIN
To find an appropriate integration solution we constructed a table of partner references between Life Science databases. It has 2054 rows, each row representing a specific Life Science database, and 805 columns, each column representing a specific microbial database. The cell in row I, column J has the value 1 if microbial database J has database I in its list of partners, and 0 otherwise.
With this table we track two integration parameters of a specific database {A}:
1. Connection Factor: the list of database partners of {A} according to the materials of {A} (data sources, common fields, common data curation, etc.); this is the column of the table. Connection Number: the number of elements equal to 1 in this list, i.e. how many database partners {A} has.
2. Attraction Factor of database {A}: the list of the databases that reference {A} in their Connection Factor; this is the row of the table. Attraction Number: the number of elements equal to 1 in this list, i.e. how popular {A} is in the database community.
Attraction Number (AN) and Connection Number (CN) both indicate the integration level of a specific database. In our research the UniProtKB database shows the best balanced values:
CN=149, AN=350.
In CC and mBRC catalogues both values are mostly 0.
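A minimal sketch of how CN and AN could be read off such a table, assuming it is stored as a list of rows of 0/1 values. The four database names and their partner lists below are invented toy data, and the function names are hypothetical; none of them come from the study.

```python
from typing import Dict, List, Set


def build_table(all_dbs: List[str], microbial_dbs: List[str],
                partners: Dict[str, Set[str]]) -> List[List[int]]:
    """Cell [i][j] is 1 if microbial database j lists database i as a partner."""
    return [
        [1 if row_db in partners.get(col_db, set()) else 0
         for col_db in microbial_dbs]
        for row_db in all_dbs
    ]


def connection_numbers(table: List[List[int]],
                       microbial_dbs: List[str]) -> Dict[str, int]:
    """CN of each microbial database: the number of 1s in its column."""
    return {db: sum(row[j] for row in table)
            for j, db in enumerate(microbial_dbs)}


def attraction_numbers(table: List[List[int]],
                       all_dbs: List[str]) -> Dict[str, int]:
    """AN of each database: the number of 1s in its row."""
    return {db: sum(row) for db, row in zip(all_dbs, table)}


# Toy data (hypothetical names, not taken from the study)
all_dbs = ["DB_A", "DB_B", "DB_C", "DB_D"]
microbial_dbs = ["DB_A", "DB_B", "DB_C"]
partners = {
    "DB_A": {"DB_B", "DB_C"},
    "DB_B": {"DB_A"},
    "DB_C": {"DB_A", "DB_B", "DB_D"},
}

table = build_table(all_dbs, microbial_dbs, partners)
cn = connection_numbers(table, microbial_dbs)
an = attraction_numbers(table, all_dbs)
print(cn)
print(an)
```

In this toy table DB_C has the most partners (the largest CN), while DB_A and DB_B are referenced most often (the largest AN).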
Starting from scratch, we selected two candidate lists of integration contracts:
1. NCBI + UniProt integration schema + GeneCards integration schema (“NCBI list”)
2. EMBL-EBI + UniProt integration schema + GeneCards integration schema (“EMBL-EBI list”)
According to the table, if we succeed with all three contracts, MICRO-IS would have:
1. CN=163, initial value of AN=163, potential value 670, with the NCBI list, or
2. CN=146, initial value of AN=146, potential value 654, with the EMBL-EBI list.
Databases with more than 10 partners (CN)
149 UniProtKB
148 UniProt-GOA
84 GeneCards
61 Hits
59 SWISS-2DPAGE
54 NCBI
50 PiroplasmsDB
49 EMBL
49 ENA
46 SGD
43 e!EnsemblGenomes
42 MetaCyc
40 EcoGene
40 EMBL-EBI
40 Gene
37 PubChem
37 ViralZone
36 EcoCyc
36 GPCRs
36 NCBI taxonomy database
34 ESTHER
34 MalaCards
34 OMIM
34 OMIM (1)
33 PRODORIC
32 HOGENOM
31 ChEBI
30 Genome
30 HPIDB
30 JPGV
30 STITCH
29 EuPathDB
29 FungiDB
29 MACiE
29 T3DB
28 EcoProDB
28 InterMitoBase
28 PIR
28 Reactome
27 BioSystems
27 CAZy
27 RNAcentral
27 STRING
26 GPM
26 Guide to Pharmacology
26 IRD
26 PATRIC
26 ThaleMine
26 YeastMine
25 LPSN
25 TRRD
24 CCDB
24 dictyBase
24 DrugBank
24 GeneProf
24 Genolevures
24 TDR Targets
23 GTOP
23 Pfam
23 Pfam
22 Genetics Home Reference
22 MapViewer
22 MouseMine
22 PDBsum
22 PED
22 PomBase
21 dbSNP
21 Europe PMC
21 FlyMine
21 KEGG
21 Microbes Online
21 NextBio
21 OrthoDB
20 ConsensusPathDB
20 INstruct
20 MINT
20 Retroviruses
19 MetaboLights
19 TCDB
19 TriTrypDB
18 Ebolavirus
18 HMDB
18 InnateDB
18 MitoMiner
18 MODOMICS
18 WholeCellKB
17 ComPPI
17 EcoliWiki
17 InterPro
17 SDAP
17 ViPR
17 WikiPathways
17 YeastCyc
16 IEDB
16 PANTHER
16 Pseudomonas Genome Database
16 RCSB PDB
16 RNA Virus
16 toxoMine
15 CDD
15 DNAtraffic
15 EVA
15 HFV / Ebola Database
15 i2d
15 PhosphoGRID
15 Source
15 UniRef
15 Victors
14 APD
14 Biozon
14 DAMPD
14 GeneDB
14 HSDB
14 KEGG ORTHOLOGY
14 NPIDB
14 PANDORA
14 PLEXdb
14 PROMISCUOUS
14 Rhea
14 SRA
14 TargetDB
14 TransportDB
14 UniPathway
13 APID
13 ASAP
13 BioModels
13 BRENDA
13 e!Ensembl Saccharomyces cerevisiae
13 e!EnsemblBacteria
13 e!EnsemblFungi
13 e!EnsemblProtists
13 ECMDB
13 EPD
13 FooDB
13 GenoList
13 IMEx
13 KEGG BRITE
13 KEGG GENES
13 KEGG GENOME
13 MTBLS
13 neXtProt
13 NMPDR
13 ORENZA
13 PeptideAtlas
13 PhenomicDB
12 DBETH
12 Drug2Gene
12 Genbank
12 GoMapMan
12 IMG
12 IMGT/3Dstructure-DB
12 iRefWeb
12 MEROPS
12 ModBase
12 MOPED
12 PMP
12 P-POD
12 Proteome 2D-PAGE Database
12 PubChem BioAssay
12 PubChem Compound
12 PubChem Substance
12 Rfam
12 sRNAMap
12 TubercuList
12 YMDB
11 CCSB Interactome
11 CTD
11 CyanoBase
11 ExplorEnz
11 iHOP
11 IMP
11 INTEGRALL
11 KEGG LIGAND
11 KiMoSys
11 KinBase
11 MatrixDB
11 MPIDB
11 MycoBank
11 Nucleotide
11 PICR
11 PRGdb
11 PROSITE
11 REBASE
11 SMART
Databases with a big attraction number (AN)
350 UniProtKB (+Swiss-Prot +TrEMBL)
335 PubMed
181 RCSB PDB
166 Genbank
154 NCBI taxonomy
140 Gene
133 RefSeq
131 KEGG
108 EC
104 InterPro
89 Ensembl
87 Pfam
74 SGD
68 ENA
59 OMIM
56 Nucleotide
51 IntAct
49 PROSITE
46 Reactome
45 BioGRID
45 PIR
43 HGNC
42 FlyBase
41 COGs
41 SCOP
40 UniGene
39 CAS
39 MEDLINE
39 MGI
38 PubChem
38 SMART
37 DIP
37 GEO
36 ChEBI
35 DDBJ
35 MINT
34 WormBase
33 Genome
33 STRING
31 NCBI
31 TAIR
30 DrugBank
29 BRENDA
29 HPRD
29 TIGRFAMS
28 ENZYME
27 Pfam
27 SUPERFAMILY
25 BioCyc
25 GeneCards
24 EcoCyc
24 MeSH
24 PDBe
24 PRINTS
23 CDD
23 PANTHER
23 ProDom
22 BioProject
22 CATH
22 HomoloGene
22 KEGG PATHWAY
22 MetaCyc
22 PMC
21 ChEMBL
21 dbSNP
21 wwPDB
20 PDBsum
20 PIRSF
20 PRIDE
Database partners in the NCBI list solution: addgene, Allergome, Assembly, BioCyc, BioGRID, BioProject, BioSample, BioSystems, Bookshelf, BRENDA, CAZy, CDD, CGD, ChEMBL, ClinicalTrials.gov, COGs, CollecTF, Compulyeast, CRISPRdb, CTD, dbEST, dbGSS, dbProbe, dbSNP, DDBJ, Dengue virus database, dictyBase, DNASU, DrugBank, e!EnsemblBacteria, e!EnsemblFungi, e!EnsemblGenomes, e!EnsemblProtists, EchoBASE, EcoGene, eggNOG, ENA, Ensembl, ESTHER, euGenes, euHCVdb, EuPathDB, Expression Atlas, Genbank, Gene, GeneCards, GeneDB, Genetic Codes, Genetics Home Reference, GenoList, Genome, GEO, GEO DataSets, GEO Profiles, Gramene, Guide to Pharmacology, HAMAP, Histone, HIV-1, HMDB, HOGENOM, Homologene, HOVERGEN, i2d, Influenza Virus Resource, InParanoid, IntAct, InterPro, iPTMnet, KEGG, LegioList, Leproma, LifeMap Discovery, MalaCards, MapViewer, MaxQB, MedGen, MEDLINE, MedlinePlus, MEROPS, MeSH, Micado, MINT, miRBase, miRTarBase, MMDB, MobiDB, ModBase, MoonProt, MOPED, mycoCLAP, NCBI, NCBI taxonomy, NCBI Trace Archives, NextBio, neXtProt, NONCODE, Nucleotide, OMA, OMIM, OMIM (1), Organelle genomes, OrthoDB, PANTHER, PATRIC, PaxDB, PDBe, PDBj, PDBsum, PeptideAtlas, PeroxiBase, Pfam, PharmGKB, PhylomeDB, PIR, PMC, PMP, PomBase, PopSet, PRIDE, PRK, Probe, PROSITE, Protein, Protein Clusters, Proteomes, ProteopediA, PseudoCAP, PubChem, PubChem BioAssay, PubChem Compound, PubChem Substance, PubMed, PubMed Health, RCSB PDB, Reactome, REBASE, RefSeq, Retroviruses, Rfam, SABIO-RK, SGD, SIMAP, SMART, Source, SRA, STRING, Structure, SUPERFAMILY, SWISS-2DPAGE, SWISS-MODEL, TCDB, TIGRFAMS, TubercuList, UCD 2D-PAGE, UCSC, UMLS, UniGene, UniPathway, UniProtKB, Viral genomes, Virus Variation, World-2DPAGE Repository
Database partners in the EMBL-EBI list solution: addgene, Allergome, ArrayExpress, ASTD, BioCyc, BioGRID, BioModels, BioSamples, BioSystems, Bookshelf, BRENDA, CAZy, CGD, ChEBI, ChEMBL, ClinicalTrials.gov, CollecTF, Compulyeast, CRISPRdb, CTD, dbSNP, DDBJ, dictyBase, DNASU, DNAtraffic, DrugBank, DrugPort, e!Ensembl, e!Ensembl Saccharomyces cerevisiae, e!EnsemblBacteria, e!EnsemblFungi, e!EnsemblGenomes, e!EnsemblProtists, EchoBASE, EcoGene, eggNOG, EMBL, EMBL-EBI, EMDB, ENA, Ensembl, Enzyme Structures, ESTHER, euGenes, euHCVdb, EuPathDB, EVA, Expression Atlas, Genbank, Gene, GeneCards, GeneDB, Genetics Home Reference, GenoList, Gramene, Guide to Pharmacology, HAMAP, HMDB, HOGENOM, Homologene, HOVERGEN, i2d, IMEx, InParanoid, IntAct, InterPro, iPTMnet, KEGG, LegioList, Leproma, LifeMap Discovery, MACiE, MalaCards, MaxQB, MedlinePlus, MEROPS, MeSH, MetaboLights, Micado, MINT, miRBase, miRTarBase, MobiDB, ModBase, MoonProt, MOPED, MTBLS, mycoCLAP, NCBI, NextBio, neXtProt, NONCODE, OMA, OMIM, OMIM (1), OrthoDB, PANTHER, PATRIC, PaxDB, PDBe, PDBe EM Resources, PDBj, PDBsum, PeptideAtlas, PeroxiBase, Pfam, Pfam, PharmGKB, PhylomeDB, PICR, PIR, PMP, PomBase, PRIDE, PROSITE, Proteomes, ProteopediA, PseudoCAP, PubChem, PubMed, RCSB PDB, Reactome, REBASE, RefSeq, Rfam, RNAcentral, SABIO-RK, SGD, SIMAP, SMART, Source, STRING, SUPERFAMILY, SWISS-2DPAGE, SWISS-MODEL, TCDB, TIGRFAMS, TubercuList, UCD 2D-PAGE, UCSC, UMLS, UniGene, UniPathway, UniProt-GOA, UniProtKB, World-2DPAGE Repository
Task 1a: Name processing *
* Page content from: http://www.mycobank.org/BioloMICS.aspx?Table=Mycobank&Rec=18759&Fields=All
Task 1a: Strains algorithm
WDCM 133, Centraalbureau voor Schimmelcultures, Filamentous fungi and Yeast Collection, Netherlands
WDCM 18, Food Science Australia, Ryde
WDCM 758, IBT Culture Collection of Fungi, Denmark
WDCM 214, CABI Genetic Resource Collection, UK
Task 2b tools
Based on: ELIXIR network infrastructure, BioMedBridges reports schema in semantic Web, LOD cloud
MICRO-IS
Ontologies in Task 2b
Number of microbial databases with ontologies: 243; total number of ontologies: 63
Ontologies used in more than one database (ontology, number of databases):
GO 207, KO 17, SO 9, ECO 9, PW 7, FMA 7, EFO 6, CCO 6, Pathway Tools pathway ontology 6, Pathway Tools Evidence Ontology 6, Pathway Tools compound ontology 6, PRO 5, ChEBI 5, PO 3, MultiFun 3, HPO 3, BTO 2, CAB Thesaurus 2, CL 2, DO 2, EPO 2, JPO 2, KIPO 2, KPSI MI Ontology 2
Ontologies used in just one database:
ARO, CAVEman, CELDA, Cereal plant growth stage (GRO), Dictyostelium anatomy ontology, DOID, ENVO, EC, EO, FAO, FlyBase Controlled Vocabulary, FYPO, GBIF Taxonomic checklists, GR_tax, MA, MapMan, MEO, MeGO, MetaCyc Pathway Ontology, Metathesaurus, MGED, MO, MSH, NCI Thesaurus, NCIM, ncRNA vocab, ORDO, Pathway Tools reaction ontology, PDO, Phenotype Ontology, PhiGO, Plant anatomy, Plant Development Ontology, PSI-MOD, PSO, QuickGO, The Gene categories from Monica Riley, TO, YPO
MICRO-IS catalogue data fields
If database integration is the only goal of MICRO-IS, the minimal catalogue data standard could have six fields:
- Name,
- Accession number,
- Original URL of strain passport,
- History of deposit,
- Type of microorganism,
- List of Life Science databases that present data on this strain, with links to these data
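The six fields can be sketched as a small record type that serialises to JSON for access by computer programs. This is only an illustration: the class name, the field names and all example values (including the example.org URLs) are hypothetical placeholders, not an agreed MICRO-IS format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class StrainRecord:
    """Hypothetical minimal catalogue record with the six fields listed above."""
    name: str                    # Name
    accession_number: str        # Accession number
    passport_url: str            # Original URL of strain passport
    deposit_history: List[str]   # History of deposit
    organism_type: str           # Type of microorganism
    # Life Science databases presenting data on this strain: database -> link
    database_links: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Placeholder values only; no real strain is described here
record = StrainRecord(
    name="Example species",
    accession_number="CC-0000",
    passport_url="https://example.org/catalogue/CC-0000",
    deposit_history=["Depositor X", "Collection Y"],
    organism_type="bacteria",
    database_links={"ExampleDB": "https://example.org/db?strain=CC-0000"},
)
print(record.to_json())
```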
PICR database partners
Look at: http://www.ebi.ac.uk/Tools/picr/userguide.do
PICR
EMBL(), Ensembl(), Ensembl Genomes(), EPO, FlyBase, H Inv, IPI, JPO, KIPO, PDB, PIR, PRF, Refseq(), SEGUID, SGD, TAIR, SwissProt(), TrEMBL(), TROME(), UniMES, UniParc, USPTO, VEGA(), WormBase
EMBL databases
Species-specific Refseq releases
SwissProt variant databases
TrEMBL variant databases
Species-specific Trome releases
Species-specific Vega releases
Species-specific Ensembl releases
List of taxon specific databases
Swiss-Prot varsplic and TrEMBL varsplic in output options
ChEBI data communication
[Diagram: ChEBI with its data sources and generated cross-references. Databases and organisations shown: IntEnz, NIST Chemistry WebBook, KEGG COMPOUND, PDBeChem, ChEMBL, Expression Atlas, GMD, ChemIDplus, IUBMB, NURSA, IUPAC, JCBN, CBN, PDB, BBD, RESID, COMe, NMRShiftDB, Enzyme Portal, BRENDA, Rhea, ArrayExpress, SABIO-RK, PubChem, Reactome, BioModels, IntAct, UniProtKB, UniProt, EMBL, EuroFir, LIPID MAPS, WebElements, MolBase, KEGG GLYCAN, KEGG DRUG, Patent, DrugBank, EBI Industry Programme]
UniParc database partners
Look at: http://www.uniprot.org/help/uniparc
It keeps records from the databases:
EMBL-Bank/DDBJ/GenBank nucleotide sequence databases; Ensembl; EnsemblGenomes; European Patent Office (EPO); FlyBase; H-Invitational Database (H-InvDB); International Protein Index (IPI); Japan Patent Office (JPO); Korean Intellectual Property Office (KIPO); Pathosystems Resource Integration Center (PATRIC); PIR-PSD; Protein Data Bank (PDB); Protein Research Foundation (PRF); RefSeq; Saccharomyces Genome database (SGD); TAIR Arabidopsis thaliana Information Resource; The Seed (SEED); TROME; USA Patent Office (USPTO); UniProtKB/Swiss-Prot, protein isoforms; UniProtKB/TrEMBL; Vertebrate Genome Annotation database (VEGA); WormBase; WormBase ParaSite (WBParaSite)
It contains cross-references with the databases:
PIR, PIRARC, REMTREMBL, UniMES, TREMBLNEW, TrEMBL_varsplic
Pathguide: (all) database interactions
http://pathguide.org/interactions.php
An example for the possible service
URLs:
http://biodb.jp/
https://github.com/micommunity/psicquic
http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry?action=STATUS
Sources not inspected yet
• http://cgsc.biology.yale.edu/BioLinks.php - Other Biology Servers
• http://collectf.umbc.edu/browse/links/ - Other resources
• http://phossnp.biocuckoo.org/links.php - Computational resources of protein phosphorylation:
• http://cpla.biocuckoo.org/links.php - Acetylation Databases:
• http://dbptm.mbc.nctu.edu.tw/ - after the line "Databases" long list
• http://minisatellites.u-psud.fr/ - Genomes and PolyMorphismS
• http://www.hawaii.edu/abrp/biorlinks.html - Links to some Bioremediation Sites
• http://toxnet.nlm.nih.gov/
• http://ecoliwiki.net/colipedia/index.php/Category:Databases - online databases containing information related to E. coli K-12 and other genomics information
• https://rarediseases.info.nih.gov/research/9/tools-for-researchers#category29 - lists of medical resources
• http://www.genecards.org/Guide/GeneCard - list of Antibodies databases after words "This section also provides links to Antibodies from"
• http://www.hgvs.org/locus-specific-mutation-databases/ page 2+ - List of Locus-specific Databases
• http://gydb.uv.es/index.php/Main_Page - links of interest
• http://genomics.senescence.info/links.html - Links on Ageing and Computational Biology
• https://sis.nlm.nih.gov/enviro/databasedescriptions.html - toxicity
• http://world-2dpage.expasy.org/repository/ - World-2DPAGE Repository databases
• http://www.virology.net/ - All the Virology on the WWW
• http://viralzone.expasy.org/all_by_species/677.html - ViralZone Links (137 contacts)
• https://scicrunch.org/scicrunch - SciCrunch - over 13000 research resources (datasets SW databases etc.) mostly biomedical
General comparison
(1) In culture collections* 708 culture collections in WDCM/CCinfo
150 online catalogues (67 in EU)
(2) Outside
2056 databases collected online, 807 with microbial data
Biomedicine 171
Agriculture 23
Pharma, biochemistry 35
Any type 243
Global EU
Biomedicine 53 21
Agriculture 67 26
Pharma, biochemistry 32 12
Any type 109 43
807 = 117(Mo) +483(Sp) + 207(St)
- Mo: microorganisms are covered (bacteria, fungi, yeasts, archaea, protists, microalgae, viruses), but no species names are given,
- Sp: species names are given, but no strains,
- St: strains are covered.
The value 207 (St) is about 26% of 807. In fact this means minimal interconnection between the left side and the right side: (1) for each database we recorded only the highest level found in the list Mo, Sp, St, (2) more than 50% of the strains were not in the culture collections, (3) we never found the address of the strains, (4) in databases each strain is a separate record, unlike the StrainInfo histories (next picture)
* Plus WDCM, Straininfo, CABRI and regional mBRC networks
Best integration values
Life Science:
CN: 149 UniProtKB, 148 UniProt-GOA, 84 GeneCards
AN: 350 UniProtKB, 335 PubMed, 181 RCSB PDB
MICRO-IS:
1. CN=163, initial value of AN=163, potential value 670 (NCBI list),
2. CN=146, initial value of AN=146, potential value 654 (EMBL-EBI list).
Key groups and application fields
• Biggest groups: BioCyc (7667 dbs), EMBL-EBI (98 dbs), NCBI (71 dbs), SIB (65 dbs), KEGG, UniProt
• Databases with the biggest lists of partners: COL (159 dbs), PIR (158 dbs), UniProt (150 dbs)
• The biggest attractors: UniProtKB (350), PubMed (335), RCSB PDB (181)
• Application fields: Life science, Biomedicine, Pharma, Agriculture, BioRemediation
Databases in practical areas (42/243): agriculture, baking, biodegradation, biotechnology, brewing, enzyme production, food industry, pesticide residues, heavy metals, health, patent, pharma, remediation, veterinary drug, winemaking
abYsis, AgBiotechNet, AGRICOLA, Allergome, ALTBIB, Anti-HIV Compounds, APD, ApiEST-DB, ARDB, Aspergillus Genomes, BacMap, BBD, BEI, Bio Synthesis, BioGRID, BioModels, Bionemo, BioRadBase, BMRB, BRENDA, BuG@Sbase, BuruList, CADRE, CARD, CATH, CCDB, CCSB Interactome, ChEBI, ChEMBL, ClinicalTrials.gov, CLU-IN, COGEME, Colibri, ConceptWiki, Cost Estimates of Foodborne Illnesses, CPC, CTD, CTDB, DAA, DailyMed, DAnCER, DART, DBETH, dbSNP, Diseases Database, DNAtraffic, Dr.VIS, Drug2Gene, DrugBank, DrugPort, e!Ensembl, EAWAG-BBD, Ebolavirus, EcoGene, EcoliWiki, ELM, EMBL-EBI, ENA, Ensembl, EpiFlu™, EPIMHC, Espacenet, ESTHER, EuPathDB, EVA, FCP, FluKB, ForestScience Current Database, FunCoup, FungiDB, GARD, GB, GeneCards, GeneMANIA, Genetics Home Reference, GENE-TOX, GenoList, Global Atlas of Infectious Diseases, GlycomeDB, GOBASE, GoMapMan, GPCRs, Gramene, Guide to Pharmacology, HAGR, HAPPI, Hawaii Bioremediation Database, HC DPD, HCV Immunology Database, HCV sequence database, HCVIVdb, HealthMash, HFV / Ebola Database, HIV Drug Resistance Database, HIV MID, HIV mutation browser, HIV Sequence Database, HIV Structural Database, HIV Structural Database and Chem-BLAST, HIV/AIDS Clinical Trials, HIV/SIV Vaccine, HIVBrainSeqDB, HMDB, HorizonScan database, HSDB, i2d, ICD-10, ICD-9-CM, IEDB, IMID, ImmPort, IMP, Influenza Virus Resource, INTEGRALL, iPfam, IRD, iRefWeb, KEGG, KEGG BRITE, KEGG DISEASE, KEGG MEDICUS, LAMP, LEGER, LegioList, LifeMap Discovery, ListiList, LMPD, LMSD, MalaCards, MedHunt, MedicMine, MEDLINE, MedlinePlus, MeSH, MGG, microbedb, MLSTDB, MolliGen, MouseMine, MPID, MSRDB, MTBLS, MUHDB, MUMDB, MvirDB, MycoBrowser leprae, MycoBrowser marinum, MycoBrowser tuberculosis, MypuList, NAPP, NCIm, NCIt, NCPI, NDF-RT, NextBio, neXtProt, NFSD, NRSub, OMIM, OMIM (1), OMMBID, ORegAnno, OrthoDisease, PAGED, PathoPlant, PathPred, PATRIC, PC, PDRhealth, PDTD, PED, PepBank, PeroxisomeDB, PharmGKB, PhenoM, PhosphoGRID, PhytAMP, PIMRider, PINA, PLEXdb, PLOS One, PPIRA, PRGdb, 
PRIMOS, PROFESS, PROMISCUOUS, PseudoCAP, PSP, PubChem, PubMed, RCSB PDB, Reactome, Reference Strain Catalogue, RhizoBase, RIKEN, SagaList, SALAD, Scansite, ScerTF, SCMD, SCRIPDB, SDAP, SGD, SMD, SPIDer, SPPS, SubtiList, Subviral RNA Database, SuperSite, SuperTarget, SwissVar, T3DB, TDR Targets, Telomerase database, TiPs, TOXLINE, toxoMine, TriTrypDB, TubercuList, UCD 2D-PAGE, UMLS, UniProtKB, VetMed Resource, VFDB, Victors, ViPR, Virhostome Interactome Database, WikiPathways, Wiki-Pi, Wong's Virology, YDPM, Yeast Interactome Database, Yeast Resource Center, Yeast snoRNA, YeastCyc, YeastGE, YeastMine, YEASTNET, YEASTRACT, YMDB
Bioremediation **
- Hawaii Bioremediation DBs, - EAWAG-BBD, - BioRadBase, - BIOREM, - ECSI, - CLU-IN, - OxDBase, - RhizoBase, - TechProfiles.org. Also: PathSearch, PathComp, PathPred, KEGG, BioCyc, ...
Dehalogenation
Four categories of databases are especially helpful in dehalogenation research *:
1. Databases of sequence and structure (NCBI, EMBL, DDBJ, MBDG, CMR, ExPASy, PDB, CSD, SCOP, FSSP)
2. Databases of enzymes and metabolic pathways (BRENDA, ExplorEnz, UM-BBD, MetaCyc, WIT, KEGG, Pathway Commons)
3. Databases of molecules (PubChem NCBI, ChemDB, ZINC, Pollution Database, ECOTOX)
4. Databases of organisms (Taxonomy NCBI, BSD, CBS, PAMDB, JCM)
* R. Satpathy, V.B. Konkimalla, J. Ratha. Application of bioinformatic tools in microbial dehalogenation research (a review). 2015
** F. Khan, M. Sajid and S. S. Cameotra. In Silico Approach for the Bioremediation of Toxic Pollutants. Petroleum & Environmental Biotechnology
BRIO species in Life Science databases
(number of species, database)
301 EMBL-EBI, 299 ENA, 298 Genbank, 298 NCBI taxonomy, 298 Nucleotide, 298 RefSeq, 294 PubMed, 270 UniProtKB, 248 PLOS One, 216 ForestScience Current Database, 175 CPC, 143 ALTBIB, 136 Animalscience, 128 Espacenet, 126 BRENDA, 113 VetMed Resource, 99 MetaCyc, 96 BioCyc, 95 KEGG, 91 KEGG BRITE, 83 Gene, 79 KEGG GENOME, 69 KEGG GENES, 53 KEGG DISEASE, 52 KEGG LIGAND, 52 KEGG MEDICUS, 52 KEGG MODULE,
52 KEGG Organisms, 52 KEGG ORTHOLOGY, 52 KEGG PATHWAY, 52 PathComp, 50 RCSB PDB, 47 CLU-IN, 43 InterPro, 41 Addgene, 41 AGRICOLA, 38 GoMapMan, 34 ACLAME, 31 5S RNA Database, 30 NAPP, 28 IntAct, 25 ABCdb, 25 BBD, 25 EAWAG-BBD, 25 SCOP, 18 Bionemo, 17 PROSITE, 12 Ensembl, 11 Allergome, 9 ABAC, 9 AFTOL, 9 BioRadBase, 7 RhizoBase, 4 ABCISSE, 4 Reference Strain Catalogue,
3 PhytAMP, 1 A.pernix, 1 AFLP, 1 PLEXdb, 1 PRGdb
0 species: 3D RIBOSOMAL MODIFICATION MAPS, abYsis, AfCS PLASMID, AgBiotechNet, AGD, AgrID, AH-DB, ARAMEMNON, COGEME, MGG, MPID, MSRDB, MUHDB, MUMDB, NCPI, PathoPlant, PathPred, PC, PPIRA, Reactome, SGD, SubtiList
Example for Task 2a access
Acronym | Species name | URL of access to this species
EMBL-EBI | Acaulospora nivalis | https://www.ebi.ac.uk/ebisearch/search.ebi?query=Acaulospor...
ENA | Acaulospora nivalis | http://www.ebi.ac.uk/ena/data/search?query=Acaulospora+nivalis
Genbank | Acaulospora nivalis | http://www.ncbi.nlm.nih.gov/nuccore/?term=Acaulospora+nivalis
Nucleotide | Acaulospora nivalis | http://www.ncbi.nlm.nih.gov/nuccore/?term=Acaulospora+nivalis
RefSeq | Acaulospora nivalis | https://www.ncbi.nlm.nih.gov/nuccore/?term=Acaulospora+nivalis
NCBI taxonomy | Acaulospora nivalis | http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi
PubMed | Achromobacter marplantensis | http://www.ncbi.nlm.nih.gov/pubmed/?term=Achromobacter+marp...
UniProtKB | Acidovorax delafieldii | http://www.uniprot.org/uniprot/?query=Acidovorax+delafieldi...
PLOS One | Achromobacter marplantensis | http://journals.plos.org/plosone/search?q=%22Achromobacter+...
ForestScience Current Database | Acidovorax sp. | http://www.cabi.org/forestscience/search/?q=%22Acidovorax+s...
CPC | Acidovorax sp. | http://www.cabi.org/cpc/search/?q=%22Acidovorax+sp.%22
ALTBIB | Achromobacter marplantensis | http://www.ncbi.nlm.nih.gov/pubmed?cmd=Search&term=%22Acido...
Animalscience | Acidovorax delafieldii | http://www.cabi.org/animalscience/search/?q=%22Acidovorax+d...
Espacenet | Acidovorax delafieldii | https://worldwide.espacenet.com/searchResults?submitted=tru...
BRENDA | Acidovorax sp. | http://www.brenda-enzymes.org/search_result.php?quicksearch...
VetMed Resource | Acidovorax delafieldii | http://www.cabi.org/vetmedresource/search/?q=%22Acidovorax+...
MetaCyc | Acidovorax delafieldii | http://www.biocyc.org/organism-summary?object=ADEL573060
BioCyc | Acidovorax delafieldii | http://www.biocyc.org/organism-summary?object=ADEL573060
KEGG | Acidovorax sp. | http://www.genome.jp/dbget-bin/www_bfind_sub?mode=bfind&max...
Gene | Acinetobacter venetianus | http://www.ncbi.nlm.nih.gov/gene/?term=Acinetobacter+venetianus
PathComp | Pseudomonas putida | http://www.genome.jp/tools-bin/pathcomp?org_name=ppg&org_na...
RCSB PDB | Acidovorax sp. | http://www.rcsb.org/pdb/search/advSearch.do?search=new
CLU-IN | Acidovorax sp. | https://clu-in.org/search/default.cfm?search_term=Acidovora...
InterPro | Acidovorax delafieldii | http://www.ebi.ac.uk/interpro/search?q=Acidovorax+delafieldii
addgene | Acidovorax sp. | https://www.addgene.org/search/google_results?q=Acidovorax+sp.
AGRICOLA | Alcaligenes sp. | http://agricola.nal.usda.gov/cgi-bin/Pwebrecon.cgi?Search_A...
GoMapMan | Acidovorax sp. | http://www.gomapman.org/search/gmm/Acidovorax%20sp.?entity=...
ACLAME | Acidovorax sp. | http://aclame.ulb.ac.be/perl/Aclame/search.cgi?keys=Acidovo...
5S RNA Database | Bjerkandera adusta | http://biobases.ibch.poznan.pl/5SData/
NAPP | Bacillus cereus | http://napp.u-psud.fr/Niveau2.php?specie=76&Name=Bacillus_c...
IntAct | Arthrobacter sp. | http://www.ebi.ac.uk/intact/interactions?conversationContext=1
ABCdb | Acidovorax sp. | https://www-abcdb.biotoul.fr/
BBD | Alcaligenes sp. | http://eawag-bbd.ethz.ch/servlets/search
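Access links of this kind could be generated from per-database URL templates. The sketch below uses only the four URL patterns that appear untruncated on this slide (ENA, Nucleotide, Gene, InterPro); the template dictionary and the function name are hypothetical, and the patterns are as shown on the slide and may no longer resolve.

```python
from typing import Dict
from urllib.parse import quote_plus

# URL patterns copied from the slide; "{name}" marks where the species name goes
SEARCH_TEMPLATES: Dict[str, str] = {
    "ENA": "http://www.ebi.ac.uk/ena/data/search?query={name}",
    "Nucleotide": "http://www.ncbi.nlm.nih.gov/nuccore/?term={name}",
    "Gene": "http://www.ncbi.nlm.nih.gov/gene/?term={name}",
    "InterPro": "http://www.ebi.ac.uk/interpro/search?q={name}",
}


def species_links(species_name: str,
                  templates: Dict[str, str] = SEARCH_TEMPLATES) -> Dict[str, str]:
    """Return one search URL per database for the given species name."""
    encoded = quote_plus(species_name)  # spaces become "+"
    return {db: template.format(name=encoded)
            for db, template in templates.items()}


links = species_links("Acaulospora nivalis")
for db, url in links.items():
    print(db, url)
```

Keeping the templates in one dictionary means a new partner database needs only one new entry, provided its search page accepts the name as a query parameter.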