Building a network of interoperable and independently produced linked and open
biomedical data
Michel Dumontier, Ph.D.
Associate Professor of Medicine (Biomedical Informatics)
Stanford University
An invited talk in support of the 2016 Herman Skolnik Awardees
@micheldumontier::ACS:23-08-16
My research aims to develop computational methods for biomedical knowledge discovery
We develop tools and methods to represent, store, publish, integrate, query, and reuse biomedical data, software, and ontologies
reuse needs to be considered firmly in the context of discovery and
reproducibility
Most published research findings are false
- John Ioannidis, Stanford University
Reproducible discovery
1. Data Science Tools and Methods
– Infrastructure: To identify, annotate, link, integrate, search for and query data and services
– Tools: To identify and uncover support for known or novel associations
2. Community Standards to contribute to and interrogate a massive, decentralized network of interconnected data and software
FAIR: Findable, Accessible, Interoperable, Re-usable
Findable
– Globally unique identifiers for datasets and the data they contain
– Rich set of descriptors to search and filter with
– Indexed and searchable
Accessible
– Identifiers can be used to retrieve representations using standard protocols (e.g. HTTP)
– Metadata is always available
Interoperable
– Data represented with formal knowledge representations
– Include links to other datasets/vocabularies
Reusable
– Licensing, Provenance, Community standards
The Semantic Web is the new global web of knowledge
standards for publishing, sharing and querying facts, expert knowledge and services
scalable approach for the discovery of independently formulated and distributed knowledge
Linked Data offers a solid foundation for FAIR data
• Entities (people, proteins, pathways, etc) are identified using globally unique identifiers (URIs)
• Entity descriptions are represented with a standardized language (RDF)
• Data can be retrieved using a universal protocol (HTTP)
• Entities (concepts, data, resources) can be linked together to increase interoperability
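The four points above can be illustrated with a minimal sketch. It uses only the Python standard library; the `example.org` URIs, the `participatesIn` property, and the helper names `to_ntriples` and `rdf_request` are hypothetical and chosen purely for illustration. Entities are named with URIs, described as RDF triples (serialized here as N-Triples), and an HTTP request for an RDF representation is prepared but not sent.

```python
from urllib.request import Request

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples lines.

    Simplification for this sketch: objects starting with "http" are
    written as URIs, everything else as plain string literals.
    """
    lines = []
    for s, p, o in triples:
        if o.startswith("http"):
            obj = f"<{o}>"
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


def rdf_request(uri):
    """Build (but do not send) an HTTP GET asking for a Turtle representation."""
    return Request(uri, headers={"Accept": "text/turtle"})


# Hypothetical identifiers, not taken from any real dataset.
protein = "http://example.org/protein/P1"
pathway = "http://example.org/pathway/W1"
participates_in = "http://example.org/vocab/participatesIn"

triples = [
    (protein, RDFS_LABEL, "example protein"),
    (protein, participates_in, pathway),  # a link between two entities
]

print(to_ntriples(triples))
request = rdf_request(protein)
```

The second triple is the linking step: because both ends are URIs, a client that retrieves the protein description can follow the link to the pathway.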
Linked Data for the Life Sciences
Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF.
chemicals/drugs/formulations
genomes/genes/proteins, domains
Interactions, complexes & pathways
animal models and phenotypes
Disease, genetic markers, treatments
Terminologies & publications
• 11B+ interlinked statements from 35 biomedical datasets and 400+ ontologies
• dataset description, provenance & statistics
• A growing interoperable ecosystem with the EBI, NCBI, DBCLS, NCBO, OpenPHACTS, and commercial tool providers
Bio2RDF normalizes identifiers, formats, links, and access
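Identifier normalization can be sketched as follows. The sketch assumes the `http://bio2rdf.org/namespace:identifier` URI pattern that appears in the federated query later in this talk. The function name `bio2rdf_uri` and the rule for dropping a repeated namespace prefix are assumptions of this example, not the project's actual conversion scripts.

```python
BIO2RDF_BASE = "http://bio2rdf.org/"


def bio2rdf_uri(namespace, identifier):
    """Return a URI of the form http://bio2rdf.org/<namespace>:<identifier>.

    Assumed rules for this sketch: the namespace is lowercased, surrounding
    whitespace is trimmed, and a redundant "<namespace>:" prefix on the
    identifier (e.g. "GO:0008150") is dropped.
    """
    ns = namespace.strip().lower()
    ident = identifier.strip()
    if ident.lower().startswith(ns + ":"):
        ident = ident[len(ns) + 1:]
    if not ns or not ident:
        raise ValueError("namespace and identifier must both be non-empty")
    return f"{BIO2RDF_BASE}{ns}:{ident}"


# Differently formatted source identifiers map onto the same URI.
print(bio2rdf_uri("GO", "GO:0008150"))
print(bio2rdf_uri("go", " 0008150 "))
```

Mapping every source-specific spelling to one URI is what lets statements from independently produced datasets join on the same node.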
Bio2RDF shows how datasets are connected together
Queries can be federated across private and public SPARQL databases
Get all protein catabolic processes (and more specific GO terms) in biomodels
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?reactions)
WHERE {
  SERVICE <http://bioportal.bio2rdf.org/sparql> {
    ?go rdfs:label ?label .
    ?go rdfs:subClassOf+ ?tgo .
    ?tgo rdfs:label ?tlabel .
    FILTER regex(?tlabel, "^protein catabolic process")
  }
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label
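A client could submit a query like the one above over HTTP and read back the results. The sketch below uses only the standard library and deliberately does not contact any server: it builds a SPARQL Protocol GET request and parses a hand-written response in the standard SPARQL JSON results format. The helper names, the short stand-in query, and the sample response values are hypothetical, and the endpoint URL is copied from the slide without any claim that it is still reachable.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def sparql_get_request(endpoint, query):
    """Build (but do not send) a SPARQL Protocol GET request for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def bindings_to_rows(results):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    variables = results["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in results["results"]["bindings"]
    ]


# Short stand-in query; the full federated query would be passed the same way.
QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
request = sparql_get_request("http://bioportal.bio2rdf.org/sparql", QUERY)

# Hand-written sample response (illustrative values, not real query output).
SAMPLE = """
{
  "head": {"vars": ["go", "label"]},
  "results": {"bindings": [
    {"go": {"type": "uri", "value": "http://example.org/term/1"},
     "label": {"type": "literal", "value": "example term"}}
  ]}
}
"""
rows = bindings_to_rows(json.loads(SAMPLE))
print(rows)
```

To actually run a query, the prepared request would be passed to `urllib.request.urlopen` and the response body to `json.loads`.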
Graph-like representation amenable to finding mismatches and discovering new links
W Hu, H Qiu, M Dumontier. Link Analysis of Life Science Linked Data. International Semantic Web Conference (2) 2015: 446-462.
EbolaKB Using Linked Data and Software
Kamdar, Dumontier. An Ebola virus-centered knowledge base. Database. 2015 Jun 8;2015. doi: 10.1093/database/bav049.
Network analysis and discovery
McCusker, McGuinness, Dumontier. In prep.
Can we implement an open version of PREDICT using Linked Data?
AUC 0.91 across all therapeutic indications
Drug-drug similarity
A. Chemical structure Similarity
B. Side Effect Similarity
C. Target Sequence Similarity
D. Target Functional Similarity
E. Network Distance
Disease-disease similarity
A. Phenotype Based
B. Text Extracted Concepts
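The sketch below shows one way a drug-drug similarity and a disease-disease similarity might be combined into a single score for a candidate drug-indication pair. It assumes a PREDICT-style feature: the geometric mean of the two similarities, maximized over known associations. That formulation, the Jaccard measure swapped in for the similarity measures listed above, and all names and toy data are assumptions of this example. It does not reproduce the reported AUC.

```python
from math import sqrt


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def association_score(drug, disease, known, drug_sim, disease_sim):
    """Score a candidate (drug, disease) pair against known associations.

    For each known pair, take the geometric mean of the drug-drug and
    disease-disease similarities; return the maximum over all known pairs.
    """
    best = 0.0
    for known_drug, known_disease in known:
        if (known_drug, known_disease) == (drug, disease):
            continue  # a pair must not support itself
        score = sqrt(drug_sim(drug, known_drug) * disease_sim(disease, known_disease))
        best = max(best, score)
    return best


# Toy data, invented for illustration only.
side_effects = {
    "drugA": {"nausea", "headache"},
    "drugB": {"nausea", "headache", "rash"},
    "drugC": {"rash"},
}
phenotypes = {
    "dis1": {"p1", "p2"},
    "dis2": {"p1", "p2"},
    "dis3": {"p3"},
}
known = [("drugB", "dis2"), ("drugC", "dis3")]

score = association_score(
    "drugA", "dis1", known,
    drug_sim=lambda x, y: jaccard(side_effects[x], side_effects[y]),
    disease_sim=lambda x, y: jaccard(phenotypes[x], phenotypes[y]),
)
print(round(score, 3))
```

Each pairing of one drug-drug measure with one disease-disease measure would yield one such feature, and a classifier trained on known indications could then consume the resulting feature vector.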
HyQue: Hypothesis Validation
• A platform for knowledge discovery that uses data retrieval coupled with automated reasoning to validate scientific hypotheses
• Leverages semantic technologies to provide access to linked data, ontologies, and semantic web services
• Uses positive and negative findings, captures provenance
• Weighs evidence according to context
• Used to find aging genes in worm, assess cardiotoxicity of tyrosine kinase inhibitors
HyQue: evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3.
Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012.
What evidence might we gather?
• clinical: Are there cardiotoxic effects associated with the drug?
– Literature (studies) [curated db]
– Product labels (studies) [r3:sider]
– Clinical trials (studies) [r3:clinicaltrials]
– Adverse event reports [r2:pharmgkb/onesides]
– Electronic health records (observations)
• pre-clinical associations:
– genotype-phenotype (null/disease models) [r2:mgi, r2:sgd; r3:wormbase]
– in vitro assays (IC50) [r3:chembl]
– drug targets [r2:drugbank; r2:ctd; r3:stitch]
– drug-gene expression [r3:gxa]
– pathways [r2:kegg; r3:reactome]
– Drug-pathway, disease-pathway enrichments [aberrant pathways]
– Chemical properties [r2:pubchem; r2.drugbank]
– Toxicology [r1.toxkb/cebs]
HyQue
Beyond Bio2RDF
Network of Linked Data (~2007)
Expansion across domains
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
A rapidly growing network of Linked Data
“Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/”
but the lack of coordination makes Linked Open Data chaotic and unwieldy
There is no shortage of vocabularies, ontologies and community-based
standards
68 168
metadatacenter.org
NIH COMMONS
Making it Easier, Possibly Even Pleasant, to Author Interoperable Experimental Metadata
PubChem engaged the community to reuse and extend existing vocabularies
Semanticscience Ontology (SIO)
An effective upper level ontology.
1500+ classes
207 object properties (inc. inverses)
1 datatype property
Chemical Information Ontology (CHEMINF)
• Collaborative ontology
• Distinguishes algorithmic, or procedural information from declarative, or factual information, and renders of particular importance the annotation of provenance to calculated data.
Where are we going?
• Large-scale publishing across biomedical datatypes is possible on the web
• Hubs such as NCBI and EBI now integrate data, but there is a need for global coordination on all datatypes
• Standard vocabularies must be open, freely accessible, and demonstrably reused
• Use of worldwide data integration formats (RDF) and improved linking of data
• Easier to deploy toolkits for providing standards-compliant linked data
Linked Data Platform
Docker
• Data conversion scripts
• Query Editor
• Faceted Browser
• Relation Exploration
• API
• Data and data store
Model Organism Linked Data
MO-LD.org
In Summary
• We use semantic technologies such as ontologies and linked data to make sense of and facilitate access to biomedical data (FAIR)
• The intimate development and use of standards by PubChem and others brings us closer to an interoperability ideal
• Much more work is needed to support (computational) discovery in a reproducible manner.
Acknowledgements
Dumontier Lab
• Amrapali Zaveri
• Mary Panahiazar
• Shima Dastgheib
• Sandeep Ayyar
• Remzi Celebi
• David Odgers
• Wei Hu
• Ruben Verborgh
• Leo Chepelev
• Alison Callahan
• Jose Miguel Toledo Cruz
• Tanya Hiebert
• Beatriz Lujan
+ many more
Collaborators
• Mark Musen
• Nigam Shah
• Robert Hoehndorf
• Janna Hastings
• Christoph Steinbeck
• Egon Willighagen
• Nico Adams
• Colin Batchelor
• David Wild
• Evan Bolton
• Gang Fu
+ many more
Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier