Bio2RDF and Beyond!

Bio2RDF and Beyond! Large Scale, Distributed Biological Knowledge Discovery. EBI : 14-01-10. Michel Dumontier, Ph.D., Associate Professor of Bioinformatics, Carleton University (Department of Biology; School of Computer Science; Institute of Biochemistry; Ottawa Institute of Systems Biology; Ottawa-Carleton Institute of Biomedical Engineering)

description

The Bio2RDF project aims to transform silos of bioinformatics data into a distributed platform for biological knowledge discovery. Initial work focused on building a public database of open linked data with web-resolvable identifiers that provides information about named entities. This involved a syntactic normalization to convert open data represented in a variety of formats (flat file, tab-delimited, XML, web services) to RDF-based linked data with normalized names (HTTP URIs) and basic typing from source databases. Bio2RDF entities also make reference to other open linked data networks (e.g. DBpedia), thus facilitating traversal across information spaces. However, a significant problem arises when attempting more sophisticated knowledge discovery approaches such as question answering or symbolic data mining: each source represents knowledge in a fundamentally different manner, requiring one to know the underlying data model and to reconcile the artefactual differences when they arise. In this talk, we describe our data integration strategy, which uses both syntactic and semantic normalization to consistently marshal knowledge to a common data model while leveraging explicit logic-based mappings with community ontologies to further enhance the biological knowledgescope.

Transcript of Bio2RDF and Beyond!

Page 1: Bio2RDF and Beyond!

EBI : 14-01-10

Bio2RDF and Beyond! Large Scale, Distributed Biological Knowledge Discovery

Michel Dumontier, Ph.D.
Associate Professor of Bioinformatics

Carleton University

Department of Biology
School of Computer Science
Institute of Biochemistry
Ottawa Institute of Systems Biology
Ottawa-Carleton Institute of Biomedical Engineering

Page 2: Bio2RDF and Beyond!

Carole Goble (ISWC 2005)

Web-based Knowledge Discovery: a very painful process

Page 3: Bio2RDF and Beyond!

Syntactic Web… It takes a lot of digging to get answers

Page 4: Bio2RDF and Beyond!

Portals provide structured information and give better results

Page 5: Bio2RDF and Beyond!

Surface web: 167 terabytes

Deep web: 91,000 terabytes

545-to-one

We need to expose the deep web

Page 6: Bio2RDF and Beyond!

Data silos – not made for sharing

Page 7: Bio2RDF and Beyond!

How do we integrate these resources?

Page 8: Bio2RDF and Beyond!

We want to simultaneously query the 1000+ biological databases

Page 9: Bio2RDF and Beyond!

The Semantic Web is a web of knowledge.

It is about standards for publishing, sharing and querying knowledge drawn from diverse sources

It enables the answering of sophisticated questions

Page 10: Bio2RDF and Beyond!

A growing web of linked data

Page 11: Bio2RDF and Beyond!

Life Science Data Contributors

• HCLS (LODD)
• Neurocommons
• Bio2RDF

Page 12: Bio2RDF and Beyond!

Resource Description Framework (RDF)

Uniform Resource Identifiers (URIs) can be used as entity names

Bio2RDF specifies the naming convention

http://bio2rdf.org/uniprot:P05067 (short form uniprot:P05067) is a name for Amyloid precursor protein

http://bio2rdf.org/omim:104300 (short form omim:104300) is a name for Alzheimer disease

Allows one to talk about anything
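The naming convention on this slide can be sketched in a few lines of Python. This is only an illustration, assuming that a Bio2RDF name is always the base `http://bio2rdf.org/` followed by `namespace:identifier`, as in the two examples above; the function names `to_uri` and `to_short_name` are hypothetical and not part of any Bio2RDF software.

```python
# Assumption: every Bio2RDF name is BASE + "namespace:identifier",
# as in the two examples shown on the slide.
BIO2RDF_BASE = "http://bio2rdf.org/"


def to_uri(short_name: str) -> str:
    """Expand a 'namespace:identifier' short name into a full URI."""
    namespace, sep, identifier = short_name.partition(":")
    if not sep or not namespace or not identifier:
        raise ValueError(f"expected 'namespace:identifier', got {short_name!r}")
    return f"{BIO2RDF_BASE}{namespace}:{identifier}"


def to_short_name(uri: str) -> str:
    """Recover the 'namespace:identifier' short name from a full URI."""
    if not uri.startswith(BIO2RDF_BASE):
        raise ValueError(f"not a Bio2RDF-style URI: {uri!r}")
    return uri[len(BIO2RDF_BASE):]


print(to_uri("uniprot:P05067"))
print(to_short_name("http://bio2rdf.org/omim:104300"))
```

Because the mapping is purely mechanical, any two datasets that follow it will produce the same name for the same record.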

Page 13: Bio2RDF and Beyond!

Resource Description Framework (RDF)

An RDF statement consists of:
– Subject: resource identified by a URI
– Predicate: resource identified by a URI
– Object: resource or literal

Example: uniprot:P05067 is a uniprot:Protein

Allows one to express statements
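As a rough illustration of the subject, predicate, object structure, the sketch below models a statement as a small Python tuple and prints it in N-Triples form. It assumes that "is a" corresponds to `rdf:type` (as in the RDF/XML example later in the talk); the `Triple` class and `to_ntriples` helper are hypothetical names, and the literal escaping is deliberately minimal.

```python
from typing import NamedTuple

# Assumption: the "is a" arrow on the slide corresponds to rdf:type.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class Triple(NamedTuple):
    subject: str                  # resource identified by a URI
    predicate: str                # resource identified by a URI
    obj: str                      # a URI, or the text of a literal
    obj_is_literal: bool = False  # True when obj is a literal


def to_ntriples(t: Triple) -> str:
    """Serialize one statement as an N-Triples line (minimal escaping only)."""
    if t.obj_is_literal:
        escaped = (t.obj.replace("\\", "\\\\")
                        .replace('"', '\\"')
                        .replace("\n", "\\n"))
        obj = f'"{escaped}"'
    else:
        obj = f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


statement = Triple(
    "http://bio2rdf.org/uniprot:P05067",
    RDF_TYPE,
    "http://bio2rdf.org/uniprot:Protein",
)
print(to_ntriples(statement))
```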

Page 14: Bio2RDF and Beyond!

Multi-Source Data Integration

Sources: UniProt + Gene Ontology + iRefIndex

Statements shown in the figure:
uniprot:P05067 is a uniprot:Protein
uniprot:P05067 located in go:Membrane
uniprot:P05067 interacts with uniprot:P05067

Unified view of uniprot:P05067: has name, is a uniprot:Protein, located in go:Membrane, interacts with

A unified view depends on consistent naming
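The point about consistent naming can be made concrete with a small sketch. Statements are plain `(subject, predicate, object)` tuples, and merging sources is just a set union grouped by subject. Which statement comes from which source is an assumption made for illustration, and `unified_view` and the `UniProtKB:` spelling are hypothetical.

```python
from collections import defaultdict

P05067 = "uniprot:P05067"

# Assumption: the attribution of each statement to a source is illustrative.
uniprot = {(P05067, "is a", "uniprot:Protein")}
gene_ontology = {(P05067, "located in", "go:Membrane")}
irefindex = {(P05067, "interacts with", P05067)}


def unified_view(*sources):
    """Merge triple sets and group the statements by subject name."""
    merged = set().union(*sources)
    view = defaultdict(set)
    for subject, predicate, obj in merged:
        view[subject].add((predicate, obj))
    return dict(view)


view = unified_view(uniprot, gene_ontology, irefindex)
for predicate, obj in sorted(view[P05067]):
    print(P05067, predicate, obj)

# A source that spells the same protein differently does not join the view:
mismatched = {("UniProtKB:P05067", "has name", "Amyloid precursor protein")}
print(len(unified_view(uniprot, mismatched)), "separate subjects")
```

With shared names the three sources collapse into one description of the protein; with a different spelling the same protein appears as two unrelated subjects.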

Page 15: Bio2RDF and Beyond!

Building statements creates knowledge

uniprot:P05067 is a Protein
uniprot:P05067 label "Amyloid precursor protein"

omim:104300 is a Disease
omim:104300 label "Alzheimer Disease"

uniprot:P05067 is involved in omim:104300

Page 16: Bio2RDF and Beyond!

RDF/XML

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [ <!ENTITY u "http://bio2rdf.org/uniprot:"> ]>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:u="http://bio2rdf.org/uniprot:">
  <rdf:Description rdf:about="&u;Q16665">
    <rdf:type rdf:resource="&u;Protein"/>
  </rdf:Description>
</rdf:RDF>

RDF/N3

@prefix u: <http://bio2rdf.org/uniprot:> .

u:Q16665 a u:Protein .

RDF has multiple representations

Page 17: Bio2RDF and Beyond!

Bio2RDF is a framework to create and provision linked data networks

Francois Belleau, Laval University
Marc-Alexandre Nolin, Laval University
Peter Ansell, Queensland University of Technology
Michel Dumontier, Carleton University

Page 18: Bio2RDF and Beyond!

Bio2RDF’s RDFized data fits together

Page 19: Bio2RDF and Beyond!

Bio2RDF now serving over 5 / 15 billion triples of linked biological data

Page 20: Bio2RDF and Beyond!

Bio2RDF linked data

Page 21: Bio2RDF and Beyond!

Bioinformatics Discovery Registry
• SharedName initiative to provide stable URI patterns for data records.
• We added the relationship between entities and records

Directory Service
• ~1700 datasets & dozens of resolvers.

Discovery Service
• Registry links entities to data records, their formats (RDF/XML, HTML, etc.) and provider (Bio2RDF, UniProt)

Redirection Service
• Automatic redirection to data provider document

Page 22: Bio2RDF and Beyond!

Something you can look up or search for, with rich descriptions

Page 23: Bio2RDF and Beyond!

Bio2RDF: Raw Data!

Page 24: Bio2RDF and Beyond!

SPARQL is the new cool kid on the query block

SQL SPARQL

Page 25: Bio2RDF and Beyond!

Bio2RDF’s describe service uses SPARQL

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  ?s ?p ?o .
  FILTER(?s = <http://bio2rdf.org/ns:id>) .
}

Sent to http://ns.bio2rdf.org/sparql?query=...

http://bio2rdf.org/ns:id
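A rough sketch of how such a describe request could be assembled is shown below. It assumes that `ns` in the slide stands for the dataset namespace, so that a request for `http://bio2rdf.org/uniprot:P05067` would go to `http://uniprot.bio2rdf.org/sparql`; that reading, and the helper names `describe_query` and `describe_url`, are assumptions rather than documented Bio2RDF behaviour. The code only builds the URL and does not contact any server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def describe_query(namespace: str, identifier: str) -> str:
    """Build the CONSTRUCT query from the slide for one entity."""
    uri = f"http://bio2rdf.org/{namespace}:{identifier}"
    return (
        "CONSTRUCT { ?s ?p ?o . } "
        "WHERE { ?s ?p ?o . "
        f"FILTER(?s = <{uri}>) . }}"
    )


def describe_url(namespace: str, identifier: str) -> str:
    """URL-encode the query for a namespace-specific endpoint (assumed pattern)."""
    endpoint = f"http://{namespace}.bio2rdf.org/sparql"
    return endpoint + "?" + urlencode({"query": describe_query(namespace, identifier)})


url = describe_url("uniprot", "P05067")
print(url)

# Decoding the query string again recovers the original SPARQL text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```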

Page 26: Bio2RDF and Beyond!

Bio2RDF’s search service uses SPARQL

http://bio2rdf.org/search/hexokinase

[Figure labels: kegg, uniprot, gene, bio2rdf.org]

Page 27: Bio2RDF and Beyond!

Bio2RDF
Scalable, Decentralized Data Provision
Globally Mirrored and Point Provision

Customizable Query Resolution

Page 28: Bio2RDF and Beyond!

Customizable Configuration (in N3)
Single Query, Single Provider

Page 29: Bio2RDF and Beyond!

Query Resolution

Page 30: Bio2RDF and Beyond!

Page 31: Bio2RDF and Beyond!

700,000 queries in November 2009

Page 32: Bio2RDF and Beyond!

Yay for data!

But how do we discover more than what was in the data?

Page 33: Bio2RDF and Beyond!

Ontology as Strategy

Page 34: Bio2RDF and Beyond!

Reasoning and Inference through Semantics

Fact: uniprot:P05067 is a uniprot:Protein
Ontology: uniprot:Protein is a chebi:PolyatomicEntity
Inference: uniprot:P05067 is a chebi:PolyatomicEntity

Knowledge base = facts + ontology

Page 35: Bio2RDF and Beyond!

The Web Ontology Language (OWL) Has Explicit Semantics

Can therefore be used to capture knowledge in a machine understandable way
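The kind of inference that explicit semantics makes possible can be illustrated with the protein example from the previous slide. The sketch below is not an OWL reasoner; it only follows "is a" links upward through a class hierarchy, which is the simplest case of such reasoning. The function name `infer_types` and the dictionary layout are hypothetical.

```python
def infer_types(instance_types, subclass_of):
    """Return every class each entity belongs to, asserted or inferred.

    instance_types: entity -> set of asserted classes (the facts)
    subclass_of:    class  -> set of direct superclasses (the ontology)
    """
    inferred = {}
    for entity, classes in instance_types.items():
        seen = set()
        stack = list(classes)
        while stack:
            cls = stack.pop()
            if cls in seen:
                continue  # already visited; also guards against cycles
            seen.add(cls)
            stack.extend(subclass_of.get(cls, ()))
        inferred[entity] = seen
    return inferred


facts = {"uniprot:P05067": {"uniprot:Protein"}}
ontology = {"uniprot:Protein": {"chebi:PolyatomicEntity"}}

types = infer_types(facts, ontology)
print(sorted(types["uniprot:P05067"]))
```

The statement that P05067 is a polyatomic entity appears in neither the data nor the ontology alone; it follows only when the two are combined.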

Page 36: Bio2RDF and Beyond!

Over 170 bio-ontologies

Page 37: Bio2RDF and Beyond!

From linked data to linked knowledge through syntactic and semantic normalization.

Page 38: Bio2RDF and Beyond!

Multiple Ways To Represent Knowledge

Three ways to model the relationship between a protein and the volume it occupies.

Page 39: Bio2RDF and Beyond!

Web-based Knowledge Discovery
Some of our queries need services

Page 40: Bio2RDF and Beyond!

The Holy Grail:

Align the promoters of all serine threonine kinases involved exclusively in the regulation of cell sorting during wound healing in blood vessels.

Retrieve and align 2000nt 5' from every serine/threonine kinase in Mus musculus expressed exclusively in the tunica [I | M |A] whose expression increases 5X or more within 5 hours of wounding but is not activated during the normal development of blood vessels, and is <40% similar in the active site to kinases known to be involved in cell-cycle regulation in any other species.

Page 41: Bio2RDF and Beyond!

Semantic Automated Discovery and Integration

http://sadiframework.org

Mark Wilkinson, UBC
Michel Dumontier, Carleton University
Christopher Baker, UNB

Page 42: Bio2RDF and Beyond!

SADI – described oriented service matching based on registered predicates

Page 43: Bio2RDF and Beyond!

Page 44: Bio2RDF and Beyond!

What pathways does UniProt protein P47989 belong to?

PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX ont: <http://ontology.dumontierlab.com/>
PREFIX uniprot: <http://lsrn.org/UniProt:>

SELECT ?gene ?pathway
WHERE {
  uniprot:P47989 pred:isEncodedBy ?gene .
  ?gene ont:isParticipantIn ?pathway .
}

Page 45: Bio2RDF and Beyond!

SADI

• Describe the input and output using OWL-DL classes
• Subject of input and output must be the same
• Web services correspond to predicates
• BioCatalogue to register SADI-compliant services
• Simplified migration path for existing web services (Java, Perl)
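The idea that web services correspond to predicates can be sketched with a toy registry. Everything here is hypothetical: the two functions stand in for remote SADI services, the `example:` identifiers are placeholder values rather than real biological data, and `REGISTRY` and `follow` are illustrative names, not part of the SADI framework. The sketch resolves the two triple patterns of the pathway query for P47989 one predicate at a time, and each service attaches its output to the same subject it was given.

```python
# Hypothetical stand-ins for remote services; the data is placeholder only.
def encoded_by_service(subject):
    toy = {"uniprot:P47989": ["example:geneA"]}
    return toy.get(subject, [])


def participant_in_service(subject):
    toy = {"example:geneA": ["example:pathway1", "example:pathway2"]}
    return toy.get(subject, [])


# Registry: each predicate is served by one service.
REGISTRY = {
    "pred:isEncodedBy": encoded_by_service,
    "ont:isParticipantIn": participant_in_service,
}


def follow(subjects, predicate):
    """Call the service registered for a predicate on each subject.

    The returned triples keep the input subject, mirroring the rule that
    the subject of a service's input and output must be the same.
    """
    service = REGISTRY[predicate]
    return [(s, predicate, o) for s in subjects for o in service(s)]


gene_triples = follow(["uniprot:P47989"], "pred:isEncodedBy")
genes = [o for _, _, o in gene_triples]
pathway_triples = follow(genes, "ont:isParticipantIn")

for triple in gene_triples + pathway_triples:
    print(*triple)
```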

Page 46: Bio2RDF and Beyond!

Build a knowledge base from a series of questions

Page 47: Bio2RDF and Beyond!

You want to join the knowledge web

Page 48: Bio2RDF and Beyond!

Share your data

Page 49: Bio2RDF and Beyond!

Build semantic web services

Page 50: Bio2RDF and Beyond!

Get to where you want to be … faster!

Page 51: Bio2RDF and Beyond!

Next Steps

Service and Data Buildout
Formal Partnerships
Applications

Page 52: Bio2RDF and Beyond!

Page 53: Bio2RDF and Beyond!

Page 54: Bio2RDF and Beyond!

We’re interested in Personalized Medicine

The ability to offer
• The Right Drug
• To The Right Patient
• For The Right Disease
• At The Right Time
• With The Right Dosage

Genetic and metabolic data will allow drugs to be tailored to patient subgroups

Page 55: Bio2RDF and Beyond!

PharmGKB is an emerging resource for pharmacogenomics

+ Role of genes, gene variants, drugs
+ pharmacokinetics
+ pharmacodynamics
+ clinical outcomes
+ Links to publications

- Natural language descriptions
- Variant details in publications

Page 56: Bio2RDF and Beyond!

PHARMACOGENOMICS OF DEPRESSION KNOWLEDGE BASE

Contains statements from 11/40 relevant publications involving 45 genes / gene variants, 57 drugs annotated with 19 classes of antidepressants, 45 drug treatments, 47 drug-gene interactions, 29 clinical outcomes, 10 drug-induced side-effects, and 8 gene-disease interactions.

Page 57: Bio2RDF and Beyond!

Nortriptyline induced side effects for ABCB1 gene variants

‘side effect’ that ‘is realized by’ some (‘drug treatment’ that ‘involves’ some ‘nortriptyline’ and ‘involves’ some (‘variant of’ some ‘ABCB1’))

QUERYING THE PDKB
Protégé 4, FaCT++, DL Query Tab

Result: postural hypotension is a side effect of nortriptyline treatment of depression for individuals presenting the 3435C>T genotype.