Page 1: Linked Data for Biopharma

Tom Plasterer, PhD

Integrated Informatics Semantic Framework Lead (i2SF)

The Path to Linked Data in BioPharma

Integrated R&D Informatics and Knowledge Management

Page 2: Linked Data for Biopharma

R&D | RDI

Blockbuster ‘Patent Cliff’ Gives Way to Personalized Approach

Drivers & Solutions

Blockbuster Patent Cliff

Growth of Generics

Mergers & Acquisitions

Personalized Medicine
• Pharmacogenetics
• Biomarkers

American Action Forum; Primer: The Pharmaceutical Industry (Han Zhong | Updated June 2012)

IMAP Pharma & Biotech Industry Global Report 2011

Evaluate Pharma World Preview 2018

From: http://www.liv.ac.uk/pharmacogenetics/

Page 3: Linked Data for Biopharma

Where do the new opportunities arise?

Inside & Outside

Build from within
• Nurture ‘best in class’ programs
• Kill early
• Repositioning

Mergers & Acquisitions
• Partner or Buy?
• Integrate cultures & technology
• Is the disruption worth it?

Pre-Competitive Consortiums
• How much can be shared, and still be useful?
• Who is driving?

Finding ‘KOLs’
• Aggressive Regional Partnerships (Pfizer's Centers for Therapeutic Innovation)
• Co-locate near Academic Centers of Excellence (Novartis)
• Cherry pick (GSK, AZ, others)

Page 4: Linked Data for Biopharma

Distributed Data in a Monolithic Environment

Managing Silos

Partitioned By Content
• Regulated Systems vs. Discovery

Partitioned By Geography & Organization
• US, EU, ASIAPAC

Data Formats
• RDB, Excel, Text, RSS, RDF?

Warehouses & Service Oriented Architecture
• Steps in the right direction?

Collaborative Environment
• eRooms, Sharepoint, Yammer, ‘Lync’ vs. Twitter, Google Docs, Skype

Standards?
• Vendor specific or open?
• Mixed Bag

Where are the ‘smarts’?
• UI? Services?
• Metadata?

Page 5: Linked Data for Biopharma

Requirements of The Informatics Landscape

Must span the entire drug development lifecycle
o and back (post-market surveillance to discovery)

Must support large and very heterogeneous data
o single nucleotide polymorphisms to countries

Will change as new science emerges & new regulations come into play
o Medline just under 1M articles/year

Must be able to work with multiple, international regulatory bodies
o Emerging markets

Partners, customers and collaborators will change
o and will have divergent technical aptitudes

Must be able to interoperate with precompetitive consortia
o Can they perform common tasks for the community?

Must be able to work with legacy data
o Lots of unmined gems here!

Maximal Agility

Page 6: Linked Data for Biopharma

What’s Needed?

Linked Data!

LOD Cloud 2011: http://thedatahub.org/group/lodcloud

Page 7: Linked Data for Biopharma

The 5 Stars of Open Linked Data

W3C/TBL Guidance

http://www.w3.org/DesignIssues/LinkedData.html

★ Make your stuff available on the web (any format)

★★ Make it available as structured data (e.g. Excel instead of image scan of a table)

★★★ Use a non-proprietary format (e.g. CSV instead of Excel)

★★★★ Use URLs to identify things, so that people can point at your stuff

★★★★★ Link your data to other people’s data to provide context
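
To make the upper rungs of the ladder concrete, here is a minimal Python sketch, not taken from the talk, that starts from a three-star CSV table, names each row with a URL (four stars), and links it to someone else's data (five stars). The CSV content, the `data.example.com` namespace, and the choice of `rdfs:label` and `owl:sameAs` as predicates are all illustrative assumptions.

```python
import csv
import io

# Hypothetical three-star data: an open, non-proprietary CSV table.
CSV_TEXT = (
    "compound_id,label,external_uri\n"
    "C001,Example compound,http://example.org/public/compound/42\n"
)

BASE = "http://data.example.com/id/compound/"  # assumed namespace, not from the talk
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def row_to_ntriples(row):
    """Turn one CSV row into a list of N-Triples lines."""
    subject = BASE + row["compound_id"]  # four stars: a URL identifies the thing
    lines = [f'<{subject}> <{RDFS_LABEL}> "{escape_literal(row["label"])}" .']
    if row["external_uri"]:  # five stars: link out to other people's data
        lines.append(f'<{subject}> <{OWL_SAME_AS}> <{row["external_uri"]}> .')
    return lines


def csv_to_ntriples(text):
    """Convert the whole CSV document into N-Triples lines."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        triples.extend(row_to_ntriples(row))
    return triples


if __name__ == "__main__":
    print("\n".join(csv_to_ntriples(CSV_TEXT)))
```

The sketch only handles plain string labels; a real conversion would also need datatype and language handling.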

Page 8: Linked Data for Biopharma

The 5 Stars of Closed (rather than Open) Linked Data

W3C/TBL Guidance

http://www.w3.org/DesignIssues/LinkedData.html

★ Make your stuff available on the intranet rather than the web (any format)

★★ Make it available as structured data (e.g. Excel instead of image scan of a table)

★★★ Use a non-proprietary format (e.g. CSV instead of Excel)

★★★★ Use URLs to identify things, so that people can point at your stuff

★★★★★ Link your data to other people’s data to provide context

Page 9: Linked Data for Biopharma

Towards a Linked Data Architecture

[Architecture diagram. Labels: Structured, Semi-Structured and Unstructured Content + Tagging; Catalogues, Mapping, Queries; RDF; Triplestores; Vocabulary Server; Active & Partial PURLs; Central Identity Management; Search; Semantic Visualization]

http://research.vocab.astrazeneca.com/id/DOID/2841

http://humandiseaseontology.astrazeneca.net/DOID/2841

Page 10: Linked Data for Biopharma

Choosing Linked Vocabularies

Current LOD Cloud Adoption

Vocabulary prefix   Vocabulary link                             Number of usages in data sets
dc                  http://purl.org/dc/elements/1.1/            92 (31.19 %)
foaf                http://xmlns.com/foaf/0.1/                  81 (27.46 %)
skos                http://www.w3.org/2004/02/skos/core#        58 (19.66 %)
geo                 http://www.w3.org/2003/01/geo/wgs84_pos#    25 (8.47 %)
xhtml               http://www.w3.org/1999/xhtml/vocab#         19 (6.44 %)
akt                 http://www.aktors.org/ontology/portal#      17 (5.76 %)
bibo                http://purl.org/ontology/bibo/              14 (4.75 %)
mo                  http://purl.org/ontology/mo/                13 (4.41 %)
vcard               http://www.w3.org/2006/vcard/ns#            10 (3.39 %)
sioc                http://rdfs.org/sioc/ns#                    10 (3.39 %)
cc                  http://creativecommons.org/ns#              8 (2.71 %)
geonames            http://www.geonames.org/ontology#           6 (2.03 %)

http://www4.wiwiss.fu-berlin.de/lodcloud/state/#terms

Vocabulary Server
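
Reusing these vocabularies in practice mostly means agreeing on the prefix-to-namespace mapping. The following Python sketch shows one possible way to expand a prefixed name into a full URI and to compact it again. The namespaces are copied from the adoption table; the function names `expand` and `compact` are illustrative and do not come from any particular library.

```python
# Prefix-to-namespace mapping copied from the LOD cloud adoption table.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "bibo": "http://purl.org/ontology/bibo/",
    "sioc": "http://rdfs.org/sioc/ns#",
}


def expand(prefixed_name):
    """Expand 'prefix:local' into a full URI using the shared mapping."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise KeyError(f"unknown prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local


def compact(uri):
    """Shorten a full URI to 'prefix:local'; return it unchanged if no namespace matches."""
    matches = [(len(ns), prefix) for prefix, ns in PREFIXES.items() if uri.startswith(ns)]
    if not matches:
        return uri
    _, prefix = max(matches)  # the longest matching namespace wins
    return f"{prefix}:{uri[len(PREFIXES[prefix]):]}"


if __name__ == "__main__":
    print(expand("skos:prefLabel"))
    print(compact("http://xmlns.com/foaf/0.1/name"))
```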

Page 11: Linked Data for Biopharma

The 5 Stars of Open Linked Vocabularies

Bernard Vatant (Mondeca) Guidance

http://blog.hubjects.com/2012/02/is-your-linked-data-vocabulary-5-star_9588.html

★ Publish your vocabulary on the Web at a stable URI

★★ Provide human-readable documentation and basic metadata (e.g. creator, publisher, date of creation, last modification, version number)

★★★ Provide labels and descriptions, if possible in several languages, to make your vocabulary usable in multiple linguistic scopes

★★★★ Make your vocabulary available via its namespace URI, both as a formal file and human-readable documentation, using content negotiation

★★★★★ Link to other vocabularies by re-using elements rather than re-inventing
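
The four-star rule relies on HTTP content negotiation: the same namespace URI serves documentation to a browser and a formal file to a machine client, depending on the `Accept` header. A simplified Python sketch of that selection logic follows. The file names and the HTML default are assumptions for illustration, and the parser ignores details such as `text/*` wildcards that a production server would need.

```python
# Hypothetical representations available at one vocabulary namespace URI.
REPRESENTATIONS = {
    "text/html": "vocab.html",          # human-readable documentation
    "text/turtle": "vocab.ttl",         # formal file (Turtle)
    "application/rdf+xml": "vocab.rdf", # formal file (RDF/XML)
}
DEFAULT_TYPE = "text/html"  # assumed fallback; a strict server could answer 406 instead


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def negotiate(accept_header):
    """Pick the media type to serve for the given Accept header."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in REPRESENTATIONS:
            return media_type
        if media_type == "*/*":
            return DEFAULT_TYPE
    return DEFAULT_TYPE


if __name__ == "__main__":
    for header in ("text/turtle", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"):
        chosen = negotiate(header)
        print(header, "->", REPRESENTATIONS[chosen])
```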

Page 12: Linked Data for Biopharma

Domain Specific Vocabularies

Linked Open Vocabularies: http://labs.mondeca.com/dataset/lov/index.html

NCBO: http://bioportal.bioontology.org/

Page 13: Linked Data for Biopharma

Building Linked Data Applications

• Capture Business Questions and Sources

• Domain Expert Concept Map

• Build Formal Ontology (Reuse Vocabularies!)

• Challenge with Linked Data

• Model Business Questions (SPARQL)

• Interact with RDF answer in a Faceted Browser
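
As a rough illustration of what modelling a business question in SPARQL amounts to, the Python sketch below evaluates a two-pattern query over a handful of in-memory triples. It is a toy matcher rather than a SPARQL engine, and the `ex:` terms, the data, and the question itself ("which compounds target a protein associated with a given disease?") are invented for the example. The equivalent SPARQL would be roughly `SELECT ?compound WHERE { ?compound ex:targets ?protein . ?protein ex:associatedWith ex:diseaseD }`.

```python
# Hypothetical triples; prefixed names stand in for full URIs.
TRIPLES = [
    ("ex:compoundA", "ex:targets", "ex:proteinX"),
    ("ex:compoundB", "ex:targets", "ex:proteinY"),
    ("ex:proteinX", "ex:associatedWith", "ex:diseaseD"),
]


def match_pattern(pattern, triple, bindings):
    """Extend bindings so that pattern matches triple; return None if it cannot."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # a variable: bind it or check the existing binding
            if result.setdefault(term, value) != value:
                return None
        elif term != value:       # a constant: must match exactly
            return None
    return result


def query(patterns, triples):
    """Evaluate a list of triple patterns as a join, one pattern at a time."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


if __name__ == "__main__":
    question = [
        ("?compound", "ex:targets", "?protein"),
        ("?protein", "ex:associatedWith", "ex:diseaseD"),
    ]
    for row in query(question, TRIPLES):
        print(row["?compound"])
```

A faceted browser would then let the user narrow such result sets by the bound variables.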

Page 14: Linked Data for Biopharma

Improving Internal Interoperability

Scientists, Clinicians, Informaticists can now freely interoperate as:

The PURL server provides a central identity management authority for resources that are of value (need to persist) across the enterprise. The Persistent URLs are used to connect resources found in multiple locations

The vocabulary server provides a way of harmonizing concepts across different domains

o Where possible, public vocabularies are used
o Where not, they’re extended
o We don’t want to develop and maintain vocabularies
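
One way to picture the PURL server's role is as a lookup from a persistent identifier to wherever the resource currently lives. The Python sketch below resolves exact PURLs and partial (prefix) PURLs. The two AstraZeneca URLs are the ones shown on the architecture slide, but the direction of the mapping between them, the table layout, and the `resolve` function are assumptions made for illustration, not a description of the actual server.

```python
# Exact PURLs: one persistent URL maps to one current location (none registered here).
EXACT_PURLS = {}

# Partial PURLs: a registered prefix is rewritten and the remainder is kept.
# Assumption: the research.vocab address is the persistent one and redirects
# to the ontology host; the slide shows both URLs without stating the direction.
PARTIAL_PURLS = {
    "http://research.vocab.astrazeneca.com/id/DOID/":
        "http://humandiseaseontology.astrazeneca.net/DOID/",
}


def resolve(purl):
    """Return the current location for a persistent URL, or None if unregistered."""
    if purl in EXACT_PURLS:
        return EXACT_PURLS[purl]
    # Try longer prefixes first so the most specific registration wins.
    for prefix in sorted(PARTIAL_PURLS, key=len, reverse=True):
        if purl.startswith(prefix):
            return PARTIAL_PURLS[prefix] + purl[len(prefix):]
    return None


if __name__ == "__main__":
    print(resolve("http://research.vocab.astrazeneca.com/id/DOID/2841"))
```

Because consumers only ever store the persistent form, the target side of the table can change when a resource moves without breaking any links.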

Page 15: Linked Data for Biopharma

Inside/Outside Disappears

[Architecture diagram. Labels: External; Internal; Structured Vendor Content; Consortium Content; RESTful APIs; Structured, Semi-Structured and Unstructured Content + Tagging; Catalogues, Mapping, Queries; RDF; Triplestores; Active & Partial PURLs; Central Identity Management; Semantic Visualization; Vocabulary Server]

Page 16: Linked Data for Biopharma

Unstructured Content (or most of the data out there…)

Giving Structure to Unstructured Content
o Entity Recognition
o Use of common vocabularies
o Schemas
o Domain-Specific Content? Open BEL? TMO?
o Compatibility of text indices with triplestores & middleware tools

Encouraging Publishers to Structure Content
o How can this be ‘monetized’ so they don’t lose their ROI?
o What about interoperability & persistence?
o Can this be mandated via funding agencies?
o RDFa to start?

Publishers or ‘Re-publishers’
o Thomson-Reuters
o Ingenuity
o Open up vocabularies

Page 17: Linked Data for Biopharma

Pre-Competitive Consortia

Open PHACTS (Innovative Medicines Initiative)

Pistoia Alliance

W3C Health Care & Life Sciences Interest Group

National Center for Biomedical Ontology (NCBO)

Open BEL (Biological Expression Language)

Page 18: Linked Data for Biopharma

Open PHACTS (Open Pharmacological Space)
• EU/EFPIA Innovative Medicines Initiative (IMI) project

Key Points

Large scale data integration
• Focused on pharmacology
• We integrate so you don’t have to
• Dealing with multiple identifiers for the same concept
• Always up-to-date
• State of the art and industrial strength

Flexible and adaptable
• Dynamic schema-less approach; rapidly incorporate new datasets
• Queries are adaptive, based on scientific profiles (e.g. chemist or biologist)
• Use-case driven & tested by users in industry and academia

Great APIs for building apps
• JSON REST-style APIs
• Also supports XML, Turtle, etc.
• Chemistry services
• Exemplars show how to take advantage of the platform
• Clear licensing details for all data in the system

Focus On Data Quality
• Provenance is critical: know where every data point comes from
• Google-style indexing; data providers keep their own data
• Chemistry standardization: enhancing chemistry connectivity
• Working with data providers to expose and enhance their data

From: Open PHACTS Architecture - Building the extensible platform (EuroQSAR 2012 in Vienna, 30.08.2012)

Page 19: Linked Data for Biopharma

W3C HCLS

The mission of the Semantic Web Health Care and Life Sciences Interest Group (HCLS IG) is to develop, advocate for, and support the use of Semantic Web technologies across health care, life sciences, clinical research and translational medicine.

Activities:
o Continue to develop high level (e.g. TMO) and architectural (e.g. SWAN) vocabularies.
o Implement proof-of-concept demonstrations and industry-ready code.
o Document guidelines to accelerate the adoption of the technology.
o Disseminate information about the group's work at government, industry, academic events and by participating in community initiatives.

Use Cases/Domains
o Drug Discovery
o Electronic Lab Notebooks
o Comparator Arm Data
o Patient Data Ownership
o Biotech Acquisition
o Supply Chain Automation
o Web Integration
o Bio-surveillance
o Co-development

http://www.w3.org/blog/hcls/

Page 20: Linked Data for Biopharma

Pleas & Future Directions

Prognostications

• RDF Content Farms
o Vendors: Someone will figure out how to monetize this
o Consortia: Who ‘Owns’ this?

• Government in Health Care & Life Sciences; can we learn from the EPA? open.gov?

• Shrinking Pharma
o Smaller (or virtual) footprint
o Back to first principles: what do we do best?

• More modeling & Simulation

• Rise of the informaticist…

Community Help

• Resist Silos
o Where is your data? Where is it likely to be in 5, 10 years?
o A single triplestore with all ETL-streams leading to an RDF ‘data warehouse’ is another silo
o Building on top of ‘standards+’ may lead to silos
o Need to follow & influence emergence of standards if you have a ‘horse in the race’

• Support (business focused) Consortiums
o We’re doing the same job many, many times

Page 21: Linked Data for Biopharma

Thank You

Listeners & Molecular Med TRI-CON 2013 Organizers