Big Data Supporting Drug Discovery: Cautionary Tales from the World of Chemistry for Translational Informatics

Valery Tkachenko

RSC-CSIR/OSDD meeting, Pune, India, February 3rd 2014

• Big Data
• Chemical Space
• Drug Discovery pipeline
• Machine learning
• Training sets
• RSC/ChemSpider platforms
• RSC/Archive
• Research data management
• Data quality, crowdsourcing and AltMetrics
• Building Global Chemistry Network

Chemical space - roughly 10^60 possible drug-like molecules

Navigation in chemical space

Structure-based Drug Design

Ligand-based Drug Design

Machine learning

Applied machine learning

ChemSpider

• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• A structure-centric hub for web searching

Properties - experimental

Properties - ACD/Labs

Properties - EPI Suite

Properties - ChemAxon

Literature references

Patent references

Books

Classification

Chemical vendors and datasources

Multimedia

ChemSpider Reactions

ChemSpider Spectra

ChemSpider Databases

ChemSpider Compounds

ChemSpider Reactions

ChemSpider Spectra

ChemSpider Crystals

ChemSpider Materials

ChemSpider Assays

ChemSpider Algorithms

Research data inflow

Deposition Gateway

Staging databases

Compounds

Reactions

Spectra

Materials

Articles / CSSP

Compounds Module

Spectra Module

Reactions Module

Materials Module

Text-mining Module

… Module

Web UI for unified depositions

DropBox, Google Drive, SkyDrive, etc

LabTrove and other templated data

Documents

API, FTP, etc

Raw data

Validated data

Staging databases

All databases are sliced by data source or data collection and have a simple security model in which each data slice/source is private, public or embargoed.
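The slice-level security model can be illustrated with a minimal Python sketch. The names `Visibility`, `DataSlice` and `can_read` are hypothetical, and two behaviours are assumptions rather than anything stated on the slide: that an owner can always read their own slice, and that an embargoed slice becomes readable by everyone once its embargo date has passed.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass(frozen=True)
class DataSlice:
    """One data source/collection within a database (hypothetical model)."""
    name: str
    owner: str
    visibility: Visibility
    embargo_until: Optional[date] = None  # only meaningful for EMBARGOED


def can_read(slice_: DataSlice, user: str, today: date) -> bool:
    """Return True if `user` may read records in this slice on `today`."""
    if user == slice_.owner:
        return True  # assumption: owners always see their own data
    if slice_.visibility is Visibility.PUBLIC:
        return True
    if slice_.visibility is Visibility.EMBARGOED:
        # Assumption: an embargoed slice opens to everyone once the date passes.
        return slice_.embargo_until is not None and today >= slice_.embargo_until
    return False  # PRIVATE: owner only
```

Keeping the access decision at the level of a slice, rather than of individual records, means one check covers every record deposited from the same source.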

Research data outflow

Data tier: Compounds, Reactions, Spectra, Materials, Documents

Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API

User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets

User interface tier (examples): Analytical Laboratory application, Electronic Laboratory Notebook, Chemical Inventory application, paid 3rd party integrations (various platforms: SharePoint, Google, etc.)
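The four outflow tiers (data, data access, user interface components, user interface) can be pictured with a small Python sketch for the Compounds case. This is only an illustration of the layering: every class name, method and record field below is hypothetical and does not describe the actual RSC APIs or widgets.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CompoundsStore:
    """Data tier: holds compound records (in memory for this sketch)."""
    records: Dict[int, dict] = field(default_factory=dict)


class CompoundsAPI:
    """Data access tier: the only component that touches the store."""

    def __init__(self, store: CompoundsStore) -> None:
        self._store = store

    def get(self, compound_id: int) -> dict:
        return self._store.records[compound_id]


class CompoundWidget:
    """User interface components tier: renders one record via the API."""

    def __init__(self, api: CompoundsAPI) -> None:
        self._api = api

    def render(self, compound_id: int) -> str:
        record = self._api.get(compound_id)
        return f"{record['name']} ({record['formula']})"


class NotebookApp:
    """User interface tier: an application composed of widgets."""

    def __init__(self, widget: CompoundWidget) -> None:
        self._widget = widget

    def page(self, compound_ids: List[int]) -> str:
        return "\n".join(self._widget.render(cid) for cid in compound_ids)


# Illustrative records only.
store = CompoundsStore({1: {"name": "water", "formula": "H2O"},
                        2: {"name": "ethanol", "formula": "C2H6O"}})
app = NotebookApp(CompoundWidget(CompoundsAPI(store)))
print(app.page([1, 2]))
```

Because each tier talks only to the one directly beneath it, a different application (an inventory tool, say) could reuse the same widget and API without knowing how the records are stored.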

RSC Archive – since 1841

DERA - Digitally Enabling RSC Archive

Semantic mark-up of articles

It is so difficult to navigate…

• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known pathways?
• Working on now?
• Connections to disease?
• Expressed in right cell type?
• Competitors?
• IP?

Data quality issue and CVSP

– Robochemistry

– Proliferation of errors in public and private databases

– Automated quality control system
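As a rough illustration of what an automated quality control pass might look like, the Python sketch below runs a list of rule functions over structure records and collects the issues found. It is not a description of how CVSP works: the rules, the record layout (`id`, `smiles`) and the sample records are all assumptions made up for this example, and a real system would rely on a proper cheminformatics toolkit rather than string checks.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
Rule = Callable[[Record], Optional[str]]  # returns an issue message or None


def missing_structure(record: Record) -> Optional[str]:
    """Flag records that carry no structure string at all."""
    if not record.get("smiles", "").strip():
        return "no structure given"
    return None


def unbalanced_parentheses(record: Record) -> Optional[str]:
    """Flag SMILES strings whose branch parentheses do not pair up."""
    depth = 0
    for ch in record.get("smiles", ""):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" appeared before its "("
                return "unbalanced parentheses in SMILES"
    return "unbalanced parentheses in SMILES" if depth != 0 else None


def validate(records: List[Record], rules: List[Rule]) -> Dict[str, List[str]]:
    """Run every rule on every record; return issues keyed by record id."""
    report: Dict[str, List[str]] = {}
    for record in records:
        issues = []
        for rule in rules:
            message = rule(record)
            if message is not None:
                issues.append(message)
        if issues:
            report[record["id"]] = issues
    return report


# Made-up records for illustration; not taken from any real database.
records = [
    {"id": "R1", "smiles": "CC(=O)O"},  # acetic acid, well-formed
    {"id": "R2", "smiles": "CC(=O"},    # truncated string
    {"id": "R3", "smiles": ""},         # structure missing
]
report = validate(records, [missing_structure, unbalanced_parentheses])
print(report)
```

Expressing each check as an independent rule makes it straightforward to add new checks as new classes of error turn up in source databases.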

DrugBank dataset (6516 records)

J. Brecher, IUPAC, Graphical Representation of Stereochemical Configuration, Section ST-1.1.10

DB06287

Research data management

University 1

Data Hub

Workstations

University 2

Data Hub

Workstations

Company 3

Data Hub

Workstations

Data Repository indexed storage

Data Repository provided data storage

Chemically intelligent services

Indexes

Data

External clients: Publishers, Scientists, Funding bodies

Crowdsourcing

AltMetrics

RSC/Rewards and Recognition

Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said “A journey of a thousand miles begins with a single step”. In the same way we hope that this will be the first of many submissions that you make to CSSP.

The First Step badge is awarded when a user submits (& has published) their 1st CSSP article.
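The award rule for this badge can be written down in a few lines of Python. The `Submission` type and the `earns_first_step` function are hypothetical names chosen for this sketch; the only rule encoded is the one stated above, namely that the author has at least one submitted and published CSSP article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Submission:
    """A CSSP article submission (hypothetical record layout)."""
    author: str
    published: bool


def earns_first_step(author: str, submissions: List[Submission]) -> bool:
    """True once the author has at least one published CSSP article."""
    return any(s.author == author and s.published for s in submissions)
```

A submission that is still awaiting publication does not count, which matches the "(& has published)" condition in the badge description.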

Visualization

Visualization and navigation

We are a part of a larger world

ChemSpider APIs

National Chemistry Database

http://www.openphacts.org

Open PHACTS is an Innovative Medicines Initiative (IMI) project that aims to reduce the barriers to drug discovery in industry, academia and small businesses.

The semantic web is one of the cornerstones

OSDD

Thank you

Email: [email protected]

Slides: http://www.slideshare.net/valerytkachenko16