How deep learning can help to design better and safer medicine?

Post on 14-Aug-2020

Olexandr Isayev, Ph.D.

University of North Carolina at Chapel Hill

@olexandr http://olexandrisayev.com

How deep learning can help to design better and safer medicine?

KinomeNet: multi-task deep convolutional network

About me

Ph.D. in Chemistry (computational)

Minor in CS

Worked in a federal research lab on HPC & GPU computing to solve chemical problems

Now I am research faculty at the University of North Carolina, Chapel Hill

http://olexandrisayev.com

And I am also Director of Drug Discovery at Atlas Regeneration. We use AI & multi-omics for developing regenerative medicine and stem cell differentiation technologies.

http://atlasregeneration.com/

A public-private partnership that supports the discovery of new medicines through open access research

www.thesgc.org

How are drugs discovered?

The Long and Winding Road to Drug Discovery

Data Science approaches useful across the pipeline,

but very different techniques

aim for success, but if not:

fail early, fail cheap

Medicines Are Transforming the Treatment of Many Diseases

Robotic biological tests (HTS)

Robotic synthesis

Drowning in Data but starving for Knowledge

The rapid growth of materials research has led to the accumulation of vast amounts of data. For example, 160,000 entries in the Inorganic Crystal Structure Database (ICSD).

Numerous commercial and open experimental databases: NIST, MatWeb, MatBase, etc.

Vast computational databases such as AFLOWLIB, Materials Project, and Harvard Clean Energy.

Scannell et al. Nature Reviews Drug Discovery, 2012, 11, 191‐200

Decline in Pharmaceutical R&D efficiency

The cost of developing a new drug (~$2‐3B) roughly doubles every nine years.
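Taken at face value, a cost that doubles every nine years is simple exponential growth. A rough worked form follows; $C_0$ (the cost in some reference year) and $t$ (years elapsed) are symbols introduced here for illustration only:

```latex
C(t) = C_0 \cdot 2^{t/9},
\qquad
\frac{C(t+1)}{C(t)} = 2^{1/9} \approx 1.08
```

That is roughly 8% growth per year, or a fourfold increase over 18 years.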

Why do drugs fail?

Selectivity of Kinase inhibitors

All kinases bind ATP and therefore contain a conserved binding site

Most compounds inhibit more than one kinase

Why Don’t We Do Better? A Couple of Observations

• Tykerb – Breast cancer

• Gleevec – Leukemia, GI cancers

• Nexavar – Kidney and liver cancer

• Staurosporine – natural product – alkaloid – many uses, e.g., antifungal, antihypertensive

Collins and Workman. Nature Chemical Biology, 2006, 2, 689‐700

>40% of biologically active compounds bind to more than one target

~10^6 – 10^7 molecules

~10^2 – 10^3 molecules

VIRTUAL SCREENING

Empirical Rules/Filters

Similarity Search

Consensus QSA

Potential Hits

ML or QSAR Models

Structure-based Models

Virtual Screening to identify potential hits

Candidate molecules
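Two of the funnel stages named above, empirical filters and similarity search, can be sketched in a few lines. This is a minimal illustration, not the platform's actual pipeline: the molecule records, property values, fingerprint bits, and the 0.5 similarity threshold are all made up, and the rule-of-five filter here rejects on any single violation, which is a stricter variant chosen for brevity.

```python
# Hypothetical library: precomputed properties plus fingerprint "on" bits.
library = [
    {"id": "mol_a", "mw": 320.4, "logp": 2.1, "hbd": 2, "hba": 5, "fp": {1, 4, 7, 9}},
    {"id": "mol_b", "mw": 612.8, "logp": 6.3, "hbd": 4, "hba": 11, "fp": {1, 4, 7, 8}},
    {"id": "mol_c", "mw": 410.5, "logp": 3.8, "hbd": 1, "hba": 6, "fp": {2, 3, 5}},
]
query_fp = {1, 4, 7, 9, 12}  # fingerprint of a known active (made up)


def passes_rule_of_five(mol):
    """Lipinski-style filter; here any violation rejects the molecule."""
    return (
        mol["mw"] <= 500
        and mol["logp"] <= 5
        and mol["hbd"] <= 5
        and mol["hba"] <= 10
    )


def tanimoto(a, b):
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def screen(mols, query, min_similarity=0.5):
    """Apply the cheap filter first, then rank survivors by similarity."""
    kept = [m for m in mols if passes_rule_of_five(m)]
    scored = [(tanimoto(m["fp"], query), m["id"]) for m in kept]
    return sorted((s for s in scored if s[0] >= min_similarity), reverse=True)


hits = screen(library, query_fp)
print(hits)
```

Running the cheap property filter before the similarity ranking mirrors the funnel idea: each stage shrinks the set that the next, more expensive stage has to process.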

Our vision for next-gen cheminformatics platforms

• Scale up Machine Learning Methods with the Data
• Use the full variety of available data (-omics, sensors, etc.)
• Take advantage of the latest algorithmic developments: Deep Learning

Collected all human kinase data from open sources

• ChEMBL
• PKIS
• PubChem
• Private datasets
• Literature, patents, etc.

300,000+ Molecules

489 Targets 

>800,000 Experimental data points

Biggest target data: >25000 molecules

Smallest target data: 1
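A collection like this is usually turned into a sparse compound-by-kinase label matrix before modeling. The sketch below assumes IC50 values in nM and treats IC50 <= 1 µM as "active", matching the "Active @1uM" column used later in the deck. The records, the `<=` convention, and all names are illustrative assumptions, not the actual curation procedure.

```python
import math

# Hypothetical records: (compound_id, kinase, IC50 in nM). Values are made up.
records = [
    ("cpd1", "ABL1", 12.0),
    ("cpd1", "ZAP70", 4500.0),
    ("cpd2", "ABL1", 1000.0),
    ("cpd3", "BMX", 30000.0),
]

CUTOFF_NM = 1000.0  # 1 uM; treating IC50 <= cutoff as active is an assumption


def pic50(ic50_nm):
    """Convert IC50 in nM to pIC50 = -log10(IC50 in mol/L)."""
    return 9.0 - math.log10(ic50_nm)


# Sparse label store: only measured (compound, kinase) pairs get an entry.
labels = {(cpd, kinase): int(ic50 <= CUTOFF_NM) for cpd, kinase, ic50 in records}

compounds = {cpd for cpd, _, _ in records}
kinases = {kinase for _, kinase, _ in records}
density = len(labels) / (len(compounds) * len(kinases))

# Same calculation with the lower-bound counts quoted above.
slide_density = 800_000 / (300_000 * 489)
print(f"toy density: {density:.2f}, slide-level density: {slide_density:.4f}")
```

With the quoted counts, only on the order of 0.5% of all compound-kinase pairs carry a measurement, which is why the per-target data sizes range from more than 25000 molecules down to 1.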

Human Kinase Inhibitor Data Collection

Human Kinase IC50 Data Distribution 

“Popular” targets

“Rare” targets

Convolutional Neural Network (ConvNet)

Convolution Function (Filter)

Comes from Image and Signal Processing

The easiest way to understand a convolution is by thinking of it as a sliding window function applied to a matrix.
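The sliding-window picture can be made concrete with a small pure-Python sketch. It uses no padding and a stride of 1, and, as in most deep learning libraries, it does not flip the kernel (strictly speaking a cross-correlation). The matrix and kernel values are arbitrary examples.

```python
def convolve2d(matrix, kernel):
    """Slide `kernel` over `matrix` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(matrix) - kh + 1
    out_w = len(matrix[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the window anchored at (i, j) elementwise with the kernel.
            window_sum = sum(
                matrix[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(window_sum)
        output.append(row)
    return output


matrix = [
    [1, 2, 0, 1],
    [0, 1, 3, 2],
    [2, 0, 1, 0],
    [1, 2, 0, 3],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = convolve2d(matrix, kernel)
print(feature_map)  # [[0, -1, -2], [0, 0, 3], [0, 0, -2]]
```

A 4x4 input and a 2x2 window give a 3x3 feature map; a ConvNet learns the kernel values instead of fixing them by hand.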

Groundbreaking results of DL are mostly based on networks with convolutional filters

• Image recognition
• Object detection
• Medical image processing

Different Levels of Abstraction 

• Hierarchical Learning 

• Natural progression from low level to high level structure as seen in natural complexity 

• Easier to monitor what is being learnt and to guide the machine to better subspaces 

• A good lower level representation can be used for many distinct tasks 

KinomeNet: Convolutional Neural Network for QSAR

ConvNet

2D matrix of Descriptors

Multitask Learning (253 targets)

ABL1

ACVR1

ZAK

ZAP70
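The multitask layout above (one shared network, one output per kinase) can be sketched as follows. This is a toy under stated assumptions, not KinomeNet itself: a single ReLU layer stands in for the convolutional trunk, the weights are invented, and the masked loss (scoring only the kinases a compound was actually tested against) is a common way to handle sparse labels that the slides do not confirm.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_features(descriptors, trunk_weights):
    """Stand-in for the shared ConvNet trunk: one ReLU layer, for brevity."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, descriptors)))
        for row in trunk_weights
    ]


def predict_all(descriptors, trunk_weights, heads):
    """One forward pass yields an activity probability for every kinase."""
    feats = shared_features(descriptors, trunk_weights)
    return {
        kinase: sigmoid(sum(w * f for w, f in zip(head, feats)))
        for kinase, head in heads.items()
    }


def masked_loss(predictions, labels):
    """Mean binary cross-entropy over the kinases that were actually measured."""
    losses = []
    for kinase, y in labels.items():
        p = predictions[kinase]
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)


# Toy weights and inputs (all hypothetical).
trunk_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
heads = {
    "ABL1": [1.0, -1.0],
    "ACVR1": [0.2, 0.4],
    "ZAK": [-0.7, 0.3],
    "ZAP70": [0.0, 1.2],
}
descriptors = [1.0, 0.5, 2.0]

predictions = predict_all(descriptors, trunk_weights, heads)
# This compound was only tested against two kinases; the rest are skipped.
loss = masked_loss(predictions, {"ABL1": 1, "ZAP70": 0})
print(predictions, loss)
```

Because every head reads the same features, gradients from data-rich kinases also shape the representation used by data-poor ones.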

Some Statistics & Performance Numbers

Random Forest Models

Target   N compounds   Active @ 1 µM   AUC    TN    FP   TP    FN   Sensitivity   Specificity
MAP4K4   160           10              0.88   149   1    1     9    0.1           0.93
BMX      155           151             0.78   0     4    151   0    1.0           0.0

DL Model

Target   N compounds   Active @ 1 µM   AUC    TN    FP   TP    FN   Sensitivity   Specificity
MAP4K4   160           10              0.91   150   0    6     4    0.6           0.94
BMX      155           151             0.93   4     0    149   6    0.99          1.0

RF (Random Forest) Average AUC: 0.90

KinomeNet Average AUC: 0.96
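Sensitivity and specificity follow directly from the confusion-matrix counts (TN, FP, TP, FN). A minimal sketch, with a hypothetical helper name, applied to the BMX random forest row:

```python
def sensitivity_specificity(tn, fp, tp, fn):
    """Return (sensitivity, specificity); None when a class has no examples."""
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Random forest row for BMX: TN=0, FP=4, TP=151, FN=0.
bmx_rf = sensitivity_specificity(tn=0, fp=4, tp=151, fn=0)
print(bmx_rf)  # (1.0, 0.0)
```

That row describes a model that calls every compound active: every true active is found, but none of the four inactives is recognized. On a target this imbalanced, both metrics need to be read together.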

KinomeNet: “Deorphanizing” rare targets

ConvNet

“Rare” targets (67 targets)

ACVR1

TYMS

…

“Frequent” (253 targets)

Multitask Learning (320 targets)

2D matrix of Descriptors

Why It Works: Transfer Learning

• Feature‐representation‐transfer

• To learn a “good” feature representation for the target domain. 

• The knowledge used to transfer across domains is encoded into the learned feature representation.

• With the new feature representation, the performance of the target task is expected to improve. 
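Feature-representation transfer can be illustrated with a frozen feature extractor and a small new output head trained on a handful of compounds for a rare target. This is a sketch only: the fixed "pretrained" weights, the four training compounds, and the plain gradient-descent loop are all assumptions made for the example. In the multitask setup described earlier, the rare-target outputs are trained jointly with the frequent ones instead of in a separate step.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def frozen_features(descriptors):
    """Stand-in for the representation learned on the data-rich kinases.

    The weights below are hypothetical and stay fixed during transfer.
    """
    trunk = [[0.9, -0.4], [-0.3, 0.8]]
    return [max(0.0, sum(w * x for w, x in zip(row, descriptors))) for row in trunk]


def fit_rare_head(examples, epochs=500, lr=0.5):
    """Train only a small logistic head on the few labelled compounds."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for descriptors, label in examples:
            feats = frozen_features(descriptors)
            p = sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)
            error = p - label  # gradient of cross-entropy w.r.t. the logit
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias


def predict(descriptors, weights, bias):
    feats = frozen_features(descriptors)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)


# Four made-up compounds measured against a rare kinase.
rare_data = [([1.0, 0.1], 1), ([0.9, 0.2], 1), ([0.1, 1.0], 0), ([0.2, 0.8], 0)]
weights, bias = fit_rare_head(rare_data)
print(predict([1.0, 0.1], weights, bias), predict([0.1, 1.0], weights, bias))
```

Only three parameters are fitted for the rare target; everything else is inherited, which is what makes learning from very few measurements feasible at all.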

Recovery of Kinase Similarity by the Network  

Atlas Regeneration

Young dynamic startup company (formed in 2015) in North Carolina

We use AI to develop regenerative medicine

Design molecules to induce iPSC stem cell differentiation

Tissue and muscle regeneration, fibrosis

BIG CHEMICAL DATA

FAST ARTIFICIAL INTELLIGENCE

TOP HITS

250M+ SCREENING MOLECULES

o Integrated public data (PubChem, ChEMBL, etc.)

o Private datasets

o Literature and patents

o In vitro (HTS)

o In vivo (mouse, rats)

o Multi-omics

o Signaling Pathways

o Gene Expression

AI Drug Discovery Platform

200M+ of potential candidates

Selectivity
Off target binding
Toxicity
Metabolic stability
Bioavailability
Solubility
etc.

7

• Good selectivity
• Three novel scaffolds
• Predicted potency 7 – 25 nM
• Good synthetic accessibility
• Good ADME/Tox properties

Large scale prediction of bioactivity with Deep Learning

TGF beta inhibitor (Fibrosis)

Conclusions

• Data availability is the biggest barrier
• Novel architecture for multitask‐QSAR
• Improvement over well converged RF models
• Convenience: 1 vs 320 models
• Training of 1 network is faster than 320 RF models
• Scalability of DL to “Big Data”
• DL benefits from transfer learning
• The more tasks and data, the higher the benefit
• Transferability: KinomeNet ‐> GPCRNet