Robots, Small Molecules & R

Robots, Small Molecules & R: Ingredients for Exploring and Predicting Biological Effects. Rajarshi Guha, September 13, 2014. http://blog.rguha.net/

Transcript of Robots, Small Molecules & R

Page 1: Robots, Small Molecules & R

Robots, Small Molecules & R: Ingredients for Exploring and Predicting Biological Effects

Rajarshi Guha
September 13, 2014

http://blog.rguha.net/

Page 2: Robots, Small Molecules & R

Hunting for Leads

Target Identification → Lead Discovery → Lead Optimization → Clinical Development

HTS
• Assay Optimization – Sensitivity, Scaling
• Primary Screening – Fluorescence, High Content
• Cherry Picking – Select subset to follow up, Diversity
• Confirmation – Counter screen, Explore SAR

Page 3: Robots, Small Molecules & R

High Throughput Screening

• Test thousands to hundreds of thousands of compounds in one or more assays
• Employs a robotic platform
• Rapidly identify novel modulators of biological systems
  – Infectious agents
  – Cellular basis of diseases

Page 4: Robots, Small Molecules & R

Robots for Screening

Page 5: Robots, Small Molecules & R

Robots for Screening

Page 6: Robots, Small Molecules & R

HTS Workflow

• Rapidly screen large compound collections
• Efficiently identify real actives
  – Test them in slower, accurate, expensive screens
• Use the data to learn what types of compounds tend to be active
• Use the model to suggest more compounds to screen

[Figure: number of molecules at each stage, narrowing from 300K to 1000 to 300; stage labels include HTS and Cherry Picks]

Page 7: Robots, Small Molecules & R

Data Science Problems

• Predictive models for highly imbalanced datasets
• Global versus local models?
• Feature selection – data driven? Domain driven?
• Clustering & enrichment
• Similarity – definition, computation, performance
• Integration – chemical structures, numerical data, text (papers, patents), images

Page 8: Robots, Small Molecules & R

The Roles of R

• Data Access – ROracle, RMySQL, RPostgreSQL, rpubchem, chemblr
• Chemistry – rcdk, ChemmineR, fingerprint
• HTS QC – displayHTS, spdep
• Imaging – EBImage, rflowcyt, ripa, raster
• Visualization – grid, ggplot, Shiny, ggvis, igraph
• Data Analysis – drc, igraph, randomForest, svm, ...

Also see the ChemPhys CRAN Task View

Page 9: Robots, Small Molecules & R

HTS Data Types – Single Point

[Figure: single-point measurement; x-axis: Concentration (9.50 to 10.50), y-axis: Response (0 to 100)]

Page 10: Robots, Small Molecules & R

HTS Data Types – Dose Response

[Figure: dose-response curve; x-axis: log10 Concentration (ticks at 0.01 and 1.00), y-axis: Response (30 to 120)]

$$y = S_0 + \frac{S_{\inf} - S_0}{1 + 10^{(\log AC_{50} - x) H}}$$
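The curve on this slide can be evaluated directly. The following is a minimal base R sketch of the equation as written, not a fitting routine; the function name `hill` and all parameter values are assumptions chosen for illustration.

```r
# Four-parameter Hill curve, following the slide's notation.
# x is the log10 concentration, S0 and Sinf are the two asymptotes,
# logAC50 is the log10 concentration at half-maximal response,
# and H is the Hill slope.
hill <- function(x, S0, Sinf, logAC50, H) {
  S0 + (Sinf - S0) / (1 + 10^((logAC50 - x) * H))
}

# Hypothetical parameter values, for illustration only
x <- seq(-2, 0, by = 0.5)
y <- hill(x, S0 = 0, Sinf = 100, logAC50 = -1, H = 1)

# At x = logAC50 the response sits halfway between S0 and Sinf
mid <- hill(-1, S0 = 0, Sinf = 100, logAC50 = -1, H = 1)
```

For real data the parameters would be estimated rather than fixed, for example with a dose-response package such as drc, which appears in the package overview earlier in the deck.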

Page 11: Robots, Small Molecules & R

HTS Data Types – Multiple Readouts

(and have this at multiple doses!)

Page 12: Robots, Small Molecules & R

HTS Data Types – Combinations

+  

Page 13: Robots, Small Molecules & R

Independent Variable(s)

Activity = f( [figure: chemical structure] )

Page 14: Robots, Small Molecules & R

Features, Features, Features

• How do we "quantify" a chemical structure?

Page 15: Robots, Small Molecules & R

Features, Features, Features

Charges, Dipole moments, Surface properties, Topological invariants

1 0 1 1 0 0 0 1 0

Page 16: Robots, Small Molecules & R

Working with Molecules in R

• A number of OSS libraries are available
• ChemmineR and rcdk are the main packages that allow you to manipulate molecules in R
• They use rJava to interface with JOELib and CDK, respectively

Page 17: Robots, Small Molecules & R

rcdk

• Idiomatic R interface to the CDK library
  – I/O support for chemical file formats
  – Manipulation of atoms, bonds, molecules
  – Generate molecular descriptors, fingerprints

library(rcdk)
mol <- parse.smiles('CCCC')[[1]]
mols <- load.molecules('http://www.rguha.net/mipe100.smi')

Page 18: Robots, Small Molecules & R

rcdk

• rcdk works with references to Java objects
  – Can't save them in a workspace (trivially)

> mol [1] "Java-Object{AtomContainer(2040919865, #A:4, Atom(2131361171, S:C, H:3, AtomType(2131361171, FC:0, Isotope(2131361171, Element(2131361171, S:C, AN:6)))), Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, Element(1759969037, S:C, AN:6)))), Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, Isotope(359851081, Element(359851081, S:C, AN:6)))), Atom(703168415, S:C, H:3, AtomType(703168415, FC:0, Isotope(703168415, Element(703168415, S:C, AN:6)))), #B:3, Bond(549041464, #O:SINGLE, #S:NONE, #A:2, Atom(2131361171, S:C, H:3, AtomType(2131361171, FC:0, Isotope(2131361171, Element(2131361171, S:C, AN:6)))), Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, Element(1759969037, S:C, AN:6)))), ElectronContainer(549041464EC:2)), Bond(2654289, #O:SINGLE, #S:NONE, #A:2, Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, Element(1759969037, S:C, AN:6)))), Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, Isotope(359851081, Element(359851081, S:C, AN:6)))), ElectronContainer(2654289EC:2)), Bond(1660962283, #O:SINGLE, #S:NONE, #A:2, Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, Isotope(359851081, Element(359851081, S:C, AN:6)))), Atom(703168415, S:C, H:3, AtomType(703168415, FC:0, Isotope(703168415, Element(703168415, S:C, AN:6)))), ElectronContainer(1660962283EC:2)))}" >

Page 19: Robots, Small Molecules & R

Calculating Molecular Features

• Evaluate a matrix of numerical features
• End up with a rectangular data.frame

mols <- load.molecules("mipe100.smi")
dnames <- get.desc.names('topological')
descs <- eval.desc(mols, dnames)

> str(descs)
'data.frame': 99 obs. of 195 variables:
 $ nRings7 : num 1 0 1 0 0 0 0 0 0 0 ...
 $ nRings8 : num 0 0 0 0 0 0 0 0 0 0 ...
 $ nRings9 : num 0 0 0 0 0 0 0 0 0 0 ...
 $ tpsaEfficiency : num 0.1856 0.2035 0.0118 0.0602 ...

Page 20: Robots, Small Molecules & R

Calculating Fingerprints

• Binary string representation of molecular structure
  – Objectively defined, fast to calculate
  – Good for searching, clustering, prediction
• The fingerprint package is used to represent them as S4 objects

library(fingerprint)
fps <- lapply(mols, get.fingerprint)
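Binary fingerprints like these are usually compared with a set-overlap measure. The sketch below computes the Tanimoto coefficient in base R on plain vectors of "on" bit positions; the function name `tanimoto` and the bit positions are assumptions made for illustration, and the code does not operate on the S4 fingerprint objects themselves.

```r
# Tanimoto similarity between two fingerprints, each given as a
# vector of the bit positions that are set to 1
tanimoto <- function(a, b) {
  common <- length(intersect(a, b))  # bits set in both
  total <- length(union(a, b))       # bits set in either
  if (total == 0) return(NA_real_)   # undefined when no bits are set at all
  common / total
}

# Hypothetical bit positions, for illustration only
fp1 <- c(15, 18, 45, 73, 77)
fp2 <- c(15, 18, 45, 96, 107)
sim <- tanimoto(fp1, fp2)   # 3 shared bits out of 7 distinct bits
self <- tanimoto(fp1, fp1)  # a fingerprint is identical to itself
```

In practice the fingerprint package's own routines, such as fp.sim.matrix, would be the natural choice; the version here only makes the arithmetic explicit.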

Page 21: Robots, Small Molecules & R

Calculating Fingerprints

• Methods to compute similarities, generate summaries & manipulate fingerprints

> fps[[1]]
Fingerprint object
 name =
 length = 1024
 folded = FALSE
 source = CDK
 bits on = 15 18 45 73 77 78 79 85 87 96 107 109 129 139 149 159 162 166 172 179 194 209 214 223 225 227 239 254 266 272 301 312 327 335 350 354 359 392 393 395 397 415 435 455 486 491 492 499 534 535 541 543 544 545 546 559 575 600 605 618 621 622 626 635 638 644 645 647 690 723 728 742 743 753 754 800 819 831 832 889 893 913 922 930 936 954 985 988 1005 1008 1016
>

Page 22: Robots, Small Molecules & R

Use Case - SAR

• Cluster molecules by structure and examine whether clusters are enriched in activity

library(chemblr); library(rcdk)
## fetch the activities for an assay, then the compounds tested in it
d <- get.activity(chembl.id='CHEMBL857155', type='assay')
cmpds <- lapply(d$ingredient_cmpd_chemblid, get.compound, type='chemblid')
cmpds <- do.call(rbind, lapply(cmpds, function(x)
  data.frame(x$chemblId, x$smiles, stringsAsFactors=FALSE)))
## fingerprints, pairwise similarity matrix, hierarchical clustering
mols <- parse.smiles(cmpds$x.smiles)
fps <- lapply(mols, get.fingerprint)
sm <- fp.sim.matrix(fps)
rownames(sm) <- cmpds$x.chemblId
dm <- as.dist(1-sm)
clus <- hclust(dm)
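The code on this slide stops at the clustering step. One possible way to examine clusters for enrichment is sketched below in base R. It uses a toy similarity matrix in place of `sm` and made-up activity labels, and the choice of a one-sided Fisher exact test is an assumption rather than the method used in the talk.

```r
# Toy stand-in for the similarity matrix 'sm': two tight groups of four
sm <- matrix(0.1, nrow = 8, ncol = 8)
sm[1:4, 1:4] <- 0.9
sm[5:8, 5:8] <- 0.9
diag(sm) <- 1
rownames(sm) <- colnames(sm) <- paste0("cmpd", 1:8)

clus <- hclust(as.dist(1 - sm))
groups <- cutree(clus, k = 2)  # cut the tree into two clusters

# Hypothetical activity calls (TRUE = active), not real assay data
active <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

# For each cluster, test whether actives are over-represented
cluster_ids <- sort(unique(groups))
enrich <- sapply(cluster_ids, function(g) {
  tab <- table(factor(groups == g, levels = c(TRUE, FALSE)),
               factor(active, levels = c(TRUE, FALSE)))
  fisher.test(tab, alternative = "greater")$p.value
})
names(enrich) <- paste0("cluster", cluster_ids)
```

With real data the number of clusters and the activity threshold are both judgment calls, and the p-values would need correcting for the number of clusters tested.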

Page 23: Robots, Small Molecules & R

Use Case - SAR

[Figure: dendrogram of the clustered compounds, with leaves labeled by ChEMBL ID (CHEMBL331502, CHEMBL328164, CHEMBL52551, ...)]

Page 24: Robots, Small Molecules & R

Use Case - Bit Spectrum

• Vector summary of the fingerprints for a dataset
• Defined as the fraction of times a bit position is set to 1, for each bit position

[Figure: bit spectrum for a dataset of ~10K molecules; x-axis: Bit Position (ticks 0 to 750), y-axis: Normalized Frequency (0 to 1)]

Fingerprints (one per row) and the resulting bit spectrum (last row):

0    0    1     ...
0    1    0     ...
1    1    1     ...
1    0    1     ...
0.5  0.5  0.75  ...
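The definition can be checked on the four example fingerprints from this slide. This is a base R sketch using a plain 0/1 matrix; the variable names are assumptions, and the fingerprint package offers bit.spectrum for its own fingerprint objects.

```r
# The four example fingerprints from the slide, one per row
fp <- rbind(c(0, 0, 1),
            c(0, 1, 0),
            c(1, 1, 1),
            c(1, 0, 1))

# Bit spectrum: for each bit position, the fraction of fingerprints
# in which that bit is set
bs <- colMeans(fp)
```

Because the spectrum has one entry per bit position, its length does not depend on the number of molecules in the dataset.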

Page 25: Robots, Small Molecules & R

Use Case - Bit Spectrum

• Comparison of two datasets is now O(n)
• Simply take the difference of the two bit spectra

e.g.: Compare ~800 soluble compounds with >30K insoluble compounds

library(fingerprint)
library(ggplot2)
## make two subsets and generate bit spectra
sol.idx <- which(sol$label == 'high')
insol.idx <- which(sol$label != 'high')
sol.bs <- bit.spectrum(fps[sol.idx])
insol.bs <- bit.spectrum(fps[insol.idx])
## display a difference plot
bsdiff <- sol.bs - insol.bs
d <- data.frame(x=1:length(sol.bs), y=bsdiff)
ggplot(d, aes(x=x,y=y))+geom_line()+
  xlab('Bit Position')+ ylab('Normalized Frequency')+ ylim(c(-1,1))

[Figure: difference of the two bit spectra; x-axis: Bit Position (0 to 150), y-axis: Δ Normalized Frequency (-1 to 1)]

Page 26: Robots, Small Molecules & R

PREDICTIVE MODELS - CAVEATS

Page 27: Robots, Small Molecules & R

Building Models is the Easy Part

• Given a descriptor data.frame or fingerprint list we're ready to build models
  – caret, caretEnsemble
• Question is whether the model(s) can generalize
• Applicability is a key consideration when predicting bioactivity
  – Has economic & safety ramifications in regulatory environments

Page 28: Robots, Small Molecules & R

Domain Applicability

• How dissimilar to the training set do you have to be before the prediction is meaningless?
  – Distance to training set? Inside/outside convex hull?
  – Comparison of bit spectra

[Figure: Training Set versus Test Set]
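The "distance to training set" idea can be sketched in base R as follows. The data are random toy descriptors, and the cutoff rule (95th percentile of nearest-neighbour distances within the training set) is an assumption picked for illustration, not a recommendation from the talk.

```r
set.seed(1)  # toy data only
train <- matrix(rnorm(200), ncol = 4)  # 50 training compounds, 4 descriptors
test <- rbind(train[1, ] + 0.01,       # near-duplicate of a training compound
              rep(10, 4))              # point far from all training data

# Euclidean distance from each test compound to its nearest training compound
nn_dist <- apply(test, 1, function(z) {
  min(sqrt(rowSums(sweep(train, 2, z)^2)))
})

# Hypothetical cutoff: 95th percentile of nearest-neighbour distances
# observed within the training set itself
d_train <- as.matrix(dist(train))
diag(d_train) <- Inf
cutoff <- quantile(apply(d_train, 1, min), 0.95)

# TRUE where a test compound lies within the cutoff of the training set
in_domain <- nn_dist <= cutoff
```

Predictions for compounds flagged as out of domain would then be treated with more caution, or not reported at all.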

Page 29: Robots, Small Molecules & R

Global vs Local Models

• Bioassay data is not really big data
• Can big data be too big?
• AID 1996
  – 57K measurements of aqueous solubility
• Do we build one model?
• Or multiple local models?

[Figure: PCA of 166 Binary Features]

Page 30: Robots, Small Molecules & R

RESPONSE SURFACES

Page 31: Robots, Small Molecules & R

Screening Drug Combinations

Translational Interest
• Increased efficacy
• Delay resistance
• Attenuate toxicity

Basic Interest
• Inform signaling pathway connectivity
• Identify synthetic lethality
• Polypharmacology

Page 32: Robots, Small Molecules & R

How to Test Combinations

• Many procedures described in the literature
  – Fixed dose ratio (aka ray)
  – Ray contour
  – Checkerboard
  – Genetic algorithm

[Figure: dose matrix layout with single-agent doses C1 to C5 and D1 to D5 on the margins and combinations such as C1,D1 and C5,D5 in the cells]

Page 33: Robots, Small Molecules & R

How to Test Combinations

• Many procedures described in the literature
  – Fixed dose ratio (aka ray)
  – Ray contour
  – Checkerboard
  – Genetic algorithm

[Figure: panels labeled Vargatef, DCC-2036, PD-166285, GDC-0941, PI-103, GDC-0980, Bardoxolone methyl, AT-7519, SNS-032, NCGC00188382-01, Lestaurtinib, CNF-2024, ISOX, Belinostat, PF-477736, AZD-7762]

Page 34: Robots, Small Molecules & R

Why Similarity?

• Vargatef exhibited anomalous matrix response compared to other VEGFR inhibitors

[Figure: response matrix panels labeled Vargatef, Linifanib, Axitinib, Sorafenib, Vatalanib, Motesanib, Tivozanib, Brivanib, Telatinib, Cabozantinib, Cediranib, BMS-794833, Lenvatinib, OSI-632, Foretinib, Regorafenib]

Page 35: Robots, Small Molecules & R

When are Combinations Similar?

• Differences and their aggregates such as RMSD can lead to degeneracy
• Instead we're interested in the shape of the surface
• How to characterize shape?
  – Parametrized fits
  – Distribution of responses

[Figure: three panels of response distributions, annotated with D and p value]

Page 36: Robots, Small Molecules & R

Similarity via the Syrjala Test

• Syrjala test used to compare population distributions over a spatial grid
  – Invariant to grid orientation
  – Provides an empirical p-value
• Less degenerate than just considering 1D distributions

[Figure: distribution of D; x-axis: D (0.00 to 0.75), y-axis: density (0 to 10)]

Syrjala, S.E., "A Statistical Test for a Difference between the Spatial Distributions of Two Populations", Ecology, 1996, 77(1), 75-80

Page 37: Robots, Small Molecules & R

Clustering Response Surfaces

[Figure: clustering of response surfaces into four clusters: C1 (24), C2 (47), C3 (35), C4 (24); axis from 0.0 to 0.8]

Page 38: Robots, Small Molecules & R

Working in "Combination Space"

• Each cell line is represented as a vector of response matrices
• "Distance" between two cell lines is a function of the distance between component response matrices

$$D(L_1, L_2) = F(\{d_1, d_2, \ldots, d_n\})$$

• F can be min, max, mean, …

[Figure: cell lines L1 and L2 shown as columns of response matrices; each pair of corresponding matrices gives one distance, d1 to d5]
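The aggregation D(L1, L2) = F({d1, ..., dn}) can be sketched in base R as below. The function names, the toy matrices, and in particular the per-matrix distance are assumptions: a plain RMSD is used purely as a placeholder, even though the talk notes that such aggregates can be degenerate and compares surfaces with the Syrjala test instead.

```r
# Placeholder distance between two response matrices (RMSD);
# any matrix-to-matrix distance could be substituted here
mat_dist <- function(a, b) sqrt(mean((a - b)^2))

# Distance between two cell lines: aggregate the distances between
# matched response matrices with a function 'agg' (the slide's F)
cell_line_dist <- function(L1, L2, agg = mean) {
  stopifnot(length(L1) == length(L2))
  d <- mapply(mat_dist, L1, L2)
  agg(d)
}

# Toy data: each cell line is a list of three 2 x 2 response matrices
L1 <- list(matrix(0, 2, 2), matrix(50, 2, 2), matrix(100, 2, 2))
L2 <- list(matrix(10, 2, 2), matrix(50, 2, 2), matrix(70, 2, 2))

d_mean <- cell_line_dist(L1, L2, agg = mean)
d_max <- cell_line_dist(L1, L2, agg = max)
d_min <- cell_line_dist(L1, L2, agg = min)
```

Even on this toy input the three aggregations disagree, which is why the choice of F matters for any downstream clustering of cell lines.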

Page 39: Robots, Small Molecules & R

Many Choices to Make

[Figure: four dendrograms of 20 cell lines (KMS-34, INA-6, L363, OPM-1, XG-2, FR4, AMO-1, XG-6, MOLP-8, ANBL-6, KMS-20, XG-7, OCI-MY1, XG-1, 8226, EJM, U266, KMS-11LB, SKMM-1, MM-MM1), one panel each for sum, max, min, and euc]

Page 40: Robots, Small Molecules & R

NETWORKS

Page 41: Robots, Small Molecules & R

Networks & Integration

• Network models of molecules and targets are common
  – Allows for the incorporation of lots of associated information
  – Diseases, pathways, OTE's
• When linked with clinical data & outcomes, we can generate massive networks
  – Adverse events (FDA AERS)
  – Analysis by Cloudera considered > 10E6 drug-drug-reaction triples

Yildirim, M.A. et al

Page 42: Robots, Small Molecules & R

Networks & Integration

• SAR data can be viewed in a network form
  – SALI, SARI based networks
  – Usually requires pairwise calculations of the metric
• Current studies have focused on small datasets (< 1000 molecules)
• Hadoop + Giraph could let us apply this to HTS-scale datasets

http://sali.rguha.net/
Peltason, L et al

Page 43: Robots, Small Molecules & R

Networks & Integration

• When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
  – Network based similarity
  – Community detection (aka clustering)
  – PageRank style ranking (of targets, compounds, …)
  – Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)

Bauer-Mehren et al

Page 44: Robots, Small Molecules & R

Combinations as Networks

Combination screens lend themselves naturally to network representations

[Figure: two network diagrams, each with a ∆ Bliss legend (−4.3 to 0.0 and −3.4 to 0.0); labels include immune system process, apoptotic process, transcription from RNA polymerase II promoter, protein phosphorylation, cell communication, immune response]

Page 45: Robots, Small Molecules & R

Combinations as Networks

• Things get more interesting when we have n × m screens
• Can be simplified using a variety of methods
  – Neighborhoods
  – Minimum Spanning Tree

[Figure: network diagram]

Page 46: Robots, Small Molecules & R

Comparing Neighborhoods

Combinations that have DBSumNeg < 1st quartile value for that strain

[Figure: three panels, one per strain: 3D7, DD2, HB3]

Page 47: Robots, Small Molecules & R

Identifying the Most Synergistic Pairs

[Figure: network diagram]

Page 48: Robots, Small Molecules & R

Summary

• The HTS workflow presents multiple data science problems involving (unique) data types
• R can play a role at several stages, but model building is straightforward
• Representation is key and guides the types and nature of analyses