Robots, Small Molecules & R

Post on 02-Jul-2015


Robots, Small Molecules & R: Ingredients for Exploring and Predicting Biological Effects

Rajarshi Guha
September 13, 2014
http://blog.rguha.net/

Hunting for Leads

Target Identification → Lead Discovery → Lead Optimization → Clinical Development

HTS stages:
• Assay Optimization
    – Sensitivity
    – Scaling
• Primary Screening
    – Fluorescence
    – High Content
• Cherry Picking
    – Select subset to follow up
    – Diversity
• Confirmation
    – Counter screen
    – Explore SAR

High Throughput Screening

• Test thousands to hundreds of thousands of compounds in one or more assays
• Employs a robotic platform
• Rapidly identify novel modulators of biological systems
    – Infectious agents
    – Cellular basis of diseases

Robots for Screening

HTS Workflow

• Rapidly screen large compound collections
• Efficiently identify real actives
    – Test them in slower, accurate, expensive screens
• Use the data to learn what types of compounds tend to be active
• Use the model to suggest more compounds to screen

[Figure: screening funnel. Axis: Number of Molecules (300K, 1000, 300); labeled stages: HTS, Cherry Picks.]

Data Science Problems

• Predictive models for highly imbalanced datasets
• Global versus local models?
• Feature selection – data driven? Domain driven?
• Clustering & enrichment
• Similarity – definition, computation, performance
• Integration – chemical structures, numerical data, text (papers, patents), images

The Roles of R

Also see the ChemPhys CRAN Task View

• Data Access: ROracle, RMySQL, RPostgreSQL, rpubchem, chemblr
• Chemistry: rcdk, ChemmineR, fingerprint
• HTS QC: displayHTS, spdep
• Imaging: EBImage, rflowcyt, ripa, raster
• Visualization: grid, ggplot, Shiny, ggvis, igraph
• Data Analysis: drc, igraph, randomForest, svm, ...

HTS Data Types – Single Point

[Figure: Response (0 to 100) plotted against Concentration for a single-point measurement.]

HTS Data Types – Dose Response

[Figure: Response (30 to 120) plotted against log10 Concentration (0.01 to 1.00), with a fitted curve.]

$$y = S_0 + \frac{S_{\inf} - S_0}{1 + 10^{(\log AC_{50} - x)H}}$$
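The dose-response equation above can be fitted with base R. The following is a minimal sketch, assuming simulated data (the parameter values, noise level and concentration range are made up for illustration) and using `nls` rather than any particular HTS package:

```r
# Hill curve as written on the slide: x is log10 concentration
hill <- function(x, S0, Sinf, logAC50, H) {
  S0 + (Sinf - S0) / (1 + 10^((logAC50 - x) * H))
}

set.seed(1)
x <- seq(-9, -4, length.out = 11)   # hypothetical log10 concentrations
y <- hill(x, S0 = 0, Sinf = 100, logAC50 = -6.5, H = 1) +
  rnorm(length(x), sd = 2)          # simulated responses with noise

# Nonlinear least squares, started from rough guesses taken from the data
fit <- nls(y ~ hill(x, S0, Sinf, logAC50, H),
           start = list(S0 = min(y), Sinf = max(y),
                        logAC50 = median(x), H = 1))
est <- coef(fit)
print(round(est, 2))
```

On real screening data the starting values matter more than they do here, and curves that never reach a plateau may not converge at all.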

HTS Data Types – Multiple Readouts

(and have this at multiple doses!)

HTS Data Types - Combinations

[Figure: two panels joined by "+".]

Independent Variable(s)

Activity = f( )

Features, Features, Features

• How do we “quantify” a chemical structure?
    – Charges
    – Dipole moments
    – Surface properties
    – Topological invariants
    – Bit strings, e.g. 1 0 1 1 0 0 0 1 0

Working with Molecules in R

• A number of OSS libraries are available
• ChemmineR and rcdk are the main packages that allow you to manipulate molecules in R
• They use rJava to interface with JOELib and CDK, respectively

rcdk

• Idiomatic R interface to the CDK library
    – I/O support for chemical file formats
    – Manipulation of atoms, bonds, molecules
    – Generate molecular descriptors, fingerprints

    library(rcdk)
    # parse a single SMILES string into a molecule object
    mol <- parse.smiles('CCCC')[[1]]
    # load a set of molecules from a SMILES file
    mols <- load.molecules('http://www.rguha.net/mipe100.smi')

rcdk

• rcdk works with references to Java objects
    – Can’t save them in a workspace (trivially)

> mol [1] "Java-Object{AtomContainer(2040919865, #A:4, Atom(2131361171, S:C, H:3, AtomType(2131361171, FC:0, Isotope(2131361171, Element(2131361171, S:C, AN:6)))), Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, Element(1759969037, S:C, AN:6)))), Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, Isotope(359851081, Element(359851081, S:C, AN:6)))), Atom(703168415, S:C, H:3, AtomType(703168415, FC:0, Isotope(703168415, Element(703168415, S:C, AN:6)))), #B:3, Bond(549041464, #O:SINGLE, #S:NONE, #A:2, Atom(2131361171, S:C, H:3, AtomType(2131361171, FC:0, Isotope(2131361171, Element(2131361171, S:C, AN:6)))), Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, Element(1759969037, S:C, AN:6)))), ElectronContainer(549041464EC:2)), Bond(2654289, #O:SINGLE, #S:NONE, #A:2, Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, Element(1759969037, S:C, AN:6)))), Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, Isotope(359851081, Element(359851081, S:C, AN:6)))), ElectronContainer(2654289EC:2)), Bond(1660962283, #O:SINGLE, #S:NONE, #A:2, Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, Isotope(359851081, Element(359851081, S:C, AN:6)))), Atom(703168415, S:C, H:3, AtomType(703168415, FC:0, Isotope(703168415, Element(703168415, S:C, AN:6)))), ElectronContainer(1660962283EC:2)))}" >

Calculating Molecular Features

• Evaluate a matrix of numerical features
• End up with a rectangular data.frame

    library(rcdk)
    mols <- load.molecules("mipe100.smi")
    # names of the topological descriptor classes
    dnames <- get.desc.names('topological')
    # one row per molecule, one column per descriptor value
    descs <- eval.desc(mols, dnames)

    > str(descs)
    'data.frame': 99 obs. of 195 variables:
     $ nRings7        : num 1 0 1 0 0 0 0 0 0 0 ...
     $ nRings8        : num 0 0 0 0 0 0 0 0 0 0 ...
     $ nRings9        : num 0 0 0 0 0 0 0 0 0 0 ...
     $ tpsaEfficiency : num 0.1856 0.2035 0.0118 0.0602 ...

Calculating Fingerprints

• Binary string representation of molecular structure
    – Objectively defined, fast to calculate
    – Good for searching, clustering, prediction
• The fingerprint package is used to represent them as S4 objects

    library(fingerprint)
    # get.fingerprint() comes from rcdk; mols is the molecule list loaded earlier
    fps <- lapply(mols, get.fingerprint)

Calculating Fingerprints

• Methods to compute similarities, generate summaries & manipulate fingerprints

    > fps[[1]]
    Fingerprint object
     name =
     length = 1024
     folded = FALSE
     source = CDK
     bits on = 15 18 45 73 77 78 79 85 87 96 107 109 129 139 149 159 162 166
       172 179 194 209 214 223 225 227 239 254 266 272 301 312 327 335 350 354
       359 392 393 395 397 415 435 455 486 491 492 499 534 535 541 543 544 545
       546 559 575 600 605 618 621 622 626 635 638 644 645 647 690 723 728 742
       743 753 754 800 819 831 832 889 893 913 922 930 936 954 985 988 1005
       1008 1016
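To make "compute similarities" concrete, here is a minimal base R sketch of the Tanimoto coefficient, a common fingerprint similarity. It assumes each fingerprint is reduced to its vector of "bits on" positions, as in the printout above; the two short vectors are made up for illustration and the function is not part of the fingerprint package.

```r
# Tanimoto similarity: shared "on" bits divided by bits on in either fingerprint
tanimoto <- function(a, b) {
  total <- length(union(a, b))
  if (total == 0) return(NA_real_)   # undefined when neither has a bit set
  length(intersect(a, b)) / total
}

fp1 <- c(15, 18, 45, 73, 77)   # hypothetical "bits on" positions
fp2 <- c(15, 18, 45, 80)
sim <- tanimoto(fp1, fp2)      # 3 shared bits out of 6 distinct bits
print(sim)
```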

Use Case - SAR

• Cluster molecules by structure and examine whether clusters are enriched in activity

    library(chemblr); library(rcdk)
    # activity records for one ChEMBL assay
    d <- get.activity(chembl.id='CHEMBL857155', type='assay')
    # fetch each tested compound and keep its ID and SMILES
    cmpds <- lapply(d$ingredient_cmpd_chemblid, get.compound, type='chemblid')
    cmpds <- do.call(rbind, lapply(cmpds, function(x)
      data.frame(x$chemblId, x$smiles, stringsAsFactors=FALSE)))
    # fingerprints and pairwise similarity matrix
    mols <- parse.smiles(cmpds$x.smiles)
    fps <- lapply(mols, get.fingerprint)
    sm <- fp.sim.matrix(fps)
    rownames(sm) <- cmpds$x.chemblId
    # hierarchical clustering on the distance 1 - similarity
    dm <- as.dist(1-sm)
    clus <- hclust(dm)
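The slide stops at `hclust`; the enrichment step is not shown. One way it might be done is a per-cluster Fisher exact test, sketched below. Everything here is an assumption for illustration: the distance matrix is simulated rather than derived from fingerprints, the activity labels are made up, and the choice of three clusters is arbitrary.

```r
set.seed(42)
# Hypothetical stand-in for the structure-based distance matrix `dm`:
# three well-separated groups of 20 "compounds" each
coords <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                matrix(rnorm(40, mean = 5), ncol = 2),
                matrix(rnorm(40, mean = 10), ncol = 2))
dm <- dist(coords)
clus <- hclust(dm)
membership <- cutree(clus, k = 3)   # cut the dendrogram into 3 clusters

# Hypothetical activity labels: the first group is mostly active
active <- c(rbinom(20, 1, 0.8), rbinom(40, 1, 0.1)) == 1

# One-sided Fisher exact test per cluster: more actives than expected?
cluster.ids <- sort(unique(membership))
enrichment <- sapply(cluster.ids, function(k) {
  in.cluster <- membership == k
  tab <- table(factor(in.cluster, levels = c(TRUE, FALSE)),
               factor(active, levels = c(TRUE, FALSE)))
  fisher.test(tab, alternative = "greater")$p.value
})
names(enrichment) <- paste0("cluster", cluster.ids)
print(enrichment)
```

With many clusters the p-values would need a multiple-testing correction, for example `p.adjust`.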

Use Case - SAR

[Figure: dendrogram from the hierarchical clustering of the assay compounds; leaves are labeled with ChEMBL IDs (CHEMBL331502, CHEMBL328164, CHEMBL52551, ...).]

Use Case - Bit Spectrum

• Vector summary of the fingerprints for a dataset
• Defined as the fraction of times a bit position is set to 1, for each bit position

[Figure: bit spectrum. x axis: Bit Position (0 to 750 and beyond); y axis: Normalized Frequency (0 to 1).]

Example (fingerprints as rows, ~ 10K molecules in practice):

    0 0 1 ...
    0 1 0 ...
    1 1 1 ...
    1 0 1 ...

Bit spectrum:

    0.5 0.5 0.75 ...
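The worked example above is just a column mean over the fingerprint matrix. A minimal base R sketch, assuming the fingerprints are stored as rows of a 0/1 matrix (the fingerprint package's own `bit.spectrum` works on fingerprint objects instead):

```r
# The four 3-bit fingerprints from the example, one per row
fps.mat <- rbind(c(0, 0, 1),
                 c(0, 1, 0),
                 c(1, 1, 1),
                 c(1, 0, 1))

# Fraction of fingerprints with each bit set
bs <- colMeans(fps.mat)
print(bs)   # 0.50 0.50 0.75
```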

Use Case - Bit Spectrum

• Comparison of two datasets is now O(n)
• Simply take the difference of the two bit spectra

e.g.: Compare ~ 800 solubles with > 30k insolubles

    library(fingerprint)
    library(ggplot2)
    ## make two subsets and generate bit spectra
    sol.idx <- which(sol$label == 'high')
    insol.idx <- which(sol$label != 'high')
    sol.bs <- bit.spectrum(fps[sol.idx])
    insol.bs <- bit.spectrum(fps[insol.idx])
    ## display a difference plot
    bsdiff <- sol.bs - insol.bs
    d <- data.frame(x=seq_along(sol.bs), y=bsdiff)
    ggplot(d, aes(x=x, y=y)) + geom_line() +
      xlab('Bit Position') + ylab('Normalized Frequency') +
      ylim(c(-1,1))

[Figure: difference bit spectrum. x axis: Bit Position (0 to 150 and beyond); y axis: Δ Normalized Frequency (-1 to 1).]

PREDICTIVE MODELS - CAVEATS

Building Models is the Easy Part

• Given a descriptor data.frame or fingerprint list we’re ready to build models
    – caret, caretEnsemble
• Question is whether the model(s) can generalize
• Applicability is a key consideration when predicting bioactivity
    – Has economic & safety ramifications in regulatory environments

Domain Applicability

• How dissimilar to the training set do you have to be before the prediction is meaningless?
    – Distance to training set? Inside/outside convex hull
    – Comparison of bit spectra

[Figure: two panels, Training Set and Test Set.]
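As an illustration of the "distance to training set" idea, the sketch below computes the nearest-neighbor Tanimoto distance from a query fingerprint to a training set. It is only one possible definition of the domain, and all data are simulated: the 64-bit fingerprints, the prototype-plus-noise training set, and the leave-one-out threshold are assumptions made for the example.

```r
# Tanimoto distance between two 0/1 vectors
tanimoto.dist <- function(a, b) {
  either <- sum(a | b)
  if (either == 0) return(0)
  1 - sum(a & b) / either
}

# Distance from a query to its nearest training fingerprint
nn.dist <- function(query, train) {
  min(apply(train, 1, tanimoto.dist, b = query))
}

set.seed(7)
nbits <- 64
prototype <- rbinom(nbits, 1, 0.3)   # hypothetical chemotype
flip <- function(fp, k) {            # flip k randomly chosen bits
  idx <- sample(length(fp), k)
  fp[idx] <- 1 - fp[idx]
  fp
}
train <- t(replicate(50, flip(prototype, 3)))   # 50 close analogs, one per row

# Leave-one-out nearest-neighbor distances suggest a threshold
loo <- sapply(seq_len(nrow(train)), function(i) {
  nn.dist(train[i, ], train[-i, , drop = FALSE])
})
threshold <- max(loo)

d.inside  <- nn.dist(prototype, train)       # resembles the training set
d.outside <- nn.dist(1 - prototype, train)   # complement: very dissimilar
print(c(inside = d.inside, outside = d.outside, threshold = threshold))
```

A query whose distance exceeds the threshold would be flagged as outside the domain; where to put the threshold is a judgment call, not something the distance itself decides.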

Global vs Local Models

• Bioassay data is not really big data
• Can big data be too big?
• AID 1996
    – 57K measurements of aqueous solubility
• Do we build one model?
• Or multiple local models?

[Figure: PCA of 166 Binary Features.]

RESPONSE SURFACES

Screening Drug Combinations

Translational Interest
• Increased efficacy
• Delay resistance
• Attenuate toxicity

Basic Interest
• Inform signaling pathway connectivity
• Identify synthetic lethality
• Polypharmacology

How to Test Combinations

• Many procedures described in the literature
    – Fixed dose ratio (aka ray)
    – Ray contour
    – Checkerboard
    – Genetic algorithm

[Figure: 6 x 6 dose matrix. Rows: compound C at doses C5 to C1, then 0; columns: compound D at doses D5 to D1, then 0. Interior cells are (Ci, Dj) combinations; the zero row and zero column hold the single agents.]

Why Similarity?

• Vargatef exhibited anomalous matrix response compared to other VEGFR inhibitors

[Figure: response matrices, one panel per compound: Vargatef, DCC-2036, PD-166285, GDC-0941, PI-103, GDC-0980, Bardoxolone methyl, AT-7519, SNS-032, NCGC00188382-01, Lestaurtinib, CNF-2024, ISOX, Belinostat, PF-477736, AZD-7762.]

[Figure: Vargatef alongside other compounds: Linifanib, Axitinib, Sorafenib, Vatalanib, Motesanib, Tivozanib, Brivanib, Telatinib, Cabozantinib, Cediranib, BMS-794833, Lenvatinib, OSI-632, Foretinib, Regorafenib.]

When are Combinations Similar?

• Differences and their aggregates such as RMSD can lead to degeneracy
• Instead we’re interested in the shape of the surface
• How to characterize shape?
    – Parametrized fits
    – Distribution of responses

[Figure: three response distributions (density versus response, 0 to 100), each annotated with D and p value, and a density plot of D (0.00 to 0.75).]
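The "distribution of responses" comparison can be sketched with a two-sample Kolmogorov-Smirnov test. This assumes the D and p value in the figure refer to that test, which the slides do not state; the 6 x 6 response matrices are simulated.

```r
set.seed(3)
# Hypothetical 6 x 6 response matrices (percent viability)
m1 <- matrix(runif(36, 60, 100), nrow = 6)   # weak response
m2 <- matrix(runif(36, 60, 100), nrow = 6)   # similar distribution to m1
m3 <- matrix(runif(36, 0, 40), nrow = 6)     # strong response

# Compare the 1D distributions of responses, ignoring cell positions
ks12 <- ks.test(as.vector(m1), as.vector(m2))
ks13 <- ks.test(as.vector(m1), as.vector(m3))
d12 <- unname(ks12$statistic)
d13 <- unname(ks13$statistic)

# Degeneracy: shuffling the cells changes the surface but not the distribution
m1.shuffled <- matrix(sample(m1), nrow = 6)
d.shuffled <- unname(suppressWarnings(   # ties trigger a warning only
  ks.test(as.vector(m1), as.vector(m1.shuffled))$statistic))

print(c(D.similar = d12, D.different = d13, D.shuffled = d.shuffled))
```

The shuffled matrix scores D = 0 against the original even though the surfaces differ, which is the limitation of a purely one-dimensional comparison.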

Similarity via the Syrjala Test

• Syrjala test used to compare population distributions over a spatial grid
    – Invariant to grid orientation
    – Provides an empirical p-value
• Less degenerate than just considering 1D distributions

Syrjala, S.E., “A Statistical Test for a Difference between the Spatial Distributions of Two Populations”, Ecology, 1996, 77(1), 75-80
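A simplified base R sketch of a Syrjala-style test on two response matrices follows. It is an illustration under assumptions, not the implementation used for the slides: the surfaces are made up, the statistic is the squared difference of cumulative distributions averaged over the four grid corners, and each permutation swaps the two values at randomly chosen cells and then renormalizes, which may differ in detail from the published procedure.

```r
# Cumulative sums from the top-left corner: [i, j] sums rows <= i, cols <= j
cum.grid <- function(m) {
  t(apply(apply(m, 2, cumsum), 1, cumsum))
}

# Squared difference between the two normalized cumulative distributions
psi <- function(a, b) {
  a <- a / sum(a)
  b <- b / sum(b)
  sum((cum.grid(a) - cum.grid(b))^2)
}

# Average over the four corners by flipping rows and/or columns
syrjala.stat <- function(a, b) {
  flips <- list(function(m) m,
                function(m) m[nrow(m):1, , drop = FALSE],
                function(m) m[, ncol(m):1, drop = FALSE],
                function(m) m[nrow(m):1, ncol(m):1, drop = FALSE])
  mean(sapply(flips, function(f) psi(f(a), f(b))))
}

# Permutation test: randomly swap the pair of values at each grid cell
syrjala.test <- function(a, b, nperm = 199) {
  a <- a / sum(a)
  b <- b / sum(b)
  observed <- syrjala.stat(a, b)
  permuted <- replicate(nperm, {
    swap <- matrix(runif(length(a)) < 0.5, nrow = nrow(a))
    syrjala.stat(ifelse(swap, b, a), ifelse(swap, a, b))
  })
  list(statistic = observed,
       p.value = (sum(permuted >= observed) + 1) / (nperm + 1))
}

set.seed(11)
surf.a <- outer(1:6, 1:6)                             # mass in one corner
surf.b <- surf.a + matrix(runif(36, 0, 0.5), nrow = 6) # nearly the same shape
surf.flipped <- surf.a[6:1, 6:1]                       # mass in opposite corner

res.same <- syrjala.test(surf.a, surf.b)
res.diff <- syrjala.test(surf.a, surf.flipped)
print(c(same = res.same$statistic, different = res.diff$statistic))
print(c(same = res.same$p.value, different = res.diff$p.value))
```

Note that `surf.a` and `surf.flipped` contain exactly the same set of values, so a 1D distribution comparison could not tell them apart, while the spatial statistic can.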

Clustering Response Surfaces

[Figure: dendrogram of response surfaces (height 0.0 to 0.8) with four clusters: C1 (24), C2 (47), C3 (35), C4 (24).]

Working in “Combination Space”

• Each cell line is represented as a vector of response matrices
• “Distance” between two cell lines is a function of the distance between component response matrices

$$D(L_1, L_2) = F(\{d_1, d_2, \ldots, d_n\})$$

• F can be min, max, mean, …

[Figure: cell lines L1 and L2 shown as columns of response matrices; each matched pair of matrices gives one distance, d1 to d5.]
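A minimal sketch of the aggregate distance, assuming each cell line is a list of response matrices in the same combination order. The component distance used here is RMSD purely as a placeholder (a shape-aware measure could be swapped in), `agg` plays the role of F, and the tiny matrices are made up.

```r
# Placeholder component distance between two response matrices
matrix.dist <- function(m1, m2) sqrt(mean((m1 - m2)^2))

# D(L1, L2) = F({d1, ..., dn}); `agg` is the aggregation function F
cell.line.dist <- function(L1, L2, agg = mean) {
  d <- mapply(matrix.dist, L1, L2)   # one distance per matched combination
  agg(d)
}

# Two hypothetical cell lines, three combinations each
L1 <- list(matrix(0, 2, 2), matrix(1, 2, 2), matrix(2, 2, 2))
L2 <- list(matrix(0, 2, 2), matrix(3, 2, 2), matrix(3, 2, 2))
# component distances are 0, 2 and 1

d.mean <- cell.line.dist(L1, L2, agg = mean)
d.max  <- cell.line.dist(L1, L2, agg = max)
d.min  <- cell.line.dist(L1, L2, agg = min)
print(c(mean = d.mean, max = d.max, min = d.min))
```

The three aggregates already disagree on this toy input, which is the point of the choice of F: it changes which cell lines look alike.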

Many Choices to Make

[Figure: four dendrograms of the same 20 cell lines, one per aggregation function: sum, max, min, euc. The leaf order differs between panels. Cell lines: KMS-34, INA-6, L363, OPM-1, XG-2, FR4, AMO-1, XG-6, MOLP-8, ANBL-6, KMS-20, XG-7, OCI-MY1, XG-1, 8226, EJM, U266, KMS-11LB, SKMM-1, MM-MM1.]

NETWORKS

Networks & Integration

• Network models of molecules and targets are common
    – Allows for the incorporation of lots of associated information
    – Diseases, pathways, OTE’s
• When linked with clinical data & outcomes, we can generate massive networks
    – Adverse events (FDA AERS)
    – Analysis by Cloudera considered > 10E6 drug-drug-reaction triples

Yildirim, M.A. et al

Networks & Integration

• SAR data can be viewed in a network form
    – SALI, SARI based networks
    – Usually requires pairwise calculations of the metric
• Current studies have focused on small datasets (< 1000 molecules)
• Hadoop + Giraph could let us apply this to HTS-scale datasets

http://sali.rguha.net/
Peltason, L et al

Networks & Integration

• When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
    – Network based similarity
    – Community detection (aka clustering)
    – PageRank style ranking (of targets, compounds, …)
    – Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)

Bauer-Mehren et al

Combinations as Networks

Combination screens lend themselves naturally to network representations

[Figure: two network diagrams of combinations, each with a Δ Bliss color legend (-4.3 to 0.0 and -3.4 to 0.0). Annotated node groups: immune system process, apoptotic process, transcription from RNA polymerase II promoter, protein phosphorylation, cell communication, immune response.]

Combinations as Networks

• Things get more interesting when we have n × m screens
• Can be simplified using a variety of methods
    – Neighborhoods
    – Minimum Spanning Tree

[Figure: network diagram of an n × m combination screen.]
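To show what the minimum spanning tree simplification does, here is a small base R sketch using Prim's algorithm. The four-node weight matrix is made up; in a combination network the weights might be derived from a synergy measure, but that mapping is an assumption and is not specified in the slides.

```r
# Prim's algorithm on a symmetric weight matrix; returns the tree's edges
mst.prim <- function(w) {
  n <- nrow(w)
  in.tree <- c(TRUE, rep(FALSE, n - 1))   # start from node 1
  edges <- data.frame(from = integer(0), to = integer(0), weight = numeric(0))
  while (!all(in.tree)) {
    # only edges leaving the current tree are candidates
    cand <- w[in.tree, !in.tree, drop = FALSE]
    idx <- which(cand == min(cand), arr.ind = TRUE)[1, ]
    from <- which(in.tree)[idx[1]]
    to <- which(!in.tree)[idx[2]]
    edges[nrow(edges) + 1, ] <- list(from, to, w[from, to])
    in.tree[to] <- TRUE
  }
  edges
}

# Hypothetical pairwise weights between four compounds (lower = closer)
w <- matrix(c(0, 1, 4, 3,
              1, 0, 2, 5,
              4, 2, 0, 6,
              3, 5, 6, 0), nrow = 4, byrow = TRUE)
tree <- mst.prim(w)
print(tree)
total <- sum(tree$weight)
```

The full network has six edges; the tree keeps the three cheapest edges that still connect every compound.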

Comparing Neighborhoods

Combinations that have DBSumNeg < 1st quartile value for that strain

[Figure: three network panels, one per strain: 3D7, DD2, HB3.]

Identifying the Most Synergistic Pairs

[Figure: network diagram highlighting the most synergistic pairs.]

Summary

• The HTS workflow presents multiple data science problems involving (unique) data types
• R can play a role at several stages, but model building is straightforward
• Representation is key and guides the types and nature of analyses