Open Source Bioinformatics for Data Scientists
Amanda Schierz
Recent Projects
• Druggability prediction
  - 3D structure
  - Protein sequence
  - Predict a protein’s druggability based on its position in the protein-protein interaction network
• Drug Resistance
• Therapeutic opportunities
  - Identification of new gene targets for cancer
  - Are they druggable?
• Candidate Compounds
  - Compounds more likely to be a hit for a bioassay
Drug Discovery Process
Early-stage: Discovery → Optimisation → ADMET → Clinical Trials → Paperwork
• Discovery: Target Evaluation; Compound Screening
• Optimisation: Computational Chemistry; Structure-based Drug Design
• ADMET: Absorption, Distribution, Metabolism, Excretion, Toxicity
• Clinical Trials: Patient Stratification; Protocol
• Paperwork: Drug Approval
Biology 101
• There is a many-to-many relationship between Gene and Protein
• A Protein is a large molecule; a Drug is a small molecule
• Gene Expression data
  - The amount of a gene's product (e.g. mRNA) produced. Epigenetics.
  - highly / lowly / over / under – fold change
  - Warning: Platforms and preprocessing
• Gene Copy Number
  - Loss / Gain of a gene
  - On one strand or 2?
• There are only approx. 400 genetic targets of approved pharmaceuticals
  - Only from a handful of Protein Families
  - Desperate need for diversity
• TCGGTCAGGCTAGCCGTTACAGGG
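The "over / under – fold change" bullet can be made concrete with a small sketch. It assumes expression values for the same gene in a tumour and a normal sample, already normalised on the same platform. The function names, the pseudocount and the threshold of 1.0 (a two-fold change) are illustrative choices, not values from the talk.

```python
import math


def log2_fold_change(tumour, normal, pseudocount=1.0):
    """Log2 ratio of tumour vs normal expression.

    The pseudocount (an assumption) keeps the ratio defined when a value is 0.
    """
    return math.log2((tumour + pseudocount) / (normal + pseudocount))


def classify_expression(lfc, threshold=1.0):
    """Label a gene as over-, under- or unchanged-expressed.

    threshold=1.0 means at least a two-fold change; this cut-off is hypothetical.
    """
    if lfc >= threshold:
        return "over"
    if lfc <= -threshold:
        return "under"
    return "unchanged"


if __name__ == "__main__":
    lfc = log2_fold_change(tumour=30.0, normal=6.0)
    print(round(lfc, 3), classify_expression(lfc))
```

Values from different platforms or preprocessing pipelines should not be compared this way without harmonising them first, which is the point of the warning bullet.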
Target Identification
• Prediction of disease-associated genes
  - patient level
  - gene / protein level
  - network
• Prediction of mechanisms of disease
  - Epigenetic targets – meta-targets
• Prediction of protein function – from sequence / structure / network
  - multi-class; multi-label
• Prediction of 3D structure
• Prediction of protein binding
  - New immune targets
Druggability Prediction
• Drugs – FDA Approved ~350. Very strict – known therapeutic benefit
• DrugBank – loose – binds but no therapeutic benefit
• Tractable or Druggable
  - Rule of 5 compliant
• Precedence-based
  - Druggable families / Homology
  - Ligand-based scoring
  - UniProt, bioassays – EBI and PubChem BioAssay
  - Statistical analysis
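As a rough illustration of "Rule of 5 compliant", the sketch below checks Lipinski's four criteria on descriptors that are assumed to be already computed (for example with a cheminformatics toolkit). The function names and the convention of tolerating one violation are assumptions for this example; the talk does not say which variant was used.

```python
def rule_of_five_violations(mol_weight, logp, h_donors, h_acceptors):
    """Return the list of Lipinski criteria that a compound violates."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if logp > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    return violations


def is_rule_of_five_compliant(mol_weight, logp, h_donors, h_acceptors,
                              max_violations=1):
    """Treat a compound as compliant if it breaks at most max_violations rules.

    Allowing one violation is a common convention, used here as an assumption.
    """
    found = rule_of_five_violations(mol_weight, logp, h_donors, h_acceptors)
    return len(found) <= max_violations


if __name__ == "__main__":
    # Hypothetical descriptor values, not taken from any real compound.
    print(is_rule_of_five_compliant(350.0, 2.1, 2, 5))
    print(rule_of_five_violations(720.0, 6.3, 7, 12))
```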
Druggability Prediction
• Sequence Analysis
  - Amino acid motifs and composition
  - Physicochemical descriptors
  - Virtually unlimited number of descriptors – very wide data set
  - Supervised classification
• FASTA - can download all human sequences from UniProt
  >seq0
  FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTD
• R protr; R Bioconductor
• species,mhc,peptide_length,A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V,scl1.lag1,scl2.lag1,scl1.lag2,scl2.lag2,scl1.2.lag1,scl2.1.lag1,scl1.2.lag2,scl2.1.lag2,AA,RA,NA,DA,CA,EA,QA,GA,HA,IA,LA,KA,MA,FA,PA,SA,TA,WA,YA,VA,AR,RR,NR,DR,CR ..... ,Schneider.Xr.K,Schneider.Xr.M,Schneider.Xr.F, Grantham.Xr.A,Grantham.Xr.R,
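The simplest descriptors in the feature header above, the 20 single amino acid columns (A, R, N, ...) and the 400 dipeptide columns (AA, RA, NA, ...), can be sketched in plain Python. This is a minimal illustration, not the R code used in the talk. The helper names are hypothetical, the column order is copied from the header, and the coupling, Schneider and Grantham descriptors are not covered.

```python
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"  # same order as the feature header


def read_fasta(text):
    """Parse FASTA text into {sequence_id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    if not seq:
        raise ValueError("empty sequence")
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each of the 400 adjacent residue pairs (AA, RA, NA, ...)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        raise ValueError("sequence needs at least two residues")
    # The first residue varies fastest, matching the AA, RA, NA, ... header order.
    return {a + b: pairs.count(a + b) / len(pairs)
            for b in AMINO_ACIDS for a in AMINO_ACIDS}


if __name__ == "__main__":
    fasta = ">seq0\nFQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTD\n"
    seq = read_fasta(fasta)["seq0"]
    print(len(seq), amino_acid_composition(seq)["K"])
```

Even this small subset yields 420 columns per protein, which is why the resulting data set is described as very wide.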
Druggability Prediction
• 3D structure
  - Pockets, surface area
  - Ligand interaction fingerprints
  - Supervised classification
3D Structure
• PDB, ProtDCal, PockDrug
Druggability Prediction
• Interaction Network
• Many use cases
• Data from EBI and Y2H (yeast two-hybrid)
• List of binary interactions
  - Be careful 1: Data is inherently biased
  - Be careful 2: Complex interactions
• R igraph; Gephi for visualisation
  - Topological properties
  - Community analysis
  - Subgraph analysis
  - Statistical analysis, network analysis and supervised classification
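To show what "topological properties" from a list of binary interactions might look like as features, here is a dependency-free sketch that computes each protein's degree and local clustering coefficient. The talk used R igraph; this Python version, its function names and the protein identifiers are illustrative assumptions only.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(interactions):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def degree(adjacency):
    """Number of interaction partners per protein."""
    return {node: len(partners) for node, partners in adjacency.items()}


def local_clustering(adjacency):
    """Fraction of a protein's partner pairs that also interact with each other."""
    result = {}
    for node, partners in adjacency.items():
        k = len(partners)
        if k < 2:
            result[node] = 0.0
            continue
        links = sum(1 for u, v in combinations(partners, 2) if v in adjacency[u])
        result[node] = 2 * links / (k * (k - 1))
    return result


if __name__ == "__main__":
    # Hypothetical binary interactions.
    edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P3", "P4")]
    graph = build_graph(edges)
    print(degree(graph))
    print(local_clustering(graph))
```

Because well-studied proteins have more recorded interactions, a high degree may reflect research attention as much as biology, which is the bias the first "be careful" bullet warns about.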
Drug Resistance
Compound Bioactivity
• Brute force mass screening
• 1000s of compounds screened in batches
• Primary Assays; Secondary / confirmatory assays
• Can be binary classification or regression
  - The IC50 is the concentration of a compound needed to inhibit a biological process by 50%; a lower IC50 means a more potent compound
  - Active / inactive: IC50 threshold
• Goal is also to identify diverse compound structures
  - Scaffold Hopping
• Same kind of method as Protein Sequence conversion
  - Pharmacophore fingerprints
• https://www.chemaxon.com/free-software/
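The two framings mentioned above, regression on potency and binary active / inactive classification by an IC50 threshold, can be sketched as follows. It assumes IC50 values reported in nanomolar. The function names and the 1000 nM (1 µM) cut-off are hypothetical; in practice the threshold depends on the assay.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar).

    pIC50 is a common regression target: higher means more potent.
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def label_activity(ic50_nm, threshold_nm=1000.0):
    """Binary label for classification; the default threshold is an assumption."""
    return "active" if ic50_nm <= threshold_nm else "inactive"


if __name__ == "__main__":
    # Hypothetical assay readouts in nM.
    for compound, ic50 in [("cmpd_a", 50.0), ("cmpd_b", 25000.0)]:
        print(compound, round(pic50_from_nm(ic50), 2), label_activity(ic50))
```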
Compound ADMET
• Many use cases
• ADMET of hits
  - Absorption
  - Distribution
  - Metabolism
  - Excretion
  - Toxicity
• Mutagenicity
• Protein binding
General Resources
• EBI (European Bioinformatics Institute) / PubChem
  - API
  - Integrates several downloadable data sources (expression, copy number, bioassays, network, disease-specific)
  - Baseline data (normal, not diseased)
• Protein Data Bank – 3D Structures
• DrugBank
• Cancer – The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC)
• Coding Tools – R Bioconductor, BioPerl, Biopython
• https://docs.chemaxon.com/display/docs/Documentation
General Resources
• canSAR database
  - Integration of biological, pharmacological, chemical, structural biology and protein network data
Beware 101
• Non-standard Gene names
• Some experiments report Genes, some report Proteins
• We need new Drug Targets, different from established ones.
  - Keep in mind when analysing results
• Cancer is difficult
  - Drug resistance
  - Data lags behind the science
  - Tumour Heterogeneity
• Wide data = random patterns
• Different expression / sequencing platforms
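The "Wide data = random patterns" warning can be demonstrated with a small simulation: with few samples and thousands of purely random features, some feature will correlate strongly with the outcome by chance. The sample and feature counts and the function names below are arbitrary choices for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def best_random_correlation(n_samples=20, n_features=2000, seed=0):
    """Largest |correlation| between a random outcome and random features."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


if __name__ == "__main__":
    # No feature carries any signal, yet the best one looks convincing.
    print(round(best_random_correlation(), 2))
```

This is why feature selection on wide data should be done inside cross-validation rather than on the full data set.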
Therapeutic Opportunities
• Only approximately 350 - 400 protein targets
• DNA damage response (DDR) is essential for maintaining the genomic integrity of the cell
  - Currently targeted by chemotherapy and radiation. Goal is small-molecule targeting
• TCGA Patient Analysis: Expression, Copy Number Variation and Mutation data.
  - 15 cancer disease types
• Telegraph March 2015
  - New drugs to tackle cancer cell weak spots could end 'scattergun' chemotherapy
Laurence H. Pearl, Amanda C. Schierz, Simon E. Ward, Bissan Al-Lazikani, Frances M. G. Pearl. Therapeutic opportunities within the DNA Damage Response. Nature Reviews Cancer
Therapeutic Opportunities
• Statistical analysis of DDR deregulation in patients compared to a random set of genes
• Druggability prediction of deregulated DDR genes
• Synthetic Lethality analysis of Yeast DDR orthologues
  - Two genes are synthetic lethal if mutation of either alone is tolerated but mutation of both leads to cell death. Targeting a gene that is synthetic lethal to a cancer-relevant mutation should, in theory, kill only cancer cells.
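One way to compare a gene set against random sets of genes is an empirical (resampling) p-value; the slides do not say which test was actually used, so this is only a sketch of the general idea. It assumes a per-gene deregulation score (for example, the fraction of patients in which the gene is altered). The function name, the scoring and the gene identifiers are hypothetical.

```python
import random


def random_set_p_value(scores, gene_set, n_random=2000, seed=0):
    """Compare the mean score of gene_set with equally sized random gene sets.

    Returns (observed_mean, empirical_p), where empirical_p estimates how often
    a random set scores at least as high as the set of interest.
    """
    rng = random.Random(seed)
    all_genes = list(scores)
    size = len(gene_set)
    observed = sum(scores[g] for g in gene_set) / size
    at_least_as_high = 0
    for _ in range(n_random):
        sample = rng.sample(all_genes, size)
        if sum(scores[g] for g in sample) / size >= observed:
            at_least_as_high += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_high + 1) / (n_random + 1)


if __name__ == "__main__":
    # Hypothetical scores: 5 strongly deregulated genes among 100.
    scores = {f"gene{i}": (1.0 if i < 5 else 0.0) for i in range(100)}
    ddr_like = [f"gene{i}" for i in range(5)]
    print(random_set_p_value(scores, ddr_like))
```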
Therapeutic Opportunities
DDR Pathway Signatures