Getting the Big Picture by Joining up the SAR dots

Page 1: Getting the Big Picture by Joining up the SAR dots

Getting the Big Picture by Joining up the SAR dots

The 9th Annual Pharmaceutical IT Congress 2011

Sorel Muresan

AstraZeneca R&D Mölndal

DECS Computational Sciences

Large-scale integration of structure and bioactivity data

Page 2: Getting the Big Picture by Joining up the SAR dots

DECS | CompSci

WO patents with the classification code C07D

Query performed using the European Patent Office search interface

Page 3: Getting the Big Picture by Joining up the SAR dots

Driver – explosion in SAR data

• Chemical information landscape changing fast

• Databases, journal articles, patents, internal docs

[Figure panels: 2008 and 2006]

Southan, C.; Varkonyi, P.; Muresan, S., J. Cheminfo. 2009, 1:10

Page 4: Getting the Big Picture by Joining up the SAR dots

The Challenge – Information deluge

• Volume

• Complexity

• Unstructured content

Page 5: Getting the Big Picture by Joining up the SAR dots

Since 2006 >1M chemistry publications per year

Number of articles (diamonds) and patents (open boxes) abstracted annually by Chemical Abstracts. Bachrach, J. Cheminformatics 2009, 1:2

Page 6: Getting the Big Picture by Joining up the SAR dots

Number of structures per year from J Med Chem

W. Patrick Walters; Jeremy Green; Jonathan R. Weiss; Mark A. Murcko; J. Med. Chem. Article ASAP DOI: 10.1021/jm200504p Copyright © 2011 American Chemical Society

Page 7: Getting the Big Picture by Joining up the SAR dots

SAR key entities and relationships

Unstructured Data from Documents

Structured Entries in Relational Databases

Expert Extraction or Text Mining

Southan, C.; Boppana, K.; Jagarlapudi, S.; Muresan, S., J. Cheminfo. 2011, 3:14

Page 8: Getting the Big Picture by Joining up the SAR dots

Manually extracted SAR data (commercial)

• GOSTAR (GVKBIO Online Structure Activity Relationship Database) is a comprehensive database that captures explicit relationships between the three entities of publications, compounds and sequences.

• It includes 2.6 million compounds linked to 3,500 sequences with 12.5M SAR points extracted from 43,000 patents and 67,000 articles from 125 journals

Page 9: Getting the Big Picture by Joining up the SAR dots

SAR data (public)

• PubChem: the NCBI public informatics backbone for the NIH Molecular Libraries Initiative, focused on small molecules as systems biology probes and potential therapeutic agents. It holds 30.5 million compounds with 85.6 million links; 1,654K of the compounds have been tested in 504K assays.

• ChEMBL: includes drugs and small molecules from the medicinal chemistry or biochemical literature, together with their targets. It contains 1,060,258 distinct compounds extracted by expert manual curation from 42,516 publications, with 5,479,146 activities including SAR and ADMET values. These data are mapped to 8,603 targets.

Page 10: Getting the Big Picture by Joining up the SAR dots

Extracting chemical entities from text

Collaboration with IBM Research Almaden to apply text analytics technology to analyze intellectual property and scientific literature

- 10 million full text patents

- 11 million structures

- 12% out of 46M parent structures in Chemistry Connect

Page 11: Getting the Big Picture by Joining up the SAR dots

Chemical Named Entity Recognition (NER)

7-CHLORO-1,3-DIHYDRO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE

Name-to-Structure software

CN1c2ccc(cc2C(=NCC1=O)c3ccccc3)Cl
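The name-to-SMILES step on this slide can be illustrated with a toy lookup. This is only a sketch under loud assumptions: real name-to-structure software parses nomenclature, whereas the hypothetical `NAME_TO_SMILES` dictionary and `extract_structures` function below simply match candidate tokens against the single name/SMILES pair shown above.

```python
import re

# Assumption: a one-entry stand-in dictionary holding the pair from the slide.
# Real name-to-structure tools interpret the name; they do not look it up.
NAME_TO_SMILES = {
    "7-chloro-1,3-dihydro-1-methyl-5-phenyl-2h-1,4-benzodiazepin-2-one":
        "CN1c2ccc(cc2C(=NCC1=O)c3ccccc3)Cl",
}

# Candidate tokens: whitespace-free runs containing a locant-like "digit-hyphen".
CANDIDATE = re.compile(r"\S*\d-\S+")


def extract_structures(text):
    """Return (name, SMILES) pairs for candidate tokens found in the dictionary."""
    hits = []
    for match in CANDIDATE.finditer(text):
        # Drop sentence punctuation that got attached to the token.
        token = match.group(0).rstrip(".,;:")
        smiles = NAME_TO_SMILES.get(token.lower())
        if smiles is not None:
            hits.append((token, smiles))
    return hits
```

Names that contain spaces, or that are not in the dictionary, are missed by this sketch; handling them is exactly what dedicated NER and name-to-structure tools are for.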

Page 12: Getting the Big Picture by Joining up the SAR dots

Extracting chemical entities from text

The biggest cause of missing compounds when extracting chemical entities from text is the presence of typographical errors: human errors, OCR failures, hyphenation and multiple line issues, etc.

• Automated spelling correction with CaffeineFix from NextMove Software

• CaffeineFix significantly improves extraction rates (22% increase from D=0 to D=1)

• Name-to-structure (n2s) tools are complementary (40% of the structures come from single n2s contributions)
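To make the D=0 versus D=1 comparison concrete, here is a minimal sketch of dictionary-based correction within a maximum edit distance. It is not how CaffeineFix is implemented; the `edit_distance` and `correct` functions and the tiny vocabulary in the usage note are assumptions for illustration only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct(token, vocabulary, max_distance=1):
    """Return the closest vocabulary term within max_distance edits, else None."""
    best_term, best_distance = None, max_distance + 1
    for term in sorted(vocabulary):  # sorted so that ties resolve deterministically
        distance = edit_distance(token.lower(), term)
        if distance < best_distance:
            best_term, best_distance = term, distance
    return best_term
```

With a hypothetical vocabulary `{"benzodiazepin", "chloro", "methyl"}`, the OCR-style token `"benzodiazcpin"` finds no match at `max_distance=0` but is mapped to `"benzodiazepin"` at `max_distance=1`, which is the kind of recovery the extraction-rate increase refers to.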

Page 13: Getting the Big Picture by Joining up the SAR dots

Structure standardisation

Muresan, S.; Sitzmann, M.; Southan, C., Biocomputing and Drug Discovery, 2011

“The big merge” requires:

• A common set of chemistry and biology rules applied carefully & consistently across databases
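One example of a rule that has to be applied identically across sources is reducing a salt or mixture to a parent structure. The sketch below is a deliberately naive, string-level illustration and not the standardisation used in Chemistry Connect: it assumes SMILES input, and the hypothetical `parent_fragment` function simply keeps the dot-separated fragment with the most atoms. Charges, tautomers and stereochemistry are not touched.

```python
import re

# Atom tokens: bracket atoms, then two-letter halogens, then the organic subset.
# Every bracket atom counts as one atom, including an explicit [H].
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def atom_count(fragment):
    """Count atom tokens in one SMILES fragment."""
    return len(ATOM.findall(fragment))


def parent_fragment(smiles):
    """Return the largest '.'-separated fragment of a SMILES string."""
    return max(smiles.split("."), key=atom_count)
```

For instance, `parent_fragment("CC(=O)[O-].[Na+]")` keeps the acetate fragment and drops the sodium counter-ion. A production pipeline would use a cheminformatics toolkit for this step rather than regular expressions.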

Page 14: Getting the Big Picture by Joining up the SAR dots

Chemistry Connect

Page 15: Getting the Big Picture by Joining up the SAR dots

Technical Overview - ETL

Data Sources

Oracle DB

Text Files

Web Service

Python Scripts (chemistry)

Pipeline Pilot (biological results)

Extraction, Transformation, Loading

Oracle PL/SQL (ext tables)

Structure Normalization

Property calc

Page 16: Getting the Big Picture by Joining up the SAR dots

Technical Overview - Application

Oracle 11g

Direct 7

WebLogic Server

REST (and SOAP) services

HTML

Java

.Net

Pipeline Pilot

Knime

Excel

Page 17: Getting the Big Picture by Joining up the SAR dots

Source content in Chemistry Connect

Source        Structures  % unique  Cpd/Str  Syn/Str
ChemSpider    18922316    50        1.07     1.8
Reaxys        15535377    59        1.12     2.0
IBM patents   11038533    51        1.00     n/a
PubChemBE     4675643     n/a       1.03     n/a
ACD           4452644     73        1.01     1.3
eMolecules    4213813     19        1.01     n/a
TRPharma      3268613     n/a       1.03     n/a
GOSTAR        3128567     27        1.00     3.3
ChEMBL        940905      n/a       1.05     1.6
TRIntegrity   307685      27        1.00     1.3
AZReagents    78265       3.4       1.73     3.4
TRPartnering  17901       10        1.00     1.0
ChEBI         13191       n/a       1.31     5.2
HMDB          7789        53        1.00     13.4
DrugBank      6359        n/a       1.04     5.0
TTD           2663        4.9       1.27     n/a
Bioprint      2481        n/a       1.00     n/a

Muresan, S. et al, Drug Discovery Today 2011, in print

Page 18: Getting the Big Picture by Joining up the SAR dots

Finding a common language

[3H]Acetaminophen10066-90-7103-90-21047-607-001169-894-1216110-10-4222 AF222-AF3-(glutathion-S-yl)acetaminophen37519-14-53-hydroxyacetaminophen4-(Acetylamino)phenol4-13-00-010914-ACETAMIDOPHENOL4-Acetaminophenol4-ACETYLAMINOPHENOL4'-Hydroxyacetanilide4-HYDROXYACETANILIDE4-HYDROXYANILID KYSELINY OCTOVE4-hydroxyphenolacetamide644/4046644/750264889-81-2659/950177097-85-9840-416-00872-667-00878-022-04878-022-09878-022-14878-022-19882-720-04882-720-07882-720-10

882-720-13882-720-16882-720-20A F ANACINA PERA.F. ANACINAAPaa-sulfateAA-sulphateAbenolAbensanilABROLABROLETAC112578AC112579AcamolAccu-TapAcenolAcenol (pharmaceutical)AcephenAcertolAcetaAceta ElixirAceta TabletsAcetacoAcetagesicAcetalginACETAMIDE, N-(4-HYDROXYPHENYL)-ACETAMIDE, N-(P-HYDROXYPHENYL)-AcetamidophenolAcetaminofenAcetaminophenAcetaminophen (4-hydroxyacetanilide)Acetaminophen glucuronide(55%)acetaminophen sulfate

AcetaminophenAcetaminophen (4-hydroxyacetanilide)Acetaminophen glucuronide(55%)acetaminophen sulfateAcetaminophen sulfate(30%)acetaminophen sulphateAcetaminophen UnisertsacetaminopheneAcetamolACETANILIDE, 4'-HYDROXY-AcetavanceAcetofenACETOMINOPHENActaminActamin ExtraActamin SuperActifed PlusActimolActimol Chewable TabletsActimol Children's SuspensionActimol Infants' SuspensionActimol Junior Strength CapletsActronAfebrinAfebrylAferadolAG10223AG12029AG124687AG12800AG12948AmadilAminofenAminofen MaxAnacinAnacin-3Anacin-3 Extra StrengthAnadin dla dzieciAnaflonAnalterAnapapAndoxAnelixAnexsiaAnexsia 10/660Anexsia 5/325Anexsia 7.5/325Anexsia 7.5/650AnhibaAnoquanAnti-AlgosAntidolApacetApacet Capsules

Acetaminophen: >1000 synonyms
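The synonym list mixes registry-style numbers with trade and systematic names. As a small sketch of one way to sort such a list, the code below separates strings that pass the standard CAS Registry Number check-digit rule from everything else. The function names `is_valid_cas` and `split_synonyms` are hypothetical, and nothing here is claimed to be part of Chemistry Connect.

```python
import re

# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(identifier):
    """True if the string has CAS RN layout and a correct check digit."""
    match = CAS_PATTERN.match(identifier)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight digits 1, 2, 3, ... from the right; the sum modulo 10 is the check digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


def split_synonyms(synonyms):
    """Split a synonym list into (valid CAS numbers, everything else)."""
    registry_numbers = [s for s in synonyms if is_valid_cas(s)]
    others = [s for s in synonyms if not is_valid_cas(s)]
    return registry_numbers, others
```

For example, `103-90-2` from the list above passes the check (1×0 + 2×9 + 3×3 + 4×0 + 5×1 = 32, and 32 mod 10 = 2), whereas an entry such as `878-022-04` does not have CAS layout and stays with the other synonyms.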

Page 19: Getting the Big Picture by Joining up the SAR dots

Word of the Day: Crowdsourcing

Page 20: Getting the Big Picture by Joining up the SAR dots

Exact match source comparisons

predominantly patent-derived compounds

sources that include known drugs

Page 21: Getting the Big Picture by Joining up the SAR dots

Chemistry Connect - Synonyms Searches

Page 22: Getting the Big Picture by Joining up the SAR dots

Chemistry Connect - Structure Searches

Page 23: Getting the Big Picture by Joining up the SAR dots

Chemistry Connect - Patent Searches

Page 24: Getting the Big Picture by Joining up the SAR dots

Chemistry Connect - Test & Result Searches

Page 25: Getting the Big Picture by Joining up the SAR dots

Different Questions, Common Language

• What compounds have been described in document D?

• What compounds bind target X with an affinity greater than A?

• What targets does compound C bind with an affinity greater than A?

• What compounds have AZ patented on target X?

• What is the structure for this development compound?

• How can I quickly get the SAR data from this patent?

Concepts (diagram labels): Species, Compound, Study, BMO (AE), Target, Disease, Institute, Drug, MoA, Pathway, People, Bioprocess, Test, Question
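Two of the questions above ("What compounds bind target X with an affinity greater than A?" and "What targets does compound C bind with an affinity greater than A?") can be sketched as joins over compound, target and activity tables. Everything in the example is an assumption for illustration: the schema, the table and column names, the use of SQLite instead of Oracle, the made-up demo rows, and the choice of a negative-log activity value (`p_activity`) so that "greater" means "more potent".

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical schema and made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE compound (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE target   (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE activity (
            compound_id INTEGER REFERENCES compound(id),
            target_id   INTEGER REFERENCES target(id),
            document    TEXT,
            p_activity  REAL
        );
        -- Demo values only; they do not come from any real source.
        INSERT INTO compound VALUES (1, 'cpd-A'), (2, 'cpd-B');
        INSERT INTO target   VALUES (1, 'target-X'), (2, 'target-Y');
        INSERT INTO activity VALUES
            (1, 1, 'doc-D', 8.2),
            (2, 1, 'doc-D', 5.9),
            (1, 2, 'doc-E', 6.5);
    """)
    return conn


def compounds_for_target(conn, target_name, min_p_activity):
    """Names of compounds whose activity on the target exceeds the threshold."""
    rows = conn.execute(
        """SELECT DISTINCT c.name
           FROM activity a
           JOIN compound c ON c.id = a.compound_id
           JOIN target t ON t.id = a.target_id
           WHERE t.name = ? AND a.p_activity > ?
           ORDER BY c.name""",
        (target_name, min_p_activity),
    ).fetchall()
    return [name for (name,) in rows]


def targets_for_compound(conn, compound_name, min_p_activity):
    """Names of targets on which the compound's activity exceeds the threshold."""
    rows = conn.execute(
        """SELECT DISTINCT t.name
           FROM activity a
           JOIN compound c ON c.id = a.compound_id
           JOIN target t ON t.id = a.target_id
           WHERE c.name = ? AND a.p_activity > ?
           ORDER BY t.name""",
        (compound_name, min_p_activity),
    ).fetchall()
    return [name for (name,) in rows]
```

Because both questions run over the same three tables, they differ only in which column is fixed and which is returned, which is the point of agreeing on a common set of concepts first.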

Page 26: Getting the Big Picture by Joining up the SAR dots

Take-home messages

• Chemistry Connect is enabling AZ to intensify its exploitation of synergies between internal and external SAR estate and to shorten the time between hypothesis generation during DMTA cycles

• Our Chemical Dictionary of 120 million chemical terms has become a crucial cross-mapping resource between chemistry and the scientific literature

• We cannot wave a magic wand over data quality, provenance issues, drug name space, and the inherent challenges of chemistry representation, but Chemistry Connect gives us a unique overview and amelioration options for each source

Page 27: Getting the Big Picture by Joining up the SAR dots

A Democracy of Ideas (Acknowledgements)

• Plamen Petrov

• Chris Southan

• Paul Xie

• Peter Varkonyi

• Thierry Kogej

• Christian Tyrchan

• Magnus Kjellberg

• Håkan Nilsson

• Mats Ericsson

• Jonas Ekengren

• Marcus Gelderman

• Ithipol Suriyawongkul

• Niklas Blomberg

• Kay Brickmann

• Ola Engkvist

• Yidong Yang

• Hongming Chen

• and many others…

Page 28: Getting the Big Picture by Joining up the SAR dots

Thank you!