Getting the Big Picture by Joining up the SAR dots
The 9th Annual Pharmaceutical IT Congress 2011
Sorel Muresan, AstraZeneca R&D Mölndal, DECS Computational Sciences
Large-scale integration of structure and bioactivity data
DECS | CompSci
WO patents with the classification code C07D
Query performed using the European Patent Office search interface
Driver – explosion in SAR data
• Chemical information landscape changing fast
• Databases, journal articles, patents, internal docs
Figure panels: 2008, 2006
Southan, C.; Varkonyi, P.; Muresan, S., J. Cheminfo. 2009, 1:10
The Challenge – Information deluge
• Volume
• Complexity
• Unstructured content
Since 2006 >1M chemistry publications per year
Number of articles (diamonds) and patents (open boxes) abstracted annually by Chemical Abstracts. Bachrach, J. Cheminfo. 2009, 1:2
Number of structures per year from J Med Chem
W. Patrick Walters; Jeremy Green; Jonathan R. Weiss; Mark A. Murcko; J. Med. Chem. Article ASAP DOI: 10.1021/jm200504p Copyright © 2011 American Chemical Society
SAR key entities and relationships
Unstructured Data from Documents
Structured Entries in Relational Databases
Expert Extraction or Text Mining
Southan, C.; Boppana, K.; Jagarlapudi, S.; Muresan, S., J. Cheminfo. 2011, 3:14
Manually extracted SAR data (commercial)
• GOSTAR (GVKBIO Online Structure Activity Relationship Database) is a comprehensive database that captures explicit relationships between the three entities of publications, compounds and sequences.
• It includes 2.6 million compounds linked to 3,500 sequences with 12.5M SAR points extracted from 43,000 patents and 67,000 articles from 125 journals
SAR data (public)
• PubChem: the NCBI public informatics backbone for the NIH Molecular Libraries Initiative, focused on small molecules as systems biology probes and potential therapeutic agents. The statistics are 30.5 million compounds with 85.6 million links. Of the compounds, 1654K have been tested in 504K assays.
• ChEMBL: includes drugs, small molecules from the medicinal chemistry or biochemical literature and their targets. It contains 1,060,258 distinct compounds extracted by expert manual curation from 42,516 publications with 5,479,146 activities, including SAR and ADMET values. This data is mapped to 8,603 targets.
Extracting chemical entities from text
Collaboration with IBM Research Almaden to apply text analytics technology to analyze intellectual property and scientific literature
- 10 million full text patents
- 11 million structures
- 12% out of 46M parent structures in Chemistry Connect
Chemical Named Entity Recognition (NER)
7-CHLORO-1,3-DIHYDRO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
Name-to-Structure software
CN1c2ccc(cc2C(=NCC1=O)c3ccccc3)Cl
Extracting chemical entities from text
The biggest cause of missing compounds when extracting chemical entities from text is the presence of typographical errors: human errors, OCR failures, hyphenation and multiple line issues, etc.
• Automated spelling correction with CaffeineFix from NextMove Software
• CaffeineFix significantly improves extraction rates (22% increase from D=0 to D=1)
• Name-to-structure (n2s) tools are complementary (40% of the structures come from single n2s contributions)
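The spelling-correction idea above can be illustrated with a minimal sketch. It assumes that D denotes the maximum edit distance allowed between a token in the text and a dictionary term, which the slide does not state explicitly. It is not CaffeineFix's algorithm: the function names, the alphabet, and the tiny dictionary are hypothetical and chosen only for illustration.

```python
import string

# Characters considered when generating candidate corrections (an assumption).
ALPHABET = string.ascii_lowercase + string.digits + "-,()[]' "


def edits_within_one(term):
    """Return every string within edit distance 1 of term, including term itself."""
    splits = [(term[:i], term[i:]) for i in range(len(term) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    replaces = {left + ch + right[1:]
                for left, right in splits if right for ch in ALPHABET}
    inserts = {left + ch + right for left, right in splits for ch in ALPHABET}
    return {term} | deletes | replaces | inserts


def correct(term, dictionary, max_distance=1):
    """Map a possibly misspelled token to a dictionary term, or None if no safe match."""
    term = term.lower()
    if term in dictionary:          # D=0: exact match only
        return term
    if max_distance >= 1:           # D=1: accept a single unambiguous neighbour
        candidates = sorted(edits_within_one(term) & dictionary)
        if len(candidates) == 1:
            return candidates[0]
    return None


# Hypothetical dictionary of name fragments.
FRAGMENTS = {"benzodiazepin", "chloro", "methyl", "phenyl"}
```

With this toy dictionary, an OCR-style error such as `pheny1` is rejected at `max_distance=0` and recovered as `phenyl` at `max_distance=1`, which is the kind of gain the D=0 to D=1 comparison refers to.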
Structure standardisation
Muresan, S.; Sitzmann, M.; Southan, C., Biocomputing and Drug Discovery, 2011
“The big merge” requires:
• A common set of chemistry and biology rules applied carefully & consistently across databases
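One way to picture "the same rules applied consistently across databases" is a single ordered rule list that every source passes through. The sketch below is only an illustration: the two rules are hypothetical, and the largest-fragment rule uses string length as a crude proxy where a real pipeline would use a chemistry toolkit.

```python
def strip_whitespace(smiles):
    """Rule 1: remove leading and trailing whitespace."""
    return smiles.strip()


def keep_largest_fragment(smiles):
    """Rule 2: keep the longest dot-separated SMILES component as the parent.

    String length is a crude stand-in for fragment size (an assumption).
    """
    return max(smiles.split("."), key=len)


# The same ordered rule list is used for every source.
RULES = [strip_whitespace, keep_largest_fragment]


def standardise(smiles, rules=RULES):
    """Apply each rule in order to one structure string."""
    for rule in rules:
        smiles = rule(smiles)
    return smiles


def standardise_sources(sources):
    """Apply the shared rules to every record of every source."""
    return {name: [standardise(s) for s in records]
            for name, records in sources.items()}
```

For example, `standardise(" CN1c2ccc(cc2C(=NCC1=O)c3ccccc3)Cl.Cl ")` drops the whitespace and the `Cl` counter-ion component, leaving the parent SMILES shown on the NER slide.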
Chemistry Connect
Technical Overview - ETL
Data Sources
Oracle DB
Text Files
Web Service
Python Scripts (chemistry)
Pipeline Pilot (biological results)
Extraction, Transformation, Loading
Oracle PL/SQL (ext tables)
Structure Normalization
Property calc
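The extract, transform, load flow on this slide can be sketched in a few lines. This is an assumption-laden illustration, not the Chemistry Connect pipeline: an in-memory SQLite database stands in for Oracle, the source names and table layout are hypothetical, and a chlorine count stands in for the property calculation.

```python
import csv
import io
import sqlite3


def extract(text):
    """Extraction: read rows from a delimited text source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: tidy the structure string and compute one toy property."""
    records = []
    for row in rows:
        smiles = row["smiles"].strip()
        # "Cl" in a SMILES string always denotes chlorine, so counting is safe.
        records.append((row["source"].strip(), smiles, smiles.count("Cl")))
    return records


def load(conn, records):
    """Loading: write the transformed records into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS structure "
        "(source TEXT, smiles TEXT, n_chlorine INTEGER)"
    )
    conn.executemany("INSERT INTO structure VALUES (?, ?, ?)", records)
    conn.commit()


# Hypothetical input: two sources, one structure each.
RAW = (
    "source,smiles\n"
    "source_b, CN1c2ccc(cc2C(=NCC1=O)c3ccccc3)Cl \n"
    "source_a,CC(=O)Nc1ccc(O)cc1\n"
)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))
rows = conn.execute(
    "SELECT source, n_chlorine FROM structure ORDER BY source"
).fetchall()
```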
Technical Overview - Application
Oracle 11g, Direct 7
WebLogic Server
REST (and SOAP) services
HTML
Java
.Net
Pipeline Pilot, Knime, Excel
Source content in Chemistry Connect

Source         Structures   % unique   Cpd/Str   Syn/Str
ChemSpider       18922316         50      1.07       1.8
Reaxys           15535377         59      1.12       2.0
IBM patents      11038533         51      1.00       n/a
PubChemBE         4675643        n/a      1.03       n/a
ACD               4452644         73      1.01       1.3
eMolecules        4213813         19      1.01       n/a
TRPharma          3268613        n/a      1.03       n/a
GOSTAR            3128567         27      1.00       3.3
ChEMBL             940905        n/a      1.05       1.6
TRIntegrity        307685         27      1.00       1.3
AZReagents          78265        3.4      1.73       3.4
TRPartnering        17901         10      1.00       1.0
ChEBI               13191        n/a      1.31       5.2
HMDB                 7789         53      1.00      13.4
DrugBank             6359        n/a      1.04       5.0
TTD                  2663        4.9      1.27       n/a
Bioprint             2481        n/a      1.00       n/a
Muresan, S. et al., Drug Discovery Today 2011, in press
Finding a common language
[3H]Acetaminophen; 10066-90-7; 103-90-2; 1047-607-00; 1169-894-12; 16110-10-4; 222 AF; 222-AF; 3-(glutathion-S-yl)acetaminophen; 37519-14-5; 3-hydroxyacetaminophen; 4-(Acetylamino)phenol; 4-13-00-01091; 4-ACETAMIDOPHENOL; 4-Acetaminophenol; 4-ACETYLAMINOPHENOL; 4'-Hydroxyacetanilide; 4-HYDROXYACETANILIDE; 4-HYDROXYANILID KYSELINY OCTOVE; 4-hydroxyphenolacetamide; 644/4046; 644/7502; 64889-81-2; 659/9501; 77097-85-9; 840-416-00; 872-667-00; 878-022-04; 878-022-09; 878-022-14; 878-022-19; 882-720-04; 882-720-07; 882-720-10; 882-720-13; 882-720-16; 882-720-20;
A F ANACIN; A PER; A.F. ANACIN; AAP; aa-sulfate; AA-sulphate; Abenol; Abensanil; ABROL; ABROLET; AC112578; AC112579; Acamol; Accu-Tap; Acenol; Acenol (pharmaceutical); Acephen; Acertol; Aceta; Aceta Elixir; Aceta Tablets; Acetaco; Acetagesic; Acetalgin; ACETAMIDE, N-(4-HYDROXYPHENYL)-; ACETAMIDE, N-(P-HYDROXYPHENYL)-; Acetamidophenol; Acetaminofen; Acetaminophen; Acetaminophen (4-hydroxyacetanilide); Acetaminophen glucuronide(55%); acetaminophen sulfate; Acetaminophen sulfate(30%); acetaminophen sulphate; Acetaminophen Uniserts; acetaminophene; Acetamol; ACETANILIDE, 4'-HYDROXY-; Acetavance; Acetofen; ACETOMINOPHEN;
Actamin; Actamin Extra; Actamin Super; Actifed Plus; Actimol; Actimol Chewable Tablets; Actimol Children's Suspension; Actimol Infants' Suspension; Actimol Junior Strength Caplets; Actron; Afebrin; Afebryl; Aferadol; AG10223; AG12029; AG124687; AG12800; AG12948; Amadil; Aminofen; Aminofen Max; Anacin; Anacin-3; Anacin-3 Extra Strength; Anadin dla dzieci; Anaflon; Analter; Anapap; Andox; Anelix; Anexsia; Anexsia 10/660; Anexsia 5/325; Anexsia 7.5/325; Anexsia 7.5/650; Anhiba; Anoquan; Anti-Algos; Antidol; Apacet; Apacet Capsules
Acetaminophen: >1000 synonyms
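A synonym list like this becomes useful once it is inverted into a lookup from name to structure. The following is a minimal sketch of such an index, using a handful of the synonyms above. The normalisation (case-folding and whitespace collapsing only), the function names, and the use of a SMILES string as the structure key are assumptions made for illustration.

```python
from collections import defaultdict


def normalise(name):
    """Case-fold and collapse whitespace; locants and punctuation are left alone."""
    return " ".join(name.casefold().split())


def build_index(records):
    """Invert {structure_key: [synonyms]} into {normalised synonym: {structure_keys}}."""
    index = defaultdict(set)
    for structure_key, synonyms in records.items():
        for synonym in synonyms:
            index[normalise(synonym)].add(structure_key)
    return index


def lookup(index, name):
    """Return the set of structure keys registered for a name (empty if unknown)."""
    return set(index.get(normalise(name), set()))


# SMILES for acetaminophen (paracetamol) used as an illustrative structure key.
ACETAMINOPHEN = "CC(=O)Nc1ccc(O)cc1"

INDEX = build_index({
    ACETAMINOPHEN: [
        "Acetaminophen",
        "4-HYDROXYACETANILIDE",
        "4'-Hydroxyacetanilide",
        "103-90-2",
        "Acamol",
    ],
})
```

A set is kept per synonym because one name can legitimately resolve to more than one structure across sources, which is part of the drug name space problem.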
Word of the Day: Crowdsourcing
Exact match source comparisons
predominantly patent-derived compounds
sources that include known drugs
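An exact-match comparison between sources reduces to set operations once every structure has a canonical key. The sketch below assumes such keys already exist (for example, after standardisation) and uses made-up keys and source names. Treating "% unique" as the share of a source's structures found in no other source is also an assumption about how that column was defined.

```python
from itertools import combinations


def pairwise_overlap(sources):
    """Count structures shared by each pair of sources (exact key match)."""
    return {
        (a, b): len(sources[a] & sources[b])
        for a, b in combinations(sorted(sources), 2)
    }


def percent_unique(sources):
    """Percentage of each source's structures that appear in no other source."""
    result = {}
    for name, keys in sources.items():
        others = set().union(*(v for k, v in sources.items() if k != name))
        result[name] = 100.0 * len(keys - others) / len(keys) if keys else 0.0
    return result


# Hypothetical sources holding hypothetical structure keys.
SOURCES = {
    "A": {"k1", "k2", "k3", "k4"},
    "B": {"k3", "k4", "k5"},
    "C": {"k4", "k6"},
}

OVERLAP = pairwise_overlap(SOURCES)
UNIQUE = percent_unique(SOURCES)
```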
Chemistry Connect - Synonyms Searches
Chemistry Connect - Structure Searches
Chemistry Connect - Patent Searches
Chemistry Connect - Test & Result Searches
Different Questions, Common Language
• What compounds have been described in document D?
• What compounds bind target X with an affinity greater than A?
• What targets does compound C bind with an affinity greater than A?
• What compounds have AZ patented on target X?
• What is the structure for this development compound?
• How can I quickly get the SAR data from this patent?
Diagram labels: Question; Concepts (Species, Compound, Study, BMO (AE), Target, Disease, Institute, Drug, MoA, Pathway, People, Bioprocess, Test)
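Several of the questions on this slide become simple queries once compounds, targets, and documents share one schema. The sketch below is hypothetical throughout: the `activity` table, its columns, and the demo rows are invented for illustration, SQLite stands in for the production database, and affinity is assumed to be stored on a negative-log scale so that a greater value means tighter binding.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical activity table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE activity "
        "(compound TEXT, target TEXT, p_affinity REAL, document TEXT)"
    )
    conn.executemany(
        "INSERT INTO activity VALUES (?, ?, ?, ?)",
        [
            ("cpd1", "X", 8.2, "D1"),
            ("cpd2", "X", 5.9, "D1"),
            ("cpd1", "Y", 7.1, "D2"),
        ],
    )
    return conn


def compounds_in_document(conn, document):
    """What compounds have been described in document D?"""
    rows = conn.execute(
        "SELECT DISTINCT compound FROM activity "
        "WHERE document = ? ORDER BY compound",
        (document,),
    )
    return [r[0] for r in rows]


def compounds_for_target(conn, target, min_affinity):
    """What compounds bind target X with an affinity greater than A?"""
    rows = conn.execute(
        "SELECT DISTINCT compound FROM activity "
        "WHERE target = ? AND p_affinity > ? ORDER BY compound",
        (target, min_affinity),
    )
    return [r[0] for r in rows]


def targets_for_compound(conn, compound, min_affinity):
    """What targets does compound C bind with an affinity greater than A?"""
    rows = conn.execute(
        "SELECT DISTINCT target FROM activity "
        "WHERE compound = ? AND p_affinity > ? ORDER BY target",
        (compound, min_affinity),
    )
    return [r[0] for r in rows]
```

The three functions differ only in which column is fixed and which is returned, which is the point of the slide title: different questions, one common language.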
Take-home messages
• Chemistry Connect is enabling AZ to intensify its exploitation of synergies between internal and external SAR estate and to shorten the time between hypothesis generation during DMTA cycles
• Our Chemical Dictionary of 120 million chemical terms has become a crucial cross-mapping resource between chemistry and the scientific literature
• We cannot wave a magic wand over data quality, provenance issues, drug name space, and the inherent challenges of chemistry representation, but Chemistry Connect gives us a unique overview and amelioration options for each source
A Democracy of Ideas (Acknowledgements)
• Plamen Petrov
• Chris Southan
• Paul Xie
• Peter Varkonyi
• Thierry Kogej
• Christian Tyrchan
• Magnus Kjellberg
• Håkan Nilsson
• Mats Ericsson
• Jonas Ekengren
• Marcus Gelderman
• Ithipol Suriyawongkul
• Niklas Blomberg
• Kay Brickmann
• Ola Engkvist
• Yidong Yang
• Hongming Chen
• and many others…
Thank you!