Exploring Chemical and Biological Knowledge Spaces with PubChem

Dr. Paul A. Thiessen, NCBI

2013/03/21 draft

description

My presentation for the Drug Repurposing workshop at the upcoming Bio-IT World Expo: http://www.bio-itworldexpo.com/Bio-It_Expo_Content.aspx?id=124256

Presentation abstract: PubChem has a wealth of chemical structure and biological activity information. In conjunction with NCBI’s other resources such as PubMed and GenBank, PubChem is a vast source of information relevant to repurposing not only established drugs but also any compounds with in vivo pharmacology and/or clinical results. The challenge is how to take advantage of this knowledge. The ability to explore not only chemical similarity but also relationships between diseases and disease targets has crucial value in repurposing. While focused investigations are already possible within the existing Entrez system, navigating across these linked information spaces is difficult to do on a large scale with current tools. We are actively developing new infrastructure to support such analyses, and pursuing new methods of exploring inter- and intra-database relationships between chemicals, targets, diseases, and patents. Progress and some future directions in these areas will be presented.

Transcript of Exploring Chemical and Biological Knowledge Spaces with PubChem

Page 1: Exploring Chemical and Biological Knowledge Spaces with PubChem

EXPLORING CHEMICAL AND BIOLOGICAL KNOWLEDGE SPACES WITH PUBCHEM

Dr. Paul A. Thiessen, NCBI

2013/03/21 draft

Page 2: Exploring Chemical and Biological Knowledge Spaces with PubChem

What is a “Knowledge Space”?

May be a database

But may be a concept not encapsulated in a database

Literature (PubMed)

Chemicals (PubChem)

Targets (sequences)

Genes

Diseases

Patents

Drugs

Assays (PubChem)

Page 3: Exploring Chemical and Biological Knowledge Spaces with PubChem

Connecting the Spaces

Database cross-links

Literature (PubMed)

Chemicals (PubChem)

Targets (sequences)

Assays (PubChem)

Active

Inactive

MeSH

Depositor

Page 4: Exploring Chemical and Biological Knowledge Spaces with PubChem

Moving Within a Space

Neighbors… some examples

Chemicals (PubChem)

Assays (PubChem)

Same connectivity

Same parent

Similar by 2D or 3D

Similar target (BLAST)

Similar sets of screened chemicals
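The "similar by 2D" neighbor relation can be illustrated with a Tanimoto coefficient computed over fingerprint bits. This is a minimal sketch under stated assumptions, not PubChem's implementation: fingerprints are modeled as sets of "on" bit positions, the IDs and bit patterns are invented, and the 0.9 cutoff is only an illustrative default.

```python
from typing import Dict, FrozenSet, List

Fingerprint = FrozenSet[int]  # positions of the "on" bits (toy representation)


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Shared on-bits divided by all distinct on-bits (1.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def neighbors(query: int, prints: Dict[int, Fingerprint],
              threshold: float = 0.9) -> List[int]:
    """IDs whose fingerprint is at least `threshold` similar to the query's."""
    q = prints[query]
    return sorted(cid for cid, fp in prints.items()
                  if cid != query and tanimoto(q, fp) >= threshold)


# Toy data: the IDs and bit positions are made up for illustration.
toy_prints = {
    1: frozenset(range(0, 10)),   # bits 0..9
    2: frozenset(range(0, 9)),    # bits 0..8, shares 9 of 10 bits with ID 1
    3: frozenset(range(5, 15)),   # bits 5..14, shares 5 of 15 bits with ID 1
}

print(neighbors(1, toy_prints))
```

Precomputing such neighbor lists is what turns "similar structure" into a link that can be followed like any other cross-link.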

Page 5: Exploring Chemical and Biological Knowledge Spaces with PubChem

Drug Repurposing as a Spatial Transformation

Drugs

Search

Targets

Diseases (hypothesized)

One possible route…

Diseases (known)

Similarity
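One reading of this diagram is that repurposing hypotheses come from composing link tables: drug to targets, targets to similar targets, targets to diseases, minus the diseases already known for the drug. The sketch below follows that reading only. The function names, the table layout, and every identifier in the toy data are hypothetical placeholders, not PubChem content.

```python
from typing import Dict, Set

Links = Dict[str, Set[str]]


def follow(start: Set[str], links: Links) -> Set[str]:
    """Map a set of IDs through one link table and pool the results."""
    out: Set[str] = set()
    for item in start:
        out |= links.get(item, set())
    return out


def hypothesized_diseases(drug: str, drug_targets: Links, similar_targets: Links,
                          target_diseases: Links, known_diseases: Links) -> Set[str]:
    """Diseases reachable via the drug's targets (and their neighbors) that are not already known."""
    targets = follow({drug}, drug_targets)
    widened = targets | follow(targets, similar_targets)  # the similarity step
    candidates = follow(widened, target_diseases)
    return candidates - known_diseases.get(drug, set())


# Toy link tables with placeholder names.
drug_targets = {"drugA": {"T1"}}
similar_targets = {"T1": {"T2"}}
target_diseases = {"T1": {"D1"}, "T2": {"D2"}}
known_diseases = {"drugA": {"D1"}}

print(hypothesized_diseases("drugA", drug_targets, similar_targets,
                            target_diseases, known_diseases))
```

Each step is a plain set operation, so other routes through the spaces can be tried by swapping in a different link table.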

Page 6: Exploring Chemical and Biological Knowledge Spaces with PubChem

What is in PubChem

117M Substances (SIDs)
• Information from depositors, including links to PubMed, sequences, structures, patents, etc.

47M Compounds (CIDs)
• Derived from Substances (including links)
• Computed properties

650k Assays (AIDs)
• ~200M test results on SIDs
• Links to target sequences

Page 7: Exploring Chemical and Biological Knowledge Spaces with PubChem

Some PubChem Statistics

All CIDs: 46,814,409
Unique parents by connectivity: 36,806,372
Rule of 5: 34,343,056
Rule of 5 but MW 250-800: 31,483,865
Active in any BioAssay: 824,028
Tested in any BioAssay: 1,872,313
Experimental 3D (mainly PDB): 41,406
Computed 3D (multiple confs + neighbors): 42,252,570
Pharmacological Actions: 11,531
Biosystems: 9,703
Chemical vendors: 28,852,943
NIH Molecular Libraries: 402,076
Patent sources: 14,512,499
Patent links: 5,978,538

… as of 2013/03/20
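A few ratios put these counts in proportion. The short calculation below uses only the figures on this slide. It assumes that the "active" compounds are a subset of the "tested" ones, which the slide implies but does not state. Under that assumption, roughly 4% of all CIDs have been tested in any BioAssay, and about 44% of the tested ones are active in at least one.

```python
# Counts copied from the statistics slide (as of 2013/03/20).
stats = {
    "all_cids": 46_814_409,
    "rule_of_5": 34_343_056,
    "tested": 1_872_313,
    "active": 824_028,
}


def share(part: int, whole: int) -> float:
    """Fraction of `whole` represented by `part`."""
    return part / whole


tested_share = share(stats["tested"], stats["all_cids"])
# Assumes every active compound is also counted among the tested ones.
active_share = share(stats["active"], stats["tested"])
rule_of_5_share = share(stats["rule_of_5"], stats["all_cids"])

for label, value in [("tested / all CIDs", tested_share),
                     ("active / tested", active_share),
                     ("Rule of 5 / all CIDs", rule_of_5_share)]:
    print(f"{label}: {value:.1%}")
```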

Page 8: Exploring Chemical and Biological Knowledge Spaces with PubChem

What is in NCBI Entrez

Many other databases…
• PubMed
• Protein/Nucleotide sequences
• Genes
• Biosystems (metabolic pathways)
• PDB structures (with VAST neighbors)

Text and numeric search fields

Cross-links
• Between databases
• Within databases (neighbors)

Page 9: Exploring Chemical and Biological Knowledge Spaces with PubChem

How Entrez Works

Search results = list of identifiers

Boolean operations on lists (query refinement)

Links from one database to another

PubChem Search

CID List

PubChem Search

CID List

Link to PubMed

PMID List
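The search, refine, and link pattern can be sketched against the public E-utilities endpoints. The helpers below only build request URLs for `esearch.fcgi` and `elink.fcgi` and combine ID lists locally. The function names are made up for this sketch, the example CIDs are arbitrary, and fetching and parsing the responses is left out.

```python
from typing import Iterable, List
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """URL for a text search; the response carries a list of identifiers."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax})


def elink_url(dbfrom: str, db: str, ids: Iterable[int]) -> str:
    """URL that follows links from IDs in one database to another database."""
    id_list = ",".join(str(i) for i in ids)
    return f"{EUTILS}/elink.fcgi?" + urlencode(
        {"dbfrom": dbfrom, "db": db, "id": id_list})


def refine(first: Iterable[int], second: Iterable[int]) -> List[int]:
    """Boolean AND of two result lists, done locally on the identifiers."""
    return sorted(set(first) & set(second))


# Example: search PubChem Compound, then link a CID list to PubMed.
search = esearch_url("pccompound", "aspirin")
link = elink_url("pccompound", "pubmed", [2244, 3672])
print(search)
print(link)
```

Entrez can also combine earlier result sets on the server side. The local `refine` helper is just the simplest way to show that a refinement is an operation on two ID lists.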

Page 10: Exploring Chemical and Biological Knowledge Spaces with PubChem

Limitations of Entrez

Only text or numeric search

Search fields hard to discover

Search fields and defaults vary by database

Chemical structure search, and other specialized algorithms, must be done outside Entrez

The kicker: links are incomplete
• Only 500-10,000 ids!
• Limit also varies by database

Page 11: Exploring Chemical and Biological Knowledge Spaces with PubChem

Working Around the Limitations

Scripting
• E-Utils, PUG SOAP/REST, etc.
• Break queries into smaller chunks

Specialized services
• PubChem’s ID Exchange
• Classification trees (with associated IDs)
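"Break queries into smaller chunks" might look like the following in a script. The link lookup itself is passed in as a function, so the sketch does not depend on any particular service. `linked_ids`, `fetch_links`, and the chunk size of 500 are assumptions chosen for illustration, and the stub stands in for a real E-Utils or PUG call.

```python
from typing import Callable, Iterator, List, Sequence, Set


def chunks(ids: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` identifiers."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def linked_ids(ids: Sequence[int],
               fetch_links: Callable[[Sequence[int]], Set[int]],
               chunk_size: int = 500) -> Set[int]:
    """Run the link lookup one chunk at a time and merge the results."""
    merged: Set[int] = set()
    for batch in chunks(ids, chunk_size):
        merged |= fetch_links(batch)
    return merged


# Stub lookup that records batch sizes; a real one would call a web service.
batch_sizes: List[int] = []


def fake_fetch(batch: Sequence[int]) -> Set[int]:
    batch_sizes.append(len(batch))
    return {cid * 10 for cid in batch}


result = linked_ids(list(range(1, 1201)), fake_fetch, chunk_size=500)
print(batch_sizes, len(result))
```

Keeping each request below the per-database limit means no links are silently dropped, at the cost of more round trips.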

Page 12: Exploring Chemical and Biological Knowledge Spaces with PubChem

What is not in Entrez

… as a database per se, but which may be imported and linked to PubChem

Drugs (sort of but not really)

Targets (again sort of)

Diseases

Patents

Page 13: Exploring Chemical and Biological Knowledge Spaces with PubChem

Some Public Sources of Information Relevant to Drugs and Repurposing

United States (FDA, NLM, NCBI, …)
• ClinicalTrials.gov
• NDF(-RT)
• RxNorm
• HSDB
• MeSH
• DailyMed
• PubMed, PubMed Health
• USPTO

Europe
• ChEBI / ChEMBL
• EPO / WIPO

Canada
• DrugBank

Japan
• KEGG

… not an exhaustive list

… some are linked to PubChem

… some are works in progress

Page 14: Exploring Chemical and Biological Knowledge Spaces with PubChem

MeSH and ChEBI

Chemical structure classification

Biological role

Pharmacological action

Page 15: Exploring Chemical and Biological Knowledge Spaces with PubChem

KEGG and DrugBank

Drug classification

Targets

Page 16: Exploring Chemical and Biological Knowledge Spaces with PubChem

Patents

PubChem depositors
• Per SID:
  ○ Patent IDs
  ○ PubMed IDs

Classifications
• ECLA
• IPC
• USPC
• CPC

Page 17: Exploring Chemical and Biological Knowledge Spaces with PubChem

Aside: Patent Summaries

Page 18: Exploring Chemical and Biological Knowledge Spaces with PubChem

NDF-RT

Molecular interactions

Drug ingredients

Diseases (with drugs)

Physiological effects

Has links to MeSH… which leads to CIDs

Page 19: Exploring Chemical and Biological Knowledge Spaces with PubChem

NDF-RT linked to SID, CID

Page 20: Exploring Chemical and Biological Knowledge Spaces with PubChem

Classifications as Navigation Tools

Where are the CIDs in the tree?

• Example: chemicals affecting serotonin transporters according to KEGG

Page 21: Exploring Chemical and Biological Knowledge Spaces with PubChem

Classifications for Query Refinement

Where are MY CIDs in the tree?

• Example: what diseases are linked by NDF to KEGG’s serotonin transport drugs?
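Answering "where are MY CIDs in the tree?" amounts to intersecting a query set with the CIDs attached to each node and rolling the hits up to ancestor nodes. The sketch below assumes a tree stored as two dictionaries (children per node, directly linked CIDs per node). The node names and CIDs are invented, and this is not the classification browser's actual code.

```python
from typing import Dict, List, Set


def annotate(node: str, children: Dict[str, List[str]],
             node_cids: Dict[str, Set[int]], my_cids: Set[int],
             counts: Dict[str, int]) -> Set[int]:
    """Return the query CIDs found at or below `node`; record the count per node."""
    hits = node_cids.get(node, set()) & my_cids
    for child in children.get(node, []):
        hits = hits | annotate(child, children, node_cids, my_cids, counts)
    counts[node] = len(hits)  # a CID under several branches is counted once
    return hits


# Toy classification tree with placeholder node names.
children = {"root": ["A", "B"], "A": ["A1"]}
node_cids = {"A": {3}, "A1": {1, 2}, "B": {2, 4}}
my_cids = {2, 3, 9}

counts: Dict[str, int] = {}
annotate("root", children, node_cids, my_cids, counts)
print(counts)
```

Nodes with a zero count can be pruned from the display, which is what makes the tree usable for refining a query.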

Page 22: Exploring Chemical and Biological Knowledge Spaces with PubChem

Big Classifications… Some Engineering Required

WIPO IPC

• 72,000 tree nodes

• 6,000,000 CIDs

• 124,000,000 node-CID links

Filtering on the fly:

• 22,000 CIDs from PDB

… interactive!
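One way to keep such filtering interactive is to index the links by CID rather than by node, so the work scales with the size of the query (about 22,000 CIDs in the PDB example) instead of the full set of 124,000,000 node-CID links. The sketch below shows that idea on toy data. It is an assumption about a workable approach, not a description of the engineering behind the slide.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set


def invert(node_cids: Dict[str, Set[int]]) -> Dict[int, Set[str]]:
    """Build a CID -> nodes index once, up front."""
    cid_nodes: Dict[int, Set[str]] = defaultdict(set)
    for node, cids in node_cids.items():
        for cid in cids:
            cid_nodes[cid].add(node)
    return cid_nodes


def filter_counts(cid_nodes: Dict[int, Set[str]],
                  query_cids: Iterable[int]) -> Counter:
    """Count query CIDs per node, touching only the links of the query CIDs."""
    node_hits: Counter = Counter()
    for cid in query_cids:
        node_hits.update(cid_nodes.get(cid, ()))
    return node_hits


# Toy node-CID links with placeholder node names.
toy_links = {"n1": {1, 2, 3}, "n2": {3, 4}}
index = invert(toy_links)
hits = filter_counts(index, {3, 4, 99})
print(dict(hits))
```

The index costs memory proportional to the number of links, but it is built once and reused for every query.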

Page 23: Exploring Chemical and Biological Knowledge Spaces with PubChem

More Space to Explore

Literature (PubMed)

Chemicals (PubChem)

Targets (sequences)

Genes

Diseases

Patents

Drugs

Assays (PubChem)

… and beyond

Page 24: Exploring Chemical and Biological Knowledge Spaces with PubChem

Conclusions

PubChem is…
• A very generalized system
• Based on open data
• Part of the larger Entrez collection

We strive to…
• Make analysis across multiple knowledge spaces accessible and powerful
• Enable hypothesis generation for drug repurposing (as one scenario among many)

Feedback is always welcome! [email protected]

Page 25: Exploring Chemical and Biological Knowledge Spaces with PubChem

Acknowledgements

Evan Bolton

Steve Bryant

Asta Gindulyte (classification front end)

Chris Southan

… Thank You!