ChemSpider as an integration hub for interlinked chemistry data

ChemSpider as an Integration Hub for Interlinked Chemistry Data. Antony Williams, SETAC, November 18th 2013

description

The internet has provided access to unprecedented quantities of data. In chemistry specifically, over the past decade the web has become populated with tens of millions of chemical structures and related properties, both experimental and predicted, together with tens of thousands of spectra and syntheses. The data have, to a large extent, remained disparate and disconnected. In recent years, with the wave of Web 2.0 participation, any chemist can contribute to both the sharing and validation of chemistry-related data, whether via Wikipedia, the online encyclopedia, or one of the multiple public compound databases. Toxicologists commonly wish to source data for reference purposes or to support the development of models; when experimental data are not available, predicted data will suffice. This presentation will offer a perspective on the type and quality of chemistry data available today, our experiences of building the ChemSpider public compound database to link together chemistry on the internet, and our efforts to both encourage and enable even greater integration and connectivity of chemistry data for the community.

Transcript of ChemSpider as an integration hub for interlinked chemistry data

Page 1: ChemSpider as an integration hub for interlinked chemistry data

ChemSpider as an Integration Hub for Interlinked Chemistry Data

Antony Williams

SETAC

November 18th 2013

Page 2: ChemSpider as an integration hub for interlinked chemistry data

How Much Data Online?

• How much data regarding environmental toxicology and chemistry is online?

• How can it all be mapped together?

Page 3: ChemSpider as an integration hub for interlinked chemistry data

A Grand Challenge….

• Let’s map together all historical chemistry data and build systems to integrate new data

• Let’s integrate chemistry, toxicology and biology data and add in disease data too

• Let’s model the data and see if we can extract new relationships – quantitative and qualitative

• Let’s make it all available on the web

Page 4: ChemSpider as an integration hub for interlinked chemistry data
Page 5: ChemSpider as an integration hub for interlinked chemistry data

What about this….

• We’re going to map the world

• We’re going to take photos of as many places as we can and link them together

• We’ll let people annotate and curate the map

• Then let’s make it available free on the web

• We’ll make it available for decision making

• Put it on Mobile Devices, Give it Away

Page 6: ChemSpider as an integration hub for interlinked chemistry data

The World of Online Chemistry

• Property databases

• Compound aggregators

• Screening assay results

• Scientific publications

• Encyclopedic articles (Wikipedia)

• Metabolic pathway databases

• ADME/Tox data – eTOX for example

• Blogs/Wikis and Open Notebook Science

Page 7: ChemSpider as an integration hub for interlinked chemistry data

How to Map Data Together

• Download the structure representations and map together at the structure level

• Integrate and mesh chemical names, chemical properties, analytical data

• Carry URL links and retain external links to original data sets (assume no link decay)

• It sounds easy….
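The mapping steps above can be sketched in a few lines. This is a minimal illustration, not ChemSpider's actual pipeline: the record fields (`structure_key`, `name`, `source`, `url`), the key values and the example.org URLs are all hypothetical. It groups records from different sources under a shared structure key and carries names and original links along.

```python
from collections import defaultdict


def merge_by_structure(records):
    """Group source records that share the same structure key.

    Each merged entry keeps the set of names seen for that structure
    and a list of (source, url) links back to the original data sets.
    """
    merged = defaultdict(lambda: {"names": set(), "links": []})
    for rec in records:
        entry = merged[rec["structure_key"]]
        entry["names"].add(rec["name"])
        entry["links"].append((rec["source"], rec["url"]))
    return dict(merged)


# Hypothetical records from two hypothetical data sources.
records = [
    {"source": "source_a", "structure_key": "KEY-0001",
     "name": "compound X", "url": "https://example.org/a/17"},
    {"source": "source_b", "structure_key": "KEY-0001",
     "name": "X (synonym)", "url": "https://example.org/b/903"},
    {"source": "source_b", "structure_key": "KEY-0002",
     "name": "compound Y", "url": "https://example.org/b/904"},
]

hub = merge_by_structure(records)
for key, entry in sorted(hub.items()):
    print(key, sorted(entry["names"]), len(entry["links"]))
```

The sketch assumes every source already supplies a reliable structure key. In practice that assumption is where most of the difficulty lies, which is why it only sounds easy.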

Page 8: ChemSpider as an integration hub for interlinked chemistry data

ChemSpider

• Build a HUB connecting as many data sources as possible

• NOT to harvest all data from each data source

• Today we have >29 million unique chemicals from >500 data sources

• Focus on improving data quality

• Allow users to enhance, curate and annotate

Page 9: ChemSpider as an integration hub for interlinked chemistry data

RSC’s ChemSpider

Page 10: ChemSpider as an integration hub for interlinked chemistry data

Identifiers are very useful! But what about when they are “closed”?

Page 11: ChemSpider as an integration hub for interlinked chemistry data

CAS Numbers Validation?
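One common first-pass check on a CAS Registry Number is its check digit. Whether this is the validation the slide showed is an assumption. The function name below is hypothetical; the rule itself is the published one: weight the digits before the check digit by their position counted from the right, sum them, and take the result modulo 10.

```python
import re


def cas_check_digit_ok(cas: str) -> bool:
    """Return True if the string is CAS-formatted and its check digit matches."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Rightmost digit gets weight 1, the next weight 2, and so on.
    total = sum(int(d) * weight
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check


print(cas_check_digit_ok("7732-18-5"))  # water
print(cas_check_digit_ok("7732-18-4"))  # same number with a wrong check digit
```

A passing check digit only shows the number is well formed. It says nothing about whether the number is attached to the right structure.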

Page 12: ChemSpider as an integration hub for interlinked chemistry data

Various Registration Numbers

Page 13: ChemSpider as an integration hub for interlinked chemistry data

Mappings and Inconsistencies

PubChem, DrugBank, ChemSpider

Imatinib Mesylate

Page 14: ChemSpider as an integration hub for interlinked chemistry data

The InChI Identifier

Page 15: ChemSpider as an integration hub for interlinked chemistry data

InChI Strings Hash to InChIKeys

Page 16: ChemSpider as an integration hub for interlinked chemistry data

Vancomycin – Search the Internet

Page 17: ChemSpider as an integration hub for interlinked chemistry data

Vancomycin

Search Molecular SKELETON

Search Full Molecule
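The two searches can be sketched in terms of InChIKey blocks, assuming they were run that way. A standard InChIKey has three hyphen-separated blocks of 14, 10 and 1 characters. The first block is derived from the connectivity (the skeleton), while the second also reflects stereochemistry and other layers. The keys below are made-up placeholders, not real compounds, and the function names are hypothetical.

```python
def split_inchikey(key: str):
    """Split an InChIKey into its three blocks, checking the block lengths."""
    parts = key.split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not an InChIKey-shaped string: {key!r}")
    return parts[0], parts[1], parts[2]


def skeleton_matches(query: str, keys):
    """Keys sharing the first (connectivity) block with the query."""
    skeleton = split_inchikey(query)[0]
    return [k for k in keys if split_inchikey(k)[0] == skeleton]


def full_matches(query: str, keys):
    """Keys identical to the query in every block."""
    return [k for k in keys if k == query]


# Placeholder keys: two share a skeleton block, one does not.
indexed = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "DDDDDDDDDDDDDD-BBBBBBBBSA-N",
]
query = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"

print(len(skeleton_matches(query, indexed)), "skeleton hits")
print(len(full_matches(query, indexed)), "full-molecule hits")
```

A skeleton search is the looser of the two, so it returns at least as many hits as the full-molecule search.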

Page 18: ChemSpider as an integration hub for interlinked chemistry data

Full Skeleton Search: 529 Hits

Page 19: ChemSpider as an integration hub for interlinked chemistry data

Full Molecule Search: 294 Hits

Page 20: ChemSpider as an integration hub for interlinked chemistry data

Historical Data for reference

• As evidence that InChI is proliferating and data is improving:

• Three years ago there were only 104 hits on the complete InChI online

• Only 4 were correct

Page 21: ChemSpider as an integration hub for interlinked chemistry data

What you might not know about Chemistry Databases on the Internet

Page 22: ChemSpider as an integration hub for interlinked chemistry data

NCGC Pharma Collection

Page 23: ChemSpider as an integration hub for interlinked chemistry data

NCGC Pharma Collection

Page 24: ChemSpider as an integration hub for interlinked chemistry data

NCGC Pharma Collection

Page 25: ChemSpider as an integration hub for interlinked chemistry data

PHYSPROP Database

• The freely downloadable database underlying the EPI Suite prediction software

• Very basic filters suggest data quality issues

Page 26: ChemSpider as an integration hub for interlinked chemistry data

The Stereochemistry Challenge

12500 chemicals with “missed” stereo

Page 27: ChemSpider as an integration hub for interlinked chemistry data

NIST Webbook

Page 28: ChemSpider as an integration hub for interlinked chemistry data

PubChem

Page 29: ChemSpider as an integration hub for interlinked chemistry data

Patents

Page 30: ChemSpider as an integration hub for interlinked chemistry data

Patents

Page 31: ChemSpider as an integration hub for interlinked chemistry data

But ChemSpider is curated, right?

Page 32: ChemSpider as an integration hub for interlinked chemistry data

Originally 15 compounds “called” Yohimbine

54 Skeletons for Yohimbine

Page 33: ChemSpider as an integration hub for interlinked chemistry data

Crowdsourced Curation

• Crowdsourced curation: identify and tag errors, edit names and synonyms, identify records to deprecate

Page 34: ChemSpider as an integration hub for interlinked chemistry data

Search “Vitamin H”

Page 35: ChemSpider as an integration hub for interlinked chemistry data

“Curate” Identifiers

Page 36: ChemSpider as an integration hub for interlinked chemistry data

“Curate” Identifiers

Page 37: ChemSpider as an integration hub for interlinked chemistry data

“Curate” Identifiers

Page 38: ChemSpider as an integration hub for interlinked chemistry data

Chemical name dictionaries for:

• Text-mining (publications, patents): used to index PubMed and link Google Patents

• Linking to other databases – think Biology! When structures are not available, names link

• Searching the web: names link to structures, which link to InChIs
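A dictionary-based lookup of this kind can be sketched as follows. The tiny dictionary and the record labels are hypothetical stand-ins for a curated name dictionary, and the matching is deliberately simple (case-insensitive whole-name matches only).

```python
import re

# Hypothetical dictionary: lower-case name or synonym -> record label.
NAME_TO_RECORD = {
    "vitamin h": "biotin",
    "biotin": "biotin",
    "vincristine": "vincristine",
}


def find_chemical_names(text, dictionary):
    """Return (offset, matched text, record label) for each dictionary hit."""
    hits = []
    for name, record in dictionary.items():
        # Require non-word characters (or string edges) around the name.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.group(0), record))
    return sorted(hits)


sentence = "Vitamin H and vincristine were both indexed."
for offset, found, record in find_chemical_names(sentence, NAME_TO_RECORD):
    print(offset, found, "->", record)
```

Because synonyms map to the same record label, a name found in a publication or patent can be linked to a structure even when the text contains no structure at all.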

Page 39: ChemSpider as an integration hub for interlinked chemistry data

I want to know about “Vincristine”

Page 40: ChemSpider as an integration hub for interlinked chemistry data

Vincristine: Identifiers to link

Page 41: ChemSpider as an integration hub for interlinked chemistry data

Vincristine: Vendors and Sources Linked by Structure

Page 42: ChemSpider as an integration hub for interlinked chemistry data

Vincristine: Patents Linked by Name

Page 43: ChemSpider as an integration hub for interlinked chemistry data

Vincristine: Articles Linked by Name

Page 44: ChemSpider as an integration hub for interlinked chemistry data

What needs to happen?

• Standards: standardization of structures

• More sharing of data – downloadable data collections for mapping, meshing and integration

• InChI adoption

• Collaboration: stop reinventing the wheel; share data, share efforts and speed the process

Page 45: ChemSpider as an integration hub for interlinked chemistry data

Adopting Modified FDA Rules

Page 46: ChemSpider as an integration hub for interlinked chemistry data

Nitro groups

Page 47: ChemSpider as an integration hub for interlinked chemistry data

Salt and Ionic Bonds

Page 48: ChemSpider as an integration hub for interlinked chemistry data

Ammonium salts

Page 49: ChemSpider as an integration hub for interlinked chemistry data

What if we could capture it all?

Digitally Enhancing the RSC Archive

Page 50: ChemSpider as an integration hub for interlinked chemistry data

Start with data in publications

Page 51: ChemSpider as an integration hub for interlinked chemistry data

Text Mining

The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) were charged into a glass reaction vessel equipped with a mechanical stirrer, thermometer and reflux condenser.

The reaction mixture was heated at reflux with stirring for a period of about one-half hour.

After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue.
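As a rough illustration of what text mining can pull from a passage like this, the sketch below extracts reagent and amount pairs with a regular expression. It is a toy pattern written for this one sentence shape, not the text-mining system used for the RSC archive; the function name and the short list of units are assumptions.

```python
import re

# Reagent name (letters and spaces) followed by "(<number> <unit>)".
# Spaces inside the parentheses are tolerated.
REAGENT_AMOUNT = re.compile(
    r"(?:\band\s+)?"            # skip a leading conjunction
    r"([A-Za-z][A-Za-z ]*?)"    # reagent name
    r"\s*\(\s*"
    r"(\d+(?:\.\d+)?)\s*"       # numeric amount
    r"(ml|g|mg)"                # unit (assumed short list)
    r"\s*\)"
)


def extract_amounts(text):
    """Return (reagent, amount, unit) tuples found in the text."""
    return [(name.strip(), float(amount), unit)
            for name, amount, unit in REAGENT_AMOUNT.findall(text)]


fragment = ("thionyl chloride (5 ml) and benzene (50 ml) were charged "
            "into a glass reaction vessel")
for reagent, amount, unit in extract_amounts(fragment):
    print(reagent, amount, unit)
```

Systematic names such as the urea derivatives in the passage contain digits, hyphens and parentheses, so this pattern would not capture them. Recognising those names is the job of a chemical name dictionary or a dedicated name parser.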

Page 52: ChemSpider as an integration hub for interlinked chemistry data

ChemSpider Reactions

Page 53: ChemSpider as an integration hub for interlinked chemistry data

Turn “Figures” Into Data

FIGURE

EXTRACTED DATA

Page 54: ChemSpider as an integration hub for interlinked chemistry data

Conclusions

• There are some amazing online resources for environmental toxicology and chemistry already!

• ChemSpider has an important role in data quality and in linking resources

• Crowdsourced deposition, validation and curation work

• Standards are an important part of data linking

• MORE collaboration and data sharing can benefit us all

Page 55: ChemSpider as an integration hub for interlinked chemistry data

Thank you

Email: [email protected]

Twitter: @ChemConnector

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams