ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Post on 04-Dec-2014


description

ChemSpider was developed to aggregate and index available sources of chemical structures and their associated information into a single searchable repository, and to make it available to everybody at no charge. There are now hundreds of chemical structure databases covering literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data and more, and no single way to search across them. Despite the diversity of databases available online, their quality, accuracy and completeness are lacking in many regards. ChemSpider was established to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data and experimental properties. ChemSpider has now grown into a database of almost 25 million chemical substances, grows daily, and is integrated with over 400 sources, many of these directly supporting the Life Sciences. This presentation will provide an overview of our efforts to improve the quality of data online, to provide a foundation for a linked web for chemistry and to provide access to a set of online tools and services to support access to these data.

Transcript of ChemSpider – Is This The Future of Linked Chemistry on the Internet?

ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Antony Williams
BAGIM, Boston, August 2010

Our dog has fleas

It’s not an Advantage…

What is the structure of “Advantage”?

Audience Participation Time….

Where would you look? What would you trust? Where would you look ONLINE?

What is the Structure of Vitamin K?

MeSH

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K

What is the Structure of Vitamin K1?

Wikipedia

What is the Structure of Vitamin K1?

CAS’s Common Chemistry

PubChem

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”

Variants of systematic names on PubChem

2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
2-methyl-3-(3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl

Bioassay Data are Associated…

Structures on DailyMed

Lack of Stereochemistry

Does Stereochemistry Matter?

Does one stereocenter matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide

Incorrect Structures

Wow!

ChEBI – Manual Curation

The InChI Identifier

Multiple Layers
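The layered structure of an InChI string can be illustrated with a minimal sketch. The example string (standard InChI for L-alanine) is not from the slides, and the `LAYER_NAMES` table and `split_inchi` helper are illustrative names only; the code does plain string splitting and is not the official InChI software.

```python
# Human-readable names for common InChI layer prefixes (illustrative subset).
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogen atoms",
    "q": "charge",
    "p": "protonation",
    "b": "double-bond stereochemistry",
    "t": "tetrahedral stereochemistry",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
}


def split_inchi(inchi):
    """Split an InChI string into version, formula and prefixed layers.

    Assumes the string carries a formula layer directly after the version.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len(prefix):].split("/")
    layers = [
        (part[0], LAYER_NAMES.get(part[0], "other"), part[1:])
        for part in rest
        if part
    ]
    return {"version": version, "formula": formula, "layers": layers}


# Example string assumed for illustration: L-alanine (not from the slides).
EXAMPLE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

parsed = split_inchi(EXAMPLE)
print(parsed["version"], parsed["formula"])
for layer_prefix, layer_name, body in parsed["layers"]:
    print(f"/{layer_prefix}  {layer_name}: {body}")
```

Dropping the `/t`, `/m` and `/s` layers from such a string gives the description of the same skeleton without stereochemistry, which is why records with and without stereo can still be related to one another.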

InChIStrings Hash to InChIKeys

PubChem InChIKeys

MBWXNTAXLNYFJB-NKFFZRIASA-N
MBWXNTAXLNYFJB-LKUDQCMESA-N
MBWXNTAXLNYFJB-UHFFFAOYSA-N
MBWXNTAXLNYFJB-FAKCLFGASA-N
MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
MBWXNTAXLNYFJB-ODDKJFTJSA-N
MBWXNTAXLNYFJB-KSVLJPARSA-N
MBWXNTAXLNYFJB-UDCSOKOMSA-N
MBWXNTAXLNYFJB-JHBCSKSVSA-N
MBWXNTAXLNYFJB-JXAKDHTRSA-N
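The ten keys listed here share their first block and differ only in the second. A short sketch makes that explicit by splitting each key into its parts and grouping by the first (skeleton) block. The `InChIKeyParts` type and `parse_inchikey` helper are hypothetical names for this illustration; the block layout (14 characters, then 8 hash characters plus a standard flag and a version character, then a protonation character) follows the published InChIKey format.

```python
from collections import defaultdict
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    skeleton: str      # 14-character hash of the connectivity
    details: str       # 8-character hash of stereo, isotopes and other layers
    standard: bool     # True when the flag character is "S"
    version: str       # InChI version character, e.g. "A"
    protonation: str   # "N" for neutral


def parse_inchikey(key):
    """Split an InChIKey into its blocks after a basic length check."""
    first, second, third = key.split("-")
    if len(first) != 14 or len(second) != 10 or len(third) != 1:
        raise ValueError(f"unexpected InChIKey layout: {key}")
    return InChIKeyParts(first, second[:8], second[8] == "S", second[9], third)


# The keys shown on the slide.
KEYS = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "MBWXNTAXLNYFJB-NIHVXYICSA-N",
    "MBWXNTAXLNYFJB-ODDKJFTJSA-N",
    "MBWXNTAXLNYFJB-KSVLJPARSA-N",
    "MBWXNTAXLNYFJB-UDCSOKOMSA-N",
    "MBWXNTAXLNYFJB-JHBCSKSVSA-N",
    "MBWXNTAXLNYFJB-JXAKDHTRSA-N",
]

# Group the second-block hashes under each skeleton block.
groups = defaultdict(set)
for key in KEYS:
    parts = parse_inchikey(key)
    groups[parts.skeleton].add(parts.details)

for skeleton, variants in groups.items():
    print(f"{skeleton}: {len(variants)} distinct stereo/isotope variants")
```

The second block `UHFFFAOYSA` is the value a standard InChIKey carries when the InChI has no stereo or isotopic layers, so that record is the one drawn without stereochemistry.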


InChIs

InChIs are proliferating across databases
InChIs are increasingly used by publishers
Single code base – no multi-flavored SMILES

InChIs are “incomplete” but very useful…

Vancomycin – Search the Internet

Full Skeleton Search: 104 Hits

Full Molecule Search: 4 Hits
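The difference between the two hit counts can be sketched as two matching rules over InChIKeys, assuming that a "full skeleton" search matches on the first block of the key only and a "full molecule" search matches the whole key. The `RECORDS` list, the page labels and both function names are hypothetical; the Vitamin K1 keys come from the earlier slide, and the last key (L-alanine) is not from the slides and is included only as a non-matching record.

```python
# Hypothetical index of web pages and the InChIKey found on each page.
RECORDS = [
    {"source": "page-A", "inchikey": "MBWXNTAXLNYFJB-NKFFZRIASA-N"},
    {"source": "page-B", "inchikey": "MBWXNTAXLNYFJB-UHFFFAOYSA-N"},
    {"source": "page-C", "inchikey": "MBWXNTAXLNYFJB-LKUDQCMESA-N"},
    {"source": "page-D", "inchikey": "MBWXNTAXLNYFJB-NKFFZRIASA-N"},
    {"source": "page-E", "inchikey": "QNAYBMKLOCPYGJ-REOHAPCRSA-N"},
]


def full_molecule_search(query, records):
    """Return records whose InChIKey matches the query exactly."""
    return [r for r in records if r["inchikey"] == query]


def full_skeleton_search(query, records):
    """Return records that share the query's first (connectivity) block."""
    skeleton = query.split("-")[0]
    return [r for r in records if r["inchikey"].split("-")[0] == skeleton]


QUERY = "MBWXNTAXLNYFJB-NKFFZRIASA-N"
molecule_hits = full_molecule_search(QUERY, RECORDS)
skeleton_hits = full_skeleton_search(QUERY, RECORDS)

print("full molecule:", [r["source"] for r in molecule_hits])
print("full skeleton:", [r["source"] for r in skeleton_hits])
```

A skeleton match is the looser rule, so it always returns at least as many hits as a full-molecule match; the extra hits are records drawn with different or missing stereochemistry.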

Is this the structure of Vitamin K1?

Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

Linked Data on the Web

Taken from: Rafael Sidi’s Blog

Where Would You look? What Do You Trust?

Question Everything online: www.dhmo.org

It’s all on Wikipedia…

What’s Methane?

What’s Methane?

What ELSE is Methane???

The EXPERTS must get it right?!

Wikipedia, C&E News, PubChem
C&E News (from ACS)

Feedback from Steve Ritter

“As for where we source our structures, our primary source is the researcher and peer-reviewed papers, because many compounds are novel.

..we always double check them against one or more primary sources, typically Merck Index and SciFinder.

Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”

Feedback from Steve Ritter

“As a rule, we at C&EN don’t use Wikipedia as a primary source for structures or chemical information, and I recommend that policy to anyone.”

“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”

A vision…

Authoritative web-based source of standard, well-drawn structures
With associated data – spectra, property data, ADME/Tox data, Bioassay data
Linked to encyclopedic articles, publications, patents, MSDS/safety sheets
Links to chemical vendors
Links to property predictions

A Pragmatic Vision

“Build a Structure Centric Community”

December 2006 – A hobby project initiated to connect chemistry on the web

Integrate chemical structure data on the web
Create a “structure-based hub” to information and data
Provide access to structure-based “algorithms”
Let chemists contribute their own data
Allow the community to curate/correct data

media.obsessable.com

As few interfaces as possible

What do humans want?

www.chemspider.com

We’re Out to Answer Questions

Questions a chemist might ask…
What is the melting point of n-heptanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?

Search for a Chemical…by name

Available Information…

Linked to vendors, safety data, toxicity, metabolism

Available Information….

Search for a chemical…by structure
Substructure search coming…

Annotating, Cleaning and Growing...

Almost 25 million chemicals from 400 diverse data sources

“Diverse” data sources…
High Quality through questionable to wrong
Rich content of Wikipedia links, YouTube videos and photographs to “Stub Records” containing “just a structure”
All records can be further enhanced…
25 million compounds need annotation by the masses

Search “Vitamin H”

Search “Vitamin H”

“Curate” Identifiers

“Curate” Identifiers

“Curate” Identifiers

“Curate” Identifiers

General curation activities
Remove incorrect names
Correct spellings
Remove names with/without stereo compared to the structure
Correct registry numbers and other numeric identifiers (Beilstein, EINECS etc)
Add multilingual names
Add alternative names
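One curation step listed above, correcting registry numbers, lends itself to an automated first pass because CAS Registry Numbers carry a check digit. The sketch below implements the standard CAS checksum (digits weighted 1, 2, 3… from the right, sum taken modulo 10). The `is_valid_cas` function name is hypothetical, and this is not a description of how ChemSpider itself validates identifiers. A passing checksum only shows that a number is well formed, not that it belongs to the structure it is attached to.

```python
import re


def is_valid_cas(rn):
    """Check the format and check digit of a CAS Registry Number."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", rn)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("7732-18-5"))  # water
print(is_valid_cas("50-78-2"))    # aspirin
print(is_valid_cas("7732-18-4"))  # wrong check digit
```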

Crowdsourced “Annotations”

Registered Users can add
Descriptions/Syntheses/Commentaries
Links to PubMed articles
Links to articles via DOIs
Add spectral data
Add Crystallographic Information Files
Add photos
Add MP3 files
Add Videos

Spectra Linked

Spectra Linked

Link off a structure in ChemSpider

Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”

Semantic Markup: Project Prospect

Success Depends on Dictionaries

Semantic Linking of Structures

What would you want to link off a structure?
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”

“Chemicalizing” Pages

“Chemicalizing” Pages

ChemSpider SyntheticPages

ChemSpider SyntheticPages

ChemSpider Everywhere: What do computers want?

Web services

Web Services

ChemSpider Everywhere

Linked from Wikipedia and many Public Databases

Linked from Open Notebook Science sites

Linked from Blogs using Structure/Spectra EMBED

Integrated into structure drawing packages

Integrated to software offerings from Thermo, Waters, Agilent, Bruker

Structure Database Lookup

Structure Database Lookup

Reaction Database Look-up

Reaction Database Look-up

There will always be gaps...

What ChemSpider does not deal with, yet...

Materials
Minerals
Polymers
Biological macromolecules

ChemSpider Tomorrow

6 months: >1.2M compounds/month
6 months: >800,000 new uniques
6 months: >60 new data sources added

Continue the curation effort and keep cleaning
Finish depositions – millions left to deposit
Integrate RSC content – a massive archive!
Integrate RSC publishing workflows and databases
Enable the semantic web for chemistry – RDF was layered on last week
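What "layering on RDF" means in practice can be sketched by expressing a compound record as subject, predicate, object triples and writing them in N-Triples syntax. Everything under `example.org` is a placeholder, the `inchikey` property is hypothetical, and the helper names are illustrative; this is not ChemSpider's actual RDF vocabulary. Only the `owl:sameAs` and `rdfs:label` predicate URIs are real W3C terms, and the InChIKey is one of those shown earlier in the slides.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
# Hypothetical property, used here only for illustration.
EX_INCHIKEY = "http://example.org/vocab/inchikey"


def uri(value):
    """Format a URI term for N-Triples."""
    return f"<{value}>"


def literal(text):
    """Format a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one triple per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Placeholder URIs for one compound as held by two different databases.
compound = "http://example.org/compound/vitamin-k1"
other_db = "http://example.org/otherdb/record/123"

triples = [
    (uri(compound), uri(RDFS_LABEL), literal("Vitamin K1")),
    (uri(compound), uri(EX_INCHIKEY), literal("MBWXNTAXLNYFJB-UHFFFAOYSA-N")),
    (uri(compound), uri(OWL_SAME_AS), uri(other_db)),
]

output = to_ntriples(triples)
print(output)
```

The last triple is the "linked" part: it asserts that two records in different databases describe the same thing, which is only as trustworthy as the structure validation behind it.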

The Future of Linked Chemistry on the Internet?
I can buy my wife a “methane ring” for Xmas
There are more than 10 compounds called Vitamin K1 on PubChem…
Most databases online cannot be annotated
The public funds the generation of data that is then mis-associated, cannot be used for modeling, for reference, for…
Low quality databases become authorities
The community accepts the status quo

The PREFERABLE Future of Linked Chemistry on the Internet?
Public compound databases federate to build a truly linked environment of validated data!
Data validation needs are not ignored
Publishers layer on information to make publications discoverable
Public-Private databases can be linked
Open Data proliferate
RDF is everywhere

Business models WILL change

Thank you

Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams