ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Antony Williams, BAGIM, Boston, August 2010

description

ChemSpider was developed with the intention of aggregating and indexing available sources of chemical structures and their associated information into a single searchable repository and making it available to everybody, at no charge. There are now hundreds of chemical structure databases covering literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data and more, and no single way to search across them. Despite the diversity of databases available online, their inherent quality, accuracy and completeness are lacking in many regards. ChemSpider was established to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data and experimental properties. ChemSpider has now grown into a database of almost 25 million chemical substances, grows daily, and is integrated with over 400 sources, many of these directly supporting the Life Sciences. This presentation will provide an overview of our efforts to improve the quality of data online, to provide a foundation for a linked web for chemistry and to provide access to a set of online tools and services to support access to these data.

Transcript of ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Page 1: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Antony Williams
BAGIM, Boston, August 2010

Page 2: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Our dog has fleas

Page 3: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

It’s not an Advantage…

Page 4: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

What is the structure of “Advantage”?

Audience Participation Time….

Where would you look? What would you trust? Where would you look ONLINE?

Page 5: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

What is the Structure of Vitamin K?

Page 6: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

MeSH

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.

Page 7: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

What is the Structure of Vitamin K1?

Page 8: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Wikipedia

Page 9: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

What is the Structure of Vitamin K1?

Page 10: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

CAS’s Common Chemistry

Page 11: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

PubChem

Page 12: ChemSpider – Is This The Future of Linked Chemistry on the Internet?
Page 13: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”

Variants of systematic names on PubChem

2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
2-methyl-3-(3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl

Page 14: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Bioassay Data are Associated…

Page 15: ChemSpider – Is This The Future of Linked Chemistry on the Internet?
Page 16: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Structures on DailyMed

Page 17: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Lack of Stereochemistry

Page 18: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Does Stereochemistry Matter?

Page 19: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Does one stereocenter matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide

Page 20: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Incorrect Structures

Page 21: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Wow!

Page 22: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

ChEBI – Manual Curation

Page 23: ChemSpider – Is This The Future of Linked Chemistry on the Internet?
Page 24: ChemSpider – Is This The Future of Linked Chemistry on the Internet?
Page 25: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

The InChI Identifier

Page 26: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Multiple Layers
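
A minimal sketch of what "multiple layers" means in practice, assuming a standard InChI string: the identifier is a `/`-separated sequence in which the version and the chemical formula come first and each later layer is introduced by a one-letter prefix (for example `c` for connections and `h` for hydrogens). The function name `split_inchi_layers` and the ethanol example are illustrative assumptions and are not taken from the slides.

```python
def split_inchi_layers(inchi):
    """Split an InChI string into a dict of its layers.

    Assumes the usual layout 'InChI=<version>/<formula>/<prefixed layers...>'.
    This simple sketch keeps only the last occurrence if a prefix repeats.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if not part:
            continue  # skip empty segments defensively
        # The first character names the layer, the rest is its content.
        layers[part[0]] = part[1:]
    return layers


# Ethanol: version 1S, formula, connection layer (c) and hydrogen layer (h).
ethanol = split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
```

Stereochemistry and isotopes, where present, appear as further prefixed layers, which is why two records for "the same" compound can share the early layers and still differ.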

Page 27: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

InChIStrings Hash to InChIKeys

Page 28: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

PubChem InChIKeys

MBWXNTAXLNYFJB-NKFFZRIASA-N
MBWXNTAXLNYFJB-LKUDQCMESA-N
MBWXNTAXLNYFJB-UHFFFAOYSA-N
MBWXNTAXLNYFJB-FAKCLFGASA-N
MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
MBWXNTAXLNYFJB-ODDKJFTJSA-N
MBWXNTAXLNYFJB-KSVLJPARSA-N
MBWXNTAXLNYFJB-UDCSOKOMSA-N
MBWXNTAXLNYFJB-JHBCSKSVSA-N
MBWXNTAXLNYFJB-JXAKDHTRSA-N
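
The ten keys above all begin with the same 14 letters. As a rough sketch of why that matters: in a standard InChIKey the first block is a hash of the connectivity (the skeleton), while the second block hashes the remaining layers such as stereochemistry and isotopes. The helper names below are illustrative assumptions; the keys are the ones listed on the slide.

```python
from collections import defaultdict

# The ten InChIKeys listed on the slide.
SLIDE_KEYS = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "MBWXNTAXLNYFJB-NIHVXYICSA-N",  # marked "O-18 label" on the slide
    "MBWXNTAXLNYFJB-ODDKJFTJSA-N",
    "MBWXNTAXLNYFJB-KSVLJPARSA-N",
    "MBWXNTAXLNYFJB-UDCSOKOMSA-N",
    "MBWXNTAXLNYFJB-JHBCSKSVSA-N",
    "MBWXNTAXLNYFJB-JXAKDHTRSA-N",
]


def split_inchikey(key):
    """Return the (skeleton, detail, protonation) blocks of an InChIKey."""
    blocks = key.split("-")
    if [len(block) for block in blocks] != [14, 10, 1]:
        raise ValueError(f"unexpected InChIKey layout: {key!r}")
    return tuple(blocks)


def group_by_skeleton(keys):
    """Map each 14-character first block to the second blocks seen with it."""
    groups = defaultdict(list)
    for key in keys:
        skeleton, detail, _ = split_inchikey(key)
        groups[skeleton].append(detail)
    return dict(groups)


groups = group_by_skeleton(SLIDE_KEYS)
for skeleton, details in groups.items():
    print(skeleton, len(details), "variants")
```

Grouping this way collapses all ten records onto one skeleton, leaving the second block to distinguish the stereo and isotope variants.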

Page 29: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

PubChem InChIKeys

MBWXNTAXLNYFJB-NKFFZRIASA-N
MBWXNTAXLNYFJB-LKUDQCMESA-N
MBWXNTAXLNYFJB-UHFFFAOYSA-N
MBWXNTAXLNYFJB-FAKCLFGASA-N
MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
MBWXNTAXLNYFJB-ODDKJFTJSA-N
MBWXNTAXLNYFJB-KSVLJPARSA-N
MBWXNTAXLNYFJB-UDCSOKOMSA-N
MBWXNTAXLNYFJB-JHBCSKSVSA-N
MBWXNTAXLNYFJB-JXAKDHTRSA-N

Page 30: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

InChIs

InChIs are proliferating across databases
InChIs are increasingly used by publishers
Single code base – no multi-flavored SMILES

InChIs are “incomplete” but very useful…

Page 31: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Vancomycin – Search the Internet

Page 32: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Full Skeleton Search: 104 Hits

Page 33: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Full Molecule Search: 4 Hits
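
One plausible reading of the two searches, sketched under assumptions: a "full skeleton" search matches only the first block of the InChIKey, so it also returns stereo and isotope variants, while a "full molecule" search requires the whole key to match. The function names and the tiny in-memory index are hypothetical; the sample keys are reused from the PubChem slide rather than being vancomycin keys, the `AAAAAAAAAAAAAA` entry is a made-up placeholder and not a real compound, and the hit counts on the slides (104 and 4) come from a web search that is not reproduced here.

```python
def skeleton_search(query_key, index):
    """Return every key in the index sharing the query's first (skeleton) block."""
    skeleton = query_key.split("-")[0]
    return [key for key in index if key.split("-")[0] == skeleton]


def full_molecule_search(query_key, index):
    """Return only the keys that match the query exactly."""
    return [key for key in index if key == query_key]


# Hypothetical index: three keys from the PubChem slide plus one placeholder.
INDEX = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",  # placeholder, not a real compound
]

query = "MBWXNTAXLNYFJB-NKFFZRIASA-N"
skeleton_hits = skeleton_search(query, INDEX)
exact_hits = full_molecule_search(query, INDEX)
print(len(skeleton_hits), "skeleton hits;", len(exact_hits), "full-molecule hits")
```

The skeleton search is the broader of the two by construction, which is consistent with the larger hit count reported on the slide.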

Page 34: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Is this the structure of Vitamin K1?

Page 35: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Where is chemistry online?

Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

Page 36: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Linked Data on the Web

Taken from: Rafael Sidi’s blog

Page 37: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Where Would You look? What Do You Trust?

Page 38: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Question Everything online: www.dhmo.org

Page 39: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

It’s all on Wikipedia…

Page 40: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

What’s Methane?

Page 41: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

What’s Methane?

Page 42: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

What ELSE is Methane???

Page 43: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

The EXPERTS must get it right?!

Page 44: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Wikipedia, C&E News, PubChem C&E News (from ACS)

Page 45: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Feedback from Steve Ritter

“As for where we source our structures, our primary source is the researcher and peer-reviewed papers, because many compounds are novel.

..we always double check them against one or more primary sources, typically Merck Index and SciFinder.

Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”

Page 46: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Feedback from Steve Ritter

“As a rule, we at C&EN don’t use Wikipedia as a primary source for structures or chemical information, and I recommend that policy to anyone.”

“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”

Page 47: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

A vision…

Authoritative web-based source of standard, well-drawn structures
With associated data – spectra, property data, ADME/Tox data, Bioassay data
Linked to encyclopedic articles, publications, patents, MSDS/safety sheets
Links to chemical vendors
Links to property predictions

Page 48: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

A Pragmatic Vision

“Build a Structure Centric Community”

December 2006 – A hobby project initiated to connect chemistry on the web

Integrate chemical structure data on the web
Create a “structure-based hub” to information and data
Provide access to structure-based “algorithms”
Let chemists contribute their own data
Allow the community to curate/correct data

Page 49: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

media.obsessable.com

As few interfaces as possible

What do humans want?

Page 50: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

www.chemspider.com

Page 51: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

We’re Out to Answer Questions

Questions a chemist might ask…

What is the melting point of n-heptanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?

Page 52: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Search for a Chemical…by name

Page 53: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Available Information…

Linked to vendors, safety data, toxicity, metabolism

Page 54: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Available Information….

Page 55: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Search for a chemical…by structure

Substructure search coming…

Page 56: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Annotating, Cleaning and Growing...

Almost 25 million chemicals from 400 diverse data sources

“Diverse” data sources…
High quality through questionable to wrong
Rich content of Wikipedia links, YouTube videos and photographs to “Stub Records” containing “just a structure”

All records can be further enhanced…
25 million compounds need annotation by the masses

Page 57: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Search “Vitamin H”

Page 58: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Search “Vitamin H”

Page 59: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

“Curate” Identifiers

Page 60: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

“Curate” Identifiers

Page 61: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

“Curate” Identifiers

Page 62: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

“Curate” Identifiers

General curation activities

Remove incorrect names
Correct spellings
Remove names with/without stereo compared to the structure
Correct registry numbers and other numeric identifiers (Beilstein, EINECS etc)
Add multilingual names
Add alternative names
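
As an illustration of one curation step that can be partly automated, the sketch below checks a CAS Registry Number against its check digit: the final digit should equal the weighted sum of the preceding digits (weights 1, 2, 3, … counted from the right) modulo 10. This is an assumed helper for illustration, not a description of how ChemSpider validates identifiers, and a number that passes the check can still be attached to the wrong structure.

```python
import re


def is_valid_cas_rn(cas_rn):
    """Return True if the string is a well-formed CAS RN with a correct check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_rn)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas_rn("7732-18-5"))  # water
print(is_valid_cas_rn("50-78-2"))    # aspirin
print(is_valid_cas_rn("50-78-3"))    # aspirin with a corrupted check digit
```

A check like this catches transcription errors cheaply and leaves the harder question, whether the number belongs to the displayed structure, to human curators.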

Page 63: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Crowdsourced “Annotations”

Registered Users can add

Descriptions/Syntheses/Commentaries
Links to PubMed articles
Links to articles via DOIs
Add spectral data
Add Crystallographic Information Files
Add photos
Add MP3 files
Add Videos

Page 64: ChemSpider – Is This The Future of Linked Chemistry on the Internet?
Page 65: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Spectra Linked

Page 66: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Spectra Linked

Page 67: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Link off a structure in ChemSpider

Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”

Page 68: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Semantic Markup: Project Prospect

Page 69: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Success Depends on Dictionaries

Page 70: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Semantic Linking of Structures

What would you want to link off a structure?

Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”

Page 71: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

“Chemicalizing” Pages

Page 72: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

“Chemicalizing” Pages

Page 73: ChemSpider – Is This The Future of Linked Chemistry on the Internet?
Page 74: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

ChemSpider SyntheticPages

Page 75: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

ChemSpider SyntheticPages

Page 76: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

ChemSpider Everywhere: What do computers want?

Web services

Page 77: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Web Services

Page 78: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

ChemSpider Everywhere

Linked from Wikipedia and many Public Databases

Linked from Open Notebook Science sites

Linked from Blogs using Structure/Spectra EMBED

Integrated into structure drawing packages

Integrated into software offerings from Thermo, Waters, Agilent, Bruker

Page 79: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Structure Database Lookup

Page 80: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Structure Database Lookup

Page 81: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Reaction Database Look-up

Page 82: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Reaction Database Look-up

Page 83: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

There will always be gaps...

What ChemSpider does not deal with, yet...

Materials
Minerals
Polymers
Biological macromolecules

Page 84: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

ChemSpider Tomorrow

6 months: >1.2M compounds/month
6 months: >800,000 new uniques
6 months: >60 new data sources added

Continue the curation effort and keep cleaning
Finish depositions – millions left to deposit
Integrate RSC content – a massive archive!
Integrate RSC publishing workflows and databases
Enable the semantic web for chemistry – RDF was layered on last week

Page 85: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

The Future of Linked Chemistry on the Internet?

I can buy my wife a “methane ring” for Xmas
There are more than 10 compounds called Vitamin K1 on PubChem…
Most databases online cannot be annotated
The public funds the generation of data that is then mis-associated, cannot be used for modeling, for reference, for…
Low quality databases become authorities
The community accepts the status quo

Page 86: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

The PREFERABLE Future of Linked Chemistry on the Internet?

Public compound databases federate to build a truly linked environment of validated data!
Data validation needs are not ignored
Publishers layer on information to make publications discoverable
Public-Private databases can be linked
Open Data proliferate
RDF is everywhere

Business models WILL change

Page 87: ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Thank you

Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams