Chemicals, Chemical Identifiers and Navigating Through Databases

Antony Williams, UNC Chapel Hill, October 2010

description

This presentation was given to a group of students at the UNC Eshelman School of Pharmacy. As chemists, many of us want to find information that is high quality, accurate and relevant to our query. With the proliferation of online chemistry resources, it is very common for us to turn to them to source data. However, are resources such as Wikipedia, PubChem and the plethora of databases delivering information for metabolism, medicinal chemistry and synthetic chemistry trustworthy? Which of these resources, if any, should be treated as authorities? What is the most integrated approach to finding chemistry-related data online? What approaches can be taken to validate the data that is available, and how can individual scientists help to improve the content and quality of chemistry-related data on the web?

Antony Williams is ChemSpiderman. He started the ChemSpider database (www.chemspider.com) as a hobby to deliver a free platform for the community to source chemistry-related data. Within three years the system was acquired by the Royal Society of Chemistry. It now serves up close to 25 million chemical structures linked to over 400 data sources across the internet, and offers individual scientists the opportunity to host and share their data with the community and to participate in data curation and annotation.

Tony will share his experiences of building this chemistry database, with a focus on data validation, curation and sourcing high-quality data. During the presentation he will discuss ways to check chemical structure representations before submission to public systems for searching, and provide an overview of chemical identifiers such as SMILES strings and the International Chemical Identifier (InChI), which allow resources to be interlinked. Attendees can expect to leave the session with a deeper understanding of using the internet to find chemistry-related data.

Transcript of Chemicals, Chemical Identifiers and Navigating Through Databases

Page 1: Chemicals, Chemical Identifiers and Navigating Through Databases

Chemicals, Chemical Identifiers and Navigating Through Databases

Antony Williams
UNC Chapel Hill, October 2010

Page 2: Chemicals, Chemical Identifiers and Navigating Through Databases

Chemistry on the Internet

Where do you source chemistry information?
What can you trust online?
How can you recognize potential issues?
Cross-referencing and curating data

Page 3: Chemicals, Chemical Identifiers and Navigating Through Databases

What is the Structure of Vitamin K?

Page 4: Chemicals, Chemical Identifiers and Navigating Through Databases

MeSH

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K

Page 5: Chemicals, Chemical Identifiers and Navigating Through Databases

What is the Structure of Vitamin K1?

Page 6: Chemicals, Chemical Identifiers and Navigating Through Databases

Wikipedia

Page 7: Chemicals, Chemical Identifiers and Navigating Through Databases

What is the Structure of Vitamin K1?

Page 8: Chemicals, Chemical Identifiers and Navigating Through Databases

CAS’s Common Chemistry

Page 9: Chemicals, Chemical Identifiers and Navigating Through Databases

PubChem

Page 10: Chemicals, Chemical Identifiers and Navigating Through Databases
Page 11: Chemicals, Chemical Identifiers and Navigating Through Databases

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”

Variants of systematic names on PubChem

2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
2-methyl-3-(3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl

Page 12: Chemicals, Chemical Identifiers and Navigating Through Databases

Bioassay Data are Associated…

Page 13: Chemicals, Chemical Identifiers and Navigating Through Databases
Page 14: Chemicals, Chemical Identifiers and Navigating Through Databases

Lack of Stereochemistry

Page 15: Chemicals, Chemical Identifiers and Navigating Through Databases

ChEBI – Manual Curation

Page 16: Chemicals, Chemical Identifiers and Navigating Through Databases
Page 17: Chemicals, Chemical Identifiers and Navigating Through Databases
Page 18: Chemicals, Chemical Identifiers and Navigating Through Databases
Page 19: Chemicals, Chemical Identifiers and Navigating Through Databases

Molfiles (http://en.wikipedia.org/wiki/Chemical_table_file)

Page 20: Chemicals, Chemical Identifiers and Navigating Through Databases

Molfiles

 10  9  0  0  1  0  0  0  0  0  1 V2000
   31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  3  1  2  0  0  0  0
  4  1  1  0  0  0  0
  9  1  1  0  0  0  0
  7  2  1  0  0  0  0
  5  2  2  0  0  0  0
  8  2  1  0  0  0  0
  6  4  1  0  0  0  0
  4 10  1  6  0  0  0
  7  6  1  0  0  0  0
M  END

Page 21: Chemicals, Chemical Identifiers and Navigating Through Databases

Molfiles

Molfiles are the primary exchange format between structure drawing packages
Can be different between different drawing packages
Most commonly carry X,Y coordinates for layout
Can support polymers, organometallics, etc.
Can carry 3D coordinates
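
To make the layout of the connection table on the previous slide concrete, here is a minimal Python sketch that reads its counts line, atom block and bond block. It is only an illustration under stated assumptions: fields are split on whitespace (real V2000 files are fixed-width, so columns can run together), the three header lines of a complete molfile are absent as on the slide, and the function name parse_ctab is hypothetical.

```python
from collections import Counter

# Connection table copied from the Molfiles slide (no header lines).
CTAB = """\
 10  9  0  0  1  0  0  0  0  0  1 V2000
   31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  3  1  2  0  0  0  0
  4  1  1  0  0  0  0
  9  1  1  0  0  0  0
  7  2  1  0  0  0  0
  5  2  2  0  0  0  0
  8  2  1  0  0  0  0
  6  4  1  0  0  0  0
  4 10  1  6  0  0  0
  7  6  1  0  0  0  0
M  END
"""


def parse_ctab(text):
    """Return (atoms, bonds) from a header-less V2000 connection table."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    counts = lines[0].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for ln in lines[1:1 + n_atoms]:
        x, y, z, symbol = ln.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for ln in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        # first atom, second atom (1-based), bond order, bond stereo flag
        a, b, order, stereo = (int(v) for v in ln.split()[:4])
        bonds.append((a, b, order, stereo))
    return atoms, bonds


atoms, bonds = parse_ctab(CTAB)
composition = Counter(symbol for symbol, *_ in atoms)
# In V2000 bond blocks a stereo flag of 1 means wedge (up) and 6 means hash (down).
stereo_bonds = [b for b in bonds if b[3] != 0]
print(dict(composition), stereo_bonds)
```

The only stereo information in this record is the single hashed bond from atom 4 to atom 10; it is interpreted relative to the X,Y coordinates, which is why a molfile without coordinates cannot carry wedge-based stereo reliably.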

Page 22: Chemicals, Chemical Identifiers and Navigating Through Databases

SMILES (http://en.wikipedia.org/wiki/SMILES)

SMILES is a common format
Can support polymers, organometallics, etc.
Does NOT carry X,Y or Z coordinates for layout so requires layout algorithms – can be problematic!
Generally different between drawing packages

Page 23: Chemicals, Chemical Identifiers and Navigating Through Databases

Stereo

Page 24: Chemicals, Chemical Identifiers and Navigating Through Databases

Tautomers

Page 25: Chemicals, Chemical Identifiers and Navigating Through Databases

SMILES

ACD/Labs: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O

OpenEye: CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C

ChEMBL: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C
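
The three strings on this slide look different but are meant to describe the same compound. A crude way to see that they are at least consistent is to count heavy atoms and marked stereocentres in each one. The Python sketch below does only that, with a deliberately simple tokenizer; it is not canonicalization, which needs a cheminformatics toolkit, and the function names are hypothetical.

```python
import re
from collections import Counter

# Bracket atoms, then two-letter and one-letter organic-subset atoms, then aromatic atoms.
TOKEN = re.compile(r"\[([^\]]+)\]|(Cl|Br|[BCNOPSFI])|([bcnops])")
# Element symbol at the start of a bracket atom, after an optional isotope number.
BRACKET_ELEMENT = re.compile(r"\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atoms(smiles):
    """Count non-hydrogen atoms by element."""
    counts = Counter()
    for bracket, organic, aromatic in TOKEN.findall(smiles):
        if bracket:
            symbol = BRACKET_ELEMENT.match(bracket).group(1).capitalize()
        else:
            symbol = organic or aromatic.upper()
        if symbol != "H":
            counts[symbol] += 1
    return counts


def marked_stereocentres(smiles):
    """Count bracket atoms that carry an @ or @@ chirality mark."""
    return sum(1 for bracket, _, _ in TOKEN.findall(smiles) if "@" in bracket)


SMILES = {
    "ACD/Labs": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": r"CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C",
}

for source, smiles in SMILES.items():
    print(source, dict(heavy_atoms(smiles)), marked_stereocentres(smiles))
```

Note that @ and @@ are defined relative to the order in which neighbours are written, so the OpenEye string using @ where the others use @@ does not by itself indicate a different stereoisomer.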

Page 26: Chemicals, Chemical Identifiers and Navigating Through Databases

The InChI Identifier

Page 27: Chemicals, Chemical Identifiers and Navigating Through Databases

InChI

SINGLE code base managed by IUPAC – integrated into drawing packages. No variability as with SMILES

InChI Strings can be reversed to structures – same problem as with SMILES – no layout

Well adopted by the community (databases, publishers, blogs, Wikipedia) – good for searching the internet

Page 28: Chemicals, Chemical Identifiers and Navigating Through Databases

Multiple Layers
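
As a small illustration of the layered design, the Python sketch below splits an InChI string into its layers. The example string is the standard InChI of L-alanine, which is not taken from the slides, and split_inchi is a hypothetical helper. It assumes a simple InChI with a formula layer; strings with fixed-H or isotopic sections can repeat layer prefixes, which is why the result is a list of pairs rather than a dictionary.

```python
def split_inchi(inchi):
    """Split an InChI string into (layer, value) pairs, in order."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError(f"not an InChI string: {inchi!r}")
    version, formula, *rest = body.split("/")
    layers = [("version", version), ("formula", formula)]
    for part in rest:
        # Every later layer starts with a one-letter prefix: c, h, q, p, b, t, m, s, ...
        layers.append((part[0], part[1:]))
    return layers


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

for layer, value in split_inchi(L_ALANINE):
    print(f"{layer:8} {value}")
```

In this output the c layer is the connectivity, the h layer holds the hydrogens (the "(H,5,6)" term is a mobile hydrogen shared between atoms 5 and 6), and the t, m and s layers carry tetrahedral stereo. Double bond orientation, when present, appears in a b layer.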

Page 29: Chemicals, Chemical Identifiers and Navigating Through Databases

Tautomers – “Mobile H Perception”

Page 30: Chemicals, Chemical Identifiers and Navigating Through Databases

Double Bond Orientation

Page 31: Chemicals, Chemical Identifiers and Navigating Through Databases

Stereo

Page 32: Chemicals, Chemical Identifiers and Navigating Through Databases

Checking for Stereochemistry

Page 33: Chemicals, Chemical Identifiers and Navigating Through Databases

Checking for Stereochemistry

Use your drawing package!

Page 34: Chemicals, Chemical Identifiers and Navigating Through Databases

Checking for Stereochemistry

Page 35: Chemicals, Chemical Identifiers and Navigating Through Databases

Checking for Stereochemistry

Page 36: Chemicals, Chemical Identifiers and Navigating Through Databases

Checking for Stereochemistry

Page 37: Chemicals, Chemical Identifiers and Navigating Through Databases

InChI Strings Hash to InChIKeys

Page 38: Chemicals, Chemical Identifiers and Navigating Through Databases
Page 39: Chemicals, Chemical Identifiers and Navigating Through Databases

PubChem InChIKeys

MBWXNTAXLNYFJB-NKFFZRIASA-N
MBWXNTAXLNYFJB-LKUDQCMESA-N
MBWXNTAXLNYFJB-UHFFFAOYSA-N
MBWXNTAXLNYFJB-FAKCLFGASA-N
MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
MBWXNTAXLNYFJB-ODDKJFTJSA-N
MBWXNTAXLNYFJB-KSVLJPARSA-N
MBWXNTAXLNYFJB-UDCSOKOMSA-N
MBWXNTAXLNYFJB-JHBCSKSVSA-N
MBWXNTAXLNYFJB-JXAKDHTRSA-N

Page 40: Chemicals, Chemical Identifiers and Navigating Through Databases

PubChem InChIKeys

MBWXNTAXLNYFJB-NKFFZRIASA-N
MBWXNTAXLNYFJB-LKUDQCMESA-N
MBWXNTAXLNYFJB-UHFFFAOYSA-N
MBWXNTAXLNYFJB-FAKCLFGASA-N
MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
MBWXNTAXLNYFJB-ODDKJFTJSA-N
MBWXNTAXLNYFJB-KSVLJPARSA-N
MBWXNTAXLNYFJB-UDCSOKOMSA-N
MBWXNTAXLNYFJB-JHBCSKSVSA-N
MBWXNTAXLNYFJB-JXAKDHTRSA-N
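
The keys listed on this slide all share their first block and differ only in the second. The Python sketch below splits each key into its parts to make that visible. It assumes the usual 27-character InChIKey layout (a 14-letter skeleton hash, an 8-letter hash of the remaining layers such as stereo and isotopes, a standard flag, a version letter and a protonation letter); the helper name split_inchikey is hypothetical.

```python
import re
from collections import defaultdict

INCHIKEY = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")

# Second-block hash seen when an InChI has no stereo or isotopic layers.
NO_DETAIL = "UHFFFAOY"


def split_inchikey(key):
    """Split an InChIKey into its blocks, or raise ValueError."""
    match = INCHIKEY.match(key)
    if match is None:
        raise ValueError(f"not an InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,
        "detail": detail,
        "standard": flag == "S",
        "version": version,
        "protonation": protonation,
    }


# Keys copied from the slide.
KEYS = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "MBWXNTAXLNYFJB-NIHVXYICSA-N",  # O-18 label
    "MBWXNTAXLNYFJB-ODDKJFTJSA-N",
    "MBWXNTAXLNYFJB-KSVLJPARSA-N",
    "MBWXNTAXLNYFJB-UDCSOKOMSA-N",
    "MBWXNTAXLNYFJB-JHBCSKSVSA-N",
    "MBWXNTAXLNYFJB-JXAKDHTRSA-N",
]

by_skeleton = defaultdict(list)
for key in KEYS:
    by_skeleton[split_inchikey(key)["skeleton"]].append(key)

without_stereo = [k for k in KEYS if split_inchikey(k)["detail"] == NO_DETAIL]

print(len(by_skeleton), "skeleton(s);", "no stereo or isotope layers:", without_stereo)
```

One skeleton with ten different second blocks means ten records that agree on connectivity but disagree on stereochemistry or isotopic labelling, which is the situation the slide is pointing at.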

Page 41: Chemicals, Chemical Identifiers and Navigating Through Databases

Databases and Standardization

Page 42: Chemicals, Chemical Identifiers and Navigating Through Databases

Databases and Standardization

Page 43: Chemicals, Chemical Identifiers and Navigating Through Databases

InChI

No support for polymers, organometallics

Many option settings can lead to variability and make integration across databases difficult – FixedH option especially problematic

“Slight” chance of collisions of InChIKeys

VERY USEFUL FOR INTEGRATING THE WEB

Page 44: Chemicals, Chemical Identifiers and Navigating Through Databases

Vancomycin

Page 45: Chemicals, Chemical Identifiers and Navigating Through Databases

Vancomycin

Search Molecular SKELETON

Search Full Molecule

Page 46: Chemicals, Chemical Identifiers and Navigating Through Databases

Full Skeleton Search: 104 Hits

Page 47: Chemicals, Chemical Identifiers and Navigating Through Databases

Full Molecule Search: 4 Hits
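
The gap between the skeleton and full-molecule hit counts can be mimicked with InChIKeys: match on the first block only for a skeleton search, or on the whole key for a full-molecule search. The Python sketch below is one plausible way to do this and is not claimed to be how any particular database implements it. The record identifiers are hypothetical, and the keys are the vitamin K1 keys from the earlier PubChem slide rather than vancomycin keys.

```python
def skeleton_of(inchikey):
    """First block of an InChIKey (the connectivity hash)."""
    return inchikey.split("-")[0]


def search_full(index, query_key):
    """Records whose whole InChIKey equals the query."""
    return sorted(rec for rec, key in index.items() if key == query_key)


def search_skeleton(index, query_key):
    """Records that share the query's first InChIKey block."""
    target = skeleton_of(query_key)
    return sorted(rec for rec, key in index.items() if skeleton_of(key) == target)


# Hypothetical record IDs mapped to InChIKeys copied from the PubChem slide.
INDEX = {
    "record-1": "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "record-2": "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "record-3": "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
}

query = "MBWXNTAXLNYFJB-NKFFZRIASA-N"
print("full molecule:", search_full(INDEX, query))
print("skeleton:     ", search_skeleton(INDEX, query))
```

A skeleton search is the more forgiving option when the stereochemistry of the query or of the database records is in doubt, at the cost of returning stereoisomers and isotopically labelled forms together.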

Page 48: Chemicals, Chemical Identifiers and Navigating Through Databases

Where is chemistry online?

Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

Page 49: Chemicals, Chemical Identifiers and Navigating Through Databases

Linked Data on the Web

Taken from: Rafael Sidi’s Blog

Page 50: Chemicals, Chemical Identifiers and Navigating Through Databases

www.chemspider.com

Page 51: Chemicals, Chemical Identifiers and Navigating Through Databases

Search for a Chemical…by name

Page 52: Chemicals, Chemical Identifiers and Navigating Through Databases

Available Information…

Linked to vendors, safety data, toxicity, metabolism

Page 53: Chemicals, Chemical Identifiers and Navigating Through Databases

How do we build it?

25 million chemicals from 400 data sources
We deal in Molfiles or SDF files – including coordinates
We do rudimentary filtering – valence checking, charge imbalance – prior to deposition
We have our own “business logic” to standardize
We use InChI to “aggregate tautomers” to one record
We link out to external sites where possible using their IDs
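
The filtering and aggregation steps listed on this slide can be sketched in a few lines of Python. This is an illustration only and not ChemSpider's actual logic: the valence table is a small assumed subset, charged atoms are skipped by the valence check, the record layout is hypothetical, and the "inchi" field is a placeholder standing in for a string that real InChI software would generate.

```python
from collections import defaultdict

# Assumed maximum valences for neutral atoms; deliberately incomplete.
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


def valence_problems(atoms, bonds):
    """atoms: [(element, formal_charge)]; bonds: [(i, j, order)], 0-based indices."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    problems = []
    for idx, (element, charge) in enumerate(atoms):
        limit = MAX_VALENCE.get(element)
        if limit is not None and charge == 0 and used[idx] > limit:
            problems.append((idx, element, used[idx]))
    return problems


def net_charge(atoms):
    return sum(charge for _, charge in atoms)


def deposit(records):
    """Reject records that fail the checks; group the rest by InChI string."""
    accepted, rejected = defaultdict(list), {}
    for rec in records:
        reasons = []
        if valence_problems(rec["atoms"], rec["bonds"]):
            reasons.append("valence")
        if net_charge(rec["atoms"]) != 0:
            reasons.append("charge imbalance")
        if reasons:
            rejected[rec["name"]] = reasons
        else:
            accepted[rec["inchi"]].append(rec["name"])
    return dict(accepted), rejected


RECORDS = [
    {"name": "drawing from source 1", "inchi": "placeholder-inchi-1",
     "atoms": [("C", 0), ("O", 0)], "bonds": [(0, 1, 2)]},
    {"name": "drawing from source 2", "inchi": "placeholder-inchi-1",
     "atoms": [("C", 0), ("O", 0)], "bonds": [(0, 1, 1)]},
    {"name": "pentavalent carbon", "inchi": "placeholder-inchi-2",
     "atoms": [("C", 0)] + [("H", 0)] * 5,
     "bonds": [(0, n, 1) for n in range(1, 6)]},
    {"name": "acetate without counter-ion", "inchi": "placeholder-inchi-3",
     "atoms": [("C", 0), ("C", 0), ("O", 0), ("O", -1)],
     "bonds": [(0, 1, 1), (1, 2, 2), (1, 3, 1)]},
]

accepted, rejected = deposit(RECORDS)
print(accepted)
print(rejected)
```

Grouping on the InChI string is what merges the two accepted drawings into one record here; in practice that merging works for tautomers because the InChI hydrogen layer treats mobile hydrogens as shared rather than fixed.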

Page 54: Chemicals, Chemical Identifiers and Navigating Through Databases

Inherited Errors

We have inherited errors from every database… all public compound databases, including ours, have errors

“Incorrect” structures – assertions, timelines etc
“Incorrect” names associated with structures
Properties
Links
Publications

ENORMOUS CHALLENGE

Page 55: Chemicals, Chemical Identifiers and Navigating Through Databases

Compounds and Identifiers

Page 56: Chemicals, Chemical Identifiers and Navigating Through Databases

Be careful searching by Name!

Determining the correct structure by name searching is difficult online!

Good, not perfect:
Wikipedia
ChEBI/ChEMBL
ChemIDplus
ChemSpider

Be VERY careful with MOST databases

Page 57: Chemicals, Chemical Identifiers and Navigating Through Databases

Validating structures

Check for “full stereo” and use stereo descriptors especially for checking!

Check for quality of associated data sources
Check against reference literature when available – but it can be wrong
Question EVERYTHING!

Page 58: Chemicals, Chemical Identifiers and Navigating Through Databases

Online Curation

Online databases generally do NOT allow curation or annotation

If you find errors they stay there!
ChemSpider is unique…immediate curation

ChemSpider live demo following this lecture:
Searching
Deposition and Curation
ChemSpider SyntheticPages

Page 59: Chemicals, Chemical Identifiers and Navigating Through Databases

Thank you

Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams