Chemicals, Chemical Identifiers and Navigating Through Databases
Antony Williams, UNC Chapel Hill, October 2010
Chemistry on the Internet
Where do you source chemistry information?
What can you trust online?
How can you recognize potential issues?
Cross-referencing and curating data
What is the Structure of Vitamin K?
MeSH
A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K
What is the Structure of Vitamin K1?
Wikipedia
What is the Structure of Vitamin K1?
CAS’s Common Chemistry
PubChem
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
Variants of systematic names on PubChem
2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
2-methyl-3-(3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
Bioassay Data are Associated…
Lack of Stereochemistry
ChEBI – Manual Curation
Molfiles (http://en.wikipedia.org/wiki/Chemical_table_file)
Molfiles
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END
Molfiles
Molfiles are the primary exchange format between structure drawing packages
Can be different between different drawing packages
Most commonly carry X,Y coordinates for layout
Can support polymers, organometallics, etc.
Can carry 3D coordinates
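To make the layout of a Molfile connection table concrete, here is a minimal Python sketch that reads the counts line, atom block and bond block of the V2000 record shown on the Molfiles slide. It assumes the whitespace-separated form that appears in this transcript; real Molfiles use fixed-width columns and three header lines, and the function name `parse_v2000_tokens` is hypothetical.

```python
from collections import Counter

# Connection table from the slide, without the three Molfile header lines.
CTAB = """
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END
"""


def parse_v2000_tokens(text):
    """Parse a whitespace-tokenised V2000 connection table (simplified)."""
    tokens = text.split()
    n_atoms, n_bonds = int(tokens[0]), int(tokens[1])
    pos = tokens.index("V2000") + 1  # first token after the counts line
    atoms = []
    for _ in range(n_atoms):
        x, y, z, symbol = tokens[pos:pos + 4]
        atoms.append((symbol, float(x), float(y), float(z)))
        pos += 16  # x, y, z, symbol plus 12 integer fields
    bonds = []
    for _ in range(n_bonds):
        a, b, order, stereo = (int(t) for t in tokens[pos:pos + 4])
        bonds.append((a, b, order, stereo))
        pos += 7  # two atom indices, order, stereo flag, three more fields
    return atoms, bonds


atoms, bonds = parse_v2000_tokens(CTAB)
formula = Counter(symbol for symbol, *_ in atoms)
stereo_bonds = [b for b in bonds if b[3] != 0]
print(dict(formula))
print(stereo_bonds)
```

The only bond with a non-zero stereo flag is the one from atom 4 to atom 10; in the V2000 convention a value of 6 on a single bond marks a "down" (hashed) bond, which is how the drawing carries the stereocentre.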
SMILES (http://en.wikipedia.org/wiki/SMILES)
SMILES is a common format
Can support polymers, organometallics, etc.
Does NOT carry X, Y or Z coordinates for layout, so it requires layout algorithms, which can be problematic!
Generally different between drawing packages
Stereo
Tautomers
SMILES
ACD/Labs: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O
OpenEye: CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C
ChEMBL: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C
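The three SMILES strings on this slide differ as text, so plain string comparison cannot be used to match records across packages. The Python sketch below illustrates this with a crude heavy-atom count: all three strings give the same composition even though no two are identical. The regular expressions and the function name `heavy_atoms` are illustrative assumptions; they handle only simple organic SMILES, and an equal composition is a necessary check, not proof of identity. In practice a cheminformatics toolkit's canonical SMILES or an InChI would be used instead.

```python
import re
from collections import Counter

# SMILES strings as listed on the slide (raw strings keep the backslashes).
SMILES = {
    "ACD/Labs": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": r"CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C",
}

# Bracket atoms first, then two-letter halogens, then the organic subset.
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
ELEMENT = re.compile(r"[A-Z][a-z]?|[a-z]")


def heavy_atoms(smiles):
    """Count non-hydrogen atoms per element in a simple organic SMILES."""
    counts = Counter()
    for match in ATOM.finditer(smiles):
        token = match.group()
        if token.startswith("["):
            # e.g. "[C@@H]" -> "C"; the trailing H is a hydrogen count, not an atom
            token = ELEMENT.search(token).group()
        counts[token.capitalize()] += 1  # aromatic "c" is counted as "C"
    return counts


compositions = {source: heavy_atoms(s) for source, s in SMILES.items()}
distinct_strings = len(set(SMILES.values()))
for source, counts in compositions.items():
    print(source, dict(counts))
```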
The InChI Identifier
InChI
SINGLE code base managed by IUPAC – integrated into drawing packages. No variability as with SMILES
InChI Strings can be reversed to structures – same problem as with SMILES – no layout
Well adopted by the community (databases, publishers, blogs, Wikipedia) – good for searching the internet
Multiple Layers
Tautomers – “Mobile H Perception”
Double Bond Orientation
Stereo
Checking for Stereochemistry
Use your drawing package!
InChI Strings Hash to InChIKeys
PubChem InChIKeys
MBWXNTAXLNYFJB-NKFFZRIASA-N
MBWXNTAXLNYFJB-LKUDQCMESA-N
MBWXNTAXLNYFJB-UHFFFAOYSA-N
MBWXNTAXLNYFJB-FAKCLFGASA-N
MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
MBWXNTAXLNYFJB-ODDKJFTJSA-N
MBWXNTAXLNYFJB-KSVLJPARSA-N
MBWXNTAXLNYFJB-UDCSOKOMSA-N
MBWXNTAXLNYFJB-JHBCSKSVSA-N
MBWXNTAXLNYFJB-JXAKDHTRSA-N
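The keys listed for PubChem all share the same first block, which is what makes the InChIKey useful for spotting records that differ only in stereochemistry or isotopes. As a rough Python sketch, the key can be split into its parts: a 14-character hash of the connectivity, an 8-character hash of the remaining layers, a standard/non-standard flag, a version letter and a protonation character. The names `InChIKeyParts` and `split_inchikey` are hypothetical helpers, not part of any InChI software.

```python
from collections import namedtuple

InChIKeyParts = namedtuple(
    "InChIKeyParts", "skeleton layers flag version protonation"
)

# Keys from the PubChem slide.
KEYS = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "MBWXNTAXLNYFJB-NIHVXYICSA-N",
    "MBWXNTAXLNYFJB-ODDKJFTJSA-N",
    "MBWXNTAXLNYFJB-KSVLJPARSA-N",
    "MBWXNTAXLNYFJB-UDCSOKOMSA-N",
    "MBWXNTAXLNYFJB-JHBCSKSVSA-N",
    "MBWXNTAXLNYFJB-JXAKDHTRSA-N",
]

# Second-block hash produced when a standard InChI has no stereo or isotope layers.
NO_EXTRA_LAYERS = "UHFFFAOY"


def split_inchikey(key):
    """Split an InChIKey into connectivity hash, layer hash and flag characters."""
    first, second, protonation = key.split("-")
    if len(first) != 14 or len(second) != 10 or len(protonation) != 1:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return InChIKeyParts(first, second[:8], second[8], second[9], protonation)


parts = [split_inchikey(k) for k in KEYS]
skeletons = {p.skeleton for p in parts}
flat_keys = [k for k, p in zip(KEYS, parts) if p.layers == NO_EXTRA_LAYERS]

print(skeletons)   # one shared connectivity block
print(flat_keys)   # the record deposited without stereochemistry
```

Grouping by the first block collects every stereo and isotope variant of one skeleton, while the `UHFFFAOY` layer hash flags the record that carries no stereochemistry at all.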
Databases and Standardization
InChI
No support for polymers, organometallics
Many option settings can lead to variability and make integration across databases difficult – FixedH option especially problematic
“Slight” chance of collisions of InChIKeys
VERY USEFUL FOR INTEGRATING THE WEB
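The "slight" chance of collisions can be put into rough numbers with a birthday-bound estimate. The Python sketch below assumes that the first InChIKey block carries about 65 bits of hash, that the hash behaves uniformly, and that the collection holds 25 million distinct skeletons (the ChemSpider size quoted in this talk); all three are simplifying assumptions, and the function name is hypothetical.

```python
import math


def collision_probability(n_items, bits):
    """Birthday-bound estimate of at least one collision among n_items uniform hashes."""
    space = 2.0 ** bits
    # 1 - exp(-n(n-1)/(2N)), written with expm1 for accuracy at small values
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


p_skeleton = collision_probability(25_000_000, 65)
print(f"chance of any first-block collision: {p_skeleton:.2e}")
```

Under these assumptions the estimate comes out on the order of one in a hundred thousand for the whole collection: small, but not zero, which matches the "slight" on the slide.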
Vancomycin
Search Molecular SKELETON
Search Full Molecule
Full Skeleton Search: 104 Hits
Full Molecule Search: 4 Hits
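The difference between the two hit counts follows from how much of the InChIKey is matched. A minimal Python sketch, assuming a skeleton search compares only the first block and a full-molecule search compares the whole key; the record list reuses a few vitamin K1 keys from the PubChem slide plus one placeholder key that is clearly made up, and `search` is a hypothetical helper.

```python
RECORDS = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",  # hypothetical placeholder for an unrelated compound
]


def search(records, query_key, skeleton_only):
    """Return records matching the query on the first block only, or on the full key."""
    if skeleton_only:
        skeleton = query_key.split("-")[0]
        return [r for r in records if r.split("-")[0] == skeleton]
    return [r for r in records if r == query_key]


query = "MBWXNTAXLNYFJB-NKFFZRIASA-N"
skeleton_hits = search(RECORDS, query, skeleton_only=True)
full_hits = search(RECORDS, query, skeleton_only=False)
print(len(skeleton_hits), len(full_hits))
```

A skeleton search always returns at least as many hits as a full-molecule search, because it also picks up records with different or missing stereochemistry.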
Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science
Linked Data on the Web
Taken from: Rafael Sidi’s Blog
www.chemspider.com
Search for a Chemical…by name
Available Information…
Linked to vendors, safety data, toxicity, metabolism
How do we build it?
25 million chemicals from 400 data sources
We deal in Molfiles or SDF files, including coordinates
We do rudimentary filtering (valence checking, charge imbalance) prior to deposition
We have our own “business logic” to standardize
We use InChI to “aggregate tautomers” to one record
We link out to external sites where possible using their IDs
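The "rudimentary filtering" step can be pictured with a short Python sketch. This is not ChemSpider's actual business logic: the valence table, the rule of skipping charged atoms and the function name `check_structure` are all simplifying assumptions. The test structure is the atom and bond list from the Molfile slide, with all formal charges taken as zero.

```python
# Simplified maximum valences for neutral atoms (assumption, not a full rule set).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}


def check_structure(atoms, bonds):
    """Return a list of problems.

    atoms: list of (symbol, formal_charge)
    bonds: list of (atom_i, atom_j, order), atom indices 1-based
    """
    problems = []
    bond_sum = [0] * len(atoms)
    for i, j, order in bonds:
        bond_sum[i - 1] += order
        bond_sum[j - 1] += order
    for idx, ((symbol, charge), total) in enumerate(zip(atoms, bond_sum), start=1):
        limit = MAX_VALENCE.get(symbol)
        if limit is None:
            problems.append(f"atom {idx}: no valence rule for {symbol}")
        elif charge == 0 and total > limit:  # charged atoms are skipped in this sketch
            problems.append(f"atom {idx}: {symbol} has valence {total} > {limit}")
    net_charge = sum(charge for _, charge in atoms)
    if net_charge != 0:
        problems.append(f"charge imbalance: net charge {net_charge:+d}")
    return problems


# Atoms and bonds from the Molfile slide.
ATOMS = [(s, 0) for s in "C C O C O C C N O N".split()]
BONDS = [(3, 1, 2), (4, 1, 1), (9, 1, 1), (7, 2, 1), (5, 2, 2),
         (8, 2, 1), (6, 4, 1), (4, 10, 1), (7, 6, 1)]

print(check_structure(ATOMS, BONDS))

# Deliberately broken copy: the 4-1 single bond becomes a double bond.
BROKEN = [(4, 1, 2) if b == (4, 1, 1) else b for b in BONDS]
print(check_structure(ATOMS, BROKEN))
```

Checks of this kind catch drawing and conversion mistakes before deposition, but they say nothing about whether the structure is the right one for the name attached to it.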
Inherited Errors
We have inherited errors from every database… all public compound databases, including ours, have errors
“Incorrect” structures: assertions, timelines, etc.
“Incorrect” names associated with structures
Properties
Links
Publications
ENORMOUS CHALLENGE
Compounds and Identifiers
Be careful searching by Name!
Determining the correct structure by name searching is difficult online!
Good, not perfect: Wikipedia, ChEBI/ChEMBL, ChemIDplus, ChemSpider
Be VERY careful with MOST databases
Validating structures
Check for “full stereo” and use stereo descriptors especially for checking!
Check for quality of associated data sources
Check against reference literature when available, but it can be wrong
Question EVERYTHING!
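A first-pass check for stereo descriptors can be automated when a record is available as SMILES. The Python sketch below simply counts chirality tags (`@`, `@@`) and directional bond markers (`/`, `\`) in the ACD/Labs string from the SMILES slide, and compares it with a copy stripped of those markers. It cannot tell whether every stereocentre has been specified, so it is only a flag for manual checking; both function names are hypothetical.

```python
import re

# ACD/Labs SMILES for vitamin K1 as given on the SMILES slide.
ACD_SMILES = r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O"


def stereo_markers(smiles):
    """Count tetrahedral tags and directional bond markers in a SMILES string."""
    return {
        "tetrahedral": len(re.findall(r"@+", smiles)),       # "@" or "@@" tags one centre
        "bond_markers": len(re.findall(r"[/\\]", smiles)),   # "/" and "\" around C=C
    }


def strip_stereo(smiles):
    """Remove stereo markers; only handles the [C@H]/[C@@H] form used in this example."""
    flat = re.sub(r"\[C@+H\]", "C", smiles)
    return flat.replace("/", "").replace("\\", "")


with_stereo = stereo_markers(ACD_SMILES)
without_stereo = stereo_markers(strip_stereo(ACD_SMILES))
print(with_stereo)
print(without_stereo)
```

A record that reports zero markers for a compound whose systematic name carries descriptors such as (E,7R,11R) is a candidate for the "lack of stereochemistry" problem shown earlier.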
Online Curation
Online databases generally do NOT allow curation or annotation
If you find errors they stay there!
ChemSpider is unique…immediate curation
ChemSpider live demo following this lecture
Searching
Deposition and Curation
ChemSpider SyntheticPages
Thank you
Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams