The NCI/CADD Group's InChI Usage and Analysis of ...bulletin.acscinf.org/PDFs/247nm56.pdf ·...

Page 1

Chemical Biology Laboratory
Center for Cancer Research
National Cancer Institute
National Institutes of Health
Frederick, Maryland 21702

The NCI/CADD Group's InChI Usage and Analysis of Tautomerism for InChI V2

Marc C. Nicklaus

Computer-Aided Drug Design (CADD) Group

Page 2

Chemical Identifier Resolver (CIR)

http://cactus.nci.nih.gov/chemical/structure

• “Resolves” structure identifiers or representations, i.e. converts one structure identifier/representation into another
• Usable by humans, but optimized for communication between computers (returns MIME/text pages wherever possible)
• Launched in June 2009
• Planned: major update of services and underlying database, the CADD Group’s “Chemical Structure DataBase” (CSDB)

Page 3

Chemical Identifier Resolver (CIR) Flowchart

1. http request carrying an identifier
2. detection of the identifier type
   – identifier is a full structure representation (e.g. SMILES, InChI): the structure is obtained directly
   – identifier is a hashed structure representation (e.g. InChIKey), or chemical name etc.: the structure is obtained by database lookup
3. structure
4. calculation of the requested structure representation (e.g. InChI, GIF image) or database lookup (e.g. CAS number, chemical name)
5. http response carrying the representation
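The "detection of the identifier type" step can be illustrated with a minimal Python sketch. This is not CIR's actual logic, which the slides do not describe; the function names, the returned labels and the choice of patterns are assumptions made for illustration. It only uses the textual shape of an InChI, of a Standard InChIKey, and of a CAS Registry Number (including its check digit).

```python
import re

# Standard InChIKey shape: 14 letters, 10 letters, 1 letter, joined by hyphens.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# CAS Registry Number shape: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_cas_number(text: str) -> bool:
    """Check the CAS pattern and verify its check digit."""
    match = CAS_RE.match(text)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weights 1, 2, 3, ... are applied from the rightmost digit leftwards.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def detect_identifier_type(text: str) -> str:
    """Return a coarse, hypothetical label for the kind of identifier."""
    text = text.strip()
    if text.startswith("InChI="):
        return "inchi"           # full structure representation
    if INCHIKEY_RE.match(text):
        return "inchikey"        # hashed representation, needs a lookup
    if is_cas_number(text):
        return "cas"             # registry number, needs a lookup
    return "name_or_smiles"      # left to a SMILES parser or a name lookup


print(detect_identifier_type("CDBRNDSHEYLDJV-FVGYRXGTSA-M"))
print(detect_identifier_type("204255-11-8"))
```

A real resolver would also have to tell SMILES strings from chemical names, which needs an actual parser rather than a regular expression.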

Page 4

Chemical Identifier Resolver (CIR)

Examples:

Naproxen sodium
http://cactus.nci.nih.gov/chemical/structure/CDBRNDSHEYLDJV-FVGYRXGTSA-M/smiles
[C@H](C2=CC1=CC=C(OC)C=C1C=C2)(C([O-])=O)C.[Na+] (MIME type: text/plain)

Buckyball
http://cactus.nci.nih.gov/chemical/structure/XMWRBQBLMFGWIX-UHFFFAOYSA-N/image?height=300&width=300&bgcolor=black&bondcolor=white
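The example URLs follow the pattern base/identifier/representation, optionally followed by a query string. A minimal Python sketch that assembles such URLs is shown here; the helper name `cir_url` is hypothetical, and percent-encoding the identifier is an assumption about how identifiers containing characters such as "/" or "#" should be passed.

```python
from urllib.parse import quote, urlencode

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier: str, representation: str, **params: str) -> str:
    """Build a URL of the form <base>/<identifier>/<representation>[?query]."""
    # Percent-encode the identifier so it stays inside one path segment.
    url = f"{CIR_BASE}/{quote(identifier, safe='')}/{representation}"
    if params:
        url += "?" + urlencode(params)
    return url


print(cir_url("CDBRNDSHEYLDJV-FVGYRXGTSA-M", "smiles"))
print(cir_url("XMWRBQBLMFGWIX-UHFFFAOYSA-N", "image",
              height="300", width="300", bgcolor="black", bondcolor="white"))
```

The sketch only builds strings; fetching them (for instance with `urllib.request.urlopen`) returns plain text or an image, as described on the slides.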

Page 5

Chemical Structure Database (CSDB) in CIR

• ChemNavigator/Sigma iResearch Library: compilation of commercially available screening compounds from ~300 international chemistry suppliers
• PubChem database: including Open NCI database, EPA DSSTox databases, NIAID HIV database, NIST Webbook, NLM ChemIDplus, ChemSpider, …
• Commercial Sources / others: Asinex, Comgenex, eMolecules, …

[Pie chart: ChemNavigator iResearch Library ~56%, PubChem ~38%, others ~6%]

Currently available in CIR:

140 chemical structure databases
120 million structure records
84.6 million unique structures by FICuS
110 million unique Standard InChIKeys for lookup (includes ~25 million derived scaffold and ring structures)

Page 6

Tautomerism in large small-molecule databases

NCI/CADD Structure Identifiers

Each identifier is named by five letters, one per structural feature. An upper-case letter means the identifier is sensitive to that feature; "u" means it is un-sensitive:

F: Fragments
I: Isotopes
C: Charges
T: Tautomers
S: Stereochemistry

[Figure: pairs of example structures for each feature, marked as equal (=) or different (≠) depending on whether the identifier is sensitive or un-sensitive to that feature.]

Based on CACTVS hashcodes; 16-digit hex numbers (64 bit unsigned)

Sitzmann et al. SAR QSAR Environ. Res. 2008, 19, 1–9

FICuS identifier: comes closest to how a chemist perceives a compound
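The naming scheme (one letter per feature, "u" for un-sensitive) can be decoded mechanically. The following Python sketch is illustrative only; the function name `sensitivity` and the dictionary output are assumptions, not part of CACTVS or CIR.

```python
FEATURES = ("Fragments", "Isotopes", "Charges", "Tautomers", "Stereochemistry")
LETTERS = "FICTS"


def sensitivity(identifier_type: str) -> dict:
    """Map each structural feature to True (sensitive) or False (un-sensitive)."""
    if len(identifier_type) != len(LETTERS):
        raise ValueError("expected five characters, e.g. 'FICuS'")
    result = {}
    for feature, letter, char in zip(FEATURES, LETTERS, identifier_type):
        if char not in (letter, "u"):
            raise ValueError(f"unexpected character {char!r} for {feature}")
        result[feature] = (char == letter)
    return result


print(sensitivity("FICuS"))
```

For "FICuS" every feature except Tautomers maps to True, which matches the description of FICuS as the tautomer-invariant identifier.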

Page 7

Tautomerism in large small-molecule databases

Histidine

[Figure: histidine drawn as tautomer, stereoisomers, charged form, salt, isotope-labeled form and “errors”, each annotated with its Std. InChIKey and its FICuS identifier.]

Std. InChIKeys shown:
HNDVDQJCIGZPNO-UHFFFAOYSA-N
HNDVDQJCIGZPNO-CDYZYAPPSA-N
HNDVDQJCIGZPNO-RXMQYKEDSA-N
HNDVDQJCIGZPNO-YFKPBYRVSA-N
UHPNKBYGGMJTIM-UHFFFAOYSA-M

FICuS identifiers shown:
9850FD9F9E2B4E25-FICuS
E5F83F10C5DB080A-FICuS
E92E4BA2869F3611-FICuS
8A7AD1EB498CC76A-FICuS
A3DAE0788050DDE4-FICuS
B2FDA68AEDA06DB9-FICuS
9850FD9F9E2B4E25-FICuS-01-78

Page 8

Chemical Structure Database (current version)

Calculation of Standard InChIKey and NCI/CADD identifiers (from ~350 million original raw records including multiple database versions)

Workflow: original structure record → structure normalization → parent structure (FICTS, FICuS, uuuuu) → hashcode calculation (E_HASHISY) → NCI/CADD Identifier

InChIKeys (Standard InChIKey 1.04; Standard, Set 1, Set 2, Set 3):
Standard: 167,722,852
Set 1: 167,722,850
Set 2: 167,722,852
Set 3: 167,698,426
any unique (union set): 167,723,824

NCI/CADD Identifiers (unique counts):
FICTS: 125,009,738
FICuS: 121,429,689
uuuuu: 108,993,792
(before normalization: ~127M)

Total unique records: 168,015,387 (includes ~40M derived scaffold and ring structures)
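The unique counts allow a quick estimate of how many parent structures collapse when sensitivity to a feature is switched off. The Python sketch below only does arithmetic on the numbers quoted on the slide; reading FICTS minus FICuS as "structures merged by ignoring tautomerism" is an interpretation, not a statement from the slide.

```python
# Unique identifier counts as quoted on the slide.
ficts = 125_009_738
ficus = 121_429_689
uuuuu = 108_993_792

# FICTS and FICuS differ only in tautomer sensitivity.
merged_by_tautomerism = ficts - ficus
share_of_ficts = 100 * merged_by_tautomerism / ficts

# Further reduction when all remaining sensitivities are dropped.
merged_by_all_other_features = ficus - uuuuu

print(merged_by_tautomerism, round(share_of_ficts, 2))
print(merged_by_all_other_features)
```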

Page 9

Tautomers

Tautomers are isomers that can transform into each other through chemical equilibrium reactions

- Prototropic tautomerism: intramolecular movement of a hydrogen atom (figure: enol form and keto form)
- Ring-chain tautomerism: movement of the proton accompanied by opening/closing of a ring (figure: cyclic form and acyclic form)

Strongly environment-dependent (pH, solvent, T, time, ... )

Page 10

Importance of Getting Tautomers Right

The existence of multiple tautomeric forms of the same molecule can create problems!

• Ligand docking: hydrogen bonding interactions are different for different tautomers
• Clustering, diversity: (Tanimoto) similarity between tautomers can be very low
• Registration in databases: may lead to duplicate registration, missed molecules in searches, or incorrect identification of two structures as the same compound (most important for InChI)
• Property calculation: variations across tautomers by several orders of magnitude, e.g. pKa, logP
• X-ray crystallography: what is the tautomeric form present in ligand-protein complexes?

May impact the success of drug discovery
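To make the similarity remark concrete: the Tanimoto coefficient of two fingerprints is the number of bits set in both divided by the number of bits set in either. The Python sketch below uses invented bit sets purely for illustration; they are not real fingerprints of any tautomer pair, and which fingerprint a given toolkit uses is left open.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient |A & B| / |A | B| of two sets of 'on' bits."""
    union = bits_a | bits_b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(bits_a & bits_b) / len(union)


# Hypothetical bit sets, invented for this example only.
keto_bits = {1, 4, 7, 9, 12, 15}
enol_bits = {1, 4, 8, 10, 13, 15}

print(tanimoto(keto_bits, enol_bits))
```

Moving one hydrogen changes bond orders and atom environments, so many substructure-based bits can flip at once; that is why two tautomers of the same compound can score low.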

Page 11

Tautomerism in large small-molecule databases

NCI/CADD Chemical Structure Database: Tautomer Analysis

Average tautomeric overlap per DB: ~0.3%
Tautomeric overlap across all DBs: ~10%
Structures capable of tautomerism: ~68%

[Figure: histogram of the tautomeric overlap within each individual database release. x-axis: percentage of actual duplicates (FICTS - FICuS parent structure) in each database release, 0.0 to 2.0; y-axis: frequency (number of database releases), 0 to 100. Databases labeled in the plot: Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs, Ambinter, BIND, BindingDB, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat, NCI/DTP, PASS Training Set, SGC-Ox, ChemDB, ZINC, ChEBI, ChemSpider.]

Sitzmann M, Ihlenfeldt WD, Nicklaus MC. J Comput Aided Mol Des. 2010 Jun;24(6-7):521-51.

Page 12

Tautomerism in Chemoinformatics

• In reality, tautomerism is a quantum-mechanical effect (orbital changes)

• In principle calculable at the QM level

• But incorporating the conditions (solvent, pH, ...) is not easy

• ...and these calculations can take weeks for one molecule

• In chemoinformatics: rule-based (you have maybe 100 ms per structure!)

• These rules will be “correct” only in a statistical sense

• Whether QM or rule-based: one has to agree on what set and ranges of conditions to use to define “tautomerism” in a practical application, e.g. for identifiers used for compound registration in a database or repository

Page 13

Tautomerism and Identifiers

Different chemical identifiers are sensitive to tautomerism to different degrees:

•InChI/InChIKey [IUPAC International Chemical Identifier]

– one identifier with a layer structure

– in principle designed to be tautomer-invariant

– but not all types of common tautomerism used by default (e.g. not invariant by default to keto/enol tautomerism in v. 1.04)

•NCI/CADD identifiers1

– several identifiers with different sensitivities to chemical features; most important one:

• FICuS – tautomer-invariant

1 Sitzmann et al. SAR QSAR Environ. Res. 2008, 19, 1–9

Page 14

Example of Known Issues in InChI

Dmitrii Tchekhovskoi, IUPAC InChI Committee Meeting, March 2012

1,4-oxime/nitroso tautomerism not currently handled by InChI. Adding it will break current InChI.

[Figure: two tautomers of C5H5NO2 that differ in which oxygen atom carries the hydrogen. Their InChI strings differ:]

InChI=1S/C5H5NO2/c7-5-3-1-2-4-6(5)8/h1-4,7H

InChI=1S/C5H5NO2/c7-5-3-1-2-4-6(5)8/h1-4,8H
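The two InChI strings above can be compared layer by layer to see where they diverge. The Python sketch below is a simplified splitter written for this example: it assumes each layer prefix occurs only once, which holds for these two Standard InChIs but not for every InChI (fixed-H and isotopic sections can repeat prefixes). The function names are hypothetical.

```python
def inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula and one-letter-prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        # Remaining layers start with a one-letter prefix such as 'c' or 'h'.
        layers[part[0]] = part[1:]
    return layers


def differing_layers(inchi_a: str, inchi_b: str) -> list:
    """Return the sorted names of layers whose content differs."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


first = "InChI=1S/C5H5NO2/c7-5-3-1-2-4-6(5)8/h1-4,7H"
second = "InChI=1S/C5H5NO2/c7-5-3-1-2-4-6(5)8/h1-4,8H"
print(differing_layers(first, second))
```

Formula and connectivity layers are identical; only the hydrogen layer differs (hydrogen on atom 7 versus atom 8), which is exactly the kind of difference a tautomer-invariant identifier would have to absorb.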

Page 15

Tautomerism Transform Rules

• CACTVS: rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS (CACTVS by: Wolf-Dietrich Ihlenfeldt, Xemistry GmbH)

• Types of tautomerism covered:

rule 1: 1.3 (thio)keto/(thio)enol*
rule 2: 1.5 (thio)keto/(thio)enol
rule 3: simple (aliphatic) imine
rule 4: special imine
rule 5: 1.3 aromatic heteroatom H shift
rule 6: 1.3 heteroatom H shift
rule 7: 1.5 (aromatic) heteroatom H shift (1)
rule 8: 1.5 (aromatic) heteroatom H shift (2)
rule 9: 1.7 (aromatic) heteroatom H shift
rule 10: 1.9 (aromatic) heteroatom H shift
rule 11: 1.11 (aromatic) heteroatom H shift
rule 12: furanones
rule 13: keten/ynol exchange
rule 14: ionic nitro/aci-nitro
rule 15: pentavalent nitro/aci-nitro
rule 16: oxim/nitroso
rule 17: oxim/nitroso via phenol
rule 18: cyanic/iso-cyanic acids
rule 19: formamidinesulfinic acids
rule 20: isocyanides
rule 21: phosphonic acids

* rule 1 has been merged with rule 6

Page 16

Tautomer Overlap in Commercial Catalogs

Aldrich Market Select (AMS) database of commercially available samples: 5,755,574 molecules (2012-09 version)

31,156 conflicts, 62,872 molecules

n-tuples   Conflicts
2          30,619
3          514
4          21
5          1

Examples (prices per 1 g):

$300 vs. $188

Same original supplier!

$350 vs. $313
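The n-tuple table can be tallied to see how it relates to the headline figures. The Python sketch below uses only the numbers on the slide. Summing the rows as extracted gives totals that are slightly below the stated 31,156 conflicts and 62,872 molecules; the slide does not explain the small difference, so it is simply reported here.

```python
# Conflicts per tuple size, as listed in the table.
conflicts_by_tuple_size = {2: 30_619, 3: 514, 4: 21, 5: 1}

total_conflicts = sum(conflicts_by_tuple_size.values())
total_molecules = sum(n * c for n, c in conflicts_by_tuple_size.items())

# Share of the AMS catalog involved in a conflict, using the stated totals.
stated_molecules = 62_872
ams_size = 5_755_574
share_percent = 100 * stated_molecules / ams_size

print(total_conflicts, total_molecules, round(share_percent, 2))
```

Roughly one catalog entry in a hundred is therefore listed alongside at least one tautomer of itself.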

Page 17

Quadruple Tautomeric Case

Page 18

But wait, there is more: Ring-Chain Tautomerism

“The need for computer programs that predict ring-chain tautomerization, a capability absent from the current tautomer generation programs”

Y.C. Martin, “Let’s not forget tautomers”, J Comput Aided Mol Des. 2009, 23:693-704

Prototropic tautomerism handled by most chemoinformatics tools...

...but not ring-chain tautomerism (OUR GOAL):

[Figure: reaction schemes with exocyclic and endocyclic cases, etc. XH: Nucleophilic center, YZ: Electrophilic center]

Page 19

Example for Ring-Chain Tautomerism: Warfarin

Anticoagulant drug used in the prevention of thrombosis

Introduced in 1948 as a pesticide against rats and mice
Approved in 1954 for use as a medication
Most widely prescribed oral anticoagulant drug in the U.S.

Inhibits vitamin K epoxide reductase (recycles oxidized vitamin K1 to its reduced form)

Page 20

Tautomerism of Warfarin – What to Expect

Can exist in principle in as many as 40 topologically distinct tautomeric forms!

[Figure: warfarin tautomer structures, grouped as “Submitted to PubChem”, “Mentioned in literature” and “Confirmed experimentally”.]

Valente E.J. et al., J. Med. Chem. 1977, 20, 1489-1493
Karlsson, B.C.G. et al., J. Phys. Chem. B 2007, 111, 10520-10528
Porter, R.P., J. Comput. Aided Mol. Des. 2010, 24, 553–573
Nicholls, I.A. et al., J. Mol. Recognit. 2010, 23, 604–608

Page 21

Tautomerism of Warfarin – FICuS Identifier

http://cactus.nci.nih.gov/chemical/structure/tautomers:warfarin/ficus

[Figure: the warfarin tautomer structures annotated with their FICuS identifiers and labeled “prototropic tautomerism” or “ring-chain tautomerism”.]

FICuS identifiers shown:
D76B88C0354759F1-FICuS (nine structures)
09BB2FAADA1508A7-FICuS
8F5519DD1E62B6B2-FICuS

Not covered by current rule set!

Page 22

Tautomerism of Warfarin – InChIKey Identifier

http://cactus.nci.nih.gov/chemical/structure/tautomers:warfarin/stdinchikey

[Figure: the warfarin tautomer structures, labeled “prototropic tautomerism” or “ring-chain tautomerism”, annotated with their Standard InChIKeys:]

QTXVAVXCBMYBJW-UHFFFAOYSA-N
VWSXIGYSLWNCBN-VAWYXSNFSA-N
GRAAPKVUSREWIL-UHFFFAOYSA-N
FQEPJUOLUDFINX-UHFFFAOYSA-N
UCKRWKACBKRIKB-VAWYXSNFSA-N
NNLYDNMZCAHUOV-UHFFFAOYSA-N
PJVWKTKQMONHTI-UHFFFAOYSA-N
FVSFCRPKSVCTBA-VAWYXSNFSA-N
BBOSKMPTDUUMKL-UHFFFAOYSA-N
LSCYDZJASSKSMJ-UHFFFAOYSA-N
PIBBOXWKSPNJFI-UHFFFAOYSA-N

All InChIKey identifiers are different!!

Revamp handling of tautomerism in InChI V.2
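The contrast between the two warfarin slides (Page 21 and Page 22) can be checked by counting distinct identifiers. The Python sketch below copies the identifier values from those slides into lists; nothing is fetched from CIR, and the pairing of individual structures with individual keys is not needed for the count.

```python
from collections import Counter

# FICuS identifiers of the eleven warfarin forms shown on Page 21.
ficus = ["D76B88C0354759F1"] * 9 + ["09BB2FAADA1508A7", "8F5519DD1E62B6B2"]

# Standard InChIKeys of the same eleven forms shown on Page 22.
inchikeys = [
    "QTXVAVXCBMYBJW-UHFFFAOYSA-N", "VWSXIGYSLWNCBN-VAWYXSNFSA-N",
    "GRAAPKVUSREWIL-UHFFFAOYSA-N", "FQEPJUOLUDFINX-UHFFFAOYSA-N",
    "UCKRWKACBKRIKB-VAWYXSNFSA-N", "NNLYDNMZCAHUOV-UHFFFAOYSA-N",
    "PJVWKTKQMONHTI-UHFFFAOYSA-N", "FVSFCRPKSVCTBA-VAWYXSNFSA-N",
    "BBOSKMPTDUUMKL-UHFFFAOYSA-N", "LSCYDZJASSKSMJ-UHFFFAOYSA-N",
    "PIBBOXWKSPNJFI-UHFFFAOYSA-N",
]

ficus_groups = Counter(ficus)
# The first InChIKey block hashes the connectivity; it already differs for every form.
first_blocks = {key.split("-")[0] for key in inchikeys}

print(len(ficus_groups), "distinct FICuS values;",
      len(first_blocks), "distinct InChIKey first blocks")
```

FICuS groups nine of the eleven forms under a single value, while even the first block of the Standard InChIKey separates all eleven.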

Page 23

Definition of Ring-Chain Tautomerism Rules

Baldwin's Rules as starting point (J.E. Baldwin, J Chem Soc Chem Commun. 1976: 734-736):

Ring sizes being formed: 3 – 7
Breaking bonds: exocyclic (exo) or endocyclic (endo)
Geometry of carbon atoms: sp3 (tet), sp2 (trig) or sp (dig)

[Figure: table of disfavoured / favoured ring closures, comparing Baldwin’s Rules with Our Rules.]

11 ring-chain rules → each rule is encoded as a SMIRKS string

Guasch, L. et al., JCIM (under revision)
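The three descriptors (ring size, exo/endo, tet/trig/dig) combine into rule names such as "6-exo-Trig". The Python sketch below enumerates the name space and parses a name back into its parts. It illustrates the nomenclature only: the function names are hypothetical, and the sketch says nothing about which combinations are favoured or which eleven were encoded as SMIRKS.

```python
import re
from itertools import product

RING_SIZES = range(3, 8)              # ring sizes 3 to 7
BOND_POSITIONS = ("exo", "endo")      # breaking bond outside or inside the ring
GEOMETRIES = ("tet", "trig", "dig")   # sp3, sp2, sp carbon

NAME_RE = re.compile(r"^([3-7])-(exo|endo)-(tet|trig|dig)$", re.IGNORECASE)


def parse_rule_name(name: str) -> tuple:
    """Split a name like '6-exo-Trig' into (ring size, bond position, geometry)."""
    match = NAME_RE.match(name.strip())
    if not match:
        raise ValueError(f"not a ring-closure rule name: {name!r}")
    return int(match.group(1)), match.group(2).lower(), match.group(3).lower()


def all_rule_names() -> list:
    """Every combination of the three descriptors, written like '5-exo-Dig'."""
    return [f"{size}-{pos}-{geo.capitalize()}"
            for size, pos, geo in product(RING_SIZES, BOND_POSITIONS, GEOMETRIES)]


print(parse_rule_name("6-exo-Trig"))
print(len(all_rule_names()))
```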

Page 24

Examples: Ring-chain tautomerism rules

[Figure: atom-numbered reaction schemes for the rules 6-exo-Trig, 5-exo-Dig and 7-endo-Trig.]

Page 25

Tautomers per Structure in the AMS Database

                                 Prototropic Tautomerism    Ring-chain Tautomerism
                                 (21 Rules)                 (11 Rules)
                                 Count         %            Count         %
no tautomers (single molecule)   1,393,612     24.21        5,297,864     92.05
one tautomer                     1,235,979     21.47        101,890       1.77
2 tautomers                      833,492       14.48        214,488       3.73
3 tautomers                      483,057       8.39         16,490        0.29
4 tautomers                      223,114       3.88         40,606        0.71
5 – 10 tautomers                 889,118       15.45        32,661        0.57
11 – 50 tautomers                584,842       10.16        37,267        0.65
51 – 100 tautomers               72,832        1.27         3,905         0.07
101 – 200 tautomers              35,901        0.62         7,078         0.12
201 – 500 tautomers              3,486         0.06         3,017         0.05
501 – 1000 tautomers             141           0.00         308           0.01

Ring-chain tautomerism is in the minority relative to prototropic tautomerism, but it is not an “exotic” occurrence either.
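A short consistency check on the table: each count column should add up to the size of the AMS database (5,755,574 molecules in the 2012-09 version), and the share of molecules with at least one tautomer follows from the first row. The Python sketch below uses only numbers from the slides; the variable and function names are illustrative.

```python
AMS_TOTAL = 5_755_574  # molecules in the 2012-09 AMS version

# Counts per row, top to bottom, starting with "no tautomers".
prototropic = [1_393_612, 1_235_979, 833_492, 483_057, 223_114,
               889_118, 584_842, 72_832, 35_901, 3_486, 141]
ring_chain = [5_297_864, 101_890, 214_488, 16_490, 40_606,
              32_661, 37_267, 3_905, 7_078, 3_017, 308]


def share_with_tautomers(counts: list) -> float:
    """Percentage of molecules for which at least one tautomer was generated."""
    total = sum(counts)
    return 100 * (total - counts[0]) / total


print(sum(prototropic) == AMS_TOTAL, sum(ring_chain) == AMS_TOTAL)
print(round(share_with_tautomers(prototropic), 2),
      round(share_with_tautomers(ring_chain), 2))
```

Both columns sum to the database size, which confirms how the flattened numbers pair up with the two rule sets: about three quarters of the molecules have prototropic tautomers under the 21 rules, about 8% have ring-chain tautomers under the 11 rules.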

Page 26

Tautomerism Analysis by “Experimental Chemoinformatics”

Procedure:

• Select 100-200 tautomer tuples from AMS by

– coverage of types of tautomer transforms

– chemical diversity

– solubility

– availability from same original supplier

– likelihood to be distinguishable by NMR

– price

• Purchase samples

• Analyze by NMR spectroscopy

– measure as function of temperature, solvent, pH, shelf time...

Goal:

• Investigate prevalence of tautomeric overlap in a real commercial catalog

• Test which of the tautomer transform rules may be too “aggressive”

Page 27

NMR Experiments

338 molecules from AMS

127 prototropic tautomeric pairs
5 prototropic tautomeric triples
34 ring-chain tautomeric pairs

Bruker AVANCE™ 500 – Autosampler (24)

Solvent: DMSO-d6
Room temperature

1H and 13C NMR Spectra

Page 28

Keto/enol Tautomerism (conflict 26)

1H NMR Spectrum

13C NMR Spectrum

Same 1H and 13C NMR spectrum between samples. Assignment of chemical shifts indicates enol form is present in both samples.

26_1
InChI=1S/C14H12N4OS/c1-9-10(5-4-8-15)13(19)18(17-9)14-16-11-6-2-3-7-12(11)20-14/h2-3,6-7,17H,4-5H2,1H3
InChIKey=HGTVTJWWZHYAQS-UHFFFAOYSA-N

26_2
InChI=1S/C14H12N4OS/c1-9-10(5-4-8-15)13(19)18(17-9)14-16-11-6-2-3-7-12(11)20-14/h2-3,6-7,19H,4-5H2,1H3
InChIKey=KTGFJMXHOLHDQA-UHFFFAOYSA-N

Page 29

Heteroaromatic Tautomerism (conflict 36)

1H NMR Spectrum

13C NMR Spectrum

Same 1H and 13C NMR spectrum between samples. Double number of peaks: both samples have the same mixture of tautomers.

InChI=1S/C9H13ClN4/c1-9(2,3)14-7-6(5-12-14)4-11-8(10)13-7/h5,12H,4H2,1-3H3
InChIKey=PCZNHIHWFBLCGG-UHFFFAOYSA-N

InChI=1S/C9H13ClN4/c1-9(2,3)14-7-6(5-12-14)4-11-8(10)13-7/h5H,4H2,1-3H3,(H,11,13)
InChIKey=XRAYTMYJALHNQU-UHFFFAOYSA-N

Page 30

Ring-Chain Tautomerism (conflict 652)

1H NMR Spectrum

13C NMR Spectrum

Same 1H and 13C NMR spectrum between samples. Assignment of chemical shifts indicates open form is present in both samples.

InChI=1S/C14H14N2O2S/c15-13(17)12-10-5-1-2-6-11(10)19-14(12)16-8-9-4-3-7-18-9/h3-4,7-8H,1-2,5-6H2,(H2,15,17)
InChIKey=VKEHQTAMAUSRIS-UHFFFAOYSA-N

InChI=1S/C14H14N2O2S/c17-13-11-8-4-1-2-6-10(8)19-14(11)16-12(15-13)9-5-3-7-18-9/h3,5,7,12,16H,1-2,4,6H2,(H,15,17)/t12-/m0/s1
InChIKey=XMZKSUZSNNZEDZ-LBPRGKRZSA-N

Page 31

Preliminary Results

• More than 200 spectra have been analyzed so far.

• VERY PRELIMINARY conclusions: Around 80% of prototropic tautomeric cases and 50% of ring-chain tautomeric cases show the same 1H and 13C NMR spectra

• Usually the same visual appearance of the samples of a pair (texture, color etc.) corresponds to identical NMR results.

• We have assigned the chemical shifts of some spectra to determine which tautomer is present in the samples.

• Some tautomeric conflicts, e.g. involving triazole, imidazole or pyrazole moieties, are practically indistinguishable by standard NMR experiments:

12_1 and 12_2 (both samples give the same InChI and InChIKey):
InChI=1S/C6H4BrN3/c7-4-1-2-5-6(3-4)9-10-8-5/h1-3H,(H,8,9,10)
InChIKey=BQCIJWPKDPZNHD-UHFFFAOYSA-N

• We are starting to review the tautomeric rules based on the NMR results.

Page 32

Tautomerism – Ongoing and Planned Activities

• Database of tautomerism data (structures; ratios, interconversion rates, relative energies...) from literature, both experimental and computational

• IUPAC Working Group: “Redesign of Handling of Tautomerism for InChI V2”

• Agree on conditions for “tautomerism” in InChI

• Addition of ring-chain tautomerism rule set

• Recommendation for a tautomerism rule set for InChI V2

• Definition of a canonical tautomer

Page 33

Acknowledgements

NCI/CADD Team: Alexey Zakharov, Laura Guasch, Megan Peach, Marc Nicklaus, Markus Sitzmann

Xemistry GmbH: Wolf-Dietrich Ihlenfeldt

ChemNavigator: Scott Hutton

InChI Team

PubChem

All other database providers

Page 34

Page 35

NCI/CADD Structure Identifiers

Unique Representation of Chemical Structures

• based on hashcodes calculated by the chemoinformatics toolkit CACTVS

• CACTVS hashcodes:
  – represent a chemical structure uniquely as 16-digit hexadecimal number (64-bit unsigned)
  – high sensitivity to structural features of a compound
  – change if connectivity changes

[Figure: histidine structure with its hashcode 9850FD9F9E2B4E25]
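Since the hashcode is a 64-bit unsigned number written as 16 hexadecimal digits, it can be round-tripped between the integer and the text form. The Python sketch below is a minimal illustration; the function names are hypothetical, and the "<hashcode>-FICuS" layout is taken from the identifier strings shown on the slides. It does not compute a hashcode from a structure, which is done by CACTVS.

```python
import re

HASHCODE_RE = re.compile(r"^[0-9A-Fa-f]{16}$")


def parse_hashcode(text: str) -> int:
    """Convert a 16-digit hexadecimal hashcode into an unsigned 64-bit integer."""
    if not HASHCODE_RE.match(text):
        raise ValueError("expected exactly 16 hexadecimal digits")
    return int(text, 16)


def format_identifier(value: int, kind: str = "FICuS") -> str:
    """Format a 64-bit value as '<16 hex digits>-<identifier type>'."""
    if not 0 <= value < 2 ** 64:
        raise ValueError("value does not fit into 64 bits")
    # '016X' pads with leading zeros, so values such as 09BB... keep 16 digits.
    return f"{value:016X}-{kind}"


print(format_identifier(parse_hashcode("9850FD9F9E2B4E25")))
```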

Page 36

[Figure: histidine forms (tautomer, stereoisomers, charged form, salt, isotope-labeled form and “errors”), each annotated with its Std. InChIKey and FICuS identifier; repeats the identifiers listed on Page 7.]

Page 37

Chemical Identifier Resolver (CIR)

Examples:

http://cactus.nci.nih.gov/chemical/structure/PGZUMBJQJWIWGJ-ONAKXNSWSA-N/cas
204255-11-8 (MIME type: text/plain)

Buckyball
http://cactus.nci.nih.gov/chemical/structure/XMWRBQBLMFGWIX-UHFFFAOYSA-N/image?height=300&width=300&bgcolor=black&bondcolor=white

Page 38

Partial InChIKey Lookup (in preparation)

• resolve Standard InChIKey into full structure representation: Ethanol

http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA-N/smiles
CCO

http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA/smiles
CCO
CC[OH2+]

http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ/smiles
C(C(O)([2H])[2H])[2H]
CC(O)([2H])[2H]
C(CO)([2H])([2H])[2H]
CC[17OH]
C(CO)[2H]
[14CH3]CO
CCO
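The three query forms on this slide drop hyphen-separated blocks from the right of the InChIKey. A Python sketch of how such prefixes could be derived and matched follows; the function names are hypothetical, and the actual matching performed by the (then unreleased) CIR feature is not described on the slide.

```python
def inchikey_query_levels(inchikey: str) -> list:
    """Return the full key, the key without its last block, and the first block only."""
    first, second, last = inchikey.split("-")
    return [f"{first}-{second}-{last}", f"{first}-{second}", first]


def matches(query: str, full_key: str) -> bool:
    """True if `query` equals `full_key` or is a whole-block prefix of it."""
    return full_key == query or full_key.startswith(query + "-")


for level in inchikey_query_levels("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"):
    print(level)
```

Matching on whole blocks only (query plus a hyphen) avoids accidental hits on keys that merely share some leading characters.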

Page 39

Chemical Structure Database (current version)

InChI/InChIKey (Version 1.04) calculated with four InChI flag sets: Standard (Standard InChIKey), Set 1, Set 2, Set 3

Flags listed for Set 1, Set 2 and Set 3: DONOTADDH W0 FIXEDH RECMET NEWPS SPXYZ SAsXYZ Fb Fnud KET 15T

[Table: per-set selection of these flags, as marked on the slide.]

Standard Set, Set 1 & Set 2: addition of hydrogen atoms by CACTVS

Set 3: addition of hydrogen atoms by the InChI library

Page 40

NCI/CADD Structure Identifiers

Unique Representation of Chemical Structures

• we calculate a set of parent structures with different sensitivity to chemical features
• fine grained representation of chemical structures

Workflow: original structure record (MDL Molfile, MDL SDF, SMILES, ChemDraw cdx, PDB, database) → structure normalization → parent structure (FICTS, FICuS, uuuuu) → hashcode calculation (E_HASHISY) → NCI/CADD Identifier

Page 41

NCI/CADD Chemical Structure Database

Tautomer Analysis

[Figure: histogram of the occurrence of “tautomerism-critical” molecules within each individual database release. x-axis: occurrence (%), 0.5 to 24.5; y-axis: frequency (number of database releases), 0 to 30.]

average: ~9.5% of FICuS parent structures

Sitzmann M, Ihlenfeldt WD, Nicklaus MC. J Comput Aided Mol Des. 2010 Jun;24(6-7):521-51.

Page 42

Ring-chain tautomerism in a real database

Most commonly applicable ring-chain tautomerism rules in the AMS Database

SMIRKS rule    Count        %
3-exo-Trig     65,435       0.31
4-exo-Trig     10,560       0.05
5-exo-Trig     7,506,722    35.09
6-exo-Trig     5,289,114    24.72
7-exo-Trig     4,185,292    19.56
5-exo-Dig      179,567      0.84
6-exo-Dig      472,074      2.21
7-exo-Dig      3,293,445    15.4
5-endo-Trig    169,007      0.79
6-endo-Trig    156,371      0.73
7-endo-Trig    65,239       0.3

Most common (highest count): 5-exo-Trig
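The % column can be reproduced from the counts if it is read as each rule's share of all rule applications listed in the table. That reading is an assumption, but the Python sketch below shows that it matches the slide's figures to within rounding.

```python
# Counts per ring-chain rule, as listed in the table.
rule_counts = {
    "3-exo-Trig": 65_435, "4-exo-Trig": 10_560, "5-exo-Trig": 7_506_722,
    "6-exo-Trig": 5_289_114, "7-exo-Trig": 4_185_292,
    "5-exo-Dig": 179_567, "6-exo-Dig": 472_074, "7-exo-Dig": 3_293_445,
    "5-endo-Trig": 169_007, "6-endo-Trig": 156_371, "7-endo-Trig": 65_239,
}

total = sum(rule_counts.values())
percentages = {rule: 100 * n / total for rule, n in rule_counts.items()}
most_common = max(rule_counts, key=rule_counts.get)

for rule, pct in percentages.items():
    print(f"{rule:12s} {pct:6.2f}")
print("most common:", most_common)
```

The total exceeds the number of molecules in the database, so the counts are evidently rule applications rather than molecules; one molecule can match several rules or the same rule at several sites.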

Page 43

Imine/amine tautomerism (conflict 5)

1H NMR Spectrum

13C NMR Spectrum

Same 1H and 13C NMR spectrum between samples. Assignment of chemical shifts indicates imine form is present in both samples.

5_1
InChI=1S/C14H11FN6O/c1-22-7-11-12(8-2-4-9(15)5-3-8)14-19-18-10(6-16)13(17)21(14)20-11/h2-5H,7,17H2,1H3
InChIKey=ZUXILZLWKBIETJ-UHFFFAOYSA-N

5_2
InChI=1S/C14H11FN6O/c1-22-7-11-12(8-2-4-9(15)5-3-8)14-19-18-10(6-16)13(17)21(14)20-11/h2-5,17,19H,7H2,1H3
InChIKey=ZDHCKDOWMHOUPE-UHFFFAOYSA-N

Page 44

InChI/InChIKey Resolver

Page 45

InChI/InChIKey Resolver

• “loose coupling” of InChI resolvers provided by different organizations
• central list of resolvers
• each resolver must provide a specific protocol.
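The idea of loosely coupled resolvers behind one common protocol can be sketched with a structural interface and a central list. Everything in the Python sketch below is hypothetical: the slides do not define the protocol, so the `resolve` method, the resolver names and the dictionary-backed stand-ins are assumptions chosen only to show the shape of the design.

```python
from typing import Optional, Protocol


class Resolver(Protocol):
    """What each participating resolver would have to provide (assumed interface)."""
    name: str

    def resolve(self, inchikey: str) -> Optional[str]:
        ...


class DictResolver:
    """Toy resolver backed by an in-memory table; stands in for a remote service."""

    def __init__(self, name: str, table: dict):
        self.name = name
        self.table = table

    def resolve(self, inchikey: str) -> Optional[str]:
        return self.table.get(inchikey)


def resolve_with_all(resolvers: list, inchikey: str) -> dict:
    """Ask every resolver on the central list and collect the answers that exist."""
    answers = {}
    for resolver in resolvers:
        hit = resolver.resolve(inchikey)
        if hit is not None:
            answers[resolver.name] = hit
    return answers


# Central list with two hypothetical resolvers; only one knows ethanol.
central_list = [
    DictResolver("resolver_a", {"LFQSCWFLJHTTHZ-UHFFFAOYSA-N": "CCO"}),
    DictResolver("resolver_b", {}),
]
print(resolve_with_all(central_list, "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
```

Because the interface is structural, an organization could plug in its own resolver without sharing code with the others, which is the point of the loose coupling.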

Page 46

InChI/InChIKey Resolver

Evan Bolton (NCBI, NLM, NIH)

Valery Tkachenko (RSC/ChemSpider)

Marc Nicklaus (CADD Group, NCI, NIH)

Steven Bachrach (Trinity University)

Antony Williams (RSC/ChemSpider)

Markus Sitzmann (CADD Group, NCI, NIH)

Page 47