InChI/InChIKey vs. NCI/CADD Structure Identifiers: A comparison Markus Sitzmann Computer-Aided Drug Design Group (NCI/CADD), Laboratory of Medicinal Chemistry, NCI-Frederick, NIH, DHHS

Page 1: InChI/InChIKey vs. NCI/CADD Structure Identifiers: A ...acscinf.org/docs/meetings/237nm/presentations/237nm17.pdf · Comparison Standard InChI/InChIKeys - NCI/CADD Structure Identifiers

Comparison Standard InChI/InChIKeys - NCI/CADD Structure Identifiers

InChI/InChIKey vs. NCI/CADD Structure Identifiers: A comparison Markus Sitzmann

Computer-Aided Drug Design Group (NCI/CADD), Laboratory of Medicinal Chemistry, NCI-Frederick, NIH, DHHS

Page 2:

The Adaption and Use of the IUPAC InChI/InChIKey

NCI/CADD Identifiers InChI/InChIKey

Chemical Structure Lookup Service

FICTS FICuS uuuuu Std. InChI/InChIKey

74 million structure records – 46 million unique structures

Page 3:

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures

•  based on hashcodes calculated by the chemoinformatics toolkit CACTVS

•  CACTVS hashcodes:

  - represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned)

  - have a high sensitivity to structural features of a compound

  - change if connectivity changes

[Structure drawing: example compound with CACTVS hashcode 9850FD9F9E2B4E25]

Page 4:

[Figure: structure drawings of variants of the example compound, each with its own CACTVS hashcode. Panel labels: charged form, tautomers, isotope, salt, stereoisomers, "errors".]

Hashcodes shown: 9850FD9F9E2B4E25 (twice), A3DAE0788050DDE4, 3ECEF579D7DF025A, E92E4BA2869F3611, 8A7AD1EB498CC76A, 6C16DE2351F9FF50, 8F7A1DE5A733F0E0, 60525E1AF41497B6, B2FDA68AEDA06DB9

Page 5:

NCI/CADD Structure Identifiers: Unique Representation of Chemical Structures

[Workflow diagram]

input structure (MDL Molfile, MDL SDF, SMILES, ChemDraw cdx, PDB)

→ structure normalization

→ parent structure (MDL SDF, SMILES, database)

→ hashcode calculation (E_HASHISY)

→ NCI/CADD Identifier

Page 6:

NCI/CADD Structure Identifiers: Structure Normalization

•  adjustable levels of sensitivity:

Fragments: sensitive, or un-sensitive (keep only largest organic fragment)

Isotopes: sensitive, or un-sensitive (ignore isotope labels)

Charges: sensitive, or un-sensitive (uncharge)

Tautomers: sensitive, or un-sensitive (find canonical tautomer)

Stereochemistry: sensitive, or un-sensitive (discard stereo information)

[Figure: example structure drawings for each feature]

Page 7:

NCI/CADD Structure Identifiers: Structure Normalization

[Figure: sensitivity grid with example structures for Fragments, Isotopes, Charges, Tautomers, and Stereochemistry; each feature can be set to sensitive or un-sensitive]

Page 8:

NCI/CADD Structure Identifiers: Structure Normalization

FICTS identifier: representation of the exact drawing

F I C T S: sensitive to Fragments, Isotopes, Charges, Tautomers, and Stereochemistry

[Figure: sensitivity grid; the example structure pairs are marked = or ≠ under FICTS]

Page 9:

NCI/CADD Structure Identifiers: Structure Normalization

FICuS identifier: comes closest to how a chemist perceives a compound

F I C u S: sensitive to Fragments, Isotopes, Charges, and Stereochemistry; un-sensitive to Tautomers

[Figure: sensitivity grid; the example structure pairs are marked = or ≠ under FICuS]

Page 10:

NCI/CADD Structure Identifier: Structure Normalization

uuuuu identifier: closely related forms of the same compound

u u u u u: un-sensitive to Fragments, Isotopes, Charges, Tautomers, and Stereochemistry

[Figure: sensitivity grid; all example structure pairs are marked = under uuuuu]

Page 11:

NCI/CADD Structure Identifier: Structure Normalization

[Workflow diagram; steps shown:]

input structure

correct structure: add hydrogen atoms, correct functional groups, correct metal atom bonds

get largest fragment & uncharge: delete complex center, get largest organic fragment, delete radical center, uncharge structure

define canonical resonance form/protonation state

define canonical tautomer

discard isotope labels

normalize (n) or discard (d) stereo information

parent structures: uuuuu, uuuuS, uuuTu, uuuTS, FICuu, FICuS, FICTS, FICTu

Page 12:

NCI/CADD Structure Identifier

Format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>

9850FD9F9E2B4E25-FICTS-01-57

9850FD9F9E2B4E25-FICuS-01-78

9850FD9F9E2B4E25-uuuuu-01-27

[Structure drawing: example compound]
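
The identifier format on this slide can be taken apart mechanically. The following is a minimal Python sketch, not part of the original slides: the names `NciCaddIdentifier`, `parse_identifier`, and `sensitivity` are hypothetical, and the checksum is only captured, not verified, because the slides do not say how it is computed.

```python
import re
from typing import NamedTuple


class NciCaddIdentifier(NamedTuple):
    """Fields of <hashcode>-<tag>-<version>-<checksum> (hypothetical container)."""
    hashcode: str
    tag: str
    version: str
    checksum: str


# 16 hex digits, a 5-letter tag (each position is the feature letter or "u"),
# a two-digit version, and a two-digit checksum, as in the slide examples.
_PATTERN = re.compile(r"^([0-9A-F]{16})-([Fu][Iu][Cu][Tu][Su])-(\d{2})-(\d{2})$")

_FEATURES = ("fragments", "isotopes", "charges", "tautomers", "stereochemistry")


def parse_identifier(text: str) -> NciCaddIdentifier:
    """Split an identifier string into its four fields."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not an NCI/CADD identifier: {text!r}")
    return NciCaddIdentifier(*match.groups())


def sensitivity(tag: str) -> dict:
    """Map each feature to True (sensitive) or False ("u", un-sensitive)."""
    return {name: letter != "u" for name, letter in zip(_FEATURES, tag)}


if __name__ == "__main__":
    ident = parse_identifier("9850FD9F9E2B4E25-FICuS-01-78")
    print(ident.hashcode, ident.tag, sensitivity(ident.tag))
```

Reading the tag position by position mirrors the F/I/C/T/S grid on the preceding slides: a "u" in a position means that feature was normalized away before hashing.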

Page 13:

FICTS

[Figure: the structure variants of the example compound labeled with FICTS identifiers. Panel labels: charged form, tautomers, isotope, salt, stereoisomers, "errors".]

Identifiers shown: 9850FD9F9E2B4E25-FICTS (twice), A3DAE0788050DDE4-FICTS, E5F83F10C5DB080A-FICTS (twice), B2FDA68AEDA06DB9-FICTS, E92E4BA2869F3611-FICTS, 8A7AD1EB498CC76A-FICTS, 6C16DE2351F9FF50-FICTS

Page 14:

FICuS

[Figure: the structure variants of the example compound labeled with FICuS identifiers. Panel labels: charged form, tautomers, isotope, salt, stereoisomers, "errors".]

Identifiers shown: 9850FD9F9E2B4E25-FICuS (three times), A3DAE0788050DDE4-FICuS, E5F83F10C5DB080A-FICuS (twice), B2FDA68AEDA06DB9-FICuS, E92E4BA2869F3611-FICuS, 8A7AD1EB498CC76A-FICuS

Page 15:

uuuuu

[Figure: the structure variants of the example compound labeled with uuuuu identifiers. Panel labels: charged form, tautomers, isotope, stereoisomers, salt, "errors".]

Identifiers shown: 9850FD9F9E2B4E25-uuuuu (eight times), 9850FD9F9E2B4E25-FICuS (once)

Page 16:

Std. InChIKey

[Figure: the structure variants of the example compound labeled with Standard InChIKeys. Panel labels: charged form, tautomers, isotope, stereoisomers, salt, "errors".]

InChIKeys shown:

HNDVDQJCIGZPNO-UHFFFAOYSA-N (four times)

HNDVDQJCIGZPNO-CDYZYAPPSA-N

HNDVDQJCIGZPNO-RXMQYKEDSA-N

HNDVDQJCIGZPNO-YFKPBYRVSA-N

UHPNKBYGGMJTIM-UHFFFAOYSA-M (twice)
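
Later slides compare identifiers by the "InChIKey first block". As a hedged illustration (a sketch added here, not code from the talk), the following Python splits a Standard InChIKey into its three hyphen-separated parts and groups the keys from this slide by first block. The function names are hypothetical.

```python
from collections import defaultdict


def split_inchikey(key: str) -> tuple:
    """Return (first block, second block, final character) of an InChIKey."""
    parts = key.strip().split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"unexpected InChIKey layout: {key!r}")
    return parts[0], parts[1], parts[2]


def group_by_first_block(keys) -> dict:
    """Collect the distinct full keys that share each first block."""
    groups = defaultdict(set)
    for key in keys:
        first, _, _ = split_inchikey(key)
        groups[first].add(key)
    return dict(groups)


# The InChIKeys as extracted from this slide.
SLIDE_KEYS = [
    "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
    "HNDVDQJCIGZPNO-CDYZYAPPSA-N",
    "HNDVDQJCIGZPNO-RXMQYKEDSA-N",
    "HNDVDQJCIGZPNO-YFKPBYRVSA-N",
    "UHPNKBYGGMJTIM-UHFFFAOYSA-M",
]

if __name__ == "__main__":
    for first, members in sorted(group_by_first_block(SLIDE_KEYS).items()):
        print(first, len(members))
```

On these keys the grouping yields two first blocks: the four HNDVDQJCIGZPNO keys differ only in the second block, while the UHPNKBYGGMJTIM key forms a group of its own.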

Page 17:

Structure Normalization: Tautomers

[Figure: several tautomeric drawings of one compound; which one is the canonical tautomer?]

Page 18:

Structure Normalization: Tautomers

•  CACTVS: generation of all formal tautomers for a given organic compound (prototropic tautomerism)

•  rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS

•  types of tautomerism covered:

  - 1.3, 1.5 keto/enol

  - imine/enamine

  - imine/amine

  - lactam/lactim

  - 1.3, 1.5, 1.7, 1.11 hydrogen atom shift on (aromatic) heteroatoms

  - keten/ynol

  - nitro/aci-nitro

  - nitroso/oxime

  - special cases: cyanic/iso-cyanic acid, phosphonic acid, formamidinesulfonic acid, isocyanide, furanones and more …

Page 19:

Structure Normalization: Tautomers

•  21 SMIRKS transforms, examples:

transform: 1.3 keto-enol

[O,S,Se,Te;X1:1]=[Cx1:2][CX4R{0-2}:3][#1:4]>>[#1:4][O,S,Se,Te;X2:1][Cx1,cx1:2]=[C,cx1,cx0:3]

transform: 1.3 heteroatom H shift

[N,n,S,s,O,o,Se,Te:1]=[NX2,nX2,C,c,P,p:2][N,n,S,O,Se,Te:3][#1:4]>>[#1:4][N,n,S,O,Se,Te:1][NX2,nX2,C,c,P,p:2]=[N,n,S,s,O,o,Se,Te:3]

transform: 1.5 heteroatom H shift

[nX2,NX2,S,O,Se,Te:1]=[C,c,nX2,NX2:6][C,c:5]=[C,c,nX2:2][N,n,S,s,O,o,Se,Te:3][#1:4]>>[#1:4][N,n,S,O,Se,Te:1][C,c,nX2,NX2:6]=[C,c:5][C,c,nX2:2]=[NX2,S,O,Se,Te:3]
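
To make the structure of these transforms easier to read, here is a small Python sketch that works on the SMIRKS text only. It is an illustration added to this transcript, not CACTVS code: it does no chemistry, does not interpret the CACTVS-specific extensions, and assumes an unbranched pattern in which the mapped hydrogen sits at one end. The function names are hypothetical.

```python
import re

# One mapped atom: "[<expression>:<map number>]"
_ATOM = re.compile(r"\[([^\]]*):(\d+)\]")

# 1.3 keto-enol transform as transcribed from the slide (whitespace removed).
KETO_ENOL = (
    "[O,S,Se,Te;X1:1]=[Cx1:2][CX4R{0-2}:3][#1:4]"
    ">>[#1:4][O,S,Se,Te;X2:1][Cx1,cx1:2]=[C,cx1,cx0:3]"
)


def mapped_atoms(side: str) -> list:
    """List (expression, map number) for each mapped atom, in written order."""
    return [(expr, int(num)) for expr, num in _ATOM.findall(side)]


def hydrogen_neighbor(atoms: list) -> int:
    """Map number of the atom written next to the explicit hydrogen [#1:n].

    Assumes a linear pattern, so adjacency in the text means a bond.
    """
    idx = next(i for i, (expr, _) in enumerate(atoms) if expr == "#1")
    neighbor = atoms[idx - 1] if idx > 0 else atoms[idx + 1]
    return neighbor[1]


def describe_transform(smirks: str) -> dict:
    """Check that both sides use the same atom maps and report the H shift."""
    reactant, product = (part.strip() for part in smirks.split(">>"))
    r_atoms, p_atoms = mapped_atoms(reactant), mapped_atoms(product)
    r_maps = sorted(num for _, num in r_atoms)
    p_maps = sorted(num for _, num in p_atoms)
    if r_maps != p_maps:
        raise ValueError("atom maps differ between reactant and product side")
    return {
        "maps": r_maps,
        "h_from": hydrogen_neighbor(r_atoms),
        "h_to": hydrogen_neighbor(p_atoms),
    }


if __name__ == "__main__":
    print(describe_transform(KETO_ENOL))
```

For the keto-enol pattern this reports that the mapped hydrogen moves from atom 3 (the sp3 carbon) to atom 1 (the chalcogen), which is the 1.3 shift the transform name refers to.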

Page 20:

Structure Normalization: Tautomers

guanine

[Figure: 15 tautomer drawings of guanine, each labeled with a FICTS identifier]

FICTS identifiers shown: A6199E68A788F2F5-FICTS, 959B273B619C709F-FICTS, 61248C4A7D045A47-FICTS, 675R4FCC50F45026-FICTS, 0B345B47F6625113-FICTS, 181CA9BCE3EF47F4-FICTS, 1AD375920BE60DAD-FICTS, 67196F0B20B1D934-FICTS, BCCDA7D0CDACF120-FICTS, CE8F480C11DBFC4F-FICTS, D46A1E6500B06AB6-FICTS, D979CF9770AC0BA5-FICTS, 56FFE8B5619FB01-FICTS, F802E527EC5C61BF-FICTS, EF060DA9D97091DE-FICTS

FICuS identifier shown: BCCDA7D0CDACF120-FICuS

Std. InChIKey shown: UYTPUPDQBNUYGX-UHFFFAOYSA-N

Page 21:

Structure Normalization: Tautomerism & Stereochemistry

methyl propenyl ketone

[Figure: Z and E isomers of methyl propenyl ketone]

Page 22:

Structure Normalization: Tautomerism & Stereochemistry

methyl propenyl ketone

[Figure: Z and E isomers of methyl propenyl ketone, each connected by a "tautomer" arrow to a hydroxy (OH) form]

Page 23:

Structure Normalization: Tautomerism & Stereochemistry

methyl propenyl ketone

[Figure: Z and E isomers of methyl propenyl ketone, each connected by a "tautomer" arrow to a hydroxy (OH) form, plus a fourth drawing]

76D03F08ACDF6C0C-FICuS

FICuS disregards stereochemistry on double bonds if the double bond is not located during tautomer generation.

Page 24:

InChI/InChIKey - NCI/CADD Identifier comparison: Tautomerism & Stereochemistry

methyl propenyl ketone

[Figure: Z and E isomers of methyl propenyl ketone, each connected by a "tautomer" arrow to a hydroxy (OH) form, plus a fourth drawing]

76D03F08ACDF6C0C-FICuS

FICuS disregards stereochemistry on double bonds if the double bond is not located during tautomer generation.

Std. InChI strings and InChIKeys shown:

InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3/b4-3+
LABTWGUMFABVFG-ONEGZZNKSA-N

InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4,6H,1H2,2H3/b5-4-
LYGWZVOQSCPYDG-PLNGDYQASA-N

InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3/b4-3-
LABTWGUMFABVFG-ARJAWSKDSA-N

InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3
LABTWGUMFABVFG-UHFFFAOYSA-N
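
The four InChI strings on this slide differ only in a few layers, which is easy to see once the strings are split at the "/" separators. The Python below is a simplified sketch added for illustration; it assumes each layer prefix occurs at most once, which holds for these four strings but not for every InChI. The function names are hypothetical.

```python
def inchi_layers(inchi: str) -> dict:
    """Split an InChI string into a dict of layers keyed by prefix letter."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError(f"not an InChI string: {inchi!r}")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]  # e.g. "c1-3-4-5(2)6" -> layers["c"]
    return layers


def differing_layers(a: str, b: str) -> list:
    """Names of the layers in which two InChI strings differ."""
    la, lb = inchi_layers(a), inchi_layers(b)
    return sorted(name for name in set(la) | set(lb) if la.get(name) != lb.get(name))


# Strings as shown on the slide.
KETONE_PLUS = "InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3/b4-3+"
KETONE_MINUS = "InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3/b4-3-"
KETONE_NO_STEREO = "InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3"
HYDROXY_FORM = "InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4,6H,1H2,2H3/b5-4-"

if __name__ == "__main__":
    print(differing_layers(KETONE_PLUS, KETONE_MINUS))
    print(differing_layers(KETONE_PLUS, KETONE_NO_STEREO))
    print(differing_layers(KETONE_PLUS, HYDROXY_FORM))
```

The two ketone stereoisomers differ only in the /b layer, whereas the hydroxy form also differs in the /h layer; the formula and /c layers are identical in all four strings.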

Page 25:

InChI/InChIKey - NCI/CADD Identifier comparison: Tautomerism & Stereochemistry

methyl propenyl ketone

FICTS "sees" four different structures

[Figure: the four drawings of methyl propenyl ketone and its tautomers, each labeled with a FICTS identifier]

FICTS identifiers shown: 821D8C17ACE5040E-FICTS, 6EB4AA2BAA11965F-FICTS, 1677645190718885-FICTS, 76D03F08ACDF6C0C-FICTS

Page 26:

Structure Normalization: Charges in Resonance Systems

[Figure: nitrogen-containing structures in different protonation states, labeled with the CACTVS hashcodes F3A27F03AE77A722 (twice), 62FADCB01F197FC9, and 2E011EE4519F7920. Annotations: "canonical resonance structure?", "uncharge" (twice), "problem!", "different protonation states".]

Page 27:

Structure Normalization: Charges in Resonance Systems

•  generation of all formal resonance structures for a given (charged) organic compound

•  rule set of 14 transforms encoded as (CACTVS-extended) SMIRKS

  - shifting of charges: 5 rules

  - recombination of charges: 5 rules

  - separation of charges: 4 rules

[Figure: six resonance drawings of an O-N-O fragment]

Page 28:

Structure Normalization: Charges in Resonance Systems

münchnones (no plausible unpolarized resonance structure can be drawn):

[Figure: eight resonance drawings of a münchnone, connected by transforms labeled 1.2 shift, 1.2 recombination, 1.3 shift, 1.3 recombination, and separation (pentavalent N atom)]

IUYUGWCTOLFFCL-UHFFFAOYSA-N

F68AC07DE0D3379F-FICuS

Page 29:

InChI/InChIKey - NCI/CADD Identifier comparison: »Chemical Structure Lookup Service« Database

74 million structure records (~46 million unique structures)

•  PubChem database (including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider …)

•  ChemNavigator iResearch Library (compilation of commercially available screening compounds from ~250 international chemistry suppliers)

•  Commercial Sources / Others (Asinex, Comgenex, …)

[Pie chart: PubChem ~47%, ChemNavigator iResearch Library ~43%, Others ~10%]

Page 30:

InChI/InChIKey - NCI/CADD Identifier comparison: Unique Structure Counts

•  structure records registered in CSLS: 74.2 million

successful calculation of:

  - Standard InChI/InChIKey: 73.8 million records

  - NCI/CADD Structure Identifiers: 73.7 million records

•  unique structure counts (compound sets)

Standard InChI/InChIKey: 48,027,940

FICTS Identifier: 48,023,835

FICuS Identifier: 46,715,521

Standard InChIKey (first block): 43,055,589

uuuuu Identifier: 41,671,010

Standard InChI/InChIKeys were calculated by stdinchi-1 (Linux i-386 executable) from the original SD file records

Page 31:

InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison

[Diagram with three boxes:]

original structure record set (74.2 million)

FICuS compound set (46.7 million unique)

Standard InChI/InChIKey set calculated by stdinchi-1 (73.8 million, 48.0 million unique)

Page 32:

[Diagram as on Page 31, with the comparison "1 conflicts?" added]

Page 33:

[Diagram as on Page 32]

Page 34:

InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison

[Diagram with four boxes:]

original structure record set (74.2 million)

FICuS compound set (46.7 million unique)

Standard InChI/InChIKey set calculated by stdinchi-1 (73.8 million, 48.0 million unique)

Standard InChI/InChIKey calculated by CACTVS from FICuS compound structure

Comparison added: "2 same InChI/InChIKey?"

Page 35:

InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison

1 no conflicts between Std. InChI/InChIKey and FICuS

structure records (million records)

all structure records: 73.7

FICuS linked to a single InChI/InChIKey: 62.3 (84.5%)

  - both linked to a single structure record: 34.4 (46.9%)

  - both linked to multiple structure records: 27.9 (38.0%)

Page 36:

InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison

1 conflicts between Std. InChI/InChIKey and FICuS

structure records (million records)

all structure records: 73.7

FICuS is linked to multiple InChI/InChIKeys or vice versa: 10.9 (14.7%)

  - one FICuS is linked to multiple InChI/InChIKeys: 6.8 (9.2%)

  - one InChI/InChIKey is linked to multiple FICuS: 4.1 (5.5%)

Page 37:

InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison

1 conflicts between Std. InChI/InChIKey and FICuS

structure records (million records)

all structure records: 73.7

FICuS is linked to multiple InChI/InChIKeys or vice versa: 10.9 (14.7%)

  - one FICuS is linked to multiple InChI/InChIKeys: 6.8 (9.2%); number of InChIKey first block: 2.3 (3.1%)

  - one InChI/InChIKey is linked to multiple FICuS: 4.1 (5.5%); number of InChIKey first block: 1.0 (1.3%)
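
The conflict categories on these slides can be reproduced on any table of (FICuS, InChIKey) pairs. The following Python is a minimal sketch on invented toy data, not the procedure used for the CSLS numbers: the function name, the category labels, and the rule that a record falling into both conflict categories is counted under "one FICuS, many keys" are all assumptions.

```python
from collections import Counter, defaultdict


def classify_links(records) -> Counter:
    """Count structure records per link category.

    records: iterable of (ficus, inchikey) pairs, one pair per structure record.
    """
    records = list(records)
    keys_per_ficus = defaultdict(set)
    ficus_per_key = defaultdict(set)
    for ficus, key in records:
        keys_per_ficus[ficus].add(key)
        ficus_per_key[key].add(ficus)

    counts = Counter()
    for ficus, key in records:
        if len(keys_per_ficus[ficus]) > 1:
            counts["one_ficus_many_keys"] += 1  # assumed tie-break: checked first
        elif len(ficus_per_key[key]) > 1:
            counts["one_key_many_ficus"] += 1
        else:
            counts["no_conflict"] += 1
    return counts


# Invented placeholder identifiers, for illustration only.
TOY_RECORDS = [
    ("A-FICuS", "KEY1"), ("A-FICuS", "KEY1"),  # one-to-one, two records
    ("B-FICuS", "KEY2"), ("B-FICuS", "KEY3"),  # one FICuS, two InChIKeys
    ("C-FICuS", "KEY4"), ("D-FICuS", "KEY4"),  # one InChIKey, two FICuS
]

if __name__ == "__main__":
    print(dict(classify_links(TOY_RECORDS)))
```

Passing only the first block of each InChIKey instead of the full key gives the coarser "InChIKey first block" variant of the same count.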

Page 38:

InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison

2 same InChI/InChIKey?

compounds (unique structures) (million records)

FICuS: 46.7 compounds; InChI changes: 6.4 (13.7%)

FICTS: 48.0 compounds; InChI changes: 3.8 (7.9%)

uuuuu: 41.6 compounds; InChI changes: 11.9 (28.6%)

structure records (million records)

all records: 73.7

FICuS: InChI changes: 9.3 (12.7%)

FICTS: InChI changes: 4.6 (6.2%)

uuuuu: InChI changes: 21.9 (29.7%)

Page 39:

InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison

2 same InChI/InChIKey?

compounds (unique structures) (million records)

FICuS: 46.7 compounds; InChI changes: 6.4 (13.7%)

FICTS: 48.0 compounds; InChI changes: 3.8 (7.9%)

uuuuu: 41.6 compounds; InChI changes: 11.9 (28.6%)

structure records (million records)

all records: 73.7

FICuS: InChI changes: 9.3 (12.7%)

FICTS: InChI changes: 4.6 (6.2%)

uuuuu: InChI changes: 21.9 (29.7%)

vs. InChIKey first block: 3.2 (7.6%) and 6.3 (8.4%)

Page 40:

InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison

compound classification: occurrence in FICuS set / occurrence in FICuS subset (InChI changes)

(formal) tautomer count > 1: 56.4% / 14.5%

(formal) tautomer count > 3: 25.4% / 18.5%

(formal) tautomer count > 10: 5.5% / 28.9%

full stereo: 25.7% / 16.9%

contains metal atoms: 0.8% / 34.5%

metal complexes: 0.2% / 52.1%

salt: 1.0% / 18.6%

has resonance charges: 0.2% / 52.1%

inorganic: 0.1% / 33.9%

Page 41:

InChI/InChIKey - NCI/CADD Identifier comparison

ChemBlock A3422/0145215

[Structure drawing]

FICuS: 12 different structure records are linked to this structure

Std. InChI/InChIKey (stdinchi-1): calculates 3 different strings/keys for these 12 structure records (all have the same connectivity layer/first block)

All 3 of these StdInChI/InChIKeys differ from the StdInChI/InChIKey calculated after FICuS normalization (including the connectivity layer/first block)

Page 42:

InChI/InChIKey - NCI/CADD Identifier comparison

ChemBlock A3422/0145215

[Figure: the ChemBlock structure together with two related drawings labeled Z and E]

Page 43:

[Figure as on Page 42, with a further drawing added under the label "tautomer:"]

Page 44:

[Figure as on Page 43, with a further drawing added and the annotation "tautomeric interconversion?"]

Page 45:

[Figure as on Page 44, with two further drawings labeled S and R and a second annotation "tautomeric interconversion?"]

Page 46:

[Figure as on Page 45]

Page 47:

InChI/InChIKey - NCI/CADD Identifier comparison

How many structures?

[Figure as on Page 46: the ChemBlock structure with its Z/E forms, tautomer, and S/R forms, now labeled with database records]

Database records shown:

ZINC04685909

ChemBlock A3422/0145215

ChemNavigator 47748165

NIST MS-Lib 1967005690

ChemNavigator 34903393

ChemNavigator 65635274

Page 48:

InChI/InChIKey - NCI/CADD Identifier comparison

How many structures?

[Figure as on Page 47: the ChemBlock A3422/0145215 structure with its Z/E forms, tautomer, and S/R forms, now grouped by identifier]

InChIKey A, InChIKey B, InChIKey C: same connectivity layer/block

FICuS: parent structure

Page 49:

The Adaption and Use of the IUPAC InChI/InChIKey

NCI/CADD Identifiers InChI/InChIKey

Chemical Structure Lookup Service

FICTS FICuS uuuuu Std. InChI/InChIKey

74 million structure records – 46 million unique structures

http://cactus.nci.nih.gov/lookup

Page 50:

Web Service

Chemical Structure REST Service (beta)

URL scheme:

http://cactus.nci.nih.gov/chemical/structure/{identifier}/{method}

Examples:

http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/smiles

http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/names

http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/ficus

http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/stdinchi

http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/image

http://cactus.nci.nih.gov/chemical/structure/ethanol/stdinchikey

http://cactus.nci.nih.gov/chemical/structure/64-17-5/stdinchikey

Returns plain text or a GIF image. If the structure identifier is not resolvable, the service responds with an HTTP 404 status code.
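
A client for the URL scheme above might look like the following Python sketch. It is an illustration only: the function names `build_url` and `resolve` are hypothetical, the percent-encoding choice is an assumption, and the service was labeled beta on the slide, so the host and behavior may have changed since the talk.

```python
from typing import Optional
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

# Base URL as given on the slide.
BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_url(identifier: str, method: str) -> str:
    """Fill the {identifier}/{method} URL scheme shown on the slide."""
    # Assumption: keep "=" readable (as in "InChIKey=...") and
    # percent-encode other reserved characters.
    return f"{BASE_URL}/{quote(identifier, safe='=')}/{quote(method, safe='')}"


def resolve(identifier: str, method: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch a plain-text representation; return None on HTTP 404."""
    try:
        with urlopen(build_url(identifier, method), timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as error:
        if error.code == 404:  # identifier not resolvable, per the slide
            return None
        raise


if __name__ == "__main__":
    print(build_url("ethanol", "stdinchikey"))
    print(build_url("InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "smiles"))
```

`resolve` is meant for the text methods such as `smiles` or `stdinchikey`; the `image` method returns binary data and would need to be read without decoding.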

Page 51:

Acknowledgments

ChemNavigator: Scott Hutton, Tad Hurst

CADD Group, LMC, NCI: Marc Nicklaus, Igor V. Filippov

CACTVS, Xemistry GmbH: Wolf-Dietrich Ihlenfeldt

Thanks to all database providers

Thanks to the InChI Team

Our web site: http://cactus.nci.nih.gov