Page 1: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Implementing ISO 11238 standard compliance with ChemAxon tools

Roger Sayle

NextMove Software, Cambridge, UK

ChemAxon User Group Meeting 2014, Budapest, Hungary, Wednesday 21st May 2014

Page 2: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

What is ISO 11238?

• ISO standard 11238 is entitled “Health Informatics – Identification of medicinal products – Data elements and structures for the unique identification and exchange of regulated information on substances”.

• Defines a framework for uniquely identifying and exchanging compounds of pharmaceutical interest.

• The framework serves a similar role to CAS registry numbers, PubChem CID or InChI-Key, assigning unique identifiers to substances.

Page 3: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Meet the (IDMP) family

• 11238 is one of a suite of 5 related standards, all for “unique identification and exchange of …”

– 11238 “… regulated information on substances”.

– 11239 “… dose forms, units, administration, etc.”.

– 11240 “… units of measurement”.

– 11615 “… regulated medicinal product information”.

– 11616 “… regulated pharmaceutical product information”.

Page 4: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Why is 11238 important?

• EU regulation 520/2012 on “pharmacovigilance” requires countries, regulatory authorities and pharma to adopt the 5 IDMP standards (articles 25 and 26) by 1st July 2016 (article 40).

• Executive summary: It’s the law!

Page 5: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

How it works

• Code Assignment (Authority): Name/Identifier + Connection Table + Properties (Significant Text) → Unique Code

• Code Look-up (Authority): Unique Code → Name/Identifier + Connection Table + Properties (Significant Text)

Page 6: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Likely implementation

• Code Assignment (Authority): Name/Identifier + Connection Table + Properties (Significant Text) → Unique Code

• Code Look-up (Authority): Unique Code → Name/Identifier + Connection Table + Properties (Significant Text)

• Diagram labels: FDA UNII; FDA SRS Search; XML; INN/USAN/CID; FDA/NCATS GInAS; MOL2000/SMILES/InChI; Protein/NA Sequence; ISO11238 Groups 1-4

Page 7: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Current status

• The standard has been ratified and its use has been written into EU law (EU Reg. 520/2012).

• The framework requires non-semantic, random, fixed-length unique identifiers that include an internal integrity check.

• The standard also details constraints on uniqueness.

• Exact implementation details are yet to be determined (to appear in a future “Implementation Guide”).
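To make the identifier requirements above concrete, here is a minimal Java sketch of a non-semantic, random, fixed-length code with a trailing check character. The class name, the 9+1 character layout, the base-36 alphabet and the position-weighted checksum are all illustrative assumptions; the standard's actual scheme is left to the future Implementation Guide, and this is not the FDA UNII algorithm.

```java
import java.security.SecureRandom;

// Hypothetical sketch: random fixed-length identifier with an integrity check.
class RandomSubstanceId {
    private static final String ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final int BODY_LENGTH = 9;  // assumed: 9 random characters + 1 check character
    private static final SecureRandom RNG = new SecureRandom();

    // Check character: position-weighted sum of character values, modulo 36.
    // This illustrative scheme does not catch every possible single-character error.
    static char checkChar(String body) {
        int sum = 0;
        for (int i = 0; i < body.length(); i++) {
            sum += (i + 1) * ALPHABET.indexOf(body.charAt(i));
        }
        return ALPHABET.charAt(sum % ALPHABET.length());
    }

    // The body carries no information about the substance (non-semantic, random).
    static String newId() {
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < BODY_LENGTH; i++) {
            body.append(ALPHABET.charAt(RNG.nextInt(ALPHABET.length())));
        }
        return body.toString() + checkChar(body.toString());
    }

    // Validates length, alphabet and the trailing check character.
    static boolean isValid(String id) {
        if (id == null || id.length() != BODY_LENGTH + 1) {
            return false;
        }
        for (char c : id.toCharArray()) {
            if (ALPHABET.indexOf(c) < 0) {
                return false;
            }
        }
        return id.charAt(BODY_LENGTH) == checkChar(id.substring(0, BODY_LENGTH));
    }
}
```

A real authority would also have to enforce uniqueness, for example by recording every issued code and retrying on a collision; that bookkeeping is omitted here.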

Page 8: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

What will the future look like?

• ISO11238 compliant identifiers will be very similar to the FDA’s UNII (UNique Ingredient Identifier).

• The fixed width non-semantic identifier requirement rules out the use of plain SMILES, InChI, V2000 Mol file and similar encodings.

• The random requirement rules out plain CAS registry numbers, PubChem CIDs and ChEMBL IDs (which use sequential or monotonic number assignment).

• Alternatively, InChI keys or similar hashes (with [CRC] checks) of connection tables+text may be possible.
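The hash-based alternative in the last bullet can be sketched as follows: a fixed-width code derived from the connection table plus significant text, with a CRC appended as the integrity check. The class name, the choice of SHA-256, the 16+8 hexadecimal layout and the zero-byte separator are assumptions made for illustration only, not anything specified by ISO 11238 or InChIKey.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

// Hypothetical sketch: fixed-width hashed identifier with a CRC-32 check suffix.
class HashedSubstanceId {

    // 16 hex characters taken from SHA-256 over structure + text, then 8 hex characters of CRC-32.
    static String idFor(String connectionTable, String significantText) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            sha.update(connectionTable.getBytes(StandardCharsets.UTF_8));
            sha.update((byte) 0);  // separator so ("ab","c") and ("a","bc") hash differently
            sha.update(significantText.getBytes(StandardCharsets.UTF_8));
            byte[] digest = sha.digest();
            StringBuilder body = new StringBuilder();
            for (int i = 0; i < 8; i++) {
                body.append(String.format("%02X", digest[i]));
            }
            return body + crcOf(body.toString());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);  // SHA-256 is mandatory on every Java platform
        }
    }

    // CRC-32 of the identifier body, rendered as 8 upper-case hex characters.
    static String crcOf(String body) {
        CRC32 crc = new CRC32();
        crc.update(body.getBytes(StandardCharsets.US_ASCII));
        return String.format("%08X", crc.getValue());
    }

    // True when the identifier has the expected width and its CRC suffix matches its body.
    static boolean isIntact(String id) {
        return id != null && id.length() == 24
                && id.substring(16).equals(crcOf(id.substring(0, 16)));
    }
}
```

Unlike a randomly assigned code, such an identifier is reproducible from its input, so two parties can compute it independently; the trade-off is that truncated hashes can collide.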

Page 9: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

What’s available now

• ISO charges for access to the official standards documents (which is why five IDMP standards are more profitable than one): about 158 CHF (US$177) from ISO for 11238 [between $120 and $340 online].

• However, as with many ISO standards, late drafts of ISO 11238 are freely available on the internet.

• Caution: Many of the technical examples (all XML) were removed from the final standard and are due to appear in the upcoming “Implementation Guide(s)”.

Page 10: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Example requirement

• §3.4 “Naming of substances” states “at least one substance name or company code shall be associated with each substance”.

• For the envisioned workflows this typically assumes that an INN or USAN name has already been assigned.

• One way to guarantee the existence of a suitable substance name for investigational compounds is to use IUPAC naming software (such as ChemAxon’s) during submission to the unique coding authority.

• Plug: ChemAxon s2n coverage is state-of-the-art.

Page 11: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

The devil is in the details

• One of the interesting cheminformatics challenges of working with the published ISO standard, and with the examples from the draft annex, is the typography.

• The document has been typeset by editors with expertise outside the field of cheminformatics who have inadvertently changed whitespace without appreciating the impact this has on chemistry tools.

Page 12: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Final ISO11238 standard Annex A

• §A.2.3 SMILES uses the example “C1 = CC = CC = C1” where the spurious spaces create problems for SMILES readers.

• §A.2.4 InChI both strips the “InChI=” prefix and again suffers from spaces “1/C6H6 /c1-2-4-6-5-3-1/h1-6H”.

– Interestingly this is an old InChI not a standard InChI.

• §A.2.2 Molfile fails to mention that V2000 mol files use fixed-width columns and blank lines; as a result, the example given in the text [next slide] can’t easily be read.
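For the two line-notation examples above, the repair is simple enough to sketch directly: remove the whitespace the typesetters introduced and, for InChI, restore the missing prefix. The class and method names are hypothetical, and this is a plain string clean-up rather than the grammar-driven correction described later in the talk. Note that in ordinary SMILES files whitespace ends the structure and starts the title, so stripping it is only appropriate when the whole string is known to be a single structure.

```java
// Hypothetical sketch: undo typesetting damage in the Annex A SMILES and InChI examples.
class TypographyRepair {

    // Remove every whitespace character from a SMILES string.
    static String fixSmiles(String smiles) {
        return smiles.replaceAll("\\s+", "");
    }

    // Remove whitespace and put back the "InChI=" prefix if it was stripped.
    static String fixInchi(String inchi) {
        String compact = inchi.replaceAll("\\s+", "");
        return compact.startsWith("InChI=") ? compact : "InChI=" + compact;
    }
}
```

Applied to the strings quoted from §A.2.3 and §A.2.4, this gives `C1=CC=CC=C1` and `InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H`.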

Page 13: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Annex A: example.mol

ACD/Labs0812062058

6 6 0 0 0 0 0 0 0 0 1 V2000

1.9050 −0.7932 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

1.9050 −2.1232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

0.7531 −0.1282 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

0.7531 −2.7882 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

−0.3987 −0.7932 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

−0.3987 −2.1232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

2 1 1 0 0 0 0

3 1 2 0 0 0 0

4 2 2 0 0 0 0

5 3 1 0 0 0 0

6 4 1 0 0 0 0

6 5 2 0 0 0 0

M END

$$$$

Slide callouts: missing blank lines; incorrectly aligned columns.

Page 14: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Benefit of the doubt?

• These unintentional typographical errors in the normative text may perhaps be the result of poor fonts, with the exception of “InChI=”.

• Alas, the content of the original Annex B from the draft indicates these issues were more widespread and may arise from ignorance of cheminformatics file formats.

Page 15: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

§B.2.2 InChI in XML Example

<STRUCTURAL_REPRESENTATION_TYPE>INCHI</STRUCTURAL_REPRESENTATION_TYPE>

<STRUCTURAL_REPRESENTATION>1S/C2H5NO2.AL.CLH.2H2O.ZR/C3-1-

2(4)5;;;;;/H1,3H2,(H,4,5);;1H;2*1H2;/Q;+3;;;;+4/P-

2</STRUCTURAL_REPRESENTATION>

Missing InChI=

Standard and Non-Standard InChI?

Converted to upper case

Indentation

Spurious Spaces

Line Breaks

Page 16: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

§B.2.4 V2000 Mol File in XML Example

<STRUCTURAL_REPRESENTATION_TYPE>MOL</STRUCTURAL_REPRESENTATION_TYPE>

<STRUCTURAL_REPRESENTATION>30 29 0 0 0 0 0 0 0 0999 V2000 9.9563 -7.3055 0.0000 Y

1 1 0 0 0 0 0 0 0 0 0 0 15.0355 -4.8847 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0 13.3609 -

8.0134 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 13.8867 -9.9869 0.0000 O 0 5 0 0 0 0 0 0 0 0 0

0 6.4178 -6.8678 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 5.8872 -4.8955 0.0000 O 0 5 0 0 0 0

0 0 0 0 0 0 6.7218 -5.7285 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 13.0541 -9.1519 0.0000 C

0 0 0 0 0 0 0 0 0 0 0 0 13.3408 -6.8634 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 13.8599 -

4.8881 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 13.0301 -5.7260 0.0000 C 0 0 0 0 0 0 0 0 0 0 0

0 5.9099 -9.9441 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0 6.4492 -7.9743 0.0000 O 0 0 0 0 0 0

0 0 0 0 0 0 6.7482 -9.1149 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 7.8605 -5.4221 0.0000 C 0

0 0 0 0 0 0 0 0 0 0 0 11.8897 -5.4263 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 11.9147 -9.4555

0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 7.8855 -9.4263 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

7.6897 -8.0305 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 7.6897 -6.8513 0.0000 C 0 0 0 0 0 0 0

0 0 0 0 0 8.7018 -6.2618 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 9.2908 -5.2506 0.0000 C 0 0

0 0 0 0 0 0 0 0 0 0 10.4700 -5.2524 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 11.0577 -6.2664

0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 12.0761 -6.8427 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

12.0891 -8.0218 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 8.7257 -8.5952 0.0000 N 0 0 0 0 0 0

0 0 0 0 0 0 11.0839 -8.6223 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0 10.4848 -9.6275 0.0000

C 0 0 0 0 0 0 0 0 0 0 0 0 9.3057 -9.6139 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 10 2 1 0 0 0 0

8 3 2 0 0 0 0 25 24 1 0 0 0 0 8 4 1 0 0 0 0 27 18 1 0 0 0 0 7 5 2 0 0 0 0 26 28 1 0 0 0 0

7 6 1 0 0 0 0 19 27 1 0 0 0 0 15 7 1 0 0 0 0 20 21 1 0 0 0 0 17 8 1 0 0 0 0 30 27 1 0 0 0

0 11 9 2 0 0 0 0 30 29 1 0 0 0 0 11 10 1 0 0 0 0 20 19 1 0 0 0 0 16 11 1 0 0 0 0 22 21 1

0 0 0 0 14 12 1 0 0 0 0 23 24 1 0 0 0 0 14 13 2 0 0 0 0 18 14 1 0 0 0 0 26 25 1 0 0 0 0

21 15 1 0 0 0 0 29 28 1 0 0 0 0 24 16 1 0 0 0 0 23 22 1 0 0 0 0 28 17 1 0 0 0 0 M CHG 4

1 3 4 -1 6 -1 12 -1 M ISO 1 1 90 M END </STRUCTURAL_REPRESENTATION>

Where to begin?

Page 17: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

All is not lost!

• Back at the 2011 ChemAxon UGM here in Budapest, Sorel Muresan from AstraZeneca Sweden gave a presentation on how spelling correction improves the recall of ChemAxon’s name-to-structure tools.

• The exact same CaffeineFix technology can be applied to perform aggressive “spelling correction” on SMILES strings, InChI and V2000 mol files.

• As with IUPAC-like systematic names, these can each be specified by a formal grammar.

Page 18: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

How the algorithm works

• The regular expression describing V2000 mol files is compiled into a “finite state machine” with 1333 states.

• The only allowed “corrections” are the deletion of new lines and the insertion of spaces or new lines, but only where permitted in the grammar/FSM.

• Depth-first recursion is used to identify a minimal set of edits to correct the input.
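The FSM-driven search itself is not reproduced here. As a much simpler stand-in, the sketch below re-flows a whitespace-mangled V2000 connection table by counting tokens: it reads the atom and bond counts, then lays out 16 tokens per atom line and 7 per bond line in fixed-width columns. Everything about it is an assumption for illustration: the class name, the premise that the 3-line header block is missing (as in the draft's §B.2.4 example), that fields other than the counts-line tail are whitespace-separated, and that the only property lines are of the `M  CHG` / `M  ISO` shape followed by `M  END`. It has no error handling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token-counting sketch; NOT the finite-state-machine algorithm from the talk.
class MolReflowSketch {

    // Split on whitespace, rejoining a minus sign that a line break separated from its number.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        String pendingSign = "";
        for (String t : text.replace('\u2212', '-').trim().split("\\s+")) {
            if (t.equals("-")) { pendingSign = "-"; continue; }
            out.add(pendingSign + t);
            pendingSign = "";
        }
        return out;
    }

    // Right-align a token in a field whose width is the next multiple of 3 ("0999" -> "  0999").
    static String pad3(String t) {
        int width = ((t.length() + 2) / 3) * 3;
        return String.format("%" + width + "s", t);
    }

    static String reflow(String mangled) {
        List<String> tok = tokens(mangled);
        StringBuilder sb = new StringBuilder("\n\n\n");  // assumed-missing 3-line header block
        int i = 0;
        StringBuilder counts = new StringBuilder();
        while (!tok.get(i).equals("V2000")) counts.append(pad3(tok.get(i++)));
        i++;                                             // skip the "V2000" token
        int nAtoms = Integer.parseInt(counts.substring(0, 3).trim());
        int nBonds = Integer.parseInt(counts.substring(3, 6).trim());
        sb.append(counts).append(" V2000\n");
        for (int a = 0; a < nAtoms; a++) {               // x, y, z, symbol, then 12 integer fields
            sb.append(String.format("%10s%10s%10s %-3s%2s", tok.get(i), tok.get(i + 1),
                    tok.get(i + 2), tok.get(i + 3), tok.get(i + 4)));
            for (int k = 5; k < 16; k++) sb.append(String.format("%3s", tok.get(i + k)));
            sb.append('\n');
            i += 16;
        }
        for (int b = 0; b < nBonds; b++) {               // 7 three-column fields per bond
            for (int k = 0; k < 7; k++) sb.append(String.format("%3s", tok.get(i++)));
            sb.append('\n');
        }
        while (i < tok.size() && tok.get(i).equals("M")) {
            String tag = tok.get(++i);
            sb.append("M  ").append(tag);
            i++;
            if (tag.equals("END")) { sb.append('\n'); break; }
            boolean first = true;                        // entry count, then " aaa vvv" pairs
            while (i < tok.size() && !tok.get(i).equals("M")) {
                sb.append(String.format(first ? "%3s" : " %3s", tok.get(i++)));
                first = false;
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Token counting cannot recover fields that were already fused in the original fixed-width layout (other than in the tail of the counts line), which is one reason a grammar-constrained search is the more robust approach.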

Page 19: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

§B.2.4 example after correction

30 29 0 0 0 0 0 0 0 0999 V2000

9.9563 -7.3055 0.0000 Y 1 1 0 0 0 0 0 0 0 0 0 0

15.0355 -4.8847 0.0000 * 0 0 0 0 0 0 0 0 0 0 0 0

13.3609 -8.0134 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0

13.8867 -9.9869 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0

6.4178 -6.8678 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0

5.8872 -4.8955 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0

6.7218 -5.7285 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

13.0541 -9.1519 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

13.3408 -6.8634 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0

13.8599 -4.8881 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0

...

21 15 1 0 0 0 0

29 28 1 0 0 0 0

24 16 1 0 0 0 0

23 22 1 0 0 0 0

28 17 1 0 0 0 0

M CHG 4 1 3 4 -1 6 -1 12 -1

M ISO 1 1 90

M END

Slide callout: 3-line header block before the counts line.

Page 20: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

ChemAxon toolkit implementation

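import chemaxon.formats.MolFormatException;
import chemaxon.formats.MolImporter;
import chemaxon.struc.Molecule;

// Try a normal import first; fall back to whitespace repair only if parsing fails.
public static Molecule molFileToChemaxonMol(String molFileStr) throws MolFormatException {
    try {
        return MolImporter.importMol(molFileStr);
    } catch (MolFormatException e) {
        // FixMolFile is the repair class from the forum post cited on this slide.
        molFileStr = FixMolFile.fixMolFile(molFileStr);
        if (molFileStr == null) {
            throw e;  // no repair was found: rethrow the original parse error
        }
        return MolImporter.importMol(molFileStr);
    }
}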

// Java source code available at http://www.chemaxon.com/forum/ftopic1265.html

Page 21: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Geek of the week

• A particularly tricky corner case concerns Accelrys’ Pipeline Pilot-style V2000 mol files, which abbreviate the columns in the atom block (to save space).

• In these files there’s potential ambiguity where the first bond line is mistaken as a continuation of the last (abbreviated) atom line.

• Our solution relies on the atom stereo care field being zero in non-query mol files vs. the non-zero values that appear in the first three fields of bonds.

Page 22: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Lest we forget

• A similar “spelling correction” variant that allows uppercase characters to be mapped to lowercase, and the prefix “InChI=” to magically appear at the start of a string can also be used to fix ISO InChIs.

• Alas, uppercasing an InChI (or any molecular formula) is potentially lossy, e.g. “CsN” vs. “CSn”.
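A minimal sketch of this kind of InChI repair is given below. It restores the `InChI=` prefix, lower-cases the single letter that introduces each layer, and re-cases the formula with a small depth-first parse that prefers one-letter element symbols and falls back to two-letter ones only when the one-letter reading leads to a dead end. The class name and that preference rule are assumptions for illustration, not the CaffeineFix implementation; by design the heuristic resolves ambiguous formulas one way only (for example, an upper-cased `CSN` comes back as `CSN`, never `CsN` or `CSn`), which is exactly the lossiness noted above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: repair an upper-cased, prefix-less InChI.
class InchiCaseFix {
    private static final Set<String> ELEMENTS = new HashSet<>(Arrays.asList((
            "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn "
          + "Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La Ce "
          + "Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn "
          + "Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl "
          + "Mc Lv Ts Og").split(" ")));

    // Depth-first re-casing of an upper-case formula from position pos; null means "no parse".
    static String recase(String s, int pos) {
        if (pos == s.length()) return "";
        char c = s.charAt(pos);
        if (!Character.isLetter(c)) {                 // digits and '.' pass through unchanged
            String rest = recase(s, pos + 1);
            return rest == null ? null : c + rest;
        }
        String one = String.valueOf(c);
        if (ELEMENTS.contains(one)) {                 // prefer a one-letter symbol
            String rest = recase(s, pos + 1);
            if (rest != null) return one + rest;
        }
        if (pos + 1 < s.length() && Character.isLetter(s.charAt(pos + 1))) {
            String two = one + Character.toLowerCase(s.charAt(pos + 1));
            if (ELEMENTS.contains(two)) {             // otherwise try a two-letter symbol
                String rest = recase(s, pos + 2);
                if (rest != null) return two + rest;
            }
        }
        return null;
    }

    static String fix(String raw) {
        String s = raw.replaceAll("\\s+", "");        // drop spaces and line breaks
        if (s.regionMatches(true, 0, "InChI=", 0, 6)) s = s.substring(6);
        String[] layers = s.split("/");
        StringBuilder out = new StringBuilder("InChI=").append(layers[0]);
        for (int i = 1; i < layers.length; i++) {
            String layer = layers[i];
            if (i == 1) {                             // assumed to be the formula layer
                String formula = recase(layer.toUpperCase(), 0);
                if (formula != null) layer = formula;
            } else if (!layer.isEmpty()) {            // layer prefix letter: /C -> /c, /H -> /h ...
                layer = Character.toLowerCase(layer.charAt(0)) + layer.substring(1);
            }
            out.append('/').append(layer);
        }
        return out.toString();
    }
}
```

Only the first character of each later layer is lower-cased, so hydrogens inside the `/h` layer keep their upper-case `H`.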

Page 23: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Before and after InChI example

1S/C17H21CLN4O/C1-22-12-3-2-4-13(22)8-11(7-12)21-17(23)14-5-10(18)6-15-16(14)20-9-19-15/H5-6,9,11-13H,2-4,7-8H2,1H3,(H,19,20)(H,21,23)

InChI=1S/C17H21ClN4O/c1-22-12-3-2-4-13(22)8-11(7-12)21-17(23)14-5-10(18)6-15-16(14)20-9-19-15/h5-6,9,11-13H,2-4,7-8H2,1H3,(H,19,20)(H,21,23)

Page 24: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

How common are the ambiguities?

• 1.35 million standard InChIs from ChEMBL

• Uppercase the InChIs, fix them and check whether the original InChI can be regenerated

• 99.5% roundtrip (6596 discrepancies)

Page 25: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

InChI case-insensitive ambiguities

Page 26: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Conclusions

• The Java source code for recovering V2000 mol files and InChIs from the types of corruption seen in the ISO 11238 draft has now been contributed to the ChemAxon forum, allowing Marvin and JChem to read the examples given in that document.

• Whether this functionality will be required to fully support the final (pending) “Implementation Guide” requirements remains to be seen (and voted upon).

• Attention to detail is important in standards writing.

Page 27: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Final words

• ISO 11238 IDs may become as popular as Chemical Abstracts’ registry numbers.

Page 28: EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools

Acknowledgements

• Daniel Lowe, NextMove Software, Cambridge, UK.

• Richard Bolton, GSK, Stevenage, UK.

• Evan Bolton, NCBI PubChem, Bethesda, MD, USA.

• Dac-Trung Nguyen, NIH NCATS, Rockville, MD, USA.

• Tyler Peryea, NIH NCATS, Rockville, MD, USA.

• Noel Southall, NIH NCATS, Rockville, MD, USA.

• Yulia Borodina, FDA, Silver Spring, MD, USA.

• Lawrence Callahan, FDA, Silver Spring, MD, USA.

• Andrew Marr, Marr Consultancy, Knebworth, UK.
