Page 1: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Generating canonical identifiers for (glycoproteins and other chemically modified) biopolymers

Roger Sayle, John May & Noel O’Boyle

NextMove Software, Cambridge, UK

250th ACS National Meeting, Boston, MA. Sunday 16th August 2015

Page 2: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Motivation

• Non-standard peptides, post-translationally modified proteins and drug-antibody conjugates are becoming increasingly relevant to the life sciences.

• Registration of biologics, beyond the FASTA sequence, is considered desirable but technically challenging.

• In this talk, I discuss two complementary approaches to biologics registration: one based upon expressive all-atom representations, the other on tracking deltas against a reference database of protein sequences.

Page 3: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Real-world small-scale example

• Many research reagents contain “hybrid molecules”

– Innovagen SP-5125 lauroyl-apelin-13: dodecanoyl-QRPRLSHKGPMPF

– Innovagen SP-5126 myristoyl-apelin-13: tetradecanoyl-QRPRLSHKGPMPF

– Innovagen SP-5124 palmitoyl-apelin-13: hexadecanoyl-QRPRLSHKGPMPF

– Innovagen SP-5127 stearoyl-apelin-13: octadecanoyl-QRPRLSHKGPMPF

Page 4: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

The cutting edge of biosimilarity

• The high prevalence of potentially life-threatening hypersensitivity reactions to the antibody cetuximab (Erbitux) in some US states has been traced to its glycosylation [containing a Gal(α1-3)Gal epitope].

Chung et al., “Cetuximab-induced anaphylaxis and IgE specific for galactose-alpha-1,3-galactose”, New England Journal of Medicine, Vol. 358, No. 11, pp. 1109-1117, 13th March 2008.

• Similarly, Human Erythropoietin (EPO) alpha, beta, delta and omega share the same primary sequence, but differ in their glycosylation patterns.

Page 5: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Monomer dictionaries don’t scale

• Systems based upon monomer dictionaries (such as HELM and PDB) are notoriously difficult to maintain.

• The limited number of monomers in proteinogenic peptides and natural nucleic acid sequences leads to a false sense of security: the belief that the set of monomers is finite.

• In practice, the number of monomers, post-translational and chemical modifications is infinite.

• Even more difficult than standardizing monomer definitions via a central repository, like PDB, is allowing local custom definitions.

Page 6: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

48 hexopyranoses

Page 7: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

264 deoxy-hexopyranoses

Page 8: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

9540 substituted hexopyranoses (4 most common substituents)

Page 9: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

The current situation

• Pistoia HELM can’t [yet] handle/canonicalize glycans and oligosaccharides.

– It can’t uniquely canonicalize Fmoc-Ala-OH (between Pistoia and ChEMBL monomer sets).

• IUPAC InChI can’t [yet] officially handle more than 1024 atoms.

• Folks working on glycoproteins are screwed…

(or use expensive commercial software)

Page 10: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Constructive suggestion…

• Ideally, a chemical identifier should be independent of the input representation or file format.

• Equivalence between small molecules, peptides and proteins is best determined by a single identifier, preferably the existing standard InChI.

• This is possible as increases in computer power and storage mean that cheminformatics toolkits can handle huge biopolymers on modern hardware.

Page 11: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Three recent experiments

1. Is it possible to generate standard InChI for extremely large molecules (polymers)?

2. How well do all-atom canonicalization algorithms scale and can they be improved?

3. Are there alternative canonical identifiers that can be useful in bioinformatics and precision medicine?

Page 12: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Previous InChI key record (2014)

• Sequence Identifier: UTP10_KLULA

• Sequence Length: 1774 amino acids

• Molecule size: 28509 atoms

• InChI Length: 119699 characters

• InChI key: PHBRSEQMAKHFGD-ZBXWIJJNSA-N

• InChI Canonicalization Time: 73.2s

• Canonical SMILES Length: 35408 chars

• OEChem SMILES Canonicalization Time: 0.4s

Page 13: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Special classes of molecules

• Alkanes

– InChI=1S/C6H14/c1-3-5-6-4-2/h3-6H2,1-2H3

– InChI=1S/C8H18/c1-3-5-7-8-6-4-2/h3-8H2,1-2H3

– InChI=1S/C10H22/c1-3-5-7-9-10-8-6-4-2/h3-10H2,1-2H3

– 1 million carbons, InChI is 6,889,942 bytes (~6.9Mbytes)

– 1 billion carbons, InChI is 9,888,888,954 bytes (~9.9Gbytes)

• Polyalanine

– InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1

– 1 thousand L-alanines, InChI is 42,965 bytes (~43Kbytes)

– 1 million L-alanines, InChI is 66,888,995 bytes (~66.9Mbytes)

• Theoretically one could write an efficient fasta2inchi
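The regularity of the alkane examples on this slide suggests how such a direct generator might look. The following is a minimal sketch, assuming that the numbering pattern visible in the three InChIs listed (odd atom numbers ascending, then even atom numbers descending) continues for longer unbranched chains. The helper name `alkane_inchi` is hypothetical and is not part of any InChI library; the sketch has only been checked against the three strings shown above.

```python
def alkane_inchi(n: int) -> str:
    """Build the InChI string of an unbranched alkane with n carbons.

    Assumption: the connectivity layer follows the pattern of the C6, C8
    and C10 examples (odd atom numbers up, even atom numbers back down).
    """
    if n < 4:
        raise ValueError("this sketch only covers chains of 4 or more carbons")
    odds = range(1, n + 1, 2)                   # 1, 3, 5, ...
    last_even = n if n % 2 == 0 else n - 1
    evens = range(last_even, 0, -2)             # ..., 6, 4, 2
    chain = "-".join(str(i) for i in (*odds, *evens))
    # Atoms 1 and 2 are the terminal CH3 groups; atoms 3..n are CH2 groups.
    return f"InChI=1S/C{n}H{2 * n + 2}/c{chain}/h3-{n}H2,1-2H3"


if __name__ == "__main__":
    for carbons in (6, 8, 10):
        print(alkane_inchi(carbons))
```

No molecular graph is built and no canonicalization is run; the string is written out in time linear in its length, which is the property a hypothetical fasta2inchi would aim for on regular polymers.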

Page 14: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Algorithm scaling to 100AA

Page 15: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Algorithm scaling to 1000AA

Page 16: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Algorithm scaling to 5000AA

Page 17: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Algorithm scaling to maximum

Page 18: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Spread of algorithm run-times

Page 19: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Peptide names (ChEMBL)

• The following names are machine generated

• [15-L-arginine]nociceptin CHEMBL526333

• [2-4-chloro-L-phenylalanine]neuropeptide S [human] CHEMBL441576

• [1-L-threonine]cyclosporin A CHEMBL2370014

• [6-L-tryptophan]sermorelin free acid CHEMBL440438

• angiotensin II (3-8) CHEMBL261120

• nociceptin amide CHEMBL389521

• acetyl-alpha-MSH (4-10) amide CHEMBL410411

• [2-L-cysteine,13-L-cysteine]neurotensin disulfide CHEMBL3278512

• myristoyl-[1-L-lysine,4-L-tryptophan]tetrapandin 2 amide CHEMBL3288219

• [2-(4RS)-thiazolidine-4-carboxylic acid,4-L-proline]endomorphin-2 CHEMBL126611

• [22-L-serine]kalata B1 CHEMBL1801140

Page 20: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Scaling-up protein variant naming

• The algorithm described for naming peptides can also be applied to naming arbitrary protein variants.

• Consider a database of the following 11 peptides:

– CFFQNCPRG phenylpressin

– CFVRNCPTG annetocin

– CFWTSCPIG octopressin

– CYFQNCPRG argipressin

– CYFQNCPKG lypressin

– CYFRNCPIG cephalotocin

– CYIQNCPLG oxytocin

– CYIQNCPPG prol-oxytocin

– CYIQNCPRG vasotocin

– CYIQSCPIG seritocin

– CYISNCPIG isotocin

Page 21: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

DAG representation of sequences

These 11 peptides may be efficiently represented and searched as a “directed acyclic graph” [38 vs. 99 states]
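One common way to obtain such a graph is to build a prefix tree (trie) and then merge subtrees that spell out identical sets of suffixes. The sketch below does this for the 11 peptides of the previous slide. It is an illustration under stated assumptions, not the authors' data structure: the `$` end marker and the function names are hypothetical, and the resulting state count depends on the construction, so it need not equal the figure quoted on the slide.

```python
PEPTIDES = [
    "CFFQNCPRG", "CFVRNCPTG", "CFWTSCPIG", "CYFQNCPRG", "CYFQNCPKG",
    "CYFRNCPIG", "CYIQNCPLG", "CYIQNCPPG", "CYIQNCPRG", "CYIQSCPIG",
    "CYISNCPIG",
]
END = "$"  # assumed end-of-sequence marker; not an amino-acid code


def build_trie(words):
    """Nested dicts: one node per distinct prefix."""
    root = {}
    for word in words:
        node = root
        for ch in word + END:
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.values())


def to_dag(trie):
    """Merge identical subtrees; returns (root_id, table of edge dicts)."""
    registry = {}  # subtree signature -> state id
    table = []     # state id -> {residue: child state id}

    def visit(node):
        edges = {ch: visit(child) for ch, child in node.items()}
        signature = tuple(sorted(edges.items()))
        if signature not in registry:
            registry[signature] = len(table)
            table.append(edges)
        return registry[signature]

    return visit(trie), table


def accepts(root, table, word):
    """Follow edges from the root; True only for stored sequences."""
    state = root
    for ch in word + END:
        if ch not in table[state]:
            return False
        state = table[state][ch]
    return True


if __name__ == "__main__":
    trie = build_trie(PEPTIDES)
    root, table = to_dag(trie)
    print(count_nodes(trie), "trie nodes,", len(table), "DAG states")
    print(accepts(root, table, "CYIQNCPLG"))  # oxytocin is stored
```

Sharing both prefixes and suffixes is what makes the graph smaller than the 99 residues of the flat list, while lookup still costs one edge per residue.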

Page 22: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Entirety of UniProt/Swiss-Prot

• Using this representation, all 540546 protein sequences in uniprot_sprot, which together contain over 192M amino acids, require 142M states (1.4Gb).

• This data structure allows close analogues to be identified much faster than using NCBI blastp.

• For example, all 540546 sequences can be queried against this database (i.e. all-against-all) in ~9m30s on a single core on a laptop.

• The sequence from PDB 1CRN (crambin 46AA) is canonically named as [L25I]P01542 in 0.002s.

Page 23: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Application to precision medicine

• A more realistic example is that the sequence of the gene “spastic paraplegia 4” with six mutations from OMIM:604277 can be canonically named as [I344K,S362C,N386S,D441G,C448Y,R499C]Q9UBP0

• Run-time for this query is 0.2s.

• By comparison, blastp 2.2.29+ takes about 6s.

– With default arguments, NCBI blastp run time is 7s.

– Only 6s with -num_descriptions 1 -num_alignments 1.
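The variant names shown on these slides, such as [L25I]P01542, can be illustrated with a small sketch. It assumes that each entry reads reference residue, 1-based position, variant residue, and it handles substitutions only (no insertions or deletions). The function names are hypothetical, and the linear scan over the database merely stands in for the much faster graph search described in the talk.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def name_variant(query: str, reference: str, accession: str) -> str:
    """Name `query` as a list of substitutions relative to `reference`."""
    if len(query) != len(reference):
        raise ValueError("this sketch only handles equal-length sequences")
    subs = [
        f"{ref}{pos}{new}"
        for pos, (ref, new) in enumerate(zip(reference, query), start=1)
        if ref != new
    ]
    return f"[{','.join(subs)}]{accession}" if subs else accession


def name_against(query: str, database: dict) -> str:
    """Pick the closest same-length reference; ties are broken by name."""
    candidates = [n for n, s in database.items() if len(s) == len(query)]
    if not candidates:
        raise ValueError("no reference of matching length")
    best = min(candidates, key=lambda n: (hamming(query, database[n]), n))
    return name_variant(query, database[best], best)


if __name__ == "__main__":
    # Vasotocin differs from oxytocin only at position 8 (L -> R).
    print(name_variant("CYIQNCPRG", "CYIQNCPLG", "oxytocin"))  # [L8R]oxytocin
```

Breaking ties deterministically (here by reference name) matters if the result is to serve as a canonical identifier: the same query must always map to the same name.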

Page 24: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Conclusions

• “InChI for large molecules” can be achieved, and remain compatible with small molecule InChI identifiers, through the evolution of ever better canonicalization algorithms.

• Journal reviewers who claim that the run-time of canonicalization algorithms is a non-issue, and not an area ripe for improvement are… very mistaken.

Page 25: CINF 1: Generating Canonical Identifiers For (Glycoproteins And Other Chemically Modified) Biopolymers

Acknowledgements

• Greg Landrum, Novartis, Basel, Switzerland.

• Nadine Schneider, Novartis, Basel, Switzerland.

• Evan Bolton, NCBI PubChem project, Bethesda, MD.

• Joann Prescott-Roy, Novartis, Cambridge, MA, USA.

• Daniel Lowe, NextMove Software, Cambridge, UK.
