ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem



Extraction of structural information from ChemDraw CDX files: easy, or an underestimated, difficult challenge?

Josef Eiblmaier (InfoChem, Germany)

In the past decade, various systems for the automatic identification and extraction of chemistry-related information from unstructured sources have emerged. They have opened up new possibilities for organizing, querying, and analyzing chemical content to support the research and development process. Patent authorities and scientific publishers make available, on a large scale, not only full text and images but also ChemDraw CDX files for many sources. The chemical information in these CDX files is primarily intended for the layout of publications, yet it is often erroneously considered to be readily usable as input for building structure and reaction databases. Unfortunately, the automatic work-up of chemical structures and reactions from these CDX files faces serious obstacles. As a result, the information produced is often incorrect or incomplete and therefore not properly available to information professionals via structure and reaction searching. This talk will present different approaches to extracting reactions and structures correctly from CDX files and will describe the main difficulties and drawbacks encountered.

Transcript of ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Page 1: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

InfoChem GmbH © 2013 Dr. Josef Eiblmaier ICIC 2013 Vienna, October 13 – 16

Extraction of structural information from ChemDraw CDX files: easy, or an underestimated, difficult challenge?

Josef Eiblmaier, Hans Kraut, Sascha Hausberg, Peter Loew

ICIC 2013 Vienna, October 13 – 16

Page 2: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Outline

» ChemDraw files: Relevance and the Challenge
» Approach
» Projects
  • InfoChem ChemProspector
  • Wiley Smart Article
  • Thieme Science of Synthesis Update / Pharmaceutical Substances
» Conclusion / Outlook


Page 3: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Patents, Journal Articles and MRWs: a Buried Treasure?

» Reactions (CDX files)
» Chemical structures (images)
» Markush structures (text, images, CDX)
» Chemical structures (CDX files)
» Chemical names/fragments (text)

Page 4: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Manuscript submission → Publishing → Database production (e.g. SciFinder, Reaxys, SPRESI, eEROS, ...)

Manuscript → Article → Database …

Manual Indexing

Page 5: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

CDX Scheme vs. Database Record

ChemDraw file                                    Database
Purpose: presentation / publishing, no search    Purpose: search / retrieval
Unstructured                                     Structured
Structures: no strict rules                      Structures: strict rules
General rules: none                              Database rules: strict

Roles marked in the example scheme: Reactant, Product, Reagent, Solvent, Catalyst

Conditions shown in the example scheme: SOCl2; LiOH; H2O, THF; Pd(OAc)2; Cl-CO2Et, Et3N; Acetone, H2O

Source: Thieme Pharmaceutical Substances, Ticagrelor (in production)

Page 6: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

CDX Scheme Processing: what does that mean?

ICSchemeProcessor converts a CDX scheme into:

» Chemical structures (SD files)
» Reactions (RD files)
» Conditions (RD files): Reagent, Solvent, Catalyst

Conditions in the example scheme: SOCl2; LiOH; H2O, THF; Pd(OAc)2; Cl-CO2Et, Et3N; Acetone, H2O

Source: Thieme Pharmaceutical Substances, Ticagrelor (in production)
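To illustrate what turning free-text arrow labels into structured conditions can involve, here is a minimal Python sketch. It is not the ICSchemeProcessor implementation, which the slides do not describe. The lookup tables, the class name `ReactionConditions`, and the function `assign_roles` are all hypothetical, and treating every unrecognized label as a reagent is a simplifying assumption.

```python
from dataclasses import dataclass, field

# Hypothetical lookup tables (lower-cased). A production system would need
# curated, source-specific dictionaries, not these few illustrative entries.
KNOWN_SOLVENTS = {"h2o", "thf", "acetone"}
KNOWN_CATALYSTS = {"pd(oac)2"}


@dataclass
class ReactionConditions:
    """Structured counterpart of the free-text labels drawn at a reaction arrow."""
    reagents: list[str] = field(default_factory=list)
    solvents: list[str] = field(default_factory=list)
    catalysts: list[str] = field(default_factory=list)


def assign_roles(arrow_text: str) -> ReactionConditions:
    """Split a comma-separated arrow label and sort each entry into a role."""
    conditions = ReactionConditions()
    for raw in arrow_text.split(","):
        label = raw.strip()
        if not label:
            continue  # skip empty pieces such as a trailing comma
        key = label.lower()
        if key in KNOWN_SOLVENTS:
            conditions.solvents.append(label)
        elif key in KNOWN_CATALYSTS:
            conditions.catalysts.append(label)
        else:
            # Assumption: anything unrecognized is treated as a reagent.
            conditions.reagents.append(label)
    return conditions
```

A call such as `assign_roles("Acetone, H2O,")` would put both labels into `solvents`. The default-to-reagent fallback is exactly the kind of rule that yields wrong database records when a drawing deviates from the expected conventions.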

Page 7: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

But: CDX files, often an optical illusion!

Authors are very inventive in pursuit of a 'perfect' layout!

Appearances are deceiving!

» Usage of graphical symbols

• Polymer supports

• Heteroatoms

C Grid:

Page 8: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Optical illusions 2

» Unresolvable labels

• Labels not defined

• Element symbols used as R-group labels

• Ambiguous fragment labels (e.g. molecular formula)
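The three label problems listed on this slide can be pictured as a small classification step. The Python sketch below is only an illustration under stated assumptions: the element list is a short subset, the regular expressions cover just simple `R`-style labels and `CnHm` formulas, and the function name `classify_label` and its category strings are invented for this example.

```python
import re

# Illustrative subset of element symbols that also get used as generic
# placeholders; a real check would use the full periodic table.
ELEMENT_SYMBOLS = {"B", "C", "N", "O", "F", "P", "S", "W", "Y", "U", "V", "I"}

R_GROUP_PATTERN = re.compile(r"^R\d*'*$")       # R, R1, R4, R', ...
ALKYL_FORMULA_PATTERN = re.compile(r"^C\d+H\d+$")  # C3H7, C4H9, ...


def classify_label(label: str, defined_labels: set[str]) -> str:
    """Return a coarse category for a text label attached to a structure.

    defined_labels holds the labels that the scheme, caption, or table
    explicitly defines (e.g. via "Y = OMe, Cl").
    """
    if label in ELEMENT_SYMBOLS and label in defined_labels:
        # The symbol is defined as a variable, so it is not meant as an atom.
        return "element symbol used as R-group label"
    if label in ELEMENT_SYMBOLS:
        return "element"
    if R_GROUP_PATTERN.match(label):
        return "r-group" if label in defined_labels else "undefined label"
    if ALKYL_FORMULA_PATTERN.match(label):
        # A bare formula such as C4H9 does not say which isomer is meant.
        return "ambiguous formula label"
    return "needs review"
```

For instance, `classify_label("R5", {"R4"})` reports an undefined label, and `classify_label("C4H9", set())` flags the formula as ambiguous, because the label alone does not distinguish n-butyl from tert-butyl.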

Page 9: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Optical illusions 3

» Variable points of attachment

Page 10: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Optical illusions 4

» Reaction arrows / forked arrows / brackets

Page 11: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Approach


Page 12: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Approach

» The algorithmic approach:
  • A set of generic, project-independent rules is built into the software. The software should recognize all cases that might occur!
  • Project- (title-) specific rules require that drawing conventions do not change; otherwise further development is necessary
  • Manual post-correction is required (cost- and time-intensive)
  • The problem is unbounded: unprecedented issues cannot be handled

» The templating approach:
  • Software is developed to recognize a defined set of problems (PS)
  • All content must be manually pre-templated (cost-intensive) according to the capabilities of the software

» The hybrid approach:
  • Depending on the source, the focus can be placed on either approach
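One way to picture the hybrid approach is as an ordered chain of rules with a manual queue as the last resort. The following Python sketch is a hedged illustration, not a description of InfoChem's software: the rule functions, the template entries, and the `process` function are all hypothetical names chosen for this example.

```python
from typing import Callable, Optional

# A rule inspects one scheme label and returns a resolved value,
# or None if the rule does not apply to that label.
Rule = Callable[[str], Optional[str]]


def expand_ethyl(label: str) -> Optional[str]:
    """Generic (algorithmic) rule: expand the common abbreviation 'Et'."""
    return "ethyl" if label == "Et" else None


def make_template_rule(template: dict[str, str]) -> Rule:
    """Project-specific rule built from manually prepared template entries."""
    return lambda label: template.get(label)


def process(labels: list[str], rules: list[Rule]) -> tuple[dict[str, str], list[str]]:
    """Apply the rules in order; collect whatever no rule can handle."""
    resolved: dict[str, str] = {}
    manual_queue: list[str] = []
    for label in labels:
        for rule in rules:
            result = rule(label)
            if result is not None:
                resolved[label] = result
                break
        else:
            # No rule matched: this item needs manual post-correction.
            manual_queue.append(label)
    return resolved, manual_queue
```

With `rules = [expand_ethyl, make_template_rule({"Ar": "4-methoxyphenyl"})]`, the labels `["Et", "Ar", "Z"]` resolve to two entries and leave `"Z"` in the manual queue. Shifting the focus between the approaches then amounts to changing how much is covered by generic rules versus template entries.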

Page 13: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Templating

» Templating: Guidelines for authors and typesetters

• Syntax definitions for tables, R-groups etc.

• Syntax rules for captions

• Reaction arrangement, forked arrows

• Rules for reaction conditions (reactants, catalysts, solvents, yields, temperature)

Page 14: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Examples:

» Algorithmic detection of features

» Resolution of repeating groups

» Enumeration of R-groups

» Resolution of aliases/labels

• source specific alias databases

• continuously extended

» Table Enumeration

• compound enumeration

• reaction factual data: Caption/Yield

» Variable points of attachment

» Forked arrows
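Two of the examples above, alias resolution and table enumeration, can be combined in a short Python sketch. Everything here is an assumption made for illustration: the alias table, the row layout, and the function names are hypothetical, and the table values and yields in the usage note are invented, not taken from any of the sources discussed in the talk.

```python
# Hypothetical, source-specific alias table; per the slide, such a table
# would be extended continuously as new labels turn up.
ALIASES = {"Me": "methyl", "Et": "ethyl", "Ph": "phenyl", "Bn": "benzyl"}


def resolve_alias(label, aliases=ALIASES):
    """Return the expansion of a label, or None if it is not yet known."""
    return aliases.get(label)


def enumerate_table(scaffold_id, r_labels, rows):
    """Turn an R-group table into one record per table row (compound).

    rows is a list of dicts with an "entry" key, one key per R-group label,
    and an optional "yield" key holding the reported yield as text.
    """
    records = []
    unresolved = set()
    for row in rows:
        substituents = {}
        for r_label in r_labels:
            value = row[r_label]
            expansion = resolve_alias(value)
            if expansion is None:
                unresolved.add(value)  # candidate for the alias database
            substituents[r_label] = expansion
        records.append({
            "id": f"{scaffold_id}{row['entry']}",
            "substituents": substituents,
            "yield": row.get("yield"),
        })
    return records, sorted(unresolved)
```

Given the invented rows `{"entry": "a", "R4": "Me", "R5": "Ph", "yield": "85%"}` and `{"entry": "b", "R4": "Et", "R5": "Tol", "yield": "72%"}` for a scaffold numbered `40`, the function yields records `40a` and `40b` and reports `Tol` as unresolved. Returning the unresolved labels separately mirrors the idea of an alias database that grows with each source.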

Page 15: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Projects

Page 16: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Successful Application of CDX Processing:

Chemistry Enrichment Workflow* (Wiley Smart Article)

*Reinhard Neudert: Enhancing the User Experience for Wiley Chemistry Content, ICIC 2012, Berlin, October 14 – 17

Page 17: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Templating*

Author's CDX File → CDX Template → Enumerated structures

Steps shown: Templating (following the CDX-Templating Guidelines for structures), ICSchemeProcessor

*Reinhard Neudert: Enhancing the User Experience for Wiley Chemistry Content, ICIC 2012, Berlin, October 14 – 17

Page 18: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Workflow Science of Synthesis Update

Scheme → ICSchemeProcessor → Error Report

» Correct / extend process (CDX-Templating Guidelines for reactions)
» Manual data entry (scheme correction not possible)

[Figure: example reaction scheme with R-group labels R4 and R5 and compound numbers 39 and 40, shown at three points in the workflow]

Page 19: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Sample Pharmaceutical Substances Update

Source: Thieme Pharmaceutical Substances, Abiraterone

Page 20: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Conclusion

» As much algorithmic processing as possible is desirable

• generic: can be applied to other content as well

• cheaper (human labor is costly!)

» 100% conversion (without human interaction) is never possible

» Solutions are project- / source-specific

» The relevance of automatic extraction will continue to increase

» Authors / publishers play an essential role in a successful conversion

Page 21: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Acknowledgements

» Wiley

Michael Forster

Reinhard Neudert

» Thieme

Guido Herrmann

Rolf Hoppe

Klaus Köberlein

» InfoChem

Hans Kraut, Sascha Hausberg, Thomas Menke, Manuela Rauh

Fanny Irlinger, Huyen Ngyen, Dagmar Kunzmann

Page 22: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Thank you!

Page 23: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

Questions?