Introducing CLiDE Pro:
A chemical OCR tool
Aniko T. Valko, Keymodule Ltd.
Chemical structure Diagrams
Chemical structure diagrams are a form of representation of chemical compounds.
Information contained in a structure diagram can be divided into three areas:
• Atom information: chemical elements, functional groups, generic elements, vertex label, charge, atomic weight, hybridization, etc.
• Bond information: bond orders, bond styles, bond labels
• Structural information: atom information, bond information, overall charge, structure label
[Figure: example structure diagrams illustrating atom, bond and structural information]
Excerpt of an MDL MOL file (V2000 counts line and the first 16 atom records):
29 31 0 0 0 0 0 0 0 0999 V2000
-1.9417 2.3939 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.3542 1.6794 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.9417 0.9649 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.1167 0.9649 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.7042 1.6794 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.1167 2.3939 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.1792 1.6794 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.5917 0.9649 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.5917 2.3939 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.0042 1.6794 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.1208 1.6794 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
0.7042 1.0961 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-0.0927 2.4763 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.7042 2.2628 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.5292 1.0961 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9417 0.3816 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
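As an illustration of how the atom information in a MOL excerpt like the one above can be read back, here is a minimal Python sketch. The names `Atom`, `parse_counts_line` and `parse_atom_line` are hypothetical and not part of CLiDE Pro. The sketch assumes whitespace-separated fields, which holds for this excerpt; the V2000 format itself is fixed-column, so adjacent fields can run together in other files.

```python
from typing import NamedTuple, Tuple


class Atom(NamedTuple):
    x: float
    y: float
    z: float
    symbol: str


def parse_counts_line(line: str) -> Tuple[int, int]:
    """Return (number of atoms, number of bonds) from a counts line.

    Assumption: the first two fields are separated by whitespace.
    """
    fields = line.split()
    return int(fields[0]), int(fields[1])


def parse_atom_line(line: str) -> Atom:
    """Parse one atom record: x, y, z coordinates followed by the symbol."""
    fields = line.split()
    x, y, z = (float(value) for value in fields[:3])
    return Atom(x, y, z, fields[3])


counts = parse_counts_line("29 31 0 0 0 0 0 0 0 0999 V2000")
sulfur = parse_atom_line("0.1208 1.6794 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0")
print(counts, sulfur)
```

The twelve trailing zero fields of each atom record (charge, stereo parity and so on) are ignored here because the excerpt leaves them all at their defaults.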
What is chemical OCR for?
Publication process: chemical structure diagrams are converted to images.
All chemical information is lost!
Manual reproduction: slow and prone to errors
Chemical OCR: automatic extraction of chemical information from chemical structure depictions
20-90 seconds per page
CLiDE Pro
A chemical OCR software tool
The latest incarnation of software to emerge from the long-term CLiDE (Chemical Literature Data Extraction) project [1-3].
[1] P. Ibison, M. Jacquot, F. Kam, A. Neville, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Literature Data Extraction: The CLiDE Project. J. Chem. Inf. Comput. Sci. 1993, 33(3), 338-344.
[2] P. Ibison, F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson. Chemical Structure Recognition and Generic Text in the CLiDE Project. In Proceedings on Online Information 92. 1992, London, England.
[3] A. Simon and A.P. Johnson. Recent Advances in the CLiDE Project: Logical Layout Analysis of Chemical Documents. J. Chem. Inf. Comput. Sci. 1997, 37(1), 109-116.
Features
Converts chemical images into connection tables
Interprets generic structures
Supports document-oriented processing as opposed to page-oriented processing: the whole document is loaded and processed at once rather than individual pages.
Loads PDF documents, as well as TIFF and BMP image files
Handles various difficult drawing features
Exports chemical information into MDL MOL files
Operates in interactive or batch mode
Tools for structure and text editing
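To make the MOL export feature listed above more concrete, the following Python sketch writes a small connection table as a V2000 MOL block. It is not CLiDE Pro's exporter: the function name `to_molblock` and its input layout are hypothetical, only 2D coordinates and bond orders are written, and all optional fields are left at 0.

```python
from typing import List, Tuple

AtomRec = Tuple[str, float, float]   # (symbol, x, y)
BondRec = Tuple[int, int, int]       # (first atom, second atom, order), 1-based


def to_molblock(name: str, atoms: List[AtomRec], bonds: List[BondRec]) -> str:
    """Serialize atoms and bonds as a V2000 MOL block (2D, z = 0)."""
    lines = [name, "", ""]  # three header lines: title, program line, comment
    # Counts line: atom count, bond count, eight unused fields, 999, version.
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for symbol, x, y in atoms:
        # Fixed-column atom record: three coordinates, symbol, 12 zero fields.
        lines.append(
            f"{x:10.4f}{y:10.4f}{0.0:10.4f} {symbol:<3}" + " 0" + "  0" * 11
        )
    for first, second, order in bonds:
        lines.append(f"{first:3d}{second:3d}{order:3d}" + "  0" * 4)
    lines.append("M  END")
    return "\n".join(lines) + "\n"


# Usage: a three-atom fragment C-C=O with made-up coordinates.
block = to_molblock(
    "fragment",
    [("C", -1.9417, 2.3939), ("C", -1.1167, 2.3939), ("O", -0.7042, 1.6794)],
    [(1, 2, 1), (2, 3, 2)],
)
print(block)
```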
Three main problems involved in chemical OCR
1) Identification of chemical images within a document.
2) Compilation of chemical graphs of individual molecules from chemical images.
3) Interpretation of complex objects such as generic structures using the retrieved chemical graphs.
CLiDE Pro’s solutions to Problem 1
Problem 1: Identification of chemical images within a document
Document image segmentation
[Figure: digitized image of a document page of a patent]
[Figure: segmented document highlighting recognized text blocks and graphic blocks]
Identification of connected components
Bottom-up layout analysis by building the tree structure of the page
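The identification of connected components named above can be sketched in Python as follows. This is a generic breadth-first labelling of a binary image with 8-connectivity, given only as an illustration; the slides do not state which algorithm or connectivity CLiDE Pro uses, and the function names are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Pixel = Tuple[int, int]  # (row, column)


def connected_components(image: List[List[int]]) -> List[List[Pixel]]:
    """Group the black pixels (value 1) of a binary image into components."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components: List[List[Pixel]] = []
    for row in range(height):
        for col in range(width):
            if image[row][col] == 0 or seen[row][col]:
                continue
            # Flood-fill a new component starting from this pixel.
            seen[row][col] = True
            queue = deque([(row, col)])
            component: List[Pixel] = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(component)
    return components


def bounding_box(component: List[Pixel]) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of a component."""
    rows = [p[0] for p in component]
    cols = [p[1] for p in component]
    return min(rows), min(cols), max(rows), max(cols)


IMAGE = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
for comp in connected_components(IMAGE):
    print(len(comp), bounding_box(comp))
```

Bounding boxes of this kind are a natural input for a bottom-up layout analysis, since neighbouring boxes can be merged step by step into larger blocks.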
CLiDE Pro’s solutions to Problem 2
Problem 2: Extraction of connection tables from chemical images
1 Chemical image
2 Classification of connected components into basic groups: characters, lines, dashes, graphics
3 Construction of dashed bonds, based on the Hough transform method [4]
4 Vectorization, based on a polygon approximation method [5]
5 Construction of atom labels: OCR, grouping characters into atom labels, recognition of superatoms
6 Construction of connection table: connecting lines to atoms, joining lines to form implicit carbon atoms
[Figure: 3D molecular structure obtained after exporting the constructed connection table (CT) into an SDF file in 2D and converting the structure from 2D to 3D]
[4] R.O. Duda and P.E. Hart. Use of the Hough Transform to Detect Lines and Curves in Pictures. Graphics Image Process. 1972, 1.
[5] J. Sklansky and V. Gonzalez. Fast Polygonal Approximation of Digitized Curves. Pattern Recognit. 1980, 12, 327-331.
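The construction of the connection table (step 6) can be sketched in Python under simplifying assumptions: bonds arrive as vectorized line segments, atom labels arrive as (position, symbol) pairs, every bond is treated as a single bond, and the two tolerances are arbitrary illustrative values. The function `build_connection_table` is hypothetical and is not CLiDE Pro's implementation.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]
Label = Tuple[Point, str]


def _dist(p: Point, q: Point) -> float:
    return hypot(p[0] - q[0], p[1] - q[1])


def build_connection_table(
    segments: List[Segment],
    labels: List[Label],
    join_tol: float = 0.1,
    label_tol: float = 0.5,
) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Return (atom symbols, bonds as index pairs) for a set of lines."""
    positions: List[Point] = []
    symbols: List[str] = []
    bonds: List[Tuple[int, int]] = []

    def node_for(end: Point) -> int:
        # Connect the line end to a nearby atom label if there is one;
        # otherwise the end becomes an implicit carbon atom.
        symbol, pos = "C", end
        for label_pos, label_symbol in labels:
            if _dist(end, label_pos) <= label_tol:
                symbol, pos = label_symbol, label_pos
                break
        # Join with an existing atom when the positions coincide.
        for index, existing in enumerate(positions):
            if _dist(pos, existing) <= join_tol:
                return index
        positions.append(pos)
        symbols.append(symbol)
        return len(positions) - 1

    for start, end in segments:
        i, j = node_for(start), node_for(end)
        if i != j:
            bonds.append((i, j))
    return symbols, bonds


# Two lines meeting at a corner, the second ending next to an "O" label.
lines = [((0.0, 0.0), (1.0, 0.0)), ((1.0, 0.02), (2.0, 0.0))]
atom_labels = [((2.3, 0.0), "O")]
print(build_connection_table(lines, atom_labels))
```

The label tolerance is larger than the join tolerance because drawn bonds usually stop short of the characters they point at, whereas two bonds that share a carbon atom end at almost the same pixel.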
CLiDE Pro’s solutions to Problem 3
Problem 3: Interpretation of generic structures
[Figure: generic structure with variable groups X, Y1, Y2 and R (in CO2R)]
41a: X=N, Y1=H, Y2=Cl, R=Et
41b: X=CF, Y1=Y2=F, R=Et
1 Generic text interpretation (GTI): R-groups, substitution values, labels
Currently, GTI is limited to text in which an '=' sign separates the R-groups from the substituents.
However, combined assignments to R-groups (e.g. Y1=Y2=F) are handled successfully.
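A minimal Python sketch of this kind of interpretation is given below. It only covers the '=' form described above, including combined assignments, and it assumes one labelled entry per line with comma-separated clauses; the function name `parse_generic_text` is hypothetical.

```python
from typing import Dict, Tuple


def parse_generic_text(line: str) -> Tuple[str, Dict[str, str]]:
    """Split 'label: G1=V, G2=G3=V' into the label and a group->value map."""
    label, _, body = line.partition(":")
    assignments: Dict[str, str] = {}
    for clause in body.split(","):
        parts = [part.strip() for part in clause.split("=")]
        if len(parts) < 2 or not all(parts):
            continue  # no usable '=' assignment: outside this sketch's scope
        *groups, value = parts
        # A combined assignment such as Y1=Y2=F gives every group the value.
        for group in groups:
            assignments[group] = value
    return label.strip(), assignments


print(parse_generic_text("41a: X=N, Y1=H, Y2=Cl, R=Et"))
print(parse_generic_text("41b: X=CF, Y1=Y2=F, R=Et"))
```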
2 Association of the generic text block with the structure by matching R-groups present in both the text and the structure
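The matching idea can be sketched in Python as a set-overlap comparison. The scoring rule used here (pick the structure sharing the most R-group names with the text block, and leave the block unassigned when nothing is shared) is an assumption made for illustration, and the function name `associate` is hypothetical.

```python
from typing import Dict, Optional, Set


def associate(
    text_blocks: Dict[str, Set[str]],
    structures: Dict[str, Set[str]],
) -> Dict[str, Optional[str]]:
    """Map each text block id to the structure id with most shared R-groups."""
    result: Dict[str, Optional[str]] = {}
    for block_id, text_groups in text_blocks.items():
        best_id, best_score = None, 0
        for structure_id, structure_groups in structures.items():
            score = len(text_groups & structure_groups)
            if score > best_score:
                best_id, best_score = structure_id, score
        result[block_id] = best_id
    return result


structures = {"s1": {"X", "Y1", "Y2", "R"}, "s2": {"R1", "R2"}}
text_blocks = {"t1": {"X", "Y1", "Y2", "R"}, "t2": {"R1"}, "t3": {"Z"}}
print(associate(text_blocks, structures))
```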
Alignment of Atom Labels
Two types of alignment of atom labels with more than one character:
• Horizontal atom labels
• Vertical atom labels
Examples
[Figure: input image and constructed molecule; the interpreted structure in CLiDE Pro’s GUI]
Ambiguity in interpretation
Contextual analysis is used to resolve ambiguous drawing features:
• Horizontal lines: dashes of a dashed wedged bond, or a negative charge
• Vertical lines: iodine atoms, or part of a double bond
• Circles: oxygen atoms, or aromatic rings
[Figures: input image and constructed molecule for each case, with the interpreted structure in CLiDE Pro’s GUI]
Crossing bonds in bridged molecule
[Figure: input image and constructed molecule]
No extra Carbon atom is generated at the point where bonds cross each other
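One way to tell a true crossing from two bonds that meet at an atom is a standard segment intersection test, sketched below in Python. It reports a crossing only when the intersection point lies strictly inside both segments, so bonds that share an endpoint are not flagged. This is a generic geometric test offered as an illustration; the slides do not describe how CLiDE Pro detects crossings, and the function names are hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]


def orientation(p: Point, q: Point, r: Point) -> float:
    """Cross product of (q - p) and (r - p); its sign gives the turn direction."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def bonds_cross(a: Point, b: Point, c: Point, d: Point) -> bool:
    """True if segments ab and cd intersect at a point interior to both."""
    # a and b must lie on opposite sides of line cd, and vice versa.
    return (orientation(c, d, a) * orientation(c, d, b) < 0
            and orientation(a, b, c) * orientation(a, b, d) < 0)


print(bonds_cross((0, 0), (2, 2), (0, 2), (2, 0)))  # diagonals of a square
print(bonds_cross((0, 0), (1, 0), (1, 0), (2, 1)))  # bonds sharing an atom
```

For a pair flagged as crossing, the two lines are kept as separate bonds and no node is created at the intersection point.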
Functional groups are expanded in the exported structure
A generic structure
[Figure: input image and constructed molecules for R = H and R = Me]
Bad image quality
• Isolated black spots (noise from scanning)
• Black spots touching one connected component (CC)
• Black spots merging two or more CCs
[Figures: input images and constructed molecules]
The quality of interpretation depends on the ability to deal with difficult situations such as:
- ambiguous drawing features
- distortions resulting from bad image quality
Conclusions and Outlook
CLiDE Pro, a chemical OCR tool
3 main problems in chemical OCR and CLiDE Pro’s solutions
Goal: to extend CLiDE Pro to further chemical drawing features such as
- Reaction schemes (partly implemented)
- Improved generic text interpretation (dealing with tables of R-groups)
- Positional variation in Markush structures
- Other difficult situations (e.g. missing bonds between ring atoms)
- Frequency variation in Markush structures
Palytoxin – A complex structure
Constructed molecule
Input image
Further Information
CLiDE Pro is licensed with Keymodule Ltd. and SimBioSys Inc.
http://www.keymodule.co.uk
http://www.simbiosys.ca
Live demo at Booth #817
Acknowledgments
People who previously worked on CLiDE