Information Extraction from Chemical Images
Discovery Knowledge & Informatics
April 24th, 2006
Dr. Marc Zimmermann
Available Chemical Information
page 2, Discovery Knowledge & Informatics, April 24th, 2006, Marc Zimmermann
Textbooks
Reports
Patents
Databases
Scientific journals and publications
Websites
Representations of Chemical Compounds
Name (trivial, trade, brand, INN, USAN)
Registration numbers (CAS, NCI, Beilstein)
Formal description (sum formula, SMILES)
Chemical nomenclature (IUPAC, CAS, InChI)
Depictions
Example: Aspirin
Name: Acetylsalicylic acid, Aspirin, Bayer, Colfarit, Dolean PH 8, Duramax, Ecotrin, …
CAS: 50-78-2
SID: 35870
Formula: C9H8O4
IUPAC Name: 2-acetoxybenzoic acid
SMILES: CC(=O)OC1=CC=CC=C1C(=O)O
InChI: 1.12Beta/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h1H3,2-5H,(H,11,12)
Depiction: [structure drawing]
Information Extraction Methods
Names: dictionary-based
Registration numbers: databases
Formal descriptions: rule-based
Depictions: chemical OCR
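As an illustration of rule-based extraction for registration numbers, the sketch below finds CAS Registry Numbers in free text and checks them with the CAS check-digit rule (each digit is weighted by its position counted from the right, and the sum modulo 10 must equal the final digit). The function names are hypothetical and chosen only for this example; a production system would additionally look the candidates up in a database, as the slide indicates.

```python
import re

# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(candidate: str) -> bool:
    """Return True if the string has the CAS layout and a correct check digit."""
    match = CAS_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def find_cas_numbers(text: str) -> list:
    """Return all substrings that look like CAS numbers and pass the checksum."""
    return [m.group(0) for m in CAS_PATTERN.finditer(text)
            if is_valid_cas(m.group(0))]


if __name__ == "__main__":
    sample = "Aspirin (CAS 50-78-2), water 7732-18-5, pages 12-34-5"
    print(find_cas_numbers(sample))
```

The checksum filters out look-alike strings such as page ranges, which is the main reason to combine a pattern with a rule rather than rely on the pattern alone.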
Representing a Chemical Compound
How much information do you want to include?
Atoms present
Connections between atoms
Bond types
Isotopes
Charges
Stereochemical configuration
[Figure: example structure drawing with atom labels OH, CH2, 14C (isotope label), H3N+ and O- (charges), O]
Modeling of Chemicals as Graphs
Why use graph theory?
Established mathematical field
Graphs can be easily represented in computers
Existing algorithms for comparison, searching, etc.
Unlike humans, computers aren’t very good at pattern recognition
[Figure: two depictions side by side: similar or the same?]
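To make the graph view concrete, here is a minimal sketch that stores a molecule as a labeled graph (atoms as node labels, bonds as edges with a bond order) and decides whether two such graphs describe the same structure. It uses a brute-force search over all atom permutations, which is an assumption made for brevity and only feasible for very small graphs; real systems rely on canonical numbering or optimized matchers. All names are hypothetical.

```python
from itertools import permutations


def bond_key(a, b):
    """Store each bond under a sorted atom-index pair so direction is ignored."""
    return (a, b) if a < b else (b, a)


def make_molecule(atoms, bonds):
    """atoms: element symbols; bonds: (i, j, order) triples, 0-based indices."""
    return list(atoms), {bond_key(i, j): order for i, j, order in bonds}


def same_structure(mol1, mol2):
    """Brute-force labeled graph isomorphism test (exponential in atom count)."""
    atoms1, bonds1 = mol1
    atoms2, bonds2 = mol2
    if sorted(atoms1) != sorted(atoms2) or len(bonds1) != len(bonds2):
        return False
    for perm in permutations(range(len(atoms1))):
        # perm[i] is the atom of mol2 that atom i of mol1 is mapped to.
        if any(atoms1[i] != atoms2[perm[i]] for i in range(len(atoms1))):
            continue
        if all(bonds2.get(bond_key(perm[i], perm[j])) == order
               for (i, j), order in bonds1.items()):
            return True
    return False


if __name__ == "__main__":
    # Ethanol heavy atoms, numbered in two different ways, and dimethyl ether.
    ethanol_a = make_molecule(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
    ethanol_b = make_molecule(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
    ether = make_molecule(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])
    print(same_structure(ethanol_a, ethanol_b), same_structure(ethanol_a, ether))
```

The two ethanol graphs differ only in atom numbering and compare as equal, while dimethyl ether has the same atoms and bond count but a different connectivity.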
Computer Representation
A typical example: MDL MOL file (SDF)
For more information on MDL formats, see http://www.mdl.com/downloads/public/ctfile/ctfile.jsp
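The following sketch shows the fixed-width layout of a V2000 MOL block by writing and re-reading a tiny structure. It covers only the counts line, the atom block (coordinates and element symbol) and the bond block (atom pair and bond type); all other fields are written as zeros. The function names and the header text are hypothetical, and the reader assumes well-formed input.

```python
def write_molblock(name, atoms, bonds):
    """atoms: (symbol, x, y) tuples; bonds: (i, j, order) with 1-based indices."""
    lines = [name, "  sketch", ""]  # header: title, program line, comment
    lines.append(f"{len(atoms):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000")
    for symbol, x, y in atoms:
        # x, y, z in 10-character columns, then the symbol and 12 zero fields.
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} {symbol:<3} 0" + "  0" * 11)
    for i, j, order in bonds:
        lines.append(f"{i:3d}{j:3d}{order:3d}  0  0  0  0")
    lines.append("M  END")
    return "\n".join(lines)


def parse_molblock(text):
    """Read atoms and bonds back from a V2000 MOL block."""
    lines = text.split("\n")
    counts = lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        atoms.append((line[31:34].strip(), float(line[0:10]), float(line[10:20])))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds


if __name__ == "__main__":
    atoms = [("C", 0.0, 0.0), ("C", 1.3, 0.75), ("O", 2.6, 0.0)]
    bonds = [(1, 2, 1), (2, 3, 1)]
    block = write_molblock("ethanol", atoms, bonds)
    print(block)
    print(parse_molblock(block))
```

Because the format is column-based rather than delimiter-based, whitespace damage (as happens when slides are converted to text) makes a MOL block unreadable for standard parsers.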
Disadvantages of Using Graphs
Many graph algorithms are inherently slow
Analogy between chemical structures and graphs is not perfect
Realities of chemical structures cause problems:
aromaticity
stereochemistry
tautomerism
inorganic compounds
macromolecules and polymers
incompletely-defined substances
Good News
There are only a limited number of chemical drawing tools
(and these use templates):
ChemDraw (CambridgeSoft)
ChemSketch (ACD)
ISISdraw (MDL)
Java applets (ChemAxon)
...
Reduced complexity
chemOCR: Reconstruction of Chemical Compounds
1 Document → 2 Depiction → 3 Reconstruction → 4 SDF file
SDF file (excerpt: program line, counts line, atom block):
  -ISIS-  09230315072D
 27 29  0  0  0  0  0  0  0  0999 V2000
   -0.9348   -0.4000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.9359   -1.2274    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2211   -1.6402    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.4953   -1.2269    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.4925   -0.3964    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2229    0.0128    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0750   -1.8084    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0708   -2.6334    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7875   -1.3917    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5042   -1.8034    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.2162   -1.3874    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.2120   -0.5611    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4899   -0.1526    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7808   -0.5709    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0042   -0.3417    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.0083   -1.5959    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
    4.4125   -1.0542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.2375   -1.0542    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.4792   -3.2167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2167   -2.3917    0.0000 S   0  0  3  0  0  0  0  0  0  0  0  0
    5.0125   -2.1750    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.4167   -2.6042    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.4292   -3.1875    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
    5.2250   -3.4000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0125   -3.9000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3458   -3.2125    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.8875   -3.9292    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
CSR (Compound Structure Reconstruction)
[Figure: CSR architecture diagram. Components shown:]
raster images
common fragments module
molecule database
chemical cartridge
connected components
manual curation tool
machine learning tool
chemical rules module
page segmentation
image preprocessing
vectorizer
OCR
component classifier
s-atom database
approx. graph matcher
molecular graph converter
super-atoms
machine learning tool
Preprocessing Steps
Page segmentation
Image extraction
Image conversion (image restoration, adaptive binarization, ...)
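Adaptive binarization can be sketched as local-mean thresholding: a pixel counts as ink if it is clearly darker than the average of its neighborhood, so uneven backgrounds (scans, shaded figures) do not need one global cutoff. The slides do not say which method CSR uses; the window radius, the offset and the function name below are assumptions for illustration only.

```python
def adaptive_binarize(gray, radius=1, offset=10):
    """Return a 0/1 image; 1 marks pixels darker than local mean minus offset.

    gray is a list of rows with intensities 0 (black) to 255 (white).
    """
    height, width = len(gray), len(gray[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighborhood, clipped at the image border.
            window = [gray[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            local_mean = sum(window) / len(window)
            result[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return result


if __name__ == "__main__":
    # Background fades from 220 to 80; strokes sit in columns 1 and 6.
    row = [220, 130, 180, 160, 140, 120, 20, 80]
    for line in adaptive_binarize([row, row, row]):
        print(line)
```

In the example the left stroke (130) is brighter than the background on the right (120 and 80), so no single global threshold separates ink from paper, whereas the local rule marks exactly the two stroke columns.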
Connected Component Analysis
Building an image tree
Using adaptive nested TreeMaps
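The basic idea of connected component analysis can be illustrated with a plain flood fill over the binarized image. This is a simplified stand-in: it returns a flat list of components and does not build the image tree or the nested TreeMaps mentioned on the slide. The 8-connectivity and the helper names are assumptions.

```python
from collections import deque


def connected_components(binary):
    """Group 8-connected ink pixels (value 1); returns a list of pixel lists."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            queue, pixels = deque([(y, x)]), []
            seen[y][x] = True
            while queue:  # breadth-first flood fill from the seed pixel
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny in range(max(0, cy - 1), min(height, cy + 2)):
                    for nx in range(max(0, cx - 1), min(width, cx + 2)):
                        if binary[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (min_row, min_col, max_row, max_col) of a component."""
    rows = [p[0] for p in pixels]
    cols = [p[1] for p in pixels]
    return min(rows), min(cols), max(rows), max(cols)


if __name__ == "__main__":
    image = [[1, 1, 0, 0, 0],
             [0, 0, 0, 1, 0],
             [0, 0, 0, 1, 0],
             [1, 0, 0, 0, 1]]
    for component in connected_components(image):
        print(len(component), bounding_box(component))
```

Each component, with its bounding box, then becomes a candidate bond line, character or other drawing element for the later classification step.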
Component Classification
1 Raster image
2 Extract features
3 Classify as: single bonds, double bonds, thick chirals, dotted chirals, text
4 Manual curation
Atomtype Reconstruction
Need for a chemically intelligent OCR:
1 Train new characters
2 Define new superatoms
3 Expand superatoms
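Superatom expansion can be pictured as a table lookup that replaces an abbreviated label by its atoms and bonds. The sketch below is a hypothetical illustration, not the CSR implementation: the table entries, the convention that fragment atom 0 is the attachment point, and the function name are all assumptions.

```python
# Hypothetical superatom table: label -> (fragment atoms, fragment bonds).
# Fragment atom 0 is the attachment point; bonds are (i, j, order).
SUPERATOMS = {
    "OMe": (["O", "C"], [(0, 1, 1)]),
    "Et": (["C", "C"], [(0, 1, 1)]),
    "COOH": (["C", "O", "O"], [(0, 1, 2), (0, 2, 1)]),
}


def expand_superatoms(atoms, bonds, table=SUPERATOMS):
    """Replace every superatom label by its fragment.

    atoms: list of labels; bonds: (i, j, order) with 0-based indices.
    Existing bonds stay attached to the fragment's atom 0.
    """
    new_atoms = list(atoms)
    new_bonds = list(bonds)
    for index, label in enumerate(atoms):
        if label not in table:
            continue
        frag_atoms, frag_bonds = table[label]
        # Atom 0 reuses the superatom's node; the rest are appended at the end.
        first_new = len(new_atoms)
        mapping = [index] + list(range(first_new, first_new + len(frag_atoms) - 1))
        new_atoms[index] = frag_atoms[0]
        new_atoms.extend(frag_atoms[1:])
        new_bonds.extend((mapping[a], mapping[b], order) for a, b, order in frag_bonds)
    return new_atoms, new_bonds


if __name__ == "__main__":
    print(expand_superatoms(["COOH", "C", "OMe"], [(0, 1, 1), (1, 2, 1)]))
```

Defining a new superatom then amounts to adding one table entry, which is why the slide treats "define" and "expand" as separate steps.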
Vectorization
Fixing vectorization errors using relative neighborhood graphs
Need for a chemically intelligent vectorizer
Disconnections
Dubious links
Antiparallel double bonds
Fixing bond lengths
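A relative neighborhood graph joins two points when no third point is closer to both of them than they are to each other. The sketch below computes it for stroke endpoints and then proposes short graph edges that are not yet strokes as candidate repairs for disconnections. How CSR actually uses the graph is not described on the slide; the gap heuristic, the tolerance and the function names are assumptions.

```python
from itertools import combinations
from math import dist


def relative_neighborhood_graph(points):
    """Return index pairs (i, j), i < j, that are relative neighbors."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = dist(points[i], points[j])
        # The pair is blocked if some third point is closer to both ends.
        blocked = any(
            max(dist(points[i], points[k]), dist(points[j], points[k])) < d_ij
            for k in range(len(points)) if k not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges


def candidate_gap_links(points, strokes, max_gap):
    """Hypothetical repair step: short neighbor links that are not strokes yet."""
    known = {tuple(sorted(s)) for s in strokes}
    return [(i, j) for i, j in relative_neighborhood_graph(points)
            if (i, j) not in known and dist(points[i], points[j]) <= max_gap]


if __name__ == "__main__":
    # Two strokes whose endpoints 1 and 2 almost touch.
    points = [(0.0, 0.0), (1.0, 0.0), (1.1, 0.05), (2.0, 0.6)]
    strokes = [(0, 1), (2, 3)]
    print(relative_neighborhood_graph(points))
    print(candidate_gap_links(points, strokes, max_gap=0.2))
```

The cubic-time loop is acceptable here because a single depiction yields only a modest number of endpoints.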
Graph Matching
Using a line graph representation
Searching for subgraph isomorphism
Database with common fragments
Decomposition network for fragments
Recognizing new fragments
Graph matching: a solution for mapping bridged ring systems
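Searching a reconstructed graph for a known fragment is a subgraph isomorphism problem. The sketch below uses plain backtracking on the atom graph, matching element symbols and bond orders. This is a simplification: the slide mentions a line graph representation and a decomposition network, neither of which is reproduced here, and the function name is hypothetical.

```python
def find_fragment(mol_atoms, mol_bonds, frag_atoms, frag_bonds):
    """Return a list mapping fragment atom index -> molecule atom index, or None.

    Bonds are (i, j, order) triples with 0-based indices and no self-loops.
    """
    mol_adj = {}
    for a, b, order in mol_bonds:
        mol_adj[(a, b)] = mol_adj[(b, a)] = order

    def extend(mapping):
        k = len(mapping)  # next fragment atom to place
        if k == len(frag_atoms):
            return mapping
        for candidate in range(len(mol_atoms)):
            if candidate in mapping or mol_atoms[candidate] != frag_atoms[k]:
                continue
            # Each fragment bond from atom k to an already placed atom must
            # exist in the molecule with the same bond order.
            if all(mol_adj.get((mapping[min(a, b)], candidate)) == order
                   for a, b, order in frag_bonds if max(a, b) == k):
                result = extend(mapping + [candidate])
                if result is not None:
                    return result
        return None

    return extend([])


if __name__ == "__main__":
    # Acetic acid heavy atoms and a carboxyl fragment C(=O)O.
    acid = (["C", "C", "O", "O"], [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
    carboxyl = (["C", "O", "O"], [(0, 1, 2), (0, 2, 1)])
    print(find_fragment(*acid, *carboxyl))
```

Backtracking of this kind is exponential in the worst case, which is one reason a database of common fragments and a decomposition network are attractive for repeated searches.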
Manual Curation of Errors
Reconstruction score
Editing bonds
Post Processing
Workflow plugin technology
2D beautify
File format conversion
2D to 3D conversion
Name generation
Property calculation / prediction
…
A Real Challenge
Data set with ~7,600 depictions of natural products
to get new scaffolds and superatoms
to incorporate the CSR workflow into a grid service
to add a database interface
But we need more real training sets…
(i.e. pictures and the solved structures)
Current status: ~3,400 fully reconstructed!
Future Works
Incompletely-defined substances:
unknown stereochemistry
unknown attachment position
unknown repetition
[Figure: example structures with labels OH, n (repeat count), NH2, Cl]
Markush (“Generic”) Structures and Reaction Schemes
shorthand for describing sets of structures with common features
structures with R-groups
very important in chemical patents
can be used to describe combinatorial libraries
can be used as queries in database searches
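[Figure: Markush structure bearing OH and the substituents R1 and R2; R1 = Br, I or Cl; R2 = one of three alkyl chains built from CH2 and CH3 groups (* marks the attachment point)]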
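Enumerating a Markush structure means taking every combination of the allowed R-group substituents. The sketch below does this on SMILES-like strings with `itertools.product`. The scaffold string, the substituent positions and the exact R2 chains are hypothetical placeholders chosen for illustration; only the R1 halogens (Br, I, Cl) are taken from the slide.

```python
from itertools import product


def enumerate_markush(scaffold, r_groups):
    """Yield one concrete string per combination of R-group substituents.

    scaffold contains placeholders such as {R1}; r_groups maps each
    placeholder name to its list of alternatives.
    """
    names = sorted(r_groups)
    for choice in product(*(r_groups[name] for name in names)):
        yield scaffold.format(**dict(zip(names, choice)))


if __name__ == "__main__":
    # Hypothetical phenol scaffold with two substitution sites.
    scaffold = "Oc1ccc({R1})c({R2})c1"
    r_groups = {
        "R1": ["Br", "I", "Cl"],
        "R2": ["CC", "CCC", "CCCC"],  # assumed alkyl chains
    }
    for structure in enumerate_markush(scaffold, r_groups):
        print(structure)
```

With three alternatives at each of two sites the generic structure covers nine specific compounds, and the count grows multiplicatively with every additional R-group, which is why patents can claim very large sets with a single drawing.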
The Mission: Combination of CSR and Text Mining
Image Analysis / Structure Reconstruction: -CH3, -CH2-CH3, -CH2-CNHS, -COOH
Text Analysis / Entity Recognition: cytochrome inhibition, PPAR activation, stability in serum, side effect, blood-brain barrier
Reconstruction of published chem-, pharm- and patent space
The Team (in the order of appearance)
Marc Zimmermann
Tanja Fey
Le Thuy Bui Thi
Christoph Friedrich
Yuan Wang
Maria-Elena Algorri
Miguel Alvarez
Wei Wang
CSR Software Demo available
CSR can:
extract chemical depictions from various image sources and convert them into SD files, which can be used in nearly all chemical software
let users modify reconstructed molecules in a structure editor
maintain the superatom and bond (single, double, triple, or chiral) information
accept user curation and scoring schemes to improve its performance