The RSC chemical validation and standardization platform, a potential path to quality-conscious...
![Page 1: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/1.jpg)
Chemistry Validation and Standardization Platform
Modularization and “Hadoop”ization
Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams
ACS New Orleans, April 2013
![Page 2: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/2.jpg)
Overview
• Motivation
• What we support
• Modularization
• Parallelization
• Examples
![Page 3: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/3.jpg)
Motivation: validation
Open and free chemical validation system for:
• Structure validation
– Warn on query atoms, pseudo atoms, polymers, etc.
– Nonsensical stereo
• SDF field mapping for validating depositor-provided names, InChI, SMILES
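As an illustration of the "warn on query atoms, pseudo atoms" check, here is a minimal Python sketch (CVSP itself is written in C#, and this is not its code). It reads the atom block of a V2000 molfile and flags symbols that are not real elements. The `QUERY_SYMBOLS` set, the function names and the sample record are assumptions made for this example.

```python
# Symbols treated as query or placeholder atoms in this sketch
# (an assumed, non-exhaustive list).
QUERY_SYMBOLS = {"A", "Q", "*", "L", "LP", "R#", "R"}


def atom_line(x, y, z, symbol):
    """Build a V2000 atom line: three 10-column coordinates, a space, the symbol."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0  0  0  0  0  0  0  0  0  0  0  0"


def atom_symbols(molblock):
    """Return the atom symbols found in the atom block of a V2000 molfile."""
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])  # counts line: first three columns = atom count
    # Columns 32-34 of each atom line hold the atom symbol.
    return [line[31:34].strip() for line in lines[4:4 + n_atoms]]


def validate(molblock):
    """Return one warning per query or pseudo atom."""
    return [
        f"atom {i}: query or pseudo atom '{sym}'"
        for i, sym in enumerate(atom_symbols(molblock), start=1)
        if sym in QUERY_SYMBOLS
    ]


# Hypothetical two-atom record: a carbon bonded to an "any atom" query symbol.
SAMPLE = "\n".join([
    "example",
    "  sketch",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    atom_line(0.0, 0.0, 0.0, "C"),
    atom_line(1.5, 0.0, 0.0, "A"),
    "  1  2  1  0",
    "M  END",
])

print(validate(SAMPLE))
```

A real validator would also need to handle V3000 files, atom lists and polymer brackets, which this sketch ignores.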
![Page 4: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/4.jpg)
Motivation: standardization
Allows users to run the default CVSP standardization workflow (or the FDA, Open PHACTS and other workflows)
Allows users to put together their own workflow using the modules provided:
• Apply default CVSP or user-defined SMIRKS rules
• Layout
• Neutralize
• Get canonical tautomer using ChemAxon’s algorithms
• Get biggest organic fragment
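The idea of assembling a workflow from modules can be sketched as a list of functions applied in order. This Python sketch is only an illustration of that composition pattern: the step names are hypothetical, and `biggest_fragment` is a toy stand-in that works on the SMILES string itself rather than on a parsed structure, as a real module would.

```python
from typing import Callable, Iterable

# A workflow step takes a SMILES string and returns a SMILES string.
Step = Callable[[str], str]


def strip_whitespace(smiles: str) -> str:
    """Remove leading and trailing whitespace."""
    return smiles.strip()


def biggest_fragment(smiles: str) -> str:
    """Toy stand-in: keep the dot-separated component with the most characters."""
    return max(smiles.split("."), key=len)


def run_workflow(smiles: str, steps: Iterable[Step]) -> str:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        smiles = step(smiles)
    return smiles


# A user-defined workflow is just a different list of steps.
EXAMPLE_WORKFLOW = [strip_whitespace, biggest_fragment]

print(run_workflow(" CC(=O)[O-].[Na+] ", EXAMPLE_WORKFLOW))
```

Keeping every step behind the same signature is what makes the workflows reusable: steps can be reordered, dropped or swapped without touching the runner.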
![Page 5: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/5.jpg)
What we support
• SD files and mol files
• ChemDraw files (in-house code)
• Tab-delimited text files of names, InChIs, SMILES
• Zipped files
• GZipped files
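One way to accept plain, zipped and gzipped uploads through a single entry point is to look at the first bytes of the file. The Python sketch below does that and then splits an SD file into records on the `$$$$` delimiter line. The function names are hypothetical, and it assumes UTF-8 text and reads only the first member of a zip archive.

```python
import gzip
import io
import zipfile


def read_text(data: bytes) -> str:
    """Decode an upload, transparently unpacking gzip or zip content."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.decompress(data).decode("utf-8")
    if data[:4] == b"PK\x03\x04":  # zip local file header
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            first_member = archive.namelist()[0]
            return archive.read(first_member).decode("utf-8")
    return data.decode("utf-8")


def split_sdf(text: str) -> list:
    """Split SD file text into records, using '$$$$' lines as delimiters."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):  # trailing record without delimiter
        records.append("\n".join(current))
    return records


# Usage with a hypothetical two-record SD file, gzipped in memory.
sdf_text = "mol1\n$$$$\nmol2\n$$$$\n"
print(split_sdf(read_text(gzip.compress(sdf_text.encode("utf-8")))))
```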
![Page 6: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/6.jpg)
CVSP: modularization
![Page 7: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/7.jpg)
Reusable workflows
![Page 8: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/8.jpg)
SMIRKS-based rules
![Page 9: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/9.jpg)
![Page 10: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/10.jpg)
![Page 11: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/11.jpg)
![Page 12: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/12.jpg)
“Hadoop”ization
Apache Hadoop is a framework for the distributed processing of large data sets across clusters of computers.
CVSP is written in C#. To run it on Linux machines we use Mono (cross-platform .NET runtime environment)
Farm:
• 28 CPU cores
• 42 GB memory
• 2 TB disk space
Processor-intensive tasks:
• Tautomerization
![Page 13: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/13.jpg)
[Flow diagram; boxes as extracted]
• Input file
• Deposit ID in database
• Upload to farm for processing on Hadoop
• Hadoop processing
• Download results
• Upload results to database for user preview
• Convert to SD format
![Page 14: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/14.jpg)
Hadoop queues
Three Hadoop queues (capacity queues) are used to prioritize CVSP submissions by size:
• “Small” submission queue for submissions under 500 records
• Large submissions queue
• Internal queue
– For internal projects, e.g. tautomer analysis of ChemSpider or ChemSpider standardization
All records have to be processed on Hadoop before the user can see the results (no partial preview)
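The routing rule behind the three queues can be sketched in a few lines of Python. Only the 500-record threshold comes from the slide; the queue names `small`, `large` and `internal` and the function itself are hypothetical.

```python
SMALL_LIMIT = 500  # from the slide: the small queue takes submissions under 500 records


def pick_queue(record_count: int, internal: bool = False) -> str:
    """Choose a queue name (names are assumptions) for a submission."""
    if internal:
        return "internal"
    return "small" if record_count < SMALL_LIMIT else "large"


for count in (20, 499, 500, 6516):
    print(count, pick_queue(count))
```

Separating small jobs this way keeps a quick check of a few hundred structures from waiting behind a database-sized submission.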
![Page 15: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/15.jpg)
Examples
DrugBank
• ~6500 records, approximately 2 records per second
PubMed
• ~100 000 records, about 9 h
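Taking the rounded figures on this slide at face value, the two examples can be put on a common footing with simple arithmetic. The helper names below are hypothetical; the results are back-of-envelope estimates, not measurements.

```python
def minutes_for(records: int, records_per_second: float) -> float:
    """Total processing time in minutes at a given throughput."""
    return records / records_per_second / 60


def rate_for(records: int, hours: float) -> float:
    """Average throughput in records per second."""
    return records / (hours * 3600)


# DrugBank: ~6500 records at ~2 records/s is roughly 54 minutes.
print(round(minutes_for(6500, 2), 1))

# Second example: ~100 000 records in ~9 h is roughly 3.1 records/s.
print(round(rate_for(100_000, 9), 2))
```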
![Page 16: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/16.jpg)
Rate-limiting step?
Canonical tautomerization
This molecule took 45 min to canonicalize.
![Page 17: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/17.jpg)
DrugBank dataset (6516 records)
Errors
• 2 records with a query (any) bond
• 2 records with R groups
• 3 polymers
• 18 porphyrins with a metal coordinated inside and one of the metal-nitrogen bonds stereogenic
• Unusual valence: ~20
Warnings
• InChI not matching structure (100+)
• SMILES not matching structure (100+)
![Page 18: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/18.jpg)
DrugBank ID: DB00755
InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+
DrugBank ID: DB00614
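An InChI string is a sequence of layers separated by `/`, so a first, cheap comparison between a depositor-provided InChI and one generated from the structure can be made layer by layer. The Python sketch below splits the DB00755 InChI from this slide into its layers. The function names are hypothetical, and it assumes a standard InChI with a formula layer and no repeated layer prefixes.

```python
DB00755_INCHI = (
    "InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5"
    "/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+"
)


def inchi_layers(inchi: str) -> dict:
    """Split an InChI into version, formula and single-letter prefixed layers."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string")
    parts = body.split("/")
    layers = {"version": parts[0], "formula": parts[1] if len(parts) > 1 else ""}
    for part in parts[2:]:
        layers[part[0]] = part[1:]  # e.g. 'c' connections, 'h' hydrogens, 'b' double bonds
    return layers


def differing_layers(a: str, b: str) -> list:
    """Names of layers that are present in only one InChI or differ between them."""
    la, lb = inchi_layers(a), inchi_layers(b)
    return sorted(k for k in set(la) | set(lb) if la.get(k) != lb.get(k))


layers = inchi_layers(DB00755_INCHI)
print(layers["formula"])
print(layers["b"])
```

Comparing the `b` layer alone, for instance, would show whether two otherwise identical records disagree only in double-bond geometry.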
![Page 19: The RSC chemical validation and standardization platform, a potential path to quality-conscious databases](https://reader034.fdocuments.net/reader034/viewer/2022052602/559f75991a28abf4718b4811/html5/thumbnails/19.jpg)
Stereo issues
DB08128 DB06287
J. Brecher, Pure Appl. Chem., 2008, doi:10.1351/pac200880020277