Chemical Informatics and Cyber-infrastructure Building Blocks Chemical Informatics Resources: ...


Chemical Informatics and Cyber-infrastructure Building Blocks

Chemical Informatics Resources:

Deluge of experimental data
  > 100,000 compounds screened by 10 publicly funded high-throughput screening centers using various assay techniques (molecular to cellular)
  Molecular Libraries Screening Center Network
Chemical databases maintained by various groups
  NIH PubChem, NIH DTP
Chemical informatics and computational chemistry
  Data clustering, data mining, descriptor calculations, toxicity prediction, docking, molecular modeling, and quantum chemistry
Visualization tools
Web resources: journal articles, etc.

A Chemical Informatics Grid will need to integrate these into a common, loosely coupled, open, distributed computing environment.

Our Solution Stack

Domain-specific Web Services
  VOTables, CDK services
Grid services, Cyber-infrastructure for computationally intensive applications
  Clustering, quantum chemistry
Workflow and service management
  We work with Taverna
  Many solutions: Kepler, BPEL engines, etc.
Portlets and other user interfaces
  Rich desktop apps
  Ubiquitous clients

Stack diagram layers (top to bottom): Portals and Other User Interfaces; Workflow and Service Management; Web and Grid Services

Each level is a subject for research and development, as is their integration.

Wrapping Science Applications as Services

Science Grid services typically must wrap legacy applications written in C or Fortran.
You must handle such problems as
  Specifying several input and output files
    These may need to be staged in
  Launching executables and monitoring their progress.
  Specifying environment variables
  Often these also have shell scripts to do some miscellaneous tasks.
How do you convert this to WSDL?
  Or (equivalently) how do you automatically generate the XML job description for WS-GRAM?
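One way to picture the wrapping problem above is as a small record of everything the service needs to know about the legacy program, serialized to an XML job description. The sketch below is illustrative only: the element names (executable, argument, environment, stdout, stderr, fileStageIn, transfer) are modeled on GT4-style WS-GRAM job descriptions and should be checked against the actual schema, and the paths, host name, and the `LegacyJob` class are hypothetical.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class LegacyJob:
    """What a service wrapper needs to know about a legacy executable."""
    executable: str
    arguments: list = field(default_factory=list)
    environment: dict = field(default_factory=dict)
    stage_in: list = field(default_factory=list)  # (source URL, destination URL) pairs
    stdout: str = "stdout.txt"
    stderr: str = "stderr.txt"


def to_job_description(job: LegacyJob) -> str:
    """Serialize the job as XML. Element names are assumed, not schema-validated."""
    root = ET.Element("job")
    ET.SubElement(root, "executable").text = job.executable
    for arg in job.arguments:
        ET.SubElement(root, "argument").text = arg
    for name, value in job.environment.items():
        env = ET.SubElement(root, "environment")
        ET.SubElement(env, "name").text = name
        ET.SubElement(env, "value").text = value
    ET.SubElement(root, "stdout").text = job.stdout
    ET.SubElement(root, "stderr").text = job.stderr
    if job.stage_in:
        stage = ET.SubElement(root, "fileStageIn")
        for source, destination in job.stage_in:
            transfer = ET.SubElement(stage, "transfer")
            ET.SubElement(transfer, "sourceUrl").text = source
            ET.SubElement(transfer, "destinationUrl").text = destination
    return ET.tostring(root, encoding="unicode")


# Hypothetical legacy application with one staged input file.
example = LegacyJob(
    executable="/usr/local/bin/cluster_app",
    arguments=["input.smi"],
    environment={"APP_HOME": "/usr/local/app"},
    stage_in=[("gsiftp://data.example.org/input.smi", "file:///tmp/input.smi")],
)
print(to_job_description(example))
```

Keeping the description in a plain data record means the same record could also drive a WSDL-facing wrapper, which is the equivalence the slide points at.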

Flow Chart of SMILES to Cluster Partitioned of BCI Web Service

Flow chart nodes: SMILES String; Makebits; Dictionary (Default); Fingerprint (*.scn); DivKmeans; Cluster Hierarchy (*.dkm); Optclus; RNNclus; One Column Process; Merge Process; Extracted Cluster Hierarchy (*.clu); New SMILES String
Flow chart stage labels: Generating fingerprints; Clustering fingerprints; SMILES to DKM; Generating the best levels (best level); Extracting individual cluster partitions

BCI Clustering Service Methods

Service Method | Description | Input | Output
makebitsGenerate | Generate fingerprints from a SMILES structure | SMIstring | Fingerprint string
divkmGenerate | Cluster fingerprints with Divkmeans | SCNstring | Clustered Hierarchy
smile2dkm | Makebits + divkm | SMIstring | Clustered Hierarchy
optclusGenerate | Generate the best levels in a hierarchy | DKMstring | Best partition cluster level
rnnclusGenerate | Extract individual cluster partitions | DKMstring | Indiv. cluster partitions
smile2ClusterPartitioned | Generate a new SMILES structure w/ extra col. | SMIstring | New SMILES structure
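The table lists smile2dkm as "Makebits + divkm", that is, a composite method built by chaining two simpler ones. A minimal sketch of that chaining pattern follows. The two stage functions here are placeholders that only tag their input so the data flow is visible; they do not call the BCI tools or the real services, and the `compose` helper is introduced purely for illustration.

```python
from functools import reduce
from typing import Callable

Stage = Callable[[str], str]


def compose(*stages: Stage) -> Stage:
    """Chain string-in, string-out stages, left to right."""
    def pipeline(data: str) -> str:
        return reduce(lambda value, stage: stage(value), stages, data)
    return pipeline


# Placeholder stand-ins for the real service methods (assumptions, not the BCI API).
def makebits_generate(smi_string: str) -> str:
    return f"SCN({smi_string})"


def divkm_generate(scn_string: str) -> str:
    return f"DKM({scn_string})"


# The composite method is just the two stages applied in order.
smile2dkm = compose(makebits_generate, divkm_generate)
print(smile2dkm("CCO"))
```

Because every stage takes and returns a string, longer chains such as the full SMILES-to-partition path can be assembled the same way.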

Submitting Applications with Condor

We are working to use Condor-G as a simple bridge to the NSF’s TeraGrid for job submission.
Condor has a Web Service interface (called BirdBath) that we are using to construct Java portlets.
We are investigating how to construct Condor ClassAds using GPIR.
  Required for Condor matchmaking
  But there is no facility for this built into the TeraGrid.

Diagram panels:
  Condor Only: (Portal) Client; Condor Master; Condor pool members
  Condor-G and Globus: (Portal) Client; Condor-G; TeraGrid Globus; LSF; PBS
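To make the Condor-G bridge concrete, the sketch below generates a grid-universe submit description of the kind Condor-G accepts. It is a hedged illustration rather than the portal's actual code: the gatekeeper host, executable path, and the `condor_g_submit` helper are hypothetical, and the `gt2` resource type with a `jobmanager-pbs` job manager is just one plausible choice for a Globus gateway in front of PBS.

```python
def condor_g_submit(executable, arguments, gatekeeper,
                    jobmanager="jobmanager-pbs", input_files=()):
    """Build the text of a Condor-G submit description (grid universe)."""
    lines = [
        "universe = grid",
        # Resource type, then the Globus gatekeeper contact string.
        f"grid_resource = gt2 {gatekeeper}/{jobmanager}",
        f"executable = {executable}",
        f"arguments = {arguments}",
    ]
    if input_files:
        lines.append("transfer_input_files = " + ", ".join(input_files))
    lines += [
        "output = job.out",
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical host and application names.
submit_text = condor_g_submit(
    executable="/usr/local/bin/cluster_app",
    arguments="input.smi",
    gatekeeper="gatekeeper.example.org",
    input_files=["input.smi"],
)
print(submit_text)
```

A portlet could write this text to a file and hand it to `condor_submit`, or pass the equivalent attributes through the Web Service interface.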

VOTables: Handling Tabular Data

Developed by the Virtual Observatory community for encoding astronomy data.
The VOTable format is an XML representation of the tabular data (data coming from BCI, NIH DTP databases, and so on).
VOTable-compatible tools have been built
  We just inherit them.
SAVOT and JAVOT Java parser APIs for VOTable allow us to easily build VOTable-based applications
  Web Services
  Spreadsheet
  Plotting applications.

VOPlot and TopCat are two

mrtd1.txt: SMILES representation of chemical compounds along with their properties

votable.xml: XML representation of the mrtd1.txt file

VOPlot application using the generated votable.xml file: graph plotted with Mass on the X-axis and PSA on the Y-axis
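The conversion from a property table to VOTable XML can be sketched in a few lines. This is an illustrative example, not the converter used for the slides: the layout of mrtd1.txt is not shown in the source, so a tab-separated layout with SMILES, Mass, and PSA columns is assumed, the two sample rows are illustrative, and the XML namespace declarations a full VOTable would carry are omitted.

```python
import xml.etree.ElementTree as ET


def parse_table(text):
    """Split tab-separated text into a header list and a list of row tuples."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    rows = [tuple(line.split("\t")) for line in lines[1:]]
    return header, rows


def rows_to_votable(columns, rows):
    """columns: (name, datatype) pairs; rows: tuples of cell values."""
    votable = ET.Element("VOTABLE", version="1.1")
    resource = ET.SubElement(votable, "RESOURCE")
    table = ET.SubElement(resource, "TABLE")
    for name, datatype in columns:
        attrs = {"name": name, "datatype": datatype}
        if datatype == "char":
            attrs["arraysize"] = "*"  # variable-length string
        ET.SubElement(table, "FIELD", attrs)
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for row in rows:
        tr = ET.SubElement(tabledata, "TR")
        for value in row:
            ET.SubElement(tr, "TD").text = str(value)
    return ET.tostring(votable, encoding="unicode")


# Assumed input layout; rows are illustrative (ethanol and benzene).
SAMPLE = "SMILES\tMass\tPSA\nCCO\t46.07\t20.23\nc1ccccc1\t78.11\t0.0\n"
header, rows = parse_table(SAMPLE)
columns = [(header[0], "char")] + [(name, "double") for name in header[1:]]
xml_text = rows_to_votable(columns, rows)
print(xml_text)
```

Once the data is in this form, VOTable-aware tools can plot any pair of numeric FIELD columns, such as Mass against PSA.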

More Services: WWMM Services

Services | Descriptions | Input | Output
InChIGoogle | Search an InChI structure through Google | inchiBasic, type | Search result in HTML format
InChIServer | Generate InChI | version, format | An InChI structure
OpenBabelServer | Transform a chemical format to another using Open Babel | format, inputData, outputData, options | Converted chemical structure string
CMLRSSServer | Generate CMLRSS feed from CML data | mol, title, description, link, source | Converted CMLRSS feed of CML data

CDK-Based Services

Common Substructure: Calculates the common substructure between two molecules.
CDKsim: Takes two SMILES and evaluates the Tanimoto coefficient (ratio of intersection to union of their fingerprints).
CDKdesc: Calculates a variety of molecular and atomic descriptors for QSAR modeling
CDKws: Fingerprint generation
CDKsdg: Creates a JPEG of the compound’s 2D structure
CDKStruct3D: Generates 3D coordinates of a molecule from its SMILES
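The Tanimoto coefficient mentioned for CDKsim is simple enough to show directly. The sketch below does not use CDK; it assumes each fingerprint is represented as a set of "on" bit positions, and the two example fingerprints are made up for illustration. Returning 0.0 when both fingerprints are empty is a convention chosen here, not something the source specifies.

```python
def tanimoto(bits_a, bits_b):
    """Ratio of shared on-bits to total distinct on-bits in two fingerprints."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # both fingerprints empty: convention assumed for this sketch
    return len(a & b) / len(union)


# Hypothetical fingerprints given as on-bit positions.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8, 9, 12}
print(tanimoto(fp1, fp2))  # 3 shared bits out of 6 distinct bits
```

A value of 1.0 means the two fingerprints set exactly the same bits, and 0.0 means they share none.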

ToxTree Service

The Threshold of Toxicological Concern (TTC) establishes a level of exposure for all chemicals below which there would be no appreciable risk to human health.
ToxTree implements the Cramer Decision Tree approach to estimate TTC.
We have converted this into a service.
  Uses SMILES as input.
  Note the GUI must be separated from the library to be a service

http://ecb.jrc.it/QSAR/home.php?CONTENU=/QSAR/qsar_tools/qsar_tools_toxtree.php

OSCAR3 Service

Oscar3 is a tool for shallow, chemistry-specific natural language parsing of chemical documents (e.g., journal articles).
It identifies (or attempts to identify):
  Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms.
  Chemical data: Spectra, melting/boiling point, yield etc. in experimental sections.
  Other entities: Things like N(5)-C(3) and so on.
Results are exported as an XML file.
There is a larger effort, SciBorg, in this area
  http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html
It also has potentially very interesting Workflows

http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3