Exploring Large Chemical Data Sets

Exploring Large Chemical Data Sets: Interactive Analysis and Visualization. Kyle Lutz and Marcus D. Hanwell. August 21, 2012, Skolnik Symposium.


Transcript of Exploring Large Chemical Data Sets

Page 1: Exploring Large Chemical Data Sets

Exploring Large Chemical Data Sets

Interactive Analysis and Visualization

Kyle Lutz and Marcus D. Hanwell

August 21, 2012, Skolnik Symposium

Page 2: Exploring Large Chemical Data Sets

Overview

● An open-source, cross-platform cheminformatics tool

● A general-purpose tool for chemical data exploration and analysis

● Interactive, editable, and queryable database of chemical data on the desktop

● Part of the Open Chemistry application suite (Avogadro and MoleQueue)

● Leverages several open-source projects: Qt, VTK, Chemkit, Open Babel, MongoDB

Page 3: Exploring Large Chemical Data Sets

Architecture

● Native, cross-platform C++ application built with Qt

● Stores chemical data in a NoSQL MongoDB database

● Uses VTK for 2D and 3D data set visualization

Page 4: Exploring Large Chemical Data Sets

Main Window

Page 5: Exploring Large Chemical Data Sets

Molecule Details

Page 6: Exploring Large Chemical Data Sets

Queries

Supports different queries:

● Name

● Formula

● InChI

● InChIKey

● Structure and Substructure
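As a rough illustration of how these query types could map onto MongoDB filter documents, here is a minimal Python sketch. Only the "name" and "heavyAtomCount" field names appear elsewhere in this talk; the "formula", "inchi", and "inchikey" field names and the build_query helper are hypothetical. The substructure case covers only a cheap prefilter (a match must have at least as many heavy atoms as the query fragment), not the structure matching itself.

```python
def build_query(kind, value):
    """Return a MongoDB-style filter document for one query type.

    Hypothetical helper: only "name" and "heavyAtomCount" are field
    names taken from the slides; the others are assumed.
    """
    if kind in ("name", "formula", "inchi", "inchikey"):
        # Exact-match lookup on a single identifier field.
        return {kind: value}
    if kind == "substructure":
        # Prefilter only: value is the heavy-atom count of the query
        # fragment. A real substructure match still has to be run on
        # the candidates that pass this filter.
        return {"heavyAtomCount": {"$gte": int(value)}}
    raise ValueError(f"unsupported query type: {kind}")


name_filter = build_query("name", "ethanol")
prefilter = build_query("substructure", 3)
```

The resulting dictionaries have the same shape as the first argument of a find() call on the molecules collection.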

Page 7: Exploring Large Chemical Data Sets

Similarity Searching

Page 8: Exploring Large Chemical Data Sets

Charts and Plots

Histogram of logP

Scatter Plot of Polar Surface Area (TPSA) against Volume (VABC)

Page 9: Exploring Large Chemical Data Sets

Multidimensional Analysis

● Provide tools for viewing and analyzing large amounts of data with multiple dimensions
  ○ Scatter Plot Matrix
  ○ Parallel Coordinates
  ○ K-Means Clustering

● Interactive charts supporting selection

● Easy to add new chemical descriptors

Page 10: Exploring Large Chemical Data Sets

Scatter Plot Matrix

Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume

Page 11: Exploring Large Chemical Data Sets

Parallel Coordinates

Polar Surface Area vs. logP vs. Mass vs. Rotatable Bonds vs. Volume
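Descriptors such as polar surface area and mass live on very different scales, so a parallel coordinates plot typically rescales each axis independently before drawing. The Python sketch below shows one common choice, min-max scaling per descriptor column. It is an assumption about how such a plot can be prepared, not a description of what the application or VTK does internally, and the input values are toy numbers rather than real descriptor data.

```python
def scale_axes(rows):
    """Min-max scale each descriptor column to the range [0, 1].

    rows is a list of equal-length numeric lists, one per molecule.
    A constant column is mapped to 0.5 to avoid dividing by zero.
    """
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [
            (value - low) / (high - low) if high > low else 0.5
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]


# Toy values only: two made-up descriptor columns for three molecules.
toy_rows = [[20.0, 1.0], [40.0, 3.0], [60.0, 2.0]]
scaled = scale_axes(toy_rows)
```

Each scaled row can then be drawn as one polyline crossing the vertical axes at the scaled heights.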

Page 12: Exploring Large Chemical Data Sets

K-Means Clustering

● ~30 numeric molecular descriptors

● 1D, 2D, and 3D visualization

● Selection and extraction of molecules from clusters
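For readers unfamiliar with k-means, the following pure-Python sketch shows the standard Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its members). It is a generic illustration with toy 2D points standing in for descriptor vectors; the application's own clustering implementation is not shown in the slides and may differ.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points with Lloyd's algorithm.

    Returns (assignments, centroids), where assignments[i] is the
    cluster index of points[i].
    """
    rng = random.Random(seed)
    # Initialize centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignments == assignments:
            break  # Converged: no point changed cluster.
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(toy_points, k=2)
```

Extracting the molecules of one cluster then amounts to selecting the rows whose label equals that cluster index.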

Page 13: Exploring Large Chemical Data Sets

Similarity Visualization

● Similarity Clustering

● Calculated from fingerprint similarity or structural similarity
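Fingerprint similarity is commonly measured with the Tanimoto coefficient: the number of bits set in both fingerprints divided by the number set in either. The slides do not say which measure the application uses, so the Python sketch below is an assumption. It stores each fingerprint as an integer bit set and lists the pairs whose similarity reaches a chosen threshold; the fingerprints are toy bit patterns, not real molecular fingerprints.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints stored as int bit sets."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0  # Convention chosen here for two empty fingerprints.
    return bin(fp_a & fp_b).count("1") / union


def similar_pairs(fingerprints, threshold):
    """Return index pairs (i, j) whose similarity is at least threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(fingerprints), 2)
        if tanimoto(a, b) >= threshold
    ]


# Toy 4-bit fingerprints for illustration only.
toy_fps = [0b1111, 0b0111, 0b1000]
strict = similar_pairs(toy_fps, 0.60)
loose = similar_pairs(toy_fps, 0.20)
```

Raising the threshold keeps only the closest pairs, so a graph built from these pairs breaks into smaller, tighter groups.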

Page 14: Exploring Large Chemical Data Sets

Similarity Visualization

[Figure panels labeled 30%, 45%, and 60%]

Page 15: Exploring Large Chemical Data Sets

ChemicalJSON

Specification available at: http://wiki.openchemistry.org/Chemical_JSON

Example: ethane.cjson

● JSON (JavaScript Object Notation) is a "lightweight data-interchange format"

● Store molecular structure, geometry, identifiers, and descriptors all as a single JSON object

● Benefits:
  ○ More compact than XML/CML
  ○ Native language of MongoDB and JSON-RPC
  ○ Easily converted to a binary representation (BSON)
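To give a feel for "a single JSON object", the Python sketch below builds a small ethane record and round-trips it through JSON text. The key names ("chemical json", "atoms", "bonds", and so on) are assumptions modeled on the general shape of the format and should be checked against the specification; 3D coordinates are deliberately left out rather than invented.

```python
import json

# Hypothetical ChemicalJSON-style record for ethane (key names assumed).
ethane = {
    "chemical json": 0,
    "name": "ethane",
    "formula": "C2H6",
    "inchi": "InChI=1S/C2H6/c1-2/h1-2H3",
    "atoms": {
        # Atomic numbers: two carbons followed by six hydrogens.
        "elements": {"number": [6, 6, 1, 1, 1, 1, 1, 1]},
    },
    "bonds": {
        # Flattened pairs of atom indices: C-C, then three C-H per carbon.
        "connections": {"index": [0, 1, 0, 2, 0, 3, 0, 4, 1, 5, 1, 6, 1, 7]},
        "order": [1, 1, 1, 1, 1, 1, 1],
    },
}

# Serialize without extra whitespace, then parse the text back.
text = json.dumps(ethane, separators=(",", ":"))
restored = json.loads(text)
```

Because the record is plain nested dictionaries and lists, the same object can be written to a file, stored as a database document, or sent in an RPC message without a separate conversion step.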

Page 16: Exploring Large Chemical Data Sets

ChemicalJSON in MongoDB

● Nearly identical to what is stored in a file
  ○ A few extra fields stored
    ■ 2D diagram (as PNG)
    ■ Heavy atom count (for substructure searching)
    ■ Binary fingerprints (for similarity searching)
    ■ InChIKey for indexing and as a unique key
    ■ Mongo's OID ("_id") field

● Trivial to write out to a .cjson file:

    db.molecules.find({"name" : "ethanol"}, {"diagram" : 0, "heavyAtomCount" : 0, "fp2_fingerprint" : 0, "_id" : 0})
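The second argument of the find() call above is a projection that excludes the database-only fields. The same effect can be sketched client-side in Python, assuming the stored document has already been fetched as a dictionary. The four excluded field names come from the query above; the to_cjson helper and the placeholder document values are hypothetical.

```python
import json

# Database-only fields, as listed in the projection of the find() call.
EXTRA_FIELDS = ("diagram", "heavyAtomCount", "fp2_fingerprint", "_id")


def to_cjson(document):
    """Drop the database-only fields and return .cjson file content."""
    cjson = {key: value for key, value in document.items() if key not in EXTRA_FIELDS}
    return json.dumps(cjson, indent=2)


# Placeholder stored document; the binary values are dummies.
stored = {
    "_id": "placeholder-id",
    "name": "ethanol",
    "formula": "C2H6O",
    "heavyAtomCount": 3,
    "diagram": b"placeholder PNG bytes",
    "fp2_fingerprint": b"placeholder fingerprint bytes",
}
cjson_text = to_cjson(stored)
```

Doing the exclusion in the query itself, as on the slide, is preferable in practice because the large binary fields are then never sent over the wire.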

Page 17: Exploring Large Chemical Data Sets

Open Chemistry with ParaViewWeb

● Uses ParaView's client-server architecture

● Interactive 3D rendering

● Runs in any modern web browser

URL: http://paraviewweb.kitware.com/OpenChemistry/

Page 18: Exploring Large Chemical Data Sets

Open Chemistry with ParaViewWeb

ChemData

Page 19: Exploring Large Chemical Data Sets

RPC / Avogadro Integration

● Uses JSON-RPC to communicate with other applications (most notably Avogadro)

● Visualize data directly from the database

● Uses ChemicalJSON to represent molecular structures and transfer molecular information
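A JSON-RPC exchange is a pair of small JSON messages, which is why ChemicalJSON can be embedded in it directly. The Python sketch below builds a request and unpacks a response using the JSON-RPC 2.0 message layout. The protocol version, the "loadMolecule" method name, and the "cjson" parameter name are all assumptions for illustration; the slides do not specify the actual methods the applications expose, and the transport is omitted.

```python
import itertools
import json

_request_ids = itertools.count(1)


def make_request(method, params):
    """Serialize a JSON-RPC 2.0 request with a fresh integer id."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_request_ids),
            "method": method,
            "params": params,
        }
    )


def parse_response(text):
    """Return the result of a JSON-RPC 2.0 response or raise on error."""
    message = json.loads(text)
    if "error" in message:
        raise RuntimeError(message["error"].get("message", "unknown error"))
    return message["result"]


# Hypothetical method and parameter names, with a minimal molecule payload.
request_text = make_request("loadMolecule", {"cjson": {"name": "ethane"}})
```

The receiving application would parse the request, act on the embedded molecule, and reply with a message carrying the same id.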

Page 20: Exploring Large Chemical Data Sets

Future Directions

● Direct integration with 3rd party databases (PubChem, PDB, ...)

● Broader support for storing and analyzing computational job results
  ○ Linked with molecular structures
  ○ Direct from CML or converted/parsed

● Plugins to facilitate extension
  ○ Descriptors
  ○ Visualization
  ○ Chemical file input/output

● Scaling studies, working with multiple data servers and terabytes of data

Page 21: Exploring Large Chemical Data Sets

Comments/Questions?

Home Page: http://wiki.openchemistry.org/ChemData

Source Code: https://github.com/OpenChemistry/chemdata

ParaViewWeb Demo: http://paraviewweb.kitware.com/OpenChemistry