MyMolDB: A micromolecular database solution with open source and free components



DOI: 10.1002/jcc.21874

MyMolDB: A Micromolecular Database Solution with Open Source and Free Components

Bing Xia,[a] Zheng-Fu Tai,[a] Yu-Cheng Gu,*[b] Bang-Jing Li,[a] Li-Sheng Ding,[a] and Yan Zhou*[a]

Background: Managing chemical structures is an important daily task in small laboratories. Few solutions are available on the internet, and most of them are closed-source applications. The open-source applications typically have limited capability and only basic cheminformatics functionality. In this article, we describe an open-source solution for managing chemicals in research groups, built from open-source and free components. It has a user-friendly interface for chemical handling and intensive searching.

Results: MyMolDB is a micromolecular database solution that supports exact, substructure, similarity, and combined searching. It is implemented mainly in the scripting language Python, with a web-based interface for compound management and searching. Almost all searches are in essence done with pure SQL, exploiting the high performance of the database engine. Impressive search speed has thus been achieved on large data sets, because no external CPU-intensive languages are involved in the key search procedure.

Availability: MyMolDB is open-source software and can be modified and/or redistributed under the GNU General Public License version 3 published by the Free Software Foundation (Free Software Foundation Inc. The GNU General Public License, Version 3, 2007. Available at: http://www.gnu.org/licenses/gpl.html). The software itself can be found at http://code.google.com/p/mymoldb/. © 2011 Wiley Periodicals, Inc. J Comput Chem 32: 2942–2947, 2011

Keywords: Chemical management · micromolecular database solution · open source · python · structure searching

Introduction

Over the centuries, millions of chemicals have been synthesized in chemistry laboratories or isolated from natural sources. Handling the information on these chemicals efficiently and extracting the required knowledge from it is a huge challenge. Traditional approaches, such as storing molecular information in Microsoft Excel files, fail to cope at the scale of millions of compounds. With the rapid development of computer hardware and the evolution of database technologies, computer-aided strategies have made this possible and are becoming vital to the success of chemical research in drug design.[1–4] Cheminformatics,[5,6] which uses computer and information techniques to study a range of problems, focuses mainly on small organic molecules rather than macromolecules such as polymers, genes, proteins, and polysaccharides. For the storage and retrieval of information on these small molecules, a number of chemical compound database solutions have been developed, but most of them are commercial software, e.g., ChemAxon’s JChem package,[7] which runs several Java programs as servlets on a server and stores the data in relational SQL databases, and the solution from CambridgeSoft, which serves as an extension to the Oracle database.[8] There are also many open-source counterparts with basic functionality. Haider[9] released a set of programs written in Pascal and Perl that parse and format the relevant molecular data from commonly used MDL SD files and store the formatted results in MySQL[10] tables. Haider’s solution typically performs a substructure search within a few seconds on a standard PC (AMD Athlon 1.6 GHz CPU, 1.5 GB of memory). Open-source libraries for chem- and bioinformatics such as the Chemistry Development Kit (CDK)[11] and Open Babel[12,13] provide methods for common cheminformatics tasks and are fully capable of supporting chemical compound database construction and other cheminformatics-related processing. Rijnbeek and Steinbeck[14] developed OrChem, an extension for the Oracle 11G database that supports fast substructure and similarity searching with cheminformatics functionality provided by CDK. OrChem provides similarity searching with response times on the order of seconds for databases with millions of compounds, depending on the given similarity cutoff. For substructure searching, it can make use of multiple processor cores on today’s powerful database servers to provide fast response times on equally large data sets.

Database extensions based on Open Babel also exist. Jerome Pansanel developed Mychem,[15] a cheminformatics extension for MySQL based on Open Babel, which provides a set of functions for handling chemical data within the MySQL database. All these solutions have their particular merits. The commercial ones have the advantage when it comes to handling very large data sets, whereas the open-source ones are free in most cases and their source code is open, so users familiar with the relevant programming language can add their own functions at will.

[a] B. Xia, Z.-F. Tai, B.-J. Li, L.-S. Ding, Y. Zhou
Key Laboratory of Mountain Ecological Restoration and Bioresource Utilization, Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu 610041, People’s Republic of China
E-mail: [email protected]
[b] Y.-C. Gu
Syngenta, Jealott’s Hill International Research Centre, Bracknell, Berkshire RG42 6EY, United Kingdom
E-mail: [email protected]
Grant sponsor: The National Natural Sciences Foundation of China; Grant numbers: 30973634, 21072185, 81073014; Grant sponsor: Syngenta Ltd. (the PhD studentship)

Journal of Computational Chemistry © 2011 Wiley Periodicals, Inc. 2942
SOFTWARE NEWS AND UPDATES

For small laboratories, it is in most cases sufficient to have a database solution that can handle molecule collections of a few thousand to tens of thousands of compounds with exact, substructure, and similarity search capabilities. Although several solutions such as OrChem and Mychem, mentioned earlier, are capable and available, they are only extensions for database engines: they provide only core functionality, and users have to write the required back ends and interfaces themselves. They are also not easy to hack because of the programming languages used (Java for OrChem, and C and C++ for Mychem). Therefore, we introduce MyMolDB, a micromolecular database solution based on open-source and free components including Python,[16] Open Babel, and Web.py,[17] with a user-friendly web-based interface for managers and end users. In this solution, Apache,[18] Lighttpd,[19] or Nginx[20] can be used as the web server, MySQL is used as the storage engine for chemical data, and Python serves as the scripting and glue language for both chemical information handling and web interface programming. The leading open-source web servers (Apache, Lighttpd, and Nginx) and relational database management system (MySQL) underpin the performance and stability of the system. The use of the simple and powerful scripting language Python as the main programming language makes the system easy to extend and modify to suit individual requirements and preferences. Figure 1 gives an overview of MyMolDB’s architecture and a brief work flow. A detailed introduction to the main searching options (exact, substructure, similarity, and combined) can be found in the corresponding sections under Implementation.

Implementation

MyMolDB was developed on top of many open-source or free software components. Python was chosen as the main programming language because it has a clear syntax and a large collection of standard and third-party libraries, which makes it easy to learn and use for nonprofessional users. Scientific users, who typically lack specialized programming knowledge, can develop applications for their research rapidly and efficiently by using Python with third-party scientific libraries such as Pybel[21] (chem- and bioinformatics related, generally provided by Linux distributions under the package name python-openbabel), Cinfony[22] (cheminformatics related), Biopython[23–25] (bioinformatics related), and scientific computing libraries such as SciPy.[26] In MyMolDB, cheminformatics functionality such as fingerprint calculation, SMILES[27] canonicalization, SMARTS[28] parsing, and generation of 3D molecular coordinates is provided by Pybel. The chemical data handling part is implemented using the open-source relational database engine MySQL, and MySQL-python[29] is used to interact with the MySQL server from Python. The web interfaces are built on top of the simple but powerful Python-based web framework Web.py. We use OLN JSDraw[30] as the tool for molecule editing and structure display on the web pages, whereas an Indigo[31]-based molecular structure layout and picture generator is optional for structure display. Integration of other online molecular structure input or display tools such as the JME Java applet,[32] FlaME,[33] and Ketcher[34] with MyMolDB is to be developed. In our own setup, the web interfaces and tools are served by the lightweight and fast Lighttpd software (with the fastcgi module enabled). Many tools are written in Python, such as sdfparser.py, which parses molecules from MDL SD files and generates SQL statements according to a file that defines the rules for mapping MDL SD fields to SQL statements.

Preview

All the services supplied by the MyMolDB system are accessible across the Internet through the web interfaces. The interfaces include normal mode (in which substructure, exact, and similarity searching options are available), advanced mode (in which combined searching is supported), molecule management, and user management. Figure 2 shows a preview of MyMolDB in normal mode under Google Chrome; more detailed screen shots are available at http://code.google.com/p/mymoldb/wiki/Screenshots.

Figure 1. The architecture and brief work flow of MyMolDB.

Database schema

We use MySQL to store all our data; as a relational database engine, it allows complex arrangements of data to be handled. The Open Babel canonical SMILES strings, around which the basic database schema is relationally organized, were calculated, and their MD5 hashes were stored in the table mol_data for rapid indexing and enforcement of uniqueness. The Open Babel binary fingerprints and the fingerprint bit counts of the molecules were precalculated. Each fingerprint was divided into 32 segments, and the segments were converted to decimal numbers and stored in the table mol_fp, on which all substructure- and similarity-related queries are based. The Tanimoto coefficient was used to measure the similarity between molecules. In addition, statistics of the molecules were precalculated and stored in the table mol_stat as molecular statistic fingerprints, which help limit the search range when querying for molecules with particular molecular statistics. Table 1 presents the main database schema for substructure and similarity searching.
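To make the segment layout concrete, here is a minimal sketch of how a fingerprint could be packed into 32 integer columns. It is an illustration under assumptions, not MyMolDB's own code: the 32-bit segment width, the 1-based bit numbering, and the helper names pack_segments and to_row are all hypothetical. The example bit positions are chosen only so that the resulting row matches the segment values quoted for benzene in the SQL examples of this article.

```python
# Sketch only: SEGMENT_COUNT comes from the text; SEGMENT_WIDTH and the
# 1-based bit numbering are assumptions made for this illustration.
SEGMENT_COUNT = 32
SEGMENT_WIDTH = 32


def pack_segments(on_bits):
    """Pack a collection of 1-based on-bit positions into integer segments."""
    segments = [0] * SEGMENT_COUNT
    for position in on_bits:
        index, offset = divmod(position - 1, SEGMENT_WIDTH)
        segments[index] |= 1 << offset
    return segments


def to_row(mol_id, on_bits):
    """Build a dict shaped like one mol_fp row (fp01 ... fp32 plus fp_bits)."""
    segments = pack_segments(on_bits)
    row = {"mol_id": mol_id, "fp_bits": len(set(on_bits))}
    for number, value in enumerate(segments, start=1):
        row["fp%02d" % number] = value
    return row


# Hypothetical bit positions, picked to reproduce the benzene segment values.
example_bits = [2 * 32 + 18, 10 * 32 + 28, 20 * 32 + 16,
                22 * 32 + 7, 22 * 32 + 12, 28 * 32 + 10]
row = to_row(1, example_bits)
```

Storing each segment as a plain integer is what lets the later screening and similarity steps run as ordinary bitwise expressions inside the database.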

Main work flow of MyMolDB

MyMolDB’s program entry point, main.py, is launched by Lighttpd’s fastcgi module. During initialization, a map is established between external Uniform Resource Locators (URLs) and Python classes with different functions. The Python classes are instantiated by Web.py and called when a search request arrives at the related URL. The resulting instances first parse the request to obtain the queried structure(s) and then call the related methods to generate SQL statements according to the search type of the query. The generated SQL statements are finally executed by MySQL through the MySQL-python API to obtain the results. This procedure is outlined in Figure 3.
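The URL-to-class mapping described above can be imitated with the standard library alone. The sketch below is not Web.py and not MyMolDB's main.py; the URL patterns, class names, and the dispatch helper are hypothetical and only show the pattern of looking up a class by URL, instantiating it, and letting it handle the request.

```python
import re


# Hypothetical handler classes; the real class names are not given in the text.
class ExactSearch(object):
    def handle(self, params):
        return "exact search for %s" % params["smiles"]


class SubstructureSearch(object):
    def handle(self, params):
        return "substructure search for %s" % params["smiles"]


# URL pattern -> handler class, in the spirit of a web framework URL map.
URL_MAP = (
    (r"^/search/exact$", ExactSearch),
    (r"^/search/sub$", SubstructureSearch),
)


def dispatch(path, params):
    """Instantiate the class mapped to `path` and let it handle the request."""
    for pattern, handler_class in URL_MAP:
        if re.match(pattern, path):
            return handler_class().handle(params)
    raise LookupError("no handler for %s" % path)


reply = dispatch("/search/sub", {"smiles": "c1ccccc1"})
```

In a real deployment the framework performs this lookup itself; the sketch only makes the control flow visible.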

Figure 2. Web interface for normal searching.

Exact searching

Exact searching is relatively easy in our system. The Open Babel canonical SMILES strings of all molecules to be handled are precalculated and hashed with the MD5 algorithm. The hashes are stored in the column md5_openbabel_can_smiles of the SQL table mol_data, and an SQL index on md5_openbabel_can_smiles is created for fast exact searching. At search time, the MD5 hash of the Open Babel canonical SMILES of the target molecule is calculated and used by MyMolDB to generate the SQL statement.

The SQL statement is generated by an instance of the sql class of libmymoldb. The following code shows an exact query for benzene:

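# settings provides env_db, the dict describing the database fields
from settings import *
import libmymoldb

sql_obj = libmymoldb.sql(env_db)
# 'c1ccccc1' is benzene; 1 is the code for exact searching
sql_string = sql_obj.gen_search_sql('c1ccccc1', 1)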

Here env_db, imported from settings, is a Python dict that holds the required database information, such as the field names for fp_bits and openbabel_can_smiles and the primary field. The string c1ccccc1 is the Open Babel canonical SMILES for benzene, and 1 in sql_obj.gen_search_sql('c1ccccc1', 1) is the code for exact searching. The generated SQL statement will resemble:

SELECT … FROM table WHERE mol_data.md5_openbabel_can_smiles = '975f4ba64117f4dc65044299552ee698';

where 975f4ba64117f4dc65044299552ee698 is the MD5 hash of c1ccccc1. The SQL statement is then executed by an instance of the libmymoldb.database class to obtain the results.

Using a hashed string introduces the possibility of collisions, in which two different strings hash to the same value. We therefore perform an additional exact text match on the unhashed canonical SMILES whenever the query returns a nonempty result, which eliminates false-positive hits.
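The hash lookup followed by a text check can be sketched as below. This is an illustration rather than the libmymoldb implementation: the function names, the selected columns (including a mol_id column on mol_data), and the simulated result rows are assumptions. The %s placeholder follows the parameter style used by MySQL-python.

```python
import hashlib


def md5_of_smiles(canonical_smiles):
    """Return the hexadecimal MD5 digest used as the indexed lookup key."""
    return hashlib.md5(canonical_smiles.encode("utf-8")).hexdigest()


def exact_search_sql(canonical_smiles):
    """Build a parameterized statement; the column list is a placeholder."""
    sql = ("SELECT mol_id, openbabel_can_smiles FROM mol_data "
           "WHERE md5_openbabel_can_smiles = %s")
    return sql, (md5_of_smiles(canonical_smiles),)


def drop_collisions(rows, canonical_smiles):
    """Keep only rows whose unhashed SMILES equals the query text."""
    return [row for row in rows
            if row["openbabel_can_smiles"] == canonical_smiles]


sql, params = exact_search_sql("c1ccccc1")
# Simulated result set: the second row stands for a hypothetical collision.
rows = [{"mol_id": 1, "openbabel_can_smiles": "c1ccccc1"},
        {"mol_id": 2, "openbabel_can_smiles": "CCO"}]
hits = drop_collisions(rows, "c1ccccc1")
```

The text comparison only runs on the handful of rows that share the hash, so it adds little cost to the indexed lookup.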

Substructure searching

In graph-theoretical terms, substructure searching can be described as checking for the existence of a subgraph isomorphism with the query graph in a series of topological graphs.[35] Fingerprints are used to quickly screen out candidate molecules that are likely to contain the query molecule as a substructure. A further filter is needed to inspect the candidates, because fingerprint screening alone sometimes returns false positives owing to the degeneracy of bits in fingerprints. Our further screening of candidates is based on SMARTS matching. The substructure searching process is therefore separated into two steps: fingerprint screening and SMARTS matching. The fingerprint method is fast, consumes little CPU resource, and is well suited to large data sets, whereas SMARTS matching is accurate but relatively slow.

The fingerprint method is a bitwise AND comparison between the fingerprint of the query structure and those of the candidates. Because the bitwise AND operation is supported by MySQL, a simple SQL clause is enough to screen the candidates. The fingerprints of the candidate molecules are precalculated and stored in the database, so only the fingerprint of the query molecule needs to be calculated, using the Pybel API. The molecular statistics of the query are also calculated and added to the SQL statement to limit the search range. All of this is done by an instance of the libmymoldb.sql class with code similar to that shown for exact searching; the only difference is that the search code is switched to 2 (the code for substructure searching). The SQL statement generated for a substructure search for "c1ccccc1" (benzene) would resemble:

SELECT …
FROM mol_stat INNER JOIN mol_fp ON (mol_stat.mol_id = mol_fp.mol_id)
WHERE ((n_rings >= 1 AND n_atoms >= 12 AND n_C >= 6 AND n_r6 >= 1 AND n_bonds >= 12 AND n_X >= 0 AND fp_bits >= 6 AND n_Car >= 6)
AND (fp03 & 131072 = 131072 AND fp11 & 134217728 = 134217728 AND fp21 & 32768 = 32768 AND fp23 & 2112 = 2112 AND fp29 & 512 = 512));

The SQL statement is then executed in MySQL to obtain preliminary results. The large number of bitwise AND comparisons is carried out entirely within MySQL, without any other computationally expensive language, which improves the efficiency and speed of searching.
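The fingerprint part of the screening clause above can be generated mechanically from the nonzero segments of the query fingerprint. The following is a minimal sketch under assumptions: the function names and the dict representation of a fingerprint (segment number mapped to integer value) are hypothetical, and the molecular statistics conditions are left out. The benzene values are those quoted in the statement above.

```python
def fp_screen_clause(query_segments):
    """Build the bitwise AND conditions from {segment number: integer value}."""
    conditions = []
    for number in sorted(query_segments):
        value = query_segments[number]
        if value:  # all-zero segments impose no constraint
            conditions.append("fp%02d & %d = %d" % (number, value, value))
    return " AND ".join(conditions)


def passes_screen(candidate_segments, query_segments):
    """Python equivalent of the SQL screen, for checking the logic."""
    return all(candidate_segments.get(number, 0) & value == value
               for number, value in query_segments.items())


benzene = {3: 131072, 11: 134217728, 21: 32768, 23: 2112, 29: 512}
clause = fp_screen_clause(benzene)

# A hypothetical candidate that has every benzene bit plus one extra bit.
candidate = dict(benzene)
candidate[5] = 8
```

A candidate passes only if every bit set in the query is also set in the candidate, which is why a molecule sharing only some of the bits is rejected before the slower SMARTS step.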

Table 1. Database tables of fingerprints for substructure and similarity searching.

TABLE mol_fp
mol_id                                              INT(11)
fp_bits (bit count of the fingerprints)             SMALLINT(5)
fp01                                                BIGINT(20)
…                                                   …
fp32                                                BIGINT(20)

TABLE mol_stat
mol_id                                              INT(11)
n_atoms (number of atoms)                           SMALLINT(5)
n_bonds (number of bonds)                           SMALLINT(5)
n_rings (number of rings)                           SMALLINT(5)
n_r3 (number of 3-membered rings)                   SMALLINT(5)
…                                                   …
n_r8 (number of 8-membered rings)                   SMALLINT(5)
n_rx (number of rings with members more than 8)     SMALLINT(5)
n_C (number of carbon atoms)                        SMALLINT(5)
n_C2 (number of sp2 carbon atoms)                   SMALLINT(5)
n_C3 (number of sp3 carbon atoms)                   SMALLINT(5)
n_Car (number of aromatic carbon atoms)             SMALLINT(5)
…                                                   …

Figure 3. Main work flow of searching of MyMolDB.

After the first step of rough screening, SMARTS matching is used to find the target molecules accurately. We defined a Python class named mymol in libmymoldb, whose sub_match method does the SMARTS matching. The code for SMARTS matching in Python should resemble the following, assuming that the rough results are stored in a Python list named rough_results whose elements are Python dicts holding the information on the candidate molecules from the fingerprint screening step, with the SMILES stored under the key openbabel_can_smiles, and that the structure of the query molecule, in SMILES format, is assigned to the variable query_smiles:

import libmymoldb

# Wrap the query structure, then keep only candidates that really contain it
mol_obj = libmymoldb.mymol('smi', query_smiles)
results = [candidate for candidate in rough_results
           if mol_obj.sub_match('smi', candidate['openbabel_can_smiles'])]

where the string smi is the format code for SMILES in Pybel. With just a few lines of Python like those above, this result-filtering step is fast enough, owing to the high performance of the C++-backed Pybel API and the manageable number of rough results from the fingerprint screening step.

Similarity searching

Although there are many measures of structural similarity, such as the 2D fingerprint method,[36] layered atom environment fingerprints,[37] and structural dictionary-based methods,[38] traditional binary-encoded fingerprint-based similarity algorithms are still the most widely used because of their simplicity and robustness. In our system, the Tanimoto coefficient is used to evaluate the similarity between molecules. The Tanimoto coefficient between molecules a and b is given in eq. (1):

\[ \text{Tanimoto coefficient} = \frac{C}{A + B - C} \qquad (1) \]

in which A is the number of 1 bits in the fingerprint of molecule a, B is that of molecule b, and C is the number of 1 bits present in the fingerprints of both molecules a and b.
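Equation (1) can be checked on small numbers with a few lines of Python. This is a sketch for illustration only: the function names are hypothetical, the two fingerprints are toy values with two segments each, and returning 0.0 for two empty fingerprints is a convention chosen here, not something stated in the article.

```python
def bit_count(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")


def tanimoto(segments_a, segments_b):
    """Tanimoto coefficient of two fingerprints given as equal-length
    lists of integer segments."""
    a = sum(bit_count(x) for x in segments_a)
    b = sum(bit_count(y) for y in segments_b)
    c = sum(bit_count(x & y) for x, y in zip(segments_a, segments_b))
    if a + b - c == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return c / float(a + b - c)


# Toy 2-segment fingerprints (hypothetical values).
fp_a = [0b1011, 0b0001]   # A = 4
fp_b = [0b0011, 0b0101]   # B = 4, shared bits C = 3
score = tanimoto(fp_a, fp_b)   # 3 / (4 + 4 - 3)
```

Because C is computed segment by segment with a bitwise AND, the same arithmetic maps directly onto per-column expressions in SQL.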

Our solution for calculating the Tanimoto coefficient mainly uses the standard functions and arithmetic operations supported by MySQL. All the computationally expensive operations are performed in pure SQL on the table mol_fp; better speed and efficiency are obtained because no external languages are used. The SQL statement for similarity searching is also generated by an instance of the libmymoldb.sql class, using code similar to that shown for exact searching with the search code changed to 3 (the code for similarity searching). For example, the SQL statement for finding molecules whose Tanimoto coefficient with "c1ccccc1" (benzene) exceeds 0.9 should resemble the following:

SELECT …
FROM mol_fp
WHERE ((BIT_COUNT(fp03 & 131072) + BIT_COUNT(fp11 & 134217728) + BIT_COUNT(fp21 & 32768) + BIT_COUNT(fp23 & 2112) + BIT_COUNT(fp29 & 512)) / (6 + fp_bits - (BIT_COUNT(fp03 & 131072) + BIT_COUNT(fp11 & 134217728) + BIT_COUNT(fp21 & 32768) + BIT_COUNT(fp23 & 2112) + BIT_COUNT(fp29 & 512))) > 0.9);

in which 6 is the number of 1 bits in the fingerprint of the query structure (benzene here), and 0.9 is the minimum similarity between benzene and the candidate molecules.
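The WHERE expression above follows a regular pattern, so it can be assembled from the nonzero query segments. The sketch below is one possible way to do so and is not taken from libmymoldb: the function names and the dict representation of the query fingerprint are assumptions, and it expects a query with at least one nonzero segment.

```python
def bit_count(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")


def similarity_where(query_segments, min_similarity):
    """Build a Tanimoto WHERE expression from {segment number: value}.

    Assumes at least one nonzero segment in the query fingerprint.
    """
    nonzero = sorted((n, v) for n, v in query_segments.items() if v)
    # C: bits shared by the query and the stored fingerprint
    common = " + ".join("BIT_COUNT(fp%02d & %d)" % (n, v) for n, v in nonzero)
    # A: bits set in the query; B is the stored fp_bits column
    query_bits = sum(bit_count(v) for _, v in nonzero)
    return "((%s) / (%d + fp_bits - (%s)) > %s)" % (
        common, query_bits, common, min_similarity)


benzene = {3: 131072, 11: 134217728, 21: 32768, 23: 2112, 29: 512}
where = similarity_where(benzene, 0.9)
```

Only segments in which the query has bits set appear in the expression, since all other segments contribute nothing to the shared bit count.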

Combined searching

MyMolDB supports so-called combined searching, which makes it possible to search in a flexible and powerful way. Complex searches can be done without writing long and sophisticated SQL statements or editing the back-end code of the system. For this purpose, a simple search language was designed to express the query. A statement in the language is in essence a logical query formula in which the keywords supported by the system are abbreviations of the names of MySQL columns (the abbreviations are configurable in the settings file). Operators include AND, OR, NOT, ~ (match, followed by a string in which the wildcard % is supported), SUB (substructure), MAX (the maximum number of results), and so on. A molecule buffer with several cells named m1 to mn (1 < n < 10) is provided on the web page; molecules can be drawn in those cells and referred to by name in the query formula. The MyMolDB system translates the formula into simple search types (exact searching and substructure searching) and then generates the SQL statement using the related methods, after which the SQL statement is executed. If we have two molecules in cells m1 and m2 of the molecule buffer, the query for molecules whose molecular weight is less than 400 and which contain m1 but not m2 as a substructure, with no more than 100 results returned, is described by the formula below:

MW < 400 AND (SUB = m1 AND SUB != m2) MAX = 100

which is also valid in the following form:

MW < 400 AND (SUB = m1 AND (NOT SUB = m2)) MAX = 100

in which MW is the abbreviation of the database column for molecular weight.
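As a rough illustration of how such a formula might be taken apart before SQL generation, the sketch below tokenizes the first formula, separates the trailing MAX limit, and lists the substructure terms. It is not MyMolDB's parser: the token grammar, the function names, and the handling of MAX are assumptions, and it deliberately ignores the NOT form and the quoted strings used with ~.

```python
import re

# Operators, identifiers (keywords and cell names), and numbers.
TOKEN = re.compile(
    r"\s*(!=|<=|>=|[()<>=~]|[A-Za-z_][A-Za-z_0-9]*|\d+(?:\.\d+)?)")


def tokenize(formula):
    """Split a query formula into operator, keyword, and number tokens."""
    tokens, position = [], 0
    formula = formula.strip()
    while position < len(formula):
        match = TOKEN.match(formula, position)
        if match is None:
            raise ValueError("unexpected text at %d" % position)
        tokens.append(match.group(1))
        position = match.end()
    return tokens


def split_max(tokens):
    """Separate a trailing 'MAX = n' from the logical part of the formula."""
    if len(tokens) >= 3 and tokens[-3] == "MAX" and tokens[-2] == "=":
        return tokens[:-3], int(tokens[-1])
    return tokens, None


def substructure_terms(tokens):
    """List (cell name, wanted) pairs for every 'SUB = mX' or 'SUB != mX'."""
    terms = []
    for i in range(len(tokens) - 2):
        if tokens[i] == "SUB" and tokens[i + 1] in ("=", "!="):
            terms.append((tokens[i + 2], tokens[i + 1] == "="))
    return terms


logic, limit = split_max(
    tokenize("MW < 400 AND (SUB = m1 AND SUB != m2) MAX = 100"))
terms = substructure_terms(logic)
```

Here limit would become the row limit of the generated statement, while each entry in terms would trigger its own fingerprint screen and SMARTS match.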

Performance

The performance of MyMolDB was tested on a PubChem data sample of 1.0 million molecules, a data set large enough for daily use in small laboratories. All tests were run on an Arch Linux server with a single quad-core Q9550 2.83 GHz CPU and 4 GB of RAM, using kernel 2.6.34.1, Python 2.6.5, Web.py 0.34, Pybel 2.2.3, MySQL 5.1.47, and Lighttpd 1.4.26.

Substructure searching

As described earlier, the substructure searching process is separated into two steps: a fingerprint-based rough screening followed by an accurate SMARTS-based match. The performance of substructure searching therefore includes the performance of both steps. Substructure searches for "c1ccccc1" were timed with different limits on the number of results, and the results are illustrated in Figure 4.

Similarity searching

The performance of similarity searching is more representative than that of the other search types, because the Tanimoto method checks almost all the compounds in the data set in a single pass. In our test, 100 randomly chosen compounds were each used as a query for all their similar compounds, and the test was repeated for different minimum Tanimoto coefficient values. Table 2 shows the performance of searching the PubChem data set at different minimum Tanimoto similarities in terms of query throughput time and average result size.

Conclusion

The micromolecular database solution MyMolDB, built from open-source and free components for small laboratories, has been developed. It supports substructure, exact, similarity, and combined searching with acceptable performance even on a large data set of 1.0 million chemicals. The web-based search pages provide a user-friendly interface for compound searching and user management. Being based on Python makes the system extensible. The libraries provided within the package can be used to further extend its cheminformatics-related functionality. Furthermore, a plug-in system is planned, which could make it easier for users to build their own cheminformatics-related data processing applications. The project is hosted on Google Code at http://code.google.com/p/mymoldb. Developers who are interested in further extending MyMolDB are welcome to participate through Google Code. Bug reports and suggestions can be submitted on the project home page.

License

MyMolDB is released under the GNU General Public License version 3 as published by the Free Software Foundation.

Acknowledgments

The authors thank the PubChem Project for assembling its great collection of molecular structures and opening it up for noncommercial use, which allowed us to test the performance of our system. They also thank the Open Babel community for their great work on the Open Babel chemical toolbox and its Python binding, which provides the cheminformatics-related functionality for our solution. They thank Scilligence for providing their powerful JavaScript chemical structure editor/viewer OLN JSDraw free of charge for noncommercial use, with which structures can be edited and displayed easily on the web. They thank Mr. John Delaney for his scientific input and for proofreading the manuscript.

[1] A. V. Veselovsky, A. S. Ivanov, Curr Drug Targ Infect Dis 2003, 3, 33.

[2] F. Ooms, Curr Med Chem 2000, 7, 141.

[3] O. F. Guner, Curr Top Med Chem 2002, 2, 1321.

[4] G. Schneider, U. Fechner, Nat Rev Drug Discov 2005, 4, 649.

[5] Cheminformatics. Wikipedia, 2011. Available at: http://en.wikipedia.org/wiki/Cheminformatics. Accessed on May 2, 2011.

[6] F. K. Brown, Annu Rep Med Chem 1998, 33, 375.

[7] ChemAxon Ltd., Introduction to the JChem suite, 2011. Available at:

http://www.chemaxon.com/jchem/intro/index.html. Accessed on May 2,

2011.

[8] CambridgeSoft Corporation, CambridgeSoft Enterprise Solutions—Oracle Cartridge, 2007. Available at: http://www.cambridgesoft.com/solutions/details/?fid=186. Accessed on May 2, 2011.

[9] N. Haider, Molecules (Basel, Switzerland) 2010, 15, 5079.

[10] Oracle Corporation and/or its affiliates, MySQL: The World’s Most Popular Open Source Database, Version 5.5.10, 2011. Available at: http://www.mysql.com/. Accessed on May 2, 2011.

[11] C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann, E. Willighagen,

J Chem Inf Comput Sci 2003, 43, 493.

[12] R. Guha, M. T. Howard, G. R. Hutchison, P. Murray-Rust, H. Rzepa, C.

Steinbeck, J. Wegner, E. L. Willighagen, J Chem Inf Model 2006, 46,

991.

[13] The Open Babel Package, Version 2.2.3, 2011. Available at: http://openbabel.sourceforge.net/. Accessed on May 2, 2011.

[14] M. Rijnbeek, C. Steinbeck, J Cheminf 2009, 1, 1.

[15] J. Pansanel, Mychem—A Chemical Extension for MySQL, Version 0.8.0,

2010. Available at: http://mychem.sourceforge.net/. Accessed on May 2,

2011.

[16] Python Software Foundation. Python Programming Language—Official

Website, Version 2.7.1, 2011. Available at: http://www.python.org/.

Accessed on May 2, 2011.

[17] A. Swartz, Welcome to web.py! (web.py), Version 0.34, 2010. Available at: http://webpy.org/. Accessed on May 2, 2011.

[18] The Apache Software Foundation, Apache HTTP Server, Version 2.2.17,

2010. Available at: http://projects.apache.org/projects/http_server.html.

Accessed on May 2, 2011.

[19] Lighttpd fly light, Version 1.4.28, 2010. Available at: http://www.lighttpd.net/. Accessed on May 2, 2011.

[20] Nginx news, Version 0.8.54, 2010. Available at: http://nginx.org.

Accessed on May 2, 2011.

[21] N. M. O'Boyle, C. Morley, G. R. Hutchison, Chem Cent J 2008, 2, 1.

[22] N. M. O'Boyle, G. R. Hutchison, Chem Cent J 2008, 2, 24.

Figure 4. Performance of substructure search for "c1ccccc1" with different limits on the number of results on 1.0 million compounds.

Table 2. Similarity search throughput time for 100 randomly selected compounds with different minimal Tanimoto similarity in a sample database with 1.0 million structures.

Minimal Tanimoto similarity    Average compounds found    Average throughput time (seconds)
0.5                            846                        7.4
0.65                           135                        5.3
0.8                            34                         3.5
0.95                           5                          2.4


[23] B. Chapman, J. Chang, ACM SIGBIO Newsletter 2000, 20, 15.

[24] P. J. Cock, T. Antao, J. T. Chang, B. Chapman, C. J. Cox, A. Dalke, I. Friedberg, T. Hamelryck, F. Kauff, B. Wilczynski, M. J. L. de Hoon, Bioinformatics (Oxford, England) 2009, 25, 1422.

[25] M. J. L. D. Hoon, B. Chapman, I. Friedberg, North 2003, 299,

298.

[26] E. Jones, T. Oliphant, P. Peterson, SciPy: Open Source Scientific Tools

for Python, 2010. Available at: http://www.scipy.org/. Accessed on May 2,

2011.

[27] D. Weininger, J Chem Inf Model 1988, 28, 31.

[28] Daylight Chemical Information Systems Inc., SMARTS—A Language for Describing Molecular Patterns, 2008. Available at: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html. Accessed on May 2, 2011.

[29] A. Dustman, MySQL for Python, Version 1.2.3. Available at: http://sourceforge.net/projects/mysql-python/. Accessed on May 2, 2011.

[30] Scilligence, OLN™ JSDraw™—A Javascript Chemical Structure Editor/Viewer, Version 1.2.2, 2011. Available at: http://www.scilligence.com/web/jsdrawapis.aspx. Accessed on May 2, 2011.

[31] GGA Software Services LLC, Indigo—GGA Software Services, Version

1.0 Beta3. Available at: http://ggasoftware.com/opensource/indigo.

Accessed on May 2, 2011.

[32] P. Ertl, JME Molecular Editor, 2011. Available at: http://www.molinspiration.com/jme/index.html. Accessed on May 2, 2011.

[33] P. Dallakian, N. Haider, J Cheminf 2011, 3, 6.

[34] GGA Software Services LLC, Ketcher—GGA Software Services, Version 1.0 Beta3, 2011. Available at: http://ggasoftware.com/opensource/ketcher. Accessed on May 2, 2011.

[35] J. M. Barnard, J Chem Inf Model 1993, 33, 532.

[36] P. Willett, Drug Discovery Today 2006, 11, 1046.

[37] A. Bender, H. Y. Mussa, R. C. Glen, S. Reiling, J Chem Inf Comput Sci

2004, 44, 1708.

[38] J. M. Barnard, G. M. Downs, J Chem Inf Model 1997, 37, 141.

Received: 5 Jan. 2011; Revised: 2 May 2011; Accepted: 28 May 2011; Published online on 5 July 2011
