MyMolDB: A micromolecular database solution with open source and free components
DOI: 10.1002/jcc.21874
MyMolDB: A Micromolecular Database Solution with Open Source and Free Components
Bing Xia,[a] Zheng-Fu Tai,[a] Yu-Cheng Gu,*[b] Bang-Jing Li,[a]
Li-Sheng Ding,[a] and Yan Zhou*[a]
Background: Managing chemical structures is an important daily
task in small laboratories. Few solutions are available on the
internet, and most of them are closed-source applications. The
open-source applications typically offer limited capability and
only basic cheminformatics functionality. In this article, we
describe a solution for managing chemicals in research groups
that is built on open-source and free components. It has a
user-friendly interface with functions for chemical handling and
intensive searching.
Results: MyMolDB is a micromolecular database solution that
supports exact, substructure, similarity, and combined
searching. It is implemented mainly in the scripting language
Python, with a web-based interface for compound management and
searching. Almost all searches are in essence done in pure SQL
on the database, exploiting the high performance of the database
engine. Impressive search speed has thus been achieved on large
data sets, because no external Central Processing Unit
(CPU)-intensive language is involved in the key steps of
searching.
Availability: MyMolDB is open-source software and can be
modified and/or redistributed under the GNU General Public
License version 3 published by the Free Software Foundation
(Free Software Foundation Inc. The GNU General Public
License, Version 3, 2007. Available at:
http://www.gnu.org/licenses/gpl.html). The software itself can
be found at http://code.google.com/p/mymoldb/. © 2011 Wiley
Periodicals, Inc. J Comput Chem 32: 2942–2947, 2011
Keywords: chemical management · micromolecular database solution · open source · Python · structure searching
Introduction
Over the centuries, millions of chemicals have been synthesized
in chemistry laboratories or isolated from natural sources.
Handling the information on these chemicals efficiently and
extracting the required knowledge from it is a huge challenge.
Traditional approaches, such as storing molecule information in
Microsoft Excel files, fail at the scale of millions of
compounds. With the rapid development of computer hardware and
the evolution of database technologies, computer-aided
strategies have made this possible and are becoming vital to the
success of chemical research in drug design.[1–4]
Cheminformatics,[5,6] which applies computer and information
techniques to a range of problems, focuses mainly on small
organic molecules rather than macromolecules such as polymers,
genes, proteins, and polysaccharides. A number of chemical
compound database solutions have been developed for the storage
and retrieval of these small molecules, but most of them are
commercial software. Examples are ChemAxon’s JChem package,[7]
which runs several Java programs as servlets on a server and
stores the data in relational SQL databases, and the solution
from CambridgeSoft, which serves as an extension to the Oracle
database.[8] There are also many open-source counterparts with
basic functionality. Haider[9] released a set of programs
written in Pascal and Perl that parse and format the relevant
molecule data from commonly used MDL SD files and store the
formatted results in MySQL[10] tables. Haider’s solution
typically performs a substructure search within a few seconds on
a standard PC (AMD Athlon 1.6 GHz CPU, 1.5 GB of memory).
Open-source chem- and bioinformatics libraries such as the
Chemistry Development Kit (CDK)[11] and Open Babel,[12,13] which
provide methods for common cheminformatics tasks, are fully
capable of supporting the construction of chemical compound
databases and other cheminformatics-related processing. Rijnbeek
and Steinbeck[14] developed OrChem, an extension for the Oracle
11G database that supports fast substructure and similarity
searching with cheminformatics functionality provided by CDK.
OrChem provides similarity searching with response times on the
order of seconds for databases with millions of compounds,
depending on the given similarity cutoff. For substructure
searching, it can use the multiple processor cores of today’s
powerful database servers to provide fast response times on
equally large data sets.
Database extensions based on Open Babel also exist. Jerome
Pansanel developed Mychem,[15] a cheminformatics extension for
MySQL based on Open Babel, which provides a set of functions for
handling chemical data within the MySQL database. All these
solutions have their particular merits. The commercial ones have
the advantage when it comes to handling very large data sets,
whereas the open-source ones are free in most cases and their
source code is open, so users familiar with the relevant
programming language can add their own functions at will.
[a] B. Xia, Z.-F. Tai, B.-J. Li, L.-S. Ding, Y. Zhou
Key Laboratory of Mountain Ecological Restoration and Bioresource
Utilization, Chengdu Institute of Biology, Chinese Academy of Sciences,
Chengdu 610041, People’s Republic of China
E-mail: [email protected]
[b] Y.-C. Gu
Syngenta, Jealott’s Hill International Research Centre, Bracknell, Berkshire
RG42 6EY, United Kingdom
E-mail: [email protected]
Grant sponsor: The National Natural Sciences Foundation of China;
Grant numbers: 30973634, 21072185, 81073014; Grant sponsor:
Syngenta Ltd. (the PhD studentship)
Journal of Computational Chemistry, © 2011 Wiley Periodicals, Inc.
SOFTWARE NEWS AND UPDATES
In most cases, it is sufficient for a small laboratory to have
a database solution that can handle molecule collections of a
few thousand to tens of thousands of compounds, with exact,
substructure, and similarity search capabilities. Although
several solutions such as OrChem and Mychem, mentioned earlier,
are capable and available, they are only extensions for database
engines: they provide core functionality only, and users have to
write the required back ends and interfaces themselves. They are
also not easy to modify because of the programming languages
used (Java for OrChem, and C and C++ for Mychem). Therefore, we
introduce MyMolDB, a micromolecular database solution based on
open-source and free components, including Python,[16] Open
Babel, and Web.py,[17] with a user-friendly web-based interface
for managers and end users. In this solution, Apache,[18]
Lighttpd,[19] or Nginx[20] can be used as the web server, MySQL
is used as the storage engine for chemical data, and Python
serves as the scripting and glue language for both chemical
information handling and web interface programming. Relying on
leading open-source web servers (Apache, Lighttpd, and Nginx)
and a leading relational database management system (MySQL)
supports the performance and stability of the system. Using the
simple and powerful scripting language Python as the main
programming language makes the system easy to extend and modify
to suit individual requirements and preferences. Figure 1 gives
an overview of MyMolDB’s architecture and a brief work flow. The
main searching options (exact, substructure, similarity, and
combined) are described in detail in the corresponding parts of
the Implementation section.
Implementation
MyMolDB was developed on top of many open-source or free
software packages. Python was chosen as the main programming
language because of its clear syntax and its extensive standard
and third-party libraries, which make it easy to learn and use
for nonprofessional programmers. Scientific users, who typically
lack specialized programming knowledge, can develop applications
for their research rapidly and efficiently by using Python with
third-party scientific libraries such as Pybel[21] (chem- and
bioinformatics related; generally provided by Linux
distributions with the package name python-openbabel),
Cinfony[22] (cheminformatics related), Biopython[23–25]
(bioinformatics related), and scientific computing libraries
such as SciPy.[26] In MyMolDB, cheminformatics functions such as
fingerprint calculation, SMILES[27] canonicalization, SMARTS[28]
parsing, and generation of 3D molecular coordinates are provided
by Pybel. Chemical data handling is implemented with the
open-source relational database engine MySQL, and
MySQL-python[29] is used to interact with the MySQL server from
Python. The web interfaces are built on the simple but powerful
Python-based web framework Web.py. We use OLN JSDraw[30] for
molecule editing and structure display on the web pages; an
Indigo[31]-based molecular structure layout and picture
generator is optional for structure display. Integration with
other online molecular structure input or display tools, such as
the JME Java applet,[32] FlaME,[33] and Ketcher,[34] remains to
be developed. In our own setup, the web interfaces and tools are
served by the lightweight and fast Lighttpd software (with the
fastcgi module enabled). Many tools are written in Python, such
as sdfparser.py, which parses molecules from MDL SD files and
generates SQL statements according to a file that defines the
rules for mapping MDL SD fields to SQL statements.
Preview
All the services supplied by the MyMolDB system are accessible
across the Internet through the web interfaces. The interfaces
include a normal mode (in which substructure, exact, and
similarity searching options are available), an advanced mode in
which combined searching is supported, molecule management, and
user management. Figure 2 shows a preview of MyMolDB in normal
mode under Google Chrome; more detailed screen shots are
available at http://code.google.com/p/mymoldb/wiki/Screenshots.
Figure 1. The architecture and brief work flow of MyMolDB.
Database schema
We use MySQL, a relational database engine, to store all our
data; the relational structure of the data allows complex
arrangements to be handled. The basic database schema is
organized around the Open Babel canonical SMILES strings of the
molecules. These strings are calculated, hashed with MD5, and
stored in the table mol_data for rapid indexing and enforcement
of uniqueness. The Open Babel binary fingerprints of the
molecules and their bit counts are precalculated. Each
fingerprint is divided into 32 segments, and the segments are
converted to decimal numbers and stored in the table mol_fp, on
which all substructure- and similarity-related queries are
based. The Tanimoto coefficient is used to measure the
similarity between molecules. Statistics of the molecules are
also precalculated and stored in the table mol_stat as molecular
statistic fingerprints, which help limit the search range when
querying for molecules with particular molecular statistics.
Table 1 presents the main database schema for substructure and
similarity searching.
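As a rough illustration of the segmentation described above, the
following sketch splits a fingerprint into 32 integer segments and
counts its set bits. It assumes a 1024-bit fingerprint, 32-bit
segments, and a particular bit ordering; none of these details is
stated in the text, and the helper names segment_fingerprint and
as_row are hypothetical rather than part of libmymoldb.

```python
N_SEGMENTS = 32      # number of fpNN columns in mol_fp (from Table 1)
SEGMENT_BITS = 32    # assumed width of one segment (1024-bit fingerprint)


def segment_fingerprint(on_bits):
    """Split a fingerprint, given as 0-based set-bit positions, into segments.

    Returns (segments, fp_bits): a list of 32 integers for fp01..fp32 and
    the total number of set bits. The bit ordering is an assumption.
    """
    segments = [0] * N_SEGMENTS
    for bit in set(on_bits):
        if not 0 <= bit < N_SEGMENTS * SEGMENT_BITS:
            raise ValueError("bit position out of range: %d" % bit)
        index, offset = divmod(bit, SEGMENT_BITS)
        segments[index] |= 1 << offset
    fp_bits = sum(bin(seg).count("1") for seg in segments)
    return segments, fp_bits


def as_row(mol_id, on_bits):
    """Build a dict shaped like one mol_fp row (column names from Table 1)."""
    segments, fp_bits = segment_fingerprint(on_bits)
    row = {"mol_id": mol_id, "fp_bits": fp_bits}
    for i, value in enumerate(segments, start=1):
        row["fp%02d" % i] = value
    return row
```

Storing the segments as plain integers is what allows the later
screening and similarity steps to be expressed with ordinary SQL
bitwise operators.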
Main work flow of MyMolDB
The entry point of MyMolDB, main.py, is launched by Lighttpd’s
fastcgi module. During initialization, a map between external
Uniform Resource Locators (URLs) and Python classes with
different functions is established. The classes are instantiated
by Web.py and called when a search request arrives at the
corresponding URL. An instance first parses the request to
obtain the queried structure(s) and then calls the related
methods to generate SQL statements according to the search type
of the query. The generated SQL statements are finally executed
by MySQL through the MySQL-python API to obtain the results.
This procedure is briefly illustrated in Figure 3.
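The request flow described above can be sketched without any web
framework. The URL paths, class names, and the placeholder SQL text
below are hypothetical; in MyMolDB the mapping is handled by Web.py
and the SQL is generated by libmymoldb, neither of which is
reproduced here. Only the search codes 1, 2, and 3 are taken from the
text.

```python
class ExactSearch(object):
    """Hypothetical handler: turns a parsed request into an SQL string."""
    search_type = 1  # code for exact searching, as used in the text

    def handle(self, params):
        smiles = params["mol"]  # the queried structure taken from the request
        # Placeholder for SQL generation; the real statements differ.
        return "-- type %d search for %s" % (self.search_type, smiles)


class SubstructureSearch(ExactSearch):
    search_type = 2  # code for substructure searching


class SimilaritySearch(ExactSearch):
    search_type = 3  # code for similarity searching


# Map between external URLs and handler classes, built once at start-up.
URL_MAP = {
    "/search/exact": ExactSearch,
    "/search/sub": SubstructureSearch,
    "/search/sim": SimilaritySearch,
}


def dispatch(url, params):
    """Instantiate the class mapped to `url` and let it process the request."""
    try:
        handler_class = URL_MAP[url]
    except KeyError:
        raise LookupError("no handler registered for %s" % url)
    return handler_class().handle(params)
```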
Exact searching
Exact searching is relatively simple in our system. The Open
Babel canonical SMILES of every molecule in the database is
precalculated and hashed with the MD5 algorithm. The hashes are
stored in the column md5_openbabel_can_smiles of the SQL table
mol_data, and an SQL index on md5_openbabel_can_smiles is
created for fast exact searching. At search time, the MD5 hash
of the Open Babel canonical SMILES of the target molecule is
calculated and used by MyMolDB to generate the SQL statement.
Figure 2. Web interface for normal searching.
The SQL statement is generated by an instance of the sql class
of libmymoldb. The following code shows an exact query for
benzene:
    from settings import *
    import libmymoldb
    sql_obj = libmymoldb.sql(env_db)
    sql_string = sql_obj.gen_search_sql('c1ccccc1', 1)
where env_db, imported from settings, is a Python dict holding
the required database information, such as the field names of
fp_bits and openbabel_can_smiles and the primary field. The
string c1ccccc1 is the Open Babel canonical SMILES for benzene,
and the 1 in sql_obj.gen_search_sql('c1ccccc1', 1) is the code
for exact searching. The generated SQL statement will resemble:
    SELECT … FROM table WHERE mol_data.md5_openbabel_can_smiles = '975f4ba64117f4dc65044299552ee698';
where 975f4ba64117f4dc65044299552ee698 is the MD5 hash of
c1ccccc1. The SQL statement is then executed by an instance of
the libmymoldb.database class to obtain the results.
Using a hashed string introduces the possibility of collisions,
in which two different strings hash to the same value. We
therefore perform a further exact text match on the unhashed
canonical SMILES when the query returns a nonempty result, which
eliminates false-positive hits.
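The hash lookup followed by a text check can be tried out with the
Python standard library alone. The sketch below uses an in-memory
SQLite table as a stand-in for the MySQL table mol_data and assumes
that the SMILES strings passed in are already canonical. The function
names are hypothetical, and this is not the libmymoldb
implementation.

```python
import hashlib
import sqlite3


def md5_hex(smiles):
    """MD5 hex digest of a (canonical) SMILES string."""
    return hashlib.md5(smiles.encode("utf-8")).hexdigest()


def build_demo_db(smiles_list):
    """Create an in-memory stand-in for mol_data with an index on the hash."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mol_data ("
        "mol_id INTEGER PRIMARY KEY, "
        "openbabel_can_smiles TEXT, "
        "md5_openbabel_can_smiles TEXT)"
    )
    conn.execute(
        "CREATE INDEX idx_md5 ON mol_data (md5_openbabel_can_smiles)"
    )
    conn.executemany(
        "INSERT INTO mol_data (openbabel_can_smiles, md5_openbabel_can_smiles)"
        " VALUES (?, ?)",
        [(s, md5_hex(s)) for s in smiles_list],
    )
    return conn


def exact_search(conn, query_smiles):
    """Look up by hash, then confirm on the unhashed SMILES.

    The second comparison discards any row that matched only because of
    an MD5 collision.
    """
    rows = conn.execute(
        "SELECT mol_id, openbabel_can_smiles FROM mol_data"
        " WHERE md5_openbabel_can_smiles = ?",
        (md5_hex(query_smiles),),
    ).fetchall()
    return [mol_id for mol_id, smiles in rows if smiles == query_smiles]
```

The indexed column has a fixed length regardless of molecule size,
which is the practical reason for indexing the hash rather than the
SMILES string itself.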
Substructure searching
Substructure searching can be described in graph-theoretical
terms as checking for the existence of a subgraph isomorphism
between the query graph and a series of topological graphs.[35]
Fingerprints are used to quickly select candidate molecules that
are likely to contain the query molecule as a substructure. A
further filter is needed to inspect the candidates, because the
fingerprint method alone sometimes returns false-positive
results owing to the degeneracy of bits in fingerprints. Our
further screening of candidates is based on SMARTS matching. The
substructure searching process is therefore separated into two
steps: fingerprint screening and SMARTS matching. The
fingerprint method is fast, consumes little CPU resource, and is
suited to large data sets, whereas the SMARTS matching algorithm
is accurate but relatively slow.
The fingerprint method is a bitwise AND comparison between the
fingerprint of the query structure and those of the candidates.
Because the bitwise AND operation is supported by MySQL, a
single SQL clause is enough to screen the candidates. The
fingerprints of the candidate molecules are precalculated and
stored in the database, so we only need to calculate the
fingerprint of the query molecule using the Pybel API. The
molecular statistics of the query are also calculated and added
to the SQL statement to limit the search range. All of this is
done by an instance of the libmymoldb.sql class with code
similar to that shown for exact searching; the only difference
is that the search code is switched to 2 (the code for
substructure searching). The SQL statement generated for a
substructure search for "c1ccccc1" (benzene) would resemble:
    SELECT …
    FROM mol_stat INNER JOIN mol_fp ON (mol_stat.mol_id = mol_fp.mol_id)
    WHERE ((n_rings >= 1 AND n_atoms >= 12 AND n_C >= 6 AND n_r6 >= 1
    AND n_bonds >= 12 AND n_X >= 0 AND fp_bits >= 6 AND n_Car >= 6)
    AND (fp03 & 131072 = 131072 AND fp11 & 134217728 = 134217728
    AND fp21 & 32768 = 32768 AND fp23 & 2112 = 2112
    AND fp29 & 512 = 512));
The SQL statement is then executed in MySQL to obtain the
preliminary results. The large number of bitwise AND comparisons
is carried out entirely in MySQL, without any other
computationally expensive language, which improves the
efficiency and speed of searching.
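To make the screening condition concrete, the sketch below builds the
fingerprint part of such a WHERE clause from the nonzero segments of
a query fingerprint and applies the same test in Python. The function
names are hypothetical and libmymoldb is not reproduced; the segment
values are those of the benzene SQL example in the text.

```python
def fingerprint_clause(query_segments):
    """Build 'fpNN & v = v' conditions for every nonzero query segment.

    query_segments maps a 1-based segment number to its integer value.
    """
    parts = []
    for number in sorted(query_segments):
        value = int(query_segments[number])  # int() keeps the SQL text numeric
        if value:
            parts.append("fp%02d & %d = %d" % (number, value, value))
    return " AND ".join(parts)


def passes_screen(candidate_segments, query_segments):
    """Python equivalent: every bit set in the query is set in the candidate."""
    return all(
        (candidate_segments.get(number, 0) & value) == value
        for number, value in query_segments.items()
    )


# Segment values taken from the benzene SQL example in the text.
BENZENE = {3: 131072, 11: 134217728, 21: 32768, 23: 2112, 29: 512}
```

A candidate passes only if it has every bit of the query set, so the
screen can produce false positives but, for a fingerprint of this
kind, should not discard a true substructure match.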
After the first step of rough screening, SMARTS matching is
used to identify the target molecules accurately. We defined a
Python class named mymol in libmymoldb, whose sub_match method
does the SMARTS matching. Assume that the rough results are
stored in a Python list named rough_results, whose elements are
Python dicts holding the information on the candidate molecules
from the fingerprint screening step, with the SMILES stored
under the key openbabel_can_smiles, and that the structure of
the query molecule is assigned, in SMILES format, to the
variable query_smiles. The Python code for SMARTS matching
should then resemble the following:
    import libmymoldb
    mol_obj = libmymoldb.mymol('smi', query_smiles)
    results = [candidate for candidate in rough_results
               if mol_obj.sub_match('smi', candidate['openbabel_can_smiles'])]
where the string smi is the format code for SMILES in Pybel.
With only a few lines of Python code like those above, this
filtering step is fast enough, owing to the high performance of
the C++-backed Pybel API and to the acceptable number of rough
results from the fingerprint screening step.
Table 1. Database tables of fingerprints for substructure and similarity
searching.
TABLE mol_fp
  mol_id                                             INT(11)
  fp_bits (bit count of the fingerprints)            SMALLINT(5)
  fp01                                               BIGINT(20)
  …
  fp32                                               BIGINT(20)
TABLE mol_stat
  mol_id                                             INT(11)
  n_atoms (number of atoms)                          SMALLINT(5)
  n_bonds (number of bonds)                          SMALLINT(5)
  n_rings (number of rings)                          SMALLINT(5)
  n_r3 (number of 3-membered rings)                  SMALLINT(5)
  …                                                  …
  n_r8 (number of 8-membered rings)                  SMALLINT(5)
  n_rx (number of rings with members more than 8)    SMALLINT(5)
  n_C (number of carbon atoms)                       SMALLINT(5)
  n_C2 (number of sp2 carbon atoms)                  SMALLINT(5)
  n_C3 (number of sp3 carbon atoms)                  SMALLINT(5)
  n_Car (number of aromatic carbon atoms)            SMALLINT(5)
  …                                                  …
Figure 3. Main work flow of searching of MyMolDB.
Similarity searching
Although there are many measures for calculating structural
similarity, such as 2D fingerprint methods,[36] layered atom
environment fingerprints,[37] and structural dictionary-based
methods,[38] traditional binary-encoded fingerprint-based
similarity algorithms are still the most widely used because of
their simplicity and robustness. In our system, the Tanimoto
coefficient is used to evaluate the similarity between
molecules. The Tanimoto coefficient between molecules a and b
is given in eq. (1):
$$\text{Tanimoto coefficient} = \frac{C}{A + B - C} \tag{1}$$
in which A is the number of bits set to 1 in the fingerprint of
molecule a, B is that of molecule b, and C is the number of bits
set to 1 in the fingerprints of both a and b.
Our calculation of the Tanimoto coefficient mainly uses the
standard functions and arithmetic operations supported by MySQL.
All the computationally expensive operations are performed in
pure SQL on the table mol_fp; better speed and efficiency are
obtained because no external language is used. The SQL statement
for similarity searching is also generated by an instance of the
libmymoldb.sql class, using code similar to that shown for exact
searching with the search code changed to 3 (the code for
similarity searching). For example, the SQL statement for
finding molecules whose Tanimoto coefficient with "c1ccccc1"
(benzene) exceeds 0.9 should resemble the following:
    SELECT …
    FROM mol_fp
    WHERE ((BIT_COUNT(fp03 & 131072) + BIT_COUNT(fp11 & 134217728)
    + BIT_COUNT(fp21 & 32768) + BIT_COUNT(fp23 & 2112)
    + BIT_COUNT(fp29 & 512))
    / (6 + fp_bits - (BIT_COUNT(fp03 & 131072)
    + BIT_COUNT(fp11 & 134217728) + BIT_COUNT(fp21 & 32768)
    + BIT_COUNT(fp23 & 2112) + BIT_COUNT(fp29 & 512))) > 0.9);
in which 6 is the number of bits set to 1 in the fingerprint of
the query structure (benzene here), and 0.9 is the minimum
similarity between benzene and the candidate molecules.
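The arithmetic in this statement can be mirrored in a few lines of
Python, which may help when checking a generated query by hand. This
is a minimal sketch, not MyMolDB code: the function names are
hypothetical, and bit_count only imitates what MySQL's BIT_COUNT()
does for non-negative integers.

```python
def bit_count(value):
    """Number of bits set to 1, like MySQL's BIT_COUNT() for value >= 0."""
    return bin(value).count("1")


def tanimoto(query_segments, candidate_segments, candidate_fp_bits):
    """Tanimoto coefficient C / (A + B - C) from segmented fingerprints.

    query_segments and candidate_segments map segment numbers to integers;
    candidate_fp_bits is the stored bit count of the candidate (B).
    """
    a = sum(bit_count(v) for v in query_segments.values())       # A
    c = sum(                                                     # C
        bit_count(candidate_segments.get(n, 0) & v)
        for n, v in query_segments.items()
    )
    union = a + candidate_fp_bits - c
    return float(c) / union if union else 0.0


# Segment values taken from the benzene SQL example in the text (A = 6).
BENZENE = {3: 131072, 11: 134217728, 21: 32768, 23: 2112, 29: 512}
```

Only the nonzero segments of the query need to be inspected, because
B is already stored in fp_bits; this is why the SQL statement touches
just five of the 32 fingerprint columns.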
Combined searching
MyMolDB supports so-called combined searching, which makes it
possible to search in a flexible and powerful way. Complex
searches can be done easily without writing long and
sophisticated SQL statements or editing the back-end code of the
system. For this purpose, a simple search language was designed
to express the query. A statement in this language is in essence
a logical query formula in which the key words supported by the
system are abbreviations of the names of MySQL columns (the
abbreviations are configurable in the settings file). Operators
include AND, OR, NOT, ~ (match; followed by a string in which
the wildcard % is supported), SUB (substructure), MAX (the
maximum number of results), and so on. A molecule buffer with
several cells named m1 to mn (1 < n < 10) is provided on the web
page; molecules can be drawn in those cells and referred to by
name in the query formula. The MyMolDB system interprets the
formula as simple search types (exact searching and substructure
searching) and then generates the SQL statement using the
related methods; the SQL statement is then executed. Suppose the
molecule buffer holds two molecules in the cells named m1 and
m2. The query for molecules whose molecular weight is less than
400 and that contain m1 but not m2 as a substructure, with no
more than 100 results returned, is described by the formula
below:
    MW < 400 AND (SUB = m1 AND SUB != m2) MAX = 100
which is also valid in the following form:
    MW < 400 AND (SUB = m1 AND (NOT SUB = m2)) MAX = 100
in which MW is the abbreviation of the database column for
molecular weight.
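How such a formula might be taken apart is sketched below. This is
not the parser used by MyMolDB, whose grammar is not given in the
text. It assumes that MAX appears only as a trailing "MAX = n", that
SUB is always followed by = or != and a cell name, and that MW maps
to a column called mol_weight; that column name and the function
names are hypothetical.

```python
import re

# Hypothetical abbreviation table; the real one lives in the settings file.
ABBREVIATIONS = {"MW": "mol_weight"}

TOKEN = re.compile(
    r"\s*(\(|\)|!=|<=|>=|=|<|>|~|'[^']*'|[A-Za-z_]\w*|\d+(?:\.\d+)?)"
)


def tokenize(formula):
    """Split a query formula into tokens, rejecting anything unrecognized."""
    formula = formula.rstrip()
    tokens, pos = [], 0
    while pos < len(formula):
        match = TOKEN.match(formula, pos)
        if match is None:
            raise ValueError("unexpected text at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def analyse(formula, abbreviations=ABBREVIATIONS):
    """Return (condition_tokens, referenced_cells, max_results)."""
    tokens = tokenize(formula)
    max_results = None
    # A trailing 'MAX = n' limits the number of results.
    if len(tokens) >= 3 and tokens[-3] == "MAX" and tokens[-2] == "=":
        max_results = int(tokens[-1])
        tokens = tokens[:-3]
    condition, cells, i = [], [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "SUB":
            if i + 2 >= len(tokens) or tokens[i + 1] not in ("=", "!="):
                raise ValueError("SUB needs = or != and a cell name")
            if tokens[i + 1] == "!=":
                condition.append("NOT")  # 'SUB != m' becomes 'NOT SUB(m)'
            condition.append("SUB(%s)" % tokens[i + 2])
            cells.append(tokens[i + 2])
            i += 3
        else:
            condition.append(abbreviations.get(tok, tok))
            i += 1
    return condition, cells, max_results
```

Under these assumptions the two formulas given in the text reduce to
the same sequence of conditions apart from the extra pair of
parentheses, and the SUB(...) placeholders mark where a substructure
search on the named cell would be substituted.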
Performance
The performance of MyMolDB was tested on a PubChem data sample
of 1.0 million molecules, a data set large enough for daily use
in small laboratories. All tests were run on an Arch Linux
server with a single quad-core Q9550 2.83 GHz CPU and 4 GB of
RAM, with kernel version 2.6.34.1, Python 2.6.5, Web.py 0.34,
Pybel 2.2.3, MySQL 5.1.47, and Lighttpd 1.4.26.
Substructure searching
As described earlier, the substructure searching process is
separated into two steps: a fingerprint-based rough screening
followed by a SMARTS-based accurate match. The performance of
substructure searching therefore includes the performance of
both steps. The performance of substructure searching for
"c1ccccc1" was measured with different limits on the number of
results, and the results are shown in Figure 4.
Similarity searching
The performance of similarity searching is more representative
than that of the other types of searching, because
the Tanimoto method of similarity searching checks almost all
the compounds in the data set in a single pass. In our test, 100
randomly chosen compounds were each used to query for all their
similar compounds, and the test was repeated for different
minimum Tanimoto coefficient values. Table 2 shows the query
throughput time and average result size for searching the
PubChem data set at each minimum Tanimoto similarity.
Conclusion
We have developed MyMolDB, a micromolecular database solution
for small laboratories built on open-source and free components.
It supports substructure, exact, similarity, and combined
searching with acceptable performance even on a large data set
of 1.0 million chemicals. The web-based search pages provide a
user-friendly human-computer interface for compound searching
and user management. Because the system is written in Python, it
is readily extensible. The libraries provided in the package can
be used to further extend its cheminformatics-related
functionality. Furthermore, we plan to develop a plug-in system,
which could make it easier for users to build their own
cheminformatics-related data processing applications. The
project is hosted on Google Code at
http://code.google.com/p/mymoldb. Developers who are interested
in further extending MyMolDB are welcome to participate through
Google Code. Bug reports and suggestions can be submitted on the
home page of the project.
License
MyMolDB is released under the GNU General Public License
version 3 as published by the Free Software Foundation.
Acknowledgments
The authors thank the PubChem Project for assembling its great
collection of molecular structures and making it available for
noncommercial use, which enabled us to test the performance of
our system. They also thank the Open Babel community for their
great work on the Open Babel chemical toolbox and its Python
binding, which provide the cheminformatics-related functionality
of our solution. They thank Scilligence for providing their
powerful JavaScript chemical structure editor/viewer OLN JSDraw
free of charge for noncommercial use, with which structures can
be edited and displayed easily on the web. They thank Mr. John
Delaney for scientific input and for proofreading the
manuscript.
[1] A. V. Veselovsky, A. S. Ivanov, Curr Drug Targ Infect Dis 2003, 3, 33.
[2] F. Ooms, Curr Med Chem 2000, 7, 141.
[3] O. F. Guner, Curr Top Med Chem 2002, 2, 1321.
[4] G. Schneider, U. Fechner, Nat Rev Drug Discov 2005, 4, 649.
[5] Cheminformatics. Wikipedia, 2011. Available at: http://en.wikipedia.org/wiki/Cheminformatics. Accessed on May 2, 2011.
[6] F. K. Brown, Annu Rep Med Chem 1998, 33, 375.
[7] ChemAxon Ltd., Introduction to the JChem suite, 2011. Available at: http://www.chemaxon.com/jchem/intro/index.html. Accessed on May 2, 2011.
[8] CambridgeSoft Corporation, CambridgeSoft Enterprise Solutions - Oracle Cartridge, 2007. Available at: http://www.cambridgesoft.com/solutions/details/?fid=186. Accessed on May 2, 2011.
[9] N. Haider, Molecules (Basel, Switzerland) 2010, 15, 5079.
[10] Oracle Corporation and/or its affiliates, MySQL: The World’s Most Popular Open Source Database, Version 5.5.10, 2011. Available at: http://www.mysql.com/. Accessed on May 2, 2011.
[11] C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann, E. Willighagen, J Chem Inf Comput Sci 2003, 43, 493.
[12] R. Guha, M. T. Howard, G. R. Hutchison, P. Murray-Rust, H. Rzepa, C. Steinbeck, J. Wegner, E. L. Willighagen, J Chem Inf Model 2006, 46, 991.
[13] The Open Babel Package, Version 2.2.3, 2011. Available at: http://openbabel.sourceforge.net/. Accessed on May 2, 2011.
[14] M. Rijnbeek, C. Steinbeck, J Cheminf 2009, 1, 1.
[15] J. Pansanel, Mychem - A Chemical Extension for MySQL, Version 0.8.0, 2010. Available at: http://mychem.sourceforge.net/. Accessed on May 2, 2011.
[16] Python Software Foundation. Python Programming Language - Official Website, Version 2.7.1, 2011. Available at: http://www.python.org/. Accessed on May 2, 2011.
[17] A. Swartz, Welcome to web.py! (web.py), Version 0.34, 2010. Available at: http://webpy.org/. Accessed on May 2, 2011.
[18] The Apache Software Foundation, Apache HTTP Server, Version 2.2.17, 2010. Available at: http://projects.apache.org/projects/http_server.html. Accessed on May 2, 2011.
[19] Lighttpd fly light, Version 1.4.28, 2010. Available at: http://www.lighttpd.net/. Accessed on May 2, 2011.
[20] Nginx news, Version 0.8.54, 2010. Available at: http://nginx.org. Accessed on May 2, 2011.
[21] N. M. O'Boyle, C. Morley, G. R. Hutchison, Chem Cent J 2008, 2, 1.
[22] N. M. O'Boyle, G. R. Hutchison, Chem Cent J 2008, 2, 24.
Figure 4. Performance of substructure search for "c1ccccc1" with different
results limitations on 1.0 million compounds.
Table 2. Similarity search throughput time for 100 randomly selected
compounds with different minimal Tanimoto similarity in a sample
database with 1.0 million structures.
Minimal Tanimoto    Average compounds    Average throughput
similarity          found                time (seconds)
0.5                 846                  7.4
0.65                135                  5.3
0.8                 34                   3.5
0.95                5                    2.4
[23] B. Chapman, J. Chang, ACM SIGBIO Newsletter 2000, 20, 15.
[24] P. J. Cock, T. Antao, J. T. Chang, B. Chapman, C. J. Cox, A. Dalke, I. Friedberg, T. Hamelryck, F. Kauff, B. Wilczynski, M. J. L. de Hoon, Bioinformatics (Oxford, England) 2009, 25, 1422.
[25] M. J. L. D. Hoon, B. Chapman, I. Friedberg, North 2003, 299, 298.
[26] E. Jones, T. Oliphant, P. Peterson, SciPy: Open Source Scientific Tools for Python, 2010. Available at: http://www.scipy.org/. Accessed on May 2, 2011.
[27] D. Weininger, J Chem Inf Model 1988, 28, 31.
[28] Daylight Chemical Information Systems Inc., SMARTS - A Language for Describing Molecular Patterns, 2008. Available at: http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html. Accessed on May 2, 2011.
[29] A. Dustman, MySQL for Python, Version 1.2.3, Available at: http://sourceforge.net/projects/mysql-python/. Accessed on May 2, 2011.
[30] Scilligence, OLN™ JSDRAW™ - A Javascript Chemical Structure Editor/Viewer, Version 1.2.2, 2011. Available at: http://www.scilligence.com/web/jsdrawapis.aspx. Accessed on May 2, 2011.
[31] GGA Software Services LLC, Indigo - GGA Software Services, Version 1.0 Beta3. Available at: http://ggasoftware.com/opensource/indigo. Accessed on May 2, 2011.
[32] P. Ertl, JME Molecular Editor, 2011. Available at: http://www.molinspiration.com/jme/index.html. Accessed on May 2, 2011.
[33] P. Dallakian, N. Haider, J Cheminf 2011, 3, 6.
[34] GGA Software Services LLC, Ketcher - GGA Software Services, Version 1.0 Beta3, 2011. Available at: http://ggasoftware.com/opensource/ketcher. Accessed on May 2, 2011.
[35] J. M. Barnard, J Chem Inf Model 1993, 33, 532.
[36] P. Willett, Drug Discovery Today 2006, 11, 1046.
[37] A. Bender, H. Y. Mussa, R. C. Glen, S. Reiling, J Chem Inf Comput Sci 2004, 44, 1708.
[38] J. M. Barnard, G. M. Downs, J Chem Inf Model 1997, 37, 141.
Received: 5 Jan. 2011; Revised: 2 May 2011; Accepted: 28 May 2011; Published online on 5 July 2011