A lazy learning-based QSAR classification study for...

26
This article was downloaded by: [Gyeongsang National Uni] On: 21 May 2015, At: 22:33 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Click for updates SAR and QSAR in Environmental Research Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/gsar20 A lazy learning-based QSAR classification study for screening potential histone deacetylase 8 (HDAC8) inhibitors G.P. Cao a , M. Arooj b , S. Thangapandian a , C. Park a , V. Arulalapperumal a , Y. Kim c , Y.J. Kwon d , H.H. Kim e , J.K. Suh f & K.W. Lee a a Department of Biochemistry, Division of Applied Life Science (BK21 Plus Program), Systems and Synthetic Agrobiotech Centre (SSAC), Plant Molecular Biology and Biotechnology Research Centre (PMBBRC), Research Institute of Natural Science (RINS), Gyeongsang National University, Jinju, Republic of Korea b CHIRI Biosciences Research Precinct, School of Biomedical Sciences, Curtin University, Perth, Australia c Department of Science Education, Kyungnam University, Masan, Republic of Korea d Department of Chemical Engineering, Kangwon National University, Chunchon, Republic of Korea e Division of Quality of Life, Korea Research Institute of Standards and Science, Daejeon, Republic of Korea f Bio Computing Major, Korean German Institute of Technology, Seoul, Republic of Korea Published online: 19 May 2015. To cite this article: G.P. Cao, M. Arooj, S. Thangapandian, C. Park, V. Arulalapperumal, Y. Kim, Y.J. Kwon, H.H. Kim, J.K. Suh & K.W. Lee (2015) A lazy learning-based QSAR classification study for screening potential histone deacetylase 8 (HDAC8) inhibitors, SAR and QSAR in Environmental Research, 26:5, 397-420, DOI: 10.1080/1062936X.2015.1040453 To link to this article: http://dx.doi.org/10.1080/1062936X.2015.1040453 PLEASE SCROLL DOWN FOR ARTICLE

Transcript of A lazy learning-based QSAR classification study for...

Page 1: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

This article was downloaded by: [Gyeongsang National Uni]On: 21 May 2015, At: 22:33Publisher: Taylor & FrancisInforma Ltd Registered in England and Wales Registered Number: 1072954 Registeredoffice: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Click for updates

SAR and QSAR in EnvironmentalResearchPublication details, including instructions for authors andsubscription information:http://www.tandfonline.com/loi/gsar20

A lazy learning-based QSARclassification study for screeningpotential histone deacetylase 8(HDAC8) inhibitorsG.P. Caoa, M. Aroojb, S. Thangapandiana, C. Parka, V.Arulalapperumala, Y. Kimc, Y.J. Kwond, H.H. Kime, J.K. Suhf & K.W.Leea

a Department of Biochemistry, Division of Applied Life Science(BK21 Plus Program), Systems and Synthetic Agrobiotech Centre(SSAC), Plant Molecular Biology and Biotechnology ResearchCentre (PMBBRC), Research Institute of Natural Science (RINS),Gyeongsang National University, Jinju, Republic of Koreab CHIRI Biosciences Research Precinct, School of BiomedicalSciences, Curtin University, Perth, Australiac Department of Science Education, Kyungnam University, Masan,Republic of Koread Department of Chemical Engineering, Kangwon NationalUniversity, Chunchon, Republic of Koreae Division of Quality of Life, Korea Research Institute of Standardsand Science, Daejeon, Republic of Koreaf Bio Computing Major, Korean German Institute of Technology,Seoul, Republic of KoreaPublished online: 19 May 2015.

To cite this article: G.P. Cao, M. Arooj, S. Thangapandian, C. Park, V. Arulalapperumal, Y. Kim,Y.J. Kwon, H.H. Kim, J.K. Suh & K.W. Lee (2015) A lazy learning-based QSAR classification studyfor screening potential histone deacetylase 8 (HDAC8) inhibitors, SAR and QSAR in EnvironmentalResearch, 26:5, 397-420, DOI: 10.1080/1062936X.2015.1040453

To link to this article: http://dx.doi.org/10.1080/1062936X.2015.1040453

PLEASE SCROLL DOWN FOR ARTICLE

Page 2: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

Taylor & Francis makes every effort to ensure the accuracy of all the information (the“Content”) contained in the publications on our platform. However, Taylor & Francis,our agents, and our licensors make no representations or warranties whatsoever as tothe accuracy, completeness, or suitability for any purpose of the Content. Any opinionsand views expressed in this publication are the opinions and views of the authors,and are not the views of or endorsed by Taylor & Francis. The accuracy of the Contentshould not be relied upon and should be independently verified with primary sourcesof information. Taylor and Francis shall not be liable for any losses, actions, claims,proceedings, demands, costs, expenses, damages, and other liabilities whatsoever orhowsoever caused arising directly or indirectly in connection with, in relation to or arisingout of the use of the Content.

This article may be used for research, teaching, and private study purposes. Anysubstantial or systematic reproduction, redistribution, reselling, loan, sub-licensing,systematic supply, or distribution in any form to anyone is expressly forbidden. Terms &Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 3: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

A lazy learning-based QSAR classification study for screening potentialhistone deacetylase 8 (HDAC8) inhibitors

G.P. Caoa, M. Aroojb, S. Thangapandiana, C. Parka, V. Arulalapperumala, Y. Kimc,Y.J. Kwond, H.H. Kime, J.K. Suhf and K.W. Leea*

aDepartment of Biochemistry, Division of Applied Life Science (BK21 Plus Program), Systems andSynthetic Agrobiotech Centre (SSAC), Plant Molecular Biology and Biotechnology Research Centre(PMBBRC), Research Institute of Natural Science (RINS), Gyeongsang National University, Jinju,Republic of Korea; bCHIRI Biosciences Research Precinct, School of Biomedical Sciences, CurtinUniversity, Perth, Australia; cDepartment of Science Education, Kyungnam University, Masan, Republicof Korea; dDepartment of Chemical Engineering, Kangwon National University, Chunchon, Republic ofKorea; eDivision of Quality of Life, Korea Research Institute of Standards and Science, Daejeon,Republic of Korea; fBio Computing Major, Korean German Institute of Technology, Seoul, Republic ofKorea

(Received 24 October 2014; in final form 9 April 2015)

Histone deacetylases 8 (HDAC8) is an enzyme repressing the transcription of various genesincluding tumour suppressor gene and has already become a target of human cancer treat-ment. In an effort to facilitate the discovery of HDAC8 inhibitors, two quantitative struc-ture–activity relationship (QSAR) classification models were developed using K nearestneighbours (KNN) and neighbourhood classifier (NEC). Molecular descriptors were calcu-lated for the data set and database compounds using ADRIANA.Code of MolecularNetworks. Principal components analysis (PCA) was used to select the descriptors. Thedeveloped models were validated by leave-one-out cross validation (LOO CV). The perfor-mances of the developed models were evaluated with an external test set. Highly predictivemodels were used for database virtual screening. Furthermore, hit compounds were subse-quently subject to molecular docking. Five hits were obtained based on consensus scoringfunction and binding affinity as potential HDAC8 inhibitors. Finally, HDAC8 structures incomplex with five hits were also subjected to 5 ns molecular dynamics (MD) simulationsto evaluate the complex structure stability. To the best of our knowledge, the NEC classi-fication model used in this study is the first application of NEC to virtual screening fordrug discovery.

Keywords: histone deacetylases 8; K nearest neighbours; neighbourhood classifier;molecular docking; MD simulation; virtual screening

1. Introduction

Histone deacetylases (HDACs) play critical roles in the regulation of cellular metabolism andconstitute promising drug targets for treatment of a broad range of human diseases [1] suchas cardiomyopathy, osteodystrophy, neurodegenerative disorder, metabolic disorders, cardio-vascular disease, aging and cancer. The 18 HDAC enzymes can be identified and classifiedinto four different classes based on sequence homology, function, DNA similarity and

*Corresponding author. Email: [email protected]

© 2015 Taylor & Francis

SAR and QSAR in Environmental Research, 2015Vol. 26, No. 5, 397–420, http://dx.doi.org/10.1080/1062936X.2015.1040453

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 4: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

phylogenetic analysis [2–5]. Class I, Class II and Class IV HDACs are zinc (Zn2+) dependentdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide adeninedinucleotide (NAD+) for their deacetylation activity [6,7].

The HDAC8 enzyme belongs to the class I group of enzymes which are found primarilyin the nucleus [8]. Except for HDAC8, functional HDACs are found as multimeric complexesof higher molecular weight; in addition, most of the purified HDAC enzymes are functionallyinactive [9,10]. Along with this advantage, expression of HDAC8 notably correlates with thedisease stage of neuroblastoma, a highly malignant childhood cancer derived from the sympa-thetic nervous system [11,12]. Moreover, a RNA interference study showed that HDAC8 isinvolved in the regulation of proliferation, clonogenic growth and neuronal differentiation ofneuroblastoma cells. Inv1, an abnormal fusion protein formed during acute myeloid leukaemiabinding HDAC8, is also associated with aberrant, constitutive genetic repression [13]. Thisevidence proves HDAC8 as a potential target for cancer treatment. Development of newHDAC8 inhibitors remains ongoing. Only a few HDAC8 inhibitors have been approved bythe US Food and Drug Administration (FDA) and are serving as anti-cancer therapeutics[14]. Thus, the discovery of new and potential HDAC8 inhibitors attracts much attention.

Drug design is a complicated, costly and time-consuming work. In silico virtual screeningprovides an economical and rapid way to identify novel leads in chemical databases. At pre-sent various virtual screening methods have been well-established such as quantitative struc-ture–activity relationship (QSAR) model based and molecular docking based virtualscreening. Obviously, different virtual screening methods have their advantages and disadvan-tages. Each of these virtual screening methods might not perform optimally when used alonein terms of the speed and effectiveness of virtual screening; a combination of these methodsis an alternative approach to improve performance of virtual screening.

In order to discover potential leads based on the known inhibitors, it is necessary to studythe QSAR of these inhibitors. In the past decade, machine learning has become an attractiveapproach to develop QSAR models for virtual screening. Among them, eager learning suchas artificial neural network (ANN) and support vector machine (SVM) is widely discussedand applied [15–28]. But another efficient and effective learning technique based on localinformation, lazy learning, has not been carefully reviewed and considered [29,30]. Lazylearning can provide faster computing speed and its target function is approximated locallycompared with eager learning.

In this study, two lazy learning methods including K nearest neighbours (KNN) andneighbourhood classifier (NEC) were utilized to develop QSAR classification models for vir-tual screening of HDAC8 inhibitors. For an application to the classification task, compoundsare typically represented by n-dimensional molecular descriptors. Thus, a feature selectionmethod is necessary during modelling, which can bring many potential benefits: facilitatingdata visualization and data understanding; reducing the measurement and storage require-ments; reducing training and utilization times; and defying the curse of dimensionality toimprove prediction performance [29]. Because of many benefits, principal component analysis(PCA) was utilized for selecting relevant descriptors.

This study focused on developing QSAR classification models using KNN and NECbased on known HDAC8 inhibitors, which can correctly discriminate the existing HDAC8inhibitors and non-inhibitors. Then these models, in combination with molecular docking, willbe queries for searching the Maybridge database to identify novel HDAC8 inhibitors.HDAC8–inhibitor complex stability was checked using molecular dynamics (MD) simulation.In addition, the predictive abilities of KNN and NEC were compared through severalstatistical parameters.

398 G.P. Cao et al.

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 5: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

2. Materials and methods

2.1 Collection of data set

A data set of 75 compounds with HDAC8 inhibitory activity (IC50) values predicted undersame biological assay conditions was collected from various literature resources, includingpatents [13,31–37]. This data set included 38 inhibitors and 37 non-inhibitor compoundsbased on their activity values: inhibitors 0.008 μM < IC50 ≤ 0.3 μM; non-inhibitors IC50 >0.3 μM. The data set was finally divided into training and test sets containing 30 and 45compounds, respectively. The 30 training set compounds, including 15 inhibitors and 15non-inhibitors, were used in the development of QSAR classification models (Figure 1). Thetest set containing 23 inhibitors and 22 non-inhibitors was utilized to validate the developedmodels. All the compounds in the data set were sketched in two-dimensional (2D) structureswith Accelrys Draw v4.1 (Accelrys Inc., San Diego, CA, USA) and subsequently convertedinto 3D structures with Accelrys Discovery Studio (DS) v3.5 (Accelrys, San Diego, CA,USA).

Figure 1. The 30 chemically diverse compounds used as training set in classification model generation.

SAR and QSAR in Environmental Research 399

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 6: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

2.2 Molecular descriptor calculation and data preprocessing

Molecular descriptors are the final results of a logical and mathematical procedure whichtransforms chemical information encoded within a symbolic representation of a molecule intoa useful number or the result of some standardized experiment [38]. Although thousands ofdescriptors can be calculated, to date this is only useful from a medicinal chemistry perspec-tive when they are reduced to a small set of molecular descriptors that can effectively beapplied in designing novel and potent compounds. Thus, the main goal of a QSAR study isnot to calculate thousands of descriptors but to identify a few molecular descriptors.

In this study, molecular descriptors for both data set and database compounds were calcu-lated using ADRIANA.Code program (Molecular Networks Inc.) [39] (Table 1). Furthermore,each descriptor of the data set and database compounds was scaled to the range [-1, +1] usingthe following:

yi ¼ xij j � xminj jxmaxj j � xminj j ; i ¼ 1; 2; . . .; n (1)

yi ¼ yi; xi � 0ð Þyi ¼ yi; xi\0ð Þ

�; i ¼ 1; 2; . . .n (2)

where xi is a descriptor vector, n is the number of compounds in data set and yi is a scaledresult.

2.3 K-nearest neighbour (KNN) algorithm

KNN is a classic lazy learning method based on local distance information used for classifica-tion. Cover and Hart firstly proposed the 1-NN rule in 1967 (nearest neighbour) and it workedpowerfully [40]. Afterwards, the KNN rule was drafted as a simple generalization of 1-NN torecognize a new pattern [41]. KNN classifies a new object into the class membership with amajority vote of its K-nearest neighbours. KNN first calculates the distances between the train-ing set and the new object, and then selects the K-nearest training data based on the calculateddistances. Finally, the new object is assigned to the class most common among its K-nearesttraining data. In KNN, K is a user-defined constant and must be a positive integer. Distancemetrics can be used to calculate distance between training data and the new object.

2.4 Neighbourhood classifier (NEC) algorithm

NEC is also a lazy classifier based on local distance information. It is similar to KNN, but it wasproposed by few researchers [42,43]. In general, the process of NEC can be summed up as atwo-step procedure: First, a valid area (neighbourhood) is selected within training data with thecorrect classification. Then, NEC assigns the class to a new object according to majority vote oftraining data in the neighbourhood. The neighbourhood is given in distance metric functions anda threshold w, which can determine the shape and size of the neighbourhood, respectively.

2.5 Principal components analysis (PCA) feature transformation

PCA is a powerful dimensionality reduction and noise reduction method, which uses orthogo-nal transformation to convert data into linearly uncorrelated principal components [44,45].PCA can reduce the number of dimensions of data, without much loss of information. This

400 G.P. Cao et al.

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 7: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

Table 1. Molecular descriptors used in this study.

Descriptor name Description AbbreviationType ofdescriptors

Molecular weight Molecular weight in [u] or [Da] derived fromthe gross formula

Weight Globalmoleculardescriptors

Number of hydrogenbonding acceptors

Number of hydrogen bonding acceptors derivedfrom the sum of nitrogen and oxygen atoms inthe molecule

HAcc Globalmoleculardescriptors

Number of hydrogenbonding donors

Number of hydrogen bonding donors derivedfrom the sum of N–H and O–H groups in themolecule

HDon Globalmoleculardescriptors

Octanol–waterpartition coefficient(log P)

Octanol–water partition coefficient in [log units]of the molecule following the XlogP approach

XlogP Globalmoleculardescriptors

Topological polarsurface area

Topological polar surface area in [Å2] of themolecule derived from polar two-dimensional(2D) fragments

TPSA Globalmoleculardescriptors

Mean molecularpolarizability

Mean molecular polarizability in [Å3] of themolecule

Polariz Globalmoleculardescriptors

Molecular dipolemoment

Dipole moment in [Debye] of the molecule Dipole Globalmoleculardescriptors

Aqueous solubility(logS)

Solubility of the molecule in water in [logunits]

LogS Globalmoleculardescriptors

Number of rotatablebonds

Number of open-chain, single rotatable bonds NRotBond Globalmoleculardescriptors

Number of atoms Number of all atoms in the molecule (includinghydrogen atoms)

NAtoms Globalmoleculardescriptors

Molecular complexity Molecular complexity according to theapproach by J. Hendrickson

Complexity Globalmoleculardescriptors

Ring complexity Ring complexity according to the approach byJ. Gasteiger and C. Jochum

RComplexity Globalmoleculardescriptors

Molecular diameter Maximum distance between two atoms in themolecule in [Å]

Diameter Size andshapedescriptors

Principal moment ofinertia of 1stprincipal axis

Principal component of the inertia tensor in xdirection in [Da·Å²]

InertiaX Size andshapedescriptors

Principal moment ofinertia of 2ndprincipal axis

Principal component of the inertia tensor in ydirection in [Da·Å²]

InertiaY Size andshapedescriptors

Principal moment ofinertia of 3rdprincipal axis

Principal component of the inertia tensor in zdirection in [Da·Å²]

InertiaZ Size andshapedescriptors

(Continued)

SAR and QSAR in Environmental Research 401

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 8: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

transformation is defined in such a way that the first principal component has the largestpossible variance (as much of the variability in the data as possible) and each succeedingcomponent in turn has the highest variance possible under the constraint that it is orthogonalto the preceding components.

PCA was performed using statistics toolbox in MATLAB v6.5 student version (Math-Works, Natick, MA, USA).

2.6 Leave-one-out cross validation (LOO CV)

CV is used as a model validation technique for assessing how the results of a statistical analy-sis will generalize to an independent data set [46]. The LOO CV procedure was implementedto estimate the predictive capability of the developed QSAR models. In the LOO CV process,a single compound from the data set was used as the test data and the remaining compoundsas the training data. This was repeated such that each sample in the data set is used once asthe test data. The results were averaged as output of LOO CV.

2.7 Grid search (GS) method

During the development of QSAR classification modelling, a difficult issue is how to set appro-priate parameters of classifiers. It is not clear beforehand which parameters are best. Thus, aparameter search must be performed. In this study, GS was utilized for parameter optimization.GS is a straightforward but powerful method, which involves exhaustive searching through asubset of the parameter space of a learning algorithm to solve the problem of model selectionand parameter optimization. The GS method has to be guided by some performance metric,typically measured by CVon the training set; LOO CV was used in this study.

2.8 Evaluation of prediction performance

The predictive abilities of KNN and NEC were evaluated using the following statistical mea-sures. In following equations, TP is the number of true positives, TN the number of truenegatives, FP the number of false positives and FN the number of false negatives. In thisstudy, HDAC8 inhibitors were considered the ‘positive set’ and the non-inhibitors wereconsidered the ‘negative set’.

Table 1. (Continued).

Descriptor name Description AbbreviationType ofdescriptors

Molecular span Radius of the smallest sphere centred at thecentre of mass which completely encloses allatoms in the molecule in [Å]

Span Size andshapedescriptors

Molecular radius ofgyration

Radius of gyration in [Å] Rgyr Size andshapedescriptors

Molecular eccentricity Molecular eccentricity Eccentric Size andshapedescriptors

Molecular asphericity Molecular asphericity Aspheric Size andshapedescriptors

402 G.P. Cao et al.

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 9: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

Accuracy (ACC) is the degree of closeness of measurements of a quantity to that quan-tity’s actual (true) value. The Matthews correlation coefficient (MCC) is a measure of qualityof binary classification; it returns a value between –1 and +1. A coefficient of +1 stands for aperfect prediction, 0 means a random prediction, and –1 indicates total disagreement betweenprediction and observation. True positive rate (TPR, known as Recall) is a measure of rele-vant retrieved instances. False positive rate (FPR) refers to the probability of falsely rejectingthe null hypothesis for a particular test.

ACC can be calculated by following the formula

ACC ¼ TPþ TN

TPþ FPþ TNþ FN� 100%: (3)

MCC can be calculated directly from the confusion matrix using the formula

MCC ¼ ðTP� TNÞ � ðFP� FNÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiðTPþ FPÞðTPþ FNÞðTNþ FPÞðTNþ FNÞp : (4)

TPR is defined as

TPR ¼ TP

TPþ FN� 100%: (5)

FPR is defined as

FPR ¼ FP

TNþ FP� 100%: (6)

2.9 Chemical database preparation and virtual screening

Maybridge, a commercial chemical database containing 59,652 compounds, was employedfor the virtual screening procedure [47]. However, this database is found to have a number ofnon-druglike compounds. As it is worthless screening all the compounds of these databasesand then eliminating them in the later phase for their non-druglike properties, the compoundsnot satisfying druglike properties were excluded from the database prior to double lazylearning-based virtual screening.

In order to accomplish this task, compounds in this database were subjected to variousscrupulous druglike filters such as Lipinski’s rule of five and ADMET (absorption, distribu-tion, metabolism, excretion and toxicity) properties. ADMET was applied to check whetherthe compounds are in a position to cross the blood–brain barrier (BBB) and have good solu-bility, human intestinal absorption (HIA) and low toxicity. Here, we mainly focused on oralbioavailability, low or no hepatotoxicity, and the capacity to penetrate the BBB, which is akey decision filter for central nervous system drug discovery.

The compounds that satisfied the abovementioned properties were chosen for moleculardocking studies. Lipinski’s rule of five states that ClogP is ≤5, molecular weight is ≤500,number of hydrogen bond acceptors is ≤10 and number of hydrogen bond donor is ≤5.Compounds violating more than one of these rules may have problems with bioavailability.Therefore these parameters were calculated by Prepare Ligands and ADMET Descriptorsprotocols as available in DS v3.5 software to eliminate compounds that did not pass theabove criteria.

SAR and QSAR in Environmental Research 403

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 10: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

After preparation of the druglike database, the generated classification models weresubjected to screening of this database. The retrieved hit compounds were further subjected toa molecular docking process.

2.10 Structure-based molecular docking

Molecular docking is a potent method in the drug discovery process. It predicts the preferredorientation of one molecule to a second when they bind with each other to constitute a stablecomplex. Virtual screening followed by docking has become one of the reputable methods fordrug discovery and enhancing the efficiency in lead optimization.

The hit compounds resulting from QSAR-based virtual screening along with two of themost active inhibitors (IC50 values of Inh1 and Inh2 are 8 nM and 10 nM, respectively) inthe data set were docked to HDAC8 using LibDock module in DS v3.5. LibDock is a high-throughput algorithm for docking ligands into an active receptor site. Ligand conformationsare aligned to polar and apolar receptor interaction sites (hotspots) and the best scoring posesare reported. Ligand conformations can be pre-calculated or generated on the fly usingCATALYST [48,49]. Protein was coordinated from the crystal structure of HDAC8 (PDB ID:2V5X) [9], which was selected from the Protein Data Bank (PDB, www.rcsb.org). All thewater molecules present in the protein structure were removed and hydrogen atoms wereadded. The active sites were defined around the ligand present in the crystal structure of2V5X. During the docking process, the top 100 conformations were generated for each ligandbased on docking score value. Docking tolerance was set to 0.25. The interacting ability of acompound depends on the LibDockScore; the greater the LibDockScore, the better the bind-ing affinity. Protein–inhibitor interactions were examined by DS v3.5. For further validation,the binding free energies were calculated for the final hit compounds and the two most activeknown inhibitors using the AutoDock Vina tool available in PyRx v0.8 [50]. The hitcompounds with best binding free energies were selected.

2.11 Molecular dynamics (MD) simulation

Selected complexes from docking study were subjected to 5 nanosecond (ns) MD simulationusing the GROMACS 4.5.3 package with AMBER03 force field running on a high-perfor-mance Linux cluster computer [51]. Topology files for the inhibitors were generated usingACPYPE (AnteChamber Python Parser interface) [52]. The structure was solvated in adodecahedron box of length 1.5 nm and the TIP3P water model was generated to perform thesimulations in an aqueous environment [53,54]. Ten Na+ counter ions were added by replac-ing water molecules to neutralize the simulated system. The systems were subjected to asteepest descent energy minimization process up to a tolerance of 1000 kJ/mol/nm to avoidhigh energy interactions and steric clashes. The energy minimized system was treated for 100picoseconds (ps) in an equilibration run. A constant temperature and pressure of 300 K and 1bar were achieved with the V-rescale thermostat and Parrinello-Rahman barostat [55,56]. Theparticle mesh Ewald (PME) method was implemented to accurately determine the long-rangeelectrostatic interactions [57]. Bonds between heavy metals and corresponding hydrogenatoms were limited to their equilibrium bond lengths using the LINCS21 algorithm [58]. Thetime step for the simulations was set to 2 femtoseconds (fs). During the production phase, thecoordinate data were written to the file every 10 ps. All the analyses of the MD simulationswere carried out by GROMACS and DS v3.5.

404 G.P. Cao et al.

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 11: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

3. Results and discussion

3.1 Strategy for screening potent HDAC8 inhibitors

Virtual screening is useful for drug design as it is a cost-effective and cost-saving process.Virtual screening methods can be divided into two categories: structure-based and ligand-based methods. To date, the three-dimensional (3D) structure of HDAC8 as a target receptorand its binding sites are available. Molecular docking is a highly effective structure-basedtechnique for screening HDAC8 inhibitors. A set of active inhibitors is available and it ispossible to compute their shared information. Bearing this in mind, an innovative hybridstrategy integrating structure-based and ligand-based approaches to identify novel HDAC8inhibitors is presented in this study.

This strategy (Figure 2) begins with preparation of a druglike database which is furtherbound to the HDAC8 protein, thus predicting the binding conformations and molecularinteractions. In next step, a KNN-based model and NEC-based model were applied sequen-tially to druglike compounds, so that only candidates that were classified correctly by bothKNN and NEC were passed for further molecular docking. On the basis of the binding modeanalysis, hit compounds with good binding characteristics and showing strong interactionswith crucial amino acids at the active site of HDAC8 were selected as final hits. A 5 ns MDsimulation was used to check the HDAC8-inhibitor complex stability.

Figure 2. Strategy of HDAC8 inhibitors discovery used in this study.

SAR and QSAR in Environmental Research 405

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 12: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

3.2 PCA performance and selection

PCA feature transformation was carried out on the data set which was used to develop KNNand NEC models. PCA is a statistical method, which uses orthogonal transformation to con-vert a large number of correlated variables into a small number of uncorrelated variables,called principal components, while retaining maximal amount of variation, and thus making iteasier to operate the data to achieve data dimension reduction. During the PCA procedure,the calculated molecular descriptors were converted into 20 principal components (PCs)(Figure 3). The transformation is defined in such a way that the first principal component hasthe largest possible variance, accounting for as much of the variability in the original data setas possible, and each succeeding component in turn has the highest variance possible underthe constraint that it is orthogonal to the preceding components. The only noticeable break inthe amount of variance accounted for by each component is between the first and secondcomponents after the extraction of PCA. The first component by itself explained 53.82% oftotal variation and the second PC explained 13.63% of total variation. The value for each PCis small. It means that any single PC is not good enough to develop a model with goodperformance. Thus, a combination of PCs is needed. It not known beforehand whichcombination of PCs is best for developing KNN and NEC models, and there is no uniformstandard about selecting PCs.

As a general rule, the total explained variance for the selected PCs and variance for eachselected PC were chose as measures in this study. The selected PCs together should explainmore than 90% of the total variance for general cases and retain much of the variability inthe original data set as possible. The first six principal components together can explain 92%of the total variation, which satisfied the requirement. It shows that the minimum number ofthe selected PCs is six. However, the 11th PC explained 0.6% of the total variation, which istoo small to construct a good classification model. Thus, the PCs which explained less than1% of the total variance were ignored in this study. It means that the maximum number ofthe selected PCs is 10. The selected number of PCs for classification modelling is assigned ina range from 6 to 10. The fundamental purpose of the selection is to make a trade-off deci-sion between performance and computing time. In order to retain the maximal amount ofvariation, the first 10 ten components, which can explain 97.66% of the total variance, werefinally selected for developing classification models.

Figure 3. Histogram plot of the percentage variability explained by each principal component.

406 G.P. Cao et al.

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 13: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

3.3 Data set diversity and distribution

A three-dimensional (3D) visualization of the data set using the first three PCs was studied(Figure 4). All compounds were well distributed in the PC space and there was no clearboundary to distinguish HDAC8 inhibitors and non-inhibitors. Although there were somecompounds isolated from the majority of the data set, there was no evidence to indicate thatthese compounds were outliers. To insure data integrity and data diversity, these compoundswere still used for developing models.

3.4 Construction and validation of KNN classification model

The 30 training set compounds were used in development of a KNN classification QSARmodel. Besides the input (as mentioned above, the first 10 PCs were selected as the inputvariable), the K value and distance metric can also affect the predictive ability of the KNNmodel. The optimized parameters can improve the performance of the classification model. Inorder to find an optimized combination of these two parameters, the GS method was used onK and the distance metric. During parameter searching, K is an odd number ranging from 1to 29, and step was set to 2. Euclidean distance and Manhattan distance were considered asdistance metrics. However, a process of LOO CV was utilized to improve generalization qual-ity of KNN classification model. Finally, the model with the best average accuracy of LOOCV (73.33%) was picked for further validation. The K value and distance metric were 5 andManhattan distance, respectively.

The generated KNN classification model was first used to predict the training set (Table 2).The ACC, MCC, TPR and FPR for the training set were 80.00%, 0.62, 93.33% and 33.33%,respectively. Next, an external test set containing 23 inhibitors and 22 non-inhibitors was

Figure 4. Data set distribution using the first three principal components.

SAR and QSAR in Environmental Research 407

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 14: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

prepared using the same protocol as the training set and used to validate whether the devel-oped KNN model was able to classify the HDAC8 inhibitors and non-inhibitors other thanthe training set compounds. The ACC and MCC for the test set given by KNN model were77.78% and 0.60, respectively (Table 2). MCC is usually used to assess quality of binaryclassification. Generally, an MCC value >0.4 for the test set is considered to be predictive inmachine learning methods.

The developed KNN model predicted the outcome well for the training and test sets.These results clearly demonstrate that the established KNN model achieved highly robustand good utility, implying the KNN model can be used as a filter for retrieving HDAC8inhibitors.

3.5 Construction and validation of NEC classification model

During the development of NEC modelling, the same strategy as with KNN was adopted.The NEC model was developed based on the same training set. The first 10 PCs werechosen for NEC modelling. Besides the number of PCs, distance metrics and w (a thresh-old) can also affect the predictive ability of the NEC model. GS was also utilized tosearch for an optimal combination of distance metric and w based on LOO CV. TheEuclidean distance and Manhattan distance were thought of as distance metrics. The wvalue is a decimal number ranging from 0.01 to 1, and the step was set to 0.01 for GS.Various pairs of distance metrics and w values were tried and the one with the best aver-age accuracy of LOO CV was selected. The model showed 76.67% of LOO CV accuracy.The w value and distance metric were 0.13 and Manhattan distance, respectively. Bothtraining and test sets were used to determine whether the developed NEC model was ableto be used as a screening tool for database virtual screening (Table 3). The values ofACC, MCC, TPR and FPR for training set were 93.33%, 0.87, 100%, and 13.33%,respectively. Moreover, the NEC model also predicted the test set well, which showed anACC of 82.22%, an MCC of 0.66, a TPR of 73.91% and an FPR of 9.09%. Thus, theNEC model was selected for further chemical database virtual screening.

By comparing NEC to KNN, it was shown that NEC slightly outperformed the KNNclassifier in this study. Although NEC appears to be well suited for solving binary decisionproblems in classifying HDCA8 inhibitors and non-inhibitors, the KNN model also showedsatisfactory prediction. The developed KNN model can be a useful complement of NEC pre-diction. This study does not justify the conclusion that NEC outperforms KNN. An associatedapplication approach of these two methods was used to identify potential HDAC8 inhibitors.

Table 2. Statistical parameters of the developed KNN model.

Training set (30 compounds) Test set (45 compounds)

K value 5 5Distance metric Manhattan distance metric Manhattan distance metricTP 14 14TN 10 21FP 5 1FN 1 9ACC 80.00% 77.78%TPR 93.33% 60.87%FPR 33.33% 4.55%MCC 0.62 0.60

408 G.P. Cao et al.

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 15: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

A meaningful combination of the different classifiers with individual strengths will bedeterminative.

3.6 Database virtual screening

QSAR-based virtual screening deals with the quick search of large libraries of small mole-cules to discover drug targets. QSAR classification models developed by NEC and KNN wereused as a two-step query for retrieving novel candidate molecules from the Maybridge data-base. However, the Maybridge database was found to contain a number of non-druglike com-pounds. Various scrupulous druglike filters such as Lipinski’s rule of five and ADMET(absorption, distribution, metabolism, excretion and toxicity) properties were applied to con-structing a druglike database in this step. A total of 4741 compounds were regarded as beingdruglike compounds. Among them, a total of 1307 compounds satisfied both the NEC andKNN models. These 1307 hit compounds were further considered for molecular docking stud-ies.

3.7 Molecular docking studies of HDAC8

Molecular docking is a method which samples conformations of small compounds in proteinbinding sites, which can be invoked as a post-docking filter to select the compounds interact-ing with the active site amino acids and to predict the binding orientations of the hit com-pounds. In this study, a HDAC8-inhibitor complex form PDB (PDB: 2V5X) was selected asthe target protein for molecular docking.

The 1307 hit compounds retrieved from the Maybridge database which satisfied druglikeproperties, along with two most active inhibitors in the data set (IC50 values of Inh1 and Inh2are 8 nM and 10 nM, respectively.) were docked into the active sites of HDAC8 usingLibDock module as available in DS v3.5. The LibDock generated 100 feasible binding con-formations for each compound and ranked them according to their LibDockScore. The boundconformation with the most favourable energies was considered as the best binding orienta-tion. The LibDockScore for Inh 1 and Inh 2 was 112.516 and 111.668, respectively. Theinteracting ability of a compound depends on the LibDockScore and the greater theLibDockScore, the better the binding affinity.

In view of these findings, hit compounds which gave a LibDockScore >111.668 wereselected for further analysis on binding modes and detailed molecular interactions with the

Table 3. Statistical parameters of the developed NEC model.

Training set (30 compounds) Test set (45 compounds)

w value 0.13 0.13Distance metric Manhattan distance metric Manhattan distance metricTP 15 17TN 13 20FP 2 2FN 0 6ACC 93.33% 82.22%TPR 100% 73.91%FPR 13.33% 9.09%MCC 0.87 0.66

SAR and QSAR in Environmental Research 409

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 16: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

important amino acid residues. Inh 1 and Inh 2 showed hydrogen bonds with K33, D101 andK33, respectively (Figure 5). Thus, K33 and D101 were considered to be crucial for inhibi-tory activity of HDAC8. Based on docking modes and molecular interactions observed fromInh 1 and Inh 2, five hit compounds from the Maybridge-based druglike database wereselected as possible hits (Figure 6). All final hits possesses a LibDockScore >112 and showed

Figure 5. Two-dimensional chemical structures and protein–inhibitor interaction diagrams of HDACand (A) Inh1 and (B) Inh 2.

410 G.P. Cao et al.

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 17: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

hydrogen bonds with crucial amino acids K33 and D101, as well as a few reliable hydropho-bic interactions.

In order to explore the binding free energy of each docked compound, the binding modesof the five possible hits along with two most active inhibitors were analysed using AutoDockVina as a supplement to the LibDock docking. In this case, AutoDock Vina docking resultsshowed that all five possible hits have better or similar binding free energy values comparedwith the binding energies of the two most active inhibitors from the data set (Table 4).Finally, these five hits were listed as final hits on the basis of strong binding interaction atactive site of target protein, good LibDock docking score and favourable AutoDock Vinabinding affinity. The binding modes, molecular interactions and binding free energies betweenthe final hits and HDAC8 are discussed below.

3.8 Binding mode of AW00409

The binding conformation of AW00409 within the active site of HDAC8 had a LibDockScoreof 112.14 and an AutoDock Vina binding free energy of –6 kcal/mol. This binding mode

Figure 6. Two-dimensional chemical structures of the five final hits.

Table 4. Binding energies for the five hits and two most active known inhibitors using AutoDock Vinaprograms.

Name Binding energy (kcal/mol)

Inhibitor 1 -7Inhibitor 2 -5.7AW00409 -6.0HTS03338 -6.4JFD03558 -6.9KM00731 -6.5KM02921 -6.7

SAR and QSAR in Environmental Research 411

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 18: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

showed strong interactions with the important residues (Figure 7). The AW00409 showeddemonstrated hydrogen bonds with K33 (2.383 Å and 1.882 Å) and D101 (1.785 Å), as wellas with some reliable hydrophobic interaction with I94, G97, G99, C102, F152, D272, P273and Y306. On the whole, this compound has gained significant hydrophobic interactions.

3.9 Binding mode of HTS03338

HTS03338 had a LibDockScore of 121.10 and an AutoDock Vina binding free energy of –6.4 kcal/mol. This hit compound also showed considerable interaction with the importantresidues (Figure 8). The compound has formed three hydrogen bonds with K33 (2.085 Å),Y100 (2.231 Å) and D101 (2.255 Å), as well as a strong hydrophobic interaction with F152,P273 and M274.

Figure 7. (A) Binding mode and (B) molecular interactions diagram of AW00409 and HDAC8. (A)AW00409 is represented by violet ball and sticks. Black dotted lines denote intermolecular hydrogenbonds. (B) The locations of key amino acid residues are represented in round boxes, where pink andgreen colours denote hydrogen bond acceptor/donor and nonpolar contacts, respectively.

412 G.P. Cao et al.

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 19: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

3.10 Binding mode of JFD03558

JFD03558 was predicted to have a LibDockScore of 123.35 and an AutoDock Vina bindingfree energy of –6.9 kcal/mol. In this binding mode, the compound showed strong hydrogenbonding and hydrophobic interactions with three active residues of K33 (2.279 Å and 2.289Å), Y100 (2.441 Å) andD101 (2.088 Å), and with a few hydrophobic residues such as F152,D272, P273 and M274 (Figure 9).

3.11 Binding mode of KM00731

KM00731 possessed the highest LibDockScore of 126.11 and an AutoDock Vina binding freeenergy of –6.5 kcal/mol. The compound was engaged in two short-ranged hydrogen bondinteractions with K33 (1.650 Å) and D101 (1.561 Å). Meanwhile it also formed numerousstrong hydrophobic interactions with E148, A149, S150 G151, F152 and F208 (Figure 10).

Figure 8. (A) Binding mode and (B) molecular interactions diagram of HTS03338 and HDAC8. (A)HTS03338 is represented by marine ball and sticks. Black dotted lines denote intermolecular hydrogenbonds. (B) The locations of key amino acid residues are represented in round boxes, where pink andgreen colours denote hydrogen bond acceptor/donor and nonpolar contacts, respectively.

SAR and QSAR in Environmental Research 413

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 20: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

3.12 Binding mode of KM02921

The LibDockScore and AutoDock Vina binding free energy for KM02921 were 122.10 and –6.7 kcal/mol, respectively. The compound formed hydrogen bonds with K33 (1.775 Å) andD101 (1.841 Å) amino acids and was also involved in hydrophobic interactions with F152and Y306 (Figure 11).

3.13 Molecular dynamics (MD) simulation studies

After molecular docking studies, the best binding conformations of the two most activeHDAC8 inhibitors along with five final hits were subjected to 5 ns MD simulations. The MDsimulation results were analysed using tools bundled with the GROMACS distribution

Figure 9. (A) Binding mode and (B) molecular interactions diagram of JFD03558 and HDAC8. (A)JFD03558 is represented by firebrick red ball and stick. Black dotted lines denote intermolecular hydro-gen bonds. (B) The locations of key amino acid residues are represented in round boxes, where pinkand green colours denote hydrogen bond acceptors/donor and nonpolar contacts, respectively.

414 G.P. Cao et al.

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 21: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

package. The Cα root mean square deviation (RMSD) was used to confirm the complex sys-tem stability. RMSD plots showed that the HDAC8–inhibitor complexes were stable duringthe second half of simulation time (2500–5000 ps) (Figure 12A). The averaged Cα RMSDvalues were 0.13 nm for Inh1, 0.12 nm for Inh2, 0.14 nm for AW00409, 0.15 nm for

Figure 10. (A) Binding mode and (B) molecular interactions diagram of KM00731 and HDAC8. (A)KM00731 is represented by brown ball and sticks. Black dotted lines denote intermolecular hydrogenbonds. (B) The locations of key amino acid residues are represented in round boxes, where pink andgreen colours denote hydrogen bond acceptor/donor and nonpolar contacts, respectively.

SAR and QSAR in Environmental Research 415

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 22: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

HTS03338, 0.12 nm for JFD03558, 0.13 nm for KM00731 and 0.13 nm for KM02921. Theradius of gyration (RG) is an indicator for structural change in a protein, when HDAC8 inhi-bitors were bound to the protein. The averaged Cα RG values were 1.94 nm for Inh1, 1.93nm for Inh2, 1.94 nm for AW00409, 1.95 nm for HTS03338, 1.94 nm for JFD03558, 1.94nm for KM00731 and 1.94 nm for KM02921 (Figure 12B). These results suggested that five

Figure 11. (A) Binding mode and (B) molecular interactions diagram of KM02921 and HDAC8. (A)KM02921 is represented by green ball and sticks. Black dotted lines denote intermolecular hydrogenbonds. (B) The locations of key amino acid residues are represented in round boxes, where pink andgreen colours denote hydrogen bond acceptor/donor and nonpolar contacts, respectively.

416 G.P. Cao et al.

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 23: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

final hits in complex with HDAC8 were as stable as Inh1 and Inh2. These five hits werefinally selected as novel HDAC8 inhibitors for in vivo studies.

4. Conclusion

It was demonstrated that a combination of NEC and KNN has the capacity to produce higheroverall prediction accuracy for HDAC8 inhibitors. The focus of this study was not onlyto develop the two kinds of classification models to discriminate HDAC8 inhibitors and non-inhibitors, but also to discover potential inhibitors of HDAC8 as antitumour agents byintegrating QSAR-based virtual screening with molecular docking virtual screening. As thefirst step, a druglike database based on the Maybridge database was prepared. All hit com-pounds with druglike properties retrieved by both NEC and KNN models were selected forfurther docking studies. From the overall analyses, five hit compounds having strong interac-tions with HDAC8 enzyme were regarded as final hits for future in vivo studies. On thewhole, the developed classification models can be used as simple filters in the HDAC8inhibitors discovery process and the strategy used in current study to develop computationalmodels could be applied to other targets.

AcknowledgementsThis research was supported by: the Basic Science Research Programme under Grant2012R1A1A4A01013657; the Management of Climate Change Programme under Grant 2010-0029084through the National Research Foundation of Korea (NRF) funded by the Ministry of Education of

Figure 12. Comparison of: (A) Cα root mean square deviation (RMSD); and (B) radius of gyration(RG) of 5 ns MD simulation for seven systems.

SAR and QSAR in Environmental Research 417

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 24: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

Republic of Korea; and the Next-Generation BioGreen 21 Programme under Grant PJ009486 from theRural Development Administration (RDA) of Republic of Korea.

Disclosure statement

No potential conflict of interest was reported by the authors.

References

[1] D. Taylor, M. Maxwell, R. Luthi-Carter, and A. Kazantsev, Biological and potential therapeuticroles of sirtuin deacetylases, Cell. Mol. Life Sci. 65 (2008), pp. 4000–4018.

[2] X.Y. Yang, and E. Seto, The Rpd3/Hda1 family of lysine deacetylases: From bacteria and yeast tomice and men, Nat. Rev. Mol. Cell Biol. 9 (2008), pp. 206–218.

[3] H. Lehrmann, L.L. Pritchard, and A. Harel-Bellan, Histone acetyltransferases and deacetylases inthe control of cell proliferation and differentiation, Adv. Cancer Res. 86 (2002), pp. 41–65.

[4] P.A. Marks and R. Breslow, Dimethyl sulfoxide to vorinostat: Development of this histone deacety-lase inhibitor as an anticancer drug, Nat. Biotechnol. 25 (2007), pp. 84–90.

[5] S. Emiliani, W. Fischle, C. Van Lint, Y. Al-Abed, and E. Verdin, Characterization of a humanRPD3 ortholog, HDAC3, Proc. Natl. Acad. Sci. U S A 95 (1998), pp. 2795–2800.

[6] S. Imai, C.M. Armstrong, M. Kaeberlein, and L. Guarente, Transcriptional silencing and longevityprotein Sir2 is an NAD-dependent histone deacetylase, Nature 403 (2000), pp. 795–800.

[7] J. Landry, J. Slama, and R. Sternglanz, Role of NAD+ in the deacetylase activity of the SIR2-likeProteins, Biochem. Bioph. Res. Co. 278 (2000), pp. 685–690.

[8] A. Valenzuela-Fernández, J.R. Cabrero, J.M. Serrador, and F. Sánchez-Madrid, HDAC6: A key reg-ulator of cytoskeleton, cell migration and cell-cell interactions error, Trends Cell Biol. 18 (2008),pp. 291–297.

[9] A. Vannini, C. Volpari, P. Gallinari, P. Jones, M. Mattu, A. Carfí, R. De Francesco, C. Steinkühler,and S. Di Marco, Substrate binding to histone deacetylases as shown by the crystal structure of theHDAC8–substrate complex, EMBO Reports 8 (2007), pp. 879–884.

[10] J.E. Bolden, M.J. Peart, and R.W. Johnstone, Anticancer activities of histone deacetylase inhibitors,Nat. Rev. Drug Discov. 5 (2006), pp. 769–784.

[11] G.M. Brodeur, Neuroblastoma: Biological insights into a clinical enigma, Nat. Rev. Cancer 3(2003), pp. 203–216.

[12] I. Oehme, H.E. Deubzer, D. Wegener, D. Pickert, J.P. Linke, B. Hero, A. Kopp-Schneider, F.Westermann, S.M. Ulrich, A. von Deimling, M. Fischer, and O. Witt, Histone deacetylase 8 in neu-roblastoma tumorigenesis, Clin. Cancer Res. 15 (2009), pp. 91–99.

[13] K.L. Durst, B. Lutterbach, T. Kummalue, A.D. Friedman, and S.W. Hiebert, The inv(16) fusion pro-tein associates with corepressors via a smooth muscle myosin heavy-chain domain, Mol. Cell. Biol.23 (2003), pp. 607–619.

[14] S. Thangapandian, S. John, Y. Lee, S. Kim, and K.W. Lee, Dynamic structure-based pharma-cophore model development: A new and effective addition in the histone deacetylase 8 (HDAC8)inhibitor discovery, Int. J. Mol. Sci. 12 (2011), pp. 9440–9462.

[15] E. Byvatov, U. Fechner, J. Sadowski, and G. Schneider, Comparison of support vector and artifi-cial neural network system for drug/nondrug classification, J. Chem. Inf. Comput. Sci. 43 (2003),pp. 1882–1889.

[16] J. Zhang, B. Han, X. Wei, C. Tan, Y. Chen, and Y. Jiang, A two-step target binding and selectivitysupport vector machines approach for virtual screening of dopamine receptor subtype-selectiveligands, PLoS ONE 7 (2012), p. e39076.

[17] H.L. Wan, Z.R. Wang, L.L. Li, C. Cheng, P. Ji, J.J. Liu, H. Zhang, J. Zou, and S.Y. Yang, Discov-ery of novel Bruton’s tyrosine kinase inhibitors using a hybrid protocol of virtual screeningapproaches based on SVM model, pharmacophore and molecular docking, Chem. Biol. Drug Des.80 (2012), pp. 366–373.

418 G.P. Cao et al.

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 25: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

[18] X.H. Ma, R. Wang, C.Y. Tan, Y.Y. Jiang, and T. Lu, Virtual screening of selective multitargetkinase inhibitors by combinatorial support vector machines, Mol. Pharm. 7 (2010), pp. 1545–1560.

[19] Z. Shi, X.H. Ma, C. Qin, J. Jia, and Y.Y. Jiang, Combinatorial support vector machines approachfor virtual screening of selective multi-target serotonin reuptake inhibitors from large compoundlibraries, J. Mol. Graph. Model. 32 (2012), pp. 49–66.

[20] G.P. Cao, S. Thangapandian, S. John, and K.W. Lee, Classification of HDAC8 inhibitors andnon-inhibitors using support vector machines, IBC 4 (2012), pp. 1–7.

[21] P. Vasanthanathan, O. Taboureau, C. Oostenbrink, N.P. Vermeulen, L. Olsen, and F.S. Jørgensen,Classification of cytochrome P450 1A2 inhibitors and noninhibitors by machine learningtechniques, Drug Metab. Dispos. 37 (2009), pp. 658–664.

[22] L.Y. Han, C.J. Zheng, B. Xie, J. Jia, X.H. Ma, F. Zhu, H.H. Lin, X. Chen, and Y.Z. Chen, Supportvector machines approach for predicting druggable proteins: Recent progress in its explorationand investigation of its usefulness, Drug Discov. Today 12 (2007), pp. 7–8.

[23] C.W. Yap and Y.Z. Chen, Prediction of cytochrome P450 3A4, 2D6, and 2C9 inhibitors andsubstrates by using support vector machines, J. Chem. Inf. Model. 45 (2004), pp. 982–992.

[24] M. Wang, K. Wang, A. Yan, and C. Yu, classification of HCV NS5B polymerase inhibitors usingsupport vector machine, Int. J. Mol. Sci. 13 (2012), pp. 4033–4047.

[25] P. Mahé, N. Ueda, T. Akutsu, J.L. Perret, and J.P. Vert, Graph kernels for molecular structure-activity relationship analysis with support vector machines, J. Chem. Inf. Model. 45 (2005),pp. 939–951.

[26] C.Y. Liew, X.H. Ma, X. Liu, and C.W. Yap, SVM model for virtual screening of LCK inhibitors, J.Chem. Inf. Model. 49 (2009), pp. 877–885.

[27] H. Tang, X.S. Wang, X.P. Huang, B.L. Roth, K.V. Butler, A.P. Kozikowski, M. Jung, and A. Tropsha,Novel inhibitors of human histone deacetylase (HDAC) identified by QSAR modeling of known inhibi-tors, virtual screening, and experimental validation, J. Chem. Inf. Model. 49 (2009), pp. 461–476.

[28] B. Niu, W.C. Lu, S.S. Yang, Y.D. Cai, and G.Z. Li, Support vector machine for SAR/QSAR ofphenethyl-amines, Acta. Pharmacol. Sin. 28 (2007), pp. 1075–1086.

[29] S. Zhang, A. Golbraikh, S. Oloff, H. Kohn, and A. Tropsha, A novel automated lazy learningQSAR (ALL-QSAR) approach: Method development, applications, and virtual screening of chemicaldatabases using validated ALL-QSAR models, J. Chem. Inf. Model. 46 (2006), pp. 1984–1995.

[30] J.H. Hsieh, X.S. Wang, D. Teotico, A. Golbraikh, and A. Tropsha, Differentiation of AmpC beta-lactamase binders vs. decoys using classification kNN QSAR modeling and application of theQSAR classifier to virtual screening, J. Comput. Aided Mol. Des. 22 (2008), pp. 593–609.

[31] W. Gu, I. Nusinzon, R.D. Jr. Smith, C.M. Horvath, and R.B. Silverman, Carbonyl-sulfurcontaininganalogs of suberoylanilide hydroxam-ic acid: Potent inhibition of histone deacetylases, Bioorg.Med. Chem. 14 (2006), pp. 3320–3329.

[32] T.Y. Wu, C. Hassig, Y. Wu, S. Ding, and P.G. Schultz, Design, synthesis, and activity of HDACinhibitors with a N-formyl hydroxyl-amine head group, Bioorg. Med. Chem. Lett. 14 (2004),pp. 449–453.

[33] M.B. Jeffrey, L. Zuomei, D. Daniel, and B. Claire, Methods for specifically inhibiting histone-7and 8, US Patent 2004/0072770 A1.

[34] C. Dizhong, D. Weiping, S. Kand, Y.S. Hong, T.S. Eric, Y. Niefang, and Z. Yong, Benzimidazolederivatives: Preparation and pharma-ceutical applications, US Patent 2007/0043043 A1.

[35] S. Walter, W. Haishan, and Y. Zheng, Biaryl linked hydroxamates: Preparation and pharmaceuticalapplications, US Patent 2007/0167499 A1.

[36] L. Ze-Yi, W. Haishan, and Z. Yan, Aclyurea connected and sul-fonamide connected hydroxamates,US Patent, 2008/0070954 A1.48.

[37] J.B. Joseph and B. Sriram, Uses of selective inhibitors of HDAC8 for treatment of T-cell prolifera-tive disorders, US Patent 2008/0112889 A1.

[38] R. Todeschini and V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, Weinheim,2000, p. 9.

SAR and QSAR in Environmental Research 419

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015

Page 26: A lazy learning-based QSAR classification study for ...bio.gnu.ac.kr/publication/pdf/2015_07(127).pdfdeacetylases and Class III HDACs (sirtuins) are mainly dependent on nicotinamide

[39] ADRIANA Code. Molecular Networks Inc.; software available at http://www.molecular-networks.com.

[40] T.M. Cover and P.E. Hart, Nearest neighbor pattern classification, IEEE T. Inform. Theory 13(1967), pp. 21–27.

[41] K.S. Kim, H.H. Choi, C.S. Moon, and C.W. Mun, Comparison of k-nearest neighbor, quadraticdiscriminant and linear discriminant analysis in classification of electromyogram signals based onthe wrist-motion directions, Curr. Appl. Phys 11 (2011), pp. 740–745.

[42] T.Y. Lin, Neighborhood systems and relational database, in Proceedings of 1988 ACM 16th AnnualConference on Computer Science, Atlanta, GA, 1988, p. 725.

[43] Q.H. Hu, D.R. Yu, and Z.X. Xie, Neighborhood classifiers, Expert Syst. Appl. 34 (2008),pp. 886–876.

[44] K. Pearson, On lines and planes of closest fit to systems of points in space, Philos. Mag. 2 (1901),pp. 559–572.

[45] S.G. Ma and Y. Dai, Principal component analysis based methods in bioinformatics studies, Brief.Bioinform. 12 (2011), pp. 714–722.

[46] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection,in Proceedings of the 14th International Joint Conference on Artificial Intelligence, Volume 2,1995, pp. 1137–1143.

[47] Maybridge. Maybridge Chemical Co., Cornwall, U.K; database available at http://www.maybridge.com.

[48] D.J. Diller and R. Li, Kinases, Homology models, and high throughput docking, J. Med. Chem. 46(2003), pp. 4638–4647.

[49] S.N. Rao, M.S. Head, A. Kulkarni, and J.M. Lalonde, Validation studies of the site-directeddocking program LibDock, J. Chem. Inf. Model. 47 (2007), pp. 2159–2171.

[50] O. Trott and A.J. Olson, AutoDock Vina: Improving the speed and accuracy of docking with a newscoring function, efficient optimization and multithreading, J. Comput. Chem. 31 (2010),pp. 455–461.

[51] B. Hess, C. Kutzner, and D. van der Spoel, GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation, J. Chem. Theory Comput. 4 (2008), pp. 453–447.

[52] A.W. Sousa da Silva and W.F. Vranken, ACPYPE – AnteChamber PYthon Parser interface, BMCResearch Notes 5 (2012), pp. 1–8.

[53] H.J.C. Berendsen, J.P.M. Postma, W.F. Van Gunsteren, and J. Hermans, Interaction models forwater in relation to protein hydration, Intermol. Forces 11 (1981), pp. 331–342.

[54] W.L. Jorgensen, J. Chandrasekhar, J.D. Madura, R.W. Impey, and M.L. Klein, Comparison ofsimple potential functions for simulating liquid water, J. Chem. Phys. 79 (1983), p. 926.

[55] G. Bussi, D. Donadio, and M. Parrinello, Canonical sampling through velocity rescaling, J. Chem.Phys. 126 (2007), p. 014101.

[56] M. Parrinello and A. Rahman, Polymorphic transitions in single crystals: A new moleculardynamics method, J. Appl. Phys. 52 (1981), p. 7182.

[57] U. Essmann, L. Perera, M.L. Berkowitz, T. Darden, H. Lee, and G.P. Lee, A smooth particle meshEwald method, J. Chem. Phys. 103 (1995), pp. 8577–8593.

[58] B. Hess, H. Bekker, H.J.C. Berendsen, and J.G.E.M. Fraaije, LINCS: A linear constraint solver formolecular simulations, J. Comput. Chem. 18 (1997), pp. 1463–1472.

420 G.P. Cao et al.

Dow

nloa

ded

by [

Gye

ongs

ang

Nat

iona

l Uni

] at

22:

33 2

1 M

ay 2

015