An accurate free energy estimator: based on MM/PBSA ...



Nanoscale

PAPER

Cite this: Nanoscale, 2020, 12, 10737

Received 17th December 2019, Accepted 17th April 2020

DOI: 10.1039/c9nr10638c

rsc.li/nanoscale

An accurate free energy estimator: based on MM/PBSA combined with interaction entropy for protein–ligand binding affinity†

Kaifang Huang,a Song Luo,a Yalong Cong,b Susu Zhong,a John Z. H. Zhang b,c,d and Lili Duan *a

The molecular mechanics/Poisson–Boltzmann surface area (MM/PBSA) method is widely used to calculate the binding free energy of protein–ligand complexes, and has been shown to balance computational cost against accuracy effectively. The relative binding affinities obtained by the MM/PBSA approach are acceptable, but the method usually overestimates the absolute binding free energy. This paper proposes four free energy estimators based on MM/PBSA for the enthalpy change combined with interaction entropy (IE) for the entropy change, using different weights for the individual energy terms. The ΔGPBSA_IE method is determined to be the optimal estimator based on its performance in terms of the correlation between experimental and theoretical values and error estimations. This approach is optimized using high-quality experimental values from a training set containing 84 protein–ligand systems, and the coefficients for the sum of electrostatic energy and polar solvation free energy, the van der Waals (vdW) energy, the non-polar solvation energy and the entropy change are obtained by multivariate linear fitting to the corresponding experimental values. A comparison between the traditional MM/PBSA method and this method shows that the correlation coefficient improves from 0.46 to 0.72 and the slope of the regression line increases from 0.10 to 1.00. More importantly, the mean absolute error (MAE) is reduced from 22.52 to 1.59 kcal mol−1. Furthermore, the numerical stability of this method is validated on a test set, which gives a correlation coefficient, slope and MAE similar to those of the training set. Based on the above advantages, the ΔGPBSA_IE method can be a powerful tool for reliable and accurate estimation of binding free energy and can play a significant role in detailed energetic investigations of protein–ligand interactions.

Introduction

Molecular recognition plays an important role in several biological processes, such as self-replication, information processing and gene expression. For example, protein–protein, protein–ligand, antigen–antibody, protein–DNA, sugar–lectin and RNA–ribosome interactions all belong to molecular recognition. In particular, protein–ligand interactions occur in almost all biological processes and are crucial for drug discovery. However, the major challenge is to find a ligand that binds the target protein with high affinity and thereby alters its activity. In general, the strength of protein–ligand binding is judged by the free energy change of the binding process. Thus, an accurate and highly efficient calculation of binding free energy is critical in early-stage drug discovery.

Along with experimental progress, theoretical and computational studies have become increasingly popular for investigating the microscopic mechanisms of protein structure and function at the atomic level, owing to the increase in computing power and the development of related algorithms. This benefits drug design research and can accelerate the drug discovery process. In the virtual screening of drugs, binding free energy is often used as an important criterion for ranking small molecules. The free energy perturbation (FEP)1–7 and thermodynamic integration (TI)8–15 methods are theoretically rigorous. Although their computational efficiencies are significantly improved by graphics processing units (GPUs), they may still not be appropriate for evaluating the binding free energy of a large number of protein–ligand systems, such as in drug screening.16 Another approach is the scoring function, which is widely used to estimate several protein–ligand binding affinities quickly because of its high efficiency. However, simple empirical scoring functions are generally unreliable.17–21 As a result, the molecular mechanics/Poisson–Boltzmann surface area (MM/PBSA)22–30 method has become a very popular method in most practical applications.31–34

†Electronic supplementary information (ESI) available: Information about proteins in training and test sets, various energy terms of enthalpy change calculated by the MM/PBSA method, the entropy change calculated by the IE method, the experimental value of binding affinity, binding free energy predicted by ΔGH_IE, ΔGMM_Sol_IE, ΔGPBSA_IE, ΔGall, ΔGPBSA and ΔGPBSA+IE, various energy terms between 2 ns and 10 ns MD simulation, RMSD of the typical systems and 2d-RMSD and RMSF of 20 systems. See DOI: 10.1039/c9nr10638c

a School of Physics and Electronics, Shandong Normal University, Jinan, 250014, China. E-mail: [email protected]

b Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai, 200062, China

c NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai, 200062, China

d Department of Chemistry, New York University, NY, NY 10003, USA

This journal is © The Royal Society of Chemistry 2020. Nanoscale, 2020, 12, 10737–10750. Published on 17 April 2020.

In addition, one should note that the entropy change in the MM/PBSA method is usually calculated using the normal mode (Nmode) method, which is time-consuming and approximate in theory.35–38 This method divides the entropic contribution into three parts, i.e., changes in translational, rotational and vibrational degrees of freedom, which probably leads to systematic errors in the vibrational entropy because the anharmonic contribution is ignored.31,39–41 Therefore, many studies neglect the entropic contribution to the binding free energy when applying the MM/PBSA method, which leads to unreliable results. To address this problem, we proposed the interaction entropy (IE) method, which is theoretically rigorous, computationally efficient and numerically reliable.42 With the IE method, the entropy change is obtained directly from the fluctuation of the interaction energy around its average value in the molecular dynamics (MD) simulation, at no additional computational cost. Recently, the IE method combined with the MM/PBSA method has achieved highly efficient binding free energy calculations for many biological systems.43–50 Moreover, the correlation coefficient between the theoretical values and the experimental data increases when the entropy change is included via the IE method. However, the MM/PBSA method depends heavily on the parameters used, such as atomic charges, dielectric constants, atomic radii, the force field and empirical models of nonpolar hydration.16 In the traditional MM/PBSA, some parameters are tuned to make the relative binding free energies reasonable, which usually leads to overestimation of the calculated binding free energy.51–54

The challenges associated with the MM/PBSA method have attracted considerable research interest. Hou’s group suggested that the application of variable dielectric constants in MM/PBSA and molecular mechanics/generalized-Born surface area (MM/GBSA) methods can significantly improve the accuracy of the results.55 Zhang’s group obtained excellent results by using residue-type-specific dielectric constants in the MM/GBSA framework.53 Ryde’s group combined semi-empirical quantum mechanical (SQM) calculations with the MM/GBSA method and showed that the SQM conductor-like screening model surface-area (SQM-COSMO/SA) method is a viable alternative to the MM/PB(GB)SA approach for calculating binding free energy.56 In addition, their later work proposed the QM-MM/PBSA method, which calculates the internal interaction energy of the reaction site at the QM level and the other parts at the MM level. Their results show that QM-MM/PBSA predicts binding free energy significantly worse than the QM/MM FEP approach.57

These efforts motivate the search for a new method for calculating binding free energy that has a low computational cost, is insensitive to the protein family or ligand, and does not need frequent adjustment of the estimator’s internal parameters, so that it can be used in a wide range of fields. This paper is an attempt to meet these requirements. In this work, the binding free energy is obtained via a linear combination of the energy terms from the MM/PBSA and IE calculations. The IE method is used to calculate the entropy change, while the MM/PBSA method is used to calculate the various energy terms of the enthalpy change. The coefficients of the linear combination are optimized by multivariate linear fitting on a training set consisting of 84 protein–ligand complexes with experimentally known binding affinities. The final binding free energy is obtained by adding the weighted energy terms. We perform four multivariate linear fitting schemes based on the interaction mechanism of the various energy terms. Our results for the training set show that the third fitting method is the optimal choice. Then, a test set comprising 44 protein–ligand complexes is used to validate the fitting methods. The results show that the binding free energies calculated with the third method are in excellent agreement with the experimental data for both the training set and the test set. We also compare the performance of the current method with those of other methods, such as MM/PBSA and MM/PBSA combined with the IE method.

Methods

MD simulation

The protein–ligand complexes of the training and test sets are derived from the “protein–ligand complexes: the general set minus refined set” in the PDBbind database developed by Wang et al.58 The native structures of these two datasets, obtained from the Protein Data Bank (PDB),59 are used as the starting structures for the MD simulations. The 84 protein–ligand complexes whose PDB IDs begin with two characters between 1A and 1U form the training set, while the other 44 systems, whose PDB IDs begin with two characters between 1V and 2H, form the test set. The training set and test set contain 60 and 34 different proteins, respectively. Only 10 proteins in the test set also appear in the training set, so a total of 84 different proteins are used in our study. Information about the protein–ligand complexes of the datasets is provided in Table S1.† We exclude systems with metal ions present in the reactive site or ligands with more than two charges.
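The split rule and the protein count described above can be sketched in a few lines of Python. This is only an illustration: the function name `assign_set` is hypothetical, and it assumes that "between 1A and 1U" and "between 1V and 2H" mean plain lexicographic (ASCII) ordering of the first two characters of the PDB ID, which the text does not state explicitly.

```python
def assign_set(pdb_id):
    """Return 'training', 'test' or None from the first two PDB ID characters.

    Assumption: the ranges 1A..1U and 1V..2H are compared lexicographically.
    """
    prefix = pdb_id[:2].upper()
    if "1A" <= prefix <= "1U":
        return "training"
    if "1V" <= prefix <= "2H":
        return "test"
    return None


# Distinct proteins: 60 in the training set, 34 in the test set, 10 shared.
n_training_proteins = 60
n_test_proteins = 34
n_shared_proteins = 10
n_distinct_proteins = n_training_proteins + n_test_proteins - n_shared_proteins

print(assign_set("1BIO"), assign_set("1SQP"), n_distinct_proteins)
```

The three systems named later in the text (1BIO, 1C85 and 1SQP) all fall in the training range under this rule.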

All ligands of the complexes are optimized at the HF/6-31G** level, and then the single point energy is calculated at the B3LYP/cc-pVTZ level to obtain electrostatic potentials (ESP). The restrained ESP (RESP) method is used to fit the atomic charges of the ligand.60,61 All missing hydrogen atoms of the protein are placed in the proper positions using the Leap module of the AMBER16 package.62 The AMBER12SB force field and the general AMBER force field (GAFF) are applied to produce the parameters of the protein and ligand for MD simulation, respectively. Each system is solvated in an octahedral box of TIP3P water molecules in which the distance between the edges of the box and the closest solute atom is at least 10 Å. In addition, counter ions are added to neutralize the whole system. Prior to the MD simulation, the energy of the system is minimized in two steps, by the steepest descent method followed by conjugate gradient minimization, until convergence is reached. After minimization, each system is heated from 0 to 300 K over 300 ps with a weak restraint of 10 kcal (mol Å2)−1 on all solute atoms. Langevin dynamics63 is applied to regulate the temperature with a collision frequency of 1.0 ps−1. The SHAKE algorithm64 is used to constrain all bonds involving hydrogen atoms. The time step for all MD simulations is set to 2 fs, and a cutoff of 10 Å is applied to the non-bonded interactions. Finally, MD simulations of these systems are performed for 2 ns. Coordinates are written to the trajectory file every 4 ps during the first 1 ns and every 10 fs during the last 1 ns, to obtain the conformational sampling needed for the PBSA and IE calculations.
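The output schedule above implies very different frame counts for the two halves of the run. The following sketch works through that arithmetic; the helper names are hypothetical, and times are held as integer femtoseconds purely to avoid floating-point rounding.

```python
DT_FS = 2                 # MD time step in fs (from the text)
NS_IN_FS = 1_000_000      # 1 ns expressed in fs


def steps_between_writes(interval_fs, dt_fs=DT_FS):
    """Number of MD steps between two trajectory writes."""
    if interval_fs % dt_fs:
        raise ValueError("write interval must be a multiple of the time step")
    return interval_fs // dt_fs


def frames_written(duration_fs, interval_fs):
    """Number of frames saved over a segment of the given duration."""
    return duration_fs // interval_fs


first_ns_frames = frames_written(NS_IN_FS, 4_000)   # one frame every 4 ps
last_ns_frames = frames_written(NS_IN_FS, 10)       # one frame every 10 fs
total_steps = 2 * NS_IN_FS // DT_FS                 # 2 ns at 2 fs per step

print(first_ns_frames, last_ns_frames, total_steps)
print(steps_between_writes(4_000), steps_between_writes(10))
```

The dense sampling of the last nanosecond is what supplies the many interaction-energy values averaged in the IE calculation, whereas only 100 snapshots are used for MM/PBSA.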

MM/PBSA method22–28

The binding free energy (ΔGbind) can be expressed by eqn (1):

$$\Delta G_{\mathrm{bind}} = \Delta H - T\Delta S \qquad (1)$$

where the enthalpy change (ΔH) is computed as the sum of the changes in the gas-phase energy (ΔEMM) and the solvation free energy (ΔGsol), averaged over a conformational ensemble generated by MD simulations (eqn (2)):

$$\Delta H = \Delta E_{\mathrm{MM}} + \Delta G_{\mathrm{sol}} \qquad (2)$$

ΔEMM can be written as:

$$\Delta E_{\mathrm{MM}} = \Delta E_{\mathrm{ele}} + \Delta E_{\mathrm{vdW}} + \Delta E_{\mathrm{int}} \qquad (3)$$

where ΔEele, ΔEvdW and ΔEint represent the electrostatic energy, the vdW energy and the internal energy (bond, angle and dihedral energies), respectively. In this work, the conformations of the protein–ligand complex, apoprotein and ligand are taken from a single MD trajectory (the complex trajectory only), which treats the protein and ligand structures as rigid on binding. This means that the internal energies of the complex and of the isolated components (i.e., apoprotein and ligand) cancel each other in ΔEint, because they are calculated from the same MD trajectory. So, only ΔEele and ΔEvdW of eqn (3) are considered in the following work.

ΔGsol is the sum of the polar solvation free energy (ΔGpb) and the non-polar solvation free energy (ΔGnp):

$$\Delta G_{\mathrm{sol}} = \Delta G_{\mathrm{pb}} + \Delta G_{\mathrm{np}} \qquad (4)$$

ΔGpb is determined by solving the linearized Poisson–Boltzmann equation65 using the PBSA program24 in the AMBER suite. The nonpolar solvation term is calculated from the solvent-accessible surface area (SASA)66 using the formula:

$$\Delta G_{\mathrm{np}} = \gamma \times \mathrm{SASA} + \beta \qquad (5)$$

where γ and β are empirical constants equal to 0.00542 kcal (mol Å2)−1 and 0.92 kcal mol−1, respectively. 100 snapshots are extracted from the final trajectory for the MM/PBSA calculation.
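Eqn (2)–(5) amount to simple arithmetic once the per-snapshot energy terms are available. The sketch below is a minimal illustration and not the AMBER implementation: the function names and the input arrays are hypothetical, β is taken as an energy offset in kcal mol−1, and ΔEint is set to zero as in the single-trajectory protocol.

```python
import numpy as np

GAMMA = 0.00542   # kcal mol^-1 Å^-2, surface tension constant from the text
BETA = 0.92       # kcal mol^-1, offset constant from the text


def nonpolar_solvation(sasa):
    """Eqn (5): non-polar solvation term from a SASA value (in Å^2)."""
    return GAMMA * np.asarray(sasa, dtype=float) + BETA


def enthalpy_change(d_ele, d_vdw, d_pb, d_np):
    """Eqn (2)-(4) with ΔE_int = 0: ensemble average over snapshots."""
    per_snapshot = (np.asarray(d_ele, dtype=float)
                    + np.asarray(d_vdw, dtype=float)
                    + np.asarray(d_pb, dtype=float)
                    + np.asarray(d_np, dtype=float))
    return float(per_snapshot.mean())


# Hypothetical per-snapshot values (kcal/mol) for two snapshots.
dH = enthalpy_change([-30.0, -32.0], [-40.0, -38.0], [45.0, 47.0], [-4.0, -4.2])
print(nonpolar_solvation(100.0), dH)
```

In practice the four input arrays would hold one value per extracted snapshot (100 in this work).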

IE method42

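The entropic contribution can be calculated by the IE method using the following expression:

$$-T\Delta S_{\mathrm{IE}} = KT \ln \left\langle e^{\beta \Delta E_{\mathrm{pl}}^{\mathrm{int}}} \right\rangle \qquad (6)$$

where $\Delta E_{\mathrm{pl}}^{\mathrm{int}}$ is defined as the fluctuation of the protein–ligand interaction energy ($E_{\mathrm{pl}}^{\mathrm{int}}$) around its average $\langle E_{\mathrm{pl}}^{\mathrm{int}} \rangle$, which can be computed by eqn (7):

$$\Delta E_{\mathrm{pl}}^{\mathrm{int}} = E_{\mathrm{pl}}^{\mathrm{int}} - \left\langle E_{\mathrm{pl}}^{\mathrm{int}} \right\rangle \qquad (7)$$

The advantage of this method is that $\langle E_{\mathrm{pl}}^{\mathrm{int}} \rangle$ and $\langle e^{\beta \Delta E_{\mathrm{pl}}^{\mathrm{int}}} \rangle$ can be calculated simply from the following formulas:

$$\left\langle E_{\mathrm{pl}}^{\mathrm{int}} \right\rangle = \frac{1}{T}\int_{0}^{T} E_{\mathrm{pl}}^{\mathrm{int}}(t)\,\mathrm{d}t = \frac{1}{N}\sum_{i=1}^{N} E_{\mathrm{pl}}^{\mathrm{int}}(t_i) \qquad (8)$$

and

$$\left\langle e^{\beta \Delta E_{\mathrm{pl}}^{\mathrm{int}}} \right\rangle = \frac{1}{N}\sum_{i=1}^{N} e^{\beta \Delta E_{\mathrm{pl}}^{\mathrm{int}}(t_i)} \qquad (9)$$

in which β is 1/KT, with K the Boltzmann constant and T the temperature. In the time average of eqn (8), T denotes the total simulation time and N the number of snapshots.

It can be seen that the interaction energy can be obtained directly from the MD simulation, so that the interaction entropy can be calculated without any additional expense.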
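Eqn (6)–(9) reduce to an exponential average over the interaction-energy time series. The following is a minimal sketch, assuming energies in kcal mol−1 and a Boltzmann constant of 0.0019872041 kcal mol−1 K−1; the function name is hypothetical, and the maximum is subtracted before exponentiation only for numerical stability, which does not change the result.

```python
import numpy as np

K_B = 0.0019872041  # Boltzmann constant in kcal mol^-1 K^-1


def interaction_entropy(e_int, temperature=300.0):
    """Return -TΔS_IE (kcal/mol) from protein-ligand interaction energies.

    e_int: one interaction energy per MD snapshot (kcal/mol).
    """
    e_int = np.asarray(e_int, dtype=float)
    beta = 1.0 / (K_B * temperature)
    delta = e_int - e_int.mean()            # eqn (7), using the mean of eqn (8)
    x = beta * delta
    shift = x.max()                         # log-mean-exp trick for stability
    log_avg = shift + np.log(np.mean(np.exp(x - shift)))  # ln of eqn (9)
    return log_avg / beta                   # eqn (6): KT ln<exp(βΔE)>


# Hypothetical four-snapshot series, for illustration only.
print(interaction_entropy([-52.0, -48.5, -50.3, -47.9]))
```

Because the exponential average of a zero-mean fluctuation is never below one, this quantity is non-negative, and it vanishes for a constant interaction energy.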

Binding energy estimator using multivariate linear fitting based on MM/PBSA combined with IE

The binding free energy estimators are defined by the following four fitting methods, from which the optimal combination is selected.

In the first method, the enthalpy change as a whole and the entropy change are fitted with independent coefficients. The corresponding binding free energy is given as follows:

$$\Delta G_{\mathrm{H\_IE}} = a_1(\Delta E_{\mathrm{ele}} + \Delta E_{\mathrm{vdW}} + \Delta G_{\mathrm{pb}} + \Delta G_{\mathrm{np}}) + a_2(-T\Delta S_{\mathrm{IE}}) + a_3 \qquad (10)$$

In the second method, the gas-phase interaction energy (ΔEMM) as a whole, the solvation energy ΔGsol and the entropy change are fitted with independent coefficients. The corresponding binding free energy is given as:

$$\Delta G_{\mathrm{MM\_Sol\_IE}} = b_1(\Delta E_{\mathrm{ele}} + \Delta E_{\mathrm{vdW}}) + b_2\Delta G_{\mathrm{sol}} + b_3(-T\Delta S_{\mathrm{IE}}) + b_4 \qquad (11)$$

In the third method, the electrostatic and polar solvation energies are fitted with one common coefficient because these energy terms can be regarded as the total polar contribution, while the other energy terms are fitted with independent coefficients. The binding free energy, defined as ΔGPBSA_IE, is given by the following equation:

$$\Delta G_{\mathrm{PBSA\_IE}} = c_1(\Delta E_{\mathrm{ele}} + \Delta G_{\mathrm{pb}}) + c_2\Delta E_{\mathrm{vdW}} + c_3\Delta G_{\mathrm{np}} + c_4(-T\Delta S_{\mathrm{IE}}) + c_5 \qquad (12)$$

In the final method, each energy term is fitted with an independent coefficient. The binding free energy, defined as ΔGall, is given by the following equation:

$$\Delta G_{\mathrm{all}} = d_1\Delta E_{\mathrm{ele}} + d_2\Delta E_{\mathrm{vdW}} + d_3\Delta G_{\mathrm{pb}} + d_4\Delta G_{\mathrm{np}} + d_5(-T\Delta S_{\mathrm{IE}}) + d_6 \qquad (13)$$

We first calculate the individual energy terms for the protein–ligand complexes in the training set (84 complexes, as described above) from the MD simulations using the MM/PBSA and IE methods, and then substitute them into eqn (10)–(13). The coefficients a1–a3, b1–b4, c1–c5 and d1–d6 are obtained via least-squares fitting to the corresponding experimental binding affinities for all complexes in the training set. The details of the results obtained from the MD simulations are given in Tables S2 and S3† for the training set and Tables S4 and S5† for the test set.
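The least-squares step can be illustrated for the third scheme, eqn (12). The sketch below is not the authors' fitting code: it uses NumPy's `numpy.linalg.lstsq`, a hypothetical helper name, and synthetic noise-free data generated from made-up coefficients, so it only demonstrates that the design matrix is set up consistently with eqn (12).

```python
import numpy as np


def fit_pbsa_ie(d_ele, d_vdw, d_pb, d_np, minus_tds, dg_exp):
    """Least-squares estimate of c1..c5 in eqn (12)."""
    d_ele, d_vdw, d_pb, d_np, minus_tds, dg_exp = (
        np.asarray(a, dtype=float)
        for a in (d_ele, d_vdw, d_pb, d_np, minus_tds, dg_exp)
    )
    # Columns: polar sum, vdW, non-polar, entropy term, constant offset.
    design = np.column_stack(
        [d_ele + d_pb, d_vdw, d_np, minus_tds, np.ones_like(dg_exp)]
    )
    coeffs, *_ = np.linalg.lstsq(design, dg_exp, rcond=None)
    return coeffs


# Synthetic stand-in for the 84 training systems (values are made up).
rng = np.random.default_rng(0)
n = 84
d_ele = rng.normal(-40.0, 20.0, n)
d_vdw = rng.normal(-45.0, 10.0, n)
d_pb = rng.normal(50.0, 20.0, n)
d_np = rng.normal(-5.0, 1.0, n)
minus_tds = rng.uniform(2.0, 30.0, n)

true_c = np.array([0.05, 0.15, 0.06, 0.11, -4.8])   # illustrative only
dg_exp = (true_c[0] * (d_ele + d_pb) + true_c[1] * d_vdw
          + true_c[2] * d_np + true_c[3] * minus_tds + true_c[4])

fitted_c = fit_pbsa_ie(d_ele, d_vdw, d_pb, d_np, minus_tds, dg_exp)
print(fitted_c)
```

The other three schemes differ only in how the columns of the design matrix are grouped, for example a single summed enthalpy column for eqn (10).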

Error estimates

To compare the performance of the methods discussed in this paper, we use the following error measures.

The absolute error (ΔΔG) is given by:

$$\Delta\Delta G = \Delta G_{\mathrm{th}} - \Delta G_{\mathrm{exp}} \qquad (14)$$

where ΔGth is the theoretical binding free energy calculated using the estimator described in this work, and ΔGexp is the experimental value.

The mean absolute error (MAE) is defined by:

$$\mathrm{MAE} = \left\langle \left| \Delta G_{\mathrm{th}} - \Delta G_{\mathrm{exp}} \right| \right\rangle \qquad (15)$$

The root mean square error (RMSE) is obtained from:

$$\mathrm{RMSE} = \sqrt{\left\langle (\Delta G_{\mathrm{th}} - \Delta G_{\mathrm{exp}})^2 \right\rangle} \qquad (16)$$

The degree of freedom (DF) is defined as

$$\mathrm{DF} = n - k \qquad (17)$$

where n is the number of systems used in the fit, and k is the number of fitted parameters.
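Eqn (14)–(17) translate directly into a few NumPy one-liners. The function names below are hypothetical and the two-system input is invented for illustration; the DF example assumes n = 84 training systems and the five parameters c1–c5 of eqn (12).

```python
import numpy as np


def signed_error(dg_th, dg_exp):
    """Eqn (14): ΔΔG = ΔG_th - ΔG_exp for each system."""
    return np.asarray(dg_th, dtype=float) - np.asarray(dg_exp, dtype=float)


def mae(dg_th, dg_exp):
    """Eqn (15): mean absolute error."""
    return float(np.mean(np.abs(signed_error(dg_th, dg_exp))))


def rmse(dg_th, dg_exp):
    """Eqn (16): root mean square error."""
    return float(np.sqrt(np.mean(signed_error(dg_th, dg_exp) ** 2)))


def degrees_of_freedom(n_systems, n_parameters):
    """Eqn (17): DF = n - k."""
    return n_systems - n_parameters


print(mae([-8.0, -10.0], [-9.0, -9.0]), rmse([-8.0, -10.0], [-9.0, -9.0]))
print(degrees_of_freedom(84, 5))
```

RMSE weights large deviations more heavily than MAE, so the two agree only when all errors have the same magnitude, as in this toy input.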

Results and discussion

Study of MD simulation length

To investigate the effect of MD simulation time on the binding free energy prediction, we extend the simulations of 10 protein–ligand complexes in the training set to 10 ns. Their experimental binding free energies span a wide range, from −4.64 to −12.29 kcal mol−1. The results of the 10 ns simulations are generally consistent with those of the 2 ns simulations, as shown in Table S6.† There is no significant difference in any energy term between 2 ns and 10 ns. In addition, the MAE between 2 ns and 10 ns over all energy terms of the same system (denoted MAEa) and over all systems for the same energy term (denoted MAEb) are also calculated and listed in this table. The maximum MAEa is 1.77 kcal mol−1, the maximum MAEb is 1.38 kcal mol−1, and the mean of the MAEs is 0.93 kcal mol−1. This result indicates that a 2 ns MD simulation is long enough to give a stable result. In this work we aim to propose a low-cost, simple and reliable approach for protein–ligand binding free energy calculation, and long MD simulations may not be appropriate for large-scale drug screening. Therefore, the 2 ns MD simulation is used in our study.
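One plausible reading of MAEa and MAEb is a row-wise and a column-wise mean of the absolute 2 ns versus 10 ns differences in a systems-by-terms table. The sketch below follows that reading; the function name, the array layout and the tiny example table are assumptions, not values from Table S6.

```python
import numpy as np


def mae_between_runs(terms_2ns, terms_10ns):
    """Compare two tables of energy terms, shape (n_systems, n_terms).

    Returns (MAEa per system, MAEb per energy term, overall mean).
    """
    diff = np.abs(np.asarray(terms_2ns, dtype=float)
                  - np.asarray(terms_10ns, dtype=float))
    mae_a = diff.mean(axis=1)   # average over energy terms, one per system
    mae_b = diff.mean(axis=0)   # average over systems, one per energy term
    return mae_a, mae_b, float(diff.mean())


# Hypothetical table: 2 systems x 2 energy terms (kcal/mol).
mae_a, mae_b, overall = mae_between_runs([[1.0, 2.0], [3.0, 4.0]],
                                         [[2.0, 2.0], [3.0, 6.0]])
print(mae_a, mae_b, overall)
```

With the real data the table would have 10 rows, one per extended system, and one column per energy term.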

Fitting formula

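After performing the numerical fitting procedure using the training set, the final formulas of the energy estimators ΔGH_IE, ΔGMM_Sol_IE, ΔGPBSA_IE and ΔGall are given by:

$$\Delta G_{\mathrm{H\_IE}} = 0.15178(\Delta E_{\mathrm{ele}} + \Delta E_{\mathrm{vdW}} + \Delta G_{\mathrm{pb}} + \Delta G_{\mathrm{np}}) + 0.16624(-T\Delta S_{\mathrm{IE}}) - 6.09665 \qquad (18)$$

$$\Delta G_{\mathrm{MM\_Sol\_IE}} = 0.15487(\Delta E_{\mathrm{ele}} + \Delta E_{\mathrm{vdW}}) + 0.16623\Delta G_{\mathrm{sol}} + 0.12830(-T\Delta S_{\mathrm{IE}}) - 6.12818 \qquad (19)$$

$$\Delta G_{\mathrm{PBSA\_IE}} = 0.05402(\Delta E_{\mathrm{ele}} + \Delta G_{\mathrm{pb}}) + 0.14852\Delta E_{\mathrm{vdW}} + 0.05584\Delta G_{\mathrm{np}} + 0.11351(-T\Delta S_{\mathrm{IE}}) - 4.77148 \qquad (20)$$

$$\Delta G_{\mathrm{all}} = 0.05867\Delta E_{\mathrm{ele}} + 0.13519\Delta E_{\mathrm{vdW}} + 0.06796\Delta G_{\mathrm{pb}} + 0.25050\Delta G_{\mathrm{np}} + 0.08875(-T\Delta S_{\mathrm{IE}}) - 4.63248 \qquad (21)$$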
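Once fitted, eqn (20) is a closed-form expression that can be applied to any system for which the five energy terms are known. The sketch below encodes the published coefficients; the function name and the input values in the example call are hypothetical.

```python
def dg_pbsa_ie(d_ele, d_vdw, d_pb, d_np, minus_tds):
    """Eqn (20): fitted ΔG_PBSA_IE estimator (all energies in kcal/mol)."""
    return (0.05402 * (d_ele + d_pb)
            + 0.14852 * d_vdw
            + 0.05584 * d_np
            + 0.11351 * minus_tds
            - 4.77148)


# Hypothetical energy terms for a single complex.
example = dg_pbsa_ie(d_ele=-30.0, d_vdw=-40.0, d_pb=40.0, d_np=-4.0,
                     minus_tds=10.0)
print(round(example, 5))
```

The small weights show how strongly the raw MM/PBSA terms are scaled down, with the constant offset setting the baseline of the predicted affinity.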

Training set

Analysis of the correlation and slope. We examine the correlation between the binding free energies calculated from the above four formulas and the experimental values for the 84 protein–ligand complexes in the training set. For comparison, the binding free energies calculated directly by the traditional MM/PBSA method (enthalpy change only, denoted ΔGPBSA) and by MM/PBSA combined with the IE method (enthalpy and entropy changes, denoted ΔGPBSA+IE) are also evaluated, and are shown in Fig. 1(a) and (b). Fig. 1 shows that the binding free energies predicted by the four formulas are concentrated around the regression line, while those obtained from the traditional ΔGPBSA and ΔGPBSA+IE methods are more scattered. Note that the slopes generated by the four estimators are all close to 1.00, compared with the values of 0.10 and 0.15 obtained from the traditional methods (ΔGPBSA and ΔGPBSA+IE, respectively). In addition, both the slope and the correlation coefficient obtained from ΔGPBSA+IE are higher than those obtained from ΔGPBSA, which indicates that the IE method can improve the result of the MM/PBSA approach. This is consistent with previous studies.51–54
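The slope K of a regression line can be obtained with an ordinary least-squares fit of one quantity against the other. The sketch below uses `numpy.polyfit` on synthetic data; it assumes, without confirmation from the text, that K is the slope of the experimental values regressed on the calculated ones, and all numbers are made up.

```python
import numpy as np


def regression_line(x, y):
    """Least-squares slope and intercept of y regressed on x."""
    slope, intercept = np.polyfit(np.asarray(x, dtype=float),
                                  np.asarray(y, dtype=float), 1)
    return float(slope), float(intercept)


# Synthetic "experimental" values driven by three hypothetical energy terms.
rng = np.random.default_rng(1)
features = rng.normal(size=(84, 3))
dg_exp = (-8.0 + features @ np.array([0.5, -0.3, 0.2])
          + rng.normal(scale=1.5, size=84))

# Linear estimator with an intercept, fitted on the same data.
design = np.column_stack([features, np.ones(84)])
coeffs, *_ = np.linalg.lstsq(design, dg_exp, rcond=None)
dg_fit = design @ coeffs

slope_fit, intercept_fit = regression_line(dg_fit, dg_exp)
print(slope_fit, intercept_fit)
```

Under that axis convention, a least-squares fit with an intercept gives a training-set slope of one by construction, so the slope on an independent test set carries separate information.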


The correlation between the theoretical and experimental values is an important test of the performance of an estimator. In this work, both Pearson’s and Spearman’s correlation coefficients are computed, as shown in Fig. 2(a). The two correlation coefficients show nearly the same trend in this figure, indicating that the correlation is not accidental and the calculated correlations are credible. Because the two correlation coefficients are very close in value, Pearson’s correlation is used in the following discussion.
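The two correlation measures can be computed with NumPy alone. The sketch below is illustrative: the function names are hypothetical, and the rank-based Spearman version assumes there are no tied values (a full implementation would assign average ranks to ties).

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    return float(np.corrcoef(np.asarray(x, dtype=float),
                             np.asarray(y, dtype=float))[0, 1])


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (no ties)."""
    def rank(values):
        return np.argsort(np.argsort(np.asarray(values, dtype=float)))
    return pearson(rank(x), rank(y))


# A monotonic but non-linear relation: Spearman is 1, Pearson is below 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]
print(pearson(x, y), spearman(x, y))
```

Pearson's coefficient measures linear association while Spearman's measures monotonic association, so close agreement between them suggests the relation is not dominated by a few outliers.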

Fig. 2(a) shows that the binding free energy calculated only from the enthalpy change has a poor linear correlation with the experimental value, with a correlation coefficient of 0.46. When the entropic contribution is included using the IE method, the linear correlation improves to a correlation coefficient of 0.61. On the other hand, the binding free energies predicted by ΔGPBSA_IE and ΔGall are in better agreement with the experimental values than the other methods, with correlation coefficients of 0.72 and 0.73, respectively. For the other two fitting methods (ΔGH_IE and ΔGMM_Sol_IE), the resulting correlations are nearly the same as those obtained using the ΔGPBSA+IE method; however, their computational errors are significantly reduced, as discussed in the following analysis.

Fig. 1 Scatter plots of experimental and calculated binding free energies in the training set. (a) ΔGPBSA, (b) ΔGPBSA+IE, (c) ΔGH_IE, (d) ΔGMM_Sol_IE, (e) ΔGPBSA_IE and (f) ΔGall. K is the slope of the regression line, and the DF is the degree of freedom of multiple regression.

Considering the performance of the correlation coefficient and slope of the regression lines, the ΔGPBSA_IE and ΔGall methods are considered to be the best approaches among the six estimators, i.e., two traditional methods and four fitting methods.

Analysis of the degree of freedom. We find that the four fitting methods differ in their ability to improve the accuracy of the estimated binding free energy. Hence, we systematically evaluate these methods to determine the optimal fitting method, one that leads to a highly efficient and reliable calculation of binding free energy. From Fig. 1 and 2(a), the correlation coefficients of the four estimators improve from 0.61 to 0.73 as the number of fitting parameters increases, i.e., 3, 4, 5 and 6 for ΔGH_IE, ΔGMM_Sol_IE, ΔGPBSA_IE and ΔGall, respectively. Note, however, that increasing the number of fitting terms means that the calculated result depends more on the corresponding coefficients of these terms. That is, the calculated binding free energy is mainly corrected by the coefficients rather than by the contribution of each energy term. Hence, we measure the degree of dependence of the calculated result on the fitting coefficients with the degree of freedom (DF). Fig. 1(c)–(f) show that the DF decreases as the number of fitting terms increases, as expected from eqn (17), in which each additional parameter lowers the DF by one. Considering the correlation and slope, ΔGPBSA_IE and ΔGall are suitable choices, which achieve almost the same correlation coefficients. However, the DF of ΔGall is lower than that of ΔGPBSA_IE. Therefore, ΔGPBSA_IE is considered the optimal estimator for calculating protein–ligand binding free energy.

Analysis of the error. To further evaluate these methods, two types of error estimation, i.e., the mean absolute error (MAE) and the root mean square error (RMSE), are calculated for the six methods and shown in Fig. 2(b). MAE and RMSE give similar results, so we use MAE as the main error measure in the following work. When the entropic contribution is included using the ΔGPBSA+IE method, the MAE falls to less than half of that of ΔGPBSA, i.e., from 22.52 kcal mol−1 to 10.50 kcal mol−1. Although the ΔGPBSA+IE method is more accurate than the ΔGPBSA method, the error is still very large. We then examine the performance of the four estimators and find that the MAE and RMSE of the four fitting methods are less than 10% of those of the ΔGPBSA method. The lowest MAE is 1.56 kcal mol−1, obtained using the ΔGall method, and the second lowest is 1.59 kcal mol−1, obtained using the ΔGPBSA_IE method. The difference between ΔGPBSA_IE and ΔGall for MAE is only 0.03 kcal mol−1, which can be considered negligible. This suggests that the final energy estimator can be chosen between the ΔGPBSA_IE and ΔGall methods.

Fig. 2 The correlation coefficient, MAE and RMSE of the experimental and calculated binding free energies in the training set. (a) Correlation coefficient and (b) MAE and RMSE.

Fig. 3 Absolute error of the experimental and calculated binding free energies for each system in the training set. The system numbers are listed in Table S2.†

The differences (ΔΔG) between the binding free energies obtained using ΔGPBSA, ΔGPBSA+IE, ΔGPBSA_IE, and ΔGall and the experimental values for all 84 systems are shown in Fig. 3. For the ΔGPBSA method, the minimum error (ΔΔG) is −55.98 kcal mol−1 and the maximum is 5.75 kcal mol−1, so the errors span a range of 61.73 kcal mol−1. The ΔGPBSA+IE method improves these results, with a minimum error of −37.52 kcal mol−1 and a maximum of 10.81 kcal mol−1; however, the range of values is still unacceptable. On the other hand, the ΔGPBSA_IE and ΔGall methods obtain a minimum ΔΔG of −3.89 and −3.84 kcal mol−1 and a maximum ΔΔG of 3.63 and 3.76 kcal mol−1, respectively. The two methods thus give significantly reduced computational errors, with the range of errors being almost one-eighth that of the ΔGPBSA method, indicating that these two fitting methods are reliable and reasonable. As shown in Fig. 3, the ΔΔG values for most systems are consistently close to zero for the ΔGPBSA_IE and ΔGall methods, and the two lines fluctuate in a similar way, which suggests that the calculated values are close to the experimental values. This means that the ΔGPBSA_IE and ΔGall methods can significantly improve the reliability and accuracy of binding free energy calculation. However, considering the correlation coefficient, slope, DF and errors, the optimal estimator for calculating binding free energy is determined to be ΔGPBSA_IE.
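The error measures used in this subsection can be illustrated with a minimal sketch. It assumes that ΔΔG is the calculated value minus the experimental value; the function names and the sample numbers are illustrative and are not taken from the paper's data.

```python
import math


def signed_errors(calculated, experimental):
    """Per-system error, assumed here to be calculated - experimental (kcal/mol)."""
    if len(calculated) != len(experimental):
        raise ValueError("both lists must have the same length")
    return [c - e for c, e in zip(calculated, experimental)]


def mae(errors):
    """Mean absolute error."""
    return sum(abs(x) for x in errors) / len(errors)


def rmse(errors):
    """Root mean square error."""
    return math.sqrt(sum(x * x for x in errors) / len(errors))


def error_range(errors):
    """Width of the interval spanned by the signed errors."""
    return max(errors) - min(errors)


# Made-up values for illustration only (kcal/mol).
calc = [-9.0, -7.5, -11.0, -6.0]
expt = [-8.0, -8.5, -10.0, -9.0]
ddg = signed_errors(calc, expt)
print(mae(ddg), rmse(ddg), error_range(ddg))
```

Applied to the two extremes quoted above for ΔGPBSA, `error_range([-55.98, 5.75])` reproduces the reported range of about 61.73 kcal mol−1.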

Analysis of the typical systems. To evaluate the stability of the MD simulations, the root-mean-square deviation (RMSD) of the backbone atoms of three representative systems relative to their native structures is shown as a function of MD simulation time in Fig. S1.† For these three systems, the error (ΔΔG) between the ΔGPBSA+IE calculation and experiment is, respectively, the maximum (PDBID: 1BIO), close to zero (PDBID: 1C85) and the minimum (PDBID: 1SQP) among all systems. As shown in Fig. S1,† the RMSD values fluctuate below 2.5 Å, which means that these systems are stable during the MD simulations. For the 1BIO and 1SQP systems, the binding free energies calculated from ΔGPBSA+IE are 3.25 and −48.72 kcal mol−1, respectively, which differ from the experimental values by as much as 10.81 and −37.52 kcal mol−1. When the ΔGPBSA_IE method is used, the accuracy significantly improves, with the error being very close to zero. For the 1C85 system, whose ΔGPBSA+IE error is the closest to zero among all systems, the result still matches the experimental value well when the ΔGPBSA_IE method is applied. This indicates that the ΔGPBSA_IE method can not only improve the calculated accuracy for protein–ligand binding free energy, but also maintain the consistency between the theoretical and experimental values for systems that already perform well with the other methods.

Test set

To verify the universality of the fitting method, we use a test set comprising 44 randomly selected systems and compute the various energy terms using MM/PBSA combined with the IE method. We then substitute these terms into the fitted functions to obtain the final theoretical binding free energies. The calculated binding free energies are compared with the corresponding experimental values to evaluate the rationality of these methods. All calculation data are shown in Tables S4 and S5.†
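Substituting the energy terms of a new system into a fitted function amounts to evaluating a weighted sum. The sketch below assumes the functional form described for ΔGPBSA_IE (one weight for the sum of the electrostatic and polar solvation terms, one each for the vdW, non-polar and entropy terms, plus a constant). The class and function names are hypothetical, and the weights in the example are placeholders, not the fitted coefficients of the paper.

```python
from dataclasses import dataclass


@dataclass
class EnergyTerms:
    """Energy terms of one complex in kcal/mol (field names are illustrative)."""
    e_ele: float       # electrostatic interaction energy
    e_vdw: float       # van der Waals interaction energy
    g_pb: float        # polar solvation free energy
    g_np: float        # non-polar solvation free energy
    minus_t_ds: float  # entropic term -TΔS from the IE method


def weighted_estimate(terms, w_polar, w_vdw, w_np, w_entropy, constant):
    """Weighted sum with a single weight shared by (e_ele + g_pb)."""
    return (
        w_polar * (terms.e_ele + terms.g_pb)
        + w_vdw * terms.e_vdw
        + w_np * terms.g_np
        + w_entropy * terms.minus_t_ds
        + constant
    )


# Placeholder weights and made-up energies, for illustration only.
system = EnergyTerms(e_ele=-50.0, e_vdw=-40.0, g_pb=60.0, g_np=-5.0, minus_t_ds=10.0)
print(weighted_estimate(system, 0.1, 0.2, 0.3, 0.2, -1.0))
```

Keeping the weights as explicit arguments makes it straightforward to evaluate the same test-set terms with several fitted functions.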

Analysis of the correlation and slope. Fig. 4 and 5(a) show the relation between the binding free energies predicted by ΔGPBSA, ΔGPBSA+IE, ΔGH_IE, ΔGMM_Sol_IE, ΔGPBSA_IE, and ΔGall and the experimental values for the 44 systems in the test set. As in the training set, the predicted values of the four fitting methods are concentrated around the regression line, while the values calculated by the traditional methods are more scattered. Both the slope of the regression line and the correlation coefficient are much higher for the ΔGPBSA_IE method than for the traditional ΔGPBSA and ΔGPBSA+IE methods, which is similar to what we found for the training set. In particular, the slope of the regression line and the correlation coefficient calculated by ΔGPBSA_IE are almost the same in the training set and the test set: 1.00/0.98 and 0.72/0.72, respectively. The ΔGPBSA_IE method therefore not only improves the accuracy of the predicted binding free energy, but is also stable for unfamiliar systems. Meanwhile, the ΔGPBSA+IE method is numerically more stable than the ΔGPBSA method, because its correlation coefficient changes much less between the two sets: the correlation coefficients are 0.61 and 0.62 in the training set and the test set, respectively, for the ΔGPBSA+IE method, but 0.46 and 0.55 for the ΔGPBSA method. This further verifies that the IE method can improve the numerical stability of the calculated result.
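The two quantities compared here, Pearson's correlation coefficient and the slope of the regression line, can be computed as in the following sketch. Which variable is placed on the x axis is an assumption: the calculated values are taken as x and the experimental values as y, so that an estimator with too large a spread of calculated values gives a slope well below 1. The function name and the data are illustrative.

```python
import math


def pearson_and_slope(x, y):
    """Return (Pearson r, least-squares slope of y regressed on x)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least two points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("constant input has no defined correlation")
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    return r, slope


# Made-up values (kcal/mol): calculated energies spread ten times wider
# than the experimental ones, so the slope is 0.1 although r is 1.
calculated = [-10.0, -30.0, -50.0, -70.0]
experimental = [-6.0, -8.0, -10.0, -12.0]
print(pearson_and_slope(calculated, experimental))
```

The example shows why the two indices are reported together: a method can rank ligands perfectly and still have a slope far from 1.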

Analysis of the error. Fig. 5(b) shows the errors between the theoretical and the experimental binding free energies for the test set. In terms of both MAE and RMSE, the ΔGPBSA_IE method is significantly better than the ΔGPBSA and ΔGPBSA+IE methods. The errors obtained by the ΔGPBSA_IE method are the second lowest of all methods, but very close to the lowest values, which are calculated by the ΔGall method. The difference in accuracy between ΔGPBSA_IE and ΔGall is significantly narrowed from the training set to the test set. Comparing Fig. 2(b) with Fig. 5(b), the difference in MAE between ΔGPBSA_IE and ΔGall is about 0.02 kcal mol−1 in the training set and test set. Meanwhile, the difference in RMSE between the two methods is very small in the two sets. However, the DF of the ΔGall method is smaller than that of the ΔGPBSA_IE method, which means that ΔGall depends more strongly on the fitted weights than ΔGPBSA_IE. Considering the balance between efficiency and accuracy, the ΔGPBSA_IE method is more appropriate than the ΔGall method for estimating the binding free energy of unfamiliar systems. This is why we select the ΔGPBSA_IE method instead of the ΔGall method for predicting the binding affinities of protein–ligand complexes.

Fig. 6 shows ΔΔG between the theoretical binding free energy and the experimental binding affinity for each system. For the ΔGPBSA method, the maximum value of ΔΔG is −3.23 kcal mol−1 and the minimum value is −50.46 kcal mol−1; hence, the range is 47.23 kcal mol−1. For the ΔGPBSA+IE method, the maximum value of ΔΔG is 8.94 kcal mol−1 and the minimum value is −29.58 kcal mol−1; hence, the range is 38.52 kcal mol−1 in the test set. These wide ranges suggest that the computational accuracies of the ΔGPBSA and ΔGPBSA+IE methods are poor. The ΔΔG curve of the ΔGPBSA_IE method in Fig. 6 shows that all points lie close to the line where ΔΔG is equal to zero. The range of ΔΔG is 6.67 kcal mol−1, with a maximum value of 2.65 kcal mol−1 and a minimum value of −4.02 kcal mol−1. For comparison, in the ΔGall method, the range of ΔΔG is 6.90 kcal mol−1, with a maximum value of 2.74 kcal mol−1 and a minimum value of −4.16 kcal mol−1, which indicates that the accuracy of the predicted binding free energy is almost the same as that of the ΔGPBSA_IE method. The maximum and minimum values calculated by the ΔGall and ΔGPBSA_IE methods occur at the same systems, PDBID 1X7B and 1W7H, respectively. Although the difference is not significant, the maximum value of ΔΔG in the ΔGall method is larger than that in the ΔGPBSA_IE method (2.74 vs. 2.65 kcal mol−1), while the minimum value in the ΔGall method is smaller than that in the ΔGPBSA_IE method (−4.16 vs. −4.02 kcal mol−1). This suggests that the calculated uncertainty of the ΔGall method is greater than that of the ΔGPBSA_IE method for new protein–ligand systems, because the DF of the ΔGall method is smaller than that of the ΔGPBSA_IE method.

Fig. 4 Scatter plots of the experimental and calculated binding free energies in the test set. (a) ΔGPBSA, (b) ΔGPBSA+IE, (c) ΔGH_IE, (d) ΔGMM_Sol_IE, (e) ΔGPBSA_IE and (f) ΔGall. K is the slope of the regression line.

Analysis of the typical systems. Three representative systems in the test set are used to examine the efficiency of the ΔGPBSA_IE method. For these systems, the ΔΔG between the value calculated by the ΔGPBSA+IE method and the experimental value is, respectively, the maximum (PDBID: 2D2V), close to zero (PDBID: 2B1Q) and the minimum (PDBID: 1XH6). We first examine the stability of the MD simulations. Fig. S2† shows the evolution of the RMSD and the native structures for the three systems. The RMSDs fluctuate below 1.6 Å, which means that all structures are highly stable within the simulated time frame. The differences between the theoretical and experimental binding free energies for 2D2V and 1XH6 obtained by the ΔGPBSA+IE method are 8.94 kcal mol−1 and −29.58 kcal mol−1, respectively. However, the accuracy for these systems is significantly improved when the ΔGPBSA_IE method is used, giving ΔΔG values very close to zero. The other representative system, 2B1Q, whose ΔΔG from ΔGPBSA+IE is the closest to zero among all systems, still matches the experimental data when the ΔGPBSA_IE method is applied. The same behaviour is observed in the training set, which indicates that the ΔGPBSA_IE method can reliably improve the accuracy of the predicted binding free energy.

The property of the free energy estimator

Analysis of the rationality for IE calculation. It should be mentioned that the calculation of the entropic contribution using the IE method is based on the approximation that the structures of the protein and ligand remain unchanged before and after binding,42 which is in agreement with the idea of MM/PBSA using the single trajectory (complex only) approach. To examine the conformational changes of the complex during binding, we pick the 20 complexes with the maximum RMSD in MD simulation from the training and test sets (10 systems from each set). Their apoproteins are then isolated from the initial complexes, and 2 ns MD simulations are performed under the same conditions as for the complexes. We first calculate the two-dimensional RMSD (2d-RMSD) of the Cα atoms in the binding pocket for the protein–ligand complexes and apoproteins based on the last 1 ns of the trajectories. The cross-comparison of 2d-RMSD between the apo system and the complex system is shown in Fig. S3 and S4.† The 2d-RMSD values are color-coded from black to yellow. As shown in these figures, the 2d-RMSD values between the apoprotein trajectories and the complex trajectories are all less than 4 Å, and almost all are below 2 Å, which suggests that the apoprotein and the complex are similar to each other and display no large conformational differences. To probe the effect of ligand binding on the structural fluctuation of each residue, we then compare the root-mean-square fluctuation (RMSF) in the binding pocket between the complex simulation and the apoprotein simulation. The RMSF and its MAE for each system are shown in Fig. S5 and S6.† The RMSFs in the apo and complex simulations are similar and their MAE values are very low, which indicates that the conformational changes of the protein before and after ligand binding are very slight. Since the systems selected here are those with the largest RMSD, the conformations of the other systems should also exhibit no significant change in the process of binding.
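The two structural measures used in this check can be sketched as follows. This is a simplified illustration, not the analysis pipeline of the paper: it assumes that all frames have already been superimposed on a common reference and that each frame is a list of (x, y, z) coordinates for the same pocket Cα atoms. The function names are hypothetical.

```python
import math


def rmsd(frame_a, frame_b):
    """RMSD between two pre-aligned frames given as lists of (x, y, z) tuples."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must contain the same atoms")
    total = sum(
        (a[k] - b[k]) ** 2 for a, b in zip(frame_a, frame_b) for k in range(3)
    )
    return math.sqrt(total / len(frame_a))


def two_d_rmsd(traj_a, traj_b):
    """Matrix whose entry [i][j] is the RMSD of frame i of traj_a vs frame j of traj_b."""
    return [[rmsd(fa, fb) for fb in traj_b] for fa in traj_a]


def rmsf(traj):
    """Per-atom root-mean-square fluctuation about the trajectory-average position."""
    n_frames = len(traj)
    n_atoms = len(traj[0])
    result = []
    for i in range(n_atoms):
        mean = [sum(frame[i][k] for frame in traj) / n_frames for k in range(3)]
        msf = sum(
            (frame[i][k] - mean[k]) ** 2 for frame in traj for k in range(3)
        ) / n_frames
        result.append(math.sqrt(msf))
    return result


# Tiny made-up trajectories (coordinates in Å): one apo frame, two complex frames.
apo = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
cpx = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]]
print(two_d_rmsd(apo, cpx))
print(rmsf(cpx))
```

Comparing the apo and complex RMSF lists element by element and averaging the absolute differences would give a per-system MAE of the kind reported in the supplementary figures.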

Analysis of the precision of method. For a newly developed method, an analysis of precision is essential for establishing credibility. In this study, the standard error of the mean (SEM) of each energy term is first calculated, and the precision of the fitted binding free energy of each system is then obtained by the error propagation formula. The detailed data are shown in Tables S2–S5.† To visualize and compare the precision of the six methods, a violin plot of the distributions of precision in predicting the binding free energy for the various estimators is shown in Fig. 7. In the figure, the white circle represents the median of the precision. The violet dashed horizontal line is the reference line at the median, whereas the black dashed horizontal lines mark the first and third quartiles of the bootstrap distribution. Since the precisions of the ΔGPBSA and ΔGPBSA+IE methods differ by an order of magnitude from those of the four other methods, part of the image is enlarged.

Fig. 5 The correlation coefficient, MAE and RMSE of the experimental and calculated binding free energies in the test set. (a) Correlation coefficient and (b) MAE and RMSE.

Fig. 6 Absolute error of the experimental and calculated binding free energies for each system in the test set. The system numbers are listed in Table S4.†

Fig. 7 shows that the precision of ΔGPBSA and ΔGPBSA+IE ranges from 0 to 2.25 kcal mol−1, with the bulk of the distribution between 0.43 and 0.80 kcal mol−1, while that of our four fitted methods ranges from 0 to 0.40 kcal mol−1, with the bulk of the distribution between 0.03 and 0.13 kcal mol−1. This suggests that the fitting significantly improves the precision of the calculated binding free energy. In both the training set and the test set, the ΔGPBSA_IE method has the smallest median, the lowest 25% and 75% distribution lines, and the narrowest interval between the 25% and 75% lines, which indicates that the ΔGPBSA_IE approach performs best in terms of precision. A smaller SEM corresponds to a higher precision.

This is especially important in high-throughput virtual screening of drugs, because small molecules that target the same protein are likely to have similar functional groups, which often results in similar binding free energies. A more precise calculation of binding free energy is therefore necessary for screening drug candidates. In addition, the area of the violin plot representing the precision of the ΔGPBSA_IE approach is the smallest of all methods, which means that the precision of this method is the most concentrated of all estimators. The method can therefore also minimize the fluctuations caused by different protein–ligand systems when calculating the binding free energy, which means that it has better applicability. The above analysis evaluates the fitting methods from the perspective of precision and illustrates that the ΔGPBSA_IE method is the optimal solution.
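The precision estimate described in this subsection (an SEM for each energy term, followed by error propagation through the fitted linear function) can be sketched as below. Two assumptions are made that the text does not state: the per-frame samples are treated as uncorrelated when computing the SEM, and the energy terms are treated as independent in the propagation. The function names and numbers are illustrative.

```python
import math
import statistics


def sem(samples):
    """Standard error of the mean: sample standard deviation / sqrt(N)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return statistics.stdev(samples) / math.sqrt(len(samples))


def propagated_error(weights, sems):
    """Uncertainty of sum_k w_k * x_k for independent terms x_k.

    The additive constant of a fitted function carries no sampling error,
    so it does not appear here.
    """
    if len(weights) != len(sems):
        raise ValueError("one weight per SEM is required")
    return math.sqrt(sum((w * s) ** 2 for w, s in zip(weights, sems)))


# Made-up per-frame energies (kcal/mol) for two terms, with placeholder weights.
term_a = [-40.2, -39.8, -40.5, -39.5, -40.0]
term_b = [12.1, 11.9, 12.4, 11.6, 12.0]
print(propagated_error([0.2, 0.1], [sem(term_a), sem(term_b)]))
```

Because the fitted weights are all smaller than 1, the propagated uncertainty of a fitted estimator is smaller than that of the unweighted sum of the same terms, which is in line with the order-of-magnitude difference seen in the violin plot.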

Analysis of outliers. As shown in Fig. 1(e) and (f), there are six outliers, which are specially marked in the S1 and S3 regions of the training set shown in Fig. 8(a) and (b). Analysis of these outliers shows that the vdW interactions of these systems are unusual. For the systems in the S1 region (1HIV, 1BIL, 1SQP, 1NPV and 1LF3), the vdW interactions are significantly higher than those of the other systems. For the system in the S3 region (1CZE), the vdW interaction is the lowest of all systems. According to eqn (20) and (21), the fitted parameter of the vdW term is the largest among all fitted parameters, which means that the final calculated binding free energy depends strongly on the vdW term. In fact, these outliers are not bad points, but play an important role in the overall fit. That is to say, an ideal multivariate linear fit requires not only sufficient samples, but also a wide range of values, so that the fitted equation has better applicability.

To examine this judgement, we remove the systems in the S1 and S3 regions and refit the equations. As shown in Fig. 8(e)–(h), the correlation coefficient improves in the training set (e and f), but decreases slightly in the test set (g and h). In particular, the slope drops from about 1 (as shown in Fig. 8(c) and (d)) to about 0.74. The equation fitted without the outliers gives good results in the training set, but worse results in the test set. This result indicates that including the outliers improves the predicted accuracy on the test set, which also validates our previous conclusion. These outliers are useful for our final fitting results, because they not only increase the diversity of the training set data, but also improve the applicability of the fitted equation.
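The refitting experiment can be sketched as an ordinary multivariate least-squares fit that is repeated with selected systems excluded. The solver below uses the normal equations with Gaussian elimination so that it needs only the standard library. This is an assumption about the procedure, since the fitting software is not named in this section; all names and data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough independent samples")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear(features, targets):
    """Least-squares weights for targets ≈ sum_k w_k·feature_k + constant.

    Returns the weights followed by the constant term.
    """
    rows = [list(f) + [1.0] for f in features]
    p = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    return solve_linear_system(ata, atb)


def fit_without(features, targets, excluded):
    """Refit after dropping the systems whose indices are in `excluded`."""
    keep = [i for i in range(len(targets)) if i not in excluded]
    return fit_linear([features[i] for i in keep], [targets[i] for i in keep])


# Made-up data: two energy terms per system; the last system has an extreme
# first term and deviates from the trend followed by the others.
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (9.0, 1.0)]
targets = [1.0, 3.0, 4.0, 6.0, 8.0, 25.0]
print(fit_linear(features, targets))
print(fit_without(features, targets, {5}))
```

Scoring both sets of weights on a held-out test set, as done in Fig. 8, then shows whether the excluded systems were helping or hurting the transferability of the fit.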

Meaning of fitted parameters. In terms of sign, all fitted parameters except the final constant term are positive, which is in line with the idea of PBSA, i.e.,

$$\Delta G_{\mathrm{bind}} = \Delta E_{\mathrm{ele}} + \Delta E_{\mathrm{vdW}} + \Delta G_{\mathrm{pb}} + \Delta G_{\mathrm{np}} - T\Delta S \tag{22}$$

Fig. 7 The violin plot of precision distributions of various estimators in training and test sets.


In terms of absolute values, the fitted parameters are very small (eqn (18)–(21)), and the largest value is only 0.25050, which seems to explain why the standard MM/PBSA often overestimates binding free energy.

In terms of relative values, the fitted parameter of the vdW term is the largest, which means that this term plays a vital role in binding free energy calculation. In studies of the interaction mechanisms of biological macromolecules, the vdW interaction has been one of the main contributions promoting protein–ligand binding, and it is also a key entry point for the development and design of drugs.49,51,67,68

The second largest fitted parameter is that of the entropy term, which plays an important role in characterizing the absolute binding free energy. In fact, when the entropic contribution is ignored, the MM/PBSA method often performs well in qualitative analysis, but has major shortcomings in quantitative binding free energy calculation. Balancing efficiency and accuracy has always been a major challenge in the calculation of entropy. The interaction entropy method is not only recognized in previous works but also shows excellent performance in this work.

The third largest fitted parameter is that of ΔGnp, which means that this term is indispensable even though the value of the non-polar solvation energy is small. The non-polar solvation energy term has been a topic of active research for many years, and many methods have been proposed to improve the accuracy of its calculation, such as cavity and dispersion (CD),69 the polarizable continuum model (PCM)70 and so on. Compared to the other energy terms, the non-polar solvation energy term makes a smaller contribution.

Interestingly, a comparison of eqn (20) (ΔGPBSA_IE method) with eqn (21) (ΔGall method) shows that the coefficients of the electrostatic term and of the polar solvation energy are very close in eqn (21), which is consistent with eqn (20), where the electrostatic and polar solvation terms are fitted as a whole. Considering the influence of the number of fitted terms on the applicability, the ΔGPBSA_IE method is more advantageous. This also explains why ΔGPBSA_IE instead of ΔGall is used as our final method. In addition, the fitting coefficient of the electrostatic term and the polar term is the smallest, which means that these two terms are the most strongly rescaled. In recent years, many researchers have assigned higher dielectric constants to proteins to correct the overestimated electrostatic terms and polar solvation energy, which has achieved very good results in related research.31,53,55,71–74

Conclusion

This paper proposes four estimators (i.e., ΔGH_IE, ΔGMM_Sol_IE, ΔGPBSA_IE and ΔGall) based on the MM/PBSA approach combined with the IE method for highly efficient and accurate calculation of protein–ligand binding free energy. Various energy terms are calculated in the MM/PBSA and IE methods, and then fitted to a training set with 84 protein–ligand complexes to obtain the optimized coefficients for the four fitting functions. By comparing the four estimators, we find that the ΔGPBSA_IE method is superior to the other three methods in terms of accuracy and numerical stability. The universality of ΔGPBSA_IE is further verified on a test set with 44 protein–ligand systems. Our results can be summarized as follows:

First, compared to the traditional MM/PBSA method, the inclusion of the entropic contribution using the IE method not only improves the correlation (correlation coefficients increase from 0.46 to 0.61 in the training set and from 0.55 to 0.62 in the test set) but also reduces the absolute error between the calculated and experimental values (MAEs decrease from 22.52 to 10.50 kcal mol−1 in the training set and from 20.94 to 12.37 kcal mol−1 in the test set). This indicates that the IE method can improve the computational accuracy.

Fig. 8 Scatter plots of the experimental and calculated binding free energies with and without outliers, comparing the ΔGPBSA_IE and ΔGall methods in the training set and test set. K is the slope of the regression line, and R is Pearson’s correlation coefficient. (a) and (b) are the training set and (c) and (d) are the test set, all with the outliers retained. (e)–(h) have the outliers removed: (e) and (f) are the training set and (g) and (h) are the test set.

Second, although the ΔGPBSA+IE method produces a more accurate prediction of the binding free energy than the MM/PBSA method, the computational errors are still far from our expectation. Therefore, by giving different weights to the energy terms of ΔGPBSA+IE using multivariate linear fitting, we obtain four types of binding free energy estimators, i.e., ΔGH_IE (eqn (18)), ΔGMM_Sol_IE (eqn (19)), ΔGPBSA_IE (eqn (20)) and ΔGall (eqn (21)). With the four fitting methods, the accuracy of the calculated binding free energy is again significantly improved. The slope of the regression lines increases to about 1.00, while the correlation coefficients are higher than 0.61. More importantly, the predicted values are in good agreement with the experimental values, with an MAE of less than 1.78 kcal mol−1. Meanwhile, by comparing the correlations and errors, we find that the ΔGPBSA_IE and ΔGall methods are the optimal methods; they share a similar ability to predict the binding free energy, with correlation coefficients of about 0.72 and errors that fluctuate around 0.

Third, compared to the two traditional unfitted methods (ΔGPBSA and ΔGPBSA+IE), our four estimators exhibit better universality and stability in binding free energy prediction. The evaluation indices of the traditional ΔGPBSA and ΔGPBSA+IE methods change between the training set and the test set, whereas those of the four fitting methods do not.

Fourth, for the training and test sets, the performances of the ΔGPBSA_IE and the ΔGall methods are always optimal in terms of correlation and MAE. However, the ΔGall method has a smaller DF than the ΔGPBSA_IE method, and a smaller DF means that the final theoretical value depends more strongly on the fitted weights. Therefore, the ΔGPBSA_IE method is chosen as the final estimator. Besides, the ΔGPBSA_IE method transfers better to the test set: the differences between its training set and test set results are smaller than those of the ΔGall method.

Fifth, the ΔGPBSA_IE method has the minimum SEM in the calculation of binding free energy, which means that it has the highest precision. Meanwhile, the ΔGPBSA_IE method not only improves the computational accuracy for systems whose binding free energy is poorly predicted by the other methods, but also maintains the consistency of the experimental and theoretical values for systems that perform well with the other methods.

Finally, the analysis of the fitted parameters shows that the vdW term plays a vital role in the prediction of protein–ligand binding free energy. Besides, the non-polar solvation energy term is indispensable even though its value is small. Meanwhile, the accurate calculation of entropy significantly improves the accuracy of the binding free energy. The overestimation of electrostatic energy is probably one of the most important reasons why the MM/PBSA method overestimates binding free energy.

Overall, the introduction of the IE method significantly improves the accuracy of binding free energy calculations and addresses the difficulty in calculating the entropy change. In this paper, by combining the advantages of the MM/PBSA method and IE theory, we correct the energy terms with reasonable parameters and obtain an optimal solution, the ΔGPBSA_IE method. This method is a valuable, highly accurate and widely applicable estimator for the prediction of binding free energy. It is helpful for drug screening, understanding protein–ligand interactions, and elucidating related biological processes.

Conflicts of interest

There are no conflicts to declare.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 11774207, 11574184, and 91753103) and the NYU Global Seed Grant. We also thank the ECNU Public Platform for Innovation 001 for providing supercomputer time.

References

1 P. A. Bash, M. J. Field and M. Karplus, J. Am. Chem. Soc., 1987, 109, 8092–8094.

2 S. N. Rao, U. C. Singh, P. A. Bash and P. A. Kollman, Nature, 1987, 328, 551–554.

3 S. Hirono and P. A. Kollman, J. Mol. Biol., 1990, 212, 197–209.

4 B. Rao and U. C. Singh, J. Am. Chem. Soc., 1990, 112, 3803–3811.

5 P. Kollman, Chem. Rev., 1993, 93, 2395–2417.

6 Y. Kita, T. Arakawa, T. Y. Lin and S. N. Timasheff, Biochemistry, 1994, 33, 15178–15189.

7 W. L. Jorgensen and L. L. Thomas, J. Chem. Theory Comput., 2008, 4, 869–876.

8 D. L. Beveridge and F. DiCapua, Annu. Rev. Biophys. Biophys. Chem., 1989, 18, 431–492.

9 C. X. Wang, Y. Y. Shi, F. Zhou and L. Wang, Proteins: Struct., Funct., Bioinf., 1993, 15, 5–9.

10 M. Zacharias, T. P. Straatsma and J. A. Mccammon, J. Chem. Phys., 1994, 100, 9025–9031.

11 T. P. Straatsma and H. J. C. Berendsen, J. Chem. Phys., 1988, 89, 5876–5886.

12 S. Kamath, E. Coutinho and P. Desai, J. Biomol. Struct. Dyn., 1999, 16, 1239–1244.

13 K. W. Wu, P. C. Chen, J. Wang and Y. C. Sun, J. Comput. Aided Mol. Des., 2012, 26, 1159–1169.

14 M. Lawrenz, R. Baron, Y. Wang and J. A. McCammon, in Computational Drug Discovery and Design, ed. R. Baron, Springer, New York, NY, 2012, vol. 819, pp. 469–486.


15 V. Gapsys and B. L. de Groot, J. Chem. Theory Comput.,2017, 13, 6275–6289.

16 E. C. Wang, H. Y. Sun, J. M. Wang, Z. Wang, H. Liu,J. Z. H. Zhang and T. J. Hou, Chem. Rev., 2019, 119, 9478–9508.

17 Y. Z. Li, J. Shen, X. Q. Sun, W. H. Li, G. X. Liu and Y. Tang,J. Chem. Inf. Model., 2010, 50, 1134–1146.

18 D. Plewczynski, M. Łaźniewski, R. Augustyniak andK. Ginalski, J. Comput. Chem., 2011, 32, 742–755.

19 X. B. Hou, J. T. Du, J. Zhang, L. P. Du, H. Fang and M. Y. Li,J. Chem. Inf. Model., 2013, 53, 188–200.

20 E. Yuriev, J. Holien and P. A. Ramsland, J. Mol. Recognit.,2015, 28, 581–604.

21 Z. Wang, H. Y. Sun, X. J. Yao, D. Li, L. Xu, Y. Y. Li, S. Tianand T. J. Hou, Phys. Chem. Chem. Phys., 2016, 18, 12964–12975.

22 A. Nicholls and B. Honig, J. Comput. Chem., 1991, 12, 435–445.

23 P. A. Kollman, I. Massova, C. Reyes, B. Kuhn, S. H. Huo,L. Chong, M. Lee, T. Lee, Y. Duan, W. Wang, O. Donini,P. Cieplak, J. Srinivasan, D. A. Case and T. E. Cheatham,Acc. Chem. Res., 2000, 33, 889–897.

24 I. Massova and P. A. Kollman, Perspect. Drug Discovery Des.,2000, 18, 113–135.

25 W. Rocchia, E. Alexov and B. Honig, J. Phys. Chem. B, 2001,105, 6507–6514.

26 R. Luo, L. David and M. K. Gilson, J. Comput. Chem., 2002,23, 1244–1253.

27 B. Kuhn, P. Gerber, T. Schulz-Gasch and M. Stahl, J. Mol.Biol., 2005, 48, 4040–4048.

28 Z. Chen, N. A. Baker and G. W. Wei, J. Comput. Chem.,2010, 229, 8231–8258.

29 J. Z. Chen, J. N. Wang, W. L. Zhu and G. H. Li, J. Comput.Aided Mol. Des., 2013, 27, 965–974.

30 T. Fu, X. Wu, Z. L. Xiu, J. G. Wang, L. Yin and G. H. Li,J. Theor. Comput. Chem., 2013, 12, 1341003.

31 T. J. Hou, J. M. Wang, Y. Y. Li and W. Wang, J. Comput.Chem., 2011, 32, 866–877.

32 F. Chen, H. Liu, H. Y. Sun, P. C. Pan, Y. Y. Li, D. Li andT. J. Hou, Phys. Chem. Chem. Phys., 2016, 18, 22129–22139.

33 I. Maffucci and A. Contini, J. Chem. Inf. Model., 2016, 56,1692–1704.

34 T. Feng, F. Chen, Y. Kang, H. Y. Sun, H. Liu, D. Li, F. Zhuand T. J. Hou, J. Cheminf., 2017, 9, 66.

35 N. Gō and H. A. Scheraga, J. Chem. Phys., 1969, 51, 4751–4767.

36 D. T. Nguyen and D. A. Case, J. Phys. Chem., 1985, 89,4020–4026.

37 B. R. Brooks, D. Janežič and M. Karplus, J. Comput. Chem.,1995, 16, 1522–1542.

38 J. Srinivasan, T. E. Cheatham, P. Cieplak, P. A. Kollmanand D. A. Case, J. Am. Chem. Soc., 1998, 120, 9401–9409.

39 M. Karplus and J. N. Kushick, Macromolecules, 1981, 14,325–332.

40 J. M. Wang, P. Morin, W. Wang and P. A. Kollman, J. Am.Chem. Soc., 2001, 123, 5221–5230.

41 S. H. Chong and S. Ham, J. Phys. Chem. B, 2015, 119,12623–12631.

42 L. L. Duan, X. Liu and J. Z. H. Zhang, J. Am. Chem. Soc.,2016, 138, 5722–5728.

43 L. Q. Qiu, Y. N. Yan, Z. X. Sun, J. N. Song andJ. Z. H. Zhang, Wiley Interdiscip. Rev.: Comput. Mol. Sci.,2017, 8, e1342.

44 Z. X. Sun, Y. N. Yan, M. Y. Yang and J. Z. H. Zhang, J. Chem.Phys., 2017, 146, 124124.

45 M. Aldeghi, M. J. Bodkin, S. Knapp and P. C. Biggin,J. Chem. Inf. Model., 2017, 57, 2203–2221.

46 Y. N. Yan, M. Y. Yang, C. G. Ji and J. Z. H. Zhang, J. Chem.Inf. Model., 2017, 57, 1112–1122.

47 Y. C. Li, Y. L. Cong, G. Q. Feng, S. S. Zhong, J. Z. H. Zhang,H. Y. Sun and L. L. Duan, Struct. Dyn., 2018, 5,064101.

48 J. Z. Chen, X. Y. Wang, J. Z. H. Zhang and T. Zhu, ACSOmega, 2018, 3, 18052–18064.

49 Z. R. Xiao, Y. L. Cong, K. F. Huang, S. S. Zhong, J. Z. H. Zhang and L. L. Duan, Phys. Chem. Chem. Phys., 2019, 21, 20951–20964.

50 L. P. He, J. X. Bao, Y. P. Yang, S. Z. Dong, L. J. Zhang, Y. F. Qi and J. Z. H. Zhang, J. Chem. Inf. Model., 2019, 59, 3871–3878.

51 L. L. Duan, G. Q. Feng, X. W. Wang, L. Z. Wang and Q. G. Zhang, Phys. Chem. Chem. Phys., 2017, 19, 10140–10152.

52 Y. L. Cong, Y. C. Li, K. Jin, S. S. Zhong, J. Z. H. Zhang, H. Li and L. L. Duan, Front. Chem., 2018, 6, 18.

53 X. Liu, L. Peng and J. Z. H. Zhang, J. Chem. Inf. Model., 2019, 59, 272–281.

54 M. X. Li, Y. L. Cong, Y. C. Li, S. S. Zhong, R. Wang, H. Li and L. L. Duan, Front. Chem., 2019, 7, 33.

55 H. Y. Sun, Y. Y. Li, S. Tian, L. Xu and T. J. Hou, Phys. Chem. Chem. Phys., 2014, 16, 16719–16729.

56 P. Mikulskis, S. Genheden, K. Wichmann and U. Ryde, J. Comput. Chem., 2012, 33, 1179–1189.

57 M. Kaukonen, P. R. Söderhjelm, J. Heimdal and U. Ryde, J. Phys. Chem. B, 2008, 112, 12537–12548.

58 Z. H. Liu, M. Y. Su, L. Han, J. Liu, Q. F. Yang, Y. Li and R. X. Wang, Acc. Chem. Res., 2017, 50, 302–309.

59 H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov and P. E. Bourne, Nucleic Acids Res., 2000, 28, 235–242.

60 C. I. Bayly, P. Cieplak, W. Cornell and P. A. Kollman, J. Phys. Chem., 1993, 97, 10269–10280.

61 W. D. Cornell, P. Cieplak, C. I. Bayly and P. A. Kollman, J. Am. Chem. Soc., 1993, 115, 9620–9631.

62 D. A. Case, D. S. Cerutti, T. E. Cheatham III, T. A. Darden, R. E. Duke, T. J. Giese, H. Gohlke, A. W. Goetz, D. Greene, N. Homeyer, S. Izadi, A. Kovalenko, T. S. Lee, S. LeGrand, P. F. Li, C. Lin, J. Liu, T. Luchko, R. Luo, D. Mermelstein, K. M. Merz, G. Monard, H. Nguyen, I. Omelyan, A. Onufriev, F. Pan, R. X. Qi, D. R. Roe, A. Roitberg, C. Sagui, C. L. Simmerling, W. M. Botello-Smith, J. Swails, R. C. Walker, J. Wang, R. M. Wolf, X. W. Wu, L. Xiao, D. M. York and P. A. Kollman, AMBER 2017, University of California, San Francisco, 2017.

63 R. W. Pastor, B. R. Brooks and A. Szabo, Mol. Phys., 1988, 65, 1409–1419.

64 J. P. Ryckaert, G. Ciccotti and H. J. C. Berendsen, J. Comput. Phys., 1977, 23, 327–341.

65 K. A. Sharp and B. Honig, J. Phys. Chem., 1990, 94, 7684–7692.

66 J. Weiser, P. S. Shenkin and W. C. Still, J. Comput. Chem., 1999, 20, 217–230.

67 J. Z. Chen, X. Y. Wang, L. X. Pang, J. Z. H. Zhang and T. Zhu, Nucleic Acids Res., 2019, 47, 6618–6631.

68 D. D. Huang, W. Wen, X. Liu, Y. Li and J. Z. H. Zhang, RSC Adv., 2019, 9, 14944–14956.

69 C. H. Tan, Y. H. Tan and R. Luo, J. Phys. Chem. B, 2007, 111, 12263–12274.

70 V. Barone, M. Cossi and J. Tomasi, J. Chem. Phys., 1997, 107, 3210–3221.

71 H. Y. Sun, Y. Y. Li, M. Y. Shen, S. Tian, L. Xu, P. C. Pan, G. Yan and T. J. Hou, Phys. Chem. Chem. Phys., 2014, 16, 22035–22045.

72 T. Y. Yang, J. C. Wu, C. L. Yan, Y. F. Wang, R. Luo, M. B. Gonzales, K. N. Dalby and P. Y. Ren, Proteins: Struct., Funct., Bioinf., 2011, 79, 1940–1951.

73 P. Soderhjelm, J. Kongsted and U. Ryde, J. Chem. Theory Comput., 2010, 6, 1726–1737.

74 T. Venken, D. Krnavek, J. Münch, F. Kirchhoff, P. Henklein, M. De Maeyer and A. Voet, Proteins: Struct., Funct., Bioinf., 2011, 79, 3221–3235.
