SAMPL6 Challenge Results from pKa Predictions Based on a ...

21
doi.org/10.26434/chemrxiv.6406505.v1 SAMPL6 Challenge Results from pKa Predictions Based on a General Gaussian Process Model Caitlin C. Bannan, David L. Mobley, Geoff Skillman Submitted date: 01/06/2018 Posted date: 04/06/2018 Licence: CC BY 4.0 Citation information: Bannan, Caitlin C.; Mobley, David L.; Skillman, Geoff (2018): SAMPL6 Challenge Results from pKa Predictions Based on a General Gaussian Process Model. ChemRxiv. Preprint. A variety of fields would benefit from accurate pK a predictions, especially drug design due to the affect a change in ionization state can have on a molecules physiochemical properties.Participants in the recent SAMPL6 blind challenge were asked to submit predictions for microscopic and macroscopic pK a s of 24 drug like small molecules.We recently built a general model for predicting pK a s using a Gaussian process regression trained using physical and chemical features of each ionizable group.Our pipeline takes a molecular graph and uses the OpenEye Toolkits to calculate features describing the removal of a proton.These features are fed into a Scikit-learn Gaussian process to predict microscopic pK a s which are then used to analytically determine macroscopic pK a s.Our Gaussian process is trained on a set of 2,700 macroscopic pK a s from monoprotic and select diprotic molecules.Here, we share our results for microscopic and macroscopic predictions in the SAMPL6 challenge.Overall, we ranked in the middle of the pack compared to other participants, but our fairly good agreement with experiment is still promising considering the challenge molecules are chemically diverse and often polyprotic while our training set is predominately monoprotic.Of particular importance to us when building this model was to include an uncertainty estimate based on the chemistry of the molecule that would reflect the likely accuracy of our prediction. Our model reports large uncertainties for the molecules that appear to have chemistry outside our domain of applicability, along with good agreement in quantile-quantile plots, indicating it can predict its own accuracy.The challenge highlighted a variety of means to improve our model, including adding more polyprotic molecules to our training set and more carefully considering what functional groups we do or do not identify as ionizable. File list (3) download file view on ChemRxiv sampl6_preprint.pdf (745.74 KiB) download file view on ChemRxiv sampl6_SI.pdf (618.58 KiB) download file view on ChemRxiv sampl6_SI.tar (11.82 MiB)

Transcript of SAMPL6 Challenge Results from pKa Predictions Based on a ...

Page 1: SAMPL6 Challenge Results from pKa Predictions Based on a ...

doi.org/10.26434/chemrxiv.6406505.v1

SAMPL6 Challenge Results from pKa Predictions Based on a GeneralGaussian Process ModelCaitlin C. Bannan, David L. Mobley, Geoff Skillman

Submitted date: 01/06/2018 • Posted date: 04/06/2018Licence: CC BY 4.0Citation information: Bannan, Caitlin C.; Mobley, David L.; Skillman, Geoff (2018): SAMPL6 Challenge Resultsfrom pKa Predictions Based on a General Gaussian Process Model. ChemRxiv. Preprint.

A variety of fields would benefit from accurate pKa predictions, especially drug design due to the affect achange in ionization state can have on a molecules physiochemical properties.Participants in the recentSAMPL6 blind challenge were asked to submit predictions for microscopic and macroscopic pKas of 24 druglike small molecules.We recently built a general model for predicting pKas using a Gaussian processregression trained using physical and chemical features of each ionizable group.Our pipeline takes amolecular graph and uses the OpenEye Toolkits to calculate features describing the removal of aproton.These features are fed into a Scikit-learn Gaussian process to predict microscopic pKas which are thenused to analytically determine macroscopic pKas.Our Gaussian process is trained on a set of 2,700macroscopic pKas from monoprotic and select diprotic molecules.Here, we share our results for microscopicand macroscopic predictions in the SAMPL6 challenge.Overall, we ranked in the middle of the packcompared to other participants, but our fairly good agreement with experiment is still promising consideringthe challenge molecules are chemically diverse and often polyprotic while our training set is predominatelymonoprotic.Of particular importance to us when building this model was to include an uncertainty estimatebased on the chemistry of the molecule that would reflect the likely accuracy of our prediction. Our modelreports large uncertainties for the molecules that appear to have chemistry outside our domain of applicability,along with good agreement in quantile-quantile plots, indicating it can predict its own accuracy.The challengehighlighted a variety of means to improve our model, including adding more polyprotic molecules to ourtraining set and more carefully considering what functional groups we do or do not identify as ionizable.

File list (3)

download fileview on ChemRxivsampl6_preprint.pdf (745.74 KiB)

download fileview on ChemRxivsampl6_SI.pdf (618.58 KiB)

download fileview on ChemRxivsampl6_SI.tar (11.82 MiB)

Page 2: SAMPL6 Challenge Results from pKa Predictions Based on a ...

SAMPL6 challenge results from pKa predictions based on ageneral Gaussian process model

Caitlin C. Bannan · David L. Mobley · A. Geoffrey Skillman

Abstract A variety of fields would benefit from ac-

curate pKa predictions, especially drug design due to

the affect a change in ionization state can have on a

molecules physiochemical properties. Participants in the

recent SAMPL6 blind challenge were asked to submit

predictions for microscopic and macroscopic pKas of 24

drug like small molecules. We recently built a general

model for predicting pKas using a Gaussian process re-

gression trained using physical and chemical features

of each ionizable group. Our pipeline takes a molecular

graph and uses the OpenEye Toolkits to calculate fea-

tures describing the removal of a proton. These features

are fed into a Scikit-learn Gaussian process to predict

microscopic pKas which are then used to analytically

determine macroscopic pKas. Our Gaussian process is

trained on a set of 2,700 macroscopic pKas from mono-

protic and select diprotic molecules. Here, we share our

results for microscopic and macroscopic predictions in

the SAMPL6 challenge. Overall, we ranked in the mid-

dle of the pack compared to other participants, but our

fairly good agreement with experiment is still promising

considering the challenge molecules are chemically di-

verse and often polyprotic while our training set is pre-

dominately monoprotic. Of particular importance to us

C. C. Bannan(1) Department of Chemistry, University of California, Irvine,(2) 2017 Summer Intern, OpenEye Scientific Software, Inc.,Santa Fe, NM.

D. L. MobleyDepartments of Pharmaceutical Sciences and Chemistry, Uni-versity of California, Irvine147 Bison Modular, Irvine, CA 92697Tel.: +949-824-6383Fax: +949-824-2949E-mail: [email protected]

A. G. SkillmanOpenEye Scientific Software, Inc.9 Bisbee Court, Suite D, Santa Fe, NM 87508Tel.: +505-473-7385E-mail: [email protected]

when building this model was to include an uncertainty

estimate based on the chemistry of the molecule that

would reflect the likely accuracy of our prediction. Our

model reports large uncertainties for the molecules that

appear to have chemistry outside our domain of applica-

bility, along with good agreement in quantile-quantile

plots, indicating it can predict its own accuracy. The

challenge highlighted a variety of means to improve our

model, including adding more polyprotic molecules to

our training set and more carefully considering what

functional groups we do or do not identify as ionizable.

Keywords pKa · SAMPL6 · blind challenge ·Gaussian process

1 Introduction

Accurate predictions of pKa values are of interest in

a variety of fields including pharmaceutical research,as absorption, distribution, metabolism, and toxicity

can be profoundly affected by changes in ionization

state [1,2]. Other key physiochemical properties, such

as lipophilicity, solubility, and permeability are also pKa-

dependent [3–6]. Knowing the likely ionization state of

a molecule is also important as preparation for other

modeling studies. For example, predictions of distri-

bution coefficients in SAMPL5 demonstrated how dra-

matically free energy calculations can be affected by

a choice in ionization state of a molecule [7,8]. Calcu-

lations of other biomolecular properties, such as pro-

tein ligand binding affinities, are similarly affected by

choices in ionization state [9].

Because of the importance of pKa prediction, and

the difficulty of predicting pKa values, the SAMPL

challenge organizers included a pKa prediction compo-

nent in SAMPL6. Experimental macroscopic pKas were

collected for 24 drug like molecules using an established

spectrophotometric technique limited to a pH range be-

tween two and twelve. As a part of follow up analysis, a

Page 3: SAMPL6 Challenge Results from pKa Predictions Based on a ...

2 Caitlin C. Bannan et al.

few NMR experiments were performed to determine the

microscopic pKas of a select few molecules [10]. Micro-

scopic pKas refer to an equilibrium resulting from re-

moving a specific hydrogen from a molecule and macro-

scopic pKas describe the process of removing any hy-

drogen or an overall change in charge state [11,12]. All

experimental data was kept secret from the public to

allow participants in the challenge to make blind micro-

scopic and macroscopic predictions for the 24 molecules.

Specifically, there were three formats allowed for predic-

tion submission:

– type I: microscopic pKas,

– type II: fractional microstate populations as a func-

tion of pH, and

– type III: macroscopic pKas

where microstates refer to a single tautomer of a spe-

cific charge state of a molecule. For each type of submis-

sion participants were encouraged, but not required, to

submit all predictions their model generated for every

molecule. SAMPL6 organizers then evaluated predic-

tions based on experimental results for all macroscopic

and a select set of microscopic pKas [13]. Details for the

challenge including experimental results, all submitted

predictions, and an overview analysis are available on-

line (github.com/MobleyLab/SAMPL6).

There are many different methods and tools for pKa

prediction, and a variety can be seen in this special is-

sue on SAMPL6 results. These techniques vary dramat-

ically in scope, computational cost, and accuracy. His-

torically, a common approach for predicting pKa was

through linear free energy relationships using empir-

ically determined constants to relate an acid or base

to a parent molecule in a known database [14,15]. A

related technique, quantitative structure-property rela-

tionships (QSPR), remain popular. These incorporate

a variety of molecular and atomistic descriptors [16–

18]. Some of these techniques have been updated to use

more advanced machine learning models such as artifi-

cial neural networks [19,20]. A variety of quantum me-

chanical descriptors, including partial atomic charges,

have also been shown to be promising in QSPR models

– due to computational cost, these methods are imprac-

tical for a general model and have only been applied to

specific types of ionizable groups [6,21–23]. Quantum

mechanical calculations from first principles can also

be used to calculate pKa using a thermodynamic cycle

of deprotonation in the gas phase and the hydration

free energy of both the protonated and deprotonated

molecule [24]. QM calculations are often still limited in

accuracy due to the difficulty in calculating hydration

free energies of ionized molecules in implicit solvents.

The most successful quantum mechanical predictions

from first principles also apply an empirical linear cor-

rection factor [25,26].

Here, we introduce a new machine learning model

for predicting microscopic and macroscopic pKas. Our

goal was to create a universal model which provides

predictions that come with accurate estimates of their

uncertainty. Most general methods for pKa prediction

build separate models for each type of ionizable group.

Even preditions based on DFT calculations for model

input can require specialized models for different ioniz-

able groups [27]

In contrast, we set out to build a single general

model which could predict a microscopic pKa for any

identified ionizable group. We believe that if our fea-

tures, or input into the machine learning model, are

based on the underlying physical and chemical proper-

ties responsible for the variation in deprotonation en-

ergy, only one model would be necessary. Artificial neu-

ral networks have been successful for predicting pKa,

but require substantial training data. We were inter-

ested in a machine learning model that could be built

from less training data, but did not require an assump-

tion about the shape of the function being fit. Gaus-

sian process regression meets these requirements pro-

viding a model based on distributions in feature space.

It also automatically incorporates an assessment of un-

certainty based on how similar input data is to the

training data [28]. Here, we present this new model and

our results for the type I and type III components of

the SAMPL6 blind challenge.

2 Computational Methods

We built a pipeline to predict the microscopic and macro-

scopic pKas of a molecule starting from any molecular

representation, such as a SMILES string. Our model

directly predicts microscopic pKas and then calculates

macroscopic pKas. First, we identify all ionizable groups

in a molecule and iterate through them to identify all

transitions between microstates. In the next step, we

convert each microscopic transition into a list of quan-

titative features. These features are used as input into

our Gaussian process regression model which predicts

a pKa for each micro-transition. The output from these

steps is a list of all microscopic pKas for each molecule.

Lastly, macroscopic pKas are analytically calculated

from a thermodynamic cycle involving all microscopic

transitions. Below, each of these steps is described in de-

tail including an overview of how we trained, validated,

and tested our model before the SAMPL6 challenge.

Page 4: SAMPL6 Challenge Results from pKa Predictions Based on a ...

SAMPL6 challenge results from pKa predictions based on a general Gaussian process model 3

2.1 A heuristic approach is used to identify aqueous

ionizable groups

The first step in processing any molecule, either for

training or prediction, is to identify all microscopic tran-

sitions. Molecules, in the form of SMILES strings or

any common molecular file format, are processed us-

ing OpenEye’s OEChem toolkit [29] and one reasonable

tautomer of the neutral form of the molecule is chosen.

A substructure search is used to identify groups that

commonly ionize in water [15] as either acidic:

– any protonated oxygen atom

– any protonated aliphatic sulfur atom

– cyclopentadiene

– carbon or nitrogen between two strongly electron

withdrawing groups

– arylsulfonamide nitrogens

– pyrrole-like aromatic nitrogens

– any atom with a formal positive charge and a hy-

drogen

or basic:

– aliphatic nitrogen atoms, not a part of amide or sul-

fonamide groups

– pyridine-like aromatic nitrogens

– trivalent aliphatic phosphorous

– any atom with a formal negative charge.

Next, we protonate all basic groups and then iterate

through all ionizable groups recursively removing a pro-

ton from each in order to identify all micro-transitions.

For each transition we store the protonated and depro-

tonated form of the molecule. OpenEye’s Omega toolkit

is then used to generate a low-energy conformation foreach form of the molecule [30]. Next, a list of features

is calculated to describe the micro-transition between

these two forms of the molecule. This feature list will

then be used as input for into our Gaussian process

model.

2.2 Features were chosen to describe physical

characteristics

We chose features based on the chemical and physical

properties that affect pKa. The key properties chemists

are trained to think about in relation to ionization are

the ionizable atom, resonance, inductive effects, steric

effects, and solvation. These properties affect the abil-

ity of an ionizable group to support a protonated or

deprotonated state along with the associated change

in formal charge. We also considered the quantum me-

chanical approach for calculating pKa using a thermo-

dynamic cycle involving the gas phase acidity and the

Fig. 1 We use the partial charge on the deprotonated atom(yellow) and the average partial charge on atoms one bond(purple) and two bonds (orange) away from that atom inboth the protonated and deprotonated form of the molecule,making a total of six features involving partial charges.

solvation free energy of each form of the molecule. Thus,

we calculate features to describe the micro-transition

using the protonated and deprotonated forms of the

molecule in gas and aqueous phase. Using OpenEye

Toolkits we calculate a total of ten features for each

transition, some for each form of the molecule and some

taking differences in properties between the two forms.

– Difference in enthalpy

– Mayer Partial Bond order on the bond between hy-

drogen and the ionizable group

– AM1-BCC partial charges on multiple atoms, re-

sulting in six charge-related features

– Difference of solvation free energy

– Solvent accessible surface area of the deprotonated

atom

To begin, we perform a semi-empirical AM1 calcula-

tion for each microstate and then extract several prop-

erties. The first feature mirrors gas phase acidity by tak-

ing the difference in enthalpy between the protonated

and deprotonated form of the molecule. For the pro-

tonated form of the molecule, the Mayer partial bond

order is also calculated for the bond between the ionized

atom and the hydrogen to be removed [31–33]. AM1-

BCC partial charges are calculated for atoms one and

two bonds away from the ionized atom [33–35]. Previous

work established partial charges as a useful feature to

predict pKa on molecules. These studies used molecule

sets with all the same ionizable groups considering the

charge on the deprotonated and surrounding atoms [22,

23,36]. In order to apply our model to all identified ion-

izable groups, we needed a more general approach. We

decided to consider the partial charge of (1) the depro-

tonated atom and the average partial charge on atoms

(2) one bond and (3) two bonds away from this atom,

for both the protonated and deprotonated forms of the

molecule (Figure 1). This leads to six features based

on the partial atomic charges. Since the AM1 calcula-

tions are performed in gas phase, the last two features

Page 5: SAMPL6 Challenge Results from pKa Predictions Based on a ...

4 Caitlin C. Bannan et al.

attempt to capture the affect of solvation on the equi-

librium. The difference in solvation free energy of the

two forms of the molecule is estimated by a Poisson

Boltzmann surface area calculation as implemented in

OpenEye’s Szybki toolkit [37,38]. Lastly, the solvent

accessible surface area around the deprotonated atom

is determined with OpenEye’s Spicoli Toolkit [39–42].

2.3 Gaussian process regression provides a simple

machine learning model

We built our Gaussian process regression model using

the Python package Scikit-learn [43]. A Gaussian pro-

cess is a nonparametric model which uses a Bayesian

approach to sample a posterior distribution of func-

tions [28]. There are two priors set for a Gaussian pro-

cess, a mean function and a kernel (or covariance) func-

tion. As with most Gaussian process models, we set our

prior mean function to zero. When initially training and

validating the model, we considered a variety of the ker-

nel functions included in Scikit-learn. To choose a ker-

nel and optimize any required parameters, we used a

three-fold cross validation method considering the root

mean squared error (RMSE), mean error, and correla-

tion coefficient of the training and validating sets (Sec-

tion 2.5). The best performing kernel for our purposes

was a Matern kernel – a generalized function between

the squared and absolute exponential kernels [28]. This

kernel requires a preset parameter ν which was opti-

mized to 2.5 for our model. The general form of Matern

kernel is complex including a Bessel function, and with

ν = 2.5, our final kernel is the function:

k = c

(1 +

√5d

l+

5d2

3l2

)exp

(−√

5d

l

)(1)

where c and l are trained constants and d is the

distance between two feature vectors.

2.4 Macroscopic pKas are calculated from microscopic

transitions

Our Gaussian process model is trained to predict mi-

croscopic pKas which can be used to analytically calcu-

late macroscopic pKas. Most experimentally measured

pKas are macroscopic, providing an equilibrium con-

stant for an overall change in total charge. These macro-

scopic transitions are comprised of multiple microscopic

transitions, each of which consists of the removal of one

specific hydrogen atom. If pKas, or equilibrium con-

stants, for all microscopic transitions are known, then

the macroscopic pKa can be analytically calculated us-

ing a thermodynamic cycle [15,6]. For example, for a

Fig. 2 We identify two ionizable groups in SAMPL6 com-pound SM22; this thermodynamic cycle shows an exampleof how microscopic transitions with equilibrium constants a,b, c, and d are related to macroscopic equilibrium constants(Ka1

and Ka2).

molecule with two ionizable groups, the macroscopic

Ka’s are:

Ka1= a+ b (2)

Ka2=

1

c+

1

d(3)

where a and b are equilibrium constants for the first

deprotonation and c and d are for the second depro-

tonation with the thermodynamic cycle in Figure 2.

Similar, though more complex, cycles can be drawn for

polyprotic molecules, allowing us to calculate macro-

scopic pKas for any provided molecule.

2.5 Training, validation, and internal test sets include

monoprotic and select diprotic molecules

Our training set was derived from an extensive exper-

imental pKa database Tony Slater curated from four

original sources:

– Dissociation Constants of Organic Bases in Aque-

ous Solution, by D.D. Perin (3,775 molecules, 8,766

pKas) [15];

– Dissociation Constants of Organic Acids in Aqueous

Solution, by G. Kortum, W. Vogel and K. Andrus-

sow (1,063 molecules, 2,893 pKas) [44];

– Dissociation Constants of Organic Bases in Aqueous

Solution, Supplement 1972, by D.D. Perin (4,275

molecules, 7,844 pKas) [45];

– Ionisation Constants of Organic Acids in Aqueous

Solution, by E.P. Serjeant and Boyd Dempsey (4,584

molecules, 10,912 pKas) [46].

This is the same database used for the OpenEye appli-

cation pKa Prospector. To begin, we filtered database

Page 6: SAMPL6 Challenge Results from pKa Predictions Based on a ...

SAMPL6 challenge results from pKa predictions based on a general Gaussian process model 5

entries for experimental measurements which were aque-

ous (including removing measurements in D2O), taken

between 20 and 25 degrees Celsius, and not tagged as

very uncertain. This resulted in a set of 9,890 molecules

with 26,519 experimental measurements. The large num-

ber of experimental results compared to number of molecules

is not solely due to molecules with multiple pKas; rather,

it is primarily due to replicate measurements for certain

molecules. In such cases we performed a weighted aver-

age, propagating the estimated uncertainties. As we are

most interested in biologically relevant ionization, we

also removed molecules where the experimental pKas

were outside a range of 0 to 14.

We currently use SciKit-Learn’s out-of-the-box ver-

sion of Gaussian Process which assumes one expected

value for each feature vector, limiting the types of molecules

we can use for training. Specifically, polyprotic molecules

are not suitable for training input in this approach.

Specifically, for polyprotic molecules, there is a fea-

ture vector for each microscopic transition, leading to

more predicted values than experimental pKa values.

For example, there are four microscopic transitions for a

diprotic molecule but only two macroscopic pKa values.

Thus, we focused our training set on instances where a

microscopic transition can be directly mapped to an ex-

perimental macroscopic pKa. The first set of molecules

was perhaps the most obvious — those with only one

ionizable group where the microscopic and macroscopic

transition are identical. We checked that molecules we

identified as having a single ionizable group also only

had one experimental measurement. This resulted in

2,672 molecules. To expand the diversity of the training

set we added a selection of diprotic molecules. For this

set we also included molecules where we identified two

ionizable groups and two experimental values were re-

ported. Additionally, we required the difference in these

two experimental pKas be greater than three log units

to assure dominance of a single microstate in estimation

of the macroscopic pKa. There were a total of 286 dipro-

tic molecules in the database that met this requirement.

For these molecules, we assumed each macroscopic pKa

was dominated by only one microscopic transition.

Before training, we removed 10% of these molecules

to later serve as an internal test set, resulting in setting

aside 243 monoprotic and 29 diprotic molecules, for a

total of 301 data points. The training data then con-

sisted of 2,186 monoprotic and 257 diprotic molecules.

We then split the training data into thirds in order to

use a three-fold cross validation method to evaluate the

choice of a Gaussian process model and choose a ker-

nel [47]. To evaluate model performance, we considered

RMSE, mean error, and correlation coefficients for each

training and validation set pair. We judged model per-

formance on training and cross-validation datasets in

the context of learning curves for the purposes of model

and feature selection. All training data was recombined

for our final Gaussian process model used to evaluate

our internal test set and make predictions for SAMPL6.

3 SAMPL6 challenge results

We predict microscopic pKa values using a Gaussian

process model trained on 2,443 mono- and diprotic molecules

(2,700 data points). Physical and chemical features are

calculated for the protonated and deprotonated form

of the molecule using OpenEye toolkits (Section 2.2).

Macroscopic pKas are then analytically calculated from

a combination of microscopic transitions (Section 2.4).

We used our model to predict microscopic (type I) and

macroscopic (type III) pKas for 24 drug like molecules

in the SAMPL6 blind challenge [13]. While Mobley is a

co-organizer of the challenge, none of the authors had

any access to the experimental data nor any knowl-

edge of details of the measurements until experimental

values and details were publicly released to all partici-

pants. The SAMPL6 organizers also asked for optional

microstate populations as a function of pH (type II).

However, we elected not to participate in that portion

of the challenge.

3.0.1 Predictions were matched with experiment to

reduce error

In an ideal world we would have a one-to-one match

when comparing predicted and experimental results,

where each calculated pKa has a corresponding exper-

imental value. When SAMPL6 was announced, it in-

cluded specification for how the experimental pKa val-

ues would be measured. This included the limitation

to perform experiments in a pH range of 2 − 12. Fol-

lowing the organizer suggestions, we included predic-

tions for all macroscopic pKas our model predicted in-

cluding those outside the specified experimental range.

Thus, there are many molecules with fewer experimen-

tally determined pKas than we predicted. The organiz-

ers considered two matching algorithms and analyzed

all challenge submissions with both methods. The first

was a closest matching algorithm where each prediction

is matched to an experimental value based on the ab-

solute difference between them. If two predictions are

paired to the same experimental value then the match

with the larger absolute difference is thrown out lead-

ing to one less pair used in the analysis. To prevent the

loss of data due to multiple pairings, the organizers re-

did the analysis using a Hungarian matching algorithm

instead [48]. In the Hungarian algorithm, the absolute

Page 7: SAMPL6 Challenge Results from pKa Predictions Based on a ...

6 Caitlin C. Bannan et al.

difference is calculated for each pair of prediction to

experiment. Then the combination of pairs which re-

duces the absolute error for that whole molecule is re-

tained. One potential problem in this approach is that it

does not account for the natural ordering of pKa values,

meaning it is possible that the larger of two predictions

could be paired with the smaller of two experimental

pKas. For example, if a molecule had two experimental

pKas 2.15 and 9.58 and a prediction reported values of

0.50 and 1.84, then the final pairs would be (9.58, 0.5)

and (2.15, 1.84) as that would result in the smallest

absolute error overall. In general, we believe the Hun-

garian approach is superior as it allows for all possible

data to be included, though an ideal algorithm would

restrict the order while matching. Fortunately, this re-

ordering did not occur when our predictions were paired

with experiment so we used the Hungarian matching

to evaluate our performance. All analysis by organizers

can be found online and in an overview article in this

issue (github.com/MobleyLab/SAMPL6) [13].

3.1 Microscopic pKa reported for type I predictions

We reported microscopic pKas for all ionizable groups

we identified in the SAMPL6 molecules. The first step

for any prediction we perform is to identify ionizable

groups and then iterate through those groups to find

all microscopic transitions. For SAMPL6, all resonance

structures of a given microstate were considered to be a

single state and assigned a single identification number

for the set. We matched each molecular microstate to

the proper identification number using a script adapted

from the SAMPL6 organizers that identifies identical

resonance structures. A full table of microscopic pKas

and the script used to find their identification numbers

is provided in the supplementary information.

The SAMPL6 organizers initially provided an ap-

proximate evaluation of predicted microscopic transi-

tions using macroscopic pKas. Experimental measure-

ments were only made for macroscopic pKas [10]. In

an attempt to provide feedback on type I predictions,

organizers compared experimental data for molecules

with only one experimental pKa or two pKas with a

difference greater than 3 relative to microscopic predic-

tions. For each molecule, the experimental values were

matched to microscopic predictions that resulted in the

lowest error. However, most of these molecules have

multiple ionizable groups which may contribute to each

macroscopic pKa. We chose not to focus on this anal-

ysis as we thought it would not be informative about

how well our model predicts microscopic pKas.

After the macroscopic pKa values and all predic-

tions were made public, molecule SM07 was analyzed in

Fig. 3 NMR experiments for SM07 identified only one mi-croscopic transition for the protonation of the top nitrogen(green). We identified the other two nitrogens (blue) as ion-izable and reported microscopic pKas for all three sites.

Table 1 Specific, microscopic pKa predictions can be com-pared to experiment based on NMR experiments on SM07where the macroscopic pKa was dominated by a single mi-croscopic transition (Figure 3).

SAMPL ID Experiment Prediction

SM02 5.03± 0.01 5.32± 1.28SM12 5.28± 0.01 5.89± 1.28SM09 5.37± 0.01 6.09± 1.28SM13 5.77± 0.01 6.95± 1.37SM04 6.02± 0.01 6.73± 1.37SM07 6.08± 0.01 7.05± 1.39

an NMR experiment to determine microscopic pKas [10].

SM07 is a 4-amino quinazoline derivative with three

nitrogens our algorithm identified as ionizable (Figure

3). The NMR results indicated the macroscopic pKa

was dominated by a single microscopic transition ob-

served for this molecule. SAMPL6 included five other

molecules that are also 4-amino quinazoline derivatives.

If we assume all of these molecules have a dominant mi-

crotransition on the same nitrogen, then we can com-

pare a total of six predicted microscopic pKas with ex-

periment (Table 1). Our predictions for these micro-

scopic transitions have reasonable correlation with the

experimental values with an R2 of 0.9± 0.2 (Figure 4).

While there appears to be a slight bias with a mean er-

ror of 0.7±0.1, all predictions were within uncertainty of

experiment. Predicted uncertainties are also fairly large

(all greater than 1.2) which is not unexpected as our

training data only included molecules with one or two

ionizable groups so these 4-amino quinazoline deriva-

tives would not be well represented.

Page 8: SAMPL6 Challenge Results from pKa Predictions Based on a ...

SAMPL6 challenge results from pKa predictions based on a general Gaussian process model 7

3 4 5 6 7 8 9Experimental microscopic pKa

3

4

5

6

7

8

9

Pre

dict

ed m

icro

scop

ic p

K a

RMSE = 0.8 ± 0.1R2 = 0.9 ± 0.2ME = 0.7 ± 0.1

Fig. 4 Predicted microscopic pKas are compared to experi-ment for six 4-amino quinazoline derivatives based on NMRexperiments on molecule SM07. The shaded region region in-dicates agreement within 1pKa unit.

In an attempt to evaluate a wider range of micro-

scopic pKas, we compared our predictions with some

of the top results from the macroscopic analysis. Sim-

ulationsPlus’ pKa Predictor [20] and ACD Lab’s pKa

GALAS [49] both performed better than our approach

in the macroscopic pKa challenge compared to experi-

ment and provided type I predictions [13]. The SAMPL6

challenge instructions encouraged all participants to sub-

mit whatever microscopic pKas their method identi-fied. Each method we are considering for comparison

here reported a different number of microscopic pKas.

We found 254, SimulationPlus reported 313, while ACD

only predicted 65. For transitions where we both made

a prediction, we compared our results with these two

commercial products (Figure 5 a and b), then we also

compared SimulationPlus with ACD predictions with

each other (Figure 5c). In all cases there is no corre-

lation and without experimental data for more micro-

scopic transitions there is no way of evaluating how well

our methods performed relative to these other more es-

tablished methods.

3.2 Macroscopic pKa reported for type III predictions

We reported macroscopic pKa values (type III) for all

molecules in SAMPL6 (Table 2). These were calculated

analytically based on the microscopic pKas determined

for each molecule (Section 2.4). Using the Hungarian

Table 2 A list of experimental pKas for all molecules inSAMPL6 by molecule ID and our predicted macroscopic pKa

that matches with each, based on the Hungarian matchingalgorithm.

SAMPL ID Prediction Experiment

SM01 9.27± 0.17 9.53± 0.01

SM02 5.19± 2.47 5.03± 0.01

SM03 4.49± 3.47 7.02± 0.01

SM04 6.73± 1.39 6.02± 0.01

SM05 7.62± 1.04 4.59± 0.01

SM06 1.77± 2.43 3.03± 0.04

3.94± 0.54 11.74± 0.01

SM07 5.17± 2.47 6.08± 0.01

SM08 4.61± 0.23 4.22± 0.01

SM09 5.14± 2.47 5.37± 0.01

SM10 6.44± 0.98 9.02± 0.01

SM11 5.07± 3.59 3.89± 0.01

SM12 5.17± 2.47 5.28± 0.01

SM13 4.97± 2.49 5.77± 0.01

SM14 0.12± 3.42 2.58± 0.01

6.49± 0.58 5.30± 0.01

SM15 5.42± 0.45 4.70± 0.01

8.71± 0.20 8.94± 0.01

SM16 5.91± 0.34 5.37± 0.01

SM17 3.47± 4.20 3.16± 0.01

SM18 −0.26± 2.70 2.15± 0.02

5.00± 4.39 9.58± 0.03

10.98± 1.59 11.02± 0.04

SM19 6.04± 0.88 9.56± 0.02

SM20 7.31± 1.84 5.70± 0.03

SM21 4.07± 0.02 4.10± 0.01

SM22 2.73± 0.34 2.40± 0.02

6.60± 1.08 7.43± 0.01

SM23 5.48± 2.83 5.45± 0.01

SM24 1.71± 3.14 2.60± 0.01

algorithm described in section 3.0.1, SAMPL6 orga-

nizers compared experimental results with all 34 pre-

diction submissions using RMSE, mean error (ME),

and R2 correlation coefficient. Overall, we saw reason-

able agreement between our predictions and experiment

(Figure 6). The SAMPL6 molecules included a vari-

ety of polyprotic functional groups that are completely

outside the scope of our mono- and diprotic training

set. Despite this, over half (18 predictions) fall within

one pKa unit of experiment. By RMSE and ME we fall

within the middle 15 predictions which cannot be eas-

Page 9: SAMPL6 Challenge Results from pKa Predictions Based on a ...

8 Caitlin C. Bannan et al.

5 3 1 1 3 5 7 9 11 13 15Our microscopic pKa

53113579

111315

Sim

ulat

ionP

lus

mic

rosc

opic

pK a a

5 3 1 1 3 5 7 9 11 13 15Our microscopic pKa

53113579

111315

AC

D G

ALA

S m

icro

scop

ic p

K a

b

5 3 1 1 3 5 7 9 11 13 15SimulationPlus microscopic pKa

53113579

111315

AC

D G

ALA

S m

icro

scop

ic p

K a

c

Fig. 5 These plots compare microscopic pKa predictions for all combinations of our model, SimulationPlus’ pKa predictorand ACD Lab’s pKa GALAS. The shaded region indicates an agreement within 1pKa unit.

4 2 0 2 4 6 8 10 12Experimental macroscopic pKa

4

2

0

2

4

6

8

10

12

Pre

dict

ed m

acro

scop

ic p

K a

RMSE = 2.2 ± 0.5R2 = 0.4 ± 0.1ME = 0.7 ± 0.4

SM06

Fig. 6 This plot shows our macroscopic pKa predictionscompared to experiment. The shaded region represents agree-ment within 1pKa unit. The most significant outlier (SM06)is due to an acidic amide we did not identify as ionizable

ily ranked due to wide confidence intervals, determined

using bootstrapping, for most participants. By correla-

tion coefficient (R2) our method ranks very low, in the

bottom five submissions, but this appears to be due to

one rather extreme outlier we discuss in detail below. If

we were to remove this one outlier, our ranking would

improve significantly with a change in R2 from 0.4±0.1

to 0.62 ± 0.09 and a shift in RMSE from 2.2 ± 0.5 to

1.7± 0.3.

The molecule is SM06 can definitely be considered

an outlier, not just due to the large discrepancy between

our prediction and experiment, but also due to an ion-

izable group we did not properly identify (Figure 7). In

this case our predicted value of 3.94 ± 0.5 is matched

with the experimental value 11.74±0.01, as pointed out

Figure 6. This molecule contains three ionizable groups:

the pyridine nitrogen base, the quinoline nitrogen base

and the amide nitrogen either as a base at low pH or

as an acid at high pH. We did not train our model to

treat amides as either acids or bases (Section 2.1). Our

model predicted the transition from +2 to +1 to occur

at a pH of 1.77 ± 2.43 and to be dominated by depro-

tonation of the pyridine nitrogen. It predicts +1 to +0

transition at 3.94±0.54 dominated by deprotonation of

the quinoline nitrogen. In order to improve our model,

we consider how similar functional groups are repre-

sented in our training set and look to the literature to

attempt to determine which microstate dominates at

the 11.74± 0.01 transition. It seems probable that the

deprotonation from +3 charge to +2 charge would oc-

cur well below pH2.0, outside the experimental range,

and is most likely dominated by the deprotonation of a

charged and doubly protonated amide. The next tran-

sitions are less immediately obvious, so we look to our

training set which contains meta-substituted bromo-

pyridines and carboxamide pyridines, both with pKas

in the low 3s. It also includes several monoprotic quino-

line derivatives with pKas from 4.8− 5.5. One explana-

tion for this large error could be that the carbonyl of

the amide group could form an internal hydrogen bond

stabilizing the protonated form of the quinoline and

increasing its pKa. While internal hydrogen-bonding

may affect some of the features we already include, our

model does not directly consider it. Adding a more ex-

plicit descriptor to capture such affects may be some-

thing we should explore as we improve our model. A

more likely explanation for the error is that it is due to

the amide nitrogen our model misses. Our logic for not

Page 10: SAMPL6 Challenge Results from pKa Predictions Based on a ...

SAMPL6 challenge results from pKa predictions based on a general Gaussian process model 9

Fig. 7 SM06 provided feedback on our ability to accurateidentify ionizable groups as our method only finds two (blue),notably missing the acid amide (red)

Fig. 8 Our prediction for SM20 was still rather inaccurate,despite it being monoprotic (blue). This is likely due to a lackof representation of imide groups in similar environments inour training set.

including amides as ionizable sites in our model was

because they often have a basic pKa value less than

2.0 and an acidic pKa value greater than 14.0. How-

ever, in this highly conjugated system, that amide ni-

trogen could be an important contributor to the pKa of

11.74± 0.01. An analogous system to consider is N-(2-

pyrimidyl)benzamide, with its second ionization mea-

sured at 11.2 [50], demonstrating that acidic amide ni-

trogens can have pKa values in the appropriate range.

Improving our model will likely involve conducting a

more thorough investigation of which groups should be

considered ionizable.

We knew going into the SAMPL6 challenge that

complex polyprotic molecules would fall outside the

domain of applicability for our model, however, other

functional groups appear to also be poorly represented.

For example, molecule SM20 has only one ionizable

group, the acidic imide group (Figure 8). While our

training set includes some similar functional groups,

there was not a wide diversity. Specifically, none in the

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9Expected fraction in range

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Frac

tion

of p

redi

ctio

ns in

rang

e

predicted uncertaintySlope=0.88test set RMSE (0.75)Slope=0.73

Fig. 9 This QQ plot provides an assessment for predicteduncertainties compared to a normal distribution. Our pre-dicted uncertainties (blue circles) out perform a fixed error of0.75 taken from our test set RMSE (red squares), as evidentby the proximity of the point to the x = y black line and aslope approaching one.

training set had a sulfur one bond away. This is also evi-

dent in our prediction 7.31±1.84 where the large uncer-

tainty reflects the lack of similarity between this ioniz-

able group and our training set. Expanding our training

data to include polyprotic molecules was already in our

plans, but considering more complex mono- or diprotic

molecules with overlapping microconstants could also

improve our model.

An important goal in building our model was to

be able to predict uncertainties which actually provide

some guidance as to expected accuracy and limitations

in the training data. Not all commercial products make

this a priority; for example, SimulationPlus did not

provide uncertainty estimates for any of their predic-

tions. One obvious feature in our data is that the pre-

dicted uncertainties are all very large, greater than 1

pKa unit for 19 of the matched predictions. Most of

the molecules with large uncertainties are polyprotic

and include functional groups outside the domain of

our training set. Therefore it seems these large uncer-

tainties are a good sign as they seem to correlate with

actual error.

Previous SAMPL challenges have included quantile-

quantile plots (QQ plots) which provide a more quanti-

tative assessment of a participants reported model un-

certainties [7,51]. QQ plots are based on the concept

that actual errors should be drawn from a normal distri-

Page 11: SAMPL6 Challenge Results from pKa Predictions Based on a ...

10 Caitlin C. Bannan et al.

bution, and well-predicted uncertainties should be able

to predict the frequency of deviations of a given size.

Thus in QQ plots, y-axis has the fraction of predicted

minus experimental values that fall within a given num-

ber of uncertainties and the x-axis shows the fraction of

a normal distribution within that many standard devi-

ations. The closer the predicted uncertainties compare

to a normal distribution the closer they will come to an

x = y line. Thus, the slope of a regression is also often

used as a part of the evaluation. We compare two possi-

bilities for model uncertainty in our QQ plot. The first

uncertainty approach we consider uses the predicted

uncertainties from our Gaussian process model (blue

circles in Figure 9). Another common way to report

uncertainty is to assume it is the same for all predic-

tions based on past performance of the approach. For

the second set of data, we assumed the uncertainty for

each predicted pKa was equal to the RMSE for our in-

ternal test set, 0.75 (red squares in Figure 9). A method

producing accurate predicted uncertainties should lead

to a diagonal line on the resulting plot, with slope of

1; in this case, we find that the uncertainty model us-

ing predicted uncertainties from the Gaussian process

(slope=0.87) outperforms the model with a fixed un-

certainty (slope=0.73). This is promising evidence that

our model is capable of predicting how its reliability

varies with the chemistry being considered, rather than

just its overall typical performance.

4 Conclusions

Our Gaussian process model showed promising results

in the SAMPL6 challenge, but was limited by the scope

of our training set. The chemical space represented in

our training set was limited to mono- and diprotic molecules

(Section 2.5). Despite this limitation, we still saw fairly

good agreement between our predicted macroscopic pKas

and the experimentally measured values and performed

competitively compared to other participants. We rank

in the top ten by RMSE (1.7 ± 0.3) after removing a

single obvious outlier (Figure 6). This outlier, with an

acidic amide group, highlighted a potential hole in our

limited definition of ionizable groups. Improving our

model will require adding groups which are often ion-

ized outside the aqueous pKa range, but which can be

perturbed to ionize within that range. Our performance

in this blind challenge is evidence that a single model

trained on physically and chemically relevant features

can be competitive with established methods which rely

on specialized models for individual functional groups.

The other important step in improving our model

will be to augment our training set with additional

polyprotic molecules. Currently, the likelihood function

in Scikit-learn requires one feature vector for each ex-

perimental result. Using this function, we would require

a large dataset of experimental microscopic pKas in or-

der to include polyprotic molecules. Generally speaking,

it is easier to acquire experimental macroscopic pKa

data. Thus, a preferred approach would be to define

a new likelihood function which would take advantage

of the analytical relationship between microscopic and

macroscopic pKas and evaluate a set of microscopic pre-

dictions with one macroscopic value. We are confident

that with this expansion we will have a general model

which could predict pKa for molecules with any combi-

nation of ionizable groups.

Evaluating microscopic pKa predictions was limited

by the availability of experimental results. For the six

molecules with NMR supported microscopic pKas, our

predicted values agreed with experiment within uncer-

tainty. This was a rather limited set of the possible

microscopic transitions so we also attempted to com-

pare our performance to competitive commercial prod-

ucts. However, there was no correlation between any

combination of type I predictions from SimulationPlus,

ACD Labs, or our own model (Figure 5), indicating

that much research remains to be done to predict the

true microscopic pKa values for many important transi-

tions. A valuable addition to future SAMPL challenges

including pKa predictions would be to expand experi-

mental measurements to include more microscopic re-

sults when available.

We believe predictions are only valuable when they

include an accurate assessment of uncertainty, other-

wise downstream users have no guidance as to the reli-

ability of such predictions and thus no confidence as to

when they can usefully be used and when they should

be ignored. These uncertainties are even more valuable

if they are determined based on the input molecule, cap-

turing when reliability varies with chemistry. Unfortu-

nately, ten out of 34 type III submissions in SAMPL6

provided no uncertainties with their predictions. Per-

haps requiring such predictions for every submission

would improve future challenges and drive progress in

this respect. From the beginning, we considered pro-

viding an uncertainty evaluation for each prediction an

important component of our model. Thus, our ability

to determine accurate uncertainty predictions based on

input chemistry shows our model’s potential to be a suc-

cessful predictive method. Previous SAMPL challenges

have highlighted the importance and difficulty in ac-

curately assessing model uncertainty for hydration free

energies [51] and distribution coefficients [7]. The large

error bars for ionizable sites we consider outside our do-

main of applicability provide evidence our uncertainty

estimates are working as desired. QQ plots also support

Page 12: SAMPL6 Challenge Results from pKa Predictions Based on a ...

SAMPL6 challenge results from pKa predictions based on a general Gaussian process model 11

the conclusion that our model is capable of predicting

its own uncertainty (Figure 9).

SAMPL6 was an opportunity to test our Gaussian

process model on an external test set and our first

completely blind set of predictions. Our new Gaussian

process model performed semi-competitively, especially

considering its limited training set compared to more

established methods which participated. We look for-

ward to incorporating important lessons from this chal-

lenge, particularly, expanding our definition of an ion-

izable group and improving our likelihood function to

include polyprotic molecules in our next training set.

Overall, SAMPL challenges provide an important ser-

vice to the community allowing participants to test

their predictive models in a blind manner.

Acknowledgements DLM and CCB appreciate the finan-cial support from the National Science Foundation (CHE1352608) and the National Institutes of Health (1R01GM108889-01). CCB was supported financially by OpenEye ScientificSoftware to build this model during Summer 2017 and is nowsupported by a fellowship from The Molecular Sciences Soft-ware Institute under NSF grant ACI-1547580. We are thank-ful for valuable conversations with OpenEye employees, theSAMPL6 organizers, and all challenge participants, and espe-cially to Merck for its contributions to the experimental workin this challenge. AGS would like to thank Paul Hawkins,Christopher Bayly and Robert Tolbert as well as AnthonyNicholls and Matthew Geballe for many insightful discussionsof pKa and machine learning.

4.1 Supplementary Materials

Included with this article you will find supplementary

materials in the form of a PDF with human readable

figures and tables and a compressed file with machine

readable data and analysis scripts. In the PDF we pro-

vide equations for computing macroscopic equilibrium

constants Ka1, Ka2

, and Ka3for a triprotic molecule

along with the corresponding thermodynamic cycle sim-

ilar to the one in Figure 2. Also included there is a full

list of all microscopic and macroscopic pKas our model

predicts for all 24 SAMPL6 molecules. In the electronic

materials we include the prediction files we submitted

for type I and type III, along with all the analysis scripts

we used to generate data and figures provided here. For

analysis of all SAMPL6 submissions [13] and details

on the experimental data [10] see the GitHub repos-

itory provided by challenge organizers (github.com/

MobleyLab/SAMPL6).

References

1. H. Wan, J. Ulander, High-throughput pKa screening andprediction amenable for ADME profiling, Expert Opin.

Drug Metab. Toxicol. 2(1), 139 (2006). DOI 10.1517/17425255.2.1.139

2. M.P. Gleeson, Generation of a Set of Simple, Inter-pretable ADMET Rules of Thumb, J. Med. Chem. 51(4),817 (2008). DOI 10.1021/jm701122q

3. D.T. Manallack, R.J. Prankerd, E. Yuriev, T.I. Oprea,D.K. Chalmers, The Significance of Acid/Base Propertiesin Drug Discovery, Chem. Soc. Rev. 42(2), 485 (2013).DOI 10.1039/c2cs35348b

4. J. Manchester, G. Walkup, O. Rivin, Z. You, Evaluationof pKa Estimation Methods on 211 Druglike Compounds,J. Chem. Inf. Model. 50(4), 565 (2010). DOI 10.1021/ci100019p

5. L. Settimo, K. Bellman, R.M.A. Knegtel, Comparison ofthe Accuracy of Experimental and Predicted pKa Valuesof Basic and Acidic Compounds, Pharm. Res. 31(4), 1082(2014). DOI 10.1007/s11095-013-1232-z

6. R. Fraczkiewicz, in Reference Module in Chemistry,Molecular Sciences and Chemical Engineering (Elsevier,2013). DOI 10.1016/B978-0-12-409547-2.02610-X

7. C.C. Bannan, K.H. Burley, M. Chiu, M.R. Shirts,M.K. Gilson, D.L. Mobley, Blind prediction of cyclo-hexane–water distribution coefficients from the SAMPL5challenge, J.Comput.-Aided Mol. Des. 30(11), 1 (2016).DOI 10.1007/s10822-016-9954-8

8. F.C. Pickard, G. Konig, F. Tofoleanu, J. Lee, A.C. Sim-monett, Y. Shao, J.W. Ponder, B.R. Brooks, Blind pre-diction of distribution in the SAMPL5 challenge withQM based protomer and pKa corrections, J. Comput.-Aided Mol. Des. 30(11), 1 (2016). DOI 10.1007/s10822-016-9955-7

9. B. Aguilar, R. Anandakrishnan, J.Z. Ruscio, A.V.Onufriev, Statistics and Physical Origins of pK and Ion-ization State Changes upon Protein-Ligand Binding, Bio-phys. J. 98(5), 872 (2010). DOI 10.1016/j.bpj.2009.11.016

10. M. Isik, D.L. Mobley, J.D. Chodera, SAMPL6 Experi-mental Place Holders, J Comput-Aided Mol Des (2018)

11. I.G. Darvey, The assignment of pKa values to functionalgroups in amino acids, Biochem. Educ. 23(2), 80 (1995).DOI 10.1016/0307-4412(94)00150-N

12. G.M. Bodner, Assigning the pKa’s of polyprotic acids, J.Chem. Educ. 63(3), 246 (1986). DOI 10.1021/ed063p246

13. M. Isik, D.L. Mobley, J.D. Chodera, SAMPL6 OverviewPlace Holders, J. Comput.-Aided Mol. Des. (2018)

14. O. Exner, in Advances in Linear Free Energy Relation-ships (Springer, Boston, MA, 1972), pp. 1–69. DOI10.1007/978-1-4615-8660-9 1

15. D. Perrin, B. Dempsey, E. Serjeant, pKa Prediction forOrganic Acids and Bases (Chapman and Hall, New York,NY, 1981)

16. S. Geidl, R. Svobodova Varekova, V. Bendova,L. Petrusek, C.M. Ionescu, Z. Jurka, R. Abagyan,J. Koca, How Does the Methodology of 3D StructurePreparation Influence the Quality of pKa Prediction?,J. Chem. Inf. Model. 55(6), 1088 (2015). DOI 10.1021/ci500758w

17. G. Cruciani, F. Milletti, L. Storchi, G. Sforna, L. Goracci,In silico pKa Prediction and ADME Profiling, Chem.Biodiversity. 6(11), 1812 (2009). DOI 10.1002/cbdv.200900153

18. A.R. Katritzky, M. Kuanar, S. Slavov, C.D. Hall,M. Karelson, I. Kahn, D.A. Dobchev, Quantitative Cor-relation of Physical and Chemical Properties with Chemi-cal Structure: Utility for Prediction, Chem. Rev. 110(10),5714 (2010). DOI 10.1021/cr900238d

Page 13: SAMPL6 Challenge Results from pKa Predictions Based on a ...

12 Caitlin C. Bannan et al.

19. K.L. Peterson, in Reviews in Computational Chemistry(Wiley, 2000), pp. 53–140

20. R. Fraczkiewicz, M. Lobell, A.H. Goller, U. Krenz,R. Schoenneis, R.D. Clark, A. Hillisch, Best of BothWorlds: Combining Pharma Data and State of the ArtModeling Technology To Improve in Silico pKa Predic-tion, J. Chem. Inf. Model. 55(2), 389 (2015). DOI10.1021/ci500585w

21. M.J. Citra, Estimating the pKa of phenols, carboxylicacids and alcohols from semi-empirical quantum chemicalmethods, Chemosphere 38(1), 191 (1999). DOI 10.1016/S0045-6535(98)00172-6

22. R.S. Varekova, S. Geidl, C.M. Ionescu, O. Skrehota,T. Bouchal, D. Sehnal, R. Abagyan, J. Koca, Predict-ing pKa values from EEM atomic charges, J. Cheminf.5, 18 (2013). DOI 10.1186/1758-2946-5-18

23. S.L. Dixon, P.C. Jurs, Estimation of pKa for organic oxy-acids using calculated atomic charges, J. Comput. Chem.14(12), 1460 (1993). DOI 10.1002/jcc.540141208

24. Y.E. Zevatskii, D.V. Samoilov, Modern methods for esti-mation of ionization constants of organic compounds insolution, Russ. J. Org. Chem. 47(10), 1445 (2011). DOI10.1134/S1070428011100010

25. P. Pracht, C.A. Bauer, S. Grimme, Automated and ef-ficient quantum chemical determination and energeticranking of molecular protonation sites, J. Comput.Chem. 38(30), 2618 (2017). DOI 10.1002/jcc.24922

26. A.D. Bochevarov, E. Harder, T.F. Hughes, J.R. Green-wood, D.A. Braden, D.M. Philipp, D. Rinaldo, M.D.Halls, J. Zhang, R.A. Friesner, Jaguar: A high-performance quantum chemistry software program withstrengths in life and materials sciences, Int. J. QuantumChem. 113(18), 2110 (2013). DOI 10.1002/qua.24481

27. A.D. Bochevarov, M.A. Watson, J.R. Greenwood, D.M.Philipp, Multiconformation, density functional theory-based pka prediction in application to large, flexible or-ganic molecules with diverse functional groups, J. Chem.Theory Comput. 12(12), 6001 (2016). DOI 10.1021/acs.jctc.6b00805

28. C.E. Rasmussen, C.K.I. Williams, Gaussian Processesfor Machine Learning. Adaptive computation and ma-chine learning (MIT Press, Cambridge, Mass, 2006)

29. OpeneEye Scientific Software, Inc. OEChem Toolkit(2018). URL http://www.eyesopen.com

30. P.C.D. Hawkins, A.G. Skillman, G.L. Warren, B.A.Ellingson, M.T. Stahl, Conformer Generation withOMEGA: Algorithm and Validation Using High QualityStructures from the Protein Databank and CambridgeStructural Database, J. Chem. Inf. Model 50(4), 572(2010). DOI 10.1021/ci100031x

31. K.B. Wiberg, Application of the pople-santry-segalCNDO method to the cyclopropylcarbinyl and cyclobutylcation and to bicyclobutane, Tetrahedron 24(3), 1083(1968). DOI 10.1016/0040-4020(68)88057-3

32. I. Mayer, Bond order and valence indices: A personalaccount, J. Comput. Chem. 28(1), 204 (2007). DOI10.1002/jcc.20494

33. OpeneEye Scientific Software, Inc. OEQuacPac Toolkit(2018). URL http://www.eyesopen.com

34. A. Jakalian, B.L. Bush, D.B. Jack, C.I. Bayly, Fast, ef-ficient generation of high-quality atomic charges. AM1-BCC model: I. Method, J. Comput. Chem. 21(2),132 (2000). DOI 10.1002/(SICI)1096-987X(20000130)21:2〈132::AID-JCC5〉3.0.CO;2-P

35. A. Jakalian, D.B. Jack, C.I. Bayly, Fast, efficient gener-ation of high-quality atomic charges. AM1-BCC model:

II. Parameterization and validation, J. Comput. Chem.23(16), 1623 (2002). DOI 10.1002/jcc.10128

36. S. Jelfs, P. Ertl, P. Selzer, Estimation of pKa for druglikecompounds using semiempirical and information-baseddescriptors, J. Chem. Inf. Model. 47(2), 450 (2007). DOI10.1021/ci600285n

37. A. Nicholls, S. Wlodek, J.A. Grant, SAMPL2 and contin-uum modeling, J. Comput.-Aided Mol. Des. 24(4), 293(2010). DOI 10.1007/s10822-010-9334-8

38. J.A. Grant, B.T. Pickup, A. Nicholls, A smooth permit-tivity function for Poisson-Boltzmann solvation methods,J. Comput. Chem. 22(6), 608 (2001). DOI 10.1002/jcc.1032

39. A. Nicholls. Spicoli: A Surface Toolkit, dude (2004)40. B. Lee, F.M. Richards, The interpretation of protein

structures: Estimation of static accessibility, J. Mol. Biol.55(3), 379 (1971). DOI 10.1016/0022-2836(71)90324-X

41. M.L. Connolly, Analytical molecular surface calcula-tion, J. Appl. Cryst. 16(5), 548 (1983). DOI 10.1107/S0021889883010985

42. K.A. Sharp, A. Nicholls, R.F. Fine, B. Honig, Reconcil-ing the magnitude of the microscopic and macroscopichydrophobic effects, Science 252(5002), 106 (1991). DOI10.1126/science.2011744

43. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cour-napeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine Learning in Python, J. Mach. Learn. Res.12, 2825 (2011)

44. G. Kortum, W. Vogel, K. Andrussow, Disssociationconstants of organic acids in aqueous solution, PureAppl. Chem. 1(2-3), 187 (1960). DOI 10.1351/pac196001020187

45. D.D. Perrin, Dissociation Constants of Organic Basesin Aqueous Solution: Supplement 1972 (Butterworths,1972)

46. P. Serjeant, B. Dempsey, Ionisation Constants of Or-ganic Acids in Aqueous Solution, vol. 23 (Pergamon,1979)

47. T. Hastie, R. Tibshirani, J.H. Friedman, The elementsof statistical learning: data mining, inference, and pre-diction, 2nd edn. Springer series in statistics (Springer,New York, NY, 2009)

48. H.W. Kuhn, The Hungarian method for the assignmentproblem, Nav. Res. Logist. 52(1), 7 (2004). DOI 10.1002/nav.20053

49. Advanced Chemistry Development, Inc. pKa Classic(2015). URL www.acdlabs.com

50. R.F. Evans, 460. hydropyrimidines. part iii. reduction ofamino-pyrimidines, J. Chem. Soc. 0, 2450 (1964). DOI10.1039/JR9640002450

51. D.L. Mobley, K.L. Wymer, N.M. Lim, J.P. Guthrie, Blindprediction of solvation free energies from the SAMPL4challenge, J. Comput.-Aided Mol. Des. 28(3), 135 (2014).DOI 10.1007/s10822-014-9718-2

Page 15: SAMPL6 Challenge Results from pKa Predictions Based on a ...

Supporting Information:SAMPL6 challenge results from pKa predictionsbased on a general Gaussian process model

Caitlin C. Bannan · David L. Mobley · A. Geoffrey Skillman

SI 1 Triprotic molecules require a more

complicated thermodynamic cycle

As the number of ionizable groups increases, the

corresponding thermodynamic cycle becomes in-

creasingly complex. For a molecule we three ioniz-

able groups the macroscopic equilibrium constants

(Ka’s) are given by:

Ka1 = a + b + c (3)

Ka2=

egh + egi + efi

gh + eg + eidgh = efi (4)

Ka3=

1

j+

1

k+

1

l(5)

where a to l for equilibrium constants with a,

b, and c for the first deprotonation followed by d

to i for the second and j, k, and l for the third.

The corresponding thermodynamic cycle is shown

in Figure SI 1.

C. C. Bannan(1) Department of Chemistry, University of California,Irvine,(2) 2017 Summer Intern, OpenEye Scientific Software,Inc., Santa Fe, NM.

D. L. MobleyDepartments of Pharmaceutical Sciences and Chem-istry, University of California, Irvine147 Bison Modular, Irvine, CA 92697 Tel.: +949-824-6383Fax: +949-824-2949E-mail: [email protected]

A. G. SkillmanOpenEye Scientific Software, Inc.9 Bisbee Court, Suite D, Santa Fe, NM 87508 Tel.: +505-473-7385E-mail: [email protected]

SI 2 Microscopic pKa’s were predicted for

all molecules

In the main paper we analyzed a subset of micro-

scopic pKa’s based on experimental data for SM07

which was extrapolated to five molecules which are

also 4-amino quinazoline derivatives. We also com-

pared all of microscopic pKa predictions with Sim-

ulationPlus and ACD lab’s predictions. Below we

report all mcroscopic pKa’s our model predicts for

every molecule in SAMPL6 (Tables SI 1, SI 2, SI

3, and SI 4)

SI 3 Not all macroscopic predictions were

paired with experiment

For every molecule in SAMPL6 we predicted a

number of macroscopic pKa’s based on the number

of ionizable groups we identified. That is for each

diprotic molecule we reported two pKa’s and each

triprotic three pKa’s and so forth. In the main text

we reported only the macroscopic pKa’s actually

used for analysis, that is the ones paired to ex-

perimental values using the Hungarian algorithm.

Here we provide a complete table of all macro-

scopic pKa’s predicted by our method for each

molecule in SAMPL6 (Table SI 5).

Page 16: SAMPL6 Challenge Results from pKa Predictions Based on a ...

2 Caitlin C. Bannan et al.

Fig. SI 1: We identify three ionizable groups in SM15 from SAMPL6. This thermodynamic cycle shows an exampleof how microscopic transitions with equilibrium constants a-l are related to the macroscopic equilibrium constants(Ka1 , Ka2 , and Ka3) as in equations 3, 4, and 5

Page 17: SAMPL6 Challenge Results from pKa Predictions Based on a ...

Title Suppressed Due to Excessive Length 3

Table SI 1: All microscopic pKa’s we predict forSAMPL6 molecules SM01 to SM10

Acid microstate ID Base microstate ID Prediction

SM01 micro005 SM01 micro004 9.27 ± 0.17SM02 micro014 SM02 micro008 5.70 ± 4.25SM02 micro014 SM02 micro012 5.69 ± 4.25SM02 micro014 SM02 micro013 5.60 ± 4.25SM02 micro008 SM02 micro004 3.33 ± 4.12SM02 micro008 SM02 micro006 1.61 ± 3.92SM02 micro012 SM02 micro004 3.92 ± 4.12SM02 micro012 SM02 micro005 3.06 ± 3.95SM02 micro013 SM02 micro005 3.30 ± 4.14SM02 micro013 SM02 micro006 2.78 ± 4.06SM02 micro004 SM02 micro002 5.32 ± 1.28SM02 micro005 SM02 micro002 4.00 ± 1.54SM02 micro006 SM02 micro002 −1.13 ± 1.95SM03 micro024 SM03 micro001 2.38 ± 4.06SM03 micro024 SM03 micro016 3.47 ± 4.10SM03 micro001 SM03 micro012 4.16 ± 2.69SM03 micro016 SM03 micro012 4.22 ± 2.33SM04 micro016 SM04 micro013 3.94 ± 4.13SM04 micro016 SM04 micro014 5.15 ± 4.21SM04 micro016 SM04 micro015 3.76 ± 4.05SM04 micro013 SM04 micro002 3.68 ± 4.11SM04 micro013 SM04 micro008 1.50 ± 3.90SM04 micro014 SM04 micro002 3.91 ± 4.11SM04 micro014 SM04 micro006 2.97 ± 3.91SM04 micro015 SM04 micro006 3.53 ± 4.11SM04 micro015 SM04 micro008 2.81 ± 4.06SM04 micro002 SM04 micro003 6.73 ± 1.37SM04 micro006 SM04 micro003 4.94 ± 1.75SM04 micro008 SM04 micro003 0.26 ± 1.97SM05 micro010 SM05 micro001 7.62 ± 1.04SM06 micro002 SM06 micro005 2.38 ± 2.83SM06 micro002 SM06 micro007 1.89 ± 3.09SM06 micro005 SM06 micro001 2.19 ± 0.31SM06 micro007 SM06 micro001 3.93 ± 0.53SM07 micro016 SM07 micro013 5.68 ± 4.25SM07 micro016 SM07 micro014 5.70 ± 4.25SM07 micro016 SM07 micro015 5.57 ± 4.25SM07 micro013 SM07 micro006 3.65 ± 4.11SM07 micro013 SM07 micro011 1.49 ± 3.89SM07 micro014 SM07 micro006 3.85 ± 4.10SM07 micro014 SM07 micro007 2.92 ± 3.90SM07 micro015 SM07 micro007 3.49 ± 4.11SM07 micro015 SM07 micro011 2.76 ± 4.06SM07 micro006 SM07 micro004 7.05 ± 1.39SM07 micro007 SM07 micro004 5.12 ± 1.75SM07 micro011 SM07 micro004 0.46 ± 1.98SM08 micro011 SM08 micro003 10.00 ± 1.09SM08 micro011 SM08 micro006 4.61 ± 0.23SM08 micro003 SM08 micro004 6.92 ± 3.71SM08 micro006 SM08 micro004 10.82 ± 3.73SM09 micro015 SM09 micro008 5.66 ± 4.25SM09 micro015 SM09 micro013 5.66 ± 4.25SM09 micro015 SM09 micro014 5.55 ± 4.25SM09 micro008 SM09 micro001 3.17 ± 4.09SM09 micro008 SM09 micro006 1.34 ± 3.87SM09 micro013 SM09 micro001 3.55 ± 4.06SM09 micro013 SM09 micro005 2.74 ± 3.87SM09 micro014 SM09 micro005 2.99 ± 4.11SM09 micro014 SM09 micro006 2.41 ± 4.01SM09 micro001 SM09 micro003 6.09 ± 1.28SM09 micro005 SM09 micro003 4.71 ± 1.52SM09 micro006 SM09 micro003 −0.19 ± 2.08SM10 micro028 SM10 micro036 6.44 ± 0.98

Table SI 2: All microscopic pKa’s we predict forSAMPL6 molecules SM11 to SM16

Acid microstate ID Base microstate ID Prediction

SM11 micro027 SM11 micro023 5.85 ± 4.25SM11 micro027 SM11 micro024 5.85 ± 4.25SM11 micro027 SM11 micro025 5.85 ± 4.25SM11 micro027 SM11 micro026 5.84 ± 4.25SM11 micro023 SM11 micro019 5.60 ± 4.25SM11 micro023 SM11 micro022 5.66 ± 4.25SM11 micro023 SM11 micro031 5.76 ± 4.25SM11 micro024 SM11 micro017 5.60 ± 4.25SM11 micro024 SM11 micro021 5.77 ± 4.25SM11 micro024 SM11 micro022 5.57 ± 4.25SM11 micro025 SM11 micro018 5.52 ± 4.25SM11 micro025 SM11 micro021 5.70 ± 4.25SM11 micro025 SM11 micro031 5.66 ± 4.25SM11 micro026 SM11 micro017 5.71 ± 4.25SM11 micro026 SM11 micro018 5.72 ± 4.25SM11 micro026 SM11 micro019 5.65 ± 4.25SM11 micro017 SM11 micro011 4.89 ± 4.19SM11 micro017 SM11 micro013 2.49 ± 4.01SM11 micro018 SM11 micro008 3.80 ± 4.09SM11 micro018 SM11 micro011 4.34 ± 4.17SM11 micro019 SM11 micro008 4.54 ± 4.17SM11 micro019 SM11 micro013 3.43 ± 4.12SM11 micro021 SM11 micro009 3.02 ± 3.96SM11 micro021 SM11 micro011 2.91 ± 3.97SM11 micro022 SM11 micro009 4.79 ± 4.16SM11 micro022 SM11 micro013 3.36 ± 3.91SM11 micro031 SM11 micro008 2.93 ± 3.93SM11 micro031 SM11 micro009 3.77 ± 4.02SM11 micro008 SM11 micro005 4.74 ± 1.99SM11 micro009 SM11 micro005 1.87 ± 1.04SM11 micro011 SM11 micro005 3.78 ± 1.53SM11 micro013 SM11 micro005 4.75 ± 3.29SM12 micro015 SM12 micro010 5.69 ± 4.25SM12 micro015 SM12 micro013 5.68 ± 4.25SM12 micro015 SM12 micro014 5.59 ± 4.25SM12 micro010 SM12 micro006 3.21 ± 4.10SM12 micro010 SM12 micro008 1.50 ± 3.90SM12 micro013 SM12 micro001 2.93 ± 3.91SM12 micro013 SM12 micro006 3.80 ± 4.10SM12 micro014 SM12 micro001 3.11 ± 4.12SM12 micro014 SM12 micro008 2.60 ± 4.03SM12 micro001 SM12 micro012 4.38 ± 1.50SM12 micro006 SM12 micro012 5.89 ± 1.28SM12 micro008 SM12 micro012 −0.71 ± 1.96SM13 micro015 SM13 micro003 5.56 ± 4.25SM13 micro015 SM13 micro013 5.43 ± 4.25SM13 micro015 SM13 micro014 5.37 ± 4.25SM13 micro003 SM13 micro001 2.52 ± 4.00SM13 micro003 SM13 micro008 0.75 ± 3.74SM13 micro013 SM13 micro001 3.41 ± 4.03SM13 micro013 SM13 micro012 2.21 ± 3.78SM13 micro014 SM13 micro008 1.56 ± 3.86SM13 micro014 SM13 micro012 1.97 ± 4.00SM13 micro001 SM13 micro005 6.95 ± 1.37SM13 micro008 SM13 micro005 0.57 ± 1.98SM13 micro012 SM13 micro005 4.93 ± 1.59SM14 micro003 SM14 micro002 2.73 ± 3.66SM14 micro003 SM14 micro004 0.13 ± 3.43SM14 micro002 SM14 micro001 6.48 ± 0.56SM14 micro004 SM14 micro001 4.83 ± 0.22SM15 micro003 SM15 micro001 7.26 ± 1.57SM15 micro003 SM15 micro002 5.43 ± 0.45SM15 micro001 SM15 micro004 7.61 ± 1.69SM15 micro002 SM15 micro004 8.68 ± 0.19SM16 micro004 SM16 micro002 5.91 ± 0.34

Page 18: SAMPL6 Challenge Results from pKa Predictions Based on a ...

4 Caitlin C. Bannan et al.

Table SI 3: All microscopic pKa’s we predict forSAMPL6 molecules SM17 to SM22

Acid microstate ID Base microstate ID Prediction

SM17 micro008 SM17 micro006 5.46 ± 4.25SM17 micro008 SM17 micro007 5.09 ± 4.24SM17 micro008 SM17 micro009 4.94 ± 4.22SM17 micro006 SM17 micro004 2.06 ± 3.29SM17 micro006 SM17 micro011 1.82 ± 3.61SM17 micro007 SM17 micro005 2.61 ± 3.66SM17 micro007 SM17 micro011 4.06 ± 3.88SM17 micro009 SM17 micro004 2.86 ± 4.11SM17 micro009 SM17 micro005 2.57 ± 4.07SM17 micro004 SM17 micro010 3.65 ± 1.93SM17 micro005 SM17 micro010 3.07 ± 1.88SM17 micro011 SM17 micro010 3.96 ± 0.36SM18 micro026 SM18 micro004 1.04 ± 3.79SM18 micro026 SM18 micro034 0.69 ± 2.70SM18 micro026 SM18 micro056 −0.18 ± 3.20SM18 micro004 SM18 micro002 3.11 ± 1.05SM18 micro004 SM18 micro069 5.84 ± 1.54SM18 micro034 SM18 micro002 2.39 ± 3.68SM18 micro034 SM18 micro016 2.96 ± 1.02SM18 micro056 SM18 micro016 2.20 ± 0.95SM18 micro056 SM18 micro069 2.28 ± 3.49SM18 micro002 SM18 micro071 10.98 ± 1.59SM18 micro016 SM18 micro071 7.87 ± 1.66SM18 micro069 SM18 micro071 7.77 ± 2.70SM19 micro006 SM19 micro007 6.04 ± 0.88SM20 micro003 SM20 micro004 7.31 ± 1.84SM21 micro036 SM21 micro032 5.53 ± 4.22SM21 micro036 SM21 micro033 5.27 ± 4.23SM21 micro036 SM21 micro034 5.43 ± 4.22SM21 micro036 SM21 micro035 5.36 ± 4.23SM21 micro032 SM21 micro026 5.63 ± 4.25SM21 micro032 SM21 micro027 5.70 ± 4.25SM21 micro032 SM21 micro028 5.64 ± 4.25SM21 micro033 SM21 micro028 5.59 ± 4.25SM21 micro033 SM21 micro029 5.57 ± 4.25SM21 micro033 SM21 micro031 5.64 ± 4.25SM21 micro034 SM21 micro026 5.65 ± 4.25SM21 micro034 SM21 micro029 5.66 ± 4.25SM21 micro034 SM21 micro030 5.71 ± 4.25SM21 micro035 SM21 micro027 5.54 ± 4.25SM21 micro035 SM21 micro030 5.53 ± 4.25SM21 micro035 SM21 micro031 5.55 ± 4.25SM21 micro026 SM21 micro001 4.29 ± 4.14SM21 micro026 SM21 micro024 3.50 ± 4.06SM21 micro027 SM21 micro001 2.51 ± 4.02SM21 micro027 SM21 micro022 2.72 ± 3.96SM21 micro028 SM21 micro022 3.59 ± 4.06SM21 micro028 SM21 micro024 2.11 ± 4.02SM21 micro029 SM21 micro002 3.45 ± 4.05SM21 micro029 SM21 micro024 2.67 ± 4.03SM21 micro030 SM21 micro001 2.76 ± 4.04SM21 micro030 SM21 micro002 2.10 ± 3.96SM21 micro031 SM21 micro002 1.62 ± 3.86SM21 micro031 SM21 micro022 2.23 ± 3.92SM21 micro001 SM21 micro020 3.73 ± 1.81SM21 micro002 SM21 micro020 1.73 ± 2.60SM21 micro022 SM21 micro020 −0.47 ± 2.00SM21 micro024 SM21 micro020 4.32 ± 2.49SM22 micro002 SM22 micro001 2.78 ± 0.27SM22 micro002 SM22 micro003 3.69 ± 2.08SM22 micro001 SM22 micro004 6.59 ± 1.04SM22 micro003 SM22 micro004 5.09 ± 2.03

Table SI 4: All microscopic pKa’s we predict forSAMPL6 molecules SM23 and SM24

Acid microstate ID Base microstate ID Prediction

SM23 micro048 SM23 micro044 5.11 ± 4.22SM23 micro048 SM23 micro045 5.28 ± 4.23SM23 micro048 SM23 micro046 5.32 ± 4.22SM23 micro048 SM23 micro047 5.15 ± 4.22SM23 micro044 SM23 micro038 5.52 ± 4.25SM23 micro044 SM23 micro039 5.64 ± 4.25SM23 micro044 SM23 micro041 5.53 ± 4.25SM23 micro045 SM23 micro039 5.44 ± 4.25SM23 micro045 SM23 micro040 5.44 ± 4.25SM23 micro045 SM23 micro043 5.44 ± 4.25SM23 micro046 SM23 micro038 5.57 ± 4.25SM23 micro046 SM23 micro040 5.69 ± 4.25SM23 micro046 SM23 micro042 5.59 ± 4.25SM23 micro047 SM23 micro041 5.47 ± 4.25SM23 micro047 SM23 micro042 5.45 ± 4.25SM23 micro047 SM23 micro043 5.56 ± 4.25SM23 micro038 SM23 micro007 4.41 ± 4.13SM23 micro038 SM23 micro025 3.44 ± 4.01SM23 micro039 SM23 micro007 2.42 ± 3.98SM23 micro039 SM23 micro021 2.78 ± 3.86SM23 micro040 SM23 micro007 2.47 ± 3.99SM23 micro040 SM23 micro024 2.20 ± 3.90SM23 micro041 SM23 micro021 3.70 ± 4.02SM23 micro041 SM23 micro025 1.80 ± 3.95SM23 micro042 SM23 micro024 3.72 ± 4.04SM23 micro042 SM23 micro025 2.35 ± 3.97SM23 micro043 SM23 micro021 2.09 ± 3.86SM23 micro043 SM23 micro024 1.66 ± 3.79SM23 micro007 SM23 micro012 5.30 ± 2.01SM23 micro021 SM23 micro012 0.77 ± 2.25SM23 micro024 SM23 micro012 2.32 ± 2.76SM23 micro025 SM23 micro012 5.00 ± 2.71SM24 micro035 SM24 micro020 4.96 ± 4.24SM24 micro035 SM24 micro032 5.54 ± 4.24SM24 micro035 SM24 micro033 5.22 ± 4.24SM24 micro035 SM24 micro034 5.18 ± 4.24SM24 micro020 SM24 micro010 3.53 ± 4.04SM24 micro020 SM24 micro011 3.14 ± 3.88SM24 micro020 SM24 micro031 1.74 ± 3.89SM24 micro032 SM24 micro010 1.93 ± 3.90SM24 micro032 SM24 micro017 0.73 ± 3.69SM24 micro032 SM24 micro029 2.18 ± 3.94SM24 micro033 SM24 micro011 1.28 ± 3.88SM24 micro033 SM24 micro017 1.93 ± 3.86SM24 micro033 SM24 micro030 1.62 ± 3.87SM24 micro034 SM24 micro029 5.79 ± 4.25SM24 micro034 SM24 micro030 1.63 ± 3.56SM24 micro034 SM24 micro031 0.69 ± 3.72SM24 micro010 SM24 micro006 4.13 ± 1.51SM24 micro010 SM24 micro023 6.82 ± 2.79SM24 micro011 SM24 micro006 4.54 ± 1.84SM24 micro011 SM24 micro013 3.09 ± 2.45SM24 micro017 SM24 micro006 0.55 ± 2.05SM24 micro017 SM24 micro009 6.62 ± 2.86SM24 micro029 SM24 micro009 8.57 ± 2.97SM24 micro029 SM24 micro023 5.74 ± 4.25SM24 micro030 SM24 micro009 5.68 ± 4.25SM24 micro030 SM24 micro013 0.93 ± 1.67SM24 micro031 SM24 micro013 0.67 ± 1.98SM24 micro031 SM24 micro023 5.71 ± 4.25SM24 micro006 SM24 micro027 11.28 ± 1.31SM24 micro009 SM24 micro027 4.11 ± 2.63SM24 micro013 SM24 micro027 5.83 ± 4.25SM24 micro023 SM24 micro027 7.40 ± 2.52

Page 19: SAMPL6 Challenge Results from pKa Predictions Based on a ...

Title Suppressed Due to Excessive Length 5

Table SI 5: All macroscopic pKa’s we predict forSAMPL6 molecules

SAMPL6 ID Prediction

SM01 9.27 ± 0.17

SM025.19 ± 2.472.84 ± 2.895.34 ± 1.34

SM032.35 ± 3.764.49 ± 3.47

SM043.53 ± 2.872.83 ± 3.196.73 ± 1.39

SM05 7.62 ± 1.04

SM061.77 ± 2.433.94 ± 0.54

SM075.17 ± 2.472.78 ± 3.177.06 ± 1.40

SM084.61 ± 0.2310.82 ± 3.73

SM095.14 ± 2.472.49 ± 2.926.11 ± 1.34

SM10 6.44 ± 0.98

SM11

5.25 ± 2.125.41 ± 2.425.22 ± 0.025.07 ± 3.59

SM125.17 ± 2.472.68 ± 2.845.90 ± 1.32

SM134.97 ± 2.491.84 ± 2.636.96 ± 1.38

SM140.12 ± 3.426.49 ± 0.58

SM155.42 ± 0.458.71 ± 0.20

SM16 5.91 ± 0.34

SM174.64 ± 2.663.47 ± 4.204.17 ± 0.58

SM18−0.26 ± 2.705.00 ± 4.3910.98 ± 1.59

SM19 6.04 ± 0.88SM20 7.31 ± 1.84

SM21

4.78 ± 2.165.41 ± 2.414.07 ± 0.024.42 ± 2.96

SM222.73 ± 0.346.60 ± 1.08

SM23

4.60 ± 2.155.30 ± 2.453.45 ± 0.025.48 ± 2.83

SM24

4.58 ± 2.301.71 ± 3.143.61 ± 0.0211.28 ± 1.31