

Consensus QSAR modeling for the toxicity of organic chemicals against Pseudokirchneriella subcapitata using 2D descriptors

Kabiruddin Khan¹, Kunal Roy¹*

¹ Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, 188 Raja S C Mullick Road, 700032, Kolkata, India

Ecotoxicity of organic chemicals (OCs)

Protocol followed

• OCs are the most extensively used chemical class
• Strong potential PBT (persistent, bioaccumulative and toxic) candidates
• Include pharmaceuticals, UV filters, hormones, biocides, endocrine-disrupting chemicals, etc.
• Environmental risk assessment of many OCs became mandatory in the US (since 1995) and the EU (since 1993)
• NOEC and ECx are the most widely used measures of ecotoxicity

What is NOEC?

• No-observed-effect concentration
• Highest concentration at which there is no significant change relative to the control population for the measured endpoint
• Represents the lower threshold for toxicity

What is ECx?

• Denotes the x% effective concentration, where x% is the change in the population of the studied species (x can be 5-100%)

ECx replacing NOEC?

• OECD and SETAC recommend replacing NOEC with ECx (x = 5-20)
• At the US EPA, effluent testing relies on NOEC and EC25
• Under the EU REACH regulation, NOEC and ECx are fundamental information for indicating toxicity
• An OECD workshop in 1996 reached consensus on the need to replace NOEC with ECx

Objectives

1. To propose robust QSAR models to calculate EC10 and EC50 concentrations in algae
2. To demonstrate the advantage of consensus modeling in error reduction
3. To identify features responsible for algal immobility
4. To fill the data gap for untested/unknown compounds using the developed models
5. To analyze the predictive efficiency of the developed models


pEC50 models: equations

1. pEC50 Algae (mol/L) = 3.67917 + 0.02888 × CrippenMR + 0.39725 × MLOGP + 0.60323 × B06[C-N] + 0.78557 × B05[C-Cl] + 1.16506 × B02[N-S] − 1.46683 × nSO2OH
2. pEC50 Algae (mol/L) = 3.90575 + 0.02754 × CrippenMR + 0.2973 × Log Kow + 0.67324 × B06[C-N] + 0.83114 × B05[C-Cl] + 1.12005 × B02[N-S] − 1.05917 × nSO2OH
3. pEC50 Algae (mol/L) = 3.90595 + 0.03373 × CrippenMR + 0.28608 × Log Kow + 0.96872 × B05[C-Cl] + 1.07463 × B02[N-S] − 0.20679 × H-051 − 1.23943 × nSO2OH
4. pEC50 Algae (mol/L) = 3.943 + 0.02707 × CrippenMR + 0.30156 × Log Kow + 0.67215 × B06[C-N] + 0.81936 × B05[C-Cl] + 0.61348 × F02[N-S] − 1.15045 × nSO2OH
5. pEC50 Algae (mol/L) = 3.92235 + 0.03364 × CrippenMR + 0.28524 × Log Kow + 0.96588 × B05[C-Cl] + 0.65962 × F02[N-S] − 0.20618 × H-051 − 1.23579 × nSO2OH

pEC50 models: statistical quality of the individual models

| M. no | Div | LV | R² | Q² | MAEtrain100% | MAEtrain95% | Prediction quality (train) a | Q²F1 | MAEtest100% | MAEtest95% | Prediction quality (test) a |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ED | 3 | 0.72 | 0.70 | 0.69 | 0.62 | Moderate | 0.70 | 0.67 | 0.60 | Good |
| 2 | ED | 3 | 0.72 | 0.70 | 0.70 | 0.63 | Moderate | 0.71 | 0.63 | 0.56 | Moderate |
| 3 | ED | 3 | 0.72 | 0.70 | 0.69 | 0.62 | Moderate | 0.71 | 0.64 | 0.57 | Good |
| 4 | ED | 3 | 0.71 | 0.70 | 0.70 | 0.63 | Moderate | 0.70 | 0.64 | 0.57 | Moderate |
| 5 | ED | 3 | 0.71 | 0.70 | 0.70 | 0.63 | Moderate | 0.70 | 0.65 | 0.59 | Moderate |

pEC50 models: consensus predictions (test set)

| Model | Description | Q²F1 | MAEtest100% | MAEtest95% | Prediction quality a |
|---|---|---|---|---|---|
| CM0 | Average of predictions from all input individual models | 0.72 | 0.62 | 0.55 | Good |
| CM1 | Average of predictions from 'qualified' individual models | 0.72 | 0.62 | 0.55 | Good |
| CM2 | Weighted average predictions from 'qualified' individual models | 0.72 | 0.62 | 0.54 | Good |
| CM3 | Best selection of predictions (compound-wise) from 'qualified' individual models [Winner Model] | 0.71 | 0.61 | 0.53 | Good |

a MAE criteria
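Each individual model is a plain linear equation, so a prediction is the intercept plus the sum of coefficient × descriptor value. The sketch below evaluates pEC50 model 1 with the coefficients printed on the poster. The descriptor values in `hypothetical` are invented for illustration only, and the back-conversion assumes the usual convention pEC50 = −log10(EC50 in mol/L), which the poster does not state explicitly.

```python
# Coefficients of pEC50 model 1, copied from the poster.
MODEL_1 = {
    "intercept": 3.67917,
    "CrippenMR": 0.02888,
    "MLOGP": 0.39725,
    "B06[C-N]": 0.60323,
    "B05[C-Cl]": 0.78557,
    "B02[N-S]": 1.16506,
    "nSO2OH": -1.46683,
}


def predict_pec50(descriptors, model=MODEL_1):
    """Linear QSAR prediction: intercept + sum(coefficient * descriptor value)."""
    return model["intercept"] + sum(
        coef * descriptors[name]
        for name, coef in model.items()
        if name != "intercept"
    )


def pec_to_molar(pec):
    """Convert a pEC value to mol/L, ASSUMING pEC = -log10(EC in mol/L)."""
    return 10.0 ** (-pec)


# Hypothetical descriptor values (not taken from the dataset).
hypothetical = {
    "CrippenMR": 50.0,
    "MLOGP": 2.0,
    "B06[C-N]": 1,
    "B05[C-Cl]": 0,
    "B02[N-S]": 0,
    "nSO2OH": 0,
}

pec50 = predict_pec50(hypothetical)
print(f"predicted pEC50 = {pec50:.4f}, EC50 = {pec_to_molar(pec50):.3e} mol/L")
```

Swapping in the coefficient set of another model (for example one that uses Log Kow instead of MLOGP) only requires a different dictionary with the same structure.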

pEC10 models: equations

1. pEC10 Algae (mol/L) = 1.93569 + 0.02315 × CrippenMR + 0.00496 × MW + 0.3396 × MLOGP + 0.64439 × Log Kow − 1.71498 × nSO2OH + 1.48957 × B02[N-S] − 2.24679 × X3A − 0.17172 × H-051
2. pEC10 Algae (mol/L) = 1.93569 + 0.02315 × CrippenMR + 0.00496 × MW + 0.3396 × MLOGP + 0.64439 × B10[C-N] − 1.71498 × nSO2OH + 1.48957 × F01[C-C] − 2.24679 × X3A − 0.17172 × H-051
3. pEC10 Algae (mol/L) = 1.93569 + 0.02315 × CrippenMR + 0.00496 × MW + 0.3396 × MLOGP + 0.64439 × B10[C-N] − 1.71498 × nSO2OH + 1.48957 × B02[N-S] − 2.24679 × X3A − 0.17172 × H-051
4. pEC10 Algae (mol/L) = 1.93569 + 0.02315 × CrippenMR + 0.00496 × MW + 0.3396 × MLOGP + 0.64439 × Log Kow − 1.71498 × nSO2OH + 1.48957 × F01[C-C] − 2.24679 × X3A − 0.17172 × H-051
5. pEC10 Algae (mol/L) = 1.71202 + 0.02658 × CrippenMR + 0.00556 × MW + 0.349452 × MLOGP − 1.8637 × nSO2OH + 1.26739 × F01[C-C] − 1.98596 × X3A − 0.17195 × H-051 − 0.01353 × F04[C-S]

pEC10 models: statistical quality of the individual models

| M. no | Div | LV | R² | Q² | MAEtrain100% | MAEtrain95% | Prediction quality (train) a | Q²F1 | MAEtest100% | MAEtest95% | Prediction quality (test) a |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SR | 2 | 0.70 | 0.68 | 0.70 | 0.62 | Moderate | 0.77 | 0.61 | 0.56 | Good |
| 2 | SR | 2 | 0.70 | 0.68 | 0.70 | 0.62 | Moderate | 0.77 | 0.61 | 0.56 | Moderate |
| 3 | SR | 2 | 0.70 | 0.68 | 0.70 | 0.62 | Moderate | 0.77 | 0.61 | 0.56 | Good |
| 4 | SR | 2 | 0.70 | 0.68 | 0.70 | 0.62 | Moderate | 0.77 | 0.61 | 0.56 | Good |
| 5 | SR | 2 | 0.70 | 0.68 | 0.71 | 0.63 | Moderate | 0.76 | 0.62 | 0.57 | Good |

pEC10 models: consensus predictions (test set)

| Model | Description | Q²F1 | MAEtest100% | MAEtest95% | Prediction quality a |
|---|---|---|---|---|---|
| CM0 | Average of predictions from all input individual models [Winner Model] | 0.77 | 0.61 | 0.55 | Good |
| CM1 | Average of predictions from 'qualified' individual models | 0.76 | 0.62 | 0.56 | Good |
| CM2 | Weighted average predictions from 'qualified' individual models | 0.76 | 0.62 | 0.56 | Good |
| CM3 | Best selection of predictions (compound-wise) from 'qualified' individual models | 0.76 | 0.63 | 0.57 | Moderate |

a MAE criteria
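The test-set columns in these tables can be reproduced from observed and predicted values with a few lines of code. The sketch below is a minimal illustration, not the authors' tool. It assumes that MAE95% is the mean absolute error after discarding the 5% of compounds with the largest absolute residuals, and it uses the standard definition Q²F1 = 1 − Σ(y_obs − y_pred)² / Σ(y_obs − ȳ_train)², where ȳ_train is the training-set mean response. The function names are hypothetical.

```python
def mae(y_obs, y_pred, keep_fraction=1.0):
    """Mean absolute error over the `keep_fraction` of compounds with the
    smallest absolute residuals (1.0 -> MAE100%, 0.95 -> MAE95%).
    The 5% trimming rule is an assumption about how MAE95% is defined."""
    residuals = sorted(abs(o - p) for o, p in zip(y_obs, y_pred))
    n_keep = max(1, int(round(keep_fraction * len(residuals))))
    kept = residuals[:n_keep]
    return sum(kept) / len(kept)


def q2_f1(y_obs_test, y_pred_test, y_train_mean):
    """External predictivity: 1 - PRESS / SS, with SS taken around the
    training-set mean response."""
    press = sum((o - p) ** 2 for o, p in zip(y_obs_test, y_pred_test))
    ss = sum((o - y_train_mean) ** 2 for o in y_obs_test)
    return 1.0 - press / ss


# Toy data: 19 compounds off by 0.5 log units and one off by 5 log units.
y_obs = [float(i) for i in range(20)]
y_pred = [y + 0.5 for y in y_obs[:-1]] + [y_obs[-1] + 5.0]

print(mae(y_obs, y_pred))        # all 20 compounds
print(mae(y_obs, y_pred, 0.95))  # worst 5% (one compound) removed
```

The toy data show why both values are reported: a single badly predicted compound inflates MAE100% while MAE95% reflects the bulk of the predictions.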

Acknowledgement

• KK thanks the Indian Council of Medical Research, New Delhi, for financial support in the form of a senior research fellowship.

Conclusion

• Externally validated consensus QSAR models (pEC10, pEC50) are proposed to predict the acute toxicity of organic compounds to the green alga Pseudokirchneriella subcapitata
• The models were validated rigorously following the OECD guidelines for QSAR validation
• The models were developed using the genetic algorithm-partial least squares (GA-PLS) technique followed by a consensus approach
• The QSAR models have a large applicability domain and were verified with several internationally accepted validation metrics
• The developed QSAR models point to a strong logP dependence of the toxicity of organic chemicals
• Variables such as MLOGP, MR and Log Kow (observed lipophilicity) make the highest positive contributions to aquatic toxicity
• More polar features such as oxygen (nSO2OH moiety) and alpha hydrogen (H-051 moiety) showed an inverse correlation with algal toxicity
• The results emphasize the value of consensus modeling for reducing prediction error
• The models can help identify and prioritize chemicals of highest concern and support the production of safer alternatives, in line with the REACH regulation on hazardous chemicals
• The modeling approach will be useful for designing safer replacements for organic compounds with respect to green algae, and the models can be applied a priori, from chemical structures alone, to detect potential contaminants toxic mainly to Pseudokirchneriella subcapitata

Data gap filling

• The consensus model (CM3) was applied to predict 64 additional synthetic organic compounds for which no absolute observed responses were reported in the source paper of Kusk et al.
• First, the five individual PLS models (pEC50) were used to derive predicted acute toxicity values for the 64 compounds; the models could predict all 64 compounds with the confidence suggested by the "Prediction reliability indicator" (PRI) tool
• Second, consensus model 3 was used to estimate the acute toxicities of the 64 compounds, and the predictions were compared with the range values provided in the manuscript
• The model predicted 53 (82%) of the 64 compounds within ±2 log units of the lower range value and 51 (80%) of the 64 compounds within ±2 log units of the higher range value
• In the external set, 37 molecules (53%) were predicted with a deviation of less than 1 log unit
• Seven molecules were outside the domain of predictions as given by the PRI tool
• The precise prediction of these additional data points demonstrates the significance of the developed QSAR models, the consensus approach and the practical application of the PRI tool in predicting unknown/untested chemicals
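The comparison against reported ranges amounts to counting how many predictions fall within a tolerance of each range bound. A minimal sketch follows. The function names and the toy numbers are hypothetical, and treating the ±2 log unit limit as inclusive is an assumption.

```python
def within_tolerance(predicted, reference, tol=2.0):
    """True if the prediction is within +/- tol log units of the reference
    (inclusive limit, which is an assumption)."""
    return abs(predicted - reference) <= tol


def range_coverage(predictions, range_bounds, tol=2.0):
    """Count predictions within tol of the lower and of the upper range value.

    predictions  : predicted pEC50 values, one per compound
    range_bounds : (lower, upper) reported range values, one pair per compound
    Returns (hits_lower, hits_upper, n_compounds).
    """
    hits_lower = sum(
        within_tolerance(p, lower, tol)
        for p, (lower, _) in zip(predictions, range_bounds)
    )
    hits_upper = sum(
        within_tolerance(p, upper, tol)
        for p, (_, upper) in zip(predictions, range_bounds)
    )
    return hits_lower, hits_upper, len(predictions)


# Toy values for illustration only (not taken from Kusk et al.).
preds = [4.0, 5.0, 9.0]
bounds = [(3.0, 5.0), (2.5, 6.5), (4.0, 6.0)]
lower_hits, upper_hits, n = range_coverage(preds, bounds)
print(f"{lower_hits}/{n} within tolerance of lower bound, "
      f"{upper_hits}/{n} within tolerance of upper bound")
```

Calling `range_coverage(preds, bounds, tol=1.0)` gives the stricter 1 log unit count in the same way.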

Reduction in MAE (mean absolute error) and predictivity enhancement

• Predictions derived from multiple models are more reliable than those from a single QSAR model, as the former can cover a larger number of features
• The multiple PLS models developed for each endpoint (pEC10, pEC50) were further subjected to consensus modeling using the intelligent consensus predictor tool (available at http://teqip.jdvu.ac.in/QSAR_Tools/) in order to reduce test set errors (in terms of MAE: MAEtest100%, MAEtest95%) and to enhance the prediction performance of the models
• Four types of consensus modeling were carried out in the present study (Roy et al., 2018): CM0 (simple average of predictions), CM1 (average of predictions from the 'qualified' individual models), CM2 (weighted average predictions (WAPs) from 'qualified' individual models, where 'qualified' means having at least 3 similar compounds in the training set for the respective test set compound) and CM3 (best selection of predictions (compound-wise) from 'qualified' individual models)
• For both endpoints, the consensus models were of better quality than the individual models
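The four consensus schemes can be sketched for a single query compound as follows. This is an illustrative reading of the descriptions above, not the intelligent consensus predictor tool itself. The text only says that a model is 'qualified' when the query has at least 3 similar training compounds; the use of a per-model local error (`local_mae`, the error on those similar training compounds) both as an inverse weight in CM2 and as the selection criterion in CM3 is an assumption, as are all names.

```python
from statistics import mean


def consensus(preds, n_similar, local_mae, min_similar=3):
    """Consensus predictions for ONE query compound.

    preds      : prediction of each individual model
    n_similar  : number of similar training compounds found per model
    local_mae  : per-model error on those similar compounds (ASSUMED to
                 drive the CM2 weights and the CM3 selection)
    """
    # A model is 'qualified' if enough similar training compounds exist.
    qualified = [i for i, n in enumerate(n_similar) if n >= min_similar]

    result = {"CM0": mean(preds), "CM1": None, "CM2": None, "CM3": None}
    if not qualified:
        return result  # only the plain average is available

    # CM1: plain average over qualified models.
    result["CM1"] = mean(preds[i] for i in qualified)

    # CM2: weighted average, weight = 1 / local error (assumed scheme).
    weights = [1.0 / (local_mae[i] + 1e-9) for i in qualified]
    result["CM2"] = (
        sum(w * preds[i] for w, i in zip(weights, qualified)) / sum(weights)
    )

    # CM3: prediction of the qualified model with the lowest local error.
    best = min(qualified, key=lambda i: local_mae[i])
    result["CM3"] = preds[best]
    return result


# Three hypothetical models; the third is not qualified (1 similar compound).
print(consensus([4.0, 5.0, 6.0], n_similar=[5, 3, 1], local_mae=[0.5, 1.0, 0.2]))
```

In this toy case the third model has the lowest local error but is excluded from CM1 to CM3 because it lacks enough similar training compounds, which is the point of the qualification step.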

Applicability domain assessment (pEC50 and pEC10)

• The AD assessment of the developed models was carried out using the DModX approach implemented in the SIMCA-P tool, as proposed by Wold et al. (2001)
• For the pEC50 models, 22 outliers were present in the training set and 5 molecules in the test set were outside the domain of applicability
• The models developed using the pEC10 data gave 22 outliers in the training set, and 5 chemicals in the test set were outside the domain of applicability
• Interestingly, 20 outliers were common to the models for both endpoints: 15 (an allergen), 54 (cyclohexane), 57 (an insecticide), 90 (an intermediate), 115 (a hazardous chemical), 136 (a personal care product (PCP)), 143 (a surfactant), 163 (a flavoring agent), 172 (a contaminant), 266 (a reaction intermediate), 267 (a contaminant), 280 (a flavoring agent), 285 (a flavoring agent), 294 (a hazardous chemical), 331 (a PCP ingredient), 350 (a herbicide), 352 (a hazardous chemical), 369 (a contaminant), 370 (a pharmaceutical) and 372 (a contaminant)
• The outliers were retained in the final models because they were predicted with moderate precision by the respective models (absolute predicted residual <2 log units)

References

1. K.O. Kusk, A.M. Christensen, and N. Nyholm, Algal growth inhibition test results of 425 organic chemical substances. Chemosphere 204 (2018), pp. 405-412.
2. A. Beasley, S.E. Belanger, J.L. Brill, and R.R. Otter, Evaluation and comparison of the relationship between NOEC and EC10 or EC20 values in chronic Daphnia toxicity testing. Environ Toxicol Chem. 34 (2015), pp. 2378-2384.
3. K. Roy, and I. Mitra, On various metrics used for validation of predictive QSAR models with applications in virtual screening and focused library design. Combinatorial Chemistry & High Throughput Screening 14 (2011), pp. 450-474.
4. C.W. Yap, PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J Comput Chem. 32 (2011), pp. 1466-1474.
5. H. Golmohammadi, Z. Dashtbozorgi, and W.E. Acree Jr, Quantitative structure-activity relationship prediction of blood-to-brain partitioning behavior using support vector machine. European Journal of Pharmaceutical Sciences 47 (2012), pp. 421-429.
6. R.W. Kennard, and L.A. Stone, Computer aided design of experiments. Technometrics 11 (1969), pp. 137-148.
7. P.M. Khan, and K. Roy, Current approaches for choosing feature selection and learning algorithms in quantitative structure-activity relationships (QSAR). Expert Opin Drug Discov. 13 (2018), pp. 1075-1089.
8. K. Roy, P. Ambure, S. Kar, and P.K. Ojha, Is it possible to improve the quality of predictions from an "intelligent" use of multiple QSAR/QSPR/QSTR models? J. Chemom. 32 (2018), e2992.
9. K. Roy, P. Ambure, and S. Kar, How Precise Are Our Quantitative Structure-Activity Relationship Derived Predictions for New Query Chemicals? ACS Omega 3 (2018), pp. 11392-11406.

Data curation for model development

Results and Analysis (process, metrics and output)

[Figure panels: Process; Output; Metrics (pEC50); Metrics (pEC10)]

[Figure: Summary of features present in organic chemicals responsible for enhancing and reducing algal toxicity]

Table caption: Statistical qualities of all five models developed against algae species using the pEC50 endpoint. a MAE criteria

Table caption: Statistical qualities of all five models developed against algae species using the pEC10 endpoint. a MAE criteria

Mechanistic interpretation pEC10

As in the pEC50 QSAR models, molar refractivity, lipophilicity and the presence of sulfur/nitrogen (through the descriptors CrippenMR, MLOGP, B02[N-S], B10[C-N], F01[C-C] and F04[C-S]) enhanced pEC10 values, while the presence of more polar atoms such as oxygen (nSO2OH moiety) reduced pEC10 values. Molecular weight was a significant descriptor, appearing in all five QSAR models with a positive contribution to pEC10 values. Among the other descriptors, B10[C-N], F01[C-C] and F04[C-S] contributed positively, while X3A contributed negatively to pEC10.

Mechanistic interpretation pEC50

CrippenMR (molar refractivity) gives a measure of molecular size, volume and polarizability. The former two attributes enhance the lipophilicity of molecules, while the latter has an electronic component; in other words, an increase in MR does not always mean an increase in the size or lipophilicity of a molecule. Molecules with higher molar refractivity were generally toxic irrespective of their logP values. The highly toxic compounds with higher molar refractivity include 253 (a pharmaceutical), 72 (a cationic surfactant), 234 (a bulky molecule) and 317 (Amycin, a pharmaceutical). Compounds 72 (a cationic surfactant), 317 (a pharmaceutical) and 386 (a herbicide) are toxic molecules with high MR values (128, 111 and 94, respectively) and very low logP values (1.18, -0.87 and 1.67, respectively). The modeled dataset covered a wide range of log Kow/MLOGP values, from -6.7 to 7.2, and thus a large chemical space.
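The point that high-MR compounds are predicted toxic regardless of logP can be checked with simple arithmetic on the model terms. The sketch below uses the CrippenMR and MLOGP coefficients of pEC50 model 1 together with the MR and logP values quoted above for compounds 72, 317 and 386. Treating the quoted logP values as MLOGP is an assumption, since the text does not say which lipophilicity descriptor they refer to.

```python
COEF_MR = 0.02888    # CrippenMR coefficient, pEC50 model 1
COEF_LOGP = 0.39725  # MLOGP coefficient, pEC50 model 1

# compound id -> (MR, logP) as quoted in the text;
# logP is ASSUMED to be the MLOGP descriptor.
compounds = {72: (128, 1.18), 317: (111, -0.87), 386: (94, 1.67)}


def term_contributions(mr, logp):
    """Return the (MR term, logP term) contributions to predicted pEC50."""
    return COEF_MR * mr, COEF_LOGP * logp


for cid, (mr, logp) in compounds.items():
    mr_term, logp_term = term_contributions(mr, logp)
    print(f"compound {cid}: MR term = {mr_term:+.2f}, logP term = {logp_term:+.2f}")
```

Under these assumptions the MR term adds roughly 2.7 to 3.7 log units for the three compounds, while the logP term stays below 0.7 log units in magnitude, which is consistent with the statement that molar refractivity dominates their predicted toxicity.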


1. Structural data collection
2. Remove salts
3. Remove mixtures
4. Identify organic chemicals
5. Optimize and normalize structures
6. Aromatize, add explicit hydrogens and save