DEVELOPMENT OF STRUCTURE-ACTIVITY
RELATIONSHIPS FOR PHARMACOTOXICOLOGICAL
ENDPOINTS RELEVANT TO EUROPEAN UNION
LEGISLATION
A thesis submitted in partial fulfilment of the requirements of Liverpool John Moores University for the PhD degree
IGLIKA LESSIGIARSKA
School of Pharmacy and Chemistry
Liverpool John Moores University
Byrom Street
Liverpool L3 3AF
England
and
Institute for Health & Consumer Protection
Joint Research Centre
European Commission
I-21020 Ispra (VA)
Italy
April 2006
ABSTRACT
In this project quantitative structure-activity relationships (QSARs) were developed for
several toxicological endpoints, including chemical cytotoxicity and acute toxicity, and
biokinetic parameters related to penetration of chemical compounds through the blood-
brain barrier (BBB). QSARs are computer-based mathematical models, which give
information about the intrinsic properties of compounds (such as potential biological
effects) on the basis of their chemical structure alone. For the regulatory assessment of
chemicals and chemical products, the proposed new EU legislation called REACH
(Registration, Evaluation and Authorisation of Chemicals) foresees that there will be an
increased use of QSARs as an alternative approach to (animal) testing.
In this project, QSARs were developed for compound penetration through the BBB in
vivo and through several membrane models in vitro, taking into account penetration by
both passive diffusion and active transport. The classification of compounds as low or
high BBB penetrators was explored and a simple classification QSAR model based on
compound lipophilicity and H-bonding ability was obtained. The BBB transport of
compounds known to interact with the P-glycoprotein (one of the BBB efflux transport
systems) was modelled by 3D-QSAR analysis, using hydrophobic and electrostatic
molecular fields.
Toxicities to a broad range of biological systems were also investigated, including
unicellular organisms like bacteria and algae, isolated human and rodent cells, and in vivo
toxicity to Daphnia, fish, rodents and humans. Similarities between the toxic effects for
some of these systems were identified. Baseline toxicity effects (relationships between
compound toxicity and lipophilicity) attributed to non-polar narcotics were investigated
and separate QSARs were obtained for compounds acting by different mechanisms of
toxic action or belonging to different chemical types. Classification QSAR models were
obtained applying the EU classification scheme for chemical toxicity. The project also
includes an investigation of the feasibility of predicting in vivo human toxicity by
combining in vitro data and molecular descriptors.
The QSAR models developed contributed to a mechanistic understanding of the
investigated biological effects. Some of the models could be applied in integrated testing
strategies for the assessment of regulatory endpoints based on alternative (non-animal)
methods.
CONTENTS
Acknowledgements
Abbreviations
Chapter 1. Introduction and overview
1.1. Introduction to the project
1.2. Overview of the thesis
Part I. Background to the research
Chapter 2. Introduction to (Q)SAR analysis
Chapter 3. Descriptors of chemical structure
3.1. Descriptors of lipophilicity (hydrophobicity) and hydrophilicity
3.2. Constitutional descriptors
3.3. Geometrical descriptors
3.4. Topological and electrotopological indices
3.5. Electronic descriptors
Chapter 4. 3D-Molecular conformations
4.1. Molecular mechanical calculations
4.2. Quantum mechanical calculations
4.3. Obtaining 3D-molecular conformation with minimum energy
Chapter 5. Mathematical procedures and statistical methods
5.1. Linear regression analysis
5.2. Principal Components Analysis
5.3. Partial Least Squares regression
5.4. Linear discriminant function analysis
5.5. Classification trees
5.6. Cluster Analysis
5.7. Selecting variables for statistical analyses
5.8. Cross-validation statistical procedures
Chapter 6. Approaches for 3D-(Q)SAR analysis
6.1. GASP analysis
6.2. CoMFA and CoMSIA analyses
Chapter 7. Blood-brain barrier
7.1. Introduction to the blood-brain barrier
7.2. Literature review of QSARs for blood-brain barrier penetration
Chapter 8. Acute chemical toxicity/cytotoxicity
8.1. Mechanisms of toxic action
8.2. Literature review of QSARs for acute toxicity/cytotoxicity
Part II. Description of the research
Chapter 9. Programs for statistical algorithms
9.1. Objectives
9.2. Reduction of data multicollinearity
9.3. Program for selection of regression equations with best statistical fit
9.4. Contribution to existing knowledge
Chapter 10. Investigation of blood-brain barrier penetration
10.1. Objectives
10.2. Methods
10.3. Results
10.4. Discussion
10.5. Conclusions
Chapter 11. Investigation of BBB penetration in relation to P-glycoprotein interactions
11.1. Objectives
11.2. Methods
11.3. Results
11.4. Discussion
11.5. Conclusions
Chapter 12. Investigation of bacterial toxicity
12.1. Objectives
12.2. Methods
12.3. Results
12.4. Discussion
12.5. Conclusions
Chapter 13. Investigation of acute aquatic toxicity
13.1. Objectives
13.2. Methods
13.3. Results
13.4. Discussion
13.5. Conclusions
Chapter 14. Investigation of acute toxicity/cytotoxicity
14.1. Objectives
14.2. Methods
14.3. Results
14.4. Discussion
14.5. Conclusions
Chapter 15. Summary and general discussion
15.1. Objectives
15.2. Summary of the research results
15.3. Some reflections on the quality of QSARs
15.4. Perspectives on the use of the QSARs in integrated testing strategies
References
List of author’s publications related to the research in the project
Appendix A. C codes of the programs for statistical algorithms
Appendix B. SMILES codes for the investigated compounds
ACKNOWLEDGEMENTS
I should like to thank Liverpool John Moores University for giving me the opportunity
to work in the interesting field of modelling biokinetic and pharmacotoxicological
endpoints, and for providing me with the facilities and training needed for the research.
A very special thanks to my supervisors Dr. Andrew Worth (European Chemicals Bureau,
Joint Research Centre of the European Commission, Ispra, Italy), Dr. Mark Cronin
(Liverpool John Moores University), and Prof. John Dearden (Liverpool John Moores
University), who assisted me in every step of my work, and helped me to gain knowledge
and training as a researcher. Needless to say, without them my PhD project would not
have its present form. I also thank my advisors Dr. Ilza Pajeva (Central Laboratory of
Biomedical Engineering, Bulgarian Academy of Sciences, Bulgaria) and Dr. Tatiana
Netzeva (Liverpool John Moores University, and European Chemicals Bureau, Joint
Research Centre of the European Commission, Ispra, Italy) who supported me with
valuable advice and suggestions.
I acknowledge the European Commission’s DG Joint Research Centre (JRC), for giving
me a training grant (JRC contract 19120-2002-01 P1B20 ISP IT). The two units of the
JRC - European Centre for the Validation of Alternative Methods (ECVAM) and
European Chemicals Bureau (ECB), where the work was mainly performed, provided me
with the necessary facilities and research environment to complete the project. Working in an
international environment of specialists in a broad range of fields was very helpful for my
development as a researcher. A special thanks to Pilar Prieto (ECVAM, Joint Research
Centre, Italy) and Joaquin Baraibar-Fentanes (ECB, Joint Research Centre, Italy) for
providing me with data sets needed for the research.
ABBREVIATIONS
2D 2-Dimensional
3D 3-Dimensional
AM1 Austin Model 1 semi-empirical quantum mechanical method
BB Ratio between compound concentration in brain and that in plasma at the steady state
BBB Blood-brain barrier
BBEC Membrane system constructed by bovine and rat subcultured primary brain capillary endothelial cells, co-cultured with primary rat astrocytes
CA Cluster analysis
Caco-2 Membrane system constructed by human epithelial colon carcinoma cell line
CART Classification and regression tree splitting
CoMFA Comparative molecular field analysis
CoMSIA Comparative molecular similarity indices analysis
CT Classification tree
DA Discriminant analysis
DBU Discriminant-based univariate splitting
EC European Commission
ECB European Chemicals Bureau
ECVAM European Centre for the Validation of Alternative Methods
EMS Explained mean square
ESS Explained sum of squares
EU European Union
F Fisher statistic
GASP Genetic algorithm similarity program
GTO Gaussian-type orbitals
H-bond Hydrogen-bond
HAP Approximate acute human blood/serum peak LC50 values
HLD Acute oral human lethal doses
HOMO Highest occupied molecular orbital
IC50 Concentration that inhibits 50% of enzyme reaction or cell growth
JRC Joint Research Centre
LC50 Concentration that causes 50% lethality in a group of test animals
LD50 Dose that causes 50% lethality in a group of test animals
logD Logarithm of the oil-water distribution coefficient
logDoct Logarithm of the octanol-water distribution coefficient
logP Logarithm of the oil-water partition coefficient
logPoct Logarithm of the octanol-water partition coefficient
LUMO Lowest unoccupied molecular orbital
MAE Mean absolute error
MEIC Multicentre Evaluation of In Vitro Cytotoxicity
MEMO MEIC monographs on time-related human lethal blood concentrations
MDCK Membrane system constructed by dog kidney epithelial cell line
MDR Multidrug resistance
MO-LCAO Molecular orbitals – a linear combination of atomic orbitals
NCD New Chemicals Data Base
NDDO Neglect of diatomic differential overlap
NDO Neglect of differential overlap
P-gp P-glycoprotein
Papp Apparent permeability coefficient
PCA Principal components analysis
Pe Endothelial permeability coefficient
PLS Partial least squares
PM3 Parametric Model 3 semi-empirical quantum mechanical method
R2 Squared regression coefficient of determination
R2adj Adjusted squared regression coefficient of determination
RHF Restricted Hartree-Fock method
RMS Residual mean square
RSS Residual sum of squares
Q2 Cross-validated squared coefficient of determination
QSAAR Quantitative structure-activity-activity relationship
QSAR Quantitative structure-activity relationship
s Standard error of estimate
SCF Self-consistent field
SEP Cross-validated standard error of prediction
STO Slater-type orbitals
SV-ARBEC Membrane system constructed by rat SV40 immortalised brain microvascular endothelial cells, co-cultured with SV40 immortalised rat astrocytes
TIs Topological indices
TLSER Theoretical linear solvation energy relationship
TSS Total sum of squares
UHF Unrestricted Hartree-Fock method
CHAPTER 1
INTRODUCTION AND OVERVIEW
1.1. Introduction to the project
The general aim of the project was to develop quantitative structure-activity relationships
(QSARs) for one or more of the toxicological endpoints currently mandated for testing
under the Dangerous Substances Directive (EC, 1967) of the European Union (EU).
These endpoints include acute and chronic toxicity, genotoxicity, carcinogenicity, skin
and eye irritation and corrosivity, skin and respiratory sensitisation, biokinetics
(absorption, distribution, metabolism, elimination), and they characterise the toxicological
hazard of a chemical. QSARs are computer-based mathematical models, which relate the
biological activity of compounds to their chemical structure. In comparison with other
methods for assessing toxicological endpoints, such as animal-based and in vitro methods,
QSAR models are not only easy to apply, but are also efficient in terms of time and
financial cost. In addition, the QSAR models sometimes contribute to the mechanistic
understanding of the pharmacotoxicological effects being modelled.
QSARs, together with in vitro methods, provide alternative methods to animal testing of
toxicity. Alternative methods are designed to replace, reduce and refine animal use, as
required by the Directive on the Protection of Laboratory Animals (EC, 1986). The saving
potential of the alternatives has recently been evaluated by Pedersen et al. (2003), and
Van der Jagt et al. (2004). Under the proposed new EU legislation for chemicals and
chemical products, called REACH (Registration, Evaluation and Authorisation of
Chemicals) (EC, 2001b; EC, 2003), it is foreseen that there will be an increased use of
QSARs as alternative methods for hazard and risk assessment of chemicals. In addition to
the evaluation of the toxic effect of a substance (chemical hazard), the process of
chemical risk assessment involves appraisal of the exposure of biological systems to
chemicals, i.e. evaluation of the level of chemical release in the environment and/or
chemical biouptake.
Individual alternative methods have limited applicability: they may be restricted to
particular biological systems or chemical classes, or be associated with particular levels
of confidence. Integrated testing strategies are therefore developed to reduce, refine, or
replace animal use without a loss of relevant information or safety. These are
designed as combinations of individual methods applied in a stepwise and/or parallel
fashion, in order to ensure high confidence in the results.
The project was funded by the Joint Research Centre (JRC) of the European Commission
(EC). The work was performed mainly in the Institute for Health and Consumer
Protection, which is part of the JRC, in its units European Centre for the Validation of
Alternative Methods (ECVAM) and European Chemicals Bureau (ECB). The JRC
provides scientific and technical support for the development, implementation and
monitoring of EU policies. The work of ECVAM is focused on coordination of the
validation of alternative methods at the EU level, exchange of information on alternative
methods, and promotion of their development, scientific and regulatory acceptance. The
ECB manages data and assessment procedures on dangerous chemicals at the EU level.
The specific aims of the research project included investigation of two of the
abovementioned toxicological and pharmaceutical (biokinetic) endpoints by QSAR
modelling. The project was focused on modelling of blood-brain barrier penetration, and
acute toxicity and cytotoxicity. Published QSARs for these endpoints were reviewed.
Data sets were compiled from the literature, and
from in-house data of ECVAM and ECB. The latter include data from an ECVAM study
on in vitro models of the blood-brain barrier (BBB), and data for aquatic toxicity from the
New Chemicals Database, managed by the ECB. QSAR models were developed and
discussed in relation to their use as alternative methods and in integrated testing
strategies.
1.2. Overview of the thesis
The thesis consists of two parts. Part I contains seven chapters in which the background to
the research is presented. Part II presents the research performed in the project, also in
seven chapters. The appendices at the end of the thesis contain the programs written in C
for the purposes of the project, and the SMILES codes of the investigated chemical
compounds.
In Part I, Chapter 2, a short introduction to the theory of QSAR analysis is presented,
including the main steps of the analysis and approaches to assess model quality. In
Chapters 3-6 some elements of QSAR analysis are discussed in detail, including the
descriptors used to represent chemical structures, methods to obtain relevant 3D
molecular conformations, mathematical and statistical techniques, and 3D-QSAR
approaches. In Chapters 7 and 8 an introduction to the underlying biology of the two
pharmacotoxicological endpoints is given, including the structure and function of the
blood-brain barrier, and mechanisms of acute toxic action of chemicals. Additionally,
reviews of published QSARs in these areas are presented.
In Part II, Chapter 9, two computational algorithms with applications in QSAR analysis
are suggested. The first one reduces variable multicollinearity in a data set containing a
large number of variables. The second algorithm represents an application of the best
subsets regression to obtain regression models with best statistical fit.
In Chapters 10 and 11, the QSAR investigation of compound penetration through the
BBB is presented. The work reported in Chapter 10 is based on two data sets including
compounds transported by both passive diffusion and active transport. The first data set
contains data for both in vivo BBB permeability, and permeability through several in vitro
membrane models of the BBB. Models that describe the in vivo BBB penetration by
combining in vitro membrane penetration and structural descriptors (QSAARs) were
developed. The second data set contained in vivo BBB penetration data for a large number
of compounds (approximately 150). QSARs were developed, and additionally, the
compounds were classified into low and high penetrators through the BBB and a
classification QSAR model was obtained.
In Chapter 11 an investigation of structural features influencing transport across the BBB
of compounds known to inhibit one of the BBB efflux transport systems, P-glycoprotein,
is reported. The investigated compounds were imipramine and phenothiazine derivatives.
Their mechanism of transport across the BBB involves passive diffusion and P-
glycoprotein interactions. Common 3D structural characteristics of these compounds
related to their mechanism of transport across the BBB were identified.
The next three chapters of Part II (Chapters 12-14) present the work related to analysis of
acute toxicity and cytotoxicity, which are considered to be related effects. Toxicities to a
broad range of biological systems were investigated, including unicellular organisms like
bacteria and algae, isolated human and rodent cells, and in vivo toxicity to Daphnia, fish,
rodents and humans. Similarities between toxic effects for these systems were investigated.
Separate QSARs were obtained for compounds acting by different mechanisms of toxic
action or belonging to different chemical types. QSARs were also developed by
combining different compounds and mechanisms of action.
Chapter 12 presents an investigation of toxicity to the bacterium Sinorhizobium meliloti.
A large data set of 140 compounds was used to develop mechanistically-based and
chemical class-based QSARs.
In Chapter 13 an investigation of environmental toxic effects is reported. Five aquatic
species (two algal species, Daphnia, and two fish species) were included in the study. The
data were taken from the New Chemicals Database of the European Union. Interspecies
correlations were developed to reveal similarities between toxicity to different species.
Baseline toxicity effects attributed to non-polar narcotics and QSARs for the whole data
sets were investigated.
Classification QSAR models were obtained by applying the EU classification scheme, which
divides compounds into those that are very toxic, toxic, harmful, or non-toxic to aquatic
species. Additionally, a two-group classification into dangerous and non-toxic compounds
was investigated, with the group of dangerous compounds formed by merging the very toxic,
toxic, and harmful groups of the EU classification scheme. The quality of the biological
data used was also assessed.
In Chapter 14 a study on toxicity to a broad range of biological systems, including
bacterial strains, rodent and human cell lines, rodents and humans, is presented. The data
were taken from the MEIC (Multicentre Evaluation of In Vitro Cytotoxicity) programme
(Bondesson et al., 1989; Ekwall et al., 1998a). The toxicity endpoints were compared.
The study explored an approach to predict in vivo human toxicity by combination of in
vitro data and molecular descriptors (QSAAR analysis). QSAR models for the
investigated endpoints were also developed.
The final Chapter 15 of the thesis summarises the results of the project. Proposals for use
of the developed QSARs in integrated testing strategies are made. In addition a
commentary on the quality and applicability of the QSARs is provided.
CHAPTER 2
INTRODUCTION TO (Q)SAR ANALYSIS
The objective of (quantitative) structure-activity relationship ((Q)SAR) analysis is to
derive empirical models that relate the biological activity of compounds to their chemical
structure. SAR analysis is based on the assumption that the chemical structure of a
compound implicitly determines its behaviour in biological systems and the biological
response to it. There can be both qualitative SARs and quantitative SARs (QSARs),
depending on the means used to describe the chemical structure and on the nature of the
derived relationship. In QSAR analysis, quantitative descriptors are used to describe the
chemical structure and the analysis results in a mathematical model describing the
relationship between the chemical structure and biological activity. In Figure 2.1 the main
steps in QSAR analysis are shown.
To develop (Q)SARs, a series of compounds, called a training set, is used. The
compounds in the training set ideally possess the same or similar mechanism of biological
action to ensure that the same factors influence the activity of all compounds under
investigation. For all compounds in the series, biological activities are evaluated and
compound structural descriptors are calculated. Statistical tools are then used to derive
QSARs.
The selection of a suitable training set is an important step in the QSAR analysis since the
representativeness and the size of the training set and the quality of the biological data
will affect all of the following steps. The biological activity of compounds is usually
assessed by using equilibrium constants of binding to biological macromolecules, enzyme
or cell-proliferation inhibition constants, rate constants of metabolism or distribution in
biological systems, and also, related to them, compound concentrations and doses causing
a certain biological effect (e.g., compound concentration that inhibits 50% of enzyme
reaction or cell growth (IC50), compound dose that causes 50% of a certain biological
effect (ED50), compound dose that causes death of 50% of laboratory animals (LD50)).
The mathematical models derived in the QSAR analysis are based on the linear or non-
linear dependence of the Gibbs free energy (∆G) of the investigated process on the
compound structural descriptors. ∆G is related to the parameter describing biological
activity K (equilibrium constant of binding or inhibition, rate constant of metabolism or
distribution) by the logarithmic dependence known as the van’t Hoff isotherm:
∆G = -2.303 RT logK, where R is the gas constant and T is the absolute temperature.
Consequently, QSAR models are often based on
the logarithm of the measured biological activity.
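As a small worked illustration of why the logarithm of K is the natural modelling variable, the sketch below evaluates the relationship ∆G = -2.303 RT logK for a few values of K. The function name, the default temperature of 298.15 K and the example K values are assumptions made for illustration; they are not taken from the thesis.

```python
import math

R = 8.314  # gas constant in J/(mol*K)


def delta_g(k, temperature=298.15):
    """Free energy change in J/mol from dG = -2.303 * R * T * log10(K).

    The default temperature (25 degrees C) is an assumption for this example.
    """
    return -2.303 * R * temperature * math.log10(k)


# Each tenfold increase in K changes dG by the same fixed amount,
# so log K varies linearly with dG.
for k in (1.0, 10.0, 100.0):
    print(k, delta_g(k) / 1000.0, "kJ/mol")
```

Because equal steps in logK correspond to equal steps in ∆G, a descriptor that contributes linearly to ∆G also contributes linearly to logK.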
To derive QSAR models, an appropriate representation of the chemical structure is
necessary. For this purpose, descriptors of the structure are commonly used. These
descriptors are generally understood as being any term, index or parameter conveying
structural information. Commonly used descriptors in the QSAR analysis are presented
in Table 2.1. Some of them are obtained directly from the chemical structure, e.g.
constitutional, geometrical, and topological and electrotopological descriptors (Table 2.1).
Others represent physicochemical properties determined by the chemical structure
(lipophilicity and hydrophilicity descriptors, electronic descriptors, energies of
interaction, Table 2.1).
Some of the listed descriptors can be obtained experimentally (for example, logP and
aqueous solubility). However, most descriptors are calculated using specially
developed software packages, for example ACD/LogP (Advanced Chemistry
Development Inc.) and KOWWIN (Syracuse Research Corporation) for calculating
aqueous solubility and the octanol-water logP, and Tsar (Oxford
Molecular) or Dragon (TALETE srl) for topological descriptors. Some 3D-QSAR
approaches use energies of interaction with a probe atom, for which
specialised software packages have also been developed (e.g. Sybyl, Tripos Associates).
Statistical methods are used to relate biological activity to chemical structure. Statistical
methods include, for example, regression analysis, principal components analysis,
discriminant analysis, partial least squares, classification trees and neural network
techniques. Statistical software packages are used to perform these analyses. Some of
these statistical techniques are used to derive continuous models which relate the
structural descriptors with values of the biological parameters (linear regression, partial
least squares regression). Other techniques (discriminant analysis, classification trees) are
used to develop models which classify chemical compounds into groups according to their
activity (for example, groups of toxic and non-toxic compounds).
The development of (Q)SAR models has several purposes and applications:
- (Q)SARs can lead to better understanding of the mechanisms of interaction
between compounds and biological systems. They may reveal important structural
features for the biological effect.
- QSAR models provide useful information about a dose range for a biological
effect of a compound, thus helping the experimental design (selection of doses and
tests) in drug research and toxicity testing.
- QSAR models are used to predict the activity of new (hypothetical) chemical
compounds, even before their synthesis. Thus, QSARs can save time and
experimental resources for synthesising and biological testing of large numbers of
compounds. QSARs offer possibilities for reduction or replacement of animal use
in research and toxicity testing.
To perform properly in these applications, QSARs need to reflect the actual relationships
between structure and activity. Thus, evaluation of their quality (reliability) is necessary.
The quality of a model can be assessed using the following means:
- statistical parameters of the model (correlation coefficient, standard error of
estimation, Fisher-test values, Wilks’ λ, see Chapter 5);
- statistical validation procedures, such as cross-validation (see Chapter 5,
Section 5.8 “Cross-validation statistical procedures”);
- validation on an external test set, in which the model is used to predict the
activity of chemical compounds that were not included in the model training set.
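The first two quality measures listed above can be sketched for the simplest case, a one-descriptor linear model. The example computes R2 as 1 - RSS/TSS and a leave-one-out Q2 as 1 - PRESS/TSS. This is a minimal sketch, not the procedure used in the thesis: the function names and the data values are hypothetical, and other conventions for Q2 exist.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = a*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def total_sum_of_squares(y):
    mean_y = sum(y) / len(y)
    return sum((yi - mean_y) ** 2 for yi in y)


def r_squared(x, y):
    """R2 = 1 - RSS/TSS for the model fitted to all compounds."""
    a, b = fit_line(x, y)
    rss = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - rss / total_sum_of_squares(y)


def q_squared_loo(x, y):
    """Leave-one-out Q2 = 1 - PRESS/TSS: each compound is predicted
    by a model fitted to all the other compounds."""
    press = 0.0
    for i in range(len(x)):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a * x[i] + b)) ** 2
    return 1.0 - press / total_sum_of_squares(y)


# Hypothetical training set: a descriptor (e.g. logP) and an activity.
logp_values = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
activities = [1.1, 1.4, 2.1, 2.4, 3.0, 3.4]

print("R2 =", r_squared(logp_values, activities))
print("Q2 =", q_squared_loo(logp_values, activities))
```

For a least-squares model Q2 computed this way cannot exceed R2, so a large gap between the two is a common warning sign of overfitting.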
Similar to QSAR analysis is quantitative structure-activity-activity relationship (QSAAR)
analysis. It is performed in the same way and it has the same characteristics as the QSAR
analysis, with the exception that it uses some biological endpoint(s) in combination with
descriptors of the chemical structure to derive models for the investigated biological
activity. Thus, the QSAAR models relate the investigated biological activity to a
combination of descriptors of chemical structure and other biological endpoints. These
endpoints are usually less complex than the investigated activity and represent parts of the
investigated biological effect. For example, in vivo toxicity may be described by
combining in vitro cytotoxicity and descriptors of the chemical structure. The descriptors
of the chemical structure included in the QSAAR models reveal factors that influence the
investigated biological effect (in vivo toxicity) apart from the other biological endpoint(s) (in
vitro cytotoxicity).
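As an illustration only, a linear QSAAR of the kind described above could take the following general form. The symbols are assumptions introduced here and do not come from a model in the thesis: C denotes an effective concentration or dose, x_j are structural descriptors, and a, b_j and c are fitted coefficients.

```latex
\log\left(1/C_{\mathrm{in\ vivo}}\right)
  = a \,\log\left(1/C_{\mathrm{in\ vitro}}\right)
  + \sum_{j} b_j x_j + c
```

In such a form, the descriptor terms b_j x_j account for the part of the in vivo effect that the in vitro endpoint does not explain.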
Figure 2.1. Main steps in QSAR analysis
[Flow chart. Box labels: Training set; Biological data; Chemical structure representation; Statistical analyses; Preliminary model(s); Evaluation of the quality of the model(s); Modification of the training set and further statistical analyses; Refined model(s)]
Table 2.1. Descriptors of the chemical structure used in the QSAR analysis
Descriptor type | Examples
lipophilicity (hydrophobicity) and hydrophilicity | aqueous solubility, oil-water partition coefficient (logP), oil-water distribution coefficient (logD)
constitutional | molecular weight, total number of atoms, number of individual types of atoms, number of rings, total number of bonds, number of individual types of bonds
geometrical | molecule volume, molecule surface area, solvent accessible molecular surface area, shadow indices
topological and electrotopological | Wiener index, Balaban index, Randić indices, Kier and Hall connectivity indices, Kappa shape indices, electrotopological-state indices
electronic | dipole moments, polarisability, hydrogen bonding parameters, Hammett constant, HOMO and LUMO energies, orbital electron densities, superdelocalisabilities
energies of interaction | steric, electrostatic, or hydrophobic energies of interaction with a given atom (molecule) at certain points in space
CHAPTER 3
DESCRIPTORS OF CHEMICAL STRUCTURE
In this chapter the descriptors used for chemical structure representation are presented.
Only descriptors used in the current project are described. The main grouping of
descriptors is presented in Table 2.1. As mentioned in the introduction, some descriptors
are obtained directly from the chemical structure (constitutional, geometrical, and
topological and electrotopological descriptors), whilst others represent physicochemical
properties (lipophilicity and hydrophilicity descriptors, electronic descriptors). Another
characteristic of the descriptors is that the values of some of them depend only on the 2D
chemical formula (constitutional, topological and electrotopological descriptors), whereas
the values of others are influenced additionally by the 3D molecular conformations
(geometrical descriptors, electronic descriptors, energies of interactions).
The descriptor presentation starts with lipophilicity and hydrophilicity descriptors, due to
the fact that the oil-water partition coefficient (logP) is the most widely used descriptor in
the QSAR analyses included in the project. Afterwards the descriptors are presented in
order of increasing complexity of their definition and calculation.
3.1. Descriptors of lipophilicity (hydrophobicity) and hydrophilicity
3.1.1. Oil-water partition coefficient
Partition coefficient (P) is defined as the ratio of the solute concentration in the oil phase
to the non-ionised solute concentration in the water phase, at equilibrium:
\[ P = \frac{C_{\mathrm{oil}}}{C_{\mathrm{water}}} \tag{3.1} \]
where $C_{\mathrm{oil}}$ is the equilibrium concentration of the solute in the oil phase and
$C_{\mathrm{water}}$ is the equilibrium concentration of the solute in water.
P describes the distribution of a compound between two phases – oil and water. It is
generally used in its logarithmic form (logP). A logP of zero indicates that the solute is
equally soluble in the two phases, a negative logP means that the solute is more soluble in
water, and a positive value indicates a greater solubility in the oil phase.
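The sign convention described above can be checked with a few lines of code. This is a minimal sketch: the function names and the concentration values are hypothetical, and both concentrations are assumed to be in the same units.

```python
import math


def log_p(c_oil, c_water):
    """logP from equilibrium concentrations, following equation 3.1."""
    return math.log10(c_oil / c_water)


def preferred_phase(logp):
    """Interpret the sign of logP as described in the text."""
    if logp > 0:
        return "oil"
    if logp < 0:
        return "water"
    return "neither"


# Hypothetical solute that is 100 times more concentrated in the oil phase.
value = log_p(c_oil=100.0, c_water=1.0)
print(value, preferred_phase(value))
```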
Examples of oil phases are octanol, cyclohexane, chloroform. LogP is probably the most
commonly used descriptor of lipophilicity, and is usually interpreted in biological terms
as a measure of the ability of the solute to cross lipid membranes. Molecules with low
logP values cannot easily enter the lipid phase of the membranes, while molecules with
high logP values are trapped in the membrane. Therefore, only molecules with
intermediate logP values (e.g. between about 0 and 4) can readily cross membranes (by
passive diffusion).
LogP can be determined by a number of experimental methods. However, the
measurements can be time-consuming and, due to differences in the experimental
protocol, a broad range of reported values exists for some chemicals (Dearden, 2002).
Various computational methods and software packages are available for calculating logP for the
octanol-water system (logPoct). In most of these methods, the molecule
of interest is divided into fragments (or atoms), and the individual contributions of the
fragments (or atoms) to logPoct are summed. Correction
factors for interactions between the individual fragments (or atoms) in the molecule
are then added. Commonly used programs for calculating logPoct, which differ in the
molecular fragments and correction factors used, include ACD/LogP (Advanced
Chemistry Development, Inc.), KOWWIN (Syracuse Research Corporation), and ClogP
(Daylight Chemical Information Systems, Inc.).
Also, statistical models are used in some software for calculating logPoct. In the software
CSlogP (ChemSilico LLC, available on the internet: www.logp.com) logPoct is calculated
by using a neural network model based on topological descriptors of the chemical
structure.
3.1.2. Oil-water distribution coefficient
LogP is calculated for the non-ionised forms of the chemicals. When a compound has
ionisable groups, to account for the distribution between oil and aqueous phases of the
ionic forms (microspecies) of the compound, the oil-water distribution coefficient (D) is
used. D is equal to the ratio of the total concentration (the sum of the concentrations of all
microspecies including ionised and non-ionised forms) in the oil phase to the total
concentration in the aqueous phase:
logD = log (Σ Cioil / Σ Ciwater) (3.2)
where Σ Cioil is the sum of the concentrations of all microspecies in oil
Σ Ciwater is the sum of the concentrations of all microspecies in water.
The value of logD depends on the pH of the aqueous phase. For chemical compounds
containing a single ionisable group, logD can be calculated from the acid-base
dissociation constant of the compound (pKa) and the pH of the aqueous phase by applying
the equation:
$\log D = \log P - \log\left(1 + 10^{\mathrm{pH} - \mathrm{p}K_a}\right)$ for acids (3.3)
$\log D = \log P - \log\left(1 + 10^{\mathrm{p}K_a - \mathrm{pH}}\right)$ for bases (3.4)
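Equations 3.3 and 3.4 can be illustrated with a short Python sketch. It assumes that "log" denotes the base-10 logarithm and that the compound has a single ionisable group. The function name `log_d`, its parameters and the example values are illustrative only and do not correspond to any of the software packages named in this section.

```python
import math


def log_d(log_p: float, pka: float, ph: float, is_acid: bool) -> float:
    """Estimate logD from logP for a compound with one ionisable group.

    Acids follow Equation 3.3, bases follow Equation 3.4.
    """
    exponent = (ph - pka) if is_acid else (pka - ph)
    return log_p - math.log10(1.0 + 10.0 ** exponent)


# Hypothetical acid with logP = 2.0 and pKa = 4.0, evaluated at two pH values.
log_d_at_pka = log_d(2.0, 4.0, 4.0, is_acid=True)  # half of the compound ionised
log_d_at_ph7 = log_d(2.0, 4.0, 7.0, is_acid=True)  # almost fully ionised
```

When the pH equals the pKa, logD is lower than logP by log 2 (about 0.30); three pH units above the pKa of an acid, logD is lower than logP by about 3.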
Commercially available software for calculating logD in the octanol-water system
(logDoct) includes ACD/LogD (Advanced Chemistry Development, Inc.), which uses
equilibrium constants for the distribution of microspecies between the octanol and the
aqueous phases to calculate logDoct. Two algorithms are available: the first takes into
account the penetration of ion pairs into the octanol phase, whereas the second does not.
Other software for calculating logDoct includes PrologD from the PALLAS package
(CompuDrug Inc.), and CSlogD (ChemSilico LLC), which is based on neural network
analysis.
3.1.3. Aqueous solubility
Aqueous solubility is defined as the maximum concentration of a compound that will
dissolve in pure water at a certain temperature, at equilibrium. It is a measure of
compound hydrophilicity. The amount of a compound available for absorption,
distribution, elimination, and/or the compound concentration at the site of biological
action depends on its aqueous solubility.
For the purposes of QSAR modelling, the solubility is usually expressed in units mol/l
and logarithmically transformed. It can be determined experimentally; however,
nowadays commonly available software for calculating aqueous solubility exists (e.g.
ACD/LogP from Advanced Chemistry Development Inc., WSKOWWIN from Syracuse
Research Corporation). The calculations are based on relationships between the logarithm
of the aqueous solubility and the octanol-water partition coefficient (logPoct), melting
point (for solid compounds at the specific temperature), and molecular weight. Also,
statistical models are used in some software for calculating aqueous solubility; for
example the software CSlogWS (ChemSilico LLC, available on the internet:
www.logp.com) uses a neural network model based on topological descriptors.
3.1.4. Hydrophilic factor
Another hydrophilicity descriptor (hydrophilic factor, Hy) was defined by Todeschini et
al. (1997). It is calculated by the following equation:
$Hy = \dfrac{(1 + N_{Hy}) \log_2 (1 + N_{Hy}) + n_C \left[ \frac{1}{n_A} \log_2 \frac{1}{n_A} \right] + \sqrt{N_{Hy} / n_A^2}}{\log_2 (1 + n_A)}$ (3.5)
where NHy is the number of hydrophilic groups (for example OH, SH, NH)
nC is the number of carbon atoms
nA is the number of atoms in the molecule (excluding hydrogen atoms).
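As a hedged illustration of Equation 3.5, the following Python sketch evaluates Hy from the three counts defined above. The function name `hydrophilic_factor` and the two example molecules are chosen here for illustration; the counts were assigned by hand and are not taken from the cited reference.

```python
import math


def hydrophilic_factor(n_hy: int, n_c: int, n_a: int) -> float:
    """Hydrophilic factor Hy following Equation 3.5.

    n_hy: number of hydrophilic groups, n_c: number of carbon atoms,
    n_a: number of non-hydrogen atoms.
    """
    numerator = (
        (1 + n_hy) * math.log2(1 + n_hy)
        + n_c * ((1.0 / n_a) * math.log2(1.0 / n_a))
        + math.sqrt(n_hy / n_a ** 2)
    )
    return numerator / math.log2(1 + n_a)


# Methanol: one hydrophilic group (OH), one carbon, two non-hydrogen atoms.
hy_methanol = hydrophilic_factor(n_hy=1, n_c=1, n_a=2)
# Ethane: no hydrophilic groups, two carbons, two non-hydrogen atoms.
hy_ethane = hydrophilic_factor(n_hy=0, n_c=2, n_a=2)
```

With these inputs the value for methanol is positive and the value for ethane is negative, in line with the interpretation of Hy as a measure of hydrophilicity.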
3.2. Constitutional descriptors
Constitutional descriptors are broadly used in QSAR analysis. This descriptor group
includes molecular weight, number of different types of atoms present in a molecule,
number of rings, number of different types of bonds, number of different functional
groups. They encode the size of molecules (molecular weight, number of atoms, number
of rings), and chemical properties (type and number of functional groups).
3.3. Geometrical descriptors
These descriptors reflect features of the molecular geometry. Examples for such
descriptors include distances between particular points of the molecular surface (the two
farthest points, the two closest points), and distances between given chemical groups. The
most widely used geometrical descriptors are molecular surface area and molecular
volume, which are discussed in the following paragraphs.
3.3.3. Molecular surface area and molecular volume
Molecular surface area is the area of the outer surface of the volume from which solvent
molecules are excluded due to the presence of the solute molecule in a solution
(solvent-excluding surface). It is based on the van der Waals molecular surface, which is
defined by the van der Waals radii of the atoms (represented as spheres) in the molecule.
However, the van der Waals molecular surface contains small gaps and crevices that are
inaccessible to other atoms and molecules (for example solvent molecules). The molecular
surface area is defined by excluding these gaps and crevices. Thus, the molecular surface
consists of the parts of the van der Waals surfaces of the atoms that can come into contact
with the solvent molecules and, additionally, of the surfaces of solvent molecules placed in
contact with the van der Waals surfaces of two or more atoms of the investigated molecule
(Chapman and Connolly, 2001). Typically, water is used as the solvent in calculations of
the molecular surface area. For practical reasons the water molecule is treated as a sphere
with a radius of 1.4 to 1.7 Å (Angstrom, 1 Å = 10⁻¹⁰ m), which is the average distance
from the centre of the oxygen atom to the van der Waals surface of the water molecule
(Chapman and Connolly, 2001). Molecular volume is the volume enclosed within this
surface.
Additionally, a solvent-accessible molecular surface area is defined by the centre of a
probe sphere (solvent molecule, typically water), when it is rolled over the molecular
surface.
The calculated values of the molecular surface area and the molecular volume depend on
the molecular conformation (see Chapter 4). Molecular surface area and volume are
usually expressed in Å² and Å³ respectively. They are measures of the molecular size.
3.4. Topological and electrotopological indices
Topological indices (TIs) are derived by representing the molecules as molecular graphs.
In a molecular graph the atoms are represented as dots (vertices), which are connected to
each other by lines (edges), representing the chemical bonds (Netzeva, 2003). From the
molecular graphs, paths of a certain length can be calculated. A path of a certain length
(m) represents two atoms that are connected with m bonds in the shortest pathway
between them. For example, a path of length one represents two atoms, connected with a
bond, and a path of length two indicates two bonds between the atoms. Another term is a
walk of a certain length. A walk of length one for a given atom is equal to the number of
atoms, to which it is connected. A walk of length m for a given atom is equal to the sum
of walks of length m-1 of all its neighbouring atoms (Morgan’s summation procedure)
(Rücker and Rücker, 1993). TIs are usually calculated from a hydrogen-suppressed
molecular graph, i.e. a molecular graph from which the hydrogen atoms are excluded.
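The walk counts described above can be sketched in Python as follows. The representation of the hydrogen-suppressed graph as an adjacency list (one list of neighbour indices per atom) and the function name `atomic_walk_counts` are assumptions made for this illustration.

```python
def atomic_walk_counts(adjacency, max_length):
    """Return {m: [walk count of length m for each atom]} for m = 1..max_length.

    A walk of length one equals the number of neighbours of the atom; a walk
    of length m is the sum of the walks of length m-1 of all its neighbours.
    """
    counts = {1: [len(neighbours) for neighbours in adjacency]}
    for m in range(2, max_length + 1):
        previous = counts[m - 1]
        counts[m] = [sum(previous[j] for j in neighbours) for neighbours in adjacency]
    return counts


# Hydrogen-suppressed graph of 2-methylpropane: atom 0 is the central carbon.
isobutane = [[1, 2, 3], [0], [0], [0]]
walks = atomic_walk_counts(isobutane, max_length=3)
```

For this graph the central atom has walk counts 3, 3 and 9 for lengths one, two and three, while each terminal atom has 1, 3 and 3.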
Numerous topological indices have been created and used in QSAR studies. Their
calculation is easy, based on the molecular 2D structure only, thus not requiring
conformational analysis or 3D optimisation of the structure (see Chapter 4). The main
drawback of TIs is their complex and difficult interpretation.
3.4.1.Wiener index, Balaban index
These two indices are calculated on the basis of the distance matrix of a molecular graph.
The ij element of the distance matrix is equal to the number of bonds between atoms i and
j in the shortest pathway between them. The diagonal elements of the distance matrix are
equal to zero.
The Wiener index (W) was proposed by Wiener in 1947 (Wiener, 1947). It is calculated
as a half sum of the elements of the distance matrix:
W = ½ ΣΣ Dij (3.6)
where Dij is the ij element of the distance matrix of the molecule. The summation is
made over all atoms i and j in the molecule.
The Wiener index has greater value for molecules with linear structure (e.g. n-alkanes)
and smaller value for compact molecules with many branches and cycles (Sablić, 1990).
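A minimal Python sketch of the Wiener index calculation is given below. It assumes a connected hydrogen-suppressed graph supplied as an adjacency list and builds the distance matrix by breadth-first search; the function names are illustrative.

```python
from collections import deque


def distance_matrix(adjacency):
    """Topological distance matrix: number of bonds in the shortest path."""
    n_atoms = len(adjacency)
    matrix = []
    for start in range(n_atoms):
        dist = [-1] * n_atoms  # -1 marks atoms not yet reached
        dist[start] = 0
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if dist[neighbour] < 0:
                    dist[neighbour] = dist[atom] + 1
                    queue.append(neighbour)
        matrix.append(dist)
    return matrix


def wiener_index(adjacency):
    """Half the sum of all elements of the distance matrix (Equation 3.6)."""
    return sum(sum(row) for row in distance_matrix(adjacency)) // 2


n_butane = [[1], [0, 2], [1, 3], [2]]
isobutane = [[1, 2, 3], [0], [0], [0]]
w_linear = wiener_index(n_butane)
w_branched = wiener_index(isobutane)
```

The linear isomer gives the larger value (10 for n-butane against 9 for 2-methylpropane), which illustrates the trend described above.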
The Balaban index (J) (Balaban, 1982) is defined as:
$J = \dfrac{M}{\mu + 1} \sum (D_i D_j)^{-0.5}$ (3.7)
where M is the number of bonds in the molecule
Di is the sum of the elements of i row of the distance matrix, corresponding to the
i-th atom; Di = Σ Dij
Dij is the ij member of the distance matrix of the molecule
µ is the cyclomatic number of the molecular graph, equal to the minimum number
of edges that must be removed before a polycyclic graph becomes acyclic:
µ = n – (N-1) (3.8)
where n is the number of graph edges (bonds)
N is the number of graph vertices (atoms).
The Balaban index increases with the size of the molecule, degree of branching, and
degree of unsaturation. Its value decreases sharply with the number of rings in a molecule
(Sablić, 1990).
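Equations 3.7 and 3.8 can be sketched in Python as follows. The sketch assumes, as in Balaban's definition, that the summation in Equation 3.7 runs over all pairs of bonded atoms i and j. It also assumes a connected hydrogen-suppressed graph given as an adjacency list, with multiple bonds and heteroatoms not treated specially. The function names are illustrative.

```python
from collections import deque


def distance_sums(adjacency):
    """Row sums D_i of the topological distance matrix (breadth-first search)."""
    sums = []
    for start in range(len(adjacency)):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in dist:
                    dist[neighbour] = dist[atom] + 1
                    queue.append(neighbour)
        sums.append(sum(dist.values()))
    return sums


def balaban_index(adjacency):
    """Balaban index J (Equation 3.7) with the cyclomatic number of Equation 3.8."""
    n_atoms = len(adjacency)
    n_bonds = sum(len(neighbours) for neighbours in adjacency) // 2
    mu = n_bonds - (n_atoms - 1)
    d = distance_sums(adjacency)
    bond_sum = sum(
        (d[i] * d[j]) ** -0.5
        for i in range(n_atoms)
        for j in adjacency[i]
        if i < j  # count each bond once
    )
    return n_bonds / (mu + 1) * bond_sum


j_propane = balaban_index([[1], [0, 2], [1]])
j_cyclopropane = balaban_index([[1, 2], [0, 2], [0, 1]])
```

For cyclopropane the cyclomatic number is one, so the factor M / (µ + 1) is 3/2 rather than 3.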
3.4.2. Path/walk Randić shape indices
Path/walk Randić shape indices of order m (PWm) are calculated by summing the ratios
of the atomic path counts over the atomic walk counts of order m for all atoms, and
dividing the sum by the number of non-hydrogen atoms (Randić, 2001). Path/walk count
ratios are independent of molecular size, and these descriptors can be considered as
encoding the shape of molecules. Path/walk Randić index of first order is equal to one for
all molecules as the counts of the paths and walks of length one are equal.
3.4.3. Molecular connectivity indices
Molecular connectivity indices (χ, often called “chi” indices) were introduced by Randić
(1975), and further developed by Kier and Hall (Kier and Hall, 1986; Hall and Kier,
1991). The indices are based on dividing the hydrogen-suppressed molecular graph into
subgraphs (fragments) of equal length (number of bonds). They are denoted as $^m\chi_t^v$,
where the left-side superscript m is the order of the index, equal to the length of the
fragments into which the molecule is divided, i.e. the number of bonds in the fragments; the
right-side superscript v denotes that the index is of the valence type (if it is missing, the
index is of the simple type); and the subscript t denotes the type of structural fragments into
which the molecule has been divided. The valence connectivity indices take into account
the presence of heteroatoms and multiple bonds in the molecule.
Four types of structural fragments are used in definition of the connectivity indices: path
fragments (subscript p); cluster fragments (subscript c); path clusters (subscript pc); and
chains (rings) (subscript ch). If the subscript t is missing, the index is assumed to be
path-type. Connectivity indices of order less than or equal to two can only be of the path
type; therefore the subscript p is often omitted from their notation.
To calculate the connectivity indices, simple ($\delta_i$) and valence ($\delta_i^v$) terms are
calculated for each non-hydrogen atom i in the molecule. These terms are defined as:
$\delta_i = \sigma_i - h_i$, equal to the number of skeletal neighbours of the specified atom (3.9)
where $\sigma_i$ is the number of valence electrons in the sigma orbitals of the atom i
$h_i$ is the number of hydrogen atoms bonded to the atom i.
$\delta_i^v = Z_i^v - h_i$ for atoms in the second row of the periodic table (e.g. C, N, O, F) (3.10)
$\delta_i^v = (Z_i^v - h_i) / (Z_i - Z_i^v - 1)$ for other atoms (3.11)
where $Z_i^v$ is the total number of valence electrons of atom i
$Z_i$ is the atomic number of i, i.e. the number of all electrons of the atom i.
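The following Python sketch illustrates Equations 3.9 to 3.11. It treats atoms up to neon (atomic number 10 or less) as the case covered by Equation 3.10 and all heavier atoms as the case covered by Equation 3.11; this cut-off, the function names and the example atoms are assumptions made for illustration.

```python
def simple_delta(sigma_electrons: int, n_hydrogens: int) -> int:
    """Simple delta value (Equation 3.9): number of skeletal neighbours."""
    return sigma_electrons - n_hydrogens


def valence_delta(atomic_number: int, valence_electrons: int, n_hydrogens: int) -> float:
    """Valence delta value (Equations 3.10 and 3.11)."""
    if atomic_number <= 10:  # assumed cut-off for Equation 3.10
        return valence_electrons - n_hydrogens
    return (valence_electrons - n_hydrogens) / (
        atomic_number - valence_electrons - 1
    )


delta_v_hydroxyl_o = valence_delta(8, 6, 1)   # oxygen of an -OH group
delta_v_methyl_c = valence_delta(6, 4, 3)     # carbon of a -CH3 group
delta_v_chlorine = valence_delta(17, 7, 0)    # chlorine substituent
```

The hydroxyl oxygen and the methyl carbon both have one skeletal neighbour (simple delta of 1), but their valence delta values differ (5 and 1), which is how the valence indices distinguish heteroatoms.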
Connectivity terms $^mC_t$ or $^mC_t^v$ are defined for each fragment of type t and order m as
equal to:
$^mC_t = \prod (\delta_i)^{-1/2}$ for simple indices (3.12)
$^mC_t^v = \prod (\delta_i^v)^{-1/2}$ for valence indices (3.13)
where the multiplication is done over all atoms in the fragment of type t and order m.
The connectivity indices are defined as:
$^m\chi_t = \sum {}^mC_t$ for simple indices (3.14)
$^m\chi_t^v = \sum {}^mC_t^v$ for valence indices (3.15)
where the summation is done over all fragments of type t and order m in the molecule.
For example, fragments of one atom form paths of length 0. As mentioned above, the
connectivity indices of order 0 are only of type path and therefore they are denoted as $^0\chi$
(the simple index) and $^0\chi^v$ (the valence index) (the subscript p is omitted). They are equal
to:
$^0\chi = \sum {}^mC_t = \sum (\delta_i)^{-1/2}$ (3.16)
$^0\chi^v = \sum {}^mC_t^v = \sum (\delta_i^v)^{-1/2}$ (3.17)
Fragments of two connected atoms form paths of length 1. $^1\chi$ (the simple index) and $^1\chi^v$
(the valence index) are equal to:
$^1\chi = \sum {}^mC_t = \sum (\delta_i \delta_j)^{-1/2}$ (3.18)
$^1\chi^v = \sum {}^mC_t^v = \sum (\delta_i^v \delta_j^v)^{-1/2}$ (3.18)
where i and j refer to the two connected atoms in the fragments, and the summation is
over all possible two-atom fragments.
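The zero-order and first-order simple indices can be sketched in Python as shown below. The sketch assumes a hydrogen-suppressed graph given as an adjacency list, so that the simple delta value of an atom is its number of neighbours; the function name is illustrative.

```python
def simple_chi_0_and_1(adjacency):
    """Return the simple connectivity indices of order zero and one."""
    deltas = [len(neighbours) for neighbours in adjacency]
    chi0 = sum(delta ** -0.5 for delta in deltas)
    chi1 = sum(
        (deltas[i] * deltas[j]) ** -0.5
        for i, neighbours in enumerate(adjacency)
        for j in neighbours
        if i < j  # count each two-atom fragment once
    )
    return chi0, chi1


n_butane = [[1], [0, 2], [1, 3], [2]]
isobutane = [[1, 2, 3], [0], [0], [0]]
chi0_linear, chi1_linear = simple_chi_0_and_1(n_butane)
chi0_branched, chi1_branched = simple_chi_0_and_1(isobutane)
```

Both molecules have four non-hydrogen atoms, but the first-order index is smaller for the branched isomer (about 1.73 against about 1.91), reflecting the dependence on branching.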
The connectivity indices of order zero, one and two are well correlated with the number
of atoms in the molecule, i.e with the molecular size (Netzeva et al., 2003). The values of
the indices of order one and two depend also on the molecular branching.
Molecular connectivity indices of order three or more can be of path, cluster, or chain
types. In the path-type graphs each atom is connected to a maximum of two other
non-hydrogen atoms. In cluster-type graphs all atoms are connected to a common central
atom. For example, a cluster graph of third order has the skeleton of isobutane
(2-methylpropane), and a cluster graph of fourth order is neopentane-like. In the chain-type
graphs the atoms participate in a ring system. A third-order chain graph has a
cyclopropane-like structure, and a sixth-order chain graph is constructed from six-membered
aromatic or aliphatic rings. Path/cluster indices can be of order four or more, and are
calculated by dividing the molecule into fragments, which combine paths and clusters.
Molecular connectivity indices are generally accepted to encode molecular size,
branching, cyclicity, unsaturation, and the presence and type of heteroatoms.
3.4.4. Difference molecular connectivity indices
Due to the large contribution of molecular size to the values of the chi indices, a high
degree of intercorrelation between the path chi indices may occur. This is observed
especially when the investigated compound set contains molecules with different sizes,
and/or insufficient structural diversity exists. In order to exclude the contribution of the
molecular size in the constitution of the chi indices, the difference connectivity indices
were constructed. Thus, other properties encoded by the chi indices, like branching,
cyclicity, unsaturation, and/or presence of heteroatoms, are emphasised.
To calculate the difference connectivity indices ($d^m\chi_n$) and the difference valence
connectivity indices ($d^m\chi_n^v$), a (valence) chi path index is calculated for a hypothetical
reference molecule with an unbranched (straight chain) molecular graph and the same
number of atoms as the molecule being described (denoted $^m\chi_n$ or $^m\chi_n^v$ respectively). This
index is subtracted from the corresponding chi path index ($^m\chi_t$ or $^m\chi_t^v$) for the given
molecule:
$d^m\chi_n = {}^m\chi_t - {}^m\chi_n$ (3.19)
$d^m\chi_n^v = {}^m\chi_t^v - {}^m\chi_n^v$ (3.20)
Difference connectivity indices are not defined for cluster, path/cluster and chain chi
indices because size factors do not influence them to the same extent as the path chi
indices (QsarIS version 1.1 software, help manual).
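As a hedged illustration of Equation 3.19 for the first-order simple index, the Python sketch below subtracts the index of an unbranched reference graph with the same number of atoms. It assumes hydrogen-suppressed graphs given as adjacency lists; the helper names are illustrative and the sketch does not reproduce the implementation of any particular software.

```python
def chi1(adjacency):
    """First-order simple connectivity index of a hydrogen-suppressed graph."""
    deltas = [len(neighbours) for neighbours in adjacency]
    return sum(
        (deltas[i] * deltas[j]) ** -0.5
        for i, neighbours in enumerate(adjacency)
        for j in neighbours
        if i < j
    )


def linear_chain(n_atoms):
    """Adjacency list of the unbranched reference graph with n_atoms atoms."""
    return [
        [j for j in (i - 1, i + 1) if 0 <= j < n_atoms]
        for i in range(n_atoms)
    ]


def difference_chi1(adjacency):
    """First-order difference index: molecule minus unbranched reference."""
    return chi1(adjacency) - chi1(linear_chain(len(adjacency)))


isobutane = [[1, 2, 3], [0], [0], [0]]
d_chi1_isobutane = difference_chi1(isobutane)
d_chi1_n_butane = difference_chi1(linear_chain(4))
```

The difference index is zero for the unbranched molecule itself and negative (about -0.18) for 2-methylpropane, so it responds to branching rather than to size.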
3.4.5. Average molecular connectivity indices
The average molecular connectivity indices are denoted by $^0\chi_A$ (simple average indices)
and $^0\chi_A^v$ (valence average indices) and are obtained by dividing the corresponding simple
or valence connectivity path index by the number of paths (fragments) involved in its
calculation. They are also aimed at excluding the influence of the molecular size on the
value of the index.
3.4.6. Kier benzene-likeliness index
The Kier benzene-likeliness index (BLI, Kier and Hall, 1986) is calculated by dividing
the first-order valence connectivity index $^1\chi^v$ by the number of non-hydrogen bonds in the
molecule and then normalising to the corresponding value for benzene. It was proposed as
a measure of molecular aromaticity.
3.4.7. Kappa shape indices
Kappa shape indices are designed to encode the overall molecular shape (Kier, 1985).
They are calculated using the counts of paths of length one (one-bond), two (two-bond)
and three (three-bond) in the hydrogen-suppressed graph of the molecule, and
correspondingly, the kappa shape indices are defined as of first, second and third order.
To calculate the kappa shape indices for a given molecule, two reference structures are
used, having the same number of atoms as the molecule of interest. The two reference
structures are constructed in such a way that they have the minimum possible ($^mp_{min}$) and
the maximum possible ($^mp_{max}$) number of paths of length m (m = 1, 2, 3) for the given
number of atoms in the molecule. The first reference structure for all kappa indices is the
non-branched isomer (linear graph), whose shape can be described as a cylinder or
ellipsoid. The second reference structure, which may be either real or hypothetical, is a
complete graph for m = 1 (first order kappa shape index); a star graph for m = 2 (second
order kappa shape index); and a twin star graph for m = 3 (third order kappa shape index)
(MolconnZ version 4.0 software, online help manual). A complete graph is a graph with
all atoms connected to each other. In the star graph all atoms are connected to a central
atom.
If $^mp_i$ is the number of paths of length m in the molecule of interest, then it is assumed
that $^mp_{min} \le {}^mp_i \le {}^mp_{max}$, where $^mp_{min}$ is the number of paths of length m in the linear
graph, and $^mp_{max}$ is the number of paths of length m in the second reference structure.
The general formula for calculating kappa shape indices ($^m\kappa$) is the following:
$^m\kappa = 2 \, {}^mp_{max} \, {}^mp_{min} / ({}^mp_i)^2$ (3.21)
where m is the order of the index, m = 1, 2, 3.
The $^mp_{min}$ and $^mp_{max}$ values can be calculated directly from the number of non-hydrogen
atoms ($n_A$) in the molecule. Their substitution for paths of different length has resulted in
the following equations for calculation of $^m\kappa$:
$^1\kappa = n_A (n_A - 1)^2 / ({}^1p_i)^2$ (3.22)
$^2\kappa = (n_A - 1)(n_A - 2)^2 / ({}^2p_i)^2$ (3.23)
$^3\kappa = (n_A - 3)(n_A - 2)^2 / ({}^3p_i)^2$, when $n_A$ is even (3.24)
$^3\kappa = (n_A - 1)(n_A - 3)^2 / ({}^3p_i)^2$, when $n_A$ is odd (3.25)
where $n_A$ is the number of atoms in the molecule (excluding hydrogen atoms).
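Equations 3.22 to 3.25 can be sketched in Python as follows. The sketch takes the atom count and the path counts as inputs and assumes that all three path counts are non-zero; the function name and the hand-counted example values are illustrative.

```python
def kappa_indices(n_atoms: int, p1: int, p2: int, p3: int):
    """Kappa shape indices of first, second and third order.

    n_atoms: number of non-hydrogen atoms; p1, p2, p3: numbers of paths of
    length one, two and three in the hydrogen-suppressed graph (all non-zero).
    """
    kappa1 = n_atoms * (n_atoms - 1) ** 2 / p1 ** 2
    kappa2 = (n_atoms - 1) * (n_atoms - 2) ** 2 / p2 ** 2
    if n_atoms % 2 == 0:
        kappa3 = (n_atoms - 3) * (n_atoms - 2) ** 2 / p3 ** 2
    else:
        kappa3 = (n_atoms - 1) * (n_atoms - 3) ** 2 / p3 ** 2
    return kappa1, kappa2, kappa3


# n-Pentane: 5 atoms with 4, 3 and 2 paths of length one, two and three.
kappa_n_pentane = kappa_indices(5, 4, 3, 2)
# Cyclohexane: 6 atoms with 6 paths of each length.
kappa_cyclohexane = kappa_indices(6, 6, 6, 6)
```

For the unbranched chain the first-order index equals the number of atoms, because the molecule coincides with the linear reference structure.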
The kappa indices were derived assuming that all atoms in the molecule are equivalent.
The influence on the molecular shape of atoms other than carbon in the sp3 hybrid state is
accounted for by the kappa alpha shape indices ($^m\kappa_\alpha$). They can be obtained by modifying
each $n_A$ and $^mp_i$ in the above equations by adding an α value:
α = r(x) / r(C(sp3)) – 1 (3.26)
where r(x) is the covalent radius of atom x
r(C(sp3)) is the covalent radius of a carbon atom in the sp3 hybrid state.
Subsequently, $n_A$ is replaced by $n_A$ + α, and $^mp_i$ is replaced by $^mp_i$ + α.
3.4.8. Shape flexibility index
The flexibility of a molecule depends on the presence of cycles and/or branching. By
combining 1κα and 2κα indices with the number of atoms (nA) (for normalisation), a
further index Φ has been defined (Kier, 1989), which is considered to measure molecular
flexibility.
$\Phi = ({}^1\kappa_\alpha \cdot {}^2\kappa_\alpha) / n_A$ (3.27)
The flexibility index decreases with increased branching or cyclicity. The presence of
unsaturation or heteroatoms results in a decrease in the flexibility index when these atoms
have a covalent radius that is less than that of a carbon in sp3 hybrid state (QsarIS version
1.1. software, help manual).
3.4.9. Electrotopological state (E-state) indices
The E-state indices (Kier and Hall, 1990; Hall et al., 1991; Kier and Hall, 1999) are
defined again in terms of the δi and δvi values of an atom i similarly to the connectivity
indices (Equations 3.9, 3.10 and 3.11, Section 3.4.3 “Molecular connectivity indices”).
The E-state index for an atom i (Si, also called atom-level E-state index) is composed of
an intrinsic state term (Ii), plus a sum of perturbations (∆Iij) from all other atoms in the
molecule:
Si = Ii + Σj ∆Iij (3.28)
where the summation is over the remaining atoms in the molecule.
The intrinsic state (Ii) and the perturbation (∆Iij) terms are calculated as follows:
$I_i = \left( (2 / N_i)^2 \, \delta_i^v + 1 \right) / \delta_i$ (3.29)
where $N_i$ is the principal quantum number of the valence electrons (the row of the
periodic table in which the chemical element is placed)
$\Delta I_{ij} = (I_i - I_j) / r_{ij}^2$ (3.30)
where $r_{ij}$ is the topological distance between atoms i and j, equal to the number of atoms
in the shortest path between them.
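Equations 3.28 to 3.30 can be illustrated with the Python sketch below, applied to the three non-hydrogen groups of ethanol. The delta values and the matrix of atom counts along the shortest paths were assigned by hand for this example, and the function names are illustrative rather than taken from any of the cited software.

```python
def intrinsic_state(n_quantum: int, delta_v: float, delta: float) -> float:
    """Intrinsic state I_i (Equation 3.29)."""
    return ((2.0 / n_quantum) ** 2 * delta_v + 1.0) / delta


def estate_indices(intrinsic, atoms_in_path):
    """Atom-level E-state indices S_i (Equations 3.28 and 3.30).

    atoms_in_path[i][j] is r_ij, the number of atoms in the shortest path
    between atoms i and j (both end atoms included).
    """
    n_atoms = len(intrinsic)
    return [
        intrinsic[i]
        + sum(
            (intrinsic[i] - intrinsic[j]) / atoms_in_path[i][j] ** 2
            for j in range(n_atoms)
            if j != i
        )
        for i in range(n_atoms)
    ]


# Ethanol as CH3-CH2-OH: (principal quantum number, valence delta, simple delta).
groups = [(2, 1, 1), (2, 2, 2), (2, 5, 1)]
intrinsic = [intrinsic_state(*group) for group in groups]
r = [[1, 2, 3], [2, 1, 2], [3, 2, 1]]  # diagonal entries are never used
s_values = estate_indices(intrinsic, r)
```

The perturbation terms cancel in pairs, so the sum of the E-state indices equals the sum of the intrinsic states (9.5 in this example); the hydroxyl group receives the largest value.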
Atom-level E-state indices can be calculated for each atom (such as >C<, >N–, =O, –Cl)
in a molecule, as well as for each hydride group (such as –CH3, >NH, –OH). For
simplicity, both atoms and hydride groups are termed “atoms”. Their values encode
information about the topological environment of the atom and the electronic interactions
due to all other atoms in the molecule (MolconnZ version 4.0 software, online help
manual).
In addition to atom-level E-state values, which are computed for each atom in the
molecule, atom-type E-state indices can be calculated. The atom type E-state indices are
defined as the sum of the individual atom-level E-state values for a particular atom type.
Hydrogen atoms can also be included in calculation of E-state values for deriving
hydrogen atom-type E-state indices (Rose and Hall, 2003).
Atom-type E-state indices encode information about the electron accessibility for atoms
of the same type, the presence/absence of an atom type and the count of the atoms of a
given atom type. Thus, the E-state values account for the ability of molecule to enter into
non-covalent intermolecular interactions (Rose and Hall, 2003).
3.4.10. E-state topological parameter
The E-state topological parameter (TIE) is calculated as follows (Voelkel, 1994):
$TIE = (n_B / n_R) \sum (S_i S_j)^{-1/2}$ (3.31)
where nB is the number of non-hydrogen bonds
nR the number of rings in the molecule
Si and Sj the E-state indices for the two connected atoms
the summation is over all bonds in the molecule.
3.5. Electronic descriptors
This group includes a large number of descriptors whose values are determined by the
spatial distribution of the electrons in a molecule. They represent global molecular
properties such as dipole moment, polarisability and molecular orbital energies (HOMO,
LUMO), as well as local properties such as partial atomic charges and atomic
superdelocalisabilities.
3.5.1. Dipole moment
An electric dipole consists of a pair of charges of equal magnitude and opposite signs (+q
and –q), separated by a distance (r). The dipole moment of an electric dipole is a vector
directed from the negative to the positive charge with magnitude equal to q*r. The
magnitude of the dipole moment is measured in coulomb metres (C m) or in debyes (D),
where 1 D = 3.336 × 10⁻³⁰ C m.
If the centres of positive and negative charge in a molecule do not coincide, the molecule
possesses a permanent dipole moment (µ) (polar molecule). The molecular dipole moment
is usually calculated using the following formula:
µ = Σqi*ri (3.32)
where ri is the radius-vector of an atom i from the origin of the coordinate system (centre
of charge or centre of mass)
qi is the partial charge of atom i
the summation is over all atoms in the molecule.
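A minimal Python sketch of Equation 3.32 is given below. It assumes partial charges in units of the elementary charge and coordinates in Å, and converts the result to debyes using 1 e Å ≈ 4.803 D. The function name and the two-charge example are hypothetical and serve only as an illustration.

```python
import math

E_ANGSTROM_TO_DEBYE = 4.803204  # one elementary charge times one angstrom, in debye


def dipole_moment(charges, coordinates):
    """Return the dipole vector (x, y, z) and its magnitude, both in debye."""
    components = [
        sum(q * position[axis] for q, position in zip(charges, coordinates))
        * E_ANGSTROM_TO_DEBYE
        for axis in range(3)
    ]
    magnitude = math.sqrt(sum(c * c for c in components))
    return components, magnitude


# Hypothetical pair of partial charges +0.5 and -0.5 separated by 1 angstrom.
charges = [0.5, -0.5]
coordinates = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
vector, magnitude = dipole_moment(charges, coordinates)
```

For a neutral molecule the result does not depend on the choice of origin, but the individual x, y and z components do depend on the orientation of the molecule in the coordinate system.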
Usually the magnitude of the dipole moment is used as a descriptor in the QSAR analysis.
The magnitude of one or more of the vector’s components along the x, y and z Cartesian
axes can also be used. However, this requires proper alignment of the investigated
molecules in the coordinate system.
Molecules having permanent dipole moments tend to align with each other, resulting in
attractive non-covalent van der Waals intermolecular interactions (orientation effect). The
presence of molecules with permanent dipole moments temporarily distorts the electron
distribution in nearby polar or non-polar molecules, inducing further polarisation
and van der Waals intermolecular forces (induction effect). However, it is difficult to
determine which biological and physicochemical factors are encoded by the magnitude of
the dipole moment when it is included in QSAR models. The dipole moment could
influence electrostatic interactions with biological macromolecules. Electric dipoles are
known to align themselves with externally applied electric fields in such a way that the
field they generate opposes the applied field, decreasing its magnitude. In this way the
magnitude of the dipole moment could be a measure of the ability of a molecule to diminish
the electric field across lipid membranes, which will influence membrane transport.
The dipole moment of a molecule can be measured experimentally. Quantum mechanical
methods allow calculation of the molecular dipole moment. The calculated value depends
on the molecular conformation (see Chapter 4).
3.5.2. Polarisability
Polarisability is the relative susceptibility of the electron cloud of an atom or a molecule
to be distorted from its normal shape by the presence of an external electric field (for
example that of a nearby ion or dipole). This distortion gives rise to an induced electric
dipole moment.
Polarisability (α) is a tensor relating the induced dipole moment (µind) to the applied
electric field strength (E):
µind = αE (3.33)
The non-diagonal elements of the tensor represent the polarisability of the electrons along
one of the axes of the coordinate system due to a component of the applied electric field
along another of the coordinate axes. As this effect is negligible compared to the
polarisability in the direction of the applied electric field (or it does not exist), the non-
diagonal elements of the polarisability tensor are zero or very small compared to the
diagonal elements. Therefore, in practice the polarisability is presented as ‘mean
polarisability’, i.e., the average polarisability over the three axes of the molecule, and is
equal to one third (1/3) of the sum of the tensor diagonal elements (the trace of the
tensor).
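The relationship in Equation 3.33 and the mean polarisability described above can be sketched in Python as follows. The tensor values are hypothetical numbers chosen for illustration, and the function names are not taken from any particular software.

```python
def induced_dipole(tensor, field):
    """Induced dipole moment as the product of the tensor and the field vector."""
    return [sum(tensor[i][j] * field[j] for j in range(3)) for i in range(3)]


def mean_polarisability(tensor):
    """One third of the trace of the polarisability tensor."""
    return (tensor[0][0] + tensor[1][1] + tensor[2][2]) / 3.0


# Hypothetical tensor with zero off-diagonal elements (arbitrary units).
alpha = [
    [1.2, 0.0, 0.0],
    [0.0, 1.5, 0.0],
    [0.0, 0.0, 1.8],
]
mu_induced = induced_dipole(alpha, [1.0, 0.0, 0.0])  # field applied along x
alpha_mean = mean_polarisability(alpha)
```

With zero off-diagonal elements, a field along the x axis induces a dipole along x only, which is why the mean of the diagonal elements is usually sufficient as a descriptor.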
Polarisability can be measured experimentally, or calculated with the methods of quantum
mechanics. Values are reported in Å³.
Molecular polarisability causes the inductive van der Waals forces (interactions between
permanent and induced molecular dipoles). It is also important for the presence of the
dispersion van der Waals (or London) forces which represent interactions between
temporary dipoles and induced dipoles. Temporary dipoles are formed in non-polar
molecules due to the electron motion within the molecule, and fluctuate rapidly with the
motion of the electrons. These temporary dipoles induce dipoles in other closely placed
molecules resulting in van der Waals intermolecular forces by orienting the temporary
and induced dipoles with each other.
3.5.3. Energies of the frontier molecular orbitals HOMO and LUMO
The energies of the frontier orbitals HOMO (highest occupied molecular orbital) and
LUMO (lowest unoccupied molecular orbital) are commonly used descriptors in QSAR
analysis. They reflect the reactivity of a molecule. A higher HOMO energy suggests
higher affinity of a molecule to react as a nucleophile, a lower LUMO energy suggests
stronger electrophilic nature of a molecule. Additionally, electrophilic and nucleophilic
attack will most likely occur at atoms where the coefficients of the corresponding atomic
orbitals in HOMO and LUMO, respectively, are large. The coefficients of the atomic
orbitals in HOMO and LUMO and the orbital energies are calculated with the methods of
the quantum mechanics (for details see Chapter 4, Section 4.2 “Quantum mechanical
calculations”).
3.5.4. Partial atomic charges
Partial atomic charges appear due to the different ability of atoms to withdraw electron
density (differences in the atomic electronegativity). The partial atomic charges determine
the electrostatic potential around a molecule, thus they influence intermolecular
interactions with electrostatic nature, including hydrogen-bonding (see Section 3.5.5
“Hydrogen-bond donor and acceptor ability”). Partial charges can be calculated with the
methods of molecular or quantum mechanics (for details see Chapter 4).
3.5.5. Hydrogen-bond donor and acceptor ability
The descriptors characterising hydrogen-bonding (H-bonding) are included in the group
of electronic descriptors, because this molecular property is determined by the
distribution of electrons, i.e. presence of lone electron pairs, and the energies of HOMO
and LUMO and high positive or negative charges on certain types of atoms (see below).
The simplest approach to evaluating the hydrogen-bonding (H-bonding) ability is to use
counts of the H-bond donor and acceptor atoms in a molecule. H-bond donors are
hydrogen atoms attached to an electronegative atom (O or N), whereas H-bond acceptors
are atoms that have a lone electron pair (O, N).
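The counting approach can be sketched in Python as follows. The representation of a molecule as a list of (element, number of attached hydrogens) pairs for the non-hydrogen atoms is an assumption made for this illustration, and the simple rule applied here (every O or N counts as an acceptor) ignores special cases that dedicated software may treat differently.

```python
def count_h_bond_sites(heavy_atoms):
    """Return (donor count, acceptor count) for a list of (element, n_H) pairs.

    Donors: hydrogen atoms attached to O or N.
    Acceptors: O and N atoms (assumed to carry a lone electron pair).
    """
    donors = sum(n_h for element, n_h in heavy_atoms if element in ("O", "N"))
    acceptors = sum(1 for element, _ in heavy_atoms if element in ("O", "N"))
    return donors, acceptors


# Ethanol (CH3-CH2-OH) and 2-aminoethanol (H2N-CH2-CH2-OH).
ethanol = [("C", 3), ("C", 2), ("O", 1)]
aminoethanol = [("N", 2), ("C", 2), ("C", 2), ("O", 1)]
ethanol_counts = count_h_bond_sites(ethanol)
aminoethanol_counts = count_h_bond_sites(aminoethanol)
```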
Several other approaches to assess the overall H-bonding strength of a molecule have
been developed. Kamlet and Taft (Kamlet and Taft, 1976; Taft and Kamlet, 1976)
proposed scales of solvent H-bond donor acidity (α) and H-bond acceptor basicity (β).
These scales were derived from effects of solvents on the ultraviolet and visible spectra of
solutes, which change due to H-bonding. Other measures of H-bond acidity ($\alpha_2^H$) and
H-bond basicity ($\beta_2^H$), proposed by Abraham and co-workers (Abraham et al., 1989;
Abraham, 1993), are based on values of equilibrium constants for complex formation by
H-bonding in tetrachloromethane.
To avoid experimental determination of H-bond ability, Wilson and Famini (1991)
proposed descriptors based on quantum mechanics calculations. H-bond acceptor strength
is represented by two terms. The first is the covalent contribution to Lewis basicity (εb),
equal to the difference in energy between LUMO (lowest unoccupied molecular orbital)
of water and HOMO (highest occupied molecular orbital) of the solute. The second term
is the electrostatic basicity (q-) equal to the largest negative atomic charge in the solute
molecule (the most negatively charged atom will have the greatest interaction with a
proton from a neighbouring molecule). Analogously, H-bond donor strength is
represented by the covalent acidity (εa) – the energy difference between HOMO of water
and LUMO of solute, and the electrostatic acidity (qH+) - the largest positive charge of a
hydrogen atom in the solute molecule (the most positively charged proton of the molecule
will be most attracted to a negatively charged atom of a neighbour).
CHAPTER 4
3D-MOLECULAR CONFORMATIONS
A number of QSAR techniques and descriptors need 3D-representation of the molecular
structure. Examples of descriptors, whose values depend on the 3D-molecular
conformations are molecular surface area and volume, dipole moment, energies of the
frontier molecular orbitals (HOMO and LUMO), partial atomic charges. 3D-QSAR
techniques like CoMFA and CoMSIA are very sensitive to the choice of the 3D-
molecular conformations (see Chapter 6). The 3D-conformation chosen for the analysis
should resemble most closely the active 3D-conformation of a chemical compound, i.e.
the conformation, in which it elicits its biological effect. Information on the active
conformation can be obtained from crystallographic analysis of complexes between the
investigated chemical and its biological receptor (enzyme or other macromolecule).
However, often such information is insufficient or missing. Therefore it is generally
accepted to use the 3D-conformation with minimum energy in the QSAR analysis.
Generally, two methods can be used to calculate the energy-minimised molecular
conformations and corresponding molecular descriptors, namely the theory of molecular
mechanics and the theory of quantum mechanics. Usually, owing to limited
computational resources, both approaches calculate the energy-minimised
conformation in vacuum, ignoring solvent effects. Calculations performed by molecular
mechanics are fast, but are less accurate and cannot estimate the energies of the electron
orbitals. To perform more accurate calculations and to calculate electronic energies, the
theory of quantum mechanics is used. However, quantum mechanical calculations
demand greater computational resources and can be very time-consuming. To reduce the
computational time and resources needed, usually a molecular mechanical minimisation is
performed to obtain a reasonable conformation, which is further minimised with quantum
mechanical calculations.
In this chapter molecular mechanical and quantum mechanical approaches for calculation
of (energy-minimised) molecular conformations are presented.
4.1. Molecular mechanical calculations
Molecular mechanics is a method to calculate 3D-structure and energy of molecules
based on the motion of atomic nuclei only. Electrons are not considered explicitly, as it is
assumed that they will find their optimum distribution once the positions of the nuclei are
known. This assumption is based on the Born-Oppenheimer approximation (see Section
4.2 “Quantum mechanical calculations”), which states that the nuclear motions can be
studied separately from the electronic motions. Because the nuclei are much heavier and
move more slowly than the electrons, the electrons are assumed to move fast enough to
adjust to the nuclear motion. Thus, in the theory of molecular mechanics the molecules
are considered as assemblies of spherical particles with given radii (representing the
nuclei) held together by simple harmonic forces (representing the bonds).
The energy and the optimum (energy-minimised) geometry of a molecule are calculated
by applying a force field to it. A force field is a collection of atom types and atom
parameters, parameters for bond lengths at equilibrium, bond angles, torsion angles, etc.
The total energy of a molecule is divided into several terms called force potentials, which
are described by potential energy equations, included in the definition of the force field.
Examples of force potentials are the energies associated with bond stretching, bond
bending (bond angle deformation), torsion strain (rotation of adjacent groups about a
common bond), interactions between non-bonded atoms (van der Waals attraction, steric
repulsion, and electrostatic attraction/repulsion), energies due to dipole-dipole
interactions between the bond dipole moments in a molecule, and/or intra-molecular
hydrogen bonding energies. Force potentials are calculated independently from each
other, and are summed to give the total energy of the molecule.
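As a rough illustration of how independently evaluated force potentials are summed, the following Python sketch adds harmonic bond-stretching, harmonic angle-bending and Lennard-Jones non-bonded terms. The function names, the functional forms chosen and all parameter values are hypothetical and are not taken from MM2, MM3, AMBER, Cosmic or any other published force field.

```python
import math

def bond_stretch(r, r0, k):
    """Harmonic bond-stretching energy for a bond of length r (equilibrium length r0)."""
    return 0.5 * k * (r - r0) ** 2

def angle_bend(theta, theta0, k):
    """Harmonic angle-bending energy for a bond angle theta (radians)."""
    return 0.5 * k * (theta - theta0) ** 2

def lennard_jones(r, epsilon, sigma):
    """Van der Waals attraction plus steric repulsion between two non-bonded atoms."""
    x = (sigma / r) ** 6
    return 4.0 * epsilon * (x * x - x)

def total_energy(bonds, angles, nonbonded):
    """Total energy as the sum of the independently calculated force potentials."""
    energy = sum(bond_stretch(*b) for b in bonds)
    energy += sum(angle_bend(*a) for a in angles)
    energy += sum(lennard_jones(*n) for n in nonbonded)
    return energy

# Made-up geometry and parameters, for illustration only
bonds = [(1.10, 1.09, 340.0), (1.09, 1.09, 340.0)]            # (r, r0, k)
angles = [(math.radians(109.5), math.radians(109.5), 35.0)]   # (theta, theta0, k)
nonbonded = [(3.0, 0.1, 3.0)]                                 # (r, epsilon, sigma)

energy = total_energy(bonds, angles, nonbonded)
```

In this made-up example only the first bond is displaced from its equilibrium length, so it alone contributes to the total energy; the Lennard-Jones term is zero because the two atoms are placed exactly at the distance sigma.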
Different force fields have been developed. They differ by their atom and bond
parameterisation and by the energy potentials included. Commonly used force fields for
calculating conformations of small organic molecules are MM2 and MM3 developed by
Allinger and coworkers (Allinger, 1977; Allinger et al., 1989). A force field suitable for
modelling proteins and nucleic acids is AMBER (Assisted Model Building with Energy
Refinement) (Weiner and Kollman, 1981; Weiner et al., 1984). Another force field
developed for calculations of small molecules (with molecular weight between 50 and
1000) is the Cosmic force field, which uses simple force field parameters and equations
and gives results of reasonable accuracy (Abraham and Haworth, 1988; Morley et al.,
1991).
4.2. Quantum mechanical calculations
In the theory of quantum mechanics a wave function is associated with every quantum
particle. The wave function describes the spatial position and the motion of the particle
and its energy. The wave function for a given particle is obtained by solving the
Schrödinger equation. The time-independent and non-relativistic (neglect of relativistic
effects on the particle mass) form of the Schrödinger equation is the following:
Ĥ Ψ = E Ψ (4.1)
where Ĥ is the Hamiltonian operator, which is applied to the wavefunction Ψ to give the
same function multiplied by a constant. The constant E is the energy of the state of the
particle described by the given wavefunction. Ψ and E represent respectively the
eigenvectors and the corresponding eigenvalues of the operator Ĥ. The square of the
wavefunction value at a given point in space represents the probability for the particle to
be found at that point.
For quantum systems consisting of more than one particle (molecules), the Schrödinger
equation is solved to find wavefunctions describing the different energy states of the
whole system (all particles), and the corresponding energies of the system.
So far the Schrödinger equation has been exactly solved for one-electron molecules only.
For more complex systems some approximations need to be made to find its solution. The
Born-Oppenheimer approximation is based on the fact that the nuclear motions are slower
than the electronic motions. Therefore the nuclei and the electrons can be studied
independently. Independent Schrödinger equations are solved to obtain separate
wavefunctions describing the motion of the nuclei and the motion of the electrons. In this
approximation, the nuclear motion is neglected, and a solution of the Schrödinger
equation for electrons is sought at a given fixed nuclear configuration.
The electronic Hamiltonian has three terms (discussed in detail in Schüürmann, 2004).
The first term corresponds to the sum of the kinetic energies of the electrons, the second
term is related to the attractive electrostatic forces between the nuclei and the electrons,
and the third term corresponds to the sum of the repulsive electrostatic forces between
each two electrons. The square of the wavefunction obtained for the electrons is the
physically observable electron density.
However, an electronic wavefunction depending on the position of all electrons is very
complicated. Several further approximations need to be made to solve the Schrödinger
equation for electrons. In the molecular orbital approximation the motion of the electrons
in a molecule is considered independent from each other. Thus, the third term in the
electronic Hamiltonian, corresponding to the repulsive electron-electron interactions, is
neglected. Since electron-electron interactions are neglected in this approach, the
probability of finding an electron at a given position (i.e. the square of a one-electron
wavefunction) does not depend on the positions of the other electrons. Thus, the
probability of finding all electrons at a given configuration in the space is represented
using the products of the squares of the one-electron wavefunctions. Therefore, the
molecular orbital approximation assumes that the electronic wavefunction (describing the
state of all electrons) can be obtained as a combination of products of one-electron
wavefunctions.
The molecular one-electron wavefunctions (called molecular spin orbitals) can be
expressed as products of a spatial wavefunction, describing the spatial position of the
electron, and a spin function (accounting for the electron spin). The energy associated
with the molecular orbitals is the energy required to remove an electron from that orbital.
The spatial parts of the molecular orbitals (ψi) are usually expressed as a linear
combination of a set of one-electron wavefunctions, centred on the nuclei, called basis
set:
$$ \psi_i = \sum_j c_{ij}\,\varphi_j \qquad (4.2) $$
The one-electron wavefunctions of the basis set (φj) have a similar mathematical form to
the wavefunctions describing the state of an electron in non-connected atoms (the so-called
atomic orbitals). Therefore this approximation is called a “molecular orbitals – a
linear combination of atomic orbitals” (MO-LCAO) method. The coefficients cij
determine the contribution of the atomic orbital φj to the molecular orbital ψi.
The most common atomic wavefunctions (φj) used as basis sets are the Slater-type
orbitals (STO), and the Gaussian-type orbitals (GTO). The STOs account for the actual
shape of the atomic orbitals. GTOs are mathematically simpler than STOs, but less
accurate. The one-electron orbitals can also be built by combining sets of gaussian
functions to approximate each STO (contracted gaussian functions). A minimal basis set
is one in which only occupied orbitals of each isolated atom are used to compose the
molecular orbitals.
The objective of the MO-LCAO method is to find the coefficients cij which result in
molecular orbitals that best approximate the actual molecular orbitals. The variational
principle states that this is equivalent to finding values for the coefficients cij, which result
in molecular orbitals with minimum energies for a given choice of Hamiltonian and basis
set. To obtain these coefficient values, the energy is expressed as a function of cij, and the
derivatives of the energy with respect to each coefficient are set equal to zero. Calculating
the cij requires the values of certain integrals, which can be obtained completely
or partially from experimental data (semi-empirical quantum mechanical methods) or
by computing them in full (ab initio methods).
As described above, the effects of electron-electron interactions are neglected in the
Hamiltonian used to derive the molecular orbitals. These effects, called "electron
correlation", include electrostatic electron-electron repulsions, and Pauli interactions,
representing the attractive interaction between two electrons of opposite spin and
repulsive interactions between two electrons of the same spin. However, to obtain a
proper description of the state of molecular systems these effects need to be taken into
account.
In the self-consistent-field (SCF) approximation the electrostatic electron-electron
repulsions are accounted for by assuming that each electron interacts with an average
electrostatic field of the nuclei plus all other electrons. This gives a mathematically
simpler approach than describing the electron interactions in pairs. It is used in the
Hartree-Fock approach, where a Hamiltonian operator (called the Fock operator) is
constructed, which considers each electron in the average electrostatic field formed from
all other electrons in the molecule, and also takes into account the Pauli interactions
(discussed in detail in Schüürmann, 2004).
To find molecular orbitals which take into account the electron correlation, an iterative
procedure is applied. An initial set of molecular orbitals is constructed and the
Schrödinger equation is used to calculate the electron correlation term of the Fock
operator. The Fock operator is then used to calculate a new set of molecular orbitals, and
again, a new Fock operator is computed. This process is repeated until the calculated
molecular orbitals become constant, or a pre-defined energy criterion is met. Each SCF
iteration involves calculation of a new set of coefficients cij, and the process is repeated
until the cijs approach constant values within a convergence criterion.
Two types of Hartree-Fock calculations exist. The restricted Hartree-Fock (RHF)
calculations are performed on closed shell systems, where all orbitals are empty or have
paired electrons. The unrestricted Hartree-Fock method (UHF) is applied for open shell
systems with one or more unpaired electrons in some molecular orbitals.
Molecular orbital calculations are used to calculate molecular orbital energies, energies of
the electrons and nuclei; energies of electron-electron, electron-nuclei, and nuclei-nuclei
interactions; the total molecular energy; heat of formation (calculated from the difference
between the total energy and the energy of the isolated atoms); partial atomic charges,
electrostatic potential; dipole moment; spectroscopic properties.
The molecular electrostatic potential surrounding the molecule, which arises from its charge
distribution, can be used to derive partial atomic charges by fitting the charges so that they
reproduce this potential. Partial atomic charges can also be calculated from the molecular
orbital coefficients using methods such as Mulliken population analysis, which provides
partitioning of either the total electron density, or an orbital density (Mulliken, 1955).
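The partitioning of the total electron density can be sketched for a toy case. The Python example below uses the standard textbook form of the Mulliken gross atomic population, in which the charge on atom A is the nuclear charge minus the sum of the diagonal elements (PS)μμ over the basis functions μ centred on A, where P is the density matrix and S the overlap matrix. This formula is not given in the text above and is stated here as an assumption; the function names and the overlap value are hypothetical. The system is H2 in a minimal basis with both electrons in the bonding orbital.

```python
import math

def matmul(a, b):
    """Product of two small matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mulliken_charges(density, overlap, basis_atom, nuclear_charges):
    """Partial charges q_A = Z_A - sum over basis functions on A of (P S)_mu,mu."""
    ps = matmul(density, overlap)
    populations = [0.0] * len(nuclear_charges)
    for mu, atom in enumerate(basis_atom):
        populations[atom] += ps[mu][mu]
    return [z - n for z, n in zip(nuclear_charges, populations)]

# H2 with one 1s-type basis function per atom; the overlap value is an assumed number
s = 0.66
c = 1.0 / math.sqrt(2.0 * (1.0 + s))      # normalised bonding-orbital coefficient
density = [[2 * c * c, 2 * c * c],        # P = 2 c c^T for one doubly occupied orbital
           [2 * c * c, 2 * c * c]]
overlap = [[1.0, s],
           [s, 1.0]]

charges = mulliken_charges(density, overlap, basis_atom=[0, 1],
                           nuclear_charges=[1.0, 1.0])
```

For this symmetric molecule each atom is assigned exactly one electron, so both partial charges vanish, as expected.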
4.2.1. Semi-empirical methods
The semi-empirical quantum mechanical methods allow for relatively fast calculations on
larger molecules. In these methods only the outer or valence electron orbitals are
calculated. The inner electrons are considered of less importance for the chemical
properties, and usually are parameterised empirically. In the semi-empirical
approximations some integrals arising from the Schrödinger equation are neglected, and
parameters are used to compensate for the errors introduced by removing these integrals.
The parameters are obtained by fitting to measured or calculated ionisation
potentials, electron affinities, and spectroscopic quantities, or are computed from ab initio
calculations on model systems.
The first semi-empirical methods were developed by Pople and coworkers (Pople and
Segal, 1965; Pople et al., 1965; Pople and Beveridge, 1970). They are based on neglect of
differential overlap integrals (NDO). The methods developed differ by the amount to
which these integrals are neglected. These methods include CNDO (Complete Neglect of
Differential Overlap), INDO (Intermediate Neglect of Differential Overlap), and MINDO
(Modified Intermediate Neglect of Differential Overlap).
The semi-empirical methods used nowadays are based on the neglect of diatomic
differential overlap (NDDO). In these methods two-electron integrals are considered only
if they are centred on the same atom. The first method was developed by Dewar and Thiel
(1977) and called MNDO (Modified Neglect of Diatomic Overlap). Austin Model 1
(AM1) (Dewar et al., 1985), and Parametric Model 3 (PM3) (Stewart, 1989) are the most
commonly used NDDO methods. AM1 and PM3 are quantum mechanically identical but
differ in their parameterisations. Generally, AM1 is most reliable for molecules
containing elements C, H, N, and O, while PM3 is suitable for molecules containing P
and S or for the main group elements that are not parameterised in AM1 (Vamp
Reference Guide, 1997).
Currently, the semi-empirical methods are used for molecules in the range of 20 to 400
atoms. Specific software implementations of the semi-empirical methods include
AMPAC (Semichem, Inc.), MOPAC (Fujitsu), ZINDO (Zerner, 1993), Vamp (Accelrys
Inc.).
4.2.2. Ab initio methods
In the ab initio approach all integrals in the calculation of molecular orbitals are computed.
It gives the most accurate results, but requires greater computational resources compared
to the semi-empirical methods. The obtained accuracy, however, depends on the choice of
the basis set. Examples of basis sets include Slater-type Orbital (STO) bases. In these
bases STOs are approximated by N Gaussian-type orbitals (GTOs), and the bases are
denoted as STO-NG (for example STO-3G, where each STO is replaced by 3 GTOs).
STO-NG are minimal basis sets, with only one function per atomic orbital for each atom.
Better results are given by the so-called split-valence basis sets, in which different basis
functions are used for the core orbitals and for the valence orbitals. Examples are the 3-21G
and 6-31G bases. The notation 3-21G means that each core orbital is approximated by a
contraction of 3 GTOs, while each valence orbital is split into two basis functions: an inner
function built from 2 GTOs and a more diffuse outer function consisting of 1 GTO.
Further improvement of the results is achieved by adding d-orbitals for the
heavy atoms to the basis set. For typical organic compounds they are not used in bond
formation, but rather are used to allow a shift of the centre of an orbital from the position
of the nucleus (polarisation of the orbital). For example, a p-orbital on carbon can be
polarised by adding to it a d-orbital. The presence of polarisation functions is indicated by
adding an asterisk to the notation of the basis set. Thus, 3-21G* implies the previously
described split valence basis with polarisation added. A second asterisk is used to denote
the addition of a set of p-orbitals to each hydrogen atom to provide for their polarisation
(for example 6-31G** basis).
The choice of a basis set is based on a compromise between the desired accuracy in the
results, and the corresponding computational demands. Ab initio methods can calculate
any type of atom, including metals. Currently, their application is limited to molecules
containing tens of atoms due to the great computational demand. Software
implementations include GAUSSIAN (Gaussian, Inc.), and GAMESS (Computing for
Science (CFS) Ltd.).
4.3. Obtaining 3D-molecular conformation with minimum energy
By using the methods of molecular or quantum mechanics the energy of a molecule and
some descriptors of the structure are calculated. For the needs of QSAR analysis these
descriptors should be calculated on the basis of the conformation with minimum energy.
To find an energy-minimised molecular conformation, an iterative procedure is usually
applied. A given starting conformation is constructed and its energy is calculated.
Subsequently, the starting conformation is altered to form a conformation with smaller
energy. Again, the energy of the new conformation is calculated and these steps are
repeated until a conformation with minimum energy is found. Different algorithms exist
for altering the conformation in the process of searching for smaller-energy
conformations. In general, these can be divided into derivative and non-derivative
algorithms. In the derivative algorithms the derivatives of the energy with respect to the
atomic coordinates are calculated. Some derivative algorithms search for the minimum
value of the energy gradient (the first derivative of the energy). Such are, for example, the
steepest descent method, the arbitrary step approach, and the conjugate gradients
minimisation. They are fast and are appropriate for finding the most rapid decrease in
energy far from the energy minimum. Other derivative algorithms consider also the
second derivative of the energy to find the conformation with a minimum energy
(Newton-Raphson method). These methods are slower, and are more appropriate for
structures close to the energy minimum. The non-derivative algorithms do not use the
energy derivatives. In the Simplex method (a slow computational method) each atom is
moved along the Cartesian axes in both directions and the energy for each position is
calculated; the position with the lowest energy is selected as the new position. In Monte Carlo
procedures the molecular conformation is altered by a small random amount at each step,
and if the energy decreases the new structure is accepted, otherwise it is rejected.
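The iterative procedure can be illustrated with the simplest derivative algorithm, steepest descent, applied to a one-dimensional problem. The sketch below minimises a Lennard-Jones energy for two atoms as a function of their separation r; the choice of potential, the reduced units (epsilon = sigma = 1), the fixed step size and the convergence threshold are all assumptions made for illustration.

```python
def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones energy of two atoms at separation r (reduced units)."""
    x = (sigma / r) ** 6
    return 4.0 * epsilon * (x * x - x)

def lj_gradient(r, epsilon=1.0, sigma=1.0):
    """First derivative of the Lennard-Jones energy with respect to r."""
    x = (sigma / r) ** 6
    return 4.0 * epsilon * (-12.0 * x * x + 6.0 * x) / r

def steepest_descent(r, step=0.01, tol=1e-8, max_iter=100000):
    """Move downhill along the negative gradient until the gradient is ~zero."""
    for _ in range(max_iter):
        g = lj_gradient(r)
        if abs(g) < tol:          # convergence criterion on the gradient
            break
        r -= step * g             # alter the geometry towards lower energy
    return r

r_start = 1.5
r_min = steepest_descent(r_start)
```

The separation converges to the analytical minimum of the Lennard-Jones potential at r = 2^(1/6) sigma. A fixed step is used here for brevity; practical implementations usually adjust the step length during the search.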
The minimum-energy conformation obtained by these algorithms usually depends on the
starting conformation and may represent a local minimum of the energy. There might be
several local energy minima for a molecule. The lowest of the local energy minima is
called the global minimum. To find the conformation corresponding to the global energy
minimum, methods for conformational search are used. In these methods several starting
conformations are generated and optimised (i.e. for the given starting conformation the
conformation with energy in a local minimum is found). The optimised conformation
with the lowest energy is considered to be the conformation with a global energy
minimum. A variety of methods exist to generate different starting conformations.
Generally, they are of two types: stochastic methods, in which the new conformations are
generated randomly; and systematic methods, in which the conformation space is
searched in a systematic way, for example, by increasing the torsion angles by a given
increment.
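A systematic conformational search over a single torsion angle can be sketched as follows. Starting conformations are generated on a regular grid of torsion angles, each is optimised to its local minimum by steepest descent, and the optimised conformation with the lowest energy is taken as the global minimum. The model torsional potential and its parameters V1 and V3 are hypothetical and serve only to produce several local minima.

```python
import math

V1, V3 = 1.4, 3.2   # hypothetical torsional parameters (arbitrary energy units)

def torsion_energy(phi):
    """Model torsional potential as a function of the torsion angle phi (radians)."""
    return 0.5 * (V1 * (1.0 + math.cos(phi)) + V3 * (1.0 + math.cos(3.0 * phi)))

def torsion_gradient(phi):
    """Analytical first derivative of torsion_energy with respect to phi."""
    return 0.5 * (-V1 * math.sin(phi) - 3.0 * V3 * math.sin(3.0 * phi))

def optimise(phi, step=0.01, tol=1e-8, max_iter=100000):
    """Local optimisation by steepest descent from the starting angle phi."""
    for _ in range(max_iter):
        g = torsion_gradient(phi)
        if abs(g) < tol:
            break
        phi -= step * g
    return phi

def systematic_search(increment_deg=30):
    """Optimise one starting conformation per grid point of the torsion angle."""
    results = []
    for start_deg in range(0, 360, increment_deg):
        phi = optimise(math.radians(start_deg))
        results.append((torsion_energy(phi), math.degrees(phi), start_deg))
    return results

results = systematic_search()
global_energy, global_angle, _ = min(results)   # lowest optimised energy
```

With these parameters the search finds local minima close to 60 and 300 degrees and the global minimum at 180 degrees. A start placed exactly on an energy maximum (here 0 degrees) has zero gradient and does not move, which is one reason why several starting conformations are needed.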
CHAPTER 5
MATHEMATICAL PROCEDURES AND STATISTICAL METHODS
In this chapter an introduction to the mathematical procedures and statistical methods
used to derive QSAR models in the project is presented. Statistical parameters used to
assess model quality are described. Assumptions of the statistical approaches are given.
Additionally, statistical techniques for selecting variables resulting in best statistical fit,
and for evaluating model quality, are described.
5.1. Linear regression analysis
Linear regression analysis is the most widely used statistical approach to derive QSARs.
5.1.1. General description of the method
In regression analysis a functional dependence of a variable (called the dependent
variable) on a set of other variables (called independent or predictor variables) is sought.
In linear regression analysis the dependence has the following linear form:
$$ Y' = b_1 X_1 + b_2 X_2 + \dots + b_p X_p + b \qquad (5.1) $$
where b1, b2, …, bp are regression coefficients
b is the intercept
X1, X2, … Xp are independent variables
Y’ represents expected values of the dependent variable by the regression model.
The regression coefficients b1, b2, …, bp and the intercept are calculated by applying the
method of least squares to give the smallest possible sum of squared differences between
the true Y values of the dependent variable and the Y’ values calculated by the regression
model.
The equation represents a hyperplane in the (p+1)-dimensional space spanned by the p
predictor variables and the dependent variable. It reduces to a line in the case of one predictor
variable and a plane in the case of two predictor variables.
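The least-squares calculation of the regression coefficients and the intercept can be sketched in plain Python by solving the normal equations (XᵀX)b = XᵀY, where a constant column is appended to the predictor matrix for the intercept. The use of the normal equations, the helper names and the data are assumptions made for illustration; the data were generated without noise from Y = 2·X1 − X2 + 0.5 so that the expected result is known.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                   # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_linear(xs, ys):
    """Least-squares fit of Y' = b1*X1 + ... + bp*Xp + b; returns ([b1..bp], b)."""
    rows = [list(x) + [1.0] for x in xs]             # constant column for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    coef = solve(xtx, xty)
    return coef[:-1], coef[-1]

# Hypothetical, noise-free data generated from Y = 2*X1 - X2 + 0.5
xs = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
ys = [0.5, 3.5, 1.5, 5.5, 2.5]

coefficients, intercept = fit_linear(xs, ys)
```

Because the data contain no noise, the fit recovers the generating coefficients 2 and −1 and the intercept 0.5. Statistical packages typically use numerically more stable decompositions than the normal equations.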
The derived regression equation can be used for predicting values of the dependent
variable from the values of the independent variable(s).
When the dependent and independent variables are standardised to have means of zero
and standard deviations of one, the equation can be written in a standardised form, in
which the intercept b is equal to zero. In matrix form the equation can be written as:
Ys = XsBs + E (5.2)
where Ys is the matrix of the standardised dependent variable; it has dimensionality of n
x 1 (n rows and one column)
Xs is the matrix containing the standardised independent variables included in the
model; of dimensionality n x p
Bs is the matrix of the standardised regression (beta) coefficients, having
dimensionality of p x 1
E represents a term corresponding to the unexplained part of Y by the model of
dimensionality n x 1
n is the number of observations
p is the number of independent variables included in the regression model.
The coefficients for the independent variables in this case are called beta coefficients, and
can be used to assess the relative contribution of the independent variables to the
prediction of the dependent variable. Independent variables with higher absolute values of
their beta coefficients explain a greater part of the variance in the dependent variable.
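Beta coefficients can also be obtained without refitting the model, using the standard relation beta_j = b_j · s(Xj) / s(Y), where s denotes the sample standard deviation. This relation is not stated in the text above and is introduced here as an assumption of the sketch, as are the function name and the data (the same hypothetical, noise-free data set generated from Y = 2·X1 − X2 + 0.5).

```python
import statistics

def beta_coefficients(coefs, columns, y):
    """Standardised (beta) coefficients: beta_j = b_j * sd(X_j) / sd(Y)."""
    sd_y = statistics.stdev(y)
    return [b * statistics.stdev(col) / sd_y for b, col in zip(coefs, columns)]

# Hypothetical data generated from Y = 2*X1 - X2 + 0.5
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 5.0, 3.0, 8.0]
y = [0.5, 3.5, 1.5, 5.5, 2.5]

betas = beta_coefficients([2.0, -1.0], [x1, x2], y)
```

The absolute values of the resulting beta coefficients can be compared directly, because both predictors are expressed on the same standardised scale.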
To assess the accuracy with which the dependence found describes the variance of the
dependent variable (i.e. the quality of the statistical fit), the Pearson correlation
coefficient (r) (for regression with a single predictor variable) or the coefficient of
determination (R2) is used:
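$$ r = \pm \left( \mathrm{ESS} / \mathrm{TSS} \right)^{0.5} \qquad (5.3) $$
$$ R^2 = \mathrm{ESS} / \mathrm{TSS} = 1 - \left( \mathrm{RSS} / \mathrm{TSS} \right) \qquad (5.4) $$
where: TSS is the total sum of squares: $\mathrm{TSS} = \sum_i (Y_i - Y_{mean})^2$, it has n-1 degrees of
freedom
ESS is the explained sum of squares: $\mathrm{ESS} = \sum_i (Y_i' - Y_{mean})^2$, it has p degrees of
freedom
RSS is the residual sum of squares: $\mathrm{RSS} = \sum_i (Y_i - Y_i')^2$, it has n-p-1 degrees of
freedom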
Yi are the observed values of the dependent variable
Yi’ are the predicted values of the dependent variable by the regression model
Ymean is the mean value of the dependent variable
n is the number of observations
p is the number of independent variables included in the regression model.
The Pearson correlation coefficient (r) has a positive value if the values of the dependent
variable increase with increasing values of the independent variable; otherwise the
coefficient is negative.
An R2 value greater than 0.5 means that the variance explained by the model (ESS) is greater
than the unexplained variance (RSS). The closer the value of R2 to 1, the better the
regression equation describes the observed data. The value of R2 depends on the size of
the sample and the number of predictor variables in the equation. It keeps the same value
or increases when a new predictor variable is added to the regression equation, even if the
added variable does not contribute to reducing the unexplained variance in the dependent
variable. Therefore another statistical parameter can be used, called adjusted R2 value
(R2adj). It is obtained by a similar expression as the one for R2, however the RSS and TSS
are divided by their corresponding degrees of freedom:
$$ R^2_{adj} = 1 - \frac{\mathrm{RSS}/(n-p-1)}{\mathrm{TSS}/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1} \qquad (5.5) $$
The value of R2adj decreases if a variable added to the equation does not reduce the
unexplained variance.
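Equations 5.4 and 5.5 can be evaluated directly from observed and predicted values. In the sketch below the function names and the numerical values are hypothetical; the predicted values stand for the output of some regression model with p predictor variables.

```python
def r_squared(y, y_pred):
    """Coefficient of determination, R2 = 1 - RSS/TSS (equation 5.4)."""
    mean = sum(y) / len(y)
    tss = sum((yi - mean) ** 2 for yi in y)
    rss = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))
    return 1.0 - rss / tss

def adjusted_r_squared(y, y_pred, p):
    """Adjusted R2 for a model with p predictor variables (equation 5.5)."""
    n = len(y)
    return 1.0 - (1.0 - r_squared(y, y_pred)) * (n - 1) / (n - p - 1)

# Hypothetical observed and predicted values
y_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.2, 1.9, 3.1, 3.8, 5.0]

r2 = r_squared(y_obs, y_pred)
r2_adj_one = adjusted_r_squared(y_obs, y_pred, p=1)
r2_adj_two = adjusted_r_squared(y_obs, y_pred, p=2)
```

For the same predictions, the adjusted value is lower when the model is assumed to contain two predictors instead of one, which shows how the correction penalises variables that do not improve the fit.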
The reliability of prediction of the dependent variable can be assessed by the value of the
standard error of estimate s:
$$ s = (\mathrm{RMS})^{0.5} = \left[ \sum_i (Y_i - Y_i')^2 / (n-p-1) \right]^{0.5} \qquad (5.6) $$
where RMS is the residual mean square: RMS = RSS / (n-p-1).
The standard error of estimate (s) is a measure of the dispersion of the observed values of
the dependent variable about the regression line (surface). Larger values of s mean a worse
statistical fit of the model and less reliability of the prediction.
The statistical significance of a regression equation can be assessed by means of the
Fisher (F) statistic:
$$ F = \mathrm{EMS} / \mathrm{RMS} = \frac{R^2 (n - p - 1)}{p (1 - R^2)} \qquad (5.7) $$
where: EMS is the explained mean square: EMS = ESS / p.
A regression equation is considered to be statistically significant if the observed F value is
greater than a tabulated value for the chosen level of significance (typically, the 95%
level) and the corresponding degrees of freedom of F. The degrees of freedom of F are
equal to p and n-p-1. Significance of the equation at the 95% level means that, if there
were no real relationship between the dependent and the independent variables in the
population, a dependence at least as strong as the one found would arise from a chance
numerical relationship with a probability of at most 5%.
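The standard error of estimate and the F statistic can be computed from the sums of squares as in the following sketch. The function name and data are hypothetical; the predicted values are those of the least-squares line Y' = 0.6X + 2.2 fitted to X = 1, …, 5 and the observed values shown, so that ESS + RSS = TSS holds.

```python
import math

def fit_statistics(y, y_pred, p):
    """Standard error of estimate s (eq. 5.6) and F statistic (eq. 5.7)."""
    n = len(y)
    mean = sum(y) / n
    ess = sum((yp - mean) ** 2 for yp in y_pred)            # explained sum of squares
    rss = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))  # residual sum of squares
    rms = rss / (n - p - 1)                                 # residual mean square
    ems = ess / p                                           # explained mean square
    return {"s": math.sqrt(rms), "F": ems / rms, "df": (p, n - p - 1)}

# Hypothetical data: predictions from the least-squares line Y' = 0.6*X + 2.2
y_obs = [2.0, 4.0, 5.0, 4.0, 5.0]
y_pred = [2.8, 3.4, 4.0, 4.6, 5.2]

stats = fit_statistics(y_obs, y_pred, p=1)
```

Here F = 4.5 with 1 and 3 degrees of freedom, which is below the tabulated 5% critical value of about 10.13, so this small example would not be considered statistically significant at the 95% level.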
The statistical significance of an independent variable can be assessed by using the t
statistic, which is equal to the variable’s regression coefficient (b) divided by the standard
error of the coefficient (sb):
t = b/sb (5.8)
where b is the regression coefficient for the variable
sb is the standard error of the coefficient, equal to:
$$ s_b = s / \left[ \sum_i (X_i - X_{mean})^2 \right]^{0.5} = \left[ \mathrm{RMS} / \sum_i (X_i - X_{mean})^2 \right]^{0.5} \qquad (5.9) $$
where Xi are the observed values of the independent variable
Xmean is the mean value of the independent variable.
Higher t values of a predictor variable correspond to a greater statistical significance. The
statistical significance of a variable using its t value is determined again from tables for a
given level of significance (similar to the use of F value). The degrees of freedom of t are
equal to n-p-1 (corresponding to the degrees of freedom of RMS). If a variable is
statistically significant at the 95% level, then, if its true regression coefficient were zero,
a t value at least this large in magnitude would be observed with a probability of at most 5%.
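For a regression with a single predictor variable, equations 5.8 and 5.9 can be applied as in the sketch below. The function name and the data are hypothetical.

```python
import math

def slope_t_statistic(x, y):
    """Regression coefficient b, its standard error s_b (eq. 5.9) and t = b/s_b (eq. 5.8)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx                                  # least-squares slope
    intercept = y_mean - b * x_mean
    rss = sum((yi - (b * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    rms = rss / (n - 2)                            # n - p - 1 with p = 1
    s_b = math.sqrt(rms / sxx)
    return b, s_b, b / s_b

# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b, s_b, t = slope_t_statistic(x, y)
```

For this example b = 0.6 and t is about 2.12 with 3 degrees of freedom, below the tabulated two-sided 5% value of about 3.18. With a single predictor, t squared equals the F statistic of the regression.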
Adding independent variables into the regression model generally results in an increased
R2 value. However, adding a large number of predictor variables can result in an overfitted
model, i.e. in describing the noise in the data rather than true underlying relationships. To
avoid this, the ratio of observations to predictor variables in a model should be kept as
high as possible. According to Topliss and Costello (1972), this ratio should be at least
five.
5.1.2. Assumptions of linear regression analysis
In regression analysis it is assumed that the found dependence reflects a real causality in
the population. However, even when relationships between variables are found, the
underlying causal mechanism is not ascertained unequivocally. If relevant causal
variables are omitted from the model, the common variance which they share with
included variables may be a reason for wrongly attributed causality. Therefore, alternative
causal explanations may also be considered. Additionally, it should be certain that the
dependent variable is not a cause of one or more of the independent variables.
Another assumption (evident from the name of the analysis) is that the relationship
between the dependent and the independent variable(s) is linear. This can be examined by
plotting dependent against independent variable(s). If a non-linear relationship is
observed, a transformation of the variables can be performed to achieve a linear
relationship, or non-linear regression analysis should be applied.
For correct application of the tests for statistical significance the residuals (predicted
minus observed values of the dependent variable) should follow the normal distribution.
This assumption can be assessed by using a histogram of residuals, which shows their
distribution. However, the F-test is robust with regard to small violations of the
normality assumption, meaning that in such cases it can be used for estimating statistical
significance (Statistica version 5.5. software, help manual).
Another assumption is that of homoscedasticity of residuals. This means that the residuals
are dispersed randomly throughout the range of the dependent variable, i.e. the variance
of residuals is constant for all values of the independent variables (Berry, 1993). This
assumption can be examined by using plots of residuals against the predicted values of
the dependent variable. Outliers, representing cases with high residuals, are a form of
violation of homoscedasticity. When the homoscedasticity assumption is violated, the
values of tests for statistical significance (F, t) are questionable. If heteroscedasticity is
detected, the best strategy is to build a model based on the absolute values of the
residuals. If an adequate residual model is obtained, the absolute values of the residuals
for each observation are evaluated on the basis of that model, and a new regression
dependence is built using the method of weighted least squares, in which the
reciprocals of the evaluated absolute residuals are used as weighting factors
(Radojnowa et al., 2002). Otherwise it is considered that the statistically significant
heteroscedasticity cannot be explained by the independent variables and can therefore be
neglected from a practical point of view.
The next assumption is that of independence of each observation from each other
observation (absence of autocorrelation). The values of a variable should not depend on
the order in which the observations were made. This is often a problem with time series,
where variable values tend to depend on the value in a previous moment. Violation of this
assumption leads to biased estimates of standard errors of the regression coefficients and
the statistical significance. The autocorrelation is tested by the Durbin-Watson coefficient
(d). The value of d ranges from 0 to 4, with a value between 1.5 and 2.5 indicating
independence of the observations. For a graphical test of the observation independence, a
plot of residuals against the sequence of cases can be used. It must show no pattern,
indicating independence of errors.
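The Durbin-Watson coefficient is commonly computed as the sum of squared differences between successive residuals divided by the sum of squared residuals. This formula is not given in the text above and is stated here as the usual textbook definition; the function name and the residuals (taken in the order in which the observations were made) are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson coefficient d = sum (e_t - e_{t-1})^2 / sum e_t^2."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals, listed in the order of the observations
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
d = durbin_watson(residuals)

# Limiting behaviour: strictly alternating residuals give a value above 2
d_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
```

For the hypothetical residuals d is close to 2, which falls within the range of 1.5 to 2.5 quoted above. The alternating sequence gives d = 3, as expected for negatively autocorrelated residuals.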
Multicollinearity is a high level of intercorrelation among the independent variables,
which results in difficult determination of the individual effects of the variables.
Multicollinearity does not influence the calculated regression coefficients, but increases
their standard errors, thus making it difficult to assess the relative importance of the
independent variables by using beta coefficients. To assess the multicollinearity the
Pearson correlation coefficients between each pair of independent variables can be
examined. As a rule, independent variables which correlate with correlation coefficient
higher than 0.70-0.80 should not be included in the same regression model (Kubinyi,
1993). Another method of assessing multicollinearity is to use tolerance. Tolerance is
defined as 1 – R², where R² is the coefficient of determination of the regression when the
given independent variable is regressed on all other independent variables. If the tolerance
value is less than some cut-off value, usually 0.20, the independent variable should be
excluded from the analysis due to multicollinearity.
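A minimal sketch of the tolerance calculation is given below: each independent variable is regressed on all the others (with an intercept) and 1 – R² is reported. The function name and the synthetic data are assumptions made for the example only.

```python
import numpy as np

def tolerances(X):
    """Tolerance (1 - R^2) of each column of X regressed on all other columns."""
    X = np.asarray(X, dtype=float)
    n, q = X.shape
    tol = np.empty(q)
    for j in range(q):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = np.sum((y - A @ coef) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        tol[j] = ss_res / ss_tot          # equals 1 - R^2
    return tol

# x3 is almost a copy of x1, so both should show a low tolerance
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + 0.1 * rng.normal(size=100)
tol = tolerances(np.column_stack([x1, x2, x3]))
```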
5.1.3. Observations with large influence on the regression model, outliers
Some observations can have large influence (leverage) on the obtained regression results.
As a result, a biased regression model is obtained. A number of techniques have been developed
to identify such observations.
The leverage statistic (h, also called hat-value) is related to the distances between the
values of the independent variables for the given observation and the means of the values
of the independent variables calculated on all observations. The leverage statistic varies
from 0 (the observation has no influence on the model) to 1 (the observation completely
determines the model). As a rule, observations should have leverage lower than 0.2.
Observations with leverage greater than 0.5 should be excluded from the model. Another
estimate of observation influence on a model is the Cook's distance (D). It is related to the
difference between the computed regression coefficients and the regression coefficients of
a model in which the respective observation is excluded. The distances for all
observations should be of a similar magnitude, otherwise observations with large distance
will exert undue leverage.
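As an illustration, the leverage values can be taken from the diagonal of the hat matrix H = A(AᵀA)⁻¹Aᵀ, and the Cook's distances can be computed from the residuals and leverages using the standard closed-form expression (which avoids refitting the model for each excluded observation). The function name and the small data set are hypothetical.

```python
import numpy as np

def leverage_and_cooks(X, y):
    """Hat-values h_i and Cook's distances D_i for a linear model with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    n, k = A.shape                              # k = number of fitted coefficients
    H = A @ np.linalg.inv(A.T @ A) @ A.T        # hat matrix
    h = np.diag(H)
    e = y - H @ y                               # ordinary residuals
    s2 = e @ e / (n - k)                        # residual variance
    cooks = (e ** 2 / (k * s2)) * h / (1.0 - h) ** 2
    return h, cooks

# The last observation lies far from the others in the predictor space
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 12.0])
h, cooks = leverage_and_cooks(x, y)
```

The leverages sum to the number of fitted coefficients, so an observation with a leverage close to 1 leaves very little to be shared among the remaining observations.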
Outliers are observations that are not well described by the regression model (have large
residuals). The presence of outliers can seriously bias the regression results by greatly
influencing the values of the regression coefficients. Defining an observation as
an outlier is to some extent subjective and should take into account specific experimental
considerations. Generally, however, observations with high residuals (usually exceeding ±2 times
the standard error) are considered as outliers.
Studentised residuals (residuals divided by their standard errors) can also be used to
identify outliers. According to a method developed by Tenekedjiev and Radojnowa
(2001) each observation is excluded once from the data set and its studentised residual is
calculated from a regression model of the remaining observations. After testing the whole
sample, all the observations whose studentised residuals do not lie within the calculated
confidence interval corresponding to a chosen significance level are rejected. This
procedure (representing a single loop) is repeated until either a predetermined number of
loops over all the observations are executed, or some loop does not reject any outliers
(which is often the second one) (Tenekedjiev and Radojnowa, 2001).
Outliers can be present for different reasons: errors in the measured variable values,
extreme values for one or more of the variables, different causality for the value of the
dependent variable for the given observation. In some cases, transformations of the data
can be applied, or additional independent variables can be included into the model to
correctly describe the outliers. Also, separate models excluding outliers can be developed.
5.2. Principal Components Analysis
Principal Components Analysis (PCA) is a mathematical procedure that is used to
identify patterns in data, and to transform a number of (possibly) correlated variables into
uncorrelated variables. A smaller number of variables can be considered after the
transformation, thus the PCA can be used to reduce the dimensionality of the data without
significant loss of information (data variability).
PCA can be performed by using the variance/covariance matrix or the correlation matrix
of the variables. To perform PCA, the eigenvectors and eigenvalues of the matrix are
computed, thus, the following matrix equation is solved:
$C_x E_i = \lambda_i E_i$ (5.10)
where $C_x$ is the variance/covariance or correlation matrix
$E_i$ are the eigenvectors of the matrix
$\lambda_i$ are the corresponding eigenvalues.
The number of possible eigenvectors and eigenvalues is equal to the number of variables.
The number of components of each eigenvector is also equal to the number of variables.
The obtained eigenvectors are orthogonal. By ordering the eigenvectors in the order of
descending eigenvalues, an ordered orthogonal basis can be created. The first eigenvector
(with the highest eigenvalue) has the direction of largest variance of the data. In this way,
the directions in which the data set has the most significant variance are found.
The eigenvectors form the principal components of the data set. Some of the components
having the smallest eigenvalues can be ignored in the further analysis, as they describe a
very small part of the variance in the data. According to the Kaiser criterion (Kaiser,
1960), a component should be considered if its eigenvalue is greater than one, which
implies that this component explains a greater part of the variance than the variance
contribution of a single variable. Thus, the dimensionality of the transformed data set can
be reduced.
To obtain the values of the principal components corresponding to each observation in the
data set (i.e. the values of the transformed variables, representing the so called component
scores), a feature vector needs to be formed. It represents a matrix of vectors, in which the
eigenvectors considered for further analysis (having high eigenvalues) form the columns
of the matrix. The eigenvectors are ordered in the matrix by decreasing
eigenvalue (the eigenvector with the highest eigenvalue forms the first column, the
eigenvector with the second highest eigenvalue forms the second column, etc.). The
feature vector is of a dimensionality q x p (q rows and p columns), where q is the original
number of variables (the dimensionality of the variance/covariance matrix), and p is the
number of principal components considered for further analysis. Additionally, the mean
of each of the original variables is subtracted from the variable values, to obtain a data set
in which each variable has a mean of zero.
To obtain the values of the transformed variables (the scores of the principal
components), the following matrix equation is solved:
T = (X – µ)F (5.11)
where T is the data matrix containing the scores of the principal components (in which
the scores of each component form the columns of the matrix); it has a dimensionality of
n x p
(X – µ) is the matrix, containing the original variables, from which the mean of
each variable is subtracted (in which the observations are in the rows, and the mean-
subtracted variables form the columns); its dimensionality is n x q
F is the feature vector (with dimensionality q x p)
q is the number of variables in the original data set
n is the number of observations in the data set
p is the number of considered principal components.
Thus, the transformed variables (the scores of the principal components) are linear
combinations of the original variables. The components of the eigenvectors represent the
loadings (weights) of the original variables in the corresponding principal components.
The obtained scores of the principal components (which are uncorrelated) can be used as
independent variables in regression analysis. The approach is called PCA regression.
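The steps of this section (mean-centring, eigen-decomposition of the variance/covariance matrix, ordering of the eigenvectors, and computation of the scores by equation 5.11) can be sketched as follows. The function name `pca_scores` and the synthetic data are assumptions for illustration; the variance/covariance matrix is used here, whereas the Kaiser criterion is normally applied to eigenvalues of the correlation matrix.

```python
import numpy as np

def pca_scores(X, p=None):
    """Return scores T (n x p), feature vector F (q x p) and all eigenvalues."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # (X - mu)
    C = np.cov(Xc, rowvar=False)                  # variance/covariance matrix (q x q)
    eigvals, eigvecs = np.linalg.eigh(C)          # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]             # descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if p is None:
        p = X.shape[1]
    F = eigvecs[:, :p]                            # feature vector, q x p
    T = Xc @ F                                    # equation 5.11
    return T, F, eigvals

# Three variables, the first two strongly correlated
rng = np.random.default_rng(2)
base = rng.normal(size=(50, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(50, 1)),
               base + 0.1 * rng.normal(size=(50, 1)),
               rng.normal(size=(50, 1))])
T, F, eigvals = pca_scores(X, p=2)
score_cov = np.cov(T, rowvar=False)               # should be diagonal
```

The covariance matrix of the scores is diagonal, with the retained eigenvalues on the diagonal, which confirms that the transformed variables are uncorrelated.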
5.3. Partial Least Squares regression
Partial Least Squares (PLS) regression combines features from PCA and multiple linear
regression. In the PLS, as in the PCA, orthogonal principal components are extracted as a
linear combination of the original variables, but in contrast to the PCA regression, the
PLS regression components are extracted from product matrices involving both the
variance/covariance matrices of the independent and the dependent variables. Thus, the
components found are relevant for the dependent variable(s). The PLS regression can be
used to analyse simultaneously more than one dependent variable. It is useful when a set
of dependent variables is assessed by a large set of independent variables, also in cases
when there are fewer observations than predictor variables.
As in the multiple linear regression, in the PLS a linear model is obtained. It has the
following matrix form:
Y = TB + E (5.12)
where Y is the matrix of the dependent variables, having dimensionality of n x m
T is the matrix containing the scores of the extracted principal components
(having the cases in the rows and the principal component scores in the columns), with
dimensions n x p
B is the matrix of the PLS regression coefficients, of dimensionality p x m
E is a term, corresponding to the unexplained variance in Y, with the same
dimensions as Y
n is the number of observations
m is the number of dependent variables
p is the number of the considered principal components (with high eigenvalues).
In this equation the variables in Y are standardised by subtracting their means and
dividing by the standard deviations.
The principal components in the PLS regression are extracted by considering the
variance/covariance structure between the predictor and dependent variables. The
loadings of the original variables in the principal components are computed to maximise
the covariance between the dependent variables and the principal components scores.
Subsequently, an ordinary least squares procedure is applied with the extracted principal
component scores and the dependent variables to produce the PLS regression model.
Since T = XF, where F is the matrix of the loadings of the original variables in the
principal component scores (feature vector, see Section 5.2 “Principal Components
Analysis”), and X is the matrix of the original mean-subtracted independent variables
(with dimensionality n x q, where n is the number of observations and q is the number of
original independent variables), the PLS regression model can be presented in terms of
the original independent variables:
Y = XQ + E (5.13)
where Q = FB is the matrix of regression coefficients.
Additionally, a matrix P is defined (equal to the transpose of the matrix F, $F^T$, with
dimensions p x q), representing the loadings of the principal component scores in the
original mean-subtracted variables:
X = TP + K (5.14)
where K is the unexplained part of X due to the reduction of the variable dimensionality
(the excluding of some of the principal components with low eigenvalues).
Different algorithms for calculating the PLS principal components exist. They are
iterative algorithms, in which one principal component is extracted from the X data
matrix at each step. The principal component is extracted to have scores with maximum
covariance with a certain linear combination of the dependent variables. Afterwards, a
regression is performed on the Y and X variables by using the scores of the extracted
principal component. The predicted Y and X values from the regression with the first
principal component are subtracted from the original Y and X variables. The obtained
residuals are called deflated Y and X matrices. In the next step of the PLS algorithm the
original Y and X data are replaced with the deflated Y and X data, and a new principal
component is extracted. The procedure is performed until the desired number of principal
components is obtained, or X becomes a null-matrix. The standard algorithm for
computing PLS regression components is called nonlinear iterative partial least squares
(NIPALS). Another algorithm is the SIMPLS algorithm, which is faster than the NIPALS
algorithm, but gives slightly different results when more than one dependent variable is
included.
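The iterative extraction and deflation scheme can be illustrated for the simplest case of a single dependent variable (often referred to as PLS1), where the NIPALS iteration reduces to closed-form steps. The sketch below is written under several assumptions: the variables are only mean-centred (not standardised), the function name `pls1_nipals` is hypothetical, and the regression coefficients are recovered with the commonly used expression W(PᵀW)⁻¹c, where W holds the weight vectors and P the X loadings.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """PLS regression for one dependent variable with deflation of X and y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean               # mean-subtracted data
    n, q = X.shape
    W = np.zeros((q, n_components))               # weights
    P = np.zeros((q, n_components))               # X loadings
    T = np.zeros((n, n_components))               # scores
    c = np.zeros(n_components)                    # y loadings
    for a in range(n_components):
        w = Xk.T @ yk                             # direction of maximum covariance
        w /= np.linalg.norm(w)
        t = Xk @ w                                # scores of the new component
        tt = t @ t
        P[:, a] = Xk.T @ t / tt                   # regress X on the scores
        c[a] = yk @ t / tt                        # regress y on the scores
        Xk = Xk - np.outer(t, P[:, a])            # deflated X
        yk = yk - c[a] * t                        # deflated y
        W[:, a], T[:, a] = w, t
    coef = W @ np.linalg.solve(P.T @ W, c)        # coefficients for the original X
    intercept = y_mean - x_mean @ coef
    return coef, intercept, T

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
coef, intercept, T = pls1_nipals(X, y, n_components=3)

# Reference: ordinary least squares on the same data
A = np.column_stack([np.ones(40), X])
ols, *_ = np.linalg.lstsq(A, y, rcond=None)
```

When all possible components are extracted from a full-rank X, the PLS coefficients coincide with those of ordinary least squares; the benefit of PLS arises when fewer components than variables are retained.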
The same statistical parameters as used in multiple linear regression analysis can be
applied to assess the statistical significance of PLS regression models.
5.4. Linear discriminant function analysis
5.4.1. General description of the method
Discriminant function analysis is used to determine which variables discriminate between
two or more naturally occurring groups in a population, and to classify observations into
the different groups with accuracy better than chance.
In discriminant analysis it is determined whether groups differ with respect to the mean
value of a variable (Statistica version 5.5 software, help manual). If the means for a
variable are significantly different in different groups, then this variable discriminates
between the groups. The significance test of whether a variable discriminates between
groups is the Fisher F test. F is computed as the ratio of the between-group variance in
the data to the pooled (average) within-group variance. If the between-group variance is
significantly larger, there are significant differences between the group means. In the case
of multivariate discrimination, the matrix of total variances and covariances, and the
matrix of pooled within-group variances and covariances, are calculated. The two
matrices are compared by using multivariate F tests (Wilks’ lambda) to determine
whether there are significant differences in the means of all variables between groups.
In the Fisher linear discriminant analysis so called linear discriminant functions are
calculated. In the two-group case, a linear equation is fitted to obtain a discriminant
function:
$\text{Group} = b_1 X_1 + b_2 X_2 + \dots + b_p X_p + b$ (5.15)
where $b_1, b_2, \dots, b_p$ are coefficients of the independent variables $X_i$
$b$ is an intercept.
This discriminant function is analogous to a multiple regression equation, but values of
discriminant coefficients are sought that will maximise the distance between the means of
the calculated values of the function in the two groups.
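For the two-group case, one common way of obtaining such coefficients is Fisher's solution, in which the coefficient vector is the inverse of the pooled within-group variance/covariance matrix multiplied by the difference of the group means. The sketch below assumes this formulation and places the cut-off midway between the two group means (i.e. equal a priori probabilities); the function name and data are hypothetical.

```python
import numpy as np

def fisher_lda(X1, X2):
    """Two-group Fisher discriminant: score = b . x + b0, positive for group 1."""
    X1, X2 = np.asarray(X1, dtype=float), np.asarray(X2, dtype=float)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled within-group variance/covariance matrix
    S = ((n1 - 1) * np.atleast_2d(np.cov(X1, rowvar=False))
         + (n2 - 1) * np.atleast_2d(np.cov(X2, rowvar=False))) / (n1 + n2 - 2)
    b = np.linalg.solve(S, m1 - m2)               # discriminant coefficients
    b0 = -b @ (m1 + m2) / 2.0                     # intercept: cut-off at the midpoint
    return b, b0

group1 = np.array([[1, 2], [2, 3], [3, 3], [2, 1]], dtype=float)
group2 = np.array([[6, 5], [7, 8], [8, 6], [7, 7]], dtype=float)
b, b0 = fisher_lda(group1, group2)
scores1 = group1 @ b + b0
scores2 = group2 @ b + b0
```

With this sign convention, observations with a positive discriminant score are assigned to the first group and those with a negative score to the second.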
In the case of more than two groups, more discriminant functions are calculated. These
functions are called also canonical functions and the analysis is termed canonical
discriminant analysis. The number of functions is the lesser of (g-1), where g is the
number of groups, and p, the number of independent variables. For example, when there
are three groups, there will be a function for discriminating between group 1 and groups 2
and 3 combined, and another function for discriminating between group 2 and group 3.
Each discriminant function is orthogonal to the others (uncorrelated).
Analogously to the regression analysis, standardised discriminant functions can be
calculated by standardising the variable values to have means of zero and standard
deviations of one. The standardised discriminant coefficients are used to compare the
relative importance of the independent variables.
For each discriminant function an eigenvalue is calculated. It reflects the percent of
between-group variance explained by the corresponding function. Thus, the eigenvalues
assess the relative importance of the discriminant functions for discriminating between
groups. If there is more than one discriminant function, the first function has the highest
eigenvalue (it is the most important), and the eigenvalues of the following functions
decrease.
Discriminant scores for each observation are calculated from the equation(s) for the
discriminant function(s) by using the values of the independent variables for the
respective observations. The Z scores are the discriminant scores calculated from the
standardised discriminant function(s).
The Pearson correlation coefficients between the independent variables and the
discriminant scores are called structure coefficients or discriminant loadings. They can be
used to assess how closely a variable is related to a discriminant function. A table
representing structure coefficients of each variable with each discriminant function is
called a canonical structure matrix.
To test the statistical significance of the discrimination the value of Wilks’ lambda (λ) is
used. It tests whether there are differences between the group means of a combination of
variables. Wilks’ λ takes values from 0 to 1, with 0 meaning group means differ (the
variables differentiate between the groups), and 1 meaning all group means are the same.
Wilks’ λ is sometimes called the U statistic. The Wilks’ λ for the overall discrimination is
computed as the ratio of the determinant (det) of the within-groups variance/covariance
matrix over the determinant of the total variance/covariance matrix:
Wilks’ λ = det(W)/det(T) (5.16)
The Wilks’ λ statistic can be mathematically transformed to a statistic, which has
approximately a Fisher (F) distribution. Thus, an F value for the corresponding Wilks’ λ
and degrees of freedom can be calculated, and the corresponding level of statistical
significance can be assessed. Details of the mathematical procedure are given in Rao
(1951).
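Equation 5.16 can be illustrated with a short calculation. The sketch below uses the within-group and total sums of squares and cross-products matrices rather than the variance/covariance matrices; the two choices are understood to give the same value of λ when the conventional degrees of freedom are used. The function name and data are hypothetical.

```python
import numpy as np

def wilks_lambda(X, groups):
    """Wilks' lambda = det(W) / det(T) for observations X and group labels."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    Xc = X - X.mean(axis=0)
    T = Xc.T @ Xc                                  # total cross-products
    W = np.zeros_like(T)
    for g in np.unique(groups):
        Xg = X[groups == g]
        Dg = Xg - Xg.mean(axis=0)
        W += Dg.T @ Dg                             # within-group cross-products
    return np.linalg.det(W) / np.linalg.det(T)

low = np.array([[1, 2], [2, 3], [3, 3], [2, 1]], dtype=float)
high = np.array([[6, 5], [7, 8], [8, 6], [7, 7]], dtype=float)
labels = [0] * 4 + [1] * 4
lam_separated = wilks_lambda(np.vstack([low, high]), labels)
lam_identical = wilks_lambda(np.vstack([low, low]), labels)
```

Well-separated groups give a value close to 0, whereas two groups with identical means give a value of 1.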
Additionally, the significance of each variable added to the model can be assessed by
calculating a partial Wilks’ λ, which is equal to:
partial λ = λ(after adding the variable)/λ(before adding the variable) (5.17)
The corresponding F value of the partial λ is calculated as:
F = [(n-q-p)/(q-1)]*[(1-partial λ)/partial λ] (5.18)
where: n is the number of observations
q is the number of groups
p is the number of variables.
Again, a level of statistical significance of the variable corresponding to the F value can
be determined.
As described above, one of the purposes of the discriminant analysis is to classify
observations into the defined groups. Several methods can be used for this purpose.
Some approaches use the non-standardised discriminant coefficients and the derived
discriminant scores. If the discriminant score for an observation, derived from a given
discriminant function, is less than or equal to a cut-off value (specified for that function),
the observation is classified to one group, otherwise it is classified to the other group,
which that function discriminates.
In another approach, classification functions are calculated for each group, which include
the variables determined as discriminating between the groups. Each function allows
computation of classification scores for each case for each group, by applying the
formula:
$S_i = w_{i1} X_1 + w_{i2} X_2 + \dots + w_{ip} X_p + c_i$ (5.19)
where the subscript i denotes the respective group
the subscripts 1, 2, ..., p denote the p variables
$c_i$ is a constant for the i'th group
$w_{ij}$ is the weight for the j'th variable in the computation of the classification score
for the i'th group
$X_j$ is the observed value for the respective case for the j'th variable
$S_i$ is the resultant classification score for the respective case.
An observation is classified as belonging to the group for which it has the highest
classification score.
The so-called Mahalanobis distances are also used to classify observations. A
Mahalanobis distance is a measure of the distance between two points in a
multidimensional space defined by two or more variables. If these variables are
uncorrelated, the Mahalanobis distances will be identical to the Euclidean distances
between points (the axes of the space can be regarded as being orthogonal). However,
calculating Mahalanobis distances takes into account that the variables defining the space
can be correlated (the space axes are considered as non-orthogonal). In those cases the
obtained Mahalanobis distances account for the correlations.
To classify an observation, the Mahalanobis distances between that observation and the
centroids of each group are calculated, and the observation is classified as belonging to
the group to which it has smallest distance. A group centroid represents the point in the
multivariable space with coordinates equal to the means of all variables calculated for the
observations in the group.
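A minimal sketch of classification by Mahalanobis distance to the group centroids is given below. It assumes that a single pooled within-group variance/covariance matrix is used for all groups, which is consistent with the homogeneity assumption of linear discriminant analysis; the function names and data are hypothetical.

```python
import numpy as np

def mahalanobis(x, centroid, cov):
    """Mahalanobis distance between x and centroid for a given covariance matrix."""
    d = np.asarray(x, dtype=float) - np.asarray(centroid, dtype=float)
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

def classify_by_centroid(x, groups):
    """groups: dict mapping a group label to an (n_g x q) array of observations."""
    n_total = sum(len(g) for g in groups.values())
    # Pooled within-group variance/covariance matrix (assumption of the sketch)
    pooled = sum((len(g) - 1) * np.cov(g, rowvar=False) for g in groups.values())
    pooled = pooled / (n_total - len(groups))
    dists = {label: mahalanobis(x, np.mean(g, axis=0), pooled)
             for label, g in groups.items()}
    return min(dists, key=dists.get), dists

groups = {
    "low": np.array([[1, 2], [2, 3], [3, 3], [2, 1]], dtype=float),
    "high": np.array([[6, 5], [7, 8], [8, 6], [7, 7]], dtype=float),
}
label_a, dists_a = classify_by_centroid([2.5, 2.0], groups)
label_b, dists_b = classify_by_centroid([7.0, 6.0], groups)
euclid_check = mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2))
```

With an identity covariance matrix (uncorrelated variables of unit variance) the function returns the ordinary Euclidean distance, in line with the description above.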
Additionally, a probability for a given observation to belong to a certain group can also be
calculated (the so-called posterior probability). The posterior classification probability for
an observation is related to the Mahalanobis distance of that observation to the group
centroid. A multivariate normal distribution of the observations around each group
centroid is assumed, and the posterior probabilities are assessed depending on how many
times the standard deviation is the Mahalanobis distance to the centroid of the observation
(e.g. observations with a Mahalanobis distance more than 1.96 times the standard
deviation from the centroid have less than 0.05 chance of belonging to the given group).
To classify observations, the so-called a priori classification probabilities are also taken
into account. They are based on an a priori knowledge for the chances of observations to
belong to a certain group, without considering the particular values of the independent
variables for the observations. For example, the a priori probabilities can be proportional
to the size of the groups, since the probability that an observation belongs to a group with
a bigger number of observations, is higher. However, the a priori probabilities should be
proportional to the group sizes only when it is known that there is a causality for such
differences in the group sizes in the population, and it is not a simple result of the
sampling procedure.
A classification table (matrix) is used to assess the percentage of correct classification. On
the rows of the table are placed the numbers of the cases observed in each group and on
the columns are the numbers of the cases predicted in each group. The numbers of the
cases predicted in the same group as observed are placed on the diagonal of the table. The
percentage of these cases represents the percentage of correct classifications.
The percentage of correct classifications must be compared to the percent of cases that
would have been correctly classified by chance alone. When there are two groups with
equal sizes, the expected correct classification by chance alone is 50%. For n groups with
equal sizes, the expected percent is 100/n. For n groups of different sizes ($g_1, g_2, \dots, g_n$),
the expected percent is equal to:
expected percent correct classification by chance alone = $100 \cdot [(g_1/N) g_1 + (g_2/N) g_2 + \dots + (g_n/N) g_n] / N$ (5.20)
where N is the number of all observations.
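Equation 5.20 is equivalent to 100 times the sum of the squared group proportions, which makes it easy to evaluate. The function name in the following sketch is hypothetical.

```python
def chance_correct_percent(group_sizes):
    """Expected percentage of correct classifications by chance alone (eq. 5.20)."""
    n_total = sum(group_sizes)
    return 100.0 * sum((g / n_total) * g for g in group_sizes) / n_total

equal_two = chance_correct_percent([50, 50])          # two equal groups
equal_four = chance_correct_percent([25, 25, 25, 25]) # four equal groups
unequal = chance_correct_percent([80, 20])            # unbalanced groups
```

For two groups of 80 and 20 observations the expected value is 68%, so a model with, for example, 70% correct classifications would be only marginally better than chance.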
5.4.2. Assumptions of linear discriminant function analysis
The discriminant function analysis makes the following assumptions:
a) Assumes linearity.
b) Independence of each observation from each other observation.
c) The sizes of the different groups must not be greatly different.
d) Each independent variable has a standard deviation different from zero in each
group (calculation requirement).
e) Errors (residuals) are randomly distributed.
f) Homogeneity of variances (homoscedasticity):
Similar values of the variances and means must be observed in the groups for the same
independent variable. Lack of homogeneity of variances will question the results of the
tests for significance. Lack of homogeneity of variances may indicate presence of outliers
in one or more groups. Lack of homogeneity of variances and presence of outliers can be
evaluated through scatterplots of variables. Also, the Levene test can be used to examine
homogeneity of variance in the case of one variable. The null hypothesis of the Levene
test is that the variances do not differ between groups. Thus, the test must be
statistically insignificant at the 95% confidence level for the variances to be considered
homogeneous.
g) Homogeneity of covariances/correlations:
Within each group the covariance/correlation between any two predictor variables must
be similar to the corresponding covariance/correlation in the other groups, i.e. the groups
must have similar covariance/correlation matrices. The Box M test is used to assess the
homogeneity of covariance matrices. Its null hypothesis is that the covariance matrices do
not differ between groups. Thus, the test must be statistically insignificant at the 95%
confidence level for the matrices to be considered homogeneous. However,
the M test is sensitive to violations from normal distribution of the variables. Another test,
which is robust to violations of the normal distribution, is the Sen-Puri nonparametric test.
It is also necessary that the test is insignificant at the 95 % level, in order not to reject its
null hypothesis that the matrices are homogeneous.
h) Low multicollinearity of the independent variables is assumed. Again, high
multicollinearity does not change the variable coefficients in the discriminant
functions, but hampers the correct estimation of the importance of the predictor
variables by using the standardised discriminant coefficients.
i) It is assumed that predictor variables follow multivariate normal distributions, i.e.
each predictor variable has a normal distribution about fixed values of all other
independent variables:
Generally, discriminant analysis will be robust against violation of this assumption if the
smallest group has more than 20 cases and the number of independent variables is fewer
than six.
Violations of the assumptions of the discriminant analysis will influence the reliability of
the significance tests in the discriminant analysis. However, the final goal of the
discriminant analysis is to classify correctly observations into groups. Thus, violations of
the assumptions of the tests for statistical significance might influence the derivation of
discriminant and classification functions. However, the performance of the discriminant
analysis should be assessed mainly on the accuracy of the classification (the percentage of
correct classifications).
5.5. Classification trees
Classification trees are used to predict membership of objects (observations) into a priori
defined groups (classes) on the basis of independent (predictor) variables. Thus, the
objective of this technique is similar to the objective of discriminant analysis. However,
algorithms used for classification differ.
Classification trees consist of sets of decision rules (splits), nested within each other,
having the following form: “if a value of a variable is smaller/bigger than a cut-off value,
then assign the object to a group or consider another cut-off or consider another variable –
if it is bigger/smaller than a cut-off value, then …; else, assign the object to another group
or consider another cut-off or consider another variable …”. Thus, the classification tree
consists of nodes, at which the decision rules are applied to create different branches
containing objects and possibly new nodes of the tree. The tree starts with an initial node
(called the root node), which contains all the objects. After each node split, each object is
assigned to a certain branch of the tree. The classification trees have a hierarchical
structure, in which a hierarchy of decision rules is applied in a recursive way (i.e. a given
predictor variable can be used in more than one decision rule). The hierarchical structure
distinguishes the classification tree approach from the discriminant analysis, where the
decision to assign an object to a group is taken on the basis of a single decision rule (in
the two-group case), taking into account the values of all variables in the model
simultaneously.
Classification trees can be computed using ordered or categoric independent variables, or
combinations of them. Ordered variables allow for ordering of their values. In categoric
variables categories are used to define variable values, which do not allow for ordering of
the values (for example a variable defining colours of objects is a categoric variable).
Univariate splits (considering one variable at a node), as well as linear-combination splits
can be applied.
The classification accuracy of classification trees can be assessed in terms of
misclassification costs. Predictions with lower misclassification cost are considered more
accurate. The misclassification costs reflect the number of misclassified objects, but also
take into account other factors. For example, in some cases, objects from certain
categories need to be correctly assigned with higher accuracy than objects from other
categories (for example, in the case when it is more dangerous to classify actually toxic
compounds as non-toxic, than actually non-toxic compounds as toxic). In these cases, the
misclassification cost of incorrect classification of an object actually belonging to the
more important categories is higher than the misclassification cost of incorrect
classification of an object actually belonging to the less important categories.
The correct classification and the classification costs also depend on the a priori
classification probabilities (see Section 5.4 “Linear discriminant function analysis”).
Additionally, the ratio of the a priori probabilities can be used to assess the importance of
misclassifications for each class. When the a priori probabilities are taken to be proportional to
the class sizes, minimising misclassification costs corresponds to minimising the overall
proportion of misclassified objects, because the prediction should be better in larger
classes to produce a lower misclassification rate (with all other factors determining the
misclassification costs equal for the different classes).
The splits in the classification trees can be chosen by different approaches. For example,
in the discriminant-based univariate (DBU) splitting, for each node formed so far, p-
levels are computed for significance tests of the relationship between class membership of
the node and the values (levels) of each predictor variable. For categoric predictors, the p-
levels are computed by using Chi-square tests of independence of the frequency
distribution of the classes on the levels of the categoric predictor. For ordered predictors,
the p-levels are computed by using analysis of variance (ANOVA). In some cases tests
that are robust to violations of the normal distribution, such as the Levene test for equality
of variances, are also used. The predictor variable for which the smallest p-level is
calculated is chosen to split the corresponding node. To determine the cut-off value of the
split the two-means clustering algorithm of Hartigan and Wong (1978) is used for ordered
predictors. The algorithm creates two “superclasses” and chooses the cut-off value to be
closest to the mean of one of the “superclasses”. To find the split for categoric predictors,
these are coded as dummy variables representing the levels of the categoric predictor. The
dummy variables are afterwards transformed into ordered predictors. The procedures for
ordered predictors are then applied. The discriminant-based linear combination splitting is
based on similar principles.
Another splitting algorithm is classification and regression trees (CART) search for
univariate splits. The method examines all possible splits for each predictor variable at
each node in order to find the split producing the largest improvement in goodness of fit.
The improvement of the goodness of fit due to the split is assessed by the degree of
reduction of the heterogeneity (presence of objects of different classes in the same node)
of the daughter nodes compared to the parent node. Deviance can be used as measure of
homogeneity (lack of heterogeneity) of a node. It is calculated on the basis of the relative
frequencies of the classes at a node, and the number of node objects. Another criterion of
homogeneity is the Gini measure of node impurity. It is calculated again using the relative
frequencies of the classes at a node from the equation:
$\text{Gini}_i = 1 - \sum_k p_{ik}^2$ (5.21)
where $\text{Gini}_i$ is the value of the Gini index of node i
$p_{ik}$ is the relative frequency of the k-th class at the i-th node
the summation is over all classes present at the node i.
The Gini index is equal to zero if all objects at a node belong to the same class, and
increases as more classes are present at the node and as they become more evenly represented.
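As an illustration of Equation 5.21, the following minimal Python sketch computes the Gini index of a node from the class labels of its objects. The function names are hypothetical, and weighting the daughter nodes by the fraction of objects they receive is the usual CART convention, assumed here rather than taken from the text.

```python
from collections import Counter

def gini_index(labels):
    """Gini index of a node: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        raise ValueError("a node must contain at least one object")
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_improvement(parent, left, right):
    """Reduction of heterogeneity when `parent` is split into two daughters.

    Assumption of this sketch: each daughter's Gini index is weighted by the
    fraction of the parent's objects that it receives.
    """
    n = len(parent)
    weighted = (len(left) / n) * gini_index(left) \
        + (len(right) / n) * gini_index(right)
    return gini_index(parent) - weighted

# A pure node has index 0; a 50:50 two-class node has index 0.5.
pure = gini_index(["high", "high", "high"])
mixed = gini_index(["high", "low"])
```

A split that separates the two classes of a mixed node completely removes all of the parent's heterogeneity, which is the largest improvement that split can achieve.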
If no limit is placed on the number of splits to be performed, the splitting will be done
until the terminal nodes of the classification tree contain only one class of objects (pure
classification). However, such a tree may contain splits due to chance numerical
relationships between variables, and may perform badly when new objects are to be
classified. Therefore, some rules for stopping the splitting are usually applied. For
example, splitting is stopped when all terminal nodes are pure or have fewer objects than
a specified number.
Even when a stopping rule is applied, the resulting tree might have a complicated structure. Thus,
a tree possessing good classification accuracy, and on the other hand having a fairly
simple structure, should be sought. For this purpose, procedures for pruning trees are
applied, in which branches of less importance for the classification accuracy are removed
from the tree. In the minimal cost-complexity pruning the costs are computed at each step
of the growing (splitting) of the tree up to the maximum tree size. The costs generally
decrease with increasing the number of splits in the tree (reflecting better classification).
However, the costs’ decrease is greater at the initial splits than at the splits close to the
maximum possible size of the tree. Thus, pruning the terminal nodes of a tree with
maximum size will result in a smaller increase of the costs, while pruning nodes that are
close to the root node will result in a greater increase of the cost. Therefore, an optimal
size of the tree can be reached when the terminal nodes are pruned up to a turn-off value
of the extent of increase of the costs due to the pruning.
Additionally, cross-validation can be performed at each step of the pruning, and the
cross-validated costs can be used to identify the tree with the optimal size. Generally, the
cross-validated costs decrease with an increasing number of splits down to a given minimum.
Subsequent tree splitting increases the cross-validated costs (unlike the non-cross-validated
costs, which always decrease with increasing complexity of the tree). Thus, the least
complex tree for which the cross-validated cost is close to its minimum value can be
chosen. Breiman et al. (1984) suggested that, to obtain a tree of optimal size, the least
complex tree whose cross-validated cost does not exceed the minimum cross-validated cost
plus one standard error of that minimum should be selected.
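The selection rule of Breiman et al. (1984) described above can be sketched in a few lines of Python. The function name, the tuple layout and the numerical values are hypothetical and serve only to illustrate the rule.

```python
def select_tree_one_se(candidates):
    """Pick the least complex tree within one standard error of the minimum.

    `candidates` is a list of (n_terminal_nodes, cv_cost, cv_cost_se) tuples,
    one per pruned tree (hypothetical layout for this sketch).
    """
    best = min(candidates, key=lambda t: t[1])       # minimum CV cost
    threshold = best[1] + best[2]                    # minimum + 1 SE
    eligible = [t for t in candidates if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])         # least complex of these

# Illustrative (invented) pruning sequence.
sequence = [(2, 0.40, 0.05), (4, 0.27, 0.04), (7, 0.25, 0.03), (12, 0.30, 0.03)]
chosen = select_tree_one_se(sequence)
```

In this invented sequence the minimum cross-validated cost belongs to the tree with seven terminal nodes, but the tree with four terminal nodes lies within one standard error of that minimum and is therefore preferred.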
A graphical representation of the hierarchical structure of classification trees can be used
to simplify the interpretation of the results.
5.6. Cluster Analysis
The term cluster analysis (CA) covers a variety of techniques used to identify
common properties of objects and assign them to groups (clusters). The investigated
objects can represent a set of observations (in this case variables can be used to identify
the similarities between the observations), or a set of variables. Unlike other statistical
procedures like discriminant analysis and classification trees, cluster analysis is used
when there is no a priori knowledge of a possible grouping of the objects.
Most CA techniques are hierarchical, i.e., clusters with increasing number of objects are
nested within each other. The formed structure of clusters is called a hierarchical tree.
Some CA methods are non-hierarchical, for example k-means clustering. Additionally,
clustering techniques can be divisive or agglomerative. In the divisive techniques initially
all objects are placed in a single cluster, which is then divided gradually into smaller
clusters. In the agglomerative techniques initially each object forms a separate cluster,
and the clusters are united gradually until one cluster is formed. To perform this, the
distances (similarities) between the objects are assessed, and the two closest objects (the
two most similar clusters) are united in a single cluster. The distances between the new
clusters are calculated again, and the two most similar clusters are again united. The
procedure is repeated until all objects are included in the same cluster.
The distances between the objects can be assessed in several ways. For example, when a
set of observations is investigated, Euclidean distances can be used (calculated on the
basis of variable values). To avoid the influence of possible differences between the
variable scales on the results, the Euclidean distances can be calculated on standardised
data. Another measure of the distances is the city-block (Manhattan) distance, which is
calculated as the sum of the absolute differences between the variable values of two objects
(observations). In most cases, this distance measure yields results similar to the Euclidean
distance. The Chebychev distance is calculated as the maximum absolute difference between
two objects over all variables. Pearson correlation coefficients can be used as
a distance measure when variables are to be clustered.
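The distance measures mentioned above can be written compactly in Python. This is a minimal sketch with hypothetical function names. The city-block distance is taken here in its usual form, as the sum of absolute differences, and the standardisation uses the sample standard deviation, which is an assumption of the sketch.

```python
import math
import statistics

def euclidean(x, y):
    """Square root of the summed squared differences over all variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    """Sum of the absolute differences over all variables."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    """Largest absolute difference found for any single variable."""
    return max(abs(a - b) for a, b in zip(x, y))

def standardise(rows):
    """Column-wise z-scores, so that variable scales do not dominate.

    Assumes every variable has a non-zero standard deviation.
    """
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

a, b = (0.0, 0.0), (3.0, 4.0)
distances = (euclidean(a, b), city_block(a, b), chebychev(a, b))
```

For the two points in the example the three measures give different values for the same pair of objects, which is why the choice of distance measure should be reported together with the clustering results.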
Additionally, linkage rules need to be specified in the agglomerative CA. When a cluster
contains a single object, its distance to another cluster, containing also a single object, is
defined by the distance between the two objects. However, if the clusters contain several
objects, to identify the distance between them a certain linkage rule is applied. In the
single-linkage rule, the distance between two clusters is determined as the closest distance
between any pair of objects of the different clusters. In the complete-linkage rule, the
distance between two clusters is determined as the greatest distance between any pair of
objects of the different clusters. Another linkage method is unweighted pair-group
average rule, in which the distance between two clusters is calculated as the average
distance between all pairs of objects of the two different clusters. It uses a more
centralised measure of the distance than the single or complete linkage rules. Weighted
pair-group average rule uses again the average distance between all pairs of objects,
taking into account the sizes of the clusters (i.e., the numbers of objects in them) as
weights. Pair-group centroid rule (unweighted, or weighted for the size of the clusters)
defines the distance between the clusters as being equal to the distance between the
cluster centroids. A centroid of a cluster is calculated as an average point in the
multidimensional space defined by the variables (when observations are subject to
clustering). Another linkage rule is Ward's method, in which the sum of squares of any
two hypothetical clusters that can be formed at each step is minimised.
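The agglomerative procedure and three of the linkage rules described above (single, complete and unweighted pair-group average) can be sketched as follows. The function names are hypothetical, Euclidean distances are assumed, and the implementation favours clarity over speed. It is not the algorithm of any particular statistical package.

```python
import math
from itertools import combinations, product

def cluster_distance(a, b, rule="single"):
    """Distance between clusters `a` and `b` under the given linkage rule."""
    d = [math.dist(p, q) for p, q in product(a, b)]  # all between-cluster pairs
    if rule == "single":
        return min(d)            # closest pair of objects
    if rule == "complete":
        return max(d)            # most distant pair of objects
    if rule == "average":
        return sum(d) / len(d)   # unweighted pair-group average
    raise ValueError(f"unknown linkage rule: {rule}")

def agglomerate(points, rule="single"):
    """Merge the two closest clusters repeatedly until one cluster remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[p] for p in points]   # each object starts as its own cluster
    merges = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], rule),
        )
        d = cluster_distance(clusters[i], clusters[j], rule)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

objects = [(0.0,), (1.0,), (5.0,)]
history = agglomerate(objects, rule="average")
```

The recorded merge distances are the values that would be plotted on the distance axis of the hierarchical tree.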
The results of the hierarchical CA are usually shown in the form of a dendrogram plot, in
which the structure of the clusters in the hierarchical tree is presented as distinct branches.
The horizontal axis of the plot represents the cluster distance.
Other CA techniques are described in Tenekedjiev (1994). For example, in the two-way
joining method observations and variables are clustered simultaneously. The k-means
clustering aims to produce exactly k clusters with the lowest possible similarity between
them. The analysis is performed by minimising variability within clusters and maximising
variability between clusters.
5.7. Selecting variables for statistical analyses
If a large set of independent variables is available, several techniques can be used for
selecting the variables that give models with the best statistical fit. An example is the stepwise
procedure (applied mainly in regression and discriminant analysis), which can proceed by
forward stepping or backward stepping. In forward stepping the initial model contains
only one predictor variable found to give the best statistical fit. In the following step a
variable, which improves the statistical parameters of the model more than the other
variables, is added. This is repeated until a fixed number of variables have been included,
or until a desired statistical fit has been achieved. In backward stepping, the initial model
contains all predictor variables, and at each step the variable whose removal results in the
smallest reduction of the model fit is excluded. Again, the procedure is continued until a fixed
number of variables remains in the model, or until further removal would reduce the fit below a desired level.
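Forward stepping can be sketched in Python as a greedy search. In this minimal sketch the measure of statistical fit is passed in as a function `score`, which is a hypothetical placeholder for whatever statistic is used in practice (for example R² of a regression model). The toy scoring function and its values are invented for illustration only.

```python
def forward_selection(candidates, score, max_vars=None, min_gain=0.0):
    """Greedy forward stepping.

    `score(subset)` returns the fit of a model built on `subset` (higher is
    better). Stepping stops when `max_vars` variables have been included or
    when the best remaining variable improves the fit by `min_gain` or less.
    """
    selected = []
    best_score = score(selected)
    remaining = list(candidates)
    while remaining and (max_vars is None or len(selected) < max_vars):
        # Variable that improves the fit more than any other variable.
        trial = max(remaining, key=lambda v: score(selected + [v]))
        trial_score = score(selected + [trial])
        if trial_score - best_score <= min_gain:
            break
        selected.append(trial)
        remaining.remove(trial)
        best_score = trial_score
    return selected, best_score

# Invented, additive contributions to the fit (illustration only).
toy_gains = {"logP": 0.60, "HBD": 0.25, "MW": 0.02}

def toy_score(subset):
    return sum(toy_gains[v] for v in subset)

chosen, fit = forward_selection(toy_gains, toy_score, min_gain=0.05)
```

Backward stepping follows the same pattern in reverse: it starts from all variables and, at each step, removes the one whose exclusion reduces the score least.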
Another approach, called “best subsets approach” can be applied, in which statistical
models are developed by using all possible combinations of variables and the
combinations that give the best statistical fits are selected. However, for large sets of
independent variables it requires great computational resources and time.
5.8. Cross-validation statistical procedures
As described in Chapter 2, a cross-validation statistical procedure can be used to evaluate
the predictive power of QSAR models. It includes the following steps:
- one (or a set) of the chemical compounds is excluded from the training set and the
model under evaluation is developed again using the remaining compounds;
- the investigated property (biological activity, class belonging) of the excluded
compound(s) is predicted using the model based on the remaining compounds;
- this procedure is repeated by excluding different compounds, until all compounds
are excluded once and have one prediction of the investigated property;
- the accuracy of the predicted values from the previous steps is estimated by
calculating certain parameters, typical for the statistical technique used to develop
the model.
For example, if the predictivity of a regression model is evaluated, the cross-validated
coefficient of determination (Q²) and cross-validated standard error of estimate
(sometimes called standard error of prediction, SEP) can be calculated in a similar way as
in the standard regression procedure, but using the predicted values from the
cross-validation procedure. For classification trees, cross-validated misclassification costs can
be calculated.
There are several cross-validation algorithms, which differ by the number of compounds
excluded from the training set at each step. The most commonly used is the leave-one-out
cross-validation procedure, in which one compound is excluded at each step. In the leave-
N-out cross-validation procedure, a set of N compounds is excluded at each step. In the
leave-group-out cross-validation, the compound set is divided into a given number of
groups, and one group is excluded at each step.
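The leave-one-out procedure can be illustrated with a one-descriptor linear regression. In this minimal Python sketch Q² is computed as 1 minus the ratio of the predictive residual sum of squares (PRESS) to the total sum of squares about the mean. This is a common definition, assumed here, and the function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one descriptor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def loo_q2(xs, ys):
    """Leave-one-out cross-validated Q2 for the one-descriptor model."""
    predictions = []
    for i in range(len(xs)):
        # Exclude compound i and refit the model on the remaining compounds.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        # Predict the excluded compound with the refitted model.
        predictions.append(slope * xs[i] + intercept)
    press = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total

# Invented descriptor and activity values (illustration only).
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [1.1, 1.9, 3.2, 3.9, 5.1]
q2 = loo_q2(descriptor, activity)
```

Leave-N-out and leave-group-out cross-validation differ only in how many compounds are removed before each refit.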
CHAPTER 6
APPROACHES FOR 3D-(Q)SAR ANALYSIS
In 3D-(Q)SAR approaches 3D-interactions between chemical compounds and their
biological receptors (enzymes or other macromolecules) are simulated. Therefore, the
choice of the molecular conformations is very important for the quality of the results. As
described in Chapter 4, the conformations chosen should reflect the active conformations
of the compounds. However, information about these conformations is often missing, and
the energetically minimal conformations are therefore considered in the analysis. The
energetically minimal conformations are found with the methods for conformational
search (see Chapter 4, Section 4.3 “Obtaining 3D-molecular conformation with minimum
energy”). Additionally, 3D-QSAR techniques are applied correctly only if the
investigated compounds interact with the same biological macromolecule(s), i.e. if they
have very similar molecular mechanisms of biological action.
In this chapter the 3D-(Q)SAR approaches used in the project are described.
6.1. GASP analysis
GASP (genetic algorithm similarity program) is a 3D-SAR technique used to obtain
information about the atoms or chemical groups that may be involved in the interaction with the
biological receptor.
Usually, a set of compounds presumed to bind to the same macromolecule to the
same extent is used. By identifying similarities between these compounds in certain
structural features and their relative spatial positions, GASP identifies possible chemical
features responsible for the common biological binding of the compounds, and possible
spatial positions of these chemical features (sometimes called “pharmacophore pattern”).
The chemical features considered are rings, which might enter into hydrophobic
interactions with macromolecules, and potential H-bonding sites (atoms that can form H-
bonds). Additionally, a conformation of a compound with high binding affinity can be
defined as a rigid template, and the remaining molecules can be aligned to it.
GASP identifies the potential interaction sites in each molecule and randomly constructs a
population of chromosomes, where each chromosome represents a possible alignment of
the molecules of the data set to each other. The chromosomes encode torsion settings for
the rotatable bonds and an intermolecular mapping of elements. A fitness score for each
chromosome is calculated as the weighted sum of three terms: the number and similarity
of overlaid chemical features, the common volume of the aligned molecules, and the van
der Waals energy of each molecule. The last-named takes into account the probability that
the active conformation of the molecule is the one used in the alignment.
In the next step parent chromosomes with high fitness scores are selected, and child
chromosomes are produced by applying a mutation or crossover operation to the parent
chromosomes. The child chromosomes with improved fitness scores are used to replace
the least fit members of the parent population. These steps are repeated until the fitness of
the population cannot be improved, or until the permitted number of genetic operations has
been completed. After obtaining the chromosomes with the best fitness score, these
chromosomes are used to define the structural features having highest similarity in their
chemical properties and spatial positions between the investigated molecules. These
structural features are suggested to be responsible for the similar biological binding of the
investigated molecules.
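The general scheme of such a genetic algorithm (random initial population, selection of fit parents, mutation or crossover, replacement of the least fit member) can be sketched in Python. This is not the GASP implementation: the chromosome here encodes only a list of torsion angles, the fitness function is an invented toy, and tournament selection and the operator settings are assumptions of the sketch.

```python
import random

def evolve(fitness, n_genes, pop_size=30, n_operations=2000, seed=0):
    """Steady-state genetic algorithm over lists of torsion angles (degrees)."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    pop = [[rng.uniform(-180.0, 180.0) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(n_operations):
        # Parent selection: fittest of three randomly drawn chromosomes
        # (tournament selection is an assumption of this sketch).
        p1 = max(rng.sample(pop, 3), key=fitness)
        p2 = max(rng.sample(pop, 3), key=fitness)
        if n_genes > 1 and rng.random() < 0.5:
            cut = rng.randrange(1, n_genes)        # one-point crossover
            child = p1[:cut] + p2[cut:]
        else:
            child = list(p1)                       # mutation of one torsion
            child[rng.randrange(n_genes)] += rng.gauss(0.0, 10.0)
        # The child replaces the least fit member only if it is fitter.
        worst = min(range(pop_size), key=lambda i: fitness(pop[i]))
        if fitness(child) > fitness(pop[worst]):
            pop[worst] = child
    return max(pop, key=fitness)

# Hypothetical target torsions; the toy fitness rewards closeness to them.
TARGET = [60.0, -60.0, 180.0]

def toy_fitness(chromosome):
    return -sum((g - t) ** 2 for g, t in zip(chromosome, TARGET))

best = evolve(toy_fitness, n_genes=3)
```

Because a child only ever replaces the least fit member, the best fitness in the population cannot decrease from one operation to the next.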
6.2. CoMFA and CoMSIA analyses
The CoMFA (comparative molecular field analysis) 3D-QSAR approach was proposed
by Cramer and co-workers (1988). In this approach the representation of molecular
structures is done by using values of a certain type of interaction energy at points in space
around the molecules. The interaction energies are calculated by using a probe atom of a
specified charge and steric properties (usually an sp³-hybridised carbon atom). Thus,
interactions with biological macromolecules are simulated. The energies of steric and
electrostatic interactions are used.
The main steps in CoMFA analysis are the following (Figure 6.1):
- the investigated compounds are aligned in space according to chemical
characteristics, considered important for the binding to macromolecules and
biological effect. A priori knowledge of such characteristics is necessary;
- a 3D-grid is constructed around the aligned molecules. The energies of interaction
with a probe atom are calculated at each grid point (energy fields);
- the PLS statistical procedure is used to obtain CoMFA QSAR models. It extracts
principal components from the energy fields;
- the number of PLS components to be included in the final QSAR model is chosen
by applying a cross-validation technique (usually leave-one-out) to evaluate the
predictive power of models with different components. The best model is selected
to have a low SEP (standard error of prediction) value and a high cross-validated
coefficient of determination (Q²).
Due to the large number of grid points participating in the extracted PLS principal
components, the CoMFA results are usually interpreted graphically as 3D-contour maps.
The maps visualise those points of the 3D-grid whose PLS loadings show a high
association between differences in calculated interaction energies and biological
activities. Usually these points are displayed in two colours, denoting the regions with the
most positive and most negative QSAR coefficients. Thus, the contour maps show
favourable and unfavourable regions around the molecules, where stronger steric and/or
electrostatic interactions will increase or decrease the biological activity, respectively.
CoMSIA (comparative molecular similarity indices analysis) is similar to CoMFA, but uses a
Gaussian function rather than Coulombic and Lennard-Jones potentials to assess steric,
electrostatic, hydrophobic, and hydrogen bond donor/acceptor fields (Klebe, 1998).
In CoMFA and CoMSIA approaches energies of interaction at the grid points around the
molecules are calculated. However, for a number of grid points the calculated energies
will have very low variance (for example grid points around parts of the molecules where
the variation in the chemical structure within the series is very small). Therefore a
criterion, called column filtering (σmin, usually in units of kcal/mol), for excluding such
grid points from the applied PLS analysis is used. Points that have standard deviations
(square root of the variance) smaller than the value of the column filtering are excluded
from the PLS analysis.
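Column filtering as described above amounts to dropping low-variance columns of the field matrix before the PLS analysis. The following minimal Python sketch assumes that rows correspond to molecules and columns to grid points, and uses the population standard deviation. Both are assumptions of the sketch, as are the function name and the example values.

```python
import statistics

def column_filter(field_matrix, sigma_min):
    """Drop grid points whose energies vary too little across the molecules.

    `field_matrix[m][g]` is the interaction energy (kcal/mol) of molecule m
    at grid point g. Columns with a standard deviation smaller than
    `sigma_min` are excluded. Returns the retained column indices and the
    filtered matrix.
    """
    columns = list(zip(*field_matrix))
    keep = [j for j, col in enumerate(columns)
            if statistics.pstdev(col) >= sigma_min]
    filtered = [[row[j] for j in keep] for row in field_matrix]
    return keep, filtered

# Invented energies for three molecules at three grid points.
energies = [
    [1.0, 0.0, 5.0],
    [1.1, 10.0, 5.0],
    [0.9, 20.0, 5.0],
]
kept, reduced = column_filter(energies, sigma_min=2.0)
```

In the example only the second grid point varies enough between the molecules to be retained. The first and third carry almost no information about differences in activity.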
Figure 6.1. Obtaining 3D-QSAR models in CoMFA (taken from Kubinyi, 1993)
CHAPTER 7
BLOOD-BRAIN BARRIER
In this chapter an introduction to the blood-brain barrier (BBB) and its functionality is
presented. The modes of assessment of compound BBB penetration are described.
Additionally, a summary of the QSAR models for BBB penetration, developed and
published before the present project, is presented.
7.1. Introduction to the blood-brain barrier
The blood-brain barrier (BBB) plays a critical role in supplying the brain with nutrients
and in its protection from circulating toxins. The BBB is formed from the brain capillary
endothelial cells, which are tightly linked with junctional complexes (tight junctions) that
eliminate gaps or spaces between cells. Under normal physiological conditions the
membranes of the capillary endothelial cells (basement (lateral) and apical membranes),
together with the tight junctions, form a continuous cellular barrier that prevents the
passive influx of a variety of substances with the exception of the smallest, lipid-soluble
molecules (Lee et al., 2001). The anatomy of the BBB is presented in Figure 7.1.
The passage of molecules across the BBB can occur either between adjacent cells (the
paracellular pathway) or through the endothelial cells (the transcellular path). The
paracellular passage includes passive diffusion of ions and solutes. Under normal
physiological conditions the tight junctions between the endothelial cells prevent most
substances from undergoing paracellular diffusion, with the exception of the smallest,
lipid-soluble molecules.
The transcellular path includes transport across cell membranes by one or more of several
membrane permeation mechanisms: passive diffusion, carrier-mediated (facilitative), and
active transport. Membrane transporters are involved in the influx/efflux of various
essential substrates such as electrolytes, nucleosides, amino acids and glucose. For
example, the sodium-ion-independent System L transports large neutral amino acids,
such as phenylalanine, tyrosine, and leucine, across the BBB. Small neutral amino acids,
such as proline, alanine, glycine, methionine, and glutamine, are transported by the
sodium-ion-dependent System A (Tamai and Tsuji, 2000). Brain tissue is additionally
protected by efflux transport systems present in the brain capillary endothelial cells.
Several transport protein families have been recognised, such as the product of the
multidrug resistance gene, MDR1 (P-glycoprotein), the multidrug resistance-associated
protein family (MRPs), and the organic anion transport proteins (OATPs) (Sun et al.,
2002). The multidrug resistance (MDR) phenomenon includes development of resistance
to a wide variety of xenobiotic compounds, involving active efflux from the cells.
In general, two ways of describing BBB permeability can be used. The first uses
permeability coefficients (P) of compounds, which have dimensions of length/time. P values
are sometimes multiplied by the capillary surface area per 1 g of brain to obtain the
permeability-surface area products (PS), in units of length³/(time × weight). Both P
and PS values are quantitative measures of the rate of transport, and so can be subjected to
QSAR analysis.
The second way is to use the blood-brain distribution at the steady state (BB), which is
defined as the ratio of the steady state concentrations of a compound in the brain and in
the blood:
$$BB = \frac{C_{\mathrm{brain}}}{C_{\mathrm{blood}}} \tag{7.1}$$
where $C_{\mathrm{brain}}$ is the steady-state concentration of the compound in the brain;
$C_{\mathrm{blood}}$ is the steady-state concentration of the compound in the blood.
7.2. Literature review of QSARs for blood-brain barrier penetration
Various QSAR models of BBB penetration have been developed and published. A variety
of approaches for structural representation and statistical tools have been applied.
Reviews of QSARs have been presented by Clark (2003) and Abraham and Platts (2000).
An early QSAR model for BBB penetration was developed by Young et al. (1988). The
BBB permeability (as BB values) of histamine H2 receptor antagonists was correlated
with the difference between logPoct and logPcyh values (∆logP), where logPoct is the
partition coefficient in the octanol-water solvent system, and logPcyh is the partition
coefficient in the cyclohexane-water system:
$$\log BB = -0.485\,(\pm 0.160)\,\Delta\log P + 0.889\,(\pm 0.500) \tag{7.2}$$
n = 20, R = 0.831, s = 0.439, F = 40.23
∆logP is related to the overall hydrogen bonding capacity of a molecule. The negative
coefficient of ∆logP suggests that brain penetration increases as the
overall H-bonding ability of a compound decreases. Young et al. (1988) did not find a significant
correlation between logBB and logPoct.
Some authors (Abraham et al., 1994; Gratton et al., 1997) used the structural parameters
included in the general solvation equation (introduced by Abraham et al., 1991; Abraham
and Whiting, 1992). For example, Abraham et al. (1994) derived the following QSAR:
$$\log BB = 0.198\,R_2 - 0.687\,\pi_2^{H} - 0.715\,\alpha_2^{H} - 0.698\,\beta_2^{H} + 0.995\,V_x - 0.038 \tag{7.3}$$
n = 57, R = 0.952, s = 0.197, F = 99.2
where $R_2$ is an excess molar refraction;
$\pi_2^{H}$ is the solute dipolarity/polarisability parameter (Abraham et al., 1991);
$\alpha_2^{H}$ is the solute H-bond acidity;
$\beta_2^{H}$ is the solute H-bond basicity;
$V_x$ is the solute characteristic volume of McGowan (Abraham and McGowan, 1987).
According to the model, an increase in solute size leads to an increase in logBB, whereas
increases in solute dipolarity/polarisability and hydrogen-bond acidity and basicity lead to
a decrease in logBB.
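The two regression equations quoted above (Equations 7.2 and 7.3) can be applied as simple functions. The following Python sketch uses the published coefficients as given in the text. The function and argument names are hypothetical, and the descriptor values in the example are invented to show the direction of each effect, not taken from any real compound.

```python
def log_bb_young(delta_log_p):
    """Equation 7.2 (Young et al., 1988): logBB from delta logP."""
    return -0.485 * delta_log_p + 0.889

def log_bb_abraham(r2, pi2h, alpha2h, beta2h, vx):
    """Equation 7.3 (Abraham et al., 1994): logBB from solvation descriptors."""
    return (0.198 * r2 - 0.687 * pi2h - 0.715 * alpha2h
            - 0.698 * beta2h + 0.995 * vx - 0.038)

# Invented descriptor sets differing only in H-bond acidity.
weak_donor = log_bb_abraham(r2=0.8, pi2h=0.9, alpha2h=0.0, beta2h=0.5, vx=1.2)
strong_donor = log_bb_abraham(r2=0.8, pi2h=0.9, alpha2h=0.6, beta2h=0.5, vx=1.2)
```

With all other descriptors held fixed, the compound with the larger H-bond acidity receives the lower predicted logBB, in line with the interpretation of the model given above.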
The Abraham solvation equation (which includes the variables described in the model of
Abraham et al., 1994, Equation 7.3) was applied as well to describe logPoct data. In the
model obtained, the coefficient for $\alpha_2^{H}$ was not significant. Therefore Abraham et al.
(1999) concluded that compounds that are H-bond acids (H-bond donors) will always be
outliers when logBB is plotted against logPoct.
Solvation free energies have also been correlated with blood-brain partitioning
(Lombardo et al., 1996; Keseru and Molnar, 2001). Other authors demonstrated the
capacity of the molecular polar surface area to obtain good predictive models for blood-
brain penetration (Kansy and Van de Waterbeemd, 1992; Clark, 1999; Kelder et al., 1999;
Feher et al., 2000). The polar surface area is defined as the sum of surface contributions
of polar atoms (usually oxygen, nitrogen and attached hydrogen) to the molecular surface,
and is related to the capacity of a compound to form H-bonds.
Some authors used topological and E-state indices to develop models for BBB penetration
(Luco, 1999; Norinder and Osterberg, 2001; Rose et al., 2002). Descriptors derived from
3D-molecular fields using the VolSurf method, have also been used (Crivori et al., 2000).
Again, the analyses confirmed that molecular properties such as hydrophilic regions and
H-bonding are unfavourable for BBB permeation.
In summary, numerous literature reports state that higher hydrogen bonding potential is
unfavourable for BBB penetration (Young et al., 1988; Kansy and Van de Waterbeemd,
1992; Abraham et al., 1994; Gratton et al., 1997; Clark, 1999; Kelder et al., 1999; Crivori
et al., 2000; Feher et al., 2000; Norinder and Osterberg, 2001; Rose et al., 2002). LogPoct
values, as an estimate of hydrophobicity, were found in some studies to correlate with
BBB penetration values (Gratton et al., 1997; Clark, 1999; Feher et al., 2000), while other
authors demonstrated that logPoct is not a suitable descriptor for obtaining BBB
permeability models (Young et al., 1988; Abraham et al., 1994). According to Abraham
et al. (1999), logBB values of compounds that are H-bond donors cannot be described
with logPoct values alone. Other descriptors found as important for BBB penetration are
compound polarity and polarisability (Abraham et al., 1994; Gratton et al., 1997;
Norinder and Osterberg, 2001).
Figure 7.1. Cross section through a brain microcapillary, representing the anatomy of
BBB (redrawn from Gloor et al., 2001)
CHAPTER 8
ACUTE CHEMICAL TOXICITY/CYTOTOXICITY
In this chapter an introduction to the mechanisms of toxic action is given. Also, a
summary of the QSAR models for toxicity, developed and published before the present
project, is presented.
8.1. Mechanisms of toxic action
A mechanism of toxic action is taken here to be the critical biological effect of the
toxicant at the molecular or cellular level. A number of toxicity mechanisms have been
defined in the literature. The following classification of mechanisms of toxic action has
been adopted in this project: non-polar narcosis, polar narcosis, weak acid respiratory
uncoupling, formation of free radicals, electrophilic reactions, and toxic action by specific
(receptor-mediated) mechanisms.
A number of methods are available to assess mechanisms of toxic action, including in
vitro tests (Urani et al., 1998); joint toxicity tests, in which the additivity (or otherwise) of
the toxicity of a chemical to the toxicity of a reference compound with a known
mechanism of toxicity, is tested (Broderius et al., 1995); fish acute toxicity syndromes
(FATS) (Bradbury et al., 1989); and considerations on the basis of structural criteria.
Also, a number of authors define structural criteria to assign compounds to a given
mechanism of toxic action (Hermens, 1990; Lipnick, 1991; Verhaar et al., 1992; Russom
et al., 1997; Schultz et al., 1997; Hansch et al., 2000; Cronin et al., 2002a). However,
identification of the mechanism of toxic action for a compound from its chemical
structure is often a difficult task due to the complex nature of toxic activity (Schultz et al.,
2003a). Additionally, the different mechanisms are not mutually exclusive, i.e. a chemical
compound may act by different mechanisms.
The narcotic mechanism of toxic action results from non-specific non-covalent reversible
interactions with cell membranes (Schultz et al., 2003a). Narcotic mechanisms of action
can include non-polar and polar narcosis (Schultz et al., 1990). In addition, some authors
also refer to other mechanisms of narcosis (e.g. amine, ester) (Russom et al., 1997). The
toxic effect of non-polar narcotics is determined by the compound lipid solubility;
compound-specific molecular features are of little or no importance (Schultz et al., 1990).
Non-polar narcotics are neutral non-reactive compounds such as aliphatic alcohols,
ketones and ethers. Quantitative relationships between toxicity and hydrophobicity
(presented as octanol-water partition coefficients, logPoct values, or octanol-water
distribution coefficients, logDoct values) for non-polar narcotics have been reported by
many authors (Könemann, 1981; Veith et al., 1983; Hansch et al., 1989; Schultz et al.,
1990). These relationships are considered to represent a “baseline effect”, whereby no
wholly soluble and non-volatile chemical can elicit toxicity less than predicted by such
relationships (Schultz et al., 2003a).
Polar narcotics are aromatic, less inert and often possess a hydrogen-bond donor moiety, e.g.
phenols and anilines (Veith and Broderius, 1990).
In general weak acid respiratory uncouplers elicit their effect by abolishing the coupling
of substrate oxidation to adenosine triphosphate synthesis in mitochondria. The molecular
characteristics of weak acid respiratory uncouplers include a weak acid feature (e.g. an
amino or hydroxyl group), a bulky, hydrophobic aromatic moiety, and multiple
electronegative groups (e.g. nitro and/or halogen substituents) (Terada, 1990). A typical
example is 2,4-dinitrophenol (Schultz et al., 1990).
Some chemicals are suggested to produce their toxic effect by forming free radicals.
Phenols, which possess electron-releasing groups could, for instance, be converted to
toxic phenoxyl radicals (Hansch et al., 2000).
Compounds that can undergo direct electrophilic interactions may cause covalent changes
in biological macromolecules (Hermens, 1990; Lipnick, 1991; Russom et al., 1997). Some
compounds may undergo metabolic reactions, resulting in more toxic forms (for example,
precursors to electrophilic compounds, i.e. proelectrophiles). Cronin et al. (2002a) stated
that 2- and 4-substituted amino and nitrophenols are capable of being oxidised to
quinones, which have toxicities greater than those of polar narcotics.
Specific mechanisms of toxic action are due to interactions with specific target
macromolecules (receptors or enzymes). These interactions depend on specific chemical
features of the compounds. Binding to receptors is non-covalent, due to ionic,
hydrophobic and/or H-bonding interactions, and requires specific 3D molecular
conformations.
In Table 8.1 a summary of structural criteria which can be used to assign mechanisms of
toxic action to compounds is given.
8.2. Literature review of QSARs for acute toxicity/cytotoxicity
A large number of QSAR studies of acute toxicity have been published in the
literature. A review of recently published QSARs (since 2000) for acute toxicity is
presented by Lessigiarska et al. (2005b).
Toxicities to a wide range of biological species have been investigated by QSAR analysis,
for example bacteria, protozoa, algae, invertebrates, plants, fish, mammalian cells and
mammals, including humans. Various chemical groups have been investigated and
various QSAR approaches have been applied.
A dependence of acute (narcotic) toxicity on a partition coefficient, especially the octanol-
water partition coefficient, has been shown by many authors (Dearden et al., 2000;
Freidig and Hermens, 2000; Kapur et al., 2000; Parkerton and Konkel, 2000; Bundy et al.,
2001; Gramatica et al., 2001; Ren and Frymer, 2002; Sverdrup et al., 2002; Worgan et al.,
2003).
For example, in the work of Schultz et al. (2002), a large data set of 500 aliphatic
chemicals was investigated for toxicity to the protozoan Tetrahymena pyriformis (toxicity
expressed as inhibitory concentrations causing 50 % inhibition of growth of the protozoan
in a two-day assay, IGC50 values). The following QSAR, representing a baseline toxicity
relationship, was derived for a group of non-polar narcotics (saturated alcohols, ketones,
nitriles, esters, and sulphur-containing compounds):
$$\log(1/IGC_{50}) = 0.723\,\log P_{\mathrm{oct}} - 1.79 \tag{8.1}$$
n = 215, R²(adj) = 0.926, s = 0.274, Q² = 0.925
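Equation 8.1 can be used to estimate the baseline (non-polar narcosis) toxicity of a compound from its logPoct value. The Python sketch below applies the coefficients as quoted. The function names are hypothetical, and the concentration units of IGC50 are not stated in the text above, so the back-transformed value is in whatever units the original regression used.

```python
def baseline_log_inv_igc50(log_p_oct):
    """Equation 8.1 (Schultz et al., 2002): log(1/IGC50) from logPoct."""
    return 0.723 * log_p_oct - 1.79

def igc50_from_log_inv(log_inv_igc50):
    """Back-transform log(1/IGC50) to IGC50 (units as in the source data)."""
    return 10.0 ** (-log_inv_igc50)

# Invented logPoct values, showing that toxicity rises with hydrophobicity.
less_hydrophobic = baseline_log_inv_igc50(1.0)
more_hydrophobic = baseline_log_inv_igc50(3.0)
```

A higher log(1/IGC50) corresponds to a lower inhibitory concentration, i.e. to a more toxic compound.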
In addition to the octanol-water partition coefficient, partition coefficients in other phases
have also been used to develop QSARs for toxicity. For example Schultz and Seward
(2000) concluded that dimyristoyl phosphatidylcholine-water partition coefficients gave
better statistical fit than octanol-water partition coefficients in QSAR analysis of
inhibition of T. pyriformis population growth for 23 non-polar narcotics, polar narcotics,
and esters. Roberts and Costello (2003) developed QSARs for toxicity of 18 non-polar
and polar narcotics to the fish Poecilia reticulata (guppy) using logPoct (octanol-water)
and logPMW (membrane-water) partition coefficients. The LogPoct-based equations for
non-polar and polar narcotics had different coefficients. LogPMW-based QSARs for non-
polar and polar narcotics had similar coefficients. Therefore Roberts and Costello (2003)
derived a general equation for polar and non-polar narcotics, based on logPMW:
pLC50 = 0.84 logPMW + 1.38 (8.2)
n = 18, R2 = 0.963, s = 0.21, F = 419
Freidig and Hermens (2000) developed QSARs for toxicity to fish (14-day LC50 for
guppy (P. reticulata) and 4-day LC50 for fathead minnow (Pimephales promelas)). The
compounds were divided into narcotics and reactive compounds, according to the values
of the ratio of excess toxicity, Te:
Te = observed toxicity / predicted toxicity by a baseline equation (8.3)
Freidig and Hermens (2000) stated that if the Te value of a compound is smaller than five,
it should be considered as a narcotic, and if Te is bigger than five it should be considered
as a reactive chemical.
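The Te criterion of Equation 8.3 can be sketched in a few lines of Python. This is an illustration only, under two stated assumptions: both toxicities are expressed on a linear potency scale in which a larger value means a more toxic compound (for example 1/LC50), and a Te value exactly equal to five, which the source does not assign to either group, is treated here as reactive. The function names are hypothetical.

```python
def excess_toxicity_ratio(observed: float, predicted_baseline: float) -> float:
    """Te = observed toxicity / toxicity predicted by a baseline equation (Eq. 8.3)."""
    if predicted_baseline <= 0:
        raise ValueError("predicted baseline toxicity must be positive")
    return observed / predicted_baseline


def classify_by_te(te: float, threshold: float = 5.0) -> str:
    """Te below the threshold -> narcotic, otherwise reactive.

    Assumption: Te exactly equal to the threshold is labelled reactive;
    the source text does not specify this boundary case.
    """
    return "narcotic" if te < threshold else "reactive"


# Hypothetical potencies (arbitrary units), for illustration only.
print(classify_by_te(excess_toxicity_ratio(2.0, 1.0)))
print(classify_by_te(excess_toxicity_ratio(12.0, 1.0)))
```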
Freidig and Hermens (2000) developed separate one-parameter QSARs for the groups of
narcotics and reactive compounds, using a descriptor that characterises the particular
toxicity mechanism (logPoct for the narcotics, and a certain electronic descriptor for the
reactive compounds). Additionally, models combining the two groups of compounds and
the two types of descriptors were developed. Freidig and Hermens (2000) concluded that
using separate QSAR models for compounds acting by different mechanisms, each of the
models including a descriptor that characterises the particular toxicity mechanism, gives
better results than using a single model that combines all compounds and descriptors.
Some authors have used the so-called “response-surface approach” based on
hydrophobicity and electrophilicity of compounds. In this approach the QSARs include a
descriptor encoding bio-uptake and distribution (usually octanol–water partition or
distribution coefficients (logPoct or logDoct)) and a descriptor of electrophilic reactivity
(usually LUMO or maximum acceptor superdelocalisability (Amax)). This approach has
been applied to different species, including the bacterium Vibrio fischeri (Cronin et al.,
2000); the protozoan Tetrahymena pyriformis (Cronin and Schultz, 2001; Cronin et al.,
2002a; Schultz et al., 2002); the yeast Saccharomyces cerevisiae (Wang et al., 2002b); the
mould Aspergillus nidulans (Cronin et al., 2002b); the algae Scenedesmus obliquus (Lu et
al., 2001) and Chlorella vulgaris (Cronin et al., 2002b); the plant Cucumis sativus (Wang
et al., 2002a; Wang et al., 2002c); mice (Cronin et al., 2002b); and perfused rat liver,
for which toxic and metabolic effects were modelled (Cronin et al., 2002b). The advantage of the response-surface
approach is that it is simple and has a sound mechanistic interpretation.
While some authors (e.g. Cronin and Schultz, 2001; Cronin et al., 2002a; Cronin et al.,
2002b) have used LUMO as a descriptor of electrophilic reactivity resulting in covalent
changes in biological systems, Dimitrov et al. (2000) and Dimitrov et al. (2003) have
suggested that LUMO can also be used to describe non-covalent electrophilic interaction
of narcotic chemicals with the site of action.
Some authors extended the response-surface approach by adding indicator variables and
other parameters to improve the statistical fit of the models (Schmitt et al., 2000; Wang
et al., 2001; Huang et al., 2003; Cronin et al., 2004; Netzeva et al., 2004). Examples of
such variables include charges of defined atoms in the compound (Schmitt et al., 2000),
the heat of formation (Wang et al., 2001), the molecular volume (Huang et al., 2003), and
molecular connectivity indices (Wang et al., 2001; Netzeva et al., 2004). These help to
model outliers for which toxicity is under- or overestimated by the two-variable
response-surface approach. However, according to Schultz et al. (1998), QSAR modelling
for electrophilic compounds is difficult because of data and descriptor limitations
compared to QSAR modelling of compounds acting by other toxic mechanisms.
Another approach has been the use of the Brown variation of the Hammett constant (σ+)
to model acute toxicity of aromatic compounds forming free radicals (Hansch et al., 2000;
Selassie et al., 2002; Moridani et al., 2003b; Verma et al., 2003).
QSARs for toxicity based on topological and/or E-state indices have been derived by
many authors (Ivanciuc, 2000; Burden, 2001; Gramatica et al., 2001; Cronin et al., 2002a;
Grodnitzky and Coats, 2002; Huuskonen, 2003; Rose and Hall, 2003). Polarisability and
molar refractivity have been used by Trohalaki et al. (2000), Geiss and Frazier (2001),
and Hansch and Kurup (2003). Another approach is based on the TLSER (theoretical linear
solvation energy relationship) model descriptors, which represent cavity,
dipolarity/polarisability, and H-bonding terms (Boyd et al., 2001; Liu et al., 2001; Liu et
al., 2003a). The so-called ‘hardness’, equal to ½ (HOMO – LUMO), has also been used as
a descriptor in some QSARs. It encodes atomic radius, nuclear charge and
polarisability (Faucon et al., 2001).
3D-QSAR approaches (CoMFA and CoMSIA) have also been applied to investigate
toxicity (Xu et al., 2002; Liu et al., 2003b). Interaction fields and regions around
molecules that are important for compound toxicity were identified. There is, however,
implicit difficulty in using these approaches, because they require a series of compounds
sharing a common and specific mechanism of action (see Chapter 6).
Also, several other QSAR approaches have been explored in investigating acute toxicity.
For example, Balaž and Lukacova (2002) used subcellular pharmacokinetic theory to
derive QSARs. Toropov and Toropova (2002), Toropov and Schultz (2003), and Toropov
and Benfenati (2004) used the optimisation of correlation weights of local graph invariants
(OCWLGI) approach (Toropov and Toropova, 2002). Weighted holistic invariant
molecular (WHIM) descriptors encoding three-dimensional information for molecular
shape and electronic structure were used by Gramatica et al. (2001), and Di Marzio et al.
(2001). Grodnitzky and Coats (2002) used geometry, topology and atomic weights
assembly descriptors (GETAWAY), which encode information about the 3D structures,
size and shape of the molecules. Several studies using autocorrelation vectors encoding
lipophilicity, molar refractivity, and H-bonding ability were performed (Devillers et al.,
2002a; Devillers et al., 2002b). Tao et al. (2002a), and Tao et al. (2002b) applied a
fragment constant method to develop toxicity QSARs.
Additionally, QSARs for metal toxicity have also been developed using ion-specific
physicochemical parameters (Enache et al., 2000; Enache et al., 2003; Ownby and
Newman, 2003).
QSARs for toxicity of mixtures of chemical compounds have also been developed. This is
an important and growing research area, because environmental pollutants are usually
released as chemical mixtures rather than single chemicals. Some authors used mixture
partition coefficients and H-bonding potentials of mixtures composed of non-polar and
polar narcotics to develop QSARs (Yu et al., 2001; Lin et al., 2003b). Other
investigations include specific characteristics of the chemical compounds in the mixtures,
for example charges at a certain atom (Lin et al., 2003a). Tichý et al. (2002) investigated
dependence of toxicity on different molar ratios of mixture compounds.
Table 8.1. Summary of structural criteria used in this project for classifying compounds
according to mechanism of toxic action (devised from Hermens, 1990; Lipnick, 1991;
Verhaar et al., 1992; Russom et al., 1997; Schultz et al., 1997; Hansch et al., 2000; Cronin
et al., 2002a)

No. | Mechanism of action | Structural determinants
1 | Non-polar narcosis | Saturated alkanes with e.g. halogen and/or alkoxy substituents (aliphatic alcohols, ketones, ethers, amines); halogen and alkyl substituted benzenes
2 | Polar narcosis | Phenols with pKa greater than or equal to 6.0; phenols and anilines with three or fewer halogen atoms, and/or alkyl substituents
3 | Weak acid respiratory uncouplers | Phenols and anilines with four or more halogen substituents, or more than one nitro group, or a single nitro group and more than one halogen group
4 | Formation of free radicals | Phenol or aniline substituted with an electron-releasing group (alkoxy, hydroxyl, more than one alkyl group)
5 | Electrophiles/proelectrophiles | Activated unsaturated compounds; benzene rings without aniline or phenol substructures that have two nitro groups on one ring; phenols with a single nitro group but not more than one halogen group; aromatic compounds with two or more hydroxy groups in the ortho or para position and at least one unsubstituted aromatic carbon atom; quinones; aldehydes; compounds with halogens at α-position of an aromatic bond; ketenes; epoxides
6 | Specific mechanisms | Chemicals interacting with specific biological macromolecules, for example acetylcholinesterase inhibitors with an organophosphate group
CHAPTER 9
PROGRAMS FOR STATISTICAL ALGORITHMS
9.1. Objectives
In this chapter two statistical algorithms designed by the author for the purposes of the
project are presented, which were used as auxiliary tools in the QSAR analysis. The first
one is related to reduction of multicollinearity of variables in a data set. It was used to
reduce data redundancy before performing statistical analyses. The second algorithm
implements the best-subsets approach for selecting the regression equations with the
best statistical fit.
9.2. Reduction of data multicollinearity
Nowadays, commercially-available QSAR software packages can calculate large numbers
of descriptors. The availability of a large number of descriptors in a single study poses the
problem of variable multicollinearity and data redundancy. One way to avoid data
redundancy is to exclude descriptors that are highly intercorrelated with each other before
performing statistical analysis. Reduced multicollinearity and redundancy in the data will
facilitate selection of relevant variables and models for the investigated endpoint. An
algorithm for reducing the intercorrelation in the variable data matrix, developed by the
author, is presented. The main steps in the proposed algorithm are presented in Figure 9.1.
A program in C code has been developed by the author, based on the proposed algorithm.
The code of the program is presented in Appendix A.1. As input an ASCII file containing
the variable data matrix is required. The user is requested to specify a cut-off value for the
Pearson coefficient of correlation R (Rc). The variables that intercorrelate with R larger
than Rc will be considered as redundant. The user may also input a list of variables that
are considered important for the modelling (list_to_remain), for example, because of their
physicochemical or mechanistic significance. The main steps of the program algorithm
are:
1. Calculation of the intercorrelation matrix for the variables.
2. For each variable, counting the number of intercorrelated variables with an
intercorrelation coefficient R larger than Rc.
3. The variable (not in the list_to_remain) that has the largest number of
intercorrelated variables is proposed to be deleted from the set of variables.
4. On positive reply from the user, the data matrix is created again without this
variable. Steps 1 to 3 are repeated. On negative reply, the user is asked whether to
quit the program, and if not, that variable is added to the list_to_remain and step 3
is repeated.
5. Steps 1 to 4 are repeated until there are no highly intercorrelated variables, or all
variables are in the list_to_remain, or the user decides to quit the program at step
4.
The output of the program is an ASCII file that contains the intercorrelation matrix of the
undeleted variables, the number and a list of intercorrelated variables for each of the
undeleted variables, and a list of the deleted variables.
A key feature of this algorithm is that the user can specify a list of variables that have
important physicochemical and/or mechanistic significance, and which are therefore not
deleted.
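The steps listed above can be sketched in Python as follows. This is not the author's C program from Appendix A.1 but a minimal non-interactive illustration under stated assumptions: every proposed deletion is accepted automatically (the user always replies "yes" at step 4), the absolute value of R is compared with Rc, constant columns are treated as uncorrelated, and ties are resolved in favour of the variable that appears first in the data table. The function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def reduce_multicollinearity(data, r_cut, list_to_remain=()):
    """data maps variable name -> list of values.

    Returns (kept_names, deleted_names). Every proposed deletion is
    accepted automatically, i.e. the interactive dialogue is omitted.
    """
    kept = list(data)
    protected = set(list_to_remain)
    deleted = []
    while True:
        # Steps 1-2: for each variable, count partners with |R| > r_cut.
        n_cor = {
            v: sum(1 for w in kept
                   if w != v and abs(pearson_r(data[v], data[w])) > r_cut)
            for v in kept
        }
        # Step 3: unprotected variable with the largest count is the candidate.
        candidates = [v for v in kept if v not in protected and n_cor[v] > 0]
        if not candidates:
            return kept, deleted  # Step 5: nothing left to delete
        worst = max(candidates, key=lambda v: n_cor[v])
        # Step 4: delete it and repeat on the reduced data matrix.
        kept.remove(worst)
        deleted.append(worst)


# Hypothetical toy data: a, b and c are highly intercorrelated, d is not.
toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [1.1, 2.0, 2.9, 4.2, 5.0],
    "d": [5, 1, 4, 2, 3],
}
print(reduce_multicollinearity(toy, 0.95, list_to_remain=["a"]))
```

Recomputing the correlations in every pass keeps the sketch short; for a few hundred descriptors the matrix could instead be computed once and reused.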
An alternative approach for reducing data multicollinearity is implemented in the
DRAGON software version 5.0 (TALETE srl.), where one of the two descriptors with a
correlation coefficient bigger than a user-specified value between 0.9 and 1.0 is excluded.
For each pair of correlated descriptors, the descriptor having the highest correlation
coefficient with some of the other descriptors is automatically excluded. The algorithm
developed in this project is based on a different principle, namely excluding descriptors
having the highest number of intercorrelations with other descriptors. In the author's
view, an advantage of the algorithm presented in this thesis is the possibility to
define a list of variables that are not to be deleted from the data set, thus allowing the
user to control the process of data reduction.
For example, the algorithm was applied to the set of descriptors calculated for a compound
set containing blood-brain barrier penetration data, taken from Platts et al. (2001) (QSAR
analysis of this data set is reported in Chapter 10 of this thesis). The data table contained
approximately 300 descriptors calculated with both the TSAR (Accelrys Inc.) and QsarIS
(then SciVision-Academic Press, San Diego, CA; currently MDL®QSAR by Elsevier
MDL, San Leandro, CA) software. Applying the proposed algorithm, with a cut-off value
of R = 0.95, resulted in a data table containing 166 descriptors. During the execution of
the program, molecular volume and molecular surface area were preserved from deletion
by adding them to the list_to_remain, because of their possible mechanistic significance.
9.3. Program for selection of regression equations with best statistical fit
A program in C code was developed by the author, which implements the algorithm of
the best-subsets approach for selecting regression equations with best statistical fit. The
code of the program is presented in Appendix A.2. The program computes k-variable
regression equations for all possible combinations of k variables from a given set of
variables. The user can specify a cut-off value of R; variables that intercorrelate with a
correlation coefficient higher than this value are not included in the same regression
equation. Afterwards, the regression equations are sorted by decreasing value of the
squares of their correlation coefficients (R2).
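The procedure described above can be illustrated with a short Python sketch. It is not the author's C program from Appendix A.2: the function names are hypothetical, the absolute value of R is used for the intercorrelation filter (an assumption), and the least-squares fit is obtained by solving the normal equations with plain Gaussian elimination, which is adequate for a toy example but would fail for exactly collinear variable sets.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient (0.0 if either column is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def r_squared(columns, y):
    """R2 of the least-squares fit of y on the columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    # Normal equations (X'X) b = X'y stored as an augmented matrix.
    a = []
    for i in range(p):
        row = [sum(r[i] * r[j] for r in rows) for j in range(p)]
        row.append(sum(r[i] * yi for r, yi in zip(rows, y)))
        a.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda i: abs(a[i][c]))
        a[c], a[piv] = a[piv], a[c]
        for i in range(p):
            if i != c:
                f = a[i][c] / a[c][c]
                a[i] = [u - f * v for u, v in zip(a[i], a[c])]
    coef = [a[i][p] / a[i][i] for i in range(p)]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - sum(b * x for b, x in zip(coef, r))) ** 2
                 for r, yi in zip(rows, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def best_subsets(data, y, k, r_max):
    """Return [(R2, variable_names)] for all admissible k-variable models."""
    results = []
    for names in combinations(data, k):
        # Skip subsets in which any pair intercorrelates above r_max.
        if any(abs(pearson_r(data[u], data[v])) > r_max
               for u, v in combinations(names, 2)):
            continue
        results.append((r_squared([data[v] for v in names], y), names))
    return sorted(results, key=lambda t: t[0], reverse=True)


# Hypothetical toy data: y = 1 + 2*x1 + 3*x3; x2 is nearly collinear with x1.
toy = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2.0, 4.1, 5.9, 8.2, 9.9],
    "x3": [5, 1, 4, 2, 3],
}
y = [18, 8, 19, 15, 20]
for r2, names in best_subsets(toy, y, k=2, r_max=0.9):
    print(names, round(r2, 4))
```

In this toy case the pair (x1, x2) is never fitted because the two variables intercorrelate above the chosen cut-off, so only two of the three possible two-variable models are ranked.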
The program needs as input an ASCII file containing the variable data matrix, including
the dependent and the independent variables. The input file may contain several
dependent variables, which are investigated one by one. The output of the program is an
ASCII file that contains the regression equations for each of the dependent variables and
the squares of their correlation coefficients, R2.
The program was developed for the needs of the work in this thesis, because it allows
the best-subsets approach to be applied in the regression analysis of data sets containing a
large number of independent variables. The program sets no limit on the number of
independent variables; the only practical restriction is the computational time, which
increases with the number of variables. In contrast, the implementation of best-subsets
regression analysis in the MINITAB version 14 software (Minitab Inc., State College, PA,
USA) is restricted to a maximum of 31 independent variables from which the k-variable
combinations are constructed in the search for the best statistical fit.
For example, the program was used to derive QSARs for toxicity to the bacterium
Sinorhizobium meliloti (the study is reported in Chapter 12 of this thesis). The variable
data sets contained approximately 100 variables after applying the data reduction
algorithm (different for each subset of compounds with different mechanisms of toxicity,
which were studied separately, see Chapter 12). The program was applied up to k = 5,
because for k ≥ 6, the number of the possible combinations of k variables from the data
set of approximately 100 variables was more than 1 100 000 000, which would require
more than 24 h computational time.
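The growth in the number of candidate equations can be checked with the binomial coefficient. The sketch below assumes exactly 100 variables, whereas the text states only "approximately 100", and it ignores the intercorrelation filter, so it gives an upper bound on the number of regressions actually fitted. The function name is hypothetical.

```python
import math


def n_k_variable_models(n_variables: int, k: int) -> int:
    """Number of distinct k-variable subsets of n_variables descriptors."""
    return math.comb(n_variables, k)


# Assumption: exactly 100 variables (the text says "approximately 100").
for k in range(1, 7):
    print(k, n_k_variable_models(100, k))
```

For 100 variables this gives 75 287 520 combinations at k = 5 and 1 192 052 400 at k = 6, in line with the figure of more than 1 100 000 000 quoted above.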
9.4. Contribution to existing knowledge
The presented algorithm for reducing data multicollinearity was designed by the author to
serve the needs of the work reported in this thesis and the development of QSARs in
general. As described above, the algorithm differs from a similar algorithm implemented
in the DRAGON software (TALETE srl.) in the way the multicollinearity is assessed,
namely by counting the number of intercorrelated variables for each variable of a data set.
Additionally, the algorithm allows the user to preserve variables considered important for
the further QSAR analysis. The algorithm was used in the analysis published by
Lessigiarska et al. (2004a).
The second algorithm, presented in this chapter, contributed to an extended
implementation of the best-subsets regression approach to large data sets, together with
the possibility to derive the variable combinations resulting in the best statistical fit, for
which the variables included are not intercorrelated above a given value of the
intercorrelation coefficient R. The algorithm was used in the analyses reported by
Lessigiarska et al. (2004a), Lessigiarska et al. (2004b), Netzeva et al. (2005) and
Lessigiarska et al. (2006).
Figure 9.1. Algorithm for reduction of data multicollinearity. Points where user input is
necessary are represented with ellipses
[Flowchart not reproduced. Inputs: variable data table; cut-off value for the intercorrelation coefficient (Rc); list of variables that are forced into the model (list_to_remain). Steps: calculation of the correlation matrix; calculation of the number of intercorrelated variables (N_cor) with R bigger than Rc for each variable; check whether there are variables with N_cor > 0 that are not in the list_to_remain; user decision on whether the variable outside the list_to_remain with the biggest N_cor should be deleted; either deletion of the variable and creation of a new variable data table, or addition of the variable to the list_to_remain; user decision on whether to exit the programme.]
CHAPTER 10
INVESTIGATION OF BLOOD-BRAIN BARRIER PENETRATION
10.1. Objectives
In this chapter an investigation of compound penetration through the blood-brain barrier
(BBB) is presented. In vivo BBB penetration, as well as in vitro penetration through
several membrane models of the BBB, was modelled. Regression and classification
QSAR models were sought. Additionally, the in vivo BBB penetration was described by
combined models including descriptors of the chemical structure and in vitro penetration
endpoints (QSAAR analysis), in order to investigate relationships between the in vitro
and in vivo endpoints. Models for passively diffusing compounds, as well as compounds
transported actively, were developed.
10.2. Methods
10.2.1. Biological data
Two data sets were used to investigate BBB penetration.
The first data set contains 21 compounds and was taken from a collaborative study funded
by the European Commission’s Joint Research Centre (ECVAM), called below “ECVAM
data set” (Garberg, 2001). It includes data for in vivo penetration through the BBB, and in
vitro penetration through several membrane models of the BBB. In Table 10.1, the
compounds investigated and their in vivo BBB and in vitro penetration abilities are
presented. The SMILES codes for the compounds are given in Appendix B.1.a. The
chemicals belong to different chemical groups and reflect different transport mechanisms.
According to the source study (Garberg, 2001) four of them are actively transported
through the BBB by different carrier systems (Table 10.1), eight are subject to active
efflux from the brain, and the remaining nine compounds cross the BBB by passive
diffusion.
The in vivo BBB penetration of a compound was described by the compound permeability
coefficient (P) and the (percentage) ratio between the compound concentration in brain
and that in plasma at the steady state (BB, see Chapter 7, Equation 7.1), i.e. BB =
(Cbrain/Cblood)*100 %, where Cbrain and Cblood are the steady-state concentrations of the
compound in the brain and in the blood, respectively.
The in vitro penetration of a compound was expressed as the permeability coefficients, P,
for membranes constructed with the following cells:
- bovine and rat subcultured primary brain capillary endothelial cells, co-cultured
with primary rat astrocytes (BBEC);
- rat SV40 immortalised brain microvascular endothelial cells, co-cultured with
SV40 immortalised rat astrocytes (SV-ARBEC);
- human epithelial colon carcinoma cell line (Caco-2);
- dog kidney epithelial cell line (MDCK).
A schematic representation of the experimental set-up of an in vitro blood–brain barrier
co-culture model is presented in Figure 10.1.
Two types of permeability coefficients were reported in the source study (Garberg, 2001).
Apparent permeability coefficients (Papp) were obtained for the permeability of the cell
membrane together with the filter insert (see Figure 10.1). Endothelial permeability
coefficients Pe were obtained indirectly by applying the formula:
1/(Pe*S) = 1/(Papp*S) – 1/(Pf*S) (10.1)
where Pf is the permeability coefficient for the insert alone
S is the surface area of the insert (see Figure 10.1).
Thus Pe values are assumed to represent the permeability of the cell membrane itself.
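Because the surface area S appears in every term of Equation 10.1, it cancels, and Pe follows from Papp and Pf alone. The Python sketch below illustrates this rearrangement; the function name and the numerical values are hypothetical, and the check that Papp is smaller than Pf is an assumption made here so that the computed Pe is positive and finite.

```python
def endothelial_permeability(p_app: float, p_filter: float) -> float:
    """Solve Equation 10.1 for Pe: 1/Pe = 1/Papp - 1/Pf (S cancels)."""
    if not 0 < p_app < p_filter:
        # Assumption: cells plus insert are less permeable than the insert alone.
        raise ValueError("expected 0 < Papp < Pf")
    return 1.0 / (1.0 / p_app - 1.0 / p_filter)


# Hypothetical permeabilities in the same (arbitrary) units.
print(endothelial_permeability(2.0, 6.0))
```

The result is always larger than Papp, since the resistance of the insert has been subtracted from the total resistance.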
The second data set was taken from Platts et al. (2001). It contains in vivo BBB
penetration data for 157 compounds collected from different literature sources, including
directly measured and indirectly determined values. The compounds and the logBB
values are presented in Table 10.2. The SMILES codes for the compounds are given in
Appendix B.1.b. The BBB penetration is presented again as the BB ratio: BB =
Cbrain/Cblood. In Platts et al. (2001) the following QSAR for the data set was reported,
based on the solvation equation of Abraham (Platts et al., 2001, adopted a simpler
notation for the structural descriptors, see Chapter 7):
logBB = 0.463 E - 0.864 S - 0.564 A - 0.731 B + 0.933 V - 0.567 I1 + 0.021 (10.2)
n = 148, R2 = 0.745, s = 0.343, F = 69, Q2 = 0.711
where: E is an excess molar refraction
S is the dipolarity/polarisability parameter
A is the H-bond acidity
B is the H-bond basicity
V is the characteristic McGowan volume
I1 is an indicator variable, set to 1 for a compound containing a carboxylic acid
fragment and 0 otherwise.
Nine compounds had been omitted as statistical outliers in the equation above in order to
improve the statistical fit.
According to the models reported by Platts et al. (2001), the factors that affect logBB are
molecular size and dispersion effects, which increase the penetration into brain, and
polarity/polarisability and H-bond acidity and basicity, which decrease the brain uptake.
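For illustration, Equation 10.2 can be written as a small Python function. The coefficients are those quoted above from Platts et al. (2001); the function name, the argument names and the descriptor values in the example are hypothetical and do not correspond to any real compound.

```python
def log_bb_platts(e, s, a, b, v, i1=0):
    """Equation 10.2: logBB from the Abraham descriptors E, S, A, B, V and I1.

    i1 is 1 for a compound containing a carboxylic acid fragment, else 0.
    """
    return (0.463 * e - 0.864 * s - 0.564 * a - 0.731 * b
            + 0.933 * v - 0.567 * i1 + 0.021)


# Hypothetical descriptor values, for illustration only.
print(round(log_bb_platts(e=0.6, s=0.5, a=0.0, b=0.1, v=0.7), 3))
```

The signs of the coefficients reproduce the interpretation given above: E and V raise the predicted logBB, whereas S, A, B and the carboxylic acid indicator lower it.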
10.2.2. Structural descriptors
The following structural descriptors for the compounds were calculated: logPoct, logDoct,
aqueous solubility, molecular weight, molar refractivity, molar volume, parachor, index
of refraction, surface tension, density, polarisability, approximately 10 quantum-chemical
and approximately 250 topological descriptors. LogPoct values were calculated with the
ACD/LogP version 4.02 (Advanced Chemistry Development Inc., Toronto, ON, Canada) and
KOWWIN version 1.65 (Syracuse Research Corporation, Syracuse, NY, USA) programs, and
with the TSAR for Windows version 3.3 (Accelrys Inc.) and QsarIS version 1.1 (then
SciVision-Academic Press, San Diego, CA; currently MDL®QSAR by Elsevier MDL,
San Leandro, CA) software. LogDoct at pH of 7.4 was calculated with ACD/LogD version
4.02 (Advanced Chemistry Development Inc., Toronto, ON, Canada) by using the ion-
pair partitioning method. Aqueous solubilities were calculated with ACD/AqSol version
4.02 (Advanced Chemistry Development Inc., Toronto, ON, Canada) using a pH setting
of 7.0. Molecular weight, molar refractivity, molar volume, parachor, index of refraction,
surface tension, density and polarisability were calculated with the ACDLabs version 4.02
software (Advanced Chemistry Development Inc., Toronto, ON, Canada). The TSAR
version 3.3 software was used to calculate quantum-chemical and topological descriptors.
For quantum-chemical calculations, the VAMP package (Accelrys Inc.) was used,
applying the AM1 Hamiltonian. The QsarIS version 1.1 software was used to calculate
topological descriptors. The abbreviations used for the molecular descriptors are
presented in Table 10.3.
10.2.3. Statistical analysis
Before performing the statistical analysis, the descriptors that had values of zero for more
than 95% of the compounds were excluded from the descriptor data matrix. To reduce
descriptor multicollinearity, the algorithm developed by the author of this thesis
(described in Chapter 9) was also applied to the descriptor data matrices before the
statistical analysis was performed.
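The first pre-processing step, removal of descriptors that are zero for more than 95% of the compounds, can be sketched as follows. This is an illustration rather than the procedure actually used in the software packages: the function name and the toy descriptor names are hypothetical.

```python
def drop_mostly_zero(data, max_zero_fraction=0.95):
    """Keep descriptors whose fraction of zero values does not exceed the limit.

    data maps descriptor name -> list of values (one per compound).
    """
    kept = {}
    for name, values in data.items():
        zero_fraction = sum(1 for v in values if v == 0) / len(values)
        if zero_fraction <= max_zero_fraction:
            kept[name] = values
    return kept


# Hypothetical descriptors for ten compounds.
toy = {
    "n_rings": [0, 1, 2, 1, 0, 3, 1, 2, 0, 1],
    "n_iodine": [0] * 10,           # zero for every compound: removed
    "n_nitro": [0] * 9 + [1],       # zero for 90% of compounds: kept
}
print(sorted(drop_mostly_zero(toy)))
```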
To model the in vivo endpoints, in vitro data were used in addition to the calculated
molecular descriptors (QSAAR analysis).
QSAAR and QSAR models were obtained by linear regression, using Statistica version
5.5 (StatSoft Inc., Tulsa, OK, USA). Two approaches were used for selecting variables to
derive models with good statistical diagnostics, and allowing for physicochemical and/or
mechanistic interpretation. Firstly, forward stepwise regression was applied, selecting up
to five variables to include. If some of the variables from the derived model were
considered to lack physicochemical and/or mechanistic relevance, these variables were
excluded from the variable data set and the forward stepwise regression was performed
again. Secondly, the C program implementing best-subsets regression, developed by the
author (described in Chapter 9), was used. Only those variables that
intercorrelated with a coefficient of intercorrelation R less than 0.7 were included in the
same model.
Discriminant analysis was also performed with Statistica version 5.5. Variable selection
in the discriminant analysis was performed by a forward stepwise procedure, applying the
same considerations regarding physicochemical and/or mechanistic relevance of the
selected variables as in the regression analysis (see above). The assumptions of the
discriminant analysis for homogeneity of variable variance and variance/covariance
matrices within the groups were tested with the univariate Levene test and with the
multivariate Sen-Puri test (see Chapter 5, Section 5.4.2 “Assumptions of the linear
discriminant function analysis”), included in the ANOVA/MANOVA module of
Statistica version 5.5 software.
10.3. Results
10.3.1. Analysis of the ECVAM data set
The permeability data used in the study are shown in Table 10.1. The abbreviations used
for the structural descriptors are presented in Table 10.3, and the calculated values of the
descriptors included in the selected QSARs are presented in Table 10.4. Models based on
the whole set of compounds, as well as mechanism-specific models for passive diffusion,
were developed. The QSARs obtained for the whole set of compounds are presented in
Table 10.5.a; the QSARs for the passively diffused compounds are presented in Table
10.5.b.
10.3.2. Analysis of the data set taken from Platts et al. (2001)
Since argon, krypton, neon and xenon were not accepted by the TSAR software, BBB
penetration data for 153 (instead of 157) compounds were used for model development,
and are summarised in Table 10.2. The values of the descriptors included in the selected
QSARs are presented in Table 10.2. The QSARs obtained are presented in Table 10.6.
For the abbreviations of the molecular descriptors, see Table 10.3.
The goodness-of-fit of the QSAR (Equation 2, Table 10.6) improved when the following
compounds were excluded: mezoridazine, 4, Y-G 19, Y-G 20, Org12692 and thioridazine
(with the exception of mezoridazine, these compounds were also excluded from the final
model in the source paper (Platts et al., 2001)):
logBB = -0.211 (± 0.017) ΣH-bond + 0.204 (± 0.037) Nrings6 + 0.117 (± 0.032) logPoct + 0.165 (± 0.091) (10.3)
n = 147, R2 = 0.695, s = 0.388, F = 108.5
A semi-quantitative classification of compounds based on their permeation of the BBB
was performed using discriminant analysis. The reason for performing this was that the
biological data have been collected from different literature sources, which decreases the
quality of the models obtained, due to possible variability in the logBB values.
The compounds were classified into two groups with a cut-off value for logBB of 0. The
cut-off value was chosen so that compounds with logBB < 0 have a higher concentration in
blood than in brain under steady-state conditions (low penetrators), whereas compounds
with logBB > 0 have a higher concentration in brain (high penetrators). Discriminant
analysis revealed ΣH-bond (calculated using TSAR software) and logPoct (calculated using
QsarIS) as the best discriminating descriptors between the two groups. The Levene
univariate test and the Sen-Puri multivariate test were not statistically significant at
the 95% level, thus suggesting homogeneity of the variable variances and
variance/covariance matrices within the two groups. The following discriminant function
was obtained:
Group = -0.426 ΣH-bond + 0.475 logPoct + 0.209 (10.4)
Wilks’ λ = 0.564, F = 57.9.
The following classification functions were obtained:
S1 = 1.284 ΣH-bond + 1.035 logPoct – 4.658 (10.5)
S2 = 0.536 ΣH-bond + 1.869 logPoct – 3.897 (10.6)
where S1 and S2 are the classification scores of the cases for the first and the second
group respectively.
Compounds with a higher value of S1 than S2 will be classified as low penetrators and,
vice versa, compounds with a higher value of S2 than S1 will be classified as high penetrators.
The a priori classification probabilities were set to be equal for the two groups. The
classification matrix of the two-group classification based on the obtained classification
functions is presented in Table 10.7. From the table it can be seen that 82.4% of the
compounds from both groups were correctly classified.
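The use of the classification functions (Equations 10.5 and 10.6) can be illustrated with a short Python sketch. The coefficients are those reported above; the function names and the descriptor values in the example are hypothetical, and assigning a compound with S1 exactly equal to S2 to the high-penetrator group is an assumption made here, since this boundary case is not covered in the text.

```python
def classification_scores(sum_h_bond: float, log_p_oct: float):
    """Return (S1, S2) from Equations 10.5 and 10.6."""
    s1 = 1.284 * sum_h_bond + 1.035 * log_p_oct - 4.658  # low penetrators
    s2 = 0.536 * sum_h_bond + 1.869 * log_p_oct - 3.897  # high penetrators
    return s1, s2


def classify_bbb(sum_h_bond: float, log_p_oct: float) -> str:
    """Assign the group with the higher classification score.

    Assumption: a tie (S1 == S2) is assigned to the high penetrators.
    """
    s1, s2 = classification_scores(sum_h_bond, log_p_oct)
    return "low penetrator" if s1 > s2 else "high penetrator"


# Hypothetical descriptor values, for illustration only.
print(classify_bbb(sum_h_bond=0, log_p_oct=2.0))
print(classify_bbb(sum_h_bond=6, log_p_oct=0.0))
```

The difference S2 – S1 decreases with ΣH-bond and increases with logPoct, which matches the signs of the two descriptors in the discriminant function (Equation 10.4).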
Other cut-off values for classification of compounds according to BBB penetration were
attempted, but this did not improve the ability to classify the chemicals.
10.4. Discussion
In this study both in vivo BBB permeability and permeability through in vitro membrane
models for BBB were investigated. The in vivo BBB penetration was correlated with both
in vitro permeability coefficients and structural descriptors. QSAR models were
developed for the in vivo and in vitro compound penetration. Additionally, compounds
were classified into high and low BBB penetrators, and a classification model was
developed.
10.4.1. Interpretation of the models obtained for the ECVAM data set
QSAR models were developed for the whole data set as well as for the passively diffused
compounds only (Table 10.5). From the table it can be seen that the QSAR models for
passive diffusion (Table 10.5.b) have better statistical fit than the models based on all
compounds (Table 10.5.a). Models for in vivo penetration of compounds transported by
different mechanisms do not include logPoct. As might be expected, this descriptor better
reflects the in vivo penetration of compounds transported by passive diffusion than that
of compounds transported by other mechanisms. LogPoct or logDoct as descriptors of
lipophilicity appeared to be suitable
descriptors for the penetration through model membranes in vitro, with the exception of
the BBEC membrane.
For the whole data set of compounds, the in vitro membrane penetration model that
correlated best with the in vivo penetration was the BBEC model (Equations 1
and 3, Table 10.5.a), which is consistent with results from the literature (Lundquist et al.,
2002). When BBEC logPs were correlated with the in vivo logPe or logBB without
including structural descriptors, R2 values of 0.324 and 0.338, respectively, were
obtained; thus, adding structural descriptors (Balaban index) increased the model R2
values by approximately 0.23 (Equations 1 and 3, Table 10.5.a). QSAR models based on
the whole set of compounds (Equations 2 and 4, Table 10.5.a) indicate that the number of
H-donor atoms is important for the in vivo BBB penetration, which is in accordance with
many literature reports (see Chapter 7).
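The comparison between an in vitro-only correlation and a QSAAR can be sketched as an ordinary least-squares fit. The example below is a minimal illustration, not the software used in this study: it assumes that the in vivo logPe and BBEC values of Table 10.1 and the Balaban indices of Table 10.4 are the values that entered Equation 1, and the helper names (solve, ols) are hypothetical. If that assumption holds, the fitted R2 values should be close to those reported above.

```python
# Values transcribed from Tables 10.1 and 10.4 (21 ECVAM compounds, table order).
LOGPE_VIVO = [0.982, 1.041, -0.301, 0.908, -0.268, 0.477, 1.638, -0.699, 0.591,
              0.875, 1.270, 1.164, -0.097, 0.792, 1.173, -0.495, 0.079, 0.799,
              0.204, 0.301, -0.125]
BBEC = [1.250, 2.495, 0.813, 2.997, 0.633, 0.845, 2.520, 0.544, 1.332,
        0.863, 1.021, 1.182, 1.358, 2.991, 2.089, 0.602, 1.568, 1.462,
        1.297, 0.398, 1.458]
BALABAN = [2.9935, 1.7743, 1.7767, 1.7803, 1.9223, 2.5440, 1.3625, 0.7660,
           2.1375, 2.7542, 2.9935, 3.3766, 1.2687, 1.6601, 1.6394, 1.9642,
           2.3238, 1.7387, 0.9321, 0.9433, 1.5681]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(y, *predictors):
    """Least-squares fit with intercept; returns (coefficients, R2).

    The intercept is the last element of the coefficient list.
    """
    rows = [list(values) + [1.0] for values in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coef, 1.0 - ss_res / ss_tot


coef_vitro, r2_vitro = ols(LOGPE_VIVO, BBEC)           # in vitro endpoint only
coef_qsaar, r2_qsaar = ols(LOGPE_VIVO, BBEC, BALABAN)  # QSAAR: add Balaban index
print(f"in vitro only:        R2 = {r2_vitro:.3f}")
print(f"QSAAR (BBEC+Balaban): R2 = {r2_qsaar:.3f}, gain = {r2_qsaar - r2_vitro:.3f}")
```

Because the second model contains the first as a special case, its R2 can never be lower; the quantity of interest is the size of the gain.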
When only passively diffused compounds were considered, the Caco-2 cell membrane in
the apical to basolateral direction appeared to be best suited for describing in vivo
penetration (Equations 12 and 14, Table 10.5.b). When the Caco-2 Ps in apical to
basolateral direction were correlated with the in vivo logPe or logBB (without including
structural descriptors) the R2 values obtained were 0.577 and 0.587, respectively. Thus,
adding structural descriptors (the magnitude of the dipole moment) improved the R2 of
the correlations by approximately 0.31. A QSAR model for in vivo penetration included
logPoct and the dipole moment (Equations 13 and 15, Table 10.5.b). The dipole moment
might express the extent to which a molecule influences the electric field across lipid
membranes, which could affect the membrane transport. A weakness of the models for
passive diffusion is the low ratio of data points to independent variables (the equations
have 2 independent variables and only 9 data points, a ratio of 4.5), which does not meet the
recommendation of Topliss and Costello (1972). According to these authors, this ratio
should be at least five.
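The ratio check can be written out explicitly. The short sketch below uses the numbers of compounds and descriptors stated in Tables 10.5 and 10.6; the function name and the choice of example models are illustrative assumptions only.

```python
def points_per_descriptor(n_compounds, n_descriptors):
    """Ratio of data points to independent variables in a regression model."""
    return n_compounds / n_descriptors


# (number of compounds, number of descriptors), from Tables 10.5 and 10.6
MODELS = {
    "Eq. 1, all ECVAM compounds": (21, 2),
    "Eq. 12, passive diffusion only": (9, 2),
    "Table 10.6, Eq. 2": (153, 3),
}
MIN_RATIO = 5  # threshold quoted in the text

for name, (n, k) in MODELS.items():
    ratio = points_per_descriptor(n, k)
    verdict = "ok" if ratio >= MIN_RATIO else "below threshold"
    print(f"{name}: {ratio:.1f} points per descriptor ({verdict})")
```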
The improved statistical fit of the in vivo models, when descriptors of the chemical
structure were added to an in vitro endpoint, suggests that efforts to describe the in vivo
BBB penetration using membrane models in vitro can be combined with QSAR analysis.
The developed models will include descriptors which account for the differences between
the in vivo and in vitro penetration processes. In the present study the Balaban index
(encoding the molecular topology) and the magnitude of the dipole moment were used.
The Balaban index increases with increasing compound size and decreases with an
increasing number of rings in a molecule. Thus, according to the models (Equations 1
and 3, Table 10.5.a), large molecules with a smaller number of rings penetrate more easily
through the BBB. The dipole moment could affect the membrane transport by influencing
electrostatic interactions with the membranes.
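The dependence of the Balaban index on size and ring count can be illustrated with the basic topological definition, J = M/(mu + 1) multiplied by the sum over all bonds of (Di Dj)^(-1/2), where M is the number of bonds, mu the number of rings (cyclomatic number) and Di the sum of topological distances from atom i to all other atoms. The sketch below is an assumption-laden illustration: it treats the hydrogen-suppressed molecule as an unweighted graph and ignores any heteroatom or bond-order corrections that the TSAR implementation may apply, so it will not necessarily reproduce the values in Table 10.4.

```python
from collections import deque
from math import sqrt


def distance_sums(adjacency):
    """Sum of shortest-path (bond-count) distances from each atom to all others."""
    sums = {}
    for start in adjacency:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search over the molecular graph
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in dist:
                    dist[neighbour] = dist[atom] + 1
                    queue.append(neighbour)
        sums[start] = sum(dist.values())
    return sums


def balaban_j(bonds):
    """Unweighted Balaban index J of a connected hydrogen-suppressed graph."""
    adjacency = {}
    for a, b in bonds:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    n_atoms, n_bonds = len(adjacency), len(bonds)
    n_rings = n_bonds - n_atoms + 1  # cyclomatic number
    d = distance_sums(adjacency)
    return n_bonds / (n_rings + 1) * sum(1 / sqrt(d[a] * d[b]) for a, b in bonds)


propane = [(0, 1), (1, 2)]
hexane = [(i, i + 1) for i in range(5)]
cyclohexane = hexane + [(5, 0)]  # same six carbons, closed into a ring

for name, bonds in [("propane", propane), ("hexane", hexane),
                    ("cyclohexane", cyclohexane)]:
    print(f"{name}: J = {balaban_j(bonds):.3f}")
```

In this simplified form the index rises from propane to hexane (larger chain) and falls from hexane to cyclohexane (ring closure), which is the trend described above.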
From Table 10.5 it can be seen that the QSAR models for in vivo BBB penetration have only
a slightly worse statistical fit than the QSAAR models, which are more complex and
require in vitro data. Thus, it appears that the use of descriptors of chemical structure alone
gives results comparable to those of the more problematic approach of modelling in vivo BBB
penetration by using membrane systems in vitro.
The models obtained for BBEC permeability indicated that the number of H-donor and H-
acceptor atoms and the largest positive charge on a hydrogen atom (Hmaxp) are important
for penetration (Equations 5 and 16, Table 10.5). According to some authors (Wilson and
Famini, 1991) Hmaxp is also related to H-bond donating potential of a compound (see
Chapter 5, Section 3.5.5 “Hydrogen-bond donor and acceptor ability”).
The permeability of the SV-ARBEC in the apical to basolateral direction was correlated
with the sum of the E-state indices of all the individual atoms in the molecule (ΣE-state)
(Equation 6, Table 10.5.a). According to Rose et al. (2002), the E-state indices account for
the ability of a molecule to enter into non-covalent intermolecular interactions, and thus
they encode factors that could influence binding to the membrane transporters. The
permeability of SV-ARBEC in the basolateral to apical direction was correlated with the
logarithm of the octanol-water distribution coefficient (logDoct) and the number of H-
donor atoms (Equation 7, Table 10.5.a). LogDoct accounts for the degree of compound
ionisation in the process of distribution between octanol and water. For the subset of
passively diffused compounds, no QSARs with good statistical parameters were obtained.
Models for predicting the permeability of the Caco-2 membrane, based on all compounds,
also indicate the importance of hydrogen bonding and logDoct for the transport processes
(Equations 8 and 9, Table 10.5.a). In the case of the passively diffused compounds, the Ps
in the apical to basolateral, and the basolateral to apical, directions are highly
intercorrelated (intercorrelation coefficient R = 0.93, compared with R = 0.62 when all
compounds were included). Thus, the same descriptors determine the transport in the two
directions, namely the number of H-donors and logPoct (Equations 17 to 20,
Table 10.5.b).
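The intercorrelation for the passively diffused compounds can be recomputed from Table 10.1. The sketch below is illustrative only: it assumes that the nine compounds marked "passive" in Table 10.1 form the subset in question and that the tabulated Caco-2 values are those used, and the function name is hypothetical.

```python
from math import sqrt

# (apical-to-basolateral, basolateral-to-apical) Caco-2 logPapp values from
# Table 10.1 for the nine compounds listed there as passively transported.
CACO2_PASSIVE = {
    "antipyrine": (1.630, 1.775),
    "caffeine": (1.635, 1.798),
    "diazepam": (1.613, 1.742),
    "glycerol": (1.167, 0.398),
    "nicotine": (1.455, 1.608),
    "phenytoin": (1.602, 1.689),
    "sucrose": (0.146, -0.347),
    "urea": (0.771, 0.863),
    "warfarin": (1.504, 1.683),
}


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


a_to_b = [pair[0] for pair in CACO2_PASSIVE.values()]
b_to_a = [pair[1] for pair in CACO2_PASSIVE.values()]
r = pearson_r(a_to_b, b_to_a)
print(f"R = {r:.2f} for n = {len(a_to_b)} passively diffused compounds")
```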
The best models for penetration across the MDCK membrane included logPoct, the shape
flexibility index, and the largest positive charge on a hydrogen atom (Equations 10, 11,
21, and 22, Table 10.5).
10.4.2. Interpretation of the models obtained for the data set taken from Platts et al.
(2001)
The derived QSARs for this data set are presented in Table 10.6 and Equation 10.3.
LogPoct and the numbers of H-donor and H-acceptor atoms again appear to be the most
suitable predictors for penetration across the BBB in vivo. According to the models,
forming H-bonds is unfavourable for passage across the BBB, while increasing logPoct
(hydrophobicity) tends to increase logBB. This is in accordance with results from the
literature (see Chapter 7). The number of 6-membered rings (Nrings6) may be related to the
volume or shape of the molecules, or may indicate differences in the aromatic character of
the compounds.
Additionally, a classification model for compounds as low and high penetrators of the
BBB was developed, by using a logBB cut-off value of 0.0. The variables that appeared
to discriminate between the groups were again related to the H-bonding capacity (ΣH-bond)
and the lipophilicity (logPoct). The model correctly classified 82.4% of the
compounds in both groups, while the expected classification by chance was 50.6%
(calculated by using Equation 5.20, Chapter 5, Section 5.4 “Linear discriminant function
analysis”). Thus, applying the discriminant model improved the
classification accuracy by approximately 30 percentage points. From the obtained classification functions
(Equations 10.5 and 10.6) it can be seen that a compound will have a higher value of S2
than S1 (the compound will be classified as high penetrator), if:
0.748 ΣH-bond < 0.834 logPoct - 0.761, equivalent to:
ΣH-bond < 1.115 logPoct - 1.017 (10.7)
As the coefficient of logPoct and the free term of this expression are very close to one, it
can be concluded that a compound should have a ΣH-bond value smaller than its logPoct
value minus one, to be classified as a high penetrator. Additionally, compounds having
negative logPoct values will always be classified as low penetrators (in accordance with
results found in the literature, see Chapter 7).
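The screening rule can be expressed in a few lines of code. The sketch below implements inequality 10.7 in its unrounded form together with the simplified "logPoct minus one" rule; the function names are hypothetical, and the example compounds and descriptor values are taken from Table 10.2.

```python
def is_high_penetrator(sum_hbond, logp_oct):
    """Condition S2 > S1, written as 0.748 * SumH-bond < 0.834 * logPoct - 0.761."""
    return 0.748 * sum_hbond < 0.834 * logp_oct - 0.761


def is_high_penetrator_simple(sum_hbond, logp_oct):
    """Rounded rule of thumb: SumH-bond must be smaller than logPoct minus one."""
    return sum_hbond < logp_oct - 1


# (SumH-bond, logPoct) as listed in Table 10.2
EXAMPLES = {
    "benzene": (0, 1.95),
    "imipramine": (1, 4.71),
    "cimetidine": (6, 0.16),
    "ethanol": (2, -0.03),
}

for name, (hb, logp) in EXAMPLES.items():
    label = "high" if is_high_penetrator(hb, logp) else "low"
    print(f"{name}: predicted {label} penetrator")

# Even a compound with no H-bond donors or acceptors needs logPoct above
# 0.761 / 0.834 to be classified as a high penetrator, so negative logPoct
# values always lead to the "low" class.
min_logp = 0.761 / 0.834
print(f"minimum logPoct for the high class: {min_logp:.2f}")
```

The two functions can disagree close to the decision boundary because the simplified rule rounds both coefficients to one; the unrounded form corresponds to the classification functions themselves.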
10.4.3. Contribution to existing knowledge
In the present study a new data set (the ECVAM data set) was investigated, and QSARs
were obtained for in vivo BBB penetration and for membrane transport through four in
vitro BBB model systems. The importance of compound lipophilicity and H-bonding
ability for the membrane transport was confirmed. However, the results suggested that
factors other than compound lipophilicity (encoded by logPoct or logDoct) mainly
influence the in vivo penetration of compounds transported by different mechanisms
(passive and active transport). In the present study these factors were related to size of
molecules and presence of rings (encoded by the Balaban index) and the number of H-
bond donors. The values of logPoct reflected better the penetration of compounds
transported by passive diffusion only.
QSAAR analysis was also applied to the data set. As far as the author is aware, this is the
first attempt to model in vivo BBB penetration by combining in vitro membrane
permeability data with descriptors of chemical structure (QSAAR analysis). The QSAAR
analysis suggested that adding structural descriptors results in better models for in vivo
BBB transport than did those based on in vitro membrane systems alone. However, using
descriptors of chemical structure alone (QSAR analysis) resulted in models with
comparable statistical fit to that of the QSAAR models.
A second data set was investigated, for which a QSAR model was previously developed
by Platts et al. (2001) (Equation 10.2). A QSAR model was developed (Equation 10.3)
with a slightly worse statistical fit than the model developed in the source paper, but using
a smaller number of descriptors of chemical structure. A new classification model of the
compounds into high and low penetrators was derived. On the basis of this classification
model, a simple relation between logPoct value and the sum of H-bond donors and
acceptors for a compound, governing its classification as a high or low penetrator, was
obtained. This relation could be used for fast screening of large numbers of compounds.
10.5. Conclusions
QSAARs for the in vivo BBB penetration combining permeability data for in vitro
membrane systems and descriptors of chemical structure gave better results than
modelling of in vivo BBB transport by using in vitro membrane systems alone. The
BBEC membrane system appeared to be the most suitable in vitro system for modelling
BBB penetration of compounds transported by different mechanisms (passive and active
transport), while for the passively diffused compounds alone the Caco-2 cell system
reflected best the in vivo BBB penetration. Factors that accounted for differences between
the in vivo and in vitro penetration processes were encoded by the Balaban index and the
magnitude of the dipole moment.
Lipophilicity and H-bonding ability were shown to be important molecular properties for
membrane transport (consistent with literature data). However, logPoct did not appear to
be a suitable descriptor for the in vivo penetration of compounds transported by different
mechanisms; it appeared to determine the in vivo passive transport only.
According to a classification model for compounds as low and high penetrators of the
BBB, the sum of H-bond donor and acceptor atoms for a compound should be smaller
than its logPoct value minus one, in order to classify the compound as a high penetrator.
This relation is simple and can be used for fast compound screening.
Figure 10.1. A schematic representation of the in vitro blood–brain barrier co-culture
model – brain capillary endothelial cells, co-cultured with astrocytes (redrawn from
Gaillard and De Boer, 2000)
Abbreviations: BCEC – brain capillary endothelial cells; AC – astrocytes
Table 10.1. ECVAM data set of 21 compounds and their in vivo BBB and in vitro permeability coefficients
No  Name           Transport mechanism           in vivo     in vivo     BBEC        SV-ARBEC*   SV-ARBEC**  Caco-2* d   Caco-2** d  MDCK* d     MDCK** d
                                                 logPe a     logBB b     logPe a     logPe a     logPe a     logPapp a   logPapp a   logPapp a   logPapp a
1   L-alanine      active influx (system A)      0.982       1.826       1.250       1.000       2.180       0.562       0.598       0.301       0.204
2   antipyrine     passive                       1.041       1.898       2.495       1.360       1.358       1.630       1.775       1.639       1.809
3   AZT            efflux (OAT)                 -0.301       0.477       0.813       1.250       1.407       1.033       1.328       0.371       0.342
4   caffeine       passive                       0.908       1.903       2.997       1.072       1.936       1.635       1.798       1.600       1.778
5   cimetidine     efflux                       -0.268       0.531       0.633       0.813       1.053      -0.097       0.686      -0.097       0.114
6   cyclosporin A  efflux (P-glycoprotein)       0.477       1.301       0.845       0.176       0.602      -0.187       1.086       0.041       0.968
7   diazepam       passive                       1.638       2.502       2.520       1.538       0.919       1.613       1.742       1.574       1.701
8   digoxin        efflux (mdr-1)               -0.699       0.000       0.544       0.398       0.919      -0.022       1.348      -0.125       1.211
9   L-dopa         active influx (aa-carrier)    0.591       1.491       1.332       1.217       0.978      -0.022       0.279      -0.155      -0.398
10  glycerol       passive                       0.875       1.785       0.863       1.367       1.238       1.167       0.398      -0.222       0.114
11  lactic acid    active influx (MCT1+2)        1.270       2.155       1.021       1.190       1.591      -0.022       0.061       0.484       0.398
12  L-leucine      active influx (system L)      1.164       2.000       1.182       1.393       1.391       1.186       0.699       0.267       0.230
13  morphine       efflux (P-glycoprotein)      -0.097       0.699       1.358       1.336       1.267       0.916       1.092       0.342       0.505
14  nicotine       passive                       0.792       1.708       2.991       1.538       1.615       1.455       1.608       1.472       1.654
15  phenytoin      passive                       1.173       2.021       2.089       1.305       1.243       1.602       1.689       1.534       1.602
16  sucrose        passive                      -0.495       0.301       0.602       0.934       1.004       0.146      -0.347      -0.699      -0.699
17  urea           passive                       0.079       0.954       1.568       1.589       1.484       0.771       0.863       0.079       0.097
18  verapamil      efflux (P-glycoprotein)       0.799       1.613       1.462       0.987       - c         1.196       1.609       1.146       1.520
19  vinblastine    efflux (P-glycoprotein)       0.204       1.000       1.297       0.301       1.210       0.312       1.545       0.114       1.299
20  vincristine    efflux (MRP1)                 0.301       1.255       0.398       0.826       0.813      -1.071       0.808      -0.699       0.332
21  warfarin       passive                      -0.125       0.699       1.458       0.519       1.648       1.504       1.683       1.306       1.453
a all permeability coefficients (P) are given as the log (P x 10e-6). Ps are in units of cm/s
b BB = concentration in the brain / concentration in the blood * 100 (%)
c the SV-ARBEC (basolateral to apical direction) P value for verapamil was given as negative, and therefore discarded from the analyses
d mean value from two experiments
* permeability coefficients obtained in apical to basolateral direction of the membrane
** permeability coefficients obtained in basolateral to apical direction of the membrane
Table 10.2. Biological data and values of the molecular descriptors included in the
QSARs for BBB penetration, developed from the data set of Platts et al. (2001)
No   Name                              logBB e  logPoct f ΣH-bond g Nrings6 g
1    1 (cimetidine)                    -1.42     0.16      6        0
2    2 c                               -0.04     0.84      3        0
3    4 b, c                            -1.30     4.08      6        3
4    5 c                               -1.06     1.98      7        2
5    6 (clonidine)                      0.11     1.71      3        1
6    7 (mepyramine)                     0.49     2.58      3        2
7    8 (imipramine)                     1.06     4.71      1        2
8    9 (ranitidine) b                  -1.23     0.20      6        0
9    10 (tiotidine)                    -0.82     0.31      8        0
10   13 c                              -0.67     1.88      5        1
11   14 c                              -0.66     1.20      5        1
12   15 c                              -0.12     2.46      5        2
13   16 c                              -0.18     1.78      4        1
14   17 c                              -1.15     1.11      5        1
15   18 c                              -1.57     1.19      6        1
16   19 c                              -1.54     1.46      7        1
17   20 b, c                           -1.12     0.72      6        0
18   21 b, c                           -0.73     2.38      6        1
19   22 c                              -0.27     2.64      6        1
20   23 c                              -0.28     2.72      6        2
21   24 c                              -0.46     3.22      4        2
22   25 c                              -0.24     4.16      4        3
23   26 c                              -0.02     2.71      4        2
24   27 c                               0.69     3.67      4        3
25   28 c                               0.44     3.63      4        2
26   29 c                               0.14     5.17      4        3
27   30 c                               0.22     4.41      5        3
28   31 (carbamazepine) d               0.00     2.51      2        2
29   32 (epoxide of carbamazepine) d   -0.34     1.68      3        2
30   33 d                              -0.30     2.72      5        1
31   34 d                              -1.34     2.47      7        1
32   35 d                              -1.82     2.25      9        1
33   36 (amitriptyline) d               0.89     4.88      1        2
34   acetaminophen                     -0.31     0.67      4        1
35   acetylsalicylic acid              -0.50     1.06      5        1
36   alprazolam                         0.04     2.71      3        2
37   aminopyrine                        0.00     1.19      1        1
38   amobarbital                        0.04     1.84      5        1
39   antipyrine                        -0.10     1.26      1        1
40   argon a                            0.03     4.70      -        -
41   atenolol                          -1.42     0.75      7        1
42   benzene                            0.37     1.95      0        1
43   bretazenil                        -0.09     2.99      4        1
44   bromperidol                        1.38     4.08      4        3
45   butanone                          -0.08     0.57      1        0
46   caffeine                          -0.05     0.12      3        1
47   chlorpromazine                     1.06     5.20      1        3
48   clobazam                           0.35     1.75      2        2
49   codeine                            0.55     1.62      5        4
50   CS2                                0.60     1.96      2        0
51   cyclohexane                        0.92     3.40      0        1
52   cyclopropane                       0.00     1.57      0        0
53   desipramine                        1.20     4.54      2        2
54   desmethyldesipramine               1.06     3.85      2        2
55   desmethylclobazam                  0.36     2.39      3        2
56   desmethyldiazepam                  0.50     2.89      3        2
57   desmonomethylpromazine             0.59     4.50      2        3
58   diazepam                           0.52     2.91      2        2
59   dichloromethane                   -0.11     1.33      0        0
60   didanosine                        -1.30     0.14      8        1
61   diethyl ether                      0.00     1.07      1        0
62   2,2-dimethylbutane                 1.04     3.12      0        0
63   divinyl ether                      0.11     1.41      1        0
64   enflurane                          0.24     2.38      1        0
65   ethanol                           -0.16    -0.03      2        0
66   ethylbenzene                       0.20     3.14      0        1
67   flumazenil                        -0.29     1.17      4        1
68   flunitrazepam                      0.06     2.06      4        2
69   fluphenazine b                     1.51     4.13      4        4
70   fluroxene                          0.13     1.44      1        0
71   haloperidol                        1.34     4.08      4        3
72   halothane                          0.35     2.33      0        0
73   heptane                            0.81     4.40      0        0
74   hexane                             0.80     3.87      0        0
75   hexobarbital                       0.10     1.53      4        2
76   1-hydroxymidazolam                -0.07     2.26      4        2
77   4-hydroxymidazolam                -0.30     2.32      4        2
78   9-hydroxyrisperidone              -0.67     3.46      7        4
79   hydroxyzine                        0.39     2.37      5        3
80   ibuprofen                         -0.18     3.76      3        1
81   indinavir                         -0.74     3.14     11        4
82   indomethacin                      -1.26     3.32      5        2
83   isoflurane                         0.42     2.31      1        0
84   krypton a                         -0.16     4.74      -        -
85   M2L-663581                        -1.82     0.74      9        1
86   mesoridazine                      -0.36     4.39      2        4
87   methane                            0.04     4.58      0        0
88   methohexital                      -0.06     2.63      5        1
89   methoxyflurane                     0.25     2.11      1        0
90   methylcyclopentane                 0.93     3.30      0        0
91   3-methylhexane                     0.90     4.14      0        0
92   2-methylpentane                    0.97     3.51      0        0
93   3-methylpentane                    1.01     3.65      0        0
94   2-methylpropan-1-ol               -0.17     0.76      2        0
95   mianserin                          0.99     3.17      1        3
96   midazolam                          0.36     2.92      2        2
97   MIL-663581                        -1.34     1.46      7        1
98   mirtazapine                        0.53     2.49      2        3
99   morphine                          -0.16     1.03      6        4
100  neon a                             0.20     4.69      -        -
101  nevirapine                         0.00     1.31      4        2
102  nitrogen                           0.03     0.41      2        0
103  nitrous oxide                      0.03    -0.26      2        0
104  nor-1-chlorpromazine               1.37     5.01      2        3
105  nor-2-chlorpromazine               0.97     4.48      2        3
106  northioridazine                    0.75     5.53      2        4
107  Org12692 b                         1.64     2.64      2        2
108  Org13011                           0.16     2.03      3        2
109  Org30526                           0.39     4.05      3        2
110  Org32104                           0.52     2.97      5        3
111  Org34167                           0.00     3.49      4        2
112  Org4428                            0.82     3.33      4        3
113  Org5222                            1.03     4.41      2        2
114  oxazepam                           0.61     2.22      5        2
115  paraxanthine                       0.06    -0.53      4        1
116  pentane                            0.76     3.33      0        0
117  pentobarbital                      0.12     1.93      5        1
118  phenylbutazone                    -0.52     2.98      2        2
119  phenytoin                         -0.04     2.23      4        2
120  promazine                          1.23     4.83      1        3
121  propan-1-ol                       -0.16     0.34      2        0
122  propan-2-ol                       -0.15     0.28      2        0
123  propanone                         -0.15     0.03      1        0
124  propranolol                        0.64     3.07      5        2
125  quinidine                         -0.46     2.85      5        4
126  risperidone                       -0.02     4.13      5        4
127  RO19-4603                         -0.25     1.76      4        0
128  salicylic acid                    -1.10     1.60      5        1
129  salicyluric acid                  -0.44     0.87      7        1
130  SF6                                0.36     3.18      0        0
131  SKF 101468                         0.25     2.37      3        1
132  SKF 89124                         -0.43     1.20      5        1
133  sulforidazine                      0.18     4.31      3        4
134  teflurane                          0.27     2.14      0        0
135  theobromine                       -0.28    -0.63      4        1
136  theophylline                      -0.29     0.19      4        1
137  thiopental                        -0.14     2.61      5        1
138  thioridazine b                     0.24     5.88      1        4
139  tibolone                           0.40     2.52      3        3
140  toluene                            0.37     2.66      0        1
141  triazolam                          0.74     3.08      3        2
142  1,1,1-trichloroethane              0.40     1.85      0        0
143  trichloroethene                    0.34     2.29      0        0
144  trichloromethane                   0.29     1.76      0        0
145  1,1,1-trifluoro-2-chloroethane     0.08     1.33      0        0
146  trifluoperazine                    1.44     4.50      2        4
147  valproic acid                     -0.22     2.39      3        0
148  xenon a                            0.03     4.78      -        -
149  2-xylene                           0.37     3.22      0        1
150  3-xylene                           0.29     3.17      0        1
151  4-xylene                           0.31     3.11      0        1
152  Y-G 14                            -0.30     0.85      3        1
153  Y-G 15                            -0.06     1.42      2        1
154  Y-G 16                            -0.42     0.29      3        0
155  Y-G 19 b                          -1.30     2.02      3        1
156  Y-G 20 b                          -1.40     0.88      3        1
157  zidovudine                        -0.72    -0.32      7        1
a compounds not accepted by the TSAR software and omitted from the final model
b compounds omitted in the final model of the source paper
c in the source paper (Platts et al., 2001), these compounds are indicated to be taken from Young et al. (1988). The number represents the numeration from Young et al. (1988)
d in the source paper (Platts et al., 2001), these compounds are indicated to be taken from Norinder et al. (1998). The number represents the numeration from Norinder et al. (1998)
e BB = concentration in the brain / concentration in the blood
f calculated with the QsarIS software
g calculated with the TSAR software
Table 10.3. Physicochemical descriptors used in the selected QSARs for BBB penetration
Symbol Name
Balaban Balaban index
Flex shape flexibility index
Hmaxp largest positive charge on a hydrogen atom.
logPoct logarithm of the octanol-water partition coefficient
logDoct logarithm of the octanol-water distribution coefficient at pH of 7.4
µ magnitude of the molecular dipole moment
NH-accept number of H-bond acceptor atoms
NH-donors number of H-bond donor atoms
ΣE-state sum of the E-state indices of all atoms in the molecule
ΣH-bond sum of the H-bond donor and acceptor atoms
Table 10.4. Calculated values of the structural descriptors included in the selected QSARs derived from the ECVAM data set
No Name ACD/LogP KOWWIN ACD/LogD TSAR TSAR TSAR TSAR QsarIS QsarIS QsarIS QsarIS
logPoct logPoct logDoct Balaban NH-donors ΣE-state µ a logPoct NH-accept Flex Hmaxp
1 L-alanine -2.77 -4.15 -3.18 2.9935 1 22.00 10.74 -2.033 3 1.764 0.229
2 antipyrine 0.27 0.59 0.27 1.7743 0 32.00 4.55 1.257 3 2.191 0.065
3 AZT -0.88 0.23 -0.58 1.7767 2 56.00 4.14 -0.322 9 4.000 0.211
4 caffeine -0.07 0.16 -0.081 1.7803 0 37.67 3.63 -0.504 6 1.845 0.098
5 cimetidine 0.36 0.57 0.21 1.9223 3 38.13 4.38 0.619 7 6.319 0.224
6 cyclosporin A 11.28 1.00 11.28 2.5440 5 211.83 - b - c - c - c - c
7 diazepam 2.96 2.70 2.96 1.3625 0 42.81 3.72 2.905 4 3.384 0.077
8 digoxin 2.06 0.50 2.23 0.7660 6 128.58 6.84 0.774 14 11.215 0.210
9 L-dopa -2.06 -3.40 -2.73 2.1375 3 44.50 8.81 -1.767 5 3.317 0.229
10 glycerol -2.41 -1.65 -2.41 2.7542 3 22.33 1.70 -1.659 3 3.023 0.210
11 lactic acid -0.70 -0.65 -6.09 2.9935 2 24.00 2.02 -0.825 3 1.764 0.229
12 L-leucine -1.72 -2.75 -1.77 3.3766 1 26.83 10.76 -0.873 3 3.421 0.229
13 morphine 1.06 0.72 0.053 1.2687 2 45.25 3.57 1.026 4 2.212 0.210
14 nicotine 0.72 1.00 -1.02 1.6601 0 22.50 3.29 1.594 2 2.086 0.085
15 phenytoin 2.52 2.16 2.50 1.6394 2 46.92 2.99 2.366 4 2.739 0.216
16 sucrose -3.65 -4.27 -3.65 1.9642 8 74.92 3.02 -2.970 11 5.920 0.210
17 urea -1.58 -1.56 -1.58 2.3238 2 16.67 4.00 -2.133 3 0.876 0.200
18 verapamil 5.03 4.80 3.38 1.7387 0 70.58 3.45 5.671 6 10.491 0.082
19 vinblastine 4.85 4.32 4.12 0.9321 3 131.92 2.02 0.072 13 9.821 0.212
20 vincristine 3.47 3.11 2.77 0.9433 3 138.92 5.43 0.902 14 10.051 0.212
21 warfarin 3.47 2.23 0.57 1.5681 1 58.00 7.98 3.803 4 4.189 0.197
a calculated with the VAMP package, using AM1 Hamiltonian
b not calculated by the VAMP package
c cyclosporin A was not accepted by the QsarIS software
Table 10.5. QSARs for BBB and membrane penetration, obtained from the ECVAM data
set
Table 10.5. a) all compounds included†
No   Permeability       Regression equation                                                                                R2     s      F
     coefficients (Ps)
1    in vivo logPe      0.517 (± 0.131) BBEC logPs + 0.446 (± 0.145) Balaban - 1.10 (± 0.36)                               0.558  0.457  11.4
2    in vivo logPe      0.333 (± 0.148) Balaban - 0.187 (± 0.050) NH-donors - 0.273 (± 0.331)                              0.539  0.467  10.5
3    in vivo logBB      0.559 (± 0.138) BBEC logPs + 0.471 (± 0.152) Balaban - 0.355 (± 0.380)                             0.569  0.479  11.9
4    in vivo logBB      0.349 (± 0.156) Balaban - 0.201 (± 0.053) NH-donors + 1.12 (± 0.35)                                0.543  0.493  10.7
5a   BBEC logPs         -8.83 (± 1.73) Hmaxp - 0.0737 (± 0.0264) NH-accept + 3.49 (± 0.33)                                 0.720  0.441  21.9
6    SV-ARBEC logPs*    -0.00674 (± 0.00110) ΣE-state + 1.47 (± 0.09)                                                      0.663  0.248  37.4
7b   SV-ARBEC logPs**   -0.0506 (± 0.0169) logDoct - 0.0973 (± 0.0294) NH-donors + 1.53 (± 0.09)                           0.581  0.264  11.8
8a   Caco-2 logPs*      0.143 (± 0.049) logDoct - 0.165 (± 0.033) NH-accept + 1.80 (± 0.23)                                0.604  0.510  13.0
9    Caco-2 logPs**     0.0749 (± 0.0187) logDoct - 0.377 (± 0.095) Balaban - 0.197 (± 0.030) NH-donors + 2.20 (± 0.21)    0.837  0.275  29.2
10a  MDCK logPs*        0.303 (± 0.046) logPoct c - 0.157 (± 0.030) Flex + 1.10 (± 0.16)                                   0.759  0.408  26.8
11a  MDCK logPs**       0.157 (± 0.040) logPoct d - 7.20 (± 1.69) Hmaxp + 1.99 (± 0.33)                                    0.748  0.421  25.3
† number of compounds in the training set n = 21
a cyclosporin A was not accepted by the QsarIS software, so the number of compounds in the training set was 20
b the SV-ARBEC (basolateral to apical direction) P value for verapamil was determined as negative, and it was excluded from the model, so the number of compounds in the training set was 20
c calculated with the QsarIS software
d calculated with the ACD/LogP software
* permeability coefficients obtained in apical to basolateral direction of the membrane (see Figure 3.1)
** permeability coefficients obtained in basolateral to apical direction of the membrane (see Figure 3.1)
Table 10.5. b) only passively diffused compounds included†
No   Permeability       Regression equation                                                    R2     s      F
     coefficients (Ps)
12   in vivo logPe      1.21 (± 0.19) Caco-2 logPs* - 0.226 (± 0.058) µ - 0.0198 (± 0.3042)    0.881  0.274  22.3
13   in vivo logPe      0.289 (± 0.051) logPoct a - 0.300 (± 0.067) µ + 1.77 (± 0.27)          0.862  0.295  18.8
14   in vivo logBB      1.274 (± 0.179) Caco-2 logPs* - 0.242 (± 0.053) µ + 0.835 (± 0.280)    0.907  0.255  29.3
15   in vivo logBB      0.297 (± 0.055) logPoct a - 0.315 (± 0.072) µ + 2.71 (± 0.29)          0.853  0.317  17.4
16   BBEC logPs         -0.282 (± 0.072) NH-donors + 2.45 (± 0.22)                             0.684  0.529  15.2
17   Caco-2 logPs*      -0.179 (± 0.032) NH-donors + 1.60 (± 0.10)                             0.813  0.237  30.4
18   Caco-2 logPs*      0.201 (± 0.038) logPoct a + 1.25 (± 0.08)                              0.801  0.245  28.1
19   Caco-2 logPs**     -0.275 (± 0.043) NH-donors + 1.73 (± 0.13)                             0.852  0.317  40.2
20   Caco-2 logPs**     0.309 (± 0.052) logPoct a + 1.20 (± 0.31)                              0.835  0.334  35.5
21   MDCK logPs*        0.365 (± 0.068) logPoct a + 0.865 (± 0.146)                            0.805  0.438  28.9
22   MDCK logPs**       0.374 (± 0.070) logPoct a + 1.000 (± 0.150)                            0.804  0.449  28.8
† number of compounds in the training set n = 9
a calculated with the KOWWIN program
* permeability coefficients obtained in apical to basolateral direction of the membrane
** permeability coefficients obtained in basolateral to apical direction of the membrane
Table 10.6. QSARs for in vivo BBB penetration, obtained from the data set of Platts et al.
(2001) †
No  BBB penetration  Regression equation                                                                               R2     s      F
1   LogBB            -0.165 (± 0.017) ΣH-bond + 0.212 (± 0.028) logPoct + 0.0507 (± 0.1053)                            0.565  0.479  97.3
2   LogBB            -0.208 (± 0.020) ΣH-bond + 0.176 (± 0.043) Nrings6 + 0.109 (± 0.037) logPoct + 0.179 (± 0.105)    0.609  0.455  77.5
† number of compounds in the training set n = 153
Table 10.7. Two-group classification matrix obtained by discriminant analysis of the data
set of Platts et al. (2001)
                               Number of predicted compounds
Number of observed compounds   Low penetrators   High penetrators   Percent of correct classification
Low penetrators                56                12                 82.4
High penetrators               15                70                 82.4
Total                          71                82                 82.4
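The percentages in Table 10.7 and the chance level quoted in the Discussion can be recomputed from the classification matrix. The sketch below assumes that the chance level is the proportional chance criterion (the sum of squared group proportions); this is an assumption, since Equation 5.20 is not reproduced in this chapter, although it does yield the quoted value of 50.6%.

```python
# Classification matrix from Table 10.7: observed group -> predicted group -> count
MATRIX = {
    "low": {"low": 56, "high": 12},
    "high": {"low": 15, "high": 70},
}

group_sizes = {group: sum(row.values()) for group, row in MATRIX.items()}
total = sum(group_sizes.values())

# Percent of correctly classified compounds, per group and overall
per_group = {g: 100 * MATRIX[g][g] / group_sizes[g] for g in MATRIX}
overall = 100 * sum(MATRIX[g][g] for g in MATRIX) / total

# Proportional chance criterion (assumed form of Equation 5.20)
chance = 100 * sum((size / total) ** 2 for size in group_sizes.values())

print(f"n = {total}, group sizes = {group_sizes}")
print(f"correct: low {per_group['low']:.1f}%, high {per_group['high']:.1f}%, "
      f"overall {overall:.1f}%")
print(f"expected by chance: {chance:.1f}%, improvement: {overall - chance:.1f} points")
```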
CHAPTER 11
INVESTIGATION OF BBB PENETRATION IN RELATION TO
P-GLYCOPROTEIN INTERACTIONS
11.1. Objectives
In Chapter 10 an investigation of blood-brain barrier (BBB) permeability for compounds
transported by different mechanisms is presented. In Chapter 11 a study of specific
interactions with one of the transport systems present in the BBB, namely the efflux
protein P-glycoprotein (P-gp), is described. P-gp plays a very important role in protecting
cells from xenobiotic compounds such as toxicants and drugs. It is associated with the
development of resistance to a wide variety of drugs (the so-called multidrug resistance
(MDR) phenomenon). 3D-(Q)SAR approaches were applied (GASP, CoMFA, CoMSIA),
which are appropriate for investigating compounds following similar binding modes to a
biological macromolecule.
11.2. Methods
11.2.1. Biological data
Compounds from the data set taken from Platts et al. (2001) and investigated in Chapter
10 were also used to study interactions with P-gp. As different mechanisms can be
involved in the transport across the BBB, a subset of 16 structurally related compounds
was selected from the whole data set. The selected compounds are phenothiazine and
imipramine derivatives, for which similar mechanisms of interaction with the BBB were
assumed. Catamphiphilic compounds like phenothiazine and imipramine derivatives have
been investigated for membrane activity (Pajeva et al., 1996) and ability to overcome
multidrug resistance (MDR) in tumour cells (Ford et al., 1989; Ramu and Ramu, 1992).
These studies show that phenothiazines and imipramines interact with membrane
phospholipids, and with the MDR transport P-gp, responsible for the active drug efflux
from the cells. Thus, it can be suggested that the BBB penetration of this group of
compounds is regulated by similar transport mechanisms, namely passive diffusion
through the membrane and active outward transport by P-gp.
The chemical structures and the in vivo BBB penetration data for the 16 selected
compounds taken from Platts et al. (2001) are presented in Table 11.1. As described in
Chapter 10 the BBB penetration of a compound was expressed by the (percentage) BB
ratio, with BB = (Cbrain/Cblood)*100 %, where Cbrain and Cblood are the steady-state
concentrations of the compound in the brain and in the blood, respectively.
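Because BB appears both as a plain ratio (footnote e of Table 10.2) and as a percentage (footnote b of Table 10.1 and the definition above), it may help to make the conversion explicit. The sketch below is illustrative only; the function name and the as_percent flag are hypothetical. The two conventions differ by exactly two log units.

```python
from math import log10


def log_bb(c_brain, c_blood, as_percent=False):
    """log10 of the brain/blood concentration ratio.

    With as_percent=True the ratio is first multiplied by 100, which
    shifts the result by +2 log units.
    """
    ratio = c_brain / c_blood
    return log10(ratio * 100 if as_percent else ratio)


# Hypothetical steady-state concentrations in arbitrary but identical units
c_brain, c_blood = 2.75, 0.1
print(f"logBB (ratio):      {log_bb(c_brain, c_blood):.2f}")
print(f"logBB (percentage): {log_bb(c_brain, c_blood, as_percent=True):.2f}")
```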
11.2.2. 3D-(Q)SAR methodology
The Sybyl version 6.8 molecular modelling software (Tripos Inc., St. Louis, MO, USA)
was used to perform the 3D-SAR and QSAR analyses. Molecular mechanical calculations
were performed with the Tripos force field (Powell method, no electrostatics, and 0.05
kcal/mol convergence criterion). The semi-empirical method AM1 (MOPAC version 6.0)
was applied for quantum chemistry calculations, using the key-word XYZ (Cartesian
coordinates to be used in the calculations). Distance constraints were defined by using the
Build/Define/Constrains option of Sybyl. GASP (SYBYL/GASP interface, Tripos Inc.,
St. Louis, MO, USA), CoMFA and CoMSIA were used, as implemented in Sybyl.
The 3D-structures of imipramine and phenothiazine derivatives were built from the
structures of imipramine and trifluoperazine, respectively. Their structures were taken
from conformational optimisation 3D-QSAR studies done by Pajeva and Wiese (1998). In
the study of Pajeva and Wiese (1998) a structural search was done in the Cambridge
Structural Database (CSD) (Cambridge Crystallographic Data Centre, Cambridge, UK).
The CSD reference code CIMPRA of chlorimipramine was used to generate the structure
of imipramine, and the CSD reference code PERAZ of prochlorperazine was used to
generate the structure of trifluoperazine. In order to obtain conformations with minimal
energy, Pajeva and Wiese (1998) then performed a systematic conformational search of
the structures keeping the tricyclic fused ring system as an aggregate, and the final
optimisation was done with AM1.
After the structures of the imipramine and phenothiazine derivatives had been built, they were
optimised first with molecular mechanics and then with AM1.
In GASP analysis compounds were studied in pairs and the 10 best alignments were
analysed. The 3D-structure of trifluoperazine was used as a template, and the 3D-
distribution of the hydrophobic and H-bonding features of each of the remaining
molecules was compared to it. As the GASP algorithm uses random numbers for
initialisation, all pairs were run more than once to estimate the reproducibility of the
obtained alignments. The default settings in GASP were applied, namely: population size
100; selection pressure 1.1; maximum number of operations 60000; operation increment
6500; fitness increment 0.01; point cross weight 95.0; allele mutation weight 95.0; full
mutation weight 0.0; full crossweight 0.0; internal van der Waals energy coefficient 0.05;
HB weight coefficient 750; van der Waals contact cut-off 0.8.
The choice of the chemical compound used as a template for superimposing the
remaining compounds under investigation is a crucial step in GASP, CoMFA and
CoMSIA analyses. If prior knowledge about the 3D-structure of the receptor is missing, the
template is usually selected from the most active compounds in the data set. In this study
trifluoperazine was chosen as a template because it had one of the highest brain
penetration values (logBB value of 1.44, Table 11.1.b). Among the remaining drugs only
fluphenazine had a slightly higher logBB value (1.51, Table 11.1.b), but it has a more
flexible side chain. Additionally, trifluoperazine was shown to be among the most
membrane-active phenothiazines in experiments with artificial membranes (Pajeva et al.,
1996). Also, Ford et al. (1989) found that this compound possessed the highest anti-MDR
potency among all phenothiazines investigated in their study including fluphenazine. This
could indicate a good ability of this compound to bind to P-gp.
CoMSIA and CoMFA analyses were applied to the investigated compounds. The
molecules were superimposed in the 3D-space on the structure of trifluoperazine, using
the similarity points derived by GASP analysis for each molecule (see Table 11.2 of
Section 11.3 “Results”) as a fitting point. For molecules having three similarity points
generated (Table 11.2), the compound conformation which gave the highest GASP fit
score was taken. Constraints were applied to the distances between the similarity points.
This conformation was first minimised with Tripos force field molecular mechanics,
and then with AM1. For the structures which appeared to have only two similarity points
when aligned on trifluoperazine (carbamazepine, epoxide of carbamazepine, thioridazine,
mesoridazine and sulphoridazine; Table 11.2), an additional third fitting point was selected
for each compound individually. For carbamazepine and its epoxide, the third point was
the nitrogen atom of the fused ring system (see their structures in Table 11.1.a), which
was aligned to N1 of trifluoperazine (Figure 11.1). As a third fitting point for the spatial
alignment of mesoridazine, sulforidazine and thioridazine, the nitrogen atom from the
piperidine ring (see compound structures in Table 11.1.b) was aligned to N2 of the piperazine
ring of trifluoperazine (Figure 11.1). For these three structures the compound conformation
from the GASP alignment to trifluoperazine that gave the highest GASP fit score was
taken, and constraints were defined relating to the distance between the two similarity
points. The conformations were then minimised, first with molecular mechanics and
then with AM1.
The CoMFA interaction energies were calculated for an sp3 carbon probe atom of +1
charge, using the default grid spacing of 2 Å. CoMSIA similarity indices were also
calculated for a probe atom with +1 parameters, using the same grid box and grid spacing
as in CoMFA. In CoMFA, a column filtering σmin of 2 kcal/mol, energetic field cut-off
values of 30 kcal/mol, no electrostatic interactions at bad steric contacts and a
distance-dependent dielectric constant were set. In CoMSIA, column filtering σmin values of 0.2 and 2.0
kcal/mol and an attenuation factor of 0.3 were used. The Partial Least Squares (PLS)
leave-one-out cross-validation procedure was applied in both CoMFA and CoMSIA. The
robustness of the best CoMSIA model was additionally evaluated by applying leave-
group-out (5 groups) cross-validation. The standard error of prediction (SEP) and cross-
validated coefficient of determination (Q2) were used to assess the quality of the models.
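The two statistics named above can be illustrated with a minimal Python sketch. It is not the PLS procedure used in this study: ordinary least squares with a single descriptor replaces PLS, the data are invented, and the SEP expression sqrt(PRESS / (n - c - 1)), with c the number of components, is an assumption about the usual convention rather than a formula taken from this thesis.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one descriptor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def loo_q2_sep(xs, ys, n_components=1):
    """Leave-one-out Q2 and SEP: each point is predicted by a model fitted without it."""
    n = len(xs)
    press = 0.0  # predictive residual sum of squares
    for i in range(n):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    mean_y = sum(ys) / n
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    q2 = 1.0 - press / ss_total
    sep = math.sqrt(press / (n - n_components - 1))  # assumed convention
    return q2, sep

# Invented descriptor and activity values, for illustration only.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [0.1, 0.4, 0.5, 0.9, 1.0, 1.3]
q2, sep = loo_q2_sep(x, y)
```

Because every prediction is made for a compound left out of the fit, Q2 cannot exceed the fitted R2 of a least-squares model and falls well below it when the model depends strongly on individual compounds.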
11.3. Results
The importance of compound lipophilicity and H-bonding ability for BBB transport has
been reported by many authors and demonstrated in Chapter 10 of this thesis. However,
for the investigated set of 16 compounds a statistically significant model based on logPoct
and/or the numbers of H-bond donor and acceptor atoms could not be obtained (the
values of these structural descriptors are presented in Table 10.2, Chapter 10). This might
be due to the suggested P-gp interactions of the compounds.
GASP analysis, CoMFA, and CoMSIA were applied to identify common 3D-structural
characteristics of the investigated compounds related to their mechanism of transport
across the BBB, thought to include passive diffusion and interactions with the P-gp.
GASP analysis was used to identify a pattern of spatial similarity of hydrophobic and H-
bonding features between the compounds. The results from the GASP analysis are
summarised in Table 11.2. The obtained similarity points include the centroids of the two
phenyl rings of the imipramines and phenothiazines (labelled with C1 and C2 in the
structure of trifluoperazine, Figure 11.1) and the nitrogen atom of the side chain (N3 of
trifluoperazine, Figure 11.1). The characteristics of the similarity pattern found for
imipramine from its 10 best alignments to trifluoperazine are given in Table 11.3
(distances and angles between the similarity points).
To further explore the compound similarity pattern, CoMFA and CoMSIA studies were
performed. The CoMSIA and CoMFA models with best statistical fit are presented in
Table 11.4. The best model was obtained with the CoMSIA hydrophobic field (Model 1,
Table 11.4.a, optimal number of components = 3, Q2 = 0.793, SEP = 0.317, R2 = 0.935, s
= 0.177, F = 58.0). The contributions contour map (Coefficients*Standard Deviations) for
the model with the CoMSIA hydrophobic indices (Model 1, Table 11.4.a) is presented in
Figure 11.2. The robustness of this model was evaluated by leave-group-out cross-
validation statistical procedure. The data set of 16 compounds was randomly divided into
5 groups, and analysis was performed excluding each time a given group of compounds
and generating a model with the remaining compounds. Table 11.5 presents the results
obtained by repeating the leave-group-out procedure 5 times.
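The random division described above can be sketched as follows. This is a hypothetical illustration of the splitting step only, not the routine of the modelling software: the function names, the seed and the integer compound labels are assumptions, and model fitting is left out.

```python
import random

def random_groups(items, n_groups, seed=None):
    """Shuffle the items and deal them into n_groups groups of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

def leave_group_out_splits(items, n_groups, seed=None):
    """Yield (training set, left-out group) pairs, one pair per group."""
    for left_out in random_groups(items, n_groups, seed):
        training = [item for item in items if item not in left_out]
        yield training, left_out

compounds = list(range(1, 17))  # placeholder labels for the 16 compounds
splits = list(leave_group_out_splits(compounds, n_groups=5, seed=0))
```

With 16 compounds and 5 groups, one group holds four compounds and the others three; repeating the procedure with different seeds gives different random divisions, as in the five repetitions reported in Table 11.5.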
11.4. Discussion
In this study 3D-QSAR approaches were applied to investigate common 3D-structural
characteristics between phenothiazine and imipramine derivatives, related to their similar
mechanism of transport through the BBB. The transport mechanism was assumed to be
passive diffusion through the BBB and active outward transport by P-gp.
11.4.1. Interpretation of the models
The investigated compounds were part of a bigger data set taken from Platts et al. (2001),
for which QSARs were developed in this project and reported in Chapter 10. LogPoct and
the sum of H-bond donor and acceptor atoms (ΣH-bond) appeared in the QSARs (Table
10.6, Chapter 10). For the set of 16 phenothiazine and imipramine derivatives a
statistically significant model based on logPoct and/or numbers of H-bond donors and
acceptors could not be obtained. Fluphenazine and thioridazine were excluded from the
final QSAR model derived in the source paper (Platts et al., 2001), which included
geometrical, electronic and H-bonding structural descriptors (see Chapter 10).
Additionally, as described in Chapter 10, excluding two of these compounds
(mesoridazine and thioridazine) improved the model based on logPoct and ΣH-bond
(Equation 10.3). Thus, it appears that the investigated compounds are not well described
by such QSARs. This agrees with the results of Dearden et al. (2003), who performed
a 2D-QSAR study on P-gp interactions and could not derive significant correlations
between parameters encoding P-gp interactions and descriptors of hydrophobicity and
H-bonding. 3D-QSAR approaches were therefore used to investigate the suggested interactions of
the studied compounds with P-gp.
A similarity pattern of 3D-hydrophobic and H-bond features of the imipramine and
phenothiazine derivatives was obtained by GASP analysis (Table 11.2). From Table 11.2
it can be seen that for 10 compounds three points were found to match the trifluoperazine
3D-distribution of hydrophobic and H-bonding features (amitriptyline, chlorpromazine,
desipramine, desmethyldesipramine, desmonomethylpromazine, fluphenazine,
imipramine, nor-1-chlorpromazine, nor-2-chlorpromazine, promazine), namely the
centroids of the two phenyl rings of the imipramines and phenothiazines (labelled with C1
and C2 in the structure of trifluoperazine, Figure 11.1) and the nitrogen atom of the side
chain (N3 of trifluoperazine, Figure 11.1). Table 11.3 presents the characteristics of the
similarity pattern found for imipramine from its 10 best alignments to trifluoperazine. It
can be seen that the pattern is stable, with only small variations for the different
alignments. Stable similarity patterns were produced also for the phenothiazine
derivatives, which possessed structures more closely related to that of trifluoperazine than
imipramine.
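As a consistency check on the geometry in Table 11.3, the three angles of the C1, C2, N3 triangle can be recomputed from the three distances with the law of cosines. The sketch below uses the average distances from the table; the function name is an assumption introduced for illustration. Since the table presumably averages each quantity separately over the 10 alignments, the recomputed angles need only agree approximately with the tabulated average angles.

```python
import math

def triangle_angles(c1_c2, c1_n3, c2_n3):
    """Angles (degrees) at C1, C2 and N3 from the three side lengths (law of cosines)."""
    def angle(adjacent_a, adjacent_b, opposite):
        cos_value = (adjacent_a ** 2 + adjacent_b ** 2 - opposite ** 2) / (
            2 * adjacent_a * adjacent_b
        )
        return math.degrees(math.acos(cos_value))

    at_c1 = angle(c1_c2, c1_n3, c2_n3)  # angle C2-C1-N3
    at_c2 = angle(c1_c2, c2_n3, c1_n3)  # angle C1-C2-N3
    at_n3 = angle(c1_n3, c2_n3, c1_c2)  # angle C1-N3-C2
    return at_c1, at_c2, at_n3

# Average distances (Å) from Table 11.3
at_c1, at_c2, at_n3 = triangle_angles(4.884, 10.380, 8.454)
```

The recomputed angles lie within a fraction of a degree of the tabulated averages, which supports the internal consistency of the reported pattern.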
For five compounds (carbamazepine, epoxide of carbamazepine, mesoridazine,
sulforidazine, thioridazine, Table 11.2) the similarity pattern consisted of only two
points – the centroids of the two phenyl rings (Figure 11.1). An explanation of this result
for carbamazepine and its epoxide is the absence of a side chain and consequently of an
H-bond acceptor atom to be superimposed on that of trifluoperazine.
The imipramine and phenothiazine derivatives with higher logBB values (for some of
them the logBB values are comparable to that of trifluoperazine) each possess
three similarity points. This suggests a highly similar spatial profile of the hydrophobic
and H-bond properties for these compounds. The results from the GASP analysis
qualitatively agree with the BBB penetration data for the investigated compounds: the
more similar the spatial profiles, and the more hydrophobic and H-bond points involved, the
higher the drug logBB values. The different similarity patterns of mesoridazine,
sulforidazine and thioridazine, which involve one point fewer, are consistent with their lower
level of BBB penetration (logBB values of -0.36, 0.18 and 0.24, respectively) compared
to the remaining compounds (Table 11.1.b). Considering the fact that the BB ratio
reflects drug influx and efflux simultaneously, it would be misleading to relate the
obtained similarity patterns to interactions with P-gp only. Naturally, drugs with a low rate
of passive diffusion due to their low hydrophobicity are also expected to have low BB
ratios. In addition to the lower number of pharmacophore points due to a missing aliphatic
side chain with tertiary nitrogen, drugs like carbamazepine and its epoxide also possess
low logPoct values (Table 10.2, Chapter 10), which can explain their low brain
penetration.
CoMFA and CoMSIA studies were performed to investigate further the compound 3D-
features important for BBB transport. A model with very good statistical parameters
(Model 1, Table 11.4.a) was obtained with the CoMSIA hydrophobic indices, when no
correlation was found with logPoct (see above). This suggests the importance of the 3D-
distribution of the hydrophobic features of imipramines and phenothiazines for the BBB
penetration, rather than the whole-molecule lipophilicity characterised by logPoct. The 3D-
distribution of the hydrophobic properties could be important for the P-gp interactions.
From Table 11.4.a it can be seen that the model with CoMSIA hydrophobic indices is
practically unaffected by the value of column filtering σmin: changing σmin from 2.0 kcal/mol to 0.2
kcal/mol left Q2 and SEP unchanged and altered R2, s and F only marginally (Models 1 and 2). This is an indication of
the high signal-to-noise ratio for the hydrophobic indices calculated. Additionally, the
model appeared to be robust when assessed with the leave-group-out (5 groups)
cross-validation procedure (Table 11.5). As can be seen from Table 11.5, the model
performance is stable, resulting in similar cross-validation statistics from the different
random divisions of the compounds into groups.
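The effect of column filtering can be illustrated with a small Python sketch, under the assumption that column filtering discards field columns (grid points) whose standard deviation across the compounds falls below σmin. The field values, grid-point names and the use of the population standard deviation are all invented for illustration and are not taken from the software used in this study.

```python
import statistics

def column_filter(columns, sigma_min):
    """Keep only field columns whose standard deviation across compounds is at least sigma_min."""
    return {
        name: values
        for name, values in columns.items()
        if statistics.pstdev(values) >= sigma_min
    }

# Invented field values at three grid points for four compounds.
field = {
    "grid_1": [0.0, 0.1, 0.0, 0.1],   # almost constant: little signal
    "grid_2": [-4.0, 3.5, 0.5, 6.0],  # varies strongly
    "grid_3": [1.0, 1.5, 0.2, 2.1],   # varies moderately
}
kept_loose = column_filter(field, sigma_min=0.2)
kept_strict = column_filter(field, sigma_min=2.0)
```

In this toy case the stricter threshold removes one additional column. A model whose statistics hardly change between the two thresholds, as observed for the hydrophobic indices, is one whose information is carried by the strongly varying columns.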
In Figure 11.2, the contributions contour map (Coefficients*Standard Deviations) for the
model with the CoMSIA hydrophobic indices (Model 1, Table 11.4.a) is presented. Three
regions around the molecules were identified in which the structural variation contributed
most to the logBB variance: more hydrophobic substituents in the regions labelled in
red (letter “R” in the figure; substituents in the fused ring system and around the side
chain) and more hydrophilic substituents in the region of the side-chain nitrogens (labelled in
blue, letter “B” in the figure) favour BBB penetration. This distribution reflects the
amphiphilic character of the studied compounds, and indirectly supports the obtained 3D-
molecular pattern.
Relatively good models were obtained also with CoMSIA steric and electrostatic indices
(Models 3 and 4, Table 11.4.a). The model with the donor-acceptor indices (Model 5,
Table 11.4.a) had a lower Q2. The model with acceptor indices had poor statistics (Q2 =
0.196, SEP = 0.578) and is not presented in Table 11.4.a.
CoMFA analysis also resulted in models with good predictivity (Models 6, 7, and 8,
Table 11.4.b), with the electrostatic field giving better results than the steric field. The models
with electrostatic field and both fields (Models 7 and 8, Table 11.4.b) had comparable
statistical parameters.
11.4.2. Contribution to existing knowledge
Numerous investigations of interactions of catamphiphilic compounds like phenothiazine
and imipramine derivatives with P-gp have been published (a review of such studies is
given in Wiese and Pajeva, 2001). A similarity pattern of hydrophobic and H-bonding
structural features for a range of chemical compounds, including phenothiazines and
imipramines, was obtained by Pajeva and Wiese (2001). The main new element
introduced by the present study is related to the fact that a similarity pattern of
hydrophobic and H-bonding structural characteristics for the phenothiazines and
imipramines was observed in relation to their transport across the BBB, i.e. the
investigation is extended to a new endpoint.
The present study has emphasised the importance of the 3D-distribution of the
hydrophobicity of phenothiazines and imipramines, which is suggested to govern their
BBB penetration, rather than the whole-molecule lipophilicity characterised by logPoct.
Similar results have been previously published for compound interactions with P-gp.
However, the present study related this observation to penetration of chemicals through
the BBB.
An article based on this work has been published by Lessigiarska et al. (2005a).
11.5. Conclusions
Common 3D-structural characteristics of phenothiazine and imipramine
derivatives were identified, related to their mechanism of transport across the BBB involving passive
diffusion and P-gp interactions. It was shown that the compounds with the highest BBB
penetration possess a similar specific profile of two clearly defined hydrophobic centres
and one hydrophilic (H-bonding) centre arranged in a particular spatial configuration.
A model with very good statistical parameters was obtained with the CoMSIA
hydrophobic indices when no correlation was found with the whole-molecule descriptor
logPoct. This suggests that hydrophobicity, represented as a space-distributed molecular
property but not as a logPoct value, is more suitable for the modelling of the BBB
penetration of phenothiazine and imipramine derivatives.
Figure 11.1. Structure of trifluoperazine. C1 and C2 label the centroids of the phenyl rings.
Nitrogen atoms are also labelled (N1, N2, N3)
[Structure diagram not reproduced]
Figure 11.2. Contour map (representing Coefficient*Standard Deviation values) for the
model with CoMSIA hydrophobic field (Model 1, Table 11.4.a); “R” labels the “red”
regions (highest positive Coefficient*Standard Deviation values), “B” labels the “blue”
region (lowest negative Coefficient*Standard Deviation values) around the molecules
Table 11.1. Chemical structure and BBB penetration data (presented as logBB values) of
the imipramine and phenothiazine derivatives investigated for their interactions with P-gp
(data taken from Platts et al., 2001). The numeration is done according to Table 10.2,
Chapter 10
a) imipramine derivatives
[Structure drawings not reproduced]
No   Name                        logBB
33   amitriptyline                0.89
28   carbamazepine                0.00
53   desipramine                  1.20
54   desmethyldesipramine         1.06
29   epoxide of carbamazepine    -0.34
7    imipramine                   1.06
b) phenothiazine derivatives
[Structure diagram not reproduced: phenothiazine ring system bearing substituents R1, R2 and R3]
No    Name                      R1           R2    R3                                        logBB
47    chlorpromazine            -Cl          -H    -(CH2)3-N(CH3)2                            1.06
57    desmonomethylpromazine    -H           -H    -(CH2)3-NH-CH3                             0.59
69    fluphenazine              -CF3         -H    -(CH2)3-[piperazine ring]-(CH2)2-OH        1.51
86    mesoridazine              -S(=O)CH3    -H    -(CH2)2-[piperidine ring, N-CH3]          -0.36
104   nor-1-chlorpromazine      -H           -Cl   -(CH2)3-NH2                                1.37
105   nor-2-chlorpromazine      -Cl          -H    -(CH2)3-NH2                                0.97
120   promazine                 -H           -H    -(CH2)3-N(CH3)2                            1.23
133   sulforidazine             -S(=O)2CH3   -H    -(CH2)2-[piperidine ring, N-CH3]           0.18
138   thioridazine              -S-CH3       -H    -(CH2)2-[piperidine ring, N-CH3]           0.24
146   trifluoperazine           -CF3         -H    -(CH2)3-[piperazine ring]-CH3              1.44
Table 11.2. Similarity points, obtained by GASP analysis. Y encodes the presence of the
corresponding similarity point, N encodes its absence
Compound                    Similarity points
                            C1   C2   N3
amitriptyline               Y    Y    Y
carbamazepine               Y    Y    N
chlorpromazine              Y    Y    Y
desipramine                 Y    Y    Y
desmethyldesipramine        Y    Y    Y
desmonomethylpromazine      Y    Y    Y
epoxide of carbamazepine    Y    Y    N
fluphenazine                Y    Y    Y
imipramine                  Y    Y    Y
mesoridazine                Y    Y    N
nor-1-chlorpromazine        Y    Y    Y
nor-2-chlorpromazine        Y    Y    Y
promazine                   Y    Y    Y
sulforidazine               Y    Y    N
thioridazine                Y    Y    N
trifluoperazine             Y    Y    Y
Table 11.3. Distances and angles between the points of the similarity pattern found from
the 10 best alignments of imipramine to trifluoperazine. C1, C2, and N3 are the
corresponding similarity points (Figure 11.1)
Values    Distances (Å)              Angles (degrees)
          C1-C2   C1-N3    C2-N3     C2-C1-N3   C1-C2-N3   C1-N3-C2
lowest    4.864   10.361   8.423     53.16      97.59      27.52
highest   4.946   10.386   8.517     54.57      99.32      27.85
average   4.884   10.380   8.454     53.60      98.40      27.61
Table 11.4. CoMSIA and CoMFA models for the 16 imipramine and phenothiazine
derivatives. The best models are shown in bold
(a) CoMSIA models
No   Field            σmin [a]   NC [b]   Q2 [b]   SEP [b]   R2 [b]   s [b]   F [b]
1    hydrophobic      2.0        3        0.793    0.317     0.935    0.177   58.0
2    hydrophobic      0.2        3        0.793    0.317     0.936    0.176   58.5
3    steric           2.0        5        0.689    0.425     0.917    0.213   48.3
4    electrostatic    2.0        3        0.641    0.417     0.917    0.200   44.4
5    donor-acceptor   2.0        1        0.444    0.481     0.674    0.368   29.0
(b) CoMFA models (σmin [a] = 2.0 kcal/mol)
No   Field            NC [b]   Q2 [b]   SEP [b]   R2 [b]   s [b]   F [b]
6    steric           3        0.697    0.384     0.918    0.199   44.9
7    electrostatic    3        0.767    0.336     0.940    0.170   63.1
8    both             3        0.773    0.332     0.955    0.148   84.6
[a] σmin – column filtering (in units kcal/mol)
[b] NC – optimal number of components in the PLS analysis; Q2 – cross-validated coefficient of
determination; SEP – cross-validated standard error of prediction; R2 – regression coefficient of
determination; s – standard error of estimate; F – Fisher statistic
Table 11.5. Results from the leave-group-out (5 groups) cross-validation procedure
performed with the CoMSIA hydrophobic indices*.
NC [a]   Q2 [a]   SEP [a]
2        0.798    0.301
3        0.821    0.294
2        0.813    0.290
3        0.790    0.319
3        0.820    0.295
* number of compounds n = 16; column filtering σmin = 2.0 kcal/mol
[a] NC – optimal number of components in the PLS analysis; Q2 – cross-validated coefficient of
determination; SEP – cross-validated standard error of prediction
CHAPTER 12
INVESTIGATION OF BACTERIAL TOXICITY
12.1. Objectives
An objective of the project was to investigate cytotoxic and acute toxic effects of
chemicals by means of QSAR analysis. Toxicity to bacterial strains, aquatic
organisms, rodents and humans was investigated. This chapter presents a QSAR analysis
of a large set of data for chemical toxicity to the bacterial strain Sinorhizobium meliloti.
The data set contained chemicals acting by different mechanisms of toxicity, which
permitted investigation of mechanisms of toxic action. Baseline toxicity effects, attributed
to non-polar narcotics, were sought. Mechanism-based QSARs were developed.
Additionally, attempts were made to develop models for compounds belonging to
different chemical classes.
12.2. Methods
12.2.1. Toxicity data
The data set was taken from Botsford (2002). The toxicity test applied uses the bacterium
Sinorhizobium meliloti as an indicator of cell viability, which in its viable state can
readily reduce the thiazole tetrazolium dye MTT to a dark blue compound. The reduction
of the dye is inhibited by toxic chemicals. The toxicity values are expressed as compound
concentrations that cause 50% inhibition of the bacterial growth (IC50).
The original data set contains toxicity measurements for 237 substances. In the present
QSAR study, data for 133 of the 237 substances were used. Compounds identified by
Botsford (2002) as non-toxic, as well as herbal medicines and related substances, and
inorganics, were excluded from this study. Unspecified positional isomers were also
excluded. Glyphosate, neomycin and streptomycin were excluded, because of their very
low calculated logPoct values (logPoct < -4). Thus, assuming the calculated logPoct values
are accurate, the compounds are unlikely to cross the cell membrane by passive diffusion.
The investigated data set is heterogeneous in terms of chemistry and mechanisms of toxic
action.
The compounds included in the study, their mean toxicities (average of several
experiments) in terms of IC50 values, and their assigned mechanisms of toxic action are
presented in Table 12.1. The SMILES codes of the compounds are given in Appendix
B.2. The mechanism of action is encoded by: 1 – non-polar narcotics; 2 – polar narcotics;
3 – weak acid respiratory uncouplers; 4 – formation of free radicals; 5 – electrophilic
interactions; 6 – toxic action by a specific mechanism. The compounds were assigned to a
given mechanism of toxicity according to structural criteria found in the literature
(Hermens, 1990; Lipnick, 1991; Verhaar et al., 1992; Russom et al., 1997; Schultz et al.,
1997; Hansch et al., 2000; Cronin et al., 2002a) and summarised in Table 8.1, Chapter 8.
The number of narcotic compounds identified is 72, of which 46 are non-polar and 26 are
polar narcotics. Five compounds are considered to act by weak acid respiratory
uncoupling; 9 compounds by forming free radicals; and 13 are likely to be
electrophiles/proelectrophiles. Forty-seven compounds act by a specific mechanism.
Some compounds are likely to possess more than one mechanism of toxicity, as noted in
Table 12.1.
12.2.2. Structural Descriptors
The following structural descriptors for the compounds were calculated: logPoct, boiling
point, melting point, vapour pressure, aqueous solubility, the Brown variation of the
Hammett constant σ+, and approximately 250 different quantum-chemical and topological
descriptors. LogPoct was calculated with the KOWWIN version 1.66 software (Syracuse
Research Corporation, Syracuse, NY, USA). Boiling points, melting points and vapour
pressures were calculated with the MPBPWIN version 1.40 software (Syracuse Research
Corporation, Syracuse, NY, USA). Aqueous solubilities were calculated with
WSKOWWIN version 1.40 (Syracuse Research Corporation, Syracuse, NY, USA).
KOWWIN version 1.66, MPBPWIN version 1.40 and WSKOWWIN version 1.40
software were downloaded from the web-site of the Environmental Protection Agency of
the United States (http://www.epa.gov/oppt/exposure/docs/episuitedl.htm). The values of
the Brown variation of the Hammett constant σ+ were calculated with the ACDLabs
version 4.02 software (Advanced Chemistry Development Inc., Toronto, Canada). TSAR
for Windows version 3.3 (Accelrys Inc.) was used to calculate quantum-chemical and
topological descriptors. For quantum-chemical calculations, the VAMP package
(Accelrys Inc.) was used with the AM1 Hamiltonian. QsarIS version 1.1 software (then
SciVision-Academic Press, San Diego, CA; currently MDL®QSAR by Elsevier MDL,
San Leandro, CA) was used to calculate topological descriptors.
12.2.3. Statistical Analysis
Before performing the statistical analysis, descriptors that had values of zero for more
than 95% of the compounds in a data set were excluded from the descriptor data matrix.
To reduce the number of intercorrelated molecular descriptors, an algorithm developed
by the author (see Chapter 9) was then applied to the descriptor data matrices.
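The first filtering step above can be sketched in a few lines of Python. The descriptor names and values are invented, and the function is a hypothetical illustration of the 95% rule rather than the code actually used; the intercorrelation algorithm of Chapter 9 is not reproduced here.

```python
def drop_mostly_zero(descriptors, max_zero_fraction=0.95):
    """Remove descriptors that are zero for more than max_zero_fraction of the compounds."""
    kept = {}
    for name, values in descriptors.items():
        zero_fraction = sum(1 for v in values if v == 0) / len(values)
        if zero_fraction <= max_zero_fraction:
            kept[name] = values
    return kept

# Invented descriptor matrix for 40 compounds.
matrix = {
    "logP": [0.1 * i for i in range(1, 41)],  # never zero
    "n_OH": [1] * 10 + [0] * 30,              # zero for 75% of compounds
    "n_rare_group": [1] + [0] * 39,           # zero for 97.5% of compounds
}
filtered = drop_mostly_zero(matrix)
```

Descriptors that are zero for nearly all compounds carry almost no variance, so a coefficient fitted to them would rest on one or two compounds only.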
QSARs were obtained by multiple linear regression, using MINITAB version 14 software
(Minitab Inc., State College, PA, USA). Two approaches were used for selecting
variables to derive models with good statistical diagnostics that also allow
physicochemical and/or mechanistic interpretation. Firstly, forward stepwise regression
was applied, selecting up to five variables to include, and excluding variables that lacked
physicochemical and/or mechanistic relevance. Secondly, a program written in C
by the author, implementing the best-subsets approach (see Chapter 9), was
used. Only variables that intercorrelated with a correlation coefficient R of less
than 0.7 were included in the same model. The leave-one-out cross-validation
procedure was applied to the QSAR models to generate the cross-validated
coefficient of determination (Q2).
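The intercorrelation constraint used in the best-subsets search can be illustrated as follows. This is a hypothetical Python sketch, not the author's C program (which is described in Chapter 9): it only enumerates descriptor subsets in which every pair has |R| below 0.7, and it omits the ranking of subsets by regression statistics. All names and data are invented.

```python
from itertools import combinations
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long value lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def admissible_subsets(descriptors, size, r_max=0.7):
    """Yield descriptor-name subsets in which every pair has |R| < r_max."""
    for subset in combinations(sorted(descriptors), size):
        if all(
            abs(pearson_r(descriptors[a], descriptors[b])) < r_max
            for a, b in combinations(subset, 2)
        ):
            yield subset

# Invented descriptor values for five compounds.
data = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly proportional to d1
    "d3": [3.0, 1.0, 4.0, 1.0, 5.0],  # only weakly related to d1 and d2
}
pairs = list(admissible_subsets(data, size=2))
```

Here the pair (d1, d2) is rejected because the two descriptors are almost collinear, while both pairs containing d3 remain candidates for regression.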
Partial least squares (PLS) analysis was performed by using MINITAB version 14
software.
12.3. Results
The abbreviations used for the molecular descriptors are presented in Table 12.2. The
values of the descriptors, included in the selected QSARs, are presented in Table 12.3.
A plot of toxicity against logPoct for all compounds investigated shows a baseline effect
(Figure 12.1.a). Of the 133 compounds, 46 meet the structural criteria for compounds that
may act by the non-polar narcosis mechanism of toxic action. Within this sub-group, 18
compounds (acetone, chloroquine, DCPA, 1,3-dichlorobenzene, dimethyl sulphoxide,
ethanol, ethylene glycol, hexachlorobenzene, methanol, methylphenidate, napropamide,
naproxen, octane, orphenadrine, 1,10-phenanthroline, propan-2-ol, tert-butyl methyl
ether, tetrachloromethane) could be identified as falling on the baseline, determined by
lipophilicity. The relationship giving the baseline toxic effect is defined by the following
equation, which was based on the sub-group of 18 compounds:
log(1/IC50) = 0.667 (± 0.041) logPoct – 2.79 (± 0.12) (12.1)
n = 18, R2 = 0.942, s = 0.369, F = 261
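To show how Equation 12.1 is used, the following Python sketch evaluates the baseline and the difference between an observed toxicity and the baseline. The coefficients are those of Equation 12.1; the function names and the example compound are hypothetical.

```python
SLOPE = 0.667      # coefficient of logPoct in Equation 12.1
INTERCEPT = -2.79  # intercept of Equation 12.1

def baseline_toxicity(log_p):
    """Baseline log(1/IC50) predicted from logPoct by Equation 12.1."""
    return SLOPE * log_p + INTERCEPT

def excess_toxicity(observed_log_inv_ic50, log_p):
    """Observed minus baseline log(1/IC50); positive values lie above the baseline."""
    return observed_log_inv_ic50 - baseline_toxicity(log_p)

# Hypothetical compound: logPoct = 3.0, observed log(1/IC50) = 0.5
excess = excess_toxicity(0.5, 3.0)
```

A positive difference indicates a compound that is more toxic than its lipophilicity alone would predict, which is the situation examined for the other mechanisms of action.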
To determine factors other than logPoct which could enhance the toxicity of the remaining
compounds above that determined by the baseline alone, additional models were
developed by including other molecular descriptors in combination with logPoct. The
QSAR equations for the groups of compounds acting by different toxic mechanisms are
presented in Table 12.4.
To investigate the possibility of developing chemical class-specific models, QSARs were
developed for the aromatic and aliphatic compounds separately. QSARs for the aromatic
compounds could be derived, but the statistical quality of these models was poor, and the
descriptors selected were not meaningful. The QSAR developed for the aliphatic
compounds is presented in Table 12.4, Equation 7.
PLS analysis was applied to the whole data set to investigate further the multivariable
data space. Firstly, the whole set of 104 independent variables was included in the PLS
analysis. The model obtained had 4 significant components (selected on the basis of best
cross-validated coefficient of determination, Q2), and R2 and Q2 values of 0.598 and
0.347, respectively. Secondly, following the approach of Cronin et al. (2002a), the four
variables found to be most significant in the regression analysis (logPoct, LUMO, NOH,
and NNH2, Equation 6, Table 12.4) were used to derive a PLS model. The model selected
had 3 components, R2 = 0.541 and Q2 = 0.464.
12.4. Discussion
In this study a data set of 133 compounds was used to develop QSARs for toxicity to the
bacterium S. meliloti. In general, several approaches could be used for developing QSARs
from a large data set for a given endpoint. The compounds in the data set could be divided
into groups according to their mechanism of action (mechanism-based approach) or their
chemical structure, with QSARs being developed for each group. A global QSAR for all
compounds in the data set could also be developed. These approaches were tried in the
present study.
12.4.1. Baseline toxicity effect
A baseline toxic effect, determined by lipophilicity, was identified for these compounds
(Equation 12.1). The position of the baseline on plots of toxicity against logPoct is
presented in Figure 12.1.a for all investigated compounds regardless of the mechanism of
toxicity, and in Figure 12.1.b for the non-polar narcotics. From Figure 12.1.b it can be
seen that the toxicity of the remaining 28 non-polar narcotics, not included in Equation
12.1, is greater than that determined by the baseline. This disagrees with some literature
reports (Könemann, 1981; Hansch et al., 1989; Schultz et al., 1990; Verhaar et al., 1992)
stating that the toxicity of the non-polar narcotics is dependent mainly on their logPoct
values. This contradiction could be due to variability in the Botsford (2002) IC50 values.
In the Botsford study, the IC50 values were obtained by investigating several samples for
the same chemical compound, and the number of samples for the different compounds
varied from 1 to 24. The coefficient of variation obtained (the standard deviation divided
by the mean) for the IC50 values varies widely from 0.25 up to 70. The high values of the
coefficient of variation suggest some uncertainty in the IC50 values determined by the
method of Botsford (2002). Additionally, all the chemicals were tested as received from
the suppliers (e.g. in the form of pesticide and pharmaceutical products), without attempts
to purify them from inactive compounds added to make the chemicals marketable.
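The coefficient of variation referred to above can be computed as in the following Python sketch. The replicate values are invented, and the use of the sample standard deviation is an assumption, since the source does not state which form was used by Botsford (2002).

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (needs at least two values)."""
    return statistics.stdev(values) / statistics.mean(values)

# Invented replicate IC50 measurements for two hypothetical compounds.
consistent = [10.0, 12.0, 11.0, 9.0]
scattered = [1.0, 40.0, 5.0, 200.0]
cv_consistent = coefficient_of_variation(consistent)
cv_scattered = coefficient_of_variation(scattered)
```

A coefficient above 1 means that the spread of the replicates exceeds their mean, so the mean IC50 of such a compound is poorly defined. For compounds measured only once the coefficient cannot be computed at all.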
The baseline presented by Equation 12.1 is developed in an arbitrary way by the author
and its position in Figure 12.1.b results in the placing of non-polar narcotic compounds
both above and below the baseline, in accordance with the suggested data variability.
12.4.2. QSAR models
When the non-polar narcotics were investigated alone, logPoct was found to explain only
67% of the variance in the toxicity (Equation 1, Table 12.4), possibly due to the suggested
data variability. The QSAR developed for the toxicity of the polar narcotics indicates that
the specific polarisability (SpcP), the largest negative charge over the atoms in a
compound (Qmaxn), and the number of hydroxyl groups (NOH) are important determinants
of toxicity (Equation 2, Table 12.4). Interestingly, it did not include logPoct (this result is
unlikely to be due to lack of logPoct variance, because the values of this descriptor varied
between –0.91 (sulphosalicylic acid) and 6.92 (hexachlorophene)). From the QSAR, it
can be seen that increasing the ability of the molecule to be polarised under an external
electric field (possibly in cell membranes) is associated with an increase in toxicity. The
value of Qmaxn could account for participation of the compound in electrostatic
interactions at the site of action, or for H-bonding effects, and its increased absolute value
is associated with increasing toxicity (Qmaxn has negative values). From Equation 2, Table
12.4 it can be seen that increasing the number of hydroxyl groups in the molecule is
associated with increasing toxicity. This property can be also related to H-bonding
activity.
The group of weak acid respiratory uncouplers is represented by only 5 compounds.
Unfortunately this is statistically insufficient to investigate all the factors that determine
the interaction with the mitochondrial membrane. Bearing in mind the limited number of
compounds in the training set, the observed relationship with logPoct
(Equation 3, Table 12.4) could be due to membrane disruption or could reflect the ability
of compounds to pass across the cell membrane in order to reach the mitochondria.
Relationships between toxicity of respiratory uncouplers and logPoct have been previously
reported in the literature by Russom et al. (1997), and Schultz and Cronin (1997).
Equation 3 has a higher intercept (not significantly different from zero) than Equation 12.1,
which defines the baseline toxicity; thus, as might be expected, the toxicity of the weak acid
respiratory uncouplers is greater than that of the non-polar narcotics.
The toxicity of the compounds forming free radicals was correlated with the molecular
dipole moment (µ) and the total molecular energy (Etot) (Equation 4, Table 12.4). These
parameters could be associated with the formation and/or stability of free radicals. It
should be noted, however, that with the exception of catechol, these compounds could
also act by the polar narcosis mechanism. The low value of 0.391 for Q2 suggests that the
model is very unstable with regard to changes in the compounds included. Furthermore, since
the equation has 2 independent variables and only 9 data points, its quality can be
questioned on the basis of the low ratio of data points to independent variables (fewer
than 5 data points per variable).
Hansch and Gao (1997) reviewed the literature on free radical reactions. They found that
in 27 QSAR equations describing endpoints in which formation of free radicals involves
the abstraction of a hydrogen radical (H·) from a phenolic hydroxyl group, 25 equations were
correlated with negative values of the Brown variation of the Hammett substituent
constant σ+, thus indicating that electron-releasing substituents, i.e. increasing electron
density in the benzene ring, favour the investigated endpoints. The Brown variation of the
Hammett constant σ+ takes into account the contribution of substituents through
conjugation to electron-deficient reactivity sites attached to a benzene ring. π-Electron-
releasing and π-electron-withdrawing groups are associated with negative and positive
values of σ+, respectively (Morao and Hillier, 2001). In the present study, σ+ values were
calculated for the eight phenols identified as forming free radicals (ACDLabs software
could not calculate the σ+ value for 2-dianisidine). No statistically significant correlations
between the toxicity and the σ+ values were found for these compounds.
No meaningful QSARs could be derived for the electrophilic compounds. It is possible
that the electrophilic interactions involved were too diverse to be accommodated in a
simple QSAR for all electrophilic compounds. Also, structural descriptors other than
those used in the present study could determine the toxicity of these compounds. As
concluded by Schultz et al. (1998), data and descriptor limitations exist for QSAR
modelling for electrophilic compounds.
The model obtained for compounds acting by specific mechanisms described only 57% of
the variation in compound toxicity (Equation 5, Table 12.4). Greater aqueous solubilities
(logSaq), higher absolute values of the largest negative charge over an atom in the
molecule (Qmaxn), lower values of the difference connectivity index of first order (d1χ), and a
smaller number of amino groups in the molecule (NNH2) appeared to decrease toxicity.
d1χ encodes information regarding the degree of skeletal branching, in terms of
decreasing values with increased branching, and tends to be insensitive to molecular size.
The positive coefficient of d1χ shows that increased branching is associated with lower
toxicity. In the case of compounds acting by specific mechanisms, it is possible that this
descriptor reflects effects of shielding of reactive centres in branched compounds,
resulting in lower toxicity. The model for specifically-acting compounds (Equation 5,
Table 12.4) is of low quality, having a low Q2 value and at least one outlier with
considerable leverage (isoproterenol). This may be due to the fact that the compounds
brought together in this model may elicit their toxic effects by interacting with different
biological macromolecules, and thus different factors may determine the toxic response to
these compounds. In such a case, it would be more appropriate to develop separate QSAR
models for separate specific mechanisms of action, if sufficient data were available.
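The size of the isoproterenol residual can be estimated by applying Equation 5 directly. In the sketch below the descriptor values are taken from Table 12.3 and the observed toxicity from Table 12.1; the function and argument names are illustrative, and the rounded published coefficients make the result approximate.

```python
def predict_specific(log_s_aq, q_max_n, d1_chi, n_nh2):
    """Equation 5, Table 12.4: log(1/IC50) for specifically acting compounds."""
    return (-0.386 * log_s_aq + 6.46 * q_max_n + 1.17 * d1_chi
            + 1.16 * n_nh2 + 1.70)


# Isoproterenol (No. 75): logSaq, Qmaxn, d1chi and NNH2 from Table 12.3
predicted = predict_specific(log_s_aq=0.66, q_max_n=-0.403, d1_chi=-0.411, n_nh2=0)
observed = -3.248                 # log(1/IC50), Table 12.1
residual = observed - predicted   # negative: the model overestimates toxicity

print(f"predicted {predicted:.3f}, observed {observed:.3f}, residual {residual:.3f}")
```

With these inputs the observed toxicity lies well over one log unit below the predicted value, in keeping with the description of isoproterenol as an outlier.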
A single, global model for S. meliloti toxicity, covering all mechanisms of action, was
developed (Equation 6, Table 12.4). The model includes logPoct, the energy of the lowest
unoccupied molecular orbital (LUMO), the number of hydroxyl groups (NOH), and the
number of amino groups (NNH2). Thus, descriptors encoding compound biouptake,
distribution and/or membrane interactions (logPoct) and electrophilic reactivity (LUMO)
appeared to influence toxicity. The number of hydroxyl and amino groups might encode
particular compound reactivity or H-bonding effects. However, the statistical fit of the
model was not high, possibly due to the diversity of the included compounds and their
mechanisms of toxic action.
In an attempt to obtain additional information from the variable space, PLS regression
was applied to the whole data set. The PLS models obtained had statistical fits comparable
to that of the QSAR based on regression analysis (Equation 6, Table 12.4), but they
lacked the mechanistic interpretability of the regression model (Cronin et al., 2002a).
The QSAR developed for the aliphatic compounds had reasonable statistical parameters
(Equation 7, Table 12.4). It included compounds acting by non-polar narcosis,
electrophilic reactivity, and specific mechanisms. Thus, the diversity of mechanisms
covered by this model was less than in the case of the model for all compounds.
Correspondingly, the model had better statistical fit. It included descriptors encoding
compound biouptake, distribution and/or membrane interactions (logPoct) and
electrophilic reactivity (LUMO). Additional descriptors included were MSA, which might
reflect the larger size of the compounds acting by specific mechanisms included in the
model, and NNH2, which probably encodes particular types of interactions/reactivity. In
Figure 12.2 a plot of predicted values against observed toxicity values is presented for this
model. It can be seen that the predicted toxicity values of some compounds (acetone,
ethanolamine, clindamycin) have high residuals (greater than 1.3 log units). The
coefficients of variation for acetone and ethanolamine reported by Botsford (2002) were
24 and 2.65 respectively (the numbers of samples were 18 and 2 respectively), which
resulted in standard deviations of the determined toxicity values of 4.71 and 0.85 log units
respectively. The standard deviation for acetone is higher than the predicted residual from
the model. The toxicity assessment of small aliphatic molecules is often difficult to
perform accurately due to their volatility. The toxicity of clindamycin was determined
with one sample only (no coefficient of variation was reported by Botsford, 2002).
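The residuals discussed above can be approximated by applying Equation 7 to the three named compounds. The sketch uses descriptor values from Table 12.3 and observed toxicities from Table 12.1; the function name is illustrative. Since the coefficients in Table 12.4 are rounded, the computed residuals need not match the values quoted in the text exactly.

```python
def predict_aliphatic(log_p_oct, lumo, msa, n_nh2):
    """Equation 7, Table 12.4: log(1/IC50) for aliphatic compounds."""
    return (0.542 * log_p_oct - 0.305 * lumo + 0.00245 * msa
            + 1.73 * n_nh2 - 1.80)


# name: (logPoct, LUMO, MSA, NNH2, observed log(1/IC50))
compounds = {
    "acetone": (-0.24, 0.844, 96.8, 0, -3.326),
    "ethanolamine": (-1.61, 3.169, 100.5, 1, -0.613),
    "clindamycin": (2.90, 1.372, 421.4, 0, -0.929),
}

# Residual = observed - predicted, in log units
residuals = {
    name: observed - predict_aliphatic(log_p, lumo, msa, n_nh2)
    for name, (log_p, lumo, msa, n_nh2, observed) in compounds.items()
}

for name, value in residuals.items():
    print(f"{name}: residual {value:+.3f} log units")
```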
12.4.3. Contribution to existing knowledge
A new data set was investigated in which the toxicity of compounds was obtained in the
same laboratory, according to the same experimental protocol, suggesting good quality of
the obtained results. However, high experimental variability was reported by Botsford
(2002) for some compounds, which could be a reason for the moderate statistical fit of the
QSARs derived in this study.
The study confirmed the presence of a baseline toxicity effect attributed to the non-polar
narcotics. In contrast to literature reports (see Chapter 8), logPoct was not among the
descriptors that best described the toxicity of the polar narcotics. The descriptors that
were related to the toxicity of these compounds accounted for polarisability, electrostatic
properties and H-bonding interactions. However, these properties play a role in molecular
partitioning between two phases, thus, they may also reflect the logPoct values.
A QSAR model for all investigated compounds was derived; however, it was based on
diverse compounds acting by different mechanisms, and thus its statistical fit was not
high. A model with good statistical fit was obtained for the aliphatic compounds,
including descriptors related to compound lipophilicity, reactivity, size and particular
types of interactions/reactivity attributed to the amino group. This result suggests that QSAR
models including compounds acting by different mechanisms might be successful if
descriptors reflecting the different mechanisms are included.
An article based on this work has been published by Lessigiarska et al. (2004a).
12.5. Conclusions
A baseline effect (related to non-polar narcosis) for the toxicity to the bacterium
Sinorhizobium meliloti was shown. However, some non-polar narcotics had toxicities
different from those defined by the baseline. This might be due to high experimental data
variability for some compounds. The toxicity of polar narcotics might be determined by
their polarisability, H-bonding ability, and electrostatic properties. The study
demonstrated that modelling of toxicity of compounds forming free radicals and
electrophilic compounds might be limited by data and descriptor availability. The toxicity
of compounds acting by more specific mechanisms of toxic action was also difficult to
predict with the structural descriptors included in the study.
Toxicity of the aliphatic compounds could be modelled successfully despite the fact that
they act by different mechanisms of toxic action (non-polar narcosis, electrophilic
reactivity, and specific mechanisms). The descriptors included in the QSAR reflect the
mechanisms of action of the compounds, namely compound biouptake and/or membrane
interactions (logPoct), electrophilic reactivity (LUMO), possibly the greater size of the
compounds acting by specific mechanisms (MSA) and particular types of
interactions/reactivity attributed to the amino group (NNH2). However, increasing the diversity
of compounds and their mechanisms of action by also adding aromatic compounds to the
training set resulted in a model with low statistical fit.
Figure 12.1. Plot of toxicity to Sinorhizobium meliloti against logPoct, showing a baseline
toxicity effect (solid line, Equation 12.1). The compounds represented by triangles were
used to define the baseline
a) all investigated compounds regardless of the mechanism of action
[Plot: log(1/IC50) (y-axis, -4 to 5) against logPoct (x-axis, -2 to 8)]
b) non-polar narcotics
[Plot: log(1/IC50) (y-axis, -4 to 2) against logPoct (x-axis, -2 to 7)]
Figure 12.2. Plot of observed against predicted toxicity to Sinorhizobium meliloti for the
QSAR model for aliphatic compounds (Equation 7, Table 12.4). Chemicals with large
residuals are named
[Plot of log(1/IC50): observed values (y-axis, -4 to 4) against predicted values (x-axis, -4 to 4). Labelled compounds: clindamycin, acetone, gentamycin, ethanolamine]
Table 12.1. Investigated compounds, their acute toxicity to S. meliloti, and mechanisms of
toxic action
No  Name  log(1/IC50) b  Mechanism of action c
1  acetaminophen  -0.217  2
2  acetic acid  -0.854  1
3  acetone  -3.326  1
4  acetylsalicylic acid  -0.394  2
5  alachlor  0.386  1
6  4-aminobenzaldehyde  -0.111  5
7  amitriptyline  1.824  6
8  amoxicillin a  -1.147  6
9  atenolol  0.636  6
10  atropine  0.180  6
11  bensulide  0.750  6
12  benzene  0.0862  1
13  benzyl chloride  0.319  5
14  bipyridyl  0.00922  1
15  bromoxynil  1.027  3
16  butylamine  -0.568  1
17  caffeine  -1.280  6
18  catechol  0.337  4, 5
19  celebrex  1.319  6
20  chloramphenicol  -0.636  5, 6
21  chlorobenzene  0.674  1
22  3-chlorophenol  1.602  2
23  4-chlorophenol  1.036  2
24  chloroquine  -0.142  1, 6
25  clindamycin  -0.929  6
26  clomazone  0.991  6
27  codeine  -0.204  6
28  2-cresol  -0.246  2
29  3-cresol  0.0410  2
30  4-cresol  -0.553  2
31  cyanazine  -0.00860  6
32  cyclohexylamine  -0.373  1
33  DCPA (chlorthal-dimethyl)  -0.109  1
34  dexedrine  0.507  2
35  2-dianisidine  -0.236  2, 4
36  diazepam (valium)  -0.504  6
37  diazinon  0.936  6
38  dibromoethane  -0.0969  1
39  1,2-dichlorobenzene  0.588  1
40  1,3-dichlorobenzene  -0.449  1
41  1,4-dichlorobenzene  0.451  1
42  1,2-dichloroethane  -0.179  1
43  dichloromethane  -0.616  1
44  2,4-dichlorophenol  0.544  2
45  3,4-dichlorophenol  1.538  2
46  3,5-dichlorophenol  2.398  2
47  2,4-dichlorophenoxyacetic acid  -0.0334  2
48  diethylamine  -0.423  1
49  digoxin  0.635  6
50  dimethyl sulphoxide  -3.050  1
51  4-(dimethylamino)benzaldehyde  -0.0755  5
52  2,3-dimethylphenol  0.00833  2, 4
53  2,4-dimethylphenol  0.234  2, 4
54  1,2-dinitrobenzene  -0.343  5
55  1,4-dinitrobenzene  -0.270  5
56  2,6-dinitro-4-cresol  0.772  3
57  2,4-dinitrophenol  0.742  3
58  diuron  0.429  1
59  emetine  0.595  6
60  eptam (EPTC)  0.565  1
61  ethanol  -3.216  1
62  ethanolamine  -0.613  5
63  ethyl acetate  -1.097  1
64  ethylbenzene  0.175  1
65  ethylene glycol  -3.535  1
66  FCCP [carbonyl cyanide 4-(trifluoromethoxy)-phenylhydrazone]  3.000  6
67  fluvoxamine  0.167  6
68  furosemide  1.244  6
69  gentamycin  2.854  6
70  hexachlorobenzene  0.793  1
71  hexachlorophene  3.721  2, 3
72  ibuprofen  0.235  2
73  imazapyr  0.126  6
74  indole  -0.631  1
75  isoproterenol  -3.248  6
76  isoxaben  0.0297  6
77  lindane  0.851  1
78  malathion  0.951  6
79  methanol  -3.328  1
80  3-methyl-4-nitrophenol  0.446  5
81  methylphenidate (ritalin)  -0.369  1
82  morphine  0.0825  6
83  1-naphthol  0.409  1
84  naproamide  -0.0273  1
85  naproxen  -0.109  1
86  nicosulfuron  0.169  6
87  nicotine  -0.785  6
88  2-nitrophenol  0.569  5
89  norflurazon  0.223  6
90  novobiocin  -0.512  6
91  octane  -0.149  1
92  orphenadrine  -0.244  1
93  orudis  0.349  1
94  oxadiazon  0.0391  6
95  paraquat  0.590  6
96  pentachlorophenol  3.268  3
97  pentanol  -1.265  1
98  1,10-phenanthroline  -1.363  1
99  phenol  -1.111  2
100  primsulfuron  0.278  6
101  propan-2-ol  -2.977  1
102  propranolol  0.395  1, 6
103  pseudocumene  0.509  1
104  quinidine  0.375  6
105  quinine  0.394  6
106  resorcinol  0.670  2, 4
107  salicylic acid  0.0410  2
108  scopolamine  0.393  6
109  sethoxydim  2.097  6
110  sulphosalicylic acid  -0.705  2
111  tert-butyl methyl ether  -2.173  1
112  tetrachloroethene  0.260  5
113  tetrachloromethane  -1.025  1
114  tetracycline  -0.145  6
115  theophylline  -1.316  6
116  thiazopyr  0.921  6
117  thifensulfuron  -0.514  6
118  thioridazine  1.987  6
119  toluene  -0.373  1
120  4-toluidine  -1.143  2
121  1,2,3-trichlorobenzene  1.119  1
122  1,1,1-trichloroethane  0.182  1
123  trichloroethene  0.298  5
124  trichloromethane  -0.723  1
125  trifluralin  1.009  6
126  2,4,6-trihydroxytoluene  0.437  2, 4
127  2,3,6-trimethylphenol  -0.0561  2, 4
128  2,4,6-trimethylphenol  0.0329  2, 4
129  3,4,5-trimethylphenol  0.428  2, 4
130  2,4,6-trinitrotoluene  0.668  5
131  verapamil  0.690  6
132  warfarin  -0.250  6
133  zanaflex  -0.0785  6
a mean value from the values for amoxicillin (USA), amoxicillin (Cyprus), amoxicillin (Jordan), and amoxicillin (Mexico)
b toxicity data are presented in terms of IC50 values, in units of mmol/l
c the mechanism of action is encoded by: 1 – non-polar narcotics; 2 – polar narcotics; 3 – weak acid respiratory uncouplers; 4 – formation of free radicals; 5 – electrophile/proelectrophile compounds; 6 – toxic action by specific mechanisms
Table 12.2. Physicochemical descriptors used in the selected QSARs for
toxicity/cytotoxicity
Symbol Name
α molecular polarisability
0χA average molecular connectivity index of zero order
2χ connectivity index of second order
2χv valence connectivity index of second order
3χpv valence path connectivity index of third order
BLI Kier benzene-likeliness index
d1χ difference connectivity index of first order
Etot total energy of the molecule
HOMO energy of the highest occupied molecular orbital
Hy hydrophilic factor
LmH difference between the energies of LUMO and HOMO
logPoct calculated logarithm of the octanol-water partition coefficient
logPoctM measured logarithm of the octanol-water partition coefficient
logSaq calculated logarithm of the aqueous solubility (units mol/l)
logSaqM measured logarithm of the aqueous solubility (units mol/l)
LUMO energy of the lowest unoccupied molecular orbital
µ magnitude of the molecular dipole moment
MM molecular mass
MSA molecular surface area
NC#N number of cyano groups
NPh number of phenyl rings
NH-donors number of H-bond donor atoms
NNH2 number of amino groups
Nrings6 number of 6-membered rings
NO number of O atoms
NOH number of hydroxyl groups
PW2 path/walk Randić shape index of second order
Qmaxn largest negative charge over the atoms in a compound
ΣE-state sum of the E-state indices of all atoms in the molecule
ΣH-bond sum of the H-bond donor and acceptor atoms
SpcP specific polarisability of a molecule equal to polarisability/volume
TIE E-state topological parameter
Table 12.3. Calculated values for the molecular descriptors included in the selected QSARs for acute toxicity to S. meliloti
No  Name  logPoct (KOWWIN)  logSaq (WSKOWWIN)  Etot a (TSAR)  LUMO (TSAR)  NNH2 (TSAR)  NOH (TSAR)  d1χ (QsarIS)  µ (QsarIS)  Qmaxn (QsarIS)  SpcP (QsarIS)  MSA (QsarIS)
1  acetaminophen  0.27  -0.70  -25.72  0.253  0  1  -0.233  5.48  -0.423  0.0241  193.6
2  acetic acid  0.09  0.90  -28.49  0.974  0  1  -0.182  2.35  -0.327  0.0279  83.0
3  acetone  -0.24  0.58  -11.17  0.844  0  0  -0.182  2.54  -0.364  0.0359  96.8
4  acetylsalicylic acid  1.13  -1.53  -25.07  -0.532  0  1  -0.305  7.19  -0.399  0.0197  204.1
5  alachlor  3.37  -4.17  5.31  0.039  0  0  -0.226  4.79  -0.422  0.0419  296.4
6  4-aminobenzaldehyde  0.79  -0.76  -8.73  -0.234  1  0  -0.089  6.86  -0.448  0.0223  160.5
7  amitriptyline  4.95  -5.53  18.14  0.079  0  0  -0.160  0.32  -0.363  0.0333  325.1
8  amoxicillin  0.97  -2.03  32.51  -0.023  1  2  -0.783  6.58  -0.421  0.0291  325.2
9  atenolol  -0.03  -2.59  -20.87  0.000  1  1  -0.445  6.80  -0.423  0.0332  330.6
10  atropine  1.91  -1.87  14.98  0.150  0  1  -0.228  1.75  -0.393  0.0396  285.6
11  bensulide  4.12  -5.69  -23.00  -2.414  0  0  -0.769  6.40  -0.320  0.0340  399.7
12  benzene  1.99  -1.59  5.39  0.554  0  0  0.086  0.00  -0.062  0.0261  117.9
13  benzyl chloride  2.79  -2.09  5.30  0.075  0  0  0.018  1.27  -0.120  0.0436  150.4
14  bipyridyl  1.38  -1.62  9.90  -0.537  0  0  0.052  0.02  -0.305  0.0195  199.8
15  bromoxynil  3.39  -3.80  10.79  -0.888  0  1  -0.267  1.19  -0.398  0.0456  193.3
16  butylamine  0.83  0.44  3.62  3.529  1  0  0.000  0.53  -0.330  0.0468  132.9
17  caffeine  0.16  -1.87  -2.78  -0.349  0  0  -0.378  4.79  -0.404  0.0240  202.9
18  catechol  1.03  -0.18  3.94  0.297  0  2  -0.110  5.51  -0.393  0.0229  132.5
19  celebrex  3.47  -4.95  19.78  -1.198  1  0  -0.861  7.36  -0.321  0.0242  332.8
20  chloramphenicol  0.92  -2.92  20.44  -1.203  0  2  -0.552  7.33  -0.421  0.0423  285.9
21  chlorobenzene  2.64  -2.45  3.78  0.155  0  0  -0.020  1.61  -0.084  0.0405  138.2
22  3-chlorophenol  2.16  -1.70  -4.29  0.019  0  1  -0.127  2.91  -0.394  0.0377  146.1
23  4-chlorophenol  2.16  -1.60  -3.80  0.095  0  1  -0.127  2.33  -0.394  0.0379  149.0
24  chloroquine  4.50  -4.48  -12.30  -0.454  0  0  -0.280  7.37  -0.358  0.0420  359.1
25  clindamycin  2.90  -4.65  15.22  1.372  0  3  -0.709  6.11  -0.422  0.0462  421.4
26  clomazone  2.86  -3.08  - b  - b  0  0  -0.409  0.63  -0.391  0.0470  212.0
27  codeine  1.28  -1.39  38.46  0.219  0  1  -0.250  4.81  -0.386  0.0459  211.1
28  2-cresol  2.06  -1.08  0.12  0.408  0  1  -0.110  2.13  -0.394  0.0277  145.1
29  3-cresol  2.06  -1.09  -3.97  0.395  0  1  -0.127  3.69  -0.394  0.0281  144.0
30  4-cresol  2.06  -1.07  -4.26  0.430  0  1  -0.127  4.13  -0.394  0.0277  147.7
31  cyanazine  2.51  -3.12  -64.41  -0.210  0  0  -0.450  3.48  -0.258  0.0388  240.9
32  cyclohexylamine  1.63  -0.19  -2.26  3.461  1  0  -0.020  0.82  -0.327  0.0447  148.7
33  DCPA (chlorthal-dimethyl)  4.24  -5.28  4.37  -1.314  0  0  -0.553  5.92  -0.387  0.0521  275.8
34  dexedrine  1.76  -0.68  -3.25  0.659  1  0  -0.127  0.37  -0.327  0.0353  185.8
35  2-dianisidine  2.08  -2.53  5.53  0.020  2  0  -0.263  1.00  -0.385  0.0262  291.2
36  diazepam (valium)  2.70  -3.69  31.93  -0.633  0  0  -0.250  4.98  -0.422  0.0330  270.0
37  diazinon  4.73  -5.45  -25.25  -1.200  0  0  -0.516  3.09  -0.264  0.0440  298.5
38  dibromoethane  2.01  -2.25  -1.36  0.001  0  0  0.000  0.00  -0.091  0.0819  130.3
39  1,2-dichlorobenzene  3.28  -3.20  6.11  -0.142  0  0  -0.110  3.48  -0.130  0.0532  150.2
40  1,3-dichlorobenzene  3.28  -3.29  -0.34  -0.158  0  0  -0.127  1.96  -0.131  0.0520  153.3
41  1,4-dichlorobenzene  3.28  -3.21  0.66  -0.216  0  0  -0.127  1.30  -0.131  0.0509  158.5
42  1,2-dichloroethane  1.83  -1.19  -1.38  0.685  0  0  0.000  0.00  -0.124  0.0820  112.3
43  dichloromethane  1.34  -0.89  0.00  0.595  0  0  0.000  1.00  -0.108  0.0925  87.1
44  2,4-dichlorophenol  2.80  -2.42  -0.34  -0.245  0  1  -0.216  4.36  -0.389  0.0497  158.3
45  3,4-dichlorophenol  2.80  -2.65  -3.12  -0.236  0  1  -0.216  2.27  -0.390  0.0489  165.3
46  3,5-dichlorophenol  2.80  -2.90  -9.97  -0.285  0  1  -0.233  1.80  -0.390  0.0499  157.8
47  2,4-dichlorophenoxyacetic acid  2.62  -2.82  -11.77  -0.520  0  1  -0.322  5.21  -0.328  0.0411  221.8
48  diethylamine  0.81  0.78  -1.11  3.227  0  0  0.000  0.30  -0.316  0.0466  132.8
49  digoxin  0.50  -5.97  45.79  -0.158  0  6  -1.403  5.68  -0.391  0.0418  714.9
50  dimethyl sulphoxide  -1.22  1.11  -8.41  0.808  0  0  -0.182  3.25  -0.309  0.0323  105.2
51  4-(dimethylamino)benzaldehyde  1.89  -1.84  8.87  -0.177  0  0  -0.178  6.73  -0.362  0.0280  192.8
52  2,3-dimethylphenol  2.61  -1.63  -0.11  0.374  0  1  -0.199  1.59  -0.360  0.0310  159.5
53  2,4-dimethylphenol  2.61  -1.48  -3.05  0.396  0  1  -0.216  1.79  -0.360  0.0308  159.5
54  1,2-dinitrobenzene  1.63  -2.26  9.15  -1.841  0  0  -0.288  0.82  -0.061  0.0122  163.2
55  1,4-dinitrobenzene  1.63  -2.07  24.63  -2.208  0  0  -0.305  0.00  -0.053  0.0114  179.3
56  2,6-dinitro-4-cresol  2.27  -2.59  7.10  -1.894  0  1  -0.484  1.92  -0.393  0.0151  197.9
57  2,4-dinitrophenol  1.73  -1.97  35.58  -1.887  0  1  -0.394  3.32  -0.394  0.0108  184.8
58  diuron  2.67  -3.19  -11.53  -0.146  0  0  -0.411  5.12  -0.450  0.0420  263.7
59  emetine  5.20  -6.13  27.62  0.399  0  0  -0.322  4.06  -0.378  0.0349  520.6
60  eptam (EPTC)  3.02  -3.32  4.72  0.123  0  0  -0.157  3.53  -0.409  0.0406  243.2
61  ethanol  -0.14  1.24  1.69  3.565  0  1  0.000  1.50  -0.395  0.0434  82.7
62  ethanolamine  -1.61  1.21  10.03  3.169  1  1  0.000  1.28  -0.394  0.0409  100.5
63  ethyl acetate  0.86  -0.47  -7.23  1.153  0  0  -0.144  0.91  -0.330  0.0352  127.8
64  ethylbenzene  3.03  -2.67  3.32  0.528  0  0  0.018  0.16  -0.062  0.0321  156.7
65  ethylene glycol  -1.20  1.21  9.13  3.023  0  2  0.000  1.58  -0.393  0.0374  95.4
66  FCCP [carbonyl cyanide 4-(trifluoromethoxy)-phenylhydrazone]  3.15  -3.58  -11.52  -1.125  0  0  -0.157  1.01  -0.178  0.0233  235.7
67  fluvoxamine  3.09  -4.16  20.25  -0.483  1  0  -0.429  1.70  -0.383  0.0373  325.4
68  furosemide  2.32  -3.35  -48.67  -0.869  1  1  -0.628  5.61  -0.397  0.0275  298.0
69  gentamycin  -1.48  -0.41  30.12  1.979  3  3  -0.920  3.69  -0.387  0.0496  405.1
70  hexachlorobenzene  5.86  -6.17  1.59  -1.041  0  0  -0.450  0.48  -0.100  0.0825  205.1
71  hexachlorophene  6.92  -8.03  3.69  -0.861  0  2  -0.679  0.92  -0.386  0.0638  303.2
72  ibuprofen  3.79  -3.70  -15.15  0.199  0  1  -0.411  2.84  -0.326  0.0395  235.9
73  imazapyr  1.57  -2.60  -21.33  -1.184  0  1  -0.555  9.60  -0.393  0.0267  261.2
74  indole  2.05  -1.88  16.54  0.300  0  0  0.052  0.48  -0.196  0.0254  139.8
75  isoproterenol  0.21  0.66  0.89  0.164  0  3  -0.411  7.94  -0.403  0.0331  255.6
76  isoxaben  3.98  -4.99  -5.17  -0.314  0  0  -0.430  8.84  -0.434  0.0314  361.5
77  lindane  4.26  -4.86  4.98  0.147  0  0  -0.450  0.03  -0.119  0.0890  233.6
78  malathion  2.29  -3.62  -9.94  -2.658  0  0  -0.496  4.15  -0.329  0.0376  306.3
79  methanol  -0.63  1.49  7.98  3.778  0  1  0.000  1.70  -0.398  0.0457  55.4
80  3-methyl-4-nitrophenol  2.46  -1.86  6.54  -1.006  0  1  -0.305  1.42  -0.360  0.0213  160.0
81  methylphenidate (ritalin)  2.78  -2.27  7.93  0.186  0  0  -0.089  2.14  -0.326  0.0343  265.8
82  morphine  0.72  -1.03  29.29  0.188  0  2  -0.288  5.32  -0.386  0.0440  201.9
83  1-naphthol  2.69  -2.11  6.82  -0.247  0  1  -0.037  0.68  -0.359  0.0217  178.8
84  naproamide  3.33  -4.05  14.81  -0.403  0  0  -0.246  3.67  -0.417  0.0316  318.2
85  naproxen  3.10  -3.20  -10.72  -0.439  0  1  -0.301  3.67  -0.385  0.0269  253.2
86  nicosulfuron  -1.15  -3.51  -94.35  -1.119  0  0  -0.741  10.98  -0.410  0.0224  386.8
87  nicotine  1.00  0.79  7.06  0.242  0  0  -0.037  2.77  -0.351  0.0329  209.3
88  2-nitrophenol  1.91  -1.75  9.56  -1.014  0  1  -0.199  1.67  -0.359  0.0166  152.8
89  norflurazon  2.19  -3.28  31.23  -0.801  0  0  -0.573  3.77  -0.370  0.0302  273.4
90  novobiocin  2.45  -5.18  - b  - b  1  3  -1.276  6.30  -0.419  0.0261  646.9
91  octane  4.27  -5.00  1.74  3.638  0  0  0.000  0.00  -0.065  0.0490  200.6
92  orphenadrine  3.65  -3.38  12.26  0.379  0  0  -0.233  0.86  -0.366  0.0386  284.3
93  orudis  3.00  -3.33  5.52  -0.583  0  1  -0.322  4.41  -0.435  0.0234  279.6
94  oxadiazon  3.43  -4.51  8.35  -0.366  0  1  -0.823  6.62  -0.403  0.0364  329.7
95  paraquat  -0.56  -0.10  38.98  -8.887  0  0  -0.160  0.00  -0.068  0.0334  199.1
96  pentachlorophenol  4.74  -4.94  8.31  -0.978  0  1  -0.450  1.31  -0.380  0.0750  193.4
97  pentanol  1.33  -0.63  3.91  3.391  0  1  0.000  0.54  -0.395  0.0454  148.1
98  1,10-phenanthroline  2.29  -3.77  19.99  -0.718  0  0  0.035  3.64  -0.309  0.0179  210.1
99  phenol  1.51  -0.56  -0.85  0.398  0  1  -0.020  1.13  -0.360  0.0239  128.8
100  primsulfuron  2.41  -4.67  -125.03  -1.158  0  0  -0.902  33.10  -0.348  0.0210  354.8
101  propan-2-ol  0.28  0.83  -8.10  3.576  0  1  -0.182  2.02  -0.392  0.0438  101.8
102  propranolol  2.60  -3.06  9.09  -0.219  0  1  -0.250  3.60  -0.388  0.0305  335.0
103  pseudocumene  3.63  -3.18  -2.12  0.502  0  0  -0.216  0.09  -0.059  0.0343  170.8
104  quinidine  3.29  -3.50  23.42  -0.597  0  1  -0.207  2.89  -0.388  0.0368  296.3
105  quinine  3.29  -3.50  23.08  -0.571  0  1  -0.207  2.89  -0.388  0.0368  296.3
106  resorcinol  1.03  -0.11  -7.84  0.321  0  2  -0.127  2.39  -0.394  0.0223  137.5
107  salicylic acid  2.24  -1.56  -19.17  -0.591  0  2  -0.199  5.33  -0.391  0.0196  153.4
108  scopolamine  0.39  -1.24  135.39  0.026  0  1  -0.228  1.20  -0.393  0.0411  241.3
109  sethoxydim  3.99  -5.33  -30.79  -0.052  0  1  -0.387  2.21  -0.393  0.0375  381.1
110  sulphosalicylic acid  -0.91  0.57  -44.86  -1.247  0  3  -0.594  6.53  -0.349  0.0145  203.1
111  tert-butyl methyl ether  1.43  -0.65  -11.29  2.988  0  0  -0.354  1.03  -0.377  0.0512  125.1
112  tetrachloroethene  2.97  -3.32  0.99  -0.437  0  0  -0.271  0.01  -0.072  0.0968  129.2
113  tetrachloromethane  2.44  -2.74  0.00  -1.117  0  0  -0.414  0.00  -0.067  0.1199  105.5
114  tetracycline  -1.33  -2.06  -6.41  -0.816  1  5  -1.142  12.72  -0.436  0.0291  372.6
115  theophylline  -0.39  -1.79  8.43  -0.089  0  0  -0.288  3.40  -0.399  0.0217  156.2
116  thiazopyr  5.76  -5.41  33.54  -1.532  0  0  -0.841  6.68  -0.403  0.0314  311.8
117  thifensulfuron  1.32  -3.05  -146.62  -1.606  0  1  -0.690  3.40  -0.397  0.0189  288.3
118  thioridazine  6.45  -7.04  16.47  -0.138  0  0  -0.156  1.77  -0.358  0.0334  354.6
119  toluene  2.54  -2.21  2.31  0.520  0  0  -0.020  0.16  -0.062  0.0294  138.2
120  4-toluidine  1.62  -1.17  -5.67  0.616  1  0  -0.127  0.62  -0.291  0.0300  150.4
121  1,2,3-trichlorobenzene  3.93  -3.98  6.02  -0.364  0  0  -0.199  4.55  -0.120  0.0630  161.5
122  1,1,1-trichloroethane  2.68  -2.30  -5.07  -0.265  0  0  -0.414  1.58  -0.084  0.1043  107.3
123  trichloroethene  2.47  -2.23  -0.64  -0.061  0  0  -0.144  1.25  -0.100  0.0852  120.1
124  trichloromethane  1.52  -1.76  0.00  -0.303  0  0  -0.182  0.59  -0.087  0.1036  100.7
125  trifluralin  5.31  -6.21  20.78  -1.989  0  0  -0.786  4.64  -0.309  0.0365  258.0
126  2,4,6-trihydroxytoluene  1.10  -0.58  -9.06  0.253  0  3  -0.305  1.61  -0.379  0.0241  168.6
127  2,3,6-trimethylphenol  3.15  -1.90  1.01  0.383  0  1  -0.288  1.51  -0.359  0.0347  169.8
128  2,4,6-trimethylphenol  3.15  -1.95  -1.91  0.433  0  1  -0.305  1.74  -0.359  0.0341  173.0
129  3,4,5-trimethylphenol  3.15  -2.31  -5.77  0.430  0  1  -0.305  1.96  -0.360  0.0338  174.1
130  2,4,6-trinitrotoluene  1.99  -2.61  23.55  -2.433  0  0  -0.573  0.17  -0.046  0.0121  200.2
131  verapamil  4.80  -5.01  42.43  0.181  0  0  -0.572  7.72  -0.378  0.0405  459.2
132  warfarin  2.23  -3.67  -10.09  -1.034  0  1  -0.339  6.48  -0.409  0.0277  283.3
133  zanaflex  1.36  -2.22  -5.41  -1.376  0  0  -0.071  5.51  -0.326  0.0281  232.3
a calculated with the Cosmic force field for molecular mechanic calculations
b not calculated by the TSAR software
Table 12.4. Selected QSARs for acute toxicity to S. meliloti †
No  Mechanism of toxicity / chemical type  Regression equation  n  R2  s  F  Q2
1  Non-polar narcotic compounds  0.648 (± 0.069) logPoct – 1.98 (± 0.19)  46  0.670  0.717  89.3  0.630
2  Polar narcotic compounds  80.0 (± 9.1) SpcP – 9.71 (± 3.45) Qmaxn + 0.399 (± 0.143) NOH – 6.37 (± 1.29)  26  0.804  0.498  30.0  0.723
3  Weak acid respiratory uncouplers  0.657 (± 0.141) logPoct – 0.599 (± 0.597)  5  0.879  0.587  21.8  0.692
4  Compounds forming free radicals  0.131 (± 0.024) µ – 0.0514 (± 0.0064) Etot – 0.171 (± 0.063)  9  0.929  0.089  39.2  0.391
5  Compounds acting by specific mechanisms  -0.386 (± 0.067) logSaq + 6.46 (± 1.70) Qmaxn + 1.17 (± 0.43) d1χ + 1.16 (± 0.24) NNH2 + 1.70 (± 0.64)  47  0.570  0.743  13.9  0.329
6  All 133 compounds  0.462 (± 0.048) logPoct – 0.278 (± 0.049) LUMO + 0.244 (± 0.078) NOH + 0.909 (± 0.190) NNH2 – 1.28 (± 0.16)  131 a  0.541  0.818  37.1  0.464
7  Aliphatic compounds  0.542 (± 0.095) logPoct – 0.305 (± 0.087) LUMO + 0.00245 (± 0.00105) MSA + 1.73 (± 0.25) NNH2 – 1.80 (± 0.30)  30  0.827  0.715  29.8  0.750
† toxicity expressed as log(1/IC50) (in units mmol/l)
a clomazone and novobiocin were excluded, since their molecular orbital properties could not be calculated by the TSAR software
CHAPTER 13
INVESTIGATION OF ACUTE AQUATIC TOXICITY
13.1. Objectives
The objective of the study presented in this chapter was to investigate environmental toxic
effects including toxicity to five aquatic species (two algal species, Daphnia, and two fish
species). Interspecies correlations were developed to identify similarities in toxicity
between the species. Baseline toxicity effects for the non-polar narcotics were sought.
Chemical toxicity was investigated by applying different statistical approaches, including
multiple linear regression, linear discriminant analysis, and classification trees in order to
develop regression and classification models for toxicity.
13.2. Methods
13.2.1. Toxicity data
The data set was extracted from the New Chemicals Data Base (NCD), which is
maintained by the European Chemicals Bureau within the European Commission’s Joint
Research Centre (ECB, http://ecb.jrc.it). The data are submitted by industry as a part of
the notification process for each new chemical substance that is manufactured in or
imported into the European Union. A harmonised European notification system for new
substances was introduced as part of the 6th Amendment to Directive 67/548/EEC (EC,
1967) on the classification, packaging and labelling of dangerous substances (Directive
79/831/EEC; EC, 1979). The testing for physicochemical, toxicological and
ecotoxicological endpoints, included in the technical dossiers for substances, should be
made according to the official testing methods contained in Annex V of Directive
67/548/EEC (EC, 1967). Some aspects of dossier information are confidential,
particularly chemical structures and spectra. Due to this confidentiality, the structures of
the investigated compounds and their toxicity values are not presented in this thesis.
The extracted data set from the NCD consisted of algal, daphnid and fish toxicities of
approximately 2900 neutral chemicals, salts, metal complexes and chemical mixtures.
However, most of the substances did not have reported toxicity values for all the species.
Algal toxicities were expressed as the chemical concentrations (in units mg/l) that result
in a 50% reduction in the growth (EbC50) or the growth rate (ErC50) of an algal culture
relative to a control over a 72-h exposure period. The testing
recommendations (Annex V of Directive 67/548/EEC) indicate that Selenastrum
capricornutum (now Pseudokirchneriella subcapitata) and Scenedesmus subspicatus are
the preferred test species. In general, a given chemical is tested on a single algal species.
Toxicity to Daphnia was expressed as the chemical concentration that results in 50%
immobilisation of the Daphnia in a test batch, over a 48-h exposure period, and in a static
system (EC50, in mg/l). The preferred test species is Daphnia magna.
Fish toxicity was expressed as the chemical concentration at which 50% lethality is
observed in a test batch of fish within a 96-h exposure period (LC50, in mg/l). Three
testing procedures can be used: a static test, in which test solutions remain unchanged
during the test; a semi-static test, without flow of test solution, but with regular batchwise
renewal of the test solution after a prolonged period (e.g. 24 hours); and a flow-through
test, in which the water is renewed constantly in the test chambers, the chemical under
test being transported with the water used to renew the test medium. A variety of fish
species can be used (Annex V of Directive 67/548/EEC); however, Brachydanio rerio
(zebra fish) and Oncorhynchus mykiss (rainbow trout) are the preferred species. Again,
for a given chemical, testing is generally performed on a single species.
The toxicities of some of the chemicals extracted from the NCD were not reported as
exact values, but as approximate numbers or as greater than or smaller than a cut-off
value. For the purposes of classification and labelling, the test recommendations (EC,
1967) allow for a limit test to be performed at 100 mg/l, to demonstrate that the 50%
effect concentration is greater than this concentration. Furthermore, if two consecutive
test concentrations with a ratio of 2.2 gave only 0 and 100% response, these two values
were considered sufficient to indicate the range within which the investigated toxicity
value falls. The correlations between the toxicity endpoints, the baseline relationships and
the regression QSARs were obtained by using only chemicals for which exact toxicity
values were reported.
To develop baseline relationships and QSARs, the initial data set of approximately 2900
chemicals was reduced by removing salts, metal complexes and mixtures, leaving single
and neutral chemicals. All notes relating to the toxicity testing of each chemical were
examined. Only chemicals with a reported purity greater than 95% were considered. This
value was chosen because a balance was sought between the use of pure compounds
(ensuring that the toxic effect is due to the main compound, and not to some impurities),
and a greater number of data points. Chemicals that could not be maintained within 80%
of the starting concentration during the test, due to chemical volatility, instability
(hydrolysis) in the test medium, or poor water solubility, were excluded from the
investigation. Any chemical with a toxicity value higher than its measured water
solubility was also excluded, since the half-maximal effect concentration of such a
chemical would not be reached under the experimental conditions. The final numbers of
investigated chemicals were different for the different endpoints and are given in Section
13.3 “Results”.
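The exclusion criteria above can be summarised as a single filter. The following is only an illustrative sketch: the record fields and function name are hypothetical, and the handling of a missing solubility measurement (the record is kept) is an assumption that is not stated in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToxicityRecord:
    # All field names are hypothetical; they are not taken from the database.
    is_single_neutral: bool            # not a salt, metal complex or mixture
    purity_percent: float
    concentration_maintained: bool     # stayed within 80% of the starting concentration
    toxicity_mg_per_l: float
    water_solubility_mg_per_l: Optional[float]  # measured value, if reported


def is_suitable(record: ToxicityRecord) -> bool:
    """Apply the exclusion criteria described in the text to one record."""
    if not record.is_single_neutral:
        return False
    if record.purity_percent <= 95:    # only purity greater than 95% is accepted
        return False
    if not record.concentration_maintained:
        return False
    solubility = record.water_solubility_mg_per_l
    # Assumption: records without a measured solubility are not excluded.
    if solubility is not None and record.toxicity_mg_per_l > solubility:
        return False
    return True
```

For the regression models a further condition applies, namely that the toxicity is reported as an exact value; this is left out of the sketch because the classification models do not require it.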
To develop classification models, the boundaries for toxicity classification were set
according to Annex VI of Directive 67/548/EEC (EC, 2001a). In the following table the
classification boundaries according to Annex VI are given:
Category             Very toxic (V)   Toxic (T)              Harmful (H)               Non-toxic (N)
Endpoint value (E)   E ≤ 1 mg/l       1 mg/l < E ≤ 10 mg/l   10 mg/l < E ≤ 100 mg/l    E > 100 mg/l
In the table “Endpoint value (E)” denotes LC50 of fish (96h-assay), EC50 of Daphnia
(48h-assay), or IC50 of algae (72h-assay). To assess algal toxicity, the inhibition of the
algal growth rate (ErC50 values) is preferred to the inhibition of the growth of the algal
biomass (EbC50).
In the present work the different species were investigated separately. The group of very
toxic compounds was denoted with “V”; the group of toxic compounds was denoted with
“T”; “H” was used to denote the group of harmful compounds. Compounds having
toxicity values higher than 100 mg/l for a given species were classified as non-toxic (the
group was denoted with “N”).
Additionally, another type of classification was tried, in which the groups of very toxic
(V), toxic (T), and harmful (H) compounds were united, to form a single group of
dangerous compounds. Thus, the chemicals were divided into two groups, representing
dangerous compounds for the investigated species (the group was denoted with “D”), and
non-toxic compounds (this group was denoted with “N”).
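The two grouping schemes can be written as simple threshold rules. The sketch below uses the Annex VI boundaries in mg/l quoted above; the function names are illustrative only.

```python
def classify_four_groups(endpoint_mg_per_l: float) -> str:
    """Return 'V', 'T', 'H' or 'N' for a 50% effect concentration in mg/l."""
    if endpoint_mg_per_l <= 1:
        return "V"   # very toxic
    if endpoint_mg_per_l <= 10:
        return "T"   # toxic
    if endpoint_mg_per_l <= 100:
        return "H"   # harmful
    return "N"       # non-toxic


def classify_two_groups(endpoint_mg_per_l: float) -> str:
    """Merge V, T and H into a single group of dangerous compounds ('D')."""
    return "N" if classify_four_groups(endpoint_mg_per_l) == "N" else "D"
```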
Additionally, experimental logPoct data were taken from the NCD. The recommended
methods (Annex V of Directive 67/548/EEC) for the experimental determination of
logPoct are the shake-flask method and high performance liquid chromatography (HPLC).
The measurements were obtained at different temperatures.
13.2.2. Structural Descriptors
The following structural descriptors for the compounds were calculated: the logarithm of
the octanol-water partition coefficient (logPoct), aqueous solubility, and approximately
250 different quantum-chemical, charge, and topological descriptors. LogPoct and aqueous
solubilities were calculated with the KOWWIN version 1.66 and WSKOWWIN version
1.40 programs, respectively (Syracuse Research Corporation, Syracuse, NY, USA;
downloaded for free from the website of the Environmental Protection Agency of the
United States: http://www.epa.gov/oppt/exposure/docs/episuitedl.htm). TSAR for
Windows version 3.3 (Accelrys Inc.) was used to calculate quantum-chemical and
topological descriptors. For quantum-chemical calculations, the VAMP package
(Accelrys Inc.) was used, applying the AM1 Hamiltonian. Topological and charge
descriptors were calculated by using DRAGON version 5.0 (TALETE srl). For the
calculations in DRAGON that require 3D chemical structures (the charge descriptors), the
structures optimised by VAMP optimisation were transferred from TSAR to DRAGON as
Mol2 files.
13.2.3. Statistical Analysis
Before performing statistical analyses, the descriptors that had values of zero for more
than 95% of the compounds in a data set were excluded from the descriptor data matrix
since these carry no useful information. To reduce the number of intercorrelated
molecular descriptors, the algorithm developed by the author of this thesis was applied
(see Chapter 9).
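The first filtering step (removal of descriptors that are zero for more than 95% of the compounds) could be sketched as follows. The data layout, a dictionary mapping descriptor names to lists of values, is an assumption made for illustration. The intercorrelation-reduction algorithm of Chapter 9 is not reproduced here.

```python
def drop_mostly_zero(descriptors, max_zero_fraction=0.95):
    """Keep only descriptors whose fraction of zero values does not exceed the limit.

    `descriptors` maps a descriptor name to its list of values, one per compound
    (hypothetical layout).
    """
    kept = {}
    for name, values in descriptors.items():
        zero_fraction = sum(1 for v in values if v == 0) / len(values)
        if zero_fraction <= max_zero_fraction:
            kept[name] = values
    return kept
```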
QSARs were obtained by multiple linear regression, using the MINITAB version 13
software (Minitab Inc., State College, PA, USA). Again, forward stepwise regression was
applied to select variables in the QSARs, excluding descriptors considered to lack
physicochemical and/or mechanistic relevance. Also, the best-subsets regression program,
written in C by the author, was used (see Chapter 9). Only those variables that
intercorrelated with a coefficient of intercorrelation R less than 0.7 were included in the
same model. The leave-one-out cross-validation statistical procedure was applied to the
QSAR models.
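Two of the checks described above, the intercorrelation limit and leave-one-out cross-validation, can be illustrated with a short sketch. It is not the MINITAB procedure or the author's C program. The use of the absolute value of R, the restriction of the cross-validation to a one-descriptor model, and all function names are assumptions made to keep the example small.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def descriptors_compatible(columns, r_limit=0.7):
    """True if every pair of descriptor columns has |R| below the limit."""
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if abs(pearson_r(columns[i], columns[j])) >= r_limit:
                return False
    return True


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one descriptor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_q2(x, y):
    """Leave-one-out cross-validated Q2 = 1 - PRESS / total sum of squares."""
    press = 0.0
    for i in range(len(x)):
        # Refit the model without compound i, then predict compound i.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_total
```

For a model with several descriptors the same leave-one-out loop applies, with the single-descriptor fit replaced by a multiple linear regression.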
Discriminant analysis was performed with the Statistica version 5.5 software (StatSoft
Inc., Tulsa, OK, USA). Variable selection in the discriminant analysis was performed by
a forward stepwise procedure, applying the same considerations regarding
physicochemical and/or mechanistic relevance of the selected variables as in the
regression analysis (see above). The assumptions of discriminant analysis for
homogeneity of variable variance and variance/covariance matrices within the groups
were tested with the univariate Levene test, and with the multivariate Sen-Puri test,
included in the ANOVA/MANOVA module of Statistica version 5.5 software.
Classification trees were developed by using the Statistica version 5.5 software. Both
discriminant-based univariate (DBU) and classification and regression tree (CART)
splitting algorithms were explored. In the CART algorithm, the Gini index (Equation
5.21, Chapter 5, Section 5.5 “Classification trees”) was used as a measure of node
homogeneity. A minimum node size of 5 observations was used as a stopping rule, and 3-
fold cross-validation pruning (dividing the training set into three groups) on
misclassification error was applied. The least complex tree, for which the cross-validated
cost did not exceed the minimum cross-validated cost plus the standard error of the
minimum cross-validated cost, was selected as a final classification tree.
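The selection rule in the last sentence can be sketched as follows. The candidate trees and their cost values are invented for illustration and do not come from the Statistica output.

```python
def select_tree(candidates):
    """Pick the least complex tree whose cross-validated (CV) cost is within
    one standard error of the minimum CV cost.

    Each candidate is a dict with hypothetical keys:
    'n_terminal_nodes', 'cv_cost' and 'cv_cost_se'.
    """
    best = min(candidates, key=lambda c: c["cv_cost"])
    threshold = best["cv_cost"] + best["cv_cost_se"]
    eligible = [c for c in candidates if c["cv_cost"] <= threshold]
    return min(eligible, key=lambda c: c["n_terminal_nodes"])


# Invented pruning sequence, for illustration only.
pruning_sequence = [
    {"n_terminal_nodes": 2, "cv_cost": 0.40, "cv_cost_se": 0.03},
    {"n_terminal_nodes": 4, "cv_cost": 0.31, "cv_cost_se": 0.03},
    {"n_terminal_nodes": 7, "cv_cost": 0.29, "cv_cost_se": 0.03},
    {"n_terminal_nodes": 12, "cv_cost": 0.33, "cv_cost_se": 0.03},
]
chosen = select_tree(pruning_sequence)
```

In this invented example the tree with 7 terminal nodes has the minimum cost, but the tree with 4 terminal nodes lies within one standard error of it and is therefore preferred as the simpler model.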
13.3. Results
13.3.1. Correlations between the toxicity endpoints
To perform correlations between the toxicity endpoints the toxicity concentrations were
kept in units of mg/l, as they are reported in the database. It was impossible to transform
the data into units of mol/l, because mixtures and salts with ill-defined compositions were
also included in the correlations. The correlations obtained are reported in Table 13.1.
Equation 1 (Table 13.1) represents the correlation between algal EbC50 and ErC50 values,
without taking into account the algal species tested. To obtain interspecies correlations
between the algal toxicity and the toxicity to Daphnia or fish, the EbC50 values were used
to represent the algal toxicity, as they were reported for more chemicals than were the
ErC50 values. Similar interspecies correlations were obtained when ErC50 values were
used to represent the algal toxicity, which is consistent with the high EbC50/ErC50
correlation, expressed by Equation 1 (Table 13.1).
13.3.2. Analysis of baseline effects
Before investigating the baseline effects and developing QSARs, each chemical was
assigned to a mechanism of toxic action according to the structural criteria
described in Chapter 8, Table 8.1. To investigate the baseline effects and develop QSARs,
all toxicity values were transformed into units of µmol/l.
For the development of baseline relationships, the calculated values of logPoct were used
in preference to the reported measured values, because for a number of chemicals,
experimental logPoct values were not reported, so reliance on experimental logPoct values
alone would have reduced considerably the number of data points for analysis.
Furthermore, it is considered good QSAR practice to be consistent in the choice of
descriptor data (i.e. all data should preferably be generated by the same experimental
protocol or estimation method) (Cronin, 2002).
The derived relationships between toxicity and logPoct for the non-polar narcotics present
in the data sets for each species are presented in Table 13.2, and in Figure 13.1.
For the alga S. subspicatus, only 50 chemicals with reported exact EbC50 values remained
after excluding the salts, metal complexes, mixtures, chemicals with purity less than or
equal to 95%, unstable or volatile chemicals, chemicals with algal EbC50 values higher
than their measured water solubilities, and chemicals with reported EbC50 values as
approximate numbers or as greater than or smaller than a cut-off value (see Section 13.2.1
“Toxicity data”). Only 8 of the remaining 50 chemicals met the structural criteria for
chemicals that may act by the non-polar narcosis mechanism of toxic action. For the alga
S. subspicatus ErC50 endpoint, only 33 chemicals with exact ErC50 values were selected
for investigation (after excluding the “unsuitable chemicals”), of which 6 were classified
as non-polar narcotics (Table 13.2).
For the alga S. capricornutum 64 chemicals with exact EbC50 values were selected, of
which 16 were classified as non-polar narcotics. For the S. capricornutum ErC50 values 51
chemicals remained after exclusion of the “unsuitable chemicals”, of which 13
chemicals were classified as non-polar narcotics. However, baseline effects for the
toxicity to the alga S. capricornutum were not observed. Chemicals were equally
distributed above and below the regression line defined by the toxicity/logPoct relationship
for the subset of non-polar narcotics.
For D. magna exclusion of the “unsuitable chemicals” resulted in a set of 176 chemicals
with reported exact toxicity values, of which 56 chemicals were classified as non-polar
narcotics. A total of 122 chemicals were selected with reported toxicities to the fish O.
mykiss (rainbow trout), of which 34 chemicals were classified as non-polar narcotics. For
B. rerio (zebra fish) 55 chemicals were selected after applying the above-described
exclusion criteria, of which 19 chemicals were classified as non-polar narcotics.
13.3.3. QSAR models
The QSAR models developed by multiple linear regression are presented in Table 13.3.
The abbreviations for the included structural descriptors are explained in Table 12.2,
Chapter 12. No QSAR with a reasonable statistical fit could be obtained for toxicity to the
alga S. capricornutum, to D. magna, or to the fish B. rerio, even if additional descriptors
were used.
13.3.4. Development of classification models for toxicity
The derived baseline relationships and QSAR models are not of a high statistical quality.
Since the toxicity data were obtained by different laboratories, experimental variability
was inevitable. Therefore, in order to avoid the use of the exact toxicity values reported,
chemicals were classified into groups according to their toxicity, and classification
models were developed. The toxicity boundaries for the classification were taken from
Annex VI of Directive 67/548/EEC (EC, 2001a) (see Section 13.2.1 “Toxicity data”). The
algal growth rate inhibition (ErC50) was used to classify compounds into groups of algal
toxicity as this endpoint is the preferred measure of acute algal toxicity for the regulatory
assessment of chemicals in the European Union.
To develop QSAR models the classification boundaries were transformed from units mg/l
(as they are given in Annex VI of Directive 67/548/EEC; EC, 2001a) to units µmol/l, by
using the average values of the molecular mass of the chemicals investigated for each
endpoint. In Table 13.4 the number of chemicals for each species, the average molecular
masses (MMav) and the upper boundaries for group V (very toxic compounds) in units
µmol/l for each species, are presented. The upper boundaries in units µmol/l for groups T
(toxic compounds) and H (harmful compounds) can be derived by multiplying the
boundary for group V by 10 and 100, respectively. As described in Section 13.2.1
(“Toxicity data”), the boundary for group N (non-toxic compounds) coincides with the
upper boundary of group H. Thus, the classification boundaries differ by one log unit.
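The conversion of the boundaries can be illustrated numerically. The average molecular mass of 200 g/mol and the example compound below are hypothetical values chosen for easy arithmetic, not values from Table 13.4.

```python
def group_boundaries_umol_per_l(mean_molar_mass_g_per_mol):
    """Upper boundaries of groups V, T and H in µmol/l."""
    # (mg/l) / (g/mol) gives mmol/l; multiplying by 1000 gives µmol/l.
    upper_v = 1.0 / mean_molar_mass_g_per_mol * 1000.0
    return {"V": upper_v, "T": 10.0 * upper_v, "H": 100.0 * upper_v}


def classify(value, bounds):
    """Assign a group from upper boundaries given in the same units as value."""
    for group in ("V", "T", "H"):
        if value <= bounds[group]:
            return group
    return "N"


BOUNDS_MG_PER_L = {"V": 1.0, "T": 10.0, "H": 100.0}
bounds_umol = group_boundaries_umol_per_l(200.0)   # hypothetical MMav

# Hypothetical compound: toxicity 8 mg/l, molecular mass 100 g/mol.
toxicity_mg = 8.0
toxicity_umol = toxicity_mg / 100.0 * 1000.0       # 80 µmol/l
group_by_mass_units = classify(toxicity_mg, BOUNDS_MG_PER_L)
group_by_molar_units = classify(toxicity_umol, bounds_umol)
```

With these hypothetical numbers the boundaries become 5, 50 and 500 µmol/l, and the example compound is assigned to group T on the mg/l scale but to group H on the µmol/l scale, because its molecular mass is well below the average. This is the kind of reassignment that the two sets of boundaries can produce.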
Chemicals with reported toxicity values as approximate numbers were included in the
development of classification models (see Section 13.2.1 “Toxicity data”). Also,
chemicals with reported toxicity values greater than the boundary for group N were
included in the group of the non-toxic compounds, and chemicals with reported toxicity
values smaller than the upper boundary for group V were included in the group of the
very toxic compounds. Therefore, the number of the chemicals included in the
classification models is greater than the number of chemicals included in the baseline and
regression QSAR models for each species (Table 13.4). Table 13.5 gives the numbers of
chemicals in each group with toxicity boundaries set both in units mg/l and µmol/l for
each species.
Results from the discriminant analysis
In Table 13.6 the classification models for the four-group classification (V, T, H, and N
groups) obtained by using discriminant analysis (DA) are presented. The a priori
classification probabilities were set as proportional to the group sizes, because it is less
probable for a chemical to have extreme toxicity (from group V) than to be toxic or
harmful.
The discriminant models for the second type of classification (dangerous (D) and non-
toxic (N) groups) are presented in Table 13.7. Classifications were tried with a priori
probabilities set as both proportional to the group sizes and equal in the two groups.
The discriminant model obtained when compounds were grouped into four groups of
toxicity to alga S. subspicatus included logPoct and the number of cyano groups (NC#N)
(Model 1, Table 13.6). However, heterogeneity of the variance of NC#N in the four groups
at the 95% level was suggested by the Levene test. The same holds true for the
variable Nrings6 used in the model for classification into four groups of toxicity to the alga
S. capricornutum (Model 2, Table 13.6) (the Levene test for this variable was statistically
significant at the 95% level). Also, the assumption of homogeneity of the
variance/covariance matrices for this model (Model 2, Table 13.6) was violated (the Sen-
Puri test was significant at 95% level). In the models for toxicity to D. magna (Model 3,
Table 13.6; and Model 3, Table 13.7), the variance of logPoct was also heterogeneous
within the groups at 95% level (suggested by the Levene test).
The tests for heterogeneity of variable variances (Levene test) and for heterogeneity of
variance/covariance matrices (Sen-Puri test) for the remaining variables and models
included in Tables 13.6 and 13.7 were not statistically significant at 95% level.
Results from the classification trees approach
Additionally, the classification tree (CT) approach was used to obtain classification
models. Both DBU and CART splitting algorithms were explored. Similarly to the DA,
when classification into four groups was investigated, the a priori probabilities were set
proportional to the group sizes. When models for the classification into dangerous and
non-toxic compounds were sought, a priori probabilities both equal and proportional to
the group sizes were explored. In Table 13.8 and Table 13.9, the results of CT analysis for
the two types of grouping – four groups (very toxic (V), toxic (T), harmful (H), and non-
toxic (N) compounds) and two groups (dangerous (D) and non-toxic (N) compounds),
respectively, are presented, including the splitting method and the percentages of
correctly classified compounds. The structures of the CTs are presented in Figures 13.2-
13.7. These are discussed in more detail in Section 13.4 “Discussion”.
13.3.5. Correlation between measured and calculated logPoct values
The following correlation between measured and calculated logPoct values was obtained
(illustrated in Figure 13.8):
Calculated logPoct = 0.709 (± 0.030) Measured logPoct + 0.387 (± 0.099) (13.1)
n = 352, R2 = 0.619, F = 568, s = 1.23
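As a small worked check, Equation 13.1 can be applied to a measured value, and the reported F statistic can be compared with the value implied by R2 and n for a one-descriptor regression. The function names are illustrative. The F formula is the standard one for linear regression and is not quoted from the thesis.

```python
def calculated_from_measured(measured_logp):
    """Calculated logPoct expected from Equation 13.1 for a measured value."""
    return 0.709 * measured_logp + 0.387


def f_statistic(r_squared, n, n_predictors=1):
    """F value implied by R2 for a linear regression with an intercept."""
    explained = r_squared / n_predictors
    unexplained = (1.0 - r_squared) / (n - n_predictors - 1)
    return explained / unexplained


example = calculated_from_measured(2.0)   # about 1.805
implied_f = f_statistic(0.619, 352)       # about 568.6
```

The implied F of about 568.6 agrees with the reported F = 568 to within the rounding of R2.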
13.4. Discussion
Data sets for toxicity to five aquatic species were investigated by different QSAR
approaches, with the aim of developing regression and classification models. Interspecies
correlations were also developed.
13.4.1. Correlations between toxicological endpoints
The correlations between the toxicity endpoints are presented in Table 13.1. From
Equation 1, Table 13.1, it can be seen that a very good correlation was obtained between
the two algal toxicity endpoints (R2 = 0.92), suggesting that testing only one of these
endpoints could be sufficient for evaluating algal toxicity. The growth rate endpoint
(ErC50) is the preferred measure of acute algal toxicity for the regulatory assessment of
chemicals in the European Union.
Statistically significant correlations were obtained between toxicity to Daphnia and algae,
but with poor statistical fits (Equations 2 and 3, Table 13.1). Toxicity to the fish O. mykiss
correlated better with toxicity to the alga S. subspicatus (R2 = 0.57, Equation 4, Table
13.1), and very badly with toxicity to the alga S. capricornutum (R2 = 0.25, Equation 5,
Table 13.1). For the fish B. rerio the opposite observation can be made, i.e. toxicity to the
fish B. rerio was well correlated with S. capricornutum toxicity (R2 = 0.67, Equation 7,
Table 13.1), having a slope of 1, and poorly with S. subspicatus toxicity (R2 = 0.28,
Equation 6, Table 13.1). A relatively good correlation was observed between toxicities to
the fish O. mykiss and D. magna (R2 = 0.67, Equation 8, Table 13.1), while the
correlation between toxicities to the fish B. rerio and D. magna was worse (R2 = 0.42,
Equation 9, Table 13.1).
13.4.2. Baseline effects
The baseline toxicity effects are presented in Table 13.2 and Figure 13.1. Although
general trends towards linear toxicity/logPoct relationships can be seen in Figure
13.1, with most of the chemicals being placed above the corresponding baselines, the
statistical fit of the baseline equations is not high (Table 13.2). The worst statistical
parameters were obtained for the relationships for D. magna and the fish B.
rerio.
As mentioned in Section 13.3 (“Results”), baseline effects for the toxicity to the alga S.
capricornutum were not observed, because the chemicals were equally distributed above
and below the regression line defined by the toxicity/logPoct relationship for the subset of
non-polar narcotics. From the baseline plots in Figure 13.1 it can be seen that similar
tendencies exist for the remaining species – although most of the chemicals are placed
above the baselines in Plots “b” of Figure 13.1 (indicating that these chemicals act by
mechanisms of toxic action other than non-polar narcosis), some chemicals are placed
below and close to the baselines without being classified as non-polar narcotics according
to the structural criteria. Most of these chemicals were classified as polar narcotics or
specifically-acting chemicals. These chemicals have complex molecular structures
containing different functional groups, which might prevent them from expressing the
potentially higher toxic effects related to some features of their structures. Also, due to
the presence of multiple and different functional groups, it was difficult to assign these
chemicals to a single mechanism of toxic action. Similar reasons could be suggested for the
absence of an observed baseline effect for the toxicity to the alga S. capricornutum.
For the algal ErC50 endpoint (S. subspicatus), the baseline relationship is represented by
Equation 2 (Table 13.2), and Plots 2a and 2b (Figure 13.1). Although a linear
toxicity/logPoct relationship can be observed, Plot 2a (Figure 13.1) shows that the 6 non-
polar narcotics do not provide a well-defined baseline.
13.4.3. QSAR models
The derived QSARs have moderate statistical fit (Table 13.3). QSARs were developed for
the alga S. subspicatus, and the models were similar for both the EbC50 and ErC50
endpoints (Equations 1 and 2, Table 13.3). A QSAR model was obtained also for toxicity
to the fish O. mykiss, albeit with a poor statistical fit (Equation 3, Table 13.3).
The QSARs are based on chemicals that are likely to act by different toxic mechanisms,
including narcotics, reactive chemicals and specifically-acting chemicals. LogPoct was
found to be a significant descriptor in all models. It accounts for the role of chemical
hydrophobicity in producing a toxic effect. The number of phenyl rings (NPh) in
Equations 1 and 2 (Table 13.3) might reflect differences in the toxic action towards S.
subspicatus between aromatic and aliphatic chemicals. An attempt was made to develop
class-specific QSARs for toxicity to S. subspicatus, for the aliphatic and aromatic
chemicals separately, but better models could not be obtained. The descriptor LUMO,
which appeared in the model for fish toxicity (Equation 3, Table 13.3), might account for
the electrophilic reactivity of the chemicals.
The statistical fit (R2) of the QSARs ranged from 0.42 to 0.71. A possible reason for the
limited goodness-of-fit of the models could be toxicokinetic factors that influence
toxicity in vivo towards fish, but which are not accounted for in the models. Other
possible reasons relate to limitations in the quality of the data used, as described below.
13.4.4. Toxicity classification
Two types of toxicity classification of compounds were attempted. The first one was set
according to classification criteria established in EU legislation (Annex VI of Directive
67/548/EEC; EC, 2001a), which provides for classification into four groups of very toxic,
toxic, harmful, and non-toxic compounds. The second classification was formed by
uniting the groups of very toxic, toxic and harmful compounds into a single group of
dangerous compounds.
In Annex VI of Directive 67/548/EEC (EC, 2001a) classification boundaries for toxicity
in units of mg/l are used. For the purposes of the QSAR analysis these boundaries were
converted into units of µmol/l. In Table 13.5 the numbers of compounds in each group
with boundaries in units of both mg/l and µmol/l are given. From Table 13.5.a it can be seen
that for the alga S. subspicatus, five compounds (6.1%) differ in their classification
according to the classification schemes used. In the classification according to the toxicity
to S. capricornutum, 16 compounds (14.1%) were classified differently by the two
schemes (Table 13.5.b). For D. magna this number is 56 compounds (13.5%, Table
13.5.c). In the case of fish O. mykiss, 32 compounds (14.0%) were classified differently
(Table 13.5.d), and for the fish B. rerio, 19 compounds (17.3%) were classified
differently by the two classification schemes (Table 13.5.e). Thus, approximately 15% of
chemical classifications are likely to be different when using the two different types of
units to define the classification boundaries.
Interpretation of the results from the discriminant analysis
The percentages of correct classifications expected by chance and obtained by applying
the DA for the four-group classification (V, T, H, and N groups) are presented in Table
13.6. From the table it can be seen that the correct classification expected by chance
varies between 25.9% and 32.1%. The larger values mean that the size of one of the
groups represents a larger percentage of the total number of compounds (for a
classification into four groups of equal sizes the correct classification expected by chance
is 25%). Thus, the sizes of the toxicity groups for the alga S. subspicatus and the fish B.
rerio (31.1% and 32.1% correct classifications expected by chance, respectively, Table
13.6) are less balanced than the sizes of the toxicity groups for the remaining species.
This is also reflected by the group sizes presented in Table 13.5.
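The dependence of the chance-expected classification rate on the balance of the group sizes can be illustrated with the proportional chance criterion, i.e. the sum of the squared group proportions. The formula used to obtain the values in Table 13.6 is not stated here, so this criterion is an assumption; it does reproduce the 25% quoted above for four groups of equal size. The unbalanced group sizes below are hypothetical.

```python
def chance_correct_fraction(group_sizes):
    """Proportional chance criterion: sum of squared group proportions."""
    total = sum(group_sizes)
    return sum((size / total) ** 2 for size in group_sizes)


balanced = chance_correct_fraction([25, 25, 25, 25])     # 0.25
unbalanced = chance_correct_fraction([10, 20, 30, 40])   # 0.30 (hypothetical sizes)
```

Under this criterion any departure from equal group sizes raises the chance-expected rate above 25%, which is consistent with the interpretation given above.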
As can be seen from Table 13.6, the discriminant models generally classified correctly
approximately 55% of the compounds, i.e. applying the discriminant models resulted in
an improvement of approximately 25 percentage points over the classification expected by chance.
The worst classification was obtained for toxicity to D. magna (50.0% correct
classification, Model 3, Table 13.6), for which the data set contained the largest number
of compounds. The best overall classification was obtained for the fish B. rerio (60.6%
correct classification, Model 5, Table 13.6); however, the second group (toxic
compounds) was very poorly predicted by this model (only 5.6% correct classification).
All models included logPoct as a discriminating variable. Additionally, the number of
cyano groups, electronic (HOMO, LUMO), topological (PW2), and volume/shape
(Nrings6) factors appeared to determine the toxicity classification.
The discriminant models for classification of chemicals into dangerous and non-toxic
groups are presented in Table 13.7. The percentages of correct classifications expected by
chance are also given in the table. From these percentages it can be seen that the largest
differences between the sizes of the two toxicity groups are present for D. magna and the
fish O. mykiss, which is also indicated by the group sizes presented in Table 13.5.
As can be seen from Table 13.7, when the a priori probabilities were set as proportional
to the group sizes, the overall classifications were similar to or better than those obtained
with equal a priori probabilities. However, the group of non-toxic compounds was
classified much less accurately with proportional a priori probabilities. Nevertheless, it
can be considered as more useful (from a regulatory perspective) to have better
classification for the group of dangerous compounds, than for the non-toxic compounds,
thus applying proportional a priori probabilities may result in more valuable classification
schemes. From Table 13.7 it can be seen that the improvement in the classification over
the classification expected by chance varies from approximately 10% (for the model for
D. magna and equal a priori probabilities) to approximately 25% (for the model for the
fish B. rerio and a priori probabilities proportional to the group sizes).
Similar descriptors appeared in the discriminant models for the two-group classification
(Table 13.7) as in those for the four-group classification (Table 13.6). Again, all models
included logPoct. In the classification model for toxicity to the alga S. subspicatus, NC#N
was not a statistically significant variable, whereas the number of H-bond donors (NH-
donors) was used as discriminating variable (Model 1, Table 13.7). The model for the alga
S. capricornutum included only logPoct, and Nrings6 was not a statistically significant
discriminating variable in this case (Model 2, Table 13.7). Similarly, LUMO was not a
statistically significant discriminating variable for the two-group classification of toxicity
to the fish O. mykiss (Model 4, Table 13.7). For D. magna and the fish B. rerio, the same
discriminating variables were used as in the models for compound classification into four
toxicity groups, namely logPoct, HOMO, and PW2 (Models 3 and 5, Tables 13.6 and
13.7).
Interpretation of the results from the classification trees approach
The results of CT analysis for the two types of grouping – four groups (very toxic, toxic,
harmful, and non-toxic compounds) and two groups (dangerous and non-toxic
compounds), are presented in Table 13.8 and Table 13.9 respectively. Unlike discriminant
analysis, where the selection of the a priori probabilities influences the final classification
only, and not the characteristics of the model itself (variables included and statistical
parameters), in the CT approach the a priori probabilities determine also the selected
discriminating variables and their cut-off values. Therefore, different CTs were obtained
in some cases (for example for the fish B. rerio, CTs 6 and 7, Table 13.9) when different
a priori probabilities (proportional to the group sizes, or equal for the groups) were
applied.
Classification trees for toxicity to the alga S. subspicatus
For the alga S. subspicatus, models for classifying chemicals into four groups (very toxic,
toxic, harmful and non-toxic chemicals) could not be obtained by using the CT approach
(Table 13.8). A CT was obtained only for the two-group classification of toxicity to S.
subspicatus, using equal a priori probabilities and DBU splitting (CT 1, Table 13.9). The
same descriptors appeared in the CT as in the model obtained by DA (Model 1, Table
13.7), namely logPoct and NH-donors. The CT approach resulted in a slightly higher
percentage of correctly classified compounds than obtained by DA; however the group of
non-toxic compounds was poorly classified with only 57.7% correct classifications. The
CT is presented in Figure 13.2. It can be seen that compounds with logPoct ≤ 2.09 and at
least one H-bond donor are classified as non-toxic to S. subspicatus, while compounds
with logPoct ≤ 2.09 and not containing H-bond donor atoms, or compounds having logPoct
> 2.09 are classified as dangerous to S. subspicatus. Thus, from the CT, it appears that
compounds with higher lipophilicity are more toxic, which is in accordance with
literature reports (see Chapter 8). Also, from the CT, it can be seen that H-bonding ability
is unfavourable for the toxicity.
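The decision rules read from the tree in Figure 13.2 can be restated as a short function. The function and argument names are illustrative, and only the cut-off values quoted in the text are used.

```python
def classify_s_subspicatus(log_p_oct: float, n_h_bond_donors: int) -> str:
    """Two-group rule for toxicity to S. subspicatus: 'N' (non-toxic) or 'D' (dangerous)."""
    if log_p_oct <= 2.09 and n_h_bond_donors >= 1:
        return "N"
    # Either logPoct > 2.09, or logPoct <= 2.09 with no H-bond donor atoms.
    return "D"
```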
Classification trees for toxicity to the alga S. capricornutum
The CT for the four-group classification of compounds according to toxicity to the alga S.
capricornutum was obtained by applying the CART splitting method (a priori
probabilities set proportional to the group sizes) (CT 1, Table 13.8). In the CT approach,
the second-order connectivity index (2χ) gave better results than the number of 6-
membered rings (Nrings6), which appeared in the DA (Model 2, Table 13.6). From Table
13.6, Model 2, and Table 13.8, CT 1, it can be seen that the overall percentages of
correctly classified compounds by the DA and the CT approach are similar; however,
with the CT, a better classification of the group of very toxic compounds is obtained. The
CT is presented in Figure 13.3. It can be seen that compounds having logPoct ≤ 2.31 are
classified as non-toxic. Compounds having logPoct > 2.31 and 2χ > 8.955 are classified as
very toxic; compounds with 2χ ≤ 8.955 and 2.31 < logPoct ≤ 3.01 are harmful to S.
capricornutum, and compounds with 2χ ≤ 8.955 and logPoct > 3.01 are toxic. Thus, higher
values of logPoct and 2χ are favourable for toxicity. The value of 2χ increases with
increasing molecular size, i.e. larger molecules are suggested to be more toxic.
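The four-group tree described above (Figure 13.3) can likewise be restated as nested threshold rules. The names are illustrative, and the cut-off values are those quoted in the text.

```python
def classify_s_capricornutum(log_p_oct: float, chi2: float) -> str:
    """Four-group rule for toxicity to S. capricornutum.

    chi2 is the second-order connectivity index (2χ).
    Returns 'V', 'T', 'H' or 'N'.
    """
    if log_p_oct <= 2.31:
        return "N"   # non-toxic
    if chi2 > 8.955:
        return "V"   # very toxic
    if log_p_oct <= 3.01:
        return "H"   # harmful
    return "T"       # toxic
```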
For the second type of classification (dangerous and non-toxic compounds), the use of
CART and a priori probabilities both equal and proportional to the group sizes produced
CTs with one split along logPoct (CT 2, Table 13.9). When the a priori probabilities were
set proportional to the group sizes, logPoct had a cut-off value of 2.05 (compounds having
logPoct ≤ 2.05 were classified as non-toxic, while compounds with logPoct > 2.05 were
classified as dangerous). When equal a priori probabilities were considered, logPoct had a
different cut-off value of 2.31 (compounds having logPoct ≤ 2.31 were classified as non-
toxic, compounds with logPoct > 2.31 were classified as dangerous). The percentages of
correct classifications are presented in Table 13.9, CT 2. It can be seen that both settings
for the a priori probabilities resulted in a better classification of the non-toxic compounds
than of the group of dangerous compounds.
Classification trees for toxicity to D. magna
For the four-group classification of toxicity to D. magna, logPoct, HOMO, and MM
appeared as discriminating variables when applying DBU splitting (a priori probabilities
proportional to the group sizes). The percentage of correct classification is presented in
Table 13.8, CT 2. A similar overall accuracy of the classification was obtained as with the
DA (Model 3, Table 13.6). The tree is presented in Figure 13.4.a. It can be seen that
compounds with logPoct ≤ 3.82 are classified as non-toxic, harmful or toxic to D. magna,
while compounds with logPoct > 3.82 are either toxic or very toxic to this species.
Compounds with logPoct ≤ 1.91 and having HOMO ≤ -10.07 are classified as non-toxic;
compounds with logPoct ≤ 1.91 and HOMO > -10.07, or compounds with 1.91 < logPoct ≤
3.82 and HOMO ≤ -9.14, are classified as harmful; compounds with 1.91 < logPoct ≤ 3.82
and HOMO > -9.14, or with logPoct > 3.82 and MM ≤ 424.3, are toxic; compounds with
logPoct > 3.82 and MM > 424.3 are very toxic. Thus, compounds with higher values of
logPoct, HOMO, and MM tend to be more toxic. Compounds with higher values of
HOMO will enter chemical reactions more easily as nucleophiles, thus the CT suggests
involvement of nucleophilic reactions in the toxic action. According to the CT, some very
toxic compounds have high molecular mass, suggesting that they are complex organic
molecules with numerous centres of reactivity and/or centres responsible for toxicity by
specific mechanisms. Additionally, the increased molecular mass might reflect increased
lipophilicity, which results in higher toxicity.
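The DBU tree for D. magna combines three descriptors, so its rules are easier to follow when written as nested conditions. This is a hedged sketch only: the function and argument names are assumptions, and the HOMO cut-offs (-10.07 and -9.14) are taken from the tree in Figure 13.4.a.

```python
def classify_d_magna_dbu(logp_oct: float, homo: float, mm: float) -> str:
    """Assign V, T, H or N following the DBU tree (CT 2, Table 13.8).

    homo: HOMO energy; mm: molecular mass. Names are hypothetical.
    """
    if logp_oct > 3.82:
        # Lipophilic branch: molecular mass separates toxic from very toxic.
        return "T" if mm <= 424.3 else "V"
    if logp_oct <= 1.91:
        # Least lipophilic branch: HOMO separates non-toxic from harmful.
        return "N" if homo <= -10.07 else "H"
    # Intermediate lipophilicity: HOMO separates harmful from toxic.
    return "H" if homo <= -9.14 else "T"
```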
Another classification tree, which included logPoct and LmH, was obtained by using the
CART splitting method (a priori probabilities proportional to group sizes) (CT 3, Table
13.8). The tree is presented in Figure 13.4.b. It can be seen that according to this CT the
cut-off value for logPoct that separates the group of non-toxic and harmful compounds
from the group of toxic and very toxic compounds is 3.12. Additionally, the group of
toxic and very toxic compounds is separated by another cut-off value for logPoct, equal to
4.91. This supports the suggestion that the increased toxicity of the very toxic compounds
is related to increased lipophilicity, rather than to the higher molecular mass of these
compounds, which appeared as a descriptor in the CT obtained by the DBU splitting (CT
2, Table 13.8; Figure 13.4.a) (see above). The non-toxic and harmful compounds were
separated by LmH, with a cut-off value of 8.87. LmH might reflect the activation energy
of the compounds. From Figure 13.4.b it can be seen that smaller values of LmH,
corresponding to smaller activation energy, tend to be associated with increased toxicity.
In Table 13.8, CTs 2 and 3, the percentages of correct classifications obtained by the two
CTs for toxicity to D. magna are presented. It can be seen that the overall classification
accuracy of the two models is very similar (approximately 51% correct classification), but
each model discriminates better between different groups: the model obtained by
DBU splitting performs better for the groups of toxic and harmful compounds, while the
model derived by CART splitting predicts better the groups of very toxic and non-toxic
compounds.
When models for the two-group classification (dangerous and non-toxic compounds)
were developed, better results were obtained by using CART splitting. The CTs obtained
with a priori probabilities set both equal and proportional to the group sizes are presented
in Figure 13.5. It can be seen that similar CTs were obtained by using the two types of a
priori probabilities. They included logPoct and LmH as discriminating variables.
However, the cut-off value for logPoct is smaller when proportional a priori probabilities
were applied (a cut-off value of 1.63 for proportional a priori probabilities, and a cut-off
value of 3.12 for equal a priori probabilities). Again, higher values of logPoct, and smaller
values of LmH favour toxicity. The percentages of correct classification are presented in
Table 13.9, CT 3. A better classification was obtained with a priori probabilities
proportional to the group sizes than with the corresponding model obtained by the DA
(Model 3, Table 13.7). As can be seen from Table 13.9, CT 3, setting the a priori
probabilities proportional to the group sizes favoured a more accurate prediction of the
group membership of dangerous chemicals. Understandably, the logPoct cut-off value
favouring classification of more compounds as dangerous (1.63) is smaller than the cut-
off value resulting in assigning more compounds as non-toxic (3.12), since the dangerous
compounds are located in the direction of higher values along the logPoct scale.
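The effect of the a priori probabilities on the two-group trees for D. magna can be illustrated by applying both sets of cut-off values to the same compound. The sketch below assumes the tree structure of Figure 13.5 (a split on logPoct followed by a split on LmH); the names `CUTOFFS` and `classify_d_magna_two_group` are hypothetical.

```python
# (logPoct cut-off, LmH cut-off) for each setting of the a priori probabilities,
# as reported for the two CART trees in Figure 13.5.
CUTOFFS = {
    "proportional": (1.63, 8.89),
    "equal": (3.12, 8.87),
}

def classify_d_magna_two_group(logp_oct: float, lmh: float, priors: str) -> str:
    """Return 'D' (dangerous) or 'N' (non-toxic) for the chosen priors."""
    logp_cut, lmh_cut = CUTOFFS[priors]
    if logp_oct > logp_cut:
        return "D"
    # Below the logPoct cut-off, small LUMO-HOMO gaps still indicate danger.
    return "D" if lmh <= lmh_cut else "N"

# A compound with logPoct = 2.5 and LmH = 9.5 changes class with the priors.
print(classify_d_magna_two_group(2.5, 9.5, "proportional"))  # D
print(classify_d_magna_two_group(2.5, 9.5, "equal"))         # N
```

Compounds with logPoct between the two cut-offs (1.63 and 3.12) and a large LmH are the ones whose assignment depends on the choice of a priori probabilities.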
Classification trees for toxicity to the fish O. mykiss (rainbow trout)
The CT for the four-group classification of compounds according to their toxicity to the
fish O. mykiss (obtained by applying DBU splitting and a priori probabilities proportional
to the group sizes) included logPoct and ΣE-state (CT 4, Table 13.8). A similar accuracy of
classification was observed to that obtained by DA (Model 4, Table 13.6). The CT is
presented in Figure 13.6. It can be seen that again (as in the CT for toxicity to D. magna
obtained by CART splitting, Figure 13.4.b), the non-toxic and harmful compounds are
separated from the toxic and very toxic compounds by a logPoct cut-off value of 3.39.
Additionally, the non-toxic and harmful compounds are separated by a logPoct cut-off value
of 0.70. The sum of the E-state indices (ΣE-state) separates the toxic from the very toxic
compounds by a cut-off value of 69.2 (bigger values of ΣE-state favour toxicity). The E-
state indices account for the ability of a molecule to enter into non-covalent
intermolecular interactions (Rose et al., 2002), and thus they encode factors that could
influence interactions with biological macromolecules.
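The four-group tree for O. mykiss described above uses two successive logPoct cut-offs and one cut-off on the sum of the E-state indices. A minimal sketch under those stated cut-offs follows; the function and argument names are illustrative assumptions.

```python
def classify_o_mykiss(logp_oct: float, sum_e_state: float) -> str:
    """Assign V, T, H or N following CT 4 (Table 13.8, Figure 13.6)."""
    if logp_oct <= 0.70:
        return "N"  # non-toxic
    if logp_oct <= 3.39:
        return "H"  # harmful
    # Above the logPoct cut-off of 3.39 the sum of E-state indices decides.
    return "T" if sum_e_state <= 69.2 else "V"
```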
When the classification into dangerous and non-toxic compounds to the fish O. mykiss
was investigated, similar results were obtained with both DBU and CART splitting
methods (CTs 4 and 5, Table 13.9). The CTs included one split by logPoct. When the a
priori probabilities were set proportional to group sizes, the cut-off value of logPoct was
0.81 (compounds having logPoct ≤ 0.81 were classified as non-toxic, while compounds
with logPoct > 0.81 were classified as dangerous) for the DBU splitting, and 1.37 for the
CART splitting. Applying equal a priori probabilities resulted in similar cut-off values of
logPoct for the two splitting methods (2.20 for DBU, and 2.17 for CART). A similar
accuracy of classification was obtained with the CT approach (CTs 4 and 5, Table 13.9)
and with DA (Model 4, Table 13.7). From Table 13.9, CTs 4 and 5, it can be seen that
setting the a priori probabilities proportional to the group sizes resulted in better overall
classification, with the group of dangerous compounds being classified more accurately
than the group of non-toxic compounds. As described above, when proportional a priori
probabilities were used, the logPoct cut-off values obtained were different for the DBU
and the CART splitting methods. As can be seen from the percentages of correct
classifications obtained with the two splitting methods (CTs 4 and 5, Table 13.9) the
logPoct cut-off value obtained by DBU splitting (0.81) allows for better classification of
the dangerous compounds, while if the logPoct cut-off value obtained by CART splitting
(1.37) is used, more compounds will be correctly classified as non-toxic. Again, the
logPoct cut-off value favouring classification of more compounds as dangerous (0.81) is
smaller than the cut-off value resulting in assigning more compounds as non-toxic (1.37).
Classification trees for toxicity to the fish B. rerio (zebra fish)
CTs discriminating between the four groups of toxicity to the fish B. rerio could not be
obtained with any of the splitting methods (Table 13.8). CTs were obtained only for the
two-group classification (dangerous and non-toxic compounds) (CTs 6 and 7, Table
13.9). A CT obtained by applying CART splitting and a priori probabilities proportional
to the group sizes (CT 6, Table 13.9) contained the same variables that appeared in the
DA model for this endpoint (Model 5, Table 13.7), namely logPoct and PW2. Similar
percentages of correct classification were obtained with both CT analysis and DA, with
the CT classifying slightly better the group of dangerous compounds at the expense of
correctly classifying the non-toxic compounds. The CT is presented in Figure 13.7. It can
be seen that compounds with logPoct ≤ 2.41 and PW2 > 0.576 are classified as non-toxic,
while compounds with logPoct ≤ 2.41 and PW2 ≤ 0.576, or having logPoct > 2.41 are
classified as dangerous. Generally, the values of PW2 are higher for more branched and
spherical molecules. Thus, less branched molecules with more extended shapes appear to
be more toxic.
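The two-group tree for B. rerio (CT 6) can be sketched in the same way. The code below assumes only the two cut-off values given above; the function and argument names are hypothetical.

```python
def classify_b_rerio(logp_oct: float, pw2: float) -> str:
    """Return 'D' (dangerous) or 'N' (non-toxic) following CT 6 (Figure 13.7)."""
    if logp_oct > 2.41:
        return "D"
    # Low lipophilicity: only the more branched, compact molecules
    # (PW2 above the cut-off) are classified as non-toxic.
    return "D" if pw2 <= 0.576 else "N"
```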
When CART splitting with equal a priori probabilities was applied, the CT for the two-
group classification of toxicity to B. rerio included only logPoct (CT 7, Table 13.9). The
descriptor had the same cut-off value (2.41) as the CT obtained by applying CART and
proportional a priori probabilities (thus, compounds with logPoct ≤ 2.41 were classified as
non-toxic; compounds with logPoct > 2.41 were classified as dangerous). The percentages
of correct classification are presented in Table 13.9, CT 7. From Table 13.9, CTs 6 and 7,
it can be seen that, again, setting the a priori probabilities proportional to the group sizes
results in a more accurate prediction of the dangerous compounds.
In summary, a slightly higher classification accuracy was obtained with the CT approach
than with DA. However, CTs for some of the endpoints could not be obtained
(classification of compounds into four groups of toxicity to the alga S. subspicatus and the
fish B. rerio, Table 13.8). Again, logPoct appeared as the first discriminating variable in all
CTs. A logPoct cut-off value of approximately 2 was observed for discrimination between
dangerous and non-toxic compounds. Other descriptors in the CTs, related to electronic
properties (HOMO, LmH, ΣE-state), H-bond donor ability, molecular topology (ΣE-state, 2χ,
PW2) and size (MM), suggested that a decreased number of H-bond donors, decreased
branching, and increased compound reactivity and size are favourable for toxicity.
13.4.5. Correlation between measured and calculated logPoct values
Although the correlation between the measured and calculated logPoct values was
statistically significant (Equation 13.1), the correlation coefficient is not high; in addition,
the slope of the line differs from 1 and the intercept differs from 0. However, from
Figure 13.8 (representing a plot of calculated against measured logPoct values), it can be
seen that the compounds are evenly distributed around the regression line. The low
correlation obtained might be due both to differences in the experimental protocols and to
computational errors. The experimental data were obtained from different laboratories, by
using two different experimental methods, i.e. the shake-flask method and the high
performance liquid chromatography (HPLC) method. Additionally, the measurements
were obtained at different temperatures.
13.4.6. Assessment of the data used
The data used in this study are properties of New Chemicals, which are developed by
industry for specific industrial or commercial applications. Some chemicals are complex
molecules, which, despite possessing the structural characteristics for assignment to a
particular toxicity mechanism, might not be able to reach and/or interact with the
biological targets and express their toxic effects. Alternatively, due to their complexity,
these chemicals might act by several mechanisms.
One reason for the limited goodness-of-fit of the baseline relationships and the QSARs
could be that the data were collected from different laboratories under
different experimental conditions. While the laboratories used official testing methods
(Annex V of Directive 67/548/EEC), there are still several sources of variability between
laboratories, due to possible differences in experimental protocols. These are discussed in
the following paragraphs.
In general, the recommendations regarding the origin and number of test animals, the
number of chemical concentrations used to derive the dose-response curves, and the
equipment used are not very strict and are worded with qualifiers such as “preferably” or “at least”.
According to the recommendations, testing should be performed without adjustment of
the pH of the test medium. However, when there is evidence of a marked change in the
pH, it is advised that the test should be repeated with pH adjustment and the results
reported. In some cases, the toxic effect might be due to the extreme pH conditions, rather
than to the toxicity of the chemical itself. The pH correction is not always explicitly
mentioned in the results presented in the database.
A particular problem arises with unstable, poorly soluble and volatile chemicals.
Maintenance of their concentrations is difficult and this could result in great differences
between the laboratories. For the purposes of this study, wherever such a problem was
recorded in the database, the chemical was excluded from the analyses. However, such
problems are not always recorded.
In the Daphnia test, the assessment of the endpoint (immobility) is subjective, which
could lead to differences in results between different laboratories. In the fish toxicity test,
different types of water supply (e.g. drinking water, reconstituted water) may be used, and
it is difficult to ensure that the water has no contaminants that could interfere with the
toxicological effects of the test substance. Furthermore, the option to conduct fish testing
with three types of procedures (static, semi-static, flow-through) is likely to introduce
additional interlaboratory variability in the data.
In the Daphnia and fish tests, data are evaluated by plotting the percentage mortality
against concentration on a logarithmic scale. This may be done manually or by using a
computer program, which could lead to differences in results between laboratories.
In the algal test, there are possible differences in the choice of test equipment (e.g., the
volume of the test flasks), illumination (e.g. different type of lamps), and method for
measuring cell density. The recommendations state that cell density should be measured
by direct counting of cells. However, other methods (photometry,
turbidimetry) may also be used. An additional difficulty when interpreting the results of
the algal test is that significant amounts of the test substance may be incorporated into the
algal biomass during the period of the test. The recommendations do not specify how the
resulting uncertainty in exposure should be taken into account in reporting the test results.
13.4.7. Contribution to existing knowledge
The study reported in this chapter is based on a large data set for fish, Daphnia and algal
toxicity extracted from the New Chemicals Database of the European Union. In the study,
correlations between the different endpoints were derived. A strong correlation was
observed between the inhibition of algal growth and algal growth rate, which suggests
that testing only one of these endpoints should be sufficient for the regulatory assessment
of algal toxicity. Two fish and two algal species were investigated, and, interestingly,
toxicity to the fish O. mykiss correlated well with toxicity to one of the algal species (S.
subspicatus), but correlated poorly with the other algal species (S. capricornutum), while
for the other fish species (B. rerio) this relation was the opposite, i.e. toxicity to B. rerio
was well correlated with toxicity to the alga S. capricornutum, and poorly correlated with
toxicity to the alga S. subspicatus.
An attempt was made to define baseline toxicity effects. Although general trends for
linear toxicity/logPoct relationships were observed for the non-polar narcotics, with most
of the remaining compounds being placed above the baselines, the derived relationships
did not have high statistical fit. Also, QSARs with moderate statistical fit were only
obtained for the alga S. subspicatus and the fish O. mykiss. They confirmed the
importance of lipophilicity, encoded by logPoct, for toxicity.
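The position of a compound relative to a baseline can be expressed as the difference between its observed toxicity and the value predicted by the baseline equation. The sketch below uses the slopes and intercepts listed in Table 13.2; the dictionary keys and function names are illustrative assumptions, and the toxicity values are taken to be on the same log(1/C) scale as in that table.

```python
# (slope, intercept) of the baseline relationships in Table 13.2:
# log(1/C) = slope * logPoct + intercept
BASELINES = {
    "S. subspicatus EbC50": (0.421, -3.38),
    "S. subspicatus ErC50": (0.319, -3.34),
    "D. magna EC50": (0.376, -2.95),
    "O. mykiss LC50": (0.469, -3.28),
    "B. rerio LC50": (0.349, -3.07),
}

def baseline_toxicity(endpoint: str, logp_oct: float) -> float:
    """Predicted baseline toxicity, log(1/C), for a given logPoct."""
    slope, intercept = BASELINES[endpoint]
    return slope * logp_oct + intercept

def excess_over_baseline(endpoint: str, logp_oct: float, observed: float) -> float:
    """Observed minus baseline log(1/C); positive values lie above the baseline."""
    return observed - baseline_toxicity(endpoint, logp_oct)
```

A positive result from `excess_over_baseline` corresponds to a compound plotted above the solid line in the "b" plots of Figure 13.1.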
Classification QSAR models were obtained by applying the EU classification scheme
(very toxic, toxic, harmful, and non-toxic compounds to the aquatic species). In general,
the models correctly classified approximately 55% of the compounds, i.e. the
classification models resulted in an approximately 25% improvement of the classification
compared with that expected by chance. Additionally, a two-group classification into
dangerous and non-toxic compounds was investigated, with the group of dangerous
compounds being formed by uniting the groups of very toxic, toxic, and harmful
compounds from the EU classification scheme. The percentage of correct classifications
by the classification models was between 70% and 83%, i.e. the improvement over the
percentage of correct classifications expected by chance was approximately 20%. The
classification models included molecular descriptors allowing for mechanistic
interpretation. The models are easily applicable for predicting the group membership of
new compounds. All classification models included logPoct as a discriminating variable. A
logPoct cut-off value of approximately 2 was observed for discrimination between
dangerous and non-toxic compounds. Additionally, electronic features (HOMO, LUMO,
energy difference between LUMO and HOMO), H-bond donor ability, topological
properties and factors encoding molecular volume, shape, and size appeared to influence
the assignment of compounds to a given toxicity group.
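The improvement over chance can be checked with a small calculation. Equation 5.20 is defined in Chapter 5 and is not reproduced here, so the sketch below assumes the proportional chance criterion (the sum of the squared group proportions). Under that assumption, the group sizes for S. subspicatus in Table 13.5 (µmol/l boundaries) reproduce the chance values listed for this species in Tables 13.6 and 13.7. The function name is hypothetical.

```python
def percent_correct_by_chance(group_sizes):
    """Proportional chance criterion: 100 * sum of squared group proportions."""
    total = sum(group_sizes)
    return 100.0 * sum((n / total) ** 2 for n in group_sizes)

# S. subspicatus group sizes V, T, H, N (µmol/l boundaries, Table 13.5a).
four_groups = [8, 14, 34, 26]
# Two-group scheme: dangerous (V + T + H) and non-toxic (N).
two_groups = [8 + 14 + 34, 26]

chance_four = percent_correct_by_chance(four_groups)
chance_two = percent_correct_by_chance(two_groups)
print(round(chance_four, 1), round(chance_two, 1))  # 31.1 56.7

# Improvement of the four-group DA model (53.7% correct, Table 13.6) over chance.
improvement = 53.7 - chance_four
```

For this species the four-group discriminant model therefore improves on chance by roughly 23 percentage points, in line with the approximate figure of 25% quoted above.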
The complexity of the molecular structures of the chemicals included in the NCD might
be a reason for the moderate quality of the derived baseline relationships and QSAR
models. Another reason might be related to the experimental variability of the data, as
they were collected from different laboratories under diverse experimental conditions.
Thus, it appears that the data in the New Chemicals Database are not sufficiently
consistent to develop high quality QSARs.
An article based on this work has been published by Lessigiarska et al. (2004b).
13.5. Conclusions
There is a strong correlation between the inhibition of algal growth and algal growth rate.
Important molecular properties included in QSARs for toxicity to the alga S. subspicatus
and the fish O. mykiss were lipophilicity (logPoct), chemical reactivity (LUMO), and the
number of phenyl rings. The number of phenyl rings might suggest differences in the
toxic effect between aromatic and aliphatic chemicals.
Classification models for aquatic toxicity obtained by applying discriminant analysis and
the classification tree approach revealed lipophilicity (logPoct) as very important for toxicity
(consistent with literature data). A logPoct cut-off value of approximately 2 was observed
for discrimination between dangerous and non-toxic compounds. Additionally, electronic
properties (HOMO, LUMO, sum of E-state indices), H-bond donor ability, topological
and size/shape factors, and the number of cyano groups appeared to influence the toxicity
classification.
On the basis of this study, it appears that the data in the New Chemicals Database are not
sufficiently consistent to develop high quality QSARs for toxicity to algae, daphnids and
fish.
Figure 13.1. Plots of toxicity against logPoct for aquatic toxicity endpoints, showing
baseline toxicity effects (represented by the solid lines) †
[Plots not reproduced. Each plot shows the toxicity endpoint (y-axis) against logPoct (x-axis). Plots “a”: non-polar narcotics; plots “b”: all compounds.]
No  Endpoint
1   Alga S. subspicatus log(1/EbC50)
2   Alga S. subspicatus log(1/ErC50)
3   D. magna log(1/EC50)
4   Fish O. mykiss log(1/LC50)
5   Fish B. rerio log(1/LC50)
† the numeration of the plots is according to the equations of Table 13.2. Plots “a” show the position of the
baselines (solid lines) with respect to the non-polar narcotic compounds. Plots “b” include all compounds,
regardless of mechanism of action, and the position of the baseline is again presented
Figure 13.2. Classification tree for toxicity to the alga S. subspicatus when two compound
groups were investigated (dangerous (D) and non-toxic (N) compounds) (using DBU
splitting and equal a priori probabilities)
Node 1: logPoct <= 2.09?
    yes -> node 2 (n = 34): NH-donors = 0?
        yes -> node 4 (n = 12): D
        no  -> node 5 (n = 22): N
    no  -> node 3 (n = 48): D
Figure 13.3. Classification tree for toxicity to the alga S. capricornutum when four
compound groups were investigated (very toxic (V), toxic (T), harmful (H) and non-toxic
(N) compounds) (using CART splitting and a priori probabilities proportional to the
group sizes)
Node 1: logPoct <= 2.31?
    yes -> node 2 (n = 54): N
    no  -> node 3 (n = 57): 2χ (2XP) <= 8.955?
        yes -> node 4 (n = 36): logPoct <= 3.01?
            yes -> node 6 (n = 19): H
            no  -> node 7 (n = 17): T
        no  -> node 5 (n = 21): V
Figure 13.4. Classification tree for toxicity to D. magna when four compound groups
were investigated (very toxic (V), toxic (T), harmful (H) and non-toxic (N) compounds)
a) using DBU splitting and a priori probabilities proportional to the group sizes
Node 1: logPoct <= 3.82?
    yes -> node 2 (n = 293): logPoct <= 1.91?
        yes -> node 4 (n = 147): HOMO <= -10.07?
            yes -> node 8 (n = 45): N
            no  -> node 9 (n = 102): H
        no  -> node 5 (n = 146): HOMO <= -9.14?
            yes -> node 10 (n = 105): H
            no  -> node 11 (n = 41): T
    no  -> node 3 (n = 121): MM <= 424.3?
        yes -> node 6 (n = 95): T
        no  -> node 7 (n = 26): V
b) using CART splitting and a priori probabilities proportional to the group sizes
Node 1: logPoct <= 3.12?
    yes -> node 2 (n = 244): LmH <= 8.87?
        yes -> node 4 (n = 87): H
        no  -> node 5 (n = 157): N
    no  -> node 3 (n = 170): logPoct <= 4.91?
        yes -> node 6 (n = 107): T
        no  -> node 7 (n = 63): V
Figure 13.5. Classification tree for toxicity to D. magna when two compound groups were
investigated (dangerous (D) and non-toxic (N) compounds)
a) using CART splitting and a priori probabilities proportional to the group sizes
Node 1: logPoct <= 1.63?
    yes -> node 2 (n = 125): LmH <= 8.89?
        yes -> node 4 (n = 38): D
        no  -> node 5 (n = 87): N
    no  -> node 3 (n = 289): D
b) using CART splitting and a priori probabilities equal in the two groups
Node 1: logPoct <= 3.12?
    yes -> node 2 (n = 244): LmH <= 8.87?
        yes -> node 4 (n = 87): D
        no  -> node 5 (n = 157): N
    no  -> node 3 (n = 170): D
Figure 13.6. Classification tree for toxicity to the fish O. mykiss (rainbow trout) when four
compound groups were investigated (very toxic (V), toxic (T), harmful (H) and non-toxic
(N) compounds) (using DBU splitting and a priori probabilities proportional to the group
sizes)
Node 1: logPoct <= 3.39?
    yes -> node 2 (n = 151): logPoct <= 0.70?
        yes -> node 4 (n = 35): N
        no  -> node 5 (n = 116): H
    no  -> node 3 (n = 77): ΣE-state (SE-state) <= 69.2?
        yes -> node 6 (n = 59): T
        no  -> node 7 (n = 18): V
Figure 13.7. Classification tree for toxicity to the fish B. rerio when two compound
groups were investigated (dangerous (D) and non-toxic (N) compounds), using CART
splitting and a priori probabilities proportional to the group sizes
Node 1: logPoct <= 2.41?
    yes -> node 2 (n = 43): PW2 <= 0.576?
        yes -> node 4 (n = 20): D
        no  -> node 5 (n = 23): N
    no  -> node 3 (n = 67): D
Figure 13.8. Correlation between measured and calculated logPoct values for 352
compounds taken from the New Chemicals Data Base
Table 13.1. Correlations between the aquatic toxicity endpoints †
No   Regression equation                                               n    R2     s      F
1 a  log(1/EbC50) = 0.940 (± 0.013) log(1/ErC50) + 0.343 (± 0.020)     443  0.919  0.289  4777
2    log(1/EC50)DM = 0.629 (± 0.063) log(1/EbC50)SS - 0.429 (± 0.075)  152  0.402  0.763  101
3    log(1/EC50)DM = 0.638 (± 0.059) log(1/EbC50)SC - 0.515 (± 0.067)  199  0.375  0.840  118
4    log(1/LC50)OM = 0.795 (± 0.079) log(1/EbC50)SS - 0.250 (± 0.097)  77   0.574  0.682  101
5    log(1/LC50)OM = 0.432 (± 0.073) log(1/EbC50)SC - 0.673 (± 0.082)  109  0.246  0.769  34.7
6    log(1/LC50)BR = 0.524 (± 0.134) log(1/EbC50)SS - 0.800 (± 0.161)  41   0.282  0.802  15.3
7    log(1/LC50)BR = 1.00 (± 0.160) log(1/EbC50)SC - 0.500 (± 0.142)   21   0.674  0.549  39.3
8    log(1/LC50)OM = 0.773 (± 0.029) log(1/EC50)DM - 0.266 (± 0.038)   360  0.665  0.631  709
9    log(1/LC50)BR = 0.634 (± 0.062) log(1/EC50)DM - 0.637 (± 0.089)   146  0.424  0.713  105
† the species were encoded as subscripts to the endpoints as follows: BR – fish B. rerio; DM – D. magna;
OM – fish O. mykiss; SC – alga S. capricornutum; SS – alga S. subspicatus
a correlation between the two algal endpoints regardless of the species
Table 13.2. Baseline toxicity relationships for five aquatic toxicity endpoints
No  Endpoint                          Regression equation                       n   R2     s      F
1   Alga S. subspicatus log(1/EbC50)  0.421 (± 0.122) logPoct - 3.38 (± 0.371)  8   0.665  0.345  11.9
2   Alga S. subspicatus log(1/ErC50)  0.319 (± 0.109) logPoct - 3.34 (± 0.282)  6   0.680  0.412  8.48
3   D. magna log(1/EC50)              0.376 (± 0.047) logPoct - 2.95 (± 0.157)  56  0.539  0.639  63.0
4   Fish O. mykiss log(1/LC50)        0.469 (± 0.062) logPoct - 3.28 (± 0.197)  34  0.640  0.499  57.0
5   Fish B. rerio log(1/LC50)         0.349 (± 0.091) logPoct - 3.07 (± 0.292)  19  0.465  0.454  14.8
Table 13.3. QSARs for algal and fish toxicity
No  Endpoint                          Regression equation                                              n    R2     s      F     Q2
1   Alga S. subspicatus log(1/EbC50)  0.345 (± 0.077) logPoct + 1.00 (± 0.179) NPh - 3.11 (± 0.205)    50   0.601  0.769  35.4  0.543
2   Alga S. subspicatus log(1/ErC50)  0.464 (± 0.091) logPoct + 0.971 (± 0.180) NPh - 3.57 (± 0.236)   33   0.709  0.691  36.3  0.655
3   Fish O. mykiss log(1/LC50)        0.369 (± 0.045) logPoct - 0.303 (± 0.074) LUMO - 2.53 (± 0.145)  122  0.417  0.884  42.6  0.371
Table 13.4. Number of investigated compounds (n), average molecular mass (MMav) and
upper classification boundaries for group V (very toxic compounds), converted to units
µmol/l for each species investigated*
Species                         n    MMav   Upper boundary for group V (mg/l)  Upper boundary for group V (µmol/l)
Alga S. subspicatus             82   236.1  1.000                              4.235
Alga S. capricornutum           111  245.7  1.000                              4.070
D. magna                        414  254.8  1.000                              3.925
Fish O. mykiss (Rainbow trout)  228  254.2  1.000                              3.934
Fish B. rerio (Zebra fish)      110  246.5  1.000                              4.057
* the upper boundaries in units mg/l for the groups T (toxic compounds) and H (harmful compounds) are
derived by multiplying the upper boundary for group V (1 mg/l) by 10 and 100, respectively
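The molar boundaries in Table 13.4 are consistent with dividing the mass-based boundary by the average molecular mass of the compounds tested for each species. The sketch below assumes that this simple conversion was used; the function name is hypothetical.

```python
def mg_per_l_to_umol_per_l(conc_mg_per_l: float, molar_mass_g_per_mol: float) -> float:
    """Convert a concentration from mg/l to µmol/l.

    mg/l divided by g/mol gives mmol/l; multiplying by 1000 gives µmol/l.
    """
    return conc_mg_per_l / molar_mass_g_per_mol * 1000.0

# Upper boundary of group V (1 mg/l) for S. subspicatus (MMav = 236.1).
boundary_v = mg_per_l_to_umol_per_l(1.0, 236.1)
print(round(boundary_v, 3))  # 4.235

# Boundaries for groups T and H follow from 10 mg/l and 100 mg/l.
boundary_t = mg_per_l_to_umol_per_l(10.0, 236.1)
boundary_h = mg_per_l_to_umol_per_l(100.0, 236.1)
```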
Table 13.5. Classification of compounds with classification boundaries set both in units
mg/l and µmol/l for each species*
a) alga S. subspicatus
Groups V (mg/l) T (mg/l) H (mg/l) N (mg/l) Total
V (µmol/l) 8 0 0 0 8
T (µmol/l) 0 13 1 0 14
H (µmol/l) 0 3 30 1 34
N (µmol/l) 0 0 0 26 26
Total 8 16 31 27 82
b) alga S. capricornutum
Groups V (mg/l) T (mg/l) H (mg/l) N (mg/l) Total
V (µmol/l) 13 2 0 0 15
T (µmol/l) 0 22 3 0 25
H (µmol/l) 0 5 26 3 34
N (µmol/l) 0 0 3 34 37
Total 13 29 32 37 111
c) D. magna
Groups V (mg/l) T (mg/l) H (mg/l) N (mg/l) Total
V (µmol/l) 57 3 0 0 60
T (µmol/l) 11 100 3 0 114
H (µmol/l) 0 7 130 13 150
N (µmol/l) 0 0 19 71 90
Total 68 110 152 84 414
d) fish O. mykiss (Rainbow trout)
Groups V (mg/l) T (mg/l) H (mg/l) N (mg/l) Total
V (µmol/l) 36 4 0 0 40
T (µmol/l) 2 56 0 0 58
H (µmol/l) 0 5 54 12 71
N (µmol/l) 0 0 9 50 59
Total 38 65 63 62 228
e) fish B. rerio (Zebra fish)
Groups V (mg/l) T (mg/l) H (mg/l) N (mg/l) Total
V (µmol/l) 8 1 0 0 9
T (µmol/l) 1 17 0 0 18
H (µmol/l) 0 6 32 8 46
N (µmol/l) 0 0 3 34 37
Total 9 24 35 42 110
* the groups were encoded as follows: V – very toxic compounds; T – toxic compounds; H – harmful
compounds; N – non-toxic compounds
Table 13.6. Discriminant models for the four-group classification of aquatic toxicity (very
toxic (V), toxic (T), harmful (H), and non-toxic (N) compounds)
No  Species                 % correct    Model variables     Wilks λ  F      % correctly classified compounds
                            by chance a                                      group V  group T  group H  group N  total
1   Alga S. subspicatus     31.1         logPoct b, NC#N     0.583    7.95   37.5     42.9     73.5     38.5     53.7
2   Alga S. capricornutum   27.4         logPoct b, Nrings6  0.783    4.60   26.7     40.0     41.2     83.8     53.2
3   D. magna                27.5         logPoct c, HOMO     0.609    38.4   41.7     36.0     70.7     38.9     50.0
4   Fish O. mykiss          25.9         logPoct c, LUMO     0.592    22.3   37.5     43.1     67.6     55.9     53.1
5   Fish B. rerio           32.1         logPoct c, PW2 b    0.567    11.5   33.3     5.6      76.1     73.0     60.6
a calculated by using Equation 5.20, Chapter 5, Section 5.4 “Linear discriminant function analysis”
b calculated with the DRAGON software. DRAGON computes logPoct by using two methods; the one used
here is calculated according to the Ghose-Crippen-Viswanadhan model (Ghose and Crippen, 1986;
Viswanadhan et al., 1989)
c calculated with the KOWWIN program
Table 13.7. Discriminant models for the two-group classification of aquatic toxicity
(dangerous (D) and non-toxic (N) compounds)
% correctly classified compounds are given for a priori probabilities proportional to group sizes (prop.) and for equal a priori probabilities (equal)
No  Species                 % correct    Model variables         Wilks λ  F      group D   group N   total     group D   group N   total
                            by chance a                                          (prop.)   (prop.)   (prop.)   (equal)   (equal)   (equal)
1   Alga S. subspicatus     56.7         logPoct b, NH-donors c  0.829    8.16   87.5      46.2      74.4      73.2      65.4      70.7
2   Alga S. capricornutum   55.6         logPoct d               0.867    16.7   94.6      29.8      73.0      70.3      81.1      73.9
3   D. magna                66.0         logPoct b, HOMO         0.801    51.0   93.5      25.6      78.7      75.6      73.3      75.1
4   Fish O. mykiss          61.6         logPoct b               0.732    82.7   95.3      44.1      82.0      74.6      81.4      76.3
5   Fish B. rerio           55.4         logPoct b, PW2 d        0.678    25.3   90.4      64.9      81.8      78.1      73.0      76.4
a calculated by using Equation 5.20, Chapter 5, Section 5.4 “Linear discriminant function analysis”
b calculated with the KOWWIN program
c calculated with the Tsar software
d calculated with the DRAGON software. DRAGON computes logPoct by using two methods; the one used
here is calculated according to the Ghose-Crippen-Viswanadhan model (Ghose and Crippen, 1986;
Viswanadhan et al., 1989)
Table 13.8. Summary of the classification tree analysis for the four-group classification of
aquatic toxicity (very toxic (V), toxic (T), harmful (H), and non-toxic (N) compounds)
                                                                        % correctly classified compounds
No  Species                 Splitting method a  Model variables         group V  group T  group H  group N  total
    Alga S. subspicatus     No model obtained                           -        -        -        -        -
1   Alga S. capricornutum   CART                logPoct b, 2χ c         60.0     44.0     32.4     83.8     55.9
2   D. magna                DBU                 logPoct d, HOMO, MM     28.3     55.3     69.3     32.2     51.4
3   D. magna                CART                logPoct d, LmH          63.3     45.6     30.7     83.3     51.0
4   Fish O. mykiss          DBU                 logPoct d, ΣE-state c   30.0     53.4     76.1     44.1     53.9
    Fish B. rerio           No model obtained                           -        -        -        -        -
a CART – classification and regression trees; DBU – discriminant-based univariate splitting
b calculated with the DRAGON software. DRAGON computes logPoct by using two methods; the one used
here is calculated according to the Ghose-Crippen-Viswanadhan model (Ghose and Crippen, 1986;
Viswanadhan et al., 1989)
c calculated with the Tsar software
d calculated with the KOWWIN program
Table 13.9. Summary of the classification tree analysis for the two-group classification of
aquatic toxicity (dangerous (D) and non-toxic (N) compounds)

                                                                          % correctly classified compounds
                                                                          a priori probabilities     a priori probabilities
                                                                          proportional to group sizes equal
No  Species                 Splitting method  Model variables             group D  group N  total    group D  group N  total
1   Alga S. subspicatus     DBU               logPoct(b), NH-donors(c)    -        -        -        87.5     57.7     78.0
2   Alga S. capricornutum   CART              logPoct(d)                  71.6     81.1     74.8     68.9     83.8     73.9
3   D. magna                CART              logPoct(b), LmH             89.5     58.9     82.9     74.7     83.3     76.6
4   Fish O. mykiss          DBU               logPoct(b)                  94.7     47.5     82.5     70.4     84.7     74.1
5   Fish O. mykiss          CART              logPoct(b)                  89.9     61.0     82.5     72.8     84.7     75.9
6   Fish B. rerio           CART              logPoct(b), PW2(d)          94.5     51.4     80.0     -        -        -
7   Fish B. rerio           CART              logPoct(b)                  -        -        -        78.1     73.0     76.4

b calculated with the KOWWIN program
c calculated with the Tsar software
d calculated with the DRAGON software. DRAGON computes logPoct by using two methods; the one used
here is calculated according to the Ghose-Crippen-Viswanadhan model (Ghose and Crippen, 1986;
Viswanadhan et al., 1989).
CHAPTER 14
INVESTIGATION OF ACUTE TOXICITY/CYTOTOXICITY
14.1. Objectives
The objective of the study described in this chapter was to investigate in vivo and in vitro
acute toxicity and cytotoxicity to a broad range of biological species. Toxicities to
bacterial strains, rodent and human cell lines, rodents and humans were included.
Similarities between the toxicity endpoints were investigated by using different
approaches such as principal components analysis and cluster analysis. Models for in vivo
toxicity based on a combination of other in vivo or in vitro endpoints and descriptors of
chemical structure (QSAAR models), and QSAR models were sought.
14.2. Methods
14.2.1. Toxicity data
The data set was taken from the MEIC (Multicentre Evaluation of In Vitro Cytotoxicity)
programme (Bondesson et al., 1989; Ekwall et al., 1998a). The MEIC programme aimed
to evaluate the relevance and reliability of in vitro tests for predicting acute in vivo
toxicity. As a part of the MEIC programme, a number of international laboratories
produced data for the toxicity of 50 reference chemicals to more than 60 in vitro systems
(bacterial, animal and human cells) in order to use them for predicting human acute
toxicity. Additionally, in vivo data for acute rat, mouse, and human toxicity of the same
chemicals were collected as a part of the MEMO programme (MEIC monographs on
time-related human lethal blood concentrations). Rat and mouse toxicities (rat LD50 and
mouse LD50 values representing lethal doses at which 50% lethality of the test animals is
observed) were collected from the NIOSH/Registry of Toxic Effects of Chemical
Substances (RTECS); human toxicity data were collected from publications of clinical
and forensic studies of individuals exposed to toxicants. A comparison of human toxicity
to the rat and mouse LD50 values, and in vitro-in vivo correlations, were obtained in the
MEIC programme (Ekwall et al., 1998b).
The MEIC data set is heterogeneous in terms of chemical structure and toxicity
mechanisms, including inorganic and simple organic chemicals, alkaloids and drugs.
In the present study, the data for rat LD50, mouse LD50 and human toxicity (negative log-
transformed, units of mmol/kg) for the 50 reference chemicals of the MEIC programme
were taken from Ekwall et al. (1998a). The human toxicity is represented as acute oral
lethal doses (HLD, negative log-transformed, mmol/kg), and approximate acute
blood/serum peak LC50 values (HAP, negative log-transformed, mmol/l). Ekwall et al.
(1998a) derived the HAP values by constructing an approximate blood/serum LC50 curve
against time after exposure, and interpolation at the time of the peak blood/serum
concentrations. For this purpose, data on human highest survival blood/serum
concentrations were plotted against time, and after consideration of known kinetic data,
LC100 time curves were constructed connecting the highest survival concentration data
points. Similarly, curves of LC0 blood/serum concentrations against time were
constructed, using data for the lowest lethal blood/serum concentrations. The approximate
LC50 curves were drawn as the geometrical mean of the LC100 and LC0 curves, and HAP
values were obtained at the time of peak of the blood/serum concentrations. However, for
some chemicals Ekwall et al. (1998a) could not construct approximate LC50 curves due to
insufficient data for the concentrations required for severe poisoning or lethality. For
these chemicals HAP values were taken from the LC0 or LC100 curves.
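The last two steps of this derivation can be illustrated with a minimal Python sketch. The concentrations are made-up numbers, not MEIC data, and the function names are hypothetical. The sketch takes the geometric mean of the LC0 and LC100 curve values at the time of the peak concentration and applies the negative log transformation, assuming a base-10 logarithm.

```python
import math


def approximate_lc50(lc0: float, lc100: float) -> float:
    """Geometric mean of the LC0 and LC100 curve values at one time point."""
    return math.sqrt(lc0 * lc100)


def negative_log(value: float) -> float:
    """Negative base-10 logarithm (the base is an assumption of this sketch)."""
    return -math.log10(value)


# Hypothetical curve values (mmol/l) read at the time of the peak concentration.
lc0_at_peak = 0.1
lc100_at_peak = 0.4

lc50_at_peak = approximate_lc50(lc0_at_peak, lc100_at_peak)
hap = negative_log(lc50_at_peak)
print(f"approximate LC50 = {lc50_at_peak:.3f} mmol/l, HAP = {hap:.3f}")
```

With this transformation, a larger HAP value corresponds to a lower lethal concentration, that is, to a more toxic chemical.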
Data for toxicity to the Chang human liver cell line (IC50 values, negative log-transformed
and converted to mmol/l, 24-h assay) were taken from the MEIC data base
(http://www.cctoxconsulting.a.se/). Data for toxicity to rat hepatocytes (IC50 values,
negative log-transformed, mmol/l, 24-h assay) were obtained from Shrivastava et al.
(1993). Data for the Biotox™ assay on the bacterium Vibrio fischeri (IC50 values,
negative log-transformed, mmol/l, 5-min assay) were taken from Kahru and Borchardt
(1994). Additionally, toxicity to the bacterium Sinorhizobium meliloti for the MEIC
chemicals was included in the investigation; the data were taken from Botsford (2002)
(IC50 values, negative log-transformed, mmol/l, 20-min assay).
To derive QSAARs and QSARs, inorganic chemicals, salts, xylene (which is a mixture of
isomers) and paraquat (which has a charge of +2) were excluded from the analyses,
resulting in a data set of 26 chemicals (their SMILES codes are given in Appendix B.3).
The investigated chemicals and their toxicities are presented in Table 14.1.
14.2.2. Structural Descriptors
The following structural descriptors for the compounds were calculated: the logarithm of
the octanol-water partition coefficient (logPoct), boiling point, melting point, vapour
pressure, water solubility, and approximately 250 different quantum-chemical and
topological descriptors. LogPoct was calculated with the KOWWIN version 1.66 software
(Syracuse Research Corporation, Syracuse, NY, USA). Boiling points, melting points and
vapour pressures were calculated with the MPBPWIN version 1.40 software (Syracuse
Research Corporation, Syracuse, NY, USA). Aqueous solubilities were calculated with
WSKOWWIN version 1.40 (Syracuse Research Corporation, Syracuse, NY, USA).
KOWWIN version 1.66, MPBPWIN version 1.40 and WSKOWWIN version 1.40
software were downloaded from the web-site of the Environmental Protection Agency of
the United States (http://www.epa.gov/oppt/exposure/docs/episuitedl.htm). TSAR for
Windows version 3.3 (Accelrys Inc.) was used to calculate quantum-chemical and
topological descriptors. For quantum-chemical calculations, the VAMP package
(Accelrys Inc.) was used with the AM1 Hamiltonian. The Dragon version 5.0 software
(TALETE srl.) was used to calculate topological and charge descriptors. The charge
descriptors required 3D chemical structures; for these descriptors, the structures
optimised previously in TSAR were used (imported into Dragon as Mol2 files). When saving
the calculated descriptors in Dragon, the options to exclude constant and near-constant
variables, and to exclude one variable from each pair with an intercorrelation coefficient
greater than 0.95, were applied.
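The effect of such a pairwise exclusion can be sketched in a few lines of Python. This is not Dragon's own procedure, whose details are not given here. It is a simple greedy filter that drops constant columns and keeps a descriptor only if its absolute Pearson correlation with every descriptor already kept does not exceed 0.95. Near-constant variables are not handled, and the descriptor names and values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def filter_descriptors(table, r_max=0.95):
    """Drop constant columns, then keep a descriptor only if |r| <= r_max
    with every descriptor kept so far (greedy, order-dependent)."""
    kept = []
    for name, values in table.items():
        if max(values) == min(values):
            continue  # constant descriptor carries no information
        if all(abs(pearson_r(values, table[k])) <= r_max for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor table: name -> values for five compounds.
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.1, 4.0, 6.1, 8.0, 10.1],  # nearly proportional to d1
    "d3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "d4": [7.0, 7.0, 7.0, 7.0, 7.0],   # constant
}
kept = filter_descriptors(descriptors)
print(kept)
```

Because the filter is greedy, which member of a correlated pair survives depends on the column order; here the first one encountered is kept.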
Additionally, experimental values for logPoct (logPoctM) and logSaq (logSaqM) were
obtained from KOWWIN version 1.66 and WSKOWWIN version 1.40.
14.2.3. Statistical Analysis
Before the analysis, only variables with at least three values different from zero were
retained. To reduce the number of intercorrelated molecular descriptors, the algorithm
developed by the author of this thesis was applied to the descriptor data matrices (see
Chapter 9).
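The first of these pre-processing steps is simple enough to sketch directly. The Python fragment below uses hypothetical descriptor names and values and retains a variable only if at least three of its values are non-zero. The intercorrelation algorithm of Chapter 9 is not reproduced.

```python
def has_enough_nonzero(values, minimum=3):
    """True if at least `minimum` of the values differ from zero."""
    return sum(1 for v in values if v != 0) >= minimum


# Hypothetical descriptor table: name -> values for six compounds.
descriptors = {
    "n_OH": [0, 1, 0, 2, 1, 0],
    "n_NH2": [0, 0, 1, 0, 0, 1],   # only two non-zero values
    "logP": [1.2, -0.3, 2.5, 0.8, 3.1, 0.0],
}
retained = [name for name, values in descriptors.items()
            if has_enough_nonzero(values)]
print(retained)
```

A count descriptor that is non-zero for only one or two compounds would otherwise act as an indicator variable for those compounds rather than as a general structural property.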
Principal components analysis (PCA) and cluster analysis were performed using the
MINITAB version 14 software (Minitab Inc., State College, PA, USA). In the PCA, the
correlation matrix was used in the calculation of the principal components. Cluster
analysis was performed with the average linkage option and with distance measured by
using the Pearson correlation coefficients R between the variables.
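A clustering of variables of this kind can be sketched in plain Python as follows. This is an illustration, not the MINITAB implementation. The distance between two variables is assumed to be 1 - R (a distance based on the absolute correlation would be an equally plausible choice), and the endpoint names and values are hypothetical.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_linkage(variables):
    """Agglomerative clustering of variables with distance 1 - R and
    average linkage; returns the merges as (cluster_a, cluster_b, distance)."""
    names = list(variables)
    dist = {frozenset(pair): 1.0 - pearson_r(variables[pair[0]], variables[pair[1]])
            for pair in combinations(names, 2)}

    def linkage(pair):
        a, b = pair
        # Average of all pairwise distances between members of a and b.
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical endpoint values for six chemicals.
endpoints = {
    "oral_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "oral_B": [1.2, 1.9, 3.2, 3.9, 5.1, 6.2],
    "cell_A": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "cell_B": [2.2, 0.9, 4.1, 2.8, 6.3, 4.9],
}
merges = average_linkage(endpoints)
for a, b, d in merges:
    print(a, "+", b, f"at distance {d:.3f}")
```

The order of the merges and their distances is the information that a dendrogram displays.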
QSAAR and QSAR models were obtained by multiple linear regression. A best-subsets
regression program written in C by the author (see Chapter 9) was used to select the
variables in the models. Variables were included in the same model only if their
coefficient of intercorrelation R was less than 0.7.
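The idea of a best-subsets search with an intercorrelation constraint can be sketched in Python as follows. This is not the author's C program, whose details are given in Chapter 9. The sketch enumerates all subsets up to a given size, skips any subset containing a pair of descriptors with |R| of 0.7 or more, fits each remaining subset by ordinary least squares, and ranks the subsets by R2 alone. All names and data are hypothetical.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(columns, y):
    """R2 of an ordinary least-squares fit of y on the columns plus intercept."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1.0 - ss_res / sum((yi - mean_y) ** 2 for yi in y)


def best_subsets(descriptors, y, max_size=2, r_limit=0.7):
    """Rank all admissible descriptor subsets by R2 (best first)."""
    results = []
    for size in range(1, max_size + 1):
        for subset in combinations(descriptors, size):
            if any(abs(pearson_r(descriptors[a], descriptors[b])) >= r_limit
                   for a, b in combinations(subset, 2)):
                continue  # intercorrelated pair: not allowed in one model
            results.append((r_squared([descriptors[n] for n in subset], y), subset))
    return sorted(results, reverse=True)


# Hypothetical data: y depends on x1 and x2; x3 is nearly a copy of x1.
descriptors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [2.0, -1.0, 3.0, 0.0, -2.0, 1.0, 4.0, -3.0],
    "x3": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1],
}
y = [1.0, 6.0, 4.0, 9.0, 13.0, 12.0, 11.0, 20.0]
ranked = best_subsets(descriptors, y)
for r2, subset in ranked:
    print(subset, f"R2 = {r2:.4f}")
```

In the study itself the candidate equations were compared on several criteria (R2, s, Q2, Q2(5) and MAE), so ranking by R2 alone is a simplification.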
The models were validated by cross-validation and validation on external test sets. Leave-
one-out cross-validation was used to generate leave-one-out cross-validation coefficients
of determination (Q2). Also, leave-five-out cross-validation was performed ten times for
each model (randomly selecting compounds to be excluded at each step of the cross-
validation procedure) in order to investigate model robustness. Thus, ten values for the
leave-five-out cross-validation coefficient of determination (Q2(5)) were generated for
each model.
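The two cross-validation schemes can be sketched in Python for a one-descriptor model. The sketch assumes the common definition Q2 = 1 - PRESS/SStotal, with the total sum of squares taken about the overall mean, and interprets leave-five-out as a random partition of the compounds into groups of up to five, each left out in turn. Both are assumptions, and the data are simulated.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for a one-descriptor model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def q2(x, y, groups):
    """Q2 = 1 - PRESS / SS_total; each group is predicted by a model
    fitted without it."""
    press = 0.0
    for group in groups:
        held = set(group)
        train = [i for i in range(len(x)) if i not in held]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        press += sum((y[i] - (a + b * x[i])) ** 2 for i in group)
    mean_y = sum(y) / len(y)
    return 1.0 - press / sum((v - mean_y) ** 2 for v in y)


def random_groups(n, size, rng):
    """Random partition of the indices 0..n-1 into groups of up to `size`."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i:i + size] for i in range(0, n, size)]


# Simulated data set of 26 compounds.
rng = random.Random(1)
x = [0.25 * i for i in range(26)]
y = [1.0 + 0.8 * xi + rng.gauss(0.0, 0.5) for xi in x]

a, b = fit_line(x, y)
mean_y = sum(y) / len(y)
r2 = 1.0 - (sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
            / sum((yi - mean_y) ** 2 for yi in y))
q2_loo = q2(x, y, [[i] for i in range(len(x))])
q2_five = [q2(x, y, random_groups(len(x), 5, rng)) for _ in range(10)]

print(f"R2 = {r2:.3f}, Q2 = {q2_loo:.3f}")
print(f"Q2(5): min {min(q2_five):.3f}, max {max(q2_five):.3f}")
```

For a least-squares model Q2 cannot exceed R2, so the size of the gap between the two, and the spread of the ten Q2(5) values, are the quantities of interest.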
A simulation of validation by using external test sets was also performed. Five
compounds (referred to as “a test set”) were excluded from the data set and the models
were developed on the basis of the remaining (approximately 21) compounds. The
toxicities of the five compounds from the test set were predicted and the correctness of
the prediction was evaluated by calculating the mean absolute error (MAE) of the
predicted values of the five excluded compounds. As the compounds investigated
represented a heterogeneous data set, no particular group of compounds could be selected
as a test set. Therefore, for each model the five test compounds were randomly chosen
and the described procedure was repeated ten times. Thus, ten MAE values were
produced for each model.
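The repeated hold-out procedure can be sketched in Python as below, again for a one-descriptor model and with simulated data. The function names are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for a one-descriptor model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def holdout_mae(x, y, test_idx):
    """Fit on all compounds except the test set; return the mean absolute
    error of the predictions for the test compounds."""
    held = set(test_idx)
    train = [i for i in range(len(x)) if i not in held]
    a, b = fit_line([x[i] for i in train], [y[i] for i in train])
    return sum(abs(y[i] - (a + b * x[i])) for i in test_idx) / len(test_idx)


# Simulated data set of 26 compounds.
rng = random.Random(7)
x = [0.25 * i for i in range(26)]
y = [1.0 + 0.8 * xi + rng.gauss(0.0, 0.5) for xi in x]

# Ten random test sets of five compounds each.
maes = [holdout_mae(x, y, rng.sample(range(len(x)), 5)) for _ in range(10)]
mae_mean = sum(maes) / len(maes)
print(f"MAE: min {min(maes):.3f}, mean {mae_mean:.3f}, max {max(maes):.3f}")
```

The minimum, mean and maximum of the ten values summarise how strongly the prediction error depends on which compounds happen to be excluded.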
The regression equations with the best regression parameters (R2 and s), cross-validation
parameters (Q2 and Q2(5)), and MAE values were selected. Additionally, models
containing variables allowing for a plausible biological interpretation were chosen from
the equations with best statistical fit. The statistical parameters of the selected models
were finally derived by using MINITAB version 14.
To identify model outliers the method of Tenekedjiev and Radojnova (2001) using
studentised residuals was applied (see Chapter 5, Section 5.1.3 “Observations with large
influence on the regression model, outliers”). An observation was classified as an outlier if
its studentised residual calculated from a regression model based on the remaining
observations did not lie within the calculated confidence interval corresponding to the 5%
significance level. The procedure was run twice (two loops), with the second run
excluding the observations classified as outliers in the first run.
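A two-loop screening of this kind can be sketched in Python for a one-descriptor model. This is not the exact procedure of Tenekedjiev and Radojnova (2001), whose details are given in Chapter 5. Each observation is predicted from a model fitted to the remaining observations, and the prediction error is divided by its standard error. The critical value is passed in as a fixed number (2.365 is the two-sided 5% Student t value for 7 degrees of freedom). A full implementation would recompute it from the degrees of freedom in each loop.

```python
import math


def deleted_studentised_residual(x, y, i):
    """Residual of observation i relative to a line fitted without it,
    divided by the standard error of that prediction."""
    xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(xs, ys))
    s = math.sqrt(sse / (m - 2))
    se_pred = s * math.sqrt(1.0 + 1.0 / m + (x[i] - mx) ** 2 / sxx)
    return (y[i] - intercept - slope * x[i]) / se_pred


def find_outliers(x, y, t_crit, loops=2):
    """Flag observations with |deleted studentised residual| > t_crit;
    each further loop is run without the observations already flagged."""
    active = list(range(len(x)))
    flagged = []
    for _ in range(loops):
        xa = [x[i] for i in active]
        ya = [y[i] for i in active]
        new = [active[k] for k in range(len(active))
               if abs(deleted_studentised_residual(xa, ya, k)) > t_crit]
        if not new:
            break
        flagged.extend(new)
        active = [i for i in active if i not in new]
    return sorted(flagged)


# Hypothetical data: a straight line with small noise and one gross error.
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x)]
y[5] += 5.0

outliers = find_outliers(x, y, t_crit=2.365)
print(outliers)
```

The second loop matters because a gross outlier inflates the residual standard deviation, which can mask smaller outliers in the first pass.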
14.3. Results
The investigated data set contained eight toxicity endpoints, which enabled PCA and
cluster analysis to be applied to the endpoints alone to investigate similarities in the toxic
effects. For this purpose, chemicals with missing toxicity values for one or more of the
endpoints were removed, leaving 30 chemicals for the PCA and cluster analysis.
The first principal component obtained by the PCA had an eigenvalue of 6.34, and
explained 79.3% of the variance in the toxicity data, while the second component had a
much lower eigenvalue of 0.712, and explained only 8.9% of the data variability. Plots of
the loadings and scores of the first against the second component are given in Figure 14.1.
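Because the principal components were calculated from the correlation matrix, the eigenvalues sum to the number of variables (here, eight endpoints), and the proportion of variance explained by a component is its eigenvalue divided by eight. The short Python check below uses the eigenvalues reported above; small differences from the reported percentages arise from rounding of the eigenvalues.

```python
n_endpoints = 8  # toxicity endpoints entering the PCA

# (eigenvalue, reported % of variance) for the first two components
reported = {"PC1": (6.34, 79.3), "PC2": (0.712, 8.9)}

computed = {}
for name, (eigenvalue, reported_percent) in reported.items():
    computed[name] = 100.0 * eigenvalue / n_endpoints
    print(f"{name}: {computed[name]:.2f}% computed, {reported_percent}% reported")
```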
The grouping of the endpoints obtained by the cluster analysis is shown in the
dendrogram of Figure 14.2 and is discussed in Section 14.4 (“Discussion”).
As explained in Section 14.2.1 (“Toxicity data”) 26 chemicals were investigated by
QSAAR and QSAR analysis (after excluding the inorganic chemicals, salts, xylene and
paraquat). The calculated values of the descriptors of chemical structure for the 26
chemicals, used in the selected QSAAR and QSAR models, are presented in Table 14.2.
The abbreviations of the descriptors are explained in Table 12.2. The derived models are
presented in Table 14.3. Only descriptors that were statistically significant at the 95%
confidence level were included in the equations. To keep the table concise, only the
minimum, maximum and average values of the leave-five-out cross-validated Q2
coefficients (Q2(5)) and of the MAEs of each model are given in the table.
In Figure 14.3, plots of observed against predicted values of HLD obtained from the
model with mouse LD50 values (Equation 1, Table 14.3.a, Figure 14.3.a), and observed
against predicted values of HAP obtained from the model with toxicity to human liver
cells (Equation 6, Table 14.3.a, Figure 14.3.b) are presented. The figure shows that
digoxin can have a large influence on the models for the human toxicity endpoints, either
having a large residual (Equation 1, Figure 14.3.a) or being responsible for the good
correlation (Equation 6, Figure 14.3.b). Therefore attempts were made to develop models
both including and excluding digoxin from the training set.
Similarly, a one-descriptor equation obtained with the molecular mass (MM) for the
toxicity to Chang human liver cells (Equation 25, Table 14.3.b) showed that digoxin
might have large influence on the correlation (Figure 14.4), and so correlations without it
were also attempted (Table 14.3.b).
From Table 14.3 it can be seen that the leave-one-out cross-validation Q2 values differed
from the corresponding regression R2 values by approximately 0.1 or less. Three
exceptions were observed: the QSAAR models for the mouse in vivo toxicity (Equations
13 and 14, Table 14.3.a) and Equation 27 (Table 14.3.b) for toxicity to Chang human
liver cells, where these differences were approximately 0.14.
From Table 14.3 it can be seen that, generally, the leave-five-out Q2(5) values were
smaller than the corresponding regression R2 values by approximately 0.2 or less.
The same exceptions were observed here: for the mouse in vivo QSAAR models
(Equations 13 and 14, Table 14.3.a) and the Chang human liver cell models (Equations 27
and 28, Table 14.3.b) these differences were up to 0.43 (Equation 13, Table 14.3.a).
The leave-five-out Q2(5) values varied by less than 0.16 for a given model, with the
same exceptions (Equations 13 and 14, Table 14.3.a, and Equations 27 and 28, Table
14.3.b).
From Table 14.3 it can be seen that the MAE averages over the ten MAEs produced for
each model ranged from 0.26 to 0.79. The maximal MAEs ranged between 0.50 and 1.46.
The worst results with MAE averages higher than 0.60 and/or maximal MAEs higher than
0.90 were obtained for the following equations: the QSAAR Equations 1 and 6 (Table
14.3.a), where only a toxicity endpoint (without structural descriptors) was used as a
predictor variable; the QSAARs for the mouse in vivo toxicity (Equations 13 and 14,
Table 14.3.a); the QSARs for toxicity to Chang human liver cells (Equations 25, 26, 27,
and 28, Table 14.3.b); and the one-parameter Equation 30 (Table 14.3.b) for toxicity to rat
hepatocytes. These exceptions are discussed in Sections “Rodent in vivo toxicity” and
“Toxicity to Chang human liver cells” below.
14.4. Discussion
Eight endpoints including in vitro cytotoxicity and in vivo acute toxicity to bacterial
strains, rodent and human cell lines, rodents and humans were investigated. Similarities
between the toxicity endpoints were investigated and QSAAR and QSAR models were
developed.
14.4.1. Investigation of similarities between the toxicity endpoints
Principal Component Analysis
The PCA resulted in one component that explained a large part of the variance in the
toxicity data (79.3%). The second component was much less important, explaining only
8.9% of the variance. This result indicates that the toxicities are determined mainly by a
single factor, with other factors being less influential. The nature of this factor is possibly
related to cell toxicity, which directly determines the values of half of the investigated
endpoints (toxicity to Chang human liver cells, rat hepatocytes, S. meliloti, and Biotox
assay), and is a basis for the remaining in vivo toxicity endpoints.
From the loading plot of the first against the second component (Figure 14.1.a) it can be
seen that all toxicity endpoints have negative loadings in the first component (values
between –0.38 and –0.33). The loadings of the toxicity endpoints in the second
component (which explained only 8.9% of the toxicity variance) separated the toxicity
tests into in vitro and in vivo tests, with in vivo toxicity endpoints having positive
loadings, and the in vitro endpoints having negative loadings. The only exception was the
in vitro toxicity to human liver cells, which also had a positive loading, albeit smaller than
the loadings of the in vivo endpoints. Thus, the second factor accounts for differences
between the in vivo and in vitro endpoints. The QSAAR and QSAR models derived below
might give an insight into the nature of the two factors.
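The narrow range of the first-component loadings can be put into context with a small calculation. The Python check below assumes that the loadings are reported as coefficients of a unit-length eigenvector, which is an assumption about the software output. Under that convention, a component that weighted all eight endpoints exactly equally would have loadings of magnitude 1/√8.

```python
import math

n_endpoints = 8
equal_loading = 1.0 / math.sqrt(n_endpoints)  # unit-length vector, equal weights

# Range of the magnitudes of the first-component loadings reported above.
low, high = 0.33, 0.38

print(f"equal-weight loading: {equal_loading:.3f}")
print(f"within the reported range: {low < equal_loading < high}")
```

The reported range brackets this value, which is consistent with a first component that acts as an approximately equally weighted average of all endpoints.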
From the score plot (Figure 14.1.b), no obvious grouping of chemicals can be discerned.
The chemicals having highest positive scores in the first component were simple organic
molecules and the salts of univalent metals, while the more complex organic molecules
and the salts of bivalent metals had smaller scores in the first component.
Cluster Analysis
From the dendrogram obtained by the cluster analysis (Figure 14.2), it can be seen that
the human blood/serum LC50 concentration (HAP) is closely related to human liver cell
toxicity, whereas the oral dose causing lethality in humans (HLD values) is closely related
to the oral rat and mouse LD50. The remaining in vitro endpoints form a separate cluster.
These results suggest that the route of toxicant administration (and subsequent
toxicokinetics) determines similarities in the toxic effects between the different species.
Thus, the oral lethal doses for humans, rats and mice were related, whereas the
toxic concentrations in human blood/serum were related to the IC50 values for human liver
cells, which are exposed in vivo to the chemical concentrations in the blood.
14.4.2. QSAAR and QSAR models
Validation procedures
As described in Section “Results”, most of the models had differences between the
regression R2 values and the corresponding leave-one-out cross-validation Q2 values
smaller than or approximately 0.1, and differences between the regression R2 values and
the corresponding leave-five-out Q2(5) values smaller than or approximately 0.2. These
results showed that the models performed satisfactorily during leave-one-out and leave-
five-out cross-validation. The exceptions observed were the QSAAR models for the
mouse in vivo toxicity (Equations 13 and 14, Table 14.3.a) and the models for Chang
human liver cells (Equations 27 and 28, Table 14.3.b). Also, the ranges of the leave-five-
out Q2(5) values obtained for a given model were smaller than 0.16 (with the same
exceptions) suggesting stable model performance with respect to the observations
included.
With the exception of the QSAAR models for the mouse in vivo toxicity (Equations 13
and 14, Table 14.3.a), the models for Chang human liver cells (Equations 25 - 28, Table
14.3.b) and some one-parameter models (Equations 1 and 6, Table 14.3.a; and Equation
30, Table 14.3.b), the remaining models had MAE averages smaller than 0.6 and maximal
MAEs smaller than 0.9, meaning that the mean errors of prediction by the models are
likely to be smaller than 0.9 log units. Perhaps this value reflects the heterogeneity of the
data sets and the complexity of the in vivo endpoints.
Human endpoints
For the 26 chemicals investigated HLD correlated best with the mouse and rat in vivo
toxicity. The correlation with mouse in vivo toxicity was better (Equation 1, Table
14.3.a). Ekwall et al. (1998b) also obtained a good correlation between
HLD values and the mouse in vivo toxicity for the whole MEIC data set. However, as
mentioned above, the model performed poorly during external validation (with a maximal
MAE of 0.969). The statistical parameters improved when the number of H-bond donors
was added (NH-donors, Equation 2, Table 14.3.a). When digoxin was excluded, the
correlation of HLD with mouse in vivo toxicity improved (Equation 3, Table 14.3.a).
NH-donors as a second descriptor was significant only at the 85% confidence level (the model
including mouse in vivo toxicity and NH-donors had n = 24, R2 = 0.764, s = 0.499, F = 34,
Q2 = 0.671). A better model included the Kier benzene-likeness index (BLI, Equation 4,
Table 14.3.a), which is a measure of molecular aromaticity (Kier and Hall, 1986). These
results suggest that the number of H-bond donors (NH-donors) influences compound
toxicity; however, for the data set excluding digoxin, which is more restricted in its
ranges of toxicity and NH-donors, it becomes less important, whilst compound aromaticity
becomes more significant.
A QSAR model for HLD was obtained from the data set excluding digoxin with the zero-
order average connectivity index only (0χA, Equation 15, Table 14.3.b), which had a
worse statistical fit than the QSAAR models. Since 0χA decreases with increasing degree
of skeletal branching and unsaturation, its negative coefficient in the equation suggests
that toxicity increases with increases in these factors. The model could not be improved
by adding other descriptors. Thus, QSAR models for HLD with comparable or better
statistical fit than the QSAAR models could not be obtained.
Ekwall et al. (1998b) correlated the HAP values of the MEIC chemicals with the results
of 61 in vitro toxicity assays, and obtained the highest R2 with toxicity to Chang liver
cells (R2 = 0.73). In the present study, HAP was also predicted best from toxicity to
Chang liver cells (Equations 5 and 6, Table 14.3.a). For the data set excluding digoxin,
the correlation improved when the energy of the lowest unoccupied molecular orbital
(LUMO) was added (Equation 7, Table 14.3.a). The best three-descriptor model included
LUMO and the number of oxygen atoms (NO) together with the toxicity to Chang liver
cells (Equation 8, Table 14.3.a).
The QSAR models obtained for HAP had slightly better statistical fits than those of the
QSAAR models. Combinations of a topological descriptor (second-order valence
connectivity index, 2χv, or third-order valence path connectivity index, 3χpv) with an
electronic descriptor (LUMO) or the number of oxygen atoms (NO) gave models with the
best statistical fit (Equations 16 – 20, Table 14.3.b). The topological descriptors (2χv, 3χpv)
encode the size and shape of the molecules. LUMO is usually considered to be a
descriptor of chemical reactivity. Similar descriptors appeared in the models for HAP
including and excluding digoxin.
Adding structural descriptors resulted in QSAAR models for in vivo human toxicity with
better fit than did those based on toxicity endpoints alone; however, the improvement of
R2 was only between 0.06 and 0.17.
Rodent in vivo toxicity
Rat LD50
Rat LD50 values correlated best with the toxicity to rat hepatocytes (Equation 9, Table
14.3.a). Adding the number of 6-membered rings (Nrings6) or the molecular polarisability
(α) improved the correlation (Equations 10 and 11, Table 14.3.a), albeit with an increase
of R2 of only about 0.09. Nrings6 and α were highly intercorrelated with R = 0.98 (n = 25).
Nrings6 may be related to the size and/or shape of the molecule. The molecular
polarisability encodes the susceptibility of a molecule to acquire a dipole moment under
an external electric field, and might influence the transport or distribution of the chemical
into cell membranes. It is also related to the molecular size.
When QSARs were investigated, a model containing the difference between the energies
of LUMO and HOMO (LmH) and the number of 6-membered rings (Nrings6) was derived
(Equation 21, Table 14.3.b). LmH is considered to reflect the activation energy of a
molecule. A three-descriptor model (Equation 22, Table 14.3.b) included the
hydrophilicity factor (Hy), the electrotopological state descriptor (TIE, calculated on the
basis of the electrotopological state indices of Kier and Hall, 1990), and the number of 6-
membered rings (Nrings6). The electrotopological state indices of Kier and Hall are
considered to account for the ability of molecules to enter into non-covalent
intermolecular interactions (Kier and Hall, 1990). The three-variable QSARs had better
statistical parameters than the QSAARs.
Mouse LD50
The descriptors that appeared in the models for rat LD50 also described mouse LD50 best,
although the statistical parameters of the models were worse. Mouse and rat LD50
intercorrelate with R = 0.840. Mouse LD50 was correlated with the toxicity to rat
hepatocytes in combination with Nrings6 or polarisability (α) (Equations 13 and 14, Table
14.3.a). As mentioned above, these models performed poorly during cross-validation and
external validation, but better QSAAR models could not be derived. The QSAR models for
mouse LD50, however, were better than the QSAARs and again included LmH and Nrings6
(Equation 23, Table 14.3.b), and a combination of Hy, TIE, and Nrings6 (Equation 24,
Table 14.3.b).
As described above, the results from the PCA suggested a factor determining differences
between the in vivo and in vitro toxicity effects, which is probably related to toxicokinetic
factors. Insight into the nature of this factor might be obtained from the structural
descriptors relating in vivo to in vitro endpoints in the QSAAR models. Generally, these
descriptors were related to electronic/reactivity properties, the presence of oxygen atoms,
and size/shape properties.
Toxicity to human and rodent cell lines in vitro
Toxicity to Chang human liver cells
The best QSARs for toxicity to Chang liver cells included the same structural descriptors
when digoxin was included and excluded from the data set (Equations 26 and 27, Table
14.3.b), namely MM and the magnitude of molecular dipole moment (µ). The positive
coefficients of MM in the equations suggest that larger molecules are more toxic. µ
influences the intermolecular interactions and membrane transport of a compound. The
exclusion of digoxin from the data set resulted also in a model with logPoctM and LUMO
(Equation 28, Table 14.3.b), which is in accordance with the so-called response-surface
approach for QSAR modeling of cytotoxicity (see Chapter 8, Section 8.2 “Literature
review of QSARs for acute toxicity/cytotoxicity”).
However, as mentioned above, the presented models for toxicity to human liver cells
performed poorly during cross-validation and external validation. These results suggest
that the toxicity of some of the compounds is poorly described by the models. The
identification of outliers by using studentised residuals at the 5% significance level
detected lindane, 1,1,1-trichloroethane, tetrachloromethane, and hexachlorophene.
Lindane and tetrachloromethane were more toxic than predicted by the three equations;
1,1,1-trichloroethane and hexachlorophene were less toxic than predicted.
Hexachlorophene has a very high logPoct value (calculated logPoct of 6.92, measured
logPoctM of 7.54, Table 14.2.a), which might be a reason for its lower toxicity than
predicted. Lindane, tetrachloromethane, and 1,1,1-trichloroethane are saturated
halogenoalkanes, which are known to cause their toxic effect by the non-polar narcotic
mechanism (see Chapter 8, Section 8.1 “Mechanisms of toxic action”). According to the
baseline effect the non-polar narcotics have the lowest possible toxicity when compared
with compounds having similar logPoct but acting by other mechanisms (see Chapter 8).
However, in the present study lindane and tetrachloromethane have higher toxicities than
expected from their logPoct values, even when their LUMO energies (considered to
encode compound reactivity) are accounted for. This may result from specific properties
of the target biological system (Chang human liver cells), which interacts in a particular
way with these compounds. These observations need more detailed experimental
investigation.
Toxicity to rat hepatocytes
A model for the toxicity to rat hepatocytes with good statistical parameters was obtained
with logSaq only. Calculated logSaq values gave better results than did measured values
(Equations 29 and 30, Table 14.3.b). The two-parameter models included a combination
of logSaq and LmH (Equation 33, Table 14.3.b) and a combination of logPoctM and LmH
(measured logPoct values gave better results, Equation 34, Table 14.3.b). In addition, the
topological descriptor 0χA was also used in combination with logSaq or logPoctM to obtain
two-descriptor models (Equations 31 and 32, Table 14.3.b). Statistically significant
models with these structural descriptors can also be obtained for rat and mouse LD50
values (data not shown), which is consistent with the high intercorrelation between rat
and mouse in vivo and rat hepatocyte in vitro toxicities.
Bacterial toxicity
S. meliloti
Two-descriptor models were obtained by combining logPoctM (measured logPoct values
gave slightly better results than did calculated values) with LUMO or LmH (LmH
resulted in a model with a slightly better statistical fit than did LUMO, Equations 35 and
36, Table 14.3.b). An extended QSAR investigation of the toxicity to the bacterium S.
meliloti is presented in Chapter 12. As described in Chapter 12, a model including logPoct,
LUMO, and the numbers of NH2 and OH groups was obtained for a data set of 131
chemicals acting by different toxicity mechanisms. The model had R2 = 0.541, s = 0.818,
and F = 37.1. When only the MEIC chemicals were considered in the study presented in
this chapter, a model with better statistical fit was obtained by using logPoct and LUMO
only (R2 = 0.862, Equation 36, Table 14.3.b); however, the training set was smaller (n =
22). Adding the number of OH groups to this equation did not improve the fit with
statistical significance. The number of NH2 groups was not attempted in combination with
logPoct and LUMO, because only two of the investigated compounds had NH2 groups.
Biotox™ (V. fischeri)
The best two-descriptor models included logPoct in combination with 0χA or LmH. A
model with logPoct and LUMO was also obtained, but again, its statistical parameters
were poorer (Equations 37, 38, and 39, Table 14.3.b.).
As described above, the results from PCA suggested that the toxicity to the different
endpoints is determined mainly by a single factor, which might be related to cell toxicity.
The QSARs for cell toxicity related this factor to compound hydrophobicity and
reactivity, which is in accordance with literature reports (see Chapter 8, Section 8.2
“Literature review of QSARs for acute toxicity/cytotoxicity”). Descriptors encoding these
properties also appeared in the QSARs for the in vivo endpoints, together with descriptors
of molecular size/shape.
14.4.3. Contribution to existing knowledge
The work presented in this chapter is based on a large data base developed in the MEIC
and MEMO programmes. As far as the author is aware, this is the first investigation by
QSAR analysis of toxicity endpoints included in the MEIC programme. The eight
endpoints selected in this study included a broad range of biological species, including
humans, which allowed comparison between them. The aim of the MEIC programme was
to assess in vitro toxicity endpoints that allow for prediction of human toxicity in vivo.
The present study explored an extended approach to predict in vivo human toxicity by
using combinations of other in vitro and/or in vivo toxicity endpoints and descriptors of
chemical structure (QSAAR analysis). Human in vivo toxicity data were correlated with
rodent toxicity in vivo or human in vitro liver cell toxicity in combination with structural
descriptors accounting for the H-bond donor ability, molecular aromaticity, and electronic
properties. However, adding structural descriptors improved the R2 of the
models by only 0.06 to 0.17 compared with those based on toxicity endpoints
alone. Nevertheless, the QSAAR models with structural descriptors performed better
during cross-validation and external validation. Additionally, QSAR models for HAP
were obtained which had slightly better statistical parameters than the QSAAR model,
without the need to include an in vitro endpoint. Such a model for HLD could not be
obtained.
Also, in vivo toxicity to rat and mouse was investigated and better QSARs (including
electronic parameters and descriptors of size and hydrophobicity) than QSAARs were
derived.
The QSAAR and QSAR models for in vivo toxicity developed in this study have
reasonable statistical parameters; however, a clear and detailed explanation of the
mechanistic basis of these models is difficult to provide. The toxic effect in vivo is
determined by a complex combination of factors related to toxicokinetics and cell
toxicity, which are not always understood. However, when mechanistically-based and/or
easily interpretable models based on in vitro endpoints and/or structural descriptors
cannot be developed, models based on statistical correlations can still be useful even if
they are not used to directly replace testing. For example, they could be used as priority-
setting methods or to provide supplementary information in hazard and risk assessment.
QSARs for the in vitro biological systems were obtained, which, in general, included a
descriptor of compound hydrophobicity (logPoct, logSaq), and reactivity (LmH, LUMO).
Thus, the present investigation confirmed the applicability of the so-called
response-surface approach (see Chapter 8, Section 8.2 “Literature review of QSARs for acute
toxicity/cytotoxicity”) for QSAR analysis of toxicity to in vitro systems. These QSARs
are simple and have a sound mechanistic interpretation as they account for both the
transport and distribution of the compounds in cell membranes (as a part of the narcotic
mechanisms of toxicity), as well as the chemical reactivity.
An article based on this work has been published by Lessigiarska et al. (2006).
14.5. Conclusions
The results from the cluster and QSAAR analyses were consistent with the findings of
Ekwall et al. (1998b) and suggested that the route of toxicant administration (and
subsequent toxicokinetics) determines similarities in the toxic effects between the
different species. Thus, the in vivo human oral lethal doses were best related to the in vivo
rat and mouse oral lethal doses, whereas the toxic concentrations in human blood/serum
were related to IC50 values of human liver cells which are exposed in vivo to the chemical
blood concentrations.
Toxicity to mammalian cells was shown to be a better predictor of in vivo human and
rodent toxicity than the toxicity to the two bacteria investigated (S. meliloti and V.
fischeri). QSAARs for human toxicity endpoints were developed by combining mouse
LD50 values and descriptors encoding H-bond donor ability and aromaticity, or in vitro
toxicity to human liver cells and descriptors related to electronic properties (LUMO).
Factors that appear to be important for human toxicity from the QSAR models are related
to molecular size and shape, and electronic properties (LUMO).
In vivo toxicity to rat and mouse was related to in vitro IC50 values of rat hepatocytes.
However, in vivo toxicity to rat and mouse was better modelled by QSAR analysis than
by QSAAR analysis, by using descriptors of hydrophilicity, electrotopological state, and the
number of 6-membered rings.
Descriptors for hydrophobicity (logPoct, logSaq) and compound reactivity (LmH, LUMO)
appeared to determine in vitro toxicity to liver and bacterial cells, which is in accordance
with many QSAR studies of toxicity to in vitro systems (see Chapter 8, Section 8.2
“Literature review of QSARs for acute toxicity/cytotoxicity”).
Figure 14.1. Plots of (a) the loadings of the endpoints and (b) the scores of the chemicals
of the first against the second component obtained by the PCA for the MEIC data set
[Figure not reproduced. (a) Loading plot of the eight endpoints (Rat hepatocytes, Biotox, S. meliloti, Human liver, Mouse LD50, Rat LD50, HLD, HAP); axes: First Component and Second Component. (b) Score plot of the chemicals, numbered according to their MEIC number (see Table 14.1); axes: First Component and Second Component.]
Figure 14.2. Dendrogram plot obtained for the endpoints of the MEIC data set by cluster
analysis with average linkage and correlation coefficient distance
[Figure not reproduced. Dendrogram of the endpoints in the order Rat hepatocytes, Biotox, S. meliloti, Mouse LD50, Rat LD50, HLD, Human liver, HAP; distance axis from 0.00 to 0.28.]
Figure 14.3. (a) Plot of observed against predicted values of HLD obtained with the
model based on mouse LD50 (Equation 1, Table 14.3.a)
[Figure not reproduced. Observed against predicted values of log 1/HLD; the point for digoxin is labelled.]
(b) Plot of observed against predicted values of HAP obtained with the model based on
the toxicity to human liver cells (Equation 6, Table 14.3.a)
[Figure not reproduced. Observed against predicted values of log 1/HAP; the point for digoxin is labelled.]
Figure 14.4. Plot of observed against predicted IC50 values for the toxicity to Chang
human liver cells obtained with the model based on MM (Equation 25, Table 14.3.b)
[Figure not reproduced. Observed against predicted values of log 1/Human liver; the point for digoxin is labelled.]
Table 14.1. The 50 MEIC chemicals and their toxicities †
No  Name  HAP  HLD  Rat LD50  Mouse LD50  Human liver  Rat hepatocytes  S. meliloti  Biotox
1 acetaminophen a -0.37 -0.25 -1.20 -0.35 na d -1.03 -0.22 -1.19
2 acetylsalicylic acid -0.89 -0.33 -0.05 -0.14 0.36 -0.43 -0.39 -0.80
3 iron (II) sulphate b 0.11 -0.55 -0.32 -0.65 0.69 -0.21 0.49 -0.93
4 diazepam a 1.15 c na d -0.09 0.77 1.44 0.64 -0.50 0.12
5 amitriptyline HCl b 2.21 0.87 -0.06 0.30 1.41 1.15 1.82 1.52
6 digoxin 4.04 3.77 1.44 1.64 5.52 0.57 0.64 -0.16
7 ethylene glycol -1.40 -1.40 -1.88 -1.95 -1.65 -2.55 -3.54 -3.43
8 methanol -2.07 -1.69 -2.24 -2.36 -2.66 -2.96 -3.33 -2.90
9 ethanol -2.26 -2.01 -2.19 -1.87 -2.50 -2.65 -3.22 -2.63
10 propan-2-ol -1.92 -1.63 -1.92 -1.78 -1.65 -2.48 -2.98 -2.27
11 1,1,1-trichloroethane -0.24 -1.73 -1.86 -1.65 -2.05 -1.88 0.18 -0.79
12 phenol 0.07 -0.22 -0.53 -0.46 -0.69 0.10 -1.11 -0.15
13 sodium chloride b -2.30 -1.59 -1.71 -1.84 -1.92 -2.01 -2.46 -2.47
14 sodium fluoride a,b -0.01 c -0.32 -0.09 -0.13 -1.54 0.00 na d -2.37
15 malathion a 2.24 c -0.35 0.06 0.24 na d na d 0.95 0.05
16 2,4-dichlorophenoxyacetic acid a -0.71 -0.24 -0.23 -0.20 -0.22 0.10 na d 0.24
17 xylene b -0.02 c -0.93 -1.61 -1.30 -0.03 -1.24 -0.09 -0.75
18 nicotine 1.08 c 2.36 0.51 1.69 -0.06 -0.55 -0.79 -0.43
19 potassium cyanide b 0.20 1.34 1.11 0.88 -0.32 0.11 0.65 -1.42
20 lithium sulphate a,b -1.15 c -0.88 na d -1.03 -1.32 -1.46 na d -2.37
21 theophylline 0.00 0.06 -0.13 -0.12 -0.86 -0.34 -1.32 -1.08
22 dextropropoxyphene HCl a,b 1.63 c 1.57 0.65 0.17 na d na d na d na d
23 propranolol HCl b 1.92 0.62 -0.20 -0.03 1.46 1.24 0.40 0.55
24 phenobarbital a 0.00 0.32 0.16 0.23 -0.40 -0.37 na d 0.40
25 paraquat a,b 1.17 0.72 0.27 0.19 0.01 -0.07 0.59 na d
26 arsenic trioxide a,b 1.66 1.68 1.13 0.80 2.22 2.52 1.64 na d
27 copper (II) sulphate b 0.60 -0.10 -0.27 -0.36 0.42 1.32 2.15 0.26
28 mercury (II) chloride b 0.70 1.10 2.43 1.66 2.35 2.52 4.22 2.89
29 thioridazine HCl b 1.96 0.83 -0.39 0.02 2.49 1.72 1.99 0.96
30 thallium sulphate a,b 1.44 1.56 1.50 1.33 0.60 0.89 na d na d
31 warfarin a 0.19 c 0.49 2.28 2.01 -0.26 0.86 -0.25 na d
32 lindane 2.35 c 0.08 0.58 0.82 2.83 0.84 0.85 0.68
33 trichloromethane -0.61 -0.92 -0.88 0.52 -0.45 -0.79 -0.72 -0.87
34 tetrachloromethane 1.42 -0.93 -1.18 -1.73 1.81 -0.60 -1.03 -0.78
35 isoniazid a -0.09 -0.10 -0.96 0.01 -0.54 -0.55 na d -1.30
36 dichloromethane -0.61 c -1.21 -1.28 -1.01 -1.74 -2.04 -0.62 -1.41
37 barium nitrate a,b -0.35 c 0.85 -0.13 -0.01 -0.21 0.33 na d na d
38 hexachlorophene a 0.55 0.28 0.86 0.78 1.36 2.70 3.72 na d
39 pentachlorophenol 0.53 0.97 0.99 0.87 1.93 1.30 3.27 1.70
40 verapamil HCl b 1.54 0.60 0.66 0.48 0.66 0.95 0.69 0.32
41 chloroquine phosphate a,b 1.53 0.79 na d 0.01 0.39 1.36 -0.14 -1.99
42 orphenadrine HCl a,b 1.38 0.81 0.08 0.49 -0.20 0.94 -0.24 na d
43 quinidine sulphate b 1.10 0.72 0.21 0.17 1.03 0.89 0.38 0.85
44 phenytoin a 0.10 -0.08 -0.81 0.23 0.39 0.75 na d na d
45 chloramphenicol 0.25 c 0.05 -0.89 -0.67 -0.11 0.40 -0.64 0.20
46 sodium oxalate a,b -0.10 c -0.39 na d -1.58 na d 0.24 na d na d
47 amphetamine sulphate a,b 0.94 1.43 0.83 1.19 na d -0.09 na d na d
48 caffeine 0.04 0.14 0.00 0.18 -0.97 -0.20 -1.28 -1.04
49 atropine sulphate b 1.85 c 2.61 0.06 0.17 0.80 0.17 0.18 -0.47
50 potassium chloride b -0.98 c -0.73 -1.54 -1.30 -1.59 -1.96 -2.46 -3.30
† the toxicity values are presented in negative logarithmic form and units of mmol/kg (HLD, rat, and mouse LD50) or mmol/l (the remaining endpoints)
a chemicals excluded from the PCA and cluster analysis due to missing values for some of the endpoint(s)
b inorganic chemicals, salts, xylene (mixture of isomers), and paraquat (charge of +2) were excluded from the QSAAR and QSAR analyses
c human peak concentrations obtained from approximate LC0 or LC100 curves, instead of an approximate LC50 curve (see text)
d na – not available
Table 14.2. Values for the molecular descriptors included in the selected QSAARs and QSARs for acute toxicity
(a) measured descriptors taken from KOWWIN and WSKOWWIN, and descriptors calculated with KOWWIN, WSKOWWIN, and DRAGON
No  Name  logPoct (KOWWIN)  logPoctM (KOWWIN)  logSaq (WSKOWWIN)  logSaqM (WSKOWWIN)  0χA (DRAGON)  BLI (DRAGON)  Hy (DRAGON)  NH-donors (DRAGON)  TIE (DRAGON)
1 acetaminophen 0.27 0.46 -0.70 -1.03 0.752 0.886 -0.107 2 45.9
2 acetylsalicylic acid 1.13 1.19 -1.53 -1.59 0.757 0.835 -0.673 0 92.9
4 diazepam 2.70 2.82 -3.69 -3.76 0.706 0.915 -0.787 2 80.6
6 digoxin 0.50 1.26 -5.97 -4.08 0.713 1.013 2.653 8 283
7 ethylene glycol -1.20 -1.36 1.21 1.21 0.854 1.132 1.837 2 24.0
8 methanol -0.63 -0.77 1.49 1.49 1.000 1.342 1.402 1 0.378
9 ethanol -0.14 -0.31 1.24 1.34 0.902 1.535 0.712 1 4.54
10 propan-2-ol 0.28 0.05 0.83 1.22 0.894 1.413 0.371 1 0.00
11 1,1,1-trichloroethane 2.68 2.49 -2.30 -2.01 0.900 1.651 -0.359 0 0.00
12 phenol 1.51 1.46 -0.56 -0.06 0.730 0.915 -0.067 1 19.7
15 malathion 2.29 2.36 -3.62 -3.36 0.784 1.627 -0.517 0 263
16 2,4-dichlorophenoxyacetic acid 2.62 2.81 -2.82 -2.51 0.757 0.957 -0.598 1 83.8
18 nicotine 1.00 1.17 0.79 0.79 0.699 1.034 -0.807 0 37.0
21 theophylline -0.39 -0.04 -1.79 -1.39 0.737 0.797 -0.523 0 39.2
24 phenobarbital 1.33 1.47 -2.15 -2.32 0.733 0.889 0.477 2 48.0
31 warfarin 2.23 2.60 -3.67 -4.26 0.713 0.884 -0.815 0 71.3
32 lindane 4.26 4.14 -4.86 -4.56 0.789 1.482 -0.484 0 82.4
33 trichloromethane 1.52 1.97 -1.76 -1.18 0.894 1.964 -0.215 0 0.00
34 tetrachloromethane 2.44 2.83 -2.74 -2.29 0.900 1.701 -0.180 0 0.00
35 isoniazid -0.81 -0.70 -0.92 0.01 0.740 0.826 -0.576 0 23.3
36 dichloromethane 1.34 1.25 -0.89 -0.82 0.902 2.405 -0.264 0 4.16
38 hexachlorophene 6.92 7.54 -8.03 -3.46 0.757 1.051 0.478 2 385
39 pentachlorophenol 4.74 5.12 -4.94 -4.28 0.789 1.140 0.089 1 474
44 phenytoin 2.16 2.47 -3.15 -3.90 0.700 0.854 -0.296 1 49.0
45 chloramphenicol 0.92 1.14 -2.92 -2.11 0.764 0.953 0.565 3 116
48 caffeine 0.16 -0.07 -1.87 -0.95 0.747 0.822 -0.557 0 43.8
Table 14.2. Values for the molecular descriptors included in the selected QSAARs and QSARs for acute toxicity
(b) descriptors calculated with TSAR
No  Name  2χv  3χpv  α a  LmH  LUMO  µ a  MM  NO  Nrings6
1 acetaminophen 2.228 1.188 19.31 8.754 0.284 4.618 151.2 2 1
2 acetylsalicylic acid 2.395 1.371 19.68 9.251 -0.532 1.428 180.2 4 1
4 diazepam 5.117 3.623 37.02 8.588 -0.632 3.712 284.8 1 2
6 digoxin 18.816 15.952 90.38 10.034 -0.158 6.837 781.1 14 6
7 ethylene glycol 0.447 0.100 5.44 13.850 3.024 0.001 62.1 2 0
8 methanol 0.000 0.000 3.06 14.914 3.778 1.631 32.0 1 0
9 ethanol 0.316 0.000 5.31 14.439 3.564 1.493 46.1 1 0
10 propan-2-ol 1.094 0.000 7.61 14.701 3.576 1.826 60.1 1 0
11 1,1,1-trichloroethane 3.936 0.000 7.06 11.726 -0.266 1.697 133.4 0 0
12 phenol 1.336 0.756 13.68 9.512 0.398 1.302 94.1 1 1
15 malathion 10.648 8.542 31.15 7.412 -2.658 4.821 330.4 6 0
16 2,4-dichlorophenoxyacetic acid 3.169 1.819 19.22 9.040 -0.312 3.149 221.0 3 1
18 nicotine 3.425 2.590 24.29 9.487 0.213 2.714 162.3 0 1
21 theophylline 2.806 2.036 19.97 8.982 -0.089 6.121 180.2 2 1
24 phenobarbital 3.862 3.026 27.35 9.817 -0.191 1.609 232.3 3 2
31 warfarin 5.517 3.862 41.03 8.284 -1.037 7.999 308.4 4 3
32 lindane 5.936 6.284 16.97 11.211 -0.151 2.081 290.8 0 1
33 trichloromethane 2.474 0.000 4.81 11.467 -0.303 0.860 119.4 0 0
34 tetrachloromethane 4.286 0.000 5.57 11.262 -1.116 0.002 153.8 0 0
35 isoniazid 1.709 1.074 16.23 9.715 -0.596 1.581 137.2 1 1
36 dichloromethane 1.010 0.000 4.08 11.984 0.594 1.279 84.9 0 0
38 hexachlorophene 6.718 5.269 31.92 8.421 -0.772 2.274 406.9 2 2
39 pentachlorophenol 3.962 3.677 17.01 8.596 -0.977 0.878 266.3 1 1
44 phenytoin 4.390 3.282 33.20 9.544 -0.133 2.996 252.3 2 2
45 chloramphenicol 5.106 2.978 27.92 9.278 -1.183 6.764 323.2 5 1
48 caffeine 3.233 2.317 22.50 8.615 -0.348 3.627 194.2 2 1
a calculated with the VAMP package, using AM1 Hamiltonian
Table 14.3. QSAAR and QSAR models based on the MEIC data set
(a) QSAAR models
No  Endpoint  Regression equation  n  R2  s  F  Q2  Q2(5) Min  Q2(5) Max  Q2(5) Average  MAE Min  MAE Max  MAE Average
1 HLD a 0.869 log1/MouseLD50 c 25 0.683 0.728 49.6 0.589 0.536 0.666 0.605 0.239 0.969 0.540
2 HLD a 0.796 log1/MouseLD50 + 0.312 NH-donors – 0.346 25 0.853 0.507 63.7 0.756 0.700 0.808 0.760 0.164 0.668 0.394
3 HLD b 0.725 log1/MouseLD50 – 0.144 24 0.739 0.513 62.4 0.663 0.598 0.720 0.669 0.163 0.645 0.412
4 HLD b 0.650 log1/MouseLD50 – 0.590 BLI + 0.548 24 0.797 0.463 41.2 0.702 0.633 0.759 0.693 0.137 0.569 0.365
5 HAP a 0.676 log1/HumanLiver c 24 0.814 0.610 96.5 0.784 0.760 0.802 0.783 0.295 0.755 0.527
6 HAP b 0.647 log1/HumanLiver c 23 0.705 0.621 50.2 0.645 0.602 0.676 0.635 0.422 0.891 0.638
7 HAP b 0.463 log1/HumanLiver – 0.251 LUMO c 23 0.773 0.558 34.0 0.684 0.614 0.759 0.688 0.363 0.735 0.529
8 HAP b 0.412 log1/HumanLiver – 0.323 LUMO – 0.191 NO + 0.376 23 0.828 0.499 30.5 0.730 0.711 0.796 0.741 0.280 0.661 0.455
9 Rat LD50 0.685 log1/RatHepat – 0.153 25 0.661 0.696 44.8 0.599 0.576 0.642 0.609 0.272 0.848 0.514
10 Rat LD50 0.493 log1/RatHepat + 0.347 Nrings6 – 0.628 25 0.758 0.601 34.4 0.658 0.544 0.687 0.624 0.328 0.693 0.529
11 Rat LD50 0.522 log1/RatHepat + 0.0215 α – 0.675 25 0.736 0.628 30.6 0.670 0.552 0.713 0.655 0.289 0.735 0.490
12 Mouse LD50 0.680 log1/RatHepat c 25 0.603 0.782 34.9 0.528 0.496 0.587 0.528 0.194 0.822 0.558
13 Mouse LD50 0.487 log1/RatHepat + 0.348 Nrings6 – 0.354 25 0.693 0.703 24.9 0.560 0.267 0.642 0.516 0.258 1.201 0.601
14 Mouse LD50 0.502 log1/RatHepat + 0.0236 α – 0.447 25 0.686 0.711 24.0 0.557 0.426 0.626 0.539 0.299 1.027 0.592
a digoxin included
b digoxin excluded
c intercept of the equation was not significantly different from zero at 95% confidence level
Table 14.3. QSAAR and QSAR models based on the MEIC data set
(b) QSAR models
No  Endpoint  Regression equation  n  R2  s  F  Q2  Q2(5) Min  Q2(5) Max  Q2(5) Average  MAE Min  MAE Max  MAE Average
15 HLD b -9.57 0χA + 7.30 24 0.655 0.589 41.8 0.575 0.542 0.599 0.576 0.266 0.589 0.435
16 HAP a 0.269 3χpv – 0.352 LUMO – 0.558 26 0.823 0.614 53.6 0.772 0.649 0.803 0.754 0.274 0.883 0.478
17 HAP a 0.258 2χv – 0.283 LUMO – 0.878 26 0.821 0.618 52.6 0.794 0.745 0.800 0.779 0.242 0.814 0.474
18 HAP b 0.497 2χv – 0.271 NO – 1.26 25 0.778 0.576 38.6 0.715 0.693 0.751 0.718 0.316 0.818 0.439
19 HAP b 0.224 3χpv – 0.388 LUMO – 0.464 25 0.744 0.620 31.9 0.641 0.583 0.665 0.630 0.273 0.716 0.423
20 HAP b 0.319 3χpv – 0.420 LUMO – 0.296 NO c 25 0.870 0.451 47.0 0.794 0.732 0.809 0.778 0.101 0.543 0.339
21 Rat LD50 -0.281 LmH + 0.427 Nrings6 + 2.01 26 0.716 0.639 29.0 0.633 0.538 0.672 0.617 0.296 0.596 0.482
22 Rat LD50 0.00382 TIE – 0.645 Hy + 0.607 Nrings6 – 1.41 26 0.815 0.527 32.4 0.735 0.667 0.746 0.705 0.188 0.677 0.412
23 Mouse LD50 -0.307 LmH + 0.409 Nrings6 + 2.57 26 0.705 0.675 27.5 0.631 0.581 0.666 0.628 0.248 0.789 0.470
24 Mouse LD50 0.00281 TIE – 0.790 Hy + 0.677 Nrings6 – 1.12 26 0.797 0.573 28.8 0.751 0.718 0.765 0.749 0.215 0.613 0.451
25 Human liver a 0.0101 MM – 2.15 24 0.726 0.989 58.3 0.690 0.672 0.705 0.689 0.364 1.458 0.790
26 Human liver a 0.0128 MM – 0.322 µ – 1.87 24 0.817 0.828 46.7 0.757 0.633 0.789 0.750 0.348 1.120 0.703
27 Human liver b 0.0141 MM – 0.335 µ – 2.05 23 0.699 0.834 23.2 0.552 0.421 0.621 0.541 0.326 1.421 0.643
28 Human liver b 0.364 logPoctM – 0.349 LUMO – 0.813 23 0.634 0.920 17.3 0.522 0.408 0.592 0.496 0.314 1.178 0.595
29 Rat hepatocytes -0.514 logSaq – 1.51 25 0.763 0.691 73.9 0.721 0.694 0.750 0.716 0.284 0.770 0.532
30 Rat hepatocytes -0.579 logSaqM – 1.39 25 0.662 0.824 45.0 0.605 0.568 0.638 0.599 0.206 1.019 0.555
31 Rat hepatocytes -0.405 logSaq – 6.90 0χA + 4.19 25 0.907 0.441 108 0.870 0.832 0.874 0.854 0.205 0.683 0.385
32 Rat hepatocytes 0.443 logPoctM – 9.49 0χA + 6.36 25 0.906 0.444 106 0.878 0.852 0.882 0.865 0.271 0.622 0.412
33 Rat hepatocytes -0.347 logSaq – 0.312 LmH + 2.10 25 0.901 0.455 101 0.871 0.830 0.882 0.864 0.163 0.495 0.319
34 Rat hepatocytes 0.341 logPoctM – 0.401 LmH + 3.18 25 0.871 0.521 74.0 0.841 0.804 0.862 0.831 0.267 0.611 0.409
35 S. meliloti 0.651 logPoctM – 0.276 LmH + 1.24 22 0.886 0.652 73.6 0.860 0.815 0.873 0.850 0.347 0.737 0.539
36 S. meliloti 0.626 logPoctM – 0.348 LUMO – 1.54 22 0.862 0.715 59.5 0.829 0.686 0.841 0.814 0.339 0.836 0.579
37 Biotox 0.576 logPoctM – 6.67 0χA + 3.80 23 0.914 0.371 107 0.885 0.854 0.901 0.876 0.129 0.523 0.263
38 Biotox 0.480 logPoctM – 0.269 LmH + 1.42 23 0.898 0.405 87.8 0.874 0.857 0.886 0.870 0.170 0.519 0.361
39 Biotox 0.430 logPoctM – 0.328 LUMO – 1.27 23 0.834 0.518 50.1 0.793 0.669 0.813 0.774 0.201 0.668 0.405
a digoxin included
b digoxin excluded
c intercept of the equation was not significantly different from zero at 95% confidence level
CHAPTER 15
SUMMARY AND GENERAL DISCUSSION
15.1. Objective
The aim of this chapter is to summarise the results of the research project, emphasising
the author’s contributions to existing knowledge. Some observations about the quality and
applicability of the QSARs based on the performed research are made. Also, possible
applications of the developed QSARs in strategies for toxicity testing are described.
15.2. Summary of the research results
The project included QSAR investigations in two main pharmacotoxicological areas,
namely blood-brain barrier (BBB) penetration, and acute toxicity and cytotoxicity.
QSARs were developed by using physicochemical properties, counts of functional
groups, descriptors based on molecular topology, electrostatic and electronic features. A
number of statistical approaches were involved, including regression analysis, partial least
squares (PLS) regression, discriminant analysis, principal components analysis (PCA),
cluster analysis, and classification trees. 3D QSAR approaches (CoMSIA, CoMFA) were
also explored to investigate specific interactions with biological molecules. Additionally,
the author developed some algorithms for the implementation of statistical approaches to
support the QSAR analysis. These are related to reducing variable multicollinearity in
large data sets, and an application of the best subsets regression.
The background to the research is presented in Chapters 2-8, whereas the results of the
research are reported in Chapters 9-14.
In Chapter 9 two computational algorithms with applications in QSAR analysis are
suggested. The first one is used to reduce multicollinearity in large data sets of structural
descriptors. It was designed to optimise subsequent QSAR analysis. The algorithm allows
the user to preserve variables considered important for describing biological activity and
development of QSARs. The algorithm was used in the publication of Lessigiarska et al.
(2004a).
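The idea of reducing multicollinearity while preserving selected variables can be sketched as follows. This is not the algorithm of Chapter 9, whose details are not repeated in this summary; it is a generic greedy pairwise-correlation filter written under stated assumptions. The function name `reduce_collinearity`, the threshold `r_max`, and the example data are all hypothetical.

```python
import numpy as np

def reduce_collinearity(X, names, r_max=0.9, keep=()):
    """Greedy filter: retain a variable only if |r| <= r_max with all retained ones.

    Variables named in `keep` are always retained and are visited first, so that
    correlated alternatives are dropped in their favour.
    """
    X = np.asarray(X, dtype=float)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    # Protected variables come first; the sort is stable, so ties keep input order
    order = sorted(range(len(names)), key=lambda j: names[j] not in keep)
    selected = []
    for j in order:
        if names[j] in keep or all(corr[j, k] <= r_max for k in selected):
            selected.append(j)
    return [names[j] for j in sorted(selected)]

# Hypothetical descriptor matrix: logS is almost a linear function of logP
X = np.array([[0.0, 0.1, 1.0],
              [1.0, -1.0, -1.0],
              [2.0, -2.1, 0.5],
              [3.0, -3.0, -0.5],
              [4.0, -3.9, 0.2]])
names = ["logP", "logS", "LUMO"]
print(reduce_collinearity(X, names))                  # logS is dropped
print(reduce_collinearity(X, names, keep=("logS",)))  # logP is dropped instead
```

The `keep` argument mirrors the option described above of preserving variables considered important for describing biological activity.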
The second algorithm reported in Chapter 9 represents an extended implementation of the
best-subsets regression to large data sets. An option is provided to include in the same
model only those variables that are intercorrelated below a given value of the
intercorrelation coefficient R. The algorithm was used in the publications of Lessigiarska
et al. (2004a), Lessigiarska et al. (2004b), Netzeva et al. (2005), and Lessigiarska et al.
(2006).
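A best-subsets search with an intercorrelation limit can be sketched as below. This is an illustrative, exhaustive implementation and not the extended algorithm of Chapter 9; the function name `best_subsets`, its parameters, and the example data are assumptions introduced for the example. Subsets are ranked by R2 only, whereas a practical implementation would normally also consider cross-validated statistics.

```python
from itertools import combinations
import numpy as np

def best_subsets(X, y, names, size, r_max=0.7, top=3):
    """Rank all descriptor subsets of a given size by R2.

    A subset is skipped if any pair of its descriptors has |r| > r_max.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    ss_total = np.sum((y - y.mean()) ** 2)
    results = []
    for subset in combinations(range(X.shape[1]), size):
        if any(corr[i, j] > r_max for i, j in combinations(subset, 2)):
            continue
        # Least-squares fit with an intercept column
        A = np.column_stack([X[:, list(subset)], np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = np.sum((y - A @ coef) ** 2)
        results.append((1.0 - rss / ss_total, [names[j] for j in subset]))
    results.sort(key=lambda item: item[0], reverse=True)
    return results[:top]

# Hypothetical data: the response depends on logP and LUMO; logS mirrors logP
X = np.array([[0.0, 0.1, 1.0],
              [1.0, -1.0, -1.0],
              [2.0, -2.1, 0.5],
              [3.0, -3.0, -0.5],
              [4.0, -3.9, 0.2]])
y = 2.0 * X[:, 0] - X[:, 2] + 1.0
models = best_subsets(X, y, ["logP", "logS", "LUMO"], size=2, r_max=0.9)
for r2, subset in models:
    print(f"{r2:.3f}  {subset}")
```

Because the number of subsets grows combinatorially with the number of descriptors, an exhaustive search of this kind is practical only for moderate descriptor pools or small subset sizes.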
In Chapters 10 and 11 the QSAR modelling of chemical penetration through the BBB is
reported. The work in Chapter 10 is based on two data sets containing compounds
transported by different mechanisms, including passive diffusion and active transport.
The first data set is taken from a study funded by ECVAM, and contains data for both in
vivo BBB permeability, and permeability through several in vitro membrane models of
the BBB. The permeability through the membrane models was compared to the in vivo
BBB permeability to assess which membrane model was most similar to the BBB in vivo.
A novel aspect introduced by the current study was the development of models that
describe the in vivo BBB penetration by combining in vitro membrane penetration and
structural descriptors (QSAAR models), thus integrating in vitro modelling with QSAR
analysis. Another result of the study was that it confirmed the importance of compound
lipophilicity and H-bonding ability for the membrane transport, which had already been
reported in the literature. However, the models for in vivo BBB penetration of compounds
transported by different mechanisms did not include logPoct. This descriptor was found to
better reflect the penetration of compounds transported by passive diffusion.
The second data set investigated in Chapter 10 was obtained from the literature (Platts et
al., 2001). Regression and classification QSAR models were developed using structural
descriptors of lipophilicity, H-bonding and size/shape (logPoct, the number of H-bond
donors/acceptors, the number of 6-membered rings). A simple rule for compound
classification as low or high BBB penetrators was suggested, stating that if for a
compound the sum of H-bond donor and acceptor atoms is smaller than its logPoct value
minus one, it is likely to have higher concentration in the brain than in the blood at steady
state (high penetrator), and vice versa.
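The classification rule stated above can be written as a one-line test. The sketch below is illustrative only: the function name is hypothetical, the two example compounds are invented, and the scheme used to count H-bond donor and acceptor atoms is assumed to be the one applied in Chapter 10.

```python
def is_high_bbb_penetrator(log_p_oct, n_h_donors, n_h_acceptors):
    """Return True if (H-bond donors + acceptors) < logPoct - 1."""
    return (n_h_donors + n_h_acceptors) < (log_p_oct - 1.0)

# Hypothetical compounds, for illustration only
print(is_high_bbb_penetrator(3.5, 1, 1))  # 2 < 2.5: predicted high penetrator
print(is_high_bbb_penetrator(0.5, 2, 3))  # 5 < -0.5 is false: predicted low penetrator
```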
The work presented in Chapter 11 adds to that presented in Chapter 10 by focussing on
structural features influencing transport across the BBB of compounds known to inhibit
one of the efflux systems of the BBB, namely P-glycoprotein (P-gp). The compounds
included were imipramine and phenothiazine derivatives. The BBB penetration of these
compounds was investigated in order to identify common 3D structural characteristics
related to the mechanism of transport across the BBB, involving passive diffusion and P-
gp interactions. It was shown that the compounds with highest BBB penetration possess a
similar specific profile of two clearly defined hydrophobic centres and one hydrophilic
(H-bonding) centre arranged in a particular spatial configuration. The present study
emphasised the importance of the 3D-distribution of the hydrophobicity of phenothiazines
and imipramines for the BBB penetration, rather than the whole-molecule lipophilicity
characterised by logPoct. This result was observed in relation to the BBB penetration of
these compounds for the first time in the present study.
The next three chapters (Chapters 12-14) report analysis of acute toxic and cytotoxic
effects. Toxicity to a broad range of biological systems was investigated, including
bacteria, algae, isolated human and rodent cells, and in vivo toxicity to Daphnia, fish,
rodents, and humans. A variety of statistical techniques were applied to investigate
similarities between toxic effects for these systems, including correlations between the
investigated endpoints, PCA, and cluster analysis. Baseline toxicity effects were sought.
Separate QSARs were obtained for compounds acting by different mechanisms of toxic
action or belonging to different chemical types. Additionally, QSARs were developed by
combining different chemical classes and mechanisms of action in the same model.
Chapter 12 reports an investigation of toxicity to the bacterium Sinorhizobium meliloti. A
large data set of 133 compounds, taken from the literature (Botsford, 2002), was used to
develop mechanistically-based and chemical class-based QSARs. A baseline toxicity
effect attributed to the non-polar narcotics was observed. However, some non-polar
narcotics had toxicity different from that defined by the baseline. Explanations for this
observation were sought by evaluation of the quality of the toxicity data. Interestingly,
logPoct was not among the descriptors that were best correlated with toxicity of the polar
narcotics. The descriptors that were related to the toxicity of these compounds accounted
for polarisability, electrostatic properties and H-bonding interactions. A QSAR with
reasonable statistical parameters was developed for the aliphatic compounds in the data
set, but no such QSAR could be obtained for the group of aromatic compounds. The
model for the aliphatic compounds included logPoct, encoding compound biouptake
and/or membrane interactions, and LUMO, related to electrophilic reactivity. The other
descriptors included in the model encoded geometrical factors (MSA), and indicators of
specific interactions/reactivity or H-bonding (NNH2). A global QSAR model for all
investigated compounds was derived, which included well-understood molecular
descriptors (logPoct, LUMO, NOH, NNH2), but had low statistical fit.
In Chapter 13 an investigation of environmental toxic effects is reported, involving five
aquatic species (two algal species, Daphnia, and two fish species). The data were taken
from the New Chemicals Database (NCD) of the European Union. These data are
submitted by industry as a part of the notification process for each new chemical
substance that is manufactured in (or imported into) the European Union. Interspecies
correlations were developed to reveal similarities in toxicity between the species. A
strong correlation was observed between the inhibition of algal growth and algal growth
rate. Again, baseline toxicity effects for the non-polar narcotics were investigated.
Although general trends for linear toxicity/logPoct relationships were observed for the
non-polar narcotics, with most of the remaining compounds being placed above the
baselines, the derived relationships did not have high statistical fits. Also, QSARs with
moderate statistical fit were obtained only for the alga S. subspicatus and the fish O.
mykiss. They were based on mechanistically transparent structural descriptors, including
logPoct, LUMO, and the number of phenyl rings.
Classification QSAR models were also developed for the data taken from the NCD, by
applying the EU classification scheme into very toxic, toxic, harmful, and non-toxic
compounds. In general, the models classified correctly approximately 55% of the
compounds. Additionally, a two-group classification (defined by the author) into
dangerous and non-toxic compounds was investigated, with the group of dangerous
compounds being formed by uniting the groups of very toxic, toxic, and harmful
compounds. The two-group models classified between 70% and 83% of the compounds
correctly. All classification models included
logPoct as a discriminating variable. A logPoct cut-off value of approximately 2 was
observed for discrimination between dangerous and non-toxic compounds. Additionally,
electronic features (HOMO, LUMO, energy difference between LUMO and HOMO), H-
bond donor ability, topological properties and factors encoding molecular volume, shape,
and size appeared to influence the assignment of compounds to a given toxicity group. The
models were evaluated in relation to biological data.
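The relation between the four-class and the two-group percentages of correct classification can be illustrated with a small sketch. The class labels follow the text above; the function names and the four example compounds are hypothetical and are not taken from the NCD.

```python
DANGEROUS = {"very toxic", "toxic", "harmful"}

def to_two_groups(eu_class):
    """Collapse the four EU classes into 'dangerous' or 'non-toxic'."""
    return "dangerous" if eu_class in DANGEROUS else "non-toxic"

def percent_correct(observed, predicted):
    """Percentage of compounds whose predicted class equals the observed class."""
    hits = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * hits / len(observed)

# Hypothetical observed and predicted classes for four compounds
observed = ["very toxic", "toxic", "harmful", "non-toxic"]
predicted = ["toxic", "toxic", "non-toxic", "non-toxic"]

four_class = percent_correct(observed, predicted)
two_group = percent_correct([to_two_groups(c) for c in observed],
                            [to_two_groups(c) for c in predicted])
print(four_class, two_group)  # 50.0 75.0
```

Confusions between neighbouring dangerous classes no longer count as errors after merging, which is one reason why a two-group scheme can be expected to give a higher percentage of correct classifications than a four-class scheme.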
In Chapter 14 a QSAR analysis of toxicity to a broad range of biological species,
including bacterial strains, rodent and human cell lines, rodents and humans, is presented.
The data were taken from the MEIC and MEMO programmes (Bondesson et al., 1989;
Ekwall et al., 1998a). The toxicity endpoints were compared with each other by using
PCA and cluster analysis. Results from the PCA indicated that the toxicities are
determined mainly by a single factor, possibly related to cell toxicity, with other factors
being less influential. The analyses showed that the route of toxicant administration (and
subsequent toxicokinetics) might determine similarities in the toxic effects between the
different species. Thus, the oral lethal doses for humans were related to the oral
lethal doses to rat and mouse, whereas the toxic concentrations in human blood/serum
were related to IC50 values of human liver cells which are exposed in vivo to the chemical
blood concentrations.
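How loadings, scores, and a correlation-based distance between endpoints can be obtained is sketched below with NumPy. The eight rows are an illustrative subset of Table 14.1 (chemicals with no missing values), not the set used for the analysis in Chapter 14. Autoscaling of the endpoints and the distance 1 - r are assumptions made for the example, and the statistical package used in the thesis is not reproduced.

```python
import numpy as np

endpoints = ["HAP", "HLD", "Rat LD50", "Mouse LD50",
             "Human liver", "Rat hepatocytes", "S. meliloti", "Biotox"]
# Rows: digoxin, methanol, ethanol, phenol, nicotine, lindane,
# pentachlorophenol, caffeine (values copied from Table 14.1)
toxicity = np.array([
    [4.04, 3.77, 1.44, 1.64, 5.52, 0.57, 0.64, -0.16],
    [-2.07, -1.69, -2.24, -2.36, -2.66, -2.96, -3.33, -2.90],
    [-2.26, -2.01, -2.19, -1.87, -2.50, -2.65, -3.22, -2.63],
    [0.07, -0.22, -0.53, -0.46, -0.69, 0.10, -1.11, -0.15],
    [1.08, 2.36, 0.51, 1.69, -0.06, -0.55, -0.79, -0.43],
    [2.35, 0.08, 0.58, 0.82, 2.83, 0.84, 0.85, 0.68],
    [0.53, 0.97, 0.99, 0.87, 1.93, 1.30, 3.27, 1.70],
    [0.04, 0.14, 0.00, 0.18, -0.97, -0.20, -1.28, -1.04],
])

# Autoscale each endpoint, then decompose the scaled matrix by SVD
Z = (toxicity - toxicity.mean(axis=0)) / toxicity.std(axis=0, ddof=1)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

explained = S ** 2 / np.sum(S ** 2)  # fraction of variance per component
loadings = Vt.T                      # rows: endpoints, columns: components
scores = U * S                       # rows: chemicals, columns: components

# Correlation-coefficient distance between endpoints, assumed here to be 1 - r
distance = 1.0 - np.corrcoef(toxicity, rowvar=False)

print("Variance explained by the first two components:", explained[:2])
for name, row in zip(endpoints, loadings):
    print(f"{name:16s} {row[0]: .3f} {row[1]: .3f}")
```

A first component that explains most of the variance, with loadings of the same sign for all endpoints, would correspond to the single dominant factor described above.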
The study explored an approach to predict in vivo human toxicity by combining other in
vivo or in vitro data and descriptors of chemical structure (QSAAR analysis). The data for
human in vivo toxicity correlated with rodent toxicity in vivo and human in vitro liver cell
toxicity. The addition of descriptors of chemical structure moderately improved the
statistical fit of the models, compared to the equations based on toxicity endpoints alone.
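As a worked illustration of the effect of adding a structural descriptor, the sketch below evaluates Equations 1 and 2 of Table 14.3(a) for chloramphenicol. The coefficients are those listed in the table and the input values are taken from Tables 14.1 and 14.2(a); the function names are hypothetical. A single compound is shown for illustration only and says nothing about the overall performance of the models.

```python
def hld_from_mouse(log_inv_mouse_ld50):
    """Equation 1, Table 14.3(a): model based on mouse LD50 alone."""
    return 0.869 * log_inv_mouse_ld50

def hld_qsaar(log_inv_mouse_ld50, n_h_donors):
    """Equation 2, Table 14.3(a): mouse LD50 plus the NH-donors descriptor."""
    return 0.796 * log_inv_mouse_ld50 + 0.312 * n_h_donors - 0.346

# Chloramphenicol: log(1/mouse LD50) = -0.67 and observed log(1/HLD) = 0.05
# (Table 14.1); NH-donors = 3 (Table 14.2a)
eq1 = hld_from_mouse(-0.67)
eq2 = hld_qsaar(-0.67, 3)
print(f"Equation 1: {eq1:.2f}, Equation 2: {eq2:.2f}, observed: 0.05")
```

For this compound the endpoint-only model gives about -0.58, whereas the QSAAR gives about 0.06, which is close to the observed value of 0.05.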
QSAR models for the investigated endpoints were developed based on descriptors of
molecular hydrophobicity, size and shape, and electronic properties. The QSARs for in
vitro toxicity included descriptors for hydrophobicity (logPoct, logSaq) and compound
reactivity (LmH, LUMO), which confirms the importance of these properties for toxicity
to in vitro systems as previously reported in the literature.
15.3. Some reflections on the quality of QSARs
Some requirements exist for the development of QSARs to assure good quality of the
models and their correct application.
A main consideration is related to the quality of the biological data used for modelling.
Criteria for the selection of high quality biological data for QSAR development are
provided by Cronin and Schultz (2003) and Schultz and Cronin (2003). Criteria include
the principle that high quality data are derived from the same endpoint and protocol, and
ideally should be measured in the same laboratory. The investigated substances should
have high purity. Heterogeneous data, obtained in
different laboratories, are often of insufficient quality for QSAR modelling, as shown in
Chapter 13 concerning the toxicity data set taken from the NCD. Additionally, as
discussed in the report of QSARs for toxicity to S. meliloti (Chapter 12), even if the data
are obtained in the same laboratory, high experimental variability can be observed,
possibly due to some characteristics of the experimental protocol. However, in practice it
is often difficult to obtain high quality biological data, so researchers are faced
with the need to use available data of lower quality. In the opinion of the author of this
thesis, QSARs based on such data can be useful to assess mechanistic factors related to
biological effects, and even to predict biological activity, but the obtained values should
be considered tentative, taking into account the error in the biological data.
The ability to develop QSARs is based on the principle that quantitative differences in
biological effect of a series of compounds can be related to quantitative differences in
certain features of their chemical structure. To apply this theoretical
consideration successfully, the same type of chemical features should determine the biological effect
for all compounds in the series used to develop a given QSAR. In more general terms this
is understood as a prerequisite to develop QSARs based on series of compounds with the
same mechanism of action. However, in certain cases QSARs including compounds
acting by different mechanisms might be also successful. An example is the QSAR
developed for the aliphatic compounds of the data set including toxicity to S. meliloti
(Chapter 12). In such cases it is presumed that the descriptors included in the QSAR
account for each mechanism of action.
Additionally, QSARs are sometimes based on compounds from the same chemical class,
on the assumption that similar structural features determine the biological action. Including
diverse compounds in a QSAR might be unsuccessful. Thus, the complexity and diversity of the
molecular structures of the chemicals included in the NCD might be a reason for the
moderate statistical quality of the derived baseline relationships and QSAR models
described in Chapter 13.
The development of reliable QSARs should be based on descriptors that are considered to
be relevant for the biological effect. The choice of such descriptors is a difficult task, and
could be based on a priori knowledge about the biological effect and the mechanism of
action. However, one of the purposes of QSAR analysis is to test hypotheses and to
provide such knowledge, so descriptors that result in a high statistical fit to the investigated
endpoint might be considered relevant for the biological effect. When applying this
approach, the researcher should be careful not to mistake chance numerical correlations
for real causality, and should be aware that apparently causal descriptors may in reality
not determine the biological effect, but simply be correlated with the true causal factor. As
argued by Cronin (2002), it is better to include in the QSAR model descriptors that have
physicochemical and/or biological relevance even if they result in a model with worse
statistics than one based on other variables. In this project, a balance was sought, by
selecting models with good statistical fit, but containing variables that would allow for
some biological interpretation.
To apply QSARs correctly to the prediction of new compounds, the applicability
domains of the models should be respected. The applicability domain of a model includes
the range of chemicals with particular structural features, to which the model can be
applied. Generally, the applicability domain can be defined on the basis of chemical class,
on the basis of the ranges of structural descriptors for the compounds included in the
model, or on the basis of the range of mechanisms of action covered by the model.
Usually, developing QSARs for a particular chemical class or mechanism of
action increases the statistical quality of the model, but narrows the range of its
applicability. Thus, there is a trade-off between the model quality and its applicability
domain. As a particular example, 3D QSARs have relatively restricted applicability
domains, as they are usually developed for compounds of a particular chemical class
interacting with the same macromolecules.
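The descriptor-range definition of an applicability domain can be illustrated with a minimal sketch. The descriptor names and training values below are hypothetical and are not taken from any model in this thesis; the check only tests whether every descriptor of a new compound lies within the range spanned by the training set, which is one of several possible domain definitions.

```python
from typing import Dict, List, Tuple

Descriptors = Dict[str, float]


def descriptor_ranges(training_set: List[Descriptors]) -> Dict[str, Tuple[float, float]]:
    """Return the (min, max) range of every descriptor in the training set."""
    names = training_set[0].keys()
    return {
        name: (min(c[name] for c in training_set), max(c[name] for c in training_set))
        for name in names
    }


def in_domain(compound: Descriptors, ranges: Dict[str, Tuple[float, float]]) -> bool:
    """True if every descriptor of the compound lies inside the training range."""
    return all(low <= compound[name] <= high for name, (low, high) in ranges.items())


# Hypothetical training compounds (illustrative values only).
training = [
    {"logP": 0.5, "LUMO": 1.2},
    {"logP": 2.1, "LUMO": 0.4},
    {"logP": 3.8, "LUMO": -0.3},
]
ranges = descriptor_ranges(training)

print(in_domain({"logP": 1.0, "LUMO": 0.0}, ranges))  # both descriptors inside the ranges
print(in_domain({"logP": 5.2, "LUMO": 0.0}, ranges))  # logP outside the training range
```

A prediction for a compound that fails such a check would be an extrapolation and should be treated with corresponding caution.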
15.4. Perspectives on the use of the QSARs in integrated testing strategies
The QSARs developed in this project can contribute to the development of integrated
strategies based on non-animal methods for hazard and risk assessment. Two main
elements of such a contribution can be defined, namely the application of the
QSARs for the assessment of brain uptake, i.e. assessment of the availability of a
compound to act on the central nervous system, and the application of the QSARs to assess
toxic effects.
The models developed in this project for brain uptake (Chapter 10) suggest that efforts to
describe the in vivo BBB penetration using membrane models in vitro in combination
with QSAR analysis might be successful. However, the QSARs developed without using
an in vitro endpoint had similar statistics. The classification model for low and high
penetrators (Chapter 10) can be used in integrated strategies to assess whether the
compound will be distributed mainly in the brain or not. The model represents a simple
relationship between the logPoct value and the sum of H-bond donor and acceptor atoms for a
compound, which is easy to apply. It states that if the sum of H-bond donors and
acceptors for a compound is smaller than its logPoct value minus one, the compound is
most likely to have higher distribution in the brain than in the blood.
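The rule stated above is simple enough to be written as a one-line test. The following is a minimal sketch of that rule only; the function name and the example compounds are hypothetical, and the strict inequality follows the wording "smaller than" in the text.

```python
def is_high_bbb_penetrator(log_p_oct: float, h_donors: int, h_acceptors: int) -> bool:
    """Apply the rule stated in the text: a compound is a likely high penetrator
    if (H-bond donors + H-bond acceptors) < logPoct - 1."""
    return (h_donors + h_acceptors) < (log_p_oct - 1.0)


# Hypothetical compounds: (description, logPoct, H-bond donors, H-bond acceptors).
examples = [
    ("lipophilic, few H-bonding atoms", 3.5, 1, 1),
    ("polar, many H-bonding atoms", 1.0, 2, 3),
]
for name, log_p, donors, acceptors in examples:
    label = "high" if is_high_bbb_penetrator(log_p, donors, acceptors) else "low"
    print(f"{name}: {label} BBB penetrator")
```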
The 3D QSAR models developed in the project (Chapter 11) can be used to identify and
assess possible specific BBB interactions related to P-gp. However, a limitation of such
models is that they are applicable only to the specific chemical classes on which they
are based.
In parallel with the evaluation of compound biouptake, an appraisal of compound toxicity
is also useful in integrated strategies. The development of alternative methods for toxicity
testing is a complex problem, mainly because toxic effects can be
different and specific for different biological systems, and different compounds can affect
the same system by different mechanisms of toxicity.
The application of alternative methods is based on an extrapolation of results obtained for
less complex systems (in vitro toxicity) to the more complex in vivo systems. This implies
an inevitable uncertainty in the prediction. In this connection, an interesting result was
obtained for the interspecies correlations of the two fish and two algal species
investigated in this project (Chapter 13). Toxicity to the fish O. mykiss correlated better
with toxicity to the alga S. subspicatus (n = 77, R2 = 0.574) than with toxicity to
the alga S. capricornutum (n = 109, R2 = 0.246), while for the fish B. rerio the opposite
was observed: toxicity to B. rerio was well correlated with S.
capricornutum toxicity (n = 21, R2 = 0.674), with a slope of 1, and poorly correlated with S.
subspicatus toxicity (n = 41, R2 = 0.282). This suggests that the toxic effect in a given
fish species is similar to that in one particular algal species but not to that in the
other. An explanation of this result is currently lacking.
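The statistics quoted for the interspecies correlations (slope and R2) are those of a simple linear regression between paired toxicity values. A minimal sketch of such a calculation is given below; the paired values are hypothetical and do not reproduce the data of Chapter 13.

```python
from typing import List, Tuple


def linear_fit(x: List[float], y: List[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, R2), where R2 is the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


# Hypothetical paired toxicity values (log units) for the same four compounds.
algal_toxicity = [1.0, 2.0, 3.0, 4.0]
fish_toxicity = [1.2, 1.9, 3.2, 3.9]

slope, intercept, r2 = linear_fit(algal_toxicity, fish_toxicity)
print(f"n = {len(algal_toxicity)}, slope = {slope:.2f}, R2 = {r2:.3f}")
```

A slope close to 1 together with a high R2 would indicate that the two species respond similarly across the compounds compared.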
QSARs for toxicity are usually related to a particular biological system. Therefore,
models to predict toxicity to different biological systems of interest for hazard evaluation
and risk assessment are needed. Such systems include indicators for environmental
toxicity (bacteria, aquatic species), rodent and human cell lines, and rodents and humans
in vivo.
The development of QSARs for toxicity to simple biological systems such as bacteria is
relatively straightforward, as the factors that influence the toxic effect are less complex
and easier to define. Thus, as outlined by the response-surface approach (for example
Cronin et al., 2002a) descriptors encoding compound membrane interactions and
transport (logPoct) and reactivity (LUMO, LmH) are suitable in QSARs for toxicity to
such systems. This was confirmed by the work reported in this thesis. Such QSARs are
suitable for the assessment of toxicity at the lower biological level of cells. However, the
toxic effect in vivo is determined by a complex combination of factors, related to
biokinetics and cell toxicity. QSARs for toxicity in vivo to particular species can be
developed, with the presumption that they will contain descriptors accounting for the
range of factors influencing the toxicity. However, due to the complexity of these factors
and difficulty in modelling biokinetics, such models usually do not have high statistical
quality and/or are difficult to interpret from a mechanistic point of view (as is the case
with the models for in vivo toxicity described in Chapters 13 and 14).
Another aspect that should be considered when developing and applying QSARs for
toxicity is related to the mechanism of toxic action of compounds. A number of
experimental methods are available to assess mechanisms of toxic action. In addition,
considerations on the basis of structural criteria are broadly used, due to the easy
application of this approach. Except for the simplest compounds (containing, for
example, unbranched structures and simple functional groups, like alcohols), the
assessment of mechanism has a number of limitations. These are related to the complexity
of chemical structures, which may elicit a particular toxic effect by multiple mechanisms,
and to the diversity of biological systems. Thus, a compound might act in a specific way
on biological macromolecules present in some biological systems, and cause a different
toxic effect by another, non-specific mechanism in other biological systems. However, in
practice this is very difficult to determine.
It is generally accepted that QSARs should ideally be associated with a particular
mechanism of toxic action. Usually, QSARs that include diverse compounds acting by
different mechanisms are not of high statistical quality. An example of this is the QSAR
model developed for toxicity to the bacterium S. meliloti including all investigated
compounds (Chapter 12).
Despite the above-mentioned limitations of alternative methods and QSARs, valuable
information can be obtained from the models. QSAR models considered to be of sufficient
quality can be integrated in testing strategies together with other alternative methods. In
this project, a QSAR obtained for toxicity of aliphatic compounds to S. meliloti can be
used to assess possible cell toxicity. It includes compounds acting by non-polar narcosis,
electrophilic reactivity, and specific mechanisms. The model has reasonable statistical
parameters. However, its applicability domain is restricted to aliphatic compounds acting
by non-polar narcosis, electrophilic reactivity, and specific mechanisms.
Other QSAR models with good statistics, which can be used to assess cell toxicity, are the
models developed for toxicity to V. fischeri and S. meliloti in Chapter 14 of the thesis. In
other words, bacterial toxicity can be used as an indicator of cell toxicity in general.
In order to contribute to testing strategies for assessment of environmental hazard,
QSARs for aquatic species were sought in the project. However, reasonable models could
be obtained for only two of the five investigated species – the alga S. subspicatus and the
fish O. mykiss. The QSARs developed for the simpler organism, the alga S. subspicatus,
had reasonable statistical fits (R2 between 0.601 and 0.709) and could therefore be
incorporated in testing strategies for environmental hazard. The QSAR for toxicity to the
fish O. mykiss was of low quality (R2 = 0.417), and, although it included well-understood
structural descriptors (logPoct and LUMO), should be used with caution, if at all. As
mentioned above, the complexity of the factors determining in vivo toxicity might be one
reason for the poor statistical fit of this model.
Classification QSAR models were obtained for toxicity to the aquatic species by
applying the EU classification of very toxic, toxic, harmful, and non-toxic compounds.
However, the models correctly classified only about 55% of the compounds.
Thus, another type of classification was tried, namely dangerous and non-toxic. The
percentage of correct classifications obtained with these models was between
70% and 83%. The classification models included simple molecular descriptors and are
easily applied to predict the toxicity group of new compounds. By setting the a
priori classification probabilities to be proportional to the group sizes, or to be equal in
the two groups, different accuracies of the within-group classification can be obtained. A
priori probabilities proportional to the group sizes favour more accurate classification of
the dangerous compounds, at the expense of the accuracy of classification of the non-toxic
compounds. This might serve the needs of hazard assessment, where the
over-classification of a non-toxic compound as dangerous might be considered preferable to
the classification of a dangerous compound as non-toxic.
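The effect of the a priori probabilities can be illustrated with a minimal sketch. It assumes a discriminant-type classifier with a single descriptor and a common within-group standard deviation; the group means, group sizes and descriptor value are hypothetical and are not taken from the models of this project.

```python
import math
from typing import Dict


def classify(x: float, means: Dict[str, float], sd: float, priors: Dict[str, float]) -> str:
    """Return the group with the highest prior-weighted Gaussian log-density at x.

    A common standard deviation is assumed for all groups, so the normalising
    constant of the density cancels and can be omitted.
    """
    def score(group: str) -> float:
        z = (x - means[group]) / sd
        return math.log(priors[group]) - 0.5 * z * z

    return max(means, key=score)


# Hypothetical group means of a single descriptor and hypothetical group sizes.
means = {"dangerous": 2.0, "non-toxic": 0.0}
sizes = {"dangerous": 80, "non-toxic": 20}

equal_priors = {group: 1.0 / len(sizes) for group in sizes}
proportional_priors = {group: n / sum(sizes.values()) for group, n in sizes.items()}

borderline = 0.8  # descriptor value of a hypothetical borderline compound
print(classify(borderline, means, 1.0, equal_priors))
print(classify(borderline, means, 1.0, proportional_priors))
```

In this sketch the borderline compound is assigned to the non-toxic group when the priors are equal, and to the larger dangerous group when the priors are proportional to the group sizes, which is the shift towards over-classification described above.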
Other results from the project, which could contribute to the development of testing strategies
for human health effects, are described in Chapter 14. Models for predicting human
toxicity in vivo based on combinations of in vitro data and molecular descriptors (QSAAR
analysis) were developed. Human peak blood/serum LC50 concentrations (HAP) were
described by a combination of human liver cell toxicity and descriptors related to
electronic properties. Additionally, QSAR models for HAP were obtained, which had
slightly better statistical parameters than the QSAAR model, without the need to include an
in vitro endpoint. A reasonable model to assess the in vivo oral human lethal doses
(HLD), combining an in vitro endpoint and descriptors of chemical structure, could not be
obtained. However, HLDs were correlated with combinations of in vivo mouse LD50
values and descriptors of H-bond donor ability and molecular aromaticity. Although
useful for the assessment of chemical hazard to humans, this model cannot be regarded as
an alternative to animal testing.
Other models which could be used as alternative methods to animal testing in integrated
strategies are the models developed for rat and mouse in vivo toxicity described in
Chapter 14. QSAAR models for rat in vivo toxicity were obtained; however, the QSAR
models for rat and mouse in vivo toxicity had better statistical fits than the QSAARs.
QSAR models with good statistics were obtained for in vitro toxicity to rat hepatocytes.
This biological system is closely related to the species of interest, and is therefore more
appropriate for inclusion in testing strategies for the assessment of chemical risk.
As a general conclusion, QSARs can provide useful information about the mechanism of
action of compounds and a potential toxic effect. They can be used as tools for
preliminary analysis or supplementary information in testing strategies in combination
with other methods. Their predictive quality and domain of applicability should be taken
into account.
References
Abraham, M.H. (1993). Scales of solute hydrogen-bonding – their construction and
application to physicochemical and biochemical processes. Chemical Society Reviews, 22,
73-83.
Abraham, R.J., and Haworth, I.S. (1988). A modification to the COSMIC
parameterisation using ab initio constrained potential functions.
Journal of Computer-Aided Molecular Design, 2, 125-135.
Abraham, M.H., and McGowan, J.C. (1987). The use of characteristic volumes to
measure cavity terms in reversed phase liquid chromatography. Chromatographia, 23,
243-246.
Abraham, M.H., and Platts, J.A. (2000). Physicochemical factors that influence brain
uptake. In: Begley, D.J., Bradbury, M.W., Kreuter, J. (Eds.), The Blood-Brain Barrier
and Drug Delivery to the CNS. Marcel Dekker, Inc, New York, Basel.
Abraham, M.H., and Whiting, G.S. (1992). Hydrogen bonding: XXI. Solvation
parameters for alkylaromatic hydrocarbons from gas-liquid chromatographic data.
Journal of Chromatography A, 594, 229-241.
Abraham, M.H., Chadha, H.S., Mitchell, R.C. (1994). Hydrogen bonding. 33. Factors that
influence the distribution of solutes between blood and brain. Journal of Pharmaceutical
Sciences, 83, 1257-1268.
Abraham, M.H., Chadha, H.S., Martins, F., Mitchell, R.C., Bradbury, M.W., Gratton, J.A.
(1999). Hydrogen bonding part 46: A review of the correlation and prediction of transport
properties by an LFER method: physicochemical properties, brain penetration and skin
permeability. Pesticide Science, 55, 78-88.
Abraham, M.H., Grellier, P.L., Prior, D.V., Duce, P.P., Morris, J.J., Taylor, P.J. (1989).
Hydrogen bonding. Part 7. A scale of solute hydrogen-bond acidity based on logK values
for complexation in tetrachloromethane. Journal of the Chemical Society, Perkin
Transactions 2, 699-711.
Abraham, M.H., Whiting, G.S., Doherty, R.M., Shuely, W.J. (1991). Hydrogen bonding:
XVI. A new solute solvation parameter, π2H, from gas chromatographic data. Journal of
Chromatography A, 587, 213-228.
Allinger, N.L. (1977). Conformational analysis 130. MM2. A hydrocarbon force field
utilizing V1 and V2 torsional terms. Journal of the American Chemical Society, 99, 8127-
8134.
Allinger, N.L., Yuh, Y.H., Lii, J-H. (1989). Molecular mechanics. The MM3 force field
for hydrocarbons. 1. Journal of the American Chemical Society, 111, 8551-8565.
Balaban, A.T. (1982). Highly discriminating distance-based topological index. Chemical
Physics Letters, 89, 399-404.
Balaž, S., and Lukacova, V. (2002). Subcellular pharmacokinetics and its potential for
library focusing. Journal of Molecular Graphics and Modelling, 20, 479-490.
Berry, W.D. (1993). Understanding Regression Assumptions. Series: Quantitative
Applications in the Social Sciences, 92. Thousand Oaks, CA, Sage Publications.
Boyd, E.M., Killham, K., Meharg, A.A. (2001). Toxicity of mono-, di- and tri-
chlorophenols to lux marked terrestrial bacteria, Burkholderia species Rasc c2 and
Pseudomonas fluorescens. Chemosphere, 43, 157-166.
Bondesson, I., Ekwall, B., Hellberg, S., Romert, L., Stenberg, K., Walum, E. (1989).
MEIC: a new international multicenter project to evaluate the relevance to human toxicity
of in vitro cytotoxicity tests. Cell Biology and Toxicology, 5, 331-348.
Bradbury, S.P., Henry, T.R., Niemi, G.J., Carlson, R.W., Snarski, V.M. (1989). Use of
respiratory-cardiovascular responses of rainbow trout (Salmo gairdneri) in identifying
acute toxicity syndromes in fish: Part 3: Polar narcotics. Environmental Toxicology and
Chemistry, 8, 247-261.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and
regression trees. Monterey, CA, Wadsworth & Brooks/Cole Advanced Books &
Software.
Broderius, S.J., Kahl, M.D., Hoglund, M.D. (1995). Use of joint toxic response to define
the primary mode of toxic action for diverse industrial organic chemicals. Environmental
Toxicology and Chemistry, 9, 1591-1605.
Bundy, J.G., Morriss, A.W.J., Durham, D.G., Campbell, C.D., Paton, G.I. (2001).
Development of QSARs to investigate the bacterial toxicity and biotransformation
potential of aromatic heterocyclic compounds. Chemosphere, 42, 885-892.
Burden, F.R. (2001). Quantitative structure-activity relationship studies using Gaussian
processes. Journal of Chemical Information and Computer Sciences, 41, 830-835.
Chapman, M.S., and Connolly, M.L. (2001). Molecular surfaces: calculations, uses and
representations. In: Rossmann, M.G., and Arnold, E. (Eds.), International Tables for
Crystallography, Vol. F: Crystallography of biological macromolecules, Chapter 22:
Molecular geometry and features, Chapter 22.1: Protein surfaces and volumes:
measurement and use. International Union of Crystallography, Chester, UK.
Available on internet (30 December 2004):
http://www.sb.fsu.edu/~chapman/Home/Publications/Chapman_&_Connolly_International_Tables_1999/Molecular_Surface_Calculations.html
Clark, D.E. (2003). In silico prediction of blood-brain barrier permeation. Drug Discovery
Today, 8, 927-933.
Clark, D.E. (1999). Rapid calculation of polar molecular surface area and its application
to the prediction of transport phenomena. 2. Prediction of blood-brain barrier penetration.
Journal of Pharmaceutical Sciences, 88, 815-821.
Cramer, R.D. III, Patterson, D.E., Bunce, J.D. (1988). Comparative molecular field
analysis (CoMFA). Effect of shape on binding of steroids to carrier proteins. Journal of
the American Chemical Society, 110, 5959-5967.
Crivori, P., Cruciani, G., Carrupt, P.A., Testa, B. (2000). Predicting blood-brain barrier
permeation from three-dimensional molecular structure. Journal of Medicinal Chemistry,
43, 2204-2216.
Cronin, M. (2002). ECVAM Workshop on The Use of Computer Models as Alternatives
to Animal Experiments in Chemical Risk Assessment, October 3-4, Praha, Czech
Republic.
Cronin, M.T.D., and Schultz, T.W. (2001). Development of quantitative structure-
activity relationships for the toxicity of aromatic compounds to T. pyriformis:
comparative assessment of the methodologies. Chemical Research in Toxicology, 14,
1284-1295.
Cronin, M.T.D., and Schultz, T.W. (2003). Pitfalls in QSAR. Journal of Molecular
Structure (Theochem), 622, 39-51.
Cronin, M.T.D., Aptula, A.O., Duffy, J.C., Netzeva, T.I., Rowe, P.H., Valkova, I.V.,
Schultz, T.W. (2002a). Comparative assessment of methods to develop QSARs for the
prediction of the toxicity of phenols to Tetrahymena pyriformis. Chemosphere, 49, 1201–
1221.
Cronin, M.T.D., Bowers, G.S., Sinks, G.D., Schultz, T.W. (2000). Structure-toxicity
relationships for aliphatic compounds encompassing a variety of mechanisms of toxic
action to V. fischeri. SAR and QSAR in Environmental Research, 11, 301-312.
Cronin, M.T.D., Dearden, J.C., Duffy, J.C., Edwards, R., Manga, N., Worth, A.P.,
Worgan, A.D.P. (2002b). The importance of hydrophobicity and electrophilicity
descriptors in mechanistically-based QSARs for toxicological endpoints. SAR and QSAR
in Environmental Research, 13, 167-176.
Cronin, M.T.D., Netzeva, T.I., Dearden, J.C., Edwards, R., Worgan, A.D.P. (2004).
Assessment and Modeling of the Toxicity of Organic Chemicals to Chlorella vulgaris:
Development of a Novel Database. Chemical Research in Toxicology, 17, 545-554.
Dearden, J. (2002). ECVAM Workshop on The Use of Computer Models as Alternatives
to Animal Experiments in Chemical Risk Assessment, October 3-4, Praha, Czech
Republic.
Dearden, J.C., Al-Noobi, A., Scott, A.C., Thomson, S.A. (2003). QSAR studies on P-
glycoprotein-regulated multidrug resistance and on its reversal by phenothiazines. SAR
and QSAR in Environmental Research, 14, 447-454.
Dearden, J.C., Bradburne, S.J.A., Cronin, M.T.D., Solanki, P. (1988). The physical
significance of molecular connectivity. In: Turner, J.E., England, M.W., Schultz, T.W.,
Kwaak, N.J. (Eds.), Proceedings of the Third International Workshop on Quantitative
Structure-Activity Relationships in Environmental Chemistry. US DOE, Oak Ridge, TN,
43-50.
Devillers, J., Chezeau, A., Thybaud, E., Rahmani, R. (2002a). QSAR modelling of the
adult and developmental toxicity of glycols, glycol ethers and xylenes to Hydra attenuata.
SAR and QSAR in Environmental Research, 13, 555-566.
Devillers, J., Chezeau, A., Thybaud, E. (2002b). PLS-QSAR of the adult and
developmental toxicity of chemicals to Hydra attenuata. SAR and QSAR in
Environmental Research, 13, 705-712.
Dewar, M.J.S., and Thiel, W. (1977). Ground states of molecules. 38. The MNDO
method. Approximation and parameters. Journal of the American Chemical Society, 99,
4899-4907.
Dewar, M.J.S., Zoebisch, E.G., Healy, E.F., Stewart, J.J.P. (1985). The development and
use of quantum-mechanical models. 76. AM1 – a new general-purpose quantum-
mechanical molecular model. Journal of the American Chemical Society, 107, 3902-3909.
Dimitrov, S.D., Mekenyan, O.G., Sinks, G.D., Schultz, T.W. (2003). Global modeling of
narcotic chemicals: ciliate and fish toxicity. Journal of Molecular Structure (Theochem),
622, 63-70.
Dimitrov, S.D., Mekenyan, O.G., Schultz, T.W. (2000). Interspecies modeling of narcotic
toxicity to aquatic animals. Bulletin of Environmental Contamination and Toxicology, 65,
399-406.
Di Marzio, W., Galassi, S., Todeschini, R., Consolaro, F. (2001). Traditional versus
WHIM molecular descriptors in QSAR approaches applied to fish toxicity studies.
Chemosphere, 44, 401-406.
EC (1967). Council Directive 67/548/EEC of 27 June 1967 on the approximation of laws,
regulations and administrative provisions relating to the classification, packaging and
labelling of dangerous substances, Official Journal of the European Community P 196,
16/08/1967.
EC (1979). Council Directive 79/831/EEC of 18 September 1979 amending for the sixth
time Directive 67/548/EEC on the approximation of the laws, regulations and
administrative provisions relating to the classification, packaging and labelling of
dangerous substances, Official Journal of the European Community L 259, 15/10/1979.
EC (1986). Council Directive 86/609/EEC of 24 November 1986 on the approximation of
laws, regulations and administrative provisions of the Member States regarding the
protection of animals used for experimental and other scientific purposes. Official Journal
of the European Community L358, 18.12.1986.
EC (2001a). Annex VI to Council Directive 67/548/EEC of 21 August 2001 on general
classification and labelling requirements for dangerous substances and preparations,
Official Journal of the European Community L 225/263, 21/08/2001.
EC (2001b). White Paper on a Strategy for a Future Chemicals Policy. Brussels:
Commission of the European Communities. Available on the internet:
http://europa.eu.int/comm/environment/chemicals/whitepaper.htm
EC (2003). Commission’s proposal COM 2003 0644 (03) of 29 October 2003 for a
Regulation of the European Parliament and of the Council concerning the Registration,
Evaluation, Authorisation and Restriction of Chemicals (REACH). Available on the
internet:
http://europa.eu.int/eur-lex/en/com/pdf/2003/com2003_0644en.html
Enache, M., Dearden, J.C., Walker, J.D. (2003). QSAR analysis of metal ion toxicity
data in sunflower callus cultures (Helianthus annuus “Sunspot”). QSAR and
Combinatorial Science, 22, 234-240.
Enache, M., Palit, P., Dearden, J.C., Lepp, N.W. (2000). Correlation of physico-chemical
parameters with toxicity of metal ions to plants. Pest Management Science, 56, 821-824.
Ekwall, B., Clemedson, C., Crafoord, B., Ekwall, B., Hallander, S., Walum, E.,
Bondesson, I. (1998a). MEIC evaluation of acute systemic toxicity. Part V. Rodent and
human toxicity data for the 50 reference chemicals. Alternatives to Laboratory Animals,
26, Suppl. 2, 571-616.
Ekwall, B., et al. (1998b). MEIC evaluation of acute systemic toxicity. Part VI. The
prediction of human toxicity by rodent LD50 values and results from 61 in vitro methods.
Alternatives to Laboratory Animals, 26, Suppl. 2, 617-658.
Faucon, J.C., Bureau, R., Faisant, J., Briens, F., Rault, S. (2001). Prediction of the
Daphnia acute toxicity from heterogeneous data. Chemosphere, 44, 407-422.
Feher, M., Sourial, E., Schmidt, J.M. (2000). A simple model for the prediction of
blood-brain partitioning. International Journal of Pharmaceutics, 201, 239-247.
Ford, J.M., Prozialeck, W.C., Hait, W.N. (1989). Structural features determining activity
of phenothiazines and related drugs for inhibition of cell growth and reversal of multidrug
resistance. Molecular Pharmacology, 35, 105-115.
Freidig, A.P., and Hermens, J.L.M. (2000). Narcosis and chemical reactivity QSARs for
acute fish toxicity. Quantitative Structure-Activity Relationships, 19, 547-553.
Gaillard, P.J., and De Boer, A.G. (2000). Relationship between permeability status of the
blood–brain barrier and in vitro permeability coefficient of a drug. European Journal of
Pharmaceutical Sciences, 12, 95–102.
Garberg, P. (2001). Prevalidation of in vitro models for the blood-brain barrier. ECVAM
internal report.
Geiss, K.T., and Frazier, J.M. (2001). QSAR modeling of oxidative stress in vitro
following hepatocyte exposures to halogenated methanes. Toxicology in Vitro, 15, 557-
563.
Gloor, S.M., Wachtel, M., Bolliger, M.F., Ishihara, H., Landmann, R., Frei, K. (2001).
Molecular and cellular permeability control at the blood–brain barrier. Brain Research
Reviews, 36, 258–264.
Gramatica, P., Vighi, M., Consolaro, F., Todeschini, R., Finizio, A., Faust, M. (2001).
QSAR approach for the selection of congeneric compounds with a similar toxicological
mode of action. Chemosphere, 42, 873-883.
Gratton, J.A., Abraham, M.H., Bradbury, M.W., Chadha, H.S. (1997). Molecular factors
influencing drug transfer across the blood-brain barrier. Journal of Pharmacy and
Pharmacology, 49, 1211-1216.
Grodnitzky, J.A., and Coats, J.R. (2002). QSAR evaluation of monoterpenoids’
insecticidal activity. Journal of Agricultural and Food Chemistry, 50, 4576-4580.
Ghose, A.K., and Crippen, G.M. (1986). Atomic physicochemical parameters for three-
dimensional structure-directed quantitative structure-activity relationships I. Partition
coefficients as a measure of hydrophobicity. Journal of Computational Chemistry, 7, 565-
577.
Hall, L.H., and Kier, L.B. (1991). The molecular connectivity chi indices and kappa shape
indices in structure-property modelling. In: Boyd, D.B., Lipkowitz, K. (Eds.), Reviews of
Computational Chemistry, Vol. 2.
Hall, L.H., Mohney, B.K., Kier, L.B. (1991). The electrotopological state: an atom index
for QSAR. Quantitative Structure-Activity Relationships, 10, 43.
Hansch, C., and Gao, H. (1997). Comparative QSAR: radical reactions of benzene
derivatives in chemistry and biology. Chemical Reviews, 97, 2995-3059.
Hansch, C., and Kurup, A. (2003). QSAR of chemical polarisability and nerve toxicity. 2.
Journal of Chemical Information and Computer Sciences, 43, 1647-1651.
Hansch, C., Kim, D., Leo, A.J., Novellino, E., Silipo, C., Vittoria, A. (1989). Toward a
quantitative comparative toxicology of organic compounds. Critical Reviews in
Toxicology, 19, 185–226.
Hansch, C., McKarns, S.C., Smith, C.J., Doolittle, D.J. (2000). Comparative QSAR
evidence for a free-radical mechanism of phenol-induced toxicity. Chemico-Biological
Interactions, 127, 61–72.
Hartigan, J.A., and Wong, M.A. (1978). Algorithm 136. A k-means clustering algorithm.
Applied Statistics, 28, 100-108.
Hermens, J.L.M. (1990). Electrophiles and acute toxicity to fish. Environmental Health
Perspectives, 87, 219-225.
Huang, H., Wang, X., Ou, W., Zhao, J., Shao, Y., Wang, L. (2003). Acute toxicity of
benzene derivatives to the tadpoles (Rana japonica) and QSAR analyses. Chemosphere,
53, 963-970.
Huuskonen, J. (2003). QSAR modeling with the electrotopological state indices:
predicting the toxicity of organic chemicals. Chemosphere, 20, 949-953.
Ivanciuc, O. (2000). QSAR comparative study of Wiener descriptors for weighted
molecular graphs. Journal of Chemical Information and Computer Sciences, 40, 1412-
1422.
Kaiser, H.F. (1960). The application of electronic computers to factor analysis.
Educational and Psychological Measurement, 20, 141-151.
Kamlet, M.J., and Taft, R.W. (1976). The solvatochromic comparison method. I. The
β-scale of solvent hydrogen-bond acceptor (HBA) basicities. Journal of the American
Chemical Society, 98, 377-383.
Kansy, M., and van de Waterbeemd, H. (1992). Hydrogen-bonding capacity and brain
penetration. Chimia, 46, 299-303.
Kahru, A., and Borchardt, B. (1994). Toxicity of 39 MEIC chemicals to bioluminescent
photobacteria (the Biotox™ test): correlation with other test systems. Alternatives to
Laboratory Animals, 22, 147-160.
Kapur, S., Shusterman, A., Verma, R.P., Hansch, C., Selassie, C.D. (2000). Toxicology of
benzyl alcohols: a QSAR analysis. Chemosphere, 41, 1643-1649.
Kelder, J., Grootenhuis, P.D., Bayada, D.M., Delbressine, L.P., Ploemen, J.P. (1999).
Polar molecular surface as a dominating determinant for oral absorption and brain
penetration of drugs. Pharmaceutical Research, 16, 1514-1519.
Keseru, G.M., and Molnar, L. (2001). High-throughput prediction of blood-brain
partitioning: a thermodynamic approach. Journal of Chemical Information and Computer
Sciences, 41, 120-128.
Kier, L.B. (1985). A shape index from molecular graphs. Quantitative Structure-Activity
Relationships, 4, 109-116.
Kier, L.B., and Hall, L.H. (1986). Molecular Connectivity in Structure-Activity Analysis.
Research Studies Press - Wiley, Chichester, UK.
Kier, L.B., and Hall, L.H. (1990). An electrotopological state index for atoms in
molecules. Pharmaceutical Research, 7, 801.
Kier, L.B., and Hall, L.H. (1999). Molecular Structure Description: The
Electrotopological State. Academic Press.
Klebe, G. (1998). Comparative molecular similarity indices: CoMSIA. In: Kubinyi, H.,
Folkers, G., Martin, Y.C. (Eds.), 3D QSAR in Drug Design. Kluwer Academic Publishers,
UK, 3-87.
Könemann, H. (1981). Quantitative structure-activity relationships in fish toxicity studies.
Part 1: Relationship for 50 industrial pollutants. Toxicology, 19, 209-221.
Kubinyi, H. (1993). QSAR in drug design. Theory, Methods and Application. Escom,
Leiden.
Lee, G., Dallas, S., Hong, M., Bendayan, R. (2001). Drug transporters in the central
nervous system: brain barriers and brain parenchyma considerations. Pharmacological
Reviews, 53, 569-596.
Lessigiarska, I., Cronin, M.T.D., Worth, A.P., Dearden, J.C., Netzeva, T.I. (2004a).
QSARs for toxicity to the bacterium Sinorhizobium meliloti. SAR and QSAR in
Environmental Research, 15, 169-190.
Lessigiarska, I., Pajeva, I., Cronin, M.T.D., Worth, A.P. (2005a). 3D QSAR investigation
of the blood-brain barrier penetration of chemical compounds. SAR and QSAR in
Environmental Research, 16, 79-91.
Lessigiarska, I., Worth, A.P., Netzeva, T.I. (2005b). Comparative review of QSARs for
acute toxicity. EUR report No. 21559 EN. EC Joint Research Centre, Ispra, Italy.
Lessigiarska, I., Worth, A.P., Netzeva, T.I., Dearden, J.C., Cronin, M.T.D. (2006).
Quantitative structure-activity-activity and quantitative structure-activity investigations of
human and rodent toxicity. Chemosphere. In press.
Lessigiarska, I., Worth, A.P., Sokull-Klüttgen, B., Jeram, S., Dearden, J.C., Netzeva, T.I.,
Cronin, M.T.D. (2004b). QSAR investigation of a large data set for fish, algae and
Daphnia toxicity. SAR and QSAR in Environmental Research, 15, 413-431.
Lin, Z., Yin, K., Shi, P., Wang, L., Yu, H. (2003a). Development of QSARs for
predicting the joint effects between cyanogenic toxicants and aldehydes. Chemical
Research in Toxicology, 16, 1365-1371.
Lin, Z., Zhong, P., Yin, K., Wang, L., Yu, H. (2003b). Quantification of joint effect for
hydrogen bond and development of QSARs for predicting mixture toxicity.
Chemosphere, 52, 1199-1208.
Lundquist, S., Renftel, M., Brillault, J., Fenart, L., Cecchelli, R., Dehouck, M-P. (2002).
Prediction of drug transport through blood-brain barrier in vivo: a comparison between
two in vitro cell models. Pharmaceutical Research, 19, 976-981.
Lipnick, R.L. (1991). Outliers: their origin and use in the classification of molecular
mechanisms of toxicity. Science of the Total Environment, 109/110, 131-153.
Liu, X., Wang, B., Huang, Z., Han, S., Wang, L. (2003a). Acute toxicity and quantitative
structure-activity relationships of alpha-branched phenylsulfonyl acetates to D. magna.
Chemosphere, 50, 403-408.
Liu, X., Wu, C., Han, S., Wang, L., Zhang, Z. (2001). The acute toxicity of alpha-
branched phenylsulfonyl acetates in V. fischeri test. Ecotoxicology and Environmental
Safety, 49, 240-244.
Liu, X., Yang, Z., Wang, L. (2003b). Three-dimensional quantitative structure–activity
relationship study for phenylsulfonyl carboxylates using CoMFA and CoMSIA.
Chemosphere, 53, 945-952.
Lombardo, F., Blake, J.F., Curatolo, W.J. (1996). Computation of brain-blood partitioning
of organic solutes via free energy calculations. Journal of Medicinal Chemistry, 39,
4750-4755.
Lu, G.H., Yuan, X., Zhao, Y.H. (2001). QSAR study on the toxicity of substituted
benzenes to the algae (Scenedesmus obliquus). Chemosphere, 44, 437-440.
Luco, J.M. (1999). Prediction of the brain-blood distribution of a large set of drugs from
structurally derived descriptors using partial least-squares (PLS) modelling. Journal of
Chemical Information and Computer Sciences, 39, 396-404.
MEIC data base: http://www.cctoxconsulting.a.se/ (accessed November 2004).
MolconnZ version 4.0, online help manual (Hall Associates Consulting). Available on the
internet (20 December 2004): http://www.eslc.vabiotech.com/molconn/manuals/400/
Moridani, M.Y., Siraki, A., O’Brien, P.J. (2003b). Quantitative structure toxicity
relationships for phenols in isolated rat hepatocytes. Chemico-Biological Interactions,
145, 213-223.
Mulliken, R.S. (1955). Electronic population analysis on LCAO-MO molecular wave
functions. I. Journal of Chemical Physics, 23, 1833-1840.
Morao, I., and Hillier, I.H. (2001). Magnetic analysis (NICS) of monoarylic cations.
Linear relationship between aromaticity and Hammett constants (σp+). Tetrahedron
Letters, 42, 4429–4431.
Morley, S.D., Abraham, R.J., Haworth, I.S., Jackson, D.E., Saunders, M.R., Vinter, J.G.
(1991). COSMIC(90): An improved molecular mechanics treatment of hydrocarbons and
conjugated systems. Journal of Computer-Aided Molecular Design, 5, 475-504.
Netzeva, T.I. (2003). Whole molecule and atom based, topological descriptors. In:
Cronin, M.T.D., and Livingstone, D. (Eds.), Predicting Chemical Toxicity and Fate.
Taylor and Francis, London, 61-83.
Netzeva, T.I., Aptula, A.O., Benfenati, E., Cronin, M.T.D., Gini, G., Lessigiarska, I.,
Maran, U., Vracko, M., Schüürmann, G. (2005). Description of the electronic structure of
organic chemicals using semiempirical and ab initio methods for development of
toxicological QSARs. Journal of Chemical Information and Modeling, 45, 106-114.
Netzeva, T.I., Dearden, J.C., Edwards, R., Worgan, A.D.P., Cronin, M.T.D. (2004).
QSAR analysis of the toxicity of aromatic compounds to Chlorella vulgaris in a novel
short-term assay. Journal of Chemical Information and Computer Sciences, 44, 258-265.
Norinder, U., and Osterberg, T. (2001). Theoretical calculation and prediction of drug
transport processes using simple parameters and partial least squares projections to latent
structures (PLS) statistics. The use of electrotopological state indices. Journal of
Pharmaceutical Sciences, 90, 1076-1085.
Ownby, D.R., and Newman, M.C. (2003). Advances in quantitative ion character-activity
relationships (QICARs): using metal-ligand binding characteristics to predict metal
toxicity. QSAR and Combinatorial Science, 22, 241-246.
Pajeva, I.K., and Wiese, M. (1998). Molecular modelling of phenothiazines and related
drugs as multidrug resistance modifiers: a comparative molecular field analysis study.
Journal of Medicinal Chemistry, 41, 1815-1826.
Pajeva, I.K., Wiese, M., Cordes, H.P., Seydel, J.K. (1996). Membrane interactions of
some catamphiphilic drugs and relation to their multidrug resistance reversing ability.
Journal of Cancer Research and Clinical Oncology, 122, 27-40.
Parkerton, T.F., and Konkel, W.J. (2000). Application of quantitative structure-activity
relationships for assessing the aquatic toxicity of phthalate esters. Ecotoxicology and
Environmental Safety, 45, 61-78.
Pedersen, F., de Bruijn, J., Munn, S., van Leeuwen, K. (2003). Assessment of additional
testing needs under REACH. Effects of QSARs, risk based testing and voluntary industry
initiatives. EUR Report No. 20863 EN. EC Joint Research Centre, Ispra, Italy.
Platts, J.A., Abraham, M.H., Zhao, Y.H., Hersey, A., Ijaz, L., Butina, D. (2001).
Correlation and prediction of a large blood–brain distribution data set — an LFER study.
European Journal of Medicinal Chemistry, 36, 719–730.
Pople, J.A., and Beveridge, D.L. (1970). Approximate Molecular Orbital Theory.
McGraw-Hill, New York.
Pople, J.A., and Segal, G.A. (1965). Approximate self-consistent molecular orbital theory
II. Calculations with complete neglect of differential overlap. Journal of Chemical
Physics, 43, S136-S149.
Pople, J.A., Santry, D.P., Segal, G.A. (1965). Approximate self-consistent molecular
orbital theory I. Invariant procedures. Journal of Chemical Physics, 43, S129-S135.
QsarIS version 1.1, help manual, then SciVision-Academic Press, San Diego, CA;
currently MDL®QSAR by Elsevier MDL, San Leandro, CA.
Radojnova, D., Tenekedjiev, K., Yordanov, Y. (2002). Stature estimation from long bone
length in Bulgarians. HOMO, 52, 221-232.
Ramu, A., and Ramu, N. (1992). Reversal of multidrug resistance by phenothiazines and
structurally related compounds. Cancer Chemotherapy and Pharmacology, 30, 165-173.
Randić, M. (2001). Novel shape descriptors for molecular graphs. Journal of Chemical
Information and Computer Sciences, 41, 607-613.
Rao, C.R. (1951). An asymptotic expansion of the distribution of Wilks’ criterion.
Bulletin of the International Statistical Institute, 33, 177-181.
Ren, S., and Frymier, P.D. (2002). Estimating the toxicities of organic chemicals to
bioluminescent bacteria and activated sludge. Water Research, 36, 4406-4414.
Roberts, D.W., and Costello, J.F. (2003). Mechanisms of action for general and polar
narcosis: A difference in dimension. QSAR and Combinatorial Science, 22, 226-233.
Rose, K., and Hall, L.H. (2003). E-state modelling of fish toxicity independent of 3D
structure information. SAR and QSAR in Environmental Research, 14, 113-129.
Rose, K., Hall, L.H., Kier, L.B. (2002). Modeling blood-brain barrier partitioning using
the electrotopological state. Journal of Chemical Information and Computer Sciences, 42,
651-666.
Rücker, G., and Rücker, C. (1993). Counts of all walks as atomic and molecular
descriptors. Journal of Chemical Information and Computer Sciences, 33, 683-695.
Russom, C.L., Bradbury, S.P., Broderius, S.J., Hammermeister, D.E., Drummond, R.A.
(1997). Predicting modes of action from chemical structure: Acute toxicity in the fathead
minnow (Pimephales promelas). Environmental Toxicology and Chemistry, 16, 948-967.
Sabljić, A. (1990). Topological indices and environmental chemistry. In: Karcher, W., and
Devillers, J. (Eds.), Practical applications of quantitative structure-activity relationships
(QSAR) in environmental chemistry and toxicology. Kluwer Academic Publishers, 61-82.
Schmitt, H., Altenburger, R., Jastorff, B., Schüürmann, G. (2000). Quantitative structure-
activity analysis of the algae toxicity of nitroaromatic compounds. Chemical Research in
Toxicology, 13, 441-450.
Schultz, T.W., and Cronin, M.T.D. (1997). Quantitative structure-activity relationships
for weak acid respiratory uncouplers to Vibrio fischeri. Environmental Toxicology and
Chemistry, 16, 357–360.
Schultz, T.W., and Cronin, M.T.D. (2003). Essential and desirable characteristics of
ecotoxicity quantitative structure-activity relationships. Environmental Toxicology and
Chemistry, 22, 599-607.
Schultz, T.W., and Seward, J.R. (2000). Dimyristoyl phosphatidylcholine/water
partitioning-dependent modeling of narcotic toxicity to T. pyriformis. Quantitative
Structure-Activity Relationships, 19, 239-344.
Schultz, T.W., Cronin, M.T.D., Netzeva, T.I., Aptula, A.O. (2002). Structure-toxicity
relationships for aliphatic chemicals evaluated with T. pyriformis. Chemical Research in
Toxicology, 15, 1602-1609.
Schultz, T.W., Cronin, M.T.D., Walker, J.D., Aptula, A.O. (2003a). Quantitative
structure–activity relationships (QSARs) in toxicology: a historical perspective. Journal
of Molecular Structure (Theochem), 622, 1–22.
Schultz, T.W., Lin, D.T., Wilke, T.S., Arnold, L.M. (1990). Quantitative structure-
activity relationships for the Tetrahymena pyriformis population growth endpoint: a
mechanism of action approach. In: Karcher, W., and Devillers, J. (Eds), Practical
Applications of Quantitative Structure-Activity Relationships (QSAR) in Environmental
Chemistry and Toxicology. Kluwer Academic Publishers, 61-82.
Schultz, T.W., Sinks, G.D., Bearden, A.P. (1998). QSAR in aquatic toxicology: a
mechanism of action approach comparing toxic potency to Pimephales promelas, T.
pyriformis, and V. fischeri. In: Devillers, J. (Ed.), Comparative QSAR. Taylor & Francis,
New York, 51-109.
Schultz, T.W., Sinks, G.D., Cronin, M.T.D. (1997). Identification of mechanisms of toxic
action of phenols to Tetrahymena pyriformis from molecular descriptors. In: Chen, F., and
Schüürmann, G. (Eds.), Quantitative Structure-Activity Relationships in Environmental
Sciences - VII. SETAC Press, Pensacola, USA, 329-342.
Schüürmann, G. (2004). Quantum chemical descriptors in structure-activity relationships
– calculation, interpretation and comparison of methods. In: Cronin, M.T.D., and
Livingstone, D.J. (Eds.), Predicting Chemical Toxicity and Fate. Taylor and Francis,
London, 85-149.
Selassie, C.D., Garg, R., Kapur, S., Kurup, A., Verma, R.P., Mekapati, S.B., Hansch, C.
(2002). Comparative QSAR and the radical toxicity of various functional groups.
Chemical Reviews, 102, 2585-2605.
Shrivastava, R., Delominie, C., Chevalier, A., John, G., Ekwall, B., Walum, E.,
Massingham, R. (1993). Comparison of in vitro acute lethal potency and in vitro
cytotoxicity of 48 chemicals. Cell Biology and Toxicology, 8, 157-167.
Stewart, J.J.P. (1989). Optimization of parameters for semiempirical methods. I. Method.
Journal of Computational Chemistry, 10, 209-220.
Statistica version 5.5, help manual, StatSoft Inc., Tulsa, OK, USA.
Sun, H., Dai, H., Shaik, N., Elmquist, W. F. (2003). Drug efflux transporters in the CNS.
Advanced Drug Delivery Reviews, 55, 83–105.
Sverdrup, L.E., Nielsen, T., Krogh, P.H. (2002). Soil ecotoxicity of polycyclic aromatic
hydrocarbons in relation to soil sorption, lipophilicity, and water solubility.
Environmental Science and Technology, 36, 2429-2435.
Taft, R.W., and Kamlet, M.J. (1976). The solvatochromic comparison method. 2. The
α-scale of solvent hydrogen-bond donor (HBD) acidities. Journal of the American
Chemical Society, 98, 2886-2894.
Tamai, I., and Tsuji, A. (2000). Transporter-mediated permeation of drugs across the
blood-brain barrier. Journal of Pharmaceutical Sciences, 89, 1371-1388.
Tao, S., Xi, X., Xu, F., Dawson, R. (2002a). A QSAR model for predicting toxicity (LC50)
to rainbow trout. Water Research, 36, 2926-2930.
Tao, S., Xi, X., Xu, F., Li, B., Cao, J., Dawson, R. (2002b). A fragment constant QSAR
model for evaluating the EC50 values of organic chemicals to D. magna. Environmental
Pollution, 116, 57-64.
Tenekedjiev, K., and Radojnova, D. (2001). Numerical procedures for stature estimation
according to length of limb bones in Bulgarian and Hungarian populations. Acta
Morphologica et Anthropologica, 6, 90-97.
Tenekedjiev, K. (1994). Technical diagnosis of complex energetic objects using statistical
pattern recognition. PhD Thesis Summary, Technical University-Varna, Bulgaria.
Terada, H. (1990). Uncouplers of oxidative phosphorylation. Environmental Health
Perspectives, 87, 213-218.
Tichý, M., Borek-Dohalský, V., Matousová, D., Rucki, M., Feltl, L., Roth, Z. (2002).
Prediction of acute toxicity of chemicals in mixtures: worms Tubifex tubifex and
gas/liquid distribution. SAR and QSAR in Environmental Research, 13, 261-269.
Todeschini, R., Vighi, M., Finizio, A., and Gramatica, P. (1997). 3D-modelling and
prediction by WHIM descriptors. Part 8. Toxicity and physico-chemical properties of
environmental priority chemicals by 2D-TI and 3D-WHIM descriptors. SAR and QSAR in
Environmental Research, 7, 173-193.
Topliss, J.G., and Costello, R.J. (1972). Chance correlations in structure-activity studies
using multiple regression analysis. Journal of Medicinal Chemistry, 15, 1066-1068.
Veith, G.D., Broderius, S.J. (1990). Rules for distinguishing toxicants that cause type I
and type II narcosis syndromes. Environmental Health Perspectives, 87, 207-211.
Toropov, A.A., and Benfenati, E. (2004). QSAR modelling of aldehyde toxicity by means
of optimization of correlation weights of nearest neighbouring codes. Journal of
Molecular Structure (Theochem), 676, 165-169.
Toropov, A.A., and Schultz, T.W. (2003). Prediction of aquatic toxicity: use of
optimization of correlation weights of local graph invariants. Journal of Chemical
Information and Computer Sciences, 43, 560-567.
Toropov, A.A., and Toropova, A.P. (2002). QSAR modeling of toxicity on optimization of
correlation weights of Morgan extended connectivity. Journal of Molecular Structure
(Theochem), 578, 129-134.
Trohalaki, S., Gifford, E., Pachter, R. (2000). Improved QSARs for predictive toxicology
of halogenated hydrocarbons. Computers and Chemistry, 24, 421-427.
Urani, C., Doldi, M., Crippa, S., Camatini, M. (1998). Human-derived cell lines to study
xenobiotic metabolism. Chemosphere, 37, 2785-2795.
Vamp Reference Guide (1997), Accelrys, Inc.
Van der Jagt, K., Munn, S., Tørsløv, J., de Bruijn, J. (2004). Alternative approaches can
reduce the use of test animals under REACH. Addendum to the report: Assessment of
additional testing needs under REACH. Effects of (Q)SARs, risk based testing and
voluntary industry initiatives. EUR Report No. 21405 EN. EC Joint Research Centre,
Ispra, Italy.
Veith, G.D., Call, D.J., Brooke, L.T. (1983). Structure-toxicity relationships for the
fathead minnow, Pimephales promelas: narcotic industrial chemicals. Canadian Journal
of Fisheries and Aquatic Sciences, 40, 743–748.
Verhaar, H.J.M., Van Leeuwen, C.J., Hermens, J.L.M. (1992). Classifying environmental
pollutants. 1. Structure-activity relationships for prediction of aquatic toxicity.
Chemosphere, 25, 471-491.
Verma, R.P., Kapur, S., Barberena, O., Shusterman, A., Hansch, C., Selassie, C.D.
(2003). Synthesis, cytotoxicity, and QSAR analysis of X-thiophenols in rapidly dividing
cells. Chemical Research in Toxicology, 16, 276-284.
Viswanadhan, V.N., Ghose, A.K., Revankar, G.R., Robins, R.K. (1989). Atomic
physicochemical parameters for three-dimensional structure-directed quantitative
structure-activity relationships. 4. Additional parameters for hydrophobic and dispersive
interactions and their application for an automated superposition of certain naturally
occurring nucleoside antibiotics. Journal of Computational Chemistry, 29, 163-172.
Voelkel, A. (1994). Structural descriptors in organic chemistry - new topological
parameter based on electrotopological state of graph vertices. Computers and Chemistry,
18, 1-4.
Wang, X., Dong, Y., Wang, L., Han, S. (2001). Acute toxicity of substituted phenols to
Rana japonica tadpoles and mechanism-based quantitative structure-activity relationship
(QSAR) study. Chemosphere, 44, 447-455.
Wang, X., Sun, C., Wang, Y., Wang, L. (2002a). Quantitative structure-activity
relationships for the inhibition toxicity to root elongation of Cucumis sativus of selected
phenols and interspecies correlation with T. pyriformis. Chemosphere, 46, 153-161.
Wang, X., Yin, C., Wang, L. (2002b). Structure-activity relationships and response-
surface analysis of nitroaromatics toxicity to the yeast (Saccharomyces cerevisiae).
Chemosphere, 46, 1045-1051.
Wang, X., Yu, J., Wang, Y., Wang, L. (2002c). Mechanism-based quantitative structure-
activity relationships for the inhibition of substituted phenols on germination rate of
Cucumis sativus. Chemosphere, 46, 241-250.
Weiner, P.K., and Kollman, P.A. (1981). AMBER: assisted model building with energy
refinement. A general program for modeling molecules and their interactions. Journal of
Computational Chemistry, 2, 287-303.
Weiner, S.J., Kollman, P.A., Case, D.A., Singh, U.C., Ghio, C., Alagona, G., Profeta, S.,
Jr., Weiner, P.K. (1984). A new force field for molecular mechanical simulation of
nucleic acids and proteins. Journal of the American Chemical Society, 106, 765-784.
Wiener, H. (1947). Structural determination of paraffin boiling points. Journal of the
American Chemical Society, 69, 17-20.
Wiese, M., and Pajeva, I.K. (2001). Structure-activity relationships of multidrug
resistance reversers. Current Medicinal Chemistry, 8, 685-713.
Wilson, L.Y., and Famini, G.R. (1991). Using theoretical descriptors in quantitative
structure-activity relationships: Some toxicological indices. Journal of Medicinal
Chemistry, 34, 1668-1674.
Worgan, A.D.P., Dearden, J.C., Edwards, R., Netzeva, T.I., Cronin, M.T.D. (2003).
Evaluation of a novel short-term algal toxicity assay by the development of QSARs and
inter-species relationships for narcotic chemicals. QSAR and Combinatorial Science, 22,
204-209.
Xu, M., Zhang, A., Han, S., Wang, L. (2002). Studies of 3D-quantitative structure–
activity relationships on a set of nitroaromatic compounds: CoMFA, advanced CoMFA
and CoMSIA. Chemosphere, 48, 707–715.
Yu, H.X., Lin, Z.F., Feng, J.F., Xu, T.L., Wang, L.S. (2001). Development of quantitative
structure activity relationships in toxicity prediction of complex mixtures. Acta
Pharmacologica Sinica, 22, 45-49.
Zerner, M.C. (1993). ZINDO, a comprehensive semiempirical quantum chemistry
package. Quantum Theory Project. Gainesville, Florida, USA.
Young, R.C., Mitchell, R.C., Brown, T.H., Ganellin, C.R., Griffiths, R., Jones, M., Rana,
K.K., Saunders, D., Smith, I.R., Sore, N.E., Wilks, T.J. (1988). Development of a new
physicochemical model to the design of centrally acting H2 receptor histamine
antagonists. Journal of Medicinal Chemistry, 31, 656-671.
List of author’s publications related to the research in the project
Lessigiarska, I., and Worth, A. (2003). The use of computer models as alternatives to
animal experiments in chemical risk assessment. Alternatives to Laboratory Animals, 31,
67-73.
Lessigiarska, I., Cronin, M.T.D., Worth, A.P., Dearden, J.C., Netzeva, T.I. (2004a).
QSARs for toxicity to the bacterium Sinorhizobium meliloti. SAR and QSAR in
Environmental Research, 15, 169-190.
Lessigiarska, I., Pajeva, I., Cronin, M.T.D., Worth, A.P. (2005a). 3D QSAR investigation
of the blood-brain barrier penetration of chemical compounds. SAR and QSAR in
Environmental Research, 16, 79-91.
Lessigiarska, I., Worth, A.P., Netzeva, T.I., Dearden, J.C., Cronin, M.T.D. (2006).
Quantitative structure-activity-activity and quantitative structure-activity investigations of
human and rodent toxicity. Chemosphere. In press.
Lessigiarska, I., Worth, A.P., Sokull-Klüttgen, B., Jeram, S., Dearden, J.C., Netzeva, T.I.,
Cronin, M.T.D. (2004b). QSAR investigation of a large data set for fish, algae and
Daphnia toxicity. SAR and QSAR in Environmental Research, 15, 413-431.
Netzeva, T.I., Aptula, A.O., Benfenati, E., Cronin, M.T.D., Gini, G., Lessigiarska, I.,
Maran, U., Vracko, M., Schüürmann, G. (2005). Description of the electronic structure of
organic chemicals using semiempirical and ab initio methods for development of
toxicological QSARs. Journal of Chemical Information and Modeling, 45, 106-114.
APPENDIX A
C CODE OF THE PROGRAMS FOR STATISTICAL ALGORITHMS
A.1. Program for reducing data multicollinearity
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "my_lib.h"

typedef struct variabl {
    char *var_name;
    double *var_val;
    int var_nico;
} variable;

void main(void)
{
    variable *var;
    FILE *f_from, *f_ic;
    char sf_from[100], sf_ic[100], ynvar, **var_rem, exs, ynexcl, **var_ex_list, qn;
    int numb, case_num, n_dep, i, j, i1, n_var_rem, max_vchar, n_step, n_indep, no_max, *var_rem_ind;
    double **arr_incor, **arr_sum1, r_lim, var_r_ico;
    int exist(char *);
    int getint(void);
    double getfloat(void);
    char getch(void);
    void err_calloc(void);
    int getstr_maxl_notw(char *, int);
    int getstr_maxl_file(FILE *, char *, int);
    int scmp_notcase (char *, char *);
    char* strcopy_my(char *str1, char *str2);
    int f_r_cont(char *, int *);
    void intercor(variable *, int, int, int, double **, double **);

    max_vchar = 30;
    printf("\nEnter the name of the file with the data:\n");
    if ( getstr_maxl_notw(sf_from, 100) != 1 ) {
        printf("\nError - the file name cannot be read!\n");
        exit(-1);
    }
    if ( (f_from = fopen(sf_from, "r")) == NULL ) {
        printf("\nError - can't open file %s for reading!\n", sf_from);
        exit(-1);
    }
    fclose(f_from);
    if ( (numb = f_r_cont(sf_from, &case_num)) == -1) {
        printf("\nError with file %s!\n", sf_from);
        exit(-1);
    }
    if (case_num < 3) {
        printf("\nError - the number of cases in the file %s is less than 3! Program terminates!\n", sf_from);
        exit(-1);
    }
    printf("\nThe file %s contains %d columns and %d cases\n",sf_from, numb, case_num);
    if ( ( var = (variable *) calloc( (size_t) numb, sizeof(variable) ) ) == NULL ) err_calloc();
    for (i = 0; i < numb; ++i) {
        if ( ( (var + i) -> var_name = (char *) calloc( (size_t) max_vchar, sizeof(char) ) ) == NULL ) err_calloc();
        if ( ( (var + i) -> var_val = (double *) calloc( (size_t) case_num, sizeof(double) ) ) == NULL ) err_calloc();
    }
    f_from = fopen(sf_from, "r");
    for (j = 0; j < numb; ++j) {
        if ( getstr_maxl_file(f_from, (var + j) -> var_name, max_vchar) != 1) {
            printf("\nError in reading from the file %s!\n", sf_from);
            exit(-1);
        }
    }
    for (i = 0; i < case_num; ++i) {
        for (j = 0; j < numb; ++j) {
            if ( fscanf(f_from, "%lf", ((var + j) -> var_val) + i) != 1) {
                printf("\nError in reading from the file %s!\n", sf_from);
                exit(-1);
            }
        }
    }
    printf("\n");
    for (j = 0; j < numb; ++j) {
        printf("%10s ", (var + j) -> var_name);
    }
    printf("\n");
    for (i = 0; i < case_num; ++i) {
        for (j = 0; j < numb; ++j) {
            printf("%10g ", *(((var + j) -> var_val) + i));
        }
        printf("\n");
    }
label1:
    printf("\nEnter the name of the file to save the intercorrelation matrix:\n");
    if ( getstr_maxl_notw(sf_ic, 100) != 1 ) {
        printf("\nError - the file name cannot be read!\n");
        goto label1;
    }
    if ( exist(sf_ic) == 1 ) {
        if ( (f_ic = fopen(sf_ic, "w")) == NULL ) {
            printf("\nError - can't open file %s for writing!\n", sf_ic);
            goto label1;
        }
    }
    else goto label1;
    fclose(f_ic);
label3:
    printf("\nEnter the number of dependent variables in the file: ");
    n_dep = getint();
    if ( (n_dep > numb - 2) || (n_dep < 0) ) {
        printf("\nError - the number of dependent variables must be between 0 and %d!\n", numb-2);
        goto label3;
    }
label31:
    printf("\nEnter the absolute value of the intercorrelation coefficient r, above which the variables are to be considered as highly intercorrelated: ");
    r_lim = getfloat();
    if ( (r_lim > 1) || (r_lim < 0) )
    {
        printf("\nError - absolute value of r must be between 0 and 1!\n");
        goto label31;
    }
label49:
    printf("\nAre there any independent variables that you want for sure to remain in the variable data set (Press 'Y' - yes, 'N' - no)?: ");
    ynvar = getch();
    if ( (ynvar != 'Y') && (ynvar != 'y') && (ynvar != 'N') && (ynvar != 'n') ) {
        printf("\nError!\n");
        goto label49;
    }
    n_var_rem = 0;
    if ( ( var_rem = (char **) calloc( (size_t) (numb - n_dep + 1), sizeof(char *) ) ) == NULL ) err_calloc();
    if ( ( var_rem_ind = (int *) calloc( (size_t) (numb-n_dep + 1), sizeof(int) ) ) == NULL ) err_calloc();
    if ( (ynvar == 'Y') || (ynvar == 'y') ) {
label29:
        printf("\nEnter the number of independent variables that have to remain in the variable data set: ");
        n_var_rem = getint();
        if ( (n_var_rem > numb - n_dep) || (n_var_rem < 0) ) {
            printf("\nError - the number of variables to remain must be between 0 and %d!\n", numb-n_dep);
            goto label29;
        }
        for (i = 0; i < n_var_rem; ++i) {
            if ( ( *(var_rem + i) = (char *) calloc( (size_t) max_vchar, sizeof(char) ) ) == NULL ) err_calloc();
        }
        if (n_var_rem > 0) printf("\nEnter the list with the names of the independent variables that have to remain in any case in the variable data set.\n");
        for (i = 0; i < n_var_rem; ++i) {
            j = 0;
label003:
            if ( getstr_maxl_notw( *(var_rem + i), max_vchar) != 1) {
                if (j == 0) {
                    printf("\nError in reading the variable name. Try again.\n");
                    ++j;
                    goto label003;
                }
                else {
                    printf("\nThis name cannot be read. Go to the next variable.\n");
                }
            }
            exs = 1;
            for (i1 = n_dep; (i1 < numb) && (exs == 1); ++i1) {
                if ( scmp_notcase ( *(var_rem + i), (var + i1) -> var_name ) == 1 ) {
                    *(var_rem_ind + i) = i1;
                    exs = 2;
                }
            }
            if (exs == 1) {
                printf("\nThere is no variable %s in the independent variable data set! Enter the name again.\n", *(var_rem + i));
                goto label003;
            }
        }
        if (n_var_rem > 0) printf("\nThe following variables will remain in any case in the independent variable data set:\n");
        for (i = 0; i < n_var_rem; ++i) {
            printf("var No %d var name %s\n", *(var_rem_ind + i) + 1, *(var_rem + i));
        }
    }
    if ( ( arr_incor = (double **) calloc( (size_t) (numb-n_dep + 2), sizeof(double *) ) ) == NULL ) err_calloc();
    for (i = 0; i < (numb-n_dep+1); ++i)
        if ( ( *(arr_incor + i) = (double *) calloc( (size_t) (numb-n_dep-i+1), sizeof(double) ) ) == NULL ) err_calloc();
    /* regr_calc_2 is called with k = 1 and writes rows 0..k+1 and columns 0..k */
    /* of this array, so 3 rows of 2 doubles are allocated here. */
    if ( ( arr_sum1 = (double **) calloc( (size_t) 3, sizeof(double *) ) ) == NULL ) err_calloc();
    for (i = 0; i < 3; ++i)
        if ( ( *(arr_sum1 + i) = (double *) calloc( (size_t) 2, sizeof(double) ) ) == NULL ) err_calloc();
    if ( ( var_ex_list = (char **) calloc( (size_t) (numb - n_dep + 1), sizeof(char *) ) ) == NULL ) err_calloc();
    n_step = 0;
labelsteps:
    n_indep = numb - n_dep - n_step;
    intercor(var, n_indep, n_dep, case_num, arr_sum1, arr_incor);
    f_ic = fopen(sf_ic, "w");
    fprintf(f_ic, " ");
    for (i = 0; i < n_indep; ++i) {
        fprintf(f_ic, "%12s ", (var + n_dep + i) -> var_name);
    }
    fprintf(f_ic, "\n");
    for (i = 0; i < n_indep; ++i) {
        fprintf(f_ic, "%12s ", (var + n_dep + i) -> var_name);
        for (j = 0; j < i; ++j) {
            fprintf(f_ic, "%12.4f ", *(*(arr_incor + j) + i-j-1) );
        }
        fprintf(f_ic, " 1.0000 ");
        for (j = i+1; j < n_indep; ++j) {
            fprintf(f_ic, "%12.4f ", *(*(arr_incor + i) + j-i-1) );
        }
        fprintf(f_ic, "\n");
    }
    for (i = 0; i < n_indep; ++i) ((var + n_dep + i) -> var_nico) = 0;
    fprintf(f_ic,"\n");
    printf("\n");
    for (i = 0; i < n_indep; ++i) {
        fprintf(f_ic,"var: %s list of intercorr. vars: ", (var + n_dep + i) -> var_name);
        printf("var: %s list of intercorr. vars: ", (var + n_dep + i) -> var_name);
        for (j = 0; j < i; ++j) {
            if( *(*(arr_incor + j) + i-j-1) < 0 ) var_r_ico = 0 - *(*(arr_incor + j) + i-j-1);
            else var_r_ico = *(*(arr_incor + j) + i-j-1);
            if( var_r_ico > r_lim ) {
                fprintf(f_ic, "%s ", (var + n_dep + j) -> var_name);
                printf("%s ", (var + n_dep + j) -> var_name);
                ++((var + n_dep + i) -> var_nico);
            }
        }
        for (j = i+1; j < n_indep; ++j)
        {
            if( *(*(arr_incor + i) + j-i-1) < 0 ) var_r_ico = 0 - *(*(arr_incor + i) + j-i-1);
            else var_r_ico = *(*(arr_incor + i) + j-i-1);
            if( var_r_ico > r_lim ) {
                fprintf(f_ic, "%s ", (var + n_dep + j) -> var_name);
                printf("%s ", (var + n_dep + j) -> var_name);
                ++((var + n_dep + i) -> var_nico);
            }
        }
        fprintf(f_ic, "number of intercorr. vars: %d\n", (var + n_dep + i) -> var_nico );
        printf("number of intercorr. vars: %d\n", (var + n_dep + i) -> var_nico );
    }
    fclose(f_ic);
    printf("\nThe correlation matrix is saved in file %s\n", sf_ic);
    no_max = n_dep;
labelh65:
    exs = 2;
    for (i1 = 0; (i1 < n_var_rem) && (exs == 2); ++i1) {
        if ( *(var_rem_ind + i1) == no_max ) {
            no_max++;
            exs = 1;
        }
    }
    if ( exs == 1) goto labelh65;
    if (no_max == n_indep + n_dep) {
        printf("\nAll of the independent variables are in the list of variables that have to remain!\n");
        goto labelh98;
    }
    for (i = no_max; i < (n_indep + n_dep); ++i) {
        if( ((var + i) -> var_nico) > ((var + no_max) -> var_nico) ) {
            exs = 2;
            for (i1 = 0; (i1 < n_var_rem) && (exs == 2); ++i1) {
                if ( scmp_notcase ( *(var_rem + i1), (var + i) -> var_name ) == 1 ) exs = 1;
            }
            if (exs == 2) no_max = i;
        }
    }
    if( ((var + no_max) -> var_nico ) > 0 )
    {
        for (i = 0; i < n_var_rem; ++i) {
            if( ((var + *(var_rem_ind + i) ) -> var_nico) > ((var + no_max) -> var_nico) ) {
                printf("\nVariable %s has a bigger number of intercorrelated variables than variable %s, but it is in the list of variables to remain.\n", (var + *(var_rem_ind + i) ) -> var_name, (var + no_max) -> var_name);
            }
        }
        printf("\nVariable %s has the biggest number of intercorrelated variables (%d).\n", (var + no_max) -> var_name, (var + no_max) -> var_nico);
labelh89:
        printf("\nProgram is going to exclude variable %s from the independent variable data set (Press 'Y' - yes, 'N' - no): ", (var + no_max) -> var_name);
        ynexcl = getch();
        if ( (ynexcl != 'Y') && (ynexcl != 'y') && (ynexcl != 'N') && (ynexcl != 'n') ) {
            printf("\nError!\n");
            goto labelh89;
        }
        if ( (ynexcl == 'Y') || (ynexcl == 'y') ) {
            if ( ( *(var_ex_list + n_step) = (char *) calloc( (size_t) max_vchar, sizeof(char) ) ) == NULL ) err_calloc();
            if( strcopy_my( *(var_ex_list + n_step), (var + no_max) -> var_name) == NULL) {
                printf("\nError - the program is not performing properly!\n");
                exit(-1);
            }
            for (i = no_max; i < (n_indep + n_dep - 1); ++i) {
                (var + i) -> var_name = (var + i+1) -> var_name;
                (var + i) -> var_val = (var + i+1) -> var_val;
            }
            for (i = 0; i < n_var_rem; ++i) {
                if ( *(var_rem_ind + i) == no_max ) {
                    printf("\nError - the program is not performing correctly!\n");
                    exit(-1);
                }
                if ( *(var_rem_ind + i) > no_max ) (*(var_rem_ind + i))--;
            }
            n_step++;
            goto labelsteps;
        }
        else {
labelh32:
            printf("\nQuit the program (press 'q') or continue with excluding the next intercorrelated variable (press 'n'): ");
            qn = getch();
            if ( (qn != 'Q') && (qn != 'q') && (qn != 'N') && (qn != 'n') ) {
                printf("\nError!\n");
                goto labelh32;
            }
            if ( (qn == 'Q') || (qn == 'q') ) goto labelh98;
            else {
                if ( ( *(var_rem + n_var_rem) = (char *) calloc( (size_t) max_vchar, sizeof(char) ) ) == NULL ) err_calloc();
                if( strcopy_my( *(var_rem + n_var_rem), (var + no_max) -> var_name) == NULL) {
                    printf("\nError - the program is not performing properly!\n");
                    exit(-1);
                }
                *(var_rem_ind + n_var_rem) = no_max;
                n_var_rem++;
                goto labelsteps;
            }
        }
    }
    else {
        printf("\nThe variable data set of variables not included into the vars remain list does not contain variables that are intercorrelated above %g.\n", r_lim);
    }
labelh98:
    if (n_step > 0) {
        f_ic = fopen(sf_ic, "a");
        fprintf(f_ic, "\n%d variables were excluded.\nThe excluded variable list:\n", n_step);
        printf("\n%d variables were excluded.\nThe excluded variable list:\n", n_step);
        for (i = 0; i < n_step; ++i) {
            fprintf(f_ic, "%s ", *(var_ex_list + i) );
            printf("%s ", *(var_ex_list + i) );
        }
        fprintf(f_ic,"\n");
        printf("\n");
        fclose(f_ic);
    }
for (i = 0; i < (numb-n_dep+1); ++i) free( *(arr_incor + i) ); free(arr_incor); free(arr_sum1); if ( (ynvar == 'Y') || (ynvar == 'y') ) { for (i = 0; i < n_var_rem; ++i) free( *(var_rem + i) ); free(var_rem); free(var_rem_ind); } } int f_r_cont(char *sf, int *c_num) { /* This function checks the file with name sf. It returns the number of */ /* strings in the first line of the file and sets *c_num value to be the */ /* number of lines of the file minus 1. */ FILE *f; int c, n = 0, eo, i; double re; if ( (f = fopen(sf, "r")) == NULL ) { printf("\nCan't open file %s for reading!\n", sf); return -1; } while ( (c = getc(f)) == '\n' ); while ( (c != '\n') && (c != EOF) ) { while ( (c == ' ') || (c == '\t') ) c = getc(f); if (c != '\n') { n++; while ( (c != ' ') && (c != '\t') && (c != '\n') && (c != EOF) ) c = getc(f); } } *c_num = 0; if (c != EOF) { eo = 1; while ( (eo = fscanf(f,"%lf", &re)) != EOF ) { if (eo != 1) {
fclose(f); printf("\nFile %s contains %d variable names in the first line, so it must contain %d columns of real numbers\n", sf, n, n); return -1; } for (i = 0; i < n-1; ++i) { eo = fscanf(f,"%lf", &re); if (eo != 1) { fclose(f); printf("\nFile %s contains %d variable names in the first line, so it must contain %d columns of real numbers\n", sf, n, n); return -1; } } (*c_num)++; } } fclose(f); return n; } void intercor(variable *v, int n, int n_dep, int n_cases, double **arr_sum, double **arr_ic) { unsigned int j, m, n_ico; double r_inter; void regr_calc_2(variable *, unsigned int *, int, int, int, int, double **, double *, unsigned int *); for (m = 0; m < n-1; ++m) { for (j = m+1; j < n; ++j) { n_ico = 0; regr_calc_2(v, &j, 1, m + n_dep, n_dep, n_cases, arr_sum, &r_inter, &n_ico); *(*(arr_ic + m) + j-m-1) = r_inter; } } } void regr_calc_2(variable *v, unsigned int *c_index, int k, int n_d, int n_dep, int n_cases, double **arr_sum, double *r, unsigned int *n_co) { unsigned int m, l, q, iv;
double **arr_det, det_a, det_k, d_mean, *d_calc, sum_om, sum_oc, sum_cm, r_sq, tol, sign; double sum_sum(double *, double *, int); double sum_1(double *, int); double c_det(double **, int); d_mean = sum_1( (v + n_d) -> var_val, n_cases); if (n_cases != 0) d_mean /= n_cases; else { printf("\nError - there is no cases!\n"); exit(-1); } for (m = 0; m < k; ++m) { for (l = 0; l < k; ++l) { *(*(arr_sum + m) + l) = sum_sum( (v + n_dep + *(c_index + m)) -> var_val, (v + n_dep + *(c_index + l)) -> var_val, n_cases); } *(*(arr_sum + k) + m) = sum_1( (v + n_dep + *(c_index + m )) -> var_val, n_cases); *(*(arr_sum + m) + k) = sum_1( (v + n_dep + *(c_index + m )) -> var_val, n_cases); *(*(arr_sum + k + 1) + m) = sum_sum( (v + n_dep + *(c_index + m )) -> var_val, (v + n_d) -> var_val, n_cases); } *(*(arr_sum + k) + k) = n_cases; *(*(arr_sum + k + 1) + k) = sum_1( (v + n_d) -> var_val, n_cases); if ( ( arr_det = (double **) calloc( (size_t) k + 1, sizeof(double *) ) ) == NULL ) err_calloc(); for (m = 0; m < k+1; ++m) if ( ( *(arr_det + m) = (double *) calloc( (size_t) k + 1, sizeof(double) ) ) == NULL ) err_calloc(); for (m = 0; m < k+1; ++m) { for (l = 0; l < k+1; ++l) { *(*(arr_det + m) + l) = *(*(arr_sum + m) + l); } } det_a = c_det(arr_det, k + 1); iv = 1; if ( (det_a > -1e-6) && (det_a < 1e-6) ) iv = 0; if (iv)
{ if ( ( d_calc = (double *) calloc( (size_t) n_cases, sizeof(double) ) ) == NULL ) err_calloc(); for (m = 0; m < n_cases; ++m) *(d_calc + m) = 0.0; } for (q = 0; q < k+1; ++q) { if (iv) { for (m = 0; m < k+1; ++m) { for (l = 0; l < k+1; ++l) { if (m != q) *(*(arr_det + m) + l) = *(*(arr_sum + m) + l); else *(*(arr_det + m) + l) = *(*(arr_sum + k+1) + l); } } det_k = c_det(arr_det, k + 1); if (q == 0) sign = det_k/det_a; if (q < k) { for (m = 0; m < n_cases; ++m) *(d_calc + m) += (det_k/det_a) * ( *(((v + n_dep + *(c_index + q)) -> var_val) + m) ); } if (q == k) { for (m = 0; m < n_cases; ++m) *(d_calc + m) += (det_k/det_a); } } } if (iv) { sum_om = 0.0; sum_oc = 0.0; sum_cm = 0.0; for (m = 0; m < n_cases; ++m) { sum_om += pow( (*(((v + n_d) -> var_val) + m) - d_mean), 2.0); sum_oc += pow( (*(((v + n_d) -> var_val) + m) - *(d_calc + m)), 2.0); sum_cm += pow( (*(d_calc + m)- d_mean), 2.0); } tol = 1 - sum_oc/sum_om - sum_cm/sum_om; if ( (tol > 1e-5) || (tol < -1e-5) ) { r_sq = 0.0; } else r_sq = sum_cm/sum_om;
        if (sign < -1e-12)
            *(r + *n_co) = -pow(r_sq, 0.5);
        else
            *(r + *n_co) = pow(r_sq, 0.5);
    }
    else
        *(r + *n_co) = 0.0;
    if (iv)
        free(d_calc);
    for (m = 0; m < k+1; ++m)
        free(*(arr_det + m));
    free(arr_det);
    *n_co += 1;
}

double sum_sum (double *r1, double *r2, int n)
{
    int i;
    double sum = 0.0;

    for (i = 0; i < n; ++i) {
        sum += (*(r1+i)) * (*(r2+i));
    }
    return sum;
}

double sum_1 (double *r1, int n)
{
    int i;
    double sum = 0.0;

    for (i = 0; i < n; ++i) {
        sum += *(r1+i);
    }
    return sum;
}
A.2. Program implementing the algorithm of the best-subsets approach for selecting regression equations with the best statistical fit
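As a compact illustration of the best-subsets idea, the sketch below enumerates every k-element subset of n candidate variables in lexicographic order and hands each one to a visitor. It follows the same recursive scheme as comb_n_k_w in the listing that follows, where each slot takes every index above the one chosen for the previous slot. It is a simplified sketch under stated assumptions: the names for_each_subset, subset_fn, subset_log and log_subset are hypothetical, and the intercorrelation filter and the regression fit that the listing applies to each subset are left to the visitor.

```c
#include <assert.h>
#include <stddef.h>

/* Visitor called once per subset; idx holds k ascending indices. */
typedef void (*subset_fn)(const unsigned int *idx, int k, void *ctx);

/* Enumerates all k-element subsets of {0, ..., n-1} in lexicographic
   order. pos is the slot being filled; start the recursion with pos = 0.
   idx must have room for k entries. */
void for_each_subset(unsigned int *idx, int n, int k, int pos,
                     subset_fn visit, void *ctx)
{
    unsigned int j, start;

    assert(k >= 1 && k <= n && pos >= 0 && pos < k);
    start = (pos == 0) ? 0u : idx[pos - 1] + 1u;
    for (j = start; j < (unsigned int) n; ++j) {
        idx[pos] = j;
        if (pos < k - 1)
            for_each_subset(idx, n, k, pos + 1, visit, ctx);
        else
            visit(idx, k, ctx);
    }
}

/* Example visitor: counts the subsets and remembers the first index
   of the most recently visited one. */
typedef struct {
    unsigned long count;
    unsigned int last_first;
} subset_log;

void log_subset(const unsigned int *idx, int k, void *ctx)
{
    subset_log *sl = (subset_log *) ctx;

    (void) k;
    sl->count += 1;
    sl->last_first = idx[0];
}
```

For n = 5 and k = 3 the visitor is called 10 times, the last time with the subset {2, 3, 4}. A best-subsets search would replace log_subset with a visitor that fits a regression model on the selected columns and keeps the models with the highest r squared.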
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "my_lib.h"

typedef struct variabl {
    char *var_name;
    double *var_val;
} variable;

int main(void)
{
    variable *var;
    FILE *f_from, *f_to;
    char sf_from[100], sf_to[100], **res, yn, c1;
    int numb, case_num, n_dep, n_max, n_min, i, j, n, n_d, npr_mod;
    unsigned int *var_index, n_c;
    long unsigned int comb_n, n_allc;
    double **arr_sum, *r_sq, **arr_incor, **arr_sum1, rpr_lim, rico_lim;

    int exist(char *);
    int getint(void);
    double getfloat(void);
    char getch(void);
    void err_calloc(void);
    long unsigned int fact_n_k(int, int);
    int f_r_cont(char *, int *);
    void intercor_2(variable *, int, int, int, double **, double **);
    void comb_n_k_w(variable *, unsigned int*, int, int, unsigned int, int, int, int,
                    double **, double *, char **, unsigned int *, double **, double, int,
                    long unsigned int *);

    printf("\nEnter the name of the file with the data:\n");
    scanf("%99s", sf_from);    /* field width keeps the name inside sf_from[100] */
    if ( (f_from = fopen(sf_from, "r")) == NULL ) {
        printf("\nError - can't open file %s for reading!\n", sf_from);
        exit(-1);
    }
    fclose(f_from);
    if ( (numb = f_r_cont(sf_from, &case_num)) == -1) {
        printf("\nError with file %s!\n", sf_from);
        exit(-1);
} if (case_num < 3) { printf("\nError - the number of cases in the file %s is less than 3! Program terminates!\n", sf_from); exit(-1); } printf("\nThe file %s contains %d columns and %d cases\n",sf_from, numb, case_num); if ( ( var = (variable *) calloc( (size_t) numb, sizeof(variable) ) ) == NULL ) err_calloc(); for (i = 0; i < numb; ++i) { if ( ( (var + i) -> var_name = (char *) calloc( (size_t) 30, sizeof(char) ) ) == NULL ) err_calloc(); if ( ( (var + i) -> var_val = (double *) calloc( (size_t) case_num, sizeof(double) ) ) == NULL ) err_calloc(); } f_from = fopen(sf_from, "r"); for (j = 0; j < numb; ++j) { if ( fscanf(f_from, "%s", (var + j) -> var_name) != 1) { printf("\nError in reading of the file %s! -1\n", sf_from); exit(-1); } } for (i = 0; i < case_num; ++i) { for (j = 0; j < numb; ++j) { if ( fscanf(f_from, "%lf", ((var + j) -> var_val) + i) != 1) { printf("\nError in reading of the file %s!\n", sf_from); exit(-1); } } } printf("\n"); for (j = 0; j < numb; ++j) { printf("%10s ", (var + j) -> var_name); } printf("\n"); for (i = 0; i < case_num; ++i) {
        for (j = 0; j < numb; ++j) {
            printf("%10g ", *(((var + j) -> var_val) + i));
        }
        printf("\n");
    }

label1:
    printf("\nEnter the name of the file to save in:\n");
    scanf("%99s", sf_to);    /* field width keeps the name inside sf_to[100] */
    if ( exist(sf_to) == 1 ) {
        if ( (f_to = fopen(sf_to, "w")) == NULL ) {
            printf("\nError - can't open file %s for writing!\n", sf_to);
            goto label1;
        }
    }
    else
        goto label1;

label31:
    printf("\nEnter the minimum value of the model r squared below which models will not be printed in the file and on the screen: ");
    rpr_lim = getfloat();
    if ( (rpr_lim > 1) || (rpr_lim < 0) ) {
        printf("\nError - model r squared must be between 0 and 1!\n");
        goto label31;
    }

label37:
    printf("\nEnter the maximum number of models to be printed in the file and on the screen for a dependent variable: ");
    npr_mod = getint();
    if ( npr_mod < 0 ) {
        printf("\nError - the number of models to be printed must not be smaller than 0!\n");
        goto label37;
    }

label3:
    printf("\nEnter the number of dependent variables in the file: ");
    n_dep = getint();
    if ( (n_dep > numb - 1) || (n_dep < 1) ) {
        printf("\nError - the number of dependent variables must be between 1 and %d!\n", numb-1);
        goto label3;
    }

label24:
    printf("\nEnter the minimum number of independent variables to be included in the model: ");
    n_min = getint();
    if ( (n_min > numb - n_dep) || (n_min < 1) ) {
        printf("\nError - the number of independent variables must be between 1 and %d!\n", numb-n_dep);
        goto label24;
    }
    if (n_min > case_num - 2) {
        printf("\nError - the number of independent variables must not be bigger than the number of cases minus two (%d)!\n", case_num-2);
        goto label24;
    }

label2:
    printf("\nEnter the maximum number of independent variables to be included in the model: ");
    n_max = getint();
    if ( (n_max > numb - n_dep) || (n_max < n_min) ) {
        printf("\nError - the number of independent variables must be between %d and %d!\n", n_min, numb-n_dep);
        goto label2;
    }
    if (n_max > case_num - 2) {
        printf("\nError - the number of independent variables must not be bigger than the number of cases minus two (%d)!\n", case_num-2);
        goto label2;
    }

label34:
    printf("\nEnter the value of the intercorrelation r squared above which the independent variables will be considered as intercorrelated and will not be included in the same model: ");
    rico_lim = getfloat();
    if ( (rico_lim > 1) || (rico_lim < 0) ) {
        printf("\nError - the intercorrelation r squared must be between 0 and 1!\n");
        goto label34;
    }

    if ( ( arr_incor = (double **) calloc( (size_t) (numb-n_dep + 2), sizeof(double *) ) ) == NULL )
        err_calloc();
    for (i = 0; i < (numb-n_dep+1); ++i)
        if ( ( *(arr_incor + i) = (double *) calloc( (size_t) (numb-n_dep-i+1), sizeof(double) ) ) == NULL )
            err_calloc();

    /* regr_calc_3 is called with k = 1 and writes to rows 0..k+1 and */
    /* columns 0..k of arr_sum, so 3 rows of 2 doubles are required.  */
    if ( ( arr_sum1 = (double **) calloc( (size_t) 3, sizeof(double *) ) ) == NULL )
        err_calloc();
    for (i = 0; i < 3; ++i)
        if ( ( *(arr_sum1 + i) = (double *) calloc( (size_t) 2, sizeof(double) ) ) == NULL )
            err_calloc();
    intercor_2(var, numb - n_dep, n_dep, case_num, arr_sum1, arr_incor);
    for (i = 0; i < 3; ++i)
        free( *(arr_sum1 + i) );
    free(arr_sum1);

    for (n_d = 0; n_d < n_dep; ++n_d) {
        for (n = n_min; n <= n_max; ++n) {
            comb_n = fact_n_k(numb - n_dep, n);
            if (comb_n > (long unsigned int) (pow(2,63) - 1)) {
                printf("\nThe number of combinations of %d independent variables is %lu. Too many. Calculations terminate.\n", n, comb_n);
                break;
            }
        label4:
            if (comb_n > 10000) {
                printf("\nThe number of combinations of %d independent variables is %lu. Do you want to continue (Press 'Y' - yes, 'N' - no)?: ", n, comb_n);
                yn = getch();
                if ( (yn != 'Y') && (yn != 'y') && (yn != 'N') && (yn != 'n') ) {
                    printf("\nError!\n");
                    goto label4;
                }
                if ( (yn == 'N') || (yn == 'n') )
                    break;
            }
            printf("\nPlease wait...\n");
            if ( comb_n < npr_mod )
                n_c = comb_n;
            else
                n_c = npr_mod;
            if ( ( r_sq = (double *) calloc( (size_t) n_c, sizeof(double) ) ) == NULL )
                err_calloc();
            if ( ( res = (char **) calloc( (size_t) n_c, sizeof(char *) ) ) == NULL )
                err_calloc();
            for (i = 0; i < n_c; ++i) {
                if ( ( *(res + i) = (char *) calloc( (size_t) n*30 + 200, sizeof(char) ) ) == NULL )
                    err_calloc();
                *(*(res + i)) = '\0';
            }
if ( ( arr_sum = (double **) calloc( (size_t) n + 2, sizeof(double *) ) ) == NULL ) err_calloc(); for (i = 0; i < n + 2; ++i) if ( ( *(arr_sum + i) = (double *) calloc( (size_t) n + 1, sizeof(double) ) ) == NULL ) err_calloc(); if ( ( var_index = (unsigned int *) calloc( (size_t) n, sizeof(unsigned int) ) ) == NULL ) err_calloc(); n_c = 0; n_allc = 0; comb_n_k_w(var, var_index, numb - n_dep, n, 0, n_d, n_dep, case_num, arr_sum, r_sq, res, &n_c, arr_incor, rico_lim, npr_mod, &n_allc); fprintf(f_to, "\n r_sq dep. var number of indep. vars is %d\n", n); for (i = 0; (i < n_c) && ( *(r_sq + i) >= rpr_lim ) && (i < npr_mod) ; ++i) { fprintf(f_to, "%5d %.8f ",i + 1, *(r_sq + i)); j = 0; while ( (c1 = *(*(res + i) + j)) != '\0' ) { fputc(c1,f_to); ++j; } fprintf(f_to,"\n"); } printf("\n r_sq dep. var number of indep. vars is %d\n", n); for (i = 0; (i < n_c) && ( *(r_sq + i) >= rpr_lim ) && (i < npr_mod) ; ++i) { printf("%5d %.8f ",i + 1, *(r_sq + i)); j = 0; while ( (c1 = *(*(res + i) + j)) != '\0' ) { putchar(c1); ++j; } printf("\n"); } free(var_index); for (i = 0; i < n+2; ++i) free(*(arr_sum + i)); free(arr_sum); if( comb_n < npr_mod) n_c = comb_n; else n_c = npr_mod; for (i = 0; i < n_c; ++i) free(*(res + i)); free(res);
free(r_sq); } } fclose(f_to); free(arr_incor); } int f_r_cont(char *sf, int *c_num) { /* This function checks the file with name sf. It returns the number of */ /* strings in the first line of the file and sets *c_num value to be the */ /* number of lines of the file minus 1. */ FILE *f; int c, n = 0, eo, i; double re; if ( (f = fopen(sf, "r")) == NULL ) { printf("\nCan't open file %s for reading!\n", sf); return -1; } while ( (c = getc(f)) == '\n' ); while ( (c != '\n') && (c != EOF) ) { while ( (c == ' ') || (c == '\t') ) c = getc(f); if (c != '\n') { n++; while ( (c != ' ') && (c != '\t') && (c != '\n') && (c != EOF) ) c = getc(f); } } *c_num = 0; if (c != EOF) { eo = 1; /* 300 */ while ( (eo = fscanf(f,"%lf", &re)) != EOF ) { if (eo != 1) { fclose(f); printf("\nFile %s contains %d variable names in the first line, so it must contain %d columns of real numbers\n", sf, n, n);
return -1; } for (i = 0; i < n-1; ++i) { eo = fscanf(f,"%lf", &re); if (eo != 1) { fclose(f); printf("\nFile %s contains %d variable names in the first line, so it must contain %d columns of real numbers\n", sf, n, n); return -1; } } (*c_num)++; } } fclose(f); return n; } void intercor_2(variable *v, int n, int n_dep, int n_cases, double **arr_sum, double **arr_ic) { unsigned int j, m; double r_inter_sq; void regr_calc_3(variable *, unsigned int *, int, int, int, int, double **, double *); for (m = 0; m < n-1; ++m) { for (j = m+1; j < n; ++j) { regr_calc_3(v, &j, 1, m + n_dep, n_dep, n_cases, arr_sum, &r_inter_sq); *(*(arr_ic + m) + j-m-1) = r_inter_sq; } } } void regr_calc_3(variable *v, unsigned int *c_index, int k, int n_d, int n_dep, int n_cases, double **arr_sum, double *r_sq) { unsigned int m, l, q, iv; double **arr_det, det_a, det_k, d_mean, *d_calc, sum_om, sum_oc, sum_cm, tol; double sum_sum(double *, double *, int); double sum_1(double *, int);
double c_det(double **, int); d_mean = sum_1( (v + n_d) -> var_val, n_cases); if (n_cases != 0) d_mean /= n_cases; else { printf("\nError - there is no cases!\n"); exit(-1); } for (m = 0; m < k; ++m) { for (l = 0; l < k; ++l) { *(*(arr_sum + m) + l) = sum_sum( (v + n_dep + *(c_index + m)) -> var_val, (v + n_dep + *(c_index + l)) -> var_val, n_cases); } *(*(arr_sum + k) + m) = sum_1( (v + n_dep + *(c_index + m )) -> var_val, n_cases); *(*(arr_sum + m) + k) = sum_1( (v + n_dep + *(c_index + m )) -> var_val, n_cases); *(*(arr_sum + k + 1) + m) = sum_sum( (v + n_dep + *(c_index + m )) -> var_val, (v + n_d) -> var_val, n_cases); } *(*(arr_sum + k) + k) = n_cases; *(*(arr_sum + k + 1) + k) = sum_1( (v + n_d) -> var_val, n_cases); if ( ( arr_det = (double **) calloc( (size_t) k + 1, sizeof(double *) ) ) == NULL ) err_calloc(); for (m = 0; m < k+1; ++m) if ( ( *(arr_det + m) = (double *) calloc( (size_t) k + 1, sizeof(double) ) ) == NULL ) err_calloc(); for (m = 0; m < k+1; ++m) { for (l = 0; l < k+1; ++l) { *(*(arr_det + m) + l) = *(*(arr_sum + m) + l); } } det_a = c_det(arr_det, k + 1); /* 400 */ iv = 1; if ( (det_a > -1e-6) && (det_a < 1e-6) ) iv = 0; if (iv) { if ( ( d_calc = (double *) calloc( (size_t) n_cases, sizeof(double) ) ) == NULL ) err_calloc(); for (m = 0; m < n_cases; ++m) *(d_calc + m) = 0.0; }
for (q = 0; q < k+1; ++q) { if (iv) { for (m = 0; m < k+1; ++m) { for (l = 0; l < k+1; ++l) { if (m != q) *(*(arr_det + m) + l) = *(*(arr_sum + m) + l); else *(*(arr_det + m) + l) = *(*(arr_sum + k+1) + l); } } det_k = c_det(arr_det, k + 1); if (q < k) { for (m = 0; m < n_cases; ++m) *(d_calc + m) += (det_k/det_a) * ( *(((v + n_dep + *(c_index + q)) -> var_val) + m) ); } if (q == k) { for (m = 0; m < n_cases; ++m) *(d_calc + m) += (det_k/det_a); } } } if (iv) { sum_om = 0.0; sum_oc = 0.0; sum_cm = 0.0; for (m = 0; m < n_cases; ++m) { sum_om += pow( (*(((v + n_d) -> var_val) + m) - d_mean), 2.0); sum_oc += pow( (*(((v + n_d) -> var_val) + m) - *(d_calc + m)), 2.0); sum_cm += pow( (*(d_calc + m)- d_mean), 2.0); } tol = 1 - sum_oc/sum_om - sum_cm/sum_om; if ( (tol > 1e-5) || (tol < -1e-5) ) { *r_sq = 0.0; } else *r_sq = sum_cm/sum_om; } else *r_sq = 0.0; if (iv) free(d_calc); for (m = 0; m < k+1; ++m) free(*(arr_det + m));
    free(arr_det);
}

void comb_n_k_w(variable *v, unsigned int *c_index, int n, int k, unsigned int i,
                int n_d, int n_dep, int n_cases, double **arr_sum, double *r,
                char **results, unsigned int *n_co, double **arr_ic, double ri_l,
                int npr_m, long unsigned int *n_allco)
{
    unsigned int j, p, m, l, flag;
    void regr_calc_4(variable *, unsigned int *, int, int, int, int, double **, double *,
                     char **, unsigned int *, int);

    /* Slot i takes every index above the one chosen for slot i-1. */
    if (i == 0)
        p = 0;
    else
        p = *(c_index + i - 1) + 1;
    for (j = p; j < n; ++j) {
        *(c_index + i) = j;
        if (i < k-1)
            comb_n_k_w(v, c_index, n, k, i+1, n_d, n_dep, n_cases, arr_sum, r, results,
                       n_co, arr_ic, ri_l, npr_m, n_allco);
        if (i == k-1) {
            /* Skip the combination if any pair of its variables is */
            /* intercorrelated above the limit ri_l.                */
            flag = 1;
            if (k > 1) {
                for (m = 0; (m < k-1) && (flag); ++m) {
                    for (l = m+1; (l < k) && (flag); ++l) {
                        if ( *( *( arr_ic + *(c_index + m) ) + *(c_index + l) - *(c_index + m) - 1 ) > ri_l )
                            flag = 0;
                    }
                }
            }
            if (flag)
                regr_calc_4(v, c_index, k, n_d, n_dep, n_cases, arr_sum, r, results, n_co, npr_m);
            if ( ( (*n_allco+1)%10000 ) == 0 )
                printf("\nReached the %lu-th combination\n", *n_allco+1 );
            *n_allco += 1;
        }
    }
}
void regr_calc_4(variable *v, unsigned int *c_index, int k, int n_d, int n_dep, int n_cases, double **arr_sum, double *r, char **results, unsigned int *n_co, int npr_m) { unsigned int m, l, q, iv; int i, j; double **arr_det, det_a, det_k, d_mean, *d_calc, sum_om, sum_oc, sum_cm, r_sq, tol; char *results_1; double sum_sum(double *, double *, int); double sum_1(double *, int); double c_det(double **, int); char * strcopy_my(char *, char *); d_mean = sum_1( (v + n_d) -> var_val, n_cases); if (n_cases != 0) d_mean /= n_cases; else { printf("\nError - there is no cases!\n"); exit(-1); } if ( ( results_1 = (char *) calloc( (size_t) k*30 + 200, sizeof(char) ) ) == NULL ) err_calloc(); for (m = 0; m < k; ++m) { for (l = 0; l < k; ++l) { *(*(arr_sum + m) + l) = sum_sum( (v + n_dep + *(c_index + m)) -> var_val, (v + n_dep + *(c_index + l)) -> var_val, n_cases); } *(*(arr_sum + k) + m) = sum_1( (v + n_dep + *(c_index + m )) -> var_val, n_cases); *(*(arr_sum + m) + k) = sum_1( (v + n_dep + *(c_index + m )) -> var_val, n_cases); *(*(arr_sum + k + 1) + m) = sum_sum( (v + n_dep + *(c_index + m )) -> var_val, (v + n_d) -> var_val, n_cases); } *(*(arr_sum + k) + k) = n_cases; *(*(arr_sum + k + 1) + k) = sum_1( (v + n_d) -> var_val, n_cases); if ( ( arr_det = (double **) calloc( (size_t) k + 1, sizeof(double *) ) ) == NULL ) err_calloc(); for (m = 0; m < k+1; ++m) if ( ( *(arr_det + m) = (double *) calloc( (size_t) k + 1, sizeof(double) ) ) == NULL ) err_calloc(); for (m = 0; m < k+1; ++m) {
for (l = 0; l < k+1; ++l) { *(*(arr_det + m) + l) = *(*(arr_sum + m) + l); } } /* 550 */ det_a = c_det(arr_det, k + 1); iv = 1; if ( (det_a > -1e-6) && (det_a < 1e-6) ) iv = 0; if (iv) { if ( ( d_calc = (double *) calloc( (size_t) n_cases, sizeof(double) ) ) == NULL ) err_calloc(); for (m = 0; m < n_cases; ++m) *(d_calc + m) = 0.0; } sprintf(results_1,"%10s ", (v + n_d) -> var_name); for (q = 0; q < k+1; ++q) { if (iv) { for (m = 0; m < k+1; ++m) { for (l = 0; l < k+1; ++l) { if (m != q) *(*(arr_det + m) + l) = *(*(arr_sum + m) + l); else *(*(arr_det + m) + l) = *(*(arr_sum + k+1) + l); } } det_k = c_det(arr_det, k + 1); if (q < k) { m = 0; while ( *(results_1 + m) != '\0' ) ++m; sprintf( results_1 + m,"%10s %15g ", (v + n_dep + *(c_index + q)) -> var_name, det_k/det_a); } if (q == k) { m = 0; while ( *(results_1 + m) != '\0' ) ++m; sprintf( results_1 + m,"Intercept %15g", det_k/det_a); } if (q < k) {
for (m = 0; m < n_cases; ++m) *(d_calc + m) += (det_k/det_a) * ( *(((v + n_dep + *(c_index + q)) -> var_val) + m) ); } if (q == k) { for (m = 0; m < n_cases; ++m) *(d_calc + m) += (det_k/det_a); } /* 600 */ } else { if (q < k) { m = 0; while ( *(results_1 + m) != '\0' ) ++m; sprintf( results_1 + m,"%10s ", (v + n_dep + *(c_index + q)) -> var_name); } if (q == k) { m = 0; while ( *(results_1 + m) != '\0' ) ++m; sprintf( results_1 + m,"Intercept can't be calculated - data have no variance"); } } } if (iv) { sum_om = 0.0; sum_oc = 0.0; sum_cm = 0.0; for (m = 0; m < n_cases; ++m) { sum_om += pow( (*(((v + n_d) -> var_val) + m) - d_mean), 2.0); sum_oc += pow( (*(((v + n_d) -> var_val) + m) - *(d_calc + m)), 2.0); sum_cm += pow( (*(d_calc + m)- d_mean), 2.0); } tol = 1 - sum_oc/sum_om - sum_cm/sum_om; if ( (tol > 1e-5) || (tol < -1e-5) ) { m = 0; while ( *(results_1 + m) != '\0' ) ++m; sprintf( results_1 + m," minimum tolerance"); r_sq = 0.0; } else r_sq = sum_cm/sum_om; } else r_sq = 0.0;
if (iv) free(d_calc); for (m = 0; m < k+1; ++m) free(*(arr_det + m)); free(arr_det); /* 650 */ if (*n_co < npr_m) { strcopy_my( *(results + *n_co), results_1 ); *(r + *n_co) = r_sq; } if (*n_co > npr_m-1) i = npr_m - 1; else i = *n_co; while ( (i >= 0) && ( *(r + i) <= r_sq) ) --i; if ( (i+1 < *n_co+1) && (i+1 < npr_m) ) { if (*n_co > npr_m-1) j = npr_m - 1; else j = *n_co; while (j > i+1) { strcopy_my( *(results + j), *(results + j-1) ); *(r + j) = *(r + j-1); --j; } strcopy_my( *(results + i+1), results_1); *(r + i+1) = r_sq; } free(results_1); *n_co += 1; } double sum_sum (double *r1, double *r2, int n) { int i; double sum = 0.0; for (i = 0; i < n; ++i) { sum += (*(r1+i)) * (*(r2+i)); } return sum; } double sum_1 (double *r1, int n)
{
    int i;
    double sum = 0.0;

    for (i = 0; i < n; ++i) {
        sum += *(r1+i);
    }
    return sum;
}
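The regr_calc routines in the listing above fit each candidate model by solving the least-squares normal equations with determinants (Cramer's rule), and they report r squared as the ratio of the regression sum of squares to the total sum of squares. The sketch below shows the same calculation for the simplest case of one independent variable. It is an illustration under stated assumptions rather than the thesis code: the names fit_result and fit_line are hypothetical, the 2 x 2 determinants are written out by hand instead of calling c_det, and the additional tolerance check that the listing applies to the sums of squares is omitted. The 1e-6 threshold mirrors the test on det_a in the listing.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Result of a one-predictor least-squares fit y = slope * x + intercept. */
typedef struct {
    double slope;
    double intercept;
    double r_sq;
    int ok;   /* 0 if the normal equations are (near) singular */
} fit_result;

fit_result fit_line(const double *x, const double *y, size_t n)
{
    fit_result res = {0.0, 0.0, 0.0, 0};
    double sxx = 0.0, sx = 0.0, sxy = 0.0, sy = 0.0;
    double det_a, y_mean, y_calc, ss_tot = 0.0, ss_reg = 0.0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; ++i) {
        sxx += x[i] * x[i];
        sx += x[i];
        sxy += x[i] * y[i];
        sy += y[i];
    }

    /* Normal equations:
         | sxx  sx | | slope     |   | sxy |
         | sx   n  | | intercept | = | sy  |   */
    det_a = sxx * (double) n - sx * sx;
    if (fabs(det_a) < 1e-6)
        return res;

    /* Cramer's rule: replace one column by the right-hand side. */
    res.slope = (sxy * (double) n - sx * sy) / det_a;
    res.intercept = (sxx * sy - sx * sxy) / det_a;

    /* r squared = regression sum of squares / total sum of squares. */
    y_mean = sy / (double) n;
    for (i = 0; i < n; ++i) {
        y_calc = res.slope * x[i] + res.intercept;
        ss_tot += (y[i] - y_mean) * (y[i] - y_mean);
        ss_reg += (y_calc - y_mean) * (y_calc - y_mean);
    }
    if (ss_tot > 0.0)
        res.r_sq = ss_reg / ss_tot;
    res.ok = 1;
    return res;
}
```

For x = {1, 2, 3, 4} and y = {3, 5, 7, 9} the fit gives a slope of 2, an intercept of 1 and an r squared of 1. A column with no variance makes the determinant zero, and the function then returns with ok set to 0, which corresponds to the case reported in the listing as "data have no variance".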
APPENDIX B
SMILES CODES FOR THE INVESTIGATED COMPOUNDS
B.1. Compounds investigated for blood-brain barrier penetration (Chapter 10)
B.1.a. Compounds from the ECVAM data set.
No Name SMILES
1 L-alanine [NH3+][C@@H](C)C(=O)[O-]
2 antipyrine c1ccccc1N2C(=O)C=C(C)N2C
3 AZT C1=C(C)C(=O)NC(=O)N1C2OC(CO)C(N=[N+]=[N-])C2
4 caffeine CN1C(=O)N(C)c2ncn(C)c2C1(=O)
5 cimetidine CNC(=NCCSCc1nc[nH]c1C)NC#N
6 cyclosporin A CN1C(=O)C(C)NC(=O)C(C)NC(=O)C(CC(C)C)N(C)C(=O)C(C(C)C)NC(=O)C(CC(C)C)N(C)C(=O)CN(C)C(=O)C(CC)NC(=O)C(C(O)C(C)CC=CC)N(C)C(=O)C(C(C)C)N(C)C(=O)C(CC(C)C)N(C)C(=O)C1CC(C)C
7 diazepam O=C1N(C)c2ccc(Cl)cc2C(=NC1)c3ccccc3
8 digoxin CC1OC(CC(O)C1O)OC2C(C)OC(CC2O)OC3C(C)OC(CC3O)OC4CCC5(C)C(CCC6C5CC(O)C7(C)C(CCC67O)C8=CC(=O)OC8)C4
9 L-dopa [NH3+][C@@H](C(c1cc(O)c(O)cc1))C(=O)[O-]
10 glycerol OCC(O)CO
11 lactic acid OC(C)C(=O)O
12 L-leucine [NH3+][C@@H](C(C(C)C))C(=O)[O-]
13 morphine C1=CC2C(N(C)C5)Cc3ccc(O)c4c3C2(C5)C(O4)C1O
14 nicotine c1ccncc1C2N(C)CCC2
15 phenytoin O=C1NC(=O)C(N1)(c2ccccc2)c3ccccc3
16 sucrose OC[C@@H]1[C@@H](O)[C@H](O)[C@@H](O)[C@H](O1)O[C@@]2(CO)[C@@H](O)[C@H](O)[C@@H](CO)O2
17 urea O=C(N)N
18 verapamil COc1c(OC)cc(cc1)CCN(C)CCCC(C(C)C)(C#N)c2cc(OC)c(OC)cc2
19 vinblastine CC[C@@]1(O)CN2CCc3c4ccccc4[nH]c3[C@](C(OC)=O)(C[C@H](C1)C2)c5c(OC)cc6N(C)[C@H]7[C@@](O)(C(OC)=O)[C@H](OC(C)=O)[C@]([C@H]89)(CC)C=CCN8CC[C@]79c6c5
20 vincristine CC[C@@]1(O)CN2CCc3c4ccccc4[nH]c3[C@](C(OC)=O)(C[C@H](C1)C2)c5c(OC)cc6N(C=O)[C@H]7[C@@](O)(C(OC)=O)[C@H](OC(C)=O)[C@]([C@H]89)(CC)C=CCN8CC[C@]79c6c5
21 warfarin c1ccccc1C(CC(C)=O)C2=C(O)c3ccccc3OC2(=O)
B.1.b. Compounds from the data set of Platts et al. (2001).
No Name SMILES
1 1 (cimetidine) CNC(=NC#N)NCCSCc1nc[NH]c1C
2 2 NC(N)=Cc1sc(C)nc1
3 4 CN(C)Cc1ccc(o1)CSCCNC2=NC(=O)C(=CN2)Cc3cc4ccccc4cc3
4 5 CN(C)Cc1ccc(o1)CSCCNC2=NC(=O)C(=CN2)Cc3ccc(C)nc3
5 6 (clonidine) Clc1cccc(Cl)c1N=C2NCCN2
6 7 (mepyramine) COc1ccc(cc1)CN(c2ccccn2)CCN(C)C
7 8 (imipramine) CN(C)CCCN1c2ccccc2CCc3ccccc13
8 9 (ranitidine) O=N(=O)C=C(NC)NCCSCc1ccc(o1)CN(C)C
9 10 (tiotidine) N#CN=C(NC)NCCSCc1csc(n1)N=C(N)N
10 13 O=N(=O)c1cc[NH]c1NCCSCc2ncccc2Br
11 14 O=N(=O)c1cc[NH]c1NCCSCc2ncccc2
12 15 O=N(=O)c1c(Cc3ccccc3)c[NH]c1NCCSCc2ncccc2
13 16 NC(N)=Nc1scc(n1)c2ccccc2
14 17 NC(N)=Nc1scc(n1)c2cc(N)ccc2
15 18 NC(N)=Nc1scc(n1)c2cc(NC(=O)C)ccc2
16 19 NC(N)=Nc1scc(n1)c2cc(CC(NC)=NC#N)ccc2
17 20 O=N(=O)c1cc[NH]c1NCCSCc2ccc(o2)CN(C)C
18 21 O=N(=O)c1c(Cc3ccccc3)c[NH]c1NCCSCc2ccc(o2)CN(C)C
19 22 O=N(=O)c1cc[NH]c1Nc2cccc(c2)c3ccc(o3)CN(C)C
20 23 O=N(=O)c1cc[NH]c1Nc2cccc(c2)c3nccc(c3)CN(C)C
21 24 C1CCCCN1Cc2cccc(c2)OCCCNC(=O)C
22 25 C1CCCCN1Cc2cccc(c2)OCCCNC(=O)c3ccccc3
23 26 C1CCCCN1Cc2cccc(c2)OCCCO
24 27 C1CCCCN1Cc2cccc(c2)OCCCNc3ccccn3
25 28 C1CCCCN1Cc2cccc(c2)OCCCNc3sccn3
26 29 C1CCCCN1Cc2cccc(c2)OCCCNc3sc4ccccc4n3
27 30 C1CCCCN1Cc2cccc(c2)OCCCNc3oc4ccccc4n3
28 31 (carbamazepine) c12ccccc2C=Cc3ccccc3N1C(=O)N
29 32 (epoxide of carbamazepine) c12ccccc2C4OC4c3ccccc3N1C(=O)N
30 33 n1cN2c3cccc(Cl)c3C(=O)N(C)Cc2c1c4noc(n4)C(C)C
31 34 n1cN2c3cccc(Cl)c3C(=O)N(C)Cc2c1c4noc(n4)C(C)(C)O
32 35 n1cN2c3cccc(Cl)c3C(=O)N(C)Cc2c1c4noc(n4)C(C)(O)CO
33 36 (amitriptyline) c12ccccc2CCc3ccccc3C1=CCCN(C)C
34 acetaminophen Oc1ccc(cc1)NC(=O)C
35 acetylsalicylic acid CC(=O)Oc1ccccc1C(=O)O
36 alprazolam Clc1ccc2N3c(C)nnc3CN=C(c4ccccc4)c2c1
37 aminopyrine c1ccccc1N2C(=O)C(N(C)C)=C(C)N2C
38 amobarbital CC(C)CCC1(CC)C(=O)NC(=O)NC1=O
39 antipyrine c1ccccc1N2C(=O)C=C(C)N2C
40 argon Ar
41 atenolol CC(C)NCC(O)COc1ccc(cc1)CC(=O)N
42 benzene c1ccccc1
43 bretazenil CC(C)(C)OC(=O)C1=C2C3CCCN3C(=O)c4cc(Br)ccc4N2C=N1
44 bromperidol Fc1ccc(cc1)C(=O)CCCN2CCC(CC2)(O)c3ccc(Br)cc3
45 butanone CC(=O)CC
46 caffeine c12ncN(C)c1C(=O)N(C)C(=O)N2C
47 chlorpromazine c12cc(Cl)ccc1Sc3ccccc3N2CCCN(C)C
48 clobazam Clc1ccc2N(C)C(=O)CC(=O)N(c3ccccc3)c2c1
49 codeine C1=C[C@H]2[C@H](N(C)C5)Cc3ccc(OC)c4c3[C@@]2(C5)[C@@H](O4)[C@H]1O
50 CS2 S=C=S
51 cyclohexane C1CCCCC1
52 cyclopropane C1CC1
53 desipramine c12ccccc1CCc3ccccc3N2CCCNC
54 desmethyldesipramine c12ccccc1CCc3ccccc3N2CCCN
55 desmethylclobazam Clc1ccc2NC(=O)CC(=O)N(c3ccccc3)c2c1
56 desmethyldiazepam Clc1ccc2NC(=O)CN=C(c3ccccc3)c2c1
57 desmonomethylpromazine c12ccccc1Sc3ccccc3N2CCCNC
58 diazepam Clc1ccc2N(C)C(=O)CN=C(c3ccccc3)c2c1
59 dichloromethane C(Cl)Cl
60 didanosine c12ncnc(O)c1N=CN2[C@H]3CC[C@@H](CO)O3
61 diethyl ether CCOCC
62 2,2-dimethylbutane CC(C)(C)CC
63 divinyl ether C=COC=C
64 enflurane ClC(F)C(F)(F)OC(F)F
65 ethanol CCO
66 ethylbenzene c1c(CC)cccc1
67 flumazenil CCOC(=O)C1=C2CN(C)C(=O)c3cc(F)ccc3N2C=N1
68 flunitrazepam [O-][N+](=O)c1ccc2N(C)C(=O)CN=C(c3c(F)cccc3)c2c1
69 fluphenazine c12ccccc1Sc3ccc(C(F)(F)F)cc3N2CCCN4CCN(CC4)CCO
70 fluroxene FC(F)(F)COC=C
71 haloperidol Fc1ccc(cc1)C(=O)CCCN2CCC(CC2)(O)c3ccc(Cl)cc3
72 halothane FC(F)(F)C(Cl)Br
73 heptane CCCCCCC
74 hexane CCCCCC
75 hexobarbital C2CCCC=C2C1(C)C(=O)N(C)C(=O)NC1=O
76 1-hydroxymidazolam Clc1ccc2N3c(CO)ncc3CN=C(c4c(F)cccc4)c2c1
77 4-hydroxymidazolam Clc1ccc2N3c(C)ncc3C(O)N=C(c4c(F)cccc4)c2c1
78 9-hydroxyrisperidone O=C1N2CCCC(O)C2=NC(C)=C1CCN3CCC(CC3)C4=NOc5cc(F)ccc45
79 hydroxyzine c1ccccc1C(c2ccccc2)N3CCN(CC3)CCOCCO
80 ibuprofen CC(C)Cc1ccc(cc1)C(C)C(=O)O
81 indinavir c1ncccc1CN2CC(C(=O)NC(C)(C)C)N(CC2)CC(O)CC(Cc3ccccc3)C(=O)NC4c5ccccc5CC4O
82 indomethacin COc1ccc2N(C(=O)c3ccc(Cl)cc3)c(C)c(CC(=O)O)c2c1
83 isoflurane FC(F)(F)C(Cl)OC(F)F
84 krypton Kr
85 M2L-663581 N1=CN2c3cccc(Cl)c3C(=O)N(C)CC2=C1C4=NOC(C(C)(O)CO)=N4
86 mesoridazine c12ccccc1Sc3ccc(S(=O)C)cc3N2CCC4CCCCN4C
87 methane C
88 methohexital CCC=CC(C)C1(CC=C)C(=O)NC(=O)NC1=O
89 methoxyflurane COC(F)(F)C(Cl)Cl
90 methylcyclopentane C1C(C)CCC1
91 3-methylhexane CCC(C)CCC
92 2-methylpentane CC(C)CCC
93 3-methylpentane CCC(C)CC
94 2-methylpropan-1-ol CC(C)CO
95 mianserin c12ccccc1Cc3ccccc3C4N2CCN(C)C4
96 midazolam Clc1ccc2N3c(C)ncc3CN=C(c4c(F)cccc4)c2c1
97 MIL-663581 N1=CN2c3cccc(Cl)c3C(=O)N(C)CC2=C1C4=NOC(C(C)(C)O)=N4
98 mirtazapine c12ncccc1Cc3ccccc3C4N2CCN(C)C4
99 morphine C1=C[C@H]2[C@H](N(C)C5)Cc3ccc(O)c4c3[C@@]2(C5)[C@@H](O4)[C@H]1O
100 neon Ne
101 nevirapine c12nccc(C)c1NC(=O)c3cccnc3N2C4CC4
102 nitrogen N#N
103 nitrous oxide [O-][N+]#N
104 nor-1-chlorpromazine c12ccccc1Sc3cccc(Cl)c3N2CCCN
105 nor-2-chlorpromazine c12ccccc1Sc3ccc(Cl)cc3N2CCCN
106 northioridazine c12ccccc1Sc3ccc(SC)cc3N2CCC4CCCCN4
107 Org12692 FC(F)(F)c1c(Cl)cc(cc1)N2CCNCC2
108 Org13011 FC(F)(F)c1ccnc(c1)N2CCN(CC2)CCCCN3CCCC3(=O)
109 Org30526 c12cc(Cl)ccc1Oc3ccccc3C4C2CNC4
110 Org32104 c12ccccc1Oc3c(C)cccc3C4C2(O)CCNC4
111 Org34167 c12ccccc2ON=C1c3ccccc3C(N)CC=C
112 Org4428 c12ccccc1Oc3c(C)cccc3C4C2(O)CCN(C)C4
113 Org5222 c12cc(Cl)ccc1Oc3ccccc3C4C2CN(C)C4
114 oxazepam Clc1ccc2NC(=O)C(O)N=C(c3ccccc3)c2c1
115 paraxanthine CN1C=NC2=C1C(=O)N(C)C(=O)N2
116 pentane CCCCC
117 pentobarbital CCCC(C)C1(CC)C(=O)NC(=O)NC1=O
118 phenylbutazone c1ccccc1N2C(=O)C(CCCC)C(=O)N2c3ccccc3
119 phenytoin c1ccccc1C2(c3ccccc3)NC(=O)NC2=O
120 promazine c12ccccc1Sc3ccccc3N2CCCN(C)C
121 propan-1-ol CCCO
122 propan-2-ol CC(O)C
123 propanone CC(=O)C
124 propranolol CC(C)NCC(O)COc1cccc2ccccc12
125 quinidine COc1ccc2nccc(c2c1)[C@H](O)[C@H]3N4CCC(C3)[C@H](C4)C=C
126 risperidone O=C1N2CCCCC2=NC(C)=C1CCN3CCC(CC3)C4=NOc5cc(F)ccc45
127 RO19-4603 CC(C)(C)OC(=O)C1=C2CN(C)C(=O)c3sccc3N2C=N1
128 salicylic acid Oc1ccccc1C(=O)O
129 salicyluric acid Oc1ccccc1C(=O)NCC(=O)O
130 SF6 FS(F)(F)(F)(F)F
131 SKF 101468 CCN(CC)CCc1cccc2NC(=O)Cc12
132 SKF 89124 CCN(CC)CCc1ccc(O)c2NC(=O)Cc12
133 sulforidazine c12ccccc1Sc3ccc(S(=O)(=O)C)cc3N2CCC4CCCCN4C
134 teflurane FC(F)(F)C(F)Br
135 theobromine CN1C=NC2=C1C(=O)NC(=O)N2(C)
136 theophylline c12ncNc1C(=O)N(C)C(=O)N2C
137 thiopental CCCC(C)C1(CC)C(=O)NC(=S)NC1=O
138 thioridazine c12cc(SC)ccc1Sc3ccccc3N2CCC4N(C)CCCC4
139 tibolone C#C[C@@]1(O)CC[C@H]2[C@@H]3[C@H](C)CC4=C(CCC(=O)C4)[C@H]3CC[C@]12C
140 toluene c1c(C)cccc1
141 triazolam Clc1ccc2N3c(C)nnc3CN=C(c4c(Cl)cccc4)c2c1
142 1,1,1-trichloroethane C(Cl)(Cl)(Cl)C
143 trichloroethene C(Cl)(Cl)=CCl
144 trichloromethane C(Cl)(Cl)Cl
145 1,1,1-trifluoro-2-chloroethane FC(F)(F)CCl
146 trifluoperazine c12cc(C(F)(F)F)ccc1Sc3ccccc3N2CCCN4CCN(C)CC4
147 valproic acid CCCC(C(=O)O)CCC
148 xenon Xe
149 2-xylene c1c(C)c(C)ccc1
150 3-xylene c1c(C)cc(C)cc1
151 4-xylene c1c(C)ccc(C)c1
152 Y-G 14 CNCCc1ccccn1
153 Y-G 15 CN(C)CCc1ccccn1
154 Y-G 16 NCCc1sccn1
155 Y-G 19 NCCc1scc(c2ccccc2)n1
156 Y-G 20 NCCc1cn2ccccc2n1
157 zidovudine C1=C(C)C(=O)NC(=O)N1[C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2
B.2. Compounds from the data set of Botsford (2002) (Chapter 12)
No Name SMILES
1 acetaminophen CC(=O)Nc1ccc(O)cc1
2 acetic acid CC(O)=O
3 acetone CC(=O)C
4 acetylsalicylic acid CC(=O)Oc1ccccc1C(O)=O
5 alachlor ClCC(=O)N(COC)c1c(CC)cccc1CC
6 4-aminobenzaldehyde Nc1ccc(C=O)cc1
7 amitriptyline c13ccccc1CCc2ccccc2C3=CCCN(C)C
8 amoxicillin c1cc(O)ccc1C(N)C(=O)NC2C(=O)N3C(C(=O)O)C(C)(C)SC23
9 atenolol CC(C)NCC(O)COc1ccc(cc1)CC(=O)N
10 atropine CN1C2CCC1C[C@H](C2)OC(=O)C(CO)c3ccccc3
11 bensulide c1ccccc1S(=O)(=O)NCCSP(=S)(OC(C)C)OC(C)C
12 benzene c1ccccc1
13 benzyl chloride ClCc1ccccc1
14 bipyridyl n1ccccc1c2ncccc2
15 bromoxynil N#Cc1cc(Br)c(O)c(Br)c1
16 butylamine CCCCN
17 caffeine CN1C(=O)N(C)c2ncn(C)c2C1(=O)
18 catechol Oc1c(O)cccc1
19 celebrex c1cc(C)ccc1C2=CC(C(F)(F)F)=NN2c3ccc(cc3)S(=O)(=O)N
20 chloramphenicol ClC(Cl)C(=O)NC(CO)C(O)c1ccc(N(=O)=O)cc1
21 chlorobenzene Clc1ccccc1
22 3-chlorophenol Clc1cc(O)ccc1
23 4-chlorophenol Clc1ccc(O)cc1
24 chloroquine CCN(CC)CCCC(C)Nc1ccnc2cc(Cl)ccc12
25 clindamycin CCC[C@H]1CN(C)[C@@H](C1)C(=O)N[C@@H](C(Cl)C)[C@@H]1[C@H](O)[C@H](O)[C@@H](O)[C@@H](CC)O1
26 clomazone Clc1ccccc1CN2OCC(C)(C)C(=O)2
27 codeine C1=C[C@H]2[C@H](N(C)C5)Cc3ccc(OC)c4c3[C@@]2(C5)[C@@H](O4)[C@H]1O
28 2-cresol Cc1ccccc1O
29 3-cresol Cc1cccc(O)c1
30 4-cresol Cc1ccc(O)cc1
31 cyanazine N#CC(C)(C)Nc1nc(NCC)nc(Cl)n1
32 cyclohexylamine C1CCCCC1N
33 DCPA (chlorthal-dimethyl) COC(=O)c1c(Cl)c(Cl)c(C(=O)OC)c(Cl)c1Cl
34 dexedrine C[C@H](N)Cc1ccccc1
35 2-dianisidine COc1c(N)ccc(c1)c2ccc(N)c(c2)OC
36 diazepam (valium) O=C1N(C)c2ccc(Cl)cc2C(=NC1)c3ccccc3
37 diazinon n1c(C(C)C)cc(C)cc1OP(=S)(OCC)OCC
38 dibromoethane BrCCBr
39 1,2-dichlorobenzene Clc1ccccc1Cl
40 1,3-dichlorobenzene Clc1cccc(Cl)c1
41 1,4-dichlorobenzene Clc1ccc(Cl)cc1
42 1,2-dichloroethane ClCCCl
43 dichloromethane ClCCl
44 2,4-dichlorophenol Clc1cc(Cl)c(O)cc1
45 3,4-dichlorophenol Clc1c(Cl)cc(O)cc1
46 3,5-dichlorophenol Clc1cc(Cl)cc(O)c1
47 2,4-dichlorophenoxyacetic acid OC(=O)COc1c(Cl)cc(Cl)cc1
48 diethylamine CCNCC
49 digoxin CC1OC(CC(O)C1O)OC2C(C)OC(CC2O)OC3C(C)OC(CC3O)OC4CCC5(C)C(CCC6C5CC(O)C7(C)C(CCC67O)C8=CC(=O)OC8)C4
50 dimethyl sulphoxide CS(=O)C
51 4-(dimethylamino)benzaldehyde O=Cc1ccc(cc1)N(C)C
52 2,3-dimethylphenol Oc1c(C)c(C)ccc1
53 2,4-dimethylphenol Oc1c(C)cc(C)cc1
54 1,2-dinitrobenzene O=N(=O)c1ccccc1N(=O)=O
55 1,4-dinitrobenzene O=N(=O)c1ccc(N(=O)=O)cc1
56 2,6-dinitro-4-cresol Cc1cc(N(=O)=O)c(O)c(N(=O)=O)c1
57 2,4-dinitrophenol c1cc(N(=O)=O)cc(N(=O)=O)c1O
58 diuron CN(C)C(=O)Nc1ccc(Cl)c(Cl)c1
59 emetine COc1c(OC)cc2c(c1)CCN3[C@H]2C[C@@H]([C@@H](CC)C3)C[C@H]4NCCc5c4cc(OC)c(OC)c5
60 eptam (EPTC) CCSC(=O)N(CCC)CCC
61 ethanol CCO
62 ethanolamine NCCO
63 ethyl acetate CCOC(=O)C
64 ethylbenzene c1ccccc1CC
65 ethylene glycol OCCO
66 FCCP [carbonyl cyanide 4-(trifluoromethoxy)-phenylhydrazone] N#CC(C#N)=NNc1cc(Cl)ccc1
67 fluvoxamine COCCCCC(=NOCCN)c1ccc(C(F)(F)F)cc1
68 furosemide NS(=O)(=O)c1cc(C(=O)O)c(cc1Cl)NCC2=CC=CO2
69 gentamicin CNC(C)[C@@H]1CC[C@@H](N)[C@H](O1)O[C@@H]2[C@@H](N)C[C@@H](N)[C@@H]([C@@H]2O)O[C@@H]3[C@@H](O)[C@@H](NC)[C@@](C)(O)CO3
70 hexachlorobenzene Clc1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl
71 hexachlorophene Clc1c(Cl)cc(Cl)c(O)c1Cc2c(O)c(Cl)cc(Cl)c2Cl
72 ibuprofen CC(C)Cc1ccc(C(C)C(=O)O)cc1
73 imazapyr OC(=O)c1cccnc1C2=NC(C)(C(C)C)C(=O)N2
74 indole c12ccccc1ccn2
75 isoproterenol CC(C)NCC(O)c1ccc(O)c(O)c1
76 isoxaben COc1cccc(OC)c1C(=O)NC1=CC(=NO1)C(C)(CC)CC
77 lindane [C@H]1(Cl)[C@H](Cl)[C@H](Cl)[C@@H](Cl)[C@H](Cl)[C@@H]1Cl
78 malathion CCOC(=O)C(CC(=O)OCC)SP(=S)(OC)OC
79 methanol CO
80 3-methyl-4-nitrophenol Oc1cc(C)c(N(=O)=O)cc1
81 methylphenidate (ritalin) N1CCCCC1C(C(=O)OC)c1ccccc1
82 morphine C1=C[C@H]2[C@H](N(C)C5)Cc3ccc(O)c4c3[C@@]2(C5)[C@@H](O4)[C@H]1O
83 1-naphthol Oc1cccc2ccccc12
84 napropamide CCN(CC)C(=O)C(C)Oc1cccc2ccccc12
85 naproxen OC(=O)[C@@H](C)c1ccc2cc(OC)ccc2c1
86 nicosulfuron CN(C)C(=O)c1cccnc1S(=O)(=O)NC(=O)Nc2nc(OC)cc(OC)n2
87 nicotine c1ccncc1[C@H]2N(C)CCC2
88 2-nitrophenol Oc1ccccc1N(=O)=O
89 norflurazon c1c(C(F)(F)F)cccc1N2N=CC(NC)=C(Cl)C2=O
90 novobiocin CC(C)=CCc1c(O)ccc(c1)C(=O)NC2=C(O)c3c(OC(=O)2)c(C)c(cc3)O[C@H]4[C@@H](O)[C@H](OC(=O)N)[C@@H](OC)C(C)(C)O4
91 octane CCCCCCCC
92 orphenadrine CN(C)CCOC(c1ccccc1)c2ccccc2C
93 orudis OC(=O)C(C)c1cccc(c1)C(=O)c2ccccc2
94 oxadiazon CC(C)(C)C1=NN(C(=O)O1)c2c(Cl)cc(O)c(OC(C)C)c2
95 paraquat C[n+]1ccc(cc1)c2cc[n+](C)cc2
96 pentachlorophenol Clc1c(Cl)c(Cl)c(Cl)c(Cl)c1O
97 pentanol CCCCCO
98 1,10-phenanthroline n1cccc2ccc3cccnc3c12
99 phenol Oc1ccccc1
100 primisulfuron FC(F)Oc1cc(OC(F)F)nc(n1)NC(=O)NS(=O)(=O)c2ccccc2C(=O)OC
101 propan-2-ol CC(O)C
102 propranolol CC(C)NCC(O)COc1cccc2ccccc12
103 pseudocumene Cc1c(C)cc(C)cc1
104 quinidine COc1ccc2nccc(c2c1)[C@H](O)[C@H]3N4CCC(C3)[C@H](C4)C=C
105 quinine COc1ccc2nccc(c2c1)[C@@H](O)[C@H]3N4CCC(C3)[C@H](C4)C=C
106 resorcinol Oc1cccc(O)c1
107 salicylic acid OC(=O)c1c(O)cccc1
108 scopolamine CN1C2C4C(O4)C1C[C@H](C2)OC(=O)C(CO)c3ccccc3
109 sethoxydim CCSC(C)CC1CC(O)=C(C(=O)C1)C(CCC)=NOCC
110 sulphosalicylic acid OC(=O)c1c(O)ccc(S(=O)(=O)O)c1
111 tert-butyl methyl ether COC(C)(C)C
112 tetrachloroethene ClC(Cl)=C(Cl)Cl
113 tetrachloromethane ClC(Cl)(Cl)Cl
114 tetracycline NC(=O)C1=C(O)[C@@H](N(C)C)C2CC3[C@](C)(O)c4cccc(O)c4C(=O)C3=C(O)[C@]2(O)C1=O
115 theophylline CN1C(=O)N(C)C(=O)C2=C1NC=N2
116 thiazopyr CC(C)Cc1c(C(=O)OC)c(C(F)F)nc(C(F)(F)F)c1C2=NCCS2
117 thifensulfuron n1c(OC)nc(C)nc1NC(=O)NS(=O)(=O)C2=C(C(O)=O)SC=C2
118 thioridazine CN1CCCCC1CCN2c3ccccc3Sc4ccc(SC)cc24
119 toluene Cc1ccccc1
120 4-toluidine Cc1ccc(N)cc1
121 1,2,3-trichlorobenzene Clc1c(Cl)c(Cl)ccc1
122 1,1,1-trichloroethane ClC(Cl)(Cl)C
123 trichloroethene ClC(Cl)=CCl
124 trichloromethane ClC(Cl)Cl
125 trifluralin FC(F)(F)c1cc(N(=O)=O)c(N(CCC)CCC)c(N(=O)=O)c1
126 2,4,6-trihydroxytoluene Cc1c(O)cc(O)cc1O
127 2,3,6-trimethylphenol Oc1c(C)c(C)ccc1C
128 2,4,6-trimethylphenol Oc1c(C)cc(C)cc1C
129 3,4,5-trimethylphenol Oc1cc(C)c(C)c(C)c1
130 2,4,6-trinitrotoluene Cc1c(N(=O)=O)cc(N(=O)=O)cc1N(=O)=O
131 verapamil COc1c(OC)cc(cc1)CCN(C)CCCC(C(C)C)(C#N)c2cc(OC)c(OC)cc2
132 warfarin c1ccccc1C(CC(C)=O)C2=C(O)c3ccccc3OC2(=O)
133 zanaflex N1CCN=C1NC2=C(Cl)C=CC3=NSN=C23
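The rows of the tables in this appendix follow a "No Name SMILES" layout in which the compound name may itself contain spaces, while a SMILES string never does. As a minimal sketch, assuming each row sits on a single line, a row can be split into its three fields by taking the first token as the number and the last token as the SMILES. The function name `parse_row` is hypothetical and is not part of the original work.

```python
def parse_row(line):
    """Split one 'No Name SMILES' row into (number, name, smiles).

    Assumes the row is on a single line: the first whitespace-delimited
    token is the compound number and the last one is the SMILES string.
    Everything in between is treated as the compound name.
    """
    number, rest = line.split(maxsplit=1)
    name, smiles = rest.rsplit(maxsplit=1)
    return int(number), name, smiles


# Three rows copied from table B.2
rows = [
    "12 benzene c1ccccc1",
    "36 diazepam (valium) O=C1N(C)c2ccc(Cl)cc2C(=NC1)c3ccccc3",
    "122 1,1,1-trichloroethane ClC(Cl)(Cl)C",
]
table = [parse_row(row) for row in rows]
```

Rows whose name and SMILES are printed on separate lines would need to be joined before being passed to such a function.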
B.3. Compounds from the MEIC data set included in the QSAAR and QSAR analyses (Chapter 14)
No Name SMILES
1 acetaminophen CC(=O)Nc1ccc(O)cc1
2 acetylsalicylic acid CC(=O)Oc1ccccc1C(O)=O
4 diazepam c1ccccc1C2=NCC(=O)N(C)c3ccc(Cl)cc23
6 digoxin CC1OC(CC(O)C1O)OC2C(C)OC(CC2O)OC3C(C)OC(CC3O)OC4CCC5(C)C(CCC6C5CC(O)C7(C)C(CCC67O)C8=CC(=O)OC8)C4
7 ethylene glycol OCCO
8 methanol CO
9 ethanol CCO
10 propan-2-ol CC(O)C
11 1,1,1-trichloroethane ClC(Cl)(Cl)C
12 phenol Oc1ccccc1
15 malathion CCOC(=O)C(CC(=O)OCC)SP(=S)(OC)OC
16 2,4-dichlorophenoxyacetic acid Clc1cc(Cl)ccc1OCC(=O)O
18 nicotine c1ccncc1C2N(C)CCC2
21 theophylline CN1C(=O)N(C)C(=O)C2=C1NC=N2
24 phenobarbital c1ccccc1C2(CC)C(=O)NC(=O)NC2=O
31 warfarin c1ccccc1C(CC(C)=O)C2=C(O)c3ccccc3OC2(=O)
32 lindane [C@H]1(Cl)[C@H](Cl)[C@H](Cl)[C@@H](Cl)[C@H](Cl)[C@@H]1Cl
33 trichloromethane ClC(Cl)Cl
34 tetrachloromethane ClC(Cl)(Cl)Cl
35 isoniazid NNC(=O)c1ccncc1
36 dichloromethane ClCCl
38 hexachlorophene Clc1c(Cl)cc(Cl)c(O)c1Cc2c(O)c(Cl)cc(Cl)c2Cl
39 pentachlorophenol Clc1c(Cl)c(Cl)c(Cl)c(Cl)c1O
44 phenytoin c1ccccc1C3(c2ccccc2)C(=O)NC(=O)N3
45 chloramphenicol ClC(Cl)C(=O)NC(CO)C(O)c1ccc(N(=O)=O)cc1
48 caffeine CN1C(=O)N(C)c2ncn(C)c2C1(=O)