DEVELOPMENT OF STRUCTURE-ACTIVITY

RELATIONSHIPS FOR PHARMACOTOXICOLOGICAL

ENDPOINTS RELEVANT TO EUROPEAN UNION

LEGISLATION

A thesis submitted in partial fulfilment of the requirements of Liverpool John Moores University for the PhD degree

IGLIKA LESSIGIARSKA

School of Pharmacy and Chemistry

Liverpool John Moores University

Byrom Street

Liverpool L3 3AF

England

and

Institute for Health & Consumer Protection

Joint Research Centre

European Commission

I-21020 Ispra (VA)

Italy

April 2006

ABSTRACT

In this project quantitative structure-activity relationships (QSARs) were developed for

several toxicological endpoints, including chemical cytotoxicity and acute toxicity, and

biokinetic parameters related to penetration of chemical compounds through the blood-

brain barrier (BBB). QSARs are computer-based mathematical models, which give

information about the intrinsic properties of compounds (such as potential biological

effects) on the basis of their chemical structure alone. For the regulatory assessment of

chemicals and chemical products, the proposed new EU legislation called REACH

(Registration, Evaluation and Authorisation of Chemicals) foresees that there will be an

increased use of QSARs as an alternative approach to (animal) testing.

In this project, QSARs were developed for compound penetration through the BBB in

vivo and through several membrane models in vitro, taking into account penetration by

both passive diffusion and active transport. The classification of compounds as low or

high BBB penetrators was explored and a simple classification QSAR model based on

compound lipophilicity and H-bonding ability was obtained. The BBB transport of

compounds known to interact with the P-glycoprotein (one of the BBB efflux transport

systems) was modelled by 3D-QSAR analysis, using hydrophobic and electrostatic

molecular fields.

Toxicities to a broad range of biological systems were also investigated, including

unicellular organisms like bacteria and algae, isolated human and rodent cells, and in vivo

toxicity to Daphnia, fish, rodents and humans. Similarities between the toxic effects for

some of these systems were identified. Baseline toxicity effects (relationships between

compound toxicity and lipophilicity) attributed to non-polar narcotics were investigated

and separate QSARs were obtained for compounds acting by different mechanisms of

toxic action or belonging to different chemical types. Classification QSAR models were

obtained applying the EU classification scheme for chemical toxicity. The project also

includes an investigation of the feasibility of predicting in vivo human toxicity by

combining in vitro data and molecular descriptors.

The QSAR models developed contributed to a mechanistic understanding of the

investigated biological effects. Some of the models could be applied in integrated testing

strategies for the assessment of regulatory endpoints based on alternative (non-animal)

methods.

CONTENTS

Acknowledgements

Abbreviations

Chapter 1. Introduction and overview 1

1.1. Introduction to the project 1

1.2. Overview of the thesis 2

Part I. Background to the research 5

Chapter 2. Introduction to (Q)SAR analysis 5

Chapter 3. Descriptors of chemical structure 10

3.1. Descriptors of lipophilicity (hydrophobicity) and hydrophilicity 10

3.2. Constitutional descriptors 13

3.3. Geometrical descriptors 13

3.4. Topological and electrotopological indices 14

3.5. Electronic descriptors 23

Chapter 4. 3D-Molecular conformations 28

4.1. Molecular mechanical calculations 28

4.2. Quantum mechanical calculations 29

4.3. Obtaining 3D-molecular conformation with minimum energy 35

Chapter 5. Mathematical procedures and statistical methods 37

5.1. Linear regression analysis 37

5.2. Principal Components Analysis 44

5.3. Partial Least Squares regression 46

5.4. Linear discriminant function analysis 48

5.5. Classification trees 54

5.6. Cluster Analysis 58

5.7. Selecting variables for statistical analyses 59

5.8. Cross-validation statistical procedures 60

Chapter 6. Approaches for 3D-(Q)SAR analysis 62

6.1. GASP analysis 62

6.2. CoMFA and CoMSIA analyses 63

Chapter 7. Blood-brain barrier 66

7.1. Introduction to the blood-brain barrier 66

7.2. Literature review of QSARs for blood-brain barrier penetration 67

Chapter 8. Acute chemical toxicity/cytotoxicity 71

8.1. Mechanisms of toxic action 71

8.2. Literature review of QSARs for acute toxicity/cytotoxicity 73

Part II. Description of the research 78

Chapter 9. Programs for statistical algorithms 78

9.1. Objectives 78

9.2. Reduction of data multicollinearity 78

9.3. Program for selection of regression equations with best statistical fit 80

9.4. Contribution to existing knowledge 81

Chapter 10. Investigation of blood-brain barrier penetration 83

10.1. Objectives 83

10.2. Methods 83

10.3. Results 87

10.4. Discussion 88

10.5. Conclusions 93

Chapter 11. Investigation of BBB penetration in relation to P-glycoprotein interactions 106

11.1. Objectives 106

11.2. Methods 106

11.3. Results 109

11.4. Discussion 110


Chapter 12. Investigation of bacterial toxicity 123


12.2. Methods 123

12.3. Results 125



Chapter 13. Investigation of acute aquatic toxicity 145


13.2. Methods 145

13.3. Results 149



Chapter 14. Investigation of acute toxicity/cytotoxicity 187


14.2. Methods 187

14.3. Results 191



Chapter 15. Summary and general discussion 212


15.2. Summary of the research results 212

15.3. Some reflections on the quality of QSARs 216

15.4. Perspectives on the use of the QSARs in integrated testing strategies 218

References 223

List of author’s publications related to the research in the project 244

Appendix A. C codes of the programs for statistical algorithms 245

Appendix B. SMILES codes for the investigated compounds 275

ACKNOWLEDGEMENTS

I should like to thank Liverpool John Moores University for giving me the possibility

to work in the interesting field of modelling biokinetic and pharmacotoxicological

endpoints, and providing me with facilities and training needed for the research.

A very special thanks to my supervisors Dr. Andrew Worth (European Chemicals Bureau,

Joint Research Centre of the European Commission, Ispra, Italy), Dr. Mark Cronin

(Liverpool John Moores University), and Prof. John Dearden (Liverpool John Moores

University), who assisted me in every step of my work, and helped me to gain knowledge

and training as a researcher. Needless to say, without them my PhD project would not

have its present form. I also thank my advisors Dr. Ilza Pajeva (Central Laboratory of

Biomedical Engineering, Bulgarian Academy of Sciences, Bulgaria) and Dr. Tatiana

Netzeva (Liverpool John Moores University, and European Chemicals Bureau, Joint

Research Centre of the European Commission, Ispra, Italy) who supported me with

valuable advice and suggestions.

I acknowledge the European Commission’s DG Joint Research Centre (JRC), for giving

me a training grant (JRC contract 19120-2002-01 P1B20 ISP IT). The two units of the

JRC - European Centre for the Validation of Alternative Methods (ECVAM) and

European Chemicals Bureau (ECB), where the work was mainly performed, provided me

with necessary facilities and research environment to complete the project. Working in an

international environment of specialists in a broad range of fields was very helpful for my

development as a researcher. A special thanks to Pilar Prieto (ECVAM, Joint Research

Centre, Italy) and Joaquin Baraibar-Fentanes (ECB, Joint Research Centre, Italy) for

providing me with data sets needed for the research.

ABBREVIATIONS

2D 2-Dimensional

3D 3-Dimensional

AM1 Austin Model 1 semi-empirical quantum mechanical method

BB Ratio between compound concentration in brain and that in plasma at the steady state

BBB Blood-brain barrier

BBEC Membrane system constructed by bovine and rat subcultured primary brain capillary endothelial cells, co-cultured with primary rat astrocytes

CA Cluster analysis

Caco-2 Membrane system constructed by human epithelial colon carcinoma cell line

CART Classification and regression tree splitting

CoMFA Comparative molecular field analysis

CoMSIA Comparative molecular similarity indices analysis

CT Classification tree

DA Discriminant analysis

DBU Discriminant-based univariate splitting

EC European Commission

ECB European Chemicals Bureau

ECVAM European Centre for the Validation of Alternative Methods

EMS Explained mean square

ESS Explained sum of squares

EU European Union

F Fisher statistic

GASP Genetic algorithm similarity program

GTO Gaussian-type orbitals

H-bond Hydrogen-bond

HAP Approximate acute human blood/serum peak LC50 values

HLD Acute oral human lethal doses

HOMO Highest occupied molecular orbital

IC50 Concentration that inhibits 50% of enzyme reaction or cell growth

JRC Joint Research Centre

LC50 Concentration that causes 50% lethality in a group of test animals

LD50 Dose that causes 50% lethality in a group of test animals

logD Logarithm of the oil-water distribution coefficient

logDoct Logarithm of the octanol-water distribution coefficient

logP Logarithm of the oil-water partition coefficient

logPoct Logarithm of the octanol-water partition coefficient

LUMO Lowest unoccupied molecular orbital

MAE Mean absolute error

MEIC Multicentre Evaluation of In Vitro Cytotoxicity

MEMO MEIC monographs on time-related human lethal blood concentrations

MDCK Membrane system constructed by dog kidney epithelial cell line

MDR Multidrug resistance

MO-LCAO Molecular orbitals – a linear combination of atomic orbitals

NCD New Chemicals Data Base

NDDO Neglect of diatomic differential overlap

NDO Neglect of differential overlap

P-gp P-glycoprotein

Papp Apparent permeability coefficient

PCA Principal components analysis

Pe Endothelial permeability coefficient

PLS Partial least squares

PM3 Parametric Model 3 semi-empirical quantum mechanical method

R2 Squared regression coefficient of determination

R2adj Adjusted squared regression coefficient of determination

RHF Restricted Hartree-Fock method

RMS Residual mean square

RSS Residual sum of squares

Q2 Cross-validated squared coefficient of determination

QSAAR Quantitative structure-activity-activity relationship

QSAR Quantitative structure-activity relationship

s Standard error of estimate

SCF Self-consistent field

SEP Cross-validated standard error of prediction

STO Slater-type orbitals

SV-ARBEC Membrane system constructed by rat SV40 immortalised brain microvascular endothelial cells, co-cultured with SV40 immortalised rat astrocytes

TIs Topological indices

TLSER Theoretical linear solvation energy relationship

TSS Total sum of squares

UHF Unrestricted Hartree-Fock method

CHAPTER 1

INTRODUCTION AND OVERVIEW

1.1. Introduction to the project

The general aim of the project was to develop quantitative structure-activity relationships

(QSARs) for one or more of the toxicological endpoints currently mandated for testing

under the Dangerous Substances Directive (EC, 1967) of the European Union (EU).

These endpoints include acute and chronic toxicity, genotoxicity, carcinogenicity, skin

and eye irritation and corrosivity, skin and respiratory sensitisation, biokinetics

(absorption, distribution, metabolism, elimination), and they characterise the toxicological

hazard of a chemical. QSARs are computer-based mathematical models, which relate the

biological activity of compounds to their chemical structure. In comparison with other

methods for assessing toxicological endpoints, such as animal-based and in vitro methods,

QSAR models are not only easy to apply, but are also efficient in terms of time and

financial cost. In addition, the QSAR models sometimes contribute to the mechanistic

understanding of the pharmacotoxicological effects being modelled.

QSARs, together with in vitro methods, provide alternative methods to animal testing of

toxicity. Alternative methods are designed to replace, reduce and refine animal use, as

required by the Directive on the Protection of Laboratory Animals (EC, 1986). The saving

potential of the alternatives has recently been evaluated by Pedersen et al. (2003), and

Van der Jagt et al. (2004). Under the proposed new EU legislation for chemicals and

chemical products, called REACH (Registration, Evaluation and Authorisation of

Chemicals) (EC, 2001b; EC, 2003), it is foreseen that there will be an increased use of

QSARs as alternative methods for hazard and risk assessment of chemicals. In addition to

the evaluation of the toxic effect of a substance (chemical hazard), the process of

chemical risk assessment involves appraisal of the exposure of biological systems to

chemicals, i.e. evaluation of the level of chemical release in the environment and/or

chemical biouptake.

Due to some limitations of the applicability of individual alternative methods (for

example, related to particular biological systems, or chemical class, or associated with

particular levels of confidence), to reduce, refine, or replace animal use without a loss of

relevant information and safety, integrated testing strategies are developed. These are

designed as combinations of individual methods applied in a stepwise and/or parallel

fashion, in order to ensure high confidence in the results.

The project was funded by the Joint Research Centre (JRC) of the European Commission

(EC). The work was performed mainly in the Institute for Health and Consumer

Protection, which is part of the JRC, in its units European Centre for the Validation of

Alternative Methods (ECVAM) and European Chemicals Bureau (ECB). The JRC

provides scientific and technical support for the development, implementation and

monitoring of EU policies. The work of ECVAM is focused on coordination of the

validation of alternative methods at the EU level, exchange of information on alternative

methods, and promotion of their development, scientific and regulatory acceptance. The

ECB manages data and assessment procedures on dangerous chemicals at the EU level.

The specific aims of the research project included investigation of two of the

abovementioned toxicological and pharmaceutical (biokinetic) endpoints by QSAR

modelling. The project was focused on modelling of blood-brain barrier penetration, and

acute toxicity and cytotoxicity. QSARs published in the

literature on these endpoints were reviewed. Data sets were compiled from the literature, and

from in-house data of ECVAM and ECB. The latter include data from an ECVAM study

on in vitro models of the blood-brain barrier (BBB), and data for aquatic toxicity from the

New Chemicals Database, managed by the ECB. QSAR models were developed and

discussed in relation to their use as alternative methods and in integrated testing

strategies.

1.2. Overview of the thesis

The thesis consists of two parts. Part I contains seven chapters in which the background to

the research is presented. Part II presents the research performed in the project, also in seven

chapters. The appendices at the end of the thesis contain the C programs written for the

project and the SMILES codes of the investigated chemical

compounds.

In Part I, Chapter 2, a short introduction to the theory of QSAR analysis is presented,

including the main steps of the analysis and approaches to assess model quality. In

Chapters 3-6 some elements of QSAR analysis are discussed in detail, including the

descriptors used to represent chemical structures, methods to obtain relevant 3D

molecular conformations, mathematical and statistical techniques, and 3D-QSAR

approaches. In Chapters 7 and 8 an introduction to the underlying biology of the two

pharmacotoxicological endpoints is given, including the structure and function of the

blood-brain barrier, and mechanisms of acute toxic action of chemicals. Additionally,

reviews of published QSARs in these areas are presented.

In Part II, Chapter 9, two computational algorithms with applications in QSAR analysis

are suggested. The first one reduces variable multicollinearity in a data set containing a

large number of variables. The second algorithm applies best-subsets

regression to obtain the regression models with the best statistical fit.

In Chapters 10 and 11, the QSAR investigation of compound penetration through the

BBB is presented. The work reported in Chapter 10 is based on two data sets including

compounds transported by both passive diffusion and active transport. The first data set

contains data for both in vivo BBB permeability, and permeability through several in vitro

membrane models of the BBB. Models that describe the in vivo BBB penetration by

combining in vitro membrane penetration and structural descriptors (QSAARs) were

developed. The second data set contained in vivo BBB penetration data for a large number

of compounds (approximately 150). QSARs were developed, and additionally, the

compounds were classified into low and high penetrators through the BBB and a

classification QSAR model was obtained.

In Chapter 11 an investigation of structural features influencing transport across the BBB

of compounds known to inhibit one of the BBB efflux transport systems, P-glycoprotein,

is reported. The investigated compounds were imipramine and phenothiazine derivatives.

Their mechanism of transport across the BBB involves passive diffusion and P-

glycoprotein interactions. Common 3D structural characteristics of these compounds

related to their mechanism of transport across the BBB were identified.

The next three chapters of Part II (Chapters 12-14) present the work related to analysis of

acute toxicity and cytotoxicity, which are considered to be related effects. Toxicities to a

broad range of biological systems were investigated, including unicellular organisms like

bacteria and algae, isolated human and rodent cells, and in vivo toxicity to Daphnia, fish,

rodents and humans. Similarities between toxic effects for these systems were investigated.

Separate QSARs were obtained for compounds acting by different mechanisms of toxic

action or belonging to different chemical types. QSARs were also developed by

combining different compounds and mechanisms of action.

Chapter 12 presents an investigation of toxicity to the bacterium Sinorhizobium meliloti.

A large data set of 140 compounds was used to develop mechanistically-based and

chemical class-based QSARs.

In Chapter 13 an investigation of environmental toxic effects is reported. Five aquatic

species (two algal species, Daphnia, and two fish species) were included in the study. The

data were taken from the New Chemicals Database of the European Union. Interspecies

correlations were developed to reveal similarities between toxicity to different species.

Baseline toxicity effects attributed to non-polar narcotics and QSARs for the whole data

sets were investigated.

Classification QSAR models were obtained by applying the EU classification scheme, which assigns

compounds to very toxic, toxic, harmful, and non-toxic classes for aquatic species. Additionally,

two-group classification into dangerous and non-toxic compounds was investigated, with

the group of dangerous compounds being formed by uniting the groups of very toxic,

toxic, and harmful compounds from the EU classification scheme. The biological data

used were assessed in terms of their quality.

In Chapter 14 a study on toxicity to a broad range of biological species, including

bacterial strains, rodent and human cell lines, rodents and humans, is presented. The data

were taken from the MEIC (Multicentre Evaluation of In Vitro Cytotoxicity) programme

(Bondesson et al., 1989; Ekwall et al., 1998a). The toxicity endpoints were compared.

The study explored an approach to predict in vivo human toxicity by combination of in

vitro data and molecular descriptors (QSAAR analysis). QSAR models for the

investigated endpoints were also developed.

The final Chapter 15 of the thesis summarises the results of the project. Proposals for use

of the developed QSARs in integrated testing strategies are made. In addition a

commentary on the quality and applicability of the QSARs is provided.

CHAPTER 2

INTRODUCTION TO (Q)SAR ANALYSIS

The objective of (quantitative) structure-activity relationship ((Q)SAR) analysis is to

derive empirical models that relate the biological activity of compounds to their chemical

structure. SAR analysis is based on the assumption that the chemical structure of a

compound implicitly determines its behaviour in biological systems and the biological

response to it. There can be both qualitative SARs and quantitative SARs (QSARs),

depending on the means used to describe the chemical structure and on the nature of the

derived relationship. In QSAR analysis, quantitative descriptors are used to describe the

chemical structure and the analysis results in a mathematical model describing the

relationship between the chemical structure and biological activity. In Figure 2.1 the main

steps in QSAR analysis are shown.

To develop (Q)SARs, a series of compounds, called a training set, is used. The

compounds in the training set ideally possess the same or similar mechanism of biological

action to ensure that the same factors influence the activity of all compounds under

investigation. For all compounds in the series, biological activities are evaluated and

compound structural descriptors are calculated. Statistical tools are then used to derive

QSARs.

The selection of a suitable training set is an important step in the QSAR analysis since the

representativeness and the size of the training set and the quality of the biological data

will affect all of the following steps. The biological activity of compounds is usually

assessed by using equilibrium constants of binding to biological macromolecules, enzyme

or cell-proliferation inhibition constants, rate constants of metabolism or distribution in

biological systems, and also, related to them, compound concentrations and doses causing

a certain biological effect (e.g., compound concentration that inhibits 50% of enzyme

reaction or cell growth (IC50), compound dose that causes 50% of a certain biological

effect (ED50), compound dose that causes death of 50% of laboratory animals (LD50)).

The mathematical models derived in the QSAR analysis are based on a linear or non-linear

dependence of the Gibbs free energy (∆G) of the investigated process on the

compound structural descriptors. ∆G is related to the parameter describing biological

activity K (equilibrium constant of binding or inhibition, rate constant of metabolism or

distribution) by the logarithmic dependence known as the van’t Hoff isotherm:

∆G = -2.303 RT logK, where R is the gas constant and T is the absolute temperature.

Consequently, QSAR models are often based on

the logarithm of the measured biological activity.
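The relation ∆G = -2.303 RT logK can be illustrated numerically. The sketch below is only an illustration: the function name `delta_g` and the default temperature of 298.15 K are assumptions made for the example, not values taken from the thesis.

```python
import math

R = 8.314462618  # molar gas constant, J/(mol*K)

def delta_g(log_k: float, temperature_k: float = 298.15) -> float:
    """Return the free energy change in J/mol from log10(K).

    Implements dG = -2.303 * R * T * logK, with 2.303 written as ln(10).
    The default temperature is an assumption for this example.
    """
    return -math.log(10) * R * temperature_k * log_k

# Each unit of logK adds the same free-energy increment, which is why
# activities are usually modelled on a logarithmic scale.
per_log_unit_kj = delta_g(1.0) / 1000.0
print(f"dG per log unit of K at 298.15 K: {per_log_unit_kj:.2f} kJ/mol")
```

Because ∆G is proportional to logK, a model that is linear in the descriptors for ∆G is also linear for the logarithm of the measured activity.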

To derive QSAR models, an appropriate representation of the chemical structure is

necessary. For this purpose, descriptors of the structure are commonly used. These

descriptors are generally understood as being any term, index or parameter conveying

structural information. Commonly used descriptors in the QSAR analysis are presented

in Table 2.1. Some of them are obtained directly from the chemical structure, e.g.

constitutional, geometrical, and topological and electrotopological descriptors (Table 2.1).

Others represent physicochemical properties determined by the chemical structure

(lipophilicity and hydrophilicity descriptors, electronic descriptors, energies of

interaction, Table 2.1).

Some of the listed descriptors can be obtained experimentally (for example, logP,

aqueous solubility). However, most of the descriptors are calculated using specially

developed software packages (for example, ACD/LogP (Advanced Chemistry

Development Inc.) and KOWWIN (Syracuse Research Corporation) for calculating

aqueous solubility and logP of partition between octanol and water phases; Tsar (Oxford

Molecular) or Dragon (TALETE srl) for topological descriptors). In some 3D-QSAR

approaches energies of interaction with a probe atom are used, for which again

specialised software packages have been developed (e.g. Sybyl, Tripos Associates).

Statistical methods are used to relate biological activity to chemical structure. Statistical

methods include, for example, regression analysis, principal components analysis,

discriminant analysis, partial least squares, classification trees and neural network

techniques. Statistical software packages are used to perform these analyses. Some of

these statistical techniques are used to derive continuous models which relate the

structural descriptors with values of the biological parameters (linear regression, partial

least squares regression). Other techniques (discriminant analysis, classification trees) are

used to develop models which classify chemical compounds into groups according to their

activity (for example, groups of toxic and non-toxic compounds).

The development of (Q)SAR models has several purposes and applications:

- (Q)SARs can lead to better understanding of the mechanisms of interaction

between compounds and biological systems. They may reveal important structural

features for the biological effect.

- QSAR models provide useful information about a dose range for a biological

effect of a compound, thus helping the experimental design (selection of doses and

tests) in drug research and toxicity testing.

- QSAR models are used to predict the activity of new (hypothetical) chemical

compounds, even before their synthesis. Thus, QSARs can save time and

experimental resources for synthesising and biological testing of large numbers of

compounds. QSARs offer possibilities for reduction or replacement of animal use

in research and toxicity testing.

To perform properly in these applications, QSARs need to reflect the actual relationships

between structure and activity. Thus, evaluation of their quality (reliability) is necessary.

The quality of a model can be assessed using the following means:

- statistical parameters of the model (correlation coefficient, standard error of

estimation, Fisher-test values, Wilks’ λ , see Chapter 5);

- statistical validation procedures developed for this purpose (cross-validation,

see Chapter 5, Section 5.8 “Cross-validation statistical procedures”);

- validation on an external test set. In this case the model is used for prediction of

activity of chemical compounds that were not included in the model training set.
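The first two quality measures in the list above can be sketched for a one-descriptor linear model. This is a minimal illustration, assuming the common definitions R2 = 1 - RSS/TSS and a leave-one-out Q2 = 1 - PRESS/TSS; the function names and the data values are hypothetical and are not taken from the thesis or its C programs.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx

def r_squared(x, y):
    """Fitted coefficient of determination, 1 - RSS/TSS."""
    a, b = fit_line(x, y)
    my = sum(y) / len(y)
    rss = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    return 1.0 - rss / tss

def q_squared_loo(x, y):
    """Leave-one-out Q2: refit without each compound, then predict it."""
    my = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a * x[i] + b)) ** 2
    tss = sum((yi - my) ** 2 for yi in y)
    return 1.0 - press / tss

# Made-up descriptor and activity values, for illustration only.
descriptor = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
activity = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
print(r_squared(descriptor, activity), q_squared_loo(descriptor, activity))
```

For a least-squares fit the leave-one-out Q2 cannot exceed R2, so a large gap between the two suggests that the model depends strongly on individual compounds. Validation on an external test set would instead apply the fitted line to compounds that were never used in `fit_line`.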

Similar to QSAR analysis is quantitative structure-activity-activity relationship (QSAAR)

analysis. It is performed in the same way and it has the same characteristics as the QSAR

analysis, with the exception that it uses some biological endpoint(s) in combination with

descriptors of the chemical structure to derive models for the investigated biological

activity. Thus, the QSAAR models relate the investigated biological activity to a

combination of descriptors of chemical structure and other biological endpoints. These

endpoints are usually less complex than the investigated activity and represent parts of the

investigated biological effect. For example, in vivo toxicity may be described by

combining in vitro cytotoxicity and descriptors of the chemical structure. The descriptors

of the chemical structure included in the QSAAR models reveal factors that influence the

investigated biological effect (in vivo toxicity) apart from the other biological endpoint(s) (in

vitro cytotoxicity).

Figure 2.1. Main steps in QSAR analysis

[Flowchart; box labels: Training set; Biological data; Chemical structure representation; Statistical analyses; Preliminary model(s); Evaluation of the quality of the model(s); Modification of the training set and further statistical analyses; Refined model(s)]

Table 2.1. Descriptors of the chemical structure used in the QSAR analysis

Descriptor type: Examples

lipophilicity (hydrophobicity) and hydrophilicity: aqueous solubility, oil-water partition coefficient (logP), oil-water distribution coefficient (logD)

constitutional: molecular weight, total number of atoms, number of individual types of atoms, number of rings, total number of bonds, number of individual types of bonds

geometrical: molecule volume, molecule surface area, solvent accessible molecular surface area, shadow indices

topological and electrotopological: Wiener index, Balaban index, Randić indices, Kier and Hall connectivity indices, Kappa shape indices, electrotopological-state indices

electronic: dipole moments, polarisability, hydrogen bonding parameters, Hammett constant, HOMO and LUMO energies, orbital electron densities, superdelocalisabilities

energies of interaction: steric, electrostatic, or hydrophobic energies of interaction with a given atom (molecule) at certain points in space

CHAPTER 3

DESCRIPTORS OF CHEMICAL STRUCTURE

In this chapter the descriptors used for chemical structure representation are presented.

Only descriptors used in the current project are described. The main grouping of

descriptors is presented in Table 2.1. As mentioned in the introduction, some descriptors

are obtained directly from the chemical structure (constitutional, geometrical, and

topological and electrotopological descriptors), whilst others represent physicochemical

properties (lipophilicity and hydrophilicity descriptors, electronic descriptors). Another

characteristic of the descriptors is that the values of some of them depend only on the 2D

chemical formula (constitutional, topological and electrotopological descriptors), whereas

the values of others are influenced additionally by the 3D molecular conformations

(geometrical descriptors, electronic descriptors, energies of interactions).

The descriptor presentation starts with lipophilicity and hydrophilicity descriptors, due to

the fact that the oil-water partition coefficient (logP) is the most widely used descriptor in

the QSAR analyses included in the project. Afterwards the descriptors are presented in

order of increasing complexity of their definition and calculation.

3.1. Descriptors of lipophilicity (hydrophobicity) and hydrophilicity

3.1.1. Oil-water partition coefficient

Partition coefficient (P) is defined as the ratio of the solute concentration in the oil phase

to the non-ionised solute concentration in the water phase, at equilibrium:

P = C_{oil} / C_{water} (3.1)

where C_{oil} is the equilibrium concentration of the solute in the oil phase

C_{water} is the equilibrium concentration of the solute in water.

P describes the distribution of a compound between two phases – oil and water. It is

generally used in its logarithmic form (logP). A logP of zero indicates that the solute is

equally soluble in the two phases, a negative logP means that the solute is more soluble in

water, and a positive value indicates a greater solubility in the oil phase.
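The definition in Equation 3.1 and the sign convention for logP can be expressed as a short sketch. The function names `log_p` and `preferred_phase` are hypothetical, introduced here only for illustration; the concentrations are assumed to be equilibrium values in the same units.

```python
import math

def log_p(c_oil: float, c_water: float) -> float:
    """Return log10(P) with P = C_oil / C_water (Equation 3.1)."""
    if c_oil <= 0 or c_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(c_oil / c_water)

def preferred_phase(logp: float) -> str:
    """Interpret the sign of logP as described in the text."""
    if logp > 0:
        return "oil"
    if logp < 0:
        return "water"
    return "neither (equally soluble)"

# A solute 100 times more concentrated in the oil phase than in water.
value = log_p(100.0, 1.0)
print(value, preferred_phase(value))
```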

Examples of oil phases are octanol, cyclohexane and chloroform. LogP is probably the most

commonly used descriptor of lipophilicity, and is usually interpreted in biological terms

as a measure of the ability of the solute to cross lipid membranes. Molecules with low

logP values cannot easily enter the lipid phase of the membranes, while molecules with

high logP values are trapped in the membrane. Therefore, only molecules with

intermediate logP values (e.g. between about 0 and 4) can readily cross membranes (by

passive diffusion).

LogP can be determined by a number of experimental methods. However, the

measurements can be time-consuming and, due to differences in the experimental

protocol, a broad range of reported values exists for some chemicals (Dearden, 2002).

Various computational methods and software are available for calculation of logP for the
octanol-water system (logPoct). In most of these methods the molecule of interest is
divided into fragments (or atoms), and the individual contributions of the fragments (or
atoms) to logPoct are summed. Correction factors for the interactions between the
individual fragments (or atoms) in the molecule are then added. Commonly used programs
for calculating logPoct, which differ in the molecular fragments and correction factors
used, include ACD/LogP (Advanced Chemistry Development, Inc.), KOWWIN (Syracuse
Research Corporation), and ClogP (Daylight Chemical Information Systems, Inc.).

Also, statistical models are used in some software for calculating logPoct. In the software

CSlogP (ChemSilico LLC, available on the internet: www.logp.com) logPoct is calculated

by using a neural network model based on topological descriptors of the chemical

structure.

3.1.2. Oil-water distribution coefficient

LogP is calculated for the non-ionised forms of the chemicals. When a compound has

ionisable groups, to account for the distribution between oil and aqueous phases of the

ionic forms (microspecies) of the compound, the oil-water distribution coefficient (D) is

used. D is equal to the ratio of the total concentration (the sum of the concentrations of all

microspecies including ionised and non-ionised forms) in the oil phase to the total

concentration in the aqueous phase:

$\log D = \log \left( \sum C_i^{oil} / \sum C_i^{water} \right)$ (3.2)

where $\sum C_i^{oil}$ is the sum of the concentrations of all microspecies in oil

$\sum C_i^{water}$ is the sum of the concentrations of all microspecies in water.

The value of logD depends on the pH of the aqueous phase. For chemical compounds

containing a single ionisable group, logD can be calculated from the acid-base

dissociation constant of the compound (pKa) and the pH of the aqueous phase by applying

the equation:

$\log D = \log P - \log(1 + 10^{pH - pK_a})$ for acids (3.3)

$\log D = \log P - \log(1 + 10^{pK_a - pH})$ for bases (3.4)
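Equations 3.3 and 3.4 can be evaluated directly. The sketch below assumes a compound with a single ionisable group, as the text requires; the function name `log_d` and the numerical inputs are hypothetical and serve only to illustrate the pH dependence.

```python
import math

def log_d(log_p: float, pka: float, ph: float, kind: str) -> float:
    """logD for a compound with ONE ionisable group (Equations 3.3 and 3.4)."""
    if kind == "acid":
        exponent = ph - pka   # Equation 3.3
    elif kind == "base":
        exponent = pka - ph   # Equation 3.4
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    return log_p - math.log10(1.0 + 10.0 ** exponent)

# Hypothetical acid with logP = 2.0 and pKa = 4.0 at three pH values.
for ph in (2.0, 4.0, 7.4):
    print(f"pH {ph}: logD = {log_d(2.0, 4.0, ph, 'acid'):.2f}")
```

When the pH equals the pKa, half of the compound is ionised and logD lies log10(2), about 0.3 units, below logP; far on the non-ionised side of the pKa, logD approaches logP.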

Commercially available software for calculating logD in the octanol-water system
(logDoct) includes ACD/LogD (Advanced Chemistry Development, Inc.), which uses
equilibrium constants for the distribution of microspecies between the octanol and the
aqueous phases to calculate logDoct. Two different algorithms are available: the first
takes into account the penetration of ion pairs into the octanol phase, whereas the
second does not. Other software for calculating logDoct includes PrologD from the
package PALLAS (CompuDrug Inc.), and CSlogD (ChemSilico LLC), which is based on neural
network analysis.

3.1.3. Aqueous solubility

Aqueous solubility is defined as the maximum concentration of a compound that will

dissolve in pure water at a certain temperature, at equilibrium. It is a measure of

compound hydrophilicity. The amount of a compound available for absorption,

distribution, elimination, and/or the compound concentration at the site of biological

action depends on its aqueous solubility.

For the purposes of QSAR modelling, the solubility is usually expressed in units mol/l

and logarithmically transformed. It can be determined experimentally; however,

nowadays commonly available software for calculating aqueous solubility exists (e.g.

ACD/LogP from Advanced Chemistry Development Inc., WSKOWWIN from Syracuse

Research Corporation). The calculations are based on relationships between the logarithm

of the aqueous solubility and the octanol-water partition coefficient (logPoct), melting

point (for solid compounds at the specific temperature), and molecular weight. Also,

statistical models are used in some software for calculating aqueous solubility; for

example the software CSlogWS (ChemSilico LLC, available on the internet:

www.logp.com) uses a neural network model based on topological descriptors.

3.1.4. Hydrophilic factor

Another hydrophilicity descriptor (hydrophilic factor, Hy) was defined by Todeschini et

al. (1997). It is calculated by the following equation:

$Hy = \dfrac{(1 + N_{Hy}) \log_2 (1 + N_{Hy}) + n_C \left[ \frac{1}{n_A} \log_2 \frac{1}{n_A} \right] + \sqrt{N_{Hy} / n_A^2}}{\log_2 (1 + n_A)}$ (3.5)

where NHy is the number of hydrophilic groups (for example OH, SH, NH)

nC is the number of carbon atoms

nA is the number of atoms in the molecule (excluding hydrogen atoms).
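A minimal sketch of Equation 3.5 follows. It assumes that the three counts have already been obtained from the structure; the function name and the two example molecules (ethanol and ethane, counted by hand) are illustrative choices rather than part of the original definition.

```python
import math

def hydrophilic_factor(n_hy: int, n_c: int, n_a: int) -> float:
    """Hydrophilic factor Hy (Equation 3.5).

    n_hy: number of hydrophilic groups (e.g. OH, SH, NH)
    n_c:  number of carbon atoms
    n_a:  number of non-hydrogen atoms
    """
    if n_a < 1:
        raise ValueError("the molecule must contain at least one non-hydrogen atom")
    term_groups = (1 + n_hy) * math.log2(1 + n_hy)
    term_carbon = n_c * (1.0 / n_a) * math.log2(1.0 / n_a)
    term_root = math.sqrt(n_hy / n_a ** 2)
    return (term_groups + term_carbon + term_root) / math.log2(1 + n_a)

# Hand-counted inputs: ethanol (one OH, two C, three heavy atoms) and ethane.
print("ethanol:", round(hydrophilic_factor(1, 2, 3), 3))
print("ethane: ", round(hydrophilic_factor(0, 2, 2), 3))
```

With these inputs the alcohol receives a positive value and the hydrocarbon a negative one, which is the ordering expected of a hydrophilicity descriptor.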

3.2. Constitutional descriptors

Constitutional descriptors are broadly used in QSAR analysis. This descriptor group

includes molecular weight, number of different types of atoms present in a molecule,

number of rings, number of different types of bonds, number of different functional

groups. They encode the size of molecules (molecular weight, number of atoms, number

of rings), and chemical properties (type and number of functional groups).

3.3. Geometrical descriptors

These descriptors reflect features of the molecular geometry. Examples of such

descriptors include distances between particular points of the molecular surface (the two

farthest points, the two closest points), and distances between given chemical groups. The

most widely used geometrical descriptors are molecular surface area and molecular

volume, which are discussed in the following paragraphs.

3.3.3. Molecular surface area and molecular volume

Molecular surface area is the area of the outer surface of the volume from which solvent
molecules are excluded due to the presence of the solute molecule in a solution
(solvent-excluding surface). It is based on the van der Waals molecular surface, which is
defined by the van der Waals radii of the atoms (represented as spheres) in the molecule.
However, the van der Waals molecular surface contains small gaps and crevices, which are
inaccessible to other atoms and molecules (for example solvent molecules). The molecular
surface area is defined by excluding these gaps and crevices. Thus, the molecular surface
consists of the van der Waals surfaces of the atoms where they can come into contact with
the solvent molecules and, additionally, of the surfaces of solvent molecules placed in
contact with the van der Waals surfaces of two or more atoms of the investigated molecule
(Chapman and Connolly, 2001). Typically water is used as the solvent in calculations of
the molecular surface area. For practical reasons the water molecule is treated as a
sphere with a radius of 1.4 to 1.7 Å (Ångström, 1 Å = $10^{-10}$ m), which is the average
distance from the centre of the oxygen atom to the van der Waals surface of the water
molecule (Chapman and Connolly, 2001). Molecular volume is the volume enclosed within
this surface.

Additionally, a solvent-accessible molecular surface area is defined by the centre of a

probe sphere (solvent molecule, typically water), when it is rolled over the molecular

surface.

The calculated values of the molecular surface area and the molecular volume depend on

the molecular conformation (see Chapter 4). Molecular surface area and volume are

usually expressed in Å² and Å³, respectively. They are measures of the molecular size.

3.4. Topological and electrotopological indices

Topological indices (TIs) are derived by representing the molecules as molecular graphs.

In a molecular graph the atoms are represented as dots (vertices), which are connected to

each other by lines (edges), representing the chemical bonds (Netzeva, 2003). From the

molecular graphs, paths of a certain length can be calculated. A path of a certain length

(m) represents two atoms that are connected with m bonds in the shortest pathway

between them. For example, a path of length one represents two atoms, connected with a

bond, and a path of length two indicates two bonds between the atoms. Another term is a

walk of a certain length. A walk of length one for a given atom is equal to the number of
atoms to which it is connected. A walk of length m for a given atom is equal to the sum
of walks of length m-1 of all its neighbouring atoms (Morgan’s summation procedure)
(Rücker and Rücker, 1993). To calculate TIs, a hydrogen-suppressed molecular graph is
usually used, i.e. a molecular graph in which the hydrogen atoms are excluded.
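Morgan's summation procedure for atomic walk counts can be sketched in a few lines of Python. The adjacency-list representation of the hydrogen-suppressed graph, the function name, and the isobutane example are assumptions made for illustration.

```python
def atomic_walk_counts(adj, length):
    """Walk counts of the given length for every atom (Morgan's summation).

    adj[i] lists the neighbours of atom i in the hydrogen-suppressed graph.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    # A walk of length one equals the number of neighbours of the atom.
    counts = [len(neighbours) for neighbours in adj]
    # A walk of length m is the sum of the walks of length m-1 of the neighbours.
    for _ in range(length - 1):
        counts = [sum(counts[nb] for nb in neighbours) for neighbours in adj]
    return counts

# Isobutane skeleton: atom 0 is the central carbon, atoms 1-3 are the methyl carbons.
isobutane = [[1, 2, 3], [0], [0], [0]]
for m in (1, 2, 3):
    print(m, atomic_walk_counts(isobutane, m))
```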

Numerous topological indices have been created and used in QSAR studies. Their

calculation is easy, based on the molecular 2D structure only, thus not requiring

conformational analysis or 3D optimisation of the structure (see Chapter 4). The main

drawback of TIs is their complex and difficult interpretation.

3.4.1. Wiener index, Balaban index

These two indices are calculated on the basis of the distance matrix of a molecular graph.

The ij element of the distance matrix is equal to the number of bonds between atoms i and

j in the shortest pathway between them. The diagonal elements of the distance matrix are

equal to zero.

The Wiener index (W) was proposed by Wiener (1947). It is calculated as half the sum of
the elements of the distance matrix:

$W = \frac{1}{2} \sum_i \sum_j D_{ij}$ (3.6)

where $D_{ij}$ is the ij element of the distance matrix of the molecule. The summation is
over all atoms i and j in the molecule.

The Wiener index has greater value for molecules with linear structure (e.g. n-alkanes)

and smaller value for compact molecules with many branches and cycles (Sablić, 1990).

The Balaban index (J) (Balaban, 1982) is defined as:

$J = \dfrac{M}{\mu + 1} \sum (D_i D_j)^{-0.5}$ (3.7)

where M is the number of bonds in the molecule

$D_i$ is the sum of the elements of the i-th row of the distance matrix, corresponding to
the i-th atom; $D_i = \sum_j D_{ij}$

$D_{ij}$ is the ij element of the distance matrix of the molecule

the summation is over all pairs of bonded atoms i and j

µ is the cyclomatic number of the molecular graph, equal to the minimum number of edges
that must be removed before a polycyclic graph becomes acyclic:

$\mu = n - (N - 1)$ (3.8)

where n is the number of graph edges (bonds)

N is the number of graph vertices (atoms).

The Balaban index increases with the size of the molecule, degree of branching, and

degree of unsaturation. Its value decreases sharply with the number of rings in a molecule

(Sablić, 1990).
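The distance matrix and the two indices derived from it (Equations 3.6 to 3.8) can be sketched as follows. The code assumes a connected, hydrogen-suppressed graph supplied as an adjacency list and sums the Balaban terms over bonded atom pairs; the function names and the two butane isomers are illustrative.

```python
from collections import deque

def distance_matrix(adj):
    """Number of bonds on the shortest path between every pair of atoms (BFS)."""
    n = len(adj)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            atom, d = queue.popleft()
            dist[start][atom] = d
            for nb in adj[atom]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append((nb, d + 1))
    return dist

def wiener_index(adj):
    """Half the sum of all distance-matrix elements (Equation 3.6)."""
    return sum(map(sum, distance_matrix(adj))) // 2

def balaban_index(adj):
    """Balaban index J (Equation 3.7) with the cyclomatic number of Equation 3.8."""
    row_sums = [sum(row) for row in distance_matrix(adj)]
    bonds = [(i, j) for i in range(len(adj)) for j in adj[i] if i < j]
    m = len(bonds)
    mu = m - (len(adj) - 1)
    return m / (mu + 1) * sum((row_sums[i] * row_sums[j]) ** -0.5 for i, j in bonds)

n_butane = [[1], [0, 2], [1, 3], [2]]
isobutane = [[1, 2, 3], [0], [0], [0]]
for name, graph in (("n-butane", n_butane), ("isobutane", isobutane)):
    print(name, wiener_index(graph), round(balaban_index(graph), 4))
```

For these two isomers the linear skeleton gives the larger Wiener index and the branched skeleton the larger Balaban index, in line with the trends described above.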

3.4.2. Path/walk Randić shape indices

Path/walk Randić shape indices of order m (PWm) are calculated by summing the ratios

of the atomic path counts over the atomic walk counts of order m for all atoms, and

dividing the sum by the number of non-hydrogen atoms (Randić, 2001). Path/walk count

ratios are independent of molecular size, and these descriptors can be considered as

encoding the shape of molecules. Path/walk Randić index of first order is equal to one for

all molecules as the counts of the paths and walks of length one are equal.

3.4.3. Molecular connectivity indices

Molecular connectivity indices (χ, often called “chi” indices) were introduced by Randić
(1975), and further developed by Kier and Hall (Kier and Hall, 1986; Hall and Kier,
1991). The indices are based on dividing the hydrogen-suppressed molecular graph into
subgraphs (fragments) of equal length (number of bonds). They are denoted as
${}^{m}\chi_{t}^{v}$, where the left-side superscript m is the order of the index, equal
to the length of the fragments into which the molecule is divided, i.e. the number of
bonds in the fragments; the right-side superscript v denotes that the index is of the
valence type (if it is missing, the index is of the simple type); and the subscript t
denotes the type of structural fragments into which the molecule has been divided. In the
valence connectivity indices the presence of heteroatoms and multiple bonding in the
molecule is taken into account.

Four types of structural fragments are used in the definition of the connectivity
indices: path fragments (subscript p); cluster fragments (subscript c); path clusters
(subscript pc); and chains (rings) (subscript ch). If the subscript t is missing, the
index is assumed to be path-type. The connectivity indices of order two or less can only
be of the path type, therefore the subscript p is often omitted for them.

To calculate the connectivity indices, simple ($\delta_i$) and valence ($\delta_i^v$)
terms are calculated for each non-hydrogen atom i in the molecule. These terms are
defined as:

$\delta_i = \sigma_i - h_i$, equal to the number of skeletal neighbours of the specified atom (3.9)

where $\sigma_i$ is the number of valence electrons in the sigma orbitals of the atom i

$h_i$ is the number of hydrogen atoms bonded to the atom i.

$\delta_i^v = Z_i^v - h_i$ for atoms in the first row of the periodic table (3.10)

$\delta_i^v = (Z_i^v - h_i) / (Z_i - Z_i^v - 1)$ for other atoms (3.11)

where $Z_i^v$ is the total number of valence electrons of i

$Z_i$ is the atomic number of i, i.e. the number of all electrons of the atom i.

Connectivity terms ${}^{m}C_{t}$ or ${}^{m}C_{t}^{v}$ are defined for each fragment of
type t and order m as:

${}^{m}C_{t} = \prod (\delta_i)^{-1/2}$ for simple indices (3.12)

${}^{m}C_{t}^{v} = \prod (\delta_i^v)^{-1/2}$ for valence indices (3.13)

where the multiplication is done over all atoms in the fragment of type t and order m.

The connectivity indices are defined as:

${}^{m}\chi_{t} = \sum {}^{m}C_{t}$ for simple indices (3.14)

${}^{m}\chi_{t}^{v} = \sum {}^{m}C_{t}^{v}$ for valence indices (3.15)

where the summation is done over all fragments of type t and order m in the molecule.

For example, fragments of one atom form paths of length 0. As mentioned above, the
connectivity indices of order 0 are only of type path and therefore they are denoted as
${}^{0}\chi$ (the simple index) and ${}^{0}\chi^{v}$ (the valence index) (the subscript p
is omitted). They are equal to:

${}^{0}\chi = \sum {}^{m}C_{t} = \sum (\delta_i)^{-1/2}$ (3.16)

${}^{0}\chi^{v} = \sum {}^{m}C_{t}^{v} = \sum (\delta_i^v)^{-1/2}$ (3.17)

Fragments of two connected atoms form paths of length 1. ${}^{1}\chi$ (the simple index)
and ${}^{1}\chi^{v}$ (the valence index) are equal to:

${}^{1}\chi = \sum {}^{m}C_{t} = \sum (\delta_i \delta_j)^{-1/2}$ (3.18)

${}^{1}\chi^{v} = \sum {}^{m}C_{t}^{v} = \sum (\delta_i^v \delta_j^v)^{-1/2}$ (3.18)

where i and j refer to the two connected atoms in the fragments, and the summation is

over all possible two-atom fragments.
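The zero- and first-order indices can be computed with a short Python sketch. It handles only atoms for which the valence delta is given by Equation 3.10 and assumes that no atom has a delta value of zero; the data layout, the function names, and the ethanol example are illustrative assumptions.

```python
# Hydrogen-suppressed ethanol, CH3-CH2-OH.
# Each atom is (number of valence electrons Z^v, number of attached hydrogens h).
atoms = [(4, 3), (4, 2), (6, 1)]
bonds = [(0, 1), (1, 2)]

def simple_deltas(n_atoms, bonds):
    """Simple delta: number of skeletal (non-hydrogen) neighbours of each atom."""
    delta = [0] * n_atoms
    for i, j in bonds:
        delta[i] += 1
        delta[j] += 1
    return delta

def valence_deltas(atoms):
    """Valence delta from Equation 3.10 only (Z^v - h); heavier atoms are not handled."""
    return [zv - h for zv, h in atoms]

def chi0(delta):
    """Zero-order index: sum over atoms of delta^(-1/2)."""
    return sum(d ** -0.5 for d in delta)

def chi1(delta, bonds):
    """First-order index: sum over bonds of (delta_i * delta_j)^(-1/2)."""
    return sum((delta[i] * delta[j]) ** -0.5 for i, j in bonds)

d = simple_deltas(len(atoms), bonds)
dv = valence_deltas(atoms)
print("0chi  =", round(chi0(d), 4), " 0chi_v =", round(chi0(dv), 4))
print("1chi  =", round(chi1(d, bonds), 4), " 1chi_v =", round(chi1(dv, bonds), 4))
```

The valence indices come out smaller than the simple ones here because the oxygen atom has a larger valence delta, which is how the presence of the heteroatom enters the index.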

The connectivity indices of order zero, one and two are well correlated with the number

of atoms in the molecule, i.e. with the molecular size (Netzeva et al., 2003). The values of

the indices of order one and two depend also on the molecular branching.

Molecular connectivity indices of order three or more can be of path, cluster, or chain
types. In path-type graphs each atom is connected to a maximum of two other non-hydrogen
atoms. In cluster-type graphs all atoms are connected to a common central atom. For
example, a cluster graph of third order has the skeleton of isobutane (2-methylpropane),
and a cluster graph of fourth order is neopentane-like (2,2-dimethylpropane). In
chain-type graphs the atoms participate in a ring system. A third-order chain graph has a
cyclopropane-like structure, and a sixth-order chain graph is constructed from
six-membered aromatic or aliphatic rings. Path/cluster indices can be of order four or
more, and are calculated by dividing the molecule into fragments that combine paths and
clusters.

Molecular connectivity indices are generally accepted to encode molecular size,

branching, cyclicity, unsaturation, and the presence and type of heteroatoms.

3.4.4. Difference molecular connectivity indices

Due to the large contribution of molecular size to the values of the chi indices, a high

degree of intercorrelation between the path chi indices may occur. This is observed

especially when the investigated compound set contains molecules with different sizes,

and/or insufficient structural diversity exists. In order to exclude the contribution of the

molecular size in the constitution of the chi indices, the difference connectivity indices

were constructed. Thus, other properties encoded by the chi indices, like branching,

cyclicity, unsaturation, and/or presence of heteroatoms, are emphasised.

To calculate the difference connectivity indices ($d\,{}^{m}\chi_{n}$) and the difference
valence connectivity indices ($d\,{}^{m}\chi_{n}^{v}$), a (valence) chi path index is
calculated for a hypothetical reference molecule with an unbranched (straight chain)
molecular graph and the same number of atoms as the molecule being described (denoted
${}^{m}\chi_{n}$ or ${}^{m}\chi_{n}^{v}$, respectively). This index is subtracted from
the corresponding chi path index (${}^{m}\chi_{t}$ or ${}^{m}\chi_{t}^{v}$) for the given
molecule:

$d\,{}^{m}\chi_{n} = {}^{m}\chi_{t} - {}^{m}\chi_{n}$ (3.19)

$d\,{}^{m}\chi_{n}^{v} = {}^{m}\chi_{t}^{v} - {}^{m}\chi_{n}^{v}$ (3.20)

Difference connectivity indices are not defined for cluster, path/cluster and chain chi
indices because size factors do not influence them to the same extent as the path chi
indices (QsarIS version 1.1 software, help manual).

3.4.5. Average molecular connectivity indices

The average molecular connectivity indices are denoted by 0χA (simple average indices)

and 0χAv (valence average indices) and are obtained by dividing the corresponding simple

or valence connectivity path index by the number of paths (fragments) involved in its

calculation. They are also aimed at excluding the influence of the molecular size on the

value of the index.

3.4.6. Kier benzene-likeliness index

The Kier benzene-likeliness index (BLI, Kier and Hall, 1986) is calculated by dividing
the first-order valence connectivity index ${}^{1}\chi^{v}$ by the number of non-hydrogen
bonds in the molecule and then normalising to the value for the benzene molecule. It was
proposed as a measure of molecular aromaticity.

3.4.7. Kappa shape indices

Kappa shape indices are designed to encode the overall molecular shape (Kier, 1985).

They are calculated using the counts of paths of length one (one-bond), two (two-bond)

and three (three-bond) in the hydrogen-suppressed graph of the molecule, and

correspondingly, the kappa shape indices are defined as of first, second and third order.

To calculate the kappa shape indices for a given molecule, two reference structures are

used, having the same number of atoms as the molecule of interest. The two reference

structures are constructed in such a way that they have the minimum possible (${}^{m}p_{min}$) and

the maximum possible (${}^{m}p_{max}$) number of paths of length m (m = 1, 2, 3) for the given

number of atoms in the molecule. The first reference structure for all kappa indices is the

non-branched isomer (linear graph), whose shape can be described as a cylinder or

ellipsoid. The second reference structure, which may be either real or hypothetical, is a

complete graph for m = 1 (first order kappa shape index); a star graph for m = 2 (second

order kappa shape index); and a twin star graph for m = 3 (third order kappa shape index)

(MolconnZ version 4.0 software, online help manual). A complete graph is a graph with

all atoms connected to each other. In the star graph all atoms are connected to a central

atom.

If ${}^{m}p_{i}$ is the number of paths of length m in the molecule of interest, then it
is assumed that ${}^{m}p_{min} \le {}^{m}p_{i} \le {}^{m}p_{max}$, where ${}^{m}p_{min}$
is the number of paths of length m in the linear graph, and ${}^{m}p_{max}$ is the number
of paths of length m in the second reference structure. The general formula for
calculating kappa shape indices (${}^{m}\kappa$) is the following:

${}^{m}\kappa = 2 \, {}^{m}p_{max} \, {}^{m}p_{min} / ({}^{m}p_{i})^{2}$ (3.21)

where m is the order of the index, m = 1, 2, 3.

The ${}^{m}p_{min}$ and ${}^{m}p_{max}$ values can be calculated directly from the number
of non-hydrogen atoms (nA) in the molecule. Substituting them for paths of different
length gives the following equations for calculation of ${}^{m}\kappa$:

${}^{1}\kappa = n_A (n_A - 1)^2 / ({}^{1}p_{i})^{2}$ (3.22)

${}^{2}\kappa = (n_A - 1) (n_A - 2)^2 / ({}^{2}p_{i})^{2}$ (3.23)

${}^{3}\kappa = (n_A - 3) (n_A - 2)^2 / ({}^{3}p_{i})^{2}$, when nA is even (3.24)

${}^{3}\kappa = (n_A - 1) (n_A - 3)^2 / ({}^{3}p_{i})^{2}$, when nA is odd (3.25)

where nA is the number of atoms in the molecule (excluding hydrogen atoms).
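Equations 3.22 to 3.25 can be applied once the path counts are known. The sketch below counts paths by exhaustive enumeration, which is adequate only for small graphs, and raises an error when a path count is zero because the corresponding index is then undefined by these formulas. The function names and the two pentane isomers are illustrative assumptions.

```python
def count_paths(adj, length):
    """Number of paths (no atom visited twice) with the given number of bonds."""
    def extend(path, remaining):
        if remaining == 0:
            return 1
        return sum(extend(path + [nb], remaining - 1)
                   for nb in adj[path[-1]] if nb not in path)
    # Every path is found once from each of its two ends, hence the division by 2.
    return sum(extend([atom], length) for atom in range(len(adj))) // 2

def kappa_indices(adj):
    """First-, second- and third-order kappa shape indices (Equations 3.22-3.25)."""
    n = len(adj)
    p1, p2, p3 = (count_paths(adj, m) for m in (1, 2, 3))
    if 0 in (p1, p2, p3):
        raise ValueError("a path count is zero; the index is undefined here")
    k1 = n * (n - 1) ** 2 / p1 ** 2
    k2 = (n - 1) * (n - 2) ** 2 / p2 ** 2
    if n % 2 == 0:
        k3 = (n - 3) * (n - 2) ** 2 / p3 ** 2
    else:
        k3 = (n - 1) * (n - 3) ** 2 / p3 ** 2
    return k1, k2, k3

n_pentane = [[1], [0, 2], [1, 3], [2, 4], [3]]
isopentane = [[1], [0, 2, 4], [1, 3], [2], [1]]
print("n-pentane: ", kappa_indices(n_pentane))
print("isopentane:", kappa_indices(isopentane))
```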

The kappa indices were derived assuming that all atoms in the molecule are equivalent.

The influence on the molecular shape of atoms other than carbons in the sp3 hybrid state
is accounted for by the kappa alpha shape indices (${}^{m}\kappa_{\alpha}$). They can be
obtained by modifying each nA and ${}^{m}p_{i}$ in the above equations by adding an α
value:

$\alpha = r(x) / r(C(sp^3)) - 1$ (3.26)

where r(x) is the covalent radius of atom x

r(C(sp3)) is the covalent radius of a carbon atom in the sp3 hybrid state.

Subsequently, nA is replaced by nA + α, and ${}^{m}p_{i}$ is replaced by ${}^{m}p_{i}$ + α.

3.4.8. Shape flexibility index

The flexibility of a molecule depends on the presence of cycles and/or branching. By
combining the ${}^{1}\kappa_{\alpha}$ and ${}^{2}\kappa_{\alpha}$ indices with the number
of atoms (nA) (for normalisation), a further index Φ has been defined (Kier, 1989), which
is considered to measure molecular flexibility:

$\Phi = ({}^{1}\kappa_{\alpha} \cdot {}^{2}\kappa_{\alpha}) / n_A$ (3.27)

The flexibility index decreases with increased branching or cyclicity. The presence of
unsaturation or heteroatoms results in a decrease in the flexibility index when these
atoms have a covalent radius that is less than that of a carbon in the sp3 hybrid state
(QsarIS version 1.1 software, help manual).

3.4.9. Electrotopological state (E-state) indices

The E-state indices (Kier and Hall, 1990; Hall et al., 1991; Kier and Hall, 1999) are
again defined in terms of the $\delta_i$ and $\delta_i^v$ values of an atom i, similarly
to the connectivity indices (Equations 3.9, 3.10 and 3.11, Section 3.4.3 “Molecular
connectivity indices”). The E-state index for an atom i ($S_i$, also called atom-level
E-state index) is composed of an intrinsic state term ($I_i$), plus a sum of
perturbations ($\Delta I_{ij}$) from all other atoms in the molecule:

$S_i = I_i + \sum_j \Delta I_{ij}$ (3.28)

where the summation is over the remaining atoms in the molecule.

The intrinsic state ($I_i$) and the perturbation ($\Delta I_{ij}$) terms are calculated
as follows:

$I_i = ((2 / N_i)^2 \, \delta_i^v + 1) / \delta_i$ (3.29)

where $N_i$ is the principal quantum number of the valence electrons (the row in which
the chemical element is placed in the periodic table)

$\Delta I_{ij} = (I_i - I_j) / r_{ij}^2$ (3.30)

where rij is the topological distance between atoms i and j, equal to the number of atoms

in the shortest path between them.
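Equations 3.28 to 3.30 can be sketched for a small molecule as follows. The code assumes a connected hydrogen-suppressed graph and takes r_ij as the number of atoms on the shortest path, i.e. the number of bonds plus one; the function names and the ethanol input values (counted by hand) are illustrative assumptions.

```python
from collections import deque

def bonds_apart(adj, start):
    """Number of bonds on the shortest path from start to every other atom (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nb in adj[atom]:
            if nb not in dist:
                dist[nb] = dist[atom] + 1
                queue.append(nb)
    return dist

def intrinsic_state(n_quantum, delta_v, delta):
    """Intrinsic state I_i (Equation 3.29)."""
    return ((2.0 / n_quantum) ** 2 * delta_v + 1.0) / delta

def e_state_indices(adj, intrinsic):
    """Atom-level E-state indices S_i (Equations 3.28 and 3.30)."""
    values = []
    for i in range(len(adj)):
        perturbation = 0.0
        for j, n_bonds in bonds_apart(adj, i).items():
            if j != i:
                r_ij = n_bonds + 1  # number of atoms on the shortest path
                perturbation += (intrinsic[i] - intrinsic[j]) / r_ij ** 2
        values.append(intrinsic[i] + perturbation)
    return values

# Ethanol as CH3-CH2-OH: (principal quantum number, valence delta, simple delta).
adj = [[1], [0, 2], [1]]
intrinsic = [intrinsic_state(2, 1, 1), intrinsic_state(2, 2, 2), intrinsic_state(2, 5, 1)]
for label, s in zip(("CH3", "CH2", "OH"), e_state_indices(adj, intrinsic)):
    print(label, round(s, 3))
```

Because every perturbation term appears once with each sign, the E-state values of a molecule sum to the same total as the intrinsic states; the perturbations only redistribute that total among the atoms.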

Atom-level E-state indices can be calculated for each atom (such as >C<, >N–, =O, –Cl)

in a molecule, as well as for each hydride group (such as –CH3, >NH, –OH). For

simplicity, both atoms and hydride groups are termed “atoms”. Their values encode

information about the topological environment of the atom and the electronic interactions

due to all other atoms in the molecule (MolconnZ version 4.0 software, online help

manual).

In addition to atom-level E-state values, which are computed for each atom in the

molecule, atom-type E-state indices can be calculated. The atom type E-state indices are

defined as the sum of the individual atom-level E-state values for a particular atom type.

Hydrogen atoms can also be included in calculation of E-state values for deriving

hydrogen atom-type E-state indices (Rose and Hall, 2003).

Atom-type E-state indices encode information about the electron accessibility for atoms

of the same type, the presence/absence of an atom type and the count of the atoms of a

given atom type. Thus, the E-state values account for the ability of a molecule to enter into

non-covalent intermolecular interactions (Rose and Hall, 2003).

3.4.10. E-state topological parameter

The E-state topological parameter (TIE) is calculated as follows (Voelkel, 1994):

$TIE = (n_B / n_R) \sum (S_i S_j)^{-1/2}$ (3.31)

where nB is the number of non-hydrogen bonds

nR the number of rings in the molecule

Si and Sj the E-state indices for the two connected atoms

the summation is over all bonds in the molecule.

3.5. Electronic descriptors

This group includes a large number of descriptors whose values are determined by the

spatial distribution of the electrons in a molecule. They represent global molecular

properties such as dipole moment, polarisability, molecular orbital energies (HOMO,

LUMO), as well as local properties like partial atomic charges and atomic

superdelocalisabilities.

3.5.1. Dipole moment

An electric dipole consists of a pair of charges of equal magnitude and opposite signs
(+q and –q), separated by a distance (r). The dipole moment of an electric dipole is a
vector directed from the negative to the positive charge with magnitude equal to q*r. The
magnitude of the dipole moment is measured in coulomb metres (C m) or in debyes (D),
where 1 D = 3.336 × $10^{-30}$ C m.

If the positive and negative charges in a molecule do not overlap, the molecule possesses

a permanent dipole moment (µ) (polar molecule). Molecular dipole moment is usually

calculated using the following formula:

$\mu = \sum_i q_i r_i$ (3.32)

where $r_i$ is the position vector of atom i relative to the origin of the coordinate
system (centre of charge or centre of mass)

$q_i$ is the partial charge of atom i

the summation is over all atoms in the molecule.

Usually the magnitude of the dipole moment is used as a descriptor in the QSAR analysis.

The magnitude of one or more of the vector’s components along the x, y and z Cartesian

axes can also be used. However, this requires proper alignment of the investigated

molecules in the coordinate system.
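Equation 3.32 amounts to a charge-weighted sum of atomic position vectors. The sketch below assumes partial charges in units of the elementary charge and coordinates in Å, and converts the result to debyes with the factor 1 e·Å ≈ 4.803 D; the function names and the two-point-charge example are hypothetical.

```python
import math

E_ANGSTROM_TO_DEBYE = 4.80320  # one elementary charge times one angstrom, in debyes

def dipole_vector(charges, coords):
    """Dipole moment vector, sum of q_i * r_i (Equation 3.32), in e*angstrom."""
    return tuple(sum(q * r[axis] for q, r in zip(charges, coords))
                 for axis in range(3))

def dipole_magnitude_debye(charges, coords):
    """Magnitude of the dipole moment in debyes."""
    mu = dipole_vector(charges, coords)
    return math.sqrt(sum(c * c for c in mu)) * E_ANGSTROM_TO_DEBYE

# Hypothetical dipole: +0.5 e at x = 1 angstrom and -0.5 e at the origin.
charges = [0.5, -0.5]
coords = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(dipole_vector(charges, coords))
print(round(dipole_magnitude_debye(charges, coords), 3), "D")
```

For a neutral set of charges the result does not depend on the choice of origin, whereas the individual x, y and z components do depend on how the molecule is oriented, which is why alignment matters when components are used as descriptors.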

Molecules having permanent dipole moments tend to align with each other, resulting in
attractive non-covalent van der Waals intermolecular interactions (orientation effect).
The presence of molecules with permanent dipole moments temporarily distorts the electron
charge in nearby polar or non-polar molecules, inducing further polarisation and van der
Waals intermolecular forces (induction effect). However, it is difficult to determine
which biological and physicochemical factors are encoded by the magnitude of the dipole
moment when it is included in QSAR models. The dipole moment could influence
electrostatic interactions with biological macromolecules. Electric dipoles are known to
align with externally applied electric fields; the field produced by the aligned dipoles
opposes the applied field and so decreases its magnitude. In this way the magnitude of
the dipole moment could be a measure of the ability of a molecule to diminish the
electric field across lipid membranes, which will influence membrane transport.

The dipole moment of a molecule can be measured experimentally. Quantum mechanical

methods allow calculation of the molecular dipole moment. The calculated value depends

on the molecular conformation (see Chapter 4).

3.5.2. Polarisability

Polarisability is the relative susceptibility of the electron cloud of an atom or a
molecule to be distorted from its normal shape by the presence of an external electric
field (for example, that of a nearby ion or dipole). Due to this distortion an induced
electric dipole moment appears. Polarisability (α) is a tensor relating the induced
dipole moment ($\mu_{ind}$) to the applied electric field strength (E):

$\mu_{ind} = \alpha E$ (3.33)

The non-diagonal elements of the tensor represent the polarisability of the electrons along

one of the axes of the coordinate system due to a component of the applied electric field

along another of the coordinate axes. As this effect is negligible compared to the

polarisability in the direction of the applied electric field (or it does not exist), the non-

diagonal elements of the polarisability tensor are zero or very small compared to the

diagonal elements. Therefore, in practice the polarisability is presented as ‘mean

polarisability’, i.e., the average polarisability over the three axes of the molecule, and is

equal to one third (1/3) of the sum of the tensor diagonal elements (the trace of the

tensor).
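Both the tensor relation of Equation 3.33 and the mean polarisability can be illustrated with plain nested lists. The tensor values below are hypothetical and the function names are assumptions made for this sketch.

```python
def induced_dipole(alpha, field):
    """Induced dipole moment as the product of the 3x3 tensor and the field vector."""
    return [sum(alpha[row][col] * field[col] for col in range(3))
            for row in range(3)]

def mean_polarisability(alpha):
    """One third of the trace of the polarisability tensor."""
    return (alpha[0][0] + alpha[1][1] + alpha[2][2]) / 3.0

# Hypothetical tensor with small off-diagonal elements.
alpha = [[10.0, 0.1, 0.0],
         [0.1, 8.0, 0.0],
         [0.0, 0.0, 6.0]]

print(induced_dipole(alpha, [1.0, 0.0, 0.0]))
print(mean_polarisability(alpha))
```

A field applied along x induces a dipole mainly along x, with only a small component along y arising from the off-diagonal element, which is why the mean of the diagonal elements is usually a sufficient summary.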

Polarisability can be measured experimentally, or calculated with the methods of quantum

mechanics. Values are reported in Å³.

Molecular polarisability causes the inductive van der Waals forces (interactions between

permanent and induced molecular dipoles). It is also important for the presence of the

dispersion van der Waals (or London) forces which represent interactions between

temporary dipoles and induced dipoles. Temporary dipoles are formed in non-polar

molecules due to the electron motion within the molecule, and fluctuate rapidly with the

motion of the electrons. These temporary dipoles induce dipoles in other closely placed

molecules resulting in van der Waals intermolecular forces by orienting the temporary

and induced dipoles with each other.

3.5.3. Energies of the frontier molecular orbitals HOMO and LUMO

The energies of the frontier orbitals HOMO (highest occupied molecular orbital) and

LUMO (lowest unoccupied molecular orbital) are commonly used descriptors in QSAR

analysis. They reflect the reactivity of a molecule. A higher HOMO energy suggests

higher affinity of a molecule to react as a nucleophile, a lower LUMO energy suggests

stronger electrophilic nature of a molecule. Additionally, electrophilic and nucleophilic

attack will most likely occur at atoms where the coefficients of the corresponding atomic

orbitals in HOMO and LUMO, respectively, are large. The coefficients of the atomic

orbitals in HOMO and LUMO and the orbital energies are calculated with quantum

mechanical methods (for details see Chapter 4, Section 4.2 “Quantum mechanical

calculations”).

3.5.4. Partial atomic charges

Partial atomic charges appear due to the different ability of atoms to withdraw electron

density (differences in the atomic electronegativity). The partial atomic charges determine

the electrostatic potential around a molecule, thus they influence intermolecular

interactions of an electrostatic nature, including hydrogen-bonding (see Section 3.5.5

“Hydrogen-bond donor and acceptor ability”). Partial charges can be calculated with the

methods of molecular or quantum mechanics (for details see Chapter 4).

3.5.5. Hydrogen-bond donor and acceptor ability

The descriptors characterising hydrogen-bonding (H-bonding) are included in the group

of electronic descriptors, because this molecular property is determined by the

distribution of electrons, i.e. presence of lone electron pairs, and the energies of HOMO

and LUMO and high positive or negative charges on certain types of atoms (see below).

The simplest approach to evaluate the hydrogen-bonding (H-bonding) ability is by using

counts of the H-bond donor and acceptor atoms in a molecule. H-bond donors are

hydrogen atoms attached to an electronegative atom (O or N); H-bond acceptors are

atoms that have a lone electron pair (O, N).

Several other approaches to assess the overall H-bonding strength of a molecule have

been developed. Kamlet and Taft (Kamlet and Taft, 1976; Taft and Kamlet, 1976)

proposed scales of solvent H-bond donor acidity (α) and H-bond acceptor basicity (β).

These scales were derived from effects of solvents on the ultraviolet and visible spectra of

solutes, which change due to H-bonding. Other measures of H-bond acidity ($\alpha_2^H$) and H-

bond basicity ($\beta_2^H$), proposed by Abraham and co-workers (Abraham et al., 1989;

Abraham, 1993), are based on values of equilibrium constants for complex formation by

H-bonding in tetrachloromethane.

To avoid experimental determination of H-bond ability, Wilson and Famini (1991) proposed
descriptors based on quantum mechanical calculations. H-bond acceptor strength is
represented by two terms. The first is the covalent contribution to Lewis basicity
($\varepsilon_b$), equal to the difference in energy between the LUMO (lowest unoccupied
molecular orbital) of water and the HOMO (highest occupied molecular orbital) of the
solute. The second term is the electrostatic basicity ($q^-$), equal to the largest
negative atomic charge in the solute molecule (the most negatively charged atom will have
the greatest interaction with a proton from a neighbouring molecule). Analogously, H-bond
donor strength is represented by the covalent acidity ($\varepsilon_a$), the energy
difference between the HOMO of water and the LUMO of the solute, and the electrostatic
acidity ($q_{H^+}$), the largest positive charge of a hydrogen atom in the solute
molecule (the most positively charged proton of the molecule will be most attracted to a
negatively charged atom of a neighbour).

CHAPTER 4

3D-MOLECULAR CONFORMATIONS

A number of QSAR techniques and descriptors require a 3D-representation of the molecular

structure. Examples of descriptors whose values depend on the 3D-molecular

conformation are molecular surface area and volume, dipole moment, energies of the

frontier molecular orbitals (HOMO and LUMO), and partial atomic charges. 3D-QSAR

techniques like CoMFA and CoMSIA are very sensitive to the choice of the 3D-

molecular conformations (see Chapter 6). The 3D-conformation chosen for the analysis

should resemble most closely the active 3D-conformation of a chemical compound, i.e.

the conformation, in which it elicits its biological effect. Information on the active

conformation can be obtained from crystallographic analysis of complexes between the

investigated chemical and its biological receptor (enzyme or other macromolecule).

However, such information is often insufficient or missing. Therefore, it is generally

accepted practice to use the 3D-conformation with minimum energy in the QSAR analysis.

Generally, two methods can be used to calculate the energy-minimised molecular

conformations and corresponding molecular descriptors, namely the theory of molecular

mechanics and the theory of quantum mechanics. Usually, owing to limited

computational resources, both approaches calculate the energy-minimised

conformation in vacuum, ignoring solvent effects. Calculations performed by molecular

mechanics are fast, but are less accurate and cannot estimate the energies of the electron

orbitals. To perform more accurate calculations and to calculate electronic energies, the

theory of quantum mechanics is used. However, quantum mechanical calculations

demand greater computational resources and can be very time-consuming. To reduce the

computational time and resources needed, usually a molecular mechanical minimisation is

performed to obtain a reasonable conformation, which is further minimised with quantum

mechanical calculations.

In this chapter molecular mechanical and quantum mechanical approaches for calculation

of (energy-minimised) molecular conformations are presented.

4.1. Molecular mechanical calculations

Molecular mechanics is a method to calculate 3D-structure and energy of molecules

based on the motion of atomic nuclei only. Electrons are not considered explicitly, as it is

assumed that they will find their optimum distribution once the positions of the nuclei are

known. This assumption is based on the Born-Oppenheimer approximation (see Section

4.2 “Quantum mechanical calculations”), which states that the nuclear motions can be

studied separately from the electronic motions. Because the nuclei are much heavier and

move more slowly than the electrons, the electrons are assumed to move fast enough to

adjust to the nuclear motion. Thus, in the theory of molecular mechanics the molecules

are considered as assemblies of spherical particles with given radii (representing the

nuclei) held together by simple harmonic forces (representing the bonds).

The energy and the optimum (energy-minimised) geometry of a molecule are calculated

by applying a force field to it. A force field is a collection of atom types and atom

parameters, parameters for bond lengths at equilibrium, bond angles, torsion angles, etc.

The total energy of a molecule is divided into several terms called force potentials, which

are described by potential energy equations, included in the definition of the force field.

Examples of force potentials are the energies associated with bond stretching, bond

bending (bond angle deformation), torsion strain (rotation of adjacent groups about a

common bond), interactions between non-bound atoms (van der Waals attraction, steric

repulsion, and electrostatic attraction/repulsion), energies due to dipole-dipole

interactions between the bond dipole moments in a molecule, and/or intra-molecular

hydrogen bonding energies. Force potentials are calculated independently from each

other, and are summed to give the total energy of the molecule.
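
The additive structure of a force field can be illustrated with a minimal sketch. The functional forms below (harmonic stretching and bending, a cosine torsion term and a 12-6 Lennard-Jones term) are common textbook choices and are assumed here for illustration; they are not the equations of MM2, AMBER or any other specific force field, and all parameter values supplied to the functions are hypothetical.

```python
import math

def bond_stretch(r, r0, k):
    """Harmonic bond-stretching term: 0.5 * k * (r - r0)^2."""
    return 0.5 * k * (r - r0) ** 2

def angle_bend(theta, theta0, k):
    """Harmonic angle-bending term (angles in radians)."""
    return 0.5 * k * (theta - theta0) ** 2

def torsion(phi, v, n):
    """Periodic torsion term: 0.5 * v * (1 + cos(n * phi))."""
    return 0.5 * v * (1.0 + math.cos(n * phi))

def lennard_jones(r, epsilon, sigma):
    """12-6 non-bonded term: steric repulsion plus van der Waals attraction."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)

def total_energy(bonds, angles, torsions, nonbonded):
    """Evaluate each term independently and sum, as described in the text.

    Each argument is a list of parameter tuples for the matching function.
    """
    energy = sum(bond_stretch(*b) for b in bonds)
    energy += sum(angle_bend(*a) for a in angles)
    energy += sum(torsion(*t) for t in torsions)
    energy += sum(lennard_jones(*p) for p in nonbonded)
    return energy
```

Electrostatic, dipole-dipole and hydrogen-bonding terms would be added to the sum in the same way.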

Different force fields have been developed. They differ by their atom and bond

parameterisation and by the energy potentials included. Commonly used force fields for

calculating conformations of small organic molecules are MM2 and MM3 developed by

Allinger and coworkers (Allinger, 1977; Allinger et al., 1989). A force field suitable for

modelling proteins and nucleic acids is AMBER (Assisted Model Building with Energy

Refinement) (Weiner and Kollman, 1981; Weiner et al., 1984). Another force field

developed for calculations of small molecules (with molecular weight between 50 and

1000) is the Cosmic force field, which uses simple force field parameters and equations

with reasonable accuracy of the results (Abraham and Haworth, 1988; Morley et al.,

1991).

4.2. Quantum mechanical calculations

In the theory of quantum mechanics a wave function is associated with every quantum

particle. The wave function describes the spatial position and the motion of the particle

and its energy. The wave function for a given particle is obtained by solving the

Schrödinger equation. The time-independent and non-relativistic (neglect of relativistic

effects on the particle mass) form of the Schrödinger equation is the following:

Ĥ Ψ = E Ψ (4.1)

where Ĥ is the Hamiltonian operator, which is applied to the wavefunction Ψ to give the

same function multiplied by a constant. The constant E is the energy of the state of the

particle described by the given wavefunction. Ψ and E represent respectively the

eigenvectors and the corresponding eigenvalues of the operator Ĥ. The square of the

wavefunction value at a given point in space represents the probability for the particle to

be found at that point.

For quantum systems consisting of more than one particle (molecules), the Schrödinger

equation is solved to find wavefunctions describing the different energy states of the

whole system (all particles), and the corresponding energies of the system.

So far the Schrödinger equation has been exactly solved for one-electron molecules only.

For more complex systems some approximations need to be made to find its solution. The

Born-Oppenheimer approximation is based on the fact that the nuclear motions are slower

than the electronic motions. Therefore the nuclei and the electrons can be studied

independently. Independent Schrödinger equations are solved to obtain separate

wavefunctions describing the motion of the nuclei and the motion of the electrons. In this

approximation, the nuclear motion is neglected, and a solution of the Schrödinger

equation for electrons is sought at a given fixed nuclear configuration.

The electronic Hamiltonian has three terms (discussed in detail in Schüürmann, 2004).

The first term corresponds to the sum of the kinetic energies of the electrons, the second

term is related to the attractive electrostatic forces between the nuclei and the electrons,

and the third term corresponds to the sum of the repulsive electrostatic forces between

each two electrons. The square of the wavefunction obtained for the electrons is the

physically observable electron density.

However, an electronic wavefunction depending on the position of all electrons is very

complicated. Several further approximations need to be made to solve the Schrödinger

equation for electrons. In the molecular orbital approximation the motions of the electrons

in a molecule are considered independent of each other. Thus, the third term in the

electronic Hamiltonian, corresponding to the repulsive electron-electron interactions, is

neglected. Since electron-electron interactions are neglected in this approach, the

probability of finding an electron at a given position (i.e. the square of a one-electron

wavefunction) does not depend on the positions of the other electrons. Thus, the

probability of finding all electrons at a given configuration in the space is represented

using the products of the squares of the one-electron wavefunctions. Therefore, the

molecular orbital approximation assumes that the electronic wavefunction (describing the

state of all electrons) can be obtained as a combination of products of one-electron

wavefunctions.

The molecular one-electron wavefunctions (called molecular spin orbitals) can be

expressed as products of a spatial wavefunction, describing the spatial position of the

electron, and a spin function (accounting for the electron spin). The energy associated

with the molecular orbitals is the energy required to remove an electron from that orbital.

The spatial parts of the molecular orbitals (ψi) are usually expressed as a linear

combination of a set of one-electron wavefunctions, centred on the nuclei, called basis

set:

$\psi_i = \sum_j c_{ij}\,\phi_j$ (4.2)

The one-electron wavefunctions of the basis set ($\phi_j$) have a similar mathematical form to

the wavefunctions describing the state of an electron in isolated atoms (the

so-called atomic orbitals). Therefore this approximation is called the “molecular orbitals – a

linear combination of atomic orbitals” (MO-LCAO) method. The coefficients $c_{ij}$

determine the contribution of the atomic orbital $\phi_j$ to the molecular orbital $\psi_i$.

The most common atomic wavefunctions ($\phi_j$) used as basis sets are the Slater-type

orbitals (STO), and the Gaussian-type orbitals (GTO). The STOs account for the actual

shape of the atomic orbitals. GTOs are mathematically simpler than STOs, but less

accurate. The one-electron orbitals can also be built by combining sets of gaussian

functions to approximate each STO (contracted gaussian functions). A minimal basis set

is one in which only occupied orbitals of each isolated atom are used to compose the

molecular orbitals.

The objective of the MO-LCAO method is to find the coefficients cij which result in

molecular orbitals that best approximate the actual molecular orbitals. The variational

principle states that this is equivalent to finding values for the coefficients cij, which result

in molecular orbitals with minimum energies for a given choice of Hamiltonian and basis

set. To obtain these coefficient values, the energy is expressed as a function of the cij, and the

derivatives of the energy with respect to each coefficient are set equal to zero. The

calculation of the cij requires the values of certain integrals, which can be obtained completely

or partially from experimental data (semi-empirical quantum mechanical methods) or

from complete calculation of these integrals (ab initio methods).
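
For the smallest possible case, two basis functions, setting the energy derivatives to zero leads to a 2 x 2 eigenvalue problem that can be solved in closed form. The sketch below assumes an orthonormal basis (no overlap integrals) and a real symmetric Hamiltonian matrix; the function name and the matrix elements passed to it are hypothetical and serve only to show how the coefficients and orbital energies emerge together.

```python
import math

def two_orbital_lcao(h11, h22, h12):
    """Solve H c = E c for a 2x2 real symmetric Hamiltonian matrix.

    Assumes an orthonormal basis. Returns a list
    [(E_low, (c1, c2)), (E_high, (c1, c2))] with normalised coefficients.
    """
    if h12 == 0.0:
        raise ValueError("h12 = 0: the basis functions do not mix")
    mean = 0.5 * (h11 + h22)
    root = math.hypot(0.5 * (h11 - h22), h12)
    orbitals = []
    for energy in (mean - root, mean + root):
        # First row of (H - E) c = 0 gives c2 / c1 = (E - h11) / h12
        c1, c2 = h12, energy - h11
        norm = math.hypot(c1, c2)
        orbitals.append((energy, (c1 / norm, c2 / norm)))
    return orbitals
```

With equal diagonal elements and a negative off-diagonal element, the lower-energy orbital has coefficients of equal sign (bonding) and the higher-energy orbital has coefficients of opposite sign (antibonding).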

As described above, the effects of electron-electron interactions are neglected in the

Hamiltonian used to derive the molecular orbitals. These effects, called "electron

correlation", include electrostatic electron-electron repulsions, and Pauli interactions,

representing the attractive interaction between two electrons of opposite spin and

repulsive interactions between two electrons of the same spin. However, to obtain a

proper description of the state of molecular systems these effects need to be taken into

account.

In the self-consistent-field (SCF) approximation the electrostatic electron-electron

repulsions are accounted for by assuming that each electron interacts with an average

electrostatic field of the nuclei plus all other electrons. This gives a mathematically

simpler approach than describing the electron interactions in pairs. It is used in the

Hartree-Fock approach, where a Hamiltonian operator (called the Fock operator) is

constructed, which considers each electron in the average electrostatic field formed from

all other electrons in the molecule, and also takes into account the Pauli interactions

(discussed in detail in Schüürmann, 2004).

To find molecular orbitals which take into account the electron correlation, an iterative

procedure is applied. An initial set of molecular orbitals is constructed and the

Schrödinger equation is used to calculate the electron correlation term of the Fock

operator. The Fock operator is then used to calculate a new set of molecular orbitals, and

again, a new Fock operator is computed. This process is repeated until the calculated

molecular orbitals become constant, or a pre-defined energy criterion is met. Each SCF

iteration involves calculation of a new set of coefficients cij, and the process is repeated

until the cijs approach constant values within a convergence criterion.

Two types of Hartree-Fock calculations exist. The restricted Hartree-Fock (RHF)

calculations are performed on closed shell systems, where all orbitals are empty or have

paired electrons. The unrestricted Hartree-Fock method (UHF) is applied for open shell

systems with one or more unpaired electrons on some molecular orbitals.

Molecular orbital calculations are used to calculate molecular orbital energies, energies of

the electrons and nuclei; energies of electron-electron, electron-nuclei, and nuclei-nuclei

interactions; the total molecular energy; heat of formation (calculated from the difference

between the total energy and the energy of the isolated atoms); partial atomic charges,

electrostatic potential; dipole moment; spectroscopic properties.

The electrostatic potential that the charge distribution of a molecule creates around it

can be used to derive partial atomic charges: the charges are fitted so that they

reproduce this potential. Partial atomic charges can also be calculated from the molecular

orbital coefficients using methods such as Mulliken population analysis, which provides

partitioning of either the total electron density, or an orbital density (Mulliken, 1955).

4.2.1. Semi-empirical methods

The semi-empirical quantum mechanical methods allow for relatively fast calculations on

larger molecules. In these methods only the outer or valence electron orbitals are

calculated. The inner electrons are considered of less importance for the chemical

properties, and usually are parameterised empirically. In the semi-empirical

approximations some integrals arising from the Schrödinger equation are neglected, and

parameters are used to compensate for the errors introduced by removing these integrals.

The parameters are fitted to measured or calculated ionisation

potentials, electron affinities, and spectroscopic quantities, or are computed from ab initio

calculations on model systems.

The first semi-empirical methods were developed by Pople and coworkers (Pople and

Segal, 1965; Pople et al., 1965; Pople and Beveridge, 1970). They are based on the neglect of

differential overlap (NDO) integrals. The methods developed differ in the extent to

which these integrals are neglected. These methods include CNDO (Complete Neglect of

Differential Overlap), INDO (Intermediate Neglect of Differential Overlap), and MINDO

(Modified Intermediate Neglect of Differential Overlap).

The semi-empirical methods used nowadays are based on the neglect of diatomic

differential overlap (NDDO). In these methods differential overlap is neglected only

between atomic orbitals centred on different atoms. The first such method was developed by Dewar and Thiel

(1977) and called MNDO (Modified Neglect of Diatomic Overlap). Austin Model 1

(AM1) (Dewar et al., 1985), and Parametric Model 3 (PM3) (Stewart, 1989) are the most

commonly used NDDO methods. AM1 and PM3 are quantum mechanically identical but

differ in their parameterisations. Generally, AM1 is most reliable for molecules

containing elements C, H, N, and O, while PM3 is suitable for molecules containing P

and S or for the main group elements that are not parameterised in AM1 (Vamp

Reference Guide, 1997).

Currently, the semi-empirical methods are used for molecules in the range of 20 to 400

atoms. Specific software implementations of the semi-empirical methods include

AMPAC (Semichem, Inc.), MOPAC (Fujitsu), ZINDO (Zerner, 1993), Vamp (Accelrys

Inc.).

4.2.2. Ab initio methods

In the ab initio approach all integrals in the calculation of molecular orbitals are computed.

It gives the most accurate results, but requires greater computational resources compared

to the semi-empirical methods. The obtained accuracy, however, depends on the choice of

the basis set. Examples of basis sets include Slater-type Orbital (STO) bases. In these

bases STOs are approximated by N Gaussian-type orbitals (GTOs), and the bases are

denoted as STO-NG (for example STO-3G, where each STO is replaced by 3 GTOs).

STO-NG are minimal basis sets, with only one function per atomic orbital for each atom.

Better results are given by the so-called split-valence basis sets, in which different basis

functions are used for the core orbitals and for the valence orbitals. Examples are the 3-21G

and 6-31G bases. The notation 3-21G means that each core orbital is approximated by a

contraction of 3 GTOs, while each valence orbital is split into an inner part, described by a

contraction of 2 GTOs, and an outer part, described by a single GTO.

Further improvement of the results is achieved by adding d-orbitals for the

heavy atoms to the basis set. For typical organic compounds they are not used in bond

formation, but rather are used to allow a shift of the centre of an orbital from the position

of the nucleus (polarisation of the orbital). For example, a p-orbital on carbon can be

polarised by adding to it a d-orbital. The presence of polarisation functions is indicated by

adding an asterisk to the notation of the basis set. Thus, 3-21G* implies the previously

described split valence basis with polarisation added. A second asterisk is used to denote

the addition of a set of p-orbitals to each hydrogen atom to provide for their polarisation

(for example 6-31G** basis).
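
The naming scheme described above is regular enough to be decoded mechanically. The following sketch is a hypothetical helper, not part of any quantum chemistry package; it handles only the STO-NG names and the split-valence names with optional asterisks discussed in this section.

```python
import re

def parse_basis_name(name):
    """Decode names such as 'STO-3G', '3-21G', '6-31G*' or '6-31G**'."""
    match = re.fullmatch(r"STO-(\d)G", name)
    if match:
        return {"type": "minimal", "gtos_per_sto": int(match.group(1))}
    match = re.fullmatch(r"(\d)-(\d+)G(\*{0,2})", name)
    if not match:
        raise ValueError(f"unrecognised basis-set name: {name}")
    core, valence, stars = match.groups()
    return {
        "type": "split-valence",
        "core_gtos": int(core),                    # GTOs per core orbital
        "valence_gtos": [int(d) for d in valence], # GTOs per valence part
        "d_on_heavy_atoms": len(stars) >= 1,       # first asterisk
        "p_on_hydrogen": len(stars) == 2,          # second asterisk
    }
```

For example, `parse_basis_name("6-31G**")` reports 6 GTOs for the core orbitals, valence parts of 3 and 1 GTOs, and polarisation functions on both heavy atoms and hydrogen.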

The choice of a basis set is based on a compromise between the desired accuracy in the

results, and the corresponding computational demands. Ab initio methods can be applied to

any type of atom, including metals. Currently, their application is limited to molecules

containing tens of atoms due to the great computational demand. Software

implementations include GAUSSIAN (Gaussian, Inc.), and GAMESS (Computing for

Science (CFS) Ltd.).

4.3. Obtaining 3D-molecular conformation with minimum energy

By using the methods of molecular or quantum mechanics the energy of a molecule and

some descriptors of the structure are calculated. For the needs of QSAR analysis these

descriptors should be calculated on the basis of the conformation with minimum energy.

To find an energy-minimised molecular conformation, an iterative procedure is usually

applied. A given starting conformation is constructed and its energy is calculated.

Subsequently, the starting conformation is altered to form a conformation with smaller

energy. Again, the energy of the new conformation is calculated and these steps are

repeated until a conformation with minimum energy is found. Different algorithms exist

for altering the conformation in the process of searching for smaller-energy

conformations. In general, these can be divided into derivative and non-derivative

algorithms. In the derivative algorithms the derivatives of the energy with respect to the

atomic coordinates are calculated. Some derivative algorithms use only the energy gradient

(the first derivative of the energy) to locate the minimum. Examples are the

steepest descent method, the arbitrary step approach, and conjugate gradient

minimisation. They are fast and are appropriate for finding the most rapid decrease in

energy far from the energy minimum. Other derivative algorithms consider also the

second derivative of the energy to find the conformation with a minimum energy

(Newton-Raphson method). These methods are slower, and are more appropriate for

structures close to the energy minimum. The non-derivative algorithms do not use the

energy derivatives. In the Simplex method (a slow computational method) each atom is

moved along the Cartesian axes in both directions and the energy for each position is

calculated; the position with the lowest energy is selected. In Monte Carlo

procedures the molecular conformation is altered by a small random amount at each step,

and if the energy decreases the new structure is accepted, otherwise it is rejected.
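
Two of the strategies above, steepest descent and the simple Monte Carlo acceptance rule, can be compared on a one-dimensional example. The energy profile below is a made-up torsional potential with several minima, chosen only for illustration; step sizes, tolerances and function names are likewise assumptions, and the gradient is estimated numerically rather than analytically.

```python
import math
import random

def energy(phi):
    """Hypothetical torsional energy profile (arbitrary units, phi in rad)."""
    return 1.0 + math.cos(3.0 * phi) + 0.5 * math.cos(phi)

def gradient(phi, h=1e-6):
    """Central finite-difference estimate of dE/dphi."""
    return (energy(phi + h) - energy(phi - h)) / (2.0 * h)

def steepest_descent(phi, step=0.05, tol=1e-8, max_iter=10000):
    """Move against the gradient until it (almost) vanishes."""
    for _ in range(max_iter):
        g = gradient(phi)
        if abs(g) < tol:
            break
        phi -= step * g
    return phi

def monte_carlo(phi, max_move=0.1, n_steps=5000, seed=0):
    """Accept a small random move only if it lowers the energy."""
    rng = random.Random(seed)
    current = energy(phi)
    for _ in range(n_steps):
        trial = phi + rng.uniform(-max_move, max_move)
        trial_energy = energy(trial)
        if trial_energy < current:
            phi, current = trial, trial_energy
    return phi
```

Started from the same torsion angle, both procedures end in the same nearby minimum; neither can cross an energy barrier, so the result depends on the starting conformation.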

The minimum-energy conformation obtained by these algorithms usually depends on the

starting conformation and may represent a local minimum of the energy. There might be

several local energy minima for a molecule. The lowest of the local energy minima is

called the global minimum. To find the conformation corresponding to the global energy

minimum, methods for conformational search are used. In these methods several starting

conformations are generated and optimised (i.e. for the given starting conformation the

conformation with energy in a local minimum is found). The optimised conformation

with the lowest energy is considered to be the conformation with a global energy

minimum. A variety of methods exist to generate different starting conformations.

Generally, they are of two types: stochastic methods, in which the new conformations are

generated randomly; and systematic methods, in which the conformation space is

searched in a systematic way, for example, by incrementing the torsion angles by a given

value.
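
A systematic conformational search over a single torsion angle might look as follows. The energy profile is the same kind of made-up one-dimensional potential used purely for illustration, the local optimiser is a plain steepest descent with a numerical gradient, and the 30 degree increment is an arbitrary assumption.

```python
import math

def energy(phi):
    """Hypothetical torsional energy profile (arbitrary units, phi in rad)."""
    return 1.0 + math.cos(3.0 * phi) + 0.5 * math.cos(phi)

def local_minimise(phi, step=0.05, h=1e-6, tol=1e-8, max_iter=10000):
    """Steepest-descent optimisation from one starting conformation."""
    for _ in range(max_iter):
        g = (energy(phi + h) - energy(phi - h)) / (2.0 * h)
        if abs(g) < tol:
            break
        phi -= step * g
    return phi

def systematic_search(increment_deg=30):
    """Optimise every starting torsion and keep the lowest-energy result."""
    starts = [math.radians(a) for a in range(0, 360, increment_deg)]
    optimised = [local_minimise(s) for s in starts]
    return min(optimised, key=energy)
```

A starting point that happens to lie exactly on an energy maximum has a zero gradient and is not moved by the optimiser; because the final selection is by energy, such a point is simply not chosen.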

CHAPTER 5

MATHEMATICAL PROCEDURES AND STATISTICAL METHODS

In this chapter an introduction to the mathematical procedures and statistical methods

used to derive QSAR models in the project is presented. Statistical parameters used to

assess model quality are described. Assumptions of the statistical approaches are given.

Additionally, statistical techniques for selecting variables resulting in best statistical fit,

and for evaluating model quality, are described.

5.1. Linear regression analysis

Linear regression analysis is the most widely used statistical approach to derive QSARs.

5.1.1. General description of the method

In regression analysis a functional dependence of a variable (called the dependent

variable) on a set of other variables (called independent or predictor variables) is sought.

In linear regression analysis the dependence has the following linear form:

$Y' = b_1 X_1 + b_2 X_2 + \dots + b_p X_p + b$ (5.1)

where $b_1, b_2, \dots, b_p$ are the regression coefficients

b is the intercept

$X_1, X_2, \dots, X_p$ are the independent variables

$Y'$ represents the values of the dependent variable predicted by the regression model.

The regression coefficients $b_1, b_2, \dots, b_p$ and the intercept b are calculated by applying the

method of least squares to give the smallest possible sum of squared differences between

the true Y values of the dependent variable and the Y’ values calculated by the regression

model.
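
The least-squares calculation can be sketched without any external library by forming and solving the normal equations. This is an illustrative implementation only (the function name is an assumption, and statistical packages use numerically more robust algorithms); it returns the coefficients and the intercept of Equation 5.1.

```python
def fit_linear(X, y):
    """Ordinary least squares for Y' = b1*X1 + ... + bp*Xp + b.

    X: list of n rows, each holding p predictor values.
    y: list of n observed values of the dependent variable.
    Returns ([b1, ..., bp], b).
    """
    rows = [list(r) + [1.0] for r in X]   # column of ones for the intercept
    m = len(rows[0])
    # Normal equations (A^T A) beta = A^T y as an augmented matrix
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(m)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))]
           for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(m):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    beta = [aug[i][m] / aug[i][i] for i in range(m)]
    return beta[:-1], beta[-1]
```

The error raised for linearly dependent predictors is the extreme case of the multicollinearity problem.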

The equation represents a hyperplane in the (p+1)-dimensional space, where p is the number

of predictor variables in the equation. It reduces to a line in the case of one predictor

variable and a plane in the case of two predictor variables.

The derived regression equation can be used for predicting values of the dependent

variable from the values of the independent variable(s).

When the dependent and independent variables are standardised to have means of zero

and standard deviations of one, the equation can be written in a standardised form, in

which the intercept b is equal to zero. In matrix form the equation can be written as:

$Y_s = X_s B_s + E$ (5.2)

where $Y_s$ is the matrix of the standardised dependent variable, of dimensionality n x 1 (n rows and one column)

$X_s$ is the matrix containing the standardised independent variables included in the model, of dimensionality n x p

$B_s$ is the matrix of the standardised regression (beta) coefficients, of dimensionality p x 1

E is the matrix of residuals, i.e. the part of Y not explained by the model, of dimensionality n x 1

n is the number of observations

p is the number of independent variables included in the regression model.

The coefficients for the independent variables in this case are called beta coefficients, and

can be used to assess the relative contribution of the independent variables to the

prediction of the dependent variable. Independent variables with higher absolute values of

their beta coefficients explain a greater part of the variance in the dependent variable.
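
Beta coefficients do not require a second fit on standardised data: each one equals the raw regression coefficient multiplied by the ratio of the standard deviation of its predictor to that of the dependent variable. The helper below is a sketch under that identity; its name and input layout are assumptions.

```python
from statistics import stdev

def beta_coefficients(coefs, X, y):
    """Convert raw regression coefficients to standardised (beta) ones.

    Uses beta_j = b_j * s(X_j) / s(Y), which is equivalent to refitting
    the model on variables scaled to mean 0 and standard deviation 1.
    coefs: [b1, ..., bp]; X: list of n rows of p values; y: n values.
    """
    s_y = stdev(y)
    columns = list(zip(*X))   # one tuple of n values per predictor
    return [b * stdev(col) / s_y for b, col in zip(coefs, columns)]
```

Because the betas are dimensionless, their absolute values can be compared directly even when the predictors are measured in different units.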

To assess the accuracy with which the dependence found describes the variance of the

dependent variable (i.e. the quality of the statistical fit), the Pearson correlation

coefficient (r) (for regression with a single predictor variable) or the coefficient of

determination ($R^2$) is used:

$r = \pm\,(ESS/TSS)^{0.5}$ (5.3)

$R^2 = ESS/TSS = 1 - (RSS/TSS)$ (5.4)

where: TSS is the total sum of squares: $TSS = \sum_i (Y_i - Y_{mean})^2$, it has n-1 degrees of freedom

ESS is the explained sum of squares: $ESS = \sum_i (Y_i' - Y_{mean})^2$, it has p degrees of freedom

RSS is the residual sum of squares: $RSS = \sum_i (Y_i - Y_i')^2$, it has n-p-1 degrees of freedom

$Y_i$ are the observed values of the dependent variable

$Y_i'$ are the values of the dependent variable predicted by the regression model

$Y_{mean}$ is the mean value of the dependent variable

n is the number of observations

p is the number of independent variables included in the regression model.

The Pearson correlation coefficient (r) has a positive value if the values of the dependent

variable increase with increasing values of the independent variable; otherwise the

coefficient is negative.

An $R^2$ value greater than 0.5 means that the variance explained by the model (ESS) is greater

than the unexplained variance (RSS). The closer the value of $R^2$ is to 1, the better the

regression equation describes the observed data. The value of $R^2$ depends on the size of

the sample and the number of predictor variables in the equation. It keeps the same value

or increases when a new predictor variable is added to the regression equation, even if the

added variable does not contribute to reducing the unexplained variance in the dependent

variable. Therefore another statistical parameter can be used, called the adjusted $R^2$ value

($R^2_{adj}$). It is obtained by an expression similar to the one for $R^2$, except that the RSS and TSS

are divided by their corresponding degrees of freedom:

$R^2_{adj} = 1 - \frac{RSS/(n-p-1)}{TSS/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}$ (5.5)

The value of $R^2_{adj}$ decreases if a variable added to the equation does not reduce the

unexplained variance.
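
The sums of squares and the two goodness-of-fit measures of Equations 5.4 and 5.5 can be computed directly from observed and predicted values. The function below is a minimal sketch with an assumed name and return format; the predicted values are expected to come from a least-squares model that includes an intercept.

```python
def fit_statistics(y_obs, y_pred, p):
    """Return TSS, ESS, RSS, R2 and adjusted R2 for a regression model.

    p is the number of predictor variables. The identity TSS = ESS + RSS
    holds when y_pred comes from a least-squares fit with an intercept.
    """
    n = len(y_obs)
    y_mean = sum(y_obs) / n
    tss = sum((y - y_mean) ** 2 for y in y_obs)
    ess = sum((yp - y_mean) ** 2 for yp in y_pred)
    rss = sum((y - yp) ** 2 for y, yp in zip(y_obs, y_pred))
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))
    return {"TSS": tss, "ESS": ess, "RSS": rss, "R2": r2, "R2adj": r2_adj}
```

For any model with at least one predictor and an imperfect fit, the adjusted value is smaller than the unadjusted one, because the residual sum of squares is divided by fewer degrees of freedom.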

The reliability of prediction of the dependent variable can be assessed by the value of the

standard error of estimate s:

$s = (RMS)^{0.5} = \left[ \sum_i (Y_i - Y_i')^2 / (n-p-1) \right]^{0.5}$ (5.6)

where RMS is the residual mean square: RMS = RSS / (n-p-1).

The standard error of estimate (s) is a measure of the dispersion of the observed values of

the dependent variable about the regression line (surface). Larger values of s mean a worse

statistical fit of the model and less reliability of the prediction.

The statistical significance of a regression equation can be assessed by means of the

Fisher (F) statistic:

$F = EMS / RMS = \frac{R^2\,(n-p-1)}{p\,(1-R^2)}$ (5.7)

where: EMS is the explained mean square: EMS = ESS / p.

A regression equation is considered to be statistically significant if the observed F value is

greater than a tabulated value for the chosen level of significance (typically, the 95%

level) and the corresponding degrees of freedom of F. The degrees of freedom of F are

equal to p and n-p-1. Significance of the equation at the 95% level means that there is

only a 5% probability that the dependence found is obtained due to a chance numerical

relationship between the variables, i.e. that there is no real relationship between the

dependent and the independent variables in the population.
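
The standard error of estimate (Equation 5.6) and the F statistic (Equation 5.7) follow from the same sums of squares. The sketch below uses an assumed function name and leaves the comparison with a tabulated F value to the caller.

```python
import math

def standard_error_and_f(y_obs, y_pred, p):
    """Return (s, F) for a regression model with p predictor variables."""
    n = len(y_obs)
    y_mean = sum(y_obs) / n
    ess = sum((yp - y_mean) ** 2 for yp in y_pred)
    rss = sum((y - yp) ** 2 for y, yp in zip(y_obs, y_pred))
    rms = rss / (n - p - 1)   # residual mean square
    ems = ess / p             # explained mean square
    return math.sqrt(rms), ems / rms
```

As an illustration with made-up data, five observations and one predictor give 1 and 3 degrees of freedom, for which the tabulated F value at the 95% level is about 10.13; an F value above this would indicate a statistically significant equation.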

The statistical significance of an independent variable can be assessed by using the t

statistic, which is equal to the variable’s regression coefficient (b) divided by the standard

error of the coefficient (sb):

$t = b / s_b$ (5.8)

where b is the regression coefficient for the variable

$s_b$ is the standard error of the coefficient, equal (for a regression with a single predictor variable) to:

$s_b = s / \left[ \sum_i (X_i - X_{mean})^2 \right]^{0.5} = \left[ RMS / \sum_i (X_i - X_{mean})^2 \right]^{0.5}$ (5.9)

where $X_i$ are the observed values of the independent variable

$X_{mean}$ is the mean value of the independent variable.

Higher absolute t values of a predictor variable correspond to a greater statistical significance. The

significance is again determined by comparing the t value with tabulated values for a

given level of significance (similar to the use of the F value). The degrees of freedom of t are

equal to n-p-1 (corresponding to the degrees of freedom of RMS). If a variable is

statistically significant at the 95% level then there is only a 5% probability that the

regression coefficient of the variable is not significantly different from zero.
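
For a regression with one predictor, Equations 5.8 and 5.9 can be evaluated as follows. The function name is an assumption, and the sketch covers only the single-predictor case, for which the square of t equals the F statistic of the equation.

```python
import math

def slope_t_statistic(x, y):
    """Return (b, s_b, t) for a one-predictor least-squares regression."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - b * x_mean
    rss = sum((yi - (intercept + b * xi)) ** 2 for xi, yi in zip(x, y))
    rms = rss / (n - 2)            # n - p - 1 with p = 1
    s_b = math.sqrt(rms / sxx)     # Equation 5.9
    return b, s_b, b / s_b         # Equation 5.8
```

With five observations there are 3 degrees of freedom, for which the tabulated two-sided t value at the 95% level is about 3.18.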

Adding independent variables into the regression model generally results in an increased

$R^2$ value. However, adding a large number of predictor variables can result in an overfitted

model, i.e. in describing the noise in the data rather than true underlying relationships. To

avoid this, the ratio of observations to predictor variables in a model should be kept as

high as possible. According to Topliss and Costello (1972), this ratio should be at least

five.

5.1.2. Assumptions of linear regression analysis

In regression analysis it is assumed that the dependence found reflects a real causality in

the population. However, even when relationships between variables are found, the

underlying causal mechanism is not ascertained unequivocally. If relevant causal

variables are omitted from the model, the common variance which they share with

included variables may be a reason for wrongly attributed causality. Therefore, alternative

causal explanations should also be considered. Additionally, it should be certain that the

dependent variable is not a cause of one or more of the independent variables.

Another assumption (evident from the name of the analysis) is that the relationship

between the dependent and the independent variable(s) is linear. This can be examined by

plotting the dependent variable against the independent variable(s). If a non-linear relationship is

observed, a transformation of the variables can be performed to achieve a linear

relationship, or non-linear regression analysis should be applied.

For correct application of the tests for statistical significance the residuals (predicted

minus observed values of the dependent variable) should follow the normal distribution.

This assumption can be assessed by using a histogram of residuals, which shows their

distribution. However, the F-test is robust with regard to small violations of the

normality assumption, meaning that in such cases it can be used for estimating statistical

significance (Statistica version 5.5. software, help manual).

Another assumption is that of homoscedasticity of residuals. This means that the residuals

are dispersed randomly throughout the range of the dependent variable, i.e. the variance

of residuals is constant for all values of the independent variables (Berry, 1993). This

assumption can be examined by using plots of residuals against the predicted values of

the dependent variable. Outliers, representing cases with high residuals, are a form of

violation of homoscedasticity. When the homoscedasticity assumption is violated, the

values of tests for statistical significance (F, t) are questionable. If heteroscedasticity is

detected, the best strategy is to build a model based on the absolute values of the

residuals. If an adequate residual model is obtained, the absolute values of the residuals

for each observation are estimated from that model, and a new regression

model is built by the method of weighted least squares, in which the

reciprocals of the estimated absolute residuals are used as weighting factors

(Radojnowa et al., 2002). Otherwise it is considered that statistically significant

heteroscedasticity cannot be explained by the independent variables and can therefore be

neglected from a practical point of view.

The next assumption is that of independence of each observation from each other

observation (absence of autocorrelation). The values of a variable should not depend on

the order in which the observations were made. This is often a problem with time series,

where variable values tend to depend on the value in a previous moment. Violation of this

assumption leads to biased estimates of standard errors of the regression coefficients and

the statistical significance. The autocorrelation is tested by the Durbin-Watson coefficient

(d). The value of d ranges from 0 to 4, with a value between 1.5 and 2.5 indicating

independence of the observations. For a graphical test of the observation independence, a

plot of residuals against the sequence of cases can be used. It must show no pattern,

indicating independence of errors.
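
For illustration, the Durbin-Watson coefficient can be computed from the residuals taken in the order of the observations. The sketch below assumes NumPy and uses the usual definition, d = Σ(e_t – e_(t–1))² / Σ(e_t)², which is not written out in the text above; the residual sequences are artificial.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson d from residuals listed in observation order."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

d_positive = durbin_watson([1.0] * 20)          # identical neighbours: d = 0
d_negative = durbin_watson([1.0, -1.0] * 10)    # alternating signs: d close to 4
```

Values near 0 indicate positive autocorrelation and values near 4 indicate negative autocorrelation, which is why the middle of the range is taken as evidence of independence.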

Multicollinearity is a high level of intercorrelation among the independent variables,

which results in difficult determination of the individual effects of the variables.

Multicollinearity does not influence the calculated regression coefficients, but increases

their standard errors, thus making it difficult to assess the relative importance of the

independent variables by using beta coefficients. To assess the multicollinearity the

Pearson correlation coefficients between each pair of independent variables can be

examined. As a rule, independent variables which correlate with correlation coefficient

higher than 0.70-0.80 should not be included in the same regression model (Kubinyi,

1993). Another method of assessing multicollinearity is to use tolerance. Tolerance is

defined as 1 – R², where R² is the coefficient of determination of the regression when the

given independent variable is regressed on all other independent variables. If the tolerance

value is less than some cut-off value, usually 0.20, the independent variable should be

excluded from the analysis due to multicollinearity.
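
Both checks can be illustrated with a short sketch, assuming NumPy and synthetic descriptors; the function name `tolerances` is hypothetical.

```python
import numpy as np

def tolerances(X):
    """Tolerance (1 - R^2) of each column regressed on all other columns."""
    X = np.asarray(X, dtype=float)
    n, q = X.shape
    tol = np.empty(q)
    for j in range(q):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = np.sum((y - A @ coef) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        tol[j] = ss_res / ss_tot                # equals 1 - R^2
    return tol

# Synthetic descriptors (assumed data): c is almost a copy of a
rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = a + 0.05 * rng.normal(size=100)
X = np.column_stack([a, b, c])

r = np.corrcoef(X, rowvar=False)                # pairwise Pearson correlations
tol = tolerances(X)
```

In this example the pair (a, c) exceeds the correlation cut-off and both variables fall below the 0.20 tolerance cut-off, whereas b passes both checks.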

5.1.3. Observations with large influence on the regression model, outliers

Some observations can have large influence (leverage) on the obtained regression results.

As a result, a biased regression model may be obtained. A number of techniques have been developed

to identify such observations.

The leverage statistic (h, also called hat-value) is related to the distances between the

values of the independent variables for the given observation and the means of the values

of the independent variables calculated on all observations. The leverage statistic varies

from 0 (the observation has no influence on the model) to 1 (the observation completely

determines the model). As a rule, observations should have leverage lower than 0.2.

Observations with leverage greater than 0.5 should be excluded from the model. Another

estimate of observation influence on a model is the Cook's distance (D). It is related to the

difference between the computed regression coefficients and the regression coefficients of

a model in which the respective observation is excluded. The distances for all

observations should be of a similar magnitude, otherwise observations with large distance

will exert undue influence on the model.
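
A minimal sketch of both statistics is given below, assuming NumPy and a small artificial data set. The leverage values are taken as the diagonal of the hat matrix and Cook's distance is computed with the commonly used formula D = e²h / (k s² (1 – h)²), where k is the number of fitted parameters; this formula is an assumption, since it is not given in the text.

```python
import numpy as np

def leverage_and_cook(X, y):
    A = np.column_stack([np.ones(len(y)), X])   # design matrix with intercept
    n, k = A.shape
    H = A @ np.linalg.pinv(A)                   # hat matrix
    h = np.diag(H)                              # leverage of each observation
    e = y - H @ y                               # residuals
    s2 = e @ e / (n - k)                        # residual variance
    cook = e**2 * h / (k * s2 * (1 - h) ** 2)   # Cook's distance
    return h, cook

# Artificial data: the last observation lies far from the others in x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 12.0])
h, cook = leverage_and_cook(x, y)
```

Here the last observation has a leverage above 0.5 and by far the largest Cook's distance, although its residual is small, which shows why residuals alone do not reveal influential observations.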

Outliers are observations that are not well described by the regression model (have large

residuals). The presence of outliers can seriously bias the regression results by greatly

influencing the values of the regression coefficients. However, defining an observation as

an outlier is subjective and should take into account specific experimental considerations.

Generally, observations with large residuals (usually exceeding ±2 times the

standard error of the estimate) are considered to be outliers.

Studentised residuals (residuals divided by their standard errors) can also be used to

identify outliers. According to a method developed by Tenekedjiev and Radojnowa

(2001) each observation is excluded once from the data set and its studentised residual is

calculated from a regression model of the remaining observations. After testing the whole

sample, all the observations whose studentised residuals do not lie within the calculated

confidence interval corresponding to a chosen significance level are rejected. This

procedure (representing a single loop) is repeated until either a predetermined number of

loops over all the observations are executed, or some loop does not reject any outliers

(which is often the second one) (Tenekedjiev and Radojnowa, 2001).
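
The looping scheme can be sketched as follows. This is only a loose illustration of the description above, assuming NumPy; it is not the published algorithm of Tenekedjiev and Radojnowa. The function names are hypothetical, and the critical value `t_crit` is supplied by the user instead of being derived from a chosen significance level.

```python
import numpy as np

def deleted_studentised(A, y):
    """Studentised residual of each case from a fit on the remaining cases."""
    n, k = A.shape
    t = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Ai, yi = A[keep], y[keep]
        coef, *_ = np.linalg.lstsq(Ai, yi, rcond=None)
        s2 = np.sum((yi - Ai @ coef) ** 2) / (n - 1 - k)
        # variance of the prediction error for the left-out case
        var = s2 * (1.0 + A[i] @ np.linalg.inv(Ai.T @ Ai) @ A[i])
        t[i] = (y[i] - A[i] @ coef) / np.sqrt(var)
    return t

def reject_outliers(X, y, t_crit, max_loops=10):
    A = np.column_stack([np.ones(len(y)), X])
    idx = np.arange(len(y))                     # indices of retained observations
    for _ in range(max_loops):
        t = deleted_studentised(A[idx], y[idx])
        bad = np.abs(t) > t_crit
        if not bad.any():                       # a loop without rejections ends the search
            break
        idx = idx[~bad]
    return idx

# Artificial data: a straight line with small noise and one gross error
x = np.arange(12, dtype=float)
y = 1.0 + 2.0 * x + np.array([0.1, -0.1] * 6)
y[7] += 15.0
kept = reject_outliers(x, y, t_crit=3.0)
```

In this example the first loop rejects only the observation with the gross error and the second loop rejects nothing, so the procedure stops after two loops.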

Outliers can be present for different reasons: errors in the measured variable values,

extreme values for one or more of the variables, different causality for the value of the

dependent variable for the given observation. In some cases, transformations of the data

can be applied, or additional independent variables can be included into the model to

correctly describe the outliers. Also, separate models excluding outliers can be developed.

5.2. Principal Components Analysis

Principal Components Analysis (PCA) is a mathematical procedure that is used to

identify patterns in data, and to transform a number of (possibly) correlated variables into

uncorrelated variables. A smaller number of variables can be considered after the

transformation, thus the PCA can be used to reduce the dimensionality of the data without

significant loss of information (data variability).

PCA can be performed by using the variance/covariance matrix or the correlation matrix

of the variables. To perform PCA, the eigenvectors and eigenvalues of the matrix are

computed, thus, the following matrix equation is solved:

C_x E_i = λ_i E_i (5.10)

where C_x is the variance/covariance or correlation matrix

E_i are the eigenvectors of the matrix

λ_i are the corresponding eigenvalues.

The number of possible eigenvectors and eigenvalues is equal to the number of variables.

The number of components of each eigenvector is also equal to the number of variables.

The obtained eigenvectors are orthogonal. By ordering the eigenvectors in the order of

descending eigenvalues, an ordered orthogonal basis can be created. The first eigenvector

(with the highest eigenvalue) has the direction of largest variance of the data. In this way,

the directions in which the data set has the most significant variance are found.

The eigenvectors form the principal components of the data set. Some of the components

having the smallest eigenvalues can be ignored in the further analysis, as they describe a

very small part of the variance in the data. According to the Kaiser criterion (Kaiser,

1960), a component should be considered if its eigenvalue is greater than one, which

implies that this component explains a greater part of the variance than the variance

contribution of a single variable. Thus, the dimensionality of the transformed data set can

be reduced.
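
The eigenvalue problem and the Kaiser criterion can be illustrated with the following sketch, which assumes NumPy, synthetic data and the correlation matrix as input.

```python
import numpy as np

# Synthetic data (assumed): three variables driven by one latent factor, one pure noise
rng = np.random.default_rng(2)
z = rng.normal(size=(200, 1))
X = np.hstack([z + 0.3 * rng.normal(size=(200, 3)), rng.normal(size=(200, 1))])

C = np.corrcoef(X, rowvar=False)          # correlation matrix of the variables
eigvals, eigvecs = np.linalg.eigh(C)      # eigh: symmetric matrix, ascending eigenvalues
order = np.argsort(eigvals)[::-1]         # re-order by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

keep = eigvals > 1.0                      # Kaiser criterion
explained = eigvals / eigvals.sum()       # share of the total variance per component
```

With a correlation matrix each standardised variable contributes a variance of one, so the eigenvalues sum to the number of variables and an eigenvalue above one corresponds to more variance than a single variable contributes.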

To obtain the values of the principal components corresponding to each observation in the

data set (i.e. the values of the transformed variables, representing the so called component

scores), a feature vector needs to be formed. It represents a matrix of vectors, in which the

eigenvectors considered for further analysis (having high eigenvalues) form the columns

of the matrix. The eigenvectors are ordered in the matrix by decreasing

eigenvalues (the eigenvector with the highest eigenvalue forms the first column, the

eigenvector with the second highest eigenvalue forms the second column, etc.). The

feature vector is of a dimensionality q x p (q rows and p columns), where q is the original

number of variables (the dimensionality of the variance/covariance matrix), and p is the

number of principal components considered for further analysis. Additionally, the mean

of each of the original variables is subtracted from the values of that variable, to obtain a data set

in which each variable has a mean of zero.

To obtain the values of the transformed variables (the scores of the principal

components), the following matrix equation is solved:

T = (X – µ)F (5.11)

where T is the data matrix containing the scores of the principal components (in which

the scores of each component form the columns of the matrix); it has a dimensionality of

n x p

(X – µ) is the matrix, containing the original variables, from which the mean of

each variable is subtracted (in which the observations are in the rows, and the mean-

subtracted variables form the columns); its dimensionality is n x q

F is the feature vector (with dimensionality q x p)

q is the number of variables in the original data set

n is the number of observations in the data set

p is the number of considered principal components.

Thus, the transformed variables (the scores of the principal components) are linear

combinations of the original variables. The components of the eigenvectors represent the

loadings (weights) of the original variables in the corresponding principal components.

The obtained scores of the principal components (which are uncorrelated) can be used as

independent variables in regression analysis. The approach is called PCA regression.
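
The calculation of the component scores, T = (X – µ)F, can be sketched as follows, assuming NumPy and synthetic data; the function name `pca_scores` is hypothetical.

```python
import numpy as np

def pca_scores(X, p):
    """Scores of the first p principal components of the data matrix X (n x q)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                     # (X - mu), n x q
    C = np.cov(Xc, rowvar=False)                # variance/covariance matrix, q x q
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1][:p]       # the p largest eigenvalues
    F = eigvecs[:, order]                       # feature vector, q x p
    return Xc @ F, F, eigvals[order]            # T has dimensionality n x p

# Synthetic correlated variables (assumed data)
rng = np.random.default_rng(3)
mix = np.array([[1.0, 0.8, 0.2], [0.0, 1.0, 0.5], [0.0, 0.0, 0.3]])
X = rng.normal(size=(50, 3)) @ mix
T, F, lam = pca_scores(X, p=2)
```

The variance/covariance matrix of the scores T is diagonal, with the retained eigenvalues on the diagonal, which confirms that the transformed variables are uncorrelated.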

5.3. Partial Least Squares regression

Partial Least Squares (PLS) regression combines features from PCA and multiple linear

regression. In the PLS, as in the PCA, orthogonal principal components are extracted as a

linear combination of the original variables, but in contrast to the PCA regression, the

PLS regression components are extracted from product matrices involving both the

variance/covariance matrices of the independent and the dependent variables. Thus, the

components found are relevant for the dependent variable(s). The PLS regression can be

used to analyse simultaneously more than one dependent variable. It is useful when a set

of dependent variables is assessed by a large set of independent variables, also in cases

when there are fewer observations than predictor variables.

As in the multiple linear regression, in the PLS a linear model is obtained. It has the

following matrix form:

Y = TB + E (5.12)

where Y is the matrix of the dependent variables, having dimensionality of n x m

T is the matrix containing the scores of the extracted principal components

(having the cases in the rows and the principal component scores in the columns), with

dimensions n x p

B is the matrix of the PLS regression coefficients, of dimensionality p x m

E is a term, corresponding to the unexplained variance in Y, with the same

dimensions as Y


m is the number of dependent variables

p is the number of the considered principal components (with high eigenvalues).

In this equation the variables in Y are standardised by subtracting their means and

dividing by the standard deviations.

The principal components in the PLS regression are extracted by considering the

variance/covariance structure between the predictor and dependent variables. The

loadings of the original variables in the principal components are computed to maximise

the covariance between the dependent variables and the principal components scores.

Subsequently, an ordinary least squares procedure is applied with the extracted principal

component scores and the dependent variables to produce the PLS regression model.

Since T = XF, where F is the matrix of the loadings of the original variables in the

principal component scores (feature vector, see Section 5.2 “Principal Components

Analysis”), and X is the matrix of the original mean-subtracted independent variables

(with dimensionality n x q, where n is the number of observations and q is the number of

original independent variables), the PLS regression model can be presented in terms of

the original independent variables:

Y = XQ + E (5.13)

where Q = FB is the matrix of regression coefficients.

Additionally, a matrix P is defined (equal to the transpose of the matrix F, F^T, with

dimensions p x q), representing the loadings of the principal component scores in the

original mean-subtracted variables:

X = TP + K (5.14)

where K is the unexplained part of X due to the reduction of the variable dimensionality

(the excluding of some of the principal components with low eigenvalues).

Different algorithms for calculating the PLS principal components exist. They are

iterative algorithms, in which one principal component is extracted from the X data

matrix at each step. The principal component is extracted to have scores with maximum

covariance with a certain linear combination of the dependent variables. Afterwards, a

regression is performed on the Y and X variables by using the scores of the extracted

principal component. The predicted Y and X values from the regression with the first

principal component are subtracted from the original Y and X variables. The obtained

residuals are called deflated Y and X matrices. In the next step of the PLS algorithm the

original Y and X data are replaced with the deflated Y and X data, and a new principal

component is extracted. The procedure is performed until the desired number of principal

components is obtained, or X becomes a null-matrix. The standard algorithm for

computing PLS regression components is called nonlinear iterative partial least squares

(NIPALS). Another algorithm is the SIMPLS algorithm, which is faster than the NIPALS

algorithm, but gives slightly different results when more than one dependent variable is

included.
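
The extraction and deflation steps can be sketched for the simplest case of a single dependent variable, for which no inner iteration is needed. The code below assumes NumPy and synthetic data; the function name `pls1_nipals` is hypothetical, and the sketch does not cover the case of several dependent variables.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    X = X - X.mean(axis=0)                  # mean-subtracted independent variables
    y = y - y.mean()
    W, P, c, T = [], [], [], []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)              # weights maximising the covariance of Xw with y
        t = X @ w                           # scores of the extracted component
        tt = t @ t
        p = X.T @ t / tt                    # loadings of the component in X
        q = y @ t / tt                      # regression of y on the scores
        X = X - np.outer(t, p)              # deflate X
        y = y - q * t                       # deflate y
        W.append(w); P.append(p); c.append(q); T.append(t)
    W, P, T = np.array(W).T, np.array(P).T, np.array(T).T
    coef = W @ np.linalg.solve(P.T @ W, np.array(c))   # coefficients for the original X
    return coef, T

# Synthetic data (assumed)
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
coef, T = pls1_nipals(X, y, n_components=3)
```

When all possible components are extracted, as in this example, the coefficients coincide with those of ordinary least squares on the mean-subtracted data; with fewer components a regularised model is obtained. The score vectors of the successive components are mutually orthogonal.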

The same statistical parameters as used in multiple linear regression analysis can be

applied to assess the statistical significance of PLS regression models.

5.4. Linear discriminant function analysis

5.4.1. General description of the method

Discriminant function analysis is used to determine which variables discriminate between

two or more naturally occurring groups in a population, and to classify observations into

the different groups with accuracy better than chance.

In discriminant analysis it is determined whether groups differ with respect to the mean

value of a variable (Statistica version 5.5 software, help manual). If the means for a

variable are significantly different in different groups, then this variable discriminates

between the groups. The significance test of whether a variable discriminates between

groups is the Fisher F test. F is computed as the ratio of the between-groups variance in

the data and the pooled (average) within-group variance. If the between-group variance is

significantly larger than the within-group variance, there are significant differences between the group means. In the case

of multivariate discrimination, the matrix of total variances and covariances, and the

matrix of pooled within-group variances and covariances, are calculated. The two

matrices are compared by using multivariate F tests (Wilks’ lambda) to determine

whether there are significant differences in the means of all variables between groups.

In the Fisher linear discriminant analysis, so-called linear discriminant functions are

calculated. In the two-group case, a linear equation is fitted to obtain a discriminant

function:

Group = b_1 X_1 + b_2 X_2 + … + b_p X_p + b (5.15)

where b_1, b_2, … b_p are coefficients of the independent variables X_i

b is an intercept.

This discriminant function is analogous to a multiple regression equation, but values of

discriminant coefficients are sought that will maximise the distance between the means of

the calculated values of the function in the two groups.
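
For the two-group case the coefficients can be obtained in closed form from the pooled within-group variance/covariance matrix and the difference between the group means. The sketch below assumes NumPy and synthetic data; the function name `fisher_discriminant` is hypothetical, and the choice of intercept (zero score midway between the group centroids) is an assumption made for illustration.

```python
import numpy as np

def fisher_discriminant(X1, X2):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled within-group variance/covariance matrix
    S1 = np.cov(X1, rowvar=False) * (len(X1) - 1)
    S2 = np.cov(X2, rowvar=False) * (len(X2) - 1)
    Sw = (S1 + S2) / (len(X1) + len(X2) - 2)
    b = np.linalg.solve(Sw, m1 - m2)        # coefficients b_1 ... b_p
    b0 = -b @ (m1 + m2) / 2                 # intercept: zero score midway between centroids
    return b, b0

# Synthetic two-group data (assumed)
rng = np.random.default_rng(5)
X1 = rng.normal(loc=[2.0, 0.0], size=(30, 2))
X2 = rng.normal(loc=[-2.0, 0.0], size=(30, 2))
b, b0 = fisher_discriminant(X1, X2)
scores1, scores2 = X1 @ b + b0, X2 @ b + b0
```

With this intercept the mean discriminant score is positive in the first group and negative in the second, so a cut-off value of zero separates the two groups.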

In the case of more than two groups, more discriminant functions are calculated. These

functions are called also canonical functions and the analysis is termed canonical

discriminant analysis. The number of functions is the lesser of (g-1) where g is the

number of groups, and p, the number of independent variables. For example, when there

are three groups, there will be a function for discriminating between group 1 and groups 2

and 3 combined, and another function for discriminating between group 2 and group 3.

Each discriminant function is orthogonal to the others (uncorrelated).

Analogously to the regression analysis, standardised discriminant functions can be

calculated by standardising the variable values to have means of zero and standard

deviations of one. The standardised discriminant coefficients are used to compare the

relative importance of the independent variables.

For each discriminant function an eigenvalue is calculated. It reflects the percent of

between-group variance explained by the corresponding function. Thus, the eigenvalues

assess the relative importance of the discriminant functions for discriminating between

groups. If there is more than one discriminant function, the first function has the highest

eigenvalue (it is the most important), and the eigenvalues of the following functions

decrease.

Discriminant scores for each observation are calculated from the equation(s) for the

discriminant function(s) by using the values of the independent variables for the

respective observations. The Z scores are the discriminant scores calculated from the

standardised discriminant function(s).

The Pearson correlation coefficients between the independent variables and the

discriminant scores are called structure coefficients or discriminant loadings. They can be

used to assess how closely a variable is related to a discriminant function. A table

representing structure coefficients of each variable with each discriminant function is

called a canonical structure matrix.

To test the statistical significance of the discrimination the value of Wilks’ lambda (λ) is

used. It tests whether there are differences between the group means of a combination of

variables. Wilks’ λ takes values from 0 to 1, with 0 meaning group means differ (the

variables differentiate between the groups), and 1 meaning all group means are the same.

Wilks’ λ is sometimes called the U statistic. The Wilks’ λ for the overall discrimination is

computed as the ratio of the determinant (det) of the within-groups variance/covariance

matrix over the determinant of the total variance/covariance matrix:

Wilks’ λ = det(W)/det(T) (5.16)

The Wilks’ λ statistic can be mathematically transformed to a statistic, which has

approximately a Fisher (F) distribution. Thus, an F value for the corresponding Wilks’ λ

and degrees of freedom can be calculated, and the corresponding level of statistical

significance can be assessed. Details of the mathematical procedure are given in Rao

(1951).
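
The computation of Wilks' λ can be sketched as follows, assuming NumPy. The sketch uses the within-group and total matrices of sums of squares and cross-products, which is the usual form of the statistic; the function name `wilks_lambda` and the small data set are hypothetical.

```python
import numpy as np

def wilks_lambda(X, groups):
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    Xc = X - X.mean(axis=0)
    T = Xc.T @ Xc                           # total sums of squares and cross-products
    W = np.zeros_like(T)
    for g in np.unique(groups):
        Xg = X[groups == g]
        Xg = Xg - Xg.mean(axis=0)
        W += Xg.T @ Xg                      # within-group contribution
    return np.linalg.det(W) / np.linalg.det(T)

# Two well-separated groups of four observations each (artificial data)
X = [[0, 0], [1, 1], [0, 1], [1, 0], [10, 10], [11, 11], [10, 11], [11, 10]]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
lam = wilks_lambda(X, groups)
```

For these data det(W) = 4 and det(T) = 804, so λ is close to zero, reflecting the strong separation of the group means.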

Additionally, the significance of each variable added to the model can be assessed by

calculating a partial Wilks’ λ, which is equal to:

partial λ = λ(after adding the variable)/λ(before adding the variable) (5.17)

The corresponding F value of the partial λ is calculated as:

F = [(n-q-p)/(q-1)]*[(1-partial λ)/partial λ] (5.18)

where: n is the number of observations

q is the number of groups

p is the number of variables.

Again, a level of statistical significance of the variable corresponding to the F value can

be determined.
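
Equations 5.17 and 5.18 amount to the following arithmetic. The sketch is plain Python; the function name `partial_lambda_f` and the numerical values are hypothetical and serve only as an illustration.

```python
def partial_lambda_f(lambda_after, lambda_before, n, q, p):
    """Partial Wilks' lambda (Eq. 5.17) and its F value (Eq. 5.18)."""
    partial = lambda_after / lambda_before
    f = ((n - q - p) / (q - 1)) * ((1 - partial) / partial)
    return partial, f

# Hypothetical values: 50 observations, 2 groups, 3 variables
partial, f = partial_lambda_f(lambda_after=0.30, lambda_before=0.60, n=50, q=2, p=3)
```

For these values the partial λ is 0.5 and F = (45/1) × (0.5/0.5) = 45.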

As described above, one of the purposes of the discriminant analysis is to classify

observations into the defined groups. Several methods can be used for this purpose.

Some approaches use the non-standardised discriminant coefficients and the derived

discriminant scores. If the discriminant score for an observation, derived from a given

discriminant function, is less than or equal to a cut-off value (specified for that function),

the observation is classified to one group, otherwise it is classified to the other group,

which that function discriminates.

In another approach, classification functions are calculated for each group, which include

the variables determined as discriminating between the groups. Each function allows

computation of classification scores for each case for each group, by applying the

formula:

S_i = w_i1*X_1 + w_i2*X_2 + ... + w_ip*X_p + c_i (5.19)

where the subscript i denotes the respective group

the subscripts 1, 2, ..., p denote the p variables

c_i is a constant for the i'th group

w_ij is the weight for the j'th variable in the computation of the classification score

for the i'th group

X_j is the observed value for the respective case for the j'th variable

S_i is the resultant classification score for the respective case.

An observation is classified as belonging to the group for which it has the highest

classification score.

The so-called Mahalanobis distances are also used to classify observations. A

Mahalanobis distance is a measure of the distance between two points in a

multidimensional space defined by two or more variables. If these variables are

uncorrelated, the Mahalanobis distances will be identical to the Euclidean distances

between points (the axes of the space can be regarded as being orthogonal). However,

calculating Mahalanobis distances takes into account that the variables defining the space

can be correlated (the space axes are considered as non-orthogonal). In those cases the

obtained Mahalanobis distances account for the correlations.

To classify an observation, the Mahalanobis distances between that observation and the

centroids of each group are calculated, and the observation is classified as belonging to

the group to which it has smallest distance. A group centroid represents the point in the

multivariable space with coordinates equal to the means of all variables calculated for the

observations in the group.
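
Classification by Mahalanobis distance can be sketched as follows, assuming NumPy. The centroids, the variance/covariance matrices and the observation are hypothetical, and a single pooled matrix is assumed for all groups.

```python
import numpy as np

def mahalanobis_classify(x, centroids, pooled_cov):
    """Squared Mahalanobis distances to each centroid and the index of the nearest group."""
    inv = np.linalg.inv(pooled_cov)
    d2 = np.array([(x - m) @ inv @ (x - m) for m in centroids])
    return d2, int(np.argmin(d2))

centroids = np.array([[0.0, 0.0], [4.0, 0.0]])
cov_uncorrelated = np.eye(2)                         # unit variances, no correlation
cov_correlated = np.array([[1.0, 0.9], [0.9, 1.0]])  # strongly correlated variables
x = np.array([1.5, 1.5])

d2_u, g_u = mahalanobis_classify(x, centroids, cov_uncorrelated)
d2_c, g_c = mahalanobis_classify(x, centroids, cov_correlated)
```

With uncorrelated variables of unit variance the squared Mahalanobis distances equal the squared Euclidean distances. With correlated variables the distances change, because deviations along the direction of the correlation are penalised less than deviations across it.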

Additionally, a probability for a given observation to belong to a certain group can also be

calculated (the so-called posterior probability). The posterior classification probability for

an observation is related to the Mahalanobis distance of that observation to the group

centroid. A multivariate normal distribution of the observations around each group

centroid is assumed, and the posterior probabilities are assessed depending on how many

standard deviations the observation lies from the centroid in terms of Mahalanobis distance

(e.g. observations which have a Mahalanobis distance more than 1.96 times the standard

deviation from the centroid have less than 0.05 chance of belonging to the given group).

To classify observations, the so-called a priori classification probabilities are also taken

into account. They are based on an a priori knowledge for the chances of observations to

belong to a certain group, without considering the particular values of the independent

variables for the observations. For example, the a priori probabilities can be proportional

to the size of the groups, since the probability that an observation belongs to a group with

a bigger number of observations, is higher. However, the a priori probabilities should be

proportional to the group sizes only when it is known that there is a causality for such

differences in the group sizes in the population, and it is not a simple result of the

sampling procedure.

A classification table (matrix) is used to assess the percentage of correct classification. On

the rows of the table are placed the numbers of the cases observed in each group and on

the columns are the numbers of the cases predicted in each group. The numbers of the

cases predicted in the same group as observed are placed on the diagonal of the table. The

percentage of these cases represents the percentage of correct classifications.

The percentage of correct classifications must be compared to the percent of cases that

would have been correctly classified by chance alone. When there are two groups with

equal sizes, the expected correct classification by chance alone is 50%. For n groups with

equal sizes, the expected percent is 100/n. For n groups of different sizes (g1, g2,..., gn),

the expected percent is equal to:

expected percent correct classification by chance alone = 100*[(g1/N)*g1 + (g2/N)*g2 +

... + (gn/N)*gn] /N (5.20)

where N is the number of all observations.
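
The percentage of correct classifications and the percentage expected by chance alone (Equation 5.20) can be computed as in the sketch below. It is plain Python; the function names and the classification table are hypothetical.

```python
def chance_percent(group_sizes):
    """Expected percent correct classification by chance alone (Eq. 5.20)."""
    n_total = sum(group_sizes)
    return 100.0 * sum((g / n_total) * g for g in group_sizes) / n_total

def percent_correct(table):
    """Percent of cases on the diagonal of a classification table (rows: observed)."""
    total = sum(sum(row) for row in table)
    diagonal = sum(table[i][i] for i in range(len(table)))
    return 100.0 * diagonal / total

table = [[40, 10],    # hypothetical counts, observed group 1
         [5, 45]]     # hypothetical counts, observed group 2

observed = percent_correct(table)                        # 85% for this table
expected = chance_percent([sum(row) for row in table])   # 50% for two equal groups
unequal = chance_percent([80, 20])                       # 68% for groups of 80 and 20
```

The last value shows that with unequal group sizes the level expected by chance is higher than 100/n, so a model must be judged against this higher level.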

5.4.2. Assumptions of linear discriminant function analysis

The discriminant function analysis makes the following assumptions:

a) Assumes linearity.

b) Independence of each observation from each other observation.

c) The sizes of the different groups must not be greatly different.

d) Each independent variable has a standard deviation different from zero in each

group (calculation requirement).

e) Errors (residuals) are randomly distributed.

f) Homogeneity of variances (homoscedasticity):

Similar values of the variances and means must be observed in the groups for the same

independent variable. Lack of homogeneity of variances will question the results of the

tests for significance. Lack of homogeneity of variances may indicate presence of outliers

in one or more groups. Lack of homogeneity of variances and presence of outliers can be

evaluated through scatterplots of variables. Also, the Levene test can be used to examine

homogeneity of variance in the case of one variable. The null hypothesis of the Levene

test is that the variances do not differ between groups. Thus, the test must be

statistically non-significant at the 95% confidence level for the variances to be regarded

as homogeneous.

g) Homogeneity of covariances/correlations:

Within each group the covariance/correlation between any two predictor variables must

be similar to the corresponding covariance/correlation in the other groups, i.e. the groups

must have similar covariance/correlation matrices. The Box M test is used to assess the

homogeneity of covariance matrices. Its null hypothesis is that the covariance matrices do

not differ between groups. Thus, the test must be statistically non-significant at the 95%

confidence level for the matrices to be regarded as homogeneous. However,

the M test is sensitive to violations from normal distribution of the variables. Another test,

which is robust to violations of the normal distribution, is the Sen-Puri nonparametric test.

It is also necessary that the test is insignificant at the 95 % level, in order not to reject its

null hypothesis that the matrices are homogeneous.

h) Low multicollinearity of the independent variables is assumed. Again, high

multicollinearity does not change the variable coefficients in the discriminant

functions, but hampers the correct estimation of the importance of the predictor

variables by using the standardised discriminant coefficients.

i) It is assumed that predictor variables follow multivariate normal distributions, i.e.

each predictor variable has a normal distribution about fixed values of all other

independent variables:

Generally, discriminant analysis will be robust against violation of this assumption if the

smallest group has more than 20 cases and the number of independent variables is fewer

than six.

Violations of the assumptions of the discriminant analysis will influence the reliability of

the significance tests in the discriminant analysis. However, the final goal of the

discriminant analysis is to classify correctly observations into groups. Thus, violations of

the assumptions of the tests for statistical significance might influence the derivation of

discriminant and classification functions. However, the performance of the discriminant

analysis should be assessed mainly on the accuracy of the classification (the percentage of

correct classifications).

5.5. Classification trees

Classification trees are used to predict membership of objects (observations) into a priori

defined groups (classes) on the basis of independent (predictor) variables. Thus, the

objective of this technique is similar to the objective of discriminant analysis. However,

algorithms used for classification differ.

Classification trees consist of sets of decision rules (splits), nested within each other,

having the following form: “if a value of a variable is smaller/bigger than a cut-off value,

then assign the object to a group or consider another cut-off or consider another variable –

if it is bigger/smaller than a cut-off value, then …; else, assign the object to another group

or consider another cut-off or consider another variable …”. Thus, the classification tree

consists of nodes, at which the decision rules are applied to create different branches

containing objects and possibly new nodes of the tree. The tree starts with an initial node

(called the root node), which contains all the objects. After each node split, each object is

assigned to a certain branch of the tree. The classification trees have a hierarchical

structure, in which a hierarchy of decision rules is applied in a recursive way (i.e. a given

predictor variable can be used in more than one decision rule). The hierarchical structure

distinguishes the classification tree approach from the discriminant analysis, where the

decision to assign an object to a group is taken on the basis of a single decision rule (in

the two-group case), taking into account the values of all variables in the model

simultaneously.

Classification trees can be computed using ordered or categoric independent variables, or

combinations of them. Ordered variables allow for ordering of their values. In categoric

variables categories are used to define variable values, which do not allow for ordering of

the values (for example a variable defining colours of objects is a categoric variable).

Univariate splits (considering one variable at a node), as well as linear-combination splits

can be applied.

The classification accuracy of classification trees can be assessed in terms of

misclassification costs. Predictions with lower misclassification cost are considered more

accurate. The misclassification costs reflect the number of misclassified objects, but also

take into account other factors. For example, in some cases, objects from certain

categories need to be correctly assigned with higher accuracy than objects from other

categories (for example, in the case when it is more dangerous to classify actually toxic

compounds as non-toxic, than actually non-toxic compounds as toxic). In these cases, the

misclassification cost of incorrect classification of an object actually belonging to the

more important categories is higher than the misclassification cost of incorrect

classification of an object actually belonging to the less important categories.

The correct classification and the classification costs also depend on the a priori

classification probabilities (see Section 5.4 “Linear discriminant function analysis”).

Additionally, the ratio of the a priori probabilities can be used to assess the importance of

misclassifications for each class. When the a priori probabilities are taken to be proportional to

the class sizes, minimising misclassification costs corresponds to minimising the overall

proportion of misclassified objects, because the prediction should be better in larger

classes to produce a lower misclassification rate (with all other factors determining the

misclassification costs equal for the different classes).

The splits in the classification trees can be chosen by different approaches. For example,

in the discriminant-based univariate (DBU) splitting, for each node formed so far, p-

levels are computed for significance tests of the relationship between class membership of

the node and the values (levels) of each predictor variable. For categoric predictors, the p-

levels are computed by using Chi-square tests of independence of the frequency

distribution of the classes on the levels of the categoric predictor. For ordered predictors,

the p-levels are computed by using analysis of variance (ANOVA). In some cases tests

that are robust to violations of the normal distribution, such as the Levene test for equality

of variances, are also used. The predictor variable for which the smallest p-level is

calculated is chosen to split the corresponding node. To determine the cut-off value of the

split the two-means clustering algorithm of Hartigan and Wong (1978) is used for ordered

predictors. The algorithm creates two “superclasses” and chooses the cut-off value to be

closest to the mean of one of the “superclasses”. To find the split for categoric predictors,

these are coded as dummy variables representing the levels of the categoric predictor. The

dummy variables are afterwards transformed into ordered predictors. The procedures for

ordered predictors are then applied. The discriminant-based linear combination splitting is

based on similar principles.

Another splitting algorithm is the exhaustive search for univariate splits used in classification and

regression trees (CART). The method examines all possible splits for each predictor variable at

each node in order to find the split producing the largest improvement in goodness of fit.

The improvement of the goodness of fit due to the split is assessed by the degree of

reduction of the heterogeneity (presence of objects of different classes in the same node)

of the daughter nodes compared to the parent node. Deviance can be used as a measure of

homogeneity (lack of heterogeneity) of a node. It is calculated on the basis of the relative

frequencies of the classes at a node, and the number of node objects. Another criterion of

homogeneity is the Gini measure of node impurity. It is calculated again using the relative

frequencies of the classes at a node from the equation:

$\mathrm{Gini}_i = 1 - \sum_k p_{ik}^2$ (5.21)

where $\mathrm{Gini}_i$ is the value of the Gini index of node i

$p_{ik}$ is the relative frequency of the k-th class at the i-th node

the summation is over all classes k present at node i.

The Gini index is equal to zero if all objects at a node belong to the same class, and

increases as more classes are present at the node and as they become more evenly represented.
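The calculation of the Gini index can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular software package. The function names are hypothetical, and the size-weighted average used in `gini_reduction` to score a split is the conventional CART choice, assumed here rather than taken from the text.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of a node: 1 minus the sum of squared relative class frequencies."""
    n = len(labels)
    if n == 0:
        raise ValueError("node contains no objects")
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_reduction(parent, left, right):
    """Reduction of impurity when `parent` is split into `left` and `right`.

    Assumption: daughter nodes are weighted by their share of the parent objects.
    """
    n = len(parent)
    weighted = (len(left) / n) * gini_index(left) + (len(right) / n) * gini_index(right)
    return gini_index(parent) - weighted


pure_node = ["toxic"] * 5
mixed_node = ["toxic", "toxic", "non-toxic", "non-toxic"]
print(gini_index(pure_node))   # a pure node has zero impurity
print(gini_index(mixed_node))  # two equally frequent classes
print(gini_reduction(mixed_node, ["toxic", "toxic"], ["non-toxic", "non-toxic"]))
```

A split that separates the two classes completely removes all of the parent impurity, which is why such a split would be preferred by the search.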

If no limit is placed on the number of splits to be performed, the splitting will be done

until the terminal nodes of the classification tree contain only one class of objects (pure

classification). However, such a tree may contain splits due to chance numerical

relationships between variables, and may perform badly when new objects are to be

classified. Therefore, some rules for stopping the splitting are usually applied. For

example, splitting is stopped when all terminal nodes are pure or have fewer objects than

a specified number.

Even when a stopping rule is applied, the resulting tree might have a complicated structure. Thus,

a tree possessing good classification accuracy, and on the other hand having a fairly

simple structure, should be sought. For this purpose, procedures for pruning trees are

applied, in which branches of less importance for the classification accuracy are removed

from the tree. In the minimal cost-complexity pruning the costs are computed at each step

of the growing (splitting) of the tree up to the maximum tree size. The costs generally

decrease with increasing the number of splits in the tree (reflecting better classification).

However, the costs’ decrease is greater at the initial splits than at the splits close to the

maximum possible size of the tree. Thus, pruning the terminal nodes of a tree with

maximum size will result in a smaller increase of the costs, while pruning nodes that are

close to the root node will result in a greater increase of the cost. Therefore, an optimal

size of the tree can be reached when the terminal nodes are pruned up to a cut-off value

of the extent of increase of the costs due to the pruning.

Additionally, cross-validation can be performed at each step of the pruning, and the cross-

validated costs can be used to identify the tree with the optimal size. Generally, the cross-

validated costs decrease with an increasing number of splits down to a given minimum. Subsequent

tree splitting increases the cross-validated costs (unlike the non-cross-validated

costs, which always decrease with increasing complexity of the tree). Thus, the less

complex tree, for which the cross-validated cost is close to its minimum value, can be

chosen. Breiman et al. (1984) suggested that to obtain a tree with optimal size, the least

complex tree whose cross-validated cost does not exceed the minimum cross-validated cost

plus one times the standard error of the minimum cross-validated cost, should be selected.
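The selection rule of Breiman et al. (1984) can be illustrated with a short Python sketch. The function name, the argument names and all numerical values below are hypothetical. The sketch assumes that the cross-validated cost and its standard error are already available for each candidate tree in the pruning sequence, and that complexity is measured by the number of terminal nodes.

```python
def select_tree_one_se(n_terminal_nodes, cv_costs, cv_std_errors, se_factor=1.0):
    """Return the index of the least complex tree within se_factor standard errors
    of the minimum cross-validated cost."""
    best = min(range(len(cv_costs)), key=lambda i: cv_costs[i])
    threshold = cv_costs[best] + se_factor * cv_std_errors[best]
    eligible = [i for i in range(len(cv_costs)) if cv_costs[i] <= threshold]
    # Assumption: fewer terminal nodes means a less complex tree.
    return min(eligible, key=lambda i: n_terminal_nodes[i])


# Hypothetical pruning sequence, from the root-only tree to the largest tree.
sizes = [1, 2, 3, 5, 8, 12]
costs = [0.50, 0.35, 0.28, 0.25, 0.24, 0.29]
errors = [0.05, 0.04, 0.03, 0.03, 0.03, 0.03]

chosen = select_tree_one_se(sizes, costs, errors)
print(sizes[chosen], costs[chosen])
```

In this example the minimum cross-validated cost belongs to the tree with 8 terminal nodes, but the tree with 5 terminal nodes lies within one standard error of that minimum and is therefore selected.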

A graphical representation of the hierarchical structure of classification trees can be used

to simplify the interpretation of the results.

5.6. Cluster Analysis

The term cluster analysis (CA) represents a variety of techniques used to identify

common properties of objects and assign them to groups (clusters). The investigated

objects can represent a set of observations (in this case variables can be used to identify

the similarities between the observations), or a set of variables. Unlike other statistical

procedures like discriminant analysis and classification trees, cluster analysis is used

when there is no a priori knowledge of the possible grouping of the objects.

Most CA techniques are hierarchical, i.e., clusters with increasing numbers of objects are

nested within each other. The formed structure of clusters is called a hierarchical tree.

Some CA methods are non-hierarchical, for example k-means clustering. Additionally,

clustering techniques can be divisive or agglomerative. In the divisive techniques initially

all objects are placed in a single cluster, which is then divided gradually into smaller

clusters. In the agglomerative techniques initially each object forms a separate cluster,

and the clusters are united gradually until one cluster is formed. To perform this, the

distances (similarities) between the objects are assessed, and the two closest objects (the

two most similar clusters) are united in a single cluster. The distances between the new

clusters are calculated again, and the two most similar clusters are again united. The

procedure is repeated until all objects are included in the same cluster.

The distances between the objects can be assessed in several ways. For example, when a

set of observations is investigated, Euclidean distances can be used (calculated on the

basis of variable values). To avoid the influence of possible differences between the

variable scales on the results, the Euclidean distances can be calculated on standardised

data. Another measure of the distances is the city-block (Manhattan) distance, which is

calculated as the sum of the absolute differences between the variable values of two objects

(observations). In most cases, this distance measure yields results similar to the Euclidean

distance. The Chebychev distance is calculated as the largest absolute difference in any single

variable between two objects. Pearson correlation coefficients can be used as

a distance measure when variables are to be clustered.
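The distance measures for observations can be written compactly in Python. The sketch below is illustrative, with hypothetical function names. The city-block distance is taken as the sum of the absolute differences, which is the usual definition (dividing it by the number of variables gives an averaged form). Standardisation is assumed to mean column-wise z-scores based on the sample standard deviation.

```python
import math
import statistics


def euclidean(x, y):
    """Square root of the sum of squared differences between variable values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def city_block(x, y):
    """Sum of the absolute differences between variable values (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebychev(x, y):
    """Largest absolute difference in any single variable."""
    return max(abs(a - b) for a, b in zip(x, y))


def standardise(rows):
    """Column-wise z-scores, so that variables on different scales contribute equally.

    Assumption: no variable is constant (a zero standard deviation would fail).
    """
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


a, b = [0.0, 0.0], [3.0, 4.0]
print(euclidean(a, b), city_block(a, b), chebychev(a, b))

# Hypothetical data: the second variable is on a ten-fold larger scale.
scaled = standardise([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(scaled)
```

After standardisation both variables in the hypothetical data set take the same values, so neither dominates the distance because of its scale.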

Additionally, linkage rules need to be specified in the agglomerative CA. When a cluster

contains a single object, its distance to another cluster, containing also a single object, is

defined by the distance between the two objects. However, if the clusters contain several

objects, to identify the distance between them a certain linkage rule is applied. In the

single-linkage rule, the distance between two clusters is determined as the closest distance

between any pair of objects of the different clusters. In the complete-linkage rule, the

distance between two clusters is determined as the greatest distance between any pair of

objects of the different clusters. Another linkage method is the unweighted pair-group

average rule, in which the distance between two clusters is calculated as the average

distance between all pairs of objects of the two different clusters. It uses a more

centralised measure of the distance than the single or complete linkage rules. The weighted

pair-group average rule again uses the average distance between all pairs of objects,

taking into account the sizes of the clusters (i.e., the numbers of objects in them) as

weights. The pair-group centroid rule (unweighted, or weighted for the size of the clusters)

defines the distance between the clusters as being equal to the distance between the

cluster centroids. A centroid of a cluster is calculated as an average point in the

multidimensional space defined by the variables (when observations are subject to

clustering). Another linkage rule is Ward's method, in which the sum of squares of any

two hypothetical clusters that can be formed at each step is minimised.
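The difference between the linkage rules is easiest to see when they are applied to the same pair of clusters. The following Python sketch covers the single-linkage, complete-linkage, unweighted pair-group average and unweighted centroid rules. The function names and the example coordinates are hypothetical, and Euclidean distance is assumed as the underlying distance measure.

```python
import math
from itertools import product


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pair_distances(cluster_a, cluster_b):
    """Distances between all pairs of objects drawn from the two clusters."""
    return [euclidean(a, b) for a, b in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    return min(pair_distances(cluster_a, cluster_b))  # closest pair


def complete_linkage(cluster_a, cluster_b):
    return max(pair_distances(cluster_a, cluster_b))  # most distant pair


def average_linkage(cluster_a, cluster_b):
    d = pair_distances(cluster_a, cluster_b)
    return sum(d) / len(d)  # unweighted pair-group average


def centroid(cluster):
    """Average point of a cluster in the space defined by the variables."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


def centroid_linkage(cluster_a, cluster_b):
    return euclidean(centroid(cluster_a), centroid(cluster_b))


# Two hypothetical clusters of two observations each, described by two variables.
cluster_1 = [(0.0, 0.0), (0.0, 2.0)]
cluster_2 = [(3.0, 0.0), (3.0, 2.0)]

for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, rule(cluster_1, cluster_2))
```

For these clusters the single-linkage distance is the smallest and the complete-linkage distance the largest, with the average rule lying between them.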

The results of the hierarchical CA are usually shown in the form of a dendrogram plot, in

which the structure of the clusters in the hierarchical tree is presented as distinct branches.

The horizontal axis of the plot represents the cluster distance.

Other CA techniques are described in Tenekedjiev (1994). For example, in the two-way

joining method observations and variables are clustered simultaneously. The k-means

clustering aims to produce exactly k clusters with the lowest possible similarity between

them. The analysis is performed by minimising variability within clusters and maximising

variability between clusters.

5.7. Selecting variables for statistical analyses

If a large set of independent variables is available, several techniques can be used for

selecting the variables that give models with the best statistical fit. An example is the stepwise

procedure (applied mainly in regression and discriminant analysis), which can proceed by

forward stepping or backward stepping. In forward stepping the initial model contains

only one predictor variable found to give the best statistical fit. In the following step a

variable, which improves the statistical parameters of the model more than the other

variables, is added. This is repeated until a fixed number of variables have been included,

or until a desired statistical fit has been achieved. In backward stepping, the initial model

contains all predictor variables, and at each step the variable whose removal results in the smallest

reduction of the model fit is excluded. Again, the procedure is continued until a fixed

number of variables remain, or until removing further variables would reduce the fit below a desired level.
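Forward stepping can be sketched as a greedy loop in Python. The sketch is generic and is not the procedure of any specific statistical package. The function `forward_select`, its arguments and the toy scoring function are hypothetical. In practice `score` would refit the regression or discriminant model for the given subset and return a fit statistic for which larger values are better.

```python
def forward_select(candidates, score, max_vars, target_score=None):
    """Greedy forward stepping.

    candidates   : names of the available predictor variables
    score        : callable taking a list of variable names and returning the model
                   fit (assumption: larger values mean a better fit)
    max_vars     : stop when this many variables have been included
    target_score : optional desired fit; stop as soon as it is reached
    """
    selected, remaining = [], list(candidates)
    current = None
    while remaining and len(selected) < max_vars:
        if target_score is not None and current is not None and current >= target_score:
            break
        # Try each remaining variable and keep the one giving the best fit.
        trial = {v: score(selected + [v]) for v in remaining}
        best_var = max(trial, key=trial.get)
        selected.append(best_var)
        remaining.remove(best_var)
        current = trial[best_var]
    return selected, current


# Toy scoring function standing in for a real model fit (hypothetical values).
contributions = {"logP": 0.60, "HBD": 0.25, "MW": 0.05, "PSA": 0.00}


def toy_score(subset):
    return sum(contributions[v] for v in subset)


print(forward_select(["MW", "logP", "PSA", "HBD"], toy_score, max_vars=3))
print(forward_select(["MW", "logP", "PSA", "HBD"], toy_score, max_vars=4, target_score=0.8))
```

Backward stepping follows the same pattern in reverse: it starts from the full set and, at each step, removes the variable whose exclusion reduces the score least.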

Another approach, called “best subsets approach” can be applied, in which statistical

models are developed by using all possible combinations of variables and the

combinations that give the best statistical fits are selected. However, for large sets of

independent variables it requires great computational resources and time.

5.8. Cross-validation statistical procedures

As described in Chapter 2, a cross-validation statistical procedure can be used to evaluate

the predictive power of QSAR models. It includes the following steps:

- one (or a set) of the chemical compounds is excluded from the training set and the

model under evaluation is developed again using the remaining compounds;

- the investigated property (biological activity, class belonging) of the excluded

compound(s) is predicted using the model based on the remaining compounds;

- this procedure is repeated by excluding different compounds, until all compounds

are excluded once and have one prediction of the investigated property;

- the accuracy of the predicted values from the previous steps is estimated by

calculating certain parameters, typical for the statistical technique used to develop

the model.

For example, if the predictivity of a regression model is evaluated, the cross-validated

coefficient of determination (Q²) and cross-validated standard error of estimate

(sometimes called standard error of prediction, SEP) can be calculated in a similar way as

in the standard regression procedure, but using the predicted values from the cross-

validation procedure. In classification trees, cross-validated misclassification costs can

be calculated.

There are several cross-validation algorithms, which differ by the number of compounds

excluded from the training set at each step. The most commonly used is the leave-one-out

cross-validation procedure, in which one compound is excluded at each step. In the leave-

N-out cross-validation procedure, a set of N compounds is excluded at each step. In the

leave-group-out cross-validation, the compound set is divided into a given number of

groups, and one group is excluded at each step.
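The leave-one-out procedure can be illustrated for a regression model with a single descriptor. The Python sketch below is a minimal example under stated assumptions: the function names and the data are hypothetical, and Q² is calculated as one minus the ratio of the predictive residual sum of squares (PRESS) to the total sum of squares about the mean of the observed values, which is one common convention.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single descriptor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_q2(x, y):
    """Leave-one-out cross-validated Q2 (assumed convention: 1 - PRESS / SS_total)."""
    press = 0.0
    for i in range(len(x)):
        # Exclude compound i and redevelop the model on the remaining compounds.
        x_train, y_train = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        # Predict the excluded compound and accumulate the squared error.
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_total


# Hypothetical descriptor values and activities for five compounds.
descriptor = [0.0, 1.0, 2.0, 3.0, 4.0]
activity = [0.1, 0.9, 2.2, 2.8, 4.1]
print(loo_q2(descriptor, activity))
```

Leave-N-out and leave-group-out cross-validation differ only in how many compounds are removed in each pass of the loop.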

CHAPTER 6

APPROACHES FOR 3D-(Q)SAR ANALYSIS

In 3D-(Q)SAR approaches 3D-interactions between chemical compounds and their

biological receptors (enzymes or other macromolecules) are simulated. Therefore, the

choice of the molecular conformations is very important for the quality of the results. As

described in Chapter 4, the conformations chosen should reflect the active conformations

of the compounds. However, information about these conformations is often missing, and

the energetically minimal conformations are therefore considered in the analysis. The

energetically minimal conformations are found with the methods for conformational

search (see Chapter 4, Section 4.3 “Obtaining 3D-molecular conformation with minimum

energy”). Additionally, 3D-QSAR techniques are applied correctly only if the

investigated compounds interact with the same biological macromolecule(s), i.e. if they

have very similar molecular mechanisms of biological action.

In this chapter the 3D-(Q)SAR approaches used in the project are described.

6.1. GASP analysis

GASP (genetic algorithm similarity program) is a 3D-SAR technique used to obtain

information about the atoms or chemical groups that may be involved in the interaction with the

biological receptor.

Usually, a set of compounds presumed to bind to the same macromolecule to a

similar extent is used. By identifying similarities between these compounds in certain

structural features and their relative spatial positions, GASP identifies possible chemical

features responsible for the common biological binding of the compounds, and possible

spatial positions of these chemical features (sometimes called “pharmacophore pattern”).

The chemical features considered are rings, which might enter into hydrophobic

interactions with macromolecules, and potential H-bonding sites (atoms that can form H-

bonds). Additionally, a conformation of a compound with high binding affinity can be

defined as a rigid template, and the remaining molecules can be aligned to it.

GASP identifies the potential interaction sites in each molecule and randomly constructs a

population of chromosomes, where each chromosome represents a possible alignment of

the molecules of the data set to each other. The chromosomes encode torsion settings for

the rotatable bonds and an intermolecular mapping of elements. A fitness score for each

chromosome is calculated as the weighted sum of three terms: the number and similarity

of overlaid chemical features, the common volume of the aligned molecules, and the van

der Waals energy of each molecule. The last-named takes into account the probability that

the active conformation of the molecule is the one used in the alignment.

In the next step parent chromosomes with high fitness scores are selected, and child

chromosomes are produced by applying a mutation or crossover operation on the parent

chromosomes. The child chromosomes with improved fitness scores are used to replace

the least fit members of the parent population. These steps are repeated until the fitness of

the population cannot be improved, or until the maximum permitted number of genetic operations has

been performed. After obtaining the chromosomes with the best fitness score, these

chromosomes are used to define the structural features having highest similarity in their

chemical properties and spatial positions between the investigated molecules. These

structural features are suggested to be responsible for the similar biological binding of the

investigated molecules.

6.2. CoMFA and CoMSIA analyses

The CoMFA (comparative molecular field analysis) 3D-QSAR approach was proposed

by Cramer and co-workers (1988). In this approach the representation of molecular

structures is done by using values of a certain type of interaction energy at points in space

around the molecules. The interaction energies are calculated by using a probe atom of a

specified charge and steric properties (usually a carbon atom in sp3 hybrid state). Thus,

interactions with biological macromolecules are simulated. The energies of steric and

electrostatic interactions are used.

The main steps in CoMFA analysis are the following (Figure 6.1):

- the investigated compounds are aligned in space according to chemical

characteristics, considered important for the binding to macromolecules and

biological effect. A priori knowledge of such characteristics is necessary;

- a 3D-grid is constructed around the aligned molecules. The energies of interaction

with a probe atom are calculated at each grid point (energy fields);

- the PLS statistical procedure is used to obtain CoMFA QSAR models. It extracts

principal components from the energy fields;

- the number of PLS components to be included in the final QSAR model is chosen

by applying a cross-validation technique (usually leave-one-out) to evaluate the

predictive power of models with different components. The best model is selected

to have low SEP (standard error of prediction) value and high cross-validated

coefficient of determination (Q²).

Due to the large number of grid points participating in the extracted PLS principal

components, the CoMFA results are usually interpreted graphically as 3D-contour maps.

The maps visualise those points of the 3D-grid whose PLS loadings show a high

association between differences in calculated interaction energies and biological

activities. Usually these points are displayed in two colours, denoting the regions with the

highest positive and lowest negative QSAR coefficients. Thus, the contour maps show

favourable and unfavourable regions around the molecules, where stronger steric and/or

electrostatic interactions will increase or decrease the biological activity, respectively.

CoMSIA (comparative molecular similarity indices analysis) is similar to CoMFA, but uses a

Gaussian function rather than Coulombic and Lennard-Jones potentials to assess steric,

electrostatic, hydrophobic, and hydrogen bond donor/acceptor fields (Klebe, 1998).

In CoMFA and CoMSIA approaches energies of interaction at the grid points around the

molecules are calculated. However, for a number of grid points the calculated energies

will have very low variance (for example grid points around parts of the molecules where

the variation in the chemical structure within the series is very small). Therefore a

criterion, called column filtering (σmin, usually in units of kcal/mol), for excluding such

grid points from the applied PLS analysis is used. Points that have standard deviations

(square root of the variance) smaller than the value of the column filtering are excluded

from the PLS analysis.
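Column filtering amounts to removing low-variance columns from the field matrix before PLS. The Python sketch below illustrates the idea and is not the implementation used in any molecular modelling package. The function name, the matrix layout (rows as compounds, columns as grid points) and the numerical values are hypothetical, and the sample standard deviation is assumed.

```python
import statistics


def column_filter(field_matrix, sigma_min):
    """Drop grid points whose field values vary too little across the compounds.

    field_matrix : list of rows, one row per compound, one column per grid point
    sigma_min    : column filtering threshold (same units as the field values)
    Returns the indices of the retained columns and the filtered matrix.
    """
    columns = list(zip(*field_matrix))
    # Points with a standard deviation smaller than sigma_min are excluded.
    kept = [j for j, col in enumerate(columns) if statistics.stdev(col) >= sigma_min]
    filtered = [[row[j] for j in kept] for row in field_matrix]
    return kept, filtered


# Hypothetical interaction energies (kcal/mol) for 3 compounds at 3 grid points.
fields = [
    [0.0, 1.0, 5.0],
    [0.1, 1.0, -2.0],
    [0.0, 1.0, 10.0],
]
kept, filtered = column_filter(fields, sigma_min=2.0)
print(kept, filtered)
```

Only the third grid point, where the energies differ markedly between the compounds, survives the filter in this example.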

Figure 6.1. Obtaining 3D-QSAR models in CoMFA (taken from Kubinyi, 1993)

CHAPTER 7

BLOOD-BRAIN BARRIER

In this chapter an introduction to the blood-brain barrier (BBB) and its functionality is

presented. The modes of assessment of compound BBB penetration are described.

Additionally, a summary of the QSAR models for BBB penetration, developed and

published before the present project, is presented.

7.1. Introduction to the blood-brain barrier

The blood-brain barrier (BBB) plays a critical role in supplying the brain with nutrients

and in its protection from circulating toxins. The BBB is formed from the brain capillary

endothelial cells, which are tightly linked with junctional complexes (tight junctions) that

eliminate gaps or spaces between cells. Under normal physiological conditions the

membranes of the capillary endothelial cells (basement (lateral) and apical membranes),

together with the tight junctions, form a continuous cellular barrier that prevents the

passive influx of a variety of substances with the exception of the smallest, lipid-soluble

molecules (Lee et al., 2001). The anatomy of the BBB is presented in Figure 7.1.

The passage of molecules across the BBB can occur either between adjacent cells (the

paracellular pathway) or through the endothelial cells (the transcellular path). The

paracellular passage includes passive diffusion of ions and solutes. Under normal

physiological conditions the tight junctions between the endothelial cells prevent most

substances from undergoing paracellular diffusion, with the exception of the smallest,

lipid-soluble molecules.

The transcellular path includes transport across cell membranes by one or more of several

membrane permeation mechanisms: passive diffusion, carrier-mediated (facilitative), and

active transport. Membrane transporters are involved in the influx/efflux of various

essential substrates such as electrolytes, nucleosides, amino acids and glucose. For

example, the sodium-ion-independent System L transports large neutral amino acids such as

phenylalanine, tyrosine, and leucine across the BBB. Small and neutral amino acids,

such as proline, alanine, glycine, methionine, and glutamine are transported by the

sodium-ion-dependent System A (Tamai and Tsuji, 2000). Brain tissue is additionally

protected by efflux transport systems present in the brain capillary endothelial cells.

Several transport protein families have been recognised, such as the product of the

multidrug resistance gene, MDR1 (P-glycoprotein), the multidrug resistance-associated

protein family (MRPs), and the organic anion transporting polypeptides (OATPs) (Sun et al.,

2002). The multidrug resistance (MDR) phenomenon includes development of resistance

to a wide variety of xenobiotic compounds, involving active efflux from the cells.

In general two ways to describe the BBB permeability can be used. The first includes use

of permeability coefficients (P) for compounds with dimensions of length/time. P values

are sometimes multiplied by the capillary surface area per 1 g brain to obtain the

permeability-surface area products (PS), in units of length³/(time × weight). Both the P

and PS values are quantitative measures of the rate of transport, and so can be subjected to

QSAR analysis.

The second way is to use the blood-brain distribution at the steady state (BB), which is

defined as the ratio of the steady state concentrations of a compound in the brain and in

the blood:

$BB = C_{brain}/C_{blood}$ (7.1)

where $C_{brain}$ is the steady-state concentration of the compound in the brain

$C_{blood}$ is the steady-state concentration of the compound in the blood.

7.2. Literature review of QSARs for blood-brain barrier penetration

Various QSAR models of BBB penetration have been developed and published. A variety

of approaches for structural representation and statistical tools have been applied.

Reviews of QSARs have been presented by Clark (2003) and Abraham and Platts (2000).

An early QSAR model for BBB penetration was developed by Young et al. (1988). The

BBB permeability (as BB values) of histamine H2 receptor antagonists was correlated

with the difference between logPoct and logPcyh values (∆logP), where logPoct is the

partition coefficient in the octanol-water solvent system, and logPcyh is the partition

coefficient in the cyclohexane-water system:

$\log BB = -0.485\,(\pm 0.160)\,\Delta\log P + 0.889\,(\pm 0.500)$ (7.2)

n = 20, R = 0.831, s = 0.439, F = 40.23

∆logP is related to the overall hydrogen bonding capacity of a molecule. The negative

coefficient of ∆logP suggests that brain penetration increases as the

overall H-bonding ability of a compound decreases. Young et al. (1988) did not find a significant

correlation between logBB and logPoct.
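The use of Equation 7.2 can be shown with a short Python sketch. The coefficients are those quoted above for the model of Young et al. (1988). The function names and the partition coefficient values are hypothetical and serve only to illustrate the arithmetic. The model was derived for histamine H2 receptor antagonists, so applying it to other compound classes is an extrapolation.

```python
def delta_log_p(log_p_oct, log_p_cyh):
    """Difference between the octanol-water and cyclohexane-water log P values."""
    return log_p_oct - log_p_cyh


def log_bb_young(dlogp):
    """logBB predicted from Equation 7.2 (central coefficient values only)."""
    return -0.485 * dlogp + 0.889


# Hypothetical compound: logPoct = 2.0, logPcyh = 0.5.
dlogp = delta_log_p(2.0, 0.5)
log_bb = log_bb_young(dlogp)
bb = 10 ** log_bb  # back-transform to the brain/blood concentration ratio of Equation 7.1
print(dlogp, log_bb, bb)
```

Because the coefficient of ∆logP is negative, a compound with a larger ∆logP (a greater H-bonding capacity) receives a lower predicted logBB.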

Some authors (Abraham et al., 1994; Gratton et al. 1997) used the structural parameters

included in the general solvation equation (introduced by Abraham et al., 1991; Abraham

and Whiting, 1992). For example, Abraham et al. (1994) derived the following QSAR:

$\log BB = 0.198 R_2 - 0.687 \pi_2^H - 0.715 \alpha_2^H - 0.698 \beta_2^H + 0.995 V_x - 0.038$ (7.3)

n = 57, R = 0.952, s = 0.197, F = 99.2

where $R_2$ is an excess molar refraction

$\pi_2^H$ is the solute dipolarity/polarisability parameter (Abraham et al., 1991)

$\alpha_2^H$ is the solute H-bond acidity

$\beta_2^H$ is the solute H-bond basicity

$V_x$ is the solute characteristic volume of McGowan (Abraham and McGowan,

1987).

According to the model, an increase in solute size leads to an increase in logBB, whereas

increases in solute dipolarity/polarisability and hydrogen-bond acidity and basicity lead to

a decrease in logBB.
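The direction of each effect in Equation 7.3 can be verified by evaluating the equation directly. In the Python sketch below the coefficients are those quoted for the model of Abraham et al. (1994). The function name, the argument names and the descriptor values are hypothetical and do not correspond to any real compound.

```python
def log_bb_abraham(r2, pi2h, alpha2h, beta2h, vx):
    """logBB from Equation 7.3.

    r2      : excess molar refraction
    pi2h    : solute dipolarity/polarisability
    alpha2h : solute H-bond acidity
    beta2h  : solute H-bond basicity
    vx      : McGowan characteristic volume
    """
    return (0.198 * r2 - 0.687 * pi2h - 0.715 * alpha2h
            - 0.698 * beta2h + 0.995 * vx - 0.038)


# Hypothetical descriptor set used as a reference point.
reference = dict(r2=1.0, pi2h=1.0, alpha2h=1.0, beta2h=1.0, vx=1.0)
base = log_bb_abraham(**reference)

# Increase one descriptor at a time and compare with the reference value.
larger_solute = log_bb_abraham(**{**reference, "vx": 1.5})
stronger_acid = log_bb_abraham(**{**reference, "alpha2h": 1.5})
print(base, larger_solute, stronger_acid)
```

Raising the volume term increases the predicted logBB, whereas raising the H-bond acidity lowers it, in line with the interpretation of the model given above.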

The Abraham solvation equation (which includes the variables described in the model of

Abraham et al., 1994, Equation 7.3) was also applied to describe logPoct data. In the

model obtained, the coefficient for αH2 was not significant. Therefore Abraham et al.

(1999) concluded that compounds that are H-bond acids (H-bond donors) will always be

outliers when logBB is plotted against logPoct.

Solvation free energies have also been correlated with blood-brain partitioning

(Lombardo et al., 1996; Keseru and Molnar, 2001). Other authors demonstrated the

capacity of the molecular polar surface area to obtain good predictive models for blood-

brain penetration (Kansy and Van de Waterbeemd, 1992; Clark, 1999; Kelder et al., 1999;

Feher et al., 2000). The polar surface area is defined as the sum of surface contributions

of polar atoms (usually oxygen, nitrogen and attached hydrogen) to the molecular surface,

and is related to the capacity of a compound to form H-bonds.

Some authors used topological and E-state indices to develop models for BBB penetration

(Luco 1999; Norinder and Osterberg, 2001; Rose et al. 2002). Descriptors derived from

3D-molecular fields using the VolSurf method, have also been used (Crivori et al., 2000).

Again, the analyses confirmed that molecular properties such as hydrophilic regions and

H-bonding are unfavourable for BBB permeation.

In summary, numerous literature reports state that higher hydrogen bonding potential is

unfavourable for BBB penetration (Young et al., 1988; Kansy and Van de Waterbeemd,

1992; Abraham et al., 1994; Gratton et al., 1997; Clark, 1999; Kelder et al., 1999; Crivori

et al., 2000; Feher et al., 2000; Norinder and Osterberg, 2001; Rose et al. 2002). LogPoct

values, as an estimate of hydrophobicity, were found in some studies to correlate with

BBB penetration values (Gratton et al., 1997; Clark, 1999; Feher et al., 2000), while other

authors demonstrated that logPoct is not a suitable descriptor for obtaining BBB

permeability models (Young et al., 1988; Abraham et al., 1994). According to Abraham

et al. (1999), logBB values of compounds that are H-bond donors cannot be described

with logPoct values alone. Other descriptors found as important for BBB penetration are

compound polarity and polarisability (Abraham et al., 1994; Gratton et al., 1997;

Norinder and Osterberg, 2001).

Figure 7.1. Cross section through a brain microcapillary, representing the anatomy of

BBB (redrawn from Gloor et al., 2001)

CHAPTER 8

ACUTE CHEMICAL TOXICITY/CYTOTOXICITY

In this chapter an introduction to the mechanisms of toxic action is given. Also, a

summary of the QSAR models for toxicity, developed and published before the present

project, is presented.

8.1. Mechanisms of toxic action

A mechanism of toxic action is taken here to be the critical biological effect of the

toxicant at the molecular or cellular level. A number of toxicity mechanisms have been

defined in the literature. The following classification of mechanisms of toxic action has

been adopted in this project: non-polar narcosis, polar narcosis, weak acid respiratory

uncoupling, formation of free radicals, electrophilic reactions, and toxic action by specific

(receptor-mediated) mechanisms.

A number of methods are available to assess mechanisms of toxic action, including in

vitro tests (Urani et al., 1998); joint toxicity tests, in which the additivity (or otherwise) of

the toxicity of a chemical to the toxicity of a reference compound with a known

mechanism of toxicity, is tested (Broderius et al., 1995); fish acute toxicity syndromes

(FATS) (Bradbury et al., 1989); and considerations on the basis of structural criteria.

Also, a number of authors define structural criteria to assign compounds to a given

mechanism of toxic action (Hermens 1990; Lipnick, 1991; Verhaar et al., 1992; Russom

et al., 1997; Schultz et al., 1997; Hansch et al., 2000; Cronin et al., 2002a). However,

identification of the mechanism of toxic action for a compound from its chemical

structure is often a difficult task due to the complex nature of toxic activity (Schultz et al.,

2003a). Additionally, the different mechanisms are not mutually exclusive, i.e. a chemical

compound may act by different mechanisms.

The narcotic mechanism of toxic action results from non-specific non-covalent reversible

interactions with cell membranes (Schultz et al., 2003a). Narcotic mechanisms of action

can include non-polar and polar narcosis (Schultz et al., 1990). In addition, some authors

also refer to other mechanisms of narcosis (e.g. amine, ester) (Russom et al., 1997). The

toxic effect of non-polar narcotics is determined by the compound lipid solubility;

compound-specific molecular features are of little or no importance (Schultz et al., 1990).

Non-polar narcotics are neutral non-reactive compounds such as aliphatic alcohols,

ketones and ethers. Quantitative relationships between toxicity and hydrophobicity

(presented as octanol-water partition coefficients, logPoct values, or octanol-water

distribution coefficients, logDoct values) for non-polar narcotics have been reported by

many authors (Könemann, 1981; Veith et al., 1983; Hansch et al., 1989; Schultz et al.,

1990). These relationships are considered to represent a “baseline effect”, whereby no

wholly soluble and non-volatile chemical can elicit toxicity less than predicted by such

relationships (Schultz et al., 2003a).

Polar narcotics are aromatic, less inert and often possess a hydrogen-bond donor moiety, e.g.

phenols and anilines (Veith and Broderius, 1990).

In general weak acid respiratory uncouplers elicit their effect by abolishing the coupling

of substrate oxidation to adenosine triphosphate synthesis in mitochondria. The molecular

characteristics of weak acid respiratory uncouplers include a weak acid feature (e.g. an

amino or hydroxyl group), a bulky, hydrophobic aromatic moiety, and multiple

electronegative groups (e.g. nitro and/or halogen substituents) (Terada, 1990). A typical

example is 2,4-dinitrophenol (Schultz et al., 1990).

Some chemicals are suggested to produce their toxic effect by forming free radicals.

Phenols that possess electron-releasing groups could, for instance, be converted to

toxic phenoxyl radicals (Hansch et al., 2000).

Compounds that can undergo direct electrophilic interactions may cause covalent changes

in biological macromolecules (Hermens, 1990; Lipnick, 1991; Russom et al., 1997). Some

compounds may undergo metabolic reactions, resulting in more toxic forms (for example,

precursors to electrophilic compounds, i.e. proelectrophiles). Cronin et al. (2002a) stated

that 2- and 4-substituted amino and nitrophenols are capable of being oxidised to

quinones, which have toxicities greater than those of polar narcotics.

Specific mechanisms of toxic action are due to interactions with specific target

macromolecules (receptors or enzymes). These interactions depend on specific chemical

features of the compounds. Binding to receptors is non-covalent, due to ionic,

hydrophobic and/or H-bonding interactions, and requires specific 3D molecular

conformations.

In Table 8.1 a summary of structural criteria which can be used to assign mechanisms of

toxic action to compounds is given.

8.2. Literature review of QSARs for acute toxicity/cytotoxicity

A large number of QSAR studies of acute toxicity have been published in the

literature. A review of QSARs for acute toxicity published since 2000 is

presented by Lessigiarska et al. (2005b).

Toxicities to a wide range of biological species have been investigated by QSAR analysis,

for example bacteria, protozoa, algae, invertebrates, plants, fish, mammalian cells and

mammals, including humans. Various chemical groups have been investigated and

various QSAR approaches have been applied.

A dependence of acute (narcotic) toxicity on a partition coefficient, especially the octanol-

water partition coefficient, has been shown by many authors (Dearden et al., 2000;

Freidig and Hermens, 2000; Kapur et al., 2000; Parkerton and Konkel, 2000; Bundy et al.,

2001; Gramatica et al., 2001; Ren and Frymer, 2002; Sverdrup et al., 2002; Worgan et al.,

2003).

For example, in the work of Schultz et al. (2002), a large data set of 500 aliphatic

chemicals was investigated for toxicity to the protozoan Tetrahymena pyriformis (toxicity

expressed as inhibitory concentrations causing 50 % inhibition of growth of the protozoan

in a two-day assay, IGC50 values). The following QSAR, representing a baseline toxicity

relationship, was derived for a group of non-polar narcotics (saturated alcohols, ketones,

nitriles, esters, and sulphur-containing compounds):

log(1/IGC50) = 0.723 logPoct – 1.79 (8.1)

n = 215, R²adj = 0.926, s = 0.274, Q² = 0.925
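
As an illustration of how Equation 8.1 is applied, the following minimal Python sketch computes the baseline toxicity predicted for a given logPoct. The function name and the example logPoct value are hypothetical and are not taken from Schultz et al. (2002); a prediction is only meaningful for compounds resembling the non-polar narcotics from which the equation was derived.

```python
def baseline_log_inv_igc50(log_p_oct):
    """Baseline toxicity to T. pyriformis from Equation 8.1.

    Returns log(1/IGC50) = 0.723 * logPoct - 1.79.
    """
    return 0.723 * log_p_oct - 1.79


# Hypothetical compound with logPoct = 2.0 (illustrative value only).
predicted = baseline_log_inv_igc50(2.0)
print(f"Predicted log(1/IGC50): {predicted:.3f}")
```

A compound whose measured toxicity lies well above this baseline prediction may act by a mechanism other than non-polar narcosis.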

In addition to the octanol-water partition coefficient, partition coefficients in other phases

have also been used to develop QSARs for toxicity. For example, Schultz and Seward

(2000) concluded that dimyristoyl phosphatidylcholine-water partition coefficients gave

a better statistical fit than octanol-water partition coefficients in QSAR analysis of

inhibition of T. pyriformis population growth for 23 non-polar narcotics, polar narcotics,

and esters. Roberts and Costello (2003) developed QSARs for toxicity of 18 non-polar

and polar narcotics to the fish Poecilia reticulata (guppy) using logPoct (octanol-water)

and logPMW (membrane-water) partition coefficients. The logPoct-based equations for

non-polar and polar narcotics had different coefficients, whereas the logPMW-based QSARs for

the two groups had similar coefficients. Roberts and Costello (2003) therefore

derived a general equation for polar and non-polar narcotics, based on logPMW:

pLC50 = 0.84 logPMW + 1.38 (8.2)

n = 18, R² = 0.963, s = 0.21, F = 419

Freidig and Hermens (2000) developed QSARs for toxicity to fish (14-day LC50 for

guppy (P. reticulata) and 4-day LC50 for fathead minnow (Pimephales promelas)). The

compounds were divided into narcotics and reactive compounds, according to the values

of the ratio of excess toxicity, Te:

Te = observed toxicity / predicted toxicity by a baseline equation (8.3)

Freidig and Hermens (2000) stated that a compound with a Te value less than five

should be considered a narcotic, whereas a compound with a Te value greater than five should be considered

a reactive chemical.
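
The rule based on Equation 8.3 can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, toxicity is assumed to be expressed on a linear scale on which larger values mean higher toxicity (for example 1/LC50), and the treatment of a Te value of exactly five, which the source does not specify, is an assumption.

```python
def excess_toxicity(observed, predicted_baseline):
    """Te = observed toxicity / toxicity predicted by a baseline equation (Eq. 8.3)."""
    return observed / predicted_baseline


def classify_by_te(te, threshold=5.0):
    """Return 'narcotic' for Te below the threshold, otherwise 'reactive'.

    Assumption: Te exactly equal to the threshold is treated as reactive.
    """
    return "narcotic" if te < threshold else "reactive"


# Hypothetical toxicity values on a linear scale (e.g. 1/LC50).
near_baseline = classify_by_te(excess_toxicity(observed=2.0, predicted_baseline=1.0))
far_above_baseline = classify_by_te(excess_toxicity(observed=50.0, predicted_baseline=1.0))
print(near_baseline, far_above_baseline)
```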

Freidig and Hermens (2000) developed separate one-parameter QSARs for the groups of

narcotics and reactive compounds, using a descriptor that characterises the particular

toxicity mechanism (logPoct for the narcotics, and a certain electronic descriptor for the

reactive compounds). Additionally, models combining the two groups of compounds and

the two types of descriptors were developed. Freidig and Hermens (2000) concluded that

using separate QSAR models for compounds acting by different mechanisms, each of the

models including a descriptor that characterises the particular toxicity mechanism, gives

better results than using a single model that combines all compounds and descriptors.

Some authors have used the so-called “response-surface approach” based on

hydrophobicity and electrophilicity of compounds. In this approach the QSARs include a

descriptor encoding bio-uptake and distribution (usually octanol–water partition or

distribution coefficients (logPoct or logDoct)) and a descriptor of electrophilic reactivity

(usually LUMO or maximum acceptor superdelocalisability (Amax)). This approach has

been applied to different species, including the bacterium Vibrio fischeri (Cronin et al.,

2000); the protozoan Tetrahymena pyriformis (Cronin and Schultz, 2001; Cronin et al.,

2002a; Schultz et al., 2002); the yeast Saccharomyces cerevisiae (Wang et al., 2002b); the

mould Aspergillus nidulans (Cronin et al., 2002b); the algae Scenedesmus obliquus (Lu et

al., 2001) and Chlorella vulgaris (Cronin et al., 2002b); the plant Cucumis sativus (Wang

et al., 2002a; Wang et al., 2002c); mice (Cronin et al., 2002b); and toxic and metabolic effects

on perfused rat liver (Cronin et al., 2002b). The advantage of the response-surface

approach is that it is simple and has a sound mechanistic interpretation.

While some authors (e.g. Cronin and Schultz, 2001; Cronin et al., 2002a; Cronin et al.,

2002b) have used LUMO as a descriptor of electrophilic reactivity resulting in covalent

changes in biological systems, Dimitrov et al. (2000) and Dimitrov et al. (2003) have

suggested that LUMO can also be used to describe non-covalent electrophilic interaction

of narcotic chemicals with the site of action.

Some authors extended the response-surface approach by adding indicator

variables and other parameters to improve the statistical fit of the models (Schmitt et al.,

2000; Wang et al., 2001; Huang et al., 2003; Cronin et al., 2004; Netzeva et al., 2004).

Examples of such variables include charges of defined atoms in the compound (Schmitt et

al., 2000), the heat of formation (Wang et al., 2001), the molecular volume (Huang et al.,

2003), and molecular connectivity indices (Wang et al., 2001; Netzeva et al., 2004). These

help to model outliers for which toxicity is under- or overestimated by the two-variable

response-surface approach. However, according to Schultz et al. (1998), QSAR modelling

for electrophilic compounds is more difficult than QSAR modelling of compounds acting

by other toxic mechanisms, because of data and descriptor limitations.

Another approach has been the use of the Brown variation of the Hammett constant (σ+)

to model acute toxicity of aromatic compounds forming free radicals (Hansch et al., 2000;

Selassie et al., 2002; Moridani et al., 2003b; Verma et al., 2003).

QSARs for toxicity based on topological and/or E-state indices have been derived by

many authors (Ivanciuc, 2000; Burden, 2001; Gramatica et al., 2001; Cronin et al., 2002a;

Grodnitzky and Coats, 2002; Huuskonen, 2003; Rose and Hall, 2003). Polarisability and

molar refractivity have been used by Geiss and Frazier (2001); Hansch and Kurup (2003);

Trohalaki et al. (2000). Another approach is based on the TLSER (theoretical linear

solvation energy relationship) model descriptors, which represent cavity,

dipolarity/polarisability, and H-bonding terms (Boyd et al., 2001; Liu et al., 2001; Liu et

al., 2003a). The so-called ‘hardness’, equal to

½ (HOMO – LUMO), was also used as a descriptor in some QSARs. It encodes atomic radius, nuclear charge and

polarisability (Faucon et al., 2001).

3D-QSAR approaches (CoMFA and CoMSIA) have also been applied to investigate

toxicity (Xu et al., 2002; Liu et al., 2003b). Interaction fields and regions around

molecules that are important for compound toxicity were identified. There is, however,

an inherent difficulty in using these approaches, because they require a series of compounds

sharing a common and specific mechanism of action (see Chapter 6).

Several other QSAR approaches have also been explored in investigating acute toxicity.

For example, Balaž and Lukacova (2002) used subcellular pharmacokinetic theory to

derive QSARs. Toropov and Toropova (2002), Toropov and Schultz (2003), and Toropov

and Benfenati (2004) used the optimisation of correlation weights of local graph invariants

(OCWLGI) approach (Toropov and Toropova, 2002). Weighted holistic invariant

molecular (WHIM) descriptors encoding three-dimensional information for molecular

shape and electronic structure were used by Gramatica et al. (2001), and Di Marzio et al.

(2001). Grodnitzky and Coats (2002) used geometry, topology and atomic weights

assembly descriptors (GETAWAY), which encode information about the 3D structures,

size and shape of the molecules. Several studies using autocorrelation vectors encoding

lipophilicity, molar refractivity, and H-bonding ability were performed (Devillers et al.,

2002a; Devillers et al., 2002b). Tao et al. (2002a), and Tao et al. (2002b) applied a

fragment constant method to develop toxicity QSARs.

Additionally, QSARs for metal toxicity have also been developed using ion-specific

physicochemical parameters (Enache et al., 2000; Enache et al., 2003; Ownby and

Newman, 2003).

QSARs for toxicity of mixtures of chemical compounds have also been developed. This

is an important and growing research area, because environmental pollutants are usually

released as chemical mixtures rather than single chemicals. Some authors used mixture

partition coefficients and H-bonding potentials of mixtures composed from non-polar and

polar narcotics to develop QSARs (Yu et al., 2001; Lin et al., 2003b). Other

investigations include specific characteristics of the chemical compounds in the mixtures,

for example charges at a certain atom (Lin et al., 2003a). Tichý et al. (2002) investigated

the dependence of toxicity on the molar ratios of the compounds in a mixture.

Table 8.1. Summary of structural criteria used in this project for classifying compounds according to mechanism of toxic action (devised from Hermens, 1990; Lipnick, 1991; Verhaar et al., 1992; Russom et al., 1997; Schultz et al., 1997; Hansch et al., 2000; Cronin et al., 2002a)

No. | Mechanism of action | Structural determinants
1 | Non-polar narcosis | Saturated alkanes with e.g. halogen and/or alkoxy substituents (aliphatic alcohols, ketones, ethers, amines); halogen and alkyl substituted benzenes
2 | Polar narcosis | Phenols with pKa greater than or equal to 6.0; phenols and anilines with three or fewer halogen atoms, and/or alkyl substituents
3 | Weak acid respiratory uncouplers | Phenols and anilines with four or more halogen substituents, or more than one nitro group, or single nitro group and more than one halogen group
4 | Formation of free radicals | Phenol or aniline substituted with an electron-releasing group (alkoxy, hydroxyl, more than one alkyl group)
5 | Electrophiles/proelectrophiles | Activated unsaturated compounds; benzene rings without aniline or phenol substructures that have two nitro groups on one ring; phenols with single nitro group but not more than one halogen group; aromatic compounds with two or more hydroxy groups in the ortho or para position and at least one unsubstituted aromatic carbon atom; quinones; aldehydes; compounds with halogens at α-position of an aromatic bond; ketenes; epoxides
6 | Specific mechanisms | Chemicals interacting with specific biological macromolecules. For example, acetylcholinesterase inhibitors with an organophosphate group

CHAPTER 9

PROGRAMS FOR STATISTICAL ALGORITHMS

9.1. Objectives

This chapter presents two statistical algorithms designed by the author for the purposes of the

project, which were used as auxiliary tools in the QSAR analysis. The first

reduces the multicollinearity of variables in a data set. It was used to

reduce data redundancy before performing statistical analyses. The second algorithm

is an implementation of the best-subsets approach for selection of regression

equations with the best statistical fit.

9.2. Reduction of data multicollinearity

Nowadays, commercially-available QSAR software packages can calculate large numbers

of descriptors. The availability of a large number of descriptors in a single study poses the

problem of variable multicollinearity and data redundancy. One way to avoid data

redundancy is to exclude descriptors that are highly intercorrelated with each other before

performing statistical analysis. Reduced multicollinearity and redundancy in the data will

facilitate selection of relevant variables and models for the investigated endpoint. An

algorithm for reducing the intercorrelation in the variable data matrix, developed by the

author, is presented. The main steps in the proposed algorithm are presented in Figure 9.1.

A program in C code has been developed by the author, based on the proposed algorithm.

The code of the program is presented in Appendix A.1. As input an ASCII file containing

the variable data matrix is required. The user is requested to specify a cut-off value for the

Pearson coefficient of correlation R (Rc). The variables that intercorrelate with R larger

than Rc will be considered as redundant. The user may also input a list of variables that

are considered important for the modelling (list_to_remain), for example, because of their

physicochemical or mechanistic significance. The main steps of the program algorithm

are:

1. Calculation of the intercorrelation matrix for the variables.

2. For each variable, counting the number of intercorrelated variables with an

intercorrelation coefficient R larger than Rc.

3. The variable (not in the list_to_remain) that has the largest number of

intercorrelated variables is proposed to be deleted from the set of variables.

4. On positive reply from the user, the data matrix is created again without this

variable. Steps 1 to 3 are repeated. On negative reply, the user is asked whether to

quit the program, and if not, that variable is added to the list_to_remain and step 3

is repeated.

5. Steps 1 to 4 are repeated until there are no highly intercorrelated variables, or all

variables are in the list_to_remain, or the user decides to quit the program at step

4.
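
The steps listed above can be sketched in Python as follows. This is not the C program of Appendix A.1 but a simplified, non-interactive illustration: every proposed deletion is accepted automatically instead of prompting the user, the absolute value of R is compared with Rc, and ties in the number of intercorrelated variables are broken by column order. These three choices, the function names and the example data are assumptions made for the sketch.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient; constant columns are treated as uncorrelated."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: no correlation is defined for a constant column
    return sxy / math.sqrt(sxx * syy)


def reduce_multicollinearity(columns, r_cutoff, list_to_remain=()):
    """Repeatedly delete the unprotected variable with the most partners above Rc.

    columns maps a variable name to its list of values.
    Returns (names of kept variables, names of deleted variables).
    """
    kept = dict(columns)
    protected = set(list_to_remain)
    deleted = []
    while True:
        names = list(kept)
        # Steps 1 and 2: intercorrelations and the count of partners above Rc.
        n_cor = {name: 0 for name in names}
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if abs(pearson_r(kept[a], kept[b])) > r_cutoff:
                    n_cor[a] += 1
                    n_cor[b] += 1
        # Step 3: candidate for deletion, never taken from list_to_remain.
        candidates = [n for n in names if n not in protected and n_cor[n] > 0]
        if not candidates:
            return list(kept), deleted  # Step 5: nothing left to delete
        worst = max(candidates, key=lambda n: n_cor[n])
        # Step 4: the deletion is accepted without asking (sketch assumption).
        del kept[worst]
        deleted.append(worst)


# Hypothetical descriptor values for five compounds.
data = {
    "volume": [1, 2, 3, 4, 5],
    "surface_area": [2, 4, 6, 8, 10.5],
    "weight": [1.1, 2.0, 3.2, 3.9, 5.1],
    "polarity": [5, 1, 4, 2, 3],
}
kept, deleted = reduce_multicollinearity(data, r_cutoff=0.95, list_to_remain=["volume"])
print("kept:", kept)
print("deleted:", deleted)
```

In this example the three size-related descriptors are mutually intercorrelated, and protecting one of them through list_to_remain determines which of the three survives the reduction.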

The output of the program is an ASCII file that contains the intercorrelation matrix of the

undeleted variables, the number and a list of intercorrelated variables for each of the

undeleted variables, and a list of the deleted variables.

A key feature of this algorithm is that the user can specify a list of variables that have

important physicochemical and/or mechanistic significance, and which are therefore not

deleted.

An alternative approach for reducing data multicollinearity is implemented in the

DRAGON software version 5.0 (TALETE srl.), where one of two descriptors with a

correlation coefficient greater than a user-specified value between 0.9 and 1.0 is excluded.

For each pair of correlated descriptors, the descriptor having the highest correlation

coefficient with some of the other descriptors is automatically excluded. The algorithm

developed in this project is based on a different principle, namely excluding descriptors

having the highest number of intercorrelations with other descriptors. In the view of the author

of this thesis, an advantage of the algorithm presented here is the possibility to

define a list of variables which are not to be deleted from the data set, thus allowing the

user to control the process of data reduction.

For example, the algorithm was applied to the set of descriptors calculated for a compound

set containing blood-brain barrier penetration data, taken from Platts et al. (2001) (QSAR

analysis of this data set is reported in Chapter 10 of this thesis). The data table contained

approximately 300 descriptors calculated with both the TSAR (Accelrys Inc.) and QsarIS

(then SciVision-Academic Press, San Diego, CA; currently MDL®QSAR by Elsevier

MDL, San Leandro, CA) software. Applying the proposed algorithm, with a cut-off value

of R = 0.95, resulted in a data table containing 166 descriptors. During the execution of

the program, molecular volume and molecular surface area were preserved from deletion

by adding them to the list_to_remain, because of their possible mechanistic significance.

9.3. Program for selection of regression equations with best statistical fit

A program in C code was developed by the author, which implements the

best-subsets approach for selecting regression equations with the best statistical fit. The

code of the program is presented in Appendix A.2. The program computes k-variable

regression equations for all possible combinations of k variables from a given

set of variables. The user can specify an R value; variables that intercorrelate with a

correlation coefficient higher than this R value will not be included in the same regression

equation. Afterwards the regression equations are sorted by decreasing value of the

squares of their correlation coefficients (R²).
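
The procedure described above can be illustrated with the following Python sketch. It is not the C program of Appendix A.2: the function names, the use of the absolute value of R in the intercorrelation filter, the least-squares solver (normal equations with Gauss-Jordan elimination) and the example data are all assumptions made for illustration.

```python
from itertools import combinations
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared(y, predictors):
    """R2 of a least-squares fit of y on the predictors plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + [[float(v) for v in x] for x in predictors]
    p = len(cols)
    # Augmented normal equations (X'X | X'y).
    a = [[sum(ci[t] * cj[t] for t in range(n)) for cj in cols]
         + [sum(ci[t] * y[t] for t in range(n))] for ci in cols]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(p):
        pivot = max(range(i, p), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            return None  # singular system, no unique fit
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(p):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [u - f * v for u, v in zip(a[r], a[i])]
    coef = [a[i][p] / a[i][i] for i in range(p)]
    mean_y = sum(y) / n
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((y[t] - sum(coef[j] * cols[j][t] for j in range(p))) ** 2
                 for t in range(n))
    return 1.0 - ss_res / ss_tot


def best_subsets(y, columns, k, r_max):
    """Fit every k-variable combination whose members do not intercorrelate
    above r_max, and return (combination, R2) pairs sorted by decreasing R2."""
    results = []
    for combo in combinations(columns, k):
        if any(abs(pearson_r(columns[a], columns[b])) > r_max
               for a, b in combinations(combo, 2)):
            continue  # intercorrelated variables are kept out of the same equation
        r2 = r_squared(y, [columns[name] for name in combo])
        if r2 is not None:
            results.append((combo, r2))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical data: x3 is nearly collinear with x1; y depends on x1 and x2 only.
columns = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5.5],
    "x3": [1.1, 2.1, 2.9, 4.2, 5.0, 6.1],
}
y = [2 * a + 3 * b for a, b in zip(columns["x1"], columns["x2"])]
ranked = best_subsets(y, columns, k=2, r_max=0.9)
for combo, r2 in ranked:
    print(combo, round(r2, 4))
```

Exhaustive enumeration guarantees that the best-fitting combination is found, at the cost of a number of fits that grows combinatorially with the number of variables.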

The program needs as input an ASCII file containing the variable data matrix, including

the dependent and the independent variables. The input file may contain several

dependent variables, which are investigated one by one. The output of the program is an

ASCII file that contains the regression equations for each of the dependent variables and

the squares of their correlation coefficients, R².

The program was developed for the needs of the work in this thesis, because it allowed

the best-subsets approach to be applied in the regression analysis of data sets containing a

large number of independent variables. In practice, the program sets no limit on

the number of independent variables; the only restriction is the increase in

computational time as the number of variables increases. In contrast, the

implementation of best-subsets regression analysis in the

MINITAB version 14 software (Minitab Inc., State College, PA, USA) is restricted to a maximum

of 31 independent variables from which k-variable combinations

are constructed in the search for the best statistical fit.

For example, the program was used to derive QSARs for toxicity to the bacterium

Sinorhizobium meliloti (the study is reported in Chapter 12 of this thesis). The variable

data sets contained approximately 100 variables after applying the data reduction

algorithm (different for each subset of compounds with different mechanisms of toxicity,

which were studied separately, see Chapter 12). The program was applied up to k = 5,

because for k ≥ 6, the number of the possible combinations of k variables from the data

set of approximately 100 variables was more than 1 100 000 000, which would require

more than 24 h computational time.
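
The growth in the number of combinations can be checked with a short Python calculation. The figure of exactly 100 variables is an assumption standing in for the "approximately 100" variables mentioned above.

```python
import math

n_variables = 100  # assumption: stands in for "approximately 100" variables

# Number of distinct k-variable combinations for k = 1 to 6.
counts = {k: math.comb(n_variables, k) for k in range(1, 7)}
for k, count in counts.items():
    print(f"k = {k}: {count:,} combinations")
```

For 100 variables this gives 75 287 520 combinations for k = 5 and 1 192 052 400 for k = 6, in line with the figure of more than 1 100 000 000 quoted for k ≥ 6.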

9.4. Contribution to existing knowledge

The presented algorithm for reducing data multicollinearity was designed by the author to

serve the needs of the work reported in this thesis and the development of QSARs in

general. As described above, the algorithm differs from a similar algorithm implemented

in the DRAGON software (TALETE srl.) in the way the multicollinearity is assessed,

namely by counting the number of intercorrelated variables for each variable of a data set.

Additionally, the algorithm allows the user to preserve variables considered important for

the further QSAR analysis. The algorithm was used in the analysis published by

Lessigiarska et al. (2004a).

The second algorithm, presented in this chapter, contributed to an extended

implementation of the best-subsets regression approach to large data sets, together with

the possibility to derive the variable combinations resulting in the best statistical fit, for

which the variables included are not intercorrelated above a given value of the

intercorrelation coefficient R. The algorithm was used in the analyses reported by

Lessigiarska et al. (2004a), Lessigiarska et al. (2004b), Netzeva et al. (2005) and

Lessigiarska et al. (2006).

Figure 9.1. Algorithm for reduction of data multicollinearity. Points where user input is necessary are represented with ellipses

Input: variable data table; cut-off value for the intercorrelation coefficient (Rc); list of variables that are forced into the model (list_to_remain)

1. Calculation of the correlation matrix

2. Calculation of the number of intercorrelated variables (N_cor) with R bigger than Rc for each variable

3. Are there variable(s) with N_cor > 0 that are not included in the list_to_remain? No: exit. Yes: continue

4. User input: should the variable outside the list_to_remain that has the biggest value of N_cor be deleted from the variable data set? Yes: deleting the variable and creating a new variable data table, then return to step 1. No: continue

5. User input: do you wish to exit the programme? Yes: exit. No: the variable is added to the list_to_remain, then return to step 3

CHAPTER 10

INVESTIGATION OF BLOOD-BRAIN BARRIER PENETRATION

10.1. Objectives

In this chapter an investigation of compound penetration through the blood-brain barrier

(BBB) is presented. In vivo BBB penetration, as well as in vitro penetration through

several membrane models of the BBB, was modelled. Regression and classification

QSAR models were sought. Additionally, the in vivo BBB penetration was described by

combined models including descriptors of the chemical structure and in vitro penetration

endpoints (QSAAR analysis), in order to investigate relationships between the in vitro

and in vivo endpoints. Models for passively diffusing compounds, as well as compounds

transported actively, were developed.

10.2. Methods

10.2.1. Biological data

Two data sets were used to investigate BBB penetration.

The first data set contains 21 compounds and was taken from a collaborative study funded

by the European Commission’s Joint Research Centre (ECVAM), referred to below as the “ECVAM

data set” (Garberg, 2001). It includes data for in vivo penetration through the BBB, and in

vitro penetration through several membrane models of the BBB. In Table 10.1, the

compounds investigated and their in vivo BBB and in vitro penetration abilities are

presented. The SMILES codes for the compounds are given in Appendix B.1.a. The

chemicals belong to different chemical groups and reflect different transport mechanisms.

According to the source study (Garberg, 2001) four of them are actively transported

through the BBB by different carrier systems (Table 10.1), eight are subject to active

efflux from the brain, and the remaining nine compounds cross the BBB by passive

diffusion.

The in vivo BBB penetration of a compound was described by the compound permeability

coefficient (P) and the (percentage) ratio between the compound concentration in brain

and that in plasma at the steady state (BB, see Chapter 7, Equation 7.1), i.e. BB =

(Cbrain/Cblood) × 100 %, where Cbrain and Cblood are the steady-state concentrations of the

compound in the brain and in the blood, respectively.

The in vitro penetration of a compound was expressed as the permeability coefficients, P,

for membranes constructed with the following cells:

- bovine and rat subcultured primary brain capillary endothelial cells, co-cultured

with primary rat astrocytes (BBEC);

- rat SV40 immortalised brain microvascular endothelial cells, co-cultured with

SV40 immortalised rat astrocytes (SV-ARBEC);

- human epithelial colon carcinoma cell line (Caco-2);

- dog kidney epithelial cell line (MDCK).

A schematic representation of the experimental set-up of an in vitro blood–brain barrier

co-culture model is presented in Figure 10.1.

Two types of permeability coefficients were reported in the source study (Garberg, 2001).

Apparent permeability coefficients (Papp) were obtained for the permeability of the cell

membrane together with the filter insert (see Figure 10.1). Endothelial permeability

coefficients Pe were obtained indirectly by applying the formula:

1/(Pe*S) = 1/(Papp*S) – 1/(Pf*S) (10.1)

where Pf is the permeability coefficient for the insert alone

S is the surface area of the insert (see Figure 10.1).

Thus Pe values are assumed to represent the permeability of the cell membrane itself.
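
Because the surface area S appears in every term of Equation 10.1, it cancels, leaving 1/Pe = 1/Papp – 1/Pf. A minimal Python sketch of this rearrangement is given below; the function name, the input check and the numerical values are hypothetical and are not data from Garberg (2001).

```python
def endothelial_permeability(p_app, p_filter):
    """Pe from Equation 10.1 after cancelling S: 1/Pe = 1/Papp - 1/Pf.

    p_app    apparent permeability (cells plus filter insert)
    p_filter permeability of the insert alone, in the same units
    """
    if not 0 < p_app < p_filter:
        # Assumption: the formula is only used when the cell layer adds resistance.
        raise ValueError("expected 0 < Papp < Pf")
    return 1.0 / (1.0 / p_app - 1.0 / p_filter)


# Hypothetical values in arbitrary but identical units.
p_e = endothelial_permeability(p_app=2.0, p_filter=6.0)
print(f"Pe = {p_e:.2f}")
```

The correction always gives a Pe larger than Papp, and the two coincide only when the insert offers negligible resistance (Pf much larger than Papp).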

The second data set was taken from Platts et al. (2001). It contains in vivo BBB

penetration data for 157 compounds collected from different literature sources, including

directly measured and indirectly determined values. The compounds and the logBB

values are presented in Table 10.2. The SMILES codes for the compounds are given in

Appendix B.1.b. The BBB penetration is presented again as the BB ratio: BB =

Cbrain/Cblood. In Platts et al. (2001) the following QSAR for the data set was reported,

based on the solvation equation of Abraham (Platts et al., 2001, adopted a simpler

notation for the structural descriptors, see Chapter 7):

logBB = 0.463 E - 0.864 S - 0.564 A - 0.731 B + 0.933 V - 0.567 I1 + 0.021 (10.2)

n = 148, R² = 0.745, s = 0.343, F = 69, Q² = 0.711

where: E is an excess molar refraction

S is the dipolarity/polarisability parameter

A is the H-bond acidity

B is the H-bond basicity

V is the characteristic McGowan volume

I1 is an indicator variable, set to 1 for a compound containing a carboxylic acid

fragment and 0 otherwise.

Nine compounds had been omitted as statistical outliers in the equation above in order to

improve the statistical fit.

According to the models reported by Platts et al. (2001), the factors that affect logBB are

molecular size and dispersion effects, which increase the penetration into brain, and

polarity/polarisability and H-bond acidity and basicity, which decrease the brain uptake.
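
The direction of these effects can be seen by evaluating Equation 10.2 directly, as in the following Python sketch. The function name and the descriptor values are hypothetical and serve only to show how the terms combine; they do not correspond to any compound in the data set.

```python
def log_bb_abraham(e, s, a, b, v, has_carboxylic_acid):
    """logBB from Equation 10.2 (Platts et al., 2001).

    e, s, a, b, v are the Abraham descriptors E, S, A, B and V;
    the indicator I1 is 1 for a compound with a carboxylic acid fragment.
    """
    i1 = 1 if has_carboxylic_acid else 0
    return (0.463 * e - 0.864 * s - 0.564 * a - 0.731 * b
            + 0.933 * v - 0.567 * i1 + 0.021)


# Hypothetical descriptor values, identical apart from the acid indicator.
neutral = log_bb_abraham(e=1.0, s=1.0, a=0.0, b=0.5, v=1.0, has_carboxylic_acid=False)
acid = log_bb_abraham(e=1.0, s=1.0, a=0.0, b=0.5, v=1.0, has_carboxylic_acid=True)
print(f"logBB without acid fragment: {neutral:.4f}")
print(f"logBB with acid fragment:    {acid:.4f}")
```

With all other descriptors equal, the carboxylic acid indicator lowers the predicted logBB by 0.567 log units.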

10.2.2. Structural descriptors

The following structural descriptors for the compounds were calculated: logPoct, logDoct,

aqueous solubility, molecular weight, molar refractivity, molar volume, parachor, index

of refraction, surface tension, density, polarisability, approximately 10 quantum-chemical

and approximately 250 topological descriptors. LogPoct values were calculated by using

the ACD/LogP version 4.02 (Advanced Chemistry Development Inc., Toronto, ON,

Canada), KOWWIN version 1.65 (Syracuse Research Corporation, Syracuse, NY, USA)

programs, TSAR for Windows version 3.3 (Accelrys Inc.) and QsarIS version 1.1 (then

SciVision-Academic Press, San Diego, CA; currently MDL®QSAR by Elsevier MDL,

San Leandro, CA) software. LogDoct at pH of 7.4 was calculated with ACD/LogD version

4.02 (Advanced Chemistry Development Inc., Toronto, ON, Canada) by using the ion-

pair partitioning method. Aqueous solubilities were calculated with ACD/AqSol version

4.02 (Advanced Chemistry Development Inc., Toronto, ON, Canada) using a pH setting

of 7.0. Molecular weight, molar refractivity, molar volume, parachor, index of refraction,

surface tension, density and polarisability were calculated with the ACDLabs version 4.02

software (Advanced Chemistry Development Inc., Toronto, ON, Canada). The TSAR

version 3.3 software was used to calculate quantum-chemical and topological descriptors.

For quantum-chemical calculations, the VAMP package (Accelrys Inc.) was used,

applying the AM1 Hamiltonian. The QsarIS version 1.1 software was used to calculate

topological descriptors. The abbreviations used for the molecular descriptors are

presented in Table 10.3.

10.2.3. Statistical analysis

Before performing the statistical analysis, the descriptors that had values of zero for more

than 95% of the compounds were excluded from the descriptor data matrix. To reduce

descriptor multicollinearity, the algorithm developed by the author of this thesis

(described in Chapter 9) was then applied to the descriptor data matrices.
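
The first filtering step, the removal of descriptors that are zero for more than 95% of the compounds, can be sketched in Python as follows. The function name, the descriptor names and the values are hypothetical; the thesis does not state how this step was implemented.

```python
def drop_mostly_zero(columns, max_zero_fraction=0.95):
    """Keep descriptors whose fraction of zero values does not exceed the limit."""
    kept = {}
    for name, values in columns.items():
        zero_fraction = sum(1 for v in values if v == 0) / len(values)
        if zero_fraction <= max_zero_fraction:
            kept[name] = values
    return kept


# Hypothetical descriptor table for 20 compounds.
descriptors = {
    "logPoct": [0.5 * i for i in range(1, 21)],  # no zero values
    "n_nitro": [0] * 18 + [1, 2],                # 90% zeros, kept
    "n_iodine": [0] * 20,                        # 100% zeros, removed
}
filtered = drop_mostly_zero(descriptors)
print(sorted(filtered))
```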

To model the in vivo endpoints, in vitro data were used in addition to the calculated

molecular descriptors (QSAAR analysis).

QSAAR and QSAR models were obtained by linear regression, using Statistica version

5.5 (StatSoft Inc., Tulsa, OK, USA). Two approaches were used for selecting variables to

derive models with good statistical diagnostics, and allowing for physicochemical and/or

mechanistic interpretation. Firstly, forward stepwise regression was applied, selecting up

to five variables to include. If some of the variables from the derived model were

considered to lack physicochemical and/or mechanistic relevance, these variables were

excluded from the variable data set and the forward stepwise regression was performed

again. Secondly, the program in C code implementing the best-subsets regression,

developed by the author (described in Chapter 9), was used. Only those variables that

intercorrelated with a coefficient of intercorrelation R less than 0.7 were included in the

same model.

Discriminant analysis was also performed with Statistica version 5.5. Variable selection

in the discriminant analysis was performed by a forward stepwise procedure, applying the

same considerations regarding physicochemical and/or mechanistic relevance of the

selected variables as in the regression analysis (see above). The assumptions of the

discriminant analysis for homogeneity of variable variance and variance/covariance

matrices within the groups were tested with the univariate Levene test and with the

multivariate Sen-Puri test (see Chapter 5, Section 5.4.2 “Assumptions of the linear

discriminant function analysis”), included in the ANOVA/MANOVA module of

Statistica version 5.5 software.

10.3. Results

10.3.1. Analysis of the ECVAM data set

The permeability data used in the study are shown in Table 10.1. The abbreviations used

for the structural descriptors are presented in Table 10.3, and the calculated values of the

descriptors included in the selected QSARs are presented in Table 10.4. Models based on

the whole set of compounds, as well as mechanism-specific models for passive diffusion,

were developed. The QSARs obtained for the whole set of compounds are presented in

Table 10.5.a; the QSARs for the passively diffused compounds are presented in Table

10.5.b.

10.3.2. Analysis of the data set taken from Platts et al. (2001)

Since argon, krypton, neon and xenon were not accepted by the TSAR software, BBB

penetration data for 153 (instead of 157) compounds were used for model development,

and are summarised in Table 10.2. The values of the descriptors included in the selected

QSARs are presented in Table 10.2. The QSARs obtained are presented in Table 10.6.

For the abbreviations of the molecular descriptors, see Table 10.3.

The goodness-of-fit of the QSAR (Equation 2, Table 10.6) improved when the following

compounds were excluded: mezoridazine, 4, Y-G 19, Y-G 20, Org12692 and thioridazine

(with the exception of mezoridazine, these compounds were also excluded from the final

model in the source paper (Platts et al., 2001)):

logBB = -0.211 (± 0.017) ΣH-bond + 0.204 (± 0.037) Nrings6 + 0.117 (± 0.032) logPoct + 0.165 (± 0.091) (10.3)

n = 147, R² = 0.695, s = 0.388, F = 108.5

A semi-quantitative classification of compounds based on their permeation of the BBB

was performed using discriminant analysis. This was done because the

biological data had been collected from different literature sources, which introduces possible

variability in the logBB values and thereby decreases the quality of the models obtained.

The compounds were classified into two groups with a cut-off value for logBB of 0. The

cut-off value was chosen so that compounds with a logBB value < 0

have a higher concentration in blood than in brain under steady-state conditions (low

penetrators). Conversely, compounds with a logBB value > 0 have a higher

concentration in brain (high penetrators). Discriminant analysis revealed ΣH-bond

(calculated using TSAR software) and logPoct (calculated using QsarIS) as the best

discriminating descriptors between the two groups. The Levene univariate test and the

Sen-Puri multivariate test were not statistically significant at the 95% level, thus suggesting

homogeneity in the variable variances and variance/covariance matrices within the two

groups. The following discriminant function was obtained:

Group = -0.426 ΣH-bond + 0.475 logPoct + 0.209 (10.4)

Wilks’ λ = 0.564, F = 57.9.

The following classification functions were obtained:

S1 = 1.284 ΣH-bond + 1.035 logPoct – 4.658 (10.5)

S2 = 0.536 ΣH-bond + 1.869 logPoct – 3.897 (10.6)

where S1 and S2 are the classification scores of the cases for the first and the second

group respectively.

Compounds with a higher value of S1 than S2 will be classified as low penetrators and, vice

versa, compounds with a higher value of S2 than S1 will be classified as high penetrators.
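
The decision rule based on the two classification functions can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the coefficients are copied from Equations 10.5 and 10.6 as printed, and assigning a tie (S1 equal to S2) to the high penetrators is an assumption, since the text does not cover that case. The descriptor values used in the example are taken from Table 10.2.

```python
def classification_scores(sum_h_bond: float, log_p_oct: float) -> tuple[float, float]:
    """Return (S1, S2): scores for the low- and high-penetrator groups."""
    s1 = 1.284 * sum_h_bond + 1.035 * log_p_oct - 4.658  # Equation 10.5
    s2 = 0.536 * sum_h_bond + 1.869 * log_p_oct - 3.897  # Equation 10.6
    return s1, s2


def classify(sum_h_bond: float, log_p_oct: float) -> str:
    """Assign the group with the higher score (tie -> 'high' is an assumption)."""
    s1, s2 = classification_scores(sum_h_bond, log_p_oct)
    return "low" if s1 > s2 else "high"


# Cimetidine: sum of H-bond atoms = 6, logPoct = 0.16, observed logBB = -1.42
print(classify(6, 0.16))
# Benzene: sum of H-bond atoms = 0, logPoct = 1.95, observed logBB = 0.37
print(classify(0, 1.95))
```

With these inputs cimetidine is assigned to the low penetrators and benzene to the high penetrators, in line with the signs of their observed logBB values.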

The a priori classification probabilities were set to be equal for the two groups. The

classification matrix of the two-group classification based on the obtained classification

functions is presented in Table 10.7. From the table it can be seen that 82.4% of the

compounds from both groups were correctly classified.
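
The percentage of correct classifications can be recomputed from the counts in Table 10.7, as in the Python sketch below. The helper names are hypothetical. The chance level is computed here with the proportional chance criterion (the sum of the squared observed group proportions); this is an assumption about the form of Equation 5.20, which is not reproduced in this chapter, although it does return the 50.6% quoted in the Discussion.

```python
# Counts from Table 10.7: outer keys are observed groups, inner keys predicted groups.
confusion = {
    "low": {"low": 56, "high": 12},
    "high": {"low": 15, "high": 70},
}


def percent_correct(matrix: dict) -> float:
    """Overall percentage of compounds assigned to their observed group."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[group][group] for group in matrix)
    return 100.0 * correct / total


def proportional_chance(matrix: dict) -> float:
    """Expected percentage correct by chance (assumed proportional chance criterion)."""
    sizes = [sum(row.values()) for row in matrix.values()]
    total = sum(sizes)
    return 100.0 * sum((n / total) ** 2 for n in sizes)


print(f"correct: {percent_correct(confusion):.1f}%")
print(f"chance:  {proportional_chance(confusion):.1f}%")
```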

Other cut-off values for classification of compounds according to BBB penetration were

attempted, but this did not improve the ability to classify the chemicals.

10.4. Discussion


In this study both in vivo BBB permeability and permeability through in vitro membrane

models for BBB were investigated. The in vivo BBB penetration was correlated with both

in vitro permeability coefficients and structural descriptors. QSAR models were

developed for the in vivo and in vitro compound penetration. Additionally, compounds

were classified into high and low BBB penetrators, and a classification model was

developed.

10.4.1. Interpretation of the models obtained for the ECVAM data set

QSAR models were developed for the whole data set as well as for the passively diffused

compounds only (Table 10.5). From the table it can be seen that the QSAR models for

passive diffusion (Table 10.5.b) have better statistical fit than the models based on all

compounds (Table 10.5.a). Models for in vivo penetration of compounds transported by

different mechanisms do not include logPoct. As can be expected, this descriptor reflects

better the in vivo penetration of compounds transported by passive diffusion than by other

mechanisms. LogPoct or logDoct as descriptors of lipophilicity appeared to be suitable

descriptors for the penetration through model membranes in vitro, with the exception of

the BBEC membrane.

For the whole data set of compounds, the in vitro membrane penetration model that

appears to correlate best with the in vivo penetration was the BBEC model (Equations 1

and 3, Table 10.5.a), which is consistent with results from the literature (Lundquist et al.,

2002). When BBEC logPs were correlated with the in vivo logPe or logBB without

including structural descriptors, R2 values of 0.324 and 0.338, respectively, were

obtained; thus, adding structural descriptors (Balaban index) increased the model R2

values by approximately 0.23 (Equations 1 and 3, Table 10.5.a). QSAR models based on

the whole set of compounds (Equations 2 and 4, Table 10.5.a) indicate that the number of

H-donor atoms is important for the in vivo BBB penetration, which is in accordance with

many literature reports (see Chapter 7).
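
To make the QSAAR idea concrete, the sketch below evaluates Equation 1 of Table 10.5.a, which combines an in vitro endpoint (BBEC permeability) with a structural descriptor (the Balaban index). It is an illustration under stated assumptions: the function name is hypothetical, only the coefficient point estimates are used, and the input values for antipyrine are read from Tables 10.1 and 10.4.

```python
def qsaar_in_vivo_logpe(bbec_logps: float, balaban: float) -> float:
    """Equation 1, Table 10.5.a: in vivo logPe from BBEC logPs and the Balaban index."""
    return 0.517 * bbec_logps + 0.446 * balaban - 1.10


# Antipyrine: BBEC logPe = 2.495 (Table 10.1), Balaban index = 1.7743 (Table 10.4);
# observed in vivo logPe = 1.041 (Table 10.1).
antipyrine_logpe = qsaar_in_vivo_logpe(bbec_logps=2.495, balaban=1.7743)
print(f"predicted in vivo logPe for antipyrine: {antipyrine_logpe:.2f}")
```

For antipyrine the equation gives about 0.98, compared with the observed value of 1.041.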

When only passively diffused compounds were considered, the Caco-2 cell membrane in

the apical to basolateral direction appeared to be best suited for describing in vivo

penetration (Equations 12 and 14, Table 10.5.b). When the Caco-2 Ps in apical to

basolateral direction were correlated with the in vivo logPe or logBB (without including

structural descriptors) the R2 values obtained were 0.577 and 0.587, respectively. Thus,

adding structural descriptors (the magnitude of the dipole moment) improved the R2 of


the correlations by approximately 0.31. A QSAR model for in vivo penetration included

logPoct and the dipole moment (Equations 13 and 15, Table 10.5.b). The dipole moment

might express the extent to which a molecule influences the electric field across lipid

membranes, which could affect the membrane transport. A weakness of the models for

passive diffusion is the low ratio of data points to independent variables (the equations

have 2 independent variables and only 9 data points, a ratio of 4.5), which falls short of the

recommendation of Topliss and Costello (1972). According to these authors, this ratio

should be at least five.

The improved statistical fit of the in vivo models, when descriptors of the chemical

structure were added to an in vitro endpoint, suggests that efforts to describe the in vivo

BBB penetration using membrane models in vitro can be combined with QSAR analysis.

The developed models will include descriptors which account for the differences between

the in vivo and in vitro penetration processes. In the present study the Balaban index

(encoding the molecular topology) and the magnitude of the dipole moment were used.

The Balaban index increases with increase in the compound size and decreases with

increase in the number of rings in a molecule. Thus, according to the models (Equations 1

and 3, Table 10.5.a) large molecules with a smaller number of rings penetrate more easily

through the BBB. The dipole moment could affect the membrane transport by influencing

electrostatic interactions with the membranes.

From Table 10.5 it is seen that the QSAR models for in vivo BBB penetration have only

slightly worse statistical fit than the QSAAR models, which are more complex and

require in vitro data. Thus, it appears that the use of descriptors of chemical structure alone

gives results comparable to those of the more problematic approach of modelling in vivo BBB

penetration by using membrane systems in vitro.

The models obtained for BBEC permeability indicated that the number of H-donor and H-

acceptor atoms and the largest positive charge on a hydrogen atom (Hmaxp) are important

for penetration (Equations 5 and 16, Table 10.5). According to some authors (Wilson and

Famini, 1991) Hmaxp is also related to H-bond donating potential of a compound (see

Chapter 5, Section 3.5.5 “Hydrogen-bond donor and acceptor ability”).

The permeability of the SV-ARBEC in the apical to basolateral direction was correlated

with the sum of the E-state indices of all the individual atoms in the molecule (ΣE-state)

(Equation 6, Table 10.5.a). According to Rose et al. (2002), the E-state indices account for


the ability of a molecule to enter into non-covalent intermolecular interactions, and thus

they encode factors that could influence binding to the membrane transporters. The

permeability of SV-ARBEC in the basolateral to apical direction was correlated with the

logarithm of the octanol-water distribution coefficient (logDoct) and the number of H-donor

atoms (Equation 7, Table 10.5.a). LogDoct accounts for the degree of compound

ionisation in the process of distribution between octanol and water. For the subset of

passively diffused compounds, no QSARs with good statistical parameters were obtained.

Models for predicting the permeability of the Caco-2 membrane, based on all compounds,

also indicate the importance of hydrogen bonding and logDoct for the transport processes

(Equations 8 and 9, Table 10.5.a). In the case of the passively diffused compounds, the Ps

in the apical to basolateral, and the basolateral to apical, directions are highly

intercorrelated (intercorrelation coefficient R = 0.93, compared with R = 0.62 when all

compounds were included). Thus, the same descriptors determine the transport in the two

directions, namely the number of H-donors and logPoct (Equations 17, 18, 19, 20,

Table 10.5.b).

The best models for penetration across the MDCK membrane included logPoct, the shape

flexibility index, and the largest positive charge on a hydrogen atom (Equations 10, 11,

21, and 22, Table 10.5).

10.4.2. Interpretation of the models obtained for the data set taken from Platts et al.

(2001)

The derived QSARs for this data set are presented in Table 10.6 and Equation 10.3.

LogPoct and the numbers of H-donor and H-acceptor atoms again appear to be the most

suitable predictors for penetration across the BBB in vivo. According to the models,

forming H-bonds is unfavourable for passage across the BBB, while increasing logPoct

(hydrophobicity) tends to increase logBB. This is in accordance with results from the

literature (see Chapter 7). The number of 6-membered rings (Nrings6) may be related to the

volume or shape of the molecules, or may indicate differences in the aromatic character of

the compounds.

Additionally, a classification model for compounds as low and high penetrators of the

BBB was developed, by using a logBB cut-off value of 0.0. The variables that appeared

to discriminate between the groups were again related to the H-bonding capacity (ΣH-bond)


and the lipophilicity (logPoct). The model resulted in 82.4% correct classification of the

compounds from both groups, while the expected classification by chance was 50.6%

(calculated by using Equation 5.20, Chapter 5, Section 5.4 “Linear discriminant function

analysis”). Thus, applying the discriminant model improved the classification accuracy

by approximately 32 percentage points. From the obtained classification functions

(Equations 10.5 and 10.6) it can be seen that a compound will have a higher value of S2

than S1 (the compound will be classified as high penetrator), if:

0.748 ΣH-bond < 0.834 logPoct - 0.761, equivalent to:

ΣH-bond < 1.115 logPoct - 1.017 (10.7)

As the coefficient of logPoct and the free term of this expression are very close to one, it

can be concluded that a compound should have a ΣH-bond value smaller than its logPoct

value minus one, to be classified as a high penetrator. Additionally, compounds having

negative logPoct values will always be classified as low penetrators (in accordance with

results found in the literature, see Chapter 7).
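
The screening rule can be written as a one-line test, sketched below in Python. The sketch takes the coefficients of Equation 10.7 exactly as printed and does not re-derive them from Equations 10.5 and 10.6; the function name and the simplified flag are hypothetical. With simplified=True the rounded rule (ΣH-bond smaller than logPoct minus one) is applied instead. The example values come from Table 10.2.

```python
def is_high_penetrator(sum_h_bond: float, log_p_oct: float, simplified: bool = False) -> bool:
    """Screening rule of Equation 10.7, or its rounded form when simplified=True."""
    if simplified:
        # Rounded rule: sum of H-bond atoms smaller than logPoct minus one.
        return sum_h_bond < log_p_oct - 1.0
    return sum_h_bond < 1.115 * log_p_oct - 1.017


# Imipramine: sum of H-bond atoms = 1, logPoct = 4.71, observed logBB = 1.06
print(is_high_penetrator(1, 4.71))
# Ethanol: sum of H-bond atoms = 2, logPoct = -0.03, observed logBB = -0.16
print(is_high_penetrator(2, -0.03))
```

Because ΣH-bond cannot be negative, any compound with a negative logPoct fails both forms of the test, which is the behaviour described in the text.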

10.4.3. Contribution to existing knowledge

In the present study a new data set (the ECVAM data set) was investigated, and QSARs

were obtained for in vivo BBB penetration and for membrane transport through four in

vitro BBB model systems. The importance of compound lipophilicity and H-bonding

ability for the membrane transport was confirmed. However, the results suggested that

factors other than compound lipophilicity (encoded by logPoct or logDoct) mainly

influence the in vivo penetration of compounds transported by different mechanisms

(passive and active transport). In the present study these factors were related to size of

molecules and presence of rings (encoded by the Balaban index) and the number of H-

bond donors. The values of logPoct reflected better the penetration of compounds

transported by passive diffusion only.

QSAAR analysis was also applied to the data set. As far as the author is aware, this is the

first attempt to model in vivo BBB penetration by combining in vitro membrane

permeability data with descriptors of chemical structure (QSAAR analysis). The QSAAR

analysis suggested that adding structural descriptors results in better models for in vivo

BBB transport than did those based on in vitro membrane systems alone. However, using


descriptors of chemical structure alone (QSAR analysis) resulted in models with

comparable statistical fit to that of the QSAAR models.

A second data set was investigated, for which a QSAR model was previously developed

by Platts et al. (2001) (Equation 10.2). A QSAR model was developed (Equation 10.3)

with slightly worse statistical fit than the model developed in the source paper, but using

a smaller number of descriptors of chemical structure. A new classification model of the

compounds into high and low penetrators was derived. On the basis of this classification

model, a simple relation between logPoct value and the sum of H-bond donors and

acceptors for a compound, governing its classification as a high or low penetrator, was

obtained. This relation could be used for fast screening of large numbers of compounds.

10.5. Conclusions

QSAARs for the in vivo BBB penetration combining permeability data for in vitro

membrane systems and descriptors of chemical structure gave better results than

modelling of in vivo BBB transport by using in vitro membrane systems alone. The

BBEC membrane system appeared to be the most suitable in vitro system for modelling

BBB penetration of compounds transported by different mechanisms (passive and active

transport), while for the passively diffused compounds alone the Caco-2 cell system

reflected best the in vivo BBB penetration. Factors that accounted for differences between

the in vivo and in vitro penetration processes were encoded by the Balaban index and the

magnitude of the dipole moment.

Lipophilicity and H-bonding ability were shown to be important molecular properties for

membrane transport (consistent with literature data). However, logPoct did not appear to

be a suitable descriptor for the in vivo penetration of compounds transported by different

mechanisms; it appeared to determine the in vivo passive transport only.

According to a classification model for compounds as low and high penetrators of the

BBB, the sum of H-bond donor and acceptor atoms for a compound should be smaller

than its logPoct value minus one, in order to classify the compound as a high penetrator.

This relation is simple and can be used for fast compound screening.


Figure 10.1. A schematic representation of the in vitro blood–brain barrier co-culture

model – brain capillary endothelial cells, co-cultured with astrocytes (redrawn from

Gaillard and De Boer, 2000)

Abbreviations: BCEC – brain capillary endothelial cells; AC – astrocytes


Table 10.1. ECVAM data set of 21 compounds and their in vivo BBB and in vitro permeability coefficients

No | Name | Transport mechanism | in vivo logPe a | in vivo logBB b | BBEC logPe a | SV-ARBEC* logPe a | SV-ARBEC** logPe a | Caco-2* logPapp a, d | Caco-2** logPapp a, d | MDCK* logPapp a, d | MDCK** logPapp a, d
1 | L-alanine | active influx (system A) | 0.982 | 1.826 | 1.250 | 1.000 | 2.180 | 0.562 | 0.598 | 0.301 | 0.204
2 | antipyrine | passive | 1.041 | 1.898 | 2.495 | 1.360 | 1.358 | 1.630 | 1.775 | 1.639 | 1.809
3 | AZT | efflux (OAT) | -0.301 | 0.477 | 0.813 | 1.250 | 1.407 | 1.033 | 1.328 | 0.371 | 0.342
4 | caffeine | passive | 0.908 | 1.903 | 2.997 | 1.072 | 1.936 | 1.635 | 1.798 | 1.600 | 1.778
5 | cimetidine | efflux | -0.268 | 0.531 | 0.633 | 0.813 | 1.053 | -0.097 | 0.686 | -0.097 | 0.114
6 | cyclosporin A | efflux (P-glycoprotein) | 0.477 | 1.301 | 0.845 | 0.176 | 0.602 | -0.187 | 1.086 | 0.041 | 0.968
7 | diazepam | passive | 1.638 | 2.502 | 2.520 | 1.538 | 0.919 | 1.613 | 1.742 | 1.574 | 1.701
8 | digoxin | efflux (mdr-1) | -0.699 | 0.000 | 0.544 | 0.398 | 0.919 | -0.022 | 1.348 | -0.125 | 1.211
9 | L-dopa | active influx (aa-carrier) | 0.591 | 1.491 | 1.332 | 1.217 | 0.978 | -0.022 | 0.279 | -0.155 | -0.398
10 | glycerol | passive | 0.875 | 1.785 | 0.863 | 1.367 | 1.238 | 1.167 | 0.398 | -0.222 | 0.114
11 | lactic acid | active influx (MCT1+2) | 1.270 | 2.155 | 1.021 | 1.190 | 1.591 | -0.022 | 0.061 | 0.484 | 0.398
12 | L-leucine | active influx (system L) | 1.164 | 2.000 | 1.182 | 1.393 | 1.391 | 1.186 | 0.699 | 0.267 | 0.230
13 | morphine | efflux (P-glycoprotein) | -0.097 | 0.699 | 1.358 | 1.336 | 1.267 | 0.916 | 1.092 | 0.342 | 0.505
14 | nicotine | passive | 0.792 | 1.708 | 2.991 | 1.538 | 1.615 | 1.455 | 1.608 | 1.472 | 1.654
15 | phenytoin | passive | 1.173 | 2.021 | 2.089 | 1.305 | 1.243 | 1.602 | 1.689 | 1.534 | 1.602
16 | sucrose | passive | -0.495 | 0.301 | 0.602 | 0.934 | 1.004 | 0.146 | -0.347 | -0.699 | -0.699
17 | urea | passive | 0.079 | 0.954 | 1.568 | 1.589 | 1.484 | 0.771 | 0.863 | 0.079 | 0.097
18 | verapamil | efflux (P-glycoprotein) | 0.799 | 1.613 | 1.462 | 0.987 | - c | 1.196 | 1.609 | 1.146 | 1.520
19 | vinblastine | efflux (P-glycoprotein) | 0.204 | 1.000 | 1.297 | 0.301 | 1.210 | 0.312 | 1.545 | 0.114 | 1.299
20 | vincristine | efflux (MRP1) | 0.301 | 1.255 | 0.398 | 0.826 | 0.813 | -1.071 | 0.808 | -0.699 | 0.332
21 | warfarin | passive | -0.125 | 0.699 | 1.458 | 0.519 | 1.648 | 1.504 | 1.683 | 1.306 | 1.453

a all permeability coefficients (P) are given as the log (P x 10e-6). Ps are in units of cm/s
b BB = concentration in the brain / concentration in the blood * 100 (%)
c the SV-ARBEC (basolateral to apical direction) P value for verapamil was given as negative, and therefore discarded from the analyses
d mean value from two experiments

* permeability coefficients obtained in apical to basolateral direction of the membrane

** permeability coefficients obtained in basolateral to apical direction of the membrane


Table 10.2. Biological data and values of the molecular descriptors included in the

QSARs for BBB penetration, developed from the data set of Platts et al. (2001)

No | Name | logBB e | logPoct f | ΣH-bond g | Nrings6 g
1 | 1 (cimetidine) | -1.42 | 0.16 | 6 | 0
2 | 2 c | -0.04 | 0.84 | 3 | 0
3 | 4 b, c | -1.30 | 4.08 | 6 | 3
4 | 5 c | -1.06 | 1.98 | 7 | 2
5 | 6 (clonidine) | 0.11 | 1.71 | 3 | 1
6 | 7 (mepyramine) | 0.49 | 2.58 | 3 | 2
7 | 8 (imipramine) | 1.06 | 4.71 | 1 | 2
8 | 9 (ranitidine) b | -1.23 | 0.20 | 6 | 0
9 | 10 (tiotidine) | -0.82 | 0.31 | 8 | 0
10 | 13 c | -0.67 | 1.88 | 5 | 1
11 | 14 c | -0.66 | 1.20 | 5 | 1
12 | 15 c | -0.12 | 2.46 | 5 | 2
13 | 16 c | -0.18 | 1.78 | 4 | 1
14 | 17 c | -1.15 | 1.11 | 5 | 1
15 | 18 c | -1.57 | 1.19 | 6 | 1
16 | 19 c | -1.54 | 1.46 | 7 | 1
17 | 20 b, c | -1.12 | 0.72 | 6 | 0
18 | 21 b, c | -0.73 | 2.38 | 6 | 1
19 | 22 c | -0.27 | 2.64 | 6 | 1
20 | 23 c | -0.28 | 2.72 | 6 | 2
21 | 24 c | -0.46 | 3.22 | 4 | 2
22 | 25 c | -0.24 | 4.16 | 4 | 3
23 | 26 c | -0.02 | 2.71 | 4 | 2
24 | 27 c | 0.69 | 3.67 | 4 | 3
25 | 28 c | 0.44 | 3.63 | 4 | 2
26 | 29 c | 0.14 | 5.17 | 4 | 3
27 | 30 c | 0.22 | 4.41 | 5 | 3
28 | 31 (carbamazepine) d | 0.00 | 2.51 | 2 | 2
29 | 32 (epoxide of carbamazepine) d | -0.34 | 1.68 | 3 | 2
30 | 33 d | -0.30 | 2.72 | 5 | 1
31 | 34 d | -1.34 | 2.47 | 7 | 1
32 | 35 d | -1.82 | 2.25 | 9 | 1
33 | 36 (amitriptyline) d | 0.89 | 4.88 | 1 | 2
34 | acetaminophen | -0.31 | 0.67 | 4 | 1
35 | acetylsalicylic acid | -0.50 | 1.06 | 5 | 1
36 | alprazolam | 0.04 | 2.71 | 3 | 2
37 | aminopyrine | 0.00 | 1.19 | 1 | 1
38 | amobarbital | 0.04 | 1.84 | 5 | 1
39 | antipyrine | -0.10 | 1.26 | 1 | 1
40 | argon a | 0.03 | 4.70 | - | -
41 | atenolol | -1.42 | 0.75 | 7 | 1
42 | benzene | 0.37 | 1.95 | 0 | 1
43 | bretazenil | -0.09 | 2.99 | 4 | 1
44 | bromperidol | 1.38 | 4.08 | 4 | 3
45 | butanone | -0.08 | 0.57 | 1 | 0
46 | caffeine | -0.05 | 0.12 | 3 | 1
47 | chlorpromazine | 1.06 | 5.20 | 1 | 3
48 | clobazam | 0.35 | 1.75 | 2 | 2
49 | codeine | 0.55 | 1.62 | 5 | 4

50 | CS2 | 0.60 | 1.96 | 2 | 0
51 | cyclohexane | 0.92 | 3.40 | 0 | 1
52 | cyclopropane | 0.00 | 1.57 | 0 | 0
53 | desipramine | 1.20 | 4.54 | 2 | 2
54 | desmethyldesipramine | 1.06 | 3.85 | 2 | 2
55 | desmethylclobazam | 0.36 | 2.39 | 3 | 2
56 | desmethyldiazepam | 0.50 | 2.89 | 3 | 2
57 | desmonomethylpromazine | 0.59 | 4.50 | 2 | 3
58 | diazepam | 0.52 | 2.91 | 2 | 2
59 | dichloromethane | -0.11 | 1.33 | 0 | 0
60 | didanosine | -1.30 | 0.14 | 8 | 1
61 | diethyl ether | 0.00 | 1.07 | 1 | 0
62 | 2,2-dimethylbutane | 1.04 | 3.12 | 0 | 0
63 | divinyl ether | 0.11 | 1.41 | 1 | 0
64 | enflurane | 0.24 | 2.38 | 1 | 0
65 | ethanol | -0.16 | -0.03 | 2 | 0
66 | ethylbenzene | 0.20 | 3.14 | 0 | 1
67 | flumazenil | -0.29 | 1.17 | 4 | 1
68 | flunitrazepam | 0.06 | 2.06 | 4 | 2
69 | fluphenazine b | 1.51 | 4.13 | 4 | 4
70 | fluroxene | 0.13 | 1.44 | 1 | 0
71 | haloperidol | 1.34 | 4.08 | 4 | 3
72 | halothane | 0.35 | 2.33 | 0 | 0
73 | heptane | 0.81 | 4.40 | 0 | 0
74 | hexane | 0.80 | 3.87 | 0 | 0
75 | hexobarbital | 0.10 | 1.53 | 4 | 2
76 | 1-hydroxymidazolam | -0.07 | 2.26 | 4 | 2
77 | 4-hydroxymidazolam | -0.30 | 2.32 | 4 | 2
78 | 9-hydroxyrisperidone | -0.67 | 3.46 | 7 | 4
79 | hydroxyzine | 0.39 | 2.37 | 5 | 3
80 | ibuprofen | -0.18 | 3.76 | 3 | 1
81 | indinavir | -0.74 | 3.14 | 11 | 4
82 | indomethacin | -1.26 | 3.32 | 5 | 2
83 | isoflurane | 0.42 | 2.31 | 1 | 0
84 | krypton a | -0.16 | 4.74 | - | -
85 | M2L-663581 | -1.82 | 0.74 | 9 | 1
86 | mesoridazine | -0.36 | 4.39 | 2 | 4
87 | methane | 0.04 | 4.58 | 0 | 0
88 | methohexital | -0.06 | 2.63 | 5 | 1
89 | methoxyflurane | 0.25 | 2.11 | 1 | 0
90 | methylcyclopentane | 0.93 | 3.30 | 0 | 0
91 | 3-methylhexane | 0.90 | 4.14 | 0 | 0
92 | 2-methylpentane | 0.97 | 3.51 | 0 | 0
93 | 3-methylpentane | 1.01 | 3.65 | 0 | 0
94 | 2-methylpropan-1-ol | -0.17 | 0.76 | 2 | 0
95 | mianserin | 0.99 | 3.17 | 1 | 3
96 | midazolam | 0.36 | 2.92 | 2 | 2
97 | MIL-663581 | -1.34 | 1.46 | 7 | 1
98 | mirtazapine | 0.53 | 2.49 | 2 | 3
99 | morphine | -0.16 | 1.03 | 6 | 4

100 | neon a | 0.20 | 4.69 | - | -
101 | nevirapine | 0.00 | 1.31 | 4 | 2
102 | nitrogen | 0.03 | 0.41 | 2 | 0
103 | nitrous oxide | 0.03 | -0.26 | 2 | 0
104 | nor-1-chlorpromazine | 1.37 | 5.01 | 2 | 3
105 | nor-2-chlorpromazine | 0.97 | 4.48 | 2 | 3
106 | northioridazine | 0.75 | 5.53 | 2 | 4
107 | Org12692 b | 1.64 | 2.64 | 2 | 2
108 | Org13011 | 0.16 | 2.03 | 3 | 2
109 | Org30526 | 0.39 | 4.05 | 3 | 2
110 | Org32104 | 0.52 | 2.97 | 5 | 3
111 | Org34167 | 0.00 | 3.49 | 4 | 2
112 | Org4428 | 0.82 | 3.33 | 4 | 3
113 | Org5222 | 1.03 | 4.41 | 2 | 2
114 | oxazepam | 0.61 | 2.22 | 5 | 2
115 | paraxanthine | 0.06 | -0.53 | 4 | 1
116 | pentane | 0.76 | 3.33 | 0 | 0
117 | pentobarbital | 0.12 | 1.93 | 5 | 1
118 | phenylbutazone | -0.52 | 2.98 | 2 | 2
119 | phenytoin | -0.04 | 2.23 | 4 | 2
120 | promazine | 1.23 | 4.83 | 1 | 3
121 | propan-1-ol | -0.16 | 0.34 | 2 | 0
122 | propan-2-ol | -0.15 | 0.28 | 2 | 0
123 | propanone | -0.15 | 0.03 | 1 | 0
124 | propranolol | 0.64 | 3.07 | 5 | 2
125 | quinidine | -0.46 | 2.85 | 5 | 4
126 | risperidone | -0.02 | 4.13 | 5 | 4
127 | RO19-4603 | -0.25 | 1.76 | 4 | 0
128 | salicylic acid | -1.10 | 1.60 | 5 | 1
129 | salicyluric acid | -0.44 | 0.87 | 7 | 1
130 | SF6 | 0.36 | 3.18 | 0 | 0
131 | SKF 101468 | 0.25 | 2.37 | 3 | 1
132 | SKF 89124 | -0.43 | 1.20 | 5 | 1
133 | sulforidazine | 0.18 | 4.31 | 3 | 4
134 | teflurane | 0.27 | 2.14 | 0 | 0
135 | theobromine | -0.28 | -0.63 | 4 | 1
136 | theophylline | -0.29 | 0.19 | 4 | 1
137 | thiopental | -0.14 | 2.61 | 5 | 1
138 | thioridazine b | 0.24 | 5.88 | 1 | 4
139 | tibolone | 0.40 | 2.52 | 3 | 3
140 | toluene | 0.37 | 2.66 | 0 | 1
141 | triazolam | 0.74 | 3.08 | 3 | 2
142 | 1,1,1-trichloroethane | 0.40 | 1.85 | 0 | 0
143 | trichloroethene | 0.34 | 2.29 | 0 | 0
144 | trichloromethane | 0.29 | 1.76 | 0 | 0
145 | 1,1,1-trifluoro-2-chloroethane | 0.08 | 1.33 | 0 | 0
146 | trifluoperazine | 1.44 | 4.50 | 2 | 4
147 | valproic acid | -0.22 | 2.39 | 3 | 0
148 | xenon a | 0.03 | 4.78 | - | -
149 | 2-xylene | 0.37 | 3.22 | 0 | 1

150 | 3-xylene | 0.29 | 3.17 | 0 | 1
151 | 4-xylene | 0.31 | 3.11 | 0 | 1
152 | Y-G 14 | -0.30 | 0.85 | 3 | 1
153 | Y-G 15 | -0.06 | 1.42 | 2 | 1
154 | Y-G 16 | -0.42 | 0.29 | 3 | 0
155 | Y-G 19 b | -1.30 | 2.02 | 3 | 1
156 | Y-G 20 b | -1.40 | 0.88 | 3 | 1
157 | zidovudine | -0.72 | -0.32 | 7 | 1

a compounds not accepted by the TSAR software and omitted from the final model
b compounds omitted in the final model of the source paper
c in the source paper (Platts et al., 2001), these compounds are indicated to be taken from Young et al. (1988). The number represents the numeration from Young et al. (1988)
d in the source paper (Platts et al., 2001), these compounds are indicated to be taken from Norinder et al. (1998). The number represents the numeration from Norinder et al. (1998)
e BB = concentration in the brain / concentration in the blood
f calculated with the QsarIS software
g calculated with the TSAR software


Table 10.3. Physicochemical descriptors used in the selected QSARs for BBB penetration

Symbol Name

Balaban Balaban index

Flex shape flexibility index

Hmaxp largest positive charge on a hydrogen atom.

logPoct logarithm of the octanol-water partition coefficient

logDoct logarithm of the octanol-water distribution coefficient at pH of 7.4

µ magnitude of the molecular dipole moment

NH-accept number of H-bond acceptor atoms

NH-donors number of H-bond donor atoms

ΣE-state sum of the E-state indices of all atoms in the molecule

ΣH-bond sum of the H-bond donor and acceptor atoms


Table 10.4. Calculated values of the structural descriptors included in the selected QSARs derived from the ECVAM data set

No Name ACD/LogP KOWWIN ACD/LogD TSAR TSAR TSAR TSAR QsarIS QsarIS QsarIS QsarIS

logPoct logPoct logDoct Balaban NH-donors ΣE-state µ a logPoct NH-accept Flex Hmaxp

1 L-alanine -2.77 -4.15 -3.18 2.9935 1 22.00 10.74 -2.033 3 1.764 0.229

2 antipyrine 0.27 0.59 0.27 1.7743 0 32.00 4.55 1.257 3 2.191 0.065

3 AZT -0.88 0.23 -0.58 1.7767 2 56.00 4.14 -0.322 9 4.000 0.211

4 caffeine -0.07 0.16 -0.081 1.7803 0 37.67 3.63 -0.504 6 1.845 0.098

5 cimetidine 0.36 0.57 0.21 1.9223 3 38.13 4.38 0.619 7 6.319 0.224

6 cyclosporin A 11.28 1.00 11.28 2.5440 5 211.83 - b - c - c - c - c

7 diazepam 2.96 2.70 2.96 1.3625 0 42.81 3.72 2.905 4 3.384 0.077

8 digoxin 2.06 0.50 2.23 0.7660 6 128.58 6.84 0.774 14 11.215 0.210

9 L-dopa -2.06 -3.40 -2.73 2.1375 3 44.50 8.81 -1.767 5 3.317 0.229

10 glycerol -2.41 -1.65 -2.41 2.7542 3 22.33 1.70 -1.659 3 3.023 0.210

11 lactic acid -0.70 -0.65 -6.09 2.9935 2 24.00 2.02 -0.825 3 1.764 0.229

12 L-leucine -1.72 -2.75 -1.77 3.3766 1 26.83 10.76 -0.873 3 3.421 0.229

13 morphine 1.06 0.72 0.053 1.2687 2 45.25 3.57 1.026 4 2.212 0.210

14 nicotine 0.72 1.00 -1.02 1.6601 0 22.50 3.29 1.594 2 2.086 0.085

15 phenytoin 2.52 2.16 2.50 1.6394 2 46.92 2.99 2.366 4 2.739 0.216

16 sucrose -3.65 -4.27 -3.65 1.9642 8 74.92 3.02 -2.970 11 5.920 0.210

17 urea -1.58 -1.56 -1.58 2.3238 2 16.67 4.00 -2.133 3 0.876 0.200

18 verapamil 5.03 4.80 3.38 1.7387 0 70.58 3.45 5.671 6 10.491 0.082

19 vinblastine 4.85 4.32 4.12 0.9321 3 131.92 2.02 0.072 13 9.821 0.212

20 vincristine 3.47 3.11 2.77 0.9433 3 138.92 5.43 0.902 14 10.051 0.212

21 warfarin 3.47 2.23 0.57 1.5681 1 58.00 7.98 3.803 4 4.189 0.197

a calculated with the VAMP package, using AM1 Hamiltonian
b not calculated by the VAMP package
c cyclosporin A was not accepted by the QsarIS software


Table 10.5. QSARs for BBB and membrane penetration, obtained from the ECVAM data

set

Table 10.5. a) all compounds included†

No | Permeability coefficients (Ps) | Regression equation | R2 | s | F
1 | in vivo logPe | 0.517 (± 0.131) BBEC logPs + 0.446 (± 0.145) Balaban - 1.10 (± 0.36) | 0.558 | 0.457 | 11.4
2 | in vivo logPe | 0.333 (± 0.148) Balaban - 0.187 (± 0.050) NH-donors - 0.273 (± 0.331) | 0.539 | 0.467 | 10.5
3 | in vivo logBB | 0.559 (± 0.138) BBEC logPs + 0.471 (± 0.152) Balaban - 0.355 (± 0.380) | 0.569 | 0.479 | 11.9
4 | in vivo logBB | 0.349 (± 0.156) Balaban - 0.201 (± 0.053) NH-donors + 1.12 (± 0.35) | 0.543 | 0.493 | 10.7
5a | BBEC logPs | -8.83 (± 1.73) Hmaxp - 0.0737 (± 0.0264) NH-accept + 3.49 (± 0.33) | 0.720 | 0.441 | 21.9
6 | SV-ARBEC logPs* | -0.00674 (± 0.00110) ΣE-state + 1.47 (± 0.09) | 0.663 | 0.248 | 37.4
7b | SV-ARBEC logPs** | -0.0506 (± 0.0169) logDoct - 0.0973 (± 0.0294) NH-donors + 1.53 (± 0.09) | 0.581 | 0.264 | 11.8
8a | Caco-2 logPs* | 0.143 (± 0.049) logDoct - 0.165 (± 0.033) NH-accept + 1.80 (± 0.23) | 0.604 | 0.510 | 13.0
9 | Caco-2 logPs** | 0.0749 (± 0.0187) logDoct - 0.377 (± 0.095) Balaban - 0.197 (± 0.030) NH-donors + 2.20 (± 0.21) | 0.837 | 0.275 | 29.2
10a | MDCK logPs* | 0.303 (± 0.046) logPoct c - 0.157 (± 0.030) Flex + 1.10 (± 0.16) | 0.759 | 0.408 | 26.8
11a | MDCK logPs** | 0.157 (± 0.040) logPoct d - 7.20 (± 1.69) Hmaxp + 1.99 (± 0.33) | 0.748 | 0.421 | 25.3

† number of compounds in the training set n = 21
a cyclosporin A was not accepted by the QsarIS software, so the number of compounds in the training set was 20
b the SV-ARBEC (basolateral to apical direction) P value for verapamil was determined as negative, and it was excluded from the model, so the number of compounds in the training set was 20
c calculated with the QsarIS software
d calculated with the ACD/LogP software

* permeability coefficients obtained in apical to basolateral direction of the membrane (see Figure 3.1)

** permeability coefficients obtained in basolateral to apical direction of the membrane (see Figure 3.1)


Table 10.5. b) only passively diffused compounds included†

No | Permeability coefficients (Ps) | Regression equation | R2 | s | F
12 | in vivo logPe | 1.21 (± 0.19) Caco-2 logPs* - 0.226 (± 0.058) µ - 0.0198 (± 0.3042) | 0.881 | 0.274 | 22.3
13 | in vivo logPe | 0.289 (± 0.051) logPoct a - 0.300 (± 0.067) µ + 1.77 (± 0.27) | 0.862 | 0.295 | 18.8
14 | in vivo logBB | 1.274 (± 0.179) Caco-2 logPs* - 0.242 (± 0.053) µ + 0.835 (± 0.280) | 0.907 | 0.255 | 29.3
15 | in vivo logBB | 0.297 (± 0.055) logPoct a - 0.315 (± 0.072) µ + 2.71 (± 0.29) | 0.853 | 0.317 | 17.4
16 | BBEC logPs | -0.282 (± 0.072) NH-donors + 2.45 (± 0.22) | 0.684 | 0.529 | 15.2
17 | Caco-2 logPs* | -0.179 (± 0.032) NH-donors + 1.60 (± 0.10) | 0.813 | 0.237 | 30.4
18 | Caco-2 logPs* | 0.201 (± 0.038) logPoct a + 1.25 (± 0.08) | 0.801 | 0.245 | 28.1
19 | Caco-2 logPs** | -0.275 (± 0.043) NH-donors + 1.73 (± 0.13) | 0.852 | 0.317 | 40.2
20 | Caco-2 logPs** | 0.309 (± 0.052) logPoct a + 1.20 (± 0.31) | 0.835 | 0.334 | 35.5
21 | MDCK logPs* | 0.365 (± 0.068) logPoct a + 0.865 (± 0.146) | 0.805 | 0.438 | 28.9
22 | MDCK logPs** | 0.374 (± 0.070) logPoct a + 1.000 (± 0.150) | 0.804 | 0.449 | 28.8

† number of compounds in the training set n = 9
a calculated with the KOWWIN program

* permeability coefficients obtained in apical to basolateral direction of the membrane

** permeability coefficients obtained in basolateral to apical direction of the membrane


Table 10.6. QSARs for in vivo BBB penetration, obtained from the data set of Platts et al.

(2001) †

No | BBB penetration | Regression equation | R2 | s | F
1 | LogBB | -0.165 (± 0.017) ΣH-bond + 0.212 (± 0.028) logPoct + 0.0507 (± 0.1053) | 0.565 | 0.479 | 97.3
2 | LogBB | -0.208 (± 0.020) ΣH-bond + 0.176 (± 0.043) Nrings6 + 0.109 (± 0.037) logPoct + 0.179 (± 0.105) | 0.609 | 0.455 | 77.5

0.609 0.455 77.5

† number of compounds in the training set n = 153


Table 10.7. Two-group classification matrix obtained by discriminant analysis of the data set of Platts et al. (2001)

Number of observed compounds | Number of predicted compounds: low penetrators | Number of predicted compounds: high penetrators | Percent of correct classification
Low penetrators | 56 | 12 | 82.4
High penetrators | 15 | 70 | 82.4
Total | 71 | 82 | 82.4


CHAPTER 11

INVESTIGATION OF BBB PENETRATION IN RELATION TO

P-GLYCOPROTEIN INTERACTIONS

11.1. Objectives

In Chapter 10 an investigation of blood-brain barrier (BBB) permeability for compounds

transported by different mechanisms is presented. In Chapter 11 a study of specific

interactions with one of the transport systems present in the BBB, namely the efflux

protein P-glycoprotein (P-gp), is described. P-gp plays a very important role in cell

protection from xenobiotic compounds such as toxicants and drugs. It is associated with the

development of resistance to a wide variety of drugs (the so-called multidrug resistance

(MDR) phenomenon). 3D-(Q)SAR approaches were applied (GASP, CoMFA, CoMSIA),

which are appropriate for investigating compounds following similar binding modes to a

biological macromolecule.

11.2. Methods

11.2.1. Biological data

Compounds from the data set taken from Platts et al. (2001) and investigated in Chapter

10 were also used to study interactions with P-gp. As different mechanisms can be

involved in the transport across the BBB, a subset of 16 structurally related compounds

was selected from the whole data set. The selected compounds are phenothiazine and

imipramine derivatives, for which similar mechanisms of interaction with the BBB were

assumed. Catamphiphilic compounds like phenothiazine and imipramine derivatives have

been investigated for membrane activity (Pajeva et al., 1996) and ability to overcome

multidrug resistance (MDR) in tumour cells (Ford et al., 1989; Ramu and Ramu, 1992).

These studies show that phenothiazines and imipramines interact with membrane

phospholipids, and with the MDR transporter P-gp, responsible for the active drug efflux

from the cells. Thus, it can be suggested that the BBB penetration of this group of

compounds is regulated by similar transport mechanisms, namely passive diffusion

through the membrane and active outward transport by P-gp.


The chemical structures and the in vivo BBB penetration data for the 16 selected

compounds taken from Platts et al. (2001) are presented in Table 11.1. As described in

Chapter 10 the BBB penetration of a compound was expressed by the (percentage) BB

ratio, with BB = (Cbrain/Cblood)*100 %, where Cbrain and Cblood are the steady-state

concentrations of the compound in the brain and in the blood, respectively.

11.2.2. 3D-(Q)SAR methodology

The Sybyl version 6.8 molecular modelling software (Tripos Inc., St. Louis, MO, USA)

was used to perform the 3D-SAR and QSAR analyses. Molecular mechanical calculations

were performed with the Tripos force field (Powell method, no electrostatics, and 0.05

kcal/mol convergence criterion). The semi-empirical method AM1 (MOPAC version 6.0)

was applied for quantum chemistry calculations, using the key-word XYZ (Cartesian

coordinates to be used in the calculations). Distance constraints were defined by using

the Build/Define/Constrains option of Sybyl. GASP (SYBYL/GASP interface, Tripos Inc.,

St. Louis, MO, USA), CoMFA and CoMSIA were used, as implemented in Sybyl.

The 3D-structures of imipramine and phenothiazine derivatives were built from the

structures of imipramine and trifluoperazine, respectively. Their structures were taken

from conformational optimisation 3D-QSAR studies done by Pajeva and Wiese (1998). In

the study of Pajeva and Wiese (1998) a structural search was done in the Cambridge

Structural Database (CSD) (Cambridge Crystallographic Data Centre, Cambridge, UK).

The CSD reference code CIMPRA of chlorimipramine was used to generate the structure

of imipramine, and the CSD reference code PERAZ of prochlorperazine was used to

generate the structure of trifluoperazine. In order to obtain conformations with minimal

energy, Pajeva and Wiese (1998) further performed a systematic conformational search of

the structures keeping the tricyclic fused ring system as an aggregate, and the final

optimisation was done with AM1.

Once built, the structures of the imipramine and phenothiazine derivatives were

optimised first with molecular mechanics and then with AM1.

In GASP analysis compounds were studied in pairs and the 10 best alignments were

analysed. The 3D-structure of trifluoperazine was used as a template, and the 3D-

distribution of the hydrophobic and H-bonding features of each of the remaining

molecules was compared to it. As the GASP algorithm uses random numbers for

initialisation, all pairs were run more than once to estimate the reproducibility of the

obtained alignments. The default settings in GASP were applied, namely: population size

100; selection pressure 1.1; maximum number of operations 60000; operation increment

6500; fitness increment 0.01; point cross weight 95.0; allele mutation weight 95.0; full

mutation weight 0.0; full crossweight 0.0; internal van der Waals energy coefficient 0.05;

HB weight coefficient 750; van der Waals contact cut-off 0.8.

The choice of the chemical compound used as a template for superimposing the

remaining compounds under investigation is a crucial step in GASP, CoMFA and

CoMSIA analyses. If preliminary knowledge about the 3D-structure of the receptor is missing, the

template is usually selected from the most active compounds in the data set. In this study

trifluoperazine was chosen as a template because it had one of the highest brain

penetration values (logBB value of 1.44, Table 11.1.b). Among the remaining drugs only

fluphenazine had a slightly higher logBB value (1.51, Table 11.1.b), but it has a more

flexible side chain. Additionally, trifluoperazine was shown to be among the most

membrane-active phenothiazines in experiments with artificial membranes (Pajeva et al.,

1996). Also, Ford et al. (1989) found that this compound possessed the highest anti-MDR

potency among all phenothiazines investigated in their study including fluphenazine. This

could indicate a good ability of this compound to bind to P-gp.

CoMSIA and CoMFA analyses were applied to the investigated compounds. The

molecules were superimposed in the 3D-space on the structure of trifluoperazine, using

the similarity points derived by GASP analysis for each molecule (see Table 11.2 of

Section 11.3 “Results”) as fitting points. For molecules having three similarity points

generated (Table 11.2), the compound conformation which gave the highest GASP fit

score was taken. Constraints were applied to the distances between the similarity points.

This conformation was firstly minimised with Tripos force field molecular mechanics,

and then with AM1. For the structures which appeared to have only two similarity points

when aligned on trifluoperazine (carbamazepine, epoxide of carbamazepine, thioridazine,

mesoridazine and sulphoridazine; Table 11.2), an additional third fitting point was selected

for each compound individually. For carbamazepine and its epoxide, the third point was

the nitrogen atom of the fused ring system (see their structures in Table 11.1.a), which

was aligned to N1 of trifluoperazine (Figure 11.1). As a third fitting point for the spatial

alignment of mesoridazine, sulphoridazine and thioridazine, the nitrogen atom from the

piperidine ring (see compound structures in Table 11.1.b) was aligned to N2 of the piperazine

ring of trifluoperazine (Figure 11.1). For these three structures the compound conformation

from the GASP alignment to trifluoperazine that gave the biggest GASP fit score was

taken, and constraints were defined relating to the distance between the two similarity

points. The conformations were then minimised, firstly with molecular mechanics and

then with AM1.

The CoMFA interaction energies were calculated for a sp3 carbon probe atom of +1

charge, using the default grid spacing of 2 Å. CoMSIA similarity indices were also

calculated for a probe atom with +1 parameters, using the same grid box and grid spacing

as in CoMFA. In CoMFA, a column filtering σmin of 2 kcal/mol, energetic field cut-off

values of 30 kcal/mol, no electrostatic interactions at bad steric contacts and a distance-

dependent dielectric constant, were set. In CoMSIA a column filtering σmin of 0.2 and 2.0

kcal/mol, and an attenuation factor of 0.3 were used. The Partial Least Squares (PLS)

leave-one-out cross-validation procedure was applied in both CoMFA and CoMSIA. The

robustness of the best CoMSIA model was additionally evaluated by applying leave-

group-out (5 groups) cross-validation. The standard error of prediction (SEP) and cross-

validated coefficient of determination (Q2) were used to assess the quality of the models.
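The leave-one-out statistics named above can be sketched as follows. This is an illustration only: a one-descriptor least-squares line stands in for the PLS model, the data are hypothetical, and the formula SEP = sqrt(PRESS / (n - c - 1)), with c the number of components, is an assumed convention rather than something stated in this section.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a*x + b (stand-in for the PLS model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx


def loo_statistics(x, y, n_components=1):
    """Return (Q2, SEP) from leave-one-out cross-validation."""
    n = len(y)
    press = 0.0
    for i in range(n):
        # Refit the model without compound i, then predict compound i
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a * x[i] + b)) ** 2
    mean_y = sum(y) / n
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    q2 = 1.0 - press / ss_total
    # Assumed convention: SEP = sqrt(PRESS / (n - c - 1))
    sep = math.sqrt(press / (n - n_components - 1))
    return q2, sep


# Hypothetical descriptor values and activities, for illustration only
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.9, 2.1, 2.9, 4.2, 4.8]
q2, sep = loo_statistics(x, y)
print(round(q2, 3), round(sep, 3))
```

Because each prediction is made by a model that never saw the predicted compound, Q2 is normally lower than the fitted R2 of the same model.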

11.3. Results

The importance of compound lipophilicity and H-bonding ability for BBB transport has

been reported by many authors and demonstrated in Chapter 10 of this thesis. However,

for the investigated set of 16 compounds a statistically significant model based on logPoct

and/or the numbers of H-bond donor and acceptor atoms could not be obtained (the

values of these structural descriptors are presented in Table 10.2, Chapter 10). This might

be due to the suggested P-gp interactions of the compounds.

GASP analysis, CoMFA, and CoMSIA were applied to identify common 3D-structural

characteristics of the investigated compounds related to their mechanism of transport

across the BBB, thought to include passive diffusion and interactions with the P-gp.

GASP analysis was used to identify a pattern of spatial similarity of hydrophobic and H-

bonding features between the compounds. The results from the GASP analysis are

summarised in Table 11.2. The obtained similarity points include the centroids of the two

phenyl rings of the imipramines and phenothiazines (labelled with C1 and C2 in the

structure of trifluoperazine, Figure 11.1) and the nitrogen atom of the side chain (N3 of

trifluoperazine, Figure 11.1). The characteristics of the similarity pattern found for

imipramine from its 10 best alignments to trifluoperazine are given in Table 11.3

(distances and angles between the similarity points).

To further explore the compound similarity pattern, CoMFA and CoMSIA studies were

performed. The CoMSIA and CoMFA models with best statistical fit are presented in

Table 11.4. The best model was obtained with the CoMSIA hydrophobic field (Model 1,

Table 11.4.a, optimal number of components = 3, Q2 = 0.793, SEP = 0.317, R2 = 0.935, s

= 0.177, F = 58.0). The contributions contour map (Coefficients*Standard Deviations) for

the model with the CoMSIA hydrophobic indices (Model 1, Table 11.4.a) is presented in

Figure 11.2. The robustness of this model was evaluated by leave-group-out cross-

validation statistical procedure. The data set of 16 compounds was randomly divided into

5 groups, and the analysis was performed by excluding each group of compounds in turn

and generating a model with the remaining compounds. Table 11.5 presents the results

obtained by repeating the leave-group-out procedure 5 times.
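The random division used in the leave-group-out procedure can be sketched as below. This is not the Sybyl implementation: the function names are hypothetical, and the choice of near-equal group sizes (one group of 4 and four groups of 3 for 16 compounds) is an assumption.

```python
import random


def random_groups(n_items, n_groups, seed=None):
    """Randomly partition indices 0..n_items-1 into n_groups near-equal groups."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    return [sorted(indices[g::n_groups]) for g in range(n_groups)]


def leave_group_out_splits(n_items, n_groups, seed=None):
    """Yield (training_indices, excluded_indices) for each group in turn."""
    for excluded in random_groups(n_items, n_groups, seed):
        held_out = set(excluded)
        training = [i for i in range(n_items) if i not in held_out]
        yield training, excluded


# 16 compounds, 5 groups, repeated 5 times with different random divisions
for repeat in range(5):
    sizes = [len(excluded) for _, excluded in leave_group_out_splits(16, 5, seed=repeat)]
    print(repeat, sizes)
```

In each repeat a model would be fitted to every training set and used to predict the excluded group, with the pooled prediction errors giving one Q2 and one SEP.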

11.4. Discussion

In this study 3D-QSAR approaches were applied to investigate common 3D-structural

characteristics between phenothiazine and imipramine derivatives, related to their similar

mechanism of transport through the BBB. The transport mechanism was assumed to be

passive diffusion through the BBB and active outward transport by P-gp.

11.4.1. Interpretation of the models

The investigated compounds were part of a bigger data set taken from Platts et al. (2001),

for which QSARs were developed in this project and reported in Chapter 10. LogPoct and

the sum of H-bond donor and acceptor atoms (ΣH-bond) appeared in the QSARs (Table

10.6, Chapter 10). For the set of 16 phenothiazine and imipramine derivatives a

statistically significant model based on logPoct and/or numbers of H-bond donors and

acceptors could not be obtained. Fluphenazine and thioridazine were excluded from the

final QSAR model derived in the source paper (Platts et al., 2001), which included

geometrical, electronic and H-bonding structural descriptors (see Chapter 10).

Additionally, as described in Chapter 10, excluding two of these compounds

(mesoridazine and thioridazine) improved the model based on logPoct and ΣH-bond

(Equation 10.3). Thus, it appears that the investigated compounds are not well described

by such QSARs. This agrees with the results of Dearden et al. (2003), who performed

a 2D-QSAR study on P-gp interactions and could not derive significant correlations

between parameters encoding P-gp interactions and descriptors of hydrophobicity and H-

bonding. 3D-QSAR approaches were therefore used to investigate the suggested interactions of

the studied compounds with P-gp.

A similarity pattern of 3D-hydrophobic and H-bond features of the imipramine and

phenothiazine derivatives was obtained by GASP analysis (Table 11.2). From Table 11.2

it can be seen that for 10 compounds three points were found to match the trifluoperazine

3D-distribution of hydrophobic and H-bonding features (amitriptyline, chlorpromazine,

desipramine, desmethyldesipramine, desmonomethylpromazine, fluphenazine,

imipramine, nor-1-chlorpromazine, nor-2-chlorpromazine, promazine), namely the

centroids of the two phenyl rings of the imipramines and phenothiazines (labelled with C1

and C2 in the structure of trifluoperazine, Figure 11.1) and the nitrogen atom of the side

chain (N3 of trifluoperazine, Figure 11.1). Table 11.3 presents the characteristics of the

similarity pattern found for imipramine from its 10 best alignments to trifluoperazine. It

can be seen that the pattern is stable, with only small variations for the different

alignments. Stable similarity patterns were produced also for the phenothiazine

derivatives, which possessed structures more closely related to that of trifluoperazine than

imipramine.

For five compounds (carbamazepine, epoxide of carbamazepine, mesoridazine,

sulphoridazine, thioridazine, Table 11.2) the similarity pattern consisted of only two

points – the centroids of the two phenyl rings (Figure 11.1). An explanation of this result

for carbamazepine and its epoxide is the absence of a side chain and consequently of an H-

bond acceptor atom to be superimposed on that of trifluoperazine.

The imipramine and phenothiazine derivatives with higher logBB values (for some of

them the logBB values are comparable to that of trifluoperazine) possess correspondingly

three similarity points. This suggests a highly similar spatial profile of the hydrophobic

and H-bond properties for these compounds. The results from the GASP analysis

qualitatively agree with the BBB penetration data for the investigated compounds: the

more similar space profiles with more hydrophobic and H-bond points involved, the

higher the drug logBB values. The different similarity pattern of mesoridazine,

sulphoridazine and thioridazine, which involves one point fewer, is consistent with their lower

level of BBB penetration (logBB values of -0.36, 0.18 and 0.24, respectively) compared

to the remaining compounds (Table 11.1.b). Considering the fact that the BB ratio

encodes drug influx and efflux simultaneously, it would be misleading to relate the

obtained similarity patterns to interactions with P-gp only. Naturally, drugs with low rate

of passive diffusion due to their low hydrophobicity are expected to have also low BB

ratios. In addition to the lower number of pharmacophore points due to a missing aliphatic

side chain with tertiary nitrogen, drugs like carbamazepine and its epoxide also possess

low logPoct values (Table 10.2, Chapter 10), which can explain their low brain

penetration.
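The qualitative agreement between the number of similarity points and brain penetration can be checked with a short calculation. The sketch below uses the logBB values of Table 11.1 and the similarity points of Table 11.2; grouping the compounds this way and comparing group means is an illustrative choice, not an analysis reported in the study.

```python
from statistics import mean

# logBB values from Table 11.1 (a and b)
log_bb = {
    "amitriptyline": 0.89, "carbamazepine": 0.00, "desipramine": 1.20,
    "desmethyldesipramine": 1.06, "epoxide of carbamazepine": -0.34,
    "imipramine": 1.06, "chlorpromazine": 1.06,
    "desmonomethylpromazine": 0.59, "fluphenazine": 1.51,
    "mesoridazine": -0.36, "nor-1-chlorpromazine": 1.37,
    "nor-2-chlorpromazine": 0.97, "promazine": 1.23,
    "sulforidazine": 0.18, "thioridazine": 0.24, "trifluoperazine": 1.44,
}

# Compounds with only two similarity points (N3 absent) in Table 11.2
two_point_names = {"carbamazepine", "epoxide of carbamazepine",
                   "mesoridazine", "sulforidazine", "thioridazine"}

three_point = [v for name, v in log_bb.items() if name not in two_point_names]
two_point = [v for name, v in log_bb.items() if name in two_point_names]

print(len(three_point), round(mean(three_point), 2), min(three_point))
print(len(two_point), round(mean(two_point), 2), max(two_point))
```

The lowest logBB among the three-point compounds (the template trifluoperazine included) lies above the highest logBB among the two-point compounds, so the two groups do not overlap.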

CoMFA and CoMSIA studies were performed to investigate further the compound 3D-

features important for BBB transport. A model with very good statistical parameters

(Model 1, Table 11.4.a) was obtained with the CoMSIA hydrophobic indices, whereas no

correlation was found with logPoct (see above). This suggests the importance of the 3D-

distribution of the hydrophobic features of imipramines and phenothiazines for the BBB

penetration, rather than the whole-molecule lipophilicity characterised by logPoct. The 3D-

distribution of the hydrophobic properties could be important for the P-gp interactions.

From Table 11.4.a it can be seen that the model with CoMSIA hydrophobic indices is not

influenced by the value of column filtering σmin (changing σmin from 2.0 kcal/mol into 0.2

kcal/mol did not change the statistical parameters of the model). This is an indication of

the high signal-to-noise ratio for the hydrophobic indices calculated. Additionally, the

model appeared to be robust when assessed with the leave-group-out (5 groups) cross-

validation procedure (Table 11.5). As can be seen from Table 11.5, the model

performance is stable, resulting in similar cross-validation statistics from the different

random divisions of the compounds into groups.

In Figure 11.2, the contributions contour map (Coefficients*Standard Deviations) for the

model with the CoMSIA hydrophobic indices (Model 1, Table 11.4.a) is presented. Three

regions around the molecules were identified, in which the structural variation contributed

mostly to the logBB variance: more hydrophobic substituents in the regions, labelled in

red (letter “R” in the figure) (substituents in the fused ring system, and around the side

chain) and more hydrophilic substituents in the region of side-chain nitrogens (labeled in

blue, letter “B” in the figure) favour BBB penetration. This distribution reflects the

amphiphilic character of the studied compounds, and indirectly supports the obtained 3D-

molecular pattern.

Relatively good models were obtained also with CoMSIA steric and electrostatic indices

(Models 3 and 4, Table 11.4.a). The model with the donor-acceptor indices (Model 5,

Table 11.4.a) had a lower Q2. The model with acceptor indices had poor statistics (Q2 =

0.196, SEP = 0.578) and is not presented in Table 11.4.a.

CoMFA analysis also resulted in models having good predictivity (Models 6, 7, and 8,

Table 11.4.b), with the electrostatic field giving better results than the steric field. The models

with electrostatic field and both fields (Models 7 and 8, Table 11.4.b) had comparable

statistical parameters.


Numerous investigations of interactions of catamphiphilic compounds like phenothiazine

and imipramine derivatives with P-gp have been published (a review of such studies is

given in Wiese and Pajeva, 2001). A similarity pattern of hydrophobic and H-bonding

structural features for a range of chemical compounds, including phenothiazines and

imipramines, was obtained by Pajeva and Wiese (2001). The main new element

introduced by the present study is related to the fact that a similarity pattern of

hydrophobic and H-bonding structural characteristics for the phenothiazines and

imipramines was observed in relation to their transport across the BBB, i.e. the

investigation is extended to a new endpoint.

The present study has emphasised the importance of the 3D-distribution of the

hydrophobicity of phenothiazines and imipramines, which is suggested to govern their

BBB penetration, rather than the whole-molecule lipophilicity characterised by logPoct.

Similar results have been previously published for compound interactions with P-gp.

However, the present study related this observation to penetration of chemicals through

the BBB.

An article based on this work has been published by Lessigiarska et al. (2005a).

11.5. Conclusions

Common 3D-structural characteristics of phenothiazine and imipramine derivatives

were identified that relate to their mechanism of transport across the BBB, involving passive

diffusion and P-gp interactions. It was shown that the compounds with highest BBB

penetration possess a similar specific profile of two clearly defined hydrophobic centres

and one hydrophilic (H-bonding) centre arranged in a particular spatial configuration.

A model with very good statistical parameters was obtained with the CoMSIA

hydrophobic indices, whereas no correlation was found with the whole-molecule descriptor

logPoct. This suggests that hydrophobicity, represented as a space-distributed molecular

property rather than as a logPoct value, is more suitable for modelling the BBB

penetration of phenothiazine and imipramine derivatives.

Figure 11.1. Structure of trifluoperazine. C1 and C2 label the centroids of the phenyl rings.

Nitrogen atoms (N1, N2 and N3) are also labelled

[Structure drawing not reproduced]

Figure 11.2. Contour map (representing Coefficient*Standard Deviation values) for the

model with CoMSIA hydrophobic field (Model 1, Table 11.4.a); “R” labels the “red”

regions (highest positive Coefficient*Standard Deviation values), “B” labels the “blue”

region (lowest negative Coefficient*Standard Deviation values) around the molecules

Table 11.1. Chemical structure and BBB penetration data (presented as logBB values) of

the imipramine and phenothiazine derivatives investigated for their interactions with P-gp

(data taken from Platts et al., 2001). The numeration is done according to Table 10.2,

Chapter 10

a) imipramine derivatives [structure drawings not reproduced]

No    Name                        logBB
33    amitriptyline                0.89
28    carbamazepine                0.00
53    desipramine                  1.20
54    desmethyldesipramine         1.06
29    epoxide of carbamazepine    -0.34
7     imipramine                   1.06

b) phenothiazine derivatives [general structure: phenothiazine ring system carrying the substituents R1, R2 and R3; drawing not reproduced]

No    Name                      R1            R2     R3                                         logBB
47    chlorpromazine            -Cl           -H     -(CH2)3-N(CH3)2                             1.06
57    desmonomethylpromazine    -H            -H     -(CH2)3-NH-CH3                              0.59
69    fluphenazine              -CF3          -H     -(CH2)3-[piperazine ring]-(CH2)2-OH         1.51
86    mesoridazine              -S(=O)CH3     -H     -(CH2)2-[piperidine ring, N-CH3]           -0.36
104   nor-1-chlorpromazine      -H            -Cl    -(CH2)3-NH2                                 1.37
105   nor-2-chlorpromazine      -Cl           -H     -(CH2)3-NH2                                 0.97
120   promazine                 -H            -H     -(CH2)3-N(CH3)2                             1.23
133   sulforidazine             -S(=O)2CH3    -H     -(CH2)2-[piperidine ring, N-CH3]            0.18
138   thioridazine              -S-CH3        -H     -(CH2)2-[piperidine ring, N-CH3]            0.24
146   trifluoperazine           -CF3          -H     -(CH2)3-[piperazine ring]-CH3               1.44

Table 11.2. Similarity points, obtained by GASP analysis. Y encodes the presence of the

corresponding similarity point, N encodes its absence

                            Similarity points
Compound                    C1    C2    N3
amitriptyline               Y     Y     Y
carbamazepine               Y     Y     N
chlorpromazine              Y     Y     Y
desipramine                 Y     Y     Y
desmethyldesipramine        Y     Y     Y
desmonomethylpromazine      Y     Y     Y
epoxide of carbamazepine    Y     Y     N
fluphenazine                Y     Y     Y
imipramine                  Y     Y     Y
mesoridazine                Y     Y     N
nor-1-chlorpromazine        Y     Y     Y
nor-2-chlorpromazine        Y     Y     Y
promazine                   Y     Y     Y
sulforidazine               Y     Y     N
thioridazine                Y     Y     N
trifluoperazine             Y     Y     Y

Table 11.3. Distances and angles between the points of the similarity pattern found from

the 10 best alignments of imipramine to trifluoperazine. C1, C2, and N3 are the

corresponding similarity points (Figure 11.1)

            Distances (Å)                 Angles (degrees)
Values      C1-C2    C1-N3     C2-N3      C2-C1-N3    C1-C2-N3    C1-N3-C2
lowest      4.864    10.361    8.423      53.16       97.59       27.52
highest     4.946    10.386    8.517      54.57       99.32       27.85
average     4.884    10.380    8.454      53.60       98.40       27.61
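The three similarity points form a triangle, so the tabulated angles follow from the tabulated distances by the law of cosines. The sketch below applies this to the average distances of Table 11.3; the function and variable names are hypothetical. Angles computed from averaged distances need not coincide exactly with the averaged angles, so only approximate agreement with the last row should be expected.

```python
import math


def angle_deg(adjacent_1, adjacent_2, opposite):
    """Angle (degrees) enclosed by two sides of a triangle, by the law of cosines."""
    cos_angle = (adjacent_1 ** 2 + adjacent_2 ** 2 - opposite ** 2) / (
        2.0 * adjacent_1 * adjacent_2
    )
    return math.degrees(math.acos(cos_angle))


# Average distances (Å) from Table 11.3
c1_c2, c1_n3, c2_n3 = 4.884, 10.380, 8.454

angle_at_c1 = angle_deg(c1_c2, c1_n3, c2_n3)  # C2-C1-N3
angle_at_c2 = angle_deg(c1_c2, c2_n3, c1_n3)  # C1-C2-N3
angle_at_n3 = angle_deg(c1_n3, c2_n3, c1_c2)  # C1-N3-C2

print(round(angle_at_c1, 2), round(angle_at_c2, 2), round(angle_at_n3, 2))
```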

Table 11.4. CoMSIA and CoMFA models for the 16 imipramine and phenothiazine

derivatives. The best models are shown in bold

(a) CoMSIA models

No    Field             σmin (a)    NC (b)    Q2 (b)    SEP (b)    R2 (b)    s (b)    F (b)
1     hydrophobic       2.0         3         0.793     0.317      0.935     0.177    58.0
2     hydrophobic       0.2         3         0.793     0.317      0.936     0.176    58.5
3     steric            2.0         5         0.689     0.425      0.917     0.213    48.3
4     electrostatic     2.0         3         0.641     0.417      0.917     0.200    44.4
5     donor-acceptor    2.0         1         0.444     0.481      0.674     0.368    29.0

(b) CoMFA models (σmin (a) = 2.0 kcal/mol)

No    Field             NC (b)    Q2 (b)    SEP (b)    R2 (b)    s (b)    F (b)
6     steric            3         0.697     0.384      0.918     0.199    44.9
7     electrostatic     3         0.767     0.336      0.940     0.170    63.1
8     both              3         0.773     0.332      0.955     0.148    84.6

(a) σmin – column filtering (in units kcal/mol)

(b) NC – optimal number of components in the PLS analysis; Q2 – cross-validated coefficient of

determination; SEP – cross-validated standard error of prediction; R2 – regression coefficient of

determination; s – standard error of estimate; F – Fisher statistic
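The error statistics of Model 1 can be cross-checked against its coefficients of determination using only the logBB values of Table 11.1. The sketch below assumes that both standard errors are computed with n - NC - 1 degrees of freedom, which is a common convention but is not stated in the table; the helper name is hypothetical.

```python
import math

# logBB values of the 16 compounds (Table 11.1.a followed by Table 11.1.b)
log_bb = [0.89, 0.00, 1.20, 1.06, -0.34, 1.06,
          1.06, 0.59, 1.51, -0.36, 1.37, 0.97, 1.23, 0.18, 0.24, 1.44]

n = len(log_bb)
mean_y = sum(log_bb) / n
ss_total = sum((y - mean_y) ** 2 for y in log_bb)


def error_from_r2(r2, n_components):
    """Standard error implied by a coefficient of determination.

    Assumes n - n_components - 1 degrees of freedom.
    """
    return math.sqrt((1.0 - r2) * ss_total / (n - n_components - 1))


sep_model_1 = error_from_r2(0.793, 3)  # from Q2 of Model 1
s_model_1 = error_from_r2(0.935, 3)    # from R2 of Model 1
print(round(sep_model_1, 3), round(s_model_1, 3))
```

Under this assumption both values agree with the tabulated SEP and s of Model 1 to within the rounding of the reported Q2 and R2.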

Table 11.5. Results from the leave-group-out (5 groups) cross-validation procedure

performed with the CoMSIA hydrophobic indices*.

NC (a)    Q2 (a)    SEP (a)
2         0.798     0.301
3         0.821     0.294
2         0.813     0.290
3         0.790     0.319
3         0.820     0.295

* number of compounds n = 16; column filtering σmin = 2.0 kcal/mol

(a) NC – optimal number of components in the PLS analysis; Q2 – cross-validated coefficient of

determination; SEP – cross-validated standard error of prediction
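The stability of the five repeats can be summarised with a mean and a spread. The sketch below uses the values of Table 11.5; reporting the sample standard deviation is an illustrative choice and not a statistic given in the study.

```python
from statistics import mean, stdev

# Q2 and SEP from the five leave-group-out repeats (Table 11.5)
q2_values = [0.798, 0.821, 0.813, 0.790, 0.820]
sep_values = [0.301, 0.294, 0.290, 0.319, 0.295]

print(round(mean(q2_values), 3), round(stdev(q2_values), 3))
print(round(mean(sep_values), 3), round(stdev(sep_values), 3))
print(min(q2_values), max(q2_values))
```

All five Q2 values fall within a range of about 0.03, close to the leave-one-out Q2 of 0.793 reported for Model 1 in Table 11.4.a.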

CHAPTER 12

INVESTIGATION OF BACTERIAL TOXICITY

12.1. Objectives

An objective of the project was to investigate cytotoxic and acute toxic effects of

chemicals by means of QSAR analysis. Toxicity to bacterial strains, aquatic

organisms, rodents and humans was investigated. This chapter presents a QSAR analysis

of a large set of data for chemical toxicity to the bacterial strain Sinorhizobium meliloti.

The data set contained chemicals acting by different mechanisms of toxicity, which

permitted investigation of mechanisms of toxic action. Baseline toxicity effects, attributed

to non-polar narcotics, were sought. Mechanism-based QSARs were developed.

Additionally, attempts were made to develop models for compounds belonging to

different chemical classes.

12.2. Methods

12.2.1. Toxicity data

The data set was taken from Botsford (2002). The toxicity test applied uses the bacterium

Sinorhizobium meliloti as an indicator of cell viability, which in its viable state can

readily reduce the thiazole tetrazolium dye MTT to a dark blue compound. The reduction

of the dye is inhibited by toxic chemicals. The toxicity values are expressed as compound

concentrations that cause 50% inhibition of the bacterial growth (IC50).

The original data set contains toxicity measurements for 237 substances. In the present

QSAR study, data for 133 of the 237 substances were used. Compounds identified by

Botsford (2002) as non-toxic, as well as herbal medicines and related substances, and

inorganics, were excluded from this study. Unspecified positional isomers were also

excluded. Glyphosate, neomycin and streptomycin were excluded, because of their very

low calculated logPoct values (logPoct < -4). Thus, assuming the calculated logPoct values

are accurate, the compounds are unlikely to cross the cell membrane by passive diffusion.

The investigated data set is heterogeneous in terms of chemistry and mechanisms of toxic

action.

The compounds included in the study, their mean toxicities (average of several

experiments) in terms of IC50 values, and their assigned mechanisms of toxic action are

presented in Table 12.1. The SMILES codes of the compounds are given in Appendix

B.2. The mechanism of action is encoded by: 1 – non-polar narcotics; 2 – polar narcotics;

3 – weak acid respiratory uncouplers; 4 – formation of free radicals; 5 – electrophilic

interactions; 6 – toxic action by a specific mechanism. The compounds were assigned to a

given mechanism of toxicity according to structural criteria found in the literature

(Hermens, 1990; Lipnick, 1991; Verhaar et al., 1992; Russom et al., 1997; Schultz et al.,

1997; Hansch et al., 2000; Cronin et al., 2002a) and summarised in Table 8.1, Chapter 8.

The number of narcotic compounds identified is 72, of which 46 are non-polar and 26 are

polar narcotics. Five compounds are considered to act by weak acid respiratory

uncoupling; 9 compounds by forming free radicals; and 13 are likely to be

electrophiles/proelectrophiles. Forty-seven compounds act by a specific mechanism.

Some compounds are likely to possess more than one mechanism of toxicity, as noted in

Table 12.1.

12.2.2. Structural Descriptors

The following structural descriptors for the compounds were calculated: logPoct, boiling

point, melting point, vapour pressure, aqueous solubility, the Brown variation of the

Hammett constant σ+, and approximately 250 different quantum-chemical and topological

descriptors. LogPoct was calculated with the KOWWIN version 1.66 software (Syracuse

Research Corporation, Syracuse, NY, USA). Boiling points, melting points and vapour

pressures were calculated with the MPBPWIN version 1.40 software (Syracuse Research

Corporation, Syracuse, NY, USA). Aqueous solubilities were calculated with

WSKOWWIN version 1.40 (Syracuse Research Corporation, Syracuse, NY, USA).

KOWWIN version 1.66, MPBPWIN version 1.40 and WSKOWWIN version 1.40

software were downloaded from the web-site of the Environmental Protection Agency of

the United States (http://www.epa.gov/oppt/exposure/docs/episuitedl.htm). The values of

the Brown variation of the Hammett constant σ+ were calculated with the ACDLabs

version 4.02 software (Advanced Chemistry Development Inc., Toronto, Canada). TSAR

for Windows version 3.3 (Accelrys Inc.) was used to calculate quantum-chemical and

topological descriptors. For quantum-chemical calculations, the VAMP package

(Accelrys Inc.) was used with the AM1 Hamiltonian. QsarIS version 1.1 software (then

SciVision-Academic Press, San Diego, CA; currently MDL®QSAR by Elsevier MDL,

San Leandro, CA) was used to calculate topological descriptors.

12.2.3. Statistical Analysis

Before performing the statistical analysis, the descriptors that had values of zero for more

than 95% of the compounds in a data set were excluded from the descriptor data matrix.

To reduce the number of intercorrelated molecular descriptors, the algorithm developed

by the author was applied to the descriptor data matrices before performing statistical

analysis (see Chapter 9).
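The first pre-processing step, removal of descriptors that are zero for more than 95% of the compounds, can be sketched as follows. The function name, the matrix layout (one row per compound, one column per descriptor) and the example data are hypothetical. The intercorrelation-reduction algorithm of Chapter 9 is not reproduced here.

```python
def drop_sparse_descriptors(matrix, names, max_zero_fraction=0.95):
    """Drop descriptor columns that are zero for more than max_zero_fraction of rows."""
    n_rows = len(matrix)
    keep = [
        j for j in range(len(names))
        if sum(1 for row in matrix if row[j] == 0) / n_rows <= max_zero_fraction
    ]
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [names[j] for j in keep]


# Hypothetical data: 40 compounds, 3 descriptors
names = ["dense", "sparse_kept", "sparse_dropped"]
matrix = []
for i in range(40):
    dense = float(i + 1)                  # never zero
    sparse_kept = 1.0 if i < 4 else 0.0   # zero for 90% of compounds
    sparse_dropped = 1.0 if i == 0 else 0.0  # zero for 97.5% of compounds
    matrix.append([dense, sparse_kept, sparse_dropped])

filtered, kept_names = drop_sparse_descriptors(matrix, names)
print(kept_names)
```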

QSARs were obtained by multiple linear regression, using MINITAB version 14 software

(Minitab Inc., State College, PA, USA). Two approaches were used for selecting

variables to derive models with good statistical diagnostics, and allowing for

physicochemical and/or mechanistic interpretation. Firstly, forward stepwise regression

was applied, selecting up to five variables to include, and excluding variables that lack

physicochemical and/or mechanistic relevance. Secondly, the program in C code

developed by the author, implementing the best-subsets approach (see Chapter 9), was

used. Only those variables that intercorrelated with a coefficient of intercorrelation R less

than 0.7 were included in the same model. The leave-one-out cross-validation statistical

procedure was applied to the QSAR models to generate the cross-validated

coefficient of determination (Q2).
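The intercorrelation criterion can be illustrated by enumerating the descriptor subsets that a best-subsets search would be allowed to consider. This is a sketch in Python, not the author's C program: the descriptor names and values are hypothetical, and the absolute value of Pearson's R is assumed to be the quantity compared with 0.7.

```python
import math
from itertools import combinations


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def admissible_subsets(descriptors, max_size, r_limit=0.7):
    """Yield descriptor-name tuples whose pairwise |R| values are all below r_limit."""
    names = sorted(descriptors)
    for size in range(1, max_size + 1):
        for subset in combinations(names, size):
            if all(abs(pearson_r(descriptors[p], descriptors[q])) < r_limit
                   for p, q in combinations(subset, 2)):
                yield subset


# Hypothetical descriptor columns (one value per compound)
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.1, 3.9, 6.2, 8.1, 9.8],   # nearly collinear with d1
    "d3": [1.0, -1.0, 0.5, -0.5, 0.0],
}
subsets = list(admissible_subsets(descriptors, max_size=2))
print(subsets)
```

Each admissible subset would then be fitted by multiple linear regression and ranked by its statistics; the pair (d1, d2) is never fitted because its members are almost perfectly correlated.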

Partial least squares (PLS) analysis was performed by using MINITAB version 14

software.

12.3. Results

The abbreviations used for the molecular descriptors are presented in Table 12.2. The

values of the descriptors, included in the selected QSARs, are presented in Table 12.3.

A plot of toxicity against logPoct for all compounds investigated shows a baseline effect

(Figure 12.1.a). Of the 133 compounds, 46 meet the structural criteria for compounds that

may act by the non-polar narcosis mechanism of toxic action. Within this sub-group, 18

compounds (acetone, chloroquine, DCPA, 1,3-dichlorobenzene, dimethyl sulphoxide,

ethanol, ethylene glycol, hexachlorobenzene, methanol, methylphenidate, napropamide,

naproxen, octane, orphenadrine, 1,10-phenanthroline, propan-2-ol, tert-butyl methyl

ether, tetrachloromethane) could be identified as falling on the baseline, determined by

lipophilicity. The relationship giving the baseline toxic effect is defined by the following

equation, which was based on the sub-group of 18 compounds:

log(1/IC50) = 0.667 (± 0.041) logPoct – 2.79 (± 0.12) (12.1)

n = 18, R2 = 0.942, s = 0.369, F = 261
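Equation 12.1 can be used to express the toxicity of any compound relative to the baseline. The sketch below evaluates the equation and the difference between an observed and a baseline log(1/IC50); the function names and the example compound are hypothetical, and the coefficient uncertainties are ignored.

```python
def baseline_log_inv_ic50(log_p_oct):
    """Baseline toxicity from Equation 12.1: log(1/IC50) = 0.667 logPoct - 2.79."""
    return 0.667 * log_p_oct - 2.79


def excess_toxicity(observed_log_inv_ic50, log_p_oct):
    """Observed minus baseline log(1/IC50); positive values lie above the baseline."""
    return observed_log_inv_ic50 - baseline_log_inv_ic50(log_p_oct)


# Hypothetical compound: logPoct = 3.0, observed log(1/IC50) = 0.5
print(round(baseline_log_inv_ic50(3.0), 3))
print(round(excess_toxicity(0.5, 3.0), 3))
```

A positive excess indicates that the compound is more toxic than its lipophilicity alone would predict.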

To determine factors other than logPoct which could enhance the toxicity of the remaining

compounds above that determined by the baseline alone, additional models were

developed by including other molecular descriptors in combination with logPoct. The

QSAR equations for the groups of compounds acting by different toxic mechanisms are

presented in Table 12.4.

To investigate the possibility of developing chemical class-specific models, QSARs were

developed for the aromatic and aliphatic compounds separately. QSARs for the aromatic

compounds could be derived, but the statistical quality of these models was poor, and the

descriptors selected were not meaningful. The QSAR developed for the aliphatic

compounds is presented in Table 12.4, Equation 7.

PLS analysis was applied to the whole data set to investigate further the multivariable

data space. Firstly, the whole set of 104 independent variables was included in the PLS

analysis. The model obtained had 4 significant components (selected on the basis of best

cross-validated coefficient of determination, Q2), and R2 and Q2 values of 0.598 and

0.347, respectively. Secondly, following the approach of Cronin et al. (2002a), the four

variables found to be most significant in the regression analysis (logPoct, LUMO, NOH,

and NNH2, Equation 6, Table 12.4) were used to derive a PLS model. The model selected

had 3 components, R2 = 0.541 and Q2 = 0.464.

12.4. Discussion

In this study a data set of 133 compounds was used to develop QSARs for toxicity to the

bacterium S. meliloti. In general, several approaches could be used for developing QSARs

from a large data set for a given endpoint. The compounds in the data set could be divided

into groups according to their mechanism of action (mechanism-based approach) or their

chemical structure, with QSARs being developed for each group. A global QSAR for all

compounds in the data set could also be developed. These approaches were tried in the

present study.

12.4.1. Baseline toxicity effect

A baseline toxic effect, determined by lipophilicity, was identified for these compounds

(Equation 12.1). The position of the baseline on plots of toxicity against logPoct is

presented in Figure 12.1.a for all investigated compounds regardless of the mechanism of

toxicity, and in Figure 12.1.b for the non-polar narcotics. From Figure 12.1.b it can be

seen that the toxicity of the remaining 28 non-polar narcotics, not included in Equation

12.1, is greater than that determined by the baseline. This disagrees with some literature

reports (Könemann, 1981; Hansch et al., 1989; Schultz et al., 1990; Verhaar et al., 1992)

stating that the toxicity of the non-polar narcotics is dependent mainly on their logPoct

values. This contradiction could be due to variability in the Botsford (2002) IC50 values.

In the Botsford study, the IC50 values were obtained by investigating several samples for

the same chemical compound, and the number of samples for the different compounds

varied from 1 to 24. The coefficient of variation obtained (the standard deviation divided

by the mean) for the IC50 values varies widely from 0.25 up to 70. The high values of the

coefficient of variation suggest some uncertainty in the IC50 values determined by the

method of Botsford (2002). Additionally, all the chemicals were tested as received from

the suppliers (e.g. in the form of pesticide and pharmaceutical products), without attempts

to purify them from inactive compounds added to make the chemicals marketable.
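The coefficient of variation referred to above is the standard deviation of replicate measurements divided by their mean. A minimal sketch follows; the replicate values are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the source does not say which form was used.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (needs at least two values)."""
    if len(values) < 2:
        raise ValueError("at least two replicate values are required")
    return stdev(values) / mean(values)


# Hypothetical replicate IC50 measurements for one compound
replicates = [10.0, 12.0, 8.0, 11.0, 9.0]
cv = coefficient_of_variation(replicates)
print(round(cv, 3))
```

A compound tested only once has no defined coefficient of variation, which is why the helper rejects single values.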

The baseline presented by Equation 12.1 is developed in an arbitrary way by the author

and its position in Figure 12.1.b results in the placing of non-polar narcotic compounds

both above and below the baseline, in accordance with the suggested data variability.

12.4.2. QSAR models

When the non-polar narcotics were investigated alone, logPoct was found to explain only

67% of the variance in the toxicity (Equation 1, Table 12.4), possibly due to the suggested

data variability. The QSAR developed for the toxicity of the polar narcotics indicates that

the specific polarisability (SpcP), the largest negative charge over the atoms in a

compound (Qmaxn), and the number of hydroxyl groups (NOH) are important determinants

of toxicity (Equation 2, Table 12.4). Interestingly, it did not include logPoct (this result is

unlikely to be due to lack of logPoct variance, because the values of this descriptor varied


between –0.91 (sulphosalicylic acid) and 6.92 (hexachlorophene)). From the QSAR, it

can be seen that increasing the ability of the molecule to be polarised under an external

electric field (possibly in cell membranes) is associated with an increase in toxicity. The

value of Qmaxn could account for participation of the compound in electrostatic

interactions at the site of action, or for H-bonding effects, and its increased absolute value

is associated with increasing toxicity (Qmaxn has negative values). From Equation 2, Table

12.4 it can be seen that increasing the number of hydroxyl groups in the molecule is

associated with increasing toxicity. This property can be also related to H-bonding

activity.
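
As a worked illustration of Equation 2 (Table 12.4), the sketch below evaluates the model for two polar narcotics, using the rounded coefficients from Table 12.4, the descriptor values from Table 12.3 and the observed toxicities from Table 12.1. The function and variable names are illustrative only, and the predictions are approximate because the published coefficients are rounded.

```python
def predict_polar_narcotic(spc_p, q_max_n, n_oh):
    """Equation 2, Table 12.4: predicted log(1/IC50) for polar narcotics."""
    return 80.0 * spc_p - 9.71 * q_max_n + 0.399 * n_oh - 6.37


# name: (SpcP, Qmaxn, NOH, observed log(1/IC50))
POLAR_NARCOTICS = {
    "phenol": (0.0239, -0.360, 1, -1.111),
    "3,5-dichlorophenol": (0.0499, -0.390, 1, 2.398),
}

predictions = {}
for name, (spc_p, q_max_n, n_oh, observed) in POLAR_NARCOTICS.items():
    predictions[name] = predict_polar_narcotic(spc_p, q_max_n, n_oh)
    print(f"{name}: predicted {predictions[name]:.2f}, observed {observed:.2f}")
```

The higher specific polarisability and the more negative Qmaxn of 3,5-dichlorophenol both raise its predicted toxicity relative to phenol, in line with the signs of the coefficients discussed above.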

The group of weak acid respiratory uncouplers is represented only by 5 compounds.

Unfortunately this is statistically insufficient to investigate all the factors that determine

the interaction with the mitochondrial membrane. The observed relationship with logPoct

(Equation 3, Table 12.4) could be due to membrane disruption or could reflect the ability

of compounds to pass across the cell membrane in order to reach the mitochondria,

although the limited number of compounds in the training set should be borne in mind.

Relationships between toxicity of respiratory uncouplers and logPoct have been previously

reported in the literature by Russom et al. (1997), and Schultz and Cronin (1997).

Equation 3 has a higher intercept (not significantly different from zero) than Equation 12.1,

which defines the baseline toxicity; thus, as expected, the toxicity of the weak acid

respiratory uncouplers is greater than that of the non-polar narcotics.
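
The fit and cross-validation statistics of such a small model can be examined with a few lines of code. The sketch below refits a one-descriptor regression to the five compounds coded as mechanism 3 in Table 12.1 (logPoct from Table 12.3) and computes a leave-one-out Q2. The definition of Q2 used here (1 minus PRESS divided by the total sum of squares about the full-data mean) is an assumption, since this section does not state the formula; under that assumption the results come out close to the values reported for Equation 3.

```python
# name: (logPoct from Table 12.3, log(1/IC50) from Table 12.1)
UNCOUPLERS = {
    "bromoxynil": (3.39, 1.027),
    "2,6-dinitro-4-cresol": (2.27, 0.772),
    "2,4-dinitrophenol": (1.73, 0.742),
    "hexachlorophene": (6.92, 3.721),
    "pentachlorophenol": (4.74, 3.268),
}


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y):
    """Fraction of the variance in y explained by the fitted line."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def q_squared_loo(x, y):
    """Leave-one-out Q2: refit without each compound, then predict it."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_tot


log_p = [values[0] for values in UNCOUPLERS.values()]
toxicity = [values[1] for values in UNCOUPLERS.values()]
slope, intercept = fit_line(log_p, toxicity)
r2 = r_squared(log_p, toxicity)
q2 = q_squared_loo(log_p, toxicity)
print(f"slope {slope:.3f}, intercept {intercept:.3f}, R2 {r2:.3f}, Q2 {q2:.3f}")
```

With only five compounds, each left-out point changes the fitted line noticeably, which is why Q2 falls well below R2.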

The toxicity of the compounds forming free radicals was correlated with the molecular

dipole moment (µ) and the total molecular energy (Etot) (Equation 4, Table 12.4). These

parameters could be associated with the formation and/or stability of free radicals. It

should be noted, however, that with the exception of catechol, these compounds could

also act by the polar narcosis mechanism. The low value of 0.391 for Q2 suggests that the

model is very unstable with regard to changes of compounds included. Furthermore, since

the equation has 2 independent variables and only 9 data points, its quality can be

questioned on the basis of the low ratio of data points to independent variables (fewer

than 5 data points for 1 variable).

Hansch and Gao (1997) reviewed the literature on free radical reactions. They found that

in 27 QSAR equations describing endpoints in which formation of free radicals involves

the abstraction of a hydrogen radical (H•) from a phenolic hydroxyl group, 25 equations were

correlated with negative values of the Brown variation of the Hammett substituent


constant σ+, thus indicating that electron-releasing substituents, i.e. increasing electron

density in the benzene ring, favour the investigated endpoints. The Brown variation of the

Hammett constant σ+ takes into account the contribution of substituents through

conjugation to electron-deficient reactivity sites attached to a benzene ring. π-Electron-

releasing and π-electron-withdrawing groups are associated with negative and positive

values of σ+, respectively (Morao and Hillier, 2001). In the present study, σ+ values were

calculated for the eight phenols identified as forming free radicals (ACDLabs software

could not calculate the σ+ value for 2-dianisidine). No statistically significant correlations

between the toxicity and the σ+ values were found for these compounds.

No meaningful QSARs could be derived for the electrophilic compounds. It is possible

that the electrophilic interactions involved were too diverse to be accommodated in a

simple QSAR for all electrophilic compounds. Also, structural descriptors other than

those used in the present study could determine the toxicity of these compounds. As

concluded by Schultz et al. (1998), data and descriptor limitations exist for QSAR

modelling for electrophilic compounds.

The model obtained for compounds acting by specific mechanisms described only 57% of

the variation in compound toxicity (Equation 5, Table 12.4). Greater aqueous solubilities

(logSaq), higher absolute values of the largest negative charge over an atom in the

molecule (Qmaxn), lower values of the difference first order connectivity index (d1χ), and a

smaller number of amino groups in the molecule (NNH2) appeared to decrease toxicity.

d1χ encodes information regarding the degree of skeletal branching, in terms of

decreasing values with increased branching, and tends to be insensitive to molecular size.

The positive coefficient of d1χ shows that increased branching is associated with lower

toxicity. In the case of compounds acting by specific mechanisms, it is possible that this

descriptor reflects effects of shielding of reactive centres in branched compounds,

resulting in lower toxicity. The model for specifically-acting compounds (Equation 5,

Table 12.4) is of low quality, having a low Q2 value and at least one outlier with

considerable leverage (isoproterenol). This may be due to the fact that the compounds

brought together in this model may elicit their toxic effects by interacting with different

biological macromolecules, and thus different factors may determine the toxic response to

these compounds. In such a case, it would be more appropriate to develop separate QSAR

models for separate specific mechanisms of action, if sufficient data were available.


A single, global model for S. meliloti toxicity, covering all mechanisms of action, was

developed (Equation 6, Table 12.4). The model includes logPoct, the energy of the lowest

unoccupied molecular orbital (LUMO), the number of hydroxyl groups (NOH), and the

number of amino groups (NNH2). Thus, descriptors encoding compound biouptake,

distribution and/or membrane interactions (logPoct) and electrophilic reactivity (LUMO)

appeared to influence toxicity. The number of hydroxyl and amino groups might encode

particular compound reactivity or H-bonding effects. However, the statistical fit of the

model was not high, possibly due to the diversity of the included compounds and their

mechanisms of toxic action.

In an attempt to obtain additional information from the variable space, PLS regression

was applied to the whole data set. The PLS models obtained had statistical fits comparable

to that of the QSAR based on regression analysis (Equation 6, Table 12.4), but they

lacked the mechanistic interpretability of the regression model (Cronin et al., 2002a).

The QSAR developed for the aliphatic compounds had reasonable statistical parameters

(Equation 7, Table 12.4). It included compounds acting by non-polar narcosis,

electrophilic reactivity, and specific mechanisms. Thus, the diversity of mechanisms

covered by this model was less than in the case of the model for all compounds.

Correspondingly, the model had better statistical fit. It included descriptors encoding

compound biouptake, distribution and/or membrane interactions (logPoct) and

electrophilic reactivity (LUMO). Additional descriptors included were MSA, which might

reflect the larger size of the compounds acting by specific mechanisms included in the

model, and NNH2, which probably encodes particular types of interactions/reactivity. In

Figure 12.2 a plot of predicted values against observed toxicity values is presented for this

model. It can be seen that the predicted toxicity values of some compounds (acetone,

ethanolamine, clindamycin) have high residuals (greater than 1.3 log units). The

coefficients of variation for acetone and ethanolamine reported by Botsford (2002) were

24 and 2.65, respectively (from 18 and 2 samples), which resulted in standard deviations

of the determined toxicity values of 4.71 and 0.85 log units, respectively. The standard

deviation for acetone is higher than the predicted residual from

the model. The toxicity assessment of small aliphatic molecules is often difficult to

perform accurately due to their volatility. The toxicity of clindamycin was determined

with one sample only (no coefficient of variation was reported by Botsford, 2002).
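
As a rough check on the residuals discussed above, Equation 7 (Table 12.4) can be evaluated for individual compounds. The sketch below uses the rounded coefficients from Table 12.4, descriptor values from Table 12.3 and observed toxicities from Table 12.1; the names are illustrative, and the results are only approximate because the published coefficients are rounded.

```python
def predict_aliphatic(log_p_oct, lumo, msa, n_nh2):
    """Equation 7, Table 12.4: predicted log(1/IC50) for aliphatic compounds."""
    return (0.542 * log_p_oct - 0.305 * lumo
            + 0.00245 * msa + 1.73 * n_nh2 - 1.80)


# name: ((logPoct, LUMO, MSA, NNH2), observed log(1/IC50))
ALIPHATICS = {
    "acetone": ((-0.24, 0.844, 96.8, 0), -3.326),
    "clindamycin": ((2.90, 1.372, 421.4, 0), -0.929),
}

# Residual defined here as observed minus predicted.
residuals = {
    name: observed - predict_aliphatic(*descriptors)
    for name, (descriptors, observed) in ALIPHATICS.items()
}
for name, residual in residuals.items():
    print(f"{name}: residual {residual:.2f} log units")
```

For both compounds the observed toxicity is lower than the predicted value by more than 1.3 log units.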



A new data set was investigated in which the toxicity of compounds was obtained in the

same laboratory, according to the same experimental protocol, suggesting good quality of

the obtained results. However, high experimental variability was reported by Botsford

(2002) for some compounds, which could be a reason for the moderate statistical fit of the

QSARs derived in this study.

The study confirmed the presence of a baseline toxicity effect attributed to the non-polar

narcotics. In contrast to literature reports (see Chapter 8) logPoct was not among the

descriptors that best described the toxicity of the polar narcotics. The descriptors that

were related to the toxicity of these compounds accounted for polarisability, electrostatic

properties and H-bonding interactions. However, these properties play a role in molecular

partitioning between two phases, thus, they may also reflect the logPoct values.

A QSAR model for all investigated compounds was derived; however, it was based on

diverse compounds acting by different mechanisms, and thus its statistical fit was not

high. A model with good statistical fit was obtained for the aliphatic compounds,

including descriptors related to compound lipophilicity, reactivity, size and particular

types of interactions/reactivity attributed to the amino group. This result suggests that QSAR

models including compounds acting by different mechanisms might be successful if

descriptors reflecting the different mechanisms are included.

An article based on this work has been published by Lessigiarska et al. (2004a).

12.5. Conclusions

A baseline effect (related to non-polar narcosis) for the toxicity to the bacterium

Sinorhizobium meliloti was shown. However, some non-polar narcotics had toxicities

different from those defined by the baseline. This might be due to high experimental data

variability for some compounds. The toxicity of polar narcotics might be determined by

their polarisability, H-bonding ability, and electrostatic properties. The study

demonstrated that modelling of toxicity of compounds forming free radicals and

electrophilic compounds might be limited by data and descriptor availability. The toxicity

of compounds acting by more specific mechanisms of toxic action was also difficult to

predict with the structural descriptors included in the study.


Toxicity of the aliphatic compounds could be modelled successfully despite the fact that

they act by different mechanisms of toxic action (non-polar narcosis, electrophilic

reactivity, and specific mechanisms). The descriptors included in the QSAR reflect the

mechanisms of action of the compounds, namely compound biouptake and/or membrane

interactions (logPoct), electrophilic reactivity (LUMO), possibly the greater size of the

compounds acting by specific mechanisms (MSA) and particular types of

interactions/reactivity attributed to the amino group (NNH2). However, increasing the diversity

of compounds and their mechanisms of action by also adding aromatic compounds to the

training set resulted in a model with low statistical fit.


Figure 12.1. Plot of toxicity to Sinorhizobium meliloti against logPoct, showing a baseline

toxicity effect (solid line, Equation 12.1). The compounds represented by triangles were

used to define the baseline

a) all investigated compounds regardless of the mechanism of action

b) non-polar narcotics

[Plots not reproduced. Both panels: x-axis logPoct; y-axis log(1/IC50).]


Figure 12.2. Plot of observed against predicted toxicity to Sinorhizobium meliloti for the

QSAR model for aliphatic compounds (Equation 7, Table 12.4). Chemicals with large

residuals are named

[Plot not reproduced. x-axis: predicted values; y-axis: observed values; both as log(1/IC50). Named points: clindamycin, acetone, gentamycin, ethanolamine.]


Table 12.1. Investigated compounds, their acute toxicity to S. meliloti, and mechanisms of

toxic action

No  Name  log(1/IC50) b  Mechanism of action c

1  acetaminophen  -0.217  2
2  acetic acid  -0.854  1
3  acetone  -3.326  1
4  acetylsalicylic acid  -0.394  2
5  alachlor  0.386  1
6  4-aminobenzaldehyde  -0.111  5
7  amitriptyline  1.824  6
8  amoxicillin a  -1.147  6
9  atenolol  0.636  6
10  atropine  0.180  6
11  bensulide  0.750  6
12  benzene  0.0862  1
13  benzyl chloride  0.319  5
14  bipyridyl  0.00922  1
15  bromoxynil  1.027  3
16  butylamine  -0.568  1
17  caffeine  -1.280  6
18  catechol  0.337  4, 5
19  celebrex  1.319  6
20  chloramphenicol  -0.636  5, 6
21  chlorobenzene  0.674  1
22  3-chlorophenol  1.602  2
23  4-chlorophenol  1.036  2
24  chloroquine  -0.142  1, 6
25  clindamycin  -0.929  6
26  clomazone  0.991  6
27  codeine  -0.204  6
28  2-cresol  -0.246  2
29  3-cresol  0.0410  2
30  4-cresol  -0.553  2
31  cyanazine  -0.00860  6
32  cyclohexylamine  -0.373  1
33  DCPA (chlorthal-dimethyl)  -0.109  1
34  dexedrine  0.507  2
35  2-dianisidine  -0.236  2, 4
36  diazepam (valium)  -0.504  6
37  diazinon  0.936  6
38  dibromoethane  -0.0969  1
39  1,2-dichlorobenzene  0.588  1
40  1,3-dichlorobenzene  -0.449  1
41  1,4-dichlorobenzene  0.451  1
42  1,2-dichloroethane  -0.179  1
43  dichloromethane  -0.616  1
44  2,4-dichlorophenol  0.544  2
45  3,4-dichlorophenol  1.538  2
46  3,5-dichlorophenol  2.398  2
47  2,4-dichlorophenoxyacetic acid  -0.0334  2
48  diethylamine  -0.423  1

49  digoxin  0.635  6
50  dimethyl sulphoxide  -3.050  1
51  4-(dimethylamino)benzaldehyde  -0.0755  5
52  2,3-dimethylphenol  0.00833  2, 4
53  2,4-dimethylphenol  0.234  2, 4
54  1,2-dinitrobenzene  -0.343  5
55  1,4-dinitrobenzene  -0.270  5
56  2,6-dinitro-4-cresol  0.772  3
57  2,4-dinitrophenol  0.742  3
58  diuron  0.429  1
59  emetine  0.595  6
60  eptam (EPTC)  0.565  1
61  ethanol  -3.216  1
62  ethanolamine  -0.613  5
63  ethyl acetate  -1.097  1
64  ethylbenzene  0.175  1
65  ethylene glycol  -3.535  1
66  FCCP [carbonyl cyanide 4-(trifluoromethoxy)-phenylhydrazone]  3.000  6
67  fluvoxamine  0.167  6
68  furosemide  1.244  6
69  gentamycin  2.854  6
70  hexachlorobenzene  0.793  1
71  hexachlorophene  3.721  2, 3
72  ibuprofen  0.235  2
73  imazapyr  0.126  6
74  indole  -0.631  1
75  isoproterenol  -3.248  6
76  isoxaben  0.0297  6
77  lindane  0.851  1
78  malathion  0.951  6
79  methanol  -3.328  1
80  3-methyl-4-nitrophenol  0.446  5
81  methylphenidate (ritalin)  -0.369  1
82  morphine  0.0825  6
83  1-naphthol  0.409  1
84  naproamide  -0.0273  1
85  naproxen  -0.109  1
86  nicosulfuron  0.169  6
87  nicotine  -0.785  6
88  2-nitrophenol  0.569  5
89  norflurazon  0.223  6
90  novobiocin  -0.512  6
91  octane  -0.149  1
92  orphenadrine  -0.244  1
93  orudis  0.349  1
94  oxadiazon  0.0391  6
95  paraquat  0.590  6
96  pentachlorophenol  3.268  3
97  pentanol  -1.265  1
98  1,10-phenanthroline  -1.363  1

99  phenol  -1.111  2
100  primsulfuron  0.278  6
101  propan-2-ol  -2.977  1
102  propranolol  0.395  1, 6
103  pseudocumene  0.509  1
104  quinidine  0.375  6
105  quinine  0.394  6
106  resorcinol  0.670  2, 4
107  salicylic acid  0.0410  2
108  scopolamine  0.393  6
109  sethoxydim  2.097  6
110  sulphosalicylic acid  -0.705  2
111  tert-butyl methyl ether  -2.173  1
112  tetrachloroethene  0.260  5
113  tetrachloromethane  -1.025  1
114  tetracycline  -0.145  6
115  theophylline  -1.316  6
116  thiazopyr  0.921  6
117  thifensulfuron  -0.514  6
118  thioridazine  1.987  6
119  toluene  -0.373  1
120  4-toluidine  -1.143  2
121  1,2,3-trichlorobenzene  1.119  1
122  1,1,1-trichloroethane  0.182  1
123  trichloroethene  0.298  5
124  trichloromethane  -0.723  1
125  trifluralin  1.009  6
126  2,4,6-trihydroxytoluene  0.437  2, 4
127  2,3,6-trimethylphenol  -0.0561  2, 4
128  2,4,6-trimethylphenol  0.0329  2, 4
129  3,4,5-trimethylphenol  0.428  2, 4
130  2,4,6-trinitrotoluene  0.668  5
131  verapamil  0.690  6
132  warfarin  -0.250  6
133  zanaflex  -0.0785  6

a mean value from the values for amoxicillin (USA), amoxicillin (Cyprus), amoxicillin (Jordan), and amoxicillin (Mexico)

b toxicity data are presented in terms of IC50 values, in units of mmol/l

c the mechanism of action is encoded by: 1 – non-polar narcotics; 2 – polar narcotics; 3 – weak acid respiratory uncouplers; 4 – formation of free radicals; 5 – electrophile/proelectrophile compounds; 6 – toxic action by specific mechanisms


Table 12.2. Physicochemical descriptors used in the selected QSARs for

toxicity/cytotoxicity

Symbol Name

α molecular polarisability

0χA average molecular connectivity index of zero order

2χ connectivity index of second order

2χv valence connectivity index of second order

3χpv valence path connectivity index of third order

BLI Kier benzene-likeliness index

d1χ difference connectivity index of first order

Etot total energy of the molecule

HOMO energy of the highest occupied molecular orbital

Hy hydrophilic factor

LmH difference between the energies of LUMO and HOMO

logPoct calculated logarithm of the octanol-water partition coefficient

logPoctM measured logarithm of the octanol-water partition coefficient

logSaq calculated logarithm of the aqueous solubility (units mol/l)

logSaqM measured logarithm of the aqueous solubility (units mol/l)

LUMO energy of the lowest unoccupied molecular orbital

µ magnitude of the molecular dipole moment

MM molecular mass

MSA molecular surface area

NC#N number of cyano groups

NPh number of phenyl rings

NH-donors number of H-bond donor atoms

NNH2 number of amino groups

Nrings6 number of 6-membered rings

NO number of O atoms

NOH number of hydroxyl groups

PW2 path/walk Randić shape index of second order

Qmaxn largest negative charge over the atoms in a compound

ΣE-state sum of the E-state indices of all atoms in the molecule

ΣH-bond sum of the H-bond donor and acceptor atoms

SpcP specific polarisability of a molecule equal to polarisability/volume

TIE E-state topological parameter


Table 12.3. Calculated values for the molecular descriptors included in the selected QSARs for acute toxicity to S. meliloti

Software used: logPoct (KOWWIN); logSaq (WSKOWWIN); Etot, LUMO, NNH2, NOH (TSAR); d1χ, µ, Qmaxn, SpcP, MSA (QsarIS)

No  Name  logPoct  logSaq  Etot a  LUMO  NNH2  NOH  d1χ  µ  Qmaxn  SpcP  MSA

1  acetaminophen  0.27  -0.70  -25.72  0.253  0  1  -0.233  5.48  -0.423  0.0241  193.6
2  acetic acid  0.09  0.90  -28.49  0.974  0  1  -0.182  2.35  -0.327  0.0279  83.0
3  acetone  -0.24  0.58  -11.17  0.844  0  0  -0.182  2.54  -0.364  0.0359  96.8
4  acetylsalicylic acid  1.13  -1.53  -25.07  -0.532  0  1  -0.305  7.19  -0.399  0.0197  204.1
5  alachlor  3.37  -4.17  5.31  0.039  0  0  -0.226  4.79  -0.422  0.0419  296.4
6  4-aminobenzaldehyde  0.79  -0.76  -8.73  -0.234  1  0  -0.089  6.86  -0.448  0.0223  160.5
7  amitriptyline  4.95  -5.53  18.14  0.079  0  0  -0.160  0.32  -0.363  0.0333  325.1
8  amoxicillin  0.97  -2.03  32.51  -0.023  1  2  -0.783  6.58  -0.421  0.0291  325.2
9  atenolol  -0.03  -2.59  -20.87  0.000  1  1  -0.445  6.80  -0.423  0.0332  330.6
10  atropine  1.91  -1.87  14.98  0.150  0  1  -0.228  1.75  -0.393  0.0396  285.6
11  bensulide  4.12  -5.69  -23.00  -2.414  0  0  -0.769  6.40  -0.320  0.0340  399.7
12  benzene  1.99  -1.59  5.39  0.554  0  0  0.086  0.00  -0.062  0.0261  117.9
13  benzyl chloride  2.79  -2.09  5.30  0.075  0  0  0.018  1.27  -0.120  0.0436  150.4
14  bipyridyl  1.38  -1.62  9.90  -0.537  0  0  0.052  0.02  -0.305  0.0195  199.8
15  bromoxynil  3.39  -3.80  10.79  -0.888  0  1  -0.267  1.19  -0.398  0.0456  193.3
16  butylamine  0.83  0.44  3.62  3.529  1  0  0.000  0.53  -0.330  0.0468  132.9
17  caffeine  0.16  -1.87  -2.78  -0.349  0  0  -0.378  4.79  -0.404  0.0240  202.9
18  catechol  1.03  -0.18  3.94  0.297  0  2  -0.110  5.51  -0.393  0.0229  132.5
19  celebrex  3.47  -4.95  19.78  -1.198  1  0  -0.861  7.36  -0.321  0.0242  332.8
20  chloramphenicol  0.92  -2.92  20.44  -1.203  0  2  -0.552  7.33  -0.421  0.0423  285.9
21  chlorobenzene  2.64  -2.45  3.78  0.155  0  0  -0.020  1.61  -0.084  0.0405  138.2
22  3-chlorophenol  2.16  -1.70  -4.29  0.019  0  1  -0.127  2.91  -0.394  0.0377  146.1
23  4-chlorophenol  2.16  -1.60  -3.80  0.095  0  1  -0.127  2.33  -0.394  0.0379  149.0
24  chloroquine  4.50  -4.48  -12.30  -0.454  0  0  -0.280  7.37  -0.358  0.0420  359.1
25  clindamycin  2.90  -4.65  15.22  1.372  0  3  -0.709  6.11  -0.422  0.0462  421.4
26  clomazone  2.86  -3.08  - b  - b  0  0  -0.409  0.63  -0.391  0.0470  212.0
27  codeine  1.28  -1.39  38.46  0.219  0  1  -0.250  4.81  -0.386  0.0459  211.1

28  2-cresol  2.06  -1.08  0.12  0.408  0  1  -0.110  2.13  -0.394  0.0277  145.1
29  3-cresol  2.06  -1.09  -3.97  0.395  0  1  -0.127  3.69  -0.394  0.0281  144.0
30  4-cresol  2.06  -1.07  -4.26  0.430  0  1  -0.127  4.13  -0.394  0.0277  147.7
31  cyanazine  2.51  -3.12  -64.41  -0.210  0  0  -0.450  3.48  -0.258  0.0388  240.9
32  cyclohexylamine  1.63  -0.19  -2.26  3.461  1  0  -0.020  0.82  -0.327  0.0447  148.7
33  DCPA (chlorthal-dimethyl)  4.24  -5.28  4.37  -1.314  0  0  -0.553  5.92  -0.387  0.0521  275.8
34  dexedrine  1.76  -0.68  -3.25  0.659  1  0  -0.127  0.37  -0.327  0.0353  185.8
35  2-dianisidine  2.08  -2.53  5.53  0.020  2  0  -0.263  1.00  -0.385  0.0262  291.2
36  diazepam (valium)  2.70  -3.69  31.93  -0.633  0  0  -0.250  4.98  -0.422  0.0330  270.0
37  diazinon  4.73  -5.45  -25.25  -1.200  0  0  -0.516  3.09  -0.264  0.0440  298.5
38  dibromoethane  2.01  -2.25  -1.36  0.001  0  0  0.000  0.00  -0.091  0.0819  130.3
39  1,2-dichlorobenzene  3.28  -3.20  6.11  -0.142  0  0  -0.110  3.48  -0.130  0.0532  150.2
40  1,3-dichlorobenzene  3.28  -3.29  -0.34  -0.158  0  0  -0.127  1.96  -0.131  0.0520  153.3
41  1,4-dichlorobenzene  3.28  -3.21  0.66  -0.216  0  0  -0.127  1.30  -0.131  0.0509  158.5
42  1,2-dichloroethane  1.83  -1.19  -1.38  0.685  0  0  0.000  0.00  -0.124  0.0820  112.3
43  dichloromethane  1.34  -0.89  0.00  0.595  0  0  0.000  1.00  -0.108  0.0925  87.1
44  2,4-dichlorophenol  2.80  -2.42  -0.34  -0.245  0  1  -0.216  4.36  -0.389  0.0497  158.3
45  3,4-dichlorophenol  2.80  -2.65  -3.12  -0.236  0  1  -0.216  2.27  -0.390  0.0489  165.3
46  3,5-dichlorophenol  2.80  -2.90  -9.97  -0.285  0  1  -0.233  1.80  -0.390  0.0499  157.8
47  2,4-dichlorophenoxyacetic acid  2.62  -2.82  -11.77  -0.520  0  1  -0.322  5.21  -0.328  0.0411  221.8
48  diethylamine  0.81  0.78  -1.11  3.227  0  0  0.000  0.30  -0.316  0.0466  132.8
49  digoxin  0.50  -5.97  45.79  -0.158  0  6  -1.403  5.68  -0.391  0.0418  714.9
50  dimethyl sulphoxide  -1.22  1.11  -8.41  0.808  0  0  -0.182  3.25  -0.309  0.0323  105.2
51  4-(dimethylamino)benzaldehyde  1.89  -1.84  8.87  -0.177  0  0  -0.178  6.73  -0.362  0.0280  192.8
52  2,3-dimethylphenol  2.61  -1.63  -0.11  0.374  0  1  -0.199  1.59  -0.360  0.0310  159.5
53  2,4-dimethylphenol  2.61  -1.48  -3.05  0.396  0  1  -0.216  1.79  -0.360  0.0308  159.5
54  1,2-dinitrobenzene  1.63  -2.26  9.15  -1.841  0  0  -0.288  0.82  -0.061  0.0122  163.2
55  1,4-dinitrobenzene  1.63  -2.07  24.63  -2.208  0  0  -0.305  0.00  -0.053  0.0114  179.3

56  2,6-dinitro-4-cresol  2.27  -2.59  7.10  -1.894  0  1  -0.484  1.92  -0.393  0.0151  197.9
57  2,4-dinitrophenol  1.73  -1.97  35.58  -1.887  0  1  -0.394  3.32  -0.394  0.0108  184.8
58  diuron  2.67  -3.19  -11.53  -0.146  0  0  -0.411  5.12  -0.450  0.0420  263.7
59  emetine  5.20  -6.13  27.62  0.399  0  0  -0.322  4.06  -0.378  0.0349  520.6
60  eptam (EPTC)  3.02  -3.32  4.72  0.123  0  0  -0.157  3.53  -0.409  0.0406  243.2
61  ethanol  -0.14  1.24  1.69  3.565  0  1  0.000  1.50  -0.395  0.0434  82.7
62  ethanolamine  -1.61  1.21  10.03  3.169  1  1  0.000  1.28  -0.394  0.0409  100.5
63  ethyl acetate  0.86  -0.47  -7.23  1.153  0  0  -0.144  0.91  -0.330  0.0352  127.8
64  ethylbenzene  3.03  -2.67  3.32  0.528  0  0  0.018  0.16  -0.062  0.0321  156.7
65  ethylene glycol  -1.20  1.21  9.13  3.023  0  2  0.000  1.58  -0.393  0.0374  95.4
66  FCCP [carbonyl cyanide 4-(trifluoromethoxy)-phenylhydrazone]  3.15  -3.58  -11.52  -1.125  0  0  -0.157  1.01  -0.178  0.0233  235.7
67  fluvoxamine  3.09  -4.16  20.25  -0.483  1  0  -0.429  1.70  -0.383  0.0373  325.4
68  furosemide  2.32  -3.35  -48.67  -0.869  1  1  -0.628  5.61  -0.397  0.0275  298.0
69  gentamycin  -1.48  -0.41  30.12  1.979  3  3  -0.920  3.69  -0.387  0.0496  405.1
70  hexachlorobenzene  5.86  -6.17  1.59  -1.041  0  0  -0.450  0.48  -0.100  0.0825  205.1
71  hexachlorophene  6.92  -8.03  3.69  -0.861  0  2  -0.679  0.92  -0.386  0.0638  303.2
72  ibuprofen  3.79  -3.70  -15.15  0.199  0  1  -0.411  2.84  -0.326  0.0395  235.9
73  imazapyr  1.57  -2.60  -21.33  -1.184  0  1  -0.555  9.60  -0.393  0.0267  261.2
74  indole  2.05  -1.88  16.54  0.300  0  0  0.052  0.48  -0.196  0.0254  139.8
75  isoproterenol  0.21  0.66  0.89  0.164  0  3  -0.411  7.94  -0.403  0.0331  255.6
76  isoxaben  3.98  -4.99  -5.17  -0.314  0  0  -0.430  8.84  -0.434  0.0314  361.5
77  lindane  4.26  -4.86  4.98  0.147  0  0  -0.450  0.03  -0.119  0.0890  233.6
78  malathion  2.29  -3.62  -9.94  -2.658  0  0  -0.496  4.15  -0.329  0.0376  306.3
79  methanol  -0.63  1.49  7.98  3.778  0  1  0.000  1.70  -0.398  0.0457  55.4
80  3-methyl-4-nitrophenol  2.46  -1.86  6.54  -1.006  0  1  -0.305  1.42  -0.360  0.0213  160.0
81  methylphenidate (ritalin)  2.78  -2.27  7.93  0.186  0  0  -0.089  2.14  -0.326  0.0343  265.8
82  morphine  0.72  -1.03  29.29  0.188  0  2  -0.288  5.32  -0.386  0.0440  201.9

83  1-naphthol  2.69  -2.11  6.82  -0.247  0  1  -0.037  0.68  -0.359  0.0217  178.8
84  naproamide  3.33  -4.05  14.81  -0.403  0  0  -0.246  3.67  -0.417  0.0316  318.2
85  naproxen  3.10  -3.20  -10.72  -0.439  0  1  -0.301  3.67  -0.385  0.0269  253.2
86  nicosulfuron  -1.15  -3.51  -94.35  -1.119  0  0  -0.741  10.98  -0.410  0.0224  386.8
87  nicotine  1.00  0.79  7.06  0.242  0  0  -0.037  2.77  -0.351  0.0329  209.3
88  2-nitrophenol  1.91  -1.75  9.56  -1.014  0  1  -0.199  1.67  -0.359  0.0166  152.8
89  norflurazon  2.19  -3.28  31.23  -0.801  0  0  -0.573  3.77  -0.370  0.0302  273.4
90  novobiocin  2.45  -5.18  - b  - b  1  3  -1.276  6.30  -0.419  0.0261  646.9
91  octane  4.27  -5.00  1.74  3.638  0  0  0.000  0.00  -0.065  0.0490  200.6
92  orphenadrine  3.65  -3.38  12.26  0.379  0  0  -0.233  0.86  -0.366  0.0386  284.3
93  orudis  3.00  -3.33  5.52  -0.583  0  1  -0.322  4.41  -0.435  0.0234  279.6
94  oxadiazon  3.43  -4.51  8.35  -0.366  0  1  -0.823  6.62  -0.403  0.0364  329.7
95  paraquat  -0.56  -0.10  38.98  -8.887  0  0  -0.160  0.00  -0.068  0.0334  199.1
96  pentachlorophenol  4.74  -4.94  8.31  -0.978  0  1  -0.450  1.31  -0.380  0.0750  193.4
97  pentanol  1.33  -0.63  3.91  3.391  0  1  0.000  0.54  -0.395  0.0454  148.1
98  1,10-phenanthroline  2.29  -3.77  19.99  -0.718  0  0  0.035  3.64  -0.309  0.0179  210.1
99  phenol  1.51  -0.56  -0.85  0.398  0  1  -0.020  1.13  -0.360  0.0239  128.8
100  primsulfuron  2.41  -4.67  -125.03  -1.158  0  0  -0.902  33.10  -0.348  0.0210  354.8
101  propan-2-ol  0.28  0.83  -8.10  3.576  0  1  -0.182  2.02  -0.392  0.0438  101.8
102  propranolol  2.60  -3.06  9.09  -0.219  0  1  -0.250  3.60  -0.388  0.0305  335.0
103  pseudocumene  3.63  -3.18  -2.12  0.502  0  0  -0.216  0.09  -0.059  0.0343  170.8
104  quinidine  3.29  -3.50  23.42  -0.597  0  1  -0.207  2.89  -0.388  0.0368  296.3
105  quinine  3.29  -3.50  23.08  -0.571  0  1  -0.207  2.89  -0.388  0.0368  296.3
106  resorcinol  1.03  -0.11  -7.84  0.321  0  2  -0.127  2.39  -0.394  0.0223  137.5
107  salicylic acid  2.24  -1.56  -19.17  -0.591  0  2  -0.199  5.33  -0.391  0.0196  153.4
108  scopolamine  0.39  -1.24  135.39  0.026  0  1  -0.228  1.20  -0.393  0.0411  241.3
109  sethoxydim  3.99  -5.33  -30.79  -0.052  0  1  -0.387  2.21  -0.393  0.0375  381.1
110  sulphosalicylic acid  -0.91  0.57  -44.86  -1.247  0  3  -0.594  6.53  -0.349  0.0145  203.1

111  tert-butyl methyl ether  1.43  -0.65  -11.29  2.988  0  0  -0.354  1.03  -0.377  0.0512  125.1
112  tetrachloroethene  2.97  -3.32  0.99  -0.437  0  0  -0.271  0.01  -0.072  0.0968  129.2
113  tetrachloromethane  2.44  -2.74  0.00  -1.117  0  0  -0.414  0.00  -0.067  0.1199  105.5
114  tetracycline  -1.33  -2.06  -6.41  -0.816  1  5  -1.142  12.72  -0.436  0.0291  372.6
115  theophylline  -0.39  -1.79  8.43  -0.089  0  0  -0.288  3.40  -0.399  0.0217  156.2
116  thiazopyr  5.76  -5.41  33.54  -1.532  0  0  -0.841  6.68  -0.403  0.0314  311.8
117  thifensulfuron  1.32  -3.05  -146.62  -1.606  0  1  -0.690  3.40  -0.397  0.0189  288.3
118  thioridazine  6.45  -7.04  16.47  -0.138  0  0  -0.156  1.77  -0.358  0.0334  354.6
119  toluene  2.54  -2.21  2.31  0.520  0  0  -0.020  0.16  -0.062  0.0294  138.2
120  4-toluidine  1.62  -1.17  -5.67  0.616  1  0  -0.127  0.62  -0.291  0.0300  150.4
121  1,2,3-trichlorobenzene  3.93  -3.98  6.02  -0.364  0  0  -0.199  4.55  -0.120  0.0630  161.5
122  1,1,1-trichloroethane  2.68  -2.30  -5.07  -0.265  0  0  -0.414  1.58  -0.084  0.1043  107.3
123  trichloroethene  2.47  -2.23  -0.64  -0.061  0  0  -0.144  1.25  -0.100  0.0852  120.1
124  trichloromethane  1.52  -1.76  0.00  -0.303  0  0  -0.182  0.59  -0.087  0.1036  100.7
125  trifluralin  5.31  -6.21  20.78  -1.989  0  0  -0.786  4.64  -0.309  0.0365  258.0
126  2,4,6-trihydroxytoluene  1.10  -0.58  -9.06  0.253  0  3  -0.305  1.61  -0.379  0.0241  168.6
127  2,3,6-trimethylphenol  3.15  -1.90  1.01  0.383  0  1  -0.288  1.51  -0.359  0.0347  169.8
128  2,4,6-trimethylphenol  3.15  -1.95  -1.91  0.433  0  1  -0.305  1.74  -0.359  0.0341  173.0
129  3,4,5-trimethylphenol  3.15  -2.31  -5.77  0.430  0  1  -0.305  1.96  -0.360  0.0338  174.1
130  2,4,6-trinitrotoluene  1.99  -2.61  23.55  -2.433  0  0  -0.573  0.17  -0.046  0.0121  200.2
131  verapamil  4.80  -5.01  42.43  0.181  0  0  -0.572  7.72  -0.378  0.0405  459.2
132  warfarin  2.23  -3.67  -10.09  -1.034  0  1  -0.339  6.48  -0.409  0.0277  283.3
133  zanaflex  1.36  -2.22  -5.41  -1.376  0  0  -0.071  5.51  -0.326  0.0281  232.3

a calculated with the Cosmic force field for molecular mechanic calculations

b not calculated by the TSAR software


Table 12.4. Selected QSARs for acute toxicity to S. meliloti †

Each entry gives the equation number, the mechanism of toxicity / chemical type, the regression equation, and the statistics n, R2, s, F and Q2.

1  Non-polar narcotic compounds
   log(1/IC50) = 0.648 (± 0.069) logPoct – 1.98 (± 0.19)
   n = 46; R2 = 0.670; s = 0.717; F = 89.3; Q2 = 0.630

2  Polar narcotic compounds
   log(1/IC50) = 80.0 (± 9.1) SpcP – 9.71 (± 3.45) Qmaxn + 0.399 (± 0.143) NOH – 6.37 (± 1.29)
   n = 26; R2 = 0.804; s = 0.498; F = 30.0; Q2 = 0.723

3  Weak acid respiratory uncouplers
   log(1/IC50) = 0.657 (± 0.141) logPoct – 0.599 (± 0.597)
   n = 5; R2 = 0.879; s = 0.587; F = 21.8; Q2 = 0.692

4  Compounds forming free radicals
   log(1/IC50) = 0.131 (± 0.024) µ – 0.0514 (± 0.0064) Etot – 0.171 (± 0.063)
   n = 9; R2 = 0.929; s = 0.089; F = 39.2; Q2 = 0.391

5  Compounds acting by specific mechanisms
   log(1/IC50) = –0.386 (± 0.067) logSaq + 6.46 (± 1.70) Qmaxn + 1.17 (± 0.43) d1χ + 1.16 (± 0.24) NNH2 + 1.70 (± 0.64)
   n = 47; R2 = 0.570; s = 0.743; F = 13.9; Q2 = 0.329

6  All 133 compounds
   log(1/IC50) = 0.462 (± 0.048) logPoct – 0.278 (± 0.049) LUMO + 0.244 (± 0.078) NOH + 0.909 (± 0.190) NNH2 – 1.28 (± 0.16)
   n = 131 a; R2 = 0.541; s = 0.818; F = 37.1; Q2 = 0.464

7  Aliphatic compounds
   log(1/IC50) = 0.542 (± 0.095) logPoct – 0.305 (± 0.087) LUMO + 0.00245 (± 0.00105) MSA + 1.73 (± 0.25) NNH2 – 1.80 (± 0.30)
   n = 30; R2 = 0.827; s = 0.715; F = 29.8; Q2 = 0.750

† toxicity expressed as log(1/IC50) (in units mmol/l)

a clomazone and novobiocin were excluded, since their molecular orbital properties could not be calculated by the TSAR software


CHAPTER 13

INVESTIGATION OF ACUTE AQUATIC TOXICITY

13.1. Objectives

The objective of the study presented in this chapter was to investigate environmental toxic

effects including toxicity to five aquatic species (two algal species, Daphnia, and two fish

species). Interspecies correlations were developed to identify similarities in toxicity

between the species. Baseline toxicity effects for the non-polar narcotics were sought.

Chemical toxicity was investigated by applying different statistical approaches, including

multiple linear regression, linear discriminant analysis, and classification trees in order to

develop regression and classification models for toxicity.

13.2. Methods


The data set was extracted from the New Chemicals Data Base (NCD), which is

maintained by the European Chemicals Bureau within the European Commission’s Joint

Research Centre (ECB, http://ecb.jrc.it). The data are submitted by industry as a part of

the notification process for each new chemical substance that is manufactured in or

imported into the European Union. A harmonised European notification system for new

substances was introduced as part of the 6th Amendment to Directive 67/548/EEC (EC,

1967) on the classification, packaging and labelling of dangerous substances (Directive

79/831/EEC; EC, 1979). The testing for physicochemical, toxicological and

ecotoxicological endpoints, included in the technical dossiers for substances, should be

made according to the official testing methods contained in Annex V of Directive

67/548/EEC (EC, 1967). Some aspects of dossier information are confidential,

particularly chemical structures and spectra. Due to this confidentiality, the structures of

the investigated compounds and their toxicity values are not presented in this thesis.

The extracted data set from the NCD consisted of algal, daphnid and fish toxicities of

approximately 2900 neutral chemicals, salts, metal complexes and chemical mixtures.

However, most of the substances did not have reported toxicity values for all the species.

Algal toxicities were expressed as the chemical concentrations (in units mg/l) that result

in a 50% reduction in the growth (EbC50) or the growth rate (ErC50) of an algal culture

relative to a control over a 72-h exposure period. The testing

recommendations (Annex V of Directive 67/548/EEC) indicate that Selenastrum

capricornutum (now Pseudokirchneriella subcapitata) and Scenedesmus subspicatus are

the preferred test species. In general, a given chemical is tested on a single algal species.

Toxicity to Daphnia was expressed as the chemical concentration that results in 50%

immobilisation of the Daphnia in a test batch, over a 48-h exposure period, and in a static

system (EC50, in mg/l). The preferred test species is Daphnia magna.

Fish toxicity was expressed as the chemical concentration at which 50% lethality is

observed in a test batch of fish within a 96-h exposure period (LC50, in mg/l). Three

testing procedures can be used: a static test, in which test solutions remain unchanged

during the test; a semi-static test, without flow of test solution, but with regular batchwise

renewal of the test solution after a prolonged period (e.g. 24 hours); and a flow-through

test, in which the water is renewed constantly in the test chambers, the chemical under

test being transported with the water used to renew the test medium. A variety of fish

species can be used (Annex V of Directive 67/548/EEC); however, Brachydanio rerio

(zebra fish) and Oncorhynchus mykiss (rainbow trout) are the preferred species. Again,

for a given chemical, testing is generally performed on a single species.

The toxicities of some of the chemicals extracted from the NCD were not reported as

exact values, but as approximate numbers or as greater than or smaller than a cut-off

value. For the purposes of classification and labelling, the test recommendations (EC,

1967) allow for a limit test to be performed at 100 mg/l, to demonstrate that the 50%

effect concentration is greater than this concentration. Furthermore, if two consecutive

test concentrations with a ratio of 2.2 gave only 0 and 100% response, these two values

were considered sufficient to indicate the range within which the investigated toxicity

value falls. The correlations between the toxicity endpoints, the baseline relationships and

the regression QSARs were obtained by using only chemicals for which exact toxicity

values were reported.

To develop baseline relationships and QSARs, the initial data set of approximately 2900

chemicals was reduced by removing salts, metal complexes and mixtures, leaving single

and neutral chemicals. All notes relating to the toxicity testing of each chemical were

examined. Only chemicals with a reported purity greater than 95% were considered. This

value was chosen because a balance was sought between the use of pure compounds

(ensuring that the toxic effect is due to the main compound, and not to some impurities),

and a greater number of data points. Chemicals that could not be maintained within 80%

of the starting concentration during the test, due to chemical volatility, instability

(hydrolysis) in the test medium, or poor water solubility, were excluded from the

investigation. Any chemical with a toxicity value higher than its measured water

solubility was also excluded, since the half-maximal effect concentration of such a

chemical would not be reached under the experimental conditions. The final numbers of

investigated chemicals were different for the different endpoints and are given in Section

13.3 “Results”.
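The exclusion criteria described above can be summarised as a simple filter. The following is a minimal sketch and not the procedure actually used in this work: the record layout and all field names (`kind`, `purity_percent`, `min_fraction_of_start_conc`, `toxicity_mg_l`, `water_solubility_mg_l`, `is_exact`) are hypothetical and were introduced only for illustration.

```python
def is_suitable(chem):
    """Return True if a record passes the exclusion criteria described above.

    All field names are hypothetical; they are not NCD field names.
    """
    # Only single, neutral chemicals (no salts, metal complexes or mixtures).
    if chem["kind"] != "neutral":
        return False
    # Reported purity must be greater than 95%.
    if chem["purity_percent"] <= 95.0:
        return False
    # The concentration must stay within 80% of the starting concentration.
    if chem["min_fraction_of_start_conc"] < 0.80:
        return False
    # The toxicity value must not exceed the measured water solubility.
    if chem["toxicity_mg_l"] > chem["water_solubility_mg_l"]:
        return False
    # Baseline and regression work used exact toxicity values only.
    return chem["is_exact"]
```

Applying such a filter separately to the records of each endpoint would give endpoint-specific data sets of different sizes, as described in the text.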

To develop classification models, the boundaries for toxicity classification were set

according to Annex VI of Directive 67/548/EEC (EC, 2001a). In the following table the

classification boundaries according to Annex VI are given:

Category         Endpoint value (E)
Very toxic (V)   E ≤ 1 mg/l
Toxic (T)        1 mg/l < E ≤ 10 mg/l
Harmful (H)      10 mg/l < E ≤ 100 mg/l
Non-toxic (N)    E > 100 mg/l

In the table “Endpoint value (E)” denotes LC50 of fish (96h-assay), EC50 of Daphnia

(48h-assay), or IC50 of algae (72h-assay). To assess algal toxicity, the inhibition of the

algal growth rate (ErC50 values) is preferred to the inhibition of the growth of the algal

biomass (EbC50).

In the present work the different species were investigated separately. The group of very

toxic compounds was denoted with “V”; the group of toxic compounds was denoted with

“T”; “H” was used to denote the group of harmful compounds. Compounds having

toxicity values higher than 100 mg/l for a given species were classified as non-toxic (the

group was denoted with “N”).

Additionally, another type of classification was tried, in which the groups of very toxic

(V), toxic (T), and harmful (H) compounds were united, to form a single group of

dangerous compounds. Thus, the chemicals were divided into two groups, representing

dangerous compounds for the investigated species (the group was denoted with “D”), and

non-toxic compounds (this group was denoted with “N”).
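The two grouping schemes can be written as a pair of small functions. This is an illustrative sketch only; the function names are assumptions, while the boundaries are those of Annex VI given in the table above (endpoint value E in mg/l).

```python
def classify_four(e_mg_l):
    """Four-group class (V, T, H, N) from an endpoint value E in mg/l."""
    if e_mg_l <= 1.0:
        return "V"  # very toxic
    if e_mg_l <= 10.0:
        return "T"  # toxic
    if e_mg_l <= 100.0:
        return "H"  # harmful
    return "N"      # non-toxic


def classify_two(e_mg_l):
    """Two-group class: V, T and H are united into 'D' (dangerous)."""
    return "N" if classify_four(e_mg_l) == "N" else "D"
```

For example, a hypothetical chemical with E = 50 mg/l would fall in group H under the first scheme and in group D under the second.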

Additionally, experimental logPoct data were taken from the NCD. The recommended

methods (Annex V of Directive 67/548/EEC) for the experimental determination of

logPoct are the shake-flask method and high performance liquid chromatography (HPLC).

The measurements were obtained at different temperatures.


The following structural descriptors for the compounds were calculated: the logarithm of

the octanol-water partition coefficient (logPoct), aqueous solubility, and approximately

250 different quantum-chemical, charge, and topological descriptors. LogPoct and aqueous

solubilities were calculated with the KOWWIN version 1.66 and WSKOWWIN version

1.40 programs, respectively (Syracuse Research Corporation, Syracuse, NY, USA;

downloaded for free from the website of the Environmental Protection Agency of the

United States: http://www.epa.gov/oppt/exposure/docs/episuitedl.htm). TSAR for

Windows version 3.3 (Accelrys Inc.) was used to calculate quantum-chemical and


(Accelrys Inc.) was used, applying the AM1 Hamiltonian. Topological and charge

descriptors were calculated by using DRAGON version 5.0 (TALETE srl). For the

calculations in DRAGON that require 3D chemical structures (the charge descriptors), the

structures optimised with VAMP were transferred from TSAR to DRAGON as

Mol2 files.

13.2.3. Statistical Analysis

Before performing statistical analyses, the descriptors that had values of zero for more

than 95% of the compounds in a data set were excluded from the descriptor data matrix

since these carry no useful information. To reduce the number of intercorrelated

molecular descriptors, the algorithm developed by the author of this thesis was applied

(see Chapter 9).
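The first pre-processing step (removal of descriptors that are zero for more than 95% of the compounds) could be sketched as follows. The function name and the data layout are assumptions made for illustration; the intercorrelation-reduction algorithm of Chapter 9 is not reproduced here.

```python
def drop_mostly_zero(matrix, names, max_zero_fraction=0.95):
    """Keep descriptors whose fraction of zero values is <= max_zero_fraction.

    matrix: list of rows (one per compound), each a list of descriptor values.
    names:  descriptor names, one per column (illustrative labels only).
    """
    n = len(matrix)
    kept = []
    for j, name in enumerate(names):
        # Count the compounds for which descriptor j is exactly zero.
        zeros = sum(1 for row in matrix if row[j] == 0)
        if zeros / n <= max_zero_fraction:
            kept.append(name)
    return kept
```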

QSARs were obtained by multiple linear regression, using the MINITAB version 13

software (Minitab Inc., State College, PA, USA). Again, forward stepwise regression was

applied to select variables in the QSARs, excluding descriptors considered to lack

physicochemical and/or mechanistic relevance. The best-subsets regression program written in C

by the author was also used (see Chapter 9). Only those variables that

intercorrelated with a coefficient of intercorrelation R less than 0.7 were included in the

same model. The leave-one-out cross-validation statistical procedure was applied to the

QSAR models.
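The leave-one-out procedure can be illustrated with a short sketch. For brevity it is restricted to a single descriptor, whereas the models in this chapter were multiple linear regressions fitted in MINITAB; the function names are assumptions, and the cross-validated coefficient is taken here as Q2 = 1 - PRESS/SStotal, which is one common definition.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a*x + b (single descriptor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx


def loo_q2(x, y):
    """Leave-one-out cross-validated Q2 = 1 - PRESS / SS_total."""
    my = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        # Refit the model without compound i, then predict compound i.
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a * x[i] + b)) ** 2
    ss_total = sum((yi - my) ** 2 for yi in y)
    return 1.0 - press / ss_total
```

A Q2 value close to the fitted R2 suggests that the model does not depend strongly on any single compound.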

Discriminant analysis was performed with the Statistica version 5.5 software (StatSoft

Inc., Tulsa, OK, USA). Variable selection in the discriminant analysis was performed by

a forward stepwise procedure, applying the same considerations regarding

physicochemical and/or mechanistic relevance of the selected variables as in the

regression analysis (see above). The assumptions of discriminant analysis for

homogeneity of variable variance and variance/covariance matrices within the groups

were tested with the univariate Levene test, and with the multivariate Sen-Puri test,

included in the ANOVA/MANOVA module of Statistica version 5.5 software.

Classification trees were developed by using the Statistica version 5.5 software. Both

discriminant-based univariate (DBU) and classification and regression tree (CART)

splitting algorithms were explored. In the CART algorithm, the Gini index (Equation

5.21, Chapter 5, Section 5.5 “Classification trees”) was used as a measure of node

homogeneity. A minimum node size of 5 observations was used as a stopping rule, and 3-

fold cross-validation pruning (dividing the training set into three groups) on

misclassification error was applied. The least complex tree, for which the cross-validated

cost did not exceed the minimum cross-validated cost plus the standard error of the

minimum cross-validated cost, was selected as a final classification tree.
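The node-homogeneity measure and the tree-selection rule can be sketched as follows. The Gini index is assumed here to have its standard form, 1 minus the sum of the squared class proportions (Equation 5.21 is not repeated in this chapter), and the candidate-tree tuples are a hypothetical data layout rather than Statistica output.

```python
def gini(counts):
    """Gini impurity of a node from its class counts: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def select_tree(candidates):
    """Pick the least complex tree within one standard error of the minimum.

    candidates: list of (n_terminal_nodes, cv_cost, cv_cost_se) tuples,
    one per pruned subtree (the layout is an assumption for illustration).
    """
    best = min(candidates, key=lambda t: t[1])
    # Allowed cost: minimum cross-validated cost plus its standard error.
    threshold = best[1] + best[2]
    eligible = [t for t in candidates if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])
```

A pure node has a Gini index of 0, and a two-class node with equal class counts has a Gini index of 0.5.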

13.3. Results

13.3.1. Correlations between the toxicity endpoints

To perform correlations between the toxicity endpoints the toxicity concentrations were

kept in units of mg/l, as they are reported in the database. It was impossible to transform

the data into units of mol/l, because mixtures and salts with ill-defined compositions were

also included in the correlations. The correlations obtained are reported in Table 13.1.

Equation 1 (Table 13.1) represents the correlation between algal EbC50 and ErC50 values,

without taking into account the algal species tested. To obtain interspecies correlations

between the algal toxicity and the toxicity to Daphnia or fish, the EbC50 values were used

to represent the algal toxicity, as they were reported for more chemicals than were the

ErC50 values. Similar interspecies correlations were obtained when ErC50 values were

used to represent the algal toxicity, which is consistent with the high EbC50/ErC50

correlation, expressed by Equation 1 (Table 13.1).

13.3.2. Analysis of baseline effects

Before investigating the baseline effects and developing QSARs, each chemical was

assigned to a mechanism of toxic action according to the structural criteria

described in Chapter 8, Table 8.1. To investigate the baseline effects and develop QSARs,

all toxicity values were transformed into units of µmol/l.

For the development of baseline relationships, the calculated values of logPoct were used

in preference to the reported measured values, because for a number of chemicals,

experimental logPoct values were not reported, so reliance on experimental logPoct values

alone would have reduced considerably the number of data points for analysis.

Furthermore, it is considered good QSAR practice to be consistent in the choice of

descriptor data (i.e. all data should preferably be generated by the same experimental

protocol or estimation method) (Cronin, 2002).

The derived relationships between toxicity and logPoct for the non-polar narcotics present

in the data sets for each species are presented in Table 13.2, and in Figure 13.1.

For the alga S. subspicatus, only 50 chemicals with reported exact EbC50 values remained

after excluding the salts, metal complexes, mixtures, chemicals with purity less than or

equal to 95%, unstable or volatile chemicals, chemicals with algal EbC50 values higher

than their measured water solubilities, and chemicals with reported EbC50 values as

approximate numbers or as greater than or smaller than a cut-off value (see Section 13.2.1

“Toxicity data”). Only 8 of the remaining 50 chemicals met the structural criteria for

chemicals that may act by the non-polar narcosis mechanism of toxic action. For the alga

S. subspicatus ErC50 endpoint, only 33 chemicals with exact ErC50 values were selected

for investigation (after excluding the “unsuitable chemicals”), of which 6 were classified

as non-polar narcotics (Table 13.2).

For the alga S. capricornutum 64 chemicals with exact EbC50 values were selected, from

which 16 were classified as non-polar narcotics. For the S. capricornutum ErC50 values 51

chemicals remained after exclusion of the “unsuitable chemicals”, from which 13

chemicals were classified as non-polar narcotics. However, baseline effects for the

toxicity to the alga S. capricornutum were not observed. Chemicals were equally

distributed above and below the regression line defined by the toxicity/logPoct relationship

for the subset of non-polar narcotics.

For D. magna exclusion of the “unsuitable chemicals” resulted in a set of 176 chemicals

with reported exact toxicity values, of which 56 chemicals were classified as non-polar

narcotics. A total of 122 chemicals were selected with reported toxicities to the fish O.

mykiss (rainbow trout), of which 34 chemicals were classified as non-polar narcotics. For

B. rerio (zebra fish) 55 chemicals were selected after applying the above-described

exclusion criteria, of which 19 chemicals were classified as non-polar narcotics.

13.3.3. QSAR models

The QSAR models developed by multiple linear regression are presented in Table 13.3.

The abbreviations for the included structural descriptors are explained in Table 12.2,

Chapter 12. No QSAR with a reasonable statistical fit could be obtained for toxicity to the

alga S. capricornutum, to D. magna, or to the fish B. rerio, even if additional descriptors

were used.

13.3.4. Development of classification models for toxicity

The derived baseline relationships and QSAR models are not of a high statistical quality.

Since the toxicity data were obtained by different laboratories, experimental variability

was inevitable. Therefore, in order to avoid the use of the exact toxicity values reported,

chemicals were classified into groups according to their toxicity, and classification

models were developed. The toxicity boundaries for the classification were taken from

Annex VI of Directive 67/548/EEC (EC, 2001a) (see Section 13.2.1 “Toxicity data”). The

algal growth rate inhibition (ErC50) was used to classify compounds into groups of algal

toxicity as this endpoint is the preferred measure of acute algal toxicity for the regulatory

assessment of chemicals in the European Union.

To develop QSAR models the classification boundaries were transformed from units mg/l

(as they are given in Annex VI of Directive 67/548/EEC; EC, 2001a) to units µmol/l, by

using the average values of the molecular mass of the chemicals investigated for each

endpoint. In Table 13.4 the number of chemicals for each species, the average molecular

masses (MMav) and the upper boundaries for group V (very toxic compounds) in units

µmol/l for each species, are presented. The upper boundaries in units µmol/l for groups T

(toxic compounds) and H (harmful compounds) can be derived by multiplying the

boundary for group V by 10 and 100, respectively. As described in Section 13.2.1

(“Toxicity data”), the boundary for group N (non-toxic compounds) coincides with the

upper boundary of group H. Thus, the classification boundaries differ by one log unit.
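The unit conversion of the boundaries can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the average molecular mass of 250 g/mol used in the usage note is an arbitrary value, not one taken from Table 13.4.

```python
def mg_l_to_umol_l(conc_mg_l, molar_mass_g_mol):
    """Convert a concentration from mg/l to µmol/l."""
    # mg/l divided by g/mol gives mmol/l; multiply by 1000 for µmol/l.
    return conc_mg_l / molar_mass_g_mol * 1000.0


def boundaries_umol_l(mm_av):
    """Upper bounds of groups V, T and H in µmol/l for an average molar mass."""
    v = mg_l_to_umol_l(1.0, mm_av)  # 1 mg/l is the upper bound of group V
    return {"V": v, "T": 10.0 * v, "H": 100.0 * v}
```

With an illustrative average molecular mass of 250 g/mol, the upper boundary of group V would be 4 µmol/l, and those of groups T and H would be 40 and 400 µmol/l, respectively.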

Chemicals with reported toxicity values as approximate numbers were included in the

development of classification models (see Section 13.2.1 “Toxicity data”). Also,

chemicals with reported toxicity values greater than the boundary for group N were

included in the group of the non-toxic compounds, and chemicals with reported toxicity

values smaller than the upper boundary for group V were included in the group of the

very toxic compounds. Therefore, the number of the chemicals included in the

classification models is greater than the number of chemicals included in the baseline and

regression QSAR models for each species (Table 13.4). Table 13.5 gives the numbers of

chemicals in each group with toxicity boundaries set both in units mg/l and µmol/l for

each species.

Results from the discriminant analysis

In Table 13.6 the classification models for the four-group classification (V, T, H, and N

groups) obtained by using discriminant analysis (DA) are presented. The a priori

classification probabilities were set as proportional to the group sizes, because it is less

probable for a chemical to have extreme toxicity (from group V) than to be toxic or

harmful.

The discriminant models for the second type of classification (dangerous (D) and non-

toxic (N) groups) are presented in Table 13.7. Classifications were tried with a priori

probabilities set as both proportional to the group sizes and equal in the two groups.

The discriminant model obtained when compounds were grouped into four groups of

toxicity to alga S. subspicatus included logPoct and the number of cyano groups (NC#N)

(Model 1, Table 13.6). However, heterogeneity of the variance of NC#N in the four groups

at 95% significance level was suggested by the Levene test. The same holds true for the

variable Nrings6 used in the model for classification into four groups of toxicity to the alga

S. capricornutum (Model 2, Table 13.6) (the Levene test for this variable was statistically

significant at 95% level). Also, the assumption for homogeneity of the

variance/covariance matrices for this model (Model 2, Table 13.6) was violated (the Sen-

Puri test was significant at 95% level). In the models for toxicity to D. magna (Model 3,

Table 13.6; and Model 3, Table 13.7), the variance of logPoct was also heterogeneous

within the groups at 95% level (suggested by the Levene test).

The tests for heterogeneity of variable variances (Levene test) and for heterogeneity of

variance/covariance matrices (Sen-Puri test) for the remaining variables and models

included in Tables 13.6 and 13.7 were not statistically significant at 95% level.

Results from the classification trees approach

Additionally, the classification tree (CT) approach was used to obtain classification

models. Both DBU and CART splitting algorithms were explored. Similarly to the DA,

when classification into four groups was investigated, the a priori probabilities were set

proportional to the group sizes. When models for the classification into dangerous and

non-toxic compounds were sought, a priori probabilities both equal and proportional to

the group sizes were explored. In Table 13.8 and Table 13.9, the results of CT analysis for

the two types of grouping – four groups (very toxic (V), toxic (T), harmful (H), and non-

toxic (N) compounds) and two groups (dangerous (D) and non-toxic (N) compounds),

respectively, are presented, including the splitting method and the percentages of

correctly classified compounds. The structures of the CTs are presented in Figures 13.2-

13.7. These are discussed in more detail in Section 13.4 “Discussion”.

13.3.5. Correlation between measured and calculated logPoct values

The following correlation between measured and calculated logPoct values was obtained

(illustrated in Figure 13.8):

Calculated logPoct = 0.709 (± 0.030) Measured logPoct + 0.387 (± 0.099) (13.1)

n = 352, R2 = 0.619, F = 568, s = 1.23

13.4. Discussion

Data sets for toxicity to five aquatic species were investigated by different QSAR

approaches with the aim of developing regression and classification models. Interspecies

correlations were also developed.

13.4.1. Correlations between toxicological endpoints

The correlations between the toxicity endpoints are presented in Table 13.1. From

Equation 1, Table 13.1, it can be seen that a very good correlation was obtained between

the two algal toxicity endpoints (R2 = 0.92), suggesting that testing only one of these

endpoints could be sufficient for evaluating algal toxicity. The growth rate endpoint

(ErC50) is the preferred measure of acute algal toxicity for the regulatory assessment of

chemicals in the European Union.

Statistically significant correlations were obtained between toxicity to Daphnia and algae,

but with poor statistical fits (Equations 2 and 3, Table 13.1). Toxicity to the fish O. mykiss

correlated better with toxicity to the alga S. subspicatus (R2 = 0.57, Equation 4, Table

13.1), and very badly with toxicity to the alga S. capricornutum (R2 = 0.25, Equation 5,

Table 13.1). For the fish B. rerio the opposite observation can be made, i.e. toxicity to the

fish B. rerio was well correlated with S. capricornutum toxicity (R2 = 0.67, Equation 7,

Table 13.1), having a slope of 1, and poorly with S. subspicatus toxicity (R2 = 0.28,

Equation 6, Table 13.1). A relatively good correlation was observed between toxicities to

the fish O. mykiss and D. magna (R2 = 0.67, Equation 8, Table 13.1), while the

correlation between toxicities to the fish B. rerio and D. magna was worse (R2 = 0.42,

Equation 9, Table 13.1).

13.4.2. Baseline effects

The baseline toxicity effects are presented in Table 13.2 and Figure 13.1. Although

general trends for presence of linear toxicity/logPoct relationships can be seen in Figure

13.1, with most of the chemicals being placed above the corresponding baselines, the

statistical fit of the baseline equations is not high (Table 13.2). The worst statistical

parameters were observed for the relationships obtained for D. magna and the fish B.

rerio.

As mentioned in Section 13.3 (“Results”), baseline effects for the toxicity to the alga S.

capricornutum were not observed, because the chemicals were equally distributed above

and below the regression line defined by the toxicity/logPoct relationship for the subset of

non-polar narcotics. From the baseline plots in Figure 13.1 it can be seen that similar

tendencies exist for the remaining species – although most of the chemicals are placed

above the baselines in Plots “b” of Figure 13.1 (indicating that these chemicals act by

mechanisms of toxic action other than non-polar narcosis), some chemicals are placed

below and close to the baselines without being classified as non-polar narcotics according

to the structural criteria. Most of these chemicals were classified as polar narcotics or

specifically-acting chemicals. These chemicals have complex molecular structures

containing different functional groups, which might prevent them from expressing the

potentially higher toxic effects related to some features of their structures. Also, due to

the presence of multiple and different functional groups, it was difficult to assign these

chemicals to a single mechanism of toxic action. Similar reasons for the absence of

an observed baseline effect for the toxicity to the alga S. capricornutum could be suggested.

For the algal ErC50 endpoint (S. subspicatus), the baseline relationship is represented by

Equation 2 (Table 13.2), and Plots 2a and 2b (Figure 13.1). Although a linear

toxicity/logPoct relationship can be observed, Plot 2a (Figure 13.1) shows that the 6 non-

polar narcotics do not provide a well-defined baseline.

13.4.3. QSAR models

The derived QSARs have moderate statistical fit (Table 13.3). QSARs were developed for

the alga S. subspicatus, and the models were similar for both the EbC50 and ErC50

endpoints (Equations 1 and 2, Table 13.3). A QSAR model was obtained also for toxicity

to the fish O. mykiss, albeit with a poor statistical fit (Equation 3, Table 13.3).

The QSARs are based on chemicals that are likely to act by different toxic mechanisms,

including narcotics, reactive chemicals and specifically-acting chemicals. LogPoct was

found to be a significant descriptor in all models. It accounts for the role of chemical

hydrophobicity in producing a toxic effect. The number of phenyl rings (NPh) in

Equations 1 and 2 (Table 13.3) might reflect differences in the toxic action towards S.

subspicatus between aromatic and aliphatic chemicals. An attempt was made to develop

class-specific QSARs for toxicity to S. subspicatus, for the aliphatic and aromatic

chemicals separately, but better models could not be obtained. The descriptor LUMO,

which appeared in the model for fish toxicity (Equation 3, Table 13.3), might account for

the electrophilic reactivity of the chemicals.

The statistical fit (R2) of the QSARs ranged from 0.42 to 0.71. A possible reason for the

limited goodness-of-fit of the models is that toxicokinetic factors that influence

toxicity in vivo towards fish are not accounted for in the models. Other

possible reasons relate to limitations in the quality of the data used, as described below.

13.4.4. Toxicity classification

Two types of toxicity classification of compounds were attempted. The first one was set

according to classification criteria established in EU legislation (Annex VI of Directive

67/548/EEC; EC, 2001a), which provides for classification into four groups of very toxic,

toxic, harmful, and non-toxic compounds. The second classification was formed by

uniting the groups of very toxic, toxic and harmful compounds into a single group of

dangerous compounds.

In Annex VI of Directive 67/548/EEC (EC, 2001a) classification boundaries for toxicity

in units of mg/l are used. For the purposes of the QSAR analysis these boundaries were

converted into units of µmol/l. Table 13.5 gives the numbers of compounds in each group

with boundaries in units of both mg/l and µmol/l. From Table 13.5.a it can be seen

that for the alga S. subspicatus, five compounds (6.1%) differ in their classification

according to the classification schemes used. In the classification according to the toxicity

to S. capricornutum, 16 compounds (14.1%) were classified differently by the two

schemes (Table 13.5.b). For D. magna this number is 56 compounds (13.5%, Table

13.5.c). In the case of fish O. mykiss, 32 compounds (14.0%) were classified differently

(Table 13.5.d), and for the fish B. rerio, 19 compounds (17.3%) were classified

differently by the two classification schemes (Table 13.5.e). Thus, approximately 15% of

chemical classifications are likely to be different when using the two different types of

units to define the classification boundaries.

Interpretation of the results from the discriminant analysis

The percentages of correct classifications expected by chance and obtained by applying

the DA for the four-group classification (V, T, H, and N groups) are presented in Table

13.6. From the table it can be seen that the correct classification expected by chance

varies between 25.9% and 32.1%. The larger values mean that the size of one of the

groups represents a larger percentage of the total number of compounds (for a

classification into four groups of equal sizes the correct classification expected by chance

is 25%). Thus, the sizes of the toxicity groups for the alga S. subspicatus and the fish B.

rerio (31.1% and 32.1% correct classifications expected by chance, respectively, Table

13.6) are less balanced than the sizes of the toxicity groups for the remaining species.

This is also reflected by the group sizes presented in Table 13.5.
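The dependence of the chance classification rate on the balance of the group sizes can be illustrated with a short sketch. It assumes that, with a priori probabilities proportional to the group sizes, the correct classification expected by chance is the sum of the squared group proportions; this assumption reproduces the 25% figure for four equal groups quoted above, but the formula itself is not stated in this chapter. The group sizes used in the usage note are invented.

```python
def chance_correct(group_sizes):
    """Expected fraction classified correctly by chance: sum of p_i^2.

    Assumes a priori probabilities proportional to the group sizes.
    """
    total = sum(group_sizes)
    return sum((n / total) ** 2 for n in group_sizes)
```

For four equal groups the function returns 0.25, while an unbalanced split such as 50, 30, 10 and 10 compounds gives 0.36, showing that larger values indicate less balanced groups.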

As can be seen from Table 13.6, the discriminant models generally classified correctly

approximately 55% of the compounds, i.e. applying the discriminant models resulted in

an approximately 25% improvement of the classification over that expected by chance.

The worst classification was obtained for toxicity to D. magna (50.0% correct

classification, Model 3, Table 13.6), for which the data set contained the largest number

of compounds. The best overall classification was obtained for the fish B. rerio (60.6%

correct classification, Model 5, Table 13.6); however, the second group (toxic

compounds) was very poorly predicted by this model (only 5.6% correct classification).

All models included logPoct as a discriminating variable. Additionally, the number of

cyano groups, electronic (HOMO, LUMO), topological (PW2), and volume/shape

(Nrings6) factors appeared to determine the toxicity classification.

The discriminant models for classification of chemicals into dangerous and non-toxic

groups are presented in Table 13.7. The percentages of correct classifications expected by

chance are also given in the table. From these percentages it can be seen that the largest

differences between the sizes of the two toxicity groups are present for D. magna and the

fish O. mykiss, which is also indicated by the group sizes presented in Table 13.5.

As can be seen from Table 13.7, when the a priori probabilities were set as proportional

to the group sizes, the overall classifications were similar to or better than those obtained

with equal a priori probabilities. However, the group of non-toxic compounds was

classified much less accurately with proportional a priori probabilities. Nevertheless, it

can be considered more useful (from a regulatory perspective) to have better

classification for the group of dangerous compounds than for the non-toxic compounds;

thus, applying proportional a priori probabilities may result in more valuable classification

schemes. From Table 13.7 it can be seen that the improvement in the classification over

the classification expected by chance varies from approximately 10% (for the model for

D. magna and equal a priori probabilities) to approximately 25% (for the model for the

fish B. rerio and a priori probabilities proportional to the group sizes).

Similar descriptors appeared in the discriminant models for the two-group classification

(Table 13.7) as in those for the four-group classification (Table 13.6). Again, all models

included logPoct. In the classification model for toxicity to the alga S. subspicatus, NC#N

was not a statistically significant variable, whereas the number of H-bond donors (NH-

donors) was used as discriminating variable (Model 1, Table 13.7). The model for the alga

S. capricornutum included only logPoct, and Nrings6 was not a statistically significant

discriminating variable in this case (Model 2, Table 13.7). Similarly, LUMO was not a

statistically significant discriminating variable for the two-group classification of toxicity

to the fish O. mykiss (Model 4, Table 13.7). For D. magna and the fish B. rerio, the same

discriminating variables were used as in the models for compound classification into four

toxicity groups, namely logPoct, HOMO, and PW2 (Models 3 and 5, Tables 13.6 and

13.7).

Interpretation of the results from the classification trees approach

The results of CT analysis for the two types of grouping – four groups (very toxic, toxic,

harmful, and non-toxic compounds) and two groups (dangerous and non-toxic

compounds), are presented in Table 13.8 and Table 13.9 respectively. Unlike discriminant

analysis, where the selection of the a priori probabilities influences the final classification

only, and not the characteristics of the model itself (variables included and statistical

parameters), in the CT approach the a priori probabilities determine also the selected

discriminating variables and their cut-off values. Therefore, different CTs were obtained

in some cases (for example for the fish B. rerio, CTs 6 and 7, Table 13.9) when different

a priori probabilities (proportional to the group sizes, or equal for the groups) were

applied.

Classification trees for toxicity to the alga S. subspicatus

For the alga S. subspicatus, models for classifying chemicals into four groups (very toxic,

toxic, harmful and non-toxic chemicals) could not be obtained by using the CT approach

(Table 13.8). A CT was obtained only for the two-group classification of toxicity to S.

subspicatus, using equal a priori probabilities and DBU splitting (CT 1, Table 13.9). The

same descriptors appeared in the CT as in the model obtained by DA (Model 1, Table

13.7), namely logPoct and NH-donors. The CT approach resulted in a slightly higher

percentage of correctly classified compounds than that obtained by DA; however, the group of

non-toxic compounds was poorly classified, with only 57.7% correct classifications. The

CT is presented in Figure 13.2. It can be seen that compounds with logPoct ≤ 2.09 and at

least one H-bond donor are classified as non-toxic to S. subspicatus, while compounds

with logPoct ≤ 2.09 and not containing H-bond donor atoms, or compounds having logPoct

> 2.09 are classified as dangerous to S. subspicatus. Thus, from the CT, it appears that

compounds with higher lipophilicity are more toxic, which is in accordance with

literature reports (see Chapter 8). The CT also indicates that H-bond donor ability

is unfavourable for toxicity.
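The rules read from this CT can be written as a short function. This is a minimal sketch: the function and argument names are hypothetical, and the cut-off value is the one reported for Figure 13.2.

```python
def classify_s_subspicatus(logp_oct: float, n_h_donors: int) -> str:
    """Two-group rule read from the CT for S. subspicatus (Figure 13.2).

    Returns "D" (dangerous) or "N" (non-toxic).
    """
    if logp_oct > 2.09:
        # More lipophilic compounds are classified as dangerous.
        return "D"
    # At low lipophilicity, only compounds with H-bond donors are non-toxic.
    return "D" if n_h_donors == 0 else "N"
```

For example, a hypothetical compound with logPoct = 1.5 and two H-bond donors would be assigned to the non-toxic group, while the same logPoct without H-bond donors leads to the dangerous group.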

Classification trees for toxicity to the alga S. capricornutum

The CT for the four-group classification of compounds according to toxicity to the alga S.

capricornutum was obtained by applying the CART splitting method (a priori

probabilities set proportional to the group sizes) (CT 1, Table 13.8). In the CT approach,

the second-order connectivity index (2χ) gave better results than the number of 6-

membered rings (Nrings6), which appeared in the DA (Model 2, Table 13.6). From Table

13.6, Model 2, and Table 13.8, CT 1, it can be seen that the overall percentages of

correctly classified compounds by the DA and the CT approach are similar; however,

with the CT, a better classification of the group of very toxic compounds is obtained. The

CT is presented in Figure 13.3. It can be seen that compounds having logPoct ≤ 2.31 are

classified as non-toxic. Compounds having logPoct > 2.31 and 2χ > 8.955 are classified as

very toxic; compounds with 2χ ≤ 8.955 and 2.31 < logPoct ≤ 3.01 are harmful to S.

capricornutum, and compounds with 2χ ≤ 8.955 and logPoct > 3.01 are toxic. Thus, higher

values of logPoct and 2χ are favourable for toxicity. The value of 2χ increases with

increasing molecular size, i.e. larger molecules are suggested to be more toxic.

For the second type of classification (dangerous and non-toxic compounds), the use of

CART and a priori probabilities both equal and proportional to the group sizes produced

CTs with one split along logPoct (CT 2, Table 13.9). When the a priori probabilities were

set proportional to the group sizes, logPoct had a cut-off value of 2.05 (compounds having

logPoct ≤ 2.05 were classified as non-toxic, while compounds with logPoct > 2.05 were

classified as dangerous). When equal a priori probabilities were considered, logPoct had a

different cut-off value of 2.31 (compounds having logPoct ≤ 2.31 were classified as non-

toxic, compounds with logPoct > 2.31 were classified as dangerous). The percentages of

correct classifications are presented in Table 13.9, CT 2. It can be seen that both settings

for the a priori probabilities resulted in a better classification of the non-toxic compounds

than of the group of dangerous compounds.

Classification trees for toxicity to D. magna

For the four-group classification of toxicity to D. magna, logPoct, HOMO, and MM

appeared as discriminating variables when applying DBU splitting (a priori probabilities

proportional to the group sizes). The percentage of correct classification is presented in

Table 13.8, CT 2. A similar overall accuracy of the classification was obtained as with the

DA (Model 3, Table 13.6). The tree is presented in Figure 13.4.a. It can be seen that

compounds with logPoct ≤ 3.82 are classified as non-toxic, harmful or toxic to D. magna,

while compounds with logPoct > 3.82 are either toxic or very toxic to this species.

Compounds with logPoct ≤ 1.91 and HOMO ≤ -10.07 are classified as non-toxic;

compounds with logPoct ≤ 1.91 and HOMO > -10.07, or compounds with 1.91 < logPoct ≤

3.82 and HOMO ≤ -9.14, are classified as harmful; compounds with 1.91 < logPoct ≤ 3.82

and HOMO > -9.14, or with logPoct > 3.82 and MM ≤ 424.3, are toxic; compounds with

logPoct > 3.82 and MM > 424.3 are very toxic. Thus, compounds with higher values of

logPoct, HOMO, and MM tend to be more toxic. Compounds with higher values of

HOMO will enter chemical reactions more easily as nucleophiles, thus the CT suggests

involvement of nucleophilic reactions in the toxic action. According to the CT, some very

toxic compounds have high molecular mass, suggesting that they are complex organic

molecules with numerous centres of reactivity and/or centres responsible for toxicity by

specific mechanisms. Additionally, the increased molecular mass might reflect increased

lipophilicity, which results in higher toxicity.
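A CT with several levels can be stored as nested data and applied by walking from the root to a leaf. The sketch below encodes the splits as drawn in Figure 13.4.a (both HOMO cut-offs negative); the data layout, the names and the example compound are hypothetical.

```python
# Each internal node is (descriptor, cut-off, subtree if value <= cut-off,
# subtree if value > cut-off); each leaf is a group label
# (V: very toxic, T: toxic, H: harmful, N: non-toxic).
DAPHNIA_DBU_TREE = (
    "logPoct", 3.82,
    ("logPoct", 1.91,
        ("HOMO", -10.07, "N", "H"),
        ("HOMO", -9.14, "H", "T")),
    ("MM", 424.3, "T", "V"),
)


def classify(tree, descriptors):
    """Walk the tree until a leaf (a group label string) is reached."""
    node = tree
    while not isinstance(node, str):
        name, cutoff, low_branch, high_branch = node
        node = low_branch if descriptors[name] <= cutoff else high_branch
    return node


# Hypothetical compound: moderately lipophilic with a relatively high HOMO.
example = {"logPoct": 2.5, "HOMO": -9.0, "MM": 180.0}
group = classify(DAPHNIA_DBU_TREE, example)
```

Keeping the tree as data rather than as nested if-statements allows the same function to be reused for the other CTs by supplying a different tuple.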

Another classification tree, which included logPoct and LmH, was obtained by using the

CART splitting method (a priori probabilities proportional to group sizes) (CT 3, Table

13.8). The tree is presented in Figure 13.4.b. It can be seen that according to this CT the

cut-off value for logPoct that separates the group of non-toxic and harmful compounds

from the group of toxic and very toxic compounds is 3.12. Additionally, the group of

toxic and very toxic compounds is separated by another cut-off value for logPoct, equal to

4.91. This supports the suggestion that the increased toxicity of the very toxic compounds

is related to increased lipophilicity, rather than to the higher molecular mass of these

compounds, which appeared as a descriptor in the CT obtained by the DBU splitting (CT

2, Table 13.8; Figure 13.4.a) (see above). The non-toxic and harmful compounds were

separated by LmH, with a cut-off value of 8.87. LmH might reflect the activation energy

of the compounds. From Figure 13.4.b it can be seen that smaller values of LmH,

corresponding to smaller activation energy, tend to be associated with increased toxicity.

In Table 13.8, CTs 2 and 3, the percentages of correct classifications obtained by the two

CTs for toxicity to D. magna are presented. It can be seen that the overall classification

accuracy of the two models is very similar (approximately 51% correct classification), but

the models differ in which groups they discriminate best: the model obtained by

DBU splitting performs better for the groups of toxic and harmful compounds, while the

model derived by CART splitting predicts the groups of very toxic and non-toxic

compounds better.
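The distinction between overall and per-group accuracy can be made concrete with a small sketch. The function name and the observed and predicted labels below are hypothetical and are not taken from the data set used in this study.

```python
from collections import Counter


def percent_correct(observed, predicted):
    """Return (per-group % correct, overall % correct)."""
    totals = Counter(observed)
    hits = Counter(o for o, p in zip(observed, predicted) if o == p)
    per_group = {g: 100.0 * hits[g] / totals[g] for g in totals}
    overall = 100.0 * sum(hits.values()) / len(observed)
    return per_group, overall


# Hypothetical observed and predicted groups for eight compounds.
observed = ["V", "V", "T", "T", "H", "H", "N", "N"]
predicted = ["V", "T", "T", "T", "N", "H", "N", "N"]
per_group, overall = percent_correct(observed, predicted)
```

Two models can share the same overall percentage while differing considerably in the per-group values, which is why both are reported in Tables 13.8 and 13.9.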

When models for the two-group classification (dangerous and non-toxic compounds)

were developed, better results were obtained by using CART splitting. The CTs obtained

with a priori probabilities set both equal and proportional to the group sizes are presented

in Figure 13.5. It can be seen that similar CTs were obtained by using the two types of a

priori probabilities. They included logPoct and LmH as discriminating variables.

However, the cut-off value for logPoct is smaller when proportional a priori probabilities

were applied (a cut-off value of 1.63 for proportional a priori probabilities, and a cut-off

value of 3.12 for equal a priori probabilities). Again, higher values of logPoct and smaller

values of LmH favour toxicity. The percentages of correct classification are presented in

Table 13.9, CT 3. A better classification was obtained with a priori probabilities

proportional to the group sizes than with the corresponding model obtained by the DA

(Model 3, Table 13.7). As can be seen from Table 13.9, CT 3, setting the a priori

probabilities proportional to the group sizes favoured a more accurate prediction of the

group membership of dangerous chemicals. Understandably, the logPoct cut-off value

favouring classification of more compounds as dangerous (1.63) is smaller than the cut-

off value resulting in assigning more compounds as non-toxic (3.12), since the dangerous

compounds are located in the direction of higher values along the logPoct scale.

Classification trees for toxicity to the fish O. mykiss (rainbow trout)

The CT for the four-group classification of compounds according to their toxicity to the

fish O. mykiss (obtained by applying DBU splitting and a priori probabilities proportional

to the group sizes) included logPoct and ΣE-state (CT 4, Table 13.8). A similar accuracy of

classification was observed to that obtained by DA (Model 4, Table 13.6). The CT is

presented in Figure 13.6. It can be seen that again (as in the CT for toxicity to D. magna

obtained by CART splitting, Figure 13.4.b), the non-toxic and harmful compounds are

separated from the toxic and very toxic compounds by a logPoct cut-off value of 3.39.

Additionally, the non-toxic and harmful compounds are separated by a logPoct cut-off value

of 0.70. The sum of the E-state indices (ΣE-state) separates the toxic from the very toxic

compounds by a cut-off value of 69.2 (larger values of ΣE-state favour toxicity). The

E-state indices account for the ability of a molecule to enter into non-covalent

intermolecular interactions (Rose et al., 2002), and thus they encode factors that could

influence interactions with biological macromolecules.

When the classification into dangerous and non-toxic compounds to the fish O. mykiss

was investigated, similar results were obtained with both DBU and CART splitting

methods (CTs 4 and 5, Table 13.9). The CTs included one split by logPoct. When the a

priori probabilities were set proportional to group sizes, the cut-off value of logPoct was

0.81 (compounds having logPoct ≤ 0.81 were classified as non-toxic, while compounds

with logPoct > 0.81 were classified as dangerous) for the DBU splitting, and 1.37 for the

CART splitting. Applying equal a priori probabilities resulted in similar cut-off values of

logPoct for the two splitting methods (2.20 for DBU, and 2.17 for CART). A similar

accuracy of classification was obtained with the CT approach (CTs 4 and 5, Table 13.9)

and with DA (Model 4, Table 13.7). From Table 13.9, CTs 4 and 5, it can be seen that

setting the a priori probabilities proportional to the group sizes resulted in better overall

classification, with the group of dangerous compounds being classified more accurately

than the group of non-toxic compounds. As described above, when proportional a priori

probabilities were used, the logPoct cut-off values obtained were different for the DBU

and the CART splitting methods. As can be seen from the percentages of correct

classifications obtained with the two splitting methods (CTs 4 and 5, Table 13.9) the

logPoct cut-off value obtained by DBU splitting (0.81) allows for better classification of

the dangerous compounds, while if the logPoct cut-off value obtained by CART splitting

(1.37) is used, more compounds will be correctly classified as non-toxic. Again, the

logPoct cut-off value favouring classification of more compounds as dangerous (0.81) is

smaller than the cut-off value resulting in assigning more compounds as non-toxic (1.37).

Classification trees for toxicity to the fish B. rerio (zebra fish)

CTs discriminating between the four groups of toxicity to the fish B. rerio could not be

obtained with any of the splitting methods (Table 13.8). CTs were obtained only for the

two-group classification (dangerous and non-toxic compounds) (CTs 6 and 7, Table

13.9). A CT obtained by applying CART splitting and a priori probabilities proportional

to the group sizes (CT 6, Table 13.9) contained the same variables that appeared in the

DA model for this endpoint (Model 5, Table 13.7), namely logPoct and PW2. Similar

percentages of correct classification were obtained with both CT analysis and DA, with

the CT classifying the group of dangerous compounds slightly better, at the expense of

the non-toxic compounds. The CT is presented in Figure 13.7. It can

be seen that compounds with logPoct ≤ 2.41 and PW2 > 0.576 are classified as non-toxic,

while compounds with logPoct ≤ 2.41 and PW2 ≤ 0.576, or having logPoct > 2.41 are

classified as dangerous. Generally, the values of PW2 are higher for more branched and

spherical molecules. Thus, less branched molecules with more extended shapes appear to

be more toxic.

When CART splitting with equal a priori probabilities was applied, the CT for the two-

group classification of toxicity to B. rerio included only logPoct (CT 7, Table 13.9). The

descriptor had the same cut-off value (2.41) as the CT obtained by applying CART and

proportional a priori probabilities (thus, compounds with logPoct ≤ 2.41 were classified as

non-toxic; compounds with logPoct > 2.41 were classified as dangerous). The percentages

of correct classification are presented in Table 13.9, CT 7. From Table 13.9, CTs 6 and 7,

it can be seen that, again, setting the a priori probabilities proportional to the group sizes

results in a more accurate prediction of the dangerous compounds.

In summary, a slightly higher classification accuracy was obtained with the CT approach

than with DA. However, CTs for some of the endpoints could not be obtained

(classification of compounds into four groups of toxicity to the alga S. subspicatus and the

fish B. rerio, Table 13.8). Again, logPoct appeared as the first discriminating variable in all

CTs. A logPoct cut-off value of approximately 2 was observed for discrimination between

dangerous and non-toxic compounds. Other descriptors in the CTs, related to electronic

properties (HOMO, LmH, ΣE-state), H-bond donor ability, molecular topology (ΣE-state, 2χ,

PW2) and size (MM), suggested that a decreased number of H-bond donors, decreased

branching, and increased compound reactivity and size are favourable for toxicity.

13.4.5. Correlation between measured and calculated logPoct values

Although the correlation between the measured and calculated logPoct values was

statistically significant (Equation 13.1), the correlation coefficient is not high; in addition,

the slope of the line differs from 1 and the intercept differs from 0. However, from

Figure 13.8 (a plot of calculated against measured logPoct values), it can be

seen that the compounds are evenly distributed around the regression line. The low

correlation obtained might be due both to differences in the experimental protocols and to

computational errors. The experimental data were obtained from different laboratories, by

using two different experimental methods, i.e. the shake-flask method and the

high-performance liquid chromatography (HPLC) method. Additionally, the measurements

were obtained at different temperatures.
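The quantities discussed here (slope, intercept and squared correlation coefficient of calculated against measured values) can be obtained by ordinary least squares. The sketch below is illustrative only: Equation 13.1 is not reproduced, the function name is hypothetical, and the logPoct values are invented for the example.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, with R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical measured and calculated logPoct values.
measured = [0.5, 1.2, 2.0, 2.9, 3.5, 4.4]
calculated = [0.9, 1.1, 2.4, 2.6, 3.9, 4.0]
slope, intercept, r2 = linear_fit(measured, calculated)
```

Perfect agreement between calculated and measured values would give a slope of 1, an intercept of 0 and R2 = 1; deviations from these values indicate systematic or random disagreement.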

13.4.6. Assessment of the data used

The data used in this study are properties of New Chemicals, which are developed by

industry for specific industrial or commercial applications. Some chemicals are complex

molecules, which, despite possessing the structural characteristics for assignment to a

particular toxicity mechanism, might not be able to reach and/or interact with the

biological targets and express their toxic effects. Alternatively, due to their complexity,

these chemicals might act by several mechanisms.

One reason for the limited goodness-of-fit of the baseline relationships and the QSARs

could be that the data were collected from different laboratories under

different experimental conditions. While the laboratories used official testing methods

(Annex V of Directive 67/548/EEC), there are still several sources of variability between

laboratories, due to possible differences in experimental protocols. These are discussed in

the following paragraphs.

In general, the recommendations regarding the origin and number of test animals, the

number of chemical concentrations used to derive the dose-response curves, and

equipment used are not very strict, using statements such as “preferably” or “at least”.

According to the recommendations, testing should be performed without adjustment of

the pH of the test medium. However, when there is evidence of a marked change in the

pH, it is advised that the test should be repeated with pH adjustment and the results

reported. In some cases, the toxic effect might be due to the extreme pH conditions, rather

than to the toxicity of the chemical itself. The pH correction is not always explicitly

mentioned in the results presented in the database.

A particular problem arises with unstable, poorly soluble and volatile chemicals.

Maintenance of their concentrations is difficult and this could result in great differences

between the laboratories. For the purposes of this study, wherever such a problem was

recorded in the database, the chemical was excluded from the analyses. However, such

problems are not always recorded.

In the Daphnia test, the assessment of the endpoint (immobility) is subjective, which

could lead to differences in results between different laboratories. In the fish toxicity test,

different types of water supply (e.g. drinking water, reconstituted water) may be used, and

it is difficult to ensure that the water has no contaminants that could interfere with the

toxicological effects of the test substance. Furthermore, the option to conduct fish testing

with three types of procedures (static, semi-static, flow-through) is likely to introduce

additional interlaboratory variability in the data.

In the Daphnia and fish tests, data are evaluated by plotting the percentage mortality

against concentration on a logarithmic scale. This may be done manually or by using a

computer program, which could lead to different results between laboratories.
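One simple way such an evaluation might be carried out is linear interpolation of percentage mortality against the logarithm of concentration. The sketch below is only one possible procedure, not the one prescribed by the test method; the function name and the data are hypothetical. Other procedures, such as probit analysis, would generally give a somewhat different value, which illustrates the source of variability described above.

```python
import math


def lc50_by_interpolation(concentrations, mortality_percent):
    """Estimate LC50 by linear interpolation of % mortality against
    log10(concentration) between the two tested concentrations that
    bracket 50% mortality. Returns None if 50% is not bracketed."""
    points = sorted(zip(concentrations, mortality_percent))
    for (c_low, m_low), (c_high, m_high) in zip(points, points[1:]):
        if m_low <= 50.0 <= m_high and m_low != m_high:
            fraction = (50.0 - m_low) / (m_high - m_low)
            log_lc50 = math.log10(c_low) + fraction * (
                math.log10(c_high) - math.log10(c_low))
            return 10 ** log_lc50
    return None


# Hypothetical test: three concentrations (mg/l) and observed % mortality.
lc50 = lc50_by_interpolation([1.0, 10.0, 100.0], [10.0, 30.0, 70.0])
```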

In the algal test, there are possible differences in the choice of test equipment (e.g., the

volume of the test flasks), illumination (e.g. different type of lamps), and method for

measuring cell density. The recommendations state that cell density should be measured

by direct cell counting. However, other methods (photometry,

turbidimetry) may also be used. An additional difficulty when interpreting the results of

the algal test is that significant amounts of the test substance may be incorporated into the

algal biomass during the period of the test. The recommendations do not specify how the

resulting uncertainty in exposure should be taken into account in reporting the test results.


The study reported in this chapter is based on a large data set for fish, Daphnia and algal

toxicity extracted from the New Chemicals Database of the European Union. In the study,

correlations between the different endpoints were derived. A strong correlation was

observed between the inhibition of algal growth and algal growth rate, which suggests

that testing only one of these endpoints should be sufficient for the regulatory assessment

of algal toxicity. Two fish and two algal species were investigated, and, interestingly,

toxicity to the fish O. mykiss correlated well with toxicity to one of the algal species (S.

subspicatus), but correlated poorly with the other algal species (S. capricornutum), while

for the other fish species (B. rerio) this relation was the opposite, i.e. toxicity to B. rerio

was well correlated with toxicity to the alga S. capricornutum, and poorly correlated with

toxicity to the alga S. subspicatus.

An attempt was made to define baseline toxicity effects. Although general trends for

linear toxicity/logPoct relationships were observed for the non-polar narcotics, with most

of the remaining compounds being placed above the baselines, the derived relationships

did not have high statistical fit. Also, QSARs with moderate statistical fit were only

obtained for the alga S. subspicatus and the fish O. mykiss. They confirmed the

importance of lipophilicity, encoded by logPoct, for toxicity.

Classification QSAR models were obtained by applying the EU classification scheme

(very toxic, toxic, harmful, and non-toxic compounds to the aquatic species). In general,

the models correctly classified approximately 55% of the compounds, i.e. the

classification models resulted in an approximately 25% improvement of the classification

compared with that expected by chance. Additionally, a two-group classification into

dangerous and non-toxic compounds was investigated, with the group of dangerous

compounds being formed by uniting the groups of very toxic, toxic, and harmful

compounds from the EU classification scheme. The percentage of correct classifications

by the classification models was between 70% and 83%, i.e. the improvement over the

percentage of correct classifications expected by chance was approximately 20%. The

classification models included molecular descriptors allowing for mechanistic

interpretation. The models are easily applicable for predicting the group membership of

new compounds. All classification models included logPoct as a discriminating variable. A

logPoct cut-off value of approximately 2 was observed for discrimination between

dangerous and non-toxic compounds. Additionally, electronic features (HOMO, LUMO,

energy difference between LUMO and HOMO), H-bond donor ability, topological

properties and factors encoding molecular volume, shape, and size appeared to influence

the assignment of compounds to a given toxicity group.
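The percentage of correct classifications expected by chance is calculated in this thesis with Equation 5.20 (Chapter 5), which is not reproduced in this chapter. As an assumption, the sketch below uses the proportional chance criterion (the sum of the squared group proportions). With the group sizes for S. subspicatus taken from Table 13.5 (boundaries in µmol/l), it reproduces the chance values listed for this species in Tables 13.6 and 13.7. The function name is hypothetical.

```python
def percent_correct_by_chance(group_sizes):
    """Proportional chance criterion: 100 * sum of squared group proportions.

    Assumed here to correspond to Equation 5.20 of the thesis.
    """
    total = sum(group_sizes)
    return 100.0 * sum((n / total) ** 2 for n in group_sizes)


# Group sizes for the alga S. subspicatus (Table 13.5 a, µmol/l boundaries).
four_groups = [8, 14, 34, 26]        # V, T, H, N
two_groups = [8 + 14 + 34, 26]       # D (V + T + H united), N

chance_four = round(percent_correct_by_chance(four_groups), 1)
chance_two = round(percent_correct_by_chance(two_groups), 1)
```

The improvement over chance quoted in the text is then the difference between the observed percentage of correct classifications and this chance value.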

The complexity of the molecular structures of the chemicals included in the NCD might

be a reason for the moderate quality of the derived baseline relationships and QSAR

models. Another reason might be related to the experimental variability of the data, as

they were collected from different laboratories under diverse experimental conditions.

Thus, it appears that the data in the New Chemicals Database are not sufficiently

consistent to develop high quality QSARs.

An article based on this work has been published by Lessigiarska et al. (2004b).

13.5. Conclusions

There is a strong correlation between the inhibition of algal growth and algal growth rate.

Important molecular properties included in QSARs for toxicity to the alga S. subspicatus

and the fish O. mykiss were lipophilicity (logPoct), chemical reactivity (LUMO), and the

number of phenyl rings. The number of phenyl rings might suggest differences in the

toxic effect between aromatic and aliphatic chemicals.

Classification models for aquatic toxicity obtained by applying discriminant analysis and

the classification tree approach revealed lipophilicity (logPoct) as very important for toxicity

(consistent with literature data). A logPoct cut-off value of approximately 2 was observed

for discrimination between dangerous and non-toxic compounds. Additionally, electronic

properties (HOMO, LUMO, sum of E-state indices), H-bond donor ability, topological

and size/shape factors, and the number of cyano groups appeared to influence the toxicity

classification.

On the basis of this study, it appears that the data in the New Chemicals Database are not

sufficiently consistent to develop high quality QSARs for toxicity to algae, daphnids and

fish.

Figure 13.1. Plots of toxicity against logPoct for aquatic toxicity endpoints, showing

baseline toxicity effects (represented by the solid lines) †

[Figure: for each endpoint, two scatter plots of toxicity (vertical axis) against logPoct (horizontal axis). Plots “a”: non-polar narcotics. Plots “b”: all compounds.]

No Endpoint

1 Alga S. subspicatus log(1/EbC50)

2 Alga S. subspicatus log(1/ErC50)

3 D. magna log(1/EC50)

4 Fish O. mykiss log(1/LC50)

5 Fish B. rerio log(1/LC50)

† the numeration of the plots is according to the Equations of Table 13.1. Plots “a” show the position of the

baselines (solid lines) with respect to the non-polar narcotic compounds. Plots “b” include all compounds,

regardless of mechanism of action, and the position of the baseline is again presented

Figure 13.2. Classification tree for toxicity to the alga S. subspicatus when two compound

groups were investigated (dangerous (D) and non-toxic (N) compounds) (using DBU

splitting and equal a priori probabilities)

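[Tree diagram; the numbers in parentheses are the numbers of compounds in each node]

Node 1: logPoct ≤ 2.09?
    yes: Node 2 (34): NH-donors = 0?
        yes: Node 4 (12): D
        no: Node 5 (22): N
    no: Node 3 (48): D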

Figure 13.3. Classification tree for toxicity to the alga S. capricornutum when four

compound groups were investigated (very toxic (V), toxic (T), harmful (H) and non-toxic

(N) compounds) (using CART splitting and a priori probabilities proportional to the

group sizes)

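[Tree diagram; the numbers in parentheses are the numbers of compounds in each node]

Node 1: logPoct ≤ 2.31?
    yes: Node 2 (54): N
    no: Node 3 (57): 2χ ≤ 8.955?
        yes: Node 4 (36): logPoct ≤ 3.01?
            yes: Node 6 (19): H
            no: Node 7 (17): T
        no: Node 5 (21): V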

Figure 13.4. Classification tree for toxicity to D. magna when four compound groups

were investigated (very toxic (V), toxic (T), harmful (H) and non-toxic (N) compounds)

a) using DBU splitting and a priori probabilities proportional to the group sizes

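[Tree diagram; the numbers in parentheses are the numbers of compounds in each node]

Node 1: logPoct ≤ 3.82?
    yes: Node 2 (293): logPoct ≤ 1.91?
        yes: Node 4 (147): HOMO ≤ -10.07?
            yes: Node 8 (45): N
            no: Node 9 (102): H
        no: Node 5 (146): HOMO ≤ -9.14?
            yes: Node 10 (105): H
            no: Node 11 (41): T
    no: Node 3 (121): MM ≤ 424.3?
        yes: Node 6 (95): T
        no: Node 7 (26): V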

b) using CART splitting and a priori probabilities proportional to the group sizes

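[Tree diagram; the numbers in parentheses are the numbers of compounds in each node]

Node 1: logPoct ≤ 3.12?
    yes: Node 2 (244): LmH ≤ 8.87?
        yes: Node 4 (87): H
        no: Node 5 (157): N
    no: Node 3 (170): logPoct ≤ 4.91?
        yes: Node 6 (107): T
        no: Node 7 (63): V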

Figure 13.5. Classification tree for toxicity to D. magna when two compound groups were

investigated (dangerous (D) and non-toxic (N) compounds)

a) using CART splitting and a priori probabilities proportional to the group sizes

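[Tree diagram; the numbers in parentheses are the numbers of compounds in each node]

Node 1: logPoct ≤ 1.63?
    yes: Node 2 (125): LmH ≤ 8.89?
        yes: Node 4 (38): D
        no: Node 5 (87): N
    no: Node 3 (289): D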

b) using CART splitting and a priori probabilities equal in the two groups

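[Tree diagram; the numbers in parentheses are the numbers of compounds in each node]

Node 1: logPoct ≤ 3.12?
    yes: Node 2 (244): LmH ≤ 8.87?
        yes: Node 4 (87): D
        no: Node 5 (157): N
    no: Node 3 (170): D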

Figure 13.6. Classification tree for toxicity to the fish O. mykiss (rainbow trout) when four

compound groups were investigated (very toxic (V), toxic (T), harmful (H) and non-toxic

(N) compounds) (using DBU splitting and a priori probabilities proportional to the group

sizes)

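[Tree diagram; the numbers in parentheses are the numbers of compounds in each node]

Node 1: logPoct ≤ 3.39?
    yes: Node 2 (151): logPoct ≤ 0.70?
        yes: Node 4 (35): N
        no: Node 5 (116): H
    no: Node 3 (77): ΣE-state ≤ 69.2?
        yes: Node 6 (59): T
        no: Node 7 (18): V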

Figure 13.7. Classification tree for toxicity to the fish B. rerio when two compound

groups were investigated (dangerous (D) and non-toxic (N) compounds), using CART

splitting and a priori probabilities proportional to the group sizes

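[Tree diagram; the numbers in parentheses are the numbers of compounds in each node]

Node 1: logPoct ≤ 2.41?
    yes: Node 2 (43): PW2 ≤ 0.576?
        yes: Node 4 (20): D
        no: Node 5 (23): N
    no: Node 3 (67): D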

Figure 13.8. Correlation between measured and calculated logPoct values for 352

compounds taken from the New Chemicals Database

Table 13.1. Correlations between the aquatic toxicity endpoints †

No   Regression equation                                                n    R2     s      F
1 a  log(1/EbC50) = 0.940 (± 0.013) log(1/ErC50) + 0.343 (± 0.020)      443  0.919  0.289  4777
2    log(1/EC50)DM = 0.629 (± 0.063) log(1/EbC50)SS - 0.429 (± 0.075)   152  0.402  0.763  101
3    log(1/EC50)DM = 0.638 (± 0.059) log(1/EbC50)SC - 0.515 (± 0.067)   199  0.375  0.840  118
4    log(1/LC50)OM = 0.795 (± 0.079) log(1/EbC50)SS - 0.250 (± 0.097)   77   0.574  0.682  101
5    log(1/LC50)OM = 0.432 (± 0.073) log(1/EbC50)SC - 0.673 (± 0.082)   109  0.246  0.769  34.7
6    log(1/LC50)BR = 0.524 (± 0.134) log(1/EbC50)SS - 0.800 (± 0.161)   41   0.282  0.802  15.3
7    log(1/LC50)BR = 1.00 (± 0.160) log(1/EbC50)SC - 0.500 (± 0.142)    21   0.674  0.549  39.3
8    log(1/LC50)OM = 0.773 (± 0.029) log(1/EC50)DM - 0.266 (± 0.038)    360  0.665  0.631  709
9    log(1/LC50)BR = 0.634 (± 0.062) log(1/EC50)DM - 0.637 (± 0.089)    146  0.424  0.713  105

† the species were encoded as subscripts to the endpoints as follows: BR – fish B. rerio; DM – D. magna;

OM – fish O. mykiss; SC – alga S. capricornutum; SS – alga S. subspicatus

a correlation between the two algal endpoints regardless of the species
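For a regression with a single predictor, the F statistic follows from n and R2 as F = R2 (n - 2) / (1 - R2). The sketch below applies this standard relationship to three rows of Table 13.1 as a consistency check; the function name is hypothetical, and small deviations for other rows would be expected because the tabulated R2 values are rounded.

```python
def f_statistic(r_squared, n, predictors=1):
    """F statistic of a least-squares regression with an intercept."""
    return (r_squared / predictors) / ((1.0 - r_squared) / (n - predictors - 1))


# (n, R2) as listed in Table 13.1 for Equations 2, 3 and 4.
f_eq2 = round(f_statistic(0.402, 152))
f_eq3 = round(f_statistic(0.375, 199))
f_eq4 = round(f_statistic(0.574, 77))
```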

Table 13.2. Baseline toxicity relationships for five aquatic toxicity endpoints

No  Endpoint                          Regression equation                       n   R2     s      F
1   Alga S. subspicatus log(1/EbC50)  0.421 (± 0.122) logPoct - 3.38 (± 0.371)  8   0.665  0.345  11.9
2   Alga S. subspicatus log(1/ErC50)  0.319 (± 0.109) logPoct - 3.34 (± 0.282)  6   0.680  0.412  8.48
3   D. magna log(1/EC50)              0.376 (± 0.047) logPoct - 2.95 (± 0.157)  56  0.539  0.639  63.0
4   Fish O. mykiss log(1/LC50)        0.469 (± 0.062) logPoct - 3.28 (± 0.197)  34  0.640  0.499  57.0
5   Fish B. rerio log(1/LC50)         0.349 (± 0.091) logPoct - 3.07 (± 0.292)  19  0.465  0.454  14.8
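The baseline relationships can be used to estimate the baseline toxicity of a compound from its logPoct and to check whether an observed value lies above the baseline. The sketch below uses the coefficients of Table 13.2; the dictionary layout, the function names and the example values are hypothetical.

```python
# Slope and intercept of each baseline relationship (Table 13.2).
BASELINES = {
    "Alga S. subspicatus, log(1/EbC50)": (0.421, -3.38),
    "Alga S. subspicatus, log(1/ErC50)": (0.319, -3.34),
    "D. magna, log(1/EC50)": (0.376, -2.95),
    "Fish O. mykiss, log(1/LC50)": (0.469, -3.28),
    "Fish B. rerio, log(1/LC50)": (0.349, -3.07),
}


def baseline_toxicity(endpoint, logp_oct):
    """Baseline toxicity predicted for the given endpoint and logPoct."""
    slope, intercept = BASELINES[endpoint]
    return slope * logp_oct + intercept


def excess_over_baseline(endpoint, logp_oct, observed):
    """Positive values mean the compound is more toxic than the baseline."""
    return observed - baseline_toxicity(endpoint, logp_oct)


# Hypothetical compound: logPoct = 2.0, observed log(1/LC50) = -1.5.
predicted = baseline_toxicity("Fish O. mykiss, log(1/LC50)", 2.0)
excess = excess_over_baseline("Fish O. mykiss, log(1/LC50)", 2.0, -1.5)
```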

Table 13.3. QSARs for algal and fish toxicity

No  Endpoint                          Regression equation                                              n    R2     s      F     Q2
1   Alga S. subspicatus log(1/EbC50)  0.345 (± 0.077) logPoct + 1.00 (± 0.179) NPh - 3.11 (± 0.205)    50   0.601  0.769  35.4  0.543
2   Alga S. subspicatus log(1/ErC50)  0.464 (± 0.091) logPoct + 0.971 (± 0.180) NPh - 3.57 (± 0.236)   33   0.709  0.691  36.3  0.655
3   Fish O. mykiss log(1/LC50)        0.369 (± 0.045) logPoct - 0.303 (± 0.074) LUMO - 2.53 (± 0.145)  122  0.417  0.884  42.6  0.371

Table 13.4. Number of investigated compounds (n), average molecular mass (MMav) and

upper classification boundaries for group V (very toxic compounds), converted to units

µmol/l for each species investigated*

Species                          n    MMav   Upper boundary for  Upper boundary for
                                             group V (mg/l)      group V (µmol/l)
Alga S. subspicatus              82   236.1  1.000               4.235
Alga S. capricornutum            111  245.7  1.000               4.070
D. magna                         414  254.8  1.000               3.925
Fish O. mykiss (Rainbow trout)   228  254.2  1.000               3.934
Fish B. rerio (Zebra fish)       110  246.5  1.000               4.057

* the upper boundaries in units mg/l for the groups T (toxic compounds) and H (harmful compounds) are

derived by multiplying the upper boundary for group V (1 mg/l) by 10 and 100, respectively
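The values in the last column follow from dividing the boundary in mg/l by the average molecular mass. A minimal sketch of this conversion, with hypothetical function and variable names, is given below for S. subspicatus; the boundaries for groups T and H are obtained as described in the footnote.

```python
def mg_per_l_to_umol_per_l(conc_mg_per_l, molar_mass_g_per_mol):
    """Convert a concentration from mg/l to µmol/l."""
    return conc_mg_per_l / molar_mass_g_per_mol * 1000.0


MM_AV_S_SUBSPICATUS = 236.1  # average molecular mass from Table 13.4

# Upper boundaries in mg/l: group V, then 10x (group T) and 100x (group H).
boundaries_mg_per_l = {"V": 1.0, "T": 10.0, "H": 100.0}
boundaries_umol_per_l = {
    group: mg_per_l_to_umol_per_l(value, MM_AV_S_SUBSPICATUS)
    for group, value in boundaries_mg_per_l.items()
}
upper_v = round(boundaries_umol_per_l["V"], 3)
```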

Table 13.5. Classification of compounds with classification boundaries set both in units

mg/l and µmol/l for each species*

a) alga S. subspicatus

Groups       V (mg/l)  T (mg/l)  H (mg/l)  N (mg/l)  Total
V (µmol/l)   8         0         0         0         8
T (µmol/l)   0         13        1         0         14
H (µmol/l)   0         3         30        1         34
N (µmol/l)   0         0         0         26        26
Total        8         16        31        27        82

b) alga S. capricornutum

Groups       V (mg/l)  T (mg/l)  H (mg/l)  N (mg/l)  Total
V (µmol/l)   13        2         0         0         15
T (µmol/l)   0         22        3         0         25
H (µmol/l)   0         5         26        3         34
N (µmol/l)   0         0         3         34        37
Total        13        29        32        37        111

c) D. magna

Groups       V (mg/l)  T (mg/l)  H (mg/l)  N (mg/l)  Total
V (µmol/l)   57        3         0         0         60
T (µmol/l)   11        100       3         0         114
H (µmol/l)   0         7         130       13        150
N (µmol/l)   0         0         19        71        90
Total        68        110       152       84        414

d) fish O. mykiss (Rainbow trout)

Groups       V (mg/l)  T (mg/l)  H (mg/l)  N (mg/l)  Total
V (µmol/l)   36        4         0         0         40
T (µmol/l)   2         56        0         0         58
H (µmol/l)   0         5         54        12        71
N (µmol/l)   0         0         9         50        59
Total        38        65        63        62        228

e) fish B. rerio (Zebra fish)

Groups       V (mg/l)  T (mg/l)  H (mg/l)  N (mg/l)  Total
V (µmol/l)   8         1         0         0         9
T (µmol/l)   1         17        0         0         18
H (µmol/l)   0         6         32        8         46
N (µmol/l)   0         0         3         34        37
Total        9         24        35        42        110

* the groups were encoded as follows: V – very toxic compounds; T – toxic compounds; H – harmful

compounds; N – non-toxic compounds

Table 13.6. Discriminant models for the four-group classification of aquatic toxicity (very

toxic (V), toxic (T), harmful (H), and non-toxic (N) compounds)

                                                                                 % correctly classified compounds
No  Species                % correct      Model variables      Wilks λ  F     group V  group T  group H  group N  total
                           by chance a
1   Alga S. subspicatus    31.1           logPoct b, NC#N      0.583    7.95  37.5     42.9     73.5     38.5     53.7
2   Alga S. capricornutum  27.4           logPoct b, Nrings6   0.783    4.60  26.7     40.0     41.2     83.8     53.2
3   D. magna               27.5           logPoct c, HOMO      0.609    38.4  41.7     36.0     70.7     38.9     50.0
4   Fish O. mykiss         25.9           logPoct c, LUMO      0.592    22.3  37.5     43.1     67.6     55.9     53.1
5   Fish B. rerio          32.1           logPoct c, PW2 b     0.567    11.5  33.3     5.6      76.1     73.0     60.6

a calculated by using Equation 5.20, Chapter 5, Section 5.4 “Linear discriminant function analysis”

b calculated with the DRAGON software. DRAGON computes logPoct by using two methods; the one used

here is calculated according to the Ghose-Crippen-Viswanadhan model (Ghose and Crippen, 1986;

Viswanadhan et al., 1989)

c calculated with the KOWWIN program

Table 13.7. Discriminant models for the two-group classification of aquatic toxicity (dangerous (D) and non-toxic (N) compounds)

                                                                                          a priori probabilities        a priori probabilities
                                                                                          proportional to group sizes   equal
No  Species                 % correct by chance^a  Model variables         Wilks λ  F     group D   group N   total     group D   group N   total
1   Alga S. subspicatus     56.7                   logPoct^b, NH-donors^c  0.829    8.16  87.5      46.2      74.4      73.2      65.4      70.7
2   Alga S. capricornutum   55.6                   logPoct^d               0.867    16.7  94.6      29.8      73.0      70.3      81.1      73.9
3   D. magna                66.0                   logPoct^b, HOMO         0.801    51.0  93.5      25.6      78.7      75.6      73.3      75.1
4   Fish O. mykiss          61.6                   logPoct^b               0.732    82.7  95.3      44.1      82.0      74.6      81.4      76.3
5   Fish B. rerio           55.4                   logPoct^b, PW2^d        0.678    25.3  90.4      64.9      81.8      78.1      73.0      76.4

a calculated by using Equation 5.20, Chapter 5, Section 5.4 “Linear discriminant function analysis”
b calculated with the KOWWIN program
c calculated with the Tsar software
d calculated with the DRAGON software. DRAGON computes logPoct by using two methods; the one used here is calculated according to the Ghose-Crippen-Viswanadhan model (Ghose and Crippen, 1986; Viswanadhan et al., 1989)

Table 13.8. Summary of the classification tree analysis for the four-group classification of aquatic toxicity (very toxic (V), toxic (T), harmful (H), and non-toxic (N) compounds)

No  Species                 Splitting method^a  Model variables         group V  group T  group H  group N  total
    Alga S. subspicatus     No model obtained                           -        -        -        -        -
1   Alga S. capricornutum   CART                logPoct^b, 2χ^c         60.0     44.0     32.4     83.8     55.9
2   D. magna                DBU                 logPoct^d, HOMO, MM     28.3     55.3     69.3     32.2     51.4
3   D. magna                CART                logPoct^d, LmH          63.3     45.6     30.7     83.3     51.0
4   Fish O. mykiss          DBU                 logPoct^d, ΣE-state^c   30.0     53.4     76.1     44.1     53.9
    Fish B. rerio           No model obtained                           -        -        -        -        -

a CART – classification and regression trees; DBU – discriminant-based univariate splitting
b calculated with the DRAGON software. DRAGON computes logPoct by using two methods; the one used here is calculated according to the Ghose-Crippen-Viswanadhan model (Ghose and Crippen, 1986; Viswanadhan et al., 1989)
c calculated with the Tsar software
d calculated with the KOWWIN program

Table 13.9. Summary of the classification tree analysis for the two-group classification of aquatic toxicity (dangerous (D) and non-toxic (N) compounds)

                                                                      a priori probabilities        a priori probabilities
                                                                      proportional to group sizes   equal
No  Species                 Splitting method  Model variables         group D   group N   total     group D   group N   total
1   Alga S. subspicatus     DBU               logPoct^b, NH-donors^c  -         -         -         87.5      57.7      78.0
2   Alga S. capricornutum   CART              logPoct^d               71.6      81.1      74.8      68.9      83.8      73.9
3   D. magna                CART              logPoct^b, LmH          89.5      58.9      82.9      74.7      83.3      76.6
4   Fish O. mykiss          DBU               logPoct^b               94.7      47.5      82.5      70.4      84.7      74.1
5   Fish O. mykiss          CART              logPoct^b               89.9      61.0      82.5      72.8      84.7      75.9
6   Fish B. rerio           CART              logPoct^b, PW2^d        94.5      51.4      80.0      -         -         -
7   Fish B. rerio           CART              logPoct^b               -         -         -         78.1      73.0      76.4

b calculated with the KOWWIN program
c calculated with the Tsar software
d calculated with the DRAGON software. DRAGON computes logPoct by using two methods; the one used here is calculated according to the Ghose-Crippen-Viswanadhan model (Ghose and Crippen, 1986; Viswanadhan et al., 1989).

CHAPTER 14

INVESTIGATION OF ACUTE TOXICITY/CYTOTOXICITY

14.1. Objectives

The objective of the study described in this chapter was to investigate in vivo and in vitro

acute toxicity and cytotoxicity to a broad range of biological species. Toxicities to

bacterial strains, rodent and human cell lines, rodents and humans were included.

Similarities between the toxicity endpoints were investigated by using different

approaches such as principal components analysis and cluster analysis. Models for in vivo

toxicity based on a combination of other in vivo or in vitro endpoints and descriptors of

chemical structure (QSAAR models), and QSAR models were sought.

14.2. Methods

14.2.1. Toxicity data

The data set was taken from the MEIC (Multicentre Evaluation of In Vitro Cytotoxicity)

programme (Bondesson et al., 1989; Ekwall et al., 1998a). The MEIC programme aimed

to evaluate the relevance and reliability of in vitro tests for predicting acute in vivo

toxicity. As a part of the MEIC programme, a number of international laboratories

produced data for the toxicity of 50 reference chemicals to more than 60 in vitro systems

(bacterial, animal and human cells) in order to use them for predicting human acute

toxicity. Additionally, in vivo data for acute rat, mouse, and human toxicity of the same

chemicals were collected as a part of the MEMO programme (MEIC monographs on

time-related human lethal blood concentrations). Rat and mouse toxicities (rat LD50 and

mouse LD50 values representing lethal doses at which 50% lethality of the test animals is

observed) were collected from the NIOSH/Registry of Toxic Effects of Chemical

Substances (RTECS); human toxicity data were collected from publications of clinical

and forensic studies of individuals exposed to toxicants. A comparison of human toxicity

to the rat and mouse LD50 values, and in vitro-in vivo correlations, were obtained in the

MEIC programme (Ekwall et al., 1998b).

The MEIC data set is heterogeneous in terms of chemical structure and toxicity

mechanisms, including inorganic and simple organic chemicals, alkaloids and drugs.

In the present study, the data for rat LD50, mouse LD50 and human toxicity (negative log-

transformed, units of mmol/kg) for the 50 reference chemicals of the MEIC programme

were taken from Ekwall et al. (1998a). The human toxicity is represented as acute oral

lethal doses (HLD, negative log-transformed, mmol/kg), and approximate acute

blood/serum peak LC50 values (HAP, negative log-transformed, mmol/l). Ekwall et al.

(1998a) derived the HAP values by constructing an approximate blood/serum LC50 curve

against time after exposure, and interpolation at the time of the peak blood/serum

concentrations. For this purpose, data on human highest survival blood/serum

concentrations were plotted against time, and after consideration of known kinetic data,

LC100 time curves were constructed connecting the highest survival concentration data

points. Similarly, curves of LC0 blood/serum concentrations against time were

constructed, using data for the lowest lethal blood/serum concentrations. The approximate

LC50 curves were drawn as the geometrical mean of the LC100 and LC0 curves, and HAP

values were obtained at the time of peak of the blood/serum concentrations. However, for

some chemicals Ekwall et al. (1998a) could not construct approximate LC50 curves due to

insufficient data for the concentrations required for severe poisoning or lethality. For

these chemicals HAP values were taken from the LC0 or LC100 curves.
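
The construction of the approximate LC50 curve described above can be illustrated with a minimal Python sketch. It is not the procedure of Ekwall et al. (1998a): the curves, the time points and the use of linear interpolation between sampled points are hypothetical, and the decadic logarithm is assumed. Only the point-wise geometric mean of the LC0 and LC100 curves and the negative log transformation follow the description in the text.

```python
import math

def approx_lc50_curve(lc0, lc100):
    """Point-wise geometric mean of two curves sampled at the same time points."""
    if len(lc0) != len(lc100):
        raise ValueError("curves must be sampled at the same time points")
    return [math.sqrt(a * b) for a, b in zip(lc0, lc100)]

def hap_from_curve(times, lc50, t_peak):
    """Interpolate the LC50 curve linearly at t_peak and return -log10 of the value."""
    points = list(zip(times, lc50))
    for (t0, c0), (t1, c1) in zip(points, points[1:]):
        if t0 <= t_peak <= t1:
            conc = c0 + (c1 - c0) * (t_peak - t0) / (t1 - t0)
            return -math.log10(conc)
    raise ValueError("t_peak lies outside the sampled time range")

# Hypothetical curves (mmol/l) at 1, 2, 4 and 8 h after exposure.
times = [1.0, 2.0, 4.0, 8.0]
lc0 = [1.0, 2.0, 4.0, 2.0]       # assumed lowest lethal concentrations
lc100 = [4.0, 8.0, 16.0, 8.0]    # assumed highest survival concentrations
lc50 = approx_lc50_curve(lc0, lc100)
hap = hap_from_curve(times, lc50, t_peak=4.0)
```

Because the geometric mean is taken on the concentration scale, the resulting curve lies midway between the two boundary curves on a logarithmic axis, which matches the log-transformed units in which HAP is reported.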

Data for toxicity to the Chang human liver cell line (IC50 values, negative log-transformed

and converted to mmol/l, 24-h assay) were taken from the MEIC data base

(http://www.cctoxconsulting.a.se/). Data for toxicity to rat hepatocytes (IC50 values,

negative log-transformed, mmol/l, 24-h assay) were obtained from Shrivastava et al.

(1993). Data for the Biotox™ assay on the bacterium Vibrio fischeri (IC50 values,

negative log-transformed, mmol/l, 5-min assay) were taken from Kahru and Borchardt

(1994). Additionally, toxicity to the bacterium Sinorhizobium meliloti for the MEIC

chemicals was included in the investigation; the data were taken from Botsford (2002)

(IC50 values, negative log-transformed, mmol/l, 20-min assay).
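
The unit conversion and negative log transformation applied to the in vitro data can be sketched as follows. The function names and the example compound are hypothetical, and the decadic logarithm is assumed, since the text specifies only "negative log-transformed".

```python
import math

def to_mmol_per_l(conc_mg_per_l, molar_mass_g_per_mol):
    """A concentration in mg/l divided by the molar mass in g/mol gives mmol/l."""
    return conc_mg_per_l / molar_mass_g_per_mol

def neg_log(conc_mmol_per_l):
    """Negative decadic logarithm; larger values correspond to higher toxicity."""
    return -math.log10(conc_mmol_per_l)

# Hypothetical compound: IC50 = 2.0 mg/l, molar mass 200 g/mol, i.e. 0.01 mmol/l.
ic50_mmol = to_mmol_per_l(2.0, 200.0)
p_ic50 = neg_log(ic50_mmol)
```

Expressing all endpoints on a molar, negative logarithmic scale makes them directly comparable and ensures that a higher value always means a more toxic compound.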

To derive QSAARs and QSARs, inorganic chemicals, salts, xylene (which is a mixture of

isomers) and paraquat (which has a charge of +2) were excluded from the analyses,

resulting in a data set of 26 chemicals (their SMILES codes are given in Appendix B.3).

The investigated chemicals and their toxicities are presented in Table 14.1.


The following structural descriptors for the compounds were calculated: the logarithm of

the octanol-water partition coefficient (logPoct), boiling point, melting point, vapour

pressure, water solubility, and approximately 250 different quantum-chemical and

topological descriptors. LogPoct was calculated with the KOWWIN version 1.66 software

(Syracuse Research Corporation, Syracuse, NY, USA). Boiling points, melting points and

vapour pressures were calculated with the MPBPWIN version 1.40 software (Syracuse

Research Corporation, Syracuse, NY, USA). Aqueous solubilities were calculated with

WSKOWWIN version 1.40 (Syracuse Research Corporation, Syracuse, NY, USA).

KOWWIN version 1.66, MPBPWIN version 1.40 and WSKOWWIN version 1.40

software were downloaded from the web-site of the Environmental Protection Agency of

the United States (http://www.epa.gov/oppt/exposure/docs/episuitedl.htm). TSAR for

Windows version 3.3 (Accelrys Inc.) was used to calculate quantum-chemical and


(Accelrys Inc.) was used with the AM1 Hamiltonian. The Dragon version 5.0 software

(TALETE srl.) was used to calculate topological and charge descriptors. Some

calculations in Dragon required 3D chemical structures (the charge descriptors). For these

descriptors, the structures optimised previously in TSAR were used (imported into

Dragon as Mol2 files). The options to exclude constant and near-constant variables, as

well as one variable from a pair with intercorrelation coefficient greater than 0.95, were

applied when saving the calculated descriptors in Dragon.
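
The effect of such descriptor pruning can be illustrated with a short Python sketch. It does not reproduce the Dragon implementation: the rule for deciding which member of a highly correlated pair is dropped (here, the one encountered later) is an assumption, the near-constant criterion is omitted because its threshold is not stated in the text, and the descriptor names and values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def prune_descriptors(table, r_max=0.95):
    """table maps descriptor name -> values; returns the names kept, in input order."""
    kept = []
    for name, values in table.items():
        if len(set(values)) < 2:
            continue  # constant descriptor carries no information
        if any(abs(pearson_r(values, table[k])) > r_max for k in kept):
            continue  # too strongly correlated with a descriptor already kept
        kept.append(name)
    return kept

# Hypothetical descriptor table for five compounds.
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.1, 4.0, 6.2, 8.1, 9.9],   # almost proportional to d1
    "d3": [0.0, 0.0, 0.0, 0.0, 0.0],   # constant
    "d4": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = prune_descriptors(descriptors)
```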

Additionally, experimental values for logPoct (logPoctM) and logSaq (logSaqM) were

obtained from KOWWIN version 1.66 and WSKOWWIN version 1.40.

14.2.3. Statistical Analysis

Before the analysis, descriptors were retained only if at least

three of their values were different from zero. To reduce the number of intercorrelated

molecular descriptors, the algorithm developed by the author of this thesis was applied to

the descriptor data matrices (see Chapter 9).

Principal components analysis (PCA) and cluster analysis were performed using the

MINITAB version 14 software (Minitab Inc., State College, PA, USA). In the PCA, the

correlation matrix was used in the calculation of the principal components. Cluster

analysis was performed with the average linkage option and with distance measured by

using the Pearson correlation coefficients R between the variables.
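
The clustering of variables by average linkage on a correlation-based distance can be sketched in plain Python. This is not the MINITAB routine: the distance is assumed to be 1 - R, the stopping rule (a fixed number of clusters) is chosen only for illustration, and the four variables A to D are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_linkage(series, n_clusters):
    """Merge variables, using distance 1 - R, until n_clusters groups remain."""
    names = list(series)
    dist = {frozenset(pair): 1.0 - pearson_r(series[pair[0]], series[pair[1]])
            for pair in combinations(names, 2)}

    def avg(a, b):
        # Average of all pairwise distances between members of the two clusters.
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    clusters = [[n] for n in names]
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda ab: avg(*ab))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return clusters

# Hypothetical variables measured on five objects.
series = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [1.1, 2.3, 2.9, 4.2, 4.8],
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],
    "D": [4.8, 1.2, 4.1, 2.2, 2.9],
}
clusters = average_linkage(series, n_clusters=2)
```

With a correlation-based distance, variables are grouped by the similarity of their pattern across compounds rather than by their absolute magnitude, which is the property of interest when comparing toxicity endpoints.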

QSAAR and QSAR models were obtained by multiple linear regression. The program in

C code (best-subsets regression), developed by the author (see Chapter 9), was used to

select variables in the models. Two variables were included in the same model only if their

coefficient of intercorrelation R was less than 0.7.
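
A best-subsets search with this intercorrelation constraint can be sketched in Python. This is not the author's C program (see Chapter 9): ranking the subsets by R2, the function names and the data are assumptions made for illustration only.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(columns, y):
    """R2 of a least-squares fit of y on the given columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1.0 - ss_res / sum((yi - mean_y) ** 2 for yi in y)

def best_subsets(descriptors, y, size, r_max=0.7):
    """Rank all subsets of the given size whose pairwise |R| stays below r_max."""
    ranked = []
    for names in combinations(descriptors, size):
        if any(abs(pearson_r(descriptors[a], descriptors[b])) >= r_max
               for a, b in combinations(names, 2)):
            continue  # intercorrelated pair: not allowed in the same model
        ranked.append((r_squared([descriptors[n] for n in names], y), names))
    return sorted(ranked, reverse=True)

# Hypothetical data: x2 is nearly collinear with x1; y depends on x1 and x3.
descriptors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    "x3": [2.0, 1.0, 2.0, 1.0, 2.0, 1.0],
}
y = [8.0, 7.0, 12.0, 11.0, 16.0, 15.0]
ranked = best_subsets(descriptors, y, size=2)
```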

The models were validated by cross-validation and validation on external test sets. Leave-

one-out cross-validation was used to generate leave-one-out cross-validation coefficients

of determination (Q2). Also, leave-five-out cross-validation was performed ten times for

each model (randomly selecting compounds to be excluded at each step of the cross-

validation procedure) in order to investigate model robustness. Thus, ten values for the

leave-five-out cross-validation coefficient of determination (Q2(5)) were generated for

each model.
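
The two cross-validation schemes can be sketched for a one-descriptor model as follows. The definition Q2 = 1 - PRESS/SS_total with the overall mean in the denominator, the partitioning of the shuffled data into consecutive groups of five, and the data themselves are assumptions; the thesis models are multiple regressions fitted with other software.

```python
import random

def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def q2(x, y, folds):
    """Cross-validated Q2 = 1 - PRESS / SS_total for the given left-out index groups."""
    press = 0.0
    for left_out in folds:
        train = [i for i in range(len(x)) if i not in left_out]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        press += sum((y[i] - (slope * x[i] + intercept)) ** 2 for i in left_out)
    mean_y = sum(y) / len(y)
    return 1.0 - press / sum((v - mean_y) ** 2 for v in y)

def leave_one_out(n):
    return [[i] for i in range(n)]

def leave_k_out(n, k, rng):
    """Shuffle the indices and cut them into consecutive groups of k."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i:i + k] for i in range(0, n, k)]

# Hypothetical data set of 20 compounds with one descriptor.
rng = random.Random(0)
x = [float(i) for i in range(20)]
y = [0.5 * v + rng.uniform(-1.0, 1.0) for v in x]

q2_loo = q2(x, y, leave_one_out(len(x)))
q2_5 = [q2(x, y, leave_k_out(len(x), 5, rng)) for _ in range(10)]
q2_5_summary = (min(q2_5), sum(q2_5) / len(q2_5), max(q2_5))
```

The spread of the ten leave-five-out values indicates how strongly the model depends on which compounds happen to be left out.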

A simulation of validation by using external test sets was also performed. Five

compounds (referred to as “a test set”) were excluded from the data set and the models

were developed on the basis of the remaining (approximately 21) compounds. The

toxicities of the five compounds from the test set were predicted and the correctness of

the prediction was evaluated by calculating the mean absolute error (MAE) of the

predicted values of the five excluded compounds. As the compounds investigated

represented a heterogeneous data set, no particular group of compounds could be selected

as a test set. Therefore, for each model the five test compounds were randomly chosen

and the described procedure was repeated ten times. Thus, ten MAE values were

produced for each model.
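
The repeated random test-set procedure can be sketched in Python for a one-descriptor model. The data, the random seed and the simple linear model are hypothetical; only the design (five compounds held out, ten repetitions, MAE of the held-out predictions) follows the text.

```python
import random

def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def external_mae(x, y, test_size, rng):
    """Hold out a random test set, refit on the rest, return the MAE on the test set."""
    test = rng.sample(range(len(x)), test_size)
    train = [i for i in range(len(x)) if i not in test]
    slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
    return sum(abs(y[i] - (slope * x[i] + intercept)) for i in test) / test_size

# Hypothetical data set of 26 compounds; 21 remain for training in each repetition.
rng = random.Random(1)
x = [float(i) for i in range(26)]
y = [0.3 * v + rng.uniform(-0.5, 0.5) for v in x]

maes = [external_mae(x, y, 5, rng) for _ in range(10)]
mae_min, mae_mean, mae_max = min(maes), sum(maes) / len(maes), max(maes)
```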

The regression equations with the best regression parameters (R2 and s), cross-validation

parameters (Q2 and Q2(5)), and MAE values were selected. Additionally, models

containing variables allowing for a plausible biological interpretation were chosen from

the equations with best statistical fit. The statistical parameters of the selected models

were finally derived by using MINITAB version 14.

To identify model outliers the method of Tenekedjiev and Radojnova (2001) using

studentised residuals was applied (see Chapter 5, Section 5.1.3 “Observations with large

influence on the regression model, outliers”). An observation was classified as an outlier if

its studentised residual calculated from a regression model based on the remaining

observations did not lie within the calculated confidence interval corresponding to the 5%

significance level. This procedure was repeated two times (two loops), with the second

time excluding the observations classified as outliers in the first run.
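
A two-pass outlier screen of this kind can be sketched in Python. This is a generic deleted-residual t-test for a one-descriptor model, not the method of Tenekedjiev and Radojnova (2001), whose details are given in Chapter 5; the critical values are standard two-sided 5% Student t values for the degrees of freedom that occur in this small example, and the data are hypothetical.

```python
import math

# Two-sided 5% critical values of Student's t for selected degrees of freedom.
T_CRIT = {5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306}

def deleted_t(x, y, i):
    """Studentised residual of point i from a line fitted to all other points."""
    xs = [v for j, v in enumerate(x) if j != i]
    ys = [v for j, v in enumerate(y) if j != i]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((v - mx) ** 2 for v in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    # Standard error of predicting a new observation at x[i].
    se_pred = s * math.sqrt(1.0 + 1.0 / n + (x[i] - mx) ** 2 / sxx)
    return (y[i] - (slope * x[i] + intercept)) / se_pred, n - 2

def screen(x, y, loops=2):
    """Return the indices flagged as outliers after the given number of passes."""
    active = list(range(len(x)))
    flagged = []
    for _ in range(loops):
        xa, ya = [x[i] for i in active], [y[i] for i in active]
        new = []
        for pos, idx in enumerate(active):
            t, df = deleted_t(xa, ya, pos)
            if abs(t) > T_CRIT[df]:
                new.append(idx)
        flagged += new
        # The next pass is based only on observations not yet flagged.
        active = [i for i in active if i not in new]
    return flagged

# Hypothetical data: ten points close to y = x, with a gross error at index 4.
x = [float(i) for i in range(1, 11)]
noise = [0.1, -0.1, 0.05, -0.05, 4.0, 0.1, -0.1, 0.05, -0.05, 0.0]
y = [a + e for a, e in zip(x, noise)]
flagged = screen(x, y)
```

The second pass matters because a gross outlier inflates the residual standard deviation in the first pass and can thereby mask less extreme outliers.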

14.3. Results

The investigated data set contained eight toxicity endpoints, which enabled PCA and

cluster analysis to be applied to the endpoints alone to investigate similarities in the toxic

effects. For this purpose, chemicals that had missing toxicity values for one or more of the

endpoints were removed; thus, 30 chemicals remained for the PCA and cluster

analysis.

The first principal component obtained by the PCA had an eigenvalue of 6.34, and

explained 79.3% of the variance in the toxicity data, while the second component had a

much lower eigenvalue of 0.712, and explained only 8.9% of the data variability. Plots of

the loadings and scores of the first against the second component are given in Figure 14.1.
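
Because the PCA was based on the correlation matrix of eight endpoints, the total variance equals eight, and the explained fractions follow directly from the eigenvalues. The short check below uses the reported (rounded) eigenvalues; the function name is hypothetical.

```python
def explained_share(eigenvalue, n_variables):
    """Share of total variance in a correlation-matrix PCA (total variance = n_variables)."""
    return eigenvalue / n_variables

pc1 = explained_share(6.34, 8)     # close to the reported 79.3%
pc2 = explained_share(0.712, 8)    # the reported 8.9%
```

The small difference between 6.34/8 and the reported 79.3% is consistent with rounding of the eigenvalue to three significant figures.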

The grouping of the endpoints, obtained by the cluster analysis, is shown in the

dendrogram of Figure 14.2 and is discussed in Section 14.4 (“Discussion”).

As explained in Section 14.2.1 (“Toxicity data”) 26 chemicals were investigated by

QSAAR and QSAR analysis (after excluding the inorganic chemicals, salts, xylene and

paraquat). The calculated values of the descriptors of chemical structure for the 26

chemicals, used in the selected QSAAR and QSAR models, are presented in Table 14.2.

The abbreviations of the descriptors are explained in Table 12.2. The derived models are

presented in Table 14.3. Only descriptors that were statistically significant at the 95%

confidence level were included in the equations. To keep the table concise, only the

minimum, maximum, and average values of the ten leave-five-out

cross-validated Q2 coefficients (Q2(5)) and of the ten MAEs of each model are given in the

table.

In Figure 14.3, plots of observed against predicted values of HLD obtained from the

model with mouse LD50 values (Equation 1, Table 14.3.a, Figure 14.3.a), and observed

against predicted values of HAP obtained from the model with toxicity to human liver

cells (Equation 6, Table 14.3.a, Figure 14.3.b) are presented. From the figure it can be

seen that digoxin can have a large influence on the models for the human toxicity

endpoints, having a large residual in Equation 1 (Figure 14.3.a), or giving the good

correlation of Equation 6 (Figure 14.3.b). Therefore attempts were made to develop

models both including and excluding digoxin from the training set.

Similarly, a one-descriptor equation obtained with the molecular mass (MM) for the

toxicity to Chang human liver cells (Equation 25, Table 14.3.b) showed that digoxin

might have large influence on the correlation (Figure 14.4), and so correlations without it

were also attempted (Table 14.3.b).

From Table 14.3 it can be seen that the leave-one-out cross-validation Q2 values differed

from the corresponding regression R2 values by approximately 0.1 or less. Three

exceptions were observed – the QSAAR models for the mouse in vivo toxicity (Equations

13 and 14, Table 14.3.a) and Equation 27 (Table 14.3.b) for toxicity to Chang human

liver cells, where these differences were approximately 0.14.

From Table 14.3 it can be seen that, generally, the leave-five-out Q2(5) values were

smaller than the corresponding regression R2 values by approximately 0.2 or less.

The same exceptions were observed here – for the mouse in vivo QSAAR models

(Equations 13 and 14, Table 14.3.a) and Chang human liver cell models (Equations 27

and 28, Table 14.3.b) these differences were up to 0.43 (Equation 13, Table 14.3.a).

The leave-five-out Q2(5) values varied by less than 0.16 for a given model, with the

same exceptions (Equations 13 and 14, Table 14.3.a, and Equations 27 and 28, Table

14.3.b).

From Table 14.3 it can be seen that the MAE averages over the ten MAEs produced for

each model ranged from 0.26 to 0.79. The maximal MAEs ranged between 0.50 and 1.46.

The worst results with MAE averages higher than 0.60 and/or maximal MAEs higher than

0.90 were obtained for the following equations: the QSAAR Equations 1 and 6 (Table

14.3.a), where only a toxicity endpoint (without structural descriptors) was used as a

predictor variable; the QSAARs for the mouse in vivo toxicity (Equations 13 and 14,

Table 14.3.a); the QSARs for toxicity to Chang human liver cells (Equations 25, 26, 27,

and 28, Table 14.3.b); and the one-parameter Equation 30 (Table 14.3.b) for toxicity to rat

hepatocytes. These exceptions are discussed in Sections “Rodent in vivo toxicity” and

“Toxicity to Chang human liver cells” below.

14.4. Discussion

Eight endpoints including in vitro cytotoxicity and in vivo acute toxicity to bacterial

strains, rodent and human cell lines, rodents and humans were investigated. Similarities

between the toxicity endpoints were investigated and QSAAR and QSAR models were

developed.

14.4.1. Investigation of similarities between the toxicity endpoints

Principal Component Analysis

The PCA resulted in one component that explained a large part of the variance in the

toxicity data (79.3%). The second component was much less important, explaining only

8.9% of the variance. This result indicates that the toxicities are determined mainly by a

single factor, with other factors being less influential. The nature of this factor is possibly

related to cell toxicity, which directly determines the values of half of the investigated

endpoints (toxicity to Chang human liver cells, rat hepatocytes, S. meliloti, and Biotox

assay), and is a basis for the remaining in vivo toxicity endpoints.

From the loading plot of the first against the second component (Figure 14.1.a) it can be

seen that all toxicity endpoints have negative loadings in the first component (values

between –0.38 and –0.33). The loadings of the toxicity endpoints in the second

component (which explained only 8.9% of the toxicity variance) separated the toxicity

tests into in vitro and in vivo tests, with in vivo toxicity endpoints having positive

loadings, and the in vitro endpoints having negative loadings. The only exception was the

in vitro toxicity to human liver cells, which also had a positive loading, albeit smaller than

the loadings of the in vivo endpoints. Thus, the second factor accounts for differences

between the in vivo and in vitro endpoints. The QSAAR and QSAR models derived below

might give insight into the nature of the two factors.

From the score plot (Figure 14.1.b), no obvious grouping of chemicals can be discerned.

The chemicals having highest positive scores in the first component were simple organic

molecules and the salts of univalent metals, while the more complex organic molecules

and the salts of bivalent metals had smaller scores in the first component.

Cluster Analysis

From the dendrogram, obtained by the cluster analysis (Figure 14.2), it can be seen that

the human blood/serum LC50 concentration (HAP) is closely related to human liver cell

toxicity, whereas the oral dose causing lethality in humans (HLD values) is closely related

to the oral rat and mouse LD50. The remaining in vitro endpoints form a separate cluster.

These results suggest that the route of toxicant administration (and subsequent

toxicokinetics) determines similarities in the toxic effects between the different species.

Thus, the oral lethal doses for humans, rats and mice were related, whereas the

toxic concentrations in human blood/serum were related to the IC50 values of human liver

cells, which in vivo are exposed to the blood concentrations of the chemical.

14.4.2. QSAAR and QSAR models

Validation procedures

As described in Section 14.3 (“Results”), most of the models had differences between the

regression R2 values and the corresponding leave-one-out cross-validation Q2 values

smaller than or approximately 0.1, and differences between the regression R2 values and

the corresponding leave-five-out Q2(5) values smaller than or approximately 0.2. These

results showed that the models performed satisfactorily during leave-one-out and leave-

five-out cross-validation. The exceptions observed were the QSAAR models for the

mouse in vivo toxicity (Equations 13 and 14, Table 14.3.a) and the models for Chang

human liver cells (Equations 27 and 28, Table 14.3.b). Also, the ranges of the leave-five-

out Q2(5) values obtained for a given model were smaller than 0.16 (with the same

exceptions) suggesting stable model performance with respect to the observations

included.

With the exception of the QSAAR models for the mouse in vivo toxicity (Equations 13

and 14, Table 14.3.a), the models for Chang human liver cells (Equations 25 - 28, Table

14.3.b) and some one-parameter models (Equations 1 and 6, Table 14.3.a; and Equation

30, Table 14.3.b), the remaining models had MAE averages smaller than 0.6 and maximal

MAEs smaller than 0.9, meaning that the mean errors of prediction by the models are

likely to be smaller than 0.9 log units. Perhaps this value reflects the heterogeneity of the

data sets and the complexity of the in vivo endpoints.

Human endpoints

For the 26 chemicals investigated HLD correlated best with the mouse and rat in vivo

toxicity. The correlation with mouse in vivo toxicity was better (Equation 1, Table

14.3.a). Ekwall et al. (1998b) previously obtained a good correlation between

HLD values and the mouse in vivo toxicity for the whole MEIC data set. However, as

mentioned above, the model performed poorly during external validation (with maximal

MAE of 0.969). The statistical parameters improved when the number of H-bond donors

was added (NH-donors, Equation 2, Table 14.3.a). When digoxin was excluded, the

correlation of HLD with mouse in vivo toxicity improved (Equation 3, Table 14.3.a). NH-

donors as a second descriptor was significant only at the 85% confidence level (the model

including mouse in vivo toxicity and NH-donors had n = 24, R2 = 0.764, s = 0.499, F = 34,

Q2 = 0.671). A better model included the Kier benzene-likeness index (BLI, Equation 4,

Table 14.3.a), which is a measure of molecular aromaticity (Kier and Hall, 1986). These

results suggest that the number of H-bond donors (NH-donors) influences compound

toxicity; however, for the data set excluding digoxin, which covers narrower ranges of toxicity

and NH-donors, it becomes less influential, whilst compound aromaticity becomes

more significant.

A QSAR model for HLD was obtained from the data set excluding digoxin with the zero-

order average connectivity index only (0χA, Equation 15, Table 14.3.b), which had a

worse statistical fit than the QSAAR models. Since 0χA decreases with increasing degree

of skeletal branching and unsaturation, its negative coefficient in the equation suggests

that toxicity increases with increases in these factors. The model could not be improved

by adding other descriptors. Thus, QSAR models for HLD with comparable or better

statistical fit than the QSAAR models could not be obtained.

Ekwall et al. (1998b) correlated the HAP values of the MEIC chemicals with the results

of 61 in vitro toxicity assays, and obtained the highest R2 with toxicity to Chang liver

cells (R2 = 0.73). In the present study, HAP was also predicted best from toxicity to

Chang liver cells (Equations 5 and 6, Table 14.3.a). For the data set excluding digoxin,

the correlation improved when the energy of the lowest unoccupied molecular orbital

(LUMO) was added (Equation 7, Table 14.3.a). The best three-descriptor model included

LUMO and the number of oxygen atoms (NO) together with the toxicity to Chang liver

cells (Equation 8, Table 14.3.a).

The QSAR models obtained for HAP had slightly better statistical fits than those of the

QSAAR models. Combinations of topological descriptors (second-order valence

connectivity index, 2χv, or third-order valence path connectivity index, 3χpv), and

an electronic descriptor (LUMO) or the number of oxygen atoms (NO) gave models with the

best statistical fit (Equations 16 – 20, Table 14.3.b). The topological descriptors (2χv, 3χpv)

encode the size and shape of the molecules. LUMO is usually considered to be a

descriptor of chemical reactivity. Similar descriptors appeared in the models for HAP

including and excluding digoxin.

Adding structural descriptors resulted in QSAAR models for in vivo human toxicity with

better fit than did those based on toxicity endpoints alone; however, the improvement of

R2 was only between 0.06 and 0.17.

Rodent in vivo toxicity

Rat LD50

Rat LD50 values correlated best with the toxicity to rat hepatocytes (Equation 9, Table

14.3.a). Adding the number of 6-membered rings (Nrings6) or the molecular polarisability

(α) improved the correlation (Equations 10 and 11, Table 14.3.a), albeit with an increase

of R2 of only about 0.09. Nrings6 and α were highly intercorrelated with R = 0.98 (n = 25).

Nrings6 may be related to the size and/or shape of the molecule. The molecular

polarisability encodes the susceptibility of a molecule to acquire a dipole moment under

an external electric field, and might influence the transport or distribution of the chemical

into cell membranes. It is also related to the molecular size.

When QSARs were investigated, a model containing the difference between the energies

of LUMO and HOMO (LmH) and the number of 6-membered rings (Nrings6) was derived

(Equation 21, Table 14.3.b). LmH is considered to reflect the activation energy of a

molecule. A three-descriptor model (Equation 22, Table 14.3.b) included the

hydrophilicity factor (Hy), the electrotopological state descriptor (TIE, calculated on the

basis of the electrotopological state indices of Kier and Hall, 1990), and the number of 6-

membered rings (Nrings6). The electrotopological state indices of Kier and Hall are

considered to account for the ability of molecules to enter into non-covalent

intermolecular interactions (Kier and Hall, 1990). The three-variable QSARs had better

statistical parameters than the QSAARs.

Mouse LD50

The descriptors that appeared in the models for rat LD50 also described mouse LD50 best,

although the statistical parameters of the models were worse. Mouse and rat LD50

intercorrelate with R = 0.840. Mouse LD50 was correlated with the toxicity to rat

hepatocytes in combination with Nrings6 or polarisability (α) (Equations 13 and 14, Table

14.3.a). As mentioned above, these models performed poorly during cross-

validation and external validation, but better QSAAR models could not be derived.

The QSAR models for mouse LD50 were better than the QSAARs and again included

LmH and Nrings6 (Equation 23, Table 14.3.b), or a combination of Hy, TIE, and

Nrings6 (Equation 24, Table 14.3.b).

As described above, the results from the PCA suggested a factor determining differences

between the in vivo and in vitro toxicity effects, which is probably related to toxicokinetic

factors. Insight into the nature of this factor might be obtained from the structural

descriptors relating in vivo to in vitro endpoints in the QSAAR models. Generally, these

descriptors were related to electronic/reactivity properties, presence of oxygen atoms, and

size/shape properties.

Toxicity to human and rodent cell lines in vitro

Toxicity to Chang human liver cells

The best QSARs for toxicity to Chang liver cells included the same structural descriptors

when digoxin was included and excluded from the data set (Equations 26 and 27, Table

14.3.b), namely MM and the magnitude of molecular dipole moment (µ). The positive

coefficients of MM in the equations suggest that larger molecules are more toxic. µ

influences the intermolecular interactions and membrane transport of a compound. The

exclusion of digoxin from the data set resulted also in a model with logPoctM and LUMO

(Equation 28, Table 14.3.b), which is in accordance with the so-called response-surface

approach for QSAR modeling of cytotoxicity (see Chapter 8, Section 8.2 “Literature

review of QSARs for acute toxicity/cytotoxicity”).

However, as mentioned above, the presented models for toxicity to human liver cells

performed poorly during cross-validation and external validation. These results suggest

that the toxicity of some of the compounds is poorly described by the models. The

identification of outliers by using studentised residuals at the 5% significance level

detected lindane, 1,1,1-trichloroethane, tetrachloromethane, and hexachlorophene.

Lindane and tetrachloromethane were more toxic than predicted by the three equations;

1,1,1-trichloroethane and hexachlorophene were less toxic than predicted.

Hexachlorophene has a very high logPoct value (calculated logPoct of 6.92, measured

logPoctM of 7.54, Table 14.2.a), which might be a reason for its lower toxicity than

predicted. Lindane, tetrachloromethane, and 1,1,1-trichloroethane are saturated

halogenoalkanes, which are known to cause their toxic effect by the non-polar narcotic

mechanism (see Chapter 8, Section 8.1 “Mechanisms of toxic action”). According to the

baseline effect the non-polar narcotics have the lowest possible toxicity when compared

with compounds having similar logPoct but acting by other mechanisms (see Chapter 8).

However, in the present study lindane and tetrachloromethane have higher toxicities than

expected from their logPoct values, even when their LUMO energies (considered to

encode compound reactivity) are accounted for. This may result from specific properties

of the target biological system (Chang human liver cells), which interacts in a particular

way with these compounds. These observations need more detailed experimental

investigation.

Toxicity to rat hepatocytes

A model for the toxicity to rat hepatocytes with good statistical parameters was obtained

with logSaq only. Calculated logSaq values gave better results than did measured values

(Equations 29 and 30, Table 14.3.b). The two-parameter models included a combination

of logSaq and LmH (Equation 33, Table 14.3.b) and a combination of logPoctM and LmH

(measured logPoct values gave better results, Equation 34, Table 14.3.b). In addition, the

topological descriptor 0χA was also used in combination with logSaq or logPoctM to obtain

two-descriptor models (Equations 31 and 32, Table 14.3.b). Statistically significant

models with these structural descriptors can also be obtained for rat and mouse LD50

values (data not shown), which is consistent with the high intercorrelation between rat

and mouse in vivo and rat hepatocyte in vitro toxicities.

Bacterial toxicity

S. meliloti

Two-descriptor models were obtained by combining logPoctM (measured logPoct values

gave slightly better results than did calculated values) with LUMO or LmH (LmH

resulted in a model with a slightly better statistical fit than did LUMO, Equations 35 and

36, Table 14.3.b). An extended QSAR investigation of the toxicity to the bacterium S.

meliloti is presented in Chapter 12. As described in Chapter 12, a model including logPoct,

LUMO, and the numbers of NH2 and OH groups was obtained for a data set of 131

chemicals acting by different toxicity mechanisms. The model had R2 = 0.541, s = 0.818,

and F = 37.1. When only the MEIC chemicals were considered in the study presented in

this chapter, a model with better statistical fit was obtained by using logPoct and LUMO

only (R2 = 0.862, Equation 36, Table 14.3.b); however, the training set was smaller (n =

22). Adding the number of OH groups to this equation did not give a statistically

significant improvement in fit. The number of NH2 groups was not tested in combination

with logPoct and LUMO, because only two of the investigated compounds had NH2 groups.
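The fit statistics quoted above can be cross-checked with the standard relation between R2 and the overall F ratio of a least-squares model, F = (R2/p) / ((1 - R2)/(n - p - 1)), where p is the number of descriptors. The sketch below is only a consistency check; it assumes that both models include an intercept, and the function name is illustrative.

```python
def f_statistic(r2, n, p):
    """Overall F ratio of a least-squares model with p descriptors plus an intercept."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

# Global S. meliloti model from Chapter 12: four descriptors, 131 chemicals
f_global = f_statistic(r2=0.541, n=131, p=4)
# Equation 36 (MEIC chemicals only): two descriptors, 22 chemicals
f_meic = f_statistic(r2=0.862, n=22, p=2)

print(f"global model: F = {f_global:.1f}")
print(f"Equation 36:  F = {f_meic:.1f}")
```

Evaluated this way, the two models give F values of about 37.1 and 59.3, close to the reported 37.1 and 59.5. The small difference for Equation 36 is of the size expected from rounding R2 to three decimals.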

Biotox™ (V. fischeri)

The best two-descriptor models included logPoct in combination with 0χA or LmH. A

model with logPoct and LUMO was also obtained, but again, its statistical parameters

were poorer (Equations 37, 38, and 39, Table 14.3.b).

As described above, the results from PCA suggested that the toxicity to the different

endpoints is determined mainly by a single factor, which might be related to cell toxicity.

The QSARs for cell toxicity related this factor to compound hydrophobicity and

reactivity, which is in accordance with literature reports (see Chapter 8, Section 8.2

“Literature review of QSARs for acute toxicity/cytotoxicity”). Descriptors encoding these

properties also appeared in the QSARs for the in vivo endpoints, together with descriptors

of molecular size/shape.


The work presented in this chapter is based on a large database developed in the MEIC

and MEMO programmes. As far as the author is aware, this is the first investigation by

QSAR analysis of toxicity endpoints included in the MEIC programme. The eight

endpoints selected in this study covered a broad range of biological species, including

humans, which allowed comparison between them. The aim of the MEIC programme was

to assess in vitro toxicity endpoints that allow for prediction of human toxicity in vivo.

The present study explored an extended approach to predict in vivo human toxicity by

using combinations of other in vitro and/or in vivo toxicity endpoints and descriptors of

chemical structure (QSAAR analysis). Human in vivo toxicity data were correlated with

rodent toxicity in vivo or human in vitro liver cell toxicity in combination with structural

descriptors accounting for the H-bond donor ability, molecular aromaticity, and electronic

properties. However, adding structural descriptors improved the R2 of the models by

only 0.06 to 0.17 compared with those based on toxicity endpoints alone. Nevertheless,

the QSAAR models with structural descriptors performed better

during cross-validation and external validation. Additionally, QSAR models for HAP

were obtained which had slightly better statistical parameters than the QSAAR model,

without the need to include an in vitro endpoint. Such a model for HLD could not be

obtained.
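To make the QSAAR idea concrete, the sketch below evaluates Equations 1 and 2 of Table 14.3.a for three chemicals, using the mouse LD50 and observed HLD values from Table 14.1 and the NH-donors counts from Table 14.2. The function names and the choice of chemicals are illustrative, not part of the original analysis.

```python
# QSAAR models for the human lethal dose (Table 14.3.a, digoxin included).
# All toxicities are negative logarithms, as in Table 14.1.
def hld_eq1(log_inv_mouse_ld50):
    return 0.869 * log_inv_mouse_ld50  # Equation 1, no intercept term

def hld_eq2(log_inv_mouse_ld50, nh_donors):
    return 0.796 * log_inv_mouse_ld50 + 0.312 * nh_donors - 0.346  # Equation 2

# name: (log 1/MouseLD50, NH-donors, observed log 1/HLD), Tables 14.1 and 14.2
chemicals = {
    "phenol": (-0.46, 1, -0.22),
    "ethanol": (-1.87, 1, -2.01),
    "digoxin": (1.64, 8, 3.77),
}

predictions = {}
for name, (mouse, nh, observed) in chemicals.items():
    predictions[name] = (hld_eq1(mouse), hld_eq2(mouse, nh))
    eq1, eq2 = predictions[name]
    print(f"{name}: Eq. 1 {eq1:.2f}, Eq. 2 {eq2:.2f}, observed {observed:.2f}")
```

For digoxin, which has the largest NH-donors value in Table 14.2, the structural term in Equation 2 moves the prediction much closer to the observed value. For the two simpler compounds the equations give similar predictions.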

In vivo toxicity to rat and mouse was also investigated, and the QSARs derived (including

electronic parameters and descriptors of size and hydrophobicity) were better than the

corresponding QSAARs.

The QSAAR and QSAR models for in vivo toxicity developed in this study have

reasonable statistical parameters; however, a clear and detailed explanation of the

mechanistic basis of these models is difficult to provide. The toxic effect in vivo is

determined by a complex combination of factors related to toxicokinetics and cell

toxicity, which are not always understood. However, when mechanistically-based and/or

easily interpretable models based on in vitro endpoints and/or structural descriptors

cannot be developed, models based on statistical correlations can still be useful even if

they are not used to directly replace testing. For example, they could be used as priority-

setting methods or to provide supplementary information in hazard and risk assessment.

QSARs for the in vitro biological systems were obtained, which, in general, included a

descriptor of compound hydrophobicity (logPoct, logSaq), and reactivity (LmH, LUMO).

Thus, the present investigation confirmed the applicability of the so-called response-

surface approach (see Chapter 8, Section 8.2 “Literature review of QSARs for acute

toxicity/cytotoxicity”) for QSAR analysis of toxicity to in vitro systems. These QSARs

are simple and have a sound mechanistic interpretation as they account for both the

transport and distribution of the compounds in cell membranes (as a part of the narcotic

mechanisms of toxicity), as well as the chemical reactivity.
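The two-term structure of such a response-surface model can be sketched with Equation 39 of Table 14.3.b (Biotox), which combines a hydrophobicity term with a reactivity term. The inputs are the measured logPoct and LUMO values from Table 14.2. The split into named contributions is an illustrative device, not part of the original analysis.

```python
# Equation 39 (Table 14.3.b): Biotox toxicity = 0.430 logPoctM - 0.328 LUMO - 1.27
def biotox_terms(logp_measured, lumo):
    hydrophobic = 0.430 * logp_measured  # transport/partitioning contribution
    reactivity = -0.328 * lumo           # a lower LUMO raises the predicted toxicity
    total = hydrophobic + reactivity - 1.27
    return hydrophobic, reactivity, total

# name: (logPoctM, LUMO, observed Biotox value), Tables 14.1 and 14.2
chemicals = {
    "ethanol": (-0.31, 3.564, -2.63),
    "phenol": (1.46, 0.398, -0.15),
    "pentachlorophenol": (5.12, -0.977, 1.70),
}

totals = {}
for name, (logp, lumo, observed) in chemicals.items():
    hydrophobic, reactivity, totals[name] = biotox_terms(logp, lumo)
    print(f"{name}: hydrophobic {hydrophobic:+.2f}, reactivity {reactivity:+.2f}, "
          f"predicted {totals[name]:.2f}, observed {observed:.2f}")
```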

An article based on this work has been published by Lessigiarska et al. (2006).

14.5. Conclusions

The results from the cluster and QSAAR analyses were consistent with the findings of

Ekwall et al. (1998b) and suggested that the route of toxicant administration (and

subsequent toxicokinetics) determines similarities in the toxic effects between the

different species. Thus, the in vivo human oral lethal doses were best related to the in vivo

rat and mouse oral lethal doses, whereas the toxic concentrations in human blood/serum

were related to IC50 values for human liver cells, which are exposed in vivo to the blood

concentrations of the chemical.

Toxicity to mammalian cells was shown to be a better predictor of in vivo human and

rodent toxicity than the toxicity to the two bacteria investigated (S. meliloti and V.

fischeri). QSAARs for human toxicity endpoints were developed by combining mouse

LD50 values and descriptors encoding H-bond donor ability and aromaticity, or in vitro

toxicity to human liver cells and descriptors related to electronic properties (LUMO).

According to the QSAR models, the factors that appear to be important for human toxicity

are related to molecular size and shape, and electronic properties (LUMO).

In vivo toxicity to rat and mouse was related to in vitro IC50 values of rat hepatocytes.

However, in vivo toxicity to rat and mouse was better modelled by QSAR analysis than

by QSAAR analysis, by using descriptors of hydrophilicity, electrotopological state, and the

number of 6-membered rings.

Descriptors for hydrophobicity (logPoct, logSaq) and compound reactivity (LmH, LUMO)

appeared to determine in vitro toxicity to liver and bacterial cells, which is in accordance

with many QSAR studies of toxicity to in vitro systems (see Chapter 8, Section 8.2

“Literature review of QSARs for acute toxicity/cytotoxicity”).

Figure 14.1. Plots of (a) the loadings of the endpoints and (b) the scores of the chemicals

of the first against the second component obtained by the PCA for the MEIC data set

(a) Loading plot

[Plot not reproduced: second component against first component; points labelled Rat hepatocytes, Biotox, S. meliloti, Human liver, Mouse LD50, Rat LD50, HLD, and HAP.]

(b) Score plot; the chemicals are numbered according to their MEIC number (see Table 14.1)

[Plot not reproduced: second component against first component.]

Figure 14.2. Dendrogram plot obtained for the endpoints of the MEIC data set by cluster

analysis with average linkage and correlation coefficients distance

[Dendrogram not reproduced: distance axis from 0.00 to 0.28; endpoints in the order Rat hepatocytes, Biotox, S. meliloti, Mouse LD50, Rat LD50, HLD, Human liver, HAP.]

Figure 14.3. (a) Plot of observed against predicted values of HLD obtained with the

model based on mouse LD50 (Equation 1, Table 14.3.a)

[Plot not reproduced: observed against predicted values of log 1/HLD; digoxin is labelled.]

(b) Plot of observed against predicted values of HAP obtained with the model based on

the toxicity to human liver cells (Equation 6, Table 14.3.a)

[Plot not reproduced: observed against predicted values of log 1/HAP; digoxin is labelled.]

Figure 14.4. Plot of observed against predicted IC50 values for the toxicity to Chang

human liver cells obtained with the model based on MM (Equation 25, Table 14.3.b)

[Plot not reproduced: observed against predicted values of log 1/Human liver; digoxin is labelled.]

Table 14.1. The 50 MEIC chemicals and their toxicities †

No Name HAP HLD Rat LD50 Mouse LD50 Human liver Rat hepatocytes S. meliloti Biotox

1 acetaminophen a -0.37 -0.25 -1.20 -0.35 na d -1.03 -0.22 -1.19

2 acetylsalicylic acid -0.89 -0.33 -0.05 -0.14 0.36 -0.43 -0.39 -0.80

3 iron (II) sulphate b 0.11 -0.55 -0.32 -0.65 0.69 -0.21 0.49 -0.93

4 diazepam a 1.15 c na d -0.09 0.77 1.44 0.64 -0.50 0.12

5 amitriptyline HCl b 2.21 0.87 -0.06 0.30 1.41 1.15 1.82 1.52

6 digoxin 4.04 3.77 1.44 1.64 5.52 0.57 0.64 -0.16

7 ethylene glycol -1.40 -1.40 -1.88 -1.95 -1.65 -2.55 -3.54 -3.43

8 methanol -2.07 -1.69 -2.24 -2.36 -2.66 -2.96 -3.33 -2.90

9 ethanol -2.26 -2.01 -2.19 -1.87 -2.50 -2.65 -3.22 -2.63

10 propan-2-ol -1.92 -1.63 -1.92 -1.78 -1.65 -2.48 -2.98 -2.27

11 1,1,1-trichloroethane -0.24 -1.73 -1.86 -1.65 -2.05 -1.88 0.18 -0.79

12 phenol 0.07 -0.22 -0.53 -0.46 -0.69 0.10 -1.11 -0.15

13 sodium chloride b -2.30 -1.59 -1.71 -1.84 -1.92 -2.01 -2.46 -2.47

14 sodium fluoride a,b -0.01 c -0.32 -0.09 -0.13 -1.54 0.00 na d -2.37

15 malathion a 2.24 c -0.35 0.06 0.24 na d na d 0.95 0.05

16 2,4-dichlorophenoxyacetic acid a -0.71 -0.24 -0.23 -0.20 -0.22 0.10 na d 0.24

17 xylene b -0.02 c -0.93 -1.61 -1.30 -0.03 -1.24 -0.09 -0.75

18 nicotine 1.08 c 2.36 0.51 1.69 -0.06 -0.55 -0.79 -0.43

19 potassium cyanide b 0.20 1.34 1.11 0.88 -0.32 0.11 0.65 -1.42

20 lithium sulphate a,b -1.15 c -0.88 na d -1.03 -1.32 -1.46 na d -2.37

21 theophylline 0.00 0.06 -0.13 -0.12 -0.86 -0.34 -1.32 -1.08

22 dextropropoxyphene HCl a,b 1.63 c 1.57 0.65 0.17 na d na d na d na d

23 propranolol HCl b 1.92 0.62 -0.20 -0.03 1.46 1.24 0.40 0.55

24 phenobarbital a 0.00 0.32 0.16 0.23 -0.40 -0.37 na d 0.40

25 paraquat a,b 1.17 0.72 0.27 0.19 0.01 -0.07 0.59 na d

26 arsenic trioxide a,b 1.66 1.68 1.13 0.80 2.22 2.52 1.64 na d

27 copper (II) sulphate b 0.60 -0.10 -0.27 -0.36 0.42 1.32 2.15 0.26

28 mercury (II) chloride b 0.70 1.10 2.43 1.66 2.35 2.52 4.22 2.89

29 thioridazine HCl b 1.96 0.83 -0.39 0.02 2.49 1.72 1.99 0.96

30 thallium sulphate a,b 1.44 1.56 1.50 1.33 0.60 0.89 na d na d

31 warfarin a 0.19 c 0.49 2.28 2.01 -0.26 0.86 -0.25 na d

32 lindane 2.35 c 0.08 0.58 0.82 2.83 0.84 0.85 0.68

33 trichloromethane -0.61 -0.92 -0.88 0.52 -0.45 -0.79 -0.72 -0.87

34 tetrachloromethane 1.42 -0.93 -1.18 -1.73 1.81 -0.60 -1.03 -0.78

35 isoniazid a -0.09 -0.10 -0.96 0.01 -0.54 -0.55 na d -1.30

36 dichloromethane -0.61 c -1.21 -1.28 -1.01 -1.74 -2.04 -0.62 -1.41

37 barium nitrate a,b -0.35 c 0.85 -0.13 -0.01 -0.21 0.33 na d na d

38 hexachlorophene a 0.55 0.28 0.86 0.78 1.36 2.70 3.72 na d

39 pentachlorophenol 0.53 0.97 0.99 0.87 1.93 1.30 3.27 1.70

40 verapamil HCl b 1.54 0.60 0.66 0.48 0.66 0.95 0.69 0.32

41 chloroquine phosphate a,b 1.53 0.79 na d 0.01 0.39 1.36 -0.14 -1.99

42 orphenadrine HCl a,b 1.38 0.81 0.08 0.49 -0.20 0.94 -0.24 na d

43 quinidine sulphate b 1.10 0.72 0.21 0.17 1.03 0.89 0.38 0.85

44 phenytoin a 0.10 -0.08 -0.81 0.23 0.39 0.75 na d na d

45 chloramphenicol 0.25 c 0.05 -0.89 -0.67 -0.11 0.40 -0.64 0.20

46 sodium oxalate a,b -0.10 c -0.39 na d -1.58 na d 0.24 na d na d

47 amphetamine sulphate a,b 0.94 1.43 0.83 1.19 na d -0.09 na d na d

48 caffeine 0.04 0.14 0.00 0.18 -0.97 -0.20 -1.28 -1.04

49 atropine sulphate b 1.85 c 2.61 0.06 0.17 0.80 0.17 0.18 -0.47

50 potassium chloride b -0.98 c -0.73 -1.54 -1.30 -1.59 -1.96 -2.46 -3.30

† the toxicity values are presented in negative logarithmic form and units of mmol/kg (HLD, rat, and mouse LD50) or mmol/l (the remaining endpoints)

a chemicals excluded from the PCA and cluster analysis due to missing values for some of the endpoint(s)

b inorganic chemicals, salts, xylene (mixture of isomers), and paraquat (charge of +2) were excluded from the QSAAR and QSAR analyses

c human peak concentrations obtained from approximate LC0 or LC100 curves, instead of an approximate LC50 curve (see text)

d na – not available

Table 14.2. Values for the molecular descriptors included in the selected QSAARs and QSARs for acute toxicity

a) measured descriptors taken from the KOWWIN and WSKOWWIN, and descriptors calculated with KOWWIN, WSKOWWIN, and DRAGON

Software: KOWWIN (logPoct, logPoctM); WSKOWWIN (logSaq, logSaqM); DRAGON (0χA, BLI, Hy, NH-donors, TIE)

No Name logPoct logPoctM logSaq logSaqM 0χA BLI Hy NH-donors TIE
1 acetaminophen 0.27 0.46 -0.70 -1.03 0.752 0.886 -0.107 2 45.9
2 acetylsalicylic acid 1.13 1.19 -1.53 -1.59 0.757 0.835 -0.673 0 92.9
4 diazepam 2.70 2.82 -3.69 -3.76 0.706 0.915 -0.787 2 80.6
6 digoxin 0.50 1.26 -5.97 -4.08 0.713 1.013 2.653 8 283
7 ethylene glycol -1.20 -1.36 1.21 1.21 0.854 1.132 1.837 2 24.0
8 methanol -0.63 -0.77 1.49 1.49 1.000 1.342 1.402 1 0.378
9 ethanol -0.14 -0.31 1.24 1.34 0.902 1.535 0.712 1 4.54
10 propan-2-ol 0.28 0.05 0.83 1.22 0.894 1.413 0.371 1 0.00
11 1,1,1-trichloroethane 2.68 2.49 -2.30 -2.01 0.900 1.651 -0.359 0 0.00
12 phenol 1.51 1.46 -0.56 -0.06 0.730 0.915 -0.067 1 19.7
15 malathion 2.29 2.36 -3.62 -3.36 0.784 1.627 -0.517 0 263
16 2,4-dichlorophenoxyacetic acid 2.62 2.81 -2.82 -2.51 0.757 0.957 -0.598 1 83.8
18 nicotine 1.00 1.17 0.79 0.79 0.699 1.034 -0.807 0 37.0
21 theophylline -0.39 -0.04 -1.79 -1.39 0.737 0.797 -0.523 0 39.2
24 phenobarbital 1.33 1.47 -2.15 -2.32 0.733 0.889 0.477 2 48.0
31 warfarin 2.23 2.60 -3.67 -4.26 0.713 0.884 -0.815 0 71.3
32 lindane 4.26 4.14 -4.86 -4.56 0.789 1.482 -0.484 0 82.4
33 trichloromethane 1.52 1.97 -1.76 -1.18 0.894 1.964 -0.215 0 0.00
34 tetrachloromethane 2.44 2.83 -2.74 -2.29 0.900 1.701 -0.180 0 0.00
35 isoniazid -0.81 -0.70 -0.92 0.01 0.740 0.826 -0.576 0 23.3
36 dichloromethane 1.34 1.25 -0.89 -0.82 0.902 2.405 -0.264 0 4.16
38 hexachlorophene 6.92 7.54 -8.03 -3.46 0.757 1.051 0.478 2 385
39 pentachlorophenol 4.74 5.12 -4.94 -4.28 0.789 1.140 0.089 1 474
44 phenytoin 2.16 2.47 -3.15 -3.90 0.700 0.854 -0.296 1 49.0
45 chloramphenicol 0.92 1.14 -2.92 -2.11 0.764 0.953 0.565 3 116
48 caffeine 0.16 -0.07 -1.87 -0.95 0.747 0.822 -0.557 0 43.8

b) descriptors calculated with TSAR

No Name 2χv 3χpv α^a LmH LUMO µ^a MM NO Nrings6
1 acetaminophen 2.228 1.188 19.31 8.754 0.284 4.618 151.2 2 1
2 acetylsalicylic acid 2.395 1.371 19.68 9.251 -0.532 1.428 180.2 4 1
4 diazepam 5.117 3.623 37.02 8.588 -0.632 3.712 284.8 1 2
6 digoxin 18.816 15.952 90.38 10.034 -0.158 6.837 781.1 14 6
7 ethylene glycol 0.447 0.100 5.44 13.850 3.024 0.001 62.1 2 0
8 methanol 0.000 0.000 3.06 14.914 3.778 1.631 32.0 1 0
9 ethanol 0.316 0.000 5.31 14.439 3.564 1.493 46.1 1 0
10 propan-2-ol 1.094 0.000 7.61 14.701 3.576 1.826 60.1 1 0
11 1,1,1-trichloroethane 3.936 0.000 7.06 11.726 -0.266 1.697 133.4 0 0
12 phenol 1.336 0.756 13.68 9.512 0.398 1.302 94.1 1 1
15 malathion 10.648 8.542 31.15 7.412 -2.658 4.821 330.4 6 0
16 2,4-dichlorophenoxyacetic acid 3.169 1.819 19.22 9.040 -0.312 3.149 221.0 3 1
18 nicotine 3.425 2.590 24.29 9.487 0.213 2.714 162.3 0 1
21 theophylline 2.806 2.036 19.97 8.982 -0.089 6.121 180.2 2 1
24 phenobarbital 3.862 3.026 27.35 9.817 -0.191 1.609 232.3 3 2
31 warfarin 5.517 3.862 41.03 8.284 -1.037 7.999 308.4 4 3
32 lindane 5.936 6.284 16.97 11.211 -0.151 2.081 290.8 0 1
33 trichloromethane 2.474 0.000 4.81 11.467 -0.303 0.860 119.4 0 0
34 tetrachloromethane 4.286 0.000 5.57 11.262 -1.116 0.002 153.8 0 0
35 isoniazid 1.709 1.074 16.23 9.715 -0.596 1.581 137.2 1 1
36 dichloromethane 1.010 0.000 4.08 11.984 0.594 1.279 84.9 0 0
38 hexachlorophene 6.718 5.269 31.92 8.421 -0.772 2.274 406.9 2 2
39 pentachlorophenol 3.962 3.677 17.01 8.596 -0.977 0.878 266.3 1 1
44 phenytoin 4.390 3.282 33.20 9.544 -0.133 2.996 252.3 2 2
45 chloramphenicol 5.106 2.978 27.92 9.278 -1.183 6.764 323.2 5 1
48 caffeine 3.233 2.317 22.50 8.615 -0.348 3.627 194.2 2 1

a calculated with the VAMP package, using AM1 Hamiltonian

Table 14.3. QSAAR and QSAR models based on the MEIC data set

(a) QSAAR models

No Endpoint Regression equation n R2 s F Q2 Q2(5) Min Q2(5) Max Q2(5) Average MAE Min MAE Max MAE Average

1 HLD a 0.869 log1/MouseLD50 c 25 0.683 0.728 49.6 0.589 0.536 0.666 0.605 0.239 0.969 0.540

2 HLD a 0.796 log1/MouseLD50 + 0.312 NH-donors – 0.346 25 0.853 0.507 63.7 0.756 0.700 0.808 0.760 0.164 0.668 0.394

3 HLD b 0.725 log1/MouseLD50 – 0.144 24 0.739 0.513 62.4 0.663 0.598 0.720 0.669 0.163 0.645 0.412

4 HLD b 0.650 log1/MouseLD50 – 0.590 BLI + 0.548 24 0.797 0.463 41.2 0.702 0.633 0.759 0.693 0.137 0.569 0.365

5 HAP a 0.676 log1/HumanLiver c 24 0.814 0.610 96.5 0.784 0.760 0.802 0.783 0.295 0.755 0.527

6 HAP b 0.647 log1/HumanLiver c 23 0.705 0.621 50.2 0.645 0.602 0.676 0.635 0.422 0.891 0.638

7 HAP b 0.463 log1/HumanLiver – 0.251 LUMO c 23 0.773 0.558 34.0 0.684 0.614 0.759 0.688 0.363 0.735 0.529

8 HAP b 0.412 log1/HumanLiver – 0.323 LUMO – 0.191 NO + 0.376 23 0.828 0.499 30.5 0.730 0.711 0.796 0.741 0.280 0.661 0.455

9 Rat LD50 0.685 log1/RatHepat – 0.153 25 0.661 0.696 44.8 0.599 0.576 0.642 0.609 0.272 0.848 0.514

10 Rat LD50 0.493 log1/RatHepat + 0.347 Nrings6 – 0.628 25 0.758 0.601 34.4 0.658 0.544 0.687 0.624 0.328 0.693 0.529

11 Rat LD50 0.522 log1/RatHepat + 0.0215 α – 0.675 25 0.736 0.628 30.6 0.670 0.552 0.713 0.655 0.289 0.735 0.490

12 Mouse LD50 0.680 log1/RatHepat c 25 0.603 0.782 34.9 0.528 0.496 0.587 0.528 0.194 0.822 0.558

13 Mouse LD50 0.487 log1/RatHepat + 0.348 Nrings6 – 0.354 25 0.693 0.703 24.9 0.560 0.267 0.642 0.516 0.258 1.201 0.601

14 Mouse LD50 0.502 log1/RatHepat + 0.0236 α – 0.447 25 0.686 0.711 24.0 0.557 0.426 0.626 0.539 0.299 1.027 0.592

a digoxin included

b digoxin excluded

c intercept of the equation was not significantly different from zero at 95% confidence level

(b) QSAR models

No Endpoint Regression equation n R2 s F Q2 Q2(5) Min Q2(5) Max Q2(5) Average MAE Min MAE Max MAE Average
15 HLD b -9.57 0χA + 7.30 24 0.655 0.589 41.8 0.575 0.542 0.599 0.576 0.266 0.589 0.435
16 HAP a 0.269 3χpv – 0.352 LUMO – 0.558 26 0.823 0.614 53.6 0.772 0.649 0.803 0.754 0.274 0.883 0.478
17 HAP a 0.258 2χv – 0.283 LUMO – 0.878 26 0.821 0.618 52.6 0.794 0.745 0.800 0.779 0.242 0.814 0.474
18 HAP b 0.497 2χv – 0.271 NO – 1.26 25 0.778 0.576 38.6 0.715 0.693 0.751 0.718 0.316 0.818 0.439
19 HAP b 0.224 3χpv – 0.388 LUMO – 0.464 25 0.744 0.620 31.9 0.641 0.583 0.665 0.630 0.273 0.716 0.423
20 HAP b 0.319 3χpv – 0.420 LUMO – 0.296 NO c 25 0.870 0.451 47.0 0.794 0.732 0.809 0.778 0.101 0.543 0.339
21 Rat LD50 -0.281 LmH + 0.427 Nrings6 + 2.01 26 0.716 0.639 29.0 0.633 0.538 0.672 0.617 0.296 0.596 0.482
22 Rat LD50 0.00382 TIE – 0.645 Hy + 0.607 Nrings6 – 1.41 26 0.815 0.527 32.4 0.735 0.667 0.746 0.705 0.188 0.677 0.412
23 Mouse LD50 -0.307 LmH + 0.409 Nrings6 + 2.57 26 0.705 0.675 27.5 0.631 0.581 0.666 0.628 0.248 0.789 0.470
24 Mouse LD50 0.00281 TIE – 0.790 Hy + 0.677 Nrings6 – 1.12 26 0.797 0.573 28.8 0.751 0.718 0.765 0.749 0.215 0.613 0.451
25 Human liver a 0.0101 MM – 2.15 24 0.726 0.989 58.3 0.690 0.672 0.705 0.689 0.364 1.458 0.790
26 Human liver a 0.0128 MM – 0.322 µ – 1.87 24 0.817 0.828 46.7 0.757 0.633 0.789 0.750 0.348 1.120 0.703
27 Human liver b 0.0141 MM – 0.335 µ – 2.05 23 0.699 0.834 23.2 0.552 0.421 0.621 0.541 0.326 1.421 0.643
28 Human liver b 0.364 logPoctM – 0.349 LUMO – 0.813 23 0.634 0.920 17.3 0.522 0.408 0.592 0.496 0.314 1.178 0.595
29 Rat hepatocytes -0.514 logSaq – 1.51 25 0.763 0.691 73.9 0.721 0.694 0.750 0.716 0.284 0.770 0.532
30 Rat hepatocytes -0.579 logSaqM – 1.39 25 0.662 0.824 45.0 0.605 0.568 0.638 0.599 0.206 1.019 0.555
31 Rat hepatocytes -0.405 logSaq – 6.90 0χA + 4.19 25 0.907 0.441 108 0.870 0.832 0.874 0.854 0.205 0.683 0.385
32 Rat hepatocytes 0.443 logPoctM – 9.49 0χA + 6.36 25 0.906 0.444 106 0.878 0.852 0.882 0.865 0.271 0.622 0.412
33 Rat hepatocytes -0.347 logSaq – 0.312 LmH + 2.10 25 0.901 0.455 101 0.871 0.830 0.882 0.864 0.163 0.495 0.319
34 Rat hepatocytes 0.341 logPoctM – 0.401 LmH + 3.18 25 0.871 0.521 74.0 0.841 0.804 0.862 0.831 0.267 0.611 0.409
35 S. meliloti 0.651 logPoctM – 0.276 LmH + 1.24 22 0.886 0.652 73.6 0.860 0.815 0.873 0.850 0.347 0.737 0.539
36 S. meliloti 0.626 logPoctM – 0.348 LUMO – 1.54 22 0.862 0.715 59.5 0.829 0.686 0.841 0.814 0.339 0.836 0.579
37 Biotox 0.576 logPoctM – 6.67 0χA + 3.80 23 0.914 0.371 107 0.885 0.854 0.901 0.876 0.129 0.523 0.263
38 Biotox 0.480 logPoctM – 0.269 LmH + 1.42 23 0.898 0.405 87.8 0.874 0.857 0.886 0.870 0.170 0.519 0.361
39 Biotox 0.430 logPoctM – 0.328 LUMO – 1.27 23 0.834 0.518 50.1 0.793 0.669 0.813 0.774 0.201 0.668 0.405

a digoxin included

b digoxin excluded

c intercept of the equation was not significantly different from zero at 95% confidence level

CHAPTER 15

SUMMARY AND GENERAL DISCUSSION

15.1. Objective

The aim of this chapter is to summarise the results of the research project, emphasising

the author’s contributions to existing knowledge. Some observations about the quality and

applicability of the QSARs based on the performed research are made. Also, possible

applications of the developed QSARs in strategies for toxicity testing are described.

15.2. Summary of the research results

The project included QSAR investigations in two main pharmacotoxicological areas,

namely blood-brain barrier (BBB) penetration, and acute toxicity and cytotoxicity.

QSARs were developed by using physicochemical properties, counts of functional

groups, and descriptors based on molecular topology and on electrostatic and electronic

features. A number of statistical approaches were used, including regression analysis, partial least

squares (PLS) regression, discriminant analysis, principal components analysis (PCA),

cluster analysis, and classification trees. 3D QSAR approaches (CoMSIA, CoMFA) were

also explored to investigate specific interactions with biological molecules. Additionally,

the author developed some algorithms for the implementation of statistical approaches to

support the QSAR analysis. These are related to reducing variable multicollinearity in

large data sets, and an application of the best subsets regression.

The background to the research is presented in Chapters 2-8, whereas the results of the

research are reported in Chapters 9-14.

In Chapter 9 two computational algorithms with applications in QSAR analysis are

suggested. The first one is used to reduce multicollinearity in large data sets of structural

descriptors. It was designed to optimise subsequent QSAR analysis. The algorithm allows

the user to preserve variables considered important for describing biological activity and

development of QSARs. The algorithm was used in the publication of Lessigiarska et al.

(2004a).
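The algorithm itself is described in Chapter 9. Purely to illustrate the general idea, the sketch below shows one common way of reducing multicollinearity while protecting chosen variables: a greedy pairwise-correlation filter. The function names, the threshold of 0.9, and the greedy strategy are assumptions made for this example and should not be read as the author's procedure. The descriptor values are those of the first six compounds in Table 14.2.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def reduce_collinearity(descriptors, r_max=0.9, preserve=()):
    """Keep a descriptor only if its |r| with every kept descriptor is <= r_max."""
    kept = list(preserve)  # protected descriptors are always retained
    for name in descriptors:
        if name in kept:
            continue
        if all(abs(pearson_r(descriptors[name], descriptors[k])) <= r_max
               for k in kept):
            kept.append(name)
    return kept

# Values for compounds 1, 2, 4, 6, 7, and 8 of Table 14.2
data = {
    "logPoct": [0.27, 1.13, 2.70, 0.50, -1.20, -0.63],
    "logPoctM": [0.46, 1.19, 2.82, 1.26, -1.36, -0.77],
    "MM": [151.2, 180.2, 284.8, 781.1, 62.1, 32.0],
}

print(reduce_collinearity(data))
print(reduce_collinearity(data, preserve=("logPoctM",)))
```

In this small example the calculated and measured logPoct values are highly intercorrelated, so only one of them survives; the `preserve` argument decides which.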

The second algorithm reported in Chapter 9 extends the implementation of

best-subsets regression to large data sets. An option is provided to include in the same

model only those variables whose intercorrelation is below a given value of the

intercorrelation coefficient R. The algorithm was used in the publications of Lessigiarska

et al. (2004a), Lessigiarska et al. (2004b), Netzeva et al. (2005), and Lessigiarska et al.

(2006).
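The intercorrelation option can be sketched as a filter applied to candidate descriptor subsets before any model is fitted. The code below is a minimal illustration under that assumption; the function names and the limit of 0.9 are hypothetical, and the regression and ranking steps of a full best-subsets search are deliberately omitted.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def admissible_subsets(descriptors, size, r_max):
    """Yield descriptor subsets of a given size whose pairwise |R| is below r_max."""
    for subset in combinations(descriptors, size):
        if all(abs(pearson_r(descriptors[a], descriptors[b])) < r_max
               for a, b in combinations(subset, 2)):
            yield subset

# Values for compounds 1, 2, 4, 6, 7, and 8 of Table 14.2
data = {
    "logPoct": [0.27, 1.13, 2.70, 0.50, -1.20, -0.63],
    "logPoctM": [0.46, 1.19, 2.82, 1.26, -1.36, -0.77],
    "MM": [151.2, 180.2, 284.8, 781.1, 62.1, 32.0],
}

pairs = list(admissible_subsets(data, size=2, r_max=0.9))
print(pairs)
```

Each admissible subset would then be fitted by least squares and the resulting models ranked by a chosen statistic. Filtering first keeps highly intercorrelated descriptors out of the same equation.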

In Chapters 10 and 11 the QSAR modelling of chemical penetration through the BBB is

reported. The work in Chapter 10 is based on two data sets containing compounds

transported by different mechanisms, including passive diffusion and active transport.

The first data set is taken from a study funded by ECVAM, and contains data for both in

vivo BBB permeability, and permeability through several in vitro membrane models of

the BBB. The permeability through the membrane models was compared to the in vivo

BBB permeability to assess which membrane model was most similar to the BBB in vivo.

A novel aspect introduced by the current study was the development of models that

describe the in vivo BBB penetration by combining in vitro membrane penetration and

structural descriptors (QSAAR models), thus integrating in vitro modelling with QSAR

analysis. Another result of the study was that it confirmed the importance of compound

lipophilicity and H-bonding ability for the membrane transport, which had already been

reported in the literature. However, the models for in vivo BBB penetration of compounds

transported by different mechanisms did not include logPoct. This descriptor was found to

reflect better the penetration of compounds transported by passive diffusion.

The second data set investigated in Chapter 10 was obtained from the literature (Platts et

al., 2001). Regression and classification QSAR models were developed using structural

descriptors of lipophilicity, H-bonding and size/shape (logPoct, the number of H-bond

donors/acceptors, the number of 6-membered rings). A simple rule for compound

classification as low or high BBB penetrators was suggested, stating that if for a

compound the sum of H-bond donor and acceptor atoms is smaller than its logPoct value

minus one, it is likely to have higher concentration in the brain than in the blood at steady

state (high penetrator), and vice versa.
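The classification rule stated above can be written as a one-line test. In the sketch below the function name and the example inputs are hypothetical; how H-bond donor and acceptor atoms are counted follows Chapter 10 and is not reproduced here.

```python
def is_high_bbb_penetrator(log_p_oct, n_hbond_donors, n_hbond_acceptors):
    """Rule of Chapter 10: high penetrator if (donors + acceptors) < logPoct - 1."""
    return (n_hbond_donors + n_hbond_acceptors) < (log_p_oct - 1.0)

# Hypothetical compounds, for illustration only
lipophilic = is_high_bbb_penetrator(log_p_oct=3.5, n_hbond_donors=1, n_hbond_acceptors=1)
polar = is_high_bbb_penetrator(log_p_oct=1.0, n_hbond_donors=2, n_hbond_acceptors=3)

print("lipophilic example is a high penetrator:", lipophilic)
print("polar example is a high penetrator:", polar)
```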

The work presented in Chapter 11 adds to that presented in Chapter 10 by focussing on

structural features influencing transport across the BBB of compounds known to inhibit

one of the efflux systems of the BBB, namely P-glycoprotein (P-gp). The compounds

included were imipramine and phenothiazine derivatives. The BBB penetration of these

compounds was investigated in order to identify common 3D structural characteristics

related to the mechanism of transport across the BBB, involving passive diffusion and P-

gp interactions. It was shown that the compounds with highest BBB penetration possess a

similar specific profile of two clearly defined hydrophobic centres and one hydrophilic

(H-bonding) centre arranged in a particular spatial configuration. The present study

emphasised the importance of the 3D-distribution of the hydrophobicity of phenothiazines

and imipramines for the BBB penetration, rather than the whole-molecule lipophilicity

characterised by logPoct. This result was observed in relation to the BBB penetration of

these compounds for the first time in the present study.

The next three chapters (Chapters 12-14) report analysis of acute toxic and cytotoxic

effects. Toxicity to a broad range of biological systems was investigated, including

bacteria, algae, isolated human and rodent cells, and in vivo toxicity to Daphnia, fish,

rodents, and humans. A variety of statistical techniques were applied to investigate

similarities between toxic effects for these systems, including correlations between the

investigated endpoints, PCA, and cluster analysis. Baseline toxicity effects were sought.

Separate QSARs were obtained for compounds acting by different mechanisms of toxic

action or belonging to different chemical types. Additionally, QSARs were developed by

combining different chemical classes and mechanisms of action in the same model.

Chapter 12 reports an investigation of toxicity to the bacterium Sinorhizobium meliloti. A

large data set of 133 compounds, taken from the literature (Botsford, 2002), was used to

develop mechanistically-based and chemical class-based QSARs. A baseline toxicity

effect attributed to the non-polar narcotics was observed. However, some non-polar

narcotics had toxicity different from that defined by the baseline. Explanations for this

observation were sought by evaluation of the quality of the toxicity data. Interestingly,

logPoct was not among the descriptors that were best correlated with toxicity of the polar

narcotics. The descriptors that were related to the toxicity of these compounds accounted

for polarisability, electrostatic properties and H-bonding interactions. A QSAR with

reasonable statistical parameters was developed for the aliphatic compounds in the data

set, but no such QSAR could be obtained for the group of aromatic compounds. The

model for the aliphatic compounds included logPoct, encoding compound biouptake

and/or membrane interactions, and LUMO, related to electrophilic reactivity. The other

descriptors included in the model encoded geometrical factors (MSA), and indicators of

specific interactions/reactivity or H-bonding (NNH2). A global QSAR model for all

investigated compounds was derived, which included well-understood molecular

descriptors (logPoct, LUMO, NOH, NNH2), but had low statistical fit.

In Chapter 13 an investigation of environmental toxic effects is reported, involving five

aquatic species (two algal species, Daphnia, and two fish species). The data were taken

from the New Chemicals Database (NCD) of the European Union. These data are

submitted by industry as a part of the notification process for each new chemical

substance that is manufactured in (or imported into) the European Union. Interspecies

correlations were developed to reveal similarities in toxicity between the species. A

strong correlation was observed between the inhibition of algal growth and algal growth

rate. Again, baseline toxicity effects for the non-polar narcotics were investigated.

Although general trends for linear toxicity/logPoct relationships were observed for the

non-polar narcotics, with most of the remaining compounds being placed above the

baselines, the derived relationships did not have high statistical fits. Also, QSARs with

moderate statistical fit were obtained only for the alga S. subspicatus and the fish O.

mykiss. They were based on mechanistically transparent structural descriptors, including

logPoct, LUMO, and the number of phenyl rings.

Classification QSAR models were also developed for the data taken from the NCD, by

applying the EU classification scheme into very toxic, toxic, harmful, and non-toxic

compounds. In general, the models classified correctly approximately 55% of the

compounds. Additionally, a two-group classification (defined by the author) into

dangerous and non-toxic compounds was investigated, with the group of dangerous

compounds being formed by uniting the groups of very toxic, toxic, and harmful

compounds. These classification models classified between 70% and 83% of the

compounds correctly. All classification models included

logPoct as a discriminating variable. A logPoct cut-off value of approximately 2 was

observed for discrimination between dangerous and non-toxic compounds. Additionally,

electronic features (HOMO, LUMO, energy difference between LUMO and HOMO), H-

bond donor ability, topological properties and factors encoding molecular volume, shape,

and size appeared to influence the assignment of compounds to a given toxicity group. The

models were evaluated in relation to biological data.

In Chapter 14 a QSAR analysis of toxicity to a broad range of biological species,

including bacterial strains, rodent and human cell lines, rodents and humans, is presented.

The data were taken from the MEIC and MEMO programmes (Bondesson et al., 1989;

Ekwall et al., 1998a). The toxicity endpoints were compared with each other by using

PCA and cluster analysis. Results from the PCA indicated that the toxicities are

determined mainly by a single factor, possibly related to cell toxicity, with other factors

being less influential. The analyses showed that the route of toxicant administration (and

subsequent toxicokinetics) might determine similarities in the toxic effects between the

different species. Thus, the oral lethal doses for humans were related to the oral

lethal doses to rat and mouse, whereas the toxic concentrations in human blood/serum

were related to IC50 values of human liver cells which are exposed in vivo to the chemical

blood concentrations.

The study explored an approach to predict in vivo human toxicity by combining other in

vivo or in vitro data and descriptors of chemical structure (QSAAR analysis). The data for

human in vivo toxicity correlated with rodent toxicity in vivo and human in vitro liver cell

toxicity. The addition of descriptors of chemical structure improved moderately the

statistical fit of the models, compared to the equations based on toxicity endpoints alone.

QSAR models for the investigated endpoints were developed based on descriptors of

molecular hydrophobicity, size and shape, and electronic properties. The QSARs for in

vitro toxicity included descriptors for hydrophobicity (logPoct, logSaq) and compound

reactivity (LmH, LUMO), which confirms the importance of these properties for toxicity

to in vitro systems as previously reported in the literature.

15.3. Some reflections on the quality of QSARs

Certain requirements must be met when developing QSARs to ensure good quality of the

models and their correct application.

A main consideration is related to the quality of the biological data used for modelling.

Criteria for the selection of high quality biological data for QSAR development are

provided by Cronin and Schultz (2003) and Schultz and Cronin (2003). Criteria include

the principle that high quality data are derived from the same endpoint and protocol, and

ideally should be measured in the same laboratory. The investigated substances should

have high purity. Heterogeneous data, obtained in different laboratories, are often of

insufficient quality for QSAR modelling, as shown in

Chapter 13 concerning the toxicity data set taken from the NCD. Additionally, as

discussed in the report of QSARs for toxicity to S. meliloti (Chapter 12), even if the data

are obtained in the same laboratory, high experimental variability can be observed,

possibly due to some characteristics of the experimental protocol. However, in practice it

is often difficult to obtain high quality biological data, and researchers are therefore faced

with the need to use available data of lower quality. In the view of the author of this

thesis, QSARs based on such data can be useful to assess mechanistic factors related to

biological effects, and even to predict biological activity, but the obtained values should

be considered tentative, taking into account the error in the biological data.

The ability to develop QSARs is based on the principle that quantitative differences in

biological effect of a series of compounds can be related to quantitative differences in

certain features of their chemical structure. To apply this theoretical consideration

successfully, the same type of chemical features should determine the biological effect

for all compounds in the series used to develop a given QSAR. In more general terms this

is understood as a prerequisite to develop QSARs based on series of compounds with the

same mechanism of action. However, in certain cases QSARs including compounds

acting by different mechanisms might also be successful. An example is the QSAR

developed for the aliphatic compounds of the data set including toxicity to S. meliloti

(Chapter 12). In such cases it is presumed that the descriptors included in the QSAR

account for each mechanism of action.

Additionally, sometimes QSARs are based on compounds from the same chemical class,

suggesting similar structural features determining the biological action. Including diverse

compounds in a QSAR might be unsuccessful. Thus, the complexity and diversity of the

molecular structures of the chemicals included in the NCD might be a reason for the

moderate statistical quality of the derived baseline relationships and QSAR models

described in Chapter 13.

The development of reliable QSARs should be based on descriptors that are considered to

be relevant for the biological effect. The choice of such descriptors is a difficult task, and

could be based on a priori knowledge about the biological effect and the mechanism of

action. However, one of the purposes of QSAR analysis is to test hypotheses and to

provide such knowledge, so descriptors that result in high statistical fit to the investigated

endpoint might be considered relevant for the biological effect. When applying this

approach the researcher should be careful not to mistake chance numerical correlations

for real causality, and should be aware that apparently causal descriptors may in reality

not determine the biological effect, but simply be correlated with the real cause. As

argued by Cronin (2002), it is better to include in the QSAR model descriptors that have

physicochemical and/or biological relevance even if they result in a model with worse

statistics compared with other variables. In this project, a balance was sought, by

selecting models with good statistical fit, but containing variables that would allow for

some biological interpretation.

To apply QSARs correctly to the prediction of new compounds, the applicability

domains of the models should be respected. The applicability domain of a model includes

the range of chemicals with particular structural features, to which the model can be

applied. Generally, the applicability domain can be defined on the basis of chemical class,

on the basis of the ranges of structural descriptors for the compounds included in the

model, or on the basis of the range of mechanisms of action covered by the model.

Usually development of QSARs related to a particular chemical class or mechanism of

action increases the statistical quality of the model, but decreases the range of its

applicability. Thus, there is a trade-off between the model quality and its applicability

domain. As a particular example, 3D QSARs have relatively more restricted applicability

domains, as they are usually developed for compounds of a particular chemical class

interacting with the same macromolecules.
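
The descriptor-range definition of the applicability domain mentioned above can be illustrated with a minimal Python sketch, which checks whether a new compound lies within the descriptor ranges spanned by a training set. The function names and all numerical values are hypothetical and are not taken from the models in this thesis; a simple range check is only one possible way to express a domain.

```python
from typing import Dict, List, Tuple

Descriptors = Dict[str, float]


def descriptor_ranges(training_set: List[Descriptors]) -> Dict[str, Tuple[float, float]]:
    """Return the (min, max) of every descriptor over the training compounds."""
    names = training_set[0].keys()
    return {
        name: (
            min(compound[name] for compound in training_set),
            max(compound[name] for compound in training_set),
        )
        for name in names
    }


def in_domain(compound: Descriptors, ranges: Dict[str, Tuple[float, float]]) -> bool:
    """True if every descriptor of the compound lies inside the training range."""
    return all(low <= compound[name] <= high for name, (low, high) in ranges.items())


# Hypothetical training compounds (values invented for illustration only).
training = [
    {"logPoct": 0.5, "LUMO": -0.2},
    {"logPoct": 2.1, "LUMO": 0.9},
    {"logPoct": 3.4, "LUMO": 0.3},
]
ranges = descriptor_ranges(training)

inside = in_domain({"logPoct": 1.8, "LUMO": 0.1}, ranges)   # within both ranges
outside = in_domain({"logPoct": 5.0, "LUMO": 0.1}, ranges)  # logPoct exceeds the range
```

A prediction for a compound flagged as outside the ranges would be an extrapolation and should be treated with corresponding caution.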

15.4. Perspectives on the use of the QSARs in integrated testing strategies

The QSARs developed in this project can contribute to development of integrated

strategies based on non-animal methods for hazard and risk assessment. Two main

elements can be defined in relation to such a contribution, namely the application of the

QSARs for the assessment of brain uptake, i.e. assessment of the availability of a

compound to act on the central nervous system, and application of the QSARs to assess

toxic effects.

The models developed in this project for brain uptake (Chapter 10) suggest that efforts to

describe the in vivo BBB penetration using membrane models in vitro in combination

with QSAR analysis might be successful. However, the QSARs developed without using

an in vitro endpoint had similar statistics. The classification model for low and high

penetrators (Chapter 10) can be used in integrated strategies to assess whether the

compound will be distributed mainly in the brain or not. The model represents a simple

relationship between the logPoct value and the sum of H-bond donor and acceptor atoms for a

compound, which is easy to apply. It states that if the sum of H-bond donors and

acceptors for a compound is smaller than its logPoct value minus one, the compound is

most likely to have higher distribution in the brain than in the blood.
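
The rule stated above can be written as a one-line test. The following Python sketch is an illustration only: the function name and the example values are hypothetical, and a compound lying exactly on the boundary is treated here as not meeting the criterion, since the rule is stated as "smaller than".

```python
def likely_high_brain_penetrator(log_p_oct: float, h_donors: int, h_acceptors: int) -> bool:
    """Apply the rule: (H-bond donors + acceptors) < logPoct - 1."""
    return (h_donors + h_acceptors) < (log_p_oct - 1.0)


# Hypothetical compounds (values invented for illustration only).
lipophilic = likely_high_brain_penetrator(log_p_oct=3.5, h_donors=1, h_acceptors=1)
polar = likely_high_brain_penetrator(log_p_oct=1.0, h_donors=2, h_acceptors=3)
boundary = likely_high_brain_penetrator(log_p_oct=3.0, h_donors=1, h_acceptors=1)
```

Here the first compound satisfies the rule (2 < 2.5), while the second (5 is not smaller than 0) and the boundary case (2 is not smaller than 2) do not.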

The 3D QSAR models developed in the project (Chapter 11) can be used to identify and

assess possible specific BBB interactions related to P-gp. However, a limitation of such

models is that they are applicable only to the specific chemical classes on which they

are based.

In parallel with the evaluation of compound biouptake, an appraisal of compound toxicity

is also useful in integrated strategies. Development of alternative methods for toxicity

testing is a complex problem, mainly because toxic effects can be

different and specific for different biological systems, and different compounds can affect

the same system by different mechanisms of toxicity.

The application of alternative methods is based on an extrapolation of results obtained for

less complex systems (in vitro toxicity) to the more complex in vivo systems. This implies

an inevitable uncertainty in the prediction. In this connection, an interesting result was

obtained for the interspecies correlations of the two fish and two algal species

investigated in this project (Chapter 13). Toxicity to the fish O. mykiss correlated better

with toxicity to the alga S. subspicatus (n = 77, R2 = 0.574), and poorly with toxicity to

the alga S. capricornutum (n = 109, R2 = 0.246); while for the fish B. rerio the opposite

observation was made, i.e. toxicity to the fish B. rerio was well correlated with S.

capricornutum toxicity (n = 21, R2 = 0.674), having a slope of 1, and poorly with S.

subspicatus toxicity (n = 41, R2 = 0.282). This suggests a higher level of similarity in the

toxic effect for a given fish species to a particular algal species, and lack of similarity to

other algal species. An explanation of this interesting result is currently lacking.
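
Interspecies correlations of the kind quoted above are characterised by a slope and an R2 value. As a minimal sketch, assuming that these were obtained by ordinary least-squares regression of one toxicity endpoint on another, the two statistics can be computed as follows. The function name and the data are invented for illustration and are not the NCD values.

```python
from typing import Sequence, Tuple


def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit y = slope * x + intercept; returns (slope, intercept, R2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared


# Hypothetical log-transformed toxicities to an alga (x) and a fish (y).
alga = [0.0, 1.0, 2.0, 3.0]
fish = [1.0, 3.0, 5.0, 7.0]
slope, intercept, r_squared = linear_fit(alga, fish)
```

For a single predictor, R2 equals the squared correlation coefficient, so a low value indicates only that the two endpoints are weakly related, not why.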

QSARs for toxicity are usually related to a particular biological system. Therefore,

models to predict toxicity to different biological systems of interest for hazard evaluation

and risk assessment are needed. Such systems include indicators for environmental

toxicity (bacteria, aquatic species), rodent and human cell lines, and rodents and humans

in vivo.

The development of QSARs for toxicity to simple biological systems such as bacteria is

relatively straightforward, as the factors that influence the toxic effect are less complex

and easier to define. Thus, as outlined by the response-surface approach (for example

Cronin et al., 2002a) descriptors encoding compound membrane interactions and

transport (logPoct) and reactivity (LUMO, LmH) are suitable in QSARs for toxicity to

such systems. This was confirmed by the work reported in this thesis. Such QSARs are

suitable for the assessment of toxicity at the lower biological level of cells. However, the

toxic effect in vivo is determined by a complex combination of factors, related to

biokinetics and cell toxicity. QSARs for toxicity in vivo to particular species can be

developed, with the presumption that they will contain descriptors accounting for the

range of factors influencing the toxicity. However, due to the complexity of these factors

and difficulty in modelling biokinetics, such models usually do not have high statistical

quality and/or are difficult to interpret from a mechanistic point of view (as is the case

with the models for in vivo toxicity described in Chapters 13 and 14).

Another aspect that should be considered when developing and applying QSARs for

toxicity is related to the mechanism of toxic action of compounds. A number of

experimental methods are available to assess mechanisms of toxic action. In addition,

considerations on the basis of structural criteria are broadly used, due to the easy

application of this approach. Apart from the simplest compounds (containing, for

example, unbranched structures and simple functional groups, like alcohols), the

assessment of mechanism has a number of limitations. These are related to the complexity

of chemical structures, which may elicit a particular toxic effect by multiple mechanisms,

and the diversity of biological systems. Thus, a compound might act in a specific way

with biological macromolecules present in some biological systems, and cause a different

toxic effect by another, non-specific mechanism in other biological systems. However, in

practice this is very difficult to determine.

It is generally accepted that QSARs should ideally be associated with a particular

mechanism of toxic action. Usually, QSARs that include diverse compounds acting by

different mechanisms are not of high statistical quality. An example of this is the QSAR

model developed for toxicity to the bacterium S. meliloti including all investigated

compounds (Chapter 12).

Despite the above-mentioned limitations of alternative methods and QSARs, valuable

information can be obtained from the models. QSAR models, considered of sufficient

quality, can be integrated in testing strategies together with other alternative methods. In

this project, a QSAR obtained for toxicity of aliphatic compounds to S. meliloti can be

used to assess possible cell toxicity. It includes compounds acting by non-polar narcosis,

electrophilic reactivity, and specific mechanisms. The model has reasonable statistical

parameters. However, its applicability domain is restricted to aliphatic compounds acting

by non-polar narcosis, electrophilic reactivity, and specific mechanisms.

Other QSAR models with good statistics, which can be used to assess cell toxicity, are the

models developed for toxicity to V. fischeri and S. meliloti in Chapter 14 of the thesis. In

other words, bacterial toxicity can be used as an indicator of cell toxicity in general.

In order to contribute to testing strategies for assessment of environmental hazard,

QSARs for aquatic species were sought in the project. However, reasonable models could

be obtained for only two of the five investigated species – the alga S. subspicatus and the

fish O. mykiss. The QSARs developed for the simpler organism, the alga S. subspicatus,

had reasonable statistical fits (R2 between 0.601 and 0.709) and could therefore be

incorporated in testing strategies for environmental hazard. The QSAR for toxicity to the

fish O. mykiss was of low quality (R2 = 0.417), and, although it included well-understood

structural descriptors (logPoct and LUMO), should be used with caution, if at all. As

mentioned above, the complexity of the factors determining in vivo toxicity might be one

reason for the poor statistical fit of this model.

Classification QSAR models were obtained for the toxicity to the aquatic species by

applying the EU classification of very toxic, toxic, harmful, and non-toxic compounds.

However, the models correctly classified only approximately 55% of the compounds.

Thus, another type of classification was tried, namely dangerous and non-toxic. The

percentage correct classification when applying these classification models was between

70% and 83%. The classification models included simple molecular descriptors and are

easily applicable for prediction of the toxicity group of new compounds. By setting the a

priori classification probabilities to be proportional to the group sizes, or to be equal in

the two groups, different accuracy of the within-group classification can be obtained. A

priori probabilities proportional to the group sizes favour more accurate classification of

the dangerous compounds, at the expense of accuracy of classification of the non-toxic

compounds. This might serve the needs of hazard assessment, where the over-

classification of a non-toxic compound as dangerous might be considered preferable to

classification of a dangerous compound as non-toxic.
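
The influence of the a priori probabilities can be illustrated with a minimal Python sketch of a two-group classification on a single discriminant score. It assumes normally distributed scores with equal variance in both groups and, as the statement above implies, that the dangerous group is the larger one. The group sizes, means, and standard deviation are invented for illustration and do not correspond to the models in this thesis.

```python
import math


def gaussian_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_dangerous(score: float, prior_dangerous: float,
                        mean_dangerous: float, mean_non_toxic: float, sd: float) -> float:
    """Posterior probability of the dangerous group, by Bayes' rule."""
    weighted_d = prior_dangerous * gaussian_pdf(score, mean_dangerous, sd)
    weighted_n = (1.0 - prior_dangerous) * gaussian_pdf(score, mean_non_toxic, sd)
    return weighted_d / (weighted_d + weighted_n)


# Hypothetical group sizes: 70 dangerous and 30 non-toxic compounds.
n_dangerous, n_non_toxic = 70, 30
prior_proportional = n_dangerous / (n_dangerous + n_non_toxic)
prior_equal = 0.5

# A borderline compound whose score lies slightly closer to the non-toxic mean.
borderline_score = 1.8
p_equal = posterior_dangerous(borderline_score, prior_equal, 3.0, 1.0, 1.0)
p_proportional = posterior_dangerous(borderline_score, prior_proportional, 3.0, 1.0, 1.0)
```

With equal priors the borderline compound is assigned to the non-toxic group (posterior below 0.5), whereas with priors proportional to the group sizes it is assigned to the dangerous group. This shift of the decision boundary towards the smaller group is what trades accuracy for the non-toxic compounds against accuracy for the dangerous ones.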

Other results from the project, which could contribute to development of testing strategies

for human health effects, are described in Chapter 14. Models for predicting human

toxicity in vivo based on combinations of in vitro data and molecular descriptors (QSAAR

analysis) were developed. Human peak blood/serum LC50 concentrations (HAP) were

described by a combination of human liver cell toxicity and descriptors related to

electronic properties. Additionally, QSAR models for HAP were obtained, which had

slightly better statistical parameters than the QSAAR model, without the need to include an

in vitro endpoint. A reasonable model to assess the in vivo oral human lethal doses

(HLD), combining an in vitro endpoint and descriptors of chemical structure, could not be

obtained. However, HLDs were correlated with combinations of in vivo mouse LD50

values and descriptors of H-bond donor ability and molecular aromaticity. Although

useful for the assessment of chemical hazard to humans, this model cannot be regarded as

an alternative to animal testing.

Other models which could be used as alternative methods to animal testing in integrated

strategies are the models developed for rat and mouse in vivo toxicity described in

Chapter 14. QSAAR models for rat in vivo toxicity were obtained; however, the QSAR

models for rat and mouse in vivo toxicity had better statistical fits than the QSAARs.

QSAR models with good statistics were obtained for in vitro toxicity to rat hepatocytes.

This biological system is closely related to the species of interest, and is therefore more

appropriate for inclusion in testing strategies for the assessment of chemical risk.

As a general conclusion, QSARs can provide useful information about the mechanism of

action of compounds and a potential toxic effect. They can be used as tools for

preliminary analysis or supplementary information in testing strategies in combination

with other methods. Their predictive quality and domain of applicability should be taken

into account.

References

Abraham, M.H. (1993). Scales of solute hydrogen-bonding – their construction and

application to physicochemical and biochemical processes. Chemical Society Reviews, 22,

73-83.

Abraham, R.J., and Haworth, I.S. (1988). A modification to the COSMIC

parameterisation using ab initio constrained potential functions. Journal of Computer-

Aided Molecular Design, 2, 125-135.

Abraham, M.H., and McGowan, J.C. (1987). The use of characteristic volumes to

measure cavity terms in reversed phase liquid chromatography. Chromatographia, 23,

243-246.

Abraham, M.H., and Platts, J.A. (2000). Physicochemical factors that influence brain

uptake. In: Begley, D.J., Bradbury, M.W., Kreuter, J. (Eds.), The Blood-Brain Barrier

and Drug Delivery to the CNS. Marcel Dekker, Inc, New York, Basel.

Abraham, M.H., and Whiting, G.S. (1992). Hydrogen bonding: XXI. Solvation

parameters for alkylaromatic hydrocarbons from gas-liquid chromatographic data.

Journal of Chromatography A, 594, 229-241.

Abraham, M.H., Chadha, H.S., Mitchell R.C. (1994). Hydrogen bonding. 33. Factors that

influence the distribution of solutes between blood and brain. Journal of Pharmaceutical

Sciences, 83, 1257-1268.

Abraham, M.H., Chadha, H.S., Martins, F., Mitchell, R.C., Bradbury, M.W., Gratton, J.A.

(1999). Hydrogen bonding part 46: A review of the correlation and prediction of transport

properties by an LFER method: physicochemical properties, brain penetration and skin

permeability. Pesticide Science, 55, 78-88.

Abraham, M.H., Grellier, P.L., Prior, D.V., Duce, P.P., Morris, J.J., Taylor, P.J. (1989).

Hydrogen bonding. Part 7. A scale of solute hydrogen-bond acidity based on logK values

for complexation in tetrachloromethane. Journal of the Chemical Society, Perkin

Transactions 2, 699-711.

Abraham, M.H., Whiting, G.S., Doherty, R.M., Shuely, W.J. (1991). Hydrogen bonding:

XVI. A new solute solvation parameter, π2H, from gas chromatographic data. Journal of

Chromatography A, 587, 213-228.

Allinger, N.L. (1977). Conformational analysis 130. MM2. A hydrocarbon force field

utilizing V1 and V2 torsional terms. Journal of the American Chemical Society, 99, 8127-

8134.

Allinger, N.L., Yuh, Y.H., Lii, J-H. (1989). Molecular mechanics. The MM3 force field

for hydrocarbons. 1. Journal of the American Chemical Society, 111, 8551-8565.

Balaban, A.T. (1982). Highly discriminating distance-based topological index. Chemical

Physics Letters, 89, 399-404.

Balaž, S., and Lukacova, V. (2002). Subcellular pharmacokinetics and its potential for

library focusing. Journal of Molecular Graphics and Modelling, 20, 479-490.

Berry, W.D. (1993). Understanding Regression Assumptions. Series: Quantitative

Applications in the Social Sciences, 92. Thousand Oaks, CA, Sage Publications.

Boyd, E.M., Killham, K., Meharg, A.A. (2001). Toxicity of mono-, di- and

tri-chlorophenols to lux marked terrestrial bacteria, Burkholderia species Rasc c2 and

Pseudomonas fluorescens. Chemosphere, 43, 157-166.

Bondesson, I., Ekwall, B., Hellberg, S., Romert, L., Stenberg, K., Walum, E. (1989).

MEIC: a new international multicenter project to evaluate the relevance to human toxicity

of in vitro cytotoxicity tests. Cell Biology and Toxicology, 5, 331-348.

Bradbury, S.P., Henry, T.R., Niemi, G.J., Carlson, R.W., Snarski, V.M. (1989). Use of

respiratory-cardiovascular responses of rainbow trout (Salmo gairdneri) in identifying

acute toxicity syndromes in fish: Part 3: Polar narcotics. Environmental Toxicology and

Chemistry, 8, 247-261.

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and

regression trees. Monterey, CA, Wadsworth & Brooks/Cole Advanced Books &

Software.

Broderius, S.J., Kahl, M.D., Hoglund, M.D. (1995). Use of joint toxic response to define

the primary mode of toxic action for diverse industrial organic chemicals. Environmental

Toxicology and Chemistry, 9, 1591-1605.

Bundy, J.G., Morriss, A.W.J., Durham, D.G., Campbell, C.D., Paton, G.I. (2001).

Development of QSARs to investigate the bacterial toxicity and biotransformation

potential of aromatic heterocyclic compounds. Chemosphere, 42, 885-892.

Burden, F.R. (2001). Quantitative structure-activity relationship studies using Gaussian

processes. Journal of Chemical Information and Computer Sciences, 41, 830-835.

Chapman, M.S., and Connolly, M.L. (2001). Molecular surfaces: calculations, uses and

representations. In: Rossmann, M.G., and Arnold, E. (Eds.), International Tables for

Crystallography, Vol. F: Crystallography of biological macromolecules, Chapter 22:

Molecular geometry and features, Chapter 22.1: Protein surfaces and volumes:

measurement and use. International Union of Crystallography, Chester, UK.

Available on internet (30 December 2004):

http://www.sb.fsu.edu/~chapman/Home/Publications/Chapman_&_Connolly_International_Tables_1999/Molecular_Surface_Calculations.html

Clark, D.E. (2003). In silico prediction of blood-brain barrier permeation. Drug Discovery

Today, 8, 927-33.

Clark, D.E. (1999). Rapid calculation of polar molecular surface area and its application

to the prediction of transport phenomena. 2. Prediction of blood-brain barrier penetration.

Journal of Pharmaceutical Sciences, 88, 815-821.

Cramer, R.D. III, Patterson, D.E., Bunce, J.D. (1988). Comparative molecular field

analysis (CoMFA). Effect of shape on binding of steroids to carrier proteins. Journal of

the American Chemical Society, 110, 5959-5967.

Crivori, P., Cruciani, G., Carrupt, P.A., Testa, B. (2000). Predicting blood-brain barrier

permeation from three-dimensional molecular structure. Journal of Medicinal Chemistry,

43, 2204-2216.

Cronin, M. (2002). ECVAM Workshop on The Use of Computer Models as Alternatives

to Animal Experiments in Chemical Risk Assessment, October 3-4, Praha, Czech

Republic.

Cronin, M.T.D., and Schultz, T.W. (2001). Development of quantitative structure-

activity relationships for the toxicity of aromatic compounds to T. pyriformis:

comparative assessment of the methodologies. Chemical Research in Toxicology, 14,

1284-1295.

Cronin, M.T.D., and Schultz, T.W. (2003). Pitfalls in QSAR. Journal of Molecular

Structure (Theochem), 622, 39-51.

Cronin, M.T.D., Aptula, A.O., Duffy, J.C., Netzeva, T.I., Rowe, P.H., Valkova, I.V.,

Schultz, T.W. (2002a). Comparative assessment of methods to develop QSARs for the

prediction of the toxicity of phenols to Tetrahymena pyriformis. Chemosphere, 49, 1201–

1221.

Cronin, M.T.D., Bowers, G.S., Sinks, G. D., Schultz, T.W. (2000). Structure-toxicity

relationships for aliphatic compounds encompassing a variety of mechanisms of toxic

action to V. fischeri. SAR and QSAR in Environmental Research, 11, 301-312.

Cronin, M.T.D., Dearden, J.C., Duffy, J.C., Edwards, R., Manga, N., Worth, A.P.,

Worgan, A.D.P. (2002b). The importance of hydrophobicity and electrophilicity

descriptors in mechanistically-based QSARs for toxicological endpoints. SAR and QSAR

in Environmental Research, 13, 167-176.

Cronin, M.T.D., Netzeva, T.I, Dearden, J.C., Edwards, R., Worgan, A.D.P. (2004).

Assessment and Modeling of the Toxicity of Organic Chemicals to Chlorella vulgaris:

Development of a Novel Database. Chemical Research in Toxicology, 17, 545-554.

Dearden, J. (2002). ECVAM Workshop on The Use of Computer Models as Alternatives

to Animal Experiments in Chemical Risk Assessment, October 3-4, Praha, Czech

Republic.

Dearden, J.C., Al-Noobi, A., Scott, A.C., Thomson, S.A. (2003). QSAR studies on P-

glycoprotein-regulated multidrug resistance and on its reversal by phenothiazines. SAR

and QSAR in Environmental Research, 14, 447-454.

Dearden, J.C., Bradburne, S.J.A., Cronin, M.T.D., Solanki, P. (1988). The physical

significance of molecular connectivity. In: Turner, J.E., England, M.W., Schultz, T.W.,

Kwaak, N.J. (Eds.), Proceedings of the Third International Workshop on Quantitative

Structure-Activity Relationships in Environmental Chemistry. US DOE, Oak Ridge, TN,

43-50.

Devillers, J., Chezeau, A., Thybaud, E., Rahmani, R. (2002a). QSAR modelling of the

adult and developmental toxicity of glycols, glycol ethers and xylenes to Hydra attenuata.

SAR and QSAR in Environmental Research, 13, 555-566.

Devillers, J., Chezeau, A., Thybaud, E. (2002b). PLS-QSAR of the adult and

developmental toxicity of chemicals to Hydra attenuata. SAR and QSAR in

Environmental Research, 13, 705-712.

Dewar, M.J.S., and Thiel, W. (1977). Ground states of molecules. 38. The MNDO

method. Approximation and parameters. Journal of the American Chemical Society, 99,

4899-4907.

Dewar, M.J.S., Zoebisch, E.G., Healy, E.F., Stewart, J.J.P. (1985). The development and

use of quantum-mechanical models. 76. AM1 – a new general-purpose quantum-

mechanical molecular model. Journal of the American Chemical Society, 107, 3902-3909.

Dimitrov, S.D., Mekenyan, O.G., Sinks, G.D., Schultz, T.W. (2003). Global modeling of

narcotic chemicals: ciliate and fish toxicity. Journal of Molecular Structure (Theochem),

622, 63-70.

Dimitrov, S.D., Mekenyan, O.G., Schultz, T.W. (2000). Interspecies modeling of narcotic

toxicity to aquatic animals. Bulletin of Environmental Contamination and Toxicology, 65,

399-406.

Di Marzio, W., Galassi, S., Todeschini, R., Consolaro, F. (2001). Traditional versus

WHIM molecular descriptors in QSAR approaches applied to fish toxicity studies.

Chemosphere, 44, 401-406.

EC (1967). Council Directive 67/548/EEC of 27 June 1967 on the approximation of laws,

regulations and administrative provisions relating to the classification, packaging and

labelling of dangerous substances, Official Journal of the European Community P 196,

16/08/1967.

EC (1979). Council Directive 79/831/EEC of 18 September 1979 amending for the sixth

time Directive 67/548/EEC on the approximation of the laws, regulations and

administrative provisions relating to the classification, packaging and labelling of

dangerous substances, Official Journal of the European Community L 259, 15/10/1979.

EC (1986). Council Directive 86/609/EEC of 24 November 1986 on the approximation of

laws, regulations and administrative provisions of the Member States regarding the

protection of animals used for experimental and other scientific purposes. Official Journal

of the European Community L358, 18.12.1986.

EC (2001a). Annex VI to Council Directive 67/548/EEC of 21 August 2001 on general

classification and labelling requirements for dangerous substances and preparations,

Official Journal of the European Community L 225/263, 21/08/2001.

EC (2001b). White Paper on a Strategy for a Future Chemicals Policy. Brussels:

Commission of the European Communities. Available on the internet:

http://europa.eu.int/comm/environment/chemicals/whitepaper.htm

EC (2003). Commission’s proposal COM 2003 0644 (03) of 29 October 2003 for a

Regulation of the European Parliament and of the Council concerning the Registration,

Evaluation, Authorisation and Restriction of Chemicals (REACH). Available on the

internet:

http://europa.eu.int/eur-lex/en/com/pdf/2003/com2003_0644en.html

Enache, M., Dearden, J.C., Walker, J.D. (2003). QSAR analysis of metal ion toxicity

data in sunflower callus cultures (Helianthus annuus “Sunspot”). QSAR and

Combinatorial Science, 22, 234 – 240.

Enache, M., Palit, P., Dearden, J.C., Lepp, N.W. (2000). Correlation of physico-chemical

parameters with toxicity of metal ions to plants. Pest Management Science, 56, 821-824.

Ekwall, B., Clemedson, C., Crafoord, B., Ekwall, B., Hallander, S., Walum, E.,

Bondesson, I. (1998a). MEIC evaluation of acute systemic toxicity. Part V. Rodent and

human toxicity data for the 50 reference chemicals. Alternatives to Laboratory Animals,

26, Suppl. 2, 571-616.

Ekwall, B., et al. (1998b). MEIC evaluation of acute systemic toxicity. Part VI. The

prediction of human toxicity by rodent LD50 values and results from 61 in vitro methods.

Alternatives to Laboratory Animals, 26, Suppl. 2, 617-658.

Faucon, J.C., Bureau, R., Faisant, J., Briens, F., Rault, S. (2001). Prediction of the

Daphnia acute toxicity from heterogeneous data. Chemosphere, 44, 407-422.

Feher, M., Sourial, E., Schmidt, J.M. (2000). A simple model for the prediction of blood-

brain partitioning. International Journal of Pharmaceutics, 201, 239-247.

Ford, J.M., Prozialeck, W.C., Hait, W.N. (1989). Structural features determining activity

of phenothiazines and related drugs for inhibition of cell growth and reversal of multidrug

resistance. Molecular Pharmacology, 35, 105-115.

Freidig, A.P., and Hermens, J.L.M. (2000). Narcosis and chemical reactivity QSARs for

acute fish toxicity. Quantitative Structure-Activity Relationships, 19, 547-553.

Gaillard, P.J., and De Boer, A.G. (2000). Relationship between permeability status of the

blood–brain barrier and in vitro permeability coefficient of a drug. European Journal of

Pharmaceutical Sciences, 12, 95–102.

Garberg, P. (2001). Prevalidation of in vitro models for the blood-brain barrier. ECVAM

internal report.

Geiss, K.T., and Frazier, J.M. (2001). QSAR modeling of oxidative stress in vitro

following hepatocyte exposures to halogenated methanes. Toxicology in Vitro, 15, 557-

563.

Gloor, S.M., Wachtel, M., Bolliger, M.F., Ishihara, H., Landmann, R., Frei, K. (2001).

Molecular and cellular permeability control at the blood–brain barrier. Brain Research

Reviews, 36, 258–264.

Gramatica, P., Vighi, M., Consolaro, F., Todeschini, R., Finizio, A., Faust, M. (2001).

QSAR approach for the selection of congeneric compounds with a similar toxicological

mode of action. Chemosphere, 42, 873-883.

Gratton, J.A., Abraham, M.H., Bradbury, M.W., Chadha, H.S. (1997). Molecular factors

influencing drug transfer across the blood-brain barrier. Journal of Pharmacy and

Pharmacology, 49, 1211-1216.

Grodnitzky, J.A., and Coats, J.R. (2002). QSAR evaluation of monoterpenoids’

insecticidal activity. Journal of Agricultural and Food Chemistry, 50, 4576-4580.

Ghose, A.K., and Crippen, G.M. (1986). Atomic physicochemical parameters for three-

dimensional structure-directed quantitative structure-activity relationships I. Partition

coefficients as a measure of hydrophobicity. Journal of Computational Chemistry, 7, 565-

577.

Hall, L.H., and Kier, L.B. (1991). The molecular connectivity chi indices and kappa shape

indices in structure-property modelling. In: Boyd, D.B., Lipkowitz, K. (Eds.), Reviews of

Computational Chemistry, Vol. 2.

Hall, L.H., Mohney, B.K., Kier, L.B. (1991). The electrotopological state: an atom index

for QSAR. Quantitative Structure-Activity Relationships, 10, 43.

Hansch, C., and Gao, H. (1997). Comparative QSAR: radical reactions of benzene

derivatives in chemistry and biology. Chemical Reviews, 97, 2995-3059.

Hansch, C., and Kurup, A. (2003). QSAR of chemical polarisability and nerve toxicity. 2.

Journal of Chemical Information and Computer Sciences, 43, 1647-1651.

Hansch, C., Kim, D., Leo, A.J., Novellino, E., Silipo, C., Vittoria, A. (1989). Toward a

quantitative comparative toxicology of organic compounds. Critical Reviews in

Toxicology, 19, 185–226.

Hansch, C., McKarns, S.C., Smith, C.J., Doolittle, D.J. (2000). Comparative QSAR

evidence for a free-radical mechanism of phenol-induced toxicity. Chemico-Biological

Interactions, 127, 61–72.

Hartigan, J.A., and Wong, M.A. (1978). Algorithm 136. A k-means clustering algorithm.

Applied Statistics, 28, 100-108.

Hermens, J.L.M. (1990). Electrophiles and acute toxicity to fish. Environmental Health

Perspectives, 87, 219-225.

Huang, H., Wang, X., Ou, W., Zhao, J., Shao, Y., Wang, L. (2003). Acute toxicity of

benzene derivatives to the tadpoles (Rana japonica) and QSAR analyses. Chemosphere,

53, 963-970.

Huuskonen, J. (2003). QSAR modeling with the electrotopological state indices:

predicting the toxicity of organic chemicals. Chemosphere, 20, 949-953.

Ivanciuc, O. (2000). QSAR comparative study of Wiener descriptors for weighted

molecular graphs. Journal of Chemical Information and Computer Sciences, 40, 1412-

1422.

Kaiser, H.F. (1960). The application of electronic computers to factor analysis.

Educational and Psychological Measurement, 20, 141-151.

Kamlet, M.J., and Taft, R.W. (1976). The solvatochromic comparison method. I. The

β-scale of solvent hydrogen-bond acceptor (HBA) basicities. Journal of the American

Chemical Society, 98, 377-383.

Kansy, M., and van de Waterbeemd, H. (1992). Hydrogen-bonding capacity and brain

penetration. Chimia, 46, 299-303.

Kahru, A., and Borchardt, B. (1994). Toxicity of 39 MEIC chemicals to bioluminescent

photobacteria (the Biotox™ test): correlation with other test systems. Alternatives to

Laboratory Animals, 22, 147-160.

Kapur, S., Shusterman, A., Verma, R.P., Hansch, C., Selassie, C.D. (2000). Toxicology of

benzyl alcohols: a QSAR analysis. Chemosphere, 41, 1643-1649.

Kelder, J., Grootenhuis, P.D., Bayada, D.M., Delbressine, L.P., Ploemen, J.P. (1999).

Polar molecular surface as a dominating determinant for oral absorption and brain

penetration of drugs. Pharmaceutical Research, 16, 1514-1519.

Keseru, G.M., and Molnar, L. (2001). High-throughput prediction of blood-brain

partitioning: a thermodynamic approach. Journal of Chemical Information and Computer

Sciences, 41, 120-128.

Kier, L.B. (1985). A shape index from molecular graphs. Quantitative Structure-Activity

Relationships, 4, 109-116.

Kier, L.B., and Hall, L.H. (1986). Molecular Connectivity in Structure-Activity Analysis.

Research Studies Press - Wiley, Chichester, UK.

Kier, L.B., and Hall, L.H. (1990). An electrotopological state index for atoms in

molecules. Pharmaceutical Research, 7, 801.

Kier, L.B., and Hall, L.H. (1999). Molecular Structure Description: The

Electrotopological State. Academic Press.

Klebe, G. (1998). Comparative molecular similarity indices: CoMSIA. In: Kubinyi, H.,

Folkers, G., Martin, Y.C. (Eds.), 3D QSAR in Drug Design. Kluwer Academic Publishers,

UK, 3-87.

Könemann, H. (1981). Quantitative structure-activity relationships in fish toxicity studies.

Part 1: Relationship for 50 industrial pollutants. Toxicology, 19, 209-221.

Kubinyi, H. (1993). QSAR in drug design. Theory, Methods and Application. Escom,

Leiden.

Lee, G., Dallas, S., Hong, M., Bendayan, R. (2001). Drug transporters in the central

nervous system: brain barriers and brain parenchyma considerations. Pharmacological

Reviews, 53, 569-596.

Lessigiarska, I., Cronin, M.T.D., Worth, A.P., Dearden, J.C., Netzeva, T.I. (2004a).

QSARs for toxicity to the bacterium Sinorhizobium meliloti. SAR and QSAR in


Lessigiarska, I., Pajeva, I., Cronin, M.T.D., Worth, A.P. (2005a). 3D QSAR investigation

of the blood-brain barrier penetration of chemical compounds. SAR and QSAR in


Lessigiarska, I., Worth, A.P., Netzeva, T.I. (2005b). Comparative review of QSARs for

acute toxicity. EUR report No. 21559 EN. EC Joint Research Centre, Ispra, Italy.

Lessigiarska, I., Worth, A.P., Netzeva, T.I., Dearden, J.C., Cronin, M.T.D. (2006).

Quantitative structure-activity-activity and quantitative structure-activity investigations of

human and rodent toxicity. Chemosphere. In press.

Lessigiarska, I., Worth, A.P., Sokull-Klüttgen, B., Jeram, S., Dearden, J.C., Netzeva, T.I.,

Cronin, M.T.D. (2004b). QSAR investigation of a large data set for fish, algae and

Daphnia toxicity. SAR and QSAR in Environmental Research, 15, 413-431.

Lin, Z., Yin, K., Shi, P., Wang, L., Yu, H. (2003a). Development of QSARs for

predicting the joint effects between cyanogenic toxicants and aldehydes. Chemical

Research in Toxicology, 16, 1365-1371.

Lin, Z., Zhong, P., Yin, K., Wang, L., Yu, H. (2003b). Quantification of joint effect for

hydrogen bond and development of QSARs for predicting mixture toxicity.


Lundquist, S., Renftel, M., Brillault, J., Fenart, L., Cecchelli, R., Dehouck, M-P. (2002).

Prediction of drug transport through blood-brain barrier in vivo: a comparison between

two in vitro cell models. Pharmaceutical Research, 19, 976-981.

Lipnick, R.L. (1991). Outliers: their origin and use in the classification of molecular

mechanisms of toxicity. Science of the Total Environment, 109/110, 131-153.

Liu, X., Wang, B., Huang, Z., Han, S., Wang, L. (2003a). Acute toxicity and quantitative

structure-activity relationships of alpha-branched phenylsulfonyl acetates to D. magna.


Liu, X., Wu, C., Han, S., Wang, L., Zhang, Z. (2001). The acute toxicity of alpha-

branched phenylsulfonyl acetates in V. fischeri test. Ecotoxicology and Environmental

Safety, 49, 240-244.

Liu, X., Yang, Z., Wang, L. (2003b). Three-dimensional quantitative structure–activity

relationship study for phenylsulfonyl carboxylates using CoMFA and CoMSIA.


Lombardo, F., Blake, J.F., Curatolo, W.J. (1996). Computation of brain-blood partitioning

of organic solutes via free energy calculations. Journal of Medicinal Chemistry, 39, 4750-

4755.

Lu, G.H., Yuan, X., Zhao, Y.H. (2001). QSAR study on the toxicity of substituted

benzenes to the algae (Scenedesmus obliquus). Chemosphere, 44, 437-440.

Luco, J.M. (1999). Prediction of the brain-blood distribution of a large set of drugs from

structurally derived descriptors using partial least-squares (PLS) modelling. Journal of

Chemical Information and Computer Sciences, 39, 396-404.

MEIC data base: http://www.cctoxconsulting.a.se/ (accessed November 2004).

MolconnZ version 4.0, online help manual (Hall Associates Consulting). Available on the

internet (20 December 2004): http://www.eslc.vabiotech.com/molconn/manuals/400/

Moridani, M.Y., Siraki, A., O’Brien, P.J. (2003b). Quantitative structure toxicity

relationships for phenols in isolated rat hepatocytes. Chemico-Biological Interactions,

145, 213-223.

Mulliken, R.S. (1955). Electronic population analysis on LCAO-MO molecular wave

functions. I. Journal of Chemical Physics, 23, 1833-1840.

Morao, I., and Hillier, I.H. (2001). Magnetic analysis (NICS) of monoarylic cations.

Linear relationship between aromaticity and Hammett constants (σp+). Tetrahedron

Letters, 42, 4429–4431.

Morley, S.D., Abraham, R.J., Haworth, I.S., Jackson, D.E., Saunders, M.R., Vinter, J.G.

(1991). COSMIC(90): An improved molecular mechanics treatment of hydrocarbons and

conjugated systems. Journal of Computer-Aided Molecular Design, 5, 475-504.

Netzeva, T.I. (2003). Whole molecule and atom based, topological descriptors. In:

Cronin, M.T.D., and Livingstone, D. (Eds.), Predicting Chemical Toxicity and Fate.

Taylor and Francis, London, 61-83.

Netzeva, T.I., Aptula, A.O., Benfenati, E., Cronin, M.T.D., Gini, G., Lessigiarska, I.,

Maran, U., Vracko, M., Schüürmann, G. (2005). Description of the electronic structure of

organic chemicals using semiempirical and ab initio methods for development of

toxicological QSARs. Journal of Chemical Information and Modeling, 45, 106-114.

Netzeva, T.I., Dearden, J.C., Edwards, R., Worgan, A.D.P., Cronin, M.T.D. (2004).

QSAR analysis of the toxicity of aromatic compounds to Chlorella vulgaris in a novel

short-term assay. Journal of Chemical Information and Computer Sciences, 44, 258-265.

Norinder, U., and Osterberg, T. (2001). Theoretical calculation and prediction of drug

transport processes using simple parameters and partial least squares projections to latent

structures (PLS) statistics. The use of electrotopological state indices. Journal of

Pharmaceutical Sciences, 90, 1076-1085.

Ownby, D.R., and Newman, M.C. (2003). Advances in quantitative ion character-activity

relationships (QICARs): using metal-ligand binding characteristics to predict metal

toxicity. QSAR and Combinatorial Science, 22, 241-246.

Pajeva, I.K., and Wiese, M. (1998). Molecular modelling of phenothiazines and related

drugs as multidrug resistance modifiers: a comparative molecular field analysis study.

Journal of Medicinal Chemistry, 41, 1815-1826.

Pajeva, I.K., Wiese, M., Cordes, H.P., Seydel, J.K. (1996). Membrane interactions of

some catamphiphilic drugs and relation to their multidrug resistance reversing ability.

Journal of Cancer Research and Clinical Oncology, 122, 27-40.

Parkerton, T.F., and Konkel, W.J. (2000). Application of quantitative structure-activity

relationships for assessing the aquatic toxicity of phthalate esters. Ecotoxicology and

Environmental Safety, 45, 61-78.

Pedersen, F., de Bruijn, J., Munn, S., van Leeuwen, K. (2003). Assessment of additional

testing needs under REACH. Effects of QSARs, risk based testing and voluntary industry

initiatives. EUR Report No. 20863 EN. EC Joint Research Centre, Ispra, Italy.

Platts, J.A., Abraham, M.H., Zhao, Y.H., Hersey, A., Ijaz, L., Butina, D. (2001).

Correlation and prediction of a large blood–brain distribution data set — an LFER study.

European Journal of Medicinal Chemistry, 36, 719–730.

Pople, J.A., and Beveridge, D.L. (1970). Approximate Molecular Orbital Theory.

McGraw-Hill, New York.

Pople, J.A., and Segal, G.A. (1965). Approximate self-consistent molecular orbital theory

II. Calculations with complete neglect of differential overlap. Journal of Chemical

Physics, 43, S136-S149.

Pople, J.A., Santry, D.P., Segal, G.A. (1965). Approximate self-consistent molecular

orbital theory I. Invariant procedures. Journal of Chemical Physics, 43, S129-S135.

QsarIS version 1.1, help manual, then SciVision-Academic Press, San Diego, CA;

currently MDL®QSAR by Elsevier MDL, San Leandro, CA.

Radojnova, D., Tenekedjiev, K., Yordanov, Y. (2002). Stature estimation from long bone

length in Bulgarians. HOMO, 52, 221-232.

Ramu, A., and Ramu, N. (1992). Reversal of multidrug resistance by phenothiazines and

structurally related compounds. Cancer Chemotherapy and Pharmacology, 30, 165-173.

Randić, M. (2001). Novel shape descriptors for molecular graphs. Journal of Chemical

Information and Computer Sciences, 41, 607-613.

Rao, C.R. (1951). An asymptotic expansion of the distribution of Wilks’ criterion.

Bulletin of the International Statistical Institute, 33, 177-181.

Ren, S., and Frymier, P.D. (2002). Estimating the toxicities of organic chemicals to

bioluminescent bacteria and activated sludge. Water Research, 36, 4406-4414.

Roberts, D.W., and Costello, J.F. (2003). Mechanisms of action for general and polar

narcosis: A difference in dimension. QSAR and Combinatorial Science, 22, 226-233.

Rose, K., and Hall, L.H. (2003). E-state modelling of fish toxicity independent of 3D

structure information. SAR and QSAR in Environmental Research, 14, 113-129.

Rose, K., Hall, L.H., Kier, L.B. (2002). Modeling blood-brain barrier partitioning using

the electrotopological state. Journal of Chemical Information and Computer Sciences, 42,

651-666.

Rücker, G., and Rücker, C. (1993). Counts of all walks as atomic and molecular

descriptors. Journal of Chemical Information and Computer Sciences, 33, 683-695.

Russom, C.L., Bradbury, S.P., Broderius, S.J., Hammermeister, D.E., Drummond, R.A.

(1997). Predicting modes of action from chemical structure: Acute toxicity in the fathead

minnow (Pimephales promelas). Environmental Toxicology and Chemistry, 16, 948-967.

Sabljić, A. (1990). Topological indices and environmental chemistry. In: Karcher, W., and

Devillers J. (Eds.), Practical applications of quantitative structure-activity relationships

(QSAR) in environmental chemistry and toxicology. Kluwer Academic Publishers, 61-82.

Schmitt, H., Altenburger, R., Jastorff, B., Schüürmann, G. (2000). Quantitative structure-

activity analysis of the algae toxicity of nitroaromatic compounds. Chemical Research in

Toxicology, 13, 441-450.

Schultz, T.W., and Cronin, M.T.D. (1997). Quantitative structure-activity relationships

for weak acid respiratory uncouplers to Vibrio fischeri. Environmental Toxicology and

Chemistry, 16, 357–360.

Schultz, T.W., and Cronin, M.T.D. (2003). Essential and desirable characteristics of

ecotoxicity quantitative structure-activity relationships. Environmental Toxicology and

Chemistry, 22, 599-607.

Schultz, T.W., and Seward, J.R. (2000). Dimyristoyl phosphatidylcholine/water

partitioning-dependent modeling of narcotic toxicity to T. pyriformis. Quantitative

Structure-Activity Relationships, 19, 239-344.

Schultz, T.W., Cronin, M.T.D., Netzeva, T.I., Aptula, A.O. (2002). Structure-toxicity

relationships for aliphatic chemicals evaluated with T. pyriformis. Chemical Research in

Toxicology, 15, 1602-1609.

Schultz, T.W., Cronin, M.T.D., Walker, J.D., Aptula, A.O. (2003a). Quantitative

structure–activity relationships (QSARs) in toxicology: a historical perspective. Journal

of Molecular Structure (Theochem), 622, 1–22.

Schultz, T.W., Lin, D.T., Wilke, T.S., Arnold, L.M. (1990). Quantitative structure-

activity relationships for the Tetrahymena pyriformis population growth endpoint: a

mechanism of action approach. In: Karcher, W., and Devillers, J. (Eds.), Practical

Applications of Quantitative Structure-Activity Relationships (QSAR) in Environmental

Chemistry and Toxicology. Kluwer Academic Publishers, 61-82.

Schultz, T.W., Sinks, G.D., Bearden, A.P. (1998). QSAR in aquatic toxicology: a

mechanism of action approach comparing toxic potency to Pimephales promelas, T.

pyriformis, and V. fischeri. In: Devillers, J. (Ed.), Comparative QSAR. Taylor & Francis,

New York, 51-109.

Schultz, T.W., Sinks, G.D., Cronin, M.T.D. (1997). Identification of mechanisms of toxic

action of phenols to Tetrahymena pyriformis from molecular descriptors. In: Chen, F., and

Schüürmann, G. (Eds.), Quantitative Structure-Activity Relationships in Environmental

Sciences - VII. SETAC Press, Pensacola, USA, 329-342.

Schüürmann, G. (2004). Quantum chemical descriptors in structure-activity relationships

– calculation, interpretation and comparison of methods. In: Cronin, M.T.D., and

Livingstone, D.J. (Eds.), Predicting Chemical Toxicity and Fate. Taylor and Francis,

London, 85-149.

Selassie, C.D., Garg, R., Kapur, S., Kurup, A., Verma, R.P., Mekapati, S.B., Hansch, C.

(2002). Comparative QSAR and the radical toxicity of various functional groups.

Chemical Reviews, 102, 2585-2605.

Shrivastava, R., Delominie, C., Chevalier, A., John, G., Ekwall, B., Walum, E.,

Massingham, R. (1993). Comparison of in vivo acute lethal potency and in vitro

cytotoxicity of 48 chemicals. Cell Biology and Toxicology, 8, 157-167.

Stewart, J.J.P. (1989). Optimization of parameters for semiempirical methods. I. Method.

Journal of Computational Chemistry, 10, 209-220.

Statistica version 5.5, help manual, StatSoft Inc., Tulsa, OK, USA.

Sun, H., Dai, H., Shaik, N., Elmquist, W. F. (2003). Drug efflux transporters in the CNS.

Advanced Drug Delivery Reviews, 55, 83–105.

Sverdrup, L.E., Nielsen, T., Krogh, P.H. (2002). Soil ecotoxicity of polycyclic aromatic

hydrocarbons in relation to soil sorption, lipophilicity, and water solubility.

Environmental Science and Technology, 36, 2429-2435.

Taft, R.W., and Kamlet, M.J. (1976). The solvatochromic comparison method. 2. The

α-scale of solvent hydrogen-bond donor (HBD) acidities. Journal of the American

Chemical Society, 98, 2886-2894.

Tamai, I., and Tsuji, A. (2000). Transporter-mediated permeation of drugs across the

blood-brain barrier. Journal of Pharmaceutical Sciences, 89, 1371-1388.

Tao, S., Xi, X., Xu, F., Dawson, R. (2002a). A QSAR model for predicting toxicity (LC50)

to rainbow trout. Water Research, 36, 2926-2930.

Tao, S., Xi, X., Xu, F., Li, B., Cao, J., Dawson, R. (2002b). A fragment constant QSAR

model for evaluating the EC50 values of organic chemicals to D. magna. Environmental

Pollution, 116, 57-64.

Tenekedjiev, K., and Radojnova, D. (2001). Numerical procedures for stature estimation

according to length of limb bones in Bulgarian and Hungarian populations. Acta

Morphologica et Anthropologica, 6, 90-97.

Tenekedjiev, K. (1994). Technical diagnosis of complex energetic objects using statistical

pattern recognition. PhD Thesis Summary, Technical University-Varna, Bulgaria.

Terada, H. (1990). Uncouplers of oxidative phosphorylation. Environmental Health

Perspectives, 87, 213-218.

Tichý, M., Borek-Dohalský, V., Matousová, D., Rucki, M., Feltl, L., Roth, Z. (2002).

Prediction of acute toxicity of chemicals in mixtures: worms Tubifex tubifex and

gas/liquid distribution. SAR and QSAR in Environmental Research, 13, 261-269.

Todeschini, R., Vighi, M., Finizio, A., and Gramatica, P. (1997). 3D-modelling and

prediction by WHIM descriptors. Part 8. Toxicity and physico-chemical properties of

environmental priority chemicals by 2D-TI and 3D-WHIM descriptors. SAR and QSAR in


Topliss, J.G., and Costello, R.J. (1972). Chance correlations in structure-activity studies

using multiple regression analysis. Journal of Medicinal Chemistry, 15, 1066-1068.

Veith, G.D., and Broderius, S.J. (1990). Rules for distinguishing toxicants that cause type I

and type II narcosis syndromes. Environmental Health Perspectives, 87, 207-211.

Toropov, A.A., and Benfenati, E. (2004). QSAR modelling of aldehyde toxicity by means

of optimization of correlation weights of nearest neighbouring codes. Journal of

Molecular Structure (Theochem), 676, 165-169.

Toropov, A.A., and Schultz, T.W. (2003). Prediction of aquatic toxicity: use of

optimization of correlation weights of local graph invariants. Journal of Chemical

Information and Computer Sciences, 43, 560-567.

Toropov, A.A., and Toropova, A.P. (2002). QSAR modeling of toxicity on optimization of

correlation weights of Morgan extended connectivity. Journal of Molecular Structure

(Theochem), 578, 129-134.

Trohalaki, S., Gifford, E., Pachter, R. (2000). Improved QSARs for predictive toxicology

of halogenated hydrocarbons. Computers and Chemistry, 24, 421-427.

Urani, C., Doldi, M., Crippa, S., Camatini, M. (1998). Human-derived cell lines to study

xenobiotic metabolism. Chemosphere, 37, 2785-2795.

Vamp Reference Guide (1997), Accelrys, Inc.

Van der Jagt, K., Munn, S., Tørsløv, J., de Bruijn, J. (2004). Alternative approaches can

reduce the use of test animals under REACH. Addendum to the report: Assessment of

additional testing needs under REACH. Effects of (Q)SARs, risk based testing and

voluntary industry initiatives. EUR Report No. 21405 EN. EC Joint Research Centre,

Ispra, Italy.

Veith, G.D., Call, D.J., Brooke, L.T. (1983). Structure-toxicity relationships for the

fathead minnow, Pimephales promelas: narcotic industrial chemicals. Canadian Journal

of Fisheries and Aquatic Sciences, 40, 743–748.

Verhaar, H.J.M., Van Leeuwen, C.J., Hermens, J.L.M. (1992). Classifying environmental

pollutants. 1. Structure-activity relationships for prediction of aquatic toxicity.


Verma, R.P., Kapur, S., Barberena, O., Shusterman, A., Hansch, C., Selassie, C.D.

(2003). Synthesis, cytotoxicity, and QSAR analysis of X-thiophenols in rapidly dividing

cells. Chemical Research in Toxicology, 16, 276-284.

Viswanadhan, V.N., Ghose, A.K., Revankar, G.R., Robins, R.K. (1989). Atomic

physicochemical parameters for three-dimensional structure-directed quantitative

structure-activity relationships. 4. Additional parameters for hydrophobic and dispersive

interactions and their application for an automated superposition of certain naturally

occurring nucleoside antibiotics. Journal of Computational Chemistry, 29, 163-172.

Voelkel, A. (1994). Structural descriptors in organic chemistry - new topological

parameter based on electrotopological state of graph vertices. Computers and Chemistry,

18, 1-4.

Wang, X., Dong, Y., Wang, L., Han, S. (2001). Acute toxicity of substituted phenols to

Rana japonica tadpoles and mechanism-based quantitative structure-activity relationship

(QSAR) study. Chemosphere, 44, 447-455.

Wang, X., Sun, C., Wang, Y., Wang, L. (2002a). Quantitative structure-activity

relationships for the inhibition toxicity to root elongation of Cucumis sativus of selected

phenols and interspecies correlation with T. pyriformis. Chemosphere, 46, 153-161.

Wang, X., Yin, C., Wang, L. (2002b). Structure-activity relationships and response-

surface analysis of nitroaromatics toxicity to the yeast (Saccharomyces cerevisiae).


Wang, X., Yu, J., Wang, Y., Wang, L. (2002c). Mechanism-based quantitative structure-

activity relationships for the inhibition of substituted phenols on germination rate of

Cucumis sativus. Chemosphere, 46, 241-250.

Weiner, P.K., and Kollman, P.A. (1981). AMBER: assisted model building with energy

refinement. A general program for modeling molecules and their interactions. Journal of

Computational Chemistry, 2, 287-303.

Weiner, S.J., Kollman, P.A., Case, D.A., Singh, U.C., Ghio, C., Alagona, G., Profeta, S.,

Jr., Weiner, P.K. (1984). A new force field for molecular mechanical simulation of

nucleic acids and proteins. Journal of the American Chemical Society, 106, 765-784.

Wiener, H. (1947). Structural determination of paraffin boiling points. Journal of the

American Chemical Society, 69, 17-20.

Wiese, M., and Pajeva, I.K. (2001). Structure-activity relationships of multidrug

resistance reversers. Current Medicinal Chemistry, 8, 685-713.

Wilson, L.Y., and Famini, G.R. (1991). Using theoretical descriptors in quantitative

structure-activity relationships: Some toxicological indices. Journal of Medicinal

Chemistry, 34, 1668-1674.

Worgan, A.D.P., Dearden, J.C., Edwards, R., Netzeva, T.I., Cronin, M.T.D. (2003).

Evaluation of a novel short-term algal toxicity assay by the development of QSARs and

inter-species relationships for narcotic chemicals. QSAR and Combinatorial Science, 22,

204-209.

Xu, M., Zhang, A., Han, S., Wang, L. (2002). Studies of 3D-quantitative structure–

activity relationships on a set of nitroaromatic compounds: CoMFA, advanced CoMFA

and CoMSIA. Chemosphere, 48, 707–715.

Yu, H.X., Lin, Z.F., Feng, J.F., Xu, T.L., Wang, L.S. (2001). Development of quantitative

structure activity relationships in toxicity prediction of complex mixtures. Acta

Pharmacologica Sinica, 22, 45-49.

Zerner, M.C. (1993). ZINDO, a comprehensive semiempirical quantum chemistry

package. Quantum Theory Project. Gainesville, Florida, USA.

Young, R.C., Mitchell, R.C., Brown, T.H., Ganellin, C.R., Griffiths, R., Jones, M., Rana,

K.K., Saunders, D., Smith, I.R., Sore, N.E., Wilks, T.J. (1988). Development of a new

physicochemical model to the design of centrally acting H2 receptor histamine

antagonists. Journal of Medicinal Chemistry, 31, 656-671.

List of author’s publications related to the research in the project

Lessigiarska, I., and Worth, A. (2003). The use of computer models as alternatives to

animal experiments in chemical risk assessment. Alternatives to Laboratory Animals, 31,

67-73.

Lessigiarska, I., Cronin, M.T.D., Worth, A.P., Dearden, J.C., Netzeva, T.I. (2004a).

QSARs for toxicity to the bacterium Sinorhizobium meliloti. SAR and QSAR in


Lessigiarska, I., Pajeva, I., Cronin, M.T.D., Worth, A.P. (2005a). 3D QSAR investigation

of the blood-brain barrier penetration of chemical compounds. SAR and QSAR in


Lessigiarska, I., Worth, A.P., Netzeva, T.I., Dearden, J.C., Cronin, M.T.D. (2006).

Quantitative structure-activity-activity and quantitative structure-activity investigations of

human and rodent toxicity. Chemosphere. In press.

Lessigiarska, I., Worth, A.P., Sokull-Klüttgen, B., Jeram, S., Dearden, J.C., Netzeva, T.I.,

Cronin, M.T.D. (2004b). QSAR investigation of a large data set for fish, algae and

Daphnia toxicity. SAR and QSAR in Environmental Research, 15, 413-431.

Netzeva, T.I., Aptula, A.O., Benfenati, E., Cronin, M.T.D., Gini, G., Lessigiarska, I.,

Maran, U., Vracko, M., Schüürmann, G. (2005). Description of the electronic structure of

organic chemicals using semiempirical and ab initio methods for development of

toxicological QSARs. Journal of Chemical Information and Modeling, 45, 106-114.

APPENDIX A

C SOURCE CODE OF THE PROGRAMS FOR STATISTICAL ALGORITHMS

A.1. Program for reducing data multicollinearity
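The selection rule used by the program listed in this section can be summarised in a short sketch. It is not part of the thesis program and rests on several assumptions: the data are already held in memory as column arrays, the function names r_squared, most_intercorrelated and prune are hypothetical, and the squared correlation is compared with the squared limit, which is equivalent to testing |r| > r_lim and avoids a square root. Unlike the interactive program, the sketch removes variables without asking for confirmation.

```c
#include <stddef.h>

/* Squared Pearson correlation between two columns of n cases.
   Returns 0 for a constant column (an assumption of this sketch). */
double r_squared(const double *x, const double *y, size_t n)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, syy = 0.0, sxy = 0.0;
    double cov, vx, vy;
    size_t i;

    for (i = 0; i < n; ++i) {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    cov = (double) n * sxy - sx * sy;
    vx = (double) n * sxx - sx * sx;
    vy = (double) n * syy - sy * sy;
    if (vx <= 0.0 || vy <= 0.0) return 0.0;
    return (cov * cov) / (vx * vy);
}

/* Index of the removable column with the largest number of partners
   having |r| > r_lim, or -1 if no removable column has such a partner.
   active[i] != 0: column i is still in the data set.
   keep[i] != 0:   column i must remain and is never returned. */
int most_intercorrelated(double **cols, const int *active, const int *keep,
                         int n_cols, size_t n_cases, double r_lim)
{
    int best = -1, best_count = 0, count, i, j;

    for (i = 0; i < n_cols; ++i) {
        if (!active[i] || keep[i]) continue;
        count = 0;
        for (j = 0; j < n_cols; ++j) {
            if (j == i || !active[j]) continue;
            if (r_squared(cols[i], cols[j], n_cases) > r_lim * r_lim) ++count;
        }
        /* strict comparison: the first column with the highest count wins */
        if (count > best_count) {
            best = i;
            best_count = count;
        }
    }
    return best;
}

/* Repeatedly drops the most intercorrelated removable column and
   recounts, until no removable column exceeds the limit with any other.
   Returns the number of columns dropped. */
int prune(double **cols, int *active, const int *keep,
          int n_cols, size_t n_cases, double r_lim)
{
    int dropped = 0, idx;

    while ((idx = most_intercorrelated(cols, active, keep,
                                       n_cols, n_cases, r_lim)) >= 0) {
        active[idx] = 0;
        ++dropped;
    }
    return dropped;
}
```

Recounting after every removal, rather than removing all flagged variables at once, mirrors the step-by-step loop of the full program: dropping one variable changes the counts of its former partners, so fewer variables are discarded overall.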

#include <stdio.h> #include <stdlib.h> #include <math.h> #include "my_lib.h" typedef struct variabl { char *var_name; double *var_val; int var_nico; } variable; void main(void) { variable *var; FILE *f_from, *f_ic; char sf_from[100], sf_ic[100], ynvar, **var_rem, exs, ynexcl, **var_ex_list, qn; int numb, case_num, n_dep, i, j, i1, n_var_rem, max_vchar, n_step, n_indep, no_max, *var_rem_ind; double **arr_incor, **arr_sum1, r_lim, var_r_ico; int exist(char *); int getint(void); double getfloat(void); char getch(void); void err_calloc(void); int getstr_maxl_notw(char *, int); int getstr_maxl_file(FILE *, char *, int); int scmp_notcase (char *, char *); char* strcopy_my(char *str1, char *str2); int f_r_cont(char *, int *); void intercor(variable *, int, int, int, double **, double **); max_vchar = 30; printf("\nEnter the name of the file with the data:\n"); if ( getstr_maxl_notw(sf_from, 100) != 1 ) { printf("\nError - the file name cannot be read!\n"); exit(-1);

246

} if ( (f_from = fopen(sf_from, "r")) == NULL ) { printf("\nError - can't open file %s for reading!\n", sf_from); exit(-1); } fclose(f_from); if ( (numb = f_r_cont(sf_from, &case_num)) == -1) { printf("\nError with file %s!\n", sf_from); exit(-1); } if (case_num < 3) { printf("\nError - the number of cases in the file %s is less than 3! Program terminates!\n", sf_from); exit(-1); } printf("\nThe file %s contains %d columns and %d cases\n",sf_from, numb, case_num); if ( ( var = (variable *) calloc( (size_t) numb, sizeof(variable) ) ) == NULL ) err_calloc(); for (i = 0; i < numb; ++i) { if ( ( (var + i) -> var_name = (char *) calloc( (size_t) max_vchar, sizeof(char) ) ) == NULL ) err_calloc(); if ( ( (var + i) -> var_val = (double *) calloc( (size_t) case_num, sizeof(double) ) ) == NULL ) err_calloc(); } f_from = fopen(sf_from, "r"); for (j = 0; j < numb; ++j) { if ( getstr_maxl_file(f_from, (var + j) -> var_name, max_vchar) != 1) { printf("\nError in reading from the file %s!\n", sf_from); exit(-1); } } for (i = 0; i < case_num; ++i) { for (j = 0; j < numb; ++j) { if ( fscanf(f_from, "%lf", ((var + j) -> var_val) + i) != 1) { printf("\nError in reading from the file %s!\n", sf_from); exit(-1);

247

} } } printf("\n"); for (j = 0; j < numb; ++j) { printf("%10s ", (var + j) -> var_name); } printf("\n"); for (i = 0; i < case_num; ++i) { for (j = 0; j < numb; ++j) { printf("%10g ", *(((var + j) -> var_val) + i)); } printf("\n"); } label1: printf("\nEnter the name of the file to save the intercorrelation matrix:\n"); if ( getstr_maxl_notw(sf_ic, 100) != 1 ) { printf("\nError - the file name cannot be read!\n"); goto label1; } if ( exist(sf_ic) == 1 ) { if ( (f_ic = fopen(sf_ic, "w")) == NULL ) { printf("\nError - can't open file %s for writing!\n", sf_ic); goto label1; } } else goto label1; fclose(f_ic); label3: printf("\nEnter the number of dependent variables in the file: "); n_dep = getint(); if ( (n_dep > numb - 2) || (n_dep < 0) ) { printf("\nError - the number of dependent variables must be between 0 and %d!\n", numb-2); goto label3; } label31: printf("\nEnter the absolute value of the intercorrelation coefficent r, above which the variables to be considered as highly intercorrelated: "); r_lim = getfloat(); if ( (r_lim > 1) || (r_lim < 0) )

248

{ printf("\nError - absolute value of r must be between 0 and 1!\n"); goto label31; } label49: printf("\nAre there any independent varibles that you want for sure to remain in the variable data set (Press 'Y' - yes, 'N' - no)?: "); ynvar = getch(); if ( (ynvar != 'Y') && (ynvar != 'y') && (ynvar != 'N') && (ynvar != 'n') ) { printf("\nError!\n"); goto label49; } n_var_rem = 0; if ( ( var_rem = (char **) calloc( (size_t) (numb - n_dep + 1), sizeof(char *) ) ) == NULL ) err_calloc(); if ( ( var_rem_ind = (int *) calloc( (size_t) (numb-n_dep + 1), sizeof(int) ) ) == NULL ) err_calloc(); if ( (ynvar == 'Y') || (ynvar == 'y') ) { label29: printf("\nEnter the number of independent variables that have to remain in the variable data set: "); n_var_rem = getint(); if ( (n_var_rem > numb - n_dep) || (n_var_rem < 0) ) { printf("\nError - the number of variables to remain must be between 0 and %d!\n", numb-n_dep); goto label29; } for (i = 0; i < n_var_rem; ++i) { if ( ( *(var_rem + i) = (char *) calloc( (size_t) max_vchar, sizeof(char) ) ) == NULL ) err_calloc(); } if (n_var_rem > 0) printf("\nEnter the list with the names of the independent variables that have to remain in any case in the variable data set.\n"); for (i = 0; i < n_var_rem; ++i) { j = 0; label003: if ( getstr_maxl_notw( *(var_rem + i), max_vchar) != 1) { if (j == 0) {

249

printf("\nError in reading the variable name. Try again.\n"); ++j; goto label003; } else { printf("\nThis name cannot be read. Go to the next variable.\n"); } } /*200*/ exs = 1; for (i1 = n_dep; (i1 < numb) && (exs == 1); ++i1) { if ( scmp_notcase ( *(var_rem + i), (var + i1) -> var_name ) == 1 ) { *(var_rem_ind + i) = i1; exs = 2; } } if (exs == 1) { printf("\nThere is no variable %s in the independent variable data set! Enter the name again.\n", *(var_rem + i)); goto label003; } } if (n_var_rem > 0) printf("\nThe following variables will remain in any case in the independent variable data set:\n"); for (i = 0; i < n_var_rem; ++i) { printf("var No %d var name %s\n", *(var_rem_ind + i) + 1, *(var_rem + i)); } } if ( ( arr_incor = (double **) calloc( (size_t) (numb-n_dep + 2), sizeof(double *) ) ) == NULL ) err_calloc(); for (i = 0; i < (numb-n_dep+1); ++i) if ( ( *(arr_incor + i) = (double *) calloc( (size_t) (numb-n_dep-i+1), sizeof(double) ) ) == NULL ) err_calloc(); if ( ( arr_sum1 = (double **) calloc( (size_t) 2, sizeof(double *) ) ) == NULL ) err_calloc(); for (i = 0; i < 2; ++i) if ( ( *(arr_sum1 + i) = (double *) calloc( (size_t) 1, sizeof(double) ) ) == NULL ) err_calloc(); if ( ( var_ex_list = (char **) calloc( (size_t) (numb - n_dep + 1), sizeof(char *) ) ) == NULL ) err_calloc(); n_step = 0;

250

labelsteps: n_indep = numb - n_dep - n_step; intercor(var, n_indep, n_dep, case_num, arr_sum1, arr_incor); f_ic = fopen(sf_ic, "w"); fprintf(f_ic, " "); for (i = 0; i < n_indep; ++i) { fprintf(f_ic, "%12s ", (var + n_dep + i) -> var_name); } fprintf(f_ic, "\n"); for (i = 0; i < n_indep; ++i) { fprintf(f_ic, "%12s ", (var + n_dep + i) -> var_name); for (j = 0; j < i; ++j) { fprintf(f_ic, "%12.4f ", *(*(arr_incor + j) + i-j-1) ); } fprintf(f_ic, " 1.0000 "); for (j = i+1; j < n_indep; ++j) { fprintf(f_ic, "%12.4f ", *(*(arr_incor + i) + j-i-1) ); } fprintf(f_ic, "\n"); } for (i = 0; i < n_indep; ++i) ((var + n_dep + i) -> var_nico) = 0; fprintf(f_ic,"\n"); printf("\n"); for (i = 0; i < n_indep; ++i) { fprintf(f_ic,"var: %s list of intercorr. vars: ", (var + n_dep + i) -> var_name); printf("var: %s list of intercorr. vars: ", (var + n_dep + i) -> var_name); for (j = 0; j < i; ++j) { if( *(*(arr_incor + j) + i-j-1) < 0 ) var_r_ico = 0 - *(*(arr_incor + j) + i-j-1); else var_r_ico = *(*(arr_incor + j) + i-j-1); if( var_r_ico > r_lim ) { fprintf(f_ic, "%s ", (var + n_dep + j) -> var_name); printf("%s ", (var + n_dep + j) -> var_name); ++((var + n_dep + i) -> var_nico); } } for (j = i+1; j < n_indep; ++j)

{ if( *(*(arr_incor + i) + j-i-1) < 0 ) var_r_ico = 0 - *(*(arr_incor + i) + j-i-1); else var_r_ico = *(*(arr_incor + i) + j-i-1); if( var_r_ico > r_lim ) { fprintf(f_ic, "%s ", (var + n_dep + j) -> var_name); printf("%s ", (var + n_dep + j) -> var_name); ++((var + n_dep + i) -> var_nico); } } fprintf(f_ic, "number of intercorr. vars: %d\n", (var + n_dep + i) -> var_nico ); printf("number of intercorr. vars: %d\n", (var + n_dep + i) -> var_nico ); } /*300*/ fclose(f_ic); printf("\nThe correlation matrix is saved in file %s\n", sf_ic); no_max = n_dep; labelh65: exs = 2; for (i1 = 0; (i1 < n_var_rem) && (exs == 2); ++i1) { if ( *(var_rem_ind + i1) == no_max ) { no_max++; exs = 1; } } if ( exs == 1) goto labelh65; if (no_max == n_indep + n_dep) { printf("\nAll of the independent variable are in the list of variables that have to remain!\n"); goto labelh98; } for (i = no_max; i < (n_indep + n_dep); ++i) { if( ((var + i) -> var_nico) > ((var + no_max) -> var_nico) ) { exs = 2; for (i1 = 0; (i1 < n_var_rem) && (exs == 2); ++i1) { if ( scmp_notcase ( *(var_rem + i1), (var + i) -> var_name ) == 1 ) exs = 1; } if (exs == 2) no_max = i; } } if( ((var + no_max) -> var_nico ) > 0 )

{
  for (i = 0; i < n_var_rem; ++i) {
    if ( ((var + *(var_rem_ind + i)) -> var_nico) > ((var + no_max) -> var_nico) ) {
      printf("\nVariable %s has a bigger number of intercorrelated variables than variable %s, but it is in the list of variables to remain.\n", (var + *(var_rem_ind + i)) -> var_name, (var + no_max) -> var_name);
    }
  }
  printf("\nVariable %s has the biggest number of intercorrelated variables (%d).\n", (var + no_max) -> var_name, (var + no_max) -> var_nico);
  labelh89:
  printf("\nThe program is going to exclude variable %s from the independent variable data set (Press 'Y' - yes, 'N' - no): ", (var + no_max) -> var_name);
  ynexcl = getch();
  if ( (ynexcl != 'Y') && (ynexcl != 'y') && (ynexcl != 'N') && (ynexcl != 'n') ) {
    printf("\nError!\n");
    goto labelh89;
  }
  if ( (ynexcl == 'Y') || (ynexcl == 'y') ) {
    if ( ( *(var_ex_list + n_step) = (char *) calloc( (size_t) max_vchar, sizeof(char) ) ) == NULL )
      err_calloc();
    if ( strcopy_my( *(var_ex_list + n_step), (var + no_max) -> var_name ) == NULL ) {
      printf("\nError - the program is not performing properly!\n");
      exit(-1);
    }
    /* Shift the following variables down to close the gap left by the excluded one. */
    for (i = no_max; i < (n_indep + n_dep - 1); ++i) {
      (var + i) -> var_name = (var + i+1) -> var_name;
      (var + i) -> var_val = (var + i+1) -> var_val;
    }
    for (i = 0; i < n_var_rem; ++i) {
      if ( *(var_rem_ind + i) == no_max ) {
        printf("\nError - the program is not performing correctly!\n");
        exit(-1);
      }
      if ( *(var_rem_ind + i) > no_max )
        (*(var_rem_ind + i))--;
    }
    n_step++;
    goto labelsteps;
  } else {

    labelh32:
    printf("\nQuit the program (press 'q') or continue with excluding the next intercorrelated variable (press 'n'): ");
    qn = getch();
    if ( (qn != 'Q') && (qn != 'q') && (qn != 'N') && (qn != 'n') ) {
      printf("\nError!\n");
      goto labelh32;
    }
    if ( (qn == 'Q') || (qn == 'q') )
      goto labelh98;
    else {
      /* The variable is kept: add it to the list of variables to remain. */
      if ( ( *(var_rem + n_var_rem) = (char *) calloc( (size_t) max_vchar, sizeof(char) ) ) == NULL )
        err_calloc();
      if ( strcopy_my( *(var_rem + n_var_rem), (var + no_max) -> var_name ) == NULL ) {
        printf("\nError - the program is not performing properly!\n");
        exit(-1);
      }
      *(var_rem_ind + n_var_rem) = no_max;
      n_var_rem++;
      goto labelsteps;
    }
  }
}
else {
  printf("\nThe set of variables not included in the list of variables to remain does not contain variables that are intercorrelated above %g.\n", r_lim);
}
labelh98:
if (n_step > 0) {
  f_ic = fopen(sf_ic, "a");
  fprintf(f_ic, "\n%d variables were excluded.\nThe excluded variable list:\n", n_step);
  printf("\n%d variables were excluded.\nThe excluded variable list:\n", n_step);
  for (i = 0; i < n_step; ++i) {
    fprintf(f_ic, "%s ", *(var_ex_list + i) );
    printf("%s ", *(var_ex_list + i) );
  }
  fprintf(f_ic,"\n");
  printf("\n");
  fclose(f_ic);
}

for (i = 0; i < (numb-n_dep+1); ++i) free( *(arr_incor + i) ); free(arr_incor); free(arr_sum1); if ( (ynvar == 'Y') || (ynvar == 'y') ) { for (i = 0; i < n_var_rem; ++i) free( *(var_rem + i) ); free(var_rem); free(var_rem_ind); } } int f_r_cont(char *sf, int *c_num) { /* This function checks the file with name sf. It returns the number of */ /* strings in the first line of the file and sets *c_num value to be the */ /* number of lines of the file minus 1. */ FILE *f; int c, n = 0, eo, i; double re; if ( (f = fopen(sf, "r")) == NULL ) { printf("\nCan't open file %s for reading!\n", sf); return -1; } while ( (c = getc(f)) == '\n' ); while ( (c != '\n') && (c != EOF) ) { while ( (c == ' ') || (c == '\t') ) c = getc(f); if (c != '\n') { n++; while ( (c != ' ') && (c != '\t') && (c != '\n') && (c != EOF) ) c = getc(f); } } *c_num = 0; if (c != EOF) { eo = 1; while ( (eo = fscanf(f,"%lf", &re)) != EOF ) { if (eo != 1) {

fclose(f); printf("\nFile %s contains %d variable names in the first line, so it must contain %d columns of real numbers\n", sf, n, n); return -1; } for (i = 0; i < n-1; ++i) { eo = fscanf(f,"%lf", &re); if (eo != 1) { fclose(f); printf("\nFile %s contains %d variable names in the first line, so it must contain %d columns of real numbers\n", sf, n, n); return -1; } } (*c_num)++; } } fclose(f); return n; } void intercor(variable *v, int n, int n_dep, int n_cases, double **arr_sum, double **arr_ic) { unsigned int j, m, n_ico; double r_inter; void regr_calc_2(variable *, unsigned int *, int, int, int, int, double **, double *, unsigned int *); for (m = 0; m < n-1; ++m) { for (j = m+1; j < n; ++j) { n_ico = 0; regr_calc_2(v, &j, 1, m + n_dep, n_dep, n_cases, arr_sum, &r_inter, &n_ico); *(*(arr_ic + m) + j-m-1) = r_inter; } } } void regr_calc_2(variable *v, unsigned int *c_index, int k, int n_d, int n_dep, int n_cases, double **arr_sum, double *r, unsigned int *n_co) { unsigned int m, l, q, iv;

double **arr_det, det_a, det_k, d_mean, *d_calc, sum_om, sum_oc, sum_cm, r_sq, tol, sign; double sum_sum(double *, double *, int); double sum_1(double *, int); double c_det(double **, int); d_mean = sum_1( (v + n_d) -> var_val, n_cases); if (n_cases != 0) d_mean /= n_cases; else { printf("\nError - there is no cases!\n"); exit(-1); } for (m = 0; m < k; ++m) { for (l = 0; l < k; ++l) { *(*(arr_sum + m) + l) = sum_sum( (v + n_dep + *(c_index + m)) -> var_val, (v + n_dep + *(c_index + l)) -> var_val, n_cases); } *(*(arr_sum + k) + m) = sum_1( (v + n_dep + *(c_index + m )) -> var_val, n_cases); *(*(arr_sum + m) + k) = sum_1( (v + n_dep + *(c_index + m )) -> var_val, n_cases); *(*(arr_sum + k + 1) + m) = sum_sum( (v + n_dep + *(c_index + m )) -> var_val, (v + n_d) -> var_val, n_cases); } *(*(arr_sum + k) + k) = n_cases; *(*(arr_sum + k + 1) + k) = sum_1( (v + n_d) -> var_val, n_cases); if ( ( arr_det = (double **) calloc( (size_t) k + 1, sizeof(double *) ) ) == NULL ) err_calloc(); for (m = 0; m < k+1; ++m) if ( ( *(arr_det + m) = (double *) calloc( (size_t) k + 1, sizeof(double) ) ) == NULL ) err_calloc(); for (m = 0; m < k+1; ++m) { for (l = 0; l < k+1; ++l) { *(*(arr_det + m) + l) = *(*(arr_sum + m) + l); } } det_a = c_det(arr_det, k + 1); iv = 1; if ( (det_a > -1e-6) && (det_a < 1e-6) ) iv = 0; if (iv)

{ if ( ( d_calc = (double *) calloc( (size_t) n_cases, sizeof(double) ) ) == NULL ) err_calloc(); for (m = 0; m < n_cases; ++m) *(d_calc + m) = 0.0; } for (q = 0; q < k+1; ++q) { if (iv) { for (m = 0; m < k+1; ++m) { for (l = 0; l < k+1; ++l) { if (m != q) *(*(arr_det + m) + l) = *(*(arr_sum + m) + l); else *(*(arr_det + m) + l) = *(*(arr_sum + k+1) + l); } } det_k = c_det(arr_det, k + 1); if (q == 0) sign = det_k/det_a; if (q < k) { for (m = 0; m < n_cases; ++m) *(d_calc + m) += (det_k/det_a) * ( *(((v + n_dep + *(c_index + q)) -> var_val) + m) ); } if (q == k) { for (m = 0; m < n_cases; ++m) *(d_calc + m) += (det_k/det_a); } } } if (iv) { sum_om = 0.0; sum_oc = 0.0; sum_cm = 0.0; for (m = 0; m < n_cases; ++m) { sum_om += pow( (*(((v + n_d) -> var_val) + m) - d_mean), 2.0); sum_oc += pow( (*(((v + n_d) -> var_val) + m) - *(d_calc + m)), 2.0); sum_cm += pow( (*(d_calc + m)- d_mean), 2.0); } tol = 1 - sum_oc/sum_om - sum_cm/sum_om; if ( (tol > 1e-5) || (tol < -1e-5) ) { r_sq = 0.0; } else r_sq = sum_cm/sum_om;

    /* The correlation coefficient takes the sign of the regression slope. */
    if (sign < -1e-12)
      *(r + *n_co) = -pow(r_sq, 0.5);
    else
      *(r + *n_co) = pow(r_sq, 0.5);
  }
  else
    *(r + *n_co) = 0.0;
  if (iv)
    free(d_calc);
  for (m = 0; m < k+1; ++m)
    free(*(arr_det + m));
  free(arr_det);
  *n_co += 1;
}

/* Sum of the element-wise products of r1 and r2 over n cases. */
double sum_sum (double *r1, double *r2, int n)
{
  int i;
  double sum = 0.0;
  for (i = 0; i < n; ++i) {
    sum += (*(r1+i)) * (*(r2+i));
  }
  return sum;
}

/* Sum of the elements of r1 over n cases. */
double sum_1 (double *r1, int n)
{
  int i;
  double sum = 0.0;
  for (i = 0; i < n; ++i) {
    sum += *(r1+i);
  }
  return sum;
}
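The listing above obtains each pairwise correlation through a one-variable regression and then proposes for exclusion the variable with the most intercorrelated partners. The following is a minimal sketch of that selection step, not the program's own routine. It computes the squared Pearson correlation directly from sums, and it compares r squared with the square of the limit, which is equivalent to comparing |r| with a non-negative limit. The names pearson_r_sq and most_intercorrelated are hypothetical and were introduced only for this illustration, as is the assumption that the data are stored as one array per variable.

```c
#include <assert.h>
#include <stddef.h>

/* Squared Pearson correlation of two columns x and y of length n.
   Returns 0.0 when either column has no variance (an assumption of
   this sketch, mirroring the listing's handling of degenerate data). */
double pearson_r_sq(const double *x, const double *y, size_t n)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, syy = 0.0, sxy = 0.0;
    double cov, vx, vy;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; ++i) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    cov = (double) n * sxy - sx * sy;
    vx  = (double) n * sxx - sx * sx;
    vy  = (double) n * syy - sy * sy;
    if (vx <= 0.0 || vy <= 0.0)
        return 0.0;
    return (cov * cov) / (vx * vy);
}

/* Returns the index of the column with the most partners whose |r|
   exceeds r_lim (r_lim >= 0), or -1 if no pair exceeds the limit.
   Ties go to the lowest index. cols[j] points to the n_cases values
   of column j. */
int most_intercorrelated(const double *const *cols, size_t n_cols,
                         size_t n_cases, double r_lim)
{
    int best = -1;
    size_t best_count = 0, count, j, k;

    for (j = 0; j < n_cols; ++j) {
        count = 0;
        for (k = 0; k < n_cols; ++k)
            if (k != j &&
                pearson_r_sq(cols[j], cols[k], n_cases) > r_lim * r_lim)
                ++count;
        if (count > best_count) {
            best_count = count;
            best = (int) j;
        }
    }
    return best;
}
```

Unlike the listing, this sketch does not keep a list of variables that must remain, and it recomputes each pair twice for brevity instead of storing a triangular matrix.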

A.2. Program implementing the algorithm of the best-subsets approach for selecting regression equations with best statistical fit

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "my_lib.h"

typedef struct variabl {
  char *var_name;
  double *var_val;
} variable;

void main(void)
{
  variable *var;
  FILE *f_from, *f_to;
  char sf_from[100], sf_to[100], **res, yn, c1;
  int numb, case_num, n_dep, n_max, n_min, i, j, n, n_d, npr_mod;
  unsigned int *var_index, n_c;
  long unsigned int comb_n, n_allc;
  double **arr_sum, *r_sq, **arr_incor, **arr_sum1, rpr_lim, rico_lim;
  int exist(char *);
  int getint(void);
  double getfloat(void);
  char getch(void);
  void err_calloc(void);
  long unsigned int fact_n_k(int, int);
  int f_r_cont(char *, int *);
  void intercor_2(variable *, int, int, int, double **, double **);
  void comb_n_k_w(variable *, unsigned int*, int, int, unsigned int, int, int, int, double **, double *, char **, unsigned int *, double **, double, int, long unsigned int *);

  printf("\nEnter the name of the file with the data:\n");
  scanf("%99s", sf_from);  /* sf_from holds at most 99 characters plus the terminator */
  if ( (f_from = fopen(sf_from, "r")) == NULL ) {
    printf("\nError - can't open file %s for reading!\n", sf_from);
    exit(-1);
  }
  fclose(f_from);
  if ( (numb = f_r_cont(sf_from, &case_num)) == -1) {
    printf("\nError with file %s!\n", sf_from);
    exit(-1);

} if (case_num < 3) { printf("\nError - the number of cases in the file %s is less than 3! Program terminates!\n", sf_from); exit(-1); } printf("\nThe file %s contains %d columns and %d cases\n",sf_from, numb, case_num); if ( ( var = (variable *) calloc( (size_t) numb, sizeof(variable) ) ) == NULL ) err_calloc(); for (i = 0; i < numb; ++i) { if ( ( (var + i) -> var_name = (char *) calloc( (size_t) 30, sizeof(char) ) ) == NULL ) err_calloc(); if ( ( (var + i) -> var_val = (double *) calloc( (size_t) case_num, sizeof(double) ) ) == NULL ) err_calloc(); } f_from = fopen(sf_from, "r"); for (j = 0; j < numb; ++j) { if ( fscanf(f_from, "%s", (var + j) -> var_name) != 1) { printf("\nError in reading of the file %s! -1\n", sf_from); exit(-1); } } for (i = 0; i < case_num; ++i) { for (j = 0; j < numb; ++j) { if ( fscanf(f_from, "%lf", ((var + j) -> var_val) + i) != 1) { printf("\nError in reading of the file %s!\n", sf_from); exit(-1); } } } printf("\n"); for (j = 0; j < numb; ++j) { printf("%10s ", (var + j) -> var_name); } printf("\n"); for (i = 0; i < case_num; ++i) {

    for (j = 0; j < numb; ++j) {
      printf("%10g ", *(((var + j) -> var_val) + i));
    }
    printf("\n");
  }
  label1:
  printf("\nEnter the name of the file to save in:\n");
  scanf("%99s", sf_to);  /* sf_to holds at most 99 characters plus the terminator */
  if ( exist(sf_to) == 1 ) {
    if ( (f_to = fopen(sf_to, "w")) == NULL ) {
      printf("\nError - can't open file %s for writing!\n", sf_to);
      goto label1;
    }
  } else
    goto label1;
  label31:
  printf("\nEnter the minimum value of the model r squared below which models are not to be printed in the file and on the screen: ");
  rpr_lim = getfloat();
  if ( (rpr_lim > 1) || (rpr_lim < 0) ) {
    printf("\nError - model r squared must be between 0 and 1!\n");
    goto label31;
  }
  label37:
  printf("\nEnter the maximum number of models to be printed in the file and on the screen for a dependent variable: ");
  npr_mod = getint();
  if ( npr_mod < 0 ) {
    printf("\nError - the number of models to be printed must not be smaller than 0!\n");
    goto label37;
  }
  label3:
  printf("\nEnter the number of dependent variables in the file: ");
  n_dep = getint();
  if ( (n_dep > numb - 1) || (n_dep < 1) ) {
    printf("\nError - the number of dependent variables must be between 1 and %d!\n", numb-1);
    goto label3;
  }
  label24:
  printf("\nEnter the minimum number of independent variables to be included in the model: ");

  n_min = getint();
  if ( (n_min > numb - n_dep) || (n_min < 1) ) {
    printf("\nError - the number of independent variables must be between 1 and %d!\n", numb-n_dep);
    goto label24;
  }
  if (n_min > case_num - 2) {
    printf("\nError - the number of independent variables must not be bigger than the number of cases minus two (%d)!\n", case_num-2);
    goto label24;
  }
  label2:
  printf("\nEnter the maximum number of independent variables to be included in the model: ");
  n_max = getint();
  if ( (n_max > numb - n_dep) || (n_max < n_min) ) {
    printf("\nError - the number of independent variables must be between %d and %d!\n", n_min, numb-n_dep);
    goto label2;
  }
  if (n_max > case_num - 2) {
    printf("\nError - the number of independent variables must not be bigger than the number of cases minus two (%d)!\n", case_num-2);
    goto label2;
  }
  label34:
  printf("\nEnter the value of the intercorrelation r squared above which the independent variables will be considered as intercorrelated and will not be included in the same model: ");
  rico_lim = getfloat();
  if ( (rico_lim > 1) || (rico_lim < 0) ) {
    printf("\nError - the intercorrelation r squared must be between 0 and 1!\n");
    goto label34;
  }
  if ( ( arr_incor = (double **) calloc( (size_t) (numb-n_dep + 2), sizeof(double *) ) ) == NULL )
    err_calloc();
  for (i = 0; i < (numb-n_dep+1); ++i)
    if ( ( *(arr_incor + i) = (double *) calloc( (size_t) (numb-n_dep-i+1), sizeof(double) ) ) == NULL )
      err_calloc();
  /* regr_calc_3 is called with k = 1 and addresses arr_sum as a */
  /* (k+2) x (k+1) array, so 3 rows of 2 values are allocated.   */
  if ( ( arr_sum1 = (double **) calloc( (size_t) 3, sizeof(double *) ) ) == NULL )
    err_calloc();
  for (i = 0; i < 3; ++i)
    if ( ( *(arr_sum1 + i) = (double *) calloc( (size_t) 2, sizeof(double) ) ) == NULL )
      err_calloc();
  intercor_2(var, numb - n_dep, n_dep, case_num, arr_sum1, arr_incor);
  for (i = 0; i < 3; ++i)
    free( *(arr_sum1 + i) );
  free(arr_sum1);
  for (n_d = 0; n_d < n_dep; ++n_d) {
    for (n = n_min; n <= n_max; ++n) {
      comb_n = fact_n_k(numb - n_dep, n);
      if (comb_n > (long unsigned int) (pow(2,63) - 1)) {
        printf("\nThe number of combinations of %d independent variables is %lu. Too many. Calculations terminate.\n", n, comb_n);
        break;
      }
      label4:
      if (comb_n > 10000) {
        printf("\nThe number of combinations of %d independent variables is %lu. Do you want to continue (Press 'Y' - yes, 'N' - no)?: ", n, comb_n);
        yn = getch();
        if ( (yn != 'Y') && (yn != 'y') && (yn != 'N') && (yn != 'n') ) {
          printf("\nError!\n");
          goto label4;
        }
        if ( (yn == 'N') || (yn == 'n') )
          break;
      }
      printf("\nPlease wait...\n");
      /* Keep at most npr_mod models for each dependent variable. */
      if (comb_n < npr_mod)
        n_c = comb_n;
      else
        n_c = npr_mod;
      if ( ( r_sq = (double *) calloc( (size_t) n_c, sizeof(double) ) ) == NULL )
        err_calloc();
      if ( ( res = (char **) calloc( (size_t) n_c, sizeof(char *) ) ) == NULL )
        err_calloc();
      for (i = 0; i < n_c; ++i) {
        if ( ( *(res + i) = (char *) calloc( (size_t) n*30 + 200, sizeof(char) ) ) == NULL )
          err_calloc();
        *(*(res + i)) = '\0';
      }

if ( ( arr_sum = (double **) calloc( (size_t) n + 2, sizeof(double *) ) ) == NULL ) err_calloc(); for (i = 0; i < n + 2; ++i) if ( ( *(arr_sum + i) = (double *) calloc( (size_t) n + 1, sizeof(double) ) ) == NULL ) err_calloc(); if ( ( var_index = (unsigned int *) calloc( (size_t) n, sizeof(unsigned int) ) ) == NULL ) err_calloc(); n_c = 0; n_allc = 0; comb_n_k_w(var, var_index, numb - n_dep, n, 0, n_d, n_dep, case_num, arr_sum, r_sq, res, &n_c, arr_incor, rico_lim, npr_mod, &n_allc); fprintf(f_to, "\n r_sq dep. var number of indep. vars is %d\n", n); for (i = 0; (i < n_c) && ( *(r_sq + i) >= rpr_lim ) && (i < npr_mod) ; ++i) { fprintf(f_to, "%5d %.8f ",i + 1, *(r_sq + i)); j = 0; while ( (c1 = *(*(res + i) + j)) != '\0' ) { fputc(c1,f_to); ++j; } fprintf(f_to,"\n"); } printf("\n r_sq dep. var number of indep. vars is %d\n", n); for (i = 0; (i < n_c) && ( *(r_sq + i) >= rpr_lim ) && (i < npr_mod) ; ++i) { printf("%5d %.8f ",i + 1, *(r_sq + i)); j = 0; while ( (c1 = *(*(res + i) + j)) != '\0' ) { putchar(c1); ++j; } printf("\n"); } free(var_index); for (i = 0; i < n+2; ++i) free(*(arr_sum + i)); free(arr_sum); if( comb_n < npr_mod) n_c = comb_n; else n_c = npr_mod; for (i = 0; i < n_c; ++i) free(*(res + i)); free(res);

free(r_sq); } } fclose(f_to); free(arr_incor); } int f_r_cont(char *sf, int *c_num) { /* This function checks the file with name sf. It returns the number of */ /* strings in the first line of the file and sets *c_num value to be the */ /* number of lines of the file minus 1. */ FILE *f; int c, n = 0, eo, i; double re; if ( (f = fopen(sf, "r")) == NULL ) { printf("\nCan't open file %s for reading!\n", sf); return -1; } while ( (c = getc(f)) == '\n' ); while ( (c != '\n') && (c != EOF) ) { while ( (c == ' ') || (c == '\t') ) c = getc(f); if (c != '\n') { n++; while ( (c != ' ') && (c != '\t') && (c != '\n') && (c != EOF) ) c = getc(f); } } *c_num = 0; if (c != EOF) { eo = 1; /* 300 */ while ( (eo = fscanf(f,"%lf", &re)) != EOF ) { if (eo != 1) { fclose(f); printf("\nFile %s contains %d variable names in the first line, so it must contain %d columns of real numbers\n", sf, n, n);

return -1; } for (i = 0; i < n-1; ++i) { eo = fscanf(f,"%lf", &re); if (eo != 1) { fclose(f); printf("\nFile %s contains %d variable names in the first line, so it must contain %d columns of real numbers\n", sf, n, n); return -1; } } (*c_num)++; } } fclose(f); return n; } void intercor_2(variable *v, int n, int n_dep, int n_cases, double **arr_sum, double **arr_ic) { unsigned int j, m; double r_inter_sq; void regr_calc_3(variable *, unsigned int *, int, int, int, int, double **, double *); for (m = 0; m < n-1; ++m) { for (j = m+1; j < n; ++j) { regr_calc_3(v, &j, 1, m + n_dep, n_dep, n_cases, arr_sum, &r_inter_sq); *(*(arr_ic + m) + j-m-1) = r_inter_sq; } } } void regr_calc_3(variable *v, unsigned int *c_index, int k, int n_d, int n_dep, int n_cases, double **arr_sum, double *r_sq) { unsigned int m, l, q, iv; double **arr_det, det_a, det_k, d_mean, *d_calc, sum_om, sum_oc, sum_cm, tol; double sum_sum(double *, double *, int); double sum_1(double *, int);

double c_det(double **, int); d_mean = sum_1( (v + n_d) -> var_val, n_cases); if (n_cases != 0) d_mean /= n_cases; else { printf("\nError - there is no cases!\n"); exit(-1); } for (m = 0; m < k; ++m) { for (l = 0; l < k; ++l) { *(*(arr_sum + m) + l) = sum_sum( (v + n_dep + *(c_index + m)) -> var_val, (v + n_dep + *(c_index + l)) -> var_val, n_cases); } *(*(arr_sum + k) + m) = sum_1( (v + n_dep + *(c_index + m )) -> var_val, n_cases); *(*(arr_sum + m) + k) = sum_1( (v + n_dep + *(c_index + m )) -> var_val, n_cases); *(*(arr_sum + k + 1) + m) = sum_sum( (v + n_dep + *(c_index + m )) -> var_val, (v + n_d) -> var_val, n_cases); } *(*(arr_sum + k) + k) = n_cases; *(*(arr_sum + k + 1) + k) = sum_1( (v + n_d) -> var_val, n_cases); if ( ( arr_det = (double **) calloc( (size_t) k + 1, sizeof(double *) ) ) == NULL ) err_calloc(); for (m = 0; m < k+1; ++m) if ( ( *(arr_det + m) = (double *) calloc( (size_t) k + 1, sizeof(double) ) ) == NULL ) err_calloc(); for (m = 0; m < k+1; ++m) { for (l = 0; l < k+1; ++l) { *(*(arr_det + m) + l) = *(*(arr_sum + m) + l); } } det_a = c_det(arr_det, k + 1); /* 400 */ iv = 1; if ( (det_a > -1e-6) && (det_a < 1e-6) ) iv = 0; if (iv) { if ( ( d_calc = (double *) calloc( (size_t) n_cases, sizeof(double) ) ) == NULL ) err_calloc(); for (m = 0; m < n_cases; ++m) *(d_calc + m) = 0.0; }

for (q = 0; q < k+1; ++q) { if (iv) { for (m = 0; m < k+1; ++m) { for (l = 0; l < k+1; ++l) { if (m != q) *(*(arr_det + m) + l) = *(*(arr_sum + m) + l); else *(*(arr_det + m) + l) = *(*(arr_sum + k+1) + l); } } det_k = c_det(arr_det, k + 1); if (q < k) { for (m = 0; m < n_cases; ++m) *(d_calc + m) += (det_k/det_a) * ( *(((v + n_dep + *(c_index + q)) -> var_val) + m) ); } if (q == k) { for (m = 0; m < n_cases; ++m) *(d_calc + m) += (det_k/det_a); } } } if (iv) { sum_om = 0.0; sum_oc = 0.0; sum_cm = 0.0; for (m = 0; m < n_cases; ++m) { sum_om += pow( (*(((v + n_d) -> var_val) + m) - d_mean), 2.0); sum_oc += pow( (*(((v + n_d) -> var_val) + m) - *(d_calc + m)), 2.0); sum_cm += pow( (*(d_calc + m)- d_mean), 2.0); } tol = 1 - sum_oc/sum_om - sum_cm/sum_om; if ( (tol > 1e-5) || (tol < -1e-5) ) { *r_sq = 0.0; } else *r_sq = sum_cm/sum_om; } else *r_sq = 0.0; if (iv) free(d_calc); for (m = 0; m < k+1; ++m) free(*(arr_det + m));

free(arr_det); } void comb_n_k_w(variable *v, unsigned int *c_index, int n, int k, unsigned int i, int n_d, int n_dep, int n_cases, double **arr_sum, double *r, char **results, unsigned int *n_co, double **arr_ic, double ri_l, int npr_m, long unsigned int *n_allco) { unsigned int j, p, m, l, flag; void regr_calc_4(variable *, unsigned int *, int, int, int, int, double **, double *, char **, unsigned int *, int); if (i == 0) p = 0; else p = *(c_index + i - 1) + 1; for (j = p; j < n; ++j) { *(c_index + i) = j; if (i < k-1) comb_n_k_w(v, c_index, n, k, i+1, n_d, n_dep, n_cases, arr_sum, r, results, n_co, arr_ic, ri_l, npr_m, n_allco); if (i == k-1) { flag = 1; if (k > 1) { for (m = 0; (m < k-1) && (flag); ++m) { for (l = m+1; (l < k) && (flag); ++l) { if ( *( *( arr_ic + *(c_index + m) ) + *(c_index + l) - *(c_index + m) - 1 ) > ri_l ) flag = 0; } } } if (flag) regr_calc_4(v, c_index, k, n_d, n_dep, n_cases, arr_sum, r, results, n_co, npr_m); if ( ( (*n_allco+1)%10000 ) == 0 ) printf("\nReached the %u-th combination\n", *n_allco+1 ); *n_allco += 1; } } } /* 500 */

void regr_calc_4(variable *v, unsigned int *c_index, int k, int n_d, int n_dep, int n_cases, double **arr_sum, double *r, char **results, unsigned int *n_co, int npr_m) { unsigned int m, l, q, iv; int i, j; double **arr_det, det_a, det_k, d_mean, *d_calc, sum_om, sum_oc, sum_cm, r_sq, tol; char *results_1; double sum_sum(double *, double *, int); double sum_1(double *, int); double c_det(double **, int); char * strcopy_my(char *, char *); d_mean = sum_1( (v + n_d) -> var_val, n_cases); if (n_cases != 0) d_mean /= n_cases; else { printf("\nError - there is no cases!\n"); exit(-1); } if ( ( results_1 = (char *) calloc( (size_t) k*30 + 200, sizeof(char) ) ) == NULL ) err_calloc(); for (m = 0; m < k; ++m) { for (l = 0; l < k; ++l) { *(*(arr_sum + m) + l) = sum_sum( (v + n_dep + *(c_index + m)) -> var_val, (v + n_dep + *(c_index + l)) -> var_val, n_cases); } *(*(arr_sum + k) + m) = sum_1( (v + n_dep + *(c_index + m )) -> var_val, n_cases); *(*(arr_sum + m) + k) = sum_1( (v + n_dep + *(c_index + m )) -> var_val, n_cases); *(*(arr_sum + k + 1) + m) = sum_sum( (v + n_dep + *(c_index + m )) -> var_val, (v + n_d) -> var_val, n_cases); } *(*(arr_sum + k) + k) = n_cases; *(*(arr_sum + k + 1) + k) = sum_1( (v + n_d) -> var_val, n_cases); if ( ( arr_det = (double **) calloc( (size_t) k + 1, sizeof(double *) ) ) == NULL ) err_calloc(); for (m = 0; m < k+1; ++m) if ( ( *(arr_det + m) = (double *) calloc( (size_t) k + 1, sizeof(double) ) ) == NULL ) err_calloc(); for (m = 0; m < k+1; ++m) {

for (l = 0; l < k+1; ++l) { *(*(arr_det + m) + l) = *(*(arr_sum + m) + l); } } /* 550 */ det_a = c_det(arr_det, k + 1); iv = 1; if ( (det_a > -1e-6) && (det_a < 1e-6) ) iv = 0; if (iv) { if ( ( d_calc = (double *) calloc( (size_t) n_cases, sizeof(double) ) ) == NULL ) err_calloc(); for (m = 0; m < n_cases; ++m) *(d_calc + m) = 0.0; } sprintf(results_1,"%10s ", (v + n_d) -> var_name); for (q = 0; q < k+1; ++q) { if (iv) { for (m = 0; m < k+1; ++m) { for (l = 0; l < k+1; ++l) { if (m != q) *(*(arr_det + m) + l) = *(*(arr_sum + m) + l); else *(*(arr_det + m) + l) = *(*(arr_sum + k+1) + l); } } det_k = c_det(arr_det, k + 1); if (q < k) { m = 0; while ( *(results_1 + m) != '\0' ) ++m; sprintf( results_1 + m,"%10s %15g ", (v + n_dep + *(c_index + q)) -> var_name, det_k/det_a); } if (q == k) { m = 0; while ( *(results_1 + m) != '\0' ) ++m; sprintf( results_1 + m,"Intercept %15g", det_k/det_a); } if (q < k) {

for (m = 0; m < n_cases; ++m) *(d_calc + m) += (det_k/det_a) * ( *(((v + n_dep + *(c_index + q)) -> var_val) + m) ); } if (q == k) { for (m = 0; m < n_cases; ++m) *(d_calc + m) += (det_k/det_a); } /* 600 */ } else { if (q < k) { m = 0; while ( *(results_1 + m) != '\0' ) ++m; sprintf( results_1 + m,"%10s ", (v + n_dep + *(c_index + q)) -> var_name); } if (q == k) { m = 0; while ( *(results_1 + m) != '\0' ) ++m; sprintf( results_1 + m,"Intercept can't be calculated - data have no variance"); } } } if (iv) { sum_om = 0.0; sum_oc = 0.0; sum_cm = 0.0; for (m = 0; m < n_cases; ++m) { sum_om += pow( (*(((v + n_d) -> var_val) + m) - d_mean), 2.0); sum_oc += pow( (*(((v + n_d) -> var_val) + m) - *(d_calc + m)), 2.0); sum_cm += pow( (*(d_calc + m)- d_mean), 2.0); } tol = 1 - sum_oc/sum_om - sum_cm/sum_om; if ( (tol > 1e-5) || (tol < -1e-5) ) { m = 0; while ( *(results_1 + m) != '\0' ) ++m; sprintf( results_1 + m," minimum tolerance"); r_sq = 0.0; } else r_sq = sum_cm/sum_om; } else r_sq = 0.0;

if (iv) free(d_calc); for (m = 0; m < k+1; ++m) free(*(arr_det + m)); free(arr_det); /* 650 */ if (*n_co < npr_m) { strcopy_my( *(results + *n_co), results_1 ); *(r + *n_co) = r_sq; } if (*n_co > npr_m-1) i = npr_m - 1; else i = *n_co; while ( (i >= 0) && ( *(r + i) <= r_sq) ) --i; if ( (i+1 < *n_co+1) && (i+1 < npr_m) ) { if (*n_co > npr_m-1) j = npr_m - 1; else j = *n_co; while (j > i+1) { strcopy_my( *(results + j), *(results + j-1) ); *(r + j) = *(r + j-1); --j; } strcopy_my( *(results + i+1), results_1); *(r + i+1) = r_sq; } free(results_1); *n_co += 1; } double sum_sum (double *r1, double *r2, int n) { int i; double sum = 0.0; for (i = 0; i < n; ++i) { sum += (*(r1+i)) * (*(r2+i)); } return sum; } double sum_1 (double *r1, int n)

{
  int i;
  double sum = 0.0;
  for (i = 0; i < n; ++i) {
    sum += *(r1+i);
  }
  return sum;
}
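The routine comb_n_k_w in the listing above visits every k-element subset of the independent variables recursively, in increasing index order. The same enumeration can also be written iteratively, as in the minimal sketch below. The helper names n_choose_k and next_combination are hypothetical and do not appear in the program; the sketch assumes 1 <= k <= n for the enumeration and does not check the count for overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Number of k-element subsets of n items, built up multiplicatively so
   that every intermediate value is itself a binomial coefficient.
   Returns 0 if k > n. Overflow is not checked in this sketch. */
unsigned long n_choose_k(unsigned n, unsigned k)
{
    unsigned long c = 1;
    unsigned i;

    if (k > n)
        return 0;
    if (k > n - k)
        k = n - k;               /* use the smaller of k and n-k */
    for (i = 1; i <= k; ++i)
        c = c * (n - k + i) / i; /* the division is always exact */
    return c;
}

/* Advances idx[0..k-1], a strictly increasing list of indices below n,
   to the next combination in lexicographic order. Returns 1 on success
   and 0 when idx already holds the last combination. */
int next_combination(unsigned *idx, unsigned n, unsigned k)
{
    unsigned i = k, j;

    assert(k >= 1 && k <= n);
    while (i > 0) {
        --i;
        if (idx[i] < n - k + i) {       /* position i can still grow */
            ++idx[i];
            for (j = i + 1; j < k; ++j) /* reset the tail behind it */
                idx[j] = idx[j - 1] + 1;
            return 1;
        }
    }
    return 0;
}
```

Starting from idx = {0, 1, ..., k-1} and calling next_combination until it returns 0 visits all n_choose_k(n, k) subsets. In a best-subsets search, a regression would be fitted for each subset that passes the intercorrelation check.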

APPENDIX B

SMILES CODES FOR THE INVESTIGATED COMPOUNDS

B.1. Compounds investigated for blood-brain barrier penetration (Chapter 10)

B.1.a. Compounds from the ECVAM data set.

No Name SMILES

1 L-alanine [NH3+][C@@H](C)C(=O)[O-]

2 antipyrine c1ccccc1N2C(=O)C=C(C)N2C

3 AZT C1=C(C)C(=O)NC(=O)N1C2OC(CO)C(N=[N+]=[N-])C2

4 caffeine CN1C(=O)N(C)c2ncn(C)c2C1(=O)

5 cimetidine CNC(=NCCSCc1nc[nH]c1C)NC#N

6 cyclosporin A CN1C(=O)C(C)NC(=O)C(C)NC(=O)C(CC(C)C)N(C)C(=O)C(C(C)C)NC(=O)C(CC(C)C)N(C)C(=O)CN(C)C(=O)C(CC)NC(=O)C(C(O)C(C)CC=CC)N(C)C(=O)C(C(C)C)N(C)C(=O)C(CC(C)C)N(C)C(=O)C1CC(C)C

7 diazepam O=C1N(C)c2ccc(Cl)cc2C(=NC1)c3ccccc3

8 digoxin CC1OC(CC(O)C1O)OC2C(C)OC(CC2O)OC3C(C)OC(CC3O)OC4CCC5(C)C(CCC6C5CC(O)C7(C)C(CCC67O)C8=CC(=O)OC8)C4

9 L-dopa [NH3+][C@@H](C(c1cc(O)c(O)cc1))C(=O)[O-]

10 glycerol OCC(O)CO

11 lactic acid OC(C)C(=O)O

12 L-leucine [NH3+][C@@H](C(C(C)C))C(=O)[O-]

13 morphine C1=CC2C(N(C)C5)Cc3ccc(O)c4c3C2(C5)C(O4)C1O

14 nicotine c1ccncc1C2N(C)CCC2

15 phenytoin O=C1NC(=O)C(N1)(c2ccccc2)c3ccccc3

16 sucrose OC[C@@H]1[C@@H](O)[C@H](O)[C@@H](O)[C@H](O1)O[C@@]2(CO)[C@@H](O)[C@H](O)[C@@H](CO)O2

17 urea O=C(N)N

18 verapamil COc1c(OC)cc(cc1)CCN(C)CCCC(C(C)C)(C#N)c2cc(OC)c(OC)cc2

19 vinblastine CC[C@@]1(O)CN2CCc3c4ccccc4[nH]c3[C@](C(OC)=O)(C[C@H](C1)C2)c5c(OC)cc6N(C)[C@H]7[C@@](O)(C(OC)=O)[C@H](OC(C)=O)[C@]([C@H]89)(CC)C=CCN8CC[C@]79c6c5

20 vincristine CC[C@@]1(O)CN2CCc3c4ccccc4[nH]c3[C@](C(OC)=O)(C[C@H](C1)C2)c5c(OC)cc6N(C=O)[C@H]7[C@@](O)(C(OC)=O)[C@H](OC(C)=O)[C@]([C@H]89)(CC)C=CCN8CC[C@]79c6c5

21 warfarin c1ccccc1C(CC(C)=O)C2=C(O)c3ccccc3OC2(=O)

B.1.b. Compounds from the data set of Platts et al. (2001).

No Name SMILES

1 1 (cimetidine) CNC(=NC#N)NCCSCc1nc[NH]c1C

2 2 NC(N)=Cc1sc(C)nc1

3 4 CN(C)Cc1ccc(o1)CSCCNC2=NC(=O)C(=CN2)Cc3cc4ccccc4cc3

4 5 CN(C)Cc1ccc(o1)CSCCNC2=NC(=O)C(=CN2)Cc3ccc(C)nc3

5 6 (clonidine) Clc1cccc(Cl)c1N=C2NCCN2

6 7 (mepyramine) COc1ccc(cc1)CN(c2ccccn2)CCN(C)C

7 8 (imipramine) CN(C)CCCN1c2ccccc2CCc3ccccc13

8 9 (ranitidine) O=N(=O)C=C(NC)NCCSCc1ccc(o1)CN(C)C

9 10 (tiotidine) N#CN=C(NC)NCCSCc1csc(n1)N=C(N)N

10 13 O=N(=O)c1cc[NH]c1NCCSCc2ncccc2Br

11 14 O=N(=O)c1cc[NH]c1NCCSCc2ncccc2

12 15 O=N(=O)c1c(Cc3ccccc3)c[NH]c1NCCSCc2ncccc2

13 16 NC(N)=Nc1scc(n1)c2ccccc2

14 17 NC(N)=Nc1scc(n1)c2cc(N)ccc2

15 18 NC(N)=Nc1scc(n1)c2cc(NC(=O)C)ccc2

16 19 NC(N)=Nc1scc(n1)c2cc(CC(NC)=NC#N)ccc2

17 20 O=N(=O)c1cc[NH]c1NCCSCc2ccc(o2)CN(C)C

18 21 O=N(=O)c1c(Cc3ccccc3)c[NH]c1NCCSCc2ccc(o2)CN(C)C

19 22 O=N(=O)c1cc[NH]c1Nc2cccc(c2)c3ccc(o3)CN(C)C

20 23 O=N(=O)c1cc[NH]c1Nc2cccc(c2)c3nccc(c3)CN(C)C

21 24 C1CCCCN1Cc2cccc(c2)OCCCNC(=O)C

22 25 C1CCCCN1Cc2cccc(c2)OCCCNC(=O)c3ccccc3

23 26 C1CCCCN1Cc2cccc(c2)OCCCO

24 27 C1CCCCN1Cc2cccc(c2)OCCCNc3ccccn3

25 28 C1CCCCN1Cc2cccc(c2)OCCCNc3sccn3

26 29 C1CCCCN1Cc2cccc(c2)OCCCNc3sc4ccccc4n3

27 30 C1CCCCN1Cc2cccc(c2)OCCCNc3oc4ccccc4n3

28 31 (carbamazepine) c12ccccc2C=Cc3ccccc3N1C(=O)N

29 32 (epoxide of carbamazepine) c12ccccc2C4OC4c3ccccc3N1C(=O)N

30 33 n1cN2c3cccc(Cl)c3C(=O)N(C)Cc2c1c4noc(n4)C(C)C

31 34 n1cN2c3cccc(Cl)c3C(=O)N(C)Cc2c1c4noc(n4)C(C)(C)O

32 35 n1cN2c3cccc(Cl)c3C(=O)N(C)Cc2c1c4noc(n4)C(C)(O)CO

33 36 (amitriptyline) c12ccccc2CCc3ccccc3C1=CCCN(C)C

34 acetaminophen Oc1ccc(cc1)NC(=O)C

35 acetylsalicylic acid CC(=O)Oc1ccccc1C(=O)O

36 alprazolam Clc1ccc2N3c(C)nnc3CN=C(c4ccccc4)c2c1

37 aminopyrine c1ccccc1N2C(=O)C(N(C)C)=C(C)N2C

38 amobarbital CC(C)CCC1(CC)C(=O)NC(=O)NC1=O

39 antipyrine c1ccccc1N2C(=O)C=C(C)N2C

40 argon Ar

41 atenolol CC(C)NCC(O)COc1ccc(cc1)CC(=O)N

42 benzene c1ccccc1

43 bretazenil CC(C)(C)OC(=O)C1=C2C3CCCN3C(=O)c4cc(Br)ccc4N2C=N1

44 bromperidol Fc1ccc(cc1)C(=O)CCCN2CCC(CC2)(O)c3ccc(Br)cc3

45 butanone CC(=O)CC

46 caffeine c12ncN(C)c1C(=O)N(C)C(=O)N2C

47 chlorpromazine c12cc(Cl)ccc1Sc3ccccc3N2CCCN(C)C

48 clobazam Clc1ccc2N(C)C(=O)CC(=O)N(c3ccccc3)c2c1

49 codeine C1=C[C@H]2[C@H](N(C)C5)Cc3ccc(OC)c4c3[C@@]2(C5)[C@@H](O4)[C@H]1O

50 CS2 S=C=S

51 cyclohexane C1CCCCC1

52 cyclopropane C1CC1

53 desipramine c12ccccc1CCc3ccccc3N2CCCNC

54 desmethyldesipramine c12ccccc1CCc3ccccc3N2CCCN

55 desmethylclobazam Clc1ccc2NC(=O)CC(=O)N(c3ccccc3)c2c1

56 desmethyldiazepam Clc1ccc2NC(=O)CN=C(c3ccccc3)c2c1

57 desmonomethylpromazine c12ccccc1Sc3ccccc3N2CCCNC

58 diazepam Clc1ccc2N(C)C(=O)CN=C(c3ccccc3)c2c1

59 dichloromethane C(Cl)Cl

60 didanosine c12ncnc(O)c1N=CN2[C@H]3CC[C@@H](CO)O3

61 diethyl ether CCOCC

62 2,2-dimethylbutane CC(C)(C)CC

63 divinyl ether C=COC=C

64 enflurane ClC(F)C(F)(F)OC(F)F

65 ethanol CCO

66 ethylbenzene c1c(CC)cccc1

67 flumazenil CCOC(=O)C1=C2CN(C)C(=O)c3cc(F)ccc3N2C=N1

68 flunitrazepam [O-][N+](=O)c1ccc2N(C)C(=O)CN=C(c3c(F)cccc3)c2c1

69 fluphenazine c12ccccc1Sc3ccc(C(F)(F)F)cc3N2CCCN4CCN(CC4)CCO

70 fluroxene FC(F)(F)COC=C

71 haloperidol Fc1ccc(cc1)C(=O)CCCN2CCC(CC2)(O)c3ccc(Cl)cc3

72 halothane FC(F)(F)C(Cl)Br

73 heptane CCCCCCC

74 hexane CCCCCC

75 hexobarbital C2CCCC=C2C1(C)C(=O)N(C)C(=O)NC1=O

76 1-hydroxymidazolam Clc1ccc2N3c(CO)ncc3CN=C(c4c(F)cccc4)c2c1

77 4-hydroxymidazolam Clc1ccc2N3c(C)ncc3C(O)N=C(c4c(F)cccc4)c2c1

78 9-hydroxyrisperidone O=C1N2CCCC(O)C2=NC(C)=C1CCN3CCC(CC3)C4=NOc5cc(F)ccc45

79 hydroxyzine c1ccccc1C(c2ccccc2)N3CCN(CC3)CCOCCO

80 ibuprofen CC(C)Cc1ccc(cc1)C(C)C(=O)O

81 indinavir c1ncccc1CN2CC(C(=O)NC(C)(C)C)N(CC2)CC(O)CC(Cc3ccccc3)C(=O)NC4c5ccccc5CC4O

82 indomethacin COc1ccc2N(C(=O)c3ccc(Cl)cc3)c(C)c(CC(=O)O)c2c1

83 isoflurane FC(F)(F)C(Cl)OC(F)F

84 krypton Kr

85 M2L-663581 N1=CN2c3cccc(Cl)c3C(=O)N(C)CC2=C1C4=NOC(C(C)(O)CO)=N4

86 mesoridazine c12ccccc1Sc3ccc(S(=O)C)cc3N2CCC4CCCCN4C

87 methane C

88 methohexital CCC=CC(C)C1(CC=C)C(=O)NC(=O)NC1=O

89 methoxyflurane COC(F)(F)C(Cl)Cl

90 methylcyclopentane C1C(C)CCC1

91 3-methylhexane CCC(C)CCC

92 2-methylpentane CC(C)CCC

93 3-methylpentane CCC(C)CC

94 2-methylpropan-1-ol CC(C)CO

95 mianserin c12ccccc1Cc3ccccc3C4N2CCN(C)C4

96 midazolam Clc1ccc2N3c(C)ncc3CN=C(c4c(F)cccc4)c2c1

97 MIL-663581 N1=CN2c3cccc(Cl)c3C(=O)N(C)CC2=C1C4=NOC(C(C)(C)O)=N4

98 mirtazapine c12ncccc1Cc3ccccc3C4N2CCN(C)C4

99 morphine C1=C[C@H]2[C@H](N(C)C5)Cc3ccc(O)c4c3[C@@]2(C5)[C@@H](O4)[C@H]1O

100 neon Ne

101 nevirapine c12nccc(C)c1NC(=O)c3cccnc3N2C4CC4

102 nitrogen N#N

103 nitrous oxide [O-][N+]#N

104 nor-1-chlorpromazine c12ccccc1Sc3cccc(Cl)c3N2CCCN

105 nor-2-chlorpromazine c12ccccc1Sc3ccc(Cl)cc3N2CCCN

106 northioridazine c12ccccc1Sc3ccc(SC)cc3N2CCC4CCCCN4

107 Org12692 FC(F)(F)c1c(Cl)cc(cc1)N2CCNCC2

108 Org13011 FC(F)(F)c1ccnc(c1)N2CCN(CC2)CCCCN3CCCC3(=O)

109 Org30526 c12cc(Cl)ccc1Oc3ccccc3C4C2CNC4

110 Org32104 c12ccccc1Oc3c(C)cccc3C4C2(O)CCNC4

111 Org34167 c12ccccc2ON=C1c3ccccc3C(N)CC=C

112 Org4428 c12ccccc1Oc3c(C)cccc3C4C2(O)CCN(C)C4

113 Org5222 c12cc(Cl)ccc1Oc3ccccc3C4C2CN(C)C4

114 oxazepam Clc1ccc2NC(=O)C(O)N=C(c3ccccc3)c2c1

115 paraxanthine CN1C=NC2=C1C(=O)N(C)C(=O)N2

116 pentane CCCCC

117 pentobarbital CCCC(C)C1(CC)C(=O)NC(=O)NC1=O

118 phenylbutazone c1ccccc1N2C(=O)C(CCCC)C(=O)N2c3ccccc3

119 phenytoin c1ccccc1C2(c3ccccc3)NC(=O)NC2=O

120 promazine c12ccccc1Sc3ccccc3N2CCCN(C)C

121 propan-1-ol CCCO

122 propan-2-ol CC(O)C

123 propanone CC(=O)C

124 propranolol CC(C)NCC(O)COc1cccc2ccccc12

125 quinidine COc1ccc2nccc(c2c1)[C@H](O)[C@H]3N4CCC(C3)[C@H](C4)C=C

126 risperidone O=C1N2CCCCC2=NC(C)=C1CCN3CCC(CC3)C4=NOc5cc(F)ccc45

127 RO19-4603 CC(C)(C)OC(=O)C1=C2CN(C)C(=O)c3sccc3N2C=N1

128 salicylic acid Oc1ccccc1C(=O)O

129 salicyluric acid Oc1ccccc1C(=O)NCC(=O)O

130 SF6 FS(F)(F)(F)(F)F

131 SKF 101468 CCN(CC)CCc1cccc2NC(=O)Cc12

132 SKF 89124 CCN(CC)CCc1ccc(O)c2NC(=O)Cc12

133 sulforidazine c12ccccc1Sc3ccc(S(=O)(=O)C)cc3N2CCC4CCCCN4C

134 teflurane FC(F)(F)C(F)Br

135 theobromine CN1C=NC2=C1C(=O)NC(=O)N2(C)

136 theophylline c12ncNc1C(=O)N(C)C(=O)N2C

137 thiopental CCCC(C)C1(CC)C(=O)NC(=S)NC1=O

138 thioridazine c12cc(SC)ccc1Sc3ccccc3N2CCC4N(C)CCCC4

139 tibolone C#C[C@@]1(O)CC[C@H]2[C@@H]3[C@H](C)CC4=C(CCC(=O)C4)[C@H]3CC[C@]12C

140 toluene c1c(C)cccc1

141 triazolam Clc1ccc2N3c(C)nnc3CN=C(c4c(Cl)cccc4)c2c1

142 1,1,1-trichloroethane C(Cl)(Cl)(Cl)C

143 trichloroethene C(Cl)(Cl)=CCl

144 trichloromethane C(Cl)(Cl)Cl

145 1,1,1-trifluoro-2-chloroethane FC(F)(F)CCl

146 trifluoperazine c12cc(C(F)(F)F)ccc1Sc3ccccc3N2CCCN4CCN(C)CC4

147 valproic acid CCCC(C(=O)O)CCC

148 xenon Xe

149 2-xylene c1c(C)c(C)ccc1

150 3-xylene c1c(C)cc(C)cc1

151 4-xylene c1c(C)ccc(C)c1

152 Y-G 14 CNCCc1ccccn1

153 Y-G 15 CN(C)CCc1ccccn1

154 Y-G 16 NCCc1sccn1

155 Y-G 19 NCCc1scc(c2ccccc2)n1

156 Y-G 20 NCCc1cn2ccccc2n1

157 zidovudine C1=C(C)C(=O)NC(=O)N1[C@H]2C[C@H](N=[N+]=[N-])[C@@H](CO)O2

B.2. Compounds from the data set of Botsford (2002) (Chapter 12)

No Name SMILES

1 acetaminophen CC(=O)Nc1ccc(O)cc1

2 acetic acid CC(O)=O

3 acetone CC(=O)C

4 acetylsalicylic acid CC(=O)Oc1ccccc1C(O)=O

5 alachlor ClCC(=O)N(COC)c1c(CC)cccc1CC

6 4-aminobenzaldehyde Nc1ccc(C=O)cc1

7 amitriptyline c13ccccc1CCc2ccccc2C3=CCCN(C)C

8 amoxicillin c1cc(O)ccc1C(N)C(=O)NC2C(=O)N3C(C(=O)O)C(C)(C)SC23

9 atenolol CC(C)NCC(O)COc1ccc(cc1)CC(=O)N

10 atropine CN1C2CCC1C[C@H](C2)OC(=O)C(CO)c3ccccc3

11 bensulide c1ccccc1S(=O)(=O)NCCSP(=S)(OC(C)C)OC(C)C

12 benzene c1ccccc1

13 benzyl chloride ClCc1ccccc1

14 bipyridyl n1ccccc1c2ncccc2

15 bromoxynil N#Cc1cc(Br)c(O)c(Br)c1

16 butylamine CCCCN


18 catechol Oc1c(O)cccc1

19 celebrex c1cc(C)ccc1C2=CC(C(F)(F)F)=NN2c3ccc(cc3)S(=O)(=O)N

20 chloramphenicol ClC(Cl)C(=O)NC(CO)C(O)c1ccc(N(=O)=O)cc1

21 chlorobenzene Clc1ccccc1

22 3-chlorophenol Clc1cc(O)ccc1

23 4-chlorophenol Clc1ccc(O)cc1

24 chloroquine CCN(CC)CCCC(C)Nc1ccnc2cc(Cl)ccc12

25 clindamycin CCC[C@H]1CN(C)[C@@H](C1)C(=O)N[C@@H](C(Cl)C)[C@@H]1[C@H](O)[C@H](O)[C@@H](O)[C@@H](CC)O1

26 clomazone Clc1ccccc1CN2OCC(C)(C)C(=O)2

27 codeine C1=C[C@H]2[C@H](N(C)C5)Cc3ccc(OC)c4c3[C@@]2(C5)[C@@H](O4)[C@H]1O

28 2-cresol Cc1ccccc1O

29 3-cresol Cc1cccc(O)c1

30 4-cresol Cc1ccc(O)cc1

31 cyanazine N#CC(C)(C)Nc1nc(NCC)nc(Cl)n1

32 cyclohexylamine C1CCCCC1N

33 DCPA (chlorthal-dimethyl) COC(=O)c1c(Cl)c(Cl)c(C(=O)OC)c(Cl)c1Cl

34 dexedrine C[C@H](N)Cc1ccccc1

35 2-dianisidine COc1c(N)ccc(c1)c2ccc(N)c(c2)OC

36 diazepam (valium) O=C1N(C)c2ccc(Cl)cc2C(=NC1)c3ccccc3

37 diazinon n1c(C(C)C)cc(C)cc1OP(=S)(OCC)OCC

38 dibromoethane BrCCBr

39 1,2-dichlorobenzene Clc1ccccc1Cl

40 1,3-dichlorobenzene Clc1cccc(Cl)c1

41 1,4-dichlorobenzene Clc1ccc(Cl)cc1

42 1,2-dichloroethane ClCCCl

43 dichloromethane ClCCl

44 2,4-dichlorophenol Clc1cc(Cl)c(O)cc1

45 3,4-dichlorophenol Clc1c(Cl)cc(O)cc1

46 3,5-dichlorophenol Clc1cc(Cl)cc(O)c1

47 2,4-dichlorophenoxyacetic acid OC(=O)COc1c(Cl)cc(Cl)cc1

48 diethylamine CCNCC


50 dimethyl sulphoxide CS(=O)C

51 4-(dimethylamino)benzaldehyde O=Cc1ccc(cc1)N(C)C

52 2,3-dimethylphenol Oc1c(C)c(C)ccc1

53 2,4-dimethylphenol Oc1c(C)cc(C)cc1

54 1,2-dinitrobenzene O=N(=O)c1ccccc1N(=O)=O

55 1,4-dinitrobenzene O=N(=O)c1ccc(N(=O)=O)cc1

56 2,6-dinitro-4-cresol Cc1cc(N(=O)=O)c(O)c(N(=O)=O)c1

57 2,4-dinitrophenol c1cc(N(=O)=O)cc(N(=O)=O)c1O

58 diuron CN(C)C(=O)Nc1ccc(Cl)c(Cl)c1

59 emetine COc1c(OC)cc2c(c1)CCN3[C@H]2C[C@@H]([C@@H](CC)C3)C[C@H]4NCCc5c4cc(OC)c(OC)c5

60 eptam (EPTC) CCSC(=O)N(CCC)CCC

61 ethanol CCO

62 ethanolamine NCCO

63 ethyl acetate CCOC(=O)C

64 ethylbenzene c1ccccc1CC

65 ethylene glycol OCCO

66 FCCP [carbonyl cyanide 4-(trifluoromethoxy)-phenylhydrazone] N#CC(C#N)=NNc1cc(Cl)ccc1

67 fluvoxamine COCCCCC(=NOCCN)c1ccc(C(F)(F)F)cc1

68 furosemide NS(=O)(=O)c1cc(C(=O)O)c(cc1Cl)NCC2=CC=CO2

69 gentamycin CNC(C)[C@@H]1CC[C@@H](N)[C@H](O1)O[C@@H]2[C@@H](N)C[C@@H](N)[C@@H]([C@@H]2O)O[C@@H]3[C@@H](O)[C@@H](NC)[C@@](C)(O)CO3

70 hexachlorobenzene Clc1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl

71 hexachlorophene Clc1c(Cl)cc(Cl)c(O)c1Cc2c(O)c(Cl)cc(Cl)c2Cl

72 ibuprofen CC(C)Cc1ccc(C(C)C(=O)O)cc1

73 imazapyr OC(=O)c1cccnc1C2=NC(C)(C(C)C)C(=O)N2

74 indole c12ccccc1ccn2

75 isoproterenol CC(C)NCC(O)c1ccc(O)c(O)c1

76 isoxaben COc1cccc(OC)c1C(=O)NC1=CC(=NO1)C(C)(CC)CC

77 lindane [C@H]1(Cl)[C@H](Cl)[C@H](Cl)[C@@H](Cl)[C@H](Cl)[C@@H]1Cl

78 malathion CCOC(=O)C(CC(=O)OCC)SP(=S)(OC)OC

79 methanol CO

80 3-methyl-4-nitrophenol Oc1cc(C)c(N(=O)=O)cc1

81 methylphenidate (ritalin) N1CCCCC1C(C(=O)OC)c1ccccc1

82 morphine C1=C[C@H]2[C@H](N(C)C5)Cc3ccc(O)c4c3[C@@]2(C5)[C@@H](O4)[C@H]1O

83 1-naphthol Oc1cccc2ccccc12

84 napropamide CCN(CC)C(=O)C(C)Oc1cccc2ccccc12

85 naproxen OC(=O)[C@@H](C)c1ccc2cc(OC)ccc2c1

86 nicosulfuron CN(C)C(=O)c1cccnc1S(=O)(=O)NC(=O)Nc2nc(OC)cc(OC)n2

87 nicotine c1ccncc1[C@H]2N(C)CCC2

88 2-nitrophenol Oc1ccccc1N(=O)=O

89 norflurazon c1c(C(F)(F)F)cccc1N2N=CC(NC)=C(Cl)C2=O

90 novobiocin CC(C)=CCc1c(O)ccc(c1)C(=O)NC2=C(O)c3c(OC(=O)2)c(C)c(cc3)O[C@H]4[C@@H](O)[C@H](OC(=O)N)[C@@H](OC)C(C)(C)O4

91 octane CCCCCCCC

92 orphenadrine CN(C)CCOC(c1ccccc1)c2ccccc2C

93 orudis OC(=O)C(C)c1cccc(c1)C(=O)c2ccccc2

94 oxadiazon CC(C)(C)C1=NN(C(=O)O1)c2c(Cl)cc(O)c(OC(C)C)c2

95 paraquat C[n+]1ccc(cc1)c2cc[n+](C)cc2

96 pentachlorophenol Clc1c(Cl)c(Cl)c(Cl)c(Cl)c1O

97 pentanol CCCCCO

98 1,10-phenanthroline n1cccc2ccc3cccnc3c12

99 phenol Oc1ccccc1

100 primisulfuron FC(F)Oc1cc(OC(F)F)nc(n1)NC(=O)NS(=O)(=O)c2ccccc2C(=O)OC


102 propranolol CC(C)NCC(O)COc1cccc2ccccc12

103 pseudocumene Cc1c(C)cc(C)cc1

104 quinidine COc1ccc2nccc(c2c1)[C@H](O)[C@H]3N4CCC(C3)[C@H](C4)C=C

105 quinine COc1ccc2nccc(c2c1)[C@@H](O)[C@H]3N4CCC(C3)[C@H](C4)C=C

106 resorcinol Oc1cccc(O)c1

107 salicylic acid OC(=O)c1c(O)cccc1

108 scopolamine CN1C2C4C(O4)C1C[C@H](C2)OC(=O)C(CO)c3ccccc3

109 sethoxydim CCSC(C)CC1CC(O)=C(C(=O)C1)C(CCC)=NOCC

110 sulphosalicylic acid OC(=O)c1c(O)ccc(S(=O)(=O)O)c1

111 tert-butyl methyl ether COC(C)(C)C

112 tetrachloroethene ClC(Cl)=C(Cl)Cl

113 tetrachloromethane ClC(Cl)(Cl)Cl

114 tetracycline NC(=O)C1=C(O)[C@@H](N(C)C)C2CC3[C@](C)(O)c4cccc(O)c4C(=O)C3=C(O)[C@]2(O)C1=O

115 theophylline CN1C(=O)N(C)C(=O)C2=C1NC=N2

116 thiazopyr CC(C)Cc1c(C(=O)OC)c(C(F)F)nc(C(F)(F)F)c1C2=NCCS2

117 thifensulfuron n1c(OC)nc(C)nc1NC(=O)NS(=O)(=O)C2=C(C(O)=O)SC=C2

118 thioridazine CN1CCCCC1CCN2c3ccccc3Sc4ccc(SC)cc24

119 toluene Cc1ccccc1

120 4-toluidine Cc1ccc(N)cc1

121 1,2,3-trichlorobenzene Clc1c(Cl)c(Cl)ccc1

122 1,1,1-trichloroethane ClC(Cl)(Cl)C

123 trichloroethene ClC(Cl)=CCl

124 trichloromethane ClC(Cl)Cl

125 trifluralin FC(F)(F)c1cc(N(=O)=O)c(N(CCC)CCC)c(N(=O)=O)c1

126 2,4,6-trihydroxytoluene Cc1c(O)cc(O)cc1O

127 2,3,6-trimethylphenol Oc1c(C)c(C)ccc1C

128 2,4,6-trimethylphenol Oc1c(C)cc(C)cc1C

129 3,4,5-trimethylphenol Oc1cc(C)c(C)c(C)c1

130 2,4,6-trinitrotoluene Cc1c(N(=O)=O)cc(N(=O)=O)cc1N(=O)=O

131 verapamil COc1c(OC)cc(cc1)CCN(C)CCCC(C(C)C)(C#N)c2cc(OC)c(OC)cc2


133 zanaflex N1CCN=C1NC2=C(Cl)C=CC3=NSN=C23

B.3. Compounds from the MEIC data set included in the QSAAR and QSAR analyses (Chapter 14)

No Name SMILES

1 acetaminophen CC(=O)Nc1ccc(O)cc1

2 acetylsalicylic acid CC(=O)Oc1ccccc1C(O)=O

4 diazepam c1ccccc1C2=NCC(=O)N(C)c3ccc(Cl)cc23


7 ethylene glycol OCCO

8 methanol CO

9 ethanol CCO


11 1,1,1-trichloroethane ClC(Cl)(Cl)C

12 phenol Oc1ccccc1

15 malathion CCOC(=O)C(CC(=O)OCC)SP(=S)(OC)OC

16 2,4-dichlorophenoxyacetic acid Clc1cc(Cl)ccc1OCC(=O)O

18 nicotine c1ccncc1C2N(C)CCC2

21 theophylline CN1C(=O)N(C)C(=O)C2=C1NC=N2

24 phenobarbital c1ccccc1C2(CC)C(=O)NC(=O)NC2=O


32 lindane [C@H]1(Cl)[C@H](Cl)[C@H](Cl)[C@@H](Cl)[C@H](Cl)[C@@H]1Cl

33 trichloromethane ClC(Cl)Cl

34 tetrachloromethane ClC(Cl)(Cl)Cl

35 isoniazid NNC(=O)c1ccncc1

36 dichloromethane ClCCl

38 hexachlorophene Clc1c(Cl)cc(Cl)c(O)c1Cc2c(O)c(Cl)cc(Cl)c2Cl

39 pentachlorophenol Clc1c(Cl)c(Cl)c(Cl)c(Cl)c1O

44 phenytoin c1ccccc1C3(c2ccccc2)C(=O)NC(=O)N3

45 chloramphenicol ClC(Cl)C(=O)NC(CO)C(O)c1ccc(N(=O)=O)cc1

