
ASSESSMENT OF SEASONAL AND POLLUTING EFFECTS ON THE QUALITY OF RIVER WATER BY EXPLORATORY DATA ANALYSIS

MARISOL VEGA*, RAFAEL PARDO, ENRIQUE BARRADO and LUIS DEBÁN

Departamento de Química Analítica, Facultad de Ciencias, Universidad de Valladolid, 47005 Valladolid, Spain

(First received December 1996; accepted March 1998)

Abstract. Twenty-two physico-chemical variables have been analyzed in water samples collected every three months for two and a half years from three sampling stations located along a 25 km section of a river affected by man-made and seasonal influences. Exploratory analysis of the experimental data has been carried out by box plots, ANOVA, display methods (principal component analysis) and unsupervised pattern recognition (cluster analysis) in an attempt to discriminate sources of variation of water quality. PCA has allowed the identification of a reduced number of "latent" factors with a hydrochemical meaning: mineral contents, man-made pollution and water temperature. Spatial (pollution of anthropogenic origin) and temporal (seasonal and climatic) sources of variation affecting the quality and hydrochemistry of river water have been differentiated and assigned to polluting sources. An ANOVA of the rotated principal components has demonstrated that (i) mineral contents are seasonal and climate dependent, thus pointing to a natural origin for this polluting form and (ii) pollution by organic matter and nutrients originates from anthropogenic sources, mainly as municipal wastewater. The application of PCA and cluster analysis has achieved a meaningful classification of river water samples based on seasonal and spatial criteria. © 1998 Elsevier Science Ltd. All rights reserved

Key words: water quality, surface water, hydrochemistry, exploratory data analysis, ANOVA, box plot, principal component analysis, pattern recognition, cluster analysis.

INTRODUCTION

River basins generally constitute areas with a high population density owing to favourable living conditions such as the availability of fertile lands, water for irrigation, industrial or drinking purposes, and efficient means of transportation. Rivers play a major role in assimilating or carrying off industrial and municipal wastewater, manure discharges and runoff from agricultural fields, roadways and streets, which are responsible for river pollution (Stroomberg et al., 1995; Ward and Elliot, 1995). Rivers are also the main water resources in inland areas for drinking, irrigation and industrial purposes; thus, reliable information on water quality is a prerequisite for effective and efficient water management.

The discharge of industrial and municipal wastewater and manure can be considered a constant polluting source, but not so the surface runoff, which is seasonal and highly affected by climate. Flow in rivers is a function of many factors including precipitation, surface runoff, interflow, groundwater flow and pumped inflow and outflow. Seasonal variations of these factors have a strong effect on flow rates and hence on the concentration of pollutants in the river water.

Long-term surveys and monitoring programs of water quality are an adequate approach to a better knowledge of river hydrochemistry and pollution, but they produce large sets of data which are often difficult to interpret (Dixon and Chiswell, 1996). Most discussions on trend detection focus on analysing a single variable, while routine monitoring programs ordinarily measure several variables. The problem of data reduction and interpretation of multiconstituent chemical and physical measurements can be approached through the application of multivariate statistical methods and exploratory data analysis (Massart et al., 1988; Wenning and Erickson, 1994). The usefulness of multivariate statistical tools in the treatment of analytical and environmental data is reflected by the increasing number of papers cited in Analytical Chemistry Reviews (Brown et al., 1994, 1996).

Cluster analysis and principal component analysis (PCA) have been widely used as they are unbiased methods which can indicate associations between samples and/or variables (Wenning and Erickson, 1994). These associations, based on similar magnitudes or variations in chemical and physical constituents, may indicate the presence of seasonal or man-made influences. Hierarchical agglomerative cluster analysis indicates groupings of samples by linking inter-sample similarities and illustrates the overall similarity of variables in the data set (Massart and Kaufman, 1983). PCA is used to reduce the dimensionality of the data set by explaining the correlation among a large set of variables in terms of a small number of underlying factors or principal components without losing much information (Jackson, 1991; Meglen, 1992), and allows associations between variables to be assessed, since it indicates the participation of individual chemicals in several influence factors. Exploratory data analysis has been used to evaluate the water quality of rivers, and seasonal, spatial and anthropogenic influences have been evidenced (Brown et al., 1980; Bartels et al., 1985; Grimalt et al., 1990; Librando, 1991; Andrade et al., 1992; Aruga et al., 1993; Elosegui and Pozo, 1994; Pardo et al., 1994; Battegazzore and Renoldi, 1995; Voutsa et al., 1995).

In this work, PCA, analysis of variance (ANOVA) and agglomerative hierarchical cluster analysis have been used to investigate the water quality of the Pisuerga river (Duero basin, Spain), to assess the influence that pollution and seasonality have on the quality of river water, and to discriminate the individual effects of climate and human activities on the river hydrochemistry.

Wat. Res. Vol. 32, No. 12, pp. 3581–3592, 1998. © 1998 Elsevier Science Ltd. All rights reserved. Printed in Great Britain. 0043-1354/98 $19.00+0.00. PII: S0043-1354(98)00138-9

*Author to whom all correspondence should be addressed. [E-mail: [email protected]].

METHODS

Sampling stations

The Pisuerga river belongs to the Duero river basin, which is located in the Castilla y León region (Centre-North of Spain). The inland geographic situation of the basin, surrounded by mountains, results in an extremely continental climate. Precipitation in the area is scarce, ranging from 313 to 571 mm yr⁻¹, with a mean of 442 mm yr⁻¹. Precipitation is highest in November (49.8 mm) and lowest in August (13.2 mm). The annual mean temperature is 12°C, and extreme values of −4°C and 32°C are registered in January and July, respectively.

The river flows North–South from the Northern mountains through a high tableland to run into the Duero river, and is the main drainage stream in that direction; in spring, snow melting in the Northern mountains causes a marked increase in river flow. Along its course, the river passes through limestone, marl, gypsum and sandstone soils, which are the main contributors to the high levels of minerals in the river water. Significant agricultural activity devoted to irrigated crops takes place in riverine areas, where the use of nitrogenous fertilisers is common practice. Twelve kilometres upstream of its mouth, the river crosses the town of Valladolid, the major industrial centre of the region, with a population of ca. 400 000. Municipal wastewater is discharged directly into the river (estimated volume ca. 57 million m³ yr⁻¹) as the wastewater and sewage treatment plant is still being built. Moreover, although the big industries settled in the area purify their wastewater, small industries are suspected of discharging residues into the river. The combination of a high population density in the area and an extreme continental climate causes river hydrology, and hence river pollution, to be strongly influenced by seasonality.

The investigated river section is located at 41°23'24''N and 04°27'00''W, at an average of 690 m above sea level. It covers a length of 25 km from Cabezón de Pisuerga, a small village located 13 km upstream of Valladolid, to the village of Simancas, at the mouth of the Pisuerga river, 12 km downstream of Valladolid. Major industrial activity in the area is concentrated in the North of the city, upstream of the bridge called Puente Mayor, and municipal discharges into the river mainly occur between Puente Mayor and Simancas.

Selected sampling stations were located in Cabezón de Pisuerga, Puente Mayor and Simancas, in an attempt to isolate and identify the polluting sources: in Cabezón de Pisuerga the river has not yet received industrial and municipal wastewater, and the water quality at this station can be considered to reflect pollution from overland flow and from agricultural and manure discharges; Puente Mayor reflects the situation in which industrial wastewater has been discharged, but no municipal residues; in Simancas the river has received all the polluting discharges.

Selected stations were sampled every three months for two and a half years. A total of 10 samples were collected from each station on the following dates: 06/04/90, 03/07/90, 09/10/90, 09/01/91, 03/04/91, 02/07/91, 10/10/91, 09/01/92, 10/04/92 and 06/07/92. Samples are identified throughout by means of a four-character code XYZZ, where X denotes the sampling station (C, Cabezón; P, Puente Mayor; S, Simancas), Y is the month of sampling (A, April; J, July; O, October; E, January) and ZZ is the year (90, 91 or 92).
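The XYZZ labelling scheme can be unpacked mechanically. The following minimal sketch is an illustration only and is not part of the original study: the names `decode_sample_code`, `SampleId`, `CAMPAIGNS` and `ALL_CODES` are hypothetical, and mapping the two-digit year ZZ to 19ZZ is an assumption based on the sampling dates listed above.

```python
from dataclasses import dataclass

# Letter codes as defined in the text.
STATIONS = {"C": "Cabezón", "P": "Puente Mayor", "S": "Simancas"}
MONTHS = {"A": "April", "J": "July", "O": "October", "E": "January"}


@dataclass(frozen=True)
class SampleId:
    """Decoded sample label (hypothetical helper type)."""
    station: str
    month: str
    year: int


def decode_sample_code(code: str) -> SampleId:
    """Decode a four-character XYZZ label such as 'SJ91'."""
    if len(code) != 4:
        raise ValueError(f"expected four characters, got {code!r}")
    station_key, month_key, year_digits = code[0], code[1], code[2:]
    if station_key not in STATIONS or month_key not in MONTHS:
        raise ValueError(f"unknown station or month letter in {code!r}")
    if not (year_digits.isascii() and year_digits.isdigit()):
        raise ValueError(f"year part must be two digits in {code!r}")
    # Assumption: all campaigns fall in the 1990s, so ZZ maps to 19ZZ.
    return SampleId(STATIONS[station_key], MONTHS[month_key],
                    1900 + int(year_digits))


# The ten quarterly campaigns listed in the text, as month-year suffixes.
CAMPAIGNS = ["A90", "J90", "O90", "E91", "A91", "J91", "O91", "E92", "A92", "J92"]
ALL_CODES = [station + suffix for station in STATIONS for suffix in CAMPAIGNS]

print(decode_sample_code("SJ91"))
print(len(ALL_CODES))
```

Building the labels from three stations and ten campaigns gives the 30 samples on which the rest of the analysis is based.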

Analytical procedures

Sample containers were 1 l polyethylene bottles provided with hermetic-locking caps. Bottles and caps were cleaned by soaking in 50% HCl for three days, rinsed with deionized water, soaked in 2 M HNO3 for another three days, rinsed again with deionized water, drained, wrapped in polyethylene bags and stored until required.

Samples were collected by means of a Go-Flo device from the middle of the stream at a depth of 15 cm, from stone bridges existing at each of the sampling stations. Prior to sample collection, the sampling device and containers were rinsed twice with the water to be sampled. Temperature, pH, conductivity and dissolved oxygen measurements were performed in situ. Duplicate samples were taken from each sampling station and immediately filtered under nitrogen pressure through cellulose nitrate filters (pore size 0.45 µm) into acid-washed polyethylene bottles. One duplicate was acidified to pH 2 by addition of 100 µl of 10 M HCl to each 100 ml sample and used for determination of metals, hardness, nitrogen (as ammonia, nitrite and nitrate) and phosphorus (as phosphate). The second duplicate was kept at its natural pH and used for determination of the remaining anions (bicarbonate, chloride and sulphate), conductivity and organic matter (as chemical oxygen demand, COD, and biochemical oxygen demand, BOD). Samples were immediately transported to the laboratory and stored at 4°C until their analysis, which was accomplished within one week.

Twenty-two physico-chemical parameters were determined by following standard and recommended methods of analysis (APHA-AWWA-WPCF, 1985; AOAC, 1990). Table 1 displays the variables measured and their units, the analytical techniques employed, and the abbreviations used henceforth. A total of 660 analyses were carried out (22 variables in 30 samples). Two replicates of each analysis were performed and mean values were used for calculations.

Data treatment

Exploratory data analysis was performed by linear display methods (principal component analysis) and by unsupervised pattern recognition techniques (hierarchical cluster analysis) on experimental data normalized to zero mean and unit variance, in order to avoid misclassifications arising from the different orders of magnitude of both the numerical values and the variances of the parameters analysed. As the methods of classification used here are non-parametric, they make no assumptions about the underlying statistical distribution of the data and therefore no evaluation of normal (Gaussian) distribution of the data is necessary (Sharaf et al., 1986).

Principal component analysis was applied to normalized data to assess associations between variables, since this method evidences participation of individual chemicals in several influence factors, which commonly occurs in hydrochemistry. Diagonalization of the correlation matrix transforms the original p correlated variables into p uncorrelated (orthogonal) variables called principal components (PCs), which are weighted linear combinations of the original variables (Mellinger, 1987; Meglen, 1992; Wenning and Erickson, 1994). The characteristic roots (eigenvalues) of the PCs are a measure of their associated variances, and the sum of eigenvalues coincides with the total number of variables. Correlation of PCs and original variables is given by loadings, and individual transformed observations are called scores.

Cluster analysis is an unsupervised pattern recognition technique that uncovers intrinsic structure or underlying behaviour of a data set without making a priori assumptions about the data, in order to classify the objects of the system into categories or clusters based on their nearness or similarity. In hierarchical cluster analysis the distance between samples is used as a measure of similarity. Hierarchical agglomerative cluster analysis was carried out on the normalised data by means of the complete linkage (furthest neighbour), average linkage (between and within groups) and Ward's methods, using squared Euclidean distances as a measure of similarity (Massart and Kaufman, 1983; Willet, 1987).
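The two preprocessing and classification steps named above (scaling to zero mean and unit variance, then agglomerative clustering on squared Euclidean distances) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the software used in the study: only complete linkage is shown, the function names are hypothetical, and the four (conductivity, COD) readings are made-up numbers.

```python
from statistics import mean, stdev


def autoscale(rows):
    """Scale every column to zero mean and unit sample variance."""
    columns = list(zip(*rows))
    centres = [mean(col) for col in columns]
    # A constant column has zero spread and would cause a division error.
    spreads = [stdev(col) for col in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, centres, spreads)]
        for row in rows
    ]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def complete_linkage(points, n_clusters):
    """Merge clusters until n_clusters remain; returns lists of row indices."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the furthest pair.
                d = max(squared_euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up (conductivity, COD) readings for four samples, illustration only.
samples = [[400.0, 2.0], [420.0, 2.2], [760.0, 6.0], [770.0, 6.5]]
scaled = autoscale(samples)
groups = complete_linkage(scaled, n_clusters=2)
print(groups)
```

Without the scaling step, conductivity (hundreds of units) would dominate the distances and COD (a few units) would barely influence the grouping, which is the misclassification risk the text refers to.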

RESULTS AND DISCUSSION

Table 2 briefly summarises the mean value and standard deviation of the 22 measured variables in the river water samples from the three stations. The high dispersion of most variables (high standard deviations) should be noted; it indicates variability in chemical composition between samples, thus pointing to the presence of temporal variations likely caused by polluting sources and/or climatic factors.

Recommended guide levels of these variables and maximum levels allowed by the European Directive 80/778/EEC concerning the quality of water intended for human consumption are included in Table 2. It must be emphasised that average concentrations of some variables such as chloride, COD, iron, manganese, sodium, ammonia, nitrite, phosphate and sulphate are higher than those recommended by the European legislation; therefore this water resource is not adequate for human consumption or industrial purposes and needs to be purified.

High levels of phosphate may originate from municipal wastewater discharges since it is an important component of detergents. The presence of nitrate in the river section sampled is suspected to originate from overland runoff from riverine agricultural fields where irrigated horticultural crops are grown and the use of inorganic fertilisers (usually as ammonium nitrate) is rather frequent. This practice could also explain the high levels of ammonia, but this pollutant may also originate from decomposition of nitrogen-containing organic compounds such as proteins and urea occurring in municipal wastewater discharges. In the presence of high levels of organic matter, nitrate can be reduced to some extent to nitrite, which could explain the high concentration of this pollutant in some samples. The high sulphate contents found in waters of the Pisuerga river are probably a consequence of the morphology of soils irrigated by the river, which are formed mainly by limestone, marl and gypsum.

Exploratory data analysis using box plots

Normal probability plots of the variables in conjunction with the Anderson–Darling normality test demonstrated that most variables were not normally distributed. However, these normality tests applied to individual sampling stations resulted in normal distributions for most variables, thus pointing to the existence of differences in water composition among stations.

Table 1. Physico-chemical parameters determined and analytical techniques used

Variable                    Abbreviation  Analytical technique        Units
Biochemical oxygen demand   BOD           potentiometry/O2 probe      mg O2 l⁻¹
Calcium                     Ca            flame AAS                   mg l⁻¹
Chloride                    Cl            ion chromatography          mg l⁻¹
Chemical oxygen demand      COD           redox titrimetry (KMnO4)    mg O2 l⁻¹
Conductivity                COND          conductometry               µmho cm⁻¹
Dissolved solids            DS            drying at 180°C/weighing    mg l⁻¹
Iron                        Fe            flame AAS                   mg l⁻¹
Flow rate                   FLOW          (*)                         m³ s⁻¹
Hardness                    HARD          EDTA titrimetry             mg CaCO3 l⁻¹
Bicarbonate                 HCO3          acid–base titrimetry        mg l⁻¹
Potassium                   K             flame AES                   mg l⁻¹
Magnesium                   Mg            flame AAS                   mg l⁻¹
Manganese                   Mn            flame AAS                   mg l⁻¹
Sodium                      Na            flame AES                   mg l⁻¹
Ammonium                    NH4           spectrophotometry           mg l⁻¹
Nitrite                     NO2           spectrophotometry           mg l⁻¹
Nitrate                     NO3           spectrophotometry           mg l⁻¹
Dissolved oxygen            OXYG          potentiometry/O2 probe      mg l⁻¹
pH                          pH            potentiometry/pH probe      pH units
Phosphate                   PO4           ion chromatography          mg l⁻¹
Sulphate                    SO4           ion chromatography          mg l⁻¹
Temperature                 TEMP          temperature probe           °C

(*) Data supplied by Confederación Hidrográfica del Duero.

Box plots (also called box-and-whisker plots) of individual variables in the three sampling stations were examined. Figure 1 shows an example of box plots for some meaningful variables related to the quality of river water, such as conductivity (mineralization), COD, dissolved oxygen or ammonium. The line across the box represents the median, whereas the bottom and top of the box show the locations of the first and third quartiles (Q1 and Q3). The whiskers are the lines that extend from the bottom and top of the box to the lowest and highest observations inside the region defined by Q1 − 1.5(Q3 − Q1) and Q3 + 1.5(Q3 − Q1). Individual points with values outside these limits (outliers) are plotted with asterisks.
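The box plot construction just described reduces to a handful of order statistics. The sketch below is an illustration under stated assumptions: the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) is a choice made here, since the text does not say which convention the plotting software used, and the ten values are made-up ammonium-like readings, not data from the study.

```python
from statistics import median, quantiles


def box_plot_summary(values):
    """Return the quantities drawn in a box-and-whisker plot."""
    data = sorted(values)
    # Assumption: 'inclusive' quartiles; other conventions differ slightly.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "median": median(data),
        "q1": q1,
        "q3": q3,
        # Whiskers end at the most extreme observations inside the fences.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Points beyond the fences would be drawn as asterisks.
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


# Made-up concentrations with one unusually high reading.
readings = [0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2, 1.4, 3.6]
summary = box_plot_summary(readings)
print(summary)
```

A long upper whisker or isolated high outliers, as in this toy series, is the pattern that signals a distribution skewed toward high concentrations.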

Table 2. Statistical descriptives for the 30 samples analysed

          Cabezón           Puente Mayor      Simancas
Variable  Mean   Std. Dev.  Mean   Std. Dev.  Mean   Std. Dev.  Min.   Max.   Guide level*  Max.*
BOD       2.8    0.8        3.2    0.7        3.7    1.2        1.5    6.5
Ca        77.0   9.6        77.1   7.4        76.5   8.9        58.8   91.2   100
Cl        23.3   7.7        24.3   8.0        28.3   9.9        12.2   46.1   25            200
COD       3.1    1.2        3.6    0.8        5.0    2.0        0.7    10     2             5
COND      589    123        599    98         629    115        402    773    400
DS        398    81         410    67         427    69         273    524                  1500
Fe        0.10   0.05       0.12   0.04       0.11   0.05       0.01   0.19   0.05          0.2
FLOW      45.0   42.6       37.0   20.9       37.5   21.2       14.8   129.2
HARD      250.1  43.6       253.1  32.6       254.4  35.1       179.1  302.9
HCO3      150.4  17.8       142.8  20.9       156.1  23.4       96.1   176.8
K         4.8    1.9        5.2    1.8        6.2    2.2        2.8    10.4   10            12
Mg        14.0   5.3        14.8   4.8        15.4   4.5        6.2    23.8   30            50
Mn        0.03   0.02       0.03   0.02       0.04   0.02       0.01   0.08   0.02          0.05
Na        19.4   9.5        20.2   7.7        25.6   10.0       7.1    40.5   20            150
NH4       0.63   0.62       0.51   0.23       1.66   0.92       0.05   3.61   0.05          0.5
NO2       0.32   0.32       0.13   0.09       0.35   0.30       0.03   1.08   Absence       0.1
NO3       11.2   7.3        11.9   7.0        10.4   8.4        0.3    29.9   25            50
OXYG      8.1    1.8        8.4    1.8        4.9    3.3        0.7    11.4
pH        8.0    0.2        8.1    0.5        7.6    0.3        7.2    8.8    6.5–8.5       9.5
PO4       0.84   0.32       0.86   0.30       1.61   0.63       0.35   2.50   0.3           3.3
SO4       105.4  34.9       108.9  28.7       112.7  28.1       50     150    25            250
TEMP      13.6   5.9        14.5   7.7        14.3   7.3        2.2    24.9   12            25

(*) Recommended guide levels and maximum concentrations allowed by the European Directive 80/778/EEC concerning the quality of water intended for human consumption.

Fig. 1. Box plots for conductivity, COD, dissolved oxygen and ammonium in Cabezón (C), Puente Mayor (P) and Simancas (S).


Box plots provide a visual impression of the location and shape of the underlying distributions. For example, box plots with long whiskers at the top of the box (such as that for ammonium at Simancas) indicate that the underlying distribution is skewed toward high concentrations. Box plots with large spread indicate seasonal variations of the water composition (see conductivity box plot). By inspecting these plots it was also possible to perceive differences among the three stations. For example, dissolved oxygen in Simancas is lower and has a greater spread compared with that in Cabezón and Puente Mayor. At the same time, COD and ammonium are higher in Simancas, thus pointing to a deterioration of the water quality downstream, likely caused by the discharge of municipal wastewater.

Analysis of variance (ANOVA) examines the different effects (usually called sources of variation) operating simultaneously on a response to decide which effects are statistically significant and to estimate their contribution to the variability of the response (Scheffé, 1959; Ross, 1988). Two-way ANOVA of independent variables showed the existence of seasonal and/or spatial differences. For example, significant seasonal differences were found for conductivity, temperature or flow, whereas for ammonium, phosphate or pH the differences were mainly due to the sampling station. For COD and BOD both sources of variation were significant.

Box plots and ANOVA showed similar trends for each variable; however, these are univariate techniques, inadequate for the investigation of our multivariate data table as the variables are correlated.
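For a single variable, a two-way ANOVA of this kind splits the total variability into a station (spatial) part, a sampling-date (temporal) part and a residual. The following is a minimal sketch assuming one observation per station and date with no interaction term; the text does not state the exact design used, so this layout, the function name and the conductivity-like numbers are assumptions for illustration only. Only F ratios are returned; converting them to p-values requires the F distribution, which is not implemented here.

```python
def two_way_anova(table):
    """Two-way ANOVA without replication.

    table[i][j] is the single observation for station i and date j.
    """
    r, c = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (r * c)
    station_means = [sum(row) / c for row in table]
    date_means = [sum(table[i][j] for i in range(r)) / r for j in range(c)]

    ss_station = c * sum((m - grand) ** 2 for m in station_means)
    ss_date = r * sum((m - grand) ** 2 for m in date_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_resid = ss_total - ss_station - ss_date

    df_station, df_date = r - 1, c - 1
    df_resid = df_station * df_date
    # A perfectly additive table has zero residual and cannot be tested.
    ms_resid = ss_resid / df_resid
    return {
        "ss_station": ss_station,
        "ss_date": ss_date,
        "ss_resid": ss_resid,
        "ss_total": ss_total,
        "F_station": (ss_station / df_station) / ms_resid,
        "F_date": (ss_date / df_date) / ms_resid,
    }


# Made-up conductivity-like values: 3 stations (rows) x 4 dates (columns).
conductivity = [
    [420.0, 610.0, 700.0, 480.0],
    [430.0, 600.0, 720.0, 500.0],
    [460.0, 640.0, 735.0, 505.0],
]
result = two_way_anova(conductivity)
print(result["F_station"], result["F_date"])
```

In this toy table the date-to-date swings are much larger than the differences between stations, which mimics the seasonal behaviour reported for conductivity.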

Principal component analysis

The covariance matrix of the 22 analysed variables was calculated from data normalised as described in the Data treatment section and, therefore, coincides with the correlation matrix (Table 3). Because the three sampling stations were combined to calculate the correlation matrix, the correlation coefficients should be interpreted with caution as they are affected simultaneously by spatial and temporal variations. Nevertheless, some clear hydrochemical relationships can be readily inferred: high and positive correlations can be observed between bicarbonate, sulphate, chloride, calcium, magnesium, potassium, sodium, dissolved solids, conductivity and hardness (r = 0.572 to 0.977), which are responsible for water mineralization. Flow rate is negatively correlated to most variables, since an increase in flow rate causes dilution of contaminants. This anti-correlation is highly significant for "mineral" components (conductivity, hardness, dissolved solids, magnesium and sulphate). BOD and COD are strongly correlated with each other (r = 0.893) and also with ammonia, phosphate (closely related to contamination by organic matter) and potassium. As expected, dissolved oxygen is negatively correlated with temperature because the solubility of oxygen in water decreases with increasing temperature; BOD, COD and nitrogen and phosphorus compounds are also anti-correlated with dissolved oxygen, as organic matter is partially oxidized by oxygen, whilst nutrients are responsible for eutrophication of freshwater, thus causing a further increase in organic matter concentration and, hence, in oxygen demand. Iron, nitrate and pH showed no significant correlation with any other variables.

Table 3. Correlation matrix of the 22 physico-chemical parameters determined

         BOD     Ca     Cl    COD   COND     DS     Fe   FLOW   HARD   HCO3      K     Mg     Mn     Na    NH4    NO2    NO3   OXYG     pH    PO4    SO4   TEMP
BOD    1.000
Ca    −0.117  1.000
Cl     0.413  0.758  1.000
COD    0.893 −0.036  0.516  1.000
COND   0.260  0.887  0.916  0.321  1.000
DS     0.316  0.825  0.881  0.334  0.974  1.000
Fe     0.177 −0.270 −0.137  0.065 −0.154 −0.102  1.000
FLOW  −0.164 −0.497 −0.394 −0.108 −0.592 −0.571 −0.048  1.000
HARD   0.229  0.898  0.860  0.240  0.977  0.951 −0.151 −0.659  1.000
HCO3   0.270  0.648  0.712  0.347  0.774  0.762 −0.251 −0.484  0.770  1.000
K      0.679  0.442  0.748  0.649  0.701  0.713 −0.100 −0.356  0.656  0.644  1.000
Mg     0.552  0.579  0.772  0.484  0.849  0.868  0.016 −0.683  0.879  0.725  0.736  1.000
Mn     0.492  0.109  0.434  0.437  0.333  0.311  0.464 −0.431  0.346  0.285  0.423  0.521  1.000
Na     0.238  0.809  0.914  0.350  0.929  0.902 −0.118 −0.419  0.841  0.705  0.697  0.683  0.280  1.000
NH4    0.709  0.110  0.483  0.773  0.378  0.384  0.094 −0.170  0.291  0.485  0.663  0.419  0.359  0.468  1.000
NO2    0.324  0.062  0.190  0.258  0.195  0.233 −0.110 −0.198  0.222  0.381  0.381  0.341  0.329  0.118  0.327  1.000
NO3   −0.010 −0.021 −0.172 −0.114 −0.018  0.072  0.208  0.187 −0.019 −0.211 −0.047 −0.014 −0.314 −0.074  0.021 −0.109  1.000
OXYG  −0.531 −0.009 −0.375 −0.634 −0.282 −0.246 −0.016  0.389 −0.247 −0.435 −0.476 −0.444 −0.613 −0.286 −0.559 −0.555  0.453  1.000
pH    −0.541  0.402 −0.031 −0.544  0.112  0.030 −0.337 −0.132  0.159 −0.048 −0.112 −0.144 −0.292  0.076 −0.477 −0.365 −0.173  0.442  1.000
PO4    0.434  0.209  0.506  0.601  0.409  0.378  0.026 −0.395  0.342  0.532  0.503  0.406  0.590  0.451  0.613  0.453 −0.376 −0.847 −0.374  1.000
SO4    0.130  0.902  0.873  0.209  0.971  0.944 −0.097 −0.594  0.949  0.682  0.572  0.781  0.297  0.900  0.224  0.112  0.014 −0.182  0.160  0.338  1.000
TEMP   0.290 −0.080  0.122  0.278  0.092  0.041  0.022 −0.481  0.150  0.142  0.198  0.359  0.568 −0.025 −0.031  0.359 −0.501 −0.712 −0.070  0.463  0.074  1.000
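The statement that the covariance matrix of autoscaled data coincides with the correlation matrix can be checked numerically. The sketch below is an illustration only: the function name is hypothetical and the five (flow, conductivity, COD) rows are made-up numbers chosen so that conductivity falls as flow rises, not measurements from the study.

```python
from statistics import mean, stdev


def correlation_matrix(rows):
    """Covariance matrix of autoscaled columns, i.e. Pearson correlations."""
    n = len(rows)
    scaled_columns = []
    for column in zip(*rows):
        centre, spread = mean(column), stdev(column)
        scaled_columns.append([(x - centre) / spread for x in column])
    p = len(scaled_columns)
    # Sample covariance of two autoscaled columns: sum of products / (n - 1).
    return [
        [sum(a * b for a, b in zip(scaled_columns[i], scaled_columns[j])) / (n - 1)
         for j in range(p)]
        for i in range(p)
    ]


# Made-up rows of (flow, conductivity, COD), illustration only.
observations = [
    (15.0, 770.0, 4.0),
    (30.0, 650.0, 3.1),
    (45.0, 590.0, 3.5),
    (80.0, 480.0, 2.2),
    (129.0, 402.0, 2.9),
]
R = correlation_matrix(observations)
for row in R:
    print([round(value, 3) for value in row])
```

Every diagonal entry equals one, so the trace of the matrix, and therefore the sum of its eigenvalues, equals the number of variables, as stated in the Data treatment section.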

By applying Bartlett's sphericity test, a value of 1006.6 for the Bartlett chi-square statistic was found (critical value is 234 for 231 degrees of freedom at the 95% confidence level), confirming that the variables are not orthogonal but correlated, which allows the data variability to be explained with a smaller number of variables (called principal components).

Principal components were extracted by the R-mode principal component method, which mathematically transforms the original data with no assumptions about the form of the covariance matrix. This analysis allows a clustering of variables on the basis of mutual correlations, and a grouping of objects based on their similarities. For this analysis, the covariance matrix was diagonalised and the characteristic roots (eigenvalues) were obtained. The transformed variables or principal components (PCs) were obtained as weighted linear combinations of the original variables.
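Bartlett's sphericity statistic is usually computed from the determinant of the correlation matrix R as chi-square = −(n − 1 − (2p + 5)/6) ln|R|, with p(p − 1)/2 degrees of freedom, where n is the number of samples and p the number of variables. The sketch below assumes this standard form (the text does not spell out the formula used) and applies it, purely as an illustration, to the 3 × 3 block of Table 3 for conductivity, hardness and flow with n = 30; it does not attempt to reproduce the reported value of 1006.6, which refers to all 22 variables. Function names are hypothetical.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(x) for x in row] for row in matrix]
    n = len(a)
    det = 1.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[pivot][k] == 0.0:
            return 0.0
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            det = -det  # a row swap flips the sign
        det *= a[k][k]
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
    return det


def bartlett_sphericity(corr, n_samples):
    """Return (chi-square statistic, degrees of freedom)."""
    p = len(corr)
    dof = p * (p - 1) // 2
    # math.log raises ValueError if the matrix is singular (determinant 0).
    chi2 = -(n_samples - 1 - (2 * p + 5) / 6) * math.log(determinant(corr))
    return chi2, dof


# Sub-block of Table 3: COND, HARD, FLOW (illustration on three variables).
block = [
    [1.000, 0.977, -0.592],
    [0.977, 1.000, -0.659],
    [-0.592, -0.659, 1.000],
]
chi2, dof = bartlett_sphericity(block, n_samples=30)
print(chi2, dof)
```

For the full data set, p = 22 gives 22 × 21 / 2 = 231 degrees of freedom, which matches the figure quoted above.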

The Scree plot (see Fig. 2) was used to identify the number of PCs to be retained in order to comprehend the underlying data structure (Jackson, 1991). The Scree plot shows a pronounced change of slope after the third eigenvalue; Cattell and Jaspers (1967) suggested using all the PCs up to and including the first one after the break, so four PCs were retained, which have eigenvalues greater than unity and explain 81.5% of the variance or information contained in the original data set. Projections of the original variables on the subspace of the PCs are called loadings and coincide with the correlation coefficients between PCs and variables. Loadings of the four retained PCs are presented in Table 4. PC1 explains 46.1% of the variance and receives high contributions from most variables: chloride, bicarbonate, sulphate, conductivity, dissolved solids, hardness, calcium, potassium, magnesium, sodium and, to a lesser extent, BOD, COD, manganese, ammonia and phosphate. These variables were demonstrated to be correlated (see correlation matrix, Table 3). Flow rate and dissolved oxygen load negatively on PC1. PC2 explains 19.0% of the variance and includes calcium, dissolved oxygen and pH (positive loadings), and BOD, COD, nitrite, phosphate and manganese (negative loadings). PC3 (9.8% of the variance) has a positive contribution from nitrate and a negative one from temperature. Finally, PC4 explains 6.6% of the total variability of the original data and is dominated by iron.
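The percentages quoted above follow directly from the eigenvalues. The short sketch below recomputes them from the eigenvalues reported in Table 4, under the assumption that the analysis was run on autoscaled variables, so that the total variance equals the number of variables (22); this assumption is consistent with 10.148/22 being about 46.1%. The function name is hypothetical.

```python
# Eigenvalues of the four retained PCs, transcribed from Table 4.
eigenvalues = [10.148, 4.181, 2.154, 1.459]
# Assumption: autoscaled data, so total variance = number of variables.
n_variables = 22


def explained_variance(eigs, total):
    """Return per-component and cumulative percentages of explained variance."""
    percents = [100.0 * e / total for e in eigs]
    cumulative = []
    running = 0.0
    for pct in percents:
        running += pct
        cumulative.append(running)
    return percents, cumulative


percents, cumulative = explained_variance(eigenvalues, n_variables)
for i, (pct, cum) in enumerate(zip(percents, cumulative), start=1):
    print(f"PC{i}: eigenvalue > 1: {eigenvalues[i - 1] > 1.0}, "
          f"{pct:.1f}% of variance, {cum:.1f}% cumulative")
```

The results agree with Table 4 to within rounding of the last digit.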

As can be seen in Table 4, most variables load highly on PC1, which hinders its hydrochemical interpretation. In the same way, variables related to anthropogenic pollution such as BOD, COD and phosphorus or nitrogen compounds load highly on both PC1 and PC2, and therefore PC2 cannot be explained only in terms of organic pollution. A rotation of principal components can achieve a simpler and more meaningful representation of the underlying factors by decreasing the contribution to PCs of variables with minor significance and increasing that of the more significant ones. Rotation produces a new set of factors, each one involving primarily a subset of the original variables with as little overlap as possible, so that the original variables are divided into groups somewhat independent of each other (Sharaf et al., 1986; Massart et al., 1988). Although rotation does not affect the goodness of fit of the principal component solution, the variance explained by each factor is modified.

Fig. 2. Scree plot of the characteristic roots (eigenvalues) of principal components and varifactors (plotted with distinct symbols).

Table 4. Loadings of 22 experimental variables on four significant principal components for 30 river water samples

Variable                   PC1      PC2      PC3      PC4
BOD                      0.523   -0.635    0.353    0.022
Ca                       0.702    0.656   -0.073   -0.027
Cl                       0.914    0.164    0.101   -0.073
COD                      0.574   -0.618    0.322   -0.154
COND                     0.925    0.365    0.046    0.036
DS                       0.909    0.335    0.139    0.076
Fe                      -0.074   -0.328    0.195    0.826
FLOW                    -0.628   -0.095    0.424   -0.347
HARD                     0.897    0.394   -0.037    0.101
HCO3                     0.821    0.116   -0.050   -0.242
K                        0.828   -0.139    0.215   -0.147
Mg                       0.901    0.020    0.013    0.216
Mn                       0.547   -0.479   -0.253    0.459
Na                       0.864    0.317    0.140   -0.067
NH4                      0.590   -0.468    0.446   -0.199
NO2                      0.388   -0.400   -0.152   -0.218
NO3                     -0.160    0.223    0.710    0.303
OXYG                    -0.576    0.669    0.329    0.136
pH                      -0.120    0.712   -0.401   -0.082
PO4                      0.650   -0.495   -0.205   -0.151
SO4                      0.851    0.458    0.003    0.140
TEMP                     0.306   -0.469   -0.708    0.143
Eigenvalue              10.148    4.181    2.154    1.459
% Variance explained      46.1     19.0      9.8      6.6
% Cum. variance           46.1     65.1     74.9     81.5

A varimax rotation of the principal components led to 22 rotated PCs (henceforth called varifactors) whose eigenvalues are plotted in Fig. 2. The Scree plot shows a pronounced change of slope after the third eigenvalue; therefore four varifactors explaining 67.8% of the variance were retained (Cattell and Jaspers, 1967). Eigenvalues and loadings of these varifactors are displayed in Table 5. It must be noted that rotation has resulted in an increase in the number of factors necessary to explain the same amount of variance of the original data set, so that the first two varifactors used for graphical representation explain a smaller amount of variance. However, smaller groups of variables can now be associated with individual rotated factors with a clearer hydrochemical meaning.
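To illustrate what a varimax rotation does to a loading matrix, the following is a minimal sketch based on successive planar rotations of factor pairs (Kaiser's classical procedure). It is an assumption-laden illustration rather than the authors' implementation: the text does not state which software or varimax variant was used, so raw varimax (without Kaiser normalisation) is assumed, and the function names are hypothetical. Note also that this sketch rotates only the columns it is given, which preserves their total explained variance, whereas the study rotated all 22 PCs before retaining four varifactors.

```python
import math


def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    p = len(loadings)
    total = 0.0
    for j in range(len(loadings[0])):
        sq = [row[j] ** 2 for row in loadings]
        total += sum(s * s for s in sq) / p - (sum(sq) / p) ** 2
    return total


def varimax(loadings, max_sweeps=50, tol=1e-10):
    """Rotate a loading matrix (list of rows) to maximise the raw varimax criterion."""
    L = [row[:] for row in loadings]
    p, k = len(L), len(L[0])
    for _ in range(max_sweeps):
        largest_angle = 0.0
        for i in range(k - 1):
            for j in range(i + 1, k):
                u = [row[i] ** 2 - row[j] ** 2 for row in L]
                v = [2.0 * row[i] * row[j] for row in L]
                A, B = sum(u), sum(v)
                C = sum(a * a - b * b for a, b in zip(u, v))
                D = 2.0 * sum(a * b for a, b in zip(u, v))
                # Angle that maximises the criterion for this pair of factors.
                phi = 0.25 * math.atan2(D - 2.0 * A * B / p,
                                        C - (A * A - B * B) / p)
                largest_angle = max(largest_angle, abs(phi))
                c, s = math.cos(phi), math.sin(phi)
                for row in L:
                    x, y = row[i], row[j]
                    row[i] = x * c + y * s
                    row[j] = -x * s + y * c
        if largest_angle < tol:  # no pair needed a noticeable rotation
            break
    return L


# Toy example: a simple structure deliberately mixed by a 30 degree rotation.
theta = math.radians(30.0)
simple = [[0.9, 0.0], [0.8, 0.0], [0.0, 0.7], [0.0, 0.6]]
mixed = [[x * math.cos(theta) - y * math.sin(theta),
          x * math.sin(theta) + y * math.cos(theta)] for x, y in simple]
rotated = varimax(mixed)
for row in rotated:
    print([round(value, 3) for value in row])
```

In the toy example every variable loads on both mixed factors, and the rotation recovers a structure in which each variable loads on a single factor (up to sign and column order), which is the kind of simplification that makes varifactors easier to interpret than unrotated PCs.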

Varifactor 1 explains 37.2% of the total variance and has high loadings of calcium, chloride, conductivity, dissolved solids, hardness, bicarbonate, magnesium, sodium and sulphate, and can thus be interpreted as a mineral component of the river water. This clustering of variables points to a common origin for these minerals, likely from dissolution of limestone, marl and gypsum soils. Flow rate contributes negatively to this factor, which can be explained considering that dilution of dissolved minerals increases with flow. Varifactor 2 contains 16.7% of the variance and includes BOD, COD and ammonia, whereas pH and oxygen have a negative contribution to this varifactor. This varifactor can be explained taking into account that high levels of dissolved organic matter consume large amounts of oxygen; organic matter in urban wastewater consists mainly of carbohydrates, proteins and lipids which, as the amount of available dissolved oxygen decreases, undergo anaerobic fermentation processes leading to ammonia and organic acids. Hydrolysis of these acidic materials causes a decrease of water pH values. Potassium contributes to a similar extent to varifactors 1 and 2. Varifactor 3 (8.0% of variance) has a high and positive loading of temperature and a negative loading of dissolved oxygen, since the solubility of gases in water decreases with increasing temperature. Flow rate would be expected to have a high and negative loading on varifactor 3, as high temperatures correspond to dry and hot seasons like summer, when flow rate is lower; however, its loading is negative but small (-0.323) because during 1990 a persistent drought caused low flow rates even in the winter season. Finally, varifactor 4 (5.9% of variance) has contributions from iron and manganese, which are hydrochemically related.

Table 5. Loadings of 22 experimental variables on the first four rotated PCs for 30 river water samples

Variable               Varifactor 1   Varifactor 2   Varifactor 3   Varifactor 4
BOD                        0.116          0.934          0.163          0.111
Ca                         0.920         -0.179         -0.093         -0.119
Cl                         0.893          0.326          0.048         -0.034
COD                        0.180          0.912          0.159          0.011
COND                       0.973          0.148          0.049         -0.038
DS                         0.950          0.183         -0.001          0.001
Fe                        -0.131          0.072          0.012          0.970
FLOW                      -0.496         -0.005         -0.323         -0.094
HARD                       0.952          0.089          0.106         -0.033
HCO3                       0.697          0.184          0.024         -0.139
K                          0.584          0.614          0.089         -0.043
Mg                         0.766          0.359          0.289          0.071
Mn                         0.248          0.290          0.387          0.472
Na                         0.918          0.180         -0.070          0.003
NH4                        0.225          0.761         -0.190          0.065
NO2                        0.105          0.170          0.182         -0.061
NO3                        0.014         -0.003         -0.260          0.104
OXYG                      -0.132         -0.418         -0.540         -0.016
pH                         0.169         -0.434         -0.018         -0.201
PO4                        0.276          0.350          0.244          0.045
SO4                        0.981          0.008          0.059          0.022
TEMP                      -0.003          0.114          0.919          0.031
Eigenvalue                 8.175          3.677          1.763          1.292
% Variance explained        37.2           16.7            8.0            5.9
% Cum. variance             37.2           53.9           61.9           67.8

Fig. 3. Scores of river water samples on the bidimensional plane defined by the first two varifactors. Space reduction from 22 to 2 dimensions (53.9% of the total variance). Samples collected at Cabezón de Pisuerga, Puente Mayor and Simancas (plotted with distinct symbols) in January (E), April (A), July (J) and October (O) from 1990 to 1992.

Figure 3 displays a plot of sample scores on the bidimensional plane defined by varifactor 1 (mineral contents) and varifactor 2 (anthropogenic contamination, namely organic matter). High and positive scores on varifactors 1 or 2 indicate high mineral contents or high organic pollution, respectively, whereas samples with high and negative scores on varifactors 1 or 2 correspond to higher flow rate or dissolved oxygen, thus indicating better water quality. From Fig. 3 it can be concluded that sample SJ90 (collected in Simancas in July 1990) shows the worst quality, with high levels of both minerals and organics. Samples collected in January and April 1991 are projected onto negative varifactor 1 and therefore show the lowest mineral contents. As pointed out above, the winter of 1990 was extremely dry, and that fact is reflected by the high scores on varifactor 2 of samples collected in April and July 1990.

Box plots of varifactors 1, 2 and 3 at the three sampling stations are shown in Fig. 4. Some important conclusions are derived from these plots: varifactor 1 (mineral contents) and varifactor 3 (temperature) show a large spread around the median, thus pointing to an important contribution of sampling time to the variance of these varifactors. On the other hand, varifactor 2 (anthropogenic pollution) exhibits a small spread, but the median increases slightly from Cabezón to Simancas, indicating that sampling station is the most important source of variation in explaining the variance of this varifactor, which is scarcely affected by sampling time.

Two-way ANOVA on the three most relevant varifactors was carried out and results of the F-test are displayed in Table 6. Normal probability plots of varifactors applied to individual sampling stations showed that varifactors were normally distributed, except varifactor 2 at Simancas. However, the F-test as applied in ANOVA is not too sensitive to departures from normality (Miller and Miller, 1984) and was therefore used to interpret the sources of variation.

Fig. 4. Box plots for the three most significant varifactors in Cabezón (C), Puente Mayor (P) and Simancas (S).

Table 6. Two-way ANOVA and F-test of the three most relevant rotated PCs

Source of variation   Sum of squares   Degrees of freedom   Variance (mean square)        F   Pooled sum of squares   % Contribution
Varifactor 1
  Sampling time               18.399                    9                    2.044    3.521                  13.629             47.0
  Sampling station             0.150                    2                    0.075    0.129
  Residual                    10.451                   18                    0.581                           15.371             53.0
  Total                       29.000                   29                                                    29.000            100.0
Varifactor 2
  Sampling time               11.809                    9                    1.312    1.941
  Sampling station             5.026                    2                    2.513    3.718                   3.250             11.2
  Residual                    12.165                   18                    0.676                           25.750             88.8
  Total                       29.000                   29                                                    29.000            100.0
Varifactor 3
  Sampling time               25.306                    9                    2.812   15.428                  23.643             81.5
  Sampling station             0.414                    2                    0.207    1.135
  Residual                     3.281                   18                    0.182                            5.357             18.5
  Total                       29.000                   29                                                    29.000            100.0

F calculated as variance of the effect/variance of the residual. Fcrit is 2.456 for 9 and 18 degrees of freedom and 3.555 for 2 and 18 d.f. (p = 0.05).

Sources of variation that can affect sample projections on varifactors are sampling time (seasonal effect) and sampling station (geographical or polluting effect). A comparison of the estimates of variance by means of the Fisher ratio (F) indicates that, at the 95% confidence level, there is a significant contribution to the total variance of varifactor 1 due to variation between sampling times (F > Fcrit(9, 18, p = 0.05)), but the variation between sampling stations does not contribute significantly (F < Fcrit(2, 18, p = 0.05)). Since varifactor 1 can be interpreted as the inorganic (mineral) contents of the water, which increase with decreasing flow rate, it can be concluded that levels of minerals in the river water investigated are seasonal and climate dependent, and are unaffected by sampling location, thus pointing to a natural (non-anthropogenic) origin for this polluting form. For varifactor 2 (organic matter, nitrogen and phosphorus), the only significant contribution to the variance was that due to differences between sampling stations. This indicates that organic pollution of river water originates from anthropogenic sources, mainly as municipal wastewater which is discharged into the river between Puente Mayor and Simancas. Sampling stations were demonstrated not to contribute to the variance of varifactor 3 (temperature), whereas highly significant differences were found between sampling times, thus showing that only climate and seasonality are responsible for variations in water temperature, and that there is no thermal pollution in the river section investigated.
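The F-test decisions described above can be re-derived from the sums of squares and degrees of freedom in Table 6. The sketch below is only an illustration: the dictionary layout and function names are hypothetical, and the critical values are taken from the footnote of Table 6 rather than recomputed.

```python
# Sums of squares transcribed from Table 6.
TABLE6 = {
    "varifactor 1": {"time": 18.399, "station": 0.150, "residual": 10.451},
    "varifactor 2": {"time": 11.809, "station": 5.026, "residual": 12.165},
    "varifactor 3": {"time": 25.306, "station": 0.414, "residual": 3.281},
}
DOF = {"time": 9, "station": 2, "residual": 18}
# Critical F values at p = 0.05, as quoted in the footnote of Table 6.
F_CRIT = {"time": 2.456, "station": 3.555}


def f_ratios(sums_of_squares):
    """F = variance of the effect / variance of the residual."""
    residual_variance = sums_of_squares["residual"] / DOF["residual"]
    return {
        effect: (sums_of_squares[effect] / DOF[effect]) / residual_variance
        for effect in ("time", "station")
    }


def significant_effects(sums_of_squares):
    """True where the effect's F ratio exceeds its critical value."""
    ratios = f_ratios(sums_of_squares)
    return {effect: ratios[effect] > F_CRIT[effect] for effect in ratios}


for name, ss in TABLE6.items():
    ratios = f_ratios(ss)
    flags = significant_effects(ss)
    summary = ", ".join(
        f"{effect}: F = {ratios[effect]:.3f} "
        f"({'significant' if flags[effect] else 'not significant'})"
        for effect in ratios
    )
    print(f"{name}: {summary}")
```

Sampling time comes out as significant for varifactors 1 and 3 and sampling station for varifactor 2, in line with the interpretation given in the text.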

Those sources of variation that were demonstrated not to contribute significantly to the variance of varifactors (F < Fcrit) were combined with the residual variance (Ross, 1988), and from the recalculated sum of squares the contribution of the effect to the variability of the varifactor was estimated as

\[ \%\,\mathrm{Contribution} = \frac{SS'}{SS_T} \times 100, \]

where SS' is the pooled sum of squares and SS_T the total sum of squares. It can be seen in Table 6 that seasonality contributes 47.0% and 81.5% to the variability of varifactors 1 (mineral composition) and 3 (temperature), respectively, evidencing the strong effect that climate has on the variables explained by these varifactors. Besides, sampling location has a negligible contribution to varifactors 1 and 3, but contributes 11.2% to the variability of varifactor 2 (anthropogenic pollution); this contribution is smaller than that of the residual (88.8%), thus indicating the possible existence of an interaction between both sources of variation: although the effect of sampling time (season) is not significant, it cannot be completely discarded (F < Fcrit but > 1) since climate also makes a small contribution to varifactor 2 due to seasonal variations of flow rate, which cause dilution of pollutants of anthropogenic origin.
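The text does not spell out how the pooled sum of squares SS' is obtained. The sketch below assumes the "pure sum of squares" convention of Taguchi-style ANOVA (Ross, 1988): the non-significant effect is merged into the residual, and SS' = SS - d.f. x V_e, where V_e is the pooled error variance. This is an assumption, but it reproduces the values of Table 6 to within rounding. The function name is hypothetical.

```python
def pooled_contribution(ss_effect, df_effect, ss_pooled_error, df_pooled_error, ss_total):
    """Return (SS', % contribution) for a significant effect.

    ss_pooled_error and df_pooled_error are the residual values after the
    non-significant effect has been merged into the residual.
    """
    error_variance = ss_pooled_error / df_pooled_error
    ss_prime = ss_effect - df_effect * error_variance
    return ss_prime, 100.0 * ss_prime / ss_total


# Varifactor 1: sampling time is significant, sampling station is pooled
# into the residual (values from Table 6).
ss1, pct1 = pooled_contribution(18.399, 9, 10.451 + 0.150, 18 + 2, 29.000)
# Varifactor 2: sampling station is significant, sampling time is pooled.
ss2, pct2 = pooled_contribution(5.026, 2, 12.165 + 11.809, 18 + 9, 29.000)
# Varifactor 3: sampling time is significant, sampling station is pooled.
ss3, pct3 = pooled_contribution(25.306, 9, 3.281 + 0.414, 18 + 2, 29.000)

for label, ss, pct in [("Varifactor 1, time", ss1, pct1),
                       ("Varifactor 2, station", ss2, pct2),
                       ("Varifactor 3, time", ss3, pct3)]:
    print(f"{label}: SS' = {ss:.3f}, contribution = {pct:.1f}%")
```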

Spatio-temporal variations of water quality can be readily visualised in Fig. 5, where varifactors 1, 2 and 3 have been plotted vs sampling times for the stations investigated: Cabezón, Puente Mayor and Simancas. The average flow rate for the three stations has been simultaneously plotted to show the relationship between water quality and flow rate. Again, the inverse relationship between flow rate and rotated factors 1 and 3 (mineral components in water and temperature, respectively) can be observed, whilst for varifactor 2 (organic pollution and nutrients) this negative correlation is less marked. The interaction between sampling location and sampling time is illustrated in Fig. 5: maximum variability of varifactor 2 along the river section sampled occurs in dry seasons (July and October), when river flow rate decreases. This can be interpreted taking into account that municipal wastewater discharges into the Pisuerga river are the main and nearly constant source of organic matter, so that an increase in river flow rate causes dilution of pollutants and hence differences between sampling stations become less evident. Figure 5 also shows that sample scores on varifactor 2 are always higher for those samples collected in Simancas, whilst Cabezón and Puente Mayor scores are similar, indicating that the main discharges of organic matter and nutrients are located between Puente Mayor and Simancas, which confirms municipal wastewater as the principal source of organic pollutants for the Pisuerga river. These conclusions are in good agreement with the spatio-temporal profile exhibited by the complexing capacity of the Pisuerga river water (Pardo et al., 1994). Furthermore, differences in sample scores between Simancas and the other two sampling stations were higher in dry seasons (July and October), thus confirming the spatial-temporal interaction on varifactor 2.

Fig. 5. Spatio-temporal fluctuations of varifactors 1, 2 and 3 and their relationship with river flow rate (solid line). Sampling stations: Cabezón de Pisuerga, Puente Mayor and Simancas (plotted with distinct symbols).

Temporal variation of some independent variables associated with contamination of river water is depicted in Fig. 6. It can be observed that conductivity behaves in the same way as varifactor 1 (see Fig. 5 for comparison), since this variable is closely related to the mineral composition of river water, and therefore to varifactor 1. COD and ammonia are associated with organic pollution and therefore their profiles are similar to that of varifactor 2. As can be seen in Fig. 6, the highest variation of these contaminants occurs in Simancas, as important amounts of municipal wastewater are discharged upstream of this station. Dissolved oxygen also shows a periodic profile related to seasonality, with strong decreases at Simancas caused by the high levels of oxygen-consuming organic matter.

Fig. 6. Temporal variations of some original variables associated with river water pollution and their relation with flow rate (solid line). Sampling stations: Cabezón de Pisuerga, Puente Mayor and Simancas (plotted with distinct symbols).

Cluster analysis

Cluster analysis allows the grouping of river water samples on the basis of their similarities in chemical composition. Unlike PCA, which normally uses only two or three PCs for display purposes, cluster analysis uses all the variance or information contained in the original data set. Hierarchical agglomerative clustering by Ward's method was selected for sample classification because it has a small space-distorting effect, uses more information on cluster contents than other methods, and has been proved to be an extremely powerful grouping mechanism (Willet, 1987); besides, Ward's method yielded the most meaningful clusters. The method was applied to normalised data using squared Euclidean distances as a measure of similarity (Massart and Kaufman, 1983). A similar classification pattern was obtained by the average linkage method (between groups).

The dendrogram of samples obtained by Ward's method is shown in Fig. 7. Two well differentiated clusters can be seen, each formed by two subgroups, with river water quality decreasing from top to bottom. The first group from the top comprises samples collected in January and April 1991, and one sample collected in Cabezón in April 1992; in the PCA classification these samples scored high and negative on varifactor 1 and close to 0 on varifactor 2 (see Fig. 3), indicating the lowest levels of both minerals and organic matter, as these samples were collected in January and April 1991, when the river flow rate is at its maximum due to snow melting at the river sources. This cluster is linked at a rescaled distance of about 7 to another small but tight group that includes samples taken in July 1991 and July 1992 (except that from Simancas) and the sample PO91. In the PCA these samples were also grouped at intermediate and negative values on the varifactor 1 axis. The second main cluster is formed by two subgroups that are linked at a rescaled distance of 10: the first of them includes very similar samples collected in January and April 1992, and samples CO90 and CO91, and corresponds to samples scoring high and positive on varifactor 1 and negative on varifactor 2 in the PCA (see Fig. 3), thus pointing to their high levels of minerals and low levels of anthropogenic pollutants. The second subgroup includes samples collected in 1990 (April, July and October) and samples collected from Simancas in July and October 1991. These samples correspond to dry seasons and to the most contaminated station (Simancas) and show the worst water quality in both minerals and organic matter.

Fig. 7. Dendrogram based on agglomerative hierarchical clustering (Ward's method) for 30 river water samples collected at Cabezón de Pisuerga (C), Puente Mayor (P) and Simancas (S) in January (E), April (A), July (J) and October (O) from 1990 to 1992.
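The clustering procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the software used in the study: "normalised data" is taken to mean autoscaling (z-scores), cluster distances are updated with the Lance-Williams recurrence for Ward's method on squared Euclidean distances, and the merge distances are left on their original scale rather than rescaled as in Fig. 7. The function names and the toy data are hypothetical.

```python
import statistics


def autoscale(rows):
    """Centre each column on its mean and divide by its standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stdevs = [statistics.stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, stdevs)] for row in rows]


def ward_linkage(rows):
    """Agglomerate samples; return [(members_a, members_b, distance), ...]."""
    clusters = {i: [i] for i in range(len(rows))}
    dist = {}
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            dist[(i, j)] = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
    merges = []
    next_id = len(rows)
    while len(clusters) > 1:
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        n_i, n_j = len(clusters[i]), len(clusters[j])
        new_dist = {}
        for k in clusters:
            if k in (i, j):
                continue
            n_k = len(clusters[k])
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Lance-Williams update for Ward's method.
            new_dist[k] = ((n_i + n_k) * d_ik + (n_j + n_k) * d_jk
                           - n_k * d_ij) / (n_i + n_j + n_k)
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d_ij))
        members = clusters.pop(i) + clusters.pop(j)
        dist = {pair: d for pair, d in dist.items()
                if i not in pair and j not in pair}
        clusters[next_id] = members
        for k, d in new_dist.items():
            dist[(k, next_id)] = d  # next_id is larger than any existing id
        next_id += 1
    return merges


# Toy data (two variables, five samples), not the river measurements.
toy_samples = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.1], [5.1, 5.0]]
toy_merges = ward_linkage(autoscale(toy_samples))
for group_a, group_b, distance in toy_merges:
    print(f"merge {group_a} + {group_b} at distance {distance:.4f}")
```

Reading the merge history from the last entry backwards gives the top-down splits of the dendrogram: in the toy data the final merge joins the two obvious groups of samples.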

CONCLUSIONS

Environmental analytical chemistry generates multidimensional data that require multivariate statistics to analyse and interpret the underlying information. Water quality data of a river have been analysed by unsupervised pattern recognition (hierarchical cluster analysis) and display methods (principal component analysis) to extract correlations and similarities between variables and to classify river water samples into groups of similar quality.

PCA has found a reduced number of "latent" variables (principal components) that explain most of the variance of the experimental data set. A varimax rotation of these PCs led to a reduced number of varifactors, each of them related to a small group of experimental variables with a hydrochemical meaning: mineral contents for varifactor 1, anthropogenic pollutants for varifactor 2 and water temperature for varifactor 3.

PCA in combination with ANOVA has allowed the identification and assessment of spatial (pollution of anthropogenic origin) and temporal (seasonal and climatic) sources of variation affecting the quality and hydrochemistry of river water. Man-made pollution was demonstrated to originate from municipal wastewater discharged into the river between the sampling stations of Puente Mayor and Simancas; temporal effects were associated with seasonal variations of river flow rate, which cause dilution of pollutants and hence variations in water quality. The application of PCA and cluster analysis has achieved a meaningful classification of hydrochemical variables and of river water samples based on seasonal and spatial criteria. Both multivariate techniques led to very similar classification patterns.

Acknowledgement: The authors wish to thank the Confederación Hidrográfica del Duero (Valladolid, Spain) for providing data of river flow rates.

REFERENCES

Andrade J. M., Prada D., Muniategui S., González E. and Alonso E. (1992) Multivariate analysis of environmental data for two hydrographic basins. Anal. Lett. 25, 379–399.

AOAC (1990) Official Methods of Analysis, Vol. 1, 15th edn., Association of Official Analytical Chemists, Arlington, VA, U.S.A., p. 312.

APHA-AWWA-WPCF (1985) Standard Methods for the Examination of Water and Wastewater, 16th edn., American Public Health Association, American Water Works Association, Water Pollution Control Federation, U.S.A.

Aruga R., Negro G. and Ostacoli G. (1993) Multivariate data analysis applied to the investigation of river pollution. Fresenius J. Anal. Chem. 346, 968–975.

Bartels J. H. M., Janse T. A. H. M. and Pijpers F. W. (1985) Classification of the quality of surface waters by means of pattern recognition. Anal. Chim. Acta 177, 35–45.

Battegazzore M. and Renoldi M. (1995) Integrated chemical and biological evaluation of the quality of the river Lambro (Italy). Wat. Air Soil Poll. 83, 375–390.

Brown S. D., Skogerboe R. K. and Kowalski B. R. (1980) Pattern recognition assessment of water quality data: coal strip mine drainage. Chemosphere 9, 265–276.

Brown S. D., Blank T. B., Sum S. T. and Weyer L. G. (1994) Chemometrics. Anal. Chem. 66, 315R–359R.

Brown S. D., Sum S. T. and Despagne F. (1996) Chemometrics. Anal. Chem. 68, 21R–61R.

Cattell R. B. and Jaspers J. (1967) A general plasmode (No. 30-10-5-2) for factor analytic exercises and research. Mult. Behav. Res. Monogr. 67, 1–212.

Dixon W. and Chiswell B. (1996) Review of aquatic monitoring program design. Wat. Res. 30, 1935–1948.

Elosegui A. and Pozo J. (1994) Spatial vs temporal variability in the physical and chemical characteristics of the Aguera stream (Northern Spain). Acta Ecologica - Int. J. Ecol. 15, 543–559.

Grimalt J. O., Olive J. and Gómez-Belinchón J. I. (1990) Assessment of organic source contributions in coastal waters by principal component and factor analysis of the dissolved and particulate hydrocarbon and fatty acid contents. Int. J. Environ. Anal. Chem. 38, 305–320.

Jackson J. E. (1991) A User's Guide to Principal Components. Wiley, New York.

Librando V. (1991) Chemometric evaluation of surface water quality at regional level. Fresenius J. Anal. Chem. 339, 613–619.

Massart D. L. and Kaufman L. (1983) The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. Wiley, New York.

Massart D. L., Vandeginste B. G. M., Deming S. N., Michotte Y. and Kaufman L. (1988) Chemometrics: A Textbook. Elsevier, Amsterdam.

Meglen R. R. (1992) Examining large databases: a chemometric approach using principal component analysis. Mar. Chem. 39, 217–237.

Mellinger M. (1987) Multivariate data analysis: its methods. Chemometr. Intell. Lab. Systems 2, 29–36.

Miller J. C. and Miller J. N. (1984) Statistics for Analytical Chemistry. Ellis Horwood Series in Analytical Chemistry, Wiley, New York.

Pardo R., Barrado E., Vega M., Deban L. and Tascón M. L. (1994) Voltammetric complexation capacity of waters from the Pisuerga river. Wat. Res. 28, 2139–2146.

Ross P. J. (1988) Taguchi Techniques for Quality Engineering. McGraw-Hill, New York.

Scheffé H. (1959) The Analysis of Variance. Wiley, New York.

Sharaf M. A., Illman D. L. and Kowalski B. R. (1986) Chemometrics. Wiley, New York.

Stroomberg G. J., Freriks I. L., Smedes F. and Cofino W. P. (1995) In Quality Assurance in Environmental Monitoring, ed. P. Quevauviller. VCH, Weinheim.

Voutsa D., Zachariadis G., Samara C. and Kouimtzis T. (1995) Evaluation of chemical parameters in Aliakmon river in Northern Greece. 2. Dissolved and particulate heavy metals. J. Environ. Sci. Hlth. Part A: Environ. Sci. Engng 30, 1–13.

Ward A. D. and Elliot W. J. (1995) In Environmental Hydrology, ed. A. D. Ward and W. J. Elliot, pp. 1. CRC Press, Boca Raton.

Wenning R. J. and Erickson G. A. (1994) Interpretation and analysis of complex environmental data using chemometric methods. Trends Anal. Chem. 13, 446–457.

Willet P. (1987) Similarity and Clustering in Chemical Information Systems. Research Studies Press, Wiley, New York.
