Fast regression methods in a Lanczos (or PLS-1) basis. Theory and applications




Chemometrics and Intelligent Laboratory Systems 51 (2000) 145–161
www.elsevier.com/locate/chemometrics

Wen Wu a,1, Rolf Manne b,*

a ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussel, Belgium
b Department of Chemistry, University of Bergen, Allegt 41, NO-5007 Bergen, Norway

Received 4 October 1999; received in revised form 3 February 2000; accepted 15 February 2000

* Corresponding author. Tel.: +47-55-58-3444; fax: +47-55-58-9490. E-mail address: [email protected] (R. Manne).
1 Present address: Glaxo SmithKline Pharmaceuticals, The Frythe, Welwyn, Hertfordshire, AL6 9AR, UK.

Abstract

In order to improve the calibration speed for very large data sets, novel algorithms for principal component regression (PCR) and partial-least-squares (PLS) regression are presented. They use the Lanczos (or PLS-1) transformation to reduce the data matrix X to a small bidiagonal matrix (R), after which the small tridiagonal matrix (R'R) is diagonalized and inverted. The complexity of the PCR model may be optimized by cross-validation (PCRL) but also using simpler and faster recipes based upon round-off monitoring and model fit (PCRF). A similar fast PLS procedure (PLSF) is also presented. Calculations are made for five near infrared spectroscopy (NIR) data sets and compared with PCR with feature selection (PCRS) based on correlation and with de Jong's simple partial least squares (SIMPLS). The Lanczos-based methods have comparable prediction performance and similar model complexity to PCRS and SIMPLS but are considerably faster. From a detailed comparison of the methods, some insight is gained into the performance of the PLS method. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: PCR; PLS; Lanczos; Fast algorithms; Fast optimization; Calibration; NIR

1. Introduction

In industrial applications of chemometrics, data matrices are often very large, e.g., in 3D-QSAR and in data mining. Near infrared spectroscopy (NIR) data sets may have hundreds or thousands of variables, and sometimes hundreds of samples. For such large data sets, calculational time is a factor which cannot be neglected. In this paper, we are concerned with the efficient implementation of principal component regression (PCR) and partial least squares (PLS) regression for large data sets. Both techniques are widely applied in chemometrics [1–4].

The present work was motivated by a desire to find a more efficient PCR algorithm utilizing Lanczos' method for solving eigenvalue problems [5]. Such an algorithm will be presented, but when comparing its performance with a fast PLS algorithm we found the latter to be still more effective. From the comparison of the two algorithms, we were led to results concerning the question of why PLS outperforms PCR in speed and parsimony of model.

Both PCR and PLS start by generating a set of latent variables describing the data. Then follows a regression calculation in the basis of these latent variables. In common chemometric implementations of PCR, the non-linear iterative partial least squares (NIPALS) algorithm [6] is used for the generation of latent variables. The PLS algorithm might be seen as a simplification of the NIPALS algorithm which uses information of the dependent or y variable in the generation of the latent variables [4,7].

In PCR, there are two ways to select latent variables for the construction of the model. The standard top-down approach selects principal components after their ability to explain the variance of the data matrix X. This choice, however, is not optimal if one wants to use as few principal components as possible in the regression. The alternative is therefore to select principal components using some feature relating to the dependent or y variable. Such an approach is thus likely to have properties similar to the PLS algorithm. For PCR, criteria like the correlation [8,9] or the prediction coefficient [10] have been used for the selection of latent variables. Verdú-Andrés and Massart [11] found that these selection criteria give results which are equal to or better than those obtained by top-down PCR. Centner et al. [12] have systematically compared 24 different multivariate regression methods with four NIR data sets with test samples within the calibration domain. They found that PCR with feature selection (PCRS) and PLS give similar results, both with regard to complexity of models and performance of prediction. Kalivas et al. [13–15] have combined top-down PCR, PLS and ordinary least squares methods into the method of cyclic subspace regression (CSR) and give theoretical arguments why PLS uses fewer factors than top-down PCR. In our opinion, the main advantage of PCR over PLS lies in the ability of detecting outliers in X-space — both in the calibration and in the prediction step.

Current implementations of PCRS seem to require that all principal components are extracted before one builds the model. If cross-validation is used to optimize the number of principal components in the model, this full decomposition is repeated several times. Kernel principal component analysis (PCA) algorithms previously developed [16,17] improve the calculating speed for high-dimensional data, but for PCRS, full factor extraction still cannot be avoided.

In this paper, we explore the use of Lanczos' eigenvalue algorithm [5] for PCR. Lanczos' method has advantages for very large and sparse matrices and is well known in numerical analysis, see for example, Ref. [18]. The basic idea is that the large matrix is first transformed to a smaller tridiagonal one using a set of auxiliary basis vectors, which are formally identical to the latent variables used in PLS [19]. These vectors may further be seen as a more efficient use of intermediate vectors generated in the NIPALS or power eigenvalue method, as it is generally known outside chemometrics [20]. There are thus several reasons to try Lanczos' method for the present problem.

The paper is organized as follows. Section 2 presents an overview over the least-squares calibration problem with emphasis on calculational linear algebra. The relation between the PLS and Lanczos methods is explored and the numerical peculiarities of the latter explained. Section 3 describes the data and the calculational procedures actually used. In Section 4, results are presented regarding selection procedures, calculational speed and predictive power. The discussion compares the present method with the work of Kalivas et al. [13–15] with regards to the relation between PLS and PCR.

2. Theory

2.1. Notation

Bold-face capital letters are used for matrices, bold-face lower-case letters for vectors and italic characters for scalars, except for the two letters d and s, which represent diagonal matrices of the eigenvalues and singular values, respectively. X denotes the calibration data set after column mean centering, where n rows represent n given samples and p columns p given variables. y denotes a column vector containing the measurement values of interest for the n calibration samples after the same pretreatment.

2.2. Calibration and least squares methods

We are interested in solving a system of linear equations Xb = y. A general solution may be written in terms of the generalized inverse (or pseudoinverse) [21] as b = X^+ y, where b is the estimate of the coefficient vector. This solution covers the various cases which may appear: (i) the ordinary solution b = X^{-1}y when X is square and invertible, (ii) the least-squares solution

    b = (X'X)^{-1}X'y,                                             (1)

when there are more equations than variables, as well as (iii) the minimum-norm solution when there are more variables than equations. The latter situation commonly occurs in multivariate calibration when spectra are measured at more wavelengths than there are samples.

Several methods for calculating the generalized inverse may be seen as involving an orthogonal decomposition of X, i.e.,

    X = URW',                                                      (2)

where U and W are rectangular matrices with orthonormal columns and R is a square and non-singular matrix of a form which is easily inverted [19]. In the general case, one may write the transpose X' = WR'U' and the generalized inverse X^+ = WR^{-1}U'. The regression coefficient vector may thus be written

    b = X^+ y = WR^{-1}U'y.                                        (3)

In the following, we derive an expression which is closer to the ordinary least-squares expression. The generalized inverse fulfils, e.g., X^+ = X^+XX^+ and (XX^+)' = XX^+ [21]. Combining these two expressions with the orthogonal decomposition (Eq. (2)), one gets

    b = X^+ y = X^+ (XX^+)'y
      = WR^{-1}U'UR'^{-1}W'X'y
      = W(R'R)^{-1}W'X'y.                                          (4)

When X'X is non-singular and the inverse (X'X)^{-1} exists, one may obtain this expression directly by inserting Eq. (2) into the ordinary least-squares expression (Eq. (1)).

In actual applications, the decomposition (Eq. (2)) is heavily truncated, at least in chemometrics. Still, there are reasons to hope that the transformed matrix R and its inverse R^{-1} pick up those parts of X which are important for the problem under consideration.
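As a numerical illustration of Eq. (4) (ours, not part of the original paper), the following MATLAB fragment uses the built-in svd function as the orthogonal decomposition X = URW' for a small full-rank example and checks that W(R'R)^{-1}W'X'y reproduces the ordinary least-squares solution of Eq. (1); all variable names are our own.

% Sketch (ours): check Eq. (4) against Eq. (1) on a toy example.
X = randn(20, 5);                            % tall, full-rank data matrix
y = randn(20, 1);                            % dependent variable
[U, R, W] = svd(X, 'econ');                  % orthogonal decomposition X = U*R*W'
b_eq4 = W * ((R' * R) \ (W' * (X' * y)));    % Eq. (4)
b_ls  = (X' * X) \ (X' * y);                 % ordinary least squares, Eq. (1)
disp(norm(b_eq4 - b_ls));                    % should be of the order of machine precision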

There are two forms of R which are of importance in the present discussion. For PCR the matrix R is diagonal, while for PLS-1, R is right bidiagonal. For PCR, one may find the non-zero diagonal elements of R from one of the eigenvalue problems

    X'XW = Wd,                                                     (5)

and

    XX'U = Ud.                                                     (6)

The diagonal matrix d is usually written with the diagonal elements ordered, i.e., d_{11} ≥ d_{22} ≥ ... ≥ d_{AA} ≥ 0, where A is the number of non-zero eigenvalues. The positive square roots d_{ii}^{1/2} = s_i are called singular values, and the singular-value decomposition (s = diagonal matrix with elements s_i on the diagonal) is

    X = UsW',

giving the regression coefficients

    b_PCR = Ws^{-1}U'y.                                            (7)

It is sufficient to solve one of the Eqs. (5) or (6), since the non-zero eigenvalues d_{ii} are the same. One thus obtains U or W from the eigenvalue problem of smallest dimension and the other matrix from

    U = XWs^{-1},                                                  (8)

or

    W = X'Us^{-1},                                                 (9)

respectively. The kernel PCA methods recently developed [16,17] solve the smallest-dimension problem (Eq. (6)) for wide matrices.
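As a hedged illustration of Eq. (7) (ours, not from the paper), the MATLAB lines below compute truncated PCR coefficients from an economy-size SVD of a centred toy matrix; the cut-off used to discard small singular values is arbitrary here and only stands in for the noise cut-off discussed in Section 2.3.

% Sketch (ours): truncated PCR coefficients, Eq. (7).
X = randn(30, 200);  X = X - ones(30, 1) * mean(X);    % centred toy data
y = randn(30, 1);    y = y - mean(y);
[U, S, W] = svd(X, 'econ');
s = diag(S);
keep = s > 1e-6 * s(1);                                % illustrative cut-off for noise
b_pcr = W(:, keep) * ((U(:, keep)' * y) ./ s(keep));   % b_PCR = W s^{-1} U'y, truncated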


2.3. PLS and the Lanczos eigenvalue method

In the PLS-1 method as described by one of us [19], one starts from u_0 = 0 and

    w_1 = X'y / ||X'y||                                            (10)

and obtains successive column vectors of U and W through the iteration scheme

    u_i = k_i (Xw_i - u_{i-1}u_{i-1}'Xw_i),                        (11)

    w_i = k_i' (X'u_{i-1} - w_{i-1}w_{i-1}'X'u_{i-1}),             (12)

where k_i and k_i' are normalization constants.

This scheme is attributed to Golub and Kahan [22] and is closely related to the Lanczos tridiagonalization transformation for symmetric matrices [5]. It gives R = U'XW in right bidiagonal form, i.e., only elements R_{ii} and R_{i,i+1} are different from zero. It is then straightforward to calculate the inverse R^{-1}, which has right triangular form. For details see Ref. [19].
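The recursion of Eqs. (10)–(12) is short enough to state directly in MATLAB. The sketch below (ours; names and toy data are our own) runs a few iterations and verifies that R = U'XW is right bidiagonal, i.e., that only the elements R_{ii} and R_{i,i+1} are appreciably different from zero.

% Sketch (ours): Golub-Kahan / PLS-1 bidiagonalization, Eqs. (10)-(12).
n = 30;  p = 200;  nl = 5;
X = randn(n, p);  X = X - ones(n, 1) * mean(X);        % centred toy data
y = randn(n, 1);  y = y - mean(y);
W = zeros(p, nl);  U = zeros(n, nl);
w = X' * y;  w = w / norm(w);  W(:, 1) = w;            % Eq. (10)
u = X * w;   u = u / norm(u);  U(:, 1) = u;            % Eq. (11) with u_0 = 0
for i = 2:nl
    w = X' * u - W(:, i-1) * (W(:, i-1)' * (X' * u));  % Eq. (12)
    w = w / norm(w);  W(:, i) = w;
    u = X * w - U(:, i-1) * (U(:, i-1)' * (X * w));    % Eq. (11)
    u = u / norm(u);  U(:, i) = u;
end
R = U' * X * W;                                        % should be right bidiagonal
disp(norm(R - triu(tril(R, 1)), 'fro'));               % off-bidiagonal part ~ 0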

It is clear from Eq. (7) that small singular values representing noise may give large contributions to the regression coefficients b_PCR. For this reason, calculations are made with a cut-off such that small singular values and their corresponding singular vectors do not contribute to the regression coefficients.

This possibility of explicitly separating out noise is an advantage of PCR over PLS. The latter method, on the other hand, often gives a description of comparable quality as PCR with fewer component vectors. The relation between PCR and PLS will be further addressed in the light of the calculations in this paper.

The original Lanczos transformation [5] takes a symmetric matrix A into a tridiagonal matrix Q = W'AW through the iteration scheme

    w_i = N (Aw_{i-1} - w_{i-1}(w_{i-1}'Aw_{i-1}) - w_{i-2}(w_{i-2}'Aw_{i-1})),   (13)

where N is a normalization factor. In the present case, we consider A = X'X, w_0 = 0, and w_1 given by Eq. (10). One then obtains the same vectors w_i as with the Golub–Kahan or PLS-1 bidiagonalization scheme (Eqs. (11) and (12)) above.

There are thus two ways of obtaining Q = R'R and W: by bidiagonalization (Eqs. (11) and (12)) of X followed by the multiplication R'R, or by direct tridiagonalization (Eq. (13)) of A = X'X. Which approach is faster depends upon the shape of X as well as upon the number of factors w_i to be calculated. Tridiagonalization (Eq. (13)) is more efficient for tall (portrait) matrices and when many factors w_i are desired. For wide (landscape) matrices and when few factors w_i are needed, bidiagonalization is more efficient. In our program development, we have used the latter approach.

For the Lanczos method, the remaining task is to solve the eigenvalue equation

    QV = Vd,                                                       (14)

giving

    Q = VdV'.                                                      (15)

The eigenvectors V of Q with eigenvalues d from Eq. (14) give eigenvectors of X'X as WV with the same eigenvalues. They are a subset of those obtained by diagonalizing the full matrix X'X as in Eq. (5). For PCR, one may use the decomposition Eq. (15) to get the inverse

    Q^{-1} = Vd^{-1}V',                                            (16)

which, when inserted into Eq. (4), gives the regression coefficient vector as

    b = WVd^{-1}V'W'X'y.                                           (17)

A final simplification occurs from the choice of w_1 = X'y/||X'y|| in the PLS-1 algorithm (Eq. (10)) and the assumed orthonormality of the columns of W. We may therefore obtain the elements of the regression coefficient vector as

    b_i = ||X'y|| Σ_{kl} W_{ik} V_{kl} d_l^{-1} V_{1l}.            (18)

With formula (18), one has the possibility of deleting eigenvalues and eigenvectors using two criteria: (1) small eigenvalues d_l which represent noise and (2) small coefficients V_{1l} which represent vectors irrelevant to the current regression problem. Eq. (18), however, should not be used directly without first considering the problems with round-off which are peculiar to the Lanczos approach.

2.4. Numerical peculiarities of the Lanczos approach

The numerical peculiarities of the Lanczos eigenvalue method are well understood. The so-called Krylov sequence {w_0, Aw_0, A^2 w_0, ..., A^n w_0} is known to converge to the eigenvector of A with the numerically largest eigenvalue [20]. If the iteration is carried far enough and the vectors A^n w_0 are normalized, the difference between consecutive elements thus becomes infinitesimal. The Lanczos vectors {w_i} may be seen as a Schmidt-orthogonalized version of the Krylov sequence. Even for fairly low values of i, the vectors {w_i} are dominated by round-off. This has important consequences: (1) with k large enough, w_i and w_{i+k} may be severely non-orthogonal; (2) single eigenvalues of A = X'X appear as multiple roots of the tridiagonal matrix Q = W'AW; (3) there may be eigenvalues of Q appearing in between those of A. Explicit reorthogonalization of the vectors {w_i} postpones the problems but does not eliminate them. For sparse-matrix computations, where the Lanczos method has its greatest advantages, explicit reorthogonalization may be undesirable or even impossible. Furthermore, non-orthogonalities and multiple eigenvalues tell that the calculation has converged and may be used for monitoring purposes. When they appear, one knows that one has exhausted the space spanned by Lanczos or Krylov vectors [18].

For the regression problem, it is thus important that non-orthogonalities among the vectors {w_i} are treated correctly. This can be done in two ways. The simple way is to monitor the overlap w_1'w_j (j > 1) and to stop the iteration when this quantity starts to deviate appreciably from zero. The possible drawback is that one stops too early, i.e., before the space spanned by the Lanczos vectors is exhausted. On the other hand, if one continues with more iterations, one still gets good eigenvalues but the normalization of the eigenvectors WV is likely to be destroyed. It is therefore necessary to renormalize the eigenvectors and to delete those which overlap strongly with converged solutions. These overlapping vectors are either the well-known converged multiple solutions or the occasional non-converged ones. In the present work, we have chosen to "stop early", i.e., before non-orthogonalities among the vectors {w_j} require special corrections.
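A minimal sketch (ours) of this "stop early" strategy: the bidiagonalization loop of Section 2.3 is repeated, but a new Lanczos vector is accepted only while the overlap w_1'w_j stays below a tolerance; 0.5 × 10^{-5} is the value used in Section 4.1. Function and variable names are our own, and Xc, yc denote the centred calibration data.

% Sketch (ours): Lanczos/PLS-1 basis with round-off monitoring.
function [W, U, nl] = lanczos_basis(Xc, yc, tol)
    [n, p] = size(Xc);
    nlmax = min(n, p);
    W = zeros(p, nlmax);  U = zeros(n, nlmax);
    w = Xc' * yc;  w = w / norm(w);  W(:, 1) = w;             % Eq. (10)
    u = Xc * w;    u = u / norm(u);  U(:, 1) = u;             % Eq. (11), u_0 = 0
    nl = 1;
    for j = 2:nlmax
        w = Xc' * u - W(:, j-1) * (W(:, j-1)' * (Xc' * u));   % Eq. (12)
        w = w / norm(w);
        if abs(W(:, 1)' * w) > tol      % round-off has set in:
            break                       % stop at nl = j-1, before corrections are needed
        end
        W(:, j) = w;
        u = Xc * w - U(:, j-1) * (U(:, j-1)' * (Xc * w));     % Eq. (11)
        u = u / norm(u);  U(:, j) = u;
        nl = j;
    end
    W = W(:, 1:nl);  U = U(:, 1:nl);
end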

2.5. Selection of eigenvectors

In this section, we discuss how to truncate V, i.e., to decide which eigenvectors of Q (Eq. (14)) are to be included in the calculation of regression coefficients (Eq. (18)). We discuss here two criteria, one based upon model fit, the other upon cross-validation.

The first criterion is thus to study how much an eigenvector contributes to the model of the target vector y in the calibration set. We write the objective function of the least-squares problem as L = (y - Xb)'(y - Xb) with b in Eq. (17),

    b = ||X'y|| WVd^{-1}v_1',                                      (19)

where v_1 is the first row vector of V. The objective function may be simplified to

    L = y'y - ||X'y||^2 v_1 d^{-1} v_1'.                           (20)
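The step from Eq. (19) to Eq. (20) is not spelled out in the text. A short derivation (ours), assuming W'W = I and V'V = I and using X'y = ||X'y|| We_1 (so that W'X'y = ||X'y|| e_1, with e_1 the first unit vector) together with W'X'XW = Q = VdV', runs as follows in LaTeX notation:

\begin{aligned}
L &= (y - Xb)'(y - Xb) = y'y - 2\,b'X'y + b'X'Xb,\\
b'X'y &= \|X'y\|\, v_1 d^{-1} V'W'X'y = \|X'y\|^2\, v_1 d^{-1} V'e_1 = \|X'y\|^2\, v_1 d^{-1} v_1',\\
b'X'Xb &= \|X'y\|^2\, v_1 d^{-1} (V'QV)\, d^{-1} v_1' = \|X'y\|^2\, v_1 d^{-1} v_1',\\
L &= y'y - \|X'y\|^2\, v_1 d^{-1} v_1',
\end{aligned}

which is Eq. (20).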

In the following, we are interested in the contributions of individual terms to L. We write

    v_1 d^{-1} v_1' = Σ_l V_{1l}^2 / d_l                           (21)

and introduce the normalized fit

    F_k = (V_{1k}^2 / d_k) / (Σ_l V_{1l}^2 / d_l).                 (22)

This expression tells that a column eigenvector v_k is important for the regression problem if it makes F_k large, and we might therefore investigate the eigenvectors in descending order of this quantity. When F_k is equal to zero, the corresponding eigenvector has no contribution at all to the fit of the model. The normalization gives

    Σ_k F_k = 1                                                    (23)

and a partial sum Σ_{k=1}^{a} F_k is thus a measure of the fit of the model with the most "effective" eigenvectors. With α some small number determined from experience, one may decide to stop with a eigenvectors if

    Σ_{k=1}^{a-1} F_k < 1 - α ≤ Σ_{k=1}^{a} F_k.                   (24)


Calculations using this approach will, in the following, be denoted by PCRF (F = fast). Note that with α = 0, one uses the full space of eigenvectors generated in the Lanczos basis. The result is a PLS-1 calculation.
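The selection just described amounts to only a few lines of MATLAB. The sketch below (ours) assumes that W and U have been obtained from the lanczos_basis sketch of Section 2.4 and that Xc, yc are the centred calibration data; alpha = 0.0040 is the value quoted in Section 4.2.

% Sketch (ours) of the PCRF selection step and of Eq. (18).
alpha = 0.0040;
R = U' * Xc * W;  Q = R' * R;                  % tridiagonal Q of Eq. (14)
[V, D] = eig(Q);  d = diag(D);
F = V(1, :)'.^2 ./ d;  F = F / sum(F);         % normalized fit F_k, Eq. (22)
[Fs, order] = sort(F, 'descend');              % consider vectors in descending F_k
a = find(cumsum(Fs) >= 1 - alpha, 1);          % stopping rule, Eq. (24)
keep = order(1:a);
b = norm(Xc' * yc) * W * (V(:, keep) * (V(1, keep)' ./ d(keep)));   % Eq. (18)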

More accurate but also more time-consuming is the procedure of cross-validation. A full-scale cross-validation is on two levels, first determining the number of basis functions {w_i}, then selecting the eigenvalues and eigenvectors contributing to the regression problem. For each combination of the two parameters, one deletes in turn the data for each sample and makes a prediction of the dependent variable using the remaining data. A search is then made for parameter values which give models with low values of the root mean square error based on the cross-validation (RMSECV). The optimal model should have the most parsimonious complexity (lowest numbers of the Lanczos basis vectors and eigenvectors) and its RMSECV is statistically equivalent to the minimal RMSECV value. This approach is here named PCRL (L = Lanczos).

2.6. Algorithms

We will here summarize the algorithms for PCR in the Lanczos basis.

Step 1. Transform X to bidiagonal form R using formulas (11) and (12):

    X = URW';
    R = U'XW.

This gives the auxiliary Lanczos/PLS-1 basis vectors {w_i} as columns of W. Monitor non-orthogonality due to round-off as a stopping criterion (PCRF).

Step 2. Make the tridiagonal and symmetric matrix Q = R'R.

Step 3. Solve the eigenvalue problem

    QV = Vd,

where d is the diagonal matrix of squared singular values. Sort the eigenvalues and eigenvectors after their contribution to the fit (Eq. (21)). Delete vectors according to Eq. (24) (PCRF).

Step 4. Calculate regression coefficients as

    b_i = ||X'y|| Σ_{kl} W_{ik} V_{kl} d_l^{-1} V_{1l}.

In anticipation of numerical calculations, one may estimate that step 1 is comparable to the multiplication of X'X or XX' in ordinary or kernel PCA. Step 2 is linear in the dimension of R and is thus quite fast. Matrix diagonalization, step 3, usually starts with a Householder transformation to tridiagonal form. In the present case, the matrix is already tridiagonal from the Lanczos transformation. This fact is likely to give some saving of time. The dimension of Q cannot exceed the smaller of the dimensions of X'X and XX', but is in reality much smaller due to the stopping criterion.

Appendix A presents Matlab code for calibration according to the PCRF procedure as well as for a PLS procedure (PLSF), which uses the same step 1 as PCRF but makes a direct inversion of the bidiagonal R matrix. A short routine for prediction is also included.

3. Calculations

3.1. Data sets

Five NIR data sets were analysed as practical examples. They are the same data sets as used by Centner et al. [12] who discuss their properties at length and describe in detail how they are split into separate parts for calibration and prediction, respectively. In the following, the sets are numbered 1–5. Sets no. 1 and 2 are for the prediction of moisture content in wheat and are from data made available as a standard reference by Kalivas [23]. The remaining sets no. 3–5 are from industrial practice. Set no. 1 is with offset correction and centering. Like sets no. 3–5, it was split into calibration and prediction sets using the Kennard and Stone [24] algorithm. Set no. 2 is as given by Kalivas [23] with no data pretreatment except centering and with two test sets.


Table 1
Description of NIR data sets

Data set                        Wavelength   No. of      No. of spectra       No. of spectra   Dependent variable   Transform           Linearity/nonlinearity
                                range (nm)   variables   in calibration set   in test set
1 (WHEAT)                       1100–2500    701         59                   40               Moisture content     offset correction   linear
2 (WHEAT, Kalivas setup [23])   1100–2500    701         50                   20, 20           Moisture content     none                linear
3 (POLY-DAT)                    1130–2126    499         60                   24               Hydroxyl number      offset correction   linear
4 (GASOLINE)                    800–1080     561         132                  30               Octane number        first derivative    slightly nonlinear
5 (POLYMER)                     1100–2498    700         40                   14               Minor mineral        SNV [31]            strongly nonlinear

Set no. 3, denoted POLY-DAT by Centner et al. [12], was collected for the prediction of hydroxyl number from polyether polyol samples, set no. 4, denoted GASOLINE, was collected for the determination of octane number for a set of gasoline samples, and set no. 5, denoted POLYMER, for the determination of the amount of a minor mineral compound in a polymer. For all five data sets, the objects of the test sets are within the domain covered by the calibration sets. For the data sets no. 1–4, the relationship between the data X and the dependent y variable is linear or only slightly nonlinear, while there is a strong nonlinearity for data set no. 5. The number of samples in the calibration sets varied between 40 and 132, the number of variables (wavelengths) between 499 and 701, giving matrices with between 28 000 and 74 052 elements. The test sets contained between 14 and 40 samples. Table 1 gives a summary of the properties of the data sets. Further details about the data may be found in Refs. [12,23].

3.2. Methods of calculation

All algorithms have been programmed in Matlab 4.0. Timing estimates have been made by using the flops function which counts the number of floating point operations (flops) needed in the operation. This measure is independent of the speed of the computer, and is proportional to the elapsed time for a given hardware configuration. Calculations with five different methods were made and will be described below.

Two methods of calculation were used to provide a reference for comparison with the present work.

(1) PCRS. These calculations are identical to calculations by Centner et al. [12] using feature selection based on correlation. The fastest kernel PCA algorithm [16,17] was used for this calculation, and full matrix diagonalization was achieved with the eig subroutine of Matlab. Following Centner et al. [12], this approach is denoted PCRS (S = Selection).

(2) de Jong's modified simple partial least squares (SIMPLS) method, which gives practically the same results as the standard PLS-1 method [25,26].

Apart from the predictive ability of these calculations, it is interesting to compare the number of factors (principal components) used and the timing with the present work.

The remaining three sets of calculations were made using the Lanczos iteration scheme without reorthogonalization as presented above (Eqs. (11) and (12)). These three calculations were as follows.

(1) PCR in the Lanczos basis with both the number of Lanczos vectors nl and the number of eigenvalues/eigenvectors nf determined by cross-validation. As mentioned in Section 2.5, this method is denoted PCRL (L = Lanczos).

(2) PCR with the number of Lanczos vectors nl determined by the overlap criterion discussed in Section 2.4 and the number of eigenvalues/eigenvectors nf determined by model fit as discussed in Section 2.5. This type of calculation is denoted by PCRF (F = Fast).


Table 2
The matrix W'W for data set no. 1. Values denoted by 0.0 and 1.0 are correct within 5 × 10^{-5}

        1         2         3         4         5         6         7
1       1.0       0.0       0.0       0.0      -0.0001   -0.8538   -0.5148
2       0.0       1.0       0.0       0.0       0.0      -0.0665   -0.0401
3       0.0       0.0       1.0       0.0       0.0      -0.0001   -0.0001
4       0.0       0.0       0.0       1.0       0.0       0.0       0.0
5      -0.0001    0.0       0.0       0.0       1.0       0.0       0.0
6      -0.8538   -0.0665   -0.0001    0.0       0.0       1.0       0.0
7      -0.5148   -0.0401   -0.0001    0.0       0.0       0.0       1.0

(3) PLS-1 in the same Lanczos basis as PCRF. This calculation is denoted by PLSF (F = Fast).

4. Results

4.1. Selection of Lanczos vectors

Table 2 presents the overlap matrix W'W for the first seven Lanczos vectors from data set no. 1, and Table 3 presents the overlap elements w_1'w_j (j = 2–10) for all five data sets. Table 4 compares the eigenvalues for data set no. 1 from the full calculation (eig subroutine of Matlab) with those from calculations in Lanczos bases of dimension 4 and 7, respectively.

In exact arithmetic, Table 2 would show a unit matrix, and all entries in Table 3 would be zero. This is far from the case, as anticipated in Section 2.4. The non-zero elements coupling w_1 with w_6 and w_7 in Table 2 show that w_1 is contained in the latter two vectors to more than 99% (here and elsewhere percentages are calculated as sums of squares of expansion coefficients). One might therefore expect that with seven or more Lanczos vectors, the first eigenvalue is duplicated for this data set. That this is the case is seen in Table 4. This table, which is representative of all data sets investigated, also shows that the convergence of eigenvalues improves with the size of the Lanczos basis. With nl = 4, the three largest eigenvalues are in agreement with the full calculation; with nl = 7, one has agreement for four out of maximally six eigenvalues. The agreement is best for the largest eigenvalues and deteriorates for the smaller ones. The smallest eigenvalues from the Lanczos calculations cannot be fitted to the exact ones in any sensible manner. The corresponding errors are put in parentheses and are only given for completeness.

We now consider the dimension of the Lanczos basis (nl) actually used in the calculations. In the SIMPLS as well as in the PCRL calculations, nl was determined by cross-validation, while the "fast" calculations PCRF and PLSF stopped at nl = j - 1 when, due to accumulation of round-off, the value w_1'w_j exceeded 0.5 × 10^{-5}. The values of nl are given in Table 5. For each data set, one obtains essentially the same result, irrespective of method. There is no deviation larger than 1 unit.

Table 3
Values of w_1'w_j (j = 2, 3, ..., 10) for the data sets investigated. Values denoted by 0.0 are correct within 5 × 10^{-5}

Data set no.   j=2    j=3    j=4    j=5       j=6       j=7       j=8       j=9      j=10
1              0.0    0.0    0.0    -0.0001   -0.8538   -0.5148    0.0       0.0      0.1009
2              0.0    0.0    0.0     0.0       0.8412    0.5403    0.0       0.0     -0.0823
3              0.0    0.0    0.0     0.0       0.0       0.0       0.0010    0.4865   0.2709
4              0.0    0.0    0.0     0.0       0.0       0.0       0.0       0.0     -0.0334
5              0.0    0.0    0.0     0.0      -0.0030   -0.9960   -0.0663    0.0      0.0

Table 4
Eigenvalues for data set no. 1

No.   Exact value   Errors (negative) in Lanczos calculations
                    nl = 7                           nl = 4
1     126.0890      <0.5 × 10^{-12}; 4 × 10^{-8}     <0.5 × 10^{-12}
2     4.3863        <0.5 × 10^{-12}                  9 × 10^{-7}
3     0.2928        1 × 10^{-12}                     2 × 10^{-6}
4     0.1268        2 × 10^{-7}
5     0.0118        (0.0020)                         (-0.0161)
6     0.0066
7     0.0031        (0.0005)
8     0.0027
9     0.0017
10    0.0014

A further indication that round-off monitoring is sufficient as a stopping criterion for the number of Lanczos basis functions is obtained from the cross-validation calculations for the PCRL method. For data set no. 1, Table 6 gives values of RMSECV for all combinations of nl and nf which are consistent with the round-off stopping criterion. The lowest RMSECV value is obtained for nl/nf = 3/3. This value, 0.2433, is significantly different from the results obtained with nl < 3 and nf < 2, but the difference from the model with nl/nf = 3/2 is not significant. This is clear both from inspection of the table and from the use of a randomization test [27]. The optimal model selected for this data set is thus nl/nf = 3/2. The nl/nf values of optimal models for all the data sets are given in Table 5. For data sets no. 1 and 2, cross-validation gives a value of nl which is 1 unit smaller than that given by the round-off criterion used for PCRF. In neither case is the difference in RMSECV significant.

w xuse of a randomization test 27 . The optimal modelselected for this data set is thus nlrnfs3r2. Thenlrnf values of optimal models for all the data setsare given in Table 5. For data sets no. 1 and 2,cross-validation gives a value of nl which is 1 unitsmaller than that given by the round-off criterion used

Table 5Ž .Dimensions nlrnf of optimal models with different methods

Method Data set no.

1 2 3 4 5

PCRS yr2 yr3 yr8 yr11 yr7SIMPLS 3ry 4ry 7ry 9ry 5ryPCRL 3r2 4r3 7r7 9r6 5r5PCRF 4r2 5r4 7r6 9r6 5r5PLSF 4ry 5ry 7ry 9ry 5ry

Table 6
RMSECV for data set no. 1 using different dimensions (nl/nf) in the Lanczos calculation (PCRL). The optimal model (nl/nf = 3/2) is marked with an asterisk

nl/nf    1         2         3         4
1        1.1701
2        0.8434    0.5986
3        0.7562    0.2486*   0.2433
4        0.7536    0.2470    0.2498    0.2502


We therefore conclude that the simplest approach, monitoring the round-off as in Table 3, is just as good for the determination of nl as the cross-validation normally performed in PLS calculations.

The cross-validation calculations for PCRL were made also with larger values of nl than given by the round-off criterion, and the choice of nl was made without regard to round-off.

4.2. Selection of eigenvectors

In this section, we compare how many (nf) and which eigenvectors are selected by the three variants of PCR considered — PCRS, PCRL, and PCRF. A difference between PCRS and the other variants is that with PCRS, one may have any number of eigenvectors, while for the Lanczos-based methods, the maximal number of eigenvectors is given by the number of Lanczos vectors nl. The order in which the eigenvectors are considered in PCRL and PCRF is determined by how they maximize the fit or — equivalently — how they minimize the residual. Normalized fit parameters F_k (Eq. (22)) are given in Table 7. The number of eigenvectors nf is given in Table 5.

As mentioned above, Tables 5 and 6 give details from the PCRL cross-validation calculations.

We first compare PCRL with the reference method PCRS. For data sets no. 1 and 2, only 2 and 3 eigenvectors, respectively, are needed. Both methods select the same eigenvectors. For data set no. 1, eigenvector nos. 1 and 3 are selected; for data set no. 2, eigenvector nos. 1, 3, and 4. With nl determined by cross-validation equal to 3 and 4, respectively, for the two data sets, there is thus one eigenvector missing.

Table 7
The normalized fit parameter F_k (k = 1, 2, ..., nl)

Data set no.   Eigenvector no.
               1        2        3        4        5        6        7        8        9
1              0.2775   0.0002   0.7211   0.0012
2              0.6966   0.0006   0.2029   0.0958   0.0041
3              0.7328   0.2372   0.0042   0.0137   0.0061   0.0026   0.0033
4              0.7917   0.0656   0.0002   0.0405   0.0009   0.0082   0.0692   0.0024   0.0214
5              0.8606   0.1170   0.0056   0.0041   0.0126

This is no. 2 for both data sets. It goes without saying that the last contributing eigenvector (nos. 3 and 4, respectively) in PCRL has not converged with these low values of nl. With nl taking its maximum value allowed by round-off monitoring (4 and 5, respectively), the convergence problem for the contributing eigenvectors goes away. This is shown, e.g., by the errors in eigenvalues given for data set no. 1 in Table 4 for the case of nl = 4.

For data sets no. 3 and 5, PCRL selects all eigenvectors available in the Lanczos basis. These calculations are thus formally identical to PLS with the same number of basis functions. The number of eigenvectors used in PCRS is larger — 8 and 7, respectively, compared to 7 and 5 used in PCRL. For data set no. 3, PCRS selects eigenvector nos. 1–6, 8, and 9. Of the seven vectors in the Lanczos calculation, the unconverged vector no. 7 may be expanded in terms of those of the full calculation. One finds that it consists 87% of no. 8 and 9% of no. 9. The Lanczos basis is thus contained within the basis of eigenvectors of the PCRS calculation.

The situation for data set no. 5 is similar. PCRS selects eigenvector nos. 1–7, PCRL selects eigenvector nos. 1–5. The unconverged vector no. 5 in the Lanczos calculation is 14% no. 5, 8% no. 6, and 76% no. 7.

For data set no. 4, PCRS selects eigenvectors 1, 2, 4, 6, 7, 9, 11–15, a total of 11 eigenvectors. The Lanczos basis consists of 9 vectors, with 6 eigenvectors selected by PCRL. These are nos. 1, 2, 4, 6, 7, and 9. Nos. 1–7 are pure, i.e., the same as in the full PCRS calculation. No. 9 is 97% a mixture of nos. 11–15, but none of these dominates over the others. The PCRL eigenvectors are thus contained within the space of those selected by PCRS. What is missing, however, is a vector in the PCRL calculation corresponding to no. 9 in the full calculation. This vector is no. 8 of the PCRL calculation, which is 11% no. 8, 80% no. 9, and 7% nos. 11–15. As may be inferred from Table 7, this is the vector in line to be selected if nf is increased from 6 to 7.

The PCRF algorithm was constructed to give results close to those of PCRL without using cross-validation. The number of eigenvectors was determined according to Eq. (24) with the stopping parameter α = 0.0040. This number was determined by trial and error, and gives nf values which are the same as those of PCRL for three of the five data sets. For sets no. 2 and 3, the nf values deviate by 1 unit, see Table 5.

4.3. Speed of calculations

Table 8 presents timing estimates for the calibration calculations. The unit is Megaflops (Mflops), i.e., millions of floating point operations, as given by the Matlab system. The numbers are thus independent of installation but depend, of course, upon the relative speed of the numerical routines used by Matlab. The differences between the methods are quite large, a factor of 1000 in the worst case.

Table 8
Timing estimates from calibration calculations (Mflops)

Method    Data set no.
          1         2        3         4          5
PCRS      167.70    86.51    147.02    3357.21    42.07
SIMPLS    25.26     16.81    24.83     199.77     9.83
PCRL      30.70     27.50    22.27     45.50      23.91
PCRF      0.96      1.01     1.14      3.29       0.85
PLSF      0.95      0.96     1.06      3.18       0.77

Table 9
RMSEP for optimal model using different methods

Method    Test set no.
          1        2a       2b       3       4        5
PCRS      0.213    0.256    0.307    2.09    0.163    0.0477
SIMPLS    0.215    0.246    0.296    2.23    0.146    0.0445
PCRL      0.213    0.254    0.306    2.23    0.151    0.0445
PCRF      0.213    0.240    0.307    2.31    0.151    0.0445
PLSF      0.218    0.235    0.301    2.23    0.147    0.0445

By far the slowest is the PCRS method, where full matrix eigenvalue problems are solved repeatedly in the cross-validation. This shows up particularly for data set no. 4 where the largest matrix is generated. Of the remaining methods, SIMPLS and PCRL lose time compared to the fast methods PCRF and PLSF due to cross-validation calculations. They are of comparable speed. A method with cross-validation only for the number of eigenvectors (nf) would be intermediate in speed between PCRL and PCRF. Finally, of the two fast methods, PLSF is marginally faster than PCRF. This difference between these two methods is expected from the complexity of matrix inversion vs. matrix diagonalization.

4.4. Predictive ability

Table 9 presents the root-mean-square error of prediction (RMSEP) for the test sets not used in the calibration (one for each of data sets no. 1, 3–5; two for data set no. 2). It is striking that the methods investigated give such similar results. This conclusion was also drawn by Centner et al. [12] who investigated the same data sets. They concluded from a randomization test [27] for data set no. 3 that the RMSEP differences achieved are not statistically significant. In our calculations, the RMSEP differences are smaller, and the same conclusion is drawn.

5. Discussion

As stated in Section 1, the method of CSR by Kalivas et al. [13–15] combines elements of PCR and PLS. By varying the parameters of this method, it is possible to make it describe other regression methods as well. The main idea of CSR is to start with a complete singular-value decomposition of the data matrix and then to generate subspaces of the full row and column spaces by a Krylov or Lanczos procedure. From the point of view of calculational speed, little is gained by CSR since it involves a complete resolution of the data matrix, the time-consuming step that we explicitly wanted to avoid by first applying the Lanczos procedure and then solving the eigenvalue problem in the subspace. With the numerical peculiarities of the Lanczos procedure in mind (Section 2.4), it is understandable that special efforts were required to make CSR numerically robust [14].

The great value of starting with a complete SVD, however, lies in the conceptual understanding it gives to different regression methods. A major concern in a recent paper by Kalivas [28] is whether PLS overfits the data relative to top-down PCR. It has been shown theoretically by de Jong [29] that PLS fits the data better than top-down PCR with the same number of factors. This improvement, however, is obtained by including contributions from eigenvectors with small eigenvalues, which may describe noise. To draw the line between signal and noise is to some extent a subjective matter. Round-off monitoring of the Lanczos iteration should give a conservative estimate of the number of relevant factors that one can extract in a regression calculation. The agreement with cross-validation for three out of the five data sets that were investigated speaks for this. However, for data sets no. 1 and 2, cross-validation (PCRL) gives one factor less than round-off monitoring (PCRF, PLSF, see Table 5). Kalivas [28], using the data set we label no. 2 but with 141 instead of 701 spectral points [23], contends that this set is slightly overfitted with five instead of four PLS factors. This is hardly a serious problem. It is avoided — at the expense of calculational speed — if the number of PLS factors or the number of eigenvectors is determined by cross-validation.

The present study, Section 4.2, gives some further insight into the PLS method. Our Lanczos calculations give good approximations to the larger eigenvalues of the principal components. In the Lanczos bases, 3–9 vectors could be used without ambiguities due to round-off, with one (or, for the largest set, two) principal component eigenvectors not well represented in each case. The same result holds for PLS calculations in the same bases. The poorly represented vectors obtained by diagonalizing in the Lanczos or PLS basis correspond to mixtures of several principal component eigenvectors. For data sets no. 3–5, these mixtures contain eigenvectors which are selected by PCRS. As a consequence, PLS has fewer factors than PCRS for these data sets [13–15,28]. Principal component eigenvectors which are excluded by PCRS are thus handled in two ways by PLS. Those with large eigenvalues are included in PLS; those with small eigenvalues are, in general, excluded in PLS or in the Lanczos basis.

In the present work, the ordering feature of eigenvectors was their relative contribution to the fit to the y variable. An equivalent formulation could be made in terms of residuals. Verdú-Andrés and Massart [11] have recently suggested that one should instead use leverage-corrected residuals so as to include some corrections in leave-one-out cross-validation. In the present work, we have not investigated the use of cross-validation or leverage correction on the ordering of eigenvectors for feature selection. However, Lorber and Kowalski showed some years ago [30] that leverage correction only accounts for a part of the difference between errors of fit and errors of true cross-validation.

6. Concluding remarks

This paper has achieved results in two areas. One is in the understanding of the relation between PLS and PCR, where our calculations show which eigenvectors in PCR are included in PLS and which are not. The other and major area is how to perform PCRS in a quicker way. This involves three steps. (1) The calculation is made in the truncated basis obtained from the Lanczos iteration. (2) The eigenvectors are ordered after their contribution to the least-squares fit, or equivalently, how they minimize the least-squares residuals. (3) A third step would be to do without cross-validation altogether. For the determination of the size of the Lanczos basis, monitoring the accumulation of round-off gives almost the same result as cross-validation. To eliminate cross-validation for the selection of eigenvectors is more controversial and should be investigated further. An alternative to this is to find ways to do cross-validation in a more efficient way so as to avoid the heavy re-calculations performed by many codes in use. This is a topic for further research.

7. List of abbreviations

CSR       Cyclic subspace regression
Mflops    Megaflops, 1 million floating point operations
nf        Number of eigenvectors
NIPALS    Non-linear iterative partial least squares algorithm
NIR       Near infrared spectroscopy
nl        Number of Lanczos vectors
PCA       Principal component analysis
PCR       Principal component regression
PCRS      Principal component regression with feature selection
PCRF      Principal component regression in Lanczos basis using round-off monitoring and model fit to optimize model — "fast method"
PCRL      Principal component regression in Lanczos basis using cross-validation to optimize model
PLS       Partial least squares
PLSF      Partial least squares with dimension determined by round-off monitoring — "fast method"
RMSECV    Root mean square error based on cross-validation
RMSEP     Root mean square error of prediction
SIMPLS    "Simple partial least squares" of de Jong [25,26]

Acknowledgements

Thanks are due to D.L. Massart for the encouragement and support of this work. Thanks are also due to Luisa Pasti for valuable discussions and to John H. Kalivas for helpful comments. Financial assistance to W.W. from the Nationaal Fonds van Wetenschappelijk Onderzoek, the DWTC, and the Standards, Measurement and Testing program of the EU is gratefully acknowledged.


Appendix A. Matlab code for fast PCR and PLS calculations
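The Matlab listing of the published appendix is not reproduced in this transcript. What follows is a compact reconstruction sketch (ours), written from the description in Sections 2.3–2.6 and 4: a PCRF calibration routine, a PLSF calibration routine that inverts the bidiagonal R directly (Eq. (3)), a prediction routine, and the Lanczos basis helper of Section 2.4. Function names, argument order and default values are our own choices, not necessarily those of the published code; in practice each routine would be saved in its own file.

function [b, mX, my] = pcrf_cal(X, y, alpha, tol)
% PCRF calibration (sketch): PCR in a Lanczos (PLS-1) basis with
% round-off monitoring (Section 2.4) and model-fit selection of
% eigenvectors (Section 2.5).  X is n-by-p, y is n-by-1 (raw data).
if nargin < 3, alpha = 0.0040; end           % stopping parameter, Section 4.2
if nargin < 4, tol = 0.5e-5;   end           % round-off tolerance, Section 4.1
mX = mean(X);  my = mean(y);                 % column mean centering
Xc = X - ones(size(X, 1), 1) * mX;  yc = y - my;
[W, U] = lanczos_basis(Xc, yc, tol);         % step 1, Eqs. (10)-(12)
R = U' * Xc * W;  Q = R' * R;                % step 2
[V, D] = eig(Q);  d = diag(D);               % step 3, Eq. (14)
F = V(1, :)'.^2 ./ d;  F = F / sum(F);       % normalized fit, Eq. (22)
[Fs, order] = sort(F, 'descend');
nf = find(cumsum(Fs) >= 1 - alpha, 1);       % Eq. (24)
keep = order(1:nf);
b = norm(Xc' * yc) * W * (V(:, keep) * (V(1, keep)' ./ d(keep)));  % step 4, Eq. (18)
end

function [b, mX, my] = plsf_cal(X, y, tol)
% PLSF calibration (sketch): same Lanczos basis, but direct inversion
% of the bidiagonal R (Eq. (3)), i.e. a PLS-1 regression.
if nargin < 3, tol = 0.5e-5; end
mX = mean(X);  my = mean(y);
Xc = X - ones(size(X, 1), 1) * mX;  yc = y - my;
[W, U] = lanczos_basis(Xc, yc, tol);
R = U' * Xc * W;                             % right bidiagonal, right-triangular inverse
b = W * (R \ (U' * yc));                     % b = W R^{-1} U'y, Eq. (3)
end

function yhat = nir_predict(Xnew, b, mX, my)
% Prediction (sketch): apply the calibration-set centering to new
% spectra and add back the mean of y.
yhat = (Xnew - ones(size(Xnew, 1), 1) * mX) * b + my;
end

function [W, U, nl] = lanczos_basis(Xc, yc, tol)
% Golub-Kahan / PLS-1 bidiagonalization with round-off monitoring;
% identical to the sketch in Section 2.4.
[n, p] = size(Xc);
nlmax = min(n, p);
W = zeros(p, nlmax);  U = zeros(n, nlmax);
w = Xc' * yc;  w = w / norm(w);  W(:, 1) = w;
u = Xc * w;    u = u / norm(u);  U(:, 1) = u;
nl = 1;
for j = 2:nlmax
    w = Xc' * u - W(:, j-1) * (W(:, j-1)' * (Xc' * u));
    w = w / norm(w);
    if abs(W(:, 1)' * w) > tol, break, end
    W(:, j) = w;
    u = Xc * w - U(:, j-1) * (U(:, j-1)' * (Xc * w));
    u = u / norm(u);  U(:, j) = u;
    nl = j;
end
W = W(:, 1:nl);  U = U(:, 1:nl);
end

For a calibration set (Xcal, ycal) and test spectra Xtest, a typical call sequence under these assumptions would be [b, mX, my] = pcrf_cal(Xcal, ycal); yhat = nir_predict(Xtest, b, mX, my);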


References

[1] D.L. Massart, B. Vandeginste, L. Buydens, S. de Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics: Parts A and B, Elsevier, Amsterdam, 1998.
[2] B.G. Osborne, T. Fearn, P.H. Hindle, Practical NIR Spectroscopy with Applications in Food and Beverage Analysis, Wiley, New York, 1993.
[3] E.R. Malinowski, Factor Analysis in Chemistry, 2nd edn., Wiley, New York, 1991.
[4] H. Martens, T. Næs, Multivariate Calibration, Wiley, Chichester, 1989.
[5] C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, Journal of Research of the National Bureau of Standards 45 (1950) 255–282.
[6] H. Wold, Estimation of principal components and related models by iterative least squares, in: P.R. Krishnaiah (Ed.), Multivariate Analysis, Academic Press, New York, 1966.
[7] K. Esbensen, T. Midtgård, S. Schönkopf, Multivariate Analysis in Practice, Camo AS, Trondheim, 1994.
[8] J. Sun, A correlation principal component regression analysis of NIR data, Journal of Chemometrics 9 (1995) 21–29.
[9] A.M.C. Davies, The best way of doing principal component regression, Spectroscopy Europe 7 (1995) 36–38.
[10] J.M. Sutter, J.H. Kalivas, P.M. Lang, Which principal components to utilize for principal component regression, Journal of Chemometrics 6 (1992) 217–225.
[11] J. Verdú-Andrés, D.L. Massart, Comparison of prediction- and correlation-based methods to select the best subset of principal components for principal component regression and detect outlying objects, Applied Spectroscopy 52 (1998) 1425–1434.
[12] V. Centner, J. Verdú-Andrés, B. Walczak, D. Jouan-Rimbaud, F. Despagne, L. Pasti, R. Poppi, D.L. Massart, A comparison of multivariate calibration techniques applied to experimental data sets, Chemometrics and Intelligent Laboratory Systems, submitted for publication.
[13] P.M. Lang, J.M. Brenchley, R.G. Nieves, J.H. Kalivas, Cyclic subspace regression, Journal of Multivariate Analysis 65 (1998) 58–70.
[14] J.M. Brenchley, P.M. Lang, R.G. Nieves, J.H. Kalivas, Stabilization of cyclic subspace regression, Chemometrics and Intelligent Laboratory Systems 41 (1998) 127–134.
[15] J.H. Kalivas, Cyclic subspace regression with analysis of the hat matrix, Chemometrics and Intelligent Laboratory Systems 45 (1999) 215–224.
[16] W. Wu, D.L. Massart, S. de Jong, The kernel PCA algorithms for wide data: Part I. Theory and algorithms, Chemometrics and Intelligent Laboratory Systems 36 (1997) 165–172.
[17] W. Wu, D.L. Massart, S. de Jong, Kernel PCA algorithms for wide data: Part II. Fast cross-validation and application in classification of NIR data, Chemometrics and Intelligent Laboratory Systems 37 (1997) 271–280.
[18] L.N. Trefethen, D. Bau III, Numerical Linear Algebra, SIAM, Philadelphia, 1997.
[19] R. Manne, Analysis of two partial-least-squares algorithms for multivariate calibration, Chemometrics and Intelligent Laboratory Systems 2 (1987) 187–197.
[20] A.S. Householder, The Theory of Matrices in Numerical Analysis, Blaisdell, New York, 1964; reprinted by Dover, New York, 1975.
[21] G. Strang, Linear Algebra and Its Applications, 3rd edn., Harcourt, Brace, Jovanovich, Orlando, 1988, App. A.
[22] G.H. Golub, W. Kahan, Calculating the singular values and the pseudo-inverse of a matrix, SIAM Journal on Numerical Analysis, Series B 2 (1965) 205–224.
[23] J.H. Kalivas, Two data sets of near infrared spectra, Chemometrics and Intelligent Laboratory Systems 37 (1997) 255–259.
[24] R.W. Kennard, L.A. Stone, Computer aided design of experiments, Technometrics 11 (1969) 137–148.
[25] S. de Jong, SIMPLS: an alternative approach to partial least squares regression, Chemometrics and Intelligent Laboratory Systems 18 (1993) 251–263.
[26] S. de Jong, A comparison of algorithms for partial least squares regression, in preparation.
[27] H. van der Voet, Comparing the predictive accuracy of models using a simple randomization test, Chemometrics and Intelligent Laboratory Systems 25 (1994) 313–323.
[28] J.H. Kalivas, Interrelationships of multivariate regression methods using eigenvector basis sets, Journal of Chemometrics 13 (1999) 111–132.
[29] S. de Jong, PLS fits closer than PCR, Journal of Chemometrics 7 (1993) 551–557.
[30] A. Lorber, B.R. Kowalski, Alternatives to cross-validatory estimation of the number of factors in multivariate calibration, Applied Spectroscopy 44 (1990) 1464–1470.
[31] R.J. Barnes, M.S. Dhanoa, S.J. Lister, Standard normal variate transformation and detrending of near-infrared diffuse reflectance spectra, Applied Spectroscopy 43 (1989) 772–777.