Feature selection using the Kalman filter for classification of multivariate data


Analytica Chimica Acta 335 (1996) 11-22

Wen Wu, Sarah C. Rutan¹, Antonietta Baldovin, Désiré-Luc Massart*

ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium

* Corresponding author. Fax: +32 2 417 41 35.
¹ Permanent address: Box 842006, Department of Chemistry, Virginia Commonwealth University, Richmond, VA 23284-2006, USA.

Received 16 April 1996; revised 16 July 1996; accepted 19 July 1996

Abstract

A Kalman filter is developed as a feature selection method and classifier for multivariate data. Three near-infrared (NIR) data sets and a pollution data set are analyzed. For the two most difficult data sets (data sets 1 and 3), the Kalman filter successfully selects the wavelengths which lead to very good results, with a correct classification rate (CCR) equal to one. These results are much better than the best results obtained from regularized discriminant analysis (RDA) using the Fourier transform (FT), principal component analysis (PCA) and univariate feature selection methods as the variable reduction methods. For the second data set, which consists of more than two classes, the Kalman filter gives similar results (CCR=1) to those of RDA. For the pollution data set (data set 4), the Kalman filter gives similar results to partial least-squares (PLS) using fewer variables. The disadvantage of the Kalman filter is that it needs more memory and more computing time than PLS. The potential hazards of overfitting have been considered.

Keywords: Chemometrics; Classification; Feature selection; Kalman filter; Infrared spectrometry

1. Introduction

Multivariate linear regression (inverse MLR), principal component regression (PCR) and partial least-squares (PLS) are the three common multivariate methods used in calibration and classification of chemical data. For spectral data such as ultraviolet-visible (UV-Vis), infrared (IR) and near-infrared (NIR) data, the number of variables (wavelengths) is much higher than the number of objects; therefore, classical inverse MLR cannot be directly applied because of the matrix singularity problem. PLS and PCR can be used directly for such ill-conditioned data by extracting the latent variables (factors). The number of latent variables is lower than the number of objects [1-8].

In fact, all three methods have a common point in that all of them model data using a linear least-squares fitting technique. This means that they build linear models between an independent matrix X (spectral data) and a dependent matrix Y (concentration or class label)

Y = X*B + Error    (1)


and estimate the B matrix using least-squares fitting techniques.

Recently some researchers have applied PLS with selected variables, and have found that the prediction ability of the PLS model is improved when a limited number of variables is used [2-8]. In these reports, some of the authors selected variables based on weights of loadings in a PLS model, such as in Martens [4], Frank's Intermediate Least-Squares method [5] and Lindgren's Interactive Variable Selection method [6], and others based on b-coefficients, such as Garrido Frenich's work [7]. When the b-coefficients are used, the standard deviation (sb) of the b-coefficient is difficult to calculate, so the X matrix is standardized before modelling. After standardization, the standard deviation of each variable is constant, so it is reasonable to select the variables which have the largest absolute values of the b-coefficient. However, such standardization increases the influence of the non-informative or noise variables (such as baseline wavelengths in NIR data), and leads to model bias. Therefore, Centner suggests selecting variables according to the ratio of the b-coefficient to the corresponding standard deviation [8], which seems quite logical. However, the problem is how to calculate the standard deviation of the b-coefficients in PLS. Centner solves this problem through leave-one-out cross-validation. During cross-validation, a set of b-coefficients is calculated, and the mean and standard deviation are used to estimate the ratio of the b-coefficient to the standard deviation. This method overcomes the biased model problem. However, cross-validation is very time consuming, especially in classification problems. In classification, the number of objects is often much higher than that in calibration, and the number of variables in the Y matrix is usually greater than one, which leads to more than one column of b-coefficients.

Another possible approach to solve the singularity problem in classical inverse MLR is the Kalman filter [9-13]. The Kalman filter is a recursive, point-by-point, least-squares fitting technique. It can be used to build a linear model between X and Y. The parameters of the model are calculated iteratively, and no matrix inversion operator is necessary in this technique. When the number of variables is higher than the number of objects, the Kalman filter still yields a solution, but with a higher risk of overfitting.

For PLS and PCR, the model is not directly connected with the original variables, which makes the model difficult to interpret. The Kalman filter, however, gives a direct model. In PLS and PCR, the standard deviations of b-coefficients are difficult to estimate, whereas the Kalman filter directly gives the covariance matrix for the b-coefficients, which can be easily used to calculate the ratio of the b-coefficient to its standard deviation to select the variables for the model. These theoretical advantages prompted us to apply the Kalman filter to select features for classification of multivariate data, especially of NIR data. The goal of this paper is to improve the prediction ability of multivariate classifiers by selecting a small number of informative variables.

2. Theory

2.1. Kalman filter

The Kalman filter [9-13] was first created by R.E. Kalman to process data for problems in satellite orbit mechanics [9]. In essence, it is a recursive linear parameter estimation filter, or a least-squares fitting technique. It has been used in calibration, multicomponent analysis, kinetic analysis, outlier detection and so on. A tutorial giving an explanation and applications of the Kalman filter has appeared in the chemometric literature [10]. Here just a brief description relevant to our work is presented.

We have a linear model in Eq. (1), where X (m×n) is a spectral data matrix with m objects and n variables, and Y (m×p) is a class label matrix with m objects and p classes. To eliminate the constant term in the model, both X and Y have been centred before modelling. B (n×p) is the b-coefficient matrix with p column vectors of b-coefficients. In the Kalman filter, the B matrix (parameters) and its covariance matrix P (n×n) are estimated in an iterative way using the simplified equations shown in Table 1. In [10], the Kalman filter is based on a one-column Y matrix (dependent variable), similar to PLS1. The formulae in Table 1 have been extended to deal with a p-column Y matrix, so that the Kalman filter can be used like PLS2. This calculation is equivalent to running p independent Kalman filters.


Table 1
Kalman filter equations

B(k) = B(k - 1) + K(k)[Y(k) - X(k)B(k - 1)]                          (2)
P(k) = [I - K(k)X(k)]P(k - 1)[I - K(k)X(k)]' + K(k)r(k)K'(k)         (3)
K(k) = P(k - 1)X'(k)[X(k)P(k - 1)X'(k) + r(k)]^(-1)                  (4)

n: number of system states (number of variables); m: number of objects; k = 1:m: index indicating the measurement number (index of the object to be filtered); B(k): system state matrix (n×p) (b-coefficient matrix); Y(k): measurement vector (1×p) (class membership for the kth object); X(k): measurement function vector (1×n) (spectrum for the kth object); K(k): Kalman gain weighting factor (n×1); P(k): system state covariance (n×n) (covariance matrix of B); r(k): measurement noise variance (scalar).

Table 1 does not include the extrapolation equations, since the b-coefficients are time-invariant in our model, and these equations have no effect on the results. Each object's data (X(k), Y(k)) are entered into the filter, and the B matrix is updated by Eq. (2). The P matrix, which describes the precision of the b-coefficients, is updated by Eq. (3). The vector K(k) in Eq. (4) contains the weighting factors and is called the Kalman gain, which determines the importance of using the kth object's data to improve the parameter estimates. From Eq. (4), one can see that the gain is inversely proportional to the variance of the measurement noise r(k). This means that relatively noise-free data will be weighted heavily in the update calculation, whereas noisy data will be given correspondingly less weight. Since the data are processed object-by-object, it is fairly easy to change the variance of the noise and the weighting factors, leading to a convenient implementation of weighted least-squares fitting.

It is quite simple to use the Kalman filter in practice. For the Kalman filter, a value of r(k) must be determined at the beginning. According to our experience, we use a small value of 2.5×10⁻⁸ as the default value. The results of the Kalman filter are relatively stable when different initial values of P and B are used, since our model is a linear model. In our study, we use a zero matrix for the initial value of B and the identity matrix as the initial default value for P. Then for each object k (from object 1 to object m), the matrices P and B and the vector K are updated by Eqs. (2)-(4) using the data of the kth object as input. After all objects have been evaluated, the final values of P and B are the output of the Kalman filter.
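As an illustration only (not code from the paper), the recursion of Eqs. (2)-(4) can be written in a few lines of Python. The function name, the NumPy implementation and the default arguments below are our own sketch; it processes the centred data object-by-object for a p-column Y matrix, with B(0) = 0, P(0) = p0*I and a small scalar noise variance r as described above.

import numpy as np

def kalman_filter_fit(X, Y, r=2.5e-8, p0=1.0):
    """Recursive least-squares (Kalman filter) estimate of B in Y = X*B + Error.
    X: (m, n) centred spectra; Y: (m, p) centred class labels.
    Returns B (n, p) and the covariance matrix P (n, n) of the b-coefficients."""
    m, n = X.shape
    p = Y.shape[1]
    B = np.zeros((n, p))                   # B(0) = 0
    P = p0 * np.eye(n)                     # P(0) = p0 * I (identity matrix by default)
    I = np.eye(n)
    for k in range(m):                     # enter the objects one by one
        x = X[k:k + 1, :]                  # X(k), shape (1, n)
        y = Y[k:k + 1, :]                  # Y(k), shape (1, p)
        K = P @ x.T / (x @ P @ x.T + r)    # Eq. (4): Kalman gain, shape (n, 1)
        B = B + K @ (y - x @ B)            # Eq. (2): update of the b-coefficients
        A = I - K @ x
        P = A @ P @ A.T + K @ K.T * r      # Eq. (3): covariance update
    return B, P

The standard deviations of the b-coefficients needed for feature selection (step 2 in Section 2.2) are then simply the square roots of the diagonal elements of P.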

The dimension of the P matrix is n×n. When the number (n) of variables is high, such as for an NIR data set with 500 or more variables, the computer needs a very large memory to hold the P matrix and a lot of time to update the P matrix. One of the alternative algorithms of the Kalman filter is based on using a square root matrix S to represent the P matrix [13]. This square root algorithm is usually less susceptible to round-off errors and is particularly effective in stabilising the filter. It might be possible to save some memory using the square root filter when the number of elements in P is large (by using lower precision variables); however, the dimension of the S matrix is not reduced compared to the P matrix, and the square root filter has a significantly larger computational burden than the algorithm used here [13].

2.2. Feature selection

The principle of feature selection is to select the variables based on the ratio of the b-coefficient to its standard deviation [8]. This ratio represents the importance of the variables which are used to build the model. One selects those variables which have the higher ratios, and eliminates those for which the ratios are lower. The optimal number of variables is decided by cross-validation. The feature selection procedure is described by the following 11 steps, which can be divided into three parts: to select a pre-defined number of variables, to optimize the number of variables, and to eliminate collinear variables.

The following steps are used to select a pre-defined number of variables:

(1) Use the Kalman filter to calculate B and P.

(2) Calculate the standard deviation of the b-coefficients: (sb)(n×1) = sqrt(diag(P)), where diag means to take the diagonal elements of the P matrix, and sqrt means the square root of the elements of the matrix.

(3) For each column in the B matrix (b), calculate (b./sb)(n×1), where b./sb means every element of vector b is divided by the corresponding element of vector sb.

(4) Rank the absolute values of b./sb from large to small.


(5) If there is only one column of B, go to step 7.

(6) If there is more than one column in B, repeat steps 3 and 4 p-1 times.

(7) For each column in B, choose the first np (a pre-defined number) variables after ranking, and find the mc common variables for the set of different columns. The number mc depends on how many variables end up being in common.

After steps 1-7, with np pre-defined variables, the common variables are identified (and are ranked according to their importance to the model). The purpose of the following steps is to decide how many variables should be selected.

(8) With a different pre-defined number (np), the common variables are again selected. For the mc common variables, use the Kalman filter to build a model and calculate the correct classification rate (CCR) based on leave-one-out cross-validation. The CCR is calculated as described previously [14].

(9) Plot the CCR as a function of the number of variables to find the optimal number of variables, where the maximal CCR is reached.

After step 9, highly collinear variables which make the model unstable are sometimes included among the selected variables. Collinearity can be detected by several parameters such as the tolerance, the variance inflation factor (VIF), the eigenvalues or the condition indexes [15]. Here we use the simplest paired correlation coefficient. This selection is not necessarily, or even probably, the best possible choice, but this aspect was considered to be unimportant for the aims of the study. Therefore, the effect of collinearity is simply studied by the following steps:

(10) Calculate the correlation coefficients (r) for the selected variables.

(11) Eliminate the variables which have the highest values of r and have lower rankings, e.g. if variables 1, 2 and 3 are highly correlated (r>0.9999) and they are ranked in the order of 2, 1 and 3 in step 4, we will eliminate variables 1 and 3 because they are less important than variable 2.

Once these variables have been identified, the Kalman filter is run a final time to establish the coefficients in the model.
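As an illustration only (not code from the paper), the ranking and collinearity steps above can be sketched in Python as follows. The helper kalman_filter_fit is the hypothetical sketch from Section 2.1, the leave-one-out evaluation of the CCR in steps 8-9 is omitted for brevity, and all function names are our own.

import numpy as np

def rank_variables(B, P):
    """Steps 2-6: order the variables by |b|/sb for every column of B."""
    sb = np.sqrt(np.diag(P))                       # step 2
    ratios = np.abs(B) / sb[:, None]               # step 3, all p columns at once
    # steps 4 and 6: per column, variable indices from large to small ratio
    return [np.argsort(-ratios[:, j]) for j in range(B.shape[1])]

def select_common(order_per_column, n_pre):
    """Step 7: common variables among the first n_pre of every ranking."""
    sets = [set(order[:n_pre]) for order in order_per_column]
    common = set.intersection(*sets)
    # keep the importance order of the first column (a simplification)
    return [v for v in order_per_column[0] if v in common]

def drop_collinear(X, selected, threshold=0.9999):
    """Steps 10-11: drop a lower-ranked variable when it is highly correlated
    (|r| > threshold) with a higher-ranked variable that is already kept."""
    kept = []
    for v in selected:                             # 'selected' is ordered by importance
        r_to_kept = [abs(np.corrcoef(X[:, v], X[:, w])[0, 1]) for w in kept]
        if all(r <= threshold for r in r_to_kept):
            kept.append(v)
    return kept

In use, one would call select_common(rank_variables(B, P), n_pre) for a series of pre-defined numbers n_pre, keep the number of common variables that maximizes the leave-one-out CCR (steps 8 and 9), and finally apply drop_collinear with the chosen threshold.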

3. Experimental

3.1. Data

Four data sets were studied. Data sets 1-3 are NIR data sets which have been used in [17,18]. Data set 4 is an industrial pollution data set; it has also been used in [20]. Fig. 1a-d are the score plots of PC1 versus PC2 for the four data sets, and show that there is overlap between the classes for these data sets.

Here we describe them briefly. Data set 1 contains 40 spectra (1318-2340 nm, 512 wavelengths) of pure para-xylene and para-xylene with 0.30% ortho-xylene; each class contains 20 spectra. The goal is to identify if a sample is pure or impure.

Data set 2 consists of 60 spectra (1376-2398 nm, 512 wavelengths) of three batches of excipients which are made by mixing cellulose, mannitol, sucrose, sodium saccharin and citric acid in different proportions. Each class contains the spectra measured for 20 samples of the same batch of excipients. The aim is to quickly distinguish different batches of excipients in order to decide procedures for pharma- ceutical formulation.

Data set 3 contains 95 spectra (1130-2152 nm, 512 wavelengths) of pure butanol and butanol containing different concentrations of water (from 0.02% to 0.32%). The pure class consists of 42 spectra and the impure class consists of 53 spectra. The goal for this data set is the same as for data set 1.

Data set 4 contains 49 objects related to two different kinds of pollution (galvanization and steelworks sludges) with 10 variables including the pH, Cd, Cr, Pb, Cu, residual at 105°C, residual at 660°C, Zn, CN- and Ni. Three outlier objects have been eliminated in our previous work using several outlier detection techniques [20]. Class 1 consists of 30 objects and class 2 consists of 19 objects. The data set is standardized before analysing to correct for the effect of the different magnitudes of the variables. The goal for this data set is to find a method which requires a minimum number of measurements to discriminate the data.



Fig. 1. Score plots for PC1 versus PC2: (a) data set 1; (b) data set 2; (c) data set 3; (d) data set 4.

4. Results and discussion

4.1. Extension of the Kalman filter

As described in previous literature, the Kalman filter is normally used to deal with a Y matrix with only one column. If there is more than one column, one can make several models with each column separately. In classification, there are often several classes, so the Y matrix has several columns. If we make models with each column separately, as is done for PLS1, it is very time consuming. In Table 1, the formulae have been modified so that the Kalman filter can be used to model several columns of Y simultaneously. Compared with the normal Kalman filter, the symbols are not changed; we just modified the state B(k) from a vector (n×1) to a matrix (n×p), and changed Y(k) from a scalar into a row vector (1×p). The results can be shown to be identical to those obtained by defining several state vectors independently. For instance, if we have two columns of Y, our modified Kalman filter gives a B matrix with two columns. Each column is exactly the same as that obtained by using the corresponding column of Y separately, and the same P matrix is also obtained. This allows the Kalman filter to be used conveniently as a classifier or multicomponent calibrator like PLS2. This situation is directly analogous to inverse MLR, where the simultaneous model yields results that are exactly the same as those obtained using separate models.
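This equivalence is easy to check numerically with the hypothetical kalman_filter_fit sketch given in Section 2.1 (the data below are synthetic and purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8)); X -= X.mean(axis=0)    # centred "spectra"
Y = rng.normal(size=(20, 2)); Y -= Y.mean(axis=0)    # centred two-column class labels

B_joint, P_joint = kalman_filter_fit(X, Y)           # both columns at once
B0, P0 = kalman_filter_fit(X, Y[:, [0]])             # first column alone
B1, P1 = kalman_filter_fit(X, Y[:, [1]])             # second column alone

assert np.allclose(B_joint, np.hstack([B0, B1]))     # same b-coefficients
assert np.allclose(P_joint, P0) and np.allclose(P_joint, P1)   # same P matrix

The P matrix is identical in both cases because the updates of Eqs. (3) and (4) depend only on X and r(k), not on Y.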

4.2. Data set 1

This is a data set which was specially designed in our laboratory to make classification more challenging.



Fig. 2. The mean spectra of the two classes in data set 1; "+" denotes the variables selected by the Kalman filter.

This can be seen from the mean spectra of the two classes (Fig. 2). No difference between them can be seen at this scale. In our previous work [17], we have analyzed these data with different kinds of classifiers, i.e. LDA (linear discriminant analysis), QDA (quadratic discriminant analysis) and RDA (regularized discriminant analysis), using PCA, FT (Fourier transform) and a univariate feature selection method to reduce the number of variables. The standard normal variate (SNV) transform was also used to pre-treat the data [16]. The results from those studies are listed in Table 2. It shows that the best CCR of leave-one-out cross-validation using all those methods is only 0.775. Walczak and van den Bogaert have analyzed this data set with the very powerful wavelet packet transform (WPT) method [19], where the best result was 0.975 (Table 2). These results demonstrate that this data set is extremely difficult to discriminate.

With the original data, the leave-one-out cross-validation result from the Kalman filter method is CCR=1 (Table 2). This result is extremely good.

In these data, there are only two classes, so the rank of the Y matrix is only 1. After step 1 (described in Section 2.2), only one column of b-coefficients (in matrix B) is necessary. Fig. 3 displays the ratio of the absolute value of the b-coefficient to the standard deviation. The high degree of noise apparent in this plot may indicate that overfitting is occurring and/or that many of the variables are collinear; however, the absolute value of the ratio is the largest in two regions near the 200th and 400th wavelengths, which correspond to the peak regions in the original spectra (Fig. 2). This indicates that the Kalman filter tends to give a high weight and small standard deviation to the lower noise wavelengths, and low weight and high standard deviation to the high noise wavelengths.

Table 2
The classification results (CCR) from different dimensionality reduction methods

Data set   Kalman filter   FS-RDA^a           FT-RDA^a            PCA-RDA^a           WPT^b
1          1.0000 (60)     0.7750 (6) (1,0)   0.7250 (4) (0.6,0)  0.6250 (4) (1,1)    0.975
2          1.0000 (12)     0.9333 (9) (1,0)   1.0000 (8) (0,0)    1.0000 (7) (1,0)    1.000
3          1.0000 (100)    0.8211 (3) (0,1)   0.8526 (4) (0,0)    0.8526 (21) (0,0)   -

FS-RDA: univariate feature selection method to select variables pre-treated by SNV as the input of RDA; FT-RDA: select the first few Fourier coefficients except the first one as the input of RDA; PCA-RDA: use the first few PC factors after ranking as the input of RDA with SNV data; WPT: wavelet packet transform. The optimal number of variables or factors is listed between brackets; the (λ, γ) for RDA is also expressed between brackets.
^a Results taken from [17].
^b Results taken from [19].

Fig. 3. The ratio of the absolute value of the first column b-coefficient/standard deviation versus wavelength; data set 1.


Fig. 4. The ratio of the absolute value of the first column b-coefficient/standard deviation after ranking from large to small; data set 1.

After ranking this ratio from large to small, we obtain Fig. 4. There is no specific trend indicating that a certain number of wavelengths should be selected. We decided to select every set of 20 variables from 20 to 120 as the pre-defined number of features in step 8. The leave-one-out cross-validation results (Fig. 5) show that when 60 variables are selected, the Kalman filter gives a 100% correct classification rate. In step 10, we find that the correlation coefficients between the 60 selected wavelengths are all smaller than 0.6, so there are no highly collinear variables to be deleted in step 11. The selected features (Fig. 2) correspond to the chemically characteristic wavelengths for para-xylene and ortho-xylene. The reduction of the number of wavelengths from 512 to 60 means that the risk of overfitting is correspondingly reduced.

Fig. 5. The classification results (CCR) with different numbers of selected variables based on leave-one-out cross-validation; data set 1.

Fig. 6. The classification results (CCR) based on leave-one-out cross-validation with different numbers of selected variables; data set 2.

4.3. Data set 2

This data set is an example with more than two classes. There are three classes, so two columns of matrix B are taken into account. Based on these two columns' ratios of the absolute values of the b-coefficient to the standard deviation after ranking, 20-160 variables in increments of 20 are selected as the pre-defined number of variables in step 8. The common variables of the 20, 40, ..., 160 variables are obtained, and the numbers of variables in common are 9, 19, 26, 35, 46, 60, 77 and 91.

With these common variables, the cross-validation results (Fig. 6) show that with 19 variables, the Kalman filter leads to a 100% correct classification rate. With these 19 wavelengths, we eliminate the highly correlated variables with different threshold values of the correlation coefficient (r) (0.9, 0.99, 0.999, 0.9999).

Table 3 shows the results when the collinear variables whose correlation coefficient values are higher than 0.999 are eliminated. The Kalman filter still gives very good results with the remaining 12 variables. When, however, the variables with r>0.99 are eliminated, the Kalman filter does not give good results.


Table 3
The classification results (CCR) after deleting the highly correlated variables (r higher than the different threshold values); data set 2

Threshold value   No. of features   CCR
0.9000            2                 0.3500
0.9900            5                 0.7667
0.9990            12                1.0000
0.9999            19                1.0000

Even though many of the remaining variables are still highly collinear, they apparently can provide information which leads to improved predictions. Therefore, the finally selected variables are the 12 variables shown in Fig. 7. This classification result (CCR=1) is the same as that obtained for most of the other classification methods, as shown in Table 2.

Fig. 7. The mean spectra of the three classes in data set 2; "+" denotes the variables selected by the Kalman filter.

Usually the Kalman filter is used to estimate the parameters of the measurement model, so r(k) is the measurement noise variance. However, Eq. (1) is an inverse measurement model, and not a direct measurement model, so it is more difficult to interpret the meaning of r(k). According to error propagation, r(k) includes not only the measurement error in the X matrix, but also the model error in the Y matrix. To study the effects of the initial values of P(0) and r(k), we systematically changed the values of r(k) and P(0) in the Kalman filter using the final set of 12 selected variables. The results are shown in Table 4. From the first three rows, we can see that when the ratio r(k)/P(0) is fixed, the root mean square error of prediction (RMSEP) is constant because the b-coefficients are constant. If the ratio r(k)/P(0) is fixed, the standard deviation is proportional to the square root of P(0), so the ratio of the b-coefficient to the standard deviation is inversely proportional to the square root of P(0). Figs. 8a and b show the ratio of the absolute value of the first column b-coefficient/standard deviation versus the 12 selected wavelengths obtained from the Kalman filter with r(k)=2.5×10⁻⁴, P(0)=1×10⁶, and with r(k)=2.5×10⁻⁸, P(0)=1×10². The shape of this ratio versus variable plot in Fig. 8a does not change compared to that in Fig. 8b, but the magnitude of the values changes.

Table 4
Classification results (CCR) with different initial values of r(k) and P(0) in the Kalman filter with 12 selected variables

r(k)        P(0)    r(k)/P(0)   max(b./sb)^a   max(b./sb)^b   CCR
2.5×10⁻⁴    1×10⁶   2.5×10⁻¹⁰   2.73×10¹       4.19×10¹       1.0000
2.5×10⁻⁸    1×10²   2.5×10⁻¹⁰   2.73×10³       4.19×10³       1.0000
2.5×10⁻¹⁰   1       2.5×10⁻¹⁰   2.73×10⁴       4.19×10⁴       1.0000
2.5×10⁻⁹    1       2.5×10⁻⁹    8.69×10³       1.32×10⁴       1.0000
2.5×10⁻⁸    1       2.5×10⁻⁸    2.86×10³       4.27×10³       1.0000
2.5×10⁻⁷    1       2.5×10⁻⁷    1.14×10³       1.47×10³       0.9833
2.5×10⁻⁶    1       2.5×10⁻⁶    5.77×10²       5.80×10²       0.9667
2.5×10⁻⁵    1       2.5×10⁻⁵    1.61×10²       1.69×10²       0.4667
2.5×10⁻⁵    10      2.5×10⁻⁶    1.82×10²       1.83×10²       0.9667
2.5×10⁻⁵    1×10²   2.5×10⁻⁷    1.14×10²       1.47×10²       0.9833
2.5×10⁻⁵    1×10³   2.5×10⁻⁸    9.03×10¹       1.35×10²       1.0000
2.5×10⁻⁵    1×10⁴   2.5×10⁻⁹    8.69×10¹       1.32×10²       1.0000
2.5×10⁻⁵    1×10⁵   2.5×10⁻¹⁰   8.65×10¹       1.33×10²       1.0000

^a Maximum value of the ratio of the absolute value of the first column b-coefficient/standard deviation.
^b Maximum value of the ratio of the absolute value of the second column b-coefficient/standard deviation; data set 2.

Fig. 8. The ratio of the absolute value of the first column b-coefficient/standard deviation versus the 12 selected wavelengths obtained from the Kalman filter with: (a) r(k)=2.5×10⁻⁴, P(0)=1×10⁶; (b) r(k)=2.5×10⁻⁸, P(0)=1×10²; data set 2.

This indicates that the ratio of the b-coefficient to the standard deviation obtained by the Kalman filter only has a relative meaning. We can also see from this table that when the ratio is between 2.5×10⁻¹⁰ and 2.5×10⁻⁸, there is no difference among the CCRs; however, when the ratio is higher than 2.5×10⁻⁸, the CCR decreases substantially. This suggests that our results are quite stable when the ratio r(k)/P(0) is between 2.5×10⁻¹⁰ and 2.5×10⁻⁸, and also indicates that the default values of r(k) and P(0) are reasonable. If r(k) is fixed at 2.5×10⁻⁵ and P(0) is increased from 1 to 1×10⁵, the CCR increases first, and then converges when the ratio is between 2.5×10⁻¹⁰ and 2.5×10⁻⁸.
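This scaling behaviour follows directly from Eqs. (2)-(4): multiplying both r(k) and P(0) by the same factor c leaves the gain K(k), and therefore B, unchanged, while P (and hence the squared standard deviations of the b-coefficients) is multiplied by c. A small numerical check with the hypothetical kalman_filter_fit sketch from Section 2.1 (using synthetic, purely illustrative data) could look like this:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 12)); X -= X.mean(axis=0)   # e.g. 60 objects, 12 variables
Y = rng.normal(size=(60, 2));  Y -= Y.mean(axis=0)   # centred two-column labels

c = 1.0e4
B1, P1 = kalman_filter_fit(X, Y, r=2.5e-8,     p0=1.0)
B2, P2 = kalman_filter_fit(X, Y, r=2.5e-8 * c, p0=c)

assert np.allclose(B1, B2)        # fixed r(k)/P(0): identical b-coefficients
assert np.allclose(P2, c * P1)    # P, and therefore sb**2, scales with P(0)
# hence |b|/sb is divided by sqrt(c), the inverse square-root dependence noted above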

4.4. Data set 3

This data set is known to be a difficult one, in the sense that good classification is difficult. Cross-validation results (Fig. 9) show that when 100 variables are selected, the Kalman filter correctly classifies all of the objects.

With these 100 variables, the results after elimination of the most collinear variables are shown in Table 5. As was the case for the previous data set, it is advantageous to retain more variables, even though they are highly collinear.


Fig. 9. The classification results (CCR) based on leave-one-out cross-validation with different numbers of selected variables; data set 3.

Table 5
The classification results (CCR) after deleting the highly correlated variables (r higher than the different threshold values); data set 3

Threshold value   No. of features   CCR
0.90000           9                 0.8000
0.99000           32                0.7684
0.99900           66                0.9158
0.99990           96                0.9789
0.99999           100               1.0000

Fig. 10. The mean spectra of the two classes in data set 3; “+” denotes the variables selected by the Kalman filter.


Table 6
The classification results (CCR) of different feature selection methods for data set 4 (PLS (bw): select variables according to the b-coefficient after standardising the data)

Method          Features   Factors   CCR
Kalman filter   2          -         0.9184
PLS (bw)^a      3          1         0.9388

^a Results taken from [20].

The 100 features selected by the Kalman filter are presented in Fig. 10. The classification results for the Kalman filter, together with our previous results from RDA, are shown in Table 2. It shows that the Kalman filter dramatically improves the CCR from 0.8526 to 1.0.

The reason why the Kalman filter gives very good classification results is that the Kalman filter can be used directly to classify data sets whose variables-to-objects ratio is so high that even RDA cannot be used directly. The Kalman filter allows us to identify the most significant variables, which are used for the final model.

4.5. Data set 4

This is a pollution data set which has been used by Baldovin et al. [20]. Baldovin applied PLS with ranked b-coefficients after standardizing the data to select variables. Her results are listed in Table 6. The results of the Kalman filter are also listed in this table. The results show that the leave-one-out cross-validation CCR obtained from the Kalman filter is slightly worse than that from PLS. However, the difference is not significant, and the Kalman filter uses two variables only, while PLS uses three. In this case, the more the variables, the higher the cost. Considering both the CCR and the number of variables, we prefer the Kalman filter to PLS in this case.

Fig. 11 shows the absolute value of the ratio of the b-coefficient to the standard deviation. This figure shows that there are three variables (2, 7 and 10) where the absolute value of the b-coefficient is relatively high, its standard deviation low, and the ratio also high. These variables correspond to the three variables with significantly different parameter values for the two different classes (Fig. 12).


Fig. 11. The ratio of the absolute value of the first column b-coefficient/standard deviation versus variable; data set 4.

Fig. 12. The mean value ± standard errors for the original data of the two classes versus the index of variables in data set 4; "*" denotes the variables selected by the Kalman filter.

This means that the Kalman filter has correctly modelled the data by giving high weights to these "good" variables.

Since the number of variables is not very high in this data set, we select 1, 2, ..., 10 as the pre-defined number of variables in step 8. The leave-one-out cross-validation results (Fig. 13) show that the CCR is maximal (0.9184) when two variables are selected. Since the correlation coefficient of the two variables is only -0.09, no highly correlated variables can be eliminated in steps 10 and 11.


Fig. 13. The classification results (CCR) based on leave-one-out cross-validation with different numbers of selected variables; data set 4.

4.6. Independent test set

One possibility that must be considered when using the Kalman filter for variable selection and classification or calibration is overfitting. In general, for any calibration or classification method, when the ratio of the number of variables to objects is high, the possibility exists that a very good fit or class prediction will be obtained based on random contributions to the data. When a large number of variables is used in both PLS and PCR, there is some risk of overfitting the data if too many latent variables are retained in the model. Overfitting results in a model that cannot provide reliable predictions for independent test samples, even though the fit quality for the training set is good. The appropriate number of latent variables is normally determined using leave-one-out cross-validation. When feature selection is used in PLS, an additional leave-one-out cross-validation step is also needed to determine the number of features [7]. In using the Kalman filter, leave-one-out cross-validation can also be used to determine the appropriate number of spectral features. Therefore, the Kalman filter is simpler than PLS with feature selection. However, the risk of overfitting is somewhat greater when using the Kalman filter as a tool for feature selection. In using all three methods (PCR, PLS and Kalman filter), other additional validation steps may be required to minimize the risk of overfitting. This topic is currently under investigation in this research group.

Table 7
The results of the Kalman filter with an independent test set (LOO: the results of leave-one-out cross-validation for the training set)

Data set   No. of objects   No. of objects   No. of     CCR      CCR
           (training set)   (test set)       features   (LOO)    (test set)
1          30               10               35         0.9667   0.9000
2          45               15               18         1.0000   1.0000
3          74               21               140        1.0000   0.8095
4          39               10               3          0.9231   0.9000

Although leave-one-out cross-validation can overcome the problem of overfitting to some extent, the resulting model may still give very large errors for independent test set samples. To evaluate this possibility, all four data sets were divided into training sets and test sets, using a Kennard-Stone selection procedure [14]. The Kalman filter is applied to select the variables using the training set, and the final model is evaluated using the independent test set. The results are shown in Table 7.
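A minimal sketch (our own illustration, not code from the paper) of the Kennard-Stone selection used for this split is given below: the two most distant objects are taken first, and further training objects are then added one by one as the object farthest from those already selected. The function name and the Euclidean distance choice are assumptions made for illustration.

import numpy as np

def kennard_stone(X, n_train):
    """Return the indices of n_train representative objects of X (m, n)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    i, j = np.unravel_index(np.argmax(d), d.shape)              # two most distant objects
    selected = [int(i), int(j)]
    while len(selected) < n_train:
        remaining = [k for k in range(len(X)) if k not in selected]
        # distance of every remaining object to its closest selected object
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(min_d))])       # add the farthest one
    return np.array(selected)

For data set 1, for example, train_idx = kennard_stone(X, 30) would give the training objects; the variables are then selected on the training rows only, and the CCR of the final model is computed on the remaining (test) objects.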

For data set 1, the CCR from leave-one-out cross-validation of the training set is 0.9667 with 35 variables, and the CCR of the independent test set is only a little lower (0.9000). For data set 2, the CCR from cross-validation of the training set is 1.0 with 18 variables, and the CCR of the independent test set is also 1.0. The selected variables are similar to those selected based on the entire data set. For data set 3, the CCR from cross-validation of the training set is 1.0 with 140 variables, and the CCR of the independent test set is 0.8095. For data set 4, the optimal CCR from leave-one-out cross-validation of the training set is 0.9231 using three variables. The CCR of the test set is 0.9000. These results show that there is some overfitting on data set 3. These results also indicate that the higher the number of selected variables, the larger the difference between the CCRs of the training set and the test set, probably due to the corresponding increase in the risk of overfitting. Further studies to develop appropriate validation procedures to check for the possibility of overfitting are in progress.

When there is multicollinearity, one expects that the ratio |b|/sb would lead to the false conclusion that most of the informative variables are insignificant. This is not the case for the four studied data sets.


For data sets 2 and 3, even highly correlated variables are selected to give good classification results, because |b|/sb is still very large for these informative variables which are highly correlated. The feature of the Kalman filter that is exploited here is that multicollinearities among the variables do not prevent a result from being obtained. When variables are multicollinear, unstable coefficients are obtained (i.e. the coefficients have large standard deviations); however, the predicted results are still reliable. This is a separate issue from true overfitting, where random contributions to the data help in determining the values for the coefficients of the model. In that case, the model will not be useful in providing accurate predictions for the independent test samples.

5. Conclusion

The regular Kalman filter is extended to deal with more than one dependent variable. Like PLS2, the proposed Kalman filter approach can be used as a classifier or multicomponent calibrator in the analysis of NIR spectral data. For the two most difficult NIR data sets (data sets 1 and 3), the Kalman filter successfully selects the wavelengths which lead to excellent classification results (CCR=1). The results are much better than the best results of RDA using the Fourier transform, PCA and univariate feature selection methods as variable reduction methods. For the NIR data set (data set 2) which consists of more than two classes, the Kalman filter gives the same good results (CCR=1) as RDA. For the pollution data set (data set 4), the Kalman filter gives similar results to PLS with fewer variables. Therefore, the Kalman filter appears to be a very powerful feature selection method for classification problems. The results obtained from the evaluation of the independent test sets suggest that the greater the number of variables selected, the higher the probability of overfitting the data. The disadvantage of the Kalman filter is that it needs more memory and more computing time than PLS. This disadvantage is not considered to be very important, because memory sizes are becoming larger and computers faster.

Acknowledgements

The first author thanks Vita Centner, Bas van den Bogaert and Beata Walczak for kindly providing their unpublished papers.

References

[1] P. Geladi and B.R. Kowalski, Anal. Chim. Acta, 185 (1986) 1.
[2] D. Jouan-Rimbaud, B. Walczak, D.L. Massart, I.R. Last and K.A. Prebble, Anal. Chim. Acta, 304 (1995) 285.
[3] M. Baroni, S. Clementi, G. Cruciani, G. Costantino, D. Riganelli and E. Oberrauch, J. Chemometrics, 6 (1992) 347.
[4] H. Martens and T. Naes, Multivariate Calibration, Wiley, New York, 1991.
[5] I. Frank, Chemometrics Intell. Lab. Systems, 1 (1987) 233.
[6] F. Lindgren, P. Geladi, S. Rannar and S. Wold, J. Chemometrics, 8 (1994) 349.
[7] A. Garrido Frenich, D. Jouan-Rimbaud, D.L. Massart, S. Kuttatharmmakul, M. Martinez Galera and J.L. Martinez Vidal, Analyst, 120 (1995) 2787.
[8] V. Centner, D.L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste and C. Sterna, A method for elimination of uninformative variables for multivariate calibration, Anal. Chem., in press.
[9] S.C. Rutan and S.D. Brown, Anal. Chim. Acta, 160 (1984) 99.
[10] S.C. Rutan, Chemometrics Intell. Lab. Systems, 6 (1989) 191.
[11] S.C. Rutan, J. Chemometrics, 4 (1990) 103.
[12] S.C. Rutan, Anal. Chem., 63 (1991) 1103A.
[13] S.D. Brown, Anal. Chim. Acta, 181 (1986) 1.
[14] W. Wu, B. Walczak, D.L. Massart, S. Heuerding, F. Erni, I.R. Last and K.A. Prebble, Chemometrics Intell. Lab. Systems, 33 (1996) 35.
[15] M.J. Norusis, SPSS for Windows: Base System User's Guide, Release 5.0, 347-348, SPSS, Chicago, 1992.
[16] W. Wu, B. Walczak, D.L. Massart, K.A. Prebble and I.R. Last, Anal. Chim. Acta, 315 (1995) 243.
[17] W. Wu, B. Walczak, W. Penninckx and D.L. Massart, Feature reduction by Fourier transform in pattern recognition of NIR data, Anal. Chim. Acta, in press.
[18] W. Wu, Y. Mallet, B. Walczak, W. Penninckx and D.L. Massart, Anal. Chim. Acta, 329 (1996) 257.
[19] B. Walczak, B. van den Bogaert and D.L. Massart, Anal. Chem., 68 (1996) 1742.
[20] A. Baldovin, W. Wu, V. Centner, D. Jouan-Rimbaud, D.L. Massart, L. Favretto and A. Turello, Feature selection for the discrimination between pollution types with partial least squares modelling, Analyst, in press.