Impact of a metric of association between two variables on performance of filters for binary data


Kashif Javed a, Haroon A. Babri a, Mehreen Saeed b

a Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan
b Department of Computer Science, National University of Computer and Emerging Sciences, Lahore, Pakistan

Article info

Article history:
Received 9 March 2014
Received in revised form 1 May 2014
Accepted 31 May 2014
Communicated by Feiping Nie
Available online 11 June 2014

Keywords: Feature selection; Filters; Binary data; Classification

Abstract

In the feature selection community, filters are quite popular. The design of a filter depends on two parameters, namely the objective function and the metric it employs for estimating the feature-to-class (relevance) and feature-to-feature (redundancy) association. Filter designers pay relatively more attention to the objective function, but a poor metric can overshadow the goodness of an objective function. The metrics that have been proposed in the literature estimate relevance and redundancy differently, thus raising the question: can the metric estimating the association between two variables improve the feature selection capability of a given objective function, or in other words of a filter? This paper investigates this question. Mutual information is the metric proposed for measuring the relevance and redundancy between features for the mRMR filter [1], while the MBF filter [2] employs the correlation coefficient. Symmetrical uncertainty, a variant of mutual information, is used by the fast correlation-based filter (FCBF) [3]. We carry out experiments on the mRMR, MBF and FCBF filters with three different metrics (mutual information, correlation coefficient and diff-criterion) using three binary data sets and four widely used classifiers. We find that MBF's performance is much better if it uses the diff-criterion rather than the correlation coefficient, while mRMR with the diff-criterion demonstrates performance better than or comparable to mRMR with mutual information. For the FCBF filter, the diff-criterion also exhibits results much better than mutual information.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

In data and knowledge management systems, feature selection is frequently used as a preprocessing step for data analysis [4,5]. The aim is to remove irrelevant and redundant information from the data [6]. To formally define the feature selection problem, we first introduce some notation. Suppose that $D = \{\mathbf{x}^t, C^t\}_{t=1}^{N}$ represents a labeled data set consisting of N instances and M features, such that $\mathbf{x}^t \in \mathbb{R}^M$ and $C^t$ denotes the class variable of instance t. In this study, we assume that the variable C can attain two values, thus representing a two-class classification task. Each vector x is, thus, an M-dimensional vector of features, i.e., $\mathbf{x}^t = \{F_1^t, F_2^t, \ldots, F_M^t\}$. Let $W_k(F_i, C)$ denote the relationship estimated by the kth metric of association, where the two variables are the ith feature and C. Furthermore, F is used to refer to the set comprising all features of a data set, whereas G denotes a feature subset. The feature selection problem is to find a subset G, termed the final or optimal subset, of m features from the set F having M features, with the smallest classification error [7] or at least without a significant degradation in performance [8].

Removal of irrelevant and redundant features through a feature selection algorithm improves the effectiveness and efficiency of learning algorithms [2,9]. Additionally, the learned results become comprehensible [3]. Because of these advantages, feature selection has been a fertile field of research and development over the years. Different approaches have been suggested and are broadly categorized as filters, wrappers, and embedded methods [10]. Filters select a subset of the most useful features as a standalone task, independent of the learning algorithm [11]. Wrappers search for a good subset in the feature space with the guidance of a classifier [7]. In the embedded approach, the feature selection process is integrated into the training process of a given classifier [12]. Among these methods, filters are highly popular because of their fast speed of computation and lower chance of overfitting [13,14]. Recently, researchers have shown interest in FS methods based on sparse learning because of their good performance [15]. For example, the method of Nie et al. [16] was found to outperform the well-known minimal-redundancy-maximal-relevance (mRMR) and information gain (IG) algorithms. In this study, however, our focus is only on filter methods.

As filters select features independently of any particular learning algorithm, these methods require a function or a criterion to quantify how useful the presence of a feature in the final subset is. The function is optimized so that features highly relevant to the class but having minimum redundancy get selected. To meet this goal, a metric of association is used to estimate the relevance between a feature and the class variable and the redundancy among features. Although the metric is equally important for a filter, filter designers are usually more concerned about the working of the objective function. The usefulness of a filter (or an objective function) is established by testing it against other filters. This, however, only provides insight into the performance of a filter with a given metric.

A number of metrics have been proposed in the literature [17]. These metrics of association measure the relationship between two variables differently [18,19]. A metric that poorly estimates the relevance and redundancy of features can spoil the goodness of a filter's objective function. On the other hand, a filter that has shown poor performance against other filters can demonstrate better performance if a better metric of association is chosen for it. This aspect of a filter's working has been ignored. In this work, we analyze filters from this perspective and investigate how the performance of a given filter is affected when it uses different metrics meant for measuring the relationship between variables. For this purpose, we examine the performance of three well-known filters, namely the minimal-redundancy-maximal-relevance (mRMR) filter [1], the Markov blanket filter (MBF) [2] and the fast correlation-based filter (FCBF) [3], with three different metrics (mutual information, correlation coefficient and diff-criterion). Peng et al. [1] suggested that mutual information should be employed for the mRMR filter. We use this combination as a reference to see whether the filter's performance improves when it uses the correlation coefficient and the diff-criterion. Similarly, the MBF and correlation coefficient combination as suggested by Koller and Sahami [2] can be taken as a reference to investigate the performance of MBF with mutual information and the diff-criterion. For the FCBF, Yu and Liu [3] employed symmetrical uncertainty (a variant of mutual information). We consider FCBF and mutual information to be the reference against which the other combinations will be tested. Because of their wide applications [20-25], experiments are carried out on binary data sets. We use three binary data sets from different application domains with four widely used classifiers to evaluate the performance of the filters.

The remainder of the paper is organized into four sections. Section 2 describes the theory related to filters and discusses well-known filter methods. We also present various metrics of association. In Section 3, the experimental setup is described, while results obtained on three real-life data sets are presented and discussed in Section 4. The conclusions are drawn in Section 5.

2. Filters

The idea of filters is to separate feature selection from classification. Filters search the feature space by optimizing a function or a criterion, which is termed the objective function. The goal is to select those features in the final subset that maximize the relevance and minimize the redundancy. The function therefore acts as a proxy measure of the accuracy of the classification algorithm [14]. It employs a metric of association, which estimates the association between variables by taking the statistics of the data into account. Because of these characteristics, the filter's search for an optimal feature subset is less expensive than that of wrappers and embedded methods [26]. Filters are also known to be less prone to overfitting the training data [13]. These advantages have made filters widely used by researchers of different application domains, who are also highly active in designing new filters.

If the design of filters is taken into consideration, we can broadly identify two design parameters: an objective function with which a filter populates the final subset by searching for the most useful features, and a metric of association used for estimating the usefulness of features. The combination of these two components determines the performance of a filter. However, one can find that filters that have been proposed in the literature [27,28,14] primarily vary in the objective function. Even though a number of metrics of association have been proposed in the literature, metrics have drawn less attention from filter designers. Because of this, different filters may even employ the same metric without considering its impact on the objective function. For instance, Battiti's mutual information based feature selection (MIFS) criterion [29] and Peng et al.'s mRMR filter [1] both use mutual information as the metric of association but optimize different objective functions.

When based on a metric that is not a good estimator of the relevance and redundancy, even a well-designed objective function will not be able to select the most useful features, thus resulting in poor performance. One can find many studies in the literature [11,30] that compare the performance of different filters, but hardly any study exists which investigates the effect of the metric of association on the performance of a given objective function. The selection of a good metric is therefore important for a filter. The motivation behind this argument is illustrated with an example given in Section 2.3. But before that, let us present the metrics of association and filters that are used in this study.

2.1. Metrics of association

Because of computational issues, filters generally consider pairwise interactions among variables, i.e., feature-to-class and feature-to-feature relationships [31]. A number of metrics have been proposed in the literature and can be broadly categorized into three groups, namely correlation based, information-theoretic and probabilistic metrics [32,17]. Although the expressions given in this section are written in terms of a feature (F_i) and the class variable (C), we can also use the same expressions for estimating the redundancy between two features by replacing C with a feature, say F_j.

2.1.1. Correlation based metrics

Pearson's correlation coefficient [33] and Chi-squared statistics [34] are well-known members of this category. The correlation coefficient (CC) is designed to estimate the linear relationship between two variables and is given by

$$W_{cc}(F_i, C) = \frac{E(F_i C) - E(F_i)\,E(C)}{\sqrt{\sigma^2(F_i)\,\sigma^2(C)}} \qquad (1)$$

where E represents the expected value of a variable and $\sigma^2$ denotes its variance. When F_i and C are linearly dependent, W_cc(F_i, C) is equal to ±1, and when they are completely uncorrelated, W_cc(F_i, C) becomes 0.
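As an illustration (and not part of the paper's MATLAB implementation [39]), a minimal Python sketch of Eq. (1) for binary-valued vectors could look as follows; the function name w_cc is our own choice.

    import numpy as np

    def w_cc(f, c):
        # Pearson correlation coefficient (Eq. 1) between two equal-length
        # 1-D arrays; for binary data they contain only 0s and 1s.
        f = np.asarray(f, dtype=float)
        c = np.asarray(c, dtype=float)
        num = np.mean(f * c) - np.mean(f) * np.mean(c)   # E(F C) - E(F) E(C)
        den = np.sqrt(np.var(f) * np.var(c))             # sqrt(var(F) var(C))
        return num / den if den > 0 else 0.0             # 0 when a variable is constant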

2.1.2. Information-theoretic metrics

In this category, mutual information [29] and symmetrical uncertainty [3] are widely known. Mutual information (MI) estimates the relationship between the joint distribution p(C, F_i) of two variables C and F_i and their product distribution p(C)p(F_i) [35] and is given by

$$W_{mi}(F_i, C) = \sum_{C}\sum_{F_i} p(C, F_i)\,\log\frac{p(C, F_i)}{p(C)\,p(F_i)} \qquad (2)$$

The value of $W_{mi}(F_i, C) \geq 0$, and a larger value means that a variable contains more information about the other variable. When two variables have no information in common, W_mi(F_i, C) reduces to 0.
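A corresponding sketch of Eq. (2) for two binary variables, again an illustrative reimplementation rather than the authors' code (w_mi is our own name; logarithms are natural, so the value is in nats):

    import numpy as np

    def w_mi(f, c):
        # Mutual information (Eq. 2) between two binary variables, estimated
        # from empirical frequencies; empty cells contribute nothing to the sum.
        f = np.asarray(f, dtype=int)
        c = np.asarray(c, dtype=int)
        mi = 0.0
        for fv in (0, 1):
            for cv in (0, 1):
                p_fc = np.mean((f == fv) & (c == cv))    # joint p(F=fv, C=cv)
                p_f, p_c = np.mean(f == fv), np.mean(c == cv)
                if p_fc > 0:
                    mi += p_fc * np.log(p_fc / (p_f * p_c))
        return mi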


2.1.3. Probabilistic metrics

The distance between probability distributions can also be used to estimate the dependency between the features and the class variable. One example of this category is the Vajda entropy [36]. Recently, we have proposed a metric termed the diff-criterion (DC), which has shown promising results on binary data [37] and is given by

$$W_{dc}(F_i, C) = \left| p(F_i = 1 \mid C_2) - p(F_i = 1 \mid C_1) \right| \qquad (3)$$

The diff-criterion values range over [0, 1]. The higher the distance between the two probability distributions, the higher the worth of the feature. One simple way of extending the diff-criterion to work with discrete-valued features is to take the sum of the differences of each of the values of a feature between the classes. Furthermore, we can make the diff-criterion work for multi-class classification applications by taking the average over pairs of classes [32,38], i.e.,

$$W = \sum_{a=1}^{L} \sum_{b > a} p(C_a)\,p(C_b)\,W_{a,b}$$

where $W_{a,b}$ is the weight of the ith feature between the two classes C_a and C_b, and L represents the total number of classes. Formulating these proposals for extending the diff-criterion is a part of our future work. A comparison of the diff-criterion against mutual information was given in Javed et al. [37], in which it was found that the two metrics work almost the same for binary data sets with balanced classes. As class skew increases, the dissimilarity becomes evident. At present, we are working on extending the work done in Javed et al. [37] so that a unified framework for the analysis of metrics can be defined. The platform allows us to learn which type of binary-valued features top the ranking of a metric, thus giving a better understanding of the working of metrics.
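For the two-class, binary-feature case studied in this paper, Eq. (3) reduces to a few lines; the sketch below (our own w_dc, not the authors' code) does not implement the multi-class averaging discussed above.

    import numpy as np

    def w_dc(f, c):
        # Diff-criterion (Eq. 3): |p(F=1 | C=c2) - p(F=1 | C=c1)| for a binary
        # feature f and a class vector c taking exactly two values.
        f = np.asarray(f, dtype=int)
        c = np.asarray(c, dtype=int)
        c1, c2 = np.unique(c)                            # the two class labels
        return abs(np.mean(f[c == c2]) - np.mean(f[c == c1]))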

2.2. Well-known filters

The minimal-redundancy-maximal-relevance (mRMR) filter [1] is a well-known filter. It employs the mutual information metric for measuring the associations of the variables. At each step, it selects the feature F_i that obtains the maximum value for the difference between its mutual information with the class and the average of its mutual information with each of the already selected individual features. The objective function of mRMR is given by

$$\max_{F_i \in \{F - G\}} \left[ W_{mi}(F_i, C) - \frac{1}{m-1}\sum_{F_j \in G} W_{mi}(F_i, F_j) \right] \qquad (4)$$

where m - 1 features have already been selected in the final subset G, and W_mi(F_i, F_j) denotes the value of association between F_i and F_j estimated by mutual information.
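The greedy loop behind Eq. (4) can be sketched with a pluggable metric, which is precisely the knob varied in this study. This is an illustrative reimplementation under our own names (mrmr_rank), not the MATLAB code used in the experiments; metric can be any pairwise association function such as the w_mi, w_cc or w_dc sketches above.

    import numpy as np

    def mrmr_rank(X, y, metric, n_select=None):
        # Greedy mRMR (Eq. 4): X is an (N, M) feature matrix, y the class
        # vector, metric(a, b) -> float the association measure. Returns the
        # column indices in the order they are selected.
        n_features = X.shape[1]
        n_select = n_features if n_select is None else n_select
        relevance = np.array([metric(X[:, i], y) for i in range(n_features)])
        selected = [int(np.argmax(relevance))]           # most relevant feature first
        candidates = set(range(n_features)) - set(selected)
        while candidates and len(selected) < n_select:
            best_i, best_score = None, -np.inf
            for i in candidates:
                # average redundancy with the already selected features
                redundancy = np.mean([metric(X[:, i], X[:, j]) for j in selected])
                score = relevance[i] - redundancy
                if score > best_score:
                    best_i, best_score = i, score
            selected.append(best_i)
            candidates.remove(best_i)
        return selected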

Koller and Sahami [2] have shown that, among a set of features, a feature F_i whose Markov blanket can be identified can be eliminated from that set without an increase in the divergence from the true class distribution. However, exactly pinpointing the true Markov blanket of F_i is practically an intractable task. Therefore, heuristics are used to find an approximation of the Markov blanket. One such heuristic proposed by Koller and Sahami is the Markov blanket filter (MBF), which uses the correlation coefficient metric to estimate the feature-to-class and feature-to-feature relationships.

In Yu and Liu [3], a filter termed the fast correlation-based filter (FCBF) was proposed. To estimate the feature-to-class and feature-to-feature correlations, it employs symmetrical uncertainty (a variant of mutual information). FCBF works in two stages. Initially, it discards those features whose relevance to the class variable is lower than a given threshold value. Next, the features of the reduced set, sorted in decreasing order of relevance, are checked for redundancies. A feature F_i in the ordered list is discarded if its correlation with a feature F_j that has already been selected is higher than its relevance to the class. FCBF is computationally efficient as it does not analyze redundancies for all possible pairs of features.
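A simplified sketch of FCBF's two stages is given below. To match the experiments in this paper it takes the association metric as an argument (Yu and Liu's original formulation uses symmetrical uncertainty); fcbf_select and its threshold default are our own illustrative choices.

    import numpy as np

    def fcbf_select(X, y, metric, threshold=0.0):
        # Stage 1: keep features whose relevance metric(F_i, C) exceeds the
        # threshold, sorted in decreasing order of relevance.
        relevance = np.array([metric(X[:, i], y) for i in range(X.shape[1])])
        kept = [i for i in np.argsort(-relevance) if relevance[i] > threshold]
        # Stage 2: walk the list and discard any feature whose association with
        # an already selected feature is higher than its relevance to the class.
        selected = []
        for i in kept:
            redundant = any(metric(X[:, i], X[:, j]) > relevance[i] for j in selected)
            if not redundant:
                selected.append(i)
        return selected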

2.3. Motivation

Now, we look at an example to provide motivation for investigating the performance of filters with different metrics. In Table 1, a data set denoted by x with three binary-valued features and nine instances is presented. We used the Boolean expression $C = (\neg F_1 \wedge F_2) \vee (F_1 \wedge \neg F_3)$ to generate the class variable. Let R(kth) denote the ranking of features in descending order of importance that is obtained by the kth algorithm, which can be a metric working standalone or as part of the objective function in a filter method.

First, we look at the working of the metrics as standalone methods. The association between the features and the class variable is estimated using Eqs. (2), (1) and (3) for mutual information, correlation coefficient and diff-criterion, respectively. The results thus obtained are shown in the third column of Table 1. We can see that the features are ordered differently by the three metrics. In other words, each metric has estimated the feature-to-class relationship (or relevance) differently. Considering F1, we find that for mutual information and diff-criterion it has the lowest association with C, while it is located at position number 2 in the ranked list of the correlation coefficient. Similarly, F2 is the top ranked feature in terms of the weights estimated by mutual information and correlation coefficient, whereas according to the diff-criterion it should be placed at the second position. The feature F3 has been assigned ranks 2, 3, and 1 by mutual information, correlation coefficient and diff-criterion, respectively. This is quite surprising, as no two metrics have located it at a common position in their rankings. Hence, different metrics estimate the relevance of features with the class differently. One can easily deduce from these observations that these metrics will also work differently while estimating the association between two features for determining the redundancy between them.

Because of the above behavior of the metrics, we can claim that a filter whose objective function employs one of the metrics will perform differently when used with different metrics. As evidence, we next apply the mRMR filter, whose objective function is given by Eq. (4), to the data set of this example using the three metrics turn-by-turn. The results are shown in the last column of Table 1. The output of mRMR is sorted according to the descending order of importance it assigns to the features. It is clear that when mRMR employs different metrics of association, it locates the features at different positions, which will thus result in different performance.

Table 1
Motivation.

x          C    Ranked list of metrics       Ranked list of mRMR
(0,0,0)    0
(0,0,1)    0    R(mi) = (F2, F3, F1)         R(mrmr+mi) = (F2, F3, F1)
(0,1,0)    1
(0,1,1)    1
(1,0,0)    1    R(cc) = (F2, F1, F3)         R(mrmr+cc) = (F2, F1, F3)
(1,0,1)    0
(1,1,0)    1
(1,1,1)    0    R(dc) = (F3, F2, F1)         R(mrmr+dc) = (F3, F2, F1)
(1,1,0)    1
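The setup of Table 1 is easy to reproduce with the sketches above (this snippet assumes the illustrative w_mi, w_cc, w_dc and mrmr_rank defined earlier). Note that on this tiny data set F2 and F3 receive identical weights under some metrics, so tie-breaking may order them differently than in Table 1.

    import numpy as np

    # Data set of Table 1: nine instances of (F1, F2, F3) with
    # C = (not F1 and F2) or (F1 and not F3)
    X = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0],
                  [1, 0, 1], [1, 1, 0], [1, 1, 1], [1, 1, 0]])
    y = ((1 - X[:, 0]) & X[:, 1]) | (X[:, 0] & (1 - X[:, 2]))

    for name, metric in [("mi", w_mi), ("cc", w_cc), ("dc", w_dc)]:
        weights = [metric(X[:, i], y) for i in range(3)]
        standalone = np.argsort(weights)[::-1] + 1       # feature numbers, best first
        mrmr_order = np.array(mrmr_rank(X, y, metric)) + 1
        print(name, "standalone:", standalone, "mRMR:", mrmr_order)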


3. Experiments

In this section, the experimental setup used to test the performance of the mRMR, MBF and FCBF filters with three different metrics of association (mutual information (MI), correlation coefficient (CC) and diff-criterion (DC)) is presented. We have implemented the three filters mRMR, MBF and FCBF in MATLAB [39]. Table 2 presents a summary of the three real-life data sets used. The first two data sets were introduced in the "Agnostic Learning versus Prior Knowledge" challenge [40]. GINA models a recognition problem of handwritten digits and has been binarized according to the method given in Javed et al. [37]. HIVA represents a drug discovery application and is designed to predict which compounds are active and inactive against the AIDS HIV infection. SPECT has been taken from the University of California Irvine (UCI) machine learning repository [41] and models a heart disease classification task. Comparing these data sets in terms of the number of features, we see that SPECT is a data set with small dimensionality, while GINA and HIVA are high-dimensional data sets.

3.1. Performance evaluation

We generate $N_T$ nested subsets $S_1 \subseteq S_2 \subseteq \cdots \subseteq S_{N_T}$ from the sorted list of features generated by a filter by progressively adding features of decreasing importance. As SPECT contains only 22 features, we use $N_T$ equal to M for it. For GINA and HIVA, which are high-dimensional data sets, $N_T$ is far less than M. Each subset is then used to train a classifier. Four widely used classifiers, namely naive Bayes' (NB), kernel ridge regression (KRR), also known as kridge, logistic regression (LR) and support vector machines with a radial basis function kernel (SVM+RBF), are used to evaluate the quality of the feature subsets generated. These classifiers are implemented in the CLOP package [42]. The output of a classifier is measured in terms of error, which is estimated over 5-fold cross validation with the balanced error rate (BER). The average of the error rates of the positive and negative classes gives the balanced error rate [40].
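As a sketch of this protocol (not the CLOP-based setup of the paper; here scikit-learn's naive Bayes and balanced accuracy stand in for the CLOP models and the BER computation, and the helper names are ours):

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import balanced_accuracy_score

    def balanced_error_rate(y_true, y_pred):
        # BER: average of the error rates of the positive and negative classes.
        return 1.0 - balanced_accuracy_score(y_true, y_pred)

    def nested_subset_curve(X, y, ranking, subset_sizes, clf=None):
        # Evaluate nested subsets S1, S2, ... taken from a ranked feature list:
        # one 5-fold cross-validated BER value per subset size.
        clf = BernoulliNB() if clf is None else clf
        curve = []
        for k in subset_sizes:
            cols = ranking[:k]                           # top-k ranked features
            y_pred = cross_val_predict(clf, X[:, cols], y, cv=5)
            curve.append(balanced_error_rate(y, y_pred))
        return curve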

To set the various parameters of the classifiers, we referred to the quick start guide available for CLOP [42]. It describes the different hyperparameters and their possible values for each classifier. There are no parameters for the naive Bayes' classifier. A linear kernel was used for the KRR classifier. That value of its shrinkage parameter was used which resulted in the least error on a data set.

Table 2
Characteristics of the data sets.

Data set   M      N      %Sp    M/N     %+ive class
GINA       970    3468   69.1   0.28    49.2
HIVA       1617   4229   90.9   0.38    3.5
SPECT      22     267    67.1   0.083   41.2

For each data set, we show the number of features M, the number of training examples N, the sparsity Sp, the ratio of the number of features to the number of training examples M/N, and the fraction of examples in the positive class. Each data set models a 2-class classification task.

Fig. 1 (plot omitted; BER versus size of feature subsets). Performance of mRMR filter on GINA. (a) Naive Bayes', (b) kernel ridge regression, (c) support vector machines and (d) logistic regression.


We employed a radial basis function kernel for the support vector machines classifier. Its gamma and shrinkage parameters were tuned in order to obtain the least error on a data set. For the LR classifier, the implementation provided by Minka [43] was used and integrated into CLOP. We used the conjugate gradient optimizer for it. The default value was chosen for lambda, which gives a regularized estimate of the weights.

Table 3
Comparison of metrics for the MBF, mRMR and FCBF filters on GINA.

Classifier  Metric  BER_F (%)   MBF filter                   mRMR filter                  FCBF filter
                                |G|    BER_G (%)  Red. (%)   |G|    BER_G (%)  Red. (%)   |G|    BER_G (%)  Red. (%)
NB          DC      19.8        37     19.7       96.2       5      18.9       99.5       9      19.1       99.1
            MI                  37     19.8       96.2       5      18.1       99.5       9      19.1       99.1
            CC                  400    19.3       58.8       9      19.4       99.1       355    19.7       63.4
KRR         DC      13.9        165    13.7       83         170    13.9       82.5       827    14         14.7
            MI                  206    13.9       78.8       165    13.8       83         935    14         3.6
            CC                  255    13.9       73.7       970    13.9       0          400    14         58.7
SVM+RBF     DC      6.4         90     6.3        90.7       95     6.4        90.2       575    6.3        40.7
            MI                  100    6.1        89.7       155    6.4        84         550    6.5        43.3
            CC                  105    6.2        89.2       245    6.3        74.7       400    6.1        58.7
LR          DC      21.4        8      21.4       99.2       4      21.9       99.6       6      20.1       99.4
            MI                  14     19.8       98.6       4      18.9       99.6       5      21.2       99.5
            CC                  10     21.3       98.9       6      20.1       99.4       13     21.6       98.6
Mean        DC      –           75     –          92.3       69     –          92.9       354    –          63.5
            MI                  89     –          90.8       82     –          91.5       375    –          61.4
            CC                  193    –          80.2       308    –          68         292    –          70

F is the entire feature set, G is the selected feature subset, BER is the balanced error rate, DC is the diff-criterion, MI is mutual information, CC is correlation coefficient, NB is naive Bayes', KRR is kernel ridge regression, SVM+RBF means support vector machines with a radial basis function kernel, LR stands for logistic regression and Red. stands for Reduction = (M − |G|)/M × 100.

Fig. 2 (plot omitted; BER versus size of feature subsets). Performance of mRMR filter on HIVA. (a) Naive Bayes', (b) kernel ridge regression, (c) support vector machines and (d) logistic regression.


Table 4
Comparison of metrics for the MBF, mRMR and FCBF filters on HIVA.

Classifier  Metric  BER_F (%)   MBF filter                   mRMR filter                  FCBF filter
                                |G|    BER_G (%)  Red. (%)   |G|    BER_G (%)  Red. (%)   |G|    BER_G (%)  Red. (%)
NB          DC      28.9        50     28.9       97.0       8      28.4       99.5       9      28.1       99.5
            MI                  101    28.9       93.7       8      28.5       99.5       9      28.4       99.4
            CC                  101    28.3       93.7       8      28.3       99.5       16     28.1       99.1
KRR         DC      26.8        46     26.8       97.2       18     26.3       98.8       13     26.6       99.2
            MI                  50     26.4       97.0       20     26.2       98.7       50     26.7       96.9
            CC                  73     26.5       95.5       20     26.6       98.7       50     26.5       96.9
SVM+RBF     DC      24.9        140    24.7       91.4       25     25         98.4       15     25.1       99.1
            MI                  267    25.1       83.5       44     25.3       97.3       185    25.2       88.5
            CC                  153    25.0       90.4       57     25.3       96.5       335    24.9       79.2
LR          DC      29.1        10     29.1       99.4       8      27.8       99.5       9      28.5       99.4
            MI                  58     28.8       96.4       8      29.1       99.5       8      29.1       99.5
            CC                  12     28.6       99.3       7      28.3       99.6       10     29.1       99.4
Mean        DC      –           61.5   –          96.2       14.8   –          98.8       11.5   –          99.3
            MI                  119    –          92.6       20     –          97.9       63     –          96
            CC                  84.8   –          94.7       23     –          98.6       102.8  –          93.6

F is the entire feature set, G is the selected feature subset, BER is the balanced error rate, DC is the diff-criterion, MI is mutual information, CC is correlation coefficient, NB is naive Bayes', KRR is kernel ridge regression, SVM+RBF means support vector machines with a radial basis function kernel, LR stands for logistic regression and Red. is Reduction = (M − |G|)/M × 100.

Fig. 3 (plot omitted; BER versus size of feature subsets). Performance of mRMR filter on SPECT. (a) Naive Bayes', (b) kernel ridge regression, (c) support vector machines and (d) logistic regression.


The procedure above is repeated for the three data sets and leads to the construction of error curves, which are then analyzed to determine the feature selection capability of a filter.

In the following discussion, G denotes the subset with the minimum number of features with which a BER value as good as that obtained by F, the set of all features of a data set, is attained.

Table 5
Comparison of metrics for the MBF, mRMR and FCBF filters on SPECT.

Classifier  Metric  BER_F (%)   MBF filter                   mRMR filter                  FCBF filter
                                |G|    BER_G (%)  Red. (%)   |G|    BER_G (%)  Red. (%)   |G|    BER_G (%)  Red. (%)
NB          DC      30.1        1      29.1       95.5       1      29.1       95.5       1      29.1       95.5
            MI                  1      29.4       95.5       1      29.1       95.5       1      29.1       95.5
            CC                  1      29.4       95.5       1      29.1       95.5       1      29.1       95.5
KRR         DC      28.1        2      27.9       90.9       3      27.4       86.4       4      26.4       81.8
            MI                  2      27.9       90.9       3      27.7       86.4       4      26.4       81.8
            CC                  2      27.9       90.9       3      27.4       86.4       3      27.8       86.4
SVM+RBF     DC      27.7        2      27.6       90.9       3      26.5       86.4       4      26.9       81.8
            MI                  2      27.6       90.9       3      27.6       86.4       4      26.9       81.8
            CC                  2      27.6       90.9       3      26.5       86.4       3      27.7       86.4
LR          DC      32.5        1      29.1       95.5       1      29.1       95.5       1      29.1       95.5
            MI                  1      29.4       95.5       1      29.1       95.5       1      29.1       95.5
            CC                  1      29.4       95.5       1      29.1       95.5       1      29.1       95.5
Mean        DC      –           1.3    –          94.3       2      –          90.9       2.5    –          88.6
            MI                  1.3    –          94.3       2      –          90.9       2.5    –          88.6
            CC                  1.3    –          94.3       2      –          90.9       2      –          90.9

F is the entire feature set, G is the selected feature subset, BER is the balanced error rate, DC is the diff-criterion, MI is mutual information, CC is correlation coefficient, NB is naive Bayes', KRR is kernel ridge regression, SVM+RBF means support vector machines with a radial basis function kernel, LR stands for logistic regression and Red. is Reduction = (M − |G|)/M × 100.

Fig. 4 (plot omitted; BER versus size of feature subsets). Performance of MBF filter on GINA. (a) Naive Bayes', (b) kernel ridge regression, (c) support vector machines, and (d) logistic regression.


We represent the BER attained by the set of all features with a horizontal dashed line in the error curves. The following criterion is used in order to compare the performance of a filter:

• The reduction in dimensionality that is obtained with subset G is measured as Reduction = (M − |G|)/M, where |G| is the cardinality of G.

4. Results

In this section, the results obtained for the performance of the mRMR, MBF and FCBF filters when used with mutual information (MI), correlation coefficient (CC) and diff-criterion (DC) are presented and discussed.

4.1. The mRMR filter

As Peng et al. [1] proposed mutual information to be used for the mRMR filter, we can therefore look at the working of mRMR with mutual information as a reference and see how its performance varies when it uses other metrics.

Fig. 1 shows the error curves obtained for mRMR on GINA when evaluated by naive Bayes', kernel ridge regression, logistic regression, and support vector machines. From these curves, we can determine the feature selection and classification performance of mRMR. It is clear from the plots that the feature subsets generated by the combination of mRMR with correlation coefficient have resulted in the poorest classification performance. Comparing mutual information and diff-criterion, we observe that the classification performance of mRMR is almost the same for these two metrics. Table 3 indicates how the feature selection capability of mRMR is affected by changing the metric of association. We find that the feature subset G (whose features result in the same error as that obtained by all the GINA features) selected by mRMR with correlation coefficient contains the highest number of features, as evaluated by the four classifiers. The performance of mRMR with the diff-criterion is comparable to the combination of mRMR with mutual information. The average size of G leads us to the conclusion that diff-criterion and mRMR is the best combination for the GINA data set.

Next, we look at the HIVA data set. The performance of mRMR with the three metrics of association and four classifiers is shown in Fig. 2. Among the three metrics, it is clear that mRMR's classification is the poorest with correlation coefficient. As the size of the feature subset increases, classification performance deteriorates. On the other hand, mRMR with the diff-criterion achieves the best classification results. Table 4 lists the results of the feature selection performance of mRMR on HIVA. We find that mRMR selects the least number of features for the G subset when used with the diff-criterion. On this data set, we observe that the correlation coefficient's feature selection performance is comparable to that of mutual information. Taking the overall classification and feature selection performance into account, we can say that the mRMR and diff-criterion combination is the winner for HIVA.

Finally, we investigate the impact of changing the metric of association on mRMR's performance for SPECT, which consists of only 22 features.

Fig. 5 (plot omitted; BER versus size of feature subsets). Performance of MBF filter on HIVA. (a) Naive Bayes', (b) kernel ridge regression, (c) support vector machines, and (d) logistic regression.


Fig. 3 depicts the error curves that have been obtained for the four classifiers. We can observe that mRMR with correlation coefficient results in the optimal BER value when evaluated by kridge. In the remaining three cases, mRMR with the three metrics demonstrates comparable classification performance. The feature selection capability of mRMR is compared in Table 5. We find that it remains the same irrespective of the metric that is used with mRMR. Hence, we can conclude that the reduction in dimensionality of mRMR on SPECT does not change significantly by changing the measure of association. The reason is that metrics on SPECT, which is a data set with small dimensionality, have higher chances of selecting similar features.

It was suggested by Peng et al. [1] that the mRMR filter should employ mutual information for the estimation of the relevance and redundancy of features. Considering mutual information and mRMR as the reference, we find through our experiments that by changing the metric of association, mRMR's performance does change. This change is noticeable in the case of data sets with high dimensionality. We find that the mRMR filter exhibits better or comparable performance when it uses the diff-criterion than with mutual information, in terms of both feature selection and classification.

4.2. The MBF filter

As suggested by Koller and Sahami [2], we can consider the combination of MBF and correlation coefficient as a reference and see how MBF's performance varies when used with other metrics.

Fig. 4 shows the classification performance of MBF when the relevance and redundancy of GINA's features are estimated by mutual information, correlation coefficient and diff-criterion. From the error curves of the four classifiers, we find that the classification performance of the MBF filter does not vary significantly if the metric of association is changed. We next look at the feature selection performance. Results are given in Table 3. On comparing the results, we find that the MBF filter selects the least number of features for the G subset when it uses the diff-criterion, for all four classifiers. On the other hand, the cardinality of G is the largest for correlation coefficient. Hence, we can say that MBF and diff-criterion is the best of the three combinations for GINA.

HIVA is the data set we investigate next. The classification performance is shown in Fig. 5 for the four classifiers. The plots indicate that only for naive Bayes' is there an improvement in error over the reference (MBF with correlation coefficient), obtained by combining MBF with the diff-criterion. Otherwise, changing the metric of association for MBF does not significantly change the classification performance. However, when we compare the three metrics on the basis of feature selection capability, we find that MBF's performance is at its best when the diff-criterion is used with it instead of correlation coefficient. As we can see from Table 4, MBF with the diff-criterion has selected smaller G subsets as compared to mutual information and correlation coefficient. This comparison leads us to conclude that if MBF is used with the diff-criterion it can produce comparatively better results.

Fig. 6 shows the error curves obtained by evaluating the performance of MBF on SPECT with naive Bayes', kridge, logistic regression, and SVM with the RBF kernel.

Fig. 6 (plot omitted; BER versus size of feature subsets). Performance of MBF filter on SPECT. (a) Naive Bayes', (b) kernel ridge regression, (c) support vector machines and (d) logistic regression.


The classification performance of MBF is improved when it employs the diff-criterion for measuring the relevance and redundancy of features, compared with its use of correlation coefficient. Feature selection performance is given in Table 5. We find that by changing the metric of association, the feature selection capability (measured in terms of G) of MBF remains the same, but the classification performance varies. The minimum BER point in the case of naive Bayes', support vector machines and logistic regression is obtained with the diff-criterion, while correlation coefficient gives comparatively better results when the kridge classifier is used. Hence, although MBF with the diff-criterion results in the same feature selection performance, it performs better than the other two metrics in terms of classification performance.

Koller and Sahami [2] have suggested the use of correlation coefficient for MBF. When we evaluated the performance of the MBF filter with other metrics, we found that the performance of MBF on data sets with small dimensionality does not vary remarkably. However, it was found that changing the metric of association changes the feature selection capability of the MBF filter significantly on high-dimensional data sets. The MBF filter performs much better with the diff-criterion in comparison to correlation coefficient and mutual information, by selecting smaller G subsets.

4.3. The FCBF filter

FCBF [3] divides the features into two subsets. One subset contains the selected features while the other one consists of the discarded features. In other words, unlike mRMR and MBF, it does not output a ranked list of all the features. The selected subset may or may not be equal to G, which we have defined to be the smallest subset of top ranked features that results in the same error as with all the features, and which has been used for the performance evaluation of mRMR and MBF. Therefore, we combined the selected and discarded subsets. From this ranked list, we generated the nested subsets in order to determine the subset G. This allows our evaluation of FCBF to be consistent with the criterion used for the other two filters.

First, FCBF is evaluated on GINA. The four error curves for FCBF with the three metrics are shown in Fig. 7. We find that FCBF with correlation coefficient shows comparatively better classification performance. On comparing the results in terms of the G subsets given in Table 3, the FCBF filter also performs relatively better with correlation coefficient. Hence, the combination of FCBF and correlation coefficient is much better than when FCBF uses the diff-criterion or mutual information.

For the HIVA data set, Fig. 8 shows the performance of FCBF with the three metrics for the four classifiers. The plots indicate that FCBF with the diff-criterion has better or comparable classification performance against the other two metrics. The comparison of feature selection capability is given in Table 4. It can be seen that when FCBF uses the diff-criterion for the relevance and redundancy measurements, it outperforms the other two metrics. Hence, we can say that FCBF and diff-criterion is the best combination for HIVA.

Finally, we refer to Fig. 9 to observe the classification performance of FCBF with the three metrics.

Fig. 7 (plot omitted; BER versus size of feature subsets). Performance of FCBF filter on GINA. (a) Naive Bayes', (b) kernel ridge regression, (c) support vector machines and (d) logistic regression.


On the low-dimensional SPECT data set, no significant difference can be observed. Table 5 shows the feature selection capability. We can observe that the three metrics have shown a similar kind of behavior. Hence, the performance of FCBF on the low-dimensional SPECT data set is almost the same when different metrics of association are used.

4.4. Discussions

A filter's design contains two important parameters, namely the objective function and the metric it employs for estimating the feature-to-feature and feature-to-class associations. The decision whether a filter's performance is good or bad is usually based on comparisons against other filters. This kind of comparison in fact compares the working of two objective functions, each with a given metric. Whether the metric is suitable for the objective function is a question that has been ignored. In fact, filter designers are more interested in the objective functions, as one can easily find filters that have different objective functions but the same metric. However, not every metric suits an objective function. The impact of a metric on the objective function needs to be studied before pairing them. A number of metrics of association have been proposed in the literature, and they work differently; the difference in their working motivated us to address the above question. Among the metrics of association, the one that has a better capability of capturing the relevance and redundancy of features can improve an objective function's performance. In this regard, we investigated the performance of three widely known filters, namely mRMR, MBF and FCBF, with mutual information, correlation coefficient and diff-criterion. Our experiments performed on three data sets with four widely used classifiers lead us to the following findings:

• On high-dimensional data sets, changing the metric of association changes the performance of a filter significantly, while for data sets with small dimensionality it may not result in remarkable changes in performance.

• The mRMR filter shows better or comparable performance with the diff-criterion than with mutual information, whose employment was suggested by Peng et al. [1].

• Koller and Sahami [2] proposed the use of correlation coefficient with the MBF filter. But the performance of MBF improves significantly when it uses the diff-criterion instead of correlation coefficient.

• FCBF with correlation coefficient and with the diff-criterion performed much better than FCBF with mutual information.

Besides allowing us to find a better metric for a filter, the idea of testing the objective functions of filters with different metrics provides us with a new way of comparing metrics. It also suggests that a filter which has been tagged as being of poor quality after being compared against other filters may show better performance if a better metric is employed for it.

Fig. 8 (plot omitted; BER versus size of feature subsets). Performance of FCBF filter on HIVA. (a) Naive Bayes', (b) kernel ridge regression, (c) support vector machines and (d) logistic regression.


Therefore, before arriving at a conclusion about the performance of a filter, one should search for a suitable candidate for the objective function by testing the filter's performance with different metrics.

5. Conclusions

The feature selection capability of a filter method depends upon the objective function used for populating the final subset and the metric it uses for estimating the relevance and redundancy of features. The opinion about a filter which was found to exhibit poor performance against other filters may change altogether if a better metric of association is employed for it. This paper studies this issue. We tested the working of three well-known filters, namely the mRMR, MBF and FCBF filters, with three different metrics (mutual information, correlation coefficient and diff-criterion). Experiments were carried out on three binary data sets and, for evaluation purposes, four widely used classifiers were employed. We found that the mRMR filter, for which Peng et al. [1] suggested the use of mutual information, has shown better results with the diff-criterion. Similarly, our results on the MBF filter, for which Koller and Sahami [2] proposed the use of correlation coefficient, have demonstrated that its performance improves significantly with the usage of the diff-criterion. For the FCBF filter, the use of the diff-criterion has resulted in improved performance as compared to FCBF with mutual information. Hence, it was found that changing the metric of association can significantly change the performance of a filter. Among the three metrics we evaluated, the diff-criterion has shown promising results as a metric estimating the association between two variables for the mRMR, MBF and FCBF filters.

References

[1] H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005) 1226–1238.

[2] D. Koller, M. Sahami, Toward Optimal Feature Selection, Technical Report 1996-77, Stanford InfoLab, 1996.

[3] L. Yu, H. Liu, Feature selection for high-dimensional data: a fast correlation based filter solution, in: Proceedings of the 20th International Conference on Machine Learning (ICML), 2003, pp. 856–863.

[4] L. Yin, Y. Ge, K. Xiao, X. Wang, X. Quan, Feature selection for high-dimensional imbalanced data, Neurocomputing 105 (2013) 3–11.

[5] G. Doquire, M. Verleysen, Mutual information-based feature selection for multilabel classification, Neurocomputing 122 (2013) 148–155.

[6] K. Javed, Development of feature selection algorithms for high-dimensional binary data (Ph.D. thesis), Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan, 2012.

[7] R. Kohavi, G. John, Wrappers for feature subset selection, Artif. Intell. 97 (1997) 273–324.

[8] M. Dash, H. Liu, Feature selection for classification, Intell. Data Anal. 1 (1997) 131–156.

[9] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.

[10] I. Guyon, S. Gunn, M. Nikravesh, L. Zadeh, Feature Extraction: Foundations and Applications, Springer, 2006.

[11] M. Hall, G. Holmes, Benchmarking attribute selection techniques for discrete class data mining, IEEE Trans. Knowl. Data Eng. 15 (2003) 1437–1447.

[12] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (2002) 389–422.

[13] S. Das, Filters, wrappers, and a boosting based hybrid for feature selection, in: Proceedings of the 18th International Conference on Machine Learning (ICML), 2001, pp. 74–81.

[14] G. Brown, A. Pocock, M.-J. Zhao, M. Lujan, Conditional likelihood maximisation: a unifying framework for information theoretic feature selection, J. Mach. Learn. Res. 13 (2012) 27–66.

[15] A. Argyriou, T. Evgeniou, M. Pontil, Multi-task feature learning, in: NIPS, 2006, pp. 41–48.

Fig. 9 (plot omitted; BER versus size of feature subsets). Performance of FCBF filter on SPECT. (a) Naive Bayes', (b) kernel ridge regression, (c) support vector machines and (d) logistic regression.

K. Javed et al. / Neurocomputing 143 (2014) 248–260 259

Page 13: Impact of a metric of association between two variables on performance of filters for binary data

[16] F. Nie, H. Huang, X. Cai, C. Ding, Efficient and robust feature selection via joint ℓ2,1-norms minimization, in: NIPS, 2010, pp. 1813–1821.

[17] W. Duch, Filter methods, in: I. Guyon, M. Nikravesh, S. Gunn, L. Zadeh (Eds.), Feature Extraction: Foundations and Applications, Springer, 2006, pp. 89–117.

[18] A. Arauzo-Azofra, J. Aznarte, J. Benitez, Empirical study of feature selection methods based on individual feature evaluation for classification problems, Expert Syst. Appl. 38 (2011) 8170–8177.

[19] K. Javed, M. Saeed, H.A. Babri, The correctness problem: evaluating the ordering of binary features in rankings, Knowl. Inf. Syst. 39 (2014) 543–563.

[20] N. Bouguila, On multivariate binary data clustering and feature weighting, Comput. Stat. Data Anal. 54 (2010) 120–134.

[21] S. Myllykangas, J. Tikka, T. Bohling, S. Knuutila, J. Hollmen, Classification of human cancers based on DNA copy number amplification modeling, BMC Med. Gen. 1 (2008).

[22] A. Juan, E. Vidal, Bernoulli mixture models for binary images, in: Proceedings of the 17th International Conference on Pattern Recognition (ICPR), 2004, pp. 367–370.

[23] J. Wilbur, J. Ghosh, C. Nakatsu, S. Brouder, R. Doerge, Variable selection in high-dimensional multivariate binary data with application to the analysis of microbial community DNA fingerprints, Biometrics 58 (2002) 378–386.

[24] S. Dolnicar, F. Leisch, Behavioral market segmentation of binary guest survey data with bagged clustering, in: Proceedings of the International Conference on Artificial Neural Networks (ICANN), 2001, pp. 111–118.

[25] M. Saeed, K. Javed, H.A. Babri, Machine learning using Bernoulli mixture models: clustering, rule extraction and dimensionality reduction, Neurocomputing 119 (2013) 366–374.

[26] C. Lazar, J. Taminau, S. Meganck, D. Steenhoff, A. Coletta, C. Molter, V. Schaetzen, R. Duque, H. Bersini, A. Nowe, A survey on filter techniques for feature selection in gene expression microarray analysis, IEEE/ACM Trans. Comput. Biol. Bioinform. 9 (2012) 1106–1119.

[27] Y. Saeys, I. Inza, P. Larranaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23 (2007) 2507–2517.

[28] V. Bolon-Canedo, N. Sanchez-Marono, A. Alonso-Betanzos, A review of feature selection methods on synthetic data, Knowl. Inf. Syst. 34 (2013) 483–519.

[29] R. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Trans. Neural Netw. 5 (1994) 537–550.

[30] L. Yu, H. Liu, Efficient feature selection via analysis of relevance and redundancy, J. Mach. Learn. Res. 5 (2004) 1205–1224.

[31] G. Brown, A new perspective for information theoretic feature selection, in: JMLR Workshop and Conference Proceedings: AISTATS, vol. 5, 2009, pp. 49–56.

[32] B. Guo, R. Damper, S. Gunn, J. Nelson, A fast separability-based feature selection method for high-dimensional remotely sensed image classification, Pattern Recognit. 41 (2008) 1653–1662.

[33] W. Press, B. Flannery, S. Teukolsky, W. Vetterling, Numerical Recipes in C, Cambridge University Press, 1988.

[34] H. Liu, R. Setiono, Chi2: feature selection and discretization of numeric attributes, in: Proceedings of IEEE 7th International Conference on Tools with Artificial Intelligence, 1995, pp. 388–391.

[35] T. Cover, J. Thomas, Elements of Information Theory, John Wiley and Sons, 1991.

[36] I. Vajda, Theory of Statistical Inference and Information, Kluwer Academic Press, 1979.

[37] K. Javed, H. Babri, M. Saeed, Feature selection based on class-dependent densities for high-dimensional binary data, IEEE Trans. Knowl. Data Eng. 24 (2012) 465–477.

[38] F. van der Heijden, R. Duin, D. de Ridder, D. Tax, Classification, Parameter Estimation, and State Estimation: An Engineering Approach using MATLAB, John Wiley and Sons, 2004.

[39] MathWorks, MATLAB: The Language of Technical Computing, ⟨http://www.mathworks.com/products/matlab/index.html⟩, 2010.

[40] I. Guyon, A. Saffari, G. Dror, G. Cawley, Agnostic learning vs. prior knowledge challenge, in: Proceedings of International Joint Conference on Neural Networks (IJCNN), 2007, pp. 829–834.

[41] A. Frank, A. Asuncion, UCI Machine Learning Repository, ⟨http://archive.ics.uci.edu/ml⟩, 2010.

[42] A. Saffari, I. Guyon, Quick Start Guide for Challenge Learning Object Package (CLOP), Technical Report, Graz University of Technology and Clopinet, ⟨http://clopinet.com/clop/⟩, 2006.

[43] T. Minka, A Comparison of Numerical Optimizers for Logistic Regression, ⟨http://research.microsoft.com/~minka/papers/⟩, 2003.

Kashif Javed received the BSc, MSc and PhD degrees in electrical engineering in 1999, 2004, and 2012, respectively, from the University of Engineering and Technology (UET), Lahore, Pakistan. He joined the Department of Electrical Engineering at UET in 1999, where he is currently an assistant professor. He has been a reviewer for IEEE Transactions on Knowledge and Data Engineering (TKDE) in 2011, 2012 and 2013. His research interests include machine learning, pattern recognition and ad hoc network security.

Haroon A. Babri received the BSc degree in electrical engineering from the University of Engineering and Technology (UET), Lahore, Pakistan, in 1981, and the MS and PhD degrees in electrical engineering from the University of Pennsylvania in 1991 and 1992, respectively. He was with the Nanyang Technological University, Singapore, from 1992 to 1998, with the Kuwait University from 1998 to 2000, and with the Lahore University of Management Sciences (LUMS) from 2000 to 2004. He is currently a professor of electrical engineering at UET. He has written two book chapters and has more than 60 publications in machine learning, pattern recognition, neural networks, and software reverse engineering.

Mehreen Saeed has been working at the Department of Computer Science, FAST, National University of Computer and Emerging Sciences, Lahore Campus, Pakistan, since 2006. She is also engaged as a search consultant at the Centre for Language Engineering at Al-Khawarizmi Institute of Computer Science, UET, Lahore. She received her PhD from the University of Bristol, UK in 1999 and completed her Master's degree in Computer Science in 1995 from Quaid-e-Azam University, Islamabad. Her hobbies include programming, participating in machine learning challenges and implementing intelligent learning algorithms.
