Transcript of FOCUS.doc

Literature review of dimensionality reduction

Feature Selection (FS) & Feature Extraction (FE)

o K. Fukunaga and D. R. Olsen, "An algorithm for finding intrinsic dimensionality of data," IEEE Transactions on Computers, vol. 20, no. 2, pp. 176-183, 1971.

This paper is the earliest literature I could collect on the dimensionality reduction issue. The intrinsic dimensionality is defined as the smallest dimensional space we can obtain under some constraint. Two problems were addressed in this paper: intrinsic dimensionality for representation (mapping) and intrinsic dimensionality for classification (separating).

FOCUS

There are several famous dimensionality reduction algorithms that appear in almost all of the literature. They are FOCUS and RELIEF.

o H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), pages 547--552, Anaheim, CA, USA, 1991. AAAI Press.

o Almuallim H., Dietterich T.G.: Efficient Algorithms for Identifying Relevant Features, Proc. of the Ninth Canadian Conference on Artificial Intelligence, University of British Columbia, Vancouver, May 11-15, 1992, 38-45

The first paper above proposed the FOCUS algorithm; the second upgraded it into FOCUS-2. FOCUS implements the Min-Features bias, which prefers consistent hypotheses definable over as few features as possible. In the simplest implementation, it performs a breadth-first search over feature subsets and checks each subset for inconsistency, as in the sketch below.
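A minimal sketch of the Min-Features idea, assuming binary features and a Boolean class label; the subset enumeration and the consistency test (no two instances agree on the selected features but disagree on the class) follow the description above, while the data layout and function names are illustrative and not the authors' implementation.

```python
from itertools import combinations

def is_consistent(data, subset):
    """A feature subset is consistent if no two instances agree on all
    selected features but have different class labels."""
    seen = {}
    for features, label in data:
        key = tuple(features[i] for i in subset)
        if key in seen and seen[key] != label:
            return False
        seen[key] = label
    return True

def focus(data, n_features):
    """Breadth-first search over subsets of increasing size; return the
    smallest consistent subset (Min-Features bias)."""
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            if is_consistent(data, subset):
                return subset
    return None

# Toy example: class equals feature 0 XOR feature 2; feature 1 is irrelevant.
data = [((a, b, c), a ^ c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
print(focus(data, 3))   # expected: (0, 2)
```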

RELIEF

o K. Kira and L. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), pages 129--134, Menlo Park, CA, USA, 1992. AAAI Press.

o Kononenko, Igor. Estimating Attributes: Analysis and Extensions of RELIEF, In European Conference on Machine Learning, pages 171-182, 1994.

In the first paper, RELIEF was presented as a weight-based feature ranking algorithm. From the set of training instances it first chooses a sample of instances; the user must provide the number of instances in this sample. RELIEF randomly picks this sample of instances, and for each instance in it finds the Near Hit (the nearest instance of the same class) and the Near Miss (the nearest instance of a different class) based on a Euclidean distance measure.

Figure: Near Hit and Near Miss (instances of Class A and Class B)

The basic idea is to update the weights, which are initialized to zero at the beginning, based on the following equation, where m is the number of sampled instances:

$W[A] := W[A] - \mathrm{diff}(A, R, H)/m + \mathrm{diff}(A, R, M)/m$

where A is the attribute; R is the randomly selected instance; H is the near hit; M is the near miss; and diff calculates the difference between two instances on attribute A. After exhausting all instances in the sample, RELIEF chooses all features having a weight greater than or equal to a threshold. A sketch of the procedure follows.
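A compact sketch of the sampling and weight update described above, assuming numeric features scaled to [0, 1] so that diff is the absolute difference; the function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def relief(X, y, sample_size, seed=0):
    """RELIEF-style feature weights for a two-class problem.

    X: (n_instances, n_features) array with features scaled to [0, 1].
    y: (n_instances,) array of class labels.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for idx in rng.choice(n, size=sample_size, replace=False):
        r, label = X[idx], y[idx]
        dists = np.linalg.norm(X - r, axis=1)
        dists[idx] = np.inf                                   # exclude the instance itself
        same, other = (y == label), (y != label)
        hit = X[np.where(same)[0][np.argmin(dists[same])]]    # near hit
        miss = X[np.where(other)[0][np.argmin(dists[other])]] # near miss
        w += (-np.abs(r - hit) + np.abs(r - miss)) / sample_size
    return w

# Toy data: feature 0 determines the class, feature 1 is noise.
rng = np.random.default_rng(1)
X = rng.random((100, 2))
y = (X[:, 0] > 0.5).astype(int)
print(relief(X, y, sample_size=30))   # the weight of feature 0 should dominate
```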

In the second paper, the RELIEF algorithm, which only dealt with two-class problems, was upgraded into RELIEF-F, which can handle noisy, incomplete, and multi-class data sets. Firstly, RELIEF was extended to search for the k nearest hits/misses instead of only one near hit/miss. The extended version was called RELIEF-A; it averages the contribution of the k nearest hits/misses to deal with noisy data. Secondly, different diff functions give different RELIEF versions, so the author proposed three versions of RELIEF (RELIEF-B, RELIEF-C and RELIEF-D) to deal with incomplete data sets.

RELIEF-D: Given two instances (I1 and I2):

If one instance (say I1) has an unknown value for attribute A:
$\mathrm{diff}(A, I_1, I_2) = 1 - P(\mathrm{value}(A, I_2) \mid \mathrm{class}(I_1))$

If both instances have unknown values:
$\mathrm{diff}(A, I_1, I_2) = 1 - \sum_{V} P(V \mid \mathrm{class}(I_1)) \, P(V \mid \mathrm{class}(I_2))$

Finally, an extension was made on RELIEF-D to deal with multi-class problems, giving RELIEF-E and RELIEF-F. RELIEF-F showed advantages on both noise-free and noisy data.

RELIEF-F: find one near miss M(C) for each different class C and average their contributions, weighted by each class's prior probability, when updating the estimate W[A]:

$W[A] := W[A] - \mathrm{diff}(A, R, H)/m + \sum_{C \neq \mathrm{class}(R)} P(C)\,\mathrm{diff}(A, R, M(C))/m$

FS: Liu, Dash et al.

Review of DR methods

One important piece of work by Liu and Dash is a review that groups all feature selection methods under the classification scenario.

o Dash, M., & Liu, H. 1997. Feature selection for classification. Intelligent Data Analysis, 1, 131-156.


The major contribution of this paper is to present the following figure, in which the feature selection process is addressed clearly.

Figure: Feature selection process with validation (original feature set → generation → subset → evaluation → goodness of subset → stopping criterion → validation)

Liu et al. claimed that each feature selection method can be characterized by the type of generation procedure and the type of evaluation function it uses, so the paper presented a table classifying feature selection methods. In this table, each row stands for one type of evaluation measure and each column represents one kind of generation method. Most of the methods listed in the cells of this table (5 x 3 = 15 cells in total) were addressed in this review paper.

Table: Two-dimensional categorization of feature selection methods

Evaluation measure     | Heuristic generation                                         | Complete generation                     | Random generation
-----------------------|--------------------------------------------------------------|-----------------------------------------|------------------------
Distance measure       | Relief, Relief-F, Sege84                                     | B&B, BFF, Bobr88                        | -
Information measure    | DTM, Koll-Saha96                                             | MDLM                                    | -
Dependency measure     | POE1ACC, PRESET                                              | -                                       | -
Consistency measure    | -                                                            | Focus, MIFES                            | -
Classifier error rate  | SBS, SFS, SBS-SLASH, PQSS, Moor-Lee94, BDS, RC, Quei-Gels84  | Ichi-Skla84a, Ichi-Skla84b, AMB&B, BS   | GA, SA, RGSS, RMHC-PF1

LVF + Consistency Measure

H. Liu, M. Dash, et al. published many papers on their feature selection methods.

o Liu, H., and Setiono, R. (1996) A probabilistic approach to feature selection - A filter solution. In 13th International Conference on Machine Learning (ICML'96), July 1996, pp. 319-327. Bari, Italy.

o H. Liu and R. Setiono. Some issues on scalable feature selection. In 4th World Congress of Expert Systems: Application of Advanced Information Technologies, 1998.

In the introduction section of the first paper, the authors first compared the wrapper approach and the filter approach. Several reasons were pointed out why the wrapper approach, despite certain advantages, is not as general as the filter approach: (1) it inherits the learning algorithm's bias; (2) it has a high computational cost; (3) large datasets cause problems when running some algorithms, and it is impractical to employ computationally intensive learning algorithms such as neural networks or genetic algorithms. Second, the introduction addressed two types of feature selection search: exhaustive (check correlations of all orders) and heuristic (make use of 1st- and 2nd-order information, i.e., one attribute and the combination of two attributes).

The first paper introduced random search based on a Las Vegas algorithm, which uses randomness to guide the search and guarantees that, given enough time, a correct solution will be found. The probabilistic approach is called LVF; it uses the inconsistency rate as the evaluation measure and Las Vegas random search as the generation procedure. LVF only works on discrete attributes because it relies on the inconsistency calculation. A sketch of the idea follows.
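A minimal sketch of the Las Vegas search with an inconsistency-rate evaluation, assuming discrete attributes and labelled data; the threshold handling and the function names are illustrative simplifications of the paper's description.

```python
import random
from collections import Counter, defaultdict

def inconsistency_rate(data, subset):
    """Fraction of instances that are inconsistent: among instances that match
    on the selected attributes, everything beyond the majority class counts."""
    groups = defaultdict(Counter)
    for features, label in data:
        groups[tuple(features[i] for i in subset)][label] += 1
    bad = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return bad / len(data)

def lvf(data, n_features, max_tries=1000, gamma=0.0, seed=0):
    """Las Vegas Filter: repeatedly draw random subsets no larger than the
    current best, keeping the smallest one whose inconsistency rate is
    within the allowed threshold gamma."""
    rng = random.Random(seed)
    best = list(range(n_features))
    for _ in range(max_tries):
        size = rng.randint(1, len(best))
        subset = rng.sample(range(n_features), size)
        if inconsistency_rate(data, subset) <= gamma:
            best = subset
    return sorted(best)

# Toy data: the class is determined by attributes 0 and 2 only.
data = [((a, b, c), (a + c) % 2) for a in (0, 1) for b in (0, 1, 2) for c in (0, 1)]
print(lvf(data, 3))   # expected: [0, 2]
```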

The second paper pointed out three big issues in feature selection identified by Liu et al.: (1) a large number of features; (2) a large number of instances; and (3) features expanding as the environment changes. For the first issue, the LVF algorithm is used to reduce the number of features. For the second, an upgraded LVF algorithm (LVS) was developed; the major idea is to use some percentage of the data set and then add more instances step by step until certain conditions are satisfied. The last issue reduces, in many cases, to the missing-attribute problem.


1. Scaling data
2. Expanding features

At the end of this paper, the authors considered the computational implementation as a potential area of future work. Two ideas were listed: one is to use parallel feature selection; the other is to use database techniques such as data warehouses and metadata.

o Huan Liu, Rudy Setiono: Incremental Feature Selection. Applied Intelligence 9(3): 217-230 (1998)

This paper is almost the same as the one addressing LVS; the difference is just that the algorithm is named LVI.

ABB + Consistency Measure

o H. Liu, H. Motoda, and M. Dash, A Monotonic Measure for Optimal Feature Selection, Proc. of ECML-98, pages 101-106, 1998.

o M. Dash, H. Liu and H. Motoda, "Consistency Based Feature Selection", pp. 98-109, PAKDD 2000, Kyoto, Japan. April, 2000. Springer.

The first paper studied the monotonic property of the inconsistency measure, noting that most error- or distance-based measures are not monotonic. The authors argued that a monotonic measure is necessary to find an optimal subset of features using complete (but not exhaustive) search. The paper gave an ABB (Automatic Branch & Bound) algorithm to select a subset of features; a simplified sketch is given below.
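A simplified sketch of the branch-and-bound idea with a monotonic inconsistency measure: start from the full feature set, remove one feature at a time, and prune any branch whose inconsistency exceeds that of the full set (the bound). The breadth-first bookkeeping is reduced to a recursion for brevity, and the inconsistency helper is the same as in the LVF sketch; this is not the authors' exact algorithm.

```python
from collections import Counter, defaultdict

def inconsistency_rate(data, subset):
    """Same helper as in the LVF sketch above."""
    groups = defaultdict(Counter)
    for features, label in data:
        groups[tuple(features[i] for i in subset)][label] += 1
    bad = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return bad / len(data)

def abb(data, n_features):
    """Return the smallest feature subsets whose inconsistency rate does not
    exceed that of the full feature set (the bound)."""
    full = tuple(range(n_features))
    bound = inconsistency_rate(data, full)
    best = {full}
    visited = set()

    def expand(subset):
        nonlocal best
        for f in subset:
            child = tuple(x for x in subset if x != f)
            if not child or child in visited:
                continue
            visited.add(child)
            if inconsistency_rate(data, child) <= bound:   # legitimate node
                smallest = len(next(iter(best)))
                if len(child) < smallest:
                    best = {child}
                elif len(child) == smallest:
                    best.add(child)
                expand(child)   # monotonicity: only legitimate nodes can have legitimate children

    expand(full)
    return sorted(best)

# Same toy data as the LVF sketch: the class depends on attributes 0 and 2.
data = [((a, b, c), (a + c) % 2) for a in (0, 1) for b in (0, 1, 2) for c in (0, 1)]
print(abb(data, 3))   # expected: [(0, 2)]
```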

The second paper focused on the consistency measure, which was used with five different feature selection algorithms: FOCUS (exhaustive search), ABB (Automatic Branch & Bound; complete search), SetCover (heuristic search), LVF (probabilistic search), and QBB (a combination of LVF and ABB; hybrid search).

QBB + Consistency Measure

o M. Dash and H. Liu, "Hybrid search of feature subsets," in PRICAI'98, (Singapore), Springer-Verlag, November 1998.

This paper proposed a hybrid algorithm (QBB) combining probabilistic and complete search: it begins with LVF to reduce the number of features and then runs ABB to obtain the optimal feature subset. The recommendation is: if M is known to be small, apply FocusM; else if N >= 9, apply ABB; else apply QBB.

VHR (Discretization)

o H. Liu and R. Setiono, "Dimensionality reduction via discretization," Knowledge Based Systems 9(1), pp. 71-77, 1996.

This paper proposed a vertical and horizontal reduction (VHR) method to build a data and dimensionality reduction (DDR) system. The algorithm is based on the idea of Chi-merge discretization: adjacent intervals of an attribute are merged according to a chi-square test, and duplicate instances produced after merging are removed from the data set. At the end, if an attribute has been merged into only one value, the attribute can simply be discarded. A sketch of the merging step follows.
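A minimal sketch of the ChiMerge-style discretization step that VHR builds on, assuming a single numeric attribute and discrete class labels: the adjacent pair of intervals with the lowest chi-square statistic is merged until that statistic exceeds a threshold. The threshold and the function names are illustrative, not taken from the paper.

```python
import numpy as np

def chi2_of_pair(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals given their class counts."""
    table = np.array([counts_a, counts_b], dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / table.sum()
    expected[expected == 0] = 1e-9           # avoid division by zero
    return float(((table - expected) ** 2 / expected).sum())

def chimerge(values, labels, threshold=3.84):
    """Merge adjacent intervals while the lowest pairwise chi-square is below
    the threshold (3.84 is roughly the 95% critical value for 1 d.o.f.)."""
    classes = sorted(set(labels))
    # Start with one interval per distinct value, each holding class counts.
    intervals = []
    for v in sorted(set(values)):
        counts = [sum(1 for x, y in zip(values, labels) if x == v and y == c)
                  for c in classes]
        intervals.append((v, counts))
    while len(intervals) > 1:
        chis = [chi2_of_pair(intervals[i][1], intervals[i + 1][1])
                for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))
        if chis[i] >= threshold:
            break
        merged = [a + b for a, b in zip(intervals[i][1], intervals[i + 1][1])]
        intervals[i:i + 2] = [(intervals[i][0], merged)]
    return [lo for lo, _ in intervals]        # lower boundaries of final intervals

# Toy attribute: values below 5 belong to class 0, values above 5 to class 1.
vals = [1, 2, 3, 4, 6, 7, 8, 9]
labs = [0, 0, 0, 0, 1, 1, 1, 1]
print(chimerge(vals, labs))   # expected: two intervals, boundaries [1, 6]
```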


Neural Network Pruning

o R. Setiono and H. Liu, "Neural network feature selector," IEEE Trans. on Neural Networks, vol. 8, no. 3, pp. 654-662, 1997.

This paper proposed the use of a three-layer feedforward neural network to select those input attributes (features) that are most useful for discriminating classes in a given set of input patterns. A network pruning algorithm is the foundation of the method, and a simple criterion based on the accuracy rate of the network decides whether an attribute can be removed.

Kohavi, John et al.

Compared to Liu et al., Kohavi, John, et al. did much research work on the wrapper approach.

o R. Kohavi and G.H. John, Wrappers for feature subset selection, Artificial Intelligence 97(1-2) (1997), 273--324.

In this influential paper Kohavi and John presented a number of disadvantages of the filter approach to the feature selection problem, steering research towards algorithms adopting the wrapper approach.

The wrapper approach to feature subset selection is shown in the following figure:

Relevance measure

Definition 1 (Optimal feature subset). Given an inducer I and a dataset D with features X1, X2, ..., Xn, from a distribution D over the labeled instance space, an optimal feature subset, Xopt, is a subset of the features such that the accuracy of the induced classifier C = I(D) is maximal.

Definitions 2-3 (Existing definitions of relevance). Almuallim & Dietterich: a feature Xi is said to be relevant to a concept C if Xi appears in every Boolean formula that represents C, and irrelevant otherwise. Gennari et al.: Xi is relevant iff there exist some xi and y with $p(X_i = x_i) > 0$ such that $p(Y = y \mid X_i = x_i) \neq p(Y = y)$.

Definition 4. Let $S_i = \{X_1, \ldots, X_n\} \setminus \{X_i\}$ and let si be a value assignment to all features in Si. Xi is relevant iff there exist some xi, y, and si with $p(X_i = x_i) > 0$ such that $p(Y = y, S_i = s_i \mid X_i = x_i) \neq p(Y = y, S_i = s_i)$.

Definition 5 (Strong relevance). Xi is strongly relevant iff there exist some xi, y, and si with $p(X_i = x_i, S_i = s_i) > 0$ such that $p(Y = y \mid X_i = x_i, S_i = s_i) \neq p(Y = y \mid S_i = s_i)$.

Definition 6 (Weak relevance). Xi is weakly relevant iff it is not strongly relevant and there exists a subset Si' of Si for which there exist some xi, y, and si' with $p(X_i = x_i, S_i' = s_i') > 0$ such that $p(Y = y \mid X_i = x_i, S_i' = s_i') \neq p(Y = y \mid S_i' = s_i')$.

The following figure shows a view of feature set relevance.

Search & Induce

This paper then demonstrated the wrapper approach with two search methods, hill-climbing and best-first search, and two induction algorithms, decision tree (ID3) and Naïve Bayes, on 14 datasets. A sketch of the wrapper idea with forward hill-climbing is given below.
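A minimal sketch of the wrapper idea with hill-climbing (forward selection), scoring feature subsets by the cross-validated accuracy of the target learner; the choice of scikit-learn's DecisionTreeClassifier and the toy data are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def wrapper_forward_selection(X, y, estimator, cv=5):
    """Greedy hill-climbing: add the feature that most improves
    cross-validated accuracy; stop when no addition helps."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining:
        scores = [(np.mean(cross_val_score(estimator, X[:, selected + [f]], y, cv=cv)), f)
                  for f in remaining]
        score, f = max(scores)
        if score <= best_score:
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score

# Toy data: only features 0 and 1 carry the class signal.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = ((X[:, 0] + X[:, 1]) > 1.0).astype(int)
feats, acc = wrapper_forward_selection(X, y, DecisionTreeClassifier(random_state=0))
print(feats, round(acc, 3))
```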


This paper also gave some directions for future work: (1) other search engines such as simulated annealing and genetic algorithms; (2) selecting the initial subset of features; (3) incremental operations and aggregation techniques; (4) parallel computing techniques; (5) the overfitting issue (using cross-validation).

FE: PCA

o Partridge M. and R. A. Calvo. Fast Dimensionality Reduction and Simple PCA, Intelligent Data Analysis, 2(3), 1998.

A fast and simple algorithm for approximately calculating the principal components (PCs) of a data set, and so reducing its dimensionality, is described. This Simple Principal Components Analysis (SPCA) method was used for dimensionality reduction of two high-dimensional image databases, one of handwritten digits and one of handwritten Japanese characters. It was tested and compared with other techniques: matrix methods such as SVD, and several data methods. A generic sketch of simple iterative PC estimation is given below.
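To make the "simple, approximate PCs" idea concrete, here is a generic power-iteration-with-deflation sketch for estimating the leading principal components without a full eigendecomposition; it illustrates the general class of simple iterative PCA methods rather than the exact SPCA update used by Partridge and Calvo.

```python
import numpy as np

def approx_pca(X, n_components, n_iter=100, seed=0):
    """Estimate the leading principal components of X by power iteration
    on the covariance matrix, deflating after each component."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                       # center the data
    cov = Xc.T @ Xc / (len(Xc) - 1)
    components = []
    for _ in range(n_components):
        v = rng.normal(size=cov.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(n_iter):                   # power iteration
            v = cov @ v
            v /= np.linalg.norm(v)
        components.append(v)
        cov = cov - np.outer(v, v) * (v @ cov @ v)   # deflate the found component
    return np.array(components)

# Compare against an SVD on roughly rank-2 data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(500, 6))
W = approx_pca(X, n_components=2)
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
print(np.abs(np.round(W @ Vt[:2].T, 3)))          # close to the identity (up to sign)
```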

o David Gering (2002) In fulfillment of the Area Exam doctoral requirements: Linear and Nonlinear Data Dimensionality Reduction, http://www.ai.mit.edu/people/gering/areaexam/areaexam.doc

This report presented three different approaches to deriving PCA: Pearson's least-squares distance approach, Hotelling's change-of-variables approach, and the author's new method of matrix factorization for variation compression. The report also addressed Multidimensional Scaling (MDS). The author then gave three implementations of the two techniques: Eigenfaces, Locally Linear Embedding (LLE), and Isometric feature mapping (Isomap).

FE: Soft Computing Approaches

This is an interesting area.

o Pal, N.R. and K.K.Chintalapudi (1997). "A connectionist system for feature selection", Neural, Parallel and Scientific Computation Vol. 5, 359-382.

o N. R. Pal, "Soft Computing for Feature Analysis," Fuzzy Sets and Systems, Vol. 103, 201-221, 1999.

In the first paper, Pal and Chintalapudi proposed a connectionist model for selecting a subset of good features for pattern recognition problems. Each input node of an MLP has an associated multiplier (attenuation factor), which allows or restricts the passing of the corresponding feature into the higher layers of the net. A high value of the attenuation factor indicates that the associated feature is either redundant or harmful. The network learns both the connection weights and the attenuation factors; at the end of learning, features with high attenuation factors are eliminated. A sketch of this gating idea appears below.
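A minimal sketch of the gating idea, assuming a PyTorch MLP whose inputs are multiplied by learnable gates in (0, 1), where a low gate plays the role of a high attenuation factor; the sigmoid gate, the L1 penalty that pushes gates toward zero, and the toy data are illustrative assumptions, not the authors' exact attenuation function or training scheme.

```python
import torch
import torch.nn as nn

class GatedMLP(nn.Module):
    """MLP whose inputs pass through learnable multiplicative gates."""
    def __init__(self, n_features, n_hidden, n_classes):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(n_features))
        self.net = nn.Sequential(nn.Linear(n_features, n_hidden), nn.Tanh(),
                                 nn.Linear(n_hidden, n_classes))

    def forward(self, x):
        gates = torch.sigmoid(self.gate_logits)   # one multiplier per feature
        return self.net(x * gates), gates

# Toy data: the class depends on features 0 and 1 only; features 2-4 are noise.
torch.manual_seed(0)
X = torch.rand(512, 5)
y = ((X[:, 0] + X[:, 1]) > 1.0).long()

model = GatedMLP(5, 16, 2)
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()
for _ in range(300):
    opt.zero_grad()
    logits, gates = model(X)
    # The L1 penalty drives the gates of useless features toward zero.
    loss = loss_fn(logits, y) + 0.05 * gates.sum()
    loss.backward()
    opt.step()

# Gates of features 0 and 1 should stay high; the rest should shrink toward zero.
print(torch.sigmoid(model.gate_logits).detach())
```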

The second paper gave an overview of using three soft computing techniques: fuzzy logic, neural networks, and genetic algorithms for feature ranking, selection and extraction.

o Fuzzy sets were introduced in 1965 by Zadeh as a new way to represent vagueness in everyday life.

o Neural networks have the characteristics of parallel computing, robustness, built-in learnability, and the capability to deal with imprecise, fuzzy, noisy, and probabilistic information.

o Genetic Algorithms (GAs) are biologically inspired tools for optimization. They are parallel and randomized search techniques.

This paper also mentioned the separation of the two problems of feature selection and feature extraction.

For feature extraction using neural networks, Pal reviewed several methods such as:

o PCA neural networks by Rubner (J. Rubner and P. Tavan. A self-organization network for principal-component analysis. Europhysics Letters, 10:693-698, 1989.) (This reminded me of the book Principal Component Neural Networks: Theory and Applications, Kostas I. Diamantaras and S. Y. Kung, New York: Wiley, 1996.)

o Nonlinear projection by Sammon (J. W. Sammon, Jr., "A nonlinear mapping for data structure analysis," IEEE Trans. Comput., vol. C-18, pp. 401-409, May 1969.)

For feature ranking & selection using neural networks, Pal summarized three types of methods:

o Saliency based feature ranking (SAFER) (Ruck, D.W., S.K.Rogers and M.Kabrisky (1990). "Feature selection using a multilayer perceptron", Journal of Neural Network Computing, 40-48.)

o Sensitivity based feature ranking (SEFER)(R. K. De, N. R. Pal, and S. K. Pal. Feature analysis: Neural network and fuzzy set theoretic approaches. Pattern Recognition, 30(10):1579--1590, 1997)

o An attenuator based feature selection (AFES)

Neural networks

o De Backer S., Naud A., Scheunders P. Nonlinear dimensionality reduction techniques for unsupervised feature extraction. Pattern Recognition Letters, 19 (1998), pp. 711-720.

In this paper, a study is performed on unsupervised non-linear feature extraction. Four techniques were studied: a multidimensional scaling approach (MDS), Sammon's mapping (SAM), Kohonen's self-organizing map (SOM) and an auto-associative feedforward neural network (AFN). All four yield better classification results than the optimal linear approach, PCA, and therefore can be utilized as a feature extraction step in the design of classification schemes. Because of the nature of the techniques, SOM and AFN perform better for very low dimensions. Because of the complexity of the techniques, MDS and SAM are most suited for high-dimensional data sets with a limited number of data points, while SOM and AFN are more appropriate for low-dimensional problems with a large number of data points.

Aha et al.

o Aha, D. W. & Bankert, R. L. (1994), Feature selection for case-based classification of cloud types: An empirical comparison, in Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, pp. 106-112.

o Aha, D. W. and Bankert, R. L. (1995), A comparative evaluation of sequential feature selection algorithms, in Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, editors D. Fisher and H. Lenz, pp. 1-7, Ft. Lauderdale, FL.

The two papers gave a framework of sequential feature selection - BEAM. The following figures show two versions of algorithms in the two papers.

It considers feature selection as a combination of the search method (FSS or BSS) and the evaluation function (IB1 or Index).

Bradley et al.

o P. S. Bradley, O. L. Mangasarian, and W. N. Street. Feature selection via mathematical programming. INFORMS Journal on Computing, 10:209-217, 1998.

This paper tried to transform feature selection into a mathematical programming problem. The task becomes discriminating two given sets ($\mathcal{A}$ and $\mathcal{B}$) in an n-dimensional feature space by using as few of the given features as possible. In mathematical programming terms, we attempt to generate a separating plane $P = \{x \in R^n : x^T \omega = \gamma\}$, suppressing as many of the components of $\omega$ as possible, in a feature space of as small a dimension as possible, while minimizing the average distance of misclassified points to the plane.

Typically this will be achieved in a feature space of reduced dimensionality, that is, with $\mathrm{card}\{i : \omega_i \neq 0\} < n$. The term counting the nonzero components of $\omega$ is a step-function term, $e^T |\omega|_*$ (where $t_* = 1$ if $t \neq 0$ and $0$ otherwise), which is discrete; different ways of approximating it lead to different algorithms. Three methods were proposed in this paper:

o Standard sigmoid: $t_* \approx 1/(1 + e^{-\alpha t})$. This gives FSS (FS sigmoid).

o Concave exponential (advantageous for its simplicity and concavity): $t_* \approx 1 - e^{-\alpha t}$. This gives FSV (FS concave).

o Treat it as a linear program with equilibrium constraints. This gives FSL. After reformulating to avoid the computational difficulty, we get FSB (FS bilinear program).

This paper also gave an adaptation of the optimal brain damage (OBD).

Algorithms for FSS, FSV, FSB and OBD were presented. Experiments were carried out on the WPBC (Wisconsin Prognostic Breast Cancer) problem and the Ionosphere problem.

o P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proc. 15th International Conf. on Machine Learning, pages 82--90. Morgan Kaufmann, San Francisco, CA, 1998.

This paper essentially extracted part of the previous paper, focusing on FSV and its algorithm (the Successive Linearization Algorithm, SLA). The introduction of the SVM just adds another suppression term to the objective function.

Hall et al.

o Hall, M. A., Smith, L. A. (1999). Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. Proceedings of the Florida Artificial Intelligence Symposium (FLAIRS-99).

o Practical feature subset selection for machine learning. Proceedings of the 21st Australian Computer Science Conference. Springer. 181-191.

o M. Hall. Correlation-based feature selection for machine learning. PhD thesis, University of Waikato, 1999.

o Hall, M. (2000). Correlation-based feature selection of discrete and numeric class machine learning. In Proceedings of the International Conference on Machine Learning, pages 359-366, San Francisco, CA. Morgan Kaufmann Publishers.

In these papers and his thesis, Hall presented his new method CFS (Correlation-based Feature Selection). CFS couples a subset evaluation formula with an appropriate correlation measure and a heuristic search strategy; the merit of a feature subset S containing k features is

$\mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$

where $\overline{r_{cf}}$ is the average feature-class correlation and $\overline{r_{ff}}$ is the average feature-feature inter-correlation. A small sketch of this evaluation is given below.
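A small sketch of the CFS merit computation, assuming numeric data and using absolute Pearson correlation as the correlation measure (Hall's CFS actually uses information-theoretic measures such as symmetrical uncertainty for discrete data); the names and toy data are illustrative.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Merit_S = k * mean(|corr(feature, class)|) /
                 sqrt(k + k*(k-1) * mean(|corr(feature_i, feature_j)|))."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k == 1:
        r_ff = 0.0
    else:
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                        for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Toy data: features 0 and 1 are informative, feature 2 duplicates feature 0,
# feature 3 is noise. CFS should prefer {0, 1} over {0, 1, 2} and {0, 1, 3}.
rng = np.random.default_rng(0)
X = rng.random((300, 4))
X[:, 2] = X[:, 0] + 0.01 * rng.random(300)
y = X[:, 0] + X[:, 1]
for s in ([0, 1], [0, 1, 2], [0, 1, 3], [3]):
    print(s, round(cfs_merit(X, y, s), 3))
```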


Further experiments compared CFS with a wrapper—a well known approach to feature selection that employs the target learning algorithm to evaluate feature sets. In many cases CFS gave comparable results to the wrapper, and in general, outperformed the wrapper on small datasets. CFS executes many times faster than the wrapper, which allows it to scale to larger datasets.

Langley et al.

o Blum, Avrim and Langley, Pat. Selection of Relevant Features and Examples in Machine Learning, Artificial Intelligence, Vol. 97, No. 1-2, pages 245-271, 1997.

This paper addressed the two problems of irrelevant features and irrelevant examples. For feature selection, they used almost the same relevance concepts and definitions as John, Kohavi and Pfleger.

Secondly, the authors regarded feature selection as a problem of heuristic search, and four basic issues were studied: (1) the starting point; (2) the organization of the search: exhaustive or greedy; (3) the evaluation of alternative subsets; and (4) the halting criterion. This discussion is much the same as in Liu's publications. The authors then reviewed three types of feature selection methods: (1) those that embed the selection within the basic induction algorithm; (2) those that use the selection to filter features passed to induction; and (3) those that treat the selection as a wrapper around the induction process. The following two tables list the characteristics of the different methods.


Devaney et al. - Conceptual Clustering

o Devaney, M., and Ram, A. Efficient Feature Selection in Conceptual Clustering. Machine Learning: ICML '97, 92-97, Morgan Kaufmann, San Francisco, CA, 1997.

This paper addressed feature selection in unsupervised situations. A typical wrapper approach cannot be applied because of the absence of class labels in the dataset. One solution is to use average predictive accuracy over all attributes; another is to use category utility (M. A. Gluck and J. E. Corter. Information, uncertainty, and the unity of categories. In Proceedings of the 7th Annual Conference of the Cognitive Science Society, pages 283-287, Irvine, CA, 1985).

The COBWEB system (D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987), which applies category utility, was used in the paper:

$CU = \frac{\sum_{k} P(C_k) \sum_{i} \sum_{j} P(A_i = V_{ij} \mid C_k)^2 - \sum_{i} \sum_{j} P(A_i = V_{ij})^2}{n}$

The Ck terms are the concepts in the partition, Ai is each attribute, and Vij is each of the possible values for that attribute. This equation yields a measure of the increase in the number of attribute values that can be predicted given a set of concepts, C1 ... Ck, over the number of attribute values that could be predicted without using any concepts. The term $P(A_i = V_{ij})$ is the probability of each attribute value independent of class membership and is obtained from the parent of the partition. The term $P(C_k)$ weights the values for each concept according to its size, and the division by n, the number of concepts in the partition, allows comparison of partitions of different sizes. A sketch of this computation is given below.
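A small sketch of the category utility computation for nominal attributes, assuming a partition given as a list of clusters of instances; the variable names and toy data are illustrative.

```python
from collections import Counter

def category_utility(partition):
    """partition: list of clusters; each cluster is a list of instances,
    and each instance is a tuple of nominal attribute values."""
    all_instances = [inst for cluster in partition for inst in cluster]
    n_total = len(all_instances)
    n_attrs = len(all_instances[0])

    def sum_sq_probs(instances):
        # Sum over attributes and values of P(A_i = V_ij)^2 within `instances`.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(inst[i] for inst in instances)
            total += sum((c / len(instances)) ** 2 for c in counts.values())
        return total

    parent_term = sum_sq_probs(all_instances)
    cu = 0.0
    for cluster in partition:
        p_ck = len(cluster) / n_total
        cu += p_ck * sum_sq_probs(cluster)
    return (cu - parent_term) / len(partition)

# Toy example: two clusters that separate perfectly on both attributes,
# versus a partition that mixes the same instances arbitrarily.
good = [[('a', 'x'), ('a', 'x')], [('b', 'y'), ('b', 'y')]]
bad = [[('a', 'x'), ('b', 'y')], [('a', 'x'), ('b', 'y')]]
print(category_utility(good), category_utility(bad))   # good > bad (bad is 0)
```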


For continuous attributes, the paper used the CLASSIT algorithm (Gennari, J. H. (1990). An experimental study of concept formation. Doctoral dissertation, Department of Information & Computer Science, University of California, Irvine.), whose category utility is

$CU = \frac{\sum_{k} P(C_k) \sum_{i} 1/\sigma_{ik} - \sum_{i} 1/\sigma_{ip}}{K}$

where K is the number of classes in the partition, $\sigma_{ik}$ is the standard deviation of attribute i in class k, and $\sigma_{ip}$ is the standard deviation of attribute i at the parent node. The method, as the authors described, blurs the traditional wrapper/filter distinction: it is like a wrapper model in that the underlying learning algorithm is being used to guide the descriptor search, but it is like a filter in that the evaluation function measures an intrinsic property of the data rather than some type of predictive accuracy.

Then a hill-climbing-based search algorithm, AICC, was proposed, and the heart disease and LED datasets were used to benchmark the methodology.

Caruana & Freitag

o Caruana, R. and D. Freitag. Greedy Attribute Selection. In International Conference on Machine Learning, 1994.

The paper examined five greedy hill-climbing procedures (forward selection, backward elimination, forward stepwise selection, backward stepwise elimination, and backward stepwise elimination with SLASH) that search for attribute sets that generalize well with ID3/C4.5. A caching scheme was presented that makes attribute hill-climbing more practical computationally.