Feature Selection I


Dr. Athanasios Tsanas (‘Thanasis’), Wellcome Trust post-doctoral fellow, Institute of Biomedical Engineering (IBME); Affiliate researcher, Centre for Mathematical Biology, Mathematical Institute; Junior Research Fellow, Kellogg College; University of Oxford

Information Driven Healthcare: Machine Learning course

Lecture: Feature selection I --- Concepts

Centre for Doctoral Training in Healthcare Innovation

The big picture: signal processing and statistical machine learning

[Pipeline diagram: Feature generation -> Feature selection/transformation -> Statistical mapping]

Signal processing course: tools for extracting information (patterns) from the data (feature generation)

This course: using the extracted information from the data

Statistical mapping: associating features with another measured quantity (response) – supervised learning

Goal: maximize the information available in the data to predict the response

Supervised learning setting

Subjects   feature 1   feature 2   ...   feature M   Outcome
S1         3.1         1.3         ...   0.9         type 1
S2         3.7         1.0         ...   1.3         type 2
S3         2.9         2.6         ...   0.6         type 1
...        ...         ...         ...   ...         ...
SN         1.7         2.0         ...   0.7         type 5

N rows (samples), M columns (features or characteristics); the feature values form the design matrix X and the Outcome column is the response y

y = f(X), f: mechanism, X: feature set, y: outcome

Feature selection: which features 1…M in the design matrix X should we keep?

Feature transformation: project the features to a new lower dimensional feature space
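To make the notation concrete, here is a minimal MATLAB sketch of the design matrix X and response y, using the illustrative numbers from the table above and treating M as 3 for brevity; the two reduction routes shown (column selection, linear projection) are toy examples, not the lecture's specific methods.

```matlab
% Design matrix X: N rows (subjects) by M columns (features); response y: N x 1.
% Values are the illustrative numbers from the table above (M taken as 3 here).
X = [3.1 1.3 0.9;
     3.7 1.0 1.3;
     2.9 2.6 0.6;
     1.7 2.0 0.7];
y = [1; 2; 1; 5];          % outcome coded numerically ("type 1", "type 2", ...)

Xsel = X(:, [1 3]);        % feature selection: keep a subset of the original columns
Xnew = X * randn(3, 2);    % feature transformation: toy projection onto 2 new axes
```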

Introduction to the problem

Many features (large M): curse of dimensionality

This obstructs interpretability and is detrimental to the learning process

Solution to the problem

Reduce the initial feature space from M to m dimensions, where m < M (often m << M), via:

Feature selection

Feature transformation

Main concepts

Principle of parsimony

Information content

Statistical associations

Computational constraints

We want to determine the most parsimonious feature subset with maximum joint information content

Feature transformation

Construct a lower-dimensional space in which the new data points preserve the distances between the data points in the original feature space

Different algorithms arise depending on how we define the distance
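As one illustration (my own choice, not a method named on the slide), classical multidimensional scaling from the MATLAB Statistics Toolbox builds such a lower-dimensional embedding from pairwise distances; a minimal sketch:

```matlab
% Embed the N x M design matrix X into 2 dimensions while approximately
% preserving the pairwise Euclidean distances of the original feature space.
D      = pdist(X);          % pairwise distances between data points
[Y, e] = cmdscale(D);       % classical multidimensional scaling
Y2     = Y(:, 1:2);         % keep the first 2 coordinates

% Sanity check: correlation between original and embedded distances
corr(D', pdist(Y2)')        % close to 1 when distances are well preserved
```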

Feature transformation problems

Results are not easily interpretable

Does not save resources on data collection or data processing

Reliable transformation in high-dimensional spaces is problematic

Feature selection

Discard features that do not contribute towards predicting the response

Feature selection advantages

Interpretable

Retains domain expertise

Often the only useful approach in practice (e.g. in microarray data)

Saves resources on data collection and data processing

Feature selection approaches

Two approaches:

Wrappers (involve a learner)

Filters (rely on the information content of the feature subset, e.g. using statistical tests)

Wrappers

Computationally intensive

Rely on incorporating a learner

Feature exportability problems

Wrappers may produce models with better predictive performance compared to filters

Filters

Rely on basic concepts from statistics and information theory

Computationally fast

Learner performance comes in only at a later stage, and hence filters may generalize better than wrappers

Filter concept: relevance

Maximum relevance: features (F) and response (y)

Which features would you choose? In which order?

[Diagram: features F1, F2, F3 and the response y]
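One simple way to quantify relevance, used here purely as an illustration (the lecture leaves the metric open), is to rank features by their absolute Spearman correlation with the response:

```matlab
% Rank the columns of the design matrix X by relevance to the response y,
% measured here by the absolute Spearman correlation (one metric among many).
M   = size(X, 2);
rel = zeros(M, 1);
for j = 1:M
    rel(j) = abs(corr(X(:, j), y, 'type', 'Spearman'));
end
[~, order] = sort(rel, 'descend');   % order(1) is the most relevant feature
```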

Filter concept: redundancy

Minimum redundancy amongst features in the subset

Which features would you choose? In which order?

[Diagram: features F1, F2, F3, F4]
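In the same illustrative spirit, redundancy can be gauged from pairwise associations amongst the features themselves:

```matlab
% Pairwise absolute correlations amongst features: large off-diagonal entries
% flag redundant features that carry largely overlapping information.
R = abs(corr(X, 'type', 'Spearman'));    % M x M association matrix
R = R - diag(diag(R));                   % zero the diagonal (self-correlation)
redundancy = mean(R, 2);                 % average redundancy of each feature
```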

Filter concept: complementarity

Conditional relevance (feature interaction or complementarity)

Formalizing these concepts

How to express relevance and redundancy (i.e. which are the appropriate metrics?)

Metrics include: correlation coefficients, mutual information, statistical tests, p-values, information gain…

How to compromise between relevance and redundancy?

Process? (forward selection vs. backward elimination)
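For concreteness, the mutual information mentioned above between one continuous feature F and a response y coded as positive integers can be estimated with a simple histogram-based sketch (the bin count and coding convention are my own assumptions):

```matlab
% Histogram-based estimate of the mutual information I(F; y), with F a
% continuous feature (column vector) and y a response coded as 1, 2, ..., K.
nbins = 10;                                                  % assumed bin count
Fb    = discretize(F, linspace(min(F), max(F), nbins + 1));  % bin the feature
Pxy   = accumarray([Fb, y], 1);                              % joint counts
Pxy   = Pxy / sum(Pxy(:));                                   % joint probabilities
Px    = sum(Pxy, 2);                                         % marginal of F
Py    = sum(Pxy, 1);                                         % marginal of y
PxPy  = Px * Py;                                             % product of marginals
nz    = Pxy > 0;                                             % skip zero cells (avoid log 0)
MI    = sum(Pxy(nz) .* log2(Pxy(nz) ./ PxPy(nz)));           % mutual information in bits
```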

To be continued…

In the following lecture we will look at specific algorithms!

Usual steps in forward selection
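As a rough sketch of the usual greedy forward-selection loop, with a placeholder scoring function of my own choosing (a simple filter criterion here; a wrapper would instead score with a learner's cross-validated performance):

```matlab
% Greedy forward selection on the design matrix X and response y: start from
% the empty set and repeatedly add the feature that most improves the score.
m = 5;                                                      % number of features to keep (example value)
scoreSubset = @(X, y, idx) mean(abs(corr(X(:, idx), y)));   % placeholder criterion

selected  = [];
remaining = 1:size(X, 2);
for step = 1:m
    best = -inf;
    for f = remaining
        s = scoreSubset(X, y, [selected f]);    % score the candidate subset
        if s > best
            best = s;  bestFeat = f;
        end
    end
    selected  = [selected bestFeat];            % add the best feature
    remaining = setdiff(remaining, bestFeat);
end
```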

LASSO

Start with classical ordinary least squares regression

L1 penalty: sparsity promoting, some coefficients become 0
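A minimal usage sketch of the MATLAB lasso function linked later in the slides (assuming a continuous response y):

```matlab
% Fit the LASSO regularization path with 10-fold cross-validation over lambda.
[B, FitInfo] = lasso(X, y, 'CV', 10);

% Coefficients at the lambda minimizing cross-validated MSE; the features with
% non-zero coefficients are the ones the L1 penalty has kept.
coef     = B(:, FitInfo.IndexMinMSE);
selected = find(coef ~= 0);
```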

RELIEF

Feature weighting algorithm

Concept: work with nearest neighbours

Nearest hit (NH) and nearest miss (NM)

Great for datasets with interactions, but does not account for information redundancy
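A minimal usage sketch of MATLAB's relieff (linked later), assuming a classification response y and 10 nearest neighbours:

```matlab
% RELIEF-F feature weighting: features are scored using nearest hits/misses.
k = 10;                                  % number of nearest neighbours
[ranked, weights] = relieff(X, y, k);    % ranked(1) indexes the top-weighted feature
```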

mRMR

minimum Redundancy Maximum Relevance (mRMR)

Trades off relevance and redundancy

Does not account for interactions and non-pairwise redundancy

Generally works very well
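Conceptually, at each greedy step (as in the forward-selection sketch earlier) mRMR adds the candidate feature f with the highest relevance-minus-redundancy score given the already selected set S; a sketch using absolute correlation as a stand-in for the mutual information used by the linked implementation:

```matlab
% mRMR-style score for candidate feature f given the selected set S.
f = 2;  S = [1 3];                                % hypothetical example indices
relevance  = abs(corr(X(:, f), y));               % association of f with the response
redundancy = mean(abs(corr(X(:, f), X(:, S))));   % mean association of f with selected features
score      = relevance - redundancy;              % greedily pick the f maximizing this
```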

Comparing feature selection algorithms

Selecting the ‘true’ feature subset (i.e. discarding features which are known to be noise)

o Possible only for artificial datasets

Maximize the out-of-sample prediction performance

o proxy for assessing feature selection algorithms

o adds an additional ‘layer’: the learner

o beware of feature exportability (different learners may give different results)

o BUT… in practice this is really what is of most interest!
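A minimal sketch of this out-of-sample protocol, selecting features on the training folds only and scoring a learner on the held-out fold (RELIEF and k-nearest neighbours are arbitrary choices here; numeric class labels are assumed):

```matlab
% Assess a feature selection algorithm by out-of-sample classification error.
m   = 5;                                    % number of features to keep (example value)
cv  = cvpartition(y, 'KFold', 10);
err = zeros(cv.NumTestSets, 1);
for k = 1:cv.NumTestSets
    tr = training(cv, k);  te = test(cv, k);

    % Select features using the training fold only, to avoid optimistic bias.
    ranked = relieff(X(tr, :), y(tr), 10);
    keep   = ranked(1:m);

    mdl    = fitcknn(X(tr, keep), y(tr));   % learner: an arbitrary choice
    err(k) = mean(predict(mdl, X(te, keep)) ~= y(te));
end
mean(err)                                   % estimated out-of-sample error
```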

Matlab code

LASSO: http://www.mathworks.co.uk/help/stats/lasso.html

RELIEF: http://www.mathworks.co.uk/help/stats/relieff.html

mRMR: http://www.mathworks.com/matlabcentral/fileexchange/14888

Be careful: the latter implementation relies on discrete features and computes densities using histograms. For continuous features you would need another density estimator (e.g. kernel density estimation).
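For the continuous-feature caveat above, one option is MATLAB's ksdensity, which returns a kernel (Gaussian by default) density estimate that could replace the histogram-based densities; a minimal sketch:

```matlab
% Kernel density estimate of a continuous feature F.
[f, xi] = ksdensity(F);   % f: estimated density values at evaluation points xi
plot(xi, f)               % visualize the smoothed density
```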

UCI ML repository http://archive.ics.uci.edu/ml/

Conclusions

Multi-faceted problem, fertile field for research

No free lunch theorem (no universally best algorithm)

Trade-offs

o algorithmic: relevance, redundancy, complementarity

o computational: wrappers are costly but often give better results

o comprehensive search of the feature space, e.g. genetic algorithms (very costly)

Reducing the number of features may improve prediction performance and always improves interpretability