Feature Selection I


Transcript of Feature Selection I

Page 1: Feature Selection I

Dr. Athanasios Tsanas (‘Thanasis’), Wellcome Trust post-doctoral fellow, Institute of Biomedical Engineering (IBME), Affiliate researcher, Centre for Mathematical Biology, Mathematical Institute, Junior Research Fellow, Kellogg College University of Oxford

Information Driven Healthcare: Machine Learning course

Lecture: Feature selection I --- Concepts

Centre for Doctoral Training in Healthcare Innovation

Page 2: Feature Selection I

The big picture: signal processing and statistical machine learning

[Pipeline diagram: Feature generation → Feature selection/transformation → Statistical mapping]

Signal processing course: tools for extracting information (patterns) from the data (feature generation)

This course: using the extracted information from the data

Statistical mapping: associating features with another measured quantity (response) – supervised learning

Goal: maximize the information available in the data to predict the response

Page 3: Feature Selection I

Supervised learning setting

Subjects   feature 1   feature 2   ...   feature M   Outcome
S1         3.1         1.3         ...   0.9         type 1
S2         3.7         1.0         ...   1.3         type 2
S3         2.9         2.6         ...   0.6         type 1
...
SN         1.7         2.0         ...   0.7         type 5

The design matrix X has N rows (samples) and M columns (features or characteristics); y denotes the outcome column.

y = f(X), where f: mechanism, X: feature set, y: outcome

Feature selection: which features 1…M in the design matrix X should we keep?

Feature transformation: project the features to a new lower dimensional feature space
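As a minimal MATLAB sketch of this setting (the numbers are synthetic, not the slide's values, and a continuous outcome stands in for the 'type' labels above):

N = 100;                                        % number of subjects (rows)
M = 5;                                          % number of features (columns)
X = randn(N, M);                                % design matrix X (N-by-M)
y = X(:, 1) + 0.5*X(:, 3) + 0.1*randn(N, 1);    % outcome y = f(X) + noise (N-by-1)

The later sketches reuse this X and y as a running example.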

Page 4: Feature Selection I

Introduction to the problem

Many features M → curse of dimensionality

This obstructs interpretability and is detrimental to the learning process

Page 5: Feature Selection I

Solution to the problem

Reduce the initial feature space of M features to m features, with m < M (often m << M), via either:

Feature selection

Feature transformation

Page 6: Feature Selection I

Main concepts

Principle of parsimony

Information content

Statistical associations

Computational constraints

We want to determine the most parsimonious feature subset with maximum joint information content

Page 7: Feature Selection I

Feature transformation

Construct a lower dimensional space in which the new data points retain the distances between the data points in the original feature space

Different algorithms arise depending on how we define the distance
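PCA is not named on the slide, but it is one standard feature-transformation algorithm; a minimal MATLAB sketch (assuming the Statistics and Machine Learning Toolbox, and an arbitrary target dimensionality m) projecting the data from the setting above onto its first m principal components:

m = 2;                              % target dimensionality (assumption)
[coeff, score] = pca(X);            % principal directions and projected (centred) data
Xlow = score(:, 1:m);               % new N-by-m representation of the N samples

The top components preserve most of the variance, so Euclidean distances in Xlow approximate those in the original feature space.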

Page 8: Feature Selection I

Feature transformation problems

Results are not easily interpretable

Does not save resources on data collection or data processing

Reliable transformation in high dimensional spaces is problematic

Page 9: Feature Selection I

Feature selection

Discard features that do not contribute towards predicting the response

Page 10: Feature Selection I

Feature selection advantages

Interpretable

Retain domain expertise

Often the only useful approach in practice (e.g. for microarray data)

Saves resources on data collection and data processing

Page 11: Feature Selection I

Feature selection approaches

Two approaches:

Wrappers (involve a learner)

Filters (rely on the information content of the feature subset, e.g. using statistical tests)

Page 12: Feature Selection I

Wrappers

Computationally intensive

Rely on incorporating a learner

Feature exportability problems

Wrappers may produce models with better predictive performance compared to filters

Page 13: Feature Selection I

Filters

Rely on basic statistical and information-theoretic concepts

Computationally fast

The learner comes in only later, hence filters may generalize better than wrappers

Page 14: Feature Selection I

Filter concept: relevance

Maximum relevance: features (F) and response (y)

Which features would you choose? In which order?

[Diagram: features F1, F2, F3 and their associations with the response y]

Page 15: Feature Selection I

Filter concept: redundancy

Minimum redundancy amongst features in the subset

Which features would you choose? In which order?

[Diagram: candidate features F1, F2, F3, F4 and their pairwise associations]

Page 16: Feature Selection I

Filter concept: complementarity

Conditional relevance (feature interaction or complementarity)
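A hypothetical illustration of complementarity (not from the slide) is the XOR setting: each feature alone carries essentially no information about the class, but the two features together determine it. A minimal MATLAB sketch:

N  = 1000;
F1 = double(rand(N, 1) > 0.5);      % binary feature 1
F2 = double(rand(N, 1) > 0.5);      % binary feature 2
y  = double(xor(F1, F2));           % class depends on F1 and F2 jointly
corr(F1, y)                         % near 0: F1 alone looks irrelevant
corr(F2, y)                         % near 0: F2 alone looks irrelevant

Knowing both F1 and F2 predicts y perfectly, even though each is individually "irrelevant" by a pairwise metric.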

Page 17: Feature Selection I

Formalizing these concepts

How to express relevance and redundancy (i.e. which are the appropriate metrics)?

Metrics include: correlation coefficients, mutual information, statistical tests, p-values, information gain…

How to compromise between relevance and redundancy?

Process? (forward selection vs. backward elimination)
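As a rough MATLAB sketch of these quantities, using absolute Pearson correlation as the metric (just one of the options listed above) and given the design matrix X and response y from the running example; the subset S is hypothetical:

S          = [1 3 4];                                            % hypothetical candidate subset
relevance  = mean(abs(corr(X(:, S), y)));                        % average |correlation| with y
R          = abs(corrcoef(X(:, S)));                             % |correlation| among the subset
redundancy = (sum(R(:)) - numel(S)) / (numel(S)*(numel(S)-1));   % mean off-diagonal entry
criterion  = relevance - redundancy;                             % one simple trade-off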

Page 18: Feature Selection I

To be continued…

In the following lecture we will look at specific algorithms!

Page 19: Feature Selection I

Usual steps in forward selection
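The slide's flow chart is not reproduced in this transcript; as a hedged sketch of the usual greedy loop, with a placeholder subset-scoring criterion (any filter or wrapper score could be substituted) and an assumed target subset size:

% Forward selection: greedily add the feature that most improves the subset score
score_fn  = @(cand) mean(abs(corr(X(:, cand), y)));  % placeholder criterion: average relevance
m_target  = 5;                                       % desired subset size (assumption)
selected  = [];
remaining = 1:size(X, 2);
for step = 1:m_target
    scores = arrayfun(@(f) score_fn([selected f]), remaining);
    [~, i] = max(scores);                            % best single addition this round
    selected  = [selected remaining(i)];             % grow the subset by one feature
    remaining(i) = [];                               % remove it from the candidates
end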

Page 20: Feature Selection I

LASSO

Start with classical ordinary least squares regression

L1 penalty: sparsity promoting – some coefficients become exactly 0
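A hedged usage sketch of the lasso function linked on the 'Matlab code' slide (least squares plus an L1 penalty on the coefficients); the 10-fold cross-validation and the one-standard-error choice of lambda are illustrative assumptions, not part of the slide:

[B, FitInfo] = lasso(X, y, 'CV', 10);        % fit an L1-penalised path with 10-fold CV
bestB    = B(:, FitInfo.Index1SE);           % coefficients at the '1 standard error' lambda
selected = find(bestB ~= 0);                 % features the sparsity penalty has kept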

Page 21: Feature Selection I

RELIEF

Feature weighting algorithm

Concept: work with nearest neighbours

Nearest hit (NH) and nearest miss (NM)

Great for datasets with interactions, but does not account for information redundancy
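A minimal usage sketch of the relieff function linked on the 'Matlab code' slide; the number of nearest neighbours and the number of retained features are illustrative assumptions:

k = 10;                                   % number of nearest neighbours (assumption)
[ranked, weights] = relieff(X, y, k);     % feature ranking via nearest hits and misses
top_features = ranked(1:3);               % e.g. keep the three highest-weighted features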

Page 22: Feature Selection I

mRMR

minimum Redundancy Maximum Relevance (mRMR)

Trades off relevance against redundancy

Does not account for interactions or non-pairwise redundancy

Generally works very well
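A rough sketch of the mRMR criterion itself, using absolute correlation in place of the mutual information used in the File Exchange implementation (purely for illustration, with an assumed subset size):

% mRMR-style greedy selection: maximise relevance minus mean redundancy
rel       = abs(corr(X, y));                  % relevance proxy for every feature
selected  = [];
remaining = 1:size(X, 2);
for step = 1:min(3, size(X, 2))               % pick e.g. three features (assumption)
    best = -Inf;  pick = remaining(1);
    for f = remaining
        if isempty(selected)
            red = 0;                          % no redundancy term for the first feature
        else
            red = mean(abs(corr(X(:, f), X(:, selected))));  % mean redundancy with the subset
        end
        score = rel(f) - red;                 % mRMR 'difference' form of the trade-off
        if score > best, best = score; pick = f; end
    end
    selected  = [selected pick];
    remaining = setdiff(remaining, pick);
end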

Page 23: Feature Selection I

Comparing feature selection algorithms

Selecting the ‘true’ feature subset (i.e. discarding features which are known to be noise)

o Possible only for artificial datasets

Maximize the out-of-sample prediction performance

o proxy for assessing feature selection algorithms

o adds an additional ‘layer’: the learner

o beware of feature exportability (different learners may give different results)

o BUT… in practice this is really what is of most interest!
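A hedged MATLAB sketch of this evaluation protocol, assuming a classification outcome y (class labels), a previously chosen index vector 'selected', and k-nearest neighbours as an arbitrary stand-in learner:

cvp = cvpartition(y, 'KFold', 10);               % stratified 10-fold cross-validation
acc = zeros(cvp.NumTestSets, 1);
for i = 1:cvp.NumTestSets
    tr = training(cvp, i);   te = test(cvp, i);
    mdl    = fitcknn(X(tr, selected), y(tr), 'NumNeighbors', 5);   % learner on the subset only
    yhat   = predict(mdl, X(te, selected));
    acc(i) = mean(yhat == y(te));                % accuracy on the held-out fold
end
mean(acc)                                        % out-of-sample performance of the subset

Repeating this for the subsets chosen by different algorithms gives the comparison the slide describes, subject to the feature-exportability caveat above.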

Page 24: Feature Selection I

Matlab code

LASSO: http://www.mathworks.co.uk/help/stats/lasso.html

RELIEF: http://www.mathworks.co.uk/help/stats/relieff.html

mRMR: http://www.mathworks.com/matlabcentral/fileexchange/14888

Be careful: the latter implementation relies on discrete features and computes densities using histograms. For continuous features you would need another density estimator (e.g. kernel density estimation)

UCI ML repository http://archive.ics.uci.edu/ml/
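One simple (if crude) workaround, sketched here as an assumption rather than a recommendation from the slide: bin each continuous feature into a fixed number of equal-width bins before handing it to the discrete mRMR implementation; kernel density estimation is the smoother alternative mentioned above. The bin count is arbitrary.

nbins = 10;                                    % number of bins (arbitrary assumption)
Xdisc = zeros(size(X));
for j = 1:size(X, 2)
    Xdisc(:, j) = discretize(X(:, j), nbins);  % equal-width bins over each feature's range
end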

Page 25: Feature Selection I

Conclusions

Multi-faceted problem, fertile field for research

No free lunch theorem (no universally best algorithm)

Trade-offs

o algorithmic: relevance, redundancy, complementarity

o computational: wrappers are costly but often give better results

o comprehensive search of the feature space, e.g. genetic algorithms (very costly)

Reducing the number of features may improve prediction performance and always improves interpretability