Preprocessing and Dimensionality Reduction


Transcript of Preprocessing and Dimensionality Reduction

Page 1:

Preprocessing and Dimensionality Reduction

Jérémy Fix

CentraleSupélec

[email protected]

2017

Page 2:

Where to get data

You need datasets

You can use open datasets

For example, for experimenting with a new ML algorithm:

• UCI ML Repo: http://archive.ics.uci.edu/ml/

• Kaggle competitions, e.g. https://www.kaggle.com/c/diabetic-retinopathy-detection

• specific, well-known datasets for specific ML problems

Page 3:

Where to get data

Some available datasets

Face expression classification

48x48-pixel grayscale images of faces.
Labels: 0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral.
28K train images; 3K for the public test, another 3K for the final test.

Kaggle, ICML 2013

Page 4:

Where to get data

Some available datasets

Object localization/detection

PascalVOC2012: 20 classes; 20,000 train images, 20,000 val, 11,000 test. Average image size: 469x387 pixels, RGB.

Classes: person / bird, cat, cow, dog, horse, sheep / aeroplane, bicycle, boat, bus, car, motorbike, train / bottle, chair, dining table, potted plant, sofa, tv-monitor.

http://host.robots.ox.ac.uk/pascal/VOC/

Page 5:

Where to get data

Some available datasets

Object localization/detection

ImageNet, ILSVRC2014: 1000 classes; 1.2M train images, 50K validation, 100K test. Average image size: 482x415 pixels, RGB.

ImageNet Large Scale Visual Recognition Challenge, Russakovsky et al. (2015)

Page 6:

Where to get data

Some available datasets

Object localization/detection

Open Images Dataset: https://github.com/openimages/dataset

• ≈ 9M automatically labelled images, 4M human-validated

• 80M bounding boxes, 6000 classes

• both meta labels (e.g. vehicle) and fine-grained labels (e.g. Honda NSX)

Page 7:

Where to get data

Some available datasets

Object segmentation

COCO 2017: 200K images, 80 classes, 500K masks

http://cocodataset.org/

Page 8:

Where to get data

Some available datasets

Recommendation systems

MovieLens, Netflix Prize, Anime Recommendations Database. MovieLens 20M:

• 27K movies by 138K users

• 5-star ratings, in 0.5 increments (0.0, 0.5, ...)

• 20M ratings

• metadata (e.g. genre)

• links to IMDb to enrich the metadata

https://grouplens.org/datasets/movielens/

Page 9:

Where to get data

Some available datasets

Automatic speech recognition

TIMIT, VoxForge, ... TIMIT:

• 630 speakers, eight American English dialects

• time-aligned orthographic, phonetic and word transcriptions

• 16kHz speech waveform file for each utterance

https://catalog.ldc.upenn.edu/ldc93s1

Page 10:

Where to get data

Some available datasets

Sentiment analysis

Large Movie Review Dataset (IMDB)

• 25K reviews for training, 25K reviews for testing

• movie reviews (sentences), with a rating in [1, 10]

• aim: are reviews of a given product positive or negative?

Maas(2011), Learning Word Vectors for Sentiment Analysis

Automatic translation

Dataset from the European Parliament (Europarl dataset)

• single language datasets (language model)

• parallel corpora (translation), e.g. French-English (2M sentences), Czech-English (650K sentences), ...

Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005

Page 11:

Make your own dataset

You need datasets

You have a specific problem

You may need to collect data on your own.

• Crawl the web? (e.g. the Twitter API, ...)

• if supervised learning: assign labels (Mechanical Turk, domain experts, e.g. for classifying tumors)

• Ensure you collected sufficient features

Page 12:

We need vectors, appropriately scaled, without missing values

Preprocessing

Page 13:

We need vectors, appropriately scaled, without missing values

Preprocessing data

Data are not necessarily vectorial

• Ordinal or categorical: poor/fair/excellent; Male/Female

• Text documents : bag of words / word embeddings

Even if vectorial

• Missing data: check how missing values are indicated (-9, ' ', ...) → imputation of missing values

• Feature scaling

Page 14:

We need vectors, appropriately scaled, without missing values

Your data might not be vectorial data

Ordinal and categorical features

Ordinal values have an order:

    Ordinal feature value     poor   fair   excellent
    Numerical feature value   -1     0      1

Categorical values do not have an order (use a one-hot encoding):

    Categorical value   American       Spanish        German         French
    Numerical value     [1, 0, 0, 0]   [0, 1, 0, 0]   [0, 0, 1, 0]   [0, 0, 0, 1]
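As a small illustration, here is how these two encodings might look with scikit-learn's preprocessing module (a minimal sketch; the toy feature values are made up, and the sparse_output flag assumes scikit-learn ≥ 1.2):

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal feature: the category order is meaningful, so map it to integers.
quality = [["poor"], ["excellent"], ["fair"]]
ord_enc = OrdinalEncoder(categories=[["poor", "fair", "excellent"]])
print(ord_enc.fit_transform(quality))         # [[0.], [2.], [1.]]

# Categorical feature: no order, so encode it one-hot.
nationality = [["American"], ["Spanish"], ["German"], ["French"]]
onehot_enc = OneHotEncoder(sparse_output=False)
print(onehot_enc.fit_transform(nationality))  # one row of 0/1 indicators per sample
```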

Page 15:

We need vectors, appropriately scaled, without missing values

Your data might not be vectorial data

Vectorial representation of text documents

Bag Of Words

• define a vocabulary V, |V| = n

• for each document, build a vector x so that xi is the frequency of the word Vi

e.g. V = {I, in, love, metz, machine learning, study}
"I love machine learning and love metz too." → x = [1, 0, 2, 1, 1, 0]
"I love studying machine learning in Metz." → x = [1, 1, 1, 1, 1, 1]
Does not take word order into account → n-grams, but these lead to sparser representations.
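A minimal bag-of-words sketch with scikit-learn's CountVectorizer, on the two sentences above (the counts will differ slightly from the slide, since the vectorizer builds its own vocabulary and drops single-character tokens by default):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love machine learning and love metz too.",
    "I love studying machine learning in Metz.",
]
# Each document becomes a vector of word counts over the learned vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the vocabulary V
print(X.toarray())                         # one count vector per document
```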

Page 16:

We need vectors, appropriately scaled, without missing values

Your data might not be vectorial data

Vectorial representation of text documents

Word/sentence embeddings (e.g. word2vec, GloVe, fastText).

Continuous Bag of Words (CBOW): predict a word given its context

• Input and output coded with one-hot

• predict a word given its context

• hidden layer: the word representation

Captures some semantic information.
For sentences: tweet2vec, sentence2vec, averaging word vectors.

See also: Bayesian approaches (e.g. Latent Dirichlet Allocation).
Pennington (2014), GloVe: Global Vectors for Word Representation; Mikolov (2013), Efficient Estimation of Word Representations in Vector Space; https://fasttext.cc/
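A CBOW word2vec model can be trained in a few lines with gensim (a sketch, assuming the gensim 4.x API; the toy corpus is made up and far too small for meaningful vectors):

```python
from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence per list.
sentences = [
    ["i", "love", "machine", "learning"],
    ["i", "love", "studying", "machine", "learning", "in", "metz"],
]
# sg=0 selects CBOW: the model predicts a word from its context window.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
print(model.wv["machine"])               # the learned 50-d word vector
print(model.wv.most_similar("machine"))  # nearest neighbors in embedding space
```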

Page 17:

We need vectors, appropriately scaled, without missing values

Some features might be missing

Missing features

• Completely drop the samples with missing attributes, or the dimensions that have missing values,

• or try to impute, i.e. set a value in place of the missing attributes.

For missing value imputation, there are plenty of methods:

• global: assign the mean, median, or most frequent value of an attribute

• local: decide which value to impute based on the k-nearest neighbors

The bias you may introduce by imputing a value may depend on the causes of the missing values; see [Silva(2014)].
Silva(2014). A brief review of the main approaches for treatment of missing data
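Both the global and the local strategies are available in scikit-learn's impute module; a minimal sketch on a toy matrix (np.nan marks the missing entries):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Global: replace each missing entry by the column mean
# (strategy can also be "median" or "most_frequent").
print(SimpleImputer(strategy="mean").fit_transform(X))

# Local: replace each missing entry using the k nearest complete neighbors.
print(KNNImputer(n_neighbors=2).fit_transform(X))
```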

Page 18:

We need vectors, appropriately scaled, without missing values

Some vectorial data might not be appropriately scaled

Feature scaling

• dimensions with the largest variations will dominate Euclidean distances (e.g. for nearest neighbors)

• when gradient descent is involved, feature scaling makes convergence faster (the loss becomes closer to circularly symmetric)

• when regularization is involved, we would like to use a single regularization coefficient, independent of the scale of the features

Page 19:

We need vectors, appropriately scaled, without missing values

Some vectorial data might not be appropriately scaled

Feature scaling

Given $x_i \in \mathbb{R}^d$, you can normalize by:

• min/max scaling:

$$\forall i, \forall j \in [0, d-1], \quad x'_{i,j} = \frac{x_{i,j} - \min_k x_{k,j}}{\max_k x_{k,j} - \min_k x_{k,j}}$$

• z-score normalization:

$$\forall i, \forall j \in [0, d-1], \quad x'_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j}, \qquad \mu_j = \frac{1}{N} \sum_k x_{k,j}, \qquad \sigma_j = \sqrt{\frac{1}{N} \sum_k (x_{k,j} - \mu_j)^2}$$

Your statistics must be computed on the training set and then applied to the test data.
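This train/test discipline is exactly what scikit-learn's scaler objects enforce; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

scaler = StandardScaler()                  # z-score; MinMaxScaler for min/max scaling
X_train_s = scaler.fit_transform(X_train)  # mu_j, sigma_j computed on the training set only
X_test_s = scaler.transform(X_test)        # the same training statistics reused on the test set
```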

Page 20:

Dimensionality reduction

Page 21:

Dimensionality reduction: what/why/how?

What

Optimally transform $x_i \in \mathbb{R}^n$ into $z_i \in \mathbb{R}^d$ so that $d \ll n$.
It remains to define what "optimally transform" means.

Why

• visualization of the data

• interpretability of the predictor

• speed up the algorithms whose complexity depends on n

• data may occupy a manifold of lower dimensionality than n

• curse of dimensionality: data quickly become sparse, and models may overfit

Page 22:

Dimensionality reduction: why?

Data analysis/Visualization

How are your data distributed? How intertwined are your classes? Do we have discriminative features?

[Figure: t-SNE embedding of MNIST, van der Maaten et al.]

Page 23:

Dimensionality reduction: why?

Interpretability of the predictor

e.g. why does this predictor say the tumor is malignant?

Real risk = 0.92 ± 0.05; Real risk = 0.92 ± 0.06
UCI ML Breast Cancer Wisconsin (Diagnostic) dataset. Real risk estimated by 10-fold CV.

Page 24:

Dimensionality reduction: why?

Speeding up the algorithms

Decreasing the dimensionality decreases training/inference times. For example:

• Linear regression: $y = \theta^T x + b$
• Logistic regression (classification): $P(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^T x)}$

Both training and inference are in O(n) for $x \in \mathbb{R}^n$.

Page 25:

Dimensionality reduction: why?

The data may occupy a lower-dimensional manifold

Swiss roll

→ you do not necessarily lose information by reducing the number of dimensions

Page 26:

Dimensionality reduction: why?

The data may occupy a lower-dimensional manifold

You want to classify the facial expressions of a single person, under controlled illumination:

• suppose a huge image resolution, e.g. 1024×1024 RGB pixels, $x \in \mathbb{R}^{1024 \times 1024 \times 3}$

• what is the dimensionality of the data manifold? ≈ 50

→ you do not necessarily lose information by reducing the number of dimensions

Page 29:

Dimensionality reduction: why?

You may even get better predictors: the curse of dimensionality

The data become (exponentially) quickly sparse with respect to the dimension.

Image from [Goodfellow, Bengio, Courville (2016): Deep Learning].
See also [Hastie et al. (2017), The Elements of Statistical Learning].

Page 30:

Dimensionality reduction: what/why/how?

What

Optimally transform $x_i \in \mathbb{R}^n$ into $z_i \in \mathbb{R}^d$ so that $d \ll n$.
It remains to define what "optimally transform" means.

How

• select a subset of the original features: feature selection
• compute new features from the original ones: feature extraction

Page 31:

Feature selection

Feature selection

Select a subset of the original features/attributes/dimensions: $x_i \in \mathbb{R}^n \rightarrow z \in \mathbb{R}^d$

Page 32:

Feature selection

Feature selection

Overview

• Embedded: the ML algorithm is designed to select a subset of the features, e.g. linear regression with an L1 penalty

• Filters: dimensions are selected based on a heuristic
• Wrappers: dimensions are selected based on an estimation of the real risk

⇒ Notebook "Feature selection.ipynb"

Page 33:

Feature selection

Feature selection: embedded

Embedded: the loss to minimize embeds a penalty promoting sparsity.

Least Absolute Shrinkage and Selection Operator (LASSO)

Given a regression problem $(x_i, y_i)$, $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$, optimize with respect to $\theta$:

$$\frac{1}{N} \sum_{i=0}^{N-1} \left( y_i - \theta^T x_i \right)^2 + \lambda \|\theta\|_1 \qquad (1)$$

Linear regression with an L1 penalty. The L1 penalty promotes sparse predictors.

Tibshirani (1996). Regression shrinkage and Selection via the Lasso
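In scikit-learn this is the Lasso estimator; a minimal sketch on synthetic data where only a few coefficients are truly nonzero (alpha plays the role of the λ in Eq. (1), up to sklearn's 1/(2N) scaling of the data-fit term):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
theta_true = np.zeros(20)
theta_true[:3] = [1.5, -2.0, 0.7]          # only 3 informative features
y = X @ theta_true + 0.1 * rng.normal(size=100)

# Larger alpha -> stronger L1 penalty -> sparser theta.
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.flatnonzero(lasso.coef_))          # indices of the selected features
```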

Page 34:

Feature selection

Feature selection: embedded

LASSO example

N = 30 points, $y_i = 0.5 + 0.4 \sin(2\pi x_i) + \mathcal{N}(0, 0.01)$
30 RBF features + a constant term:

$$\phi(x) = \left[ 1, e^{-\frac{(x_0 - x)^2}{2\sigma^2}}, \ldots, e^{-\frac{(x_{N-1} - x)^2}{2\sigma^2}} \right]$$

[Figures: the fit (samples, true function, plain linear regression, L1-regularized linear regression) and the fitted parameters.]

≈ 20%-33% of the features are selected

Page 35:

Feature selection

Feature selection: embedded

Decision tree example

Decision tree with Gini impurity, max depth = 2, 10-fold CV accuracy 0.92.
UCI ML Breast Cancer Wisconsin dataset: 569 samples, binary classification, 30 continuous features.

Page 36:

Feature selection

Feature selection: univariate filters

Principle: measure the correlation/dependency between each input feature, considered independently, and the target. E.g. chi-squared, ANOVA test of independence, mutual information, Pearson correlation, ...

Example (continuous feature → discrete target): ANOVA

Breast cancer, ANOVA F-values.
[Figures: the class-conditional distributions P(x14 | y) (lowest F) and P(x27 | y) (highest F).]
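A univariate ANOVA filter is one line in scikit-learn; a sketch on the same breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Score each feature independently with the ANOVA F-test, keep the 5 best.
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)
print(selector.scores_)                    # one F-value per feature
print(selector.get_support(indices=True))  # indices of the selected features
```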

Page 37:

Feature selection

Feature selection: multivariate filters and wrappers

Overview

Denote by χ a subset of the dimensions/attributes/features:

• suppose we are provided a measure J(χ) of how good this subset is
• we optimize J(χ) over the possible subsets χ

If $x \in \mathbb{R}^n$, we have $2^n$ possible subsets χ:

$$\chi \in \{\emptyset, \{x_1\}, \{x_2\}, \cdots, \{x_1, x_2\}, \cdots\}$$

http://featureselection.asu.edu/ : algorithms and datasets; Python package scikit-feature.
John et al. (1994), Irrelevant features and the subset selection problem.

Page 38:

Feature selection

Feature selection: optimizing J(χ)

Tree search

[Figure: search tree over feature subsets. The root is the empty set; depth 1 holds the singletons $\{x_0\}, \{x_1\}, \ldots, \{x_{d-1}\}$; depth 2 holds the pairs $\{x_0, x_1\}, \{x_0, x_2\}, \ldots, \{x_{d-2}, x_{d-1}\}$; and so on. The number of subsets per level is 1, d, d(d-1)/2, ..., i.e. $\frac{d!}{k!(d-k)!}$ subsets of size k.]

Sequential Forward Search / Sequential Backward Search.
If you allow undoing steps: "Sequential Floating Forward Search" / "Sequential Floating Backward Search".

Variants and extensions: Somol et al. (2010), Efficient Feature Subset Selection and Subset Size Optimization.
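Greedy sequential search is available out of the box in recent scikit-learn (SequentialFeatureSelector, ≥ 0.24); a sketch wrapping a k-NN classifier:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Sequential Forward Search: start from the empty set and greedily add,
# at each step, the feature that most improves the cross-validated score.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=5,
    direction="forward",   # "backward" gives Sequential Backward Search
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # the 5 selected feature indices
```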

Page 39:

Feature selection

Feature selection: quantifying the quality of a subset of features

We need to quantify the quality J(χ) of a subset of features. Filters use a heuristic, to be maximized.

Filters

Heuristic: e.g. correlation-based feature selection (CFS).
Strategy: keep features correlated with the label, yet uncorrelated with each other.
Given a training set $\{(x_i, y_i)\}$:

$$J_{CFS}(\chi) = \frac{k \, \overline{r}(\chi, y)}{\sqrt{k + k(k-1) \, \overline{r}(\chi, \chi)}}$$

$$\overline{r}(\chi, y) = \frac{1}{k} \sum_{j \in \chi} r(x_{\cdot,j}, y), \qquad \overline{r}(\chi, \chi) = \frac{1}{k(k-1)} \sum_{\substack{(j_1, j_2) \in \chi \\ j_1 \neq j_2}} r(x_{\cdot,j_1}, x_{\cdot,j_2})$$

with k = |χ| and r a measure of correlation.

Page 40:

Feature selection

Feature selection: quantifying the quality of a subset of features

We need to quantify the quality J(χ) of a subset of features. Wrappers use an estimation of the real risk, to be minimized.

Wrappers

1. Train a predictor from the subset χ
2. J(χ) = estimation of the real risk (e.g. by cross-validation)

More theoretically grounded, but more computationally expensive.
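A wrapper criterion J(χ) is simply a cross-validated score of a predictor restricted to χ; a minimal sketch on the breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def J(chi):
    """Wrapper criterion: cross-validated accuracy of a predictor
    trained on the feature subset chi (a list of column indices)."""
    model = LogisticRegression(max_iter=5000)
    return cross_val_score(model, X[:, chi], y, cv=10).mean()

print(J([0, 1, 2]))        # quality of the subset {x0, x1, x2}
print(J(list(range(30))))  # quality of the full feature set
```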

Page 41:

Feature extraction

Feature extraction

Given N samples $x_i \in \mathbb{R}^d$, we compute $r \ll d$ new features from the original d features.

Page 42:

Feature extraction

Principal Component Analysis [Pearson(1901)]

Statement: find an affine transformation of the data minimizing the reconstruction error.

Intuition and formalisation

[Figure: a 2D point cloud and a candidate line $(w_0, w_1)$, with red segments joining each point to its orthogonal projection on the line.]

In 1D, we seek a line $(w_0, w_1)$ minimizing the sum of the squared lengths of the red segments. It is not unique!

Page 43:

Feature extraction

Principal Component Analysis [Pearson(1901)]

Statement: find an affine transformation of the data minimizing the reconstruction error. Formally:

$$\min_{\{w_0, w_1, \ldots, w_r\} \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left\| x_i - \left( w_0 + \sum_{j=1}^{r} \left( w_j^T (x_i - w_0) \right) w_j \right) \right\|_2^2 \qquad (2)$$

subject to $w_i^T w_j = \delta_{i,j}$.

→ matrix form?

Page 44:

Feature extraction

Principal Component Analysis [Pearson(1901)]

Matrix formulation of PCA

Introduce $W = (w_1 | \ldots | w_r) \in \mathcal{M}_{d \times r}(\mathbb{R})$:

$$(2) \Leftrightarrow \min_{\{w_0, w_1, \ldots, w_r\} \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left\| (I_d - W W^T)(x_i - w_0) \right\|_2^2 \quad \text{subject to } W^T W = I_r$$

Page 45:

Feature extraction

Principal Component Analysis [Pearson(1901)]

Simplification of the matrix formulation

• If M is idempotent, so is (I − M)
• $(I_d - W W^T)$ is symmetric and idempotent

$$(2) \Leftrightarrow \min_{\{w_0, w_1, \ldots, w_r\} \in \mathbb{R}^d} \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - W W^T)(x_i - w_0) \quad \text{subject to } W^T W = I_r$$

Page 46:

Feature extraction

Principal Component Analysis [Pearson(1901)]

Remember: for $u : \mathbb{R}^n \mapsto \mathbb{R}^m$, $v : \mathbb{R}^n \mapsto \mathbb{R}^m$, $A \in \mathcal{M}_{m,m}(\mathbb{R})$:

$$\frac{d\, u^T A v}{dx} = \frac{du}{dx} A v + \frac{dv}{dx} A^T u$$

Finding $w_0$

$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - W W^T)(x_i - w_0)$$

$$\frac{\partial J}{\partial w_0} = -2 (I_d - W W^T) \sum_{i=0}^{N-1} (x_i - w_0)$$

Page 47:

Feature extraction

Principal Component Analysis [Pearson(1901)]

Finding $w_0$

$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - W W^T)(x_i - w_0)$$

$$\frac{\partial J}{\partial w_0} = -2 (I_d - W W^T) \sum_{i=0}^{N-1} (x_i - w_0)$$

$$\frac{\partial J}{\partial w_0} = 0 \;\Leftrightarrow\; \exists h \in \operatorname{span}\{w_1, \ldots, w_r\}, \; w_0 = h + \frac{1}{N} \sum_i x_i$$

$(I_d - W W^T) h$ is the residual of the orthogonal projection of $h$ on the column vectors of $W$:
if $h \in \operatorname{span}\{w_1, \ldots, w_r\}$, the residual is 0;
if $h \in \operatorname{span}\{w_1, \ldots, w_r\}^{\perp}$, the residual is $h$.

Page 48:

Feature extraction

Principal Component Analysis [Pearson(1901)]

Finding $w_0$

$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - W W^T)(x_i - w_0)$$

$$\operatorname{argmin}_{w_0} J \;\Rightarrow\; w_0 = h + \frac{1}{N} \sum_i x_i, \qquad h \in \operatorname{span}\{w_1, \ldots, w_r\}, \text{ e.g. } h = 0$$

The offset $w_0$

The offset $w_0$ is the mean of the data points, up to a translation in the space spanned by the principal component vectors.
Step 1: center the data.

Page 49:

Feature extraction

Principal Component Analysis [Pearson(1901)]

Denote $\tilde{x}_i = x_i - \bar{x}$, with $\bar{x} = \frac{1}{N} \sum_i x_i$.

Deriving the first principal component

• $J = \sum_{i=0}^{N-1} \tilde{x}_i^T (I_d - W W^T) \tilde{x}_i$
• $\operatorname{argmin}_{w_1, \ldots, w_r} J = \operatorname{argmax}_{w_1, \ldots, w_r} \sum_{i=0}^{N-1} \tilde{x}_i^T W W^T \tilde{x}_i$
• $\tilde{X} = (\tilde{x}_0 | \ldots | \tilde{x}_{N-1})$
• $\operatorname{argmin}_{w_1, \ldots, w_r} J = \operatorname{argmax}_{w_1, \ldots, w_r} \sum_{j=1}^{r} w_j^T \tilde{X} \tilde{X}^T w_j$

Our optimization problem turns out to be:

$$\operatorname{argmax}_{w_1, \ldots, w_r} \sum_{j=1}^{r} w_j^T \tilde{X} \tilde{X}^T w_j \quad \text{subject to } W^T W = I_r$$

We have a constrained optimization problem: Lagrangian.

Page 50:

Feature extraction

Principal Component Analysis [Pearson(1901)]

Deriving the first principal component: Lagrangian

$$\operatorname{argmax}_{w_1} w_1^T \tilde{X} \tilde{X}^T w_1 \quad \text{subject to } w_1^T w_1 = 1$$

• $\mathcal{L}(w_1, \lambda_1) = w_1^T \tilde{X} \tilde{X}^T w_1 + \lambda_1 (1 - w_1^T w_1)$
• $\frac{\partial \mathcal{L}}{\partial w_1} = 0 \Rightarrow \tilde{X} \tilde{X}^T w_1 = \lambda_1 w_1$: $w_1$ is an eigenvector, but for which $\lambda_1$?
• $w_1^T \tilde{X} \tilde{X}^T w_1 = \lambda_1$, so $\lambda_1$ is the largest eigenvalue of $\tilde{X} \tilde{X}^T$

First principal component vector

The first principal component vector is a normalized eigenvector associated with the largest eigenvalue of the "sample covariance matrix" $\tilde{X} \tilde{X}^T$.

Page 51:

Feature extraction

Principal Component Analysis [Pearson(1901)]

Deriving the second principal component: greedy

Suppose we have $w_1$, a normalized eigenvector of $\tilde{X} \tilde{X}^T$ associated with its largest eigenvalue. Denote $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d \geq 0$ the eigenvalues. We want to optimize:

$$\operatorname{argmax}_{w_2} \left( w_1^T \tilde{X} \tilde{X}^T w_1 + w_2^T \tilde{X} \tilde{X}^T w_2 \right) = \operatorname{argmax}_{w_2} \left( \lambda_1 + w_2^T \tilde{X} \tilde{X}^T w_2 \right) = \operatorname{argmax}_{w_2} w_2^T \tilde{X} \tilde{X}^T w_2$$

with $w_i^T w_j = \delta_{i,j}$. And since

$$w_2^T \tilde{X} \tilde{X}^T w_2 = w_2^T \left( \tilde{X} \tilde{X}^T - \lambda_1 w_1 w_1^T \right) w_2,$$

$w_2$ is a normalized eigenvector associated with the largest eigenvalue of $(\tilde{X} \tilde{X}^T - \lambda_1 w_1 w_1^T)$, i.e. $\lambda_2$.

And so on. But does the greedy algorithm find the optimum?

Page 52:

Feature extraction

Principal Component Analysis [Pearson(1901)]

Deriving the other principal components: greedy

Does it make sense to use a greedy algorithm? (proof in the lecture notes)

Theorem

For any symmetric positive semi-definite matrix $M \in \mathcal{M}_{d \times d}(\mathbb{R})$, denote $\{\lambda_i\}_{i=1..d}$ its eigenvalues with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$. For any set of $r \in [\![1, d]\!]$ orthogonal unit vectors $\{v_1, \ldots, v_r\}$, we have:

$$\sum_{j=1}^{r} v_j^T M v_j \leq \sum_{j=1}^{r} \lambda_j \qquad (3)$$

This upper bound is reached by eigenvectors associated with the largest eigenvalues of $M$.

Page 53:

Feature extraction

Principal Component Analysis [Pearson(1901)]

PCA: recipe

Given $\{x_0, \ldots, x_{N-1}\} \in \mathbb{R}^d$, to compute the r principal component vectors:

1. Center your data: $\tilde{x}_i = x_i - \bar{x}$
2. Build the matrix $\tilde{X} = [\tilde{x}_0 | \ldots | \tilde{x}_{N-1}]$
3. Compute r normalized eigenvectors associated with the r largest eigenvalues of $\tilde{X} \tilde{X}^T$
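The recipe is a few lines of NumPy; a sketch on synthetic data (columns of X_tilde are the centered samples, matching the slides' convention):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))           # N = 200 samples in d = 5 dimensions, one per column
r = 2

# 1. Center the data; 2. build X_tilde with one centered sample per column.
X_tilde = X - X.mean(axis=1, keepdims=True)

# 3. Eigendecomposition of X_tilde @ X_tilde.T (d x d, symmetric).
eigvals, eigvecs = np.linalg.eigh(X_tilde @ X_tilde.T)
order = np.argsort(eigvals)[::-1]       # eigh returns eigenvalues in ascending order
W = eigvecs[:, order[:r]]               # the r principal component vectors

Z = W.T @ X_tilde                       # principal components of the samples
print(W.shape, Z.shape)                 # (5, 2) (2, 200)
```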

Page 54:

Feature extraction

Principal Component Analysis [Pearson(1901)]

PCA is a projection method

Given $x \in \mathbb{R}^d$, its principal components are its coordinates in the selected eigenspace:

$$x \rightarrow \left( (x - \bar{x})^T w_1, (x - \bar{x})^T w_2, \ldots, (x - \bar{x})^T w_r \right)$$

If $x \in \{x_0, \ldots, x_{N-1}\}$, you had better use the SVD, which gives you the principal components directly.

Page 55:

Feature extraction

Principal Component Analysis [Pearson(1901)]

Singular Value Decomposition

For any matrix $M \in \mathcal{M}_{d,N}(\mathbb{R})$, there exist an orthogonal matrix $U \in \mathcal{M}_{d,d}(\mathbb{R})$, a diagonal matrix $D \in \mathcal{M}_{d,N}(\mathbb{R})$, and an orthogonal matrix $V \in \mathcal{M}_{N,N}(\mathbb{R})$ such that:

$$M = U D V^T$$

Orthogonal matrices: $U^T = U^{-1}$.

Page 56:

Feature extraction

Principal Component Analysis [Pearson(1901)]

PCA with SVD

Given $\tilde{X} = U D V^T$:

$$\tilde{X} \tilde{X}^T = U D D^T U^{-1}$$

This is the diagonalization of $\tilde{X} \tilde{X}^T$. The projection vectors are the column vectors of $U$: $\{w_1, \ldots, w_r\} = \{u_1, \ldots, u_r\}$. The principal components of the training set are the first r rows of:

$$U^T \tilde{X} = U^T U D V^T = D V^T$$
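The same computation via the SVD, which returns the principal components directly (a NumPy sketch, keeping the one-centered-sample-per-column convention):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))
X_tilde = X - X.mean(axis=1, keepdims=True)
r = 2

# X_tilde = U @ diag(s) @ Vt ; the columns of U are the projection vectors w_j.
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
W = U[:, :r]

# Principal components: the first r rows of D V^T, i.e. s_j * (row j of Vt).
Z = s[:r, None] * Vt[:r]
print(np.allclose(Z, W.T @ X_tilde))   # True: both routes agree
```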

Page 57:

Feature extraction

Principal Component Analysis [Pearson(1901)]

What is $\tilde{X} \tilde{X}^T$?

$$\tilde{X} \tilde{X}^T = \sum_{i=0}^{N-1} \tilde{x}_i \tilde{x}_i^T = \sum_i \left( x_i - \frac{1}{N} \sum_j x_j \right) \left( x_i - \frac{1}{N} \sum_j x_j \right)^T = (N-1) \, \Sigma$$

with $\Sigma$ the sample covariance matrix. $\Sigma$ is symmetric positive semi-definite, i.e. its eigenvalues are all non-negative.

Page 58:

Feature extraction

Principal Component Analysis [Pearson(1901)]

Equivalent formulations

There are two equivalent formulations of PCA:

• Find an affine transformation minimizing the reconstruction error
• Find an affine transformation maximizing the variance of the projections

Page 59:

Feature extraction

Principal Component Analysis

Maximizing the variance of the projections

Suppose your data are centered, i.e. $\frac{1}{N} \sum_i x_i = 0$. Denote $z_i \in \mathbb{R}^r$ the projection of $x_i$ onto $w_1, \ldots, w_r$; we have $\bar{z} = 0$. The sample covariance matrix of the projections, $\Sigma_z \in \mathcal{M}_{r,r}(\mathbb{R})$, is:

$$\Sigma_z = \frac{1}{N-1} \sum_i z_i z_i^T = \frac{1}{N-1} W^T \left( \sum_i x_i x_i^T \right) W$$

We want to maximize $\sum_{j=1}^{r} (\Sigma_z)_{j,j}$, and:

$$\sum_{j=1}^{r} (\Sigma_z)_{j,j} = \frac{1}{N-1} \sum_j \sum_i (w_j^T x_i)(x_i^T w_j) = \frac{1}{N-1} \sum_j w_j^T X X^T w_j$$

This is the same optimization problem as before.

Page 60:

Feature extraction

Principal Component Analysis [Pearson(1901)]

What is the fraction of variance we keep?

For any matrix $M$ and orthogonal matrix $P$:

$$\operatorname{Tr}(P^{-1} M P) = \operatorname{Tr}(M)$$

Therefore $\operatorname{Tr}(\tilde{X} \tilde{X}^T) = \sum_i \lambda_i$, and the variance of our data points is

$$\operatorname{Tr}\left( \frac{1}{N-1} \tilde{X} \tilde{X}^T \right) = \frac{1}{N-1} \sum_i \lambda_i$$

If we keep r principal components, we keep a fraction of the variance equal to:

$$\frac{\sum_{i=0}^{r-1} \lambda_i}{\sum_i \lambda_i}$$

Page 61:

Feature extraction

Principal Component Analysis [Pearson(1901)]

PCA on MNIST (28×28 images)

[Figures: the MNIST digits 0-9 projected on the first two principal vectors (w1, w2), capturing 17.05% of the total variance, and the 10 first principal vectors displayed as images.]

Page 62:

Feature extraction

Sample covariance and Gram matrices

Definitions

The sample covariance matrix is:

$$\Sigma = \frac{1}{N-1} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{N-1} \tilde{X} \tilde{X}^T$$

The Gram matrix is:

$$G = \tilde{X}^T \tilde{X} = \begin{pmatrix} \tilde{x}_0^T \tilde{x}_0 & \tilde{x}_0^T \tilde{x}_1 & \cdots & \tilde{x}_0^T \tilde{x}_{N-1} \\ \vdots & \vdots & & \vdots \\ \tilde{x}_{N-1}^T \tilde{x}_0 & \tilde{x}_{N-1}^T \tilde{x}_1 & \cdots & \tilde{x}_{N-1}^T \tilde{x}_{N-1} \end{pmatrix}$$

The Gram matrix is built from dot products.

The eigenvectors/eigenvalues of G and Σ are related!

Page 63:

Feature extraction

Sample covariance and Gram matrices

Lemma

$$\forall A \in \mathbb{R}^{n \times m}, \quad \ker(A) = \ker(A^T A)$$

Theorem (Rank-nullity)

$$\forall A \in \mathbb{R}^{n \times m}, \quad \operatorname{rk}(A) + \dim(\ker(A)) = m$$

Theorem

$$\forall A \in \mathbb{R}^{n \times m}, \quad \operatorname{rk}(A^T A) = \operatorname{rk}(A A^T) \leq \min(n, m)$$

Page 64:

Feature extraction

Lemma (Eigenvalues of the covariance and Gram matrices)

The nonzero eigenvalues of the scaled covariance matrix $(N-1)\Sigma = \tilde{X} \tilde{X}^T$ and of the Gram matrix $G = \tilde{X}^T \tilde{X}$ are the same:

$$\{\lambda \in \mathbb{R}^*, \exists v \neq 0, (N-1)\Sigma v = \lambda v\} = \{\lambda \in \mathbb{R}^*, \exists v \neq 0, G v = \lambda v\}$$

And, during the proof, we show that:

• if $(\lambda, v)$ is an eigenpair of $\tilde{X} \tilde{X}^T$, then $(\lambda, \tilde{X}^T v)$ is an eigenpair of $\tilde{X}^T \tilde{X}$
• if $(\lambda, w)$ is an eigenpair of $\tilde{X}^T \tilde{X}$, then $(\lambda, \tilde{X} w)$ is an eigenpair of $\tilde{X} \tilde{X}^T$

There are several applications of this property:

• the eigenface algorithm, used when N ≪ d
• the nonlinear PCA called Kernel PCA [Schoelkopf, 1999]

Page 65:

Feature extraction

What to do when N ≪ d? $G \in \mathcal{M}_{N,N}(\mathbb{R})$, $\Sigma \in \mathcal{M}_{d,d}(\mathbb{R})$.

Eigenface

If N ≪ d, it is much more efficient to "diagonalize" G than Σ. In that case, the recipe is:

1. Center your data: $\tilde{x}_i = x_i - \bar{x}$
2. Build the matrix $\tilde{X} = [\tilde{x}_0 | \ldots | \tilde{x}_{N-1}]$
3. Compute the r normalized eigenvectors $w_j \in \mathbb{R}^N$ of G, with eigenvalues $\lambda_j$
4. Project your data onto the r normalized eigenvectors of Σ given by:

$$\frac{\tilde{X} w_j}{\left| \tilde{X} w_j \right|_2} = \frac{1}{\sqrt{\lambda_j}} \tilde{X} w_j$$

Page 66:

Feature extraction

Toward a Kernel PCA

We can reformulate PCA using only dot products.

PCA with only dot products

Computing the Gram matrix involves only dot products between the $x_i$. Projecting a vector x onto the vector $\frac{1}{\sqrt{\lambda_j}} \tilde{X} w_j$ reads:

$$\left( \frac{1}{\sqrt{\lambda_j}} \tilde{X} w_j \right)^T x = \frac{1}{\sqrt{\lambda_j}} w_j^T \begin{pmatrix} \langle x_0, x \rangle \\ \vdots \\ \langle x_{N-1}, x \rangle \end{pmatrix}$$

A linear algorithm involving only dot products can be rendered non-linear using the kernel trick (see SVM). The only remaining difficulty is to ensure that the vectors in the feature space are centered.

Page 67:

Feature extraction

Non-linear PCA

Kernel PCA [Scholkopf(1999)]

Consider a kernel $k : \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$ with $\langle \phi(x), \phi(x') \rangle = k(x, x')$, e.g.

• RBF kernel: $k(x, x') = \exp\left( -\frac{|x - x'|_2^2}{2\sigma^2} \right)$

We perform a PCA in the feature space, the image of φ: compute the Gram matrix and its eigenvectors/eigenvalues $\lambda_j, w_j$. For projecting a vector x, compute:

$$\frac{1}{\sqrt{\lambda_j}} w_j^T \begin{pmatrix} k(x_0, x) \\ \vdots \\ k(x_{N-1}, x) \end{pmatrix}$$

What about centering the $\phi(x_i)$?

Page 68:

Feature extraction

Non-linear PCA

Kernel PCA: centering in the feature space

It can be shown that the matrix $\tilde{G}$, obtained from the Gram matrix G by the double centering transformation

$$\tilde{G} = \left( I_N - \frac{1}{N} \mathbf{1} \right) G \left( I_N - \frac{1}{N} \mathbf{1} \right)$$

(with $\mathbf{1}$ the N×N matrix of ones), is the matrix of the dot products of the feature vectors centered in the feature space.

Page 69:

Feature extraction

Feature extraction: manifold learning

Goal: for each $x_i \in \mathbb{R}^d$, associate $y_i \in \mathbb{R}^r$ so that the pairwise distances in $\mathbb{R}^d$ are as similar as possible to the pairwise distances in $\mathbb{R}^r$.

Perfect for visualizing datasets in low dimensions.
Examples: LLE, MDS, Isomap, SNE, t-SNE, ...

Page 70:

Feature extraction

Manifold learning

Overview

$x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}^r$, $r \ll d$, e.g. r = 2:

1. Quantify the similarity between pairs of points in $\mathbb{R}^d$
2. Quantify the similarity between pairs of points in $\mathbb{R}^r$
3. Quantify the discrepancy between these similarities
4. Optimize with respect to the $y_i$

Page 71:

Feature extraction

Manifold learning

t-distributed Stochastic Neighbor Embedding (t-SNE) [van der Maaten(2008)]

Focuses on preserving local distances, allowing larger distances in $\mathbb{R}^d$ to become even larger in $\mathbb{R}^r$.

• Similarity in $\mathbb{R}^d$:

$$\forall i, j, \quad p_{i,j} = \frac{p_{j|i} + p_{i|j}}{2N}, \qquad p_{j|i} = \frac{\exp\left( -\frac{|x_i - x_j|^2}{2\sigma_i^2} \right)}{\sum_{k \neq i} \exp\left( -\frac{|x_i - x_k|^2}{2\sigma_i^2} \right)}$$

Page 72:

Feature extraction

Manifold learning

t-distributed Stochastic Neighbor Embedding (t-SNE) [van der Maaten(2008)]

Focuses on preserving local distances, allowing larger distances in $\mathbb{R}^d$ to become even larger in $\mathbb{R}^r$.

• Similarity in $\mathbb{R}^r$:

$$\forall i, j, \quad q_{i,j} = \frac{(1 + |y_i - y_j|_2^2)^{-1}}{\sum_{k \neq l} (1 + |y_k - y_l|_2^2)^{-1}}$$

• Match the $q_{i,j}$ to the $p_{i,j}$ by minimizing the Kullback-Leibler divergence:

$$C = \sum_{i,j} p_{i,j} \log\left( \frac{p_{i,j}}{q_{i,j}} \right)$$

Complexity O(N²); an optimized version with Barnes-Hut has complexity O(N log N).
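Running t-SNE for visualization is a one-liner with scikit-learn; a sketch on a small digits dataset (perplexity controls the σi through the entropy of the conditional distributions):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)      # 1797 samples, 64 dimensions

# Barnes-Hut t-SNE down to r = 2 dimensions for visualization.
tsne = TSNE(n_components=2, perplexity=30.0, method="barnes_hut", random_state=0)
Y = tsne.fit_transform(X)
print(Y.shape)                           # (1797, 2): one 2D point per sample
```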

Page 73:

Feature extraction

Manifold learning

t-distributed Stochastic Neighbor Embedding (t-SNE) [van der Maaten(2008)]
