Source: maktabfile.ir/wp-content/uploads/2017/11/PCA.pdf
CE-717: Machine Learning, Sharif University of Technology
Spring 2016
Soleymani
Principal Component Analysis (PCA)
Dimensionality Reduction:
Feature Selection vs. Feature Extraction
Feature selection
Select a subset of a given feature set
Feature extraction
A linear or non-linear transform on the original feature space
2
Feature Selection:
$$\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \rightarrow \begin{bmatrix} x_{i_1} \\ \vdots \\ x_{i_{d'}} \end{bmatrix} \qquad (d' < d)$$
Feature Extraction:
$$\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \rightarrow \begin{bmatrix} y_1 \\ \vdots \\ y_{d'} \end{bmatrix} = f\left(\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}\right) \qquad (d' < d)$$
Feature Extraction
3
Mapping of the original data to another space
The criterion for feature extraction can differ based on the problem setting:
Unsupervised task: minimize the information loss (reconstruction error)
Supervised task: maximize the class discrimination on the projected space
Feature extraction algorithms
Linear Methods
Unsupervised: e.g., Principal Component Analysis (PCA)
Supervised: e.g., Linear Discriminant Analysis (LDA)
Also known as Fisher's Discriminant Analysis (FDA)
Feature Extraction
4
Unsupervised feature extraction: the input is the data matrix
$$X = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$$
and the output is either a mapping $f:\mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ or only the transformed data
$$X' = \begin{bmatrix} x_1'^{(1)} & \cdots & x_{d'}'^{(1)} \\ \vdots & \ddots & \vdots \\ x_1'^{(N)} & \cdots & x_{d'}'^{(N)} \end{bmatrix}$$
Supervised feature extraction: the input is the data matrix $X$ together with the labels
$$Y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix}$$
and the output is again either a mapping $f:\mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ or only the transformed data $X'$.
Unsupervised Feature Reduction
5
Visualization: projection of high-dimensional data onto 2D
or 3D.
Data compression: efficient storage, communication, and retrieval.
Pre-processing: to improve accuracy by reducing the number of features
As a preprocessing step to reduce dimensionality for supervised learning tasks
Helps avoid overfitting
Noise removal
E.g., "noise" in images introduced by minor lighting variations, slightly different imaging conditions, etc.
Linear Transformation
6
For a linear transformation, we find an explicit mapping $f(x) = A^T x$ that can also transform new data vectors:
$$x' = A^T x, \qquad A^T \in \mathbb{R}^{d' \times d}, \quad x \in \mathbb{R}^{d} \text{ (original data)}, \quad x' \in \mathbb{R}^{d'} \text{ (reduced data)}, \quad d' < d$$
Linear Transformation
7
Linear transformations are simple mappings:
$$x_i' = a_i^T x, \qquad i = 1, \ldots, d'$$
$$A = \begin{bmatrix} a_{11} & \cdots & a_{1d'} \\ \vdots & \ddots & \vdots \\ a_{d1} & \cdots & a_{dd'} \end{bmatrix}, \qquad x' = A^T x$$
$$\begin{bmatrix} x_1' \\ \vdots \\ x_{d'}' \end{bmatrix} = \begin{bmatrix} a_{11} & \cdots & a_{d1} \\ \vdots & \ddots & \vdots \\ a_{1d'} & \cdots & a_{dd'} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$$
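As a small numerical illustration (not from the slides; the matrix and data point below are made up), the product $A^T x$ maps a 3-dimensional point to a 2-dimensional one:

```python
import numpy as np

# hypothetical transform: d = 3 original features, d' = 2 extracted features
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])        # A has shape (d, d'), so A^T is (d', d)

x = np.array([2.0, -1.0, 0.5])    # one d-dimensional data point

x_prime = A.T @ x                 # x' = A^T x, a d'-dimensional vector
print(x_prime)                    # [ 2.5 -0.5]
```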
Linear Dimensionality Reduction
8
Unsupervised
Principal Component Analysis (PCA) [we will discuss]
Independent Component Analysis (ICA) [we will discuss]
Singular Value Decomposition (SVD)
Multidimensional Scaling (MDS)
Canonical Correlation Analysis (CCA)
Principal Component Analysis (PCA)
9
Also known as the Karhunen-Loève (KL) transform
Principal Components (PCs): orthogonal vectors that are
ordered by the fraction of the total information (variation) in
the corresponding directions
Find the directions along which the data approximately lie
When the data is projected onto first PC, the variance of the projected data
is maximized
PCA is an orthogonal projection of the data onto a (lower-dimensional) subspace such that the variance of the projected data is maximized.
Principal Component Analysis (PCA)
10
The "best" linear subspace (i.e., the one providing the least reconstruction error of the data) is found from the mean-reduced (centered) data.
The axes have been rotated to new (principal) axes such that:
Principal axis 1 has the highest variance
....
Principal axis i has the i-th highest variance.
The principal axes are uncorrelated
Covariance among each pair of the principal axes is zero.
Goal: reducing the dimensionality of the data while preserving the variation present in the dataset as much as possible.
PCs can be found as the "best" eigenvectors (those with the largest eigenvalues) of the covariance matrix of the data points.
Principal components
11
If the data has a Gaussian distribution $\mathcal{N}(\mu, \Sigma)$, the direction of the largest variance can be found by the eigenvector of $\Sigma$ that corresponds to the largest eigenvalue of $\Sigma$.
[Figure: 2-D Gaussian data cloud with the principal directions $v_1$ and $v_2$]
PCA: Steps
12
Input: $N \times d$ data matrix $X$ (each row contains a $d$-dimensional data point)
$$\mu = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$$
$X \leftarrow$ the mean value of the data points is subtracted from the rows of $X$
$$C = \frac{1}{N} X^T X \quad \text{(covariance matrix)}$$
Calculate the eigenvalues and eigenvectors of $C$
Pick the $d'$ eigenvectors corresponding to the largest eigenvalues and put them in the columns of $A = [v_1, \ldots, v_{d'}]$ (from the first PC $v_1$ to the $d'$-th PC $v_{d'}$)
$$X' = X A$$
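A minimal numpy sketch of these steps (my own illustration, not code from the course; the toy data and the helper name `pca` are made up), following the slide's notation:

```python
import numpy as np

def pca(X, d_prime):
    """PCA as in the steps above: rows of X are d-dimensional data points."""
    mu = X.mean(axis=0)                    # mean of the data points
    Xc = X - mu                            # subtract the mean from every row
    C = Xc.T @ Xc / Xc.shape[0]            # covariance matrix C = (1/N) X^T X
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh since C is symmetric
    order = np.argsort(eigvals)[::-1]      # sort eigenvalues in descending order
    A = eigvecs[:, order[:d_prime]]        # columns = the top d' eigenvectors (PCs)
    X_prime = Xc @ A                       # projected data X' = X A
    return X_prime, A, mu

# toy data: 200 points in R^3 that vary mostly along one direction
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 3))
X_prime, A, mu = pca(X, d_prime=2)
print(X_prime.shape)                       # (200, 2)
```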
Covariance Matrix
13
$$\mu_x = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_d \end{bmatrix} = \begin{bmatrix} E(x_1) \\ \vdots \\ E(x_d) \end{bmatrix}$$
$$\Sigma = E\left[ (x - \mu_x)(x - \mu_x)^T \right]$$
ML estimate of the covariance matrix from the data points $\{x^{(i)}\}_{i=1}^{N}$:
$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} \left( x^{(i)} - \mu \right)\left( x^{(i)} - \mu \right)^T = \frac{1}{N} X^T X, \qquad \mu = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$$
where $X$ is the mean-centered data matrix
$$X = \begin{bmatrix} x^{(1)T} \\ \vdots \\ x^{(N)T} \end{bmatrix} \leftarrow \begin{bmatrix} (x^{(1)} - \mu)^T \\ \vdots \\ (x^{(N)} - \mu)^T \end{bmatrix}$$
We now assume that the data are mean-removed, so $X$ in the later slides denotes this centered matrix.
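A quick numerical check (my own sketch, not from the slides) that the sum of outer products and the matrix form $\frac{1}{N} X^T X$ on centered data give the same estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))             # 500 data points in R^4
mu = X.mean(axis=0)
Xc = X - mu                               # mean-centered data matrix

# sum of outer products (x - mu)(x - mu)^T, averaged over the data points
Sigma_sum = sum(np.outer(x - mu, x - mu) for x in X) / X.shape[0]
# matrix form (1/N) X^T X on the centered data
Sigma_mat = Xc.T @ Xc / X.shape[0]

print(np.allclose(Sigma_sum, Sigma_mat))  # True
```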
Correlation matrix
14
$$\frac{1}{N} X^T X
= \frac{1}{N}
\begin{bmatrix} x_1^{(1)} & \cdots & x_1^{(N)} \\ \vdots & \ddots & \vdots \\ x_d^{(1)} & \cdots & x_d^{(N)} \end{bmatrix}
\begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}
= \frac{1}{N}
\begin{bmatrix}
\sum_{n=1}^{N} x_1^{(n)} x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_1^{(n)} x_d^{(n)} \\
\vdots & \ddots & \vdots \\
\sum_{n=1}^{N} x_d^{(n)} x_1^{(n)} & \cdots & \sum_{n=1}^{N} x_d^{(n)} x_d^{(n)}
\end{bmatrix}$$
where
$$X = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$$
Two Interpretations
15
Maximum Variance Subspace
PCA finds vectors $v$ such that projections onto the vectors capture maximum variance in the data:
$$\frac{1}{N} \sum_{i=1}^{N} \left( v^T x^{(i)} \right)^2 = \frac{1}{N} v^T X^T X v$$
Minimum Reconstruction Error
PCA finds vectors $v$ such that projection onto the vectors yields minimum MSE reconstruction:
$$\frac{1}{N} \sum_{i=1}^{N} \left\| x^{(i)} - \left( v^T x^{(i)} \right) v \right\|^2$$
Least Squares Error Interpretation
16
PCs are linear least squares fits to samples, each orthogonal to
the previous PCs:
First PC is a minimum distance fit to a vector in the original feature
space
Second PC is a minimum distance fit to a vector in the plane
perpendicular to the first PC
And so on
Example
17
Example
18
Least Squares Error and Maximum Variance
Views Are Equivalent (1-dim Interpretation)
Minimizing sum of square distances to the line is equivalent to
maximizing the sum of squares of the projections on that line
(Pythagoras).
19
[Figure: a data vector from the origin (green), its projection onto the line (blue), and the residual distance to the line (red)]
$$\text{red}^2 + \text{blue}^2 = \text{green}^2$$
green² is fixed (it is the squared length of the data vector after mean removal)
⟹ maximizing blue² is equivalent to minimizing red²
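This identity is easy to verify numerically; the sketch below (my own, not from the slides) checks, for a random unit vector $v$, that each centered point's squared length splits into the squared projection (blue²) plus the squared residual (red²):

```python
import numpy as np

rng = np.random.default_rng(2)
Xc = rng.normal(size=(100, 3))
Xc -= Xc.mean(axis=0)                    # mean-centered data
v = rng.normal(size=3)
v /= np.linalg.norm(v)                   # unit vector defining the line

proj = Xc @ v                            # signed projection lengths ("blue")
resid = Xc - np.outer(proj, v)           # residual vectors ("red")

green2 = (Xc ** 2).sum()                 # fixed: total squared length of the data
blue2 = (proj ** 2).sum()
red2 = (resid ** 2).sum()
print(np.isclose(green2, blue2 + red2))  # True: maximizing blue^2 <=> minimizing red^2
```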
First PC
20
The first PC is the direction of greatest variability in the data.
We will show that the first PC is the eigenvector of the covariance matrix corresponding to the maximum eigenvalue of this matrix.
If $\|v\| = 1$, the projection of a $d$-dimensional $x$ onto $v$ is $v^T x$:
$$\|x\| \cos\theta = \frac{x^T v}{\|v\|} = v^T x$$
[Figure: projection of a vector $x$ onto the unit vector $v$, with angle $\theta$ between them]
First PC
21
$$\arg\max_{v} \; \frac{1}{N} \sum_{i=1}^{N} \left( v^T x^{(i)} \right)^2 = \frac{1}{N} v^T X^T X v \qquad \text{s.t. } v^T v = 1$$
$$\frac{\partial}{\partial v} \left[ \frac{1}{N} v^T X^T X v + \lambda \left( 1 - v^T v \right) \right] = 0 \;\Rightarrow\; \frac{1}{N} X^T X v = \lambda v$$
$v$ is an eigenvector of the sample covariance matrix $\frac{1}{N} X^T X$.
The eigenvalue $\lambda$ denotes the amount of variance along that dimension:
$$\text{Variance} = \frac{1}{N} v^T X^T X v = v^T (\lambda v) = \lambda v^T v = \lambda$$
So, if we seek the dimension with the largest variance, it will be the eigenvector corresponding to the largest eigenvalue of the sample covariance matrix.
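A short numerical check (my own sketch) that the top eigenvector of $\frac{1}{N} X^T X$ satisfies the eigenvector equation and that the variance of the projections along it equals the largest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(3)
Xc = rng.normal(size=(300, 5))
Xc -= Xc.mean(axis=0)                             # mean-centered data
C = Xc.T @ Xc / Xc.shape[0]                       # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)
lam, v = eigvals[-1], eigvecs[:, -1]              # largest eigenvalue and its eigenvector

print(np.allclose(C @ v, lam * v))                # (1/N) X^T X v = lambda v
print(np.isclose(((Xc @ v) ** 2).mean(), lam))    # variance along v equals lambda
```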
PCA: Uncorrelated Features
22
$$x' = A^T x$$
$$R_{x'} = E\left[ x' x'^T \right] = E\left[ A^T x x^T A \right] = A^T E\left[ x x^T \right] A = A^T R_x A$$
If $A = [v_1, \ldots, v_d]$ where $v_1, \ldots, v_d$ are orthonormal eigenvectors of $R_x$:
$$R_{x'} = A^T R_x A = A^T \left( A \Lambda A^T \right) A = \Lambda$$
$$\Rightarrow \; \forall i \neq j, \; i, j = 1, \ldots, d: \quad E\left[ x_i' x_j' \right] = 0$$
Then mutually uncorrelated features are obtained.
Completely uncorrelated features avoid information redundancies.
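The decorrelation can be checked numerically; in this sketch (my own, not from the slides) the covariance of the projected data is, up to numerical error, the diagonal matrix $\Lambda$:

```python
import numpy as np

rng = np.random.default_rng(4)
Xc = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))   # correlated features
Xc -= Xc.mean(axis=0)

R = Xc.T @ Xc / Xc.shape[0]
eigvals, A = np.linalg.eigh(R)              # A = [v_1, ..., v_d], orthonormal eigenvectors

X_prime = Xc @ A                            # x' = A^T x for every data point
R_prime = X_prime.T @ X_prime / Xc.shape[0]

print(np.allclose(R_prime, np.diag(eigvals)))   # diagonal: uncorrelated features
```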
PCA Derivation:
Mean Square Error Approximation
23
Incorporating all eigenvectors in $A = [v_1, \ldots, v_d]$:
$$x' = A^T x \;\Rightarrow\; A x' = A A^T x = x \;\Rightarrow\; x = A x'$$
⟹ If $d' = d$ then $x$ can be reconstructed exactly from $x'$.
PCA Derivation:
Relation between Eigenvalues and Variances
24
The $i$-th largest eigenvalue of $R_x$ is the variance along the $i$-th PC:
$$\mathrm{var}\left( x_i' \right) = \lambda_i$$
$$\mathrm{var}\left( x_i' \right) = E\left[ x_i' x_i' \right] = E\left[ v_i^T x \, x^T v_i \right] = v_i^T E\left[ x x^T \right] v_i = v_i^T R_x v_i = v_i^T \lambda_i v_i = \lambda_i$$
Eigenvalues $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \cdots$
• The 1st PC $v_1$ is the eigenvector of the sample covariance matrix associated with the largest eigenvalue
• The 2nd PC $v_2$ is the eigenvector of the sample covariance matrix associated with the second largest eigenvalue
• And so on ...
PCA Derivation:
Mean Square Error Approximation
25
Incorporating only the $d'$ eigenvectors corresponding to the largest eigenvalues, $A = [v_1, \ldots, v_{d'}]$ ($d' < d$), minimizes the MSE between $x$ and $\hat{x} = A x'$:
$$J(A) = E\left[ \left\| x - \hat{x} \right\|^2 \right] = E\left[ \left\| x - A x' \right\|^2 \right] = E\left[ \left\| \sum_{i=d'+1}^{d} x_i' v_i \right\|^2 \right]$$
$$= E\left[ \sum_{i=d'+1}^{d} \sum_{j=d'+1}^{d} x_i' \, v_i^T v_j \, x_j' \right] = E\left[ \sum_{i=d'+1}^{d} x_i'^2 \right] = \sum_{i=d'+1}^{d} E\left[ x_i'^2 \right] = \sum_{i=d'+1}^{d} \lambda_i$$
i.e., the sum of the $d - d'$ smallest eigenvalues.
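A numerical illustration of this result (my own sketch): the average reconstruction error with the top $d'$ PCs equals the sum of the $d - d'$ smallest eigenvalues of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
Xc = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
Xc -= Xc.mean(axis=0)                       # mean-centered data

C = Xc.T @ Xc / Xc.shape[0]
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

d_prime = 2
A = eigvecs[:, :d_prime]                    # keep only the top d' PCs
X_hat = (Xc @ A) @ A.T                      # reconstruction A x' for every point
mse = ((Xc - X_hat) ** 2).sum(axis=1).mean()

print(np.isclose(mse, eigvals[d_prime:].sum()))   # True: sum of discarded eigenvalues
```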
PCA Derivation:
Mean Square Error Approximation
26
In general, it can also be shown that the MSE is minimized compared to any other approximation of $x$ by any $d'$-dimensional orthonormal basis;
this result can be obtained without first assuming that the axes are eigenvectors of the correlation matrix.
If the data is mean-centered in advance, $R_x$ and $C_x$ (the covariance matrix) will be the same.
However, in the correlation version, when $C_x \neq R_x$ the approximation is not, in general, a good one (although it is a minimum-MSE solution).
PCA on Faces: โEigenfacesโ
27
[Figure: some sample images from the ORL database]
PCA on Faces: โEigenfacesโ
28
For the eigenfaces: "gray" = 0, "white" > 0, "black" < 0
[Figure: the average face and the 1st to 10th PCs (eigenfaces)]
PCA on Faces:
29
Feature vector = $[x_1', x_2', \ldots, x_{d'}']$
[Figure: a face image] = [average face] $+\; x_1' \times$ [1st eigenface] $+\; x_2' \times$ [2nd eigenface] $+ \cdots +\; x_{256}' \times$ [256th eigenface]
$$x_i' = x^T v_i \quad \text{(the projection of } x \text{ onto the } i\text{-th PC)}$$
$x$ is a $112 \times 92 = 10304$-dimensional vector containing the intensities of the pixels of this image.
PCA on Faces: Reconstructed Face
[Figure: reconstructed faces for d' = 1, 2, 4, 8, 16, 32, 64, 128, 256, compared with the original image]
30
Dimensionality Reduction by PCA
31
In high-dimensional problems, data sometimes lies near a
linear subspace (small variability around this subspace can
be considered as noise)
Only keep data projections onto the principal components with large eigenvalues
Might lose some information, but if the discarded eigenvalues are small, we do not lose much
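One common way to decide how many projections to keep is the fraction of the total variance (eigenvalue mass) they explain; a minimal sketch of that idea (my own, not from the slides), with made-up toy data:

```python
import numpy as np

def choose_d_prime(Xc, keep=0.95):
    """Smallest d' whose PCs explain at least `keep` of the total variance."""
    C = Xc.T @ Xc / Xc.shape[0]
    eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]     # eigenvalues, descending
    explained = np.cumsum(eigvals) / eigvals.sum()     # cumulative variance ratio
    return int(np.searchsorted(explained, keep) + 1)

rng = np.random.default_rng(6)
scales = np.array([5.0, 3.0] + [0.1] * 8)              # 2 strong directions + 8 "noise" ones
Xc = rng.normal(size=(400, 10)) * scales
Xc -= Xc.mean(axis=0)
print(choose_d_prime(Xc, keep=0.95))                   # 2: the noise eigenvalues are tiny
```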
Kernel PCA
32
Kernel extension of PCA: useful when the data (approximately) lies on a lower-dimensional non-linear manifold.
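A minimal sketch of the idea (my own illustration, not the lecture's formulation), using an RBF kernel and the standard double-centering of the Gram matrix; the toy data are two concentric circles, which linear PCA cannot unfold:

```python
import numpy as np

def rbf_kernel_pca(X, d_prime, gamma=1.0):
    """Project the training data onto the top d' kernel principal components."""
    N = X.shape[0]
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq_dists)                        # RBF Gram matrix
    one_n = np.ones((N, N)) / N
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:d_prime]
    # projections of the training points = eigenvectors scaled by sqrt(eigenvalues)
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

rng = np.random.default_rng(7)
theta = rng.uniform(0, 2 * np.pi, size=200)
radius = np.repeat([1.0, 3.0], 100)                      # two concentric circles
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))
Z = rbf_kernel_pca(X, d_prime=2, gamma=2.0)
print(Z.shape)   # (200, 2); in this space the two circles become (roughly) linearly separable
```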
PCA and LDA: Drawbacks
33
PCA drawback: an excellent information-packing transform does not necessarily lead to good class separability; the directions of maximum variance may be useless for classification purposes.
LDA drawback: singularity or under-sampled problem (when $N < d$)
Example: gene expression data, images, text documents
Can reduce the dimension only to $d' \leq C - 1$ (unlike PCA)
[Figure: PCA direction vs. LDA direction on the same dataset]
PCA vs. LDA
Although LDA often provides more suitable features for classification tasks, PCA might outperform LDA in some situations:
when the number of samples per class is small (the overfitting problem of LDA)
when the number of desired features is more than $C - 1$
Advances in the last decade:
Semi-supervised feature extraction
E.g., PCA+LDA, Regularized LDA, Locally FDA (LFDA)
34
Singular Value Decomposition (SVD)
35
Given a matrix $X \in \mathbb{R}^{N \times d}$, the SVD is a decomposition:
$$X = U \Sigma V^T$$
$\Sigma$ is a diagonal matrix containing the singular values $\sigma_1, \ldots, \sigma_d$ of $X$.
The columns of $U$ and $V$ are orthonormal.
$$\underbrace{X}_{N \times d} = \underbrace{U}_{N \times d} \; \underbrace{\Sigma}_{d \times d} \; \underbrace{V^T}_{d \times d}$$
Singular Value Decomposition (SVD)
36
Given a matrix $X \in \mathbb{R}^{N \times d}$, the SVD is a decomposition $X = U \Sigma V^T$.
The SVD of $X$ is related to the eigen-decompositions of $X^T X$ and $X X^T$:
$$X^T X = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T$$
so $V$ contains the eigenvectors of $X^T X$ and $\Sigma^2$ contains its eigenvalues ($\lambda_i = \sigma_i^2$).
$$X X^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T$$
so $U$ contains the eigenvectors of $X X^T$ and $\Sigma^2$ contains its eigenvalues ($\lambda_i = \sigma_i^2$).
In fact, we can view each row of $U\Sigma$ ($= XV$) as the coordinates of an example along the axes given by the eigenvectors.
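A quick numpy check of this relation (my own sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(50, 4))

U, s, Vt = np.linalg.svd(X, full_matrices=False)     # X = U diag(s) V^T
eigvals = np.linalg.eigvalsh(X.T @ X)                # eigenvalues of X^T X (ascending)

# squared singular values equal the eigenvalues of X^T X
print(np.allclose(s ** 2, np.sort(eigvals)[::-1]))   # True
# rows of U Sigma are the coordinates X V of the examples along the eigenvector axes
print(np.allclose(U * s, X @ Vt.T))                  # True
```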
Independent Component Analysis (ICA)
37
PCA:
The transformed dimensions will be uncorrelated with each other
Orthogonal linear transform
Only uses second order statistics (i.e., covariance matrix)
ICA:
The transformed dimensions will be as independent as
possible.
Non-orthogonal linear transform
Higher-order statistics can also be used
Uncorrelated and Independent
38
Uncorrelated: $\mathrm{cov}(s_1, s_2) = 0$; Independent: $p(s_1, s_2) = p(s_1)\,p(s_2)$
Gaussian: Independent ⟺ Uncorrelated
Non-Gaussian: Independent ⟹ Uncorrelated, but Uncorrelated ⇏ Independent
ICA: Cocktail party problem
39
Cocktail party problem
$d$ speakers are speaking simultaneously and any microphone records only an overlapping combination of these voices.
Each microphone records a different combination of the speakers' voices.
Using these $d$ microphone recordings, can we separate out the original $d$ speakers' speech signals?
Mixing matrix $A$:
$$x = A s$$
Unmixing matrix $A^{-1}$:
$$s = A^{-1} x$$
$s_j(i)$: sound that speaker $j$ was uttering at time $i$.
$x_j(i)$: acoustic reading recorded by microphone $j$ at time $i$.
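A small sketch of this setup (my own illustration, not from the lecture), using scikit-learn's FastICA to unmix two made-up source signals:

```python
import numpy as np
from sklearn.decomposition import FastICA   # assumes scikit-learn is installed

# two "speakers": a sine wave and a square wave sampled over time
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # sources s, shape (2000, 2)

A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                         # mixing matrix
X = S @ A.T                                        # microphone recordings x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                       # estimated sources (up to scale and order)
print(S_hat.shape)                                 # (2000, 2)
```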
ICA
40
Find a linear transformation
$$x = A s$$
for which the dimensions of $s = [s_1, s_2, \ldots, s_d]^T$ are statistically independent:
$$p(s_1, \ldots, s_d) = p_1(s_1)\, p_2(s_2) \cdots p_d(s_d)$$
Algorithmically, we need to identify the matrix $A$ and the sources $s$, where $x = A s$, such that the mutual information between $s_1, s_2, \ldots, s_d$ is minimized:
$$I(s_1, s_2, \ldots, s_d) = \sum_{i=1}^{d} H(s_i) - H(s_1, s_2, \ldots, s_d)$$