Ch 12. Continuous Latent Variables
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized by Soo-Jin Kim
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
(C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Contents
12.1 Principal Component Analysis
  12.1.1 Maximum variance formulation
  12.1.2 Minimum-error formulation
  12.1.3 Applications of PCA
  12.1.4 PCA for high-dimensional data
12.1 Principal Component Analysis
Principal Component Analysis (PCA)
  PCA is used for applications such as dimensionality reduction, lossy data compression, feature extraction, and data visualization.
  It is also known as the Karhunen-Loève transform.
  PCA can be defined as the orthogonal projection of the data onto a lower-dimensional linear space, known as the principal subspace, such that the variance of the projected data is maximized.
[Figure: data points and their orthogonal projections onto the principal subspace]
  The subspace maximizes the variance of the projected points.
  Equivalently, it minimizes the sum of squares of the projection errors.
12.1.1 Maximum variance formulation (1/2)
Consider a data set of observations $\{x_n\}$, where $n = 1, \dots, N$ and $x_n$ is a Euclidean variable with dimensionality $D$.
Goal: project the data onto a space having dimensionality $M < D$ while maximizing the variance of the projected data.
One-dimensional space ($M = 1$)
  Define the direction of this space using a $D$-dimensional unit vector $u_1$, so that $u_1^T u_1 = 1$.
  The mean of the projected data is $u_1^T \bar{x}$, where $\bar{x}$ is the sample set mean given by
  $$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$$
  The variance of the projected data is given by
  $$\frac{1}{N} \sum_{n=1}^{N} \{ u_1^T x_n - u_1^T \bar{x} \}^2 = u_1^T S u_1$$
  where $S$ is the data covariance matrix
  $$S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$$
  We maximize the projected variance $u_1^T S u_1$ with respect to $u_1$.
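As a quick numerical check of the identity between the variance of the projected data and the quadratic form $u_1^T S u_1$, here is a minimal NumPy sketch. The synthetic data, the random seed, the chosen direction, and all variable names are assumptions made for illustration; they do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; synthetic data (assumption)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing  # N = 500 points, D = 3

x_bar = X.mean(axis=0)                    # sample mean
S = (X - x_bar).T @ (X - x_bar) / len(X)  # covariance with 1/N normalisation

u1 = np.array([1.0, 1.0, 0.0])
u1 /= np.linalg.norm(u1)                  # arbitrary unit-length direction

proj = X @ u1                             # projections u1^T x_n
var_direct = np.mean((proj - u1 @ x_bar) ** 2)
var_quadratic = u1 @ S @ u1               # u1^T S u1

print(var_direct, var_quadratic)          # the two values should agree
```

The direction here is arbitrary: the identity holds for any unit vector, and PCA then searches for the one that makes this value largest.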
12.1.1 Maximum variance formulation (2/2)
Lagrange multiplier
  Introduce $\lambda_1$ to enforce the constraint $u_1^T u_1 = 1$ and make an unconstrained maximization of
  $$u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)$$
  Setting the derivative with respect to $u_1$ to zero gives
  $$S u_1 = \lambda_1 u_1$$
  so $u_1$ must be an eigenvector of $S$.
  Left-multiplying by $u_1^T$ gives the variance
  $$u_1^T S u_1 = \lambda_1$$
  The variance is maximized when $u_1$ is the eigenvector having the largest eigenvalue $\lambda_1$ (this eigenvector is the first principal component).
PCA involves evaluating the mean $\bar{x}$ and the covariance matrix $S$ of the data set, and then finding the $M$ eigenvectors of $S$ corresponding to the $M$ largest eigenvalues.
The cost of computing the full eigenvector decomposition for a matrix of size $D \times D$ is $O(D^3)$.
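The procedure just described (mean, covariance, top-$M$ eigenvectors) can be sketched as follows. The function name `pca_eig`, the synthetic data, and the seed are hypothetical choices for illustration; `numpy.linalg.eigh` is used because $S$ is symmetric.

```python
import numpy as np

def pca_eig(X, M):
    """Return the mean, the M largest eigenvalues of S, and their eigenvectors."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]             # covariance with 1/N normalisation
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:M]  # indices of the M largest
    return x_bar, eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(1)             # synthetic data (assumption)
X = rng.normal(size=(400, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

x_bar, lam, U = pca_eig(X, M=2)
u1 = U[:, 0]                               # first principal component
S = np.cov(X, rowvar=False, bias=True)     # bias=True gives the 1/N form

print(lam[0], u1 @ S @ u1)                 # projected variance along u1 vs. lambda_1
```

Sorting explicitly keeps the sketch independent of the ordering convention of the eigen-solver.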
12.1.2 Minimum-error formulation (1/4)
Based on projection error minimization
  Introduce a complete orthonormal set of $D$-dimensional basis vectors $\{u_i\}$, where $i = 1, \dots, D$, that satisfy
  $$u_i^T u_j = \delta_{ij}$$
  Each data point can be represented exactly by a linear combination of the basis vectors
  $$x_n = \sum_{i=1}^{D} \alpha_{ni} u_i$$
  with coefficients $\alpha_{nj} = x_n^T u_j$, so that, without loss of generality,
  $$x_n = \sum_{i=1}^{D} (x_n^T u_i) u_i$$
  Goal: approximate this data point using a representation involving a restricted number $M < D$ of variables, corresponding to a projection onto a lower-dimensional subspace.
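A small sketch of the exact expansion in an orthonormal basis may help here. The basis is generated from a QR decomposition of a random matrix, which is an assumption made purely to obtain some orthonormal set; any complete orthonormal basis would do.

```python
import numpy as np

rng = np.random.default_rng(2)  # synthetic example (assumption)
D = 5

# Columns of U form a complete orthonormal basis u_1, ..., u_D.
U, _ = np.linalg.qr(rng.normal(size=(D, D)))

x_n = rng.normal(size=D)        # one data point
alpha = U.T @ x_n               # coefficients alpha_j = x_n^T u_j
x_rebuilt = U @ alpha           # sum_i alpha_i u_i

print(np.max(np.abs(x_rebuilt - x_n)))  # reconstruction error, expected near zero
```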
12.1.2 Minimum-error formulation (2/4)
We approximate each data point $x_n$ by
$$\tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i$$
  The $\{z_{ni}\}$ depend on the particular data point.
  The $\{b_i\}$ are constants that are the same for all data points.
Distortion measure: the squared distance between the original data point $x_n$ and its approximation $\tilde{x}_n$, averaged over the data set, so that the goal is to minimize
$$J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2$$
We are free to choose the $\{u_i\}$, the $\{z_{ni}\}$, and the $\{b_i\}$ so as to minimize the distortion.
Minimizing with respect to the quantities $\{z_{nj}\}$ gives
$$z_{nj} = x_n^T u_j, \quad j = 1, \dots, M$$
and similarly, minimizing with respect to the $\{b_j\}$ gives
$$b_j = \bar{x}^T u_j, \quad j = M+1, \dots, D$$
Substituting these back:
$$x_n - \tilde{x}_n = \sum_{i=M+1}^{D} \{ (x_n - \bar{x})^T u_i \} u_i$$
The displacement vector from $x_n$ to $\tilde{x}_n$ lies in the space orthogonal to the principal subspace (Fig. 12.2).
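The optimal coefficients and the orthogonality of the displacement vector can be checked numerically. This is a sketch under stated assumptions: the data are synthetic, the orthonormal basis is arbitrary (taken from a QR decomposition) rather than the PCA basis, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)  # synthetic data (assumption)
N, D, M = 200, 4, 2
X = rng.normal(size=(N, D)) + np.array([1.0, -2.0, 0.5, 3.0])
x_bar = X.mean(axis=0)

# Any orthonormal basis works for this check; columns are u_1, ..., u_D.
U, _ = np.linalg.qr(rng.normal(size=(D, D)))
U_keep, U_drop = U[:, :M], U[:, M:]

Z = X @ U_keep      # z_nj = x_n^T u_j for j = 1..M (one row per data point)
b = x_bar @ U_drop  # b_j = x_bar^T u_j for j = M+1..D (shared by all points)

X_tilde = Z @ U_keep.T + b @ U_drop.T  # approximations, one per row
displacement = X - X_tilde

# Components of the displacement along the retained directions should vanish.
print(np.max(np.abs(displacement @ U_keep)))
```

The check holds for any orthonormal basis; choosing the basis that minimizes $J$ is the remaining step of the derivation.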
12.1.2 Minimum-error formulation (3/4)
Therefore, the distortion measure $J$ becomes
$$J = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} (x_n^T u_i - \bar{x}^T u_i)^2 = \sum_{i=M+1}^{D} u_i^T S u_i$$
The case of a 2-dimensional data space ($D = 2$) and a 1-dimensional principal subspace ($M = 1$)
  Choose a direction $u_2$ so as to minimize $J = u_2^T S u_2$, subject to $u_2^T u_2 = 1$.
  Use a Lagrange multiplier $\lambda_2$ to enforce the constraint:
  $$\tilde{J} = u_2^T S u_2 + \lambda_2 (1 - u_2^T u_2)$$
  Setting the derivative with respect to $u_2$ to zero gives $S u_2 = \lambda_2 u_2$, so $J = \lambda_2$ at any stationary point.
  The minimum value of $J$ is obtained by choosing $u_2$ to be the eigenvector corresponding to the smaller of the two eigenvalues.
  Hence we choose the principal subspace to be aligned with the eigenvector having the larger eigenvalue, in order to minimize the average squared projection distance.
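For the two-dimensional case, the claim that $J$ is minimized by the eigenvector with the smaller eigenvalue can be illustrated by scanning unit vectors over a grid of angles. The covariance used to generate the data, the seed, and the grid resolution are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)  # synthetic 2-D data (assumption)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 1.0]], size=1000)
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)

eigvals, eigvecs = np.linalg.eigh(S)  # ascending: eigvals[0] is the smaller one

def distortion(u2):
    """J = u2^T S u2 for a unit vector u2 (the discarded direction)."""
    return u2 @ S @ u2

# Evaluate J for unit vectors at one-degree steps over half a turn.
angles = np.linspace(0.0, np.pi, 181)
J_values = np.array([distortion(np.array([np.cos(a), np.sin(a)])) for a in angles])

J_at_small_eigvec = distortion(eigvecs[:, 0])
print(J_values.min(), J_at_small_eigvec, eigvals[0])
```

No direction on the grid gives a smaller distortion than the eigenvector with the smaller eigenvalue, which is why the retained direction is the one with the larger eigenvalue.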
12.1.2 Minimum-error formulation (4/4)
General solution to the minimization of $J$ for arbitrary $M < D$
  Choose the $\{u_i\}$ to be eigenvectors of the covariance matrix:
  $$S u_i = \lambda_i u_i, \quad i = 1, \dots, D$$
  The distortion measure is then
  $$J = \sum_{i=M+1}^{D} \lambda_i$$
  which is simply the sum of the eigenvalues of those eigenvectors that are orthogonal to the principal subspace.
Canonical correlation analysis (CCA)
  CCA is another linear dimensionality reduction technique.
  PCA works with a single random variable, whereas CCA considers two (or more) variables.
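The statement that the distortion equals the sum of the discarded eigenvalues can be verified directly. The following is a minimal sketch on synthetic data; the dimensions, seed, and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)  # synthetic data (assumption)
N, D, M = 300, 6, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / N

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]  # sort eigenpairs in descending order
eigvals, U = eigvals[order], eigvecs[:, order]

# Project onto the M leading eigenvectors and map back to data space.
U_M = U[:, :M]
X_tilde = x_bar + (Xc @ U_M) @ U_M.T

J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))  # average squared distance
J_from_spectrum = eigvals[M:].sum()              # sum of discarded eigenvalues

print(J, J_from_spectrum)
```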
12.1.3 Applications of PCA (1/4)
Data compression
  [Figure: the first five eigenvectors, along with the corresponding eigenvalues]
  [Figure: the distortion measure $J$ introduced by projecting the data onto a principal component subspace of dimensionality $M$]
  [Figure: the eigenvalue spectrum]
12.1.3 Applications of PCA (2/4)
PCA approximation to a data vector $x_n$ in the form
$$\tilde{x}_n = \sum_{i=1}^{M} (x_n^T u_i) u_i + \sum_{i=M+1}^{D} (\bar{x}^T u_i) u_i = \bar{x} + \sum_{i=1}^{M} (x_n^T u_i - \bar{x}^T u_i) u_i$$
where we have used the relation
$$\bar{x} = \sum_{i=1}^{D} (\bar{x}^T u_i) u_i$$
[Figure: PCA reconstruction of data points by M principal components]
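A reconstruction from $M$ principal components can be sketched as below. The helper name `pca_reconstruct` and the synthetic data are hypothetical; the sketch only illustrates that the average reconstruction error does not increase as $M$ grows and vanishes at $M = D$.

```python
import numpy as np

def pca_reconstruct(X, M):
    """Approximate each row of X by x_bar plus its projection onto M components."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]
    _, eigvecs = np.linalg.eigh(S)  # eigenvalues ascending
    U_M = eigvecs[:, ::-1][:, :M]   # M leading eigenvectors as columns
    return x_bar + (Xc @ U_M) @ U_M.T

rng = np.random.default_rng(6)      # synthetic data (assumption)
D = 5
X = rng.normal(size=(100, D)) @ rng.normal(size=(D, D))

# Average squared reconstruction error for M = 1, ..., D.
errors = [np.mean(np.sum((X - pca_reconstruct(X, M)) ** 2, axis=1))
          for M in range(1, D + 1)]
print(errors)
```

Storing the $M$ coefficients per data point, together with the shared mean and basis vectors, is what makes this form useful for compression.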
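12.1.3 Applications of PCA (3/4)
Data pre-processing
  The transformation of a data set in order to standardize certain of its properties, so that subsequent pattern recognition algorithms can be applied successfully.
  PCA makes a more substantial normalization of the data, giving it zero mean and unit covariance.
  Write the eigenvector equation as
  $$S U = U L$$
  where $L$ is a $D \times D$ diagonal matrix with elements $\lambda_i$, and $U$ is a $D \times D$ orthogonal matrix with columns given by $u_i$.
  Define the transformed data points
  $$y_n = L^{-1/2} U^T (x_n - \bar{x})$$
  The set $\{y_n\}$ has zero mean, and its covariance is the identity matrix:
  $$\frac{1}{N} \sum_{n=1}^{N} y_n y_n^T = \frac{1}{N} \sum_{n=1}^{N} L^{-1/2} U^T (x_n - \bar{x})(x_n - \bar{x})^T U L^{-1/2} = L^{-1/2} U^T S U L^{-1/2} = L^{-1/2} L L^{-1/2} = I$$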
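The normalization to zero mean and unit covariance can be sketched in a few lines. The data are synthetic and assumed to have a full-rank covariance, so that every $\lambda_i$ is strictly positive; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)  # synthetic data (assumption)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3)) + np.array([5.0, -1.0, 2.0])

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / len(X)

lam, U = np.linalg.eigh(S)  # S U = U L with L = diag(lam)

# Row n of Y is y_n^T = (x_n - x_bar)^T U L^{-1/2}; dividing by sqrt(lam)
# scales each column, which is the same as right-multiplying by L^{-1/2}.
Y = (Xc @ U) / np.sqrt(lam)

cov_Y = Y.T @ Y / len(Y)
print(np.round(cov_Y, 6))   # should be close to the identity matrix
```

If some eigenvalues were zero or tiny, the division would be ill-conditioned; handling that case is outside the scope of this sketch.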
12.1.3 Applications of PCA (4/4)
Comparison of PCA with the Fisher linear discriminant for linear dimensionality reduction
  PCA is unsupervised and chooses the direction of maximum variance.
  The Fisher linear discriminant takes the class label information into account.
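12.1.4 PCA for high-dimensional data
When the dimensionality $D$ is large, the eigenvector decomposition of the $D \times D$ covariance matrix is computationally infeasible ($O(D^3)$). This problem can be resolved when the number of data points $N$ is smaller than $D$.
  Define $X$ to be the ($N \times D$)-dimensional centered data matrix, whose $n$th row is given by $(x_n - \bar{x})^T$. Then $S = N^{-1} X^T X$, and the eigenvector equation becomes
  $$\frac{1}{N} X^T X u_i = \lambda_i u_i$$
  Pre-multiplying both sides by $X$:
  $$\frac{1}{N} X X^T (X u_i) = \lambda_i (X u_i)$$
  Defining $v_i = X u_i$ gives an eigenvector equation for the $N \times N$ matrix $N^{-1} X X^T$:
  $$\frac{1}{N} X X^T v_i = \lambda_i v_i$$
  Pre-multiplying both sides by $X^T$:
  $$\left( \frac{1}{N} X^T X \right) (X^T v_i) = \lambda_i (X^T v_i)$$
  so $X^T v_i$ is an eigenvector of $S$ with eigenvalue $\lambda_i$.
We can therefore solve the eigenvector problem in a space of lower dimensionality, with computational cost $O(N^3)$ instead of $O(D^3)$.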
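The $N \times N$ route can be sketched as follows for a case with $N \ll D$. The sizes, seed, and tolerance for discarding near-zero eigenvalues are assumptions. The final rescaling of each $X^T v_i$ to unit length is a step not shown on the slide, added here so the recovered vectors can be compared with unit-norm eigenvectors of $S$. The full matrix $S$ is built only to verify the result.

```python
import numpy as np

rng = np.random.default_rng(8)       # synthetic data (assumption)
N, D = 20, 500                       # far fewer points than dimensions
X_raw = rng.normal(size=(N, D))
X = X_raw - X_raw.mean(axis=0)       # centred data matrix, rows (x_n - x_bar)^T

K = X @ X.T / N                      # N x N matrix (1/N) X X^T
lam, V = np.linalg.eigh(K)           # costs O(N^3) rather than O(D^3)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

# Centring leaves at most N - 1 non-zero eigenvalues; drop the rest.
keep = lam > 1e-10 * lam[0]
lam, V = lam[keep], V[:, keep]

U = X.T @ V                          # columns X^T v_i are eigenvectors of S
U /= np.linalg.norm(U, axis=0)       # rescale to unit length (extra step)

S = X.T @ X / N                      # D x D, formed here only for the check
print(np.max(np.abs(S @ U - U * lam)))  # residual of S u_i = lambda_i u_i
```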