Ch 12. Continuous Latent Variables
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.


Summarized by Soo-Jin Kim

Biointelligence Laboratory, Seoul National University

http://bi.snu.ac.kr/


Contents

12.1 Principal Component Analysis
  12.1.1 Maximum variance formulation
  12.1.2 Minimum-error formulation
  12.1.3 Applications of PCA
  12.1.4 PCA for high-dimensional data


12.1 Principal Component Analysis

Principal Component Analysis (PCA) is used for applications such as dimensionality reduction, lossy data compression, feature extraction, and data visualization. It is also known as the Karhunen-Loève transform.

PCA can be defined as the orthogonal projection of the data onto a lower-dimensional linear space, known as the principal subspace, such that the variance of the projected data is maximized.

The principal subspace can be characterized in two equivalent ways:

- the orthogonal projection of the data points onto the subspace maximizes the variance of the projected points;

- equivalently, the subspace minimizes the sum-of-squares of the projection errors (see the sketch after this list).
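As a quick illustration (not part of the original slides; the synthetic data and the choice M = 2 are arbitrary assumptions), dimensionality reduction with PCA might look as follows using scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))          # N = 500 points in D = 10 dimensions
X[:, 0] *= 5.0                          # give one direction a much larger variance

pca = PCA(n_components=2)               # M = 2 principal components
Z = pca.fit_transform(X)                # projected (compressed) data, shape (500, 2)

print(Z.shape)                          # (500, 2)
print(pca.explained_variance_)          # variances along the two principal directions
```

The from-scratch NumPy sketches on the following slides implement the same computation directly from the covariance matrix.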


12.1.1 Maximum variance formulation (1/2)

Consider a data set of observations {x_n}, where n = 1,…,N, and x_n is a Euclidean variable with dimensionality D. The goal is to project the data onto a space having dimensionality M < D while maximizing the variance of the projected data.

One-dimensional space (M = 1):

- Define the direction of this space using a D-dimensional unit vector u_1, so each data point x_n is projected onto the scalar value u_1^T x_n.

- The mean of the projected data is u_1^T \bar{x}, where \bar{x} is the sample mean

    \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n

- The variance of the projected data is given by

    \frac{1}{N} \sum_{n=1}^{N} \{ u_1^T x_n - u_1^T \bar{x} \}^2 = u_1^T S u_1

  where S is the data covariance matrix

    S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T

- Maximize the projected variance u_1^T S u_1 with respect to u_1, subject to the normalization constraint u_1^T u_1 = 1 (see the sketch below).
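A minimal NumPy sketch (not from the original slides; the synthetic 2-D Gaussian data and the chosen direction are illustrative assumptions) checking that the empirical variance of the projections u_1^T x_n equals u_1^T S u_1:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: N = 1000 points in D = 2 dimensions
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.0], [1.0, 1.0]], size=1000)

x_bar = X.mean(axis=0)                            # sample mean
S = (X - x_bar).T @ (X - x_bar) / X.shape[0]      # covariance matrix, 1/N convention

u1 = np.array([1.0, 1.0]) / np.sqrt(2.0)          # some unit direction u_1
proj = X @ u1                                     # projections u_1^T x_n
print(proj.var())                                 # empirical variance of the projections
print(u1 @ S @ u1)                                # u_1^T S u_1 -- the same value
```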


12.1.1 Maximum variance formulation (2/2)

Introduce a Lagrange multiplier λ_1 to enforce the constraint u_1^T u_1 = 1, and make an unconstrained maximization of

    u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)

Setting the derivative with respect to u_1 equal to zero gives

    S u_1 = \lambda_1 u_1

so u_1 must be an eigenvector of S. Left-multiplying by u_1^T and using u_1^T u_1 = 1, the projected variance is

    u_1^T S u_1 = \lambda_1

The variance will be a maximum when we set u_1 equal to the eigenvector having the largest eigenvalue λ_1. This eigenvector is the first principal component.

PCA involves evaluating the mean \bar{x} and the covariance matrix S of the data set and then finding the M eigenvectors of S corresponding to the M largest eigenvalues. The cost of computing the full eigenvector decomposition for a matrix of size D × D is O(D^3).
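A minimal sketch of this recipe (synthetic data; not the book's code): compute S, take its eigendecomposition, and check that the variance of the projection onto the leading eigenvector equals the largest eigenvalue, and that no other unit direction does better.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated data, D = 5
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(S)          # ascending eigenvalues, orthonormal columns
u1, lam1 = eigvecs[:, -1], eigvals[-1]        # first principal component and lambda_1

print(u1 @ S @ u1, lam1)                      # projected variance equals lambda_1
u_rand = rng.normal(size=5)
u_rand /= np.linalg.norm(u_rand)              # an arbitrary competing unit direction
print(u_rand @ S @ u_rand <= lam1 + 1e-12)    # True: it cannot exceed lambda_1
```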


12.1.2 Minimum-error formulation (1/4)

The minimum-error formulation is based on projection error minimization.

Introduce a complete orthonormal set of D-dimensional basis vectors {u_i}, where i = 1,…,D, that satisfy

    u_i^T u_j = \delta_{ij}

Each data point can be represented exactly by a linear combination of the basis vectors

    x_n = \sum_{i=1}^{D} \alpha_{ni} u_i

where the coefficients are \alpha_{nj} = x_n^T u_j, so that (without loss of generality)

    x_n = \sum_{i=1}^{D} (x_n^T u_i) u_i

The goal is to approximate this data point using a representation involving a restricted number M < D of variables, corresponding to a projection onto a lower-dimensional subspace. A small numerical check of the exact representation follows.
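A minimal sketch (arbitrary orthonormal basis from a QR decomposition; illustrative only) verifying that a complete orthonormal basis represents any vector exactly as x = \sum_i (x^T u_i) u_i:

```python
import numpy as np

rng = np.random.default_rng(3)
D = 6
U, _ = np.linalg.qr(rng.normal(size=(D, D)))   # columns u_1,...,u_D form an orthonormal basis
print(np.allclose(U.T @ U, np.eye(D)))         # u_i^T u_j = delta_ij

x = rng.normal(size=D)
alpha = U.T @ x                                # coefficients alpha_i = x^T u_i
x_rebuilt = U @ alpha                          # sum_i alpha_i u_i
print(np.allclose(x, x_rebuilt))               # True: the representation is exact
```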


12.1.2 Minimum-error formulation (2/4)

We approximate each data point x_n by

    \tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i

where the {z_{ni}} depend on the particular data point, whereas the {b_i} are constants that are the same for all data points. We are free to choose the {u_i}, the {z_{ni}}, and the {b_i} so as to minimize the distortion.

Distortion measure: the mean squared distance between the original data point x_n and its approximation \tilde{x}_n, so that the goal is to minimize

    J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2

Minimization with respect to the quantities {z_{nj}}, where j = 1,…,M, gives z_{nj} = x_n^T u_j, and similarly minimization with respect to {b_j}, where j = M+1,…,D, gives b_j = \bar{x}^T u_j. Substituting back,

    x_n - \tilde{x}_n = \sum_{i=M+1}^{D} \{ (x_n - \bar{x})^T u_i \} u_i

so the displacement vector from x_n to \tilde{x}_n lies in the space orthogonal to the principal subspace (Fig. 12.2). A numerical check of this orthogonality follows.
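A minimal sketch (synthetic data and an arbitrary orthonormal basis; illustrative assumptions): with the optimal choices z_{nj} = x_n^T u_j and b_j = \bar{x}^T u_j, the displacement x_n - \tilde{x}_n is orthogonal to the first M basis vectors.

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, M = 200, 5, 2
X = rng.normal(size=(N, D))
x_bar = X.mean(axis=0)
U, _ = np.linalg.qr(rng.normal(size=(D, D)))   # any orthonormal basis u_1,...,u_D

Z = X @ U[:, :M]                               # z_nj = x_n^T u_j,   j = 1..M
b = x_bar @ U[:, M:]                           # b_j  = xbar^T u_j,  j = M+1..D
X_tilde = Z @ U[:, :M].T + b @ U[:, M:].T      # approximations x~_n

residual = X - X_tilde                         # displacement vectors x_n - x~_n
print(np.allclose(residual @ U[:, :M], 0.0))   # True: orthogonal to the subspace
```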


12.1.2 Minimum-error formulation (3/4)

Therefore, the distortion measure becomes

    J = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} ( x_n^T u_i - \bar{x}^T u_i )^2 = \sum_{i=M+1}^{D} u_i^T S u_i

Consider the case of a two-dimensional data space (D = 2) and a one-dimensional principal subspace (M = 1). We must choose a direction u_2 so as to minimize J = u_2^T S u_2, using a Lagrange multiplier λ_2 to enforce the constraint u_2^T u_2 = 1:

    \tilde{J} = u_2^T S u_2 + \lambda_2 (1 - u_2^T u_2)

The minimum value of J is obtained by choosing u_2 to be the eigenvector corresponding to the smaller of the two eigenvalues. Hence we should choose the principal subspace to be aligned with the eigenvector having the larger eigenvalue (in order to minimize the average squared projection distance).


12.1.2 Minimum-error formulation (4/4)

The general solution to the minimization of J for arbitrary M < D is obtained by choosing the {u_i} to be eigenvectors of the covariance matrix,

    S u_i = \lambda_i u_i,    i = 1,…,D

and the distortion measure is then simply the sum of the eigenvalues of those eigenvectors that are orthogonal to the principal subspace:

    J = \sum_{i=M+1}^{D} \lambda_i

Canonical correlation analysis (CCA) is a related linear dimensionality reduction technique. Whereas PCA works with a single random variable, CCA considers two (or more) variables.
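A minimal sketch (synthetic data; illustrative only) verifying this result numerically: projecting onto the eigenvectors with the M largest eigenvalues gives a distortion J equal to the sum of the discarded eigenvalues (the D = 2, M = 1 case above is a special case).

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, M = 400, 6, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / N

eigvals, U = np.linalg.eigh(S)                 # eigenvalues in ascending order
U_top = U[:, -M:]                              # eigenvectors of the M largest eigenvalues

# Project onto the principal subspace: x~_n = xbar + U_top U_top^T (x_n - xbar)
X_tilde = x_bar + (X - x_bar) @ U_top @ U_top.T
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

print(J, eigvals[:-M].sum())                   # J equals the sum of discarded eigenvalues
```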


12.1.3 Applications of PCA (1/4)

Data compression

[Figure: the first five eigenvectors, along with the corresponding eigenvalues]

[Figure: the eigenvalue spectrum, and the distortion measure J introduced by projecting the data onto a principal component subspace of dimensionality M]
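A minimal sketch (synthetic data standing in for the data set used in the figures; illustrative only) computing the two quantities shown above: the eigenvalue spectrum of S and the distortion J as a function of M.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20))
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / X.shape[0]

eigvals = np.linalg.eigvalsh(S)[::-1]          # eigenvalue spectrum, descending order
J_of_M = [eigvals[M:].sum() for M in range(len(eigvals) + 1)]  # J = sum of discarded eigenvalues

print(eigvals[:5])                             # the largest eigenvalues
print(J_of_M[:5])                              # distortion for M = 0,...,4
```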


12.1.3 Applications of PCA (2/4)

Recalling the complete expansion x_n = \sum_{i=1}^{D} (x_n^T u_i) u_i, the PCA approximation to a data vector x_n takes the form

    \tilde{x}_n = \sum_{i=1}^{M} (x_n^T u_i) u_i + \sum_{i=M+1}^{D} (\bar{x}^T u_i) u_i
                = \bar{x} + \sum_{i=1}^{M} (x_n^T u_i - \bar{x}^T u_i) u_i

This gives the PCA reconstruction of data points by M principal components.
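A minimal sketch (synthetic data; illustrative only) of this reconstruction formula, showing that the approximation is lossy for M < D and exact when M = D:

```python
import numpy as np

rng = np.random.default_rng(7)
N, D = 100, 8
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / N
_, U = np.linalg.eigh(S)                        # eigenvectors, ascending eigenvalues

def reconstruct(x, M):
    """x~ = xbar + sum_{i=1}^{M} (x^T u_i - xbar^T u_i) u_i"""
    U_top = U[:, D - M:]                        # the M leading eigenvectors
    return x_bar + ((x - x_bar) @ U_top) @ U_top.T

x = X[0]
print(np.linalg.norm(x - reconstruct(x, 3)))    # lossy reconstruction with M = 3
print(np.allclose(x, reconstruct(x, D)))        # True: exact when M = D
```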


12.1.3 Applications of PCA (3/4)

Data pre-processing: the transformation of a data set in order to standardize certain of its properties, allowing a subsequent pattern recognition algorithm to work well. PCA can be used to make a more substantial normalization of the data, giving it zero mean and unit covariance (whitening).

Write the eigenvector equation as

    S U = U L

where L is the D × D diagonal matrix with elements λ_i and U is the D × D orthogonal matrix with columns given by the u_i. Define the transformed values

    y_n = L^{-1/2} U^T (x_n - \bar{x})

Then the set {y_n} has zero mean, and its covariance is the identity, since

    \frac{1}{N} \sum_{n=1}^{N} y_n y_n^T
        = \frac{1}{N} \sum_{n=1}^{N} L^{-1/2} U^T (x_n - \bar{x})(x_n - \bar{x})^T U L^{-1/2}
        = L^{-1/2} U^T S U L^{-1/2} = L^{-1/2} L L^{-1/2} = I
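A minimal sketch (synthetic data; illustrative only) of this whitening transform, checking that the transformed data have zero mean and identity covariance:

```python
import numpy as np

rng = np.random.default_rng(8)
N, D = 1000, 4
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D)) + rng.normal(size=D)
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / N

eigvals, U = np.linalg.eigh(S)                 # S U = U L
L_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))   # L^{-1/2}

Y = (X - x_bar) @ U @ L_inv_sqrt               # rows are y_n = L^{-1/2} U^T (x_n - xbar)
print(np.allclose(Y.mean(axis=0), 0.0))                  # zero mean
print(np.allclose(Y.T @ Y / N, np.eye(D), atol=1e-8))    # unit covariance
```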


12.1.3 Applications of PCA (4/4)

Comparison of PCA with the Fisher linear discriminant for linear dimensionality reduction (see the sketch below):

- PCA chooses the direction of maximum variance, ignoring the class labels.

- The Fisher linear discriminant takes account of the class label information.
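A minimal sketch (synthetic two-class data; not from the slides) contrasting the two criteria. The PCA direction is the leading eigenvector of the total covariance, while the Fisher direction is taken as w \propto S_W^{-1}(m_2 - m_1), the standard Fisher linear discriminant result.

```python
import numpy as np

rng = np.random.default_rng(9)
# Two elongated classes whose means differ along the low-variance axis
cov = [[6.0, 0.0], [0.0, 0.5]]
X1 = rng.multivariate_normal([0.0, -2.0], cov, size=200)
X2 = rng.multivariate_normal([0.0,  2.0], cov, size=200)
X = np.vstack([X1, X2])

# PCA: leading eigenvector of the total covariance (labels ignored)
S = np.cov(X.T, bias=True)
u_pca = np.linalg.eigh(S)[1][:, -1]

# Fisher: w proportional to S_W^{-1} (m2 - m1), using the within-class covariance
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = np.cov(X1.T, bias=True) + np.cov(X2.T, bias=True)
w_fisher = np.linalg.solve(S_W, m2 - m1)
w_fisher /= np.linalg.norm(w_fisher)

print(u_pca)      # close to the high-variance axis, roughly (±1, 0)
print(w_fisher)   # close to the class-separating axis, roughly (0, ±1)
```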


12.1.4 PCA for high-dimensional data

When the dimensionality D is large (for example, larger than the number of data points N), a direct eigendecomposition is computationally infeasible (O(D^3)). This problem can be resolved as follows.

Define X to be the (N × D)-dimensional centred data matrix whose nth row is given by (x_n - \bar{x})^T, so that S = \frac{1}{N} X^T X and the eigenvector equation reads

    \frac{1}{N} X^T X u_i = \lambda_i u_i

Pre-multiplying both sides by X and defining v_i = X u_i gives

    \frac{1}{N} X X^T (X u_i) = \lambda_i (X u_i),   i.e.   \frac{1}{N} X X^T v_i = \lambda_i v_i

which is an N × N eigenvector problem with the same eigenvalues λ_i. Multiplying both sides by X^T,

    \frac{1}{N} X^T X (X^T v_i) = \lambda_i (X^T v_i)

shows that X^T v_i is an eigenvector of S with eigenvalue λ_i (it only remains to normalize it to unit length). We can therefore solve the eigenvector problem in a space of lower dimensionality, with computational cost O(N^3) instead of O(D^3). A numerical sketch follows.
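A minimal sketch (synthetic data with D much larger than N; illustrative only) of this trick: solve the small N × N eigenproblem for (1/N) X X^T, then recover an eigenvector of S as u_i ∝ X^T v_i.

```python
import numpy as np

rng = np.random.default_rng(10)
N, D = 50, 500                                  # many more dimensions than data points
X_raw = rng.normal(size=(N, D))
X = X_raw - X_raw.mean(axis=0)                  # centred data matrix, N x D

eigvals_small, V = np.linalg.eigh(X @ X.T / N)  # small N x N eigenproblem
lam, v = eigvals_small[-1], V[:, -1]            # largest eigenvalue and its v_i

u = X.T @ v                                     # u_i proportional to X^T v_i
u /= np.linalg.norm(u)                          # normalize to unit length

S = X.T @ X / N                                 # formed here only to check the result
print(np.allclose(S @ u, lam * u))              # True: u is an eigenvector of S with eigenvalue lam
```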