PCA & Matrix Factorizations
Transcript of PCA & Matrix Factorizations
Principal Component Analysis and Matrix Factorizations for Learning
Chris Ding, Lawrence Berkeley National Laboratory
Supported by Office of Science, U.S. Dept. of Energy
Many unsupervised learning methods are closely related in a simple way
Spectral Clustering
NMF
K-means clustering
PCA
Indicator Matrix Quadratic Clustering
Semi-supervised classification
Semi-supervised clustering
Outlier detection
Part 1.A. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)

• Widely used in a large number of different fields
• Most widely known as PCA (multivariate statistics)
• SVD is the theoretical basis for PCA
Brief history
• PCA
  – Draw a plane closest to the data points (Pearson, 1901)
  – Retain most variance (Hotelling, 1933)
• SVD
  – Low-rank approximation (Eckart-Young, 1936)
  – Practical applications / efficient computation (Golub-Kahan, 1965)
• Many generalizations
PCA and SVD
Data: n points in p dimensions: $X = (x_1, x_2, \ldots, x_n)$

Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T$

Gram (kernel) matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T$

Underlying basis (SVD): $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$

Principal directions $u_k$ (principal axes, subspace); principal components $v_k$ (projections onto the subspace).
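To make the decomposition concrete, here is a minimal numpy sketch (not part of the original tutorial; random data and the unnormalized covariance $C = XX^T$ as above are assumptions):

```python
# A minimal numpy sketch: PCA via the SVD of a centered data matrix X
# whose columns are the data points.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))           # p = 5 dimensions, n = 100 points
X = X - X.mean(axis=1, keepdims=True)   # center each dimension

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) Vt

# Principal directions u_k are the columns of U; the covariance
# C = X X^T has eigenvalues lambda_k = sigma_k^2:
C = X @ X.T
assert np.allclose(C, (U * s**2) @ U.T)

# Principal components v_k are the rows of Vt (projections onto the subspace).
```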
Further Developments
SVD/PCA
• Principal curves
• Independent Component Analysis
• Sparse SVD/PCA (many approaches)
• Mixture of probabilistic PCA
• Generalization to the exponential family, max-margin
• Connection to K-means clustering
Kernel (inner-product)
• Kernel PCA
Methods of PCA Utilization
Principal components (uncorrelated random variables):
$u_k = u_k(1)\, x_1 + \cdots + u_k(d)\, x_d$

Dimension reduction, using the SVD of the data matrix $X = (x_1, x_2, \ldots, x_n)$:
$X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$

Projection to a low-dimensional subspace: $\tilde{X} = U^T X$, with $U = (u_1, \ldots, u_k)$

Sphering the data (transforming it to $N(0,1)$): $\tilde{X} = C^{-1/2} X = U \Sigma^{-1} U^T X$
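A minimal numpy sketch (again not from the tutorial; random data is assumed) of the projection and sphering transforms:

```python
# Projection to a k-dim subspace and sphering, for a centered X (p x n).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 200))
X = X - X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 3
X_red = U[:, :k].T @ X                    # dimension reduction: U^T X

X_sph = U @ np.diag(1.0 / s) @ U.T @ X    # sphering: C^{-1/2} X (up to scale)
assert np.allclose(X_sph @ X_sph.T, np.eye(10))   # whitened Gram is identity
```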
Applications of PCA/SVD
• Most popular in multivariate statistics
• Image processing, signal processing
• Physics: principal axes, diagonalization of a 2nd-order tensor (mass)
• Climate: Empirical Orthogonal Functions (EOF)
• Kalman filter: $s^{(t+1)} = A s^{(t)}, \qquad P^{(t+1)} = A P^{(t)} A^T$
• Reduced-order analysis
Applications of PCA/SVD
• PCA/SVD is used as widely as the Fast Fourier Transform
  – Both are spectral expansions
  – FFT is used more for partial differential equations
  – PCA/SVD is used more for discrete (data) analysis
  – PCA/SVD will surpass FFT as the computational sciences advance further
• PCA/SVD
  – Selects combinations of variables
  – Dimension reduction: an image has $10^4$ pixels, but its true dimension is 20!
PCA is a Matrix Factorization (spectral/eigen-decomposition)

Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T = U \Lambda U^T$

Kernel matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T = V \Lambda V^T$

Underlying basis (SVD): $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$

Principal directions: $U = (u_1, u_2, \ldots, u_k)$
Principal components: $V = (v_1, v_2, \ldots, v_k)$
From PCA to spectral clustering using generalized eigenvectors

Consider the kernel matrix: $W_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$

In kernel PCA we compute the eigenvector: $Wv = \lambda v$

Generalized eigenvector: $Wq = \lambda Dq$, where $D = \mathrm{diag}(d_1, \ldots, d_n)$ and $d_i = \sum_j w_{ij}$

This leads to spectral clustering!
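A minimal numpy/scipy sketch (not from the tutorial; the two-blob data and Gaussian kernel are made-up assumptions) of clustering with the generalized eigenvector:

```python
# Solve the generalized eigenproblem W q = lambda D q for a Gaussian-kernel W.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
pts = np.vstack([rng.normal(0, 0.3, (10, 2)),
                 rng.normal(3, 0.3, (10, 2))])          # two blobs
sq = np.sum((pts[:, None] - pts[None]) ** 2, axis=2)
W = np.exp(-sq)                                         # kernel matrix
D = np.diag(W.sum(axis=1))                              # d_i = sum_j w_ij

vals, vecs = eigh(W, D)         # generalized eigenvectors, ascending eigenvalues
q = vecs[:, -2]                 # skip the trivial top eigenvector (constant)
labels = (q >= 0).astype(int)   # the sign of q recovers the two clusters
```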
Scaled PCA ⇒ Spectral Clustering

PCA: $W = \sum_k \lambda_k v_k v_k^T$

Scaling: $\tilde{W} = D^{-1/2} W D^{-1/2}$, i.e. $\tilde{w}_{ij} = w_{ij}/(d_i d_j)^{1/2}$

Scaled PCA: $W = D^{1/2} \tilde{W} D^{1/2} = D \Big( \sum_k \lambda_k q_k q_k^T \Big) D$, where $q_k = D^{-1/2} v_k$ is the scaled principal component.
Scaled PCA on a Rectangular Matrix ⇒ Correspondence Analysis

Re-scaling: $\tilde{P} = D_r^{-1/2} P D_c^{-1/2}$, i.e. $\tilde{p}_{ij} = p_{ij}/(p_{i \cdot}\, p_{\cdot j})^{1/2}$

Apply SVD on $\tilde{P}$ and subtract the trivial component:
$P - r c^T / p_{\cdot\cdot} = D_r \sum_k \lambda_k f_k g_k^T D_c$
where $r = (p_{1 \cdot}, \ldots, p_{n \cdot})^T$ and $c = (p_{\cdot 1}, \ldots, p_{\cdot n})^T$ are the row and column marginals, and
$f_k = D_r^{-1/2} u_k, \qquad g_k = D_c^{-1/2} v_k$
are the scaled row and column principal components (standard coordinates in CA).

(Zha et al., CIKM 2001; Ding et al., PKDD 2002)
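A minimal numpy sketch (not from the tutorial; the contingency table is hypothetical) of correspondence analysis via this re-scaled SVD:

```python
# Correspondence analysis as the SVD of D_r^{-1/2} (P - r c^T) D_c^{-1/2}.
import numpy as np

T = np.array([[20., 5., 2.],
              [3., 18., 6.],
              [1., 4., 25.]])        # hypothetical contingency table
P = T / T.sum()                      # normalize so p_.. = 1
r, c = P.sum(axis=1), P.sum(axis=0)  # row and column marginals

Dr = np.diag(1.0 / np.sqrt(r))
Dc = np.diag(1.0 / np.sqrt(c))
S = Dr @ (P - np.outer(r, c)) @ Dc   # trivial component subtracted

U, s, Vt = np.linalg.svd(S)
F = Dr @ U                           # scaled row coordinates f_k
G = Dc @ Vt.T                        # scaled column coordinates g_k
```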
Nonnegative Matrix Factorization
Data matrix: n points in p dimensions, $X = (x_1, x_2, \ldots, x_n)$, where each $x_i$ is an image, document, web page, etc.

Decomposition (low-rank approximation): $X \approx F G^T$

Nonnegative matrices: $X_{ij} \geq 0, \; F_{ij} \geq 0, \; G_{ij} \geq 0$

$F = (f_1, f_2, \ldots, f_k), \qquad G = (g_1, g_2, \ldots, g_k)$
Solving NMF with multiplicative updating
$J = \|X - F G^T\|^2, \qquad F \geq 0, \; G \geq 0$

Fix F, solve for G; fix G, solve for F.

Lee & Seung (2000) propose the multiplicative updates:
$G_{jk} \leftarrow G_{jk} \frac{(X^T F)_{jk}}{(G F^T F)_{jk}}, \qquad F_{ik} \leftarrow F_{ik} \frac{(X G)_{ik}}{(F G^T G)_{ik}}$
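A minimal numpy sketch (not from the tutorial; random nonnegative data, 200 iterations, and a small eps to avoid division by zero are arbitrary choices) of these updates:

```python
# Lee-Seung multiplicative updates for J = ||X - F G^T||^2 with F, G >= 0.
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((50, 30))                 # nonnegative data, p x n
k, eps = 5, 1e-9
F = rng.random((50, k))                  # p x k
G = rng.random((30, k))                  # n x k

for _ in range(200):
    G *= (X.T @ F) / (G @ (F.T @ F) + eps)   # fix F, update G
    F *= (X @ G) / (F @ (G.T @ G) + eps)     # fix G, update F

print(np.linalg.norm(X - F @ G.T))           # reconstruction error decreases
```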
Matrix Factorization Summary
Symmetric matrix (kernel matrix, graph):
  PCA: $W = V \Lambda V^T$
  Scaled PCA: $W = D^{1/2} \tilde{W} D^{1/2} = D Q \Lambda Q^T D$
  NMF: $W \approx Q Q^T$

Rectangular matrix (contingency table, bipartite graph):
  PCA: $X = U \Sigma V^T$
  Scaled PCA: $X = D_r^{1/2} \tilde{X} D_c^{1/2} = D_r F \Lambda G^T D_c$
  NMF: $X \approx F G^T$
Indicator Matrix Quadratic Clustering
Unsigned cluster indicator matrix $H = (h_1, \ldots, h_K)$

Kernel K-means clustering:
$\max_H \mathrm{Tr}(H^T W H) \quad \text{s.t. } H^T H = I, \; H \geq 0$
where $W = X^T X$ for K-means and $W = \big(\langle \phi(x_i), \phi(x_j) \rangle\big)$ for kernel K-means.

Spectral clustering (normalized cut):
$\max_H \mathrm{Tr}(H^T W H) \quad \text{s.t. } H^T D H = I, \; H \geq 0$

The difference between the two is the orthogonality constraint on H.
Indicator Matrix Quadratic Clustering
Additional features:

Semi-supervised classification: $\max_H \mathrm{Tr}(H^T W H + C^T H)$

Semi-supervised clustering, with must-link (A) and cannot-link (B) constraints:
$\max_H \mathrm{Tr}(H^T W H + \alpha H^T A H - \beta H^T B H)$

Outlier detection: $\max_H \mathrm{Tr}(H^T W H)$, allowing zero rows in H

Nonnegative Lagrangian Relaxation, with update
$H_{ik} \leftarrow H_{ik} \, \frac{(W H + C/2)_{ik}}{(H \alpha)_{ik}}, \qquad \alpha = H^T W H + \tfrac{1}{2} H^T C$
Tutorial Outline
• PCA
  – Recent developments on PCA/SVD
  – Equivalence to K-means clustering
• Scaled PCA
  – Laplacian matrix
  – Spectral clustering
  – Spectral ordering
• Nonnegative Matrix Factorization
  – Equivalence to K-means clustering
  – Holistic vs. parts-based
• Indicator Matrix Quadratic Clustering
  – Uses Nonnegative Lagrangian Relaxation
  – Includes: K-means and spectral clustering, semi-supervised classification, semi-supervised clustering, outlier detection
Part 1.B. Recent Developments on PCA and SVD

• Principal curves
• Independent Component Analysis
• Kernel PCA
• Mixture of PCA (probabilistic PCA)
• Sparse PCA/SVD: semi-discrete, truncation, L1 constraint, direct sparsification
• Column-partitioned matrix factorizations
• 2D-PCA/SVD
• Equivalence to K-means clustering
PCA and SVD
Data matrix: $X = (x_1, x_2, \ldots, x_n)$

Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T$

Gram (kernel) matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T$

Underlying basis (SVD): $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T$

Principal directions $u_k$ (principal axes, subspace); principal components $v_k$ (projections onto the subspace).
Kernel PCA
Kernel: $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$, with the mapping $x_i \to \phi(x_i)$ (Schölkopf, Smola, Müller, 1996)

Feature extraction from a PCA component v: $\langle v, \phi(x) \rangle = \sum_i v_i \langle \phi(x_i), \phi(x) \rangle$

Indefinite kernels: generalization to graphs with nonnegative weights
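A minimal numpy sketch (not from the tutorial; the RBF kernel width and random data are arbitrary) of kernel PCA with feature-space centering:

```python
# Kernel PCA: eigendecompose the centered kernel matrix, extract top components.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 2))                       # n = 40 points as rows
sq = np.sum((X[:, None] - X[None]) ** 2, axis=2)
K = np.exp(-sq / 2.0)                              # K_ij = <phi(x_i), phi(x_j)>

n = K.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J                                     # center phi(x_i) in feature space

vals, vecs = np.linalg.eigh(Kc)                    # ascending eigenvalues
alpha = vecs[:, -2:] / np.sqrt(vals[-2:])          # normalize so <v, v> = 1
features = Kc @ alpha                              # <v, phi(x_i)> for each point
```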
Mixture of PCA
• Data has local structure
  – Global PCA on all the data is not useful
• Clustering PCA (Hinton et al.):
  – Use clustering to partition the data into clusters
  – Perform PCA within each cluster
  – No explicit generative model
• Probabilistic PCA (Tipping & Bishop):
  – Latent variables
  – Generative model (Gaussian)
  – Mixture of Gaussians ⇒ mixture of PCA
  – Adding Markov dynamics for the latent variables (linear Gaussian models)
Probabilistic PCA / Linear Gaussian Model

Probabilistic PCA:
$x_i = W s_i + \mu + \epsilon, \qquad \epsilon \sim N(0, \sigma_\epsilon^2 I)$
Latent variables $S = (s_1, \ldots, s_n)$ with Gaussian prior $P(s) \sim N(s_0, \sigma_s^2 I)$, giving
$x \sim N(W s_0, \; \sigma_\epsilon^2 I + \sigma_s^2 W W^T)$

Linear Gaussian Model:
$s_{i+1} = A s_i + \eta, \qquad x_i = W s_i + \epsilon$

(Tipping & Bishop, 1995; Roweis & Ghahramani, 1999)
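A minimal numpy sketch (not from the tutorial) of sampling from this model, taking $s_0 = 0$ and $\sigma_s = 1$ so that marginally $x \sim N(\mu, WW^T + \sigma_\epsilon^2 I)$:

```python
# Sample from probabilistic PCA and check the marginal covariance.
import numpy as np

rng = np.random.default_rng(5)
p, k, n, sigma_eps = 10, 2, 5000, 0.1
W = rng.normal(size=(p, k))
mu = rng.normal(size=p)

S = rng.normal(size=(k, n))                                   # latent variables
X = W @ S + mu[:, None] + sigma_eps * rng.normal(size=(p, n))

C_model = W @ W.T + sigma_eps**2 * np.eye(p)
print(np.abs(np.cov(X) - C_model).max())    # small for large n
```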
Sparse PCA
• Compute a factorization $X \approx U V^T$ where U or V is sparse, or both
• Why sparse?
  – Variable selection (sparse U)
  – When n >> d
  – Storage savings
  – Other new reasons?
• L1 and L2 constraints
Sparse PCA: Truncation and Discretization
• Sparsified SVD: $X \approx U \Sigma V^T$, $U = (u_1 \cdots u_k)$, $V = (v_1 \cdots v_k)$
  – Compute $\{u_k, v_k\}$ one pair at a time, truncating entries below a threshold
  – Recursively compute all pairs using deflation: $X \leftarrow X - \sigma u v^T$
  – (Zhang, Zha, Simon, 2002)
• Semi-discrete decomposition
  – U, V only contain entries in {-1, 0, 1}
  – Iterative algorithm to compute U, V using deflation
  – (Kolda & O'Leary, 1999)
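A minimal numpy sketch (not from the tutorial; the threshold value is arbitrary) of sparsification by truncation with deflation, in the spirit of the first bullet:

```python
# Sparsified SVD: threshold each leading singular vector pair, then deflate.
import numpy as np

def sparse_svd(X, k, thresh=0.1):
    X = X.copy()
    us, vs = [], []
    for _ in range(k):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        u, sig, v = U[:, 0], s[0], Vt[0]
        u = np.where(np.abs(u) < thresh, 0.0, u)   # truncate small entries
        v = np.where(np.abs(v) < thresh, 0.0, v)
        us.append(u); vs.append(v)
        X = X - sig * np.outer(u, v)               # deflation
    return np.array(us).T, np.array(vs).T

rng = np.random.default_rng(6)
U_s, V_s = sparse_svd(rng.normal(size=(20, 15)), k=3)
```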
Sparse PCA: L1 constraint
• LASSO (Tibshirani, 1996): $\min \|y - X^T \beta\|^2, \quad \|\beta\|_1 \leq t$

• SCoTLASS (Jolliffe & Uddin, 2003): $\max u^T (X X^T) u, \quad \|u\|_1 \leq t, \; u^T u_h = 0$

• Least Angle Regression (Efron et al., 2004)

• Sparse PCA (Zou, Hastie, Tibshirani, 2004):
$\min_{\alpha, \beta} \sum_{i=1}^{n} \|x_i - \alpha \beta^T x_i\|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1, \quad \alpha^T \alpha = I$
with $v_j = \beta_j / \|\beta_j\|$.
Sparse PCA: Direct Sparsification
• Sparse SVD with explicit sparsification (Zhang, Zha, Simon, 2003):
$\min_{u,v} \|X - d u v^T\|_F + \mathrm{nnz}(u) + \mathrm{nnz}(v)$
  – rank-one approximation
  – minimize a bound
  – deflation

• Direct sparse PCA, on the covariance matrix S (d'Aspremont, El Ghaoui, Jordan, Lanckriet, 2004):
$\max u^T S u = \max \mathrm{Tr}(S u u^T) = \max \mathrm{Tr}(S U)$
$\text{s.t. } \mathrm{Tr}(U) = 1, \; \mathrm{nnz}(U) \leq k^2, \; U \succeq 0, \; \mathrm{rank}(U) = 1$
Sparse PCA Summary
• Many different approaches
  – Truncation, discretization
  – L1 constraint
  – Direct sparsification
  – Other approaches
• Sparse matrix factorization in general
  – L1 constraint
• Many questions
  – Orthogonality
  – Uniqueness of solution, global solution
PCA: Further Generalizations
• Generalization to the exponential family (Collins, Dasgupta, Schapire, 2001)
• Maximum-margin matrix factorization (Srebro, Rennie, Jaakkola, 2004)
  – Collaborative filtering; the input Y is binary
  – Hard margin: $X = U V^T$, $Y_{ia} X_{ia} \geq 1 \;\; \forall\, ia \in S$
  – Soft margin: $\min \|X\|_\Sigma + c \sum_{ia \in S} \max(0, 1 - Y_{ia} X_{ia})$,
    where $\|X\|_\Sigma = \min_{X = U V^T} \tfrac{1}{2}(\|U\|_{Fro}^2 + \|V\|_{Fro}^2)$
Column Partitioned Matrix Factorizations
• Column-partitioned data matrix: the n columns are grouped into k partitions,
$X = (\underbrace{x_1, \ldots, x_{n_1}}_{n_1}, \underbrace{x_{n_1+1}, \ldots, x_{n_1+n_2}}_{n_2}, \ldots) = (X_1, X_2, \ldots, X_k), \quad n_1 + \cdots + n_k = n$
• Partitions are generated by clustering
• Centroid matrix $U = (u_1 \cdots u_k)$
  – $u_k$ is the centroid of partition k
  – Fix U, compute V: $\min \|X - U V^T\|_F^2 \;\Rightarrow\; V = X^T U (U^T U)^{-1}$
• Or represent each partition by an SVD
  – Pick the leading u's of each partition to form $U = (U^{(1)}, \ldots, U^{(k)}) = (u_1^{(1)} \cdots u_{\ell_1}^{(1)}, \ldots, u_1^{(k)} \cdots u_{\ell_k}^{(k)})$
  – Fix U, compute V
• Several other variations

(Zhang & Zha, 2001; Castelli, Thomasian & Li, 2003; Park, Jeon & Rosen, 2003; Dhillon & Modha, 2001; Zeimpekis & Gallopoulos, 2004)
Two-dimensional SVD
• A large number of data objects are 2-D: images, maps
• Standard method:
  – convert (re-order) each image into a 1D vector
  – collect all 1D vectors into a single (big) matrix
  – apply SVD to the big matrix
• 2D-SVD is developed for 2D objects
  – Extension of the standard SVD
  – Keeps the 2D characteristics
  – Improves the quality of the low-dimensional approximation
  – Reduces computation and storage
[Figure: a 2D image linearized into a 1D pixel vector]
SVD and 2D-SVD

SVD: $X = (x_1, x_2, \ldots, x_n)$
$X = U \Sigma V^T, \qquad \Sigma = U^T X V$
Eigenvectors of $X X^T$ and $X^T X$

2D-SVD: $\{A\} = \{A_1, A_2, \ldots, A_n\}$
Eigenvectors of
$F = \sum_i (A_i - \bar{A})(A_i - \bar{A})^T$ (row-row covariance)
$G = \sum_i (A_i - \bar{A})^T (A_i - \bar{A})$ (column-column covariance)
$M_i = U^T A_i V, \qquad A_i = U M_i V^T$
2D-SVD: $\{A\} = \{A_1, A_2, \ldots, A_n\}$, assuming $\bar{A} = 0$

row-row covariance: $F = \sum_i A_i A_i^T = \sum_k \lambda_k u_k u_k^T$
col-col covariance: $G = \sum_i A_i^T A_i = \sum_k \zeta_k v_k v_k^T$

$U = (u_1, u_2, \ldots, u_k), \qquad V = (v_1, v_2, \ldots, v_k)$

Bilinear subspace: $M_i = U^T A_i V, \qquad A_i = U M_i V^T, \qquad i = 1, \ldots, n$

$A_i \in \mathbb{R}^{r \times c}, \quad U \in \mathbb{R}^{r \times k}, \quad V \in \mathbb{R}^{c \times k}, \quad M_i \in \mathbb{R}^{k \times k}$
2D-SVD Error Analysis
With the representation $A_i \approx L M_i R^T$, where $A_i \in \mathbb{R}^{r \times c}$, $L \in \mathbb{R}^{r \times k}$, $R \in \mathbb{R}^{c \times k}$, $M_i \in \mathbb{R}^{k \times k}$:

$J_1 = \min \sum_{i=1}^{n} \|A_i - M_i L\|^2 = \sum_{j=k+1}^{c} \zeta_j$

$J_2 = \min \sum_{i=1}^{n} \|A_i - R M_i\|^2 = \sum_{j=k+1}^{r} \lambda_j$

$J_3 = \min \sum_{i=1}^{n} \|A_i - L M_i R^T\|^2 \cong \sum_{j=k+1}^{r} \lambda_j + \sum_{j=k+1}^{c} \zeta_j$

$J_4 = \min \sum_{i=1}^{n} \|A_i - L L^T M_i\|^2 \cong 2 \sum_{j=k+1}^{r} \lambda_j$

For comparison, SVD: $\min \|X - U \Sigma V^T\|^2 = \sum_{i=k+1}^{p} \sigma_i^2$
[Figure: temperature maps (January, over 100 years)]
Reconstruction error ratio SVD/2DSVD = 1.1; storage ratio SVD/2DSVD = 8
[Figure: reconstructed images. SVD (K=15), storage 160560; 2DSVD (K=15), storage 93060]
2D-SVD Summary
• 2DSVD is an extension of the standard SVD
• Provides optimal solutions for 4 representations of 2D images/maps
• Substantial improvements in storage, computation, and quality of reconstruction
• Captures 2D characteristics
Part 1.C. K-means Clustering ⇔ Principal Component Analysis
(Equivalence between PCA and K-means)
K-means clustering
• Also called "isodata" or "vector quantization"
• Developed in the 1960s (Lloyd, MacQueen, Hartigan, etc.)
• Computationally efficient (order mN)
• Widely used in practice
  – Benchmark to evaluate other algorithms

Given n points in m dimensions, $X = (x_1, x_2, \ldots, x_n)^T$, the K-means objective is
$\min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - c_k\|^2$
PCA is equivalent to K-means
The continuous optimal solutions for the cluster indicators in K-means clustering are given by the principal components.

The subspace spanned by the K cluster centroids is given by the PCA subspace.
2-way K-means Clustering

Cluster membership indicator:
$q(i) = \begin{cases} +\sqrt{n_2/(n n_1)} & \text{if } i \in C_1 \\ -\sqrt{n_1/(n n_2)} & \text{if } i \in C_2 \end{cases}$

$J_D = \frac{n_1 n_2}{n} \left[ \frac{2\, d(C_1, C_2)}{n_1 n_2} - \frac{d(C_1, C_1)}{n_1^2} - \frac{d(C_2, C_2)}{n_2^2} \right]$

$J_K = n \langle x^2 \rangle - J_D$, so $\min J_K \;\Rightarrow\; \max J_D$

Define the distance matrix $D = (d_{ij})$, $d_{ij} = \|x_i - x_j\|^2$. Then
$J_D = -q^T \tilde{D} q = 2 q^T (X^T X) q = 2 q^T K q$
where $\tilde{D}$ is the centered distance matrix and $K = X^T X$.

The solution is the principal eigenvector $v_1$ of K. The clusters $C_1, C_2$ are determined by:
$C_1 = \{ i \mid v_1(i) < 0 \}, \qquad C_2 = \{ i \mid v_1(i) \geq 0 \}$
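A minimal numpy sketch (not from the tutorial; two Gaussian blobs serve as data) of 2-way clustering by the sign of $v_1$:

```python
# 2-way clustering from the principal eigenvector of the centered Gram matrix.
import numpy as np

rng = np.random.default_rng(8)
X = np.hstack([rng.normal(-2, 1, (2, 30)),
               rng.normal(+2, 1, (2, 30))])   # two clusters, columns as points
Xc = X - X.mean(axis=1, keepdims=True)

K = Xc.T @ Xc                          # centered Gram matrix
vals, vecs = np.linalg.eigh(K)
v1 = vecs[:, -1]                       # principal eigenvector
C1 = np.where(v1 < 0)[0]               # C1 = {i | v1(i) < 0}
C2 = np.where(v1 >= 0)[0]              # C2 = {i | v1(i) >= 0}
```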
A simple illustration
DNA Gene Expression Profiles for Leukemia

Using $v_1$, the tissue samples separate into 2 clusters, with 3 errors.
Running one further K-means step reduces this to 1 error.
Multi-way K-means Clustering
Unsigned cluster membership indicators $h_1, \ldots, h_K$, e.g. for 4 points in 3 clusters:

$(h_1, h_2, h_3) = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$

with columns corresponding to clusters $C_1, C_2, C_3$.
Multi-way K-means Clustering
(Unsigned) cluster indicators $H = (h_1, \ldots, h_K)$:

$J_K = \sum_i x_i^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} x_i^T x_j = \sum_i x_i^2 - \sum_{k=1}^{K} h_k^T X^T X h_k = \sum_i x_i^2 - \mathrm{Tr}(H^T X^T X H)$

Redundancy: $\sum_{k=1}^{K} n_k^{1/2} h_k = e$ (the all-ones vector)

Regularized relaxation: transform $h_1, \ldots, h_K$ to $q_1, \ldots, q_K$ via an orthogonal matrix T:
$(q_1, \ldots, q_K) = (h_1, \ldots, h_K) T, \qquad q_1 = e / n^{1/2}$
Multi-way K-means Clustering
After the relaxation, minimizing $J_K$ becomes
$\max \mathrm{Tr}\big( Q_{k-1}^T (X^T X) Q_{k-1} \big), \qquad Q_{k-1} = (q_2, \ldots, q_k)$

The optimal solutions for $q_2, \ldots, q_k$ are given by the principal components $v_2, \ldots, v_k$.

$J_K$ is bounded below by the total variance minus the sum of the K-1 largest eigenvalues of the covariance:
$n \langle x^2 \rangle - \sum_{k=1}^{K-1} \lambda_k \;<\; \min J_K \;<\; n \langle x^2 \rangle$
Consistency: 2-way and K-way approaches
Orthogonal transform T transforms $(h_1, h_2)$ to $(q_1, q_2)$:

$T = \begin{pmatrix} \sqrt{n_1/n} & \sqrt{n_2/n} \\ \sqrt{n_2/n} & -\sqrt{n_1/n} \end{pmatrix}$

$h_1 = (1 \cdots 1,\, 0 \cdots 0)^T, \qquad h_2 = (0 \cdots 0,\, 1 \cdots 1)^T$

$q_1 = (1 \cdots 1)^T, \qquad q_2 = (a, \ldots, a,\, -b, \ldots, -b)^T$
with $a = \sqrt{n_2/(n n_1)}$ and $b = \sqrt{n_1/(n n_2)}$.

This recovers the original 2-way cluster indicator.
Test of lower bounds of K-means clustering
Quantity measured: $|J_{opt} - J_{LB}| / J_{opt}$
The lower bound is within 0.6-1.5% of the optimal value.
Cluster Subspace (spanned by the K centroids) = PCA Subspace

Given a data point x, $P = \sum_k c_k c_k^T$ projects x into the cluster subspace.

The centroid is given by $c_k = \sum_i h_k(i)\, x_i = X h_k$, so
$P = \sum_k c_k c_k^T = \sum_k X h_k h_k^T X^T = \sum_k X v_k v_k^T X^T = \sum_k \lambda_k u_k u_k^T$

Thus $P_{K\text{-means}} = \sum_k u_k u_k^T \;\Leftrightarrow\; \sum_k \lambda_k u_k u_k^T \equiv P_{PCA}$:
PCA automatically projects into the cluster subspace.

PCA is an unsupervised version of LDA.
Effectiveness of PCA Dimension Reduction
Kernel K-means Clustering
Kernel K-means objective, with the mapping $x_i \to \phi(x_i)$:
$\min J_K^\phi = \sum_{k=1}^{K} \sum_{i \in C_k} \|\phi(x_i) - \phi(c_k)\|^2$

Expanding,
$J_K^\phi = \sum_i |\phi(x_i)|^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \phi(x_i)^T \phi(x_j)$

so kernel K-means is equivalent to maximizing
$\sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \langle \phi(x_i), \phi(x_j) \rangle$
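A minimal numpy sketch (not from the tutorial; random initialization and a fixed iteration cap are arbitrary choices) of kernel K-means using only the kernel matrix:

```python
# Kernel K-means: assign each point to the cluster whose implicit centroid is
# nearest in feature space, using only the kernel matrix K.
import numpy as np

def kernel_kmeans(K, k, iters=50, seed=0):
    n = K.shape[0]
    labels = np.random.default_rng(seed).integers(k, size=n)
    for _ in range(iters):
        dist = np.zeros((n, k))
        for c in range(k):
            idx = labels == c
            nc = max(idx.sum(), 1)
            # ||phi(x_i) - m_c||^2
            #   = K_ii - (2/nc) sum_{j in c} K_ij + (1/nc^2) sum_{j,l in c} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, idx].sum(axis=1) / nc
                          + K[np.ix_(idx, idx)].sum() / nc**2)
        new = dist.argmin(axis=1)
        if (new == labels).all():
            break
        labels = new
    return labels

# usage, e.g. on the Gaussian-kernel matrix W built earlier:
# labels = kernel_kmeans(W, k=2)
```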
Kernel K-means clustering is equivalent to kernel PCA

The continuous optimal solutions for the cluster indicators are given by the kernel PCA components.

The subspace spanned by the K cluster centroids is given by the kernel PCA principal subspace.