Expectation Maximization for Sparse and Non-Negative PCA
Christian D. Sigg, Joachim M. Buhmann
Institute of Computational Science, ETH Zurich, Switzerland
ICML 2008, Helsinki
Overview
Introduction
Constrained PCA
Solving Constrained PCA
Method
Probabilistic PCA
Expectation Maximization
Sparse and Non-Negative PCA
Multiple Principal Components
Experiments and Results
Setup
Sparse PCA
Nonnegative Sparse PCA
Feature Selection
Conclusions
Constrained PCA
Loadings $\mathbf{w}_{(1)}$ of the first principal component (PC) are found by solving a convex maximization problem:

$$\mathbf{w}_{(1)} = \arg\max_{\mathbf{w}} \; \mathbf{w}^\top \mathbf{C}\,\mathbf{w}, \quad \text{s.t. } \|\mathbf{w}\|_2 = 1, \qquad (1)$$

where $\mathbf{C} \in \mathbb{R}^{D \times D}$ is the symmetric p.s.d. covariance matrix.

Loadings $\mathbf{w}_{(l)}$ of further PCs again maximize (1), subject to $\mathbf{w}_{(l)}^\top \mathbf{w}_{(k)} = 0$ for all $k < l$.

We consider (1) under one or two additional constraints:

1. Sparsity: $\|\mathbf{w}\|_0 \leq K$
2. Non-negativity: $w_d \geq 0$, $\forall d = 1, \dots, D$
Introduction Constrained PCA 3/23
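As a baseline, the unconstrained problem (1) has a closed-form solution: the dominant eigenvector of $\mathbf{C}$. A minimal numpy sketch on hypothetical toy data (not from the talk):

```python
import numpy as np

# Toy data: 4 dimensions with decreasing variance (hypothetical example).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4)) * np.array([3.0, 1.0, 0.5, 0.1])
C = np.cov(X, rowvar=False)            # symmetric p.s.d. covariance matrix

# Without the extra constraints, arg max w^T C w over the unit sphere
# is the dominant eigenvector of C.
evals, evecs = np.linalg.eigh(C)       # eigenvalues in ascending order
w1 = evecs[:, -1]                      # loadings of the first PC
var1 = evals[-1]                       # variance captured by the first PC
```

The sparsity and non-negativity constraints below are exactly what break this closed form and motivate the EM approach of the talk.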
Motivation and Applications
Constraints facilitate a trade-off between:

1. statistical fidelity – maximization of variance
2. interpretability – feature selection omits irrelevant variables
3. applicability – constraints imposed by the domain (e.g. physical process, transaction costs in finance)
Sparse PCA:
- Interpretation of PC loadings (Jolliffe et al., 2003 JCGS)
- Gene selection (Zou et al., 2004 JCGS) and ranking (d'Aspremont et al., 2007 ICML)

Non-negative sparse PCA: image parts extraction (Zass and Shashua, 2006 NIPS)

Left: full loadings of the first four PCs. Right: corresponding non-negative sparse loadings.
Introduction Constrained PCA 4/23
Combinatorial Solvers
Write (1) as

$$\mathbf{w}^\top \mathbf{C}\,\mathbf{w} = \sum_{i,j} C_{ij}\, w_i w_j \qquad (2)$$

For a given sparsity pattern $S = \{i \mid w_i \neq 0\}$, the optimal solution is the dominant eigenvector of the corresponding submatrix $\mathbf{C}_S$.

Algorithms:

- Exact: branch and bound (Moghaddam et al., 2006 NIPS), for small $D$
- Greedy forward search (d'Aspremont et al., 2007), improved to $O(D^2)$ per step

Optimality verifiable in $O(D^3)$ (d'Aspremont et al., 2007).
Introduction Solving Constrained PCA 5/23
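The combinatorial view above can be sketched directly: for each candidate pattern $S$, solve the small eigenproblem on $\mathbf{C}_S$. The naive greedy loop below illustrates the forward search on toy data; it does not include the $O(D^2)$-per-step speedup, and `variance_for_pattern` is a name chosen here for illustration.

```python
import numpy as np

def variance_for_pattern(C, S):
    """For a fixed sparsity pattern S, the optimal loadings are the
    dominant eigenvector of the submatrix C_S, padded back to R^D."""
    idx = np.array(sorted(S))
    evals, evecs = np.linalg.eigh(C[np.ix_(idx, idx)])
    w = np.zeros(C.shape[0])
    w[idx] = evecs[:, -1]
    return evals[-1], w

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
C = A @ A.T                            # toy symmetric p.s.d. matrix

# Greedy forward search: grow S by the index that most increases variance.
S, K = set(), 3
for _ in range(K):
    rest = [i for i in range(C.shape[0]) if i not in S]
    best = max(rest, key=lambda i: variance_for_pattern(C, S | {i})[0])
    S.add(best)
var, w = variance_for_pattern(C, S)
```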
Continuous Approximation
$L_1$ relaxation: $\|\mathbf{w}\|_1 \leq B$

Adding the constraint creates local minima (Jolliffe et al., 2003) and makes the problem NP-hard.

[Figure: feasible regions and objective contours for $\mathbf{w}^\top \mathbf{C}\,\mathbf{w}$ subject to $\|\mathbf{w}\|_2 \leq 1$ (left) versus $\|\mathbf{w}\|_2 \leq 1 \,\wedge\, \|\mathbf{w}\|_1 \leq 1.2$ (right)]

Algorithms:

- Iterative $L_1$ regression (Zou et al., 2004)
- Convex $O(D^4 \sqrt{\log D})$ SDP approximation (d'Aspremont et al., 2005 NIPS)
- d.c. minimization (Sriperumbudur et al., 2007 ICML), $O(D^3)$ per iteration
Introduction Solving Constrained PCA 6/23
Generative Model
(Bishop and Tipping, 1997 TECR; Roweis, 1998 NIPS)
Latent variable $\mathbf{y} \in \mathbb{R}^L$ in the PC subspace:

$$p(\mathbf{y}) = \mathcal{N}(\mathbf{0}, \mathbf{I}).$$

Observation $\mathbf{x} \in \mathbb{R}^D$ conditioned on $\mathbf{y}$:

$$p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}(\mathbf{W}\mathbf{y} + \boldsymbol{\mu},\, \sigma^2 \mathbf{I}).$$

[Figure: the latent prior $p(\mathbf{y})$, the conditional $p(\mathbf{x} \mid \mathbf{y})$ along the loading direction $\mathbf{w}$, and the resulting marginal $p(\mathbf{x})$ in data space]
Illustrations from (Bishop, 2006 PRML).
Method Probabilistic PCA 8/23
Limit-Case EM Algorithm
Three simplifications: take the limit $\sigma^2 \to 0$, consider an $L = 1$ subspace $\mathbf{w}$, and normalize $\|\mathbf{w}\| = 1$.

E-step: orthogonal projection onto the current estimate ($\mathbf{X} \in \mathbb{R}^{N \times D}$):

$$\mathbf{y} = \mathbf{X}\mathbf{w}^{(t)}. \qquad (3)$$

M-step: minimization of the $L_2$ reconstruction error:

$$\mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w}} \sum_{n=1}^{N} \left\| \mathbf{x}^{(n)} - y_n \mathbf{w} \right\|^2. \qquad (4)$$

Renormalization: $\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t+1)} / \|\mathbf{w}^{(t+1)}\|$.

[Figure: panels (a)–(f) showing successive EM iterations aligning $\mathbf{w}$ with the first PC of a 2D point cloud]
Illustrations from (Bishop, 2006 PRML).
Method Expectation Maximization 9/23
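Steps (3)–(4) can be run as-is; on centered data the fixed point is the leading eigenvector of the covariance. A sketch on hypothetical toy data:

```python
import numpy as np

# Toy data with a clearly dominant direction (hypothetical example).
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5)) * np.array([4.0, 1.0, 0.5, 0.3, 0.1])
X -= X.mean(axis=0)

w = rng.standard_normal(5)
w /= np.linalg.norm(w)                 # random unit-norm initialization
for _ in range(100):
    y = X @ w                          # E-step (3): project onto current w
    w = X.T @ y / (y @ y)              # M-step (4): least-squares minimizer
    w /= np.linalg.norm(w)             # renormalization

# The iteration is power iteration in disguise: w aligns with the
# dominant eigenvector of the covariance matrix.
v = np.linalg.eigh(np.cov(X, rowvar=False))[1][:, -1]
```

Each iteration costs $O(ND)$, which is where the overall efficiency of the method comes from.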
Adding Constraints
Rewrite the M-step (4) as an isotropic QP, favor sparsity using an $L_1$ constraint, and add lower bounds:

$$\mathbf{w}^\circ = \arg\min_{\mathbf{w}} \left( h\,\mathbf{w}^\top \mathbf{w} - 2\,\mathbf{f}^\top \mathbf{w} \right) \quad \text{s.t. } \|\mathbf{w}\|_1 \leq B, \quad w_d \geq 0 \;\forall d \in \{1, \dots, D\}$$

with $h = \sum_{n=1}^{N} y_n^2$ and $\mathbf{f} = \sum_{n=1}^{N} y_n \mathbf{x}^{(n)}$.

Equivalently, minimize the $L_2$ distance to the unconstrained optimum $\mathbf{w}^* = \mathbf{f}/h$.
Method Sparse and Non-Negative PCA 10/23
Axis-Aligned Gradient Descent
After transformation into the non-negative orthant:

[Figure: $L_1$ ball of radius $B$ in the $(w_1, w_2)$ plane, with the unconstrained optimum $\mathbf{w}^*$ and the constrained solution $\mathbf{w}^\circ$]

Express sparsity by $K$ directly:

- The $L_1$ bound $B$ is set implicitly, due to monotonicity
- The regularization path is obtained by sorting the elements of $\mathbf{w}^*$, in $O(D \log D)$

Method Sparse and Non-Negative PCA 11/23
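In the unconstrained case $\mathbf{w}^* = \mathbf{f}/h$, the closest point with at most $K$ nonzeros simply keeps the $K$ largest-magnitude entries of $\mathbf{w}^*$, which is why a single sort yields the whole regularization path. The sketch below shows that simplified truncation idea, not the full axis-aligned descent of the talk; `sparse_m_step` is a hypothetical name.

```python
import numpy as np

def sparse_m_step(f, h, K):
    """Simplified constrained M-step: keep the K largest-magnitude
    entries of the unconstrained optimum w* = f / h, zero the rest,
    and renormalize as in the EM loop."""
    w_star = f / h
    order = np.argsort(-np.abs(w_star))    # indices sorted by |w*|, descending
    w = np.zeros_like(w_star)
    w[order[:K]] = w_star[order[:K]]
    return w / np.linalg.norm(w)

f = np.array([0.1, -2.0, 0.5, 3.0, -0.2])
w = sparse_m_step(f, h=2.0, K=2)           # keeps entries 3 and 1 of w*
```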
Multiple Principal Components
Iterative deflation:

1. Compute sparse PC loadings $\mathbf{w}_{(l)}$.
2. Project the data onto the orthogonal subspace, using the projector $\mathbf{P} = \mathbf{I} - \mathbf{w}_{(l)} \mathbf{w}_{(l)}^\top$.

Including non-negativity constraints:

$$w_{(l),i} > 0 \;\Rightarrow\; w_{(m),i} = 0 \quad \text{for } m \neq l,$$

i.e. $S_l$ and $S_m$ are disjoint. This may be too strong a requirement. To enforce quasi-orthogonality instead, add the constraint

$$\mathbf{V}^\top \mathbf{w} \leq \alpha \qquad (5)$$

where $\mathbf{V} = \left[ \mathbf{w}_{(1)}\, \mathbf{w}_{(2)} \cdots \mathbf{w}_{(l-1)} \right]$.
Method Multiple Principal Components 12/23
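The deflation step is a one-liner; a minimal sketch with a hypothetical unit-norm loading vector:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4))          # toy data matrix (N x D)
w = np.array([1.0, 0.0, 0.0, 0.0])         # current PC loadings, ||w|| = 1

# Projector onto the orthogonal complement of w, as in step 2.
P = np.eye(4) - np.outer(w, w)
X_defl = X @ P                             # deflated data

# The deflated data carry no variance along w.
resid = X_defl @ w
```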
Algorithms and Datasets
Algorithms considered:

- SPCA (Zou et al., 2004): iterative $L_1$ regression; continuous; uses $\mathbf{X}$ instead of $\mathbf{C}$, and ranking
- PathSPCA (d'Aspremont et al., 2007): combinatorial; greedy step is $O(D^2)$
- NSPCA (Zass and Shashua, 2006): non-negativity, quasi-orthogonality

Data sets:

1. $N > D$: MIT CBCL faces dataset (Sung, 1996), $N = 2429$, $D = 361$, referenced in the NSPCA and NMF literature.
2. $D \gg N$: gene expression data of three types of leukemia (Armstrong et al., 2002), $N = 72$, $D = 12582$.

Standardized to zero mean and unit variance per dimension.
Experiments and Results Setup 14/23
Variance vs. Cardinality
[Plots: variance vs. cardinality. Left, faces data: emPCA, emPCAopt, SPCA, SPCAopt, PathSPCA. Right, gene expression data: emPCA, SPCA, PathSPCA, thresholding.]

- "opt" denotes the result after recomputing the weights for a given sparsity pattern $S$.
Experiments and Results Sparse PCA 15/23
Computational Complexity
Parameter sensitivity of EM convergence and effective computational cost:
[Plots. Left, EM iterations (faces data): number of EM iterations as a function of cardinality and dimensionality. Right, Matlab runtime (gene data): CPU time in seconds vs. cardinality on a log–log scale, for emPCA, SPCA and PathSPCA.]

- Sublinear dependence of EM iterations on $D$
- Sparse emPCA solutions ($10 \leq K \leq 40$) require more effort
Experiments and Results Sparse PCA 16/23
Variance vs. Cardinality
[Plot: variance vs. cardinality on the faces data, for emPCA (nn) and NSPCA with $\alpha = 10^5, 10^6, 10^7$]

- Best result after 10 random restarts
- NSPCA sparsity parameter $\beta$ determined by bisection search
- $\alpha$ is an orthonormality penalty
Experiments and Results Nonnegative Sparse PCA 17/23
Multiple Principal Components
The first PC loadings may lie in the non-negative orthant. To recover more than one component:

1. Keep the orthogonality requirement, but constrain cardinality.
2. Require a minimum angle between components.

[Plots: cumulative variance vs. number of components. Left, orthogonal sparse loadings: emSPCA, emNSPCA, NSPCA ($\alpha = 10^{12}$). Right, full quasi-orthogonal loadings: emNPCA, NSPCA.]
Experiments and Results Nonnegative Sparse PCA 18/23
Unsupervised Gene Selection
Compare the loadings of emPCA to the CE+SR criterion (Varshavsky et al., 2006 BIOI) (an LOO comparison of the SV spectrum):

1. Choose a gene subset
2. $k$-means clustering of the samples ($k = 3$, 100 restarts)
3. Compare the clustering assignment to the labels (AML, ALL, MLL)

[Plot: Jaccard score vs. number of features, for emPCA, emPCA (nn) and CE+SR]

Jaccard score: $\frac{n_{11}}{n_{11} + n_{10} + n_{01}}$, where $n_{11}$ is the number of pairs which have the same assignment in the true labeling and in the clustering solution.
Experiments and Results Feature Selection 19/23
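The Jaccard score above is computed over all sample pairs; a small self-contained sketch (the function name is ours):

```python
import numpy as np
from itertools import combinations

def jaccard_score(labels, clusters):
    """Pairwise Jaccard score n11 / (n11 + n10 + n01): n11 counts pairs
    placed together in both the true labeling and the clustering; n10 and
    n01 count pairs joined in only one of the two partitions."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        n11 += same_label and same_cluster
        n10 += same_label and not same_cluster
        n01 += same_cluster and not same_label
    return n11 / (n11 + n10 + n01)

# Identical partitions score 1 regardless of cluster relabeling.
score = jaccard_score([0, 0, 1, 1], [1, 1, 0, 0])
```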
Summary
We have presented an $O(D^2)$ constrained PCA algorithm:

- Applicable to a wide range of problems:
  - sparse and non-negative constraints
  - strict and quasi-orthogonality between components
  - $N > D$ and $D \gg N$
- Competitive variance per cardinality for sparse PCA, superior for non-negative sparse PCA
- Unmatched computational efficiency
- Direct specification of the desired cardinality $K$, instead of a bound $B$ on the $L_1$ norm
Conclusions 21/23
Thank you for your attention.
We have machine learning PhD and postdoc positions available at ETH Zurich:

- Computational (systems) biology
- Robust (noisy) combinatorial optimization
- Learning dynamical systems

Contact: Prof. J.M. Buhmann, jbuhmann@inf.ethz.ch, http://www.ml.ethz.ch/open_positions
Joint Optimization (Work in Progress)
EM algorithm for simultaneous computation of L PCs:
E-step:

$$\mathbf{Y} = \left( \mathbf{W}_{(t)}^\top \mathbf{W}_{(t)} \right)^{-1} \mathbf{W}_{(t)}^\top \mathbf{X}$$

M-step:

$$\mathbf{W}_{(t+1)} = \arg\min_{\mathbf{W}} \|\mathbf{X} - \mathbf{W}\mathbf{Y}\|_F^2 \quad \text{s.t. } \|\mathbf{w}_{(l)}\|_1 \leq B, \; l \in \{1, \dots, L\}, \quad w_{i,j} \geq 0$$
Tasks:

- Make this as fast as the iterative deflation approach
- Avoid bad minima: enforce orthogonality?
- Specify $K$ instead of $B$
Conclusions 23/23
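A sketch of the unconstrained core of this joint EM (the $L_1$ and non-negativity constraints are omitted), taking $\mathbf{X}$ as $D \times N$ here so that $\mathbf{X} \approx \mathbf{W}\mathbf{Y}$; the toy data are hypothetical, and the QR step is our addition to keep the basis well-conditioned (it leaves the column span, and hence the iteration, unchanged):

```python
import numpy as np

rng = np.random.default_rng(4)
D, N, L = 6, 300, 2
# Toy data: two high-variance directions (hypothetical example), X is D x N.
X = rng.standard_normal((D, N)) * np.array([5.0, 3.0, 1.0, 1.0, 1.0, 1.0])[:, None]
X -= X.mean(axis=1, keepdims=True)

W = rng.standard_normal((D, L))
for _ in range(200):
    Y = np.linalg.solve(W.T @ W, W.T @ X)  # E-step: pseudo-inverse projection
    W = X @ Y.T @ np.linalg.inv(Y @ Y.T)   # M-step: unconstrained minimizer of ||X - WY||_F^2
    W, _ = np.linalg.qr(W)                 # re-orthonormalize (span is unchanged)

# span(W) converges to the top-L principal subspace of X.
U = np.linalg.eigh(X @ X.T)[1][:, -L:]        # true top-L eigenvectors
s = np.linalg.svd(U.T @ W, compute_uv=False)  # cosines of principal angles
```

Without constraints the iteration recovers only the principal subspace, not individual loadings, which is why the orthogonality question in the task list matters.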