Sparse and Low-Rank Modeling for High-Dimensional Data Analysis
Ehsan Elhamifar, Rene Vidal, John Wright, Guillermo Sapiro
CVPR 2015 Tutorial Boston, MA
High-dimensional data deluge
72 hours of new video per minute; 300 million new photos per day
Low-dimensional structures
• Intrinsic structures are low-dimensional
Basri-Jacobs’03, Tomasi-Kanade’92, Deerwester et al ’90, Goldberg-Nichols-Oki-Terry ’92
High-dimensional data analysis
• Clustering
• Embedding ($\mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$)
• Classification
• Subset selection
Challenges
• Clustering and subset selection: Non-convex and NP-hard
• Real data are often corrupted
• Little prior knowledge about low-dim structures
• Points in different groups can be very close
- Extended YaleB dataset (38 subjects, 64 images each); $K$ nearest neighbors:
  K = 1: 6%;  K = 2: 14%;  K = 3: 23%;  K = 4: 31%
This tutorial
Tools:
- Sparse & low-rank representation
- High-dimensional statistics & geometry
- Convex programming & analysis
Efficient, robust, and provably correct algorithms for (1) clustering and subset selection, and (2) classification and dimension reduction.
This tutorial
1) Clustering, subset selection: algorithms, theory, applications (Ehsan Elhamifar, Rene Vidal)
(Coffee break: 3:30pm - 4:15pm)
2) Robust PCA, learning low-rank transformations: algorithms, theory, applications (John Wright, Guillermo Sapiro)
Sparse Subspace Clustering Ehsan Elhamifar
Subspace clustering problem
• Given points $\{y_1, \dots, y_N\}$ in $\mathbb{R}^n$ lying in a union of subspaces $S_1 \cup \cdots \cup S_L$, find
- Basis for each subspace
- Clustering of the data
• Challenging for multiple subspaces
- Do not know subspace bases
- Do not know memberships of points
- Corruption by noise, missing entries, outliers, ...
[Figure: points in three subspaces $S_1, S_2, S_3$ of $\mathbb{R}^n$]
Tomasi-Kanade’92, Tipping-Bishop’99, Tseng’00, Kanatani’01, Vidal-Ma-Sastry’05, Yan-Pollefeys’06, Chen-Lerman’09
• Possible approach: Expectation-Maximization; issue: local minima
Spectral clustering-based approach
• Spectral Clustering
- Represent points as graph nodes
- Connect nodes $i$ and $j$ with weight $c_{ij}$
- Infer clusters from graph Laplacian
• Good similarity for subspaces?
- Points in the same subspace: $c_{ij} \neq 0$
- Points in different subspaces: $c_{ij} = 0$
- Nearest neighbors
Fiedler’73, Shi-Malik’01, Belkin-Niyogi’01, Ng-Weiss-Jordan’01, Xing-Jordan’03, Von Luxburg’07
Subspace clustering: idea
• Self-Expressiveness Property (SEP): $y_i = Y c_i$, where $Y = [Y_1 \; Y_2 \; \cdots \; Y_L]\,\Gamma$ (for an unknown permutation $\Gamma$) has low column-rank
- many solutions
• In $S_\ell$ of dimension $d_\ell$, a point can be reconstructed from $d_\ell$ other points
• $\ell_0$ (number of nonzero elements):
$\min \|c_i\|_0 \ \text{s.t.}\ y_i = Y c_i,\ c_{ii} = 0$   (NP-hard)
[Figure: point $y_i$ in $S_1$, with subspaces $S_2, S_3$]
Subspace clustering: idea
• Self-Expressiveness Property (SEP): $y_i = Y c_i$, where $Y = [Y_1 \; Y_2 \; \cdots \; Y_L]\,\Gamma$ has low column-rank
- many solutions
• In $S_\ell$ of dimension $d_\ell$, a point can be reconstructed from $d_\ell$ other points
• $\ell_1$ (sum of absolute values of elements):
$\min \|c_i\|_1 \ \text{s.t.}\ y_i = Y c_i,\ c_{ii} = 0$   (Convex)
Sparse subspace clustering (SSC)
• Step 1: Solve the sparse optimization
$c_i^* = \arg\min \|c_i\|_1 \ \text{s.t.}\ y_i = Y c_i,\ c_{ii} = 0$, yielding $c_i^* = [c_{i1}^*, \dots, c_{iN}^*]^\top$
• Step 2: Infer the clustering from the similarity graph via spectral clustering
E. Elhamifar and R. Vidal, CVPR 2009; E. Elhamifar and R. Vidal, IEEE Trans. PAMI, 2013
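A minimal Python sketch of these two steps (not the authors' released code; see the URL at the end of this transcript for that). It assumes cvxpy and scikit-learn are available, solves the $\ell_1$ program once per point, and feeds the symmetrized coefficient magnitudes to spectral clustering:

```python
import numpy as np
import cvxpy as cp
from sklearn.cluster import SpectralClustering

def sparse_subspace_clustering(Y, n_clusters):
    """Y: n x N data matrix whose columns are the points."""
    n, N = Y.shape
    C = np.zeros((N, N))
    for i in range(N):
        c = cp.Variable(N)
        # min ||c||_1  s.t.  y_i = Y c,  c_ii = 0
        prob = cp.Problem(cp.Minimize(cp.norm1(c)),
                          [Y @ c == Y[:, i], c[i] == 0])
        prob.solve()
        C[:, i] = c.value
    W = np.abs(C) + np.abs(C).T  # symmetrized similarity graph
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(W)

# Toy example: two lines through the origin in R^2, 20 points each
rng = np.random.default_rng(0)
Y = np.hstack([np.outer([1.0, 0.0], rng.standard_normal(20)),
               np.outer([1.0, 1.0], rng.standard_normal(20)) / np.sqrt(2)])
print(sparse_subspace_clustering(Y, n_clusters=2))
```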
L1 graph vs k-NN graph
• Conventional graph clustering
- 1) build a k-NN graph
- 2) learn edge weights
- 3) partition the graph
• SSC algorithm
- 1) learn graph & weights
- 2) partition the graph
[Figure: k-NN graph around $y_i$ with Gaussian weights $c_{ij} = e^{-\|y_i - y_j\|_2^2 / (2\sigma^2)}$, vs. the sparse coefficients SSC learns, e.g. $c_i^* = [0,\ 0.8,\ 0,\ \dots,\ 0.3,\ 0]^\top$]
SSC automatically selects the right number of neighbors! SSC can deal with subspaces of different dimensions!
E. Elhamifar and R. Vidal, CVPR 2009; E. Elhamifar and R. Vidal, IEEE Trans. PAMI, 2013
Theoretical analysis
• When does SSC succeed?
- $\ell_1$ selects points from the correct subspace: no false discovery
• More challenging than conventional sparse recovery
- Sparse representation from the correct subspace
- Sparse representation not unique
Donoho-Elad’03, La-Do’05, Candes-Romberg-Tao’06, Tropp-Gilbert’07, Wright-Ma’10, Sankaranarayanan-Turaga-Baraniuk-Chellappa’10
Geometry-based theoretical guarantees
• Theorem: SSC has zero false discovery for any $y \in S_i$ if
$\max_{j \neq i} \cos(\theta_{ij}) < \max_{\operatorname{rank}(Y_i^0) = d_i} \sigma_{d_i}(Y_i^0) / \sqrt{d_i}$,
where the maximum on the right is over full-rank submatrices $Y_i^0$ of $d_i$ points from $S_i$.
E. Elhamifar and R. Vidal, ICASSP 2010; E. Elhamifar and R. Vidal, IEEE Trans. PAMI, 2013
[Figure: geometry of the guarantee: subspaces $S_i, S_j, S_\ell$, a point $x$, and the polytopes $P_i$, $P_{-i}$; left panel: $\ell_1$ succeeds, right panel: $\ell_1$ fails]
Geometry-based theoretical guarantees
• Theorem (repeated): SSC has zero false discovery for any $y \in S_i$ if $\max_{j \neq i} \cos(\theta_{ij}) < \max_{\operatorname{rank}(Y_i^0) = d_i} \sigma_{d_i}(Y_i^0) / \sqrt{d_i}$
E. Elhamifar and R. Vidal, ICASSP 2010; E. Elhamifar and R. Vidal, IEEE Trans. PAMI, 2013
• No need to have many points; need a few, well-distributed ones!
[Figure: same geometric picture as above]
Clustering noisy data
• All points are contaminated by noise: $\tilde{Y} = Y + Z$, with $z_{ij} \sim \mathcal{N}(0, \sigma^2/n)$ i.i.d.
• Self-expressiveness $y_i = Y c_i$ implies $\tilde{y}_i = \tilde{Y} c_i + (z_i - Z c_i)$: a sparse representation plus a perturbation. Is the representation still sparse after corruption?
• Solve the Lasso: $\min\ \lambda \|c_i\|_1 + \tfrac{1}{2}\|\tilde{y}_i - \tilde{Y} c_i\|_2^2 \ \text{s.t.}\ c_{ii} = 0$
M. Soltanolkotabi, E. Elhamifar and E. Candes, Annals of Stats, 2014; E. Elhamifar, M. Soltanolkotabi, S. Sastry, IEEE TSP, 2015
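The Lasso variant simply swaps the exact equality constraint for a quadratic fidelity term. A cvxpy sketch of the column-by-column solve (the value lam=0.1 is illustrative, not from the slides):

```python
import numpy as np
import cvxpy as cp

def ssc_lasso_coefficients(Y_tilde, lam=0.1):
    """Solve the Lasso program per column of the noisy data Y_tilde (n x N)."""
    n, N = Y_tilde.shape
    C = np.zeros((N, N))
    for i in range(N):
        c = cp.Variable(N)
        # min lam * ||c||_1 + 0.5 * ||y_i - Y c||_2^2  s.t.  c_ii = 0
        obj = lam * cp.norm1(c) + 0.5 * cp.sum_squares(Y_tilde[:, i] - Y_tilde @ c)
        cp.Problem(cp.Minimize(obj), [c[i] == 0]).solve()
        C[:, i] = c.value
    return C
```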
Robust SSC
• Algorithm: two-step approach
1) $\beta_i^* = \arg\min_{\beta_i} \|\beta_i\|_1 \ \text{s.t.}\ \|\tilde{y}_i - \tilde{Y}\beta_i\|_2 \le 2\delta,\ \beta_{ii} = 0$
2) $c_i^* = \arg\min_{c_i} \lambda \|c_i\|_1 + \tfrac{1}{2}\|\tilde{y}_i - \tilde{Y} c_i\|_2^2 \ \text{s.t.}\ c_{ii} = 0$, with the data-dependent weight $\lambda = \tfrac{1}{4\|\beta_i^*\|_1}$
• Theorem: Assume the noise-free data are drawn uniformly at random from the intersection of each subspace with the unit hypersphere. Apply the two-step procedure to $\tilde{y} \in S_i$. Under some assumptions, if
$\max_{j \neq i} \sqrt{\operatorname{Ave}(\cos^2 \theta_{ij})} \lesssim (\log N)^{-1}$,
then with high probability: a) no false discovery, and b) about subspace-dimension-many nonzeros.
M. Soltanolkotabi, E. Elhamifar and E. Candes, Annals of Stats, 2014; E. Elhamifar, M. Soltanolkotabi, S. Sastry, IEEE TSP, 2015
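A sketch of the two-step procedure for a single point, using the slide's data-dependent weight $\lambda = 1/(4\|\beta_i^*\|_1)$; the noise level delta is an assumed input that the theory ties to $\sigma$:

```python
import numpy as np
import cvxpy as cp

def robust_ssc_coefficients(Y_tilde, i, delta):
    """Two-step sketch for point i of Y_tilde (n x N); delta is an assumed noise bound."""
    n, N = Y_tilde.shape
    # Step 1: l1 minimization under a noise-aware feasibility constraint
    b = cp.Variable(N)
    cp.Problem(cp.Minimize(cp.norm1(b)),
               [cp.norm(Y_tilde[:, i] - Y_tilde @ b, 2) <= 2 * delta,
                b[i] == 0]).solve()
    # Data-dependent regularization weight from the slide
    lam = 1.0 / (4.0 * np.linalg.norm(b.value, 1))
    # Step 2: Lasso with the data-dependent lambda
    c = cp.Variable(N)
    cp.Problem(cp.Minimize(lam * cp.norm1(c)
                           + 0.5 * cp.sum_squares(Y_tilde[:, i] - Y_tilde @ c)),
               [c[i] == 0]).solve()
    return c.value
```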
Application: motion segmentation
• Given feature trajectories of multiple rigid motions
• Find segmentation into underlying motions
Tomasi-Kanade’92, Sugaya-Kanatani’04, Vidal-Ma-Sastry’05, Elhamifar-Vidal’09, Chen-Lerman’09, Zhao-Medioni’11
Experiments: motion segmentation
• Hopkins 155 dataset
- 155 sequences
- 2 and 3 motions
• Clustering errors
Algorithm           RANSAC   GPCA    MSL    LSA    SCC    LRR   LRSC    SSC
2 motions, mean       5.56   4.59   4.14   4.23   2.89   4.10   3.69   1.52
2 motions, median     1.18   0.38   0.00   0.56   0.00   0.22   0.29   0.00
3 motions, mean      22.94  28.66   8.23   7.02   8.25   9.89   7.69   4.40
3 motions, median    22.03  28.26   1.76   1.45   0.24   6.22   3.80   0.56
State of the art (-) vs. SSC (+):
- nonconvex, local minima vs. convex, provable
- k-NN based vs. automatic neighbor selection
- sensitive to noise vs. robust to noise
- exponential complexity vs. computationally efficient
E. Elhamifar and R. Vidal, IEEE Trans. PAMI, 2013
Application: face clustering
• Corruption by sparse errors
• SSC error on Ext YaleB faces
• Model corruption by sparse errors, $\tilde{y}_i = y_i + e_i$, and solve
$\min\ \lambda \|c_i\|_1 + \|\tilde{y}_i - \tilde{Y} c_i\|_1 \ \text{s.t.}\ c_{ii} = 0$
E. Elhamifar and R. Vidal, IEEE Trans. PAMI, 2013
- SSC error < 2.0% for 2 subjects, < 11.0% for 10 subjects
[Figures: (a) sparse-error distribution, Laplace vs. Gaussian; (b) clustering error (%) vs. number of subjects (2 to 10) on $D = 2{,}016$-dimensional face data, comparing SSC, LRSC, LRR-H, LRR, SCC, and LSA]
Other extensions of SSC
• Extension to clustering and DR of nonlinear manifolds [Elhamifar-Vidal NIPS’11]
• Scaling to large datasets
- Greedy algorithm, theory [Dyer-Sankaranarayanan-Baraniuk JMLR’13]
- Sampling + a more compact dictionary [Peng-Zhang-Yi CVPR’13]
• Dealing with sequential and spatial data [Tierney-Gao-Guo CVPR’13, Pham et al CVPR’12]
• Enforcing block-diagonal structure on the Laplacian / adjacency matrix [Feng-Lin et al CVPR’14]
• Connectivity of SSC graph [Nasihatkon-Hartley CVPR’11]
Conclusions
• Addressed clustering of data lying in multiple subspaces
• Proposed an efficient algorithm based on sparse modeling
- Proved theoretical guarantees of the algorithm
- Extended to deal with corrupted data
- Resolved challenges of the state of the art
- Showed it performs well in real-world problems
Sparse Subset Selection Ehsan Elhamifar
Finding representatives
• A subset of data / models, efficiently representing the entire set
- Summarize and visualize images/videos/text/web datasets
- Improve computational time and memory
- Describe (complex) nonlinear models
[Figure: switched linear system with input $u(t)$ and output $y(t)$, switching among submodels $\Sigma_1, \dots, \Sigma_s$]
Column subset selection
• Given $y_1, y_2, \dots, y_N \in \mathbb{R}^n$, select a subset $\{y_{i_1}, \dots, y_{i_k}\}$ that "well represents" the dataset
[Figure: $Y \approx YC$ with a row-sparse $C$; the few nonzero rows pick out representatives $y_{i_1}, y_{i_2}, y_{i_3}$]
E. Elhamifar, G. Sapiro and R. Vidal, CVPR 2012
$\arg\min_C\ \lambda \sum_{i=1}^{N} \|C_{i\cdot}\|_p + \tfrac{1}{2}\|Y - YC\|_F^2 \ \text{s.t.}\ \mathbf{1}^\top C = \mathbf{1}^\top$
• Contrast with SSC's $\ell_1$ graph, $\arg\min\ \lambda \|C\|_1 + \tfrac{1}{2}\|Y - YC\|_F^2$: entrywise sparsity connects each point to a few others, whereas row sparsity forces all points to share a few common representatives.
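A cvxpy sketch of the row-sparsity program with $p = 2$ (the threshold and lam are illustrative choices; the released code linked at the end is the reference implementation):

```python
import numpy as np
import cvxpy as cp

def find_representatives(Y, lam=1.0):
    """min lam * sum_i ||C_i.||_2 + 0.5 * ||Y - YC||_F^2  s.t.  1^T C = 1^T.
    Returns indices of nonzero rows of C, i.e. the representatives."""
    n, N = Y.shape
    C = cp.Variable((N, N))
    obj = (lam * cp.sum(cp.norm(C, 2, axis=1))        # row-sparsity penalty
           + 0.5 * cp.sum_squares(Y - Y @ C))         # reconstruction error
    cp.Problem(cp.Minimize(obj),
               [cp.sum(C, axis=0) == np.ones(N)]).solve()
    strengths = np.linalg.norm(C.value, axis=1)
    return np.flatnonzero(strengths > 1e-3 * strengths.max())
```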
Column subset selection: theory
• Theorem: Let $H$ be the convex hull of the columns of $Y$, with $k$ vertices. Assume the columns of $Y$ lie in a $(k-1)$-dimensional affine subspace. For $p > 1$, we obtain $k$ representatives, corresponding to the vertices of $H$.
• Theorem: For points lying in a union of independent subspaces ($\dim(\oplus_i S_i) = \sum_i \dim(S_i)$), we obtain at least $\dim(S_i) + 1$ representatives from each $S_i$.
E. Elhamifar, G. Sapiro and R. Vidal, CVPR 2012
Column subset selection
• Given $y_1, \dots, y_N \in \mathbb{R}^n$, select a subset $\{y_{i_1}, \dots, y_{i_k}\}$ that "well represents" the dataset
- What if data not in low-dim subspaces?
- What if there is no feature representation? e.g., a social network graph
- What about summarization between two / multiple sets?
E. Elhamifar, G. Sapiro and R. Vidal, CVPR 2012
Subset selection using dissimilarities
• Given: dissimilarities $d : X \times Y \to \mathbb{R}_{\ge 0}$ between a source set $X = \{x_1, \dots, x_M\}$ and a target set $Y = \{y_1, \dots, y_N\}$
• Goal: select a small subset of $X$ that well represents $Y$ w.r.t. $d(\cdot, \cdot)$
• $d(x_i, y_j) = d_{ij}$: how well $x_i$ represents $y_j$
- $X$ = models, $Y$ = data: $d$ = representation/coding error
- $X$ = data, $Y$ = data: $d$ = Euclidean / geodesic distance
Dissimilarity-based sparse subset selection (DS3)
• Let $D = [d_{ij}]$ and introduce assignment variables $Z = [z_{ij}]$, where $z_{ij} = P(x_i \text{ represents } y_j)$
• To select a few elements of $X$ that well represent $Y$, minimize
1) the encoding cost of $Y$ via representatives: $\sum_{i=1}^{M} \sum_{j=1}^{N} d_{ij} z_{ij} = \operatorname{tr}(D^\top Z)$
2) the number of representatives: $\sum_{i=1}^{M} I(\|Z_{i\cdot}\|_p)$, convexified with $p \in \{2, \infty\}$
• Solve the simultaneous sparse recovery program (convex):
$\min_Z\ \lambda \sum_{i=1}^{M} \|Z_{i\cdot}\|_p + \operatorname{tr}(D^\top Z) \ \text{s.t.}\ Z \ge 0,\ \mathbf{1}^\top Z = \mathbf{1}^\top$
E. Elhamifar, G. Sapiro and R. Vidal, NIPS, 2012; E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015
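A compact cvxpy sketch of the DS3 program with $p = 2$ row norms (the $p = \infty$ variant swaps the norm; the 1e-3 selection threshold is an illustrative choice):

```python
import numpy as np
import cvxpy as cp

def ds3(D, lam):
    """D: M x N dissimilarity matrix. Returns indices of selected sources."""
    M, N = D.shape
    Z = cp.Variable((M, N), nonneg=True)
    row_cost = cp.sum(cp.norm(Z, 2, axis=1))   # relaxed count of used rows
    encoding = cp.sum(cp.multiply(D, Z))       # equals tr(D^T Z)
    cp.Problem(cp.Minimize(lam * row_cost + encoding),
               [cp.sum(Z, axis=0) == np.ones(N)]).solve()
    strengths = np.linalg.norm(Z.value, axis=1)
    return np.flatnonzero(strengths > 1e-3 * strengths.max())
```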
Dissimilarity-based sparse subset selection (DS3)
• Identical source and target
[Figure: representatives for increasing regularization, $\lambda = 0.01\,\lambda_{\max,\infty}$, $0.1\,\lambda_{\max,\infty}$, and $\lambda_{\max,2}$ ($\lambda_1 < \lambda_2 < \lambda_3$): fewer representatives are selected as $\lambda$ grows, and the corresponding $Z$ matrices become increasingly row-sparse]
Theoretical analysis
• Proposition 1: Assume $X$ and $Y$ are identical. If $\lambda \ge \lambda_{\max,p}(D)$, only one representative is selected: $Z = e_\ell \mathbf{1}^\top$, where $\ell = \arg\min_i \mathbf{1}^\top D_{i\cdot}$. If $\lambda \le \lambda_{\min,p}(D)$, each point chooses itself as a representative: $Z = I$.
• We determine $[\lambda_{\min,p}(D),\ \lambda_{\max,p}(D)]$ to set the regularization $\lambda$
E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015
- e.g., $\lambda_{\min,2}(D) = \min_j \big(\min_{i \neq j} d_{ij} - d_{jj}\big)$ and $\lambda_{\max,2}(D) = \max_{i \neq \ell} \dfrac{\sqrt{N}\,\|D_{i\cdot} - D_{\ell\cdot}\|_2^2}{2\,\mathbf{1}^\top (D_{i\cdot} - D_{\ell\cdot})}$
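A numpy sketch of the $\lambda_{\min,2}$ endpoint, which follows directly from the formula above; $\lambda_{\max,2}$ is omitted here because its reconstruction from the slide is less certain and should be checked against the PAMI paper:

```python
import numpy as np

def lambda_min_2(D):
    """Slide's lambda_min for p = 2: min over j of (min_{i != j} d_ij - d_jj)."""
    N = D.shape[0]
    off = D + np.where(np.eye(N, dtype=bool), np.inf, 0.0)  # mask the diagonal
    return float(np.min(off.min(axis=0) - np.diag(D)))
```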
Theoretical analysis
• Proposition 2: Assume $X$ and $Y$ are identical, and that the points partition into $L$ groups $G_1, \dots, G_L$. If $\lambda \le \lambda_g(D)$, the optimal $Z$ is such that
- (1) each group has at least one representative;
- (2) points in each group select representatives from that group only.
- Here $\lambda_g(D) = \min_k \min_{j \in G_k} \big(\min_{k' \neq k} \min_{i \in G_{k'}} d_{ij} - d_{c_k j}\big)$, where $x_{c_k}$ denotes the medoid of group $G_k$.
[Figure: two groups $G_1, G_2$ with points $x_i, x_j$ and medoid $x_{c_1}$]
E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015
DS3 applications: Learning nonlinear models
• Nonlinear dynamical systems as switched linear models
- Human gaits / activities, motor control systems, ...
• Learning switched dynamical models: Non-convex & NP-hard
E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015; E. Elhamifar, S. Burden and S. Sastry, IFAC, 2014
- Source $X = \{\Sigma_1, \dots, \Sigma_M\}$ = ensemble of candidate submodels; target $Y = \{(u(1), y(1)), \dots, (u(N), y(N))\}$ = input/output data
[Figure: switched linear system with input $u(t)$ and output $y(t)$, switching among submodels $\Sigma_1, \dots, \Sigma_s$]
• Our convex solution: select the few submodels that best represent the data via DS3
DS3 applications: Learning nonlinear models
• Experiments on segmentation of CMU motion capture data
- Discrete-time SS model via subspace ID, snippets of length 100
- $d_{ij}$ = Euclidean norm of the representation error of the $j$-th snippet via the $i$-th estimated submodel
Sequence number        1      2      3      4      5      6      7      8      9     10     11
# activities           4      8      7      7      7     10      6      9      4      4      7
SC error (%)       23.86  30.61  19.02  40.60  26.43  47.77  14.85  38.09   9.02   8.31   3.47
SBiC error (%)     22.77  22.08  18.94  28.40  29.85  30.96  30.50  24.78  13.03  12.68  23.68
Kmedoids error (%) 18.26  46.26  49.89  51.99  37.07  54.75  29.81  49.53   9.71  33.50  33.80
AP error (%)       22.93  41.22  49.66  54.56  37.87  50.19  37.84  48.37   9.71  26.05  23.84
DS3 error (%)       5.33   9.90  12.27  19.64  16.55  14.66  12.56  11.73  11.18   3.32   6.18
E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015
Dealing with outliers via DS3
• Add outlier representative node to source
• Solve the optimization
E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015
- $e_j = P(\text{outlier node represents } y_j)$
$\min_{Z, e}\ \lambda \sum_{i=1}^{M} \|Z_{i\cdot}\|_p + \operatorname{tr}\!\left(\begin{bmatrix} D \\ d_o^\top \end{bmatrix}^\top \begin{bmatrix} Z \\ e^\top \end{bmatrix}\right) \ \text{s.t.}\ \mathbf{1}^\top \begin{bmatrix} Z \\ e^\top \end{bmatrix} = \mathbf{1}^\top,\ \begin{bmatrix} Z \\ e^\top \end{bmatrix} \ge 0$
[Figure: bipartite source-target graph with an added outlier-representative node; left: source set with selected representatives, right: target set with detected outliers]
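A sketch of the augmentation: stack the outlier costs $d_o$ as an extra row of $D$ and solve the same DS3 program, leaving the outlier row out of the row-sparsity penalty; targets whose assignment mass falls on that row are flagged as outliers. The 0.5 threshold is an illustrative choice:

```python
import numpy as np
import cvxpy as cp

def ds3_with_outliers(D, d_out, lam):
    """D: M x N dissimilarities; d_out: length-N outlier costs
    (e.g. the slide's w_j = lam * exp(-min_i d_ij / tau))."""
    D_aug = np.vstack([D, d_out])             # append the outlier node's row
    M1, N = D_aug.shape
    Z = cp.Variable((M1, N), nonneg=True)
    row_cost = cp.sum(cp.norm(Z[:-1, :], 2, axis=1))  # outlier row not penalized
    encoding = cp.sum(cp.multiply(D_aug, Z))
    cp.Problem(cp.Minimize(lam * row_cost + encoding),
               [cp.sum(Z, axis=0) == np.ones(N)]).solve()
    e = Z.value[-1, :]                        # mass assigned to the outlier node
    return np.flatnonzero(e > 0.5)            # indices of detected outliers
```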
Dealing with outliers via DS3: experiment
• Exclude one activity when estimating LDS ensemble
• Set outlier node weights $w_j = \lambda\, e^{-\min_i d_{ij}/\tau}$
[Figure: ROC curves (true positive rate vs. false positive rate) for detecting the excluded activity (Walk, Jump, Punch), shown for $\tau = 0.1$ and $\tau = 1.0$; marked operating points near (0.91, 0.06) and (0.91, 0.01)]
E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015
DS3 applications: Active learning
• Successively annotate the most informative unlabeled samples
• For $X = Y = \{$unlabeled samples$\}$, solve the weighted program
$\min_Z\ \lambda \|WZ\|_{1,p} + \operatorname{tr}(D^\top Z) \ \text{s.t.}\ Z \ge 0,\ \mathbf{1}^\top Z = \mathbf{1}^\top$, with $W \triangleq \operatorname{diag}(w_1, w_2, \dots)$
• Confidence weights combine classifier uncertainty and sample diversity:
$w_i \triangleq \min\left\{\beta - (\beta - 1)\dfrac{E(p_i)}{\log_2 L},\ \beta - (\beta - 1)\dfrac{\min_{j \in \mathcal{L}} d_{ji}}{\max_{k \in \mathcal{U}} \min_{j \in \mathcal{L}} d_{jk}}\right\}$
(first term: classifier uncertainty confidence; second term: sample diversity confidence; $\mathcal{L}$ = labeled set, $\mathcal{U}$ = unlabeled set)
[Figure: evolving groups $G^{(1)}_1, G^{(1)}_2, G^{(1)}_3$ and $G^{(2)}_1, G^{(2)}_2$ across annotation rounds]
E. Elhamifar, G. Sapiro, A. Yang and S. Sastry, ICCV, 2013
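A numpy sketch of the confidence weights $w_i$; the inputs (classifier posteriors, the dissimilarity matrix, labeled/unlabeled index sets, and the knob $\beta > 1$) are assumed, as the slide does not fix them:

```python
import numpy as np

def confidence_weights(probs, D, labeled, unlabeled, beta=10.0):
    """probs: posterior matrix (num_unlabeled x L) for the unlabeled samples;
    D: pairwise dissimilarities over all samples; beta > 1 is an assumed knob."""
    L = probs.shape[1]
    entropy = -np.sum(probs * np.log2(probs + 1e-12), axis=1)   # E(p_i)
    w_unc = beta - (beta - 1.0) * entropy / np.log2(L)          # uncertainty term
    d_to_labeled = D[np.ix_(labeled, unlabeled)].min(axis=0)    # min_{j in L} d_ji
    w_div = beta - (beta - 1.0) * d_to_labeled / d_to_labeled.max()
    return np.minimum(w_unc, w_div)                             # elementwise min
```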
DS3 applications: Active learning
• Pedestrian vs non-pedestrian
- classifier: SVM
- dissimilarity: $\chi^2$ distances between HOG features
[Figure: test-set accuracy vs. labeled training set size (up to 300); the proposed method outperforms CPAL, CCAL, and RAND; higher is better]
• Face recognition
- classifier: SRC
- dissimilarity: Euclidean dist.
[Figure: test-set accuracy vs. labeled training set size (300 to 700); the proposed method outperforms CPAL, CCAL, and RAND; higher is better]
E. Elhamifar, G. Sapiro, A. Yang and S. Sastry, ICCV, 2013
Conclusions
• Studied the problem of subset selection
- Feature-space representations
- Pairwise dissimilarities
• Proposed convex programs using simultaneous sparse recovery
- Extended to deal with outliers
• Proved the solution recovers representatives from each group and correctly clusters data
• Addressed several problems effectively
- Active learning
- Learning nonlinear dynamical models
- Segmentation of time-series data
Thanks!
Shankar Sastry (UCB) Emmanuel Candes (Stanford) Allen Yang (UCB) Mahdi Soltanolkotabi (USC)
Collaborators:Guillermo Sapiro (Duke) Rene Vidal (JHU) Sam Burden (UCB)
Codes: http://www.eecs.berkeley.edu/~ehsan.elhamifar/code.htm