Sparse and Low-Rank Modeling for High-Dimensional Data Analysis
Ehsan Elhamifar, Rene Vidal, John Wright, Guillermo Sapiro
CVPR 2015 Tutorial Boston, MA
High-dimensional data deluge
72 hours of new video per minute; 300 million new photos per day
Low-dimensional structures
• Intrinsic structures are low-dimensional
Basri-Jacobs’03, Tomasi-Kanade’92, Deerwester et al ’90, Goldberg-Nichols-Oki-Terry ’92
High-dimensional data analysis
• Clustering
• Embedding ($\mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$)
• Classification
• Subset selection
Challenges
• Clustering and subset selection: Non-convex and NP-hard
• Real data are often corrupted
• Little prior knowledge about low-dim structures
• Points in different groups can be very close
- Extended YaleB dataset (38 subjects, 64 images each); $K$ nearest neighbors:
  K = 1: 6%;  K = 2: 14%;  K = 3: 23%;  K = 4: 31%
This tutorial
Tools:
- Sparse & low-rank representation
- High-dimensional statistics & geometry
- Convex programming & analysis
Efficient, robust, and provably correct algorithms for (1) clustering and subset selection, and (2) classification and dimension reduction.
This tutorial
1) Clustering, subset selection: algorithms, theory, applications (Ehsan Elhamifar, Rene Vidal)
(Coffee break: 3:30pm - 4:15pm)
2) Robust PCA, learning low-rank transformations: algorithms, theory, applications (John Wright, Guillermo Sapiro)
Sparse Subspace Clustering Ehsan Elhamifar
Subspace clustering problem
• Given points $\{y_1, \dots, y_N\}$ in $\mathbb{R}^n$ lying in a union of subspaces $S_1 \cup \cdots \cup S_L$, find
- Basis for each subspace
- Clustering of the data
• Challenging for multiple subspaces
- Do not know subspace bases
- Do not know memberships of points
- Corruption by noise, missing entries, outliers, ...
[Figure: points in three subspaces $S_1, S_2, S_3$ of $\mathbb{R}^n$]
Tomasi-Kanade’92, Tipping-Bishop’99, Tseng’00, Kanatani’01, Vidal-Ma-Sastry’05, Yan-Pollefeys’06, Chen-Lerman’09
• Possible approach: Expectation-Maximization; issue: local minima
Spectral clustering-based approach
• Spectral Clustering
- Represent points as graph nodes
- Connect nodes $i$ and $j$ with weight $c_{ij}$
- Infer clusters from graph Laplacian
• Good similarity for subspaces?
- Points in the same subspace: $c_{ij} \neq 0$
- Points in different subspaces: $c_{ij} = 0$
- Nearest neighbors
Fiedler’73, Shi-Malik’01, Belkin-Niyogi’01, Ng-Weiss-Jordan’01, Xing-Jordan’03, Von Luxburg’07
Subspace clustering: idea
• Self-Expressiveness Property (SEP): $y_i = Y c_i$, where $Y = [Y_1 \; Y_2 \; \cdots \; Y_L]\,\Gamma$ (for an unknown permutation $\Gamma$) has low column-rank
- many solutions
• In $S_\ell$ of dimension $d_\ell$, a point can be reconstructed from $d_\ell$ other points
• $\ell_0$ (number of nonzero elements):
$\min \|c_i\|_0 \ \text{s.t.}\ y_i = Y c_i,\ c_{ii} = 0$   (NP-hard)
[Figure: point $y_i$ in $S_1$, with subspaces $S_2, S_3$]
Subspace clustering: idea
• Self-Expressiveness Property (SEP): $y_i = Y c_i$, where $Y = [Y_1 \; Y_2 \; \cdots \; Y_L]\,\Gamma$ has low column-rank
- many solutions
• In $S_\ell$ of dimension $d_\ell$, a point can be reconstructed from $d_\ell$ other points
• $\ell_1$ (sum of absolute values of elements):
$\min \|c_i\|_1 \ \text{s.t.}\ y_i = Y c_i,\ c_{ii} = 0$   (Convex)
Sparse subspace clustering (SSC)
• Step 1: Solve the sparse optimization
$c_i^* = \arg\min \|c_i\|_1 \ \text{s.t.}\ y_i = Y c_i,\ c_{ii} = 0$, yielding $c_i^* = [c_{i1}^*, \dots, c_{iN}^*]^\top$
• Step 2: Infer the clustering from the similarity graph via spectral clustering
E. Elhamifar and R. Vidal, CVPR 2009; E. Elhamifar and R. Vidal, IEEE Trans. PAMI, 2013
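A minimal Python sketch of these two steps (not the authors' released code; see the URL at the end of this transcript for that). It assumes cvxpy and scikit-learn are available, solves the $\ell_1$ program once per point, and feeds the symmetrized coefficient magnitudes to spectral clustering:

```python
import numpy as np
import cvxpy as cp
from sklearn.cluster import SpectralClustering

def sparse_subspace_clustering(Y, n_clusters):
    """Y: n x N data matrix whose columns are the points."""
    n, N = Y.shape
    C = np.zeros((N, N))
    for i in range(N):
        c = cp.Variable(N)
        # min ||c||_1  s.t.  y_i = Y c,  c_ii = 0
        prob = cp.Problem(cp.Minimize(cp.norm1(c)),
                          [Y @ c == Y[:, i], c[i] == 0])
        prob.solve()
        C[:, i] = c.value
    W = np.abs(C) + np.abs(C).T  # symmetrized similarity graph
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(W)

# Toy example: two lines through the origin in R^2, 20 points each
rng = np.random.default_rng(0)
Y = np.hstack([np.outer([1.0, 0.0], rng.standard_normal(20)),
               np.outer([1.0, 1.0], rng.standard_normal(20)) / np.sqrt(2)])
print(sparse_subspace_clustering(Y, n_clusters=2))
```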
L1 graph vs k-NN graph
• Conventional graph clustering
- 1) build a k-NN graph
- 2) learn edge weights
- 3) partition the graph
• SSC algorithm
- 1) learn graph & weights
- 2) partition the graph
[Figure: k-NN graph around $y_i$ with Gaussian weights $c_{ij} = e^{-\|y_i - y_j\|_2^2 / (2\sigma^2)}$, vs. the sparse coefficients SSC learns, e.g. $c_i^* = [0,\ 0.8,\ 0,\ \dots,\ 0.3,\ 0]^\top$]
SSC automatically selects the right number of neighbors! SSC can deal with subspaces of different dimensions!
E. Elhamifar and R. Vidal, CVPR 2009; E. Elhamifar and R. Vidal, IEEE Trans. PAMI, 2013
Theoretical analysis
• When does SSC succeed?
- $\ell_1$ selects points from the correct subspace: no false discovery
• More challenging than conventional sparse recovery
- Sparse representation from the correct subspace
- Sparse representation not unique
Donoho-Elad’03, La-Do’05, Candes-Romberg-Tao’06, Tropp-Gilbert’07, Wright-Ma’10, Sankaranarayanan-Turaga-Baraniuk-Chellappa’10
Geometry-based theoretical guarantees
• Theorem: SSC has zero false discovery for any $y \in S_i$ if
$\max_{j \neq i} \cos(\theta_{ij}) < \max_{\operatorname{rank}(Y_i^0) = d_i} \sigma_{d_i}(Y_i^0) / \sqrt{d_i}$,
where the maximum on the right is over full-rank submatrices $Y_i^0$ of $d_i$ points from $S_i$.
E. Elhamifar and R. Vidal, ICASSP 2010; E. Elhamifar and R. Vidal, IEEE Trans. PAMI, 2013
[Figure: geometry of the guarantee: subspaces $S_i, S_j, S_\ell$, a point $x$, and the polytopes $P_i$, $P_{-i}$; left panel: $\ell_1$ succeeds, right panel: $\ell_1$ fails]
Geometry-based theoretical guarantees
• Theorem (repeated): SSC has zero false discovery for any $y \in S_i$ if $\max_{j \neq i} \cos(\theta_{ij}) < \max_{\operatorname{rank}(Y_i^0) = d_i} \sigma_{d_i}(Y_i^0) / \sqrt{d_i}$
E. Elhamifar and R. Vidal, ICASSP 2010; E. Elhamifar and R. Vidal, IEEE Trans. PAMI, 2013
• No need to have many points; need a few, well-distributed ones!
[Figure: same geometric picture as above]
Clustering noisy data
• All points are contaminated by noise: $\tilde{Y} = Y + Z$, with $z_{ij} \sim \mathcal{N}(0, \sigma^2/n)$ i.i.d.
• Self-expressiveness $y_i = Y c_i$ implies $\tilde{y}_i = \tilde{Y} c_i + (z_i - Z c_i)$: a sparse representation plus a perturbation. Is the representation still sparse after corruption?
• Solve the Lasso: $\min\ \lambda \|c_i\|_1 + \tfrac{1}{2}\|\tilde{y}_i - \tilde{Y} c_i\|_2^2 \ \text{s.t.}\ c_{ii} = 0$
M. Soltanolkotabi, E. Elhamifar and E. Candes, Annals of Stats, 2014; E. Elhamifar, M. Soltanolkotabi, S. Sastry, IEEE TSP, 2015
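The Lasso variant simply swaps the exact equality constraint for a quadratic fidelity term. A cvxpy sketch of the column-by-column solve (the value lam=0.1 is illustrative, not from the slides):

```python
import numpy as np
import cvxpy as cp

def ssc_lasso_coefficients(Y_tilde, lam=0.1):
    """Solve the Lasso program per column of the noisy data Y_tilde (n x N)."""
    n, N = Y_tilde.shape
    C = np.zeros((N, N))
    for i in range(N):
        c = cp.Variable(N)
        # min lam * ||c||_1 + 0.5 * ||y_i - Y c||_2^2  s.t.  c_ii = 0
        obj = lam * cp.norm1(c) + 0.5 * cp.sum_squares(Y_tilde[:, i] - Y_tilde @ c)
        cp.Problem(cp.Minimize(obj), [c[i] == 0]).solve()
        C[:, i] = c.value
    return C
```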
Robust SSC
• Algorithm: two-step approach
1) $\beta_i^* = \arg\min_{\beta_i} \|\beta_i\|_1 \ \text{s.t.}\ \|\tilde{y}_i - \tilde{Y}\beta_i\|_2 \le 2\delta,\ \beta_{ii} = 0$
2) $c_i^* = \arg\min_{c_i} \lambda \|c_i\|_1 + \tfrac{1}{2}\|\tilde{y}_i - \tilde{Y} c_i\|_2^2 \ \text{s.t.}\ c_{ii} = 0$, with the data-dependent weight $\lambda = \tfrac{1}{4\|\beta_i^*\|_1}$
• Theorem: Assume the noise-free data are drawn uniformly at random from the intersection of each subspace with the unit hypersphere. Apply the two-step procedure to $\tilde{y} \in S_i$. Under some assumptions, if
$\max_{j \neq i} \sqrt{\operatorname{Ave}(\cos^2 \theta_{ij})} \lesssim (\log N)^{-1}$,
then with high probability: a) no false discovery, and b) about subspace-dimension-many nonzeros.
M. Soltanolkotabi, E. Elhamifar and E. Candes, Annals of Stats, 2014; E. Elhamifar, M. Soltanolkotabi, S. Sastry, IEEE TSP, 2015
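A sketch of the two-step procedure for a single point, using the slide's data-dependent weight $\lambda = 1/(4\|\beta_i^*\|_1)$; the noise level delta is an assumed input that the theory ties to $\sigma$:

```python
import numpy as np
import cvxpy as cp

def robust_ssc_coefficients(Y_tilde, i, delta):
    """Two-step sketch for point i of Y_tilde (n x N); delta is an assumed noise bound."""
    n, N = Y_tilde.shape
    # Step 1: l1 minimization under a noise-aware feasibility constraint
    b = cp.Variable(N)
    cp.Problem(cp.Minimize(cp.norm1(b)),
               [cp.norm(Y_tilde[:, i] - Y_tilde @ b, 2) <= 2 * delta,
                b[i] == 0]).solve()
    # Data-dependent regularization weight from the slide
    lam = 1.0 / (4.0 * np.linalg.norm(b.value, 1))
    # Step 2: Lasso with the data-dependent lambda
    c = cp.Variable(N)
    cp.Problem(cp.Minimize(lam * cp.norm1(c)
                           + 0.5 * cp.sum_squares(Y_tilde[:, i] - Y_tilde @ c)),
               [c[i] == 0]).solve()
    return c.value
```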
Application: motion segmentation
• Given feature trajectories of multiple rigid motions
• Find segmentation into underlying motions
Tomasi-Kanade’92, Sugaya-Kanatani’04, Vidal-Ma-Sastry’05, Elhamifar-Vidal’09, Chen-Lerman’09, Zhao-Medioni’11
Experiments: motion segmentation
• Hopkins 155 dataset
- 155 sequences
- 2 and 3 motions
• Clustering errors
Algorithm           RANSAC   GPCA    MSL    LSA    SCC    LRR   LRSC    SSC
2 motions, mean       5.56   4.59   4.14   4.23   2.89   4.10   3.69   1.52
2 motions, median     1.18   0.38   0.00   0.56   0.00   0.22   0.29   0.00
3 motions, mean      22.94  28.66   8.23   7.02   8.25   9.89   7.69   4.40
3 motions, median    22.03  28.26   1.76   1.45   0.24   6.22   3.80   0.56
State of the art (-) vs. SSC (+):
- nonconvex, local minima vs. convex, provable
- k-NN based vs. automatic neighbor selection
- sensitive to noise vs. robust to noise
- exponential complexity vs. computationally efficient
E. Elhamifar and R. Vidal, IEEE Trans. PAMI, 2013
Application: face clustering
• Corruption by sparse errors
• SSC error on Ext YaleB faces
• Model corruption by sparse errors, $\tilde{y}_i = y_i + e_i$, and solve
$\min\ \lambda \|c_i\|_1 + \|\tilde{y}_i - \tilde{Y} c_i\|_1 \ \text{s.t.}\ c_{ii} = 0$
E. Elhamifar and R. Vidal, IEEE Trans. PAMI, 2013
- SSC error < 2.0% for 2 subjects, < 11.0% for 10 subjects
[Figures: (a) sparse-error distribution, Laplace vs. Gaussian; (b) clustering error (%) vs. number of subjects (2 to 10) on $D = 2{,}016$-dimensional face data, comparing SSC, LRSC, LRR-H, LRR, SCC, and LSA]
Other extensions of SSC
• Extension to clustering and DR of nonlinear manifolds [Elhamifar-Vidal NIPS’11]
• Scaling to large datasets
- Greedy algorithm, theory [Dyer-Sankaranarayanan-Baraniuk JMLR’13]
- Sampling + a more compact dictionary [Peng-Zhang-Yi CVPR’13]
• Dealing with sequential and spatial data [Tierney-Gao-Guo CVPR’13, Pham et al CVPR’12]
• Enforcing block-diagonal structure on the Laplacian / adjacency matrix [Feng-Lin et al CVPR’14]
• Connectivity of SSC graph [Nasihatkon-Hartley CVPR’11]
Conclusions
• Addressed clustering of data lying in multiple subspaces
• Proposed an efficient algorithm based on sparse modeling
- Proved theoretical guarantees of the algorithm
- Extended to deal with corrupted data
- Resolved challenges of the state of the art
- Showed it performs well in real-world problems
Sparse Subset Selection Ehsan Elhamifar
Finding representatives
• A subset of data / models, efficiently representing the entire set
- Summarize and visualize images/videos/text/web datasets
- Improve computational time and memory
- Describe (complex) nonlinear models
[Figure: switched linear system with input $u(t)$ and output $y(t)$, switching among submodels $\Sigma_1, \dots, \Sigma_s$]
Column subset selection
• Given $y_1, y_2, \dots, y_N \in \mathbb{R}^n$, select a subset $\{y_{i_1}, \dots, y_{i_k}\}$ that "well represents" the dataset
[Figure: $Y \approx YC$ with a row-sparse $C$; the few nonzero rows pick out representatives $y_{i_1}, y_{i_2}, y_{i_3}$]
E. Elhamifar, G. Sapiro and R. Vidal, CVPR 2012
$\arg\min_C\ \lambda \sum_{i=1}^{N} \|C_{i\cdot}\|_p + \tfrac{1}{2}\|Y - YC\|_F^2 \ \text{s.t.}\ \mathbf{1}^\top C = \mathbf{1}^\top$
• Contrast with SSC's $\ell_1$ graph, $\arg\min\ \lambda \|C\|_1 + \tfrac{1}{2}\|Y - YC\|_F^2$: entrywise sparsity connects each point to a few others, whereas row sparsity forces all points to share a few common representatives.
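A cvxpy sketch of the row-sparsity program with $p = 2$ (the threshold and lam are illustrative choices; the released code linked at the end is the reference implementation):

```python
import numpy as np
import cvxpy as cp

def find_representatives(Y, lam=1.0):
    """min lam * sum_i ||C_i.||_2 + 0.5 * ||Y - YC||_F^2  s.t.  1^T C = 1^T.
    Returns indices of nonzero rows of C, i.e. the representatives."""
    n, N = Y.shape
    C = cp.Variable((N, N))
    obj = (lam * cp.sum(cp.norm(C, 2, axis=1))        # row-sparsity penalty
           + 0.5 * cp.sum_squares(Y - Y @ C))         # reconstruction error
    cp.Problem(cp.Minimize(obj),
               [cp.sum(C, axis=0) == np.ones(N)]).solve()
    strengths = np.linalg.norm(C.value, axis=1)
    return np.flatnonzero(strengths > 1e-3 * strengths.max())
```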
Column subset selection: theory
• Theorem: Let $H$ be the convex hull of the columns of $Y$, with $k$ vertices. Assume the columns of $Y$ lie in a $(k-1)$-dimensional affine subspace. For $p > 1$, we obtain $k$ representatives, corresponding to the vertices of $H$.
• Theorem: For points lying in a union of independent subspaces ($\dim(\oplus_i S_i) = \sum_i \dim(S_i)$), we obtain at least $\dim(S_i) + 1$ representatives from each $S_i$.
E. Elhamifar, G. Sapiro and R. Vidal, CVPR 2012
Column subset selection
• Given $y_1, \dots, y_N \in \mathbb{R}^n$, select a subset $\{y_{i_1}, \dots, y_{i_k}\}$ that "well represents" the dataset
- What if data not in low-dim subspaces?
- What if there is no feature representation? e.g., a social network graph
- What about summarization between two / multiple sets?
E. Elhamifar, G. Sapiro and R. Vidal, CVPR 2012
Subset selection using dissimilarities
• Given: dissimilarities $d : X \times Y \to \mathbb{R}_{\ge 0}$ between a source set $X = \{x_1, \dots, x_M\}$ and a target set $Y = \{y_1, \dots, y_N\}$
• Goal: select a small subset of $X$ that well represents $Y$ w.r.t. $d(\cdot, \cdot)$
• $d(x_i, y_j) = d_{ij}$: how well $x_i$ represents $y_j$
- $X$ = models, $Y$ = data: $d$ = representation/coding error
- $X$ = data, $Y$ = data: $d$ = Euclidean / geodesic distance
Dissimilarity-based sparse subset selection (DS3)
• Let $D = [d_{ij}]$ and introduce assignment variables $Z = [z_{ij}]$, where $z_{ij} = P(x_i \text{ represents } y_j)$
• To select a few elements of $X$ that well represent $Y$, minimize
1) the encoding cost of $Y$ via representatives: $\sum_{i=1}^{M} \sum_{j=1}^{N} d_{ij} z_{ij} = \operatorname{tr}(D^\top Z)$
2) the number of representatives: $\sum_{i=1}^{M} I(\|Z_{i\cdot}\|_p)$, convexified with $p \in \{2, \infty\}$
• Solve the simultaneous sparse recovery program (convex):
$\min_Z\ \lambda \sum_{i=1}^{M} \|Z_{i\cdot}\|_p + \operatorname{tr}(D^\top Z) \ \text{s.t.}\ Z \ge 0,\ \mathbf{1}^\top Z = \mathbf{1}^\top$
E. Elhamifar, G. Sapiro and R. Vidal, NIPS, 2012; E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015
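A compact cvxpy sketch of the DS3 program with $p = 2$ row norms (the $p = \infty$ variant swaps the norm; the 1e-3 selection threshold is an illustrative choice):

```python
import numpy as np
import cvxpy as cp

def ds3(D, lam):
    """D: M x N dissimilarity matrix. Returns indices of selected sources."""
    M, N = D.shape
    Z = cp.Variable((M, N), nonneg=True)
    row_cost = cp.sum(cp.norm(Z, 2, axis=1))   # relaxed count of used rows
    encoding = cp.sum(cp.multiply(D, Z))       # equals tr(D^T Z)
    cp.Problem(cp.Minimize(lam * row_cost + encoding),
               [cp.sum(Z, axis=0) == np.ones(N)]).solve()
    strengths = np.linalg.norm(Z.value, axis=1)
    return np.flatnonzero(strengths > 1e-3 * strengths.max())
```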
Dissimilarity-based sparse subset selection (DS3)
• Identical source and target
[Figure: representatives for increasing regularization, $\lambda = 0.01\,\lambda_{\max,\infty}$, $0.1\,\lambda_{\max,\infty}$, and $\lambda_{\max,2}$ ($\lambda_1 < \lambda_2 < \lambda_3$): fewer representatives are selected as $\lambda$ grows, and the corresponding $Z$ matrices become increasingly row-sparse]
Theoretical analysis
• Proposition 1: Assume $X$ and $Y$ are identical. If $\lambda \ge \lambda_{\max,p}(D)$, only one representative is selected: $Z = e_\ell \mathbf{1}^\top$, where $\ell = \arg\min_i \mathbf{1}^\top D_{i\cdot}$. If $\lambda \le \lambda_{\min,p}(D)$, each point chooses itself as a representative: $Z = I$.
• We determine $[\lambda_{\min,p}(D),\ \lambda_{\max,p}(D)]$ to set the regularization $\lambda$
E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015
- e.g., $\lambda_{\min,2}(D) = \min_j \big(\min_{i \neq j} d_{ij} - d_{jj}\big)$ and $\lambda_{\max,2}(D) = \max_{i \neq \ell} \dfrac{\sqrt{N}\,\|D_{i\cdot} - D_{\ell\cdot}\|_2^2}{2\,\mathbf{1}^\top (D_{i\cdot} - D_{\ell\cdot})}$
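A numpy sketch of the $\lambda_{\min,2}$ endpoint, which follows directly from the formula above; $\lambda_{\max,2}$ is omitted here because its reconstruction from the slide is less certain and should be checked against the PAMI paper:

```python
import numpy as np

def lambda_min_2(D):
    """Slide's lambda_min for p = 2: min over j of (min_{i != j} d_ij - d_jj)."""
    N = D.shape[0]
    off = D + np.where(np.eye(N, dtype=bool), np.inf, 0.0)  # mask the diagonal
    return float(np.min(off.min(axis=0) - np.diag(D)))
```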
Theoretical analysis
• Proposition 2: Assume $X$ and $Y$ are identical, and that the points partition into $L$ groups $G_1, \dots, G_L$. If $\lambda \le \lambda_g(D)$, the optimal $Z$ is such that
- (1) each group has at least one representative;
- (2) points in each group select representatives from that group only.
- Here $\lambda_g(D) = \min_k \min_{j \in G_k} \big(\min_{k' \neq k} \min_{i \in G_{k'}} d_{ij} - d_{c_k j}\big)$, where $x_{c_k}$ denotes the medoid of group $G_k$.
[Figure: two groups $G_1, G_2$ with points $x_i, x_j$ and medoid $x_{c_1}$]
E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015
DS3 applications: Learning nonlinear models
• Nonlinear dynamical systems as switched linear models
- Human gaits / activities, motor control systems, ...
• Learning switched dynamical models: Non-convex & NP-hard
E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015; E. Elhamifar, S. Burden and S. Sastry, IFAC, 2014
- Source $X = \{\Sigma_1, \dots, \Sigma_M\}$ = ensemble of candidate submodels; target $Y = \{(u(1), y(1)), \dots, (u(N), y(N))\}$ = input/output data
[Figure: switched linear system with input $u(t)$ and output $y(t)$, switching among submodels $\Sigma_1, \dots, \Sigma_s$]
• Our convex solution: select the few submodels that best represent the data via DS3
DS3 applications: Learning nonlinear models
• Experiments on segmentation of CMU motion capture data
- Discrete-time SS model via subspace ID, snippets of length 100
- $d_{ij}$ = Euclidean norm of the representation error of the $j$-th snippet via the $i$-th estimated submodel
Sequence number        1      2      3      4      5      6      7      8      9     10     11
# activities           4      8      7      7      7     10      6      9      4      4      7
SC error (%)       23.86  30.61  19.02  40.60  26.43  47.77  14.85  38.09   9.02   8.31   3.47
SBiC error (%)     22.77  22.08  18.94  28.40  29.85  30.96  30.50  24.78  13.03  12.68  23.68
Kmedoids error (%) 18.26  46.26  49.89  51.99  37.07  54.75  29.81  49.53   9.71  33.50  33.80
AP error (%)       22.93  41.22  49.66  54.56  37.87  50.19  37.84  48.37   9.71  26.05  23.84
DS3 error (%)       5.33   9.90  12.27  19.64  16.55  14.66  12.56  11.73  11.18   3.32   6.18
E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015
Dealing with outliers via DS3
• Add outlier representative node to source
• Solve the optimization
E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015
- $e_j = P(\text{outlier node represents } y_j)$
$\min_{Z, e}\ \lambda \sum_{i=1}^{M} \|Z_{i\cdot}\|_p + \operatorname{tr}\!\left(\begin{bmatrix} D \\ d_o^\top \end{bmatrix}^\top \begin{bmatrix} Z \\ e^\top \end{bmatrix}\right) \ \text{s.t.}\ \mathbf{1}^\top \begin{bmatrix} Z \\ e^\top \end{bmatrix} = \mathbf{1}^\top,\ \begin{bmatrix} Z \\ e^\top \end{bmatrix} \ge 0$
[Figure: bipartite source-target graph with an added outlier-representative node; left: source set with selected representatives, right: target set with detected outliers]
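A sketch of the augmentation: stack the outlier costs $d_o$ as an extra row of $D$ and solve the same DS3 program, leaving the outlier row out of the row-sparsity penalty; targets whose assignment mass falls on that row are flagged as outliers. The 0.5 threshold is an illustrative choice:

```python
import numpy as np
import cvxpy as cp

def ds3_with_outliers(D, d_out, lam):
    """D: M x N dissimilarities; d_out: length-N outlier costs
    (e.g. the slide's w_j = lam * exp(-min_i d_ij / tau))."""
    D_aug = np.vstack([D, d_out])             # append the outlier node's row
    M1, N = D_aug.shape
    Z = cp.Variable((M1, N), nonneg=True)
    row_cost = cp.sum(cp.norm(Z[:-1, :], 2, axis=1))  # outlier row not penalized
    encoding = cp.sum(cp.multiply(D_aug, Z))
    cp.Problem(cp.Minimize(lam * row_cost + encoding),
               [cp.sum(Z, axis=0) == np.ones(N)]).solve()
    e = Z.value[-1, :]                        # mass assigned to the outlier node
    return np.flatnonzero(e > 0.5)            # indices of detected outliers
```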
Dealing with outliers via DS3: experiment
• Exclude one activity when estimating LDS ensemble
• Set outlier node weights $w_j = \lambda\, e^{-\min_i d_{ij}/\tau}$
[Figure: ROC curves (true positive rate vs. false positive rate) for detecting the excluded activity (Walk, Jump, Punch), shown for $\tau = 0.1$ and $\tau = 1.0$; marked operating points near (0.91, 0.06) and (0.91, 0.01)]
E. Elhamifar, G. Sapiro and S. Sastry, IEEE Trans. PAMI, 2015
DS3 applications: Active learning
• Successively annotate the most informative unlabeled samples
• For $X = Y = \{$unlabeled samples$\}$, solve the weighted program
$\min_Z\ \lambda \|WZ\|_{1,p} + \operatorname{tr}(D^\top Z) \ \text{s.t.}\ Z \ge 0,\ \mathbf{1}^\top Z = \mathbf{1}^\top$, with $W \triangleq \operatorname{diag}(w_1, w_2, \dots)$
• Confidence weights combine classifier uncertainty and sample diversity:
$w_i \triangleq \min\left\{\beta - (\beta - 1)\dfrac{E(p_i)}{\log_2 L},\ \beta - (\beta - 1)\dfrac{\min_{j \in \mathcal{L}} d_{ji}}{\max_{k \in \mathcal{U}} \min_{j \in \mathcal{L}} d_{jk}}\right\}$
(first term: classifier uncertainty confidence; second term: sample diversity confidence; $\mathcal{L}$ = labeled set, $\mathcal{U}$ = unlabeled set)
[Figure: evolving groups $G^{(1)}_1, G^{(1)}_2, G^{(1)}_3$ and $G^{(2)}_1, G^{(2)}_2$ across annotation rounds]
E. Elhamifar, G. Sapiro, A. Yang and S. Sastry, ICCV, 2013
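A numpy sketch of the confidence weights $w_i$; the inputs (classifier posteriors, the dissimilarity matrix, labeled/unlabeled index sets, and the knob $\beta > 1$) are assumed, as the slide does not fix them:

```python
import numpy as np

def confidence_weights(probs, D, labeled, unlabeled, beta=10.0):
    """probs: posterior matrix (num_unlabeled x L) for the unlabeled samples;
    D: pairwise dissimilarities over all samples; beta > 1 is an assumed knob."""
    L = probs.shape[1]
    entropy = -np.sum(probs * np.log2(probs + 1e-12), axis=1)   # E(p_i)
    w_unc = beta - (beta - 1.0) * entropy / np.log2(L)          # uncertainty term
    d_to_labeled = D[np.ix_(labeled, unlabeled)].min(axis=0)    # min_{j in L} d_ji
    w_div = beta - (beta - 1.0) * d_to_labeled / d_to_labeled.max()
    return np.minimum(w_unc, w_div)                             # elementwise min
```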
DS3 applications: Active learning
• Pedestrian vs non-pedestrian
- classifier: SVM
- dissimilarity: $\chi^2$ distances between HOG features
[Figure: test-set accuracy vs. labeled training set size (up to 300); the proposed method outperforms CPAL, CCAL, and RAND; higher is better]
• Face recognition
- classifier: SRC
- dissimilarity: Euclidean dist.
[Figure: test-set accuracy vs. labeled training set size (300 to 700); the proposed method outperforms CPAL, CCAL, and RAND; higher is better]
E. Elhamifar, G. Sapiro, A. Yang and S. Sastry, ICCV, 2013
Conclusions
• Studied the problem of subset selection
- Feature-space representations
- Pairwise dissimilarities
• Proposed convex programs using simultaneous sparse recovery
- Extended to deal with outliers
• Proved the solution recovers representatives from each group and correctly clusters data
• Addressed several problems effectively
- Active learning
- Learning nonlinear dynamical models
- Segmentation of time-series data
Thanks!
Shankar Sastry (UCB) Emmanuel Candes (Stanford) Allen Yang (UCB) Mahdi Soltanolkotabi (USC)
Collaborators:Guillermo Sapiro (Duke) Rene Vidal (JHU) Sam Burden (UCB)
Codes: http://www.eecs.berkeley.edu/~ehsan.elhamifar/code.htm