Latent Structure Beyond Sparse Codes

Benjamin Recht

Department of EECS and Statistics

University of California, Berkeley

Sparse Codes

[Figure 1 panels: overcompleteness ratios 1.25x, 2.5x, 5x, 10x]

Figure 1. Learned dictionaries. Each panel shows 100 basis functions selected at random from the dictionary of a given overcompleteness ratio.

resulting in dictionaries containing more specialized elements such as straight contours, blobs, local curvature, and gratings. The specialized elements are better matched to the structures occurring in natural images, as evidenced by the fact that they yield lower L1 norm representations, steeper coefficient decay, and better denoising. It seems plausible that they may also result in improved image compression, though this remains to be seen.

These results are of relevance to neuroscience because the input layer of V1 is thought to be at least 100x

redundancy

Which mathematical representations can be learned robustly?

robustness and sparsity

Gabor-like thingies...

Sparse Approximation

• Use the fact that images are sparse in wavelet basis to reduce number of measurements required for signal acquisition.

[Figure: pixels and their large wavelet coefficients; wideband signal samples and their large Gabor coefficients (axes: time, frequency)]

Compressed Sensing

• $n_{\text{patients}} \ll n_{\text{peaks}}$

• If very few are needed for diagnosis, search for a sparse set of markers

Lasso

Cardinality Minimization

• PROBLEM: Find the vector of lowest cardinality that satisfies/approximates the underdetermined linear system $\Phi x = y$, $\Phi : \mathbb{R}^p \to \mathbb{R}^n$

• NP-HARD:

– Reduce to EXACT-COVER

– Hard to approximate

– Known exact algorithms require enumeration

• HEURISTIC: Replace cardinality with $\ell_1$ norm

[Figure: four examples of low-rank structure, each labeled "Rank of:". Matrices shown: Data Matrix, Density Matrix, Unfolded Tensor, Gram Matrix. Applications shown: Recommender Systems, Quantum Tomography, Seismic Imaging, Geometric Structure]

Affine Rank Minimization

• PROBLEM: Find the matrix of lowest rank that satisfies/approximates the underdetermined linear system $\Phi(X) = y$, $\Phi : \mathbb{R}^{p_1 \times p_2} \to \mathbb{R}^n$

• NP-HARD:

– Reduce to solving polynomial equations

– Hard to approximate

– Exact algorithms are awful

• HEURISTIC: Replace rank with nuclear norm

IDEA: Replace rank with nuclear norm:

minimize $\|X\|_*$ subject to $\Phi(X) = b$

Some guy on livejournal, 2006. Fazel, Parrilo, Recht, 2007. Candes and Recht, 2008.

Succeeds when number of samples is $\tilde{O}(r(p_1 + p_2))$

Heuristic: Gradient Descent

[Diagram: M = L R*, where M is p1 x p2, L is p1 x r, and R* is r x p2]

• Step 1: Pick (i,j) and compute residual: $e = (L_i R_j^T - M_{ij})$

• Step 2: Take a mixture of current model and corrected model (α, β > 0):

$\begin{bmatrix} L_i \\ R_j \end{bmatrix} \leftarrow \begin{bmatrix} \alpha L_i - \beta e R_j \\ \alpha R_j - \beta e L_i \end{bmatrix}$
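The two gradient descent steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation behind the slides: the toy matrix, the initial factors, the step sizes `alpha` and `beta`, and the helper names `sgd_step` and `squared_error` are all assumptions made for the example.

```python
import random

def sgd_step(L, R, M, i, j, alpha, beta):
    """One update on entry (i, j): compute the residual, then mix rows."""
    # Step 1: residual e = L_i R_j^T - M_ij
    e = sum(l * r for l, r in zip(L[i], R[j])) - M[i][j]
    # Step 2: both rows are updated from the old values (simultaneous update)
    new_Li = [alpha * l - beta * e * r for l, r in zip(L[i], R[j])]
    new_Rj = [alpha * r - beta * e * l for l, r in zip(L[i], R[j])]
    L[i], R[j] = new_Li, new_Rj

def squared_error(L, R, M, entries):
    """Sum of squared residuals over the listed (i, j) entries."""
    return sum((sum(l * r for l, r in zip(L[i], R[j])) - M[i][j]) ** 2
               for i, j in entries)

# Hypothetical toy data: a rank-1 matrix with every entry observed.
M = [[1.0, 2.0], [2.0, 4.0]]
observed = [(i, j) for i in range(2) for j in range(2)]
L = [[0.5], [0.5]]   # p1 x r factor with r = 1
R = [[0.5], [0.5]]   # p2 x r factor
rng = random.Random(0)

initial_error = squared_error(L, R, M, observed)
for _ in range(5000):
    i, j = rng.choice(observed)          # "Pick (i, j)"
    sgd_step(L, R, M, i, j, alpha=1.0, beta=0.02)
final_error = squared_error(L, R, M, observed)
```

With `alpha=1.0` the update is a plain stochastic gradient step on the squared residual; choosing `alpha` slightly below 1 additionally shrinks the factors at every step.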

System Identification: find a dynamical model that agrees with time series data

• All linear systems are combinations of single pole filters.

• Leverage this structure for new algorithms and analysis.

Observe a time series $y_1, y_2, \ldots, y_T$ driven by the input $u_1, u_2, \ldots, u_T$

What is a principled way to build a parsimonious model for the input-output responses?

Na et al, 2012

Shah, Bhaskar, Tang, and Recht 2012
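As a rough illustration of "combinations of single pole filters", the sketch below simulates a weighted sum of one-pole recursions. The recursion `s[t] = a * s[t-1] + u[t]`, the restriction to real poles, and the example poles and weights are assumptions chosen for simplicity; the cited papers may normalize their filters differently.

```python
def single_pole_response(pole, u):
    """Run s[t] = pole * s[t-1] + u[t] (stable when |pole| < 1)."""
    s, out = 0.0, []
    for u_t in u:
        s = pole * s + u_t
        out.append(s)
    return out

def simulate(poles, weights, u):
    """Output y[t] of a weighted combination of single-pole filters driven by u."""
    responses = [single_pole_response(a, u) for a in poles]
    return [sum(c * r[t] for c, r in zip(weights, responses))
            for t in range(len(u))]

# Hypothetical example: two real poles driven by a unit impulse.
poles, weights = [0.5, -0.25], [1.0, 2.0]
impulse = [1.0, 0.0, 0.0, 0.0]
y = simulate(poles, weights, impulse)
# For an impulse input, y[t] equals sum_i weights[i] * poles[i] ** t.
```

In this picture a parsimonious model is one that explains the observed outputs with only a few poles.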

Linear Inverse Problems

• Find me a solution of $y = \Phi x$

• Φ n x p, n<p

• Of the infinite collection of solutions, which one should we pick?

• Leverage structure: Sparsity, Rank, Smoothness, Symmetry

• How do we design algorithms to solve underdetermined systems problems with priors?

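Sparsity

• 1-sparse vectors of Euclidean norm 1

• Convex hull is the unit ball of the l1 norm: $\|x\|_1 = \sum_{i=1}^{p} |x_i|$

[Figure: l1 unit ball with vertices at ±1 on each axis]

minimize $\|x\|_1$ subject to $\Phi x = y$

[Figure: axes x1, x2 with the line Φx = y]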

Compressed Sensing: Candes, Romberg, Tao, Donoho, Tanner, Etc...
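A small worked example may make the geometry concrete. The system below, with one equation and two unknowns, is an assumption chosen for illustration; it compares the minimum l1-norm solution with the minimum Euclidean-norm solution.

```latex
\[
\Phi = \begin{bmatrix} 1 & 2 \end{bmatrix},\quad y = 2
\quad\Longrightarrow\quad
\{x : \Phi x = y\} = \{(2 - 2t,\; t) : t \in \mathbb{R}\}.
\]
\[
\|x\|_1 = |2 - 2t| + |t| =
\begin{cases}
2 - 3t, & t \le 0,\\
2 - t,  & 0 \le t \le 1,\\
3t - 2, & t \ge 1,
\end{cases}
\qquad\text{minimized at } t = 1:\quad x_{\ell_1} = (0,\,1),\ \ \|x_{\ell_1}\|_1 = 1.
\]
\[
x_{\ell_2} = \Phi^\top (\Phi \Phi^\top)^{-1} y = \tfrac{2}{5}\,(1,\,2) = (0.4,\ 0.8).
\]
```

The l1 solution lands on a vertex of the scaled l1 ball and is 1-sparse, while the minimum Euclidean-norm solution has both entries nonzero.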

Rank

• 2x2 matrices

• plotted in 3d

rank 1, $x^2 + z^2 + 2y^2 = 1$

Convex hull: the unit ball of the nuclear norm, $\|X\|_* = \sum_i \sigma_i(X)$

Nuclear Norm Heuristic

Fazel 2002. R, Fazel, and Parrilo 2007

Rank Minimization/Matrix Completion
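For 2x2 matrices the nuclear norm has a closed form, since $(\sigma_1 + \sigma_2)^2 = \|X\|_F^2 + 2|\det X|$. The sketch below uses that identity to check the picture on the slide; the function names and the parametrization of the rank-1 symmetric matrices by an angle are assumptions made for the example.

```python
import math

def nuclear_norm_2x2(X):
    """Sum of singular values of a 2x2 matrix via sqrt(||X||_F^2 + 2|det X|)."""
    (a, b), (c, d) = X
    fro_sq = a * a + b * b + c * c + d * d
    det = a * d - b * c
    return math.sqrt(fro_sq + 2.0 * abs(det))

def symmetric_rank_one(theta):
    """Rank-1 symmetric matrix v v^T with v = (cos theta, sin theta)."""
    v = (math.cos(theta), math.sin(theta))
    return [[v[0] * v[0], v[0] * v[1]],
            [v[1] * v[0], v[1] * v[1]]]

# A rank-1 atom [[x, y], [y, z]] satisfies x^2 + z^2 + 2 y^2 = 1.
atom = symmetric_rank_one(0.7)
x, y, z = atom[0][0], atom[0][1], atom[1][1]
on_surface = x * x + z * z + 2.0 * y * y

# The midpoint of two atoms, e1 e1^T and e2 e2^T, has rank 2 but stays in the ball.
midpoint = [[0.5, 0.0], [0.0, 0.5]]
```

Each atom has nuclear norm 1, and so does the midpoint, which is a convex combination of two atoms and lies on a flat face of the hull.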

Integer Programming

• Integer solutions: all components of x are ±1

• Convex hull is the unit ball of the $\ell_\infty$ norm

[Figure: square with vertices (1,-1), (1,1), (-1,-1), (-1,1)]

minimize $\|x\|_\infty$ subject to $\Phi x = y$

[Figure: axes x1, x2 with the line Φx = y]

Donoho and Tanner 2008. Mangasarian and Recht 2009.

Parsimonious Models

• Search for best linear combination of fewest atoms

• “rank” = fewest atoms needed to describe the model

[Diagram: model = sum of weights times atoms; the number of terms is the rank]

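Atomic Norms

• Given a basic set of atoms, $\mathcal{A}$, define the function $\|x\|_{\mathcal{A}} = \inf\{t > 0 : x \in t\,\mathrm{conv}(\mathcal{A})\}$

• When $\mathcal{A}$ is centrosymmetric, we get a norm: $\|x\|_{\mathcal{A}} = \inf\{\sum_{a \in \mathcal{A}} |c_a| : x = \sum_{a \in \mathcal{A}} c_a a\}$

IDEA: minimize $\|z\|_{\mathcal{A}}$ subject to $\Phi z = y$

• When can we compute this?

• When does this work?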
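The three atomic sets that appear earlier in the slides give familiar norms under this definition. The block below simply writes out those special cases; the symbols $e_i$ for the standard basis vectors and $u, v$ for unit vectors are notation introduced here.

```latex
\begin{align*}
\mathcal{A} &= \{\pm e_i : i = 1, \dots, p\}
  &&\Longrightarrow& \|x\|_{\mathcal{A}} &= \sum_{i=1}^{p} |x_i| = \|x\|_1, \\
\mathcal{A} &= \{u v^\top : \|u\|_2 = \|v\|_2 = 1\}
  &&\Longrightarrow& \|X\|_{\mathcal{A}} &= \sum_i \sigma_i(X) = \|X\|_*, \\
\mathcal{A} &= \{\pm 1\}^p
  &&\Longrightarrow& \|x\|_{\mathcal{A}} &= \max_i |x_i| = \|x\|_\infty .
\end{align*}
```

In the first case the decomposition $x = \sum_i x_i e_i$ is the only one available, so the infimum is attained with $c_{e_i} = x_i$.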

Hierarchical dictionary for image patches


Union of Subspaces

• X has structured sparsity: linear combination of elements from a set of subspaces {Ug}.

• Atomic set: unit norm vectors living in one of the Ug

Permutations and Rankings

• X a sum of a few permutation matrices

• Examples: Multiobject Tracking, Ranked elections, BCS

• Convex hull of permutation matrices: doubly stochastic matrices.
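One direction of this fact is easy to check numerically: any convex combination of permutation matrices has nonnegative entries with unit row and column sums. The sketch below does so for a hypothetical choice of three permutations and weights; all names are illustrative.

```python
def permutation_matrix(perm):
    """Matrix P with P[i][perm[i]] = 1 and zeros elsewhere."""
    n = len(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]

def convex_combination(weights, matrices):
    """Entrywise weighted sum; weights are assumed nonnegative and sum to 1."""
    n = len(matrices[0])
    return [[sum(w * P[i][j] for w, P in zip(weights, matrices))
             for j in range(n)] for i in range(n)]

def is_doubly_stochastic(X, tol=1e-12):
    """Check nonnegativity and unit row and column sums."""
    n = len(X)
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in X)
    cols_ok = all(abs(sum(X[i][j] for i in range(n)) - 1.0) <= tol
                  for j in range(n))
    nonneg = all(v >= -tol for row in X for v in row)
    return rows_ok and cols_ok and nonneg

# Hypothetical example: X is a sum of a few permutation matrices.
perms = [(0, 1, 2), (1, 2, 0), (2, 1, 0)]
weights = [0.5, 0.3, 0.2]
X = convex_combination(weights, [permutation_matrix(p) for p in perms])
```

The converse, that every doubly stochastic matrix is such a combination, is the Birkhoff-von Neumann theorem.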

• Moments: convex hull of $[1, t, t^2, t^3, t^4, \ldots]$, $t \in T$, some basic set.

• System Identification, Image Processing, Numerical Integration, Statistical Inference

• Solve with semidefinite programming

• Cut-matrices: sums of rank-one sign matrices.

• Collaborative Filtering, Clustering in Genetic Networks, Combinatorial Approximation Algorithms

• Approximate with semidefinite programming

• Low-rank Tensors: sums of rank-one tensors

• Computer Vision, Image Processing, Hyperspectral Imaging, Neuroscience

• Approximate with alternating least-squares

Atomic norms in sparse approximation

• Greedy approximations: $\|f - f_n\|_{L_2} \le \dfrac{c_0 \|f\|_{\mathcal{A}}}{\sqrt{n}}$

• Best n term approximation to a function f in the convex hull of A.

• Maurey, Jones, and Barron (1980s-90s)

• Devore and Temlyakov (1996)

• Random Feature Heuristics (Rahimi and R, 2007)

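Tangent Cones

• Set of directions that decrease the norm from x form a cone: $T_{\mathcal{A}}(x) = \{d : \|x + \alpha d\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}} \text{ for some } \alpha > 0\}$

• x is the unique minimizer if the intersection of this cone with the null space of Φ equals {0}

minimize $\|z\|_{\mathcal{A}}$ subject to $\Phi z = y$

[Figure: the affine space $y = \Phi z$ touching the norm ball $\{z : \|z\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}}\}$ at x, with the tangent cone $T_{\mathcal{A}}(x)$]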

Mean Width

Support Function: $S_C(d) = \sup_{x \in C} d^\top x$

[Figure: width of the set C along the direction d]

$S_C(d) + S_C(-d)$ measures width of C when projected onto span of d.

mean width: $w(C) = \int_{S^{p-1}} S_C(u)\,du$
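When C is the convex hull of finitely many atoms, the supremum in the support function is attained at an atom, so both quantities are easy to compute. The sketch below evaluates the support function exactly and estimates the mean width by Monte Carlo, assuming `du` is the uniform probability measure on the sphere; the function names, sample count, and seed are illustrative choices.

```python
import math
import random

def support(atoms, d):
    """S_C(d) = sup over C of <d, x>; for C = conv(atoms) an atom attains it."""
    return max(sum(di * ai for di, ai in zip(d, a)) for a in atoms)

def width(atoms, d):
    """S_C(d) + S_C(-d); for a unit vector d this is the width of C along d."""
    return support(atoms, d) + support(atoms, [-di for di in d])

def mean_width(atoms, p, samples=20000, seed=0):
    """Monte Carlo estimate of the average of S_C(u) over uniform unit vectors u."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(v * v for v in g))
        u = [v / norm for v in g]   # a normalized Gaussian is uniform on the sphere
        total += support(atoms, u)
    return total / samples

# Atoms of the l1 ball in the plane: the signed standard basis vectors.
cross_polytope = [(1, 0), (-1, 0), (0, 1), (0, -1)]
estimate = mean_width(cross_polytope, p=2)
```

For these atoms the support function is the largest absolute coordinate of `d`, and in the plane the exact average over the circle is $2\sqrt{2}/\pi \approx 0.90$, which the estimate should approximate.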

• When does a random subspace U in $\mathbb{R}^p$ intersect a convex cone C only at the origin?

• Gordon (1988): with high probability if $\mathrm{codim}(U) \ge p\,w(C \cap S^{p-1})^2$, where $w(C \cap S^{p-1}) = \int_{S^{p-1}} S_C(u)\,du$ is the mean width.

• Corollary: For inverse problems, if Φ is a random Gaussian matrix with n rows, need $n \ge p\,w(T_{\mathcal{A}}(x) \cap S^{p-1})^2$ for exact recovery of x.

Rates

• Hypercube: $n \ge p/2$

• Sparse Vectors, p vector, sparsity s: $n \ge 2s \log\left(\frac{p}{s}\right) + \frac{5s}{4}$

• Block sparse, M groups (possibly overlapping), maximum group size B, k active groups: $n \ge k\left(\sqrt{2 \log(M - k)} + \sqrt{B}\right)^2 + kB$

• Low-rank matrices: p1 x p2, (p1<p2), rank r: $n \ge 3r(p_1 + p_2 - r)$
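To get a feel for these bounds, the sketch below evaluates each one for given problem sizes. It assumes `log` means the natural logarithm; the function names and the example sizes are hypothetical.

```python
import math

def hypercube_rate(p):
    """n >= p / 2"""
    return p / 2

def sparse_rate(s, p):
    """n >= 2 s log(p / s) + 5 s / 4"""
    return 2 * s * math.log(p / s) + 5 * s / 4

def block_sparse_rate(k, M, B):
    """n >= k (sqrt(2 log(M - k)) + sqrt(B))^2 + k B"""
    return k * (math.sqrt(2 * math.log(M - k)) + math.sqrt(B)) ** 2 + k * B

def low_rank_rate(r, p1, p2):
    """n >= 3 r (p1 + p2 - r)"""
    return 3 * r * (p1 + p2 - r)

# Hypothetical sizes: a 10-sparse vector in dimension 1000,
# and a rank-2 matrix of size 10 x 20.
n_sparse = sparse_rate(s=10, p=1000)
n_low_rank = low_rank_rate(r=2, p1=10, p2=20)
```

For the sparse example the bound is roughly 105 measurements, far fewer than the ambient dimension of 1000. The low-rank bound of 168 sits below the 200 matrix entries, but only narrowly at this small size.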

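Robust Recovery (deterministic)

• Suppose we observe $y = \Phi x + w$, $\|w\|_2 \le \delta$

• If $\hat{x}$ is an optimal solution of

minimize $\|z\|_{\mathcal{A}}$ subject to $\|\Phi z - y\| \le \delta$

then $\|x - \hat{x}\| \le \dfrac{2\delta}{\epsilon}$ provided that $n \ge \dfrac{p\,w(T_{\mathcal{A}}(x) \cap S^{p-1})^2}{(1 - \epsilon)^2}$

[Figure: the norm ball $\{z : \|z\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}}\}$ and the constraint region $\|\Phi z - y\| \le \delta$]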

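Robust Recovery (statistical)

• Suppose we observe $y = \Phi x + w$

• If $\hat{x}$ is an optimal solution of

minimize $\|\Phi z - y\|^2 + \mu \|z\|_{\mathcal{A}}$

then $\|\Phi x - \Phi \hat{x}\|_2 \le \sqrt{\mu \|x\|_{\mathcal{A}}}$ provided that $\mu \ge \mathbb{E}_w[\|\Phi^* w\|_{\mathcal{A}}^*]$

And under an additional “cone condition”, stated on $\mathrm{cone}\{u : \|x + u\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}} + \gamma \|u\|\}$: $\|x - \hat{x}\|_2 \le \eta(x, \mathcal{A}, \Phi, \gamma)\,\mu$

Bhaskar, Tang, and Recht 2011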

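Denoising Rates (re-derivations)

• Sparse Vectors, p vector, sparsity s: $\dfrac{1}{p} \|\hat{x} - x^\star\|_2^2 = O\left(\dfrac{\sigma^2 s \log(p)}{p}\right)$

• Low-rank matrices: p1 x p2, (p1<p2), rank r: $\dfrac{1}{p_1 p_2} \|\hat{x} - x^\star\|_F^2 = O\left(\dfrac{\sigma^2 r}{p_1}\right)$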
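In the sparse denoising case the regularized estimator has a closed form. The sketch below assumes Φ is the identity, the atoms are the signed standard basis vectors (so the atomic norm is the l1 norm), and the objective is $\|z - y\|^2 + \mu \|z\|_1$ with no factor of 1/2. Under those assumptions the minimizer is coordinatewise soft thresholding at $\mu/2$. The function names and data are illustrative.

```python
def soft_threshold(y, mu):
    """Minimizer of ||z - y||^2 + mu * ||z||_1, coordinate by coordinate."""
    t = mu / 2.0   # threshold implied by the objective without a 1/2 factor
    out = []
    for v in y:
        if v > t:
            out.append(v - t)
        elif v < -t:
            out.append(v + t)
        else:
            out.append(0.0)   # small coordinates are set exactly to zero
    return out

def objective(z, y, mu):
    """||z - y||^2 + mu * ||z||_1"""
    return sum((a - b) ** 2 for a, b in zip(z, y)) + mu * sum(abs(a) for a in z)

# Hypothetical noisy observation of a sparse vector.
y = [3.0, -0.2, 0.5, -2.0]
x_hat = soft_threshold(y, mu=1.0)
```

Here `x_hat` keeps the two large coordinates, shrunk toward zero by 0.5, and zeroes the two small ones.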

Atomic Norm Minimization

IDEA: minimize $\|z\|_{\mathcal{A}}$ subject to $\Phi z = y$

• Generalizes existing, powerful methods

• Rigorous formula for developing new analysis algorithms

• Tightest bounds on number of measurements needed for model recovery in all common models

• One algorithm prototype for many data-mining applications

Chandrasekaran, Recht, Parrilo, and Willsky 2010

Learning representations

• Gram matrix of y vectors indicates overlapping support

• Use graph algorithms to identify single dictionary elements at a time

• ASSUME:

– very sparse vectors

– $s < N^{1/2} / \log(N)$

– very incoherent dictionary (much more than RIP)

– number of observations is much bigger than N

Arora, Ge, and Moitra. Agarwal, Anandkumar, and Netrapalli.

[Figure: vectors x and z] $|\langle \Phi x, \Phi z \rangle| \approx |\langle x, z \rangle|$

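Extended representations

$C = \Pi(K \cap L)$, where C is a convex body, Π a linear map, K a cone, and L an affine space

[Figure: two hexagons. The regular hexagon is the projection of a 3-dimensional slice of $\mathbb{R}^5_+$: $\{y \in \mathbb{R}^5_+ : y_1 + y_2 + y_3 + y_5 = 2,\ y_3 + y_4 + y_5 = 1\}$. This non-regular hexagon only has the trivial LP-lift.]

Examples of $C = \Pi(K \cap L)$:

• [Figure: diamond with vertices at ±1 on each axis] $K = \mathbb{R}^{2d}_+$, $\Pi = \begin{bmatrix} I & -I \end{bmatrix}$, $L = \{y : \sum_{i=1}^{2d} y_i = 1\}$

• [Figure: square with vertices (1,-1), (1,1), (-1,-1), (-1,1)] $K = \mathbb{R}^{2d}_+$, $\Pi = \begin{bmatrix} I & -I \end{bmatrix}$, $L = \{y : y_i + y_{i+d} = 1,\ 1 \le i \le d\}$

• $K = \mathcal{S}^{d_1 + d_2}_+$, $\Pi\left(\begin{bmatrix} A & B \\ B^T & C \end{bmatrix}\right) = B$, $L = \{Z : \mathrm{trace}(Z) = 1\}$

• $K = \mathcal{S}^{d+1}_+$, $\Pi\left(\begin{bmatrix} T & x \\ x^T & u \end{bmatrix}\right) = x$, $L = \left\{Z = \begin{bmatrix} T & x \\ x^T & u \end{bmatrix} : T \text{ Toeplitz},\ T_{11} = u = 1\right\}$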
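The first of these representations can be checked directly: with K the nonnegative orthant in dimension 2d, Π = [I −I], and L the set of vectors whose entries sum to 1, every point of the l1 unit ball is the image of some nonnegative y in L. The sketch below builds one such y by splitting x into positive and negative parts and spreading the leftover mass evenly; the function names and that padding choice are assumptions.

```python
def lift_l1_ball(x, tol=1e-12):
    """Return y >= 0 with sum(y) == 1 and [I -I] y == x, for ||x||_1 <= 1."""
    d = len(x)
    slack = 1.0 - sum(abs(v) for v in x)
    if slack < -tol:
        raise ValueError("x lies outside the l1 unit ball")
    pad = max(slack, 0.0) / (2 * d)   # added to both halves, so it cancels under Pi
    pos = [max(v, 0.0) + pad for v in x]
    neg = [max(-v, 0.0) + pad for v in x]
    return pos + neg

def project(y):
    """Apply Pi = [I -I]: subtract the second half of y from the first half."""
    d = len(y) // 2
    return [y[i] - y[i + d] for i in range(d)]

# Hypothetical point inside the l1 ball in the plane.
x = [0.25, -0.5]
y = lift_l1_ball(x)
x_back = project(y)
```

Conversely, any nonnegative y with unit sum projects to a point of l1 norm at most 1, so the image of K ∩ L is exactly the l1 ball.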

Extended representations


polar body: $C^\circ = \{y : \langle x, y \rangle \le 1 \ \ \forall x \in C\}$

C has a lift into K if there are maps $A : C \to K$, $B : C^\circ \to K^*$ such that

$1 - \langle x, y \rangle = \langle A(x), B(y) \rangle$

for all extreme points x ∈ C and y ∈ C°

Gouveia, Parrilo, and Thomas

Representation learning becomes matrix factorization
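For a polytope, the quantity $1 - \langle x, y \rangle$ evaluated at the vertices of C and of its polar forms a nonnegative slack matrix, and a lift into the nonnegative orthant amounts to a nonnegative factorization of that matrix. The sketch below builds the slack matrix of the square and verifies the trivial factorization; the choice of the square and all names are assumptions made for illustration.

```python
def slack_matrix(vertices, polar_vertices):
    """S[i][j] = 1 - <x_i, y_j> for vertices x_i of C and y_j of its polar."""
    return [[1.0 - sum(a * b for a, b in zip(x, y)) for y in polar_vertices]
            for x in vertices]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# The square [-1, 1]^2 and the vertices of its polar (the l1 ball).
square = [(1, 1), (1, -1), (-1, -1), (-1, 1)]
polar = [(1, 0), (0, 1), (-1, 0), (0, -1)]
S = slack_matrix(square, polar)

# Trivial nonnegative factorization: A(x_i) is row i of S, B(y_j) is e_j.
A = [row[:] for row in S]
B = [[1.0 if k == j else 0.0 for k in range(len(polar))]
     for j in range(len(polar))]
```

Each row of `S` has two zeros, one for each facet of the square that passes through the corresponding vertex. A factorization with a smaller inner dimension, where one exists, would give a smaller lift.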

Learning extended representations?

$C = \Pi(K \cap L)$: convex body = linear map of (cone ∩ affine space)

• Learning representation through NMF?

• Ties immediately with gaussian width analysis

• Could obviate graph structured arguments

• What are the right features?
