Transcript of lecture slides: Advanced Machine Learning & Perception, Tony Jebara, Columbia University (34 pages).

Page 1:

Tony Jebara, Columbia University

Advanced Machine Learning & Perception

Instructor: Tony Jebara

Page 2:

Topic 13
• Manifolds Continued and Spectral Clustering
• Convex Invariance Learning (CoIL)
• Kernel PCA (KPCA)
• Spectral Clustering & N-Cuts

Page 3:

Manifolds Continued
• PCA: linear manifold
• MDS: compute inter-point distances, find 2D data with the same distances
• LLE: mimic local neighborhoods using low-dimensional vectors
• GTM: fit a grid of Gaussians to the data via a nonlinear warp

• Linear PCA after nonlinear normalization/invariance of the data
• Manifold learning as linear PCA in Hilbert space (kernels)
• Spectral clustering in Hilbert space

Page 4:

Convex Invariance Learning
• PCA is appropriate for finding a linear manifold.
• Variation in the data is only modeled linearly.
• But many problems are nonlinear.
• However, the nonlinear variations may be irrelevant:
  Images: morph, rotate, translate, zoom…
  Audio: pitch changes, ambient acoustics…
  Video: motion, camera view, angles…
  Genomics: proteins fold, insertions, deletions…
  Databases: fields swapped, formats, scaled…
• Imagine a “Gremlin” is corrupting your data by multiplying each input vector $X_t$ by some matrix $A_t$ to give $A_t X_t$.
• Idea: remove the nonlinear, irrelevant variations before PCA, but make this part of the PCA optimization rather than a pre-processing step.

Page 5:

Convex Invariance Learning
• Example of irrelevant variation in our data: permutation in image data. Each image $X_t$ is multiplied by a permutation matrix $A_t$ by the gremlin; we must clean it up.
• When we convert images to a vector, we are assuming an arbitrary, meaningless ordering (like the gremlin mixing the order).
• This arbitrary ordering causes wild nonlinearities (manifold).
• We should not trust the ordering; assume the gremlin has permuted it with an arbitrary permutation matrix…

$$X_t = \begin{bmatrix} 112 & 7 & 12 & 54 & 1 & 3 & 85 & 1 & 4 & 84 & \ldots \end{bmatrix}^T$$
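As a concrete illustration of the gremlin (a minimal numpy sketch of my own, with made-up pixel values), multiplying a rasterized image vector $X_t$ by a random permutation matrix $A_t$ scrambles the pixel ordering without changing the pixels themselves:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rasterized image: a vector of pixel intensities (toy example).
X_t = np.array([112, 7, 12, 54, 1, 3, 85, 1, 4, 84], dtype=float)

# The "gremlin": a random permutation matrix A_t.
perm = rng.permutation(len(X_t))
A_t = np.eye(len(X_t))[perm]

# The observed datum is A_t X_t: the same pixels in an arbitrary order.
print(A_t @ X_t)
print(np.allclose(A_t @ X_t, X_t[perm]))  # True: multiplying by A_t permutes the entries
```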

Page 6:

Permutation Invariance
• Permutation is irrelevant variation in our data.
• The gremlin is permuting the fields in our input vectors.
• So, view a datum as a “Bag of Vectors” instead of a single vector, i.e. a grayscale image = a set of vectors or “Bag of Pixels”: N pixels, each a D = 3 (X, Y, I) tuple.
• Treat each input as a permutable “Bag of Pixels”.

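A minimal sketch (my own, on an assumed toy image; the helper name is mine) of the bag-of-pixels representation: each grayscale image becomes an N x 3 set of (X, Y, I) tuples whose row order carries no meaning.

```python
import numpy as np

def image_to_bag_of_pixels(img):
    """Return an (N, 3) array of (x, y, intensity) rows, one per pixel.

    The row order is arbitrary: any permutation of the rows represents
    the same image, which is exactly the invariance we want.
    """
    ys, xs = np.indices(img.shape)
    return np.column_stack([xs.ravel(), ys.ravel(), img.ravel()]).astype(float)

img = np.arange(12, dtype=float).reshape(3, 4)   # toy 3x4 grayscale image
bag = image_to_bag_of_pixels(img)                # shape (12, 3): D=3 XYI tuples
print(bag.shape)
```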

Page 7:

Optimal Permutation

$$\vec{x}_1 = \left\{ \left(X_1^1, Y_1^1, I_1^1\right), \left(X_1^2, Y_1^2, I_1^2\right), \ldots, \left(X_1^N, Y_1^N, I_1^N\right) \right\}$$
$$\vec{x}_1 = \left\{ \left(X_1^5, Y_1^5, I_1^5\right), \left(X_1^8, Y_1^8, I_1^8\right), \ldots, \left(X_1^2, Y_1^2, I_1^2\right) \right\}$$
$$\vec{x}_2 = \left\{ \left(X_2^1, Y_2^1, I_2^1\right), \left(X_2^2, Y_2^2, I_2^2\right), \ldots, \left(X_2^N, Y_2^N, I_2^N\right) \right\}$$
$$\vec{x}_2 = \left\{ \left(X_2^3, Y_2^3, I_2^3\right), \left(X_2^4, Y_2^4, I_2^4\right), \ldots, \left(X_2^9, Y_2^9, I_2^9\right) \right\}$$

• Vectorization / rasterization uses the index in the image to sort the pixels into a large vector.
• If we knew the “optimal” correspondence, we could sort the pixels in each bag into a large vector more appropriately…
• …but we don’t know it, and must learn it.
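The slides learn the correspondence jointly with the model via a convex program; as a simpler stand-in that shows what an “optimal” correspondence between two bags means, here is a sketch that matches pixels with the Hungarian algorithm on pairwise distances (an illustration only, not the paper's method; function and variable names are mine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_bags(bag_a, bag_b):
    """Reorder the rows of bag_b so row i of bag_b corresponds to row i of bag_a.

    Solves a linear assignment problem on squared Euclidean distances
    between (X, Y, I) tuples and returns the re-ordered bag_b.
    """
    cost = cdist(bag_a, bag_b, metric="sqeuclidean")   # N x N pairwise costs
    _, cols = linear_sum_assignment(cost)              # optimal one-to-one matching
    return bag_b[cols]

rng = np.random.default_rng(0)
bag_a = rng.random((70, 3))                 # e.g. 70 (x, y, intensity) dots
bag_b = bag_a[rng.permutation(70)]          # the same dots, scrambled order
print(np.allclose(match_bags(bag_a, bag_b), bag_a))   # True: ordering recovered
```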

Page 8:

PCA on Permuted Data
• In non-permuted vector images, linear changes and eigenvectors are additions and deletions of intensities (bad!). Translating, raising eyebrows, etc. = erasing and redrawing.
• In a bag of pixels (vectorized only after knowing the optimal permutation), linear changes and eigenvectors are morphings and warpings: jointly spatial and intensity change.

Page 9:

Permutation as a Manifold
• Assume the order is unknown: a “Set of Vectors” or “Bag of Pixels”. We get permutational invariance (order doesn’t matter).
• Can’t represent this invariance by a single ‘X’ vector point in D×N space, since we don’t know the ordering.
• Get permutation invariance by letting ‘X’ span all possible reorderings: multiply X by an unknown A matrix (permutation or doubly-stochastic).

[Figure: the same bag x written under different orderings, e.g.]
$$\begin{bmatrix} x(1)\\ x(2)\\ x(3)\\ x(4) \end{bmatrix}, \quad \begin{bmatrix} x(2)\\ x(1)\\ x(3)\\ x(4) \end{bmatrix}, \quad \begin{bmatrix} x(4)\\ x(1)\\ x(3)\\ x(2) \end{bmatrix}, \quad \begin{bmatrix} x(4)\\ x(1)\\ x(2)\\ x(3) \end{bmatrix}, \quad \begin{bmatrix} x(1)\\ x(2)\\ x(3)\\ x(4) \end{bmatrix}$$

Page 10:

Invariant Paths as Matrix Ops
• Move a vector along the manifold by multiplying by a matrix: $X \rightarrow AX$
• Restrict A to be a permutation matrix (operator). The resulting manifold of configurations is an “orbit” if A is a group.
• Or, for a smooth manifold, make A a doubly-stochastic matrix.
• Endow each image in the dataset with its own transformation matrix $A_t$. Each is now a bag or manifold:
$$\left\{ A_1 X_1, \ldots, A_T X_T \right\}$$

Page 11:

A Dataset of Invariant Manifolds
• E.g. assume the model is PCA: learn a 2D subspace of 3D data.
• Permutation lets points move independently along paths.
• Find PCA after moving the points, to form a ‘tight’ 2D subspace.
• More generally, move along the manifolds to improve the fit of any model (PCA, SVM, probability density, etc.).

Page 12:

Optimizing the Permutations
• Optimize a modeling cost under linear constraints on the matrices.
• Estimate the transformation parameters and the model parameters (PCA, Gaussian, SVM).
• The cost on the matrices A emerges from the modeling criterion.
• Typically we get a convex cost over a convex hull of constraints (unique solution!).

•Since A matrices are soft permutation matrices (doubly-stochastic) we have:

$$\sum_i A_t^{ij} = 1, \qquad \sum_j A_t^{ij} = 1, \qquad A_t^{ij} \geq 0$$

The overall problem, over $A = \{A_1, \ldots, A_T\}$:
$$\min_A C\!\left(A_1, \ldots, A_T\right) \quad \text{subject to} \quad \sum_{ij} A_t^{ij} Q_{td}^{ij} + b_{td} \geq 0 \;\; \forall\, t, d$$
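The constraint set above says each $A_t$ is (softly) doubly stochastic. One common way to push an arbitrary nonnegative matrix toward that set is Sinkhorn's alternating row/column normalization; a quick sketch (a generic technique, not necessarily the solver used in the slides):

```python
import numpy as np

def sinkhorn(M, n_iters=200):
    """Alternately normalize rows and columns so the matrix approaches
    the doubly-stochastic polytope (all row and column sums equal to 1)."""
    A = np.asarray(M, dtype=float)
    for _ in range(n_iters):
        A = A / A.sum(axis=1, keepdims=True)   # rows sum to 1
        A = A / A.sum(axis=0, keepdims=True)   # columns sum to 1
    return A

A = sinkhorn(np.random.rand(5, 5) + 1e-3)
print(A.sum(axis=0))   # ~[1. 1. 1. 1. 1.]
print(A.sum(axis=1))   # ~[1. 1. 1. 1. 1.]
```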

Page 13:

Example Cost: Gaussian Mean
• Maximum likelihood Gaussian mean model:
• Theorem 1: C(A) is convex in A (convex program).
• Can solve via a quadratic program on the A matrices.
• Minimizing the trace of a covariance tries to pull the data spherically towards a common mean.

$$l(A, \mu) = \sum_t \log \mathcal{N}\!\left(A_t X_t;\, \mu, I\right), \qquad \hat{\mu} = \tfrac{1}{T}\sum_t A_t X_t$$
$$l(A, \hat{\mu}) = -\tfrac{TD}{2}\log 2\pi - \tfrac{1}{2}\sum_t \left\| A_t X_t - \hat{\mu} \right\|^2$$
$$C(A) = -l(A, \hat{\mu}) = \operatorname{trace}\!\left(\operatorname{Cov}(AX)\right)$$
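A small numpy sketch of evaluating this cost for a fixed set of $A_t$ matrices (the data and matrices here are placeholders; in the actual algorithm the $A_t$ are the variables of the QP):

```python
import numpy as np

def gaussian_mean_cost(A_list, X_list):
    """C(A) = trace(Cov(AX)): how tightly the transformed points A_t X_t
    concentrate around their common mean."""
    Y = np.stack([A @ X for A, X in zip(A_list, X_list)])    # T x D transformed data
    return np.trace(np.cov(Y, rowvar=False, bias=True))

rng = np.random.default_rng(0)
X_list = [rng.random(4) for _ in range(10)]     # T = 10 data vectors, D = 4
A_list = [np.eye(4) for _ in X_list]            # identity transforms as a baseline
print(gaussian_mean_cost(A_list, X_list))
```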

Page 14:

Example Cost: Gaussian Cov

• Theorem 2: The regularized log-determinant of the covariance is convex. Equivalently, minimize:

• Theorem 3: The cost is non-quadratic but upper-boundable by a quadratic. Iteratively solve with a QP using the variational bound:

• Minimizing the determinant flattens the data into a low-volume pancake.

$$l(A, \mu, \Sigma) = \sum_t \log \mathcal{N}\!\left(A_t X_t;\, \mu, \Sigma\right), \qquad \hat{\mu} = \tfrac{1}{T}\sum_t A_t X_t, \qquad \hat{\Sigma} = \tfrac{1}{T}\sum_t \left(A_t X_t - \hat{\mu}\right)\left(A_t X_t - \hat{\mu}\right)^T$$
$$l(A, \hat{\mu}, \hat{\Sigma}) = -\tfrac{TD}{2}\log 2\pi - \tfrac{T}{2}\log\left|\hat{\Sigma}\right| - \tfrac{1}{2}\sum_t \left(A_t X_t - \hat{\mu}\right)^T \hat{\Sigma}^{-1} \left(A_t X_t - \hat{\mu}\right)$$
$$C(A) = -l(A, \hat{\mu}, \hat{\Sigma}) = \left|\operatorname{Cov}(AX)\right|$$
Regularized cost (Theorem 2): $\;\log\left|\operatorname{Cov}(AX) + \epsilon I\right| + \epsilon\, \operatorname{trace}\!\left(\operatorname{Cov}(AX)\right)$
Variational bound (Theorem 3): $\;\log\left|S\right| \leq \log\left|S_0\right| + \operatorname{trace}\!\left(S_0^{-1} S\right) - \operatorname{trace}\!\left(S_0^{-1} S_0\right)$
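A quick numerical sanity check (my own sketch) of the variational bound in Theorem 3: since $\log|S|$ is concave, it lies below its tangent at any $S_0$, which is what makes the covariance cost upper-boundable by a quadratic in A.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(d):
    """Random symmetric positive-definite matrix."""
    B = rng.standard_normal((d, d))
    return B @ B.T + d * np.eye(d)

for _ in range(5):
    S, S0 = random_spd(4), random_spd(4)
    lhs = np.linalg.slogdet(S)[1]
    rhs = (np.linalg.slogdet(S0)[1]
           + np.trace(np.linalg.solve(S0, S))
           - np.trace(np.linalg.solve(S0, S0)))      # = d, kept literal to match the bound
    assert lhs <= rhs + 1e-9
print("log|S| <= log|S0| + tr(S0^-1 S) - tr(S0^-1 S0) held on all samples")
```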

Page 15:

Example Cost: Fisher Discriminant
• Find the linear Fisher discriminant model w that maximizes the ratio of between-class to within-class scatter:
• For discriminative invariance, the transformation matrices should increase the between-class scatter (numerator) and reduce the within-class scatter (denominator).
• Minimizing the cost C(A) below permutes the data to make classification easy.

$$\max_w \frac{w^T U w}{w^T \left(\Sigma_+ + \Sigma_-\right) w}, \qquad U = \left(\mu_+ - \mu_-\right)\left(\mu_+ - \mu_-\right)^T$$
$$C(A) = \left|\Sigma_+ + \Sigma_-\right| - \lambda \left(\mu_+ - \mu_-\right)^T\left(\mu_+ - \mu_-\right)$$
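For reference, a minimal numpy sketch of the standard Fisher criterion on fixed, already-transformed class data, i.e. the ratio the A matrices are being tuned to improve (the data and function name here are mine, for illustration only):

```python
import numpy as np

def fisher_ratio(Xp, Xm):
    """Value of max_w (w^T U w) / (w^T (S+ + S-) w) at the optimal w,
    where U is the between-class scatter and S+, S- the class covariances."""
    mu_p, mu_m = Xp.mean(axis=0), Xm.mean(axis=0)
    Sw = np.cov(Xp, rowvar=False, bias=True) + np.cov(Xm, rowvar=False, bias=True)
    w = np.linalg.solve(Sw, mu_p - mu_m)        # the optimal Fisher direction
    U = np.outer(mu_p - mu_m, mu_p - mu_m)
    return (w @ U @ w) / (w @ Sw @ w)

rng = np.random.default_rng(0)
Xp = rng.standard_normal((50, 3)) + 2.0         # class +
Xm = rng.standard_normal((50, 3)) - 2.0         # class -
print(fisher_ratio(Xp, Xm))                     # large ratio = easy classification
```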

Page 16:

Interpreting C(A)
• Maximum likelihood mean: permutes data towards a common mean.

• Maximum likelihood mean & covariance: permutes data towards a flat subspace, pushes energy into a few eigenvectors; great as pre-processing before PCA.

• Fisher discriminant: permutes data towards two flat subspaces while repelling them away from each other’s means.

Page 17:

SMO Optimization of QP
• Quadratic programming is used for all of the C(A) costs since:
  Gaussian mean: quadratic
  Gaussian covariance: upper-boundable by a quadratic
  Fisher discriminant: upper-boundable by a quadratic

• Use Sequential Minimal Optimization: axis-parallel optimization; pick axes to update and ensure the constraints are not violated.

• A soft permutation matrix gives 4 constraints, or 4 entries at a time: $A_t^{mn}, A_t^{mq}, A_t^{pn}, A_t^{pq}$

$$\operatorname{trace}\!\left(M\hat{\Sigma}\right) = \frac{1}{T}\sum_{mpnq,\,i} A_i^{mn} A_i^{pq} X_i^q M^{pm} X_i^n \;-\; \frac{1}{T^2}\sum_{mpnq,\,ij} A_i^{mn} A_j^{pq} X_j^q M^{pm} X_i^n$$
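A sketch of the axis-parallel move an SMO-style solver can make on a doubly-stochastic matrix: shifting mass by delta around the 2 x 2 sub-block of entries (m,n), (m,q), (p,n), (p,q) keeps every row and column sum fixed. (This shows only the feasibility bookkeeping; in the real solver the value of delta comes from minimizing the quadratic cost along this direction.)

```python
import numpy as np

def smo_block_step(A, m, p, n, q, delta):
    """Update the four entries A[m,n], A[m,q], A[p,n], A[p,q] by +/- delta.

    Adding delta to A[m,n] and A[p,q] while subtracting it from A[m,q] and
    A[p,n] leaves all row and column sums unchanged, so A stays in the
    doubly-stochastic polytope as long as no entry goes negative.
    """
    delta = np.clip(delta, -min(A[m, n], A[p, q]), min(A[m, q], A[p, n]))
    A = A.copy()
    A[m, n] += delta
    A[p, q] += delta
    A[m, q] -= delta
    A[p, n] -= delta
    return A

A = np.full((3, 3), 1.0 / 3.0)                  # doubly-stochastic starting point
A = smo_block_step(A, 0, 1, 0, 2, delta=0.1)
print(A.sum(axis=0), A.sum(axis=1))             # still all ones
```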

Page 18:

XY Digits Permuted PCA

[Figure: Original | PCA | Permuted PCA reconstructions]
• 20 images of ‘3’ and ‘9’; each is 70 (x, y) dots, with no order on the ‘dots’.
• PCA compresses with the same number of eigenvectors.
• The convex program first estimates the permutation, giving better reconstruction.

Page 19:

Interpolation
• Intermediate images are smooth morphs.
• Points nicely corresponded.
• Spatial morphing versus ‘redrawing’.
• No ghosting.

Page 20:

XYI Faces Permuted PCA

[Figure: Original | PCA | Permuted Bag-of-XYI-Pixels PCA reconstructions]
• 2000 XYI pixels, compressed to 20 dimensions.
• Improves the squared error of PCA by almost 3 orders of magnitude (×10³).

Page 21:

XYI Multi-Faces Permuted PCA

[Figure: +/- scaling along each of the top 5 eigenvectors]
• All are just linear variations in the bag of XYI pixels.
• Vectorization is nonlinear and needs a huge number of eigenvectors.

Page 22:

XYI Multi-Faces Permuted PCA

[Figure: +/- scaling along each of the next 5 eigenvectors]

Page 23:

Kernel PCA
• Replace all dot-products in PCA with kernel evaluations.
• Recall: we could do PCA on the D×D covariance matrix of the data or on the N×N Gram matrix of the data.
• For nonlinearity, do PCA on feature expansions.
• Instead of doing an explicit feature expansion, use a kernel, e.g. a d-th order polynomial.
• As usual, the kernel must satisfy Mercer’s theorem.
• Assume, for simplicity, all feature data is zero-mean.

$$C = \frac{1}{N}\sum_{i=1}^{N} \vec{x}_i \vec{x}_i^T, \qquad K_{ij} = \vec{x}_i^T \vec{x}_j$$
$$K_{ij} = k\!\left(x_i, x_j\right) = \phi\!\left(x_i\right)^T \phi\!\left(x_j\right) = \left(x_i^T x_j\right)^d$$
$$\bar{C} = \frac{1}{N}\sum_{i=1}^{N} \phi\!\left(x_i\right) \phi\!\left(x_i\right)^T, \qquad \text{eigenvalues and eigenvectors satisfy } \lambda \vec{v} = \bar{C} \vec{v}$$
$$\text{If the data is zero-mean: } \sum_{i=1}^{N} \phi\!\left(x_i\right) = 0$$

Page 24:

Kernel PCA
• Efficiently find and use the eigenvectors of $\bar{C}$:
$$\lambda \vec{v} = \bar{C} \vec{v}$$
• Can dot either side of the above equation with a feature vector:
$$\lambda\, \phi\!\left(x_i\right)^T \vec{v} = \phi\!\left(x_i\right)^T \bar{C} \vec{v}$$
• The eigenvectors are in the span of the feature vectors:
$$\vec{v} = \sum_{i=1}^{N} \alpha_i\, \phi\!\left(x_i\right)$$
• Combine the equations:
$$\lambda\, \phi\!\left(x_i\right)^T \sum_{j=1}^{N}\alpha_j \phi\!\left(x_j\right) = \frac{1}{N}\, \phi\!\left(x_i\right)^T \sum_{k=1}^{N}\phi\!\left(x_k\right)\phi\!\left(x_k\right)^T \sum_{j=1}^{N}\alpha_j \phi\!\left(x_j\right)$$
$$\lambda \sum_{j} \alpha_j K_{ij} = \frac{1}{N} \sum_{j}\sum_{k} \alpha_j K_{ik} K_{kj}$$
$$\lambda K \vec{\alpha} = \frac{1}{N} K^2 \vec{\alpha} \quad\Rightarrow\quad N\lambda\, K\vec{\alpha} = K^2 \vec{\alpha} \quad\Rightarrow\quad N\lambda\, \vec{\alpha} = K \vec{\alpha}$$

Page 25:

Kernel PCA
• From before, we had $\lambda\, \phi\!\left(x_i\right)^T \vec{v} = \phi\!\left(x_i\right)^T \bar{C} \vec{v}$, i.e. $N\lambda\, \vec{\alpha} = K \vec{\alpha}$: this is an eigenvalue equation!
• Get the eigenvectors and eigenvalues of K.
• The eigenvalues of K are N times $\lambda$.
• For each eigenvector $\vec{\alpha}^k$ of K there is an eigenvector $\vec{v}^k$ of $\bar{C}$.
• Want the eigenvectors v to be normalized:
$$1 = \vec{v}^{kT} \vec{v}^k = \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i^k \alpha_j^k\, \phi\!\left(x_i\right)^T \phi\!\left(x_j\right) = \vec{\alpha}^{kT} K \vec{\alpha}^k = \vec{\alpha}^{kT} \left(N \lambda_k\right) \vec{\alpha}^k \quad\Rightarrow\quad \vec{\alpha}^{kT} \vec{\alpha}^k = \frac{1}{N\lambda_k}$$
• Can now use the alphas only for doing PCA projection and reconstruction!

Page 26:

Kernel PCA
• To compute the k-th projection coefficient of a new point x:
$$c^k(x) = \phi\!\left(x\right)^T \vec{v}^k = \sum_{i=1}^{N} \alpha_i^k\, \phi\!\left(x\right)^T \phi\!\left(x_i\right) = \sum_{i=1}^{N} \alpha_i^k\, k\!\left(x, x_i\right)$$
• Reconstruction*:
$$\tilde{\phi}\!\left(x\right) = \sum_{k=1}^{K} c^k \vec{v}^k = \sum_{k=1}^{K}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i^k \alpha_j^k\, k\!\left(x, x_i\right) \phi\!\left(x_j\right)$$
  *Pre-image problem: a linear combination in Hilbert space can go outside the image of $\phi$.
• Can now do nonlinear PCA, and do PCA on non-vectors.
• Nonlinear KPCA eigenvectors satisfy the same properties as usual PCA, but in Hilbert space. These eigenvectors:
  1) Top q have maximum variance
  2) Top-q reconstruction has minimum mean squared error
  3) Are uncorrelated/orthogonal
  4) Top q have maximum mutual information with the inputs
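Pulling the last three slides together, a compact numpy sketch of KPCA (assuming the Gram matrix is already centered in feature space, as the next slide handles; function names and the toy cubic kernel are mine):

```python
import numpy as np

def kernel_pca(K, n_components):
    """From an N x N (centered) Gram matrix K, solve N*lambda*alpha = K*alpha,
    keep the leading eigenvectors, and rescale each alpha^k so the implicit
    eigenvector v^k = sum_i alpha_i^k phi(x_i) has unit norm
    (i.e. alpha^k . alpha^k = 1 / (N * lambda_k))."""
    evals, evecs = np.linalg.eigh(K)                 # ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]       # descending order
    alphas = evecs[:, :n_components] / np.sqrt(evals[:n_components])
    return alphas

def project(K_new, alphas):
    """k-th projection coefficient of new points: c^k(x) = sum_i alpha_i^k k(x, x_i).
    K_new is an M x N kernel matrix between M new points and the N training points."""
    return K_new @ alphas

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
K = (X @ X.T) ** 3                                   # cubic polynomial kernel
alphas = kernel_pca(K, n_components=2)
coeffs = project(K, alphas)                          # training-set projections
print(coeffs.shape)                                  # (30, 2)
```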

Page 27:

Centering Kernel PCA
• So far, we had assumed the feature data was zero-mean:
$$\sum_{i=1}^{N} \phi\!\left(x_i\right) = 0$$
• We want this:
$$\tilde{\phi}\!\left(x_j\right) = \phi\!\left(x_j\right) - \frac{1}{N}\sum_{i=1}^{N}\phi\!\left(x_i\right)$$
• How to do this without touching feature space? Use kernels:
$$\begin{aligned}
\tilde{K}_{ij} &= \tilde{\phi}\!\left(x_i\right)^T \tilde{\phi}\!\left(x_j\right)\\
&= \Big(\phi\!\left(x_i\right) - \tfrac{1}{N}\textstyle\sum_{k=1}^{N}\phi\!\left(x_k\right)\Big)^T \Big(\phi\!\left(x_j\right) - \tfrac{1}{N}\textstyle\sum_{k=1}^{N}\phi\!\left(x_k\right)\Big)\\
&= \phi\!\left(x_i\right)^T\phi\!\left(x_j\right) - \tfrac{1}{N}\textstyle\sum_{k}\phi\!\left(x_i\right)^T\phi\!\left(x_k\right) - \tfrac{1}{N}\textstyle\sum_{k}\phi\!\left(x_k\right)^T\phi\!\left(x_j\right) + \tfrac{1}{N^2}\textstyle\sum_{k,l}\phi\!\left(x_k\right)^T\phi\!\left(x_l\right)\\
&= K_{ij} - \tfrac{1}{N}\textstyle\sum_{k} K_{ik} - \tfrac{1}{N}\textstyle\sum_{k} K_{kj} + \tfrac{1}{N^2}\textstyle\sum_{k,l} K_{kl}
\end{aligned}$$
• Can get the alpha eigenvectors from K-tilde by adjusting the old K.
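The last line of the derivation in matrix form, as a short sketch: with $1_N$ denoting the N x N matrix whose entries are all 1/N, the centered Gram matrix is $\tilde{K} = K - 1_N K - K 1_N + 1_N K 1_N$.

```python
import numpy as np

def center_gram(K):
    """Centered Gram matrix: K - 1_N K - K 1_N + 1_N K 1_N, where 1_N has
    every entry equal to 1/N. Equivalent to subtracting the feature-space
    mean from every phi(x_i)."""
    N = K.shape[0]
    one_n = np.full((N, N), 1.0 / N)
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

rng = np.random.default_rng(0)
B = rng.random((5, 5))
K = B @ B.T                                     # toy PSD Gram matrix
Kc = center_gram(K)
print(np.allclose(Kc.sum(axis=0), 0.0))         # True: centered rows/columns sum to zero
```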

Page 28:

2D KPCA
• KPCA on a 2D dataset.
• Left-to-right: kernel polynomial order goes from 1 to 3 (1 = linear = PCA).
• Top-to-bottom: top eigenvector to weaker eigenvectors.

Page 29:

Kernel PCA Results
• Use the coefficients of the KPCA for training a linear SVM classifier to recognize chairs from their images.
• Use various polynomial kernel degrees, where 1 = linear, as in regular PCA.

Page 30:

Kernel PCA Results
• Use the coefficients of the KPCA for training a linear SVM classifier to recognize characters from their images.
• Use various polynomial kernel degrees, where 1 = linear, as in regular PCA (the worst case in these experiments).
• Inferior performance to nonlinear SVMs (why??)

Page 31:

Spectral Clustering
• Typically, we use EM or k-means to cluster N data points.
• Can imagine clustering the data points only from an N×N matrix capturing their proximity information: this is spectral clustering.
• Again compute a Gram matrix using, e.g., an RBF kernel:
$$K_{ij} = k\!\left(x_i, x_j\right) = \phi\!\left(x_i\right)^T \phi\!\left(x_j\right) = \exp\!\left(-\tfrac{1}{2\sigma^2}\left\| x_i - x_j \right\|^2\right)$$
• Example: we have N pixels from an image, each x = [xcoord, ycoord, intensity].
• The eigenvectors of the K matrix (or a slight variant) seem to capture some segmentation or clustering of the data points!
• A nonparametric form of clustering, since we didn’t assume a Gaussian distribution…
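A short sketch of this recipe in one common normalized form (in the spirit of Ng, Jordan & Weiss; the stabilized variant in the following slides differs, but the ingredients are the same: an RBF Gram matrix, its top eigenvectors, then k-means on the embedded points):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.cluster.vq import kmeans2

def spectral_cluster(X, n_clusters, sigma=1.0):
    """Cluster the rows of X from the eigenvectors of a normalized RBF affinity."""
    K = np.exp(-cdist(X, X, "sqeuclidean") / (2.0 * sigma**2))   # RBF Gram matrix
    np.fill_diagonal(K, 0.0)                                     # drop self-affinities
    d = K.sum(axis=1)
    L = K / np.sqrt(np.outer(d, d))                              # D^{-1/2} K D^{-1/2}
    _, evecs = np.linalg.eigh(L)
    V = evecs[:, -n_clusters:]                                   # top eigenvectors
    V = V / np.linalg.norm(V, axis=1, keepdims=True)             # row-normalize
    _, labels = kmeans2(V, n_clusters, minit="++")
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
print(np.bincount(spectral_cluster(X, 2)))       # roughly [50 50]
```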

Page 32:

Stability in Spectral Clustering
• A standard problem when computing and using eigenvectors: small changes in the data can cause the eigenvectors to change wildly.
• Ensure the eigenvectors we keep are distinct and stable: look at the eigengap…
• Some algorithms ensure the eigenvectors will have a safe eigengap, adjusting or processing the Gram matrix so the eigenvectors stay stable.

[Figure: two eigenvalue spectra; keeping 3 eigenvectors is unsafe when there is no clear gap after the third eigenvalue, and safe when there is a clear gap.]
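Checking whether the kept eigenvectors are separated by a safe eigengap can be done directly from the Gram matrix's spectrum; a small sketch:

```python
import numpy as np

def eigengap(K, k):
    """Gap between the k-th and (k+1)-th largest eigenvalues of the Gram matrix.
    A small gap means the span of the top-k eigenvectors can rotate wildly
    under small perturbations of the data."""
    evals = np.sort(np.linalg.eigvalsh(K))[::-1]
    return evals[k - 1] - evals[k]

print(eigengap(np.eye(4) + 0.1 * np.ones((4, 4)), k=1))   # 0.4: a clear gap after the top eigenvector
```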

Page 33:

Stabilized Spectral Clustering
• Stabilized spectral clustering algorithm:

Page 34:

Stabilized Spectral Clustering
• Example results compared to other clustering algorithms (traditional k-means, unstable spectral clustering, connected components).