Methods of Manifold Learning for Dimension Reduction of Large Data Sets


Transcript of Methods of Manifold Learning for Dimension Reduction of Large Data Sets

Page 1: Methods of Manifold Learning for Dimension Reduction of Large Data Sets

METHODS OF MANIFOLD LEARNING FOR DIMENSION REDUCTION OF

LARGE DATA SETS

Doctoral Candidacy Preliminary Oral Exam

Ryan Bensussan Harvey

May 17, 2010

Committee: Wojciech Czaja (Chair), Kasso Okoudjou, John Benedetto, Rama Chellappa

1

Page 2: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


PREVIEW

• Motivation
• Problem
• Methods
• Research Ideas

Image by Stefan Baudy, used under Creative Commons license

2 Preview

Page 3: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


MOTIVATION

• Science and business producing massive quantities of data

• Computationally difficult to store, process, analyze, visualize

• Academic focus on compression, dimension-reduced processing to address this problem

• Compression methods widely available, but require decompression step to use data

• Dimension-reduced processing generally not available

3 Motivation

Page 4: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


THE PROBLEM

• What are we trying to do?
• What is the intuition for this problem?
• How can we formalize the problem mathematically?
• On what kinds of data can we solve this problem?

Image by qisur, used under Creative Commons license

4 Problem Definition

Page 5: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


PROBLEM INTUITION

• Think flattening a 3D surface to a 2D image

• Simple projection

• Preserving some particular quantity of interest locally

• Preserving some global property of the surface

5 Problem Definition

Page 6: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


THE PROBLEM (FORMALIZED)

• Inputs: $X = [x_1, \dots, x_n]$, $x_k \in \mathbb{R}^D$
• Outputs: $Y = [y_1, \dots, y_n]$, $y_k \in \mathbb{R}^d$, $d \ll D$
• Assumption: data live on some manifold $\mathcal{M} \subset \mathbb{R}^d$ embedded in $\mathbb{R}^D$, and the inputs $X$ are samples taken in $\mathbb{R}^D$ of the underlying manifold $\mathcal{M}$.
• Problem statement: Find a reduced representation $Y$ of $X$ which best preserves the manifold structure of the data, as defined by some metric of interest.

6 Problem Definition

Page 7: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


EXAMPLES OF DATA SETS

Hyperspectral

[Figure: learning framework for silhouette data; (a) block diagram for learning a nonlinear manifold embedding with mappings from the embedding to the visual input and to 3D pose, (b) 3D pose estimation, (c) shape synthesis for three different people.]

Video

An example from Molecular Dynamics: the dynamics of a small protein in a bath of water molecules is approximated by a Langevin system of stochastic equations, $\dot{x} = -\nabla U(x) + \dot{w}$. The set of states of the protein is a noisy set of points in $\mathbb{R}^{36}$. (Mauro Maggioni, Analysis of High-dimensional Data Sets and Graphs)

Molecular Dynamics

Handwritten Digits

Database of about 60,000 28×28 gray-scale pictures of handwritten digits, collected by USPS. Point cloud in $\mathbb{R}^{28^2}$. Goal: automatic recognition. A set of 10,000 pictures (28 by 28 pixels) of 10 handwritten digits; color represents the label (digit) of each point. (Mauro Maggioni, Geometry of data sets in high dimensions and learning)

Image Collections

Text documents

1000 Science News articles, from 8 different categories. We compute about 10,000 coordinates; the i-th coordinate of document d represents the frequency in document d of the i-th word in a fixed dictionary. (Mauro Maggioni, Geometry of data sets in high dimensions and learning)

Text

7 Problem Definition

Page 8: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


METHODS: TAXONOMY

• Methods considered involve convex optimizations solved via eigenvalue problems
• Full-rank: PCA, Kernel PCA
• Sparse: LLE, Laplacian Eigenmaps

Taxonomy:
Dimension Reduction
  Convex
    Full-Rank
      Linear: PCA
      Non-linear: k-PCA
    Sparse
      Reconstruction Weights: LLE
      Neighborhood Graph Laplacian: LE
  Non-convex

8 Methods Introduction

Page 9: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


METHODS: FRAMEWORK

• 3 step algorithm framework:

• Build the kernel matrix

• Solve the appropriate eigenvalue problem associated with that kernel

• Use eigenvectors to compute the embedding in the lower dimension

• Some methods:

• Principal Components Analysis

• Kernel-based Principal Components Analysis

• Laplacian Eigenmaps

• Locally Linear Embedding

9 Methods Introduction
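Not part of the original slides: a minimal Python/NumPy sketch of this three-step framework, with the kernel construction left abstract. The names `spectral_embedding` and `build_kernel` are illustrative, not code from the talk.

```python
import numpy as np

def spectral_embedding(X, build_kernel, d, largest=True):
    """Generic 3-step framework: build a kernel, solve its eigenvalue problem, embed.

    X            : (n, D) array of samples
    build_kernel : callable returning an (n, n) symmetric kernel matrix
    d            : target dimension
    largest      : True for PCA/k-PCA (top eigenvectors),
                   False for LE/LLE (smallest nonzero eigenvectors)
    """
    K = build_kernel(X)                    # step 1: kernel matrix
    evals, evecs = np.linalg.eigh(K)       # step 2: eigenvalue problem (ascending order)
    if largest:
        idx = np.argsort(evals)[::-1][:d]  # top d eigenpairs
    else:
        idx = np.argsort(evals)[1:d + 1]   # skip the trivial (zero) eigenvector
    return evecs[:, idx]                   # step 3: embedding coordinates
```

For example, a Gram-matrix kernel `lambda Z: Z @ Z.T` on centered data recovers the PCA scores up to scaling.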

Page 10: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


PRINCIPAL COMPONENTS ANALYSIS

• Linear method: rotation, translation, simple scaling
• Think SNR: maximize signal while minimizing noise
• Rotate and translate axes so that signal variances lie on as few axes as possible
• The kernel: $C = \frac{1}{n}\sum_{j=1}^{n} x_j x_j^T$
• The eigenvalue problem: $\lambda p = Cp$, or in matrix form $P^T \Lambda = C P^T$
• The embedding: $\{\lambda_k\}_{k=2}^{d+1}$, with $1 \ge \lambda_1 \ge \cdots \ge \lambda_D \ge 0$, and $Y = P_{\{k\}} X$

10 Principal Components Analysis
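Not from the slides: a hedged NumPy sketch of the PCA recipe just summarized (covariance kernel, eigenvalue problem, projection onto the leading axes). The function name `pca_embed` is illustrative.

```python
import numpy as np

def pca_embed(X, d):
    """PCA as summarized above: covariance 'kernel' C, its EVP, projection onto top axes.

    X : (n, D) data matrix with rows x_k in R^D;  d : target dimension
    """
    Xc = X - X.mean(axis=0)              # center the data
    C = Xc.T @ Xc / Xc.shape[0]          # C = (1/n) sum_j x_j x_j^T   (D x D)
    evals, P = np.linalg.eigh(C)         # eigenvalue problem  lambda p = C p
    order = np.argsort(evals)[::-1][:d]  # d directions of largest variance
    return Xc @ P[:, order]              # coordinates in the rotated axes (Y = P_{k} X)
```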

Page 11: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


PRINCIPAL COMPONENTS ANALYSIS

EVP

11 Principal Components Analysis

[Figure: a 3D input point cloud and its 2D PCA embedding (MATLAB scatter plots).]

Page 12: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


PCA USING DOT PRODUCTS

• To move from (linear) PCA to (nonlinear) Kernel-based PCA (Schölkopf, Smola & Müller, 1998), we consider a formulation of PCA exclusively using dot products:
$$\lambda p = Cp, \qquad \lambda (x_k \cdot p) = (x_k \cdot Cp), \qquad Cp = \frac{1}{n}\sum_{j=1}^{n} (x_j \cdot p)\, x_j$$
• All solutions $p$ with $\lambda \neq 0$ lie in $\mathrm{span}(x_1, x_2, \dots, x_n)$.

12 Kernel-based PCA

Page 13: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


INTRODUCING NONLINEARITY IN PCA

• We then introduce nonlinearity by mapping from the input space $\mathbb{R}^D$ to the feature space $F$:
$$\Phi : \mathbb{R}^D \to F, \qquad x \mapsto \tilde{x} = \Phi(x)$$
• For now, assume the data $\Phi(x_k)$ in $F$ is centered: $\sum_{k=1}^{n} \Phi(x_k) = 0$
• Then, the covariance matrix in $F$ is:
$$C = \frac{1}{n}\sum_{j=1}^{n} \Phi(x_j)\Phi(x_j)^T = \frac{1}{n}\sum_{j=1}^{n} \tilde{x}_j \tilde{x}_j^T$$

13 Kernel-based PCA

Page 14: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


THE EIGENPROBLEM IN $F$

• We now rewrite the eigenvalue problem of PCA, $\lambda p = Cp$, in $F$:
$$\lambda(\Phi(x_k) \cdot p) = (\Phi(x_k) \cdot Cp), \quad \forall k = 1, \dots, n$$
• Again, all $p$ with $\lambda \neq 0$ lie in $\mathrm{span}(\Phi(x_1), \Phi(x_2), \dots, \Phi(x_n))$.
• In addition, we can write the linear expansion:
$$p = \sum_{j=1}^{n} \alpha_j \Phi(x_j)$$

14 Kernel-based PCA

Page 15: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


THE EIGENPROBLEM IN $F$ (CONTINUED)

• Using this expansion, we rewrite the dot product formulation of the PCA eigenproblem in $F$:
$$\lambda \sum_{j=1}^{n} \alpha_j (\tilde{x}_k \cdot \tilde{x}_j) = \frac{1}{n}\sum_{i=1}^{n} \alpha_i \Big(\tilde{x}_k \cdot \sum_{j=1}^{n} \tilde{x}_j (\tilde{x}_j \cdot \tilde{x}_i)\Big), \quad \forall k = 1, \dots, n$$
• We define an $n \times n$ kernel matrix $K$ by $K_{ij} = (\Phi(x_i) \cdot \Phi(x_j)) = (\tilde{x}_i \cdot \tilde{x}_j)$
• And rewrite the eigenproblem in matrix form: $n\lambda K\alpha = K^2\alpha$

15 Kernel-based PCA

Page 16: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


COMPUTING IN FEATURE SPACE

• The feature space $F$ is of arbitrarily large and possibly infinite dimension. Computing dot products in $F$ directly is often not possible, and computationally impractical when it is.
• Solution: the “kernel trick” (Aizerman et al, 1964). Construct a kernel function
$$k(u, v) = (\Phi(u) \cdot \Phi(v))$$
• Then replace each $(\Phi(u) \cdot \Phi(v))$ with $k(u, v)$. This construction implicitly defines $\Phi$ and $F$ via $k(u, v)$.

16 Kernel-based PCA

Page 17: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


SOME POSSIBLE KERNELS

• Kernels must be continuous, symmetric, positive semi-definite.
• Some possible kernels proposed by Schölkopf et al include:
  • Dot product in the space of all $d^{\text{th}}$-order monomials: $k(u, v) = (u \cdot v)^d$
  • Radial basis functions: $k(u, v) = \exp\left(-\frac{\|u - v\|^2}{2\sigma^2}\right)$
  • Sigmoid functions: $k(u, v) = \tanh(\kappa (u \cdot v) + \Theta)$

17 Kernel-based PCA
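Not in the original deck: the three kernels listed above written out as NumPy functions, as a sketch; parameter names (`d`, `sigma`, `kappa`, `theta`) mirror the formulas. Note that the sigmoid kernel is not positive semi-definite for all parameter choices.

```python
import numpy as np

def polynomial_kernel(U, V, d=2):
    """k(u, v) = (u . v)^d : dot product in the space of all d-th order monomials."""
    return (U @ V.T) ** d

def rbf_kernel(U, V, sigma=1.0):
    """k(u, v) = exp(-||u - v||^2 / (2 sigma^2)) : radial basis function kernel."""
    sq = np.sum(U**2, 1)[:, None] + np.sum(V**2, 1)[None, :] - 2.0 * U @ V.T
    return np.exp(-sq / (2.0 * sigma**2))

def sigmoid_kernel(U, V, kappa=1.0, theta=0.0):
    """k(u, v) = tanh(kappa (u . v) + theta) : sigmoid kernel."""
    return np.tanh(kappa * (U @ V.T) + theta)
```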

Page 18: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


SOLVING THE EIGENPROBLEM

• To solve the eigenproblem $n\lambda K\alpha = K^2\alpha$, where $K_{ij} = k(x_i, x_j)$, we solve the following:
$$n\lambda\alpha = K\alpha$$
• Solutions are identical to all relevant solutions of the prior problem, as can be seen by expanding $\alpha$ in the eigenvector basis of $K$.
• Let $0 \le \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ be the complete set of eigenvalues (the values of $n\lambda$) and $\alpha^1, \alpha^2, \dots, \alpha^n$ the corresponding eigenvectors, with $\lambda_q$ the first nonzero eigenvalue.

18 Kernel-based PCA

Page 19: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


COMPUTING THE EMBEDDING

• We normalize $\lambda_q, \dots, \lambda_n$ by requiring that the corresponding vectors in $F$ be normalized: $(p^k \cdot p^k) = 1,\ \forall k = q, \dots, n$
• This translates to a normalization condition for $\alpha^q, \dots, \alpha^n$:
$$1 = \sum_{i,j=1}^{n} \alpha_i^k \alpha_j^k (\tilde{x}_i \cdot \tilde{x}_j) = \sum_{i,j=1}^{n} \alpha_i^k \alpha_j^k K_{ij} = (\alpha^k \cdot K\alpha^k) = \lambda_k (\alpha^k \cdot \alpha^k)$$
• Compute projections of a test point $x$ onto eigenvectors $p^k$:
$$(p^k \cdot \Phi(x)) = \sum_{j=1}^{n} \alpha_j^k (\Phi(x_j) \cdot \Phi(x)) = \sum_{j=1}^{n} \alpha_j^k\, k(x_j, x)$$

19 Kernel-based PCA

Page 20: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


CENTERING THE DATA

• We assumed centered data in $F$, which is unrealistic. To center our data, we must have:
$$\Phi^c(x_j) = \tilde{x}_j^c = \tilde{x}_j - \frac{1}{n}\sum_{k=1}^{n} \tilde{x}_k$$
• Then we rewrite everything in terms of $\Phi^c(x_j)$, and thus have a new kernel $K^c$, which we express in terms of $K$:
$$K^c_{ij} = (\tilde{x}_i^c \cdot \tilde{x}_j^c) = \Big(\tilde{x}_i - \tfrac{1}{n}\sum_{k=1}^{n} \tilde{x}_k\Big) \cdot \Big(\tilde{x}_j - \tfrac{1}{n}\sum_{\ell=1}^{n} \tilde{x}_\ell\Big) = (K - 1_n K - K 1_n + 1_n K 1_n)_{ij},$$
where $(1_n)_{ij} = \frac{1}{n},\ \forall i, j$.

20 Kernel-based PCA
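Not from the slides: a one-function sketch of the double-centering formula above, assuming the kernel matrix `K` has already been computed.

```python
import numpy as np

def center_kernel(K):
    """K^c = K - 1_n K - K 1_n + 1_n K 1_n, where (1_n)_ij = 1/n."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n
```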

Page 21: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


KERNEL-BASED PCA

• Extend linear PCA to nonlinear space via the kernel transformation $\Phi : \mathbb{R}^D \to F$, $x \mapsto \tilde{x}$
• Think SNR where signal lies along a curve in space
• Rotate/translate transformed axes so signal variances lie on as few axes as possible
• The kernel: $K_{ij} = (\tilde{x}_i \cdot \tilde{x}_j) = k(x_i, x_j)$
• The eigenvalue problem: $(n\lambda)\alpha = K\alpha$
• The embedding: $(p^k \cdot \Phi(x)) = \sum_{j=1}^{n} \alpha_j^k\, k(x_j, x)$

21 Kernel-based PCA
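Not part of the talk: a minimal kernel PCA sketch tying the summary together (kernel matrix, centering, the EVP $(n\lambda)\alpha = K\alpha$, normalization, and projections of the training points). The `kernel` argument could be, for instance, the hypothetical `rbf_kernel` sketched earlier.

```python
import numpy as np

def kernel_pca(X, kernel, d):
    """Kernel PCA: centered kernel matrix, its EVP, and the d-dimensional embedding."""
    n = X.shape[0]
    K = kernel(X, X)                                     # K_ij = k(x_i, x_j)
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # centered kernel
    evals, alphas = np.linalg.eigh(Kc)                   # eigenvalues of Kc (= n*lambda)
    order = np.argsort(evals)[::-1][:d]                  # keep the top d eigenpairs
    evals, alphas = evals[order], alphas[:, order]
    alphas = alphas / np.sqrt(np.maximum(evals, 1e-12))  # enforce lambda_k (a^k . a^k) = 1
    return Kc @ alphas                                   # rows: projections of the x_i
```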

Page 22: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


KERNEL-BASED PCA

[Diagram: the feature space $F$, defined by $\Phi$ and $k(u, v)$; kernel PCA as an EVP in $F$.]

22 Kernel-based PCA

[Figure 3: Embeddings from kernel PCA with the Gaussian kernel on the Swiss roll and teapot data sets; the first three principal components are shown. Different patches of the manifolds are mapped to orthogonal parts of feature space.]

[Figure 4: Results of kernel PCA applied to N = 200 images of a teapot viewed from different angles in the plane, under 180 degrees of rotation. The eigenvalues of different kernel matrices are shown, normalized by their trace; the one-dimensional embedding from SDE is also shown.]

Images from (3)

Page 23: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


LAPLACIAN EIGENMAPS

• While it provides nonlinearity, kernel-based PCA requires computation dependent on the number of points $n$, rather than on the often smaller dimension $D$ of each point.
• We thus consider Laplacian Eigenmaps (Belkin & Niyogi, 2003), which introduces sparsity in the kernel.
• This method has been shown to be a special case of kernel-based PCA by Bengio et al (2004).

23 Laplacian Eigenmaps

Page 24: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


INTRODUCING SPARSITY

• To build a sparse kernel, we build a graph from the data which samples the assumed manifold $\mathcal{M}$. The adjacency matrix is built by taking either a fixed number $m$ of nearest neighbors of a given point, or all points within an $\varepsilon$-ball of that point, as the point's nearest neighbors.
• We denote the set of nearest neighbors of a point $x_j$ by $N_j$. The adjacency matrix $A$ is then given by:
$$A_{ij} = \begin{cases} 1, & \text{if } x_i \in N_j \\ 0, & \text{if } x_i \notin N_j \end{cases}$$

24 Laplacian Eigenmaps

Page 25: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


BUILDING THE KERNEL

• We then introduce edge weights in the graph. The heat kernel is chosen due to its connection to the Laplace–Beltrami operator on the manifold, and therefore to the graph approximation of the manifold Laplacian:
$$W_{ij} = \begin{cases} \exp\left(-\frac{\|x_i - x_j\|^2}{t}\right), & x_i \in N_j \text{ or } x_j \in N_i \\ 0, & \text{otherwise} \end{cases}$$
• Note that as $t \to \infty$, $W \to A$.

25 Laplacian Eigenmaps
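Not from the slides: a brute-force sketch of the adjacency and heat-kernel weight construction just described, using a fixed number `m` of nearest neighbors; `heat_kernel_weights` is an illustrative name, and the O(n^2) distance computation is only meant for small examples.

```python
import numpy as np

def heat_kernel_weights(X, m=10, t=5.0):
    """W_ij = exp(-||x_i - x_j||^2 / t) if x_i in N_j or x_j in N_i, else 0."""
    n = X.shape[0]
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(sq, np.inf)                 # a point is not its own neighbor
    nn = np.argsort(sq, axis=1)[:, :m]           # indices of the m nearest neighbors
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), m), nn.ravel()] = True
    A = A | A.T                                  # symmetrize: x_i in N_j or x_j in N_i
    return np.where(A, np.exp(-sq / t), 0.0)
```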

Page 26: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


CONSTRUCTING THE EIGENVALUE PROBLEM

• To understand what eigenvalue problem to solve here, we must consider the optimization problem.
• First, we think of mapping the graph $G(X, E, W)$ in a simplistic sense to a line $y$ (1D) such that connected points stay as close together as possible.
• This gives the objective function:
$$\frac{1}{2}\sum_i \sum_j (y_i - y_j)^2 W_{ij}$$

26 Laplacian Eigenmaps

Page 27: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


CONSTRUCTING THE EIGENVALUE PROBLEM

• Here, we introduce the diagonal matrix $D$, where $D_{ii} = \sum_j W_{ji}$, and the graph Laplacian matrix $L = D - W$, and note that $W$ is symmetric, which allows us to rewrite the objective function in matrix-vector form:
$$\frac{1}{2}\sum_i \sum_j (y_i - y_j)^2 W_{ij} = \frac{1}{2}\sum_i \sum_j (y_i^2 + y_j^2 - 2 y_i y_j) W_{ij} = \frac{1}{2}\sum_i y_i^2 D_{ii} + \frac{1}{2}\sum_j y_j^2 D_{jj} - \sum_i \sum_j y_i y_j W_{ij} = y^T D y - y^T W y = y^T L y$$

27 Laplacian Eigenmaps
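Not in the original slides: a quick numerical check of the identity derived above, assuming a symmetric weight matrix; it is only a sanity test, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 6))
W = (W + W.T) / 2.0                      # symmetric weights
np.fill_diagonal(W, 0.0)
y = rng.standard_normal(6)

D = np.diag(W.sum(axis=0))               # D_ii = sum_j W_ji
L = D - W                                # graph Laplacian
lhs = 0.5 * np.sum(W * (y[:, None] - y[None, :]) ** 2)
rhs = y @ L @ y
assert np.isclose(lhs, rhs)              # (1/2) sum_ij (y_i - y_j)^2 W_ij = y^T L y
```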

Page 28: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


CONSTRUCTING THE EIGENVALUE PROBLEM

• So the relevant optimization problem in the 1D case becomes:
$$\arg\min_{y\,:\,y^T D y = 1} y^T L y$$
• This problem can be solved by solving the generalized eigenvalue problem $Ly = \lambda D y$ for the minimum eigenvalues.
• Note that the computation on the previous slide also shows that $L$ is positive semi-definite.

28 Laplacian Eigenmaps

Page 29: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


CONSTRUCTING THE EIGENVALUE PROBLEM

• Extending the same argument to maps into $\mathbb{R}^d$, with $F \in \mathbb{R}^{n \times d}$ and rows $f^{(i)} = [f_1^{(i)}, \dots, f_d^{(i)}]^T$, we need to minimize the objective function:
$$\sum_i \sum_j \|f^{(i)} - f^{(j)}\|^2 W_{ij} = \mathrm{tr}(F^T L F),$$
giving the minimization
$$\arg\min_{F\,:\,F^T D F = I} \mathrm{tr}(F^T L F)$$
• This problem can also be solved by solving the generalized eigenvalue problem $Lf = \lambda D f$ for the minimum eigenvalues.

29 Laplacian Eigenmaps

Page 30: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


THE EIGENVALUE PROBLEM AND THE EMBEDDING

• Thus, we solve for the minimum nonzero eigenvalue solutions of the generalized eigenvalue problem $Lf = \lambda D f$.
• We then order the eigenvalues $0 = \lambda_0 \le \lambda_1 \le \cdots \le \lambda_{n-1}$ and construct the embedding from the first $d$ corresponding eigenvectors (leaving out the zero eigenvector), giving the embedding
$$x_i \to y_i = (f_1^{(i)}, \dots, f_d^{(i)})$$

30 Laplacian Eigenmaps
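Not from the talk: a sketch of the Laplacian Eigenmaps embedding step, solving the generalized EVP with SciPy. It assumes every point has at least one neighbor (so that D is positive definite); `W` could come from the hypothetical `heat_kernel_weights` sketch above.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(W, d):
    """Solve L f = lambda D f and keep the d eigenvectors after the zero eigenvalue."""
    D = np.diag(W.sum(axis=0))      # D_ii = sum_j W_ji
    L = D - W                       # graph Laplacian
    evals, evecs = eigh(L, D)       # generalized symmetric EVP, ascending eigenvalues
    return evecs[:, 1:d + 1]        # row i is the embedded point y_i
```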

Page 31: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


LAPLACIAN EIGENMAPS

• Move away from PCA's full-matrix computations toward a graph sampling of the manifold, which allows for a sparse kernel matrix
• Point-to-point metric locally applied to preserve distances on the manifold between points
• The kernel: $W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{t}\right)$ if $x_i \in N_j$ or $x_j \in N_i$, and $0$ otherwise
• The eigenvalue problem: $Lf = \lambda D f$, with $D_{ii} = \sum_j W_{ji}$ and $L = D - W$
• The embedding: order $0 = \lambda_0 \le \lambda_1 \le \cdots \le \lambda_{n-1}$ and map $x_i \to y_i = (f_1^{(i)}, \dots, f_d^{(i)})$

31 Laplacian Eigenmaps

Page 32: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


LAPLACIAN EIGENMAPS

[Diagram: nearest-neighbor graph $G(X, E, W)$ → weights $W$ → EVP.]

32 Laplacian Eigenmaps

[Figures: Swiss roll data and its 2D Laplacian Eigenmaps embeddings for N = 5, 10, 15 nearest neighbors and heat kernel parameter t = 5.0, 25.0, ∞.]

Images from (1)

Page 33: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


AN ALTERNATIVE VIEW OF LAPLACIAN EIGENMAPS

• Although theoretically sound and able to exploit sparsity, Laplacian Eigenmaps is intuitively difficult to understand.
• Belkin & Niyogi (2003) show that Locally Linear Embedding (Roweis & Saul, 2000), which has a more intuitive geometric construction, is approximately equivalent under certain conditions.
• We will develop the LLE method, then briefly sketch the argument for approximate equivalence.

33 Locally Linear Embedding

Page 34: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


LOCALLY LINEAR EMBEDDING

• We construct the graph sampling the manifold in the same way as before, by finding nearest neighbors of each point.
• Weights for the matrix $W$ are selected by assuming that local neighborhoods of points are nearly linear, and solving a minimization problem with the cost function:
$$\sum_i \Big\| x_i - \sum_j W_{ij} x_{i_j} \Big\|^2, \quad \text{where } x_{i_j} \in N_i.$$

34 Locally Linear Embedding

Page 35: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


FINDING THE WEIGHTS

• Weights solving this minimization can be found via a closed-form expression as follows:
(1) Compute neighbor correlation matrices (and inverses): $C_{jk} = x_{i_j} \cdot x_{i_k}$, with $x_{i_j}, x_{i_k} \in N_i$
(2) Compute the Lagrange multiplier (sum-to-one constraint):
$$\lambda = \frac{\alpha}{\beta} = \frac{1 - \sum_j \sum_k C^{-1}_{jk} (x_i \cdot x_{i_k})}{\sum_j \sum_k C^{-1}_{jk}}$$
(3) Compute reconstruction weights: $W_{ij} = \sum_k C^{-1}_{jk} (x_i \cdot x_{i_k} + \lambda)$
• Nearly singular $C_{jk}$ can be preconditioned prior to computing.

35 Locally Linear Embedding
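Not part of the slides: a sketch of the weight computation. Rather than the explicit Lagrange-multiplier formula above, it solves the equivalent local Gram system C w = 1 and normalizes (the standard way to solve the same sum-to-one constrained minimization), with a small ridge term as the preconditioning mentioned in the last bullet. The `neighbors` index array is an assumed input.

```python
import numpy as np

def lle_weights(X, neighbors, reg=1e-3):
    """Reconstruction weights minimizing ||x_i - sum_j W_ij x_{i_j}||^2 with sum_j W_ij = 1.

    X         : (n, D) data;  neighbors : (n, m) integer array, neighbors[i] indexes N_i
    """
    n, m = neighbors.shape
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]              # neighbors centered on x_i
        C = Z @ Z.T                             # local Gram (correlation) matrix C_jk
        C += np.eye(m) * reg * np.trace(C)      # precondition a nearly singular C
        w = np.linalg.solve(C, np.ones(m))
        W[i, neighbors[i]] = w / w.sum()        # enforce the sum-to-one constraint
    return W
```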

Page 36: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


THE EMBEDDING

• To find the embedding, we minimize the same form, this time over the embedding coordinates $y$ with fixed weights:
$$\arg\min_{y} \sum_i \Big\| y_i - \sum_j W_{ij} y_j \Big\|^2$$
subject to constraints:
• Centering: $\sum_i y_i = 0$
• Unit covariance (to avoid degenerate solutions): $\frac{1}{n}\sum_i y_i \otimes y_i = I$

36 Locally Linear Embedding

Page 37: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


COMPUTING THE EMBEDDING

• To compute solutions to the minimization, we introduce the sparse matrix $E = (I - W)^T (I - W)$, defined entrywise by
$$E_{ij} = \delta_{ij} - W_{ij} - W_{ji} + \sum_k W_{ki} W_{kj}$$
• We then solve for eigenpairs of $E$ and take as the embedding the eigenvectors corresponding to the $d$ lowest eigenvalues, excluding the zero eigenvalue.
• Note that $E$ is symmetric and positive semi-definite.

37 Locally Linear Embedding
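Not from the slides: a sketch of the embedding step just described, taking the eigenvectors of $E = (I - W)^T(I - W)$ for the $d$ smallest nonzero eigenvalues; it pairs with the hypothetical `lle_weights` sketch above.

```python
import numpy as np

def lle_embed(W, d):
    """Embedding from LLE weights via the sparse matrix E = (I - W)^T (I - W)."""
    n = W.shape[0]
    M = np.eye(n) - W
    E = M.T @ M                        # symmetric, positive semi-definite
    evals, evecs = np.linalg.eigh(E)   # ascending eigenvalues; evals[0] ~ 0
    return evecs[:, 1:d + 1]           # discard the zero eigenvector, keep the next d
```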

Page 38: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


LOCALLY LINEAR EMBEDDING

• Another method which considers metrics used to weight a graph sampling the manifold
• Weights computed by global linear optimization over a local neighborhood around each point
• The kernel: $W_{ij} = \sum_k C^{-1}_{jk}(x_i \cdot x_{i_k} + \lambda)$, with $E = (I - W)^T (I - W)$
• The eigenvalue problem: $Ef = \lambda f$
• The embedding: order $0 = \lambda_0 \le \lambda_1 \le \cdots \le \lambda_{n-1}$ and map $x_i \to y_i = (f_1^{(i)}, \dots, f_d^{(i)})$

38 Locally Linear Embedding

Page 39: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


CONNECTION TO THE GRAPH LAPLACIAN

• We will show in three steps that, for a function $f$ on $\mathcal{M}$ (under appropriate assumptions): $Ef \approx \frac{1}{2} L^2 f$
(1) Fix a point $x_i$ and show that
$$[(I - W)f]_i \approx -\frac{1}{2}\sum_j W_{ij} (x_i - x_{i_j})^T H (x_i - x_{i_j}),$$
where $H$ is the Hessian of $f$ at $x_i$.
(2) Show that the expectation $\mathbb{E}[v^T H v] = r\, Lf$.
(3) Put steps (1) and (2) together to achieve the final result.

39 Locally Linear Embedding

Page 40: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


CONNECTION TO THE GRAPH LAPLACIAN (1)

• Show: $[(I - W)f]_i \approx -\frac{1}{2}\sum_j W_{ij} (x_i - x_{i_j})^T H (x_i - x_{i_j})$
• Consider a coordinate system in the tangent plane centered at $x_i$ and let $v_j = x_{i_j} - x_i$. This is a vector originating at $o = x_i$.
• Let $\alpha_j = W_{ij}$. Since $x_i$ is in the affine span of its neighbors (and by construction of $W$), we have
$$o = x_i = \sum_j \alpha_j v_j, \qquad \text{where } \sum_j \alpha_j = 1.$$

40 Locally Linear Embedding

Page 41: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


CONNECTION TO THE GRAPH LAPLACIAN (1)

• Assuming $f$ is sufficiently smooth, we write the 2nd order Taylor approximation
$$f(v) = f(o) + v^T \nabla f + \frac{1}{2}(v^T H v) + o(\|v\|^2),$$
where $\nabla f$ is the gradient and $H$ is the Hessian, both evaluated at $o$.

41 Locally Linear Embedding

Page 42: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


CONNECTION TO THE GRAPH LAPLACIAN (1)

• We have $[(I - W)f]_i = f(o) - \sum_j \alpha_j f(v_j)$, and using Taylor's approximation for $f(v_j)$, we can write
$$[(I - W)f]_i = f(o) - \sum_j \alpha_j f(v_j) \approx f(o) - \sum_j \alpha_j f(o) - \sum_j \alpha_j v_j^T \nabla f - \frac{1}{2}\sum_j \alpha_j (v_j^T H v_j)$$
• Since $\sum_j \alpha_j = 1$ and $\sum_j \alpha_j v_j = o$, the first three terms disappear, and
$$[(I - W)f]_i = f(o) - \sum_j \alpha_j f(v_j) \approx -\frac{1}{2}\sum_j \alpha_j v_j^T H v_j$$

42 Locally Linear Embedding

Page 43: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


CONNECTION TO THE GRAPH LAPLACIAN (2)

• Show: $v^T H v$ is proportional to $Lf$
• If $\sqrt{\alpha_j}\, v_j$ form an orthonormal basis (unusual), then $\sum_j W_{ij}\, v_j^T H v_j = \mathrm{tr}(H) = Lf$
• If not, we assume $v$ to be a random vector with uniform distribution on every sphere centered at $x_i$, and show proportionality.
• Let $e_1, \dots, e_n$ be an orthonormal basis of eigenvectors for $H$, corresponding to eigenvalues $\lambda_1, \dots, \lambda_n$.

43 Locally Linear Embedding

Page 44: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


CONNECTION TO THE GRAPH LAPLACIAN (2)

• Then, using the spectral theorem, we can write
$$\mathbb{E}[v^T H v] = \mathbb{E}\Big[\sum_i \lambda_i \langle v, e_i\rangle^2\Big]$$
• Since $\mathbb{E}[\langle v, e_i\rangle^2]$ is independent of $i$, we can replace $\mathbb{E}[\langle v, e_i\rangle^2] = r$ to get
$$\mathbb{E}[v^T H v] = r\Big(\sum_i \lambda_i\Big) = r\,\mathrm{tr}(H) = r\,Lf$$

44 Locally Linear Embedding

Page 45: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


CONNECTION TO THE GRAPH LAPLACIAN (3)

• Now, putting these together, we have
$$[(I - W)f]_i \approx -\frac{1}{2}\sum_j W_{ij}\, v_j^T H v_j, \qquad \mathbb{E}[v^T H v] = r\,Lf, \qquad (I - W)^T (I - W) f \approx \frac{1}{2} L^2 f$$
• LLE minimizes $f^T (I - W)^T (I - W) f$, which reduces to finding eigenfunctions of $(I - W)^T (I - W)$, which can now be interpreted as finding eigenfunctions of the iterated Laplacian $L^2$. Eigenfunctions of $L^2$ coincide with those of $L$.

45 Locally Linear Embedding

Page 46: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


LOCALLY LINEAR EMBEDDING

• Another method which considers metrics used to weight a graph sampling the manifold
• Weights computed by global linear optimization over a local neighborhood around each point
• The kernel: $W_{ij} = \sum_k C^{-1}_{jk}(x_i \cdot x_{i_k} + \lambda)$, with $E = (I - W)^T (I - W)$
• The eigenvalue problem: $Ef = \lambda f$
• The embedding: order $0 = \lambda_0 \le \lambda_1 \le \cdots \le \lambda_{n-1}$ and map $x_i \to y_i = (f_1^{(i)}, \dots, f_d^{(i)})$

46 Locally Linear Embedding

Page 47: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


LOCALLY LINEAR EMBEDDING

[Diagram: nearest-neighbor graph $G(X, E, W)$ → weight minimization → EVP.]

47 Locally Linear Embedding

[Figure 1 from Roweis & Saul (Science 290, 2000): the problem of nonlinear dimensionality reduction, illustrated for three-dimensional data (B) sampled from a two-dimensional manifold (A). The color coding illustrates the neighborhood-preserving mapping discovered by LLE; black outlines in (B) and (C) show the neighborhood of a single point.]

Images from (5)

Page 48: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


COMPUTATIONAL COMPLEXITY OF METHODS

Method   Computational Cost   Memory Usage
PCA      O(D^3)               O(D^2)
k-PCA    O(n^3)               O(n^2)
LLE      O(ξ n^2)             O(ξ n^2)
LE       O(ξ n^2)             O(ξ n^2)

ξ is the sparsity ratio of the kernel matrix: the number of non-zero elements divided by the total number of elements in the kernel matrix.

48 Comparison
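Not in the original slides: the sparsity ratio ξ written out, as a two-line sketch.

```python
import numpy as np

def sparsity_ratio(K):
    """xi = (# non-zero elements) / (total # elements) of the kernel matrix K."""
    return np.count_nonzero(K) / K.size
```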

Page 49: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


BENEFITS AND LIMITATIONS

Method   Benefits                         Limitations
PCA      Fast, simple                     Linear
k-PCA    Kernel choice                    Computation, kernel selection
LE       Sparse kernel, justification     Nearest neighbor search
LLE      Sparse kernel, direct solution   Nearest neighbor search

49 Comparison

Page 50: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


RESEARCH IDEAS

• Guiding principles

• Software should be built for modular use within a framework and library

• Software should be validated with real known data associated with “ground truth”

• Research directions

• Landmarks, out-of-sample extensions, low-rank update iterative methods

• Hybridization of methods and ideas

• Extension to higher order graph properties

50 Research Ideas

Page 51: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


SOFTWARE LIBRARY

• A comprehensive testbed software library and experimentation framework is needed to support manifold learning research
• Must be modular, extensible, platform agnostic
• Interpreted/scriptable languages are a good choice for experimentation: Python, MATLAB, Boo, IDL
• Previous efforts:
  • DRToolbox (MATLAB, 2007-) by van der Maaten
  • scikit.learn (Python, 2009-) by Matthieu Brucher

51 Research Ideas

Page 52: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


1) Belkin, M. and P. Niyogi, Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, Neural Comp. 15 (2003),1373-1396.

2) Schölkopf, B., A. Smola, and K. Müller, Nonlinear Component Analysis as a Kernel Eigenvalue Problem, Neural Comp. 10 (1998), 1299-1319.

3) Weinberger, K. Q., B. D. Packer, and L. K. Saul, Nonlinear Dimensionality Reduction by Semidefinite Programming and Kernel Matrix Factorization, Proc. AI and Statistics (Dec 2005), 381-388.

4) Golub, G. H. and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press (Baltimore, 1996), 70-75.

5) Roweis, S. T. and L. K. Saul, Nonlinear Dimensionality Reduction by Locally Linear Embedding, Science 290 (2000), 2323-2326.

6) Shlens, J., A Tutorial on Principal Component Analysis, Version 2 (Dec 2005).

7) van der Maaten, L., E. Postma, and J. van den Herik, Dimensionality Reduction: A Comparative Review, TiCC TR 2009-005 (Oct 2009).

52 References

Page 53: Methods of Manifold Learning for Dimension Reduction of Large Data Sets


QUESTIONS? IDEAS? THANK YOU!

53