Spectral Methods for Learning Latent Variable Models: Unsupervised and Supervised Settings

Anima Anandkumar, U.C. Irvine

Page 1:

Spectral Methods for Learning

Latent Variable Models: Unsupervised and Supervised Settings

Anima Anandkumar

U.C. Irvine

Page 2:

Learning with Big Data

Pages 3–5:

Data vs. Information

Messy Data

Missing observations, gross corruptions, outliers.

High-dimensional regime: as data grows, so does the number of variables!

Useful information: low-dimensional structures.

Learning with big data: ill-posed problem.

Learning is finding a needle in a haystack.

Learning with big data: computationally challenging!

Principled approaches for finding low dimensional structures?

Pages 6–8:

How to model information structures?

Latent variable models

Incorporate hidden or latent variables.

Information structures: Relationships between latent variables and observed data.

Basic Approach: mixtures/clusters

Hidden variable is categorical.

Advanced: Probabilistic models

Hidden variables have more general distributions.

Can model mixed membership/hierarchical groups.

[Figure: graphical model with hidden variables h1, h2, h3 and observed variables x1, . . . , x5.]

Page 9:

Latent Variable Models (LVMs)

Document modeling

Observed: words.

Hidden: topics.

Social Network Modeling

Observed: social interactions.

Hidden: communities, relationships.

Recommendation Systems

Observed: recommendations (e.g., reviews).

Hidden: user and business attributes.

Unsupervised Learning: Learn LVM without labeled examples.

Page 10:

LVM for Feature Engineering

Learn good features/representations for classification tasks, e.g., computer vision and NLP.

Sparse Coding/Dictionary Learning

Sparse representations, low dimensional hidden structures.

A few dictionary elements make complicated shapes.

Pages 11–14:

Associative Latent Variable Models

Supervised Learning

Given labeled examples {(xi, yi)}, learn a classifier y = f(x).

Associative/conditional models: p(y|x).

Example: Logistic regression: E[y|x] = σ(〈u, x〉).

Mixture of Logistic Regressions

E[y|x, h] = g(〈Uh, x〉 + 〈b, h〉)

Multi-layer/Deep Network

E[y|x] = σ_d(A_d σ_{d−1}(A_{d−1} σ_{d−2}(· · · A_2 σ_1(A_1 x))))
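As a concrete illustration, here is a minimal numpy sketch, with all names and dimensions illustrative rather than from the slides, of sampling labels from a mixture of logistic regressions: a categorical hidden variable h picks a column of U and an entry of b.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 1000               # input dim, # components, # samples

U = rng.normal(size=(d, k))         # one weight vector per mixture component
b = rng.normal(size=k)              # one bias per component
pi = np.ones(k) / k                 # mixing weights: h ~ Categorical(pi)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

x = rng.normal(size=(n, d))
h = rng.choice(k, size=n, p=pi)                  # hidden component per sample
p = sigmoid((x * U[:, h].T).sum(axis=1) + b[h])  # E[y | x, h] = sigma(<U_h, x> + b_h)
y = rng.binomial(1, p)                           # observed binary labels
```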

Page 15:

Challenges in Learning LVMs

Computational Challenges

Maximum likelihood is NP-hard in most scenarios.

Practice: Local search approaches such as Back-propagation, EM, Variational Bayes have no consistency guarantees.

Sample Complexity

Sample complexity is exponential (w.r.t. hidden variable dimension) for many learning methods.

Guaranteed and efficient learning through spectral methods

Pages 16–17:

Outline

1 Introduction

2 Spectral Methods
   Classical Matrix Methods
   Beyond Matrices: Tensors

3 Moment Tensors for Latent Variable Models
   Topic Models
   Network Community Models
   Experimental Results

4 Moment Tensors in Supervised Setting

5 Conclusion


Page 18:

Classical Spectral Methods: Matrix PCA and CCA

Unsupervised Setting: PCA
For centered samples {xi}, find a projection P with Rank(P) = k s.t.

    min_P (1/n) ∑_{i∈[n]} ‖xi − P xi‖².

Result: eigen-decomposition of S = Cov(X).

Supervised Setting: CCA
For centered samples {(xi, yi)}, find

    max_{a,b} a⊤ E[xy⊤] b / √(a⊤ E[xx⊤] a · b⊤ E[yy⊤] b).

Result: generalized eigen-decomposition.

[Figure: x and y projected onto 〈a, x〉 and 〈b, y〉.]
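Both estimators reduce to a few lines of linear algebra; here is a minimal sketch using scipy, assuming data matrices X (n × d) and Y (n × p). The function names and the small ridge term reg are illustrative additions.

```python
import numpy as np
from scipy.linalg import eigh

def pca(X, k):
    """Top-k PCA directions: eigen-decomposition of S = Cov(X)."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / len(Xc)
    evals, evecs = eigh(S)            # eigenvalues in ascending order
    return evecs[:, -k:][:, ::-1]     # top-k eigenvectors of the covariance

def cca_top_pair(X, Y, reg=1e-8):
    """Leading CCA pair (a, b) via a generalized eigen-decomposition:
    Cxy Cyy^{-1} Cyx a = rho^2 Cxx a."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = len(Xc)
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    _, vecs = eigh(M, Cxx)            # generalized eigen-decomposition
    a = vecs[:, -1]                   # top generalized eigenvector
    b = np.linalg.solve(Cyy, Cxy.T @ a)
    return a, b / np.linalg.norm(b)
```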

Pages 19–21:

Shortcomings of Matrix Methods

Learning through Spectral Clustering

Dimension reduction through PCA (on data matrix)

Clustering on projected vectors (e.g. k-means).

Basic method works only for single memberships.

Failure to cluster under small separation.

Efficient Learning Without Separation Constraints?

Page 22:

Outline

1 Introduction

2 Spectral Methods
   Classical Matrix Methods
   Beyond Matrices: Tensors

3 Moment Tensors for Latent Variable Models
   Topic Models
   Network Community Models
   Experimental Results

4 Moment Tensors in Supervised Setting

5 Conclusion

Page 23:

Beyond SVD: Spectral Methods on Tensors

How to learn the mixture models without separation constraints?

◮ PCA uses covariance matrix of data. Are higher order moments helpful?

Unified framework?

◮ Moment-based estimation of probabilistic latent variable models?

SVD gives spectral decomposition of matrices.
◮ What are the analogues for tensors?

Pages 24–25:

Moment Matrices and Tensors

Multivariate Moments in Unsupervised Setting

M1 := E[x],  M2 := E[x ⊗ x],  M3 := E[x ⊗ x ⊗ x].

Matrix

E[x ⊗ x] ∈ R^{d×d} is a second-order tensor.

E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}].

For matrices: E[x ⊗ x] = E[xx⊤].

Tensor

E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor.

E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].

Multivariate Moments in Supervised Setting

M1 := E[x], E[y];  M2 := E[x ⊗ y];  M3 := E[x ⊗ x ⊗ y].
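These moments are straightforward to estimate from samples; a minimal numpy sketch (the function name is illustrative), keeping in mind that the d × d × d third moment is only practical for moderate d:

```python
import numpy as np

def empirical_moments(X):
    """Empirical M1, M2, M3 from samples X of shape (n, d):
    M2[i1, i2] = E[x_i1 x_i2], M3[i1, i2, i3] = E[x_i1 x_i2 x_i3]."""
    n = len(X)
    M1 = X.mean(axis=0)
    M2 = np.einsum('ni,nj->ij', X, X) / n          # E[x ⊗ x]
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n   # E[x ⊗ x ⊗ x]
    return M1, M2, M3
```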

Pages 26–27:

Spectral Decomposition of Tensors

M2 = ∑_i λi ui ⊗ vi

[Figure: matrix M2 drawn as λ1 u1 ⊗ v1 + λ2 u2 ⊗ v2 + · · ·]

M3 = ∑_i λi ui ⊗ vi ⊗ wi

[Figure: tensor M3 drawn as λ1 u1 ⊗ v1 ⊗ w1 + λ2 u2 ⊗ v2 ⊗ w2 + · · ·]

u ⊗ v ⊗ w is a rank-1 tensor since its (i1, i2, i3)-th entry is u_{i1} v_{i2} w_{i3}.
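A quick numpy check of this definition (values illustrative):

```python
import numpy as np

u, v, w = np.random.randn(4), np.random.randn(4), np.random.randn(4)
T = np.einsum('i,j,k->ijk', u, v, w)          # the rank-1 tensor u ⊗ v ⊗ w
assert np.isclose(T[1, 2, 3], u[1] * v[2] * w[3])
```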

How to solve this non-convex problem?

Pages 28–31:

Decomposition of Orthogonal Tensors

M3 = ∑_i wi ai ⊗ ai ⊗ ai.

Suppose A has orthogonal columns.

M3(I, a1, a1) = ∑_i wi 〈ai, a1〉² ai = w1 a1.

ai are eigenvectors of tensor M3.

Analogous to matrix eigenvectors: Mv = M(I, v) = λv.
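A small numpy sanity check of this eigenvector property, with randomly generated orthonormal columns (all values illustrative):

```python
import numpy as np

d, k = 5, 3
A, _ = np.linalg.qr(np.random.randn(d, k))        # orthonormal columns a_i
w = np.array([2.0, 1.5, 1.0])
M3 = np.einsum('r,ir,jr,kr->ijk', w, A, A, A)     # sum_i w_i a_i ⊗ a_i ⊗ a_i

a1 = A[:, 0]
out = np.einsum('ijk,j,k->i', M3, a1, a1)         # M3(I, a1, a1)
assert np.allclose(out, w[0] * a1)                # equals w_1 a_1
```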

Two Problems

How to find eigenvectors of a tensor?

A is not orthogonal in general.

Pages 32–37:

Orthogonal Tensor Power Method

Symmetric orthogonal tensor T ∈ R^{d×d×d}:

    T = ∑_{i∈[k]} λi vi ⊗ vi ⊗ vi.

Recall the matrix power method: v ↦ M(I, v) / ‖M(I, v)‖.

Algorithm: tensor power method: v ↦ T(I, v, v) / ‖T(I, v, v)‖.

How do we avoid spurious solutions (not part of decomposition)?

• {vi}’s are the only robust fixed points.
• All other eigenvectors are saddle points.

For an orthogonal tensor, no spurious local optima!
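A minimal numpy sketch of the method, with random restarts (to land on robust fixed points) and deflation to extract all k components; restart and iteration counts are illustrative, and a practical implementation would add convergence checks:

```python
import numpy as np

def tensor_power_method(T, n_restarts=10, n_iters=100):
    """One eigenpair of a symmetric orthogonal tensor T via the update
    v <- T(I, v, v) / ||T(I, v, v)||, keeping the best of several restarts."""
    d = T.shape[0]
    best_lam, best_v = -np.inf, None
    for _ in range(n_restarts):
        v = np.random.randn(d)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            Tv = np.einsum('ijk,j,k->i', T, v, v)    # T(I, v, v)
            v = Tv / np.linalg.norm(Tv)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)   # eigenvalue T(v, v, v)
        if lam > best_lam:
            best_lam, best_v = lam, v
    return best_lam, best_v

def decompose(T, k):
    """Recover k components by repeated power iterations plus deflation."""
    lams, vecs = [], []
    for _ in range(k):
        lam, v = tensor_power_method(T)
        lams.append(lam)
        vecs.append(v)
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)   # deflate
    return np.array(lams), np.array(vecs).T             # (k,), (d, k)
```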

Page 38:

Whitening: Conversion to Orthogonal Tensor

M3 = ∑_i wi ai ⊗ ai ⊗ ai,  M2 = ∑_i wi ai ⊗ ai.

Find whitening matrix W s.t. W⊤A = V is an orthogonal matrix.

When A ∈ R^{d×k} has full column rank, it is an invertible transformation.

[Figure: W maps a1, a2, a3 to orthogonal v1, v2, v3.]

Use pairwise moments M2 to find W.

SVD of M2 is needed.
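A minimal sketch of computing W from M2, assuming M2 is positive semi-definite with rank at least k (the function name is illustrative):

```python
import numpy as np

def whitening_matrix(M2, k):
    """W such that W^T M2 W = I_k, from the top-k eigen-decomposition of M2.
    With M2 = A diag(w) A^T, the columns of W^T A diag(w)^{1/2} are orthonormal."""
    evals, evecs = np.linalg.eigh(M2)          # ascending eigenvalues
    evals, evecs = evals[-k:], evecs[:, -k:]   # top-k spectrum
    return evecs / np.sqrt(evals)              # W = U_k diag(s_k)^{-1/2}
```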

Pages 39–41:

Putting it together

Non-orthogonal tensor: M3 = ∑_i wi ai ⊗ ai ⊗ ai,  M2 = ∑_i wi ai ⊗ ai.

Whitening matrix W; multilinear transform: T = M3(W, W, W).

[Figure: whitening maps tensor M3 (components a1, a2, a3) to orthogonal tensor T (components v1, v2, v3).]

Tensor Decomposition: Guaranteed Non-Convex Optimization!
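Under these assumptions the whole pipeline fits in a few lines of numpy, reusing the whitening_matrix and decompose sketches from earlier; with orthonormal whitened components, λi = 1/√wi, which gives the un-whitening step below:

```python
import numpy as np

def learn_from_moments(M2, M3, k):
    """Sketch of the pipeline: whiten M3 using W from M2, decompose the
    orthogonal tensor T = M3(W, W, W), then un-whiten to recover w_i, a_i."""
    W = whitening_matrix(M2, k)
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)   # T = M3(W, W, W)
    lams, V = decompose(T, k)
    # Whitened components satisfy v_i = sqrt(w_i) W^T a_i and lam_i = w_i^{-1/2},
    # so w_i = 1 / lam_i^2 and a_i = lam_i * pinv(W)^T v_i.
    A_hat = np.linalg.pinv(W).T @ V * lams
    w_hat = 1.0 / lams**2
    return w_hat, A_hat
```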

For what latent variable models can we obtain M2 and M3 forms?

Page 42:

Outline

1 Introduction

2 Spectral Methods
   Classical Matrix Methods
   Beyond Matrices: Tensors

3 Moment Tensors for Latent Variable Models
   Topic Models
   Network Community Models
   Experimental Results

4 Moment Tensors in Supervised Setting

5 Conclusion

Page 43:

Types of Latent Variable Models

What is the form of hidden variables h?

Basic Approach: mixtures/clusters

Hidden variable h is categorical.

Advanced: Probabilistic models

Hidden variable h has more general distributions.

Can model mixed memberships, e.g. Dirichlet distribution.

[Figure: graphical model with hidden variables h1, h2, h3 and observed variables x1, . . . , x5.]

Page 44:

Outline

1 Introduction

2 Spectral Methods
   Classical Matrix Methods
   Beyond Matrices: Tensors

3 Moment Tensors for Latent Variable Models
   Topic Models
   Network Community Models
   Experimental Results

4 Moment Tensors in Supervised Setting

5 Conclusion

Page 45:

Topic Modeling

Page 46:

Geometric Picture for Topic Models

[Figure: a document represented by its topic-proportions vector h on the simplex.]

Pages 47–49:

Geometric Picture for Topic Models

[Figure: single topic h at a vertex of the simplex; word generation x1, x2, . . . through A.]

Linear model: E[xi | h] = Ah.

Pages 50–51:

Moments for Single Topic Models

E[xi | h] = Ah.

w := E[h].

Learn topic-word matrix A and vector w.

[Figure: words x1, . . . , x5 generated from topic h through A.]

Pairwise Co-occurrence Matrix M2

    M2 := E[x1 ⊗ x2] = E[E[x1 ⊗ x2 | h]] = ∑_{i=1}^{k} wi ai ⊗ ai

Triples Tensor M3

    M3 := E[x1 ⊗ x2 ⊗ x3] = E[E[x1 ⊗ x2 ⊗ x3 | h]] = ∑_{i=1}^{k} wi ai ⊗ ai ⊗ ai
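With one-hot encoded words these moments are just (co-)occurrence counts; a minimal sketch over the first three words of each document (function name and data layout illustrative):

```python
import numpy as np

def single_topic_moments(docs, d):
    """Empirical M2, M3 for the single-topic model from word-id triples.
    docs: list of documents, each a sequence of word ids in [0, d)."""
    M2 = np.zeros((d, d))
    M3 = np.zeros((d, d, d))
    for doc in docs:
        w1, w2, w3 = doc[0], doc[1], doc[2]   # x1, x2, x3 as one-hot vectors
        M2[w1, w2] += 1.0                     # x1 ⊗ x2
        M3[w1, w2, w3] += 1.0                 # x1 ⊗ x2 ⊗ x3
    return M2 / len(docs), M3 / len(docs)
```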

Page 52:

Moments under LDA

M2 := E[x1 ⊗ x2] − (α0/(α0 + 1)) E[x1] ⊗ E[x1]

M3 := E[x1 ⊗ x2 ⊗ x3] − (α0/(α0 + 2)) E[x1 ⊗ x2 ⊗ E[x1]] − more stuff . . .

Then

    M2 = ∑ wi ai ⊗ ai,  M3 = ∑ wi ai ⊗ ai ⊗ ai.

Three words per document suffice for learning LDA.

Similar forms for HMM, ICA, sparse coding etc.

“Tensor Decompositions for Learning Latent Variable Models” by A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade and M. Telgarsky. JMLR 2014.

Page 53:

Outline

1 Introduction

2 Spectral Methods
   Classical Matrix Methods
   Beyond Matrices: Tensors

3 Moment Tensors for Latent Variable Models
   Topic Models
   Network Community Models
   Experimental Results

4 Moment Tensors in Supervised Setting

5 Conclusion

Pages 54–59:

Network Community Models

[Figures: example mixed-membership vectors for nodes, e.g. (0.4, 0.3, 0.3), (0.7, 0.2, 0.1), (0.1, 0.8, 0.1), and example edge probabilities 0.9 and 0.1.]

Pages 60–62:

Subgraph Counts as Graph Moments

3-Star Count Tensor

    M3(a, b, c) = (1/|X|) · (# of common neighbors of a, b, c in X)
                = (1/|X|) ∑_{x∈X} G(x, a) G(x, b) G(x, c).

    M3 = (1/|X|) ∑_{x∈X} [G⊤_{x,A} ⊗ G⊤_{x,B} ⊗ G⊤_{x,C}]

[Figure: a 3-star: node x ∈ X connected to a ∈ A, b ∈ B, c ∈ C.]
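A minimal numpy sketch of this count, assuming a 0/1 adjacency matrix G and arrays of node ids X, A, B, C as in the slide (the function name is illustrative):

```python
import numpy as np

def three_star_tensor(G, X, A, B, C):
    """M3[a, b, c] = (1/|X|) * sum_{x in X} G[x, a] G[x, b] G[x, c],
    the 3-star count tensor restricted to partitions A, B, C."""
    GA, GB, GC = G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]
    return np.einsum('xa,xb,xc->abc', GA, GB, GC) / len(X)
```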

“A Tensor Spectral Approach to Learning Mixed Membership Community Models” by A. Anandkumar, R. Ge, D. Hsu, and S.M. Kakade. COLT 2013.

Page 63:

Outline

1 Introduction

2 Spectral Methods
   Classical Matrix Methods
   Beyond Matrices: Tensors

3 Moment Tensors for Latent Variable Models
   Topic Models
   Network Community Models
   Experimental Results

4 Moment Tensors in Supervised Setting

5 Conclusion

Page 64:

Computational Complexity (k ≪ n)

n = # of nodes

N = # of iterations

k = # of communities.

c = # of cores.

        Whiten          STGD       Unwhiten
Space   O(nk)           O(k²)      O(nk)
Time    O(nsk/c + k³)   O(Nk³/c)   O(nsk/c)

Whiten: matrix/vector products and SVD.

STGD: Stochastic Tensor Gradient Descent

Unwhiten: matrix/vector products

Our approach: O(nsk/c + k³).

Embarrassingly Parallel and fast!

Page 65:

Tensor Decomposition on GPUs

[Figure: running time (secs) vs. number of communities k on log-log axes, comparing MATLAB Tensor Toolbox (CPU), CULA Standard Interface (GPU), CULA Device Interface (GPU), and Eigen Sparse (CPU).]

Page 66:

Summary of Results

Datasets:

    Facebook (user–friend graph): n ∼ 20k
    Yelp (user–business reviews): n ∼ 40k
    DBLP (author–coauthor graph): n ∼ 1 million (sub-sampled: ∼ 100k)

Error (E) and Recovery ratio (R)

Dataset            k    Method       Running Time  E       R
Facebook (k=360)   500  ours         468           0.0175  100%
Facebook (k=360)   500  variational  86,808        0.0308  100%
Yelp (k=159)       100  ours         287           0.046   86%
Yelp (k=159)       100  variational  N.A.
DBLP sub (k=250)   500  ours         10,157        0.139   89%
DBLP sub (k=250)   500  variational  558,723       16.38   99%
DBLP (k=6000)      100  ours         5,407         0.105   95%

Thanks to Prem Gopalan and David Mimno for providing variational code.

Pages 67–68:

Experimental Results on Yelp

Lowest error business categories & largest weight businesses

Rank  Category        Business                   Stars  Review Counts
1     Latin American  Salvadoreno Restaurant     4.0    36
2     Gluten Free     P.F. Chang’s China Bistro  3.5    55
3     Hobby Shops     Make Meaning               4.5    14
4     Mass Media      KJZZ 91.5FM                4.0    13
5     Yoga            Sutra Midtown              4.5    31

Bridgeness: Distance from vector [1/k, . . . , 1/k]⊤

Top-5 bridging nodes (businesses)

Business              Categories
Four Peaks Brewing    Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco       Restaurants, Pizza, Phoenix
FEZ                   Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt’s Big Breakfast  Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co      Restaurants, Bars, Nightlife, Pubs, Tempe

Page 69:

Outline

1 Introduction

2 Spectral Methods
   Classical Matrix Methods
   Beyond Matrices: Tensors

3 Moment Tensors for Latent Variable Models
   Topic Models
   Network Community Models
   Experimental Results

4 Moment Tensors in Supervised Setting

5 Conclusion

Page 70:

Moment Tensors for Associative Models

Multivariate Moments: Many possibilities...

E[x ⊗ y],  E[x ⊗ x ⊗ y],  E[ψ(x) ⊗ y], . . .

Feature Transformations of the Input: x ↦ ψ(x)

How to exploit them?

Are moments E[ψ(x) ⊗ y] useful?

If ψ(x) is a matrix/tensor, we have matrix/tensor moments.

Can carry out spectral decomposition of the moments.

Page 71:

Score Function Features

Higher order score function: S_m(x) := (−1)^m ∇^{(m)} p(x) / p(x)

∗ Can be a matrix or a tensor instead of a vector.
∗ Derivative w.r.t. parameter or input.

Form the cross-moments: E[y · S_m(x)].

Extension of Stein’s lemma: E[y · S_m(x)] = E[∇^{(m)} G(x)] when E[y|x] := G(x).

Spectral decomposition: E[∇^{(m)} G(x)] = ∑_{j∈[k]} u_j^{⊗m}.

Can be applied for learning of associative latent variable models.
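As a concrete instance: for x ~ N(0, I), the score functions are the multivariate Hermite polynomials, e.g. S3(x) = x ⊗ x ⊗ x minus the symmetrized x ⊗ I terms. A minimal sketch of S3 and the empirical cross-moment (function names illustrative):

```python
import numpy as np

def gaussian_score_3(x):
    """S3(x) = (-1)^3 ∇^(3) p(x) / p(x) for x ~ N(0, I): the third
    multivariate Hermite polynomial x⊗x⊗x − sym(x ⊗ I)."""
    I = np.eye(len(x))
    T = np.einsum('i,j,k->ijk', x, x, x)
    T -= np.einsum('i,jk->ijk', x, I)   # x_i δ_jk
    T -= np.einsum('j,ik->ijk', x, I)   # x_j δ_ik
    T -= np.einsum('k,ij->ijk', x, I)   # x_k δ_ij
    return T

def cross_moment_3(X, y):
    """Empirical E[y · S3(x)] from samples X (n × d) and labels y (n,)."""
    return sum(yi * gaussian_score_3(xi) for xi, yi in zip(X, y)) / len(y)
```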

Page 72:

Learning Deep Neural Networks

Realizable Setting: E[y|x] = σ_d(A_d σ_{d−1}(A_{d−1} σ_{d−2}(· · · A_2 σ_1(A_1 x))))

    M3 = E[y · S3(x)] = ∑_{i∈[r]} λi · u_i^{⊗3}

where the u_i = e_i⊤ A1 are the rows of A1.

Guaranteed learning of weights (layer-by-layer) via tensor decomposition.

Similar guarantees for learning mixtures of classifiers.

Page 73:

Automated Extraction of Discriminative Features

Page 74:

Outline

1 Introduction

2 Spectral Methods
   Classical Matrix Methods
   Beyond Matrices: Tensors

3 Moment Tensors for Latent Variable Models
   Topic Models
   Network Community Models
   Experimental Results

4 Moment Tensors in Supervised Setting

5 Conclusion

Page 75:

Conclusion: Guaranteed Non-Convex Optimization

Tensor Decomposition

Efficient sample and computational complexities

Better performance compared to EM, Variational Bayes etc.

In practice

Scalable and embarrassingly parallel: handle large datasets.

Strong empirical performance under perplexity and ground-truth validation.

Related Topics

Overcomplete Tensor Decomposition: neural networks, sparse coding and ICA models tend to be overcomplete (more neurons than input dimensions).

Provable Non-Convex Iterative Methods: Robust PCA, dictionary learning, etc.

Page 76:

My Research Group and Resources

Furong Huang Majid Janzamin Hanie Sedghi

Niranjan UN Forough Arabshahi

ML summer school lectures available at http://newport.eecs.uci.edu/anandkumar/MLSS.html