
Transcript of MLconf NYC Animashree Anandkumar

Page 1: MLconf NYC Animashree Anandkumar

Tensor Decompositions for Guaranteed Learning of Latent Variable Models

Anima Anandkumar

U.C. Irvine

Page 2: MLconf NYC Animashree Anandkumar

Application 1: Topic Modeling

Document modeling

Observed: words in document corpus.

Hidden: topics.

Goal: carry out document summarization.

Page 3: MLconf NYC Animashree Anandkumar

Application 2: Understanding Human Communities

Social Networks

Observed: network of social ties, e.g. friendships, co-authorships

Hidden: groups/communities of actors.

Page 4: MLconf NYC Animashree Anandkumar

Application 3: Recommender Systems

Recommender System

Observed: Ratings of users for various products, e.g. Yelp reviews.

Goal: Predict new recommendations.

Modeling: Find groups/communities of users and products.

Page 5: MLconf NYC Animashree Anandkumar

Application 4: Feature Learning

Feature Engineering

Learn good features/representations for classification tasks, e.g. image and speech recognition.

Sparse representations, low dimensional hidden structures.

Page 6: MLconf NYC Animashree Anandkumar

Application 5: Computational Biology

Observed: gene expression levels

Goal: discover gene groups

Hidden variables: regulators controlling gene groups

“Unsupervised Learning of Transcriptional Regulatory Networks via Latent Tree Graphical Model” by A. Gitter, F. Huang, R. Valluvan, E. Fraenkel and A. Anandkumar. Submitted to BMC Bioinformatics, Jan. 2014.

Page 7: MLconf NYC Animashree Anandkumar

Statistical Framework

In all applications: discover hidden structure in data: unsupervised learning.

Latent Variable Models

Concise statistical description through graphical modeling

Conditional independence relationships or hierarchy of variables.

[Diagram: observed variable x, hidden variable h]

Page 8: MLconf NYC Animashree Anandkumar

Statistical Framework

In all applications: discover hidden structure in data: unsupervised learning.

Latent Variable Models

Concise statistical description through graphical modeling

Conditional independence relationships or hierarchy of variables.

[Diagram: observed variables x1, ..., x5, hidden variable h]

Page 9: MLconf NYC Animashree Anandkumar

Statistical Framework

In all applications: discover hidden structure in data: unsupervised learning.

Latent Variable Models

Concise statistical description through graphical modeling

Conditional independence relationships or hierarchy of variables.

[Diagram: observed variables x1, ..., x5, hidden variables h1, h2, h3]

Page 10: MLconf NYC Animashree Anandkumar

Computational Framework

Challenge: Efficient Learning of Latent Variable Models

Maximum likelihood is NP-hard.

Practice: EM, Variational Bayes have no consistency guarantees.

Efficient computational and sample complexities?

Page 11: MLconf NYC Animashree Anandkumar

Computational Framework

Challenge: Efficient Learning of Latent Variable Models

Maximum likelihood is NP-hard.

Practice: EM, Variational Bayes have no consistency guarantees.

Efficient computational and sample complexities?

Fast methods such as matrix factorization are not statistical. We cannot learn the latent variable model through such methods.

Page 12: MLconf NYC Animashree Anandkumar

Computational Framework

Challenge: Efficient Learning of Latent Variable Models

Maximum likelihood is NP-hard.

Practice: EM, Variational Bayes have no consistency guarantees.

Efficient computational and sample complexities?

Fast methods such as matrix factorization are not statistical. We cannot learn the latent variable model through such methods.

Tensor-based Estimation

Estimate moment tensors from data: higher order relationships.

Compute decomposition of moment tensor.

Iterative updates, e.g. tensor power iterations, alternating minimization.

Non-convex: convergence to a local optimum. No guarantees.

Page 13: MLconf NYC Animashree Anandkumar

Computational Framework

Challenge: Efficient Learning of Latent Variable Models

Maximum likelihood is NP-hard.

Practice: EM, Variational Bayes have no consistency guarantees.

Efficient computational and sample complexities?

Fast methods such as matrix factorization are not statistical. We cannot learn the latent variable model through such methods.

Tensor-based Estimation

Estimate moment tensors from data: higher order relationships.

Compute decomposition of moment tensor.

Iterative updates, e.g. tensor power iterations, alternating minimization.

Non-convex: convergence to a local optimum. No guarantees.

Innovation: Guaranteed convergence to correct model.

Page 14: MLconf NYC Animashree Anandkumar

Computational Framework

Challenge: Efficient Learning of Latent Variable Models

Maximum likelihood is NP-hard.

Practice: EM, Variational Bayes have no consistency guarantees.

Efficient computational and sample complexities?

Fast methods such as matrix factorization are not statistical. We cannot learn the latent variable model through such methods.

Tensor-based Estimation

Estimate moment tensors from data: higher order relationships.

Compute decomposition of moment tensor.

Iterative updates, e.g. tensor power iterations, alternating minimization.

Non-convex: convergence to a local optimum. No guarantees.

Innovation: Guaranteed convergence to correct model.

In this talk: tensor decompositions and applications

Page 15: MLconf NYC Animashree Anandkumar

Outline

1 Introduction

2 Topic Models

3 Efficient Tensor Decomposition

4 Experimental Results

5 Conclusion

Page 16: MLconf NYC Animashree Anandkumar

Topic Models: Bag of Words

Page 17: MLconf NYC Animashree Anandkumar

Probabilistic Topic Models

Bag of words: order of words does not matter

Graphical model representation

ℓ words in a document: x1, . . . , xℓ.

h: proportions of topics in a document.

Word xi generated from topic yi.

A(i, j) := P[xm = i | ym = j]: topic-word matrix.

[Diagram: topic mixture h generates topics y1, . . . , y5, which generate words x1, . . . , x5 through the topic-word matrix A]
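To make this generative picture concrete, here is a minimal sketch (not the speaker's code; all variable names are illustrative) that samples a bag-of-words document from a single-topic model with topic-word matrix A:

```python
# Minimal sketch of a single-topic bag-of-words generator (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, k, n_words = 1000, 5, 50                  # vocabulary size, number of topics, words per document

A = rng.dirichlet(np.ones(d), size=k).T      # d x k topic-word matrix; column r is a_r
lam = rng.dirichlet(np.ones(k))              # topic probabilities, lambda_r = P[h = r]

h = rng.choice(k, p=lam)                     # hidden topic of this document
words = rng.choice(d, size=n_words, p=A[:, h])   # observed words x_1, ..., x_l
```

Encoding each word xi as a one-hot vector gives E[xi | h] = a_h, the linear form E[xi | h] = Ah used on the next slides.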

Page 18: MLconf NYC Animashree Anandkumar

Geometric Picture for Topic Models

Topic proportions vector (h)

Document

Linear Model: E[xi | h] = Ah.

Multiview model: h is fixed and multiple words (xi) are generated.

Page 19: MLconf NYC Animashree Anandkumar

Geometric Picture for Topic Models

Single topic (h)

Linear Model: E[xi | h] = Ah.

Multiview model: h is fixed and multiple words (xi) are generated.

Page 20: MLconf NYC Animashree Anandkumar

Geometric Picture for Topic Models

Topic proportions vector (h)

Linear Model: E[xi | h] = Ah.

Multiview model: h is fixed and multiple words (xi) are generated.

Page 21: MLconf NYC Animashree Anandkumar

Geometric Picture for Topic Models

Topic proportions vector (h)

[Diagram: word generation x1, x2, x3, . . . from the topic proportions vector h through A]

Linear Model: E[xi | h] = Ah.

Multiview model: h is fixed and multiple words (xi) are generated.

Page 22: MLconf NYC Animashree Anandkumar

Moment Tensors

Consider the single topic model.

E[xi | h] = Ah.   λi := [E[h]]i.

Learn topic-word matrix A, vector λ = P[h].

M2: Co-occurrence of two words in a document

M2 := E[x1 x2⊤] = E[ E[x1 x2⊤ | h] ] = A E[hh⊤] A⊤ = ∑_{r=1}^k λr ar ar⊤

Page 23: MLconf NYC Animashree Anandkumar

Moment Tensors

Consider the single topic model.

E[xi | h] = Ah.   λi := [E[h]]i.

Learn topic-word matrix A, vector λ = P[h].

M2: Co-occurrence of two words in a document

M2 := E[x1 x2⊤] = E[ E[x1 x2⊤ | h] ] = A E[hh⊤] A⊤ = ∑_{r=1}^k λr ar ar⊤

Tensor M3: Co-occurrence of three words

M3 := E(x1 ⊗ x2 ⊗ x3) = ∑_r λr ar ⊗ ar ⊗ ar

Page 24: MLconf NYC Animashree Anandkumar

Moment Tensors

Consider the single topic model.

E[xi | h] = Ah.   λi := [E[h]]i.

Learn topic-word matrix A, vector λ = P[h].

M2: Co-occurrence of two words in a document

M2 := E[x1 x2⊤] = E[ E[x1 x2⊤ | h] ] = A E[hh⊤] A⊤ = ∑_{r=1}^k λr ar ar⊤

Tensor M3: Co-occurrence of three words

M3 := E(x1 ⊗ x2 ⊗ x3) = ∑_r λr ar ⊗ ar ⊗ ar

Matrix and Tensor Forms: ar := rth column of A.

M2 = ∑_{r=1}^k λr ar ⊗ ar.   M3 = ∑_{r=1}^k λr ar ⊗ ar ⊗ ar
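A minimal sketch of how these moments could be estimated empirically (my own illustrative code, assuming documents generated as in the earlier sketch with words stored as integer indices):

```python
# Sketch: empirical estimates of M2 and M3 from the first three words of each document.
import numpy as np

def empirical_moments(docs, d):
    """docs: list of integer word-index sequences; d: vocabulary size."""
    M2 = np.zeros((d, d))
    M3 = np.zeros((d, d, d))
    for doc in docs:
        x1, x2, x3 = doc[0], doc[1], doc[2]   # three distinct word positions
        M2[x1, x2] += 1.0
        M3[x1, x2, x3] += 1.0
    return M2 / len(docs), M3 / len(docs)

# As the number of documents grows, M2 -> sum_r lambda_r a_r a_r^T and
# M3 -> sum_r lambda_r a_r (x) a_r (x) a_r, matching the forms above.
```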

Page 25: MLconf NYC Animashree Anandkumar

Tensor Decomposition Problem

M2 = ∑_{r=1}^k λr ar ⊗ ar.   M3 = ∑_{r=1}^k λr ar ⊗ ar ⊗ ar

[Diagram: Tensor M3 = λ1 a1 ⊗ a1 ⊗ a1 + λ2 a2 ⊗ a2 ⊗ a2 + . . .]

u ⊗ v ⊗ w is a rank-1 tensor whose (i, j, k)th entry is ui vj wk.

k topics, d words in vocabulary.

M3: d × d × d tensor, rank k.

Learning Topic Models through Tensor Decomposition
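The rank-1 structure is easy to check numerically; the following sketch (illustrative only, with made-up dimensions) builds u ⊗ v ⊗ w and a rank-k symmetric tensor of the form above:

```python
# Sketch: rank-1 tensors and a rank-k symmetric tensor, via outer products.
import numpy as np

u, v, w = np.random.rand(4), np.random.rand(4), np.random.rand(4)
T1 = np.einsum('i,j,k->ijk', u, v, w)             # (i, j, k) entry is u_i * v_j * w_k
assert np.isclose(T1[1, 2, 3], u[1] * v[2] * w[3])

d, k = 6, 3
A = np.random.rand(d, k)                          # columns a_1, ..., a_k
lam = np.random.rand(k)
M3 = np.einsum('r,ir,jr,kr->ijk', lam, A, A, A)   # sum_r lambda_r a_r (x) a_r (x) a_r
```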

Page 26: MLconf NYC Animashree Anandkumar

Detecting Communities in Networks

Page 27: MLconf NYC Animashree Anandkumar

Detecting Communities in Networks

Stochastic Block Model

Non-overlapping

Page 28: MLconf NYC Animashree Anandkumar

Detecting Communities in Networks

Stochastic Block Model

Non-overlapping

Mixed Membership Model

Overlapping

Page 29: MLconf NYC Animashree Anandkumar

Detecting Communities in Networks

Stochastic Block Model

Non-overlapping

Mixed Membership Model

Overlapping

Page 30: MLconf NYC Animashree Anandkumar

Detecting Communities in Networks

Stochastic Block Model

Non-overlapping

Mixed Membership Model

Overlapping

Unifying Assumption

Edges conditionally independent given community memberships

Page 31: MLconf NYC Animashree Anandkumar

Multi-view Mixture Models

Page 32: MLconf NYC Animashree Anandkumar

Tensor Forms in Other Models

Independent Component Analysis

Independent sources, unknown mixing.

Blind source separation of speech, images, and video.

[Diagram: independent sources h1, h2, . . . , hk mixed through A into observations x1, x2, . . . , xd]

Gaussian Mixtures

Hidden Markov Models / Latent Trees

[Diagram: hidden variables h1, h2, h3 over observed variables x1, . . . , x5]

Reduction to similar moment forms

Page 33: MLconf NYC Animashree Anandkumar

Outline

1 Introduction

2 Topic Models

3 Efficient Tensor Decomposition

4 Experimental Results

5 Conclusion

Page 34: MLconf NYC Animashree Anandkumar

Tensor Decomposition Problem

M3 = ∑_{r=1}^k λr ar ⊗ ar ⊗ ar

[Diagram: Tensor M3 = λ1 a1 ⊗ a1 ⊗ a1 + λ2 a2 ⊗ a2 ⊗ a2 + . . .]

u ⊗ v ⊗ w is a rank-1 tensor whose (i, j, k)th entry is ui vj wk.

k topics, d words in vocabulary.

M3: d × d × d tensor, rank k.

d: vocabulary size for topic models, or n: size of the network for community models.

Page 35: MLconf NYC Animashree Anandkumar

Dimensionality Reduction for Tensor Decomposition

M3 = ∑_{r=1}^k λr ar ⊗ ar ⊗ ar

Dimensionality Reduction (Whitening)

Convert M3 of size d × d × d to a tensor T of size k × k × k.

Carry out decomposition of T.

[Diagram: Tensor M3 → Tensor T]

Dimensionality reduction through multi-linear transforms

Computed from data, e.g. pairwise moments.

T = ∑_i ρi ri⊗3 is a symmetric orthogonal tensor: {ri} are orthonormal.
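A minimal sketch of this whitening step (my own illustration; it assumes M2 and M3 in the moment forms above and a known rank k):

```python
# Sketch: whiten M3 using the pairwise moment M2 to get a k x k x k tensor T.
import numpy as np

def whiten(M2, M3, k):
    evals, evecs = np.linalg.eigh(M2)
    idx = np.argsort(evals)[::-1][:k]                 # top-k eigenpairs of M2
    W = evecs[:, idx] / np.sqrt(evals[idx])           # d x k matrix with W^T M2 W = I_k
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)   # multi-linear transform M3(W, W, W)
    return T, W
```

Under the model, the resulting T = ∑_i ρi ri⊗3 with orthonormal {ri}, so the small k × k × k tensor can be decomposed by the power iterations on the next slides.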

Page 36: MLconf NYC Animashree Anandkumar

Orthogonal/Eigen Decomposition

Orthogonal symmetric tensor: T = ∑_{j∈[k]} ρj rj⊗3

T(I, r1, r1) = ∑_{j∈[k]} ρj ⟨r1, rj⟩² rj = ρ1 r1

Page 37: MLconf NYC Animashree Anandkumar

Orthogonal/Eigen Decomposition

Orthogonal symmetric tensor: T = ∑_{j∈[k]} ρj rj⊗3

T(I, r1, r1) = ∑_{j∈[k]} ρj ⟨r1, rj⟩² rj = ρ1 r1

Obtaining eigenvectors through power iterations

u ↦ T(I, u, u) / ‖T(I, u, u)‖

Page 38: MLconf NYC Animashree Anandkumar

Orthogonal/Eigen Decomposition

Orthogonal symmetric tensor: T = ∑_{j∈[k]} ρj rj⊗3

T(I, r1, r1) = ∑_{j∈[k]} ρj ⟨r1, rj⟩² rj = ρ1 r1

Obtaining eigenvectors through power iterations

u ↦ T(I, u, u) / ‖T(I, u, u)‖

Basic Algorithm

Random initialization, run power iterations and deflate
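A minimal sketch of this basic algorithm (illustrative only, not the released implementation); in practice one would add several random restarts per eigenvector:

```python
# Sketch: tensor power method with deflation on a k x k x k orthogonal tensor T.
import numpy as np

def tensor_power_method(T, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    T = T.copy()
    eigvals, eigvecs = [], []
    for _ in range(k):
        u = rng.standard_normal(T.shape[0])
        u /= np.linalg.norm(u)                       # random initialization
        for _ in range(n_iter):
            u = np.einsum('ijk,j,k->i', T, u, u)     # u <- T(I, u, u)
            u /= np.linalg.norm(u)                   # normalize
        rho = np.einsum('ijk,i,j,k->', T, u, u, u)   # eigenvalue rho = T(u, u, u)
        eigvals.append(rho)
        eigvecs.append(u)
        T = T - rho * np.einsum('i,j,k->ijk', u, u, u)   # deflate the recovered component
    return np.array(eigvals), np.array(eigvecs)
```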

Page 39: MLconf NYC Animashree Anandkumar

Practical Considerations

k communities, n nodes, k ≪ n.

Steps

k-SVD of n × n matrix: randomized techniques

Online k × k × k tensor decomposition: No tensor explicitly formed.

Parallelization: Inherently parallelizable, GPU deployment.

Sparse implementation: real-world networks are sparse.

Validation metric: p-value test based on “soft-pairing”.

Parallel time complexity: O(nsk/c + k³), where s is the maximum degree in the graph and c is the number of cores.

Huang, Niranjan, Hakeem and Anandkumar, “Fast Detection of Overlapping Communities via Online Tensor Methods,” Preprint, Sept. 2013.

Page 40: MLconf NYC Animashree Anandkumar

Scaling Of The Stochastic Iterations

vi^{t+1} ← vi^t − 3θ βt ∑_{j=1}^k ⟨vj^t, vi^t⟩² vj^t + βt ⟨vi^t, yA^t⟩ ⟨vi^t, yB^t⟩ yC^t + . . .

Parallelize across eigenvectors.

STGD is iterative: device code reuses buffers for updates.

[Diagram: CPU/GPU data transfer of vi^t and yA^t, yB^t, yC^t under the Standard Interface vs. the Device Interface]
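A minimal sketch of one such stochastic update, following my reading of the displayed rule (θ and βt are step-size parameters, yA^t, yB^t, yC^t the whitened samples; the trailing symmetrized terms indicated by ". . ." are omitted):

```python
# Sketch: one stochastic tensor gradient (STGD) step over all k eigenvector estimates.
import numpy as np

def stgd_step(V, yA, yB, yC, beta_t, theta=1.0):
    """V: k x k array whose rows are the current estimates v_i^t."""
    G = V @ V.T                                   # G[i, j] = <v_i, v_j>
    penalty = (G ** 2) @ V                        # row i: sum_j <v_j, v_i>^2 v_j
    signal = (V @ yA) * (V @ yB)                  # entry i: <v_i, yA> * <v_i, yB>
    return V - 3.0 * theta * beta_t * penalty + beta_t * np.outer(signal, yC)
```

Each row's update depends only on inner products with the shared quantities, which is what makes the step easy to parallelize across eigenvectors on a GPU.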

Page 41: MLconf NYC Animashree Anandkumar

Scaling Of The Stochastic Iterations

[Plot: running time (secs) vs. number of communities k, on log scales, comparing MATLAB Tensor Toolbox, CULA Standard Interface, CULA Device Interface, and Eigen Sparse]

Page 42: MLconf NYC Animashree Anandkumar

Outline

1 Introduction

2 Topic Models

3 Efficient Tensor Decomposition

4 Experimental Results

5 Conclusion

Page 43: MLconf NYC Animashree Anandkumar

Experimental Results

Facebook: users connected by friendships, n ∼ 20,000

Yelp: users and businesses connected by reviews, n ∼ 40,000

DBLP: authors connected by co-authorships, n ∼ 1 million

Error (E) and Recovery ratio (R)

Dataset            k̂     Method        Running Time   E        R
Facebook (k=360)   500   ours          468            0.0175   100%
Facebook (k=360)   500   variational   86,808         0.0308   100%
Yelp (k=159)       100   ours          287            0.046    86%
Yelp (k=159)       100   variational   N.A.
DBLP (k=6000)      100   ours          5407           0.105    95%

Page 44: MLconf NYC Animashree Anandkumar

Experimental Results on Yelp

Lowest error business categories & largest weight businesses

Rank   Category         Business                     Stars   Review Counts
1      Latin American   Salvadoreno Restaurant       4.0     36
2      Gluten Free      P.F. Chang’s China Bistro    3.5     55
3      Hobby Shops      Make Meaning                 4.5     14
4      Mass Media       KJZZ 91.5FM                  4.0     13
5      Yoga             Sutra Midtown                4.5     31

Page 45: MLconf NYC Animashree Anandkumar

Experimental Results on Yelp

Lowest error business categories & largest weight businesses

Rank   Category         Business                     Stars   Review Counts
1      Latin American   Salvadoreno Restaurant       4.0     36
2      Gluten Free      P.F. Chang’s China Bistro    3.5     55
3      Hobby Shops      Make Meaning                 4.5     14
4      Mass Media       KJZZ 91.5FM                  4.0     13
5      Yoga             Sutra Midtown                4.5     31

Bridgeness: Distance from vector [1/k̂, . . . , 1/k̂]⊤

Top-5 bridging nodes (businesses)

Business               Categories
Four Peaks Brewing     Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco        Restaurants, Pizza, Phoenix
FEZ                    Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt’s Big Breakfast   Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co       Restaurants, Bars, Nightlife, Pubs, Tempe
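For reference, bridgeness as defined above reduces to a one-line computation; a minimal sketch (illustrative only):

```python
# Sketch: bridgeness of a node = distance of its membership vector from uniform.
import numpy as np

def bridgeness(pi):
    """pi: length-k_hat estimated community membership vector (sums to 1)."""
    k_hat = len(pi)
    return np.linalg.norm(pi - np.full(k_hat, 1.0 / k_hat))
```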

Page 46: MLconf NYC Animashree Anandkumar

Outline

1 Introduction

2 Topic Models

3 Efficient Tensor Decomposition

4 Experimental Results

5 Conclusion

Page 47: MLconf NYC Animashree Anandkumar

Conclusion

Guaranteed Learning of Latent Variable Models

Guaranteed to recover correct model

Efficient sample and computational complexities

Better performance compared to EM, Variational Bayes, etc.

Mixed membership communities, topic models, ICA, Gaussian mixtures...

Current and Future Goals

Guaranteed online learning in high dimensions

Large-scale cloud-based implementation of tensor approaches

Code available on website and GitHub.