Transcript of MLconf NYC Animashree Anandkumar

Tensor Decompositions for Guaranteed Learning of Latent Variable Models

Anima Anandkumar

U.C. Irvine

Application 1: Topic Modeling

Document modeling

Observed: words in document corpus.

Hidden: topics.

Goal: carry out document summarization.

Application 2: Understanding Human Communities

Social Networks

Observed: network of social ties, e.g. friendships, co-authorships

Hidden: groups/communities of actors.

Application 3: Recommender Systems

Recommender System

Observed: ratings of users for various products, e.g. Yelp reviews.

Goal: Predict new recommendations.

Modeling: Find groups/communities of users and products.

Application 4: Feature Learning

Feature Engineering

Learn good features/representations for classification tasks, e.g. image and speech recognition.

Sparse representations, low dimensional hidden structures.

Application 5: Computational Biology

Observed: gene expression levels

Goal: discover gene groups

Hidden variables: regulators controlling gene groups

“Unsupervised Learning of Transcriptional Regulatory Networks via Latent Tree Graphical Model” by A. Gitter, F. Huang, R. Valluvan, E. Fraenkel and A. Anandkumar. Submitted to BMC Bioinformatics, Jan. 2014.

Statistical Framework

In all applications: discover hidden structure in data: unsupervised learning.

Latent Variable Models

Concise statistical description through graphical modeling.

Conditional independence relationships or hierarchy of variables, e.g. observed variables x1, x2, x3, x4, x5 generated from hidden variables h1, h2, h3.

Computational Framework

Challenge: Efficient Learning of Latent Variable Models

Maximum likelihood is NP-hard.

Practice: EM, Variational Bayes have no consistency guarantees.

Efficient computational and sample complexities?

Fast methods such as matrix factorization are not statistical. We cannot learn the latent variable model through such methods.

Tensor-based Estimation

Estimate moment tensors from data: higher order relationships.

Compute decomposition of moment tensor.

Iterative updates, e.g. tensor power iterations, alternating minimization.

Non-convex: convergence to a local optimum. No guarantees.

Innovation: Guaranteed convergence to correct model.

In this talk: tensor decompositions and applications

Outline

1 Introduction

2 Topic Models

3 Efficient Tensor Decomposition

4 Experimental Results

5 Conclusion

Topic Models: Bag of Words

Probabilistic Topic Models

Bag of words: order of words does not matter

Graphical model representation

l words in a document x1, . . . , xl.

h: proportions of topics in a document.

Word xi generated from topic yi.

A(i, j) := P[xm = i | ym = j]: topic-word matrix.

[Graphical model: topic mixture h → topics y1, . . . , y5 → words x1, . . . , x5, with each word generated via the topic-word matrix A]
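As an illustrative aside, here is a minimal NumPy sketch of this generative process for the single-topic case; the vocabulary size, number of topics, topic prior and topic-word matrix below are placeholder values, not those from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, num_docs, doc_len = 100, 5, 500, 50    # vocabulary size, topics, corpus size, words per document

# Placeholder parameters: topic prior lambda = P[h] and topic-word matrix A (columns sum to 1).
lam = rng.dirichlet(np.ones(k))
A = rng.dirichlet(np.ones(d), size=k).T      # d x k, A[i, j] = P[word = i | topic = j]

docs = []
for _ in range(num_docs):
    h = rng.choice(k, p=lam)                         # single topic per document
    words = rng.choice(d, size=doc_len, p=A[:, h])   # words drawn i.i.d. given the topic
    docs.append(words)
```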

Geometric Picture for Topic Models

Topic proportions vector (h); a single topic is the special case where h is a basis vector.

Word generation (x1, x2, . . .) via the topic-word matrix A.

Linear Model: E[xi|h] = Ah.

Multiview model: h is fixed and multiple words (xi) are generated.

Moment Tensors

Consider the single topic model.

E[xi|h] = Ah.  λ := E[h], i.e. λr = P[h = r].

Learn topic-word matrix A and vector λ.

M2: Co-occurrence of two words in a document

M2 := E[x1 x2⊤] = E[ E[x1 x2⊤ | h] ] = A E[h h⊤] A⊤ = Σ_{r=1}^{k} λr ar ar⊤

Tensor M3: Co-occurrence of three words

M3 := E(x1 ⊗ x2 ⊗ x3) = Σ_{r=1}^{k} λr ar ⊗ ar ⊗ ar

Matrix and Tensor Forms: ar := rth column of A.

M2 = Σ_{r=1}^{k} λr ar ⊗ ar.   M3 = Σ_{r=1}^{k} λr ar ⊗ ar ⊗ ar
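A hedged sketch of how these moments could be estimated empirically: with one-hot encoded words from a single-topic corpus (such as the one sampled above), M2 and M3 are averages of outer products over distinct word positions. For brevity the estimator below uses only the first three words of each document; averaging over all ordered triples of distinct positions would lower the variance.

```python
import numpy as np

def empirical_moments(docs, d):
    """Estimate M2 = E[x1 ⊗ x2] and M3 = E[x1 ⊗ x2 ⊗ x3] from single-topic documents.

    Each x_i is the one-hot vector of a word; distinct word positions are used,
    so in expectation M2 = sum_r lambda_r a_r a_r^T and M3 = sum_r lambda_r a_r ⊗ a_r ⊗ a_r.
    """
    M2 = np.zeros((d, d))
    M3 = np.zeros((d, d, d))
    for words in docs:
        w1, w2, w3 = words[0], words[1], words[2]   # three distinct word positions
        M2[w1, w2] += 1.0
        M3[w1, w2, w3] += 1.0
    return M2 / len(docs), M3 / len(docs)

# Usage with the corpus sampled above: M2_hat, M3_hat = empirical_moments(docs, d)
```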

Tensor Decomposition Problem

M2 = Σ_{r=1}^{k} λr ar ⊗ ar.   M3 = Σ_{r=1}^{k} λr ar ⊗ ar ⊗ ar

Tensor M3 = λ1 a1 ⊗ a1 ⊗ a1 + λ2 a2 ⊗ a2 ⊗ a2 + . . .

u ⊗ v ⊗ w is a rank-1 tensor whose (i, j, k)-th entry is ui vj wk.

k topics, d words in vocabulary.

M3: O(d × d × d) tensor, rank k.
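A small sketch of this rank-1 structure, checking the entry formula for u ⊗ v ⊗ w with np.einsum and assembling a rank-k symmetric tensor Σ_r λr ar ⊗ ar ⊗ ar; the dimensions and values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 10, 3

u, v, w = rng.normal(size=(3, d))
T1 = np.einsum('i,j,k->ijk', u, v, w)              # rank-1 tensor u ⊗ v ⊗ w
assert np.isclose(T1[2, 5, 7], u[2] * v[5] * w[7])  # (i, j, k)-th entry is u_i v_j w_k

lam = rng.random(k)
A = rng.normal(size=(d, k))                        # columns a_r
M3 = np.einsum('r,ir,jr,kr->ijk', lam, A, A, A)    # sum_r lambda_r a_r ⊗ a_r ⊗ a_r
```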

Learning Topic Models through Tensor Decomposition

Detecting Communities in Networks

Stochastic Block Model: non-overlapping communities.

Mixed Membership Model: overlapping communities.

Unifying Assumption

Edges conditionally independent given community memberships

Multi-view Mixture Models
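To make the conditional-independence assumption concrete, here is a minimal sketch of sampling a stochastic block model adjacency matrix, in which edges are independent given the (non-overlapping) community labels; the network size and connection probabilities p_in, p_out are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 300, 3
p_in, p_out = 0.3, 0.05                        # within- vs between-community edge probabilities

z = rng.integers(k, size=n)                    # hidden community label per node
P = np.where(z[:, None] == z[None, :], p_in, p_out)
A = (rng.random((n, n)) < P).astype(int)       # edges sampled independently given labels
A = np.triu(A, 1)
A = A + A.T                                    # undirected graph, no self-loops
```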

Tensor Forms in Other Models

Independent Component Analysis

Independent sources, unknown mixing.

Blind source separation of speech, image, video, etc.

[Graphical model: independent sources h1, h2, . . . , hk mixed through A into observations x1, x2, . . . , xd]

Gaussian Mixtures; Hidden Markov Models / Latent Trees

[Graphical model: hidden variables h1, h2, h3 generating observations x1, . . . , x5]

Reduction to similar moment forms

Outline

1 Introduction

2 Topic Models

3 Efficient Tensor Decomposition

4 Experimental Results

5 Conclusion

Tensor Decomposition Problem

M3 = Σ_{r=1}^{k} λr ar ⊗ ar ⊗ ar

Tensor M3 = λ1 a1 ⊗ a1 ⊗ a1 + λ2 a2 ⊗ a2 ⊗ a2 + . . .

u ⊗ v ⊗ w is a rank-1 tensor whose (i, j, k)-th entry is ui vj wk.

k topics, d words in vocabulary.

M3: O(d × d × d) tensor, rank k.

d: vocabulary size for topic models, or n: size of network for community models.

Dimensionality Reduction for Tensor Decomposition

M3 = Σ_{r=1}^{k} λr ar ⊗ ar ⊗ ar

Dimensionality Reduction (Whitening)

Convert M3 of size O(d × d × d) to tensor T of size k × k × k.

Carry out decomposition of T.

Dimensionality reduction through multi-linear transforms.

Computed from data, e.g. pairwise moments.

T = Σ_i ρi ri⊗3 is a symmetric orthogonal tensor: {ri} are orthonormal.
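A hedged sketch of this whitening step, assuming the pairwise moment M2 and the third moment M3 are available as dense arrays: a whitening matrix W with W⊤ M2 W = I is built from the top-k eigenpairs of M2, and the multilinear transform M3(W, W, W) gives the k × k × k tensor T.

```python
import numpy as np

def whiten(M2, M3, k):
    """Reduce a d x d x d moment tensor to a k x k x k (approximately) orthogonal tensor.

    W satisfies W.T @ M2 @ W = I_k, so the whitened components W.T @ a_r
    (suitably rescaled) become orthonormal.
    """
    vals, vecs = np.linalg.eigh(M2)
    vals, vecs = vals[-k:], vecs[:, -k:]                 # top-k eigenpairs of M2
    W = vecs / np.sqrt(vals)                             # d x k whitening matrix
    T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)      # multilinear transform M3(W, W, W)
    return T, W
```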

Orthogonal/Eigen Decomposition

Orthogonal symmetric tensor: T = Σ_{j∈[k]} ρj rj⊗3

T(I, r1, r1) = Σ_{j∈[k]} ρj ⟨r1, rj⟩² rj = ρ1 r1

Obtaining eigenvectors through power iterations:

u ↦ T(I, u, u) / ‖T(I, u, u)‖

Basic Algorithm

Random initialization, run power iterations and deflate.
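A minimal sketch of this basic algorithm on a symmetric k × k × k tensor (such as the whitened T above); it omits the multiple random restarts and convergence checks a robust implementation would use.

```python
import numpy as np

def tensor_power_method(T, k, n_iter=100, seed=0):
    """Recover (eigenvalue, eigenvector) pairs of a symmetric (near-)orthogonal tensor T."""
    rng = np.random.default_rng(seed)
    eigvals, eigvecs = [], []
    for _ in range(k):
        u = rng.normal(size=T.shape[0])
        u /= np.linalg.norm(u)                          # random initialization
        for _ in range(n_iter):
            u = np.einsum('ijk,j,k->i', T, u, u)        # power update u <- T(I, u, u)
            u /= np.linalg.norm(u)
        rho = float(np.einsum('ijk,i,j,k->', T, u, u, u))   # eigenvalue rho = T(u, u, u)
        eigvals.append(rho)
        eigvecs.append(u)
        T = T - rho * np.einsum('i,j,k->ijk', u, u, u)  # deflate the recovered component
    return np.array(eigvals), np.array(eigvecs)
```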

Practical Considerations

k communities, n nodes, k ≪ n.

Steps

k-SVD of n × n matrix: randomized techniques

Online k × k × k tensor decomposition: No tensor explicitly formed.

Parallelization: Inherently parallelizable, GPU deployment.

Sparse implementation: real-world networks are sparse

Validation Metric: p-value test based “soft-pairing”

Parallel time complexity: O(nsk/c + k³), where s is the max. degree in the graph and c is the number of cores.

Huang, Niranjan, Hakeem and Anandkumar, “Fast Detection of Overlapping Communities via Online Tensor Methods,” Preprint, Sept. 2013.

Scaling Of The Stochastic Iterations

v_i^{t+1} ← v_i^t − 3θ β^t Σ_{j=1}^{k} [⟨v_j^t, v_i^t⟩² v_j^t] + β^t ⟨v_i^t, y_A^t⟩ ⟨v_i^t, y_B^t⟩ y_C^t + . . .
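A heavily hedged sketch of one such stochastic update for a single whitened sample (y_A, y_B, y_C); the trailing "+ . . ." terms from the slide are omitted, and theta and beta_t are placeholder step-size parameters rather than the schedule used in the paper.

```python
import numpy as np

def stgd_step(V, yA, yB, yC, theta=1.0, beta_t=1e-2):
    """One stochastic gradient update of the k whitened eigenvectors (rows of V, shape k x k).

    Implements v_i <- v_i - 3*theta*beta_t * sum_j <v_j, v_i>^2 v_j
                        + beta_t * <v_i, yA> <v_i, yB> yC
    (the remaining symmetrized terms indicated by '+ . . .' on the slide are not shown).
    """
    G = V @ V.T                                   # G[i, j] = <v_i, v_j>
    penalty = (G ** 2) @ V                        # row i: sum_j <v_j, v_i>^2 v_j
    signal = ((V @ yA) * (V @ yB))[:, None] * yC[None, :]
    return V - 3.0 * theta * beta_t * penalty + beta_t * signal
```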

Parallelize across eigenvectors.

STGD is iterative: device code reuses buffers for updates.

[Diagram: CPU↔GPU transfer of v_i^t and y_A^t, y_B^t, y_C^t under the CULA standard interface vs. the device interface]

Scaling Of The Stochastic Iterations

[Plot: running time (secs) vs. number of communities k, comparing MATLAB Tensor Toolbox, CULA Standard Interface, CULA Device Interface, and Eigen Sparse implementations]

Outline

1 Introduction

2 Topic Models

3 Efficient Tensor Decomposition

4 Experimental Results

5 Conclusion

Experimental Results

Facebook friendship network (users, friend edges): n ∼ 20,000.

Yelp review graph (users and businesses, review edges): n ∼ 40,000.

DBLP co-authorship network (authors, co-author edges): n ∼ 1 million.

Error (E) and Recovery ratio (R)

Dataset            k̂     Method        Running Time   E        R
Facebook (k=360)   500    ours          468            0.0175   100%
Facebook (k=360)   500    variational   86,808         0.0308   100%
Yelp (k=159)       100    ours          287            0.046    86%
Yelp (k=159)       100    variational   N.A.
DBLP (k=6000)      100    ours          5407           0.105    95%

Experimental Results on Yelp

Lowest error business categories & largest weight businesses

Rank   Category         Business                     Stars   Review Counts
1      Latin American   Salvadoreno Restaurant       4.0     36
2      Gluten Free      P.F. Chang's China Bistro    3.5     55
3      Hobby Shops      Make Meaning                 4.5     14
4      Mass Media       KJZZ 91.5FM                  4.0     13
5      Yoga             Sutra Midtown                4.5     31

Bridgeness: Distance from vector [1/k̂, . . . , 1/k̂]⊤
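A small sketch of this bridgeness score, assuming a hypothetical matrix Pi of estimated membership vectors (one row per node, rows summing to one); nodes whose rows are closest to the uniform vector bridge the most communities.

```python
import numpy as np

def bridgeness(Pi):
    """Bridgeness of each node: distance of its membership vector from uniform.

    Pi: n x k_hat matrix, row i is node i's estimated community membership vector.
    Smaller distance from [1/k_hat, ..., 1/k_hat] means the node bridges more communities.
    """
    k_hat = Pi.shape[1]
    return np.linalg.norm(Pi - 1.0 / k_hat, axis=1)
```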

Top-5 bridging nodes (businesses)

Business               Categories
Four Peaks Brewing     Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco        Restaurants, Pizza, Phoenix
FEZ                    Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt's Big Breakfast   Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co       Restaurants, Bars, Nightlife, Pubs, Tempe

Outline

1 Introduction

2 Topic Models

3 Efficient Tensor Decomposition

4 Experimental Results

5 Conclusion

Conclusion

Guaranteed Learning of Latent Variable Models

Guaranteed to recover correct model

Efficient sample and computational complexities

Better performance compared to EM, Variational Bayes, etc.

Mixed membership communities, topic models, ICA, Gaussian mixtures, ...

Current and Future Goals

Guaranteed online learning in high dimensions

Large-scale cloud-based implementation of tensor approaches

Code available on website and GitHub.