Factorized Asymptotic Bayesian Inference for Latent Feature Models


Casual slides explaining our work accepted at NIPS 2013.


Factorized Asymptotic Bayesian Inference for Latent Feature Models

Kohei Hayashi (1, 2)

1: National Institute of Informatics

2: JST, ERATO, Kawarabayashi Large Graph Project

Sept. 5th, 2014

Joint work with Ryohei Fujimaki (NEC Labs America)

1 / 24

Background

Generally, data are high-dimensional and consist of large samples

• Sensor data, texts, images, ...

• Raw data are often hard for humans to interpret

One purpose of machine learning: data interpretation

• Aim: to find meaningful features in data

Latent feature models

2 / 24

Example: Mixture Models (MMs)

3 / 24

Example: Mixture Models (MMs)

4 / 24

Model Selection

In MMs, selection of #components is important

• Too many components: difficult to interpret

5 / 24

Factorized Information Criterion (FIC)

Model selection for binary latent variable models

• MMs [Fujimaki&Morinaga AISTATS’12]

• HMMs [Fujimaki&Hayashi ICML’12]

Pros/Cons:

:) Asymptotically equivalent to the marginal likelihood

• Preferable for “Big Data” scenarios

:) Fast computation, no sensitive tuning parameters

• An alternative to nonparametric Bayesian methods

:( Only applicable to MM-type models

6 / 24

Contribution

Derive FIC for latent feature models

• Compact and accurate model selection

• Runtime is 5–50× faster (vs. IBP)

• May be applicable to other non-MM-type models (e.g., topic models)

7 / 24

Latent Feature Models (LFMs)

8 / 24

LFM: an Extension of MM

• LFM considers combinations of components

9 / 24

LFM: an Extension of MM

• LFM considers combinations of components

10 / 24

Observation Model (Likelihood)

For n = 1, . . . , N,

x_n = W z_n + ε_n    (1)

• x_n ∈ R^D: observation

• W ∈ R^{D×K}: linear bases

• z_n ∈ {0, 1}^K: binary latent variable

• ε_n ∈ R^D: Gaussian noise N(0, diag(λ)^{-1})

[Figure: the observation matrix X (real, D × N) is modeled as the linear bases W (real, D × K) combined with the binary latent variables Z]
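For concreteness, a minimal NumPy sketch of sampling data from this observation model (the sizes and the values of π and λ are arbitrary choices for the example, not taken from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, K = 1000, 30, 5                      # samples, observed dims, latent features

    pi = rng.uniform(0.2, 0.8, size=K)         # activation probability of each feature
    W = rng.normal(size=(D, K))                # linear bases
    lam = np.full(D, 10.0)                     # noise precisions, diag(lambda)

    Z = (rng.uniform(size=(N, K)) < pi).astype(float)        # z_n in {0,1}^K, samples as rows
    eps = rng.normal(scale=1.0 / np.sqrt(lam), size=(N, D))  # eps_n ~ N(0, diag(lambda)^-1)
    X = Z @ W.T + eps                          # x_n = W z_n + eps_n, stacked as rows of X

The inference task is then to recover Z, W, and the effective K from X alone.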

11 / 24

Priors

• p(Z) = ∏_n ∏_k π_k^{z_nk} (1 − π_k)^{1 − z_nk}

• p(P) for P ≡ {π, W, λ}

12 / 24

Marginal Likelihood

A criterion for Bayesian model selection

p(X) = ∑_Z ∫ dP p(P) p(X, Z | P)    (2)

Problems:

• Integral w.r.t. P is intractable

• Sum over Z needs O(2^K) terms (see the sketch below)

Approach: use

• Laplace approximation

• Variational bound (+ mean field + linearization)
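To see the O(2^K) cost concretely, here is a toy NumPy sketch that sums z_n out by brute force for a fixed P (the integral over P is the other obstacle and is not handled here; all names are illustrative):

    import itertools
    import numpy as np

    def logsumexp(a):
        m = np.max(a)
        return m + np.log(np.sum(np.exp(a - m)))

    def log_lik_given_P(X, W, lam, pi):
        """log p(X | P) = sum_n log sum_{z in {0,1}^K} p(z) N(x_n | W z, diag(lam)^-1)."""
        K = W.shape[1]
        Zs = np.array(list(itertools.product([0, 1], repeat=K)), dtype=float)  # 2^K rows
        log_pz = Zs @ np.log(pi) + (1.0 - Zs) @ np.log1p(-pi)   # Bernoulli prior of each z
        means = Zs @ W.T                                        # W z for every configuration
        const = 0.5 * np.sum(np.log(lam / (2.0 * np.pi)))       # Gaussian normalizer
        total = 0.0
        for x in X:
            log_px_z = const - 0.5 * np.sum(lam * (x - means) ** 2, axis=1)
            total += logsumexp(log_pz + log_px_z)               # 2^K terms per sample
        return total

Already at K = 20 the inner sum has about a million terms per sample, which is why an approximate q(Z) is used instead.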

13 / 24

FIC of LFMs

14 / 24

Variational Lower Bound

Suppose we have p(X, Z) = ∫ dP p(P) p(X, Z | P); then, by Jensen's inequality,

log p(X) ≥ ∑_Z q(Z) log p(X, Z) + H(q)    (3)

• H(q) ≡ −∑_Z q(Z) log q(Z)

• Equality holds iff q(Z) is the true posterior p(Z | X)
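A quick, self-contained numeric check of this bound (a toy with a handful of Z configurations, unrelated to the LFM above):

    import numpy as np

    rng = np.random.default_rng(0)

    p_xz = rng.uniform(0.01, 1.0, size=8)       # toy values of p(X, Z=z), with X held fixed
    log_px = np.log(p_xz.sum())                 # log p(X) = log sum_Z p(X, Z)

    def lower_bound(q):
        """sum_Z q(Z) log p(X, Z) + H(q), as in Eq. (3)."""
        return np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))

    q_any = rng.dirichlet(np.ones(8))           # an arbitrary distribution over Z
    q_post = p_xz / p_xz.sum()                  # the true posterior p(Z | X)

    assert lower_bound(q_any) <= log_px              # any q gives a lower bound
    assert np.isclose(lower_bound(q_post), log_px)   # equality at the true posterior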

15 / 24

Laplace Approximation of p(X,Z)

Suppose we have the maximum likelihood estimators P̂.

For N → ∞,

log p(X, Z) = log p(X, Z | P̂) − r(Z) − ((D + K)/2) log N + O_p(1)    (4)

• r(Z) ≡ (D/2) ∑_k log ∑_n z_nk: complexity of the model

16 / 24

By combining the two approximations, we obtain FIC_LFM:

FIC_LFM ≡ max_q { E_q[ log p(X, Z | P̂) − r(Z) ] + H(q) } − ((D + K)/2) log N

“Asymptotically” equivalent to the marginal likelihood:

log p(X) = FIC_LFM + O(1)

• r(Z) = (D/2) ∑_k log ∑_n z_nk prefers sparse Z

[Plot: the function −log(x), the shape of each component's contribution as a function of its usage ∑_n z_nk]
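A small numeric illustration of that preference, comparing only the penalty term r(Z) for a dense and a sparse assignment of the same N samples (toy sizes; components that nobody uses are dropped, matching the shrinkage step later):

    import numpy as np

    def r(Z, D):
        """r(Z) = (D/2) * sum_k log sum_n z_nk, over components that are actually used."""
        counts = Z.sum(axis=0)
        return 0.5 * D * np.sum(np.log(counts[counts > 0]))

    rng = np.random.default_rng(0)
    N, D = 1000, 30
    Z_dense = (rng.uniform(size=(N, 10)) < 0.5).astype(float)   # 10 active components
    Z_sparse = (rng.uniform(size=(N, 3)) < 0.5).astype(float)   # 3 active components

    print(r(Z_dense, D), r(Z_sparse, D))   # roughly 930 vs. 280: the sparser Z is penalized less

Since r(Z) is subtracted in FIC_LFM, the sparser Z gets the higher score, all else being equal.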

17 / 24

FAB Algorithm

Maximize w.r.t. q̃ and P by an EM-like algorithm

Maximize w.r.t. q̃

E_q̃[z_nk] ← sigmoid( c_nk + logit(π_k) − D / (2 ∑_m E_q̃[z_mk]) )

• c_nk = w_k^⊤ diag(λ) (x_n − ∑_{l≠k} E_q̃[z_nl] w_l − (1/2) w_k)
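A minimal NumPy sketch of this update (EZ holds E_q̃[z_nk] as an N × K matrix; the sweep order over k and other details are illustrative, not the exact implementation):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def fab_update_q(X, W, lam, pi, EZ):
        """One sweep of the variational update for EZ[n, k] = E_q[z_nk]."""
        N, D = X.shape
        K = W.shape[1]
        for k in range(K):
            # sum_{l != k} E[z_nl] w_l, for all n at once
            others = EZ @ W.T - np.outer(EZ[:, k], W[:, k])
            # c_nk = w_k^T diag(lam) (x_n - others_n - 0.5 w_k)
            c_k = (X - others - 0.5 * W[:, k]) @ (lam * W[:, k])
            # FIC-induced shrinkage term D / (2 sum_m E[z_mk]); assumes the component is still used
            penalty = D / (2.0 * EZ[:, k].sum())
            EZ[:, k] = sigmoid(c_k + np.log(pi[k] / (1.0 - pi[k])) - penalty)
        return EZ

Components whose usage EZ[:, k].sum() / N falls to (almost) zero would then be removed, which is the shrinkage step below; W and λ are re-estimated in closed form as stated on the slide.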

Maximize w.r.t. W and λ

• Closed-form solutions

Shrinkage of Z

• Delete z_k and w_k if ∑_n z_nk / N ≃ 0

18 / 24

Experiments

19 / 24

Artificial Data

• Generate X by the observation model (D = 30)

• Error bars: s.d. over 10 trials

[Plot: elapsed time (sec, log scale) vs. N ∈ {100, 250, 500, 1000, 2000}; two panels for true K = 5 and K = 10; methods: fab, em, ibp, meibp, vb]

Computational time vs. N

20 / 24

Artificial Data (Cont’d)

[Plot: estimated K vs. N ∈ {100, 250, 500, 1000, 2000}; two panels for true K = 5 and K = 10]

Selected K vs. N

21 / 24

Block Data

22 / 24

Real Data

• Evaluate testing and training log-likelihoods (PLL and TLL)

Data (N × D)          Method  Time (h)  K             PLL            TLL
Sonar (208 × 49)      FAB     < 0.01    4.4 ± 1.1     −1.25 ± 0.02   −1.14 ± 0.03
                      EM      < 0.01    48.8 ± 0.5    −4.04 ± 0.46   −0.08 ± 0.07
                      IBP     3.3       69.6 ± 4.8    −4.48 ± 0.15    0.13 ± 0.02
Libras (360 × 90)     FAB     < 0.01    19.0 ± 0.7    −0.63 ± 0.03   −0.42 ± 0.03
                      EM      0.01      75.6 ± 8.6    −0.68 ± 0.11    0.76 ± 0.24
                      IBP     4.8       36.4 ± 1.1    −0.18 ± 0.01    0.13 ± 0.01
Auslan (16180 × 22)   FAB     0.04      6.0 ± 0.7     −1.34 ± 0.15   −0.92 ± 0.02
                      EM      0.2       22 ± 0        −1.79 ± 0.27   −0.78 ± 0.02
                      IBP     50.2      73 ± 5        −4.54 ± 0.08    0.08 ± 0.01
EEG (120576 × 32)     FAB     1.6       11.2 ± 1.6    −0.93 ± 0.02   −0.76 ± 0.04
                      EM      3.7       32 ± 0        −0.88 ± 0.09   −0.59 ± 0.01
                      IBP     53.0      46.4 ± 4.4    −3.16 ± 0.03   −0.26 ± 0.05
Piano (57931 × 161)   FAB     19.4      58.0 ± 3.5    −0.83 ± 0.01   −0.63 ± 0.02
                      EM      50.1      158.6 ± 3.4   −0.82 ± 0.02   −0.45 ± 0.01
                      IBP     55.8      89.6 ± 4.2    −1.83 ± 0.02   −0.84 ± 0.05
yaleB (2414 × 1024)   FAB     2.2       77.2 ± 7.9    −0.37 ± 0.02   −0.29 ± 0.03
                      EM      50.9      929 ± 20      −4.60 ± 1.20    0.80 ± 0.27
                      IBP     51.7      94.2 ± 7.5    −0.54 ± 0.02   −0.35 ± 0.02
USPS (110000 × 256)   FAB     11.2      110.2 ± 5.1   −0.96 ± 0.01   −0.64 ± 0.02
                      EM      45.7      256 ± 0       −1.06 ± 0.01   −0.36 ± 0.01
                      IBP     61.6      181.0 ± 4.8   −2.59 ± 0.08   −0.76 ± 0.01

23 / 24

Summary

• Derive FIC for LFMs

• Develop a FAB algorithm that accelerates shrinkage toward sparse Z

• Demonstrate that FAB is fast and accurate

24 / 24