Factorized Asymptotic Bayesian Inference for Latent Feature Models


Casual slides explaining our work accepted at NIPS 2013.


Factorized Asymptotic Bayesian Inference for Latent Feature Models

Kohei Hayashi (1, 2)

1: National Institute of Informatics

2: JST, ERATO, Kawarabayashi Large Graph Project

Sept. 5th, 2014

Joint work with Ryohei Fujimaki (NEC Labs America)

1 / 24

Background

Generally, data are high-dimensional and consist of large samples

• Sensor data, texts, images, ...

• Raw data are often hard for humans to interpret

One purpose of machine learning: data interpretation

• Aim: to find meaningful features in data

Latent feature models

2 / 24

Example: Mixture Models (MMs)

3 / 24

Example: Mixture Models (MMs)

4 / 24

Model Selection

In MMs, selection of #components is important

• Too many components: difficult to interpret

5 / 24

Factorized Information Criterion (FIC)

Model selection for binary latent variable models

• MMs [Fujimaki&Morinaga AISTATS’12]

• HMMs [Fujimaki&Hayashi ICML’12]

Pros/Cons:

:) Asymptotically equivalent to the marginal likelihood

• Preferable for “Big Data” scenarios

:) Fast computation, no sensitive tuning parameters

• An alternative to nonparametric Bayesian methods

:( Only applicable to MM-type models

6 / 24

Contribution

Derive FIC for latent feature models

• Compact and accurate model selection

• Runtime is 5–50× faster (vs. IBP)

• May be applicable to other non-MM-type models (e.g., topic models)

7 / 24

Latent Feature Models (LFMs)

8 / 24

LFM: an Extension of MM

• LFM considers combinations of components

9 / 24

LFM: an Extension of MM

• LFM considers combinations of components

10 / 24

Observation Model (Likelihood)

For n = 1, . . . , N,

x_n = W z_n + ε_n    (1)

• x_n ∈ R^D: observation

• W ∈ R^{D×K}: linear bases

• z_n ∈ {0, 1}^K: binary latent variable

• ε_n ∈ R^D: Gaussian noise N(0, diag(λ)^{-1})

[Figure: the observation matrix X (real, D × N) is modeled as the linear bases W (real, D × K) combined with the binary latent variables Z]
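For concreteness, a minimal NumPy sketch of sampling data from this observation model (the sizes and the values of π and λ are arbitrary choices for the example, not taken from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, K = 1000, 30, 5                      # samples, observed dims, latent features

    pi = rng.uniform(0.2, 0.8, size=K)         # activation probability of each feature
    W = rng.normal(size=(D, K))                # linear bases
    lam = np.full(D, 10.0)                     # noise precisions, diag(lambda)

    Z = (rng.uniform(size=(N, K)) < pi).astype(float)        # z_n in {0,1}^K, samples as rows
    eps = rng.normal(scale=1.0 / np.sqrt(lam), size=(N, D))  # eps_n ~ N(0, diag(lambda)^-1)
    X = Z @ W.T + eps                          # x_n = W z_n + eps_n, stacked as rows of X

The inference task is then to recover Z, W, and the effective K from X alone.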

11 / 24

Priors

• p(Z) = ∏_n ∏_k π_k^{z_nk} (1 − π_k)^{1 − z_nk}

• p(P) for P ≡ {π, W, λ}

12 / 24

Marginal Likelihood

A criterion for Bayesian model selection

p(X) = ∑_Z ∫ dP p(P) p(X, Z | P)    (2)

Problems:

• Integral w.r.t. P is intractable

• Sum over Z needs O(2^K) terms (see the sketch below)

Approach: use

• Laplace approximation

• Variational bound (+ mean field + linearization)
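To see the O(2^K) cost concretely, here is a toy NumPy sketch that sums z_n out by brute force for a fixed P (the integral over P is the other obstacle and is not handled here; all names are illustrative):

    import itertools
    import numpy as np

    def logsumexp(a):
        m = np.max(a)
        return m + np.log(np.sum(np.exp(a - m)))

    def log_lik_given_P(X, W, lam, pi):
        """log p(X | P) = sum_n log sum_{z in {0,1}^K} p(z) N(x_n | W z, diag(lam)^-1)."""
        K = W.shape[1]
        Zs = np.array(list(itertools.product([0, 1], repeat=K)), dtype=float)  # 2^K rows
        log_pz = Zs @ np.log(pi) + (1.0 - Zs) @ np.log1p(-pi)   # Bernoulli prior of each z
        means = Zs @ W.T                                        # W z for every configuration
        const = 0.5 * np.sum(np.log(lam / (2.0 * np.pi)))       # Gaussian normalizer
        total = 0.0
        for x in X:
            log_px_z = const - 0.5 * np.sum(lam * (x - means) ** 2, axis=1)
            total += logsumexp(log_pz + log_px_z)               # 2^K terms per sample
        return total

Already at K = 20 the inner sum has about a million terms per sample, which is why an approximate q(Z) is used instead.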

13 / 24

FIC of LFMs

14 / 24

Variational Lower Bound

Suppose we have p(X, Z) = ∫ dP p(P) p(X, Z | P); then, by Jensen's inequality,

log p(X) ≥ ∑_Z q(Z) log p(X, Z) + H(q)    (3)

• H(q) ≡ −∑_Z q(Z) log q(Z)

• Equality holds iff q(Z) is the true posterior p(Z | X)
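A quick, self-contained numeric check of this bound (a toy with a handful of Z configurations, unrelated to the LFM above):

    import numpy as np

    rng = np.random.default_rng(0)

    p_xz = rng.uniform(0.01, 1.0, size=8)       # toy values of p(X, Z=z), with X held fixed
    log_px = np.log(p_xz.sum())                 # log p(X) = log sum_Z p(X, Z)

    def lower_bound(q):
        """sum_Z q(Z) log p(X, Z) + H(q), as in Eq. (3)."""
        return np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))

    q_any = rng.dirichlet(np.ones(8))           # an arbitrary distribution over Z
    q_post = p_xz / p_xz.sum()                  # the true posterior p(Z | X)

    assert lower_bound(q_any) <= log_px              # any q gives a lower bound
    assert np.isclose(lower_bound(q_post), log_px)   # equality at the true posterior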

15 / 24

Laplace Approximation of p(X,Z)

Suppose we have the maximum likelihood estimators P̂.

For N → ∞,

log p(X, Z) = log p(X, Z | P̂) − r(Z) − ((D + K)/2) log N + O_p(1)    (4)

• r(Z) ≡ (D/2) ∑_k log ∑_n z_nk: complexity of the model

16 / 24

By combining the two approximations, we obtain FIC_LFM:

FIC_LFM ≡ max_q { E_q[ log p(X, Z | P̂) − r(Z) ] + H(q) } − ((D + K)/2) log N

“Asymptotically” equivalent to the marginal likelihood:

log p(X) = FIC_LFM + O(1)

• r(Z) = (D/2) ∑_k log ∑_n z_nk prefers sparse Z

[Plot: the function −log(x), the shape of each component's contribution as a function of its usage ∑_n z_nk]
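A small numeric illustration of that preference, comparing only the penalty term r(Z) for a dense and a sparse assignment of the same N samples (toy sizes; components that nobody uses are dropped, matching the shrinkage step later):

    import numpy as np

    def r(Z, D):
        """r(Z) = (D/2) * sum_k log sum_n z_nk, over components that are actually used."""
        counts = Z.sum(axis=0)
        return 0.5 * D * np.sum(np.log(counts[counts > 0]))

    rng = np.random.default_rng(0)
    N, D = 1000, 30
    Z_dense = (rng.uniform(size=(N, 10)) < 0.5).astype(float)   # 10 active components
    Z_sparse = (rng.uniform(size=(N, 3)) < 0.5).astype(float)   # 3 active components

    print(r(Z_dense, D), r(Z_sparse, D))   # roughly 930 vs. 280: the sparser Z is penalized less

Since r(Z) is subtracted in FIC_LFM, the sparser Z gets the higher score, all else being equal.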

17 / 24

FAB Algorithm

Maximize w.r.t. q̃ and P by an EM-like algorithm

Maximize w.r.t. q̃

E_q̃[z_nk] ← sigmoid( c_nk + logit(π_k) − D / (2 ∑_m E_q̃[z_mk]) )

• c_nk = w_k^⊤ diag(λ) (x_n − ∑_{l≠k} E_q̃[z_nl] w_l − (1/2) w_k)
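A minimal NumPy sketch of this update (EZ holds E_q̃[z_nk] as an N × K matrix; the sweep order over k and other details are illustrative, not the exact implementation):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def fab_update_q(X, W, lam, pi, EZ):
        """One sweep of the variational update for EZ[n, k] = E_q[z_nk]."""
        N, D = X.shape
        K = W.shape[1]
        for k in range(K):
            # sum_{l != k} E[z_nl] w_l, for all n at once
            others = EZ @ W.T - np.outer(EZ[:, k], W[:, k])
            # c_nk = w_k^T diag(lam) (x_n - others_n - 0.5 w_k)
            c_k = (X - others - 0.5 * W[:, k]) @ (lam * W[:, k])
            # FIC-induced shrinkage term D / (2 sum_m E[z_mk]); assumes the component is still used
            penalty = D / (2.0 * EZ[:, k].sum())
            EZ[:, k] = sigmoid(c_k + np.log(pi[k] / (1.0 - pi[k])) - penalty)
        return EZ

Components whose usage EZ[:, k].sum() / N falls to (almost) zero would then be removed, which is the shrinkage step below; W and λ are re-estimated in closed form as stated on the slide.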

Maximize w.r.t. W and λ

• Closed-form solutions

Shrinkage of Z

• Delete z_k and w_k if ∑_n z_nk / N ≃ 0

18 / 24

Experiments

19 / 24

Artificial Data

• Generate X by the observation model (D = 30)

• Error bars: s.d. over 10 trials

[Plot: elapsed time (sec, log scale) vs. N ∈ {100, 250, 500, 1000, 2000}; two panels for true K = 5 and K = 10; methods: fab, em, ibp, meibp, vb]

Computational time vs. N

20 / 24

Artificial Data (Cont’d)

[Plot: estimated K vs. N ∈ {100, 250, 500, 1000, 2000}; two panels for true K = 5 and K = 10]

Selected K vs. N

21 / 24

Block Data

22 / 24

Real Data

• Evaluate testing and training log-likelihoods (PLL and TLL)

Data (N × D)          Method  Time (h)  K             PLL            TLL
Sonar (208 × 49)      FAB     < 0.01    4.4 ± 1.1     −1.25 ± 0.02   −1.14 ± 0.03
                      EM      < 0.01    48.8 ± 0.5    −4.04 ± 0.46   −0.08 ± 0.07
                      IBP     3.3       69.6 ± 4.8    −4.48 ± 0.15    0.13 ± 0.02
Libras (360 × 90)     FAB     < 0.01    19.0 ± 0.7    −0.63 ± 0.03   −0.42 ± 0.03
                      EM      0.01      75.6 ± 8.6    −0.68 ± 0.11    0.76 ± 0.24
                      IBP     4.8       36.4 ± 1.1    −0.18 ± 0.01    0.13 ± 0.01
Auslan (16180 × 22)   FAB     0.04      6.0 ± 0.7     −1.34 ± 0.15   −0.92 ± 0.02
                      EM      0.2       22 ± 0        −1.79 ± 0.27   −0.78 ± 0.02
                      IBP     50.2      73 ± 5        −4.54 ± 0.08    0.08 ± 0.01
EEG (120576 × 32)     FAB     1.6       11.2 ± 1.6    −0.93 ± 0.02   −0.76 ± 0.04
                      EM      3.7       32 ± 0        −0.88 ± 0.09   −0.59 ± 0.01
                      IBP     53.0      46.4 ± 4.4    −3.16 ± 0.03   −0.26 ± 0.05
Piano (57931 × 161)   FAB     19.4      58.0 ± 3.5    −0.83 ± 0.01   −0.63 ± 0.02
                      EM      50.1      158.6 ± 3.4   −0.82 ± 0.02   −0.45 ± 0.01
                      IBP     55.8      89.6 ± 4.2    −1.83 ± 0.02   −0.84 ± 0.05
yaleB (2414 × 1024)   FAB     2.2       77.2 ± 7.9    −0.37 ± 0.02   −0.29 ± 0.03
                      EM      50.9      929 ± 20      −4.60 ± 1.20    0.80 ± 0.27
                      IBP     51.7      94.2 ± 7.5    −0.54 ± 0.02   −0.35 ± 0.02
USPS (110000 × 256)   FAB     11.2      110.2 ± 5.1   −0.96 ± 0.01   −0.64 ± 0.02
                      EM      45.7      256 ± 0       −1.06 ± 0.01   −0.36 ± 0.01
                      IBP     61.6      181.0 ± 4.8   −2.59 ± 0.08   −0.76 ± 0.01

23 / 24

Summary

• Derive FIC for LFMs

• Develop a FAB algorithm that accelerates shrinkage toward sparse Z

• Demonstrate that FAB is fast and accurate

24 / 24