
Deep Generative Models with Stick-Breaking Priors

Eric Nalisnick, University of California, Irvine

Outline

1. Overview of Deep Generative Models
   - Definition
   - Applications
   - Training

2. The Dirichlet Process
   - Definition
   - Stochastic Gradient Variational Bayes for the DP

3. Deep Generative Models with Stick-Breaking Priors
   - Stick-Breaking Variational Autoencoders
   - Dirichlet Process Variational Autoencoders
   - Stick-Breaking for Adaptive Computation

Overview of Deep Generative Models

“Generative modeling…denotes modeling tasks in which a density over all the observable quantities is constructed.”

David MacKay, Information Theory, Inference, and Learning Algorithms

Discriminative modeling: p(y | x).  Generative modeling: p(x, y) or p(x).

Deep Generative Models

Density Network (MacKay & Gibbs, 1997)

[Graphical model with plate notation: a stochastic latent variable Z (dimension d) is passed through one or more neural network layers to generate the observed data X, repeated over N observations; θ and λ denote the model's parameters.]
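As a concrete toy illustration of this structure, here is a minimal NumPy sketch of a density-network-style generative model: a latent draw z from a standard normal prior is pushed through a small random-weight network to parametrize a Bernoulli likelihood over x. The dimensions, the two-layer architecture, and the Bernoulli likelihood are my own illustrative assumptions, not details from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the talk): d-dimensional latent, D-dimensional data.
d, D, H = 2, 784, 128

# Random decoder weights stand in for the "one or more neural network layers".
W1, b1 = rng.normal(scale=0.1, size=(H, d)), np.zeros(H)
W2, b2 = rng.normal(scale=0.1, size=(D, H)), np.zeros(D)

def decoder(z):
    """Map a latent z to Bernoulli means for the observation x."""
    h = np.tanh(W1 @ z + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))   # sigmoid -> pixel probabilities

# Ancestral sampling: z ~ p(z), then x ~ p(x | z; theta).
z = rng.standard_normal(d)                 # prior p(z) = N(0, I)
x_mean = decoder(z)                        # parameters of the likelihood
x = rng.binomial(1, x_mean)                # a sampled (binarized) observation
```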

Sampling from a Deep Generative Model

[Draw ẑ from the prior and pass it through the network to produce x̂. Shown: samples from a DGM trained on ImageNet, and text samples from a DGM (Bowman et al., 2016).]

Overview of Deep Generative Models

1.1 Why Generative Modeling?

Why Generative Modeling?

1. Cognitive Motivations.
2. Semi-Supervised Learning: regularizing discriminative models p(y | x) with unlabeled data by modeling the joint p(x, y).
3. Simulating Environments for Reinforcement Learning.

Overview of Deep Generative Models

1.2 How to Train a Deep Generative Model

Learning a Deep Generative Model

Three approaches, in increasing order of sophistication:
1. Importance Sampling
2. Variational Inference
3. Adversarial Training

Learning a Deep Generative Model

1. Importance Sampling (MacKay & Gibbs, 1997)

Works only in low dimensions. A sketch of the estimator follows.
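To make the idea concrete, here is a minimal sketch of the importance-sampled estimate of the marginal likelihood, using the prior as the proposal distribution. The `log_likelihood` function is a toy stand-in for the decoder network, and the dimensions and sample counts are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_likelihood(x, z):
    """Hypothetical Bernoulli likelihood log p(x | z); a toy stand-in for the network."""
    logits = z.sum() * np.ones_like(x, dtype=float)   # toy decoder, not the talk's model
    p = 1.0 / (1.0 + np.exp(-logits))
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def importance_sampled_log_marginal(x, d=2, num_samples=1000):
    """log p(x) ~= log (1/S) sum_s p(x | z_s), with z_s drawn from the prior p(z)."""
    zs = rng.standard_normal((num_samples, d))        # proposal = prior
    log_w = np.array([log_likelihood(x, z) for z in zs])
    m = log_w.max()                                   # log-sum-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_w - m)))

x = rng.binomial(1, 0.5, size=10)
print(importance_sampled_log_marginal(x))
```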

Learning a Deep Generative Model

2. Variational Inference

Optimize a lower bound on the marginal likelihood using a posterior approximation q(z | x); the bound decomposes into a data-reconstruction term and a regularization term.

Problem: for neural network likelihoods, these integrals are analytically intractable.
Step #1: Use Monte Carlo approximations.
Problem: still can't compute gradients w.r.t. the variational parameters.
Step #2: Sample via differentiable non-centered parametrizations (Kingma & Welling, 2014; Rezende et al., 2014).

The Variational Autoencoder (VAE)

[An inference (encoder) network outputs the variational parameters of q(z | x); the generative (decoder) network maps z to the likelihood p(x | z).]
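Below is a minimal NumPy sketch of the two steps for a Gaussian posterior: a reparametrized (non-centered) draw z = μ + σ·ε and the closed-form KL regularizer. In a real implementation the encoder would produce μ and log σ and an autodiff framework would propagate gradients through the sample; the zero-valued parameters here are placeholders of my own.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_posterior(mu, log_sigma, num_samples=1):
    """Differentiable non-centered parametrization of q(z | x) = N(mu, sigma^2):
    z = mu + sigma * eps, eps ~ N(0, I).  The randomness lives in eps, so gradients
    w.r.t. the variational parameters (mu, log_sigma) pass straight through."""
    eps = rng.standard_normal((num_samples,) + mu.shape)
    return mu + np.exp(log_sigma) * eps

def gaussian_kl(mu, log_sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), the ELBO's regularization term."""
    return 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1.0 - 2 * log_sigma)

# Monte Carlo ELBO with a single sample (Step #1), using the reparametrized draw (Step #2):
#   ELBO(x) ~= log p(x | z) - KL(q(z | x) || p(z)),   z = mu + sigma * eps
mu, log_sigma = np.zeros(2), np.zeros(2)       # would be produced by the encoder network
z = sample_posterior(mu, log_sigma)[0]
print("KL term:", gaussian_kl(mu, log_sigma))
```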

Learning a Deep Generative Model

3. Adversarial Training (Goodfellow et al., 2014)

An auxiliary classifier predicts whether x is generated or observed data.

Two-step training:
1. Minimize the objective with respect to the generator's parameters.
2. Maximize the objective with respect to the classifier's (discriminator's) parameters.

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014)
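As an illustration of the two-step scheme, here is a self-contained toy example with a one-dimensional generator and a logistic-regression discriminator, trained with hand-derived gradients. The data distribution, model forms, learning rate, and step counts are my own choices for the sketch, not the setup from Goodfellow et al.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Toy 1-D setup (illustrative only): real data ~ N(3, 1),
# generator g(eps) = theta + eps, discriminator D(x) = sigmoid(a * x + b).
theta = 0.0          # generator parameter; should drift toward the real mean of 3
a, b = 1.0, 0.0      # discriminator parameters
lr = 0.05

for step in range(2000):
    x_real = rng.normal(3.0, 1.0, size=64)
    x_fake = theta + rng.standard_normal(64)

    # Maximize the value function w.r.t. the discriminator (gradient ascent).
    d_real, d_fake = sigmoid(a * x_real + b), sigmoid(a * x_fake + b)
    grad_a = np.mean((1 - d_real) * x_real) + np.mean(-d_fake * x_fake)
    grad_b = np.mean(1 - d_real) + np.mean(-d_fake)
    a, b = a + lr * grad_a, b + lr * grad_b

    # Minimize the value function w.r.t. the generator (gradient descent).
    d_fake = sigmoid(a * x_fake + b)
    grad_theta = np.mean(-d_fake * a)       # d/dtheta of E[log(1 - D(g(eps)))]
    theta = theta - lr * grad_theta

print("generator mean:", theta)            # should drift toward the real mean of 3
```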

Variational Bayes vs. Adversarial Training

[Side-by-side comparison: VAE vs. GAN.]

Deep Generative Models Summary

(Original) Density Network
  Training method: importance-sampled estimate of the marginal likelihood
  Auxiliary model: none

Variational Autoencoder
  Training method: stochastic gradient variational Bayes
  Auxiliary model: backend auxiliary model (the inference/encoder network)

Generative Adversarial Network
  Training method: adversarial training
  Auxiliary model: frontend auxiliary model (the discriminator)

The Dirichlet Process

[Stick-breaking construction: draw the mixture weights π from a GEM distribution. Fractions v_k ~ Beta(1, α) break off pieces of a unit-length stick, giving π_1 = v_1 and π_k = v_k ∏_{j<k} (1 − v_j).]
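A minimal sketch of a truncated stick-breaking (GEM) draw; the truncation level and concentration parameter are chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_gem(alpha=1.0, truncation=50):
    """Truncated stick-breaking draw of mixture weights pi ~ GEM(alpha):
    v_k ~ Beta(1, alpha), pi_k = v_k * prod_{j<k} (1 - v_j)."""
    v = rng.beta(1.0, alpha, size=truncation)
    v[-1] = 1.0                                   # close the stick at the truncation level
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining

pi = sample_gem(alpha=5.0)
print(pi[:5], pi.sum())                           # weights decay stochastically and sum to 1
```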

SGVB for the Dirichlet Process

Two obstacles:
1. Gradients through the stick-breaking weights π_k.
2. Gradients through samples from the mixture.

Differentiating Through π_k

Obstacle: the Beta distribution does not have a non-centered parametrization (except in special cases).

Solution: the Kumaraswamy distribution, a Beta-like distribution with a closed-form inverse CDF; use it as the variational posterior. (Poondi Kumaraswamy, 1930-1988.)
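A minimal sketch of the resulting reparametrized sampler, using the Kumaraswamy inverse CDF (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_kumaraswamy(a, b, size=None):
    """Reparametrized draw from Kumaraswamy(a, b) via its closed-form inverse CDF:
    u ~ Uniform(0, 1),  v = (1 - (1 - u)^(1/b))^(1/a).
    The transform is a deterministic, differentiable function of (a, b), so gradients
    w.r.t. the variational parameters pass through the sample (shown in NumPy for
    illustration; an autodiff framework would be used in practice)."""
    u = rng.uniform(size=size)
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

# Beta-like fractions v_k for the stick-breaking posterior:
v = sample_kumaraswamy(a=1.0, b=5.0, size=10)
print(v)
```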

Stochastic Backpropagation Through Mixtures

Obstacle: it is not obvious how to use the reparametrization trick for samples from a mixture distribution.

Brute-force solution: summing over all components yields a tractable ELBO but requires O(T) decoder propagations, one per component of the truncated mixture.
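The brute-force idea can be sketched as follows: the reconstruction term is computed once per component and combined with the mixture weights, so the decoder is invoked T times per data point. The toy decoder and uniform weights are placeholders of my own, not the talk's model.

```python
import numpy as np

rng = np.random.default_rng(6)

def decoder_log_lik(x, z):
    """Toy stand-in for log p(x | z); one call = one decoder propagation."""
    p = 1.0 / (1.0 + np.exp(-z.mean()))
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def mixture_reconstruction_term(x, pi, component_samples):
    """Brute-force marginalization over the T mixture components:
    E_q[log p(x | z)] ~= sum_k pi_k * log p(x | z_k), i.e. O(T) decoder propagations."""
    return sum(pi_k * decoder_log_lik(x, z_k)
               for pi_k, z_k in zip(pi, component_samples))

T, d = 10, 2
pi = np.full(T, 1.0 / T)                          # toy, uniform weights
component_samples = [rng.standard_normal(d) for _ in range(T)]
x = rng.binomial(1, 0.5, size=5)
print(mixture_reconstruction_term(x, pi, component_samples))
```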

Deep Generative Models with Stick-Breaking Priors

Why Combine the DP with DGMs?

[Latent variable Z drawn from a stochastic process, generating the observed data X.]

1. Dynamic Latent Representations (Width)
2. Neural Networks with Adaptive Computation (Depth)
3. Automatic and Principled Model Selection (Based on Marginal Likelihood)

3.1 The Stick-Breaking Variational Autoencoder

In collaboration with Padhraic Smyth

Deep Generative Models with Stick-Breaking Priors

MODEL #1: Stick-Breaking Variational Autoencoder (Nalisnick & Smyth, 2017)

[The inference network produces Kumaraswamy samples, which are composed by stick-breaking into the latent weights; the posterior is truncated (not necessary, but it learns faster). A sketch of this construction follows.]
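In this sketch, Kumaraswamy draws (one per stick break) are composed into a latent vector on the simplex via stick-breaking, which would then be fed to the decoder. The encoder outputs a and b are hard-coded placeholders of my own; a truncation level of 50 is used to match the experiments reported in the talk.

```python
import numpy as np

rng = np.random.default_rng(7)

def stick_breaking(v):
    """Compose Beta-like fractions v_1..v_{K-1} into weights pi on the simplex:
    pi_k = v_k * prod_{j<k}(1 - v_j), with the last weight taking the leftover stick."""
    v = np.append(v, 1.0)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining

# Hypothetical encoder outputs for one data point x: Kumaraswamy parameters (a_k, b_k)
# for each of the K - 1 breaks.
K = 50
a = np.ones(K - 1)
b = 5.0 * np.ones(K - 1)

u = rng.uniform(size=K - 1)
v = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)   # reparametrized Kumaraswamy draws
z = stick_breaking(v)                              # latent code on the simplex, fed to the decoder
print(z.shape, z.sum())                            # (50,), sums to 1
```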

Samples from the Generative Model (Nalisnick & Smyth, 2017)

[Stick-Breaking VAE: truncation level of 50 dimensions, Beta(1, 5) prior. Gaussian VAE: 50 dimensions, N(0, 1) prior.]

Quantitative Results for SB-VAE (Nalisnick & Smyth, 2017)

[Unsupervised and semi-supervised results: t-SNE visualizations of the MNIST latent spaces (Dirichlet vs. Gaussian), kNN classifier accuracy on the latent space, and (estimated) marginal likelihoods. The semi-supervised model is a nonparametric version of the M2 model of (Kingma et al., NIPS 2014).]

3.2 The Dirichlet Process Variational Autoencoder

In collaboration with Lars Hertel and Padhraic Smyth

Deep Generative Models with Stick-Breaking Priors

MODEL #2: Dirichlet Process Variational Autoencoder (Nalisnick et al., 2016)
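As a rough sketch of the generative story, assuming the latent prior is a truncated stick-breaking mixture of Gaussians; the component parameters and the decoder here are toy placeholders, not the trained model from the paper.

```python
import numpy as np

rng = np.random.default_rng(8)

# Ancestral sampling when the latent prior is a (truncated) stick-breaking mixture of Gaussians.
T, d = 5, 2
v = rng.beta(1.0, 1.0, size=T); v[-1] = 1.0
pi = v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])    # mixture weights ~ GEM
pi = pi / pi.sum()                                            # guard against floating-point drift
means = rng.normal(scale=2.0, size=(T, d))                    # toy component parameters

k = rng.choice(T, p=pi)                                       # pick a mixture component
z = rng.normal(means[k], 1.0)                                 # component-specific latent draw
x_mean = 1.0 / (1.0 + np.exp(-z.sum()))                       # toy decoder output
print("component", k, "latent", z, "decoder mean", x_mean)
```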

Samples from the Generative Model (Nalisnick et al., 2016)

[MNIST samples from mixture component #1 and from component #5.]

Quantitative Results for DP-VAE (Nalisnick et al., 2016)

[t-SNE visualizations of the MNIST latent spaces (Dirichlet process vs. Gaussian), kNN classifier accuracy on the latent space, and (estimated) marginal likelihoods.]

3.3 Adaptive Computation Time via Stick-Breaking

In collaboration with Marc-Alexandre Cote and Hugo Larochelle

Deep Generative Models with Stick-Breaking Priors

MODEL #3: Stick-Breaking Adaptive Computation

[A recurrent network reads input X and produces per-step outputs y_1, y_2, ..., y_k, weighted by stick-breaking weights π_1, π_2, ..., π_k; computation stops once the remaining stick is negligibly small. A sketch of the halting mechanism follows.]
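A toy sketch of the halting mechanism: per-step outputs are weighted by stick-breaking weights, and computation stops once the remaining stick falls below a threshold. The per-step outputs and fractions are random placeholders standing in for what a trained network would produce.

```python
import numpy as np

rng = np.random.default_rng(9)

def adaptive_steps(step_outputs, step_fractions, eps=1e-2):
    """Weight per-step outputs y_t by stick-breaking weights pi_t and halt once the
    remaining stick is negligibly small (toy version of the adaptive-computation idea)."""
    remaining, combined, used = 1.0, 0.0, 0
    for y_t, v_t in zip(step_outputs, step_fractions):
        pi_t = v_t * remaining          # pi_t = v_t * prod_{j<t} (1 - v_j)
        combined += pi_t * y_t
        remaining *= (1.0 - v_t)
        used += 1
        if remaining < eps:             # stop computing further steps
            break
    return combined, used

# Toy per-step outputs and halting fractions (a trained network would produce these).
ys = rng.standard_normal(20)
vs = rng.uniform(0.2, 0.6, size=20)
out, steps_used = adaptive_steps(ys, vs)
print(out, "computed in", steps_used, "steps")
```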

Quantitative Results for SB-RNN: Image Modeling

Quantitative Results for SB-RNN: Text Modeling

Conclusions

1. Superior Latent Spaces with NP Priors: they seem to preserve class structure well, resulting in better discriminative properties, and their dynamic capacity naturally encodes factors of variation.

2. Adaptive Computation: fusing NNs with nonparametric priors endows NNs with natural model selection properties.

Thank you. Questions?

Appendix