Transcript of Lecture 9: Deep Generative Models

Page 1

Lecture 9: Deep Generative Models
Efstratios Gavves, UVA Deep Learning Course

Page 2

oEarly Generative Models

oRestricted Boltzmann Machines

oDeep Boltzmann Machines

oDeep Belief Network

oContrastive Divergence

oGentle intro to Bayesian Modelling and Variational Inference

oVariational Autoencoders

oNormalizing Flows

Lecture overview

Page 3

Explicit density models

o Plug the model density function into the likelihood

o Then maximize the likelihood

Problems

o Design a model complex enough to match the data complexity

o At the same time, make sure the model is computationally tractable

oMore details in the next lecture

Page 4

Restricted Boltzmann Machines, Deep Boltzmann Machines, Deep Belief Nets

Page 5

o We can define an explicit density function over all possible relations ψ_c between the input variables x_c

p(x) = ∏_c ψ_c(x_c)

o Quite inefficient: think of all possible relations between 256 × 256 = 65K input variables ◦ Not just pairwise

o Solution: Define an energy function to model these relations

How to define a generative model?

Page 6

o First, define an energy function E(x); the joint distribution is then

p(x) = (1/Z) exp(−E(x))

o Z is a normalizing constant that makes sure p(x) is a pdf: ∫ p(x) dx = 1

Z = Σ_x exp(−E(x))

Boltzmann Distribution

Page 7

oWell understood in physics, mathematics and mechanics

oA Boltzmann distribution (also called Gibbs distribution) is a probability distribution, probability measure, or frequency distribution of particles in a system over various possible states

o The distribution is expressed in the form

F(state) ∝ exp(−E / kT)

o E is the state energy, k is the Boltzmann constant, T is the thermodynamic temperature

Why Boltzmann?

https://en.wikipedia.org/wiki/Boltzmann_distribution

Page 8

Problem with Boltzmann Distribution?

Page 9

o Assuming binary variables x, the normalizing constant has very high computational complexity

o For n-dimensional x we must enumerate all 2^n possible states to compute Z

o Clearly, this gets out of hand for any decent n

oSolution: Consider only pairwise relations

Problem with Boltzmann Distribution?
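A minimal sketch (not from the slides) of why Z is the bottleneck: brute-force enumeration of the partition function for a toy energy of the quadratic form introduced on the next slides, with hypothetical parameters W and b. The 2^n loop is exactly what becomes impossible for realistic n.

```python
import itertools
import numpy as np

# Brute-force partition function for n binary variables with a hypothetical
# quadratic energy E(x) = -x^T W x - b^T x. The loop has 2^n terms, which is
# why the exact normalizer is intractable for any decently sized n.

def partition_function(W, b):
    n = len(b)
    Z = 0.0
    for bits in itertools.product([0, 1], repeat=n):   # 2^n configurations
        x = np.array(bits, dtype=float)
        energy = -x @ W @ x - b @ x
        Z += np.exp(-energy)
    return Z

n = 12                                   # already 4096 states; n = 65536 is hopeless
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, n))
b = rng.normal(scale=0.1, size=n)
print(partition_function(W, b))
```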

Page 10

o The energy function becomes

E(x) = −x^T W x − b^T x

o x is considered binary

o x^T W x captures correlations between input variables

o b^T x captures the model prior ◦ The energy that each input variable contributes by itself

Boltzmann Machines

Page 11

Problem with Boltzmann Machines?

Page 12

oStill too complex and high-dimensional

o If π‘₯ has 256 Γ— 256 = 65536 dimensions

oThe pairwise relations need a huge π‘Š: 4.2 billion dimensions

o Just for connecting two layers!

oSolution: Consider latent variables for model correlations

Problem with Boltzmann Machines?

Page 13

oRestrict the model energy function further to a bottleneck over latents β„Ž

𝐸 π‘₯ = βˆ’π‘₯π‘‡π‘Šβ„Ž βˆ’ 𝑏𝑇π‘₯ βˆ’ π‘π‘‡β„Ž

Restricted Boltzmann Machines

Page 14

o𝐸 π‘₯ = βˆ’π‘₯π‘‡π‘Šβ„Ž βˆ’ 𝑏𝑇π‘₯ βˆ’ π‘π‘‡β„Ž

oThe π‘₯π‘‡π‘Šβ„Ž models correlations between π‘₯ and the latent activations via the parameter matrix π‘Š

oThe 𝑏𝑇π‘₯, π‘π‘‡β„Ž model the priors

oRestricted Boltzmann Machines (RBM) assume x, β„Ž to be binary

Restricted Boltzmann Machines

Page 15

o Energy function: E(x, h) = −x^T W h − b^T x − c^T h

p(x) = (1/Z) Σ_h exp(−E(x, h))

◦ Not in the form ∝ exp(−E(x))/Z because of the Σ over h

o Free energy function: F(x) = −b^T x − Σ_i log Σ_{h_i} exp(h_i (c_i + W_i x))

p(x) = (1/Z) exp(−F(x))

Z = Σ_x exp(−F(x))

Restricted Boltzmann Machines

π‘₯ β„Ž

Page 16

oThe 𝐹 π‘₯ defines a bipartite graph with undirected connectionsβ—¦ Information flows forward and backward

Restricted Boltzmann Machines

π‘₯ β„Ž

Page 17

oThe hidden units β„Žπ‘— are independent to each otherconditioned on the visible units

𝑝 β„Ž π‘₯ =ΰ·‘

𝑗

𝑝 β„Žπ‘— π‘₯, πœƒ

oThe hidden units π‘₯𝑖 are independent to each otherconditioned on the visible units

𝑝 π‘₯ β„Ž =ΰ·‘

𝑖

𝑝 π‘₯𝑖 β„Ž, πœƒ

Restricted Boltzmann Machines

Page 18

o The conditional probabilities are defined as sigmoids

p(h_j | x, θ) = σ(W_{·j}ᵀ x + c_j),  p(x_i | h, θ) = σ(W_{i·} h + b_i)

o Maximize the log-likelihood

L(θ) = (1/N) Σ_n log p(x_n | θ)

with

p(x) = (1/Z) exp(−F(x))

Training RBMs


Page 19

o Let's take the gradients

∂ log p(x_n | θ) / ∂θ = −∂F(x_n)/∂θ − ∂ log Z/∂θ

= −Σ_h p(h | x_n, θ) ∂E(x_n, h | θ)/∂θ + Σ_{x,h} p(x, h | θ) ∂E(x, h | θ)/∂θ

Training RBMs


Page 20

o Let's take the gradients

∂ log p(x_n | θ) / ∂θ = −∂F(x_n)/∂θ − ∂ log Z/∂θ

= −Σ_h p(h | x_n, θ) ∂E(x_n, h | θ)/∂θ + Σ_{x,h} p(x, h | θ) ∂E(x, h | θ)/∂θ

o The first term is easy, because we just substitute x_n into the definitions and sum over h

o The second term is hard, because you need to sum over both x and h, which can be huge ◦ It requires approximate inference, e.g., MCMC

Training RBMs

Page 21

oApproximate the gradient with Contrastive Divergence

o Specifically, apply a Gibbs sampler for k steps and approximate the gradient

∂ log p(x_n | θ) / ∂θ ≈ −( ∂E(x_n, h_0 | θ)/∂θ − ∂E(x_k, h_k | θ)/∂θ )

Training RBMs with Contrastive Divergence

Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Neural Computation, 2002
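A hedged sketch (my own illustration, not the course code) of a CD-1 update for a binary RBM: the positive statistics come from the data, the negative statistics from one Gibbs step, using the sigmoid conditionals from the earlier slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(x, W, b, c, lr=0.1):
    # x: one binary visible vector; W: (n_visible, n_hidden); b, c: biases.
    # Positive phase: h0 ~ p(h | x)
    ph0 = sigmoid(x @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: x1 ~ p(x | h0), then p(h | x1)
    px1 = sigmoid(h0 @ W.T + b)
    x1 = (rng.random(px1.shape) < px1).astype(float)
    ph1 = sigmoid(x1 @ W + c)
    # CD-1 approximation: positive minus negative statistics (-dE/dθ terms)
    W += lr * (np.outer(x, ph0) - np.outer(x1, ph1))
    b += lr * (x - x1)
    c += lr * (ph0 - ph1)
```

In practice the update is averaged over a mini-batch and iterated over the dataset; this per-sample version only shows the structure of the positive and negative phases.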

Page 22

o RBMs are just one layer

oUse RBM as a building block

o Stack multiple RBMs one on top of the other: p(x, h1, h2) = p(x | h1) · p(h1 | h2)

o Deep Belief Networks (DBN) are directed models ◦ The layers are densely connected and have a single forward flow

◦ This is because the RBM conditional is directional, p(x_i | h, θ) = σ(W_{i·} h + b_i), i.e., its input argument contains variables from one neighbouring layer only

Deep Belief Network

Page 23

o Stacking layers again, but now with connections from the layers above and below

o Since it's a Boltzmann machine, we need an energy function

E(x, h1, h2, h3 | θ) = x^T W1 h1 + h1^T W2 h2 + h2^T W3 h3

p(h2_k | h1, h3) = σ( Σ_j W2_{jk} h1_j + Σ_l W3_{kl} h3_l )

Deep Boltzmann Machines

Page 24

oSchematically similar to Deep Belief Networks

o But, Deep Boltzmann Machines (DBM) are undirected models ◦ They belong to the Markov Random Field family

o So, two types of relationships: bottom-up and top-down

p(h2_k | h1, h3) = σ( Σ_j W2_{jk} h1_j + Σ_l W3_{kl} h3_l )

Deep Boltzmann Machines

Page 25

oComputing gradients is intractable

o Instead, variational methods (mean-field) or sampling methods are used

Training Deep Boltzmann Machines

Page 26

Variational Inference

Page 27

o Observed variables x

o Latent variables θ ◦ Both the unobservable model parameters w and the unobservable model activations z ◦ θ = {w, z}

o Joint probability density function (pdf): p(x, θ)

o Marginal pdf: p(x) = ∫_θ p(x, θ) dθ

o Prior pdf (marginal over the input): p(θ) = ∫_x p(x, θ) dx ◦ Usually a user-defined pdf

o Posterior pdf: p(θ | x)

o Likelihood pdf: p(x | θ)

Some (Bayesian) Terminology

π‘₯

Page 28

o Posterior pdf

p(θ | x) = p(x, θ) / p(x)                       (conditional probability)

         = p(x | θ) p(θ) / p(x)                 (Bayes rule)

         = p(x | θ) p(θ) / ∫_θ' p(x, θ') dθ'    (marginal probability in the denominator)

         ∝ p(x | θ) p(θ)                        (p(x) is constant in θ)

o Posterior predictive pdf

p(y_new | y) = ∫_θ p(y_new | θ) p(θ | y) dθ

Bayesian Terminology

Page 29

o Conjugate priors ◦ When the posterior and the prior belong to the same family, the posterior is easy to compute in closed form

o Point estimate approximations of latent variables ◦ Instead of computing a distribution over all possible values for the variable, compute one point only ◦ E.g., the most likely value (maximum likelihood or maximum a posteriori estimate)

θ* = arg max_θ p(x | θ) p(θ)   (MAP)

θ* = arg max_θ p(x | θ)        (MLE)

◦ Quite good when the posterior distribution is peaky (low variance)

Bayesian Terminology

[Figure: point estimate of a neural network weight marked on the posterior]

Page 30

o Estimate the posterior density p(θ | x) for your training data x

o To do so, we need to define the prior p(θ) and likelihood p(x | θ) distributions

o Once the p(θ | x) density is estimated, Bayesian Inference is possible ◦ p(θ | x) is a (density) function, not just a single number (point estimate)

o But how to estimate the posterior density? ◦ Markov Chain Monte Carlo (MCMC): simulation-like estimation ◦ Variational Inference: turn estimation into optimization

Bayesian Modelling

Page 31

o Estimating the true posterior p(θ | x) is not always possible ◦ Especially for complicated models like neural networks

o Variational Inference assumes another function q(θ | φ) with which we want to approximate the true posterior p(θ | x) ◦ q(θ | φ) is the approximate posterior

◦ Note that the approximate posterior does not depend on the observable variables x

o We approximate by minimizing the reverse KL-divergence w.r.t. φ

φ* = arg min_φ KL(q(θ | φ) || p(θ | x))

oTurn inference into optimization

Variational Inference

Page 32

Variational Inference (graphically)

Underestimating variance. Why?

Page 33

Variational Inference (graphically)

Underestimating variance. Why?

Page 34

Variational Inference (graphically)

Underestimating variance. Why? How to overestimate variance?

Page 35

Variational Inference (graphically)

Underestimating variance. Why? How to overestimate variance? Forward KL
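A small numerical illustration (mine, not from the slides) of the point made graphically here: for a bimodal target, the reverse KL(q||p) prefers a narrow q locked onto one mode (underestimating variance), while the forward KL(p||q) prefers a broad, mass-covering q.

```python
import numpy as np

# Reverse vs forward KL on a grid: p is a bimodal mixture, q is a single Gaussian.

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(a, b):
    return np.sum(a * np.log((a + 1e-12) / (b + 1e-12))) * dx

p = 0.5 * gauss(x, -3, 1) + 0.5 * gauss(x, 3, 1)    # bimodal "true" posterior
q_narrow = gauss(x, -3, 1)                           # sits on one mode
q_broad  = gauss(x, 0, 3.2)                          # covers both modes

print("reverse KL(q||p):", kl(q_narrow, p), kl(q_broad, p))   # narrow q wins
print("forward  KL(p||q):", kl(p, q_narrow), kl(p, q_broad))  # broad q wins
```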

Page 36

o Given latent variables θ and the approximate posterior q_φ(θ) = q(θ | φ)

o What about the log marginal log p(x)?

Variational Inference - Evidence Lower Bound (ELBO)

Page 37

o Given latent variables θ and the approximate posterior q_φ(θ) = q(θ | φ)

o We want to maximize the marginal p(x) (or the log marginal log p(x))

log p(x) ≥ E_{q_φ(θ)}[ log ( p(x, θ) / q_φ(θ) ) ]

Variational Inference - Evidence Lower Bound (ELBO)

Page 38

Evidence Lower Bound (ELBO): Derivations

Page 39

o Given latent variables θ and the approximate posterior q_φ(θ) = q(θ | φ)

o The log marginal is

log p(x) = log ∫_θ p(x, θ) dθ

         = log ∫_θ q_φ(θ) ( p(x, θ) / q_φ(θ) ) dθ

         = log E_{q_φ(θ)}[ p(x, θ) / q_φ(θ) ]

         ≥ E_{q_φ(θ)}[ log ( p(x, θ) / q_φ(θ) ) ]

Evidence Lower Bound (ELBO): Derivations

Jensen's inequality: φ(E[x]) ≤ E[φ(x)] for convex φ; log is concave, so the inequality is reversed

Page 40

ELBO: A second derivation

Page 41

β‰₯ π”Όπ‘žπœ‘ πœƒ log𝑝(π‘₯, πœƒ)

π‘žπœ‘ πœƒ= π”Όπ‘žπœ‘ πœƒ log 𝑝(π‘₯|πœƒ) + π”Όπ‘žπœ‘ πœƒ log 𝑝 πœƒ βˆ’ π”Όπ‘žπœ‘ πœƒ log π‘žπœ‘ πœƒ

= π”Όπ‘žπœ‘ πœƒ log 𝑝(π‘₯|πœƒ) βˆ’ KL(π‘žπœ‘ πœƒ ||p(ΞΈ))

= ELBOΞΈ,Ο†(x)

oMaximize reconstruction accuracy π”Όπ‘žπœ‘ πœƒ log 𝑝(π‘₯|πœƒ)

oWhile minimizing the KL distance between the prior p(ΞΈ) and the approximate posterior π‘žπœ‘ πœƒ

ELBO: Formulation 1

Page 42

β‰₯ π”Όπ‘žπœ‘ πœƒ log𝑝(π‘₯, πœƒ)

π‘žπœ‘ πœƒ= π”Όπ‘žπœ‘ πœƒ log 𝑝(π‘₯, πœƒ) βˆ’ π”Όπ‘žπœ‘ πœƒ log π‘žπœ‘(πœƒ)

= π”Όπ‘žπœ‘ πœƒ log 𝑝(π‘₯, πœƒ) + Ξ— ΞΈ

= ELBOΞΈ,Ο†(x)

oMaximize something like negative Boltzmann energy π”Όπ‘žπœ‘ πœƒ log 𝑝(π‘₯, πœƒ)

oWhile maximizing the entropy the approximate posterior π‘žπœ‘ πœƒβ—¦Avoid collapsing latents ΞΈ to a single value (like for MAP estimates)

ELBO: Formulation 2
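A quick Monte Carlo sanity check (my own toy setup, not from the slides) that the two ELBO formulations are the same quantity and that both lower-bound log p(x). The example is a conjugate Gaussian model where log p(x) is known in closed form.

```python
import numpy as np

# Toy model: prior p(θ)=N(0,1), likelihood p(x|θ)=N(θ,1), so p(x)=N(0,2).
# q_φ(θ)=N(m, s^2) is a (deliberately imperfect) approximate posterior.

rng = np.random.default_rng(0)
x0, m, s = 1.3, 0.5, 0.8

def log_gauss(y, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2

theta = rng.normal(m, s, size=200_000)            # θ ~ q_φ
log_lik   = log_gauss(x0, theta, 1.0)             # log p(x0|θ)
log_prior = log_gauss(theta, 0.0, 1.0)            # log p(θ)
log_q     = log_gauss(theta, m, s)                # log q_φ(θ)

elbo_1 = np.mean(log_lik) - np.mean(log_q - log_prior)   # E[log p(x|θ)] - KL(q||p(θ))
elbo_2 = np.mean(log_lik + log_prior) - np.mean(log_q)   # E[log p(x,θ)] + H-style form
print(elbo_1, elbo_2)                             # the two formulations coincide
print(log_gauss(x0, 0.0, np.sqrt(2.0)))           # log p(x0) ≥ ELBO; gap = KL(q||posterior)
```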

Page 43

o It is easy to see that the ELBO is directly related to the marginal

log 𝑝(π‘₯) = ELBOΞΈ,Ο† x + 𝐾𝐿(π‘žπœ‘ πœƒ ||𝑝(πœƒ|π‘₯))

oYou can also see ELBOΞΈ,Ο† x as Variational Free Energy

ELBO vs. Marginal

Page 44

o It is easy to see that the ELBO is directly related to the marginal: ELBO_{θ,φ}(x) = ...

ELBO vs. Marginal: Derivations

Page 45

o It is easy to see that the ELBO is directly related to the marginal

ELBO_{θ,φ}(x) = E_{q_φ(θ)}[ log p(x, θ) ] − E_{q_φ(θ)}[ log q_φ(θ) ]

              = E_{q_φ(θ)}[ log p(θ | x) ] + E_{q_φ(θ)}[ log p(x) ] − E_{q_φ(θ)}[ log q_φ(θ) ]

              = E_{q_φ(θ)}[ log p(x) ] − KL(q_φ(θ) || p(θ | x))

              = log p(x) − KL(q_φ(θ) || p(θ | x))      (log p(x) does not depend on q_φ(θ); E_{q_φ(θ)}[1] = 1)

⇒ log p(x) = ELBO_{θ,φ}(x) + KL(q_φ(θ) || p(θ | x))

o You can also see ELBO_{θ,φ}(x) as the Variational Free Energy

ELBO vs. Marginal: Derivations

Page 46

o log 𝑝(π‘₯) = ELBOΞΈ,Ο† x + 𝐾𝐿(π‘žπœ‘ πœƒ ||𝑝(πœƒ|π‘₯))

oThe log-likelihood log 𝑝(π‘₯) constant does not depend on any parameter

oAlso, ELBOΞΈ,Ο† x > 0 and 𝐾𝐿(π‘žπœ‘ πœƒ ||𝑝 πœƒ π‘₯ ) > 0

1. The higher the Variational Lower Bound ELBOΞΈ,Ο† x , the smaller the difference between the approximate posterior π‘žπœ‘ πœƒ and the true posterior 𝑝 πœƒ π‘₯ better latent representation

2. The Variational Lower Bound ELBOΞΈ,Ο† x approaches the log-likelihood better density model

ELBO interpretations

Page 47

o The variational distribution q(θ | φ) does not depend directly on the data ◦ Only indirectly, via minimizing its distance to the true posterior, KL(q(θ | φ) || p(θ | x))

o So, with q(θ | φ) we have a major optimization problem

o The approximate posterior must approximate the whole dataset x = [x_1, x_2, …, x_N] jointly

o Otherwise, we would need different variational parameters for each data point x_i

Amortized Inference

Page 48

o Better to share weights and "amortize" the optimization between the individual data points

q(θ | φ) = q_φ(θ | x)

o Predict the model parameters θ using a φ-parameterized model of the input x

o Use amortization for the parameters that depend on the data ◦ E.g., the latent activations that are the output of a neural network layer: z ~ q_φ(z | x)

Amortized Inference

Page 49

Amortized Inference (Intuitively)

oThe original view on Variational Inference is that π‘ž πœƒ|πœ‘ describes the approximate posterior of the dataset as a whole

o Imagine you don't want to make a practical model that returns latent activations for a specific input

o Instead, you want to optimally approximate the true posterior of the unknown weights with a model with latent parameters

o It doesn't matter whether these parameters are "latent activations" z or "model variables" w

Page 50

o Let's rewrite the ELBO a bit more explicitly

ELBO_{θ,φ}(x) = E_{q_φ(θ)}[ log p(x | θ) ] − KL(q_φ(θ) || p(θ))

              = E_{q_φ(z|x)}[ log p_θ(x | z) ] − KL(q_φ(z | x) || p_λ(z))

o p_θ(x | z) instead of p(x | θ)

o I.e., the likelihood model p_θ(x | z) has weights parameterized by θ

o Conditioned on the latent model activations z

Variational Autoencoders

Page 51

o Let's rewrite the ELBO a bit more explicitly

ELBO_{θ,φ}(x) = E_{q_φ(θ)}[ log p(x | θ) ] − KL(q_φ(θ) || p(θ))

              = E_{q_φ(z|x)}[ log p_θ(x | z) ] − KL(q_φ(z | x) || p_λ(z))

o p_λ(z) instead of p(θ)

o I.e., a λ-parameterized prior only on the latent activations z

oNot on model weights

Variational Autoencoders

Page 52

o Let's rewrite the ELBO a bit more explicitly

ELBO_{θ,φ}(x) = E_{q_φ(θ)}[ log p(x | θ) ] − KL(q_φ(θ) || p(θ))

              = E_{q_φ(z|x)}[ log p_θ(x | z) ] − KL(q_φ(z | x) || p_λ(z))

o q_φ(z | x) instead of q(θ | φ)

o The model q_φ(z | x) approximates the posterior density of the latents z

o The model weights are parameterized by φ

Variational Autoencoders

Page 53

oELBOπœƒ,πœ‘ π‘₯ = π”Όπ‘žπœ‘ 𝑧|π‘₯ log π‘πœƒ(π‘₯|𝑧) βˆ’ KL(π‘žπœ‘ 𝑧|π‘₯ ||pΞ»(z))

oHow to model π‘πœƒ(π‘₯|𝑧) and π‘žπœ‘ 𝑧|π‘₯ ?

Variational Autoencoders

Page 54

oELBOπœƒ,πœ‘ π‘₯ = π”Όπ‘žπœ‘ 𝑧|π‘₯ log π‘πœƒ(π‘₯|𝑧) βˆ’ KL(π‘žπœ‘ 𝑧|π‘₯ ||pΞ»(z))

oHow to model π‘πœƒ(π‘₯|𝑧) and π‘žπœ‘ 𝑧|π‘₯ ?

oWhat about modelling them as neural networks

Variational Autoencoders

Page 55

o The approximate posterior q_φ(z | x) is a ConvNet (or MLP) ◦ The input x is an image

◦ Given the input, the output is a feature map for the latent variable z

◦ Also known as the encoder or inference or recognition network, because it infers/recognizes the latent codes

o The likelihood density p_θ(x | z) is an inverted ConvNet (or MLP) ◦ Given the latent z as input, it reconstructs the input x

◦ Also known as the decoder or generator network

o If we ignore the distribution of the latents z, p_λ(z), then we get the vanilla Autoencoder

Variational Autoencoders

[Figure: encoder/inference/recognition network q_φ(z | x) maps x to the latent z with prior p_λ(z); decoder/generator network p_θ(x | z) maps z back to x]

Page 56

o Maximize the Evidence Lower Bound (ELBO) ◦ Or minimize the negative ELBO

L(θ, φ) = E_{q_φ(z|x)}[ log p_θ(x | z) ] − KL(q_φ(z | x) || p_λ(z))

o How do we optimize the ELBO?

Training Variational Autoencoders

Page 57

o Maximize the Evidence Lower Bound (ELBO) ◦ Or minimize the negative ELBO

L(θ, φ) = E_{q_φ(z|x)}[ log p_θ(x | z) ] − KL(q_φ(z | x) || p_λ(z))

        = ∫_z q_φ(z | x) log p_θ(x | z) dz − ∫_z q_φ(z | x) log ( q_φ(z | x) / p_λ(z) ) dz

o Forward propagation: compute the two terms

o The first term is an integral (expectation) that we cannot solve analytically, so we need to sample from the pdf instead ◦ When p_θ(x | z) contains even a few nonlinearities, like in a neural network, the integral is hard to compute analytically

Training Variational Autoencoders

Page 58

o Maximize the Evidence Lower Bound (ELBO) ◦ Or minimize the negative ELBO

L(θ, φ) = E_{q_φ(z|x)}[ log p_θ(x | z) ] − KL(q_φ(z | x) || p_λ(z))

        = ∫_z q_φ(z | x) log p_θ(x | z) dz − ∫_z q_φ(z | x) log ( q_φ(z | x) / p_λ(z) ) dz

o Forward propagation: compute the two terms

o The first term is an integral (expectation) that we cannot solve analytically ◦ When p_θ(x | z) contains even a few nonlinearities, like in a neural network, the integral is hard to compute analytically

o So, we need to sample from the pdf instead

o VAE is a stochastic model

o The second term is the KL divergence between two distributions that we know

Training Variational Autoencoders

Page 59

oβˆ«π‘§ π‘žπœ‘ 𝑧 π‘₯ log π‘πœƒ(π‘₯|𝑧) 𝑑𝑧

oThe first term is an integral (expectation) that we cannot solve analytically.β—¦When π‘πœƒ(π‘₯|𝑧) contains even a few nonlinearities, like in a neural network, the integral is hard to compute analytically

oAs we cannot compute analytically, we sample from the pdf insteadβ—¦Using the density π‘žπœ‘ 𝑧 π‘₯ to draw samples

β—¦Usually one sample is enough stochasticity reduces overfitting

oVAE is a stochastic model

oThe second term is the KL divergence between two distributions that we know

Training Variational Autoencoders

Page 60

oβˆ«π‘§ π‘žπœ‘ 𝑧 π‘₯ logπ‘žπœ‘(𝑧|π‘₯)

π‘πœ†(𝑧)𝑑𝑧

oThe second term is the KL divergence between two distributions that we know

oE.g., compute the KL divergence between a centered 𝑁(0, 1) and a non-centered 𝑁(πœ‡, 𝜎) gaussian

Training Variational Autoencoders
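For the Gaussian case used here, that KL term has a closed form, KL(N(μ, σ²) || N(0, 1)) = ½ Σ (μ² + σ² − log σ² − 1). A small sketch (mine) that checks the formula against a Monte Carlo estimate:

```python
import numpy as np

# Closed-form KL(N(mu, sigma^2) || N(0, 1)) per latent dimension, verified
# against a Monte Carlo estimate E_q[log q(z) - log p(z)].

def kl_gauss_std_normal(mu, sigma):
    return 0.5 * np.sum(mu**2 + sigma**2 - 2.0 * np.log(sigma) - 1.0)

rng = np.random.default_rng(0)
mu, sigma = np.array([0.7, -1.2]), np.array([0.5, 1.5])

z = mu + sigma * rng.normal(size=(1_000_000, 2))              # z ~ N(mu, sigma^2)
log_q = np.sum(-0.5*np.log(2*np.pi*sigma**2) - 0.5*((z - mu)/sigma)**2, axis=1)
log_p = np.sum(-0.5*np.log(2*np.pi) - 0.5*z**2, axis=1)
print(kl_gauss_std_normal(mu, sigma), np.mean(log_q - log_p))  # both ≈ the same value
```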

Page 61

o We set the prior p_λ(z) to be the unit Gaussian: p(z) ~ N(0, 1)

o We set the likelihood to be a Bernoulli for binary data

p(x | z) ~ Bernoulli(π)

o We set q_φ(z | x) to be a neural network (MLP, ConvNet), which maps an input x to the Gaussian distribution, specifically its mean and variance ◦ μ_z, σ_z ~ q_φ(z | x)

◦ The neural network has two outputs: one is the mean μ_z and the other is σ_z, which corresponds to the covariance of the Gaussian

Typical VAE

π‘žπœ‘ 𝑧|π‘₯

πœ‡π‘§ πœŽπ‘§

πœ‡π‘§

πœŽπ‘§

Page 62

oWe set π‘πœƒ(x|z) to be an inverse neural network, which maps Z to the Bernoulli distribution if our outputs binary (e.g. Binary MNIST)

oGood exercise: Derive the ELBO for the standard VAE

Typical VAE

π‘žπœ‘ 𝑧|π‘₯

πœ‡π‘§ πœŽπ‘§

πœ‡π‘§

πœŽπ‘§

Page 63

VAE: Interpolation in the latent space

Page 64

o Sample z from the approximate posterior density z ~ q_φ(z | x) ◦ As q_φ is a neural network that outputs the parameters of a specific and known parametric pdf, e.g., a Gaussian, sampling from it is rather easy

◦ Often even a single draw is enough

o Second, compute log p_θ(x | z) ◦ As p_θ is a neural network that outputs the parameters of a specific and known parametric pdf, e.g., a Bernoulli for white/black pixels, computing the log-prob is easy

oComputing the ELBO is rather straightforward in the standard case

oHow should we optimize the ELBO?

Forward propagation in VAE

Page 65

o Sample z from the approximate posterior density z ~ q_φ(z | x) ◦ As q_φ is a neural network that outputs the parameters of a specific and known parametric pdf, e.g., a Gaussian, sampling from it is rather easy

◦ Often even a single draw is enough

o Second, compute log p_θ(x | z) ◦ As p_θ is a neural network that outputs the parameters of a specific and known parametric pdf, e.g., a Bernoulli for white/black pixels, computing the log-prob is easy

oComputing the ELBO is rather straightforward in the standard case

oHow should we optimize the ELBO? Backpropagation?

Forward propagation in VAE

Page 66

o Backpropagation: compute the gradients of L(θ, φ) = E_{z~q_φ(z|x)}[ log p_θ(x | z) ] − KL(q_φ(z | x) || p_λ(z))

oWe must take the gradients with respect to the trainable parameters

oThe generator network parameters πœƒ

oThe inference network/approximate posterior parameters πœ‘

Backward propagation in VAE

Page 67

o Let's try to compute the following integral

E[f] = ∫_x p(x) f(x) dx

where p(x) is a probability density function for x ◦ Often complex if p(x) or f(x) is even slightly complicated

o Instead, we can approximate the integral with a summation

E[f] = ∫_x p(x) f(x) dx ≈ (1/N) Σ_{i=1}^N f(x_i),  x_i ~ p(x),  which we call f̂

o The estimator is unbiased, E[f̂] = E[f], and its variance is

Var(f̂) = (1/N) E[(f − E[f])²]

o So, if we have an easy-to-sample probability density function inside the integral, we can do Monte Carlo integration to approximate the integral

Monte Carlo Integration
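A quick sanity check (illustrative, not from the slides): estimating E[f(x)] = ∫ p(x) f(x) dx for p = N(0, 1) and f(x) = x², whose true value is 1. The spread of the estimator shrinks roughly as 1/√N.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

for N in [10, 1_000, 100_000]:
    estimates = [f(rng.normal(size=N)).mean() for _ in range(200)]
    print(N, np.mean(estimates), np.std(estimates))   # mean ≈ 1, std ~ 1/sqrt(N)
```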

Page 68

o Backpropagation: compute the gradients of L(θ, φ) = E_{z~q_φ(z|x)}[ log p_θ(x | z) ] − KL(q_φ(z | x) || p_λ(z))

Gradients w.r.t. the generator parameters πœƒ

Page 69

o Backpropagation: compute the gradients of L(θ, φ) = E_{z~q_φ(z|x)}[ log p_θ(x | z) ] − KL(q_φ(z | x) || p_λ(z)) with respect to θ and φ

o ∇_θ L = E_{z~q_φ(z|x)}[ ∇_θ log p_θ(x | z) ]

o The expectation and the sampling in E_{z~q_φ(z|x)} do not depend on θ

o Also, the KL term does not depend on θ, so no gradient from there!

o So, no problem: just Monte Carlo integration using samples z drawn from q_φ(z | x)

Gradients w.r.t. the generator parameters πœƒ

Page 70

o Backpropagation: compute the gradients of L(θ, φ) = E_{z~q_φ(z|x)}[ log p_θ(x | z) ] − KL(q_φ(z | x) || p_λ(z))

o Our latent variable z is a Gaussian (in the standard VAE) represented by μ_z, σ_z

o So, we can train by sampling randomly from that Gaussian, z ~ N(μ_z, σ_z)

o Problem?

o Sampling z ~ q_φ(z | x) is not differentiable ◦ And after sampling, z is a fixed value (not a function), so we cannot backprop

o Not differentiable → no gradients

o No gradients → no backprop → no training!

Gradients w.r.t. the recognition parameters πœ‘

Page 71

oπ›»πœ‘π”Όπ‘§~π‘žπœ‘ 𝑧|π‘₯ log π‘πœƒ(π‘₯|𝑧) = π›»πœ‘ βˆ«π‘§ π‘žπœ‘ 𝑧|π‘₯ log π‘πœƒ π‘₯ 𝑧 𝑑𝑧

= ࢱ𝑧

π›»πœ‘[π‘žπœ‘ 𝑧|π‘₯ ] log π‘πœƒ π‘₯ 𝑧 𝑑𝑧

oProblem: Monte Carlo integration not possible anymoreβ—¦No density function inside the integral

β—¦Only the gradient of a density function

oSimilar to Monte Carlo integration, we want to have an expression where there is a density function inside the summation

oThat way we can express it again as Monte Carlo integration

Solution: Monte Carlo Differentiation?

Page 72

oπ›»πœ‘π”Όπ‘§~π‘žπœ‘ 𝑧|π‘₯ log π‘πœƒ(π‘₯|𝑧) = π›»πœ‘ βˆ«π‘§ π‘žπœ‘ 𝑧|π‘₯ log π‘πœƒ π‘₯ 𝑧 𝑑𝑧

= ࢱ𝑧

π›»πœ‘[π‘žπœ‘ 𝑧|π‘₯ ] log π‘πœƒ π‘₯ 𝑧 𝑑𝑧

o βˆ«π‘§π›»πœ‘[π‘žπœ‘ 𝑧|π‘₯ ] log π‘πœƒ π‘₯ 𝑧 𝑑𝑧 =

= ࢱ𝑧

π‘žπœ‘ 𝑧|π‘₯

π‘žπœ‘ 𝑧|π‘₯π›»πœ‘ π‘žπœ‘ 𝑧|π‘₯ log π‘πœƒ π‘₯ 𝑧 𝑑𝑧

NOTE: 𝛻π‘₯log 𝑓 π‘₯ =1

𝑓 π‘₯𝛻π‘₯𝑓 π‘₯

= ࢱ𝑧

π‘žπœ‘ 𝑧|π‘₯ π›»πœ‘ log π‘žπœ‘ 𝑧|π‘₯ log π‘πœƒ π‘₯ 𝑧 𝑑𝑧

= 𝔼𝑧~π‘žπœ‘ 𝑧|π‘₯ [π›»πœ‘ log π‘žπœ‘ 𝑧|π‘₯ log π‘πœƒ π‘₯ 𝑧 ]

Solution: Monte Carlo Differentiation?

Page 73

oπ›»πœ‘π”Όπ‘§~π‘žπœ‘ 𝑧|π‘₯ log π‘πœƒ(π‘₯|𝑧) = 𝔼𝑧~π‘žπœ‘ 𝑧|π‘₯ π›»πœ‘ log π‘žπœ‘ 𝑧|π‘₯ log π‘πœƒ π‘₯ 𝑧

=

𝑖

π›»πœ‘ log π‘žπœ‘ 𝑧𝑖|π‘₯ log π‘πœƒ π‘₯ 𝑧𝑖 , 𝑧𝑖~π‘žπœ‘ 𝑧|π‘₯

oAlso known as REINFORCE or score-function estimatorβ—¦ log π‘πœƒ π‘₯ 𝑧 is called score function

β—¦Used to approximate gradients of non-differentiable function

β—¦Highly popular in Reinforcement Learning, where we also sample from policies

oProblem: Typically high-variance gradients

o Slow and not very effective learning

Solution: Monte Carlo Differentiation == REINFORCE

Page 74

o So, our latent variable z is a Gaussian (in the standard VAE) represented by the mean and variance μ_z, σ_z, which are the output of a neural net

o So, we can train by sampling randomly from that Gaussian: z ~ N(μ_z, σ_z)

o Once we have that z, however, it's a fixed value (not a function), so we cannot backprop

o We can use the REINFORCE algorithm to compute an approximation to the gradient ◦ High-variance gradients → slow and not very effective learning

To sum up

Page 75

o Remember, we have a Gaussian output z ~ N(μ_z, σ_z)

o For certain pdfs, including the Gaussian, we can rewrite their random variable z as a deterministic transformation of an auxiliary and simpler random variable ε

z ~ N(μ, σ) ⇔ z = μ + ε · σ,  ε ~ N(0, 1)

o μ, σ are deterministic (not random) values

o Long story short:

o We can model μ, σ with our NN encoder/recognition network

o And ε comes externally

Solution: Reparameterization trick
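A tiny autograd check (illustrative, PyTorch): with z = μ + ε · σ the sample stays a differentiable function of μ and σ, so gradients of any downstream loss reach the encoder parameters, while ε carries all the randomness.

```python
import torch

mu = torch.tensor([0.5], requires_grad=True)
log_sigma = torch.tensor([-1.0], requires_grad=True)

eps = torch.randn(1)                        # external noise, no gradient needed
z = mu + eps * log_sigma.exp()              # reparameterized sample
loss = (z ** 2).sum()                       # any downstream loss
loss.backward()

print(mu.grad, log_sigma.grad)              # both gradients are populated
```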

Page 76

o Change of variables: z = g(ε), p(z) dz = p(ε) dε

◦ Intuitively, the probability mass must be invariant under the transformation

o In our case: ε ~ q(ε) = N(0, 1), z = g_φ(ε) = μ_φ + ε · σ_φ

o ∇_φ E_{z~q_φ(z|x)}[ log p_θ(x | z) ] = ∇_φ ∫_z q_φ(z | x) log p_θ(x | z) dz

= ∇_φ ∫_ε q(ε) log p_θ(x | μ_φ, σ_φ, ε) dε

= ∫_ε q(ε) ∇_φ log p_θ(x | μ_φ, σ_φ, ε) dε

o ∇_φ E_{z~q_φ(z|x)}[ log p_θ(x | z) ] ≈ Σ_i ∇_φ log p_θ(x | μ_φ, σ_φ, ε_i),  ε_i ~ N(0, 1) ◦ The Monte Carlo integration does not depend on the parameter of interest φ anymore

What do we gain?

Page 77

o Sampling directly from ε ~ N(0, 1) leads to low-variance estimates compared to sampling directly from z ~ N(μ_z, σ_z) ◦ Why low variance? Exercise for the interested reader

o Remember: since we are sampling z, we are also sampling gradients ◦ Stochastic gradient estimator

o More distributions beyond the Gaussian are possible: Laplace, Student-t, Logistic, Cauchy, Rayleigh, Pareto

Solution: Reparameterization trick

[Figure: high-variance vs. low-variance gradient estimates]

http://blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/
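A small numerical comparison (my own toy, not from the slides) of the two estimators for d/dμ E_{z~N(μ,1)}[z²], whose true value is 2μ: both are unbiased, but the score-function (REINFORCE) estimator has far higher variance than the reparameterized one.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, N = 2.0, 100_000
eps = rng.normal(size=N)
z = mu + eps                                   # z ~ N(mu, 1)

score_fn_grad = (z ** 2) * (z - mu)            # f(z) * d/dmu log N(z; mu, 1)
reparam_grad  = 2.0 * z                        # d/dmu f(mu + eps) = 2 (mu + eps)

for name, g in [("score function", score_fn_grad), ("reparameterization", reparam_grad)]:
    print(name, "mean:", g.mean(), "variance:", g.var())   # same mean (≈ 2*mu), very different variance
```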

Page 78

o Again, the latent variable is z = μ_φ + ε · σ_φ

o μ_φ and σ_φ are deterministic functions (via the neural network encoder)

o ε is a random variable, which comes externally

o The resulting z is itself a random variable, because of ε

o However, now the randomness is not associated with the neural network and the parameters that we have to learn ◦ The randomness instead comes from the external ε

◦ The gradients flow through μ_φ and σ_φ

Once more: what is random in the reparameterization trick?

Page 79

Reparameterization Trick (graphically)

Page 80

Variational Autoencoders

https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html

Page 81

VAE Training Pseudocode
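The slide's pseudocode is an image; a hedged sketch of one possible training loop, reusing the VAE class sketched after Page 62. The negative ELBO is the Bernoulli reconstruction term plus the closed-form Gaussian KL; `data_loader` is an assumed iterator over binarized 784-dimensional batches.

```python
import torch
import torch.nn.functional as F

model = VAE()                                              # from the sketch after Page 62
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def negative_elbo(x, logits, mu, logvar):
    # E_q[log p(x|z)] approximated with a single reparameterized sample
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL(q_phi(z|x) || N(0, I)) in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

for epoch in range(10):
    for x in data_loader:                                  # assumed: binarized (batch, 784) inputs
        logits, mu, logvar = model(x)
        loss = negative_elbo(x, logits, mu, logvar)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```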

Page 82

VAE for NLP

Page 83

VAE for Image Resynthesis

Page 84

VAE for designing chemical compounds

Page 85

o The VAE cannot model p(x) directly because of its intractable formulation

o Normalizing Flows solve exactly that problem

o They do so with a series of invertible transformations that allow for much more complex latent distributions (beyond Gaussians)

o The loss is the negative log-likelihood (not the ELBO)

Normalizing Flows

Page 86

Series of invertible transformations

https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html

Page 87

oUsing simple pdfs, like a Gaussian, for the approximate posterior limits the expressivity of the model

o Better to make sure the approximate posterior comes from a class of models that can even contain the true posterior

o Use a series of K invertible transformations to construct the approximate posterior ◦ z_K = f_K ∘ f_{K−1} ∘ ⋯ ∘ f_1(z_0)

◦ Change-of-variables rule

Normalizing Flows https://blog.evjang.com/2018/01/nf1.html

https://www.shakirm.com/slides/DeepGenModelsTutorial.pdf

https://arxiv.org/pdf/1505.05770.pdf

Changing from the variable x to y using the transformation y = f(x) = 2x + 1

Page 88

o π‘₯ = π‘§π‘˜ = π‘“π‘˜ ∘ π‘“π‘˜βˆ’1 ∘ ⋯𝑓1 𝑧0 β†’ 𝑧𝑖 = 𝑓𝑖 π‘§π‘–βˆ’1

o Again, change of variables (multi-dimensional): 𝑝𝑖 𝑧𝑖 = π‘π‘–βˆ’1 π‘“π‘–βˆ’1(𝑧𝑖) | det

π‘‘π‘“π‘–βˆ’1

𝑑𝑧𝑖|

o log 𝑝 π‘₯ = log πœ‹K 𝑧𝐾 = log πœ‹Kβˆ’1 π‘§πΎβˆ’1 βˆ’ log | det𝑑𝑓𝐾

π‘‘π‘“πΎβˆ’1|

= β‹―

= log πœ‹0 𝑧0 βˆ’

𝑖

𝐾

log detπ‘‘π‘“π‘–π‘‘π‘§π‘–βˆ’1

o Two requirements

1. 𝑓𝑖 must be easily invertible

2. The Jacobian of 𝑓𝑖 must be easy to compute

Normalizing Flows: Log-likelihood

https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html
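A minimal single-step flow sketch (mine, not from the slides): an invertible affine transform x = f(z) = exp(a)·z + b applied to a base z ~ N(0, 1). The flow log-likelihood log p(x) = log π_0(f⁻¹(x)) − log |det df/dz| reduces here to a shifted, scaled Gaussian, which we can verify in closed form; a and b are illustrative parameters.

```python
import numpy as np

a, b = 0.7, 2.0                                     # log-scale and shift of the affine flow

def log_standard_normal(z):
    return -0.5 * np.log(2 * np.pi) - 0.5 * z ** 2

def flow_log_prob(x):
    z0 = (x - b) * np.exp(-a)                       # inverse transform f^{-1}(x)
    log_det = a                                     # log |det df/dz| = log exp(a)
    return log_standard_normal(z0) - log_det

x = np.array([1.0, 2.0, 4.5])
closed_form = -0.5*np.log(2*np.pi*np.exp(2*a)) - 0.5*((x - b)/np.exp(a))**2  # N(b, e^{2a})
print(flow_log_prob(x))
print(closed_form)                                  # the two agree
```

Stacking several such invertible steps (with easy Jacobians) is exactly what gives the more expressive posteriors and densities discussed on the following slides.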

Page 89

Normalizing Flows https://blog.evjang.com/2018/01/nf1.html

https://www.shakirm.com/slides/DeepGenModelsTutorial.pdf

https://arxiv.org/pdf/1505.05770.pdf

Page 90

Normalizing Flows

https://www.shakirm.com/slides/DeepGenModelsTutorial.pdf

Page 91

Normalizing Flows on Non-Euclidean Manifolds

https://www.shakirm.com/slides/DeepGenModelsTutorial.pdf

Page 92

Normalizing Flows on Non-Euclidean Manifolds

Page 93

Summary

oGentle intro to Bayesian Modelling and Variational Inference

oRestricted Boltzmann Machines

oDeep Boltzmann Machines

oDeep Belief Network

oContrastive Divergence

oVariational Autoencoders

oNormalizing Flows