Lecture 9: Deep Generative Models
Transcript of Lecture 9: Deep Generative Models
Lecture 9: Deep Generative Models
Efstratios Gavves
o Early Generative Models
o Restricted Boltzmann Machines
o Deep Boltzmann Machines
o Deep Belief Network
o Contrastive Divergence
o Gentle intro to Bayesian Modelling and Variational Inference
o Variational Autoencoders
o Normalizing Flows
Lecture overview
Explicit density models
o Plug the model density function into the likelihood
o Then maximize the likelihood
Problems
o Design a model complex enough to meet the data complexity
o At the same time, make sure the model is computationally tractable
o More details in the next lecture
Restricted Boltzmann Machines, Deep Boltzmann Machines, Deep Belief Nets
o We can define an explicit density function over all possible relations p_i between the input variables x_i:
  p(x) = ∏_i p_i(x_i)
o Quite inefficient: think of all possible relations between 256 × 256 = 65K input variables… and not just pairwise ones
o Solution: define an energy function to model these relations
How to define a generative model?
o First, define an energy function E(x) that models the joint distribution
  p(x) = (1/Z) exp(−E(x))
o Z is a normalizing constant that makes sure p(x) is a pdf: ∫ p(x) dx = 1
  Z = ∑_x exp(−E(x))
Boltzmann Distribution
o Well understood in physics, mathematics, and mechanics
o A Boltzmann distribution (also called a Gibbs distribution) is a probability distribution, probability measure, or frequency distribution of particles in a system over various possible states
o The distribution is expressed in the form
  F(state) ∝ exp(−E / (kT))
o E is the state energy, k is the Boltzmann constant, T is the thermodynamic temperature
Why Boltzmann?
https://en.wikipedia.org/wiki/Boltzmann_distribution
o Assuming binary variables x, the normalizing constant has a very high computational complexity
o For n-dimensional x we must enumerate all 2^n possible configurations to compute Z
o Clearly, this gets out of hand for any decent n
o Solution: consider only pairwise relations
Problem with Boltzmann Distribution?
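To make the blow-up concrete, here is a minimal sketch (not from the lecture) that computes Z by brute force for a toy quadratic energy over binary variables; the parameters W, b and the tiny dimensionality n are made-up placeholders:

```python
# Sketch: exact partition function of a toy binary energy model.
# Illustrative only; W, b and the tiny dimensionality n are placeholders.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 12                      # already 2**12 = 4096 states; n = 65536 is hopeless
W = rng.normal(scale=0.1, size=(n, n))
b = rng.normal(scale=0.1, size=n)

def energy(x):
    # E(x) = -x^T W x - b^T x, with binary x
    return -x @ W @ x - b @ x

# Z = sum over all 2**n binary configurations of exp(-E(x))
Z = sum(np.exp(-energy(np.array(x)))
        for x in itertools.product([0, 1], repeat=n))
print(Z)   # every extra dimension doubles the number of terms
```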
o The energy function becomes
  E(x) = −xᵀWx − bᵀx
o x is considered binary
o xᵀWx captures the correlations between the input variables
o bᵀx captures the model prior… the energy that each input variable contributes by itself
Boltzmann Machines
o Still too complex and high-dimensional
o If x has 256 × 256 = 65,536 dimensions,
o the pairwise relations need a huge W: about 4.3 billion parameters (65,536²)
o Just for connecting two layers!
o Solution: consider latent variables to model the correlations
Problem with Boltzmann Machines?
o Restrict the model energy function further to a bottleneck over latents h:
  E(x, h) = −xᵀWh − bᵀx − cᵀh
Restricted Boltzmann Machines
o E(x, h) = −xᵀWh − bᵀx − cᵀh
o The term xᵀWh models the correlations between x and the latent activations h via the parameter matrix W
o The terms bᵀx and cᵀh model the priors
o Restricted Boltzmann Machines (RBMs) assume x and h to be binary
Restricted Boltzmann Machines
o Energy function: E(x, h) = −xᵀWh − bᵀx − cᵀh
  p(x) = (1/Z) ∑_h exp(−E(x, h))
  … not directly of the form exp(−E(x))/Z, because of the sum over h
o Free energy function: F(x) = −bᵀx − ∑_j log ∑_{h_j} exp( h_j (c_j + (Wᵀx)_j) )
  p(x) = (1/Z) exp(−F(x))
  Z = ∑_x exp(−F(x))
Restricted Boltzmann Machines
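As a small illustration of the free energy above, a NumPy sketch (assuming binary hidden units and random placeholder parameters) that uses the identity ∑_{h_j∈{0,1}} exp(h_j(c_j + (Wᵀx)_j)) = 1 + exp(c_j + (Wᵀx)_j):

```python
# Sketch: RBM free energy F(x) = -b^T x - sum_j log(1 + exp(c_j + (W^T x)_j))
# for binary hidden units. W, b, c are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
W = rng.normal(scale=0.1, size=(n_vis, n_hid))
b = rng.normal(scale=0.1, size=n_vis)   # visible bias
c = rng.normal(scale=0.1, size=n_hid)   # hidden bias

def free_energy(x):
    pre = c + x @ W                                   # c_j + (W^T x)_j for every hidden unit j
    return -b @ x - np.sum(np.logaddexp(0.0, pre))    # log(1 + exp(.)), computed stably

x = rng.integers(0, 2, size=n_vis)
print(free_energy(x))   # p(x) = exp(-F(x)) / Z, with Z still the intractable part
```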
o The model defines a bipartite graph between x and h with undirected connections… information flows forward and backward
Restricted Boltzmann Machines
o The hidden units h_j are independent of each other conditioned on the visible units:
  p(h | x) = ∏_j p(h_j | x, θ)
o The visible units x_i are independent of each other conditioned on the hidden units:
  p(x | h) = ∏_i p(x_i | h, θ)
Restricted Boltzmann Machines
o The conditional probabilities are defined as sigmoids:
  p(h_j | x, θ) = σ(c_j + (Wᵀx)_j),   p(x_i | h, θ) = σ(b_i + (Wh)_i)
o Maximize the log-likelihood
  L(θ) = (1/N) ∑_n log p(x_n | θ), with p(x) = (1/Z) exp(−F(x))
Training RBMs
o Let's take the gradients
  ∂ log p(x_n | θ) / ∂θ = −∂F(x_n)/∂θ − ∂ log Z/∂θ
  = −∑_h p(h | x_n, θ) ∂E(x_n, h | θ)/∂θ + ∑_{x,h} p(x, h) ∂E(x, h | θ)/∂θ
Training RBMs
o Let's take the gradients
  ∂ log p(x_n | θ) / ∂θ = −∂F(x_n)/∂θ − ∂ log Z/∂θ
  = −∑_h p(h | x_n, θ) ∂E(x_n, h | θ)/∂θ + ∑_{x,h} p(x, h) ∂E(x, h | θ)/∂θ
o The first term is easy because we just substitute the definition for x_n and sum over h
o The second term is hard because you need to sum over both x and h, which can be huge… it requires approximate inference, e.g., MCMC
Training RBMs
o Approximate the gradient with Contrastive Divergence
o Specifically, apply a Gibbs sampler for k steps and approximate the gradient
  ∂ log p(x_n | θ)/∂θ ≈ −( ∂E(x_n, h_0 | θ)/∂θ − ∂E(x̃_k, h_k | θ)/∂θ )
  where x̃_k, h_k are obtained after k Gibbs steps starting from x_n
Training RBMs with Contrastive Divergence
Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Neural Computation, 2002
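A hedged sketch of what one CD-1 update could look like for a binary RBM, using the sigmoid conditionals from the previous slides; the batch, sizes, and learning rate are illustrative placeholders, not the lecture's reference code:

```python
# Sketch of one Contrastive Divergence (CD-1) update for a binary RBM.
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 6, 4, 0.1
W = 0.01 * rng.normal(size=(n_vis, n_hid))
b = np.zeros(n_vis)
c = np.zeros(n_hid)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def cd1_step(x):                                  # x: batch of binary visibles, shape (B, n_vis)
    global W, b, c
    p_h0 = sigmoid(x @ W + c)                     # p(h=1 | x)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    p_x1 = sigmoid(h0 @ W.T + b)                  # p(x=1 | h0): one Gibbs step back
    x1 = (rng.random(p_x1.shape) < p_x1).astype(float)
    p_h1 = sigmoid(x1 @ W + c)                    # p(h=1 | reconstruction)
    # positive-phase statistics minus negative-phase statistics
    W += lr * (x.T @ p_h0 - x1.T @ p_h1) / len(x)
    b += lr * (x - x1).mean(axis=0)
    c += lr * (p_h0 - p_h1).mean(axis=0)

batch = rng.integers(0, 2, size=(8, n_vis)).astype(float)
cd1_step(batch)
```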
o RBMs are just one layer
o Use the RBM as a building block
o Stack multiple RBMs one on top of the other: p(x, h1, h2) = p(x | h1) · p(h1 | h2)
o Deep Belief Networks (DBNs) are directed models… the layers are densely connected and have a single forward flow
… this is because the conditional p(x_i | h, θ) = σ(b_i + (Wh)_i) is directional: each layer depends only on the layer above it
Deep Belief Network
o Stacking layers again, but now with connections from both the layer above and the layer below
o Since it's a Boltzmann machine, we need an energy function:
  E(x, h1, h2, h3 | θ) = xᵀ W1 h1 + h1ᵀ W2 h2 + h2ᵀ W3 h3
  p(h2_j | h1, h3) = σ( ∑_i W2_ij h1_i + ∑_k W3_jk h3_k )
Deep Boltzmann Machines
o Schematically similar to Deep Belief Networks
o But Deep Boltzmann Machines (DBMs) are undirected models… they belong to the Markov Random Field family
o So, two types of relationships: bottom-up and top-down
  p(h2_j | h1, h3) = σ( ∑_i W2_ij h1_i + ∑_k W3_jk h3_k )
Deep Boltzmann Machines
o Computing gradients is intractable
o Instead, variational methods (mean-field) or sampling methods are used
Training Deep Boltzmann Machines
Variational Inference
o Observed variables x
o Latent variables θ… both unobservable model parameters w and unobservable model activations z… θ = {w, z}
o Joint probability density function (pdf): p(x, θ)
o Marginal pdf: p(x) = ∫_θ p(x, θ) dθ
o Prior pdf, marginal over the input: p(θ) = ∫_x p(x, θ) dx… usually a user-defined pdf
o Posterior pdf: p(θ|x)
o Likelihood pdf: p(x|θ)
Some (Bayesian) Terminology
o Posterior pdf
  p(θ | x) = p(x, θ) / p(x)                              (conditional probability)
           = p(x | θ) p(θ) / p(x)                        (Bayes rule)
           = p(x | θ) p(θ) / ∫_θ' p(x, θ') dθ'           (marginal probability)
           ∝ p(x | θ) p(θ)                               (p(x) is constant)
o Posterior predictive pdf
  p(y_new | y) = ∫_θ p(y_new | θ) p(θ | y) dθ
Bayesian Terminology
o Conjugate priors… when the posterior and the prior belong to the same family, so the joint pdf is easy to compute
o Point estimate approximations of latent variables… instead of computing a distribution over all possible values for the variable… compute one point only… e.g., the most likely value (maximum likelihood or maximum a posteriori estimate)
  θ* = argmax_θ p(x|θ) p(θ)   (MAP)
  θ* = argmax_θ p(x|θ)        (MLE)
… quite good when the posterior distribution is peaky (low variance)
Bayesian Terminology
[Figure: a point estimate (*) of a neural network weight]
o Estimate the posterior density p(θ|x) for your training data x
o To do so, we need to define the prior p(θ) and likelihood p(x|θ) distributions
o Once the p(θ|x) density is estimated, Bayesian inference is possible… p(θ|x) is a (density) function, not just a single number (point estimate)
o But how to estimate the posterior density?
… Markov Chain Monte Carlo (MCMC): simulation-like estimation
… Variational Inference: turn estimation into optimization
Bayesian Modelling
o Estimating the true posterior p(θ|x) is not always possible… especially for complicated models like neural networks
o Variational Inference assumes another function q(θ|φ) with which we want to approximate the true posterior p(θ|x)… q(θ|φ) is the approximate posterior
… note that the approximate posterior does not depend on the observable variables x
o We approximate by minimizing the reverse KL-divergence w.r.t. φ:
  φ* = argmin_φ KL(q(θ|φ) || p(θ|x))
o Turn inference into optimization
Variational Inference
Variational Inference (graphically)
Underestimating variance. Why?
How to overestimate variance? Forward KL
o Given latent variables θ and the approximate posterior q_φ(θ) = q(θ|φ)
o What about the log marginal log p(x)?
o We want to maximize the marginal p(x) (or the log marginal log p(x)):
  log p(x) ≥ E_{q_φ(θ)}[ log (p(x, θ) / q_φ(θ)) ]
Variational Inference - Evidence Lower Bound (ELBO)
Evidence Lower Bound (ELBO): Derivations
o Given latent variables θ and the approximate posterior q_φ(θ) = q(θ|φ)
o The log marginal is
  log p(x) = log ∫_θ p(x, θ) dθ
           = log ∫_θ p(x, θ) (q_φ(θ) / q_φ(θ)) dθ
           = log E_{q_φ(θ)}[ p(x, θ) / q_φ(θ) ]
           ≥ E_{q_φ(θ)}[ log (p(x, θ) / q_φ(θ)) ]
Evidence Lower Bound (ELBO): Derivations
Jensen's inequality: f(E[x]) ≤ E[f(x)] for convex f; log is concave, so log E[x] ≥ E[log x]
ELBO: A second derivation
log p(x) ≥ E_{q_φ(θ)}[ log (p(x, θ) / q_φ(θ)) ]
         = E_{q_φ(θ)}[ log p(x|θ) ] + E_{q_φ(θ)}[ log p(θ) ] − E_{q_φ(θ)}[ log q_φ(θ) ]
         = E_{q_φ(θ)}[ log p(x|θ) ] − KL(q_φ(θ) || p(θ))
         = ELBO_{θ,φ}(x)
o Maximize the reconstruction accuracy E_{q_φ(θ)}[ log p(x|θ) ]
o While minimizing the KL distance between the prior p(θ) and the approximate posterior q_φ(θ)
ELBO: Formulation 1
log p(x) ≥ E_{q_φ(θ)}[ log (p(x, θ) / q_φ(θ)) ]
         = E_{q_φ(θ)}[ log p(x, θ) ] − E_{q_φ(θ)}[ log q_φ(θ) ]
         = E_{q_φ(θ)}[ log p(x, θ) ] + H(q_φ)
         = ELBO_{θ,φ}(x)
o Maximize something like a negative Boltzmann energy, E_{q_φ(θ)}[ log p(x, θ) ]
o While maximizing the entropy of the approximate posterior q_φ(θ)… avoid collapsing the latents θ to a single value (as in MAP estimates)
ELBO: Formulation 2
o It is easy to see that the ELBO is directly related to the marginal:
  log p(x) = ELBO_{θ,φ}(x) + KL(q_φ(θ) || p(θ|x))
o You can also see ELBO_{θ,φ}(x) as the Variational Free Energy
ELBO vs. Marginal
o It is easy to see that the ELBO is directly related to the marginal:
  ELBO_{θ,φ}(x) = E_{q_φ(θ)}[ log p(x, θ) ] − E_{q_φ(θ)}[ log q_φ(θ) ]
               = E_{q_φ(θ)}[ log p(θ|x) ] + E_{q_φ(θ)}[ log p(x) ] − E_{q_φ(θ)}[ log q_φ(θ) ]
               = E_{q_φ(θ)}[ log p(x) ] − KL(q_φ(θ) || p(θ|x))
               = log p(x) − KL(q_φ(θ) || p(θ|x))
  ⇒ log p(x) = ELBO_{θ,φ}(x) + KL(q_φ(θ) || p(θ|x))
o You can also see ELBO_{θ,φ}(x) as the Variational Free Energy
ELBO vs. Marginal: Derivations
(log p(x) does not depend on q_φ(θ), and E_{q_φ(θ)}[1] = 1)
o log p(x) = ELBO_{θ,φ}(x) + KL(q_φ(θ) || p(θ|x))
o The log-likelihood log p(x) is a constant: it does not depend on the variational parameters
o Also, KL(q_φ(θ) || p(θ|x)) ≥ 0, so the ELBO is a lower bound on log p(x)
1. The higher the variational lower bound ELBO_{θ,φ}(x), the smaller the difference between the approximate posterior q_φ(θ) and the true posterior p(θ|x) → a better latent representation
2. The variational lower bound ELBO_{θ,φ}(x) approaches the log-likelihood → a better density model
ELBO interpretations
o The variational distribution q(θ|φ) does not depend directly on the data… only indirectly, via minimizing its distance to the true posterior, KL(q(θ|φ) || p(θ|x))
o So, with q(θ|φ) we have a major optimization problem
o The approximate posterior must approximate the whole dataset x = [x_1, x_2, …, x_N] jointly
o Otherwise we would need different neural network weights for each data point x_i
Amortized Inference
o Better to share weights and 'amortize' the optimization between the individual data points:
  q(θ|φ) = q_φ(θ|x)
o Predict the model parameters θ using a φ-parameterized model of the input x
o Use amortization for the data-dependent parameters… e.g., the latent activations that are the output of a neural network layer: z ~ q_φ(z|x)
Amortized Inference
Amortized Inference (Intuitively)
o The original view on Variational Inference is that q(θ|φ) describes the approximate posterior of the dataset as a whole
o Imagine you don't want to make a practical model that returns latent activations for a specific input
o Instead, you want to optimally approximate the true posterior of the unknown weights with a model with latent parameters
o It doesn't matter if these parameters are 'latent activations' z or 'model variables' w
o Let's rewrite the ELBO a bit more explicitly:
  ELBO_{θ,φ}(x) = E_{q_φ(θ)}[ log p(x|θ) ] − KL(q_φ(θ) || p(θ))
               = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) || p_λ(z))
o p_θ(x|z) instead of p(x|θ)
o I.e., the likelihood model p_θ(x|z) has weights parameterized by θ
o It is conditioned on the latent activations z
Variational Autoencoders
o Let's rewrite the ELBO a bit more explicitly:
  ELBO_{θ,φ}(x) = E_{q_φ(θ)}[ log p(x|θ) ] − KL(q_φ(θ) || p(θ))
               = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) || p_λ(z))
o p_λ(z) instead of p(θ)
o I.e., a λ-parameterized prior only on the latent activations z
o Not on the model weights
Variational Autoencoders
o Let's rewrite the ELBO a bit more explicitly:
  ELBO_{θ,φ}(x) = E_{q_φ(θ)}[ log p(x|θ) ] − KL(q_φ(θ) || p(θ))
               = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) || p_λ(z))
o q_φ(z|x) instead of q(θ|φ)
o The model q_φ(z|x) approximates the posterior density of the latents z
o Its weights are parameterized by φ
Variational Autoencoders
o ELBO_{θ,φ}(x) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) || p_λ(z))
o How to model p_θ(x|z) and q_φ(z|x)?
o What about modelling them as neural networks?
Variational Autoencoders
o The approximate posterior q_φ(z|x) is a ConvNet (or MLP)… the input x is an image
… given the input, the output is a feature map that defines the latent variable z
… also known as the encoder, inference, or recognition network, because it infers/recognizes the latent codes
o The likelihood density p_θ(x|z) is an inverted ConvNet (or MLP)… given the latent z as input, it reconstructs the input x
… also known as the decoder or generator network
o If we ignore the distribution of the latents z, p_λ(z), then we get the vanilla autoencoder
Variational Autoencoders
[Figure: encoder/inference/recognition network q_φ(z|x), prior p_λ(z) over z, decoder/generator network p_θ(x|z)]
o Maximize the Evidence Lower Bound (ELBO)… or minimize the negative ELBO
  L(θ, φ) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) || p_λ(z))
o How do we optimize the ELBO?
Training Variational Autoencoders
o Maximize the Evidence Lower Bound (ELBO)… or minimize the negative ELBO
  L(θ, φ) = E_{q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) || p_λ(z))
          = ∫_z q_φ(z|x) log p_θ(x|z) dz − ∫_z q_φ(z|x) log ( q_φ(z|x) / p_λ(z) ) dz
o Forward propagation: compute the two terms
o The first term is an integral (expectation) that we cannot solve analytically, so we need to sample from the pdf instead… when p_θ(x|z) contains even a few nonlinearities, like in a neural network, the integral is hard to compute analytically
Training Variational Autoencoders
o Forward propagation: compute the two terms
o The first term, ∫_z q_φ(z|x) log p_θ(x|z) dz, is an integral (expectation) that we cannot solve analytically… when p_θ(x|z) contains even a few nonlinearities, like in a neural network, the integral is hard to compute analytically
o As we cannot compute it analytically, we sample from the pdf instead… using the density q_φ(z|x) to draw samples… usually one sample is enough, and the stochasticity reduces overfitting
o The VAE is a stochastic model
o The second term is the KL divergence between two distributions that we know
Training Variational Autoencoders
o ∫_z q_φ(z|x) log ( q_φ(z|x) / p_λ(z) ) dz
o The second term is the KL divergence between two distributions that we know
o E.g., compute the KL divergence between a centered N(0, 1) Gaussian and a non-centered N(μ, σ) Gaussian
Training Variational Autoencoders
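For the standard case of a diagonal Gaussian posterior and an N(0, 1) prior, this KL term has a closed form; a small sketch, where mu and log_var stand in for the encoder outputs:

```python
# Sketch: closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims.
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # 0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 )
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu = np.array([0.5, -0.2])        # placeholder encoder outputs
log_var = np.array([0.1, -0.3])
print(kl_to_standard_normal(mu, log_var))
```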
o We set the prior p_λ(z) to be the unit Gaussian: p(z) ~ N(0, 1)
o We set the likelihood to be a Bernoulli for binary data: p(x|z) ~ Bernoulli(ρ)
o We set q_φ(z|x) to be a neural network (MLP, ConvNet) that maps an input x to a Gaussian distribution, specifically to its mean and variance… μ_z, σ_z ~ q_φ(z|x)
… the neural network has two outputs: one is the mean μ_z and the other is σ_z, which corresponds to the covariance of the Gaussian
Typical VAE
[Figure: encoder q_φ(z|x) outputs μ_z and σ_z]
o We set p_θ(x|z) to be an inverse neural network, which maps z to a Bernoulli distribution if our outputs are binary (e.g., binary MNIST)
o Good exercise: derive the ELBO for the standard VAE
Typical VAE
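A compact PyTorch sketch of the two networks described above, a Gaussian encoder and a Bernoulli decoder; the layer sizes and architecture are arbitrary choices, not the course's reference implementation:

```python
# Sketch of the typical VAE pieces: a Gaussian encoder q_phi(z|x) and a
# Bernoulli decoder p_theta(x|z). Layer sizes are arbitrary choices.
import torch
import torch.nn as nn

class Encoder(nn.Module):                      # q_phi(z|x): outputs mu_z and log sigma_z^2
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)
    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):                      # p_theta(x|z): outputs Bernoulli means
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())
    def forward(self, z):
        return self.net(z)
```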
VAE: Interpolation in the latent space
o Sample z from the approximate posterior density z ~ q_φ(z|x)… as q_φ is a neural network that outputs the parameters of a specific and known parametric pdf, e.g., a Gaussian, sampling from it is rather easy
… often even a single draw is enough
o Second, compute log p_θ(x|z)… as p_θ is a neural network that outputs the parameters of a specific and known parametric pdf, e.g., a Bernoulli for white/black pixels, computing the log-prob is easy
o Computing the ELBO is rather straightforward in the standard case
o How should we optimize the ELBO? Backpropagation?
Forward propagation in VAE
o Backpropagation: compute the gradients of
  L(θ, φ) = E_{z~q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) || p_λ(z))
o We must take the gradients with respect to the trainable parameters:
o the generator network parameters θ
o the inference network / approximate posterior parameters φ
Backward propagation in VAE
o Let's try to compute the following integral
  I(f) = ∫_x f(x) p(x) dx
  where p(x) is a probability density function for x… often complex if f(x) or p(x) is even slightly complicated
o Instead, we can approximate the integral as a summation
  I(f) = ∫_x f(x) p(x) dx ≈ (1/N) ∑_{n=1}^{N} f(x_n), x_n ~ p(x), which we call Î_N
o The estimator is unbiased, I(f) = E[Î_N], and its variance is
  Var(Î_N) = (1/N) E[(f(x) − I(f))²]
o So, if we have an easy-to-sample probability density function inside the integral, we can do Monte Carlo integration to approximate it
Monte Carlo Integration
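A tiny sketch of Monte Carlo integration with an assumed p = N(0, 1) and f(x) = x², whose true expectation is 1:

```python
# Sketch: Monte Carlo integration of I(f) = E_{x~p}[f(x)] with p = N(0, 1)
# and f(x) = x^2 (true value 1). f and p are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.standard_normal(N)          # x_n ~ p(x)
I_hat = np.mean(x**2)               # (1/N) sum_n f(x_n)
print(I_hat)                        # close to 1; the error shrinks like 1/sqrt(N)
```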
o Backpropagation: compute the gradients of
  L(θ, φ) = E_{z~q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) || p_λ(z))
Gradients w.r.t. the generator parameters θ
o Backpropagation: compute the gradients of
  L(θ, φ) = E_{z~q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) || p_λ(z))
  with respect to θ and φ
o ∇_θ L = E_{z~q_φ(z|x)}[ ∇_θ log p_θ(x|z) ]
o The expectation and sampling in E_{z~q_φ(z|x)} do not depend on θ
o Also, the KL term does not depend on θ, so no gradient from there!
o So, no problem: just Monte Carlo integration using samples z drawn from q_φ(z|x)
Gradients w.r.t. the generator parameters θ
o Backpropagation: compute the gradients of
  L(θ, φ) = E_{z~q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) || p_λ(z))
o Our latent variable z is a Gaussian (in the standard VAE) represented by μ_z, σ_z
o So, we can train by sampling randomly from that Gaussian, z ~ N(μ_z, σ_z)
o Problem?
o Sampling z ~ q_φ(z|x) is not differentiable… and after sampling, z is a fixed value (not a function), so we cannot backprop
o Not differentiable → no gradients
o No gradients → no backprop → no training!
Gradients w.r.t. the recognition parameters φ
o ∇_φ E_{z~q_φ(z|x)}[ log p_θ(x|z) ] = ∇_φ ∫_z q_φ(z|x) log p_θ(x|z) dz
  = ∫_z ∇_φ[ q_φ(z|x) ] log p_θ(x|z) dz
o Problem: Monte Carlo integration is not possible anymore… there is no density function inside the integral… only the gradient of a density function
o As in Monte Carlo integration, we want an expression with a density function inside the integral
o That way we can express it again as a Monte Carlo integration
Solution: Monte Carlo Differentiation?
o ∇_φ E_{z~q_φ(z|x)}[ log p_θ(x|z) ] = ∇_φ ∫_z q_φ(z|x) log p_θ(x|z) dz
  = ∫_z ∇_φ[ q_φ(z|x) ] log p_θ(x|z) dz
o ∫_z ∇_φ[ q_φ(z|x) ] log p_θ(x|z) dz
  = ∫_z (q_φ(z|x) / q_φ(z|x)) ∇_φ q_φ(z|x) log p_θ(x|z) dz
  = ∫_z q_φ(z|x) ∇_φ log q_φ(z|x) log p_θ(x|z) dz
  = E_{z~q_φ(z|x)}[ ∇_φ log q_φ(z|x) · log p_θ(x|z) ]
  NOTE: ∇_x log p(x) = (1/p(x)) ∇_x p(x)
Solution: Monte Carlo Differentiation?
o ∇_φ E_{z~q_φ(z|x)}[ log p_θ(x|z) ] = E_{z~q_φ(z|x)}[ ∇_φ log q_φ(z|x) · log p_θ(x|z) ]
  ≈ (1/N) ∑_n ∇_φ log q_φ(z_n|x) · log p_θ(x|z_n),   z_n ~ q_φ(z|x)
o Also known as REINFORCE or the score-function estimator… ∇_φ log q_φ(z|x) is called the score function
… used to approximate gradients of non-differentiable functions
… highly popular in Reinforcement Learning, where we also sample from policies
o Problem: typically high-variance gradients
o Slow and not very effective learning
Solution: Monte Carlo Differentiation == REINFORCE
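A toy sketch of the score-function (REINFORCE) estimator, with an assumed 1-D Gaussian q_φ(z) = N(φ, 1) and a stand-in f(z) in place of log p_θ(x|z); it illustrates both the form of the estimator and its large spread across samples:

```python
# Sketch of the score-function (REINFORCE) gradient estimator for a toy case:
# q_phi(z) = N(phi, 1) and a stand-in "log-likelihood" f(z) = -(z - 3)^2.
# grad_phi E_q[f(z)] ~= (1/N) sum_n grad_phi log q_phi(z_n) * f(z_n)
import numpy as np

rng = np.random.default_rng(0)
phi, N = 0.0, 1000
z = rng.normal(loc=phi, scale=1.0, size=N)           # z_n ~ q_phi(z)
f = -(z - 3.0) ** 2                                   # placeholder for log p_theta(x|z_n)
score = (z - phi)                                     # grad_phi log N(z; phi, 1)
grad_estimates = score * f
print(grad_estimates.mean(), grad_estimates.std())    # unbiased (true value 6), but large spread
```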
o So, our latent variable z is a Gaussian (in the standard VAE) represented by the mean and variance μ_z, σ_z, which are the outputs of a neural net
o So, we can train by sampling randomly from that Gaussian, z ~ N(μ_z, σ_z)
o Once we have that z, however, it's a fixed value (not a function), so we cannot backprop
o We can use the REINFORCE algorithm to compute an approximation to the gradient… high-variance gradients → slow and not very effective learning
To sum up
o Remember, we have a Gaussian output z ~ N(μ_z, σ_z)
o For certain pdfs, including the Gaussian, we can rewrite the random variable z as a deterministic transformation of an auxiliary, simpler random variable ε:
  z ~ N(μ, σ) ⇔ z = μ + ε · σ, ε ~ N(0, 1)
o μ, σ are deterministic (not random) values
o Long story short:
o we can model μ, σ with our NN encoder / recognition network
o and ε comes externally
Solution: Reparameterization trick
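A minimal PyTorch sketch of the trick, with made-up values for μ and log σ²; the point is that gradients reach μ and σ even though z is sampled:

```python
# Sketch of the reparameterization trick in PyTorch: the randomness lives in
# eps ~ N(0, 1), while mu and sigma stay differentiable. Values are made up.
import torch

mu = torch.tensor([0.5, -1.0], requires_grad=True)       # placeholder encoder outputs
log_var = torch.tensor([0.2, 0.0], requires_grad=True)
sigma = torch.exp(0.5 * log_var)

eps = torch.randn_like(sigma)         # external noise, no gradient needed
z = mu + eps * sigma                  # z ~ N(mu, sigma^2), but differentiable in mu, sigma

loss = (z ** 2).sum()                 # stand-in for -log p_theta(x|z)
loss.backward()
print(mu.grad, log_var.grad)          # gradients flow through mu and sigma
```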
o Change of variables: for z = g(ε), p(z) dz = p(ε) dε
  … intuitively, the probability mass must be invariant under the transformation
o In our case, ε ~ p(ε) = N(0, 1) and z = g_φ(ε) = μ_φ + ε · σ_φ
o ∇_φ E_{z~q_φ(z|x)}[ log p_θ(x|z) ] = ∇_φ ∫_z q_φ(z|x) log p_θ(x|z) dz
  = ∇_φ ∫_ε p(ε) log p_θ(x | μ_φ, σ_φ, ε) dε
  = ∫_ε p(ε) ∇_φ log p_θ(x | μ_φ, σ_φ, ε) dε
o ∇_φ E_{z~q_φ(z|x)}[ log p_θ(x|z) ] ≈ ∑_n ∇_φ log p_θ(x | μ_φ, σ_φ, ε_n), ε_n ~ N(0, 1)
  … the sampling in the Monte Carlo integration does not depend on the parameter of interest φ anymore
What do we gain?
o Sampling from ε ~ N(0, 1) leads to low-variance gradient estimates compared to sampling directly from z ~ N(μ_φ, σ_φ)… why low variance? An exercise for the interested reader
o Remember: since we are sampling z, we are also sampling gradients… a stochastic gradient estimator
o More distributions beyond the Gaussian are possible: Laplace, Student-t, Logistic, Cauchy, Rayleigh, Pareto
Solution: Reparameterization trick
[Figure: high-variance vs. low-variance gradient estimates]
http://blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/
o Again, the latent variable is z = μ_φ + ε · σ_φ
o μ_φ and σ_φ are deterministic functions (via the neural network encoder)
o ε is a random variable, which comes externally
o As a result, z is itself a random variable, because of ε
o However, now the randomness is not associated with the neural network and the parameters that we have to learn… the randomness instead comes from the external ε
… the gradients flow through μ_φ and σ_φ
Once more: what is random in the reparameterization trick?
Reparameterization Trick (graphically)
Variational Autoencoders
https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html
VAE Training Pseudocode
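The transcript keeps only the slide title here; below is a hedged sketch of what one training step of a standard VAE typically looks like (binary data, Bernoulli decoder, N(0, I) prior), reusing the Encoder/Decoder sketched earlier; the optimizer, batch, and shapes are placeholders:

```python
# Sketch of one VAE training step (negative ELBO with a Bernoulli likelihood
# and an N(0, I) prior), reusing the Encoder/Decoder sketched earlier.
import torch
import torch.nn.functional as F

def vae_step(encoder, decoder, optimizer, x):          # x: batch of binary inputs in [0, 1]
    mu, log_var = encoder(x)                           # q_phi(z|x)
    eps = torch.randn_like(mu)
    z = mu + eps * torch.exp(0.5 * log_var)            # reparameterization trick
    x_hat = decoder(z)                                 # Bernoulli means p_theta(x|z)
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')          # -E_q[log p_theta(x|z)]
    kl = 0.5 * torch.sum(torch.exp(log_var) + mu**2 - 1.0 - log_var)   # KL(q || N(0, I))
    loss = recon + kl                                  # negative ELBO
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A typical loop would call vae_step on every minibatch and track the returned negative ELBO.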
VAE for NLP
VAE for Image Resynthesis
VAE for designing chemical compounds
o A VAE cannot model p(x) directly because of its intractable formulation
o Normalizing Flows solve exactly that problem
o They do so with a series of invertible transformations that allow for much more complex latent distributions (beyond Gaussians)
o The loss is the negative log-likelihood (not the ELBO and so on)
Normalizing Flows
Series of invertible transformations
https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html
o Using simple pdfs, like a Gaussian, for the approximate posterior limits the expressivity of the model
o Better make sure the approximate posterior comes from a class of models that can even contain the true posterior
o Use a series of K invertible transformations to construct the approximate posterior… z_K = f_K ∘ f_{K−1} ∘ ⋯ ∘ f_1(z_0)
… the change-of-variables rule
Normalizing Flows
https://blog.evjang.com/2018/01/nf1.html
https://www.shakirm.com/slides/DeepGenModelsTutorial.pdf
https://arxiv.org/pdf/1505.05770.pdf
Changing from the variable x to y using the transformation y = f(x) = 2x + 1
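A quick numerical check (not from the slides) of the change-of-variables rule for the example y = f(x) = 2x + 1 applied to x ~ N(0, 1): since y ~ N(1, 4), p_y(y) = p_x((y − 1)/2) · 1/2:

```python
# Sketch: change of variables for y = f(x) = 2x + 1 with x ~ N(0, 1).
# p_y(y) = p_x(f^{-1}(y)) * |d f^{-1}/dy| = p_x((y - 1) / 2) * (1 / 2)
import numpy as np

def normal_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

y = 2.0
p_y = normal_pdf((y - 1.0) / 2.0) * 0.5
print(p_y)   # matches the density of N(1, 4) evaluated at y = 2
```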
o x = z_K = f_K ∘ f_{K−1} ∘ ⋯ ∘ f_1(z_0), with z_i = f_i(z_{i−1})
o Again, change of variables (multi-dimensional): p_i(z_i) = p_{i−1}(f_i^{−1}(z_i)) · |det (∂f_i^{−1} / ∂z_i)|
o log p(x) = log p_K(z_K) = log p_{K−1}(z_{K−1}) − log |det (∂f_K / ∂z_{K−1})|
  = ⋯
  = log p_0(z_0) − ∑_{i=1}^{K} log |det (∂f_i / ∂z_{i−1})|
o Two requirements:
  1. f_i must be easily invertible
  2. The Jacobian of f_i must be easy to compute
Normalizing Flows: Log-likelihood
https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html
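A minimal sketch of this log-likelihood for a toy flow made of K element-wise affine maps z_i = a_i ⊙ z_{i−1} + b_i (trivially invertible, with log |det J_i| = ∑_d log |a_{i,d}|); the scales and shifts are placeholders, not a real flow architecture such as planar flows or RealNVP:

```python
# Sketch: log p(x) for a toy flow of K element-wise affine maps
# z_i = a_i * z_{i-1} + b_i, starting from z_0 ~ N(0, I).
import numpy as np

rng = np.random.default_rng(0)
D, K = 2, 3
a = rng.uniform(0.5, 2.0, size=(K, D))     # per-layer scales (placeholders)
b = rng.normal(size=(K, D))                # per-layer shifts (placeholders)

def log_p0(z):                             # standard normal base density
    return -0.5 * np.sum(z**2) - 0.5 * D * np.log(2.0 * np.pi)

def log_px(x):
    z = x.copy()
    log_det = 0.0
    for i in reversed(range(K)):           # invert the flow: z_{i-1} = (z_i - b_i) / a_i
        z = (z - b[i]) / a[i]
        log_det += np.sum(np.log(np.abs(a[i])))
    return log_p0(z) - log_det             # change-of-variables formula

x = rng.normal(size=D)
print(log_px(x))
```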
Normalizing Flows
https://blog.evjang.com/2018/01/nf1.html
https://www.shakirm.com/slides/DeepGenModelsTutorial.pdf
https://arxiv.org/pdf/1505.05770.pdf
Normalizing Flows
https://www.shakirm.com/slides/DeepGenModelsTutorial.pdf
Normalizing Flows on Non-Euclidean Manifolds
https://www.shakirm.com/slides/DeepGenModelsTutorial.pdf
Normalizing Flows on Non-Euclidean Manifolds
Summary
o Gentle intro to Bayesian Modelling and Variational Inference
o Restricted Boltzmann Machines
o Deep Boltzmann Machines
o Deep Belief Network
o Contrastive Divergence
o Variational Autoencoders
o Normalizing Flows