Preliminaries Variational Autoencoders Extensions of VAEs
Variational Autoencoders (VAEs)
Yuqin Yang
Wilson Lab Group Meeting Presentation
September 26 & October 3, 2017
Section 1
Preliminaries
Kullback-Leibler divergence
KL divergence (continuous case)
Let p(x) and q(x) be two probability densities. The KL divergence is defined as

KL(p||q) = ∫ p(x) log( p(x) / q(x) ) dx.  (1.1)

By Jensen's inequality, KL(p||q) ≥ 0, and equality holds if and only if p = q almost everywhere.
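These two properties can be checked numerically; below is a minimal Python sketch for the discrete analogue of (1.1), with toy probability vectors that are not from the slides:

```python
import numpy as np

def kl(p, q):
    """Discrete KL(p||q) = sum_i p_i * log(p_i / q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q))              # ≈ 0.0946, strictly positive since p != q
assert kl(p, q) > 0          # KL(p||q) >= 0
assert abs(kl(p, p)) < 1e-12 # equality exactly when p = q
```

Note the asymmetry: kl(p, q) and kl(q, p) generally differ, which is why the direction of the divergence matters in the variational-inference derivation below.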
Special case: multivariate Gaussian distribution
Suppose p1 ∼ N(μ1, Σ1) and p2 ∼ N(μ2, Σ2) are k-dimensional Gaussians. Then

KL(p1||p2) = (1/2) [ log( det Σ2 / det Σ1 ) − k + tr(Σ2⁻¹ Σ1) + (μ2 − μ1)ᵀ Σ2⁻¹ (μ2 − μ1) ].
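This closed form is easy to implement directly; here is a short Python sketch with made-up means and covariances (chosen only so the arithmetic is easy to check by hand):

```python
import numpy as np

def gaussian_kl(mu1, S1, mu2, S2):
    """Closed-form KL(N(mu1, S1) || N(mu2, S2)) for k-dimensional Gaussians."""
    k = len(mu1)
    S2inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1)) - k
                  + np.trace(S2inv @ S1) + d @ S2inv @ d)

mu1, S1 = np.array([0.0, 1.0]), np.diag([1.0, 2.0])
mu2, S2 = np.array([1.0, 0.0]), np.diag([2.0, 1.0])

print(gaussian_kl(mu1, S1, mu2, S2))          # → 1.0 for these values
assert abs(gaussian_kl(mu1, S1, mu1, S1)) < 1e-12  # identical Gaussians give 0
```

For these diagonal covariances, each term can be verified by hand: log(2/2) = 0, tr = 2.5, quadratic form = 1.5, so KL = 0.5·(0 − 2 + 2.5 + 1.5) = 1.0.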
Variational Inference
Suppose we want to use Q(z) to approximate P(z|X), where P(z|X) has no explicit representation. A good approximation should then minimize

KL(Q(z)||P(z|X)) = ∫ Q(z) log( Q(z) / P(z|X) ) dz.

By Bayes' formula, this can be rewritten as

log P(X) − KL(Q(z)||P(z|X)) = ∫ Q(z) log P(X|z) dz − KL(Q(z)||P(z)).  (1.2)
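Identity (1.2) can be verified numerically with a discrete toy model, where Bayes' rule gives the exact posterior. All numbers below are made up for illustration:

```python
import numpy as np

# Discrete toy model: latent z ∈ {0, 1, 2} with prior Pz, and P(X|z) for one fixed X.
Pz   = np.array([0.5, 0.3, 0.2])      # prior P(z)
Px_z = np.array([0.1, 0.6, 0.3])      # likelihood P(X|z) at the observed X
Px   = float(np.sum(Px_z * Pz))       # evidence P(X) by marginalization
Pz_x = Px_z * Pz / Px                 # exact posterior P(z|X) by Bayes' rule

Q = np.array([0.2, 0.5, 0.3])         # an arbitrary approximate posterior

kl = lambda p, q: float(np.sum(p * np.log(p / q)))

lhs = np.log(Px) - kl(Q, Pz_x)                       # log P(X) - KL(Q || P(z|X))
rhs = float(np.sum(Q * np.log(Px_z))) - kl(Q, Pz)    # E_Q[log P(X|z)] - KL(Q || P(z))
assert abs(lhs - rhs) < 1e-12         # identity (1.2) holds up to rounding
```

The identity holds for any choice of Q, which is what makes the right-hand side a usable training objective: the intractable posterior only appears on the left.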
Section 2
Variational Autoencoders
Original problem
Given a dataset X drawn from a distribution P(X), we want to generate new data from the unknown distribution P(X).
We construct a model f(z; θ): Z × Θ → X, where X is the space of observed variables (data), Z the space of latent variables, Θ the parameter space, and f a complex but deterministic mapping.
Latent variables: variables that are not directly observed but are instead inferred from other, directly observed variables. Given z, we can generate a sample X by f(z; θ).
We wish to optimize θ so that when we sample z from P(z), f(z; θ) will, with high probability, resemble the X's in our dataset.
Likelihood
P(X; θ) = ∫ P(X|z; θ) P(z) dz.  (2.1)

Choose θ to maximize the integral above.
In VAEs, P(X|z; θ) = N(f(z; θ), σ²I) in the continuous case, and P(X|z; θ) = B(f(z; θ)) (Bernoulli) in the discrete case. In both cases P(X|z; θ) is continuous with respect to θ, so we can use gradient ascent on θ to maximize the likelihood.
Questions:
How do we define the latent variable z so that it captures the latent information?
How do we deal with the integral over z, and with its gradient with respect to θ?
Define latent variable
We want the latent variable to satisfy two properties:
The latent variables are chosen automatically, because we do not know much about the intrinsic properties of X.
Different components of z are mutually independent, in order to avoid overlap in the latent information.
VAEs assert that the latent variable can be drawn from a standard Gaussian distribution, N(0, I).
Assertion
Any distribution in d dimensions can be generated by taking a set of d variables that are normally distributed and mapping them through a sufficiently complicated function.
Since f(z; θ) is complicated enough (it is a trained neural network), this choice of latent variable does not matter much.
Deal with the integral
A naive approach is to estimate the integral by sampling z from the prior:

P(X; θ) ≈ (1/n) Σᵢ P(X|z⁽ⁱ⁾; θ),  z⁽ⁱ⁾ ∼ N(0, I).

Figure: Counterexample. We need to set σ very small, which in turn requires a very large number of samples.
In this case, we need to choose a faster sampling procedure for z.
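To see why the naive estimator struggles, here is a sketch with a hypothetical one-line decoder f(z) = z and a small σ (both choices are illustrative, not from the slides). An observation far from the prior mode makes almost every prior sample contribute essentially nothing, so small-n estimates are unstable:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_lik(X, z, sigma=0.1):
    """log N(X; f(z), sigma^2 I), with the toy decoder f(z) = z."""
    return (-0.5 * np.sum((X - z) ** 2, axis=-1) / sigma**2
            - 0.5 * len(X) * np.log(2 * np.pi * sigma**2))

X = np.array([2.0, 2.0])              # an observation far from the prior mode

for n in (100, 10_000, 1_000_000):
    z = rng.standard_normal((n, 2))   # z^(i) ~ N(0, I)
    p_hat = np.mean(np.exp(log_lik(X, z)))
    print(n, p_hat)                   # estimates fluctuate heavily until n is huge
```

Since P(X|z) is negligible outside a tiny region of z-space, the variance of this estimator is enormous; this is exactly the motivation for sampling z from a learned Q(z) instead of the prior.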
Sampling in VAEs
The key idea behind the variational autoencoder is to attempt to sample values of z that are likely to have produced X, and to compute P(X) just from those.
We introduce a new function Q(z) that gives a distribution over the z values likely to produce X; then E_{P(z)}[P(X|z)] is replaced by E_{Q(z)}[P(X|z)]. The optimal choice of Q(z) would be P(z|X) itself, but P(z|X) is intractable.
Aim:
Find a Q(z) that approximates P(z|X) while being simple enough.
Recall: Variational Inference
For any Q(z) used to approximate P(z|X), Equation (1.2) gives

log P(X) − KL(Q(z)||P(z|X)) = E_{Q(z)}[log P(X|z)] − KL(Q(z)||P(z)).

Since we are interested in inferring P(X), it makes sense to construct a Q which depends on X:

log P(X) − KL(Q(z|X)||P(z|X)) = E_{Q(z|X)}[log P(X|z)] − KL(Q(z|X)||P(z)).  (2.2)

Aim:
Maximize log P(X) (w.r.t. θ) while minimizing KL(Q(z|X)||P(z|X)) ⇔ maximize the LHS ⇔ maximize the RHS.
Second term of RHS
Aim:
Minimize KL(Q(z|X)||P(z)), where we already have P(z) = N(0, I).
The usual choice is Q(z|X) = N(μ(X; φ), Σ(X; φ)), where μ and Σ are deterministic functions of X with parameters φ (we omit φ in the following equations). In addition, we constrain Σ to be a diagonal matrix.
Minimization
By the earlier formula for the KL divergence between multivariate Gaussians,

KL(Q(z|X)||P(z)) = (1/2) ( tr Σ(X) + μ(X)ᵀ μ(X) − k − log det Σ(X) ).
First term of RHS
The first term is maximized with SGD. To approximate the expectation, take one sample z from Q(z|X):

E_{Q(z|X)}[log P(X|z)] ≈ log P(X|z).

General maximization objective

E_{X∼D}[log P(X) − KL(Q(z|X)||P(z|X))] = E_{X∼D}[E_{z∼Q(·|X)}[log P(X|z)] − KL(Q(z|X)||P(z))].  (2.3)

To use SGD, sample a value of X and a value of z, then compute the gradient of the RHS by backpropagation. Repeat this m times and average; the result converges to the gradient of the RHS.
Figure: Flow chart for the VAE algorithm.
Significant Problems
The algorithm seems fine, but there are two significant problems in the calculation:
The gradient of the first term of the RHS in Equation (2.3) should involve the parameters of both P and Q, but our sampling method drops the dependence on the parameters of Q. Hence we cannot obtain the true gradient with respect to φ.
The algorithm is split into two parts: the first half trains the model Q(z|X) on the given data X, and the second half trains the model f on the newly sampled z. Backpropagation cannot pass through this discontinuity, so the algorithm fails.
Modification by Reparameterization Trick
To solve the first problem, we change the way of sampling. First sample ε ∼ N(0, I), then define z = μ(X) + Σ(X)^{1/2} ε. This is an equivalent representation of the sample z in the previous algorithm, but the optimization objective becomes

E_{X∼D}[E_{ε∼N(0,I)}[log P(X | μ(X) + Σ(X)^{1/2} ε)] − KL(Q(z|X)||P(z))].

Now the sampling step no longer involves the parameters we are optimizing.
More generally, sample from Q(z|X) by evaluating a function h(η, X), where η is unobserved noise and h is continuous in X (a discrete Q(z|X) fails here). Then backpropagation goes through successfully.
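The point of the trick is that the gradient now passes through the sample. A minimal numpy sketch (one-dimensional, with a toy "loss" f(z) = z² that is purely illustrative) estimates ∇_μ E_{z∼N(μ,σ²)}[z²], whose exact value is 2μ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
eps = rng.standard_normal(100_000)    # noise sampled independently of mu

# Reparameterize: z = mu + sigma * eps, so dz/dmu = 1 and the gradient of
# E[f(z)] w.r.t. mu is estimated by the sample mean of f'(z) * dz/dmu.
f  = lambda z: z ** 2                 # toy loss (hypothetical)
df = lambda z: 2 * z
z = mu + sigma * eps
grad_est = float(np.mean(df(z) * 1.0))  # pathwise / reparameterization estimator

assert abs(grad_est - 2 * mu) < 0.05  # matches the exact gradient 2*mu = 3.0
```

Had we sampled z ∼ N(μ, σ²) directly, the sampling operation itself would depend on μ and automatic differentiation could not see that dependence; writing z as a deterministic function of (μ, σ, ε) is what restores the gradient path.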
Figure: Flow chart for the corrected VAE algorithm.
Verification
For the decoder:
We can simply sample a random variable z ∼ N(0, I) and feed it into the decoder f.
The probability P(X) for a test example X:
This is not tractable, because P is implicit. However, by Equation (2.2), since the KL divergence is non-negative, the RHS is a lower bound of log P(X), called the Evidence Lower BOund (ELBO). This lower bound is a useful tool for getting a rough idea of how well our model captures a particular datapoint X, because it converges quickly.
Remarks
Detailed remarks are not presented here.
Interpretation of the RHS: the two terms have meanings in information theory.
Separating the RHS by sample.
Regularization term: it can be obtained by some transformation of the RHS.
Sampling for Q(z|X): the original paper expresses this distribution as g(X, ε), where ε ∼ p_ε independently; a restriction on p_ε is needed.²
² D. P. Kingma and M. Welling (2013). "Auto-encoding variational bayes". In: arXiv preprint arXiv:1312.6114.
Section 3
Extensions of VAEs
Comparison Versus GAN
Both are recently developed deep generative models.
The biggest advantage of VAEs is the clean probabilistic formulation they come with, as a result of maximizing a lower bound on the log-likelihood. A VAE is also usually easier to train and to get working: relatively easy to implement and robust to hyperparameter choices.
GANs are better at generating visual features; the output of VAEs is sometimes blurry.
More detailed discussions can be found on Reddit.
Conditional Variational Autoencoders
Original problem:
Given an input dataset X and outputs Y, we want a model P(Y|X) that maximizes the probability of the ground-truth distribution. Example: generating handwritten digits, where we want to add digits to an existing string of digits written by a single person.
A standard regression model fails in this situation, because it will ultimately generate an "average image" that minimizes the distance, which may look like a meaningless blur. CVAEs, however, let us tackle problems where the input-to-output mapping is one-to-many, without requiring us to explicitly specify the structure of the output distribution.
Figure: Flow chart for the CVAE algorithm.
P(Y|X) = N(f(z, X), σ²I);

log P(Y|X) − KL(Q(z|Y,X)||P(z|Y,X)) = E_{Q(z|Y,X)}[log P(Y|z,X)] − KL(Q(z|Y,X)||P(z|X)).
VAE-GAN³
Combine a VAE with a GAN by collapsing the decoder and the generator into one, since both map from the standard Gaussian latent space to the data space X.
Figure: Overview of the VAE-GAN algorithm.
³ A. B. L. Larsen et al. (2015). "Autoencoding beyond pixels using a learned similarity metric". In: arXiv preprint arXiv:1512.09300.
Instead of analyzing the error element-wise, VAE-GAN analyzes the error feature-wise, where the features are produced by the discriminator.
The generator and the decoder share parameters.
Three kinds of errors are optimized simultaneously.
Figure: Flow of the VAE-GAN algorithm. Grey arrows represent the terms in the training objective.
That’s all. Thanks!