Transcript of Lecture 5: Autoregressive Models (30 mins)
Autoregressive Models
Hao Dong
Peking University
• Definition of Autoregressive Models (Ⅰ)
• Challenge of Generative Models
• Definition of Autoregressive Models (Ⅱ)
• Learning and Inference of Autoregressive Models
• Examples of Autoregressive Models
• Fully Visible Sigmoid Belief Network (FVSBN)
• Neural Autoregressive Density Estimation (NADE)
• Masked Autoencoder for Distribution Estimation (MADE)
• PixelRNN, PixelCNN, WaveNet… (Next Lecture)
Definition of Autoregressive Models
y_t = c + b_1 y_{t−1} + b_2 y_{t−2} + … + b_p y_{t−p} + ε_t,  ε_t ~ N(0, σ²)
Put simply, an autoregressive model is a feed-forward model that predicts future values from past values.
The term "autoregressive" originates from the time-series literature, where observations from previous time steps are used to predict the value at the current time step.
y_t could be: the stock price on day t, the amplitude of a simple pendulum at time t, or any variable that depends on its preceding values!
Two examples of data from autoregressive models with different parameters. Left: AR(1) with y_t = 18 − 0.8 y_{t−1} + ε_t. Right: AR(2) with y_t = 8 + 1.3 y_{t−1} − 0.7 y_{t−2} + ε_t.
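The two example processes in the caption can be simulated in a few lines of numpy (a sketch: the coefficients come from the caption, while the function name, noise scale, and seed are our own choices):

```python
import numpy as np

def simulate_ar(coeffs, c, n_steps, sigma=1.0, rng=None):
    """Simulate an AR(p) process: y_t = c + sum_k coeffs[k] * y_{t-1-k} + eps_t."""
    rng = rng or np.random.default_rng(0)
    p = len(coeffs)
    y = np.zeros(n_steps + p)  # zero-initialize the first p lags
    for t in range(p, n_steps + p):
        eps = rng.normal(0.0, sigma)
        y[t] = c + sum(coeffs[k] * y[t - 1 - k] for k in range(p)) + eps
    return y[p:]

# Left panel:  AR(1) with y_t = 18 - 0.8 y_{t-1} + eps_t
ar1 = simulate_ar([-0.8], c=18.0, n_steps=200)
# Right panel: AR(2) with y_t = 8 + 1.3 y_{t-1} - 0.7 y_{t-2} + eps_t
ar2 = simulate_ar([1.3, -0.7], c=8.0, n_steps=200)
```

For the AR(1) case the sample mean settles near c / (1 − b_1) = 18 / 1.8 = 10, which is the stationary mean of that process.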
Definition of Autoregressive Models
y_t = c + b_1 y_{t−1} + b_2 y_{t−2} + … + b_p y_{t−p} + ε_t,  ε_t ~ N(0, σ²)
Autoregressive models have a strong ability to represent data, and can be used for:
• Regression
• Generation
• Prediction
Recap: Statistical Generative Models
A statistical generative model p(x) is learned from a dataset D plus prior knowledge (model family, loss function, optimization algorithm, etc.).
[Figure: face images x_1, x_2, x_3, x_4; x_3 is a 64×64×3 high-dimensional vector representing a woman with blonde hair, and its probability density value under p_data(x) is very high.]
Sampling from p(x) generates new images.
Recap: Challenge of Generative Models
Suppose x_1, x_2, x_3 are binary variables. p(x_1, x_2, x_3) can be specified with 2³ − 1 = 7 parameters.
Too many possibilities!
Compactness — main idea: write the joint as a product of simpler terms.
Main challenge: distributions over high-dimensional objects are actually very sparse!
What about a 28×28 black/white digit image?
2^{28×28} = 2^{784} ≈ 10^{236} parameters!
But the distribution has only 10 peaks, one for each digit 0, 1, 2, …, 9.
Recap: Challenge of Generative Models
Solution #1: Factorization
Definition of conditional probability:
p(x_1, x_2) = p(x_1) p(x_2 | x_1)
Product rule:
p(x_1, x_2, …, x_n) = ∏_{i=1}^{n} p_i(x_i | x_{<i})
Divide and conquer! We can solve the joint distribution p(x) by solving the simpler conditional distributions p_i(x_i | x_{<i}) one by one.
But it is hard to model every conditional distribution exactly — still complex!
Can you tell the exact likelihood of the next pixel (marked as a red point) conditioned on the given pixels?
Recap: Challenge of Generative Models
Solution #2a: use simple functions to form the conditionals
p(x_4 | x_1, x_2, x_3) ≈ sigmoid(a_1 x_1 + a_2 x_2 + a_3 x_3)
+ Only requires storing 3 parameters
− The relationship between x_4 and (x_1, x_2, x_3) could be too simple
Solution #2b: use a more complex functional form — a neural network
h_1 = f_11(x_1, x_2, x_3), h_2 = f_12(x_1, x_2, x_3), h_3 = f_13(x_1, x_2, x_3)
g_1 = f_21(h_1, h_2, h_3), g_2 = f_22(h_1, h_2, h_3), g_3 = f_23(h_1, h_2, h_3), …
p(x_4 | x_1, x_2, x_3) ≈ sigmoid(a_1 g_1 + a_2 g_2 + a_3 g_3 + …)
+ More flexible
− More parameters
+ More powerful at fitting data — finally, it is possible to model the data distribution!
The sigmoid function squashes the output into (0, 1), so it can parameterize a Bernoulli distribution over the binary output.
[Figure source: IJCAI-ECAI 2018 Tutorial: Deep Generative Models.]
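A minimal sketch contrasting the two solutions for one conditional p(x_4 | x_1, x_2, x_3); the layer sizes, activations, and random parameters here are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=3).astype(float)  # observed binary x1, x2, x3

# Solution #2a: logistic regression -- only 3 parameters, but a linear relationship
alpha = rng.normal(size=3)
p_logistic = sigmoid(alpha @ x)

# Solution #2b: one hidden layer -- more parameters, more flexible
W1, b1 = rng.normal(size=(8, 3)), np.zeros(8)  # hidden layer h = f1(x)
w2, b2 = rng.normal(size=8), 0.0               # output layer
h = np.tanh(W1 @ x + b1)
p_mlp = sigmoid(w2 @ h + b2)
```

Either way the output lies in (0, 1), so it is a valid Bernoulli parameter for x_4.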
Definition of Autoregressive Models
• Graphical model: a directed, fully-observed Bayesian network
• Key idea: decompose the joint distribution into a product of tractable conditionals
By defining x̂_t, the output of step t, as a random variable that follows the conditional distribution given the previous inputs x_1, x_2, …, x_{t−1}, we obtain a probability model that represents the joint distribution p_θ(x_1, x_2, …, x_n):
x̂_t = p_θ(x_t | x_1, x_2, …, x_{t−1})
p_θ(x) = ∏_{i=1}^{n} p_θ(x_i | x_1, x_2, …, x_{i−1}) = ∏_{i=1}^{n} p_θ(x_i | x_{<i})
[Figures: WaveNet animation (source: Google DeepMind); RNN diagram (source: Chris Olah).]
Relationship with RNNs: like an RNN, an autoregressive model's output x̂_t at time t depends not just on x_t but also on x_1, x_2, …, x_{t−1} from previous time steps. However, unlike an RNN, the previous x_1, x_2, …, x_{t−1} are not summarized in a hidden state: they are given directly as inputs to the model.
Learning and Inference of Autoregressive Models
• Learning maximizes the model log-likelihood over the dataset D:
min_θ KL(p_data, p_θ) = min_θ E_{x~p_data}[log p_data(x) − log p_θ(x)] ⇔ max_θ E_{x~p_data}[log p_θ(x)]
max_θ log p_θ(D) = Σ_{x∈D} log p_θ(x) = Σ_{x∈D} Σ_{i=1}^{n} log p_θ(x_i | x_{<i})
Tractable: the distribution is simple enough to be modeled explicitly.
Tractable conditionals make learning each conditional distribution meaningful, and thus allow for exact likelihood evaluation.
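The objective above — a sum of log-conditionals over each data point — can be sketched directly. This toy uses logistic-regression conditionals (an FVSBN-style parameterization, introduced later in this lecture) purely for illustration; the function and variable names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(x, alpha):
    """log p_theta(x) = sum_i log p_theta(x_i | x_{<i}) for a binary vector x,
    with each conditional a logistic regression on the preceding entries."""
    total = 0.0
    for i in range(len(x)):
        # alpha[i][0] is the bias; alpha[i][1:i+1] weight x_1 .. x_{i-1}
        p = sigmoid(alpha[i][0] + alpha[i][1:i + 1] @ x[:i])
        total += x[i] * np.log(p) + (1 - x[i]) * np.log(1 - p)
    return total

rng = np.random.default_rng(0)
n = 5
alpha = [rng.normal(size=i + 1) for i in range(n)]  # i parameters per conditional
x = rng.integers(0, 2, size=n).astype(float)
ll = log_likelihood(x, alpha)
# Learning would maximize sum(log_likelihood(x, alpha) for x in D) over alpha.
```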
Learning and Inference of Autoregressive Models
• Inference samples each variable of a data point from the estimated conditional distributions step by step, until the whole data point is generated.
Ancestral sampling: a process for producing samples from a probabilistic model.
First sample the variables that have no conditional constraints from their prior distribution: x_1 ~ p_θ(x_1).
Then sample each child variable from its conditional distribution given its parents, and so on: x_2 ~ p_θ(x_2 | x_1), …
Because autoregressive models directly model and output distributions, they naturally support ancestral sampling.
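The step-by-step procedure can be sketched for the binary case as follows (again with illustrative logistic conditionals; the names are ours, not the slides'):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ancestral_sample(alpha, rng):
    """Sample x_1, then x_2 | x_1, then x_3 | x_1, x_2, ... step by step."""
    n = len(alpha)
    x = np.zeros(n)
    for i in range(n):
        # p(x_i = 1 | x_{<i}) under a logistic-regression conditional
        p = sigmoid(alpha[i][0] + alpha[i][1:i + 1] @ x[:i])
        x[i] = rng.random() < p  # draw a Bernoulli(p) sample
    return x

rng = np.random.default_rng(0)
alpha = [rng.normal(size=i + 1) for i in range(4)]
sample = ancestral_sample(alpha, rng)
```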
Learning and Inference of Autoregressive Models
Differences between autoregressive models (AR), VAEs, and GANs:
A GAN does not define any explicit distribution; it uses a discriminator to learn the data distribution implicitly.
A VAE assumes the data distribution is too complex to model directly, so it defines a simple intermediate distribution and learns a mapping from that simple distribution to the complex data distribution.
An AR model assumes the data distribution can be learned directly (it is tractable), and defines its outputs as conditional distributions, solving the generation problem by modeling each conditional distribution directly.
Learning and Inference of Autoregressive Models
Conclusion:
1. Using complex networks, at each step an autoregressive model outputs an approximate conditional distribution x̂_t = p_θ(x_t | x_1, x_2, …, x_{t−1}).
2. Taking in the previous inputs x_1, x_2, …, x_{t−1}, and obtaining the next input x_t by sampling from the previously estimated conditional x̂_t, the model generates all conditional distributions iteratively: x_1 ~ p_θ(x_1), x_2 ~ p_θ(x_2 | x_1), x_3 ~ p_θ(x_3 | x_1, x_2), …, x_n ~ p_θ(x_n | x_1, …, x_{n−1}).
3. The product rule ensures that the generated data point, made up of the sampled x_t from each step, follows the data distribution: (x_1, x_2, …, x_n) ~ ∏_{i=1}^{n} p_θ(x_i | x_{<i}).
Fully Visible Sigmoid Belief Network (FVSBN)
The fully visible sigmoid belief network without any hidden units is denoted FVSBN. The conditional variables x_i | x_1, …, x_{i−1} in FVSBN are Bernoulli with parameters:
x̂_i = p(x_i = 1 | x_1, x_2, …, x_{i−1}) = σ_i(x_1, x_2, …, x_{i−1}; α^{(i)}) = σ(α_0^{(i)} + α_1^{(i)} x_1 + … + α_{i−1}^{(i)} x_{i−1})
• σ denotes the sigmoid function
• α^{(i)} = {α_0^{(i)}, α_1^{(i)}, …, α_{i−1}^{(i)}} denotes the parameters
Some conditionals are too complex to model exactly, so FVSBN assumes the simple logistic form above.
• The conditional for variable x_i requires i parameters, and hence the total number of parameters in the model is Σ_{i=1}^{n} i = O(n²) ≪ O(2^n).
Gan Z., Henao R., Carlson D., et al. Learning Deep Sigmoid Belief Networks with Data Augmentation. AISTATS, 2015.
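The parameter count can be checked directly (a sketch; the helper name is ours):

```python
# Each conditional p(x_i = 1 | x_{<i}) is a logistic regression with a bias
# and i - 1 weights, i.e. i parameters; the total is sum_{i=1}^{n} i.
def fvsbn_num_params(n):
    return n * (n + 1) // 2  # O(n^2), versus O(2^n) for the full joint table

print(fvsbn_num_params(784))  # 307720 parameters for a 28x28 binary image
```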
FVSBN Example
• Suppose we have a dataset D of handwritten digits (binarized MNIST).
• Each image has n = 28×28 = 784 pixels. Each pixel can either be black (0) or white (1).
• We want to learn a probability distribution p(x) = p(x_1, …, x_784) over x ∈ {0,1}^784 such that when x ~ p(x), x looks like a digit.
• Idea: define an FVSBN model, then pick a good one based on the training data D (more on that later).
• We can pick an ordering, i.e., order the variables (pixels) from top-left (x_1) to bottom-right (x_784).
• Use the product-rule factorization, without loss of generality:
p(x_1, …, x_784) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ··· p(x_784 | x_1, …, x_783)
• The FVSBN model assumes (fewer parameters):
x̂_i = p(x_i = 1 | x_1, x_2, …, x_{i−1}) = σ_i(x_1, x_2, …, x_{i−1}; α^{(i)}) = σ(α_0^{(i)} + α_1^{(i)} x_1 + … + α_{i−1}^{(i)} x_{i−1})
• Note: this is a modeling assumption. We are using logistic regression to predict the next pixel's distribution based on the previous ones. This is what makes the model autoregressive.
FVSBN Example
• How to evaluate p(x_1, …, x_784), i.e., density estimation? Multiply all the conditionals (factors). In the example below:
p(x_1 = 0, x_2 = 1, x_3 = 1, x_4 = 0)
= p(x_1 = 0) p(x_2 = 1 | x_1 = 0) p(x_3 = 1 | x_1 = 0, x_2 = 1) p(x_4 = 0 | x_1 = 0, x_2 = 1, x_3 = 1)
= (1 − x̂_1) × x̂_2 × x̂_3 × (1 − x̂_4)
• How to sample from p(x_1, …, x_784)?
1. Sample x̄_1 ~ p(x_1) (np.random.choice([1, 0], p = [x̂_1, 1 − x̂_1]))
2. Sample x̄_2 ~ p(x_2 | x_1 = x̄_1)
3. Sample x̄_3 ~ p(x_3 | x_1 = x̄_1, x_2 = x̄_2) ···
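Both operations can be sketched in a few lines; the x̂ values below are made up for illustration (in a trained FVSBN each would come from its logistic-regression conditional evaluated on the prefix):

```python
import numpy as np

# x_hat[i] = p(x_{i+1} = 1 | the given prefix), illustrative values only
x_hat = [0.3, 0.6, 0.8, 0.1]

# Density estimation: multiply the factors for the slide's example x = (0, 1, 1, 0)
p = (1 - x_hat[0]) * x_hat[1] * x_hat[2] * (1 - x_hat[3])

# Sampling the first pixel, exactly as written on the slide:
rng = np.random.default_rng(0)
x1 = rng.choice([1, 0], p=[x_hat[0], 1 - x_hat[0]])
# ... then recompute x_hat[1] with the sampled x1 plugged in, sample x2, and so on.
```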
NADE: Neural Autoregressive Density Estimation
Improvement over FVSBN: use a one-hidden-layer neural network instead of logistic regression.
Tied weights: parameters are shared across conditionals to reduce the number of parameters and speed up computation (see the blue dots in the figure).
h_i = σ(A_i x_{<i} + c_i)
x̂_i = p(x_i | x_1, x_2, …, x_{i−1}; A_i, c_i, α_i, b_i) = σ(α_i · h_i + b_i)
where A_i, c_i, α_i, b_i are the parameters.
Uria B., Côté M. A., Gregor K., et al. Neural Autoregressive Distribution Estimation. Journal of Machine Learning Research, 2016, 17(1): 7184–7220.
NADE: Neural Autoregressive Density Estimation
θ_i = {A_i ∈ R^{d×(i−1)}, c_i ∈ R^d, α_i ∈ R^d, b_i ∈ R} is the set of parameters for step i.
The total number of parameters in this model is dominated by the matrices {A_1, A_2, …, A_n}, giving O(n²d).
Sharing parameters: tied weights (taking each A_i to be the first i−1 columns of a single shared matrix W) reduce the number of parameters to O(nd) and speed up computation.
h_i = σ(A_i x_{<i} + c_i)
x̂_i = p(x_i | x_1, x_2, …, x_{i−1}; A_i, c_i, α_i, b_i) = σ_i(x_1, x_2, …, x_{i−1}) = σ(α_i · h_i + b_i)
x_{<i} ∈ R^{i−1} denotes the vector of preceding x's; h_i ∈ R^d denotes the hidden-layer activations of the MLP.
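A forward pass with tied weights can be sketched as follows (a minimal sketch: the sizes, random parameters, and function name are ours; each A_i is realized as the first i−1 columns of a shared matrix W):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_forward(x, W, c, V, b):
    """Compute x_hat_i = p(x_i = 1 | x_{<i}) for every i, with tied weights:
    A_i is W[:, :i], the first i-1 columns of the shared matrix W."""
    n = len(x)
    x_hat = np.zeros(n)
    for i in range(n):
        h = sigmoid(W[:, :i] @ x[:i] + c)    # h_i in R^d
        x_hat[i] = sigmoid(V[i] @ h + b[i])  # alpha_i . h_i + b_i
    return x_hat

rng = np.random.default_rng(0)
n, d = 6, 4
W = rng.normal(size=(d, n - 1))  # shared input weights (blue dots in the figure)
c = np.zeros(d)
V = rng.normal(size=(n, d))      # per-output weights alpha_i
b = np.zeros(n)
x = rng.integers(0, 2, size=n).astype(float)
x_hat = nade_forward(x, W, c, V, b)
```

As written the loop recomputes W[:, :i] @ x[:i] from scratch, costing O(n²d); the actual NADE model exploits A_i x_{<i} = A_{i−1} x_{<i−1} + x_{i−1} W[:, i−1] to update the pre-activation incrementally, giving an O(nd) pass.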
Performance on the MNIST dataset. (Left) Training data. (Middle) Averaged synthesized samples. (Right) Learned features at the bottom layer.
[Figure: generated samples and learned features, FVSBN vs. NADE.]
Generate other distributions
• How to model non-binary discrete random variables v_i ∈ {1, …, K}? E.g., pixel intensities varying from 0 to 255?
• Softmax generalizes the sigmoid/logistic function σ(·) and transforms a vector of K numbers into a vector of K probabilities (non-negative, summing to 1).
• One solution: let v̂_i parameterize a categorical distribution:
h_i = σ(A_i x_{<i} + c_i)
v̂_i = p(v_i | v_1, …, v_{i−1}) = (p_i1, p_i2, …, p_iK) = softmax(X_i h_i + b_i)
softmax(a) = softmax(a_1, …, a_K) = (exp(a_1) / Σ_j exp(a_j), …, exp(a_K) / Σ_j exp(a_j))
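One step of the categorical version can be sketched as follows (the sizes and random parameters are illustrative; subtracting the max before exponentiating is a standard numerical-stability trick, not something the slide states):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    a = a - a.max()  # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

rng = np.random.default_rng(0)
K, d = 256, 8                      # e.g. K = 256 pixel intensities 0..255
prefix = np.array([3.0, 17.0])     # the preceding values x_{<i}
A_i = rng.normal(size=(d, len(prefix)))
c_i = np.zeros(d)
X_i = rng.normal(size=(K, d))
b_i = np.zeros(K)

h_i = sigmoid(A_i @ prefix + c_i)
p_i = softmax(X_i @ h_i + b_i)  # (p_i1, ..., p_iK): non-negative, sums to 1
```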
Generate other distributions
• How to model continuous random variables v_i ∈ R? E.g., speech signals?
• One solution: let v̂_i parameterize a continuous distribution, e.g., a uniform mixture of K Gaussians:
h_i = σ(A_i x_{<i} + c_i)
v̂_i = (μ_i1, …, μ_iK, σ_i1, …, σ_iK)
p(v_i | v_1, …, v_{i−1}) = Σ_{k=1}^{K} (1/K) N(v_i; μ_ik, σ_ik)
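The mixture density itself is easy to evaluate; in the real model the μ's and σ's would be outputs of the step-i network, while here they are made-up numbers for illustration:

```python
import numpy as np

def gaussian_pdf(v, mu, sigma):
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_density(v, mus, sigmas):
    """Uniform mixture of K Gaussians: p(v) = (1/K) sum_k N(v; mu_k, sigma_k)."""
    K = len(mus)
    return sum(gaussian_pdf(v, mus[k], sigmas[k]) for k in range(K)) / K

mus = np.array([-1.0, 0.0, 2.0])     # illustrative mu_i1..mu_iK
sigmas = np.array([0.5, 1.0, 0.3])   # illustrative sigma_i1..sigma_iK
p = mixture_density(0.1, mus, sigmas)
```

Because each component is a normalized density and the weights 1/K sum to 1, the mixture is itself a valid density over v.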
Autoregressive model vs. Autoencoders
• FVSBN and NADE look similar to an autoencoder:
• an encoder e(·), e.g., e(x) = σ(W_2(W_1 x + b_1) + b_2);
• a decoder such that d(e(x)) ≈ x.
Binary reconstruction loss (minimized over all encoder/decoder parameters):
min_θ Σ_{x∈D} Σ_i (−x_i log x̂_i − (1 − x_i) log(1 − x̂_i))
Continuous reconstruction loss:
min_θ Σ_{x∈D} Σ_i (x_i − x̂_i)²
• Encoder: feature learning.
• A vanilla autoencoder is not a generative model: it does not define a distribution over x that we can sample from to generate new data points.
Autoregressive model vs. Autoencoders
• FVSBN and NADE look similar to an autoencoder.
• Can we get a generative model from an autoencoder? Yes — but a dependency-order constraint is required to turn the autoencoder into a valid Bayesian network.
Autoregressive model vs. Autoencoders
• To get an autoregressive model from an autoencoder, we need to make sure it corresponds to a valid Bayesian network, so we need an ordering. If the ordering is 1, 2, 3, then:
• x̂_1 cannot depend on any input x;
• x̂_2 can only depend on x_1;
• x̂_3 can only depend on x_1, x_2.
• Bonus: we can use a single neural network (with n outputs) to produce all the parameters in one pass. In contrast, NADE requires n passes. This is much more efficient on modern hardware.
MADE: Masked Autoencoder for Distribution Estimation
Germain M., Gregor K., Murray I., Larochelle H. MADE: Masked Autoencoder for Distribution Estimation. ICML, 2015.
Masked Autoencoder
Use masks to constrain the dependency paths! Each output unit is an estimated conditional distribution; it may only depend on inputs whose position in the chosen ordering comes before its own.
With the ordering x_2, x_3, x_1:
1. p(x_2) doesn't depend on any input;
2. p(x_3 | x_2) depends on input x_2;
3. p(x_1 | x_2, x_3) depends on inputs x_2, x_3.
Performance on the MNIST dataset. (Left) Samples from a two-hidden-layer MADE. (Right) Nearest neighbors in binarized MNIST.
Autoregressive Models in NLP
Natural language generation (NLG) is one of the important research fields of artificial intelligence, including text-to-text generation, meaning-to-text generation, image-to-text generation, etc.
When generating each word, it is always helpful to condition on the text that has already been generated! Thus autoregressive models are widely adopted in NLP.
Example: the powerful GPT-2 model generating text about the "First Law of Robotics".
Appendix A – Taxonomy of Generative Models
Thanks