Exponential Family of Distributions
Prof. Nicholas Zabaras
Center for Informatics and Computational Science
https://cics.nd.edu/
University of Notre Dame
Notre Dame, Indiana, USA
Email: [email protected]
URL: https://www.zabaras.com/
September 6, 2018
References

C. P. Robert, The Bayesian Choice: From Decision-Theoretic Motivations to Computational Implementation, Springer-Verlag, NY, 2001 (online resource).
A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC Press, 2nd Edition, 2003.
J. M. Marin and C. P. Robert, The Bayesian Core, Springer-Verlag, 2007 (online resource).
D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, Oxford University Press, 2006.
B. Vidakovic, Bayesian Statistics for Engineering, Online Course at Georgia Tech.
C. Bishop, Pattern Recognition and Machine Learning, Chapter 2.
M. Jordan, An Introduction to Probabilistic Graphical Models, Chapter 8 (preprint).
K. Murphy, Machine Learning: A Probabilistic Perspective, Chapters 2 and 4.
Contents

Exponential Family: Bernoulli, Beta, Gamma, Gaussian
Conjugate Priors, Posterior Predictive
Computing Moments
Moment Parametrization
MLE for the Exponential Family
Maximum Entropy and the Exponential Family
Generalized Linear Models
Exponential Family

A large family of useful distributions with common properties.

In the family: Bernoulli, beta, binomial, chi-square, Dirichlet, gamma, Gaussian, geometric, multinomial, Poisson, Weibull, ...

Not in the family: uniform, Student's t, Cauchy, Laplace, mixture of Gaussians, ...

The variable can be discrete or continuous (or vectors thereof).
Exponential Family

The exponential family of distributions over 𝒙, given parameters 𝜼, is defined to be the set of distributions of the form

$$p(\mathbf{x}\mid\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right), \quad\text{or}\quad p(\mathbf{x}\mid\boldsymbol{\eta}) = h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\right), \quad\text{where } A(\boldsymbol{\eta}) = -\log g(\boldsymbol{\eta}).$$

𝒙 is scalar/vector, discrete/continuous. 𝜼 are the natural parameters and 𝒖(𝒙) is referred to as a sufficient statistic.

𝑔(𝜼) ensures that the distribution is normalized and satisfies

$$g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right)d\mathbf{x} = 1.$$

The normalization factor 𝑍 and its logarithm 𝐴 are defined as:

$$Z(\boldsymbol{\eta}) = \frac{1}{g(\boldsymbol{\eta})}, \qquad A(\boldsymbol{\eta}) = \ln Z(\boldsymbol{\eta}) = -\ln g(\boldsymbol{\eta}) = \ln\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right)d\mathbf{x}.$$

The space of 𝜼 for which $\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right)d\mathbf{x} < \infty$ is the natural parameter space.
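As a quick numerical sanity check (an illustration added here, not part of the original slides), the sketch below computes $Z(\boldsymbol{\eta})$ by quadrature, assuming NumPy/SciPy; the particular choices of $h$ and $\mathbf{u}$ are the univariate-Gaussian ones identified later in these slides.

```python
# Minimal sketch (assumed example): compute Z(eta) and A(eta) = ln Z(eta)
# by quadrature for h(x) = (2*pi)^(-1/2), u(x) = (x, x^2), i.e. the Gaussian.
import numpy as np
from scipy.integrate import quad

def Z(eta, h, u):
    integrand = lambda x: h(x) * np.exp(np.dot(eta, u(x)))
    val, _ = quad(integrand, -np.inf, np.inf)
    return val

h = lambda x: 1.0 / np.sqrt(2 * np.pi)
u = lambda x: np.array([x, x**2])
eta = np.array([1.0, -0.5])        # mu/sigma^2 = 1 and -1/(2 sigma^2) = -0.5
print(Z(eta, h, u), np.log(Z(eta, h, u)))   # Z = 1/g(eta), A = ln Z
```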
Canonical or Natural Parameters

When the parameter 𝜽 enters the exponential family through 𝜼(𝜽), we write the probability density of the exponential family as follows:

$$p(\mathbf{x}\mid\boldsymbol{\theta}) = h(\mathbf{x})\,g\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)\exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T\mathbf{u}(\mathbf{x})\right), \quad\text{or}\quad p(\mathbf{x}\mid\boldsymbol{\theta}) = h(\mathbf{x})\exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T\mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta}(\boldsymbol{\theta}))\right),$$

$$\text{where } A\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right) = -\log g\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right).$$

𝜼(𝜽) are the canonical or natural parameters; 𝜽 is the parameter vector of some distribution that can be written in the exponential family format.
Exponential Family: The Bernoulli Distribution

Consider the Bernoulli distribution:

$$\mathrm{Bern}(x\mid\mu) = \mu^x(1-\mu)^{1-x} = \exp\left(x\ln\mu + (1-x)\ln(1-\mu)\right) = (1-\mu)\exp\left(x\ln\frac{\mu}{1-\mu}\right).$$

From this we see that (note that the relation 𝜇(𝜂) is invertible)

$$\eta = \ln\frac{\mu}{1-\mu} \quad\Longleftrightarrow\quad \mu = \sigma(\eta) = \frac{1}{1+e^{-\eta}} \quad\text{(the logistic sigmoid function)},$$

and

$$g(\eta) = 1-\mu = \sigma(-\eta).$$

Finally, comparing with the exponential family form $p(x\mid\eta) = h(x)\,g(\eta)\exp\left(\eta^T u(x)\right) = h(x)\exp\left(\eta^T u(x) - A(\eta)\right)$:

$$p(x\mid\eta) = \sigma(-\eta)\exp(\eta x), \qquad u(x)=x, \quad h(x)=1, \quad g(\eta)=\sigma(-\eta), \qquad A(\eta) = -\ln g(\eta) = \ln\left(1+e^{\eta}\right).$$
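A small numeric sketch (an added illustration, not from the slides) checking these Bernoulli identities, assuming NumPy:

```python
# Check eta = ln(mu/(1-mu)), g(eta) = sigma(-eta), A(eta) = log(1 + e^eta).
import numpy as np

mu = 0.7
eta = np.log(mu / (1 - mu))          # natural parameter
sigma = lambda t: 1 / (1 + np.exp(-t))
A = np.log1p(np.exp(eta))            # log-partition A(eta)

p = lambda x: np.exp(eta * x - A)    # p(x|eta) = h(x) exp(eta*x - A), h = 1
print(p(1), p(0))                    # -> 0.7, 0.3
print(np.isclose(sigma(-eta), np.exp(-A)))  # g(eta) = sigma(-eta) = e^{-A}
```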
Exponential Family: The Beta Distribution

Consider the Beta distribution:

$$\mathrm{Beta}(\mu\mid a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1} = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\exp\left((a-1)\ln\mu + (b-1)\ln(1-\mu)\right).$$

Comparing this with our exponential family form

$$p(\mathbf{x}\mid\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right) = h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\right),$$

we can easily identify:

$$\mathbf{u}(\mu) = \left(\ln\mu,\ \ln(1-\mu)\right)^T, \quad \boldsymbol{\eta} = (a-1,\ b-1)^T, \quad h(\mu)=1, \quad g(a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)},$$

$$A(a,b) = -\ln g(a,b) = \ln\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$
Exponential Family: The Gaussian

Consider the univariate Gaussian:

$$p(x\mid\mu,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \frac{\mu^2}{2\sigma^2}\right).$$

Comparing this with our exponential family form

$$p(\mathbf{x}\mid\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right) = h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\right),$$

we can identify (this is a two-parameter distribution):

$$\boldsymbol{\eta} = \left(\frac{\mu}{\sigma^2},\ -\frac{1}{2\sigma^2}\right)^T, \quad \mathbf{u}(x) = (x,\ x^2)^T, \quad h(x) = (2\pi)^{-1/2}, \quad g(\boldsymbol{\eta}) = (-2\eta_2)^{1/2}\exp\left(\frac{\eta_1^2}{4\eta_2}\right),$$

$$A(\boldsymbol{\eta}) = -\ln g(\boldsymbol{\eta}) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\ln(-2\eta_2).$$
Conjugate Priors

In general, for a given probability distribution 𝑝(𝒙|𝜼), we can seek a prior 𝑝(𝜼) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior.

For the Bernoulli, the conjugate prior is the Beta distribution.

For the Gaussian, the conjugate prior for the mean is a Gaussian, and the conjugate prior for the precision is the gamma distribution (the Wishart for the precision matrix of a multivariate Gaussian).
Conjugate Priors

For any member of the exponential family with likelihood

$$p(\mathbf{x}\mid\boldsymbol{\theta}) = h(\mathbf{x})\,g\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)\exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T\mathbf{u}(\mathbf{x})\right),$$

there exists a conjugate prior that can be written in the form

$$p(\boldsymbol{\theta}\mid\nu_0,\boldsymbol{\tau}_0) \propto g\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)^{\nu_0}\exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T\boldsymbol{\tau}_0\right) = \exp\left(\nu_0\,\boldsymbol{\eta}(\boldsymbol{\theta})^T\bar{\boldsymbol{\tau}}_0 - \nu_0\,A\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)\right), \quad\text{where } \boldsymbol{\tau}_0 \equiv \nu_0\bar{\boldsymbol{\tau}}_0.$$

In normalized form, we write:

$$p(\boldsymbol{\theta}\mid\nu_0,\boldsymbol{\tau}_0) = \frac{1}{Z(\nu_0,\boldsymbol{\tau}_0)}\,g\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)^{\nu_0}\exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T\boldsymbol{\tau}_0\right) = \frac{1}{Z(\nu_0,\boldsymbol{\tau}_0)}\exp\left(\nu_0\,\boldsymbol{\eta}(\boldsymbol{\theta})^T\bar{\boldsymbol{\tau}}_0 - \nu_0\,A\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)\right),$$

$$\text{where } Z(\nu_0,\boldsymbol{\tau}_0) = \int \exp\left(\nu_0\,\boldsymbol{\eta}(\boldsymbol{\theta})^T\bar{\boldsymbol{\tau}}_0 - \nu_0\,A\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)\right)d\boldsymbol{\theta}.$$
Conjugate Priors

The likelihood of the data 𝑿 = {𝒙₁, ..., 𝒙_N} is

$$p(\mathbf{X}\mid\boldsymbol{\theta}) = \left(\prod_{n=1}^N h(\mathbf{x}_n)\right) g\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)^N \exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T\sum_{n=1}^N \mathbf{u}(\mathbf{x}_n)\right).$$

Using the prior

$$p(\boldsymbol{\theta}\mid\nu_0,\boldsymbol{\tau}_0) = \frac{1}{Z(\nu_0,\boldsymbol{\tau}_0)}\,g\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)^{\nu_0}\exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T\boldsymbol{\tau}_0\right),$$

the posterior becomes (this form justifies the definition 𝝉₀ ≡ 𝜈₀𝝉̄₀):

$$p(\boldsymbol{\theta}\mid\mathbf{X},\nu_0,\boldsymbol{\tau}_0) \propto g\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)^{\nu_0+N}\exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T\left(\sum_{n=1}^N \mathbf{u}(\mathbf{x}_n) + \nu_0\bar{\boldsymbol{\tau}}_0\right)\right) = g\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)^{\nu_0+N}\exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T\left(N\bar{\mathbf{u}} + \nu_0\bar{\boldsymbol{\tau}}_0\right)\right), \qquad \bar{\mathbf{u}} = \frac{1}{N}\sum_{n=1}^N \mathbf{u}(\mathbf{x}_n).$$

In normalized form:

$$p(\boldsymbol{\theta}\mid\mathbf{X},\nu_N,\boldsymbol{\tau}_N) = \frac{1}{Z(\nu_N,\boldsymbol{\tau}_N)}\,g\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)^{\nu_N}\exp\left((N+\nu_0)\,\boldsymbol{\eta}(\boldsymbol{\theta})^T\frac{N\bar{\mathbf{u}} + \nu_0\bar{\boldsymbol{\tau}}_0}{N+\nu_0}\right) = \frac{1}{Z(\nu_N,\boldsymbol{\tau}_N)}\,g\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)^{\nu_N}\exp\left(\nu_N\,\boldsymbol{\eta}(\boldsymbol{\theta})^T\bar{\boldsymbol{\tau}}_N\right),$$

$$\text{where } \nu_N = \nu_0 + N, \qquad \bar{\boldsymbol{\tau}}_N = \frac{N\bar{\mathbf{u}} + \nu_0\bar{\boldsymbol{\tau}}_0}{N+\nu_0}, \qquad \boldsymbol{\tau}_N = \nu_N\bar{\boldsymbol{\tau}}_N = N\bar{\mathbf{u}} + \nu_0\bar{\boldsymbol{\tau}}_0 = \sum_{n=1}^N \mathbf{u}(\mathbf{x}_n) + \boldsymbol{\tau}_0.$$

The parameter 𝜈₀ can be interpreted as the effective number of fictitious prior observations, each of which has a value of the sufficient statistic equal to 𝝉̄₀.
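A minimal sketch of this update rule (an added illustration, not from the slides, assuming NumPy) for the Bernoulli with 𝑢(𝑥) = 𝑥, where the hyperparameters update as 𝜈_N = 𝜈₀ + N and 𝜏_N = 𝜏₀ + Σₙ 𝑢(𝑥ₙ):

```python
# Conjugate updating in (nu, tau) form for Bernoulli data, u(x) = x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100) < 0.7            # N = 100 Bernoulli(0.7) samples

nu0, tau0 = 2.0, 1.0                 # prior: 2 pseudo-observations, 1 "head"
nu_N = nu0 + x.size                  # nu_N = nu_0 + N
tau_N = tau0 + x.sum()               # tau_N = tau_0 + sum_n u(x_n)

# For the Bernoulli this is the Beta(tau_N + 1, nu_N - tau_N + 1) posterior;
# its mean approaches the true parameter as N grows.
print((tau_N + 1) / (nu_N + 2))
```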
Posterior Predictive

Let 𝑿′ = {𝒙′₁, ..., 𝒙′_{N′}} denote future data; the posterior predictive is then:

$$p(\mathbf{X}'\mid\mathbf{X}) = \int p(\mathbf{X}'\mid\boldsymbol{\theta})\,p(\boldsymbol{\theta}\mid\mathbf{X})\,d\boldsymbol{\theta} = \left(\prod_{i=1}^{N'} h(\mathbf{x}'_i)\right)\frac{1}{Z\left(\nu_0+N,\ \boldsymbol{\tau}_0+\mathbf{u}(\mathbf{X})\right)}\int g\left(\boldsymbol{\eta}(\boldsymbol{\theta})\right)^{\nu_0+N+N'}\exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^T\left(\boldsymbol{\tau}_0+\mathbf{u}(\mathbf{X})+\mathbf{u}(\mathbf{X}')\right)\right)d\boldsymbol{\theta}.$$

This is simplified as follows:

$$p(\mathbf{X}'\mid\mathbf{X}) = \left(\prod_{i=1}^{N'} h(\mathbf{x}'_i)\right)\frac{Z\left(\nu_0+N+N',\ \boldsymbol{\tau}_0+\mathbf{u}(\mathbf{X})+\mathbf{u}(\mathbf{X}')\right)}{Z\left(\nu_0+N,\ \boldsymbol{\tau}_0+\mathbf{u}(\mathbf{X})\right)},$$

$$\text{where } \mathbf{u}(\mathbf{X}) = \sum_{i=1}^{N}\mathbf{u}(\mathbf{x}_i), \qquad \mathbf{u}(\mathbf{X}') = \sum_{i=1}^{N'}\mathbf{u}(\mathbf{x}'_i).$$

If 𝑁 = 0, this becomes the marginal likelihood of 𝑿′, which reduces to the normalizer of the posterior divided by the normalizer of the prior, multiplied by a constant.
Beta/Bernoulli: Posterior Predictive

Consider a Bernoulli likelihood with a Beta prior. The likelihood takes the familiar exponential family form:

$$p(\mathcal{D}\mid\theta) = (1-\theta)^N\exp\left(s\ln\frac{\theta}{1-\theta}\right), \qquad s = \sum_{i=1}^N \mathbb{I}(x_i = 1).$$

The conjugate prior is a Beta:

$$p(\theta\mid\nu_0,\tau_0) \propto (1-\theta)^{\nu_0}\exp\left(\tau_0\ln\frac{\theta}{1-\theta}\right) = \theta^{\tau_0}(1-\theta)^{\nu_0-\tau_0} \;\Longrightarrow\; p(\theta\mid\nu_0,\tau_0) = \mathrm{Beta}\left(\theta\mid\tau_0+1,\ \nu_0-\tau_0+1\right).$$

Thus the posterior becomes:

$$p(\theta\mid\mathcal{D},\nu_0,\tau_0) \propto \theta^{\tau_N}(1-\theta)^{\nu_N-\tau_N} = \mathrm{Beta}\left(\theta\mid\tau_N+1,\ \nu_N-\tau_N+1\right), \qquad \nu_N = \nu_0+N, \quad \tau_N = \tau_0+s.$$

Let 𝑠 be the number of heads in the past data. The probability of $s' = \sum_{i=1}^m \mathbb{I}(x'_i = 1)$ future heads in 𝑚 trials is then:

$$p(\mathcal{D}'\mid\mathcal{D}) = \int_0^1 \theta^{s'}(1-\theta)^{m-s'}\,\mathrm{Beta}\left(\theta\mid\tau_N+1,\ \nu_N-\tau_N+1\right)d\theta = \frac{B\left(\tau_N+s'+1,\ \nu_N-\tau_N+m-s'+1\right)}{B\left(\tau_N+1,\ \nu_N-\tau_N+1\right)},$$

where 𝐵(·,·) is the Beta function.
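A short sketch of this predictive computation (an added illustration, not from the slides; the helper `log_pred` is a hypothetical name), assuming NumPy/SciPy:

```python
# Beta/Bernoulli posterior predictive via the Beta-function ratio above,
# using log-Beta values for numerical stability.
import numpy as np
from scipy.special import betaln

def log_pred(s_future, m, s, N, tau0=0.0, nu0=0.0):
    """log p(D'|D) for a specific future sequence with s_future heads in m trials."""
    tau_N, nu_N = tau0 + s, nu0 + N
    a, b = tau_N + 1, nu_N - tau_N + 1          # posterior Beta(a, b)
    return betaln(a + s_future, b + m - s_future) - betaln(a, b)

# probability that the single next trial is heads, after 7 heads in 10 trials
print(np.exp(log_pred(1, 1, s=7, N=10)))        # (tau_N+1)/(nu_N+2) = 8/12
```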
Computing Moments of the Sufficient Statistics u(x)

Differentiate with respect to 𝜼 the normalization condition for the exponential family:

$$\int p(\mathbf{x}\mid\boldsymbol{\eta})\,d\mathbf{x} = g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right)d\mathbf{x} = 1.$$

This gives

$$\nabla g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right)d\mathbf{x} + g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right)\mathbf{u}(\mathbf{x})\,d\mathbf{x} = 0.$$

The above equation can be simplified if written in terms of the partition function 𝑍 = 1/𝑔(𝜼), or 𝐴 = log 𝑍 = −log 𝑔(𝜼):

$$-\frac{\nabla g(\boldsymbol{\eta})}{g(\boldsymbol{\eta})} = g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right)\mathbf{u}(\mathbf{x})\,d\mathbf{x} = \int p(\mathbf{x}\mid\boldsymbol{\eta})\,\mathbf{u}(\mathbf{x})\,d\mathbf{x} = \mathbb{E}[\mathbf{u}(\mathbf{x})],$$

i.e.,

$$\nabla A(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})].$$

Let us rewrite this equation explicitly as:

$$\nabla A(\boldsymbol{\eta}) = g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right)\mathbf{u}(\mathbf{x})\,d\mathbf{x}.$$

We can compute the variance of 𝒖(𝒙) by differentiating this equation with respect to 𝜼.
Computing Moments of the Sufficient Statistics u(x)

Differentiating

$$\nabla A(\boldsymbol{\eta}) = g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left(\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right)\mathbf{u}(\mathbf{x})\,d\mathbf{x}$$

once more with respect to 𝜼, and using $\nabla g(\boldsymbol{\eta})/g(\boldsymbol{\eta}) = -\mathbb{E}[\mathbf{u}(\mathbf{x})]$, gives

$$\nabla\nabla A(\boldsymbol{\eta}) = \nabla g(\boldsymbol{\eta})\int h(\mathbf{x})\,e^{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})}\,\mathbf{u}(\mathbf{x})^T d\mathbf{x} + g(\boldsymbol{\eta})\int h(\mathbf{x})\,e^{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})}\,\mathbf{u}(\mathbf{x})\mathbf{u}(\mathbf{x})^T d\mathbf{x} = \mathbb{E}\left[\mathbf{u}(\mathbf{x})\mathbf{u}(\mathbf{x})^T\right] - \mathbb{E}[\mathbf{u}(\mathbf{x})]\,\mathbb{E}[\mathbf{u}(\mathbf{x})]^T,$$

i.e.,

$$\nabla\nabla A(\boldsymbol{\eta}) = \mathrm{cov}[\mathbf{u}(\mathbf{x})] \succeq 0, \quad\text{so } A(\boldsymbol{\eta})\text{ is convex}.$$

Thus the covariance of 𝒖(𝒙) can be expressed in terms of the 2nd derivatives of 𝐴(𝜼), and similarly for higher-order moments.

Provided we can normalize a distribution from the exponential family, we can always find its moments by simple differentiation.
Computing Moments of the Sufficient Statistics u(x)

Let us check the relations $\nabla A(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]$ and $\nabla\nabla A(\boldsymbol{\eta}) = \mathrm{cov}[\mathbf{u}(\mathbf{x})]$ for the univariate Gaussian, where

$$g(\boldsymbol{\eta}) = (-2\eta_2)^{1/2}\exp\left(\frac{\eta_1^2}{4\eta_2}\right), \quad A(\boldsymbol{\eta}) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\ln(-2\eta_2), \quad \mathbf{u}(x) = (x,\ x^2)^T, \quad \boldsymbol{\eta} = \left(\frac{\mu}{\sigma^2},\ -\frac{1}{2\sigma^2}\right)^T.$$

Indeed:

$$\frac{\partial A}{\partial\eta_1} = -\frac{\eta_1}{2\eta_2} = \mu = \mathbb{E}[X], \qquad \frac{\partial A}{\partial\eta_2} = \frac{\eta_1^2}{4\eta_2^2} - \frac{1}{2\eta_2} = \mu^2 + \sigma^2 = \mathbb{E}[X^2],$$

$$\mathrm{var}[X] = \frac{\partial^2 A}{\partial\eta_1^2} = -\frac{1}{2\eta_2} = \sigma^2, \quad\text{etc.}$$
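These identities are easy to verify numerically; the sketch below (an added illustration, not from the slides, assuming NumPy) uses finite differences of 𝐴(𝜼):

```python
# Numerically verify dA/d(eta_1) = E[X] and d^2A/d(eta_1)^2 = var[X]
# for the univariate-Gaussian log-partition function.
import numpy as np

def A(e1, e2):
    return -e1**2 / (4 * e2) - 0.5 * np.log(-2 * e2)

mu, sigma2 = 1.5, 0.8
e1, e2 = mu / sigma2, -1 / (2 * sigma2)

h = 1e-4  # central finite differences in eta_1
dA = (A(e1 + h, e2) - A(e1 - h, e2)) / (2 * h)
d2A = (A(e1 + h, e2) - 2 * A(e1, e2) + A(e1 - h, e2)) / h**2
print(dA, d2A)   # -> approximately mu = 1.5 and sigma^2 = 0.8
```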
Moment Parametrization

We have shown that we can compute the mean of the distribution, 𝜇 = 𝔼[𝒖(𝒙)], in terms of the canonical parameter 𝜼:

$$\boldsymbol{\mu} \equiv \mathbb{E}[\mathbf{u}(\mathbf{x})] = \nabla A(\boldsymbol{\eta}).$$

We have also shown that 𝐴(𝜼) is a convex function. Since for a (strictly) convex function there is a one-to-one relation between the argument of the function and its derivative, the mapping 𝝁(𝜼) is invertible.

Thus the exponential family of distributions can also be parameterized in terms of 𝝁 (moment parametrization), exactly as we started this course.
MLE for the Exponential Family

The joint density for a data set 𝑿 = {𝒙₁, ..., 𝒙_N} is itself in the exponential family, with sufficient statistics $\sum_{n=1}^N \mathbf{u}(\mathbf{x}_n)$:

$$p(\mathbf{X}\mid\boldsymbol{\eta}) = \left(\prod_{n=1}^N h(\mathbf{x}_n)\right) g(\boldsymbol{\eta})^N \exp\left(\boldsymbol{\eta}^T\sum_{n=1}^N\mathbf{u}(\mathbf{x}_n)\right),$$

$$\ln p(\mathbf{X}\mid\boldsymbol{\eta}) = \sum_{n=1}^N \ln h(\mathbf{x}_n) + N\ln g(\boldsymbol{\eta}) + \boldsymbol{\eta}^T\sum_{n=1}^N\mathbf{u}(\mathbf{x}_n) = \sum_{n=1}^N \ln h(\mathbf{x}_n) - N A(\boldsymbol{\eta}) + \boldsymbol{\eta}^T\sum_{n=1}^N\mathbf{u}(\mathbf{x}_n).$$

Under certain regularity conditions, the exponential family is the only family of distributions with finite sufficient statistics (i.e., of size independent of the data set size).

The log-likelihood is concave (𝐴 is convex) and has a unique maximum. Maximizing with respect to 𝜼 gives:

$$\nabla A(\boldsymbol{\eta}_{ML}) = \frac{1}{N}\sum_{n=1}^N \mathbf{u}(\mathbf{x}_n) \quad\Longrightarrow\quad \mathbb{E}[\mathbf{u}(\mathbf{x})] = \frac{1}{N}\sum_{n=1}^N \mathbf{u}(\mathbf{x}_n).$$

At the MLE, the empirical average of the sufficient statistics equals the model's theoretical expected sufficient statistics (moment matching).

Thus to find the expected value of the sufficient statistics, one can use the data directly without having to estimate 𝜼. When 𝒖(𝒙) = 𝒙, the above allows us to compute the expectation of 𝒙 directly from the data.
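A minimal sketch of moment matching (an added illustration, not from the slides, assuming NumPy) for the univariate Gaussian, whose sufficient statistics are 𝒖(𝑥) = (𝑥, 𝑥²):

```python
# MLE by moment matching: equate empirical averages of u(x) = (x, x^2)
# to the model moments (mu, mu^2 + sigma^2) and solve in closed form.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)

m1, m2 = x.mean(), (x**2).mean()     # empirical averages of u(x)
mu_ml = m1
sigma2_ml = m2 - m1**2               # E[x^2] - E[x]^2
eta_ml = np.array([mu_ml / sigma2_ml, -1 / (2 * sigma2_ml)])
print(mu_ml, sigma2_ml, eta_ml)      # ~2.0, ~2.25, and the natural parameters
```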
MLE for the Exponential Family

Using the sufficient statistics, one can in principle invert the equation

$$\nabla A(\boldsymbol{\eta}_{ML}) = \mathbb{E}[\mathbf{u}(\mathbf{x})] = \frac{1}{N}\sum_{n=1}^N \mathbf{u}(\mathbf{x}_n)$$

to compute 𝜼_MLE. For example, for the Bernoulli distribution,

$$p(x\mid\eta) = g(\eta)\exp(\eta x), \qquad u(x)=x, \quad h(x)=1, \quad \mu = \frac{1}{1+e^{-\eta}}, \quad g(\eta) = \frac{1}{1+e^{\eta}}, \quad \eta = \ln\frac{\mu}{1-\mu},$$

and thus:

$$\mathbb{E}[X] = p(X=1) = \bar{\mu} \equiv \mu_{MLE} = \frac{1}{N}\sum_{n=1}^N \mathbb{I}(x_n = 1)$$

and

$$\eta_{MLE} = \ln\frac{\bar{\mu}}{1-\bar{\mu}}.$$
MLE and Kullback-Leibler Distance

A useful property of the MLE (and not just a property of the exponential family of distributions) is the following: minimizing the KL distance to the empirical distribution is equivalent to maximizing the likelihood.

Indeed, let us consider the model log 𝑝(𝑥|𝜃) and the empirical distribution

$$p_{emp}(x) = \frac{1}{N}\sum_{n=1}^N \delta(x, x_n).$$

We can then derive the following:

$$\sum_x p_{emp}(x)\log p(x\mid\theta) = \frac{1}{N}\sum_x\sum_{n=1}^N \delta(x, x_n)\log p(x\mid\theta) = \frac{1}{N}\sum_{n=1}^N \log p(x_n\mid\theta) = \frac{1}{N}\log p(\mathcal{D}\mid\theta),$$

and from this:

$$KL\left(p_{emp}\,\|\,p(\cdot\mid\theta)\right) = \sum_x p_{emp}(x)\log\frac{p_{emp}(x)}{p(x\mid\theta)} = \sum_x p_{emp}(x)\log p_{emp}(x) - \frac{1}{N}\log p(\mathcal{D}\mid\theta).$$

Since the 1st term is independent of 𝜃, the assertion is proved.
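The identity is easy to see numerically; the sketch below (an added illustration, not from the slides, assuming NumPy) shows that the KL distance and the negative average log-likelihood differ only by a 𝜃-independent constant (the entropy of the empirical distribution):

```python
# For a discrete model, KL(p_emp || p_theta) and -(1/N) log p(D|theta)
# differ by a constant in theta, so both are minimized at the same MLE.
import numpy as np

x = np.array([0, 1, 1, 0, 1, 1, 1, 0])          # observed data
p_emp = np.bincount(x, minlength=2) / x.size    # empirical distribution

def kl_and_nll(theta):
    p = np.array([1 - theta, theta])            # Bernoulli(theta) model
    kl = np.sum(p_emp * np.log(p_emp / p))
    nll = -np.mean(np.log(p[x]))                # -(1/N) log p(D|theta)
    return kl, nll

for theta in (0.3, 0.625, 0.9):                 # 0.625 = 5/8 is the MLE
    kl, nll = kl_and_nll(theta)
    print(theta, kl, nll, nll - kl)             # last column is constant
```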
Maximum Entropy and the Exponential Family

If nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the default.ᵃ

The entropy is defined (discrete case) as

$$H(\pi) = -\sum_k \pi(\theta_k)\log\pi(\theta_k).$$

When some statistics (moments) of the distribution are known,

$$\mathbb{E}\left[g_k(\theta)\right] = \omega_k, \qquad k = 1,\dots,K,$$

the maximum entropy distribution is of the form (the 𝜆's are the Lagrange multipliers enforcing the constraints):

$$\pi^*(\theta_i) = \frac{\exp\left(\sum_{k=1}^K \lambda_k\, g_k(\theta_i)\right)}{\sum_j \exp\left(\sum_{k=1}^K \lambda_k\, g_k(\theta_j)\right)}.$$

Thus the MaxEnt distribution has the form of the exponential family.

ᵃ C. P. Robert, The Bayesian Choice, Springer, 2nd edition, Chapter 3 (full text available).
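As a concrete sketch (an added illustration, not from the slides, assuming NumPy/SciPy): for a die with support {1, ..., 6} and a single mean constraint 𝔼[𝜃] = 4.5, the MaxEnt distribution is 𝑝ᵢ ∝ exp(𝜆 𝑖), and the scalar multiplier 𝜆 can be found by root finding:

```python
# Solve for the Lagrange multiplier lambda so that p_i ∝ exp(lambda * i)
# on {1,...,6} satisfies the mean constraint E[theta] = 4.5.
import numpy as np
from scipy.optimize import brentq

vals = np.arange(1, 7)

def mean_given_lambda(lam):
    w = np.exp(lam * vals)
    p = w / w.sum()
    return p @ vals

lam = brentq(lambda l: mean_given_lambda(l) - 4.5, -10, 10)
p = np.exp(lam * vals); p /= p.sum()
print(lam, p, p @ vals)   # exponential-family form, mean = 4.5
```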
Generalized Linear Models

We now study regression given data 𝑋 and 𝑌 and a GLIM (generalized linear model).

We model the conditional expectation of 𝑌; we denote the modeled value of the conditional expectation as 𝜇 = 𝑓(𝜽ᵀ𝒙), with 𝜉 = 𝜽ᵀ𝒙.

GLIMs extend the ideas of linear regression beyond the Gaussian, Bernoulli and multinomial settings to the more general exponential family.

𝒙 enters linearly through 𝜉 = 𝜽ᵀ𝒙, and 𝑓 is called the response function. Ψ, defined by 𝜂 = Ψ(𝜇), is a one-to-one map from 𝜇 to 𝜂.

To specify a GLIM we need (a) a choice of exponential family distribution, and (b) a choice of the response function 𝑓(∙). Choosing the exponential family distribution is strongly constrained by the nature of the data.

Note that 𝑓(∙) needs to be both monotonic and differentiable. However, there is a particular response function (the canonical response function) that is uniquely associated with a given exponential family distribution.
Canonical Response Function

The canonical response function is

$$f(\cdot) = \Psi^{-1}(\cdot), \qquad\text{so that}\qquad \mu = f(\xi) = \Psi^{-1}(\xi) \;\Longleftrightarrow\; \eta = \Psi(\mu) = \xi.$$

If we decide to use the canonical response function, the choice of the exponential family density completely determines the GLIM.
MLE & Canonical Response Function

Consider a regression problem with data 𝒟 = {(𝒙ₙ, yₙ), n = 1, ..., N}. The log-likelihood for a GLIM is:

$$\ell(\boldsymbol{\theta}\mid\mathcal{D}) = \sum_{n=1}^N \log h(y_n) + \sum_{n=1}^N\left(\eta_n y_n - A(\eta_n)\right), \quad\text{where } \eta_n = \Psi(\mu_n),\ \mu_n = f(\boldsymbol{\theta}^T\mathbf{x}_n).$$

For a canonical response, 𝜂ₙ = 𝜉ₙ = 𝜽ᵀ𝒙ₙ, and this is simplified to:

$$\ell(\boldsymbol{\theta}\mid\mathcal{D}) = \sum_{n=1}^N \log h(y_n) + \sum_{n=1}^N\left(\boldsymbol{\theta}^T\mathbf{x}_n\, y_n - A(\eta_n)\right),$$

with sufficient statistic $\sum_{n=1}^N y_n\mathbf{x}_n$. Regardless of 𝑁, the size of the sufficient statistic is fixed at the dimension of 𝒙ₙ, an important reason for using the canonical response.

The gradient is

$$\nabla_{\boldsymbol{\theta}}\,\ell = \sum_{n=1}^N\left(y_n - A'(\eta_n)\right)\mathbf{x}_n = \sum_{n=1}^N\left(y_n - \mu_n\right)\mathbf{x}_n, \quad\text{or}\quad \nabla_{\boldsymbol{\theta}}\,\ell = \mathbf{X}^T(\mathbf{y}-\boldsymbol{\mu}).$$

This is a general expression for GLIMs with exponential family distributions and the canonical response function.
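A small sketch of this generic gradient (an added illustration, not from the slides, assuming NumPy), for the Bernoulli GLIM (logistic regression), where the canonical response is the sigmoid:

```python
# The log-likelihood gradient of any canonical-response GLIM is X^T (y - mu);
# here mu = sigmoid(X theta), and plain gradient ascent finds the MLE.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

def grad(theta):
    mu = 1 / (1 + np.exp(-X @ theta))   # canonical response
    return X.T @ (y - mu)               # generic GLIM gradient

theta = np.zeros(3)
for _ in range(1000):                   # gradient ascent on the log-likelihood
    theta += 0.5 * grad(theta) / len(y)
print(theta)                            # roughly recovers theta_true
```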
Iterative Reweighted Least Squares (IRLS)

To estimate the parameters with the canonical response function choice, one can use the iteratively reweighted least squares (IRLS) algorithm.

The Hessian can be computed from the gradient

$$\nabla_{\boldsymbol{\theta}}\,\ell = \sum_{n=1}^N\left(y_n - \mu_n\right)\mathbf{x}_n = \mathbf{X}^T(\mathbf{y}-\boldsymbol{\mu})$$

as:

$$\mathbf{H} = \nabla\nabla_{\boldsymbol{\theta}}\,\ell = -\sum_{n=1}^N \frac{d\mu_n}{d\eta_n}\,\mathbf{x}_n\mathbf{x}_n^T = -\mathbf{X}^T\mathbf{W}\mathbf{X}, \quad\text{where } \mathbf{W} = \mathrm{diag}\left(\frac{d\mu_1}{d\eta_1},\dots,\frac{d\mu_N}{d\eta_N}\right).$$

The batch Newton algorithm now takes the familiar IRLS form:

$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \left(\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X}\right)^{-1}\mathbf{X}^T\left(\mathbf{y}-\boldsymbol{\mu}^{(t)}\right) = \left(\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{W}^{(t)}\mathbf{z}^{(t)},$$

$$\text{where } \mathbf{z}^{(t)} = \mathbf{X}\boldsymbol{\theta}^{(t)} + \left(\mathbf{W}^{(t)}\right)^{-1}\left(\mathbf{y}-\boldsymbol{\mu}^{(t)}\right).$$

For non-canonical response functions, the Hessian has an extra term that contains the factor (𝒚 − 𝝁). When we take expectations, this term vanishes. So, using the expected Hessian in the Newton method, the algorithm looks essentially the same (the Fisher scoring algorithm).
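A compact sketch of IRLS (an added illustration, not from the slides, assuming NumPy) for logistic regression, where dμ/dη = μ(1 − μ):

```python
# IRLS for the Bernoulli GLIM with canonical response (logistic regression).
import numpy as np

def irls_logistic(X, y, iters=10):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ theta))
        W = np.clip(mu * (1 - mu), 1e-9, None)   # diagonal of W: d(mu)/d(eta)
        z = X @ theta + (y - mu) / W             # working response z
        theta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return theta

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(500) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)
print(irls_logistic(X, y))    # close to theta_true, up to sampling noise
```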
Sequential Estimation - LMS

An online estimation algorithm can be obtained by following the stochastic gradient of the log-likelihood function:

$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \rho\left(y_n - \mu_n^{(t)}\right)\mathbf{x}_n, \qquad \mu_n^{(t)} = f\left(\left(\boldsymbol{\theta}^{(t)}\right)^T\mathbf{x}_n\right),$$

with step size 𝜌.

If we do not use the canonical response function, then the gradient also includes the derivatives of 𝑓(∙) and Ψ(∙). These can be viewed as scaling coefficients that alter the step size, but otherwise leave the general LMS form intact.

The LMS algorithm is the generic stochastic gradient algorithm for models throughout the GLIM family.
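A minimal sketch of the LMS update (an added illustration, not from the slides, assuming NumPy), again for the Bernoulli GLIM, processing one observation at a time:

```python
# LMS / stochastic-gradient update theta <- theta + rho (y_n - mu_n) x_n.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(5000, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(5000) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

theta, rho = np.zeros(3), 0.05       # rho is the step size
for xn, yn in zip(X, y):
    mu = 1 / (1 + np.exp(-theta @ xn))       # canonical response
    theta += rho * (yn - mu) * xn            # LMS update
print(theta)                                 # drifts toward theta_true
```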