Exponential family & Generalized Linear Models (GLIMs)

30
Exponential family & Generalized Linear Models (GLIMs) Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani

Transcript of Exponential family & Generalized Linear Models (GLIMs)

Page 1: Exponential family & Generalized Linear Models (GLIMs)

Exponential family &

Generalized Linear Models (GLIMs)

Probabilistic Graphical Models

Sharif University of Technology

Spring 2017

Soleymani

Page 2: Exponential family & Generalized Linear Models (GLIMs)

Outline

2

Exponential family

Many standard distributions are in this family

Similarities among learning algorithms for different models in

this family:

ML estimation has a simple form for exponential families

moment matching of sufficient statistics

Bayesian learning is simplest for exponential families

They have a maximum entropy interpretation

GLIMs as to parameterize conditional distributions that

have an exponential distribution on a variable for each

value of parent

Page 3: Exponential family & Generalized Linear Models (GLIMs)

Exponential family: canonical

parameterization

3

๐‘ƒ ๐’™ ๐œผ =1

๐‘ ๐œผโ„Ž ๐’™ exp ๐œผ๐‘‡๐‘‡(๐’™)

๐‘ ๐œผ = โ„Ž ๐’™ exp ๐œผ๐‘‡๐‘‡(๐’™) ๐‘‘๐’™

๐‘ƒ ๐’™ ๐œผ = โ„Ž ๐’™ exp ๐œผ๐‘‡๐‘‡ ๐’™ โˆ’ ln ๐‘(๐œผ)

๐‘‡:๐’ณ โ†’ โ„๐พ: sufficient statistics function

๐œผ: natural or canonical parameters

โ„Ž:๐’ณ โ†’ โ„+: reference measure independent of parameters

๐‘: Normalization factor or partition function (0 < ๐‘ ๐œผ < โˆž)

๐ด(๐œผ): log partition function

Page 4: Exponential family & Generalized Linear Models (GLIMs)

Example: Bernouli

4

๐‘ƒ ๐‘ฅ ๐œƒ = ๐œƒ๐‘ฅ 1 โˆ’ ๐œƒ 1โˆ’๐‘ฅ

= exp ln๐œƒ

1 โˆ’ ๐œƒ๐‘ฅ + ln 1 โˆ’ ๐œƒ

โ€ข ๐œ‚ = ln๐œƒ

1โˆ’๐œƒ

โ€ข ๐œ‚ = ln๐œƒ

1โˆ’๐œƒโ‡’ ๐œƒ =

๐‘’๐œ‚

๐‘’๐œ‚+1=

1

1+๐‘’โˆ’๐œ‚

โ€ข ๐‘‡ ๐‘ฅ = ๐‘ฅ

โ€ข ๐ด ๐œ‚ = โˆ’ ln 1 โˆ’ ๐œƒ = ln 1 + ๐‘’๐œ‚

โ€ข โ„Ž ๐‘ฅ = 1

Page 5: Exponential family & Generalized Linear Models (GLIMs)

Example: Gaussian

5

๐‘ƒ ๐‘ฅ ๐œ‡, ๐œŽ2 =1

2๐œ‹๐œŽexp โˆ’

๐‘ฅ โˆ’ ๐œ‡ 2

2๐œŽ2

โ€ข ๐œผ =๐œ‚1๐œ‚2

=

๐œ‡

๐œŽ2

โˆ’1

2๐œŽ2

โ€ข โ‡’ ๐œ‡ = โˆ’๐œ‚1

2๐œ‚2, ๐œŽ2 = โˆ’

1

2๐œ‚2

โ€ข ๐‘‡ ๐‘ฅ =๐‘ฅ๐‘ฅ2

โ€ข ๐ด ๐œผ = โˆ’ ln 2๐œ‹๐œŽ exp๐œ‡2

2๐œŽ2= โˆ’

1

2ln 2๐œ‹ โˆ’

1

2ln โˆ’2๐œ‚2 โˆ’

๐œ‚12

4๐œ‚2

โ€ข โ„Ž ๐‘ฅ = 1

Page 6: Exponential family & Generalized Linear Models (GLIMs)

Example: Multinomial

6

๐‘ƒ ๐’™ ๐œฝ = ๐‘˜=1

๐พ

๐œƒ๐‘˜๐‘ฅ๐‘˜

๐‘ƒ ๐’™ ๐œฝ = exp

๐‘˜=1

๐พ

๐‘ฅ๐‘˜ ln ๐œƒ๐‘˜

= exp

๐‘˜=1

๐พโˆ’1

๐‘ฅ๐‘˜ ln ๐œƒ๐‘˜ + 1 โˆ’

๐‘˜=1

๐พโˆ’1

๐‘ฅ๐‘˜ ln 1 โˆ’

๐‘˜=1

๐พโˆ’1

๐œƒ๐‘˜

โ€ข ๐œผ = ๐œ‚1, โ€ฆ , ๐œ‚๐พโˆ’1๐‘‡ = ln

๐œƒ1

1โˆ’ ๐‘˜=1๐พโˆ’1 ๐œƒ๐‘˜

, โ€ฆ , ln๐œƒ๐พโˆ’1

1โˆ’ ๐‘˜=1๐พโˆ’1 ๐œƒ๐‘˜

๐‘‡

โ€ข ๐œผ = ln๐œƒ1

๐œƒ๐พ, โ€ฆ , ln

๐œƒ๐พโˆ’1

๐œƒ๐พ

๐‘‡

โ‡’ ๐œƒ๐‘˜ =๐‘’๐œ‚๐‘˜

๐‘—=1๐พ ๐‘’

๐œ‚๐‘—

โ€ข ๐‘‡ ๐’™ = ๐‘ฅ1, โ€ฆ , ๐‘ฅ๐พโˆ’1๐‘‡

โ€ข ๐ด ๐œผ = โˆ’ ln ๐œƒ๐พ = โˆ’ ln 1 โˆ’ ๐‘˜=1๐พโˆ’1๐œƒ๐‘˜ = ln ๐‘˜=1

๐พ ๐‘’๐œ‚๐‘—

๐‘˜=1

๐พ

๐œƒ๐‘˜ = 1

Page 7: Exponential family & Generalized Linear Models (GLIMs)

Well-behaved parameter space

7

Multiple exponential families may encode the same set of

distributions

We want the parameter space ๐œผ 0 < ๐‘ ๐œผ < โˆž to be:

Convex set

Non-redundant:๐œผ โ‰  ๐œผโ€ฒ โ‡’ ๐‘ƒ ๐’™ ๐œผ โ‰  ๐‘ƒ ๐’™ ๐œผโ€ฒ

The function from ๐œฝ to ๐œผ is invertible

Example: invertible function from ๐œƒ to ๐œ‚ in the Bernoulli example ๐œƒ

=1

1+๐‘’โˆ’๐œ‚

Page 8: Exponential family & Generalized Linear Models (GLIMs)

Examples of non-exponential distributions

8

Uniform

Laplace

Student t-distribution

Page 9: Exponential family & Generalized Linear Models (GLIMs)

Moments

9

๐ด ๐œผ = ln๐‘ ๐œผ

๐‘ ๐œผ = โ„Ž ๐’™ exp ๐œผ๐‘‡๐‘‡(๐’™) ๐‘‘๐’™

๐›ป๐œผ๐ด ๐œผ =๐›ป๐œผ๐‘ ๐œผ

๐‘ ๐œผ= โ„Ž ๐’™ ๐‘‡(๐’™) exp ๐œผ๐‘‡๐‘‡(๐’™) ๐‘‘๐’™

๐‘ ๐œผ

= ๐‘‡(๐’™)โ„Ž ๐’™ exp ๐œผ๐‘‡๐‘‡(๐’™)

๐‘ ๐œผ๐‘‘๐’™ = ๐ธ๐‘ƒ(๐’™|๐œผ) ๐‘‡(๐’™)

โ‡’ ๐›ป๐œผ๐ด ๐œผ = ๐ธ๐œผ ๐‘‡(๐’™)

๐›ป๐œผ2๐ด ๐œผ = ๐ธ๐œผ ๐‘‡ ๐’™ ๐‘‡ ๐’™ ๐‘‡ โˆ’ ๐ธ๐œผ ๐‘‡ ๐’™ ๐ธ๐œผ ๐‘‡ ๐’™ ๐‘‡ = ๐ถ๐‘œ๐‘ฃ๐œผ ๐‘‡ ๐’™

The first derivative of ๐ด ๐œผ is

the mean of sufficient statistics

The i-th derivative gives the i-th centered moment

of sufficient statistics.

Page 10: Exponential family & Generalized Linear Models (GLIMs)

Properties

10

The moment parameters ๐œฝ can be derived as a function of the

natural or canonical parameters:

๐›ป๐œผ๐ด ๐œผ = ๐ธ๐œผ ๐‘‡(๐’™)

๐œฝ โ‰ก ๐ธ๐œผ ๐‘‡(๐’™) โ‡’ ๐›ป๐œผ๐ด ๐œผ = ๐œฝ

๐ด(๐œผ) is convex since ๐›ป๐œผ2๐ด ๐œผ = ๐ถ๐‘œ๐‘ฃ๐œผ ๐‘‡ ๐’™ โ‰ฝ 0

Covariance matrix is always positive semi-definite โ‡’ Hessian ๐›ป๐œผ2๐ด ๐œผ is

positive semi-definite, and hence that ๐ด ๐œผ = ln ๐‘ ๐œผ is a convex

function of ๐œผ.

For many distributions,

we have ๐œฝ โ‰ก ๐ธ๐œผ ๐‘‡(๐‘ฅ)

Page 11: Exponential family & Generalized Linear Models (GLIMs)

Exponential family: moment parameterization

11

A distribution in the exponential family can also be

parameterized by the moment parameterization:

๐‘ƒ ๐’™ ๐œฝ =1

๐‘ ๐œฝโ„Ž ๐’™ exp ๐œ“ ๐œฝ ๐‘‡๐‘‡(๐’™)

๐‘ ๐œฝ = โ„Ž ๐’™ exp ๐œ“ ๐œฝ ๐‘‡๐‘‡(๐’™) ๐‘‘๐’™

If ๐›ป๐œผ2๐ด ๐œผ โ‰ป 0โ‡’๐›ป๐œผ๐ด ๐œผ is ascending โŸน๐œ“โˆ’1 ๐œผ = ๐œฝ = ๐›ป๐œผ๐ด ๐œผ is

ascending and thus is 1-to-1

The mapping from the moments to the canonical parameters is

invertible (1-to-1 relationship): ๐œผ = ๐œ“(๐œฝ)

๐œผ = ๐œ“(๐œฝ)

๐œ“ maps the parameters ๐œฝ to

the space of sufficient statistics๐œฝ โ‰ก ๐ธ๐œผ ๐‘‡(๐’™) = ๐›ป๐œผ๐ด ๐œผ

๐œฝ = ๐œ“โˆ’1 ๐œผ

Page 12: Exponential family & Generalized Linear Models (GLIMs)

Sufficiency

12

A statistic is a function of a random variable

Suppose that the distribution of ๐‘‹ depends on a parameter ๐œƒ

โ€œ๐‘‡(๐‘‹) is a sufficient statistic for ๐œƒ if there is no information in ๐‘‹regarding ๐œƒ beyond that in ๐‘‡(๐‘‹)โ€

Sufficiency in both frequentist and Bayesian frameworks implies a

factorization of ๐‘ƒ ๐‘ฅ ๐œƒ (Neyman factorization theorem):

๐‘ƒ ๐‘ฅ, ๐‘‡ ๐‘ฅ , ๐œƒ = ๐‘“ ๐‘‡ ๐‘ฅ , ๐œƒ ๐‘” ๐‘ฅ, ๐‘‡ ๐‘ฅ

๐‘ƒ ๐‘ฅ, ๐œƒ = ๐‘“ ๐‘‡ ๐‘ฅ , ๐œƒ ๐‘”(๐‘ฅ, ๐‘‡(๐‘ฅ))๐‘ƒ ๐‘ฅ|๐œƒ = ๐‘“โ€ฒ ๐‘‡ ๐‘ฅ , ๐œƒ ๐‘”(๐‘ฅ, ๐‘‡(๐‘ฅ))

Page 13: Exponential family & Generalized Linear Models (GLIMs)

Sufficient statistic

13

Sufficient statistic and the exponential family:

๐‘ƒ ๐’™ ๐œผ = โ„Ž ๐’™ exp ๐œผ๐‘‡๐‘‡ ๐’™ โˆ’ ๐ด(๐œผ)

Sufficient statistic in the case of i.i.d sampling can be obtained

easily for a set of N observations from a distribution

๐‘ƒ ๐’Ÿ ๐œผ = ๐‘›=1

๐‘

โ„Ž ๐’™(๐‘›) exp ๐œผ๐‘‡๐‘‡ ๐’™ ๐‘› โˆ’ ๐ด(๐œผ)

= ๐‘›=1

๐‘

โ„Ž ๐’™(๐‘›) exp{๐œผ๐‘‡ ๐‘›=1

๐‘

๐‘‡ ๐’™ ๐‘› โˆ’๐‘๐ด ๐œผ }

๐’Ÿ has itself an exponential distribution with sufficient statistic

๐‘›=1๐‘ ๐‘‡ ๐’™ ๐‘›

Page 14: Exponential family & Generalized Linear Models (GLIMs)

MLE for exponential family

15

โ„“ ๐œผ;๐’Ÿ = ln๐‘ƒ ๐’Ÿ ๐œผ = ln ๐‘›=1

๐‘

โ„Ž ๐’™(๐‘›) exp ๐œผ๐‘‡๐‘‡ ๐’™ ๐‘› โˆ’ ๐ด(๐œผ)

= ln

๐‘›=1

๐‘

โ„Ž(๐’™(๐‘›)) + ๐œผ๐‘‡ ๐‘›=1

๐‘

๐‘‡ ๐’™ ๐‘› โˆ’ ๐‘๐ด ๐œผ

๐›ป๐œผโ„“ ๐œผ;๐’Ÿ = 0 โ‡’ ๐‘›=1

๐‘

๐‘‡ ๐’™ ๐‘› โˆ’ ๐‘๐›ป๐œผ๐ด ๐œผ = 0

โ‡’ ๐›ป๐œผ๐ด ๐œผ = ๐‘›=1๐‘ ๐‘‡ ๐’™ ๐‘›

๐‘

โ‡’ ๐›ป๐œผ๐ด ๐œผ = ๐ธ ๐œผ ๐‘‡(๐’™) = ๐‘›=1๐‘ ๐‘‡ ๐’™ ๐‘›

๐‘

moment matching

Concave

function

Page 15: Exponential family & Generalized Linear Models (GLIMs)

Maximum entropy models

16

Among all distributions with certain moments of interest, the

exponential family is the most random (makes fewest

assumptions or structure)

Out of all distributions which reproduce the observed sufficient

statistics, the exponential family distribution (roughly) makes the fewest

additional assumptions.

The unique distribution maximizing the entropy, subject to the

constraint that these moments are exactly matched, is then an

exponential family distribution

Page 16: Exponential family & Generalized Linear Models (GLIMs)

Maximum entropy

17

Constraints:

๐ธ ๐‘“๐‘˜ =

๐’™

๐‘“๐‘˜ ๐’™ ๐‘ƒ ๐’™ = ๐น๐‘˜

Maximum entropy (maxent): pick the distribution

with maximum entropy subject to the constraints๐ฟ ๐‘ƒ, ๐€

= โˆ’

๐’™

๐‘ƒ ๐’™ log๐‘ƒ ๐’™ + ๐œ†0 1 โˆ’

๐’™

๐‘ƒ ๐’™ +

๐‘˜

๐œ†๐‘˜ ๐น๐‘˜ โˆ’

๐’™

๐‘“๐‘˜ ๐’™ ๐‘ƒ ๐’™

๐›ป๐ฟ = 0 โ‡’ ๐‘ƒ ๐’™ =1

๐‘exp โˆ’

๐‘˜

๐œ†๐‘˜๐‘“๐‘˜ ๐’™

๐‘“๐‘˜ ๐’™ : an arbitrary function

๐น๐‘˜: constant

๐‘ =

๐’™

exp โˆ’

๐‘˜

๐œ†๐‘˜๐‘“๐‘˜ ๐’™

Page 17: Exponential family & Generalized Linear Models (GLIMs)

Maximum entropy: constraints

18

Constants in the constraints:

๐น๐‘˜ measure the empirical counts on the training data

๐น๐‘˜ = ๐‘›=1๐‘ ๐‘“๐‘˜ ๐’™(๐‘›)

๐‘

These constraints also ensure consistency automatically.

Page 18: Exponential family & Generalized Linear Models (GLIMs)

Exponential family: summary

19

Many famous distribution are in the exponential family

Important properties for learning with exponential families: Gradients of log partition function gives expected sufficient statistics, ormoments, for some models

Moments of any distribution in exponential family can be easily computed bytaking the derivatives of the log normalizer

The Hessian of the log partition function is positive semi-definite and sothe log partition function is convex

Among all distributions with certain moments of interest, theexponential family has the highest entropy

Are important for modeling distributions of Markovnetworks

Page 19: Exponential family & Generalized Linear Models (GLIMs)

Generalized linear models (GLIMs)

20

Conditional relationship between ๐‘Œ and ๐‘ฟ

Examples:

Linear regression:๐‘ƒ ๐‘ฆ ๐’™,๐’˜, ๐œŽ2 = ๐’ฉ(๐‘ฆ|๐’˜๐‘‡๐’™, ๐œŽ2)

Discriminative linear classifier (two class)

Logistic regression:๐‘ƒ ๐‘ฆ ๐’™,๐’˜ = ๐ต๐‘’๐‘Ÿ(๐‘ฆ|๐œŽ ๐’˜๐‘‡๐’™ )

Probit regression: ๐‘ƒ ๐‘ฆ ๐’™,๐’˜ = ๐ต๐‘’๐‘Ÿ(๐‘ฆ|ฮฆ ๐’˜๐‘‡๐’™ ) where ฮฆ is the cdf

of ๐’ฉ(0,1)

Page 20: Exponential family & Generalized Linear Models (GLIMs)

Generalized linear models (GLIMs)

21

๐‘ƒ(๐‘ฆ|๐’™) is a generalized linear model if:

๐’™ enters into the model via a linear combination ๐’˜๐‘‡๐’™

The conditional mean of ๐‘ƒ(๐‘ฆ|๐’™) is expressed as ๐‘“ ๐’˜๐‘‡๐’™ :

๐‘“ is called the response function

๐œ‡ = ๐ธ ๐‘ฆ|๐’™ = ๐‘“ ๐’˜๐‘‡๐’™

The distribution of ๐‘ฆ is characterized by an exponential family distribution(with conditional mean ๐‘“ ๐’˜๐‘‡๐’™ )

We have two choices in the specification of a GLIM:

The choice of the exponential family distribution

Usually constrained by the nature of ๐‘Œ

The choice of the response function ๐‘“ the principal degree of freedom in the specification of a GLIM

However, we need to impose constraints on this function (e.g., ๐‘“ must be in [0,1] forBernoulli distribution on ๐‘ฆ)

Page 21: Exponential family & Generalized Linear Models (GLIMs)

The relation between vars. in a GLIMs

22

Page 22: Exponential family & Generalized Linear Models (GLIMs)

Canonical response function

23

Canonical response function: ๐‘“(. ) = ๐œ“โˆ’1(. ) or ๐œ‰ = ๐œ‚ In this case, the choice of the exponential family density completely

determines the GLIM

The constraints on the range of ๐‘“ are automatically satisfied.

๐œ‡ = ๐‘“ ๐œ‚ are guaranteed to be possible values of the conditional

expectation (i.e.,๐‘“ ๐œ‚ = ๐œ“โˆ’1 ๐œ‚ =๐‘‘๐ด ๐œ‚

๐‘‘๐œ‚= ๐ธ ๐‘Œ|๐œ‚ )

Page 23: Exponential family & Generalized Linear Models (GLIMs)

Log likelihood for GLIMs

24

โ„“ ๐œผ;๐’Ÿ = ln๐‘ƒ ๐’Ÿ ๐œผ

= ln ๐‘›=1

๐‘

โ„Ž ๐‘ฆ(๐‘›) exp ๐œ‚(๐‘›)๐‘ฆ(๐‘›) โˆ’ ๐ด ๐œ‚(๐‘›)

=

๐‘›=1

๐‘

ln โ„Ž ๐‘ฆ(๐‘›) +

๐‘›=1

๐‘

๐œ‚(๐‘›)

๐‘ฆ(๐‘›) โˆ’ ๐ด ๐œ‚(๐‘›)

๐œ‚(๐‘›) = ๐œ“(๐œ‡ ๐‘› ) and ๐œ‡ ๐‘› = ๐‘“ ๐œฝ๐‘‡๐’™(๐‘›)

In the case of canonical response function ๐œ‚(๐‘›) = ๐œฝ๐‘‡๐’™(๐‘›)

โ„“ ๐œฝ;๐’Ÿ =

๐‘›=1

๐‘

ln โ„Ž ๐‘ฆ(๐‘›) + ๐œฝ๐‘‡

๐‘›=1

๐‘

๐’™(๐‘›)๐‘ฆ(๐‘›) โˆ’

๐‘›=1

๐‘

๐ด ๐œฝ๐‘‡๐’™(๐‘›)

Sufficient statistics for ๐œฝ

Page 24: Exponential family & Generalized Linear Models (GLIMs)

Gradient of log likelihood

25

๐›ป๐œฝ๐‘™ ๐œผ; ๐’Ÿ =

๐‘›=1

๐‘๐‘‘๐‘™

๐‘‘๐œ‚(๐‘›)๐›ป๐œฝ๐œ‚

(๐‘›) =

๐‘›=1

๐‘

๐‘ฆ(๐‘›) โˆ’๐‘‘๐ด ๐œ‚(๐‘›)

๐‘‘๐œ‚(๐‘›)๐›ป๐œฝ๐œ‚

(๐‘›)

=

๐‘›=1

๐‘

๐‘ฆ(๐‘›) โˆ’ ๐œ‡(๐‘›)๐‘‘๐œ‚(๐‘›)

๐‘‘๐œ‡(๐‘›)๐‘‘๐œ‡(๐‘›)

๐‘‘๐œ‰(๐‘›)๐’™(๐‘›)

In the case of canonical response function ๐œ‚(๐‘›) = ๐œ‰(๐‘›):

๐›ป๐œฝ๐‘™ ๐œฝ; ๐’Ÿ =

๐‘›=1

๐‘

๐‘ฆ(๐‘›) โˆ’ ๐œ‡(๐‘›) ๐’™(๐‘›)

๐œ‡(๐‘›) = ๐‘“ ๐œฝ๐‘‡๐’™(๐‘›)

Page 25: Exponential family & Generalized Linear Models (GLIMs)

Online learning for GLIMs

26

An LMS like algorithm as a generic stochastic gradient

descent for GLIMs:

๐œฝ๐‘ก+1 = ๐œฝ๐‘ก + ๐œŒ ๐‘ฆ(๐‘›) โˆ’ ๐œ‡ ๐‘› ๐‘ก๐’™(๐‘›)

๐œ‡ ๐‘› ๐‘ก= ๐‘“ ๐œฝ๐‘ก

๐‘‡๐’™(๐‘›)

If we do not use the canonical response function only

scaling coefficients due to the derivatives of ๐‘“(. ) and

๐œ“(. ) will also incorporated into the step size

Similar to Least Mean Squares

(LMS) algorithm

Page 26: Exponential family & Generalized Linear Models (GLIMs)

Batch learning for GLIMs :

Newton-Rafson

27

For the canonical response functions:

๐›ป๐œฝ๐‘™ ๐œฝ; ๐’Ÿ =

๐‘›=1

๐‘

๐‘ฆ(๐‘›) โˆ’ ๐œ‡(๐‘›) ๐’™(๐‘›) = ๐‘ฟ๐‘‡ ๐’š โˆ’ ๐

๐ป =๐‘‘2๐‘™

๐‘‘๐œฝ๐‘‘๐œฝ๐‘‡=

๐‘‘๐‘™

๐‘‘๐œฝ๐‘‡

๐‘›=1

๐‘

๐‘ฆ(๐‘›) โˆ’ ๐œ‡(๐‘›) ๐’™(๐‘›) = โˆ’

๐‘›=1

๐‘

๐’™ ๐‘›๐‘‘๐œ‡ ๐‘›

๐‘‘๐œฝ๐‘‡

= โˆ’

๐‘›=1

๐‘

๐’™(๐‘›)๐‘‘๐œ‡(๐‘›)

๐‘‘๐œ‚(๐‘›)๐‘‘๐œ‚(๐‘›)

๐‘‘๐œฝ๐‘‡

Since ๐œ‚(๐‘›) = ๐œฝ๐‘‡๐’™(๐‘›)

๐ป = โˆ’

๐‘›=1

๐‘

๐’™(๐‘›)๐‘‘๐œ‡(๐‘›)

๐‘‘๐œ‚(๐‘›)๐’™ ๐‘› ๐‘‡

= โˆ’๐‘ฟ๐‘‡๐‘พ๐‘ฟ

๐‘พ = ๐‘‘๐‘–๐‘Ž๐‘”๐‘‘๐œ‡(1)

๐‘‘๐œ‚(1), โ€ฆ ,

๐‘‘๐œ‡(๐‘)

๐‘‘๐œ‚(๐‘)

๐‘ฟ =๐‘ฅ1(1)

โ‹ฏ ๐‘ฅ๐‘‘(1)

โ‹ฎ โ‹ฑ โ‹ฎ

๐‘ฅ1(๐‘)

โ‹ฏ ๐‘ฅ๐‘‘(๐‘)

๐’š =๐‘ฆ(1)

โ‹ฎ๐‘ฆ(๐‘)

๐‘‘๐œ‡(๐‘›)

๐‘‘๐œ‚(๐‘›)=

๐‘‘2๐ด

๐‘‘๐œ‚(๐‘›)

Page 27: Exponential family & Generalized Linear Models (GLIMs)

Batch learning for GLIMs: Newton-Rafson

28

๐œฝ๐‘ก+1 = ๐œฝ๐‘ก + ๐‘ฟ๐‘‡๐‘พ๐‘ก๐‘ฟ โˆ’1๐‘ฟ๐‘‡ ๐’š โˆ’ ๐๐‘ก

= ๐‘ฟ๐‘‡๐‘พ๐‘ก๐‘ฟ โˆ’1 ๐‘ฟ๐‘‡๐‘พ๐‘ก๐‘ฟ๐œฝ๐‘ก + ๐‘ฟ๐‘‡ ๐’š โˆ’ ๐๐‘ก

โ‡’ ๐œฝ๐‘ก+1 = ๐‘ฟ๐‘‡๐‘พ๐‘ก๐‘ฟ โˆ’1๐‘ฟ๐‘‡๐‘พ๐‘ก๐’›๐‘ก

๐’›๐‘ก = ๐œ‚๐‘ก +๐‘พ๐‘กโˆ’1 ๐’š โˆ’ ๐๐‘ก

Iterative Reweighted Least Squares (IRLS)

Page 28: Exponential family & Generalized Linear Models (GLIMs)

Linear regression

29

Cost function (according to MLE where ๐‘ƒ ๐‘ฆ ๐’™ = ๐’ฉ(๐‘ฆ|๐œฝ๐‘‡๐’™, ๐œŽ2)):

๐ฝ ๐œฝ =1

2

๐‘›=1

๐‘

๐œฝ๐‘‡๐’™ ๐‘› โˆ’ ๐‘ฆ ๐‘› 2

๐›ป๐œฝ๐ฝ ๐œฝ = ๐ŸŽ โ‡’ ๐œฝ = ๐‘ฟ๐‘‡๐‘ฟ โˆ’1๐‘ฟ๐‘‡๐’š

Online learning (LMS):

๐œฝ๐‘ก+1 = ๐œฝ๐‘ก + ๐œŒ ๐‘ฆ(๐‘›) โˆ’ ๐œฝ๐‘ก๐‘‡๐’™(๐‘›) ๐’™(๐‘›)

IRLS:

๐œฝ๐‘ก+1 = ๐‘ฟ๐‘‡๐‘พ๐‘ก๐‘ฟ โˆ’1๐‘ฟ๐‘‡๐‘พ๐‘ก๐’›๐‘ก

= ๐‘ฟ๐‘‡๐‘ฟ โˆ’1๐‘ฟ๐‘‡ ๐‘ฟ๐œฝ๐‘ก + ๐’š โˆ’ ๐๐‘ก

= ๐‘ฟ๐‘‡๐‘ฟ โˆ’1๐‘ฟ๐‘‡๐’š

Canonical response function

๐œ‡ ๐’™ = ๐œฝ๐‘‡๐’™ = ๐œ‚(๐’™)๐‘‘๐œ‡

๐‘‘๐œ‚= 1 โ‡’ ๐‘พ = ๐‘ฐ

Page 29: Exponential family & Generalized Linear Models (GLIMs)

Logistic regression

30

๐œ‡ ๐’™ =1

1 + ๐‘’โˆ’๐œ‚(๐’™)

Canonical response function ๐œ‚ = ๐œ‰ = ๐œฝ๐‘‡๐’™

IRLS:๐‘‘๐œ‡

๐‘‘๐œ‚= ๐œ‡ 1 โˆ’ ๐œ‡

๐‘Š =๐œ‡(1) 1 โˆ’ ๐œ‡(1) โ‹ฏ 0

โ‹ฎ โ‹ฑ โ‹ฎ0 โ‹ฏ ๐œ‡(๐‘) 1 โˆ’ ๐œ‡(๐‘ )

๐œฝ๐‘ก+1 = ๐‘ฟ๐‘‡๐‘พ๐‘ก๐‘ฟ โˆ’1๐‘ฟ๐‘‡๐‘พ๐‘ก๐’›๐‘ก

๐’›๐‘ก = ๐‘ฟ๐œฝ๐‘ก +๐‘พ๐‘กโˆ’1 ๐’š โˆ’ ๐๐‘ก

Page 30: Exponential family & Generalized Linear Models (GLIMs)

References

31

Jordan, Chapter 8.

Koller & Friedman, Chapter 8.1-8.3.