MLDM Chapter 3: Parameter Estimation


Transcript of MLDM Chapter 3: Parameter Estimation

MIMA Group

Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University


Contents

Introduction
Maximum-Likelihood Estimation
Bayesian Estimation


Bayes' Theorem

To compute the posterior probability $P(\omega_i|\mathbf{x})$, we need to know the class-conditional density $p(\mathbf{x}|\omega_i)$ and the prior $P(\omega_i)$:

$$P(\omega_i|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_i)\,P(\omega_i)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j)\,P(\omega_j)$$

How can we get these values?
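As a minimal sketch (not from the slides), here is how Bayes' rule combines given class-conditional densities and priors into posteriors; the two-class Gaussian setup and all numbers are illustrative assumptions only.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate normal density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Hypothetical two-class setup: known class-conditional densities p(x|w_i) and priors P(w_i).
priors = np.array([0.6, 0.4])
params = [(0.0, 1.0), (2.0, 1.0)]          # (mu_i, sigma_i) for each class

def posterior(x):
    """Bayes rule: P(w_i|x) = p(x|w_i) P(w_i) / p(x)."""
    likelihoods = np.array([gauss_pdf(x, mu, s) for mu, s in params])
    joint = likelihoods * priors
    return joint / joint.sum()             # divide by the evidence p(x) = sum_j p(x|w_j) P(w_j)

print(posterior(0.5))                      # posterior probabilities of the two classes at x = 0.5
```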


Samples

$$\mathcal{D} = \{D_1, D_2, \dots, D_c\}$$

The samples in $D_j$ are drawn independently according to the probability law $p(\mathbf{x}|\omega_j)$; that is, the examples in $D_j$ are i.i.d. (independent and identically distributed) random variables.

It is easy to compute the prior probability:

$$P(\omega_i) = \frac{|D_i|}{\sum_{j=1}^{c} |D_j|}$$
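A short sketch (not from the slides) of estimating the priors as relative class frequencies, assuming the training labels are available as an integer array; the labels below are purely illustrative.

```python
import numpy as np

labels = np.array([0, 0, 1, 2, 1, 0, 2, 2, 2])   # hypothetical class labels for D
counts = np.bincount(labels)                     # |D_1|, |D_2|, |D_3|
priors = counts / counts.sum()                   # P(w_i) = |D_i| / sum_j |D_j|
print(priors)                                    # [0.333..., 0.222..., 0.444...]
```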


Samples: the class-conditional pdf

Case I: $p(\mathbf{x}|\omega_j)$ has a known parametric form, e.g. $p(\mathbf{x}|\omega_j) \sim N(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$; in this case $\boldsymbol{\theta}_j$ contains $d + d(d+1)/2$ free parameters ($d$ from the mean, $d(d+1)/2$ from the symmetric covariance matrix).

Case II: $p(\mathbf{x}|\omega_j)$ does not have a parametric form (next chapter).

In general, for $\mathbf{x} \in \mathbb{R}^d$ the unknown parameters are collected into a vector $\boldsymbol{\theta}_j = (\theta_1, \theta_2, \dots, \theta_m)^T$.


Goal

$\mathcal{D} = \{D_1, D_2, \dots, D_c\}$; each class-conditional density is written in terms of its parameter vector, $p(\mathbf{x}|\omega_j) \equiv p(\mathbf{x}|\boldsymbol{\theta}_j)$ with $\boldsymbol{\theta}_j = (\theta_1, \theta_2, \dots, \theta_m)^T$.

Goal: use $D_j$ to estimate the unknown parameter vector $\boldsymbol{\theta}_j$, yielding the estimates $\hat{\boldsymbol{\theta}}_1, \hat{\boldsymbol{\theta}}_2, \dots, \hat{\boldsymbol{\theta}}_c$.


Estimation Under Parametric Form

Maximum-Likelihood Estimation: view the parameters as quantities whose values are fixed but unknown; estimate them by maximizing the likelihood (probability) of observing the actual training examples.

Bayesian Estimation: view the parameters as random variables having some known prior distribution; observing the actual training examples transforms the parameters' prior distribution into a posterior distribution (via Bayes' rule).


Maximum-Likelihood Estimation

Because each class is considered individually, the class subscript used before will be dropped.

The problem now becomes: given a sample set $\mathcal{D}$ whose elements are drawn independently from a population with a known parametric form, say $p(\mathbf{x}|\boldsymbol{\theta})$, choose the estimate $\hat{\boldsymbol{\theta}}$ that makes $\mathcal{D}$ most likely to occur.


Maximum-Likelihood Estimation (Cont.)

Criterion of ML: let $\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}$. By the independence assumption,

$$p(\mathcal{D}|\boldsymbol{\theta}) = p(\mathbf{x}_1|\boldsymbol{\theta})\,p(\mathbf{x}_2|\boldsymbol{\theta})\cdots p(\mathbf{x}_n|\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\theta})$$

The likelihood function:

$$L(\boldsymbol{\theta}|\mathcal{D}) \equiv p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\theta})$$

The maximum-likelihood estimate:

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} L(\boldsymbol{\theta}|\mathcal{D})$$


Maximum-Likelihood Estimation (Cont.)

Often, we instead maximize the log-likelihood function

$$l(\boldsymbol{\theta}|\mathcal{D}) \equiv \ln L(\boldsymbol{\theta}|\mathcal{D}) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k|\boldsymbol{\theta})$$

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} L(\boldsymbol{\theta}|\mathcal{D}) = \arg\max_{\boldsymbol{\theta}} l(\boldsymbol{\theta}|\mathcal{D})$$

Why? Because $\ln$ is strictly increasing, maximizing $l$ yields the same $\hat{\boldsymbol{\theta}}$ as maximizing $L$, and the sum is easier to differentiate than the product.


Maximum-Likelihood Estimation (Cont.)

Find the extreme values using differential calculus.

Gradient operator: let $f(\boldsymbol{\theta})$ be a continuous function of $\boldsymbol{\theta} = (\theta_1, \theta_2, \dots, \theta_n)^T$, and define

$$\nabla_{\boldsymbol{\theta}} = \left(\frac{\partial}{\partial\theta_1}, \frac{\partial}{\partial\theta_2}, \dots, \frac{\partial}{\partial\theta_n}\right)^T$$

Find the extreme values by solving

$$\nabla_{\boldsymbol{\theta}} f = \mathbf{0}$$
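A minimal symbolic sketch (not from the slides), assuming SymPy is available, of finding an extremum by solving $\nabla_{\boldsymbol{\theta}} f = \mathbf{0}$ for a toy quadratic function:

```python
import sympy as sp

t1, t2 = sp.symbols('theta1 theta2')
f = (t1 - 3)**2 + 2*(t2 + 1)**2          # toy continuous function f(theta)

grad = [sp.diff(f, t1), sp.diff(f, t2)]  # gradient operator applied to f
print(sp.solve(grad, [t1, t2]))          # {theta1: 3, theta2: -1}
```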


The Gaussian Case I

Case I: $\boldsymbol{\mu}$ unknown, $\boldsymbol{\Sigma}$ known.

$$p(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]$$

$$L(\boldsymbol{\mu}|\mathcal{D}) = p(\mathcal{D}|\boldsymbol{\mu}) = \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\mu}) = \frac{1}{(2\pi)^{nd/2}|\boldsymbol{\Sigma}|^{n/2}}\exp\!\left[-\frac{1}{2}\sum_{k=1}^{n}(\mathbf{x}_k-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\mu})\right]$$

$$l(\boldsymbol{\mu}|\mathcal{D}) = \ln L(\boldsymbol{\mu}|\mathcal{D}) = -\frac{nd}{2}\ln(2\pi) - \frac{n}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{k=1}^{n}(\mathbf{x}_k-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\mu})$$


The Gaussian Case I

Setting the gradient of the log-likelihood with respect to $\boldsymbol{\mu}$ to zero,

$$\nabla_{\boldsymbol{\mu}}\, l(\boldsymbol{\mu}|\mathcal{D}) = \sum_{k=1}^{n}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\mu}) = \mathbf{0}$$

gives

$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$

Intuitive result: the maximum-likelihood estimate of the unknown $\boldsymbol{\mu}$ is just the arithmetic average of the training samples, i.e., the sample mean.
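A small numerical sketch (not from the slides) checking this result with NumPy: the closed-form sample mean makes the gradient of the log-likelihood vanish. The synthetic data and the assumed known covariance are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])                  # assumed known covariance
X = rng.multivariate_normal([1.0, -2.0], Sigma, size=500)   # training set D

mu_hat = X.mean(axis=0)                                     # ML estimate: the sample mean

# Gradient of the log-likelihood at mu_hat: Sigma^{-1} sum_k (x_k - mu); should be ~0.
grad = np.linalg.inv(Sigma) @ (X - mu_hat).sum(axis=0)
print(mu_hat, grad)                                         # grad is numerically zero
```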


The Gaussian Case II

Case II: both $\mu$ and $\sigma^2$ are unknown. Consider the univariate case.

$$p(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \qquad \boldsymbol{\theta} = (\theta_1, \theta_2)^T = (\mu, \sigma^2)^T$$

$$L(\boldsymbol{\theta}|\mathcal{D}) = p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{k=1}^{n} p(x_k|\boldsymbol{\theta}) = \frac{1}{(2\pi)^{n/2}\sigma^{n}}\exp\!\left[-\frac{1}{2\sigma^2}\sum_{k=1}^{n}(x_k-\mu)^2\right]$$

$$l(\boldsymbol{\theta}|\mathcal{D}) = \ln L(\boldsymbol{\theta}|\mathcal{D}) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{k=1}^{n}(x_k-\mu)^2$$


The Gaussian Case II

Setting $\nabla_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta}|\mathcal{D}) = \mathbf{0}$ gives the two conditions

$$\frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2}\sum_{k=1}^{n}(x_k-\mu) = 0, \qquad \frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{k=1}^{n}(x_k-\mu)^2 = 0$$

whose solutions are

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k \quad\text{(unbiased)}, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2 \quad\text{(biased)}$$

In the multivariate case, $\hat{\boldsymbol{\mu}}$ is the arithmetic average of $n$ vectors and $\hat{\boldsymbol{\Sigma}}$ is the arithmetic average of the $n$ matrices $(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^T$.

Unbiased estimator: $E[\hat{\boldsymbol{\theta}}] = \boldsymbol{\theta}$. Consistent estimator: $\lim_{n\to\infty} E[\hat{\boldsymbol{\theta}}] = \boldsymbol{\theta}$.
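As a hedged sketch (not from the slides), assuming NumPy and SciPy are available: numerically maximizing the log-likelihood recovers the closed-form $\hat\mu$ and the biased $\hat\sigma^2$ above; the data are synthetic and purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=200)       # illustrative univariate sample D

def neg_log_likelihood(theta):
    mu, log_var = theta                            # parameterize sigma^2 = exp(log_var) > 0
    var = np.exp(log_var)
    return 0.5 * len(x) * np.log(2 * np.pi * var) + ((x - mu) ** 2).sum() / (2 * var)

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_num, var_num = res.x[0], np.exp(res.x[1])

mu_hat = x.mean()                                  # closed-form ML estimate
var_hat = ((x - mu_hat) ** 2).mean()               # biased ML estimate (divides by n)
print(mu_num, mu_hat)                              # numerical optimum matches closed form
print(var_num, var_hat, x.var(ddof=1))             # ddof=1 gives the unbiased version
```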


MLE for Normal Population

Sample mean (unbiased):

$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k, \qquad E[\hat{\boldsymbol{\mu}}] = \boldsymbol{\mu}$$

ML covariance estimate (biased):

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^T, \qquad E[\hat{\boldsymbol{\Sigma}}] = \frac{n-1}{n}\boldsymbol{\Sigma} \neq \boldsymbol{\Sigma}$$

Sample covariance matrix (unbiased):

$$\mathbf{C} = \frac{1}{n-1}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^T, \qquad E[\mathbf{C}] = \boldsymbol{\Sigma}$$
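A short illustrative check (not from the slides) of the biased vs. unbiased covariance estimates with NumPy; `np.cov` with `bias=True` divides by n, while the default divides by n-1. The true covariance is an assumed value for the simulation.

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma_true = np.array([[1.0, 0.5], [0.5, 2.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma_true, size=1000)

mu_hat = X.mean(axis=0)
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)       # ML estimate: divides by n (biased)
C = diff.T @ diff / (len(X) - 1)         # sample covariance: divides by n-1 (unbiased)

# np.cov computes the same quantities:
assert np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True))
assert np.allclose(C, np.cov(X, rowvar=False))
print(Sigma_hat, C, sep="\n")
```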


Bayesian Estimation: Settings

The parametric form of the likelihood function for each category is known.

However, $\boldsymbol{\theta}_j$ is considered to be a random variable instead of a fixed (but unknown) value.

In this case, we can no longer make a single ML estimate $\hat{\boldsymbol{\theta}}$ and then infer $P(\omega_i|\mathbf{x})$ from $P(\omega_i)$ and $p(\mathbf{x}|\omega_i)$.

How can we proceed? Fully exploit the training examples!


Posterior Probabilities from Samples

Taking all training data $\mathcal{D}$ into account, Bayes' rule becomes

$$P(\omega_i|\mathbf{x},\mathcal{D}) = \frac{p(\mathbf{x}|\omega_i,\mathcal{D})\,P(\omega_i|\mathcal{D})}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j,\mathcal{D})\,P(\omega_j|\mathcal{D})}$$

Assumptions:

The priors are known or computable in advance: $P(\omega_i|\mathcal{D}) = P(\omega_i)$.

Each class can be considered independently: $p(\mathbf{x}|\omega_i,\mathcal{D}) = p(\mathbf{x}|\omega_i,D_i)$, since the samples in $D_j$ ($j \neq i$) give no information about $\boldsymbol{\theta}_i$.

Hence

$$P(\omega_i|\mathbf{x},\mathcal{D}) = \frac{p(\mathbf{x}|\omega_i,D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j,D_j)\,P(\omega_j)}$$


Problem Formulation

$$P(\omega_i|\mathbf{x},\mathcal{D}) = \frac{p(\mathbf{x}|\omega_i,D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j,D_j)\,P(\omega_j)}$$

The key problem is to determine $p(\mathbf{x}|\omega_i,D_i)$. Treating each class independently (and dropping the class subscript), the problem becomes estimating $p(\mathbf{x}|\mathcal{D})$.

This is the central problem of Bayesian learning.


Class-Conditional Density Estimation

Assume $p(\mathbf{x})$ is unknown, but we know it has a fixed parametric form with parameter vector $\boldsymbol{\theta}$; i.e., the form of $p(\mathbf{x}|\boldsymbol{\theta})$ is assumed known, and $\boldsymbol{\theta}$ is a random variable.

$$p(\mathbf{x}|\mathcal{D}) = \int p(\mathbf{x},\boldsymbol{\theta}|\mathcal{D})\,d\boldsymbol{\theta} = \int p(\mathbf{x}|\boldsymbol{\theta},\mathcal{D})\,p(\boldsymbol{\theta}|\mathcal{D})\,d\boldsymbol{\theta} = \int p(\mathbf{x}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|\mathcal{D})\,d\boldsymbol{\theta}$$

Here $\mathbf{x}$ is independent of $\mathcal{D}$ given $\boldsymbol{\theta}$, so $p(\mathbf{x}|\boldsymbol{\theta},\mathcal{D}) = p(\mathbf{x}|\boldsymbol{\theta})$; the posterior density $p(\boldsymbol{\theta}|\mathcal{D})$ is what we want to estimate.
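A hedged Monte Carlo sketch (not from the slides) of this integral: if we can draw samples $\theta^{(s)}$ from the posterior $p(\theta|\mathcal{D})$, the predictive density can be approximated by averaging $p(x|\theta^{(s)})$. The Gaussian choices and numbers below are illustrative assumptions only.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(3)

# Assumed posterior over the single parameter theta = mu (illustrative values):
mu_n, sigma_n = 1.2, 0.1
theta_samples = rng.normal(mu_n, sigma_n, size=10_000)   # theta^(s) ~ p(theta|D)

sigma = 1.0                                              # known noise std in p(x|theta)
x = 0.5
# p(x|D) = integral of p(x|theta) p(theta|D) dtheta  ~  (1/S) sum_s p(x|theta^(s))
p_x_given_D = gauss_pdf(x, theta_samples, sigma).mean()
print(p_x_given_D)
```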


Bayesian Estimation: General Procedure

Phase I: determine the parameter posterior $p(\boldsymbol{\theta}|\mathcal{D})$.


Phase II: form the predictive (class-conditional) density

$$p(\mathbf{x}|\mathcal{D}) = \int p(\mathbf{x}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|\mathcal{D})\,d\boldsymbol{\theta}$$

Phase III: classify with Bayes' rule,

$$P(\omega_i|\mathbf{x},\mathcal{D}) = \frac{p(\mathbf{x}|\omega_i,D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j,D_j)\,P(\omega_j)}$$


The Gaussian Case

The univariate Gaussian with unknown $\mu$ (variance $\sigma^2$ known).

Phase I: $p(\mu|\mathcal{D}) \propto p(\mathcal{D}|\mu)\,p(\mu)$, with

$$p(x|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right], \qquad p(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right]$$

(Other forms of prior pdf could be assumed as well.)

$$p(\boldsymbol{\theta}|\mathcal{D}) \propto \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\theta})\,p(\boldsymbol{\theta})$$


The Gaussian Case

Substituting the Gaussian likelihood and the Gaussian prior,

$$p(\mu|\mathcal{D}) \propto \prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^2\right]\cdot\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right]$$

$$\propto \exp\!\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{\mu-x_k}{\sigma}\right)^2 + \left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right)\right] \propto \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n}x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$$


The Gaussian Case

$p(\mu|\mathcal{D})$ is an exponential of a quadratic function of $\mu$; thus $p(\mu|\mathcal{D})$ is again a normal density. Write it as $p(\mu|\mathcal{D}) \sim N(\mu_n, \sigma_n^2)$:

$$p(\mu|\mathcal{D}) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2\right] = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{1}{\sigma_n^2}\mu^2 - \frac{2\mu_n}{\sigma_n^2}\mu + \frac{\mu_n^2}{\sigma_n^2}\right)\right]$$

and compare this with the quadratic form obtained above.


The Gaussian Case

Equating the coefficients of the two forms, we obtain

$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\,\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2+\sigma^2}\,\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2+\sigma^2}, \qquad \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n}x_k$$
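A small sketch (not from the slides) computing the posterior parameters $\mu_n$ and $\sigma_n^2$ from data and an assumed prior $N(\mu_0, \sigma_0^2)$; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.5                                      # known likelihood std
mu_0, sigma_0 = 0.0, 2.0                         # assumed prior N(mu_0, sigma_0^2)
x = rng.normal(loc=3.0, scale=sigma, size=50)    # observed sample D

n = len(x)
mu_hat_n = x.mean()
mu_n = (n * sigma_0**2 * mu_hat_n + sigma**2 * mu_0) / (n * sigma_0**2 + sigma**2)
var_n = (sigma_0**2 * sigma**2) / (n * sigma_0**2 + sigma**2)
print(mu_n, var_n)   # posterior mean shrinks the sample mean toward the prior mean
```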


The Gaussian Case

Phase II: compute

$$p(x|\mathcal{D}) = \int p(x|\mu)\,p(\mu|\mathcal{D})\,d\mu$$

with $p(x|\mu) \sim N(\mu, \sigma^2)$ and $p(\mu|\mathcal{D}) \sim N(\mu_n, \sigma_n^2)$. How would $p(x|\mathcal{D})$ look in this case?


The Gaussian Case

$$p(x|\mathcal{D}) = \int p(x|\mu)\,p(\mu|\mathcal{D})\,d\mu = \int \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2\right]d\mu$$

$$= \frac{1}{2\pi\sigma\sigma_n}\exp\!\left[-\frac{1}{2}\,\frac{(x-\mu_n)^2}{\sigma^2+\sigma_n^2}\right]\int \exp\!\left[-\frac{1}{2}\,\frac{\sigma^2+\sigma_n^2}{\sigma^2\sigma_n^2}\left(\mu-\frac{\sigma_n^2 x+\sigma^2\mu_n}{\sigma^2+\sigma_n^2}\right)^2\right]d\mu$$

$p(x|\mathcal{D})$ is an exponential of a quadratic function of $x$; thus it is also a normal pdf. Which one?



Carrying out the (Gaussian) integral gives

$$p(x|\mathcal{D}) \sim N(\mu_n,\ \sigma^2+\sigma_n^2)$$
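A hedged numerical check (not from the slides) of this Phase II result: the closed-form predictive density $N(\mu_n, \sigma^2+\sigma_n^2)$ agrees with a Monte Carlo approximation of the integral. The posterior parameters and noise level are illustrative assumptions.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Illustrative posterior parameters from Phase I (assumed values):
mu_n, var_n = 2.8, 0.04
sigma = 1.5                                     # known likelihood std
x = 3.2

# Closed form: p(x|D) = N(x; mu_n, sigma^2 + sigma_n^2)
closed_form = gauss_pdf(x, mu_n, np.sqrt(sigma**2 + var_n))

# Monte Carlo check of p(x|D) = integral of p(x|mu) p(mu|D) dmu
rng = np.random.default_rng(5)
mu_samples = rng.normal(mu_n, np.sqrt(var_n), size=200_000)
monte_carlo = gauss_pdf(x, mu_samples, sigma).mean()

print(closed_form, monte_carlo)                 # the two values should agree closely
```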


The Gaussian Case

Phase III: plug the estimated class-conditional densities into Bayes' rule,

$$P(\omega_i|\mathbf{x},\mathcal{D}) = \frac{p(\mathbf{x}|\omega_i,D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j,D_j)\,P(\omega_j)}$$
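A short end-to-end sketch (not from the slides) of Phase III for two classes: each class gets its own Bayesian predictive density $N(\mu_n, \sigma^2+\sigma_n^2)$ estimated from its own $D_i$, and Bayes' rule combines these with the priors. Data, priors, and prior hyperparameters are illustrative assumptions.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def predictive_params(x_i, sigma, mu_0, sigma_0):
    """Phase I + II for one class: return (mu_n, sigma^2 + sigma_n^2)."""
    n, mu_hat = len(x_i), x_i.mean()
    mu_n = (n * sigma_0**2 * mu_hat + sigma**2 * mu_0) / (n * sigma_0**2 + sigma**2)
    var_n = sigma_0**2 * sigma**2 / (n * sigma_0**2 + sigma**2)
    return mu_n, sigma**2 + var_n

rng = np.random.default_rng(6)
sigma = 1.0                                                     # known, shared noise std
D = [rng.normal(-1.0, sigma, 40), rng.normal(2.0, sigma, 60)]   # D_1, D_2
priors = np.array([0.4, 0.6])

params = [predictive_params(x_i, sigma, mu_0=0.0, sigma_0=3.0) for x_i in D]

def classify(x):
    """Phase III: posterior class probabilities P(w_i|x, D)."""
    p = np.array([gauss_pdf(x, m, np.sqrt(v)) for m, v in params]) * priors
    return p / p.sum()

print(classify(0.2))
```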


Summary: Key Issue

Estimate the prior and the class-conditional pdf from the training set.

Basic assumption on the training examples: i.i.d.

Two strategies for the key issue:

Parametric form for the class-conditional pdf: maximum-likelihood estimation or Bayesian estimation.

No parametric form for the class-conditional pdf: next chapter.


Summary

Maximum-likelihood estimation. Settings: parameters are fixed but unknown values. Objective function: the log-likelihood. Solve by setting the gradient of the objective to zero. Gaussian cases I and II.

Bayesian estimation. Settings: parameters are random variables. General procedure: Phases I, II, III. Gaussian case. Project 3.2.

MIMA Group

Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University

Any Questions?