236607 Visual Recognition

Page 1: Tutorial 6

• Bias and variance of estimators

• The score and Fisher information

• The Cramér-Rao inequality

Page 2: Estimators and their Properties

• Let $\{p(x\mid\theta)\},\ \theta\in\Theta$, be a parametric set of distributions. Given a sample $D = x^{(n)} = (x_1,\ldots,x_n)$ drawn i.i.d. from one of the distributions in the set, we would like to estimate its parameter $\theta$ (thus identifying the distribution).

• An estimator for $\theta$ w.r.t. $D$ is any function $\hat\theta = T(D)$. Notice that an estimator is a random variable.

• How do we measure the quality of an estimator?

• Consistency: An estimator $\hat\theta$ for $\theta$ is consistent if
$$\hat\theta_n = T(x^{(n)}) \xrightarrow{\;p\;} \theta \quad \text{as } n\to\infty.$$
This is a (desirable) asymptotic property that motivates us to acquire large samples. But we should emphasize that we are also interested in measures for finite (and small!) sample sizes. (A small simulation sketch follows below.)
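Below is a minimal simulation sketch of consistency, assuming i.i.d. draws from $N(\theta, 1)$ and the sample mean as the estimator (both are illustrative choices, not fixed by the slides):

```python
import numpy as np

# A minimal sketch of consistency: the sample mean of i.i.d. N(theta, 1)
# draws concentrates around theta as n grows.
rng = np.random.default_rng(0)
theta = 2.0

for n in [10, 100, 10_000]:
    x = rng.normal(theta, 1.0, size=n)
    print(n, x.mean())  # approaches theta as n increases
```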

Page 3: Estimators and their Properties (cont.)

• Bias: Define the bias of an estimator $\hat\theta$ to be
$$b(\hat\theta) = E[\hat\theta] - \theta.$$
Here, the expectation is w.r.t. the distribution $p(x\mid\theta)$.

• The estimator is unbiased if its bias is zero: $b(\hat\theta) = 0$.

• Example: the estimators $x_1$ and $\bar x = \frac{1}{n}\sum_{i=1}^n x_i$, for the mean of a normal distribution, are both unbiased. The estimator $\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$ for its variance is biased, whereas the estimator $\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2$ is unbiased. (See the sketch after this list.)

• Variance: another important property of an estimator is its variance $\mathrm{var}_{p(x\mid\theta)}(\hat\theta)$. We would like to find estimators with minimum bias and variance.

• Which is more important, bias or variance?
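A quick Monte Carlo sketch comparing the two variance estimators (sample size, distribution, and trial count are illustrative assumptions):

```python
import numpy as np

# Compare the biased (1/n) and unbiased (1/(n-1)) variance estimators
# on many small samples from N(0, 1); the true variance is 1.
rng = np.random.default_rng(0)
n, trials = 5, 200_000

x = rng.normal(0.0, 1.0, size=(trials, n))
biased = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / n
unbiased = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / (n - 1)

print(biased.mean())    # approx (n-1)/n = 0.8, below the true variance
print(unbiased.mean())  # approx 1.0
```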

Page 4: Risky Estimators

• Employ our decision-theoretic framework to measure the quality of estimators.

• Abbreviate $\hat\theta = T(x^{(n)})$ and consider the square-error loss function
$$\lambda(\hat\theta, \theta) = (\hat\theta - \theta)^2.$$

• The conditional risk associated with $\hat\theta$ when $\theta$ is the true parameter is
$$R(\hat\theta \mid \theta) = E[(\hat\theta - \theta)^2] = \int (\hat\theta - \theta)^2\, p(x^{(n)}\mid\theta)\, dx^{(n)}.$$

• Claim: $R(\hat\theta\mid\theta) = \mathrm{var}(\hat\theta) + b(\hat\theta)^2$ (variance + bias$^2$).

• Proof:
$$\begin{aligned}
E[(\hat\theta - \theta)^2] &= E[(\hat\theta - E\hat\theta + E\hat\theta - \theta)^2] \\
&= E[(\hat\theta - E\hat\theta)^2] + 2\,E[\hat\theta - E\hat\theta]\,(E\hat\theta - \theta) + (E\hat\theta - \theta)^2 \\
&= \mathrm{var}(\hat\theta) + b(\hat\theta)^2 = \text{variance} + \text{bias}^2,
\end{aligned}$$
where the cross term vanishes since $E[\hat\theta - E\hat\theta] = 0$. (A numerical check follows below.)
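A minimal sketch verifying the risk decomposition numerically, reusing the biased variance estimator from the previous slide (all concrete values are illustrative):

```python
import numpy as np

# Check risk = variance + bias^2 for the biased variance estimator
# (1/n) * sum (x_i - xbar)^2 under N(0, 1), whose true variance is 1.
rng = np.random.default_rng(0)
n, trials, true_var = 5, 200_000, 1.0

x = rng.normal(0.0, 1.0, size=(trials, n))
est = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / n

risk = ((est - true_var) ** 2).mean()
decomposed = est.var() + (est.mean() - true_var) ** 2
print(risk, decomposed)  # the two agree up to Monte Carlo noise
```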

Page 5: Bias vs. Variance

• So, for a given level of conditional risk, there is a tradeoff between bias and variance.

• This tradeoff is among the most important facts in pattern recognition and machine learning.

• Classical approach: consider only unbiased estimators and try to find those with minimum possible variance.

• This approach is not always fruitful:
  – Unbiasedness only means that the average of the estimator (w.r.t. $p(x\mid\theta)$) is $\theta$. It doesn't mean it will be near $\theta$ for a particular sample (if the variance is large).
  – In general, an unbiased estimator is not guaranteed to exist.
Page 6: The Score

• The score of the family $\{p(x\mid\theta)\}$ is the random variable
$$v = \frac{\partial}{\partial\theta}\ln p(x\mid\theta) = \frac{1}{p(x\mid\theta)}\,\frac{\partial p(x\mid\theta)}{\partial\theta}.$$
$v$ measures the "sensitivity" of $p(x\mid\theta)$ as a function of the parameter $\theta$.

• Claim: $E[v] = 0$.

• Proof:
$$E[v] = \int \frac{1}{p(x\mid\theta)}\frac{\partial p(x\mid\theta)}{\partial\theta}\, p(x\mid\theta)\, dx = \int \frac{\partial p(x\mid\theta)}{\partial\theta}\, dx = \frac{\partial}{\partial\theta}\int p(x\mid\theta)\, dx = \frac{\partial}{\partial\theta}\,1 = 0.$$

• Corollary: $\mathrm{var}[v] = E[v^2] - (E[v])^2 = E[v^2]$.

Page 7: The Score - Example

• Consider the normal distribution $N(\theta, 1)$:
$$p(x\mid\theta) = \frac{1}{\sqrt{2\pi}}\exp\left(-\tfrac{1}{2}(x-\theta)^2\right),$$
$$\ln p(x\mid\theta) = -\tfrac{1}{2}\ln(2\pi) - \tfrac{1}{2}(x-\theta)^2,$$
$$v = \frac{\partial}{\partial\theta}\ln p(x\mid\theta) = x - \theta.$$

• Clearly, $E[v] = E[x] - \theta = 0$,

• and $\mathrm{var}(v) = E[v^2] = E[(x-\theta)^2] = 1$. (A numerical check follows below.)
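A minimal check of these two facts by simulation ($\theta$ and the sample size are arbitrary illustrative choices):

```python
import numpy as np

# For x ~ N(theta, 1) the score is v = x - theta; check E[v] = 0 and var(v) = 1.
rng = np.random.default_rng(0)
theta = 1.5

x = rng.normal(theta, 1.0, size=1_000_000)
v = x - theta
print(v.mean(), v.var())  # approx 0 and 1
```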

Page 8: The Score - Vector Form

• In the case where $\theta = (\theta_1,\ldots,\theta_k)$ is a vector, the score is the vector $v$ whose $i$th component is
$$v_i = \frac{\partial}{\partial\theta_i}\ln p(x\mid\theta).$$

• Example: for $N(\theta, \sigma^2)$ with parameters $(\theta, \sigma)$,
$$p(x\mid\theta,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}(x-\theta)^2\right),$$
$$\ln p(x\mid\theta,\sigma) = -\tfrac{1}{2}\ln(2\pi) - \ln\sigma - \frac{1}{2\sigma^2}(x-\theta)^2,$$
$$\frac{\partial}{\partial\theta}\ln p(x\mid\theta,\sigma) = \frac{x-\theta}{\sigma^2}, \qquad \frac{\partial}{\partial\sigma}\ln p(x\mid\theta,\sigma) = -\frac{1}{\sigma} + \frac{(x-\theta)^2}{\sigma^3},$$
$$v = \left(\frac{x-\theta}{\sigma^2},\; -\frac{1}{\sigma} + \frac{(x-\theta)^2}{\sigma^3}\right).$$
(A simulation sketch follows below.)
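A short sketch evaluating this vector score by simulation; both components should average to zero, consistent with $E[v] = 0$ (parameter values are illustrative):

```python
import numpy as np

# Vector score for N(theta, sigma^2) with parameters (theta, sigma):
# v = ((x - theta) / sigma^2, -1/sigma + (x - theta)^2 / sigma^3).
def score(x, theta, sigma):
    return np.stack([(x - theta) / sigma**2,
                     -1.0 / sigma + (x - theta) ** 2 / sigma**3])

rng = np.random.default_rng(0)
theta, sigma = 1.0, 2.0
x = rng.normal(theta, sigma, size=1_000_000)
print(score(x, theta, sigma).mean(axis=1))  # both components approx 0
```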

Page 9: Fisher Information

• Fisher information: designed to provide a measure of how much information the parametric probability law $p(x\mid\theta)$ carries about the parameter $\theta$.

• An adequate definition of such information should possess the following properties:
  – The larger the sensitivity of $p(x\mid\theta)$ to changes in $\theta$, the larger the information should be.
  – The information should be additive: the information carried by the combined law $p(x_1, x_2\mid\theta)$ should be the sum of those carried by $p(x_1\mid\theta)$ and $p(x_2\mid\theta)$.
  – The information should be insensitive to the sign of the change in $\theta$, and preferably positive.
  – The information should be a deterministic quantity; it should not depend on the specific random observation.

Page 10: Fisher Information

• Definition (scalar form): the Fisher information (about $\theta$) is the variance of the score,
$$J(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\ln p(x\mid\theta)\right)^2\right].$$

• Example: consider a random variable $x \sim N(\theta, \sigma^2)$:
$$\ln p(x\mid\theta,\sigma) = -\tfrac{1}{2}\ln(2\pi) - \ln\sigma - \frac{1}{2\sigma^2}(x-\theta)^2,$$
$$v = \frac{\partial}{\partial\theta}\ln p(x\mid\theta,\sigma) = \frac{x-\theta}{\sigma^2},$$
$$J(\theta) = E[v^2] = \frac{1}{\sigma^4}\,E[(x-\theta)^2] = \frac{\sigma^2}{\sigma^4} = 1/\sigma^2.$$
(A Monte Carlo check follows below.)
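A Monte Carlo sketch estimating $J(\theta)$ as the variance of the score for this example (concrete values are illustrative):

```python
import numpy as np

# Fisher information as the variance of the score for x ~ N(theta, sigma^2):
# v = (x - theta) / sigma^2, so J(theta) should be 1 / sigma^2.
rng = np.random.default_rng(0)
theta, sigma = 0.5, 2.0

x = rng.normal(theta, sigma, size=1_000_000)
v = (x - theta) / sigma**2
print(v.var(), 1.0 / sigma**2)  # both approx 0.25
```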

Page 11: Fisher Information - Cntd.

• Whenever $\theta = (\theta_1,\ldots,\theta_k)$ is a vector, the Fisher information is the matrix $J(\theta) = \big(J_{i,j}(\theta)\big)$, where
$$J_{i,j}(\theta) = \mathrm{cov}\left(\frac{\partial}{\partial\theta_i}\log p(x\mid\theta),\; \frac{\partial}{\partial\theta_j}\log p(x\mid\theta)\right).$$

• Reminder: $\mathrm{cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$.

• Remark: the Fisher information is only defined when the distributions satisfy some regularity conditions. (For example, $p(x\mid\theta)$ should be differentiable w.r.t. each $\theta_i$, and all the distributions in the parametric family must have the same support set.)

Page 12: Fisher Information - Cntd.

• Claim: Let $x^{(n)} = (x_1,\ldots,x_n)$ be i.i.d. random variables $\sim p(x\mid\theta)$. The score of $p(x^{(n)}\mid\theta)$ is the sum of the individual scores.

• Proof:
$$v(x^{(n)}) = \frac{\partial}{\partial\theta}\ln p(x^{(n)}\mid\theta) = \frac{\partial}{\partial\theta}\ln\prod_{i=1}^n p(x_i\mid\theta) = \sum_{i=1}^n \frac{\partial}{\partial\theta}\ln p(x_i\mid\theta) = \sum_{i=1}^n v(x_i).$$

• Example: If $x^{(n)} = (x_1,\ldots,x_n)$ are i.i.d. $\sim N(\theta, \sigma^2)$, the score is
$$v_n = \frac{\partial}{\partial\theta}\ln p(x^{(n)}\mid\theta,\sigma) = \sum_{i=1}^n \frac{x_i-\theta}{\sigma^2} = \frac{n(\bar x - \theta)}{\sigma^2}.$$

Page 13: Fisher Information - Cntd.

• Based on $n$ i.i.d. samples, the Fisher information about $\theta$ is
$$J_n(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\ln p(x^{(n)}\mid\theta)\right)^2\right] = E\left[\Big(\sum_{i=1}^n v(x_i)\Big)^2\right] = \sum_{i=1}^n E[v(x_i)^2] = nJ(\theta),$$
where the cross terms vanish because the individual scores are independent with zero mean.

• Thus, the Fisher information is additive w.r.t. i.i.d. random variables. (A numerical check follows below.)

• Example: Suppose $x^{(n)} = (x_1,\ldots,x_n)$ are i.i.d. $\sim N(\theta, \sigma^2)$. From the previous example we know that the Fisher information about the parameter $\theta$ based on one sample is $J(\theta) = 1/\sigma^2$. Therefore, based on the entire sample, $J_n(\theta) = n/\sigma^2$.
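A sketch of the additivity claim: the variance of the full-sample score across many i.i.d. samples should be close to $nJ(\theta) = n/\sigma^2$ (all concrete values are illustrative):

```python
import numpy as np

# Additivity of Fisher information: for n i.i.d. N(theta, sigma^2) samples
# the score is n*(xbar - theta)/sigma^2 and its variance is n/sigma^2.
rng = np.random.default_rng(0)
theta, sigma, n, trials = 0.0, 2.0, 10, 200_000

x = rng.normal(theta, sigma, size=(trials, n))
v_n = n * (x.mean(axis=1) - theta) / sigma**2
print(v_n.var(), n / sigma**2)  # both approx 2.5
```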

Page 14: The Cramér-Rao Inequality

• Theorem: Let $\hat\theta$ be an unbiased estimator for $\theta$. Then
$$\mathrm{var}(\hat\theta) \ge \frac{1}{J(\theta)}.$$

• Proof: Using $E[v] = 0$ we have:
$$E[(\hat\theta - \theta)v] = E[\hat\theta v] - \theta\, E[v] = E[\hat\theta v].$$

Page 15: The Cramér-Rao Inequality - Cntd.

• Now, using unbiasedness ($E[\hat\theta] = \theta$):
$$E[\hat\theta v] = \int \hat\theta\,\frac{1}{p(x\mid\theta)}\frac{\partial p(x\mid\theta)}{\partial\theta}\, p(x\mid\theta)\, dx = \int \hat\theta\,\frac{\partial p(x\mid\theta)}{\partial\theta}\, dx = \frac{\partial}{\partial\theta}\int \hat\theta\, p(x\mid\theta)\, dx = \frac{\partial}{\partial\theta}E[\hat\theta] = \frac{\partial\theta}{\partial\theta} = 1.$$
Page 16: The Cramér-Rao Inequality - Cntd.

• So, $E[(\hat\theta - \theta)v] = E[\hat\theta v] - \theta\,E[v] = 1$.

• By the Cauchy-Schwarz inequality (and unbiasedness, so that $E[(\hat\theta-\theta)^2] = \mathrm{var}(\hat\theta)$):
$$1 = \big(E[(\hat\theta - \theta)v]\big)^2 \le E[(\hat\theta - \theta)^2]\, E[v^2] = \mathrm{var}(\hat\theta)\, J(\theta).$$

• Therefore,
$$\mathrm{var}(\hat\theta) \ge \frac{1}{J(\theta)}.$$

• For a biased estimator we have:
$$\mathrm{var}(\hat\theta) \ge \frac{\big(\partial E[\hat\theta]/\partial\theta\big)^2}{J(\theta)} = \frac{\big(1 + \partial b(\hat\theta)/\partial\theta\big)^2}{J(\theta)}.$$

Page 17: The Cramér-Rao Inequality - General Case

• The Cramér-Rao inequality also holds in a general (vector) form: the error covariance matrix for $\hat{\boldsymbol\theta}$ is bounded as follows:
$$E[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t] = C \succeq J(\boldsymbol\theta)^{-1}.$$
Page 18: The Cramér-Rao Inequality - Cntd.

• Example: Let $x^{(n)} = (x_1,\ldots,x_n)$ be i.i.d. $\sim N(\theta, \sigma^2)$. From the previous example, $J_n(\theta) = n/\sigma^2$.

• Now let $\hat\theta(x^{(n)}) = \bar x = \frac{1}{n}\sum_{i=1}^n x_i$ be an (unbiased) estimator for $\theta$. Then
$$\mathrm{var}(\hat\theta) = E[(\hat\theta - E\hat\theta)^2] = E[\hat\theta^2] - (E[\hat\theta])^2,$$
$$E[\hat\theta^2] = E\left[\Big(\frac{1}{n}\sum_{i=1}^n x_i\Big)^2\right] = \frac{1}{n^2}\big(n\sigma^2 + n^2\theta^2\big) = \frac{\sigma^2}{n} + \theta^2,$$
so $\mathrm{var}(\hat\theta) = \sigma^2/n$.

• So $\hat\theta$ matches the Cramér-Rao lower bound $1/J_n(\theta) = \sigma^2/n$. (A simulation check follows below.)

• Def: An unbiased estimator whose covariance meets the Cramér-Rao lower bound is called efficient.
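A simulation sketch checking that the sample mean attains the Cramér-Rao bound $\sigma^2/n$ (all concrete values are illustrative):

```python
import numpy as np

# The sample mean of n i.i.d. N(theta, sigma^2) draws is unbiased and its
# variance matches the Cramer-Rao bound sigma^2 / n = 1 / J_n(theta).
rng = np.random.default_rng(0)
theta, sigma, n, trials = 1.0, 2.0, 10, 200_000

x = rng.normal(theta, sigma, size=(trials, n))
est = x.mean(axis=1)
print(est.mean())              # approx theta (unbiased)
print(est.var(), sigma**2 / n) # both approx 0.4
```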

Page 19: Efficiency

• Theorem (Efficiency): The unbiased estimator $\hat{\boldsymbol\theta}$ (with $E[\hat{\boldsymbol\theta}] = \boldsymbol\theta$) is efficient, that is,
$$E[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t] = C = J(\boldsymbol\theta)^{-1},$$
iff
$$J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta) = \boldsymbol\nu,$$
where $\boldsymbol\nu$ is the (vector) score.

• Proof (If): If $J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta) = \boldsymbol\nu$, then
$$E[J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t J(\boldsymbol\theta)^t] = J(\boldsymbol\theta)\, C\, J(\boldsymbol\theta)^t = E[\boldsymbol\nu\boldsymbol\nu^t] = J(\boldsymbol\theta),$$
meaning $C = J(\boldsymbol\theta)^{-1}$.
Page 20: 236607 Visual Recognition Tutorial1 Tutorial 6 Bias and variance of estimators The score and Fisher information Cramer-Rao inequality.

236607 Visual Recognition Tutorial 20

Efficiency

• Only if: Recall the cross covariance between

The Cauchy-Schwarz inequality for random variables says

with equality iff

2ˆ[ ( ) ]tE ν θ -θ I

ˆand ( ) : ν θ θ

2ˆ ˆ ˆ[ ( ) ] [ ] [( )( ) ]t t tE E E I ν θ -θ νν θ -θ θ -θ JC

ˆ( )( ) J θ θ -θ ν

Page 21: Cramér-Rao Inequality and ML

• Theorem: Suppose there exists an efficient estimator $\hat\theta$ for all $\theta$. Then the ML estimator is $\hat\theta$.

• Proof: By assumption,
$$\mathrm{var}(\hat\theta) = \frac{1}{J(\theta)}.$$
By the previous claim, $v = J(\theta)(\hat\theta - \theta)$, i.e.,
$$\frac{\partial}{\partial\theta}\log p(x\mid\theta) = J(\theta)(\hat\theta - \theta) \quad \text{for all } \theta.$$
This holds at $\theta = \theta_{ML}$, and since this is a maximum point, the left side is zero there, so $\hat\theta = \theta_{ML}$ (as $J(\theta_{ML}) > 0$).