236607 Visual Recognition

Page 1: Tutorial 6

• Bias and variance of estimators

• The score and Fisher information

• The Cramér-Rao inequality

Page 2: Estimators and their Properties

• Let $\{p(x\mid\theta)\},\ \theta\in\Theta$, be a parametric set of distributions. Given a sample $D = x^{(n)} = (x_1,\ldots,x_n)$ drawn i.i.d. from one of the distributions in the set, we would like to estimate its parameter $\theta$ (thus identifying the distribution).

• An estimator for $\theta$ w.r.t. $D$ is any function $\hat\theta = T(D)$. Notice that an estimator is a random variable.

• How do we measure the quality of an estimator?

• Consistency: An estimator $\hat\theta$ for $\theta$ is consistent if
$$\hat\theta_n = T(x^{(n)}) \xrightarrow{\;p\;} \theta \quad \text{as } n\to\infty.$$
This is a (desirable) asymptotic property that motivates us to acquire large samples. But we should emphasize that we are also interested in measures for finite (and small!) sample sizes. (A small simulation sketch follows below.)
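Below is a minimal simulation sketch of consistency, assuming i.i.d. draws from $N(\theta, 1)$ and the sample mean as the estimator (both are illustrative choices, not fixed by the slides):

```python
import numpy as np

# A minimal sketch of consistency: the sample mean of i.i.d. N(theta, 1)
# draws concentrates around theta as n grows.
rng = np.random.default_rng(0)
theta = 2.0

for n in [10, 100, 10_000]:
    x = rng.normal(theta, 1.0, size=n)
    print(n, x.mean())  # approaches theta as n increases
```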

Page 3: Estimators and their Properties (cont.)

• Bias: Define the bias of an estimator $\hat\theta$ to be
$$b(\hat\theta) = E[\hat\theta] - \theta.$$
Here, the expectation is w.r.t. the distribution $p(x\mid\theta)$.

• The estimator is unbiased if its bias is zero: $b(\hat\theta) = 0$.

• Example: the estimators $x_1$ and $\bar x = \frac{1}{n}\sum_{i=1}^n x_i$, for the mean of a normal distribution, are both unbiased. The estimator $\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$ for its variance is biased, whereas the estimator $\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2$ is unbiased. (See the sketch after this list.)

• Variance: another important property of an estimator is its variance $\mathrm{var}_{p(x\mid\theta)}(\hat\theta)$. We would like to find estimators with minimum bias and variance.

• Which is more important, bias or variance?
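A quick Monte Carlo sketch comparing the two variance estimators (sample size, distribution, and trial count are illustrative assumptions):

```python
import numpy as np

# Compare the biased (1/n) and unbiased (1/(n-1)) variance estimators
# on many small samples from N(0, 1); the true variance is 1.
rng = np.random.default_rng(0)
n, trials = 5, 200_000

x = rng.normal(0.0, 1.0, size=(trials, n))
biased = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / n
unbiased = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / (n - 1)

print(biased.mean())    # approx (n-1)/n = 0.8, below the true variance
print(unbiased.mean())  # approx 1.0
```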

Page 4: Risky Estimators

• Employ our decision-theoretic framework to measure the quality of estimators.

• Abbreviate $\hat\theta = T(x^{(n)})$ and consider the square-error loss function
$$\lambda(\hat\theta, \theta) = (\hat\theta - \theta)^2.$$

• The conditional risk associated with $\hat\theta$ when $\theta$ is the true parameter is
$$R(\hat\theta \mid \theta) = E[(\hat\theta - \theta)^2] = \int (\hat\theta - \theta)^2\, p(x^{(n)}\mid\theta)\, dx^{(n)}.$$

• Claim: $R(\hat\theta\mid\theta) = \mathrm{var}(\hat\theta) + b(\hat\theta)^2$ (variance + bias$^2$).

• Proof:
$$\begin{aligned}
E[(\hat\theta - \theta)^2] &= E[(\hat\theta - E\hat\theta + E\hat\theta - \theta)^2] \\
&= E[(\hat\theta - E\hat\theta)^2] + 2\,E[\hat\theta - E\hat\theta]\,(E\hat\theta - \theta) + (E\hat\theta - \theta)^2 \\
&= \mathrm{var}(\hat\theta) + b(\hat\theta)^2 = \text{variance} + \text{bias}^2,
\end{aligned}$$
where the cross term vanishes since $E[\hat\theta - E\hat\theta] = 0$. (A numerical check follows below.)
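A minimal sketch verifying the risk decomposition numerically, reusing the biased variance estimator from the previous slide (all concrete values are illustrative):

```python
import numpy as np

# Check risk = variance + bias^2 for the biased variance estimator
# (1/n) * sum (x_i - xbar)^2 under N(0, 1), whose true variance is 1.
rng = np.random.default_rng(0)
n, trials, true_var = 5, 200_000, 1.0

x = rng.normal(0.0, 1.0, size=(trials, n))
est = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / n

risk = ((est - true_var) ** 2).mean()
decomposed = est.var() + (est.mean() - true_var) ** 2
print(risk, decomposed)  # the two agree up to Monte Carlo noise
```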

Page 5: Bias vs. Variance

• So, for a given level of conditional risk, there is a tradeoff between bias and variance.

• This tradeoff is among the most important facts in pattern recognition and machine learning.

• Classical approach: consider only unbiased estimators and try to find those with minimum possible variance.

• This approach is not always fruitful:
  – Unbiasedness only means that the average of the estimator (w.r.t. $p(x\mid\theta)$) is $\theta$. It doesn't mean it will be near $\theta$ for a particular sample (if the variance is large).
  – In general, an unbiased estimator is not guaranteed to exist.
Page 6: The Score

• The score of the family $\{p(x\mid\theta)\}$ is the random variable
$$v = \frac{\partial}{\partial\theta}\ln p(x\mid\theta) = \frac{1}{p(x\mid\theta)}\,\frac{\partial p(x\mid\theta)}{\partial\theta}.$$
$v$ measures the "sensitivity" of $p(x\mid\theta)$ as a function of the parameter $\theta$.

• Claim: $E[v] = 0$.

• Proof:
$$E[v] = \int \frac{1}{p(x\mid\theta)}\frac{\partial p(x\mid\theta)}{\partial\theta}\, p(x\mid\theta)\, dx = \int \frac{\partial p(x\mid\theta)}{\partial\theta}\, dx = \frac{\partial}{\partial\theta}\int p(x\mid\theta)\, dx = \frac{\partial}{\partial\theta}\,1 = 0.$$

• Corollary: $\mathrm{var}[v] = E[v^2] - (E[v])^2 = E[v^2]$.

Page 7: The Score - Example

• Consider the normal distribution $N(\theta, 1)$:
$$p(x\mid\theta) = \frac{1}{\sqrt{2\pi}}\exp\left(-\tfrac{1}{2}(x-\theta)^2\right),$$
$$\ln p(x\mid\theta) = -\tfrac{1}{2}\ln(2\pi) - \tfrac{1}{2}(x-\theta)^2,$$
$$v = \frac{\partial}{\partial\theta}\ln p(x\mid\theta) = x - \theta.$$

• Clearly, $E[v] = E[x] - \theta = 0$,

• and $\mathrm{var}(v) = E[v^2] = E[(x-\theta)^2] = 1$. (A numerical check follows below.)
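A minimal check of these two facts by simulation ($\theta$ and the sample size are arbitrary illustrative choices):

```python
import numpy as np

# For x ~ N(theta, 1) the score is v = x - theta; check E[v] = 0 and var(v) = 1.
rng = np.random.default_rng(0)
theta = 1.5

x = rng.normal(theta, 1.0, size=1_000_000)
v = x - theta
print(v.mean(), v.var())  # approx 0 and 1
```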

Page 8: The Score - Vector Form

• In the case where $\theta = (\theta_1,\ldots,\theta_k)$ is a vector, the score is the vector $v$ whose $i$th component is
$$v_i = \frac{\partial}{\partial\theta_i}\ln p(x\mid\theta).$$

• Example: for $N(\theta, \sigma^2)$ with parameters $(\theta, \sigma)$,
$$p(x\mid\theta,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}(x-\theta)^2\right),$$
$$\ln p(x\mid\theta,\sigma) = -\tfrac{1}{2}\ln(2\pi) - \ln\sigma - \frac{1}{2\sigma^2}(x-\theta)^2,$$
$$\frac{\partial}{\partial\theta}\ln p(x\mid\theta,\sigma) = \frac{x-\theta}{\sigma^2}, \qquad \frac{\partial}{\partial\sigma}\ln p(x\mid\theta,\sigma) = -\frac{1}{\sigma} + \frac{(x-\theta)^2}{\sigma^3},$$
$$v = \left(\frac{x-\theta}{\sigma^2},\; -\frac{1}{\sigma} + \frac{(x-\theta)^2}{\sigma^3}\right).$$
(A simulation sketch follows below.)
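A short sketch evaluating this vector score by simulation; both components should average to zero, consistent with $E[v] = 0$ (parameter values are illustrative):

```python
import numpy as np

# Vector score for N(theta, sigma^2) with parameters (theta, sigma):
# v = ((x - theta) / sigma^2, -1/sigma + (x - theta)^2 / sigma^3).
def score(x, theta, sigma):
    return np.stack([(x - theta) / sigma**2,
                     -1.0 / sigma + (x - theta) ** 2 / sigma**3])

rng = np.random.default_rng(0)
theta, sigma = 1.0, 2.0
x = rng.normal(theta, sigma, size=1_000_000)
print(score(x, theta, sigma).mean(axis=1))  # both components approx 0
```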

Page 9: Fisher Information

• Fisher information: designed to provide a measure of how much information the parametric probability law $p(x\mid\theta)$ carries about the parameter $\theta$.

• An adequate definition of such information should possess the following properties:
  – The larger the sensitivity of $p(x\mid\theta)$ to changes in $\theta$, the larger the information should be.
  – The information should be additive: the information carried by the combined law $p(x_1, x_2\mid\theta)$ should be the sum of those carried by $p(x_1\mid\theta)$ and $p(x_2\mid\theta)$.
  – The information should be insensitive to the sign of the change in $\theta$, and preferably positive.
  – The information should be a deterministic quantity; it should not depend on the specific random observation.

Page 10: Fisher Information

• Definition (scalar form): the Fisher information (about $\theta$) is the variance of the score,
$$J(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\ln p(x\mid\theta)\right)^2\right].$$

• Example: consider a random variable $x \sim N(\theta, \sigma^2)$:
$$\ln p(x\mid\theta,\sigma) = -\tfrac{1}{2}\ln(2\pi) - \ln\sigma - \frac{1}{2\sigma^2}(x-\theta)^2,$$
$$v = \frac{\partial}{\partial\theta}\ln p(x\mid\theta,\sigma) = \frac{x-\theta}{\sigma^2},$$
$$J(\theta) = E[v^2] = \frac{1}{\sigma^4}\,E[(x-\theta)^2] = \frac{\sigma^2}{\sigma^4} = 1/\sigma^2.$$
(A Monte Carlo check follows below.)
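A Monte Carlo sketch estimating $J(\theta)$ as the variance of the score for this example (concrete values are illustrative):

```python
import numpy as np

# Fisher information as the variance of the score for x ~ N(theta, sigma^2):
# v = (x - theta) / sigma^2, so J(theta) should be 1 / sigma^2.
rng = np.random.default_rng(0)
theta, sigma = 0.5, 2.0

x = rng.normal(theta, sigma, size=1_000_000)
v = (x - theta) / sigma**2
print(v.var(), 1.0 / sigma**2)  # both approx 0.25
```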

Page 11: Fisher Information - Cntd.

• Whenever $\theta = (\theta_1,\ldots,\theta_k)$ is a vector, the Fisher information is the matrix $J(\theta) = \big(J_{i,j}(\theta)\big)$, where
$$J_{i,j}(\theta) = \mathrm{cov}\left(\frac{\partial}{\partial\theta_i}\log p(x\mid\theta),\; \frac{\partial}{\partial\theta_j}\log p(x\mid\theta)\right).$$

• Reminder: $\mathrm{cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$.

• Remark: the Fisher information is only defined when the distributions satisfy some regularity conditions. (For example, $p(x\mid\theta)$ should be differentiable w.r.t. each $\theta_i$, and all the distributions in the parametric family must have the same support set.)

Page 12: Fisher Information - Cntd.

• Claim: Let $x^{(n)} = (x_1,\ldots,x_n)$ be i.i.d. random variables $\sim p(x\mid\theta)$. The score of $p(x^{(n)}\mid\theta)$ is the sum of the individual scores.

• Proof:
$$v(x^{(n)}) = \frac{\partial}{\partial\theta}\ln p(x^{(n)}\mid\theta) = \frac{\partial}{\partial\theta}\ln\prod_{i=1}^n p(x_i\mid\theta) = \sum_{i=1}^n \frac{\partial}{\partial\theta}\ln p(x_i\mid\theta) = \sum_{i=1}^n v(x_i).$$

• Example: If $x^{(n)} = (x_1,\ldots,x_n)$ are i.i.d. $\sim N(\theta, \sigma^2)$, the score is
$$v_n = \frac{\partial}{\partial\theta}\ln p(x^{(n)}\mid\theta,\sigma) = \sum_{i=1}^n \frac{x_i-\theta}{\sigma^2} = \frac{n(\bar x - \theta)}{\sigma^2}.$$

Page 13: Fisher Information - Cntd.

• Based on $n$ i.i.d. samples, the Fisher information about $\theta$ is
$$J_n(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\ln p(x^{(n)}\mid\theta)\right)^2\right] = E\left[\Big(\sum_{i=1}^n v(x_i)\Big)^2\right] = \sum_{i=1}^n E[v(x_i)^2] = nJ(\theta),$$
where the cross terms vanish because the individual scores are independent with zero mean.

• Thus, the Fisher information is additive w.r.t. i.i.d. random variables. (A numerical check follows below.)

• Example: Suppose $x^{(n)} = (x_1,\ldots,x_n)$ are i.i.d. $\sim N(\theta, \sigma^2)$. From the previous example we know that the Fisher information about the parameter $\theta$ based on one sample is $J(\theta) = 1/\sigma^2$. Therefore, based on the entire sample, $J_n(\theta) = n/\sigma^2$.
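A sketch of the additivity claim: the variance of the full-sample score across many i.i.d. samples should be close to $nJ(\theta) = n/\sigma^2$ (all concrete values are illustrative):

```python
import numpy as np

# Additivity of Fisher information: for n i.i.d. N(theta, sigma^2) samples
# the score is n*(xbar - theta)/sigma^2 and its variance is n/sigma^2.
rng = np.random.default_rng(0)
theta, sigma, n, trials = 0.0, 2.0, 10, 200_000

x = rng.normal(theta, sigma, size=(trials, n))
v_n = n * (x.mean(axis=1) - theta) / sigma**2
print(v_n.var(), n / sigma**2)  # both approx 2.5
```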

Page 14: The Cramér-Rao Inequality

• Theorem: Let $\hat\theta$ be an unbiased estimator for $\theta$. Then
$$\mathrm{var}(\hat\theta) \ge \frac{1}{J(\theta)}.$$

• Proof: Using $E[v] = 0$ we have:
$$E[(\hat\theta - \theta)v] = E[\hat\theta v] - \theta\, E[v] = E[\hat\theta v].$$

Page 15: The Cramér-Rao Inequality - Cntd.

• Now, using unbiasedness ($E[\hat\theta] = \theta$):
$$E[\hat\theta v] = \int \hat\theta\,\frac{1}{p(x\mid\theta)}\frac{\partial p(x\mid\theta)}{\partial\theta}\, p(x\mid\theta)\, dx = \int \hat\theta\,\frac{\partial p(x\mid\theta)}{\partial\theta}\, dx = \frac{\partial}{\partial\theta}\int \hat\theta\, p(x\mid\theta)\, dx = \frac{\partial}{\partial\theta}E[\hat\theta] = \frac{\partial\theta}{\partial\theta} = 1.$$
Page 16: The Cramér-Rao Inequality - Cntd.

• So, $E[(\hat\theta - \theta)v] = E[\hat\theta v] - \theta\,E[v] = 1$.

• By the Cauchy-Schwarz inequality (and unbiasedness, so that $E[(\hat\theta-\theta)^2] = \mathrm{var}(\hat\theta)$):
$$1 = \big(E[(\hat\theta - \theta)v]\big)^2 \le E[(\hat\theta - \theta)^2]\, E[v^2] = \mathrm{var}(\hat\theta)\, J(\theta).$$

• Therefore,
$$\mathrm{var}(\hat\theta) \ge \frac{1}{J(\theta)}.$$

• For a biased estimator we have:
$$\mathrm{var}(\hat\theta) \ge \frac{\big(\partial E[\hat\theta]/\partial\theta\big)^2}{J(\theta)} = \frac{\big(1 + \partial b(\hat\theta)/\partial\theta\big)^2}{J(\theta)}.$$

Page 17: The Cramér-Rao Inequality - General Case

• The Cramér-Rao inequality also holds in a general (vector) form: the error covariance matrix for $\hat{\boldsymbol\theta}$ is bounded as follows:
$$E[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t] = C \succeq J(\boldsymbol\theta)^{-1}.$$
Page 18: The Cramér-Rao Inequality - Cntd.

• Example: Let $x^{(n)} = (x_1,\ldots,x_n)$ be i.i.d. $\sim N(\theta, \sigma^2)$. From the previous example, $J_n(\theta) = n/\sigma^2$.

• Now let $\hat\theta(x^{(n)}) = \bar x = \frac{1}{n}\sum_{i=1}^n x_i$ be an (unbiased) estimator for $\theta$. Then
$$\mathrm{var}(\hat\theta) = E[(\hat\theta - E\hat\theta)^2] = E[\hat\theta^2] - (E[\hat\theta])^2,$$
$$E[\hat\theta^2] = E\left[\Big(\frac{1}{n}\sum_{i=1}^n x_i\Big)^2\right] = \frac{1}{n^2}\big(n\sigma^2 + n^2\theta^2\big) = \frac{\sigma^2}{n} + \theta^2,$$
so $\mathrm{var}(\hat\theta) = \sigma^2/n$.

• So $\hat\theta$ matches the Cramér-Rao lower bound $1/J_n(\theta) = \sigma^2/n$. (A simulation check follows below.)

• Def: An unbiased estimator whose covariance meets the Cramér-Rao lower bound is called efficient.
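A simulation sketch checking that the sample mean attains the Cramér-Rao bound $\sigma^2/n$ (all concrete values are illustrative):

```python
import numpy as np

# The sample mean of n i.i.d. N(theta, sigma^2) draws is unbiased and its
# variance matches the Cramer-Rao bound sigma^2 / n = 1 / J_n(theta).
rng = np.random.default_rng(0)
theta, sigma, n, trials = 1.0, 2.0, 10, 200_000

x = rng.normal(theta, sigma, size=(trials, n))
est = x.mean(axis=1)
print(est.mean())              # approx theta (unbiased)
print(est.var(), sigma**2 / n) # both approx 0.4
```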

Page 19: Efficiency

• Theorem (Efficiency): The unbiased estimator $\hat{\boldsymbol\theta}$ (with $E[\hat{\boldsymbol\theta}] = \boldsymbol\theta$) is efficient, that is,
$$E[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t] = C = J(\boldsymbol\theta)^{-1},$$
iff
$$J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta) = \boldsymbol\nu,$$
where $\boldsymbol\nu$ is the (vector) score.

• Proof (If): If $J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta) = \boldsymbol\nu$, then
$$E[J(\boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^t J(\boldsymbol\theta)^t] = J(\boldsymbol\theta)\, C\, J(\boldsymbol\theta)^t = E[\boldsymbol\nu\boldsymbol\nu^t] = J(\boldsymbol\theta),$$
meaning $C = J(\boldsymbol\theta)^{-1}$.
Page 20: 236607 Visual Recognition Tutorial1 Tutorial 6 Bias and variance of estimators The score and Fisher information Cramer-Rao inequality.

236607 Visual Recognition Tutorial 20

Efficiency

• Only if: Recall the cross covariance between

The Cauchy-Schwarz inequality for random variables says

with equality iff

2ˆ[ ( ) ]tE ν θ -θ I

ˆand ( ) : ν θ θ

2ˆ ˆ ˆ[ ( ) ] [ ] [( )( ) ]t t tE E E I ν θ -θ νν θ -θ θ -θ JC

ˆ( )( ) J θ θ -θ ν

Page 21: Cramér-Rao Inequality and ML

• Theorem: Suppose there exists an efficient estimator $\hat\theta$ for all $\theta$. Then the ML estimator is $\hat\theta$.

• Proof: By assumption,
$$\mathrm{var}(\hat\theta) = \frac{1}{J(\theta)}.$$
By the previous claim, $v = J(\theta)(\hat\theta - \theta)$, i.e.,
$$\frac{\partial}{\partial\theta}\log p(x\mid\theta) = J(\theta)(\hat\theta - \theta) \quad \text{for all } \theta.$$
This holds at $\theta = \theta_{ML}$, and since this is a maximum point, the left side is zero there, so $\hat\theta = \theta_{ML}$ (as $J(\theta_{ML}) > 0$).