ASP Lecture 3 Estimation Intro 2015
Transcript of ASP Lecture 3 Estimation Intro 2015
Advanced Signal Processing
Introduction to Estimation Theory
Danilo Mandic,
room 813, ext: 46271
Department of Electrical and Electronic Engineering
Imperial College London, [email protected], URL: www.commsp.ee.ic.ac.uk/~mandic
© D. P. Mandic Advanced Signal Processing, Spring 2015 1
Aims of this lecture
◦ To introduce the notions of: Estimator, Estimate, Estimandum
◦ To discuss the bias and variance in statistical estimation theory; asymptotically unbiased and consistent estimators
◦ Performance metric: the Mean Square Error (MSE)
◦ The bias–variance dilemma and the MSE in this context
◦ To derive a feasible MSE estimator
◦ A class of Minimum Variance Unbiased (MVU) estimators
◦ Extension to the vector parameter case
◦ Point estimators, confidence intervals, statistical goodness of an estimator, the role of noise
Role of estimation in signal processing
(try also the function specgram in Matlab)
◦ An enabling technology in many electronic signal processing systems
1. Radar
2. Sonar
3. Speech
4. Image analysis
5. Biomedicine
6. Communications
7. Control
8. Seismics
9. Almost everywhere ...
◦ Radar and sonar: range and azimuth
◦ Image analysis: motion estimation, segmentation
◦ Speech: features used in recognition and speaker verification
◦ Seismics: oil reservoirs
◦ Communications: equalization, symbol detection
◦ Biomedicine: various applications
Statistical estimation problem
for simplicity, consider a DC level in WGN: x[n] = A + w[n], w ∼ N(0, σ²)

Problem statement: We seek to determine a set of parameters θ = [θ_1, . . . , θ_p]^T from a set of data points x = [x[0], . . . , x[N − 1]]^T such that the values of these parameters would yield the highest probability of obtaining the observed data. In other words,

max_θ p(x; θ),   where p(x; θ) reads: “p(x) parametrised by θ”
◦ The unknown parameters may be seen as deterministic or random variables
◦ There are essentially two alternatives to the statistical case
– No a priori distribution assumed: Maximum Likelihood
– A priori distribution known: Bayesian estimation
◦ Key problem ⇒ to estimate a group of parameters from a discrete-time signal or dataset.
Estimation of a scalar random variable
Given an N-point dataset x[0], x[1], . . . , x[N − 1] which depends on an unknown parameter θ (scalar), define an “estimator” as some function, g, of the dataset, that is

θ̂ = g(x[0], x[1], . . . , x[N − 1])

which may be used to estimate θ (single parameter case).
(in our DC level estimation problem, θ = A)
◦ This defines the problem of “parameter estimation”
◦ Also need to determine g(·)
◦ Theory and techniques of statistical estimation are available
◦ Estimation based on PDFs which contain unknown but deterministic parameters is termed classical estimation

◦ In Bayesian estimation, the unknown parameters are assumed to be random variables, which may be prescribed “a priori” to lie within some range of allowable parameters (or desired performance)
The statistical estimation problem
First step: to model, mathematically, the data
◦ We employ a Probability Density Function (PDF) to describe the inherently random measurement process, that is
p(x[0], x[1], . . . , x[N − 1]; θ)
which is “parametrised” by the unknown parameter θ
Example: for N = 1, and θ denoting the mean value, a generic form of PDF for the class of Gaussian PDFs with any value of θ is given by

p(x[0]; θ) = (1/√(2πσ²)) exp[ −(1/(2σ²)) (x[0] − θ)² ]

(Figure: Gaussian PDFs centred at θ = 1 and θ = 2, evaluated at the observed x[0].) Clearly, the observed value of x[0] impacts upon the likely value of θ.
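As a quick numerical sketch of this point (the observed value x[0] and unit variance below are hypothetical, chosen only for illustration), the Gaussian likelihood above can be evaluated for a few candidate values of θ:

```python
import numpy as np

def likelihood(x0, theta, sigma2=1.0):
    """Gaussian PDF p(x[0]; theta) of a single observation with mean theta."""
    return np.exp(-(x0 - theta) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x0 = 1.2  # hypothetical observed value
for theta in (0.0, 1.0, 2.0):
    print(f"theta = {theta}: p(x[0]; theta) = {likelihood(x0, theta):.4f}")
```

The candidate mean closest to the observed x[0] yields the largest likelihood, which is exactly why the observation constrains the likely value of θ.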
Estimator vs. Estimate
The parameter to be estimated is then viewed as a realisation of the random variable θ
◦ Data are described by the joint PDF of the data and parameters:
p(x, θ) = p(x | θ) · p(θ),   where p(x | θ) is the conditional PDF and p(θ) the prior PDF
◦ An estimator is a rule that assigns a value of θ̂ to each realisation of x = [x[0], . . . , x[N − 1]]^T

◦ An estimate of θ (where θ is also called the ’estimandum’) is the value obtained for a given realisation of x = [x[0], . . . , x[N − 1]]^T, in the form θ̂ = g(x)
Example: for a noisy straight line:

p(x; θ) = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A − Bn)² ]
◦ Performance is critically dependent upon this PDF assumption - the estimator should be robust to slight mismatch between the measurement and the PDF assumption
Example: finding the parameters of a straight line
Specification of the PDF is critical in determining a good estimator
In practice, we choose a PDF which fits the problem constraints and any “a priori” information; but it must also be mathematically tractable.

Example: Assume that “on the average” the data are increasing.

(Figure: the noisy line x[n] scattered about the ideal noiseless line, with intercept A at n = 0.)

Data: straight line embedded in random noise w[n] ∼ N(0, σ²)

x[n] = A + Bn + w[n],   n = 0, 1, . . . , N − 1

p(x; A, B) = (1/(2πσ²)^{N/2}) exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A − Bn)² ]

Unknown parameters: A, B ⇔ θ ≡ [A B]^T

Careful: what would be the effects of bias in A and B?
Bias in parameter estimation
Estimation theory (scalar case): estimate the value of an unknown parameter, θ, from a set of observations of a random variable described by that parameter

θ̂ = g(x[0], x[1], . . . , x[N − 1])

Example: given a set of observations from a Gaussian distribution, estimate the mean or variance from these observations.

◦ Recall that in linear mean square estimation, when estimating a value of random variable y from an observation of a related random variable x, the coefficients a and b in the estimate ŷ = ax + b depend upon the mean and variance of x and y as well as on their correlation.

The difference between the expected value of the estimate and the actual value θ is called the bias and will be denoted by B.

B = E{θ̂_N} − θ

where θ̂_N denotes the estimate over N data samples, x[0], . . . , x[N − 1]
Asymptotic unbiasedness
If the bias is zero, then the expected value of the estimate is equal to the true value, that is

E{θ̂_N} = θ,   i.e.   B = E{θ̂_N} − θ = 0

and the estimate is said to be unbiased.

If B ≠ 0 then the estimator θ̂ = g(x) is said to be biased.
Example: Consider the sample mean estimator of the signal x[n] = A + w[n], w ∼ N(0, 1), given by

Â = x̄ = (1/(N + 2)) Σ_{n=0}^{N−1} x[n],   that is, θ̂ = Â

Is the above sample mean estimator of the true mean A biased?

More often: an estimator is biased but the bias B → 0 when N → ∞

lim_{N→∞} E{θ̂_N} = θ

Such an estimator is said to be asymptotically unbiased.
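A small Monte Carlo sketch makes the asymptotic unbiasedness of the 1/(N + 2) sample mean above concrete: E{Â_N} = N·A/(N + 2), so the bias −2A/(N + 2) vanishes as N grows (the true level A = 2 and unit-variance WGN below are assumed for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = 2.0  # hypothetical true DC level

for N in (10, 100, 10000):
    x = A + rng.standard_normal((2000, N))   # 2000 realisations of x[n] = A + w[n]
    A_hat = x.sum(axis=1) / (N + 2)          # the biased 1/(N+2) sample mean
    print(f"N = {N:5d}: mean of A_hat = {A_hat.mean():.4f} "
          f"(theory: N*A/(N+2) = {N * A / (N + 2):.4f})")
```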
How about the variance?
◦ It is desirable that an estimator be either unbiased or asymptotically unbiased (think about the power of estimation error due to DC offset)

◦ For an estimate to be meaningful, it is necessary that we use the available statistics effectively, that is,
Var → 0 as N → ∞

or in other words

lim_{N→∞} var{θ̂_N} = lim_{N→∞} E{ |θ̂_N − E{θ̂_N}|² } = 0

If θ̂_N is unbiased then E{θ̂_N} = θ, and from the Tchebycheff inequality, ∀ ε > 0

Pr{ |θ̂_N − θ| ≥ ε } ≤ var{θ̂_N} / ε²

⇒ if Var → 0 as N → ∞, then the probability that θ̂_N differs by more than ε from the true value will go to zero (showing consistency).

In this case, θ̂_N is said to converge to θ with probability one.
Mean square convergence
Another form of convergence, stronger than convergence with probability one, is mean square convergence.

An estimate θ̂_N is said to converge to θ in the mean–square sense, if

lim_{N→∞} E{ |θ̂_N − θ|² } = 0    (the quantity under the limit is the mean square error)

◦ For an unbiased estimator this is equivalent to the previous condition that the variance of the estimate goes to zero

◦ An estimate is said to be consistent if it converges, in some sense, to the true value of the parameter

◦ We say that the estimator is consistent if it is asymptotically unbiased and has a variance that goes to zero as N → ∞
Example: Assessing the performance of the Sample Mean as an estimator
Consider the estimation of a DC level, A, in random noise, which could be modelled as

x[n] = A + w[n]

where w[n] is some zero mean random process.

◦ Aim: to estimate A given {x[0], x[1], . . . , x[N − 1]}

◦ Intuitively, the sample mean is a reasonable estimator

Â = (1/N) Σ_{n=0}^{N−1} x[n]

Q1: How close will Â be to A?

Q2: Are there better estimators than the sample mean?
Mean and variance of the Sample Mean estimator
x[n] = A + w[n],   w[n] ∼ N(0, σ²)

Estimator = f( random data ) ⇒ a random variable itself
⇒ its performance must be judged statistically

(1) What is the mean of Â?

E{Â} = E{ (1/N) Σ_{n=0}^{N−1} x[n] } = (1/N) Σ_{n=0}^{N−1} E{x[n]} = A   ⇒ unbiased

(2) What is the variance of Â?

Assumption: The samples w[n] are uncorrelated

var{Â} = var{ (1/N) Σ_{n=0}^{N−1} x[n] } = (1/N²) Σ_{n=0}^{N−1} var{x[n]} = (1/N²) N σ² = σ²/N

Notice the variance → 0 as N → ∞   ⇒ consistent (see your P&A sets)
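Both results can be checked numerically. The sketch below (the values of A, σ², N and the number of Monte Carlo runs M are hypothetical) averages the sample mean over many independent realisations and compares against E{Â} = A and var{Â} = σ²/N:

```python
import numpy as np

rng = np.random.default_rng(1)
A, sigma2, N, M = 5.0, 4.0, 50, 20000   # hypothetical values; M Monte Carlo runs

# M independent realisations of x[n] = A + w[n], n = 0..N-1
x = A + np.sqrt(sigma2) * rng.standard_normal((M, N))
A_hat = x.mean(axis=1)                  # sample mean of each realisation

print(f"E{{A_hat}}   ~ {A_hat.mean():.3f}  (theory: A = {A})")
print(f"var{{A_hat}} ~ {A_hat.var():.4f} (theory: sigma^2/N = {sigma2 / N})")
```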
Minimum Variance Unbiased (MVU) estimation
Aim: to establish “good” estimators of unknown deterministic parameters
Unbiased estimator ⇒ “on the average” yields the true value of the unknown parameter, independent of its particular value, i.e.

E(θ̂) = θ,   a < θ < b

where (a, b) denotes the range of possible values of θ

Example: Unbiased estimator for a DC level in White Gaussian Noise (WGN). If we are given

x[n] = A + w[n],   n = 0, 1, . . . , N − 1

where A is the unknown, but deterministic, parameter to be estimated which lies within the interval (−∞, ∞), then the sample mean can be used as an estimator of A, namely

Â = (1/N) Σ_{n=0}^{N−1} x[n]
Careful: the estimator is parameter dependent!
An estimator may be unbiased for certain values of the unknown parameter but not for all; such an estimator is not unbiased.

Consider another sample mean estimator:

Ǎ = (1/(2N)) Σ_{n=0}^{N−1} x[n]

Therefore: E{Ǎ} = 0 when A = 0, but E{Ǎ} = A/2 when A ≠ 0 (parameter dependent)

Hence Ǎ is not an unbiased estimator
◦ A biased estimator introduces a “systematic error” which should not generally be present

◦ Our goal is to avoid bias if we can, as we are interested in stochastic signal properties and bias is largely deterministic
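The parameter dependence of Ǎ is easy to reproduce in simulation; a minimal sketch, assuming unit-variance WGN and two hypothetical values of A:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 100, 50000   # N samples per realisation, M Monte Carlo runs

for A in (0.0, 4.0):
    x = A + rng.standard_normal((M, N))   # x[n] = A + w[n]
    A_check = x.sum(axis=1) / (2 * N)     # the (1/2N) "sample mean" estimator
    print(f"A = {A}: E{{A_check}} ~ {A_check.mean():.3f}  (theory: A/2 = {A / 2})")
```

The estimator is only unbiased at the single point A = 0, confirming that unbiasedness must hold for every admissible parameter value.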
Remedy
(also look in the Assignment dealing with PSD in your CW )
Several unbiased estimates of the same quantity may be averaged together, i.e. given the L independent estimates

{ θ̂_1, θ̂_2, . . . , θ̂_L }

we may choose to average them, to yield

θ̄ = (1/L) Σ_{l=1}^{L} θ̂_l
Our assumption was that the individual estimators are unbiased, with equalvariance, and uncorrelated with one another.
Then (NB: averaging biased estimators will not remove the bias)

E{θ̄} = θ

and

var{θ̄} = (1/L²) Σ_{l=1}^{L} var{θ̂_l} = (1/L) var{θ̂_l}

Note, as L → ∞, θ̄ → θ (consistent)
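The 1/L variance reduction can be sketched directly (the true value θ, single-estimate variance, L and the number of trials M below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, var_single, L, M = 1.0, 1.0, 16, 20000  # hypothetical values

# M trials, each containing L independent unbiased estimates of theta
estimates = theta + np.sqrt(var_single) * rng.standard_normal((M, L))
theta_bar = estimates.mean(axis=1)             # average the L estimates

print(f"E{{theta_bar}}   ~ {theta_bar.mean():.3f}  (theory: {theta})")
print(f"var{{theta_bar}} ~ {theta_bar.var():.4f} (theory: var/L = {var_single / L})")
```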
Effects of averaging for real world data
Problem 3.4 from your P/A sets: heart rate estimation
The heart rate, h, of a patient is automatically recorded by a computer every 100 ms. In one second the measurements { ĥ_1, ĥ_2, . . . , ĥ_10 } are averaged to obtain ĥ. Given that E{ĥ_i} = αh for some constant α and var(ĥ_i) = 1 for all i, determine whether averaging improves the estimator if α = 1 and α = 1/2.

ĥ = (1/10) Σ_{i=1}^{10} ĥ_i,   E{ĥ} = (α/10) Σ_{i=1}^{10} h = αh

If α = 1 the estimator is unbiased; if α = 1/2 it will not be unbiased unless the estimator is formed as ĥ = (1/5) Σ_{i=1}^{10} ĥ_i.

var{ĥ} = (1/10²) Σ_{i=1}^{10} var{ĥ_i} = 1/10

So averaging reduces the variance from 1 to 1/10; for α = 1 this is a clear improvement, while for α = 1/2 the bias h/2 remains unless the estimator is rescaled.

(Figure: PDFs of ĥ_i before and after averaging, for α = 1 and α = 1/2; averaging narrows the PDF about αh.)
Minimum variance criterion
⇒ An optimality criterion is necessary to define an optimal estimator
Mean Square Error (MSE)

MSE(θ̂) = E{ (θ̂ − θ)² }

measures the average squared deviation of the estimator from the true value.

This criterion leads, however, to unrealisable estimators - namely, ones which are not solely a function of the data

MSE(θ̂) = E{ [ (θ̂ − E(θ̂)) + (E(θ̂) − θ) ]² } = var(θ̂) + [ E(θ̂) − θ ]² = var(θ̂) + B²(θ̂)
⇒ MSE = VARIANCE OF THE ESTIMATOR + SQUARED BIAS
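The decomposition above can be verified numerically. The sketch below deliberately uses a biased estimator (a gain-scaled sample mean; the gain a = 0.8 and the values of A, σ², N are hypothetical) so that both terms are non-zero:

```python
import numpy as np

rng = np.random.default_rng(4)
A, sigma2, N, a, M = 3.0, 1.0, 20, 0.8, 100000  # hypothetical gain a introduces bias

x = A + np.sqrt(sigma2) * rng.standard_normal((M, N))
A_hat = a * x.mean(axis=1)                       # biased: E{A_hat} = a*A

mse = np.mean((A_hat - A) ** 2)
var = A_hat.var()
bias = A_hat.mean() - A
print(f"MSE ~ {mse:.4f},  var + bias^2 ~ {var + bias ** 2:.4f}")  # the two agree
```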
Example: An MSE estimator with ’gain factor’
Consider the following estimator for a DC level in WGN

Â = a (1/N) Σ_{n=0}^{N−1} x[n]

Task: Find the a which results in minimum MSE

Given

E{Â} = aA   and   var(Â) = a²σ²/N

we have

MSE(Â) = a²σ²/N + (a − 1)² A²

Of course, the choice of a = 1 removes the bias, but it does not necessarily minimise the MSE
c© D. P. Mandic Advanced Signal Processing, Spring 2015 20
Continued: MSE estimator with ’gain’
But, can we find a analytically? Differentiating with respect to a yields

∂MSE(Â)/∂a = 2aσ²/N + 2(a − 1)A²

and setting the result to zero gives the optimal value

a_opt = A² / (A² + σ²/N)

but we do not know the value of A

◦ The optimal value depends upon A, which is the unknown parameter

◦ Comment - any criterion which depends on the value of the unknown parameter to be found is likely to yield unrealisable estimators

◦ Practically, the minimum MSE estimator needs to be abandoned, and the estimator must be constrained to be unbiased
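A quick grid search over a (the values of A, σ² and N are hypothetical) confirms that the MSE minimum sits at a_opt = A²/(A² + σ²/N) rather than at a = 1:

```python
import numpy as np

A, sigma2, N = 1.0, 1.0, 10   # hypothetical true values

def mse(a):
    """Theoretical MSE(A_hat) = a^2 sigma^2 / N + (a - 1)^2 A^2."""
    return a ** 2 * sigma2 / N + (a - 1) ** 2 * A ** 2

a_grid = np.linspace(0.0, 1.5, 3001)
a_best = a_grid[np.argmin(mse(a_grid))]
a_opt = A ** 2 / (A ** 2 + sigma2 / N)   # analytic optimum, depends on unknown A
print(f"grid minimum at a ~ {a_best:.3f}, analytic a_opt = {a_opt:.3f}")
```

The optimum depends on the unknown A itself, which is precisely why this minimum-MSE estimator is unrealisable in practice.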
A counter-example: A little bias can help
(but the estimator is difficult to control)
Q: Let {y[n]}, n = 1, . . . , N be iid Gaussian variables ∼ N(0, σ²). Consider the following estimate of σ²:

σ̂² = (α/N) Σ_{n=1}^{N} y²[n],   α > 0

Find the α which minimises the MSE of σ̂².

S: It is straightforward to show that E{σ̂²} = ασ², and

MSE(σ̂²) = E{ (σ̂² − σ²)² } = E{σ̂⁴} + σ⁴(1 − 2α)

= (α²/N²) Σ_{n=1}^{N} Σ_{s=1}^{N} E{ y²[n] y²[s] } + σ⁴(1 − 2α)

= (α²/N²) ( N²σ⁴ + 2Nσ⁴ ) + σ⁴(1 − 2α) = σ⁴ [ α²(1 + 2/N) + (1 − 2α) ]

The minimum MSE is obtained for α_min = N/(N + 2) and has the value min MSE(σ̂²) = 2σ⁴/(N + 2). Given that the minimum variance of an unbiased estimator (CRLB, later) is 2σ⁴/N, this is an example of a biased estimator which attains a lower MSE than the CRLB.
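A Monte Carlo check of this counter-example (the values of σ², N and the number of runs M are hypothetical) compares the empirical MSE of the α = 1 estimator against the α = N/(N + 2) choice:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, N, M = 2.0, 10, 200000   # hypothetical variance; M Monte Carlo runs

y = np.sqrt(sigma2) * rng.standard_normal((M, N))
s2 = (y ** 2).mean(axis=1)       # (1/N) * sum of y^2, i.e. the alpha = 1 estimate

mse_1   = np.mean((1.0 * s2 - sigma2) ** 2)          # theory: 2*sigma^4/N
mse_opt = np.mean((N / (N + 2) * s2 - sigma2) ** 2)  # theory: 2*sigma^4/(N+2)
print(f"alpha = 1:       MSE ~ {mse_1:.4f} (theory {2 * sigma2 ** 2 / N:.4f})")
print(f"alpha = N/(N+2): MSE ~ {mse_opt:.4f} (theory {2 * sigma2 ** 2 / (N + 2):.4f})")
```

The slightly shrunk (biased) estimate trades a small bias for a larger reduction in variance, lowering the overall MSE.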
Example: Estimation of hearing threshold (hearing aids)
Objective estimation of hearing threshold:
◦ Design AM or FM stimuli with carrier frequency up to 20 kHz
◦ Modulate this carrier with e.g. 40 Hz envelope
◦ Present this sound to the listener
◦ Measure the electroencephalogram, which should contain a 40 Hz signal if the subject was able to detect the sound
◦ This is a good objective estimate of the hearing curve
Key points
Simulation: Auditory steady state response (ASSR), 40 Hz amplitude modulated
◦ An estimator is a random variable; its performance can only be completely described statistically or by a PDF.

◦ An estimator’s performance cannot be conclusively assessed by one computer simulation - we often average M trials of independent simulations, called “Monte Carlo” analysis.

◦ Performance and computational complexity are often traded - the optimal estimator is replaced by a suboptimal but realisable one.
(Figure: four EEG time-series panels on the left; on the right, the corresponding PSD panels over 20–45 Hz, with the averaged PSD in the bottom-right panel.)

Simulation: Left column: the 24 s of ITE EEG sampled at 256 Hz was split into 3 segments of length 8 s. Right column: the respective periodograms (rectangular window). Bottom right: the averaged periodogram; observe the reduction in variance (in red).
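The variance reduction from periodogram averaging can be sketched with synthetic data (a weak 40 Hz tone in unit-variance noise; the segment length, sampling rate and tone amplitude below are hypothetical, not the lecture's EEG recording):

```python
import numpy as np

rng = np.random.default_rng(6)
fs, seg_len, n_seg = 256, 256, 8           # 1 s segments at 256 Hz (hypothetical)
f0 = 40.0                                   # target "ASSR-like" component

t = np.arange(seg_len) / fs
segs = (0.5 * np.sin(2 * np.pi * f0 * t)    # weak 40 Hz tone, present in every segment
        + rng.standard_normal((n_seg, seg_len)))

P = np.abs(np.fft.rfft(segs, axis=1)) ** 2 / seg_len  # periodogram of each segment
P_avg = P.mean(axis=0)                                # averaged periodogram

freqs = np.fft.rfftfreq(seg_len, 1 / fs)
noise = freqs > 45                                    # bins away from the tone
print(f"peak of averaged periodogram at {freqs[np.argmax(P_avg)]:.0f} Hz")
print(f"noise-floor variance: single {P[0, noise].var():.2f} vs averaged {P_avg[noise].var():.2f}")
```

Averaging the periodograms of independent segments leaves the 40 Hz peak intact while shrinking the variance of the noise floor roughly by the number of segments, which is exactly the effect highlighted in the figure.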
Desired: minimum variance unbiased (MVU) estimator
Minimising the variance of an unbiased estimator concentrates the PDF of the error about zero ⇒ the estimation error is therefore less likely to be large

◦ Existence of the MVU estimator

The MVU estimator is an unbiased estimator with minimum variance for all θ, that is, θ̂_3 on the graph.
Methods to find the MVU estimator
◦ The MVU estimator may not always exist

◦ A single unbiased estimator may not exist – in which case a search for the MVU is fruitless!

1. Determine the Cramer-Rao lower bound (CRLB) and find some estimator which satisfies it

2. Apply the Rao-Blackwell-Lehmann-Scheffe (RBLS) theorem

3. Restrict the class of estimators to be not only unbiased, but also linear (BLUE)
4. Sequential vs. block estimators
5. Adaptive estimators
Extensions to the vector parameter case
◦ If θ = [θ_1, θ_2, . . . , θ_p]^T ∈ R^{p×1} is a vector of unknown parameters, an estimator is unbiased if

E(θ̂_i) = θ_i,   a_i < θ_i < b_i,   for i = 1, 2, . . . , p

and by defining

E(θ̂) = [ E(θ̂_1), E(θ̂_2), . . . , E(θ̂_p) ]^T

an unbiased estimator has the property E(θ̂) = θ within the p–dimensional space of parameters

◦ An MVU estimator has the additional property that var(θ̂_i), for i = 1, 2, . . . , p, is minimum among all unbiased estimators
Summary and food for thoughts
◦ We are now equipped with performance metrics for assessing the goodness of any estimator (bias, variance, MSE)

◦ Since MSE = var + bias², some biased estimators may yield a low MSE. However, we prefer the minimum variance unbiased (MVU) estimators

◦ Even a simple Sample Mean estimator is a very rich example of the advantages of statistical estimators

◦ The knowledge of the parametrised PDF p(data; parameters) is very important for designing efficient estimators

◦ We have introduced statistical “point estimators”; would it not also be useful to know the “confidence” we have in our point estimate?

◦ In many disciplines it is useful to design so called “set membership estimates”, where the output of an estimator belongs to a pre-defined bound (range) of values
◦ We will next address linear, best linear unbiased, maximum likelihood,least squares, sequential least squares, and adaptive estimators
Homework: Check another proof for the MSE expression
MSE(θ̂) = var(θ̂) + bias²(θ̂)

Note: var(x) = E[x²] − ( E[x] )²   (∗)

Idea: Let x = θ̂ − θ and substitute into (∗) to give

var(θ̂ − θ) = E[ (θ̂ − θ)² ] − ( E[θ̂ − θ] )²   (∗∗)

where the three terms are, from left to right, term (1), term (2) and term (3).

Let us now evaluate these terms:

(1) var(θ̂ − θ) = var(θ̂)

(2) E[ (θ̂ − θ)² ] = MSE

(3) ( E[θ̂ − θ] )² = ( E[θ̂] − θ )² = bias²(θ̂)

Substitute (1), (2), (3) into (∗∗) to give

var(θ̂) = MSE − bias²(θ̂)  ⇒  MSE = var(θ̂) + bias²(θ̂)
Recap: Unbiased estimators
Due to the linearity of the expectation operator E{·}, that is

E{a + b} = E{a} + E{b}

the sample mean operator can simply be shown to be unbiased, i.e.

E{Â} = (1/N) Σ_{n=0}^{N−1} E{x[n]} = (1/N) Σ_{n=0}^{N−1} A = A

◦ In some applications, the value of A may be constrained to be positive - a component value such as an inductor, capacitor or resistor would be positive (prior knowledge)

◦ For N data points in random noise, unbiased estimators generally have symmetric PDFs centred about their true value, i.e.

Â ∼ N(A, σ²/N)