Class Notes
Transcript of Class Notes
1/28
Review of Probability
2/28
Random Variable
Definition: Numerical characterization of the outcome of a random event
Examples:
1) Number on rolled dice
2) Temperature at specified time of day
3) Stock market at close
4) Height of a wheel going over a rocky road
3/28
Random Variable
Non-examples:
1) Heads or Tails on a coin
2) Red or Black ball from an urn
Basic Idea: we don't know how to completely determine what value will occur; we can only specify probabilities of RV values occurring.
But we can make these into RVs
4/28
Two Types of Random Variables
Random Variable
Discrete RV: die, stocks
Continuous RV: temperature, wheel height
5/28
Given a continuous RV X, what is the probability that X = x₀?
Oddity: P(X = x₀) = 0. Otherwise the probability would sum to infinity.
Need to think of Prob. Density Function (PDF)
pX(x)
x
The Probability density function of RV X
P(x₀ < X < x₀ + ∆) = ∫ from x₀ to x₀+∆ of p_X(x) dx = area shown under p_X(x)
PDF for Continuous RV
6/28
Most Commonly Used PDF: Gaussian
m & σ are parameters of the Gaussian PDF:
m = mean of RV X
σ = standard deviation of RV X (note: σ > 0)
σ² = variance of RV X
p_X(x) = (1/√(2πσ²)) exp( −(x − m)² / (2σ²) )
A RV X with the following PDF is called a Gaussian RV
Notation: When X has Gaussian PDF we say X ~ N(m,σ 2)
7/28
" Generally: take the noise to be Zero Mean
22 2
21)( σ
πσx
x exp =
Zero-Mean Gaussian PDF
8/28
Small σ → small variability (small uncertainty)
Large σ → large variability (large uncertainty)
[Figure: Gaussian PDFs p_X(x) vs. x with small and large σ, centered at x = m; the area within ±1σ of the mean is 0.683 = 68.3%.]
Effect of Variance on Gaussian PDF
9/28
"Central Limit theorem (CLT)The sum of N independent RVs has a pdfthat tends to be Gaussian as N →∞
"So What! Here is what : Electronic systems generate internal noise due to random motion of electrons in electronic components. The noise is the result of summing the random effects of lots of electrons.
CLT applies Guassian Noise
Why Is Gaussian Used?
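As a quick numerical illustration of the CLT argument above (a minimal sketch, assuming NumPy; the uniform noise model and the sample sizes are arbitrary choices, not from the notes), the snippet below sums many independent uniform RVs and compares the result against the Gaussian the CLT predicts.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
N_terms = 30          # number of independent RVs summed (arbitrary choice)
N_trials = 100_000    # number of Monte Carlo realizations of the sum

# Each term is uniform on [-0.5, 0.5]: mean 0, variance 1/12
terms = rng.uniform(-0.5, 0.5, size=(N_trials, N_terms))
sums = terms.sum(axis=1)

# CLT prediction: the sum is approximately N(0, N_terms/12)
print("sample mean of sum:", sums.mean())
print("sample var  of sum:", sums.var(), " (CLT predicts", N_terms / 12, ")")

# Compare the empirical CDF of the sum to the Gaussian CDF at a few points
sigma = sqrt(N_terms / 12)
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    empirical = np.mean(sums <= x)
    gaussian = 0.5 * (1 + erf(x / (sigma * sqrt(2))))
    print(f"P(sum <= {x:+.1f}): empirical {empirical:.4f}  Gaussian {gaussian:.4f}")
```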
10/28
Describes probabilities of joint events concerning X and Y. For example, the probability that X lies in the interval [a,b] and Y lies in the interval [c,d] is given by:
Pr( (a < X < b) and (c < Y < d) ) = ∫_a^b ∫_c^d p_XY(x, y) dy dx
This graph shows the joint PDF. (Graph from B. P. Lathi's book: Modern Digital & Analog Communication Systems.)
Joint PDF of RVs X and Y
11/28
When you have two RVs you often ask: what is the PDF of Y if X is constrained to take on a specific value?
In other words: what is the PDF of Y conditioned on the fact that X is constrained to take on a specific value?
Ex.: Husband's salary Y conditioned on wife's salary X = $100K.
First find all wives who make EXACTLY $100K; how are their husbands' salaries distributed?
This depends on the joint PDF because there are two RVs, but it should only depend on the slice of the joint PDF at X = $100K.
Now we have to adjust this to account for the fact that the joint PDF (even its slice) reflects how likely it is that the conditioning value will occur (e.g., if x = 105 is an unlikely wife's salary then p_XY(105, y) will be small); so if we divide by p_X(105) we adjust for this.
Conditional PDF of Two RVs
12/28
Conditional PDF (cont.)
Thus, the conditional PDFs are defined as (slice and normalize):
p_{Y|X}(y|x) = p_XY(x, y) / p_X(x)   if p_X(x) ≠ 0,   = 0 otherwise      (x is held fixed)
p_{X|Y}(x|y) = p_XY(x, y) / p_Y(y)   if p_Y(y) ≠ 0,   = 0 otherwise      (y is held fixed)
This graph shows the conditional PDF: slice and normalize. (Graph from B. P. Lathi's book: Modern Digital & Analog Communication Systems.)
13/28
Independence should be thought of as saying that:
neither RV impacts the other statistically thus, the values that one will likely take should be irrelevant to the value that the other has taken.
In other words: conditioning doesn't change the PDF!!!
p_{Y|X}(y|x) = p_XY(x, y) / p_X(x) = p_Y(y)
p_{X|Y}(x|y) = p_XY(x, y) / p_Y(y) = p_X(x)
Independent RVs
14/28
Independent and Dependent Gaussian PDFs
[Figure: contours of p_XY(x, y) in the (x, y) plane for three cases: independent (zero mean), independent (non-zero mean), and dependent.]
If X & Y are independent, then the contour ellipses are aligned with the x and y axes.
For the independent cases, different slices give the same normalized curve; for the dependent case, different slices give different normalized curves.
15/28
RVs X & Y are independent if:
p_XY(x, y) = p_X(x) p_Y(y)
Here's why:
p_{Y|X}(y|x) = p_XY(x, y) / p_X(x) = p_X(x) p_Y(y) / p_X(x) = p_Y(y)
An Independent RV Result
16/28
Characterizing RVs: the PDF tells everything about an RV, but sometimes that is more than we need or know. So we make do with a few characteristics:
• Mean of an RV (describes the centroid of the PDF)
• Variance of an RV (describes the spread of the PDF)
• Correlation of RVs (describes the tilt of the joint PDF)
Mean = Average = Expected Value
Symbolically: E{X}
17/28
Motivation first w/ the Data Analysis View. Consider the RV X = score on a test. Data: x₁, x₂, …, x_N.
Possible values of RV X: V₀, V₁, V₂, …, V₁₀₀ (scores 0, 1, 2, …, 100).
N_i = # of scores of value V_i;   N = total # of scores = Σ_i N_i.
Test Average = (1/N) Σ_n x_n = (V₀N₀ + V₁N₁ + … + V₁₀₀N₁₀₀)/N = Σ_{i=0}^{100} V_i (N_i/N),   where N_i/N ≈ P(X = V_i).
This is called the Data Analysis View (Statistics), but it motivates the Data Modeling View (Probability).
Motivating Idea of Mean of RV
18/28
Theoretical View of Mean
" For Discrete random Variables :
Data Analysis View leads to Probability Theory:
" This Motivates form for Continuous RV:
Probability Density Function
∑=
=n
niXi xPxXE
1)(
∫∞
∞−
= dxxpxXE X )(
Probability Function
Notation: XXE =
Data Modeling
Shorthand Notation
19/28
Aside: Probability vs. Statistics
Probability Theory:
» Given a PDF model
» Describe how the data will likely behave
E{X} = ∫_{−∞}^{∞} x p_X(x) dx   (x is a dummy variable)
There is no DATA here!!! The PDF models how the data will likely behave.
Statistics:
» Given a set of data
» Determine how the data did behave
Avg = (1/N) Σ_{i=1}^{N} x_i   (data)
There is no PDF here!!! The statistic measures how the data did behave.
The two views are linked by the Law of Large Numbers: Avg ≈ E{X}.
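A small sketch of this probability-vs-statistics link (the Gaussian model and the sample sizes are illustrative assumptions, not part of the notes): the statistic Avg = (1/N)Σx_i computed from data approaches the probability-theory quantity E{X} as N grows.

```python
import numpy as np

rng = np.random.default_rng(1)
m, sigma = 3.0, 2.0            # model: X ~ N(m, sigma^2), so E{X} = m

for N in (10, 100, 10_000, 1_000_000):
    x = rng.normal(m, sigma, size=N)   # "data" drawn from the PDF model
    avg = x.mean()                     # the statistic: (1/N) * sum(x_i)
    print(f"N = {N:>9d}   Avg = {avg:.4f}   E{{X}} = {m}")
```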
20/28
Variance: Characterizes how much you expect the RV to Deviate Around the Mean
There are similar Data vs. Theory views here, but let's go right to the theory!!
Variance:
σ_x² = E{ (X − m_x)² } = ∫ (x − m_x)² p_X(x) dx
Note: if zero mean,
σ_x² = E{X²} = ∫ x² p_X(x) dx
Variance of RV
21/28
Motivating Idea of Correlation
Consider a random experiment that observes the outcomes of two RVs:Example: 2 RVs X and Y representing height and weight, respectively
Motivate first w/ the Data Analysis View.
[Figure: scatter plot of (x, y) data points with an upward trend: positively correlated.]
22/28
Illustrating 3 Main Types of Correlation
Positive Correlation ("best friends"): e.g., GPA & starting salary
Negative Correlation ("worst enemies"): e.g., height & $ in pocket
Zero Correlation (i.e., uncorrelated; "complete strangers"): e.g., student loans & parents' salary
[Figure: three scatter plots of (x − x̄, y − ȳ) illustrating positive, negative, and zero correlation.]
Data Analysis View:  C_xy = (1/N) Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ)
23/28
To capture this, define the Covariance:
σ_XY = E{ (X − X̄)(Y − Ȳ) } = ∫∫ (x − X̄)(y − Ȳ) p_XY(x, y) dx dy
If the RVs are both zero-mean:  σ_XY = E{XY}
If X = Y:  σ_XY = σ_X² = σ_Y²
Prob. Theory View of Correlation
If X & Y are independent, then: 0=XYσ
24/28
If σ_XY = E{ (X − X̄)(Y − Ȳ) } = 0, then we say that X and Y are uncorrelated.
In that case E{XY} = X̄ Ȳ, where E{XY} is called the correlation of X & Y.
So RVs X and Y are said to be uncorrelated if σ_XY = 0, or equivalently if E{XY} = E{X}E{Y}.
25/28
X & Y are Independent
Implies X & Y are Uncorrelated
Uncorrelated
Independence
INDEPENDENCE IS A STRONGER CONDITION !!!!
)()(
),(
yfxf
yxf
YX
XY
=
YEXE
XYE
=
Independence vs. Uncorrelated
PDFs Separate Means Separate
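To make the "uncorrelated does not imply independent" point concrete, here is a small numerical sketch (the choice X ~ N(0,1), Y = X² is a standard counterexample assumed for illustration, not one given in the notes): X and Y are clearly dependent, yet their covariance is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=1_000_000)
y = x**2                      # Y is completely determined by X (dependent!)

cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
print("sample covariance of X and Y:", cov_xy)        # ~ 0  -> uncorrelated
print("E{XY} =", np.mean(x * y), "  E{X}E{Y} =", np.mean(x) * np.mean(y))
# But p_XY(x, y) clearly does not factor: knowing X fixes Y exactly.
```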
26/28
Covariance:   σ_XY = E{ (X − X̄)(Y − Ȳ) }
Correlation:   E{XY}    (same as the covariance if zero mean)
Correlation Coefficient:   ρ_XY = σ_XY / (σ_X σ_Y),   with −1 ≤ ρ_XY ≤ 1
Confusing Covariance and Correlation Terminology
27/28
Correlation Matrix: for a random vector x = [X₁ X₂ … X_N]^T,
R_x = E{ x x^T } =
  [ E{X₁X₁}   E{X₁X₂}   …   E{X₁X_N}  ]
  [ E{X₂X₁}   E{X₂X₂}   …   E{X₂X_N}  ]
  [     ⋮          ⋮        ⋱       ⋮      ]
  [ E{X_NX₁}  E{X_NX₂}  …   E{X_NX_N} ]
Covariance Matrix:   C_x = E{ (x − x̄)(x − x̄)^T }
Covariance and Correlation For Random Vectors
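A short sketch of these definitions (a minimal example assuming NumPy; the 3-dimensional Gaussian vector is an arbitrary illustration): it forms sample versions of R_x and C_x and compares C_x with NumPy's built-in estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
N_trials = 200_000

# Build a 3-D random vector x = A w + mu with known covariance A A^T
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
mu = np.array([1.0, -2.0, 0.5])
w = rng.normal(size=(N_trials, 3))
x = w @ A.T + mu                       # each row is one realization of x

R_x = (x[:, :, None] * x[:, None, :]).mean(axis=0)    # sample E{x x^T}
xc = x - x.mean(axis=0)
C_x = (xc[:, :, None] * xc[:, None, :]).mean(axis=0)  # sample E{(x-xbar)(x-xbar)^T}

print("sample C_x:\n", C_x)
print("true  C_x = A A^T:\n", A @ A.T)
print("np.cov agrees:\n", np.cov(x, rowvar=False))
print("R_x - C_x - mu mu^T ~ 0:\n", R_x - C_x - np.outer(mu, mu))
```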
28/28
E{X + Y} = E{X} + E{Y}

var(X + Y) = E{ (X + Y)² } − ( E{X + Y} )²
           = E{ X² + 2XY + Y² } − ( E{X} + E{Y} )²
           = σ_X² + σ_Y² + 2σ_XY,    where σ_XY = E{XY} − E{X}E{Y}
           = σ_X² + σ_Y²,    if X & Y are uncorrelated

E{f(X)} = ∫ f(x) p_X(x) dx        E{aX} = a E{X}        var(aX) = a² σ_X²
A Few Properties of Expected Value
1/45
Review of Matrices and Vectors
2/45
Definition of Vector: A collection of complex or real numbers, generally put in a column
v = [v₁ v₂ … v_N]^T   (the superscript T denotes transpose; v is the column vector with elements v₁, …, v_N)

Definition of Vector Addition: add element-by-element:
a + b = [a₁ + b₁   a₂ + b₂   …   a_N + b_N]^T
Vectors & Vector Spaces
3/45
Definition of Scalar: A real or complex number.
If the vectors of interest are complex valued then the set of scalars is taken to be complex numbers; if the vectors of interest are real valued then the set of scalars is taken to be real numbers.
Multiplying a Vector by a Scalar:
αa = [αa₁   αa₂   …   αa_N]^T
• changes the vector's length if |α| ≠ 1
• reverses its direction if α < 0
4/45
Arithmetic Properties of Vectors: vector addition and scalar multiplication exhibit the following properties, much like the real numbers do.
Let x, y, and z be vectors of the same dimension and let α and β be scalars; then the following properties hold:
1. Commutativity:   x + y = y + x;   αx = xα
2. Associativity:   (x + y) + z = x + (y + z);   α(βx) = (αβ)x
3. Distributivity:   α(x + y) = αx + αy;   (α + β)x = αx + βx
4. Scalar Unity & Scalar Zero:   1x = x;   0x = 0, where 0 is the zero vector of all zeros
5/45
Definition of a Vector Space: A set V of N-dimensional vectors (with a corresponding set of scalars) such that the set of vectors is:
(i) closed under vector addition(ii) closed under scalar multiplication
In other words: addition of vectors gives another vector in the set multiplying a vector by a scalar gives another vector in the set
Note: this means that ANY linear combination of vectors in the space results in a vector in the space. If v₁, v₂, and v₃ are all vectors in a given vector space V, then
v = α₁v₁ + α₂v₂ + α₃v₃ = Σ_{i=1}^{3} α_i v_i
is also in the vector space V.
6/45
Axioms of Vector Space: If V is a set of vectors satisfying the above definition of a vector space then it satisfies the following axioms:
1. Commutativity (see above)
2. Associativity (see above)
3. Distributivity (see above)
4. Unity and Zero Scalar (see above)
5. Existence of an Additive Identity any vector space V must have a zero vector
6. Existence of Negative Vector: For every vector v in V its negative must also be in V
So a vector space is nothing more than a set of vectors with an arithmetic structure
7/45
Def. of Subspace: Given a vector space V, a subset of vectors in Vthat itself is closed under vector addition and scalar multiplication (using the same set of scalars) is called a subspace of V.
Examples:
1. The space R2 is a subspace of R3.
2. Any plane in R3 that passes through the origin is a subspace
3. Any line passing through the origin in R2 is a subspace of R2
4. The set R² is NOT a subspace of C² because R² isn't closed under complex scalars (a subspace must retain the original space's set of scalars)
8/45
Length of a Vector (Vector Norm): for any vector v in C^N we define its length (or norm) to be
||v||₂ = ( Σ_{i=1}^{N} |v_i|² )^{1/2},   so that   ||v||₂² = Σ_{i=1}^{N} |v_i|²
Properties of the Vector Norm:
||αv||₂ = |α| ||v||₂
||α₁v₁ + α₂v₂||₂ ≤ |α₁| ||v₁||₂ + |α₂| ||v₂||₂
||v||₂ < ∞   ∀ v ∈ C^N
||v||₂ = 0   iff v = 0
Geometric Structure of Vector Space
9/45
Distance Between Vectors: the distance between two vectors in a vector space with the two norm is defined by:
d(v₁, v₂) = ||v₁ − v₂||₂
Note that: d(v₁, v₂) = 0 iff v₁ = v₂
[Figure: vectors v₁ and v₂ and the distance between their tips.]
10/45
Angle Between Vectors & Inner Product:
Motivate the idea in R²: let u = [1 0]^T and v = A [cos θ  sin θ]^T.
Note that:  Σ_{i=1}^{2} u_i v_i = A cos θ · 1 + A sin θ · 0 = A cos θ
Clearly we see that this gives a measure of the angle between the vectors.
Now we generalize this idea!
11/45
Inner Product Between Vectors: define the inner product between two complex vectors in C^N by:
<u, v> = Σ_{i=1}^{N} u_i v_i*
Properties of Inner Products:
1. Impact of Scalar Multiplication:   <αu, v> = α <u, v>;   <u, βv> = β* <u, v>
2. Impact of Vector Addition:   <u + v, w> = <u, w> + <v, w>;   <u, v + z> = <u, v> + <u, z>
3. Linking Inner Product to Norm:   <v, v> = ||v||₂²
4. Schwarz Inequality:   |<u, v>| ≤ ||u||₂ ||v||₂
5. Inner Product and Angle:   <u, v> = ||u||₂ ||v||₂ cos(θ)   (look back on the previous page!)
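A brief numerical sketch of properties 3, 4, and 5 for real vectors (the specific vectors are arbitrary assumptions for illustration):

```python
import numpy as np

u = np.array([3.0, 4.0, 0.0])
v = np.array([1.0, 2.0, 2.0])

ip = np.dot(u, v)                       # <u, v> for real vectors
print("<u,v> =", ip)
print("<v,v> =", np.dot(v, v), "  ||v||^2 =", np.linalg.norm(v)**2)
print("Schwarz: |<u,v>| <=", np.linalg.norm(u) * np.linalg.norm(v))

cos_theta = ip / (np.linalg.norm(u) * np.linalg.norm(v))
print("cos(theta) =", cos_theta,
      "  theta =", np.degrees(np.arccos(cos_theta)), "deg")
```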
12/45
Inner Product, Angle, and Orthogonality:
<u, v> / ( ||u||₂ ||v||₂ ) = cos(θ)
(i) This lies between −1 and 1;
(ii) It measures the directional alikeness of u and v:
  = +1 when u and v point in the same direction
  = 0 when u and v are at a right angle
  = −1 when u and v point in opposite directions
Two vectors u and v are said to be orthogonal when <u,v> = 0
If in addition, they each have unit length they are orthonormal
13/45
Can we find a set of prototype vectors v1, v2, , vM from which we can build all other vectors in some given vector space V by using linear combinations of the vi?
Same ingredients, just different amounts of them!!!
v = Σ_{k=1}^{M} α_k v_k        u = Σ_{k=1}^{M} β_k v_k
What we want to be able to do is get any vector just by changing the amounts. To do this requires that the set of prototype vectors v₁, v₂, …, v_M satisfy certain conditions.
We'd also like to have the smallest number of members in the set of prototype vectors.
Building Vectors From Other Vectors
14/45
Span of a Set of Vectors: A set of vectors v1, v2, , vM is said to span the vector space V if it is possible to write each vector v in V as a linear combination of vectors from the set:
v = Σ_{k=1}^{M} α_k v_k
This property establishes whether there are enough vectors in the proposed prototype set to build all possible vectors in V.
It is clear that:
1. We need at least N vectors to span C^N or R^N, but not just any N vectors.
2. Any set of N mutually orthogonal vectors spans C^N or R^N (a set of vectors is mutually orthogonal if all pairs are orthogonal).
[Figure: examples in R²; two non-collinear vectors span R², two collinear vectors do not span R².]
15/45
Linear Independence: A set of vectors v1, v2, , vM is said to be linearly independent if none of the vectors in it can be written as a linear combination of the others.
If a set of vectors is linearly dependent then there is redundancy in the set: it has more vectors than needed to be a prototype set!
For example, say that we have a set of four vectors v₁, v₂, v₃, v₄ and we know that we can build v₂ from v₁ and v₃; then every vector we can build from v₁, v₂, v₃, v₄ can also be built from only v₁, v₃, v₄.
It is clear that:
1. In C^N or R^N we can have no more than N linearly independent vectors.
2. Any set of mutually orthogonal vectors is linearly independent (a set of vectors is mutually orthogonal if all pairs are orthogonal).
[Figure: examples in R²; two non-collinear vectors are linearly independent, two collinear vectors are not.]
16/45
Basis of a Vector Space: A basis of a vector space is a set of linear independent vectors that span the space.
Span says there are enough vectors to build everything
Linear Indep says that there are not more than needed
Orthonormal (ON) Basis: a basis whose vectors are orthonormal to each other (all pairs of basis vectors are orthogonal and each basis vector has unit norm).
Fact: Any set of N linearly independent vectors in CN (RN) is a basis of CN (RN).
Dimension of a Vector Space: The number of vectors in anybasis for a vector space is said to be the dimension of the space. Thus, CN and RN each have dimension of N.
17/45
Fact: For a given basis v₁, v₂, …, v_N, the expansion of a vector v in V is unique. That is, for each v there is only one, unique set of coefficients α₁, α₂, …, α_N such that
v = Σ_{k=1}^{N} α_k v_k
In other words, this expansion or decomposition is unique. Thus, for a given basis we can make a 1-to-1 correspondence between vector v and the coefficients α₁, α₂, …, α_N.
We can write the coefficients as a vector, too:  α = [α₁ … α_N]^T,  so that  v ←(1-to-1)→ α.
The expansion can be viewed as a mapping (or transformation) from vector v to vector α.
We can view this transform as taking us from the original vector space into a new vector space made from the coefficient vectors of all the original vectors.
Expansion and Transformation
18/45
Fact: For any given vector space there are an infinite number of possible basis sets.
The coefficients with respect to any of them provide complete information about a vector;
some of them provide more insight into the vector and are therefore more useful for certain signal processing tasks than others.
Often the key to solving a signal processing problem lies in finding the correct basis to use for the expansion; this is equivalent to finding the right transform. See the discussion coming next linking the DFT to these ideas!!!!
19/45
DFT from Basis Viewpoint:
If we have a discrete-time signal x[n] for n = 0, 1, …, N−1,
Define the vector:  x = [x[0]  x[1]  …  x[N−1]]^T
Define an orthogonal basis from the complex exponentials used in the IDFT:
d₀ = [1  1  …  1]^T
d₁ = [1  e^{j2π(1)(1)/N}  …  e^{j2π(1)(N−1)/N}]^T
d₂ = [1  e^{j2π(2)(1)/N}  …  e^{j2π(2)(N−1)/N}]^T
⋮
d_{N−1} = [1  e^{j2π(N−1)(1)/N}  …  e^{j2π(N−1)(N−1)/N}]^T
Then the IDFT equation can be viewed as an expansion of the signal vector x in terms of this complex-sinusoid basis:
x = Σ_{k=0}^{N−1} (X[k]/N) d_k,   with kth coefficient  α_k = X[k]/N
coefficient vector:  α = [X[0]/N   X[1]/N   …   X[N−1]/N]^T
20/45
What's So Good About an ON Basis?: Given any basis v₁, v₂, …, v_N we can write any v in V as
v = Σ_{k=1}^{N} α_k v_k
Given the vector v, how do we find the α's? In general: hard! But for an ON basis: easy!!
If v₁, v₂, …, v_N is an ON basis, then
<v, v_i> = < Σ_{j=1}^{N} α_j v_j , v_i > = Σ_{j=1}^{N} α_j <v_j, v_i> = Σ_{j=1}^{N} α_j δ[i − j] = α_i
⇒  α_i = <v, v_i>
ith coefficient = inner product with the ith ON basis vector
Usefulness of an ON Basis
21/45
Another Good Thing About an ON Basis: they preserve inner products and norms (they are isometric).
If v₁, v₂, …, v_N is an ON basis and u and v are vectors expanded as
v = Σ_{k=1}^{N} α_k v_k        u = Σ_{k=1}^{N} β_k v_k
Then:
1. <v, u> = <α, β>   (preserves inner products)
2. ||v||₂ = ||α||₂ and ||u||₂ = ||β||₂   (preserves norms)
So using an ON basis provides: easy computation via inner products; preservation of geometry (closeness, size, orientation, etc.)
22/45
Example: DFT Coefficients as Inner Products:
Recall: the N-pt. IDFT is an expansion of the signal vector in terms of N orthogonal vectors. Thus
X[k] = <x, d_k> = Σ_{n=0}^{N−1} x[n] d_k*[n] = Σ_{n=0}^{N−1} x[n] e^{−j2πkn/N}
See reading notes for some details about normalization issues in this case
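A short numerical check of the statement above (a sketch assuming NumPy; the test signal is arbitrary): each DFT coefficient X[k] equals the inner product of the signal vector with the k-th complex-sinusoid basis vector, and the IDFT rebuilds x as a weighted sum of those basis vectors.

```python
import numpy as np

N = 8
rng = np.random.default_rng(4)
x = rng.normal(size=N)                      # arbitrary real test signal

n = np.arange(N)
X_inner = np.array([np.sum(x * np.exp(-1j * 2 * np.pi * k * n / N))  # <x, d_k>
                    for k in range(N)])
X_fft = np.fft.fft(x)
print("max |difference| vs np.fft.fft:", np.max(np.abs(X_inner - X_fft)))

# IDFT as a basis expansion: x = (1/N) * sum_k X[k] d_k
d = np.exp(1j * 2 * np.pi * np.outer(n, np.arange(N)) / N)  # columns are d_k
x_rebuilt = (d @ X_fft) / N
print("max reconstruction error:", np.max(np.abs(x_rebuilt - x)))
```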
23/45
Matrix: Is an array of (real or complex) numbers organized in rows and columns.
Here is a 3×4 example:
A = [ a₁₁  a₁₂  a₁₃  a₁₄ ]
    [ a₂₁  a₂₂  a₂₃  a₂₄ ]
    [ a₃₁  a₃₂  a₃₃  a₃₄ ]
We'll sometimes view a matrix as being built from its columns; the 3×4 example above could be written as
A = [ a₁ | a₂ | a₃ | a₄ ],   where  a_k = [a₁ₖ  a₂ₖ  a₃ₖ]^T
We'll take two views of a matrix:
1. Storage for a bunch of related numbers (e.g., a covariance matrix)
2. A transform (or mapping, or operator) acting on a vector (e.g., DFT, observation matrix, etc., as we'll see)
Matrices
24/45
Matrix as Transform: Our main view of matrices will be as operators that transform one vector into another vector.
Consider the 3x4 example matrix above. We could use that matrix to transform the 4-dimensional vector v into a 3-dimensional vector u:
u = Av = [ a₁ | a₂ | a₃ | a₄ ] [v₁  v₂  v₃  v₄]^T = v₁a₁ + v₂a₂ + v₃a₃ + v₄a₄
Clearly u is built from the columns of matrix A; therefore, it must lie in the span of the set of vectors that make up the columns of A.
Note that the columns of A are 3-dimensional vectors, so u is too.
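A two-line numerical check of the "u = Av is a weighted sum of the columns of A" view (the 3×4 matrix and the vector below are arbitrary choices for illustration):

```python
import numpy as np

A = np.arange(12.0).reshape(3, 4)      # arbitrary 3x4 matrix
v = np.array([1.0, -2.0, 0.5, 3.0])

u1 = A @ v                                          # matrix-vector product
u2 = sum(v[k] * A[:, k] for k in range(4))          # weighted sum of columns
print(u1, u2, np.allclose(u1, u2))
```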
25/45
Transforming a Vector Space: If we apply A to all the vectors in a vector space V we get a collection of vectors that are in a new space called U.
In the 3x4 example matrix above we transformed a 4-dimensional vector space V into a 3-dimensional vector space U
A 2x3 real matrix A would transform R3 into R2 :
Facts: If the mapping matrix A is square and its columns are linearly independent then
(i) the space that vectors in V get mapped to (i.e., U) has the same dimension as V (due to square part)
(ii) this mapping is reversible (i.e., invertible); there is an inverse matrix A-1 such that v = A-1u (due to square & LI part)
A
26/45
[Figure: A maps x₁ → y₁ and x₂ → y₂; A⁻¹ maps back.]
Transform = Matrix × Vector: a VERY useful viewpoint for all sorts of signal processing scenarios. In general we can view many linear transforms (e.g., DFT, etc.) in terms of some invertible matrix A operating on a signal vector x to give another vector y:
y_i = A x_i        x_i = A⁻¹ y_i
We can think of A and A-1 as mapping back and forth
between two vector spaces
27/45
Basis Matrix & Coefficient Vector:
Suppose we have a basis v₁, v₂, …, v_N for a vector space V. Then a vector v in space V can be written as
v = Σ_{k=1}^{N} α_k v_k
Another view of this:
v = [ v₁ | v₂ | … | v_N ] [α₁  α₂  …  α_N]^T = Vα     (V is the N×N basis matrix)
The basis matrix V transforms the coefficient vector α into the original vector v.
Matrix View & Basis View
28/45
Three Views of Basis Matrix & Coefficient Vector:
View #1: Vector v is a linear combination of the columns of the basis matrix V:  v = Σ_{k=1}^{N} α_k v_k
View #2: Matrix V maps vector α into vector v:  v = Vα
View #3: There is a matrix V⁻¹ that maps vector v into vector α:  α = V⁻¹v
Aside: If a matrix A is square and has linearly independent columns, then A is invertible and A⁻¹ exists such that AA⁻¹ = A⁻¹A = I, where I is the identity matrix having 1s on the diagonal and zeros elsewhere.
Now we have a way to go back and forth between vector v and its coefficient vector α.
29/45
Basis Matrix for an ON Basis: we get a special structure!!!
Result: For an ON basis matrix V:   V⁻¹ = V^H
(the superscript H denotes Hermitian transpose, which consists of transposing the matrix and conjugating the elements)
To see this:
V^H V = [ <v₁,v₁>   <v₁,v₂>   …   <v₁,v_N>  ]     [ 1 0 … 0 ]
        [ <v₂,v₁>   <v₂,v₂>   …   <v₂,v_N>  ]  =  [ 0 1 … 0 ]  =  I
        [     ⋮          ⋮        ⋱       ⋮      ]     [ ⋮ ⋮ ⋱ ⋮ ]
        [ <v_N,v₁>  <v_N,v₂>  …   <v_N,v_N> ]     [ 0 0 … 1 ]
The inner products are 0 or 1 because this is an ON basis.
30/45
A unitary matrix is a complex matrix A whose inverse is A-1 = AH
For the real-valued matrix case we get a special case of unitarythe idea of unitary matrix becomes orthogonal matrixfor which A-1 = AT
Two Properties of Unitary Matrices: Let U be a unitary matrix and let y1 = Ux1 and y2 = Ux2
1. They preserve norms: ||yi|| = ||xi||.
2. They preserve inner products: < y1, y2 > = < x1, x2 >
That is, the geometry of the old space is preserved by the unitary matrix as it transforms into the new space.
(These are the same as the preservation properties of ON basis.)
Unitary and Orthogonal Matrices
31/45
DFT from Unitary Matrix Viewpoint:
Consider a discrete-time signal x[n] for n = 0, 1, N-1.
We've already seen the DFT in a basis viewpoint:
x = Σ_{k=0}^{N−1} (X[k]/N) d_k,   α_k = X[k]/N
Now we can view the DFT as a transform from the unitary matrix viewpoint. Define
D = [ d₀ | d₁ | … | d_{N−1} ] =
  [ 1   1                  …   1                      ]
  [ 1   e^{j2π(1)(1)/N}       …   e^{j2π(N−1)(1)/N}       ]
  [ ⋮   ⋮                   ⋱   ⋮                     ]
  [ 1   e^{j2π(1)(N−1)/N}     …   e^{j2π(N−1)(N−1)/N}     ]
Then:
DFT:    x̃ = D^H x
IDFT:   x = (1/N) D x̃
(Actually D is not unitary, but N^{−1/2} D is unitary; see reading notes.)
32/45
Geometry Preservation of Unitary Matrix Mappings
Recall unitary matrices map in such a way that the sizes of vectors and the orientation between vectors is not changed.
[Figure: A and A⁻¹ map x₁ ↔ y₁ and x₂ ↔ y₂ with lengths and angles preserved.]
Unitary mappings just rigidly rotate the space.
33/45
[Figure: a non-unitary A maps x₁ ↔ y₁ and x₂ ↔ y₂, stretching and rotating the space.]
Effect of Non-Unitary Matrix Mappings
34/45
More on Matrices as Transforms
y = Ax,   where y is m×1, A is m×n, and x is n×1.
We'll limit ourselves here to real-valued vectors and matrices.
A maps any vector x in R^n into some vector y in R^m.
Range(A): the range space of A = the set of all vectors in R^m that can be reached by the mapping.
Since the vector y is a weighted sum of the columns of A, we may only be able to reach certain y's.
We are mostly interested in two cases:
1. Tall matrix: m > n
2. Square matrix: m = n
35/45
Range of a Tall Matrix (m > n):
range(A) ⊂ R^m
Proof: Since y is built from the n columns of A, there are not enough columns to form a basis for R^m (they don't span R^m).
Range of a Square Matrix (m = n):
If the columns of A are linearly independent, then range(A) = R^m, because the columns form a basis for R^m.
Otherwise, range(A) ⊂ R^m, because the columns don't span R^m.
36/45
Rank of a Matrix: rank(A) = largest # of linearly independent columns (or rows) of matrix A
For an m×n matrix we have that rank(A) ≤ min(m,n)
An m×n matrix A has full rank when rank(A) = min(m,n)
Example: This matrix has rank 3 because the 4th column can be written as a combination of the first 3 columns:
A = [ 1 0 0 1 ]
    [ 0 1 0 2 ]
    [ 0 0 1 1 ]
    [ 0 0 0 0 ]
    [ 0 0 0 0 ]
37/45
Tall Matrix (m > n) Case
If y does not lie in range(A), then there is No Solution
If y lies in range(A), then there is a solution (but not necessarily just one unique solution)
[Decision diagram for y = Ax: if y ∉ range(A), no solution; if y ∈ range(A), one solution when A is full rank, many solutions when A is not full rank.]
Characterizing Tall Matrix Mappings
We are interested in answering: Given a vector y, which vector x mapped into it via the matrix A in y = Ax?
38/45
Full-Rank Tall Matrix (m > n) Case
[Figure: A maps x in R^n to y = Ax in range(A) ⊂ R^m.]
For a given y∈range(A)there is only one x that maps to it.
This is because the columns of A are linearly independent and we know from our studies of vector spaces that the coefficient vector of y is unique x is that coefficient vector
By looking at y we can determine which x gave rise to it
39/45
NonFull-Rank Tall Matrix (m > n) Case
[Figure: both x₁ and x₂ in R^n map under A to the same y = Ax in range(A) ⊂ R^m.]
For a given y∈range(A) there is more than one x that maps to it
This is because the columns of A are linearly dependent and that redundancy provides several ways to combine them to create y
By looking at y we can not determine which x gave rise to it
40/45
Characterizing Square Matrix Mappings
Q: Given any y∈Rn can we find an x∈Rn that maps to it?
A: Not always!!!
[Decision diagram for y = Ax with square A: if A is full rank there is one solution; if A is not full rank then either y ∉ range(A) (no solution) or y ∈ range(A) (many solutions). Careful!!! This is quite a different flow diagram than the tall-matrix case!!!]
When a square A is full rank, its range covers the complete new space; then y must be in range(A), and because the columns of A form a basis there is a way to build y.
41/45
A Full-Rank Square Matrix is Invertible: a square matrix that has full rank is said to be nonsingular, or invertible. Then we can find the x that mapped to y using x = A⁻¹y.
Several ways to check if an n×n A is invertible:
1. A is invertible if and only if (iff) its columns (or rows) are linearly independent (i.e., if it is full rank)
2. A is invertible iff det(A) ≠ 0
3. A is invertible if (but not only if) it is positive definite (see later)
4. A is invertible if (but not only if) all its eigenvalues are nonzero
[Figure: the set of positive definite matrices is contained in the set of invertible matrices.]
42/45
Eigenvalues and Eigenvectors of Square MatricesIf matrix A is n×n, then A maps Rn → Rn
Q: For a given n×n matrix A, which vectors get mapped into being almost themselves???
More precisely Which vectors get mapped to a scalar multiple of themselves???
Even more precisely which vectors v satisfy the following:
vAv λ=
These vectors are special and are called the eigenvectors of A. The scalar λ is that eigenvector's corresponding eigenvalue.
Input Output
v Av
43/45
If an n×n real matrix A is symmetric, then:
• e-vectors corresponding to distinct e-values are orthonormal
• e-values are real valued
• we can decompose A as  A = VΛV^T,   where  V = [v₁ v₂ … v_n],  V^TV = I,  Λ = diag(λ₁, λ₂, …, λ_n)
If, further, A is pos. def. (semi-def.), then:
• e-values are positive (non-negative)
• rank(A) = # of non-zero e-values
• Pos. Def. ⇒ Full Rank (and therefore invertible); Pos. Semi-Def. ⇒ Not Full Rank (and therefore not invertible)
Eigen-Facts for Symmetric Matrices
When A is P.D., we can write  A⁻¹ = VΛ⁻¹V^T,  with  Λ⁻¹ = diag(1/λ₁, 1/λ₂, …, 1/λ_n).
For P.D. A, A⁻¹ has the same e-vectors and reciprocal e-values.
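A short sketch of these eigen-facts using NumPy (the symmetric positive-definite matrix built below is an arbitrary example): np.linalg.eigh returns real eigenvalues and orthonormal eigenvectors, and both A = VΛV^T and A⁻¹ = VΛ⁻¹V^T can be checked directly.

```python
import numpy as np

# Build an arbitrary symmetric positive-definite matrix A = B B^T + I
rng = np.random.default_rng(5)
B = rng.normal(size=(4, 4))
A = B @ B.T + np.eye(4)

lam, V = np.linalg.eigh(A)          # real eigenvalues, orthonormal eigenvectors
print("eigenvalues (all positive for P.D. A):", lam)
print("V^T V = I ?        ", np.allclose(V.T @ V, np.eye(4)))
print("A = V diag(lam) V^T ?    ", np.allclose(A, V @ np.diag(lam) @ V.T))
print("A^-1 = V diag(1/lam) V^T ?",
      np.allclose(np.linalg.inv(A), V @ np.diag(1.0 / lam) @ V.T))
```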
44/45
We'll limit our discussion to real-valued matrices and vectors.
Quadratic Forms and Positive-(Semi)Definite Matrices
Quadratic Form = matrix form for a 2nd-order multivariate polynomial.
The quadratic form of a matrix A is  Q_A(x) = x^T A x  (a scalar; x is the variable, A is fixed).
Example (2×2):
Q_A(x) = [x₁ x₂] [ a₁₁  a₁₂ ] [x₁]  = Σ_i Σ_j a_ij x_i x_j = a₁₁x₁² + a₂₂x₂² + (a₁₂ + a₂₁) x₁x₂
                  [ a₂₁  a₂₂ ] [x₂]
Other Matrix Issues
45/45
Values of the elements of matrix A determine the characteristics of the quadratic form QA(x) If QA(x) ≥ 0 ∀x ≠ 0 then say that QA(x) is positive semi-definite If QA(x) > 0 ∀x ≠ 0 then say that QA(x) is positive definite Otherwise say that QA(x) is non-definite
These terms carry over to the matrix that defines the Quad Form If QA(x) ≥ 0 ∀x ≠ 0 then say that A is positive semi-definite If QA(x) > 0 ∀x ≠ 0 then say that A is positive definite
1/15
Ch. 1 Introduction to Estimation
2/15
An Example Estimation Problem: DSB receiver.
[Figure: received spectrum S(f) centered at ±f_o; message spectrum M(f) at baseband.]
Transmitted signal:  s(t; f_o, φ_o) = m(t) cos(2πf_o t + φ_o)
Received signal:  x(t) = s(t) + w(t)   (the electronics adds noise w(t), usually "white")
Receiver chain: BPF & amp → estimation algorithm produces f̂_o, φ̂_o → oscillator w/ f̂_o, φ̂_o → mixer → audio amp → message estimate with spectrum M̂(f).
Goal: Given x(t) = s(t; f_o, φ_o) + w(t), find estimates f̂_o, φ̂_o (that are optimal in some sense).
Describe with a probability model: PDF & correlation.
3/15
Discrete-Time Estimation Problem
These days, we almost always work with samples of the observed signal (signal plus noise):
x[n] = s[n; f_o, φ_o] + w[n]
Our Thought Model: Each time you "observe" x[n] it contains the same s[n] but a different "realization" of the noise w[n], so the estimate is different each time. ⇒ f̂_o and φ̂_o are RVs.
Our Job: Given the finite data set x[0], x[1], …, x[N−1], find estimator functions that map the data into estimates:
f̂_o = g₁( x[0], x[1], …, x[N−1] ) = g₁(x)
φ̂_o = g₂( x[0], x[1], …, x[N−1] ) = g₂(x)
These are RVs… we need to describe them w/ a probability model.
4/15
PDF of Estimate
Because estimates are RVs we describe them with a PDF, p(f̂_o).
It will depend on:
1. the structure of s[n]
2. the probability model of w[n]
3. the form of the estimation function g(x)
[Figure: PDF p(f̂_o) centered near f_o; the mean measures the centroid, the std. dev. & variance measure the spread.]
Desire:  E{f̂_o} = f_o   and   σ²_f̂ = E{ ( f̂_o − E{f̂_o} )² }  small
5/15
1.2 Mathematical Estimation ProblemGeneral Mathematical Statement of Estimation Problem:
For… Measured Data x = [ x[0] x[1] … x[N-1] ]
Unknown Parameter θ = [θ1 θ2 … θp ]
θ is Not Random x is an N-dimensional random data vector
Q: What captures all the statistical information needed for an estimation problem ?
A: Need the N-dimensional PDF of the data, parameterized by θ
p(x; θ)
In practice we are not given the PDF!!! We choose a suitable model that:
• captures the essence of reality
• leads to a tractable answer
We'll use p(x;θ) to find θ̂ = g(x).
6/15
Ex. Estimating a DC Level in Zero Mean AWGN
Consider that a single data point is observed:  x[0] = θ + w[0],  where w[0] is Gaussian, zero mean, with variance σ², so x[0] ~ N(θ, σ²).
So… the needed parameterized PDF is:
p(x[0];θ ) which is Gaussian with mean of θ
So… in this case the parameterization changes the data PDF mean:
[Figure: the PDFs p(x[0];θ₁), p(x[0];θ₂), p(x[0];θ₃) are identical Gaussian shapes centered at θ₁, θ₂, θ₃.]
7/15
Ex. Modeling Data with a Linear Trend (see Fig. 1.6 in the text)
Looking at the figure we see what looks like a linear trend perturbed by some noise…
So the engineer proposes signal and noise models:
x[n] = s[n; A, B] + w[n] = (A + Bn) + w[n]
Signal model: linear trend.  Noise model: AWGN w/ zero mean.
AWGN = "Additive White Gaussian Noise"
"White" = x[n] and x[m] are uncorrelated for n ≠ m:  E{ (w − w̄)(w − w̄)^T } = σ² I
8/15
Typical Assumptions for Noise Model• W and G is always easiest to analyze
– Usually assumed unless you have reason to believe otherwise– Whiteness is usually first assumption removed– Gaussian is less often removed due to the validity of Central Limit Thm
• Zero Mean is a nearly universal assumption
– Most practical cases have zero mean
– But if not: write w[n] = w_zm[n] + µ, a zero-mean part plus the non-zero mean µ, and group µ into the signal model
• Variance of noise doesn’t always have to be known to make an estimate– BUT, must know to assess expected “goodness” of the estimate– Usually perform “goodness” analysis as a function of noise variance (or
SNR = Signal-to-Noise Ratio)– Noise variance sets the SNR level of the problem
9/15
Classical vs. Bayesian Estimation ApproachesIf we view θ (parameter to estimate) as Non-Random
→ Classical EstimationProvides no way to include a priori information about θ
If we view θ (parameter to estimate) as Random → Bayesian EstimationAllows use of some a priori PDF on θ
The first part of the course: Classical Methods• Minimum Variance, Maximum Likelihood, Least Squares
Last part of the course: Bayesian Methods• MMSE, MAP, Wiener filter, Kalman Filter
10/15
1.3 Assessing Estimator Performance
Can only do this when the value of θ is known:
• Theoretical Analysis, Simulations, Field Tests, etc.
Recall that the estimate θ̂ = g(x) is a random variable.
Thus it has a PDF of its own… and that PDF completely displays the quality of the estimate.
Illustrate with the 1-D parameter case.
[Figure: PDF p(θ̂) centered near the true value θ.]
Often we just capture quality through the mean and variance of θ̂ = g(x).
Desire:
m_θ̂ = E{θ̂} = θ   (if this is true we say the estimate is unbiased)
σ²_θ̂ = E{ ( θ̂ − E{θ̂} )² }  small
11/15
Equivalent View of Assessing Performance
Define the estimation error:  e = θ̂ − θ   (so θ̂ = θ + e);  θ̂ and e are RVs, θ is not.
Completely describe estimator quality with the error PDF p(e).
[Figure: error PDF p(e) centered near 0.]
Desire:
m_e = E{e} = 0   (if this is true we say the estimate is unbiased)
σ²_e = E{ ( e − ē )² }  small
12/15
Example: DC Level in AWGN
Model:  x[n] = A + w[n],   n = 0, 1, …, N−1
w[n]: Gaussian, zero mean, variance σ², white (uncorrelated sample-to-sample).
PDF of an individual data sample:
p(x[i]) = (1/√(2πσ²)) exp( −(x[i] − A)² / (2σ²) )
Uncorrelated Gaussian RVs are independent, so the joint PDF is the product of the individual PDFs:
p(x) = Π_{n=0}^{N−1} (1/√(2πσ²)) exp( −(x[n] − A)² / (2σ²) )
     = (1/(2πσ²)^{N/2}) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² )
(property: a product of exponentials gives a sum inside the exponential)
13/15
Each data sample has the same mean (A), which is the thing we are trying to estimate, so we can imagine trying to estimate A by finding the sample mean of the data (a statistic motivated by probability theory):
Â = (1/N) Σ_{n=0}^{N−1} x[n]
Let's analyze the quality of this estimator.
• Is it unbiased?
E{Â} = E{ (1/N) Σ_{n=0}^{N−1} x[n] } = (1/N) Σ_{n=0}^{N−1} E{x[n]} = (1/N) Σ_{n=0}^{N−1} A = A
⇒ E{Â} = A.  Yes! Unbiased!
• Can we get a small variance?
var(Â) = var( (1/N) Σ_{n=0}^{N−1} x[n] ) = (1/N²) Σ_{n=0}^{N−1} var(x[n])    (due to independence: white & Gaussian ⇒ independent)
       = (1/N²) · N σ² = σ²/N
⇒ var(Â) = σ²/N. We can make the variance small by increasing N!!!
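A Monte Carlo sketch of this analysis (the values of A, σ, and N below are arbitrary choices): it generates many independent data records of a DC level in WGN, forms the sample-mean estimate for each, and checks that the estimates are unbiased with variance close to σ²/N.

```python
import numpy as np

rng = np.random.default_rng(6)
A, sigma, N = 2.0, 1.5, 50          # true DC level, noise std. dev., record length
N_trials = 20_000                   # number of independent data records

x = A + sigma * rng.normal(size=(N_trials, N))   # each row: x[n] = A + w[n]
A_hat = x.mean(axis=1)                           # sample-mean estimate per record

print("mean of A_hat :", A_hat.mean(), "  (true A =", A, ")")
print("var  of A_hat :", A_hat.var(), "  (theory sigma^2/N =", sigma**2 / N, ")")
```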
14/15
Theoretical Analysis vs. Simulations
• Ideally we'd always like to be able to theoretically analyze the problem to find the bias and variance of the estimator
– Theoretical results show how performance depends on the problem
specifications
• But sometimes we make use of simulations– to verify that our theoretical analysis is correct– sometimes can’t find theoretical results
15/15
Course Goal = Find Optimal Estimators
• There are several different definitions or criteria for optimality!
• Most logical: Minimum MSE (Mean-Square-Error); see Sect. 2.4
mse(θ̂) = E{ (θ̂ − θ)² } = var(θ̂) + b²(θ),   where  b(θ) = E{θ̂} − θ  is the bias.
To see this result:
mse(θ̂) = E{ (θ̂ − θ)² }
        = E{ [ (θ̂ − E{θ̂}) + (E{θ̂} − θ) ]² }
        = E{ (θ̂ − E{θ̂})² } + 2 E{θ̂ − E{θ̂}}(E{θ̂} − θ) + (E{θ̂} − θ)²      (middle term = 0)
        = var(θ̂) + b²(θ)
Although MSE makes sense, the resulting estimates usually depend on the unknown θ itself (not realizable).
Chapter 2
Minimum Variance Unbiased Estimators
MVU
Ch. 2: Minimum Variance Unbiased Est.
Basic Idea of MVU: Out of all unbiased estimates, find the one with the lowest variance
(This avoids the realizability problem of MSE)
2.3 Unbiased EstimatorsAn estimator is unbiased if
E{θ̂} = θ   for all θ
Example: Estimate DC in White Uniform Noise
x[n] = A + w[n],   n = 0, 1, …, N−1
Unbiased Estimator:
Â = (1/N) Σ_{n=0}^{N−1} x[n]    (same as before: E{Â} = A regardless of the value of A)
Biased Estimator:
Ǎ = (1/N) Σ_{n=0}^{N−1} |x[n]|
Note: if A ≥ 1, then |x[n]| = x[n] for every sample, so E{Ǎ} = A; but if A < 1, then E{Ǎ} ≠ A.
⇒ Bias = 0 if A ≥ 1, Bias ≠ 0 if A < 1  ⇒  a biased estimate (unbiasedness must hold for all values of A).
MVUE = Minimum Variance Unbiased Estimator
mse(θ̂) = var(θ̂) + b²(θ)
So the MVU could also be called the "Minimum MSE Unbiased Estimator."
(Recall the realizability problem with the MMSE criterion.)
Constrain the bias to be zero, then find the estimator that minimizes the variance.
2.4 Minimum Variance Criterion
Note: in mse(θ̂) = var(θ̂) + b²(θ), the bias term is = 0 for the MVU estimator.
2.5 Existence of the MVU Estimator
Sometimes there is no MVUE. This can happen in 2 ways:
1. There may be no unbiased estimators
2. None of the unbiased estimators has uniformly minimum variance
Example of #2: Assume there are only 3 unbiased estimators for a problem, θ̂_i = g_i(x), i = 1, 2, 3. Two possible cases:
[Figure: var(θ̂_i) plotted vs. θ for the three estimators. In one case a single curve lies below the others for all θ, so an MVU exists; in the other case the lowest curve changes with θ, so no MVU exists.]
Even if the MVU exists, we may not be able to find it!!
2.6 Finding the MVU Estimator
There is no known "turn the crank" method.
Three Approaches to Finding the MVUE:
1. Determine the Cramer-Rao Lower Bound (CRLB) and see if some estimator satisfies it (Ch. 3 & 4). (Note: the MVU can exist but not achieve the CRLB.)
2. Apply the Rao-Blackwell-Lehmann-Scheffé Theorem. Rare in practice; we'll skip Ch. 5.
3. Restrict to linear unbiased estimators & find the MVLU (Ch. 6). Only gives the true MVU if the problem is linear.
2.7 Vector Parameter
When we wish to estimate multiple parameters we group them into a vector:  θ = [θ₁ θ₂ … θ_p]^T
Then an estimator is notated as:  θ̂ = [θ̂₁ θ̂₂ … θ̂_p]^T
The unbiased requirement becomes:  E{θ̂} = θ
The minimum variance requirement becomes: for each i,  var(θ̂_i) = min over all unbiased estimates.
Chapter 3 Cramer-Rao Lower Bound
Abbreviated: CRLB or sometimes just CRB
The CRLB is a lower bound on the variance of any unbiased estimator:
If θ̂ is an unbiased estimator of θ, then
σ²_θ̂ ≥ CRLB(θ)   ⇒   σ_θ̂ ≥ √CRLB(θ)
The CRLB tells us the best we can ever expect to be able to do (w/ an unbiased estimator).
What is the Cramer-Rao Lower Bound
1. Feasibility studies (e.g., sensor usefulness, etc.): can we meet our specifications?
2. Judgment of proposed estimators: estimators that don't achieve the CRLB are looked down upon in the technical literature.
3. Can sometimes provide the form for the MVU estimator.
4. Demonstrates the importance of physical and/or signal parameters to the estimation problem.
   e.g., we'll see that a signal's BW determines delay estimation accuracy ⇒ radars should use wide-BW signals.
Some Uses of the CRLB
Q: What determines how well you can estimate θ ?
Recall: Data vector is x
3.3 Est. Accuracy Consideration
samples from a random process that depends on the parameter θ
⇒ the PDF describes that dependence: p(x;θ )
Clearly if p(x;θ ) depends strongly/weakly on θwe should be able to estimate θ well/poorly.
See surface plots vs. x & θ for 2 cases:1. Strong dependence on θ2. Weak dependence on θ
⇒ Should look at p(x;θ ) as a function of θ for fixed value of observed data x
Surface Plot Examples of p(x;θ )
Ex. 3.1: PDF dependence for a DC level in noise:  x[0] = A + w[0],  w[0] ~ N(0, σ²).
Then the parameter-dependent PDF of the data point x[0] is:
p(x[0]; A) = (1/√(2πσ²)) exp( −(x[0] − A)² / (2σ²) )
Say we observe x[0] = 3; so slice the surface at x[0] = 3.
[Figure: p(x[0]; A) vs. x[0] and A; the slice p(x[0]=3; A) is a Gaussian-shaped function of A centered at A = 3.]
Define: Likelihood Function (LF)
The LF = the PDF p(x;θ), but viewed as a function of the parameter θ with the data vector x fixed.
We will also often need the Log-Likelihood Function (LLF):
LLF = ln LF = ln p(x;θ)
LF Characteristics that Affect Accuracy
Intuitively: the sharpness of the LF sets the accuracy. But how???
Sharpness is measured using curvature:
− ∂² ln p(x; θ) / ∂θ²    evaluated at the true value of θ, given the data x
Curvature ↑ ⇒ PDF concentration ↑ ⇒ Accuracy ↑
But this is for a particular set of data; we want it in general. So average over the random data vector to give the average curvature (the expected sharpness of the LF):
− E{ ∂² ln p(x; θ) / ∂θ² }    evaluated at the true value of θ    (E is w.r.t. p(x;θ))
Theorem 3.1 CRLB for a Scalar Parameter
Assume the regularity condition is met:
E{ ∂ ln p(x; θ) / ∂θ } = 0   for all θ    (E is w.r.t. p(x;θ))
Then
σ²_θ̂ ≥ 1 / ( − E{ ∂² ln p(x; θ) / ∂θ² } )    evaluated at the true value of θ
3.4 Cramer-Rao Lower Bound
where  E{ ∂² ln p(x; θ) / ∂θ² } = ∫ [ ∂² ln p(x; θ) / ∂θ² ] p(x; θ) dx
The right-hand side is the CRLB.
1. Write the log-likelihood function as a function of θ: ln p(x;θ)
2. Fix x and take the 2nd partial of the LLF: ∂² ln p(x;θ)/∂θ²
3. If the result still depends on x: fix θ and take the expected value w.r.t. x. Otherwise skip this step.
4. The result may still depend on θ: evaluate at each specific value of θ desired.
5. Negate and form the reciprocal.
Steps to Find the CRLB
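As a hedged numerical companion to these steps (not from the notes; it assumes the DC-in-AWGN model that Example 3.3 uses next), the sketch below evaluates the average curvature −E{∂² ln p(x;A)/∂A²} by Monte Carlo, using a finite-difference second derivative of the log-likelihood, and compares the resulting CRLB with the analytical σ²/N.

```python
import numpy as np

rng = np.random.default_rng(7)
A_true, sigma, N = 1.0, 0.8, 25
N_trials = 5_000
dA = 1e-3                                     # step for the finite difference

def log_lik(x, A):
    # ln p(x; A) for x[n] = A + w[n], w[n] ~ N(0, sigma^2), white
    return (-0.5 * N * np.log(2 * np.pi * sigma**2)
            - 0.5 * np.sum((x - A) ** 2) / sigma**2)

curvatures = []
for _ in range(N_trials):
    x = A_true + sigma * rng.normal(size=N)
    d2 = (log_lik(x, A_true + dA) - 2 * log_lik(x, A_true)
          + log_lik(x, A_true - dA)) / dA**2      # 2nd partial w.r.t. A
    curvatures.append(d2)

fisher_info = -np.mean(curvatures)               # -E{ d^2 ln p / dA^2 }
print("numerical CRLB :", 1.0 / fisher_info)
print("analytic  CRLB :", sigma**2 / N)
```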
Example 3.3 CRLB for DC in AWGN:  x[n] = A + w[n],  n = 0, 1, …, N−1,  with w[n] ~ N(0, σ²) & white.
Need the likelihood function. Due to whiteness and the property of exponentials:
p(x; A) = Π_{n=0}^{N−1} (1/√(2πσ²)) exp( −(x[n] − A)² / (2σ²) )
        = (1/(2πσ²)^{N/2}) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² )
Now take ln to get the LLF:
ln p(x; A) = −ln[ (2πσ²)^{N/2} ] − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²
(the first term does not depend on A; the second does)
Now take the first partial w.r.t. A:
∂ ln p(x; A)/∂A = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = (N/σ²)( x̄ − A )      (!)
where x̄ is the sample mean.
Now take the partial again:
∂² ln p(x; A)/∂A² = −N/σ²
This doesn't depend on x, so we don't need to take E{·}.
Since the result doesn't depend on x or A, all we do is negate and form the reciprocal to get the CRLB:
CRLB = 1 / ( − E{ ∂² ln p(x;θ)/∂θ² } |_{θ = true value} ) = σ²/N
⇒  var(Â) ≥ σ²/N
• Doesn't depend on A
• Increases linearly with σ²
• Decreases inversely with N
[Figure: for fixed N the CRLB grows linearly with σ²; for fixed σ² the CRLB decays as 1/N, so doubling the data halves the CRLB.]
Continuation of Theorem 3.1 on the CRLB:
There exists an unbiased estimator that attains the CRLB iff
∂ ln p(x; θ)/∂θ = I(θ) [ g(x) − θ ]      (!)
for some functions I(θ) and g(x).
Furthermore, the estimator that achieves the CRLB is then given by
θ̂ = g(x),   with  var(θ̂) = 1/I(θ) = CRLB
Since no unbiased estimator can do better, this is the MVU estimate!!
This gives a possible way to find the MVU:
• Compute ∂ ln p(x;θ)/∂θ (we need to anyway)
• Check to see if it can be put in the form (!)
• If so, then g(x) is the MVU estimator
Revisit Example 3.3 to Find the MVU Estimate
For the DC level in AWGN we found in (!) that:
∂ ln p(x; A)/∂A = (N/σ²)( x̄ − A )
This has the form I(A)[ g(x) − A ] with
g(x) = (1/N) Σ_{n=0}^{N−1} x[n] = x̄ = Â    and    I(A) = N/σ²   ⇒   var(Â) = σ²/N = CRLB
So for the DC level in AWGN: the sample mean is the MVUE!!
Definition: Efficient Estimator
An estimator that is:
unbiased and
attains the CRLB
is said to be an Efficient Estimator
Notes:
Not all estimators are efficient (see next example: Phase Est.)
Not even all MVU estimators are efficient
So there are times when our 1st partial test won't work!!!!
Example 3.4: CRLB for Phase Estimation
This is related to the DSB carrier estimation problem we used for motivation in the notes for Ch. 1, except here we have a pure sinusoid and we wish to estimate only its phase.
Signal Model:  x[n] = s[n; φ] + w[n] = A cos(2πf_o n + φ) + w[n],   w[n] AWGN w/ zero mean & variance σ²
Assumptions:
1. 0 < f_o < ½ (f_o is in cycles/sample)
2. A and f_o are known (we'll remove this assumption later)
Signal-to-Noise Ratio: signal power = A²/2, noise power = σ²,  so  SNR = A²/(2σ²)
Problem: Find the CRLB for estimating the phase φ.
We need the PDF (exploit whiteness and the exponential form):
p(x; φ) = (1/(2πσ²)^{N/2}) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} [ x[n] − A cos(2πf_o n + φ) ]² )
Now taking the log gets rid of the exponential; then taking the partial derivative gives (see book for details):
∂ ln p(x; φ)/∂φ = −(A/σ²) Σ_{n=0}^{N−1} [ x[n] sin(2πf_o n + φ) − (A/2) sin(4πf_o n + 2φ) ]
Taking the partial derivative again:
∂² ln p(x; φ)/∂φ² = −(A/σ²) Σ_{n=0}^{N−1} [ x[n] cos(2πf_o n + φ) − A cos(4πf_o n + 2φ) ]
This still depends on the random vector x, so we need to take E{·}.
Taking the expected value:
−E{ ∂² ln p(x; φ)/∂φ² } = (A/σ²) Σ_{n=0}^{N−1} [ E{x[n]} cos(2πf_o n + φ) − A cos(4πf_o n + 2φ) ]
with  E{x[n]} = A cos(2πf_o n + φ).
So plug that in, get a cos² term, use a trig identity, and get
−E{ ∂² ln p(x; φ)/∂φ² } = (A²/σ²) Σ_{n=0}^{N−1} [ ½ − ½ cos(4πf_o n + 2φ) ] ≈ N A²/(2σ²) = N × SNR
(the cosine sum is << N if f_o is not near 0 or ½)
Now invert to get the CRLB:
var(φ̂) ≥ 1 / (N × SNR)
[Figure: CRLB vs. N for fixed SNR (doubling the data halves the CRLB) and CRLB vs. SNR (non-dB) for fixed N (doubling the SNR halves the CRLB; the CRLB halves for every 3 dB increase in SNR).]
Does an efficient estimator exist for this problem? The CRLB theorem says there is one only if
∂ ln p(x; θ)/∂θ = I(θ)[ g(x) − θ ]
Our earlier result was
∂ ln p(x; φ)/∂φ = −(A/σ²) Σ_{n=0}^{N−1} [ x[n] sin(2πf_o n + φ) − (A/2) sin(4πf_o n + 2φ) ]
which cannot be put in that form ⇒ an efficient estimator does NOT exist!!!
We'll see later, though, an estimator for which var(φ̂) → CRLB as N → ∞ or as SNR → ∞.
Such an estimator is called an asymptotically efficient estimator. (We'll see such a phase estimator in Ch. 7 on MLE.)
[Figure: var(φ̂) approaching the CRLB ~ 1/N as N grows.]
Alternate Form for the CRLB (see Appendix 3A for the derivation):
var(θ̂) ≥ 1 / E{ ( ∂ ln p(x; θ)/∂θ )² }
Sometimes it is easier to find the CRLB this way.
This also gives a new viewpoint of the CRLB, from Gardner's paper (IEEE Trans. on Info. Theory, July 1979; posted on BB). Consider the normalized version of this form of the CRLB:
var(θ̂)/θ² ≥ 1 / E{ ( θ ∂ ln p(x; θ)/∂θ )² }
We'll derive this in a way that will re-interpret the CRLB.
2
Consider the incremental sensitivity of p(x;θ) to changes in θ:
If θ → θ + ∆θ, then it causes p(x;θ) → p(x;θ + ∆θ). How sensitive is p(x;θ) to that change??
S̃_p(x) = (% change in p(x;θ)) / (% change in θ) = [ ∆p(x;θ)/p(x;θ) ] / [ ∆θ/θ ]
Now let ∆θ → 0:
S_p(x) = lim_{∆θ→0} S̃_p(x) = ( θ/p(x;θ) ) ∂p(x;θ)/∂θ = θ ∂ ln p(x;θ)/∂θ
(recall from calculus:  ∂ ln f(x)/∂x = (1/f(x)) ∂f(x)/∂x)
Thus
var(θ̂)/θ² ≥ 1 / E{ ( θ ∂ ln p(x;θ)/∂θ )² } = 1 / E{ [ S_p(x) ]² }
Interpretation: normalized CRLB = inverse mean-square sensitivity.
3
Definition of Fisher Information
The denominator in the CRLB is called the Fisher Information I(θ):
I(θ) = −E{ ∂² ln p(x; θ)/∂θ² }
It is a measure of the expected "goodness" of the data for the purpose of making an estimate.
It has the properties needed for an information measure (as does Shannon information):
1. I(θ) ≥ 0 (easy to see using the alternate form of the CRLB)
2. I(θ) is additive for independent observations; this follows from
   ln p(x; θ) = Σ_n ln p(x[n]; θ)  ⇒  I(θ) = Σ_n I_n(θ)
   If each I_n(θ) is the same:  I(θ) = N × I_n(θ)
4
3.5 CRLB for Signals in AWGN
When our data is signal + AWGN we get a simple form for the CRLB.
Signal Model:  x[n] = s[n; θ] + w[n],  n = 0, 1, …, N−1,  with w[n] white, Gaussian, zero mean.
Q: What is the CRLB?
First write the likelihood function:
p(x; θ) = (1/(2πσ²)^{N/2}) exp( −(1/(2σ²)) Σ_{n=0}^{N−1} ( x[n] − s[n;θ] )² )
Differentiate the log LF twice to get:
∂² ln p(x;θ)/∂θ² = (1/σ²) Σ_{n=0}^{N−1} [ ( x[n] − s[n;θ] ) ∂²s[n;θ]/∂θ² − ( ∂s[n;θ]/∂θ )² ]
This depends on the random x[n], so we must take E{·}.
5
E{ ∂² ln p(x;θ)/∂θ² } = (1/σ²) Σ_{n=0}^{N−1} [ E{ x[n] − s[n;θ] } ∂²s[n;θ]/∂θ² − ( ∂s[n;θ]/∂θ )² ]
                      = −(1/σ²) Σ_{n=0}^{N−1} ( ∂s[n;θ]/∂θ )²      (since E{x[n] − s[n;θ]} = 0)
Then using this we get the CRLB for a signal in AWGN:
var(θ̂) ≥ σ² / Σ_{n=0}^{N−1} ( ∂s[n;θ]/∂θ )²
Note: ∂s[n;θ]/∂θ tells how sensitive the signal is to the parameter.
If the signal is very sensitive to parameter changes, then the CRLB is small: we can get a very accurate estimate!
6
Ex. 3.5: CRLB for the Frequency of a Sinusoid
Signal Model:  x[n] = A cos(2πf_o n + φ) + w[n],   n = 0, 1, …, N−1,   0 < f_o < ½
Applying the signal-in-AWGN result:
var(f̂_o) ≥ 1 / ( SNR × Σ_{n=0}^{N−1} [ 2πn sin(2πf_o n + φ) ]² )
[Figure: the bound on the variance (cycles/sample)² and the bound on the std. dev. (cycles/sample) plotted vs. f_o; note there is an error in the book's version of this plot.]
The signal is less sensitive to f_o if f_o is near 0 or ½.
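A small sketch of the bound above (the record length, SNR, and phase are arbitrary assumptions): it evaluates var(f̂_o) ≥ 1/(SNR · Σ[2πn sin(2πf_o n + φ)]²) for several f_o values, and also prints, for reference, the three-parameter CRLB quoted later in the notes converted to (cycles/sample)².

```python
import numpy as np

N, SNR, phi = 25, 10.0, 0.3        # record length, linear SNR, phase (arbitrary)
n = np.arange(N)

for fo in (0.05, 0.1, 0.25, 0.4, 0.45):
    denom = SNR * np.sum((2 * np.pi * n * np.sin(2 * np.pi * fo * n + phi)) ** 2)
    crlb = 1.0 / denom
    print(f"fo = {fo:4.2f} cycles/sample   CRLB = {crlb:.3e} (cycles/sample)^2")

# Reference: 12 / ((2*pi)^2 * SNR * N * (N^2 - 1)), the vector-parameter bound
print("3-parameter bound:", 12 / (SNR * (2 * np.pi) ** 2 * N * (N**2 - 1)))
```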
7
3.6 Transformation of Parameters
Say there is a parameter θ with a known CRLB_θ, but imagine that we instead are interested in estimating some other parameter α that is a function of θ:  α = g(θ).
Q: What is CRLB_α?
var(α̂) ≥ CRLB_α = ( ∂g(θ)/∂θ )² CRLB_θ
(The derivative captures the sensitivity of α to θ. Proved in Appendix 3B.)
A large ∂g/∂θ means a small error in θ gives a larger error in α, which increases the CRLB (i.e., worsens accuracy).
8
Example: Speed of Vehicle From Elapsed Time
A laser/sensor pair at the start and another at the stop, a known distance D apart, measure the elapsed time T. The possible accuracy of T is set by CRLB_T.
But we really want to measure the speed V = D/T. Find CRLB_V:
CRLB_V = ( ∂(D/T)/∂T )² × CRLB_T = ( −D/T² )² × CRLB_T = ( V⁴/D² ) × CRLB_T
Accuracy bound:  σ_V ≥ (V²/D) √CRLB_T   (m/s)
• Less accurate at high speeds (quadratic)
• More accurate over large distances
9
Effect of Transformation on Efficiency
Suppose you have an efficient estimator θ̂ of θ, but you are really interested in estimating α = g(θ), and you plan to use α̂ = g(θ̂).
Q: Is this an efficient estimator of α???
A: Theorem: If g(θ) has the affine form g(θ) = aθ + b, then α̂ = g(θ̂) is efficient.
Proof:
First:  var(α̂) = var(aθ̂ + b) = a² var(θ̂) = a² CRLB_θ    (because θ̂ is efficient)
Now, what is CRLB_α? Using the transformation result:
CRLB_α = ( ∂(aθ + b)/∂θ )² CRLB_θ = a² CRLB_θ
⇒ var(α̂) = CRLB_α  ⇒  Efficient!
10
Asymptotic Efficiency Under Transformation
If the mapping α = g(θ) is not affine, this result does NOT hold.
But if the number of data samples used is large, then the estimator is approximately efficient (asymptotically efficient).
[Figure: small-N case: the PDF of θ̂ is widely spread over the nonlinear mapping g(θ). Large-N case: the PDF of θ̂ is concentrated onto a linearized section of g(θ).]
1
3.7 CRLB for the Vector Parameter Case
Vector parameter:  θ = [θ₁ θ₂ … θ_p]^T;   its estimate:  θ̂ = [θ̂₁ θ̂₂ … θ̂_p]^T
Assume that the estimate is unbiased:  E{θ̂} = θ
For a scalar parameter we looked at its variance, but for a vector parameter we look at its covariance matrix:
C_θ̂ = E{ (θ̂ − θ)(θ̂ − θ)^T }
For example, for θ = [x y z]^T:
C_θ̂ = [ var(x)    cov(x,y)   cov(x,z) ]
      [ cov(y,x)  var(y)     cov(y,z) ]
      [ cov(z,x)  cov(z,y)   var(z)   ]
2
Fisher Information Matrix
For the vector parameter case, the Fisher information becomes the Fisher Information Matrix (FIM) I(θ), whose mn-th element is given by
[I(θ)]_mn = −E{ ∂² ln p(x; θ) / ∂θ_m ∂θ_n },    m, n = 1, 2, …, p
evaluated at the true value of θ.
3
The CRLB Matrix
Then, under the same kind of regularity conditions, the CRLB matrix is the inverse of the FIM:
CRLB = I⁻¹(θ)
So what this means is:
σ²_θ̂_n = [C_θ̂]_nn ≥ [I⁻¹(θ)]_nn      (!)
The diagonal elements of the inverse FIM bound the parameter variances, which are the diagonal elements of the parameter covariance matrix:
[ var(x)    cov(x,y)   cov(x,z) ]          [ b₁₁ b₁₂ b₁₃ ]
[ cov(y,x)  var(y)     cov(y,z) ]   vs.   [ b₂₁ b₂₂ b₂₃ ]  = I⁻¹(θ)
[ cov(z,x)  cov(z,y)   var(z)   ]          [ b₃₁ b₃₂ b₃₃ ]
4
More General Form of the CRLB Matrix:
C_θ̂ − I⁻¹(θ) is positive semi-definite; the mathematical notation for this is
C_θ̂ − I⁻¹(θ) ≥ 0      (!!)
Note: property #5 about p.d. matrices on p. 573 states that (!!) ⇒ (!).
5
CRLB Off-Diagonal Elements InsightLet θ = [xe ye]T represent the 2-D x-y location of a transmitter (emitter) to be estimated.
Consider the two cases of scatter plots for the estimated location:
[Figure: two scatter plots of (x̂_e, ŷ_e) with the same per-coordinate standard deviations σ_x̂e and σ_ŷe; in one the errors are uncorrelated (axis-aligned cloud), in the other they are correlated (tilted cloud).]
Each case has the same variances but location accuracy characteristics are very different. ⇒ This is the effect of the off-diagonal elements of the covariance
Should consider effect of off-diagonal CRLB elements!!!
Not In Book
6
CRLB Matrix and Error Ellipsoids (not in book)
Assume θ̂ = [x̂_e ŷ_e]^T is 2-D Gaussian w/ zero mean (only for convenience) and covariance matrix C_θ̂.
Then its PDF is given by:
p(θ̂) = ( 1 / (2π √det(C_θ̂)) ) exp( −½ θ̂^T C_θ̂⁻¹ θ̂ )
The exponent is a quadratic form (recall: it is scalar valued).
So the equal-height contours of this PDF are given by the values of θ̂ such that
θ̂^T A θ̂ = k   (some constant),   where for ease we let A = C_θ̂⁻¹
Note: A is symmetric, so a₁₂ = a₂₁, because any covariance matrix is symmetric and the inverse of a symmetric matrix is symmetric.
7
What does this look like?
a₁₁ x_e² + 2a₁₂ x_e y_e + a₂₂ y_e² = k
An ellipse!!! (Look it up in your calculus book!!!)
Recall: if a₁₂ = 0, then the ellipse is aligned with the axes, and a₁₁ and a₂₂ control the size of the ellipse along the axes.
Note: a₁₂ = 0  ⇒  C_θ̂⁻¹ = diag(a₁₁, a₂₂)  ⇒  C_θ̂ = diag(1/a₁₁, 1/a₂₂)  ⇒  x̂_e & ŷ_e are uncorrelated.
Note: a₁₂ ≠ 0  ⇒  x̂_e & ŷ_e are correlated.
In general  C_θ̂ = [ σ²_x̂e    σ_x̂eŷe ]
                   [ σ_x̂eŷe   σ²_ŷe  ]
8
Error Ellipsoids and Correlation (not in book)
[Figure: if x̂_e & ŷ_e are uncorrelated, the error ellipse is axis-aligned with sizes set by ~σ²_x̂e and ~σ²_ŷe; if they are correlated, the ellipse is tilted.]
Choosing the k value: for the 2-D case,
k = −2 ln(1 − P_e)
where P_e is the probability that the estimate will lie inside the ellipse. (See the posted paper by Torrieri.)
9
Ellipsoids and Eigen-Structure (not in book)
Consider a symmetric matrix A & its quadratic form x^T A x.
Ellipsoid:  x^T A x = k,  or  <Ax, x> = k
The principal axes of the ellipse are orthogonal to each other and are orthogonal to the tangent line on the ellipse.
Theorem: The principal axes of the ellipsoid x^T A x = k are eigenvectors of the matrix A.
10
Proof: From multi-dimensional calculus, the gradient of a scalar-valued function φ(x₁, …, x_n) is orthogonal to the surface φ = const:
grad(φ) = ∇φ(x) = ∂φ(x)/∂x = [ ∂φ/∂x₁  …  ∂φ/∂x_n ]^T     (different notations)
See the handout posted on Blackboard on Gradients and Derivatives.
[Figure: level curves of φ in the (x₁, x₂) plane with gradient vectors orthogonal to them.]
11
For our quadratic form function we have:
φ(x) = x^T A x = Σ_i Σ_j a_ij x_i x_j
⇒  ∂φ/∂x_k = Σ_i Σ_j a_ij ∂(x_i x_j)/∂x_k      (♣)
Product rule:  ∂(x_i x_j)/∂x_k = x_i δ_jk + x_j δ_ik      (♣♣)
Using (♣♣) in (♣) gives:
∂φ/∂x_k = Σ_i a_ik x_i + Σ_j a_kj x_j = 2 Σ_j a_kj x_j      (by symmetry: a_ik = a_ki)
And from this we get:
∇( x^T A x ) = 2Ax
12
Since the gradient is ⊥ to the ellipse, this says Ax is ⊥ to the ellipse <Ax, x> = k.
When x is a principal axis, then x and Ax are aligned:
Ax = λx   ⇒   Eigenvectors are principal axes!!!
< End of Proof >
13
Theorem: The length of the principal axis associated with eigenvalue λ_i is √(k/λ_i).
Proof: If x is a principal axis, then Ax = λx. Take the inner product of both sides with x:
<Ax, x> = λ<x, x>   ⇒   k = λ||x||²   ⇒   ||x|| = √(k/λ)
< End of Proof >
Note: This says that if A has a zero eigenvalue, then the error ellipse will have an infinite-length principal axis ⇒ NOT GOOD!!
So we'll require that all λ_i > 0 ⇒ C_θ̂ must be positive definite.
14
Application of Eigen-Results to Error Ellipsoids
The error ellipsoid corresponding to the estimator covariance matrix C_θ̂ must satisfy:
θ̂^T C_θ̂⁻¹ θ̂ = k
Note that the error ellipse is formed using the inverse covariance; thus finding the eigenvectors/values of C_θ̂⁻¹ shows the structure of the error ellipse.
Recall: a positive definite matrix A and its inverse A⁻¹ have the same eigenvectors and reciprocal eigenvalues.
Thus, we could instead find the eigenvalues of C_θ̂ = I⁻¹(θ) (the inverse FIM!!) and then the principal axes would have lengths set by its eigenvalues, not inverted.
15
Illustrate with the 2-D case:  θ̂^T C_θ̂⁻¹ θ̂ = k
Let v₁ & v₂ and λ₁ & λ₂ be the eigenvectors/values of C_θ̂ (not the inverse!).
[Figure: error ellipse in the (θ̂₁, θ̂₂) plane with principal axes along v₁ and v₂ and semi-axis lengths √(kλ₁) and √(kλ₂).]
16
The CRLB/FIM Ellipse
We can make an ellipse from the CRLB matrix instead of the covariance matrix, i.e., re-state this in terms of the FIM. Once we find the FIM we can:
• Find the inverse FIM
• Find its eigenvectors: these give the principal axes
• Find its eigenvalues: the principal axis lengths are then √(kλ_i)
This ellipse will be the smallest error ellipse that an unbiased estimator can achieve!
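A sketch of how the eigen-structure of a 2×2 covariance (or inverse-FIM) matrix gives the error-ellipse axes (the covariance values and P_e below are arbitrary assumptions): it uses k = −2 ln(1 − P_e) from the earlier slide and the result that the principal semi-axis lengths are √(kλ_i).

```python
import numpy as np

# Arbitrary 2x2 estimator covariance (could equally be an inverse FIM)
C = np.array([[2.0, 1.2],
              [1.2, 1.0]])
Pe = 0.9                                  # probability of lying inside the ellipse
k = -2 * np.log(1 - Pe)

lam, V = np.linalg.eigh(C)                # eigen-structure of C (not its inverse)
print("principal axis directions (columns):\n", V)
print("principal semi-axis lengths sqrt(k*lam):", np.sqrt(k * lam))

# Points on the ellipse boundary: theta = V diag(sqrt(k*lam)) [cos t, sin t]^T
t = np.linspace(0, 2 * np.pi, 7)
pts = V @ np.diag(np.sqrt(k * lam)) @ np.vstack([np.cos(t), np.sin(t)])
# Check: every boundary point satisfies theta^T C^{-1} theta = k
vals = np.einsum('ij,ij->j', pts, np.linalg.inv(C) @ pts)
print("theta^T C^-1 theta on the boundary:", vals)   # all ~ k
```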
1
3.8 Vector Transformations
Just like the scalar case: α = g(θ). If you know CRLB_θ you can find CRLB_α:
CRLB_α = [ ∂g(θ)/∂θ ] I⁻¹(θ) [ ∂g(θ)/∂θ ]^T
where ∂g(θ)/∂θ is the Jacobian matrix (see p. 46) and I⁻¹(θ) is the CRLB matrix for θ.
Example: Usually we can estimate range (R) and bearing (φ) directly, but we might really want the emitter location (x, y).
2
Example of Vector Transform
Can estimate range (R) and bearing (φ) directly, but might really want the emitter location (x_e, y_e).
Direct parameters:  θ = [R  φ]^T
Mapped parameters:  α = g(θ) = [x_e  y_e]^T = [R cos φ,  R sin φ]^T
CRLB_α = [ ∂g(θ)/∂θ ] CRLB_θ [ ∂g(θ)/∂θ ]^T
with the Jacobian matrix
∂g(θ)/∂θ = [ ∂x_e/∂R   ∂x_e/∂φ ]  =  [ cos φ   −R sin φ ]
            [ ∂y_e/∂R   ∂y_e/∂φ ]     [ sin φ    R cos φ ]
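A sketch of this vector transformation (the range/bearing CRLB matrix used as input is a made-up diagonal example, not derived in the notes): it applies CRLB_α = J · CRLB_θ · J^T with the Jacobian above to map range/bearing accuracy into (x_e, y_e) accuracy.

```python
import numpy as np

R, phi = 1000.0, np.deg2rad(30.0)          # nominal range (m) and bearing (rad)
# Assumed input bounds: var(R) = 4 m^2, var(phi) = (0.5 deg)^2
CRLB_theta = np.diag([4.0, (0.5 * np.pi / 180) ** 2])

# Jacobian of g(theta) = [R*cos(phi), R*sin(phi)]^T
J = np.array([[np.cos(phi), -R * np.sin(phi)],
              [np.sin(phi),  R * np.cos(phi)]])

CRLB_alpha = J @ CRLB_theta @ J.T
print("CRLB on (x_e, y_e):\n", CRLB_alpha)
print("std-dev bounds (m):", np.sqrt(np.diag(CRLB_alpha)))
```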
3
3.9 CRLB for the General Gaussian Case
In Sect. 3.5 we saw the CRLB for "signal + AWGN". For that case, the PDF's parameter dependence showed up only in the mean of the PDF (deterministic signal w/ scalar deterministic parameter).
Now generalize to the case where:
• the data is still Gaussian, but
• the parameter dependence is not restricted to the mean, and
• the noise is not restricted to white, so the covariance is not necessarily diagonal:
x ~ N( µ(θ), C(θ) )
One way to get this case: "signal + AGN" with a random Gaussian signal w/ a vector deterministic parameter and non-white noise.
4
For this case the FIM is given by (see Appendix 3C):
[I(θ)]_ij = [ ∂µ(θ)/∂θ_i ]^T C⁻¹(θ) [ ∂µ(θ)/∂θ_j ] + ½ tr[ C⁻¹(θ) (∂C(θ)/∂θ_i) C⁻¹(θ) (∂C(θ)/∂θ_j) ]
Term #1 captures the variability of the mean w.r.t. the parameters; Term #2 captures the variability of the covariance w.r.t. the parameters.
This shows the impact of signal model assumptions:
• deterministic signal + AGN: the estimate's covariance uses an average over the noise only
• random Gaussian signal + AGN: the estimate's covariance uses an average over the signal & noise
5
General Gaussian Example: Time-Difference-of-Arrival (TDOA)
Given: x = [x₁^T  x₂^T]^T from two receivers Rx1 and Rx2 observing a transmitter Tx:
x₁(t) = s(t − ∆τ) + w₁(t)
x₂(t) = s(t) + w₂(t)
Goal: Estimate ∆τ.
How to model the signal?
• Case #1: s(t) is a zero-mean, WSS, Gaussian process (passive sonar)
• Case #2: s(t) is a deterministic signal (radar/comm location)
Case #1:  µ(∆τ) = 0, so there is no Term #1; all the ∆τ dependence is in the covariance:
C(∆τ) = [ C₁₁(∆τ)  C₁₂(∆τ) ],   with  C_ii = C_ss + C_wiwi  and  C_ij(∆τ) = C_ss(∆τ) for i ≠ j
         [ C₂₁(∆τ)  C₂₂(∆τ) ]
Case #2:  the mean depends on ∆τ,
µ(∆τ) = [ s₁[0; ∆τ]  s₁[1; ∆τ]  …  s₁[N−1; ∆τ]   s₂[0; ∆τ]  …  s₂[N−1; ∆τ] ]^T
(the stacked samples of the two received signals), and C does not depend on ∆τ, so there is no Term #2.
6
Comments on General Gaussian CRLB
It is interesting to note that for any given problem you may find each case used in the literature!!!
For example, for the TDOA/FDOA estimation problem:
• Case #1 used by M. Wax in IEEE Trans. Info Theory, Sept. 1982
• Case #2 used by S. Stein in IEEE Trans. Signal Proc., Aug. 1993
See also the differences in the book's examples.
We'll skip Section 3.10 and leave it as a reading assignment.
1/19
3.11 CRLB Examples
1. Range Estimation sonar, radar, robotics, emitter location
2. Sinusoidal Parameter Estimation (Amp., Frequency, Phase) sonar, radar, communication receivers (recall DSB Example), etc.
3. Bearing Estimation sonar, radar, emitter location
4. Autoregressive Parameter Estimation speech processing, econometrics
We'll now apply the CRLB theory to several examples of practical signal processing problems.
We'll revisit these examples in Ch. 7, where we'll derive ML estimators that get close to achieving the CRLB.
2/19
Ex. 1 Range Estimation Problem
C-T signal model:
x(t) = s(t − τ_o) + w(t),   0 ≤ t ≤ T,   τ_o ≤ τ_max
Transmit pulse: s(t), nonzero over t ∈ [0, T_s].
Receive reflection: s(t − τ_o). Measure the time delay τ_o.
The receiver front end is a BPF & amp; w(t) is bandlimited white Gaussian noise with PSD N_o/2 over |f| ≤ B.
[Figure: transmitted pulse s(t) on [0, T_s]; received pulse s(t − τ_o); PSD of w(t) flat at N_o/2 over −B ≤ f ≤ B.]
3/19
Range Estimation D-T Signal Model
Sample every ∆ = 1/(2B) sec:  w[n] = w(n∆) is D-T white Gaussian noise with variance σ² = B N_o.
[Figure: PSD of w(t) flat at N_o/2 over |f| ≤ B, and its ACF, which has zeros at multiples of 1/(2B), so the samples are uncorrelated.]
x[n] = s(n∆ − τ_o) + w[n],   n = 0, 1, …, N−1
s[n; τ_o] has M non-zero samples starting at n_o:
x[n] = w[n],                        0 ≤ n ≤ n_o − 1
     = s(n∆ − τ_o) + w[n],          n_o ≤ n ≤ n_o + M − 1
     = w[n],                        n_o + M ≤ n ≤ N − 1
4/19
Now apply the standard CRLB result for signal + WGN:
var(τ̂_o) ≥ σ² / Σ_{n=0}^{N−1} ( ∂s[n; τ_o]/∂τ_o )²
         = σ² / Σ_{n=n_o}^{n_o+M−1} ( ∂s(n∆ − τ_o)/∂τ_o )²       (plug in and keep the non-zero terms)
         = σ² / Σ_{n=n_o}^{n_o+M−1} ( ∂s(t)/∂t |_{t = n∆ − τ_o} )²   (exploit calculus!!!)
         = σ² / Σ_{n=0}^{M−1} ( ∂s(t)/∂t |_{t = n∆} )²       (use the approximation τ_o = ∆n_o, then change variables)
Range Estimation CRLB
5/19
Assume the sample spacing is small, so approximate the sum by an integral (E_s = ∫₀^{T_s} s²(t) dt is the pulse energy):
var(τ̂_o) ≥ σ²∆ / ∫₀^{T_s} ( ∂s(t)/∂t )² dt = 1 / [ (2/N_o) ∫₀^{T_s} ( ∂s(t)/∂t )² dt ]
          = 1 / [ ( E_s/(N_o/2) ) · (1/E_s) ∫₀^{T_s} ( ∂s(t)/∂t )² dt ]
Using the FT differentiation theorem and Parseval:
var(τ̂_o) ≥ 1 / [ ( E_s/(N_o/2) ) · ∫ (2πf)² |S(f)|² df / ∫ |S(f)|² df ]
Define a BW measure:
B_rms² = ∫ (2πf)² |S(f)|² df / ∫ |S(f)|² df      (B_rms is the RMS bandwidth)
and note that E_s/(N_o/2) is a type of SNR.
Range Estimation CRLB (cont.)
6/19
Using these ideas we arrive at the CRLB on the delay:
  var(τ̂o) ≥ 1 / ( SNR × B²rms )    (sec²)

To get the CRLB on the range use the transformation-of-parameters result with R = cτo/2:

  CRLB_R = (∂R/∂τo)² CRLB_τo   ⇒   var(R̂) ≥ c² / ( 4 × SNR × B²rms )    (m²)

The CRLB is inversely proportional to the SNR measure and the RMS BW measure.

So the CRLB tells us: choose a signal with large Brms, and ensure that the SNR is large (better on nearby/large targets).
Which is better: double the transmitted energy, or double the RMS bandwidth?
Range Estimation CRLB (cont.)
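As a concrete check of these formulas, here is a minimal MATLAB-style sketch (added here, not part of the original notes) that evaluates Brms, the SNR measure, and the resulting delay/range bounds for a hypothetical pulse; the pulse shape and noise numbers are made-up examples.

  % Sketch: numerically evaluate the delay/range CRLB for a hypothetical pulse
  c  = 3e8;                          % propagation speed (m/s)
  B  = 1e6;  No = 1e-12;             % noise bandwidth (Hz) and PSD level (made up)
  dt = 1/(2*B);  Ts = 1e-4;          % sample spacing and pulse duration
  t  = (0:dt:Ts).';
  s  = (0.5*(1 - cos(2*pi*t/Ts))) .* cos(2*pi*2e5*t);   % hypothetical pulse s(t)
  Es = sum(s.^2)*dt;                 % approximates the integral of s^2(t)
  sd = [diff(s)/dt; 0];              % approximates ds/dt
  Brms2 = sum(sd.^2)*dt / Es;        % (2*pi*f)^2-weighted RMS bandwidth squared
  SNR   = Es/(No/2);                 % the "type of SNR" defined above
  crlb_tau = 1/(SNR*Brms2);          % sec^2
  crlb_R   = (c^2/4)*crlb_tau;       % m^2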
7/19
  x[n] = A cos(Ωo n + φ) + w[n],   n = 0, 1, …, N−1

Given DT signal samples of a sinusoid in noise, estimate its amplitude, frequency, and phase.
DT White Gaussian Noise: zero mean & variance σ²
Ωo is the DT frequency in rad/sample: 0 < Ωo < π
Multiple parameters, so parameter vector:  θ = [A Ωo φ]^T
Recall SNR of sinusoid in noise is:
  SNR = P_s / P_n = (A²/2) / σ² = A² / (2σ²)
Ex. 2 Sinusoid Estimation CRLB Problem
8/19
Approach: Find Fisher Info Matrix Invert to get CRLB matrix Look at diagonal elements to get bounds on parm variances
Recall: Result for FIM for general Gaussian case specialized to signal in AWGN case:
  [I(θ)]_ij = (1/σ²) [∂s(θ)/∂θ_i]^T [∂s(θ)/∂θ_j] = (1/σ²) Σ_{n=0}^{N−1} ( ∂s[n;θ]/∂θ_i )( ∂s[n;θ]/∂θ_j )
Sinusoid Estimation CRLB Approach
9/19
Taking the partial derivatives and using approximations given in the book (valid when Ωo is not near 0 or π):
With θ = [A Ωo φ]^T and all sums over n = 0, …, N−1:

  [I(θ)]_11 = (1/σ²) Σ cos²(Ωo n + φ) = (1/σ²) Σ [ 1/2 + (1/2)cos(2Ωo n + 2φ) ] ≈ N/(2σ²)

  [I(θ)]_12 = [I(θ)]_21 = −(A/σ²) Σ n cos(Ωo n + φ) sin(Ωo n + φ) = −(A/2σ²) Σ n sin(2Ωo n + 2φ) ≈ 0

  [I(θ)]_13 = [I(θ)]_31 = −(A/σ²) Σ cos(Ωo n + φ) sin(Ωo n + φ) = −(A/2σ²) Σ sin(2Ωo n + 2φ) ≈ 0

  [I(θ)]_22 = (A²/σ²) Σ n² sin²(Ωo n + φ) = (A²/σ²) Σ n² [ 1/2 − (1/2)cos(2Ωo n + 2φ) ] ≈ (A²/2σ²) Σ n²

  [I(θ)]_23 = [I(θ)]_32 = (A²/σ²) Σ n sin²(Ωo n + φ) ≈ (A²/2σ²) Σ n

  [I(θ)]_33 = (A²/σ²) Σ sin²(Ωo n + φ) ≈ N A²/(2σ²)
Sinusoid Estimation Fisher Info Elements
10/19
The Fisher Info Matrix then is (θ = [A Ωo φ]^T):

  I(θ) ≈ (1/σ²) [  N/2          0             0
                   0        (A²/2) Σ n²   (A²/2) Σ n
                   0        (A²/2) Σ n     N A²/2     ]

Recall SNR = A²/(2σ²), and there are closed-form results for the sums Σ n and Σ n².
Sinusoid Estimation Fisher Info Matrix
11/19
Inverting the FIM by hand gives the CRLB matrix and then extracting the diagonal elements gives the three bounds:
(using co-factor & detapproach helped by 0s)
  var(Â) ≥ 2σ²/N     (volts²)

  var(Ω̂o) ≥ 12 / ( SNR × N(N² − 1) )     ((rad/sample)²)

  var(φ̂) ≥ 2(2N − 1) / ( SNR × N(N + 1) ) ≈ 4 / ( SNR × N )     (rad²)

Amp. bound: decreases as 1/N; depends on the noise variance (not on SNR)
Freq. bound: decreases as 1/N³ and as 1/SNR
Phase bound: decreases as 1/N and as 1/SNR

To convert the frequency bound to Hz², multiply by (Fs/2π)².
Sinusoid Estimation CRLBs
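The following MATLAB-style fragment (an added sketch, not from the notes) simply evaluates the three bounds above versus N for made-up values of A and σ², which makes the 1/N, 1/N³, and 1/N behaviors easy to see.

  % Sketch: sinusoid-parameter CRLBs vs. N (A and sigma2 are hypothetical values)
  A = 1;  sigma2 = 0.1;  SNR = A^2/(2*sigma2);
  N = (10:10:1000).';
  crlb_A     = 2*sigma2 ./ N;                          % volts^2
  crlb_Omega = 12 ./ (SNR .* N .* (N.^2 - 1));         % (rad/sample)^2
  crlb_phase = 2*(2*N - 1) ./ (SNR .* N .* (N + 1));   % rad^2, ~ 4/(SNR*N)
  % loglog(N, [crlb_A crlb_Omega crlb_phase]) shows the slopes discussed above.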
12/19
The CRLB for Freq. Est. referred back to the CT is
  var(f̂o) ≥ 12 Fs² / ( (2π)² × SNR × N(N² − 1) )     (Hz²)
Does that mean we do worse if we sample faster than Nyquist?NO!!!!! For a fixed duration T of signal: N = TFs
Also keep in mind that Fs has effect on the noise structure:
Not in Book
[Figure: PSD of w(t) (level No/2 over ±B) and ACF of w(t) — σ² = BNo, as on the earlier slide]
Frequency Estimation CRLBs and Fs
13/19
Uniformly spaced linear array with M sensors: Sensor Spacing of d meters Bearing angle to target β radians
Figure 3.8 from textbook:
Simple model
Emits or reflects signal s(t)
  s(t) = A cos(2π fo t + φ)

Propagation time to the nth sensor:   t_n = t_0 − (n d cos β)/c,   n = 0, 1, …, M−1

Signal at the nth sensor:
  s_n(t) = s(t − t_n) = A cos( 2π fo ( t − t_0 + n d cos β / c ) + φ )
Ex. 3 Bearing Estimation CRLB Problem
14/19
Now instead of sampling each sensor at lots of time instants we just grab one snapshot of all M sensors at a single instant ts
  s_n(t_s) = A cos( 2π fo (t_s − t_0) + 2π fo (d cos β / c) n + φ ) = A cos( Ωs n + φ̃ )

a spatial sinusoid with spatial frequency Ωs:
  Ωs = 2π fo d cos β / c   (rad/sensor),   ωs = Ωs/d   (rad/meter),   φ̃ = 2π fo (t_s − t_0) + φ
For sinusoidal transmitted signal Bearing Est. reduces to Frequency Est.
And we already know its FIM & CRLB!!!
Bearing Estimation Snapshot of Sensor Signals
15/19
  x[n] = s_n(t_s) + w[n] = A cos(Ωs n + φ̃) + w[n]
Each sample in the snapshot is corrupted by a noise sample
and these M samples make the data vector x = [x[0] x[1] x[M-1] ]:
Each w[n] is a noise sample that comes from a different sensor so Model as uncorrelated Gaussian RVs (same as white temporal noise)Assume each sensor has same noise variance σ2
So the parameters to consider are:  θ = [A Ωs φ̃]^T
which get transformed to:

  α = g(θ) = [ A    β    φ̃ ]^T      with   β = arccos( c Ωs / (2π fo d) )
Parameter of interest!
Bearing Estimation Data and Parameters
16/19
Using the FIM for the sinusoidal parameter problem together with the transform. of parms result (see book p. 59 for details):
  var(β̂) ≥ 12 / [ (2π)² × SNR × M (M+1)/(M−1) × L_r² × sin²β ]     (rad²)
Bearing accuracy (variance bound):
• Decreases as 1/SNR
• Decreases as 1/M
• Decreases as 1/L_r²
• Depends on the actual bearing β: best at β = π/2 (broadside), impossible at β = 0 (endfire)

L = array physical length in meters;  M = number of array elements;  λ = c/fo = wavelength in meters
Define L_r = L/λ = array length in wavelengths.
Low-frequency (i.e., long wavelength) signals need very large physical lengths to achieve good accuracy
Bearing Estimation CRLB Result
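To see the broadside/endfire behavior numerically, here is a short MATLAB-style sketch (added, not in the original notes) that evaluates the bound reconstructed above over β for a hypothetical array; M, d, λ, and SNR are all made-up values.

  % Sketch: bearing CRLB vs. bearing angle beta for a hypothetical array
  M = 10;  d = 0.5;  lambda = 1;  SNR = 10;      % assumed half-wavelength spacing
  L  = (M-1)*d;   Lr = L/lambda;
  beta = linspace(0.05, pi-0.05, 500);           % stay away from endfire (beta = 0)
  crlb_beta = 12 ./ ((2*pi)^2 * SNR * Lr^2 * (M*(M+1)/(M-1)) .* sin(beta).^2);
  % plot(beta*180/pi, sqrt(crlb_beta)) shows best accuracy at broadside (90 deg).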
17/19
In speech processing (and other areas) we often model the signal as an AR random process and need to estimate the AR parameters. An AR process has a PSD given by
  P_xx(f; θ) = σ_u² / | 1 + Σ_{m=1}^{p} a[m] e^{−j2πfm} |²
AR Estimation Problem: Given data x[0], x[1], , x[N-1] estimate the AR parameter vector
  θ = [ a[1]  a[2]  …  a[p]  σ_u² ]^T
This is a hard CRLB to find exactly but it has been published. The difficulty comes from the fact that there is no easy direct relationship between the parameters and the data.
It is not a signal plus noise problem
Ex. 4 AR Estimation CRLB Problem
18/19
Approach: The asymptotic result we discussed is perfect here. An AR process is WSS, which is required for the Asymp. Result.
Gaussian is often a reasonable assumption needed for Asymp. Result
The Asymp. Result is in terms of partial derivatives of the PSD and that is exactly the form in which the parameters are clearly displayed!
  [I(θ)]_ij ≈ (N/2) ∫_{−1/2}^{1/2} [ ∂ln P_xx(f;θ)/∂θ_i ] [ ∂ln P_xx(f;θ)/∂θ_j ] df

Recall:
  ln P_xx(f;θ) = ln σ_u² − ln | 1 + Σ_{m=1}^{p} a[m] e^{−j2πfm} |²
AR Estimation CRLB Asymptotic Approach
19/19
After taking these derivatives you get results that can be simplified using properties of FT and convolution.
The final result is:

  var(â[k]) ≥ (σ_u²/N) [ R_xx^{−1} ]_kk ,   k = 1, 2, …, p

  var(σ̂_u²) ≥ 2σ_u⁴ / N

Both decrease as 1/N.
To get a little insight look at 1st order AR case (p = 1):
  var(â[1]) ≥ ( 1 − a²[1] ) / N

In general there is a complicated dependence on the AC matrix! The bound improves as the pole a[1] gets closer to the unit circle — PSDs with sharp peaks are easier to estimate.
[Figure: z-plane showing the pole at a[1]]
AR Estimation CRLB Asymptotic Result
1
CRLB Example: Single-Rx Emitter Location via Doppler
  s(t; f1), t ∈ [t1, t1 + T];   s(t; f2), t ∈ [t2, t2 + T];   s(t; f3), t ∈ [t3, t3 + T]

Received Signal Parameters Depend on Location
Estimate Rx Signal Frequencies: f1, f2, f3, …, fN
Then Use Measured Frequencies to Estimate Location
(X, Y, Z, fo)
2
Problem BackgroundRadar to be Located: at Unknown Location (X,Y,Z)Transmits Radar Signal at Unknown Carrier Frequency fo
Signal is intercepted by airborne receiver
Known (Navigation Data): Antenna Positions: (Xp(t), Yp(t), Zp(t))Antenna Velocities: (Vx(t), Vy(t), Vz(t))
Goal: Estimate Parameter Vector x = [X Y Z fo]T
3
Physics of Problem
[Figure: emitter, moving receiver with velocity v(t), line-of-sight unit vector u(t)]

Relative motion between the emitter and receiver causes a Doppler shift of the carrier frequency:

  f(x, t) = fo − (fo/c) v(t)•u(t)
          = fo − (fo/c) [ Vx(t)(X − Xp(t)) + Vy(t)(Y − Yp(t)) + Vz(t)(Z − Zp(t)) ] / sqrt( (X − Xp(t))² + (Y − Yp(t))² + (Z − Zp(t))² )
Because we estimate the frequency there is an error added:
  f̃(x, t_i) = f(x, t_i) + v(t_i)
4
Estimation Problem StatementGiven:
Data Vector:   f̃(x) = [ f̃(x, t1)  f̃(x, t2)  …  f̃(x, tN) ]^T

Navigation Info:
  Xp(t1), Xp(t2), …, Xp(tN)
  Yp(t1), Yp(t2), …, Yp(tN)
  Zp(t1), Zp(t2), …, Zp(tN)
  Vx(t1), Vx(t2), …, Vx(tN)
  Vy(t1), Vy(t2), …, Vy(tN)
  Vz(t1), Vz(t2), …, Vz(tN)
Estimate:Parameter Vector: [X Y Z fo]T
Right now only want to consider the CRLB
Vector-Valued function of a Vector
5
The CRLBNote that this is a signal plus noise scenario:
The signal is the noise-free frequency values The noise is the error made in measuring frequency
Assume zero-mean Gaussian noise with covariance matrix C: Can use the General Gaussian Case of the CRLB Of course validity of this depends on how closely the errors
of the frequency estimator really do follow this
Our data vector is distributed according to:   f̃(x) ~ N( f(x), C )
Only Mean Shows Dependence on
parameter x!
Only need the first term in the CRLB equation:
  [J(x)]_ij = [ ∂f(x)/∂x_i ]^T C^{−1} [ ∂f(x)/∂x_j ]
I use J for the FIM instead of I to avoid confusion with the identity matrix.
6
Convenient Form for FIM
To put this into an easier form to look at Define a matrix H:
Called The Jacobian of f(x)
  H = ∂f(x, t)/∂x |_{x = true value} = [ h1 | h2 | h3 | h4 ]

where
  h_j = [ ∂f(x, t1)/∂x_j   ∂f(x, t2)/∂x_j   …   ∂f(x, tN)/∂x_j ]^T
Then it is easy to verify that the FIM becomes:

  J = H^T C^{−1} H
7
CRLB MatrixThe Cramer-Rao bound covariance matrix then is:
  C_CRB(x) = J^{−1} = ( H^T C^{−1} H )^{−1}
A closed-form expression for the partial derivatives needed for H can be computed in terms of an arbitrary set of navigation data see Reading Versionof these notes.
Thus, for a given emitter-platform geometry it is possible to compute the matrix H and then use it to compute the CRLB covariance matrix in (5), from which an eigen-analysis can be done to determine the 4-D error ellipsoid.
Can't really plot a 4-D ellipsoid!!!
But it is possible to project this 4-D ellipsoid down into 3-D (or even 2-D) so that you can see the effect of geometry.
8
Projection of Error Ellipsoids
A zero-mean Gaussian vector made of two sub-vectors x & y:   θ = [ x^T  y^T ]^T

The PDF is:
  p(θ) = 1 / ( (2π)^{N/2} sqrt(det(C_θ)) ) · exp( −(1/2) θ^T C_θ^{−1} θ )

  C_θ = [ C_x    C_xy ]
        [ C_yx   C_y  ]

The quadratic form in the exponential defines an ellipsoid:
  θ^T C_θ^{−1} θ = k      (choose k to set the size of the ellipsoid so that θ falls inside it with a desired probability)
Q: If we are given the covariance Cθ how is x alone is distributed?
A: Extract the sub-matrix Cx out of Cθ
See also Slice Of Error Ellipsoids
9
Projections of 3-D Ellipsoids onto 2-D Space
2-D Projections ShowExpected Variation
in 2-D Space
[Figure: 3-D ellipsoid in (x, y, z) and its 2-D projection in the x-y plane — 2-D projections show the expected variation in 2-D space]

Projection Example — Full Vector: θ = [x y z]^T;  Sub-Vector: x = [x y]^T
We want to project the 3-D ellipsoid for θ down into a 2-D ellipse for x.
The 2-D ellipse still shows the full range of variations of x and y.
10
Finding ProjectionsTo find the projection of the CRLB ellipse:
1. Invert the FIM to get CCRB2. Select the submatrix CCRB,sub from CCRB3. Invert CCRB,sub to get Jproj 4. Compute the ellipse for the quadratic form of Jproj
Mathematically:

  C_CRB,sub = P C_CRB P^T          J_proj = ( P J^{−1} P^T )^{−1}

P is a matrix formed from the identity matrix: keep only the rows of the variables being projected onto.

For this example, frequency-based emitter location with x = [X Y Z fo]^T, to project the 4-D error ellipsoid onto the X-Y plane:

  P = [ 1 0 0 0
        0 1 0 0 ]
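A small MATLAB-style sketch of steps 1–4 (added here for illustration; the 4×4 FIM below is a made-up positive-definite example, not a real emitter-location FIM):

  % Sketch: project a 4-D CRLB ellipsoid for [X Y Z fo] onto the X-Y plane
  Atmp = randn(4);  J = Atmp*Atmp.' + 4*eye(4);   % hypothetical (SPD) 4x4 FIM
  CCRB     = inv(J);                  % step 1: invert the FIM
  P        = [1 0 0 0; 0 1 0 0];      % keep only the X and Y rows
  CCRB_sub = P*CCRB*P.';              % step 2: select the sub-matrix
  Jproj    = inv(CCRB_sub);           % step 3: invert the sub-matrix
  [V,D] = eig(CCRB_sub);              % step 4: trace out the k = 1 ellipse
  ang   = linspace(0, 2*pi, 200);
  ellipse = V*sqrt(D)*[cos(ang); sin(ang)];
  % plot(ellipse(1,:), ellipse(2,:)) draws the projected 2-D error ellipse.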
11
Projections Applied to Emitter Location
Shows 2-D ellipses that result from projecting 4-D
ellipsoids
12
Slices of Error EllipsoidsQ: What happens if one parameter were perfectly known.
Capture by setting that parameters error to zero⇒ slice through the error ellipsoid.
Slices ShowImpact of Knowledge
of a Parameter
xy
z
Impact: slice = projection when ellipsoid not tilted slice < projection when ellipsoid is tilted.
Recall: Correlation causes tilt
1
Chapter 4Linear Models
2
General Linear ModelRecall signal + WGN case: x[n] = s[n;θ] + w[n]
x = s(θ) + w Here, dependence on θ is general
Now we consider a special case — Linear Observations:  s(θ) = Hθ + b

The General Linear Model:
  x = Hθ + b + w

  x : N×1 data vector
  H : N×p known "observation matrix", full rank
  θ : p×1 parameters to be estimated
  b : known "offset" (N×1)
  w ~ N(0, C) : zero-mean Gaussian noise, C positive definite

Note: Gaussian is part of the Linear Model.
3
Need For Full-Rank H MatrixNote: We must assume H is full rank
Q: Why?
A: If not, the estimation problem is “ill-posed”…given vector s there are multiple θ vectors that give s:
If H is not full rank…Then for any s : ∃ θ1, θ2 such that s = Hθ1 = Hθ2
4
Importance of The Linear Model
There are several reasons:
1. Some applications admit this model
2. Nonlinear models can sometimes be linearized
  θ̂_MVU = ( H^T C^{−1} H )^{−1} H^T C^{−1} ( x − b )    … as we'll see!!!
3. Finding Optimal Estimator is Easy
5
MVUE for Linear Model
Theorem: The MVUE for the General Linear Model and its covariance (i.e., its accuracy performance) are given by:

  θ̂_MVU = ( H^T C^{−1} H )^{−1} H^T C^{−1} ( x − b )

  C_θ̂ = ( H^T C^{−1} H )^{−1}       and it achieves the CRLB.
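A quick numerical illustration of the theorem (an added sketch, not from the notes): the H, C, b, and θ below are arbitrary made-up values, and the estimate and covariance are formed exactly as in the two formulas above.

  % Sketch: MVUE for the general linear model on simulated data
  N = 100;  H = [ones(N,1) (0:N-1).'];            % hypothetical observation matrix
  theta_true = [1; 0.05];  b = zeros(N,1);
  C = 0.5*eye(N);                                 % noise covariance (white here)
  x = H*theta_true + b + sqrtm(C)*randn(N,1);     % simulated data
  Ci = inv(C);
  theta_mvu = (H.'*Ci*H) \ (H.'*Ci*(x - b));      % MVU estimate
  cov_theta = inv(H.'*Ci*H);                      % its covariance (= CRLB)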
Proof: We’ll do this for the b = 0 case but it can easily be done for the more general case.
First we have that x~N(Hθ,C) because:
  E{x} = E{Hθ + w} = Hθ + E{w} = Hθ
  cov{x} = E{ (x − Hθ)(x − Hθ)^T } = E{ w w^T } = C
6
Recalling CRLB Theorem… Look at the partial of LLF:
Constant w.r.t. θ
Linear w.r.t. θ
Quadratic w.r.t. θ(Note: HTC-1H is symmetric)
Now use results in “Gradients and Derivatives” posted on BB:
The CRLB Theorem says that if we have this form we have found the MVU and it achieves the CRLB of I-1(θ)!!
  ∂ln p(x;θ)/∂θ = ∂/∂θ [ −(1/2)(x − Hθ)^T C^{−1}(x − Hθ) ]
               = ∂/∂θ [ −(1/2)( x^T C^{−1} x − 2 x^T C^{−1} Hθ + θ^T H^T C^{−1} H θ ) ]

(using (Hθ)^T C^{−1} x = [ (Hθ)^T C^{−1} x ]^T = x^T C^{−1} Hθ)

  ∂ln p(x;θ)/∂θ = H^T C^{−1} x − H^T C^{−1} H θ
               = ( H^T C^{−1} H ) [ ( H^T C^{−1} H )^{−1} H^T C^{−1} x  −  θ ]        ← pull out H^T C^{−1} H
               = I(θ) [ θ̂ − θ ]          with  θ̂ = g(x) = ( H^T C^{−1} H )^{−1} H^T C^{−1} x
7
Whitening Filter Viewpoint
For simplicity, assume b = 0. Assume C is positive definite (necessary for C^{−1} to exist).

Thus, from (A1.2): for pos. def. C there exists an N×N invertible matrix D such that
  C^{−1} = D^T D        C = D^{−1}(D^T)^{−1}

Transform the data x using the matrix D:
  x̃ = Dx = DHθ + Dw = H̃θ + w̃

Claim: w̃ is white!
  E{w̃ w̃^T} = E{(Dw)(Dw)^T} = D E{w w^T} D^T = D C D^T = D D^{−1}(D^T)^{−1} D^T = I

x → [Whitening Filter D] → x̃ → [MVUE for Lin. Model w/ White Noise] → θ̂
8
Ex. 4.1: Curve FittingCaution: The “Linear” in “Linear Model”
does not come from fitting straight lines to data
It is more general than that !!
[Figure: data x[n] vs. n with a quadratic curve fit]

The model is quadratic in the index n, but the model is linear in the parameters (linear in the θ's):

  x[n] = θ1 + θ2 n + θ3 n² + w[n]

  x = Hθ + w,    θ = [ θ1  θ2  θ3 ]^T,    H = [ 1   1   1
                                                1   2   4
                                                ⋮   ⋮   ⋮
                                                1   N   N² ]
9
Ex. 4.2: Fourier Analysis (not most general)
Data Model:

  x[n] = Σ_{k=1}^{M} [ a_k cos(2πkn/N) + b_k sin(2πkn/N) ] + w[n]        (AWGN)

Parameters to estimate (Fourier Coefficients):  θ = [ a1 … aM  b1 … bM ]^T

Observation Matrix: H has columns cos(2πkn/N) and sin(2πkn/N), k = 1, 2, …, M, with n = 0, 1, …, N−1 running down each column.
10
Now apply MVUE Theorem for Linear Model:
  θ̂_MVU = ( H^T H )^{−1} H^T x

Using the standard orthogonality of sinusoids (see book):  H^T H = (N/2) I

  ⇒  θ̂_MVU = (2/N) H^T x

Each Fourier coefficient estimate is found by the inner product of a column of H with the data vector x.

Interesting!!! The Fourier coefficients computed from signal + AWGN are the MVU estimates of the Fourier coefficients of the noise-free signal.
COMMENT: Modeling and Estimation (are Intertwined)• Sometimes the parameters have some physical significance (e.g. delay
of a radar signal).• But sometimes parameters are part of non-physical assumed model
(e.g. Fourier)• Fourier Coefficients for signal + AGWN are MVU estimates of the
Fourier Coefficients of the noise-free signal
11
Ex. 4.3: System Identification
[Figure: known input u[n] → unknown system H(z) → sum with noise w[n] → observed noisy output x[n]]
Goal: Determine a model for the system Some Application Areas:
• Wireless Communications (identify & equalize multipath)• Geophysical Sensing (oil exploration)• Speakerphone (echo cancellation)
In many applications: assume that the system is FIR (length p)
  x[n] = Σ_{k=0}^{p−1} h[k] u[n−k] + w[n]        (AWGN)

  h[k]: estimation parameters
  u[n]: known input (assume u[n] = 0 for n < 0)
  x[n]: measured data
  p: FIR length — unknown in general, but here we'll assume it is known
12
Write the FIR convolution in matrix form, x = Hθ + w:

  x  =  [ u[0]      0        …      0
          u[1]      u[0]     …      0
          u[2]      u[1]     …      0
          ⋮                         ⋮
          u[N−1]   u[N−2]    …   u[N−p] ]  ·  [ h[0]  h[1]  …  h[p−1] ]^T   +   w

  H: N×p known input-signal matrix;   θ = [ h[0] … h[p−1] ]^T: estimate this;   x: measured data

The Theorem for the Linear Model says:

  θ̂_MVU = ( H^T H )^{−1} H^T x        C_θ̂ = σ² ( H^T H )^{−1}       and it achieves the CRLB.
13
Q: What signal u[n] is best to use ?
A: The u[n] that gives the smallest estimated variances!!
Book shows: Choosing u[n] s.t. HTH is diagonal will minimize variance
⇒ Choose u[n] to be pseudo-random noise (PRN)u[n] is ⊥ to all its shifts u[n – m]
Proof uses:   C_θ̂ = σ² ( H^T H )^{−1}
And Cauchy-Schwarz Inequality (same as Schwarz Ineq.)
Note: PRN has approximately flat spectrum
So from a frequency-domain view a PRN signal equally probes at all frequencies
1
Chapter 6Best Linear Unbiased Estimate
(BLUE)
2
Motivation for BLUEExcept for Linear Model case, the optimal MVU estimator might:
1. not even exist2. be difficult or impossible to find
⇒ Resort to a sub-optimal estimateBLUE is one such sub-optimal estimate
Idea for BLUE: 1. Restrict estimate to be linear in data x2. Restrict estimate to be unbiased3. Find the best one (i.e. with minimum variance)
Advantage of BLUE:Needs only 1st and 2nd moments of PDF
Mean & CovarianceDisadvantages of BLUE:
1. Sub-optimal (in general)2. Sometimes totally inappropriate (see bottom of p. 134)
3
6.3 Definition of BLUE (scalar case)Observed Data: x = [x[0] x[1] . . . x[N – 1] ]T
PDF: p(x;θ ) depends on unknown θ
BLUE constrained to be linear in the data:   θ̂_BLU = Σ_{n=0}^{N−1} a_n x[n] = a^T x

Choose the a_n's to give: 1. an unbiased estimator, 2. then minimize the variance.
[Figure: variance of the class of linear unbiased estimators vs. nonlinear unbiased estimators — BLUE is the best of the linear class; the MVUE may have lower variance.  Note: this is not Fig. 6.1]
4
6.4 Finding The BLUE (Scalar Case)

1. Constrain to be linear:   θ̂ = Σ_{n=0}^{N−1} a_n x[n]

2. Constrain to be unbiased:   E{θ̂} = θ
   ⇓ (using the linear constraint)
   Σ_{n=0}^{N−1} a_n E{x[n]} = θ
Q: When can we meet both of these constraints?
A: Only for certain observation models (e.g., linear observations)
5
Finding BLUE for Scalar Linear ObservationsConsider scalar-parameter linear observation:
x[n] = θs[n] + w[n] ⇒ Ex[n] = θs[n]
Then for the unbiased condition we need:

  θ̂ = Σ_{n=0}^{N−1} a_n x[n]    with    E{θ̂} = Σ_{n=0}^{N−1} a_n θ s[n] = θ   ⇒   need  a^T s = 1

This tells how to choose the weights to use in the BLUE estimator form.

Now, given that these constraints are met, we need to minimize the variance!
Given that C is the covariance matrix of x we have:

  var(θ̂_BLU) = var(a^T x) = a^T C a        (like var(aX) = a² var(X))
6
Goal: minimize aTCa subject to aTs = 1
⇒ Constrained optimization
Appendix 6A: Use Lagrangian Multipliers: Minimize J = aTCa + λ(aTs – 1)
Set ∂J/∂a = 2Ca + λs = 0   ⇒   a = −(λ/2) C^{−1} s

Apply the constraint a^T s = 1:   −(λ/2) s^T C^{−1} s = 1   ⇒   −λ/2 = 1/( s^T C^{−1} s )

  a = C^{−1} s / ( s^T C^{−1} s )

  θ̂_BLUE = a^T x = s^T C^{−1} x / ( s^T C^{−1} s )          var(θ̂) = 1 / ( s^T C^{−1} s )
Appendix 6A shows that this achieves a global minimum
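For concreteness, here is a brief MATLAB-style sketch (added; all numbers are hypothetical) of the scalar BLUE just derived, for x[n] = θ s[n] + w[n] with a known, non-white, diagonal noise covariance:

  % Sketch: scalar BLUE for x[n] = theta*s[n] + w[n]
  N = 50;  s = cos(2*pi*0.05*(0:N-1)).';  theta = 2;     % made-up signal and parameter
  C = diag(0.1 + 0.9*rand(N,1));                         % hypothetical noise covariance
  x = theta*s + sqrt(diag(C)).*randn(N,1);               % simulated data
  Ci = inv(C);
  a  = Ci*s / (s.'*Ci*s);              % optimal weights from the Lagrangian result
  theta_blue = a.'*x;                  % = s'*C^-1*x / (s'*C^-1*s)
  var_blue   = 1/(s.'*Ci*s);           % its variance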
7
Applicability of BLUE
We just derived the BLUE under the following:1. Linear observations but with no constraint on the noise PDF2. No knowledge of the noise PDF other than its mean and cov!!
What does this tell us???BLUE is applicable to linear observations
But noise need not be Gaussian!!! (as was assumed in Ch. 4 Linear Model)
And all we need are the 1st and 2nd moments of the PDF!!!
But we'll see in the Example that we can often linearize a nonlinear model!!!
8
6.5 Vector Parameter Case: Gauss-Markov Thm
Gauss-Markov Theorem:If data can be modeled as having linear observations in noise:
wHθx +=Known Matrix Known Mean & Cov
(PDF is otherwise arbitrary & unknown)
Then the BLUE is:    θ̂_BLUE = ( H^T C^{−1} H )^{−1} H^T C^{−1} x

and its covariance is:    C_θ̂ = ( H^T C^{−1} H )^{−1}
Note: If noise is Gaussian then BLUE is MVUE
9
Ex. 4.3: TDOA-Based Emitter Location
Tx @ (xs,ys)
Rx3(x3,y3)
Rx2(x2,y2)
Rx1(x1,y1)
s(t)
s(t – t1) s(t – t2) s(t – t3)
Hyperbola:τ12 = t2 – t1 = constant
Hyperbola:τ23 = t3 – t2 = constant
TDOA = Time-Difference-of-Arrival
Assume that the ith Rx can measure its TOA: ti
Then… from the set of TOAs… compute TDOAs
Then… from the set of TDOAs… estimate location (xs,ys)
We won’t worry about “how” they do that.Also… there are TDOA systems that never actually estimate TOAs!
10
TOA Measurement Model
Assume measurements of the TOAs at N receivers (only 3 shown above): t0, t1, …, tN−1. There are measurement errors.

TOA measurement model:
  ti = To + Ri/c + εi ,   i = 0, 1, …, N−1
  To = time the signal was emitted;  Ri = range from Tx to Rx_i;  c = speed of propagation (for EM: c = 3×10⁸ m/s)

Measurement noise εi: zero-mean, variance σ², independent (but PDF unknown); the variance is determined by the estimator used to estimate the ti's.

Now use Ri = [ (xs − xi)² + (ys − yi)² ]^{1/2}:

  ti = f_i(xs, ys) = To + (1/c) sqrt( (xs − xi)² + (ys − yi)² ) + εi        (nonlinear model)
11
Linearization of TOA Model
⇒ θ = [δx δy]T
So… we linearize the model so we can apply BLUE:Assume some rough estimate is available (xn, yn)
xs = xn + δxs ys = yn + δys
know estimate know estimate
Now use a truncated Taylor series to linearize Ri about (xn, yn):

  Ri ≈ Rni + [ (xn − xi)/Rni ] δxs + [ (yn − yi)/Rni ] δys  =  Rni + Ai δxs + Bi δys

where Rni (the range from (xn, yn) to Rx_i), Ai, and Bi are all known.

Apply this to the TOA:

  t̃i  =  ti − Rni/c  =  To + (Ai/c) δxs + (Bi/c) δys + εi

Three unknown parameters to estimate: To, δxs, δys
12
TOA Model vs. TDOA ModelTwo options now:
1. Use TOA to estimate 3 parameters: To, δxs, δys
2. Use TDOA to estimate 2 parameters: δxs, δys
Generally the fewer parameters the better…Everything else being the same.
But… here “everything else” is not the same: Options 1 & 2 have different noise models
(Option 1 has independent noise)(Option 2 has correlated noise)
In practice… we’d explore both options and see which is best.
13
Conversion to TDOA Model N–1 TDOAs rather than N TOAs
TDOAs:   τi = t̃i − t̃_{i−1} ,   i = 1, 2, …, N−1

  τi = [ (Ai − A_{i−1})/c ] δxs + [ (Bi − B_{i−1})/c ] δys + ( εi − ε_{i−1} )
         known                     known                     correlated noise

In matrix form:  x = Hθ + w

  x = [ τ1  τ2  …  τ_{N−1} ]^T          θ = [ δxs  δys ]^T

  H = (1/c) [ A1 − A0              B1 − B0
              A2 − A1              B2 − B1
              ⋮                      ⋮
              A_{N−1} − A_{N−2}    B_{N−1} − B_{N−2} ]

  w = Aε      (see book for the structure of the matrix A)

  C_w = cov(w) = σ² A A^T
14
Apply BLUE to the TDOA Linearized Model:

  C_θ̂ = ( H^T C_w^{−1} H )^{−1} = σ² ( H^T (A A^T)^{−1} H )^{−1}        ← describes how large the location error is

  θ̂_BLUE = ( H^T C_w^{−1} H )^{−1} H^T C_w^{−1} x = ( H^T (A A^T)^{−1} H )^{−1} H^T (A A^T)^{−1} x

The dependence on σ² cancels out of the estimate!!!
Things we can now do:1. Explore estimation error cov for different Tx/Rx geometries
• Plot error ellipses2. Analytically explore simple geometries to find trends
• See next chart (more details in book)
15
Apply TDOA Result to Simple Geometry
[Figure: Tx at range R from a 3-element line array Rx1–Rx2–Rx3 with spacing d and angle α]

Then one can show:

  C_θ̂ = c²σ² [ 1/(2cos²α)            0
                0            3/(2(1 − sin α)²) ]

Diagonal error cov ⇒ aligned error ellipse, and the y-error is always bigger than the x-error.
16
[Figure: normalized standard deviations σx/(cσ) and σy/(cσ) vs. α (degrees) for the Tx / 3-Rx geometry above]

• Used Std. Dev. to show units of X & Y
• Normalized by cσ — get actual values by multiplying by your specific cσ value
• For Fixed Range R: Increasing Rx Spacing d Improves Accuracy
• For Fixed Spacing d: Decreasing Range R Improves Accuracy
1
Chapter 7Maximum Likelihood Estimate
(MLE)
2
Motivation for MLEProblems: 1. MVUE often does not exist or can’t be found
<See Ex. 7.1 in the textbook for such a case>2. BLUE may not be applicable (x ≠ Hθ + w)
Solution: If the PDF is known, then MLE can always be used!!!
This makes the MLE one of the most popular practical methods
• Advantages: 1. It is a “Turn-The-Crank” method2. “Optimal” for large enough data size
• Disadvantages: 1. Not optimal for small data size2. Can be computationally complex
- may require numerical methods
3
Rationale for MLEChoose the parameter value that:
makes the data you did observe…the most likely data to have been observed!!!
Consider 2 possible parameter values: θ1 & θ2
Ask the following: If θi were really the true value, what is the probability that I would get the data set I really got ?
Let this probability be Pi
So if Pi is small… it says you actually got a data set that was unlikely to occur! Not a good guess for θi!!!
But P1 ≈ p(x; θ1) dx and P2 ≈ p(x; θ2) dx
⇒ pick θ̂_ML so that p(x; θ̂_ML) is largest
4
Definition of the MLEis the value of θ that maximizes the “Likelihood
Function” p(x;θ) for the specific measured data xMLθ
maximizes the likelihood function
MLθp(x;θ)
θMLθ
Note: Because ln(z) is a monotonically increasing function…
maximizes the log likelihood function lnp(x; θ)MLθ
General Analytical Procedure to Find the MLE
1. Find log-likelihood function: ln p(x;θ)
2. Differentiate w.r.t θ and set to 0: ∂ln p(x;θ)/∂θ = 0
3. Solve for θ value that satisfies the equation
5
Ex. 7.3: Ex. of MLE When MVUE Non-Existentx[n] = A + w[n] ⇒ x[n] ~ N(A,A)
WGN~N(0,A)
Likelihood Function:

  p(x; A) = 1/(2πA)^{N/2} · exp( −(1/(2A)) Σ_{n=0}^{N−1} ( x[n] − A )² ) ,     A > 0
To take ln of this… use log properties:
Take ∂/∂A, set it to 0, and change A to Â (recall A > 0):

  −N/(2Â) + (1/Â) Σ_{n=0}^{N−1} ( x[n] − Â ) + (1/(2Â²)) Σ_{n=0}^{N−1} ( x[n] − Â )² = 0

Expanding this, several terms cancel.
6
Manipulate to get:   Â + Â² − (1/N) Σ_{n=0}^{N−1} x²[n] = 0

Solve the quadratic equation to get the MLE:

  Â_ML = −1/2 + sqrt( (1/N) Σ_{n=0}^{N−1} x²[n] + 1/4 )

One can show this estimator is biased (see bottom of p. 160), but it is asymptotically unbiased.
Use the Law of Large Numbers (sample mean → true mean):

  (1/N) Σ x²[n]  →  E{x²[n]} = A + A²        as N → ∞

  Â_ML  →  −1/2 + sqrt( A² + A + 1/4 )  =  A

So one can also use this to show:   var(Â) → A² / ( N(A + 1/2) ) = CRLB

Asymptotically: unbiased & efficient.
7
7.5 Properties of the MLE (or… “Why We Love MLE”)
The MLE is asymptotically:
1. unbiased
2. efficient (i.e. achieves CRLB)
3. Gaussian PDF
Also, if a truly efficient estimator exists, then the ML procedure finds it !
The asymptotic properties are captured in Theorem 7.1:
If p(x;θ ) satisfies some “regularity” conditions, then the MLE is asymptotically distributed according to
  θ̂_ML  ~ᵃ  N( θ, I^{−1}(θ) )
where I(θ ) = Fisher Information Matrix
8
Size of N to Achieve Asymptotic
This Theorem only states what happens asymptotically…when N is small there is no guarantee how the MLE behaves
Q: How large must N be to achieve the asymptotic properties?
A: In practice: use “Monte Carlo Simulations” to answer this
9
Monte Carlo Simulations: see Appendix 7A
Not just for the MLE!!!A methodology for doing computer simulations to evaluate performance of any estimation method Illustrate for deterministic signal s[n; θ ] in AWGN
Monte Carlo Simulation:
Data Collection:
1. Select a particular true parameter value, θtrue- you are often interested in doing this for a variety of values of θso you would run one MC simulation for each θ value of interest
2. Generate signal having true θ: s[n;θt] (call it s in matlab)
3. Generate WGN having unit variancew = randn ( size(s) );
4. Form measured data: x = s + sigma*w;- choose σ to get the desired SNR- usually want to run at many SNR values
→ do one MC simulation for each SNR value
10
Data Collection (Continued):
5. Compute estimate from data x
6. Repeat steps 3-5 M times
- (call M “# of MC runs” or just “# of runs”)
7. Store all M estimates in a vector EST (assumes scalar θ)
Statistical Evaluation:
1. Compute bias
2. Compute error RMS
3. Compute the error Variance
4. Plot Histogram or Scatter Plot (if desired)
  b = (1/M) Σ_{i=1}^{M} ( θ̂_i − θ_true )

  RMS = sqrt( (1/M) Σ_{i=1}^{M} ( θ̂_i − θ_true )² )

  VAR = (1/M) Σ_{i=1}^{M} ( θ̂_i − (1/M) Σ_{j=1}^{M} θ̂_j )²
Now explore (via plots) how: Bias, RMS, and VAR vary with: θ value, SNR value, N value, Etc.
Is B ≈ 0 ?Is RMS ≈ (CRLB)½ ?
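Putting the whole recipe together, here is a compact MATLAB sketch (added; it follows the numbered steps above, using a DC-level-in-WGN example where the "estimator" is simply the sample mean, and all settings are made up):

  % Sketch: Monte Carlo evaluation of an estimator (DC level A in WGN)
  A_true = 1;  N = 100;  sigma = 0.5;  M = 1000;   % hypothetical simulation settings
  s   = A_true*ones(N,1);                 % step 2: noise-free signal
  EST = zeros(M,1);
  for i = 1:M
      w = randn(size(s));                 % step 3: unit-variance WGN
      x = s + sigma*w;                    % step 4: measured data at the chosen SNR
      EST(i) = mean(x);                   % step 5: compute the estimate
  end
  b   = mean(EST - A_true);                    % bias
  RMS = sqrt(mean((EST - A_true).^2));         % error RMS
  VAR = mean((EST - mean(EST)).^2);            % error variance
  % Compare RMS^2 and VAR to the CRLB sigma^2/N.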
11
Ex. 7.6: Phase Estimation for a SinusoidSome Applications: 1. Demodulation of phase coherent modulations
(e.g., DSB, SSB, PSK, QAM, etc.)2. Phase-Based Bearing Estimation
Signal Model:
  x[n] = A cos(2π fo n + φ) + w[n],   n = 0, 1, …, N−1
A and fo known, φ unknown;  w[n] white, ~N(0, σ²)

Recall the CRLB:   var(φ̂) ≥ 2σ² / (N A²) = 1 / (N·SNR)
For this problem… all methods for finding the MVUE will fail!!⇒ So… try MLE!!
12
So first we write the likelihood function:
  p(x; φ) = 1/(2πσ²)^{N/2} · exp( −(1/(2σ²)) Σ_{n=0}^{N−1} ( x[n] − A cos(2π fo n + φ) )² )

GOAL: find the φ that maximizes this. We end up in the same place if we maximize the LLF, which is equivalent to minimizing:

  J(φ) = Σ_{n=0}^{N−1} ( x[n] − A cos(2π fo n + φ) )²

Setting ∂J(φ)/∂φ = 0 gives:

  Σ_{n=0}^{N−1} x[n] sin(2π fo n + φ̂) = A Σ_{n=0}^{N−1} sin(2π fo n + φ̂) cos(2π fo n + φ̂) ≈ 0

(sin and cos are ⊥ when summed over full cycles)

So the MLE phase estimate satisfies:   Σ_{n=0}^{N−1} x[n] sin(2π fo n + φ̂) ≈ 0
Interpret via inner product or correlation
13
Now…using a Trig Identity and then re-arranging gives:
  cos(φ̂) Σ_n x[n] sin(2π fo n) = −sin(φ̂) Σ_n x[n] cos(2π fo n)

Or:

  φ̂_ML = −arctan[ Σ_n x[n] sin(2π fo n)  /  Σ_n x[n] cos(2π fo n) ]

Recall: this is the approximate MLE. We don't need to know A or σ², but we do need to know fo.

Recall I-Q signal generation:
[Figure: x(t) multiplied by cos(2πfo t) and −sin(2πfo t), each lowpass filtered to give yi(t) and yq(t)]
The “sums” in the above equation play the role of the LPF’s in the figure (why?)Thus, ML phase estimator can be viewed as: atan of ratio of Q/I
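A minimal MATLAB-style sketch of this estimator (added; the signal values are hypothetical). atan2 is used so the quadrant of the phase is resolved automatically.

  % Sketch: approximate ML phase estimate for a sinusoid with known A and fo
  N = 200;  fo = 0.1;  A = 1;  phi = 0.7;  sigma = 0.5;   % made-up values
  n = (0:N-1).';
  x = A*cos(2*pi*fo*n + phi) + sigma*randn(N,1);
  Q = sum(x.*sin(2*pi*fo*n));             % "Q-like" sum
  I = sum(x.*cos(2*pi*fo*n));             % "I-like" sum
  phi_ml = atan2(-Q, I);                  % = -atan(Q/I) with the quadrant resolved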
14
Monte Carlo Results for ML Phase Estimation
See figures 7.3 & 7.4 in text book
1
7.6 MLE for Transformed Parameters
Given the PDF p(x;θ), but we want an estimate of α = g(θ). What is the MLE for α??

Two cases:
1. α = g(θ) is a one-to-one function:
   α̂_ML maximizes p(x; g^{−1}(α))

2. α = g(θ) is not a one-to-one function:
   Need to define a modified likelihood function:
     p̄_T(x; α) = max_{θ : α = g(θ)} p(x;θ)
   • For each α, find all θ's that map to it
   • Extract the largest value of p(x;θ) over this set of θ's
   α̂_ML maximizes p̄_T(x; α)
2
Invariance Property of MLE Another Big Advantage of MLE!
Theorem 7.2: Invariance Property of MLEIf parameter θ is mapped according to α = g(θ ) then the MLE of α is given by
  α̂ = g(θ̂)

where θ̂ is the MLE for θ, found by maximizing p(x;θ).
Note: when g(θ ) is not one-to-one the MLE for α maximizes the modified likelihood function
“Proof”: Easy to see when g(θ ) is one-to-one
Otherwise… can “argue” that maximization over θ inside definition for modified LF ensures the result.
3
Ex. 7.9: Estimate Power of DC Level in AWGNx[n] = A + w[n] noise is N(0,σ2) & White
Want to estimate the power:  α = A²   ⇒  A = ±√α

⇒ For each α value there are 2 PDFs to consider:

  p_T1(x; α) = 1/(2πσ²)^{N/2} · exp( −(1/(2σ²)) Σ ( x[n] − √α )² )
  p_T2(x; α) = 1/(2πσ²)^{N/2} · exp( −(1/(2σ²)) Σ ( x[n] + √α )² )

Then:

  α̂_ML = arg max_{α ≥ 0} max{ p_T1(x;α), p_T2(x;α) } = [ arg max_{−∞ < A < ∞} p(x; A) ]² = Â²_ML

a demonstration that the invariance result holds for this example.
4
Ex. 7.10: Estimate Power of WGN in dB x[n] = w[n] WGN w/ var = σ2 unknown
Recall: Pnoise = σ2
One can show that the MLE for the variance is:   P̂_noise = (1/N) Σ_{n=0}^{N−1} x²[n]
To get the dB version of the power estimate:
Note: You may recall a result for estimating variance that divides by N–1 rather than by N … that estimator is unbiased, this estimate is biased (but asymptotically unbiased)
!!! "!!! #$
= ∑
−
=
1
0
210 ][1log10ˆ
N
ndB nx
NP
Using Invariance Property !
5
7.7: Numerical Determination of MLE
Note: In all previous examples we ended up with a closed-form expression for the MLE:  θ̂_ML = f(x)

Ex. 7.11: x[n] = rⁿ + w[n], noise is N(0, σ²) & white. Estimate r.
If −1 < r < 0 then this signal is a decaying oscillation that might be used to model:
• a ship's "hull ping"
• a vibrating string, etc.

To find the MLE, set ∂ln p(x;θ)/∂θ = 0, which gives:

  Σ_{n=0}^{N−1} ( x[n] − rⁿ ) n r^{n−1} = 0

There is no closed-form solution for the MLE.
6
So…we can’t always find a closed-form MLE!But a main advantage of MLE is:
We can always find it numerically!!!(Not always computationally efficiently, though)
Brute Force MethodCompute p(x;θ ) on a fine grid of θ values
Advantage: Sure to Find maximum (if grid is fine enough)
Disadvantage: Lots of Computation (especially w/ a fine grid)
p(x;θ )
θ
7
Iterative Methods for Numerical MLE
Step #1: Pick some "initial estimate" θ̂₀
Step #2: Iteratively improve it using  θ̂_{i+1} = f(θ̂_i, x)  such that  lim_{i→∞} p(x; θ̂_i) = max_θ p(x;θ)

[Figure: "hill climbing in the fog" — p(x;θ) vs. θ with iterates θ̂₀, θ̂₁, θ̂₂ moving uphill]
Note: A so-called “Greedy”maximization algorithm will always move up even though taking an occasional step downward may be the better global strategy!
Convergence Issues:1. May not converge2. May converge, but to local maximum
- good initial guess is needed !!- can use rough grid search to initialize- can use multiple initializations
8
Iterative Method: Newton-Raphson MLEThe MLE is the maximum of the LF… so set derivative to 0:
  g(θ) ≜ ∂ln p(x;θ)/∂θ = 0

So the MLE is a zero of g(θ). Newton-Raphson is a numerical method for finding the zero of a function, so it can be applied here: linearize g(θ).

[Figure: g(θ) vs. θ with successive iterates θ₀, θ₁, θ₂ approaching the zero]
Truncated Taylor series:

  g(θ) ≈ g(θ̂_k) + dg(θ)/dθ |_{θ̂_k} · (θ − θ̂_k)

Set this to 0 and solve for θ to get the next iterate:

  θ̂_{k+1} = θ̂_k − g(θ̂_k) / [ dg(θ)/dθ |_{θ̂_k} ]
9
Now, using our "definition of convenience"  g(θ) = ∂ln p(x;θ)/∂θ,  the Newton-Raphson MLE iteration is:

  θ̂_{k+1} = θ̂_k − [ ∂²ln p(x;θ)/∂θ² ]^{−1} ∂ln p(x;θ)/∂θ |_{θ = θ̂_k}

Iterate until a convergence criterion is met:   | θ̂_{k+1} − θ̂_k | < ε   (you get to choose ε!)

Look familiar??? The second-derivative term looks like I(θ), except that I(θ) is evaluated at the true θ and has an expected value.

Generally: for a given PDF model, compute the derivatives analytically, or compute them numerically:

  ∂ln p(x;θ)/∂θ |_{θ̂_k} ≈ [ ln p(x; θ̂_k + ∆θ) − ln p(x; θ̂_k) ] / ∆θ
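As an added illustration (not from the notes), the sketch below applies this Newton-Raphson recipe to Ex. 7.11 (x[n] = rⁿ + w[n]) using numerical derivatives of the log-likelihood; the signal and noise values are made up, and convergence to the global maximum is not guaranteed.

  % Sketch: Newton-Raphson MLE for r in x[n] = r^n + w[n]
  N = 50;  n = (0:N-1).';  r_true = -0.8;  sigma = 0.1;   % hypothetical values
  x   = r_true.^n + sigma*randn(N,1);
  llf = @(r) -sum((x - r.^n).^2)/(2*sigma^2);   % log-likelihood (up to a constant)
  r = -0.5;  dr = 1e-5;                         % initial guess and derivative step
  for k = 1:100
      g  = (llf(r+dr) - llf(r-dr))/(2*dr);              % dlnp/dr
      gp = (llf(r+dr) - 2*llf(r) + llf(r-dr))/dr^2;     % d2lnp/dr2
      r_new = r - g/gp;
      if abs(r_new - r) < 1e-9, r = r_new; break; end
      r = r_new;
  end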
10
Convergence Issues of Newton-Raphson:1. May not converge2. May converge, but to local maximum
- good initial guess is needed !!- can use rough grid search to initialize- can use multiple initializations
[Figure: ∂ln p(x;θ)/∂θ vs. θ, showing iterates θ₀, θ₁, θ₂, θ₃ converging to a local zero]
Some Other Iterative MLE Methods1. Scoring Method
• Replaces second-partial term by I(θ )2. Expectation-Maximization (EM) Method
• Guarantees convergence to at least a local maximum• Good for complicated multi-parameter cases
11
7.8 MLE for Vector ParameterAnother nice property of MLE is how easily it carries over to the vector parameter case.
The vector parameter is:   θ = [ θ1 θ2 … θp ]^T

θ̂_ML is the vector that satisfies:   ∂ln p(x;θ)/∂θ = 0

where the derivative w.r.t. a vector is defined as

  ∂f(θ)/∂θ = [ ∂f(θ)/∂θ1   ∂f(θ)/∂θ2   …   ∂f(θ)/∂θp ]^T
12
Ex. 7.12: Estimate DC Level and Variancex[n] = A + w[n] noise is N(0,σ2) and white
Estimate: DC level A and Noise Variance σ2 ⇒
  θ = [ A  σ² ]^T

The LF is:

  p(x; A, σ²) = 1/(2πσ²)^{N/2} · exp( −(1/(2σ²)) Σ_{n=0}^{N−1} ( x[n] − A )² )

Solve  ∂ln p(x;θ)/∂θ = 0:

  ∂ln p/∂A = (1/σ²) Σ ( x[n] − A ) = (N/σ²)( x̄ − A ) = 0

  ∂ln p/∂σ² = −N/(2σ²) + (1/(2σ⁴)) Σ ( x[n] − A )² = 0

  ⇒   θ̂_ML = [ x̄ ,   (1/N) Σ_{n=0}^{N−1} ( x[n] − x̄ )² ]^T
Interesting: For this problem, first estimate A just like the scalar case, then subtract it off and estimate the variance like the scalar case.
13
Properties of Vector MLThe asymptotic properties are captured in Theorem 7.3:
If p(x;θ) satisfies some “regularity” conditions, then the MLE is asymptotically distributed according to
  θ̂_ML  ~ᵃ  N( θ, I^{−1}(θ) )
where I(θ) = Fisher Information MatrixSo the vector ML is asymptotically:
• unbiased • efficient
Invariance Property Holds for Vector Case
If α = g (θ ), then )ˆ(ˆ MLML g θα =
14
Ex. 7.12 Revisited
It can be shown that:

  E{θ̂} = [ A ,  (N−1)σ²/N ]^T          cov(θ̂) = [ σ²/N        0
                                                    0     2(N−1)σ⁴/N² ]

For large N:

  E{θ̂} ≈ [ A, σ² ]^T = θ          cov(θ̂) ≈ [ σ²/N    0
                                              0     2σ⁴/N ]  =  I^{−1}(θ)

which we see satisfies the asymptotic property.
Diagonal covariance matrix shows estimates are uncorrelated:
Error Ellipse is aligned with axes
Ae
2σe
This is why we could “decouple”
the estimates
15
MLE for the General Gaussian CaseLet the data be general Gaussian: x ~ N (µ(θ), C(θ))
Thus ∂ln p(x;θ)/∂θ will depend in general on  ∂µ(θ)/∂θ  and  ∂C(θ)/∂θ.

For each k = 1, 2, …, p set:   ∂ln p(x;θ)/∂θ_k = 0

This gives p simultaneous equations, the kth one being:

  −(1/2) tr[ C^{−1}(θ) ∂C(θ)/∂θ_k ]  +  [ ∂µ(θ)/∂θ_k ]^T C^{−1}(θ) ( x − µ(θ) )  +  (1/2) ( x − µ(θ) )^T C^{−1}(θ) [ ∂C(θ)/∂θ_k ] C^{−1}(θ) ( x − µ(θ) )  =  0
       Term #1                              Term #2                                                  Term #3
Note: for the deterministic signal + noise case: Terms #1 & #3 are zero
This gives general conditions to find the MLE… but can’t always solve it!!!
16
MLE for Linear Model CaseThe signal model is: x = Hθ + w with the noise w ~ N(0,C)
So terms #1 & #3 are zero and term #2 gives:
For this case we cansolve these equations!
  H^T C^{−1} ( x − Hθ ) = 0          (here ∂(Hθ)/∂θ = H)

Solving this gives:   θ̂_ML = ( H^T C^{−1} H )^{−1} H^T C^{−1} x
Hey! Same as chapter 4’s MVU for linear model
Recall: the Linear Model is specified to have Gaussian noise
For Linear Model: ML = MVU
  θ̂_ML ~ N( θ, ( H^T C^{−1} H )^{−1} )        EXACT… not asymptotic!!
17
Numerical Solutions for Vector CaseObvious generalizations… see p. 187
There is one issue to be aware of, though:
The numerical implementation needs ∂ln p(x;θ)/∂θ
For the general Gaussian case this requires ∂C^{−1}(θ)/∂θ, and it is often hard to analytically get C^{−1}(θ) and then differentiate!

So we use (3C.2):

  ∂C^{−1}(θ)/∂θ_k = −C^{−1}(θ) [ ∂C(θ)/∂θ_k ] C^{−1}(θ)

(get ∂C(θ)/∂θ_k analytically, then evaluate the rest numerically)
7.9 Asymptotic MLE
Useful when data samples x[n] come from a WSS process
Reading Assignment Only
1
7.10 MLE ExamplesWe’ll now apply the MLE theory to several examples of practical signal processing problems.
These are the same examples for which we derived the CRLB in Ch. 3
1. Range Estimation – sonar, radar, robotics, emitter location
2. Sinusoidal Parameter Estimation (Amp., Frequency, Phase)– sonar, radar, communication receivers (recall DSB Example), etc.
3. Bearing Estimation – sonar, radar, emitter location
4. Autoregressive Parameter Estimation– speech processing, econometrics
See Book
We Will
Cover
2
Ex. 1 Range Estimation ProblemTransmit Pulse: s(t) nonzero over t∈[0,Ts]
Receive Reflection: s(t – τo)
Measure Time Delay: τo
C-T Signal Model:
  x(t) = s(t − τo) + w(t),   0 ≤ t ≤ T,   T = τo,max + Ts

w(t): bandlimited white Gaussian noise out of the BPF & amp, with PSD No/2 over f ∈ [−B, B]
[Figure: transmitted pulse s(t), received pulse s(t − τo), and PSD of w(t)]
3
Range Estimation D-T Signal Model
Sample every ∆ = 1/2B sec:  w[n] = w(n∆) is DT white Gaussian noise with Var σ² = BNo

[Figure: PSD and ACF of w(t), as before]

  x[n] = s[n − no] + w[n],   n = 0, 1, …, N−1

s[n; no] has M non-zero samples starting at no, where no ≈ τo/∆:

  x[n] = { w[n],                0 ≤ n ≤ no − 1
           s[n − no] + w[n],    no ≤ n ≤ no + M − 1
           w[n],                no + M ≤ n ≤ N − 1 }
4
Range Estimation Likelihood FunctionWhite and Gaussian ⇒ Independent ⇒ Product of PDFs
3 different PDFs – one for each subinterval
  p(x; no) = ∏_{n=0}^{no−1} C exp( −x²[n]/(2σ²) )  ·  ∏_{n=no}^{no+M−1} C exp( −( x[n] − s[n−no] )²/(2σ²) )  ·  ∏_{n=no+M}^{N−1} C exp( −x²[n]/(2σ²) )
                 region #1                              region #2                                               region #3

with C = 1/sqrt(2πσ²). Expand the middle factor to get an x²[n] term and group it with the other x²[n] terms:

  p(x; no) = [ ∏ C ] exp( −(1/(2σ²)) Σ_{n=0}^{N−1} x²[n] )  ·  exp( −(1/(2σ²)) Σ_{n=no}^{no+M−1} ( −2 x[n] s[n−no] + s²[n−no] ) )

The first factor does not depend on no, so we must minimize the second exponent's sum (or maximize its negative) over values of no.
5
Range Estimation ML Condition
!!! "!!! #$!!! "!!! #$∑∑
−+
=
−+
=−+−
12
1
0 ][][][Mn
nno
Mn
nn
o
o
o
o
nnsnnsnx
Doesn’t depend on no! …Summand moves with the limits as no changes.
Because s[n – no] = 0 outside summation
range… so can extend it!
2So maximize this:
∑−
=−
1
00 ][][
N
nnnsnxSo maximize this:
So…. MLE Implementation is based on Cross-correlation: “Correlate” Received signal x[n] with transmitted signal s[n]
,][][][][maxargˆ1
00∑−
=−≤≤−==
N
nxsxs
MNmo mnsnxmCmCn
6
Range Estimation MLE Viewpoint
  C_xs[m] = Σ_{n=0}^{N−1} x[n] s[n−m]

[Figure: C_xs[m] vs. m, peaking at m = n̂o]

Warning: when the signals are complex (e.g., LPE signals), find the peak of |C_xs[m]|.
• Think of this as an inner product for each m
• Compare the data x[n] to all possible delays of the signal s[n] → pick n̂o to make them most alike
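The sketch below (added; the pulse shape and delay are hypothetical) implements this cross-correlation estimator directly from the definition of C_xs[m]:

  % Sketch: ML delay estimate via cross-correlation
  N = 500;  M = 50;  no_true = 123;  sigma = 1;     % made-up example values
  s = ones(M,1);                                    % hypothetical pulse s[n]
  x = sigma*randn(N,1);
  x(no_true+1 : no_true+M) = x(no_true+1 : no_true+M) + s;   % embed delayed pulse
  Cxs = zeros(N-M+1,1);
  for m = 0:N-M
      Cxs(m+1) = sum(x(m+1:m+M).*s);                % inner product at each lag m
  end
  [~, idx] = max(Cxs);
  no_hat = idx - 1;                                 % estimated delay (in samples)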
7
Ex. 2 Sinusoid Parameter Estimation ProblemGiven DT signal samples of a sinusoid in noise….
Estimate its amplitude, frequency, and phase
1,,1,0][)cos(][ −=++Ω= NnnwnAnx o …φ
Ωo is DT frequency in cycles/sample: 0 < Ωo < π
DT White Gaussian NoiseZero Mean & Variance of σ2
Multiple parameters… so parameter vector: ToA ][ φΩ=θ
The likelihood function is:

  p(x; θ) = C_N exp( −(1/(2σ²)) Σ_{n=0}^{N−1} ( x[n] − A cos(Ωo n + φ) )² )

For the MLE: minimize   J(A, Ωo, φ) ≜ Σ_{n=0}^{N−1} ( x[n] − A cos(Ωo n + φ) )²
8
Sinusoid Parameter Estimation ML ConditionTo make things easier…
Define an equivalent parameter set:
[α1 α2 Ωo ]T α1 = Acos(φ) α2 = –Asin(φ)
Then… J'(α1 ,α2,Ωo) = J(A,Ωo,φ) α = [α1 α2]T
Define:
c(Ωo) = [1 cos(Ωo) cos(Ωo2) … cos(Ωo(N-1))]T
s(Ωo) = [0 sin(Ωo) sin(Ωo2) … sin(Ωo(N-1))]T
and…
H(Ωo) = [c(Ωo) s(Ωo)] an Nx2 matrix
9
Then: J'(α1 ,α2,Ωo) = [x – H (Ωo) α]T [x – H (Ωo) α]
Looks like the linear model case… except for Ωo dependence of H (Ωo)
Thus, for any fixed Ωo value, the optimal α estimate is
  α̂ = [ H^T(Ωo) H(Ωo) ]^{−1} H^T(Ωo) x

Then plug that into J′(α1, α2, Ωo):

  J′(α̂1, α̂2, Ωo) = [ x − H(Ωo) α̂ ]^T [ x − H(Ωo) α̂ ]
                  = x^T [ I − H(Ωo) ( H^T(Ωo)H(Ωo) )^{−1} H^T(Ωo) ] x
                  = x^T x  −  x^T H(Ωo) ( H^T(Ωo)H(Ωo) )^{−1} H^T(Ωo) x

so minimizing J′ w.r.t. Ωo is the same as maximizing the second term.
10
Sinusoid Parms. — Exact MLE Procedure

Step 1: Maximize over Ωo (equivalently, minimize J′), done numerically, to find Ω̂o:

  Ω̂o = arg max_{0 < Ωo < π}  x^T H(Ωo) [ H^T(Ωo) H(Ωo) ]^{−1} H^T(Ωo) x

Step 2: Use the result of Step 1 to get

  α̂ = [ H^T(Ω̂o) H(Ω̂o) ]^{−1} H^T(Ω̂o) x

Step 3: Convert the Step 2 result by solving for Â and φ̂ from

  α̂1 = Â cos(φ̂)          α̂2 = −Â sin(φ̂)
11
Sinusoid Parms. — Approx. MLE Procedure
First we look at a specific structure:

  x^T H(Ωo)[H^T(Ωo)H(Ωo)]^{−1}H^T(Ωo) x
      = [ c^T(Ωo)x   s^T(Ωo)x ] [ c^T c   c^T s ]^{−1} [ c^T(Ωo)x ]
                                [ s^T c   s^T s ]      [ s^T(Ωo)x ]

Then, if Ωo is not near 0 or π,   H^T(Ωo)H(Ωo) ≈ (N/2) I,   and Step 1 becomes

  Ω̂o = arg max_{0 < Ωo < π}  (2/N) | Σ_{n=0}^{N−1} x[n] exp(−jΩo n) |²  =  arg max_{0 < Ωo < π} (2/N) |X(Ωo)|²

i.e., the DTFT of the data x[n]. Steps 2 & 3 become

  Â = (2/N) |X(Ω̂o)|          φ̂ = ∠X(Ω̂o)
12
The processing is implemented as follows:
Given the data: x[n], n = 0, 1, 2, … , N-1
1. Compute the DFT X[m], m = 0, 1, 2, … , M-1 of the data• Zero-pad to length M = 4N to ensure dense grid of frequency points
• Use the FFT algorithm for computational efficiency
2. Find location of peak• Use quadratic interpolation of |X[m]|
3. Find height at peak• Use quadratic interpolation of |X[m]|
4. Find angle at peak• Use linear interpolation of ∠X[m]
[Figure: |X(Ω)| with its peak at Ω̂o, and ∠X(Ω) evaluated at Ω̂o]
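A bare-bones MATLAB sketch of steps 1–4 (added here; it omits the quadratic/linear interpolation refinements and uses made-up signal values):

  % Sketch: approximate MLE of sinusoid parameters via a zero-padded FFT
  N = 256;  n = (0:N-1).';  A = 1;  Om = 0.6;  phi = 0.3;  sigma = 0.5;
  x = A*cos(Om*n + phi) + sigma*randn(N,1);
  Mfft = 4*N;                               % zero-pad for a dense frequency grid
  X  = fft(x, Mfft);
  [~, k] = max(abs(X(1:Mfft/2)));           % peak over 0 < Omega < pi
  Om_hat  = 2*pi*(k-1)/Mfft;                % frequency estimate (rad/sample)
  A_hat   = 2*abs(X(k))/N;                  % amplitude estimate
  phi_hat = angle(X(k));                    % phase estimate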
13
Figure 3.8 from textbook:
)2cos()( φπ += tfAts ot
Ex. 3 Bearing Estimation MLEEmits or reflects
signal s(t)
Simple model
Grab one “snapshot” of all M sensors at a single instant ts:
( ) ][~
cos][)(][ nwnAnwtsnx ssn ++Ω=+= φ
Same as Sinusoidal Estimation!! So Compute DFT and Find Location of Peak!!
If emitted signal is not a sinusoid then you get a different MLE!!
1
MLE for TDOA/FDOA Location
Overview Estimating TDOA/FDOA Estimating Geo-Location
2
)(ts
tjetts 1)( 1ω−
Data Link
Data Link
tjetts 2)( 2ω−
tjetts 3)( 3ω−
MULTIPLE-PLATFORM LOCATION
Emitter to be located
3
)(ts
tjetts 1)( 1ω−
Data Link
Data Link
τ23 = t2 t3= constant
TDOATime-Difference-Of-Arrival
τ21 = t2 t1 = constant
tjetts 2)( 2ω−
tjetts 3)( 3ω−
ν21 = ω2 ω1 = constant
ν23 = ω2 ω3 = constant
FDOAFrequency-Difference-Of-Arrival
TDOA/FDOA LOCATION
4
Estimating TDOA/FDOA
5
SIGNAL MODEL! Will Process Equivalent Lowpass signal, BW = B Hz
Representing RF signal with RF BW = B Hz
! Sampled at Fs > B complex samples/sec! Collection Time T sec! At each receiver:
BPF ADCMakeLPE
SignalEqualize
cos(ω1t)
f
XRF(f)
X(f)
f
f
XfLPE(f)
B/2-B/2
6
Tx Rx
s(t) sr(t) = s(t τ(t))
R(t)
Propagation Time: τ(t) = R(t)/c
  R(t) = Ro + vt + (a/2)t² + …      (use the linear approximation — assumes small change in velocity over the observation interval)

For real BP signals:
  s_r(t) = s( t − R(t)/c ) = s( [1 − v/c] t − Ro/c )         ← time scaling and time delay τd = Ro/c
DOPPLER & DELAY MODEL
7
Analytic Signals Model
Analytic signal of Tx:   s̃(t) = E(t) e^{j[ωc t + φ(t)]}

Analytic signal of Rx:   s̃_r(t) = s̃( [1 − v/c]t − τd ) = E( [1 − v/c]t − τd ) e^{j( ωc([1 − v/c]t − τd) + φ([1 − v/c]t − τd) )}

Now what? Notice that v << c ⇒ (1 − v/c) ≈ 1.
Say v = 300 m/s (670 mph); then v/c = 300/3×10⁸ = 10⁻⁶, so (1 − v/c) differs from 1 by only 10⁻⁶.

Now assume E(t) & φ(t) vary slowly enough that, for the range of v of interest,
  E( [1 − v/c]t ) ≈ E(t)        φ( [1 − v/c]t ) ≈ φ(t)
This is called the Narrowband Approximation.
8
  s̃_r(t) = E(t − τd) e^{j( ωc t − (ωc v/c) t − ωc τd + φ(t − τd) )}
          = [carrier term e^{jωc t}] · [Doppler shift term e^{−jωd t}, ωd = ωc v/c] · [constant phase term, α = ωc τd] · [transmitted signal's LPE signal, time-shifted by τd]

Narrowband Lowpass Equivalent Signal Model (this is the signal that actually gets processed digitally):

  s_r(t) = e^{jα} e^{−jωd t} s(t − τd)
DOPPLER & DELAY MODEL (continued)
9
CRLB for TDOAWe already showed that the CRLB for the active sensor case is:
But here we need to estimate the delay between two noisy signals rather than between a noisy one and a clean one.
The only difference in the result is: replace SNR by an effective SNR given by
  1/SNR_eff = 1/SNR1 + 1/SNR2 + 1/(SNR1·SNR2)        ⇒   SNR_eff ≈ min(SNR1, SNR2)

  C(TDOA) = 1 / ( 8π² · N · SNR_eff · B²rms )

where Brms is an effective bandwidth of the signal computed from the DFT values S[k]:

  B²rms = Σ_{k=−N/2}^{N/2−1} (k/N)² |S[k]|²  /  Σ_{k=−N/2}^{N/2−1} |S[k]|²
10
CRLB for TDOA (cont.)A more familiar form for this is in terms of the C-T version of the problem:
  σ²_TDOA ≥ 1 / ( (2π Brms)² · BT · SNR_eff )        (seconds²),          B²rms = ∫ f² |S(f)|² df / ∫ |S(f)|² df

BT = Time-Bandwidth Product (≈ N, the number of samples in DT)
B = Noise Bandwidth of Receiver (Hz);  T = Collection Time (sec)
BT is called the Coherent Processing Gain (same effect as the DFT processing gain on a sinusoid).

For a signal with a rectangular spectrum of RF width Bs, the bound becomes:

  σ_TDOA ≥ 0.55 / ( Bs · sqrt( BT · SNR_eff ) )
S. Stein, Algorithms for Ambiguity Function Processing, IEEE Trans. on ASSP, June 1981
11
CRLB for FDOAHere we take advantage of the time-frequency duality if the FT:
where Trms is an effective duration of the signal computed from the signal samples s[k].
  T²rms = Σ_{n=−N/2}^{N/2−1} (n/N)² |s[n]|²  /  Σ_{n=−N/2}^{N/2−1} |s[n]|²

  C(FDOA) = 1 / ( 8π² · N · SNR_eff · T²rms )

Again we use the same effective SNR:

  1/SNR_eff = 1/SNR1 + 1/SNR2 + 1/(SNR1·SNR2)        ⇒   SNR_eff ≈ min(SNR1, SNR2)
12
CRLB for FDOA (cont.)
A more familiar form for this is in terms of the C-T version of the problem:
  σ²_FDOA ≥ 1 / ( (2π Trms)² · BT · SNR_eff )        (Hz²),          T²rms = ∫ t² |s(t)|² dt / ∫ |s(t)|² dt

For a signal with a constant envelope of duration Ts, the bound becomes:

  σ_FDOA ≥ 0.55 / ( Ts · sqrt( BT · SNR_eff ) )
S. Stein, Algorithms for Ambiguity Function Processing, IEEE Trans. on ASSP, June 1981
13
Interpreting CRLBs for TDOA/FDOAA more familiar form for this is in terms of the C-T version of the problem:
  σ²_TDOA ≥ 1 / ( (2π Brms)² · BT · SNR_eff )          σ²_FDOA ≥ 1 / ( (2π Trms)² · BT · SNR_eff )

BT pulls the signal up out of the noise; large Brms improves TDOA accuracy; large Trms improves FDOA accuracy.
BT pulls the signal up out of the noise Large Brms improves TDOA accuracy Large Trms improves FDOA accuracy
SNR1 SNR2 T = Ts B = Bs σTDOA σFDOA
3 dB 30 dB 1 ms 1 MHz 17.4 ns 17.4 Hz
3 dB 30 dB 100 ms 10 kHz 1.7 µs 0.17 Hz
Two Examples of Accuracy Bounds:
14
MLE for TDOA/FDOA S. Stein, Differential Delay/Doppler ML Estimation with Unknown Signals, IEEE Trans. on SP, August 1993
We already showed that the ML Estimate of delay for the active sensor case is the Cross-Correlation of the time signals.
By the time-frequency duality the ML estimate for doppler shift should be Cross-Correlation of the FT, which is mathematically equivalent to
  C(τ) = ∫_0^T s1(t) s2*(t + τ) dt                    → find the peak of |C(τ)|       (TDOA only)

  C(ω) = ∫_0^T s1(t) s2*(t) e^{−jωt} dt               → find the peak of |C(ω)|       (FDOA only)

  A(ω, τ) = ∫_0^T s1(t) s2*(t + τ) e^{−jωt} dt        → find the peak of |A(ω, τ)|    (joint TDOA/FDOA)
15
Ambiguity Function
[Figure: the |A(ω, τ)| surface over delay τ and Doppler ω, with its peak at (ωd, τd)]
ML Estimator for TDOA/FDOA (cont.)
  s1(t) = e^{jα} e^{−jωd t} s(t − τd)

[Figure: LPE Rx signals s1(t) and s2(t) at the two receivers are compared for all delays τ and Dopplers ω]
Called: Ambiguity Function Complex Ambiguity Function (CAF) Cross-Correlation Surface
16
ML Estimator for TDOA/FDOA (cont.)
How well do we expect the Cross-Correlation Processing to perform?
Well it is the ML estimator so it is not necessarily optimum.
But we know that an ML estimate is asymptotically Unbiased & Efficient (that means it achieves the CRLB) Gaussian
  θ̂_ML  ~ᵃ  N( θ, I^{−1}(θ) )
Those are some VERY nice properties that we can make use of in our location accuracy analysis!!!
17
Consider ω = ωd and vary τ:
  A(ωd, τ) = ∫_0^T s(t) s*(t + τ − τd) dt          → a correlation; width ∼ 1/BW
[Figure: |A(ωd, τ)| vs. τ, peaked at τd]

Consider τ = τd and vary ω:
  A(ω, τd) = ∫_0^T |s(t)|² e^{j(ωd − ω)t} dt        → like a windowed FT of a sinusoid, where the window is |s(t)|²; width ∼ 1/T
[Figure: |A(ω, τd)| vs. ω, peaked at ωd]
Properties of the CAF
18
TDOA Accuracy depends on:» Effective SNR: SNReff
» RMS Widths: Brms = RMS Bandwidth
TDOA ACCURACY REVISITED
dffS
dffSf Brms∫∫= 2
222
)(
)(
XCorr Function
~1/Brms
TDOA
Narrow Brms CasePoor Accuracy
Wide Brms CaseGood Accuracy
TDOA
XCorr Function
Low Effective SNR Causes Spurious Peaks
On Xcorr Function
Narrow Xcorr FunctionLess Susceptible to
Spurious Peaks
19
FDOA Accuracy depends on:» Effective SNR: SNReff
» RMS Widths: Drms = RMS Duration
FDOA ACCURACY REVISITED
  D²rms = ∫ t² |s(t)|² dt / ∫ |s(t)|² dt
XCorr Function
~1/Drms
FDOA
Narrow Drms CasePoor Accuracy
Wide Drms CaseGood Accuracy
FDOA
XCorr Function
Low Effective SNR Causes Spurious Peaks
On Xcorr Function
Narrow Xcorr FunctionLess Susceptible to
Spurious Peaks
20
COMPUTING THE AMBIGUITY FUNCTION
Direct computation based on the equation for the ambiguity function leads to computationally inefficient methods.
In EECE 521 notes we showed how to use decimation to efficiently compute the ambiguity function
21
Estimating Geo-Location
22
Data Link
Data Link
TDOA/FDOA LOCATION
Data Link
Centralized Network of P P-Choose-2 Pairs
# P-Choose-2 TDOA Measurements# P-Choose-2 FDOA Measurements
Warning: Watch out for Correlation Effect Due to Signal-Data-In-Common
23
TDOA/FDOA LOCATIONPair-Wise Network of P P/2 Pairs
# P/2 TDOA Measurements# P/2 FDOA Measurements
Many ways to select P/2 pairs Warning: Not all pairings are equally good!!! The Dashed Pairs are Better
24
TDOA/FDOA Measurement ModelGiven N TDOA/FDOA measurements with corresponding 2×2 Cov. Matrices
  (τ1, ν1), (τ2, ν2), …, (τN, νN)        with 2×2 covariance matrices C1, C2, …, CN

For notational purposes define the 2N measurements r(n), n = 1, 2, …, 2N:

  r_{2n−1} = τ_n ,   r_{2n} = ν_n ,   n = 1, 2, …, N

Data vector:   r = [ r1 r2 … r_{2N} ]^T

Now, those are the TDOA/FDOA estimates; the true values (τ1, ν1), …, (τN, νN) are stacked the same way:

  s_{2n−1} = τ_n ,   s_{2n} = ν_n ,   n = 1, 2, …, N

Signal vector:   s = [ s1 s2 … s_{2N} ]^T

(Assume a pair-wise network, so the TDOA/FDOA pairs are uncorrelated.)
Assume pair-wise network, soTDOA/FDOA pairs are uncorrelated
25
TDOA/FDOA Measurement Model (cont.)Each of these measurements r(n) has an error ε(n) associated with it, so
  r = s + ε

Because these measurements were estimated using an ML estimator (with a sufficiently large number of signal samples), we know the error vector ε is a zero-mean Gaussian vector with covariance matrix

  C = diag( C1, C2, …, CN )
The true TDOA/FDOA values depend on:Emitter Parms: (xe, ye, ze) and transmit frequency fe xe = [ xe ye ze fe ]T
Receivers Nav Data (positions & velocities): The totality of it called xr
  r = s(x_e; x_r) + ε          (deterministic signal + Gaussian noise; the signal is nonlinearly related to the parms)
Assumes that TDOA/FDOA pairs are uncorrelated!!!
To complete the model we need to know how s(xe;xr) depends on xe and xr.Thus we need to find TDOA & FDOA as functions of xe and xr
26
TDOA/FDOA Measurement Model (cont.)
Two Receivers with: (x1, y1, Vx1, Vy1) and (x2, y2, Vx2, Vy2)Emitter with: (xe, ye)
(Let Ri be the range between Receiver i and the emitter; c is the speed of light.)
The TDOA and FDOA are given by (Ri is the range from Receiver i to the emitter; c is the speed of light):

  s1(xe, ye) = τ21 = t2 − t1 = (R2 − R1)/c
             = (1/c) [ sqrt( (x2 − xe)² + (y2 − ye)² ) − sqrt( (x1 − xe)² + (y1 − ye)² ) ]

  s2(xe, ye, fe) = ν21 = f2 − f1 = −(fe/c) d(R2 − R1)/dt
             = (fe/c) [ ( Vx1(x1 − xe) + Vy1(y1 − ye) ) / sqrt( (x1 − xe)² + (y1 − ye)² )
                        − ( Vx2(x2 − xe) + Vy2(y2 − ye) ) / sqrt( (x2 − xe)² + (y2 − ye)² ) ]

Here we'll simplify to the x-y plane — the extension is straightforward.
27
  [J(x)]_{nm} = [ ∂µ(x)/∂x_n ]^T C^{−1} [ ∂µ(x)/∂x_m ]  +  (1/2) tr[ C^{−1} ∂C/∂x_n C^{−1} ∂C/∂x_m ]
                (variability of mean w.r.t. parms)          (variability of cov. w.r.t. parms)

CRLB for Geo-Location via TDOA/FDOA
Recall: For the General Gaussian Data case the CRLB depends on a FIM with the structure above.
Here we have a deterministic signal plus Gaussian noise, so we only have the 1st term. Using the notation introduced here gives

  C_CRLB(x_e) = ( [ ∂s(x_e; x_r)/∂x_e ]^T C^{−1} [ ∂s(x_e; x_r)/∂x_e ] )^{−1}  ≡  ( H^T C^{−1} H )^{−1}          ($)

H, called the Jacobian, for 3-D location with TDOA/FDOA will be a 2N × 4 matrix whose columns are the derivatives of s w.r.t. each of the 4 parameters.
28
TDOA/FDOA Jacobian:

  H ≜ ∂s(x_e)/∂x_e = [ ∂s1/∂xe       ∂s1/∂ye       ∂s1/∂ze       ∂s1/∂fe
                        ∂s2/∂xe       ∂s2/∂ye       ∂s2/∂ze       ∂s2/∂fe
                        ⋮              ⋮              ⋮              ⋮
                        ∂s_{2N}/∂xe   ∂s_{2N}/∂ye   ∂s_{2N}/∂ze   ∂s_{2N}/∂fe ]       (evaluated at x_e)
CRLB for Geo-Loc. via TDOA/FDOA (cont.)
Jacobian can be computed for any desired Rx-Emitter Scenario Then plug it into ($) to compute the CRLB for that scenario:
  C_CRLB(x_e) = [ H^T C^{−1} H ]^{−1}
eCRLB
29
The Location CRLB can be used to study various aspects of the emitter location problem. It can be used to study the effect of
Rx-Emitter Geometry and/or Platform Velocity TDOA accuracy vs. FDOA accuracy Number of Platforms Platform Pairings Etc., Etc., Etc
Once you have computed the CRLB Covariance CCRLB you can use it to compute and plot error ellipsoids.
CRLB Studies
Faster than doing Monte Carlo Simulation Runs!!!
Sensor 1
Emitter
V V
Sensor 2
k1σTDOA
k2σFDOA
Assumes Geo-Location Error is Gaussian
Usually reasonably valid
30
Projections of 3-D Ellipsoids onto 2-D Space
2-D Projections ShowExpected Variation
in 2-D Space
xy
z
Error Ellipsoids: If our CCRLB is 4×4 how do we get 2-D ellipses to plot???
Projections!
CRLB Studies (cont.)
31
CRLB Studies (cont.)Another useful thing that can be computed from the CRLB is the CEP for the location problem. For the 2-D case:
Circular Error Probable= radius of a circle that when centered at the estimates mean, contains 50% of the estimates
  CEP ≈ 0.75 sqrt( λ1 + λ2 ) = 0.75 sqrt( σ1² + σ2² )        (accurate to within 10%)

λ1, λ2 = eigenvalues of the 2-D cov. matrix;  σ1², σ2² = diagonal elements of the 2-D cov. matrix
Cross Range
Dow
n R
ange
CEP = 10
CEP = 20
CEP = 40
CEP = 80
CEP Contour Plots are Good Ways to Assess
Location Performance
32
25
TargetTarget
TargetSensor
FDOA OnlyTDOA Only
TDOA/FDOA
Both Important FDOA Important TDOA Important
TDOA Only FDOA Only
CRLB Studies (cont.)
Geometry and TDOA vs. FDOA Trade-Offs
33
Estimator for Geo-Location via TDOA/FDOABecause we have used the ML estimator to get the TDOA/FDOA estimates the MLs asymptotic properties tell us that we have Gaussian TDOA/FDOA measurements
Because the TDOA/FDOA measurement model is nonlinear it is unlikely that we can find a truly optimal estimate so we againresort to the ML. For the ML of a Nonlinear Signal in Gaussian we generally have to proceed numerically.
One way to do Numerical MLE is ML Newton-Raphson (need vector version):
  θ̂_{k+1} = θ̂_k − [ ∂²ln p(x;θ)/∂θ ∂θ^T ]^{−1} ∂ln p(x;θ)/∂θ |_{θ = θ̂_k}

Gradient: p×1 vector;  Hessian: p×p matrix.

However, the Hessian requires a second derivative, which can add complexity in practice. Alternative: Gauss-Newton nonlinear least squares, based on linearizing the model.
1
Chapter 8Least-Squares Estimation
2
8.3 The Least-Squares (LS) ApproachAll the previous methods we’ve studied… required a probabilistic model for the data: Needed the PDF p(x;θ)
For a Signal + Noise problem we needed: Signal Model & Noise Model
Least-Squares is not statistically based!!! ⇒ Do NOT need a PDF Model ⇒ Do NEED a Deterministic Signal Model
  x[n] = s_true[n;θ] + w[n] = s[n;θ] + e[n]

where e[n] combines the model error δ[n] = s_true[n;θ] − s[n;θ] and the measurement error (noise) w[n].
[Figure: block diagram similar to Fig. 8.1(a)]
3
Least-Squares Criterion
[Figure: the model signal s[n; θ̂] is subtracted from x[n] to form the residual ε[n]]

Choose the estimate θ̂ to make the residual small — minimize the LS cost:

  J(θ) = Σ_{n=0}^{N−1} ε²[n] = Σ_{n=0}^{N−1} ( x[n] − s[n;θ] )²
Ex. 8.1: Estimate DC Level x[n] = A + e[n] = s[n;θ] + e[n]
  J(A) = Σ_{n=0}^{N−1} ( x[n] − A )²

To minimize, set ∂J(A)/∂A = 0   ⇒   Â = (1/N) Σ_{n=0}^{N−1} x[n] = x̄

Same thing we've gotten before!  Note: if e[n] is WGN, then LS = MVU.
4
Weighted LS CriterionSometimes not all data samples are equally good:
x[0], x[1], … , x[N-1]
Say you know x[10] was poor in quality compared to other data…
You’d want to de-emphasize its importance in the sum of squares:
  J(θ) = Σ_{n=0}^{N−1} w_n ( x[n] − s[n;θ] )²        (set w_n small to de-emphasize a sample)
5
8.4 Linear Least-SquaresA linear least-squares problem is one where the parameter observation model is linear: s = Hθ x = Hθ + e
p×1N×1
p = Order of the modelN×p Known Matrix
We must assume that H is full rank… otherwise there are multiple parameter vectors that will map to the same s!!!
Note: Linear LS does NOT mean “fitting a line to data”… although that is a special case:
  s[n] = A + Bn   ⇒   s = Hθ,    θ = [ A  B ]^T,    H = [ 1    0
                                                          1    1
                                                          ⋮    ⋮
                                                          1   N−1 ]
6
Finding the LSE for the Linear Model( )
For the linear model the LS cost is:

  J(θ) = Σ_{n=0}^{N−1} ( x[n] − s[n;θ] )² = ( x − Hθ )^T ( x − Hθ )

Now, to minimize, first expand (scalar = scalar^T, so θ^T H^T x = (θ^T H^T x)^T = x^T Hθ):

  J(θ) = x^T x − 2 x^T Hθ + θ^T H^T H θ

Setting ∂J(θ)/∂θ = 0 gives  −2H^T x + 2H^T H θ̂ = 0, i.e. the "LS Normal Equations":

  H^T H θ̂ = H^T x

Because H is full rank we know that H^T H is invertible:

  θ̂_LS = ( H^T H )^{−1} H^T x          ŝ_LS = H θ̂_LS = H ( H^T H )^{−1} H^T x
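As a quick illustration (added, with made-up data) of solving the normal equations — here for the straight-line special case mentioned above:

  % Sketch: linear LS fit of x[n] = A + B*n + e[n]
  N = 100;  n = (0:N-1).';
  x = 2 + 0.1*n + 0.5*randn(N,1);        % hypothetical data
  H = [ones(N,1) n];
  theta_ls = (H.'*H) \ (H.'*x);          % solves the LS normal equations
  s_hat    = H*theta_ls;                 % = projection of x onto Range(H)
  Jmin     = sum((x - s_hat).^2);        % minimized LS cost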
7
Comparing the Linear LSE to Other Estimates:

  Model: x = Hθ + e, no probability model needed       →   θ̂_LS   = (H^T H)^{−1} H^T x
  Model: x = Hθ + w, PDF unknown, white noise          →   θ̂_BLUE = (H^T H)^{−1} H^T x
  Model: x = Hθ + w, PDF Gaussian, white noise         →   θ̂_ML   = (H^T H)^{−1} H^T x
  Model: x = Hθ + w, PDF Gaussian, white noise         →   θ̂_MVU  = (H^T H)^{−1} H^T x

If you assume Gaussian & apply these BUT you are WRONG, you at least get the LSE!
8
The LS Cost for Linear LSFor the linear LS problem…
what is the resulting LS cost for using θ̂_LS = (H^T H)^{−1} H^T x ?

  J_min = ( x − Hθ̂_LS )^T ( x − Hθ̂_LS )
        = x^T [ I − H(H^TH)^{−1}H^T ]^T [ I − H(H^TH)^{−1}H^T ] x
        = x^T [ I − H(H^TH)^{−1}H^T ] x            (easily verified; note: if AA = A then A is called idempotent)
        = x^T x − x^T H (H^TH)^{−1} H^T x

  0 ≤ J_min ≤ ||x||²
9
Weighted LS for Linear LS
Recall: de-emphasize bad samples' importance in the sum of squares:
J(θ) = Σ_{n=0}^{N-1} w_n ( x[n] − s[n;θ] )²
For the linear LS case we get J(θ) = ( x − Hθ )^T W ( x − Hθ ), with W a diagonal matrix.
Minimizing the weighted LS cost gives:
θ̂_WLS = ( H^T W H )^{-1} H^T W x        J_min = x^T [ W − W H ( H^T W H )^{-1} H^T W ] x
Note: Even though there is no true LS-based reason... many people use an inverse covariance matrix as the weight: W = C_x^{-1}. This makes WLS look like BLUE!!!!
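A minimal, self-contained sketch of the weighted LS solution for the same made-up line-fit setup; the weights here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
n = np.arange(N)
H = np.column_stack((np.ones(N), n))
x = 2.0 + 0.5 * n + 0.3 * rng.standard_normal(N)

w = np.ones(N)
w[10] = 0.1                      # de-emphasize x[10], assumed poor quality (made-up example)
W = np.diag(w)

theta_wls = np.linalg.solve(H.T @ W @ H, H.T @ W @ x)
print("weighted LS estimate:", theta_wls)
```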
10
8.5 Geometry of Linear LS
• Provides a different derivation
• Enables new versions of LS: Order-Recursive and Sequential
Recall the LS cost to be minimized: J(θ) = ( x − Hθ )^T ( x − Hθ ) = || x − Hθ ||²
Thus, LS minimizes the length of the error vector between the data and the signal estimate: ε = x − ŝ
But... for Linear LS we have ŝ = Hθ = Σ_{i=1}^{p} θ_i h_i with H = [ h_1 h_2 ... h_p ]   (N×p)
So θ lives in R^p while s lies in Range(H) ⊂ R^N (a subspace, since N > p); x can lie anywhere in R^N.
11
LS Geometry Example: N = 3, p = 2 (notation a bit different from the book)
x = s + e: the "noise" takes s out of Range(H) and into R^N.
The columns h_1, h_2 of H lie in a plane = the "subspace" spanned by the columns of H = S_2 (S_p in general).
The signal estimate ŝ = θ̂_1 h_1 + θ̂_2 h_2 lies in this plane, and the error ε = x − ŝ satisfies ε ⊥ h_i.
12
LS Orthogonality Principle
The LS error vector must be ⊥ to all columns of H:  ε^T H = 0^T  or  H^T ε = 0
Can use this property to derive the LS estimate:
H^T ε = 0 ⇒ H^T ( x − Hθ̂ ) = 0 ⇒ H^T H θ̂ = H^T x ⇒ θ̂_LS = ( H^T H )^{-1} H^T x
Same answer as before... but no derivatives to worry about!
H maps θ ∈ R^p into Range(H) ⊂ R^N; ( H^T H )^{-1} H^T acts like an inverse from R^N back to R^p, called the pseudo-inverse of H.
13
LS Projection Viewpoint
From the R³ example earlier... we see that ŝ must lie "right below" x:
ŝ = "projection" of x onto Range(H)   (Recall: Range(H) = subspace spanned by the columns of H)
From our earlier results we have:
ŝ = H θ̂_LS = H ( H^T H )^{-1} H^T x ≜ P_H x,        ε = x − ŝ
P_H = H ( H^T H )^{-1} H^T is the "Projection Matrix onto Range(H)".
14
Aside on Projections
If something is "on the floor"... its projection onto the floor = itself:
if z ∈ Range(H), then P_H z = z
Now... for a given x in the full space... P_H x is already in Range(H)... so P_H ( P_H x ) = P_H x
Thus... for any projection matrix P_H we have P_H P_H = P_H, i.e. P_H² = P_H   (projection matrices are idempotent)
Note also that the projection matrix onto Range(H) is symmetric: P_H^T = [ H ( H^T H )^{-1} H^T ]^T = P_H   (easily verified)
15
What Happens w/ Orthonormal Columns of H
Recall the general Linear LS solution: θ̂_LS = ( H^T H )^{-1} H^T x, where
H^T H = [ <h_i, h_j> ]   (the p×p matrix of inner products between the columns of H)
If the columns of H are orthonormal then <h_i, h_j> = δ_ij ⇒ H^T H = I, so
θ̂_LS = H^T x
Easy!! No inversion needed!! Recall vector space ideas with an ON basis!!
16
Geometry with Orthonormal Columns of H
Re-write this LS solution as θ̂_i = h_i^T x   (the inner product between the i-th column and the data vector)
Then we have:
ŝ = H θ̂ = Σ_{i=1}^{p} θ̂_i h_i = Σ_{i=1}^{p} ( h_i^T x ) h_i      (each term is the projection of x onto the h_i axis)
e.g. for p = 2:  ŝ = ( h_1^T x ) h_1 + ( h_2^T x ) h_2
When the columns of H are ⊥ we can first find the projection onto each 1-D subspace independently, then add these independently derived results. Nice!
1
8.6 Order-Recursive LS
Motivate this idea with curve fitting. Given data s[0], s[1], ..., s[N−1] at n = 0, 1, 2, ..., N−1, we want to fit a polynomial to the data... but which one is the right model? Constant, Linear, Quadratic, Cubic, etc.
Try each model and look at J_min(p) as a function of p, the # of parameters in the model (1 for a constant, 2 for a line, 3 for a quadratic, 4 for a cubic, ...). Which one works "best"?
2
Choosing the Best Model Order
Q: Should you pick the order p that gives the smallest J_min?   A: NO!!!!
Fact: J_min(p) is monotonically non-increasing as the order p increases.
If you have any N data points... you can perfectly fit a p = N model to them!!!!
2 points define a line, 3 points define a quadratic, 4 points define a cubic, ..., N points define a degree-(N−1) polynomial a_{N−1}x^{N−1} + ... + a_1 x + a_0.
Warning: Don't Fit the Noise!!
3
Choosing the Order in Practice
Practice: use the simplest model that adequately describes the data.
Scheme: Only increase the order if the cost reduction is "significant":
• Increase to order p+1 only if J_min(p) − J_min(p+1) > ε   (a user-set threshold)
• Also, in practice you may have some idea of the expected level of error ⇒ thus have some idea of the expected J_min ⇒ use the order p such that J_min(p) ≈ Expected J_min
It is wasteful to independently compute the LS solution for each order. This drives the need for an efficient way to compute LS for many models.
Q: If we have computed the p-order model, can we use it to recursively compute the (p+1)-order model?
A: YES!! ⇒ Order-Recursive LS
4
Define General Order-Increasing Models
Define: H_{p+1} = [ H_p  h_{p+1} ]   ⇒   H_1 = [h_1], H_2 = [h_1 h_2], H_3 = [h_1 h_2 h_3], etc.
Order-Recursive LS with Orthonormal Columns: if all the h_i are ⊥ ⇒ EASY!!
p = 1:  ŝ_1 = ( h_1^T x ) h_1
p = 2:  ŝ_2 = ŝ_1 + ( h_2^T x ) h_2
p = 3:  ŝ_3 = ŝ_2 + ( h_3^T x ) h_3
...
5
Order-Recursive Solution for General H
If the h_i are NOT ⊥ ⇒ harder, but possible!
Basic Idea: Given the current-order estimate:
• map the new column of H into an ON version
• use it to find a new "estimate" (quotes because this estimate is for the orthogonalized model)
• then transform to correct for the orthogonalization
Picture: h̃_3 is the orthogonalized version of h_3 with respect to S_2 = the 2-D space spanned by h_1 & h_2 = Range(H_2). (Note: x is not shown here... it is in a higher dimensional space!!)
6
Geometrical Development of Order-Recursive LS
The geometry of vector space is indispensable for DSP! (See App. 8A for the algebraic development. Yuk! Geometry is easier!)
Current order = k ⇒ H_k = [ h_1 h_2 ... h_k ]   (not necessarily ⊥)
Recall the projector onto S_k = Range(H_k):
P_k = H_k ( H_k^T H_k )^{-1} H_k^T
Given the next column h_{k+1}, find h̃_{k+1}, which is ⊥ to S_k:
h̃_{k+1} = h_{k+1} − P_k h_{k+1} = ( I − P_k ) h_{k+1} = P_k^⊥ h_{k+1}
h̃_{k+1} ⊥ S_k   ⇒   h̃_{k+1} ⊥ ŝ_k
7
So our approach is now: project x onto h̃_{k+1} and then add the result to ŝ_k.
The projection of x onto h̃_{k+1} is given by (divide by the norm to normalize):
Δŝ_{k+1} ≜ ( <x, h̃_{k+1}> / ||h̃_{k+1}||² ) h̃_{k+1} = ( h_{k+1}^T P_k^⊥ x / || P_k^⊥ h_{k+1} ||² ) P_k^⊥ h_{k+1}      (the factor in parentheses is a scalar; uses h̃_{k+1} = P_k^⊥ h_{k+1})
Now add this to the current signal estimate:
ŝ_{k+1} = ŝ_k + Δŝ_{k+1} = H_k θ̂_k + Δŝ_{k+1}
8
""#""$% 11
11
121
11
)(ˆ
ˆˆ
+⊥
+
⊥++
+⊥
+⊥
+⊥
+
−+=
+=
kkTk
kTkkk
kk
kk
kk
kkT
kkk
hPhxPhhPIθH
hPhP
hPxθHsScalar…
can move here and transpose
Write out Pk⊥
scalar… define as b for convenience
Now we have:
Write out ||.||2 and use that Pk
⊥ is idempotent
[ ]
−=
−+=
+−
=+
+−
+
+ b
b
bb
kTkk
Tkk
kk
kTkk
Tkkkkkk
k
11
1
11
1
)(ˆ
)(ˆˆ
1
hHHHθhH
hHHHHhθHs
H"#"$%
Finally:
Clearly this is 1ˆ+kθ
9
θ̂_{k+1} = [ θ̂_k − ( H_k^T H_k )^{-1} H_k^T h_{k+1} · ( h_{k+1}^T P_k^⊥ x ) / ( h_{k+1}^T P_k^⊥ h_{k+1} )  ;
             ( h_{k+1}^T P_k^⊥ x ) / ( h_{k+1}^T P_k^⊥ h_{k+1} ) ]
Order-Recursive LS Solution
Drawback: Needs an inversion at each recursion. See Eq. (8.29) and (8.30) for a way to avoid the inversion.
Comments:
1. If h_{k+1} ⊥ H_k ⇒ the problem simplifies as we've seen (this equation reduces to our earlier result).
2. Note: P_k^⊥ x above is the residual of the k-order model = the part of x not modeled by the k-order model ⇒ the update recursion works solely with this. Makes sense!!!
10
8.7 Sequential LS
In the Last Section: the data stays fixed, the model order increases.
In This Section: the data length increases, the model order stays fixed.
You have received a new data sample!
Say we have θ̂[N−1] based on x[0], ..., x[N−1].
If we get x[N]... can we compute θ̂[N] based on θ̂[N−1] and x[N] (w/o re-solving using the full data set)?
We want θ̂[N] = f( θ̂[N−1], x[N] )
Approach Here: 1. Derive for the DC-Level case   2. Interpret the results   3. Write down the general result w/o proof
11
Sequential LS for DC-Level Case
We know this:  Â_{N−1} = (1/N) Σ_{n=0}^{N−1} x[n]
Re-write:
Â_N = ( 1/(N+1) ) Σ_{n=0}^{N} x[n] = ( 1/(N+1) ) [ Σ_{n=0}^{N−1} x[n] + x[N] ] = ( N/(N+1) ) Â_{N−1} + ( 1/(N+1) ) x[N]
... and this:
Â_N = Â_{N−1} + ( 1/(N+1) ) ( x[N] − Â_{N−1} )
       old estimate + gain × prediction error   (Â_{N−1} is the prediction of the new data)
12
Weighted Sequential LS for DC-Level Case
This is an even better illustration... w[n] has an unknown PDF but a known time-dependent variance.
Assumed model: x[n] = A + w[n],   var(w[n]) = σ_n²
Standard WLS gives:
Â_{N−1} = [ Σ_{n=0}^{N−1} x[n]/σ_n² ] / [ Σ_{n=0}^{N−1} 1/σ_n² ]
With manipulations similar to the above case we get:
Â_N = Â_{N−1} + K_N ( x[N] − Â_{N−1} ),   with gain   K_N = ( 1/σ_N² ) / Σ_{n=0}^{N} ( 1/σ_n² )
       old estimate + gain × prediction error
K_N is a "gain" term that reflects the "goodness" of the new data.
13
Exploring The Gain Term
We know that var( Â_{N−1} ) = 1 / Σ_{n=0}^{N−1} ( 1/σ_n² ) ... and using it in K_N we get
K_N = var( Â_{N−1} ) / [ var( Â_{N−1} ) + σ_N² ]
     = ("poorness" of the current estimate) / ("poorness" of the current estimate + "poorness" of the new data, i.e. the variance of the new data)
Note: 0 ≤ K_N ≤ 1
⇒ The gain depends on the relative goodness between: the current estimate and the new data point.
14
Extreme Cases for The Gain Term
Â[N] = Â[N−1] + K[N] ( x[N] − Â[N−1] )      (old estimate + gain × prediction error)
If var( Â[N−1] ) << σ_N²   (Good Estimate, Bad Data) ⇒ K[N] ≈ 0 ⇒ the new data has little use ⇒ make little "correction" based on the new data.
If var( Â[N−1] ) >> σ_N²   (Bad Estimate, Good Data) ⇒ K[N] ≈ 1 ⇒ the new data is very useful ⇒ make a large "correction" based on the new data.
15
General Sequential LS Result   (See App. 8C for derivation)
Sequential LS requires a diagonal noise covariance.
At time index n−1 we have:
x_{n−1} = [ x[0] x[1] ... x[n−1] ]^T = H_{n−1} θ + w_{n−1},   C_{n−1} = diag( σ_0², σ_1², ..., σ_{n−1}² )
θ̂_{n−1} = LS estimate using x_{n−1}
Σ_{n−1} ≜ cov( θ̂_{n−1} )   (a quality measure of the estimate)
At time index n we get x[n]:
x_n = H_n θ + w_n   with   H_n = [ H_{n−1} ; h_n^T ]      (tack a row onto the bottom to show how θ maps to x[n])
16
Iterate these Equations
Given the following: θ̂_{n−1}, Σ_{n−1}, h_n, x[n], σ_n²
Update the Estimate:   θ̂_n = θ̂_{n−1} + k_n ( x[n] − h_n^T θ̂_{n−1} )      (h_n^T θ̂_{n−1} is the prediction of x[n] using the current parameter estimate)
Compute the Gain:      k_n = Σ_{n−1} h_n / ( σ_n² + h_n^T Σ_{n−1} h_n )
Update the Est. Cov.:  Σ_n = ( I − k_n h_n^T ) Σ_{n−1}
Initialization (assume p parameters):
• Collect the first p data samples x[0], ..., x[p−1]
• Use "batch" LS to compute θ̂_{p−1}, Σ_{p−1}
• Then start sequential processing
The gain has the same kind of dependence on the relative goodness between the current estimate and the new data point.
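A compact sketch of the sequential LS recursion above (again my own illustration, with an arbitrary two-parameter line model and made-up time-varying noise variances):

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = np.array([1.0, -0.2])
N = 100
sigma2 = 0.5 + 0.5 * rng.random(N)           # known, time-varying noise variances

h = lambda n: np.array([1.0, float(n)])      # h_n for the model x[n] = theta0 + theta1*n + w[n]
x = np.array([h(n) @ theta_true + np.sqrt(sigma2[n]) * rng.standard_normal() for n in range(N)])

# "Batch" (weighted) LS initialization with the first p = 2 samples
H0 = np.vstack([h(0), h(1)])
C0_inv = np.diag(1.0 / sigma2[:2])
Sigma = np.linalg.inv(H0.T @ C0_inv @ H0)
theta = Sigma @ H0.T @ C0_inv @ x[:2]

# Sequential updates for n = 2, 3, ...
for n in range(2, N):
    hn = h(n)
    k = Sigma @ hn / (sigma2[n] + hn @ Sigma @ hn)   # gain
    theta = theta + k * (x[n] - hn @ theta)          # estimate update
    Sigma = (np.eye(2) - np.outer(k, hn)) @ Sigma    # covariance update

print("sequential estimate:", theta)
```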
17
Sequential LS Block Diagram
[Block diagram: the observation x[n] is compared to the predicted observation h_n^T θ̂_{n−1}; the prediction error is scaled by the gain k_n (computed from Σ_{n−1}, h_n, σ_n²) and added to the previous estimate θ̂_{n−1} (held in a one-sample delay z^{-1}) to form the updated estimate θ̂_n = θ̂_{n−1} + k_n ( x[n] − h_n^T θ̂_{n−1} ).]
1
8.8 Constrained LS
Why Constrain? Because sometimes we know (or believe!) certain values are not allowed for θ.
For example: In emitter location you may know that the emitter's range can't exceed the "radio horizon".
You may also know that the emitter is on the left side of the aircraft (because you got a strong signal from the left-side antennas and a weak one from the right-side antennas).
Thus, when finding θ̂_LS you want to constrain it to satisfy these conditions.
2
Constrained LS Problem Statement
Say that S_c is the set of allowable θ values (due to the constraints). Then we seek θ̂_CLS ∈ S_c such that
|| x − H θ̂_CLS ||² = min_{θ ∈ S_c} || x − Hθ ||²
Types of Constraints (in order of increasing difficulty):
1. Linear Equality      Aθ = b                   (constrained to a line, plane or hyperplane)
2. Nonlinear Equality   f(θ) = b
3. Linear Inequality    Aθ ≥ b or Aθ ≤ b         (constrained to lie above/below a hyperplane)
4. Nonlinear Inequality f(θ) ≥ b or f(θ) ≤ b
We'll cover #1. See books on optimization for the other cases.
3
LS Cost with a Linear Equality Constraint
Using Lagrange multipliers... we need to minimize, w.r.t. θ and λ,
J_c(θ) = ( x − Hθ )^T ( x − Hθ ) + λ^T ( Aθ − b )      (linear equality constraint)
[Figure: contours of ( x − Hθ )^T ( x − Hθ ) in 2-D, showing the unconstrained minimum and the constrained minimum on the 2-D linear equality constraint line.]
4
Constrained Optimization: Lagrange Multiplier
Constraint: g(x_1, x_2) = C, i.e.  h(x_1, x_2) = g(x_1, x_2) − C = 0
The constrained max occurs when:
∇f(x_1, x_2) = −λ ∇h(x_1, x_2)   ⇒   ∇f(x_1, x_2) + λ ∇h(x_1, x_2) = 0,   i.e.   ∇[ f(x_1, x_2) + λ ( g(x_1, x_2) − C ) ] = 0
Ex. A linear constraint:  a x_1 + b x_2 − c = 0 ⇒ x_2 = (−a/b) x_1 + c/b
∇h(x_1, x_2) = [ ∂h/∂x_1 ; ∂h/∂x_2 ] = [ a ; b ]
Ex. The gradient vector has "slope" b/a ⇒ it is orthogonal to the constraint line.
[Figure: contours of f(x_1, x_2) with the constraint line; at the constrained optimum the gradients of f and h are parallel.]
5
LS Solution with a Linear Equality Constraint
Follow the usual steps for the Lagrange multiplier solution:
1. Set ∂J_c/∂θ = 0 ⇒ θ̂_CLS as a function of λ:
   −2 H^T x + 2 H^T H θ̂_c + A^T λ = 0
   ⇒ θ̂_c(λ) = ( H^T H )^{-1} H^T x − ½ ( H^T H )^{-1} A^T λ = θ̂_uc − ½ ( H^T H )^{-1} A^T λ      (θ̂_uc = unconstrained estimate)
2. Solve for λ to make θ̂_CLS satisfy the constraint A θ̂_c(λ) = b:
   λ_c = 2 [ A ( H^T H )^{-1} A^T ]^{-1} ( A θ̂_uc − b )
3. Plug λ_c in to get the constrained solution θ̂_c = θ̂_c(λ_c):
   θ̂_c = θ̂_uc − ( H^T H )^{-1} A^T [ A ( H^T H )^{-1} A^T ]^{-1} ( A θ̂_uc − b )
   ("correction term" driven by the amount of constraint deviation A θ̂_uc − b)
6
Geometry of Constrained Linear LS
The above result can be interpreted geometrically: the constrained estimate of the signal ŝ_c is the projection of the unconstrained estimate ŝ_uc onto the linear constraint subspace.
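A small numerical sketch of the equality-constrained LS formula above; the model, constraint, and data are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 30, 3
H = rng.standard_normal((N, p))
x = H @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.standard_normal(N)

# Constraint A theta = b, e.g. force theta_1 + theta_2 = 5 (made up)
A = np.array([[0.0, 1.0, 1.0]])
b = np.array([5.0])

HtH_inv = np.linalg.inv(H.T @ H)
theta_uc = HtH_inv @ H.T @ x                                   # unconstrained LS
corr = HtH_inv @ A.T @ np.linalg.solve(A @ HtH_inv @ A.T, A @ theta_uc - b)
theta_c = theta_uc - corr                                      # constrained LS
print(theta_c, "constraint check:", A @ theta_c)               # A @ theta_c equals b
```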
7
8.9 Nonlinear LS
Everything we've done up to now has assumed a linear observation model... but we've already seen that many applications have nonlinear observation models: s(θ) ≠ Hθ.
Recall: for the linear case there is a closed-form solution. < Not so for the nonlinear case!! >
Must use numerical, iterative methods to minimize the LS cost given by:
J(θ) = [ x − s(θ) ]^T [ x − s(θ) ]
But first... Two Tricks!!!
8
Two Tricks for Nonlinear LS
Sometimes it is possible to:
1. Transform into a Linear Problem
2. Separate out any Linear Parameters
Trick #1: Seek an invertible function α = g(θ), θ = g^{-1}(α), such that s(θ(α)) = Hα, which can be easily solved for α̂_LS, and then find θ̂_LS = g^{-1}( α̂_LS ).
Trick #2: See if some of the parameters are linear. Try to decompose θ = [ α ; β ] to get s(θ) = H(α) β      (nonlinear in α, linear in β!)
Sometimes it is possible to do both tricks together.
9
Example of Linearization Trick
Consider estimation of a sinusoid's amplitude and phase (with a known frequency):
s[n] = A cos( 2π f_o n + φ ),   θ = [ A  φ ]^T
But we can re-write this model as:
s[n] = A cos(φ) cos( 2π f_o n ) − A sin(φ) sin( 2π f_o n ) = α_1 cos( 2π f_o n ) + α_2 sin( 2π f_o n ),   with α_1 = A cos(φ), α_2 = −A sin(φ)
which is linear in α = [ α_1  α_2 ]^T, so α̂ = ( H^T H )^{-1} H^T x.
Then map this estimate back using
θ̂ = g^{-1}( α̂ ) = [ √( α̂_1² + α̂_2² )  ;  tan^{-1}( −α̂_2 / α̂_1 ) ]
Note that for this example this is merely exploiting polar-to-rectangular ideas!!!
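A quick numerical sketch of the linearization trick for this sinusoid example (frequency, amplitude, and phase values are arbitrary test values):

```python
import numpy as np

rng = np.random.default_rng(4)
N, fo, A_true, phi_true = 200, 0.05, 1.5, 0.7
n = np.arange(N)
x = A_true * np.cos(2 * np.pi * fo * n + phi_true) + 0.2 * rng.standard_normal(N)

# Linear-in-alpha model: s[n] = a1*cos(2*pi*fo*n) + a2*sin(2*pi*fo*n)
H = np.column_stack((np.cos(2 * np.pi * fo * n), np.sin(2 * np.pi * fo * n)))
a1, a2 = np.linalg.solve(H.T @ H, H.T @ x)

# Map back to amplitude/phase (polar-to-rectangular inverse)
A_hat = np.hypot(a1, a2)
phi_hat = np.arctan2(-a2, a1)
print(A_hat, phi_hat)
```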
10
Example of Separation Trick
Consider a signal model of three exponentials:
s[n] = A_1 r^n + A_2 r^{2n} + A_3 r^{3n},   0 < r < 1,   θ = [ A_1 A_2 A_3 r ]^T   with β = [ A_1 A_2 A_3 ]^T (linear) and α = r (nonlinear)
H(r) = the N×3 matrix whose rows are [ r^n  r^{2n}  r^{3n} ],  n = 0, 1, ..., N−1
Then we can write s(θ) = H(r) β, so for any fixed r:
β̂(r) = [ H^T(r) H(r) ]^{-1} H^T(r) x
Then we need to minimize:
J(r) = [ x − H(r) β̂(r) ]^T [ x − H(r) β̂(r) ] = [ x − H(r) ( H^T(r) H(r) )^{-1} H^T(r) x ]^T [ x − H(r) ( H^T(r) H(r) )^{-1} H^T(r) x ]
which depends on only one variable... so one might conceivably just compute it on a grid of r values and find the minimum.
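A sketch of this separation trick: grid-search the single nonlinear parameter r while solving for the linear amplitudes in closed form at each grid point (all numbers here are invented for the demo).

```python
import numpy as np

rng = np.random.default_rng(5)
N, r_true, beta_true = 60, 0.9, np.array([2.0, -1.0, 0.5])
n = np.arange(N)

def Hmat(r):
    return np.column_stack((r ** n, r ** (2 * n), r ** (3 * n)))

x = Hmat(r_true) @ beta_true + 0.05 * rng.standard_normal(N)

# Grid over the nonlinear parameter; the linear params are solved by LS at each r
r_grid = np.linspace(0.5, 0.99, 500)
costs = []
for r in r_grid:
    H = Hmat(r)
    beta = np.linalg.lstsq(H, x, rcond=None)[0]
    resid = x - H @ beta
    costs.append(resid @ resid)

r_hat = r_grid[int(np.argmin(costs))]
beta_hat = np.linalg.lstsq(Hmat(r_hat), x, rcond=None)[0]
print(r_hat, beta_hat)
```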
11
Iterative Methods for Solving Nonlinear LS
Goal: Find the θ value that minimizes J(θ) = [ x − s(θ) ]^T [ x − s(θ) ] without computing it over a p-dimensional grid.
The two most common approaches:
1. Newton-Raphson
   a. Analytically find ∂J(θ)/∂θ
   b. Apply Newton-Raphson to find a zero of ∂J(θ)/∂θ (i.e. linearize ∂J(θ)/∂θ about the current estimate)
   c. Iteratively repeat
2. Gauss-Newton
   a. Linearize the signal model s(θ) about the current estimate
   b. Solve the resulting linear problem
   c. Iteratively repeat
Both involve:
• Linearization (but they each linearize something different!)
• Solving a linear problem
• Iteratively improving the result
12
Newton-Raphson Solution to Nonlinear LS
To find the minimum of J(θ) = Σ_{i=0}^{N−1} ( x[i] − s[i](θ) )², set
g(θ) ≜ ∂J(θ)/∂θ = [ ∂J(θ)/∂θ_1  ...  ∂J(θ)/∂θ_p ]^T = 0
Taking these partials gives:
∂J(θ)/∂θ_j = −2 Σ_{i=0}^{N−1} ( x[i] − s[i](θ) ) · ∂s[i](θ)/∂θ_j
(define the residual r_i ≜ x[i] − s[i](θ) and h_ij ≜ ∂s[i](θ)/∂θ_j; the factor −2 can be ignored when setting to zero)
13
Now set this to zero:
Σ_{i=0}^{N−1} r_i h_ij = 0  for j = 1, ..., p   ⇒   g(θ) = H_θ^T r_θ = 0      (matrix × vector form)
where both H_θ (N×p) and r_θ (N×1) depend nonlinearly on θ:
H_θ = [ ∂s[i]/∂θ_j ]   (rows i = 0, ..., N−1; columns j = 1, ..., p),      r_θ = [ x[0] − s[0]  ...  x[N−1] − s[N−1] ]^T
Define the i-th row of H_θ as
h^T(θ; i) = [ ∂s[i]/∂θ_1  ∂s[i]/∂θ_2  ...  ∂s[i]/∂θ_p ]
Then the equation to solve is:
g(θ) = H_θ^T r_θ = Σ_{n=0}^{N−1} r[n] h(θ; n) = 0
14
For Newton-Raphson we linearize g(θ) around our current estimate and iterate:
θ̂_{k+1} = θ̂_k − [ ∂g(θ)/∂θ^T ]^{-1} g(θ) |_{θ = θ̂_k} = θ̂_k − [ ∂( H_θ^T r_θ )/∂θ^T ]^{-1} H_θ^T r_θ |_{θ = θ̂_k}
We need this derivative. Using the derivative-of-a-product rule:
∂( H_θ^T r_θ )/∂θ^T = ∂/∂θ^T [ Σ_{n=0}^{N−1} r[n] h(θ; n) ] = Σ_{n=0}^{N−1} [ r[n] ∂h(θ; n)/∂θ^T + h(θ; n) ∂r[n]/∂θ^T ]
with
∂r[n]/∂θ = ∂( x[n] − s[n;θ] )/∂θ = −[ ∂s[n]/∂θ_1  ...  ∂s[n]/∂θ_p ]^T
and defining the p×p matrix of second partials
[ G_n(θ) ]_{ij} = ∂²s[n](θ) / ∂θ_i ∂θ_j,   i, j = 1, 2, ..., p
this gives
∂( H_θ^T r_θ )/∂θ^T = Σ_{n=0}^{N−1} ( x[n] − s[n](θ) ) G_n(θ) − H_θ^T H_θ
15
So the Newton-Raphson method becomes:
θ̂_{k+1} = θ̂_k + [ H_θ^T H_θ − Σ_{n=0}^{N−1} ( x[n] − s[n; θ̂_k] ) G_n(θ̂_k) ]^{-1} H_θ^T ( x − s(θ̂_k) )   evaluated at θ = θ̂_k
where H_θ contains the 1st partials of the signal w.r.t. the parameters, and G_n(θ) contains the 2nd partials of s[n] w.r.t. the parameters.
Note: if the signal is linear in the parameters... this collapses to the non-iterative result we found for the linear case!!!
Newton-Raphson LS Iteration Steps:
1. Start with an initial estimate
2. Iterate the above equation until the change is "small"
16
Gauss-Newton Solution to Nonlinear LS
First we linearize the model around our current estimate by using a Taylor series and keeping only the linear terms:
s(θ) ≈ s(θ̂_k) + H_{θ̂_k} ( θ − θ̂_k ),   where   H_{θ̂_k} ≜ ∂s(θ)/∂θ |_{θ = θ̂_k}
Then we use this linearized model in the LS cost:
J(θ) = [ x − s(θ) ]^T [ x − s(θ) ]
     ≈ [ x − s(θ̂_k) − H_{θ̂_k} ( θ − θ̂_k ) ]^T [ x − s(θ̂_k) − H_{θ̂_k} ( θ − θ̂_k ) ]
     = [ ( x − s(θ̂_k) + H_{θ̂_k} θ̂_k ) − H_{θ̂_k} θ ]^T [ ( x − s(θ̂_k) + H_{θ̂_k} θ̂_k ) − H_{θ̂_k} θ ]
Define y ≜ x − s(θ̂_k) + H_{θ̂_k} θ̂_k   (all known things).
17
J(θ) = [ y − H_{θ̂_k} θ ]^T [ y − H_{θ̂_k} θ ]
This gives a form for the LS cost that looks like a linear problem!! We know the LS solution to that problem is
θ̂_{k+1} = ( H_{θ̂_k}^T H_{θ̂_k} )^{-1} H_{θ̂_k}^T y
         = ( H_{θ̂_k}^T H_{θ̂_k} )^{-1} H_{θ̂_k}^T [ x − s(θ̂_k) + H_{θ̂_k} θ̂_k ]
         = θ̂_k + ( H_{θ̂_k}^T H_{θ̂_k} )^{-1} H_{θ̂_k}^T [ x − s(θ̂_k) ]
Gauss-Newton LS Iteration:
θ̂_{k+1} = θ̂_k + ( H_{θ̂_k}^T H_{θ̂_k} )^{-1} H_{θ̂_k}^T ( x − s(θ̂_k) )
Gauss-Newton LS Iteration Steps:
1. Start with an initial estimate
2. Iterate the above equation until the change is "small"
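Here is a minimal Gauss-Newton sketch for a simple nonlinear model s[n;θ] = θ_1·exp(−θ_2·n). The model, data, and starting point are chosen only to illustrate the iteration; a real application would also guard against divergence.

```python
import numpy as np

rng = np.random.default_rng(6)
n = np.arange(50, dtype=float)
theta_true = np.array([2.0, 0.1])
x = theta_true[0] * np.exp(-theta_true[1] * n) + 0.02 * rng.standard_normal(n.size)

def s(theta):                     # signal model s[n; theta]
    return theta[0] * np.exp(-theta[1] * n)

def jac(theta):                   # N x 2 Jacobian: 1st partials of s w.r.t. theta
    e = np.exp(-theta[1] * n)
    return np.column_stack((e, -theta[0] * n * e))

theta = np.array([1.5, 0.2])      # initial estimate
for _ in range(20):
    H = jac(theta)
    delta = np.linalg.solve(H.T @ H, H.T @ (x - s(theta)))   # Gauss-Newton update
    theta = theta + delta
    if np.linalg.norm(delta) < 1e-9:                         # stop when the change is "small"
        break

print(theta)
```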
18
Newton-Raphson vs. Gauss-Newton
How do these two methods compare?
G-N:  θ̂_{k+1} = θ̂_k + ( H_{θ̂_k}^T H_{θ̂_k} )^{-1} H_{θ̂_k}^T ( x − s(θ̂_k) )
N-R:  θ̂_{k+1} = θ̂_k + [ H_{θ̂_k}^T H_{θ̂_k} − Σ_{n=0}^{N−1} ( x[n] − s[n; θ̂_k] ) G_n(θ̂_k) ]^{-1} H_{θ̂_k}^T ( x − s(θ̂_k) )
The term of 2nd partials is missing in the Gauss-Newton equation.
Which is better? Typically I prefer Gauss-Newton:
• The G_n matrices are often small enough to be negligible
• ... or the error term is small enough to make the sum term negligible
• Inclusion of the sum term can sometimes de-stabilize the iteration
See p. 683 of Numerical Recipes book
1
8.10 Signal Processing Examples of LS
We'll briefly look at two examples from the book...
Book Examples
1. Digital Filter Design
2. AR Parameter Estimation for the ARMA Model
3. Adaptive Noise Cancellation
4. Phase-Locked Loop (used in phase-coherent demodulation)
The two examples we will cover highlight the flexibility of the LS viewpoint!!!
Then (in separate note files) we’ll look in detail at two emitter location examples not in the book
2
Ex. 8.11 Filter Design by Prony's LS Method
The problem:
• You have some desired impulse response h_d[n]
• Find a rational TF with impulse response h[n] ≈ h_d[n]
View: h_d[n] as the observed "data"!!! The rational TF model's coefficients as the parameters.
General LS Problem: choose the estimate θ̂ to make the residual ε[n] = x[n] − s[n; θ̂] small.
LS Filter Design Problem: the "data" is h_d[n]; the model is h[n; â, b̂], the impulse response of H(z) = B(z)/A(z) driven by δ[n], with a (p×1) and b ((q+1)×1); choose â, b̂ to make the residual ε[n] = h_d[n] − h[n; â, b̂] small.
Prony's Modification to Get a Linear Model
The previous formulation results in a model that is nonlinear in the TF coefficient vectors a, b.
Prony's idea was to change the model slightly: filter the desired response h_d[n] through A(z) before forming the error against B(z) driven by δ[n]. This model is only approximately equivalent to the original... but it is linear in the coefficients.
Solution (see the book for details):
â = −( H_q^T H_q )^{-1} H_q^T h_{q,N−1}        b̂ = h_{0,q} + H_0 â
H_q, H_0, h_{q,N−1}, and h_{0,q} all contain elements from h_d[n]... the subscripts indicate the range of these elements.
4
Key Ideas in the Prony LS Example
1. Shows the power and flexibility of the LS approach
   - There is no noise here!!! ⇒ MVU, ML, etc. are not applicable
   - But LS works nicely!
2. Shows a slick trick to convert a nonlinear problem into a linear one
   - Be aware that finding such tricks is an art!!!
3. Results for the LS "Prony" method have links to modeling methods for random processes (i.e. AR, MA, ARMA)
Is this a practical filter design method? It's not the best: the Remez-based method is used most.
5
Ex. 8.13 Adaptive Noise Cancellation   (done a bit differently from the book)
[Block diagram: the data is x[n] = d[n] + i[n], desired signal plus interference. A reference signal ĩ[n], statistically correlated with the interference i[n] but mostly uncorrelated with the desired d[n], is passed through an adaptive FIR filter to produce î[n], an estimate of the interference adapted to "best" cancel it. The output d̂[n] = x[n] − î[n] is the estimate of the desired signal with the interference "cancelled".]
î[n] = Σ_{l=0}^{p} h_n[l] ĩ[n−l]      Time-varying filter!! The coefficients change at each sample index.
6
Noise Cancellation Typical Applications
1. Fetal Heartbeat Monitoring
[Block diagram: the sensor on the mother's stomach gives x[n] = d[n] + i[n] = fetal heartbeat + mother's heartbeat via the stomach; the sensor on the mother's chest gives the reference ĩ[n] = mother's heartbeat via the chest. The adaptive filter has to mimic the TF of the chest-to-stomach propagation so that î[n] cancels i[n], leaving d̂[n] ≈ the fetal heartbeat.]
7
2. Noise Canceling Headphones
[Block diagram: the ambient noise reference ĩ[n] is passed through the adaptive FIR filter to form î[n]; the headphone drives the ear with m[n] − î[n] (the music signal minus the noise estimate), so at the ear the noise i[n] and −î[n] cancel, leaving m[n] + i[n] − î[n] ≈ the music.]
8
3. Bistatic Radar System
[Block diagram: the receiver observes x[n] = t[n] + d_t[n], the direct-path transmitter signal t[n] (interference) plus the target echo d_t[n] (desired). A reference d[n] of the transmitted signal from the Tx is passed through the adaptive FIR filter to form t̂[n], which is subtracted to give d̂_t[n]; the cleaned output feeds the delay/Doppler radar processing.]
9
LS and Adaptive Noise Cancellation
Goal: Adjust the filter coefficients to cancel the interference. There are many signal processing approaches to this problem...
We'll look at this from an LS point of view: adjust the filter coefficients to minimize J = Σ_n d̂²[n].
Since d̂[n] = x[n] − î[n] = d[n] + ( i[n] − î[n] ) and i[n] is uncorrelated with d[n], minimizing J is essentially the same as making the ( i[n] − î[n] ) term zero.
Because the interference likely changes its character with time... we want to adapt! Use Sequential LS with Fading Memory.
10
Sequential LS with Forgetting Factor
We want to weight recent measurements more heavily than past measurements... that is, we want to "forget" past values.
So we can use weighted LS... and if we choose the weighting factor as an exponential function then it is easy to implement!
J[n] = Σ_{k=0}^{n} λ^{n−k} ( x[k] − î[k] )² = Σ_{k=0}^{n} λ^{n−k} ( x[k] − Σ_{l=0}^{p} h_n[l] ĩ[k−l] )²
λ = forgetting factor, 0 < λ < 1; a small λ quickly "down-weights" the past errors.
See the book for solution details, and Fig. 8.17 for simulation results.
Single Platform Emitter Location
Parameter types: AOA (DF), FOA, Interferometry (SBI, LBI), TOA
Emitter Location is Two Estimation Problems in One:
1) Estimate signal parameter(s) that depend on the emitter's location:
   a) Time-of-Arrival (TOA) of pulses
   b) Phase Interferometry: phase is measured between two different signals received at nearby antennas
      • SBI – Short Baseline Interferometry (antennas are close enough together that phase is measured without ambiguity)
      • LBI – Long Baseline Interferometry (antennas are far enough apart that phase is measured with ambiguity; the ambiguity is resolved either using processing or is so-called self-resolved)
   c) Frequency-of-Arrival (FOA) or Doppler
   d) Angle-of-Arrival (AOA)
2) Use signal parameters measured at several instants to estimate the location.
Frequency-Based Location (i.e. Doppler Location)
The Problem
• The emitter is assumed non-moving and at position (X, Y, Z)
  – Transmitting a radar signal at unknown carrier frequency f_o
• The signal is intercepted by a receiver on a single aircraft
  – The A/C dynamics are considered to be perfectly known as a function of time
  • Nav Data: Position X_p(t), Y_p(t), Z_p(t) and Velocity V_x(t), V_y(t), V_z(t)
• Relative motion between the Tx and Rx causes a Doppler shift
  – The received carrier frequency differs from the transmitted carrier frequency
  – Thus, the carrier frequency of the received signal will change with time
• For a given set of nav data, how the frequency changes depends on the transmitter's carrier frequency f_o and the emitter's position (X, Y, Z)
  – Parameter Vector: x = [ X Y Z f_o ]^T
  – f_o is a "nuisance" parameter
• The received frequency is a function of time as well as the parameter vector x:
  f(t, x) = f_o − ( f_o/c ) · [ V_x(t)( X_p(t) − X ) + V_y(t)( Y_p(t) − Y ) + V_z(t)( Z_p(t) − Z ) ] / √[ ( X − X_p(t) )² + ( Y − Y_p(t) )² + ( Z − Z_p(t) )² ]      (1)
• Make noisy frequency measurements at t_1, ..., t_N:  f̃(t_i, x) = f(t_i, x) + v(t_i)
• Problem: Given the noisy frequency measurements and the nav data, estimate x
• What PDF model do we use for our data????
In the TDOA/FDOA case… we had an ML estimator for TDOA/FDOA so we could claim that the measurements were asymptotically Gaussian. Because we then had a well-specified PDF for the TDOA/FDOA we could hope to use ML for the location processing.
However, here we have no ML estimator for the instantaneous frequency so claiming that the inst. freq. estimates are Gaussian is a bit of a stretch.
So we could:
1. Outright ASSUME Gaussian and then use ML approach
2. Resort to LS… which does not even require a PDF viewpoint!
Both paths get us to the exact same place:
Find the estimate x̂_e that minimizes
J(x̂_e) = Σ_{i=1}^{N} [ f̃(t_i, x) − f(t_i, x̂_e) ]²
If we Assume Gaussian… we could choose:
• Newton-Raphson MLE approach: leads to double derivatives of the measurement model f (ti,xe).
If we Resort to LS… we could choose either:
• Newton-Raphson approach, which in this case is identical to N-R under the Gaussian assumption
• Gauss-Newton approach, which needs only first derivatives of the measurement model f (ti,xe).
We’ll resort to LS and use Gauss-Newton
[Figure: measured frequency vs. time, compared with the frequency computed using the measured nav data and a poor assumed location, and with the frequency computed using the measured nav data and a good assumed location.]
LS Approach: Find the estimate x̂ such that the corresponding computed frequency measurements f(t_i, x̂) are "close" to the actual measurements:
– Minimize  J = Σ_{i=1}^{N} [ f̃(t_i, x) − f(t_i, x̂) ]²
The Solution
The measurement model in (1) is nonlinear in x ⇒ no closed-form solution.
– Newton-Raphson: linearize the derivative of the cost function
– Gauss-Newton: linearize the measurement model
Thus:  f(x) ≈ f(x̂_n) + H ( x − x̂_n ) + v   ⇒   Δf(x̂_n) ≈ H Δx + v      (a linear model)
where the Jacobian is
H = ∂f(t, x)/∂x |_{x = x̂_n} = [ h_1 | h_2 | h_3 | h_4 ]
Get the LS solution for the update and then update the current estimate:
Δx̂ = ( H^T R^{-1} H )^{-1} H^T R^{-1} Δf(x̂_n),        x̂_{n+1} = x̂_n + Δx̂
Under the condition that the frequency measurement errors are Gaussian, the CRLB for the problem can be shown to be
var(x̂) ≥ ( H^T R^{-1} H )^{-1}
Can use this to investigate performance under geometries of interest... even when the measurement errors aren't truly Gaussian.
The Algorithm
Initialization:
• Use the average of the measured frequencies as an initial transmitter frequency estimate.
• To get an initial estimate of the emitter's X, Y, Z components there are several possibilities:
  – Perform a grid search
  – Use some information from another sensor (e.g., if other on-board sensors can give a rough angle, use that together with a typical range)
  – Pick several typical initial locations (e.g., one in each quadrant with some typical range)
• Let the initial estimate be  x̂_0 = [ X̂_0  Ŷ_0  Ẑ_0  f̂_{o,0} ]^T
Iteration:
For n = 0, 1, 2, ...
1. Compute the vector of predicted frequencies at times t_1, t_2, ..., t_N using the current n-th estimate and the nav info:
   f̂(t_j, x̂_n) = f̂_{o,n} − ( f̂_{o,n}/c ) · [ V_x(t_j)( X_p(t_j) − X̂_n ) + V_y(t_j)( Y_p(t_j) − Ŷ_n ) + V_z(t_j)( Z_p(t_j) − Ẑ_n ) ] / √[ ( X̂_n − X_p(t_j) )² + ( Ŷ_n − Y_p(t_j) )² + ( Ẑ_n − Z_p(t_j) )² ]
   f̂(x̂_n) = [ f̂(t_1, x̂_n)  f̂(t_2, x̂_n)  ...  f̂(t_N, x̂_n) ]^T
2. Compute the residual vector by subtracting the predicted frequency vector from the measured frequency vector:
   Δf(x̂_n) = f̃(x) − f̂(x̂_n)
3. Compute the Jacobian matrix H using the nav info and the current estimate:
   H = ∂f(t, x)/∂x |_{x = x̂_n} = [ h_1 | h_2 | h_3 | h_4 ]
   Define:
   ΔX̂_n(t_j) = X̂_n − X_p(t_j),   ΔŶ_n(t_j) = Ŷ_n − Y_p(t_j),   ΔẐ_n(t_j) = Ẑ_n − Z_p(t_j)
   R̂_n(t_j) = √[ ΔX̂_n²(t_j) + ΔŶ_n²(t_j) + ΔẐ_n²(t_j) ]
   Differentiating (1) (quotient rule) gives the j-th element of each column, e.g.
   h_1(j) = ∂f(t_j, x)/∂X |_{x = x̂_n} = ( f̂_{o,n}/c ) · [ V_x(t_j)/R̂_n(t_j) − ΔX̂_n(t_j) · ( V_x(t_j)ΔX̂_n(t_j) + V_y(t_j)ΔŶ_n(t_j) + V_z(t_j)ΔẐ_n(t_j) ) / R̂_n³(t_j) ]
   with h_2(j) = ∂f(t_j, x)/∂Y and h_3(j) = ∂f(t_j, x)/∂Z analogous (replace the leading V_x and ΔX̂_n by V_y, ΔŶ_n and by V_z, ΔẐ_n respectively), and
   h_4(j) = ∂f(t_j, x)/∂f_o |_{x = x̂_n} ≈ 1
4. Compute the estimate update:
   Δx̂_n = ( H^T C^{-1} H )^{-1} H^T C^{-1} Δf(x̂_n)
   C is the covariance of the frequency measurements; usually assumed to be diagonal with the measurement variances on the diagonal.
   In practice you would implement this inverse using the Singular Value Decomposition (SVD) due to numerical issues when H is nearly singular (MATLAB will give you a warning when this is a problem). See pp. 676-677 of the book "Numerical Recipes ...".
5. Update the estimate using  x̂_{n+1} = x̂_n + Δx̂_n
6. Check for convergence of the solution: look to see if the update is small in some specified sense.
   If "Not Converged"... go to Step 1.
   If Converged, or the maximum number of iterations is reached... quit the loop and set x̂ = x̂_{n+1}.
7. Compute the Least-Squares cost of the converged solution:
   C(x̂) = Σ_{n=1}^{N} [ f̃(t_n, x) − f̂(t_n, x̂) ]² / σ_n²
   This last step is often done to allow assessment of how much confidence you have in the solution. There are other ways to assess confidence – see the discussion in Ch. 15 of "Numerical Recipes ...".
Note: There is no guarantee that this algorithm will converge... it might not converge at all... it might:
(i) simply wander around aimlessly,
(ii) oscillate back and forth along some path, or
(iii) wander off in complete divergence.
In practical algorithms it is a good idea to put tests into the code to check for such occurrences.
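A schematic Gauss-Newton loop for this Doppler-location problem (my own sketch, not the book's code). It assumes hypothetical helper functions `predict_freq(x_est, t, nav)` implementing Eq. (1) and `jacobian(x_est, t, nav)` returning the N×4 matrix H, and it includes simple convergence/divergence checks of the kind suggested above.

```python
import numpy as np

def gauss_newton_locate(f_meas, t, nav, x0, C, predict_freq, jacobian,
                        max_iter=50, tol=1e-3):
    """Iterate x_{n+1} = x_n + (H^T C^-1 H)^-1 H^T C^-1 (f_meas - f_pred)."""
    x_est = np.asarray(x0, dtype=float)
    Cinv = np.linalg.inv(C)
    for _ in range(max_iter):
        resid = f_meas - predict_freq(x_est, t, nav)      # residual vector
        H = jacobian(x_est, t, nav)                       # N x 4 Jacobian
        A = H.T @ Cinv @ H
        dx = np.linalg.pinv(A) @ (H.T @ Cinv @ resid)     # SVD-based solve for near-singular H
        x_est = x_est + dx
        if not np.all(np.isfinite(x_est)):                # divergence guard
            raise RuntimeError("iteration diverged")
        if np.linalg.norm(dx) < tol:                      # convergence test
            break
    cost = resid @ Cinv @ resid                           # final weighted LS cost
    return x_est, cost
```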
[Figure: Simulation results with 95% CRLB error ellipses; x-y plots (in meters) showing the platform trajectory and the estimated emitter location at several zoom levels.]
1/11
Doppler Tracking
Passive Tracking of an Airborne Radar: An Example of Least Squares "State" Estimation
2/11
Problem Statement
An airborne radar to be located follows a trajectory X(t), Y(t), Z(t) with velocities V_x(t), V_y(t), V_z(t) — all unknown.
It is transmitting a radar signal whose carrier frequency is f_o.
The signal is intercepted by a non-moving receiver at known location X_p, Y_p, Z_p.
Problem: Estimate the trajectory X(t), Y(t), Z(t).
Solution Here:
• Measure the received frequency at instants t_1, t_2, ..., t_N
• Assume a simple model for the aircraft's motion
• Estimate the model parameters to give estimates of the trajectory
3/11
An Admission
This problem is somewhat of a "rigged" application...
• It is unlikely it would be done in practice just like this
• Because it will lead to poorly observable parameters
• The H matrix is likely to be less than full rank
In real practice we would likely need either:
• Multiple Doppler sensors, or
• A single sensor that can measure other things in addition to Doppler (e.g., bearing).
We present it this way to maximize the similarity to the example of locating a non-moving radar from a moving platform.
• We can focus on the main characteristic that arises when the parameter to be estimated is a varying function (i.e. state estimation).
4/11
Doppler Shift Model
Relative motion between the emitter and receiver causes a Doppler shift.
The frequency observed at time t is related to the unknown transmitted frequency f_o by:
f(t) = f_o − ( f_o/c ) · [ V_x(t)( X(t) − X_p ) + V_y(t)( Y(t) − Y_p ) + V_z(t)( Z(t) − Z_p ) ] / √[ ( X(t) − X_p )² + ( Y(t) − Y_p )² + ( Z(t) − Z_p )² ]
We measure this at time instants t_1, t_2, ..., t_N:
f̃(t_i) = f(t_i) + v(t_i)      (frequency measurement + "noise")
and group them into a measurement vector:
f̃ = [ f̃(t_1)  f̃(t_2)  ...  f̃(t_N) ]^T
But what are we trying to estimate from this data vector???
But what are we trying to estimate from this data vector???
5/11
Trajectory Model
We can't estimate arbitrary trajectory functions like X(t), Y(t), etc.
We need a trajectory model... to reduce the problem to estimating a few parameters.
Here we will choose the simplest: the Constant-Velocity Model
X(t) = V_x × ( t − t_N ) + X_N
Y(t) = V_y × ( t − t_N ) + Y_N
Z(t) = V_z × ( t − t_N ) + Z_N
where X_N, Y_N, Z_N are the final positions in the observation block and V_x, V_y, V_z are the velocity values.
Now, given measurements of the frequencies f(t_1), f(t_2), ..., f(t_N) ... we wish to estimate the 7-parameter vector:
x = [ X_N  Y_N  Z_N  V_x  V_y  V_z  f_o ]^T
6/11
Measurement Model and Estimation Problem
Substituting the Trajectory Model into the Doppler Model gives our measurement model (the dependence on the parameter vector x enters through the constant-velocity trajectory):
f(t, x) = f_o − ( f_o/c ) · [ V_x( [V_x(t−t_N)+X_N] − X_p ) + V_y( [V_y(t−t_N)+Y_N] − Y_p ) + V_z( [V_z(t−t_N)+Z_N] − Z_p ) ] / √[ ( [V_x(t−t_N)+X_N] − X_p )² + ( [V_y(t−t_N)+Y_N] − Y_p )² + ( [V_z(t−t_N)+Z_N] − Z_p )² ]
The noisy frequency measurements are f̃(t, x) = f(t, x) + v(t). Stacking them:
f̃(x) = [ f̃(t_1, x)  f̃(t_2, x)  ...  f̃(t_N, x) ]^T = f(x) + v
         noisy measurement vector = noise-free frequency vector + noise vector
7/11
Estimation Problem
Given: the noisy data vector f̃(x) = [ f̃(t_1, x)  f̃(t_2, x)  ...  f̃(t_N, x) ]^T and the sensor position X_p, Y_p, Z_p
Estimate: the parameter vector x = [ X_N  Y_N  Z_N  V_x  V_y  V_z  f_o ]^T      (f_o is a "nuisance" parameter)
This is a nonlinear problem...
Although we could use ML to attack this, we choose LS here partly because we aren't given an explicit noise model and partly because LS is "easily" applied here!!!
8/11
Linearize the Nonlinear Model
We have a nonlinear measurement model here... so we choose to linearize our model (as before):
f̃(x) ≈ f(x̂_n) + H ( x − x̂_n ) + v
where...
x̂_n is the "current" estimate of the parameter vector,
f(x̂_n) is the "predicted" frequency measurements computed using the Doppler & Trajectory models with "back-propagation" (see next), and
H = ∂f(t, x)/∂x |_{x = x̂_n} = [ h_1 | h_2 | h_3 | h_4 | h_5 | h_6 | h_7 ]
is the N×7 Jacobian matrix evaluated at the current estimate.
9/11
Back-Propagate to Get Predicted Frequencies
Given the current parameter estimate:
x̂_n = [ X̂_N(n)  Ŷ_N(n)  Ẑ_N(n)  V̂_x(n)  V̂_y(n)  V̂_z(n)  f̂_o(n) ]^T
Back-propagate to get the current trajectory estimate:
X̂_n(t) = V̂_x(n) × ( t − t_N ) + X̂_N(n)
Ŷ_n(t) = V̂_y(n) × ( t − t_N ) + Ŷ_N(n)
Ẑ_n(t) = V̂_z(n) × ( t − t_N ) + Ẑ_N(n)
Use the back-propagated trajectory to get the predicted frequencies:
f(t_i, x̂_n) = f̂_o(n) − ( f̂_o(n)/c ) · [ V̂_x(n)( X̂_n(t_i) − X_p ) + V̂_y(n)( Ŷ_n(t_i) − Y_p ) + V̂_z(n)( Ẑ_n(t_i) − Z_p ) ] / √[ ( X̂_n(t_i) − X_p )² + ( Ŷ_n(t_i) − Y_p )² + ( Ẑ_n(t_i) − Z_p )² ]
10/11
Converting to Linear LS Problem Form
From the linearized model and the back-propagated trajectory estimate we get:
Δf(x̂_n) ≜ f̃(x) − f(x̂_n) ≈ H Δx̂ + v      ("residual" vector ≈ H × "update" vector + noise)
This is in the standard form of Linear LS... so the solution is:
Δx̂_n = ( H^T R^{-1} H )^{-1} H^T R^{-1} Δf(x̂_n)      (R is the covariance matrix of the measurements)
This LS-estimated "update" is then used to get an updated parameter estimate:
x̂_{n+1} = x̂_n + Δx̂_n
11/11
Iterating to the Solution
• n = 0: Start with some initial estimate
• Loop until the stopping criterion is satisfied:
  – n ← n+1
  – Compute the back-propagated trajectory
  – Compute the residual
  – Compute the Jacobian
  – Compute the update
  – Check the update for smallness of norm
• If the update is small enough... stop
• Otherwise, update the estimate and loop
1
Pre-Chapter 10: Results for Two Random Variables
See Reading Notes posted on BB
2
Let X and Y be two RVs, each with their own PDF: pX(x) and pY(y)
Their complete probabilistic description is captured in…
Joint PDF of X and Y: pXY(x,y)
Describes probabilities of joint events concerning X and Y.
∫ ∫=<<<<b
a
d
cXY dxdyyxpdYcbXa ),()( and )(Pr
Marginal PDFs of X and Y: The individual PDFs pX(x) and pY(y)
Imagine “adding up” the joint PDF along one direction of a piece of paper to give values “along one of the margins”.
∫∫ == dxyxpypdyyxpxp XYYXYX ),()(),()(
3
Expected Value of Functions of X and Y: You sometimes create a new RV that is a function of the two of them: Z = g(X, Y).
E[Z] = E_XY[ g(X, Y) ] = ∫∫ g(x, y) p_XY(x, y) dx dy
Example: Z = X + Y
E[Z] = E[X + Y] = ∫∫ ( x + y ) p_XY(x, y) dx dy
     = ∫∫ x p_XY(x, y) dx dy + ∫∫ y p_XY(x, y) dx dy
     = ∫ x [ ∫ p_XY(x, y) dy ] dx + ∫ y [ ∫ p_XY(x, y) dx ] dy
     = ∫ x p_X(x) dx + ∫ y p_Y(y) dy
     = E[X] + E[Y]
4
Conditional PDFs: If you know the value of one RV, how is the remaining RV now distributed?
p_{Y|X}(y|x) = p_XY(x, y) / p_X(x)  if p_X(x) ≠ 0,  and 0 otherwise
p_{X|Y}(x|y) = p_XY(x, y) / p_Y(y)  if p_Y(y) ≠ 0,  and 0 otherwise
Sometimes we think of a specific numerical value upon which we are conditioning… pY|X(y|X = 5)
Other times it is an arbitrary value…
pY|X(y|X = x) or pY|X(y|x) or pY|X(y|X)
Various Notations
5
Independence: RVs X and Y are said to be independent if knowledge of the value of one does not change the PDF model for the other:
p_{Y|X}(y|x) = p_Y(y)        p_{X|Y}(x|y) = p_X(x)
This implies (and is implied by)
p_XY(x, y) = p_X(x) p_Y(y)
since then
p_{Y|X}(y|x) = p_X(x) p_Y(y) / p_X(x) = p_Y(y)        p_{X|Y}(x|y) = p_X(x) p_Y(y) / p_Y(y) = p_X(x)
6
Decomposing the Joint PDF: Sometimes it is useful to be able to write the joint PDF in terms of conditional and marginal PDFs.
From our results for conditioning above we get:
p_XY(x, y) = p_{Y|X}(y|x) p_X(x) = p_{X|Y}(x|y) p_Y(y)
From this we can get results for the marginals:
p_X(x) = ∫ p_{X|Y}(x|y) p_Y(y) dy        p_Y(y) = ∫ p_{Y|X}(y|x) p_X(x) dx
7
Bayes’ Rule: Sometimes it is useful to be able to write one conditional PDF in terms of the other conditional PDF.
)()()|(
)|(
)()()|(
)|(
||
||
ypxpxyp
yxp
xpypyxp
xyp
Y
XXYYX
X
YYXXY
=
=
Some alternative versions of Bayes’ rule can be obtained by writing the marginal PDFs using some of the above results:
∫∫
∫∫
==
==
dxxpxyp
xpxyp
dxyxp
xpxypyxp
dyypyxp
ypyxp
dyyxp
ypyxpxyp
XXY
XXY
XY
XXYYX
YYX
YYX
XY
YYXXY
)()|(
)()|(
),(
)()|()|(
)()|(
)()|(
),(
)()|()|(
|
|||
|
|||
8
Conditional Expectations: Once you have a conditional PDF it works EXACTLY like a PDF... that is because it IS a PDF!
Remember that any expectation involves a function of a random variable (or variables) times a PDF, with that product then integrated.
So the trick to working with expected values is to make sure you know three things:
1. What function of which RVs
2. What PDF
3. What variable to integrate over
9
For conditional expectations... one idea but several notations!
E_{X|Y}[ g(X, Y) ] = ∫ g(x, y) p_{X|Y}(x|y) dx      (subscript on E indicates use of the conditional PDF; does not explicitly state the value at which Y is fixed, so an arbitrary y is used)
E_{X|Y=y_o}[ g(X, Y) ] = ∫ g(x, y_o) p_{X|Y}(x|y_o) dx      (subscript on E indicates the conditional PDF; explicitly states that Y is fixed at y_o)
E[ g(X, Y) | Y ] = ∫ g(x, y) p_{X|Y}(x|y) dx      ("conditional bar" inside the brackets of E indicates use of the conditional PDF; arbitrary y)
E[ g(X, Y) | Y = y_o ] = ∫ g(x, y_o) p_{X|Y}(x|y_o) dx      ("conditional bar" inside the brackets; explicitly Y = y_o)
10
Decomposing Joint Expectations: When averaging over the joint PDF it is sometimes useful to decompose it into nested averaging in terms of conditional and marginal PDFs. This uses the results for decomposing joint PDFs:
E_XY[ g(X, Y) ] = E_X[ E_{Y|X}[ g(X, Y) ] ]
since
E_XY[ g(X, Y) ] = ∫∫ g(x, y) p_XY(x, y) dx dy = ∫ [ ∫ g(x, y) p_{Y|X}(y|x) dy ] p_X(x) dx
The inner integral E_{Y|X}[ g(X, Y) ] is an RV that "inherits" the PDF of X!!!
11
Ex. Decomposing Joint Expectations:
Let X = # on the Red Die, Y = # on the Blue Die, g(X, Y) = X + Y.
E_XY[ g(X, Y) ] = E_X[ E_{Y|X}[ g(X, Y) ] ]
For each value x of X, the inner (conditional) expectation is
E_{Y|X}[ X + Y | X = x ] = (1/6) Σ_{y=1}^{6} ( x + y ) = x + 3.5
giving 4.5, 5.5, 6.5, 7.5, 8.5, 9.5 for x = 1, 2, 3, 4, 5, 6. These constitute an RV with uniform probability 1/6.
Then the outer expectation gives
E[ X + Y ] = E_X[ E_{Y|X}[ X + Y | X ] ] = (1/6) Σ_{x=1}^{6} ( x + 3.5 ) = 7
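A two-line numerical check of this nested-expectation example (my own illustration):

```python
import numpy as np

faces = np.arange(1, 7)
inner = np.array([np.mean(x + faces) for x in faces])   # E[X+Y | X=x] for x = 1..6
print(inner, inner.mean())                              # 4.5 ... 9.5, and E[X+Y] = 7.0
```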
1
Chapter 10: Bayesian Philosophy
2
10.1 Introduction
Up to now... Classical Approach: assumes θ is deterministic. This has a few ramifications:
• The variance of the estimate could depend on θ
• In Monte Carlo simulations:
  – M runs are done at the same θ,
  – you must do M runs at each θ of interest,
  – averaging is done over the data – no averaging over θ values
  (E is w.r.t. p(x;θ))
Bayesian Approach: assumes θ is random with pdf p(θ). This has a few ramifications:
• The variance of the estimate CAN'T depend on θ
• In Monte Carlo simulations:
  – each run is done at a randomly chosen θ,
  – averaging is done over the data AND over the θ values
  (E is w.r.t. the joint pdf p(x,θ))
joint pdf
3
Why Choose Bayesian?
1. Sometimes we have prior knowledge on θ ⇒ some values are more likely than others.
2. Useful when the classical MVU estimator does not exist because of nonuniformity of minimal variance.
   [Figure: estimator variance curves vs. θ; no single estimator has the smallest variance at every θ.]
3. To combat the "signal estimation problem"... estimate the signal s in x = s + w. If s is deterministic and is the parameter to estimate, then H = I, and the classical solution is
   ŝ = ( I^T I )^{-1} I^T x = x      — the signal estimate is the data itself!!!
   The Wiener filter is a Bayesian method to combat this!!
4
10.3 Prior Knowledge and Estimation
Bayesian Data Model:
• The parameter is "chosen" randomly with a known "prior PDF" (this is what you know ahead of time about the parameter)
• Then the data set is collected
• Then an estimate value is chosen for the parameter
Every time you collect data, the parameter has a different value, but some values may be more likely to occur than others. This is how you think about it mathematically and how you run simulations to test it.
5
Ex. of Bayesian Viewpoint: Emitter Location Emitters are where they are and don’t randomly jump around each
time you collect data. So why the Bayesian model?
(At least) Three Reasons1. You may know from maps, intelligence data, other sensors,
etc. that certain locations are more likely to have emitters• Emitters likely at airfields, unlikely in the middle of a lake
2. Recall Classical Method: Parm Est. Variance often depends on parameter• It is often desirable (e.g. marketing) to have a single
number that measures accuracy.3. Classical Methods try to give an estimator that gives low
variance at each θ value. However, this could give large variance where emitters are likely and low variance where they are unlikely.
6
Bayesian Criteria Depend on Joint PDF
There are several different optimization criteria within the Bayesian framework. The most widely used is...
Minimize the Bayesian MSE:
Bmse(θ̂) = E[ ( θ − θ̂ )² ] = ∫∫ [ θ − θ̂(x) ]² p(x, θ) dx dθ
(Take E w.r.t. the joint pdf of x and θ, so the Bmse can NOT depend on θ.)
To see the difference... compare to the Classical MSE:
mse(θ̂) = E[ ( θ − θ̂ )² ] = ∫ [ θ − θ̂(x) ]² p(x; θ) dx
(pdf of x parameterized by θ, so the mse CAN depend on θ.)
7
Ex. Bayesian for DC Level in Zero-Mean White Gaussian Noise
Same as before... x[n] = A + w[n], but here we use the following model:
• A is random with a uniform prior pdf:  p(A) = 1/(2A_o) for −A_o ≤ A ≤ A_o, 0 otherwise
• The RVs A and w[n] are independent of each other
Now we want to find the estimator function that maps the data x into the estimate of A that minimizes the Bayesian MSE:
Bmse(Â) = ∫∫ [ A − Â ]² p(x, A) dA dx = ∫ [ ∫ [ A − Â ]² p(A|x) dA ] p(x) dx      (now use p(x, A) = p(A|x) p(x))
Minimize the inner integral for each x value; this works because p(x) ≥ 0.
So... fix x, take the partial derivative w.r.t. Â, set it to 0.
8
Finding the partial derivative gives:
∂/∂Â ∫ [ A − Â ]² p(A|x) dA = −2 ∫ A p(A|x) dA + 2 Â ∫ p(A|x) dA      (the last integral = 1)
Setting this equal to zero and solving gives:
Â = ∫ A p(A|x) dA = E[ A | x ]      (the conditional mean of A given the data x)
Bayesian Minimum MSE (MMSE) Estimate = the mean of the "posterior pdf"
So... we need to explore how to compute this from our data, given knowledge of the Bayesian model for a problem.
9
Compare this Bayesian result to the classical result... for a given observed data vector x:
• Classical: work with p(x; A) ⇒ MVUE = x̄
• Bayesian: work with the posterior p(A|x) ⇒ MMSE = E[A|x]
Before taking any data... what is the best "estimate" of A?
• Classical: No best guess exists!
• Bayesian: The mean of the prior PDF...
  – the observed data "updates" this "a priori" estimate into an "a posteriori" estimate that balances "prior" vs. data
10
So... for this example we've seen that we need E[A|x]. How do we compute that!!!?? Well...
Â = E[ A | x ] = ∫ A p(A|x) dA
So... we need the posterior pdf of A given the data... which can be found using Bayes' Rule (it allows us to write one conditional PDF in terms of the other way around):
p(A|x) = p(x|A) p(A) / p(x) = p(x|A) p(A) / ∫ p(x|A) p(A) dA
p(A) is assumed known; p(x|A) is more easily found than p(A|x) — it has very much the same structure as the parameterized PDF used in Classical Methods.
11
So now we need p(x|A)... For x[n] = A + w[n] we know that
p_{x[n]}( x[n] | A ) = p_w( x[n] − A ) = ( 1/√(2πσ²) ) exp[ −( x[n] − A )² / (2σ²) ]
(for A known, x[n] is the known A plus the random w[n]; uses the fact that w[n] and A are assumed independent)
Because w[n] is white Gaussian the samples are independent... thus the data conditioned on A are independent:
p(x|A) = ( 1/(2πσ²)^{N/2} ) exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} ( x[n] − A )² ]
Same structure as the parameterized PDF used in Classical Methods... but here A is an RV upon which we have conditioned the PDF!!!
12
Now we can use all this to find the MMSE estimate for this problem (using Bayes' Rule, the prior PDF, and the parameter-conditioned PDF):
Â = E[A|x] = ∫ A p(A|x) dA = ∫ A p(x|A) p(A) dA / ∫ p(x|A) p(A) dA
           = [ ∫_{−A_o}^{A_o} A exp( −(1/(2σ²)) Σ_{n=0}^{N−1} ( x[n] − A )² ) dA ] / [ ∫_{−A_o}^{A_o} exp( −(1/(2σ²)) Σ_{n=0}^{N−1} ( x[n] − A )² ) dA ]
(the 1/(2A_o) and (2πσ²)^{−N/2} factors cancel between numerator and denominator)
MMSE Estimator: a function that maps the observed data into the estimate... there is no closed form for this case!!! (The idea is easy; the estimator is hard to build.)
13
How the Bayesian approach balances a priori and a posteriori info:
• No data: the "posterior" is just the prior p(A), with mean E[A].
• Short data record: p(A|x) narrows and its mean E[A|x] moves between the prior mean and x̄.
• Long data record: p(A|x) concentrates near the sample mean, so E[A|x] ≈ x̄.
14
General Insights From the Example
1. After collecting data: our knowledge is captured by the posterior PDF p(θ|x)
2. The estimator that minimizes the Bmse is E[θ|x]... the mean of the posterior PDF
3. The choice of prior is crucial: a bad assumption for the prior ⇒ a bad Bayesian estimate! (Especially for short data records)
4. The Bayesian MMSE estimator always exists! But not necessarily in closed form (then one must use numerical integration)
15
10.4 Choosing a Prior PDF
The choice is crucial:
1. Must be able to justify it physically
2. Anything other than a Gaussian prior will likely result in no closed-form estimates
We just saw that a uniform prior led to a non-closed form. We'll see here an example where a Gaussian prior gives a closed form.
So... there seems to be a trade-off between:
• Choosing the prior PDF as accurately as possible
• Choosing the prior PDF to give a computable closed form
16
Ex. 10.1: DC in WGN with Gaussian Prior PDF
We assume our Bayesian model is now: x[n] = A + w[n] with AWGN w[n] ~ N(0, σ²) and a prior PDF of
A ~ N( µ_A, σ_A² )
So... for a given value of the RV A the conditional PDF is
p(x|A) = ( 1/(2πσ²)^{N/2} ) exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} ( x[n] − A )² ]
Then to get the needed conditional PDF we use this and the a priori PDF for A in Bayes' Theorem:
p(A|x) = p(x|A) p(A) / ∫ p(x|A) p(A) dA
17
Then... after much algebra and gnashing of teeth (see the book) we get:
p(A|x) = ( 1/√(2πσ²_{A|x}) ) exp[ −( A − µ_{A|x} )² / ( 2σ²_{A|x} ) ]
which is a Gaussian PDF with
µ_{A|x} = σ²_{A|x} [ (N/σ²) x̄ + (1/σ_A²) µ_A ]      (a weighted combination of the a priori and sample means)
1/σ²_{A|x} = N/σ² + 1/σ_A²                            (a "parallel" combination of the a priori and sample variances)
So... the main point here so far is that by assuming:
• Gaussian noise
• a Gaussian a priori PDF on the parameter
we get a Gaussian a posteriori PDF for Bayesian estimation!!
18
Now recall that the Bayesian MMSE was the conditional a posteriori mean: Â = E[A|x].
Because we now have a Gaussian a posteriori PDF it is easy to find an expression for this:
Â = E[A|x] = µ_{A|x} = σ²_{A|x} [ (N/σ²) x̄ + (1/σ_A²) µ_A ]
After some algebra we get:
Â = α x̄ + ( 1 − α ) µ_A,   with   α = σ_A² / ( σ_A² + σ²/N ),   0 < α < 1
Easily computable estimator:
• the sample mean x̄ is computed from the data
• σ² is known from the data model
• µ_A and σ_A² are known from the prior model
var(Â) = var(A|x) = σ²_{A|x} = 1 / ( 1/σ_A² + N/σ² )
Little or Poor Data (σ²/N >> σ_A²):  Â ≈ µ_A
Much or Good Data  (σ²/N << σ_A²):  Â ≈ x̄
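A tiny numerical sketch of this closed-form MMSE estimator (prior parameters and noise level are arbitrary demo values):

```python
import numpy as np

rng = np.random.default_rng(7)
mu_A, var_A = 1.0, 0.25                       # prior: A ~ N(mu_A, var_A)
sigma2 = 1.0                                  # noise variance
A = rng.normal(mu_A, np.sqrt(var_A))          # "nature" draws the parameter
for N in (5, 500):
    x = A + np.sqrt(sigma2) * rng.standard_normal(N)
    alpha = var_A / (var_A + sigma2 / N)      # weight on the data
    A_mmse = alpha * x.mean() + (1 - alpha) * mu_A
    print(N, A, A_mmse)                       # small N leans on mu_A, large N leans on the sample mean
```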
19
Comments on this Example for Gaussian Noise and Gaussian Prior
1. Closed-form solution for the estimate!
2. The estimate is... a weighted sum of the prior mean & data mean
3. The weights balance between prior info quality and data quality
4. As N increases...
   a. The estimate E[A|x] moves: µ_A (at N = 0) → x̄ (for large N)
   b. The accuracy var(A|x) moves: σ_A² → σ²/N
20
Bmse for this Example:  Bmse(Â) = σ²_{A|x}
To see this:
Bmse(Â) = E[ ( A − Â )² ] = ∫∫ ( A − Â )² p(x, A) dA dx = ∫ [ ∫ ( A − E[A|x] )² p(A|x) dA ] p(x) dx = ∫ var(A|x) p(x) dx
General Result: Bmse = the posterior variance averaged over the PDF of x.
In this case σ²_{A|x} is not a function of x, so
Bmse(Â) = σ²_{A|x} ∫ p(x) dx = σ²_{A|x}
21
The big thing that this example shows:
Gaussian Data & Gaussian Prior gives Closed-Form MMSE SolutionThis will hold in general!
1
10.5 Properties of Gaussian PDF
To help us develop some general MMSE theory for the Gaussian Data / Gaussian Prior case, we need to have some solid results for joint and conditional Gaussian PDFs.
We'll consider the bivariate case, but the ideas carry over to the general N-dimensional case.
2
Bivariate Gaussian Joint PDF for 2 RVs X and Y
p(x, y) = ( 1 / (2π |C|^{1/2}) ) exp{ −½ [ x − µ_X ,  y − µ_Y ] C^{-1} [ x − µ_X ;  y − µ_Y ] }      (the exponent is a quadratic form)
with mean vector
E[ X ; Y ] = [ µ_X ; µ_Y ]
and covariance matrix
C = [ var(X)     cov(X,Y) ;  cov(X,Y)   var(Y) ] = [ σ_X²        ρ σ_X σ_Y ;  ρ σ_X σ_Y   σ_Y² ]
[Figure: surface plot and contour plot of p(x, y).]
3
Marginal PDFs of Bivariate Gaussian
What are the marginal (or individual) PDFs? We know that we can get them by integrating:
p(x) = ∫_{−∞}^{∞} p(x, y) dy        p(y) = ∫_{−∞}^{∞} p(x, y) dx
After performing these integrals you get that:
X ~ N( µ_X, var(X) )        Y ~ N( µ_Y, var(Y) )
[Figure: contour plot of p(x, y) with the marginals p(x) and p(y) shown along the axes.]
4
Comment on Jointly Gaussian
We have used the term "Jointly Gaussian".
Q: EXACTLY what does that mean?
A: That the RVs have a joint PDF that is Gaussian (the bivariate form given above).
We've shown that jointly Gaussian RVs also have Gaussian marginal PDFs.
Q: Does having Gaussian marginals imply jointly Gaussian? In other words, if X is Gaussian and Y is Gaussian, is it always true that X and Y are jointly Gaussian???
A: No!!!!! (Example for 2 RVs: see the Reading Notes on the counterexample posted on BB.)
5
We'll construct a counterexample: start with a zero-mean, uncorrelated 2-D joint Gaussian PDF
p_XY(x, y) = ( 1/(2π σ_X σ_Y) ) exp[ −½ ( x²/σ_X² + y²/σ_Y² ) ]
and modify it so it is no longer 2-D Gaussian but still has Gaussian marginals:
• Set it to 0 in the shaded (alternating) regions
• Double its value elsewhere
We get a 2-D PDF that is not a joint Gaussian, but the marginals are the same as the original!!!!
6
Conditional PDFs of Bivariate Gaussian
What are the conditional PDFs? If you know that X has taken the value X = x_o, how is Y distributed?
p(y|x_o) = p(x_o, y) / p(x_o) = p(x_o, y) / ∫_{−∞}^{∞} p(x_o, y) dy      (take the slice @ x_o, then normalize)
[Figure: contours of a bivariate Gaussian with C = [ 25  0.8·5·4 ; 0.8·5·4  16 ]; the conditional p(y|X = 5) is narrower than, and shifted relative to, the marginal p(y); the conditional means lie on a line of slope cov(X,Y)/var(X) = ρ σ_Y / σ_X.]
Note: Conditioning on a correlated RV shifts the mean and reduces the variance.
7
Theorem 10.1: Conditional PDF of Bivariate Gaussian
Let X and Y be random variables distributed jointly Gaussian with mean vector [ E[X]  E[Y] ]^T and covariance matrix
C = [ var(X)     cov(X,Y) ;  cov(X,Y)   var(Y) ] = [ σ_X²   σ_XY ;  σ_XY   σ_Y² ]
Then p(y|x) is also Gaussian with mean and variance given by:
E[ Y | X = x_o ] = E[Y] + ( σ_XY / σ_X² ) ( x_o − E[X] ) = E[Y] + ρ ( σ_Y / σ_X ) ( x_o − E[X] )      (σ_XY/σ_X² is the slope of the line)
var( Y | X = x_o ) = σ_Y² − σ_XY²/σ_X² = σ_Y² − ρ² σ_Y² = ( 1 − ρ² ) σ_Y²      (1 − ρ² is the reduction factor)
8
Impact on MMSE
We know the MMSE estimate of RV Y after observing the RV X = x_o is Ŷ = E[ Y | X = x_o ].
So using the ideas we have just seen: if the data and the parameter are jointly Gaussian, then
Ŷ_MMSE = E[ Y | X = x_o ] = E[Y] + ( σ_XY / σ_X² ) ( x_o − E[X] )
It is the correlation between the RVs X and Y that allows us to perform Bayesian estimation.
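A quick Monte Carlo check of Theorem 10.1 (my own illustration with arbitrary parameters): draw correlated Gaussian pairs and compare the empirical conditional mean and variance near X = x_o with the formulas.

```python
import numpy as np

rng = np.random.default_rng(8)
mu = np.array([1.0, -2.0])
C = np.array([[4.0, 2.4],
              [2.4, 9.0]])                      # var(X)=4, var(Y)=9, cov(X,Y)=2.4
xy = rng.multivariate_normal(mu, C, size=500_000)

xo = 2.0
mask = np.abs(xy[:, 0] - xo) < 0.05             # samples with X close to xo
emp_mean, emp_var = xy[mask, 1].mean(), xy[mask, 1].var()

th_mean = mu[1] + (C[0, 1] / C[0, 0]) * (xo - mu[0])
th_var = C[1, 1] - C[0, 1] ** 2 / C[0, 0]
print(emp_mean, th_mean, emp_var, th_var)       # empirical values approach the theoretical ones
```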
9
Theorem 10.2: Conditional PDF of Multivariate Gaussian
Let X (k×1) and Y (l×1) be random vectors distributed jointly Gaussian with mean vector [ E[X]^T  E[Y]^T ]^T and covariance matrix
C = [ C_XX (k×k)   C_XY (k×l) ;  C_YX (l×k)   C_YY (l×l) ]
Then p(y|x) is also Gaussian with mean vector and covariance matrix given by:
E[ Y | X = x_o ] = E[Y] + C_YX C_XX^{-1} ( x_o − E[X] )
C_{Y|X=x_o} = C_YY − C_YX C_XX^{-1} C_XY
Compare to the bivariate results:  E[Y|X=x_o] = E[Y] + ( σ_XY/σ_X² )( x_o − E[X] ),  var(Y|X=x_o) = σ_Y² − σ_XY²/σ_X².
For the Gaussian case the conditional covariance does not depend on the conditioning x-value!!!
10
10.6 Bayesian Linear Model
Now we have all the machinery we need to find the MMSE for the Bayesian Linear Model:
x = Hθ + w,   with x: N×1,  H: N×p known,  θ: p×1 ~ N( µ_θ, C_θ ),  w: N×1 ~ N( 0, C_w )
Clearly, x is Gaussian and θ is Gaussian. But are they jointly Gaussian??? If yes, then we can use Theorem 10.2 to get the MMSE for θ!!!
Answer = Yes!!
11
Bayesian Linear Model is Jointly Gaussian
θ and w are each Gaussian and are independent.
Thus their joint PDF is a product of Gaussians, which has the form of a jointly Gaussian PDF.
Can now use: a linear transform of jointly Gaussian is jointly Gaussian:
[ x ; θ ] = [ H  I ;  I  0 ] [ θ ; w ]      ⇒ jointly Gaussian
Thus, Thm. 10.2 applies! The posterior PDF is
• jointly Gaussian
• completely described by its mean and covariance
12
Conditional PDF for Bayesian Linear Model
To apply Theorem 10.2, notationally let X = x and Y = θ.
First we need:
E[X] = H E[θ] + E[w] = H µ_θ        E[Y] = E[θ] = µ_θ
And also C_YY = C_θ and
C_XX = E[ ( x − E[x] )( x − E[x] )^T ] = E[ ( H(θ − µ_θ) + w )( H(θ − µ_θ) + w )^T ] = H C_θ H^T + C_w
(the cross terms are zero because θ and w are independent)
13
Similarly,
C_YX = C_θx = E[ ( θ − µ_θ )( x − Hµ_θ )^T ] = E[ ( θ − µ_θ )( H(θ − µ_θ) + w )^T ] = C_θ H^T      (uses E[θ w^T] = 0 and E[µ_θ w^T] = 0)
Then Theorem 10.2 gives the conditional PDF's mean and covariance (and we know the conditional mean is the MMSE estimate):
Posterior Mean (Bayesian MMSE Estimator):
θ̂_MMSE = E[ θ | x ] = µ_θ + C_θ H^T ( H C_θ H^T + C_w )^{-1} ( x − H µ_θ )
           = a priori estimate + [cross-correlation C_θx] × [relative-quality weighting] × [data prediction error]
           (the update transformation maps the unpredictable part of the data into a parameter update)
Posterior Covariance:
C_{θ|x} = C_θ − C_θ H^T ( H C_θ H^T + C_w )^{-1} H C_θ
           = a priori covariance − reduction due to the data
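A minimal numeric sketch of these two formulas (dimensions and covariances invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(9)
N, p = 40, 3
H = rng.standard_normal((N, p))
mu_theta = np.zeros(p)
C_theta = np.diag([1.0, 0.5, 2.0])            # prior covariance (made up)
C_w = 0.2 * np.eye(N)                         # noise covariance (made up)

theta = rng.multivariate_normal(mu_theta, C_theta)
x = H @ theta + rng.multivariate_normal(np.zeros(N), C_w)

S = H @ C_theta @ H.T + C_w                   # H C_theta H^T + C_w
K = C_theta @ H.T @ np.linalg.inv(S)          # update transformation
theta_mmse = mu_theta + K @ (x - H @ mu_theta)
C_post = C_theta - K @ H @ C_theta            # posterior covariance
print(theta, theta_mmse, np.diag(C_post))
```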
14
Ex. 10.2: DC in AWGN w/ Gaussian Prior
Data Model: x[n] = A + w[n], with A ~ N( µ_A, σ_A² ) and w[n] ~ N( 0, σ² ) independent.
Write in linear model form:
x = 1 A + w      with H = 1 = [ 1 1 ... 1 ]^T
Now the general result gives the MMSE estimate as:
Â_MMSE = E[A|x] = µ_A + σ_A² 1^T ( σ_A² 1 1^T + σ² I )^{-1} ( x − 1 µ_A )
Can simplify using the Matrix Inversion Lemma.
15
Aside: Matrix Inversion Lemma
( A + B C D )^{-1} = A^{-1} − A^{-1} B ( C^{-1} + D A^{-1} B )^{-1} D A^{-1}      (A: n×n, B: n×m, C: m×m, D: m×n)
Special case (m = 1, with u an n×1 vector):
( A + u u^T )^{-1} = A^{-1} − ( A^{-1} u u^T A^{-1} ) / ( 1 + u^T A^{-1} u )
16
Continuing the Example
Apply the Matrix Inversion Lemma, pass 1^T through, use 1^T 1 = N, factor out, and manipulate algebraically to get:
Â_MMSE = µ_A + [ σ_A² / ( σ_A² + σ²/N ) ] ( x̄ − µ_A )
           = a priori estimate + gain factor × error between the data-only estimate and the prior-only estimate
When the data is bad (σ²/N >> σ_A²): the gain is small, the data has little use, and Â_MMSE ≈ µ_A.
When the data is good (σ²/N << σ_A²): the gain is large, the data has large use, and Â_MMSE ≈ x̄.
17
Using similar manipulations gives:
var(A|x) = σ_A² ( σ²/N ) / ( σ_A² + σ²/N ) = 1 / ( 1/σ_A² + N/σ² )
Like parallel resistors, the small one wins! ⇒ var(A|x) is ≈ the smaller of: the data estimate variance (σ²/N) and the prior variance (σ_A²).
Or looking at it another way:
1/var(A|x) = 1/σ_A² + N/σ²      — additive information!
18
10.7 Nuisance Parameters
One difficulty in classical methods is that nuisance parameters must be explicitly dealt with. In Bayesian methods they are simply integrated away!!!!
Recall Emitter Location: the parameter vector is [x y z f_0], and f_0 is a nuisance parameter.
In the Bayesian approach, from p(x, y, z, f_0 | x) we can get p(x, y, z | x):
p(x, y, z | x) = ∫ p(x, y, z, f_0 | x) df_0
Then find the conditional mean for the MMSE estimate!
Then find conditional mean for the MMSE estimate!
1
Ch. 11 General Bayesian Estimators
2
Introduction
In Chapter 10 we:
• introduced the idea of "a priori" information on θ ⇒ use the "prior" pdf p(θ)
• defined a new optimality criterion ⇒ the Bayesian MSE
• showed the Bmse is minimized by E[θ|x], called the "mean of the posterior pdf" or the "conditional mean"
In Chapter 11 we will:
• define a more general optimality criterion
  ⇒ leads to several different Bayesian approaches
  ⇒ includes the Bmse as a special case
Why? Provides flexibility in balancing: the model, the performance, and the computations.
3
11.3 Risk Functions
Previously we used the Bmse as the Bayesian measure to minimize:
Bmse(θ̂) = E[ ( θ − θ̂ )² ]      (E w.r.t. p(x, θ); define the error ε ≜ θ − θ̂)
So the Bmse is... the expected value of the square of the error.
Let's write this in a way that will allow us to generalize it. Define a quadratic Cost Function:
C(ε) = ε² = ( θ − θ̂ )²
Then we have that Bmse = E[ C(ε) ].
Why limit the cost function to just quadratic?
4
General Bayesian Criteria
1. Define a cost function C(ε)
2. Define the Bayes Risk: R = E[ C(ε) ] w.r.t. p(x, θ), i.e.  R(θ̂) = E[ C( θ − θ̂ ) ]      (depends on the choice of estimator)
3. Minimize the Bayes Risk w.r.t. the estimate θ̂
The choice of the cost function can be tailored to:
• Express the importance of avoiding certain kinds of errors
• Yield desirable forms for estimates – e.g., easily computed
• Etc.
5
Three Common Cost Functions
1. Quadratic:   C(ε) = ε²
2. Absolute:    C(ε) = |ε|
3. Hit-or-Miss: C(ε) = 0 for |ε| < δ,  C(ε) = 1 for |ε| ≥ δ      (δ > 0 and small)
6
General Bayesian Estimators
Derive how to choose the estimator to minimize the chosen risk:
R(θ̂) = E[ C( θ − θ̂ ) ] = ∫∫ C( θ − θ̂ ) p(x, θ) dθ dx
      = ∫ [ ∫ C( θ − θ̂ ) p(θ|x) dθ ] p(x) dx      (using p(x, θ) = p(θ|x) p(x))
      ≜ ∫ g(θ̂) p(x) dx
We must minimize the inner integral g(θ̂) for each x value.
So... for a given desired cost function... you have to find the form of the optimal estimator.
7
The Optimal Estimates for the Typical Costs
1. Quadratic:   R(θ̂) = E[ ( θ − θ̂ )² ] = Bmse(θ̂)  ⇒  θ̂ = E[θ|x] = mean of p(θ|x)      (as we saw in Ch. 10)
2. Absolute:    R(θ̂) = E[ |θ − θ̂| ]                ⇒  θ̂ = median of p(θ|x)
3. Hit-or-Miss:                                     ⇒  θ̂ = mode of p(θ|x)      ("Maximum A Posteriori" or MAP)
If p(θ|x) is unimodal & symmetric, then mean = median = mode.
8
Derivation for Absolute Cost Function
Writing out the function to be minimized gives:
g(θ̂) = ∫_{−∞}^{∞} | θ − θ̂ | p(θ|x) dθ
      = ∫_{−∞}^{θ̂} ( θ̂ − θ ) p(θ|x) dθ + ∫_{θ̂}^{∞} ( θ − θ̂ ) p(θ|x) dθ      (split into the regions where |θ − θ̂| = θ̂ − θ and where |θ − θ̂| = θ − θ̂)
Now set ∂g(θ̂)/∂θ̂ = 0 and use Leibniz's rule for ∂/∂u ∫_{φ_1(u)}^{φ_2(u)} h(u, v) dv:
⇒ ∫_{−∞}^{θ̂} p(θ|x) dθ − ∫_{θ̂}^{∞} p(θ|x) dθ = 0
which is satisfied if (the area to the left of θ̂) = (the area to the right of θ̂) ⇒ the median of the conditional PDF.
9
Derivation for Hit-or-Miss Cost Function
Writing out the function to be minimized gives:

g(θ̂) = ∫ C(θ − θ̂) p(θ|x) dθ
      = ∫_{−∞}^{θ̂−δ} 1 · p(θ|x) dθ + ∫_{θ̂+δ}^{∞} 1 · p(θ|x) dθ
      = 1 − ∫_{θ̂−δ}^{θ̂+δ} p(θ|x) dθ

(Almost all the probability: g = 1 − what is left out.) To minimize g(θ̂), maximize the integral ∫_{θ̂−δ}^{θ̂+δ} p(θ|x) dθ.
So… center the integral around the peak of the integrand ⇒ Mode of the conditional PDF
10
11.4 MMSE Estimators
We’ve already seen the solution for the scalar parameter case:

θ̂ = E{θ|x} = mean of p(θ|x)
Here we’ll look at:• Extension to the vector parameter case• Analysis of Useful Properties
11
Vector MMSE Estimator
The criterion is… minimize the MSE for each component.
Vector Parameter: θ = [θ1 θ2 … θp]^T
Vector Estimate: θ̂ = [θ̂1 θ̂2 … θ̂p]^T
Each θ̂i is chosen to minimize its own MSE element:

E{ (θi − θ̂i)² } = ∫∫ (θi − θ̂i)² p(x, θi) dθi dx

where p(x, θi) = ∫…∫ p(x, θ) dθ1 … dθ_{i−1} dθ_{i+1} … dθp  is p(x, θ) integrated over all the other θj’s.
From the scalar case we know the solution is:  θ̂i = E{ θi | x }
12
So… putting all these into a vector gives:

θ̂ = [θ̂1 θ̂2 … θ̂p]^T = [ E{θ1|x}  E{θ2|x}  …  E{θp|x} ]^T = E{θ|x}

Vector MMSE Estimate:  θ̂ = E{θ|x}  =  Vector Conditional Mean

Similarly…  Bmse(θ̂i) = ∫ [C_{θ|x}]_{ii} p(x) dx,   i = 1, …, p
where  C_{θ|x} = E{ [θ − E{θ|x}][θ − E{θ|x}]^T | x }
13
Ex. 11.1 Bayesian Fourier Analysis
Signal model is: x[n] = a cos(2π f_o n) + b sin(2π f_o n) + w[n],  with w[n] AWGN, zero mean, variance σ².
θ = [a  b]^T ~ N(0, σ_θ² I), and θ and w[n] are independent for each n.
This is a common propagation model called Rayleigh fading.
Write in matrix form: x = Hθ + w  (Bayesian Linear Model), where the columns of H are the cosine and sine vectors.
14
Results from Ch. 10 show that

θ̂ = E{θ|x} = σ_θ² H^T ( σ_θ² H H^T + σ² I )^{-1} x = ( (1/σ_θ²) I + (1/σ²) H^T H )^{-1} (1/σ²) H^T x

For f_o chosen such that H has orthogonal columns, then

θ̂ = E{θ|x} = β [ â  b̂ ]^T,   with   β = 1 / ( 1 + 2σ²/(N σ_θ²) )

â = (2/N) Σ_{n=0}^{N−1} x[n] cos(2π f_o n),     b̂ = (2/N) Σ_{n=0}^{N−1} x[n] sin(2π f_o n)

The bracketed terms â, b̂ are the classical Fourier coefficients.
Recall: same form as the classical result, except there β = 1.
Note: β ≈ 1 if σ_θ² >> 2σ²/N
⇒ if prior knowledge is poor, this degrades to the classical result
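A minimal sketch of this shrinkage (parameter values are assumed for illustration): the Bayesian estimate is just the classical Fourier coefficients scaled by β.

```python
import numpy as np

# Sketch: Bayesian Fourier estimate = beta * (classical Fourier coefficients).
N, fo = 64, 0.1
sigma2, sigma_theta2 = 1.0, 2.0                       # assumed noise / prior variances
rng = np.random.default_rng(1)
a_true, b_true = rng.normal(0, np.sqrt(sigma_theta2), 2)
n = np.arange(N)
x = a_true*np.cos(2*np.pi*fo*n) + b_true*np.sin(2*np.pi*fo*n) \
    + rng.normal(0, np.sqrt(sigma2), N)

a_hat = (2.0/N) * np.sum(x*np.cos(2*np.pi*fo*n))      # classical coefficient
b_hat = (2.0/N) * np.sum(x*np.sin(2*np.pi*fo*n))
beta = 1.0 / (1.0 + 2.0*sigma2/(N*sigma_theta2))      # shrinkage toward the zero prior mean
theta_mmse = beta * np.array([a_hat, b_hat])
print(a_true, b_true, theta_mmse)
```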
15
Impact of Poor Prior KnowledgeConclusion: For poor prior knowledge in Bayesian Linear Model
MMSE Est. → MVU Est.
Can see this holds in general. Recall that

θ̂ = E{θ|x} = µ_θ + ( C_θ^{-1} + H^T C_w^{-1} H )^{-1} H^T C_w^{-1} ( x − Hµ_θ )

For no prior information: C_θ^{-1} → 0 and µ_θ → 0, so

θ̂ → ( H^T C_w^{-1} H )^{-1} H^T C_w^{-1} x      — the MVUE for the General Linear Model
16
Useful Properties of MMSE Est.
1. Commutes over affine mappings: if α = Aθ + b, then α̂ = Aθ̂ + b.

2. Additive property for independent data sets. Assume θ, x1, x2 are jointly Gaussian with x1 and x2 independent. Then

θ̂ = E{θ} + C_{θx1} C_{x1x1}^{-1} ( x1 − E{x1} ) + C_{θx2} C_{x2x2}^{-1} ( x2 − E{x2} )
      (a priori estimate)     (update due to x1)           (update due to x2)

Proof: Let x = [x1^T x2^T]^T. The jointly Gaussian assumption gives θ̂ = E{θ} + C_{θx} C_{xx}^{-1} (x − E{x}); independence ⇒ C_{xx} is block diagonal with blocks C_{x1x1} and C_{x2x2}. Simplify to get the result.  (Will be used for the Kalman Filter.)

3. Jointly Gaussian case leads to a linear estimator: θ̂ = Px + m
1
11.5 MAP Estimator
Recall that the “hit-or-miss” cost function gave the MAP estimator… it maximizes the a posteriori PDF.
Q: Given that the MMSE estimator is “the most natural” one…why would we consider the MAP estimator?
A: If x and θ are not jointly Gaussian, the form for MMSE estimate requires integration to find the conditional mean.
MAP avoids this Computational Problem!Note: MAP doesn’t require this integration
Trade “natural criterion” vs. “computational ease”
What else do you gain? More flexibility to choose the prior PDF
2
Notation and Form for MAP

θ̂_MAP = arg max_θ p(θ|x)        (θ̂_MAP maximizes the posterior PDF)
“arg max” extracts the value of θ that causes the maximum.

Equivalent form (via Bayes’ Rule):  θ̂_MAP = arg max_θ [ p(x|θ) p(θ) ]
Proof: Use p(θ|x) = p(x|θ) p(θ) / p(x), so
θ̂_MAP = arg max_θ p(x|θ) p(θ) / p(x) = arg max_θ [ p(x|θ) p(θ) ]    since p(x) does not depend on θ.
3
Vector MAP   < Not as straightforward as the vector extension for MMSE >
The obvious extension leads to problems:
Choose θ̂i to minimize R = E{ C(θi − θ̂i) }    (expectation over p(x, θi))
⇒ θ̂i = arg max_{θi} p(θi|x),  a 1-D marginal conditioned on x:
p(θi|x) = ∫…∫ p(θ|x) dθ1 … dθ_{i−1} dθ_{i+1} … dθp     — need to integrate to get it!!
Problem: the whole point of MAP was to avoid doing the integration needed in MMSE!!!
Is there a way around this? Can we find an integration-free vector MAP?
4
Circular Hit-or-Miss Cost Function   (Not in Book)
First look at the p-dimensional cost function for this “troubling” version of a vector MAP.
It consists of p individual applications of the 1-D “hit-or-miss”:
C(ε1, ε2) = 0 if (ε1, ε2) is in the square [−δ, δ] × [−δ, δ], and 1 if not in the square.
The corners of the square “let too much in” ⇒ use a circle!
C(ε) = 0 if ||ε|| < δ, and 1 if ||ε|| ≥ δ
This actually seems more natural than the “square” cost function!!!
5
MAP Estimate using Circular Hit-or-Miss Back to Book
So… what vector Bayesian estimator comes from using this circular hit-or-miss cost function?
Can show that it is the following “Vector MAP”
)|(maxargˆ xθθθ
pMAP = Does Not Require Integration!!!
That is… find the maximum of the joint conditional PDF
in all θi conditioned on x
6
How Do These Vector MAP Versions Compare?
In general: they are NOT the same!!
Example: p = 2.  (Figure: p(θ1, θ2 | x) is piecewise constant over rectangles in the (θ1, θ2) plane, with heights 1/6, 1/3, 1/6; θ1 runs over roughly 1–5 and θ2 over 0–2.)
The vector MAP using the circular hit-or-miss is:  θ̂ = [2.5  0.5]^T
To find the vector MAP using the element-wise maximization, form the marginals p(θ1|x) (heights 1/6, 1/3) and p(θ2|x) (heights 1/3, 2/3), which gives:  θ̂ = [2.5  1.5]^T
7
“Bayesian MLE”
Recall… As we keep getting good data, p(θ|x) becomes more concentrated as a function of θ. But… since

θ̂_MAP = arg max_θ p(θ|x) = arg max_θ [ p(x|θ) p(θ) ]

… p(x|θ) should also become more concentrated as a function of θ.
(Sketch: p(x|θ) p(θ) vs. θ, with the prior nearly flat where p(x|θ) is non-zero.)
• Note that the prior PDF is nearly constant where p(x|θ) is non-zero
• This becomes truer as N → ∞, and p(x|θ) gets more concentrated

arg max_θ [ p(x|θ) p(θ) ] ≈ arg max_θ p(x|θ)
        (MAP)                    (“Bayesian MLE”)

The “Bayesian MLE” uses the conditional PDF p(x|θ) rather than the classically parameterized PDF.
8
11.6 Performance Characterization
The performance of Bayesian estimators is characterized by looking at the estimation error ε = θ − θ̂, which is random both through θ (the a priori PDF) and through x.
Performance characterized by error’s PDF p(ε)We’ll focus on Mean and Variance
If ε is Gaussian then these tell the whole storyThis will be the case for the Bayesian Linear Model (see Thm. 10.3)
We’ll also concentrate on the MMSE Estimator
9
Performance of Scalar MMSE Estimator
The estimator is:  θ̂ = E{θ|x} = ∫ θ p(θ|x) dθ       — a function of x.
So the estimation error is:  ε = θ − E{θ|x} = f(x, θ)  — a function of two RVs.
General results for a function of two RVs, Z = f(X, Y):

E{Z} = ∫∫ f(x, y) p_XY(x, y) dx dy
var(Z) = E{ (Z − E{Z})² } = ∫∫ ( f(x, y) − E{Z} )² p_XY(x, y) dx dy
10
So… applying the mean result gives (evaluated as seen below):

E{ε} = E_{x,θ}{ θ − E{θ|x} }
     = E_x{ E_{θ|x}{ θ − E{θ|x} } }                 (decomposing the joint expectation)
     = E_x{ E_{θ|x}{θ} − E_{θ|x}{ E{θ|x} } }         (pass E_{θ|x} through the terms)
     = E_x{ E{θ|x} − E{θ|x} }                        (E{θ|x} does not depend on θ, so E_{θ|x}{ E{θ|x} } = E{θ|x})
     = 0

(See the chart on “Decomposing Joint Expectations” in “Notes on 2 RVs”. E_{θ|x}{·} and E{·|x} are two notations for the same thing: ∫ (·) p(θ|x) dθ.)

E{ε} = 0,  i.e., the mean of the estimation error (over data & parameter) is zero!!!!
11
And… applying the variance result gives (using E{ε} = 0):

var(ε) = E{ (ε − E{ε})² } = E{ε²} = E_{x,θ}{ (θ − θ̂)² } = ∫∫ (θ − θ̂)² p(x, θ) dx dθ = Bmse(θ̂)

So… the MMSE estimation error has:
• mean = 0
• var = Bmse
So… when we minimize Bmse we are minimizing the variance of the estimate.
If ε is Gaussian then ε ~ N( 0, Bmse(θ̂) ).
12
Ex. 11.6: DC Level in WGN w/ Gaussian Prior
We saw that

Â = ( σ_A² / (σ_A² + σ²/N) ) x̄ + ( (σ²/N) / (σ_A² + σ²/N) ) µ_A      (a constant times x̄ plus a constant)
Bmse(Â) = 1 / ( 1/σ_A² + 1/(σ²/N) )

So… Â is Gaussian, because it is a linear combination of the jointly Gaussian data samples (if X is Gaussian then Y = aX + b is also Gaussian), and

ε ~ N( 0, 1 / ( 1/σ_A² + 1/(σ²/N) ) )

Note: as N gets large this PDF collapses around 0. This estimate is “consistent in the Bayesian sense”.
Bayesian Consistency: for large N, Â ≈ A  (regardless of the realization of A!)
13
Performance of Vector MMSE Estimator
Vector estimation error: ε = θ − θ̂. The mean result is obvious: E{ε} = 0.
Must extend the variance result. Some new notation… the “Bayesian Mean Square Error Matrix”:

M_θ̂ ≜ E_{x,θ}{ ε ε^T } = C_ε = cov(ε)

Look some more at this (general vector results; see the chart on “Decomposing Joint Expectations”):

M_θ̂ = E_{x,θ}{ [θ − E{θ|x}][θ − E{θ|x}]^T }
     = E_x{ E_{θ|x}{ [θ − E{θ|x}][θ − E{θ|x}]^T } }
     = E_x{ C_{θ|x} }

E{ε} = 0,   C_ε = M_θ̂ = E_x{ C_{θ|x} }       (in general C_{θ|x} is a function of x)
14
The diagonal elements of M_θ̂ are the Bmse’s of the estimates.
(Why do we call the error covariance the “Bayesian MSE matrix”?) To see this:

[M_θ̂]_{ii} = E_x{ E_{θ|x}{ [ε ε^T]_{ii} } }
            = ∫ [ ∫⋯∫ (θi − θ̂i)² p(θ|x) dθ ] p(x) dx         (integrating over all the other parameters… “marginalizing” the PDF)
            = ∫∫ (θi − θ̂i)² p(x, θi) dθi dx
            = Bmse(θ̂i)
15
Perf. of MMSE Est. for Jointly Gaussian Case
Let the data vector x and the parameter vector θ be jointly Gaussian.
Nothing new to say about the mean result: E{ε} = 0.
Now… look at the error covariance (i.e., the Bayesian MSE matrix). Recall the general result: C_ε = M_θ̂ = E_x{ C_{θ|x} }.
Thm 10.2 says that for jointly Gaussian vectors, C_{θ|x} does NOT depend on x, so

C_ε = M_θ̂ = E_x{ C_{θ|x} } = C_{θ|x}

Thm 10.2 also gives the form as:

C_ε = M_θ̂ = C_{θ|x} = C_θθ − C_θx C_xx^{-1} C_xθ
16
Perf. of MMSE Est. for Bayesian Linear Model
Recall the model: x = Hθ + w,  with θ ~ N(µ_θ, C_θ) and w ~ N(0, C_w).
Nothing new to say about the mean result: E{ε} = 0.
Now… for the error covariance… this is nothing more than a special case of the jointly Gaussian case we just saw:

C_ε = M_θ̂ = C_{θ|x} = C_θθ − C_θx C_xx^{-1} C_xθ        (result for the jointly Gaussian case)

with the evaluations for the Bayesian Linear Model:  C_θx = C_θ H^T,   C_xx = H C_θ H^T + C_w.  So

C_ε = M_θ̂ = C_{θ|x} = C_θ − C_θ H^T ( H C_θ H^T + C_w )^{-1} H C_θ = ( C_θ^{-1} + H^T C_w^{-1} H )^{-1}

(Alternate form… see (10.33).)
17
Summary of MMSE Est. Error Results
1. For all cases: the estimation error is zero mean, E{ε} = 0.
2. Error covariance for three “nested” cases:

General case:        C_ε = M_θ̂ = E_x{ C_{θ|x} },   with  Bmse(θ̂i) = [M_θ̂]_{ii}
Jointly Gaussian:    C_ε = M_θ̂ = C_{θ|x} = C_θθ − C_θx C_xx^{-1} C_xθ
Bayesian Linear Model (jointly Gaussian & linear observation):
                     C_ε = M_θ̂ = C_{θ|x} = C_θ − C_θ H^T ( H C_θ H^T + C_w )^{-1} H C_θ = ( C_θ^{-1} + H^T C_w^{-1} H )^{-1}
18
Main Bayesian Approaches

• MMSE (“squared” cost function; in general a nonlinear estimate):
  Estimate: θ̂ = E{θ|x};  Err. Cov.: M_θ̂ = E_x{ C_{θ|x} }
  Hard to implement… numerical integration.

• MAP (“hit-or-miss” cost function):
  Estimate: θ̂ = arg max_θ p(θ|x)
  Easy to implement… performance analysis is challenging.

• Jointly Gaussian x and θ (yields a linear estimate):
  Estimate: θ̂ = E{θ} + C_θx C_xx^{-1} ( x − E{x} );  Err. Cov.: M_θ̂ = C_θθ − C_θx C_xx^{-1} C_xθ
  Easier to implement… determining C_θx can be hard.

• Bayesian Linear Model (yields a linear estimate):
  Estimate: θ̂ = µ_θ + C_θ H^T ( H C_θ H^T + C_w )^{-1} ( x − Hµ_θ )
  Err. Cov.: M_θ̂ = C_θ − C_θ H^T ( H C_θ H^T + C_w )^{-1} H C_θ
  “Easy” to implement… only need an accurate model: C_θ, C_w, H.
19
11.7 Example: Bayesian Deconvolution
This example shows the power of Bayesian approaches over classical methods in signal estimation problems (i.e., estimating the signal rather than some parameters).
(Block diagram: s(t) → h(t) → Σ ← w(t), output x(t). s(t) is modeled as a zero-mean WSS Gaussian process with known ACF R_s(τ); h(t) is assumed known; w(t) is Gaussian bandlimited white noise with known variance; the measured data are samples of x(t).)
Goal: observe x(t) & estimate s(t). Note: at the output, s(t) is smeared & noisy.
So… model as a D-T system.
20
Sampled-Data Formulation

x = H s + w

where x = [x[0] x[1] … x[N−1]]^T is the measured data vector, H is the known observation matrix built from the impulse-response samples (lower triangular, with h[0] on the diagonal and h[1], h[2], … filling in below — a convolution matrix), s = [s[0] s[1] … s[n_s]]^T is the signal vector to estimate, and w is AWGN with C_w = σ² I.

We have modeled s(t) as a zero-mean WSS process with known ACF… so s[n] is a D-T WSS process with known ACF R_s[m]… so the vector s has a known covariance matrix (Toeplitz & symmetric) given by

C_s = [ R_s[0]    R_s[1]    R_s[2]   …   R_s[n_s] ]
      [ R_s[1]    R_s[0]    R_s[1]   …            ]
      [ R_s[2]    R_s[1]    R_s[0]   …            ]
      [   ⋮                           ⋱           ]
      [ R_s[n_s]    …                    R_s[0]   ]

The model for the prior PDF is then s ~ N(0, C_s), with s and w independent.
21
MMSE Solution for Deconvolution
We have the case of the Bayesian Linear Model… so:

ŝ = C_s H^T ( H C_s H^T + σ² I )^{-1} x

Note that this is a linear estimate; the matrix applied to x is called “the Wiener filter”.
The performance of the filter is characterized by:

C_ε = M_ŝ = ( C_s^{-1} + H^T H / σ² )^{-1}
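A minimal sketch of this matrix Wiener filter (the impulse response, signal ACF, and sizes below are assumed toy values, not the book’s example):

```python
import numpy as np

# Sketch: Bayesian deconvolution s_hat = Cs H^T (H Cs H^T + sigma2 I)^{-1} x.
rng = np.random.default_rng(2)
N, sigma2 = 50, 0.1
h = np.array([1.0, 0.6, 0.3])                       # assumed known impulse response
H = np.zeros((N, N))
for i in range(N):
    for k in range(len(h)):
        if i - k >= 0:
            H[i, i - k] = h[k]                      # lower-triangular convolution matrix
a = 0.9
Cs = np.array([[a**abs(i - j) for j in range(N)] for i in range(N)])  # Toeplitz signal ACF
s = rng.multivariate_normal(np.zeros(N), Cs)        # draw a signal from the prior
x = H @ s + rng.normal(0, np.sqrt(sigma2), N)       # smeared + noisy observation

W = Cs @ H.T @ np.linalg.inv(H @ Cs @ H.T + sigma2*np.eye(N))  # Wiener filter matrix
s_hat = W @ x
print(np.mean((s - s_hat)**2))
```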
22
Sub-Example: No Inverse Filtering, Noise Only
Direct observation of s with H = I… x = s + w.
Goal: observe x(t) & “de-noise” s(t). Note: at the output, s(t) with noise.

ŝ = C_s ( C_s + σ² I )^{-1} x          C_ε = M_ŝ = ( C_s^{-1} + I/σ² )^{-1}

Note the dimensionality problem… # of “parms” = # of observations. Classical methods fail (they give ŝ = x)… Bayesian methods can solve it!!
For insight… consider the “single sample” case:

ŝ[0] = ( R_s[0] / (R_s[0] + σ²) ) x[0] = ( SNR / (1 + SNR) ) x[0],    with SNR = R_s[0]/σ²

High SNR (data driven): ŝ[0] ≈ x[0];   Low SNR (prior-PDF driven): ŝ[0] ≈ 0.
23
Sub-Sub-Example: Specific Signal Model
Direct observation of s with H = I… x = s + w. But here… the signal follows a specific random signal model:

s[n] = −a1 s[n−1] + u[n],    u[n] a white Gaussian “driving process”

This is a 1st-order “auto-regressive” model: AR(1). Such a random signal has an ACF & PSD of

R_s[k] = ( σ_u² / (1 − a1²) ) (−a1)^{|k|}          P_s(f) = σ_u² / | 1 + a1 e^{−j2πf} |²

See Figures 11.9 & 11.10 in the textbook.
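A small sketch evaluating that ACF and PSD for an assumed AR(1) coefficient and driving-noise variance (illustrative values only):

```python
import numpy as np

# Sketch: theoretical ACF and PSD of s[n] = -a1*s[n-1] + u[n] (assumed a1, sigma_u2).
a1, sigma_u2 = -0.95, 1.0
k = np.arange(0, 20)
R_s = sigma_u2 / (1 - a1**2) * (-a1)**np.abs(k)           # ACF R_s[k]
f = np.linspace(-0.5, 0.5, 512)
P_s = sigma_u2 / np.abs(1 + a1*np.exp(-2j*np.pi*f))**2    # PSD P_s(f)
print(R_s[:5], P_s.max())
```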
1
Ch. 12 Linear Bayesian Estimators
2
Introduction
In Chapter 11 we saw:
the MMSE estimator takes a simple form when x and θ are jointly Gaussian – it is linear and uses only the 1st and 2nd order moments (means and covariances).
Without the Gaussian assumption, the General MMSE estimator requires integrations to implement – undesirable!
So what to do if we can’t “assume Gaussian” but want MMSE?
Keep the MMSE criteria
But…restrict the form of the estimator to be LINEAR
⇒ “LMMSE Estimator” Something similar to
BLUE!
LMMSE Estimator = Wiener Filter
3
Bayesian Approaches

• MAP: “hit-or-miss” cost function.
• MMSE: “squared” cost function (in general a nonlinear estimate):  Estimate: θ̂ = E{θ|x};  Err. Cov.: M_θ̂ = E_x{ C_{θ|x} }
• Other cost functions.
• LMMSE: force a linear estimate; only E{θ}, E{x}, and the covariances C need be known.

Jointly Gaussian x and θ (yields a linear estimate):
  Estimate: θ̂ = E{θ} + C_θx C_xx^{-1} ( x − E{x} );   Err. Cov.: M_θ̂ = C_θθ − C_θx C_xx^{-1} C_xθ
LMMSE (any PDF, known 1st/2nd moments):
  Estimate: θ̂ = E{θ} + C_θx C_xx^{-1} ( x − E{x} );   Err. Cov.: M_θ̂ = C_θθ − C_θx C_xx^{-1} C_xθ      — the Same!
Bayesian Linear Model (yields a linear estimate):
  Estimate: θ̂ = µ_θ + C_θ H^T ( H C_θ H^T + C_w )^{-1} ( x − Hµ_θ )
  Err. Cov.: M_θ̂ = C_θ − C_θ H^T ( H C_θ H^T + C_w )^{-1} H C_θ
4
12.3 Linear MMSE Estimator Solution
Scalar Parameter Case:
Estimate: θ, a random variable realization
Given: data vector x = [x[0] x[1] … x[N−1]]^T
Assume:– Joint PDF p(x, θ) is unknown– But…its 1st two moments are known– There is some statistical dependence between x and θ
• E.g., Could estimate θ = salary using x = 10 past years’ taxes owed• E.g., Can’t estimate θ = salary using x = 10 past years’ number of Christmas
cards sent
Goal: Make the best possible estimate while using an affine form for the estimator:

θ̂ = Σ_{n=0}^{N−1} a_n x[n] + a_N          (the a_N term handles the non-zero-mean case)

Choose the a_n to minimize Bmse(θ̂) = E_{x,θ}{ (θ − θ̂)² }
5
Derivation of Optimal LMMSE Coefficients
Using the desired affine form of the estimator, the Bmse is

Bmse(θ̂) = E{ ( θ − Σ_{n=0}^{N−1} a_n x[n] − a_N )² }

Step #1: Focus on a_N. Setting ∂Bmse(θ̂)/∂a_N = 0 and passing ∂/∂a_N through E gives

E{ −2 ( θ − Σ_{n=0}^{N−1} a_n x[n] − a_N ) } = 0    ⇒    a_N = E{θ} − Σ_{n=0}^{N−1} a_n E{x[n]}

Note: a_N = 0 if E{θ} = E{x[n]} = 0.
6
Step #2: Plug the Step #1 result for a_N back in:

Bmse(θ̂) = E{ ( Σ_{n=0}^{N−1} a_n ( x[n] − E{x[n]} ) − ( θ − E{θ} ) )² }
         = E{ ( a^T ( x − E{x} ) − ( θ − E{θ} ) )² }        (both terms are scalars)

where a = [a0 a1 … a_{N−1}]^T   (only up to N−1).
Note: a^T ( x − E{x} ) = ( x − E{x} )^T a since it is a scalar.
7
Thus, expanding out [ a^T (x − E{x}) − (θ − E{θ}) ]² gives

Bmse(θ̂) = a^T C_xx a − a^T c_xθ − c_θx a + c_θθ
          (N×N)   (N×1)   (1×N)   (1×1)

where c_xθ = E{ (x − E{x})(θ − E{θ}) } and c_θx = c_xθ^T are the cross-covariance vectors, so

Bmse(θ̂) = a^T C_xx a − 2 a^T c_xθ + c_θθ
8
Step #3: Minimize w.r.t. a_0, a_1, … , a_{N−1}   (only up to N−1):

∂Bmse(θ̂)/∂a = 0   ⇒   2 C_xx a − 2 c_xθ = 0   ⇒   a = C_xx^{-1} c_xθ    (equivalently a^T = c_θx C_xx^{-1})

This is where the statistical dependence between the data and the parameter is used… via a cross-covariance vector.

Step #4: Combine results:

θ̂ = Σ_{n=0}^{N−1} a_n x[n] + a_N = a^T x + ( E{θ} − a^T E{x} ) = E{θ} + a^T ( x − E{x} )

So the optimal LMMSE estimate is:

θ̂ = E{θ} + c_θx C_xx^{-1} ( x − E{x} )          ( θ̂ = c_θx C_xx^{-1} x  if the means = 0 )

Note: the LMMSE estimate only needs 1st and 2nd moments… not PDFs!!
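A minimal sketch of that point: here the needed moments are estimated from simulated training pairs (an assumption made purely for illustration — the result itself only requires that the moments be known), and the estimator never touches a PDF.

```python
import numpy as np

# Sketch: scalar LMMSE estimator theta_hat = E{theta} + c_thetax Cxx^{-1} (x - E{x}).
rng = np.random.default_rng(3)
N, trials = 4, 5000
theta = rng.uniform(-1, 1, trials)                          # non-Gaussian parameter
x = theta[:, None] + rng.normal(0, 0.5, (trials, N))        # data statistically tied to theta

Ex, Etheta = x.mean(axis=0), theta.mean()
Cxx = np.cov(x, rowvar=False)                               # N x N covariance
c_theta_x = ((theta - Etheta)[:, None] * (x - Ex)).mean(axis=0)  # cross-covariance vector

a = np.linalg.solve(Cxx, c_theta_x)                         # a = Cxx^{-1} c_xtheta
x_new = theta[0] + rng.normal(0, 0.5, N)                    # one new data vector
theta_hat = Etheta + a @ (x_new - Ex)                       # LMMSE estimate
print(theta[0], theta_hat)
```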
9
Step #5: Find Minimum Bmse
Substitute into the Bmse result and simplify:

Bmse(θ̂) = a^T C_xx a − 2 a^T c_xθ + c_θθ
         = c_θx C_xx^{-1} C_xx C_xx^{-1} c_xθ − 2 c_θx C_xx^{-1} c_xθ + c_θθ
         = c_θθ − c_θx C_xx^{-1} c_xθ

Bmse(θ̂) = c_θθ − c_θx C_xx^{-1} c_xθ

Note: If θ and x are statistically independent then C_θx = 0, so θ̂ = E{θ} (totally based on prior info… the data is useless) and Bmse(θ̂) = c_θθ.
10
Ex. 12.1 DC Level in WGN with Uniform Prior
Recall: the uniform prior gave a non-closed form requiring integration… but changing to a Gaussian prior fixed this.
Here we keep the uniform prior and get a simple form by using the Linear MMSE.
For this problem the LMMSE estimate is Â = c_Ax C_xx^{-1} x. Need (using x = A·1 + w, with A & w uncorrelated):

c_Ax = E{ A ( A·1 + w )^T } = σ_A² 1^T          C_xx = E{ ( A·1 + w )( A·1 + w )^T } = σ_A² 1 1^T + σ² I

which gives

Â = ( σ_A² / ( σ_A² + σ²/N ) ) x̄
11
12.4 Geometrical InterpretationsAbstract Vector Space
Mathematicians first tackled “physical” vector spaces like RN
and CN, etc.
But… then abstracted the “bare essence” of these structures into the general idea of a vector space.
We’ve seen that we can interpret Linear LS in terms of “Physical” vector spaces.
We’ll now see that we can interpret Linear MMSE in terms of “Abstract” vector space ideas.
12
Abstract Vector Space RulesAn abstract vector space consists of a set of “mathematical objects” called vectors and another set called scalars that obey:1. There is a well-defined operation of “addition” of vectors that
gives a vector in the set, and…• “Adding” is commutative and associative• There is a vector in the set – call it 0 – for which “adding” it to any
vector in the set gives back that same vector • For every vector there is another vector s.t. when the 2 are added you get
the 0 vector
2. There is a well-defined operation of “multiplying” a vector by a “scalar” and it gives a vector in the set, and…
• “Multiplying” is associative • Multiplying a vector by the scalar 1 gives back the same vector
3. The distributive property holds• Multiplication distributes over vector addition• Multiplication distributes over scalar addition
13
Examples of Abstract Vector Spaces
1. Scalars = Real Numbers Vectors = Nth Degree Polynomials w/ Real Coefficients
2. Scalars = Real Numbers Vectors = M×N Matrices of Real Numbers
3. Scalars = Real Numbers Vectors = Functions from [0,1] to R
4. Scalars = Real Numbers Vectors = Real-Valued Random Variables with Zero Mean
Colliding Terminology… a scalar RV is a vector!!!
14
There is a well-defined concept of inner product s.t. all the rules of “ordinary” inner product still hold
• <x,y> = <y, x>*
• <a1x1+ a2x2,y> = a1<x1,y > + a2<x2,y> • <x,x> ≥ 0; <x,x> = 0 iff x = 0
Note: an inner product “induces” a norm (or length measure):
||x||2 = <x,x>
So an inner product space has:1. Two sets of elements: Vectors and Scalars2. Algebraic Structure (Vector Addition & Scalar Multiplication) 3. Geometric Structure
• Direction (Inner Product)• Distance (Norm)
Not needed for Real IP Spaces
Inner Product SpacesAn extension of the idea of Vector Space… must also have:
15
Inner Product Space of Random VariablesVectors: Set of all real RVs w/ zero mean & finite variance (ZMFV)Scalars: Set of all real numbersInner Product: <X,Y> = EXYClaim… This is an Inner Product Space
Inner Product is Correlation!Uncorrelated = Orthogonal
First this is a vector space
Addition Properties: X+Y is another ZMFV RV1. It is Associative and Commutative: X+(Y+Z) = (X+Y)+Z; X+Y = Y+X2. The zero RV has variance of 0 (What is an RV with var = 0???)3. The negative of RV X is –X
Multiplication Properties: For any real # a, aX is another ZMFV RV1. It is Associative: a(bX) = (ab)X2. 1X = X
Distributive Properties:1. a(X+Y) = aX + aY2. (a+b)X = aX + bX
NextThis is an inner product space• <a1X1+ a2X2,Y> = E(a1X1+ a2X2)Y
= a1EX1Y+ a2EX2Y• ||X||2 = <X, X> = EX2 = varX ≥ 0
16
Use IP Space Ideas for Section 12.3
Apply to the estimation of a zero-mean scalar RV:  θ̂ = Σ_{n=0}^{N−1} a_n x[n]     (zero mean… don’t need a_N)
Trying to estimate the realization of RV θ via a linear combination of N other RVs x[0], x[1], …, x[N−1].
Now… using our new vector-space view of RVs, this is the same structural mathematics that we saw for Linear LS!
Minimize:  Bmse(θ̂) = E{ (θ − θ̂)² } = || θ − θ̂ ||²        (connects the MSE to the geometry)
Each RV is viewed as a vector. (Sketch for the N = 2 case: θ projected onto the plane spanned by x[0] and x[1], giving θ̂.)
Recall the Orthogonality Principle!! Estimation error ⊥ data space:  E{ (θ − θ̂) x[n] } = 0
17
Now apply this orthogonality principle, E{ (θ − θ̂) x^T } = 0^T, with θ̂ = a^T x:

E{ θ x^T } − a^T E{ x x^T } = 0^T   ⇒   E{ x x^T } a = E{ θ x }   ⇒   C_xx a = c_xθ     (“The Normal Equations”)

Assuming that C_xx is invertible:  a = C_xx^{-1} c_xθ,  so  θ̂ = a^T x = c_θx C_xx^{-1} x.   Same as before!!!
18
12.5 Vector LMMSE Estimator    (meaning a “physical” vector)
Estimate: realization of θ = [θ1 θ2 … θp]^T
Linear estimator: θ̂ = A x + a
Goal: minimize the Bmse for each element.
View the ith row of A and the ith element of a as forming a scalar LMMSE estimator for θi.
We already know the individual element solutions! Write them down, then combine into matrix form.
19
Solutions to Vector LMMSE
The vector LMMSE estimate is:   θ̂ = E{θ} + C_θx C_xx^{-1} ( x − E{x} )
where C_θx is now a p×N cross-covariance matrix and C_xx is still the N×N covariance matrix.
If E{θ} = 0 & E{x} = 0:   θ̂ = C_θx C_xx^{-1} x
Can show similarly that the Bmse matrix  M_θ̂ = E{ (θ − θ̂)(θ − θ̂)^T }  is

M_θ̂ = C_θθ − C_θx C_xx^{-1} C_xθ          (p×p prior covariance minus p×N · N×N · N×p)
20
Two Properties of LMMSE Estimator
1. Commutes over affine transformations: if α = Aθ + b and θ̂ is the LMMSE estimate of θ, then α̂ = Aθ̂ + b is the LMMSE estimate for α.
2. If α = θ1 + θ2 then α̂ = θ̂1 + θ̂2.
21
Bayesian Gauss-Markov Theorem   (like the G-M theorem for the BLUE)
Let the data be modeled as x = Hθ + w, with H known, θ a p×1 random vector with mean µ_θ and covariance C_θθ (not Gaussian), and w an N×1 zero-mean random vector with covariance C_w (not Gaussian).
Application of the previous results, evaluated for this data model, gives:

θ̂ = µ_θ + C_θθ H^T ( H C_θθ H^T + C_w )^{-1} ( x − Hµ_θ )
MMSE matrix:  M_θ̂ = C_ε = C_θθ − C_θθ H^T ( H C_θθ H^T + C_w )^{-1} H C_θθ

These are the same forms as for the Bayesian Linear Model (which includes the Gaussian assumption). Except here… the result is suboptimal… unless the optimal estimate is linear.
In practice… we generally don’t know if a linear estimate is optimal… but we use LMMSE for its simple form! The challenge is to “guess” or estimate the needed means & covariance matrices.
1
12.6 Sequential LMMSE Estimation
Same kind of setting as for sequential LS… a fixed number of parameters (but here they are modeled as random) and an increasing number of data samples.

Data Model:  x[n] = H[n] θ + w[n]
• x[n] = [x[0] … x[n]]^T is (n+1)×1
• θ is p×1, with unknown PDF but known mean & covariance
• w[n] = [w[0] … w[n]]^T is (n+1)×1, unknown PDF, known mean & covariance; C_w must be diagonal with elements σ_n²
• θ & w are uncorrelated
• H[n] = [ H[n−1] ; h_n^T ] is (n+1)×p and known

Goal: given an estimate θ̂[n−1] based on x[n−1], when a new data sample x[n] arrives, update the estimate to θ̂[n].
2
Development of Sequential LMMSE Estimate
Our approach here: use vector-space ideas to derive the solution for “DC level in white noise,” x[n] = A + w[n], then write down the general solution.
For convenience… assume both A and w[n] have zero mean.
Given x[0] we can find the LMMSE estimate:

Â_0 = ( E{A x[0]} / E{x²[0]} ) x[0] = ( E{A(A + w[0])} / E{(A + w[0])²} ) x[0] = ( σ_A² / (σ_A² + σ²) ) x[0]
Now we seek to sequentially update this estimate with the info from x[1]…
3
• From Vector Space View: A
x[0]
x[1]0A
1A
• First project x[1] onto x[0] to get ]0|1[x
Estimate new data given old data…
Prediction!
Notation: the estimate “at 1” based “on 0”• Use Orthogonality Principle
]0|1[ˆ]1[]1[~ xxx −=∆ is ⊥ to x[0]
⇒ This is the new, non-redundant info provided by data x[1]It is called the “innovation”
x[0]
x[1]0A
]1[~x
4
• Find the estimation update by projecting A onto the innovation:

ΔÂ_1 = ( <A, x̃[1]> / <x̃[1], x̃[1]> ) x̃[1] = ( E{A x̃[1]} / E{x̃²[1]} ) x̃[1] = k_1 x̃[1]        (gain k_1)

• Recall the property: two estimates from ⊥ (uncorrelated) data just add, and x̃[1] ⊥ x[0]:

Â_1 = Â_0 + ΔÂ_1 = Â_0 + k_1 x̃[1] = Â_0 + k_1 ( x[1] − x̂[1|0] )

That is: (new estimate) = (old estimate) + (gain) × (new data − predicted new data); the “innovation” is ⊥ the old data.
5
The Innovations Sequence
The innovations sequence x̃[0], x̃[1], x̃[2], … is:
• Key to the derivation & implementation of Seq. LMMSE
• A sequence of orthogonal (i.e., uncorrelated) RVs
• Broadly significant in Signal Processing and Controls
x̃[0] = x[0],   x̃[1] = x[1] − x̂[1|0],   x̃[2] = x[2] − x̂[2|1], …
where x̂[2|1] means the prediction of x[2] based on ALL data up to n = 1 (inclusive).
6
General Sequential LMMSE Estimation
Initialization (no data yet! ⇒ use prior information):
  θ̂_{−1} = E{θ}        (estimate)
  M_{−1} = C_θθ         (MMSE matrix)

Update Loop — for n = 0, 1, 2, …:
  Gain vector:        k_n = M_{n−1} h_n / ( σ_n² + h_n^T M_{n−1} h_n )
  Estimate update:    θ̂_n = θ̂_{n−1} + k_n ( x[n] − h_n^T θ̂_{n−1} )
  MMSE matrix update: M_n = ( I − k_n h_n^T ) M_{n−1}
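A minimal sketch of this loop for the scalar DC-level case (p = 1, h_n = 1; the prior and noise variances are assumed example values):

```python
import numpy as np

# Sketch: sequential LMMSE for x[n] = A + w[n] with a zero-mean prior on A.
rng = np.random.default_rng(4)
var_A, sigma2 = 2.0, 0.5
A = rng.normal(0.0, np.sqrt(var_A))
theta_hat, M = 0.0, var_A                       # initialization from the prior

for n in range(100):
    x_n = A + rng.normal(0.0, np.sqrt(sigma2))  # new data sample
    h_n = 1.0
    k = M * h_n / (sigma2 + h_n * M * h_n)      # gain
    theta_hat = theta_hat + k * (x_n - h_n * theta_hat)  # estimate update
    M = (1.0 - k * h_n) * M                     # MMSE update
print(A, theta_hat, M)
```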
7
Sequential LMMSE Block Diagram
(Block diagram for the data model x[n] = H[n]θ + w[n] with H[n] = [H[n−1]; h_n^T]: the observation x[n] has the predicted observation x̂[n|n−1] = h_n^T θ̂_{n−1} subtracted to form the innovation x̃[n]; the innovation is scaled by the gain k_n (computed from M_{n−1}, h_n, σ_n²) and added to the previous estimate θ̂_{n−1}, held in a z⁻¹ delay, to give the updated estimate θ̂_n.)
Exact same structure as for Sequential Linear LS!!
8
Comments on Sequential LMMSE Estimation1. Same structure as for sequential linear LS. BUT… they solve
the estimation problem under very different assumptions.
2. No matrix inversion required… So computationally Efficient
3. Gain vector kn weighs confidence in new data (σ2n) against all
previous data (Mn-1)• when previous data is better, gain is small… don’t use new data much• when new data is better, gain is large… new data is heavily used
4. If you know noise statistics σ2n and observation rows hn
T over the desired range of n:
• Can run MMSE Matrix Recursion without data measurements!!!• This provides a Predictive Performance Analysis
9
12.7 Examples Wiener Filtering During WWII, Norbert Wiener developed the mathematical ideas that led to the Wiener filter when he was working on ways to improve anti-aircraft guns.
He posed the problem in C-T form and sought the best linear filter that would reduce the effect of noise in the observed A/Ctrajectory.
He modeled the aircraft motion as a wide-sense stationary random process and used the MMSE as the criterion for optimality. The solutions were not simple and there were many different ways of interpreting and casting the results.
The results were difficult for engineers of the time to understand.
Others (Kolmogorov, Hopf, Levinson, etc.) developed these ideas for the D-T case and various special cases.
10
Wiener Filter: Model and Problem Statement
Signal Model: x[n] = s[n] + w[n]
• Observed: noisy signal x[n]; model as WSS, zero mean, with C_xx = R_xx (the covariance matrix E{(x − E{x})(x − E{x})^T} equals the correlation matrix R_xx = E{x x^T} since the mean is zero)
• Desired signal s[n]: model as WSS, zero mean, C_ss = R_ss
• Noise w[n]: model as WSS, zero mean, C_ww = R_ww
Problem Statement: Process x[n] using a linear filter to provide a “de-noised” version of the signal that has minimum MSErelative to the desired signal
LMMSE Problem!
11
Filtering, Smoothing, Prediction
Terminology for three different ways to cast the Wiener filter problem:
• Filtering: given x[0], x[1], …, x[n], find ŝ[n]   (estimate the current signal sample using data only up to the current time)
• Smoothing: given x[0], x[1], …, x[N−1], find ŝ[0], ŝ[1], …, ŝ[N−1]
• Prediction: given x[0], x[1], …, x[N−1], find x̂[N−1+l] for l > 0    (note: x, not s!)
All three are solved using the general LMMSE estimator θ̂ = C_θx C_xx^{-1} x.
12
Applying θ̂ = C_θx C_xx^{-1} x to each case (in every case C_xx = E{(s + w)(s + w)^T} = R_ss + R_ww, since s and w are uncorrelated):

Filtering (scalar θ = s[n]):
  C_θx = E{ s[n] (s + w)^T } = E{ s[n] s^T } = [ r_ss[n] … r_ss[0] ] ≜ r̃_ss^T    (a vector!)
  ŝ[n] = r̃_ss^T ( R_ss + R_ww )^{-1} x          (1×(n+1) · (n+1)×(n+1) · (n+1)×1)

Smoothing (vector θ = s):
  C_θx = E{ s (s + w)^T } = E{ s s^T } = R_ss    (a matrix!)
  ŝ = R_ss ( R_ss + R_ww )^{-1} x                (N×N · N×N · N×1)

Prediction (scalar θ = x[N−1+l] — note, x not s!):
  C_θx = E{ x[N−1+l] x^T } = [ r_xx[N−1+l] … r_xx[l] ] ≜ r̃_xx^T    (a vector!)
  x̂[N−1+l] = r̃_xx^T R_xx^{-1} x                 (1×N · N×N · N×1)
13
Comments on Filtering: FIR Wiener

ŝ[n] = r̃_ss^T ( R_ss + R_ww )^{-1} x ≜ a^T x,   with   h^{(n)} = [h^{(n)}[0] h^{(n)}[1] … h^{(n)}[n]]^T = [a_n a_{n−1} … a_0]^T

ŝ[n] = Σ_{k=0}^{n} h^{(n)}[k] x[n−k]

So the Wiener filter acts as a time-varying FIR filter: it is causal!, and its length grows!

Wiener-Hopf Filtering Equations:   ( R_ss + R_ww ) h^{(n)} = r_ss,  i.e.  R_xx h^{(n)} = r_ss,  with  r_ss = [r_ss[0] r_ss[1] … r_ss[n]]^T

[ r_xx[0]   r_xx[1]  …  r_xx[n]   ] [ h^{(n)}[0] ]   [ r_ss[0] ]
[ r_xx[1]   r_xx[0]  …  r_xx[n−1] ] [ h^{(n)}[1] ] = [ r_ss[1] ]
[    ⋮                 ⋱          ] [     ⋮      ]   [    ⋮    ]
[ r_xx[n]      …        r_xx[0]   ] [ h^{(n)}[n] ]   [ r_ss[n] ]

(The matrix is Toeplitz & symmetric.)
In principle: solve the W-H equations for the filter h at each n.  In practice: use the Levinson recursion to solve recursively.
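A minimal sketch of solving these equations for one fixed length (the ACFs below are assumed example values); SciPy’s Toeplitz solver exploits the same structure the Levinson recursion does:

```python
import numpy as np
from scipy.linalg import solve_toeplitz, toeplitz

# Sketch: solve R_xx h = r_ss for the FIR Wiener filter taps.
L = 8
k = np.arange(L)
r_ss = 0.9**k                          # assumed signal ACF r_ss[k]
r_ww = np.zeros(L); r_ww[0] = 0.5      # white noise ACF: sigma_w^2 * delta[k]
r_xx = r_ss + r_ww                     # x = s + w, s and w uncorrelated

h = solve_toeplitz(r_xx, r_ss)                       # Wiener filter taps h[0..L-1]
h_check = np.linalg.solve(toeplitz(r_xx), r_ss)      # direct solve for comparison
print(np.allclose(h, h_check), h)
```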
14
Comments on Filtering: IIR Wiener
Can show: as n → ∞ the Wiener filter becomes time-invariant. Thus h^{(n)}[k] → h[k].
Then the Wiener-Hopf equations become

Σ_{k=0}^{∞} h[k] r_xx[l−k] = r_ss[l],    l = 0, 1, …

and these are solved using so-called “spectral factorization”.
And… the Wiener filter becomes IIR time-invariant:   ŝ[n] = Σ_{k=0}^{∞} h[k] x[n−k]
15
Revisit the FIR Wiener: Fixed Length L
The way the Wiener filter was formulated above, the length of the filter grew so that the current estimate was based on all the past data.
Reformulate so that the current estimate is based on only the L most recent data samples (e.g., estimate ŝ[6], ŝ[7], ŝ[8], … each from a sliding window of the data … x[3] x[4] x[5] x[6] x[7] x[8] x[9] …):

ŝ[n] = Σ_{k=0}^{L−1} h[k] x[n−k]

Wiener-Hopf Filtering Equations for a WSS process with fixed FIR length:  ( R_ss + R_ww ) h = r_ss,  i.e.  R_xx h = r_ss,  with r_ss = [r_ss[0] r_ss[1] … r_ss[L−1]]^T and R_xx the L×L Toeplitz & symmetric matrix of r_xx lags.
Solve the W-H filtering equations ONCE for the filter h.
16
Comments on Smoothing: FIR Smoother

ŝ = R_ss ( R_ss + R_ww )^{-1} x ≜ W x        (each row of W acts like an FIR filter: time-varying, non-causal!, block-based)

To interpret this – consider the N = 1 case:

ŝ[0] = ( r_ss[0] / ( r_ss[0] + r_ww[0] ) ) x[0] = ( SNR / (1 + SNR) ) x[0]      ≈ x[0] at high SNR,  ≈ 0 at low SNR
17
Comments on Smoothing: IIR Smoother
Estimate s[n] based on …, x[−1], x[0], x[1], …:

ŝ[n] = Σ_{k=−∞}^{∞} h[k] x[n−k]        (a time-invariant & non-causal IIR filter)

The Wiener-Hopf equations become (differs from the filter case: the sum is over all k, and we solve for all l)

Σ_{k=−∞}^{∞} h[k] r_xx[l−k] = r_ss[l],   −∞ < l < ∞,    i.e.   h[n] ∗ r_xx[n] = r_ss[n]

so in the frequency domain

H(f) = P_ss(f) / P_xx(f) = P_ss(f) / ( P_ss(f) + P_ww(f) )

H(f) ≈ 1 when P_ss(f) >> P_ww(f);   H(f) ≈ 0 when P_ss(f) << P_ww(f)
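A minimal sketch of that frequency response, using an assumed AR(1) signal PSD and white noise (illustrative values only):

```python
import numpy as np

# Sketch: IIR Wiener smoother H(f) = Pss(f) / (Pss(f) + Pww(f)).
a1, sigma_u2, sigma_w2 = -0.9, 1.0, 0.5
f = np.linspace(-0.5, 0.5, 1024)
Pss = sigma_u2 / np.abs(1 + a1*np.exp(-2j*np.pi*f))**2   # assumed signal PSD
Pww = sigma_w2 * np.ones_like(f)                          # white noise PSD
H = Pss / (Pss + Pww)            # ~1 where the signal dominates, ~0 where noise dominates
print(H.max(), H.min())
```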
18
Relationship of Prediction to AR Est. & Yule-Walker
Wiener-Hopf Prediction Equations:  R_xx h = r_xx,  with  r_xx = [ r_xx[l]  r_xx[l+1]  …  r_xx[l+N−1] ]^T  and  R_xx the N×N Toeplitz & symmetric matrix built from r_xx[0], …, r_xx[N−1].
For l = 1 we get EXACTLY the Yule-Walker equations used in Ex. 7.18 to solve for the ML estimates of the AR parameters!!! The FIR prediction coefficients are estimated AR parameters.
Recall: we first estimated the ACF lags r_xx[k] using the data, then used those estimates to find estimates of the AR parameters:  R̂_xx h = r̂_xx
19
Relationship of Prediction to Inverse/Whitening Filter
(Diagram: white noise u[k] drives the AR model 1/a(z) to produce x[k] — imagination & modeling of the physical reality. The observed signal x[k], passed through the FIR predictor to form x̂[k] and subtracted from x[k], implements the inverse filter a(z), whose output is again the white noise u[k]. One-step prediction is thus a whitening operation.)
20
Results for 1-Step Prediction: For AR(3)
(Plot: signal value vs. sample index k for 0–100, showing the signal and the much smaller prediction error; at each k we predict x[k] using the past 3 samples.)
Application to data compression: the smaller dynamic range of the error gives more efficient binary coding (e.g., DPCM – Differential Pulse Code Modulation).
1
Ch. 13 Kalman Filters
2
Introduction
In 1960, Rudolf Kalman developed a way to solve some of the practical difficulties that arise when trying to apply Wiener filters.
There are D-T and C-T versions of the Kalman Filter… we will only consider the D-T version.
The Kalman filter is widely used in:• Control Systems• Navigation Systems • Tracking Systems
It is less widely used in signal processing applications
KF initially arose in the field of control systems –
in order to make a system do what you
want, you must know what it is doing now
3
The Three Keys to Leading to the Kalman Filter
Wiener Filter: LMMSE of a Signal (i.e., a Varying Parameter)
Sequential LMMSE: Sequentially Estimate a Fixed Parameter
State-Space Models: Dynamical Models for Varying Parameters
Kalman Filter: Sequential LMMSE Estimation for a time-varying parameter vector – but the time variation is
constrained to follow a “state-space” dynamical model.
Aside: There are many ways to mathematically model dynamical systems…• Differential/Difference Equations• Convolution Integral/Summation• Transfer Function via Laplace/Z transforms• State-Space Model
4
13.3 State-Variable Dynamical ModelsSystem State: the collection of variables needed to know how to determine how the system will “exist” at some future time (in the absence of an input)…
For an RLC circuit… you need to know all of its current capacitor voltages and all of its current inductor currents
Motivational Example: Constant-Velocity Aircraft in 2-D

s(t) = [ r_x(t)  r_y(t)  v_x(t)  v_y(t) ]^T       (A/C positions in m, A/C velocities in m/s)

For the constant-velocity model we constrain v_x(t) & v_y(t) to be constants V_x & V_y.
If we know s(t_o) and there is no input, we know how the A/C behaves for all future times:
r_x(t_o + τ) = r_x(t_o) + V_x τ,    r_y(t_o + τ) = r_y(t_o) + V_y τ
5
D-T State Model for Constant Velocity A/CBecause measurements are often taken at discrete times… we oftenneed D-T models for what are otherwise C-T systems
(This is the same as using a difference equation to approximate a differential equation)
If every increment of n corresponds to a duration of ∆ sec and there is no driving force, then we can write a D-T state model as s[n] = A s[n−1], with state transition matrix

A = [ 1 0 ∆ 0 ]
    [ 0 1 0 ∆ ]
    [ 0 0 1 0 ]
    [ 0 0 0 1 ]

i.e.  r_x[n] = r_x[n−1] + v_x[n−1]∆,   r_y[n] = r_y[n−1] + v_y[n−1]∆,   v_x[n] = v_x[n−1],   v_y[n] = v_y[n−1]
We can include the effect of a vector input:
][]1[][ nnn BuAss +−=
Input could be deterministic and/or random.Matrix B combines inputs & distributes them to states.
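A minimal simulation sketch of s[n] = A s[n−1] + B u[n] for this model (the time step, perturbation size, and initial state below are assumed example values):

```python
import numpy as np

# Sketch: simulate the constant-velocity state model with random velocity perturbations.
dt, steps, sigma_u = 1.0, 100, 5.0
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
B = np.array([[0, 0], [0, 0], [1, 0], [0, 1]], dtype=float)   # input enters the velocities
rng = np.random.default_rng(5)
s = np.array([0.0, 0.0, 100.0, 100.0])        # [rx, ry, vx, vy]
track = [s]
for _ in range(steps):
    u = rng.normal(0.0, sigma_u, 2)           # random velocity perturbation
    s = A @ s + B @ u
    track.append(s)
track = np.array(track)
print(track[-1])
```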
6
Thm 13.1 Vector Gauss-Markov Model    (don’t confuse with the G-M theorem of Ch. 6)
This theorem characterizes the probability model for a specific state-space model with Gaussian inputs.
Linear state model:  s[n] = A s[n−1] + B u[n],  n ≥ 0,  with s[n] p×1, A p×p known, B p×r known, u[n] r×1.
• s[n]: “state vector” — a vector Gauss-Markov process
• A: “state transition matrix”; eigenvalues assumed |λi| < 1 for stability
• B: “input matrix”
• u[n]: “driving noise” — vector WGN with zero mean, u[n] ~ N(0, Q), E{u[n]u^T[m]} = 0 for n ≠ m
• s[−1]: “initial state” ~ N(µ_s, C_s), independent of u[n]
7
Theorem:
• s[n] for n ≥ 0 is Gaussian with the following characteristics…
• Mean of the state vector:  E{s[n]} = A^{n+1} µ_s    (diverges if the eigenvalues have |λi| ≥ 1)
• Covariance between state vectors at m and n, for m ≥ n:

C_s[m, n] = E{ [s[m] − E{s[m]}][s[n] − E{s[n]}]^T } = A^{m+1} C_s (A^{n+1})^T + Σ_{k=m−n}^{m} A^k B Q B^T (A^{k−(m−n)})^T

and for m < n:  C_s[m, n] = C_s^T[n, m].   The state process is not WSS!
• Covariance matrix: C[n] = C_s[n, n]   (this is just notation)
• Propagation of the mean & covariance:

E{s[n]} = A E{s[n−1]}          C[n] = A C[n−1] A^T + B Q B^T
8
Proof (only for the scalar case, p = 1):
For the scalar case the model is s[n] = a s[n−1] + b u[n], n ≥ 0  (differs a bit from (13.1), etc.).
Now we can just iterate this model and surmise its general form:

s[0] = a s[−1] + b u[0]
s[1] = a s[0] + b u[1] = a² s[−1] + a b u[0] + b u[1]
s[2] = a s[1] + b u[2] = a³ s[−1] + a² b u[0] + a b u[1] + b u[2]
⋮
s[n] = a^{n+1} s[−1] + Σ_{k=0}^{n} a^k b u[n−k]

Now it is easy to find the mean:

E{s[n]} = a^{n+1} E{s[−1]} + Σ_{k=0}^{n} a^k b E{u[n−k]} = a^{n+1} µ_s       … as claimed!
          (zero-input response… exponential)      (zero-state response… convolution, zero mean)
9
Covariance between s[m] and s[n]:

C_s[m, n] = E{ [s[m] − a^{m+1}µ_s][s[n] − a^{n+1}µ_s] }
          = E{ ( a^{m+1}(s[−1] − µ_s) + Σ_{k=0}^{m} a^k b u[m−k] )( a^{n+1}(s[−1] − µ_s) + Σ_{l=0}^{n} a^l b u[n−l] ) }
          = a^{m+1} a^{n+1} σ_s² + Σ_{k=0}^{m} Σ_{l=0}^{n} a^k a^l b² E{ u[m−k] u[n−l] }      (must use different dummy variables!! cross-terms are zero… why? s[−1] is independent of u[n])
          = a^{m+1} a^{n+1} σ_s² + Σ_{k=0}^{m} Σ_{l=0}^{n} a^k a^l b² σ_u² δ[(m−k) − (n−l)]

For m ≥ n:   C_s[m, n] = a^{m+1} a^{n+1} σ_s² + Σ_{k=m−n}^{m} a^k a^{k−(m−n)} b² σ_u²
For m < n:   C_s[m, n] = C_s[n, m]
10
For the mean & covariance propagation, start from s[n] = a s[n−1] + b u[n]:

E{s[n]} = a E{s[n−1]} + b E{u[n]} = a E{s[n−1]}         (propagates as in the theorem)

var(s[n]) = E{ (s[n] − E{s[n]})² }
          = E{ ( a( s[n−1] − E{s[n−1]} ) + b u[n] )² }
          = a² var(s[n−1]) + b² σ_u²                     (cross-term is zero)

… which propagates as in the theorem.   < End of Proof >
So we now have:• Random Dynamical Model (A State Model)• Statistical Characterization of it
11
Random Model for “Constant” Velocity A/C

[ r_x[n] ]   [ 1 0 ∆ 0 ] [ r_x[n−1] ]   [ 0      ]
[ r_y[n] ] = [ 0 1 0 ∆ ] [ r_y[n−1] ] + [ 0      ]
[ v_x[n] ]   [ 0 0 1 0 ] [ v_x[n−1] ]   [ u_x[n] ]
[ v_y[n] ]   [ 0 0 0 1 ] [ v_y[n−1] ]   [ u_y[n] ]

(deterministic propagation of the constant velocity)   (random perturbation of the “constant” velocities)

cov{u[n]} = diag( 0, 0, σ_u², σ_u² )
12
Example set of “constant-velocity” A/C trajectories
(Plot: Y position vs. X position for several simulated trajectories; the red line is the non-random constant-velocity trajectory. Parameters: ∆ = 1 sec, σ_u = 5 m/s, r_x[−1] = r_y[−1] = 0 m, v_x[−1] = v_y[−1] = 100 m/s; an acceleration of (5 m/s)/1 s = 5 m/s².)
13
Observation ModelSo… we have a random state-variable model for the dynamics of the “signal” (… the “signal” is often some true A/C trajectory)
We need to have some observations (i.e., measurements) of the “signal”
• In Navigation Systems… inertial sensors make noisy measurements at intervals of time
• In Tracking Systems… sensing systems make noisy measurements (e.g., range and angles) at intervals of time
][][][][ nnnn wsHx +=Linear Observation Model:
Measured “observation” vector at each time
allows multiple measurements at each time
Observation Matrix …can change w/ time
State Vector Process being observed
Vector Noise Process
14
The Estimation ProblemObserve a Sequence of Observation Vectors x[0], x[1], … x[n]
Compute an Estimate of the State Vector s[n]
]|[ˆ nnsestimate state at n
using observation up to n
Notation: ]|[ˆ mns = Estimate of s[n] using x[0], x[1], … x[m]
Want Recursive Solution:Given: ]|[ˆ nns and a new observation vector x[n + 1]Find: ]1|1[ˆ ++ nns
Three Cases of Interest:• Scalar State – Scalar Observation• Vector State – Scalar Observation• Vector State – Vector Observation
1
13.4 Scalar Kalman Filter
Data Model. To derive the Kalman filter we need the data model:

s[n] = a s[n−1] + u[n]        <State Equation>
x[n] = s[n] + w[n]            <Observation Equation>

Assumptions:
1. u[n] is zero-mean Gaussian, white, E{u²[n]} = σ_u²
2. w[n] is zero-mean Gaussian, white, E{w²[n]} = σ_n²   (can vary with time)
3. The initial state is s[−1] ~ N(µ_s, σ_s²)
4. u[n], w[n], and s[−1] are all independent of each other
To simplify the derivation: let µ_s = 0 (we’ll account for this later).
2
Goal and Two Properties
Goal: Recursively compute  ŝ[n|n] = E{ s[n] | x[0], x[1], …, x[n] }.
Notation: X[n] = [x[0], x[1], …, x[n]]^T is the set of all observations; x[n] is a single observation.
Two properties we need:
1. For the jointly Gaussian case, the MMSE estimator of a zero-mean θ based on two uncorrelated data vectors x1 & x2 is (see p. 350 of the text):  θ̂ = E{θ | x1, x2} = E{θ|x1} + E{θ|x2}
2. If θ = θ1 + θ2 then the MMSE estimator is θ̂ = E{θ1 + θ2 | x} = E{θ1|x} + E{θ2|x}   (a result of the linearity of the E{·} operator)
3
Derivation of Scalar Kalman Filter
Innovation:  x̃[n] = x[n] − x̂[n|n−1].  Recall from Section 12.6 that x̂[n|n−1] is the MMSE estimate of x[n] given X[n−1] (prediction!!).
By the MMSE orthogonality principle, E{ x̃[n] X[n−1] } = 0:  x̃[n] is the part of x[n] that is uncorrelated with the previous data.
Now note: X[n] is equivalent to { X[n−1], x̃[n] }.  Why? Because we can get X[n] back from it as follows:

X[n] = [ X[n−1] ; x[n] ],   with   x[n] = x̃[n] + x̂[n|n−1] = x̃[n] + Σ_{k=0}^{n−1} a_k x[k]
4
What have we done so far?
• Have shown that X[n] ↔ { X[n−1], x̃[n] }
⇒ Have split the current data set into 2 parts: 1. old data, 2. the uncorrelated part of the new data (“just the new facts”)
⇒ Because of this:  ŝ[n|n] = E{ s[n] | X[n] } = E{ s[n] | X[n−1], x̃[n] }
So what??!! Well… can now exploit Property #1!!

ŝ[n|n] = E{ s[n] | X[n−1] } + E{ s[n] | x̃[n] } ≜ ŝ[n|n−1] + E{ s[n] | x̃[n] }

The first term is the prediction of s[n] based on the past data; the second is the update based on the innovation part of the new data. Now we need to look more closely at each of these!
5
Look at the prediction term ŝ[n|n−1]:
Use the dynamical model… it is the key to prediction because it tells us how the state should progress from instant to instant.

ŝ[n|n−1] = E{ s[n] | X[n−1] } = E{ a s[n−1] + u[n] | X[n−1] }
         = a E{ s[n−1] | X[n−1] } + E{ u[n] | X[n−1] }        (Property #2)
         = a ŝ[n−1|n−1] + 0                                    (by definition; u[n] is independent of X[n−1] — see bottom of p. 433 in the textbook)

ŝ[n|n−1] = a ŝ[n−1|n−1]
The dynamical model provides the update from estimate to prediction!!
6
Look at the update term E{ s[n] | x̃[n] }:
Use the form for the Gaussian MMSE estimate:

E{ s[n] | x̃[n] } = ( E{ s[n] x̃[n] } / E{ x̃²[n] } ) x̃[n] ≜ k[n] x̃[n],   with   x̃[n] = x[n] − x̂[n|n−1]

and, since w[n] is independent of x[0], …, x[n−1] (Property #2),
x̂[n|n−1] = ŝ[n|n−1] + ŵ[n|n−1] = ŝ[n|n−1] + 0 = ŝ[n|n−1]     — the prediction shows up again!!!
So E{ s[n] | x̃[n] } = k[n] ( x[n] − ŝ[n|n−1] ).
Put these results together:

ŝ[n|n] = ŝ[n|n−1] + k[n] ( x[n] − ŝ[n|n−1] ),    with   ŝ[n|n−1] = a ŝ[n−1|n−1]

This is the Kalman filter. How do we get the gain?
7
Look at the gain term. Need two properties…

A.  E{ s[n] ( x[n] − ŝ[n|n−1] ) } = E{ ( s[n] − ŝ[n|n−1] )( x[n] − ŝ[n|n−1] ) }
    (Aside: <x, y> = <x + z, y> for any z ⊥ y. ŝ[n|n−1] is a linear combo of past data… thus ⊥ to the innovation x̃[n] = x[n] − x̂[n|n−1].)

B.  E{ w[n] ( s[n] − ŝ[n|n−1] ) } = 0
    “Proof”: w[n] is the measurement noise and by assumption is independent of the “dynamical driving noise” u[n] and of s[−1]… in other words, w[n] is independent of everything dynamical, so E{w[n]s[n]} = 0. Also, ŝ[n|n−1] is based on past data, which involve w[0], …, w[n−1]; since the measurement noise has independent samples, ŝ[n|n−1] ⊥ w[n].
8
So… we start with the gain as defined above:

k[n] = E{ s[n] x̃[n] } / E{ x̃²[n] }
     = E{ s[n] ( x[n] − ŝ[n|n−1] ) } / E{ ( x[n] − ŝ[n|n−1] )² }                                 (plug in the innovation)
     = E{ ( s[n] − ŝ[n|n−1] )( x[n] − ŝ[n|n−1] ) } / E{ ( s[n] − ŝ[n|n−1] + w[n] )² }             (Prop. A in the numerator; x[n] = s[n] + w[n] in the denominator)
     = E{ ( s[n] − ŝ[n|n−1] )² } + E{ ( s[n] − ŝ[n|n−1] ) w[n] }  over
       E{ ( s[n] − ŝ[n|n−1] )² } + 2 E{ ( s[n] − ŝ[n|n−1] ) w[n] } + E{ w²[n] }                   (x[n] = s[n] + w[n] in the numerator too; expand)
     = M[n|n−1] / ( M[n|n−1] + σ_n² )        (the w[n] cross-terms vanish by Prop. B; M[n|n−1] ≜ E{(s[n] − ŝ[n|n−1])²} is the MSE when s[n] is estimated by 1-step prediction)
9
This gives a form for the gain:

k[n] = M[n|n−1] / ( M[n|n−1] + σ_n² )

This balances the quality of the measured data against the predicted state. In the Kalman filter the prediction acts like the prior information about the state at time n before we observe the data at time n.
10
Look at the prediction MSE term. But now we need to know how to find M[n|n−1]!!!

M[n|n−1] = E{ ( s[n] − ŝ[n|n−1] )² }
         = E{ ( a s[n−1] + u[n] − a ŝ[n−1|n−1] )² }          (use the dynamical model & exploit the form for the prediction)
         = E{ ( a( s[n−1] − ŝ[n−1|n−1] ) + u[n] )² }
         = a² M[n−1|n−1] + σ_u²                               (cross-terms = 0; M[n−1|n−1] is the estimation error MSE at the previous time)

Why are the cross-terms zero? Two parts:
1. s[n−1] depends on u[0] … u[n−1], s[−1], which are independent of u[n]
2. ŝ[n−1|n−1] depends on s[0]+w[0], …, s[n−1]+w[n−1], which are independent of u[n]
11
Look at a recursion for the MSE term M[n|n]. By definition:

M[n|n] = E{ ( s[n] − ŝ[n|n] )² } = E{ ( [s[n] − ŝ[n|n−1]] − k[n][x[n] − ŝ[n|n−1]] )² }       (Term A − Term B)

Now we’ll get three terms: E{A²}, E{AB}, E{B²}:
E{A²} = M[n|n−1]
E{AB} = k[n] E{ ( s[n] − ŝ[n|n−1] )( x[n] − ŝ[n|n−1] ) } = k[n] · (numerator of k[n]) = k[n] M[n|n−1]
E{B²} = k²[n] E{ ( x[n] − ŝ[n|n−1] )² } = k²[n] · (denominator of k[n]) = k[n] · (numerator of k[n]) = k[n] M[n|n−1]

Recall:  k[n] = M[n|n−1] / ( M[n|n−1] + σ_n² )
12
So this gives…

M[n|n] = M[n|n−1] − 2 k[n] M[n|n−1] + k[n] M[n|n−1] = ( 1 − k[n] ) M[n|n−1]
Putting all of these results together gives some very simple equations to iterate…
Called the Kalman Filter
We just derived the form for Scalar State & Scalar Observation.On the next three charts we give the Kalman Filter equations for:
• Scalar State & Scalar Observation• Vector State & Scalar Observation• Vector State & Vector Observation
13
Kalman Filter: Scalar State & Scalar Observation
State Model:  s[n] = a s[n−1] + u[n],   u[n] WGN, WSS, ~ N(0, σ_u²)
Observation Model:  x[n] = s[n] + w[n],   w[n] WGN ~ N(0, σ_n²)   (varies with n)
Must know: µ_s, σ_s², a, σ_u², σ_n²

Initialization:   ŝ[−1|−1] = E{s[−1]} = µ_s,    M[−1|−1] = E{(s[−1] − ŝ[−1|−1])²} = σ_s²
Prediction:       ŝ[n|n−1] = a ŝ[n−1|n−1]
Pred. MSE:        M[n|n−1] = a² M[n−1|n−1] + σ_u²
Kalman Gain:      K[n] = M[n|n−1] / ( σ_n² + M[n|n−1] )
Update:           ŝ[n|n] = ŝ[n|n−1] + K[n] ( x[n] − ŝ[n|n−1] )
Est. MSE:         M[n|n] = ( 1 − K[n] ) M[n|n−1]
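A minimal sketch iterating exactly these equations (all model values below are assumed for illustration):

```python
import numpy as np

# Sketch: scalar Kalman filter for s[n] = a s[n-1] + u[n], x[n] = s[n] + w[n].
a, sigma_u2, sigma_n2 = 0.98, 0.1, 1.0
mu_s, sigma_s2 = 0.0, 1.0
rng = np.random.default_rng(6)

# Simulate the state and the noisy observations
n_steps = 200
s = np.zeros(n_steps); s_prev = rng.normal(mu_s, np.sqrt(sigma_s2))
x = np.zeros(n_steps)
for n in range(n_steps):
    s[n] = a * s_prev + rng.normal(0, np.sqrt(sigma_u2)); s_prev = s[n]
    x[n] = s[n] + rng.normal(0, np.sqrt(sigma_n2))

# Run the Kalman filter
s_hat, M = mu_s, sigma_s2                    # initialization
s_est = np.zeros(n_steps)
for n in range(n_steps):
    s_pred = a * s_hat                       # prediction
    M_pred = a**2 * M + sigma_u2             # prediction MSE
    K = M_pred / (sigma_n2 + M_pred)         # Kalman gain
    s_hat = s_pred + K * (x[n] - s_pred)     # update
    M = (1 - K) * M_pred                     # estimation MSE
    s_est[n] = s_hat
print(np.mean((s - s_est)**2), M)
```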
14
Kalman Filter: Vector State & Scalar Observation
State Model:  s[n] = A s[n−1] + B u[n],   s p×1, A p×p, B p×r, u[n] r×1 ~ N(0, Q)
Observation Model:  x[n] = h_n^T s[n] + w[n],   h_n p×1,  w[n] WGN ~ N(0, σ_n²)
Must know: µ_s, C_s, A, B, h_n, Q, σ_n²

Initialization:     ŝ[−1|−1] = E{s[−1]} = µ_s,   M[−1|−1] = E{(s[−1] − ŝ[−1|−1])(s[−1] − ŝ[−1|−1])^T} = C_s
Prediction:           ŝ[n|n−1] = A ŝ[n−1|n−1]
Pred. MSE (p×p):      M[n|n−1] = A M[n−1|n−1] A^T + B Q B^T
Kalman Gain (p×1):    K[n] = M[n|n−1] h_n / ( σ_n² + h_n^T M[n|n−1] h_n )       (denominator is 1×1)
Update:               ŝ[n|n] = ŝ[n|n−1] + K[n] ( x[n] − h_n^T ŝ[n|n−1] )         (x[n] − x̂[n|n−1] is the innovation x̃[n])
Est. MSE (p×p):       M[n|n] = ( I − K[n] h_n^T ) M[n|n−1]
15
Kalman Filter: Vector State & Vector Observation
State Model:  s[n] = A s[n−1] + B u[n],   s p×1, A p×p, B p×r, u[n] r×1 ~ N(0, Q)
Observation Model:  x[n] = H[n] s[n] + w[n],   x[n] M×1, H[n] M×p, w[n] ~ N(0, C[n])
Must know: µ_s, C_s, A, B, H, Q, C[n]

Initialization:     ŝ[−1|−1] = E{s[−1]} = µ_s,   M[−1|−1] = E{(s[−1] − ŝ[−1|−1])(s[−1] − ŝ[−1|−1])^T} = C_s
Prediction:           ŝ[n|n−1] = A ŝ[n−1|n−1]
Pred. MSE (p×p):      M[n|n−1] = A M[n−1|n−1] A^T + B Q B^T
Kalman Gain (p×M):    K[n] = M[n|n−1] H^T[n] ( C[n] + H[n] M[n|n−1] H^T[n] )^{-1}      (the inverted matrix is M×M)
Update:               ŝ[n|n] = ŝ[n|n−1] + K[n] ( x[n] − H[n] ŝ[n|n−1] )                (x[n] − x̂[n|n−1] is the innovation x̃[n])
Est. MSE (p×p):       M[n|n] = ( I − K[n] H[n] ) M[n|n−1]
16
Kalman Filter Block Diagram
(Block diagram: the observation x[n] has the predicted observation x̂[n|n−1] = H[n]ŝ[n|n−1] subtracted to form the innovation x̃[n]; the innovation is scaled by the gain K[n] and added to the predicted state ŝ[n|n−1] to give the updated estimate ŝ[n|n]; the update feeds back through the embedded dynamical model A z⁻¹ to form the next predicted state, and through the embedded observation model H[n] to form the next predicted observation. The gained innovation K[n]x̃[n] can also be read as an estimate of the driving-noise term Bû[n].)
Looks a lot like Sequential LS/MMSE except it has the embedded dynamical model!!!
17
Overview of MMSE Estimation
• General MMSE (“squared” cost function):  θ̂ = E{θ|x}
• Assume Gaussian ⇒ jointly Gaussian MMSE, or force linear with any PDF and known 2nd moments ⇒ LMMSE — same form:  θ̂ = E{θ} + C_θx C_xx^{-1} ( x − E{x} )
• Bayesian Linear Model / LMMSE Linear Model:  θ̂ = µ_θ + C_θ H^T ( H C_θ H^T + C_w )^{-1} ( x − Hµ_θ )
• Optimal or linear sequential filter (no dynamics):  θ̂_n = θ̂_{n−1} + k_n ( x[n] − h_n^T θ̂_{n−1} )
• Optimal or linear Kalman filter (with dynamics):  ŝ[n|n] = ŝ[n|n−1] + K[n] ( x[n] − H[n] A ŝ[n−1|n−1] )
1
Important Properties of the KF1. Kalman filter is an extension of the sequential MMSE
estimator• Sequential MMSE is for a fixed parameter• Kalman is for time-varying parameter, but must have a known
dynamical model• Block diagrams are nearly identical except for the Az-1 feedback box in
the Kalman filter… just a z-1 box in seq. MMSE… the A is the dynamical model’s state-transition matrix
2. Inversion is only needed for the vector observation case3. Kalman filter is a time-varying filter
• Due to two time-varying blocks: gain K[n] & Observation Matrix H[n]• Note: K[n] changes constantly to adjust the balance between “info from
the data” (the innovation) vs. “info from the model” (the prediction)
4. Kalman filter computes (and uses!) its own performance measure M[n|n] (which is the MMSE matrix)
• Used to help balance between innovation and prediction
2
5. There is a natural up-down progression in the error• The Prediction Stage increases the error • The Update Stage decreases the error M[n|n – 1] > M[n|n]• This is OK… prediction is just a natural, intermediate step in the Optimal
processing
5 6 7 8 n
M[5|4]M[6|5]
M[7|6]M[8|7]M[5|5]
M[6|6]M[7|7]
6. Prediction is an integral part of the KF• And it is based entirely on the Dynamical Model!!!
7. After a “long” time (as n →∞) the KF reaches “steady-state”operation… and the KF becomes a Linear Time-Invariant filter
• M[n|n] and M[n|n – 1] both become constant• … but still have M[n|n – 1] > M[n|n]• Thus, the gain k[n] becomes constant, too.
3
8. The KF creates an uncorrelated sequence… the innovations.• Can view the innovations as “an equivalent input sequence”• Or… if we view the innovations as the output, then the steady-state KF is
a LTI whitening filter (need state-state to get constant-power innovations)
9. The KF is optimal for the Gaussian Case (minimizes MSE)• If not Gaussian… the KF is still the optimal Linear MMSE estimator!!!
10. M[n|n – 1], M[n|n], and K[n] can be computed ahead of time (“off-line”)
• As long as the expected measurement variance σ2n is known
• This allows off-line data-independent assessment of KF performance
4
13.5 Kalman Filters vs. Wiener FiltersThey are hard to directly compare… They have different models
• Wiener assumes WSS signal + Noise
• Kalman assumes Dynamical Model w/ Observation Model
So… to compare we need to put them in the same context:
If we let:1. Consider only after much time has elapsed (as n →∞)
• Gives IIR Wiener case• Gives steady-state Kalman & Dynamic model becomes AR
2. For Kalman Filter, let σ2n be constant
• Observation noise becomes WSS
Then… Kalman = Wiener!!! See book for more details
5
13.7 Extended Kalman FilterThe dynamical and observation models we assumed when developing the Kalman filter were Linear models:
][]1[][ nnn BuAss +−=Dynamics:(A matrix is a linear operator)
][][][][ nnnn wsHx +=Observations:
However, many (most?) applications have a• Nonlinear State Equation
and/or• Nonlinear Observation Equation
Solving for the Optimal Kalman filter for the nonlinear model case is generally intractable!!!
The “Extended Kalman Filter” is a sub-optimal approach that linearizes the model(s) and then applies the standard KF
6
EKF Motivation: A/C Tracking with Radar
Case #1: Dynamics are linear but observations are nonlinear.
Recall the constant-velocity model for an aircraft. Define the state in rectangular coordinates:

s[n] = [ r_x[n]  r_y[n]  v_x[n]  v_y[n] ]^T        (A/C positions in m, velocities in m/s)

Dynamics Model:  s[n] = A s[n−1] + B u[n],  with

A = [ 1 0 ∆ 0 ]        B = [ 0 0 ]
    [ 0 1 0 ∆ ]            [ 0 0 ]
    [ 0 0 1 0 ]            [ 1 0 ]
    [ 0 0 0 1 ]            [ 0 1 ]

For rectangular coordinates the state equation is linear.
7
But… the choice of rectangular coordinates makes the radar’s observations nonlinearly related to the state. A radar can observe range and bearing (i.e., the angle to the target) — and radial and angular velocities, which we will ignore here.
So the observation equations – relating the observation to the state – are given by:

x[n] = [ R[n] ]  =  [ sqrt( r_x²[n] + r_y²[n] )    ]  +  [ w_R[n] ]
       [ β[n] ]     [ tan^{-1}( r_y[n] / r_x[n] )  ]     [ w_β[n] ]

(Geometry: the target sits at (r_x, r_y) relative to the radar, at range R and bearing β.)
For rectangular state coordinates the observation equation is non-linear.
8
Case #2: Observations are linear but dynamics are nonlinear.
If we choose the state to be in polar form then the observations will be linear functions of the state… so maybe then we won’t have a problem??? WRONG!!!

s[n] = [ R[n]  β[n]  S[n]  α[n] ]^T        (A/C range & bearing; A/C speed & heading)

Observation Model — the observation is linear:

x[n] = [ 1 0 0 0 ] s[n] + [ w_R[n] ]
       [ 0 1 0 0 ]        [ w_β[n] ]
9
But… the dynamics model is now non-linear: writing the constant-velocity propagation in polar coordinates gives

R[n] = sqrt( ( R[n−1]cos β[n−1] + ∆ S[n−1]cos α[n−1] )² + ( R[n−1]sin β[n−1] + ∆ S[n−1]sin α[n−1] )² )
β[n] = tan^{-1}( ( R[n−1]sin β[n−1] + ∆ S[n−1]sin α[n−1] ) / ( R[n−1]cos β[n−1] + ∆ S[n−1]cos α[n−1] ) )
S[n] = sqrt( ( S[n−1]cos α[n−1] + u_x[n] )² + ( S[n−1]sin α[n−1] + u_y[n] )² )
α[n] = tan^{-1}( ( S[n−1]sin α[n−1] + u_y[n] ) / ( S[n−1]cos α[n−1] + u_x[n] ) )

In each of these cases… we can’t apply the standard KF because it relies on the assumption of linear state and observation models!!!
10
Nonlinear Models
We state here the case where both the state and observation equations are nonlinear:

s[n] = a( s[n−1] ) + B u[n]
x[n] = h_n( s[n] ) + w[n]

where a(·) and h_n(·) are both nonlinear functions mapping a vector to a vector.
11
What To Do When Facing a Non-Linear Model?1. Go back and re-derive the MMSE estimator for the the nonlinear
case to develop the “your-last-name-here filter”??• Nonlinearities don’t preserve Gaussian so it will be hard to derive…• There has been some recent progress in this area: “particle filters”
2. Give up and try to convince your company’s executives and the FAA (Federal Aviation Administration) that tracking airplanes is not that important??
• Probably not a good career move!!!
3. Argue that you should use an extremely dense grid of radars networked together??
• Would be extremely expensive… although with today’s efforts in sensor networks this may not be so far-fetched!!!
4. Linearize each nonlinear model using a 1st order Taylor series?• Yes!!!• Of course, it won’t be optimal… but it might give the required
performance!
12
Linearization of Models

State:         a( s[n−1] ) ≈ a( ŝ[n−1|n−1] ) + A[n−1] ( s[n−1] − ŝ[n−1|n−1] ),
               with A[n−1] ≜ ∂a/∂s[n−1] evaluated at s[n−1] = ŝ[n−1|n−1]

Observation:   h_n( s[n] ) ≈ h_n( ŝ[n|n−1] ) + H[n] ( s[n] − ŝ[n|n−1] ),
               with H[n] ≜ ∂h_n/∂s[n] evaluated at s[n] = ŝ[n|n−1]
13
Using the Linearized Models

s[n] = A[n−1] s[n−1] + B u[n] + [ a( ŝ[n−1|n−1] ) − A[n−1] ŝ[n−1|n−1] ]
x[n] = H[n] s[n] + w[n] + [ h_n( ŝ[n|n−1] ) − H[n] ŝ[n|n−1] ]

Just like what we did in the linear case, except now we have a time-varying A matrix, plus a new additive term… but it is known at each step. So… in terms of development we can imagine that we just subtract off this known part ⇒ result: this part has no real impact!
1. The resulting EKF iteration is virtually the same – except there is a “linearization” step.
2. We can no longer do data-free, off-line performance iteration: H[n] and A[n−1] are computed on each iteration using the data-dependent estimate and prediction.
14
Extended Kalman Filter (Vector-Vector)
Initialization:   ŝ[−1|−1] = µ_s,   M[−1|−1] = C_s
Prediction:       ŝ[n|n−1] = a( ŝ[n−1|n−1] )
Linearizations:   A[n−1] = ∂a/∂s[n−1] |_{s[n−1]=ŝ[n−1|n−1]},     H[n] = ∂h_n/∂s[n] |_{s[n]=ŝ[n|n−1]}
Pred. MSE:        M[n|n−1] = A[n−1] M[n−1|n−1] A^T[n−1] + B Q B^T
Kalman Gain:      K[n] = M[n|n−1] H^T[n] ( C[n] + H[n] M[n|n−1] H^T[n] )^{-1}
Update:           ŝ[n|n] = ŝ[n|n−1] + K[n] ( x[n] − h_n( ŝ[n|n−1] ) )
Est. MSE:         M[n|n] = ( I − K[n] H[n] ) M[n|n−1]
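A minimal sketch of one EKF iteration for the constant-velocity state with range/bearing observations (all numerical values, including the measurement and initial state, are assumed purely for illustration):

```python
import numpy as np

# Sketch: one EKF step with linear dynamics and nonlinear range/bearing observation.
dt = 1.0
A = np.array([[1,0,dt,0],[0,1,0,dt],[0,0,1,0],[0,0,0,1]], dtype=float)
B = np.array([[0,0],[0,0],[1,0],[0,1]], dtype=float)
Q = 0.001*np.eye(2)                         # driving-noise covariance (assumed)
C = np.diag([0.1, 0.01])                    # range/bearing noise covariance (assumed)

def h(s):                                    # nonlinear observation: range, bearing
    rx, ry = s[0], s[1]
    return np.array([np.hypot(rx, ry), np.arctan2(ry, rx)])

def H_jac(s):                                # Jacobian of h evaluated at the prediction
    rx, ry = s[0], s[1]
    R2 = rx**2 + ry**2; R = np.sqrt(R2)
    return np.array([[rx/R, ry/R, 0, 0],
                     [-ry/R2, rx/R2, 0, 0]])

s_hat = np.array([10.0, -5.0, 0.2, 0.2]); M = 100*np.eye(4)   # previous estimate / MSE
x_meas = np.array([11.0, -0.45])                              # assumed new measurement
s_pred = A @ s_hat                                            # prediction
M_pred = A @ M @ A.T + B @ Q @ B.T                            # prediction MSE
H = H_jac(s_pred)                                             # linearization
K = M_pred @ H.T @ np.linalg.inv(C + H @ M_pred @ H.T)        # Kalman gain
s_hat = s_pred + K @ (x_meas - h(s_pred))                     # update with nonlinear h
M = (np.eye(4) - K @ H) @ M_pred                              # estimation MSE
print(s_hat)
```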
1
13.8 Signal Processing Examples
Ex. 13.3 Time-Varying Channel Estimation
(Diagram: transmitter → direct path and multipath → receiver; input v(t), output y(t).)

y(t) = ∫_0^T h_t(τ) v(t − τ) dτ         (T is the maximum delay)

Model using a time-varying D-T FIR system. The channel changes with time if there is relative motion between Rx and Tx, or reflectors move/change with time:

y[n] = Σ_{k=0}^{p} h_n[k] v[n−k]        (coefficients change at each n to model the time-varying channel)
2
In communication systems, multipath channels degrade performance(Inter-symbol interference (ISI), flat fading, frequency-selective fading, etc.)
Need To: First estimate the channel coefficients Second Build an Inverse Filter or Equalizer
2 Broad Scenarios:1. Signal v(t) being sent is known (Training Data)2. Signal v(t) being sent is not known (Blind Channel Est.)
One method for scenario #1 is to use a Kalman Filter:
State to be estimated is h[n] = [hn[0] hn[p]]T
(Note: h here is no longer used to notate the observation model here)
3
Need State Equation: assume the FIR tap coefficients change slowly,

h[n] = A h[n−1] + u[n]

with A assumed known (that is a weakness!!). Assume the FIR taps are uncorrelated with each other (<uncorrelated scattering>), so A, Q, C_h, cov{h[−1]} = M[−1|−1], and cov{u[n]} are all diagonal.
4
Need Observation Equation: we have a measurement model from the convolution view,

x[n] = Σ_{k=0}^{p} h_n[k] v[n−k] + w[n]        (w[n] zero-mean WGN with variance σ²; v[n] is the known training signal)

x[n] = v_n^T h[n] + w[n]

where the observation row v_n^T is made up of samples of the known transmitted signal and the state vector h[n] holds the filter coefficients.
5
Simple Specific Example: p = 2 (1 direct path, 1 multipath)

h[n] = A h[n−1] + u[n],    A = diag( 0.99, 0.999 ),    Q = cov{u[n]} = diag( 0.0001, 0.0001 )

(Plot: a typical realization of the channel coefficients; the book doesn’t state how the initial coefficients were chosen for this realization. Note: h_n[0] decays faster, and the random perturbation is small.)
6
Known Transmitted Signal
Noise-Free Received Signal<It is a bit odd that the received
signal is larger than the transmitted signal>
Noisy Received SignalThe variance of the noise in the measurement model is σ2 = 0.1
7
Estimation Results Using Standard Kalman Filter
Initialization:  ĥ[−1|−1] = [0 0]^T,   M[−1|−1] = 100·I,   σ² = 0.1 — chosen to reflect that little prior knowledge is available.
In theory we said that we initialize to the a priori mean, but in practice it is common to just pick some arbitrary initial value and set the initial covariance quite high – this forces the filter to start out trusting the data a lot!
Transient due to wrong IC
Eventually Tracks Well!!hn[0]
hn[1]
8
Kalman Filter Gains
Decay down relies more on modelGain is zero when signal is noise only
Kalman Filter MMSE
Filter Performance improves with time
9
Example: Radar Target Tracking
State Model: constant-velocity A/C model, s[n] = A s[n−1] + u[n], with

s[n] = [ r_x[n]  r_y[n]  v_x[n]  v_y[n] ]^T,    A = [ 1 0 ∆ 0 ; 0 1 0 ∆ ; 0 0 1 0 ; 0 0 0 1 ],
u[n] = [ 0  0  u_x[n]  u_y[n] ]^T,    Q = cov{u} = diag( 0, 0, σ_u², σ_u² )
(velocity perturbations due to wind, slight speed corrections, etc.)

Observation Model: noisy range/bearing radar measurements,

x[n] = [ sqrt( r_x²[n] + r_y²[n] ) ;  tan^{-1}( r_y[n] / r_x[n] ) ] + [ w_R[n] ; w_β[n] ]     (bearing in radians)

For this simple example, assume  C = cov{w} = diag( σ_R², σ_β² ).
10
Extended Kalman Filter Issues
1. Linearization of the observation model (see book for details) Calculate by hand, program into the EKF to be evaluated each iteration
2. Covariance of State Driving Noise Assume wind gusts, etc. are as likely to occur in any direction w/ same
magnitude ! model as indep. w/ common variance
Need the following:
==
2
2
000
000
0000
0000
cov
u
u
σ
σuQ
σu = what??? Note: ux[n]/ ∆ = acceleration from n-1 to n
So choose σu in m/s so that σu/ ∆ gives a reasonable range of accelerations for the
type of target expected to track
11
3. Covariance of Measurement Noise The DSP engineers working on the radar usually specify this or build
routines into the radar to provide time updated assessments of range/bearing accuracy
Usually assume to be white and zero-mean Can use CRLBs for Range & Bearing
" Note: The CRLBs depend on SNR so the Range & Bearing measurement accuracy should get worse when the target is farther away
Often assume Range Error to be Uncorrelated with Bearing Error" So use C[n] = diagσR
2[n], σβ2[n] But best to derive joint CRLB to see if they are correlated
12
4. Initialization Issues Typically Convert first range/bearing into initial rx & ry values If radar provides no velocity info (i.e. does not measure Doppler) can
assume zero velocities Pick a large initial MSE to force KF to be unbiased
" If we follow the above two ideas, then we might pick the MSE for rx & ry based on statistical analysis of conversion of range/bearing accuracy into rx & ryaccuracies
Sometimes one radar gets a hand-off from some other radar or sensor" The other radar/sensor would likely hand-off its last track values so use
those as ICs for the initializing the new radar" The other radar/sensor would likely hand-off a MSE measure of the quality its
last track so use that as M[-1|-1]
13
State Model Example Trajectories: Constant-Velocity A/C Model
(Plot: Y position vs. X position for several simulated trajectories; the red line is the non-random constant-velocity trajectory. Parameters: ∆ = 1 sec, σ_u = 0.0316 m/s (σ_u² = 0.001 m²/s²), r_x[−1] = 10 m, r_y[−1] = −5 m, and initial velocities of magnitude 0.2 m/s.)
14
Observation Model Example Measurements
(Plots: range R in meters and bearing β in degrees vs. sample index n, 0–100; the red lines are the noise-free measurements. Noise parameters: σ_R² = 0.1 m² (σ_R = 0.3162 m), σ_β² = 0.01 rad² (σ_β = 0.1 rad ≈ 5.7 deg). In reality these would get worse when the target is far away, due to a weaker returned signal.)
15
Measurements Directly Give a Poor Track
If we tried to directly convert the noisy range and bearing measurements into a track, this is what we’d get: not a very accurate track!!!! ⇒ Need a Kalman filter!!! But the observation model is nonlinear, so use the Extended KF!
(Plot: the converted measurements scattered around the true trajectory; note how the track gets worse far from the radar – angle accuracy converts into position accuracy in a way that depends on range.)
16
Extended Kalman Filter Gives Better Track
Note: The EKF was run with the correct values for Q and C (i.e., the Q and C used to simulate the trajectory and measurements were used to implement the Kalman filter).
Initialization: s[−1|−1] = [5 5 0 0]^T (picked arbitrarily), M[−1|−1] = 100·I (set large to assert that little is known a priori).
After about 20 samples the EKF attains track, even with the poor ICs and the linearization. The track gets worse near the end, where the measurements are worse; the MSE plots confirm both the acquisition of track and the degradation at the end.
17
MSE Plots Show Performance
First, a transient where things get worse; next, the EKF seems to obtain track; finally, the accuracy degrades due to range magnification of the bearing errors.