
Lecture 10

Håkan Hjalmarsson

February 6, 2019


Outline

1 Model structure selection

2 A model’s accuracy

3 Fundamental limitations: FIR

4 The structure of the asymptotic covariance matrix P

5 A refresher on orthogonal projection

6 A geometric interpretation of P

7 Structural results


The Identification Problem

Simply solve: $\min_\theta \sum_t \varepsilon_t^T(\theta)\,\Lambda^{-1}\varepsilon_t(\theta)$  ($\varepsilon_t(\theta)$: prediction error)

What’s the big deal?
- Experiment design / cost of complexity
- Value of sensors and communication
- Model structure selection
- Non-convex optimization problem


Model structure selection

Suppose we are given a set of model structures:

$$\Xi := \{\mathcal{G}(\rho) : \rho \in D_\rho \subset \mathbb{R}\}$$

Which model structure $\mathcal{G}$ should we use?

ML fails:

$$(\hat\rho,\ \hat g_{\mathcal{G}(\rho)}) = \arg\min_{\rho,\,g} L(g) \quad \text{s.t. } g \in \mathcal{G}(\rho)$$

has solution $g = g_{LS}$ whenever at least one model structure contains $g_{LS}$.


Model structure selection

Methods:
- Cross-validation:
  1. Estimate each $\theta^{(i)}$ with training data
  2. Choose $\hat i = \arg\max_i p_i(y; \theta^{(i)})$ on test data
- Residual analysis (Section 16.6 in Ljung)
- Hypothesis testing
- Information-based criteria (AIC, BIC, MDL, ..., Section 16.4 in Ljung)
- Confidence regions


Residual analysis

Check: is $\varepsilon(t) := \varepsilon(t, \hat\theta_N)$ white and independent of the input?


Residual analysis: Whiteness test

Correlation with past residuals: $\varphi(t) = [\varepsilon(t-1)\ \ldots\ \varepsilon(t-M)]^T$

$$\frac{1}{\sqrt N}\sum_{t=1}^N \varphi(t)\varepsilon(t) = \sqrt N \begin{bmatrix} R_\varepsilon^N(1) \\ \vdots \\ R_\varepsilon^N(M) \end{bmatrix} = \sqrt N\, R_\varepsilon^{N,M} \sim \mathcal N(0,\ \lambda_e^2\, I)$$

Test statistic:

$$\frac{N}{\lambda_e^2}\, \|R_\varepsilon^{N,M}\|^2 \sim \chi^2(M) \quad \text{asymptotically}$$

The noise variance estimate $\hat\lambda_e := \frac1N \sum_{t=1}^N \varepsilon^2(t) = R_\varepsilon^N(0)$ gives

$$\zeta_{N,M} = \frac{N\, \|R_\varepsilon^{N,M}\|^2}{(R_\varepsilon^N(0))^2} \sim \chi^2(M)$$

Reject the hypothesis that the residuals are white if $\zeta_{N,M} > \chi^2_\alpha(M)$.
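To make the test concrete, here is a minimal numerical sketch (not from the slides; `eps` and all other names are illustrative):

```python
# A minimal sketch of the chi^2 whiteness test, assuming `eps` holds the
# residual sequence eps(t, theta_N) from an identified model.
import numpy as np
from scipy.stats import chi2

def whiteness_test(eps, M, alpha=0.05):
    """Return (zeta, threshold, reject) for the whiteness test."""
    N = len(eps)
    R0 = np.mean(eps**2)  # R_eps^N(0), the noise variance estimate
    # Sample covariances R_eps^N(k) = (1/N) sum_t eps(t) eps(t-k), k = 1..M
    R = np.array([np.dot(eps[k:], eps[:-k]) / N for k in range(1, M + 1)])
    zeta = N * np.sum(R**2) / R0**2          # zeta_{N,M}
    threshold = chi2.ppf(1 - alpha, df=M)    # chi^2_alpha(M)
    return zeta, threshold, zeta > threshold
```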


Residual analysis: Cross-correlation test

Use past inputs instead: $\varphi(t) = [u(t-1)\ \ldots\ u(t-M)]^T$

$$\frac{1}{\sqrt N}\sum_{t=1}^N \varphi(t)\varepsilon(t) = \sqrt N \begin{bmatrix} R_{\varepsilon u}^N(1) \\ \vdots \\ R_{\varepsilon u}^N(M) \end{bmatrix} = \sqrt N\, R_{\varepsilon u}^{N,M}$$

Assume the residuals are white (general case in Ljung):

$$\sqrt N\, R_{\varepsilon u}^{N,M} \sim \mathcal N(0,\ \lambda_e E[\varphi(t)\varphi^T(t)])$$

Test statistic:

$$\zeta^u_{N,M} := \frac{N}{R_\varepsilon(0)}\, (R_{\varepsilon u}^{N,M})^T \left(\frac1N \sum_{t=1}^N \varphi(t)\varphi^T(t)\right)^{-1} R_{\varepsilon u}^{N,M} \sim \chi^2(M)$$
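A companion sketch of the cross-correlation test, under the same illustrative naming (`u` is the input sequence):

```python
# A hedged sketch of the input/residual cross-correlation test; assumes
# white residuals, matching the slide's simplification.
import numpy as np
from scipy.stats import chi2

def cross_correlation_test(eps, u, M, alpha=0.05):
    N = len(eps)
    R0 = np.mean(eps**2)                          # R_eps(0)
    # phi(t) = [u(t-1) ... u(t-M)]^T for t = M..N-1 (0-indexed)
    Phi = np.column_stack([u[M - k:N - k] for k in range(1, M + 1)])
    Reu = Phi.T @ eps[M:] / N                     # R_{eps u}^{N,M}
    S = Phi.T @ Phi / N                           # (1/N) sum phi phi^T
    zeta_u = N / R0 * Reu @ np.linalg.solve(S, Reu)
    threshold = chi2.ppf(1 - alpha, df=M)
    return zeta_u, threshold, zeta_u > threshold
```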


Model error modeling

$$E(g(\hat\theta_N)) := Y - \Phi g(\hat\theta_N) = \Phi\eta + V, \quad \eta \in \mathbb R^M$$

$$\hat\eta = (\Phi^T\Phi)^{-1}\Phi^T E$$

Assuming $\eta = 0$:

$$\sqrt N\,\hat\eta \sim \mathcal N(0,\ \lambda_e(\Phi^T\Phi)^{-1}), \qquad \frac{N}{\lambda_e}\,\hat\eta^T\Phi^T\Phi\,\hat\eta \sim \chi^2(M)$$

$$\frac{N}{\lambda_e}\,\hat\eta^T\Phi^T\Phi\,\hat\eta = \frac{N}{\lambda_e}\,E^T\Phi(\Phi^T\Phi)^{-1}\Phi^T E$$

Same as the cross-correlation statistic.


Information criteria

Akaike Information Criterion (AIC): choose $\mathcal M_i$ with the highest

$$\mathrm{AIC}(i) = \ln p(y; \hat\theta^{(i)}) - \dim(\theta^{(i)})$$

Bayesian Information Criterion (BIC):

$$\mathrm{BIC}(i) = \ln p(y; \hat\theta^{(i)}) - \dim(\theta^{(i)}) \ln N, \quad N := \dim(y)$$

BIC gives a consistent estimate of the model order.
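A minimal selection sketch (illustrative names; `loglik[i]` and `dims[i]` are assumed to hold $\ln p(y;\hat\theta^{(i)})$ and $\dim(\theta^{(i)})$ for each candidate structure):

```python
# Pick the model structure maximizing AIC or BIC as defined above.
import numpy as np

def select_structure(loglik, dims, N, criterion="AIC"):
    loglik = np.asarray(loglik, dtype=float)
    dims = np.asarray(dims, dtype=float)
    if criterion == "AIC":
        score = loglik - dims             # ln p(y; theta_i) - dim(theta_i)
    else:                                 # "BIC": consistent in the order
        score = loglik - dims * np.log(N)
    return int(np.argmax(score))
```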


Model comparisons

Suppose we know that

$$Y = \Phi g_o + V \in \mathbb R^N$$

- $\Phi \in \mathbb R^{N\times n}$ known and deterministic
- $V \sim \mathcal N(0, \lambda_e I)$
- ML estimate: $g_{LS} = (\Phi^T\Phi)^{-1}\Phi^T Y$

We want to curb the high variance by using another model structure. Let $g_\mathcal{G} \in \mathbb R^{n_\mathcal{G}}$ be the minimizer in $\mathcal G \subset \mathbb R^n$.


Hypothesis testing

$$Y = \Phi g_o + V, \quad V \sim \mathcal N(0, \lambda_e I)$$

$$L(g) := (Y - \Phi g)^T(Y - \Phi g) = (g - g_{LS})^T R\, (g - g_{LS}) + L(g_{LS}), \qquad R := \Phi^T\Phi$$

$$g_{LS} := R^{-1}\Phi^T Y = g_o + R^{-1}\Phi^T V$$

Hence

$$L(g_{LS}) = \big(\Phi g_o + V - \Phi(g_o + R^{-1}\Phi^T V)\big)^T\big(\Phi g_o + V - \Phi(g_o + R^{-1}\Phi^T V)\big) = V^T \underbrace{(I_{N\times N} - \Phi(\Phi^T\Phi)^{-1}\Phi^T)}_{\text{projection matrix}}\, V \sim \lambda_e\, \chi^2(N - n)$$

Similarly

$$L(g_\mathcal{G}) - L(g_{LS}) = (g_\mathcal{G} - g_{LS})^T R\, (g_\mathcal{G} - g_{LS}) \sim \lambda_e\, \chi^2(n - n_\mathcal{G})$$

(at least asymptotically in $N$). One can also show that these two quantities are independent.


Hypothesis testing

Hence

$$F := \frac{\big(L(g_\mathcal{G}) - L(g_{LS})\big)\big/\big(\lambda_e(n - n_\mathcal{G})\big)}{L(g_{LS})\big/\big(\lambda_e(N - n)\big)}$$

Thus, if $g \in \mathcal G$,

$$F \sim F(n - n_\mathcal{G},\ N - n) \quad \text{(at least asymptotically in } N)$$

Standard F-test: reject the hypothesis if $F > F^{-1}_p(n - n_\mathcal{G},\ N - n)$.
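A sketch of this F-test for linear-in-parameters structures (illustrative names; `Phi` is the full $N\times n$ regressor and `PhiG` spans the structured subset $\mathcal G$, assumed nested in the full model):

```python
# Structured-vs-unstructured F-test; rejects the structured model when F
# exceeds the p-quantile of the F(n - nG, N - n) distribution.
import numpy as np
from scipy.stats import f as f_dist

def structure_f_test(Y, Phi, PhiG, p=0.95):
    N, n = Phi.shape
    nG = PhiG.shape[1]
    # L(g) = ||Y - Phi g||^2 evaluated at the least-squares minimizers
    L_full = np.sum((Y - Phi @ np.linalg.lstsq(Phi, Y, rcond=None)[0]) ** 2)
    L_red = np.sum((Y - PhiG @ np.linalg.lstsq(PhiG, Y, rcond=None)[0]) ** 2)
    F = ((L_red - L_full) / (n - nG)) / (L_full / (N - n))
    threshold = f_dist.ppf(p, n - nG, N - n)
    return F, threshold, F > threshold
```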


Information criteria

$$\mathrm{AIC}(\mathcal G) = L(g_\mathcal{G})\left(1 + \frac{2 n_\mathcal{G}}{N}\right) \tag{1}$$

Structured vs unstructured estimate: the unstructured estimate is preferred if

$$L(g_\mathcal{G})\left(1 + \frac{2 n_\mathcal{G}}{N}\right) > L(g_{LS})\left(1 + \frac{2n}{N}\right)$$

i.e. if

$$L(g_\mathcal{G}) > L(g_{LS})\,\frac{1 + \frac{2n}{N}}{1 + \frac{2 n_\mathcal{G}}{N}} \approx \left(1 + \frac{2(n - n_\mathcal{G})}{N}\right) L(g_{LS})$$


Confidence regions

$$C(p) = \left\{g : (g - g_{LS})^T R\, (g - g_{LS}) \le n\,\lambda_e\, F^{-1}_p(n,\ N - n)\right\}$$

Reject $\mathcal G$ if $g_\mathcal{G} \notin C(p)$, i.e. if

$$(g_\mathcal{G} - g_{LS})^T R\, (g_\mathcal{G} - g_{LS}) > \lambda_e\, n\, F^{-1}_p(n,\ N - n)$$


Model structure selection - Summary

F-test:
$$F > F^{-1}_p(n - n_\mathcal{G},\ N - n)$$

Cross-correlation test: its statistic compared with
$$F^{-1}_p(n - n_\mathcal{G},\ N - n)$$

AIC:
$$L(g_\mathcal{G}) > \left(1 + \frac{2(n - n_\mathcal{G})}{N}\right) L(g_{LS})$$

Confidence region test:
$$(g_\mathcal{G} - g_{LS})^T R\, (g_\mathcal{G} - g_{LS}) > n\,\lambda_e\, F^{-1}_p(n,\ N - n)$$


Model structure selection revisited

$$L(g) = (g - g_{LS})^T R\, (g - g_{LS}) + L(g_{LS})$$

$$V(g) := (g - g_{LS})^T R\, (g - g_{LS}) = \|\Phi g - \Phi g_{LS}\|^2 = \left\|\hat Y(g) - \hat Y(g_{LS})\right\|^2$$

- F-test: $V(g_\mathcal{G}) > (n - n_\mathcal{G})\, F^{-1}_p(n - n_\mathcal{G},\ N - n)\, \lambda_e$
- Cross-corr. test: $V(g_\mathcal{G}) > (n - n_\mathcal{G})\, F^{-1}_p(n - n_\mathcal{G},\ N - n)\, \lambda_e$
- AIC: $V(g_\mathcal{G}) > (n - n_\mathcal{G})\, \frac{2(N - n)}{N}\, \lambda_e$
- Conf. region test: $V(g_\mathcal{G}) > n\, F^{-1}_p(n,\ N - n)\, \lambda_e$

Same statistic, (slightly) different thresholds: a check of how well the structured predictor matches the unstructured predictor.


Model structure selection revisited

Why does this hold? Cross-correlation statistic:

$$\frac{N}{\lambda_e}\, E^T\Phi(\Phi^T\Phi)^{-1}\Phi^T E$$

$$\Phi^T E = \Phi^T(Y - \Phi g(\hat\theta_N)) = \Phi^T\big(Y - \Phi g_{LS} + \Phi(g_{LS} - g(\hat\theta_N))\big) = \Phi^T\Phi\,(g_{LS} - g(\hat\theta_N))$$

since $\Phi^T(Y - \Phi g_{LS}) = 0$. Hence

$$\frac{N}{\lambda_e}\, E^T\Phi(\Phi^T\Phi)^{-1}\Phi^T E = \frac{N}{\lambda_e}\,(g_{LS} - g)^T\Phi^T\Phi\,(g_{LS} - g)$$


A model’s accuracy: FIR

$$y(t) = B_o(q)u(t) + e(t) = \varphi^T(t)\theta_o + e(t), \quad \varphi(t) = [u(t-1)\ \ldots\ u(t-n)]^T$$

Estimate:

$$\hat\theta_N = \left[\sum_{t=1}^N \varphi(t)\varphi^T(t)\right]^{-1} \sum_{t=1}^N \varphi(t)y(t) = \left[\sum_{t=1}^N \varphi(t)\varphi^T(t)\right]^{-1} \sum_{t=1}^N \varphi(t)\big(\varphi^T(t)\theta_o + e(t)\big) = \theta_o + \left[\sum_{t=1}^N \varphi(t)\varphi^T(t)\right]^{-1} \sum_{t=1}^N \varphi(t)e(t)$$

White Gaussian noise $\Rightarrow \sqrt N(\hat\theta_N - \theta_o) \sim \mathcal N(0,\ I_F^{-1})$

(Per sample) information matrix: $I_F = \frac{1}{N\lambda_e}\sum_{t=1}^N \varphi(t)\varphi^T(t)$
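A compact numerical sketch of the FIR least-squares estimate and the per-sample information matrix (all names illustrative; the sum runs over the samples where $\varphi(t)$ is defined):

```python
# FIR least squares: theta_hat and the per-sample information matrix I_F.
import numpy as np

def fir_ls(u, y, n, lambda_e):
    N = len(y)
    # phi(t) = [u(t-1) ... u(t-n)]^T, stacked as rows for t = n..N-1
    Phi = np.column_stack([u[n - k:N - k] for k in range(1, n + 1)])
    theta_hat = np.linalg.lstsq(Phi, y[n:], rcond=None)[0]
    I_F = Phi.T @ Phi / (Phi.shape[0] * lambda_e)
    return theta_hat, I_F
```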


A model’s accuracy: FIR

$$\sqrt N(\hat\theta_N - \theta_o) \sim \mathcal N(0,\ I_F^{-1}), \qquad I_F = \frac{1}{\lambda_e N}\sum_{t=1}^N \varphi(t)\varphi^T(t), \quad \varphi(t) = [u(t-1)\ \ldots\ u(t-n)]^T$$

Confidence ellipsoids:

$$\left\{\theta : (\theta_o - \theta)^T I_F\, (\theta_o - \theta) \le \frac{\chi^2_\alpha(n)}{N}\right\}$$

($\chi^2_\alpha(n)$ is the quantile function of the $\chi^2$ distribution with $n$ degrees of freedom.)

[Figure: confidence ellipsoids around $\theta_o$]

- Scales with the sample size
- $\chi^2_\alpha(n) \propto n$, so it scales with the number of parameters
- $I_F \propto 1/\lambda_e$, so it scales with the noise variance


A model’s accuracy: FIR

$$(\theta_o - \theta)^T I_F\, (\theta_o - \theta) = \frac{1}{\lambda_e N}\sum_{t=1}^N (\theta_o - \theta)^T\varphi(t)\varphi^T(t)(\theta_o - \theta)$$

$$\varphi^T(t)(\theta_o - \theta) = y(t) - e(t) - \varphi^T(t)\theta = \varepsilon(t, \theta) - e(t) \;\Rightarrow$$

$$(\theta_o - \theta)^T I_F\, (\theta_o - \theta) \approx \frac{1}{\lambda_e N}\sum_{t=1}^N \varepsilon^2(t, \theta) - \frac{1}{\lambda_e N}\sum_{t=1}^N e^2(t) \approx \frac{1}{\lambda_e} V_{id}(\theta) - 1$$

Confidence ellipsoid:

$$\left\{\theta : V_{id}(\theta) \le \frac{\lambda_e\,\chi^2_\alpha(n)}{N} + \lambda_e\right\}$$

- A level curve of the identification criterion
- Accurate estimate in directions where the prediction error is sensitive to parameter changes (input dependent)
- Large sample size ($N \to \infty$):

$$I_F \approx \frac{1}{\lambda_e} E[\varphi(t)\varphi^T(t)] = \frac{1}{\lambda_e}\begin{bmatrix} E[u^2(t)] & E[u(t)u(t-1)] & \ldots \\ E[u(t)u(t-1)] & E[u^2(t)] & \ldots \\ \vdots & \ddots & \ddots \end{bmatrix}$$

Strong correlations $\Rightarrow$ poor conditioning $\Rightarrow$ poor information


A model’s accuracy: FIR

Frequency function estimate:

$$G(e^{i\omega}, \theta) = B(e^{i\omega}) = \Gamma^T(e^{i\omega})\theta, \quad \text{where } \Gamma(q) = [q^{-1}\ \ldots\ q^{-n}]^T$$

Linear in $\theta$ $\Rightarrow$ error:

$$\mathrm{Cov}\, G(e^{i\omega}, \hat\theta_N) = \frac{1}{N}\,\Gamma^T(e^{i\omega})\, I_F^{-1}\, \Gamma(e^{i\omega})$$

White input: $I_F = \frac{\lambda_u}{\lambda_e} I$ $\Rightarrow$

$$\mathrm{Cov}\, G(e^{i\omega}, \hat\theta_N) = \frac{1}{N}\,\Gamma^T(e^{i\omega})\,\frac{\lambda_e}{\lambda_u} I\, \Gamma(e^{i\omega}) = n\,\frac{\lambda_e}{N\lambda_u}$$

- Noise power ($\lambda_e$) to signal energy ($N\lambda_u$)
- Scaling with the number of parameters $n$

Approximate expression for general input and noise spectra:

$$\mathrm{Cov}\, G(e^{i\omega}, \hat\theta_N) \approx \frac{n}{N}\,\frac{\Phi_e(\omega)}{\Phi_u(\omega)}$$
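A sketch that evaluates this covariance on a frequency grid, taking $I_F$ as given (e.g. from the FIR sketch above; names illustrative):

```python
# Variance of the frequency-function estimate: (1/N) Gamma* I_F^{-1} Gamma.
import numpy as np

def freq_response_variance(I_F, N, omegas):
    n = I_F.shape[0]
    P = np.linalg.inv(I_F)  # asymptotic covariance of sqrt(N)(theta_N - theta_o)
    var = []
    for w in omegas:
        Gamma = np.exp(-1j * w * np.arange(1, n + 1))  # [e^{-iw}, ..., e^{-inw}]
        var.append((Gamma.conj() @ P @ Gamma).real / N)
    return np.array(var)
```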


Error bounds

Data (input is white with variance 1): [figure: input and output sequences, t = 0..100]

Model: $y(t) = \dfrac{b_1}{1 + f_1 q^{-1} + f_2 q^{-2}}\, u(t) + e(t)$

Bode plots with error bounds: [figure]


Fundamental limitations: FIR

Cramér-Rao lower bound: $I_F^{-1}$ is the smallest possible covariance matrix among all unbiased estimators.

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} \mathrm{Cov}\, G(e^{i\omega}, \hat\theta_N)\,\Phi_u(\omega)\, d\omega = \frac{1}{2\pi N}\int_{-\pi}^{\pi} \Gamma^T(e^{i\omega})\, I_F^{-1}\, \Gamma(e^{i\omega})\,\Phi_u(\omega)\, d\omega = \mathrm{Tr}\left\{I_F^{-1}\,\frac{1}{2\pi N}\int_{-\pi}^{\pi} \Gamma(e^{i\omega})\Gamma^T(e^{i\omega})\,\Phi_u(\omega)\, d\omega\right\}$$

but $\varphi(t) = [u(t-1)\ \ldots\ u(t-n)]^T = \Gamma(q)u(t)$, so

$$I_F \approx \frac{1}{\lambda_e} E[\varphi(t)\varphi^T(t)] = \frac{1}{2\pi\lambda_e}\int_{-\pi}^{\pi} \Gamma(e^{i\omega})\,\Phi_u(\omega)\,\Gamma^*(e^{i\omega})\, d\omega$$

Water bed effect:

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} \mathrm{Cov}\, G(e^{i\omega}, \hat\theta_N)\,\Phi_u(\omega)\, d\omega = \frac{\lambda_e}{N}\,\mathrm{Tr}\, I = \frac{n\lambda_e}{N}$$
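A numerical sanity check of this identity (a sketch with illustrative values, assuming a white input so that $I_F = (\lambda_u/\lambda_e)I$ and $\Phi_u(\omega) = \lambda_u$):

```python
# Verify (1/2pi) int Cov G * Phi_u dw = n*lambda_e/N for a white input.
import numpy as np

n, N = 5, 1000
lambda_e, lambda_u = 0.1, 1.0
P = np.linalg.inv((lambda_u / lambda_e) * np.eye(n))   # I_F^{-1}

def cov_G(w):
    Gamma = np.exp(-1j * w * np.arange(1, n + 1))
    return (Gamma.conj() @ P @ Gamma).real / N

omegas = np.linspace(-np.pi, np.pi, 2000, endpoint=False)
# (1/2pi) * integral over [-pi, pi) equals the mean over a uniform grid
lhs = np.mean([cov_G(w) * lambda_u for w in omegas])
print(lhs, n * lambda_e / N)   # both equal 5e-4 here
```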


The waterbed effect: FIR

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} \mathrm{Cov}\, G(e^{i\omega}, \hat\theta_N)\,\Phi_u(\omega)\, d\omega = \frac{n\lambda_e}{N}$$

- Decreasing the error somewhere will increase the error somewhere else.
- The input spectrum acts as a weighting and controls the effect.
- A smaller error can be allowed where the input spectrum is large.
- Again the number of parameters pops up.


A model’s accuracy and fundamental limitations

The general case:

$$y(t) = G(q, \theta)u(t) + H(q, \theta)e(t)$$

- The error for small sample sizes is not well understood.
- Large sample size $\Rightarrow$

$$\varepsilon(t, \hat\theta_N) \approx \varepsilon(t, \theta_o) + \varphi^T(t)(\hat\theta_N - \theta_o)$$

where $\varphi(t) = \frac{d}{d\theta}\varepsilon(t, \theta_o) = -\frac{d}{d\theta}\hat y(t, \theta_o)$. Same as the FIR case!

$$\sqrt N(\hat\theta_N - \theta_o) \sim \mathcal N(0,\ I_F^{-1}), \qquad I_F = \frac{1}{\lambda_e} E[\varphi(t)\varphi^T(t)]$$

$$\mathrm{Cov}\, G(e^{i\omega}, \hat\theta_N) \approx \frac{n\, |H_o(e^{i\omega})|^2 \lambda_e}{N\,\Phi_u(\omega)}$$

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} \mathrm{Cov}\, G(e^{i\omega}, \hat\theta_N)\, \frac{\Phi_u(\omega)}{|H_o(e^{i\omega})|^2}\, d\omega = \frac{n\lambda_e}{N}$$


Connection to ML

$$V_N(\theta) = \frac{1}{N}\sum_{t=1}^N \ell(\varepsilon(t, \theta))$$

$$P = \kappa(\ell)\left(E[\varphi(t)\varphi^T(t)]\right)^{-1}$$

$\kappa(\ell)$ has replaced $\lambda_e$:

$$\kappa(\ell) = \frac{E[(\ell'(e_o(t)))^2]}{(E[\ell''(e_o(t))])^2}$$

- $\kappa = \lambda_e$ for $\ell(x) = x^2$
- $\kappa(\ell) \ge \kappa_o := \kappa(-\log f_e)$


Analysing the asymptotic covariance matrix

$$\sqrt N\big(\hat\theta_N - \theta_o\big) \in \mathrm{As}\mathcal N(0, P)$$

where $P$ also satisfies

$$P = \mathrm{AsCov}\,\hat\theta_N := \lim_{N\to\infty} N \cdot E\left[(\hat\theta_N - E\hat\theta_N)(\hat\theta_N - E\hat\theta_N)^T\right]$$


Introduction

An enormous amount of model error information is hidden in $P$:

Structural:
- Model structure (ARX, ARMAX, Box-Jenkins, non-linear, ...)
- Model order
- Open vs closed loop
- Input channels and input excitation
- Sensor channels
- Noise

System properties:
- Frequency function
- Impulse response coefficients
- Gains
- Poles and zeros
- Control applications


The structure of P

A non-singular $P$ can always be written as

$$P = [\langle \Psi, \Psi\rangle]^{-1}$$

where $\Psi : \mathbb C \to \mathbb C^{n\times m}$ for some integer $m > 0$ depending on the model structure, and where

$$\langle \Psi, \Phi\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Psi(e^{j\omega})\,\Phi^*(e^{j\omega})\, d\omega$$

(inner products between the rows, also known as a Gramian)


The structure of P: Example 1 - FIR models

$$y_t = \sum_{k=1}^n \theta_k\, u_{t-k} + e_t = \varphi_t^T\theta + e_t$$

$$P = \lambda_e\left[E[\varphi_t\varphi_t^T]\right]^{-1}$$

$$E[\varphi_t\varphi_t^T] = \begin{bmatrix} r_0 & r_1 & \ldots & r_{n-1} \\ r_1 & r_0 & \ldots & r_{n-2} \\ \vdots & & & \vdots \\ r_{n-1} & r_{n-2} & \ldots & r_0 \end{bmatrix} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \Gamma_n(e^{j\omega})\,\Phi_u(\omega)\,\Gamma_n^*(e^{j\omega})\, d\omega$$

where $\Gamma_n(q) = [q^{-1}\ \ldots\ q^{-n}]^T$

$$\Rightarrow\ P = \langle \Gamma_n\Phi_u^{1/2}/\sqrt{\lambda_e},\ \Gamma_n\Phi_u^{1/2}/\sqrt{\lambda_e}\rangle^{-1} = \langle\Psi, \Psi\rangle^{-1}$$
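A small sketch computing $P$ from an input autocovariance sequence (the AR(1)-type choice $r_k = \lambda_u a^{|k|}$ is illustrative, not from the slides):

```python
# P = lambda_e * (Toeplitz of input autocovariances)^{-1} for an FIR model.
import numpy as np
from scipy.linalg import toeplitz

def fir_P(n, lambda_e, lambda_u=1.0, a=0.7):
    r = lambda_u * a ** np.arange(n)       # r_0, ..., r_{n-1}
    return lambda_e * np.linalg.inv(toeplitz(r))
```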


The structure of P: Example 2 - NFIR models

$$y_t = \sum_{k=1}^n \theta_k\, u_{t-k} + \sum_{k=n+1}^{n+m} \theta_k\, (u_{t-(k-n)})^2 + e_t = \varphi_t^T\theta + e_t$$

$$\varphi_t = [u_{t-1}, \ldots, u_{t-n}, (u_{t-1})^2, \ldots, (u_{t-m})^2]^T = M(q)z_t$$

where

$$M(q) = \begin{bmatrix}\Gamma_n(q) & 0 \\ 0 & \Gamma_m(q)\end{bmatrix}, \qquad z_t = \begin{bmatrix} u_t \\ (u_t)^2 \end{bmatrix}$$

$$\Rightarrow\ \Psi(e^{i\omega}) = \frac{1}{\sqrt{\lambda_e}}\, M(e^{i\omega})\,\Phi_z^{1/2}(e^{i\omega})$$

where $\Phi_z^{1/2}$ is a Cholesky factor of $\Phi_z$, the spectrum of $z$.


The structure of P: Other quantities

Typically it is not the parameters $\theta$ directly we are interested in, but rather some function $J(\theta) : \mathbb R^{n\times 1} \to \mathbb C^{1\times q}$:
- frequency function
- impulse response
- system gain

$$\mathrm{AsCov}\, J(\hat\theta_N) = \lim_{N\to\infty} N\, E[(J(\hat\theta_N) - J(\theta_o))^*(J(\hat\theta_N) - J(\theta_o))] = \Lambda^*\, [\langle\Psi, \Psi\rangle]^{-1}\, \Lambda$$

where

$$\Lambda \triangleq J'(\theta_o) \in \mathbb C^{n\times q}$$


A refresher on orthogonal projection: Scalar case

Least squares estimation:

$$Y = \theta\Phi + E$$

($Y$ and $\theta$ are row vectors)

$$\hat Y = Y\Phi^T\left[\Phi\Phi^T\right]^{-1}\Phi$$

[Figure: $Y$ projected onto the row space of $\Phi$]

General case: let the rows of $X = [x_1\ x_2\ \ldots\ x_n]^T$ span a (closed) subspace $S_X$ of a Hilbert space $\mathcal H$. Then the projection of $f \in \mathcal H$ on $S_X$ is given by

$$\hat f = \langle f, X\rangle\, [\langle X, X\rangle]^{-1}\, X$$


A refresher on orthogonal projection: MV-case

Multi-output least squares estimation:

$$Y = \theta\Phi + E$$

$$\hat Y = Y\Phi^T\left[\Phi\Phi^T\right]^{-1}\Phi$$

Same equation as before: each row of $Y$ is projected on the row space of $\Phi$.

General case: the projection of $f_i \in \mathcal H$, $i = 1, \ldots, n$ on $S_X$ is given by the rows of

$$\hat f = \langle f, X\rangle\, [\langle X, X\rangle]^{-1}\, X$$

where $f = [f_1\ f_2\ \ldots\ f_n]^T$.


A refresher on orthogonal projection: The norm

$$\hat f = \langle f, X\rangle\, [\langle X, X\rangle]^{-1}\, X$$

The "norm" of the projection:

$$\langle \hat f, \hat f\rangle = \langle f, X\rangle\, [\langle X, X\rangle]^{-1}\, \langle X, f\rangle$$
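A tiny numeric illustration of the Gramian formulas, taking the Hilbert space to be plain $\mathbb R^m$ for concreteness (values illustrative):

```python
# Projection onto a row space via the Gramian, and its squared norm.
import numpy as np

def project(f, X):
    """Rows of f projected onto the row space of X: <f,X> [<X,X>]^-1 X."""
    G = X @ X.T                       # Gramian <X, X>
    return (f @ X.T) @ np.linalg.solve(G, X)

X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])      # rows span the (x1, x2) plane
f = np.array([[1.0, 2.0, 3.0]])
f_hat = project(f, X)                # -> [[1., 2., 0.]]
norm2 = (f_hat @ f_hat.T).item()     # <f_hat, f_hat> = 5.0
```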


A refresher on orthogonal projection

$$\langle \hat f, \hat f\rangle = \langle f, X\rangle\, [\langle X, X\rangle]^{-1}\, \langle X, f\rangle$$

Hmmm, didn't we just see something similar a minute ago?

$$\mathrm{AsCov}\, J(\hat\theta_N) = \Lambda^*\, [\langle\Psi, \Psi\rangle]^{-1}\, \Lambda$$

What if $\Lambda = \langle\Psi, \gamma\rangle$ for some function $\gamma$?

$$\mathrm{AsCov}\, J(\hat\theta_N) = \langle\hat\gamma, \hat\gamma\rangle = \langle\mathrm{Proj}_{S_\Psi}\gamma,\ \mathrm{Proj}_{S_\Psi}\gamma\rangle$$


A geometric interpretation of P

$$\mathrm{AsCov}\, J(\hat\theta_N) = \langle\mathrm{Proj}_{S_\Psi}\gamma,\ \mathrm{Proj}_{S_\Psi}\gamma\rangle$$

Scalar quantities: the asymptotic variance is the squared norm of $\gamma$ projected onto the subspace spanned by the rows of $\Psi$.

Why is this useful?
- Often $\gamma$ can be chosen so that it only depends on the quantity $J(\theta)$ of interest.
- The influence on $\Psi$ from model structure, model order, and experimental conditions is often simple to establish.

Decoupling! Examples will be used to illustrate this.


Structural results: Adding parameters

Example: FIR system

$$y_t = \sum_{k=1}^n \theta_k\, u_{t-k} + e_t = \theta\varphi_t^T + e_t$$

$$P = [\langle\Psi, \Psi\rangle]^{-1}, \qquad \Psi = \frac{1}{\sqrt{\lambda_e}}\,\Gamma_n\Phi_u^{1/2}$$

True order $n = n_o$, $J = J(\theta_1, \ldots, \theta_{n_o})$.
What happens with the accuracy if we over-model ($n > n_o$)?


Adding parameters: FIR example

$$\mathrm{AsCov}\, J(\hat\theta_N) = \begin{bmatrix}\Lambda \\ 0_{(n-n_o)\times q}\end{bmatrix}^* [\langle\Omega, \Omega\rangle]^{-1} \begin{bmatrix}\Lambda \\ 0_{(n-n_o)\times q}\end{bmatrix}$$

where

$$\Omega = \frac{\Gamma_n\Phi_u^{1/2}}{\sqrt{\lambda_e}} = \begin{bmatrix}\Gamma_{n_o} \\ \Gamma_{n_o,n}\end{bmatrix}\frac{\Phi_u^{1/2}}{\sqrt{\lambda_e}}$$

where

$$\Gamma_{m,n}(z) = [z^{-(m+1)}\ \ldots\ z^{-n}]^T.$$

The question is now how large $\Lambda^*\, [\langle\Psi, \Psi\rangle]^{-1}\, \Lambda$ is in comparison with the expression above.


Structural results: Adding parameters

Geometrical result: let $X$ and $Y$ be two subspaces of $L_2^m$ such that $X \subseteq Y \subseteq L_2^m$, and let $\gamma \in L_2^{q\times m}$. We then have the orthogonal decomposition

$$\mathrm{Proj}_Y\gamma = \mathrm{Proj}_X\gamma + \mathrm{Proj}_{X^\perp(Y)}\gamma$$

($X^\perp(Y)$ denotes the orthogonal complement of $X$ in $Y$.)

Hence

$$\langle\mathrm{Proj}_Y\gamma,\ \mathrm{Proj}_Y\gamma\rangle - \langle\mathrm{Proj}_X\gamma,\ \mathrm{Proj}_X\gamma\rangle = \langle\mathrm{Proj}_{X^\perp(Y)}\gamma,\ \mathrm{Proj}_{X^\perp(Y)}\gamma\rangle \ge 0$$


Adding parameters: FIR example

$$\mathrm{AsCov}\, J(\hat\theta_N) = \begin{bmatrix}\Lambda \\ 0_{(n-n_o)\times q}\end{bmatrix}^* [\langle\Omega, \Omega\rangle]^{-1} \begin{bmatrix}\Lambda \\ 0_{(n-n_o)\times q}\end{bmatrix}$$

where

$$\Omega = \frac{\Gamma_n\Phi_u^{1/2}}{\sqrt{\lambda_e}} = \begin{bmatrix}\Gamma_{n_o} \\ \Gamma_{n_o,n}\end{bmatrix}\frac{\Phi_u^{1/2}}{\sqrt{\lambda_e}} \triangleq \begin{bmatrix}\Psi \\ \Phi\end{bmatrix}$$

$X = S_\Psi \subseteq S_\Omega = Y$.

A property of a suitable $\gamma$: $\langle\Omega, \gamma\rangle = \begin{bmatrix}\Lambda \\ 0_{(n-n_o)\times q}\end{bmatrix}$, i.e. $\gamma \perp S_\Phi$.

When is there no variance increase?

$$0 = \mathrm{Proj}_{X^\perp(Y)}\gamma = \mathrm{Proj}_{S_\Psi^\perp(S_\Omega)}\gamma,\ \ \gamma \perp S_\Phi \ \Rightarrow\ S_\Psi^\perp(S_\Omega) \subset S_\Phi \ \Rightarrow\ S_\Phi \perp S_\Psi \ \Rightarrow\ \langle\Phi, \Psi\rangle = 0$$


Adding parameters: FIR example

No increase: if and only if $\langle\Phi, \Psi\rangle = 0$.

FIR example:

$$\langle\Phi, \Psi\rangle = \langle\Gamma_{n_o,n}\Phi_u^{1/2},\ \Gamma_{n_o}\Phi_u^{1/2}\rangle = R_{n_o,n} \triangleq \begin{bmatrix} r_{n_o} & r_{n_o-1} & \cdots & r_1 \\ r_{n_o+1} & r_{n_o} & \cdots & r_2 \\ \vdots & & & \vdots \\ r_{n-1} & r_{n-2} & \cdots & r_{n-n_o} \end{bmatrix}$$

No increase: if and only if $r_1 = \ldots = r_{n-1} = 0$.
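A numeric illustration of this condition (illustrative values, not from the slides): with a white input ($r_k = 0$ for $k \ge 1$) over-modeling does not inflate the variance of $\theta_1$, while a colored input does.

```python
# Variance of theta_1 = first diagonal entry of P = lambda_e * R^{-1}.
import numpy as np
from scipy.linalg import toeplitz

def var_theta1(r, lambda_e=0.1):
    return lambda_e * np.linalg.inv(toeplitz(r))[0, 0]

n_o, n = 2, 6
white = np.r_[1.0, np.zeros(n - 1)]     # r_k = 0 for k >= 1
colored = 0.6 ** np.arange(n)           # r_k = 0.6^k
print(var_theta1(white[:n_o]), var_theta1(white))      # equal: no increase
print(var_theta1(colored[:n_o]), var_theta1(colored))  # grows with n
```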


Structural results: Adding parameters

No increase: if and only if $\langle\Phi, \Psi\rangle = 0$.

The derivation is not tied to the FIR example.

Immediate generalization: adding parameters increases the asymptotic variance unless the new predictor gradients are orthogonal to the old ones.


L. Ljung, “Asymptotic variance expressions for identified black-box transfer function models,” IEEE Trans. Automatic Control, vol. 30, no. 9, pp. 834–844, 1985.

L. Ljung and Z. Yuan, “Asymptotic properties of black-box identification of transfer functions,” IEEE Trans. Automatic Control, vol. 30, no. 6, pp. 514–530, 1985.

H. Hjalmarsson and J. Mårtensson, “A geometric approach to variance analysis in system identification,” IEEE Transactions on Automatic Control, vol. 56, no. 5, pp. 983–997, May 2011.

J. Mårtensson, N. Everitt, and H. Hjalmarsson, “Covariance analysis in SISO linear systems identification,” Automatica, vol. 77, pp. 82–92, Mar 2017.
