Nonlinear Signal and Image Processing Theory Methods and Applications

download Nonlinear Signal and Image Processing Theory Methods and Applications

If you can't read please download the document

Transcript of Nonlinear Signal and Image Processing Theory Methods and Applications

Nonlinear_Signal_and_Image_Processing_Theory__Methods__and_Applications/Nonlinear_Signal_and_Image_Processing_Theory__Methods__and_Applications/1427ch1.pdf1Energy Conservation in Adaptive Filtering

Ali H. Sayed, Tareq Y. Al-Naffouri, and Vitor H. Nascimento

CONTENTS1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 The Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Energy-Conservation Relation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.4 Weighted Variance Relation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.5 Mean-Square Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.6 Mean-Square Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91.7 Steady-State Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .101.8 Small-Step-Size Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .121.9 Applications to Selected Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .151.10 Fourth-Order Moment Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .221.11 Long Filter Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .231.12 Adaptive Filters with Error Nonlinearities . . . . . . . . . . . . . . . . . . . . . . . . . . .231.13 An Interpretation of the Energy Relation . . . . . . . . . . . . . . . . . . . . . . . . . . . . .321.14 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .34Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .34References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .35

1.1 Introduction

The study of the steady-state and transient performance of adaptive filtersis a challenging task because of the nonlinear and stochastic nature of theirupdate equations (e.g., References 1 to 4). The purpose of this chapter is toprovide an overview of an energy-conservation approach to study the per-formance of adaptive filters in a unified manner.4 The approach is based onshowing that certain a priori and a posteriori errors maintain an energy balancefor all time instants.57 When examined under expectation, this energy bal-ance leads to a variance relation that characterizes the dynamics of an adaptivefilter.1014 An advantage of the energy framework is that it allows us to push

1

2004 by CRC Press LLC

2 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

the algebraic manipulations of variables to a limit, and to eliminate unneces-sary cross-terms before appealing to expectations. This is a useful step becauseit is usually easier to handle random variables algebraically than under ex-pectations, especially for higher-order moments. A second advantage of theenergy arguments is that they can be pursued without restricting the distri-bution of the input data. To illustrate this point, we have opted not to restrictthe regression data to Gaussian or white in most of the discussions below.Instead, all results are derived for arbitrary input distributions. Of course, byspecializing the results to particular distributions, some known results fromthe literature can be recovered as special cases of the general framework.

As for most adaptive filter analysis, progress is difficult without relyingon simplifying assumptions. In the initial part of our presentation, we deriveexact energy-conservation and variance relations that hold for a large classof adaptive filters without any approximations. Subsequent discussions willcall upon simplifying assumptions to make the analysis more tractable. Theassumptions tend to be reasonable for small step sizes and long filters.

1.2 The Data Model

Consider reference data {d(i)} and regression data {ui }, assumed related viathe linear regression model

d(i) = uiwo + v(i) (1.1)for some M 1 unknown column vector wo that we wish to estimate. Here uiis a regressor, taken as a row vector, and v(i) is measurement noise. Observethat we are using boldface letters to denote random quantities, which will beour convention throughout this chapter. Also, all vectors in our presentationare column vectors except for the regressor ui . In this way, the inner productbetween ui and wo is written simply as uiwo without the need for transpositionsymbols.

In Equation 1.1, {d(i), ui , v(i)} are random variables that satisfy the follow-ing conditions:

a. {v(i)} is zero-mean, independent and identically distributed withvariance E v2(i) = 2v .

b. v(i) is independent of u j for all i, j.c. The regressor ui is zero-mean and has covariance matrix

E uTi ui = Ru > 0.

(1.2)

In the first part of the chapter we focus on data-normalized adaptive filtersfor generating estimates for wo , specifically, on updates of the form

wi = wi1 + uTi

g[ui ]e(i), i 0, (1.3)

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 3

wheree(i) = d(i) ui wi1 (1.4)

is the estimation error at iteration i , and g[ui ] > 0 is some function of ui .Typical choices for g are

g[u] = 1 (LMS), g[u] = u2 (NLMS), g[u] = + u2 (-NLMS).

The initial condition w1 of Equation 1.3 is assumed to be independent ofall {d( j), u j , v( j)}. Later in the chapter we study adaptive filters with errornonlinearities in their update equations see Equation 1.52.

Our purpose is to examine the transient and steady-state performance ofsuch data-normalized filters in a unified manner (i.e., uniformly for all g).The first step in this regard is to establish an energy-conservation relationthat holds for a large class of adaptive filters, and then use it as the basis ofall subsequent analysis.

1.3 Energy-Conservation Relation

Let wi = wo wi denote the weight-error vector at iteration i , and let denote some M M positive-definite matrix. Define further the weighted apriori and a posteriori errors:

ea (i)= uiwi1, ep (i) = uiwi . (1.5)

When = I , we recover the standard definitions

ea (i) = ui wi1, ep(i) = ui wi . (1.6)

The freedom in selecting will be seen to be useful in characterizing severalaspects of the dynamic behavior of an adaptive filter. For now, we shall treat as an arbitrary weighting matrix.

It turns out that the errors {wi , wi1, ea (i), ep (i)} satisfy a fundamentalenergy-conservation relation. To arrive at the relation, we subtract wo fromboth sides of Equation 1.3 to obtain

wi = wi1 uTi

g[ui ]e(i), (1.7)

and then multiply Equation 1.7 by ui from the left to conclude that

ep (i) = ea (i) ui2g[ui ]

e(i), (1.8)

2004 by CRC Press LLC

4 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

where the notation ui2 denotes the squared weighted Euclidean norm ofui , specifically,

ui2 = uiuTi .

Relation 1.8 can be used to express e(i)/g[ui ] in terms of {ep (i), ea (i)} and toeliminate this term from Equation 1.7. Doing so leads to the equality

ui2 wi + uTi ea (i) = ui2 wi1 + uTi ep (i). (1.9)

By equating the weighted Euclidean norms of both sides of this equation, wearrive, after a straightforward calculation, at the relation:

ui2 wi2 +(ea (i)

)2 = ui2 wi12 + (ep (i))2 . (1.10)This energy relation is an exact result that shows how the energies of theweight-error vectors at two successive time instants are related to the energiesof the a priori and a posteriori estimation errors.* In addition, it follows frome(i) = ui wi1 + v(i), and from Equation 1.7, that the weight-error vectorsatisfies

wi =(

I uTi ui

g[ui ]

)wi1 u

Ti

g[ui ]v(i). (1.11)

1.4 Weighted Variance Relation

The result (Equation 1.10) with = I was developed in Reference 5 andsubsequently used in a series of works to study the robustness of adaptivefilters (e.g., References 6 through 9). It was later used in References 10 through12 to study the steady-state and tracking performance of adaptive filters. Theincorporation of a weighting matrix in References 13 and 14 turns out tobe useful for transient (convergence and stability) analysis.

In transient analysis we are interested in characterizing the time evolutionof the quantity Ewi2 , for some of interest (usually, = I or = Ru). Toarrive at this evolution, we use Equation 1.8 to replace ep (i) in Equation 1.10

* Later in Section 1.13 we provide an interpretation of the energy relation (Equation 1.10) in termsof Snells law for light propagation.

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 5

in terms of ea (i) and e(i). This step yields, after expanding and groupingterms,

ui2 wi2 = ui2 wi12 +2(ui2)2

g2[ui ]wi12uTi ui

+ 2(ui2)2g2[ui ]

v2(i) ui2

g[ui ]wi12uTi ui +uTi ui

+ 22 (ui2)

2

g2[ui ]v(i)ea (i) 2ui

2

g[ui ]v(i)ea (i). (1.12)

Assuming the event ui2 = 0 occurs with zero probability, we can eliminateui2 from both sides of Equation 1.12 and take expectations to arrive at

Ewi2 = E(wi12) + 2 2v E

(ui2g2[ui ]

), (1.13)

where the weighting matrix is defined by

= g[ui ]

uTi ui

g[ui ]uTi ui +

2ui2g2[ui ]

uTi ui . (1.14)

Observe that is a random matrix due to its dependence on the data (and,hence, the use of the boldface notation for it). The matrix , on the other hand,is not random.

1.4.1 Independent Regressors

Relations 1.11, 1.13, and 1.14 characterize the dynamic behavior of data-normalized adaptive filters for generic input distributions; they are all ex-act relations. Still, Recursion 1.13 is difficult to propagate as it requires theevaluation of the expectation

E(wi12) = E (wTi1wi1).

The difficulty is due to the fact that is a random matrix that depends onui , and wi1 is dependent on prior regressors as well. To progress further inthe analysis, we assume that

the {ui } are independent and identically distributed, (1.15)which allows us to deal with independently from wi1. This so-called in-dependence assumption is commonly used in the literature. Although rarelyapplicable, it gives good results for small step sizes.

Under Assumption 1.15, it is easy to verify that wi1 becomes independentof and, consequently, that

E[wi12] = E [wi12E []],

2004 by CRC Press LLC

6 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

with the weighting matrix replaced by its mean, which we shall denote by. In this way, the variance Recursion 1.13 becomes

Ewi2 = Ewi12 + 2 2v E(ui2

g2[ui ]

), (1.16)

with deterministic weighting matrices {, } and where, by evaluating theexpectation of Equation 1.14,

= E(

uTi uig[ui ]

) E

(uTi uig[ui ]

) + 2E

(ui2g2[ui ]

uTi ui

). (1.17)

Observe that the expression for is data dependent only.Finally, taking expectations of both sides of Equation 1.11, and using

Equation 1.15, we find that

E wi =(

I E(

uTi uig[ui ]

)) E wi1. (1.18)

Expressions 1.16 through 1.18 show that studying the transient behavior of adata-normalized adaptive filter in effect requires evaluating the three multi-variate moments:

E(ui2

g2[ui ]

), E

(uTi uig[ui ]

), and E

(ui2g2[ui ]

uTi ui

),

which are functions of ui only. In terms of these moments, Relations 1.16through 1.18 can now be used to characterize the dynamic behavior of adap-tive filters under independence Assumption 1.15. We start with the mean-square (transient) behavior.

1.5 Mean-Square Behavior

Let denote the M21 column vector that is obtained by stacking the columnsof on top of each other, written as = vec(). Likewise, let = vec().We shall also use the vec1() notation and write = vec1( ) to recover from . Similarly, = vec1( ).

Then using the Kronecker product notation,16 and the following property,for arbitrary matrices {X, Y, Z} of compatible dimensions,

vec(XYZ) = (ZT X)vec(Y).We can easily verify that Relation 1.17 for transforms into the linear vectorrelation

= F ,

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 7

where F is M2 M2 and given by

F = I A+ 2 B, (1.19)

in terms of the symmetric matrices {A, B},

A = (P IM) + (IM P),

B = E(

uTi ui uTi uig2[ui ]

),

P = E(

uTi uig[ui ].

).

(1.20)

Actually, A is positive-definite (because P is) and B is nonnegative-definite.Using the column notation , and the relation = F , we can writeEquations 1.16 through 1.17 as

Ewi2vec1( ) = Ewi12vec1(F ) + 2 2v E( ui2

g2[ui ]

),

which we shall rewrite more succinctly, by dropping the vec1() notation andkeeping the weighting vectors, as

Ewi2 = Ewi12F + 2 2v E( ui2

g2[ui ]

). (1.21)

Now, as mentioned earlier, in transient analysis we are interested in theevolution of Ewi2 and Ewi2Ru ; the former quantity is the filter mean-square deviation while the second quantity relates to the filter mean-squareerror (or learning) curve because

E e2(i) = E e2a (i) + 2v = Ewi12Ru + 2v .

The quantities {Ewi2, Ewi2Ru} are in turn special cases of Ewi2 obtainedby choosing = I or = Ru. Therefore, in the sequel, we focus on studyingthe evolution of Ewi2 for arbitrary .

From Equation 1.21 we see that to evaluate Ewi2 , we need Ewi2F withweighting vector F . This term can be deduced from Equation 1.21 by writingit for F , i.e.,

Ewi2F = Ewi12F 2 + 2 2v E(ui2F

g2[ui ]

),

with the weighted term Ewi2F 2 . This term can in turn be deduced fromEquation 1.21 by writing it for F 2 . Continuing in this fashion, for

2004 by CRC Press LLC

8 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

successive powers of F , we arrive at

Ewi2F M21 = Ewi12F M2 + 2 2v E(ui2F M21

g2[ui ]

)

in terms of the M2-power of F (recall that F is M2 M2).Fortunately, this procedure terminates. To see this, let p(x) = det(xI F )

denote the characteristic polynomial of F , say,

p(x) = xM2 + pM21xM21 + pM22xM22 + + p1x + p0,

with coefficients {pi }. Then, since p(F ) = 0 in view of the CayleyHamiltontheorem,16 we have

Ewi2F M2 =M21k=0

pkEwi2F k .

Putting these results together, we conclude that the transient (mean-square)behavior of the filter (Equation 1.3) is described by an M2-dimensional state-space model of the form:

Wi = FWi1 + 2 2v Y, (1.22)

where the M2 1 vectors {Wi , Y} are defined by

Wi =

Ewi2Ewi2F

...

Ewi2F M22Ewi2F M21

, Y =

E(ui2 /g2[ui ])

E(ui2F /g2[ui ])

...

E(ui2F M22 /g2[ui ])

E(ui2F M21 /g2[ui ])

, (1.23)

and the M2 M2 coefficient matrix F is given by

F =

0 10 0 10 0 0 1...

0 0 0 1p0 p1 p2 . . . pM21

. (1.24)

The entries of Y can be written more compactly as

Y = col {Tr(Q vec1(F k)), k = 0, 1, . . . , M2 1} ,

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 9

where

Q = E(

uTi uig2[ui ]

), (1.25)

and the notation vec1(F k) recovers the weighting matrix that correspondsto the vector F k .

When = I , the evolution of the top entry ofWi in Equation 1.22 describesthe mean-square deviation of the filter, i.e., Ewi2. If, on the other hand, is chosen as = Ru, the evolution of the top entry ofWi describes the excessmean-square error (or learning curve) of the filter, i.e., Ewi2Ru = E e2a (i).

The learning curve can also be characterized more explicitly as follows. Letr = vec(Ru) and choose = r . Iterating Equation 1.21 we find that

Ewi2r = w12F i+1r + 2 2v E[ui2(I+F ++F i )r

g2[ui ]

],

that is,Ewi2r = w12ai + 2 2v b(i),

where the vector ai and the scalar b(i) satisfy the recursions

ai = F ai1, a1 = r,

b(i) = b(i 1) + E[

ui2ai1g2[ui ]

], b(1) = 0.

Usually w1 = 0 so that w1 = wo . Using the definitions for {ai , b(i)}, it iseasy to verify that

E e2a (i) = E e2a (i 1) + wo2F i1(F I )r + 2 2v Tr(Qvec1(F i+1r)), (1.26)

which describes the learning curve of data-normalized adaptive filters as inEquation 1.3. Further discussions on the learning behavior of adaptive filterscan be found in Reference 17.

1.6 Mean-Square Stability

Recursion 1.22 shows that the adaptive filter will be mean-square stable if,and only if, the matrix F is a stable matrix; i.e., all its eigenvalues lie inside theunit circle. But since F has the form of a companion matrix, its eigenvaluescoincide with the roots of p(x), which in turn coincide with the eigenvaluesof F . Therefore, the mean-square stability of the adaptive filter requires thematrix F in Equation 1.19 to be a stable matrix.

2004 by CRC Press LLC

10 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

Now it can be verified that matrices F of the form of Equation 1.19, forarbitrary {A > 0, B 0}, are stable for all values of in the range:

0 < < min

{1

max(A1 B),

1max

{(H) IR+}

}, (1.27)

where the second condition is in terms of the largest positive real eigenvalueof the block matrix,

H =[

A/2 B/2IM2 0

],

when it exists. Because H is not symmetric, its eigenvalues may not be positiveor even real. If H does not have any real positive eigenvalue, then the upperbound on is determined by 1/max(A1 B) alone.*

Likewise, the mean-stability of the filter, as dictated by Equation 1.18, re-quires the eigenvalues of (I P) to lie inside the unit circle or, equivalently,

< 2/max(P). (1.28)

Combining Equations 1.27 and 1.28 we conclude that the filter is stable in themean and mean-square senses for step-sizes in the range

< min

{2

max(P),

1max(A1 B)

,1

max{(H) IR+}

}. (1.29)

1.7 Steady-State Performance

Steady-state performance results can also be deduced from Equation 1.21.Assuming the filter is operating in steady state, Recursion 1.21 gives in thelimit

limi

Ewi2(IF ) = 2 2v E[ ui2

g2[ui ]

].

This expression allows us to evaluate the steady-state value of Ewi2S forany weighting matrix S, by choosing such that

(I F ) = vec(S),i.e.,

= (I F )1vec(S).

* The condition involving max(A1 B) in Equation 1.27 guarantees that all eigenvalues of F areless than 1, while the condition involving H ensures that all eigenvalues of F are larger than 1.

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 11

In particular, the filter excess mean-square error, defined by

EMSE = limi

E e2a (i)

corresponds to the choice S = Ru since, by virtue of the independenceAssumption 1.15, E e2a (i) = Ewi12Ru . In other words, we should select as

emse = (I F )1vec(Ru).

On the other hand, the filter mean-square deviation, defined as

MSD = limi

Ewi2

is obtained by setting S = I , i.e.,

msd = (I F )1vec(I ).

Let {emse, msd} denote the weighting matrices that correspond to the vectors{emse, msd}, i.e.,

emse = vec1(emse), msd = vec1(msd).

Then we are led to the following expressions for the filter performance:

EMSE = 2 2v Tr(Qemse),

MSD = 2 2v Tr(Qmsd).(1.30)

Alternatively, we can also write

EMSE = 2 2v vecT (Q)emse = 2 2v vecT (Q)(I F )1vec(Ru),

MSD = 2 2v vecT (Q)msd = 2 2v vecT (Q)(I F )1vec(I ).(1.31)

While these steady-state results are obtained here as a consequence of vari-ance Relation 1.21, which relies on independence Assumption 1.15, it turnsout that steady-state results can also be deduced in an alternative manner thatdoes not rely on using the independence condition. This alternative deriva-tion starts from Equation 1.10 and uses the fact that Ewi2 = Ewi12 insteady state to derive expressions for the filter EMSE; the details are spelledout in References 11 and 12.

2004 by CRC Press LLC

12 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

1.8 Small-Step-Size Approximation

Returning to the expression of F in Equation 1.19, and to the performanceresults (Equation 1.30), we see that they are defined in terms of momentmatrices {A, B, P, Q}. These moments are generally not easy to evaluate forarbitrary input distributions and data nonlinearities g. This fact explains whyit is common in the literature to resort to Gaussian or whiteness assumptionson the regression data.

In our development so far, all results concerning filter transient perfor-mance, stability, and steady-state performance (e.g., Equations 1.22, 1.26, 1.29,and 1.30) have been derived without restricting the distribution of the regres-sion data to being Gaussian or white. To simplify the analysis, we keep theinput distribution generic and appeal instead to approximations pertainingto the step-size value, to the filter length, and also to a fourth-order momentapproximation. In this section, we discuss small-step-size approximation.

To begin with, even though we may not have available explicit values forthe moments {A, B, P, Q} in general, we can still assert the following. If thedistribution of the regression data is such that the matrix B is finite, then therealways exists a small enough step size for which F (and, hence, the filter) isstable. To see this, observe first that the eigenvalues of I A are given by

{1 [k(P) + j (P)]}for all combinations 1 j, k M of the eigenvalues of P . Now if B isbounded, then the maximum eigenvalue of F is bounded by

max(F ) 1 2min(P) + 2for some finite positive scalar (e.g., = max(B)). The upper bound onmax(F ) is a quadratic function of , and it is easy to verify that the values ofthis function are less than 1 for step sizes in the range (0, 2min(P)/). Becausemin(P)/ is positive, we conclude that there should exist a small enough such that F is stable and, consequently, the filter is mean-square stable.

Now for such small step sizes, we may ignore the quadratic term in thatappears in Equation 1.17, and approximate the variance relation (1.16 to 1.17)by

Ewi2 = Ewi12 + 2 2v E(ui2

g2[ui ]

),

= P P,(1.32)

or, equivalently, using the weighting vector notation, by

Ewi2 = Ewi12F + 2 2v E( ui2

g2[ui ]

),

F = I A,

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 13

where P = E (uTi ui/g[ui ]). Moreover, because I F = A, we can alsoapproximate the EMSE and MSD performances (Equation 1.30) of the filter by

EMSE 2v Tr(Qemse),

MSD 2v Tr(Qmsd),(1.33)

where now {emse, msd} denote the weighting matrices that correspond tothe vectors

emse = A1vec(Ru), msd = A1vec(I ).That is, {emse, msd} are the unique solutions of the Lyapunov equations:

Pmsd + msd P = I and Pemse + emse P = Ru.It is easy to verify that msd = P1/2 so that the MSD expression can bewritten more explicitly as

MSD 2v

2Tr(QP1). (1.34)

For example, in the special case of LMS, for which g[u] = 1 and P = Ru = Q,the above expressions give for small step sizes:

EMSE 2v Tr(Ru)

2, MSD

2v M2

(LMS). (1.35)

Using the simplified variance relation (Equation 1.32), we can also describethe dynamic behavior of the mean-square deviation of the filter by means of anM-dimensional state-space model, as opposed to the M2-dimensional model(Equation 1.22). To see this, let P = UUT denote the eigen-decompositionof P > 0, and introduce the transformed quantities:

wi = UT wi , ui = uiU, = UTU, Q = UT QU.Then the variance relation (Equation 1.32) can be equivalently rewritten as*

Ewi2 = Ewi12 + 2 2v E

(ui2g2[ui ]

),

= .

(1.36)

The expression for shows that it will be diagonal as long as is diagonal.Therefore, because we are free to choose (and, consequently, ), we can

* Usually, g[] is invariant under orthogonal transformations, i.e., g[ui ] = g[ui ]. This is the casefor LMS, NLMS, and -NLMS.

2004 by CRC Press LLC

14 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

assume that is diagonal. In this way, {, } will be fully characterized bytheir diagonal entries. Thus let { , } denote M 1 vectors that collect thediagonal entries of {, }, i.e.,

= diag(), = diag().

Then from Equation 1.36 we find that

= F ,

where F is the M M matrix

F = I A, A = 2.

Repeating the arguments that led to Equation 1.22 we can then establish that,for sufficiently small step sizes, the evolution of Ewi2 is described by thefollowing M-dimensional state-space model:

Wi = F Wi1 + 2 2v Y, (1.37)

where the M 1 vectors {Wi , Y} are defined by

Wi =

Ewi2Ewi2F

...

Ewi2F

M2

Ewi2F

M1

, Y =

E(ui2 /g2[ui ])

E(ui2F /g2[ui ])

...

E(ui2

FM2

/g2[ui ]

)E

(ui2F

M1/g2[ui ]

)

, (1.38)

and the M M coefficient matrix F is given by

F =

0 10 0 10 0 0 1...

0 0 0 1p0 p1 p2 . . . pM1

, (1.39)

where the {pi } are the coefficients of the characteristic polynomial of F . If weselect = vec(I ), then

wi2 = wi2 = UT wi2 = wi2

because U is orthogonal. In this case, the top entry of Wi will describe theevolution of the filter MSD.

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 15

When P and Ru have identical eigenvectors, e.g., as in LMS for whichg[u] = 1 and P = Ru, then the evolution of the learning curve of the filtercan also be read from Equation 1.37. To see this, let be the column vectorconsisting of the eigenvalues of Ru. Choosing = gives

wi2 = wi2 = wTi wi = wTi Ruwi = wi2Ru ,so that the EMSE behavior of the filter can be read from the top entry of theresulting state-vector Wi .

1.9 Applications to Selected Filters

We now illustrate the application of the results of the earlier sections, as wellas some extensions of these results, to selected adaptive filters.

1.9.1 The NLMS Algorithm

Our first example derives performance results for NLMS by showing how torelate it to LMS. In NLMS, g[u] = u2, and the filter recursion takes the form

wi = wi1 + uTi

ui2 [d(i) ui wi1].

Introduce the transformed variables:

ui = uiui , d(i) =d(i)ui , v(i) =

v(i)ui . (1.40)

Then the NLMS recursion can be rewritten as

wi = wi1 + uTi e(i)with

e(i) = d(i) ui wi1.In other words, we find that NLMS can be regarded as an LMS filter withrespect to the variables {d(i), ui }. Moreover, these variables satisfy a modelsimilar to that of {d(i), ui }, as given by Equations 1.1 and 1.2, specifically

d(i) = uiwo + v(i),where:

1. The sequence {v(i)} is iid with variance

E v2(i) = 2v = 2v E(

1ui2

).

2004 by CRC Press LLC

16 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

2. The sequence v(i) is independent of u j for all i = j .3. The covariance matrix of ui is

Ru = E uTi ui = E(

uTi uiui2

)> 0.

4. The random variables {v(i), ui } are zero mean.These conditions allow us to repeat the previous derivation of the variance andmean relations (Equations 1.16 through 1.18) using the transformed variables(Equation 1.40). In this way, the performance of NLMS can be deduced fromthat of LMS. In particular, from Equation 1.35 we obtain for NLMS:

MSD 2v M2

= 2v M2

E(

1ui2

)(1.41)

and

limi

E e2a (i) 2v Tr(Ru)

2=

2v

2,

because Tr(Ru) = 1, and where ea (i) = d(i)ui wi1. However, the filter EMSErelates to the limiting value of E e2a (i) and not E e

2a (i). To find this limiting

value, we first note from the definitions of ea (i) and ea (i) that

1ui2 e

2a (i) = e2a (i).

Then if we introduce the steady-state separation assumption*

E(

1ui2 e

2a (i)

) E e

2a (i)

Eui2 as i ,

so thatlimi

E e2a (i) = Tr(Ru) (

limi

E e2a (i))

,

we obtain

EMSE = 2v Tr(Ru)

2E

(1

ui2)

. (1.42)

An alternative method to evaluate the steady-state (as well as transient)performance of NLMS is to treat it as a special case of the results devel-oped in Section 1.8 by setting g(u) = u2. In this case, the variance relation(Equation 1.36) would become

Ewi2 = Ewi12 +

2 2v E

[ui2ui4

],

= .

* The assumption is reasonable for longer filters.

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 17

Moreover, the EMSE and MSD expressions (1.33 and 1.34) would give

MSD = 2v Tr(QP

1)2

,(1.43)

EMSE = 2v Tr(Qemse),

where now

P = E(

uTi uiui2

), Q = E

(uTi uiui4

)

and emse is the unique solution of Pemse + emse P = Ru. Expressions 1.44are alternatives to (1.41) and (1.42).

1.9.2 The RLS Algorithm

Our second example pertains to the recursive least-squares algorithm:

wi = wi1 + Pi uTi [d(i) ui wi1], i 0 (1.44)

Pi = 1[

Pi1 1Pi1uTi ui Pi1

1 + 1ui Pi1uTi

], (1.45)

where the data {d(i), ui } satisfy Equations 1.1 and 1.2, and the regressorssatisfy independence Assumption 1.15. In the above, 0 1 is a forgettingfactor and P1 = 1 I for a small positive .

Compared with the LMS-type recursion (Equation 1.3), the RLS updateincludes the matrix factor Pi multiplying uTi from the left. Moreover, Pi is afunction of both current and prior regressors. Still, the energy-conservationapproach of Sections 1.3 and 1.4 can be extended to deal with this more generalcase. In particular, it is straightforward to verify that Equation 1.10 is nowreplaced by

ui2Pi Pi wi2 +(ePi a (i)

)2 = ui2Pi Pi wi12 + (ePi p (i))2. (1.46)Under expectation, Equation 1.46 leads to

Ewi2 = Ewi12 + 2v Eui2Pi Pi , (1.47) = E (Pi uTi ui ) E (uTi ui Pi ) + E

[ui2Pi Pi uTi ui].However, the presence of the matrix Pi makes the subsequent analysis ratherchallenging; this is because Pi is dependent not only on ui but also on all priorregressors {u j , j i}.

To make the analysis more tractable, whenever necessary, we approxi-mate and replace the random variable Pi in steady-state by its respective

2004 by CRC Press LLC

18 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

mean value.* Now because

P1i = i+1 I +i

j=0i j uj u j ,

we find that, as i , and because < 1,

limi

E(P1i

) = Ru1

= P1.

That is, the mean value of P1i tends to Ru/(1 ). In comparison, the eval-uation of the limiting mean value of Pi is generally more difficult. For thisreason, we content ourselves with the approximation

E Pi [E P1i

]1 = (1 )R1u = P, as i .This is an approximation, of course, because even though Pi and P1i are theinverses of one another, it does not hold that their means will have the sameinverse relation.

Replacing Pi by P = (1)R1u , we find that variance Relation 1.48 becomesEwi2 = Ewi12 + 2v (1 )2Eui2R1u R1u ,

= 2(1 ) + (1 )2E [ui2R1u R1u uTi ui].Introduce the eigen-decomposition Ru = UUT , and define the transformedvariables

wi= UT wi , ui = uiU, = UTU.

Assume further, for the sake of illustration, that the regressors {ui } areGaussian. Then

E[ui211 uTi ui] = 2Tr(1) +

and the variance relation becomes

Ewi2 = Ewi12 + 2v (1 )2Eui211 ,

= 2 + 2(1 )2Tr(1).

It follows that will be diagonal if is. If we further introduce the M-dimensional column vectors

= diag{}, a = diag{1}, = diag{},

* This approximation essentially amounts to an ergodicity assumption on the regressors. It turns out that the approximation is reasonable for Gaussian regressors.

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 19

then the above recursion for is equivalent to

= F where F = 2 I + 2(1 )2a T .Let msd denote the weighting matrix that corresponds to the vector

msd = (I F )1diag(I ).Let also emse denote the weighting matrix that corresponds to the vector

emse = (I F )1.Then, because

MSD = 2v (1 )2Eui21msd1 ,

EMSE = 2v (1 )2Eui21emse1 ,

we can verify after some algebra that

MSD = 2v

Mk=1(1/k)

1+1 2M

,

(1.48)

EMSE = 2v M

1+1 2M

.

1.9.3 Leaky-LMS

Our third example extends the energy-conservation and variance relations ofSections 1.3 and 1.4 to leaky-LMS updates of the form:

wi = (1 )wi1 + uTi e(i), i 0e(i) = d(i) ui wi1,

where is a positive scalar. The data {d(i), ui } are still assumed tosatisfy Equations 1.1 and 1.2, with the regressors satisfying independenceAssumption 1.15.

Repeating the arguments of Sections 1.3 and 1.4, it is straightforward toverify that the variance and mean relations (Equations 1.16 through 1.18)

Ewi2 = Ewi12 + 2 2v Eui2 + 2 (wo)T J E wi1 + 22wo2 ,

= (E Ui ) (E Ui ) + 2E (UiUi ),

E wi = J E wi1 + wo .

2004 by CRC Press LLC

extend to the following (see Reference 18):

20 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

where

Ui = I + uTi ui , J = E (I Ui ) = (1 )I Ru.

Frequently, w1 = 0, so that E w1 = wo . We will make this assumption tosimplify the analysis, although it is not necessary for stability or steady-stateresults.

Therefore, by iterating the recursion for E wi we can verify that

E wi1 = Ciwo , i 0,

whereCi = J i + (I + J + + J i1).

It then follows that the term below, which appears in the recursion for Ewi2 ,can be expressed in terms of wo2 as

2(wo

)T J E wi1 = wo2 J Ci +Ci J .

Now repeating the arguments of Section 1.5 we can verify that the transientbehavior of the leaky filter is characterized by the following state-space model:

Wi = FWi1 + Yi ,

where Wi is the M2-dimensional vector

Wi=

Ewi2Ewi2F Ewi2F 2

...

Ewi2F M21

,

and F is the M2 M2 companion matrix

F =

0 10 0 10 0 0 1...

0 0 0 1p0 p1 p2 . . . pM21

,

with

p(x) = det(xI F ) = xM2 +M21k=0

pk xk

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 21

denoting the characteristic polynomial of the matrix

F = I A+ 2 B,

where

A = (E Ui I ) + (I E Ui ),B = E (Ui Ui ).

Moreover,

Yi = 2v

E |ui2Eui2F Eui2F 2

...

Eui2F M21

+

wo2(I+Si )wo2(I+Si )F wo2

(I+Si )F 2...

wo2(I+Si )F M21

,

where Si is the M2 M2 matrix

Si= (J Ci IM) + (IM Ci J ).

It follows that the filter is stable in the mean and mean-square senses forstep sizes in the range

< min

{2

+ max(Ru) ,1

max(A1 B),

1max

{(H) IR+}

},

where

H =[

A/2 B/2I 0

].

It also follows that in steady state,

limi

E wi = ( I + Ru)1wo ,

MSD = 2 2v E(ui2(IF )1vec(I )

)+ 22wo2T(IF )1vec(I ), (1.49)

EMSE = 2 2v E(ui2(IF )1vec(Ru)

)+ 22wo2T(IF )1vec(Ru),

where T is the M2 M2 matrix

T = I + ((I J )1 J I) + (I (I J )1 J ).

2004 by CRC Press LLC

22 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

1.10 Fourth-Order Moment Approximation

Instead of the small-step-size approximation of Section 1.8, we can choose toapproximate the fourth-order moment that appears in the expression for

in Equation 1.17 as

E(ui2

g2[ui ]uTi ui

) E

(uTi uig[ui ]

) E

(ui2g[ui ]

)= PTr(P),

where P = E (uTi ui/g[ui ]). In this way, Expression 1.17 for would become

= P P + 2 PTr(P), (1.50)

which is fully characterized in terms of the single moment P . If we nowlet P = UUT denote the eigen-decomposition of P > 0, and introduce thetransformed quantities:

wi = UT wi , ui = uiU, = UTU.

Then variance Relations 1.16 and 1.50 can be equivalently rewritten as

Ewi2 = Ewi12 + 2 2v E

(ui2g2[ui ]

),

= + 2Tr().

(1.51)

The expression for shows that it will be diagonal as long as is diagonal.Thus let again

= diag(), = diag().

Then from Equation 1.51 we find that

= F ,

where F is M M and given by

F = I A+ 2, B, A = 2, B = 2T ,

where = diag(). Repeating the arguments that led to Equation 1.22 wecan establish that, under the assumed fourth-order moment approximation,the evolution of Ewi2 is described by an M-dimensional state-space modelsimilar to Equation 1.37.

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 23

1.11 Long Filter Approximation

In addition to the small-step-size and fourth-order moment approximationsof Sections 1.8 and 1.10, we can also resort to a long filter approximationand derive simplified transient and steady-state performance results for data-normalized filters of form 1.3. We postpone this discussion until Section 1.12.5,whereby the simplified results will be obtained as a special case of the theorywe develop below for adaptive filters with error nonlinearities.

1.12 Adaptive Filters with Error Nonlinearities

The analysis in the earlier sections has focused on data-normalized adaptivefilters of the form 1.3. We now extend the energy-based arguments to filterswith error nonlinearities in their update equations. This class of filters isusually more challenging to study. For this reason, we resort to a long filterassumption in order to make the analysis more tractable, as we explain in thefollowing.

Thus consider filter updates of the form

wi = wi1 + uTi f [e(i)], i 0 (1.52)where

e(i) = d(i) ui wi1 (1.53)is the estimation error at iteration i , and f is some function of e(i). Typicalchoices for f are

f [e] = e (LMS), f [e] = sign(e) (sign-LMS), f [e] = e3 (LMF).The initial condition w1 of Equation 1.52 is assumed to be independent ofall {d( j), u j , v( j)}.

The same argument that was employed in Section 1.3 can be repeated here toverify that the energy relation (Equation 1.10) still holds. Indeed, subtractingwo from both sides of Equation 1.52 we obtain

wi = wi1 uTi f [e(i)], (1.54)and multiplying Equation 1.54 by ui from the left we find that

ep (i) = ea (i) ui2 f [e(i)]. (1.55)Relation 1.55 can be used to express f [e(i)] in terms of {ep (i), ea (i)} and toeliminate it from Equation 1.54. Doing so leads to the equality

ui2 wi + uTi ea (i) = ui2 wi1 + uTi ep (i), (1.56)

2004 by CRC Press LLC

24 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

and by equating the weighted Euclidean norms of both sides of this equationwe arrive again at Equation 1.10, which is repeated here for ease of reference,

ui2 wi2 +(ea (i)

)2 = ui2 wi12 + (ep (i))2 . (1.57)1.12.1 Variance Relation for Error Nonlinearities

Now recall that in transient analysis we are interested in characterizing thetime evolution of the quantity Ewi2 , for some of interest (usually, = Ior = Ru). To characterize this evolution, we replace ep (i) in Equation 1.57by its expression (Equation 1.55) in terms of ea (i) and e(i) to obtain

ui2 wi2 = ui2 wi12 + 2(ui2)2 f 2[e(i)]

2ui2ea (i) f [e(i)].Assuming the event ui2 = 0 occurs with zero probability, we can eliminateui2 from both sides and take expectations to arrive at

Ewi2 = Ewi12 2E(ea (i) f [e(i)]

) + 2E (ui2 f 2[e(i)]) , (1.58)which is the equivalent of Equation 1.13 for filters with error nonlinearities.Observe, however, that the weighting matrices for Ewi2 and Ewi12 arestill identical because we did not substitute {ea (i), e(i)} by their expressionsin terms of wi1. The reason we did not do so here is because of the non-linear error function f . Instead, to proceed, we show how to evaluate theexpectations

E(ea (i) f [e(i)]

)and E

(ui2 f 2[e(i)]) . (1.59)These expectations are generally hard to compute because of f . To facilitatetheir evaluation, we assume that the filter is long enough to justify, by centrallimit theorem arguments, that

ea (i) and ea (i) are jointly Gaussian random variables. (1.60)

1.12.1.1 Evaluation of E(ea f [e])Using Statement 1.60 we can evaluate the first expectation, E

(ea (i) f [e(i)]

),

by appealing to Prices theorem.15 The theorem states that if x and y arejointly Gaussian random variables that are independent from a third randomvariable z, then

E xk(y + z) = E xyE y2

E yk(y + z),

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 25

TABLE 1.1

Expressions for hG and hU for Some ErrorNonlinearities

Algorithm Error Nonlinearity {hG, hU}

LMS f [e] = e hG = 1hU = E e2a (i) + 2v

sign-LMS f [e] = sign[e] hG =

2

1E e2a (i) + 2v

hU = 1

LMF f [e] = e3 hG = 3(

E e2a (i) + 2v)

hU = 15(

E e2a (i) + 2v)3

Note: In the least-mean-fourth (LMF) case, we assume Gaussiannoise for simplicity.

where k() is some function of y + z. Using this result, together with theequality e(i) = ea (i) + v(i), we obtain

E ea (i) f [e(i)] = E ea (i)ea (i)E ea (i) f [e(i)]

E e2a (i)= (E ea (i)ea (i)) hG,

where the function hG is defined by

hG= E ea (i) f [e(i)]

E e2a (i). (1.61)

Clearly, because ea (i) is Gaussian, the expectation E ea (i) f [e(i)] depends onea (i) only through its second moment, E e2a (i). This means that hG itself is onlya function of E e2a (i). The function hG[] can be evaluated for different choicesof the error nonlinearity f [], as shown in Table 1.1.

1.12.1.2 Evaluation of E(ui2 f 2[e])To evaluate the second expectation, E (ui2 f 2[e(i)]), we resort to a separationassumption; i.e., we assume that the filter is long enough so that

ui2 and f 2[e(i)] are uncorrelated. (1.62)This assumption allows us to write

E(ui2 f 2[e(i)]) = (Eui2) (E f 2[e(i)]) = (Eui2) hU,

where the function hU is defined by

hU= E f 2[e(i)]. (1.63)

Again, since ea (i) is Gaussian and independent of the noise, the function hU isa function of Ee2a (i) only. The function hU can also be evaluated for differenterror nonlinearities, as shown in Table 1.1.

2004 by CRC Press LLC

26 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

1.12.2 Independent Regressors

Using the definitions of hU and hG , we can rewrite the variance relation(Equation 1.58) more compactly as

Ewi2 = Ewi12 2hGE(ea (i)ea (i)

) + 2hUTr(Ru). (1.64)As it stands, this relation is still difficult to propagate because it requires theevaluation of E ea (i)ea (i), and this expectation is not trivial in general. Thisis because of possible dependencies among the successive regressors {ui }.However, if we again resort to independence Assumption 1.15, then it is easyto verify that

E ea (i)ea (i) = Ewi12Ru ,so that Equation 1.64 becomes

Ewi2 = Ewi12 2hGEwi12Ru + 2hUTr(Ru). (1.65)We now illustrate the application of this result by considering two cases

separately. We start with the simpler case of white input data followed bycorrelated data.

1.12.3 White Regression Data

Assume first that Ru = 2u I and select = I . Then Equation 1.65 becomesEwi2 = Ewi12 2hG 2u Ewi12 + 2 M 2u hU . (1.66)

Note that all terms on the right-hand side are dependent on Ewi12 only;this is because hG and hU are functions of Ee2a (i) and, for white input data,Ee2a (i) = 2u Ewi12. We therefore find that Recursion 1.66 characterizes theevolution of Ewi2. Two special cases help demonstrate this fact.

1.12.3.1 Transient Behavior of LMS

When f [e] = e we obtain the LMS algorithm,wi = wi1 + uTi e(i) (1.67)

hU = 2u Ewi12 + 2v , hG = 1.we obtain

Ewi2 =(1 2 2u + 2 4u M

)Ewi12 + 2 M 2u 2v (1.68)

which is a linear recursion in Ewi2; it characterizes the transient behaviorof LMS for white input data.

2004 by CRC Press LLC

Using the following expressions from Table 1.1,

Energy Conservation in Adaptive Filtering 27

1.12.3.2 Transient Behavior of sign-LMS

When f [e] = sign(e) we obtain the sign-LMS algorithm,

wi = wi1 + uTi sign[e(i)]. (1.69)

hU = 1, hG =

2

1 2u Ewi12 + 2v

,

we obtain

Ewi2 =(

1

8

2u 2u Ewi12 + 2v

)Ewi12 + 2 M 2u , (1.70)

which is now a nonlinear recursion in Ewi2; it characterizes the transientbehavior of sign-LMS for white input data.

1.12.3.3 Transient Behavior of LMF

When f [e] = e3 we obtain the LMF algorithm,

wi = wi1 + uTi e3(i). (1.71)

Using the following expressions from Table 1.1,

hG = 3(E |ea (i)|2 + 2v

), hU = 15

(E |ea (i)|2 + 2v

)3,we obtain

Ewi2 = f Ewi12 + 152 M 2u 6v , (1.72)where

f = [1 + 2u 2v (45M 2u 2v 2) + 4u (45M 2u 2v 2)] Ewi12+ 152 8u M(Ewi12)2,

which is a nonlinear recursion in Ewi2; it characterizes the transient behav-ior of LMF for white input data.

1.12.4 Correlated Regression Data

When the input data is correlated, different weighting matrices will ap-pear on both sides of the variance relation (Equation 1.65). Indeed, writingEquation 1.65 for = I yields

Ewi2 = Ewi12 2hGEwi12Ru + 2Tr(Ru) hU,

2004 by CRC Press LLC

Using the following expressions from Table 1.1,

28 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

with the weighted term Ewi12Ru . This term can be deduced from Equation1.65 by writing it for = Ru, which leads to

Ewi2Ru = Ewi12Ru 2hGEwi12R2u + 2hUTr

(R2u

),

with the weighted term Ewi2R2u . This term can in turn be deduced fromEquation 1.65 by writing it for = R2u. Continuing in this fashion, for suc-cessive powers of Ru, we arrive at

Ewi2RM1u = Ewi12RM1u

2hGEwi12RMu + 2hUTr

(RMu

).

As before, this procedure terminates. To see this, let p(x) = det(xI Ru)denote the characteristic polynomial of Ru, say,

p(x) = xM + pM1xM1 + pM2xM2 + + p1x + p0.

Then, since p(Ru) = 0 in view of the CayleyHamilton theorem, we have

Ewi2RM = p0Ewi2 p1Ewi2Ru pM1Ewi2RM1u .

This result indicates that the weighted term Ewi2RM is fully determined bythe prior weighted terms.

Putting these results together, we find that the transient behavior of Filter1.52 is now described by a nonlinear M-dimensional state-space model of theform

Wi = FWi1 + 2hUY, (1.73)where the M 1 vectors {Wi , Y} are defined by

Wi=

Ewi2Ewi2Ru

...

Ewi2RM2uEwi2RM1u

, Y =

Tr(Ru)Tr(R2u)

...

Tr(RM1u )Tr(RMu )

, (1.74)

and the M M coefficient matrix F is given by

F=

1 2hG0 1 2hG0 0 1 2hG...

0 0 1 2hG2p0hG 2p1hG . . . 2pM2hG 1 + 2pM1hG

.

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 29

The evolution of the top entry of Wi describes the mean-square deviation ofthe filter, Ewi2, while the evolution of the second entry of Wi relates thelearning behavior of the filter because

E e2(i) = E e2a (i) + 2v = Ewi12Ru + 2v .

1.12.5 Long Filter Approximation

The earlier results on filters with error nonlinearities can be used to providean alternative simplified analysis of adaptive filters with data nonlinearitiesas in Equation 1.3. We did this in Sections 1.8 and 1.10 by resorting to sim-plifications that resulted from the small-step-size and fourth-order momentapproximations.

Indeed, starting from Equation 1.10, substituting ep (i) in terms of {ea (i),e(i)} from (Equation 1.8), and taking expectations, we arrive at the variancerelation

Ewi2 = Ewi12 2E(

ea (i)e(i)g[ui ]

)+ 2E

(ui2e2(i)g2[ui ]

). (1.75)

This relation is equivalent to Equation 1.13, except that in Equation 1.13 weproceeded further and expressed the terms ea (i)e(i) and e

2(i) as weightednorms of wi1. Relation 1.75 has the same form as variance relation 1.58 usedfor filters with error nonlinearities. Observe in particular that the functione/g[u] in data-normalized filters plays the role of f [e] in nonlinear errorfilters.

Now by following the arguments of Section 1.12.1, and under the followingassumptions:

ea (i) and ea (i) are jointly Gaussian random variables,

ui2 and g[ui ] are independent of e(i), and

the regressors ui are independent and identically distributed,

(1.76)

we can evaluate the expectations

E(

ea (i)e(i)g[ui ]

)and E

(ui2e2(i)g2[ui ]

),

and conclude that variance Relation 1.75 reduces to

Ewi2 = Ewi12 2hGE(ea (i)ea (i)

) + 2E(ui2g2[ui ]

) (E e2a (i) + 2v

),

2004 by CRC Press LLC

30 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

where now

hG= E

(e2a (i)/g[ui ]

)E e2a (i)

= E(

1g[ui ]

)(1.77)

in view of the independence assumptions in Listing 1.76.If we again use E e2a (i) = Ewi1Ru , then we arrive at

Ewi2 = Ewi12 hGEwi12Ru+Ru + 2E(ui2

g2[ui ]

)(Ewi12Ru + 2v

), (1.78)

which is the extension of Equation 1.65 to data-normalized filters. We nowillustrate the application of this result to the transient analysis of some data-normalized adaptive filters.

1.12.5.1 White Regression Data

Assume first that Ru = 2u I and select = I . Then Equation 1.78 becomes

Ewi2 =(

1 2 2u hG + 2 2u E( ui2

g2[ui ]

))Ewi12 + 2 2v E

( ui2g2[ui ]

).

(1.79)

For the special case of LMS, when g[u] = 1, hG in Equation 1.77 becomeshG = 1 and Equation 1.79 reduces to

Ewi2 =(1 2 2u + 2 4u M

)Ewi12 + 2 M 2u 2v . (1.80)

This is the same recursion we obtained before for LMS when trained withwhite input data.

For the special case of NLMS, g[u] = u2, and Relation 1.79 reduces to

wi2 =(

1 2 2u E(

1ui2

)+ 2 2u E

(1

ui2))

Ewi12

+ 2 2v E(

1ui2

). (1.81)

1.12.5.2 Correlated Regression Data

When the input data are correlated, different weighting matrices will appearon both sides of variance Relation 1.78. Indeed, writing 1.78 for = I yields

Ewi2 = Ewi12 2hGEwi12Ru + 2E( ui2

g2[ui ]

) (Ewi12Ru + 2v

)

2004 by CRC Press LLC

Energy Conservation in Adaptive Filtering 31

with the weighted term Ewi1Ru . This term can be deduced from Relation1.78 by writing it for = Ru, which leads to

Ewi2Ru = Ewi12Ru 2hGEwi12R2u + 2E

(ui2Rug2[ui ]

)(Ewi12Ru + 2v

)with the weighted term Ewi2R2u and so forth. The procedure terminates andleads to the following state-space model:

Wi =(F+ 2YeT2

)Wi1 + 2 2v Y, (1.82)

where the M 1 vectors {Wi , Y} are defined by

Wi=

Ewi2Ewi2Ru

...

Ewi2RM2uEwi2RM1u

, Y =

E(ui2/g2[ui ])

E(ui2Ru/g2[ui ])

...

E(ui2RM2u /g2[ui ])

E(ui2RM1u /g2[ui ])

, (1.83)

the M M matrix F is given by

F=

1 2hG0 1 2hG0 0 1 2hG...

0 0 1 2hG2p0hG 2p1hG . . . 2pM2hG 1 + 2pM1hG

ande2 = col{0, 1, 0, . . . , 0}.

Also,

hG = E(

1g[ui ]

).

The evolution of the top entry of Wi describes the mean-square deviation ofthe filter, Ewi2, while the evolution of the second entry of Wi relates tothe learning behavior of the filter. The model (Equation 1.82) is an alternativeto Equation 1.22 for adaptive filters with data nonlinearities; it is based onassumptions in Listing 1.76.

1.12.5.3 Steady-State Performance

The variance Relation 1.78 can also be used to approximate the steady-stateperformance of data-normalized adaptive filters. Writing it for = I ,

Ewi2 = Ewi12 2hGEwi12Ru + 2E( ui2

g2[ui ]

) (Ewi12Ru + 2v

)(1.84)

2004 by CRC Press LLC

32 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

and setting, in steady state,

limi

Ewi2 = limi

Ewi12,

we obtain

0 = 2E(

1g[ui ]

)EMSE + 2E

( ui2g2[ui ]

) (EMSE + 2v

),

so that the excess mean-square error, E e2a (), is given by

EMSE = 2v Tr(Q)

2E(1/g[ui ]

) Tr(Q) , (1.85)where Q = E (uTi ui/g2[ui ]). For LMS we have g[u] = 1 and Q = Ru, and theabove expression reduces to

EMSE = 2v Tr(Ru)

2 Tr(Ru) (LMS).

For NLMS we have g[u] = u2 and Q = E (uTi ui/ui4), so thatEMSE

2v

2 (NLMS).

1.12.5.4 Stability

Recursion 1.84 can be rearranged as

Ewi2 = Ewi12 + (Tr(Q) 2hG) Ewi12Ru + 2 2v Tr(Q).It is now easy to see that Ewi2 converges for step sizes satisfying

Tr(Q) 2hG < 0,or, equivalently,

0 < 2j , orv j,k,i = 0 otherwise, where 2j,k,i is the local average energy of the eight near-est neighbors of w j,k,i , and 2j the average energy in scale j . By conditioning onVj,k,i and using GMF, we develop the local contextual hidden Markov model(LCHMM) for w j,k,i as

fWj,k,i |Vj,k,i (w|v j,k,i = v) =1

m=0pSj,k,i |Vj,k,i (m|v j,k,i = v)g

(w|0, 2j,k,i,m

), (10.12)

where

pSj,k,i |Vj,k,i (m|v j,k,i = v) =pSj,k,i (m)pVj,k,i |Sj,k,i (v|m)1

m=0 pSj,k,i (m)pVj,k,i |Sj,k,i (v|m). (10.13)

LCHMM is specified by

j,k,i ={

pSj,k,i (m), 2j,k,i,m, pVj,k,i |Sj,k,i (v|m)|v, m = 0, 1

},

where j = 1, . . . , J and k, i = 0, 1, . . . , Nj 1. In fact, LCHMM defines alocal density function for each wavelet coefficient conditioning on its contextvalue. The EM training algorithm can be developed from that in Reference 13.Because j,k,i has a small number of data, one might be concerned that theestimation of j,k,i may not be robust. In this work, we can solve this problemby providing a good initial setting of j,k,i based on an idea similar to that inthe previous section. Given the AWGN of variance 2 , the LCHMM training

is performed as follows, where

x

y denotesk+C j

x=kC ji+C j

y=iC j .

Step 1. Initialization:1.0 0j = {pSj (0) = pSj (1) = 0.5, 2j,0 = 2 , 2j,1 = 22j 2 } and set

p = 0.1.1 E step: Given pj , calculate (Bayes rule)

p(

Sj,k,i = m|w j,k,i , pj) = pSj (m)g(w j,k,i ; 0, 2j,m)1

m=0 pSj (m)g(w j,k,i ; 0, 2j,m

) .(10.14)

1.2 M step: Compute the elements of p+1j by

pSj (m) =Nj 1k=0

Nj 1i=0

p(

Sj,k,i = m|w j,k,i , pj), (10.15)

2004 by CRC Press LLC

cies of wavelet coefficients as shown in Figure 10.7b. We define the random

350 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

2j,m =Nj 1

k=0Nj 1

i=0 w2j,k,i p

(Sj,k,i = m|w j,k,i , pj

)N2j pSj (m)

.

(10.16)

1.3 Iteration: Set p = p + 1. If it converges (or p = Np), then go toStep 1.4; otherwise, go to Step 1.1.

1.4 Set c = 0 and set the elements in 0j,k,i by

pSj,k,i (m) =

x

y

p(Sj,x, y = m|w j,x, y, j ), (10.17)

2j,k,i,m =

x

y w2j,x, y p(Sj,x, y = m|w j,x, y, j )

(2C j + 1)2 pSj,k,i (m), (10.18)

pVj,k,i |Sj,k,i (v|m) =

x

y p(Sj,x, y = m|w j,x, y, v j,x, y = v, j )pSj,k,i (m)

.

(10.19)

Step 2. E step: Given cj,k,i , k, i = 0, 1, . . . , Nj 1, calculate (Bayesrule)

pSj,k,i |Vj,k,i ,Wj,k,i (m|w j,k,i , v j,k,i = v)

= pSj,k,i (m)pVj,k,i |Sj,k,i (v|m)g(w j,k,i |0, 2j,k,i,m

)1

m=0 pSj,k,i (m)pVj,k,i |Sj,k,i (v|m)g(w j,k,i |0, 2j,k,i,m

) . (10.20) Step 3. M step: Compute the elements of c+1j,k,i , k, i = 0, 1, . . . , Nj 1,

by

pSj,k,i (m) =

x

y

pSj,x, y|Vj,x, y,Wj,x, y(m|w j,x, y, v j,x, y), (10.21)

2j,k,i,m =

x

y w2j,x, y pSj,x, y|Vj,x, y,Wj,x, y(m|w j,x, y, v j,x, y)

(2C j + 1)2 pSj,k,i (m),

(10.22)

pVj,k,i |Sj,k,i (v|m) =

x

y pSj,x, y|Vj,x, y,Wj,x, y(m|w j,x, y, v j,x, y = v)pSj,k,i (m)

.

(10.23)

Step 4. Iteration: Set c = c + 1. If it converges (or c = Nc), then stop;otherwise, go to Step 2.

2004 by CRC Press LLC

Statistical Image Modeling and Processing Using Wavelet-Domain HMMs 351

10.3.3 Fast EM Model Training

It seems that LCHMM training is computationally expensive because Step 3(M step) is performed on each wavelet coefficient. As a matter of fact, it is easyto notice that there are many overlapped computations in Step 3, as shown in

is slightly higher than that of CHMM13 and lower than those of HMMs.1,6

Given j,k,i , we can estimate the noise-free Yj,k,i from Wj,k,i as the conditionalmean as

E[Yj,k,i |w j,k,i , v j,k,i ] =1

m=0pSj,k,i |Vj,k,i ,Wj,k,i (m|w j,k,i , v j,k,i )

2j,k,i,m

2j,k,i,m + 2w j,k,i .

(10.24)

The denoised image is the IDWT of the above estimates of wavelet coeffi-cients. We expect the proposed LCHMM has the spatial adaptability, reduceddenoising artifacts, and fast model training process.

10.3.4 Simplified Shift-Invariant Denoising

The lack of shift-invariant property of the orthogonal DWT results in the visu-ally disturbing artifacts in denoised images. The Cycle-spinning techniquehas been proposed27 to solve this problem, where signal denoising is appliedto all shifts of the noisy signal, and the denoised results are then averaged.It can be shown that shift-invariant image denosing is equivalent to imagedenoising based on redundant wavelet transforms, such as those used in Ref-erences 6, 23, and 28. In this work, we consider the 16 shifted versions that areobtained from shifting the noisy image by 1, 2, 3, and 4 pixels in each dimen-sion, respectively. This simplification was found to be sufficient for most im-ages in practice. We assume that the LCHMM parameters are the same as thoseof the 16 shifted versions. Therefore, the EM training is performed only once,and the LCHMM training results are applied to the 16 images for denoising.

10.3.5 Simulation Results

We apply LCHMM to image denoising for real images Barbara and Lena (8 bpp,512 512) with AWGN of known variance 2 . The experimental setting isgiven as follows: (1) the window size of the local GMM in GMF decreaseswith the increase of the scale to adapt to the higher variations of waveletcoefficients in coarser scales, and in practice, {C j = 6 j | j = 1, 2, 3, 4, 5} arefound both effective and efficient; (2) for simplicity, we also fix the iterationnumbers of the initialization step and the EM training step to be Np = 20and Nc = 5; (3) we use the five-scale DWT where two wavelets, Daubechies-8(D8) and Symmlet-8 (S8), are tested; (4) the DWT is used with two setups: theorthogonal DWT and the redundant DWT or shift-invariant (SI) techniques.

2004 by CRC Press LLC

Figure 10.7c. The actual computational complexity of the LCHMM training

352 Nonlinear Signal and Image Processing: Theory, Methods, and Applications

TABLE 10.2

PSNR (dB) Results from Several Recent Denoising Algorithms

Noisy ImagesLena Barbara

Denoising Methods 10 15 20 25 10 15 20 25

Orthogonal DWT

Donohos HT (D8)19 31.6 29.8 28.5 27.4 28.6 26.5 25.2 24.3Wiener (MATLAB) 32.7 31.3 30.1 29.0 28.4 27.4 26.5 25.7HMT (D8)6 33.9 31.8 30.4 29.5 31.9 29.4 27.8 27.1SAWT (S8)23 31.8 30.5 29.5 29.2 27.6 26.5LAWMAP (D8)21 34.3 32.4 31.0 30.0 32.6 30.2 28.6 27.4AHMF (D8)22 34.5 32.5 31.1 30.1 32.7 30.3 28.7 27.5SSM (D8)24 34.8 32.5 32.4 30.0 LCHMM (D8) 34.4 32.4 30.9 29.9 32.8 30.5 28.9 27.7LCHMM (S8) 34.5 32.5 31.2 30.1 33.1 30.8 29.2 28.0

Redundant DWT

RHMT (D8)6 34.6 32.6 31.2 30.1 32.8 30.3 28.6 27.7SAWT (S8)23 33.0 31.9 30.6 30.7 28.9 27.6SAOE28 34.9 33.0 31.9 30.6 33.3 31.1 29.4 28.2LCHMM-SI (D8) 34.8 33.0 31.7 30.5 33.5 31.2 29.6 28.3LCHMM-SI (S8) 35.0 33.0 31.7 30.6 33.6 31.4 29.7 28.5

The peak signal-to-noise ratio (PSNR) results are shown in Table 10.2 whereseveral recent image denosing algorithms are compared. It is shown thatLCHMM provides the excellent denoising performance for the two images,especially for the Barbara image where the nonstationarity property is promi-nent. LCHMM outperforms all the other methods in most cases. We also show

LCHMM and LCHMM-SI provide better visual quality with fewer artifactsthan HMT.

10.3.6 Discussions of Image Denoising

In this section, we have proposed a new wavelet-domain HMM, called the local contextual hidden Markov model (LCHMM), for statistical modeling and image denoising. The simulation results show that LCHMM can achieve state-of-the-art denoising performance with three major advantages, i.e., spatial adaptability, nonstructured local-region modeling, and fast model training. However, the main drawback of LCHMM is overfitting in terms of the number of model parameters, which is even larger than the number of wavelet coefficients to be modeled. This drawback may prevent LCHMM from wider application. Nevertheless, LCHMM demonstrates evident advantages in image estimation and restoration applications.


FIGURE 10.8
Partial denoising results of image Barbara. (a) Noisy image (20.34 dB). (b) Donoho's HT (24.23 dB). (c) Wiener filter (25.71 dB). (d) HMT (26.92 dB). (e) LCHMM (27.72 dB). (f) LCHMM-SI (28.43 dB).

10.4 Image Segmentation

Bayesian approaches to image segmentation have proved efficient for integrating both image features and prior contextual properties, where maximum a posteriori (MAP) estimation is usually involved. The Markov random field (MRF) has been developed to model the contextual behavior of image data,29,30 and Bayesian segmentation becomes the MAP estimate of the unknown MRF from the observed data. Because the MRF model usually favors the formation of large, uniformly classified regions, it may oversmooth texture boundaries and wipe off small isolated areas. The noncausal dependence structure of MRFs typically results in high computational complexity. Recently, researchers have proposed multiscale techniques that apply contextual behavior in the coarser scale to guide the decision in the finer scale while retaining the underlying MRF model in each fixed scale.31,32 In particular, Markovian dependencies are assumed across scales to capture interscale dependencies of multiscale class labels with a causal MRF structure,32 so


that a noniterative segmentation algorithm was developed, in which a sequential MAP (SMAP) estimator replaces the MAP estimator. In this section, we develop a joint multicontext and multiscale (JMCMS) approach to Bayesian segmentation, which can be formulated as a multiobjective optimization that generalizes the single-objective optimization involved in References 7, 32, and 33. To estimate the SMAP with respect to multiple context models in JMCMS, we use the heuristic multistage problem-solving technique.34 The simulation results show that the proposed JMCMS algorithm improves the accuracy of texture classification, boundary localization, and boundary detection at a comparable computational cost.35

10.4.1 Multiscale Bayesian Segmentation

We now briefly review the multiscale segmentation approach of Reference 32. Given a random field Y, we need to accurately estimate the pixel labels in X, where each label specifies one of Nc possible classes. Bayesian estimators attempt to minimize the average cost of an erroneous segmentation, as follows:

\hat{x} = \arg\min_{x} E[C(X, x) \mid Y = y],                    (10.25)

where C(X, x) is the cost of the estimate x given the true segmentation X. The MAP estimate is the solution of Equation 10.25 if we use the cost functional C_{MAP}(X, x) = 1 whenever any pixel is incorrectly classified. This means that the MAP estimator aims at maximizing the probability that all pixels will be correctly classified. Because the MAP estimator is excessively conservative, multiscale Bayesian segmentation has been proposed,32 where a sequential MAP (SMAP) cost function, C_{SMAP}(X, x), is introduced by proportionally summing together the segmentation errors from multiple scales. The SMAP estimator aims at minimizing the spatial size of errors, resulting in more desirable segmentation results with lower computational complexity than the MAP estimator. The multiscale image model proposed in Reference 32 is composed of a series of random fields at multiple scales. Each scale has a random field of image feature vectors, Y^(n), and a random field of class labels, X^(n). We denote an individual sample at scale n by y_s^(n) and x_s^(n), where s is the position in a 2D lattice S^(n). Assuming Markovian dependencies across scales, the SMAP recursion can be computed in a coarse-to-fine fashion as follows:

\hat{x}^{(n)} = \arg\max_{x^{(n)}} \left\{ \log p_{y^{(n)}|x^{(n)}}(y^{(n)} \mid x^{(n)}) + \log p_{x^{(n)}|x^{(n+1)}}(x^{(n)} \mid \hat{x}^{(n+1)}) \right\}.                    (10.26)

The two terms in Equation 10.26 are the likelihood function of the image feature y^(n) and the context-based prior knowledge from the next coarser scale, respectively. Specifically, the quadtree pyramid was developed in Reference 32 to capture interscale dependencies of multiscale class labels, i.e., the latter term of Equation 10.26. Thanks to the multiscale embedded


structure, the quadtree model allows the efficient recursive computation of likelihood functions, but it also results in discontinuous texture boundaries, because spatially adjacent samples may not have a common parent sample at the next coarser scale. Therefore, a more generalized pyramid graph model was introduced,32 where each sample has more parent samples in the next coarser scale. However, this pyramid graph also complicates the computation of likelihood functions, and the fine-to-coarse recursion of Equation 10.26 has to be solved approximately. A trainable context model for multiscale Bayesian segmentation has been proposed,33 where x_s^(n) is assumed to be dependent only on x_{∂s}, a set of neighboring samples (5 × 5) at the coarser scale, and ∂s ⊂ S^(n+1) denotes a 5 × 5 window of samples at scale n + 1. The behavior of this simplified contextual structure can be trained off-line by providing sufficient training data, including many images and their ground-truth segmentations. Then, the segmentation can be accomplished efficiently via a single fine-to-coarse-to-fine iteration through the pyramid.
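To make the coarse-to-fine decision rule of Equation 10.26 concrete, the sketch below combines per-block class log-likelihoods with a simple quadtree interscale prior and takes the argmax at each scale; it is a simplified illustration under assumed inputs (a fixed parent-to-child transition table), not the SMAP estimator of Reference 32.

```python
import numpy as np

def coarse_to_fine_labels(loglik, log_trans):
    """Coarse-to-fine labeling in the spirit of Equation 10.26.

    loglik    : list of arrays, coarsest scale first; loglik[n][i, j, c] is
                log f(y_s | x_s = c) for block s = (i, j) at scale n
    log_trans : (Nc, Nc) array of log p(child class | parent class)
    Returns a list of label maps, coarsest scale first.
    """
    labels = [np.argmax(loglik[0], axis=-1)]      # coarsest scale: likelihood only
    for n in range(1, len(loglik)):
        parent = labels[n - 1]
        # Each child block looks up the contextual log-prior given its quadtree parent.
        parent_up = np.repeat(np.repeat(parent, 2, axis=0), 2, axis=1)
        prior = log_trans[parent_up]              # (H, W, Nc) log-prior per child block
        labels.append(np.argmax(loglik[n] + prior, axis=-1))
    return labels
```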

10.4.2 HMTseg Algorithm

A distinct context-based Bayesian segmentation algorithm has been proposed,7 in which the context model is characterized by a context vector v_s^(n) derived from a set of neighboring samples (3 × 3) at the coarser scale. It is assumed that, given y_s^(n), its context vector v_s^(n) = {x_{ρs}, x̄_{∂s}} can provide supplementary information regarding x_s^(n), where x_{ρs} denotes the class label of the parent sample and x̄_{∂s} the dominant class label of the 3 × 3 samples at the coarser scale. Both ρs ∈ S^(n+1), the position of the parent sample, and ∂s ⊂ S^(n+1), a 3 × 3 window centered at ρs, are at scale n + 1. Thus, given v_s^(n), x_s^(n) is assumed to be independent of all other class labels. In particular, the contextual prior p_{x^(n)|v^(n)}(c|u) is involved in the SMAP estimation, and it can be estimated by maximizing the following context-based mixture-model likelihood:

f(y^{(n)} \mid v^{(n)} = u) = \prod_{s \in S^{(n)}} \sum_{c=1}^{N_c} p_{x^{(n)}|v^{(n)}}(c \mid v_s^{(n)} = u)\, f(y_s^{(n)} \mid x_s^{(n)} = c),                    (10.27)

where the likelihood function f(y^(n)|x^(n) = c) is computed using the wavelet-domain HMT model. An iterative EM algorithm has been developed7 to maximize Equation 10.27, and the SMAP estimate is obtained by

\hat{x}^{(n)} = \arg\max_{x^{(n)}} p_{x^{(n)}|v^{(n)}, y^{(n)}}(x^{(n)} \mid v^{(n)}, y^{(n)}),                    (10.28)

where

p_{x^{(n)}|v^{(n)}, y^{(n)}}(x^{(n)} \mid v^{(n)}, y^{(n)}) = \frac{p_{x^{(n)}}(x^{(n)})\, p_{v^{(n)}|x^{(n)}}(v^{(n)} \mid x^{(n)})\, f(y^{(n)} \mid x^{(n)})}{\sum_{c=1}^{N_c} p_{x^{(n)}}(c)\, p_{v^{(n)}|x^{(n)}}(v^{(n)} \mid x^{(n)} = c)\, f(y^{(n)} \mid x^{(n)} = c)}.                    (10.29)
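Equation 10.29 is a pointwise Bayes combination of the class prior, the context likelihood, and the HMT likelihood; a minimal NumPy sketch, with assumed array names and a log-domain computation for numerical stability, is:

```python
import numpy as np

def context_posterior(log_f, prior_c, p_v_given_c):
    """Normalized posterior p(x = c | v, y) of Equation 10.29 for every block.

    log_f       : (H, W, Nc) log HMT likelihoods log f(y_s | x_s = c)
    prior_c     : (Nc,) class prior p_x(c)
    p_v_given_c : (H, W, Nc) context likelihood p(v_s | x_s = c) for the observed
                  context vector of each block
    Returns an (H, W, Nc) array that sums to 1 over the class axis.
    """
    # Combine the three factors in the log domain, then normalize per block.
    log_post = np.log(prior_c) + np.log(np.maximum(p_v_given_c, 1e-12)) + log_f
    log_post -= log_post.max(axis=-1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=-1, keepdims=True)
```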


FIGURE 10.9
(a) The pyramid representation of dyadic image blocks (scales j = 1, 2, 3). (b) Wavelet subtree. (c) 2D HMT.

In particular, the wavelet-domain HMT was used to obtain the statistical multiscale characterization underlying the likelihood function f(y^(n)|x^(n) = c). Using the Haar DWT, which has the best spatial localizability, an image can be recursively divided into four subimages of the same size J times and represented in a pyramid of J scales, as shown in Figure 10.9a. We denote a dyadic block at scale n by y^(n). Given a set of Haar wavelet coefficients w and a set of HMT model parameters Θ, the dyadic block y^(n) is associated with three wavelet subtrees {T^(n)_LH, T^(n)_HL, T^(n)_HH}. The three wavelet subtrees are rooted at the three wavelet coefficients of the same location in the three subbands at scale n. Regarding the model likelihood in Equation 10.29, f(y^(n)|Θ) treats y^(n) as a realization of the HMT model and is obtained by

f(y^{(n)} \mid \Theta) = f(T^{(n)}_{LH} \mid \Theta_{LH})\, f(T^{(n)}_{HL} \mid \Theta_{HL})\, f(T^{(n)}_{HH} \mid \Theta_{HH}),                    (10.30)

where it is assumed that the three wavelet subbands are independent, and each factor in Equation 10.30 can be computed with the closed-form recursion in Reference 1.
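The subtree likelihoods in Equation 10.30 come from the upward (fine-to-coarse) HMT recursion given in Reference 1; the sketch below is a minimal single-subband version, assuming zero-mean Gaussian mixture states with parameters tied within each scale, and the array layout and names are our own.

```python
import numpy as np

def gauss(w, var):
    """Zero-mean Gaussian density, evaluated elementwise."""
    return np.exp(-(w ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def subtree_likelihood(coeffs, sig2, trans, root_prob):
    """Upward HMT recursion for one subband (cf. Equation 10.30 and Reference 1).

    coeffs    : list of coefficient arrays, coarsest scale first;
                coeffs[j] has shape (H * 2**j, W * 2**j)
    sig2      : list of (M,) state variances, one entry per scale
    trans     : list of (M, M) matrices; trans[j][p, c] = P(child state c | parent state p),
                linking scale j (parent) to scale j + 1 (children)
    root_prob : (M,) state probabilities at the root scale
    Returns an (H, W) array of subtree likelihoods, one per root coefficient.
    """
    beta = gauss(coeffs[-1][..., None], sig2[-1])          # finest scale: mixture densities
    for j in range(len(coeffs) - 2, -1, -1):
        child_msg = beta @ trans[j].T                      # sum over child states
        H, W = coeffs[j].shape
        # Multiply the messages of the four quadtree children of each parent.
        child_msg = child_msg.reshape(H, 2, W, 2, -1).prod(axis=(1, 3))
        beta = gauss(coeffs[j][..., None], sig2[j]) * child_msg
    return (beta * root_prob).sum(axis=-1)
```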


10.4.3 Joint Multicontext and Multiscale Approach

The context-based Bayesian segmentation approaches7,32,33 have been applied to multispectral SPOT images, document images, aerial photos, etc. It was found that segmentation results in homogeneous regions are usually better than those around texture boundaries. This is primarily because the context models used in those approaches mainly capture interscale dependencies and encourage the formation of large, uniformly classified regions, with less consideration of texture boundaries. To improve the segmentation results in both homogeneous regions and texture boundaries simultaneously, we discuss two questions in this work: (1) What are the characteristics of context models of different structures in terms of their segmentation results? (2) How can multiple context models with distinct advantages be integrated to implement Bayesian segmentation? To answer the first question, we apply a set of numerical criteria to evaluate and quantify the segmentation performance, and we conduct experiments on a set of synthetic mosaics to quantitatively analyze the context models. We then propose a joint multicontext and multiscale (JMCMS) approach to Bayesian segmentation, which is formulated as a multiobjective optimization problem. In particular, we use the multistage problem-solving technique to estimate the SMAP of JMCMS.34

Given a sample x_s^(n), its contextual information may come from neighbors in the spatial and/or scale spaces. We then naturally have three nonoverlapping contextual sources, P = x_{ρs}, NP = x̄_{∂s}, and N = x̄_{ηs}, where ∂s is the 3 × 3 window centered at the parent sample ρs (excluding ρs) at scale n + 1, and ηs is the 3 × 3 window centered at s (excluding s) at scale n. Specifically, P is the class label of the parent sample ρs, and NP and N are the dominant class labels of ∂s and ηs, respectively. Other contextual sources are possible, but we believe P, NP, and N are the most important, because they are the nearest to x_s^(n) in the pyramid representation, and higher-order context models may introduce the context dilution problem.36 Instead of using the majority voting scheme of Reference 7, which may be ambiguous when Nc > 2, we determine the dominant class label, e.g., x̄_{ηs} over the samples in ηs, by

\bar{x}_{\eta s}^{(n)} = \arg\max_{c \in \{1, \ldots, N_c\}} \sum_{t \in \eta s} p_{x^{(n)}|v^{(n)}, y^{(n)}}(c \mid v_t^{(n)}, y_t^{(n)}),                    (10.31)

where we assume that each sample makes the same textural contribution, measured by its posterior probability from Equation 10.28, to the dominant class label over the window; x̄_{∂s} is obtained similarly. Based on P, NP, and N, we develop five context models of different orders d, as listed below (a small sketch of the weighted vote in Equation 10.31 follows the list):

d = 1: Context-1 = {P} and Context-5 = {N}
d = 2: Context-2 = {P, NP} and Context-4 = {P, N}
d = 3: Context-3 = {P, NP, N}
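The dominant-label computation of Equation 10.31, which supplies the NP and N components above, amounts to summing posterior probabilities over a 3 × 3 window and taking the winning class; a minimal sketch with assumed names:

```python
import numpy as np

def dominant_label(post, i, j, exclude_center=True):
    """Dominant class label over a 3x3 window (cf. Equation 10.31).

    post : (H, W, Nc) posterior probabilities p(x_t = c | v_t, y_t)
    i, j : center position of the window
    The winner is the class whose posterior, summed over the window samples
    (excluding the center, as in the text), is largest.
    """
    window = post[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
    score = window.reshape(-1, post.shape[-1]).sum(axis=0)
    if exclude_center:
        score = score - post[i, j]
    return int(np.argmax(score))
```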


FIGURE 10.10
Five context models between two neighboring scales, where the coarser scale (top) and the finer scale (bottom) are shown. {V1, . . . , Vd} (d = 1, 2, 3) is the context vector, and the dominant class label x̄ is defined in Equation 10.31.

The five context models are depicted in Figure 10.10. Among them, Context-1 and Context-2 are interscale context models, similar to those used to encourage the formation of large, uniformly classified regions.7,33

Context-5 is an intrascale context model often used in the MRF literature to ensure locally homogeneous labeling with high sensitivity to boundaries. Context-3 and Context-4 are hybrid inter- and intrascale context models, with characteristics similar to those used in References 37 through 39. We anticipate that these context models have distinct effects on the segmentation results in terms of classification, boundary localization, and boundary detection. To study their characteristics, we use three numerical criteria to quantify the segmentation performance. Specifically, Pa is the percentage of pixels that are correctly classified, showing accuracy; Pb is the percentage of


TABLE 10.3

Segmentation Results of the Five Contexts Regarding Pa, Pb, and Pc

      Context-1   Context-2   Context-3   Context-4   Context-5
Pa    0.9365      0.9728      0.9260      0.9186      0.8327
Pb    0.2098      0.3567      0.3716      0.3000      0.1173
Pc    0.6389      0.5189      0.7071      0.7222      0.7237

boundaries that coincide with the true ones, showing boundary specificity; and Pc is the percentage of true boundaries that are detected, showing boundary sensitivity. We conduct segmentation experiments on 10 mosaics, as shown in Figure 10.11. For each context, we perform pixel-level segmentation on the 10 mosaics using the supervised context-based segmentation algorithm.7 Pa, Pb, and Pc, averaged over the 10 trials, are shown in Table 10.3.
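The precise boundary-matching rule behind Pb and Pc is not spelled out here, so the sketch below scores them with a simple one-pixel dilation tolerance; it is an illustrative approximation under our own assumptions, not the authors' evaluation code.

```python
import numpy as np

def boundaries(labels):
    """Pixels whose right or lower neighbor carries a different class label."""
    b = np.zeros(labels.shape, dtype=bool)
    b[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    b[:-1, :] |= labels[:-1, :] != labels[1:, :]
    return b

def dilate(mask, r=1):
    """Crude binary dilation by r pixels (serves as the matching tolerance)."""
    out = mask.copy()
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def segmentation_criteria(est, truth, tol=1):
    """Rough Pa, Pb, Pc for an estimated label map against its ground truth."""
    pa = float(np.mean(est == truth))                                  # accuracy
    be, bt = boundaries(est), boundaries(truth)
    pb = (be & dilate(bt, tol)).sum() / max(be.sum(), 1)               # specificity
    pc = (bt & dilate(be, tol)).sum() / max(bt.sum(), 1)               # sensitivity
    return pa, pb, pc
```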

A good segmentation requires high Pa, Pb, and Pc. Even though Pa is usually the most important, high Pb and Pc provide more desirable segmentation results, with high accuracy of boundary localization and detection. On the other hand, boundary localization and detection in textured regions are usually regarded as difficult issues due to the abundant edges and structures around texture boundaries.41,42 From Table 10.3, it is found that none of the five context models works well singly in terms of all three criteria. For example, Context-2 has the best Pa but the worst Pc. This fact experimentally verifies that the context models used in References 7 and 33 are good choices in terms of Pa. Context-5 is the strongest in Pc but the weakest in Pa. Context-3 gives the highest Pb, but Pa and Pc suffer. These observations are almost completely consistent in each trial. Intuitively speaking, interscale context models, e.g., Context-1 and Context-2, favor Pa by encouraging the formation of large, uniformly classified regions across scales of the pyramid. The intrascale context model Context-5 helps Pc by being sensitive to boundaries within a scale. As a hybrid inter- and intrascale context model, Context-3 provides the best Pb by appropriately balancing interscale and intrascale dependencies in the SMAP Bayesian estimation. Thus, a natural idea is to integrate multiple context models to achieve high Pa, Pb, and Pc simultaneously.

Generally speaking, given y = {y^(n) | n = 1, 2, . . . , L}, the collection of multiscale random fields of an image Y, a context model V is used to simplify the characterization of the joint statistics of y with local contextual modeling. Thus, given different context models, we have different statistical characterizations of y. Accordingly, we may obtain different Bayesian segmentation results. For example, the quadtree pyramid32 and the interscale context models7,33 emphasize the homogeneity of the labeling across scales, and the segmentation results tend to be composed of large, uniformly classified regions. However, those contexts cannot provide high accuracy of boundary


FIGURE 10.11
Ten synthetic mosaics (256 × 256, 8 bpp). (a) Mosaic1 (D9/D68), (b) Mosaic2 (D16/D24), (c) Mosaic3 (D15/D38), (d) Mosaic4 (D16/D84/D24/D19), (e) Mosaic5 (D24/D68/D16/D19), (f) Mosaic6 (D9/D16/D19/D24/D28), (g) Mosaic7 (D16/D24/D84), (h) Mosaic8 (D38/D16/D15), (i) Mosaic9 (D9/D16/D19), (j) Mosaic10 (D24/D68/D16/D19). (Mosaics from Brodatz, P., Textures: A Photographic Album for Artists and Designers, New York: Dover, 1966. With permission.)


localization and detection, due to their limitations on boundary characterization. Similar to the multiscale image modeling of References 37 through 39, intrascale or hybrid inter- and intrascale context models can be used to achieve more accurate contextual modeling around boundaries, e.g., Context-3, Context-4, and Context-5. However, those contexts may be challenged in some homogeneous regions where the homogeneity is not very good at a certain scale. In this work, our goal is to apply multiple context models that have different advantages for image segmentation. Hence, y can be represented as multiple (Z) copies, each characterized by a distinct context model, i.e., {y_z | z = 1, 2, . . . , Z}. Because different context models provide different multiscale modeling, leading to distinct results in terms of Pa, Pb, and Pc, we propose a joint multicontext and multiscale (JMCMS) approach to Bayesian segmentation, which reformulates Equation 10.25 as the multiobjective optimization

\hat{x} = \arg\min_{x} E[C_{SMAP}(X, x) \mid Y = y_1],
    \vdots                                                              (10.32)
\hat{x} = \arg\min_{x} E[C_{SMAP}(X, x) \mid Y = y_Z].

The multiobjective optimization in Equation 10.32 is roughly analogous to the multiple criteria of Pa, Pb, and Pc, and it can be regarded as a generalization of the single-objective optimization in Equation 10.25. The problem in Equation 10.32 can be approached by a heuristic algorithm called the multistage problem-solving technique.34 In other words, the problem in Equation 10.32 can be broken into multiple stages, and the solution of one stage defines the constraints on the next stage. Thus, Equation 10.32 can be solved based on multiple context models individually and sequentially. According to the multistage problem-solving technique, the SMAP estimation of the posterior probabilities, as defined in Equation 10.29, is conducted for all dyadic blocks with respect to the three contexts individually and sequentially, and the SMAP decision is made only in the final step, according to Equation 10.25 or 10.28. The new JMCMS algorithm can be widely applied to different multiscale Bayesian segmentation methods using distinct texture models or texture features. Here, in particular, we adopt the supervised segmentation algorithm7 to implement context-based Bayesian segmentation, where the wavelet-domain HMT is used to obtain multiscale texture characterization.

The implementation of the JMCMS approach to Bayesian segmentation is briefly listed as follows, where Z context models {V1, . . . , VZ} are used, an L-scale image pyramid is involved, and n = 0 denotes the pixel-level representation. An important issue that should be addressed here is the determination of the context vectors during the EM training process. In particular, the causal interscale context models, e.g., Context-1 and Context-2, have fixed context vectors during the EM training process. Meanwhile, the noncausal intrascale or hybrid context models, e.g., Context-3, Context-4, and Context-5, require the


real-time update of the context vectors during each iteration, based on the results of the previous step. The JMCMS segmentation algorithm is implemented as follows (a high-level code skeleton is sketched after the steps).

Step 1. Set n = L − 1, starting from the next-to-coarsest scale.
Step 2. Set z = 1, starting from the first context model V1 in the list.
Step 3. Set p = 0, initializing {p_{x^(n)}(c), p_{v^(n)|x^(n)}(u|c)} and v^(n).
Step 4. Expectation (E) step, as defined in Equation 10.29.
Step 5. If context model Vz is noncausal, update v^(n); otherwise, continue.
Step 6. Maximization (M) step: update the contextual prior as

p_{x^{(n)}}(c) = \sum_{s \in S^{(n)}} p_{x^{(n)}|v^{(n)}, y^{(n)}}(x_s^{(n)} = c \mid v_s^{(n)}, y_s^{(n)}),                    (10.33)

p_{v^{(n)}|x^{(n)}}(u \mid c) = \frac{1}{p_{x^{(n)}}(c)} \sum_{s:\, v_s^{(n)} = u} p_{x^{(n)}|v^{(n)}, y^{(n)}}(x_s^{(n)} = c \mid v_s^{(n)}, y_s^{(n)}).                    (10.34)

Step 7. Set p = p + 1. If converged (or p = Np), go to Step 8; otherwise, go to Step 4.
Step 8. Set z = z + 1. If z > Z, go to Step 9; otherwise, use context Vz and go to Step 3.
Step 9. Set n = n − 1. If n < 0, go to Step 10; otherwise, go to Step 2.
Step 10. arg max_c p_{x^(0)|v^(0), y^(0)}(c | v^(0), y^(0)) gives the pixel-level segmentation.
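A high-level skeleton of Steps 1 through 10 is sketched below; every function and variable name is a placeholder of ours, the E-step, M-step, context-vector update, and final decision are passed in as callables, and no claim is made about the actual implementation.

```python
def jmcms_segment(contexts, n_scales, n_iter, e_step, m_step, update_context, decide):
    """Skeleton of the JMCMS loop over scales, context models, and EM iterations.

    contexts       : list of Z context models, applied in order at every scale
    n_scales       : L, number of pyramid scales (n = 0 is the pixel level)
    n_iter         : Np, EM iterations per context model
    e_step         : callable(scale, context, state) -> posteriors (Equation 10.29)
    m_step         : callable(posteriors, state) -> updated state (Equations 10.33, 10.34)
    update_context : callable(scale, posteriors, state), used only for noncausal contexts
    decide         : callable(posteriors) -> label map (final SMAP decision, Step 10)
    """
    state, post = {}, None
    for scale in range(n_scales - 1, -1, -1):        # Steps 1 and 9: coarse to fine
        for ctx in contexts:                         # Steps 2 and 8: one context at a time
            for _ in range(n_iter):                  # Steps 3-7: EM iterations
                post = e_step(scale, ctx, state)                 # E step
                if getattr(ctx, "noncausal", False):
                    update_context(scale, post, state)           # refresh context vectors
                state = m_step(post, state)                      # M step
    return decide(post)                              # Step 10: pixel-level decision
```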

In Table 10.4, we list the optimal (numerically, in terms of Pa) settings of JMCMS on the 10 synthetic mosaics for Z = 1, 2, 3. In practice, Z = 3 is found sufficient for the 10 mosaics in Figure 10.11, and Z > 3 does not help much in terms of Pa, Pb, and Pc. We also found that the JMCMS of Context-2-3-5, i.e., V1 = Context-2, V2 = Context-3, and V3 = Context-5, is the numerically best setting for the 10 mosaics regarding the three criteria, and this is almost completely consistent in each trial. It is interesting to note that the context ordering of the optimal JMCMS algorithm for each order Z in Table 10.4 also follows a roughly coarse-to-fine progression. Three facts about JMCMS are noteworthy: (1) The JMCMS of Context-2-3-5 may not be the universally optimal design, and we have the flexibility to design a tailored JMCMS for a specific application. (2) As an alternative, more sophisticated methods that select different contexts in a spatially adaptive fashion could be developed to improve the segmentation results. However, since the contextual prior is trained by the EM algorithm, which needs sufficient data for efficient training, spatially adaptive context selection faces the difficulty of robust prior estimation. The generally designed JMCMS used here has good robustness and adaptability to various image data and texture boundaries. (3) The new JMCMS approach is neither preprocessing nor postprocessing of the segmentation map, as the SMAP decision is made only in the final stage. The proposed JMCMS is a new approach to driving the Bayesian estimation of posterior probabilities toward a desired solution via multiple context models, step by step.

TABLE 10.4

The Optimal JMCMS on the 10 Mosaics with Z = 1, 2, 3

JMCMS    Context-2 (Z = 1)    Context-2-5 (Z = 2)    Context-2-3-5 (Z = 3)
Pa       0.9728               0.9893                 0.9897
Pb       0.3567               0.6923                 0.7259
Pc       0.5189               0.7314                 0.7337

10.4.4 Simulation Results

Here we test the proposed JMCMS approach of Context-2-3-5 (Z = 3) on both synthetic mosaics and remotely sensed images. At the same time, we also study the segmentation algorithm,7 where only Context-2 (Z = 1) is used. For both cases, we use the wavelet-domain HMT model to obtain multiscale statistical texture characterization. We fix the total iteration numbers of the two methods to be the same, e.g., Np × Z = 30. Thus they have similar computational complexity, and the execution time is about 20 to 30 s for 256 × 256 images (Nc = 2, 3, 4) on a Pentium-II 400 computer. The average improvements in Pa, Pb, and Pc are about 2, 32, and 18%, respectively, across the 10 mosaics given in Figure 10.11. We also show the segmentation results of five mosaics in Figure 10.12, where the improvements in Pa, Pb, and Pc are also given.

Although Context-2 in Reference 7 provides generally good segmentation results in homogeneous regions, the texture boundaries cannot be well localized and detected, i.e., Pb and Pc are low. This is the major shortcoming of most multiscale segmentation approaches, where only interscale context models are used.7,32,33 It is shown that JMCMS can overcome this limitation. First, the accuracy of boundary localization and boundary detection is significantly improved, with much smoother texture boundaries, as shown by Pb and Pc. Second, the classification accuracy in homogeneous regions is also improved by reducing misclassified and isolated pixels, as shown by Pa. These improvements are due to the multiple context models used in JMCMS, where contextual information is propagated both across scales and via multiple context models to warrant good segmentation results in both homogeneous regions and around boundaries.

One may argue that some simple processing methods, such as morphological operations, can also provide smoother boundary localization. However, there are three limitations to using morphological operations for postprocessing segmentation maps. First, they cannot deal with errors of a large size, such as those that appear in Figure 10.12d. Second, they may weaken the accuracy of



FIGURE 10.12
Segmentation results of HMTseg (top) and JMCMS (bottom). (a) Mosaic6: ΔPa = 3.03%, ΔPb = 22.12%, ΔPc = 1.86%. (b) Mosaic7: ΔPa = 0.98%, ΔPb = 26.75%, ΔPc = 20.88%. (c) Mosaic8: ΔPa = 2.01%, ΔPb = 39.71%, ΔPc = 28.10%. (d) Mosaic9: ΔPa = 3.72%, ΔPb = 41.29%, ΔPc = 27.48%. (e) Mosaic10: ΔPa = 3.98%, ΔPb = 43.44%, ΔPc = 7.97%.


FIGURE 10.13
Segmentation of real images using HMTseg and JMCMS. (a) Aerial photo A (sea/ground). (b) HMTseg result of A. (c) JMCMS result of A. (d) SAR image B (forest/grass). (e) HMTseg result of B. (f) JMCMS result of B.

boundary detection by producing oversmoothed boundaries. Third, they may wipe off some small isolated targets, which are important in some applications. In the following, we conduct experiments on remotely sensed images, including an aerial photograph and a synthetic aperture radar (SAR) image, as shown in Figure 10.13. Texture models are first trained on image samples,


which are manually extracted from the original images (512 × 512, 8 bpp). We can see the improvements of JMCMS (Context-2-3-5) over HMTseg (Context-2): the accuracy of texture classification, and in particular boundary localization and detection, is improved. Meanwhile, small targets are kept in the segmentation map.

10.4.5 Discussions of Image Segmentation

In this section, a JMCMS approach to Bayesian segmentation has been proposed. JMCMS is able to accumulate contextual behavior both across scales and via multiple context models, allowing more effective Bayesian estimation, and it applies the wavelet-domain HMT to obtain multiscale texture characterization. JMCMS can be formulated as a multiobjective optimization, which can be approached by the heuristic multistage problem-solving te