Paper 5094-43_5

Performance assessment of frequency plane filters applied to track association and sensor registration

Clay J. Stanek (a), Bahram Javidi (b), and P. Yanni (a)

(a) ANZUS, 9747 Businesspark Ave, San Diego, CA 92131
(b) Department of E. E., University of Connecticut, 206 Glenbrook Road, Storrs, CT 06269-2157

ABSTRACT

The current generation of correlation systems attempting to provide a Single Integrated Picture (SIP) has concentrated on improving quality from the situational awareness (SA) and tracking perspective, with limited success, while not addressing the combat identification (CID) issue at all. Furthermore, decision time has lengthened, not decreased [1], as more and more sensor data are made available to commanders, much of it video in origin. Many efforts are underway to build a network of sensors, including the Army's Future Combat System (FCS), the Air Force Multi-mission Command and Control Aircraft (MC2A), Network-Centric Collaborative Targeting (NCCT), and the follow-on to the Navy's Cooperative Engagement Capability (CEC). Each of these (and still other) programs has the potential to increase the precision of targeting data with successful correlation algorithms while eliminating dual track reports, but almost none have combined, or will combine, disparate sensor data into a cohesive target with a high confidence of identification. In this paper, we address an architecture that solves the track correlation problem using frequency plane pattern recognition techniques and that can also provide CID capability. We also discuss statistical considerations and performance issues.

Keywords: track correlation, sensor registration, phase invariance, generalized noise, fusion

INTRODUCTION

Correlation engines have been evolving since the implementation of radar. Here, correlation is a process typically referred to in the literature as track association/correlation. Association is often taken to mean a preliminary list of tracks for comparison (hypotheses), and correlation reflects the track association hypotheses that pass a statistical test. Track correlation and sensor registration are required to produce common, continuous, and unambiguous tracks of all objects in the surveillance area [2]. The objective is to provide a unified picture of the theatre or area of interest to battlefield decision makers. This unified picture has many names, but is most commonly referred to as a Single Integrated Picture (SIP). A related process, known as sensor registration or gridlock filtering (gridlocking), refers to the reduction of navigation errors and sensor misalignment errors so that one sensor's track data can be accurately transformed into another sensor's coordinate system. As platforms gain multiple sensors, the correlation and gridlocking of tracks become significantly more difficult. In this paper, we concentrate on correlation algorithms and follow work introduced in previous papers [3, 4].

From a technical viewpoint, legacy track correlation systems rely on a few key mathematical concepts born out of research from the early 60s through the early 80s: statistical hypothesis testing applied to the track association/correlation problem, an assignment algorithm to further eliminate ambiguities, and the Kalman filter for positional estimation of the targets. Several limitations arise. Limited digital computer performance led researchers to develop methods of simplifying the association/correlation and estimation process through model assumptions, data reduction, or both. In application, the modeling assumptions tend to hold as a convenient exception rather than the rule, adversely affecting the association/correlation and Kalman filtering process in many real-world situations. For example, the Kalman filter assumes each measurement is independent of prior data and that its errors are not cross-correlated with other system data; this can yield poor tracking performance if left unaccounted for [5]. A related assumption is used for hypothesis testing as information is accumulated over time in the association/correlation process: the Naïve Bayesian approximation is employed across parameters and multiple samples



in time. This is the well-known "curse of dimensionality": when the observed feature vector has high dimension, finding the marginal and joint densities is very difficult. Thus, modeling assumptions and data reduction in tandem can significantly reduce the processing burden, but can severely impact performance in terms of the quality of the delivered SIP [6]. In summary, statistical hypothesis testing assigns a probability conditioned on the priors and is only consistent with what we tell it: Did we enumerate the possible states of nature, discrete or continuous? Are the assigned priors consistent with our state of knowledge? Is there any additional evidence we can digest via Bayes Rule? Have we described the sampling probabilities (conditional density functions) honestly, given our state of knowledge? Problems in any of these steps can lead to a dubious posterior probability calculation. If this weren't discouraging enough, legacy systems are extremely sensitive to time alignment and sensor gridlock, which inject additional sources of systematic error and random noise into the process. Furthermore, to combine classification data with tracking data while also performing track correlation, a better approach using modern technology is required. The approach presented here avoids these limitations by using pattern recognition techniques that can be implemented on a digital, FPGA, or optical architecture. This architecture provides a native ability to accept classification data along with standard track information to provide track correlation with CID. The pattern recognition algorithms are based on concepts developed for real-time facial and target recognition systems. This all but eliminates the need to throw away information in the process.

TRACK CORRELATION & FUSION FORMULATION

Consider the problem of determining whether a track reported by one sensor or link, x_i, and a second track, x_j, correspond to the same target. In terms of a hypothesis test, we have two cases of interest and describe the binary case in this way: Let

ω_0 = the event that track x_i and track x_j correspond to the same target,
ω_1 = the event that track x_i and track x_j do not correspond to the same target.

The typical track state contains parameters such as position, velocity, and course, and these are often reported with respect to a wide variety of reference frames. The most common of these is the WGS-84 datum, which defines an ellipsoidal reference surface whose origin is taken as the center of the Earth. Sensors can use a spherical coordinate system in their internal processing. In the context of data links for track reports, it is often the case that a local Cartesian frame is used, with the origin being either a surveyed ground point (which improves track accuracy) or a more arbitrarily chosen reference. Through simple transformations, we can take the WGS-84 coordinate and transform it into a Cartesian coordinate relative to a tangent plane of specific orientation. Such transformations occur between tracks reported on Link-16 and those on the Cooperative Engagement Capability (CEC), and vice-versa. The xy plane is taken as tangent to the Earth's surface at the origin of the grid, with the y axis oriented toward North and the z axis normal to the plane. Let us take the position of the track in ellipsoidal coordinates, where λ is the azimuth, φ is the reduced latitude, and h is the altitude (referenced as a geometric height above the reference ellipsoid); speed and heading information is often also available (it is vector in nature and can be resolved into component velocities). Thus, it is perfectly valid to refer to the description of a track sample as x = [φ λ h v] in ellipsoidal coordinates or x = [x y z v] in Cartesian coordinates, where x is a vector representing the track state. The state of the target at a particular time t can be denoted as x_t = [φ λ h v]_t or x_t = [x y z v]_t. By accumulating several track state samples, we can construct a matrix where each row represents a field and each column a specific sample time; this track state matrix is referred to as X_i and represents all the information of the track reported over all time, as shown in Figure 1.
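The tangent-plane construction described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: the WGS-84 constants are the standard published values, ordinary geodetic latitude is used in place of the reduced latitude, and the grid origin and track samples are invented for the example.

```python
import math

# WGS-84 ellipsoid constants (standard published values)
A = 6378137.0               # semi-major axis (m)
E2 = 6.69437999014e-3       # first eccentricity squared

def geodetic_to_ecef(lat, lon, h):
    """Geodetic (deg, deg, m) -> Earth-centered Earth-fixed Cartesian (m)."""
    lat, lon = math.radians(lat), math.radians(lon)
    n = A / math.sqrt(1 - E2 * math.sin(lat)**2)   # prime-vertical radius
    x = (n + h) * math.cos(lat) * math.cos(lon)
    y = (n + h) * math.cos(lat) * math.sin(lon)
    z = (n * (1 - E2) + h) * math.sin(lat)
    return x, y, z

def ecef_to_enu(p, lat0, lon0, h0):
    """ECEF point -> local tangent-plane frame (x east, y north, z normal)."""
    ox, oy, oz = geodetic_to_ecef(lat0, lon0, h0)
    dx, dy, dz = p[0] - ox, p[1] - oy, p[2] - oz
    la, lo = math.radians(lat0), math.radians(lon0)
    e = -math.sin(lo)*dx + math.cos(lo)*dy
    n = -math.sin(la)*math.cos(lo)*dx - math.sin(la)*math.sin(lo)*dy + math.cos(la)*dz
    u = math.cos(la)*math.cos(lo)*dx + math.cos(la)*math.sin(lo)*dy + math.sin(la)*dz
    return e, n, u

# Track state matrix X_i: rows are fields (x, y, z), columns are sample times
origin = (29.5, -86.2, 0.0)                          # illustrative grid origin
samples = [(29.51, -86.21, 3000.0), (29.52, -86.22, 3050.0)]
X_i = list(zip(*(ecef_to_enu(geodetic_to_ecef(*s), *origin) for s in samples)))
print(X_i[1])   # the "north" row across sample times
```

Each 0.01 degree of latitude moves the track roughly 1.1 km north in the tangent plane, so the two columns of the north row grow accordingly.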


Figure 1 (left) 52 tracks represented by line plots over time; (right) track image generated from one case. See approximately 29.5N, 86.2W in the image for the red, circular track.

HYPOTHESIS TESTING

We will outline in some detail the standard hypothesis testing approach to track correlation and describe its foundation in Bayesian theory. The Bayesian decision rule is simple: choose the class that maximizes the posterior probability. For a two-class problem, the decision rule is written as

$$ q(\omega_0|x) \;\mathop{\gtrless}_{\omega_1}^{\omega_0}\; q(\omega_1|x) \tag{1.1} $$

where q(ω_i|x) is the posterior probability for class ω_i. This says that if the probability of class ω_0 given x is greater than the probability of ω_1, choose ω_0, and vice-versa. Here x is a vector of attributes, which includes everything used in the decision process: measurements such as target position, velocity, radar cross-section, and ESM parameters, as well as taxonomic clues such as IFF modes, PPLI (Precise Position Location Indicator on Link-16), and others, such that

$$ x_t = [\,x \ \ y \ \ z \ \ v_x \ \ v_y \ \ v_z \ \ m \ \ \ldots\,]^T_t $$

and so on.

The posterior probability is one of the most misunderstood concepts in scientific inference. E. T. Jaynes offers what the author feels is perhaps the best description of probability, which is too often simply interpreted as the actual frequency with which we would observe events to be true in a real experiment. He states,

"In pedantic, but scholastically correct terminology, a probability p is an abstract concept, a quantity that we assign theoretically, for the purpose of representing a state of knowledge, or that we calculate from previously assigned probabilities using the rules of inference. A frequency, f, is, in situations where it makes sense to speak of repetitions, a factual property of the real world, that we measure or estimate. Instead of committing the error of saying that the probability is the frequency, we ought to calculate the probability p(f) df that the frequency lies in various intervals df." [7]

Using Bayes rule,

$$ q(\omega_i|x) = \frac{p(x|\omega_i)\,p(\omega_i)}{p(x)}, $$

with prior class probabilities p(ω_i) and conditional class densities p(x|ω_i), and writing in ratio form:

$$ O(\omega_0|x) \equiv \frac{q(\omega_0|x)}{q(\omega_1|x)} = \frac{p(x|\omega_0)\,p(\omega_0)}{p(x|\omega_1)\,p(\omega_1)} = O(\omega_0)\,\frac{p(x|\omega_0)}{p(x|\omega_1)}, \qquad O(\omega_0) = \frac{p(\omega_0)}{p(\omega_1)} \tag{1.2} $$


In (1.2), we have used the Jaynes [8] description by referring to O(ω_0|x) as the "odds on hypothesis ω_0". Thus, the posterior odds are equal to the prior odds O(ω_0) multiplied by a dimensionless factor called the likelihood ratio. Using the Fukunaga representation [9], define the likelihood ratio l(x) [10] and decision rule as

$$ l(x) = \frac{p(x|\omega_0)}{p(x|\omega_1)} \;\mathop{\gtrless}_{\omega_1}^{\omega_0}\; \tau = \frac{p(\omega_1)}{p(\omega_0)} \tag{1.3} $$

Either (1.2) or (1.3) is an equivalent formulation using Bayes rule. By using base 10 and putting a factor of 10 in front, we can measure evidence, e(ω_0|x) = 10 log O(ω_0|x), in decibels (db) in (1.2) rather than odds. Clearly, the evidence equals the prior evidence plus the number of db of evidence provided by the likelihood ratio. While there is nothing in this calculation per se that tells us where to make the decision to accept ω_0, there are several techniques for measuring the false alarm rate as a function of the threshold, which help guide the decision portion of the calculation. The use of a loss function helps rescale what the Bayesian calculation provides into something which becomes an actionable calculation.
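The db bookkeeping in (1.2) can be sketched in a few lines. This is a minimal illustration; the prior odds and likelihood values are invented for the example, not taken from the paper.

```python
import math

def evidence_db(prior_odds, p_x_w0, p_x_w1):
    """Posterior evidence in db: prior evidence plus 10*log10 of the likelihood ratio."""
    e_prior = 10 * math.log10(prior_odds)        # prior evidence, db
    e_llr = 10 * math.log10(p_x_w0 / p_x_w1)     # evidence contributed by the data, db
    return e_prior + e_llr

# Even prior odds plus a 10:1 likelihood ratio yields +10 db of evidence for w0
print(evidence_db(1.0, 0.5, 0.05))
```

A useful property of the db scale is that independent pieces of evidence simply add, which is what makes the expansions in the next section natural.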

THE NAÏVE BAYESIAN APPROXIMATION

Suppose that the new information consists of several pieces or propositions such that x = x_1, x_2, …, x_n. Then we could expand (1.2) by using the product rule for probabilities:

$$ e(\omega_0|x) = e(\omega_0) + 10\log\frac{p(x_1|x_2,\ldots,x_n,\omega_0)}{p(x_1|x_2,\ldots,x_n,\omega_1)} + 10\log\frac{p(x_2|x_3,\ldots,x_n,\omega_0)}{p(x_2|x_3,\ldots,x_n,\omega_1)} + \cdots + 10\log\frac{p(x_n|\omega_0)}{p(x_n|\omega_1)} \tag{1.4} $$

If we suppose that each x_i is independent of the outcome of the other x_j given the class, then we have the "Naïve Bayesian approximation":

$$ e(\omega_0|x) = e(\omega_0) + \sum_{i=1}^{n} 10\log\frac{p(x_i|\omega_0)}{p(x_i|\omega_1)} \tag{1.5} $$

Notice that each piece of information contributes additively to the evidence, which makes for a clear way to introduce newly learned information into the inference process. The Bayesian normally draws a distinction between logical and causal independence. Typically this is taken to mean an honest representation of the state of knowledge about the situation, not merely known or measured cause and effect; thus, perception is an important part of the formulation to the Bayesian. Two events may be causally dependent in the sense that one influences the other, but to a person attempting to infer from available data who has not yet discovered this, the probabilities representing one's state of knowledge might be independent. There are other advantages to (1.5) that many have written about [11]. Probably the most important of these is the rate of learning from a performance standpoint: given n training examples over l attributes, the time required to learn a boosted naive Bayesian classifier is O(n·l), i.e., linear. We will comment on the non-extensibility of (1.5) to the multi-hypothesis case. In the CID problem, our hypothesis is the platform (a friendly F-15, a hostile MIG-21, a SCUD, etc.) and intent, and thus the naïve Bayesian approximation is really saying that everything is conditioned on the target type alone, i.e., a model-based approach. We could interpret this from the following scenario. Imagine characteristics for a platform such as radar information (pulse width, frequency range, pulse repetition frequency, waveform), engine type, length, radar cross-section, and positional and kinematic information. Our simplifying assumptions would mean that, given the platform is an F-15, the radar parameters are independent of the engine type, which is independent of the length, which is independent of the speed, and so on. The main advantage in this type of formulation is the avoidance of joint density estimation where the features are correlated. Think of this in only three dimensions: rather than needing to estimate from data covering the whole volume in three dimensions, one can intersect that volume with three orthogonal planes representing a feature per plane.
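The additive evidence accumulation in (1.5) can be sketched directly. This is a minimal illustration; the per-feature likelihood pairs are invented for the example.

```python
import math

def naive_bayes_evidence(e_prior, likelihoods):
    """Eq. (1.5): prior evidence plus 10*log10 of each per-feature likelihood ratio.

    likelihoods: list of (p(x_i|w0), p(x_i|w1)) pairs, one per conditionally
    independent feature.
    """
    return e_prior + sum(10 * math.log10(p0 / p1) for p0, p1 in likelihoods)

# Three features: two individually favor w0, one is uninformative (ratio 1)
feats = [(0.8, 0.4), (0.6, 0.3), (0.9, 0.9)]
print(naive_bayes_evidence(0.0, feats))
```

Because each feature contributes an independent term, newly learned information (a fourth feature, a new sensor report) is folded in by appending one more pair, which is exactly the linear-time learning behavior noted above.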


However, when the set of hypotheses is ω_i, i ∈ [0, n] with n ≥ 2, then (1.5) will lead to trivial conclusions, because the alternative to ω_0 contains many hypotheses, not just ω_1. Thus, even if our assumption that each x_i is independent of the other x_j honestly represents our state of knowledge, application of (1.5) is a serious error. Instead we must express (1.2) as

$$ O(\omega_0|x) = O(\omega_0)\,\frac{p(x_1, x_2, \ldots, x_n|\omega_0)}{\displaystyle\sum_{j=1}^{n} p(x_1, x_2, \ldots, x_n|\omega_j)\,p(\omega_j) \Big/ \sum_{j=1}^{n} p(\omega_j)} \tag{1.6} $$

so that the denominator of the likelihood ratio is a weighted sum: the average probability of the features over all alternatives. One could of course take two hypotheses at a time and use the simpler form, but one would then have to rank them. The Quicksort algorithm accomplishes a full ranking procedure in O(n log n) operations.
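The weighted-sum denominator in (1.6) can be sketched numerically. This is an illustrative toy, with invented per-class likelihoods and uniform priors; it simply checks that the odds on ω_0 are computed against the prior-weighted average of the alternatives.

```python
import numpy as np

def odds_multiclass(p_x_given, priors, k=0):
    """Posterior odds on hypothesis k against the average of all alternatives (Eq. 1.6)."""
    p_x_given = np.asarray(p_x_given, float)    # p(x|w_j) for each class j
    priors = np.asarray(priors, float)          # p(w_j)
    alt = np.delete(np.arange(len(priors)), k)  # indices of the alternative hypotheses
    prior_odds = priors[k] / priors[alt].sum()
    # prior-weighted average of the alternative likelihoods (denominator of Eq. 1.6)
    avg_alt = (p_x_given[alt] * priors[alt]).sum() / priors[alt].sum()
    return prior_odds * p_x_given[k] / avg_alt

# Three hypotheses, uniform priors: w0 twice as likely as the averaged alternatives
print(odds_multiclass([0.6, 0.2, 0.1], [1/3, 1/3, 1/3], k=0))
```

The result agrees with computing q(ω_0|x) against the total posterior mass of the alternatives directly, which is the point of the weighted-sum form.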

SINGLE AND SEQUENTIAL HYPOTHESIS TESTING

Classification based on kinematic data is compounded by the difficulty of describing the class distributions for the case of uncorrelated track pairs. One well-known alternative is to perform a single hypothesis test, the most common of which is the distance test: the distance from the feature vector to the class mean is computed (weighted by the covariance matrix) and compared to a threshold to determine whether the feature vector supports the hypothesis. If the original ω_0 class distribution is Gaussian, then the distance classification represents the evaluation of log p(x|ω_0) to within a constant factor; this is the Mahalanobis distance:

$$ d^2(x) = (x-\mu)^T \Sigma^{-1} (x-\mu) = z^T z = \sum_{i=1}^{n} z_i^2, \qquad z = A^T(x-\mu) \tag{1.7} $$

and thus is half of the calculation for a binary hypothesis test. Its relation to quadratic classifiers will not be further discussed here. Note that z is a whitened feature vector of zero mean, μ = 0, and unit covariance, Σ = I. From the Bayes standpoint, if all we know about our data set is the first two moments of the distribution, then (1.7) is the most honest representation of our state of knowledge according to the Principle of Maximum Entropy. Taking it one step further, the most widespread use of the Gaussian sampling distribution arises not because the error frequencies are known or believed to be Gaussian, but rather because they are unknown. However, the issue comes from the unknown distribution of the second class and the use of a threshold in lieu of a complete binary hypothesis test. Fukunaga [12] has outlined the issues with this approach that warrant mention here. He shows that the expectation and variance of the distance are given by

$$ E\{d^2|\omega_1\} = n, \qquad \mathrm{Var}\{d^2|\omega_1\} = \gamma n, \qquad \gamma = E\{z_i^4\} - 1 $$

when the z_i's are independent.

For normal distributions, γ = 2 and the density function for d² is

$$ p(d^2 = \zeta) = \frac{\zeta^{n/2-1}\,e^{-\zeta/2}}{2^{n/2}\,\Gamma(n/2)}\,u(\zeta) \tag{1.8} $$

which is the gamma (chi-square) density function.

which is the gamma density function. For the location of the second class, we can safely assume that under a whitening transformation, the mean vector is non-zero and the covariance matrix non-white. Thus, we take { }2 ˆ|ω µ=E x and ( ){ }2

2ˆ |µ ω− = ΛE x and can easily

calculate that

{ } { }2 2 2 22 2

1 1 1 1

ˆ ˆ| , | 2 4ω λ µ µ ω λ λ µ= = = =

= + = +∑ ∑ ∑ ∑n n n n

Ti i i i

i i i iE d Var d (1.9)

where λi is an eigenvalue of Λ . The result of this type of analysis is that the Bayes error in the one-dimensional 2d space is considerably higher than that in the n -dimensional space of x when n is moderate to large in number. The mapping from the original space to the one-dimensional space destroys classification information which existed in the original feature space.
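The design-class moments quoted above, E{d²} = n and Var{d²} = 2n for Gaussian data, can be checked with a small Monte Carlo sketch. The covariance, mean, and sample counts below are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 5, 200_000

# Draw samples from a correlated Gaussian class with mean mu and covariance Sigma
A_true = rng.standard_normal((n, n))
Sigma = A_true @ A_true.T + n * np.eye(n)    # well-conditioned covariance
mu = rng.standard_normal(n)
X = rng.multivariate_normal(mu, Sigma, size=N)

# d^2 = (x - mu)^T Sigma^{-1} (x - mu) = z^T z (Eq. 1.7), via whitening z = L^{-1}(x - mu)
L = np.linalg.cholesky(Sigma)
Z = np.linalg.solve(L, (X - mu).T).T         # whitened: zero mean, unit covariance
d2 = np.sum(Z**2, axis=1)

# Fukunaga's moments for the design class: E{d^2} = n, Var{d^2} = gamma*n, gamma = 2
print(d2.mean(), d2.var())
```

The sample mean lands near n and the sample variance near 2n, matching the gamma (chi-square) density in (1.8).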


SEQUENTIAL HYPOTHESIS TESTING

Previously, we have discussed a general hypothesis testing scheme for situations such as x_t = [φ λ h v]_t, x = [x_1 x_2 … x_n], and noted that the naïve Bayesian approximation refers to the independence between features, which here are distinct parameters. One distinguishing characteristic of tracks is that they update with time, so we often have many samples available from which to draw an inference. A basic approach is the averaging of observation vectors; this can theoretically achieve zero error when the hypotheses have different expectations. Let x_1, x_2, x_3, x_4, …, x_n be a series of observed feature vectors over time. If these are assumed to be independent and identically distributed, then (1.5) applies directly. Thus, we examine the distribution of the evidence function and seek to accept or reject our hypothesis ω_0 depending on the sign of e(ω_0|x). If the log-likelihood ratio and its moments are defined as

$$ h(x_i) = 10\log\frac{p(x_i|\omega_0)}{p(x_i|\omega_1)}, \qquad E\{h(x_i)|\omega_j\} = \eta_j, \qquad \mathrm{Var}\{h(x_i)|\omega_j\} = \sigma_j^2 \tag{1.10} $$

then E{e(x)|ω_j} = nη_j and Var{e(x)|ω_j} = nσ_j². This demonstrates a well-known scaling with sample number in the sequential testing case: the expectation of the evidence grows as n, while its standard deviation grows only as √n. This implies that the evidence density functions for the two classes become more separable, and more normal, as n increases. If the expectation of the log-likelihood function for each class is identical, then a linear classifier cannot improve our separability in the sequential test. However, if the two classes have different covariance matrices for h(x_i), then a quadratic classifier will provide increased separability with sample number. When there is correlation among the time series, the error mitigation is significantly affected, since a cross-correlation term arises in the computation of the variance of the evidence.
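The √n separability scaling can be seen in a small simulation. This is an illustrative sketch, not the paper's data: the per-sample evidence is modeled as i.i.d. Gaussian with an invented mean and standard deviation.

```python
import numpy as np

def evidence_separability(n, eta0=0.5, sigma=2.0, trials=100_000, seed=1):
    """Mean/std ratio of evidence accumulated over n i.i.d. samples.

    With h(x_i) ~ N(eta0, sigma^2), the sum has mean n*eta0 and standard
    deviation sigma*sqrt(n), so the ratio grows as sqrt(n)*eta0/sigma.
    """
    rng = np.random.default_rng(seed)
    e = rng.normal(eta0, sigma, size=(trials, n)).sum(axis=1)
    return e.mean() / e.std()

for n in (1, 4, 16, 64):
    print(n, evidence_separability(n))   # approaches sqrt(n) * eta0 / sigma
```

Quadrupling the number of samples doubles the separation between the class-conditional evidence distributions, which is the practical payoff of sequential testing when the i.i.d. assumption holds.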

$$ \mathrm{Var}\{e(x)|\omega_j\} = \sum_{i=1}^{n}\mathrm{Var}\{h(x_i)|\omega_j\} + \sum_{i=1}^{n}\sum_{\substack{k=1\\k\neq i}}^{n} E\{(h(x_i)-\eta_j)(h(x_k)-\eta_j)\,|\,\omega_j\} \tag{1.11} $$

In the limit of perfect linear correlation across the samples, where the normalized cross terms E{(h(x_i) − η_j)(h(x_k) − η_j)|ω_j}/σ_j² → 1, the standard deviation of the evidence function and its expectation both grow in proportion to the number of samples, making the procedure tantamount to a single hypothesis test. In Figure 2, we provide an example of a non-zero second term in (1.11) for several components of a feature vector normally constructed in testing hypothesis ω_0. We outline examples of the autocorrelation over multiple samples and demonstrate that it can be significant. For each autocorrelation sequence, there are 3 track-pair examples; for example, the top-left plot shows autocorrelation sequences of the difference in latitude for three track pairs. There are several possible explanations for this. The two most prominent are 1) the interpolation scheme used to provide time alignment of the data, and 2) the cross-correlation of the errors in the track states as reported from two different links or sensors. Nevertheless, the sequential hypothesis test receives continual treatment in the literature. For example, recent papers have treated the Mean-Field Bayesian Data Reduction Algorithm (BDRA) for adaptive sequential classification utilizing Page's test. This method has application in detecting a permanent change in a distribution, or in classifying as quickly as possible with an acceptable Mean Time Between False Alarms (MTBF) [13]. We would view (1.5) as the accumulation of evidence over time (samples) and denote this as

$$ S_t = \sum_{k=1}^{t} e(y_k), \qquad e(y_k) = 10\log\frac{f(y_k\,|\,x, H_1)}{f(y_k\,|\,x, H_0)} \tag{1.12} $$

then the decision rule becomes

$$ S_t - \min_{m<t} S_m \;\gtrless\; h $$

with the threshold h set by the desired false alert rate.
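The running-minimum decision rule above is Page's test (a CUSUM procedure), and it can be sketched in a few lines. The evidence stream and threshold below are invented for the illustration, assuming the per-sample evidence e(y_k) in db has already been computed.

```python
def pages_test(evidence, h):
    """Alarm at the first t where S_t - min_{m<t} S_m exceeds h (Eq. 1.12)."""
    s, s_min = 0.0, 0.0
    for t, e in enumerate(evidence):
        s += e
        if s - s_min > h:        # decision rule: S_t - min_m S_m > h
            return t             # sample index at which the test fires
        s_min = min(s_min, s)    # running minimum of the cumulative sum
    return None                  # no decision within the available data

# Usage: weak negative drift followed by a sustained positive shift in evidence
stream = [-0.2, -0.1, -0.3, 1.0, 1.2, 0.9, 1.1, 1.0]
print(pages_test(stream, h=3.0))
```

Subtracting the running minimum discards the accumulated negative drift, so the test reacts to a sustained change in the evidence stream rather than to its whole history, which is what makes it suited to detecting a permanent change in distribution.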


Figure 2 Correlation of feature vectors impacts the independent, identically distributed assumption of samples in time

CONSTRUCTING THE FEATURE VECTOR FROM KINEMATIC INFORMATION

We outline one model-based approach to hypothesis testing, proposed in the early development of this problem [14], to acquaint the reader with many of the issues in classification for track correlation. Let us suppose the most recent track report on a target from one sensor provides a position P_i = (x_i, y_i) and a second sensor provides a track report P_j = (x_j, y_j). One sensible feature vector is the difference in positions reported by the two sensors,

$$ x = (x_i - x_j,\ y_i - y_j) \equiv (\Delta x, \Delta y) \quad\text{or}\quad x = (\phi_i - \phi_j,\ \lambda_i - \lambda_j) \equiv (\Delta\phi, \Delta\lambda) \tag{1.13} $$

for either a Cartesian or elliptical coordinate system description. Let us take the Cartesian formulation explicitly for further evaluation. As a model of the conditional class density function, we take the distribution of x to be normal, p(x|ω_0) = N(0, Σ), in the case the two tracks are the same target. Recalling our previous definitions, the system is in state ω_0 when the two tracks correspond to the same target and in state ω_1 when the two tracks correspond to different targets. One simple model is that the measurement differences will be uniform in state ω_1, and that this assumption holds over a window whose size is on the order of the average target separation, which scales inversely with the track density. If the window has length scale ξ, then we take p(x|ω_1) = U(0, α(D)), x ∈ D, where α(D) is the area of the domain D ∝ ξ², depending on the exact shape of the window. It further implies that p(x|ω_1) = 0 outside the area D. Applying the Bayesian classifier in (1.4), we can express this simplified association problem with the two-element feature vector x = (Δx, Δy) as

$$ l_{\Delta x\Delta y} = 10\log\frac{\xi^2}{2\pi\,\sigma_{\Delta x}\sigma_{\Delta y}\sqrt{1-\rho_{\Delta x\Delta y}^2}} \;-\; \frac{10\log e}{2}\,d^2_{\Delta x\Delta y}(x) \tag{1.14} $$

with d²_{ΔxΔy}(x) = (x − μ_0)^T Σ^{-1} (x − μ_0), μ_0 = 0, as given by (1.7), covariance matrix

$$ \Sigma = \begin{bmatrix} \sigma_{\Delta x}^2 & \rho_{\Delta x\Delta y}\,\sigma_{\Delta x}\sigma_{\Delta y} \\ \rho_{\Delta x\Delta y}\,\sigma_{\Delta x}\sigma_{\Delta y} & \sigma_{\Delta y}^2 \end{bmatrix}, $$

and correlation coefficient ρ_{ΔxΔy}. We can further extend the feature vector with additional information.
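The Gaussian-versus-uniform likelihood ratio of (1.14) can be sketched numerically. This is an illustrative implementation: the standard deviations, correlation coefficient, window scale, and report differences are invented values, and the uniform density is taken as 1/ξ² (a square window).

```python
import numpy as np

def l_dxdy(dx, dy, sx, sy, rho, xi):
    """Evidence contribution (db) of a position difference, Eq. (1.14)."""
    Sigma = np.array([[sx**2,      rho*sx*sy],
                      [rho*sx*sy,  sy**2]])
    x = np.array([dx, dy])
    d2 = x @ np.linalg.inv(Sigma) @ x                          # Mahalanobis distance, Eq. (1.7)
    p0 = np.exp(-0.5*d2) / (2*np.pi*sx*sy*np.sqrt(1-rho**2))   # Gaussian p(x|w0)
    p1 = 1.0 / xi**2                                           # uniform p(x|w1) over the window
    return 10*np.log10(p0 / p1)

# A near-coincident report pair yields positive evidence that the tracks match
print(l_dxdy(0.01, 0.02, sx=0.05, sy=0.05, rho=0.1, xi=2.0))
```

Small position differences relative to the sensor error scale produce strongly positive db values, while differences approaching the window scale drive the evidence negative.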

If we assume Gaussian distributions for the speed and heading differences in both the correlated and uncorrelated cases, then the likelihood ratio is written as

$$ l_{\Delta s\Delta c} = 10\log\frac{\sigma_{\Delta s,1}\,\sigma_{\Delta c,1}\sqrt{1-\rho_1^2}\ \exp\left\{-\tfrac{1}{2}(x-\mu_0)^T\Sigma_0^{-1}(x-\mu_0)\right\}}{\sigma_{\Delta s,0}\,\sigma_{\Delta c,0}\sqrt{1-\rho_0^2}\ \exp\left\{-\tfrac{1}{2}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\right\}} \tag{1.15} $$

with x = (Δs, Δc), covariance matrices of the form

$$ \Sigma_i = \Sigma(\Delta s, \Delta c)_i = \begin{bmatrix} \sigma_{\Delta s}^2 & \rho_{\Delta s\Delta c}\,\sigma_{\Delta s}\sigma_{\Delta c} \\ \rho_{\Delta s\Delta c}\,\sigma_{\Delta s}\sigma_{\Delta c} & \sigma_{\Delta c}^2 \end{bmatrix}_i, $$

and correlation coefficient ρ_{ΔsΔc}, each evaluated per class. Also, note that E{x = (Δs, Δc)|ω_0} = μ_0 and E{x = (Δs, Δc)|ω_1} = μ_1. If the class means are equal, then all the separability falls to the covariance differences, which would dominate the Bhattacharyya distance. Finally, for altitude or tangent plane height, we take

$$ l_{\Delta z} = 10\log\frac{\sigma_{\Delta z,1}\ \exp\left\{-(\Delta z-\mu_0)^2/2\sigma_{\Delta z,0}^2\right\}}{\sigma_{\Delta z,0}\ \exp\left\{-(\Delta z-\mu_1)^2/2\sigma_{\Delta z,1}^2\right\}} \tag{1.16} $$

with E{Δz|ω_0} = μ_0, E{Δz|ω_1} = μ_1, and Σ_i = σ²_{Δz,i}.

In this model, we have really expressed likelihood ratios for separate components of the complete kinematic feature vector. So far, the key assumption is the logical independence of the likelihood ratios for the horizontal position, the speed/heading, and the altitude. We have allowed for correlation within the horizontal position information, as well as correlation between the course and speed. The independence of position and velocity is not necessarily a good assumption, but a simplifying one, while the independence of altitude is usually a good one except at short range. Also, we have not demonstrated that the probability density functions themselves are well modeled as Gaussian for both classes. Normality tests are usually accomplished with conventional Chi-square tests, with a Beta distribution test on the Mahalanobis distance using mean and covariance estimated from the training data, or with the Kolmogorov-Smirnov test. Using the 5-dimensional feature vector

$$ x_t = [\,\Delta x_{ij}\ \ \Delta y_{ij}\ \ \Delta z_{ij}\ \ \Delta s_{ij}\ \ \Delta c_{ij}\,]^T_t \tag{1.17} $$

from our formula in (1.5), the instantaneous evidence in favor of hypothesis ω_0 is

$$ e(\omega_0|x) = l_{\Delta x\Delta y} + l_{\Delta s\Delta c} + l_{\Delta z} + e(\omega_0). \tag{1.18} $$

Furthermore, in the sequential case under the model assumptions, we envision the accumulation of evidence according to

$$ e(\omega_0|x) = \sum_{t=1}^{n} \left[\, l_{\Delta x\Delta y,\,t} + l_{\Delta s\Delta c,\,t} + l_{\Delta z,\,t} \,\right] + e(\omega_0). \tag{1.19} $$

We mention that the assertion in (1.19), while convenient, is subject to the reality presented by (1.11): Figure 2 implies that the basic i.i.d. assumptions begin to break down when compared to real data. While this model was developed in a Cartesian reference frame, we should take notice of another difficulty that often arises in hypothesis testing. If we were to formulate the above line of reasoning in an elliptical coordinate frame such as WGS-84, we would have to transform the distributions that reflect our knowledge of the feature vector. The approach is as follows. Since the same event (ω_0) has two simultaneous expressions (as a probability in terms of x_t = [Δx_ij Δy_ij Δz_ij Δs_ij Δc_ij]_t or y_t = [Δφ_ij Δλ_ij Δh_ij Δs_ij Δc_ij]_t), the volume in probability density space is conserved. To eliminate confusion, let us momentarily refer to the latter feature vector as y_t; the actual probability should be independent of our method for describing it. For instance, say we have a description of the joint probability density in several variables, but now we want it in terms of other variables:

$$ p(x)\ \text{known, want}\ q(y),\ \text{with}\ x = g(y),\ y = f(x). \tag{1.20} $$

By the argument above, p(x) dx = q(y) dy and


$$q(y) = \sum_i p(x_i)\,\lvert J \rvert \quad (1.21)$$

where the sum is over all $x_i$ leading to the outcome $y$. Note the use of the Jacobian in (1.21). This quantity

relates the change in differential volume elements when transforming coordinates and is defined by

$$J = \frac{\partial x}{\partial y} = \begin{vmatrix}
\dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_2}{\partial y_1} & \cdots & \dfrac{\partial x_n}{\partial y_1} \\
\dfrac{\partial x_1}{\partial y_2} & \dfrac{\partial x_2}{\partial y_2} & \cdots & \dfrac{\partial x_n}{\partial y_2} \\
\vdots & & \ddots & \vdots \\
\dfrac{\partial x_1}{\partial y_n} & \dfrac{\partial x_2}{\partial y_n} & \cdots & \dfrac{\partial x_n}{\partial y_n}
\end{vmatrix} \quad (1.22)$$
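The conservation argument behind (1.20)-(1.21) can be checked numerically. The sketch below uses a hypothetical monotone map $y = e^x$ (chosen for illustration, not taken from the paper), for which the sum in (1.21) has a single term and the Jacobian is $|dx/dy| = 1/y$; the transformed density must still integrate to one:

```python
import numpy as np

def p(x):
    """Standard normal density p(x) for the (hypothetical) source variable."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Change of variables per (1.20)-(1.21) for y = f(x) = exp(x):
# the inverse map is x = g(y) = ln(y) and the Jacobian is |dx/dy| = 1/y.
def q(y):
    return p(np.log(y)) * (1.0 / y)

# Probability volume is conserved, so q(y) integrates to ~1.
y = np.linspace(1e-6, 60.0, 600_000)
total = np.trapz(q(y), y)
print(round(total, 3))
```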

In Figure 3, we show some examples of actual data representing the distribution of the feature vectors, along with Gaussian fits to those distributions. Parzen estimation is also relevant in the estimation of density functions with

$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} \kappa\left(x - x_i\right) \quad (1.23)$$

$\hat{p}(x)$ is the estimated probability density and $\kappa$ is the kernel function, which is typically normal or uniform.
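A minimal sketch of the Parzen estimate (1.23) with a Gaussian kernel; the sample distribution and bandwidth below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def parzen_estimate(x, samples, bandwidth):
    """Parzen density estimate (1.23) with a Gaussian kernel.

    p_hat(x) = (1/n) * sum_i kappa(x - x_i), with kappa a normal kernel
    of width `bandwidth` (the paper notes normal or uniform kernels are
    typical; the Gaussian choice here is an assumption).
    """
    n = len(samples)
    u = (x[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernel.sum(axis=1) / n

# Hypothetical feature samples, e.g. latitude differences of track pairs.
samples = rng.normal(0.0, 0.5, size=500)
grid = np.linspace(-3.0, 3.0, 601)
p_hat = parzen_estimate(grid, samples, bandwidth=0.2)

# The estimate should integrate to ~1 and peak near the true mean (0).
print(round(np.trapz(p_hat, grid), 2), grid[np.argmax(p_hat)])
```

The bandwidth trades bias against variance, which is exactly the smoothing-versus-interpolation tension discussed later for time alignment.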

Figure 3 Histograms representing the feature distributions and Gaussian pdf model fits for $\Delta\phi$, $\Delta\lambda$, and $\Delta v$

While Figure 3 demonstrates at least some superficial indication that normality isn't the worst assumption one could make for these distributions, we have not described the conditional class density for the second class because it is so difficult to infer. Thus, a single hypothesis test is often attempted (a distance classifier as in (1.7)), where we accept the increase in Bayes error due to dimensional folding. Also, even in 5 dimensions, estimating the conditional density functions from the training data is difficult. The simplifications of (1.19) help, but still leave us with two correlation coefficients to estimate. The reader should take care to distinguish between feature component correlations and time-series correlation of a particular feature; each presents unique challenges to the classification problem. Even having overcome the density estimation issue, a potentially more problematic issue arises. Angular misalignment is a dominant source of error. As such, the distributions of changes in latitude, longitude, etc., are functions of the absolute distance between the measuring sensor and the target. Thus, we really should be gathering information on the change in the distributions based on the spatial relationships between the sensors and targets. Setting decision thresholds at 0.02 degrees, for example, will not necessarily lead to the same location on the ROC curve for all targets.

CONNECTING HYPOTHESIS TESTING TO MATCHED FILTERING Under Gaussian conditions the Bayes classifier for the two-class problem becomes a linear classifier when the class covariance matrices are the same and a quadratic classifier when the covariance matrices are different. If we use a classification scheme based on (1.5) of


$$\operatorname{sgn}\big[e(\omega_0 \mid x)\big] \quad (1.24)$$

with evidence, $e(\omega_0 \mid x)$, viewed as an argument to a discriminant function, then we can seek to optimize our classifier subject to some criterion. Dropping $\omega_0$, we take $y(x)$ as the generalization of $e(x)$. For the linear case, the general solution has the form

$$y(x) = V^{T}x + p_0, \qquad V = \big[\,s\,\Sigma_0 + (1-s)\,\Sigma_1\,\big]^{-1}\left(\mu_0 - \mu_1\right) \quad (1.25)$$

where $p_0$ and $s$ are constants. Returning to our earlier discussion, the covariance matrix $\Sigma$ can always be made an identity matrix through a suitable whitening transformation, and the decision rule of (1.24) in the Gaussian, equal-covariance case becomes

$$\operatorname{sgn}\big[e(\omega_0 \mid x)\big] = \operatorname{sgn}\left[\tfrac{1}{2}(x - \mu_1)^{T}(x - \mu_1) - \tfrac{1}{2}(x - \mu_0)^{T}(x - \mu_0) + e(\omega_0)\right] \quad (1.26)$$

which is directly derivable from (1.15) when $\Sigma_{\Delta x \Delta y}^{-1} = \Sigma_{\Delta s \Delta c}^{-1} = I$.

Since $\tfrac{1}{2}\big(\mu_1^{T}\mu_1 - \mu_0^{T}\mu_0\big) + e(\omega_0) = p_0$ is a constant term independent of $x$, we can view (1.26) as proportional to the difference of two correlation operations:

$$\rho_{x\mu_0} - \rho_{x\mu_1} = x^{T}\left(\mu_0 - \mu_1\right) \quad (1.27)$$

where the correlation operation is defined as

$$\rho_{x\mu_i}(\kappa) = \sum_{j=1}^{N} x(j)\,\mu_i(j + \kappa) \quad (1.28)$$

and in (1.27), $\kappa = 0$. This is nothing more than the inner product of $x$ and $\mu_i$ at a given lag $\kappa$. The

decision rule for equation (1.25) then becomes $\operatorname{sgn}\big[y(x) = V^{T}x + p_0\big]$ with $V = (\mu_0 - \mu_1)$. Clearly, the connection between (1.26) and (1.25) is the correlation operation. The decision rule compares the difference in correlation scores to a threshold and assigns a class accordingly. The threshold is determined by the mean class separability and the prior probabilities for each class. We can explain this in terms of basic linear filtering theory. Given an input $x$ and filter $h$, the output of a linear system is $y = h * x$, where $*$ is the convolution operation:

$$y(\kappa) = \sum_{j=1}^{N} h(j)\,x(\kappa - j) \quad (1.29)$$

If $h(\kappa - j) = \mu_i(j + \kappa)$, then (1.28) and (1.29) are equivalent; convolution and cross-correlation can be seen as one and the same, and the discriminant function is nothing more than the $\operatorname{sgn}(\cdot)$ function applied to the difference in outputs of two matched filters. One of the filters has impulse response $\mu_0(\kappa)$ and the other $\mu_1(\kappa)$.
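The equivalence between the correlation score (1.28) and convolution with a time-reversed template (1.29) can be verified directly; the signals below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete check that convolving with a time-reversed template equals
# cross-correlating with the template, so a matched filter realizes
# the correlation score appearing in the discriminant.
x = rng.normal(size=64)    # input signal
mu = rng.normal(size=64)   # class-mean template

corr = np.correlate(x, mu, mode="full")        # sum_j x(j) mu(j + k)
conv = np.convolve(x, mu[::-1], mode="full")   # filter h(j) = mu(-j)

print(np.allclose(corr, conv))   # True
```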

CORRELATION FILTERS: PHASE, NONOVERLAPPING DISJOINT NOISE Finally, we state the key relation in discrete form, as discussed in [15] and many other references, by returning to

$$y(t_j) = \sum_{i=1}^{N} h(t_j - \tau_i)\,x(\tau_i) \quad \Longleftrightarrow \quad \hat{Y}(\nu_j) = \hat{H}(\nu_j)\,\hat{X}(\nu_j) \quad (1.30)$$


The output of a linear system is the convolution of the input with the filter. In the frequency domain, this is just multiplication of their respective transforms. For a track image as demonstrated in Figure 1, applying the 2-D convolution theorem, we can state the two-dimensional version of (1.30) as

$$\hat{Y}(\nu_1, \nu_2) = \hat{H}(\nu_1, \nu_2)\,\hat{X}(\nu_1, \nu_2) \quad (1.31)$$

where $\hat{X}$, $\hat{H}$, and $\hat{Y}$ are the 2-D Fourier transforms of the input track image, the filter, and the output response. For a matched filter,

$$\hat{H}(\nu_1, \nu_2) = \hat{X}^{*}(\nu_1, \nu_2) \quad \Longleftrightarrow \quad h(s_j, t_j) = x(\sigma - s_j, \tau - t_j) \quad (1.32)$$

with $\hat{X}^{*}$ the complex conjugate of $\hat{X}$. For a phase-only filter,

$$\hat{H}(\nu_1, \nu_2) = \exp\!\left[\,i \operatorname{atan}\frac{\operatorname{Im}\,\hat{X}^{*}(\nu_1, \nu_2)}{\operatorname{Re}\,\hat{X}^{*}(\nu_1, \nu_2)}\right] \quad (1.33)$$

Notice that for a matched filter,

$$\hat{Y}(\nu_1, \nu_2) = \hat{X}^{*}(\nu_1, \nu_2)\,\hat{X}(\nu_1, \nu_2) = \big|\hat{X}(\nu_1, \nu_2)\big|^{2} = \Phi(\nu_1, \nu_2) \quad (1.34)$$

From a frequency-plane correlation viewpoint, we can introduce a Fourier-plane nonlinearity in the hope of improving the correlation performance. The effect of this is to allow more complicated decision surfaces that better partition the class regions. One such simple mapping is

$$\hat{H}(\nu_1, \nu_2) = \big|\hat{X}(\nu_1, \nu_2)\big|^{\gamma}\, e^{\,i\phi_{X^*}}, \qquad 0 \le \gamma \le 1, \qquad \phi_{X^*} = \operatorname{atan}\frac{\operatorname{Im}\,\hat{X}^{*}(\nu_1, \nu_2)}{\operatorname{Re}\,\hat{X}^{*}(\nu_1, \nu_2)} \quad (1.35)$$

When $\gamma = 1$, we have the classic matched filter, since then $\hat{H}(\nu_1, \nu_2) = \hat{X}^{*}(\nu_1, \nu_2)$, and when $\gamma = 0$, the phase-only filter. In optical architectures, $0 < \gamma < 1$ nonlinearities are achievable.
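A short sketch of the filter family in (1.35): the same construction yields the matched filter at $\gamma = 1$ and the phase-only filter at $\gamma = 0$. The test image is random, and the zero-bin guard is an implementation convenience, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

def fractional_filter(x_img, gamma):
    """Frequency-plane filter |X|^gamma * exp(i * phase(X*)), per (1.35).

    gamma = 1 gives the classic matched filter H = X*; gamma = 0 gives
    the phase-only filter.
    """
    X = np.fft.fft2(x_img)
    mag = np.abs(X)
    mag = np.where(mag == 0, 1e-12, mag)   # guard exactly-zero bins
    return (mag ** gamma) * np.exp(-1j * np.angle(X))

img = rng.normal(size=(32, 32))            # synthetic "track image"
X = np.fft.fft2(img)

H1 = fractional_filter(img, 1.0)           # matched filter
H0 = fractional_filter(img, 0.0)           # phase-only filter

print(np.allclose(H1, np.conj(X)), np.allclose(np.abs(H0), 1.0))
```

Intermediate $\gamma$ trades the noise tolerance of the matched filter against the sharp, discrimination-friendly peaks of the phase-only filter.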

Figure 4 Track image tiles and correlation output

Phase Invariance for Alignment Errors Time alignment refers to the need to interpolate or extrapolate data so that comparisons can be made across data from different data links and sensors, since the samples from these sources almost never occur at the same time. One of the main ways this becomes an issue is when the kinematics of the target do not follow a linear velocity model. For example, the six-state, or constant-velocity, tracker is the Kalman filter of choice for constant-speed targets, while the nine-state, or constant-acceleration, tracker is the Kalman filter of choice for maneuvering targets. When models do not correspond to actual target dynamics, at best this results in larger uncertainty estimates from the filter and at worst a lost track. The correlation algorithm has many of the same issues. For current track association algorithms, data alignment is an important determinant of algorithm behavior. Some algorithms are so sensitive to these time errors that even with constant-velocity targets, if the reported time is in error


from the actual time by as much as 50 msec, a de-correlation will result. However, when more information is used, we can recognize like patterns more easily even when there are gaps in the data or other distortions.

Figure 5 (Left) Latitude, longitude, and phase plots for 3 correlated track pairs. Even time alignment problems will tend to preserve phase. (Right) Plot of difference vectors for the same information

In practice, one is typically not interested in exact interpolation. First, real samples are usually noisy, and an interpolating function passing through every data point will lead to overfitting and thereby poor generalization. Better generalization will be achieved using a smoother fit that averages out noise. This is the working premise of the modern area of radial basis functions and the generalized theory of kernel machines: classifiers must have useful generalization properties to capture system behavior and not just model the input data. Consider the left and middle plots of Figure 5. The latitude and longitude of three correlated track pairs are shown, along with the associated phases for the track time series. These latitude reports are of the same target, but due to sensor registration issues and temporal sampling differences, the latitudes and longitudes do not correspond perfectly. However, when we look at the phase of this information, we maintain good correspondence across most of the spectrum. On the far right, we see a plot of the time series for a feature vector composed of $x_t = \begin{bmatrix} \Delta\phi_{ij} & \Delta\lambda_{ij} \end{bmatrix}^t$. The latitude difference vectors are on top and the longitude difference vectors on the bottom. There is reasonably wide variation in these quantities for certain track combinations, and we should not expect a constant threshold to work well for all correlated track pair combinations, because rotational misalignment introduces a bias that depends on the physical location of the track with respect to the measuring radar.

Figure 6 Two track images that should correlate. One has a significant translational and rotational bias, and is missing a feature as well.

However, phase is nearly invariant to small angular misalignment and certainly invariant to translation, which is really an additive constant to the feature. This is further demonstrated in Figure 6, where a 5-degree in-plane rotational misalignment and a translation offset are introduced. Furthermore, we have removed a feature from one image that is present in the other. Attaining a correlation peak is still relatively easy, as demonstrated.
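The translation invariance claimed here is the basis of phase correlation, sketched below for a synthetic pattern and a circular shift (the shift amount is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# Phase-only (phase-correlation) matching of a pattern against a
# circularly shifted copy: the correlation peak survives the shift,
# and its location recovers the translation.
ref = rng.normal(size=(64, 64))
shifted = np.roll(ref, shift=(5, 9), axis=(0, 1))

R = np.fft.fft2(ref)
S = np.fft.fft2(shifted)
cross = S * np.conj(R)
cross /= np.abs(cross)                    # keep phase only
surface = np.real(np.fft.ifft2(cross))

peak = tuple(int(v) for v in
             np.unravel_index(np.argmax(surface), surface.shape))
print(peak)   # (5, 9): the introduced translation
```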


Generalized Distortion Modeling In several previous papers, we have outlined the extension of least-squares filter performance to more general and applicable conditions than that of overlapping white noise. Basic to the approach was the construction of a window function, which allowed us to differentiate between the target and background and to specify specific noise processes for a region of the input, rather than uniformly imposing them on the whole input. The general approach is to let the input signal be represented by $x_{ij}$ for the feature vector constructed from track $i$ and track $j$, and to further subdivide the windowing function on the track image so that different features can carry different noises. The model has considerable application in the track correlation problem by allowing us to describe regions of noise specific to certain parameters. Due to space considerations, we refer the reader to [16], [17], [18].

COMBAT ID / REGISTRATION The approach until this point has focused on the use of image topology to make track association decisions. Central to this approach is that tracks have attributes in common that can be compared. In the track image construction process, we can always leave an attribute 'blank' in the image by setting that field to zero, but there must be some common information. This section focuses on the ability to merge disparate information. Such a situation can arise when various input sources each provide a specific piece of information, but none of it alone is enough to make a classification. The polynomial correlation filter (PCF) is designed to address this situation. The objective is to find filters $h_i(m, n)$ such that the filter bank can respond to different transformations of the true class, and do so in a simultaneous manner. Furthermore, positive true-class detections can be due to the filter response to any individual source, several sources, or all of the input data about the object. The typical performance criterion is

$$J(\hat{h}) = \frac{\big|\hat{m}^{+}\hat{h}\big|^{2}}{\hat{h}^{+}\hat{B}\,\hat{h}} \quad (1.36)$$

where $\hat{B}$ is a diagonal matrix related to the spectral properties of the training images. Notice that this criterion is analogous to the MACH filter, where $\hat{B} = \hat{S} + \hat{C}$. Equation (1.36) was extended to multiple sensors by A. Mahalanobis [19]. To explain the idea, imagine that we have several sources of information about an object, $\chi$. This might be in the form of imagery, intelligence, kinematic, or other types of information. We can describe the information as some transformation of the original object, however complicated that transformation might be. Furthermore, let us assume that we can describe this information in some two-dimensional format:

$$x_i(m, n) = f_i(\chi) \quad (1.37)$$

$f_i$ is the transformation applied to object $\chi$ by source $i$, and $x_i(m, n)$ is the information described in a two-dimensional format. We could then design a filter bank such that the correlation output plane is expressed as

$$y(m, n) = \sum_{i=1}^{T} h_i(m, n) \otimes f_i(\chi) \quad (1.38)$$

where $\otimes$ is the correlation operation (see equation (1.28)) and there are $T$ sources of information on $\chi$. Imagine we had three sources of information on $\chi$ and we wanted to design a filter to recognize $\chi$ based on the output of the 3 sensors. Let $x_1(m, n) = f_1(\chi)$ be the output of a radar tracking the object with kinematic values supplied at various times. That is, a row, $m$, would be one parameter such as course, speed, latitude, altitude, or longitude, and the column, $n$, would represent a particular sample at a particular time. Let $x_2(m, n) = f_2(\chi)$ be the output of an LWIR FLIR imaging the target, let $x_3(m, n) = f_3(\chi)$ be hyperspectral information such as plume signature data, and let

$$\hat{m}_i = \frac{1}{T_i} \sum_{j=1}^{T_i} \hat{x}_{ij}$$

be the


average training image in frequency space for source input $i$. The average filter response in the frequency plane to source input $i$ is $\langle y_i \rangle = \hat{m}_i^{+}\hat{h}_i$. For this example, the Average Correlation Height (ACH) and Average Similarity Metric (ASM) are

$$\mathrm{ACH} = \sum_{i=1}^{3} \big|\hat{m}_i^{+}\hat{h}_i\big|^{2} \qquad \text{and} \qquad \mathrm{ASM} = \sum_{i=1}^{3}\sum_{j=1}^{3} \hat{h}_i^{+}\,\hat{\Sigma}_{ij}\,\hat{h}_j$$

with $\hat{\Sigma}_{ij}$ a frequency-plane term that resembles the covariance between the training image sets from sensor $i$ and sensor $j$. Mahalanobis showed that an optimal filter for this problem can be written as

$$\hat{h} = \hat{\Sigma}^{-1}\hat{m} \quad (1.39)$$

and we have an effective algorithm for a CID architecture.
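A numerical sketch of the solve $\hat{h} = \hat{\Sigma}^{-1}\hat{m}$ in (1.39), carried out independently at each frequency bin; the sensor count, training-set size, and regularization term are hypothetical choices, not values from [19]:

```python
import numpy as np

rng = np.random.default_rng(5)

# Multi-sensor filter solve per (1.39), done bin by bin.  Training
# "images" here are random complex spectra standing in for the
# frequency-domain track images of each source.
n_sensors, n_train, n_freq = 3, 10, 256   # flattened frequency bins

X = rng.normal(size=(n_sensors, n_train, n_freq)) + \
    1j * rng.normal(size=(n_sensors, n_train, n_freq))

m = X.mean(axis=1)                         # m_hat_i, shape (3, n_freq)

h = np.empty_like(m)
for k in range(n_freq):
    # Cross-sensor covariance-like matrix at bin k (Hermitian, 3x3).
    D = X[:, :, k]                         # (n_sensors, n_train)
    Sigma = (D @ D.conj().T) / n_train
    Sigma += 1e-6 * np.eye(n_sensors)      # regularize for invertibility
    h[:, k] = np.linalg.solve(Sigma, m[:, k])

# Average-correlation-height style score, analogous to the numerator
# of (1.36) summed over sources.
ach = sum(abs(np.vdot(m[i], h[i])) ** 2 for i in range(n_sensors))
print(h.shape, ach > 0)
```

The per-bin structure is what makes the filter cheap: each frequency requires only a small Hermitian solve, regardless of image size.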

Figure 7 A track image with additional information useful to the CID process

A simple example is given in Figure 7. The tiles have latitude, longitude, altitude, course, and speed on the vertical axis and time on the horizontal axis. Some tracks have mode-II / mode-III codes and frequency information from an Electronic Intelligence (ELINT) system. These mode codes are used to distinguish friendly platforms from hostile ones, and frequency information can be a non-cooperative way of identifying the platform.

CONCLUSION We have discussed the number one problem facing battlefield information systems today: track association/correlation for improved situational awareness and CID. We outlined in some detail how statistical hypothesis testing is used today to address this problem and described these techniques with a specific example. The statistical viewpoint is quite useful for understanding performance issues and theoretical bounds. As a novel approach to track association, we introduced the concept of image topology with regard to tracking and showed how this topology can be used to develop a complete track association system with ID capability. We looked at several frequency-plane nonlinear and composite filters and discussed how the sensor registration covariance can be used to define the region in which associated tracks are expected to lie with high probability. We then sample this region of uncertainty in such a way as to generate training images for use in composite filters. Furthermore, the majority of the filters discussed lend themselves well to optical processing, and we have discussed a commercial system that can be used to implement simplified versions of nonlinear and/or composite filters.

APPENDIX

First, take the variances of the location of the target in two reported positions: a local and remote report. The covariance matrix, Σ , for ,l lx y is expressed as

$$\Sigma_l = \begin{bmatrix} \sigma_{x_l}^{2} & \rho_l\,\sigma_{x_l}\sigma_{y_l} \\ \rho_l\,\sigma_{x_l}\sigma_{y_l} & \sigma_{y_l}^{2} \end{bmatrix}, \qquad \Sigma_r = \begin{bmatrix} \sigma_{x_r}^{2} & \rho_r\,\sigma_{x_r}\sigma_{y_r} \\ \rho_r\,\sigma_{x_r}\sigma_{y_r} & \sigma_{y_r}^{2} \end{bmatrix} \quad (1.40)$$

With radars, the measurements are typically performed in $(R, \theta)$, and these variables are assumed to be independent. For most filters, this is a good assumption. If $(R_i, \theta_i)$ is the target range and bearing from one sensor and $(R_j, \theta_j)$ the target range and bearing from the second, then the measurement variances for the two sensors are


$$\begin{aligned}
\sigma_{x_i}^{2} &= \sigma_{R_i}^{2}\sin^{2}\theta_i + R_i^{2}\,\sigma_{\theta_i}^{2}\cos^{2}\theta_i, &\qquad
\sigma_{x_j}^{2} &= \sigma_{R_j}^{2}\sin^{2}\theta_j + R_j^{2}\,\sigma_{\theta_j}^{2}\cos^{2}\theta_j \\
\sigma_{y_i}^{2} &= \sigma_{R_i}^{2}\cos^{2}\theta_i + R_i^{2}\,\sigma_{\theta_i}^{2}\sin^{2}\theta_i, &\qquad
\sigma_{y_j}^{2} &= \sigma_{R_j}^{2}\cos^{2}\theta_j + R_j^{2}\,\sigma_{\theta_j}^{2}\sin^{2}\theta_j \\
\rho_i &= \frac{\big(\sigma_{R_i}^{2} - R_i^{2}\sigma_{\theta_i}^{2}\big)\sin 2\theta_i}{2\,\sigma_{y_i}\sigma_{x_i}}, &\qquad
\rho_j &= \frac{\big(\sigma_{R_j}^{2} - R_j^{2}\sigma_{\theta_j}^{2}\big)\sin 2\theta_j}{2\,\sigma_{y_j}\sigma_{x_j}}
\end{aligned} \quad (1.41)$$

with $\sigma_{R_i}$ being the first sensor range error and $\sigma_{\theta_i}$ the sensor azimuth error. These are fundamental system performance

parameters. It is easy to show that the variances and correlation coefficient of the differences are given by

$$\sigma_{\Delta x}^{2} = \sigma_{x_i}^{2} + \sigma_{x_j}^{2}, \qquad \sigma_{\Delta y}^{2} = \sigma_{y_i}^{2} + \sigma_{y_j}^{2}, \qquad \rho_{\Delta x \Delta y} = \frac{\rho_i\,\sigma_{x_i}\sigma_{y_i} + \rho_j\,\sigma_{x_j}\sigma_{y_j}}{\sigma_{\Delta x}\,\sigma_{\Delta y}} \quad (1.42)$$
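The propagation formulas in (1.41) can be sanity-checked by Monte Carlo; the range, bearing, and error sigmas below are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(6)

# Monte Carlo check of the range/bearing -> Cartesian error propagation
# in (1.41), with bearing measured from north (x = R sin(theta),
# y = R cos(theta)).
R, theta = 50_000.0, np.deg2rad(30.0)     # true range (m) and bearing
sig_R, sig_th = 30.0, np.deg2rad(0.2)     # range and azimuth errors

n = 1_000_000
Rm = R + sig_R * rng.normal(size=n)
thm = theta + sig_th * rng.normal(size=n)
x, y = Rm * np.sin(thm), Rm * np.cos(thm)

# Analytic (small-error) variances and correlation from (1.41).
var_x = sig_R**2 * np.sin(theta)**2 + R**2 * sig_th**2 * np.cos(theta)**2
var_y = sig_R**2 * np.cos(theta)**2 + R**2 * sig_th**2 * np.sin(theta)**2
rho = (sig_R**2 - R**2 * sig_th**2) * np.sin(2 * theta) / \
      (2.0 * np.sqrt(var_x * var_y))

print(np.isclose(x.var(), var_x, rtol=0.02),
      np.isclose(y.var(), var_y, rtol=0.02),
      np.isclose(np.corrcoef(x, y)[0, 1], rho, atol=0.02))
```

Note how strongly the azimuth error dominates at long range (the $R^2\sigma_\theta^2$ terms), which is the point made above about angular misalignment.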

In a similar fashion, the speed-heading differences can be expressed as

$$\begin{bmatrix} \Delta s \\ \Delta c \end{bmatrix} = \begin{bmatrix} s_i \\ c_i \end{bmatrix} - \begin{bmatrix} s_j \\ c_j \end{bmatrix} \quad (1.43)$$

Since each feature element is formed by subtracting two random variables, its probability density is the convolution of the component densities, and we find the entries in the covariance matrix:

$$\sigma_{\Delta s}^{2} = \sigma_{s_i}^{2} + \sigma_{s_j}^{2}, \qquad \sigma_{\Delta c}^{2} = \sigma_{c_i}^{2} + \sigma_{c_j}^{2}, \qquad \rho_{\Delta s \Delta c} = \frac{\rho_i\,\sigma_{s_i}\sigma_{c_i} + \rho_j\,\sigma_{s_j}\sigma_{c_j}}{\sigma_{\Delta s}\,\sigma_{\Delta c}} \quad (1.44)$$

REFERENCES

1. R. Reynolds, Colonel (Ret.) USAF, private communication.
2. RADM M. Mathis, Col H. Dutchyshyn, and CAPT J. Wilson, "Single Integrated Air Picture," Network Centric Warfare Conference, American Institute of Engineers, 23 October 2001.
3. C. Stanek, B. Javidi, and P. Yanni, "Image-based Topology for Sensor Gridlocking and Association," SPIE Proceedings in Automatic Target Recognition, Vol. 4726, April 2002.
4. C. Stanek, B. Javidi, and P. Yanni, "Filter Construction for Topological Track Association and Sensor Registration," SPIE Annual Meeting Proceedings, Vol. 4789, 2002.
5. O. Drummond, "Track and Tracklet Fusion Filtering Using Data from Distributed Sensors," Proceedings of the Workshop on Tracking, Estimation, and Fusion: A Tribute to Bar-Shalom, May 2001.
6. Sensor gridlock is often called sensor registration: the process of registering the sensor's frame of reference to a common frame of reference or datum. The accurate registration of multiple sensors is required before any gains in precision can be made.
7. E. T. Jaynes, Bayesian Methods: General Background, An Introductory Tutorial, p. 8, 1996.
8. E. T. Jaynes, Probability Theory as Logic: Hypothesis Testing, Chapter 4, 1994.
9. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Edition, Ch. 3, 1990.
10. G. Toussaint, Course Notes in Pattern Recognition 308-644B, McGill University.
11. C. Elkan, Naïve Bayesian Learning, Dept. of Computer Science, Harvard University.
12. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Edition, pp. 67-73, 1990.
13. R. Lynch and P. Willett, "Adaptive Sequential Classification Using Page's Test," Proceedings of SPIE, Vol. 4731, 2002.
14. Applied Physics Laboratory, Johns Hopkins University, Gridlock Analysis Report, Vol. III, Issue 1, July 1982.
15. X. Rong Li, Probability, Random Signals, and Statistics, CRC Press, 1999.
16. B. Javidi and J. Wang, "Design of filters to detect a noisy target in nonoverlapping background noise," J. Opt. Soc. Am., Vol. 11, No. 10, October 1994.
17. B. Javidi, F. Parchekani, and G. Zhang, "Minimum-mean-square-error filters for detecting a noisy target in background noise," Applied Optics, Vol. 35, No. 35, December 1996.
18. B. Javidi and J. Wang, "Optimum distortion-invariant filter for detecting a noisy target in nonoverlapping background noise," J. Opt. Soc. Am., Vol. 12, No. 12, December 1995.
19. B. Javidi, ed., Image Recognition and Classification, Ch. 10, Marcel Dekker, 2002.