
A New Framework for On-Line Change Detection

Ankur Jain and Yuan-Fang Wang
Department of Computer Science
University of California, Santa Barbara, CA 93106

SUMMARY

In this paper, an on-line change-detection algorithm is proposed. The algorithm is applicable to detecting changes in both independent and dependent random processes. It is specifically tailored for on-line applications, and uses only a small amount of memory and a reasonable computational effort.

The proposed algorithm does not require knowledge of the form (e.g., that the distribution is Gaussian) or the parameters (e.g., that the distribution has zero mean) of the probability distributions of the processes before and after the changeover. This represents a significant relaxation of the assumptions in most other algorithms (e.g., CUSUM and GLR), which must know the form and the parameter values of the process before the changeover (and often of the process after the changeover as well). While such knowledge is available in, say, process control applications, it is not in many others. For example, in signal segmentation, the statistical properties of the before and after processes are often not known. What matters is that if a certain statistical property has changed in the signal, then the signal should be broken into pieces.

Theoretically, it is proven that the proposed algorithm produces the correct detection results in the expected sense, and is an unbiased estimator of the changeover location under certain general conditions. In practice, not only is the proposed technique more general than the traditional techniques, it is also significantly more accurate, especially in the difficult cases where the populations of the before and after processes are not well separated. As will be demonstrated in the experimental results, the CUSUM method, widely regarded as one of the best techniques for process monitoring, performs significantly worse than the new technique in terms of both detection accuracy and detection bias.


1 INTRODUCTION

Detecting changes (novelty, abnormality, irregularity, faults, etc.) in data is important in diverse application domains. To give a few examples: the ability to detect malicious intrusion attempts and to identify denial-of-service attacks is a must in modern computer networks. A network of sensors might be deployed in a forest to monitor the humidity, temperature, and acidity of the soil; abnormal readings might indicate an incipient forest fire or illegal toxic-waste dumping. In signal analysis, a basic operation is to decompose a signal into stationary, or weakly stationary, segments with distinct statistical properties. Quality control in an assembly line requires continuous, on-line monitoring of the production process to identify it as either in control or out of control. In an aircraft or spacecraft navigation system, it is critical to identify and isolate faulty sensors or devices to maintain normal system operation under adverse conditions. In a water (or air) quality-control system, it is important that an alarm be raised if an abnormal concentration of certain harmful substances (e.g., lead in the water or ozone in the air) is detected. All these problems fall in the realm of change/abnormality/fault/novelty detection. It is obvious that a good change-detection algorithm must be able to quickly and accurately identify the changes while raising as few false alarms as possible.

In [6], change-detection algorithms are classified into two categories, namely, those for detecting changes (1) in independent random sequences^1 with distributions parameterized by scalar parameters, and (2) in dependent sequences characterized by a multi-dimensional parameter vector. The latter can be further divided based on the types of changes allowed (either additive or non-additive/spectral changes) and the system models assumed (a linear or a nonlinear system model). In the case of a linear system model, the model can be an iterative model (corresponding to an FIR filter), a recursive model (corresponding to an IIR filter), or a state-space model (corresponding to the model used in the Kalman filter) [15, 26, 8, 25].

While the change-detection problems in the second category appear to be much more complicated than those in the first category, these two types of problems are actually intimately related. One systematic approach (Fig. 1) to detecting changes in a dependent random sequence is to perform a "whitening" (or inverse) transform.

^1 An independent (dependent) random sequence is a realization of a random process where the random variables at each time instance are independent (not independent).


Figure 1: A unified framework to address the change-detection problems of both dependent and independent random processes.

The whitening transform T^{-1}, together with the original system transform T, produces an identity mapping, as shown in Fig. 1. It has the effect of "peeling away" the dependency induced by the dependency models to reveal the "driving force" behind the system change. Often, the driving force can be treated as an independent random process, which implies that the change-detection algorithms in the first category are again applicable.
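As a concrete illustration of the whitening idea, the sketch below assumes an AR(1) dependency model fitted by least squares and uses the residuals as the approximately independent driving process. This is a minimal example under my own assumptions (the function name ar1_whiten and all parameters are illustrative), not the paper's implementation.

```python
import numpy as np

def ar1_whiten(x):
    """Whitening under an assumed AR(1) model x_t = a * x_{t-1} + e_t.

    The coefficient a is estimated by least squares; the residuals e_t
    approximate the independent "driving force" behind the dependent
    sequence, so a first-category detector can then be applied to them.
    """
    x = np.asarray(x, dtype=float)
    a = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])  # least-squares AR(1) coefficient
    return x[1:] - a * x[:-1]                           # residual (whitened) sequence

# Example: an AR(1) process whose driving noise changes variance mid-stream.
rng = np.random.default_rng(0)
e = np.concatenate([rng.normal(0, 1, 500), rng.normal(0, 3, 500)])
x = np.zeros_like(e)
for t in range(1, len(e)):
    x[t] = 0.8 * x[t - 1] + e[t]
residuals = ar1_whiten(x)  # change detection now runs on this i.i.d.-like sequence
```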

2 METHODS

The proposed technique is specifically designed to address change detection in independent random sequences. The algorithm operates on the hypothesis-and-verification principle: in the first stage, we compute the index where the changeover is most likely to have occurred within the processing window (hypothesis), and in the second stage, we validate whether such a hypothesis is correct (verification), as there might be no changeover at all.

We formulate the change-detection problem in an independent random sequence mathematically as follows: Consider a data stream X_1, X_2, ..., X_n of length n, which is composed of random variables from possibly two distributions, i.e., X_1, X_2, ..., X_{k-1} are I.I.D. variables having a distribution D_1 while X_k, X_{k+1}, ..., X_n are I.I.D. variables having a distribution D_2.^2 The changeover point k is unknown and can be anywhere between 2 and n, or beyond (if k = 1 or k > n, then there is only one type of I.I.D. variable in X_1, ..., X_n). The core algorithm is then for determining the occurrence and the position of the changeover. For sensor-network and computer-network applications, it is especially important that the design be for an on-line algorithm using a small amount of memory and a reasonable computational effort.

^2 As will be shown in the experimental results, the analysis window size n in practice is very short. Hence, it is reasonable to assume that there is at most one changeover within the window.

In an on-line application, data are continuously streamed from a source to a receiver. Our algorithm operates by sliding an analysis window of length n over the data stream X_i, i = 1, 2, ..., and examines whether the changeover point occurs within the window. To simplify the notation, we denote the n data items within the analysis window X_1, ..., X_n, regardless of their true positions in the stream. Furthermore, we allow the algorithm to operate on a slightly longer data stream X_{1-m}, ..., X_0, X_1, ..., X_n, X_{n+1}, ..., X_{n+m}; that is, the processing window is padded with m extra data items at each end (X_{1-m}, ..., X_0 and X_{n+1}, ..., X_{n+m}). This padding is necessary to ascertain with accuracy the occurrence of the changeover point from index 2 up to and including index n. The window can be positioned in either an overlapped or a non-overlapped manner over the data stream. If the window is positioned in a non-overlapped manner over successive groups of n data items, there is an average latency of n/2 + m steps in reporting the changeover, if one did occur within the window. If the window is allowed to center at each and every data item, the latency is cut to m steps.

While we do not assume that the form or the parameters of the underlying distributions are known, we can nonetheless test for the changeover by observing whether certain statistical properties (such as the data stream's mean and variance) have changed. Denote the statistic to watch as f. Then for an index l, 1 < l <= n, within the processing window, we define the following four functions of l (the means and variances of the statistic of the before and after component distributions, separated at l):^3

\bar{f}_1(l) = \frac{\sum_{i=1-m}^{l-1} f(X_i)}{n_1}, \qquad s^2_{f_1}(l) = \frac{\sum_{i=1-m}^{l-1} \big(f(X_i) - \bar{f}_1\big)^2}{n_1 - 1},
\bar{f}_2(l) = \frac{\sum_{i=l}^{n+m} f(X_i)}{n_2}, \qquad s^2_{f_2}(l) = \frac{\sum_{i=l}^{n+m} \big(f(X_i) - \bar{f}_2\big)^2}{n_2 - 1}    (1)

^3 To be consistent with the notation commonly used in statistics [46, 43], we denote the mean, standard deviation, and variance of the statistic f of a finite sample as \bar{f}, s_f, and s^2_f, and those of the underlying population as \mu_f, \sigma_f, and \sigma^2_f. It is important to properly distinguish the sample quantities from the population quantities. For example, the sample variance for a sample of size n is computed by dividing the squared sum by n - 1 instead of n if the sample mean is estimated from the same data set.

where n_1 = m + l - 1, n_2 = n + m - l + 1, and n_1 + n_2 = n + 2m. If the windowed stream does contain two distributions, intuitively, each individual distribution should be homogeneous in f while the two distributions should be distinct in f. Hence, we define two statistics on the stream: the between-class scatter

s_b(l) = \frac{n_1 n_2}{n_1 + n_2} \big(\bar{f}_1(l) - \bar{f}_2(l)\big)^2    (2)

that measures the degree of dissimilarity between the two distributions (large values indicating a high degree of dissimilarity), and the within-class scatter

s_w(l) = (n_1 - 1)\, s^2_{f_1}(l) + (n_2 - 1)\, s^2_{f_2}(l)    (3)

that measures the degree of homogeneity within the two distributions, weighted by the sample sizes (small values indicating a high degree of homogeneity). The factor n_1 n_2 / (n_1 + n_2) is used to weigh the relative importance of the two scatter measures. A logical choice of the changeover point is then an index that minimizes s_w(l) while maximizing s_b(l); that is, the ratio s_w(l)/s_b(l) should be minimized [10, 16].^4

An algorithm for changeover detection that tests the ratio of s_w and s_b obviously requires only limited storage space (more precisely, only n + 2m data points need to be kept at a time). As an on-line algorithm, we also need an efficient way to compute and update s_w and s_b. It is easy to see that simple recursive formulas exist to compute \bar{f}_1(l+1) and \bar{f}_2(l+1) from \bar{f}_1(l) and \bar{f}_2(l), i.e.,

\bar{f}_1(l+1) = \frac{\sum_{i=1-m}^{l} f(X_i)}{m+l} = \frac{\sum_{i=1-m}^{l-1} f(X_i) + f(X_l)}{m+l} = \frac{m+l-1}{m+l}\,\bar{f}_1(l) + \frac{f(X_l)}{m+l},
\bar{f}_2(l+1) = \frac{\sum_{i=l+1}^{n+m} f(X_i)}{n+m-l} = \frac{\sum_{i=l}^{n+m} f(X_i) - f(X_l)}{n+m-l} = \frac{n+m-l+1}{n+m-l}\,\bar{f}_2(l) - \frac{f(X_l)}{n+m-l}    (4)

and hence, s_b(l) can be computed recursively as well.
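As a concrete illustration of Eq. (4), the following minimal sketch (variable and function names are mine) performs the constant-time shift of the split point from l to l + 1; here n_1 = m + l - 1 and n_2 = n + m - l + 1 are the group sizes before the shift.

```python
def shift_split(f1, f2, fx_l, l, n, m):
    """Move the split from l to l+1 using the recursions of Eq. (4).

    f1, f2 : current left/right sample means of the statistic f
    fx_l   : f(X_l), the item crossing from the right group to the left
    """
    n1, n2 = m + l - 1, n + m - l + 1       # group sizes at split l
    f1_next = (n1 * f1 + fx_l) / (n1 + 1)   # left mean absorbs f(X_l)
    f2_next = (n2 * f2 - fx_l) / (n2 - 1)   # right mean releases f(X_l)
    return f1_next, f2_next
```

Each shift costs a constant number of operations, so sweeping all splits in a window is O(n).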

It is much more complicated to compute s_w(l) recursively. However, one observes that, for the windowed stream with padding, the total variance of the statistic

s^2_f = \frac{\sum_{i=1-m}^{n+m} \big(f(X_i) - \bar{f}\big)^2}{n + 2m - 1}    (5)

is a constant, independent of the choice of l (here \bar{f} = \frac{\sum_{i=1-m}^{n+m} f(X_i)}{n+2m} is the total mean). The following lemma shows that n + 2m - 1 times the total variance equals the sum of the between- and within-class scatters, i.e., (n + 2m - 1)\, s^2_f = s_w(l) + s_b(l) for all l. Hence, the sum of the two scatter measures is a constant regardless of the choice of l. As the sum of s_b and s_w is a constant, minimizing the ratio is equivalent to maximizing s_b. Therefore we need only compute s_b with the recursion formulas above.

^4 The ratio expression of the between- and within-class variances was previously used in Fisher's linear discriminant analysis (FDA) [10, 16]. The FDA is designed to achieve a reduction in the feature-space dimensions while preserving the separation of the two populations in a binary pattern-classification problem. While we use the same ratio expression, we use it in an entirely different context of changeover detection. Furthermore, we prove that the proposed change-detection algorithm is correct and unbiased if certain conditions are true.

Lemma 1. n + 2m - 1 times the total variance equals the sum of the between- and within-class scatters. Hence the sum of the two scatter measures is a constant, regardless of the choice of the changeover point l.
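The identity is easy to check numerically. The snippet below is a sketch assuming numpy (the window setup is mine); it verifies that (n + 2m - 1) s_f^2 = s_w(l) + s_b(l) at every split of a random padded window.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 5
f = rng.normal(size=n + 2 * m)               # f(X_i) over the padded window
total = (n + 2 * m - 1) * np.var(f, ddof=1)  # (n + 2m - 1) * total variance

for l in range(2, n + 1):                    # all admissible splits, 1 < l <= n
    left, right = f[: m + l - 1], f[m + l - 1 :]
    n1, n2 = len(left), len(right)
    sb = n1 * n2 / (n1 + n2) * (left.mean() - right.mean()) ** 2
    sw = (n1 - 1) * left.var(ddof=1) + (n2 - 1) * right.var(ddof=1)
    assert np.isclose(total, sw + sb)        # Lemma 1 holds for every l
```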

The proof is given in the Appendix. While we found no simple formula to satisfactorily analyze the behavior of the above algorithm when the window length is finite,^5 the estimation algorithm can be shown theoretically to give the correct prediction, and to approach an unbiased estimator of the true changeover point, as the window length approaches infinity and provided the two distributions are sufficiently dissimilar. This assertion is stated more formally and proven in the following theorem.

^5 The resolving power of the estimator for a finite window length will be studied through computer experiments.

Theorem 1. If the random variable e, defined as

e = \arg\min_{1 < i \le n} \frac{s_w(i)}{s_b(i)}    (6)

is used as the estimator of the changeover point, and if the changeover indeed occurs within the window at an index k, then asymptotically the following two statements are true:

• The estimator gives the correct prediction in the expected sense, i.e., P(e = k) > P(e = l) for all 1 < l <= n, l != k, and

• The estimator is an unbiased estimator of the changeover point, i.e., E(e) = k, if the two underlying distributions have the same variance.

The statements hold in the asymptotic sense; i.e., the estimator gives the correct expected prediction if

• the window length n approaches infinity,

• the window contains a sufficiently large number of samples from both distributions, and

• |\mu_{f(D_1)} - \mu_{f(D_2)}| > \max(\sigma_{f(D_1)}, \sigma_{f(D_2)}), where \mu_{f(D_i)} and \sigma_{f(D_i)} denote the mean and standard deviation of the statistic f for the distribution D_i, i = 1, 2.

Furthermore, the estimator is an unbiased estimator if \sigma_{f(D_1)} = \sigma_{f(D_2)}. □

The proof is given in the Appendix. While the proof is for the asymptotic case, our algorithm does not require a very large window to perform well. This is because the proof hinges upon the convergence of the sample means to each other and to the population mean as the sample size becomes large. The Central Limit Theorem asserts that the sampling distribution of the mean is Gaussian, with variance dropping at a rate inversely proportional to the sample size [46, 43]. In practical situations, the sample size does not have to be very large for convergence (e.g., for a sample size of 10, the variance of the sample mean is only 10% of that of the underlying population). This point will be illustrated in our experimental results.

What the above procedure does is identify the index where the changeover is most likely to have occurred within the analysis window. However, the hypothesis that a changeover does occur inside the window is not always correct, as there might not be any changeover within the window at all. Regardless, the algorithm will report the index where s_w/s_b is minimal. The obvious question is then: how can one distinguish a real changeover from an illusory one? What is needed is a validation procedure for the hypothesis.

If no changeover has occurred within the window, the two groups of data separated by the given index belong to the same population, and hence should have the same f-mean.^6 On the other hand, if the two sample means are "reasonably" separated, then the changeover point is a valid one. The "reasonable separation" test is a standard statistical significance test (a t-test) [13, 46, 43].

^6 As we do not assume knowledge of the underlying distributions, the f-mean has to be estimated from the data (i.e., the sample mean \bar{f}, not the population mean \mu_f, is used).

In more detail, if the above procedure identifies e, 1 < e <= n, as the changeover point, we compute the mean and variance (Eq. 1) of the sampling distribution for f(X_i) in the range 1 - m <= i < e (denoted \bar{f}_1(e) and s^2_{f_1}(e)) and in the range e <= i <= n + m (denoted \bar{f}_2(e) and s^2_{f_2}(e)). The null hypothesis is then that no changeover has occurred within the window (\bar{f}_1 = \bar{f}_2). The alternative hypothesis is that the changeover did occur and that it occurred at the index e. To reject the null hypothesis and accept the alternative, we evaluate the statistic

t = \frac{\bar{f}_1(e) - \bar{f}_2(e)}{s}    (7)

where s is the estimated standard error of the difference between the means:

s = \sqrt{\frac{2\,\mathrm{MSE}}{n_h}}, \qquad \mathrm{MSE} = \frac{n_1 s^2_{f_1}(e) + n_2 s^2_{f_2}(e)}{n + 2m - 2}, \qquad n_h = \frac{2}{\frac{1}{n_1} + \frac{1}{n_2}}    (8)

Here, MSE is the mean-squared error, n_h is the harmonic mean of the two sample sizes, and n + 2m - 2 is the number of degrees of freedom (DOF) [46, 43]. It is then a simple statistical test to see how likely it is for two sample means from the same underlying population to fall outside, say, the 5% or 1% confidence interval of each other, by looking up the t value computed above in a t table. We accept the hypothesis that a changeover did occur if the probability of the hypothesis satisfies the requirement of the specified confidence interval. Otherwise, we reject the hypothesis, slide the processing window, and repeat the calculation.
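For concreteness, here is a minimal sketch of this validation step, following Eqs. (7)-(8). It assumes scipy is available for the t critical value; the function name and interface are mine.

```python
import numpy as np
from scipy import stats

def changeover_is_valid(f_left, f_right, alpha=0.05):
    """Two-sided difference-of-means t-test of Eqs. (7)-(8).

    f_left, f_right : values of the statistic f on the two groups split at
    the hypothesized changeover index e. Returns True if the null
    hypothesis (no changeover) is rejected at significance level alpha.
    """
    n1, n2 = len(f_left), len(f_right)
    mse = (n1 * np.var(f_left, ddof=1) + n2 * np.var(f_right, ddof=1)) / (n1 + n2 - 2)
    nh = 2.0 / (1.0 / n1 + 1.0 / n2)     # harmonic mean of the sample sizes
    s = np.sqrt(2.0 * mse / nh)          # std. error of the mean difference
    t = abs(np.mean(f_left) - np.mean(f_right)) / s
    return t > stats.t.ppf(1 - alpha / 2, df=n1 + n2 - 2)
```

Since n_1 + n_2 = n + 2m, the n_1 + n_2 - 2 degrees of freedom here match the n + 2m - 2 of Eq. (8).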

An important metric for the detection procedure is the false-alarm rate: if there is no changeover within the window, how likely is the algorithm to report a positive detection (a false positive)? And if there is indeed a changeover, how likely is the algorithm to miss it (a false negative)? In our algorithm, the false-positive rate is completely controlled by the confidence interval used in the statistical t-test. This is because if there is no changeover, then no matter where the changeover index is reported to be by minimizing s_w/s_b, the two groups of data before and after the index are drawn from the same population. The difference-of-the-means test is designed exactly to validate this scenario [13, 46, 43]. The important things to note are that (1) the analysis is valid regardless of the statistic f used, and (2) it is possible to adjust the significance setting to tolerate different levels of the false-positive rate.

The false-negative analysis is more involved. There are two possibilities: (1) the minimal s_w/s_b ratio locates the changeover point correctly, or (2) the ratio test misplaces the changeover point. The first case is again what the difference-of-the-means test is designed to validate. The difficulty with the second case is that if the changeover point is misplaced, say, toward the head (tail) of the window, the second (first) group of data is then "diluted" and contains a mixture of samples from both populations. The sample mean \bar{f}_2 (\bar{f}_1) then shifts toward \bar{f}_1 (\bar{f}_2), which can cause false negatives to occur. In general, the significance level in the difference-of-the-means test is still a good indication of the false-negative rate, because we prove in Theorem 1 that our algorithm locates the changeover point correctly in the expected sense.

Table 1 gives the pseudo-code of the algorithm. The algorithm is extremely fast, as its operations are very simple. A more detailed analysis of Table 1 shows that the algorithm is O(n) with a small constant. More precisely, the initialization step requires n + 2m + 4 FLOPs. The loop in the hypothesis-generation step requires 12 FLOPs and is repeated n - 2 times, for 12(n - 2) FLOPs. Finally, 3(n + 2m) + 12 FLOPs are needed for the hypothesis verification. For a window size (see Sec. 5, Experimental Results) of, say, 100, fewer than 1500 FLOPs are needed for the whole algorithm. Given modern CPU speeds (e.g., Pentium 4 and Athlon 64 CPUs provide giga (10^9) FLOPs per second), this is an effort that takes negligible time.

3 RELATED RESEARCH

To maintain relevance to our proposed scheme, the discussion will focus on algorithms for detecting changes in independent random sequences only. Interested readers are referred to the large number of textbooks on change detection for more in-depth discussions of this and other related topics (the reference section contains a partial list of related results).

Most traditional scalar I.I.D. change-detection algorithms base their detection decision, one way or another, on the log-likelihood ratio (a sufficient statistic) of the observation X_i. In more detail, define

s_i = \ln \frac{P_{\theta_2}(X_i)}{P_{\theta_1}(X_i)}    (9)

where θ is the parameter of the probability distribution P (e.g., P might be known to be normal, and θ_1 and θ_2 are the process means before and after the changeover). Intuitively speaking, if X_i comes from the θ_1 distribution, we would reasonably expect that most likely P_{\theta_1}(X_i) > P_{\theta_2}(X_i) and E(s_i(X_i)) < 0. By the same argument, if X_i comes from the θ_2 distribution, we would expect that most likely P_{\theta_1}(X_i) < P_{\theta_2}(X_i) and E(s_i(X_i)) > 0. Hence a transition from negative s_i's to positive s_i's indicates that a change has occurred.


Table 1: The pseudo-code algorithm.

Initialization:
  k_best = k_curr = 2;
  n1_best = n1_curr = m + 1;   n2_best = n2_curr = n + m - 1;
  f1_best = f1_curr = (\sum_{i=1-m}^{1} f(X_i)) / n1_best;
  f2_best = f2_curr = (\sum_{i=2}^{n+m} f(X_i)) / n2_best;
  sb_best = sb_curr = (n1_best * n2_best / (n1_best + n2_best)) * (f1_best - f2_best)^2;

Hypothesis generation:
  for l = 3, ..., n do {
    k_curr++;  n1_curr++;  n2_curr--;
    f1_curr <- ((m + l - 1) / (m + l)) * f1_curr + f(X_l) / (m + l);
    f2_curr <- ((n + m - l + 1) / (n + m - l)) * f2_curr - f(X_l) / (n + m - l);
    sb_curr <- (n1_curr * n2_curr / (n1_curr + n2_curr)) * (f1_curr - f2_curr)^2;
    if (sb_curr > sb_best) {
      k_best = k_curr;
      n1_best = n1_curr;  n2_best = n2_curr;
      f1_best = f1_curr;  f2_best = f2_curr;
      sb_best = sb_curr;
    }
  }

Hypothesis validation:
  s2_f1_best = (\sum_{i=1-m}^{k_best-1} (f(X_i) - f1_best)^2) / (n1_best - 1);
  s2_f2_best = (\sum_{i=k_best}^{n+m} (f(X_i) - f2_best)^2) / (n2_best - 1);
  MSE = (n1_best * s2_f1_best + n2_best * s2_f2_best) / (n + 2m - 2);
  n_h = 2 / (1/n1_best + 1/n2_best);
  s = sqrt(2 * MSE / n_h);
  t = (f1_best - f2_best) / s;
  if t is larger than the corresponding value from the t-table
    Accept k_best as a real changeover index
  else
    Repeat the procedure for the next window location
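The following Python sketch implements the same pipeline end to end (hypothesis generation via the Eq. (4) recursions, then Eq. (7)-(8) validation). It is my reading of Table 1, not the authors' code; it folds the l = 2 initialization into the loop, and scipy is assumed for the t critical value.

```python
import numpy as np
from scipy import stats

def detect_changeover(fx, n, m, alpha=0.05):
    """fx: statistic f(X_i) over the padded window, i = 1-m, ..., n+m.

    Returns the accepted changeover index k (2 <= k <= n), or None if
    the t-test does not reject the no-changeover null hypothesis.
    """
    assert len(fx) == n + 2 * m
    f1 = fx[: m + 1].mean()                 # left mean at split l = 2 (n1 = m + 1)
    f2 = fx[m + 1 :].mean()                 # right mean (n2 = n + m - 1)
    best_k, best_sb = 2, -np.inf
    for l in range(2, n + 1):               # hypothesis generation: maximize s_b
        n1, n2 = m + l - 1, n + m - l + 1
        sb = n1 * n2 / (n1 + n2) * (f1 - f2) ** 2
        if sb > best_sb:
            best_k, best_sb = l, sb
        x = fx[m + l - 1]                   # f(X_l) crosses to the left group
        f1 = (n1 * f1 + x) / (n1 + 1)       # Eq. (4), O(1) per shift
        f2 = (n2 * f2 - x) / (n2 - 1)
    left, right = fx[: m + best_k - 1], fx[m + best_k - 1 :]
    n1, n2 = len(left), len(right)          # hypothesis validation, Eqs. (7)-(8)
    mse = (n1 * left.var(ddof=1) + n2 * right.var(ddof=1)) / (n1 + n2 - 2)
    s = np.sqrt(mse * (1 / n1 + 1 / n2))    # equals sqrt(2 * MSE / n_h) of Eq. (8)
    t = abs(left.mean() - right.mean()) / s
    if t > stats.t.ppf(1 - alpha / 2, df=n1 + n2 - 2):
        return best_k
    return None
```

To detect a change in a statistic other than the mean, one would pass a transformed stream as fx; for instance, (x - x.mean())**2 tracks changes in variance.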


Certainly, the stochastic nature of a random sequence prohibits us from making the changeover decision based on a single observation. Instead, we define

S_j^k = \sum_{i=j}^{k} s_i = \sum_{i=j}^{k} \ln \frac{P_{\theta_2}(X_i)}{P_{\theta_1}(X_i)}    (10)

Then for a fixed sample size n, the decision rule d is given by

d = \begin{cases} 1 & \text{if } S_1^n < h \;\; (\theta = \theta_1) \\ 2 & \text{if } S_1^n \ge h \;\; (\theta = \theta_2) \end{cases}    (11)

where h is a preset threshold value. The decision determines an alarm time

t_a = n \cdot \min\{k : d_k = 2\}    (12)

where d_k is the decision rule for sample number k. For example, in a production process, successive samples of size n are collected; an alarm sounds and the production process is halted after the first sample (sample k) of size n for which the decision favors θ_2. This scheme is called the Shewhart Control Chart [38] and is widely used in process control.

While the above change-detection process is provably optimal based on the Neyman-Pearson Lemma [6], in real-world applications a number of refinements and alternative formulations are possible.

• Geometric Moving Average Control Chart (GMA) [35]. The idea is to weigh different s_i by their recency. This conforms to the intuition that recent samples are more relevant in making the changeover decision. Frequently, an exponential weight is employed that quickly decays the contribution from past samples.

• Finite Moving Average Control Chart (FMA) [27, 22]. The idea is to use a finite window by replacing the exponential forgetting operation in GMA with a finite-memory one. This is important for real-world operating conditions, such as for a sensor of limited computing and storage power in a sensor network.

• Filtered Derivative Algorithm [5]. In the real world, noise corruption of the observation sequence might be unavoidable. Hence, changeover detectors generally employ a filtering (smoothing) operation before applying a detection algorithm, such as taking the derivative of s_i and looking for a sudden positive jump. However, the smoothing operation has a tendency to introduce multiple peaks in the smoothed derivative of s_i, and hence may cause false alarms to be registered. A stopping rule that counts the number of threshold crossings during a fixed time interval should be used to locate the true crossing and minimize the chance of false alarms.

• Cumulative Sum Algorithm (CUSUM) [27]. This is the most popular algorithm for on-line change detection. It uses an adaptive threshold instead of a fixed one, with the decision rule

d_k = \begin{cases} 1 & \text{if } S_1^k - \lfloor S_1^k \rfloor < h \;\; (\theta = \theta_1) \\ 2 & \text{if } S_1^k - \lfloor S_1^k \rfloor \ge h \;\; (\theta = \theta_2) \end{cases}    (13)

Here \lfloor S_1^k \rfloor = \min_{1 \le j \le k} S_1^j captures the random sequence's current minimum value and is computed and updated on-line. (A minimal implementation sketch is given after this list.)

• A priori Information on the Changeover Time [14]. The enhancements discussed so far do not assume that the distribution of the changeover time is known. In certain cases, such information is indeed available. For example, a production process might have a fixed probability of producing quality (p) and defective (1 - p) products (and once a defective product is produced, the process becomes out of control and all ensuing products are defective). Then the onset of the defective production stage is modeled as a geometric distribution with parameter p. A Bayesian approach allows us to incorporate such a priori information into the change-detection process. For example, the production sequence can be modeled by a Markov chain with two states, in-control (1) and out-of-control (2). The a priori state probabilities \pi_i^{(0)}, i = 1, 2, and the transition probabilities p(i|j), 1 <= i, j <= 2, are then used to propagate the system state through time. \pi_i^{(t)}, i = 1, 2, is then combined with the previous decision rules to generate the best changeover estimate using Bayes' rule.

• Unknown θ_2 [23]. In a process control application, it is reasonable to assume that the statistical properties of the process when it is in control are known. However, many things can go wrong to cause the process to become out of control, and hence it might not be reasonable to expect that the statistical properties of the out-of-control process are known a priori. From a Bayesian viewpoint, θ_2 is now itself a random variable, and P_{\theta_2} is not a unique distribution but a family of distributions indexed by θ_2.^7 A Bayesian formulation, where the contributions from different S_i, based on different assumptions of θ_2, are weighed by their prior (if the prior is known somehow), should then be used to arrive at a maximum-likelihood estimation.

^7 In fact, one can reasonably argue that not only do we not know the parameter θ after the changeover, we do not know the form of the distribution either! The formulation then becomes intractable using the traditional formulations. However, not knowing the parameter or the form of the before and after distributions does not present any difficulty for our formulation.
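Below is the promised minimal CUSUM sketch for the textbook case of a known Gaussian mean shift with equal variances (an assumption of mine; under it, the log-likelihood ratio of Eq. (9) reduces to the linear form used in the code). All names and parameters are illustrative.

```python
import numpy as np

def cusum_alarm(x, mu1, mu2, sigma, h):
    """CUSUM stopping rule of Eq. (13) for a Gaussian mean shift mu1 -> mu2.

    For equal variances, s_i = ln(P_2(x_i) / P_1(x_i))
                             = (mu2 - mu1) / sigma**2 * (x_i - (mu1 + mu2) / 2).
    Returns the first 0-based index k with S_1^k - min_j S_1^j >= h, or None.
    """
    x = np.asarray(x, dtype=float)
    s = (mu2 - mu1) / sigma**2 * (x - (mu1 + mu2) / 2)  # per-sample log-likelihood ratio
    S = np.cumsum(s)                                    # S_1^k
    g = S - np.minimum.accumulate(S)                    # S_1^k - floor(S_1^k)
    hits = np.nonzero(g >= h)[0]
    return int(hits[0]) if hits.size else None

# Example: the mean shifts from 0 to 1 at index 100.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(1, 1, 100)])
print(cusum_alarm(x, mu1=0.0, mu2=1.0, sigma=1.0, h=5.0))  # alarm shortly after 100
```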

4 OUR CONTRIBUTIONS

Compared to the state of the art, the new method makes the following novel contributions:

1. We do not assume that the form (e.g., that the distribution is normal) or the parameters (e.g., that the distribution has zero mean) of the distribution are known for either the before or the after process. This represents a significant relaxation of the assumptions in previous formulations, which must know the form and the parameter values of the before process (and often of the after process as well). While one might argue that such knowledge is available in, say, process control applications, it is not in other applications. For example, in signal segmentation, we often do not know the statistics of either the before or the after process. All we care about is that if a certain statistical property has changed in the signal, then the signal should be broken into pieces. Our new formulation is readily applicable in these general scenarios where the form and parameters of the before and after processes are not known, while the traditional change-detection algorithms are completely powerless. Hence, our algorithm is much more general than all existing ones.

2. Not only is our technique more general than the traditional techniques, it is also significantly more accurate, especially in the difficult cases where the populations are not well separated. As will be demonstrated in our experimental results, the CUSUM method, widely regarded as one of the best techniques for process monitoring, performs significantly worse than our new technique in terms of both detection accuracy and detection bias.

3. Our method works for detecting changes in any statistic, not just the mean. We prove theoretically that our algorithm produces correct detection results and is an unbiased estimator under certain conditions. We also conducted extensive experiments, based on both synthetic and real data, to verify the efficacy of the algorithm.

5 EXPERIMENTAL RESULTS

We use five different probability distributions, of both continuous and discrete types, in our experiments. Normal (characterized by mean µ and variance σ²) and uniform (characterized by a distribution range [min, max]) distributions are widely used in many simulation tasks. We also use three other distributions that are highly relevant in process control and quality measurement.

• The Poisson distribution is frequently used in modeling the rate of occurrence, such as the number of defects in a production line. The distribution has a single parameter λ > 0, which often characterizes the defect rate. The distribution (e.g., for the number of defects X observed in a sample) is

P(X) = \frac{e^{-\lambda} \lambda^X}{X!}, \qquad E(X) = \lambda, \qquad \sigma^2(X) = \lambda    (14)

• The negative binomial distribution is also often used for modeling the rate of occurrence. The distribution has two parameters, n and p, where, in a quality control application, n denotes the maximum allowed number of defects and p measures the probability of a sample being defective. The distribution measures the number of samples X that must be examined to find n defects:

P(X) = \binom{n + X - 1}{X} p^n (1-p)^X, \qquad E(X) = \frac{n(1-p)}{p}, \qquad \sigma^2(X) = \frac{n(1-p)}{p^2}    (15)

• The gamma distribution is a general distribution covering many special cases, including the chi-squared distribution and the exponential distribution. We use it for its generality: it can approximate a large number of other distributions. The distribution is

P(X) = \frac{1}{\Gamma(\alpha)\,\beta^{\alpha}}\, X^{\alpha-1} e^{-X/\beta}, \qquad E(X) = \alpha\beta, \qquad \sigma^2(X) = \alpha\beta^2    (16)

Our experiments repeat the following basic procedure many times under varying parameter settings:

Basic Procedure: We generate two random sequences of the same or different lengths by sampling from two distributions (uniform, normal, Poisson, negative binomial, or gamma). A tunable amount of Gaussian noise may be added to the two samples if necessary. We then concatenate these two sequences back to back, and the concatenated sequence emulates the windowed sequence in our algorithm. We run our algorithm to select the changeover index with the smallest s_w/s_b ratio and record the result. We then repeat the above process a thousand times to create a frequency histogram of the changeover index locations selected by our algorithm.
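As a rough illustration of this procedure, the sketch below (my own code, not the authors' experimental harness) reproduces a Fig. 2(a)-style histogram. It uses the fact from Lemma 1 that minimizing s_w/s_b is equivalent to maximizing s_b, and it ignores padding for brevity.

```python
import numpy as np

rng = np.random.default_rng(42)

def best_split(f):
    """Return the size of the left group at the split maximizing s_b."""
    scores = []
    for l in range(1, len(f)):              # split after the first l items
        a, b = f[:l], f[l:]
        scores.append(len(a) * len(b) / len(f) * (a.mean() - b.mean()) ** 2)
    return int(np.argmax(scores)) + 1

offsets = []
for _ in range(1000):                       # Basic Procedure, Fig. 2(a)-like setting
    seq = np.concatenate([rng.normal(0, 1, 15), rng.normal(2, 1, 15)])
    offsets.append(best_split(seq) - 15)    # 0 means exact localization
hist, edges = np.histogram(offsets, bins=np.arange(-10.5, 11.5))
```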

Figure 2: Sample detection histograms, each representing the distribution of the detected changeover locations from 1000 random runs. The x-axis represents the detected changeover index (0 is the correct index) and the y-axis represents the frequency (out of 1000). Parameter settings are tabulated in Table 2.

Fig. 2 depicts four sample histograms under different parameter settings. Table 2 summarizes the important parameters used in these experiments. Fig. 2(a) shows the results of changeover detection where the before and after processes are both normally distributed, with the same sample size and variance, and with no added noise. The correct changeover location is zero. A positive (negative) index means that some samples from D_1 (D_2) are incorrectly merged into D_2 (D_1). Ideally, a good detection algorithm should generate a frequency histogram that is sharply peaked at zero (correct localization), with the population tapering off quickly and equally (no bias) on the two sides of zero.

Table 2: The parameter values used in the four sample frequency histograms shown in Fig. 2.

            Process  Distribution  f-statistic  Parameters                          Sample size  Noise
Fig. 2(a)   Before   normal        mean         µ_D1 = 0, σ²_D1 = 1                 15           0
            After    normal        mean         µ_D2 = 2, σ²_D2 = 1                 15           0
Fig. 2(b)   Before   normal        mean         µ_D1 = 0, σ²_D1 = 1                 40           5%
            After    uniform       mean         min_D2 = 1, max_D2 = 2              20           5%
Fig. 2(c)   Before   Poisson       mean         λ_D1 = 1                            30           5%
            After    Poisson       mean         λ_D2 = 4                            15           5%
Fig. 2(d)   Before   uniform       variance     min_D1 = -1, max_D1 = 1             30           0
            After    uniform       variance     equal mixture of uniform            30           0
                                                distributions [3,4] and [-4,-3]

Compared to Fig. 2(a), Fig. 2(b) shows a sample result where the before and after processes have different distributions, means, variances, and sample sizes. Fig. 2(c) shows a sample result where noise is added, and the sample sizes and variances are different. Fig. 2(d) shows that a different statistic (variance instead of mean) can be used in the change detection (in this case the two populations have the same mean).

From these four results, one can easily see that quite a number of parameters can vary, and there are a large number of parameter combinations under which the algorithm can be tested. These parameters are discussed in more detail below:

• Effective population separation. One of the most important settings for our algorithm (in fact, we believe, for any change-detection algorithm) is the ratio R that measures the separation of the means of the two populations against their standard deviations:

R = \frac{|\mu_{f(D_1)} - \mu_{f(D_2)}|}{\max(\sigma_{f(D_1)}, \sigma_{f(D_2)})}    (17)

Recall that R = 1 is the minimum separation required to prove that our algorithm performs correctly asymptotically. Intuitively speaking, this ratio determines how easy or hard a changeover problem really is. If the ratio is large, the two populations are spaced far apart compared to their internal spreads, and hence the changeover in f should be very prominent and easily discernible. On the other hand, small R's can cause problems if an algorithm lacks sensitivity and selectivity.

Most experimental results below (Fig. 7 to Fig. 12) will be presented in two ways: one parameterized by R and the other by another parameter (sample size, amount of noise, etc.). For these experiments, we tested R from as small as 0.5 up to 10, representing a twenty-fold change in R. We report here experiments on small R's (R ≈ 2), as larger values of R do not present a challenge.

• Finite window size. It is proven in Theorem 1 that our algorithm gives the correct prediction of the changeover location in a window in the asymptotic sense. However, in reality the window must be of finite length. Hence, it is important to understand the performance of the algorithm for finite window sizes. Fig. 5 shows the experimental results. We vary the window size from 20 all the way to 500, and for each window-length setting, we vary the ratio R from 0.5 to 10. The results are grouped by different R's (R equal to 0.7, 1.0, 2.0, and 3.0 in Fig. 5) for all five distributions. As can be seen, as R increases, the detection algorithm correctly locates the changeover location more than 80% of the time (for R ≈ 3) for all the distributions. The experimental results also show that the performance of our algorithm is not sensitive to the window length, as the performance curves stay almost horizontal for all the window lengths and distributions tested. Hence, in all ensuing figures, we show the results for a typical window length only.

• Limited padding size. Padding is crucial to ascertain the statistical properties of the before (left padding) and after (right padding) subprocesses. Without enough padding on the two ends, the accuracy of changeover detection near the window boundary might suffer. Fig. 6 shows the results of varying the padding size (from 1 all the way to 20, for a window size of 30). Again, these plots are indexed by R. They show the same qualitative trends as those in Fig. 5, i.e., the performance quickly improves with larger R's and approaches almost 100% detection accuracy when R > 3. Again, the performance is not sensitive to the padding size or the distribution assumed. Hence, from now on we show results only for a typical window length (from 15 to 30) and padding size (1).

• Precision vs. recall. In real-world applications, the accuracy of a detection algorithm is measured by its precision and recall rates. Instead of requiring that the algorithm always locate the correct changeover locations with no margin of error, we often provide a fixed error bound and ask how likely it is for the algorithm to achieve that level of localization accuracy. The opposite way is to specify a desired level of detection sensitivity and ask how much localization error we must tolerate to achieve that level of sensitivity. The uncertainty principle dictates that as the desired detection accuracy increases, the detection sensitivity suffers, and vice versa. We verify this principle in Figs. 7 and 8.

In Fig. 7, we specify the levels of location uncertainty that we can tolerate, i.e., the reported changeover location may deviate from the true location by that many indices. The uncertainty range goes from 1 to 10. We then calculate, under each uncertainty setting, the chance that the algorithm locates the changeover point that close to the true location. Certainly, small R's make things harder under the same uncertainty level, as observed in the first two plots in Fig. 7. Furthermore, as we tolerate more uncertainty in the reported locations, the algorithm achieves a higher probability of locating the changeover points, as seen in the second two plots in Fig. 7. The important thing to note is that our algorithm quickly achieves 100% accuracy as the allowed position uncertainty and R increase. When R = 2 (the last plot), i.e., when the samples are reasonably well separated, the localization accuracy is above 80% even with a very small (1 index) error tolerance.

In Fig. 8, we answer the question opposite to that of Fig. 7: if we want to achieve a desired level of detection sensitivity (e.g., we must detect 80% of all changeovers), then how certain can we be about the reported changeover location? The uncertainty principle states that the higher the desired sensitivity, the lower the reported accuracy. While this is true, as can be seen in Fig. 8, we can be very sure of the reported location (uncertainty less than 5 indices) even with extremely high detection sensitivity (90%), if R is reasonable (e.g., R = 2 in the fourth plot of Fig. 8). Even for the critical R setting of 1, we achieve acceptable performance for all five distributions (e.g., an expected position deviation of less than 7 indices at a sensitivity level of 70%).

• Noise corruption. In real-world applications, noise corruption is to be expected. A robust change-detection algorithm must be resilient to a certain amount of noise and still be able to detect the changeover point with high precision and sensitivity. In Fig. 9, we added different amounts of noise to the underlying signals, expressed as a percentage (from 1% to 25%) of the root-mean-square signal strength R(X) = \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n}}. The first two plots in Fig. 9 show that we can quickly overcome noise corruption if the underlying processes are well separated (80% correct localization in the presence of 10% noise for R > 2). Furthermore, for a fixed, reasonable R setting, the algorithm is not very sensitive to the amount of noise, as the performance curves are almost flat and drop off only slightly at high noise levels for all distributions, as depicted in the second two plots of Fig. 9.

• Imbalanced population size. When the processing window slides over the data stream, the true changeover location can appear anywhere within the window. This implies that the sample sizes of the two populations may not be equal. When one sample size is significantly smaller than the other (say, the changeover position is toward one of the two ends of the window), it may become difficult to ascertain the statistical properties of one of the populations. Fig. 10 shows the effect of different sample sizes. We fix one population to have a sample size of 30 and vary the size of the other population from 1.5 to 10 times the first one. As can be seen from the second two plots, our algorithm is insensitive to the variation in the sample sizes and still achieves > 90% precise localization with R ≈ 3.

• Imbalanced population variance. The before and after populations should have different mean f-statistics, but they may also have different variances of the f-statistic. The effect of imbalanced population variance is shown in Figs. 11 and 12. In Fig. 11, we analyze the detection accuracy (i.e., the probability of correct localization of the changeover point), while in Fig. 12 we analyze the detection bias (i.e., how far a reported position might deviate from the true location).

This parameter is important because if the variance of one population is significantly higher (or lower) than the other's, we notice that the bias in the reported positions becomes significant, but the detection accuracy can actually become better. The reason for the bias is that it is possible to move samples from the population with the smaller variance into the collection of the other population, with a much larger variance, without significantly changing the variance of either. However, moving samples from the population with a much larger variance into the population with a smaller variance is very noticeable. Hence, the localization error is often one-sided and biased toward the population with the larger variance.

However, as shown in Fig. 12, the detection bias in our algorithm is never very large, even for small R's and large variance ratios. For example, for R = 0.5 with one population's variance 5 times as large as the other's, the average bias is just about 3 indices, as shown in the third plot. What is interesting is that even though the detection shows a bias for highly imbalanced population variances, the accuracy is actually better, particularly when R ≈ 1. Furthermore, by comparing Figs. 10 and 11, it can be seen that a difference in the population variances plays a more significant role than a difference in the sample sizes. While the performance curves in Fig. 10 are almost flat, they are less so in Fig. 11.

• Other Statistics. Finally, we present some results using statistics other than the mean. We generate two processes: one is a uniform distribution in the range [-1, 1], and the other is a combination of two uniform distributions in the ranges [-4, -3] and [3, 4], with an equal chance of selecting from either. Obviously, these two processes have the same mean of 0, but they are quite distinct in their variance. As long as we use the correct f-statistic, the algorithm is able to locate the changeover point correctly. A frequency histogram (over 1000 random runs) is shown in Fig. 2(d), using the variance instead of the mean as the statistic.

5.1 Comparison with the Traditional Techniques

We compare the proposed technique with a traditional change-detection technique (CUSUM) in terms of both accuracy and bias in changeover detection. CUSUM is a well-established technique in process control, and its performance is widely documented. The idea of CUSUM, as discussed before, is to compare the sum of the log-likelihood ratios with the minimum sum encountered so far; changeover detection is based on Eq. 13. Intuitively speaking, if the process has not yet switched over, the log-likelihood ratio s_k is likely to be negative, and S_1^k keeps decreasing. If the process has switched over, s_k becomes more positive, so S_1^k starts to pick up while \lfloor S_1^k \rfloor remains low. A changeover is declared if S_1^k has risen above a certain threshold from \lfloor S_1^k \rfloor.

Figure 3: (a) Typical CUSUM curves indexed by R (R = 0.5, 1, 3, 5); (b), (c), and (d) are sample histograms of detected changeover locations for different R values (each histogram representing 1000 random runs; x-axis: index position, y-axis: population). When R is small, as in (b) and (c), the detection results show large bias.

Fig. 3(a) shows several such S_1^k - \lfloor S_1^k \rfloor curves. These curves are for a fixed-size window of 60 indices with the changeover location in the middle of the window (index 0). They differ in the ratio R, which controls how similar the before and after populations are. As can be seen in Fig. 3(a), the S_1^k - \lfloor S_1^k \rfloor curves behave as expected (flat before the changeover point and rising linearly afterward) for large R's. For small R's, the curves actually behave very badly, rising too early and creating large detection biases.

This behavior can be understood intuitively as follows: Suppose the before process has a smaller mean than the after process. If we sample the before process, the samples can fall equally likely below or above the process's mean (e.g., if the process is normal). When the separation of the two means becomes small relative to their standard deviations, samples from the before process that are larger than that process's mean have a good chance of being very close (or closer) to the mean of the after process. This makes the log-likelihood ratio more positive, which causes the curve to rise early. The result is that when R is small (R ≈ 2), the detection process suffers large errors, as shown in Figs. 3(b) and (c).

To better quantify the effect observed in Fig. 3, we conduct an experimental test as before. We assume that both the before and after processes are normally distributed (which is usually true in process monitoring, where the uncertainty in a manufacturing process is an I.I.D. normal process). We collect an equal number (30) of samples from two normal distributions with different means, concatenate them to represent a windowed sequence, and add noise up to 25% of the RMS signal strength. We run the proposed algorithm and CUSUM to calculate two measures: (1) detection accuracy (Fig. 13): the probability of the algorithm placing the changeover location in the middle of the window, and (2) detection bias (Fig. 14): how much the reported changeover locations deviate from the true locations.

As can easily be seen in these figures, the proposed technique significantly outperforms CUSUM. This is particularly true when R is small. CUSUM has a tendency to miss the target entirely, resulting in large bias (almost 25 indices away) and low accuracy (close to 0%) for small R's, while our technique performs much better. Our technique is also more robust in the presence of noise. The performance of CUSUM starts to approach that of ours only for extremely high R's. As can be seen in Figs. 13 and 14, even at R = 5, CUSUM's performance still lags behind.

6 CONCLUDING REMARKS

In this paper, we present a technique for change detection in independent random sequences. We prove theoretically that the algorithm gives the correct changeover prediction in the expected sense, and we provide experimental results to evaluate the performance of the algorithm. We also compare the performance of the proposed algorithm with that of CUSUM, and show that the proposed algorithm has a much more general formulation and outperforms CUSUM significantly.


APPENDIX

Proof of Lemma 1

s^2_f = \frac{1}{n+2m-1} \sum_{i=1-m}^{n+m} (f(X_i) - \bar{f})^2

= \frac{1}{n+2m-1} \Big\{ \sum_{i=1-m}^{k-1} (f(X_i) - \bar{f}_1 + \bar{f}_1 - \bar{f})^2 + \sum_{i=k}^{n+m} (f(X_i) - \bar{f}_2 + \bar{f}_2 - \bar{f})^2 \Big\}

= \frac{1}{n+2m-1} \Big\{ \sum_{i=1-m}^{k-1} (f(X_i) - \bar{f}_1)^2 + \sum_{i=k}^{n+m} (f(X_i) - \bar{f}_2)^2 + \sum_{i=1-m}^{k-1} (\bar{f}_1 - \bar{f})^2 + \sum_{i=k}^{n+m} (\bar{f}_2 - \bar{f})^2 + 2 \sum_{i=1-m}^{k-1} (f(X_i) - \bar{f}_1)(\bar{f}_1 - \bar{f}) + 2 \sum_{i=k}^{n+m} (f(X_i) - \bar{f}_2)(\bar{f}_2 - \bar{f}) \Big\}

= \frac{1}{n+2m-1} \Big\{ \sum_{i=1-m}^{k-1} (f(X_i) - \bar{f}_1)^2 + \sum_{i=k}^{n+m} (f(X_i) - \bar{f}_2)^2 + n_1 (\bar{f}_1 - \bar{f})^2 + n_2 (\bar{f}_2 - \bar{f})^2 \Big\}

= \frac{1}{n+2m-1} \Big\{ (n_1 - 1) \frac{\sum_{i=1-m}^{k-1} (f(X_i) - \bar{f}_1)^2}{n_1 - 1} + (n_2 - 1) \frac{\sum_{i=k}^{n+m} (f(X_i) - \bar{f}_2)^2}{n_2 - 1} + n_1 \Big(\bar{f}_1 - \frac{n_1 \bar{f}_1 + n_2 \bar{f}_2}{n+2m}\Big)^2 + n_2 \Big(\bar{f}_2 - \frac{n_1 \bar{f}_1 + n_2 \bar{f}_2}{n+2m}\Big)^2 \Big\}

= \frac{1}{n+2m-1} \Big\{ (n_1 - 1) s^2_{f_1}(k) + (n_2 - 1) s^2_{f_2}(k) + n_1 \Big(\frac{n_2}{n+2m} (\bar{f}_1 - \bar{f}_2)\Big)^2 + n_2 \Big(\frac{n_1}{n+2m} (\bar{f}_1 - \bar{f}_2)\Big)^2 \Big\}

= \frac{1}{n+2m-1} \Big\{ (n_1 - 1) s^2_{f_1}(k) + (n_2 - 1) s^2_{f_2}(k) + \frac{n_1 n_2}{n+2m} (\bar{f}_1 - \bar{f}_2)^2 \Big\}

= \frac{1}{n+2m-1} \big( s_w(k) + s_b(k) \big)  □    (18)

(The cross terms vanish because \sum_{i=1-m}^{k-1} (f(X_i) - \bar{f}_1) = 0 and \sum_{i=k}^{n+m} (f(X_i) - \bar{f}_2) = 0.)

Proof of Theorem 1 Lemma 1 shows that minimizing the ratio of sw/sb is equivalent to maximize sb

or minimize sw, so we need to concentrate on one of the terms only (sw in this proof). Denote the true

changeover point as k, and the within-class scatter as sw(k) = (n1 − 1)s2f1

(k) + (n2 − 1)s2f2

(k). Then it

can be shown that (see Eq. 38 for more detail)

E(sw(k)) = (n1 − 1)σ2f(D1) + (n2 − 1)σ2

f(D2) (19)

Consider what happens if the changeover point is misplaced by, say, l, where l can be either positive or

negative. The within-class scatter then becomes sw(k+ l) = (n1 + l−1)s2f1

(k+ l)+(n2− l−1)s2f2

(k+ l).

What we show below is that, under the conditions stipulated in theorem 2, E(sw(k + l)) > E(sw(k)) for

all l. Hence, k will be selected as the changeover point in average more frequently than any other indices,

or P (e = k) > P (e 6= k). Furthermore, E(sw(k + l)) = E(sw(k − l)) for all l if the two underlying

populations have the same variance. This, together with P (e = k) > P (e 6= k), implies that E(e) = k.

We consider the case where l is positive (the case where l is negative can be proven in a similar manner).


Figure 4: Some symbols used in the proof.

Before we proceed, we introduce the following notation to simplify the derivation:

\[
\begin{aligned}
\bar{f}_1 &= \bar{f}_1(k) = \frac{\sum_{i=1-m}^{k-1}f(X_i)}{n_1}, &\qquad \bar{f}_2 &= \bar{f}_2(k) = \frac{\sum_{i=k}^{n+m}f(X_i)}{n_2},\\
\bar{f}'_1 &= \bar{f}_1(k+l) = \frac{\sum_{i=1-m}^{k+l-1}f(X_i)}{n_1+l}, &\qquad \bar{f}'_2 &= \bar{f}_2(k+l) = \frac{\sum_{i=k+l}^{n+m}f(X_i)}{n_2-l},\\
\bar{f}''_1 &= \frac{\sum_{i=k-l}^{k-1}f(X_i)}{l}, &\qquad \bar{f}''_2 &= \frac{\sum_{i=k}^{k+l-1}f(X_i)}{l},\\
f_i &= f(X_i)
\end{aligned}
\tag{20}
\]

These are also illustrated graphically in Fig. 4.

If $l$ is zero, the first group contains only data items from the $F_1$ family while the second group contains only those from the $F_2$ family. If $l$ is positive, $l$ data items that were in the $F_2$ distribution (group 2) are now moved into the $F_1$ distribution (group 1). Hence (the detailed steps for deriving Eqs. 29 and 30 from Eqs. 26 and 27 are presented later, in Eqs. 40 and 41),

\[
\begin{aligned}
E(s_w(k+l)) &= E\Bigl\{\sum_{i=1-m}^{k+l-1}(f_i-\bar{f}'_1)^2+\sum_{i=k+l}^{n+m}(f_i-\bar{f}'_2)^2\Bigr\} &(21,\,22)\\
&= \sum_{i=1-m}^{k+l-1}E\Bigl(f_i-\frac{n_1\bar{f}_1+l\bar{f}''_2}{n_1+l}\Bigr)^2+\sum_{i=k+l}^{n+m}E(f_i-\bar{f}'_2)^2 &(23)\\
&= \sum_{i=1-m}^{k+l-1}E\Bigl(\frac{n_1(f_i-\bar{f}_1)+l(f_i-\bar{f}''_2)}{n_1+l}\Bigr)^2+\sum_{i=k+l}^{n+m}\frac{n_2-l-1}{n_2-l}\,\sigma^2_{f(D_2)} &(24)\\
&= \sum_{i=1-m}^{k+l-1}\frac{E\bigl\{n_1^2(f_i-\bar{f}_1)^2+2n_1l(f_i-\bar{f}_1)(f_i-\bar{f}''_2)+l^2(f_i-\bar{f}''_2)^2\bigr\}}{(n_1+l)^2}+(n_2-l-1)\sigma^2_{f(D_2)} &(25)\\
&= \frac{1}{(n_1+l)^2}\Bigl\{\sum_{i=1-m}^{k-1}E\bigl(n_1^2(f_i-\bar{f}_1)^2+2n_1l(f_i-\bar{f}_1)(f_i-\bar{f}''_2)+l^2(f_i-\bar{f}''_2)^2\bigr) &(26)\\
&\qquad+\sum_{i=k}^{k+l-1}E\bigl(n_1^2(f_i-\bar{f}_1)^2+2n_1l(f_i-\bar{f}_1)(f_i-\bar{f}''_2)+l^2(f_i-\bar{f}''_2)^2\bigr)\Bigr\} &(27)\\
&\qquad+(n_2-l-1)\sigma^2_{f(D_2)} &(28)\\
&= \frac{1}{(n_1+l)^2}\Bigl\{\sum_{i=1-m}^{k-1}\Bigl(n_1^2\frac{n_1-1}{n_1}\sigma^2_{f(D_1)}+2n_1l\frac{n_1-1}{n_1}\sigma^2_{f(D_1)}+l^2\frac{n_1-1}{n_1}\sigma^2_{f(D_1)}+l^2E(\bar{f}_1-\bar{f}''_2)^2\Bigr) &(29)\\
&\qquad+\sum_{i=k}^{k+l-1}\Bigl(n_1^2\frac{l-1}{l}\sigma^2_{f(D_2)}+n_1^2E(\bar{f}_1-\bar{f}''_2)^2+2n_1l\frac{l-1}{l}\sigma^2_{f(D_2)}+l^2\frac{l-1}{l}\sigma^2_{f(D_2)}\Bigr)\Bigr\} &(30)\\
&\qquad+(n_2-l-1)\sigma^2_{f(D_2)} &(31)\\
&= (n_1-1)\sigma^2_{f(D_1)}+(l-1)\sigma^2_{f(D_2)}+\frac{n_1l}{n_1+l}E(\bar{f}_1-\bar{f}''_2)^2+(n_2-l-1)\sigma^2_{f(D_2)} &(32)\\
&= (n_1-1)\sigma^2_{f(D_1)}+(n_2-1)\sigma^2_{f(D_2)}+\frac{n_1l}{n_1+l}E(\bar{f}_1-\bar{f}''_2)^2-\sigma^2_{f(D_2)} &(33)\\
&= E(s_w(k))+\frac{n_1l}{n_1+l}E(\bar{f}_1-\bar{f}''_2)^2-\sigma^2_{f(D_2)} &(34)
\end{aligned}
\]

Hence, $E(s_w(k+l)) > E(s_w(k))$ if $\frac{n_1 l}{n_1+l}\,E(\bar{f}_1-\bar{f}''_2)^2 > \sigma^2_{f(D_2)}$. Similarly, one can show that $E(s_w(k-l)) > E(s_w(k))$ if $\frac{n_2 l}{n_2+l}\,E(\bar{f}_2-\bar{f}''_1)^2 > \sigma^2_{f(D_1)}$. These, in turn, imply

\[
\begin{aligned}
E|\bar{f}_1-\bar{f}''_2| &= |u_{f(D_1)}-u_{f(D_2)}| > \sqrt{\frac{1}{n_1}+\frac{1}{l}}\;\sigma_{f(D_2)}\\
E|\bar{f}_2-\bar{f}''_1| &= |u_{f(D_1)}-u_{f(D_2)}| > \sqrt{\frac{1}{n_2}+\frac{1}{l}}\;\sigma_{f(D_1)}
\end{aligned}
\tag{35}
\]

In the asymptotic case, when $n_1$ and $n_2$ approach infinity, to guarantee that $E(s_w(k))$ is the smallest for all choices of $l$ we must have (setting $l = 1$)

\[
|u_{f(D_1)}-u_{f(D_2)}| > \max(\sigma_{f(D_1)},\,\sigma_{f(D_2)}) \tag{36}
\]
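As a concrete illustration of condition 36 (our example, not part of the original derivation): with $f$ the identity map and the two populations $N(0, 1)$ and $N(\mu, 1)$, so that $\sigma_{f(D_1)} = \sigma_{f(D_2)} = 1$, the condition reads $|\mu| > 1$; the mean shift must exceed one standard deviation for the true changeover point to minimize the expected within-class scatter asymptotically.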


Furthermore,

\[
E(s_w(k+l))-E(s_w(k-l)) = \frac{n_1l}{n_1+l}E(\bar{f}_1-\bar{f}''_2)^2-\sigma^2_{f(D_2)}-\frac{n_2l}{n_2+l}E(\bar{f}_2-\bar{f}''_1)^2+\sigma^2_{f(D_1)} \tag{37}
\]

In the limiting case when $n_1 \to \infty$ and $n_2 \to \infty$, $E(s_w(k+l)) - E(s_w(k-l)) = 0$ if $\sigma^2_{f(D_1)} = \sigma^2_{f(D_2)}$. Hence, the estimate $e$ is unbiased if the two underlying distributions have the same variance. $\Box$
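The behavior established by Theorem 1 can also be checked by simulation. The following Monte Carlo sketch (our illustration; the estimator argmin_k s_w(k) follows the proof, but the parameters are arbitrary) verifies that the average estimate sits at the true changeover point when condition 36 holds:

import numpy as np

def argmin_sw(x):
    # Pick the split index k that minimizes the within-class scatter
    # s_w(k) = (n1 - 1) s_{f1}^2(k) + (n2 - 1) s_{f2}^2(k).
    best_k, best_sw = None, np.inf
    for k in range(2, len(x) - 1):        # keep >= 2 samples per group
        g1, g2 = x[:k], x[k:]
        sw = (len(g1) - 1) * g1.var(ddof=1) + (len(g2) - 1) * g2.var(ddof=1)
        if sw < best_sw:
            best_k, best_sw = k, sw
    return best_k

rng = np.random.default_rng(1)
true_k, trials = 50, 2000
est = [argmin_sw(np.concatenate([rng.normal(0, 1, true_k),
                                 rng.normal(2, 1, true_k)]))
       for _ in range(trials)]            # |mu1 - mu2| = 2 > max sigma = 1
print(np.mean(est))                       # close to true_k = 50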

Some of the steps in the above derivations require more elaboration. In particular,

\[
\begin{aligned}
\sum_{i=1-m}^{k-1}E(f_i-\bar{f}_1)^2 &= \sum_{i=1-m}^{k-1}E\Bigl(f_i-\frac{\sum_{j=1-m}^{k-1}f_j}{n_1}\Bigr)^2\\
&= \sum_{i=1-m}^{k-1}\frac{1}{n_1^2}\,E\Bigl(n_1^2f_i^2-2n_1f_i\sum_{j=1-m}^{k-1}f_j+\Bigl(\sum_{j=1-m}^{k-1}f_j\Bigr)\Bigl(\sum_{l=1-m}^{k-1}f_l\Bigr)\Bigr)\\
&= \sum_{i=1-m}^{k-1}\frac{1}{n_1^2}\Bigl(n_1^2E(f_i^2)-2n_1E(f_i^2)-2n_1\!\!\sum_{j=1-m,\,j\neq i}^{k-1}\!\!E(f_i)E(f_j)+\sum_{j=1-m}^{k-1}E(f_j^2)+\sum_{j=1-m}^{k-1}\;\sum_{l=1-m,\,l\neq j}^{k-1}E(f_j)E(f_l)\Bigr)\\
&= \sum_{i=1-m}^{k-1}\frac{1}{n_1^2}\Bigl(n_1^2E(f_i^2)-2n_1E(f_i^2)-2n_1(n_1-1)u^2_{f(D_1)}+n_1E(f_i^2)+n_1(n_1-1)u^2_{f(D_1)}\Bigr)\\
&= \sum_{i=1-m}^{k-1}\frac{n_1-1}{n_1}\bigl(E(f_i^2)-u^2_{f(D_1)}\bigr)\\
&= (n_1-1)\sigma^2_{f(D_1)}
\end{aligned}
\tag{38}
\]

We use the fact that, because the $X_i$'s are i.i.d., $E(f_i^2) = E(f_j^2)$ and $E(f_i f_j) = E(f_i)E(f_j) = u^2_{f(D_1)}$ for $1-m \le i, j \le k-1$, $i \neq j$. The same procedure can be used to show that

\[
\begin{aligned}
\sum_{i=k}^{n+m}E(f_i-\bar{f}_2)^2 &= (n_2-1)\sigma^2_{f(D_2)}\\
\sum_{i=k-l}^{k-1}E(f_i-\bar{f}''_1)^2 &= (l-1)\sigma^2_{f(D_1)}\\
\sum_{i=k}^{k+l-1}E(f_i-\bar{f}''_2)^2 &= (l-1)\sigma^2_{f(D_2)}
\end{aligned}
\tag{39}
\]

These relations are used in Eqs. 19, 24, 26, 27, 29, and 30.

That the second and third terms in Eq. 26 turn into their corresponding terms in Eq. 29 is verified below. (Similar derivations can be used to verify that the first and second terms in Eq. 27 turn into their corresponding terms in Eq. 30.) For the second term, we have

\[
\begin{aligned}
E(f_i-\bar{f}_1)(f_i-\bar{f}''_2) &= E(f_i-\bar{f}_1)(f_i-\bar{f}_1+\bar{f}_1-\bar{f}''_2)\\
&= E(f_i-\bar{f}_1)^2+E(f_i-\bar{f}_1)(\bar{f}_1-\bar{f}''_2)\\
&= \frac{n_1-1}{n_1}\sigma^2_{f(D_1)}+E\bigl(f_i\bar{f}_1-\bar{f}_1\bar{f}_1-f_i\bar{f}''_2+\bar{f}_1\bar{f}''_2\bigr)\\
&= \frac{n_1-1}{n_1}\sigma^2_{f(D_1)}+E\Bigl(f_i\,\frac{\sum_{j=1-m}^{k-1}f_j}{n_1}-\frac{\sum_{j=1-m}^{k-1}f_j}{n_1}\,\frac{\sum_{l=1-m}^{k-1}f_l}{n_1}\Bigr)-E\bigl(f_i\bar{f}''_2-\bar{f}_1\bar{f}''_2\bigr)\\
&= \frac{n_1-1}{n_1}\sigma^2_{f(D_1)}+E\Bigl(f_i\,\frac{\sum_{j=1-m}^{k-1}f_j}{n_1}-\frac{\sum_{j=1-m}^{k-1}f_j}{n_1}\,\frac{\sum_{l=1-m}^{k-1}f_l}{n_1}\Bigr)-\bigl(E(f_i)E(\bar{f}''_2)-E(\bar{f}_1)E(\bar{f}''_2)\bigr)\\
&= \frac{n_1-1}{n_1}\sigma^2_{f(D_1)}+\frac{E(f_i^2)+(n_1-1)u^2_{f(D_1)}}{n_1}-\frac{n_1E(f_i^2)+n_1(n_1-1)u^2_{f(D_1)}}{n_1^2}\\
&= \frac{n_1-1}{n_1}\sigma^2_{f(D_1)}
\end{aligned}
\tag{40}
\]


For the third term, we have

\[
\begin{aligned}
E(f_i-\bar{f}''_2)^2 &= E(f_i-\bar{f}_1+\bar{f}_1-\bar{f}''_2)^2\\
&= E(f_i-\bar{f}_1)^2+2E(f_i-\bar{f}_1)(\bar{f}_1-\bar{f}''_2)+E(\bar{f}_1-\bar{f}''_2)^2\\
&= \frac{n_1-1}{n_1}\sigma^2_{f(D_1)}+E(\bar{f}_1-\bar{f}''_2)^2
\end{aligned}
\tag{41}
\]

That the term $E(f_i-\bar{f}_1)(\bar{f}_1-\bar{f}''_2)$ in the above equation is zero was already verified in Eq. 40.
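The expectation identities in Eqs. 38 and 40 can likewise be checked by simulation. The sketch below (with arbitrary parameters) estimates the left-hand sides by averaging over repeated draws and compares them against the closed forms:

import numpy as np

rng = np.random.default_rng(2)
n1, l, trials = 30, 5, 100_000
mu1, sig1 = 1.0, 0.7              # population D1 (arbitrary)
mu2, sig2 = 3.0, 1.2              # population D2 (arbitrary)

lhs38, lhs40 = 0.0, 0.0
for _ in range(trials):
    g1 = rng.normal(mu1, sig1, n1)    # f_i for i in [1-m, k-1]
    g2 = rng.normal(mu2, sig2, l)     # f_i for i in [k, k+l-1]
    f1_bar, f2_pp = g1.mean(), g2.mean()
    lhs38 += np.sum((g1 - f1_bar) ** 2)           # sum of (f_i - f1_bar)^2
    lhs40 += (g1[0] - f1_bar) * (g1[0] - f2_pp)   # (f_i - f1_bar)(f_i - f2'')

print(lhs38 / trials, (n1 - 1) * sig1 ** 2)       # Eq. 38: both about 14.21
print(lhs40 / trials, (n1 - 1) / n1 * sig1 ** 2)  # Eq. 40: both about 0.4737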


Figure 5: Test results on using different window lengths. [Figure: four panels at R ratio = 0.70, 1.00, 2.00, and 3.00; x-axis: window length (0 to 500); y-axis: % of correct changeover location; curves: Normal (red cross), Uniform (green circle), Poisson (blue star), Negative binomial (cyan triangle), Gamma (black diamond).]

Figure 6: Test results on using different padding sizes. [Figure: four panels at R ratio = 0.70, 2.00, 3.00, and 5.00; x-axis: padding length (0 to 20); y-axis: % of correct changeover location; same five distribution curves as Fig. 5.]


Figure 7: Test results on the trade-off between precision and recall. The allowed changeover location uncertainty is specified, and we calculate the probability of correct localization within the uncertainty bound. [Figure: four panels; two at allowed position uncertainty = 3 and 7 (x-axis: ratio of separation of mean to maximum standard deviation; y-axis: detection sensitivity), and two at R ratio = 1.00 and 2.00 (x-axis: allowed position uncertainty; y-axis: % of correct changeover location); same five distribution curves as Fig. 5.]

Figure 8: Test results on the trade-off between precision and recall. The desired probability of localization is specified, and we calculate the changeover location uncertainty needed to achieve the desired localization probability. [Figure: four panels; two at required changeover detection sensitivity = 60% and 80% (x-axis: ratio of separation of mean to maximum standard deviation; y-axis: resulting location uncertainty), and two at R ratio = 1.00 and 2.00 (x-axis: required changeover detection sensitivity, 0.5 to 1; y-axis: resulting location uncertainty); same five distribution curves as Fig. 5.]


Figure 9: Test results on different amounts of added noise. [Figure: four panels; two at noise strength = 2% and 10% of signal RMS (x-axis: ratio of separation of mean to maximum standard deviation; y-axis: % of correct changeover location), and two at R ratio = 1.50 and 2.50 (x-axis: amount of noise as % of the signal RMS strength, 0 to 0.25); same five distribution curves as Fig. 5.]

Figure 10: Test results on different population sample sizes. [Figure: four panels; two at sample size ratio = 1.50 and 2.00 (x-axis: ratio of separation of mean to maximum standard deviation), and two at R ratio = 2.00 and 3.00 (x-axis: sample size ratio, 1 to 10); y-axis: % of correct changeover location; same five distribution curves as Fig. 5.]


Figure 11: Test results on different population variances. Here we show the results of detection accuracy. [Figure: four panels; two at variance ratio = 60% and 80% (x-axis: ratio of separation of mean to maximum standard deviation), and two at R ratio = 1.00 and 3.00 (x-axis: ratio of population variance, 0.1 to 1); y-axis: % of correct changeover location; curves: Normal (red cross), Uniform (green circle), Gamma (black diamond).]

Figure 12: Test results on different population variances. Here we show the results of detection bias. [Figure: four panels; two at population variance ratio = 50% and 70% (x-axis: ratio of separation of mean to maximum standard deviation), and two at R ratio = 0.50 and 2.00 (x-axis: ratio of population variance); y-axis: position bias from the correct location (0); curves: Normal (red cross), Uniform (green circle), Gamma (black diamond).]


Figure 13: Test results on comparing the proposed technique with CUSUM. Here we show the results of detection accuracy. [Figure: four panels; two at noise strength = 2% and 10% of signal RMS (x-axis: ratio of separation of mean to maximum standard deviation, 0 to 5), and two at R ratio = 2.00 and 3.00 (x-axis: amount of noise as % of the signal RMS strength, 0 to 0.25); y-axis: detection accuracy; curves: proposed technique (red cross), CUSUM (green circle).]

Figure 14: Test results on comparing the proposed technique with CUSUM. Here we show the results of detection bias. [Figure: four panels; two at noise strength = 2% and 10% of signal RMS (x-axis: ratio of separation of mean to maximum standard deviation, 0 to 5), and two at R ratio = 1.00 and 1.80 (x-axis: amount of noise as % of the signal RMS strength, 0 to 0.25); y-axis: detection bias (roughly -30 to 5); curves: proposed technique (red cross), CUSUM (green circle).]
