[IEEE 2008 IEEE International Conference on Data Mining Workshops (ICDMW) - Pisa, Italy...


Page 1: [IEEE 2008 IEEE International Conference on Data Mining Workshops (ICDMW) - Pisa, Italy (2008.12.15-2008.12.19)] 2008 IEEE International Conference on Data Mining Workshops - Multilayer

Multilayer Change-Point Detection on Stock Order Flows by Wavelet Transformation

Xiaoyan Liu, Dept. of Information Systems, City University of Hong Kong, [email protected] yu.edu.hk

Xindong Wu, Dept. of Computer Science, University of Vermont, [email protected]

Huaiqing Wang, Dept. of Information Systems, City University of Hong Kong, [email protected]

Yingfeng Wang, Dept. of Information Systems, City University of Hong Kong, gilbert.wang@student.cityu.edu.hk

Abstract

In empirical finance, increases or decreases in the number of stock buy/sell orders are driven by information asymmetry, which eventually affects the stock price. To monitor changes in the stock order flow, we propose a multilayer change-point detection algorithm that makes use of the multi-resolution property of the wavelet transformation. We first detect the change-points in the lower-resolution levels of the wavelet transform and then map them back to points in the original time series. Different weights are assigned to the different levels to compute the confidence that a mapped point is a change-point in the original time series. The change-points obtained by our method are more reliable than those detected only from the original time series. Experiments on both artificial Poisson sequences and real-world stock order flows from the Shanghai Stock Exchange (SSE) show the effectiveness of our detection method.

1. Introduction

Recently, the increasing importance of temporal data has attracted more and more research effort in the field of data mining [3, 9]. The stock market is an important application field, to which a variety of data mining techniques have been applied. One popular method is to find prototype patterns in the charts and then predict the future stock trend based on the identified patterns [14]. However, most of these charts are plotted from daily, weekly, or even monthly data. Shorter-term stock traders may need to make minute-by-minute decisions based on real-time stock data, which involves the analysis of “T&Q” (Trade and Quote) data.

In contrast to previous market microstructure research, which mainly focuses on the bid/ask spread and the attributes of intraday trades [12, 16], we concentrate on analyzing the bid/ask number sequences and detecting the change-points in them. Here, the bid/ask number means the number of sell/buy orders. This research problem is motivated by findings in empirical finance.

Easley et al. [11] argue that “it is private information rather than public information that leads to abnormal trading activity that precedes price changes”. For example, if some traders learn in advance, from inside information, that an oil company has found a new oil source, they will submit orders to buy the stock of that company. As another example, institutional investors often split one large order into a series of medium-size orders to avoid attracting attention, since a visible large order could cause a sharp increase in the stock price and raise their purchase cost [6]. In both cases, the number of buy orders in that period increases, which finally results in a rise of the stock price. After the market digests the news, that is, after the private information becomes public to all traders, the number of orders returns to a normal level. Therefore, detecting the change-points in the stock order flow can help uninformed traders to predict the direction of stock price changes and to monitor abnormal events.

In this paper, we propose a multilayer change-point detection algorithm that makes use of the multiresolution property of the wavelet transform. The basic idea is to first build, online, different levels of approximate sequences for the original sequence by the wavelet transform, then detect the change-points in the sequence obtained at each level and map them to points in the original time series. Weights are assigned to the change-points detected at each level. The change-points obtained at the lower levels of the wavelet transform denote longer-term changes, and so are assigned larger weights. The change-points at the finer levels reflect local changes, and so are given smaller weights. Through the mapping, the importance of the change-points can be computed. The change-points identified by multilayer detection are more reliable than those identified in the single, original time series.

2008 IEEE International Conference on Data Mining Workshops
978-0-7695-3503-6/08 $25.00 © 2008 IEEE
DOI 10.1109/ICDM.Workshops.2008.67
DOI 10.1109/ICDMW.2008.65

The remainder of the paper is organized as follows. In Section 2, we review related work on change-point detection methods. In Section 3, the Haar wavelet transformation is briefly introduced and the proposed algorithm is described in detail. In Section 4, we conduct experiments on artificial Poisson sequences and real-world stock data. In Section 5, we conclude the paper and discuss future research directions.

2. Related work

The goal of our research is to detect changes in the bid/ask number flows so as to raise alarms before big price changes. The change-point detection problem has been intensively studied by researchers in statistics since the 1950s, and more recently also in operations research, machine learning and data mining. In data stream mining, change-point detection methods are required to identify process changes when data are presented serially. Some interesting research problems and preliminary results on online mining of changes from data streams are discussed in [10].

The generic framework of online change-point detection methods is based on sliding windows or blocks of data [2, 13, 15]. For example, by moving $t$ forward, a sequence of paired windows $\{x_{t-k}, \ldots, x_{t-1}\}$ and $\{x_{t}, \ldots, x_{t+k-1}\}$ is obtained. The difference between the two windows can be measured by various methods. Some methods rely on the assumption that the sequences come from a family of distributions with unknown parameters. After parameter estimation, criteria for identifying the change-points are constructed, such as the likelihood function, Bayes factors, K-L divergence, or other sequential hypothesis testing statistics [1, 4, 18, 22]. When no prior distribution information is available, some papers estimate the probability density of the sequential windows with a kernel function and then compare the difference [2, 8]. Other methods do not rely on any distribution estimation but compare other characteristics of the two windows. These methods include singular spectrum analysis [17], independent component analysis [20], Markov chain Monte Carlo methods [21], and wavelet transformation [23].

For applications to stock data, Chen and Gupta [7] test and locate variance change-points in stock prices by a binary procedure combined with the Schwarz Information Criterion (SIC). However, this method is not an online method and relies on the assumption of a sequence of independent Gaussian random variables with a known common mean. Zhu and Shasha [23] proposed a shifted wavelet tree structure and applied it to real-time T&Q data from the New York Stock Exchange (NYSE). Their paper focused on detecting periods with high volumes of trading activity and high stock price volatility based on basic statistics, while our work concentrates on the bid/ask numbers.

In [19], a method was proposed to detect changes in the order number flow incrementally. This method is based on the assumption that the stock order flow follows a Poisson distribution with two rates. Although the order numbers follow the Poisson distribution in theory, deviations from the ideal distribution exist in practice because of inevitable errors. In this paper, we therefore propose a detection method based on the Haar wavelet transformation that does not make any distribution assumption. We detect the change-points at the different resolution levels of the wavelet transform and then map them back to the original time series. The details of our algorithm are described step by step in the next section.

3. Multilayer change-point detection

3.1. Haar wavelet transformation

Wavelet analysis has shown its advantage in the analysis of time series whose behavior changes over time [23]. Here, we briefly introduce the Haar wavelet transform, since it is the easiest member of the wavelet family to compute and implement. Consider a time series $x = \{x_1, \ldots, x_n\}$, where $n$ is a power of 2 and $m = \log_2 n$. The original time series is viewed as level 0 in a wavelet tree, and we denote it as $x^{(0)}$. The normalized pairwise averages and differences of the adjacent data items at level 0 produce the wavelet coefficients at the finest (or highest) level $m$. The differences are also called details. The process repeats until only one average and one difference are left, which are at the coarsest (or lowest) level 1. Table 1 shows an example of computing the Haar wavelet decomposition [23].

Table 1. Haar wavelet decomposition tree

Points (level 0): $x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8$

Level 3
  Averages: $s^{(3)}_1 = \frac{x_1 + x_2}{\sqrt{2}}$, $s^{(3)}_2 = \frac{x_3 + x_4}{\sqrt{2}}$, $s^{(3)}_3 = \frac{x_5 + x_6}{\sqrt{2}}$, $s^{(3)}_4 = \frac{x_7 + x_8}{\sqrt{2}}$
  Details: $d^{(3)}_1 = \frac{x_1 - x_2}{\sqrt{2}}$, $d^{(3)}_2 = \frac{x_3 - x_4}{\sqrt{2}}$, $d^{(3)}_3 = \frac{x_5 - x_6}{\sqrt{2}}$, $d^{(3)}_4 = \frac{x_7 - x_8}{\sqrt{2}}$

Level 2
  Averages: $s^{(2)}_1 = \frac{x_1 + x_2 + x_3 + x_4}{2}$, $s^{(2)}_2 = \frac{x_5 + x_6 + x_7 + x_8}{2}$
  Details: $d^{(2)}_1 = \frac{(x_1 + x_2) - (x_3 + x_4)}{2}$, $d^{(2)}_2 = \frac{(x_5 + x_6) - (x_7 + x_8)}{2}$

Level 1
  Averages: $s^{(1)}_1 = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8}{2\sqrt{2}}$
  Details: $d^{(1)}_1 = \frac{(x_1 + x_2 + x_3 + x_4) - (x_5 + x_6 + x_7 + x_8)}{2\sqrt{2}}$
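The decomposition in Table 1 can be reproduced with a few lines of code. The following is a minimal sketch, not the authors' implementation; the function names `haar_step` and `haar_tree` are hypothetical, and the $1/\sqrt{2}$ normalization is assumed from the phrase "normalized pairwise averages and differences".

```python
import math

def haar_step(values):
    """One Haar step: normalized pairwise averages and differences."""
    pairs = range(0, len(values), 2)
    averages = [(values[i] + values[i + 1]) / math.sqrt(2) for i in pairs]
    details = [(values[i] - values[i + 1]) / math.sqrt(2) for i in pairs]
    return averages, details

def haar_tree(x):
    """Return {level: (averages, details)}; level m is the finest, level 1 the coarsest."""
    n = len(x)
    m = n.bit_length() - 1
    if n < 2 or 2 ** m != n:
        raise ValueError("length must be a power of 2 and at least 2")
    tree = {}
    current = list(x)
    for level in range(m, 0, -1):
        averages, details = haar_step(current)
        tree[level] = (averages, details)
        current = averages  # the next coarser level is built from the averages
    return tree

# Eight points, as in Table 1 (m = 3).
tree = haar_tree([1, 2, 3, 4, 5, 6, 7, 8])
```

With this input, the level 2 averages are approximately 5 and 13, that is, $(1+2+3+4)/2$ and $(5+6+7+8)/2$, which agrees with the level 2 row of Table 1.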

The average at the lowest level and the details of all levels form the Discrete Wavelet Transform (DWT) of the time series. In Table 1, the DWT of $x$ is

$$X = \mathrm{DWT}(x) = (s^{(1)}_1, d^{(1)}_1, d^{(2)}_1, d^{(2)}_2, d^{(3)}_1, d^{(3)}_2, d^{(3)}_3, d^{(3)}_4).$$

The averages at level $J$ and the details at the levels $j$ ($j = J, J+1, \ldots, k$, $k \le m$) construct the DWT approximations of the original time series. For instance, we can use $(s^{(2)}_1, s^{(2)}_2, d^{(2)}_1, d^{(2)}_2)$ as the approximate representation of $x = \{x_1, \ldots, x_8\}$. In our method, the details are omitted since we are more interested in the aggregate numbers.

3.2. Single layer change-point detection

Kifer et al. [15] proposed a meta-algorithm for change detection based on a two-window paradigm. The algorithm compares the data in a “reference window” to the data in the current window. The current window slides forward with each incoming data point, and the reference window is updated whenever a change is detected. The structure of the algorithm is shown in Figure 1.

In Figure 1, $W_{1,i}$ and $W_{2,i}$ are the reference window and the current window, respectively. $W_{1,i}$ contains the first $m_{1,i}$ points of the data stream since the latest detected change. $W_{2,i}$ is a sliding window that contains the latest $m_{2,i}$ points in the data stream. The current window $W_{2,i}$ slides one step forward whenever a new data point appears on the data stream. $\delta_i$ is a parameter that defines the balance between the sensitivity and the robustness of the detection [15]. At each such update, the difference between the two windows is checked by the function Dist. If $\mathrm{Dist}(W_{1,i}, W_{2,i}) > \delta_i$, a change is reported.

Algorithm 3.1 A Meta-Algorithm for Change Detection
Input: time sequence (a1, a2, …, an, …), δ, k
Output: change points (c1, c2, …, ck, …)
Initial: i=1, index=1, c1 ← a1
Step 1: For i=1,…,k do
    c1 ← 0
    W1,i ← first m1,i points from time c1
    W2,i ← next m2,i points from stream
End For
While not at end of stream do
    For i=1,…,k do
        Slide window W2,i by 1 point
        If Dist(W1,i, W2,i) > δi Then
            c1 ← current time
            Report change at time c1
            Clear both windows and goto step 1
        End if
    End for
End while

Figure 1. The meta-algorithm for change detection in [15]

The difference between the two windows is compared to determine whether the current time t is a change-point. The key to this scheme is to choose a suitable distance function so that changes can be detected effectively. Many criteria can be used in the comparison. In our algorithm, we define the distance between $W_1$ and $W_2$ as follows,

$$\mathrm{Dist}(W_1, W_2) = \frac{\sum_{i=1}^{m_1} (x_{1,i} - \bar{x}_2)^2 + \sum_{i=1}^{m_2} (x_{2,i} - \bar{x}_1)^2}{\sum_{i=1}^{m_1} (x_{1,i} - \bar{x}_1)^2 + \sum_{i=1}^{m_2} (x_{2,i} - \bar{x}_2)^2}, \tag{3.1}$$

where $x_{1,i}$ and $x_{2,i}$ are the points in the windows $W_1$ and $W_2$ respectively, and $\bar{x}_1$ and $\bar{x}_2$ are the averages of the two windows. Intuitively, if the two windows are from the same distribution, the averages of the windows are not far from each other, so $\mathrm{Dist}(W_1, W_2)$ is very close to 1. If $\mathrm{Dist}(W_1, W_2)$ is larger than the user-defined threshold $\delta$, the current time is reported as a change-point.

3.3. Multilayer change-point detection

The basic idea of the proposed multilayer change-point detection algorithm is to apply the meta-algorithm mentioned in Section 3.2 to the wavelet transform resolutions of the original time series and then detect the change-points at different levels.
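Since the single layer detector of Section 3.2 is reused at every level, it may help to see it in code. The sketch below is an illustration under simplifying assumptions, not the authors' implementation: it uses a single pair of windows, the names `window_distance` and `detect_changes` are hypothetical, the reference window is restarted at the point following a reported change, and the guard for a zero denominator is an addition that the paper does not discuss.

```python
def window_distance(w1, w2):
    """Eq. (3.1): squared deviations from the other window's mean,
    divided by squared deviations from each window's own mean."""
    mean1 = sum(w1) / len(w1)
    mean2 = sum(w2) / len(w2)
    cross = sum((x - mean2) ** 2 for x in w1) + sum((x - mean1) ** 2 for x in w2)
    within = sum((x - mean1) ** 2 for x in w1) + sum((x - mean2) ** 2 for x in w2)
    if within == 0:
        # Assumption of this sketch: constant windows are handled explicitly.
        return 1.0 if cross == 0 else float("inf")
    return cross / within

def detect_changes(stream, m1, m2, delta):
    """Two-window detection with one reference window (m1 points)
    and one sliding current window (m2 points)."""
    changes = []
    start = 0                 # index where the reference window begins
    t = start + m1 + m2       # exclusive end of the current window
    while t <= len(stream):
        reference = stream[start:start + m1]
        current = stream[t - m2:t]
        if window_distance(reference, current) > delta:
            changes.append(t - 1)   # 0-based time of the newest point
            start = t               # clear both windows and restart
            t = start + m1 + m2
        else:
            t += 1                  # slide the current window by one point
    return changes

# Toy stream: the level shifts from about 10 to about 20 at index 40.
stream = [9 if t % 2 == 0 else 11 for t in range(40)]
stream += [19 if t % 2 == 0 else 21 for t in range(40, 80)]
changes = detect_changes(stream, m1=10, m2=6, delta=5)
```

On this toy stream the detector reports a single change a few points after index 40, which illustrates the detection delay caused by the current window having to fill with post-change data.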

In the $k$-layer ($k \ge 2$) change-point detection, for every $2^{k-1}$ points in the time series, we compute the averages and details of levels $k-1$ to 1 by the Haar wavelet transform. We take the $2^{k-2}$ averages at level $k-1$ as an approximation of the original $2^{k-1}$ points. In this way, $k-1$ transformed sequences are generated from the original sequence. For example, when $k=4$, for every 8 points in the original time series, which make up level 0 by definition, we compute 4 average values at level 3, 2 average values at level 2, and 1 average value at level 1.

The generated sequences are shown in Table 2, where the superscripts denote the levels of the transforms. We detect change-points in the sequence at each level of the Haar wavelet transform. The change-points at the coarser levels reflect changes in the long-term trend and hence are more important in the global structure; the change-points at the finer levels reflect local changes in the sequence. Depending on the users’ requirements, attention can be paid to the change-points at a given level of interest.

We assign different weights to the change-points at different levels. For $k$-layer detection, the weight for the change-points at level $j$ is as follows,

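$$w_j = \begin{cases} \dfrac{2^{k-j}}{2^k - 1}, & j = 1, \ldots, k-1, \\[2mm] \dfrac{1}{2^k - 1}, & j = 0, \end{cases} \tag{3.2}$$

and $\sum_{j=0}^{k-1} w_j = 1$. When $k=4$, the weight vector is

$$W = (w_0, w_3, w_2, w_1) = \left(\tfrac{1}{15}, \tfrac{2}{15}, \tfrac{4}{15}, \tfrac{8}{15}\right).$$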
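As a quick check of Eq. (3.2), the weights can be computed exactly with rational arithmetic. This is a small sketch; the function name `level_weights` is hypothetical.

```python
from fractions import Fraction

def level_weights(k):
    """Weights of Eq. (3.2) as a dict {level j: w_j} for k-layer detection."""
    if k < 2:
        raise ValueError("k must be at least 2")
    denom = 2 ** k - 1
    weights = {0: Fraction(1, denom)}          # original series (level 0)
    for j in range(1, k):
        weights[j] = Fraction(2 ** (k - j), denom)  # coarser levels weigh more
    return weights

w = level_weights(4)
```

For $k=4$ this gives $w_0 = 1/15$, $w_3 = 2/15$, $w_2 = 4/15$ and $w_1 = 8/15$, and the weights sum to 1 because $1 + 2 + \cdots + 2^{k-1} = 2^k - 1$.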

We map the change-points obtained at level $j$ ($j = 1, \ldots, k-2$) to the points at level $j+1$ by the following functions

$$f_{j \to j+1}: x^{(j)}_i \to x^{(j+1)}_{2i-1}, \quad (j = 1, \ldots, k-2). \tag{3.3}$$

For example, for the 4-layer detection illustrated in Table 2, if the point $x^{(2)}_{2i}$ at level 2 is detected as a change-point, the corresponding mapped point at level 3 is $x^{(3)}_{4i-1}$. The mapping from level $k-1$ to level 0 is

$$f_{k-1 \to 0}: x^{(k-1)}_i \to x^{(0)}_{2i-1}. \tag{3.4}$$

Therefore, the mapping from any level $j$ ($j = 1, \ldots, k-1$) to level 0 is the composition

$$f_{j \to 0} = f_{j \to j+1} \circ \cdots \circ f_{k-2 \to k-1} \circ f_{k-1 \to 0},$$

where the mappings are applied from left to right.

By applying $\{f_{j \to 0}\}$ ($j = 1, \ldots, k-1$), all the change-points at level $j$ ($j = 1, \ldots, k-1$) can be mapped to points in the original time series and sorted chronologically as $\{cp_{t_1}, \ldots, cp_{t_i}, \ldots\}$ ($t_i \le t_j$ if $i \le j$). We define

$$\mathrm{conf}(cp_{t_i}) = W \cdot C', \tag{3.5}$$

where $W = (w_0, w_1, \ldots, w_{k-1})$ is defined by Eq. (3.2) and $C = (c_0, c_1, \ldots, c_{k-1})$ is an indicator vector, listed in the same level order as $W$, in which $c_j = 1$ if there exists a change-point $cp^{(j)}_i$ at level $j$ that is mapped to $cp_{t_i}$, and $c_j = 0$ otherwise. For example, if $x^{(1)}_i$, $x^{(2)}_{2i-1}$, $x^{(3)}_{4i-3}$ and $x^{(0)}_{8i-7}$ are detected change-points at levels 1, 2, 3, 0 respectively, then the confidence of the change-point $x^{(0)}_{8i-7}$ in the original time series is 1. We describe the detection steps in Figure 2.

Table 2. Different levels of sequences of the original time series

L0: $x^{(0)}_1, \ldots, x^{(0)}_{8i-7}, \ldots, x^{(0)}_{8i}$
L3: $x^{(3)}_1, \ldots, x^{(3)}_{4i-3}, x^{(3)}_{4i-2}, x^{(3)}_{4i-1}, x^{(3)}_{4i}$
L2: $x^{(2)}_1, \ldots, x^{(2)}_{2i-1}, x^{(2)}_{2i}$
L1: $x^{(1)}_1, \ldots, x^{(1)}_{i}$
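The mapping of Eqs. (3.3) and (3.4) and the confidence of Eq. (3.5) can be sketched as follows. The function names `map_to_level0` and `confidence` are hypothetical, indices are 1-based as in the text, and the input format (a dictionary from level to a set of detected indices) is an assumption made for this illustration.

```python
from fractions import Fraction

def map_to_level0(j, i, k):
    """Map the 1-based index i at level j (1 <= j <= k-1) to level 0."""
    if not 1 <= j <= k - 1:
        raise ValueError("level j must satisfy 1 <= j <= k-1")
    # Levels j -> j+1 -> ... -> k-1 (Eq. 3.3), then k-1 -> 0 (Eq. 3.4):
    # k-j steps in total, and every step sends i to 2i-1.
    for _ in range(k - j):
        i = 2 * i - 1
    return i

def confidence(detected, k):
    """detected: {level j: set of 1-based change-point indices at level j}.
    Returns {level-0 index: confidence} with conf = sum_j w_j * c_j."""
    denom = 2 ** k - 1
    weights = {0: Fraction(1, denom)}
    weights.update({j: Fraction(2 ** (k - j), denom) for j in range(1, k)})
    conf = {}
    for j, indices in detected.items():
        # A set ensures each level contributes at most once per position.
        mapped = {i if j == 0 else map_to_level0(j, i, k) for i in indices}
        for t in mapped:
            conf[t] = conf.get(t, 0) + weights[j]
    return dict(sorted(conf.items()))

# 4-layer example with i = 3: levels 1, 2, 3 and 0 all point at x_17 of level 0.
conf = confidence({0: {17}, 1: {3}, 2: {5}, 3: {9, 2}}, k=4)
```

Here the point $x^{(0)}_{17}$ is supported by all four levels and receives confidence 1, while the isolated level 3 detection at index 2 maps to $x^{(0)}_{3}$ and receives only $w_3 = 2/15$.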

Figure 2. Multilayer change-point detection algorithm

4. Experiments

We conduct experiments on an artificial dataset and on real-world stock data. The goal of the experiments on the artificial dataset is to show that multilayer detection is more reliable than single layer detection, which is possible because we know exactly where the change-points are. We adopt Algorithm 3.1 for the single layer detection; any other single layer detection algorithm can also be applied in the multilayer detection framework. The experiments on the real stock dataset show the practical effect of the multilayer detection algorithm.

4.1. Poisson simulation

Since order flows are considered in empirical finance to follow a Poisson distribution, we first conduct an experiment on artificial Poisson sequences. We generate each Poisson sequence with a Poisson random number generator by the following procedure.

(1) Set the number of change-points NoCp=100, $\lambda_1 = 100$, $\lambda_2 = 130$.

(2) The number of segments is NoSg=NoCp+1. We randomly generate 100 numbers $l_i$ from the uniform distribution on the interval [100, 300] as the lengths of the segments. That means each segment contains at least 100 data points.

(3) For the $i$th segment, if $i$ is an odd number, we generate $l_i$ points from Poi($\lambda_1$), otherwise from Poi($\lambda_2$).
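The three steps above can be sketched with the standard library only. This is an illustration, not the generator used in the experiments: the names `poisson_sample` and `make_sequence` are hypothetical, Poisson variates are drawn with Knuth's multiplication method, segment lengths are drawn as integers, and one length is drawn per segment (an assumption, since the text mentions 100 lengths for NoCp+1 segments).

```python
import math
import random

def poisson_sample(lam, rng):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def make_sequence(n_change_points=100, lam1=100, lam2=130,
                  min_len=100, max_len=300, seed=0):
    """Return (data, change_points); change_points holds the 0-based index
    of the first point of every segment after the first one."""
    rng = random.Random(seed)
    data, change_points = [], []
    for segment in range(n_change_points + 1):
        length = rng.randint(min_len, max_len)      # integer-valued uniform length
        lam = lam1 if segment % 2 == 0 else lam2    # 1st, 3rd, ... segments use lam1
        if segment > 0:
            change_points.append(len(data))
        data.extend(poisson_sample(lam, rng) for _ in range(length))
    return data, change_points

# A short sequence with 5 change-points for illustration.
data, change_points = make_sequence(n_change_points=5, seed=42)
```

Because the true change-point positions are returned together with the data, the output can be fed directly into an evaluation of misdetections and false detections.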

[Plot: count (vertical axis, 70 to 160) versus t (horizontal axis, 0 to 1800)]

Figure 3. An example of generated Poisson sequence

Two indicators, the false detection rate (FDR) and the misdetection rate (MDR), are calculated. The notation we use is listed as follows.

NoRCp: the number of known real change-points of the sequence
NoDCp: the number of change-points detected by our algorithm
NoFp: the number of false change-points among our detected change-points
NoMp: the number of real change-points that are missed
FDR: NoFp/NoDCp
MDR: NoMp/NoRCp

The smaller the indicators MDR and FDR are, the better the results of the algorithm. When deciding whether a detected change-point is a real change-point, we relax the condition as follows: if

$$\mathrm{Dist}(d_i, RCP) = \min_{r_j \in RCP} \{ |d_i - r_j| \} \le \varepsilon, \tag{3.6}$$

then $d_i \in DCP$ is viewed as a real change-point, where RCP and DCP respectively denote the position sets of the real change-points and the detected change-points.

Algorithm 3.2 Multilayer Change-point Detection Algorithm (MCPA)
Input: number of layers $k$, window length $l$, data stream $\{x_1, \ldots, x_i, \ldots\}$, weight vector $W = (w_0, w_1, \ldots, w_{k-1})$
Output: change-points $\{cp_{t_1}, \ldots, cp_{t_i}, \ldots\}$, the confidences of the change-points $\{\mathrm{conf}(cp_{t_1}), \ldots, \mathrm{conf}(cp_{t_i}), \ldots\}$
Step 1. While data points are coming, generate the sequence $\{x^{(j)}_i\}$ for level $j$ ($j = k-1, \ldots, 1$) by the Haar wavelet transformation.
Step 2. For $j = 0$ to $k-1$, apply Algorithm 3.1 to the sequence $\{x^{(j)}_i\}$ to obtain the change-points $\{cp^{(j)}_i\}$, map them to the points in the original time series, sort them chronologically as $cp_{t_i}$, and update $\mathrm{conf}(cp_{t_i})$.

4.1.1. Test on single layer detection. We first test the meta-algorithm in single layer detection. To investigate the influence of the reference window length and the current window length on the two indicators, we randomly generate 100 Poisson sequences for each window length setting. The current window length is set no longer than the reference window length. The results are shown in Table 3 and Table 4.

Table 3. Different reference window lengths (current window length=30, δ=5)

Reference window length    MDR       FDR       DR
30                         6.29%     6.08%     92.72%
20                         6.60%     6.63%     92.41%
10                         7.57%     7.43%     91.43%

Table 4. Different current window lengths (reference window length=30, δ=5)

Current window length      MDR       FDR       DR
30                         18.94%    35.94%    88.10%
20                         9.56%     20.72%    92.99%
10                         6.02%     6.02%     88.10%
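For reference, the two indicators reported in these tables can be computed from the real and detected positions as in the sketch below. The function name `evaluate` is hypothetical. A detected point counts as real when it lies within `eps` of some real change-point, as in Eq. (3.6); treating a real change-point as missed when no detection lies within `eps` of it is an assumption of this sketch, since the text does not spell out that rule.

```python
def evaluate(real_cps, detected_cps, eps=10):
    """Return (MDR, FDR) for lists of real and detected change-point positions."""
    # NoFp: detections farther than eps from every real change-point.
    false_points = [d for d in detected_cps
                    if not real_cps or min(abs(d - r) for r in real_cps) > eps]
    # NoMp: real change-points with no detection within eps (assumed rule).
    missed_points = [r for r in real_cps
                     if not detected_cps or min(abs(r - d) for d in detected_cps) > eps]
    mdr = len(missed_points) / len(real_cps) if real_cps else 0.0
    fdr = len(false_points) / len(detected_cps) if detected_cps else 0.0
    return mdr, fdr

# Two of four detections are within 10 points of a real change-point.
mdr, fdr = evaluate([100, 250, 400, 600], [104, 255, 330, 612], eps=10)
```

In this small example the detections at 330 and 612 are false and the real change-points at 400 and 600 are missed, so both indicators equal 0.5.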

From Table 3 and Table 4, we can see that, for a given current window length, the misdetection rate and the false detection rate both decrease slightly as the reference window length increases. This is because the more samples the reference window contains, the closer they are to the real distribution. Meanwhile, for a given reference window length, the misdetection rate and the false detection rate both increase as the current window length increases. This is because the longer the current window is, the less sensitive it is to the change-points. DR denotes the delayed detection rate, which is the percentage of delayed detections among the correctly detected change-points.

4.1.2. Test on multilayer detection. In this part, we inspect the reliability of the multilayer change-point detection. We apply the meta-algorithm to each level of the wavelet transform. The effectiveness of the meta-algorithm has been studied in Section 4.1.1. For single layer detection, we can achieve the conditional probability $P(x=1 \mid H_0) \approx 0.07$, where $H_0$ is the hypothesis that no change occurs in the sequence, and $x=1$ means that $x$ is detected as a change-point. It is the probability that a non-change-point is detected as a change-point. Similarly, $P(x=0 \mid H_1) \approx 0.07$, where $H_1$ is the hypothesis that a change occurs in the sequence, and $x=0$ means that $x$ is not identified as a change-point. It is the probability that a change-point is missed. In the $k$-layer detection, a change-point is missed only if it is missed at every layer, so, assuming that the layers err independently, MDR $= [P(x=0 \mid H_1)]^k$. A false change-point is reported if it is detected in at least one layer, so FDR $= 1 - [P(x=0 \mid H_0)]^k$, where $P(x=0 \mid H_0) = 1 - P(x=1 \mid H_0)$.

In the multilayer detection, the final detected change-points in the original sequence are integrated from the lower levels, so the number of final detected change-points may be several times the number obtained by single layer detection. It is evident that the misdetection rate will decrease. However, the false detection rate under the criterion in Eq. (3.6) will increase, since the falsely detected change-points are the union of the false points at each layer and the mapping process also introduces deviations in time position. We therefore examine the difference between the detected change-point sequence and the real change-point sequence by the dynamic time warping (DTW) distance [5], which finds the optimal alignment between two time series.
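The DTW distance between two sequences of change-point positions can be computed with the classic dynamic programming recurrence. The sketch below is generic; the function name `dtw_distance` is hypothetical, and the absolute difference as local cost and the absence of any normalization are assumptions, since the text does not state which variant was used.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with |a_i - b_j| as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a[i-1] matched again
                                    cost[i][j - 1],      # b[j-1] matched again
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Real change-points versus detections that include one extra point at 330.
distance = dtw_distance([100, 250, 400], [104, 255, 330, 398])
```

Unlike the threshold criterion of Eq. (3.6), this measure does not classify a detection as right or wrong; it accumulates how far the detected positions are from the real ones under the best alignment, so an extra detection is charged by its distance to the nearest aligned real change-point.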

Table 5. Multi-layer detection results

                MDR (std)         DTW (std)
Single layer    6.46% (3.05%)     8.6 (0.77)
Two-layer       1.51% (1.1%)      13.5 (0.83)

From the probabilities estimated above, if we use two-layer detection, the misdetection rate is MDR $= [P(x=0 \mid H_1)]^2 \approx 0.07^2 = 0.0049$. The accuracy of the two-layer detection is sufficient for our purposes. Table 5 shows the average MDR and DTW distance obtained by single layer detection and two-layer detection on 100 generated Poisson sequences. For the MDR, we set $\varepsilon = 10$ in Eq. (3.6). In the original data sequence, the lengths of the reference window and the current window are set to 30 and 10 respectively. In the sequence at level 3, the lengths of the reference window and the current window are 20 and 10 respectively. The average segment length for each sequence follows the uniform distribution on [400, 500].

From this table, we can see that the misdetection rate is reduced by the multilayer detection. The numbers in parentheses are the standard deviations over the 100 sequences. The two-layer detection obtains a more reliable misdetection rate. For the average distance between the real change-points and the detected change-points over the 100 Poisson sequences, the single layer detection obtains 7.6 and the two-layer detection obtains 10.6. That is, the change-points detected by single layer detection deviate from the real change-points by about 8 points on average, while those detected by two-layer detection deviate by about 11 points on average. Given the systematic error introduced by the mapping, the DTW distance of the two-layer detection is acceptable. This shows that although some detected points are identified as false change-points under the criterion in Eq. (3.6), they are actually very close to the threshold $\varepsilon = 10$.

4.2. Stock simulation

We also test the proposed algorithm on 30 stocks from the Shanghai Stock Exchange in the period from Jun 1, 2003 to Jun 31, 2004. We count the number of trades every five minutes from the market opening to the market closing. The trades are classified into buy-initiated trades and sell-initiated trades [14]. Two sequences are thus obtained for each stock: the buy order flow and the sell order flow.

We detect the changes in the buy order flow and the sell order flow separately. The average price of the trades occurring in a five-minute interval is taken as the stock price in that interval. As a result, about 12,000 points are prepared for each stock’s buy/sell order flow. The statistics of these stocks are listed in Table 6. These 30 stocks are chosen because of their high trading rate.

Table 6. Statistics of 30 stocks

                               Mean        S.D.        Median
Price                          5.8122      1.9751      5.6321
Variance of ΔP (x10^-3)        8.5711      2.3982      7.5821
Daily number of trades         1003.6      234.35      958.72
Daily share volume (x10^3)     12968       12479       8882.6
Trade size in shares           10814       6731.1      8273.8
Trade size in RMB (x10^3)      5862.9      3051.3      4973.2
Spread (¥)                     0.011595    0.009215    0.01054
Spread (%)                     0.21548     0.40999     0.22517

As mentioned in Section 1, a change in the order flow will eventually lead to a change in the stock price. Since we do not know the real change-points in the order flow unless there is some important news that is easy to look up, we use “lift” as our criterion to test the effectiveness of our method. Let $P^{(J)}_i$ denote the price change between the $J$ time intervals before and after the point $i$, and $Pc^{(J)}_i$ denote the price change between the $J$ time intervals before and after the change-point $i$. The definition of “lift” is given as follows,

$$\mathrm{lift} = \frac{\sum_{i=1}^{m} Pc^{(J)}_i / m}{\sum_{i=1}^{n} P^{(J)}_i / n}, \tag{3.7}$$

where $n$ is the total number of points in the stock sequence and $m$ is the number of detected change-points. This is the ratio of the average price change around the detected order flow changes to the average price change over the whole sequence.
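The lift of Eq. (3.7) can be sketched in code once the price change is made concrete. The text does not define the price change precisely, so this sketch assumes it is the absolute difference between the mean price of the J intervals after a point and the mean price of the J intervals before it; the names `price_change` and `lift` are hypothetical, and points without J neighbours on both sides are skipped, so the baseline averages over those usable points rather than over all n points.

```python
def price_change(prices, i, J):
    """Assumed definition: |mean of the J prices from i on - mean of the J prices before i|."""
    before = prices[i - J:i]
    after = prices[i:i + J]
    return abs(sum(after) / J - sum(before) / J)

def lift(prices, change_points, J):
    """Average price change at detected change-points over the average at all usable points."""
    valid = range(J, len(prices) - J + 1)   # points with J neighbours on each side
    if len(valid) == 0:
        raise ValueError("price series too short for this J")
    baseline = sum(price_change(prices, i, J) for i in valid) / len(valid)
    usable = [i for i in change_points if J <= i <= len(prices) - J]
    if not usable or baseline == 0:
        raise ValueError("lift is undefined for this input")
    detected = sum(price_change(prices, i, J) for i in usable) / len(usable)
    return detected / baseline

# A price series that steps from 10 to 11 exactly at the detected change-point.
prices = [10.0] * 20 + [11.0] * 20
value = lift(prices, change_points=[20], J=2)
```

A lift above 1 means that prices move more around the detected change-points than around an average point; in this toy series the whole price move is concentrated at the detected point, so the lift is far above 1.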

[Plot: average lift (vertical axis, 0 to 1.8) versus layer number (horizontal axis: 0, 3, 2, 1), with one series for each of J=2, J=6, J=12 and J=18]

Figure 4. The lifts for different levels

In this experiment, we calculate the average lift for the 30 stocks under J=2, 6, 12, and 18, which corresponds to 10 minutes, 30 minutes, 1 hour and 90 minutes before and after the points respectively. The results are obtained under the following parameters: the reference window length m1=30, 20, 10, 10 and the current window length m2=20, 10, 10, 8 for the levels 0, 3, 2, 1, and the layer number k=4. δ is determined by prior knowledge about the stocks’ order numbers in normal and abnormal status. Figure 4 shows the average lift of the 30 stocks at the different levels and for the different values of J. For each J, the lift decreases as the level becomes coarser. At the coarsest level (level 1), the lift increases as J becomes larger. This is because the change in the long term becomes more evident as the time interval becomes longer. The fact that the lift values are all larger than 1 means that our detected change-points indicate more price change than other points.

Therefore, compared with the average price change for the whole time series, our detected change-points on the stock order flows are effective and actually indicate the changes of stock prices.


5. Conclusions

The algorithm proposed in this paper detects change-points in the wavelet transform resolutions at different levels. The change-points finally obtained are more reliable than those detected from a single sequence. The algorithm not only detects change-points across several levels but also gives each point a degree of importance as a change-point. For further study, the proposed algorithm needs prior knowledge about the stock order flows to determine the parameter δ; the determination of the sliding window length l at each level deserves in-depth exploration, as does the relationship between δ and l at different levels J. Reducing the detection delay is also key to practical applications of the proposed algorithm.

References

[1] R.P. Adams and D.J.C. MacKay, “Bayesian Online Change-point Detection”, http://www.inference.phy.cam.ac.uk/rpa23/papers/rpa-changepoint.pdf.
[2] C.C. Aggarwal, “A Framework for Diagnosing Changes in Evolving Data”, In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA, USA, June 2003, pp. 575-586.
[3] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and Issues in Data Stream Systems”, In Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Madison, Wisconsin, USA, June 3-5, 2002, pp. 55-68.
[4] C.W. Baum and V.V. Veeravalli, “A Sequential Procedure for Multihypothesis Testing”, IEEE Transactions on Information Theory, Vol. 40, No. 6, 1994, pp. 1994-2007.
[5] D. Berndt and J. Clifford, “Using Dynamic Time Warping to Find Patterns in Time Series”, AAAI-94 Workshop on Knowledge Discovery in Databases, Seattle, Washington, 1994.
[6] B. Biais, P. Hillion, and C. Spatt, “An Empirical Analysis of the Limit-order Book and the Order Flow in the Paris Bourse”, Journal of Finance, Vol. 50, No. 5, 1995, pp. 1655-1689.
[7] J. Chen and A.K. Gupta, “Testing and Locating Variance Change-points with Application to Stock Prices”, Journal of the American Statistical Association, Vol. 92, No. 438, June 1997, pp. 739-747.
[8] F. Desobry, M. Davy, and C. Doncarli, “An Online Kernel Change Detection Algorithm”, IEEE Transactions on Signal Processing, Vol. 53, No. 8, 2005, pp. 2961-2974.
[9] P. Domingos and G. Hulten, “Mining High-Speed Data Streams”, In Proceedings of the Association for Computing Machinery Sixth International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA, 2000, pp. 71-80.
[10] G.Z. Dong, J.W. Han, L.V.S. Lakshmanan, J. Pei, H.X. Wang, and P.S. Yu, “Online Mining of Changes from Data Streams: Research Problems and Preliminary Results”, In Proceedings of the 2003 ACM SIGMOD, San Diego, CA, USA, June 8, 2003.
[11] D. Easley, N. Kiefer, M. O’Hara, and J.P. Paperman, “Liquidity, Information and Infrequently Traded Stocks”, Journal of Finance, Vol. 51, No. 4, 1996, pp. 1405-1436.
[12] F.D. Foster and S. Viswanathan, “A Theory of Intraday Variations in Volumes Variance and Trading Costs”, Review of Financial Studies, Vol. 3, 1990, pp. 593-624.
[13] V. Ganti, J. Gehrke, and R. Ramakrishnan, “Mining Data Streams under Block Evolution”, ACM SIGKDD Explorations Newsletter, Vol. 3, No. 2, Jan. 2002, pp. 1-10.
[14] G. Gottlieb and A. Kalay, “Implications of the Discreteness of Observed Stock Prices”, Journal of Finance, Vol. 40, No. 1, 1985, pp. 135-153.
[15] D. Kifer, S. Ben-David, and J. Gehrke, “Detecting Change in Data Streams”, In Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004.
[16] G.M. Koop and S.M. Potter, “Forecasting and Estimating Multiple Change-point Models with an Unknown Number of Change-points”, Technical report, Federal Reserve Bank of New York, December 2004.
[17] V. Moskvina and A. Zhigljavsky, “Change-point Detection Algorithm Based on the Singular Spectrum Analysis”, Communication in Statistics: Simulation & Computation, Vol. 32, 2003, pp. 319-352.
[18] C.B. Lee, “Bayesian Analysis of a Change-point in Exponential Families with Applications”, Computational Statistics and Data Analysis, Vol. 27, 1998, pp. 195-208.
[19] X.Y. Liu, X.D. Wu, H.Q. Wang, and Z. Zhang, “Incremental Online Change-point Detection in Stock Order Flows”, Working Paper, Department of Information Systems, City University of Hong Kong, 2006.
[20] B.J. Noe and F.M. Ham, “Change Detection through Subspace Projection Using Independent Component Analysis to Track Moving Targets in Scenery”, International Joint Conference on Neural Networks, Washington, DC, USA, Vol. 1, No. 15-19, July 2001, pp. 703-708.
[21] M. Raimondo and N. Tajvidi, “A Peaks over Threshold Model for Change-point Detection by Wavelets”, Statistica Sinica, Vol. 14, 2004, pp. 395-412.
[22] D. Siegmund, Sequential Analysis, Springer-Verlag, 1985.
[23] Y.Y. Zhu and D. Shasha, “Efficient Elastic Burst Detection in Data Streams”, In Proceedings of the 9th ACM SIGKDD, Washington, D.C., June 2003, pp. 336 – 34.
