Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial...

46

Transcript of Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial...

Page 1: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Detecting Spatial Outliers: Algorithmand Application

Chang-Tien LuSpatial Database LabDepartment of Computer ScienceUniversity of [email protected]: 612-378-7705

Page 2: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Outline� Introduction� Motivation� General De�nition of Spatial Outlier� Related Work� Proposed Approach and Algorithm� Evaluation of Proposed Approach� Conclusions

Page 3: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Spatial Data Mining� Spatial Databases are too large to analyze manually� NASA Earth Observation System(EOS)� National Institute of Justice - Crime mapping� Census Bureau, Dept. of Commence - Census Data� Spatial Data Mining� Discover frequent and interesting spatial patternsfor post processing (knowledge discovery)� Pattern examples: outliers, hot-spots, land-useclassi�cation

1

Page 4: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Spatial Outlier� De�nition� A data point that is extreme relative to its neighbors� Individual attribute value is not necessarily extreme in thetotal population, but is extreme in its adjacent area� An example� Item: Palm Beach County,Neighbors(item) = Counties in Florida

Source: http://madison.hss.cmu.edu/buchanan-bush.gif2

Page 5: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Application Domain� I-35W North Bound� Volume: the number of vehicles passing a stationwithin 5 minutes

Traffic Volume(Time v.s. Station)

Time

I35

W S

tatio

n I

D (

No

rth

Bo

un

d)

50 100 150 200 250

10

20

30

40

50

600

20

40

60

80

100

120

140

160

180

3

Page 6: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Application Domain� Minneapolis-St. Paul (Twin-Cities) Tra�c Data Set� 930 detectors (stations) installed on major highways� Periodical measuring attributes: volume, occupancy, speed� Interesting spatial outliers - discontinuities{ Assume smooth spatial attribute

Page 7: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Application Domain: Twin-CitiesTra�c Data

0 2 4 6 8 10 12 14 16 18 20 22 240

10

20

30

40

50

Time

Volu

me

Traffic Volume v.s. Time for Station 138 on 1/12 1997

Figure 1: Station 138 on 1/12 1997

0 2 4 6 8 10 12 14 16 18 20 22 240

50

100

150

200

250

300

Time

Volu

me

Traffic Volume v.s. Time for Station 139 on 1/12 1997

Figure 2: Station 139 on 1/12 1997

0 2 4 6 8 10 12 14 16 18 20 22 240

20

40

60

80

100

Time

Volu

me

Traffic Volume v.s. Time for Station 140 on 1/12 1997

Figure 3: Station 140 on 1/12 1997

Page 8: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$An Example of Spatial Outlier� Spatial outlier: S, global outlier: G

0 2 4 6 8 10 12 14 16 18 200

1

2

3

4

5

6

7

8

← S

P →

Q →

D ↑

Original Data Points

Location

Attri

bute

Val

ues

Data Point Fitting CurveG

L

� Zs(x) approach: S(x) = [f(x)� 1kPy2N(x)(f(y))]� if Zs(x) = jS(x)��s�s j > �, declare x as an outlier

0 2 4 6 8 10 12 14 16 18 20−2

−1

0

1

2

3

4Outliers Detected Using Spatial Test

Location

Spat

ial T

est Z

s(x)

S →

P →← Q

4

Page 9: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Evaluation of Statistical Assumption� Distribution of tra�c station attribute f(x) looksnormal

−50 0 50 100 150 2000

10

20

30

40

50

60

70Volume Distribution at 10:00am on 1/15/1997

Volume Value

Num

ber

of S

tatio

ns

� Distribution of S(x) = [f(x) � 1kPy2N(x)(f(y))]looks normal too!

−4 −3 −2 −1 0 1 2 3 40

50

100

150Normalized Volume Difference over Spatial Neighborhood at 10:00am on 1/15/1997

Normalized Volume Difference over Spatial Neighborhood

Num

ber

of S

tatio

ns

5

Page 10: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$General De�nition of Spatial Outlier� Given� A spatial framework S consisting of locationss1; s2; : : : ; sn� An attribute function f : si ! R(R : set of real numbers)� A neighborhood relationship N � S � S� A neighborhood aggregation function fNaggr : RN ! R� A di�erence function Fdiff : R�R! R� A test procedure TF : R! fTrue; Falseg� General de�nition: SO-outlier� An object O 2 S is a SO-outlier (f , fNaggr, Fdiff , TF )if TFfFdiff [f(x), fNaggr(f(x), N(x))]g == TRUE� Constraints� Fdiff , TF are composed (e.g., ratio, product, linearcombination) of few statistical function of f , fNaggr

6

Page 11: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Related Work - Outlier DetectionTests

Moran Scatterplot

Outlier Detection Methods

One-dimensional(linear)

Multi-dimensional

Dimensions

Homogeneous Bi-partite dimension

(Spatial outlier detection)

Scatterplot

Graphical

Varigram Cloud Spatial StatisticZ s(x)

attribute value

Frequency

distribution over

Quantitative

7

Page 12: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Related Work - Outlier DetectionTests� Related work: two families� 1-dimension - ignores geographic location� Homogeneous Multi-dimension - mixes location withattributes� Spatial outlier{ 2 classes of dimensions - location, attributes{ Neighborhood - based on location dimensions{ Di�erence - compares attribute dimensions� Comparison of outlier detection methods

Neighbor N/AattributeDefinition

Comparisonattribute

location and

location and

location

attribute valuesof neighbors

Homogeneous( linear )One-dimensional Multi-dimensional

with populationdistribution

Bi-partite

8

Page 13: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Related Work� Di�erent Spatial Outlier Tests� Spatial Statistic Approach� Scatter Plot Approach (Luc Anselin '94)� Moran Scatter plot Approach (Luc Anselin '95)� Variogram Cloud Approach (Graphic)� All these are special case of SO-outlier� Will show this for one case� Test for Outlier DetectionOutlierDe�nition Spatial Statistic (Zs(x)) Scatter Plot Moran Scatter PlotDi�erencefunctionFdiff (f; fNaggr,fG2aggr, : : :,fGkaggr)

S(x) = [f(x) � E(x)](E(x) = 1kPy2N(x) f(y)) � = E(x)� (m� f(x)+ b)(E(x) = 1kPy2N(x) f(y)) Zi = f(x)��f�f , Ii =Zim2 Pj WijZj ,Test functionTF (f; fNaggr,fG2aggr, : : :,fGkaggr, �) jS(x)��s�s j > � j������ j > � (Zi:+, Ii: -) or (Zi:- ,Ii: +)

Table 1: Test for Outlier Detection9

Page 14: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Scatter Plot Approach

1 2 3 4 5 6 7 82

2.5

3

3.5

4

4.5

5

5.5

6

6.5

7Scatter Plot

Attribute Values

Ave

rag

e A

ttrib

ute

Va

lue

s O

ve

r N

eig

hb

orh

oo

d

← S

P →

Q →

� Lemma� Scatter plot is a special case of SO-outlier� Given� An attribute function f(x)� A neighborhood relationship N(x)� An aggregation function fNaggr : E(x) = 1kPy2N(x) f(y)� A di�erence function Fdiff : � = E(x)� (m � f(x) + b)� Detect spatial outlier by� Test function TF : j������ j > �10

Page 15: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Moran Scatterplot Approach� X axis : Zi = f(x)��� standardized deviation units� Y axis (local moran value ): Ii = Zim2Pj WijZj{ where m2 =Pj Z2j , Pj Wij = 1� Four clusters{ High-high: High attribute value, high attribute valueneighbors (Positive association){ High-low: High attribute value, low attribute value neigh-bors (Negative association){ low-high: Low attribute value, high attribute valuesneighbors (Negative association){ low-low: Low attribute value, low attribute value neigh-bors (Positive association)� Spatial outliers{ High low; Low high

−2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5−3

−2

−1

0

1

2

3

S →

P →

Q →

Moran Scatter Plot

Z−Score of Attribute Values

Wei

ghte

d N

eigh

bor Z

−Sco

re o

f Attr

ibut

e V

alue

s

Page 16: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Moran Scatter Plot Approach� Lemma� Moran Scatter plot is a special case of SO-outlier� Given� An attribute function f(x)� A neighborhood relationship N(x):neighbor matrix Wij� Di�erence functions Zi = f(x)��f�f , Ii = Zim2Pj WijZj� Detect Spatial Outlier by� Test function (Zi : + , Ii : -) or (Zi : - , Ii : +)

Page 17: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Variogram Cloud Approach� Approach� X axis: Pairwise distance between neighbors� Y axis: Square root di�erence of attribute value� Spatial outlier{ Locations that are near to one another{ Locations with large attribute di�erences� An example{ S1 : relation of S and Q node{ S2 : relation of S and P node

0 0.5 1 1.5 2 2.5 3 3.50

0.5

1

1.5

2

2.5Variogram Cloud

Pairwise Distance

Sq

ua

re R

oo

t o

f A

bso

lute

Diffe

ren

ce

of A

ttrib

ute

Va

lue

s

← (Q,S)

← (P,S)

Page 18: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Outline� Introduction� Proposed Approach and Algorithm� Problem formulation� Our approach� E�cient algorithm� Cost model� Evaluation of Proposed Approach� Conclusions

Page 19: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Problem Formulation� Given� A spatial framework S consisting of locations s1; s2; : : : ; sn� An attribute function f : si ! R� A neighborhood relationship N � S � S� An aggregation function fNaggr : RN ! R� A di�erence function Fdiff : R�R! R� A test procedure TF : R! fTrue; Falseg� General de�nition: SO-outlier� An object O 2 S is a SO-outlier (f , fNaggr, Fdiff , TF )if TFfFdiff [f(x), fNaggr(f(x), N(x))]g == TRUE� Design� An e�cient algorithm to detect SO-outier,i,e., O = fsi j si 2 S; si is a spatial outlierg� Objective� E�ciency: to minimize the computation time� Constraints� Fdiff , TF are composed (e.g., ratio, product, linearcombination) of few statistical function of f , fNaggr� The size of the data set � the main memory size� Computation time is determined by I/O time11

Page 20: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Our Approach� Separate two phases� Model Building� Test a node (or a set of nodes)� Computation Structure of Model Building� Key insights:{ Need a few statistics of f(x) and fNaggr� Spatial self join using N(x) relationship� Computation Structure of Testing� Single node: spatial range query{ Get All Neighbors(x) operation� A given set of nodes{ Sequence of Get All Neighbor(x)

12

Page 21: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$An Example of Our Approach� Consider Scatter Plot� Model Building� Neighborhood aggregate function fNaggr : E(x) = 1kPy2N(x) f(y)� Global aggregate function fG1aggr, fG2aggr, .. , fGkaggr{P f(x);PE(x);P f(x)E(x);P f 2(x);PE2(x)� Global parameter PG1aggr, PG2aggr, .. , PGkaggr{ m = NP f(x)E(x)�P f(x)PE(x)NP f2(x)�(P f(x)2{ b = P f(x)PE2(x)�P f(x)P f(x)E(x)NP f2(x)�(P f(x))2{ �� =qSyy�(m2Sxx)(n�2) ,{ where Sxx =P f 2(x)� [(P f(x))2n ]{ and Syy =PE2(x)� [(PE(x))2n ]� Testing� Di�erence function{ � = E(x)� (m � f(x) + b){ where E(x) = 1kPy2N(x) f(y)� Test function{ j������ j > �

13

Page 22: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Scatter Plot� X axis: Attribute value of each point� Y axis: Average attribute value in adjacent area� Run a linear square regression on the plot{ Regression line y = mx + b{ m = NP xiyi�P xiP yiNP x2i�(P xi)2{ b = P xiP yi2�P xiP xiyiNP x2i�(P xi)2{ � = y � (mx + b){ Standard deviation �� =qSyy�(a2Sxx)(N�2){ Mean �� = 0{ Sxx =Px2i � [(P xi)2n ]{ Syy =P y2i � [(P yi)2n ]� � = y � (mx + b){ if standardized residual j������ j > �{ Declare x as an outlier

Page 23: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Spatial Outlier Detection: Summary� Model Building� Compute simple statistics on f(x); fNaggr� Details in Table 2� Computable in a single spatial-self-join on N(x)Model BuildingOutlierDe�nition Spatial Statistic Scatter Plot Moran Scatter PlotAttributefunction f f(x) f(x) f(x)NeighborhoodaggregatefunctionfNaggr S(x) = f(x)� E(x) E(x) = 1kPy2N(x) f(y)Global ag-gregatefunctionfG1aggr, fG2aggr,.. , fGkaggr

PS(x), PS2(x) P f(x), PE(x),P f(x)E(x), P f 2(x),PE2(x) P f(x), P f 2(x)Global pa-rameterPG1aggr, PG2aggr,.. , PGkaggr �s = PS(x)n , �s =q 1n [PS2(x)� (P S(x))2n ] m =NP f(x)E(x)�P f(x)PE(x)NP f2(x)�(P f(x)2 ,b =P f(x)PE2(x)�P f(x)P f(x)E(x)NP f2(x)�(P f(x))2 ,�� = qSyy�(m2Sxx)(n�2) , whereSxx = P f 2(x) � [ (P f(x))2n ],Syy =PE2(x)� [ (PE(x))2n ],

�f = P f(x)n , �f =q 1n [P f 2(x)� (P f(x))2n ]Table 2: Model Building

14

Page 24: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Spatial Outlier Detection: Summary� Test for Outlier DetectionOutlierDe�nition Spatial Statistic (Zs(x)) Scatter Plot Moran Scatter PlotDi�erencefunctionFdiff (f; fNaggr,fG2aggr, : : :,fGkaggr)

S(x) = [f(x) � E(x)](E(x) = 1kPy2N(x) f(y)) � = E(x)� (m� f(x)+ b)(E(x) = 1kPy2N(x) f(y)) Zi = f(x)��f�f , Ii =Zim2 Pj WijZj ,Test functionTF (f; fNaggr,fG2aggr, : : :,fGkaggr, �) jS(x)��s�s j > � j������ j > � (Zi:+, Ii: -) or (Zi:- ,Ii: +)

Table 3: Test for Outlier Detection

15

Page 25: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Model Building Algorithm� Algorithm A1: steps� For each location x{ Retrieve data record of x = (f(x), identities ofneighbors(x)){ Get-All-Neighbors(x): Retrieve data records ofneighbor(x)� if neighbor y is not in the memory bu�er� request another I/O operation{ Compute neighborhood aggregate function fNaggr{ Accumulate global aggregate function:fG1aggr, fG2aggr, .. , fGkaggr� Compute global parameter PG1aggr, PG2aggr, .. , PGkaggr� I/O cost is determined by� Dominant operation: Get-All-Neighbors(x)� I/O cost of Get-All-Neighbors(x) is determined bythe clustering e�ciency{ Grouping nodes into disk page

16

Page 26: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Test Algorithm� Algorithm A2(a): steps� For each location x along a route{ Retrieve data record of x = (f(x), identi�es ofneighbors(x)){ Get-All-Neighbors(x): Retrieve data records ofneighbor(x)� if neighbor y is not in the memory bu�er� request another I/O operation{ Compute Di�erence function Fdiff(f; fNaggr, fG2aggr, : : :, fGkaggr)� if TF (f; fNaggr, fG2aggr, : : :, fGkaggr, �) == True{ Declare x as an outlier� Algorithm A2(b): steps� For each location x within the random set{ Retrieve data record of x = (f(x), identities ofneighbors(x)){ Get-All-Neighbors(x): Retrieve data records ofneighbor(x)� if neighbor y is not in the memory bu�er� request another I/O operation{ Compute Di�erence function Fdiff(f; fNaggr, fG2aggr, : : :, fGkaggr)� if TF (f; fNaggr, fG2aggr, : : :, fGkaggr, �) == True{ Declare x as an outlier

17

Page 27: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$I/O Cost Model� De�nition� CE: Clustering E�ciency� N: Total number of nodes� Bfr: Blocking factor (number of nodes in a page)� K: Number of neighbors for each node� L: Number of nodes in a route� R: Number of nodes in a random set� Assume two memory bu�ers� Cost model of A1� (N/Bfr)+N*K*(1-CE){ The cost to retrieve all nodes: (N/Bfr){ The cost to retrieve neighbors of all nodes: N*K*(1-CE)� Cost model of A2(a)� L*(1-CE) + L*K*(1-CE)� Cost model of A2(b)� R+R*K(1-CE)

18

Page 28: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Clustering E�ciency Parameter� Computation cost (I/O cost) is determined byClustering E�ciency(CE)� CE de�nition:{ (Total number of unsplit edges)/(Total number of edges){ Probability[vi and a neighbor of vi are stored in the samedisk page]� An example{ CE = 9�39 = 69 = 0:66

V1

V3V5

V6 V7 V8

V9

V2

V4

Page APage C

Page B

� CE depends on{ Disk block size{ Data set size, data set distribution{ Clustering method 19

Page 29: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Outline� Introduction� Proposed Approach and Algorithm� Evaluation of Proposed Approach� Candidates (Clustering Methods)� Experiment Design� Results� Conclusions

Page 30: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Experimental Evaluation (Summary)� Hypothesis:� I/O cost of the algorithms is determined by theclustering e�ciency� Physical Data Page Clustering Method� Graph-based method: CCAM� Geometric method: Cell Tree� Geometric method: Z-order� Metrics: Clustering E�ciency (CE), I/O cost� Benchmark data� Minneapolis - St. Paul Tra�c data� Benchmark tasks� Model Building� Test Spatial Outlier (Route)� Test Spatial Outlier (Set of Random Nodes)� Variable Parameters� Bu�er size� Disk page size

20

Page 31: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Clustering Method: CCAM� Connectivity Clustered Access Method� Cluster the nodes via min-cut graph partitioning� Use B+ tree with Z-order as the secondary index

Key 0Key 1Key 3Key 4Key 5Key 6Key 7Key 8

Key 8

Node 4

Node 1

Node 5

Node 3Node 6

Node 8

Node 9

Node 7Node 12

Node 13

Node 18

Node 11Node 14

Node 15

Node 26

Node 0

Key 9Key 11Key 12Key 13Key 14Key 15Key 18Key 26

+B tree index Data Page

1

0

3

4

5

6

7

8

9 11

12

13

14

15

2618

21

Page 32: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Clustering Method: CCAM� Connectivity Clustered Access Method� Cluster the nodes via min-cut graph partitioning� Use B+ tree with Z-order as the secondary index

1

0

3

4

5

6

7

8

9 11

12

13

14

15

2618

Page 33: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Clustering Method: Cell Tree� Binary Space Partitioning(BSP)� Decompose universe into disjoint convex subspaces� Each leaf node corresponds to one of the subspaces� Each tree node is stored on one disk page� Cannot exploit edge information, pure geometric

H1

Node 1

Node 15

Node 26

Node 0

H2

H30

H2

H1

H3

4

8

12

13

9

14

11

15

2618

5

6

3

7

Node 3

Node 6

Node 4Node 5

Node 7

Node 18

Node 8Node 9

Node 11

Node 14

Node 12Node 13

Data PageCell-tree Index

1

22

Page 34: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Clustering Method: Cell Tree� Binary Space Partitioning(BSP)� Decompose universe into disjoint convex subspaces� Each leaf node corresponds to one of the subspaces� Each tree node is stored on one disk page� Cannot exploit edge information, pure geometric

0

H2

H1

H3

4

8

12

13

9

14

11

15

2618

5

6

3

7

1

Page 35: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Clustering Method: Z-order� Impose a total order on the nodes� Z order = interleave (bits of X , bits of Y)� Use B+ tree as the primary index� Cannot exploit edge information, pure geometric

Key 0Key 1Key 3Key 4Key 5Key 6Key 7Key 8

Key 8

0

1

4

5

6

7

8

9

12

13

14

113

2 10

18 24

251917

26

27

15

16

0

1

3

4

2

0 1 2 3

Node 1

Node 6

Node 26

Node 0

Key 9Key 11Key 12Key 13Key 14Key 15Key 18Key 26

Node 3

Node 4

Node 5

Node 7

Node 8

tree indexB +

Data Page

Node 9Node 11

Node 12

Node 13

Node 14Node 15

Node 18

23

Page 36: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Clustering Method: Z-order� Impose a total order on the nodes� Z order = interleave (bits of X , bits of Y)� Use B+ tree as the primary index� Cannot exploit edge information, pure geometric

0

1

4

5

6

7

8

9

12

13

14

113

2 10

18 24

251917

26

27

15

16

0

1

3

4

2

0 1 2 3

Page 37: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Experiment Design� Experiment Data Set� Twin-Cities Tra�c Data� Each data object (node){ Attribute Values{ Neighbor List{ Size: 256 bytes

PageSize

Sets of pages of data

Sets of pages of data

HighwayTwin-Cities

SizeBufferCCAM

Z-orderCell-tree

Parameters

No of neighbors

Buffer size

No ofNeighbors

Clustering EfficiencyI/O Cost

MapClustering

method

Model Building

(spatial self- join)

(a) Set of random nodes

(b) Nodes in a Route

Test Spatial Outlier (A2)

Global

Figure 4: Experimental Layout24

Page 38: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Experimental Observation andResults� Step 1: Model Building� Compute global aggregate function� Compute global parameters� Spatial self join structure� Step 2: Test Spatial Outlier� Detect outliers along a Route� Detect outliers in a set of random nodes

25

Page 39: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Model Building: E�ect of Page Size� Fixed Parameters� Bu�er Size: 64k� Variable Parameters� Page size, clustering strategy

0

50

100

150

200

250

300

350

1 2 3 4 5 6 7 8

Nu

mb

er

of

pa

ge

acc

ess

es

Page Size (K bytes)

ZordCELL

CCAM

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

1.1

1 2 3 4 5 6 7 8

CE

Va

lue

Page Size (K Bytes)

ZordCELL

CCAM

� CCAM has the best performance� CCAM has the highest CE value� High CE => Low I/O cost{ Cost Model: (N/Bfr)+N*K*(1-CE)� Increase page size => reduce number of page accesses26

Page 40: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Model Building: e�ect of Bu�er Size� Fixed Parameters� Page size: 2K� Clustering E�ciency:{ CCAM=0.81, Cell=0.69, Z-ord=0.51� Variable Parameters� Number of bu�ers, clustering strategy

110

120

130

140

150

160

170

180

8 10 12 14 16 18 20

Nu

mb

er

of p

ag

e a

cce

sse

s

Number of Buffers

ZordCELL

CCAM

� Increase Bu�er size => reduce number of page ac-cesses� CCAM has the best performance27

Page 41: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Test Spatial Outlier (Route):E�ect of Page Size� Average I/O cost of outlier query over 50 routes� Fixed Parameters� Bu�er size: 4 Kbytes, Data point size = 256 bytes� Variable Parameters� Page size, clustering strategy

20

30

40

50

60

70

80

90

0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 2.2

Nu

mb

er

of

Pa

ge

Acc

ess

es

Page Size (K Bytes)

ZordCELL

CCAM

0

0.2

0.4

0.6

0.8

1

0.5 1 1.5 2

CE

Va

lue

Page Size (K Bytes)

ZordCELL

CCAM

� Increase page size => reduce no of page accesses� CCAM has the best performance� CCAM has the highest CE value� High CE => Low I/O cost{ Cost Model: L*(1-CE) + L*K*(1-CE)� Cell Tree has zero CE value when Bfr=2� Increase page size => Performance gap reduces28

Page 42: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Test Spatial Outlier (Random Nodes):E�ect of Bu�er Size� Average I/O cost of outlier query over 50 sets ofrandom nodes� Fixed Parameters� Number of random nodes: 150, page size : 1 Kbytes� CE value : CCAM = 0.68 Cell = 0.53 Z-ord = 0.31� Variable Parameters� Bu�er Size, clustering strategy

250

300

350

400

1 2 3 4 5 6 7 8 9

Nu

mb

er

of p

ag

e a

cce

sse

s

Number of Buffers

ZordCELL

CCAM

� Increase bu�er size => No e�ect� CCAM has the best performance

Page 43: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Test Spatial Outlier (Random Nodes):E�ect of Page Size� Average I/O cost over 50 sets of random nodes� Fixed Parameters� Number of random nodes: 150 points, bu�er size:4K� Variable Parameters� Page size, clustering strategy

200

250

300

350

400

450

500

0.5 1 1.5 2

Nu

mb

er

of P

ag

e A

cce

sse

s

Page Size (K Bytes)

ZordCELL

CCAM

� Increase page size => reduce no of page accesses� CCAM has the best performance

Page 44: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Summary of Experimental Results� Clustering E�ciency� CCAM achieves higher clustering e�ciency thanCell tree and Z-order� Test parameter and test result computation� CCAM has lower I/O than Cell tree and Z-order� Higher CE leads to lower I/O cost� CE is a good predicator of relative I/O perfor-mance� Page size improves clustering e�ciency of all meth-ods� Reduces performance gap between methods

29

Page 45: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Conclusion� SO-outlier� A general de�nition of spatial outlier� Show that existing de�nition are special cases ofSO-outlier� Scatter plot, Moran Scatterplot, Spatial Statistic� Develop e�cient algorithm to detect spatial outlier� Model Buliding� Test Spatial Outlier� Recognize the computation structure of spatialoutlier detection algorithms� � Self Join� Get-All-Neighbor() dominates I/O cost� Develop Algebraic Cost Models� Evaluate Alternate Page Clustering Methods

30

Page 46: Detecting - University of Minnesotashekhar/talk/outlier/spatial-outlier-slide.pdfDetecting Spatial Outliers: Algorithm and Application Chang-Tien Lu Spatial Database Lab ... 0 0.5

&

'

%

$Future Direction� Extend Spatial Outlier Detection Test� Multi attributes (non-location)� Location attribute includes time� Temporal and Spatial-Temporal Outliers� Extend Experiments� NASA data sets - uniform grid� Hypothesis - Geometric clustering may performwell� Explore other spatial patterns beyond spatial outlier� Land-use classi�cation

31