NEW OUTLIER DETECTION
TECHNIQUES FOR DATA STREAMS
Approved by:
Dr. Michael Hahsler
Dr. Margaret H. Dunham
Dr. Sukumaran Nair
Dr. Jeff Tian
Dr. Ping Gui
NEW OUTLIER DETECTION
TECHNIQUES FOR DATA STREAMS
A Dissertation Presented to the Graduate Faculty of the
Bobby B. Lyle School of Engineering
Southern Methodist University
in
Partial Fulfillment of the Requirements
for the degree of
Doctor of Philosophy
with a
Major in Computer Science
by
Charlie Isaksson
(M.S.C.S., Mid Sweden University, 2006)
December 17, 2016
ACKNOWLEDGMENTS
I am truly humbled and grateful to the many individuals who have supported and
encouraged me over the past nine years to fulfill my biggest dream. Dr. Margaret H.
Dunham and Dr. Michael Hahsler have been my two mentors and friends throughout this
rewarding journey. I would like to extend special thanks to Dr. Hahsler for helping me find
my path back; I deeply appreciate his background knowledge and patience.
I would like to extend my gratitude to the faculty and staff members in the Department
of Computer Science and Engineering at Southern Methodist University.
To the other members of my dissertation committee, Dr. Sukumaran Nair, Dr. Jeff Tian,
and Dr. Ping Gui: thank you for all the feedback and the patience you had with me. I
wholeheartedly enjoyed the challenge of researching an issue that is currently critical
for various industries.
Finally, a special recognition goes out to my family and friends who supported and
encouraged me during my pursuit of the doctorate in computer science. Thanks to my kids
for giving me the strength to keep going. I love you more than you will ever know.
Isaksson, Charlie M.S.C.S., Mid Sweden University, 2006
New Outlier Detection
Techniques For Data Streams
Advisor: Professor Michael Hahsler
Doctor of Philosophy degree conferred December 17, 2016
Dissertation completed November 9, 2016
The availability and reliability of data have become essential in our modern society. In
fact, it has become critical in every domain to maintain high-quality data even though that
data may originate at high velocity and in large quantities. Today it is well understood that
data enables businesses to achieve their full potential by providing valuable insights into
their business as well as potentially offering them an advantage over their competitors. To
achieve such a goal requires a significant investment in both big data infrastructure and data
mining capabilities. Data mining is the process of finding hidden patterns within a large
dataset. Imperative to data mining is the ability to detect outliers, data points that deviate
from the rest of the data, because outliers can dramatically alter the result of the analysis.
Although outliers occur infrequently, they are hard to identify since they have many
potential sources (such as human error, machine error, environmental variations, and
faulty sensors). Finding outliers in large datasets requires extremely efficient outlier
detection techniques. Detecting outliers within a data stream is even harder, since
streaming imposes a single-pass restriction and data often arrive at a very fast rate. Also,
streaming data may contain redundant information, which can reduce outlier detection
performance and efficiency. To avoid this redundancy while maintaining the correctness
of the data, it
becomes necessary to summarize the data stream. The Extensible Markov Model (EMM)
has been proven to be a good candidate for meeting these requirements to detect outliers in
data stream applications. EMM uses data stream clustering models and takes into account
temporal and ordering aspects using a Markov Chain (MC), a powerful temporal model that
allows studying a complex system and making predictions about events. The Extensible
Markov Model is a time-varying MC that can learn, dynamically adapt its structure to the
environment, and update its state transition probabilities based on the incoming data.
The model generated by EMM allows analysis of a particular time frame
as an MC, and, as time passes, this model will continue to adapt, evolve, and learn with the
ongoing data stream. This is due to the close coupling of the clustering model with an MC
model. Combining these two models delivers a spatiotemporal model that satisfies all the
requirements from a data stream (big data) infrastructure standpoint. In this dissertation,
the data pattern finding capability of EMM has been extended in several ways. Firstly, a
sophisticated mining task on the synopsis is investigated to detect Distributed Denial of
Service (DDoS) network intrusions. A performance study then compares different outlier
detection techniques with EMM, and this leads to two additional extensions that further
improve EMM's performance. SO-Stream, a new self-organizing cluster
structure that allows the algorithm to obtain the threshold for each micro-cluster dynami-
cally, is proposed, and then SO-Stream is extended by integrating a Markov Model (MM).
The new algorithm is called Adaptive Streaming Markov Model (ASMM), which is de-
signed to handle concept drift, spatiotemporal outliers, and high volume and velocity data
streams while preserving higher accuracy and cluster quality. The dissertation concludes
with directions for future work, including distributed ASMMs that can be integrated into
big data frameworks, ASMMs for telecom applications, and a visualization technique for
multidimensional data that is greatly needed for better interpretation of outlier models.
TABLE OF CONTENTS
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
CHAPTER
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2. Focus of the Dissertation and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3. Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1. Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2. Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3. Outlier Detection in Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4. Spatiotemporal Outlier Detection in Data Stream . . . . . . . . . . . . . . . . . . . . . . . . 13
3. RISK LEVELING OF NETWORK TRAFFIC ANOMALIES . . . . . . . . . . . . . . . . . 17
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2. Related Work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4. Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1. Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5. Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4. A COMPARATIVE STUDY OF OUTLIER DETECTION ALGORITHMS . . . . 36
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.1. Extensible Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.2. Density Based Local Outliers (LOF Approach) . . . . . . . . . . . . . . . . . . 41
4.1.3. Density Based Local Outliers (LSC-Mine Approach) . . . . . . . . . . . . 41
4.2. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1. Time Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.2. Experiments on Real Life Data and Synthetic Datasets . . . . . . . . . . . 45
4.3. Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5. SO-STREAM: SELF ORGANIZING DENSITY-BASED CLUSTERING OVER DATA STREAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2. Related Work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.1. SOStream Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.2. Density-Based Centroid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.3. SOStream Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3.4. Online Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.1. Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.2. Real-World Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.3. Parameter Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4.4. Scalability and Complexity of SOStream . . . . . . . . . . . . . . . . . . . . . . . . 76
5.5. Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6. ASMM: DETECTING SPATIO-TEMPORAL OUTLIERS WITH ADAPTIVE STREAMING MARKOV MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.2. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2.1. Extensible Markov Model Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2.2. Adaptive Streaming Markov Model Algorithm . . . . . . . . . . . . . . . . . . 90
6.2.3. EMMRare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.3. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3.1. Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3.2. Parameter Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.3.3. Scalability and Complexity of ASMM . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.4. Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7. CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.1. Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.2. Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
APPENDIX
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
LIST OF FIGURES
Figure Page
2.1 A classification of outlier detection techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 A workflow from a traditional spatiotemporal outlier detection framework. . . 14
2.3 The workflow from EMM outlier detection framework. . . . . . . . . . . . . . . . . . . . . 16
3.1 Logarithm of traffic volume shows the DDoS attacks . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Advantages of the LOF approach. Modified from [78] . . . . . . . . . . . . . . . . . . . . . 42
4.2 Run time for LOF, LSC-Mine, and EMM with MinPts = 20 and EMM threshold = 0.99. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1 (a) Data points of a stream with 5 overlapping clusters and (b) SOStream's capability to distinguish overlapping clusters. For visualizing the cluster structure, we do not utilize fading or merging. . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 SOStream clustering quality, horizon = 1K, stream speed = 1K. The quality evaluation for MR-Stream and D-Stream is retrieved from [68]. . . . . . . . . . . 73
5.3 SOStream memory cost over the length of the data stream. The memory evaluation for MR-Stream is retrieved from [68]. . . . . . . . . . . . . . . . . . . . . . 76
5.4 SOStream execution time using the high-dimensional KDD CUP'99 dataset with 34 numerical attributes. The data sampling rate is every 25K points. . . . . 77
6.1 Example of an EMM directed graph. . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Basic example that shows the high-level operations of ASMM. . . . . . . . . . . 90
6.3 The sensors were arranged in the lab according to the above diagram. Obtained from [88]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.4 Subplots from the time period [8398:9000]. It is evident that humidity suffers from spatial outliers. However, due to the large data size, we are unable to display temporal outliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.5 Subplots from the normalized Server System Health dataset. The highlighted red areas include both the spatial and the temporal outliers. . . . . . . . . . . . . 110
6.6 Distribution of ASMM's cluster counts for different buffer sizes on the KDD CUP'99 data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.7 Cluster sizes decrease with increased buffer size. The number of clusters stabilizes between buffer sizes 15 and 35. . . . . . . . . . . . . . . . . . . . . . . . . 117
6.8 ASMM and EMM memory cost over different threshold values using the KDD CUP'99 dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.9 ASMM and EMM execution time using the high-dimensional KDD CUP'99 dataset. . 117
LIST OF TABLES
Table Page
3.1 Notations of EMM Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 The extracted features from raw tcpdump data using tcptrace software . . . . . . 32
3.3 Legend used in the performance evaluation, with derivations from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Impacts of clustering thresholds and selection of similarity measures . . . . . . . 35
3.5 Detection rate and false alarm rate using the frequency-based anomaly detection model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Detection rate and false alarm rate using the risk-leveling anomaly detection model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 EMM detection and false positive rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 LOF detection and false positive rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 LSC-Mine detection and false positive rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 EMM, LOF and LSC-Mine detection and false positive rates using PCA. . . . . 54
4.5 EMM, LOF and LSC-Mine detection and false positive rates. . . . . . . . . . . . . . . 55
5.1 Feature comparison between different data stream clustering algorithms. . . . . 58
5.2 Comparing average purity for different MinPts for α = 0.1. . . . . . . . . . . . . . . . . 75
5.3 Comparing average purity for different MinPts for α = 0.3. . . . . . . . . . . . . . . . . 75
5.4 Improvement of SOStream compared to MR-Stream and D-Stream. . . . . . . . 75
6.1 Legend for the confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 ASMM's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3 EMM's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . 107
6.4 LOF's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.5 ASMM's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . 108
6.6 EMM's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . 109
6.7 LOF's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.8 ASMM's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . 112
6.9 EMM's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . 112
6.10 LOF's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Dedicated to the Almighty Creator, the Most Gracious, the Most Merciful.
Chapter 1
INTRODUCTION
The availability and reliability of data have become crucial factors in modern society.
One important task in any application domain is to detect abnormal data. Outlier
detection is extensively used in a wide variety of applications such as fraud detection in
the banking system, intrusion detection in network security, unusual behavior in military
surveillance, and the detection of tumors in MRI images. Outliers are defined as data
points that occur very infrequently and/or lie far from the expected values. It is crucial to
investigate outliers because they may contain valuable information regarding the process
under investigation. One should ask why such data points have occurred, and whether
similar points will continue to appear, before deciding to remove them from the dataset
prior to training models. Statisticians have researched the problem of outlier detection
since the early nineteenth century [49]. Many techniques have been proposed
for outlier detection. Out of those techniques, some are specifically designed to suit certain
application domains while others are more generic. The presence of outliers in data may
carry important information. For example, an anomaly in digital photography may indicate
that a terrorist is using steganography to hide messages in the low-order bits of a digital
photograph, in either plaintext or ciphertext form, to disguise it from adversaries [34].
Similarly, outliers in Magnetic Resonance Imaging (MRI) may identify pixels that are
significantly different between two MRI scans and thereby indicate the presence of brain
tumors [54]. Furthermore, an abnormal pattern in network traffic may signal an alarm of
intrusion, which may indicate that a compromised server is sending out unauthorized
information [87]. Other examples include outliers in credit card transactions, which may raise
attention to credit card theft [39], or interruptions in continuous signals from an airplane
to the ground, where inconsistent data act as outliers and may lead to accidents.
Outliers may arise due to several reasons such as intrusion, human error, machine error,
and changes in the behavior of the system. Because of all these various causes, outliers are
difficult to detect. For example, defining a normal region that includes all possible
behaviors is very problematic. Furthermore, it is difficult to set a precise boundary
distinguishing outlier and normal behavior. This may result in a case where an outlier
lying close to the boundary may be predicted as normal, or, on the contrary, normal data
lying close to the boundary may be identified as an outlier. There are also cases where
malicious actions result in outliers. Malicious actors may try to adapt their behavior so
that the resulting outliers appear normal, thereby making it difficult to distinguish
between normal and malicious behavior. In addition to this, it becomes difficult to detect,
distinguish, and remove data consisting of noise, which may appear to be similar to the real
outliers.
1.1. Motivation
Existing outlier detection techniques have been effective in either space or time but not
both. The Extensible Markov Model (EMM) [76] is a spatiotemporal algorithm that has
been successfully used in diverse fields [77, 28, 120, 118, 119]. EMM has proven to be a
powerful algorithm that can manage space and time very efficiently as well as adapt to
continuous changes in the environment in a scalable manner. When EMM is used to pro-
cess data, it will dynamically construct a codebook based on the input data. The codebook
consists of a set of model vectors representing typical vectors within the dataset. EMM
creates new entries in the codebook based on a fixed threshold when the input data does
not map to an existing cluster. The use of a spatiotemporal data mining algorithm like
EMM allows continuous assessment and is capable of both tracking changes over time and
determining whether or not that particular change is probable based on a normal or abnor-
mal pattern. Other outlier detection algorithms that are based only on clustering would be
incapable of establishing such relationships because they lack the temporal model.
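To make the codebook mechanism concrete, here is a toy sketch of the idea (not the authors' implementation: the class name, the use of Euclidean distance, and the threshold value are illustrative assumptions). New input either maps to the nearest existing cluster or, if it is farther than the fixed threshold, creates a new codebook entry, while a Markov chain counts transitions between clusters:

```python
import math

class TinyEMM:
    """Toy sketch of the EMM idea: a fixed-threshold nearest-centroid
    codebook coupled with Markov-chain transition counts."""

    def __init__(self, threshold):
        self.threshold = threshold  # fixed distance threshold
        self.centroids = []         # the codebook: one vector per cluster
        self.transitions = {}       # (from_state, to_state) -> count
        self.current = None         # current Markov-chain state

    def process(self, point):
        # find the nearest codebook entry, if any
        state, dist = None, float("inf")
        for i, c in enumerate(self.centroids):
            d = math.dist(point, c)
            if d < dist:
                state, dist = i, d
        # no entry is close enough: grow the codebook with a new cluster
        if state is None or dist > self.threshold:
            self.centroids.append(list(point))
            state = len(self.centroids) - 1
        # record the Markov-chain transition from the previous state
        if self.current is not None:
            key = (self.current, state)
            self.transitions[key] = self.transitions.get(key, 0) + 1
        self.current = state
        return state

emm = TinyEMM(threshold=1.0)
for p in [(0, 0), (0.1, 0.2), (5, 5), (0, 0.1), (5.1, 4.9)]:
    emm.process(p)
print(len(emm.centroids))  # 2: clusters form around (0, 0) and (5, 5)
```

An outlier test can then inspect the transition counts: a point whose arrival traverses a rarely or never observed transition is a candidate spatiotemporal outlier.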
1.2. Focus of the Dissertation and Conclusions
This work presents several innovative data mining models for outlier detection based on
Extensible Markov Model (EMM) [76] which combine the spatiotemporal data modeling
with data streams. EMM is composed of two core features: modeling and pattern-finding
capabilities. The modeling component in EMM is used to group related data points into
clusters. EMM combines a clustering model with a Markov Model (MM). Many algorithms
are proposed to extend EMM, and their performance is discussed for a large number of
datasets.
The main contributions of this work can be summarized as follows.
1. Risk leveling of network traffic anomalies: A real-world application exploring sophisticated mining tasks. The false alarm rate is used for performance evaluation.
Discussed in Chapter 3.
2. Comparison of EMM with other state-of-the-art outlier detection techniques: LOF
and LSC-Mine. The comparison is based on accuracy and runtime complexity. The
research indicates that EMM outperformed the other two techniques in several cases.
However, EMM suffered from a critical issue concerning the clustering component
using a fixed threshold. Discussed in Chapter 4.
3. For the EMM algorithm to work efficiently, it is imperative that the threshold be set
to the correct value. The threshold is the static parameter that determines whether a
new event belongs to an existing cluster or if a new cluster should be created. If this
parameter is set to an unsuitable value, the algorithm will either create too many
clusters and suffer from overfitting, or create too few clusters, resulting in unstable
classification. The next step is to give EMM an adaptable threshold. The Self-Organizing
Map (SOM) [105] is an unsupervised algorithm that does not use a fixed threshold;
instead, it creates an approximate output space with randomly assigned weights and,
depending on the incoming data, adjusts the neighbors of the winning weight to be
closer to it. SO-Stream is a proposed clustering algorithm that dynamically self-organizes
its structure without the use of a fixed threshold. SO-Stream is designed specifically
for data streams, so its performance was tested against two popular stream clustering
algorithms, MR-Stream and D-Stream, on different real-world and synthetic datasets.
SO-Stream outperformed the other two techniques
concerning cluster purity, memory, and runtime complexity. SO-Stream can identify
highly overlapping clusters, and SO-Stream operations (i.e. create, remove, merge,
and fade) are completely online. Discussed in Chapter 5.
4. EMM's two components, clustering and the MM, are tightly coupled, and data points
have to be processed in order. We propose a new algorithm, ASMM, that utilizes
offline and online components to decouple these two elements. ASMM can handle
points arriving out of order while maintaining the original order of the incoming data
points. SO-Stream utilizes an offline component to initialize the clustering model,
and then the initial model is efficiently incremented with the online component. The
offline component uses a buffering technique to add support for concept drift as well
as new patterns that may emerge from an evolving stream. ASMM's performance is
tested against two popular outlier detection algorithms, LOF and EMMRare, on a
large number of real-world and synthetic datasets. ASMM outperformed the other
two techniques in terms of different confusion matrix measures, memory, and runtime complexity.
Discussed in Chapter 6.
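The SOM-style adaptation mentioned in contribution 3, pulling the winning weight vector and its neighbors toward the input, can be sketched as a single competitive-learning step. This is illustrative only: the function name, learning rates, and neighborhood radius are assumptions, and SO-Stream's actual adaptive-threshold update differs.

```python
import math

def som_step(weights, x, winner_lr=0.5, neighbor_lr=0.1, radius=1.5):
    """One competitive-learning step: the weight vector closest to the
    input x (the winner) moves strongly toward x, and weights within
    `radius` of the winner move toward x with a smaller learning rate."""
    winner = min(range(len(weights)), key=lambda i: math.dist(weights[i], x))
    w_pos = list(weights[winner])  # winner position before any update
    for i, w in enumerate(weights):
        if i == winner:
            lr = winner_lr
        elif math.dist(w, w_pos) <= radius:
            lr = neighbor_lr  # neighbors move less than the winner
        else:
            continue          # weights outside the neighborhood stay put
        weights[i] = [wj + lr * (xj - wj) for wj, xj in zip(w, x)]
    return winner

weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
som_step(weights, [0.4, 0.4])  # moves weight 0 (the winner) and its neighbor 1
```

Because the winner and its neighborhood drift with the data, no global distance threshold has to be fixed in advance; this is the property that motivates the adaptive per-cluster threshold in SO-Stream.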
1.3. Organization of the Dissertation
The remainder of the dissertation is organized as follows: Chapter 2 presents the back-
ground work; Chapter 3 proposes a spatiotemporal technique for outlier detection in data
streams; Chapter 4 provides a comparative study of techniques to detect outliers; Chapter
5 presents a novel unsupervised clustering technique for data streams. Chapter 6 presents a
spatiotemporal model that extends SO-Stream with a temporal Markov Model, and Chap-
ter 7 concludes with an assessment of the viability of stream mining for outlier detection.
Please note that each chapter represents a previously published or submitted research
paper; for this reason, some concepts are introduced repeatedly in the introductory
sections of various chapters.
Chapter 2
BACKGROUND
In this chapter, the previous research related to spatiotemporal data stream mining is
addressed. Firstly, general techniques used in the areas of outlier detection are reviewed,
and then three important areas are discussed: data streams, outlier detection in data
streams, and spatiotemporal outlier detection in data streams.
2.1. Outlier Detection
Figure 2.1: A classification of outlier detection techniques. [Figure: a taxonomy with branches for clustering-based, nearest neighbor-based, statistical-based, classification-based (e.g., Support Vector Machine, Bayesian Network), and spectral decomposition-based (e.g., Principal Component Analysis) techniques.]
Different approaches and methodologies have been introduced to address the outlier/
anomaly detection problem; they include statistical approaches, supervised and unsuper-
vised learning techniques, neural networks and machine learning techniques (See Fig-
ure 2.1). We cannot provide a complete survey here but refer the interested reader to
available surveys [109], [71], [110]. We briefly mention some representative techniques¹.
The Grubbs method (extreme studentized deviate) [50] is a one-dimensional statistical
method in which all parameters are derived from the data; it requires no user parameters.
It calculates the mean and standard deviation of all attribute values and then calculates a
Z-score as the difference between the mean value of the attribute and the query value,
divided by the standard deviation of the attribute. The Z-score for the query is then
compared with a threshold corresponding to a 1% or 5% significance level.
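The Z-score computation just described can be sketched in a few lines (an illustrative sketch only: the function name is hypothetical, and the cutoff value 2.0 stands in for the critical value implied by the chosen significance level):

```python
import statistics

def z_score_outliers(values, cutoff=2.0):
    """Flag values whose Z-score exceeds a cutoff, as in the Grubbs-style
    test described above. The cutoff 2.0 is a hypothetical stand-in for
    the critical value derived from a 1% or 5% significance level."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / sd > cutoff]

data = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 25.0]  # 25.0 is an injected outlier
print(z_score_outliers(data))  # [25.0]
```

Note that the single extreme value inflates both the mean and the standard deviation, which is why the exact critical value (and, in the original test, repeated application after removal) matters in practice.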
An optimized k-NN approach was introduced by [97]. It produces a list of potential
outliers and their ranking. Naively, the entire distance matrix would need to be calculated
over all points, but the authors introduced a partitioning technique to speed up the k-NN algorithm.
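The distance-based ranking behind this approach can be sketched with a brute-force version (illustrative only: the function name is hypothetical, and the cited work's partitioning optimization is omitted):

```python
import math

def knn_outlier_ranking(points, k):
    """Rank points by the distance to their k-th nearest neighbor:
    the larger that distance, the more isolated (outlying) the point.
    Brute force: computes the full distance matrix the text mentions."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))  # k-th nearest neighbor distance
    return sorted(scores, reverse=True)   # most outlying point first

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(knn_outlier_ranking(points, k=2)[0][1])  # 4: the isolated point ranks first
```

The brute-force version costs O(n²) distance computations; the partitioning in [97] prunes partitions that cannot contain top-ranked outliers.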
Other outlier detection approaches are based on neural networks. Neural networks are
non-parametric models that require training and testing to determine the threshold before
they can identify outliers. Most of them suffer when the data has high dimensionality.
[10] defines novelties in time-series data for fault diagnosis in vibration signatures of
aircraft engines, and Bishop [22] monitors processes such as oil pipeline flows. Both
use a supervised neural network (multilayer perceptron), a feed-forward network with a
single hidden layer; hidden layers add neurons to the network architecture, building up
the ability to model highly complex nonlinear functions. The drawback is that an
increased number of neurons also increases the time the network needs to converge
during learning. [83] uses an auto-associative neural network, also a feed-forward
perceptron-based network, trained with supervised learning. [106] introduced a detection
technique for time-series monitoring based on the Adaptive Resonance Theory (ART) [51]
incremental unsupervised neural network.
An approach that works well with high-dimensional data uses decision trees, as in
[53] and [36], where a C4.5 decision tree detects outliers in categorical data to identify
unexpected entries in databases. They pre-select cases using the taxonomy from a
case-based retrieval algorithm to prune outliers and then use these cases to train the
decision tree. [103] and [104] introduced an approach that uses similarity-based matching
for monitoring activities.

¹This section has been published at the International Conference on Machine Learning and Data Mining (MLDM), 2009 [26].
The Local Outlier Factor (LOF) [24] algorithm detects outliers by measuring the local
deviation of a given data point with respect to its neighbors. LOF was designed for static
data, but if applied repeatedly, either periodically or every time a new data point arrives,
the algorithm can be adapted to data streams. Pokrajac [43] proposed an incremental LOF
algorithm in which the reachability distance, local reachability density (LRD), and LOF
values for each new data point are computed and those values for existing points are
updated, so outliers can be detected instantly.
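The static LOF of [24] can be sketched from scratch; the following is a minimal, brute-force illustration (the helper names and the choice k = 2 are ours, not from the cited papers), where scores near 1 indicate inliers and large scores indicate outliers:

```python
import math

def knn_info(points, i, k):
    """k-distance and k-distance neighborhood of point i (brute force)."""
    dists = sorted((math.dist(points[i], points[j]), j)
                   for j in range(len(points)) if j != i)
    k_dist = dists[k - 1][0]
    neighborhood = [j for d, j in dists if d <= k_dist]
    return k_dist, neighborhood

def lof_scores(points, k=2):
    """Local Outlier Factor for every point; scores near 1 are inliers."""
    n = len(points)
    k_dist, nbrs = {}, {}
    for i in range(n):
        k_dist[i], nbrs[i] = knn_info(points, i, k)

    def lrd(i):
        # local reachability density: inverse of the mean reachability distance
        reach = sum(max(k_dist[j], math.dist(points[i], points[j]))
                    for j in nbrs[i])
        return len(nbrs[i]) / reach

    dens = [lrd(i) for i in range(n)]
    return [sum(dens[j] for j in nbrs[i]) / (len(nbrs[i]) * dens[i])
            for i in range(n)]

# four tightly packed points and one far-away point
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = lof_scores(pts, k=2)
```

Re-running `lof_scores` over a sliding window is the naive stream adaptation described above; the incremental variant of [43] avoids the full recomputation by updating only the points affected by each insertion.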
2.2. Data Streams
The data mining community has provided many innovative technologies that address
different issues. One of these is data streams, a newer data mining area that involves data
that is continuous and potentially infinite. Such data is characterized by a high volume
arriving at high velocity. Storing it may be impractical, and even if the full volume were
stored, processing any particular record more than once may be infeasible. See [29] for a
detailed discussion of different streaming applications. Additionally, the characteristics of
streaming data may change over time (e.g., concept drift). Since data streams can be
viewed as time series, time series models were also considered. Traditional linear time
series models consist of three statistical components: autoregressive (AR), integrated (I),
and moving average (MA); the ARIMA model [23] integrates all of them. Data stream
mining has a single-pass restriction that makes traditional time series models impractical.
Thus, rather than using time series forecasting models to detect outliers, conventional
multidimensional models that account for temporal drift and deviations are used.
2.3. Outlier Detection in Data Streams
As new data arrives, data stream models need to update their structures to capture the
normal trends in the data. Outliers are then detected when new data causes a drastic
change in the model. Yamanishi and Takeuchi [61, 63] presented an online sequential
discounting algorithm that incrementally learns a probabilistic mixture model. The model
accounts for drift by using a decay factor, and it detects outliers by computing an outlier
score from the learned mixture model. Depending on the type of data (continuous or
categorical), different models were proposed. For categorical data, Sequentially
Discounting Laplace Estimation (SDLE) uses a Laplace smoothing function to compute a
probability score based on the occurrence frequency of a particular symbol divided by the
total number of data points; for every new data point, the model must update all of its
cells. Two models were proposed for continuous data, a Gaussian mixture model and a
time series model; both flag an anomaly if the model at time (t − 1) has changed after
adding a new data point at time t. Further research by Javitz [56] proposed updating the
normal distribution of the data by giving more weight to recent data.

An accepted solution for streaming data is to model or summarize related data points as
clusters, which avoids retaining the whole dataset. Clustering models in general can be
used to detect outliers: new data points are fit into existing clusters, and outliers are
detected when either a new data point does not fit into any cluster or the internal cluster
structure changes. Several popular clustering algorithms that can be used for outlier
detection are reviewed below. Aggarwal and Yu [30] proposed clustering as a method to
detect outliers. The k-means clustering method is often used because it allows reallocation
of samples even after assignment and converges quickly. The problem with basic k-means
is that the random initialization of cluster centers reduces its accuracy. Also, the values of
k (number of clusters) and t (number of iterations) are difficult to set in advance. To
counter this limitation, dynamic clustering approaches were proposed. Indeed, the
underlying structure in data stream clustering continues to evolve as time passes.
Detecting outliers, whether spatially or temporally, is particularly challenging: data points
analyzed at an early stage may be incorrectly viewed as outliers, although a new trend may
later emerge, and data points that are time-delayed may also falsely appear as outliers, so
techniques such as dynamic time warping may help uncover the true structure.
Accordingly, the dynamic nature of the data has motivated data mining researchers to
develop innovative technologies to meet these requirements.
For example, E-Stream [64] handles an evolving data stream by providing cluster
operations such as add, delete, split, and merge. The algorithm starts empty; at every time
step, based on a radius threshold, a new data point is either mapped into one of the
existing clusters or a new cluster is created around it. Any cluster that does not meet a
defined density level is considered inactive and remains isolated until it achieves the
desired weight. Cluster weights are decreased over time to reduce the influence of old data
points; this technique is known as a fading function, and a cluster that stays inactive for a
certain time period risks being deleted. Also, at each step a pair of clusters may be
merged, either because the overlap between the two clusters is sufficiently large or
because the maximum cluster limit has been reached. A cluster is split into two
sub-clusters if its internal data is heterogeneous: the split process builds one histogram
per active cluster, summarizing each data dimension into an α-bin histogram, and the split
is performed if a deep valley between two significant peaks is found.
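The fading function mentioned above can be illustrated with a small sketch; the decay rate and deletion threshold below are illustrative assumptions, not E-Stream's published parameters:

```python
FADE_LAMBDA = 0.1      # assumed decay rate
DELETE_WEIGHT = 0.5    # assumed deletion threshold

def faded_weight(weight, last_update, now, lam=FADE_LAMBDA):
    """Exponential fading: the weight decays by 2^(-lam * elapsed_time)."""
    return weight * 2 ** (-lam * (now - last_update))

# a cluster last touched at t=0 with weight 4.0
w10 = faded_weight(4.0, last_update=0, now=10)   # 4 * 2^-1 = 2.0
w40 = faded_weight(4.0, last_update=0, now=40)   # 4 * 2^-4 = 0.25
stale = w40 < DELETE_WEIGHT                      # candidate for deletion
```

A cluster that keeps receiving points has its weight refreshed, while an inactive one decays below the threshold and is removed, exactly the behavior the paragraph describes.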
CluStream [4] divides the clustering process into online and offline components. The
online micro-clustering component periodically stores detailed summary statistics of the
high-speed data stream, and the offline macro-clustering component uses these summary
statistics, together with user input, to give the user a quick understanding of the clusters
whenever required. This two-phase approach also gives the user the flexibility to explore
the nature of the evolution of the clusters over different time periods.

2 Some of these clustering algorithms are described in this section in more detail than in the originally published paper [27].
DenStream [25] discovers clusters of arbitrary shape in an evolving data stream by
maintaining two lists, one of potential micro-clusters and one of outlier micro-clusters.
Each time a new data point arrives, an attempt is made to merge it into the nearest
potential micro-cluster. If the radius of the resulting micro-cluster would be larger than a
specified radius, the merge is rejected, and another attempt is made to merge the point
with the nearest outlier micro-cluster. Again, if the resulting radius exceeds the specified
radius, the merge is rejected and a new outlier micro-cluster centered at the point is
created and added to the outlier list. If any outlier micro-cluster exceeds a specified
weight, it is moved into the potential micro-cluster list. Periodically, the outlier
micro-cluster list is checked and micro-clusters are either promoted into the potential list
or pruned.
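DenStream's insertion cascade can be sketched roughly as follows. This is a simplified illustration: micro-clusters are kept as raw point lists, and the radius bound `EPS` and promotion weight `BETA_W` are assumed values, not the paper's weighted, fading implementation:

```python
import math

EPS = 1.0      # assumed maximum micro-cluster radius
BETA_W = 3     # assumed weight needed to promote an outlier micro-cluster

class MicroCluster:
    def __init__(self, point):
        self.points = [point]

    @property
    def center(self):
        d = len(self.points[0])
        return tuple(sum(p[i] for p in self.points) / len(self.points)
                     for i in range(d))

    def radius_if_added(self, point):
        pts = self.points + [point]
        d = len(point)
        c = tuple(sum(p[i] for p in pts) / len(pts) for i in range(d))
        return max(math.dist(p, c) for p in pts)

    @property
    def weight(self):
        return len(self.points)

def insert_point(point, potential, outliers):
    """Try the nearest potential micro-cluster, then the nearest outlier one,
    else start a new outlier micro-cluster; promote heavy outlier clusters."""
    for group in (potential, outliers):
        if group:
            mc = min(group, key=lambda m: math.dist(m.center, point))
            if mc.radius_if_added(point) <= EPS:
                mc.points.append(point)
                if group is outliers and mc.weight >= BETA_W:
                    outliers.remove(mc)
                    potential.append(mc)
                return
    outliers.append(MicroCluster(point))

potential, outliers = [], []
for p in [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]:
    insert_point(p, potential, outliers)
```

After three nearby points, the outlier micro-cluster reaches the promotion weight and moves to the potential list; a far-away point would start a fresh outlier micro-cluster instead.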
OpticsStream [42] is an online visualization algorithm that produces a map representing
the clustering structure. It adds the ordering technique from OPTICS [81], which by itself
is not suitable for data streams, on top of a density-based stream algorithm such as
DenStream to better capture the cluster dynamics.
HPStream [5] is an online clustering algorithm that discovers distinct clusters based on
different subsets of the dimensions of the streaming data points. This is achieved by
maintaining, for each cluster, a d-dimensional vector that indicates which dimensions are
included in the continuous assignment of incoming data points to an appropriate cluster.
The algorithm begins by tentatively assigning the received data point to each of the
existing clusters; it then computes the radii, selects the dimensions with the smallest radii,
and creates the d-dimensional vector for each cluster. Next, the Manhattan distance is
computed from the incoming data point to the centroid of each existing cluster (with each
cluster's d-dimensional vector restricting the centroid to the selected dimensions). From
these distances the winning cluster is the one with the smallest average distance along the
included dimensions. The radius of the winning cluster is then computed and compared to
the winning distance; based on this comparison, either a new fading cluster centered at the
incoming data point is created, or the point is added to the winning cluster. Clusters are
also removed if they contain zero dimensions or if the number of clusters exceeds a
user-defined threshold.
WSTREAM [41] is a density-based algorithm that discovers cluster structure by
maintaining a list of rectangular windows that are incrementally adjusted over time. Each
window moves with the centroid of its cluster, and the centroid is incrementally
recomputed whenever new streaming data points are inserted into the window. Windows
can also incrementally contract or expand based on the window's approximated kernel
density and a user-defined bandwidth matrix, controlled by specified rules. When
windows overlap, the proportion of streaming data points in the intersection of the pair of
windows relative to the remaining points in each window is computed and compared to
user-defined thresholds, which determine whether the windows are merged or one is
removed. The algorithm also periodically monitors the weights of the stored windows:
windows whose weight is below a defined minimum threshold (considered outliers) or
that are very old compared to a defined time span are removed.
D-Stream [31] is a density-based clustering algorithm for data streams that operates on a
time-step model. It starts by initializing an empty hash table, the grid list, and contains
both an online and an offline component. The online component reads each incoming raw
data record and maps it to an existing grid in the grid list, or inserts a new grid if none
exists. After a record is inserted into a grid, the grid's characteristic vector, which holds
all the information about that grid, is updated. Thus, the online component partitions the
data into many density grids that form grid clusters. The offline component takes the role
of dynamically adjusting the clusters: if a grid receives no new values for an extended
period, it is removed from the grid list. Such grids are known as sporadic grids and may
contain outliers.
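The online component's mapping of records to density grids can be sketched as follows; the cell size, decay factor, and function names are illustrative assumptions rather than D-Stream's published settings:

```python
import math

GRID_SIZE = 1.0      # assumed length of each grid-cell side
DECAY = 0.998        # assumed per-tick density decay factor

grid_list = {}       # cell key -> (density, last update time): a tiny characteristic vector

def cell_key(point, g=GRID_SIZE):
    """Map a raw record to its density grid via coordinate quantization."""
    return tuple(math.floor(x / g) for x in point)

def online_insert(point, t):
    key = cell_key(point)
    density, last_t = grid_list.get(key, (0.0, t))
    # decay the stored density up to time t, then count the new record
    grid_list[key] = (density * DECAY ** (t - last_t) + 1.0, t)

for t, p in enumerate([(0.2, 0.7), (0.9, 0.1), (5.5, 5.5)]):
    online_insert(p, t)
```

The offline component would periodically sweep `grid_list` and delete entries whose decayed density has fallen below a sporadic-grid threshold.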
MR-Stream [68] extends D-Stream by finding clusters at versatile granularities. It
recursively partitions the data space into well-defined cells using a quadtree data
structure. MR-Stream employs both online and offline components.
2.4. Spatiotemporal Outlier Detection in Data Stream
Spatiotemporal data mining refers to the process of extracting hidden knowledge from
both the spatial and the temporal data space. It is an emerging research area for data
stream applications. Traditionally, data mining techniques treated spatiality and
temporality as two separate research areas; today, however, their combination has become
a central requirement for processing data events. According to the survey article on
outlier detection by Manish [74], spatiotemporal outliers can be defined as spatiotemporal
objects whose behavioral/thematic (non-spatial and non-temporal) attributes are
significantly different from those of the other objects in their spatial and temporal
neighborhoods. Figure 2.2 shows the workflow of a traditional spatiotemporal outlier
detection framework, which models outlier detection with three main components. The
first component is responsible for finding objects with interesting semantics in the input
data stream. The next component analyzes these objects to identify spatial outliers.
Finally, the spatial outliers are examined across time to check whether they are also
temporal outliers. Objects are classified as spatiotemporal outliers if they are found to be
both spatial and temporal outliers.
[Figure omitted: workflow diagram. Spatio-Temporal Data → Find Spatial Objects → Find Spatial Outliers → Verify Temporal Outliers → Spatio-Temporal Outliers]

Figure 2.2: A workflow from a traditional spatiotemporal outlier detection framework.
Birant [40], Cheng [101, 32], and Adam [3] proposed anomaly detection algorithms that
follow the multi-step approach of Figure 2.2: they first detect spatial outliers and then
verify their temporal neighborhoods to determine the spatiotemporal outliers. These
techniques use a modified version of DBSCAN [79] for both the spatial and the temporal
neighborhoods, given a radius, and assign a density factor to clusters intended to flag
potential outliers; the two evaluations together identify the spatiotemporal outliers.
Another technique proposed by Cheng uses spatial scaling in a four-step approach to
address the semantic and dynamic properties of geographic phenomena for ST-outlier
detection. First, the algorithm finds semantic (i.e., spatiotemporal) objects, using prior
knowledge to form regions with significant semantic meaning. Next, aggregation, which
focuses on detecting spatiotemporal outliers, is used to remove noise. The outliers found
in the clustering phase are then compared to the points that were filtered out. The final
step verifies the temporal outliers based on the previous steps. Adam [3] uses a
distance-based outlier detection technique that establishes a spatial Voronoi grid to obtain
macro-clusters; the algorithm uses the Jaccard distance and the silhouette coefficient to
determine the quality of the micro-clusters, and any points that deviate substantially from
their neighborhood are flagged as spatiotemporal outliers. Other techniques, such as
outlier solids, the Kulldorff scan statistic, and trajectory outliers, are discussed in more
detail in [73].
The Extensible Markov Model (EMM) [76] is a spatiotemporal algorithm based on
first-order Markov chains (MC) as described in [17]. EMM consists of two parts: a
distance-based data stream clustering algorithm for spatial data, which obtains
representative granules in the continuous data space, and an MC that models temporal
behavior. EMM applies to data stream processing where the number of states is unknown
in advance and provides a heuristic modeling method where approximating the Markov
property is appropriate. EMM operations are entirely online and thus suitable for data
streams. Figure 2.3 shows EMM's framework for detecting spatiotemporal outliers. The
rest of this dissertation investigates solutions and improvements for different aspects of
this framework.
[Figure omitted: workflow diagram. Spatiotemporal Data (Data_t, Data_{t+1}, ..., Data_{t+n}) → EMM (Clustering + Markov Model) → Spatio-Temporal Outliers]

Figure 2.3: The workflow of the EMM outlier detection framework.
Chapter 3
RISK LEVELING OF NETWORK TRAFFIC ANOMALIES
The goal of intrusion detection is to identify attempted or ongoing attacks on a computer
system or network. Many attacks aim to compromise computer networks in an online
manner, and traffic anomalies have been an important indication of such attacks. The
challenges in detection lie in modeling large, continuous streams of data and performing
anomaly detection online.

In this chapter, we present a data mining technique to assess the risk of local anomalies
based on a synopsis obtained from a global spatiotemporal modeling approach. The
proposed model is proactive in the detection of various traffic-related attacks such as
distributed denial of service (DDoS). It is incremental and scalable, and thus suitable for
online processing. Algorithm analysis shows the time efficiency of the proposed
technique. Experiments conducted with a DARPA dataset demonstrate that, compared
with a frequency based anomaly detection model, the false alarm rate of the proposed
model is significantly reduced without losing a high detection rate.
3.1. Introduction
Data mining is used to detect anomalies [120] [8] [78] [52]. The goal of anomaly
detection is to "find data objects that are different from most other objects" [86]. An
anomaly can be an indication of a possibly dangerous situation in computer networks and
other systems. When an anomaly is detected by an anomaly detection model, an alarm is
1 This work has been published in the International Journal of Computer Science and Network Security (IJCSNS), 2006 [28] and presents joint work with Yu Meng and Professor Margaret H. Dunham.
set and human intervention is invoked to examine whether the alarm represents an event
of interest, such as a dangerous situation or malicious activity. A traffic anomaly is a type
of anomaly: it refers to traffic characteristics that deviate from those that occur the
majority of the time. Such behaviors may have a significant impact on the system. Traffic
anomalies have received attention as a major indicator of risk exposure in computer
networks. For example, Juniper Networks has proposed a combination of traffic anomaly
detection, protocol anomaly detection, and stateful signatures to identify a variety of
attacks in computer networks [91]. Cisco has delivered the Cisco Traffic Anomaly
Detector XT 5600 for detecting distributed denial of service (DDoS) attacks, worms, and
other attacks [33]. Applications of traffic anomaly mining extend intuitively to highway
traffic operation and electric power demand management.

However, an anomaly is not necessarily a risk. Generally, as a higher detection rate is
pursued with an anomaly detection model, a higher false alarm rate results as well. The
human intervention triggered by false alarms is very costly, and there is a demand to
reduce unnecessary intervention. Automatic techniques are desired to evaluate the chance
that an anomaly is of interest, so as to filter out anomalies that are probably not risks.
Existing anomaly detection work uses either frequency based or data deviation based
approaches [120] [78] [46] [96] [9] [85]. We have noticed that these may suffer from a
high false alarm rate. In this chapter we propose a risk leveling model, a two-phase data
mining technique with rules using both occurrence frequency and data deviation. The
proposed model detects anomalies based on frequency and then measures the deviation of
each anomaly from the normal data space. The level of risk associated with the anomaly
is evaluated by this deviation, as we envision that when risks occur the anomalous data
lies away from the normal data space.

A common characteristic of a data stream is its high volume of data, which continuously
arrives at a rapid rate. It is not feasible to store all data from the streams and use random
access to the data as in a traditional database. This implies
a single-pass restriction for all data in the streams [55]. Therefore, the data stream must be
modeled to obtain a synopsis of the global profile of the dataset. Data mining is a key
technique for modeling stream data. Our proposed risk leveling model is built on the
Extensible Markov Model (EMM), a spatiotemporal modeling technique [76], and uses
the synopsis obtained from the EMM modeling process. Performance comparisons with a
frequency based anomaly detection model [120] are expected to show the low false alarm
rate of the proposed risk leveling model without losing a high detection rate. The
proposed model also inherits the incrementality and scalability of EMM.
3.2. Related Work
Our proposed technique assesses the chance that alarms raised by a frequency based
anomaly detection model are actually events of interest, i.e., risks. Before presenting the
risk leveling model, we first introduce related work, followed by the frequency based
anomaly detection technique. Among the prominent properties of an anomaly are its
rarity and its possible significance. These properties distinguish anomaly detection
techniques from modeling techniques in other areas in terms of feature
selection/construction and evaluation metrics. Lazarevic [8] indicates that unsupervised
and supervised techniques are the two major categories of anomaly detection.
Unsupervised techniques are capable of mining unlabeled data, meaning that no a priori
knowledge of "normal" profiles is required; an anomaly is detected by selecting an event
that deviates from the majority. Although a variety of algorithms can be applied, some
common steps are as follows:

• Construct features. The features may be arranged in a weighted numeric vector.

• Determine a distance measure from the data point, which represents an event under
investigation, to a cluster. The kth nearest neighbor distance [75], similarity measures
(Jaccard, cosine, overlap, Dice [75]), Euclidean distance, Manhattan distance [75], skewed
distance (Mahalanobis distance [8]), and density distance (LOF) [78] are common
distance measures.

• Apply an anomaly detection algorithm to the data, based on a set of rules defining
anomalies. The following categories of anomaly detection algorithms appear in the
literature:

– Distance based algorithms [46] [96],
– Statistics based algorithms [84], including finite mixture models [62] and information
theory [113],
– Model based algorithms such as neural networks [92] and SVMs [9].
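The distance and similarity measures listed in the second step can be written compactly; a small stdlib-only sketch:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def jaccard_sim(a, b):
    """Set overlap: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

Which measure fits depends on the feature construction: numeric vectors suit Euclidean or Manhattan distance, while set-valued or categorical features suit Jaccard-style similarity.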
Distance-based algorithms are based on clustering and form a major category of
techniques in anomaly detection. These techniques neither assume independence among
different dimensions of the data, as statistics based algorithms do, nor are they as sensitive
to the initial selection of the model as model based algorithms are. The meaning of these
techniques is easy to interpret, and they are suitable for spatial data mining. Temporality
can be another expected characteristic of anomalies in addition to spatiality, particularly
in traffic anomaly detection. Markov chains and suffix trees have been used [44] [85] to
store temporal profiles. The benefit of a Markov chain is its concise mathematical
presentation; variations of the Markov chain with dynamic structures have been proposed
to model dynamically changing data [76] [35]. The suffix tree stores all suffixes of a
sequence and is linearly efficient in string matching against those suffixes.

The EMM [76] takes advantage of distance-based clustering for spatial data as well as of
the Markov chain for temporality. EMM achieves efficient modeling by mapping groups
of closely located real-world events to the states of a Markov chain. EMM is an extension
of the Markov chain that uses clustering to obtain representative granules in the
continuous data space. By providing a dynamically adjustable structure, EMM is
applicable to data stream processing when the number of states is unknown in advance
and provides a heuristic modeling method for data that approximately hold the Markov
property. EMM formalizes a framework for spatiotemporal data mining with phases
including clustering and Markov chain construction, which model the data stream to
obtain a synopsis of the data profile, and applications, which are built on the synopsis.
Below we give a concise description of EMM, which should be sufficient to grasp the
scope of our work; further information concerning EMM can be found in [76].

A multidimensional data point in EMM represents a real-world event and can be
represented as a vector in a hyperspace. EMM defines a set of formalized procedures such
that at any time t, EMM consists of a Markov chain (MC) and algorithms to modify it,
where the algorithms include:
1. EMMCluster: defines a technique for matching input data at time t + 1 against the
existing states of the MC at time t. This is a clustering algorithm which determines
whether the new data point or event should be added to an existing cluster (MC state) or
whether a new cluster (MC state) should be created. A distance threshold th is used in
clustering.

2. EMMBuild: an algorithm that updates (as well as adds, deletes, and merges) the MC at
time t + 1, given the MC at time t and the output of EMMCluster at time t + 1.

3. EMMapplications: algorithms that use the EMM to solve various problems. To date,
we have examined EMM for prediction (EMMPredict) [76] and anomaly (rare event)
detection (EMMRare) [120].
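A toy sketch of the first two algorithms may help. Here EMMCluster is nearest neighbor matching against a distance threshold th, and EMMBuild updates the occurrence and transition counts. For brevity the node representative is the first point assigned to the node rather than a maintained centroid or medoid, and the threshold value is an arbitrary assumption:

```python
import math

TH = 2.0  # assumed clustering distance threshold

class EMM:
    """Toy EMM synopsis: nearest-neighbor clustering plus MC transition counts."""
    def __init__(self, th=TH):
        self.th = th
        self.LS = []          # representative per node (simplified: first point)
        self.CN = []          # occurrence count per node
        self.CL = {}          # (i, j) -> transition count
        self.current = None   # index of the current MC state

    def cluster(self, point):
        """EMMCluster: return the nearest node within th, or create a node."""
        if self.LS:
            j = min(range(len(self.LS)),
                    key=lambda i: math.dist(self.LS[i], point))
            if math.dist(self.LS[j], point) <= self.th:
                return j
        self.LS.append(tuple(point))
        self.CN.append(0)
        return len(self.LS) - 1

    def build(self, point):
        """EMMBuild: update counts and the transition from the current state."""
        j = self.cluster(point)
        self.CN[j] += 1
        if self.current is not None:
            key = (self.current, j)
            self.CL[key] = self.CL.get(key, 0) + 1
        self.current = j

emm = EMM()
for p in [(0, 0), (0.5, 0), (10, 10), (0.2, 0.1)]:
    emm.build(p)
```

After the four points, two states exist: the first, second, and fourth points fall into state 0, the third creates state 1, and the transition counts record the path 0 → 0 → 1 → 0.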
Throughout this chapter, we view the EMM as a directed graph with nodes and links. We
use link and transition interchangeably to refer to a directed arc, and node, state, and
cluster interchangeably to refer to a vertex in the EMM. These algorithms are executed in
an interleaved manner. The first two phases are used to model the data; the third performs
applications based on the synopsis created in the modeling process. The synopsis includes
information on cluster features [107] and on transitions between states. The cluster
feature defined in [107] includes at least a count of occurrences, CN_i (count on the
node), and either a medoid or centroid for that cluster, LS_i. To summarize, the elements
of the synopsis of an EMM are listed in Table 3.1.
Table 3.1: Notations of EMM Elements
Notation   Description
N_i        The ith EMM node, labeled by CN_i and LS_i
CN_i       Count of occurrences of data points found in the cluster (EMM node or EMM state) N_i
LS_i       A vector representing the representative data point of the cluster, usually the centroid or medoid of the cluster
L_ij       The directed link from N_i to N_j, labeled by CL_ij
CL_ij      Count of occurrences of the directed link from N_i to N_j
m          Number of EMM states
n          Number of attributes in the vector representing a data point (dimensions of the data space)
In this chapter, the frequency based anomaly detection algorithm [120], one of the several
applications of EMM, is used for comparison with the proposed model. We give a brief
review of the approach. The idea for anomaly detection comes from the fact that the
learning aspect of EMM dynamically creates a Markov chain that captures past behavior
stored in the synopsis. No input into the model identifies normal or abnormal behavior;
instead, this is learned from the statistics of occurrence of transitions and states within the
generated Markov chain. By learning what is normal, the model can predict what is not.
The basic idea is to define a set of rules related to the cardinalities of clusters and
transitions to judge anomalies. An anomaly is detected if an input event (data point), E_t,
is determined not to belong to any existing cluster (state in the EMM), if the cardinality of
the associated cluster (CN_j) is small, or if the count of the transition (CL_ij) from the
current state, i, to the new state, j, is small. When any of the predefined rules is met, a
Boolean alarm, A_t, is set to indicate the capture of an anomaly.
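The rule set can be sketched as a small predicate; the threshold constants below are illustrative assumptions, not values from [120]:

```python
MIN_CN = 2   # assumed minimum cardinality for a "normal" cluster
MIN_CL = 2   # assumed minimum count for a "normal" transition

def frequency_alarm(is_new_state, cn_j=None, cl_ij=None):
    """Boolean alarm A_t: fires on an unseen state, a rare state,
    or a rare transition from the current state."""
    if is_new_state:
        return True
    return cn_j < MIN_CN or cl_ij < MIN_CL

a1 = frequency_alarm(True)                       # unseen behavior: alarm
a2 = frequency_alarm(False, cn_j=50, cl_ij=40)   # frequent pattern: no alarm
a3 = frequency_alarm(False, cn_j=50, cl_ij=1)    # rare transition: alarm
```

Each alarm would then be passed to the risk leveling phase described next, rather than going straight to a human operator.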
3.3. Methodology
In this section we present the steps to build the risk leveling model, based on EMM
modeling [76] and the frequency based anomaly detection model [120], as well as the
evaluation metrics. KDD defines preprocessing procedures [108] that convert raw data
into a format appropriate for data mining. Our preprocessed data uses a structured format
that combines the time stamp and spatial traffic statistics in one vector:

V_t = < D_t, T_t, S_1t, S_2t, ..., S_it, ... >,

where D_t denotes the type of day, T_t the time of day, and S_it the value of the statistic
observed at spatial location i at time t. This spatiotemporal format defines an input
real-world event (input data point) in the multidimensional data space. Assuming there
are n elements in the vector, each data point can be represented as a vector in
n-dimensional space.
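Constructing V_t from a raw timestamp and traffic statistics might look as follows; the encoding of D_t (weekday vs. weekend) and T_t (seconds since midnight) is an illustrative assumption, not the exact preprocessing used in the experiments:

```python
from datetime import datetime

def make_vector(ts, stats):
    """V_t = <D_t, T_t, S_1t, S_2t, ...>: type of day, time of day, traffic stats."""
    d_t = 1 if ts.weekday() < 5 else 0                 # assumed: weekday=1, weekend=0
    t_t = ts.hour * 3600 + ts.minute * 60 + ts.second  # seconds since midnight
    return (d_t, t_t, *stats)

# Saturday, 08:30:00, with two hypothetical spatial traffic statistics
v = make_vector(datetime(2016, 12, 17, 8, 30, 0), [120.0, 35.5])
```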
A trait of EMM is that it learns while performing its task, dynamically adapting to the
time-variant dataset. To perform mining of risk levels, the following are applied:

1. EMMCluster: nearest neighbor clustering,

2. EMMBuild,

3. EMMAnomaly,

4. EMMRiskLeveling.

Algorithms EMMCluster and EMMBuild define the modeling process of EMM [76].
Combined with the algorithm EMMAnomaly [76], they define the EMM anomaly
detection model based on occurrence frequency introduced in the preceding section. The
anomaly detection model sets alarms, A_t, based on a set of predefined, frequency based
rules. To build the risk leveling model, a new algorithm, EMMRiskLeveling, is added.
The risk leveling model outputs a risk leveling index by combining the frequency based
anomaly alarm with an evaluation of the deviation of the local pattern from the normal
data space. We will see that this deviation evaluation can be calculated incrementally.
To evaluate the deviation, we use two parameters, the centroid \vec{c}(t) and the diameter
D(t), to characterize the data space Ω of the model. Here the data space Ω refers to the
region that the data points occupy in the n-dimensional hyperspace, which is equivalent to
the region over which the EMM nodes are distributed. The centroid of Ω is given in
Definition 3.1. Moreover, the centroid can be computed incrementally; using this
incrementality, the time complexity is reduced from O(m) to O(1), as given in Lemma 3.1.
Definition 3.1 (Centroid of data space). Denote an EMM node by N_i, the number of
data points included in the node by CN_i, and the first moment, or representative
location, of N_i by \vec{LS}_i. The centroid of the data space, \vec{c}(t), is defined as:

    \vec{c}(t) = \frac{\sum_{i=1}^{m} \vec{LS}_i \cdot CN_i}{t}    (3.1)

Lemma 3.1 (Incrementality of the centroid of the data space). Given \vec{c}(t-1) and
the first moment \vec{LS}_c of the current EMM state, \vec{c}(t) can be expressed in an
incremental manner:

    \vec{c}(t) = \frac{\vec{c}(t-1) \cdot (t-1) + \vec{LS}_c}{t}
               = \vec{c}(t-1) \cdot \left(1 - \frac{1}{t}\right) + \frac{\vec{LS}_c}{t}    (3.2)

Proof 3.1. First note that \vec{LS}_c is the same as \vec{LS}_t. We consider two cases:

1. N_c is a new EMM node:

    \vec{c}(t) = \frac{\sum_{i=1}^{m(t)} \vec{LS}_i \cdot CN_i}{t}
               = \frac{\sum_{i=1, i \neq c}^{m(t-1)} \vec{LS}_i \cdot CN_i}{t} + \frac{\vec{LS}_c}{t}
               = \sum_{i=1, i \neq c}^{m(t-1)} \frac{\vec{LS}_i \cdot CN_i}{t-1} \cdot \frac{t-1}{t} + \frac{\vec{LS}_c}{t}
               = \frac{\vec{c}(t-1) \cdot (t-1) + \vec{LS}_c}{t}
               = \vec{c}(t-1) \cdot \left(1 - \frac{1}{t}\right) + \frac{\vec{LS}_c}{t}

2. N_c is an existing EMM node:

    \vec{c}(t) = \frac{\sum_{i=1}^{m(t)} \vec{LS}_i \cdot CN_i}{t}
               = \frac{\sum_{i=1, i \neq c}^{m(t)} \vec{LS}_i \cdot CN_i}{t} + \frac{\vec{LS}_c \cdot CN_c}{t}
               = \frac{\sum_{i=1, i \neq c}^{m(t)} \vec{LS}_i \cdot CN_i}{t} + \frac{\vec{LS}_c \cdot (CN_c - 1)}{t} + \frac{\vec{LS}_c}{t}
               = \sum_{i=1}^{m(t-1)} \frac{\vec{LS}_i \cdot CN_i}{t-1} \cdot \frac{t-1}{t} + \frac{\vec{LS}_c}{t}
               = \frac{\vec{c}(t-1) \cdot (t-1) + \vec{LS}_c}{t}
               = \vec{c}(t-1) \cdot \left(1 - \frac{1}{t}\right) + \frac{\vec{LS}_c}{t}

where in the fourth line the counts are taken at time t − 1 (CN_c reduced by one for the
newly added point), and m(t) = m(t-1) since no node was added.
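Lemma 3.1 can be checked numerically. In the sketch below each arriving data point is treated as its own \vec{LS}_c (i.e., the node representative does not drift), and the incremental update (3.2) is compared against the batch definition (3.1):

```python
# Stream points one at a time, updating the centroid with
# c(t) = c(t-1) * (1 - 1/t) + LS_c / t, then compare against the
# batch mean. For simplicity, each point serves as its own LS_c.
stream = [(1.0, 4.0), (2.0, 3.0), (2.0, 3.0), (1.0, 4.0), (2.0, 3.0)]

c = (0.0, 0.0)
for t, ls_c in enumerate(stream, start=1):
    c = tuple(ci * (1 - 1 / t) + x / t for ci, x in zip(c, ls_c))

batch = tuple(sum(p[i] for p in stream) / len(stream) for i in range(2))
```

Each update touches only the previous centroid and the new representative, which is exactly the O(1) cost claimed in the lemma.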
Now we define the diameter of Ω in Definition 3.2.

Definition 3.2. Denote an EMM node by N_i, the number of data points included in the
node by CN_i, and the distance between any two EMM nodes N_i, N_j by d_ij. The
diameter of the data space at time t, D(t), is defined by:

    D(t) = \left( \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij}^2 \cdot CN_i \cdot CN_j}{2\, t(t-1)} \right)^{1/2}    (3.3)

where t is the time instance and m is the number of EMM nodes. For simplicity in
computations, we define:

    d(t) = \left( \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij}^2 \cdot CN_i \cdot CN_j}{2} \right)^{1/2}    (3.4)

Therefore, we have

    D(t) = \left( \frac{d^2(t)}{t(t-1)} \right)^{1/2}    (3.5)

or

    D^2(t) = \frac{d^2(t)}{t(t-1)}    (3.6)
At each time instance, D(t) gives a weighted inter-cluster distance of the data points
received so far and can be used to measure the size of the data space. It can be seen as an
approximation of the inter-point distance in the data space, obtained by ignoring the
distances among data points within the same cluster. The computational complexity of
this is O(m²); however, given the incrementality, it can be reduced to O(m).
Lemma 3.2 (Incrementality of the diameter of the data space). Given the diameter of
the data space at time instance t − 1, the diameter of the data space at time instance t can
be expressed in an incremental manner:

    d^2(t) = d^2(t-1) + \sum_{i=1}^{m(t-1)} d_i^2(t) \cdot CN_i    (3.7)

where d_i(t) denotes the distance between the node receiving the current data point and
node N_i.
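Equation (3.7) can likewise be checked numerically under the simplifying assumption that every arriving point forms its own singleton node, so the representatives never move and d_i(t) is the distance from the new point to node N_i:

```python
import math

# Incremental update of d^2(t): add the squared distance from the new node
# to every existing node, then compare against the batch definition
# d^2(t) = (1/2) * sum_ij d_ij^2 * CN_i * CN_j (here every CN_i = 1).
nodes = []
d2 = 0.0
for p in [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]:
    d2 += sum(math.dist(p, q) ** 2 for q in nodes)
    nodes.append(p)

batch = 0.5 * sum(math.dist(a, b) ** 2 for a in nodes for b in nodes)
```

Each arrival costs one pass over the m existing nodes, matching the O(m) claim above.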
Since the proof is very similar to that of the incrementality of the centroid, we skip it here. Now, denote the distance between the current node N_c and \(\vec{c}(t)\) at time instance t by d_{cc}(t). We define a risk leveling index in Definition 3.3.
Definition 3.3 (Risk Leveling Index:) Given an alert raised by the frequency based anomaly detection model when a data point \(\vec{E}_t\) is input, the risk leveling index caused by data deviation is given by a hyperbolic tangent sigmoid function, and is defined as:

\[ a(t) = \frac{e^{r(t)} - e^{-r(t)}}{e^{r(t)} + e^{-r(t)}} \tag{3.8} \]

where

\[ r(t) = \left( \frac{d_{cc}^2}{D^2(t)} \right)^{1/4} \tag{3.9} \]

or, for simplicity of computations,

\[ r(t) = \left( \frac{t (t-1) * d_{cc}^2}{d^2(t)} \right)^{1/4} \tag{3.10} \]
The index a(t) yields an output in the range [0, 1) because the ratio r(t) is never negative. The further the current data point lies outside the border of the data space, the more likely the data point is associated with a risk. This is in line with our assumption. The procedure to compute the risk leveling index is illustrated in Algorithm 1.
input : A_t : Boolean output of the frequency based anomaly detection model at time t.
        G_t : EMM at time t.
output: a(t) : Risk leveling index at time t.

1 foreach time instance t do
2     if A_t == true then
3         Update \(\vec{c}(t)\) using (3.3);
4         Update D(t) using (3.7) and (3.5) or (3.6);
5         Compute a(t) using (3.10);

Algorithm 1: EMMRiskLevel
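The loop body of Algorithm 1 can be sketched in a few lines. This is an illustrative rendering only (all names are ours), and for clarity it computes d²(t) directly from the synopsis per (3.4) rather than incrementally via (3.7):

```python
import numpy as np

def risk_level(centroids, counts, t, d_cc_sq):
    """Sketch of the EMMRiskLevel loop body: risk leveling index a(t).

    centroids : EMM node centroids LS_i, shape (m, dims)
    counts    : node counts CN_i, shape (m,)
    t         : number of data points seen so far
    d_cc_sq   : squared distance between the current node and c(t)
    """
    centroids = np.asarray(centroids, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # d^2(t) per (3.4): half the CN-weighted sum of squared inter-node distances.
    diff = centroids[:, None, :] - centroids[None, :, :]
    d2_t = ((diff ** 2).sum(axis=-1) * np.outer(counts, counts)).sum() / 2.0
    # r(t) per (3.10), a(t) per (3.8) (tanh is the hyperbolic tangent sigmoid).
    r = (t * (t - 1) * d_cc_sq / d2_t) ** 0.25
    return np.tanh(r)
```

For example, with two nodes <1, 4> and <2, 3> holding 2 and 3 points (so d²(5) = 12) and a hypothetical d_cc² equal to D²(5) = 12/20, the ratio in (3.10) is 1 and a(5) = tanh(1) ≈ 0.76.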
Example 3.1 (Given an EMM at time 5, specified as:)

N_1 = { 2, <1, 4> }, N_2 = { 3, <2, 3> },
L_{11} = { 1 }; L_{12} = { 1 }; L_{21} = { 1 }; L_{22} = { 2 };

\[ \vec{c}(5) = \langle 8/5,\ 17/5 \rangle, \qquad d^2(5) = 12 \]
Our proposed approach to determining the risk leveling index based on the synopsis has the following benefits:

• Computation takes O(1) time for \(\vec{c}(t)\) and O(m) time for D(t). Recall that the EMM takes O(m) time for clustering and O(1) time for Markov chain updates. The proposed approach inherits the time efficiency that EMM possesses.

• The proposed approach is based solely on the synopsis of the EMM at the current time. Thus the proposed method is as incremental and scalable as EMM itself.

• The proposed approach learns in an unsupervised manner while performing applications. It is not heavily dependent on a training process and is thus suitable for stream data processing.
If a data point \(\vec{d}_6 = \langle 1, 3 \rangle\) is input at time t = 6 and is clustered into a new EMM node N_3, and the frequency based anomaly model sets an alarm A_t = true according to its rules, then using Algorithm 1 we have:

\[
\begin{aligned}
\vec{c}(6) &= \Big\langle \frac{\tfrac{8}{5} * 5 + 1}{6},\ \frac{\tfrac{17}{5} * 5 + 3}{6} \Big\rangle = \langle 3/2,\ 10/3 \rangle, \\
d^2(6) &= d^2(5) + (1 + 1) = 12 + 2 = 14, \\
D^2(6) &= \frac{d^2(6)}{6 * (6-1)} = 7/15, \\
d_{cc}^2 &= |\langle 1 - 8/5,\ 3 - 17/5 \rangle|^2 = 2/5, \\
r(6) &= \frac{2/5}{7/15} = 6/7, \\
a(6) &= \tanh\big(r(6)\big) \approx 0.69.
\end{aligned}
\]
We consider several evaluation metrics to compare the performance of our proposed model with that of the frequency based anomaly detection model: Detection Rate (also called true positive rate, recall, or hit rate in the literature), False Alarm Rate (or false positive rate) [120], Precision (or positive predictive value), and the F1 score (also F-score or F-measure). Detection Rate refers to the ratio of correctly alarmed risks to the total number of actual risks, including those incorrectly labeled as normal data points. False Alarm Rate refers to the ratio of normal data points incorrectly alarmed as risks to the total number of normal data points. The F1 score is a measure of a test's accuracy (see the definitions in (3.11), (3.12), (3.13) and (3.14)).
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{3.11} \]

\[ \text{True Positive Rate} = \frac{TP}{TP + FN} \tag{3.12} \]

\[ \text{False Alarm Rate} = \frac{FP}{FP + TN} \tag{3.13} \]

\[ F1 = \frac{2\,TP}{2\,TP + FP + FN} \tag{3.14} \]
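These four metrics are direct functions of the confusion-matrix counts; a minimal sketch (the tn value in the usage line is our own illustrative assumption, not a figure from the experiments):

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Precision, true positive (detection) rate, false alarm rate,
    and F1 score per equations (3.11)-(3.14)."""
    precision = tp / (tp + fp)
    tpr = tp / (tp + fn)            # detection rate / recall
    far = fp / (fp + tn)            # false alarm rate
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, tpr, far, f1

# E.g. 1 detected attack, 4 false alarms, no missed attacks, 1043 true negatives:
# precision 0.2, detection rate 1.0, F1 = 1/3 (cf. the first row of Table 3.5).
p, tpr, far, f1 = evaluation_metrics(tp=1, fp=4, tn=1043, fn=0)
```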
3.4. Experiments and Analysis
This section briefly reports the results of experiments comparing the proposed model to a frequency based model. We demonstrate the learning capacity and the impact of parameters on time and memory utilization. The frequency based anomaly detection model was introduced in an earlier section.
3.4.1. Dataset
In 1998, 1999 and 2000, the MIT Lincoln Laboratory [57] conducted a comparative evaluation of intrusion detection systems (IDSs) developed under DARPA funding. This effort examined Internet traffic at Air Force bases, simulated on a test network. The idea was to generate a set of realistic attacks, embed them in normal data, and evaluate the false alarm and detection rates of systems on these data, in order to drive performance improvements of existing IDSs [57]. We use the DARPA dataset as a test case for our proposed model.
In order to extract information from the DARPA tcpdump datasets, the TcpTrace utility software [102] was used. This preprocessing procedure was applied to TCP connection records; ICMP and UDP packets were ignored. The new feature list obtained from the raw tcpdump data using the TcpTrace software is presented in Table 3.2. The preprocessed dataset is structured into nine features, where each feature denotes the statistical count of network traffic within a fixed time interval.
Table 3.2: The extracted features from raw tcpdump data using tcptrace software
Extracted Relevant Features
Name Description
IIN The number of packets flowing from inside to inside network
ION The number of packets flowing from inside to outside network
IDN The number of packets flowing from inside to DMZ network
OON The number of packets flowing from outside to outside network
OIN The number of packets flowing from outside to inside network
ODN The number of packets flowing from outside to DMZ network
DDN The number of packets flowing from DMZ to DMZ network
DIN The number of packets flowing from DMZ to inside network
DON The number of packets flowing from DMZ to outside network
Preprocessed network traffic statistics are gathered every 10 seconds for investigation. The two attack-free weeks of the DARPA 1999 dataset (the first and third weeks) are used as training data, and the DARPA 2000 dataset, which contains DDoS attacks, is used as test data. We obtained 20270 rows from the first week and 21174 rows from the third week to create the normal dataset used for modeling. The DARPA 2000 dataset, which contains attacks, has 1048 rows. Figure 3.1 shows the DARPA 2000 data profile with attacks.
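In essence, the nine features of Table 3.2 are per-interval counts of packets keyed by (source zone, destination zone). The following is a hedged sketch of that aggregation; the record layout and names are ours, not TcpTrace's actual output format:

```python
from collections import Counter

def windowed_counts(packets, window=10):
    """Aggregate (timestamp, src_zone, dst_zone) packet records into
    per-window feature counters like IIN, ION, ... in Table 3.2.

    packets: iterable of (timestamp_seconds, src_zone, dst_zone),
             zones being 'I' (inside), 'O' (outside) or 'D' (DMZ).
    Returns {window_start: Counter({'IO': n, ...})}.
    """
    windows = {}
    for ts, src, dst in packets:
        start = int(ts // window) * window      # 10-second bucket start
        windows.setdefault(start, Counter())[src + dst] += 1
    return windows

# Example: three packets, all inside the first 10-second interval.
counts = windowed_counts([(0.5, 'I', 'O'), (3.2, 'I', 'O'), (9.9, 'O', 'I')])
# counts[0]['IO'] == 2, counts[0]['OI'] == 1
```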
3.4.2. Experiments
Now we present the performance experiments that compare the two models using derivations from the confusion matrix. Table 3.3 gives the legend used in this section for quick reference. The experiment results show that the frequency based anomaly detection model detects the attack after being trained on the first week of data. However, the side effect is a high false alarm rate and low precision. Training additionally on the third week lowers the false alarm rate and raises the precision by 5 percentage points. Tables 3.5 and 3.6 provide the detection and false positive rates after training on the first week and after continuing training with the third week. The threshold used is 0.8 with Jaccard clustering.
Figure 3.1: Logarithm of traffic volume shows the DDoS attacks
Table 3.4 shows the number of states created in the EMM using the first and third weeks of the DARPA 1999 normal dataset. As we can see, the number of EMM nodes or states differs only slightly. This demonstrates the learning capability of the EMM, although exhaustive learning is not possible. This observation is consistent with [76], which reported a sublinear growth rate of the number of EMM states. Also, compared with the size of the dataset, the number of EMM states is low in all cases in the table, which indicates the efficiency of the model. We can also see that different similarity measures with different threshold values yield different numbers of EMM nodes or states in the modeling process. Thus the selection of threshold values impacts memory usage and time utilization.

To conclude, the proposed risk leveling model lowers the false alarm rate compared
Table 3.3: Legend used in the performance evaluation with derivations from the confusion matrix.

  Legend of performance experiments
  Name   Description
  NOA    Number of observable attacks
  NA     Number of alerts
  NTAD   Number of true attacks detected
  P      Precision
  TPR    True positive rate
  FAR    False alarm rate
  F1     F-measure
with the frequency based anomaly detection model and keeps a high detection rate in the
test cases. The approach is efficient, incremental and scalable.
3.5. Chapter Summary
This chapter presents a novel data mining technique to detect traffic based network intrusions. Our proposed technique takes both frequency and data deviation into account in an efficient, incremental, and scalable anomaly detection model. The performance experiments support our assumption that traffic related network intrusions are accompanied by data deviation. The technique is suitable for online processing.

There are several directions for future research. These include the design of models incorporating signatures previously determined to be risks, investigation of correlations among the parameters, and exploration of the feasibility of the model for dynamic datasets in grid computing environments.
Table 3.4: Impacts of clustering thresholds and selection of similarity measures

  Parameter Analysis
  Normal Dataset                            Threshold
  DARPA 1999        Sim        0.70         0.80         0.90          0.99
  First week        Jaccard     148          298          855          7794
                    Dice         72          120          372          5033
                    Cosine       13           21           59          1298
                    Overlap       6           10           11            38
  Third week        Jaccard     181          367         1124         11820
                    Dice         84          145          449          7222
                    Cosine       13           22           63          1702
                    Overlap       6           10           11            42
  Diff between      Jaccard  33 (18.23%)  69 (18.8%)   269 (23.93%)  4026 (34.1%)
  first & third     Dice     12 (14.3%)   25 (17.24%)  77 (17.15%)   2189 (30.74%)
  weeks             Cosine   0%           1 (4.55%)    4 (6.35%)     404 (23.74%)
                    Overlap  0%           0%           0%            4 (9.52%)
Table 3.5: Detection rate and false alarm rate using the frequency based anomaly detection model

  Performance of the frequency based anomaly detection model
  Setting                  NOA  NA  NTAD  P     TPR  FAR       F1
  First Week Dataset       1    5   1     0.2   1    0.00382   0.3333333
  With Third Week Dataset  1    4   1     0.25  1    0.002865  0.4
Table 3.6: Detection rate and false alarm rate using the risk leveling anomaly detection model

  Performance of the risk leveling anomaly detection model
  Setting                  NOA  NA  NTAD  P  TPR  FAR  F1
  First Week Dataset       1    1   1     1  1    0    1
  With Third Week Dataset  1    1   1     1  1    0    1
Chapter 4
A COMPARATIVE STUDY OF OUTLIER DETECTION ALGORITHMS
In the previous chapter, we studied a new anomaly detection model based on the Extensible Markov Model, a spatiotemporal model that can be used to detect outliers in data streams. In this chapter¹, we study EMM's outlier detection performance on different real life datasets and test it against two spatial outlier detection models.
4.1. Introduction
Data mining is the process of extracting interesting information from large sets of data. Outliers are defined as events that occur very infrequently. Detecting outliers before they escalate, with potentially catastrophic consequences, is very important for various real life applications such as fraud detection, network robustness analysis, and intrusion detection. This chapter presents a comprehensive analysis of three outlier detection methods: the Extensible Markov Model (EMM), the Local Outlier Factor (LOF) and LSC-Mine. In the algorithm analysis section we present a time complexity analysis and the outlier detection accuracy. The experiments conducted with the Ozone Level Detection, IR video trajectory, and 1999 and 2000 DARPA DDoS datasets indicate that EMM outperforms both LOF and LSC-Mine in both time and outlier detection accuracy. Recently, outlier detection has gained an enormous amount of attention and has become one of the most important problems in many industrial and financial applications. Supervised and unsupervised learning techniques are the two fundamental approaches to the problem of outlier detection.

¹ This work has been published in the International Conference on Machine Learning and Data Mining, MLDM, 2009 [26].

Supervised
learning approaches build models of normal data and detect deviations from the normal model in observed data. The advantage of these types of outlier detection algorithms is that they can detect new types of activity as deviations from normal usage. In contrast, unsupervised outlier detection techniques identify outliers without using any prior knowledge of the data. It is essential for outlier detection techniques to detect sudden or unexpected changes in existing behavior as soon as possible. Consider, for example, the following three scenarios:
1. A network alarm is raised indicating a possible attack. The associated network traffic deviates from normal network traffic. The security analyst discovers that the enormous traffic is produced not from the Internet, but from the local area network (LAN). This scenario is characterized as the zombie effect in a Distributed Denial of Service (DDoS) attack [120], where the LAN is utilized in the DDoS attack to deny services to a targeted network. It also means that the LAN was compromised long before the discovery of the DDoS attack. Computer systems in a LAN provide services that correspond to certain types of behavior; if a new service is started without the system administrator's permission, then it is extremely important to raise an alarm and discover the suspicious activity as soon as possible in order to avoid disaster.
2. Video surveillance [121] is frequently encountered in commercial, residential or military buildings. Finding outliers in the video data involves mining massive, automatically collected surveillance video databases to retrieve the shots containing independently moving targets. The environment in which such a system operates is often very noisy.
3. Today it is not news that the ozone layer is getting thinner and thinner [70]. This is harmful to human health, and affects other important parts of our daily life, such as farming and tourism. Therefore an accurate ozone alert forecasting system would facilitate the issuance of warnings to the public at an early stage, before ozone reaches a dangerous level.
One recent approach to outlier detection, the Local Outlier Factor (LOF) [78], is based on the density of data close to an object. This algorithm has proven to perform well, but suffers from some performance issues. In this chapter we compare the performance of LOF and one of its extensions, LSC-Mine [72], to that of our previously proposed modeling tool, the Extensible Markov Model (EMM) [120]. This comparative study examines these three outlier detection algorithms and reports their time and detection performance. The Extensible Markov Model (EMM) is a spatiotemporal modeling technique that interleaves a clustering algorithm with a first order Markov chain (MC) [82], where at any point in time EMM can provide a high level summary of the data stream. The Local Outlier Factor (LOF) [78] is an unsupervised density-based algorithm that assigns to each object a degree of being an outlier. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. LSC-Mine [72] was designed to overcome the disadvantages of the earlier LOF technique.
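To make the comparison concrete, the LOF score computation can be stated compactly. The following is our own simplified sketch of Breunig et al.'s definitions, not the code used in the experiments:

```python
import numpy as np

def lof_scores(X, k):
    """Local Outlier Factor: ratio of the average local reachability
    density (lrd) of a point's k nearest neighbors to its own lrd.
    Scores near 1 indicate inliers; scores well above 1 indicate outliers."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                      # exclude self-distances
    knn = np.argsort(D, axis=1)[:, :k]               # k nearest neighbors
    k_dist = D[np.arange(n), knn[:, -1]]             # k-distance of each point
    # reachability distance: reach(p, o) = max(k_dist(o), d(p, o))
    reach = np.maximum(k_dist[knn], D[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)                   # local reachability density
    return lrd[knn].mean(axis=1) / lrd               # LOF score per point

# A tight cluster plus one isolated point: the last point's score is large.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [10, 10]]
scores = lof_scores(X, k=3)
# scores[-1] >> 1, while the clustered points score close to 1
```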
4.1.1. Extensible Markov Model
The Extensible Markov Model (EMM) [76] has the advantage of using distance-based clustering for spatial data as well as the Markov chain for temporality. As shown in our previous work [28], EMM achieves efficient modeling by mapping groups of closely located real world events to states of a Markov chain; EMM is thus an extension of the Markov chain. EMM uses clustering to obtain representative granules in the continuous data space. By providing a dynamically adjustable structure, EMM is applicable to data stream processing when the number of states is unknown in advance, and it provides a heuristic modeling method for data that approximately satisfy the Markov property. The nodes in the graph are clusters of real world states, each of which is a vector of sensor values, for example from a flood level sensor in a river bend that continuously feeds values, creating a data stream. The EMM defines a set of formalized procedures such that at any
time t, EMM consists of a Markov Chain (MC) [13] and algorithms to modify it, where
algorithms include:
1. EMMCluster defines a technique for matching between input data at time t + 1 and existing states in the MC at time t. This is a clustering algorithm which determines whether the new data point or event should be added to an existing cluster (MC state) or whether a new cluster (MC state) should be created. A distance threshold th is used in clustering. For more details see Algorithm 2.

2. EMMIncrement updates (as well as adds, deletes, and merges) the MC at time t + 1, given the MC at time t and the output of EMMCluster at time t + 1. For more details see Algorithm 3.
3. EMMapplications are algorithms which use the EMM to solve various problems. To
date we have examined EMM for prediction (EMMPredict) [76] and anomaly (rare,
outlier event) detection (EMMRare) [120].
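A toy rendering of how EMMCluster and EMMIncrement interleave is given below. This is our own simplification, assuming Euclidean distance with a fixed threshold th, node centroids fixed at the first point seen, and no deletion or merging:

```python
import numpy as np

class EMM:
    """Toy Extensible Markov Model: threshold clustering (EMMCluster)
    interleaved with Markov-chain count updates (EMMIncrement)."""

    def __init__(self, th):
        self.th = th              # distance threshold for EMMCluster
        self.centroids = []       # LS_i for each node
        self.cn = []              # CN_i: occurrence count per node
        self.links = {}           # (i, j) -> transition count
        self.current = None       # index of the current node

    def insert(self, point):
        point = np.asarray(point, dtype=float)
        # EMMCluster: match to the nearest existing node within threshold ...
        if self.centroids:
            d = [np.linalg.norm(point - c) for c in self.centroids]
            i = int(np.argmin(d))
        if not self.centroids or d[i] > self.th:
            self.centroids.append(point)      # ... or create a new node
            self.cn.append(0)
            i = len(self.cn) - 1
        # EMMIncrement: update node and transition counts.
        self.cn[i] += 1
        if self.current is not None:
            key = (self.current, i)
            self.links[key] = self.links.get(key, 0) + 1
        self.current = i

# Three points: two fall into one node (with a self-loop), one opens a new node.
emm = EMM(th=0.5)
for p in ([0, 0], [0.1, 0], [5, 5]):
    emm.insert(p)
```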
Throughout this chapter, EMM is viewed as a directed graph with nodes and links. Link and transition are used interchangeably to refer to a directed arc; node, state, and cluster are used interchangeably to refer to a vertex in the EMM. EMMCluster and EMMIncrement are used to model the data. The EMMapplications are used to perform applications based on the synopsis created in the modeling process. The synopsis includes information about the cluster features [107] and the transitions between states. The cluster feature defined in [107] includes at least a count of occurrences, CN_i (the count on the node), and either a medoid or centroid for that cluster, LS_i. To summarize, the elements of the synopsis of an EMM are listed in Table 1. The frequency based anomaly detection [76], one of the several applications of EMM, is used for comparison with the LOF and LSC-Mine algorithms. The idea for outlier detection comes from the fact that the learning aspect of EMM dynamically creates a Markov chain and captures past behavior stored in the synopsis. No
input into the model identifies normal or abnormal behavior; instead, this is learned based on the statistics of occurrence of transitions and states within the generated Markov chain. By learning what is normal, the model can predict what is not. The basic idea is to define a set of rules related to the cardinalities of clusters and transitions to judge outliers. An outlier is