NEW OUTLIER DETECTION
TECHNIQUES FOR DATA STREAMS
Approved by:
Dr. Michael Hahsler
Dr. Margaret H. Dunham
Dr. Sukumaran Nair
Dr. Jeff Tian
Dr. Ping Gui
NEW OUTLIER DETECTION
TECHNIQUES FOR DATA STREAMS
A Dissertation Presented to the Graduate Faculty of the
Bobby B. Lyle School of Engineering
Southern Methodist University
in
Partial Fulfillment of the Requirements
for the degree of
Doctor of Philosophy
with a
Major in Computer Science
by
Charlie Isaksson
(M.S.C.S., Mid Sweden University, 2006)
December 17, 2016
ACKNOWLEDGMENTS
I am truly humbled and grateful to the many individuals who have supported and
encouraged me over the past nine years to fulfill my biggest dream. Dr. Margaret H.
Dunham and Dr. Michael Hahsler have been my two mentors and friends throughout this
rewarding journey. I would like to extend special thanks to Dr. Hahsler for helping me find
my path back; I deeply appreciate his background knowledge and patience.
I would like to extend my gratitude to the faculty and staff members in the Department
of Computer Science and Engineering at Southern Methodist University.
To the other members of my dissertation committee, Dr. Sukumaran Nair, Dr. Jeff Tian,
and Dr. Ping Gui: thank you for all the feedback and the patience you had with me. I
wholeheartedly enjoyed the challenge of researching an issue that is currently critical
for various industries.
Finally, a special recognition goes out to my family and friends who supported and
encouraged me during my pursuit of the doctorate in computer science. Thanks to my kids
for giving me the strength to keep going. I love you more than you will ever know.
Isaksson, Charlie M.S.C.S., Mid Sweden University, 2006
New Outlier Detection
Techniques For Data Streams
Advisor: Professor Michael Hahsler
Doctor of Philosophy degree conferred December 17, 2016
Dissertation completed November 9, 2016
The availability and reliability of data have become essential in our modern society. In
fact, it has become critical in every domain to maintain high-quality data even though that
data may originate at high velocity and in large quantities. Today it is well understood that
data enables businesses to achieve their full potential by providing valuable insights into
their business as well as potentially offering them an advantage over their competitors. To
achieve such a goal requires a significant investment in both big data infrastructure and data
mining capabilities. Data mining is the process of finding hidden patterns within a large
dataset. Imperative to data mining is the ability to detect outliers, data points that deviate
from the rest of the data, because outliers can dramatically alter the result of the analysis.
Although outliers occur infrequently, they are hard to identify since they have many
potential sources (such as human error, machine error, environmental variations, and
faulty sensors). Finding outliers in large datasets requires extremely efficient outlier
detection techniques. Detecting outliers within a data stream is even harder, since
streaming imposes a single-pass restriction and data often arrive at a very fast rate. Also,
streaming data may contain redundant information, which can reduce outlier detection
performance and efficiency. To avoid this redundancy while maintaining the correctness
of the data, it
becomes necessary to summarize the data stream. The Extensible Markov Model (EMM)
has been proven to be a good candidate for meeting these requirements to detect outliers in
data stream applications. EMM uses data stream clustering models and takes into account
temporal and ordering aspects using a Markov Chain (MC), a powerful temporal model that
allows studying a complex system and making predictions about events. The Extensible
Markov Model is a time-varying MC that can learn, dynamically adapt its structure to the
environment, and update its state transition probabilities based on the incoming data.
The model generated by EMM allows analysis of a particular time frame
as an MC, and, as time passes, this model will continue to adapt, evolve, and learn with the
ongoing data stream. This is due to the close coupling of the clustering model with an MC
model. Combining these two models delivers a spatiotemporal model that satisfies all the
requirements from a data stream (big data) infrastructure standpoint. In this dissertation,
the data pattern finding capability of EMM has been extended in several ways. Firstly, a
sophisticated mining task on the synopsis is investigated to detect Distributed Denial of
Service (DDoS) network intrusions. A performance study then compares different outlier
detection techniques with EMM, and this leads to two additional extensions that further
improve EMM's performance. SO-Stream, a new self-organizing cluster
structure that allows the algorithm to obtain the threshold for each micro-cluster dynami-
cally, is proposed, and then SO-Stream is extended by integrating a Markov Model (MM).
The new algorithm is called Adaptive Streaming Markov Model (ASMM), which is de-
signed to handle concept drift, spatiotemporal outliers, and high volume and velocity data
streams while preserving higher accuracy and cluster quality. The dissertation concludes
with directions for future work, including distributed ASMMs that can be integrated into
big data frameworks, ASMMs for telecom applications, and a visualization technique for
multidimensional data that is greatly needed for better interpretation of outlier models.
TABLE OF CONTENTS
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
CHAPTER
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2. Focus of the Dissertation and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3. Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1. Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2. Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3. Outlier Detection in Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4. Spatiotemporal Outlier Detection in Data Stream . . . . . . . . . . . . . . . . . . . . . . . . 13
3. RISK LEVELING OF NETWORK TRAFFIC ANOMALIES . . . . . . . . . . . . . . . . . 17
3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2. Related Work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4. Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1. Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5. Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4. A COMPARATIVE STUDY OF OUTLIER DETECTION ALGORITHMS . . . . 36
4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.1. Extensible Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.2. Density Based Local Outliers (LOF Approach) . . . . . . . . . . . . . . . . . . 41
4.1.3. Density Based Local Outliers (LSC-Mine Approach) . . . . . . . . . . . . 41
4.2. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1. Time Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.2. Experiments on Real Life Data and Synthetic Datasets . . . . . . . . . . . 45
4.3. Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5. SO-STREAM: SELF ORGANIZING DENSITY-BASED CLUSTERING OVER DATA STREAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.2. Related Work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.1. SOStream Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3.2. Density-Based Centroid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.3. SOStream Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3.4. Online Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.1. Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.2. Real-World Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.3. Parameter Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4.4. Scalability and Complexity of SOStream . . . . . . . . . . . . . . . . . . . . . . . . 76
5.5. Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6. ASMM: DETECTING SPATIO-TEMPORAL OUTLIERS WITH ADAPTIVE STREAMING MARKOV MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.2. Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2.1. Extensible Markov Model Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2.2. Adaptive Streaming Markov Model Algorithm . . . . . . . . . . . . . . . . . . 90
6.2.3. EMMRare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.3. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3.1. Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3.2. Parameter Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.3.3. Scalability and Complexity of ASMM . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.4. Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7. CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.1. Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.2. Directions for Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
APPENDIX
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
LIST OF FIGURES
Figure Page
2.1 A classification of outlier detection techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 A workflow from a traditional spatiotemporal outlier detection framework. . . 14
2.3 The workflow from EMM outlier detection framework. . . . . . . . . . . . . . . . . . . . . 16
3.1 Logarithm of traffic volume shows the DDoS attacks . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Advantages of the LOF approach. Modified from [78] . . . . . . . . . . . . . . . . . . . . . 42
4.2 Run time for LOF, LSC-Mine, and EMM with MinPts = 20 and EMM threshold = 0.99. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.1 (a) Data points of a stream with 5 overlapping clusters and (b) SOStream's capability to distinguish overlapping clusters. For visualizing the cluster structure, we do not utilize fading or merging. . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 SOStream clustering quality, horizon = 1K, stream speed = 1K. The quality evaluation for MR-Stream and D-Stream is retrieved from [68]. . . . . . . . . . . 73
5.3 SOStream memory cost over the length of the data stream. The memory evaluation for MR-Stream is retrieved from [68]. . . . . . . . . . . . . . . . . . . . . . 76
5.4 SOStream execution time using the high-dimensional KDD CUP'99 dataset with 34 numerical attributes. The data sampling rate is every 25K points. . . . . 77
6.1 Example of an EMM directed graph. . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Basic example that shows the high-level operations of ASMM. . . . . . . . . . . 90
6.3 The sensors were arranged in the lab according to the above diagram. Obtained from [88]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.4 Subplots from the time period [8398:9000]. It is evident that humidity suffers from spatial outliers. However, due to the large data size, we are unable to display temporal outliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.5 Subplots from the normalized Server System Health dataset. The highlighted red areas include both the spatial and the temporal outliers. . . . . . . . . . . . . 110
6.6 Distribution of ASMM's cluster counts for different buffer sizes on the KDD CUP'99 data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.7 Cluster sizes decrease with increased buffer size. The number of clusters stabilizes between buffer sizes 15 and 35. . . . . . . . . . . . . . . . . . . . . . . . . 117
6.8 ASMM and EMM memory cost over different threshold values using the KDD CUP'99 dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.9 ASMM and EMM execution time using the high-dimensional KDD CUP'99 dataset. . 117
LIST OF TABLES
Table Page
3.1 Notations of EMM Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 The extracted features from raw tcpdump data using tcptrace software . . . . . . 32
3.3 Legend used in the performance evaluation, with derivations from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Impacts of clustering thresholds and selection of similarity measures . . . . . . . 35
3.5 Detection rate and false alarm rate using the frequency-based anomaly detection model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Detection rate and false alarm rate using the risk-leveling anomaly detection model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 EMM detection and false positive rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 LOF detection and false positive rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 LSC-Mine detection and false positive rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 EMM, LOF and LSC-Mine detection and false positive rates using PCA. . . . . 54
4.5 EMM, LOF and LSC-Mine detection and false positive rates. . . . . . . . . . . . . . . 55
5.1 Feature comparison between different data stream clustering algorithms. . . . . 58
5.2 Comparing average purity for different MinPts for α = 0.1. . . . . . . . . . . . . . . . . 75
5.3 Comparing average purity for different MinPts for α = 0.3. . . . . . . . . . . . . . . . . 75
5.4 Improvement of SOStream compared to MR-Stream and D-Stream. . . . . . . . 75
6.1 Legend for the confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 ASMM's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3 EMM's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . 107
6.4 LOF's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.5 ASMM's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . 108
6.6 EMM's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . 109
6.7 LOF's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.8 ASMM's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . 112
6.9 EMM's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . 112
6.10 LOF's outlier detection results over different threshold values and different measures from the confusion matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Dedicated to the Almighty Creator, the Most Gracious, the Most Merciful.
Chapter 1
INTRODUCTION
The availability and reliability of data have become crucial factors in modern society.
One important task in any application domain is to detect abnormal data. Outlier
detection is extensively used in a wide variety of applications such as fraud detection in
the banking system, intrusion detection in network security, unusual behavior in military
surveillance, and the detection of tumors in MRI images. Outliers are defined as data
points that occur very infrequently and/or lie far from the expected values. It is crucial to
investigate outliers because they may contain valuable information regarding the process
under investigation. One should ask why such data points have occurred, and whether
similar points will continue to appear, before deciding to remove them from the dataset
prior to training models. Statisticians have researched the problem of outlier detection
since the early nineteenth century [49]. Many techniques have been proposed
for outlier detection. Out of those techniques, some are specifically designed to suit certain
application domains while others are more generic. The presence of outliers in data may
carry important information. For example, an anomaly in digital photography may indicate
that a terrorist is using steganography to hide messages in the low-order bits of a digital
photograph, in either plaintext or ciphertext form, to disguise it from adversaries [34].
Similarly, outliers in Magnetic Resonance Imaging (MRI) may identify pixels that are
significantly different between two MRI scans and thereby indicate the presence of brain
tumors [54]. Furthermore, an abnormal pattern in network traffic may signal an alarm of
intrusion, which may indicate that a compromised server is sending out unauthorized
information [87]. Other examples include outliers in credit card transactions, which may raise
attention to credit card theft [39], or interruptions in continuous signals from an airplane
to the ground, where inconsistent data act as outliers and may lead to accidents.
Outliers may arise due to several reasons such as intrusion, human error, machine error,
and changes in the behavior of the system. Because of all these various causes, outliers are
difficult to detect. For example, defining a normal region that includes all possible
behaviors is very problematic. Furthermore, it is difficult to set a precise boundary
distinguishing outlier and normal behavior. This may result in a case where an outlier
lying close to the boundary may be predicted as normal, or, on the contrary, normal data
lying close to the boundary may be identified as an outlier. There are also cases where
malicious actions result in outliers. Malicious actors may try to adapt their behavior so
that the resulting outliers appear normal, thereby making it difficult to distinguish
between normal and malicious behavior. In addition to this, it becomes difficult to detect,
distinguish, and remove data consisting of noise, which may appear to be similar to the real
outliers.
1.1. Motivation
Existing outlier detection techniques have been effective in either space or time but not
both. The Extensible Markov Model (EMM) [76] is a spatiotemporal algorithm that has
been successfully used in diverse fields [77, 28, 120, 118, 119]. EMM has proven to be a
powerful algorithm that can manage space and time very efficiently as well as adapt to
continuous changes in the environment in a scalable manner. When EMM is used to pro-
cess data, it will dynamically construct a codebook based on the input data. The codebook
consists of a set of model vectors representing typical vectors within the dataset. EMM
creates new entries in the codebook based on a fixed threshold when the input data does
not map to an existing cluster. The use of a spatiotemporal data mining algorithm like
EMM allows continuous assessment and is capable of both tracking changes over time and
determining whether or not that particular change is probable based on a normal or abnor-
mal pattern. Other outlier detection algorithms that are based only on clustering would be
incapable of establishing such relationships because they lack the temporal model.
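To make the codebook mechanism concrete, here is a toy sketch of the idea (not the authors' implementation: the class name, the use of Euclidean distance, and the threshold value are illustrative assumptions). New input either maps to the nearest existing cluster or, if it is farther than the fixed threshold, creates a new codebook entry, while a Markov chain counts transitions between clusters:

```python
import math

class TinyEMM:
    """Toy sketch of the EMM idea: a fixed-threshold nearest-centroid
    codebook coupled with Markov-chain transition counts."""

    def __init__(self, threshold):
        self.threshold = threshold  # fixed distance threshold
        self.centroids = []         # the codebook: one vector per cluster
        self.transitions = {}       # (from_state, to_state) -> count
        self.current = None         # current Markov-chain state

    def process(self, point):
        # find the nearest codebook entry, if any
        state, dist = None, float("inf")
        for i, c in enumerate(self.centroids):
            d = math.dist(point, c)
            if d < dist:
                state, dist = i, d
        # no entry is close enough: grow the codebook with a new cluster
        if state is None or dist > self.threshold:
            self.centroids.append(list(point))
            state = len(self.centroids) - 1
        # record the Markov-chain transition from the previous state
        if self.current is not None:
            key = (self.current, state)
            self.transitions[key] = self.transitions.get(key, 0) + 1
        self.current = state
        return state

emm = TinyEMM(threshold=1.0)
for p in [(0, 0), (0.1, 0.2), (5, 5), (0, 0.1), (5.1, 4.9)]:
    emm.process(p)
print(len(emm.centroids))  # 2: clusters form around (0, 0) and (5, 5)
```

An outlier test can then inspect the transition counts: a point whose arrival traverses a rarely or never observed transition is a candidate spatiotemporal outlier.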
1.2. Focus of the Dissertation and Conclusions
This work presents several innovative data mining models for outlier detection based on
Extensible Markov Model (EMM) [76] which combine the spatiotemporal data modeling
with data streams. EMM is composed of two core features: modeling and pattern-finding
capabilities. The modeling component in EMM is used to group related data points into
clusters. EMM combines a clustering model with a Markov Model (MM). Many algorithms
are proposed to extend EMM, and their performance is discussed for a large number of
datasets.
The main contributions of this work can be summarized as follows.
1. Risk leveling of network traffic anomalies: A real-world application exploring sophisticated mining tasks. The false alarm rate is used for performance evaluation.
Discussed in Chapter 3.
2. Comparison of EMM with other state-of-the-art outlier detection techniques: LOF
and LSC-Mine. The comparison is based on accuracy and runtime complexity. The
research indicates that EMM outperformed the other two techniques in several cases.
However, EMM suffered from a critical issue concerning the clustering component
using a fixed threshold. Discussed in Chapter 4.
3. For the EMM algorithm to work efficiently, it is imperative that the threshold be set
to the correct value. The threshold is the static parameter that determines whether a
new event belongs to an existing cluster or if a new cluster should be created. If this
parameter is set to an unsuitable value, the algorithm will either create too many
clusters and suffer from overfitting, or create too few clusters, resulting in unstable
classification. The next step is to give EMM an adaptable threshold. The Self-Organizing
Map (SOM) [105] is an unsupervised algorithm that does not use a fixed threshold;
instead, it creates an approximate output space with randomly assigned weights and,
depending on the incoming data, adjusts the neighbors of the winning weight to be
closer to it. SO-Stream is a proposed clustering algorithm that dynamically self-organizes
its structure without the use of a fixed threshold. SO-Stream is designed specifically
for data streams, so its performance was tested against two popular stream clustering
algorithms, MR-Stream and D-Stream, on different real-world and synthetic datasets.
SO-Stream outperformed the other two techniques
concerning cluster purity, memory, and runtime complexity. SO-Stream can identify
highly overlapping clusters, and SO-Stream operations (i.e. create, remove, merge,
and fade) are completely online. Discussed in Chapter 5.
4. EMM's two components, clustering and the MM, are tightly coupled, and data points
have to be processed in order. We propose a new algorithm, ASMM, that utilizes
offline and online components to decouple these two elements. ASMM can handle
points arriving out of order while maintaining the original order of the incoming data
points. SO-Stream utilizes an offline component to initialize the clustering model,
and then the initial model is efficiently incremented with the online component. The
offline component uses a buffering technique to add support for concept drift as well
as new patterns that may emerge from an evolving stream. ASMM's performance is
tested against two popular outlier detection algorithms, LOF and EMMRare, on a
large number of real-world and synthetic datasets. ASMM outperformed the other
two techniques in terms of different confusion matrix measures, memory, and runtime complexity.
Discussed in Chapter 6.
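The SOM-style adaptation mentioned in contribution 3, pulling the winning weight vector and its neighbors toward the input, can be sketched as a single competitive-learning step. This is illustrative only: the function name, learning rates, and neighborhood radius are assumptions, and SO-Stream's actual adaptive-threshold update differs.

```python
import math

def som_step(weights, x, winner_lr=0.5, neighbor_lr=0.1, radius=1.5):
    """One competitive-learning step: the weight vector closest to the
    input x (the winner) moves strongly toward x, and weights within
    `radius` of the winner move toward x with a smaller learning rate."""
    winner = min(range(len(weights)), key=lambda i: math.dist(weights[i], x))
    w_pos = list(weights[winner])  # winner position before any update
    for i, w in enumerate(weights):
        if i == winner:
            lr = winner_lr
        elif math.dist(w, w_pos) <= radius:
            lr = neighbor_lr  # neighbors move less than the winner
        else:
            continue          # weights outside the neighborhood stay put
        weights[i] = [wj + lr * (xj - wj) for wj, xj in zip(w, x)]
    return winner

weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
som_step(weights, [0.4, 0.4])  # moves weight 0 (the winner) and its neighbor 1
```

Because the winner and its neighborhood drift with the data, no global distance threshold has to be fixed in advance; this is the property that motivates the adaptive per-cluster threshold in SO-Stream.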
1.3. Organization of the Dissertation
The remainder of the dissertation is organized as follows: Chapter 2 presents the back-
ground work; Chapter 3 proposes a spatiotemporal technique for outlier detection in data
streams; Chapter 4 provides a comparative study of techniques to detect outliers; Chapter
5 presents a novel unsupervised clustering technique for data streams. Chapter 6 presents a
spatiotemporal model that extends SO-Stream with a temporal Markov Model, and Chap-
ter 7 concludes with an assessment of the viability of stream mining for outlier detection.
Please note that each chapter represents a previously published or submitted research
paper; for this reason, some concepts are introduced repeatedly in the introductory
sections of various chapters.
Chapter 2
BACKGROUND
In this chapter, the previous research related to spatiotemporal data stream mining is
addressed. Firstly, general techniques used in the areas of outlier detection are reviewed,
and then three important areas are discussed: data streams, outlier detection in data
streams, and spatiotemporal outlier detection in data streams.
2.1. Outlier Detection
Figure 2.1: A classification of outlier detection techniques. [Figure: a taxonomy with branches for clustering-based, nearest neighbor-based, statistical-based, classification-based (e.g., Support Vector Machine, Bayesian Network), and spectral decomposition-based (e.g., Principal Component Analysis) techniques.]
Different approaches and methodologies have been introduced to address the outlier/
anomaly detection problem; they include statistical approaches, supervised and unsuper-
vised learning techniques, neural networks and machine learning techniques (See Fig-
ure 2.1). We cannot provide a complete survey here but refer the interested reader to
available surveys [109], [71], [110]. We briefly mention some representative techniques¹.
The Grubbs method (extreme studentized deviate) [50] is a one-dimensional statistical
method in which all parameters are derived from the data; it requires no user parameters.
It calculates the mean and standard deviation of all attribute values and then calculates a
Z-score as the difference between the mean value of the attribute and the query value,
divided by the standard deviation of the attribute. The Z-score for the query is then
compared with a threshold corresponding to a 1% or 5% significance level.
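The Z-score computation just described can be sketched in a few lines (an illustrative sketch only: the function name is hypothetical, and the cutoff value 2.0 stands in for the critical value implied by the chosen significance level):

```python
import statistics

def z_score_outliers(values, cutoff=2.0):
    """Flag values whose Z-score exceeds a cutoff, as in the Grubbs-style
    test described above. The cutoff 2.0 is a hypothetical stand-in for
    the critical value derived from a 1% or 5% significance level."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / sd > cutoff]

data = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 25.0]  # 25.0 is an injected outlier
print(z_score_outliers(data))  # [25.0]
```

Note that the single extreme value inflates both the mean and the standard deviation, which is why the exact critical value (and, in the original test, repeated application after removal) matters in practice.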
An optimized k-NN approach was introduced by [97]. It produces a list of potential
outliers and their ranking. Naively, the entire distance matrix would need to be calculated
over all points, but the authors introduced a partitioning technique to speed up the k-NN algorithm.
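The distance-based ranking behind this approach can be sketched with a brute-force version (illustrative only: the function name is hypothetical, and the cited work's partitioning optimization is omitted):

```python
import math

def knn_outlier_ranking(points, k):
    """Rank points by the distance to their k-th nearest neighbor:
    the larger that distance, the more isolated (outlying) the point.
    Brute force: computes the full distance matrix the text mentions."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))  # k-th nearest neighbor distance
    return sorted(scores, reverse=True)   # most outlying point first

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(knn_outlier_ranking(points, k=2)[0][1])  # 4: the isolated point ranks first
```

The brute-force version costs O(n²) distance computations; the partitioning in [97] prunes partitions that cannot contain top-ranked outliers.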
Other outlier detection approaches are based on neural networks. Neural networks are
non-parametric models that require training and testing to determine the threshold before
they can identify outliers. Most of them suffer when the data has high dimensionality.
[10] defines novelties in time-series data for fault diagnosis in vibration signatures of
aircraft engines, and Bishop [22] monitors processes such as oil pipeline flows. Both
use a supervised neural network (multilayer perceptron), a feed-forward network with a
single hidden layer; hidden layers add neurons to the network architecture, building up
the ability to model highly complex nonlinear functions. The drawback is that an
increased number of neurons also increases the time the network needs to converge
during learning. [83] uses an auto-associative neural network, also a feed-forward
perceptron-based network, trained with supervised learning. [106] introduced a detection
technique for time-series monitoring based on the Adaptive Resonance Theory (ART) [51]
incremental unsupervised neural network.
An approach that works well with high-dimensional data uses decision trees, as in
[53] and [36], where a C4.5 decision tree detects outliers in categorical data to identify
unexpected entries in databases. They pre-select cases using the taxonomy from a
case-based retrieval algorithm to prune outliers and then use these cases to train the
decision tree. [103] and [104] introduced an approach that uses similarity-based matching
for monitoring activities.

¹This section has been published at the International Conference on Machine Learning and Data Mining (MLDM), 2009 [26].
The Local Outlier Factor (LOF) [24] algorithm detects outliers by measuring the local
deviation of a given data point with respect to its neighbors. LOF was designed for static
data, but if applied repeatedly, either periodically or every time a new data point arrives,
the algorithm can be adapted to data streams. Pokrajac [43] proposed an incremental LOF
algorithm in which the reachability distance, local reachability density (LRD), and LOF
values for each new data point are computed and those values for existing points are
updated, so outliers can be detected instantly.
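The static LOF of [24] can be sketched from scratch; the following is a minimal, brute-force illustration (the helper names and the choice k = 2 are ours, not from the cited papers), where scores near 1 indicate inliers and large scores indicate outliers:

```python
import math

def knn_info(points, i, k):
    """k-distance and k-distance neighborhood of point i (brute force)."""
    dists = sorted((math.dist(points[i], points[j]), j)
                   for j in range(len(points)) if j != i)
    k_dist = dists[k - 1][0]
    neighborhood = [j for d, j in dists if d <= k_dist]
    return k_dist, neighborhood

def lof_scores(points, k=2):
    """Local Outlier Factor for every point; scores near 1 are inliers."""
    n = len(points)
    k_dist, nbrs = {}, {}
    for i in range(n):
        k_dist[i], nbrs[i] = knn_info(points, i, k)

    def lrd(i):
        # local reachability density: inverse of the mean reachability distance
        reach = sum(max(k_dist[j], math.dist(points[i], points[j]))
                    for j in nbrs[i])
        return len(nbrs[i]) / reach

    dens = [lrd(i) for i in range(n)]
    return [sum(dens[j] for j in nbrs[i]) / (len(nbrs[i]) * dens[i])
            for i in range(n)]

# four tightly packed points and one far-away point
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = lof_scores(pts, k=2)
```

Re-running `lof_scores` over a sliding window is the naive stream adaptation described above; the incremental variant of [43] avoids the full recomputation by updating only the points affected by each insertion.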
2.2. Data Streams
The data mining community has provided many innovative technologies that address
different issues. One of these is data streams, a newer data mining area that involves data
that is continuous and potentially infinite. Such data is characterized by a high volume
arriving at high velocity. Storing it may be impractical, and even if the full volume were
stored, processing any particular record more than once may be infeasible. See [29] for a
detailed discussion of different streaming applications. Additionally, the characteristics of
streaming data may change over time (e.g., concept drift). Since data streams can be
viewed as time series, time series models were also considered. Traditional linear time
series models consist of three statistical components: autoregressive (AR), integrated (I),
and moving average (MA); the ARIMA model [23] integrates all of them. Data stream
mining has a single-pass restriction that makes traditional time series models impractical.
Thus, rather than using time series forecasting models to detect outliers, conventional
multidimensional models that account for temporal drift and deviations are used.
2.3. Outlier Detection in Data Streams
As new data arrives, data stream models need to update their structures to capture the
normal trends in the data. Outliers are then detected when new data causes a drastic
change in the model. Yamanishi and Takeuchi [61, 63] presented an online sequential
discounting algorithm that incrementally learns a probabilistic mixture model. The model
accounts for drift by using a decay factor, and it detects outliers by computing an outlier
score from the learned mixture model. Depending on the type of data (continuous or
categorical), different models were proposed. For categorical data, Sequentially
Discounting Laplace Estimation (SDLE) uses a Laplace smoothing function to compute a
probability score based on the occurrence frequency of a particular symbol divided by the
total number of data points; for every new data point, the model must update all of its
cells. Two models were proposed for continuous data, a Gaussian mixture model and a
time series model; both flag an anomaly if the model at time (t − 1) has changed after
adding a new data point at time t. Further research by Javitz [56] proposed updating the
normal distribution of the data by giving more weight to recent data.

An accepted solution for streaming data is to model or summarize related data points as
clusters, which avoids retaining the whole dataset. Clustering models in general can be
used to detect outliers: new data points are fit into existing clusters, and outliers are
detected when either a new data point does not fit into any cluster or the internal cluster
structure changes. Several popular clustering algorithms that can be used for outlier
detection are reviewed below. Aggarwal and Yu [30] proposed clustering as a method to
detect outliers. The k-means clustering method is often used because it allows reallocation
of samples even after assignment and converges quickly. The problem with basic k-means
is that the random initialization of cluster centers reduces its accuracy. Also, the values of
k (number of clusters) and t (number of iterations) are difficult to set in advance. To
counter this limitation, dynamic clustering approaches were proposed. Indeed, the
underlying structure in data stream clustering continues to evolve as time passes.
Detecting outliers, whether spatially or temporally, is particularly challenging: data points
analyzed at an early stage may be incorrectly viewed as outliers, although a new trend may
later emerge, and data points that are time-delayed may also falsely appear as outliers, so
techniques such as dynamic time warping may help uncover the true structure.
Accordingly, the dynamic nature of the data has motivated data mining researchers to
develop innovative technologies to meet these requirements.
For example, E-Stream [64] handles an evolving data stream by providing cluster
operations such as add, delete, split, and merge. The algorithm starts empty; at every time
step, based on a radius threshold, a new data point is either mapped into one of the
existing clusters or a new cluster is created around it. Any cluster that does not meet a
defined density level is considered inactive and remains isolated until it achieves the
desired weight. Cluster weights are decreased over time to reduce the influence of old data
points; this technique is known as a fading function, and a cluster that stays inactive for a
certain time period risks being deleted. Also, at each step a pair of clusters may be
merged, either because the overlap between the two clusters is sufficiently large or
because the maximum cluster limit has been reached. A cluster is split into two
sub-clusters if its internal data is heterogeneous: the split process builds one histogram
per active cluster, summarizing each data dimension into an α-bin histogram, and the split
is performed if a deep valley between two significant peaks is found.
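The fading function mentioned above can be illustrated with a small sketch; the decay rate and deletion threshold below are illustrative assumptions, not E-Stream's published parameters:

```python
FADE_LAMBDA = 0.1      # assumed decay rate
DELETE_WEIGHT = 0.5    # assumed deletion threshold

def faded_weight(weight, last_update, now, lam=FADE_LAMBDA):
    """Exponential fading: the weight decays by 2^(-lam * elapsed_time)."""
    return weight * 2 ** (-lam * (now - last_update))

# a cluster last touched at t=0 with weight 4.0
w10 = faded_weight(4.0, last_update=0, now=10)   # 4 * 2^-1 = 2.0
w40 = faded_weight(4.0, last_update=0, now=40)   # 4 * 2^-4 = 0.25
stale = w40 < DELETE_WEIGHT                      # candidate for deletion
```

A cluster that keeps receiving points has its weight refreshed, while an inactive one decays below the threshold and is removed, exactly the behavior the paragraph describes.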
CluStream [4] divides the clustering process into online and offline components. The
online micro-clustering component periodically stores detailed summary statistics of the
high-speed data stream, and the offline macro-clustering component uses these summary
statistics, together with user input, to give the user a quick understanding of the clusters
whenever required. This two-phase approach also gives the user the flexibility to explore
the nature of the evolution of the clusters over different time periods.

2 Some of these clustering algorithms are described in this section in more detail than in the originally published paper [27].
DenStream [25] discovers clusters of arbitrary shape in an evolving data stream by
maintaining two lists, one of potential micro-clusters and one of outlier micro-clusters.
Each time a new data point arrives, an attempt is made to merge it into the nearest
potential micro-cluster. If the radius of the resulting micro-cluster would be larger than a
specified radius, the merge is rejected, and another attempt is made to merge the point
with the nearest outlier micro-cluster. Again, if the resulting radius exceeds the specified
radius, the merge is rejected and a new outlier micro-cluster centered at the point is
created and added to the outlier list. If any outlier micro-cluster exceeds a specified
weight, it is moved into the potential micro-cluster list. Periodically, the outlier
micro-cluster list is checked and micro-clusters are either promoted into the potential list
or pruned.
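DenStream's insertion cascade can be sketched roughly as follows. This is a simplified illustration: micro-clusters are kept as raw point lists, and the radius bound `EPS` and promotion weight `BETA_W` are assumed values, not the paper's weighted, fading implementation:

```python
import math

EPS = 1.0      # assumed maximum micro-cluster radius
BETA_W = 3     # assumed weight needed to promote an outlier micro-cluster

class MicroCluster:
    def __init__(self, point):
        self.points = [point]

    @property
    def center(self):
        d = len(self.points[0])
        return tuple(sum(p[i] for p in self.points) / len(self.points)
                     for i in range(d))

    def radius_if_added(self, point):
        pts = self.points + [point]
        d = len(point)
        c = tuple(sum(p[i] for p in pts) / len(pts) for i in range(d))
        return max(math.dist(p, c) for p in pts)

    @property
    def weight(self):
        return len(self.points)

def insert_point(point, potential, outliers):
    """Try the nearest potential micro-cluster, then the nearest outlier one,
    else start a new outlier micro-cluster; promote heavy outlier clusters."""
    for group in (potential, outliers):
        if group:
            mc = min(group, key=lambda m: math.dist(m.center, point))
            if mc.radius_if_added(point) <= EPS:
                mc.points.append(point)
                if group is outliers and mc.weight >= BETA_W:
                    outliers.remove(mc)
                    potential.append(mc)
                return
    outliers.append(MicroCluster(point))

potential, outliers = [], []
for p in [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]:
    insert_point(p, potential, outliers)
```

After three nearby points, the outlier micro-cluster reaches the promotion weight and moves to the potential list; a far-away point would start a fresh outlier micro-cluster instead.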
OpticsStream [42] is an online visualization algorithm that produces a map representing
the clustering structure. It adds the ordering technique from OPTICS [81], which by itself
is not suitable for data streams, on top of a density-based stream algorithm such as
DenStream to better capture the cluster dynamics.
HPStream [5] is an online clustering algorithm that discovers distinct clusters based on
different subsets of the dimensions of the streaming data points. This is achieved by
maintaining, for each cluster, a d-dimensional vector that indicates which dimensions are
included in the continuous assignment of incoming data points to an appropriate cluster.
The algorithm begins by tentatively assigning the received data point to each of the
existing clusters; it then computes the radii, selects the dimensions with the smallest radii,
and creates the d-dimensional vector for each cluster. Next, the Manhattan distance is
computed from the incoming data point to the centroid of each existing cluster (with each
cluster's d-dimensional vector restricting the centroid to the selected dimensions). From
these distances the winning cluster is the one with the smallest average distance along the
included dimensions. The radius of the winning cluster is then computed and compared to
the winning distance; based on this comparison, either a new fading cluster centered at the
incoming data point is created, or the point is added to the winning cluster. Clusters are
also removed if they contain zero dimensions or if the number of clusters exceeds a
user-defined threshold.
WSTREAM [41] is a density-based algorithm that discovers cluster structure by
maintaining a list of rectangular windows that are incrementally adjusted over time. Each
window moves with the centroid of its cluster, and the centroid is incrementally
recomputed whenever new streaming data points are inserted into the window. Windows
can also incrementally contract or expand based on the window's approximated kernel
density and a user-defined bandwidth matrix, controlled by specified rules. When
windows overlap, the proportion of streaming data points in the intersection of the pair of
windows relative to the remaining points in each window is computed and compared to
user-defined thresholds, which determine whether the windows are merged or one is
removed. The algorithm also periodically monitors the weights of the stored windows:
windows whose weight is below a defined minimum threshold (considered outliers) or
that are very old compared to a defined time span are removed.
D-Stream [31] is a density-based clustering algorithm for data streams that operates on a
time-step model. It starts by initializing an empty hash table, the grid list, and contains
both an online and an offline component. The online component reads each incoming raw
data record and maps it to an existing grid in the grid list, or inserts a new grid if none
exists. After a record is inserted into a grid, the grid's characteristic vector, which holds
all the information about that grid, is updated. Thus, the online component partitions the
data into many density grids that form grid clusters. The offline component takes the role
of dynamically adjusting the clusters: if a grid receives no new values for an extended
period, it is removed from the grid list. Such grids are known as sporadic grids and may
contain outliers.
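The online component's mapping of records to density grids can be sketched as follows; the cell size, decay factor, and function names are illustrative assumptions rather than D-Stream's published settings:

```python
import math

GRID_SIZE = 1.0      # assumed length of each grid-cell side
DECAY = 0.998        # assumed per-tick density decay factor

grid_list = {}       # cell key -> (density, last update time): a tiny characteristic vector

def cell_key(point, g=GRID_SIZE):
    """Map a raw record to its density grid via coordinate quantization."""
    return tuple(math.floor(x / g) for x in point)

def online_insert(point, t):
    key = cell_key(point)
    density, last_t = grid_list.get(key, (0.0, t))
    # decay the stored density up to time t, then count the new record
    grid_list[key] = (density * DECAY ** (t - last_t) + 1.0, t)

for t, p in enumerate([(0.2, 0.7), (0.9, 0.1), (5.5, 5.5)]):
    online_insert(p, t)
```

The offline component would periodically sweep `grid_list` and delete entries whose decayed density has fallen below a sporadic-grid threshold.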
MR-Stream [68] extends D-Stream by finding clusters at versatile granularities. It
recursively partitions the data space into well-defined cells using a quadtree data
structure. MR-Stream employs both online and offline components.
2.4. Spatiotemporal Outlier Detection in Data Stream
Spatiotemporal data mining refers to the process of extracting hidden knowledge from
both the spatial and the temporal data space. It is an emerging research area for data
stream applications. Traditionally, data mining techniques treated spatiality and
temporality as two separate research areas; today, however, their combination has become
a central requirement for processing data events. According to the survey article on
outlier detection by Manish [74], spatiotemporal outliers can be defined as spatiotemporal
objects whose behavioral/thematic (non-spatial and non-temporal) attributes are
significantly different from those of the other objects in their spatial and temporal
neighborhoods. Figure 2.2 shows the workflow of a traditional spatiotemporal outlier
detection framework, which models outlier detection with three main components. The
first component is responsible for finding objects with interesting semantics in the input
data stream. The next component analyzes these objects to identify spatial outliers.
Finally, the spatial outliers are examined across time to check whether they are also
temporal outliers. Objects are classified as spatiotemporal outliers if they are found to be
both spatial and temporal outliers.
[Figure omitted: workflow diagram. Spatio-Temporal Data → Find Spatial Objects → Find Spatial Outliers → Verify Temporal Outliers → Spatio-Temporal Outliers]

Figure 2.2: A workflow from a traditional spatiotemporal outlier detection framework.
Birant [40], Cheng [101, 32], and Adam [3] proposed anomaly detection algorithms that
follow the multi-step approach of Figure 2.2: they first detect spatial outliers and then
verify their temporal neighborhoods to determine the spatiotemporal outliers. These
techniques use a modified version of DBSCAN [79] for both the spatial and the temporal
neighborhoods, given a radius, and assign a density factor to clusters intended to flag
potential outliers; the two evaluations together identify the spatiotemporal outliers.
Another technique proposed by Cheng uses spatial scaling in a four-step approach to
address the semantic and dynamic properties of geographic phenomena for ST-outlier
detection. First, the algorithm finds semantic (i.e., spatiotemporal) objects, using prior
knowledge to form regions with significant semantic meaning. Next, aggregation, which
focuses on detecting spatiotemporal outliers, is used to remove noise. The outliers found
in the clustering phase are then compared to the points that were filtered out. The final
step verifies the temporal outliers based on the previous steps. Adam [3] uses a
distance-based outlier detection technique that establishes a spatial Voronoi grid to obtain
macro-clusters; the algorithm uses the Jaccard distance and the silhouette coefficient to
determine the quality of the micro-clusters, and any points that deviate substantially from
their neighborhood are flagged as spatiotemporal outliers. Other techniques, such as
outlier solids, the Kulldorff scan statistic, and trajectory outliers, are discussed in more
detail in [73].
The Extensible Markov Model (EMM) [76] is a spatiotemporal algorithm based on
first-order Markov chains (MC) as described in [17]. EMM consists of two parts: a
distance-based data stream clustering algorithm for spatial data, which obtains
representative granules in the continuous data space, and an MC that models temporal
behavior. EMM applies to data stream processing where the number of states is unknown
in advance and provides a heuristic modeling method where approximating the Markov
property is appropriate. EMM operations are entirely online and thus suitable for data
streams. Figure 2.3 shows EMM's framework for detecting spatiotemporal outliers. The
rest of this dissertation investigates solutions and improvements for different aspects of
this framework.
[Figure omitted: workflow diagram. Spatiotemporal Data (Data_t, Data_{t+1}, ..., Data_{t+n}) → EMM (Clustering + Markov Model) → Spatio-Temporal Outliers]

Figure 2.3: The workflow of the EMM outlier detection framework.
Chapter 3
RISK LEVELING OF NETWORK TRAFFIC ANOMALIES
The goal of intrusion detection is to identify attempted or ongoing attacks on a computer
system or network. Many attacks aim to compromise computer networks in an online
manner, and traffic anomalies have been an important indication of such attacks. The
challenges in detection lie in modeling large, continuous streams of data and performing
anomaly detection online.

In this chapter, we present a data mining technique to assess the risk of local anomalies
based on a synopsis obtained from a global spatiotemporal modeling approach. The
proposed model is proactive in the detection of various traffic-related attacks such as
distributed denial of service (DDoS). It is incremental and scalable, and thus suitable for
online processing. Algorithm analysis shows the time efficiency of the proposed
technique. Experiments conducted with a DARPA dataset demonstrate that, compared
with a frequency based anomaly detection model, the false alarm rate of the proposed
model is significantly reduced without losing a high detection rate.
3.1. Introduction
Data mining is used to detect anomalies [120] [8] [78] [52]. The goal of anomaly
detection is to "find data objects that are different from most other objects" [86]. An
anomaly can be an indication of a possibly dangerous situation in computer networks and
other systems. When an anomaly is detected by an anomaly detection model, an alarm is
1 This work has been published in the International Journal of Computer Science and Network Security (IJCSNS), 2006 [28] and presents joint work with Yu Meng and Professor Margaret H. Dunham.
set and human intervention is invoked to examine whether the alarm represents an event
of interest, such as a dangerous situation or malicious activity. A traffic anomaly is a type
of anomaly: it refers to traffic characteristics that deviate from those that occur the
majority of the time. Such behaviors may have a significant impact on the system. Traffic
anomalies have received attention as a major indicator of risk exposure in computer
networks. For example, Juniper Networks has proposed a combination of traffic anomaly
detection, protocol anomaly detection, and stateful signatures to identify a variety of
attacks in computer networks [91]. Cisco has delivered the Cisco Traffic Anomaly
Detector XT 5600 for detecting distributed denial of service (DDoS) attacks, worms, and
other attacks [33]. Applications of traffic anomaly mining extend intuitively to highway
traffic operation and electric power demand management.

However, an anomaly is not necessarily a risk. Generally, as a higher detection rate is
pursued with an anomaly detection model, a higher false alarm rate results as well. The
human intervention triggered by false alarms is very costly, and there is a demand to
reduce unnecessary intervention. Automatic techniques are desired to evaluate the chance
that an anomaly is of interest, so as to filter out anomalies that are probably not risks.
Existing anomaly detection work uses either frequency based or data deviation based
approaches [120] [78] [46] [96] [9] [85]. We have noticed that these may suffer from a
high false alarm rate. In this chapter we propose a risk leveling model, a two-phase data
mining technique with rules using both occurrence frequency and data deviation. The
proposed model detects anomalies based on frequency and then measures the deviation of
each anomaly from the normal data space. The level of risk associated with the anomaly
is evaluated by this deviation, as we envision that when risks occur the anomalous data
lies away from the normal data space.

A common characteristic of a data stream is its high volume of data, which continuously
arrives at a rapid rate. It is not feasible to store all data from the streams and use random
access to the data as in a traditional database. This implies
a single-pass restriction for all data in the streams [55]. Therefore, the data stream must be
modeled to obtain a synopsis of the global profile of the dataset. Data mining is a key
technique for modeling stream data. Our proposed risk leveling model is built on the
Extensible Markov Model (EMM), a spatiotemporal modeling technique [76], and uses
the synopsis obtained from the EMM modeling process. Performance comparisons with a
frequency based anomaly detection model [120] are expected to show the low false alarm
rate of the proposed risk leveling model without losing a high detection rate. The
proposed model also inherits the incrementality and scalability of EMM.
3.2. Related Work
Our proposed technique assesses the chance that alarms raised by a frequency based
anomaly detection model are actually events of interest, i.e., risks. Before presenting the
risk leveling model, we first introduce related work, followed by the frequency based
anomaly detection technique. Among the prominent properties of an anomaly are its
rarity and its possible significance. These properties distinguish anomaly detection
techniques from modeling techniques in other areas in terms of feature
selection/construction and evaluation metrics. Lazarevic [8] indicates that unsupervised
and supervised techniques are the two major categories of anomaly detection.
Unsupervised techniques are capable of mining unlabeled data, meaning that no a priori
knowledge of "normal" profiles is required; an anomaly is detected by selecting an event
that deviates from the majority. Although a variety of algorithms can be applied, some
common steps are as follows:

• Construct features. The features may be arranged in a weighted numeric vector.

• Determine a distance measure from the data point, which represents an event under
investigation, to a cluster. The kth nearest neighbor distance [75], similarity measures
(Jaccard, cosine, overlap, Dice [75]), Euclidean distance, Manhattan distance [75], skewed
distance (Mahalanobis distance [8]), and density distance (LOF) [78] are common
distance measures.

• Apply an anomaly detection algorithm to the data, based on a set of rules defining
anomalies. The following categories of anomaly detection algorithms appear in the
literature:

– Distance based algorithms [46] [96],
– Statistics based algorithms [84], including finite mixture models [62] and information
theory [113],
– Model based algorithms such as neural networks [92] and SVMs [9].
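The distance and similarity measures listed in the second step can be written compactly; a small stdlib-only sketch:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def jaccard_sim(a, b):
    """Set overlap: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```

Which measure fits depends on the feature construction: numeric vectors suit Euclidean or Manhattan distance, while set-valued or categorical features suit Jaccard-style similarity.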
Distance-based algorithms are based on clustering and form a major category of
techniques in anomaly detection. These techniques neither assume independence among
different dimensions of the data, as statistics based algorithms do, nor are they as sensitive
to the initial selection of the model as model based algorithms are. The meaning of these
techniques is easy to interpret, and they are suitable for spatial data mining. Temporality
can be another expected characteristic of anomalies in addition to spatiality, particularly
in traffic anomaly detection. Markov chains and suffix trees have been used [44] [85] to
store temporal profiles. The benefit of a Markov chain is its concise mathematical
presentation; variations of the Markov chain with dynamic structures have been proposed
to model dynamically changing data [76] [35]. The suffix tree stores all suffixes of a
sequence and is linearly efficient in string matching against those suffixes.

The EMM [76] takes advantage of distance-based clustering for spatial data as well as of
the Markov chain for temporality. EMM achieves efficient modeling by mapping groups
of closely located real-world events to the states of a Markov chain. EMM is an extension
of the Markov chain that uses clustering to obtain representative granules in the
continuous data space. By providing a dynamically adjustable structure, EMM is
applicable to data stream processing when the number of states is unknown in advance
and provides a heuristic modeling method for data that approximately hold the Markov
property. EMM formalizes a framework for spatiotemporal data mining with phases
including clustering and Markov chain construction, which model the data stream to
obtain a synopsis of the data profile, and applications, which are built on the synopsis.
Below we give a concise description of EMM, which should be sufficient to grasp the
scope of our work; further information concerning EMM can be found in [76].

A multidimensional data point in EMM represents a real-world event and can be
represented as a vector in a hyperspace. EMM defines a set of formalized procedures such
that at any time t, EMM consists of a Markov chain (MC) and algorithms to modify it,
where the algorithms include:
1. EMMCluster: defines a technique for matching input data at time t + 1 against the
existing states of the MC at time t. This is a clustering algorithm which determines
whether the new data point or event should be added to an existing cluster (MC state) or
whether a new cluster (MC state) should be created. A distance threshold th is used in
clustering.

2. EMMBuild: an algorithm that updates (as well as adds, deletes, and merges) the MC at
time t + 1, given the MC at time t and the output of EMMCluster at time t + 1.

3. EMMapplications: algorithms that use the EMM to solve various problems. To date,
we have examined EMM for prediction (EMMPredict) [76] and anomaly (rare event)
detection (EMMRare) [120].
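A toy sketch of the first two algorithms may help. Here EMMCluster is nearest neighbor matching against a distance threshold th, and EMMBuild updates the occurrence and transition counts. For brevity the node representative is the first point assigned to the node rather than a maintained centroid or medoid, and the threshold value is an arbitrary assumption:

```python
import math

TH = 2.0  # assumed clustering distance threshold

class EMM:
    """Toy EMM synopsis: nearest-neighbor clustering plus MC transition counts."""
    def __init__(self, th=TH):
        self.th = th
        self.LS = []          # representative per node (simplified: first point)
        self.CN = []          # occurrence count per node
        self.CL = {}          # (i, j) -> transition count
        self.current = None   # index of the current MC state

    def cluster(self, point):
        """EMMCluster: return the nearest node within th, or create a node."""
        if self.LS:
            j = min(range(len(self.LS)),
                    key=lambda i: math.dist(self.LS[i], point))
            if math.dist(self.LS[j], point) <= self.th:
                return j
        self.LS.append(tuple(point))
        self.CN.append(0)
        return len(self.LS) - 1

    def build(self, point):
        """EMMBuild: update counts and the transition from the current state."""
        j = self.cluster(point)
        self.CN[j] += 1
        if self.current is not None:
            key = (self.current, j)
            self.CL[key] = self.CL.get(key, 0) + 1
        self.current = j

emm = EMM()
for p in [(0, 0), (0.5, 0), (10, 10), (0.2, 0.1)]:
    emm.build(p)
```

After the four points, two states exist: the first, second, and fourth points fall into state 0, the third creates state 1, and the transition counts record the path 0 → 0 → 1 → 0.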
Throughout this chapter, we view the EMM as a directed graph with nodes and links. We
use link and transition interchangeably to refer to a directed arc, and node, state, and
cluster interchangeably to refer to a vertex in the EMM. These algorithms are executed in
an interleaved manner. The first two phases are used to model the data; the third performs
applications based on the synopsis created in the modeling process. The synopsis includes
information on cluster features [107] and on transitions between states. The cluster
feature defined in [107] includes at least a count of occurrences, CN_i (count on the
node), and either a medoid or centroid for that cluster, LS_i. To summarize, the elements
of the synopsis of an EMM are listed in Table 3.1.
Table 3.1: Notations of EMM Elements
Notation   Description
N_i        The ith EMM node, labeled by CN_i and LS_i
CN_i       Count of occurrences of data points found in the cluster (EMM node or EMM state) N_i
LS_i       A vector representing the representative data point of the cluster, usually the centroid or medoid of the cluster
L_ij       The directed link from N_i to N_j, labeled by CL_ij
CL_ij      Count of occurrences of the directed link from N_i to N_j
m          Number of EMM states
n          Number of attributes in the vector representing a data point (dimensions of the data space)
In this chapter, the frequency based anomaly detection algorithm [120], one of the several
applications of EMM, is used for comparison with the proposed model. We give a brief
review of the approach. The idea for anomaly detection comes from the fact that the
learning aspect of EMM dynamically creates a Markov chain that captures past behavior
stored in the synopsis. No input into the model identifies normal or abnormal behavior;
instead, this is learned from the statistics of occurrence of transitions and states within the
generated Markov chain. By learning what is normal, the model can predict what is not.
The basic idea is to define a set of rules related to the cardinalities of clusters and
transitions to judge anomalies. An anomaly is detected if an input event (data point), E_t,
is determined not to belong to any existing cluster (state in the EMM), if the cardinality of
the associated cluster (CN_j) is small, or if the count of the transition (CL_ij) from the
current state, i, to the new state, j, is small. When any of the predefined rules is met, a
Boolean alarm, A_t, is set to indicate the capture of an anomaly.
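The rule set can be sketched as a small predicate; the threshold constants below are illustrative assumptions, not values from [120]:

```python
MIN_CN = 2   # assumed minimum cardinality for a "normal" cluster
MIN_CL = 2   # assumed minimum count for a "normal" transition

def frequency_alarm(is_new_state, cn_j=None, cl_ij=None):
    """Boolean alarm A_t: fires on an unseen state, a rare state,
    or a rare transition from the current state."""
    if is_new_state:
        return True
    return cn_j < MIN_CN or cl_ij < MIN_CL

a1 = frequency_alarm(True)                       # unseen behavior: alarm
a2 = frequency_alarm(False, cn_j=50, cl_ij=40)   # frequent pattern: no alarm
a3 = frequency_alarm(False, cn_j=50, cl_ij=1)    # rare transition: alarm
```

Each alarm would then be passed to the risk leveling phase described next, rather than going straight to a human operator.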
3.3. Methodology
In this section we present the steps to build the risk leveling model, based on EMM
modeling [76] and the frequency based anomaly detection model [120], as well as the
evaluation metrics. KDD defines preprocessing procedures [108] that convert raw data
into a format appropriate for data mining. Our preprocessed data uses a structured format
that combines the time stamp and spatial traffic statistics in one vector:

V_t = < D_t, T_t, S_1t, S_2t, ..., S_it, ... >,

where D_t denotes the type of day, T_t the time of day, and S_it the value of the statistic
observed at spatial location i at time t. This spatiotemporal format defines an input
real-world event (input data point) in the multidimensional data space. Assuming there
are n elements in the vector, each data point can be represented as a vector in
n-dimensional space.
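Constructing V_t from a raw timestamp and traffic statistics might look as follows; the encoding of D_t (weekday vs. weekend) and T_t (seconds since midnight) is an illustrative assumption, not the exact preprocessing used in the experiments:

```python
from datetime import datetime

def make_vector(ts, stats):
    """V_t = <D_t, T_t, S_1t, S_2t, ...>: type of day, time of day, traffic stats."""
    d_t = 1 if ts.weekday() < 5 else 0                 # assumed: weekday=1, weekend=0
    t_t = ts.hour * 3600 + ts.minute * 60 + ts.second  # seconds since midnight
    return (d_t, t_t, *stats)

# Saturday, 08:30:00, with two hypothetical spatial traffic statistics
v = make_vector(datetime(2016, 12, 17, 8, 30, 0), [120.0, 35.5])
```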
A trait of EMM is that it learns while performing its task, dynamically adapting to the
time-variant dataset. To perform mining of risk levels, the following are applied:

1. EMMCluster: nearest neighbor clustering,

2. EMMBuild,

3. EMMAnomaly,

4. EMMRiskLeveling.

Algorithms EMMCluster and EMMBuild define the modeling process of EMM [76].
Combined with the algorithm EMMAnomaly [76], they define the EMM anomaly
detection model based on occurrence frequency introduced in the preceding section. The
anomaly detection model sets alarms, A_t, based on a set of predefined, frequency based
rules. To build the risk leveling model, a new algorithm, EMMRiskLeveling, is added.
The risk leveling model outputs a risk leveling index by combining the frequency based
anomaly alarm with an evaluation of the deviation of the local pattern from the normal
data space. We will see that this deviation evaluation can be calculated incrementally.
To evaluate the deviation, we use two parameters, the centroid \vec{c}(t) and the diameter
D(t), to characterize the data space Ω of the model. Here the data space Ω refers to the
region that the data points occupy in the n-dimensional hyperspace, which is equivalent to
the region over which the EMM nodes are distributed. The centroid of Ω is given in
Definition 3.1. Moreover, the centroid can be computed incrementally; using this
incrementality, the time complexity is reduced from O(m) to O(1), as given in Lemma 3.1.
Definition 3.1 (Centroid of data space). Denote an EMM node by N_i, the number of
data points included in the node by CN_i, and the first moment, or representative
location, of N_i by \vec{LS}_i. The centroid of the data space, \vec{c}(t), is defined as:

    \vec{c}(t) = \frac{\sum_{i=1}^{m} \vec{LS}_i \cdot CN_i}{t}    (3.1)

Lemma 3.1 (Incrementality of the centroid of the data space). Given \vec{c}(t-1) and
the first moment \vec{LS}_c of the current EMM state, \vec{c}(t) can be expressed in an
incremental manner:

    \vec{c}(t) = \frac{\vec{c}(t-1) \cdot (t-1) + \vec{LS}_c}{t}
               = \vec{c}(t-1) \cdot \left(1 - \frac{1}{t}\right) + \frac{\vec{LS}_c}{t}    (3.2)

Proof 3.1. First note that \vec{LS}_c is the same as \vec{LS}_t. We consider two cases:

1. N_c is a new EMM node:

    \vec{c}(t) = \frac{\sum_{i=1}^{m(t)} \vec{LS}_i \cdot CN_i}{t}
               = \frac{\sum_{i=1, i \neq c}^{m(t-1)} \vec{LS}_i \cdot CN_i}{t} + \frac{\vec{LS}_c}{t}
               = \sum_{i=1, i \neq c}^{m(t-1)} \frac{\vec{LS}_i \cdot CN_i}{t-1} \cdot \frac{t-1}{t} + \frac{\vec{LS}_c}{t}
               = \frac{\vec{c}(t-1) \cdot (t-1) + \vec{LS}_c}{t}
               = \vec{c}(t-1) \cdot \left(1 - \frac{1}{t}\right) + \frac{\vec{LS}_c}{t}

2. N_c is an existing EMM node:

    \vec{c}(t) = \frac{\sum_{i=1}^{m(t)} \vec{LS}_i \cdot CN_i}{t}
               = \frac{\sum_{i=1, i \neq c}^{m(t)} \vec{LS}_i \cdot CN_i}{t} + \frac{\vec{LS}_c \cdot CN_c}{t}
               = \frac{\sum_{i=1, i \neq c}^{m(t)} \vec{LS}_i \cdot CN_i}{t} + \frac{\vec{LS}_c \cdot (CN_c - 1)}{t} + \frac{\vec{LS}_c}{t}
               = \sum_{i=1}^{m(t-1)} \frac{\vec{LS}_i \cdot CN_i}{t-1} \cdot \frac{t-1}{t} + \frac{\vec{LS}_c}{t}
               = \frac{\vec{c}(t-1) \cdot (t-1) + \vec{LS}_c}{t}
               = \vec{c}(t-1) \cdot \left(1 - \frac{1}{t}\right) + \frac{\vec{LS}_c}{t}

where in the fourth line the counts are taken at time t − 1 (CN_c reduced by one for the
newly added point), and m(t) = m(t-1) since no node was added.
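Lemma 3.1 can be checked numerically. In the sketch below each arriving data point is treated as its own \vec{LS}_c (i.e., the node representative does not drift), and the incremental update (3.2) is compared against the batch definition (3.1):

```python
# Stream points one at a time, updating the centroid with
# c(t) = c(t-1) * (1 - 1/t) + LS_c / t, then compare against the
# batch mean. For simplicity, each point serves as its own LS_c.
stream = [(1.0, 4.0), (2.0, 3.0), (2.0, 3.0), (1.0, 4.0), (2.0, 3.0)]

c = (0.0, 0.0)
for t, ls_c in enumerate(stream, start=1):
    c = tuple(ci * (1 - 1 / t) + x / t for ci, x in zip(c, ls_c))

batch = tuple(sum(p[i] for p in stream) / len(stream) for i in range(2))
```

Each update touches only the previous centroid and the new representative, which is exactly the O(1) cost claimed in the lemma.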
Now we define the diameter of Ω in Definition 3.2.

Definition 3.2. Denote an EMM node by N_i, the number of data points included in the
node by CN_i, and the distance between any two EMM nodes N_i, N_j by d_ij. The
diameter of the data space at time t, D(t), is defined by:

    D(t) = \left( \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij}^2 \cdot CN_i \cdot CN_j}{2\, t(t-1)} \right)^{1/2}    (3.3)

where t is the time instance and m is the number of EMM nodes. For simplicity in
computations, we define:

    d(t) = \left( \frac{\sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij}^2 \cdot CN_i \cdot CN_j}{2} \right)^{1/2}    (3.4)

Therefore, we have

    D(t) = \left( \frac{d^2(t)}{t(t-1)} \right)^{1/2}    (3.5)

or

    D^2(t) = \frac{d^2(t)}{t(t-1)}    (3.6)
At each time instance, D(t) gives a weighted inter-cluster distance of the data points
received so far and can be used to measure the size of the data space. It can be seen as an
approximation of the inter-point distance in the data space, obtained by ignoring the
distances among data points within the same cluster. The computational complexity of
this is O(m²); however, given the incrementality, it can be reduced to O(m).
Lemma 3.2 (Incrementality of the diameter of the data space). Given the diameter of
the data space at time instance t − 1, the diameter of the data space at time instance t can
be expressed in an incremental manner:

    d^2(t) = d^2(t-1) + \sum_{i=1}^{m(t-1)} d_i^2(t) \cdot CN_i    (3.7)

where d_i(t) denotes the distance between the node receiving the current data point and
node N_i.
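Equation (3.7) can likewise be checked numerically under the simplifying assumption that every arriving point forms its own singleton node, so the representatives never move and d_i(t) is the distance from the new point to node N_i:

```python
import math

# Incremental update of d^2(t): add the squared distance from the new node
# to every existing node, then compare against the batch definition
# d^2(t) = (1/2) * sum_ij d_ij^2 * CN_i * CN_j (here every CN_i = 1).
nodes = []
d2 = 0.0
for p in [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]:
    d2 += sum(math.dist(p, q) ** 2 for q in nodes)
    nodes.append(p)

batch = 0.5 * sum(math.dist(a, b) ** 2 for a in nodes for b in nodes)
```

Each arrival costs one pass over the m existing nodes, matching the O(m) claim above.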
Since the proof is very similar to that of the incrementality of the centroid, we skip it here. Now, denote the distance between the current node N_c and \(\vec{c}(t)\) at time instance t by d_{cc}(t). We define a risk leveling index in Definition 3.3.
Definition 3.3 (Risk Leveling Index:) Given an alert raised by the frequency based anomaly detection model when a data point \(\vec{E}_t\) is input, the risk leveling index caused by data deviation is given by a hyperbolic tangent sigmoid function, and is defined as:

\[ a(t) = \frac{e^{r(t)} - e^{-r(t)}}{e^{r(t)} + e^{-r(t)}} \tag{3.8} \]

where

\[ r(t) = \left( \frac{d_{cc}^2}{D^2(t)} \right)^{1/4} \tag{3.9} \]

or, for simplicity of computations,

\[ r(t) = \left( \frac{t (t-1) * d_{cc}^2}{d^2(t)} \right)^{1/4} \tag{3.10} \]
The index a(t) yields an output in the range [0, 1) because the ratio r(t) is never negative. The further the current data point lies outside the border of the data space, the more likely the data point is associated with a risk. This is in line with our assumption. The procedure to compute the risk leveling index is illustrated in Algorithm 1.
input : A_t : Boolean output of the frequency based anomaly detection model at time t.
        G_t : EMM at time t.
output: a(t) : Risk leveling index at time t.

1 foreach time instance t do
2     if A_t == true then
3         Update \(\vec{c}(t)\) using (3.3);
4         Update D(t) using (3.7) and (3.5) or (3.6);
5         Compute a(t) using (3.10);

Algorithm 1: EMMRiskLevel
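The loop body of Algorithm 1 can be sketched in a few lines. This is an illustrative rendering only (all names are ours), and for clarity it computes d²(t) directly from the synopsis per (3.4) rather than incrementally via (3.7):

```python
import numpy as np

def risk_level(centroids, counts, t, d_cc_sq):
    """Sketch of the EMMRiskLevel loop body: risk leveling index a(t).

    centroids : EMM node centroids LS_i, shape (m, dims)
    counts    : node counts CN_i, shape (m,)
    t         : number of data points seen so far
    d_cc_sq   : squared distance between the current node and c(t)
    """
    centroids = np.asarray(centroids, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # d^2(t) per (3.4): half the CN-weighted sum of squared inter-node distances.
    diff = centroids[:, None, :] - centroids[None, :, :]
    d2_t = ((diff ** 2).sum(axis=-1) * np.outer(counts, counts)).sum() / 2.0
    # r(t) per (3.10), a(t) per (3.8) (tanh is the hyperbolic tangent sigmoid).
    r = (t * (t - 1) * d_cc_sq / d2_t) ** 0.25
    return np.tanh(r)
```

For example, with two nodes <1, 4> and <2, 3> holding 2 and 3 points (so d²(5) = 12) and a hypothetical d_cc² equal to D²(5) = 12/20, the ratio in (3.10) is 1 and a(5) = tanh(1) ≈ 0.76.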
Example 3.1 (Given an EMM at time 5, specified as:)

N_1 = { 2, <1, 4> }, N_2 = { 3, <2, 3> },
L_{11} = { 1 }; L_{12} = { 1 }; L_{21} = { 1 }; L_{22} = { 2 };

\[ \vec{c}(5) = \langle 8/5,\ 17/5 \rangle, \qquad d^2(5) = 12 \]
Our proposed approach to determining the risk leveling index based on the synopsis has the following benefits:

• Computation takes O(1) time for \(\vec{c}(t)\) and O(m) time for D(t). Recall that the EMM takes O(m) time for clustering and O(1) time for Markov chain updates. The proposed approach inherits the time efficiency that EMM possesses.

• The proposed approach is based solely on the synopsis of the EMM at the current time. Thus the proposed method is as incremental and scalable as EMM itself.

• The proposed approach learns in an unsupervised manner while performing applications. It is not heavily dependent on a training process and is thus suitable for stream data processing.
If a data point \(\vec{d}_6 = \langle 1, 3 \rangle\) is input at time t = 6 and is clustered into a new EMM node N_3, and the frequency based anomaly model sets an alarm A_t = true according to its rules, then using Algorithm 1 we have:

\[
\begin{aligned}
\vec{c}(6) &= \Big\langle \frac{\tfrac{8}{5} * 5 + 1}{6},\ \frac{\tfrac{17}{5} * 5 + 3}{6} \Big\rangle = \langle 3/2,\ 10/3 \rangle, \\
d^2(6) &= d^2(5) + (1 + 1) = 12 + 2 = 14, \\
D^2(6) &= \frac{d^2(6)}{6 * (6-1)} = 7/15, \\
d_{cc}^2 &= |\langle 1 - 8/5,\ 3 - 17/5 \rangle|^2 = 2/5, \\
r(6) &= \frac{2/5}{7/15} = 6/7, \\
a(6) &= \tanh\big(r(6)\big) \approx 0.69.
\end{aligned}
\]
We consider several evaluation metrics to compare the performance of our proposed model with that of the frequency based anomaly detection model: Detection Rate (also called true positive rate, recall, or hit rate in the literature), False Alarm Rate (or false positive rate) [120], Precision (or positive predictive value), and the F1 score (also F-score or F-measure). Detection Rate refers to the ratio of correctly alarmed risks to the total number of actual risks, including those incorrectly labeled as normal data points. False Alarm Rate refers to the ratio of normal data points incorrectly alarmed as risks to the total number of normal data points. The F1 score is a measure of a test's accuracy (see the definitions in (3.11), (3.12), (3.13) and (3.14)).
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{3.11} \]

\[ \text{True Positive Rate} = \frac{TP}{TP + FN} \tag{3.12} \]

\[ \text{False Alarm Rate} = \frac{FP}{FP + TN} \tag{3.13} \]

\[ F1 = \frac{2\,TP}{2\,TP + FP + FN} \tag{3.14} \]
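These four metrics are direct functions of the confusion-matrix counts; a minimal sketch (the tn value in the usage line is our own illustrative assumption, not a figure from the experiments):

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Precision, true positive (detection) rate, false alarm rate,
    and F1 score per equations (3.11)-(3.14)."""
    precision = tp / (tp + fp)
    tpr = tp / (tp + fn)            # detection rate / recall
    far = fp / (fp + tn)            # false alarm rate
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, tpr, far, f1

# E.g. 1 detected attack, 4 false alarms, no missed attacks, 1043 true negatives:
# precision 0.2, detection rate 1.0, F1 = 1/3 (cf. the first row of Table 3.5).
p, tpr, far, f1 = evaluation_metrics(tp=1, fp=4, tn=1043, fn=0)
```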
3.4. Experiments and Analysis
This section briefly reports the results of experiments comparing the proposed model to a frequency based model. We demonstrate the learning capacity and the impact of parameters on time and memory utilization. The frequency based anomaly detection model was introduced in an earlier section.
3.4.1. Dataset
In 1998, 1999 and 2000, the MIT Lincoln Laboratory [57] conducted a comparative evaluation of intrusion detection systems (IDSs) developed under DARPA funding. This effort examined Internet traffic at Air Force bases, simulated on a test network. The idea was to generate a set of realistic attacks, embed them in normal data, and evaluate the false alarm and detection rates of systems on these data, in order to drive performance improvements of existing IDSs [57]. We use the DARPA dataset as a test case for our proposed model.
In order to extract information from the DARPA tcpdump datasets, the TcpTrace utility software [102] was used. This preprocessing procedure was applied to TCP connection records; ICMP and UDP packets were ignored. The new feature list obtained from the raw tcpdump data using the TcpTrace software is presented in Table 3.2. The preprocessed dataset is structured into nine features, where each feature denotes the statistical count of network traffic within a fixed time interval.
Table 3.2: The extracted features from raw tcpdump data using tcptrace software
Extracted Relevant Features
Name Description
IIN The number of packets flowing from inside to inside network
ION The number of packets flowing from inside to outside network
IDN The number of packets flowing from inside to DMZ network
OON The number of packets flowing from outside to outside network
OIN The number of packets flowing from outside to inside network
ODN The number of packets flowing from outside to DMZ network
DDN The number of packets flowing from DMZ to DMZ network
DIN The number of packets flowing from DMZ to inside network
DON The number of packets flowing from DMZ to outside network
Preprocessed network traffic statistics are gathered every 10 seconds for investigation. The two attack-free weeks of the DARPA 1999 dataset (the first and third weeks) are used as training data, and the DARPA 2000 dataset, which contains DDoS attacks, is used as test data. We obtained 20270 rows from the first week and 21174 rows from the third week to create the normal dataset used for modeling. The DARPA 2000 dataset, which contains attacks, has 1048 rows. Figure 3.1 shows the DARPA 2000 data profile with attacks.
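In essence, the nine features of Table 3.2 are per-interval counts of packets keyed by (source zone, destination zone). The following is a hedged sketch of that aggregation; the record layout and names are ours, not TcpTrace's actual output format:

```python
from collections import Counter

def windowed_counts(packets, window=10):
    """Aggregate (timestamp, src_zone, dst_zone) packet records into
    per-window feature counters like IIN, ION, ... in Table 3.2.

    packets: iterable of (timestamp_seconds, src_zone, dst_zone),
             zones being 'I' (inside), 'O' (outside) or 'D' (DMZ).
    Returns {window_start: Counter({'IO': n, ...})}.
    """
    windows = {}
    for ts, src, dst in packets:
        start = int(ts // window) * window      # 10-second bucket start
        windows.setdefault(start, Counter())[src + dst] += 1
    return windows

# Example: three packets, all inside the first 10-second interval.
counts = windowed_counts([(0.5, 'I', 'O'), (3.2, 'I', 'O'), (9.9, 'O', 'I')])
# counts[0]['IO'] == 2, counts[0]['OI'] == 1
```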
3.4.2. Experiments
Now we present the performance experiments that compare the two models using derivations from the confusion matrix. Table 3.3 gives the legend used in this section for quick reference. The experiment results show that the frequency based anomaly detection model detects the attack after being trained on the first week of data. However, the side effect is a high false alarm rate and low precision. Training additionally on the third week lowers the false alarm rate and raises the precision by 5 percentage points. Tables 3.5 and 3.6 provide the detection and false positive rates after training on the first week and after continuing training with the third week. The threshold used is 0.8 with Jaccard clustering.
Figure 3.1: Logarithm of traffic volume shows the DDoS attacks
Table 3.4 shows the number of states created in the EMM using the first and third weeks of the DARPA 1999 normal dataset. As we can see, the number of EMM nodes or states differs only slightly. This demonstrates the learning capability of the EMM, although exhaustive learning is not possible. This observation is consistent with [76], which reported a sublinear growth rate of the number of EMM states. Also, compared with the size of the dataset, the number of EMM states is low in all cases in the table, which indicates the efficiency of the model. We can also see that different similarity measures with different threshold values yield different numbers of EMM nodes or states in the modeling process. Thus the selection of threshold values impacts memory usage and time utilization.

To conclude, the proposed risk leveling model lowers the false alarm rate compared
Table 3.3: Legend used in the performance evaluation with derivations from the confusion matrix.

  Legend of performance experiments
  Name   Description
  NOA    Number of observable attacks
  NA     Number of alerts
  NTAD   Number of true attacks detected
  P      Precision
  TPR    True positive rate
  FAR    False alarm rate
  F1     F-measure
with the frequency based anomaly detection model and keeps a high detection rate in the
test cases. The approach is efficient, incremental and scalable.
3.5. Chapter Summary
This chapter presents a novel data mining technique to detect traffic based network intrusions. Our proposed technique takes both frequency and data deviation into account in an efficient, incremental, and scalable anomaly detection model. The performance experiments support our assumption that traffic related network intrusions are accompanied by data deviation. The technique is suitable for online processing.

There are several directions for future research. These include the design of models incorporating signatures previously determined to be risks, investigation of correlations among the parameters, and exploration of the feasibility of the model for dynamic datasets in grid computing environments.
Table 3.4: Impacts of clustering thresholds and selection of similarity measures

  Parameter Analysis
  Normal Dataset                            Threshold
  DARPA 1999        Sim        0.70         0.80         0.90          0.99
  First week        Jaccard     148          298          855          7794
                    Dice         72          120          372          5033
                    Cosine       13           21           59          1298
                    Overlap       6           10           11            38
  Third week        Jaccard     181          367         1124         11820
                    Dice         84          145          449          7222
                    Cosine       13           22           63          1702
                    Overlap       6           10           11            42
  Diff between      Jaccard  33 (18.23%)  69 (18.8%)   269 (23.93%)  4026 (34.1%)
  first & third     Dice     12 (14.3%)   25 (17.24%)  77 (17.15%)   2189 (30.74%)
  weeks             Cosine   0%           1 (4.55%)    4 (6.35%)     404 (23.74%)
                    Overlap  0%           0%           0%            4 (9.52%)
Table 3.5: Detection rate and false alarm rate using the frequency based anomaly detection model

  Performance of the frequency based anomaly detection model
  Setting                  NOA  NA  NTAD  P     TPR  FAR       F1
  First Week Dataset       1    5   1     0.2   1    0.00382   0.3333333
  With Third Week Dataset  1    4   1     0.25  1    0.002865  0.4
Table 3.6: Detection rate and false alarm rate using the risk leveling anomaly detection model

  Performance of the risk leveling anomaly detection model
  Setting                  NOA  NA  NTAD  P  TPR  FAR  F1
  First Week Dataset       1    1   1     1  1    0    1
  With Third Week Dataset  1    1   1     1  1    0    1
Chapter 4
A COMPARATIVE STUDY OF OUTLIER DETECTION ALGORITHMS
In the previous chapter, we studied a new anomaly detection model based on the Extensible Markov Model, a spatiotemporal model that can be used to detect outliers in data streams. In this chapter¹, we study EMM's outlier detection performance on different real life datasets and test it against two spatial outlier detection models.
4.1. Introduction
Data mining is the process of extracting interesting information from large sets of data. Outliers are defined as events that occur very infrequently. Detecting outliers before they escalate, with potentially catastrophic consequences, is very important for various real life applications such as fraud detection, network robustness analysis, and intrusion detection. This chapter presents a comprehensive analysis of three outlier detection methods: the Extensible Markov Model (EMM), the Local Outlier Factor (LOF) and LSC-Mine. In the algorithm analysis section we present a time complexity analysis and the outlier detection accuracy. The experiments conducted with the Ozone Level Detection, IR video trajectory, and 1999 and 2000 DARPA DDoS datasets indicate that EMM outperforms both LOF and LSC-Mine in both time and outlier detection accuracy. Recently, outlier detection has gained an enormous amount of attention and has become one of the most important problems in many industrial and financial applications. Supervised and unsupervised learning techniques are the two fundamental approaches to the problem of outlier detection.

¹ This work has been published in the International Conference on Machine Learning and Data Mining, MLDM, 2009 [26].

Supervised
learning approaches build models of normal data and detect deviations from the normal model in observed data. The advantage of these types of outlier detection algorithms is that they can detect new types of activity as deviations from normal usage. In contrast, unsupervised outlier detection techniques identify outliers without using any prior knowledge of the data. It is essential for outlier detection techniques to detect sudden or unexpected changes in existing behavior as soon as possible. Consider, for example, the following three scenarios:
1. A network alarm is raised indicating a possible attack. The associated network traffic deviates from normal network traffic. The security analyst discovers that the enormous traffic is produced not from the Internet, but from the local area network (LAN). This scenario is characterized as the zombie effect in a Distributed Denial of Service (DDoS) attack [120], where the LAN is utilized in the DDoS attack to deny services to a targeted network. It also means that the LAN was compromised long before the discovery of the DDoS attack. Computer systems in a LAN provide services that correspond to certain types of behavior; if a new service is started without the system administrator's permission, then it is extremely important to raise an alarm and discover the suspicious activity as soon as possible in order to avoid disaster.
2. Video surveillance [121] is frequently encountered in commercial, residential or military buildings. Finding outliers in the video data involves mining massive, automatically collected surveillance video databases to retrieve the shots containing independently moving targets. The environment in which such a system operates is often very noisy.
3. Today it is not news that the ozone layer is getting thinner and thinner [70]. This is harmful to human health, and affects other important parts of our daily life, such as farming and tourism. Therefore an accurate ozone alert forecasting system would facilitate the issuance of warnings to the public at an early stage, before ozone reaches a dangerous level.
One recent approach to outlier detection, the Local Outlier Factor (LOF) [78], is based on the density of data close to an object. This algorithm has proven to perform well, but suffers from some performance issues. In this chapter we compare the performance of LOF and one of its extensions, LSC-Mine [72], to that of our previously proposed modeling tool, the Extensible Markov Model (EMM) [120]. This comparative study examines these three outlier detection algorithms and reports their time and detection performance. The Extensible Markov Model (EMM) is a spatiotemporal modeling technique that interleaves a clustering algorithm with a first order Markov chain (MC) [82], where at any point in time EMM can provide a high level summary of the data stream. The Local Outlier Factor (LOF) [78] is an unsupervised density-based algorithm that assigns to each object a degree of being an outlier. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. LSC-Mine [72] was designed to overcome the disadvantages of the earlier LOF technique.
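To make the comparison concrete, the LOF score computation can be stated compactly. The following is our own simplified sketch of Breunig et al.'s definitions, not the code used in the experiments:

```python
import numpy as np

def lof_scores(X, k):
    """Local Outlier Factor: ratio of the average local reachability
    density (lrd) of a point's k nearest neighbors to its own lrd.
    Scores near 1 indicate inliers; scores well above 1 indicate outliers."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                      # exclude self-distances
    knn = np.argsort(D, axis=1)[:, :k]               # k nearest neighbors
    k_dist = D[np.arange(n), knn[:, -1]]             # k-distance of each point
    # reachability distance: reach(p, o) = max(k_dist(o), d(p, o))
    reach = np.maximum(k_dist[knn], D[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)                   # local reachability density
    return lrd[knn].mean(axis=1) / lrd               # LOF score per point

# A tight cluster plus one isolated point: the last point's score is large.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [10, 10]]
scores = lof_scores(X, k=3)
# scores[-1] >> 1, while the clustered points score close to 1
```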
4.1.1. Extensible Markov Model
The Extensible Markov Model (EMM) [76] has the advantage of using distance-based clustering for spatial data as well as the Markov chain for temporality. As shown in our previous work [28], EMM achieves efficient modeling by mapping groups of closely located real world events to states of a Markov chain; EMM is thus an extension of the Markov chain. EMM uses clustering to obtain representative granules in the continuous data space. By providing a dynamically adjustable structure, EMM is applicable to data stream processing when the number of states is unknown in advance, and it provides a heuristic modeling method for data that approximately satisfy the Markov property. The nodes in the graph are clusters of real world states, each of which is a vector of sensor values, for example from a flood level sensor in a river bend that continuously feeds values, creating a data stream. The EMM defines a set of formalized procedures such that at any
time t, EMM consists of a Markov Chain (MC) [13] and algorithms to modify it, where
algorithms include:
1. EMMCluster defines a technique for matching between input data at time t + 1 and existing states in the MC at time t. This is a clustering algorithm which determines whether the new data point or event should be added to an existing cluster (MC state) or whether a new cluster (MC state) should be created. A distance threshold th is used in clustering. For more details see Algorithm 2.

2. EMMIncrement updates (as well as adds, deletes, and merges) the MC at time t + 1, given the MC at time t and the output of EMMCluster at time t + 1. For more details see Algorithm 3.
3. EMMapplications are algorithms which use the EMM to solve various problems. To
date we have examined EMM for prediction (EMMPredict) [76] and anomaly (rare,
outlier event) detection (EMMRare) [120].
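A toy rendering of how EMMCluster and EMMIncrement interleave is given below. This is our own simplification, assuming Euclidean distance with a fixed threshold th, node centroids fixed at the first point seen, and no deletion or merging:

```python
import numpy as np

class EMM:
    """Toy Extensible Markov Model: threshold clustering (EMMCluster)
    interleaved with Markov-chain count updates (EMMIncrement)."""

    def __init__(self, th):
        self.th = th              # distance threshold for EMMCluster
        self.centroids = []       # LS_i for each node
        self.cn = []              # CN_i: occurrence count per node
        self.links = {}           # (i, j) -> transition count
        self.current = None       # index of the current node

    def insert(self, point):
        point = np.asarray(point, dtype=float)
        # EMMCluster: match to the nearest existing node within threshold ...
        if self.centroids:
            d = [np.linalg.norm(point - c) for c in self.centroids]
            i = int(np.argmin(d))
        if not self.centroids or d[i] > self.th:
            self.centroids.append(point)      # ... or create a new node
            self.cn.append(0)
            i = len(self.cn) - 1
        # EMMIncrement: update node and transition counts.
        self.cn[i] += 1
        if self.current is not None:
            key = (self.current, i)
            self.links[key] = self.links.get(key, 0) + 1
        self.current = i

# Three points: two fall into one node (with a self-loop), one opens a new node.
emm = EMM(th=0.5)
for p in ([0, 0], [0.1, 0], [5, 5]):
    emm.insert(p)
```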
Throughout this chapter, EMM is viewed as a directed graph with nodes and links. Link and transition are used interchangeably to refer to a directed arc; node, state, and cluster are used interchangeably to refer to a vertex in the EMM. EMMCluster and EMMIncrement are used to model the data. The EMMapplications are used to perform applications based on the synopsis created in the modeling process. The synopsis includes information about the cluster features [107] and the transitions between states. The cluster feature defined in [107] includes at least a count of occurrences, CN_i (the count on the node), and either a medoid or centroid for that cluster, LS_i. To summarize, the elements of the synopsis of an EMM are listed in Table 1. The frequency based anomaly detection [76], one of the several applications of EMM, is used for comparison with the LOF and LSC-Mine algorithms. The idea for outlier detection comes from the fact that the learning aspect of EMM dynamically creates a Markov chain and captures past behavior stored in the synopsis. No
input into the model identifies normal or abnormal behavior; instead, this is learned based on the statistics of occurrence of transitions and states within the generated Markov chain. By learning what is normal, the model can predict what is not. The basic idea is to define a set of rules related to the cardinalities of clusters and transitions to judge outliers. An outlier is