Rare Sequential Pattern Mining of
Critical Infrastructure Control Logs for
Anomaly Detection
by
Anisur Rahman
Bachelor of Science in Computer Science and Engineering (The University of Asia Pacific), August 2002
Master of Science in Computer Science and Engineering (Daffodil International University), December 2009
Submitted in fulfilment of the requirement for the degree
of Doctor of Philosophy
Information Security Discipline
Science and Engineering Faculty
Queensland University of Technology
2019
Keywords
Frequent Pattern, Rare Pattern, Sequential Database, Critical Infrastructure,
SCADA Control System, Anomaly Detection, Itemset Pattern Mining, Sequen-
tial Pattern Mining, Rare Sequential Pattern Mining, Sequential Association
Rules Mining.
Abstract
The need to provide cybersecurity for Supervisory Control and Data Acquisition
(SCADA) control systems is now recognised world-wide. These SCADA systems are
used to drive much of a nation's critical infrastructure, which by definition is
essential for the citizens' way of life. SCADA control systems no longer operate
in isolation, which had the added benefit of providing a level of protection
from anomalies or intrusions. They are now connected to computer networks and
the Internet to operate, control and monitor their operations. This connection
to the Internet exposes the SCADA system to cyber-attacks. Therefore, there is a
need for a detection system that discovers anomalies that may have occurred on a
system.
Log files record the process activities of the SCADA control system. These
logs can be analysed to detect abnormal process activities, which are treated as
anomalies on the control system. Attacks or anomalies on a system may be
frequent or rare, but this thesis is concerned only with rare anomalies. The main
objective of this thesis is to design and develop a method for detecting
anomalies in SCADA control logs by using a rare sequential pattern mining
technique. In addition, this thesis aims to develop a method for predicting
possible anomalies on a SCADA control system. To achieve the main objective, we
assume that anomalies are a rare phenomenon in a system. We therefore propose and
develop a new rare sequential pattern mining approach to find rare or infrequent
patterns. Since the goal of pattern mining is typically to find the regular
behaviour of a system, rare behaviour is often explicitly ignored and discarded.
Yet rare patterns can provide valuable information indicating anomalous or
unacceptable behaviour of a system. To find effective rare patterns, rare
sequential patterns are placed into groups such that all patterns in a group
share the same frequency, or support, value. The smallest pattern in each group
is the minimal rare sequential pattern, while the largest pattern is the maximal
rare sequential pattern. We evaluated the rare sequential pattern mining method
using SCADA control system log data containing cyber incidents. The identified
rare anomalous sequences were attacks on the system, demonstrating the
usefulness of the rare sequential pattern mining approach.
Next, we used constraints to improve the effectiveness and efficiency of the
proposed rare sequential pattern mining algorithm. The constraints were used
to generate only those rare patterns that are useful for detecting anomalies on
a SCADA system. Efficiency improved because the constrained rare sequential
pattern mining algorithm generated fewer patterns, which reduced the
computational time. These gains in effectiveness and efficiency did not
compromise the anomaly detection accuracy, so security operators need less
effort and time to detect anomalies.
Finally, we developed a sequential association rule mining approach to predict
possible anomalies on SCADA control systems. In this method, we used rare
sequential patterns to generate association rules. These rules were then applied
to streaming logs from the SCADA control system to predict possible anomalies.
The results of this thesis demonstrate that anomalies can be detected from SCADA
control logs by applying our rare sequential pattern mining approach, and that
anomalies on SCADA systems can be predicted by using sequential association
rules.
Contents
Keywords
Abstract
Table of Contents
List of Figures
List of Tables
Declaration
Previously Published Material
Acknowledgements
Chapter 1 Introduction
1.1 Background and Motivation
1.2 Research Problem
1.3 Research Aims and Scope
1.4 Research Contributions
1.5 Structure of the Thesis
Chapter 2 Background and Literature Review
2.1 Background of SCADA Control System
2.2 SCADA Test-bed Scenario
2.3 Anomalies in SCADA Control System
2.4 Anomalies Detection Methods
2.4.1 Signature-based Detection
2.4.2 Anomaly-based Detection
2.5 Data Mining and Machine Learning
2.5.1 Data Mining
2.5.2 Machine Learning Methods
2.5.3 Supervised Learning Method
2.5.4 Unsupervised Learning Method
2.5.5 Semi-supervised Learning Method
2.6 Pattern Mining
2.6.1 Itemset Pattern Mining
2.6.2 Sequential Pattern Mining
2.7 Constraint-based Pattern Mining
2.8 Association Rule Mining
2.9 Existing Anomaly Detection in SCADA System
2.10 Summary and Research Gaps
Chapter 3 A Rare Sequential Pattern Mining Approach for Anomaly Detection
3.1 Introduction
3.2 Definitions
3.3 A New Method For Finding Rare Sequential Patterns
3.3.1 Generating Rare Sequential Generator Patterns
3.3.2 Generating All Rare Sequential Patterns
3.4 Evaluation
3.4.1 SCADA System Architecture
3.4.2 Datasets
3.4.3 Experimental Methodology
3.4.4 Results
3.5 Discussion and Analysis
3.5.1 Effectiveness of Equivalence Class
3.5.2 Computational Complexity
3.6 Related Work
3.7 Summary
Chapter 4 Constraint-based Rare Sequential Pattern Mining
4.1 Introduction
4.1.1 Motivation
4.2 Existing Related Work
4.3 Preliminaries
4.4 Constraint-based Rare Sequential Pattern Mining Algorithm
4.4.1 Generating Constrained Rare Sequential Generator Patterns
4.4.2 Generating Constrained Rare Sequential Patterns
4.5 Experimental Evaluation
4.5.1 Dataset
4.5.2 Pre-processing
4.5.3 Experimental methodology
4.6 Results and Analysis
4.6.1 Conveyor-belt Control System
4.6.2 Pressure Control System
4.6.3 Water Tank Control System
4.7 Discussion
4.8 Conclusion
Chapter 5 A Rare Sequential Association Rules Mining of SCADA Streaming Logs for Anomaly Prediction
5.1 Introduction
5.1.1 Motivation
5.2 Previous Work
5.3 Preliminaries
5.4 A New Anomaly Prediction Method Using Sequential Association Rules
5.4.1 Generating Sequential Association Rules
5.4.2 Prediction of Anomalies using Sequential Association Rules
5.5 Experimental Evaluation
5.5.1 Dataset
5.5.2 Pre-processing
5.5.3 Experimental Methodology
5.6 Results and Analysis
5.6.1 Conveyor-belt Control System
5.6.2 Pressure Control System
5.6.3 Water Tank Control System
5.7 Discussion
5.8 Summary
Chapter 6 Conclusion and Future Work
6.1 Research Summary
6.2 Future Research Directions
Bibliography
List of Figures
2.1 A simplistic view of SCADA control system layout.
2.2 A physical laboratory view of the SCADA test-bed.
2.3 A logical view of SCADA test-bed process control system.
2.4 A data mining approach for information extraction.
2.5 Supervised learning method.
2.6 Unsupervised learning method.
2.7 Semi-supervised learning method.
2.8 A sequence diagram.
3.1 A partial lattice view of a sequential database.
3.2 The positive and the negative border of a lattice of a sequential database.
3.3 An equivalence class of rare sequential patterns.
5.1 Anomaly prediction from streaming logs.
List of Tables
2.1 Transaction database TDB.
2.2 A sequential database SDB.
2.3 A market basket transaction database.
3.1 A sequential database SDB.
3.2 Execution of Algorithm 3.1.
3.3 A partial view of a conveyor belt control system log.
3.4 A partial view of a pressure control system log.
3.5 A partial view of a water tank control system log.
3.6 A partial view of a conveyor belt control system logs from the second dataset.
3.7 A partial view of a pressure control system log from the second dataset.
3.8 A partial view of a water tank control system log from the second dataset.
3.9 A sample of the conveyor belt SDB generated from Dataset-1 in the First Dataset.
3.10 A sample of the pressure control SDB generated from Dataset-2 in the First Dataset.
3.11 A sample of the water tank SDB generated from Dataset-3 in the First Dataset.
3.12 A sample of the rare sequential patterns from conveyor belt SDB in Dataset-1.
3.13 A sample of rare sequential patterns from pressure control SDB in Dataset-2.
3.14 A sample of rare sequential patterns from water tank SDB in Dataset-3.
3.15 A sample of rare sequential patterns from conveyor belt SDB in Dataset-4.
3.16 A sample of rare sequential patterns from pressure control SDB in Dataset-5.
3.17 Comparison among the databases regarding the number of frequent generators our algorithm produced and the number of frequent generators produced by FEAT algorithm.
4.1 A sequential database SDB with events' occurrence time-stamp.
4.2 A partial view of a conveyor belt control logs.
4.3 A partial view of a pressure control logs.
4.4 A partial view of a water tank control logs.
4.5 Confusion matrix.
4.6 A partial view of the conveyor-belt result from the Experiment-1.
4.7 A partial view of the conveyor-belt result from Experiment-2.
4.8 A comparison table showing the number of rare sequential patterns and the computational time taken by the four experiments on the conveyor-belt database.
4.9 A partial view of the pressure control result from Experiment-1.
4.10 A partial view of the pressure control result from Experiment-2.
4.11 A comparison table showing number of rare patterns and time taken by all 4 experiments on pressure control SDB.
4.12 A partial view of the water tank result from Experiment-1.
4.13 A partial view of the water tank result from Experiment-2.
4.14 A comparison table showing number of rare patterns and computational time taken by the four experiments on water tank control system database.
5.1 A sequential database SDB.
5.2 A view of possible rare sequential association rules.
5.3 Examples of sequential association rules from three control systems.
5.4 Anomaly predictions from the three control system streaming logs.
5.5 Anomaly predictions from the three control system streaming logs.
5.6 Anomaly predictions from the three control system streaming logs.
QUT Verified Signature
Previously Published Material
The following articles have been published, and contain material based on the
content of this thesis.
(i) Anisur Rahman, Yue Xu, Kenneth Radke and Ernest Foo. Finding Anomalies
in SCADA Logs Using Rare Sequential Pattern Mining. In 10th International
Conference on Network and System Security, Springer; pages 499–506,
September 28–30, 2016, Taipei, Taiwan.
(ii) Anisur Rahman, Yue Xu, Kenneth Radke and Ernest Foo. A Rare Sequen-
tial Pattern Mining Approach For Anomaly Detection. In the Journal of
Knowledge and Information Security. (Review Submitted)
Acknowledgements
First of all, I express my humble and sincere gratitude to Almighty Allah (SWT),
Who bestowed upon me the knowledge, wisdom, health, and mental strength
to undertake the Ph.D. research and enabled me to complete it. Next, I am
greatly indebted to my wonderful supervisory team, comprising Dr. Ernest Foo,
the principal supervisor, and Associate Professor Yue Xu and Dr. Kenneth Radke,
the two associate supervisors. They spent a lot of time guiding me in our
regular weekly meetings, where we discussed my research updates. Their
consistent guidance, encouragement and constructive feedback on my research
work and writing helped me to achieve my research goals. Thanks to Ernest once
again for bringing information security and data science together in my
research, which helped me to expand my knowledge domain. I am equally thankful
to Professor Yue for guiding me to learn the data science theory, tools and
techniques needed for my research. Also, thanks to Kenneth for all the tips
while implementing the algorithms.
I would like to thank Professor Yuefeng Li and Dr. Matthew McKague for serving
as panel members from outside my supervisory team at my Ph.D. final seminar. I
am grateful that they took the time to read my thesis and give their valuable
comments. Also, thanks to the external examiners for spending their time reading
my thesis and providing constructive suggestions to improve its quality. In
addition, I am thankful to the Queensland University of Technology (QUT) for
awarding me the QUT Postgraduate Research Award (QUTPRA), the QUT Higher Degree
Research Tuition Fee Sponsorship, and Conference Travel Support. It would have
been impossible to start my Ph.D. journey without the scholarship support.
Furthermore, I would also like to thank the unit coordinators, Dr. Ernest Foo,
Associate Professor Yue Xu, Dr. Leonie Simpson, Dr. Matthew McKague, Dr. Wasana
Bandara and Prof. Yanming Feng, who showed confidence in me and gave me the
opportunity to work as a sessional academic at QUT. This helped me to gain
academic teaching and learning experience in addition to providing me with
financial support.
I would like to thank my colleagues and friends in the Information Security
Discipline at QUT. I had the opportunity to learn independent and collaborative
behaviour and to develop communication skills during the course of my Ph.D.
research candidature. As a token of my appreciation, I need to mention their
names: Nicholas Rodofile, Hassan Fareed M Lahza, Iftekhar Salam, David Myers,
Jack Parry, Hassan Musallam Ahmed Qahur Al Mahri, Basker Palaniswamy, Udyani
Shanika Kumari Herath Mudiyanselage, Mir Ali Rezazadeh Baee, Shriparen
Sriskandarajah, Mukhtar Hussain, Chathurika Pavithrani Kumari, Niluka
Amarasinghe, Tarun Bansal, Qinyi Li, James A. Akande and Chris Djamaludin.
Moreover, I extend my thanks to my roommates who kept me company in my office,
GP S1051: Shane Black, Vikal Achrya, Udyani and Fida. I also thank those whom I
used to meet in the lobby of my office for their smiling faces and their
interest in my research.
Finally, I would like to express my gratitude to my parents for their struggle
and dedication to helping their children become educated and good human beings.
May Allah (SWT) bestow His blessings and give continuous reward to both of
them. In honour of my parents' struggle and dedication, I dedicate this thesis
to them. Last but not least, my heartfelt thanks go to my wife for her love and
belief in me; to my children, who are gifts from Allah (SWT); to my younger
brothers, whom I love dearly; and to my mother-in-law, sisters-in-law,
brother-in-law, uncles, aunties and all other family members for their love,
affection, patience, tolerance and selfless support throughout my Ph.D.
research. There are many relatives, friends and well-wishers whose names are too
many to be mentioned here, yet I am greatly thankful to them for their constant
affection, encouragement and support.
Dedicated to my parents
Chapter 1
Introduction
Supervisory Control and Data Acquisition (SCADA) control system networks are
widely used in applications involving important national critical
infrastructure such as nuclear power plants. Any attack on, or malfunction of,
this critical infrastructure can have serious consequences for people, the
environment and the industries linked to it. There is therefore a need to
protect critical infrastructure networks by defending them from unwanted
incidents. However, defending is not always possible. Traditional SCADA control
systems worked in isolation, meaning they were not connected to the internet.
These systems used proprietary software and hardware, which kept them more
secure because it was hard to learn how the vendor-specific software and
hardware in the SCADA system operated. In this way the operational information
was obscured from anyone outside the system; in other words, the system
provided security through obscurity.
However, modern SCADA systems are connected to the internet, which exposes
them to external networks. In addition, they now use off-the-shelf hardware and
software for their operation. As a result, information about SCADA hardware and
software can be obtained, which allows an attacker to conduct a cyber-attack on
the system. For example, the most prominent attack on a SCADA system was the
Stuxnet malware, which in 2010 hit a nuclear power plant in Iran, causing
industrial damage [1].
Furthermore, as the configuration of a SCADA system is typically not changed
frequently, its software is not updated. This is because a SCADA system is
designed to be functional for at least 20 years. Its functionality is simple and
it has a limited number of operations, which distinguishes a SCADA system from
conventional IT networks. Because of this simplicity, the configuration of the
system is seldom updated, which leaves the system vulnerable to cyber-attack.
Security incidents and attacks are expected to increase. As legacy systems,
SCADA networks were not designed to work securely in an environment such as the
internet. Therefore, defending SCADA systems from cyber-attack is difficult.
Hence, a detection method is required to find anomalies or intrusions on the
SCADA system. Once an unwanted incident is detected, reactive measures can be
taken to reduce the consequences for the system. Detection is the primary
motivation of this research.
The rest of this chapter is organized as follows: Section 1.1 presents the
background and motivation of this research, Section 1.2 discusses the research
problem, Section 1.3 defines the research aims and scope, Section 1.4 presents
the contributions of this research, and finally Section 1.5 concludes the
chapter by outlining the general structure of this thesis.
1.1 Background and Motivation
Anomaly detection is one of several diverse measures that can be applied to
protect critical infrastructure control system networks such as SCADA control
systems. A control system consists of devices such as Remote Terminal Units
(RTUs), Programmable Logic Controllers (PLCs), Intelligent Electronic Devices
(IEDs) and various sensors that connect physical devices to computer networks
for remotely monitoring and supervising critical infrastructure. There are
different types of control systems based on their application areas, for
example Process Control Systems (PCSs), Supervisory Control and Data Acquisition
(SCADA) systems, Distributed Control Systems (DCSs), and Building Management
Systems (BMSs) [2] [3].
The US government has defined transportation, oil and gas production and
storage, water supply, emergency services, government services, banking and
finance, electrical power, and telecommunications as critical infrastructures
[4]. However, we argue that different countries may have different policies for
declaring infrastructure critical. Critical infrastructures are closely
interdependent. Therefore, if one of these establishments is affected, other
dependent infrastructures are affected in a cascading effect [5]. These critical
infrastructures are so vital that their inactivity or destruction could affect
human lives, the environment, and the economy.
In recent times, SCADA systems have increasingly been used to monitor and
control the process activities of critical infrastructures. Modern SCADA systems
have come of age through several revolutionary steps. SCADA systems now use
conventional IT technology as a backbone to communicate with field devices.
However, in the early stages of development these systems were operated in
isolation, meaning they were not connected to external networks. Moreover, the
system was vendor-centric, which means that it was operated only by software and
hardware manufactured by a specific vendor. It was therefore difficult to
inflict an attack on the system because little information about the system was
available. In other words, security was ensured by obscurity.
Since modern SCADA systems use off-the-shelf technology, they are prone to
cyber-attacks. This is because adversaries can easily find tools and techniques
for conducting successful attacks on the infrastructure [4]. Further, the
sophistication of hacking tools is growing while the technical knowledge an
intruder needs to cause harm is decreasing [6]. The impact of a cyber-attack on
control systems varies and can range from destruction of infrastructural assets
and the environment to the loss of human lives [7].
While most cyber-attacks on SCADA networks remain undisclosed or are
categorized as classified by government agencies and industries, some prominent
malicious attacks have been publicized in the literature. The first recorded
cyber-attack, in 1982 at the Trans-Siberian gas pipeline in the Soviet Union,
caused a huge explosion and a fire that was visible from space [8]. In 1998, a
12-year-old hacker managed to gain access to the Theodore Roosevelt Dam in
Arizona by taking control of the computer that controls the floodgates of the
dam. It was speculated that if the gates had been opened, the cities of Tempe
and Mesa would have been flooded [9]. In 2003, the Davis-Besse nuclear plant in
Ohio was disabled by the Slammer worm for several hours [10]. In 2010, the
Stuxnet computer worm attacked a nuclear facility at Natanz in Iran, which
caused power centrifuges to fail.
Since SCADA systems use off-the-shelf technology, meaning commercially
available hardware and software, such attack incidents are expected to grow in
the years to come. Some attacks have no established motivation and yet can do
catastrophic damage, while others are non-critical and cannot cause a
catastrophic failure [11]. A report from Security Intelligence by Scott Koegler
[12] states that the cyber-threat has been increasing over time: the 2017
report shows 90 million more intrusions than in 2016. Another report, published
by IBM Managed Security Services (MSS) [13], shows that the number of Industrial
Control System (ICS) attacks, which can be defined as disruptions of the process
activities of critical infrastructure controlled by SCADA control networks, was
on the rise from 1st January 2013 to 30th August 2015.
As the number of attacks on ICS increases, the consequences of these attacks
demand constant monitoring to detect anomalies or intrusions in the control
system network. This research aims to investigate SCADA control logs. Logs were
chosen because SCADA process activities are recorded in log files. Any evidence
of successful or attempted unauthorized access or intrusion into the system may
be recorded in log files, which can be analyzed to detect anomalies or
intrusions. We assume that the integrity and availability of logs are ensured,
meaning logs cannot be tampered with. Hadžiosmanović et al. [14] state that,
compared to logs used in other domains, SCADA logs are in a good format to
analyze. Garitano et al. [15] state that SCADA control systems have
deterministic communication and activities that are, in most cases, limited and
recurrent. Any activity that is not recurrent could be a rare pattern or
phenomenon and hence an anomaly in the system. However, finding rare patterns in
SCADA logs is challenging for the following reasons:
(i) There is a lack of existing knowledge about rare sequential pattern mining
methods for finding rare suspicious patterns in SCADA control logs. To the
best of our knowledge, there is no prior research on finding rare sequential
patterns. Although a few works aim to find rare itemset patterns, these
methods do not preserve the order of the events. Preserving event order is
important in SCADA control systems because events occur sequentially.
However, keeping the order of events is difficult and costly while generating
rare patterns from the SCADA control logs, because the large number of
combinations of events is expensive in computational time and search space.
(ii) It is important to conduct a forensic investigation of the SCADA system to
detect anomalies once they have occurred. However, due to the large volume of
logs generated by the SCADA control system, it is difficult to find anomalies
manually. Therefore, it is important to develop an efficient algorithm that
automatically detects anomalies by analysing the logs. However, no previous
algorithm has addressed this problem using a rare sequential pattern mining
approach.
(iii) It is also important to make an early prediction of a possible anomaly
in a live SCADA control system, because an early prediction may make it
possible to avert an attack on the system. However, predicting anomalies in a
live SCADA system is difficult, as it requires analysing streamed logs as they
are generated.
Therefore, this research applies pattern mining techniques, a branch
of data mining, to extract hidden and useful information about the
system activities or behavior from the logs. Among the pattern mining techniques,
this research specifically uses sequential pattern mining as it
operates on a sequential database to extract hidden and useful patterns. Frequent
patterns represent normal or expected behavior of a system. Sequences which
occur rarely in a system are called rare or infrequent sequential patterns. In this
research, it is assumed that anomalies or attacks happen very rarely in a system.
Therefore, this thesis introduces a rare sequential pattern mining technique to
find rare events that represent anomalous events or attacks on a system.
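The distinction between frequent and rare sequential patterns can be illustrated
with a minimal Python sketch. It assumes that a sequential database is a list of
event sequences, that the support of a pattern is the fraction of sequences that
contain it as an ordered (not necessarily contiguous) subsequence, and that a
pattern is rare when it occurs but its support does not exceed a chosen threshold.
The event names, the function names and the threshold `max_support` are
hypothetical and are not taken from the algorithm proposed in this thesis.

```python
from itertools import combinations

def is_subsequence(pattern, sequence):
    """Return True if pattern occurs in sequence with the event order preserved."""
    it = iter(sequence)
    return all(event in it for event in pattern)

def support(pattern, database):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database) / len(database)

def rare_patterns(database, length, max_support):
    """Enumerate subsequences of the given length whose support is at most max_support.

    Brute-force enumeration for illustration only; the cost grows quickly
    with the pattern length and the number of distinct events.
    """
    candidates = {c for s in database for c in combinations(s, length)}
    return {p: support(p, database) for p in candidates
            if support(p, database) <= max_support}

# Hypothetical event names: three normal cycles and one deviating cycle.
database = [
    ["pump_on", "level_high", "pump_off"],
    ["pump_on", "level_high", "pump_off"],
    ["pump_on", "level_high", "pump_off"],
    ["pump_on", "threshold_changed", "pump_off"],
]
rare = rare_patterns(database, length=2, max_support=0.25)
```

In this toy database only the two ordered pairs involving `threshold_changed`
are reported as rare. Because `is_subsequence` respects event order, the pair
(`pump_off`, `pump_on`) is not treated as equivalent to (`pump_on`, `pump_off`),
which is the property that itemset mining does not provide.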
1.2 Research Problem
The SCADA control system network keeps records of process events or activities
in log files. Each individual event of the control system is recorded and tagged
with a timestamp. The timestamps indicate when the events
occur in the system. As a result, the recorded logs resemble a sequence of events
that represent the process activities of the SCADA control system. The logs
record the normal process activities as well as unexpected abnormal process
activities. The normal process activities produce the expected outcome of
the control process, while the abnormal process activities include disruption in
the process activities resulting in an unexpected faulty outcome. The abnormal
process activities or anomalies could emanate from a natural failure or malfunc-
tion of the system. In addition, the abnormal process could be the result of a
cyber-attack on the system.
The analysis of SCADA control logs could help us to find anomalies in the
system. It is assumed that the anomalies this research addresses rarely happen
in a system in comparison to the regular activities of the system. Although many
attacks leave frequent records, we concentrate on rare events. Therefore, by
analysing the logs we can find rare activities of the SCADA control system. Since
the SCADA logs are large, it is difficult to analyse them manually to identify
anomalies. Data mining methods can be used to analyse these large log data
to extract useful information, that is, rare patterns to detect anomalies. There
has been much research in intrusion detection using signature-based detection.
However, this method cannot detect unknown or zero-day attacks, although
it can provide a high detection rate and a low false alarm rate. On the other hand,
there has been little work in anomaly-based or behaviour-based detection. This
method can detect not only known attacks, but also unknown zero-day
attacks. However, it generates a high false alarm rate.
The SCADA activities and topology usually do not change very frequently.
In other words, the actions and the system's behaviour remain almost predictable
[16]. If any action deviates from the normal or expected behaviour of the system,
then this action can be considered an anomalous event and deserves further
investigation. Hadžiosmanović et al. [17] used water treatment SCADA logs to
detect anomalous events by applying a data mining approach. They used itemset
pattern mining to find a single infrequent event as an anomalous event. However,
this method cannot find a sequence of rare anomalous events because it
does not consider the order of events. The order of the events is important
because the events are recorded in a sequential manner in the log file. In addition,
the order of events can change the process activities and hence the outcome.
Moreover, the method of Hadžiosmanović et al. [17] cannot make a prediction
regarding possible incoming anomalies in the SCADA system. The above-mentioned
scenario has led us to the following research problem: How can we analyse SCADA
control logs to design and develop an anomaly detection method based on rare
sequential patterns, and how can we predict possible anomalies in the SCADA control
system?
1.3 Research Aims and Scope
The primary aim of this research is to detect anomalies by analysing SCADA
control logs. It is assumed that anomalies are unexpected rare events or activities
compared to the regular activities of a system. As a common practice, activities
involving the operations of a system are recorded in a file, such as a system log.
This is primarily done to detect or trace system faults, which could be caused
by natural failure of the system or by an intruder conducting
cyber-attacks on the system. To achieve the primary aim, that is, the detection
of anomalies in a SCADA system, this research sets the following three
objectives:
(i) To design and develop a method for finding anomalies that are rare in
SCADA control systems. The process events or activities of SCADA control
systems are mostly definitive and repetitive. This means a set of events is
performed in a repetitive manner to complete a process. Also, the events
are conducted in a sequential manner. So, any rare event or sequence of
events would be a deviation from the normal or regular behaviour profile
of the SCADA control system. These rare events could be considered as
anomalies in the system. Therefore, there is a need to develop an algorithm
that can identify rare events from the SCADA control logs.
(ii) To improve the efficiency of the rare sequential pattern mining algorithm
without losing accuracy by introducing constraints. This objective aims to
improve the efficiency of the proposed rare sequential pattern mining algorithm
by generating fewer rare sequential patterns through the removal of unimportant
rare patterns. As a result, the reduced number of rare patterns
can reduce the computational time. The smaller number of rare patterns
can be achieved by integrating constraints into the rare sequential pattern
mining algorithm. This objective then checks the accuracy of the
constraint-based rare sequential pattern mining algorithm once
the efficiency is improved. That is, the research verifies the accuracy
of the constraint-based rare sequential pattern mining algorithm in terms
of identifying anomalies from the reduced number of rare patterns. The research
also aims to verify whether any overhead is added to the complexity of the
constraint-based rare sequential pattern mining algorithm. Finally, the research
will check how the inclusion of constraints in the rare sequential pattern
mining algorithm affects the number of false positives.
(iii) To provide an anomaly prediction method that can extend the work of the
rare sequential pattern mining algorithm. This objective aims not only to
detect anomalies, but also to predict anomalies in the SCADA system.
The concept of association rules can be used to predict future events in a
SCADA system. The prediction can be done on the live or streaming logs.
If incoming events in the streaming logs match the antecedent of an association
rule generated from rare sequential patterns, it can be predicted
that the remaining events of the rule may occur in the future.
The benefit of this approach is that the system security operators could
be alerted about incoming anomalies or attacks in the system before they
occur.
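The prediction idea in objective (iii) can be sketched as follows. The sketch
assumes that each rule is a pair (antecedent, consequent) of event tuples derived
from a rare sequential pattern, and that an antecedent is matched when its events
are the most recent contiguous events in the stream. Both the rule format and the
contiguous matching are simplifying assumptions made here for illustration, and
the event names are hypothetical.

```python
from collections import deque

def predict_from_stream(stream, rules):
    """Yield (position, antecedent, predicted consequent) whenever the most
    recent events in the stream equal the antecedent of a rule."""
    longest = max(len(antecedent) for antecedent, _ in rules)
    window = deque(maxlen=longest)  # sliding window over the latest events
    for position, event in enumerate(stream):
        window.append(event)
        recent = tuple(window)
        for antecedent, consequent in rules:
            if recent[-len(antecedent):] == antecedent:
                yield position, antecedent, consequent

# Hypothetical rule and stream.
rules = [(("valve_close", "threshold_changed"), ("pressure_high", "alarm"))]
stream = ["compressor_on", "valve_close", "threshold_changed", "pressure_high"]
alerts = list(predict_from_stream(stream, rules))
```

An alert is raised at the position where the antecedent completes, that is,
before the events of the consequent have been observed, which is what would allow
an operator to be warned in advance.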
1.4 Research Contributions
The main contribution of this research is to find rare sequential patterns from
a sequential database. To the best of our knowledge, this is the first approach
in the literature to show that anomalies can be detected using rare sequential
patterns. The following research contributions are presented based on the research
background and motivation, aims, scope, and objectives of the research discussed
in this chapter.
• Contribution 1: The first contribution of this thesis has been to design
and develop a novel method of generating rare sequential patterns from a
sequential database. This method, which is presented in Chapter 3, determines
whether rare sequential patterns can be used to detect anomalies
by analysing SCADA control logs. The chapter also analyses whether the
shortest-length, or minimal, rare sequential patterns are effective in comparison
to the maximal, or largest-length, rare sequential patterns regarding the
detection of anomalies. This is because the minimal rare patterns manifest
the starting point of an anomalous pattern while the maximal rare patterns
give the complete scenario of the anomalies.
• Contribution 2: The second contribution of this research has been to
improve the efficiency of our proposed rare sequential pattern mining algorithm.
The efficiency is improved by integrating constraints into the proposed
algorithm, as presented in Chapter 4. These constraints are the time-span
constraint, the feature reduction constraint, and the algorithmic constraint.
The time-span constraint is used so that only the significant patterns are
discovered. The feature reduction constraint is used to reduce the number of
unique events in the database so that a smaller number of rare sequential
patterns is generated. The feature reduction constraint also reduces
computational time because fewer rare patterns are generated. Finally,
the algorithmic constraint is used to avoid unnecessary database scanning,
which further reduces the computational time. Among these constraints,
the time-span constraint and the feature reduction constraint are used in
the data pre-processing stage, while the algorithmic constraint is used within
the rare sequential pattern mining algorithm. The purpose of adding these
constraints is to reduce the number of rare sequential patterns so that
anomalies can be identified from a smaller number of rare sequential patterns
with less computational time. It is possible that other constraints
could also reduce the number of rare sequential patterns and the
computational time, but in our experiments we used the above constraints
for the solution. This method also ensures that the accuracy of the algorithm is
not degraded while improving the efficiency. In other words, the constraint-based
rare sequential pattern mining algorithm does not sacrifice accuracy
compared to the rare sequential pattern mining algorithm without constraints.
Finally, this constrained method reduces the false positives in terms of anomaly
detection.
• Contribution 3: The third contribution of this research has been to design
and develop a method to predict possible anomalies on SCADA streaming
logs. This anomaly prediction method is presented in Chapter 5. This
method builds on the proposed rare sequential pattern mining algorithm
to generate sequential association rules. In our experiment, we used the
longest-antecedent association rules, although variable-length antecedent
association rules can be generated. The variable-length association rules
generate a large number of anomaly predictions because their shorter antecedents
are frequently found in the streaming logs. Moreover, variable-length association
rules generate redundant rules, which also contribute to the large number
of anomaly predictions. This is because the shorter antecedents are
subsequences of the longest antecedents. As a result, for a single anomalous
pattern many predictions occur in the streaming logs. To reduce the
number of possible anomaly predictions and remove redundant rules, we
used the longest-antecedent rules. These association rules are then used to
predict possible incoming anomalies once the antecedent of a rule is found
in the streaming logs. This method also detects an anomaly if the predicted
events subsequently occur in the streaming logs.
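The redundancy of variable-length antecedent rules described in Contribution 3
can be seen in a small sketch. It assumes that rules are formed by splitting a
rare sequential pattern into a prefix (the antecedent) and the remaining suffix
(the consequent); this splitting scheme, the function names and the event names
are hypothetical and serve only to illustrate the idea.

```python
def variable_length_rules(pattern):
    """All prefix -> suffix splits of a pattern, one rule per split point."""
    return [(tuple(pattern[:i]), tuple(pattern[i:])) for i in range(1, len(pattern))]

def longest_antecedent_rule(pattern):
    """Only the split with the longest antecedent, leaving a single-event consequent."""
    return tuple(pattern[:-1]), tuple(pattern[-1:])

# Hypothetical rare sequential pattern.
pattern = ["valve_close", "threshold_changed", "pressure_high", "alarm"]
all_rules = variable_length_rules(pattern)
best_rule = longest_antecedent_rule(pattern)
```

Here `all_rules` holds three rules for one pattern, and every shorter antecedent
is a prefix of the antecedent of `best_rule`. The rule whose antecedent is the
single event `valve_close` would fire whenever that event appears in the stream,
including during normal operation, which is how shorter antecedents inflate the
number of predictions.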
1.5 Structure of the Thesis
The general structure of this thesis is as follows. Chapter 2 presents
the background and literature review of this research. The contributions of this
research are presented in Chapters 3, 4, and 5. The research conclusion and
future research directions are described in Chapter 6.
The following section presents brief overviews of each of the above-mentioned
chapters of this thesis.
• Chapter 2 (Background and Literature Review): This chapter describes
the background of this research and relevant works pertaining to anomaly
detection in the literature. This chapter starts with a generic view of
the SCADA control system along with the laboratory test-bed setup of an
industry-scale SCADA control system. The data mining and machine learning
approaches used for anomaly detection are also discussed. This
chapter also discusses the existing anomaly detection methods in SCADA
control systems. Further, this chapter explains the reason for choosing
rare sequential pattern mining as an anomaly detection method. Finally,
this chapter concludes with the identification of the research gaps in the
literature.
• Chapter 3 (A Rare Sequential Pattern Mining Approach for Anomaly
Detection): This chapter presents the first contribution of this research.
In this chapter, a novel algorithm for rare sequential pattern mining is
proposed. This chapter shows that this method can be used to detect
anomalies from a sequential database. To detect anomalies, this method
is used to analyse SCADA control system logs. This chapter also
explains that, depending on the application domain, the size of the rare
patterns can play an important role in identifying anomalies in a system.
The minimal-sized rare pattern indicates the beginning of an
anomalous pattern while the maximal-sized rare pattern represents the
entire scenario of an anomalous pattern.
• Chapter 4 (Constraint-based Rare Sequential Pattern Mining): This chapter
presents the second contribution of this research. This chapter is concerned
with improving the performance of our proposed rare sequential
pattern mining algorithm presented in Chapter 3. The performance improvement
was achieved by introducing a constrained rare sequential pattern mining
algorithm, which improved the efficiency while not degrading the accuracy
compared to the unconstrained rare sequential pattern mining algorithm
of Chapter 3.
• Chapter 5 (Sequential Association Rules Mining for Anomaly Prediction):
This chapter presents the third contribution of this research. Here, the
chapter discusses how sequential association rules generated from rare sequential
patterns can be used to make predictions of possible anomalies in
a SCADA control system. This prediction can be done on a live system
by analysing streaming SCADA control logs.
• Chapter 6 (Conclusion and Future Work): This chapter concludes the
thesis by summarising the research contributions discussed in Chapters
3, 4, and 5. In addition, this chapter outlines some of the research
problems that remain open for researchers to extend this research work.
Chapter 2
Background and Literature Review
This chapter presents an overview of Industrial Control Systems (ICSs) such
as Supervisory Control and Data Acquisition (SCADA) systems and their components,
architecture and applications in Section 2.1. Section 2.2 discusses the SCADA
test-bed scenario. Section 2.3 discusses anomalies in SCADA control systems.
Section 2.4 discusses anomaly detection methods. Section 2.5
presents data mining and machine learning approaches for the detection of anomalies.
In Sections 2.6 through 2.8, we present pattern mining, constrained pattern
mining and association rule mining techniques, respectively. Section 2.9 provides
the state of the art in research on anomaly detection in the SCADA
control system. Finally, Section 2.10 draws the conclusion of this chapter.
2.1 Background of SCADA Control System
Information technology is connecting or bringing physical devices into a computer
network system. As a consequence, computer-based control systems have
grown to monitor and control machinery and industrial processes from remote
geographical locations. These computer-based control systems can be classified
into different categories considering their application areas, such as Supervisory
Control and Data Acquisition (SCADA), Process Control Systems (PCS), Distributed
Control Systems (DCS) and Cyber-Physical Systems (CPS) [3]. All of these
control systems are called Industrial Control Systems (ICSs) because they
are used to monitor and control the process activities of different
industries.
There are various SCADA applications in ICSs, such as electricity, gas
and oil pipeline distribution, water utilities and transportation networks.
The infrastructure of these networks can extend over large
geographical areas. Therefore, there is a need to monitor and control the process
activities of these infrastructures from a remote location. Among the different
controlling networks, SCADA is widely used in the electricity distribution sector [18].
The use of SCADA in power distribution systems started in the 1960s and has
been gradually evolving with the development of newer technologies [19].
Figure 2.1: A simplistic view of SCADA control system layout.
A general simplistic SCADA diagram is shown in Figure 2.1. It is composed
of three main sections comprising both hardware and software. In other words,
a SCADA system is composed of physical and logical components [20]. The
left part of Figure 2.1, comprising the supervisory systems and Human Machine
Interface (HMI), is considered the control center. The middle section is composed
of the communication backbone and protocols that connect the control center with
the field devices on the far right, such as Remote Terminal Units (RTUs),
Programmable Logic Controllers (PLCs) and Intelligent Electronic Devices (IEDs).
The RTUs collect the data and convert them to digital signals that are relayed
to the control center (Supervisory Systems and HMI). The SCADA control system is
considered the hub or nerve center that controls the critical infrastructures.
The HMI device
is controlled by software that allows the control system operator to monitor the
process and events and to react to the situation if there is an emergency. The
supervisory system comprises computer servers, such as the data historian and the
Master Terminal Unit (MTU), that collect, process and log the data sent by the field
devices such as RTUs, PLCs and IEDs. Further, the supervisory unit monitors
and sends commands to control the processes on the field devices. The communication
infrastructure is composed of different links, such as radio frequency,
telephone line and fiber, over which communication protocols run to pass
information to and from SCADA devices.
2.2 SCADA Test-bed Scenario
To conduct the experiments in this research, we use our laboratory's industry-scale
SCADA control system test-bed network, meaning that the test-bed network
is representative of an industrial-strength SCADA system. There are several
reasons why this research needs a SCADA test-bed control system rather
than a real-life control system. At the beginning of this research, we collected
some control logs from a SCADA-controlled electrical substation. After analysing
the logs, we found that the control system records the process activities in
a log file only when there is a malfunction in the system. In other words, all the
control system logs are error logs. The error logs could result from a genuine
failure of the system or from a cyber-attack. However, there is no indication as
to which logs are from natural system failures and which logs are from a
cyber-attack, if there is an attack on the system. Sometimes the system operators
do not even know whether their system has been compromised by a cyber-attack. Further,
the system never recorded the regular control process activities in the log file.
Since this research aims to find anomalies or abnormalities in the SCADA control
system, there is a need to have both the normal activity logs and anomalous logs
to evaluate the experimental results. Therefore, the logs collected from the
real-life electrical substation control system are not suitable for the experiments
of this research. Also, it is not feasible to conduct attacks on a real-life SCADA
control system to generate datasets. The reason is that the output of the process
control system could be disrupted if attacks are carried out. Further, even if
the administrators of the control system allow attacks on their system,
their organization's policy may prevent them from sharing the dataset, because
they may not want to disclose their system's weaknesses to the public. Therefore,
there is a need
to have a SCADA control system test-bed where attacks can be conducted that
generate datasets suitable for research experiments and to validate the anomaly
detection experimental results.
The SCADA test-bed is designed with three individual physical control systems:
the conveyor belt, pressure control and water tank systems. A physical
laboratory view of the test-bed is shown in Figure 2.2. The pressure control
system is on the right side of the figure, labelled with a circled (1).
The conveyor belt control system is in the middle of the figure, labelled
with a circled (2). Finally, the water tank control system is
on the left side of the figure, labelled with a circled (3).
Figure 2.2: A physical laboratory view of the SCADA test-bed.
The water tank control system consists of two tanks, the lower tank and the
upper tank. A water pump is used to fill the upper tank by transferring water
from the lower tank. The current water level in the upper tank is measured by a
sensor. The water in the upper tank rises from a lower threshold value to a
higher threshold value. Once the water level reaches the higher threshold value,
the water starts receding until it reaches the lower threshold value. Gravity
allows water in the upper tank to move back into the lower tank. This process
continues to repeat for a defined period of time.
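The repetitive fill-and-drain cycle is what makes the resulting logs predictable.
A minimal simulation of the cycle is sketched below, assuming arbitrary threshold
values, a level change of one unit per time tick and hypothetical event names;
none of these values come from the test-bed configuration.

```python
def simulate_water_tank(low, high, steps, start=None):
    """Simulate the upper tank level and return the logged events.

    The pump fills the tank until the level reaches `high`; the tank then
    drains by gravity until the level falls back to `low`, and the cycle repeats.
    """
    level = low if start is None else start
    pumping = True
    events = []
    for tick in range(steps):
        level += 1 if pumping else -1
        if pumping and level >= high:
            pumping = False
            events.append((tick, "upper_threshold_reached"))
        elif not pumping and level <= low:
            pumping = True
            events.append((tick, "lower_threshold_reached"))
    return events

events = simulate_water_tank(low=2, high=5, steps=12)
```

Under these assumptions the logged events strictly alternate between the two
threshold events at regular intervals, so any other ordering or spacing would
stand out as a deviation from the regular behaviour.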
The conveyor belt is a bi-directional control system that separates light objects
from dark objects on a moving conveyor belt. Two sensors are used
to sort the objects on the conveyor belt. The first sensor is used to detect
an object on the conveyor belt and the second sensor is used to detect the color
of the object. The objects are collected in two different directions, left and
right, based on the colors of the objects, which are detected by a sensor
built into the control system.
Figure 2.3: A logical view of SCADA test-bed process control system.
Finally, the pressure control system pressurises an object to a certain upper
threshold pressure value measured in pounds per square inch (PSI). The pressure
control system is connected to an air compressor. The air pressure inside the
pressure control pipeline increases to a predefined upper threshold value. Once
the pressure level reaches the upper threshold value, a solenoid valve is opened
to release the air and drop the pressure to a lower threshold value. When the air
pressure reaches the lower threshold value, the solenoid closes the valve and the
compressor starts to build up the pressure in the pipeline of the pressure control
system again. This process of
building and then releasing pressure continues for a defined time period.
The logical layout of the SCADA test-bed process control network is shown
in Figure 2.3. Each individual control system is attached to an industry-standard
ICS device, such as a Siemens S7-1200 PLC. The three control systems are
connected to a master PLC that aggregates the logs produced by the three control
systems. Each control system is monitored and controlled by a Human Machine
Interface (HMI) connected to the process control network of the test-bed. The
HMI also pools the logs from the PLCs connected to each control system. The
three control systems and the master PLC are connected to a switch of an Ethernet
network. The HMI and a Personal Computer (PC), referred to as the attack PC, are
also connected to the process control network. The attack PC is used to conduct
attacks that disrupt the process activities of the control system while generating
the anomalous logs.
The three control devices could be attacked to disrupt the normal process
activities by running a Python script on the attack PC. For example, the conveyor
belt sorting direction could be changed so that the white object on the belt is sorted
to the site where the dark object is being collected, or the dark object is
sorted to the site where the white object is being collected. Another example
is to change the pressure control system's upper and lower threshold values
from the defined set values. If the upper threshold value is changed to a very
high value, the pipe could burst or the compressor could explode.
2.3 Anomalies in SCADA Control System
The SCADA system is usually vulnerable to physical as well as cyber-attacks,
and attacks on this system have increased manyfold since the start of the 21st
century [18]. The main security concern of the ICS network is that the protocols
and the devices cannot withstand cyber-attacks. This is because the security goals,
that is, CIA (Confidentiality, Integrity and Availability), are not fully addressed.
The ICS networks are not changed frequently compared to conventional IT networks.
Once an ICS network is configured, the set-up remains almost unchanged for
many years. This is because the functionalities performed by the ICS networks
are definitive and do not require frequent changes to the system. As a result,
the control system software is not updated, which leaves the system vulnerable.
Furthermore, modern ICS networks are connected to the internet, which exposes
the control system to cyber-attacks. Firewalls alone cannot protect the ICS control
system. So, there is a need for an intrusion detection system (IDS) which can
detect cyber-attacks that have occurred on the system.
Although the importance of CIA is the same for both traditional or standard
IT networks and ICS networks, their security concerns or implementation priorities
are different. For standard IT networks, the priorities are protecting the data
(providing confidentiality), ensuring correct commands (maintaining integrity) and
keeping the number of interruptions low (availability of resources). In other
words, CIA is the order of importance. In contrast, for ICS networks the priority
is ensuring correct commands (maintaining integrity), reducing interruptions
(availability of resources) and protecting the data (providing confidentiality),
that is, integrity, availability and confidentiality (IAC). Moreover, the security
issues are traditionally defined or formulated by the ICS organizations to support
their individual goals. Therefore, security designed and applied to one
infrastructure cannot be fully implemented for other infrastructures. Security
techniques or policies that protect standard IT networks cannot be adopted for ICS
networks [21] because of the differences between these two networks (standard IT
and ICS). These differences include system characteristics, system maintenance and
upgrading, security practices, security countermeasures and the impacts of
cyber-attacks on the systems. Among these, the consequences of a successful
cyber-attack are more severe for ICS networks than for standard IT networks
considering the cost of damages involved with the system. Therefore, we argue that
ICS networks or control system networks (SCADA) need critical-infrastructure-specific
security strategies to safeguard the system. Because ICSs are legacy systems,
introducing new protocols is difficult and expensive. Instead, anomaly or intrusion
detection and firewalls can be added.
2.4 Anomalies Detection Methods
Malicious events or intrusions can be detected using anomaly detection techniques
[22]. Anomaly-based detection was originally proposed by Denning [23] and since
then this method has been used in computer security for intrusion detection [24].
In general, algorithms used for anomaly or intrusion detection need
normal operation data, called labeled data, to build a training model.
These algorithms generally consider anomalies as patterns that have not been seen
before in the normal behavioral patterns of a system [23] [24]. The signature-based
and anomaly-based intrusion detection techniques were originally
used in traditional IT (Information Technology) networks. Later these detection
techniques were gradually adapted to the SCADA control system
network [25]. The operational activities of SCADA are not completely similar to
those of traditional IT, and the SCADA control system is used to monitor and control
the process activities of another infrastructure, that is, the Industrial Control
Systems (ICSs). Manganaris et al. [26] showed that frequent behaviour over a long
period of time could be considered as normal behaviour of a system. Therefore,
the absence of a frequent event or set of events can be considered as an anomaly.
For example, a specific alarm that occurs every minute is normal, whereas a sudden
burst of alarms that has never happened before is more suspicious than the
frequent alarms. Clifton et al. [27] applied a sequential association mining
technique to identify the normal behaviour of a system based on the frequent
occurrence of a sequence of alarm events, which were later filtered out from the
suspicious event lists. Julisch and Dacier [28] used pattern mining episode rules
and clustering techniques to reduce irrelevant alarm signals using false positives
from historical alarms. The authors first discover the patterns of false positives
and later remove the false positive patterns from the possible anomaly alarms, which
helps to reduce the large number of alarms. These methods depend on signature-based
rules, which inherently lack the ability to identify new attacks. Data mining
based anomaly detection techniques can be used for both signature-based and
anomaly-based detection [29] [24].
2.4.1 Signature-based Detection
The signature-based method uses prior knowledge of attack signatures that define
the patterns of an attack. Fan et al. [30] used a signature-based ANN
classifier to detect malicious sequential patterns from a sequence of machine
instructions. Hadžiosmanović et al. [17] applied a semi-automated approach to analyse
water treatment SCADA logs. The authors used a frequent itemset mining
approach to find a rare event by varying the support values in SCADA
process logs. Using this approach the authors could only identify a single rare
event. Later they used the stakeholders' knowledge to identify whether the rare
event is an anomalous pattern or not. Although this method could identify a rare
itemset pattern, it cannot identify a sequence of events as a rare anomalous
pattern. Their method inherently lacks the ability to identify new malware
which has no previous signature trained in their method. This is because the
human expertise is required to design, test and deploy the signatures. Therefore,
it is a time-consuming effort and updates for the new attack signatures cannot be
readily or promptly generated and deployed into a system. This manual human
effort cannot cope with the rapidly changing behavior of attack patterns and
hence requires automatic signature generation [31] [32]. For commercial purposes,
signature-based techniques are used for intrusion detection due to their high
detection rate, reliability and low false alarm rate [33]. However, this technique
cannot identify unknown or zero-day attacks because of the constant change in the
attack patterns. Moreover, the signatures require well-defined rules for the
possible attacks, which are almost impossible to produce because of the constant
changes in the system vulnerabilities and attack patterns.
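The limitation of signature-based detection can be illustrated with a short
sketch. It assumes that a signature is a fixed run of consecutive events and that
detection is an exact match against a dictionary of known signatures; the
signature, the event names and the function names are hypothetical.

```python
def contains_signature(events, signature):
    """Return True if the signature appears as a contiguous run in the events."""
    n = len(signature)
    return any(tuple(events[i:i + n]) == tuple(signature)
               for i in range(len(events) - n + 1))

def signature_detect(events, signatures):
    """Return the names of all known signatures found in the event list."""
    return [name for name, sig in signatures.items()
            if contains_signature(events, sig)]

# One hypothetical known signature and two hypothetical attack traces.
signatures = {"threshold_tamper": ("login", "write_threshold")}
known_attack = ["login", "write_threshold", "logout"]
unknown_attack = ["login", "change_sort_direction", "logout"]
```

The first trace is reported because it contains the stored signature, whereas the
second trace passes unnoticed until a matching signature is written and deployed.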
2.4.2 Anomaly-based Detection
The anomaly-based method of finding anomalies uses the normal operational
behaviour of a system or network. Wespi et al. [34] used a behaviour-based
anomaly detection method to build a model of normal behaviour from audit data.
To detect anomalies, they compared the observed behaviour with the stored
normal behaviour. The behaviour-based method detects any deviation or
unexpected behaviour of the system without human intervention. The normal
behaviour profile can be created for each individual system. This technique
relies on machine learning, data mining or statistical methods, which are based
on supervised or unsupervised learning. The supervised method requires prior
knowledge or understanding of the system, whereas the unsupervised method does
not require any previous knowledge of the system [31] [32]. The unsupervised
method automatically builds the system profile, and any deviation from the
normal behaviour profile is detected as an anomaly. As a result, it can detect
known as well as unknown attacks. Moreover, as no manual update of signatures
is required for the unsupervised method, anomaly detection is faster than with
a signature-based system. However, this method generates a high false alarm
rate because it treats any previously unseen events, even newly added valid
events, as anomalies, and therefore cannot be fully reliable.
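The profile-and-deviation idea described above can be illustrated with a minimal sketch. It is not the method of any cited work: the profile here is simply the set of event types seen during normal operation, and the event names and the helper functions `build_profile` and `detect_anomalies` are hypothetical.

```python
def build_profile(normal_events):
    """Collect every event type observed during normal operation."""
    return set(normal_events)


def detect_anomalies(profile, observed_events):
    """Return the observed events that are absent from the normal profile."""
    return [event for event in observed_events if event not in profile]


# Hypothetical event names, used only for illustration.
normal_log = ["Close_Valve", "Pump_On", "Pump_Off", "Open_Valve", "Pump_On"]
profile = build_profile(normal_log)

observed = ["Pump_On", "Valve_Forced_Open", "Pump_Off", "Scheduled_Self_Test"]
anomalies = detect_anomalies(profile, observed)
print(anomalies)  # ['Valve_Forced_Open', 'Scheduled_Self_Test']
```

Both unseen events are flagged, even though the second one could be a newly added valid event. This is the source of the high false alarm rate mentioned above.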
2.5 Data Mining and Machine Learning
2.5.1 Data Mining
Data mining is an analytical step in the knowledge discovery process [35]. Data
mining automatically discovers hidden, interesting, useful, and understandable
knowledge from a large collection of data [36]. The algorithms used in data
mining derive knowledge from various fields, such as statistics, databases,
pattern recognition and machine learning [36]. The knowledge that is derived
from a particular application domain can be useful or applied in another
application domain. Cios et al. [35] developed a knowledge discovery model that
accommodates both academic research and industrial application aspects. It is
comprised of multiple steps that are executed in a sequence. Every step
requires the previous step's result as input. The knowledge discovery model is
an iterative process with some feedback loops. To prepare a dataset that can be
used as input to the data mining algorithms, the initial raw data needs to be
preprocessed. Data preprocessing involves several steps, such as selection,
cleansing, construction, integration and formatting of the data [37].
Preprocessing takes a comparatively large share of the time spent on the entire
knowledge discovery process [35].
After dataset preparation, meaning having the dataset ready for processing,
the actual data mining methods are used to extract the information the user
requires. In this phase different data mining algorithms are applied to the
processed dataset to extract interesting and understandable knowledge. Our
research, Data Mining Critical Infrastructure Control Logs for Anomaly
Detection, uses experimental data mining methods to detect anomalies in SCADA
control logs. The data mining methods used are composed of several steps, as
shown in Figure 2.4. The steps are data source selection, data preprocessing,
relevant data selection, development or choice of a data mining algorithm,
pattern discovery, and analysis of the patterns to extract knowledge for a
practical application.
Figure 2.4: A data mining approach for information extraction.
The main working domain of our research is the pattern mining process, which
is one of the research branches of data mining. In pattern mining, there are
two major categories of research in the literature: frequent pattern mining and
infrequent, or rare, pattern mining. Our research uses the rare pattern mining
approach to discover unusual or unexpected behaviour of a system. There has
been little work on the use of rare pattern mining on SCADA control systems.
Hadžiosmanović et al. [17] used a rare itemset pattern mining approach to
detect anomalies by analysing logs from a SCADA water treatment plant. However,
they did not preserve the order of events in the control logs. This thesis
maintains the order of events to detect anomalies by using a novel rare
sequential pattern mining approach.
Because the process activities of a SCADA control system are well defined and
repetitive, their events occur frequently and represent the regular or normal
behaviour of the system. In this thesis, it is hypothesized that irregular or
abnormal behaviour occurs rarely in a system. Any change to the SCADA process
control system results in a rare behaviour of the system, which can be
identified by using rare sequential patterns. Rare anomalies can therefore
represent unexpected, irregular behaviour of a system. Our hypothesis is tested
in Chapter 3 of this thesis with a rare pattern mining experiment using SCADA
control logs.
After the desired rare patterns have been discovered, the pattern analysis step
analyses them to find anomalies in the system, which requires domain expertise.
In this final stage of the knowledge discovery process, the discovered
knowledge is documented and applied inside the targeted system. It should also
be noted that the knowledge discovered for one domain can be extended and
applied to other knowledge areas.
2.5.2 Machine Learning Methods
Machine learning is the study of how machines can learn from experience, as
human beings do. In the learning process, a machine learning algorithm maps a
set of inputs to outputs with the help of computer programming. These
algorithms are used in data mining tasks to build automatic models that extract
patterns or knowledge from large volumes of machine-generated data. The
traditional learning methods can be categorized into two groups: supervised
learning and unsupervised learning [38]. However, there is another learning
method, called semi-supervised learning, that is a blend of the supervised and
unsupervised learning methods.
2.5.3 Supervised Learning Method
Supervised learning usually works in two phases: the first is the training
phase and the second is the testing phase. A model is built, or trained, with a
training dataset, and its performance or accuracy is then checked with a test
dataset [24] [22]. The output of this model could be a classification or a
regression. Such a model can be built only when labelled training data is
available. Example algorithms that require labelled data are decision tree and
Support Vector Machine (SVM) classifiers. An example of a decision tree
classifier is given in Figure 2.5. The decision tree is composed of three
features: Weather Forecast, Humidity Condition and Wind Condition. Each of
these features has a defined set of values. For example, the Weather Forecast
has three values: sunny, cloudy and raining. It means that on a particular day
the weather could be sunny, cloudy (overcast) or rainy. Similarly, the Humidity
Condition of a day could be normal or high, and the Wind Condition could be
strong or weak. The decision to play a tennis match on a given day depends on a
combination of the values of these features, as does the decision not to play.
For example, a tennis match could be played (a) if it is a sunny day and the
humidity condition is normal, or (b) if it is a cloudy day, or (c) if it rains
and the wind condition is weak.
Figure 2.5: Supervised learning method.
Although these supervised methods are effective in detecting known anomalies,
they cannot identify unknown anomalies. In addition, they incur the cost of
training the model, which includes preparing the labelled training dataset.
Supervised methods are therefore not effective for finding anomalies in a SCADA
system, as they cannot find unknown or zero-day attacks. Since SCADA systems
are used to control industrial processes, failure to detect a zero-day attack
could have a devastating impact on the economy and the environment.
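The three tennis rules (a) to (c) described above can be written out as the nested tests that a decision tree of this kind encodes. This is only a hand-written sketch of the resulting classifier, not a learning algorithm; the function name `play_tennis` and the lower-case string values are assumptions made for the illustration.

```python
def play_tennis(weather, humidity, wind):
    """Decision rules (a) to (c) from the text, written as a tree of tests."""
    if weather == "sunny":
        return humidity == "normal"      # rule (a)
    if weather == "cloudy":
        return True                      # rule (b)
    if weather == "raining":
        return wind == "weak"            # rule (c)
    raise ValueError(f"unknown weather value: {weather!r}")


print(play_tennis("sunny", "normal", "strong"))   # True
print(play_tennis("sunny", "high", "weak"))       # False
print(play_tennis("raining", "high", "weak"))     # True
```

A supervised learner would derive such tests from labelled examples of past days, which is why labelled training data is a prerequisite.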
2.5.4 Unsupervised Learning Method
The unsupervised learning method does not require any labelled data to detect
anomalies. In other words, the data is not classified as attack data or
non-attack data. Therefore, an unsupervised model needs no separate training
and testing datasets to detect anomalies in a dataset. The underlying unknown
structure is derived from the unlabelled data itself, for example by grouping
the data on the basis of similarity, using techniques such as clustering,
dimension reduction and association rules. Since the data does not need to be
compared with a labelled dataset, the learning process of the unsupervised
model is faster than that of the supervised learning model. However, the
anomaly detection accuracy of the unsupervised model is lower than that of the
supervised and semi-supervised models [38]. This is because the unsupervised
method treats any abnormal behaviour as a potential anomaly in the system. In
the clustering method shown in Figure 2.6, unlabelled data items are
distinguished from each other by separating them into different groups. These
groups are formed based on the close similarity of features among the data. In
Figure 2.6, three groups are generated from the unlabelled data. Different
kinds of unsupervised learning algorithms have been developed. K-means is one
of the most widely used clustering algorithms. This clustering method is not
appropriate for finding anomalies in a SCADA control system. This is because in
a SCADA system the events occur in a sequential manner, where a sequence of
events completes a task. These events are correlated with each other.
Therefore, they cannot be separated into different clusters in order to find an
anomalous sequence of events. Instead, a sequential pattern mining method can
be used to find anomalous sequences of events in a SCADA control system.
Figure 2.6: Unsupervised learning method.
2.5.5 Semi-supervised Learning Method
Semi-supervised learning lies between the supervised and unsupervised methods.
There are circumstances in which semi-supervised learning is needed. Sometimes
it is hard to find labelled data, because labelling may require expertise that
is difficult to obtain and hence expensive. Labelling may also be time
consuming and may require special devices. Furthermore, the input data may be
large in quantity with only a small part of it labelled, leaving a large amount
of the data unlabelled. The semi-supervised algorithm tries to find strong
inductive biases from a large amount of unlabelled data [39]. A general
approach is shown in Figure 2.7, where γ is a hidden structure associated with
both objects A and B. The object B′ contains a few labelled examples of object
B. Some data in object A is labelled and some is unlabelled. The unlabelled
data in A assists in inferring the object B using the hidden structure γ.
Figure 2.7: Semi-supervised learning method.
Another example of semi-supervised learning is infant word-object mapping [40],
which measures the ability to associate a word with an object. If an infant
hears a word many times before the word's corresponding object (labelled data)
is seen, the association is stronger. However, if the word has not been heard
before, the association is weak.
2.6 Pattern Mining
Pattern mining is an important and widely studied task in data mining. It
allows the extraction of interesting hidden information as well as relations
among data, such as association rules [41] [42], correlations [43], causality
[44], sequential patterns [45], multidimensional patterns [46], episodes and
emerging patterns [47] [48] and many other patterns in large databases. Han et
al. [49], in their data mining book, define a pattern as "a set of items,
subsequences, or substructures that occur frequently together (or strongly
correlated) in a dataset." Patterns can be discovered from a large dataset.
There are two major types of pattern mining: itemset pattern mining and
sequence pattern mining. Each of these techniques can be further categorized as
frequent or rare pattern mining.
Data mining techniques extract hidden patterns from large volumes of data. A
pattern is sequential when the data is represented in a sequence or
time-related format. A sequence is comprised of a set of transactions, and each
transaction is composed of a set of events or items. The sequence of
transactions and events is shown in Figure 2.8.
Figure 2.8: A sequence diagram.
Pattern analysis based on a sequence database is regarded as sequential pattern
mining. This technique was first introduced by Agrawal and Srikant [50] in 1995
and defined as follows:
"Given a database of sequences, where each sequence consists of a list of
transactions ordered by transaction time and each transaction is a set of
items, sequential pattern mining is to discover all sequential patterns with a
user specified minimum support, where the support of a pattern is the number of
data sequences that contain the pattern."
There has been much research on algorithm development for sequential pattern
mining. These algorithms can be categorized into two broad groups: (i)
Apriori-based algorithms and (ii) pattern growth-based algorithms [51] [52].
Apriori-based algorithms apply breadth-first, or level-wise, search techniques.
This technique generates many candidate sequences, whose number could be
exponential in the worst case. Most of the candidate sequences are useless and
undesirable, and therefore need to be pruned. Apriori-based algorithms also
scan the sequence database multiple times, which increases processing time [51]
[52]. Examples of Apriori-based algorithms are Generalized Sequential Patterns
(GSP), which is based on a horizontal database format, and Sequential Pattern
Discovery using Equivalent classes (SPADE), which is based on a vertical
database format.
To overcome the drawbacks of Apriori-based algorithms, especially to remove
candidate generation, the Frequent Pattern Growth (FP-Growth) algorithm was
introduced. This algorithm applies the divide and conquer method [53] [54]. It
converts the database into a frequent pattern tree and is faster on large data
than the Apriori algorithm [51]. PREFIX-projected Sequential PAtterN mining
(PrefixSpan) is a widely used example of the pattern growth approach.
Sequential pattern mining has a wide range of applications, for example,
finding customer buying patterns in order to offer customers new products,
redesigning a company's web site after analysing customers' browsing patterns,
and finding DNA sequences. As a novel approach, this thesis proposes and
develops a rare sequential pattern mining algorithm using an Apriori-based
method.
2.6.1 Itemset Pattern Mining
An itemset pattern mining algorithm extracts interesting and useful patterns
from a transaction database. This concept was first introduced by Agrawal and
Srikant [55], who discovered groups of items in a customer transaction database
that were frequently purchased together. A transaction database
$D = \{T_1, T_2, \ldots, T_n\}$ is a set of transactions such that each
transaction $T_q \subseteq I$ ($1 \le q \le n$) is a set of distinct items
[56]. Each transaction $T_q$ can be identified with a unique identifier called
a Transaction ID (TID). An example of a customer transaction database is given
in Table 2.1.
Table 2.1: Transaction database TDB.
Transaction ID    Transaction
TID1              {a, b, d, e}
TID2              {c, d, f}
TID3              {b, c, d, f}
TID4              {c, f}
TID5              {a, b, c, e, f}
The example transaction database has five transactions named TID1 to TID5. Each
transaction comprises a set of items, or an itemset. An itemset $X$ is a set of
items such that $X \subseteq I$. For example, {a, b, d, e} is the itemset of
transaction TID1 in Table 2.1. This itemset is composed of four items that were
purchased by a customer in one transaction. A transaction does not keep the
order in which its items were purchased. In other words, from an itemset we
cannot find the order in which the items were purchased. Therefore, there is no
difference between the itemsets {a, b, d, e} and {b, a, e, d}, because these
two itemsets hold the same items. Further, if an item is purchased more than
once in a transaction, it is listed as one item. For example, if the item c in
transaction TID4 were purchased four times, it would still be counted as one
item.
The task of itemset pattern mining is to find all the itemsets that appear
together frequently in a transaction database. The initial purpose of itemset
pattern mining was to analyse the market basket in order to promote sales by
placing items that are bought together next to each other. For example, the
itemset {c, f} is the most frequently purchased combination of items in Table
2.1. As a result, these items can be placed next to each other on the shelf to
increase their sales. Since the market basket application, the use of itemset
pattern mining has been extended to many different domains, such as product
recommendation, text mining, bioinformatics, e-learning, web page analysis,
network traffic analysis and image classification [57] [58] [59]. Many
algorithms have been developed for itemset pattern mining, such as Apriori
[55], FP-Growth [60], Eclat [61] and FIN [62].
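As a small illustration of the itemset view, the following sketch counts the support of every pair of items in Table 2.1 by brute force. It is not one of the cited algorithms (Apriori, FP-Growth, Eclat or FIN); the names `TDB`, `support` and `pair_support` are chosen for the example only.

```python
from itertools import combinations

# Table 2.1, with each transaction stored as an unordered set of items.
TDB = {
    "TID1": {"a", "b", "d", "e"},
    "TID2": {"c", "d", "f"},
    "TID3": {"b", "c", "d", "f"},
    "TID4": {"c", "f"},
    "TID5": {"a", "b", "c", "e", "f"},
}


def support(itemset, database):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for transaction in database.values() if itemset <= transaction)


# Order inside an itemset carries no meaning.
print({"a", "b", "d", "e"} == {"b", "a", "e", "d"})  # True

# Support of every 2-item combination, computed by exhaustive enumeration.
items = sorted(set().union(*TDB.values()))
pair_support = {pair: support(set(pair), TDB) for pair in combinations(items, 2)}
best_pair = max(pair_support, key=pair_support.get)
print(best_pair, pair_support[best_pair])  # ('c', 'f') 4
```

Exhaustive enumeration is only practical for a toy table like this one, which is why dedicated algorithms prune the candidate space.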
2.6.2 Sequential Pattern Mining
In itemset pattern mining the order of events is not considered, although in
some applications the order of items or events is important. Therefore, itemset
pattern mining is not applicable for extracting the desired information from a
database where the order of events is important. For example, in finding
intrusions in a SCADA network the order of the events is important [63]. This
is because a change in the order of events can cause the SCADA process control
system to malfunction. For example, in a water tank control system a
pre-defined ordered sequence of events fills a water tank to its maximum
threshold capacity. Then the water is drained from the tank and, when it
reaches its lower threshold level, the water pump turns on and again fills the
tank to its maximum capacity. The ordered events are as follows:
〈{Close_Valve}, {Pump_On}, {Check_Water_Max_Threshold_Value}, {Pump_Off}, {Open_Valve}〉
These ordered events continue as a regular process of the water tank control
system. However, if the order of the events changes from the regular process of
the water tank control system, the water tank system floods. The altered
ordered events are given below:
〈{Open_Valve}, {Check_Water_Max_Threshold_Value}, {Pump_Off}, {Close_Valve}, {Pump_On}〉
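The difference between the itemset view and the sequence view of the two water tank event lists can be checked directly. The sketch below is illustrative only; the variable names and the helper `as_itemset` are assumptions.

```python
# The regular and the altered event order of the water tank example.
regular = [
    {"Close_Valve"},
    {"Pump_On"},
    {"Check_Water_Max_Threshold_Value"},
    {"Pump_Off"},
    {"Open_Valve"},
]
altered = [
    {"Open_Valve"},
    {"Check_Water_Max_Threshold_Value"},
    {"Pump_Off"},
    {"Close_Valve"},
    {"Pump_On"},
]


def as_itemset(sequence):
    """Flatten a sequence into one unordered set of events."""
    return set().union(*sequence)


same_items = as_itemset(regular) == as_itemset(altered)
same_order = regular == altered
print(same_items)  # True: an itemset miner sees no difference
print(same_order)  # False: the sequences differ
```

Because both lists contain exactly the same events, only a representation that preserves order can tell the regular process from the altered one.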
In text mining, too, the order of the words in a sentence is important [64]. In
sequential pattern mining there are two types of data that can be used. The
first is time-series data and the second is sequence data. Time-series data has
numeric values and sequence data has symbolic (nominal) values [65]. Examples
of time-series data are stock prices in the capital market and the electricity
consumption of a user. Examples of sequence data are the sequence of web clicks
made by a user while visiting a website and the sequence of items purchased by
a customer over a period of time.
Sequential pattern mining is an active research area and this technique has
been used in different application areas, such as market basket analysis,
bioinformatics, e-learning, text mining and web click-stream analysis [65].
Assume a set of items or symbols $I = \{i_1, i_2, \ldots, i_m\}$. An itemset
$X$ is a set of items such that $X \subseteq I$. A sequence is an ordered list
of itemsets $s = \langle I_1, I_2, \ldots, I_n \rangle$ where $I_k \subseteq I$
($1 \le k \le n$). For example, 〈{a}, {b, c}, {d}, {e, f}〉 is a sequence of
items purchased by a customer over a period of time. The customer purchased
item {a} in the first transaction. After some time, in the second transaction,
the customer bought two items {b, c} at the same time, then purchased item {d},
and finally purchased items {e, f} in the last transaction. This sequence has 4
transactions, which indicates that the customer purchased items in a sequential
order at different times. Therefore, the size of the sequence is 4. However,
the length of the sequence is 6, which is the number of items the customer
purchased over the time period.
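The size and length of the example sequence can be computed as follows. This is a minimal sketch that represents a sequence as a Python list of sets; the function names are chosen for the illustration.

```python
def sequence_size(sequence):
    """Size: the number of transactions (itemsets) in the sequence."""
    return len(sequence)


def sequence_length(sequence):
    """Length: the total number of items over all transactions."""
    return sum(len(itemset) for itemset in sequence)


purchases = [{"a"}, {"b", "c"}, {"d"}, {"e", "f"}]
print(sequence_size(purchases))    # 4
print(sequence_length(purchases))  # 6
```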
A sequence database SDB is composed of a list of sequences, that is,
$SDB = \langle s_1, s_2, \ldots, s_p \rangle$. An example of a sequence
database is given in Table 2.2.
Table 2.2: A sequential database SDB.
Sequence ID    Sequence
SID1           〈{a}, {b, c}, {d}〉
SID2           〈{a}, {e}, {f}, {b}〉
SID3           〈{a, b}, {d}, {c}〉
SID4           〈{b}, {c, e}, {e}〉
SID5           〈{c}, {b}, {e}, {d}〉
There are 5 sequences in the sequence database SDB in Table 2.2. Each sequence
is identified with a Sequence ID (SID) number. The purpose of sequential
pattern mining is to find subsequences that are interesting to the user. A
sequence $\langle a_1, a_2, \ldots, a_n \rangle$ is called a subsequence of
another sequence $\langle b_1, b_2, \ldots, b_m \rangle$, where $m \ge n$, if
there exist integers $1 \le i_1 < i_2 < \cdots < i_n \le m$ such that
$a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$.
The interest in a subsequence could be measured by the frequency of the
subsequence. A subsequence that is rare in the database SDB could also be
interesting to some users, depending on the application domain.
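The subsequence definition above can be checked with a greedy left-to-right scan, which always takes the earliest matching transaction for each itemset of the candidate. The sketch below assumes sequences are lists of sets; `is_subsequence` is a name chosen for the illustration.

```python
def is_subsequence(candidate, sequence):
    """True if each candidate itemset is a subset of a distinct transaction
    of `sequence`, with the matched transactions in increasing order."""
    position = 0
    for itemset in candidate:
        # Advance to the first remaining transaction that contains the itemset.
        while position < len(sequence) and not itemset <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next itemset must match a later transaction
    return True


sid1 = [{"a"}, {"b", "c"}, {"d"}]
sid3 = [{"a", "b"}, {"d"}, {"c"}]

print(is_subsequence([{"b"}, {"d"}], sid1))  # True
print(is_subsequence([{"d"}, {"b"}], sid1))  # False: the order is reversed
print(is_subsequence([{"a", "b"}], sid1))    # False: a and b are in different transactions
print(is_subsequence([{"a", "b"}], sid3))    # True
```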
2.6.2.1 Frequent Sequential Pattern
In frequent sequential pattern mining, a sequence is considered frequent if its
frequency, or support value, is greater than or equal to a user-provided
threshold called the minimum support value, minsup [66]. For example, the
sequence 〈{b}, {d}〉 is a frequent subsequence in the sequence database SDB in
Table 2.2 when the user-provided minimum support value is set to minsup = 2.
The support of the sequence 〈{b}, {d}〉 is 3 because it is found in the 3
sequences SID1, SID3 and SID5. Similarly, the sequence 〈{a}〉 is also a frequent
pattern since it appears in 3 sequences in the database SDB, that is, its
support value is greater than minsup = 2. In some application domains, frequent
patterns are considered interesting because they represent the regular or
expected behavioural pattern of a system. Many algorithms have been developed
to find frequent patterns in sequence databases. Some of the most popular are
Srikant and Agrawal's [66] GSP algorithm, Zaki's [67] SPADE algorithm and Pei
et al.'s [45] PrefixSpan algorithm. All of these algorithms find all the
frequent subsequences that are considered interesting to the users. They can be
categorised into two groups: Apriori, or level-wise, algorithms and pattern
growth algorithms. GSP is an example of an Apriori algorithm and PrefixSpan is
an example of a pattern growth algorithm. The pattern growth algorithm is
computationally faster than the Apriori algorithm because, unlike the Apriori
algorithm, it does not generate candidate sequences.
2.6.2.2 Rare Sequential Pattern
In rare sequential pattern mining, a sequence is considered rare if its support
value is lower than the user-defined maximum support value, maxsup. For
example, the sequence 〈{a}, {d}, {c}〉 is a rare sequence because it appears
only once in the sequence database SDB in Table 2.2, namely in the sequence
SID3. This sequence is rare because its support value is below the maxsup
value, which is set to 2. In some application domains, rare cases are
interesting to the users. For example, rare symptoms of a disease are important
for identifying the disease. Another example of a rare case that could be
suspicious and interesting comes from a fire alarm system. In normal
circumstances, whenever smoke is detected in a system, the fire alarm is
triggered. In other words, the smoke detected event is followed by the fire
alarm triggered event, which is the correct behaviour of the system. However,
if smoke has been detected but the fire alarm is not triggered, this sequence
of events is an irregular, rare behaviour of the system. Although many
algorithms have been developed in the literature to find frequent sequential
patterns, there exist no algorithms to find rare sequential patterns in a
sequential database.
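The support counts quoted for Table 2.2, and the frequent and rare tests built on them, can be reproduced with a short sketch. It is a brute-force illustration, not the algorithm proposed in this thesis; `contains`, `support`, `is_frequent` and `is_rare` are hypothetical helper names, and the rare test uses the strict inequality stated in this section.

```python
# Table 2.2, with each sequence stored as a list of itemsets.
SDB = {
    "SID1": [{"a"}, {"b", "c"}, {"d"}],
    "SID2": [{"a"}, {"e"}, {"f"}, {"b"}],
    "SID3": [{"a", "b"}, {"d"}, {"c"}],
    "SID4": [{"b"}, {"c", "e"}, {"e"}],
    "SID5": [{"c"}, {"b"}, {"e"}, {"d"}],
}


def contains(sequence, pattern):
    """Subsequence test: a shared iterator keeps the matches in order."""
    remaining = iter(sequence)
    return all(any(itemset <= t for t in remaining) for itemset in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for s in database.values() if contains(s, pattern))


def is_frequent(pattern, database, minsup):
    return support(pattern, database) >= minsup


def is_rare(pattern, database, maxsup):
    return support(pattern, database) < maxsup


print(support([{"b"}, {"d"}], SDB))          # 3 (SID1, SID3, SID5)
print(support([{"a"}, {"d"}, {"c"}], SDB))   # 1 (SID3 only)
print(is_frequent([{"b"}, {"d"}], SDB, minsup=2))      # True
print(is_rare([{"a"}, {"d"}, {"c"}], SDB, maxsup=2))   # True
```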
2.7 Constraint-based Pattern Mining
Generally, a constraint can be considered as a preference or restrictive
parameter that can be incorporated to discover or extract expected useful
information from databases. That is, constraints are parameters that can be
used to limit the data that is accessed and processed in order to find the
information the user is interested in. Constraints can be used at the data
source, while choosing the target databases, as well as at the algorithmic
level, during the data processing stage. For example, to extract a meaningful
or significant pattern from data that covers a particular time period, or an
episodic period, we need to select a data source that has episodic
characteristics. This means that there is some fixed or irregular time gap
between the episodic events or activities in the system. In an episodic time
period the system accomplishes a complete task, and only then is data recorded
in the logs. When there is no activity in the system, no data is recorded in
the logs. Therefore, during data source selection, we need to choose episodic
data, which contains time gaps, instead of continuous data, which contains
none. An example of an algorithmic constraint is not allowing the size of a
pattern to exceed a predefined size: if a pattern exceeds the size constraint,
it can no longer be a useful pattern.
Constraints can reduce the search space in the database and can filter the
results while extracting the required information [68]. Most of the time the
use of constraints is associated with pattern mining tasks. In pattern mining,
particularly in sequential pattern mining, a large number of results (patterns)
are generated. It is hard to analyse such a large number of patterns and
discover the desired pattern among them [69] [70]. Most of the patterns are
useless or unimportant because they cannot be used to identify the information
the user requires [71]. However, a large amount of computational time is
required to process these unimportant patterns during the pattern mining
process. To reduce this time and filter out the unnecessary patterns, domain
related knowledge is needed to extract the desired result.
Pei et al. [72] give the definition of a constraint C as follows: a constraint
C is a predicate on the powerset of the set of items I, that is,
$C : 2^I \to \{\text{true}, \text{false}\}$, where
$I = \{i_1, i_2, \ldots, i_m\}$ is a set of items. A sequence S satisfies a
constraint C if and only if C(S) is true. For example, to find a constrained
rare sequential pattern, the following condition needs to be satisfied:
$sup(S) \le maxsup \land C(S) = \text{true}$. Assume the user-defined maximum
support maxsup is set to 2 and the constraint C restricts the size of a rare
pattern, that is, the number of events in a rare pattern should be no greater
than 3. Then the rare sequential pattern 〈{a}, {e}, {b}〉 from Table 2.2 can be
called a constrained pattern, because it satisfies the above condition: its
support is 1 and it contains 3 events.
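The constrained condition can be expressed as a predicate combined with a support test. The sketch below follows the inequality exactly as written in this section (support no greater than maxsup) and counts each itemset of a pattern as one event; both choices, and the helper names, are assumptions of the illustration.

```python
# Table 2.2 as a list of sequences.
SDB = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a"}, {"e"}, {"f"}, {"b"}],
    [{"a", "b"}, {"d"}, {"c"}],
    [{"b"}, {"c", "e"}, {"e"}],
    [{"c"}, {"b"}, {"e"}, {"d"}],
]


def contains(sequence, pattern):
    """Subsequence test: a shared iterator keeps the matches in order."""
    remaining = iter(sequence)
    return all(any(itemset <= t for t in remaining) for itemset in pattern)


def support(pattern, database):
    return sum(1 for sequence in database if contains(sequence, pattern))


def size_constraint(pattern, max_events=3):
    """Constraint C: the pattern holds at most `max_events` events."""
    return len(pattern) <= max_events


def is_constrained_rare(pattern, database, maxsup, constraint):
    """sup(S) <= maxsup and C(S) = true."""
    return support(pattern, database) <= maxsup and constraint(pattern)


print(is_constrained_rare([{"a"}, {"e"}, {"b"}], SDB, 2, size_constraint))         # True
print(is_constrained_rare([{"a"}, {"e"}, {"f"}, {"b"}], SDB, 2, size_constraint))  # False: 4 events
print(is_constrained_rare([{"b"}, {"d"}], SDB, 2, size_constraint))                # False: support is 3
```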
Many constraints have been proposed in the literature. These constraints can be
categorised into different groups based on semantics, properties and the nature
of the data source [68]. Semantic-based constraints depend on the interest of
the application domain. An example is an item constraint, which can be used to
extract patterns that contain a particular item or group of items. For
instance, a drug manufacturing company may be interested in finding patterns
that contain particular drugs while mining its warehouse database. A
property-based constraint relies on the behavioural characteristics of a
pattern when an item is added to or removed from an itemset. For example, if a
pattern is found to be rare in a database, any pattern that grows from this
rare pattern, meaning any superpattern, will always be a rare pattern. In other
words, adding an item to the rare pattern to make it a superpattern does not
change the rarity of the pattern. This follows from the anti-monotonic property
of support: a superpattern can never have a higher support than the pattern it
grows from. The type of data source may also influence the selection of a
constraint. For example, some data sources are continuous while others are
episodic.
Many algorithms have been developed for constrained pattern mining. Srikant and
Agrawal [66] first introduced a constraint-based algorithm called Generalized
Sequential Patterns (GSP). This algorithm uses a time gap constraint, meaning
the time difference between two consecutive events, and a time span constraint,
that is, the time difference between the first event and the last event in a
pattern. Over time, other algorithms have developed or extended the GSP
algorithm. Mannila et al. [48] used the width of a time window as a constraint
to find frequent episodic patterns. This algorithm finds the patterns whose
events fall within the time window constraint. For example, consider a
sequential pattern 〈{a}, {b, c}, {d}〉 that contains three events {a}, {b, c}
and {d}. If these three events occur within a set time window constraint of 5
minutes, the pattern is called a time window constrained pattern. In other
words, all these events happen inside the 5 minute time period. Like Mannila et
al.'s [48] time window method, our proposed method uses a time span constraint.
However, we use the time span constraint to find rare episodic patterns,
whereas Mannila et al. [48] used it to find frequent episodic patterns.
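The 5 minute time window example can be checked by comparing the timestamps of the first and last events of a pattern occurrence. The sketch below is illustrative only; the timestamps are invented and `within_time_window` is a hypothetical helper, not part of any cited algorithm.

```python
from datetime import datetime, timedelta


def within_time_window(timed_events, window):
    """True if the first and last events are no more than `window` apart."""
    times = [timestamp for _, timestamp in timed_events]
    return max(times) - min(times) <= window


# One occurrence of the pattern <{a}, {b, c}, {d}> with made-up timestamps.
occurrence = [
    ({"a"}, datetime(2019, 1, 1, 10, 0, 0)),
    ({"b", "c"}, datetime(2019, 1, 1, 10, 2, 0)),
    ({"d"}, datetime(2019, 1, 1, 10, 4, 30)),
]
five_minutes = timedelta(minutes=5)
print(within_time_window(occurrence, five_minutes))  # True: span is 4 min 30 s

# The same events, but with {d} arriving six minutes after {a}.
late = occurrence[:2] + [({"d"}, datetime(2019, 1, 1, 10, 6, 0))]
print(within_time_window(late, five_minutes))        # False
```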
Chen et al. [73] apply the time interval between events as a constraint while
discovering sequential patterns. That is, if the time interval between the
consecutive events in a pattern satisfies the set time interval constraints,
the pattern is considered significant and hence is discovered from the
database. Chen and Hu [71] introduced two constraints, recency and compactness,
in sequential pattern mining. The recency constraint discovers the most recent
patterns, because the behaviour of a system may change over time. The
compactness constraint, on the other hand, finds patterns within a defined time
span, which is similar to the idea used by Mannila et al. [48]. Another
constraint-based algorithm, Sequential Pattern mIning with Regular expressIons
consTraints (SPIRIT), uses regular expressions as constraints [74]. Later,
following the SPIRIT idea, Antunes and Oliveira [75] developed an algorithm to
infer association rules using context-free grammars as constraints. Algorithms
usually integrate constraints in the mining process. In other words,
constraints can be directly associated with the actual pattern search process
[76]. An example is Zaki's [77] constraint-based cSPADE algorithm, in which
constraints such as restrictions on the length and width of a pattern are
integrated in the mining process.
In a similar approach, we use a pattern size constraint to reduce the
number of comparisons made while checking whether a candidate sequence pattern
(sub-pattern) exists in an episodic sequence. If the candidate sequence pattern
is larger than the episodic sequence, the candidate sequence cannot be
found in it. Hence, the search can be skipped, which reduces the number of
comparisons. The pattern size constraint therefore improves the time efficiency of the
proposed rare sequential pattern mining algorithm. We also use a second
constraint, called the pattern existence constraint, which is unique to our proposed
rare sequential pattern mining method and further reduces the time spent
searching for a candidate sequence pattern in the episodic sequences. When the frequency
of a candidate sequence equals the maximum support threshold value, there is
no need to search further for the candidate sequence in the sequence database.
This is because the candidate sequence frequency cannot exceed the maximum
support threshold value. Therefore, unwanted scanning is avoided, which helps
to reduce the searching time.
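The two constraints described above can be illustrated with a minimal sketch. It is not the thesis algorithm: the function names are hypothetical, each event is assumed to be a single string, and a candidate is assumed to match an episodic sequence when its events appear in the same order, with gaps allowed.

```python
from typing import List, Sequence


def is_subsequence(candidate: Sequence[str], episode: Sequence[str]) -> bool:
    """Return True if the candidate events occur in the episode in order (gaps allowed)."""
    remaining = iter(episode)
    # `event in remaining` consumes the iterator up to the first match,
    # so later events are only searched for after earlier ones.
    return all(event in remaining for event in candidate)


def count_with_constraints(
    candidate: Sequence[str],
    episodes: List[Sequence[str]],
    max_support: int,
) -> int:
    """Count episodes containing the candidate, applying both constraints."""
    count = 0
    for episode in episodes:
        # Pattern size constraint: a candidate longer than the episode
        # cannot occur in it, so the comparison is skipped.
        if len(candidate) > len(episode):
            continue
        if is_subsequence(candidate, episode):
            count += 1
            # Pattern existence constraint: stop scanning once the count
            # reaches the maximum support threshold.
            if count == max_support:
                break
    return count
```

With this sketch, the scan of the sequence database ends as soon as the threshold is reached, and short episodes never trigger an element-by-element comparison.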
2.8 Association Rule Mining
Association rule mining is a rule-based machine learning method that discovers
interesting relations such as correlations, associations or causal structures between
data in large databases [78]. This method is used to find strong rules involving
two sets of items, indicating that if one itemset occurs then the other itemset
also occurs in the dataset. Agrawal et al. [79] first applied association analysis
to point-of-sale data, known as market basket analysis, to discover the
purchasing behaviour of customers. An example of a market basket transaction
database is given in Table 2.3. For example, the association rule {Bread, Milk}
⇒ {Diapers} can be extracted from the database shown in Table 2.3. This
rule suggests that there exists a strong association between the itemset {Bread,
Milk} and the itemset {Diapers}. In other words, this rule indicates that many
customers who bought Bread and Milk together also bought Diapers in the
same transaction. This information helps to make marketing decisions
such as promotional pricing, inventory management and placing these
items close to each other on the shelf. There have been many applications of
association rules apart from market basket analysis, for example in bioinformatics,
web usage mining, intrusion detection and medical diagnosis [80].
Table 2.3: A market basket transaction database.
Transaction ID   Items Purchased
TID1             {Bread, Milk, Diapers}
TID2             {Bread, Diapers, Beer, Eggs}
TID3             {Milk, Diapers, Bread, Cola}
TID4             {Diapers, Beer, Bread, Milk}
TID5             {Bread, Milk, Diapers, Cola}
An association rule is denoted by the implication expression X ⇒ Y.
Here, X is called the antecedent and Y is called the consequent of the rule. In
this rule, the antecedent X and consequent Y are disjoint itemsets, meaning
X ∩ Y = ∅. The strength of an association rule between the two itemsets
can be measured by different parameters or metrics, such as
support, confidence, lift and conviction. These metrics determine how interesting
a rule is among the set of possible rules. The most popular metrics for identifying
interesting rules are support and confidence. Using these metrics we can focus
on the interesting rules and disregard or eliminate uninteresting rules.
The Support indicates how frequently an itemset appears in a database. The
support can be formally defined as:
\[ \text{Support, } s(X \Rightarrow Y) = \frac{\text{support}(X \cup Y)}{N} \]
Assume that the itemset X is {Bread, Milk}, Y is {Diapers} and N is the
number of transactions, that is, 5 in the database in Table 2.3. The support count
for {Bread, Milk, Diapers} is 4 since this itemset appears in 4 transactions, TID1,
TID3, TID4 and TID5, and the total number of transactions is 5. Therefore, the
support of this rule is 4/5 = 0.8. In other words, this rule occurs in 80% of
all transactions.
The Confidence determines how often a rule has been found to be true in
a database. In other words, the confidence indicates how often the itemset Y
appears in the transactions of a database where the itemset X also appears. The
confidence can be formally defined as:
\[ \text{Confidence, } c(X \Rightarrow Y) = \frac{\text{support}(X \cup Y)}{\text{support}(X)} \]
The confidence of the rule {Bread, Milk} ⇒ {Diapers} can be calculated by
dividing the support count of the itemset {Bread, Milk, Diapers}, which is 4, by
the support count of the itemset {Bread, Milk}, which is also 4 because this itemset
appears in 4 transactions, TID1, TID3, TID4 and TID5, of the database in Table
2.3. Therefore, the confidence of the rule is 4/4 = 1.0. In other words, in 100% of
the transactions in which a customer bought Bread and Milk, they also bought Diapers.
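The support and confidence calculations above can be reproduced with a short sketch over the transactions of Table 2.3. The function and variable names are illustrative assumptions, not part of the thesis.

```python
# Transactions copied from Table 2.3.
TRANSACTIONS = {
    "TID1": {"Bread", "Milk", "Diapers"},
    "TID2": {"Bread", "Diapers", "Beer", "Eggs"},
    "TID3": {"Milk", "Diapers", "Bread", "Cola"},
    "TID4": {"Diapers", "Beer", "Bread", "Milk"},
    "TID5": {"Bread", "Milk", "Diapers", "Cola"},
}


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def support(x, y, transactions):
    """s(X => Y) = support(X u Y) / N."""
    return support_count(x | y, transactions) / len(transactions)


def confidence(x, y, transactions):
    """c(X => Y) = support(X u Y) / support(X)."""
    return support_count(x | y, transactions) / support_count(x, transactions)


X, Y = {"Bread", "Milk"}, {"Diapers"}
print(support(X, Y, TRANSACTIONS))     # 0.8
print(confidence(X, Y, TRANSACTIONS))  # 1.0
```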
Pattern mining is the first step of association rule mining. An association
rule expresses a correlation among the items in a pattern, which is not possible with
pattern mining alone. This is because in pattern mining the items are
listed as a single set of items, whereas in an association rule the items are listed in
two distinct, correlated itemsets. For example, in pattern mining {Bread, Milk,
Diapers} is a frequent pattern when the minimum support value is 2, as shown in Table 2.3.
It is not possible to say from this pattern whether any item tends to co-occur
with other items in a single transaction. On the other hand, from the association
rule {Bread, Milk} ⇒ {Diapers}, which is generated from the previous frequent
pattern, it is possible to say that if Bread and Milk are purchased in a
single transaction, it is likely that Diapers will also be purchased in the same
transaction.
The association rule can be categorised into two groups based on the order of
the antecedent and the consequent of a rule. These two categories of association
rules are (i) the Itemset Association Rule and (ii) the Sequential Association
Rule. When the order between the antecedent and the consequent of a rule is
not considered, it can be considered an itemset association rule. For example,
{Bread, Milk} ⇒ {Diapers} in the previous discussion is an itemset association
rule. This is because the rule does not say anything about the order between the
antecedent itemset {Bread, Milk} and the consequent itemset {Diapers}. The
rule does not indicate which itemset was purchased before the other itemset.
Rather the rule says that the itemset comprising both antecedent and consequent
{Bread, Milk, Diapers} is purchased together in a single transaction.
On the other hand, the sequential association rule mining considers the order
between the antecedent itemset and the consequent itemset. The rule X ⇒ Y is
considered a sequential association rule if and only if the antecedent is followed by
the consequent. However, the order of the items within the antecedent and consequent
itemsets is not considered. For example, the sequential association rule {a, b}
⇒ {c, d} indicates that the itemset {a, b} occurs before the itemset {c, d} in a
database, that is, the antecedent is followed by the consequent. However, the order
of the items within the antecedent {a, b} and within the consequent {c, d} does not matter in
the sequential association rule. The interestingness of sequential association
rules can be measured with the same metrics, support and confidence, that
are used in itemset association rule mining.
Sequential association rules can be grouped into two categories [81]. In the
first category, the order of events in a sequence is maintained both within the
antecedent and within the consequent of a sequential rule. In addition, the order is
maintained between the antecedent and the consequent. In the
second category, the order of events inside the antecedent and the consequent is
not maintained, but the order is maintained between the antecedent and
the consequent. This thesis uses sequential association rule mining in which
the order of events is maintained both inside the antecedent and consequent and
between the antecedent and the consequent. This is because the logs of
the SCADA control system are sequential, with a timestamp attached to each
event.
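The difference between the two categories can be made concrete with a minimal sketch. The function names are hypothetical, events are assumed to be single strings held in a list, and gaps between matched events are assumed to be allowed; the thesis may define matching differently.

```python
def match_ordered(pattern, sequence, start=0):
    """Return the index just past the earliest in-order match of pattern
    in sequence[start:], or -1 if there is no match."""
    pos = start
    for event in pattern:
        try:
            pos = sequence.index(event, pos) + 1
        except ValueError:
            return -1
    return pos


def rule_holds_ordered(antecedent, consequent, sequence):
    """First category: order is kept inside both sides and between them."""
    end = match_ordered(antecedent, sequence)
    return end != -1 and match_ordered(consequent, sequence, end) != -1


def rule_holds_unordered(antecedent, consequent, sequence):
    """Second category: order is kept only between antecedent and consequent."""
    for split in range(1, len(sequence)):
        before, after = set(sequence[:split]), set(sequence[split:])
        if set(antecedent) <= before and set(consequent) <= after:
            return True
    return False
```

For the rule {a, b} ⇒ {c, d}, the sequence b, a, d, c satisfies only the second category, whereas a, b, c, d satisfies both.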
Sequential association rule mining can be used in different application domains,
such as the mobile telecom industry. In the telecom industry, sequential association
rules can be used to reduce the large number of alarms, which allows the network
operator to select the most important alarms [82]. This method allows
the operator to take precautions in advance to reduce possible consequences
or system disruption. Another example of the application of sequential association
rule mining is predicting future energy demand [83]. The authors in [76]
establish a correlation between consumers' household activities and electricity
consumption to find the future electricity demand pattern. In web mining,
sequential association rules can be used to predict future HTTP requests to a
web site [84]. Yong et al. [84] use web page usage patterns to mine association
rules using sequence and temporal constraints. Although association rule mining
can be used to predict a future event based on the current trend of a pattern,
no research has been done to predict a future event or sequence of events that
is part of an anomalous sequence.
2.9 Existing Anomaly Detection in SCADA System
Research has already been done to find anomalous behavior in SCADA systems
from diverse perspectives. The existing works can be analysed according to the
following categories: (i) Log based anomaly detection techniques, (ii) Protocol
based anomaly detection techniques, (iii) Network traffic based anomaly detection
techniques, (iv) Data mining based anomaly detection techniques and (v)
Rare pattern mining based anomaly detection techniques.
(i) Log based anomaly detection techniques: Naedele et al. [85] propose to
aggregate log resources from a distributed process control environment and to
analyze and present the security status in visual form to a non-IT operator
monitoring the process, who detects abnormal conditions using domain
expertise. Naedele et al. also combine human experience with IDS techniques
to reduce the false positive alarm rate, which is normally very high
with rule based as well as anomaly based techniques. Balducelli et al. [86]
attempted to find abnormal behavior using a case based reasoning method.
They compared the sequence of SCADA log events with
previously defined normal behavior. However, they did not work with real
SCADA log files; rather, they experimented in a simulated test-bed environment.
Infrastructure appliances can record the operational activities of the
systems to a system log file in a defined format. Moreover, any successful
or attempted unauthorized access or intrusion into the system can also be
recorded in log files that can be analyzed for intrusion or fault detection
by security experts or network administrators. Therefore, logs or event
logs could be a good data source to preprocess and analyze for detecting
attacks or intrusions that have happened in the system. Vaarandi [87] uses event logs
from network devices such as routers to identify anomalies. The author uses
his proposed clustering algorithm, Simple Log File Clustering Tool (SLCT),
to separate outliers from the normal system profile. The tool checks each
line of the event log and, if the event line is found to be similar to the
normal system profile event logs, it is placed in a normal group. However,
if the event line does not fall into the normal system profile group, it
is considered an outlier and put into an outlier group. Felix et al. [88]
state that different systems maintain their own log formats, which makes
it hard to understand and process logs from outside the domain. However,
Garitano et al. [15] state that SCADA systems have a number of unique
features such as deterministic communication and limited and recurrent
activities. Therefore, modelling anomaly detection for SCADA systems is
comparatively easier than for traditional computer networks. In addition,
Hadžiosmanović et al. [14] also mention that, compared to logs used
in other domains, SCADA logs are in a good format to analyze.
(ii) Protocol based anomaly detection techniques: It is difficult for network
administrators to manually analyze the volumes of log entries in SCADA
systems. Therefore, a data mining approach is needed to analyse large volumes
of data to find hidden, useful patterns and detect anomalous events
or attacks. In the past, anomaly detection systems primarily applied two
techniques, signature based and anomaly based methods, for general enterprise
networks. Later on, these detection techniques were adopted in or
migrated to control system networks like SCADA. Usually anomaly
based detection techniques are not used in enterprise networks because of
their rapidly changing behavior [89]. However, SCADA networks or control
networks usually do not change rapidly in terms of topology, protocols,
actions and communication patterns [17] [90] [91] [92]. Therefore, we argue
that anomaly detection techniques would be able to identify
anomalies in control system networks effectively. This technique usually learns the
normal behavior of a system and raises an alert if any deviation from
the regular system behavior is found [92].
(iii) Network traffic based anomaly detection techniques: There has been much
research on anomaly detection from diverse perspectives. Some approaches operate at
the communication protocol level [90] [91] and analyse SCADA logs for
threat identification in SCADA processes. In most cases, signature based
techniques are used for anomaly detection due to their high detection rate
and low false alarm rate. Some tools are freely available, such as Snort,
Sensors and Net-Rangers, and some are for commercial use, like Real-Secure [93]
[26]. These tools collect network traffic data and match it against predefined
patterns in a database to identify any deviation and thus identify
the occurrence of an anomaly in the system. These tools usually use signature
based techniques to detect suspicious activities and refer them to
analysts for further investigation. This type of software requires periodic
updates of its database and hence works well in small network environments
[94]. However, attackers are constantly updating their techniques to circumvent
defensive security measures, and this technique cannot cope with a large
number of attacks and raises large volumes of alerts for security
experts to investigate further.
(iv) Data mining based anomaly detection techniques: Data mining can be used
to find hidden patterns (regularities and irregularities) in the big data
generated by ICS like SCADA, which can help detect anomalies in a system.
Currently, data mining methods cannot identify anomalies from constantly
changing datasets [94]. However, as SCADA log data is almost steady,
we are optimistic about detecting anomalies using data mining techniques.
Manganaris et al. [26] show that the absence of frequent events or sets
of events can be considered an anomaly. Clifton et al. [27] applied
data mining techniques to identify the normal behavior of a system based on
the frequent occurrence of alarm events and later filtered out suspicious
event lists. Barbara et al. [95] analyse system and user behavior using
association rules mined from network traffic data to train a model.
They then look for any deviation in the association rules, which is considered
abnormal behavior.
(v) Rare pattern mining based anomaly detection techniques: Although there
has been some work on infrequent itemset mining, it does not
consider the order of the itemsets or events. Until now there
has been no research in the rare pattern mining literature that considers
events in sequential order to detect anomalies. Our research
focuses on maintaining the order of the events because the sequence order
helps to find causal relationships among the events. A break in the
sequence order represents anomalous events or attacks. Research in
the rare pattern mining field started in the late 1990s. This area has recently
attracted a lot of attention from researchers due to increasing demand
for applications in anomaly detection in network security, medicine, genetics
and molecular biology [96].
Saha et al. [97] mention that the basic strategy for rare pattern mining is to
identify all the frequent patterns in a transaction database using a user
defined threshold and then prune these patterns from the database. As
a result, the remaining patterns fall below the support value and are considered
infrequent. Szathmary et al. [98] discovered rare itemsets by identifying
minimal rare itemset generators. Their approach is to identify individually
frequent items; if a combination of these frequent items is infrequent,
then this combination is considered a rare itemset. For example, the
items vegetarian {veg} and cardiovascular disease {cvd} are individually
frequent, but when they are combined, the combination {veg, cvd}
is infrequent, so {veg, cvd} is a rare itemset. However, their
co-occurrence does not carry significant meaning; in other words, it is considered
unlikely for vegetarian people to have cvd.
We are inspired by the motivation of the aforementioned work. Although
this work was done for rare itemset mining, we have ventured into finding
rare or infrequent sequential patterns where the order of itemsets or
events is preserved. The order of the events helps to find causal relationships
among the events, which is missing in Szathmary et al.'s work.
For example, Fan-Failure and Device-Down are two events, and between
these two events there is a causal relationship such that Fan-Failure leads
to Device-Down [99]. In other words, the occurrence of the event Fan-Failure
causes the second event Device-Down to happen. Therefore, the ordered pair of
events {Fan-Failure, Device-Down} carries significant, meaningful information.
However, if these two events occur in the order {Device-Down,
Fan-Failure}, then the pair does not carry meaningful information regarding which
event causes the other event to occur.
Therefore, infrequent or rare sequential patterns need to be further analyzed
to discover anomalous events, which are also rare in action. It is
worth mentioning that if a rare event is performed several times then it
becomes a frequent event and would be considered a normal or regular
event of the system. In that case, this technique might not work properly.
Usually the frequent sequential patterns are those sequences that
satisfy the user defined minimum support threshold value. The
remaining sequences, which fall below the threshold value, are considered
infrequent sequential patterns and are usually ignored as they do not
represent the regular or normal behavior of the system. The current literature
focuses only on frequent patterns and ignores infrequent patterns. Our
research will use these infrequent sequential patterns to detect anomalies.
The work closest to this research is that of Szathmary et al.
[98]. However, Szathmary et al. did not consider the order of the events,
and as a result there is no correlation among the events; this will be
addressed in this research work.
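The rare itemset idea described above for {veg, cvd} can be illustrated with a minimal sketch. It is not Szathmary et al.'s algorithm: it only lists pairs of individually frequent items whose combination is infrequent. The records, the extra item names (fit, smoker) and the threshold are hypothetical.

```python
from itertools import combinations


def support_count(itemset, records):
    """Number of records that contain every item of the itemset."""
    return sum(1 for record in records if itemset <= record)


def rare_pairs_of_frequent_items(records, min_support):
    """Pairs of individually frequent items whose combination is infrequent."""
    items = sorted(set().union(*records))
    frequent = [i for i in items if support_count({i}, records) >= min_support]
    return [
        pair
        for pair in combinations(frequent, 2)
        if support_count(set(pair), records) < min_support
    ]


# Hypothetical records, for illustration only.
RECORDS = [
    {"veg", "fit"}, {"veg"}, {"veg", "fit"},
    {"cvd"}, {"cvd", "smoker"}, {"cvd"},
    {"veg", "cvd"},
]

print(rare_pairs_of_frequent_items(RECORDS, min_support=3))  # [('cvd', 'veg')]
```

Because the result is a set-like pair, it says nothing about which item came first, which is the limitation the sequential approach is meant to address.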
2.10 Summary and Research Gaps
This chapter has reviewed literature concerning anomaly detection techniques in
critical infrastructures with diverse data sources. The data used is mainly communication
protocols, network traffic and, in a few cases, log files. The methods
used to detect anomalies by analyzing this data vary from case based reasoning and
human expertise combined with traditional IT analysis to data mining techniques. Most
of the research was conducted using signature based techniques, which can only
identify previously known attack patterns. However, attackers are constantly
changing their attack techniques or patterns, which these methods fail to
identify. Therefore, our research will address these shortcomings and develop a
behavior based anomaly detection model that will not only identify known attacks
but also detect new or unknown attacks. In addition, the developed model
will generate an early prediction of possible anomalies in the system.
This chapter has also reviewed the basic building blocks of SCADA architecture
and the control device logs that record SCADA process activities. Based on this
review, it has been found that SCADA control systems play a critical role in
monitoring and controlling critical infrastructures. However, SCADA systems
have been the target of cyber attacks that aim to disrupt the process activities of the
control system. This disruption may cause environmental as well as financial losses,
which is a great concern for communities that use SCADA.
Therefore, there is a need to detect anomalies and intrusions in SCADA control
systems. It has been found that data mining in general and pattern mining in
particular can play a significant role. This is because the large volume of
logs generated by SCADA control systems can be effectively analysed
by data mining algorithms.
Three research gaps have been identified from the literature. A brief description
of these research gaps follows.
1st Research Gap: There exists no anomaly detection method that maintains the
order of events and can identify abnormal behaviour from SCADA control
logs. In a SCADA control system, events occur in a sequential manner. The normal
or regular ordered sequence of events accomplishes a complete process, while an
irregularly ordered sequence of events disrupts the normal process. For example,
the following ordered sequence of events in a water tank control system accomplishes
a complete process.
〈{Close_Valve}, {Pump_On}, {Check_Water_Max_Threshold_Value}, {Pump_Off}, {Open_Valve}〉
This ordered sequence of events fills a water tank to its maximum threshold
capacity. Once the threshold is reached, the water is released from the tank.
When the water level touches the lower threshold value, the above ordered
sequence starts again to fill the water tank. This process continues for a defined time
period. However, if the events in the above sequence are performed in the following
different order, the water tank overflows and floods the system.
〈{Open_Valve}, {Check_Water_Max_Threshold_Value}, {Pump_Off}, {Close_Valve}, {Pump_On}〉
This is because the changed order of events deviates from the regular
process of the water tank control system.
Although there exists a single work by Hadžiosmanović et al. [17] that finds
anomalies in water treatment control system logs, that work used itemset
pattern mining to identify the abnormal or unusual system usage of a system
user. Hadžiosmanović et al.'s work finds a single unusual event with the help of
domain expert knowledge. However, since SCADA control system events occur
in a sequential manner, the order of the events is important in distinguishing
the normal behaviour from the abnormal behaviour of the SCADA
control system. For example, if the events a, b, and c happen in the SCADA
system in the order of event a followed by event b followed by event c,
then the sequence is called the regular profile of the system. However, if there is
a change in the order of events, such as event a followed by c and then by b,
then the sequence is called abnormal and can be considered an anomaly in
the system.
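The point that an order-blind comparison cannot separate the regular profile from the reordered sequence can be checked with a minimal sketch using the two water tank sequences given above. The function names are hypothetical, and exact equality with a single regular profile is a simplifying assumption.

```python
REGULAR_PROFILE = [
    "Close_Valve", "Pump_On", "Check_Water_Max_Threshold_Value",
    "Pump_Off", "Open_Valve",
]
REORDERED = [
    "Open_Valve", "Check_Water_Max_Threshold_Value", "Pump_Off",
    "Close_Valve", "Pump_On",
]


def same_itemset(observed, profile):
    """Itemset view: only the set of events is compared, order is ignored."""
    return set(observed) == set(profile)


def same_sequence(observed, profile):
    """Sequence view: the events must occur in exactly the same order."""
    return list(observed) == list(profile)


def is_order_anomaly(observed, profile):
    """Same events as the regular profile, but in a different order."""
    return same_itemset(observed, profile) and not same_sequence(observed, profile)


print(same_itemset(REORDERED, REGULAR_PROFILE))      # True
print(is_order_anomaly(REORDERED, REGULAR_PROFILE))  # True
```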
In Chapter 3 of this thesis we find anomalies in SCADA control system logs.
We use a rare sequential pattern mining approach to detect anomalies. The reason
for using a rare sequential pattern mining method is to verify the notion that
anomalies happen rarely in a system. To find rare sequences of events, this research
developed an algorithm that can generate all rare sequential patterns in a
sequence database. Chapter 3 also generates equivalence classes that separate
all rare patterns into different groups comprising the minimal rare sequential
pattern, the maximal rare sequential pattern and all the patterns in between
the minimal and maximal patterns. Depending on the domain interest, sometimes it
is the minimal pattern that helps to identify the anomalies and sometimes it
is the maximal pattern that better helps to detect the root cause of anomalies in
the system.
2nd Research Gap: Analysing a rare sequential pattern mining algorithm with
the integration of constraints. There is a need to investigate constrained rare
sequential pattern mining. The first reason is to find rare patterns within an episodic
time period or session period. If a pattern originates within a session time, then
the events in the pattern have the capability to carry out a complete task. Otherwise,
if the pattern is composed of events coming from consecutive sequences
that extend beyond the session time, the pattern may lose its strength or capability
to accomplish a task.
The second reason is to improve the efficiency of the rare sequential pattern
mining algorithm by reducing the number of unimportant rare patterns and
minimizing the computational time of the algorithm while generating the desired
patterns. This can be achieved by integrating constraints to filter out unnecessary
patterns, which helps to reduce the database search space. The
reduced search space in turn leads to less computational time for the algorithm.
The third reason is to reduce the false positive rate, which can be
achieved by identifying anomalous patterns from a small number of suspicious
rare patterns while maintaining the accuracy of the algorithm.
In Chapter 4, this research finds anomalous patterns that belong to an episodic
session time period. The events of a pattern are not split across consecutive
sequences. If the events of a pattern happen within a session time period, the
pattern is significant in that it has the strength or capacity to do harm to a system.
In this chapter, we also reduce the complexity of the rare sequential pattern
mining algorithm that we developed in Chapter 3. The complexity is reduced
by improving the efficiency of the algorithm in terms of the number
of rare sequential patterns generated and the computational time the algorithm
takes to generate these patterns. The efficiency is improved by integrating constraints
into the algorithm. There is a trade-off between efficiency and
accuracy. However, the accuracy of the algorithm does not decrease for anomaly
detection. This is due to careful use of the constraints with the help of domain
expert knowledge.
3rd Research Gap: The absence of an anomaly prediction method that can raise an alert
for possible incoming anomalies on a live SCADA system. The connection of
modern SCADA systems to the internet exposes them to cyber-attacks. One
method of detecting a cyber-attack is to analyse static off-line logs
to understand the nature of the attack once it has occurred on the system.
Although this method can identify the attack, it cannot protect the system from
the occurrence of an attack. This is because this method is unable to provide
possible attack information before the attack happens. As a result, it can neither
protect the system from the attack nor provide attack
information in time to take precautions so that the attack can be avoided. Therefore,
there is a need for an anomaly prediction method which can raise an alert for
a possible anomaly on the live SCADA system. In Chapter 5 of this thesis, the
research uses association rules generated from rare sequential patterns to predict
incoming or ongoing anomalies or attacks in the system by analysing SCADA
streaming logs. If the antecedent of an association rule is found in the
incoming streaming logs, it is predicted that the consequent of that rule is likely
to happen in the near future. The antecedent, comprising a sequence of events, is
considered the precursor, meaning it gives an indication that the consequent of
the rule is likely to follow. If the consequent, which is also an event or sequence of
events, occurs within a defined time period or session time in the log stream,
then the prediction becomes true.
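The prediction step described above can be sketched as follows. This is a simplified illustration rather than the method of Chapter 5: the log is assumed to be a list of (timestamp, event name) pairs already in time order, the function names and return labels are hypothetical, and the session time is measured from the last matched antecedent event.

```python
def first_match_end(pattern, events, start=0):
    """Index just past the earliest in-order match of pattern in events[start:],
    or -1 if the pattern does not occur. Each event is (timestamp, name)."""
    pos = start
    for name in pattern:
        while pos < len(events) and events[pos][1] != name:
            pos += 1
        if pos == len(events):
            return -1
        pos += 1
    return pos


def evaluate_rule(antecedent, consequent, events, session_seconds):
    """Return 'no_alert', 'alert_confirmed' or 'alert_unconfirmed'."""
    if not antecedent:
        raise ValueError("antecedent must contain at least one event")
    end = first_match_end(antecedent, events)
    if end == -1:
        return "no_alert"  # precursor not seen, so nothing is predicted
    alert_time = events[end - 1][0]  # time of the last antecedent event
    # Only events inside the session window can confirm the prediction.
    window = [e for e in events[end:] if e[0] - alert_time <= session_seconds]
    if first_match_end(consequent, window) != -1:
        return "alert_confirmed"
    return "alert_unconfirmed"
```

In a live setting the alert would be raised as soon as the antecedent is matched; the confirmation check is only possible once the session window has elapsed.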
The following chapter, Chapter 3, presents the first contribution of this thesis,
which addresses the first research gap: there are no anomaly detection methods
for SCADA control logs that use rare sequential pattern mining.
Chapter 3
A Rare Sequential Pattern Mining
Approach for Anomaly Detection
3.1 Introduction
Pattern mining is one branch of data mining research, focused on discovering
underlying useful information in a large database using different methods [100].
Among these techniques, some deal with itemset pattern mining and others are
used for mining sequential patterns. In both categories, frequent pattern mining
is a widely practiced research domain because it can discover the regular or
expected behaviour in data. However, to reduce computation requirements to
achievable levels, this technique ignores a large and sometimes very interesting
segment of the database, that is, the infrequent or rare behavioural patterns,
which could be interesting and significant.
It is believed that in some circumstances rarity in data gives useful, interesting
and reliable information for discovering unexpected or anomalous behaviour,
which goes against a common assumption of the data mining domain [98]. An
example where anomalous behaviour is critically important is in industrial control
SCADA (supervisory control and data acquisition) networks, which manifest
regular and expected system behaviour because of their limited and repetitive actions. So,
any irregular or unexpected behaviour, which is very rare in a SCADA system,
deserves further investigation.
In an industrial domain, a process is completed by a specific ordered sequence
of actions. For example, consider a SCADA water tank system that consists of
two water tanks and a pump that moves water from a lower tank into an upper
tank. Gravity allows water in the upper tank to move back into the lower tank.
We argue that a particular order of events can lead to an undesired result. For
example, if the upper tank reservoir valve is closed and the pump is on, the
pump fills the upper tank. When the water level reaches the 40% level of the
tank, a sensor triggers the pump off and opens the valve. Then, water drains
out to the lower tank. When the water level touches a certain low level mark in
the upper tank, the valve closes, which triggers the pump on and starts filling the
upper tank again. The following sequence of ordered events is a regular
system profile for filling a water tank reservoir:
〈{Close_Valve}, {Pump_On}, {Check_Water_Level_40%}, {Pump_Off}, {Open_Valve}〉
However, if these events are performed in a different order the system can flood.
Assume that the water level in the upper tank reservoir is above the 40% level
mark of its capacity and the valve is triggered to open. The water from
the upper tank reservoir then starts draining into the lower tank reservoir. When
the water level of the upper tank reservoir touches the 40% level mark, the
sensor triggers the pump off. Then the valve closes and triggers the water
pump on. This ordered sequence of events is given below:
〈{Open_Valve}, {Check_Water_Level_40%}, {Pump_Off}, {Close_Valve}, {Pump_On}〉
If the system runs with the above order of events, then there are no further
checks before the water level reaches its upper threshold capacity. As a result,
the upper tank reservoir could overflow and cause the system to flood.
The fundamental difference between itemset pattern mining and sequence
pattern mining is that in itemset pattern mining the order of the events is not
considered, whereas in sequence pattern mining the order of the events is
maintained and considered important. The importance of the order is shown in the
above water tank reservoir example, where an incorrect order of execution
of events can cause the system to malfunction. At the logical level,
itemset and sequential pattern mining are the same in that candidate
generator patterns are generated level-wise by adding an item to an existing
pattern. However, there is a methodical difference in how items are added to
the pattern. In itemset mining, an item is added into a set of items, whereas in
sequential pattern mining the item can be added at different positions of the
pattern. For example, if 〈{a}, {b}〉 is a pattern of size-2 and we need to generate a
candidate pattern of size-3 by adding the item {c}, then in itemset mining we get
only the candidate pattern {a, b, c}. However, in sequential pattern mining we
get three different candidate sequence patterns: 〈{c}, {a}, {b}〉, 〈{a}, {c}, {b}〉,
and 〈{a}, {b}, {c}〉. Therefore, in sequence pattern mining the integrity of the
original pattern can be preserved. This means that the occurrence order of the
events in a pattern is not changed after adding new items. However, in itemset
pattern mining there is no such integrity because an itemset has no order.
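The difference in candidate generation can be made concrete with a short sketch. It assumes a sequence is stored as a Python list of sets; the function names `itemset_candidate` and `sequence_candidates` are hypothetical and chosen only for this illustration.

```python
def itemset_candidate(items, new_item):
    """Itemset mining: adding an item yields exactly one unordered candidate."""
    return frozenset(items) | {new_item}

def sequence_candidates(pattern, new_event):
    """Sequence mining: insert the new event at every position.

    The relative order of the events already in `pattern` is never changed.
    """
    return [pattern[:i] + [new_event] + pattern[i:] for i in range(len(pattern) + 1)]

pattern = [{"a"}, {"b"}]
cands = sequence_candidates(pattern, {"c"})
single = itemset_candidate({"a", "b"}, "c")
```

For the size-2 pattern in the text, `cands` holds the three size-3 sequences, while `single` is the one itemset {a, b, c}.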
Therefore, we argue that considering only rare itemset patterns, which
ignore event order, cannot be effective in identifying important
sequential anomalies. Also, as some adversaries may not have an in-depth
understanding of a targeted system, they may try actions which are incompatible
with a system's predefined action sequences, which would result in rare and
abnormal events. Furthermore, the order of the events may help to find causal
relationships. For example, Fan-Failure and Device-Down are two events
between which there is a causal relationship: Fan-Failure leads to
Device-Down [99]. That is, the malfunction of the fan occurs first, and then
the device stops working. The other event order, Device-Down then
Fan-Failure, does not show the causal relationship.
It is often perceived that shorter patterns are usually frequent, while
longer patterns are by nature likely to be rare or infrequent, and their
combinations can be even rarer. If a combination of two frequent single items becomes
infrequent, it is called a rare pattern. These patterns could be useful in different
application domains for finding anomalies. The number of rare patterns could
be large. The number of detected rare patterns depends on the user
defined support threshold, which is compared against the frequency of a
sequence in a sequence database. If the support threshold is high, the number of
generated rare patterns is high, but if the support threshold is low, a small number
of rare patterns is generated. However, for sequential patterns, the number of
rare patterns is still large because of the different possible arrangements of the
event order in a pattern.
Therefore, identifying useful rare patterns is challenging with sequential
pattern mining. As there is an increasing demand for rare patterns in anomaly
detection in network security, medicine, genetics and molecular biology [101], we
are motivated to find anomalies with rare sequential patterns. The present work
builds upon the authors' previous work [102], which
only discovered the minimal rare sequential generator patterns that are
considered the seeds of all rare patterns, and showed that the order of the events
in a pattern is important in a system where the events occur in a sequential
manner.
There are three aspects to the contributions of this work. Firstly, this work is
the first approach to find rare sequential patterns. Secondly, we present a
structured generator-based method to generate all rare sequential patterns. Thirdly,
different application domains may have different interests in the sizes of the rare
patterns. Some domains may prefer the smallest patterns over the largest
patterns to find anomalous behaviour of a system, while others may consider the
largest patterns better for reaching a conclusion about anomalies in a system. To
cater to these demands, this method separates all rare sequential patterns into
different groups having the same frequency. In each group, the smallest pattern
is called the minimal pattern and the largest pattern is called the maximal
pattern. Sometimes there is more than one minimal or maximal pattern in
a group. The minimal patterns find the source or seed of attacks, while maximal
patterns find the size or level of the disruption. Finally, this method is applied
to real SCADA control system logs to validate the usefulness of the approach in
finding anomalies.
The remaining sections of this chapter are organized as follows. Section 3.2
gives the definitions related to our work. Section 3.3 explains our
proposed novel method for finding minimal and maximal rare patterns using
sequential pattern mining. Section 3.4 gives the details of the experimental
procedures and results. Section 3.5 provides analysis and discussion of the findings.
Section 3.6 discusses the related research, and finally, Section 3.7 draws
conclusions and outlines future work.
3.2 Definitions
We begin by presenting the de�nitions of some concepts required to formally
describe our approach and algorithms. Some of the concepts are commonly used
in related works such as the one in [103].
Definition 1 (Sequence): Let I = {i_1, i_2, i_3, ..., i_n} be a set of items. An itemset
X (also called an event) is an unordered set of distinct items, that is, X ⊆ I. A
sequence S is an ordered list of itemsets, i.e., S = 〈X_1, X_2, ..., X_k〉, where X_j ⊆ I
and 1 ≤ j ≤ k. Note that, while an item can appear only once inside an event,
an item may appear in multiple events within a sequence.
Definition 2 (Sequence database): A sequence database SDB comprises a set of
sequences, SDB = {S_1, S_2, ..., S_p}, where each S_j is a sequence. Each sequence
of the database SDB is identified by a unique sequence identifier SID.
For example, let S be a sequence in the database SDB, and let Events(S) be the set of
events which occur in S. The set of all events which occur in the database SDB
is defined as Events(SDB) = ∪_{S ∈ SDB} Events(S).
Definition 3 (Sequence containment): A sequence S_a = 〈A_1, A_2, ..., A_n〉 is said
to be contained in a sequence S_b = 〈B_1, B_2, ..., B_m〉 if and only if there exist
integers 1 ≤ i_1 < i_2 < ... < i_n ≤ m, n ≤ m, such that A_1 ⊆ B_{i_1}, A_2 ⊆ B_{i_2},
..., A_n ⊆ B_{i_n}; this is denoted as S_a ⊑ S_b. In this case S_a is a
subsequence of S_b and S_b is a super sequence of S_a. The condition n ≤ m
indicates that the number of elements in a subsequence must be less than or equal
to the number of elements in a super sequence.
For example, 〈{b}, {c, e}〉 is a subsequence of 〈{b, d}, {c, e, f}, {h}〉, while
〈{a}, {b}〉 is not a subsequence of 〈{a, b}, {c, d}〉.
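Definition 3 translates directly into a containment test. The sketch below is one possible implementation, assuming sequences are lists of Python sets; the name `contains` is hypothetical. It matches each event of the pattern against the earliest possible event of the sequence, which is sufficient for deciding containment.

```python
def contains(pattern, sequence):
    """Return True if `pattern` is a subsequence of `sequence` (Definition 3).

    Each event of the pattern must be a subset of a distinct event of the
    sequence, and the matched events must appear in the same order.
    """
    pos = 0
    for event in pattern:
        # Advance to the first remaining event that is a superset of `event`.
        while pos < len(sequence) and not set(event) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern event must match strictly later
    return True

first = contains([{"b"}, {"c", "e"}], [{"b", "d"}, {"c", "e", "f"}, {"h"}])
second = contains([{"a"}, {"b"}], [{"a", "b"}, {"c", "d"}])
```

The two calls reproduce the examples in the definition: the first pattern is contained, the second is not, because {a} and {b} would have to match two different events.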
Definition 4 (Size and Length of a sequence): The size |S| of a sequence is
the number of events or itemsets in that sequence, while the length
of a sequence is the total number of individual items (counting repetitions of an
item) in that sequence.
For example, S = 〈{a}, {b, c}, {b, d}, {e}〉 is a sequence with four events and
hence the size of this sequence is 4 denoted as size-4, whereas the length of this
sequence is 6 because the total number of items is 6.
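As a quick check of Definition 4, size and length can be computed as follows. This is a sketch that assumes a sequence is a list of sets; `size` and `length` are illustrative helper names.

```python
def size(sequence):
    """Number of events (itemsets) in the sequence."""
    return len(sequence)

def length(sequence):
    """Total number of items, counting an item once per event it appears in."""
    return sum(len(event) for event in sequence)

S = [{"a"}, {"b", "c"}, {"b", "d"}, {"e"}]
S_size, S_length = size(S), length(S)
```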
Table 3.1: A sequential database SDB.

Sequence ID    Sequence
SID1           〈{a}, {b, d}, {e}, {c}〉
SID2           〈{a}, {c}, {b}, {e}〉
SID3           〈{a, b}, {c}, {b}, {e}〉
SID4           〈{b}, {c, e}, {e}〉
SID5           〈{a}, {b, c}, {e}, {f}, {c}〉
SID6           〈{g}〉
Definition 5 (Support): The support of a sequence S_a in a sequential database
SDB is the number of sequences S ∈ SDB such that S_a ⊑ S.
For example, the pattern 〈{a}, {b}, {e}〉 is found in four sequences (SID1, SID2,
SID3, and SID5 in Table 3.1); therefore, the support of this pattern is
4 as an absolute count, or approximately 67% (4 of 6 sequences) as a relative count.
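The support count in this example can be reproduced on Table 3.1 with a short sketch. The encoding of the database as a dictionary, the shorthand constructor `seq`, and the helper names are assumptions made for illustration only.

```python
def seq(*events):
    """Shorthand: seq("a", "bd") builds the sequence <{a}, {b, d}>."""
    return tuple(frozenset(e) for e in events)

# Table 3.1, keyed by sequence identifier.
SDB = {
    "SID1": seq("a", "bd", "e", "c"),
    "SID2": seq("a", "c", "b", "e"),
    "SID3": seq("ab", "c", "b", "e"),
    "SID4": seq("b", "ce", "e"),
    "SID5": seq("a", "bc", "e", "f", "c"),
    "SID6": seq("g"),
}

def contains(pattern, sequence):
    """Containment test of Definition 3 (earliest-match strategy)."""
    pos = 0
    for event in pattern:
        while pos < len(sequence) and not event <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def supporting_sids(pattern, sdb):
    """Identifiers of the sequences that contain the pattern."""
    return sorted(sid for sid, s in sdb.items() if contains(pattern, s))

sids = supporting_sids(seq("a", "b", "e"), SDB)
absolute = len(sids)
relative = absolute / len(SDB)
```

Here `absolute` is the support as defined above and `relative` is the same count divided by the number of sequences in the database.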
Definition 6 (Frequent and Rare Sequential Pattern): A frequent sequential pattern
is a sequence whose support is greater than or equal to the user defined support
threshold. A rare sequential pattern is a sequence whose support is less than the
user defined threshold.
For example, the sequential pattern 〈{a}, {b}, {e}〉 is frequent while the pattern
〈{a, b}, {c}〉 is rare in the database SDB in Table 3.1 when the user provided
support threshold value is 2. These frequent and rare sequential patterns can
be represented by a lattice, a structure that describes the relationships
among the sequential patterns in a sequential database. This is similar to the
structure used by Pasquier et al. [104]. In both cases, the relationship is an
order relationship between superpatterns and subpatterns. However, the
lattice used by Pasquier et al. [104] is built on a subset relationship, while the
lattice used in this paper is built on a subsequence relationship. For the sequential
database in Table 3.1, a simplified partial lattice is given in Figure 3.1, which
illustrates the relationships, shown as thin lines, among the sequential patterns in
the database. The subsequence relationship among the patterns is a partial order.
For example, the sequential pattern 〈{a}, {c}, {e}〉 is a subsequence of the
sequential patterns 〈{a}, {c}, {e}, {f}〉 and 〈{a}, {c}, {f}, {e}〉, and these
relationships are shown in Figure 3.1. However, there is no subsequence relationship
between 〈{a}, {c}, {e}, {f}〉 and 〈{a}, {c}, {f}, {e}〉.
Figure 3.1: A partial lattice view of a sequential database.
Definition 7 (Closed and Generator Sequential Pattern): A sequential pattern s_a
is called a closed pattern if there is no other sequential pattern s_b such that
s_a ⊑ s_b and their supports are equal. A sequential pattern s_a is called a generator if
there is no other sequential pattern s_b such that s_b ⊑ s_a and their supports are
equal.
For example, the pattern 〈{a, b}, {c}, {e}〉 is a closed pattern because there are
no super sequential patterns which have the same support. The pattern 〈{a, b}〉
is a generator pattern because no sub-sequential patterns are found which have
the same support.
Definition 8 (Rare Sequential Generator Pattern): A sequential pattern is called
a rare sequential generator pattern (RSG) if it is rare and all of its proper
subsequences are frequent.
For example, the sequence 〈{a, b}〉 is a rare sequential generator pattern (shown in a
thick bordered shaded rectangle on the right side of the solid slanted line in
Figure 3.2) because all of its subsequence patterns, 〈{a}〉 and 〈{b}〉, are frequent when
the minimum support threshold is 2.
Definition 9 (Zero Sequential Pattern and Minimal Zero Sequential Generator
Pattern): A sequential pattern is called a zero sequential pattern if and only if
it does not exist in the database, i.e., its support value is 0. A zero sequential
pattern is called a minimal zero sequential generator pattern (mZRSGr) if and
only if all of its proper subsequences are non-zero sequential patterns.
A minimal zero sequential generator pattern is the upper border limit of a valid
rare sequential pattern. Note that none of the super sequential patterns that
can be generated from an mZRSGr appear in the database.
For example, the non-existent sequence pattern 〈{a}, {f}, {d}〉, which is shown in a
small dotted rectangle with a number zero on top in Figure 3.1, is a minimal zero
sequential generator, and none of the super sequences of this pattern exist.
Definition 10 (Equivalence Class): Two patterns S_a and S_b are in the same
equivalence class with respect to a sequential database SDB if and only if for each
sequence s ∈ SDB, we have S_a ⊑ s if and only if S_b ⊑ s. In other words, all
the patterns in the equivalence class occur in the same sequences in the SDB.
Therefore, they have the same support.
For example, the rare patterns 〈{a, b}〉, 〈{a, b}, {c}〉, 〈{a, b}, {e}〉,
〈{a, b}, {c}, {e}〉, 〈{b}, {c}, {e}〉 form an equivalence class. These patterns have the same
support and they belong to the same sequence (SID3) of the SDB (shown in
Table 3.1). Note that generator patterns are the smallest or minimal patterns
in size while the closed patterns are the largest or maximal patterns among
all of the member patterns of an equivalence class. An equivalence class may
contain more than one generator and more than one closed pattern. For example, the
sequential patterns 〈{d}〉, 〈{b, d}, {c}〉, 〈{a}, {d}〉, 〈{a}, {b, d}〉, 〈{d}, {c}〉, 〈{d}, {e}〉,
〈{b, d}, {e}〉, 〈{a}, {b, d}, {e}〉, 〈{a}, {b, d}, {c}〉, 〈{a}, {d}, {e}〉, 〈{d}, {e}, {c}〉,
〈{a}, {d}, {c}〉, 〈{b, d}, {e}, {c}〉, 〈{a}, {b, d}, {e}, {c}〉, 〈{a}, {d}, {e}, {c}〉 from
the database given in Table 3.1 form an equivalence class shown in Figure 3.3.
In this equivalence class, the sequential patterns 〈{d}〉 and 〈{b, d}〉 are the
minimal rare sequential generator patterns, while the rare sequential patterns
〈{a}, {b, d}, {e}, {c}〉 and 〈{a}, {d}, {e}, {c}〉 are the closed sequential patterns.
It is worth noting that at a high level, itemset pattern mining and sequence
pattern mining can be seen as parallel. However, sequence pattern mining differs from
itemset mining, especially in how candidate sequence generator patterns are
generated. For example, if a candidate pattern is generated from two items
a and b, then in itemset mining only one candidate itemset pattern {a, b} can
be generated. However, in sequence mining there are two possible candidate
sequence patterns that can be generated, i.e., 〈{a}, {b}〉 and 〈{b}, {a}〉. Therefore,
if the order of events is taken into consideration while calculating the support,
the itemset pattern is different from the sequential pattern. On the other hand,
if the order is ignored during the support calculation, itemset pattern
mining and sequential pattern mining are the same.
Figure 3.2: The positive and the negative border of a lattice of a sequential database.
Mannila and Toivonen [105] defined the notions of positive
and negative borders. According to Mannila and Toivonen, the maximal frequent
patterns form the positive border of the frequent zone and the minimal rare
patterns form the negative border of the infrequent zone. Figure 3.2 depicts only
a partial view of a lattice structure that represents the positive and the negative
borders of the sequential patterns of the sequential database in Table 3.1. The
solid slanted line separates the rare patterns on the right side from the frequent
patterns on the left side. The minimal rare sequential generators along the right
side of the solid slanted line form the negative border, and the maximal frequent
sequential patterns on the left side of the solid slanted line form the positive border.
The rare generators hold the important property that all of their subsequences
are frequent, and as such the subsequences must have a higher support value than
the rare generators. For example, the minimal rare sequential generator patterns
〈{d}〉, 〈{f}〉, 〈{g}〉, 〈{a, b}〉, 〈{c, e}〉, 〈{b, d}〉, 〈{b, c}〉, 〈{b}, {b}〉, 〈{c}, {c}〉,
〈{e}, {e}〉, which are shown in thick bordered shaded rectangles, form the negative
border along the right side of the solid slanted line. On the other hand, the maximal
frequent sequential patterns 〈{a}, {b}, {c}〉, 〈{a}, {e}, {c}〉, 〈{a}, {c}, {e}〉,
〈{c}, {b}, {e}〉, 〈{a}, {b}, {e}, {c}〉, which are shown in dash and dot rectangles,
form the positive border along the left side of the solid slanted line.
The pattern in a dotted rectangle on the right side of the solid slanted
line is a non-existent pattern, meaning this pattern never occurs in the SDB.
As there are many non-existent patterns, most are omitted and are shown as
three dots (...) in Figure 3.1. The frequent generators, which are shown in dashed
rectangles, and the other frequent patterns, which are shown in solid and dash and
dot rectangles, are on the left side of the solid slanted line. The frequent generators
are the frequent patterns for which there are no subsequence patterns with the
same support value. On the other hand, for the patterns shown in dotted
rectangles, at least one of their subsequences has the same support value. The rare
generators hold the important property that all of their subsequences are frequent
sequences, including frequent generators. Furthermore, for both frequent and
rare generators, the support of any proper subsequence of a generator must be higher
than the support of the generator. This property is defined by Pasquier et al.
[104] as follows:
Property 1: A sequence X is a generator if and only if the support of X is lower
than the support of Y for every proper subsequence Y of X.
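Property 1 can be tested on the database of Table 3.1 with the sketch below. It relies on one observation that is not stated in the text: because support can only grow when items are removed, it is enough to compare a pattern with the subsequences obtained by deleting a single item. All function names and the `seq` shorthand are hypothetical.

```python
def seq(*events):
    """Shorthand: seq("a", "bd") builds the sequence <{a}, {b, d}>."""
    return tuple(frozenset(e) for e in events)

SDB = [seq("a", "bd", "e", "c"), seq("a", "c", "b", "e"), seq("ab", "c", "b", "e"),
       seq("b", "ce", "e"), seq("a", "bc", "e", "f", "c"), seq("g")]  # Table 3.1

def contains(pattern, sequence):
    """Containment test of Definition 3 (earliest-match strategy)."""
    pos = 0
    for event in pattern:
        while pos < len(sequence) and not event <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, sdb):
    return sum(contains(pattern, s) for s in sdb)

def immediate_subsequences(pattern):
    """All patterns obtained by deleting exactly one item from the pattern."""
    subs = set()
    for i, event in enumerate(pattern):
        for item in event:
            rest = event - {item}
            middle = (rest,) if rest else ()  # drop the event if it becomes empty
            subs.add(pattern[:i] + middle + pattern[i + 1:])
    return subs

def is_generator(pattern, sdb):
    """Property 1: every proper subsequence has strictly higher support."""
    sup = support(pattern, sdb)
    return all(support(sub, sdb) > sup for sub in immediate_subsequences(pattern))
```

For instance, `is_generator(seq("b", "c"), SDB)` holds because the pattern has support 4 while 〈{b}〉 and 〈{c}〉 have support 5, whereas 〈{a}, {b}〉 fails the test because it has the same support as 〈{a}〉.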
In this chapter, we propose methods to generate all generators including both
frequent and rare generators based on Property 1 (Phase 1), then generate all
rare sequential patterns from the rare generators (Phase 2).
3.3 A New Method For Finding Rare Sequential
Patterns
In this section we present a new general method for finding rare sequential
patterns. This method comprises two phases. In the first phase, we generate rare
sequential generator patterns, which are the seeds for generating all rare
sequential patterns in the second phase. Note that the subsequence patterns of a rare
generator pattern are generator patterns, but could be frequent generators or rare
generators. According to Szathmary et al. [98], there are two ways to generate
rare patterns, either from frequent generators or from maximal frequent patterns.
The latter method requires finding all frequent patterns up to the maximal patterns,
from which all rare generators can be produced. Generating
rare generators from maximal frequent patterns therefore requires greater
computational time, because the method needs to explore all frequent patterns to
generate the maximal frequent patterns. Hence, we propose to use frequent generators
[106] to find all rare generators, allowing us to use only a subset of the frequent
patterns. To do this, we split the set of events Events(SDB) into two: a set of rare
size-1 sequences and a set of frequent size-1 sequences. The size-1 rare patterns
are rare generator sequential patterns, while the frequent patterns of size-1
are frequent generator sequential patterns. We exclusively use these frequent
generator sequential patterns to find further rare generator sequential patterns
based on Property 1 by applying an apriori-like method. The method is formally
described in Algorithm 3.1. The rare generators are the seeds for generating all
rare sequential patterns.
In the second phase, we generate all rare sequential patterns by generating
super patterns of rare patterns, starting from the rare sequential generators.
According to the apriori property [55], any super pattern of an infrequent pattern
must be infrequent. The rare sequential patterns are extended by merging them with
size-1 sequences to generate more rare sequential patterns until no new rare
patterns can be generated. In the process of making super sequential patterns,
we keep the integrity of the original sequence, from which the candidate super
sequences are generated, by not changing the order of its itemsets or events.
In other words, even though we add a size-1 sequence
at different positions of the original sequence, the order of the itemsets
of the original sequence remains intact in the generated candidate super
sequences. For example, given two sequential patterns S_1 = 〈{d}, {c}, {b}〉 and
S_2 = 〈{a, b}〉 from the sequential database SDB (shown in Table 3.1), there
are different ways to combine them to form new candidate super sequential
patterns at the sequence level. We can combine them in the following possible
ways:
〈{a, b}, {d}, {c}, {b}〉, 〈{d}, {a, b}, {c}, {b}〉, 〈{d}, {c}, {a, b}, {b}〉,
and 〈{d}, {c}, {b}, {a, b}〉.
In the above patterns, the order of the itemsets of the original pattern S_1, from
which all possible candidate super sequence patterns are generated, remains
unchanged. Another way of generating candidate super sequences while ensuring
integrity is to concatenate the two sequences S_1 and S_2 in both the forward and
reverse directions, i.e., 〈S_1, S_2〉 and 〈S_2, S_1〉:
〈{a, b}, {d}, {c}, {b}〉 and 〈{d}, {c}, {b}, {a, b}〉.
With concatenation, only the above two candidate super sequence patterns are
possible, irrespective of the size of the original sequence S_1. Therefore,
we have applied the former method, which generates all possible candidate patterns,
in our algorithm.
Note that the total number of possible super patterns is equal to
|S_1| + 1. For example, the size of the pattern S_1 is 3 because it has three
events. Therefore, the total number of super patterns grown from the pattern
S_1 is 4. Also, the size of the super sequences is the sum of the sizes of
the sequences S_1 and S_2. Similarly, the length of the super sequences is
the sum of the lengths of the sequences S_1 and S_2. In other words, if the size
and length of S_1 are K and L respectively, and the length of the size-1 sequence
S_2 is L′, then the size and the length of the super patterns are (K + 1) and
(L + L′) respectively. For example, the size and length of the pattern S_1 are both 3,
while for the pattern S_2 the size is 1 and the length is 2. So, the size and the
length of the super patterns are 4 and 5, respectively.
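The counts in this paragraph can be verified with a small sketch that compares the insertion method with plain concatenation. Sequences are assumed to be lists of sets, and `insert_everywhere` is a hypothetical helper name.

```python
def insert_everywhere(sequence, event):
    """Insert `event` at each of the len(sequence) + 1 positions of `sequence`."""
    return [sequence[:i] + [event] + sequence[i:] for i in range(len(sequence) + 1)]

S1 = [{"d"}, {"c"}, {"b"}]   # size K = 3, length L = 3
S2 = [{"a", "b"}]            # size 1, length L' = 2

supers = insert_everywhere(S1, S2[0])     # insertion method: |S1| + 1 candidates
concatenated = [S2 + S1, S1 + S2]         # concatenation method: always 2 candidates

sizes = {len(s) for s in supers}
lengths = {sum(len(e) for e in s) for s in supers}
```

Every candidate in `supers` has size 4 and length 5, and the two concatenations are the first and last of the four inserted candidates.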
Also, in this stage all rare patterns are separated into different equivalence
groups based on their frequency and the sequences in the SDB in which they
occur. Rare patterns that have the same frequency and occur in the
same sequences in the database SDB are put into one equivalence class.
Note that, among all rare patterns in an equivalence class, we can also identify
the smallest as well as the largest rare patterns, as required by different
domains.
3.3.1 Generating Rare Sequential Generator Patterns
In this phase, all rare sequential generator patterns (RSG) are discovered from
the sequential database SDB. First, based on the frequency test (comparing
the support of each sequence against the defined support threshold value maxsup),
Events(SDB) is divided into rare and frequent zones. The sequences in the rare
zone are size-1 rare generators. On the other hand, the frequent zone sequences
are size-1 frequent generators. This is done in steps 1-15 in Algorithm 3.1 given
below.
To find size-2 and above rare generators, size-(s−1) frequent generators, s ≥ 2,
that share a common size-(s−2) prefix subsequence are combined in pairs to
generate size-s super patterns. For example, two frequent generator patterns of
size-3, 〈{c}, {b}, {e}〉 and 〈{c}, {b}, {f}〉, share a common size-2 prefix
subsequence 〈{c}, {b}〉, so these two patterns produce two super sequences of size-4,
namely 〈{c}, {b}, {e}, {f}〉 and 〈{c}, {b}, {f}, {e}〉. In other words, keeping the
common prefix sequence unchanged, the remaining suffix events are merged
in both forward and reverse order. These super sequence patterns
are tested against the maxsup value to separate frequent generators from rare
generators. This is done in steps 18 to 32 in Algorithm 3.1.
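The prefix-based join described above can be sketched as follows, assuming patterns are lists of sets; `join_on_prefix` is an illustrative name and not part of Algorithm 3.1.

```python
def join_on_prefix(p, q):
    """Join two size-(s-1) patterns that share their first s-2 events.

    Returns the two size-s candidates obtained by appending the last events
    in forward and in reverse order, or an empty list if the prefixes differ.
    """
    if len(p) != len(q) or p[:-1] != q[:-1]:
        return []
    return [p + q[-1:], q + p[-1:]]

p = [{"c"}, {"b"}, {"e"}]
q = [{"c"}, {"b"}, {"f"}]
joined = join_on_prefix(p, q)
```

With the two size-3 patterns from the text, `joined` contains the two size-4 super sequences 〈{c}, {b}, {e}, {f}〉 and 〈{c}, {b}, {f}, {e}〉.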
The above process continues until no more frequent generators can be found.
In the third iteration with the database SDB, the process stops as there exist
no frequent generators. At the end of the process, all rare generator patterns
are collected. To demonstrate the process, the execution
of Algorithm 3.1 with the maxsup value set to 2 on the database SDB given in
Table 3.1 is illustrated in Table 3.2. At first, the algorithm finds all the
size-1 sequences as candidate sequence generators CSG_1 in Table 3.2(a) in a single
database scan. Then the support values of these sequences are counted. Since
the empty sequence is a subsequence of every sequence in the SDB,
the empty sequence is frequent and its support is equal to the total number of
sequences in the SDB. In other words, the empty sequence has a 100% support
value, shown as FSG_0 in Table 3.2(a).
Algorithm 3.1: Finding Rare Sequential Generator Patterns.
Input: A sequential database SDB, maxsup
Output: Rare Sequential Generator Patterns (RSG)
 1  CSG_1 ← {〈e〉 | ∀e ∈ Events(SDB)}   // CSG_1 is the set of candidate sequence generators of size 1
 2  FSG_1 ← {}, RSG_1 ← {}   // FSG_1 and RSG_1 are the sets of frequent and rare sequential generators, respectively
 3  S.supp_0 = |SDB|, ∀S ∈ CSG_1
 4  Count support S.supp_1 of each sequence S in CSG_1 by scanning the SDB
 5  for S ∈ CSG_1 do
 6      if S.supp_1 = S.supp_0 then
 7          remove S from CSG_1
 8      else
 9          if S.supp_1 > maxsup then
10              FSG_1 ← FSG_1 ∪ {S}
11          else
12              RSG_1 ← RSG_1 ∪ {S}
13  s ← 2
14  FSG_s ← {}, RSG_s ← {}
15  while FSG_{s−1} not empty do
16      CSG_s ← all possible combinations of two sequences with common prefix of size-(s−2) subsequences in FSG_{s−1}
17      for S ∈ CSG_s do
18          ms ← minimum support of the size-(s−1) subsequences of S
19          S.supp_s ← 0
20          for a ∈ SDB do
21              if S ⊑ a then
22                  S.supp_s ← S.supp_s + 1
23              else
24                  continue
25          if S.supp_s = ms then
26              remove S from CSG_s
27          else
28              if S.supp_s > maxsup then
29                  FSG_s ← FSG_s ∪ {S}
30              else
31                  RSG_s ← RSG_s ∪ {S}
32      s ← s + 1
33  return RSG = RSG_1 ∪ RSG_2 ... RSG_{s−1}
Table 3.2: Execution of Algorithm 3.1.

(a) First iteration

CSG_1          minsup of FSG_0   Sup of CSG_1   True CSG_1
〈{a}〉          6                 4              Yes
〈{b}〉          6                 5              Yes
〈{c}〉          6                 5              Yes
〈{d}〉          6                 1              Yes
〈{e}〉          6                 5              Yes
〈{f}〉          6                 1              Yes
〈{g}〉          6                 1              Yes
〈{a, b}〉       6                 1              Yes
〈{b, c}〉       6                 1              Yes
〈{c, e}〉       6                 1              Yes
〈{b, d}〉       6                 1              Yes

RSG_1          Sup of RSG_1      FSG_1          Sup of FSG_1
〈{d}〉          1                 〈{a}〉          4
〈{f}〉          1                 〈{b}〉          5
〈{g}〉          1                 〈{c}〉          5
〈{a, b}〉       1                 〈{e}〉          5
〈{b, c}〉       1
〈{c, e}〉       1
〈{b, d}〉       1

(b) Second iteration

CSG_2          minsup of FSG_1   Sup of CSG_2   True CSG_2
〈{a}, {a}〉     4                 0              No
〈{a}, {b}〉     4                 4              No
〈{b}, {a}〉     4                 0              No
〈{a}, {c}〉     4                 4              No
〈{c}, {a}〉     4                 0              No
〈{a}, {e}〉     4                 4              No
〈{e}, {a}〉     4                 0              No
〈{b}, {b}〉     5                 1              Yes
〈{b}, {c}〉     5                 4              Yes
〈{c}, {b}〉     5                 2              Yes
〈{b}, {e}〉     5                 5              No
〈{e}, {b}〉     5                 0              No
〈{c}, {c}〉     5                 1              Yes
〈{c}, {e}〉     5                 4              Yes
〈{e}, {c}〉     5                 2              Yes
〈{e}, {e}〉     5                 1              Yes

RSG_2          Sup of RSG_2      FSG_2          Sup of FSG_2
〈{b}, {b}〉     1                 〈{b}, {c}〉     4
〈{c}, {c}〉     1                 〈{c}, {b}〉     2
〈{e}, {e}〉     1                 〈{c}, {e}〉     4
                                 〈{e}, {c}〉     2

(c) Third iteration

CSG_3              minsup of FSG_2   Sup of CSG_3   True CSG_3
〈{c}, {b}, {e}〉    2                 2              No
〈{c}, {e}, {b}〉    2                 0              No

RSG_3          Sup of RSG_3      FSG_3          Sup of FSG_3
〈{}〉                             〈{}〉
To find true sequence generators, based on Property 1 the candidate sequence
generators are pruned if their support is equal to the minimum support
of their subsequences. For size-1, all CSG_1 in Table 3.2(a) turn out to be true CSG_1.
Then, the true CSG_1 are tested against the user defined maxsup value to find the
rare sequence generators RSG_1 and frequent sequence generators FSG_1. For the
example in Table 3.2, 7 size-1 rare generators and 4 frequent generators are found.
Next, the FSG_1 are combined in pairs to generate the size-2 candidates CSG_2 in
Table 3.2(b), keeping the common prefix subsequence of FSG_1, which is the empty
sequence. CSG_2 contains some true (potential) generators and some that are not
true generators because they have the same support as one of their subsequence
patterns. For example, 〈{a}, {b}〉 in CSG_2 has the same support as one of its
subsequences, 〈{a}〉. The true CSG_2 are further separated into rare sequential generators
RSG_2 and frequent sequential generators FSG_2 in Table 3.2(b). Finally, after
the third iteration in Table 3.2(c), there exist no frequent sequential generators
except the empty sequence. So, no further candidate sequences can be generated and
the algorithm stops. The rare generators are collected from the RSG_s, where s is the
size of the rare generators.
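The walkthrough above can be reproduced with a compact sketch of the first phase. This is not the author's implementation: it makes three assumptions chosen so that the output matches Table 3.2. Size-1 candidates are all single items plus all events observed in the database, a pattern is frequent when its support is greater than or equal to the threshold (Definition 6), and candidates with support 0 are discarded. The function names and the `seq` shorthand are hypothetical.

```python
from itertools import product

def seq(*events):
    """Shorthand: seq("a", "bd") builds the sequence <{a}, {b, d}>."""
    return tuple(frozenset(e) for e in events)

def contains(pattern, sequence):
    """Containment test of Definition 3 (earliest-match strategy)."""
    pos = 0
    for event in pattern:
        while pos < len(sequence) and not event <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, sdb):
    return sum(contains(pattern, s) for s in sdb)

def rare_generators(sdb, threshold):
    # Assumption: size-1 candidates are single items plus observed events.
    events = {e for s in sdb for e in s}
    items = {frozenset([i]) for e in events for i in e}
    rare, frequent = [], []
    for e in events | items:
        sup = support((e,), sdb)
        if sup == len(sdb):  # same support as the empty sequence: not a generator
            continue
        (frequent if sup >= threshold else rare).append((e,))
    while frequent:
        # Join patterns sharing all but their last event; ordered pairs from
        # product() give both the forward and the reverse merge.
        cands = {p + q[-1:] for p, q in product(frequent, repeat=2) if p[:-1] == q[:-1]}
        frequent = []
        for c in cands:
            sup = support(c, sdb)
            ms = min(support(c[:i] + c[i + 1:], sdb) for i in range(len(c)))
            if sup == 0 or sup == ms:  # non-existent, or pruned by Property 1
                continue
            (frequent if sup >= threshold else rare).append(c)
    return rare

SDB = [seq("a", "bd", "e", "c"), seq("a", "c", "b", "e"), seq("ab", "c", "b", "e"),
       seq("b", "ce", "e"), seq("a", "bc", "e", "f", "c"), seq("g")]  # Table 3.1
RSG = set(rare_generators(SDB, 2))
```

On Table 3.1 with a threshold of 2, `RSG` contains the seven size-1 and three size-2 rare generators listed in Table 3.2, and the loop ends after the third iteration because no frequent generator of size 3 remains.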
3.3.2 Generating All Rare Sequential Patterns
In the second phase, all rare sequential patterns are generated from the RSG
patterns discovered by Algorithm 3.1, which is explained in Section 3.3.1. We propose a
level-wise method to generate all rare sequential patterns. The proposed method
is formally described in Algorithm
3.2. Starting from the size-1 rare generators, which are shown in thick bordered
shaded rectangles in the lattice in Figure 3.1, for each size s the method
discovers size-(s+1) rare patterns from all possible size-(s+1) patterns generated
by extending each size-s rare pattern with every event in Events(SDB), as
described in steps 6 to 35 of Algorithm 3.2. The variable CRSP_{s+1} in Algorithm
3.2 contains all potential size-(s+1) rare patterns, which are generated in steps 7
to 13. In this process the order of the events in the sequence remains unchanged,
which ensures the integrity of the original sequence. For example, 〈{b, c}, {d}〉
is a size-2 rare sequential pattern which can be extended to three possible size-3
rare sequential patterns, 〈{a}, {b, c}, {d}〉, 〈{b, c}, {a}, {d}〉, and 〈{b, c}, {d}, {a}〉,
with an event {a} in SDB. After the merging, all patterns of size-(s+1) are
generated. These patterns fall either into rare patterns, which have a frequency
Algorithm 3.2: Generating all Rare Sequential Patterns and their Equivalence Classes.
Input: a sequential database SDB, a set of rare sequential generators RSG
Output: all rare patterns and their equivalence classes
 1  NEP ← {}    // NEP holds all non-existent patterns in SDB
 2  GRSP ← {}   // set of equivalence classes that have the same support and occur in the same sequences in SDB
 3  s ← 1
 4  RSP_s ← {g | g ∈ RSG, |g| = 1}   // size-1 rare generators
 5  ms ← max_{S ∈ SDB} {|S|}
 6  while s < ms and RSP_s ≠ empty do
 7      CRSP_{s+1} ← {}   // candidate rare sequence patterns of size-(s+1)
 8      for each S in RSP_s do
 9          for each e in Events(SDB) do
10              C ← all sequences generated by adding e into S at different positions
11              CRSP_{s+1} ← CRSP_{s+1} ∪ C
12      RSP_{s+1} ← {}
13      for each S in CRSP_{s+1} do
14          if there is n in NEP such that n is a subsequence of S then
15              continue
16          else
17              RSP_{s+1} ← RSP_{s+1} ∪ {S}
18      RSP_{s+1} ← RSP_{s+1} ∪ RSG_{s+1}
19      for each S in RSP_{s+1} do
20          S.supp ← 0, S.sid ← {}
21          for a ∈ SDB do
22              if S ⊑ a then
23                  S.supp ← S.supp + 1
24                  S.sid ← S.sid ∪ {a.sid}   // a.sid is the id of sequence a
25              else
26                  NEP ← NEP ∪ {S}
27          sp ← S.supp, sid ← S.sid
28          if GRSP_{sp,sid} is in GRSP then
29              GRSP_{sp,sid} ← GRSP_{sp,sid} ∪ {S}
30          else
31              GRSP_{sp,sid} ← {S}
32              GRSP ← GRSP ∪ {GRSP_{sp,sid}}
33      s ← s + 1
34      RSP_s ← RSP_s ∪ RSG_s
35  return RSP = RSP_1 ∪ RSP_2 ... ∪ RSP_{s−1}
36  return GRSP
below the support threshold value, e.g., maxsup = 2 for the database in Table
3.1, or into non-existent patterns, which have frequency zero, meaning that these
patterns do not exist in the database SDB. Note that these non-existent patterns
are not used to generate further super patterns, as those would also be
non-existent patterns. Also, no frequent patterns are formed at this stage, as
these patterns grow from minimal rare generators, meaning any super pattern
originating from a rare pattern cannot be frequent.
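The second phase can be sketched in the same style. The code below is a simplified reading of Algorithm 3.2 rather than a line-by-line transcription: it grows rare patterns level by level from the rare generators of Table 3.2, skips candidates that occur in no sequence, and groups the remaining patterns by the set of sequence identifiers in which they occur. The extension events are assumed to be all single items plus all observed events, and all names are hypothetical.

```python
from collections import defaultdict

def seq(*events):
    """Shorthand: seq("a", "bd") builds the sequence <{a}, {b, d}>."""
    return tuple(frozenset(e) for e in events)

def contains(pattern, sequence):
    """Containment test of Definition 3 (earliest-match strategy)."""
    pos = 0
    for event in pattern:
        while pos < len(sequence) and not event <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def sid_set(pattern, sdb):
    """Identifiers of the sequences containing the pattern; its size is the support."""
    return frozenset(sid for sid, s in sdb.items() if contains(pattern, s))

def rare_pattern_classes(sdb, generators):
    observed = {e for s in sdb.values() for e in s}
    events = observed | {frozenset([i]) for e in observed for i in e}
    level = defaultdict(set)           # rare patterns grouped by size
    for g in generators:
        level[len(g)].add(g)
    classes = defaultdict(set)         # sid set -> equivalence class
    size = 1
    while level[size]:
        for p in level[size]:
            classes[sid_set(p, sdb)].add(p)
            for e in events:
                for pos in range(len(p) + 1):   # insert e at every position
                    cand = p[:pos] + (e,) + p[pos:]
                    if sid_set(cand, sdb):      # skip non-existent patterns
                        level[size + 1].add(cand)
        size += 1
    return classes

SDB = {"SID1": seq("a", "bd", "e", "c"), "SID2": seq("a", "c", "b", "e"),
       "SID3": seq("ab", "c", "b", "e"), "SID4": seq("b", "ce", "e"),
       "SID5": seq("a", "bc", "e", "f", "c"), "SID6": seq("g")}   # Table 3.1
RSG = [seq("d"), seq("f"), seq("g"), seq("ab"), seq("bc"), seq("ce"), seq("bd"),
       seq("b", "b"), seq("c", "c"), seq("e", "e")]              # from Table 3.2

classes = rare_pattern_classes(SDB, RSG)
sid1_class = classes[frozenset({"SID1"})]
smallest = {p for p in sid1_class if len(p) == min(map(len, sid1_class))}
largest = {p for p in sid1_class if len(p) == max(map(len, sid1_class))}
```

Under these assumptions, the class of patterns that occur only in SID1 has 16 members; its smallest members are 〈{d}〉 and 〈{b, d}〉 and its largest members are 〈{a}, {b, d}, {e}, {c}〉 and 〈{a}, {d}, {e}, {c}〉.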
Figure 3.3: An equivalence class of rare sequential patterns.
In addition, at every stage, the rare patterns are grouped together into different
equivalence classes based on their equal support value and common sequences in
the SDB. In Algorithm 3.2, for a pattern S, its support is denoted as S.supp, and
the set of sequence IDs where S occurs is denoted as S.sid. For example, the rare
sequential patterns in Figure 3.3 form an equivalence class, which comprises of
16 rare sequential patterns because all of these patterns have the same support
value 1 and they belong to the same sequence SID3 of the database SDB shown
in Table 3.1. At the end of the process, all rare sequential patterns in the SDB
are distributed into di�erent equivalence classes by Algorithm 3.2. The smallest
rare sequential patterns 〈{d}〉 and 〈{b, d}〉 shown in thick bordered shaded rect-
angles at the bottom of the Figure 3.3 are the minimal rare sequential generator
patterns. The maximal rare sequential patterns shown in dashed rectangles at
the top of the Figure 3.3 are the closed sequential patterns. All other rare se-
quential patterns shown in thick bordered unshaded rectangles in Figure 3.3 are
generated from the minimal rare sequential generator patterns.
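The grouping into equivalence classes can be illustrated with a short Python
sketch. It assumes that a sequence is a tuple of itemsets and that a class is
keyed by the pair (S.supp, S.sid); the function names and the toy database
below are hypothetical and do not reproduce the SDB of Table 3.1.

```python
from collections import defaultdict


def occurs_in(pattern, sequence):
    """True if each itemset of `pattern` is contained, in order, in `sequence`."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def equivalence_classes(patterns, sdb):
    """Group patterns by (support, set of sequence IDs), i.e. (S.supp, S.sid)."""
    classes = defaultdict(list)
    for pattern in patterns:
        sids = frozenset(sid for sid, s in sdb.items() if occurs_in(pattern, s))
        classes[(len(sids), sids)].append(pattern)
    return dict(classes)


def seq(*itemsets):
    """Helper: build a sequence (tuple of frozensets) from strings of items."""
    return tuple(frozenset(s) for s in itemsets)


# Hypothetical toy database and rare patterns (all with support 1).
sdb = {1: seq("a", "c"), 2: seq("a", "e", "c"), 3: seq("bd", "c")}
patterns = [seq("d"), seq("bd"), seq("bd", "c"), seq("e"), seq("a", "e")]

classes = equivalence_classes(patterns, sdb)
for (supp, sids), members in classes.items():
    print(supp, sorted(sids), len(members))
```

Within each class, the shortest members play the role of minimal generators
and the longest member plays the role of the maximal pattern.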
3.4 Evaluation
In this section, we outline the experimental methodology used to evaluate our
proposed algorithms, which find minimal, maximal and all rare sequential
patterns to detect anomalies in SCADA control logs. The logs used in the
experiments are off-line control logs; that is, SCADA activities were recorded
in log files, which were later pre-processed for use with the rare sequential
pattern mining algorithm. Firstly, we present the experimental setup, a
complete industrial control system comprising three different control systems:
the conveyor belt system, the water tank system, and the pressure control
system. Secondly, we describe the datasets used throughout the experiments to
evaluate our proposed methods. Finally, we explain the results validating our
proposed methods.
3.4.1 SCADA System Architecture
This thesis uses three separate SCADA control systems, namely a conveyor belt
control system, a pressure control system and a water tank control system, that
are configured in the SCADA laboratory. All of these control systems are
connected by a human machine interface (HMI), which not only monitors and
controls the processes, but also collects the activity logs generated by the
programmable logic controllers (PLCs) in each of the systems.
The conveyor belt system consists of two loops of conveyor belts running in
two different directions, left and right. As light and dark objects pass a
sensor, a sorting arm changes direction, moving each object to either the left
or the right conveyor belt depending on its color. This process continues for a
predefined time period. After the conveyor belt finishes a process cycle, the
water tank control system starts its operation. A sensor monitors the water
level as a percentage of the total capacity of the upper tank. If the water
rises above the highest level or falls below the lowest level of the tank, an
alarm is triggered to indicate the upper tank's overflow or underflow
condition. This process repeats for a predefined time period. In the final
stage of the process, the pressure control system pumps air into a sealed steel
pipe system and increases the air pressure. At a given upper pressure
threshold, measured in pounds per square inch (PSI), the air pressure is
released through a solenoid valve and the air pressure in the pipe drops. Once
a set lower pressure level has been reached, the air compressor starts up
again, building pressure in the pipe system. This process continues for a
predefined period of time. Once the compressed air pipeline stage is complete,
a full cycle of the process has been completed, and the system starts again
with the conveyor belt control system.
3.4.2 Datasets
To evaluate our proposed algorithms, we conducted experiments on two sets of
data. The first set of data (First Dataset), comprising Dataset-1, Dataset-2,
and Dataset-3, represents the logs collected from the conveyor belt, the
pressure control and the water tank control systems described in Section 3.4.1.
Partial views of these datasets are shown in Table 3.3, Table 3.4 and Table 3.5
respectively. These datasets were collected from a training session held in the
SCADA laboratory from 9.30 a.m. to 4.00 p.m. The three process control systems
(conveyor belt, pressure control, and water tank) were switched on and were
functioning smoothly at the start of the training day.
Two highly skilled professional teams, named the blue team and the red team,
participated in the training session. The aim of the blue team was to operate
and monitor the SCADA control system equipment, while the aim of the red team
was to conduct cyber-attacks on the control system. The system was compromised
by the red team, who successfully disrupted the processes running on all three
control systems in the latter half of the day. All the events (regular and
attacked) were recorded in the log files. There were a total of 205 868,
388 877, and 228 762 lines of logs recorded in Dataset-1, Dataset-2 and
Dataset-3 respectively during the whole-day training period. The logs were
recorded under 5 different attributes or features. These are VarName,
TimeString, VarValue, Validity, and Time_ms.
Table 3.3: A partial view of a conveyor belt control system log.
VarName TimeString VarValue
Conv_Read_Conv_Color_PE 16/07/2015 9:31:11 AM 0
Conv_Read_Conv_HMI_Direction 16/07/2015 9:31:11 AM 0
Conv_Read_Conv_Present_PE 16/07/2015 9:31:11 AM 0
Conv_Read_Solenoid_Left_Direction 16/07/2015 9:31:11 AM -1
Conv_Read_Solenoid_Right_Direction 16/07/2015 9:31:11 AM 0
Conv_Run_Status 16/07/2015 9:31:11 AM 0
HMI_Conv_Direction 16/07/2015 9:31:11 AM 0
HMI_Conv_Master_Mode 16/07/2015 9:31:11 AM -1
HMI_Conv_Reset 16/07/2015 9:31:11 AM 0
Table 3.4: A partial view of a pressure control system log.
VarName TimeString VarValue
HMI_Pipe_Pump_Off_SP 16/07/2015 9:31:11 AM 40
HMI_Pipe_Pump_On_SP 16/07/2015 9:31:11 AM 5
HMI_Pipe_Solenoid_Off_SP 16/07/2015 9:31:11 AM 30
HMI_Pipe_Solenoid_On_SP 16/07/2015 9:31:11 AM 40
HMI_Pipe_Master_Mode 16/07/2015 9:31:11 AM -1
Pipe_Pump_Run_Status 16/07/2015 9:31:11 AM -1
Pipe_Read_Pipeline_Pressure 16/07/2015 9:31:11 AM 0.3130435
Pipe_Read_Pump_Mode 16/07/2015 9:31:11 AM 0
Pipe_Read_Pump_Run_Cmd 16/07/2015 9:31:11 AM 0
Pipe_Read_Solenoid_Mode 16/07/2015 9:31:11 AM 0
The attribute VarName holds the name of the event that occurs in the control
system, TimeString records the date and time when the event occurs, VarValue
holds the value of the event named in VarName, Validity is used to check the
floating-point value of VarValue, and finally Time_ms holds the TimeString
value in milliseconds. Among these 5 attributes we used only 3, namely VarName,
TimeString, and VarValue, in our experiment, since these attributes hold the
information required for finding anomalies.
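As an illustration of this attribute selection, the following Python sketch
keeps the three attributes and parses the TimeString format seen in Tables 3.3
to 3.5 (day/month/year with a 12-hour clock). The row layout as a dictionary,
the function name select_attributes, and the placeholder values for Validity
and Time_ms are assumptions; the actual log file layout is not specified here.

```python
from datetime import datetime

# Hypothetical row mirroring the five logged attributes. The Validity and
# Time_ms values are placeholders, not values taken from the real logs.
rows = [
    {"VarName": "Conv_Run_Status", "TimeString": "16/07/2015 9:31:11 AM",
     "VarValue": "0", "Validity": "1", "Time_ms": "0"},
]


def select_attributes(row):
    """Keep VarName, TimeString (parsed) and VarValue; drop the other two."""
    timestamp = datetime.strptime(row["TimeString"], "%d/%m/%Y %I:%M:%S %p")
    return row["VarName"], timestamp, row["VarValue"]


selected = [select_attributes(r) for r in rows]
print(selected[0])
```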
Table 3.5: A partial view of a water tank control system log.
VarName TimeString VarValue
HMI_Tank_Master_Mode 16/07/2015 9:31:11 AM -1
Tank_Level 16/07/2015 9:31:11 AM 52.68681
Tank_Off_SP_Int 16/07/2015 9:31:11 AM 80
Tank_On_SP_Int 16/07/2015 9:31:11 AM 50
Tank_Read_Pump_In_Auto 16/07/2015 9:31:11 AM 0
Tank_Read_Pump_In_Manual 16/07/2015 9:31:11 AM 0
Tank_Read_Pump_Running 16/07/2015 9:31:11 AM 0
Tank_Read_Tank_Level 16/07/2015 9:31:11 AM 52.6868
Tank_Stopped 16/07/2015 9:31:11 AM -1
Tank_Usage_Level 16/07/2015 9:31:11 AM 47.3132
During the training session, the red team successfully conducted 5 different
types of attacks across the three control systems. The red team changed the
diverter gate direction in the conveyor belt system between 2.49 p.m. and
3.09 p.m. It attacked the water tank system by changing the tank's mode of
operation from automatic to manual, between 3.14 p.m. and 3.24 p.m. In
addition, the team stopped the water tank while it was running at 3.24 p.m.
Finally, in the pressure control system, an attack was conducted that changed
the lower and upper pressure threshold values between 3.33 p.m. and 3.35 p.m.
on the training day.
Table 3.6: A partial view of a conveyor belt control system log from the second dataset.
VarName TimeString VarValue
Conv_Read_Conv_Color_PE 16/06/2017 6:55:08 PM 0
Conv_Read_Conv_HMI_Direction 16/06/2017 6:55:08 PM 0
Conv_Read_Conv_Present_PE 16/06/2017 6:55:08 PM 0
Conv_Read_Solenoid_Left_Direction 16/06/2017 6:55:08 PM -1
Conv_Read_Solenoid_Right_Direction 16/06/2017 6:55:08 PM 0
Conv_Run_Status 16/06/2017 6:55:08 PM 0
HMI_Conv_Direction 16/06/2017 6:55:08 PM 0
HMI_Conv_Master_Mode 16/06/2017 6:55:08 PM -1
HMI_Conv_Reset 16/06/2017 6:55:08 PM 0
The second set of data (Second Dataset) comprises Dataset-4, Dataset-5, and
Dataset-6, representing the conveyor belt, pressure control and water tank
process control logs. These datasets were collected in a different controlled
experimental setup. In other words, we reconfigured the same physical equipment
using a different process control setup. Attacks were conducted to disrupt the
normal process activities of the control system network. An attack PC was
connected to the control system network, and a Python script was run from the
attack PC to carry out these attacks.
Table 3.7: A partial view of a pressure control system log from the second dataset.
VarName TimeString VarValue
HMI_Pipe_Pump_Off_SP 16/06/2017 6:55:08 PM 40
HMI_Pipe_Pump_On_SP 16/06/2017 6:55:08 PM 5
HMI_Pipe_Solenoid_Off_SP 16/06/2017 6:55:08 PM 30
HMI_Pipe_Solenoid_On_SP 16/06/2017 6:55:08 PM 40
HMI_Pipe_Master_Mode 16/06/2017 6:55:08 PM -1
Pipe_Pump_Run_Status 16/06/2017 6:55:08 PM -1
Pipe_Read_Pipeline_Pressure 16/06/2017 6:55:08 PM 17.63478
Pipe_Read_Pump_Mode 16/06/2017 6:55:08 PM 0
Pipe_Read_Pump_Run_Cmd 16/06/2017 6:55:08 PM -1
Pipe_Read_Solenoid_Mode 16/06/2017 6:55:08 PM -1
All of the normal activities as well as the attack activities were recorded in
the logs. Later, we identified the attack logs and labeled them using another
Python script to generate labeled attack data for validating the rare patterns.
The rare patterns generated from the Second Dataset can therefore be validated
against the labeled attack dataset, in contrast to the First Dataset, where
validation relies on the knowledge of domain experts. We ran the process
control system for a period of 8 hours to generate the Second Dataset.
There were a total of 109 053, 474 368, and 235 990 lines of logs recorded in
Dataset-4, Dataset-5, and Dataset-6 respectively. Partial views of these
datasets are shown in Table 3.6, Table 3.7, and Table 3.8 respectively. These
logs, like the First Dataset logs, were recorded under 5 different attributes
or features named VarName, TimeString, VarValue, Validity, and Time_ms. As with
the First Dataset, we used only the 3 attributes VarName, TimeString, and
VarValue in our experiment, since these attributes hold the information
required for finding anomalies.
We carried out attacks on all three control systems, disrupting their regular
processes. The attacks conducted on the process control system to generate the
Second Dataset are as follows. Firstly, three types of attacks were conducted
on the conveyor belt control system: changing the direction of the diverter
gate, stopping the conveyor belt at unscheduled times, and starting the
conveyor belt after an unscheduled stoppage. In addition, a flooding attack was
conducted for a short time period that stopped and started the conveyor belt
multiple times. We kept a record of the control process logs in log files.
Later, we labeled the attacked logs so that the discovered rare suspicious
patterns could be compared against them to detect attack patterns.
Table 3.8: A partial view of a water tank control system log from the second dataset.
VarName TimeString VarValue
HMI_Tank_Master_Mode 16/06/2017 6:55:08 PM -1
Tank_Level 16/06/2017 6:55:08 PM 61.77073
Tank_Off_SP_Int 16/06/2017 6:55:08 PM 80
Tank_On_SP_Int 16/06/2017 6:55:08 PM 50
Tank_Read_Pump_In_Auto 16/06/2017 6:55:08 PM 0
Tank_Read_Pump_In_Manual 16/06/2017 6:55:08 PM 0
Tank_Read_Pump_Running 16/06/2017 6:55:08 PM 0
Tank_Read_Tank_Level 16/06/2017 6:55:08 PM 61.77073
Tank_Stopped 16/06/2017 6:55:08 PM -1
Tank_Usage_Level 16/06/2017 6:55:08 PM 38.22927
Secondly, in the pressure control system, four types of attacks were conducted.
The first type of attack changed the upper threshold value of the pressure
control system from its predefined set value. The second type of attack changed
the lower threshold value from its predefined set value. The third type of
attack stopped and then started the pressure control system at unscheduled
times. Finally, flooding attacks were conducted by activating and deactivating
the pressure control system multiple times in quick succession.
Finally, in the water tank control system, we conducted two types of attacks.
The first type of attack changed the mode of operation of the water tank from
automatic to manual mode and later from manual back to automatic mode. The
second type of attack caused an unscheduled stop and start of the water tank
pump. Both types of attacks were conducted as flooding attacks, by switching
the water tank between automatic and manual mode, and by activating and
deactivating the water tank, several times.
In the preprocessing, we merged the feature or variable name VarName with its
corresponding values held by the feature VarValue. Together, the VarName and
the VarValue represent an itemset or event of the control process. For example,
the feature Conv_Read_Solenoid_Left_Direction and its corresponding value
-1 are merged together as {Conv_Read_Solenoid_Left_Direction_-1} repre-
senting an event of the conveyor belt control system from Dataset-1 in the First
Dataset.
Table 3.9: A sample of the conveyor belt SDB generated from Dataset-1 in the First Dataset.
SID1 〈{Conv_Read_Conv_Color_PE_0}, {Conv_Read_Conv_HMI_Direction_0}, {Conv_Read_Conv_Present_PE_0}, {Conv_Read_Solenoid_Left_Direction_-1}, {Conv_Read_Solenoid_Right_Direction_0}, {Conv_Run_Status_0}, {HMI_Conv_Direction_0}, {HMI_Conv_Master_Mode_-1}, {HMI_Conv_Reset_0}〉
SID2 〈{Conv_Read_Conv_Color_PE_0}, {Conv_Read_Conv_HMI_Direction_0}, {Conv_Read_Conv_Present_PE_0}, {Conv_Read_Solenoid_Left_Direction_-1}, {Conv_Read_Solenoid_Right_Direction_0}, {Conv_Run_Status_-1}, {HMI_Conv_Direction_0}, {HMI_Conv_Master_Mode_-1}, {HMI_Conv_Reset_0}〉
The events that occur within a defined time duration form a sequence in the
sequence database SDB. An example of the sequence database SDB from the
conveyor belt control logs (Dataset-1) is given in Table 3.9. The conveyor belt
sequence database SDB built from Dataset-1 comprises 22 875 sequences. Each
sequence consists of the events that occurred within one second on the conveyor
belt control system. For example, in the conveyor belt control system 9 events
occur every second, so 22 875 sequences were created from 205 868 lines of
logs. Once the sequence database SDB was created, we ran our proposed rare
sequential pattern mining algorithm on the SDB.
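The preprocessing described above (merging VarName with VarValue into an
event, then grouping the events of each second into one sequence) can be
sketched in Python as follows. The function name build_sdb and the six-row
excerpt are hypothetical, and grouping on the raw TimeString is an assumption
about how the one-second windows were formed.

```python
from collections import OrderedDict


def build_sdb(log_rows):
    """Group merged events by timestamp: one sequence per second of log."""
    by_second = OrderedDict()
    for var_name, time_string, var_value in log_rows:
        # Merge name and value into one event label, e.g. "Conv_Run_Status_-1".
        event = "{}_{}".format(var_name, var_value)
        by_second.setdefault(time_string, []).append(frozenset([event]))
    # Sequence IDs are assigned in order of first appearance.
    return {sid: tuple(events)
            for sid, events in enumerate(by_second.values(), start=1)}


# Hypothetical excerpt: three variables logged in two consecutive seconds.
log_rows = [
    ("Conv_Run_Status", "16/07/2015 9:31:11 AM", 0),
    ("HMI_Conv_Direction", "16/07/2015 9:31:11 AM", 0),
    ("HMI_Conv_Reset", "16/07/2015 9:31:11 AM", 0),
    ("Conv_Run_Status", "16/07/2015 9:31:12 AM", -1),
    ("HMI_Conv_Direction", "16/07/2015 9:31:12 AM", 0),
    ("HMI_Conv_Reset", "16/07/2015 9:31:12 AM", 0),
]
sdb = build_sdb(log_rows)
print(len(sdb), [sorted(e)[0] for e in sdb[2]])
```

As a rough check of the reported sizes, 205 868 log lines at 9 events per
second correspond to about 22 874 seconds, in line with the 22 875 sequences
reported for Dataset-1.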
Similarly, we created the pressure control SDB from Dataset-2 in the First
Dataset. The pressure control SDB is composed of 22 877 sequences from 388 877
lines of logs. In the pressure control system, 17 events occurred every second.
A sample of the pressure control SDB is shown in Table 3.10. The water tank
SDB was created from Dataset-3 in the First Dataset. The water tank SDB
comprises 22 876 sequences from 228 762 lines of logs. In the water tank
control system, 10 events occurred every second. A sample of the water tank
SDB is shown in Table 3.11.
Table 3.10: A sample of the pressure control SDB generated from Dataset-2 in the First Dataset.
SID1 〈{HMI_Pipe_Pump_Off_SP_40}, {HMI_Pipe_Pump_On_SP_5}, {HMI_Pipe_Solenoid_Off_SP_30}, {HMI_Pipe_Solenoid_On_SP_40}, {HMI_Pipe_Master_Mode_-1}, {Pipe_Pump_Run_Status_-1}, {Pipe_Read_Pipeline_Pressure_0.3130435}, {Pipe_Read_Pump_Mode_0}, {Pipe_Read_Pump_Run_Cmd_0}, {Pipe_Read_Solenoid_Mode_0}, {Pipe_Read_Solenoid_Open_Cmd_-1}, {Pipe_Solenoid_Open_Status_-1}, {Pressure_Int_-1}, {Pump_Off_SP_Int_40}, {Pump_On_SP_Int_5}, {Solenoid_Off_SP_Int_30}, {Solenoid_On_SP_Int_40}〉
SID2 〈{HMI_Pipe_Pump_Off_SP_40}, {HMI_Pipe_Pump_On_SP_5}, {HMI_Pipe_Solenoid_Off_SP_30}, {HMI_Pipe_Solenoid_On_SP_40}, {HMI_Pipe_Master_Mode_-1}, {Pipe_Pump_Run_Status_-1}, {Pipe_Read_Pipeline_Pressure_0.326087}, {Pipe_Read_Pump_Mode_0}, {Pipe_Read_Pump_Run_Cmd_0}, {Pipe_Read_Solenoid_Mode_0}, {Pipe_Read_Solenoid_Open_Cmd_-1}, {Pipe_Solenoid_Open_Status_-1}, {Pressure_Int_-1}, {Pump_Off_SP_Int_40}, {Pump_On_SP_Int_5}, {Solenoid_Off_SP_Int_30}, {Solenoid_On_SP_Int_40}〉
Moreover, we created three more sequence databases, the conveyor belt SDB,
pressure control SDB, and water tank SDB, from Dataset-4, Dataset-5, and
Dataset-6 respectively in the Second Dataset. The conveyor belt SDB comprises
12 117 sequences from 109 053 lines of logs, the pressure control SDB comprises
27 904 sequences from 474 368 lines of logs, and the water tank SDB comprises
23 599 sequences from 235 990 lines of logs. All of the sequence databases
generated from both the First Dataset and the Second Dataset are used as
inputs to our proposed algorithms to generate rare sequential patterns.
Table 3.11: A sample of the water tank SDB generated from Dataset-3 in the First Dataset.
SID1 〈{HMI_Tank_Master_Mode_-1}, {Tank_Level_52.68681}, {Tank_Off_SP_Int_80}, {Tank_On_SP_Int_50}, {Tank_Read_Pump_In_Auto_0}, {Tank_Read_Pump_In_Manual_0}, {Tank_Read_Pump_Running_0}, {Tank_Read_Tank_Level_52.6868}, {Tank_Stopped_-1}, {Tank_Usage_Level_47.3132}〉
SID2 〈{HMI_Tank_Master_Mode_-1}, {Tank_Level_52.764}, {Tank_Off_SP_Int_80}, {Tank_On_SP_Int_50}, {Tank_Read_Pump_In_Auto_0}, {Tank_Read_Pump_In_Manual_0}, {Tank_Read_Pump_Running_0}, {Tank_Read_Tank_Level_52.764}, {Tank_Stopped_-1}, {Tank_Usage_Level_47.236}〉
3.4.3 Experimental Methodology
We carried out experiments applying our two proposed algorithms. Algorithm
3.1 `Finding Rare Sequential Generator Patterns', explained in Section 3.1,
generates all rare sequential generator patterns from a sequence database,
such as the conveyor belt SDB shown in Table 3.9. After generating the minimal
rare generators, we apply Algorithm 3.2 `Generating all Rare Sequential
Patterns in the sequence database and their Equivalence Classes', explained
in Section 3.2, to find all rare sequential patterns. These rare patterns are
grown from the minimal rare generators. In addition to generating all rare
sequential patterns, we also group them into different equivalence classes. As
a result, we can find the smallest or minimal rare pattern, the largest or
maximal rare pattern, and all rare patterns in between the minimal and maximal
rare patterns in an equivalence class.
These algorithms were implemented in Python 2.7. The experiments were run on a
Dell OptiPlex 9020 with an Intel Core i7-4770 3.4 GHz processor, 16 GB RAM and
a 256 GB HDD, running Windows 7 Enterprise. The computation time and memory
consumption of the proposed algorithms could be improved with an
implementation in Java or C++. We explored NumPy for library functions that
find a subsequence, either consecutive or non-consecutive, in a sequence
database, but could not find any built-in function for this purpose.
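For illustration, both kinds of subsequence test can be written in a few lines
of plain Python without NumPy. This is a minimal sketch and not necessarily how
the test was implemented in this thesis; the function names are hypothetical
and each sequence is assumed to be a flat list of events.

```python
def is_subsequence(pattern, sequence):
    """Non-consecutive match: events appear in order, gaps allowed."""
    it = iter(sequence)
    # `event in it` advances the iterator past the match, preserving order.
    return all(event in it for event in pattern)


def is_consecutive_subsequence(pattern, sequence):
    """Consecutive match: events appear in order with no gaps."""
    n = len(pattern)
    return any(tuple(sequence[i:i + n]) == tuple(pattern)
               for i in range(len(sequence) - n + 1))


sequence = ["a", "b", "c", "d"]
print(is_subsequence(["a", "c"], sequence))              # True
print(is_consecutive_subsequence(["a", "c"], sequence))  # False
print(is_consecutive_subsequence(["b", "c"], sequence))  # True
```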
Anomalies could be rare or frequent in a system. Our proposed method
focuses on systems where the rarity of events is considered abnormal or
irregular behavior and events are recorded in a sequential manner. In other
words, our proposed method is applicable to detecting anomalies that are rare
in a system. An example of a rare anomalous pattern is a stealthy attack. A
stealthy attack is conducted on a system over a long period of time, for
example port scanning spread over a long period. If the port scanning is
conducted quickly in a short time, the system can easily flag the activity as
suspicious. However, if the port scan is conducted slowly over a long period
of time, it is difficult to detect as suspicious anomalous behavior. This is
because the events of the stealthy activity become rare events in the system.
It was assumed that anomalies happen rarely in a system. So, to find
rare occurrences of activities, the support threshold minsup = 2 was set
throughout the experiments involving all three datasets. Moreover, it has been
observed that in the domain of SCADA control systems there exist a limited
number of processes, and the events or actions forming a process are
repetitive. Therefore, in an attack-free or malfunction-free system, a defined
number of itemsets or events that accomplish a process are considered regular
or acceptable behavior. For example, {Conv_Read_Conv_Color_PE_0},
{Conv_Read_Solenoid_Left_Direction_-1}, and
{Conv_Read_Solenoid_Right_Direction_0} are three individual itemsets or events
from the sequences of the conveyor belt SDB shown in Table 3.9. As these events
are performed in a repetitive manner in the conveyor belt SDB, they become
frequent events.
On the other hand, any change to these process events, or any change that
deviates from the predefined order of the events, can make the events rare. For
example, if the value of the feature Conv_Read_Solenoid_Left_Direction is
changed from −1 to 0, that is, the event {Conv_Read_Solenoid_Left_Direction_-1}
is changed to {Conv_Read_Solenoid_Left_Direction_0}, then the predefined
outcome of the process is disrupted. In other words, the object moving on the
belt is sorted in the wrong direction. Since anomalies rarely occur in a
system, a rare action that is irregular or unacceptable behavior can be
considered a rare anomalous pattern. Hence, it can be treated as a suspicious
event that deserves further in-depth analysis.
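The effect of the threshold minsup = 2 on single events can be illustrated
with the following Python sketch, in which an event is rare when fewer than
minsup sequences contain it. The function name rare_events and the four-row
miniature SDB are hypothetical; only the event names are taken from the
conveyor belt logs.

```python
from collections import Counter

MINSUP = 2  # support threshold used in the experiments


def rare_events(sdb, minsup=MINSUP):
    """Return events whose support (number of sequences containing them)
    is below minsup."""
    support = Counter()
    for sequence in sdb:
        support.update(set(sequence))  # count each event once per sequence
    return {e: s for e, s in support.items() if s < minsup}


# Hypothetical miniature SDB: only the last sequence carries the changed value.
sdb = [
    ["Conv_Read_Solenoid_Left_Direction_-1", "Conv_Run_Status_0"],
    ["Conv_Read_Solenoid_Left_Direction_-1", "Conv_Run_Status_0"],
    ["Conv_Read_Solenoid_Left_Direction_-1", "Conv_Run_Status_0"],
    ["Conv_Read_Solenoid_Left_Direction_0", "Conv_Run_Status_0"],
]
print(rare_events(sdb))
```

The repeated events reach the threshold and are treated as frequent, while the
single changed value is reported as rare.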
Currently in the cyber security industry, existing intrusion detection systems
generate many suspicious alarms for the experts to analyse. However, most of
the alarms turn out to be false and only a few of them are found to be true.
Since our proposed method produces fewer rare patterns, it follows that there
will be fewer suspicious alarms for the experts to analyse for anomalies and
attack patterns.
3.4.4 Results
The experimental results are evaluated in two phases. In the first phase, the
experimental results obtained from the first dataset are evaluated with
domain expert knowledge. This is because the first dataset was created from a
training session in the SCADA laboratory in which two highly skilled
professional teams, the blue team and the red team, participated. The rare
suspicious patterns generated by the rare sequential pattern mining algorithm
that occurred during the period when the system was compromised by the red
team were identified as attack patterns by an expert. The identified rare
patterns were later checked against the stored logs and verified. In the
second phase, the results are validated with labelled attack datasets. The
labelled attack datasets were created while the second datasets were collected
in a controlled experimental setup. Attacks were conducted to disrupt the
normal process activities of the control system network. All of the normal
activities as well as the attacks were recorded in the logs. Later, we
identified the attacked logs and labelled them as anomalous for validating the
rare patterns. The rare patterns generated by the rare sequential pattern
mining algorithm from the second datasets were validated against the labelled
attack patterns.
Firstly, we show the results obtained from the First Dataset comprising
Dataset-1, Dataset-2, and Dataset-3. After that, we show the results from
the Second Dataset consisting of Dataset-4, Dataset-5, and Dataset-6. In
Dataset-1, the conveyor belt control system dataset, we found 6 rare sequential
patterns, grouped in 5 equivalence classes, out of 205 868 lines of logs. An
example is given in Table 3.12, where two rare sequential patterns SID1 and
SID2 have been identified as suspicious patterns.
Table 3.12: A sample of the rare sequential patterns from conveyor belt SDB in Dataset-1.
SID1 〈{HMI_Conv_Reset_-1}〉
SID2 〈{Conv_Read_Conv_Color_PE_0}, {Conv_Read_Conv_HMI_Direction_-1}, {Conv_Read_Conv_Present_PE_0}, {Conv_Read_Solenoid_Left_Direction_0}, {Conv_Read_Solenoid_Right_Direction_-1}, {Conv_Run_Status_0}, {HMI_Conv_Direction_-1}, {HMI_Conv_Master_Mode_-1}, {HMI_Conv_Reset_-1}〉
These two patterns (shown in Table 3.12) are also grouped in an equivalence
class, since they share the same support and belong to the same sequences in
the conveyor belt SDB. The first pattern SID1 in Table 3.12 comprises only a
single event, {HMI_Conv_Reset_-1}, which indicates that the conveyor belt
control system has been reset or restarted from the HMI terminal. The second
pattern SID2 comprises 9 events. Note that, in this equivalence class, the
pattern SID1 is the minimal or smallest rare sequential pattern, while the
pattern SID2 is the maximal or largest rare sequential pattern.
Since it was assumed that rare patterns could be suspicious, we consulted
the domain experts who were monitoring and tracking the SCADA control process
activities during the training day. Analyzing the patterns in Table 3.12, they
identified 3 individual events in SID2 as anomalous, because the values of
these events, {HMI_Conv_Reset_-1}, {Conv_Read_Conv_HMI_Direction_-1} and
{HMI_Conv_Direction_-1}, were changed from the expected value 0 to −1. As a
result, these changes together caused the conveyor belt to sort the objects
(dark and light) on the belt in the wrong direction, although the objects were
set to move in a predefined direction as a requirement of the SCADA process.
Therefore, in this scenario, it is evident that the minimal rare pattern SID1
alone could not determine that the rare pattern was anomalous; rather, it is
the maximal pattern SID2, with its other relevant events such as
{Conv_Read_Conv_HMI_Direction_-1} and {HMI_Conv_Direction_-1}, that helps the
minimal rare pattern 〈{HMI_Conv_Reset_-1}〉 to identify the anomalous pattern.
Moreover, since this anomalous sequence occurred during the time period when
the system was compromised, the experts confirmed that it represents a
cyber-attack sequence.
In Dataset-2, the pressure control dataset, 57 rare sequential patterns, which
form 38 equivalence classes, were found out of 388 877 lines of logs. An
example of these equivalence classes is shown in Table 3.13, where three rare
patterns form an equivalence class. The rare patterns SID1 and SID2 in Table
3.13 are each composed of only a single event, {Pressure_Int_2} and
{Pipe_Read_Pipeline_Pressure_2} respectively. On the other hand, the rare
pattern SID3 comprises 17 events. From these three rare sequential patterns
in Table 3.13, the experts identified 3 events, {Pipe_Read_Pipeline_Pressure_2},
{Pressure_Int_2}, and {Solenoid_On_SP_Int_55}, as anomalous in the SID3 rare
sequential pattern, because the solenoid pressure value was changed to 55 PSI
from the HMI terminal. The changed value is above the preset maximum pressure
threshold value of 40 PSI of the pressure control system. Here also, the
maximal rare pattern helps to identify the anomalous pattern rather than the
minimal rare pattern. This rare pattern also occurred during the time period
when the system was compromised. Expert knowledge confirms that this rare
anomalous sequence is a cyber-attack on the system.
Table 3.13: A sample of rare sequential patterns from pressure control SDB in Dataset-2.
SID1 〈{Pressure_Int_2}〉
SID2 〈{Pipe_Read_Pipeline_Pressure_2}〉
SID3 〈{HMI_Pipe_Pump_Off_SP_40}, {HMI_Pipe_Pump_On_SP_5}, {HMI_Pipe_Solenoid_Off_SP_55}, {HMI_Pipe_Solenoid_On_SP_55}, {HMI_Pipe_Master_Mode_-1}, {Pipe_Pump_Run_Status_-1}, {Pipe_Read_Pipeline_Pressure_2}, {Pipe_Read_Pump_Mode_0}, {Pipe_Read_Pump_Run_Cmd_0}, {Pipe_Read_Solenoid_Mode_0}, {Pipe_Read_Solenoid_Open_Cmd_-1}, {Pipe_Solenoid_Open_Status_-1}, {Pressure_Int_2}, {Pump_Off_SP_Int_40}, {Pump_On_SP_Int_5}, {Solenoid_Off_SP_Int_55}, {Solenoid_On_SP_Int_55}〉
Finally, in Dataset-3, the water tank control system dataset, 34 rare
sequential patterns forming 34 equivalence classes were found out of 228 762
lines of logs. One example, a rare sequential pattern comprising 10 events, is
shown in Table 3.14. This rare pattern forms an equivalence class by itself.
Table 3.14: A sample of rare sequential patterns from water tank SDB in Dataset-3.
SID1 〈{HMI_Tank_Master_Mode_-1}, {Tank_Level_53}, {Tank_Off_SP_Int_80}, {Tank_On_SP_Int_50}, {Tank_Read_Pump_In_Auto_0}, {Tank_Read_Pump_In_Manual_-1}, {Tank_Read_Pump_Running_0}, {Tank_Read_Tank_Level_53}, {Tank_Stopped_0}, {Tank_Usage_Level_47}〉
When consulted, the domain experts identified the event
Tank_Read_Pump_In_Manual_-1 as an anomalous event in the rare sequence pattern
in Table 3.14, because the value of this event was changed from 0 to −1,
whereas the water tank pump was set to run in automatic mode as a requirement
for this experiment. This rare pattern occurred during the time when the
system was compromised. This rare anomalous sequence was identified by the
expert as a cyber-attack on the system.
Now, we describe the results found from the Second Dataset comprising
Dataset-4, Dataset-5, and Dataset-6. In the conveyor belt SDB generated from
Dataset-4, we found a total of 23 rare sequential patterns. These rare patterns
form 22 equivalence classes. An example of a rare pattern that forms an
equivalence class is shown in Table 3.15.
Table 3.15: A sample of rare sequential patterns from conveyor belt SDB in Dataset-4.
SID1 〈{Conv_Read_Conv_Color_PE_0}, {Conv_Read_Conv_HMI_Direction_-1}, {Conv_Read_Conv_Present_PE_0}, {Conv_Read_Solenoid_Left_Direction_0}, {Conv_Read_Solenoid_Right_Direction_-1}, {Conv_Run_Status_0}, {HMI_Conv_Direction_-1}, {HMI_Conv_Master_Mode_-1}, {HMI_Conv_Reset_-1}〉
The rare pattern in Table 3.15 is an attack pattern because the value of the
event HMI_Conv_Direction_-1 was changed from 0 to −1. This change was made
while the attack was being conducted on the conveyor belt control system, and
the incident was recorded in the labeled dataset as an attack. We compared all
the rare patterns found in Dataset-4 with the labeled attack dataset to find
which rare patterns are attack patterns. We found that 4 of the 23 discovered
rare suspicious patterns were attack patterns.
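One possible way to carry out such a comparison is sketched below in Python:
a rare pattern is counted as an attack pattern when it occurs in at least one
sequence labelled as an attack. This matching rule, the function name
confirmed_attacks, and the toy data are assumptions made for illustration and
are not taken from the labelling script used in the experiments.

```python
def confirmed_attacks(rare_patterns, attack_sids):
    """Return the rare patterns that occur in a labelled attack sequence.

    Assumption: each pattern is a dict with a "sids" set (the S.sid of
    Algorithm 3.2) and attack_sids is the set of sequence IDs labelled
    as attacks.
    """
    return [p for p in rare_patterns if p["sids"] & attack_sids]


# Hypothetical toy data, not the contents of Dataset-4.
rare_patterns = [
    {"name": "P1", "sids": {10}},
    {"name": "P2", "sids": {11}},
    {"name": "P3", "sids": {42}},
]
attack_sids = {42, 43}

hits = confirmed_attacks(rare_patterns, attack_sids)
print([p["name"] for p in hits])
print(len(hits) / float(len(rare_patterns)))
```

With the counts reported for Dataset-4, 4 confirmed attack patterns out of 23
rare patterns corresponds to roughly 17% of the discovered patterns.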
Table 3.16: A sample of rare sequential patterns from pressure control SDB in Dataset-5.
SID1 〈{HMI_Pipe_Pump_Off_SP_40}, {HMI_Pipe_Pump_On_SP_5}, {HMI_Pipe_Solenoid_Off_SP_25}, {HMI_Pipe_Solenoid_On_SP_40}, {HMI_Pipe_Master_Mode_-1}, {Pipe_Pump_Run_Status_-1}, {Pipe_Read_Pipeline_Pressure_0}, {Pipe_Read_Pump_Mode_0}, {Pipe_Read_Pump_Run_Cmd_0}, {Pipe_Read_Solenoid_Mode_0}, {Pipe_Read_Solenoid_Open_Cmd_-1}, {Pipe_Solenoid_Open_Status_-1}, {Pressure_Int_1}, {Pump_Off_SP_Int_40}, {Pump_On_SP_Int_5}, {Solenoid_Off_SP_Int_25}, {Solenoid_On_SP_Int_40}〉
In the pressure control SDB from Dataset-5, our algorithm discovered 218
rare patterns that form 198 equivalence classes. After comparing these rare
patterns with the labeled attack dataset, we found 2 rare patterns to be attack
patterns. An example of a rare attack pattern is given in Table 3.16. An attack
was conducted on the pressure control system by changing its lower threshold
value. The value of the event Solenoid_Off_SP_Int_25 was changed from 20 to 25,
as shown in Table 3.16.
Finally, in the water tank SDB generated from Dataset-6, we found 204
rare patterns grouped in 194 equivalence classes. We could not find any attack
patterns among these rare patterns. This is due to the nature of the attacks:
all attacks on the water tank control system were conducted as flooding
attacks. As a result, the attack events became frequent events, and hence our
method could not detect them as rare events.
3.5 Discussion and Analysis
In this section we discuss and analyze our proposed methods. In Section 3.5.1 we
show the effectiveness of the equivalence classes, and in Section 3.5.2 we discuss
the complexity of the two algorithms followed by a discussion of their efficiency.
3.5.1 Effectiveness of Equivalence Class
Our proposed general methods can generate all rare sequential patterns from a
sequence database. The ordered sequence of actions is crucial in some domains,
especially a SCADA process control system, where events are sequential, regular
and definitive. In such environments, any change in the ordered sequence
changes the process output or end result, as explained with an example in
Section 3.1. In the first experiment with the conveyor belt dataset,
our method discovered 6 rare sequential patterns. These patterns are
separated into different groups, or equivalence classes, where all the patterns
in one group have equal support value and appear in the same sequences in
the database.
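The grouping described above can be sketched as follows. This is a minimal illustration, not the thesis's algorithm: the toy database `sdb`, the pattern list `rare`, and the helper names are hypothetical, and the minimal and maximal patterns are chosen here simply by total number of events, which is an assumption made for brevity.

```python
from collections import defaultdict


def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence` (lists of itemsets)."""
    pos = 0
    for itemset in pattern:
        # Advance to the next sequence itemset that includes this pattern itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def equivalence_classes(sdb, patterns):
    """Group patterns by the set of sequence IDs in which they occur."""
    classes = defaultdict(list)
    for pattern in patterns:
        sids = frozenset(sid for sid, seq in sdb.items() if contains(seq, pattern))
        classes[sids].append(pattern)
    return dict(classes)


def pattern_length(pattern):
    """Total number of events in a pattern."""
    return sum(len(itemset) for itemset in pattern)


# Hypothetical toy database and already-mined rare patterns.
sdb = {
    1: [{"a"}, {"b"}, {"c"}],
    2: [{"a"}, {"c"}],
    3: [{"b"}, {"c"}],
    4: [{"d"}, {"e"}],
}
rare = [[{"d"}], [{"d"}, {"e"}], [{"a"}, {"b"}]]

classes = equivalence_classes(sdb, rare)
# For each class, keep the shortest and the longest pattern.
summary = {
    sids: (min(group, key=pattern_length), max(group, key=pattern_length))
    for sids, group in classes.items()
}
for sids, (minimal, maximal) in summary.items():
    print(sorted(sids), "support:", len(sids), "minimal:", minimal, "maximal:", maximal)
```

Because patterns in one class share the same set of supporting sequences, they necessarily share the same support value, which matches the description of a group given above.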
In each group the minimal rare pattern indicates the initial or first point of
attack. The maximal rare pattern, in contrast, sets the context that enables
the operator to decide the impact of the attack on the SCADA process, such as a
malfunction or breakdown of a SCADA system or a cyber-attack on the system.
An example is given in Table 3.12, where two rare sequences SID1 and SID2
form an equivalence class. The domain experts identified 3 events in the second
pattern SID2 as problematic. The values of all 3 events were changed from 0
to −1, which caused the objects to be sorted in opposite directions: the light
objects went to the left loop while the dark objects went to the right loop.
The experiment, however, required the light objects to move along the right
loop and the dark objects along the left loop. It is worth noting that the
minimal rare pattern SID1 in Table 3.12 indicates that the conveyor belt was
reset, but it does not provide the context from which the domain expert can
identify the pattern as anomalous. The maximal rare sequential pattern in SID2,
which includes the minimal pattern and other relevant events, helps the domain
expert to identify this pattern as anomalous as well as intrusive.
Similarly, in the pressure control system and the water tank system datasets,
we have also discovered some rare patterns, which are shown in Table 3.13 and
Table 3.14 respectively. In the pressure system, the maximum solenoid pressure
was set to 40 PSI. In the rare pattern the value was found to have been changed
to 55 PSI, which is above the maximum threshold value. Hence, the experts
identified this unexpected rare pattern as an intrusive pattern. In the water
tank control system, as a prerequisite, the system was running in automatic
mode. The attacking team later changed the system from automatic mode to
manual mode, and this sequence is anomalous as well as an intrusion.
It has been observed that by checking the integrity constraints of the process
control system, we can identify the anomalous as well as attack patterns.
However, we argue that this agreement in the experimental results is due to
the kind of attacks conducted on the control system processes. In a SCADA
control process, a change of a value from its predefined set value is
considered an attack on the control system, because it can disrupt the
process. This means that changing an event's value or threshold value
can alter the process outcome and make it a rare event. For example, changing
the value of the conveyor belt event {HMI_Conv_Direction} from 0 to
−1 changes the direction of the diverting gate of the conveyor belt from left
to right. As a result, the object on the conveyor belt moves in the wrong
direction. This change is irregular behaviour of the system, made purposely to
disrupt the conveyor belt's normal activity, and is therefore an attack on the
conveyor belt control system. Similarly, changing the upper threshold value of
the pressure control event {Solenoid_On_SP_Int} from 40 PSI to 50 PSI makes it
a rare anomalous event that can hamper the outcome of the process. This change
of the predefined set pressure value is an attack on the pressure control
system, as it violates the normal process activities. In other application
domains, if the process outcome can be changed by an attack that alters
the execution order of the events, our proposed rare sequential pattern mining
algorithm can also detect the rare sequential anomalous as well as attack
patterns. Therefore, we argue that not only the integrity constraints but also
the rare sequential patterns can detect the rare anomalies and attacks in a
system. In the SCADA domain datasets, we found that not all rare patterns are
needed; investigating the maximal rare patterns in a group was sufficient
to find anomalies as well as intrusions. However, for other application
domains, minimal rare patterns or all rare patterns in a group could be
helpful to detect anomalies in a system. Therefore, distributing
all rare patterns into different groups is found effective.
3.5.2 Computational Complexity
Let $N$ be the number of sequences in the sequence database SDB, $M$ the
size of the longest sequence in SDB, and $L$ the maximum number of sequences
in $CSG_s$. For Algorithm 3.1, the while-loop could iterate $M$ times in the
worst case. Inside the while-loop, the database scan at lines 20-24 goes through
$N$ sequences and the for-loop iterates $L$ times. So, the complexity of
Algorithm 3.1 is $M \cdot L \cdot N$. The candidate sequence generator $CSG_s$
contains the combinations of sequences in $FSG_s$ with a common prefix of
size $(s-1)$. The number of patterns in $CSG_s$ is usually very small, that is,
$L \ll N$. Therefore, by ignoring the less significant term, the complexity of
the first algorithm, Algorithm 3.1, is $O(M \cdot N)$.
In Algorithm 3.2, $RSP_s$ is the set of rare sequential patterns of size $s$.
The size of $RSP_s$ is at most $L$. Events(SDB) contains all the unique
events in the database. For an application domain, events are relatively
stable, so the size of Events(SDB) can be considered a constant. Let
$|Events(SDB)| = E$. For Algorithm 3.2, the while-loop iterates at most $M$
times. Line 10 takes $|S|$ steps to add $e$ into $S$ and generate $|S|$
patterns; in the worst case, $|S| = M$. Inside the while-loop, the first nested
for-loop at lines 8-11 takes $E \cdot L \cdot M$ steps. This means that the size
of the resulting set CRSP of candidate rare sequence patterns is at most
$E \cdot L \cdot M$. The second for-loop at lines 13-17 takes
$E \cdot L \cdot M$ time in the worst case. Similarly, the third for-loop at
lines 19-32 takes $E \cdot L \cdot M$ time as well. Taking the database scan at
lines 21-26 into consideration, the overall computational complexity is
$M (E \cdot L \cdot M + E \cdot L \cdot M + E \cdot L \cdot M \cdot N) = 2 \cdot E \cdot L \cdot M^2 + E \cdot L \cdot M^2 \cdot N$.
Given that $E$ is a constant, the complexity is
$O(L \cdot M^2 + L \cdot M^2 \cdot N)$. Further, for a large log dataset we
usually have $L \ll N$ and also $M \ll N$, so it is reasonable to
consider that $L \cdot M < N$. In this case, the complexity of the second
algorithm, Algorithm 3.2, is $O(M^2 \cdot N)$.
The efficiency of Algorithm 3.1 can also be justified by the way it generates
rare patterns. Our proposed method does not generate all frequent patterns;
instead, it generates only frequent generators in order to find all rare
generator patterns. As such, it needs to explore only a subset of the frequent
patterns, namely the frequent generators. Otherwise, all frequent patterns
would have to be generated, which would cost large amounts of memory as well as
search time. Moreover, we compared our algorithm with the FEAT (Frequent
sEquence generATor) miner algorithm proposed by Gao et al. [107] to check
whether our algorithm generates all the frequent sequence generators. We ran
the FEAT algorithm using SPMF [108] (Sequential Pattern Mining Framework). We
found that the number of frequent generators produced by the FEAT algorithm on
the different databases (shown in Table 3.17) is the same as the number of
frequent generators generated by our algorithm.
Table 3.17: Comparison among the databases regarding the number of frequent generators our algorithm produced and the number of frequent generators produced by FEAT algorithm.
Dataset          Database SDB       #Sequences in Database   # Freq. Gen (Our Algorithm)   # Freq. Gen (FEAT Algorithm)
First Dataset    Conveyor belt      22875                    538                           538
First Dataset    Pressure control   22876                    4572                          4572
First Dataset    Water tank         22877                    3919                          3919
Second Dataset   Conveyor belt      12117                    427                           427
Second Dataset   Pressure control   27904                    4043                          4043
Second Dataset   Water tank         23599                    2961                          2961
For example, from the conveyor belt SDB in the first dataset, our proposed
algorithm generated 538 frequent generators from 22 875 sequences. In
comparison, Gao et al.'s [107] FEAT algorithm produced the same number of
frequent generators. Table 3.17 shows the number of frequent generators
produced by our method in comparison to the FEAT algorithm. Also note that our
second method, Algorithm 3.2, does not generate unwanted patterns that would
originate from non-existent patterns, which saves search space and time. An
example of a non-existent pattern is 〈{c}, {a}〉, shown in the lattice in
Figure 3.1, from which no further candidate sequence patterns are generated.
3.6 Related Work
An anomaly or outlier is defined as "an observation that deviates so much
from other observations as to arouse suspicion that it was generated by a
different mechanism" [109]. Malicious events or intrusions can be detected using
anomaly detection techniques [22]. The anomaly detection method was originally
proposed by Denning [23], and since then it has been used in computer
security in general and for intrusion detection in particular [24]. Balducelli
et al. [86] attempted to find anomalous or abnormal behavior using a case-based
reasoning method. They compared a sequence of SCADA log events with the
previously defined normal behavior. In general, algorithms used for anomaly
detection need normal operation data, also called labeled data, to build
a training model. These algorithms generally consider anomalies as patterns
that have not been seen before in the normal or regular behavioural patterns of
a system [99] [101]. Data mining based anomaly detection techniques can be used
for both signature and behaviour based anomaly detection [29], and they
can further be classified as (i) supervised methods, (ii) semi-supervised
methods and (iii) unsupervised methods [110].
So far, there have been few works that use data mining to find anomalies.
Manganaris et al. [26] showed that the absence of a frequent event or set of
events can be considered an anomaly. Clifton et al. [27] applied a data mining
technique to identify normal behaviour of a system based on the frequent
occurrence of an alarm event, which was later filtered out from suspicious
event lists. Barbara et al. [95] built models of systems' and users' normal
behaviour using data mining association rules from network traffic data. Later,
in the detection model, they looked for any deviation in the association rules,
which was considered abnormal behaviour and thus an anomaly of the system and
users. Fan et al. [30] used a signature based ANN classifier to detect
malicious sequential patterns from a sequence of machine instructions. Their
method inherently lacks the ability to identify new malware for which no
signature has been trained. Hadžiosmanović et al. [17] applied frequent itemset
mining to find rare events by varying the support values in SCADA process logs.
They could only identify a single rare event or item, which the stakeholders
identified as a potential vulnerability, rather than a sequence of anomalous
events. Lee and Stolfo [111] pre-labeled regular system calls as normal
sequences and then looked for abnormal sequences not found in the normal list.
But finding all normal sequences is almost impossible. Julisch and Dacier [28]
used the episode rules pattern mining technique to reduce irrelevant alarm
signals using false positives from historical alarms. These methods depend on
signature based rules, which inherently lack the ability to identify
new attacks.
Saha et al. [97] mention that the basic strategy for rare or infrequent pattern
mining is to identify all the frequent patterns in a transaction database using
a user provided minimum support threshold value and later prune these patterns
from the database. As a result, the remaining patterns, which fall below
the threshold value, are considered rare patterns. Szathmary et al. [98]
discovered rare itemsets by identifying minimal rare itemset generators. The
main idea behind their approach was to identify individual frequent items. If
a combination of these frequent items becomes infrequent, then this combination
is considered a rare itemset. Their work focused on rare itemset mining, which
ignores the order of the events, but we argue that ordered rare events may help
to find anomalous patterns. Shengyi et al. [112] applied a data mining approach
called the common path algorithm, in which they labeled attacking data based on
some assumed rules. As attackers constantly update their techniques, they can
bypass this signature based method. So, it cannot find unknown attacks that do
not match the defined rules.
Until now, there has been limited research in rare itemset mining. Among
these works, Szathmary et al. [98] is a pioneer: starting from frequent
generators, the authors derive rare itemset generators by merging frequent
itemsets or events. For example, two individual frequent events {A} and {B},
when merged together, become {A,B}, which turns out to be a rare event.
However, this work did not consider the order of events. Szathmary et al.'s
approach cannot be applied to find anomalies in a system where the order of
events is significant, such as a SCADA system. In particular, SCADA systems are
often used to control critical infrastructure, and detecting an anomaly is of
the highest importance. A process failure in a SCADA system may damage
equipment and endanger human life. In addition, Szathmary et al.'s algorithm
cannot maintain the integrity of the original pattern, meaning that the order
of the events is lost in their method. As a result, a causal relationship among
the events cannot be established. For example, the rare itemset pattern
{Fan-Failure, Device-Down} does not provide information as to which event
triggers the other event.
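The loss of order in an itemset can be seen in a small sketch. The two log fragments below are hypothetical; only the event names are taken from the example above.

```python
# Two hypothetical log fragments with the same events in opposite order.
log_a = ["Fan-Failure", "Device-Down"]
log_b = ["Device-Down", "Fan-Failure"]

# Itemset view: order is discarded, so the two fragments are indistinguishable.
itemset_a, itemset_b = frozenset(log_a), frozenset(log_b)

# Sequence view: order is kept, so the two fragments remain distinct.
sequence_a, sequence_b = tuple(log_a), tuple(log_b)

same_as_itemsets = itemset_a == itemset_b
same_as_sequences = sequence_a == sequence_b
print(same_as_itemsets, same_as_sequences)
```

Under the itemset view neither fragment can say whether the fan failure preceded the device going down, whereas the sequence view preserves that information.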
Although there have been a limited number of works on rare itemset
mining, they did not address the order of itemsets or events. Our research,
however, focuses on sequential patterns, first introduced by Agrawal and Srikant
[50], where the execution order of the events remains unchanged. Any
alteration of the events' execution order represents a deviation from a
predefined sequence order, which may indicate anomalous events considered to be
rare in a system. Rahman et al. [102] have shown that the sequence of ordered
events is significant for detecting anomalies in SCADA logs using rare
sequential pattern mining. However, that work only focused on the minimal or
smallest rare sequential patterns and could not find all the rare sequential
patterns. Therefore, we extend the previous work to find all rare sequential
patterns and their equivalence classes to identify anomalies effectively from a
sequential database. Although the proposed rare sequential pattern mining
algorithm can effectively detect rare anomalous patterns, it is not applicable
to finding anomalous patterns in time gapped control logs. In such logs the
system activities are not recorded continuously; rather, entries are recorded
only when an event happens on the system. The proposed method does not consider
the time gap while generating rare sequential patterns. In addition, the
proposed method cannot be used to predict possible anomalies in a live system.
3.7 Summary
In this chapter we have proposed and developed a new method for finding
anomalies in a SCADA control system. We proposed a rare sequential pattern
mining algorithm for finding anomalies. In this thesis it is assumed that
anomalies happen rarely in a system, so a rare pattern could represent an
anomalous pattern in a system. As the activities of a SCADA system are limited
and repetitive, an anomalous pattern on the SCADA system would be rare. In
addition, the activities of a SCADA system occur sequentially, which has been
the motivation to use rare sequential patterns to detect anomalies on a SCADA
control system. We discussed a novel minimal and maximal rare sequential
pattern mining approach for anomaly detection. To find effective patterns, all
rare patterns sharing the same support value are put into groups. The smallest
pattern in each group is the minimal rare pattern, while the largest pattern is
the maximal rare pattern. In each group, the minimal patterns, the maximal
patterns, and the patterns in between could be used to find
anomalies depending on the preference of the application domain.
We evaluated our method using Supervisory Control and Data Acquisition
(SCADA) control system log data containing cyber-incidents. The identified
rare anomalous sequences were intrusions or attacks on the system,
demonstrating the usefulness of our rare sequential pattern mining approach. We
applied our proposed methods to three SCADA control system log datasets. With
every dataset it was found that some of the rare patterns identified as
suspicious were later revealed as intrusive patterns. The maximal rare patterns
were more effective in identifying malicious anomalous patterns than the
minimal rare patterns. However, depending on domain requirements, the minimal
pattern or all rare patterns in a group could be effective in identifying
anomalies.
The rare sequential pattern mining method discussed in this chapter finds
rare patterns from logs which are recorded continuously on the SCADA control
system. However, in some application domains events are recorded with an
indefinite time-span gap: an event is recorded in the logs when it occurs, and
otherwise nothing is recorded. Therefore, there exist time-span gaps among the
events in the logs. Because the proposed rare sequential pattern mining
algorithm cannot detect anomalies from time-span gapped control logs, we
propose another method to find rare patterns from such logs. In the next
chapter, Chapter 4, we discuss the constraint-based rare sequential pattern
mining technique. If a pattern is generated from two time windows separated by
a time gap, it might not be significant for finding anomalies. So, to find
anomalies from significant rare patterns we need time constraint-based pattern
mining. In addition, the proposed method can only detect anomalies from static
SCADA control logs; it cannot be used to predict anomalies from the streaming
logs of a live SCADA system. Therefore, we also propose a rare sequential
association rules mining method which uses SCADA streaming logs for anomaly
prediction.
Chapter 4
Constraint-based Rare Sequential
Pattern Mining
4.1 Introduction
Pattern mining, and especially sequential pattern mining, produces overwhelming
numbers of patterns. This makes the sequential mining process inefficient and
ineffective. It becomes inefficient because the search space becomes
exponential, and ineffective because knowledge must be extracted from a large
number of patterns. As a result, most of the patterns become useless to the
users [113]. It is in fact the users who determine the interesting and useful
patterns on a system. They use different constraints or restrictions in the
data source as well as in the mining process to discover the useful required
patterns in an efficient and effective manner [114] [115] [116] [117]. In the
literature, constraints have been accepted as the most common and effective
approach to control the large number of discovered patterns. Many approaches to
constraint-based pattern mining have been explored to achieve efficiency and
effectiveness while discovering the useful and interesting patterns. These
approaches apply not only the semantics of the domain knowledge but also the
interests of the system users.
Constraint-based pattern mining can be categorized into several groups
based on its applications. Pei et al. [45] define seven categories of
constraint: (i) Item constraint, (ii) Length constraint, (iii) Super-pattern
constraint, (iv) Aggregate constraint, (v) Regular expression constraint,
(vi) Duration constraint, and (vii) Gap constraint. All of these constraints
can be applied through three mechanisms: (a) pre-processing or dataset
filtering constraints, (b) pattern filtering or mining process constraints, and
(c) post-processing constraints [118] [76]. In the pre-processing step,
constraints can be enabled at the data source to filter and organize the
dataset so that the user desired patterns can be obtained after applying the
data mining methods. In the pattern filtering process, constraints are imposed
by modifying the actual mining algorithm. This can make the data mining process
more efficient by reducing the search space, so that less time is required to
find the desired results. In the post-processing step, constraints are used
after the standard mining process discovers the results. In this method, any
number of constraints could be used to keep or extract the patterns the users
demand and filter out the unwanted patterns. The post-processing method is
unsatisfactory because it wastes computational time producing patterns that
are unwanted from the users' perspective and only then filters them out [116].
Therefore, this method does not focus on improving the efficiency or
performance of the data mining algorithm.
Rare sequential pattern mining is also challenging with respect to setting
the threshold value. The number of rare sequential patterns could be large if
the threshold value is set to a higher value. In addition, if the average size
of the sequences in the database SDB is large, the possibility of generating a
large number of rare sequential patterns increases. Moreover, if the number of
unique events in the SDB increases, the number of candidate sequential patterns
also increases. A large number of candidate sequential patterns takes more
computational time in the data mining process. Furthermore, if the database
size increases, meaning that the number of sequences in the SDB is large, this
also contributes to increasing the computational time of the mining process.
Therefore, these three factors (the size of the database, the size of the
sequences in the database, and the number of unique events in the database)
make it difficult to identify the rare suspicious anomalous patterns. These
factors also increase the computational time significantly. Hence, there is a
need to apply constraints to generate fewer rare sequential patterns, which
reduces the computational time needed to identify rare suspicious anomalous
patterns.
4.1.1 Motivation
We have analyzed the control logs of a real life SCADA controlled electrical
power distribution substation. It has been observed that when activities or
events occur in the system, the events are recorded in a log file. However,
when no events happen in the system, nothing is recorded in the log file.
As a result, in the log file we find an episode of recorded events for a
certain time-span period, followed by a time gap during which no events are
recorded in the logs. A meaningful or significant pattern can be generated from
an episode of events in a defined time-span period. However, if a pattern
extends beyond the defined consecutive time-span period, it may not be a
significant pattern. This is because, in an episode of events during the
defined time-span period, some sequence of events accomplishes a complete task
or process. Therefore, if a pattern is derived from several consecutive
episodes of events and exceeds the defined time-span period, the pattern cannot
be a significant pattern; rather, it is a misleading pattern.
Therefore, the first motivation is to integrate the time-span constraint while
selecting the sequences from the data source, the raw logs. The time-span
constraint is applied during the data pre-processing stage. As a result, the
significant discovered patterns can only be extracted from a defined time-span
period. It means that the time duration between the first event and the last
event in a pattern must satisfy the user defined maximum time-span threshold
value [119]. For example, consider the pattern 〈{a}, {c}, {d}〉, which appears
to be rare (it occurs in SID2 and SID5) in the sequence database SDB shown in
Table 4.1 when the maximum support threshold value maxsup is set to 2. It is
difficult to distinguish between these two occurrences of the rare pattern and
to decide which one is the more reliable indication of a suspicious anomalous
pattern.
Table 4.1: A sequential database SDB with events' occurrence time-stamp.
Sequence ID   Sequences
SID1          〈{c}_1, {b, d}_6〉
SID2          〈{a}_12, {e}_13, {c}_14, {f}_16, {d}_20〉
SID3          〈{b}_26, {f, g}_27〉
SID4          〈{g}_33, {d}_38〉
SID5          〈{a}_45, {e}_47, {c}_48, {d}_50〉
However, if we further analyze these two occurrences considering the time-span
duration, the pattern in SID5 occurred within a defined time-span, say a
5-minute time-span. It is assumed that this pattern is more capable of doing
harm to the system than the pattern in SID2. The SID5 pattern is more
significant than the SID2 pattern because its events are performed within the
defined time-span period, so the pattern has the potential to harm the system
while it remains active during that period. Otherwise the pattern loses its
potential or strength to do harm to a system and is hence considered a less
significant pattern.
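The time-span comparison on Table 4.1 can be sketched as follows. This is only an illustration of the span computation under stated assumptions: the time-stamps are treated as minutes, the threshold name `MAX_SPAN` and the function `min_time_span` are hypothetical, and the check is done per pattern occurrence here, whereas the thesis applies the constraint during data pre-processing.

```python
def min_time_span(sequence, pattern):
    """Smallest (last - first) time-stamp difference over occurrences of
    `pattern` in `sequence`, or None if the pattern does not occur.

    `sequence` is a list of (itemset, timestamp) pairs ordered by time;
    `pattern` is a non-empty list of itemsets.
    """
    best = None
    for start, (itemset, t_first) in enumerate(sequence):
        if not pattern[0] <= itemset:
            continue
        pos, t_last, matched = start + 1, t_first, True
        for wanted in pattern[1:]:
            # Earliest later itemset that includes the wanted itemset.
            while pos < len(sequence) and not wanted <= sequence[pos][0]:
                pos += 1
            if pos == len(sequence):
                matched = False
                break
            t_last = sequence[pos][1]
            pos += 1
        if matched and (best is None or t_last - t_first < best):
            best = t_last - t_first
    return best


# Table 4.1, with time-stamps assumed to be in minutes.
sdb = {
    "SID1": [({"c"}, 1), ({"b", "d"}, 6)],
    "SID2": [({"a"}, 12), ({"e"}, 13), ({"c"}, 14), ({"f"}, 16), ({"d"}, 20)],
    "SID3": [({"b"}, 26), ({"f", "g"}, 27)],
    "SID4": [({"g"}, 33), ({"d"}, 38)],
    "SID5": [({"a"}, 45), ({"e"}, 47), ({"c"}, 48), ({"d"}, 50)],
}
pattern = [{"a"}, {"c"}, {"d"}]
MAX_SPAN = 5  # hypothetical maximum time-span threshold

spans = {sid: min_time_span(seq, pattern) for sid, seq in sdb.items()}
significant = [sid for sid, span in spans.items()
               if span is not None and span <= MAX_SPAN]
print(spans, significant)
```

With these assumptions the occurrence in SID2 spans 8 minutes and the occurrence in SID5 spans 5 minutes, so only SID5 satisfies a 5-minute threshold.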
The second motivation is to avoid unnecessary database scanning while
generating rare sequential patterns. This can be achieved by integrating a
constraint into the actual rare sequential pattern mining algorithm. The
constraint, which is also called the algorithmic constraint, prevents searching
for a candidate sequence whose size is larger than the size of a sequence in
the database. For example, assume that 〈{c}, {b}, {a}, {c}〉 is a candidate
sequence and 〈{b, c}, {b}, {a}〉 is a sequence in the database. The algorithmic
constraint compares the size of the candidate sequence with the size
of the sequence in the database. If the size of the candidate sequence is larger
than the size of the sequence in the database, the candidate sequence cannot be
found in that sequence. In this example, the size of the sequence
〈{b, c}, {b}, {a}〉, which is 3, is smaller than the size of the candidate
sequence 〈{c}, {b}, {a}, {c}〉, which is 4. So, the candidate sequence cannot be
found in the database sequence. This is because a candidate sequence must be a
subsequence of a database sequence in order to be found in it. Hence, there is
no benefit in looking for the candidate sequence in this database sequence.
Therefore, the algorithmic constraint can reduce the computational time of the
rare sequential pattern mining process.
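A minimal sketch of this size check follows, assuming (as in the example above) that the size of a sequence is its number of itemsets. The function names and the second database sequence are hypothetical and added only for illustration.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence` (lists of itemsets)."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support_with_size_check(sdb, candidate):
    """Count supporting sequences, skipping those shorter than the candidate."""
    support, skipped = 0, 0
    for sequence in sdb:
        if len(candidate) > len(sequence):
            skipped += 1  # size check: containment is impossible, no scan needed
            continue
        if contains(sequence, candidate):
            support += 1
    return support, skipped


candidate = [{"c"}, {"b"}, {"a"}, {"c"}]
sdb = [
    [{"b", "c"}, {"b"}, {"a"}],       # size 3, from the example above
    [{"c"}, {"b"}, {"a"}, {"c"}],     # size 4, hypothetical extra sequence
]
support, skipped = support_with_size_check(sdb, candidate)
print(support, skipped)
```

The first database sequence is skipped without any itemset comparison, which is the saving the algorithmic constraint aims for.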
The third motivation for integrating constraints into rare sequential
pattern mining is to reduce the number of features. The feature reduction
constraint reduces the number of unique events in the database. Fewer
unique events mean less computation time and a smaller search space while
mining the rare sequential patterns from the database. For example, assume
a rare sequential pattern 〈{a}, {b}, {c}〉 of size 3 and a sequence database SDB
that has 38 unique events. When all possible candidate super sequential patterns
of size 4 are generated from a size-3 rare sequential pattern, a total
of $4 \cdot 38$ candidate super sequential patterns are generated. In addition,
if there exist 20 rare sequential patterns, the total number of candidate super
sequential patterns generated is $20 \cdot 4 \cdot 38$. In general, $N$ rare
sequential patterns of size $n$ with $M$ unique events in a database generate
$N \cdot (n+1) \cdot M$ candidate super sequential patterns of size $(n+1)$.
Furthermore, if the average size of the sequences in the database SDB is large,
the number of candidate super sequential patterns generated level-wise (as the
sequence size increases) can be large. Therefore, we argue that the generation
of candidate super sequential patterns can be reduced by keeping the average
sequence size small and reducing the number of unique events in the database.
The number of unique events and the average size of the sequences can be
reduced by integrating a feature reduction constraint in the database.
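The counts above can be checked with a short sketch. It assumes that a candidate is formed by inserting one event as a new itemset at one of the $n+1$ positions of a size-$n$ pattern; this generation scheme, the event names, and the halved event count are assumptions made for illustration, and the generated list may contain duplicate candidates.

```python
def candidate_count(num_patterns, pattern_size, num_events):
    """Number of generated candidates: N * (n + 1) * M."""
    return num_patterns * (pattern_size + 1) * num_events


def extend(pattern, events):
    """Insert each event as a new itemset at each of the n + 1 positions."""
    return [pattern[:i] + [{event}] + pattern[i:]
            for i in range(len(pattern) + 1)
            for event in events]


events = [f"e{i}" for i in range(38)]      # 38 hypothetical unique events
pattern = [{"a"}, {"b"}, {"c"}]            # one rare pattern of size 3

single = candidate_count(1, 3, 38)         # 4 * 38
batch = candidate_count(20, 3, 38)         # 20 * 4 * 38
reduced = candidate_count(20, 3, 19)       # hypothetical: half the events removed
generated = len(extend(pattern, events))   # agrees with `single`
print(single, batch, reduced, generated)
```

Because the count is linear in the number of unique events, removing half of the events in this hypothetical case halves the number of candidates generated at that level.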
4.2 Existing Related Work
To understand how constraints are used in pattern mining, in this section we
review the existing works on constraint-based pattern mining. In sequential
pattern mining the use of constraints was first introduced by Srikant and
Agrawal [66]. They introduced time constraints, such as the minimum gap between
two successive events, the maximum gap between two consecutive events, and a
sliding time window that relaxes the conventional sequential pattern mining
process, in GSP with the Apriori framework. Later, many methods were proposed
in constraint-based pattern mining to achieve effectiveness and efficiency in
mining sequential patterns of interest to the users. Garofalakis et al. [120]
proposed a family of four algorithms called SPIRIT, where different regular
expressions R are used as constraints for mining frequent sequential patterns
that satisfy a given regular expression constraint. For example, SPIRIT(N)
only keeps the candidate sequence patterns whose elements are defined by the
constraint R. In other words, candidate sequence patterns are pruned when they
do not satisfy constraint R. The constraints are used inside the mining
process, meaning that they are enforced during mining.
Parthasarathy et al. [121] use constraints in post-processing. Zaki et al. [77]
integrate a variety of syntactic constraints into the cSPADE algorithm to mine
frequent sequences. These constraints are length or width restrictions, gap
limitations on the consecutive events of a sequence, a time window restriction
on the occurrence of a whole sequence, and item constraints limiting the
inclusion or exclusion of defined items in a sequence. The authors imposed
these constraints inside the mining process. Desai and Ganatra [122]
effectively apply different constraints such as Gap, Compactness (Time span),
Item, Recency, Profitability and Length to understand the purchasing behavior
of customers. Antunes and Oliveira [123] introduced a gap constraint in the
generalization of the PrefixSpan algorithm (GenPrefixSpan). They have shown
that the gap constraint is applicable to long sequences such as bioinformatics
sequences [124]. The literature has shown that the computational processing
time of the pattern mining process can be reduced significantly by applying
constraints [74] [125]. It has also been shown that constraint-based pattern
mining can effectively reduce a large search space when applied in sequential
pattern mining [45].
The above discussion shows that all of the constraints studied in the
literature are applied to mining frequent patterns. However, no prior work
applies constraints to mine rare sequential patterns by focusing on the users'
interests and the semantics of the SCADA domain. To improve the efficiency and
effectiveness of the proposed rare sequential pattern mining method, we
integrate constraints into the rare sequential pattern mining approach.
4.3 Preliminaries
This section presents the background on the constraints that are integrated
into the proposed rare sequential pattern mining algorithm. From an application
point of view, the following constraints are widely used to produce only the
patterns of interest to the users and to discard unwanted patterns. Some of
these constraint concepts are used in the literature, such as in [126] [122],
where they are applied to frequent sequential pattern mining. We, however, have
integrated them into our proposed rare sequential pattern mining algorithm. The
objective of constraint-based rare sequential pattern mining is to ensure that
the important patterns are identified and the unwanted patterns are ignored.
Let I be a set of items. According to Pei et al. [72], a constraint C is defined
as a predicate on the powerset of I, that is,
$C : 2^{I} \rightarrow \{true, false\}$. A sequence S satisfies a constraint C if
and only if C(S) is true. The problem of constrained rare sequential pattern
mining is to find all rare patterns S in a sequence database SDB that satisfy
the constraint, that is, all S for which
$(sup(S) \le \sigma) \wedge (C(S) = true)$, where σ is the maximum support
threshold value maxsup.
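As a minimal sketch of this definition (not the implementation used in this thesis), a pattern can be accepted when its support does not exceed maxsup and every constraint predicate holds. The function name `is_constrained_rare` and the example predicate `at_most_three_events` are hypothetical and only serve the illustration.

```python
def is_constrained_rare(sequence, support, maxsup, constraints):
    """Return True when sup(S) <= maxsup and every constraint C(S) is true."""
    return support <= maxsup and all(check(sequence) for check in constraints)


def at_most_three_events(sequence):
    """Hypothetical constraint predicate: the pattern has at most three events."""
    return len(sequence) <= 3


pattern = ["a", "c", "d"]
accepted = is_constrained_rare(pattern, support=2, maxsup=2,
                               constraints=[at_most_three_events])
rejected = is_constrained_rare(pattern, support=3, maxsup=2,
                               constraints=[at_most_three_events])
```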
This research has integrated the following constraints into the rare sequential
pattern mining process described in Chapter 3 to achieve the three goals
described in Section 4.1.1.
Constraint 1 (Time-span duration): The time-span constraint is defined by
calculating the timestamp difference between the first and the last events in a
discovered sequential pattern. This is similar to the approach used by Zhu et
al. [127], where the authors apply session filters to mine web sequential
patterns. The time duration must be within the given time period. Let
$S = \langle A_1, A_2, \ldots, A_n \rangle$ be a sequence, $A_i.time$ be the
timestamp of $A_i$, and $Dur(S) = A_n.time - A_1.time$ be the duration of S.
The time-span duration constraint is defined as:
$C_{TS}(S) \equiv Dur(S) \le \Delta t$, where ∆t is an integer.
For example, the pattern 〈{a}, {c}, {d}〉, which appears in SID2 and SID5 of
Table 4.1, is a rare sequential pattern when the maximum support threshold
value maxsup is set to 2. If the time-span constraint is ∆t = 5, the occurrence
in SID5 is a valid rare sequential pattern because its time-span is within ∆t.
The time difference between the last event {d} and the first event {a} of the
pattern 〈{a}, {c}, {d}〉 is 5 − 1 = 4, which is below the defined time-span
constraint ∆t = 5. The occurrence of the same pattern in SID2, however, is not
a valid rare sequential pattern, because its time-span of 10 − 1 = 9 exceeds
the defined time-span constraint ∆t = 5.
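The check in this example can be sketched as follows. This is an illustration under stated assumptions: the first and last timestamps follow the example above, while the middle timestamps and the function name `satisfies_time_span` are hypothetical.

```python
def satisfies_time_span(timestamps, delta_t):
    """C_TS: the gap between the last and first event must not exceed delta_t."""
    duration = timestamps[-1] - timestamps[0]
    return duration <= delta_t


# Timestamps of the events of <{a}, {c}, {d}> in each sequence.
# Only the first and last values come from the example; the middle
# values are hypothetical placeholders.
occurrence_sid5 = [1, 3, 5]
occurrence_sid2 = [1, 6, 10]

valid_in_sid5 = satisfies_time_span(occurrence_sid5, delta_t=5)
valid_in_sid2 = satisfies_time_span(occurrence_sid2, delta_t=5)
```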
Constraint 2 (Pattern size): The pattern size constraint is defined by
comparing the size of a candidate sequence with the size of a sequence in a
sequential database SDB. Let α and γ be two sequences, where α is a candidate
sequence and γ is a sequence in SDB. The pattern size constraint is defined as:
$C_{size}(\alpha, \gamma) \equiv Size(\alpha) \le Size(\gamma)$, where Size(α)
and Size(γ) are integers.
For example, the candidate sequence 〈{a}, {c}, {d}〉 of size-3 can only be found
in those sequences of the SDB in Table 4.1 whose size is larger than or equal
to the size of the candidate sequence. The candidate sequence can therefore
only be found in the sequences SID2 and SID5, because their sizes, size-5 and
size-4 respectively, are larger than the candidate size, size-3. The candidate
sequence cannot be found in the sequences SID1, SID3, and SID4, because each of
these sequences is of size-2, which is smaller than the candidate size, size-3.
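A minimal sketch of this filter is given below. The sequence sizes are the ones stated in the example; the function name `satisfies_size` is hypothetical.

```python
def satisfies_size(candidate_size, sequence_size):
    """C_size: the candidate must not be larger than the database sequence."""
    return candidate_size <= sequence_size


# Sizes of the five sequences as stated in the example for Table 4.1.
sequence_sizes = {"SID1": 2, "SID2": 5, "SID3": 2, "SID4": 2, "SID5": 4}
candidate = ("a", "c", "d")

# Only sequences that pass the size check need to be searched.
searchable = [sid for sid, size in sequence_sizes.items()
              if satisfies_size(len(candidate), size)]
```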
Constraint 3 (Pattern existence): The pattern existence constraint is defined
by comparing the support of a candidate sequence with the maximum support
threshold value maxsup. Let α be a candidate sequence and maxsup be the maximum
support threshold value for finding rare sequential patterns in a sequential
database. The pattern existence constraint is defined as:
$C_{PE}(\alpha) \equiv Sup(\alpha) = maxsup$, where Sup(α) and maxsup are
integers.
For example, let 〈{a}, {c}, {f}〉 be a candidate sequence and let the maximum
support threshold value maxsup be set to 1. The support of the candidate
sequence is 1, which is equal to maxsup. The support becomes equal to maxsup at
the sequence SID2 shown in Table 4.1, so it is unnecessary to scan the
remaining sequences after SID2 in the database SDB. Therefore, unwanted
scanning of the sequences SID3, SID4, and SID5 can be avoided.
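The early stop in this example can be sketched as follows. This is an illustration, not the thesis implementation: the contents of the database are hypothetical (Table 4.1 is not reproduced here), each itemset is simplified to a single event, and the helper names are assumptions.

```python
def is_subsequence(pattern, sequence):
    """True when the events of pattern occur in sequence in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def count_support_until(pattern, database, maxsup):
    """Scan the database and stop as soon as the support reaches maxsup."""
    support, scanned = 0, []
    for sid, sequence in database:
        scanned.append(sid)
        if is_subsequence(pattern, sequence):
            support += 1
            if support == maxsup:  # pattern existence constraint met
                break
    return support, scanned


# Hypothetical database whose sequence sizes match the running example.
database = [
    ("SID1", ["a", "b"]),
    ("SID2", ["a", "b", "c", "d", "f"]),
    ("SID3", ["b", "e"]),
    ("SID4", ["c", "e"]),
    ("SID5", ["a", "c", "d", "e"]),
]

support, scanned = count_support_until(["a", "c", "f"], database, maxsup=1)
```

With this hypothetical data the scan ends at SID2, so SID3, SID4 and SID5 are never read.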
Constraint 4 (Feature reduction): In this process, we select the important
features and leave out the unimportant features during the data pre-processing
stage. An unimportant feature is a feature whose value cannot be changed or a
feature that is not required to conduct attacks on the system. We formulated
feature selection rules to select the important features and to drop the
insignificant features while identifying the anomalies. The features that carry
the significant information needed to identify an anomalous pattern are
selected for the experiment. That is, the features whose values can be changed
to conduct attacks on the control system are selected for the proposed
experiment. For example, an attack can be conducted by changing the value of
the feature Tank_Read_Pump_In_Manual to operate the water tank control system
in manual mode rather than in automatic mode.
4.4 Constraint-based Rare Sequential Pattern Mining Algorithm
In this section, we discuss how the constraints are integrated by modifying our
proposed rare sequential pattern mining algorithms (Algorithm 3.1 and Algorithm
3.2) presented in Chapter 3. Algorithm 3.1, which generates rare sequential
generator patterns, cannot be modified to integrate the algorithmic pattern
existence constraint during the rare pattern mining process. This is because
finding rare sequential generator patterns requires checking whether each
candidate sequence pattern is a rare or a frequent pattern. The pattern
existence constraint, however, stops the scanning of the database once the
support of the candidate sequence pattern reaches the maximum support threshold
value. During the generation of generator patterns, a candidate sequence may
turn out to be a frequent pattern whose support value exceeds the maximum
support threshold value maxsup, and an early stop would hide this. Therefore,
it is not possible to integrate the pattern existence constraint during the
generation of the rare sequential generators.
However, the other algorithmic constraint, the pattern size constraint, can be
integrated while generating rare sequential generators with Algorithm 3.1. This
is because a candidate sequence may be larger than a sequence in the database
SDB. Therefore, the pattern size constraint is used to generate
constraint-based rare sequential generators, as given in Algorithm 4.1. In
addition, the constraint-based rare sequential pattern mining algorithm, given
in Algorithm 4.2, integrates both of the algorithmic constraints: the pattern
size constraint and the pattern existence constraint. These two constraints are
integrated by modifying the rare sequential pattern mining algorithm (Algorithm
3.2) described in Chapter 3.
The goal of the constraint-based rare sequential pattern mining algorithms
(Algorithm 4.1 and Algorithm 4.2) is to find the rare sequential patterns of
interest to the user and to remove the unwanted patterns. As a result, the
constraint-based method helps to find anomalies from a reduced number of rare
sequential patterns. In other words, the reduced set of rare sequential
patterns helps to find anomalies effectively: anomalies are found from a small
number of rare sequential patterns instead of a large number. In addition, the
constraint-based algorithms can find anomalies efficiently by reducing the
computation time while generating rare sequential patterns. This is achieved by
generating only the rare sequential patterns of interest to the user instead of
all the rare patterns. Further, the constraint-based rare sequential pattern
mining algorithms provide the system operators with fewer rare sequential
patterns from which to identify the anomalies. Having fewer rare patterns also
reduces the false positives, because the anomalies are identified from fewer
rare patterns.
4.4.1 Generating Constrained Rare Sequential Generator Patterns
The algorithm for generating constrained rare sequential generators (Algorithm
4.1) is a modified version of Algorithm 3.1 discussed in Chapter 3. Algorithm
4.1 generates minimal rare sequential generators from a sequential database
SDB. First, the size-1 generators are found by separating Events(SDB) into a
rare zone and a frequent zone based on their support values. The size-1 events,
also called candidate sequence generators CSG_1, are placed in the rare zone
when their support value is below or equal to the maximum support threshold
value maxsup, and in the frequent zone when their support value is larger than
maxsup. The candidate sequence generators CSG_1 in the rare zone are called
size-1 rare generators, while those in the frequent zone are called size-1
frequent generators. This separation is done in steps 1-12 of Algorithm 4.1.
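The separation into the two zones can be sketched as follows. This is a simplified illustration of steps 1-12, assuming single-event itemsets and a hypothetical database; the function name `split_size1_generators` is not from the thesis.

```python
def split_size1_generators(database, maxsup):
    """Split size-1 events into rare and frequent zones by their support."""
    support = {}
    for sequence in database:
        for event in set(sequence):  # count each event once per sequence
            support[event] = support.get(event, 0) + 1

    rare, frequent = set(), set()
    for event, count in support.items():
        if count == len(database):
            continue  # occurs in every sequence: removed, as in steps 6-7
        if count > maxsup:
            frequent.add(event)
        else:
            rare.add(event)
    return rare, frequent


# Hypothetical database with five sequences.
database = [
    ["a", "b"],
    ["a", "b", "c", "d", "f"],
    ["b", "e"],
    ["c", "e"],
    ["a", "c", "d", "e"],
]

rare_zone, frequent_zone = split_size1_generators(database, maxsup=2)
```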
Algorithm 4.1: Generating Constraint-based Rare Sequential Generator Patterns
Input: A sequential database SDB, maxsup
Output: Constrained Rare Sequential Generator Patterns (RSG)
 1  CSG_1 ← {〈e〉 | ∀e ∈ Events(SDB)}  // CSG_1 is the set of candidate sequence generators of size-1
 2  FSG_1 ← {}, RSG_1 ← {}  // FSG_1 and RSG_1 are the sets of frequent and rare sequential generators respectively
 3  S.supp_0 ← |SDB|, ∀S ∈ CSG_1
 4  Count support S.supp_1 of each sequence S in CSG_1 by scanning SDB
 5  for S ∈ CSG_1 do
 6      if S.supp_1 = S.supp_0 then
 7          remove S from CSG_1
 8      else
 9          if S.supp_1 > maxsup then
10              FSG_1 ← FSG_1 ∪ {S}
11          else
12              RSG_1 ← RSG_1 ∪ {S}
13  s ← 2
14  FSG_s ← {}, RSG_s ← {}
15  while FSG_{s−1} not empty do
16      CSG_s ← all possible combinations of two sequences with common prefix of size-(s−2) subsequences in FSG_{s−1}
17      for S ∈ CSG_s do
18          ms ← minimum support of the size-(s−1) subsequences of S
19          S.supp_s ← 0
20          for a ∈ SDB and C_size(S, a) is true do
21              if S ⊑ a then
22                  S.supp_s ← S.supp_s + 1
23              else
24                  continue
25          if S.supp_s = ms then
26              remove S from CSG_s
27          else
28              if S.supp_s > maxsup then
29                  FSG_s ← FSG_s ∪ {S}
30              else
31                  RSG_s ← RSG_s ∪ {S}
32      s ← s + 1
33  return RSG = RSG_1 ∪ RSG_2 ∪ ... ∪ RSG_{s−1}

After generating the size-1 rare generators, we need to find the rare
generators of larger size. These larger generators are generated by merging
frequent generators, starting from the size-1 frequent generators. Frequent
generators of size-s are merged to generate size-(s+1) candidate sequence
generators. A candidate sequence generator of size-(s+1) is generated by
merging two size-s generators that share a common prefix subsequence of
size-(s−1). For example, 〈{a}, {b}〉 and 〈{a}, {c}〉 are two frequent generators
of size-2. These two generators have the common prefix subsequence 〈{a}〉. Two
candidate sequence generators of size-3 are generated from these two frequent
generators of size-2. In this candidate sequence generation process, the common
prefix subsequence is kept unchanged and the remaining suffix sequences are
merged in both forward and reverse directions. The candidate sequence
generators are 〈{a}, {b}, {c}〉 and 〈{a}, {c}, {b}〉, which are generated at step
16 of Algorithm 4.1.
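The merging step can be sketched as follows. This is an illustration under stated assumptions: itemsets are simplified to single events, a generator is never merged with itself, and the function name `merge_generators` is hypothetical.

```python
from itertools import permutations


def merge_generators(frequent_generators):
    """Merge size-s generators that share their first s-1 events.

    Every ordered pair is considered, so each pair of generators yields
    both the forward and the reverse suffix order.
    """
    candidates = set()
    for left, right in permutations(frequent_generators, 2):
        if left[:-1] == right[:-1]:  # common prefix of size s-1
            candidates.add(left + (right[-1],))
    return candidates


size2_generators = [("a", "b"), ("a", "c")]
size3_candidates = merge_generators(size2_generators)
```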
Once the candidate sequence generators are generated, the database is scanned
to check whether each candidate sequence exists in it. While searching for the
candidate sequence in the database, the pattern size constraint is applied, as
shown at steps 20-24 of Algorithm 4.1. The pattern size constraint was not used
in the rare sequential pattern mining Algorithm 3.1. This constraint enables
Algorithm 4.1 to skip the sequences in the SDB that do not need to be scanned.
If the size of a candidate sequence is larger than the size of an SDB sequence,
that SDB sequence is skipped and the scan moves to the next sequence. For
example, when looking for the candidate sequence 〈{a}, {e}, {f}〉 in the SDB
shown in Table 4.1, the sequences SID1, SID3 and SID4 are skipped. This is
because the size of the candidate sequence, size-3, is larger than the size of
the sequences SID1, SID3 and SID4, which is size-2 for all three. Only the
sequences SID2 and SID5 are searched for the candidate sequence. Since 3 of the
5 sequences in the database are skipped, the computational time of finding the
candidate sequence can be reduced by up to 60%.
A candidate sequence whose size is smaller than or equal to the size of a
sequence in the database can be found either as a rare sequential generator or
as a frequent sequential generator, depending on the candidate sequence's
support value. This is done at steps 18-31 of Algorithm 4.1. The entire process
of finding the rare sequential generators, shown in steps 15-32, continues
until there are no more frequent sequential generators to process, that is,
until FSG_s becomes empty. At the end of the process, the constrained rare
sequential generators are collected, as shown at step 33 of Algorithm 4.1.
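The support counting and the classification of one candidate can be sketched as follows. This is a simplified reading of the loop body of Algorithm 4.1, assuming single-event itemsets and a hypothetical database; the function names and the label strings are assumptions.

```python
def is_subsequence(pattern, sequence):
    """True when the events of pattern occur in sequence in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def support(pattern, database):
    """Count supporting sequences, skipping those shorter than the pattern."""
    return sum(1 for sequence in database
               if len(pattern) <= len(sequence)  # pattern size constraint
               and is_subsequence(pattern, sequence))


def classify_candidate(candidate, database, maxsup):
    """Label a candidate as 'discarded', 'frequent' or 'rare'."""
    supp = support(candidate, database)
    # Minimum support over the subsequences obtained by dropping one event.
    min_sub = min(support(candidate[:i] + candidate[i + 1:], database)
                  for i in range(len(candidate)))
    if supp == min_sub:
        return "discarded"  # same support as a subsequence: not a generator
    return "frequent" if supp > maxsup else "rare"


# Hypothetical database with five sequences.
database = [
    ("a", "b"),
    ("a", "b", "c", "d", "f"),
    ("b", "e"),
    ("c", "e"),
    ("a", "c", "d", "e"),
]

label_ac = classify_candidate(("a", "c"), database, maxsup=2)
label_ad = classify_candidate(("a", "d"), database, maxsup=2)
```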
4.4.2 Generating Constrained Rare Sequential Patterns
In this phase, constrained rare sequential patterns are generated from the
constrained rare sequential generators produced by Algorithm 4.1. The proposed
method of generating constrained rare sequential patterns is described in
Algorithm 4.2, which is a modified version of Algorithm 3.2 described in
Chapter 3. The procedure starts with the size-1 rare sequential generators, as
shown at step 4 of Algorithm 4.2. Starting from the size-1 rare generators, at
each size-s, all possible candidate rare sequential patterns CRSP_{s+1} of
size-(s+1) are generated, as shown at steps 6-11 of Algorithm 4.2. For each
rare sequential pattern of size-s, the candidate sequence patterns are
generated by extending the rare sequential pattern with each size-1 event of
the database. Each event is placed in every possible position of the rare
sequential pattern. For example, 〈{a}, {b}, {c}〉 is a rare sequential pattern
of size-3 and {g} is a size-1 event. From the size-3 rare sequential pattern,
four candidate sequential patterns are generated by placing the event {g} in
four different positions. The generated candidate sequential patterns are
〈{g}, {a}, {b}, {c}〉; 〈{a}, {g}, {b}, {c}〉; 〈{a}, {b}, {g}, {c}〉 and
〈{a}, {b}, {c}, {g}〉.
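The extension step of this example can be sketched as follows, with each itemset simplified to a single event. The function name `extend_pattern` is hypothetical.

```python
def extend_pattern(pattern, event):
    """Insert event at every possible position of pattern."""
    return [pattern[:i] + (event,) + pattern[i:]
            for i in range(len(pattern) + 1)]


candidates = extend_pattern(("a", "b", "c"), "g")
```

A pattern of size-s always yields s+1 candidates per event, which is why the example produces four candidates from a size-3 pattern.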
Not all of the generated candidate sequential patterns are rare sequential
patterns. Among these candidate sequential patterns, some are rare sequential
patterns and the others are non-existent patterns. The candidate sequential
patterns must be infrequent, that is, either rare or non-existent, because they
are generated from rare sequential patterns. While the candidate rare
sequential patterns CRSP_{s+1} are generated, candidate sequential patterns
that contain any non-existent pattern are discarded. This is because any
candidate rare sequential pattern generated from a non-existent pattern is
itself a non-existent pattern. The candidate rare sequential patterns that are
not generated from a non-existent pattern are checked further to find the rare
sequential patterns. If a candidate rare sequential pattern in CRSP_{s+1} is
found in the database, it is added to the rare sequential patterns RSP. If it
is not found, it is added to NEP so that no subsequent candidate patterns are
generated from it. The NEP check is shown at steps 13-17 of Algorithm 4.2.
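The pruning against the non-existent patterns can be sketched as follows. This is an illustration with single-event itemsets; the NEP content and the function names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True when the events of pattern occur in sequence in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def prune_with_nep(candidates, nep):
    """Keep only candidates that contain no known non-existent pattern."""
    return [candidate for candidate in candidates
            if not any(is_subsequence(missing, candidate) for missing in nep)]


candidates = [("g", "a", "b", "c"), ("a", "g", "b", "c"), ("a", "b", "c", "g")]
nep = [("g", "b")]  # hypothetical: <{g}, {b}> was not found in the database

kept = prune_with_nep(candidates, nep)
```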
Algorithm 4.2: Generating Constrained Rare Sequential Patterns and their Equivalence Classes
Input: a sequential database SDB, a set of rare sequential generators RSG
Output: all rare patterns and their equivalence classes
 1  NEP ← {}  // NEP holds all non-existent patterns in SDB
 2  GRSP ← {}  // set of equivalence classes that have the same support and occur in the same sequences in SDB
 3  s ← 1
 4  RSP_s ← {g | g ∈ RSG, |g| = 1}  // size-1 rare generators
 5  ms ← max_{S∈SDB} {|S|}
 6  while s < ms and RSP_s ≠ empty do
 7      CRSP_{s+1} ← {}  // candidate rare sequence patterns of size-(s+1)
 8      for each S in RSP_s do
 9          for each e in Events(SDB) do
10              C ← all sequences generated by adding e into S at different positions
11              CRSP_{s+1} ← CRSP_{s+1} ∪ C
12      RSP_{s+1} ← {}
13      for each S in CRSP_{s+1} do
14          if there is n in NEP such that n is a subsequence of S then
15              continue
16          else
17              RSP_{s+1} ← RSP_{s+1} ∪ {S}
18      RSP_{s+1} ← RSP_{s+1} ∪ RSG_{s+1}
19      for each S in RSP_{s+1} do
20          S.supp ← 0, S.sid ← {}
21          for a ∈ SDB and C_size(S, a) is true do
22              if S ⊑ a and C_pe(S) is true then
23                  S.supp ← S.supp + 1
24                  S.sid ← S.sid ∪ {a.sid}  // a.sid is the id of sequence a
25              else
26                  NEP ← NEP ∪ {S}
27          sp ← S.supp, sid ← S.sid
28          if GRSP_{sp,sid} is in GRSP then
29              GRSP_{sp,sid} ← GRSP_{sp,sid} ∪ {S}
30          else
31              GRSP_{sp,sid} ← {S}
32              GRSP ← GRSP ∪ {GRSP_{sp,sid}}
33      s ← s + 1
34      RSP_s ← RSP_s ∪ RSG_s
35  return RSP = RSP_1 ∪ RSP_2 ∪ ... ∪ RSP_{s−1}
36  return GRSP

The algorithmic constraints, the pattern size constraint and the pattern
existence constraint, are applied while searching for the candidate rare
sequential patterns of CRSP_{s+1} in the database, as shown at steps 21-26 of
Algorithm 4.2. The pattern size constraint checks whether the size of a
candidate rare sequential pattern in CRSP_{s+1} is larger than the size of a
sequence in the database. If it is, the candidate sequence cannot be found in
that sequence, so the scanning of that sequence is skipped. The pattern
existence constraint stops scanning the database once the support of a
candidate rare sequential pattern in CRSP_{s+1} equals the maximum support
threshold value maxsup, as shown at step 22 of Algorithm 4.2. When the
constrained rare sequential patterns RSP_{s+1} are found, these patterns are
separated into different equivalence classes. Each equivalence class holds the
rare patterns that have the same support value and occur in the same sequences
of the database, as shown at steps 28-32 of Algorithm 4.2.
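The grouping into equivalence classes can be sketched as follows. This is an illustration only: the patterns and their sequence ids are hypothetical, the support is taken as the number of supporting sequence ids, and the function name `group_equivalence_classes` is an assumption.

```python
from collections import defaultdict


def group_equivalence_classes(pattern_sids):
    """Group patterns by (support, set of supporting sequence ids)."""
    classes = defaultdict(set)
    for pattern, sids in pattern_sids.items():
        key = (len(sids), frozenset(sids))
        classes[key].add(pattern)
    return dict(classes)


# Hypothetical rare patterns with the ids of the sequences that contain them.
pattern_sids = {
    ("a", "c", "d"): {"SID2", "SID5"},
    ("a", "d"): {"SID2", "SID5"},
    ("a", "c", "f"): {"SID2"},
}

classes = group_equivalence_classes(pattern_sids)
```

Using the pair of support and sequence ids as a dictionary key keeps the membership test of step 28 to a single lookup.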
4.5 Experimental Evaluation
In this section, we present the experimental methodology used to evaluate our
proposed constraint-based rare sequential pattern mining algorithm. Firstly, we
describe the dataset used in the experiment. Secondly, we explain the data
pre-processing steps that prepare the dataset for the proposed algorithm.
Finally, we describe the method used in the experiment.
4.5.1 Dataset
In the experiment of this chapter, we have also used off-line SCADA control
logs. Three control systems were used as the source of logs. These logs contain
data about the process activities of the control systems. The features that
record the activities in the logs hold binary and integer values. For example,
the feature Conv_Read_Conv_Color_PE, shown in Table 4.2, indicates the
detection of the color of the puck running on the conveyor belt. Depending on
the color of the puck, the value of Conv_Read_Conv_Color_PE changes from 0 to
−1 or vice versa. The diverting paddle then directs the puck to either the left
or the right of the conveyor belt. If the direction is left, the value of the
feature Conv_Read_Solenoid_Left_Direction changes from 0 to −1. However, if the
direction is right, the value of the feature Conv_Read_Solenoid_Right_Direction
changes from −1 to 0.
Table 4.2: A partial view of the conveyor belt control logs.
VarName                              TimeString              VarValue
Conv_Read_Solenoid_Left_Direction    16/06/2017 5:41:08 PM   -1
Conv_Read_Solenoid_Right_Direction   16/06/2017 5:41:08 PM   0
Conv_Run_Status                      16/06/2017 5:41:08 PM   0
Conv_Read_Conv_Color_PE              16/06/2017 5:41:08 PM   0
Conv_Read_Conv_HMI_Direction         16/06/2017 5:41:08 PM   0
Conv_Read_Conv_Present_PE            16/06/2017 5:41:08 PM   0
HMI_Conv_Master_Mode                 16/06/2017 5:41:08 PM   -1
HMI_Conv_Reset                       16/06/2017 5:41:08 PM   0
HMI_Conv_Direction                   16/06/2017 5:41:08 PM   0

All of the feature values of the conveyor belt control logs are binary, meaning
that the values are either 0 or −1. Features that are not binary hold integer
or floating point values. For example, the pressure values of the pressure
control system change from low to high when the pressure increases in the
pressure control system's pipe, and from high to low when the pressure is
released from the pipe. The pressure values are stored as floating point values
since the pressure status cannot be indicated by binary values. The floating
point pressure values have high variance, which increases the number of unique
events in the database.
The pressure control system feature Pipe_Read_Pipeline_Pressure, shown in Table
4.3, holds the floating point value 17.63478, which indicates the current
pressure in the pressure control system pipe. The pressure values change
gradually from low to high and from high to low as the pressure in the pipe
increases and decreases respectively. In a SCADA control process, a change of a
value away from its predefined set value is considered an attack on the control
system, which can disrupt the process. This means that changing an event's
value or threshold value can alter the process outcome and make it a rare
event. For example, changing the upper threshold value of the pressure control
event Solenoid_On_SP_Int from 40 PSI to 50 PSI makes it a rare anomalous event
that can hamper the outcome of the process. This change of the predefined set
pressure value is an attack on the pressure control system, as it violates the
normal process activities. All of the pressure values are recorded as float
values under the variable. Further, in the water tank control system, the
feature Tank_Read_Tank_Level, shown in Table 4.4, holds the floating point
value 61.77073, which indicates the current water level in the tank.
Table 4.3: A partial view of the pressure control logs.
VarName                        TimeString              VarValue
HMI_Pipe_Pump_Off_SP           16/06/2017 6:55:08 PM   40
HMI_Pipe_Pump_On_SP            16/06/2017 6:55:08 PM   5
HMI_Pipe_Solenoid_Off_SP       16/06/2017 6:55:08 PM   30
HMI_Pipe_Solenoid_On_SP        16/06/2017 6:55:08 PM   40
HMI_Pipe_Master_Mode           16/06/2017 6:55:08 PM   -1
Pipe_Pump_Run_Status           16/06/2017 6:55:08 PM   -1
Pipe_Read_Pipeline_Pressure    16/06/2017 6:55:08 PM   17.63478
Pipe_Read_Pump_Mode            16/06/2017 6:55:08 PM   0
Pipe_Read_Pump_Run_Cmd         16/06/2017 6:55:08 PM   -1
Pipe_Read_Solenoid_Mode        16/06/2017 6:55:08 PM   -1

The water level values, which are stored as floating point values, also have
high variance. When the pump fills the upper primary tank, the water level in
the primary tank increases from a low level to a high level. When the water is
drained from the upper primary tank, the water level decreases from a high
level to a low level. If the high level threshold value of the water tank is
changed to a very high value that is above the capacity of the upper primary
tank, the high threshold value will never be met. As a result, the pump will
never stop, and the upper primary tank overflows and floods the control system.
Likewise, if the low level threshold value is changed to a very low value, the
low threshold value will not be reached. As a result, the pump will not start
to fill the upper primary tank, although the water level has reached the low
level capacity of the upper primary tank. As in the pressure control system,
some of the features, such as Tank_Stopped shown in Table 4.4, hold a binary
value, that is, 0 or −1. If Tank_Stopped holds the value 0, it indicates that
the water tank has been stopped. However, if Tank_Stopped holds the value −1,
it indicates that the water tank is in the running state.

Table 4.4: A partial view of the water tank control logs.
VarName                     TimeString              VarValue
Tank_Level                  16/06/2017 6:55:08 PM   61.77073
Tank_Off_SP_Int             16/06/2017 6:55:08 PM   80
Tank_On_SP_Int              16/06/2017 6:55:08 PM   50
Tank_Read_Pump_In_Auto      16/06/2017 6:55:08 PM   0
Tank_Read_Pump_In_Manual    16/06/2017 6:55:08 PM   0
Tank_Read_Pump_Running      16/06/2017 6:55:08 PM   0
Tank_Read_Tank_Level        16/06/2017 6:55:08 PM   61.77073
Tank_Stopped                16/06/2017 6:55:08 PM   -1
Tank_Usage_Level            16/06/2017 6:55:08 PM   38.22927
HMI_Tank_Master_Mode        16/06/2017 6:55:08 PM   -1
Different methods can be used to generate the control system logs, according to
the guidelines provided by the vendor of the SCADA control system devices. In
the first phase, which covers data generation, recording and storing, the
circular, segmented circular, display system event at, and trigger event
methods can be used. In the circular logging method, event activities are
recorded in a fixed-size log file. Once the log file is full, the logging
starts to overwrite the existing log file. The segmented circular logging
method is a slight variation of circular logging. Instead of overwriting the
existing log file after it reaches its size limit, it creates a new log file.
If the newly created log file gets full, it starts writing again to the
previous log file. For the experiment, we have used the circular logging method
to generate the SCADA control logs.
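The overwrite behaviour of circular logging can be illustrated with a fixed-capacity buffer. This is only a conceptual sketch held in memory, assuming that the oldest entries are overwritten first; it does not reproduce the vendor's logging facility, and the class name `CircularLog` is hypothetical.

```python
from collections import deque


class CircularLog:
    """Fixed-capacity log in which new entries overwrite the oldest ones."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry when it is full.
        self._entries = deque(maxlen=capacity)

    def write(self, entry):
        self._entries.append(entry)

    def entries(self):
        return list(self._entries)


log = CircularLog(capacity=3)
for entry in ["event-1", "event-2", "event-3", "event-4"]:
    log.write(entry)

stored = log.entries()
```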
In the second phase, data collection, three methods, Cyclic, On-change and
On-demand, can be used to collect the process activity values that are stored
in the PLC devices of SCADA systems. In the cyclic method, the values stored in
the PLC memories are polled simultaneously at a fixed time interval. For
example, every 1 second, the values stored in the different variables are
gathered together. In the on-change method, only the values that have changed
during the polling time are polled. The final method of data acquisition is
on-demand, where data are polled only when there is a demand or request for the
logs. This can be done in a non-periodic manner by running a script in reply to
a data request. The previous two methods, in contrast, are periodic, meaning
that data are polled at every time interval. Note that, due to the high
variance of some values, such as pressure and water level changes, the
on-change method does not record every value that changes; rather, values are
recorded at some time interval. The reason for choosing the on-change log
acquisition mode is that it records the episodic events together in the logs,
which allows the log sequences to be segmented by a time-span gap.
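The difference between cyclic and on-change collection can be sketched by filtering cyclic snapshots down to the values that changed. This is an illustration under stated assumptions: the snapshots are hypothetical, the first observation of a variable is treated as a change, and the function name `on_change` is not part of any SCADA product.

```python
def on_change(snapshots):
    """Keep only the (time, variable, value) records whose value changed."""
    last_seen = {}
    records = []
    for timestamp, values in snapshots:
        for name, value in values.items():
            if name not in last_seen or last_seen[name] != value:
                records.append((timestamp, name, value))
                last_seen[name] = value
    return records


# Hypothetical cyclic snapshots polled once per second.
snapshots = [
    (1, {"Conv_Run_Status": 0, "HMI_Conv_Reset": 0}),
    (2, {"Conv_Run_Status": -1, "HMI_Conv_Reset": 0}),
    (3, {"Conv_Run_Status": -1, "HMI_Conv_Reset": 0}),
]

records = on_change(snapshots)
```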
The logs for the experiment of this chapter were generated and collected over 8
hours of operation of the SCADA control systems. During this period, some
attacks were conducted to disrupt the process activities of the control
systems. The conveyor belt, pressure system and water tank datasets comprise
1 929, 47 586 and 18 679 lines of logs respectively. In all three control
system datasets, different types of attacks were conducted during the data
generation operation on the SCADA control systems. In the pressure control
system, attacks were conducted by changing the pressure threshold values. For
example, the lower pressure threshold value was changed from the set value of
20 PSI (pounds per square inch) to 25 PSI, and the upper threshold value was
changed from the set value of 40 PSI to 45 PSI. In the conveyor belt control
system, attacks were conducted by unexpectedly changing the direction of the
diverter paddle. In addition, on a few occasions the conveyor belt control
system was unexpectedly stopped and then started. Finally, in the water tank
control system, some attacks were conducted by unexpectedly changing the mode
of operation from automatic to manual and from manual to automatic. Besides, on
a few occasions, the water tank was stopped and started unexpectedly.
Furthermore, some flooding attacks were conducted by changing the values of
some of the events multiple times.
4.5.2 Pre-processing
The raw data collected from the data source are not always readily usable for
experiments. The data need pre-processing to prepare a suitable dataset for the
data mining algorithm. Pyle [128] presents the basic requirements for data
pre-processing. These include data cleaning, normalization, transformation,
feature extraction and feature selection. After the different pre-processing
steps are completed, a set of data is prepared for the actual data mining
process. The control logs collected for the experiment are in comma separated
value (CSV) format, which is not readily suitable for our proposed
constraint-based rare sequential pattern mining algorithm.
To transform the CSV formatted SCADA control system logs into a sequential
database SDB, we first select the required features and then merge each feature
with its corresponding value. A feature merged with its value represents an
individual event. We formulated feature selection rules to select the important
features and to drop the insignificant features while identifying the
anomalies. The features that carry the significant information needed to
identify an anomalous pattern are selected for the experiment. That is, the
features whose values can be changed to conduct attacks on the control system
are selected for the proposed experiment. For example, an attack can be
conducted by changing the value of the feature Tank_Read_Pump_In_Manual to
operate the water tank control system in manual mode rather than in automatic
mode. In addition, we do not select the features that hold the duplicate value
of other features. For example, the features Pump_Off_SP_Int and
HMI_Pipe_Pump_Off_SP of the pressure control system hold the same value.
Therefore, instead of keeping both features, we select one of them for the
experiment. We also dropped the features that hold meta data, that is, features
whose values hold information about other features. For example, the variable
Time_ms holds the time in milliseconds, while the variable TimeString holds the
same value in date and time format. The benefit of selecting the significant
features and dropping the unimportant features is that the sequences in the
database stay small. The reduced number of features helps to generate fewer
candidate sequences. Fewer candidate sequences require fewer comparisons to
find the rare patterns, which consumes less computational time. We
pre-processed the control logs of the three systems (the conveyor belt, the
pressure control and the water tank control system) into their respective
sequence databases.
(i) Conveyor belt control logs: In the conveyor belt control system, the process
activities were recorded under 9 different features. Each feature holds a
binary value, 0 or −1. Every individual feature is merged with its
corresponding values, and each feature-value pair represents an individual
event. For example, the feature Conv_Run_Status contains either 0 or −1 based
on the current status of the conveyor belt control system. If the conveyor belt
is in the running state, the feature holds the value −1; otherwise it holds the
value 0. Therefore, this feature generates two events, {Conv_Run_Status_0} and
{Conv_Run_Status_-1}. The log events are then segmented into different
sequences, which together comprise the sequential database SDB. The sequences
are segmented based on the average time-span gap between two consecutive
episodic events in the control logs, which ensures that a pattern can be
generated from a time-span constrained episodic sequence. If the log is not
segmented based on the time-span gap between episodes of events, a pattern can
be fragmented across consecutive sequences. Applying the time-span constraint
generates 171 sequences, which make up the conveyor belt database.
Among the sequences in the database, the longest sequence has a size of 12,
meaning that it comprises 12 events. There are 38 unique events in the
database. Among these events, 17 are generated from the 9 different features.
Since one of the features, emergency_stop, holds only a single value instead of
a binary value, it generates a single event rather than two. Hence, the 9
conveyor belt features generate 17 events instead of 18.
The remaining 21 of the 38 unique events are generated by combining the unique
events that occur simultaneously in the database. For example,
{Conv_Run_Status_0, Conv_Read_Conv_Color_PE_0} is a unique event that comprises
two events, {Conv_Run_Status_0} and {Conv_Read_Conv_Color_PE_0}, because these
two events occurred simultaneously on the control system.
(ii) Pressure system control logs: In the pressure control system, the events
were recorded under 17 features, unlike the conveyor belt control system, where
events were recorded under 9 features. Among these 17 features, some hold
binary values, some hold integer values, and the rest hold floating point
values. For example, the feature Pipe_Solenoid_Open_Status holds the binary
value 0 or −1 based on the current status of the solenoid of the pressure
control system. Since each feature is merged with its corresponding values to
generate an event, the feature Pipe_Solenoid_Open_Status generates two
individual events, {Pipe_Solenoid_Open_Status_0} and
{Pipe_Solenoid_Open_Status_-1}. Another feature, Pipe_Read_Pipeline_Pressure,
holds floating point values such as 20.18696, which indicate the current
pressure on the pipeline of the pressure control system. The pipeline pressure
changes from the lower threshold value to the upper threshold value and vice
versa. Since the pressure values are held in floating point,
Pipe_Read_Pipeline_Pressure is a high variance feature, which increases the
number of unique events in the pressure control database.
To reduce the number of unique events in the database, we converted the values
of the high variance features to their ceiling values. For example, the feature
Pipe_Read_Pipeline_Pressure holds the current pressure value. As the pressure
increases on the pressure control system, this feature keeps recording the
values in floating point. Since each feature is merged with its corresponding
values, fewer distinct feature values produce fewer unique events. Therefore,
the pressure value 20.18696 is rounded up to 21, which is merged with the
feature Pipe_Read_Pipeline_Pressure to create the individual event
Pipe_Read_Pipeline_Pressure_21. In this way, instead of producing many unique
events, we reduce the number of unique events by rounding up the values.
After generating the individual events, as for the conveyor belt control
database, we create sequences from the pressure control logs by segmenting them
with the average time-span gap between two consecutive episodic events. Using
this average time-span gap, 232 sequences are generated from the pressure
control logs, which make up the pressure control database. Among these
sequences, the longest comprises 17 events. In addition, a total of 72 unique
events are generated in the pressure control database. Among the 72 unique
events, 52 events are generated by merging 16 features with their corresponding
values. Since the pressure control system features hold not only binary values
but also integer and floating point values, the number of unique events is
larger than in the conveyor belt control system. The remaining 20 events are
combinations of the 52 unique events that occur simultaneously in the pressure
control database.
(iii) Water tank control logs: In the water tank control system, the process
activities were recorded under 10 different features. Among these features,
some hold binary values, and others hold integer and floating point values, as
in the pressure control system. The floating point values of the high variance
features are converted to ceiling values, as for the pressure control system
features. As a result, the number of unique events is reduced. As with the
previous two control systems, the water tank database is created once the
individual log events are generated. There are 323 sequences generated from the
water tank control logs. The database contains 92 unique events. The longest
sequence is composed of 22 events.
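The pre-processing steps described above (merging a feature with its value,
rounding high variance values up, and segmenting the log by the average
time-span gap) can be sketched as follows. This is only an illustrative sketch
and not the implementation used in the thesis: the log rows, the timestamps,
the function names and the exact segmentation rule (start a new sequence when a
gap exceeds the average gap) are assumptions made for the example.

```python
import math
from statistics import mean

# Hypothetical log rows: (timestamp in ms, {feature: value}). The feature
# names appear in the text; the timestamps and values are made up.
LOG = [
    (0,    {"Conv_Run_Status": -1}),
    (100,  {"Conv_Run_Status": 0, "Conv_Read_Conv_Color_PE": 0}),
    (200,  {"Pipe_Read_Pipeline_Pressure": 20.18696}),
    (5000, {"Conv_Run_Status": -1}),
    (5100, {"Pipe_Read_Pipeline_Pressure": 20.9}),
]


def to_event(feature, value):
    """Merge a feature with its value; floats are rounded up (ceiling)."""
    if isinstance(value, float):
        value = math.ceil(value)
    return f"{feature}_{value}"


def to_itemset(row):
    """Events logged at the same timestamp form one itemset."""
    return frozenset(to_event(f, v) for f, v in row.items())


def segment(log):
    """Split the log into sequences. Assumed rule: a new sequence starts
    whenever the gap between two consecutive entries exceeds the average gap."""
    if not log:
        return []
    gaps = [b[0] - a[0] for a, b in zip(log, log[1:])]
    threshold = mean(gaps) if gaps else 0
    sequences, current = [], [to_itemset(log[0][1])]
    for (prev_t, _), (t, row) in zip(log, log[1:]):
        if t - prev_t > threshold:
            sequences.append(current)
            current = []
        current.append(to_itemset(row))
    sequences.append(current)
    return sequences


sdb = segment(LOG)
for sid, seq in enumerate(sdb, start=1):
    print(f"SID{sid}", [sorted(itemset) for itemset in seq])
```

In this toy log the gap of 4800 ms is the only one above the average gap, so
the five rows are split into two sequences, and the pressure reading 20.18696
becomes the event Pipe_Read_Pipeline_Pressure_21.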
4.5.3 Experimental methodology
The datasets prepared during the pre-processing phase are used with the
proposed constraint-based rare sequential pattern mining algorithm described in
Section 4.4. The algorithm comprises two phases. In the first phase, Algorithm
4.1 generates constrained rare sequential generators. In the second phase,
Algorithm 4.2 generates constrained rare sequential patterns, which are
extended from the generators. To find the impact of using constraints with the
rare sequential pattern mining algorithm discussed in Chapter 3, we conducted
experiments with and without constraints on the same datasets. Finally, we
evaluated the performance of these experiments by using the precision and the
recall derived from the confusion matrix, as shown in Table 4.5.
Precision or Detection Rate (DR): It defines the ratio between the number
of rightly detected malicious events or attacks and the total number of predicted
attacks or malicious events. The precision is defined as follows:
\[ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \times 100\% \tag{4.1} \]
Here, True Positive indicates the correct identification of an intrusive case by
the algorithm, whereas False Positive indicates the incorrect identification of
a benign case as an intrusive case. On the other hand, False Negative represents
the incorrect identification of an intrusive case as a benign case. For example,
assume that an anomaly detection algorithm has classified 148 cases, of which
only 20 are actual anomalies or attacks on the system, as shown in Table 4.5.
The algorithm has identified 40 cases as intrusive, but only 12 of these
predictions are correct (True Positive), while the other 28 are incorrect (False
Positive). The algorithm has also identified 108 cases as benign, or not
intrusive. Among these, 8 predictions are incorrect (False Negative), while the
other 100 are correct (True Negative).
Table 4.5: Confusion matrix.
| Predicted condition | True condition: Condition Positive | True condition: Condition Negative |
| --- | --- | --- |
| Condition Positive | TP: 12 (True Positive) | FP: 28 (False Positive), Type I error |
| Condition Negative | FN: 8 (False Negative), Type II error | TN: 100 (True Negative) |
Recall or True Positive Rate (TPR): It defines the ratio between the number
of rightly detected malicious events or attacks and the total number of actual
malicious events or attacks [129]. The recall is defined as follows:
\[ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \times 100\% \tag{4.2} \]
Therefore, using the confusion matrix in Table 4.5, the precision and the recall
of the anomaly detection algorithm are 30% and 60%, respectively.
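As a quick check of equations (4.1) and (4.2), the two measures can be computed
directly from the counts in Table 4.5. The function names below are
illustrative choices and are not taken from the thesis.

```python
def precision(tp, fp):
    """Equation (4.1): TP / (TP + FP), expressed as a percentage."""
    return 100.0 * tp / (tp + fp)


def recall(tp, fn):
    """Equation (4.2): TP / (TP + FN), expressed as a percentage."""
    return 100.0 * tp / (tp + fn)


# Counts taken from the confusion matrix in Table 4.5.
tp, fp, fn, tn = 12, 28, 8, 100

print(precision(tp, fp))  # 30.0
print(recall(tp, fn))     # 60.0
```

Note that the True Negative count does not enter either measure, which is why
both remain informative when benign cases vastly outnumber attacks.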
We designed four experiments (Experiment-1 to Experiment-4) to measure
the effectiveness and efficiency of our proposed constraint-based rare sequential
pattern mining algorithm to detect anomalies in the SCADA control system logs.
We conducted these four experiments individually on the three control system
datasets.
(i) First Experiment: The first experiment (Experiment-1) is designed to run
our proposed constraint-based rare sequential pattern mining algorithm without
any additional constraints apart from the time-span constrained database. The
aim of the first experiment is to record the computational time of the rare
sequential pattern mining algorithm and the total number of rare sequential
patterns the algorithm generates. These are then compared with the
computational time and the number of rare patterns generated by the other three
experiments, where the constraints are implemented. In the experiment, the
maximum support threshold value was set to 2 because it was assumed that
anomalies occur rarely in a system. The low threshold value ensures that the
frequently occurring sequences are filtered out of the control system databases.
(ii) Second Experiment: In the second experiment (Experiment-2), we added the
feature reduction constraint along with the time-span constraint to reduce the
number of unique events in the database. The goal of the second experiment is
to find the computational time and the number of rare sequential patterns when
the respective constraints are used separately. To find anomalies in an
effective and efficient manner, we applied the feature reduction constraints in
addition to the time-span constraint to the three control system databases. As
a result, anomalies are detected from a reduced number of rare sequential
patterns, which consumes less computational time compared to Experiment-1.
To obtain fewer rare patterns and less computational time, we reduced the
number of features in the database by removing less significant features that
do not contribute to finding anomalies. This means that the features which are
less likely to make any change to the process outcome are not selected in the
rare sequential pattern mining process. Likewise, features whose values are
unlikely to be altered while conducting attacks on the system are not selected
in the mining process. We therefore selected only those features whose values
can be changed to conduct attacks on the control system. For example, changing
the value of the conveyor belt feature HMI_Conv_Direction from 0 to −1 can
alter the direction of the sorted objects on the conveyor belt.
Firstly, for the conveyor belt control system experiment, we selected 3 of the
9 features from the control logs. These three features with their corresponding
values resulted in 6 individual events. As a result, the number of unique
events was reduced from 38 to 8 in the conveyor belt database. With the reduced
number of unique events, 72 sequences are generated from the conveyor belt
database instead of the 171 sequences used in Experiment-1, where no feature
reduction constraint was applied. In addition, the size of the longest sequence
is reduced to 3 events, compared to 12 events in Experiment-1.
Secondly, for the pressure control system experiment, we selected 3 features
out of 16 features. These 3 features were selected because their values can be
altered to change the process outcome of the pressure control system. As a
result, the number of unique events was reduced to 7 from the 72 events used in
Experiment-1. Also, the number of sequences was reduced to 48 from the 232
sequences used in Experiment-1. Further, the size of the longest sequence is
also reduced to 3 events from 17 events.
Finally, for the water tank control system experiment, we selected 4 features
from 10 features. As a result, the number of unique events was reduced to 8
from the 92 events used in Experiment-1. Also, the number of sequences was
reduced to 81 from 323. Further, the size of the longest sequence was reduced
to 7 events from the 22 events in Experiment-1. After implementing the feature
reduction constraints in the three control system databases, we ran our
proposed rare sequential pattern mining algorithm with a maximum support
threshold value of 2, as in Experiment-1.
(iii) Third Experiment: In the third experiment (Experiment-3), we applied two
algorithmic constraints along with the time-span constrained database to avoid
unwanted scanning of the database. We used the algorithmic constraints without
implementing the feature reduction constraints on the database. The goal of the
third experiment is to find the computational time and the number of rare
sequential patterns when the respective constraints are used independently. In
other words, the experiment checks whether the algorithm can find the same
number of anomalous patterns as found by Experiment-1 and Experiment-2, and
compares its computational time with that of those two experiments.
(iv) Fourth Experiment: Finally, in the fourth experiment (Experiment-4), we
combined all the constraints, that is, the feature reduction constraints and
the algorithmic constraints along with the time-span constrained database. The
goal of the fourth experiment is to evaluate the performance of the algorithm
when the constraints are used together rather than independently.
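The role of the maximum support threshold and of the feature reduction
constraint in these experiments can be illustrated with a small sketch. It is
not Algorithm 4.1 or Algorithm 4.2: it only counts the support of a given
pattern and projects a database onto selected features. The toy database, the
function names, and the reading of "rare" as a support between 1 and the
maximum support threshold (inclusive) are assumptions made for the example.

```python
def contains(sequence, pattern):
    """True if the itemsets of `pattern` occur in `sequence` in order,
    each as a subset of a distinct itemset of the sequence."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(sdb, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in sdb)


def is_rare(sdb, pattern, max_sup=2):
    # Assumption: a pattern is rare if 1 <= support <= max_sup.
    return 1 <= support(sdb, pattern) <= max_sup


def reduce_features(sdb, selected):
    """Feature reduction: keep only events of the selected features and
    drop itemsets and sequences that become empty."""
    reduced = []
    for seq in sdb:
        new_seq = []
        for itemset in seq:
            # An event is "<feature>_<value>"; strip the value to get the feature.
            kept = frozenset(e for e in itemset if e.rsplit("_", 1)[0] in selected)
            if kept:
                new_seq.append(kept)
        if new_seq:
            reduced.append(new_seq)
    return reduced


run = frozenset({"Conv_Run_Status_-1"})
stop = frozenset({"Conv_Run_Status_0"})
color = frozenset({"Conv_Read_Conv_Color_PE_0"})

# Hypothetical toy database of four sequences.
sdb = [[run, color, stop], [run, color], [run, color], [color, run]]

print(support(sdb, [run, stop]), is_rare(sdb, [run, stop]))    # 1 True
print(support(sdb, [run, color]), is_rare(sdb, [run, color]))  # 3 False
print(reduce_features(sdb, {"Conv_Run_Status"})[0] == [run, stop])  # True
```

The projection shortens every sequence, which is the effect the feature
reduction constraint relies on: fewer and shorter sequences leave fewer
candidate patterns to count.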
4.6 Results and Analysis
This section presents the experimental results and analysis. We conducted
the constraint-based rare sequential pattern mining on three control system
databases. First, Section 4.6.1 presents the results obtained from the conveyor
belt database. Secondly, Section 4.6.2 presents the results for the pressure
control database. Finally, Section 4.6.3 presents the results found from the
water tank control system database.
4.6.1 Conveyor-belt Control System
The rare sequential pattern mining algorithm in the first experiment
(Experiment-1), which does not include any additional constraints other than
the time-span constrained database, generated 906 925 rare sequential patterns.
The computational time the algorithm takes to generate these rare sequential
patterns is 4 days, 7 hours and 26 minutes. A partial view of the rare
sequential patterns is shown in Table 4.6. Among these rare sequential
patterns, 4 have been detected as anomalies as well as attack patterns. The
remaining rare sequential patterns are merely suspicious patterns that do not
reveal any other anomalies. Hence, these suspicious patterns are less important
for detecting anomalies. The algorithm required a large computational time to
generate these rare sequential patterns, although most of them do not
contribute to detecting anomalies. Moreover, it is difficult to detect
anomalous patterns among such a large number of rare sequential patterns,
because the security operators need to check these rare patterns manually to
identify the anomalies. Therefore, Experiment-1 shows that the rare sequential
pattern mining algorithm without any additional constraints other than the
time-span constrained database is less effective and efficient in detecting
anomalies.
Table 4.6: A partial view of the conveyor-belt result from the Experiment-1.
SID1 〈{Conv_Read_Conv_HMI_Direction_0}〉
SID2 〈{HMI_Conv_Direction_-1}〉
SID3 〈{Conv_Read_Conv_HMI_Direction_-1}〉
SID4 〈{Conv_Read_Conv_HMI_Direction_-1}, {HMI_Conv_Direction_-1}〉
SID5 〈{Conv_Read_Solenoid_Left_Direction_-1}, {Conv_Read_Solenoid_Right_Direction_0}, {Conv_Read_Conv_Color_PE_0}, {Conv_Read_Conv_Present_PE_0}, {Conv_Run_Status_0}, {Conv_Run_Status_-1}, {Conv_Run_Status_0}, {Conv_Run_Status_-1}, {Conv_Read_Solenoid_Left_Direction_0}, {Conv_Read_Solenoid_Right_Direction_-1}, {Conv_Read_Conv_Color_PE_-1}, {Conv_Read_Conv_HMI_Direction}〉
In Experiment-2, where we added the feature reduction constraints in addition
to the time-span constrained database, the rare sequential pattern mining
algorithm generated only 16 rare sequential patterns, a significant reduction
from the 906 925 rare sequential patterns generated by Experiment-1. Moreover,
the computational time for Experiment-2 is less than a minute, compared to 4
days, 7 hours and 26 minutes for Experiment-1. Although Experiment-2 generated
only 16 rare sequential patterns, it detects the same number of anomalous
patterns, 4, as Experiment-1. This means that Experiment-2 did not miss any
anomalies that were detected by Experiment-1. However, Experiment-1 detected
the anomalous patterns from a large number of rare sequential patterns, which
took a large computational time compared to that of Experiment-2. Since finding
anomalies in a smaller number of rare sequential patterns does not require
extensive work from the security operators, the feature reduction constrained
Experiment-2 is more effective in finding anomalies than Experiment-1. In
addition, as Experiment-2 takes less computational time than Experiment-1, it
is also more efficient in detecting anomalies. A partial view of the results
from Experiment-2 is shown in Table 4.7.
Table 4.7: A partial view of the conveyor-belt result from Experiment-2.
SID1 〈{Conv_Read_Conv_HMI_Direction_0}〉
SID2 〈{HMI_Conv_Direction_-1}〉
SID3 〈{HMI_Conv_Direction_0}〉
SID4 〈{Conv_Read_Conv_HMI_Direction_-1}〉
SID5 〈{Conv_Read_Conv_HMI_Direction_-1}, {HMI_Conv_Direction_-1}〉
SID6 〈{Conv_Read_Conv_HMI_Direction_0}, {HMI_Conv_Direction_0}〉
SID7 〈{Conv_Run_Status_-1}, {Conv_Run_Status_-1}〉
SID8 〈{Conv_Run_Status_-1}, {Conv_Run_Status_0}〉
SID9 〈{Conv_Run_Status_0}, {Conv_Run_Status_-1}〉
SID10 〈{Conv_Run_Status_-1}, {HMI_Conv_Direction_-1}〉
SID11 〈{Conv_Run_Status_-1}, {Conv_Read_Conv_HMI_Direction_-1}〉
SID12 〈{Conv_Run_Status_-1}, {Conv_Read_Conv_HMI_Direction_-1}, {HMI_Conv_Direction_-1}〉
In the third experiment (Experiment-3), which used the algorithmic constraints
without applying the feature reduction constraints on the database, the
algorithmic constraint-based rare sequential pattern mining algorithm generated
906 925 rare sequential patterns, the same number as generated by Experiment-1.
This is because neither Experiment-1 nor Experiment-3 implemented the feature
reduction constraint. As a result, the number of unique events, the size of the
sequences, and the size of the database remain unchanged. However, the
computational time taken by Experiment-3 is 3 days, 2 hours and 54 minutes,
which is less than the 4 days, 7 hours and 26 minutes taken by Experiment-1, as
shown in Table 4.8. This is due to the algorithmic constraints, which helped
Experiment-3 to reduce the computational time.
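To put the two reported execution times side by side, the saving can be
computed from the figures quoted above. The snippet below is only arithmetic on
those two reported times; the resulting difference and percentage are derived
here and are not stated in the thesis.

```python
from datetime import timedelta

# Execution times reported for the conveyor belt database.
exp1 = timedelta(days=4, hours=7, minutes=26)  # Experiment-1
exp3 = timedelta(days=3, hours=2, minutes=54)  # Experiment-3

saved = exp1 - exp3
saved_pct = 100 * saved / exp1  # timedelta / timedelta yields a float

print(saved)                # 1 day, 4:32:00
print(round(saved_pct, 1))  # 27.6
```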
The fourth experiment (Experiment-4), which used the feature reduction
constraint and the algorithmic constraint together along with the time-span
constrained database, generated 16 rare sequential patterns. Among these 16
rare sequential patterns, 4 were detected as anomalous patterns when compared
to the labelled attack dataset. The number of rare sequential patterns
generated by Experiment-4 is significantly smaller than the number generated by
Experiment-1 and Experiment-3. The reason is that Experiment-4 used the feature
reduction constraint together with the algorithmic constraint, a combination
which neither Experiment-1 nor Experiment-3 used. The feature reduction
constraints in Experiment-4 reduced the unique events, the size of the
sequences, and the size of the database, which in turn reduced the number of
rare sequential patterns generated compared to Experiment-1 and Experiment-3.
Generating fewer rare sequential patterns required less computational time. In
addition, the algorithmic constraints also helped to reduce the computational
time while generating the rare sequential patterns, as shown in Table 4.8.
Table 4.8: A comparison table showing the number of rare sequential patterns
and the computational time taken by the four experiments on the conveyor-belt
database.
| Experiment | # Rare patterns | Execution time |
| --- | --- | --- |
| Experiment-1, No Constraint, Conveyor belt SDB | 906 925 | 4 days 7 hours 26 minutes |
| Experiment-2, Feature Constraint, Conveyor belt SDB | 16 | < 1 minute |
| Experiment-3, Algorithmic Constraint, Conveyor belt SDB | 906 925 | 3 days 2 hours 54 minutes |
| Experiment-4, Combined Constraint, Conveyor belt SDB | 16 | < 1 minute |
Experiment-2 and Experiment-4 generated the same 16 rare sequential patterns
because both experiments applied the feature reduction constraints. In
comparison to Experiment-2, Experiment-4 also uses the algorithmic constraints,
which reduced the computational time by a matter of seconds. Since the feature
reduction constraint had already reduced the unique events, the size of the
sequences and the size of the database, the reduction in computational time
between Experiment-2 and Experiment-4 is minimal.
The four experiments show that the implementation of constraints improved the
effectiveness and efficiency of the proposed rare sequential pattern mining
algorithm. The constraint-based rare sequential pattern mining is effective
because the constrained algorithm generated fewer rare sequential patterns, and
it is more convenient to detect the anomalous patterns among a small number of
rare sequential patterns than among a large number. In addition, the
constraint-based rare sequential pattern mining algorithm is efficient, since
the generation of rare sequential patterns consumes less computational time
than the large computational time taken by the rare sequential pattern mining
algorithm without the additional constraints.
4.6.1.1 Performance Evaluation
We also evaluated the performance of our proposed constraint-based rare se-
quential pattern mining algorithm by using the precision and the recall of the
confusion matrix as shown in Table 4.5. There were 5 attacks conducted on the
conveyor belt control system during the logs generation phase. These attacks
were of 3 types:
(a) Unscheduled stoppage of the conveyor belt.
(b) Unscheduled start of the conveyor belt.
(c) Unwanted changes to the direction of the diverter gate of the conveyor belt.
In the logs, the attacked events were identified and labelled so that the
anomalies can be detected by verifying the rare sequential patterns against the
labelled dataset. The proposed constraint-based rare sequential pattern mining
algorithm successfully detected 4 anomalous patterns by verifying the rare
sequential patterns against the attacked dataset, which contained 5 actual
attacks on the conveyor belt control system. For example, the rare sequential
pattern 〈{Conv_Run_Status_-1}, {Conv_Run_Status_0}〉, shown in row SID8 of Table
4.7, was detected as an attack pattern. This is because, during the attack, the
conveyor belt control system was deliberately stopped while it was in the
running state. The first event {Conv_Run_Status_-1} of this attack pattern
indicates that the conveyor belt was in the running state, while the second
event {Conv_Run_Status_0} indicates that the conveyor belt was stopped. Another
example of an anomalous and attack pattern detected in the conveyor belt
database is 〈{Conv_Run_Status_-1}, {Conv_Read_Conv_HMI_Direction_-1},
{HMI_Conv_Direction_-1}〉, shown in row SID12 of Table 4.7, which indicates that
the direction of the conveyor belt's diverter gate was changed while the attack
was conducted on the control system. It means that the values of the second and
third events of this attack pattern, {Conv_Read_Conv_HMI_Direction_-1} and
{HMI_Conv_Direction_-1} respectively, were changed from 0 to −1.
Experiment-2 and Experiment-4 show that the constraint-based rare sequential
pattern mining algorithm correctly detected 4 attacks from the 16 generated
rare sequential patterns. The remaining 12 rare sequential patterns were not
identified as anomalous patterns. The True Positive and the False Positive
counts of the constraint-based rare sequential pattern mining algorithm are
therefore 4 and 12, respectively. Using equation (4.1), the precision of the
algorithm for Experiment-2 and Experiment-4 is 25%, which means that the
anomaly detection rate of the algorithm is 25%. Since the algorithm detected 4
attack patterns out of 5 actual attacks, the True Positive count is 4 and the
False Negative count is 1. Using equation (4.2), the recall of the constrained
rare sequential pattern mining algorithm is therefore 80%, which means that the
sensitivity, or true positive rate, of the algorithm is 80%.
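Substituting the conveyor belt counts (True Positive = 4, False Positive = 12,
False Negative = 1) into equations (4.1) and (4.2) gives the following worked
calculation:

```latex
\[
\text{Precision} = \frac{4}{4 + 12} \times 100\% = 25\%,
\qquad
\text{Recall} = \frac{4}{4 + 1} \times 100\% = 80\%
\]
```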
4.6.2 Pressure Control System
The rare sequential pattern mining algorithm in Experiment-1, which did not
include any additional constraints, generated 1 107 540 rare sequential
patterns. Among these rare sequential patterns, 5 were detected as anomalous
patterns by verifying them against the labelled attack dataset. A partial view
of the rare patterns is shown in Table 4.9. The remaining rare sequential
patterns are suspicious patterns, which are less important for detecting
anomalies. The computational time taken by Experiment-1 to generate these rare
sequential patterns is 5 days, 11 hours and 40 minutes.
Table 4.9: A partial view of the pressure control result from Experiment-1.
SID1 〈{Solenoid_On_SP_Int_40}〉
SID2 〈{Solenoid_Off_SP_Int_25}〉
SID3 〈{Solenoid_On_SP_Int_45}〉
SID4 〈{Solenoid_On_SP_Int_45}, {Solenoid_On_SP_Int_40}〉
SID5 〈{Pipe_Read_Solenoid_Mode_-1}, {Solenoid_On_SP_Int_40}, {Pressure_Int_45}, {Pipe_Read_Solenoid_Open_Cmd_-1}, {Pipe_Read_Pipeline_Pressure_44}, {Pressure_Int_43}, {Pipe_Read_Solenoid_Open_Cmd_0}, {Pipe_Read_Pipeline_Pressure_41}, {Pipe_Read_Solenoid_Open_Cmd_-1}〉
Experiment-1 detected 5 anomalous patterns, although it generated a large
number of rare sequential patterns, which required a large computational time.
It is difficult and time consuming to identify the anomalous patterns among
such a large number of rare sequential patterns. Therefore, to detect anomalies
in an effective and efficient manner, that is, from a small number of rare
sequential patterns and with less computational time, we added the feature
reduction constraints in the second experiment (Experiment-2). Experiment-2
generated only 9 rare sequential patterns, a significant reduction compared to
the 1 107 540 rare sequential patterns generated by Experiment-1. Even though
Experiment-2 generated only 9 rare sequential patterns, the number of detected
anomalous patterns, 5, is the same as in Experiment-1. This means that
Experiment-2 did not miss any anomalous patterns which were detected by
Experiment-1.
The feature reduction constrained Experiment-2 is more effective than
Experiment-1, which has no feature reduction constraint, because Experiment-2
generated fewer rare sequential patterns. It is easier to find anomalies in a
small number of rare sequential patterns than in a large number. In addition,
the computational time taken by Experiment-2 is less than a minute, a
significant reduction compared to the 5 days, 11 hours and 40 minutes taken by
Experiment-1. Hence, Experiment-2 is more efficient for anomaly detection than
Experiment-1. A partial view of the rare sequential patterns generated by
Experiment-2 is shown in Table 4.10.
Table 4.10: A partial view of the pressure control result from Experiment-2.
SID1 〈{Solenoid_On_SP_Int_40}〉
SID2 〈{Solenoid_Off_SP_Int_25}〉
SID3 〈{Solenoid_On_SP_Int_45}〉
SID4 〈{Solenoid_Off_SP_Int_30}〉
SID5 〈{Solenoid_Off_SP_Int_30}, {Solenoid_On_SP_Int_40}〉
SID6 〈{Solenoid_On_SP_Int_45}, {Solenoid_On_SP_Int_40}〉
SID7 〈{Pipe_Read_Solenoid_Mode_-1}, {Solenoid_On_SP_Int_40}〉
SID8 〈{Pipe_Read_Solenoid_Mode_-1}, {Solenoid_On_SP_Int_45}〉
SID9 〈{Pipe_Read_Solenoid_Mode_-1}, {Solenoid_On_SP_Int_45}, {Solenoid_On_SP_Int_40}〉
In the third experiment (Experiment-3), the algorithmic constrained rare
sequential pattern mining algorithm generated 1 107 540 rare sequential
patterns, the same number as generated by Experiment-1. Although Experiment-1
and Experiment-3 generated exactly the same number of rare sequential patterns,
Experiment-3 took less computational time, 4 days, 2 hours and 14 minutes,
compared to 5 days, 11 hours and 40 minutes for Experiment-1. The algorithmic
constraints helped to reduce the computational time for Experiment-3.
Experiment-3 also detected exactly the same number of anomalous patterns, 5, as
Experiment-1 and Experiment-2. This means that Experiment-3 did not miss any
attack patterns which were detected by the previous two experiments. The
computational time for Experiment-2, which is less than a minute, is
significantly lower than that of Experiment-1 and Experiment-3. This is because
Experiment-2 applied the feature reduction constraint, which reduced the unique
events, the size of the sequences in the database and the size of the database.
When they are not reduced, these three factors increase not only the number of
rare sequential patterns but also the computational time, as shown in Table
4.11. Hence, Experiment-3 is more efficient than Experiment-1, but less
efficient than Experiment-2.
Finally, the fourth experiment (Experiment-4) produced 9 rare sequential
patterns, equal to the number generated by Experiment-2. This is because both
Experiment-2 and Experiment-4 used the feature reduction constraint, which
reduces the number of rare sequential patterns. Although Experiment-4 used the
additional algorithmic constraint, it did not further reduce the number of rare
sequential patterns, because the algorithmic constraint reduces the
computational time rather than the number of patterns. The computational time
of both experiments (Experiment-2 and Experiment-4) is less than a minute,
although Experiment-4 took slightly less time, in the order of seconds, than
Experiment-2. The reduction contributed by the algorithmic constraint between
Experiment-2 and Experiment-4 is minimal because the feature reduction
constraint had already made the database small. When the database size is not
reduced, the algorithmic constraint reduces the computational time
significantly, as Table 4.11 shows for Experiment-1 and Experiment-3.

Table 4.11: A comparison of the number of rare patterns and the time taken by
all 4 experiments on the pressure control SDB.
Experiment     Constraint              Database              # Rare patterns  Execution time
Experiment-1   No Constraint           Pressure control SDB  1 107 540        5 days, 11 hours, 40 minutes
Experiment-2   Feature Constraint      Pressure control SDB  9                < 1 minute
Experiment-3   Algorithmic Constraint  Pressure control SDB  1 107 540        4 days, 2 hours, 14 minutes
Experiment-4   Combined Constraint     Pressure control SDB  9                < 1 minute
Experiment-4 also detected 5 anomalous patterns, equal to the number detected
by the previous three experiments (Experiment-1, Experiment-2 and
Experiment-3). The implementation of constraints has therefore improved the
effectiveness and efficiency of the proposed rare sequential pattern mining
algorithm. A comparison of the number of rare sequential patterns generated
and the execution time taken by the four experiments is given in Table 4.11.
4.6.2.1 Performance Evaluation
We also evaluated the performance of the proposed constraint-based rare sequen-
tial pattern mining algorithm. The performance is measured with the precision
and the recall using the confusion matrix shown in Table 4.5. In the pressure con-
trol system, there were 6 attacks conducted during the generation of the control
logs. These attacks were of 4 types:
(a) Unexpected changes to the upper threshold value of the pressure control
system.
(b) Unexpected changes to the lower threshold value of the pressure control
system.
(c) Unscheduled stopping of the pressure control system.
(d) Unscheduled starting of the pressure control system.
The constrained rare sequential pattern mining algorithm successfully detected 5
anomalous patterns. For example, the pattern 〈{Solenoid_Off_SP_Int_25}〉, shown
in the row SID2 in Table 4.10, indicates that the lower threshold value of the
pressure control system is currently 25 PSI. As a precondition for the pressure
control system experiment, the lower threshold value was set to 20 PSI. Since
the lower threshold value was deliberately changed from the predefined value of
20 PSI to 25 PSI during the attack procedure conducted on the pressure control
system, the pattern 〈{Solenoid_Off_SP_Int_25}〉 indicates an attack pattern.
Another example of a rare sequential pattern that was detected as an attack
pattern is 〈{Pipe_Read_Solenoid_Mode_-1}, {Solenoid_On_SP_Int_45},
{Solenoid_On_SP_Int_40}〉, which is shown in the row SID9 in Table 4.10. This is
because the upper threshold value of the pressure control system was changed
from the predefined value of 40 PSI to 45 PSI. This unexpected change was made
during the attack procedure conducted on the pressure control system.
Since the constraint-based rare sequential pattern mining algorithm in
Experiment-2 and Experiment-4 detected 5 anomalous patterns among the 9
generated rare sequential patterns, the True Positive and False Positive counts
of the algorithm are 5 and 4 respectively. Using equation (4.1), the precision
of the constraint-based rare sequential pattern mining algorithm is 5/9, or
approximately 55%. On the other hand, since the algorithm detected 5 attack
patterns out of 6 actual attacks, the True Positive count is 5 and the False
Negative count is 1. So, using equation (4.2), the recall of the algorithm is
5/6, or approximately 83%.
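The arithmetic behind these two figures can be checked with a short sketch. It assumes that equations (4.1) and (4.2) are the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN); the function and variable names are illustrative only.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported patterns that are real attacks: TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of real attacks that were reported: TP / (TP + FN)."""
    return tp / (tp + fn)


# Counts quoted in the text for the pressure control system.
pressure_precision = precision(tp=5, fp=4)  # 5 of the 9 rare patterns are attacks
pressure_recall = recall(tp=5, fn=1)        # 5 of the 6 attacks were found

print(f"precision = {pressure_precision:.3f}")
print(f"recall    = {pressure_recall:.3f}")
```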
4.6.3 Water Tank Control System
The first experiment (Experiment-1) generated 1 204 514 rare sequential
patterns from the water tank control system. To generate these rare sequential
patterns, the rare sequential pattern mining algorithm took 6 days, 2 hours
and 11 minutes. A partial view of the rare sequential patterns generated by
Experiment-1 is given in Table 4.12. Although some attacks were conducted on
the water tank control system, no anomalous patterns were detected, unlike in
the conveyor belt and the pressure control system experiments. The attacks were
of two types:
(a) Unexpected change of the water tank mode of operation from the manual
mode to the automatic mode.
(b) Unexpected change of the water tank mode of operation from the automatic
mode to the manual mode.
Table 4.12: A partial view of the water tank result from Experiment-1.
SID1  〈{Tank_Off_SP_Int_80}〉
SID2  〈{Tank_On_SP_Int_50}〉
SID3  〈{Tank_Off_SP_Int_80}, {Tank_On_SP_Int_50}〉
SID4  〈{Solenoid_Off_SP_Int_30}〉
SID5  〈{Tank_Read_Pump_In_Auto_-1}, {Tank_Read_Pump_Running_-1}, {Tank_Level_45},
      {Tank_Read_Tank_Level_45}, {Tank_Usage_Level_55}, {Tank_Level_43},
      {Tank_Read_Tank_Level_43}, {Tank_Usage_Level_57}〉
SID6  〈{Tank_Read_Pump_In_Auto_-1}, {Tank_Read_Pump_In_Auto_-1},
      {Tank_Read_Pump_Running_0}, {Tank_Usage_Level_76}, {Tank_Level_25},
      {Tank_Read_Tank_Level_25}, {Tank_Usage_Level_72}, {Tank_Level_29}〉
These two types of attacks were carried out as flooding attacks, which means
that the mode of operation of the water tank control system was changed many
times. Because of this repetition, the mode-change events became frequent
events. Since the proposed rare sequential pattern mining algorithm generates
rare sequential patterns comprising rare events, the flooding attacks could not
be detected on the water tank control system.
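The effect of flooding on rarity can be illustrated with a small sketch. It assumes that an event counts as rare when it occurs at least once and its support (the fraction of sequences containing it) is below a maximum support threshold; the threshold value 0.3, the toy databases and the event name `Tank_Mode_Manual_To_Auto` are hypothetical and not taken from the experiments.

```python
def support(database, event):
    """Fraction of sequences in the database that contain the event."""
    hits = sum(1 for sequence in database if event in sequence)
    return hits / len(database)


def is_rare(database, event, maxsup):
    """Assumed rule: an event is rare if it occurs and its support is below maxsup."""
    return 0 < support(database, event) < maxsup


NORMAL = ["Tank_Level_45", "Tank_Usage_Level_55"]
ATTACK = "Tank_Mode_Manual_To_Auto"  # hypothetical event name

# One mode change in ten sequences: the attack event is rare.
single_attack_db = [NORMAL + [ATTACK]] + [list(NORMAL) for _ in range(9)]
# The same change repeated in six of ten sequences: the event becomes frequent.
flooded_db = [NORMAL + [ATTACK] for _ in range(6)] + [list(NORMAL) for _ in range(4)]

print(is_rare(single_attack_db, ATTACK, maxsup=0.3))  # support is 0.1
print(is_rare(flooded_db, ATTACK, maxsup=0.3))        # support is 0.6
```

Under these assumptions the flooded event no longer satisfies the rarity test, so no rare pattern containing it would be generated.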
In the second experiment (Experiment-2), the feature reduction constrained
rare sequential pattern mining algorithm generated 65 rare sequential patterns,
a significant reduction compared to the number generated by Experiment-1. A
partial view of the results is shown in Table 4.13. The feature reduction in
Experiment-2 also reduced the computational time to less than a minute,
compared to the 6 days, 2 hours and 11 minutes taken by Experiment-1. Although
Experiment-2 reduced both the number of rare sequential patterns and the
computational time, no anomalous patterns were detected. This is because the
flooding attacks made the attack events on the water tank control system
frequent. Therefore, the proposed constrained rare sequential pattern mining
algorithm could not detect the flooding attacks on the control system.

Table 4.13: A partial view of the water tank result from Experiment-2.
SID1  〈{Tank_Off_SP_Int_80}〉
SID2  〈{Tank_On_SP_Int_50}〉
SID3  〈{Tank_Off_SP_Int_80}, {Tank_On_SP_Int_50}〉
SID4  〈{Solenoid_Off_SP_Int_30}〉
SID5  〈{Tank_Read_Pump_In_Auto_-1}, {Tank_Read_Pump_Running_-1}〉
SID6  〈{Tank_Read_Pump_In_Auto_-1}, {Tank_Read_Pump_In_Auto_-1}〉
SID7  〈{Tank_Read_Pump_Running_-1}, {Tank_Read_Pump_In_Auto_-1}〉
SID8  〈{Tank_Read_Pump_Running_0}, {Tank_Read_Pump_In_Auto_-1}〉
In the third experiment (Experiment-3), the algorithmic constrained rare
sequential pattern mining algorithm generated 1 204 514 rare sequential
patterns, equal to the number generated by Experiment-1, as shown in
Table 4.14. This is because neither Experiment-3 nor Experiment-1 applied the
feature reduction constraint. Although the two experiments generated an equal
number of rare sequential patterns, their computational time differs. The
computational time taken by Experiment-3 is 4 days, 17 hours and 40 minutes,
which is less than the 6 days, 2 hours and 11 minutes taken by Experiment-1.
Like the previous two experiments, Experiment-1 and Experiment-2, the rare
sequential patterns generated by Experiment-3 could not reveal any anomalous
patterns because of the flooding attacks on the control system.

Table 4.14: A comparison of the number of rare patterns and the computational
time taken by the four experiments on the water tank control system database.
Experiment     Constraint              Database        # Rare patterns  Execution time
Experiment-1   No Constraint           Water tank SDB  1 204 514        6 days, 2 hours, 11 minutes
Experiment-2   Feature Constraint      Water tank SDB  65               < 1 minute
Experiment-3   Algorithmic Constraint  Water tank SDB  1 204 514        4 days, 17 hours, 40 minutes
Experiment-4   Combined Constraint     Water tank SDB  65               < 1 minute
Finally, the fourth experiment (Experiment-4) on the water tank control
system, in which we implemented all of the constraints together, generated the
same number of rare sequential patterns as Experiment-2, as shown in
Table 4.14. The reason is that both Experiment-4 and Experiment-2 used the
feature reduction constraint, which reduces the number of rare sequential
patterns. The computational time for Experiment-4 is less than that of
Experiment-3 and Experiment-1. Although the computational time for both
Experiment-4 and Experiment-2 is less than a minute, Experiment-4 took less
time, in the order of a few seconds, than Experiment-2 because it used the
algorithmic constraint. The difference in computational time between
Experiment-4 and Experiment-2 is minimal because the feature reduction
constraint had already reduced the database. When the database size is not
reduced, the algorithmic constraint reduces the computational time
significantly, which is evident between Experiment-1 and Experiment-3 in
Table 4.14.
Like the previous three experiments, Experiment-4 could not detect any
anomalous patterns because of the flooding attacks on the water tank control
system. Since none of the four experiments conducted on the water tank database
detected any anomalous patterns, we could not calculate the precision and the
recall of the constrained rare sequential pattern mining algorithm for this
system. A comparison of the number of rare sequential patterns generated by
these four experiments and their computational time on the water tank control
system database is shown in Table 4.14.
4.7 Discussion
This section analyses and discusses the methods and results of the
experiments. In the experiment section we have shown that different constraints
need to be used with the rare sequential pattern mining algorithm to find
anomalies in an effective and efficient manner. The proposed constraint-based
rare sequential pattern mining algorithm (Algorithm 4.1 and Algorithm 4.2)
works in two phases. In the first phase, Algorithm 4.1 generates constrained
rare sequential generator patterns. In this phase it was not possible to
implement the pattern existence constraint, which is one of the two algorithmic
constraints. The reason is that the first phase needs to check whether a
candidate sequence pattern is a rare sequential pattern or a frequent
sequential pattern.
If we had implemented the pattern existence constraint in this phase, we would
not have been able to find the frequent sequential patterns. The reason is that
once the support value of a candidate sequence pattern meets the maximum
support threshold value, the pattern existence constraint stops further
scanning of the database. As a result, we could only find the rare sequential
patterns. However, a candidate sequence pattern could also turn out to be a
frequent sequential pattern, and leaving the pattern existence constraint out
of Algorithm 4.1 ensures that such patterns are found. In addition, if the
pattern existence constraint were used, the generation of rare sequential
generator patterns would have been limited. This is because the rare sequential
generator patterns are generated from the frequent sequential patterns. If the
frequent patterns are not generated because of the pattern existence
constraint, the generation of rare sequential patterns stops. Therefore, it was
necessary to check whether a candidate sequence pattern is a frequent pattern
by continuing to scan the database even when the support value of the candidate
sequence satisfies the maximum support threshold value.
Although the pattern existence constraint could not be used with Algorithm 4.1,
the other algorithmic constraint, the pattern size constraint, was used with
it. The pattern size constraint reduces the computational time of generating
the rare sequential generator patterns by avoiding unnecessary database
scanning: database sequences whose size is smaller than the size of the
candidate sequence are not scanned. The reduction depends on the number of
events in the candidate sequence and in the database sequences that are skipped
during the database scan. It also depends on the total number of database
sequences that still have to be scanned for the candidate sequence. In other
words, the reduction is smallest when most database sequences must be scanned
and largest when most of them can be skipped.
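The pattern size constraint can be sketched as follows. This is a minimal illustration and not Algorithm 4.1 itself: it assumes that a sequence is a list of itemsets (Python sets), that the size of a sequence is its total number of events, and that containment is ordinary subsequence containment; the helper names are hypothetical.

```python
def size(sequence):
    """Number of events in a sequence of itemsets."""
    return sum(len(itemset) for itemset in sequence)


def contains(sequence, candidate):
    """True if candidate is a subsequence of sequence (itemsets matched in order)."""
    pos = 0
    for itemset in sequence:
        if pos < len(candidate) and candidate[pos] <= itemset:
            pos += 1
    return pos == len(candidate)


def count_support(database, candidate):
    """Return (support count, number of sequences actually scanned)."""
    needed = size(candidate)
    count = scanned = 0
    for sequence in database:
        if size(sequence) < needed:
            continue  # pattern size constraint: too short to contain the candidate
        scanned += 1
        if contains(sequence, candidate):
            count += 1
    return count, scanned


db = [
    [{"a"}, {"b"}, {"c"}],
    [{"a"}],
    [{"b"}, {"a"}, {"c"}],
    [{"c"}],
]
print(count_support(db, [{"a"}, {"c"}]))
```

In this toy database the two single-event sequences are skipped for the size-2 candidate, so only two of the four sequences are scanned.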
In the second phase, Algorithm 4.2 implemented both of the algorithmic
constraints, pattern existence and pattern size, which reduced the
computational time of generating rare sequential patterns. The pattern
existence constraint prevented unnecessary scanning of the database once the
support value of a candidate sequence reached the maximum support threshold
value. In the second phase, the pattern existence constraint could be
implemented because a candidate sequence becomes a rare sequential pattern as
soon as it is found in the database. Unlike the first phase, the second phase
cannot produce a frequent sequential pattern, because each candidate sequence
pattern is generated by extending a rare sequential pattern.
Any pattern that is extended from a rare sequential pattern is itself a rare
sequential pattern. Therefore, once the support of a candidate sequence pattern
extended from a rare sequential pattern satisfies the maximum support threshold
value, scanning the rest of the sequences in the database only wastes
computational time. So, the pattern existence constraint reduces the
computational time of the rare sequential pattern mining algorithm. The second
algorithmic constraint, the pattern size constraint, further reduces the
computational time, as explained previously for the first phase of the
algorithm (Algorithm 4.1). Since Algorithm 4.2 uses two algorithmic constraints
compared to one in Algorithm 4.1, Algorithm 4.2 saves more computational time
than Algorithm 4.1.
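The early stop of the second phase can be sketched as below. This is an illustrative simplification rather than Algorithm 4.2: it assumes that one occurrence of a candidate extended from a rare pattern is enough to accept it, so the scan returns at the first match, and it applies the pattern size check as well. All names are hypothetical.

```python
def contains(sequence, candidate):
    """True if candidate is a subsequence of sequence (itemsets matched in order)."""
    pos = 0
    for itemset in sequence:
        if pos < len(candidate) and candidate[pos] <= itemset:
            pos += 1
    return pos == len(candidate)


def exists_in_database(database, candidate):
    """Return (found, sequences scanned), stopping at the first match."""
    needed = sum(len(itemset) for itemset in candidate)
    scanned = 0
    for sequence in database:
        if sum(len(itemset) for itemset in sequence) < needed:
            continue  # pattern size constraint: sequence is too short
        scanned += 1
        if contains(sequence, candidate):
            return True, scanned  # pattern existence constraint: stop scanning
    return False, scanned


db = [
    [{"a"}, {"b"}],
    [{"c"}],
    [{"a"}, {"b"}, {"c"}],
    [{"a"}, {"b"}, {"d"}],
]
print(exists_in_database(db, [{"a"}, {"b"}]))  # found in the first sequence
print(exists_in_database(db, [{"b"}, {"c"}]))  # found in the third sequence
print(exists_in_database(db, [{"d"}, {"a"}]))  # never found
```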
The lesson learnt is that three factors, the number of unique events in a
database, the size of the sequences in a database and the size of the database,
determine the performance of the proposed constraint-based rare sequential
pattern mining algorithm. If the number of unique events in a database
increases, the computational time for generating rare sequential patterns also
increases. This is because when a candidate sequence pattern is extended from a
rare sequential pattern, each unique event is added at each different position
of the rare sequential pattern. A large number of unique events therefore
creates a large number of candidate sequential patterns, which costs the
algorithm a large amount of computational time.
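The growth in candidates can be illustrated with a simplified extension step. The sketch assumes that each unique event is inserted as a new single-event itemset at every position of the pattern and that duplicates are not removed, so a pattern with n itemsets and a database with |E| unique events yields (n + 1) x |E| candidates; the candidate generation in the thesis may differ in detail.

```python
def extend(pattern, unique_events):
    """Insert every unique event, as a new itemset, at every position of the pattern."""
    candidates = []
    for position in range(len(pattern) + 1):
        for event in sorted(unique_events):
            candidates.append(pattern[:position] + [{event}] + pattern[position:])
    return candidates


rare_pattern = [{"a"}, {"b"}]
few = extend(rare_pattern, {"a", "b", "c"})
many = extend(rare_pattern, {f"e{i}" for i in range(100)})

# Three insertion positions times the number of unique events.
print(len(few), len(many))
```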
Moreover, long sequences in the database increase the number of candidate
sequential patterns. This is because a candidate sequence pattern can be
extended up to the size of the longest sequence in the database. The candidate
sequence is extended from a size-1 sequence to a size-2 sequence and so on,
stopping at the size-n sequence, where n is the maximum size of the sequences
in the database. At each extension step, the unique events multiply the number
of candidate sequential patterns. Since every candidate sequential pattern is
searched for in the database to find rare sequential patterns, a large number
of candidate sequential patterns costs a large amount of computational time.
Finally, a large database also costs a large amount of computational time,
because the entire database is scanned for each candidate sequential pattern to
find rare sequential patterns. It has also been learnt that the number of
unique events depends on the range of values held by the control system
features. If the control system features hold floating point values, the number
of unique events increases significantly. This is because each control system
feature is merged with its corresponding value to form an event. So, if a
feature holds a range of floating point values, it creates a large number of
unique events.
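A short sketch shows how merging a feature with floating point values inflates the number of unique events. The readings below are made up, and rounding to integers is used only to contrast the two cases; it is not presented as the feature reduction constraint of this chapter.

```python
def make_event(feature, value):
    """Merge a feature name with its value, e.g. ("Tank_Level", 45) -> "Tank_Level_45"."""
    return f"{feature}_{value}"


# Hypothetical tank level readings logged as floating point values.
readings = [45.02, 45.10, 45.31, 44.87, 45.49, 44.52]

float_events = {make_event("Tank_Level", r) for r in readings}
integer_events = {make_event("Tank_Level", round(r)) for r in readings}

print(len(float_events))    # one unique event per distinct reading
print(len(integer_events))  # the readings collapse once rounded
```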
In the three control system experiments, it has been found that implementing
the constraints with the rare sequential pattern mining algorithm improved the
effectiveness and efficiency of detecting anomalies. The constrained rare
sequential pattern mining algorithm is effective because it generates a small
number of rare sequential patterns, so finding the anomalous patterns requires
less effort than finding them among a large number of rare sequential patterns.
The constrained algorithm is efficient because generating rare sequential
patterns takes less computational time than with the unconstrained rare
sequential pattern mining algorithm. Although the constrained algorithm reduced
the number of rare sequential patterns, it did not miss any anomalous pattern
that was detected by the unconstrained algorithm.
The proposed constraint-based rare sequential pattern mining algorithm
successfully detected anomalies on the control system. It is an improvement on
the rare sequential pattern mining algorithm discussed in Chapter 3. The
constraint-based algorithm detected the same anomalous patterns as the rare
sequential pattern mining algorithm of Chapter 3, which means that it did not
compromise on detecting anomalies, and it did so in an effective and efficient
manner. The detection of anomalies is effective because anomalies are
identified from a small number of rare sequential patterns rather than from the
large number generated by the unconstrained algorithm. Therefore, with the
constraint-based rare sequential pattern mining algorithm, it is easier for
security operators to identify the anomalous patterns. In addition, the
constrained algorithm is efficient because it detects anomalies quickly, in
less than a minute, compared to the several days taken by the unconstrained
rare sequential pattern mining algorithm. Security operators therefore do not
need to wait a long time; they can detect anomalies in less than a minute.
Although the constraint-based rare sequential pattern mining algorithm detected
anomalies effectively and efficiently, it could not detect the flooding attacks
on the SCADA control system. This is a consequence of the nature of the
proposed algorithm, which finds rare sequential patterns. Since a flooding
attack repeats the same events many times, the attack changes the frequency of
those events from rare to frequent. Therefore, flooding attacks cannot be
detected by our proposed rare sequential pattern mining method.
4.8 Conclusion
Anomaly detection using the constraint-based rare sequential pattern mining
algorithm successfully detected attack patterns from the SCADA control system
logs. Compared to the unconstrained rare sequential pattern mining algorithm,
the constraint-based algorithm has been found to be effective and efficient in
detecting anomalies on the SCADA control system. It is effective because it
discards the less important rare sequential patterns, which reduces the number
of rare sequential patterns, so that less effort is required to find anomalies
among them. It is efficient because it takes less computational time to
identify the anomalies. The proposed constraint-based rare sequential pattern
mining algorithm is therefore promising, since in some cases it takes months,
and sometimes even a year, to detect a cyber incident in a control system after
the incident occurs [130].
We have analysed and demonstrated that implementing constraints in rare
sequential pattern mining can make it effective and efficient, by focusing only
on the patterns of interest and by reducing the computational time for
detecting anomalies. We validated our constraint-based rare sequential pattern
mining results with the SCADA labelled attack dataset. In these experiments we
have shown that our proposed constraint-based rare sequential pattern mining
algorithm can be used to detect anomalies effectively and efficiently on a
SCADA control system.
The anomaly detection in this chapter (Chapter 4) and in the previous chapter
(Chapter 3) is based on experiments with off-line SCADA control system logs.
Off-line logs means that the SCADA process control events are stored in log
files. These logs are then collected and pre-processed to prepare the
sequential databases. Our proposed rare sequential pattern mining algorithm
generates rare sequential patterns, from which anomalous patterns are detected.
The proposed method can only detect anomalies that have already occurred on the
SCADA control system; it cannot predict anomalies before they occur. In the
next chapter, Chapter 5, we show how possible anomalies on the SCADA control
system can be predicted by using the on-line or streaming SCADA control system
logs.
Chapter 5
A Rare Sequential Association
Rules Mining of SCADA Streaming
Logs for Anomaly Prediction
5.1 Introduction
Sequential association rules mining refers to discovering rules in a
sequential database. Every individual rule consists of several sequential
events. These events are divided into two parts, the antecedent and the
consequent. In many application domains sequential association rule mining has
been applied to analyse data and predict future events. For example, in stock
market analysis, e-learning, and drought management [131], sequential rules
have a high prediction accuracy compared to sequential patterns. This is
because sequential rules predict the possible occurrence of future events based
on current events [132]. Several algorithms have been developed in the
literature to find sequential rules. Mannila et al. [48] find sequential rules
by analysing the alarm flow in telecommunication network logs. If two sets of
events (also called episodes) occur frequently in a sequence, a rule can be
generated such as X ⇒ Y, where X and Y represent two sets of events in a
sequence.
The rule indicates that if an event or set of events X occurs, it is likely
that another event or set of events Y will also occur some time afterwards. The
probability of their occurrence can be indicated with a confidence value that
can be computed as
$\frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}$.
The rule is only valid if its confidence value satisfies the user-provided
minimum confidence value minconf. Harms et al. [133] find rules in which a
frequent antecedent X is followed by a frequent consequent Y in several
sequences of a sequential database. Lo et al. [134] find rules that are common
to several sequences. These rules strictly maintain the order of events inside
the antecedent X and the consequent Y, and also the order between X and Y.
However, Fournier-Viger et al. [135] proposed sequential rules wherein the
order of the events inside the antecedent X and the consequent Y is not
considered; the order is only maintained between the antecedent X and the
consequent Y. In other words, the antecedent part will be followed by the
consequent part.
The existing algorithms for sequential association rules mining in the
literature discover rules using frequent patterns. However, no existing work
generates rules from rare sequential patterns, although these patterns can be
used for detecting anomalies and attacks, as shown in Chapter 3 and Chapter 4
of this thesis. Note that when sequential association rules are generated from
rare sequential patterns, the antecedent X and the consequent Y do not always
need to be rare. Rather, the X and the Y of a rule can be frequent, or a
combination of rare and frequent patterns, unlike frequent sequential rules,
where both X and Y must be frequent. In sequential association rule mining, the
antecedent can be considered a precursor that signals the likelihood of an
incoming or ongoing anomalous or attack pattern in a streaming log. The
precursor consists of the unusual initiating events connected to the consequent
of the association rule.
The goal of sequential association rule mining in this research is to discover
rules from rare sequential patterns so that these rules can be used to predict
and detect anomalies in the incoming streaming logs. For example, if the rare
sequential pattern 〈{a}, {b}, {c}, {d}〉 is detected or identified as an
anomalous or attack pattern, the following sequential association rules can be
generated from it:
〈{a}〉 ⇒ 〈{b}, {c}, {d}〉;
〈{a}, {b}〉 ⇒ 〈{c}, {d}〉;
〈{a}, {b}, {c}〉 ⇒ 〈{d}〉.
Note that every rare sequential pattern comprising N events (N ≥ 2) generates a
maximum of (N - 1) rules. Not all of these rules can be used to predict
anomalies in the incoming streaming logs. Only the valid rules, whose
confidence value is higher than the user-provided minimum confidence value
minconf, can be used for prediction.
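The rule generation described above can be sketched as follows. This is a simplified illustration, not the algorithm of this chapter: each itemset is reduced to a single event, support is counted as the number of database sequences containing a pattern as an ordered subsequence, and the toy database, the function names and the minconf value of 0.4 are assumptions.

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence with its events in the same order."""
    pos = 0
    for event in sequence:
        if pos < len(pattern) and event == pattern[pos]:
            pos += 1
    return pos == len(pattern)


def support(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for sequence in database if contains(sequence, pattern))


def generate_rules(pattern, database, minconf):
    """Split the pattern at each of its N - 1 cut points and keep the valid rules."""
    rules = []
    whole = support(database, pattern)
    for cut in range(1, len(pattern)):
        antecedent, consequent = pattern[:cut], pattern[cut:]
        antecedent_support = support(database, antecedent)
        if antecedent_support == 0:
            continue
        confidence = whole / antecedent_support
        if confidence >= minconf:  # assumed inclusive comparison
            rules.append((antecedent, consequent, confidence))
    return rules


db = [
    ["a", "b", "c", "d"],
    ["a", "x", "y"],
    ["a", "b", "z"],
    ["x", "y", "z"],
]
for antecedent, consequent, confidence in generate_rules(["a", "b", "c", "d"], db, 0.4):
    print(antecedent, "=>", consequent, round(confidence, 2))
```

With this toy database the rule with antecedent 〈{a}〉 is dropped because its confidence is 1/3, while the other two rules are kept. A pattern with a single event has no cut point, so the function returns no rules for it.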
5.1.1 Motivation
We have shown in Chapter 3 and Chapter 4 of this thesis that rare sequential
patterns can be used to detect anomalies and attacks in SCADA control system
logs. A sequential rule is important because it gives a better understanding
of data: it provides a good correlation between two segments of data [136].
There is an increasing demand from industry not only to detect anomalies in a
system, but also to predict them. Sequential association rules can be effective
in predicting a possible incoming anomalous pattern. If the antecedent is
contained in the incoming logs, the consequent is likely to occur soon. Since
the rules are generated from rare sequential patterns, the probability of an
anomaly is high once the antecedent is found.
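As a rough illustration of how an antecedent could be watched for in a stream, the sketch below feeds log events one at a time and reports the consequent once the whole antecedent has been seen in order. The class name, the single-event simplification and the absence of any time window are assumptions made for the example.

```python
class AntecedentWatcher:
    """Tracks how much of a rule's antecedent has been seen in a stream."""

    def __init__(self, antecedent, consequent):
        self.antecedent = list(antecedent)  # must contain at least one event
        self.consequent = list(consequent)
        self.position = 0  # index of the next antecedent event to wait for

    def feed(self, event):
        """Consume one log event; return the predicted consequent or None."""
        if event == self.antecedent[self.position]:
            self.position += 1
            if self.position == len(self.antecedent):
                self.position = 0  # start watching for the next occurrence
                return self.consequent
        return None


watcher = AntecedentWatcher(antecedent=["a", "b"], consequent=["c", "d"])
stream = ["x", "a", "y", "b", "z"]
alerts = [(index, watcher.feed(event)) for index, event in enumerate(stream)]
predicted = [(index, alert) for index, alert in alerts if alert is not None]
print(predicted)
```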
However, we also argue that it is not always possible to predict an incoming
anomalous pattern from incoming streaming logs. The reason is that, in some
application domains, it is possible to attack the system by changing only a
single event's set value. For example, during the operation of a pressure
control system, the predefined lower and upper pressure threshold values remain
unchanged for the entire experiment. But it is possible to disrupt the expected
output of the pressure control system by changing the predefined threshold
values. If the predefined upper threshold value of 40 PSI is changed to 60 PSI,
the object subjected to the pressure control system could produce faulty
output. In this case, the pressure control feature merged with the changed
threshold value makes a single event, or size-1 pattern. The size-1 pattern can
be found as a rare sequential pattern on the pressure control system.
Sequential association rules cannot be generated from this single-event rare
sequential pattern, because a size-1 rare pattern is composed of one segment,
whereas a sequential association rule is composed of two segments, the
antecedent and the consequent. When the single-event rare sequential pattern
occurs on the system, it indicates an anomaly. Therefore, a single-event rare
sequential pattern cannot be used to predict a possible incoming anomaly;
rather, it is used to detect anomalies that have occurred on a system.
There are two contributions in this chapter. Firstly, this is the first work
that uses sequential association rules to predict anomalies in SCADA control
system streamed logs. Secondly, even in the cases where the proposed method
cannot predict an incoming attack, it can detect the attack when it happens in
the system. Hence, this method can also ensure there are no false negatives in
the system, because the discovered rare sequential patterns can detect all of
the possible rare anomalies in a system. To validate our proposed method, we
used SCADA control system logs for possible anomaly predictions.
5.2 Previous Work
Sequential association rules can be discovered from a single sequence or from
multiple sequences [137] [67]. Mannila et al. [48] find sequential rules by
analysing the alarm flow in telecommunication networks. They discover rules
from a single sequence. If two sets of events (or episodes) occur frequently in
a sequence, an association rule can be generated as X ⇒ Y, where X and Y
represent two sets of events in a sequence. The rule indicates that if an event
or set of events X occurs, it is likely that another event or set of events Y
will occur within a defined sliding window. The probability of their occurrence
can be indicated with a confidence value: the higher the confidence value of an
association rule, the more likely it is that the consequent will occur after
the antecedent has occurred. Harms et al. [133] find rules in which a frequent
antecedent X is followed by a frequent consequent Y in several sequences of a
sequential database within a user-defined time window. Lo et al. [134] find
rules that are common to several sequences. In all these works, the order of
the events in both the antecedent and the consequent is preserved. However,
Fournier-Viger et al. [135] generate partially ordered sequential association
rules from multiple sequences. In this method, the order of the events inside
the antecedent and the consequent is not considered. The order is only
maintained between the antecedent and the consequent of a rule.
Sequential association rules can be used in different applications, such as
weather forecasting, disease identification and financial investment risk. In the
medical sector, they can be used to predict a disease from a sequence of symptoms.
In the capital market, based on a sequence of stock market events, the
probable investment risk can be predicted using sequential association rules [121]
[138]. The existing algorithms in the literature discover sequential association
rules from frequent sequential patterns. In these rules both the antecedent and the
consequent segments are frequent. However, there exists no algorithm that discovers
association rules from rare sequential patterns, although rare sequential patterns
can be used to detect anomalies in a system. The anomaly detection procedures
for rare sequential patterns have been discussed in Chapter 3 and Chapter 4
of this thesis. In these two chapters it is shown that the rare sequential
pattern mining algorithm can generate rare sequential patterns from SCADA
static or stored control logs. These rare patterns can effectively detect anomalies
that have occurred on the SCADA system. However, these rare patterns cannot
be used to predict possible incoming anomalies on the live SCADA system.
This is because no correlation is established among the events in the rare sequential
pattern, so anomaly prediction cannot be done on the basis of a correlated
sequence of events.
To address this limitation of the rare sequential pattern mining algorithm, in
this chapter we propose a two-phase anomaly prediction algorithm. The goal
of this algorithm is to discover sequential rules that appear in a rare sequence.
In the first phase, Algorithm 5.1 generates sequential association rules from the
rare sequential patterns. Algorithm 5.1 only generates the valid association
rules, that is, those that satisfy the user provided minimum confidence value. In the second
phase, Algorithm 5.2 uses the valid sequential association rules generated by
Algorithm 5.1 to predict possible incoming anomalous patterns from SCADA
streaming control logs. A prediction of an anomalous pattern is raised by triggering
a rule once the antecedent of an association rule is found in the streaming
logs.
5.3 Preliminaries
This section presents the background knowledge of sequential association rules.
This research generates sequential association rules from rare patterns that
correlate the events in a sequence. A sequential association rule X ⇒ Y
represents an association or correlation between the two sequences X and Y.
This rule is validated by the support and the confidence parameters. The support
indicates the frequency of the sequence 〈X, Y〉 in a sequential database, while
the confidence defines the degree of the correlation between X and Y, in other
words, the likelihood that Y will occur after X has occurred. The support and
confidence are defined as:
• Support, $S(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{N}$,
where N is the total number of transactions
and σ is the number of transactions which contain both X and Y. The
support should be less than or equal to user provided maximum support
maxsup threshold value, that is, support ≤ maxsup. For example, the
sequence 〈{a}, {c}, {e}〉 in Table 5.1 appears twice in SID3 and SID5 in
the sequential database SDB. The total number of sequences in the SDB
is 6. Therefore, the relative support of this sequence is 33%. On the other
hand, in absolute count the support of the sequence is 2.
Table 5.1: A sequential database SDB.
Sequence ID    Sequences
SID1           〈{a}, {b, d}, {e}, {c}〉
SID2           〈{a}, {c}, {b, c}, {a}〉
SID3           〈{a, b}, {b, c}, {c, d}, {e}〉
SID4           〈{b}, {c, e}, {e}〉
SID5           〈{a}, {b, d}, {e}, {f}, {d}〉
SID6           〈{g}〉
• Confidence, $C(X \Rightarrow Y) = \frac{S(X \cup Y)}{S(X)}$,
where S(X ∪ Y) is the support of the
pattern X followed by the consequent Y of a sequential association rule, and
S(X) is the support of the antecedent X. The confidence should be greater
than or equal to the minconf threshold value, that is, confidence ≥ minconf.
For example, the sequential association rules R1 = 〈{a}〉 ⇒ 〈{c}, {e}〉
and R2 = 〈{a}, {c}〉 ⇒ 〈{e}〉 are generated from the rare sequential pattern
〈{a}, {c}, {e}〉. The confidence of the rule R1 is 0.5 because the support of
the rare pattern S(X ∪ Y), that is, 〈{a}, {c}, {e}〉, is 2 and the support of the
antecedent S(X), that is, 〈{a}〉, is 4. On the other hand, the confidence
of the rule R2 is 0.67 (2/3) because the support of the rare pattern S(X ∪ Y), that
is, 〈{a}, {c}, {e}〉, is 2 and the support of the antecedent S(X), that is,
〈{a}, {c}〉, is 3. Note that, even though the support of the rare pattern is
the same in the rules R1 and R2, their confidence values are not the same.
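The two measures above can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis implementation: the representation of a sequence as a list of Python sets and the helper names `contains`, `support` and `confidence` are assumptions made here. The data is Table 5.1 as printed, and the checks reproduce the two antecedent supports quoted in the text (4 for 〈{a}〉 and 3 for 〈{a}, {c}〉).

```python
# Table 5.1, with each sequence written as a list of itemsets (Python sets).
SDB = {
    "SID1": [{"a"}, {"b", "d"}, {"e"}, {"c"}],
    "SID2": [{"a"}, {"c"}, {"b", "c"}, {"a"}],
    "SID3": [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"e"}],
    "SID4": [{"b"}, {"c", "e"}, {"e"}],
    "SID5": [{"a"}, {"b", "d"}, {"e"}, {"f"}, {"d"}],
    "SID6": [{"g"}],
}


def contains(pattern, sequence):
    """Return True if every itemset of `pattern` is a subset of a distinct
    itemset of `sequence`, in the same order (greedy left-to-right match)."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def support(pattern, sdb):
    """Absolute support: number of sequences in `sdb` that contain `pattern`."""
    return sum(contains(pattern, seq) for seq in sdb.values())


def confidence(ante, cons, sdb):
    """Support of the antecedent followed by the consequent, divided by the
    support of the antecedent. Assumes the antecedent occurs at least once."""
    return support(ante + cons, sdb) / support(ante, sdb)


print(support([{"a"}], SDB))                    # absolute support of <{a}>
print(support([{"a"}, {"c"}], SDB) / len(SDB))  # relative support of <{a},{c}>
```

Dividing the absolute support by the number of sequences gives the relative support that is compared against maxsup.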
Sequential association rules are generated from rare sequential patterns. The
association rules that are generated from the rare sequential patterns can be
defined as follows:
R = {X ⇒ Y | X, Y are frequent or rare sequential patterns, X ∪ Y is a rare
sequential pattern, Sup(X ⇒ Y) ≤ maxsup, Conf(X ⇒ Y) ≥ minconf }
These association rules can be generated from rare sequential patterns that have
at least two events. However, if the rare sequential pattern is composed of a single
event, no sequential association rule can be generated. In that case, the single
event rare sequential pattern can be used to detect anomalies. For example, the
sequence 〈{a}, {c}, {e}〉 in the example dataset in Table 5.1 is a rare sequential
pattern which is composed of three events. This rare sequential pattern can
produce at most two sequential association rules. The first sequential association
rule is 〈{a}〉 ⇒ 〈{c}, {e}〉 and the second sequential association rule is
〈{a}, {c}〉 ⇒ 〈{e}〉. Generally, in generating association rules, there could be
many possible combinations of the events both in the antecedent and the consequent
segments. The consequent 〈{c}, {e}〉 of the first sequential association rule
〈{a}〉 ⇒ 〈{c}, {e}〉 could be arranged either as 〈{c}, {e}〉 or as 〈{e}, {c}〉. Also,
the antecedent 〈{a}, {c}〉 of the second sequential association rule 〈{a}, {c}〉
⇒ 〈{e}〉 could be arranged in the two combinations 〈{a}, {c}〉
and 〈{c}, {a}〉. Therefore, if we considered all of the possible combinations of
events in both the antecedent and the consequent of an association rule, then
the rare sequential pattern 〈{a}, {c}, {e}〉 could generate four sequential association
rules: 〈{a}〉 ⇒ 〈{c}, {e}〉; 〈{a}〉 ⇒ 〈{e}, {c}〉; 〈{a}, {c}〉 ⇒ 〈{e}〉;
〈{c}, {a}〉 ⇒ 〈{e}〉.
However, as sequential association rules are generated from rare sequential
patterns, combinations of events other than the order in which they appear in
the rare sequential patterns are not possible. This is because the events in the
rare sequential patterns occur sequentially. Therefore, the rare sequential pattern
〈{a}, {c}, {e}〉 generates at most two sequential association rules instead of
four sequential association rules. On the other hand, the sequence 〈{g}〉 in SID6
of the example dataset in Table 5.1 is also a rare sequential pattern when the
maximum support threshold value is set to 2. Since this rare sequential pattern is
composed of a single event, that is, it is a size-1 rare sequential pattern, a sequential
association rule cannot be generated from this pattern. Instead, this size-1
rare sequential pattern 〈{g}〉 is used to detect an anomaly
rather than to predict one.
The sequential association rules can be used to predict
and detect possible anomalous patterns in incoming streaming logs.
To predict possible incoming anomalies, the streaming logs are segmented into
a sequence of time windows or sessions denoted as 〈W1, W2, ..., Wn−1, Wn〉, as shown
in Figure 5.1. The session time for each window Wi is determined by the
Figure 5.1: Anomaly prediction from streaming logs.
user. In other words, the streaming events are segmented into windows according
to the session time set by the user.
The figure shows that the incoming streaming log data is segmented into
a sequence of time windows, starting with the window W1 and continuing until
the streaming logs end. Each window contains a sequence Si comprising
some number of ordered events e1, e2, ..., eni. In other words, a sequence Si =
〈e1, e2, ..., eni〉 is created from each window Wi. If the antecedent of a sequential
association rule is contained in the sequence Si, it can be predicted that the
consequent of the rule may occur after some time.
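The segmentation into session windows can be illustrated as follows. This is a minimal batch sketch under assumptions that are not taken from the thesis: each log event is a hypothetical `(timestamp_in_seconds, itemset)` pair, the events are already sorted by time, and the function name `segment` and the 10-second session length are chosen only for the example. A live system would emit each window as soon as its session time elapses instead of working on a stored list.

```python
def segment(events, session_seconds):
    """Split time-ordered (timestamp, itemset) events into consecutive
    fixed-length windows; returns one sequence S_i (a list of itemsets)
    per window W_i. Windows without events are kept as empty lists so that
    window indices still correspond to elapsed session time."""
    if not events:
        return []
    start = events[0][0]
    n_windows = int((events[-1][0] - start) // session_seconds) + 1
    windows = [[] for _ in range(n_windows)]
    for timestamp, itemset in events:
        index = int((timestamp - start) // session_seconds)
        windows[index].append(itemset)
    return windows


# Hypothetical stream: seven events spread over 20 seconds.
stream = [
    (0, {"a"}), (3, {"b", "c"}), (7, {"d"}), (9, {"f"}),
    (12, {"c"}), (15, {"b", "d"}), (18, {"a"}),
]
for number, sequence in enumerate(segment(stream, session_seconds=10), start=1):
    print(f"W{number}:", sequence)
```

The two windows have the same duration but hold a different number of events, which is the situation the windows W1, ..., Wn are meant to capture.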
5.4 A New Anomaly Prediction Method Using
Sequential Association Rules
Although the rare sequential pattern mining method can detect anomalies, it cannot
predict incoming anomalies from streaming logs. If a prediction regarding
incoming anomalies can be made ahead of their occurrence on a system, the system
operator can be alerted so that necessary precautions can be taken. Therefore,
there is a need to develop an anomaly prediction system. To achieve this we
propose a new method to predict potential anomalies which is composed of two
phases. In the first phase, our proposed algorithm generates sequential association
rules automatically from the rare sequential patterns. Then in the second
phase of the method, these sequential association rules are used to predict and
detect possible anomalies in the streaming logs.
In order to generate sequential association rules, we need rare sequential patterns.
Only the rare sequential patterns that have at least two events, that is,
patterns of size-2 or more, are used to generate sequential association
rules. This is because the size-1 rare sequential pattern is composed of a single
event, which cannot be separated into the antecedent and consequent of an association
rule. Moreover, while generating association rules we only generated the longest
antecedent association rules instead of all variable length association rules.
Generally, n-1 association rules can be generated from a rare sequential pattern of
size-n. For example, the rare sequential pattern 〈{a}, {b}, {c}, {d}〉 of size-4 can
generate the following 3 sequential association rules:
(i) 〈{a}〉 ⇒ 〈{b}, {c}, {d}〉
(ii) 〈{a}, {b}〉 ⇒ 〈{c}, {d}〉
(iii) 〈{a}, {b}, {c}〉 ⇒ 〈{d}〉
Among these three association rules, the 3rd association rule 〈{a}, {b}, {c}〉 ⇒
〈{d}〉 is the longest antecedent association rule. The antecedents of the 1st and
the 2nd association rules are subsequences of the antecedent of the 3rd
association rule. These three rules are redundant as they cover the same
sequence of events. If these three rules are used to predict the possible
anomalous pattern 〈{a}, {b}, {c}, {d}〉, three predictions can occur. In other
words, for one possible anomalous pattern, three rules are triggered. However,
if the longest antecedent rule is used to predict the possible anomalous pattern
〈{a}, {b}, {c}, {d}〉, a single prediction occurs instead of three predictions.
Therefore, to reduce the number of anomaly predictions and to avoid triggering redundant
rules, in this experiment we only generated the longest antecedent rules rather
than variable length rules.
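The difference between the n-1 order-preserving rules and the single longest antecedent rule can be sketched as follows. The function names `split_rules` and `longest_antecedent_rule` and the list-of-sets representation are illustrative assumptions, not names used in the thesis.

```python
def split_rules(pattern):
    """All order-preserving (antecedent, consequent) splits of a pattern:
    a size-n pattern yields n-1 rules, a size-1 pattern yields none."""
    return [(pattern[:i], pattern[i:]) for i in range(1, len(pattern))]


def longest_antecedent_rule(pattern):
    """The single rule whose antecedent holds the left-most n-1 events,
    or None for a size-1 pattern, which cannot be split."""
    if len(pattern) < 2:
        return None
    return pattern[:-1], pattern[-1:]


pattern = [{"a"}, {"b"}, {"c"}, {"d"}]
for ante, cons in split_rules(pattern):
    print(ante, "=>", cons)                  # rules (i), (ii) and (iii)
print(longest_antecedent_rule(pattern))      # rule (iii) only
```

Keeping only the last split is what limits each anomalous pattern to one prediction.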
The rest of this section is organised as follows: Section 5.4.1 describes generating
sequential association rules for the experiment, Section 5.4.2 presents the
prediction of anomalies using the sequential association rules, and finally Section
5.4.2.1 describes the prediction methods.
5.4.1 Generating Sequential Association Rules
To generate association rules, first we generate rare sequential patterns from
a sequential database SDB by applying our proposed rare sequential pattern
mining algorithms (Algorithm 3.1 and Algorithm 3.2), which are presented in
Chapter 3. The algorithm for generating sequential association rules, Algorithm
5.1, generates all possible rare sequential association rules from rare sequential
patterns. The inputs to this algorithm are a sequential database SDB, a set
of rare sequential patterns RSP and a user provided minimum confidence value
minconf. The variable SAR in step 1 of Algorithm 5.1 holds a set of
sequential association rules, which is the output generated from a sequential
database SDB. In steps 2-13, the algorithm generates sequential association rules
until no rare sequential patterns remain in RSP. For example, the sequence
〈{a}, {b, d}, {e}, {f}, {d}〉 is a rare sequential pattern P in SID5 of the example
dataset given in Table 5.1. To generate sequential association rules from the rare
sequential pattern P, it must first be checked whether P is a
single event or size-1 rare sequential pattern. This check is needed
because if the rare sequential pattern is composed of a single event, that is, it is a
size-1 rare sequential pattern, a sequential association rule cannot be generated.
This check is done at step 3 of Algorithm 5.1.
In the above example, the rare sequential pattern P is not a single event
pattern; rather the rare sequential pattern is composed of five events. Therefore,
Algorithm 5.1 generates sequential association rules from this rare sequential
pattern P. Steps 4-12 of Algorithm 5.1 generate sequential association
rules from the rare sequential pattern 〈{a}, {b, d}, {e}, {f}, {d}〉. The largest
antecedent, which is a sub-pattern of size-(n-1) of the rare sequential pattern
P, is assigned to the variable ante, that is, ante = 〈{a}, {b, d}, {e}, {f}〉, as shown
at step 4 of Algorithm 5.1. The consequent cons is formed by removing
the sub-pattern 〈{a}, {b, d}, {e}, {f}〉 from the rare sequential pattern P. So, the
consequent 〈{d}〉 is assigned to the variable cons, as shown at step 5 of Algorithm
5.1. These two variables ante and cons together form a sequential association
rule, ante ⇒ cons. This gives the association rule 〈{a}, {b, d}, {e}, {f}〉 ⇒ 〈{d}〉,
which is the only rule generated from the rare sequential pattern P.
Algorithm 5.1: Generating Sequential Association Rules.
Input: a sequential database SDB, a set of rare sequential patterns RSP
       and their supports, and a minimum confidence minconf
Output: a set of generated sequential association rules
 1 SAR ← {} // set of sequential association rules from the SDB
 2 for P ∈ RSP do
 3     if |P| > 1 then
 4         ante ← sp(i, P) // sp(i, P) is a pattern containing the left-most i
                           // events, where i ← |P| − 1
 5         cons ← P \ sp(i, P) // remove sp(i, P), cons is a consequent of a
                               // rule
 6         count the support of Sup(ante) by scanning SDB
 7         conf ← Sup(P) / Sup(ante) // support of P is divided by the
                                     // support of its antecedent
 8         if conf ≥ minconf then
 9             SAR ← SAR ∪ {ante ⇒ cons, conf}
10         else
11             continue
12     else
13         continue
14 return SAR
Once an association rule ante ⇒ cons is generated, Algorithm 5.1 finds the
support value of the antecedent ante of the association rule by scanning the
database, which is shown at step 6 of Algorithm 5.1. After obtaining the
support value of the antecedent of an association rule, the confidence of the
association rule is calculated. The confidence is calculated by dividing the support
value of the rare sequential pattern P by the support value of the antecedent of
the pattern P, which is done at step 7 of Algorithm 5.1. The generated
association rules are valid rules only when they have a confidence
value greater than or equal to the user defined minimum confidence value minconf,
as shown at step 8 of Algorithm 5.1. A valid association rule signifies that if
the antecedent of a rule occurs in the streaming logs then it is highly likely that
the consequent of the same rule will occur next in the streaming logs. The valid
association rules are assigned to the sequential association rules variable SAR
as shown at step 9 of Algorithm 5.1. Algorithm 5.1 stops generating
association rules when no rare sequential patterns remain in RSP. The valid
association rules SAR generated in the first phase by Algorithm 5.1 are used
to predict and detect anomalies using Algorithm 5.2 in the second phase,
which is discussed next in Section 5.4.2.
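The steps of Algorithm 5.1 can be written as a short Python sketch. It follows the pseudocode step by step, but the function names (`generate_rules`, `contains`, `support`), the list-of-sets representation and the way the rare patterns are passed in as `(pattern, support)` pairs are assumptions made for this illustration. The database is Table 5.1 and the rare pattern is the SID5 pattern P used in the text.

```python
SDB = {
    "SID1": [{"a"}, {"b", "d"}, {"e"}, {"c"}],
    "SID2": [{"a"}, {"c"}, {"b", "c"}, {"a"}],
    "SID3": [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"e"}],
    "SID4": [{"b"}, {"c", "e"}, {"e"}],
    "SID5": [{"a"}, {"b", "d"}, {"e"}, {"f"}, {"d"}],
    "SID6": [{"g"}],
}


def contains(pattern, sequence):
    """Ordered containment of `pattern` in `sequence` (itemset subset match)."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def support(pattern, sdb):
    """Step 6: one scan of the database to count the pattern's support."""
    return sum(contains(pattern, seq) for seq in sdb.values())


def generate_rules(sdb, rare_patterns, minconf):
    """Longest antecedent rule generation, mirroring steps 1-14.
    `rare_patterns` is a list of (pattern, support_of_pattern) pairs."""
    rules = []                                    # step 1: SAR
    for pattern, sup_pattern in rare_patterns:    # step 2
        if len(pattern) <= 1:                     # step 3: size-1, no rule
            continue
        ante = pattern[:-1]                       # step 4: left-most n-1 events
        cons = pattern[-1:]                       # step 5: remaining event
        sup_ante = support(ante, sdb)             # step 6 (>= sup_pattern >= 1)
        conf = sup_pattern / sup_ante             # step 7
        if conf >= minconf:                       # step 8
            rules.append((ante, cons, conf))      # step 9
    return rules                                  # step 14


P = [{"a"}, {"b", "d"}, {"e"}, {"f"}, {"d"}]
rare = [(P, support(P, SDB)), ([{"g"}], 1)]
for ante, cons, conf in generate_rules(SDB, rare, minconf=0.5):
    print(ante, "=>", cons, "confidence", conf)
```

The size-1 pattern 〈{g}〉 is skipped, so it remains available only for detection.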
5.4.2 Prediction of Anomalies using Sequential
Association Rules
In the second phase of our proposed method, Algorithm 5.2 (anomaly prediction and
detection using sequential association rules) predicts and detects anomalies in
the streaming logs. In the prediction and detection process, the logs are checked
against the valid sequential association rules that are generated by Algorithm
5.1. In the beginning, incoming streaming logs are segmented into a sequence
of fixed time window sessions 〈W1, W2, ..., Wn〉 as shown in Figure 5.1. Each
session window Wi is composed of a sequence Si containing some sequence of
events 〈e1, e2, ..., eni〉. The number of events can vary between the sequences
of the session windows. In other words, the number of events in each sequence depends on
the number of events that occur during the fixed time session period.
For example, assume that a sequence S1 comprises four events {a}, {b, c}, {d},
{f} that occur in the session window W1 from the streaming log, while the next
sequence S2 comprises three events {c}, {b, d}, {a} that occur in the window
W2 from the streaming log.
Even though the fixed time duration of the two windows W1 and W2 is the
same, the number of events in the two sequences S1 and S2 can be different. In
the prediction and detection process, Algorithm 5.2 takes a set of sequential
association rules rSAR as input. The sequential association rules are stored in
the variable TR as shown in step 1 of Algorithm 5.2. Algorithm 5.2 also
takes as input a sequence of windows 〈W1, W2, ..., Wn〉 that are segmented from
the streaming logs. The algorithm starts with the window Wi where the value of i
starts from 1 and continues as long as the streamed logs exist. Next, the sequence
Si comprising the sequence of events from window Wi is generated in step 6 of
Algorithm 5.2. Then the sequence Si is checked against each association rule rSAR in
TR. For each sequential association rule r = X ⇒ Y, it is checked whether the antecedent
X of the rule is contained in the sequence
Si. In other words, if antecedent X ⊑ Si, it is predicted that the consequent Y of
the association rule X ⇒ Y will occur next in the streaming logs, which is shown
in steps 8-10 of Algorithm 5.2. Therefore, the rule X ⇒ Y is triggered and
a prediction alert TRP is raised for a likely incoming event or sequence of events
Algorithm 5.2: Anomaly Prediction using Sequential Association Rules.
Input: a set of sequential association rules rSAR, a sequence of fixed
       time session windows 〈W1, W2, ..., Wn〉 from a streaming log
Output: display attack patterns
 1 TR ← rSAR // a set of sequential association rules rSAR
 2 TRP ← {} // a set of triggered rules of possible predicted anomalies
 3 TRD ← {} // a set of triggered rules of possible predicted anomalies
            // followed by detection
 4 k ← 2 // k is the maximum number of windows for the consequent of a rule
         // to occur
 5 for i ← 1 to n do
 6     Si ← 〈e1, ..., eni〉 is the sequence of the ith window Wi
 7     for r ∈ TR, r = X ⇒ Y do
 8         if antecedent X ⊑ Si then
 9             display consequent Y of r indicating Y may happen soon
10             TRP ← TRP ∪ {(Wi, (X ⇒ Y, conf))}
11             for j ← 1 to k do
12                 if Y ⊑ Si+j then
13                     TRD ← TRD ∪ {(Wi+j, (X ⇒ Y, conf))}
14                 else
15                     remove (Wi, (X ⇒ Y, conf)) from TRP
16         else
17             continue
18 return TRD
of the consequent Y of the rule. The consequent may occur in the same window
Wi where the antecedent of a rule has occurred, or the consequent may occur
in upcoming windows Wi+j. If the consequent Y occurs within a defined time
period, then the possible anomaly prediction alert is correct. So, the
triggered rule is recorded in TRD as detected, which is shown in
steps 11-13 of Algorithm 5.2. However, if the consequent Y does not occur
within the defined time period after the antecedent X has occurred, the possible
anomaly prediction alert is removed from TRP, which is shown in
step 15 of Algorithm 5.2. In that case, the possible anomaly prediction
by the triggered rule has been found false.
For example, let an association rule be 〈{a}, {b, d}〉 ⇒ 〈{d}〉,
which is checked against a sequence 〈{a}, {c}, {b, d}, {b}〉 denoted as Si. The
sequence is generated from the window Wi in a streaming log. Since the antecedent
〈{a}, {b, d}〉 of the rule is contained in the window Wi, it is predicted that the
consequent 〈{d}〉 of the rule will occur soon in the streaming log. The consequent
may happen in the same window, that is, Wi after the antecedent has occurred,
or the consequent may be in the sequence Si+1 of the window Wi+1 or in the
sequence Si+2 of the window Wi+2. These windows Wi+1 and Wi+2 are within
the defined time period for the rule's consequent to remain valid once the
antecedent of the rule has occurred. If the sequence Si+1 is 〈{e}, {a}, {f}, {c}〉 and
the sequence Si+2 is 〈{b}, {c}, {e}, {a}, {f}, {d}〉, then the consequent is found in the
sequence Si+2 after the antecedent has occurred in the sequence Si, in window
Wi. So, the possible anomaly prediction alert is found correct. Therefore, the
triggered rule, its confidence and the window number where the consequent is found are
stored in the variable TRD as shown in step 13 of Algorithm 5.2. However,
if the consequent of a rule is not found in Wi+1 or Wi+2 although its antecedent
has been found in Wi, the prediction alert is discarded. This is because the consequent
of the rule did not occur in the streaming logs within the defined time period.
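The prediction and detection loop can be sketched in Python as follows. This is an interpretation of Algorithm 5.2 under stated assumptions rather than the thesis code: the windows are supplied as a stored list instead of a live stream, a prediction is kept only if the consequent is found in the same window after the antecedent or in one of the next k windows (as the prose describes), and the names `predict_and_detect` and `contains` as well as the confidence value 0.8 are hypothetical.

```python
def contains(pattern, sequence):
    """Ordered containment of `pattern` in `sequence` (itemset subset match)."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def predict_and_detect(rules, windows, k=2):
    """Return (predicted, detected). `predicted` holds alerts that were raised
    and later confirmed; `detected` records the window where the consequent
    was found. Rules are (antecedent, consequent, confidence) triples."""
    predicted, detected = [], []
    for i, seq in enumerate(windows):
        for ante, cons, conf in rules:
            if not contains(ante, seq):
                continue
            predicted.append((i, ante, cons, conf))       # alert raised
            # Same window: the consequent must follow the antecedent.
            hit = i if contains(ante + cons, seq) else None
            if hit is None:
                for j in range(1, k + 1):                 # next k windows
                    if i + j < len(windows) and contains(cons, windows[i + j]):
                        hit = i + j
                        break
            if hit is None:
                predicted.pop()                           # false prediction
            else:
                detected.append((hit, ante, cons, conf))
    return predicted, detected


rule = ([{"a"}, {"b", "d"}], [{"d"}], 0.8)    # confidence is hypothetical
windows = [
    [{"a"}, {"c"}, {"b", "d"}, {"b"}],                 # S_i
    [{"e"}, {"a"}, {"f"}, {"c"}],                      # S_i+1
    [{"b"}, {"c"}, {"e"}, {"a"}, {"f"}, {"d"}],        # S_i+2
]
predicted, detected = predict_and_detect([rule], windows, k=2)
print(detected)   # window index 2 corresponds to W_i+2
```

With the three windows from the example above, the alert raised in the first window is confirmed in the third, which matches the outcome described in the text.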
5.4.2.1 Prediction Methods
The prediction of a consequent of a rule occurring after the antecedent is found in
the streaming log could be done using two different methods. The first method,
which is called the variable length antecedent rule, is to raise an alert after finding
any antecedent X from a set of sequential association rules that is contained in
the sequence Si of a window Wi from the streaming log. The second method,
which is called the longest antecedent rule, is to find the largest antecedent of the
association rules that is contained in the sequence Si of a window Wi and then
raise a possible anomaly alert. For example, assume an incoming log sequence
Si, that is, 〈{a}, {b, d}, {e}, {b}, {f}, {a}, {d}〉 in a window Wi. The association
rules shown in Table 5.2 are then checked against the sequence Si to predict a
possible incoming anomalous pattern.
5.4.2.2 Variable Length Antecedent Rules
In this method, the antecedent 〈{a}, {b, d}〉 of the rule R2 in Table 5.2 is con-
tained in the incoming log sequence Si. The algorithm raises a possible anomaly
alert on the system. The alert indicates that it is likely that the consequent
〈{e}, {f}, {d}〉 of the rule R2 may occur on the system. The consequent may
occur in the same sequence where the antecedent is found, or the consequent
may occur in the next windows that exist within the defined time period. If the
consequent is found in the same sequence Si following the antecedent, the
prediction is immediately found true. It is considered that a possible anomaly
has happened on the system. If the consequent is not found in the same
sequence Si, it is predicted that the consequent may occur in the next windows
that are within the defined time period. If the consequent is found in the next
windows, the possible anomaly prediction becomes true. It means that the
predicted anomaly has been detected in the streaming logs. Otherwise, the
possible anomaly prediction alert is removed from the prediction system. For the
association rule R2, the prediction is found true in the same sequence Si where
the antecedent is found.
Similarly, when the antecedent 〈{a}, {b, d}, {e}〉 of the rule R3 is contained in
the sequence Si, it is predicted that the consequent sequence of events 〈{f}, {d}〉
is likely to occur on the system after some time. Since the consequent is found
in the same sequence Si, the prediction becomes true. Finally, the antecedent
〈{a}, {b, d}, {e}, {f}〉 of the rule R4 is also found in the sequence Si. The
consequent 〈{d}〉 is also found in the same sequence. Therefore,
this prediction is also found true. Note that the antecedent and the consequent
of all three rules R2, R3 and R4 have been found in the incoming log
sequence Si. However, even if the consequents had not occurred, all three rules
would still have been used to raise predictions. These three rules are generated from a single
rare sequential pattern, and they cover the same sequence of events. Three
predictions are therefore raised for a single anomalous pattern.
This method is not effective as it triggers
many predictions where a single prediction is sufficient. As a result, this method
consumes more resources than necessary.
Table 5.2: A view of possible rare sequential association rules.
R1    〈{a}〉 ⇒ 〈{b, d}, {e}, {f}, {d}〉
R2    〈{a}, {b, d}〉 ⇒ 〈{e}, {f}, {d}〉
R3    〈{a}, {b, d}, {e}〉 ⇒ 〈{f}, {d}〉
R4    〈{a}, {b, d}, {e}, {f}〉 ⇒ 〈{d}〉
5.4.2.3 Longest Antecedent Rules
In this method, instead of raising three prediction alerts from three rules, the
algorithm raises a single alert from the single rule R4. The other rules are not
generated. The reason is that the rule R4 has the largest antecedent among the three
rules R2, R3 and R4. It means that the antecedents of the rules R2 and R3 are
contained in the antecedent of the rule R4. Since the antecedents of the rules R2
and R3 are subsequences of the antecedent of the rule R4, it is not logical to raise
alerts from the rules R2 and R3. For example, the antecedent of the sequential
association rule R2 is 〈{a}, {b, d}〉 and the antecedent of the sequential association
rule R3 is 〈{a}, {b, d}, {e}〉. These two antecedents are subsequences of
the antecedent 〈{a}, {b, d}, {e}, {f}〉 of the sequential association rule R4. So,
the longest antecedent method raises a single prediction alert compared to the 3
prediction alerts that are raised by the variable length method.
The longest antecedent method is more effective than the variable length
method at predicting possible anomalies in the streaming logs. This is because,
when anomaly predictions turn out to be false, the longest antecedent method generates
fewer false positives since it generates fewer prediction
alerts. In contrast, as the variable length antecedent method generates more
prediction alerts than the longest antecedent method, the variable length
antecedent method produces more false positives than the longest antecedent
method. For example, assume that 〈{a}, {b, d}, {e}, {b}, {f}, {a}, {c}〉 is an
incoming streaming log sequence denoted as Si that is generated from the
session time window Wi. In the anomaly prediction process, the consequents
〈{e}, {f}, {d}〉, 〈{f}, {d}〉 and 〈{d}〉 of the rules R2, R3 and R4 respectively
cannot be found in the same window Wi, although the antecedents of these rules
are found. In addition, if the consequents of these three rules cannot be found
in the next windows that occur in the defined time period, the three prediction
alerts from these three rules become false. However, since the longest antecedent
method uses the largest antecedent rule R4 to predict anomalies, the method can
produce a maximum of one false prediction compared to three false predictions
by the variable length antecedent method.
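The alert counts of the two methods can be checked with a small sketch. The rules are R2, R3 and R4 from Table 5.2 and the two log sequences are the ones used in the text; the helper names `alerts` and `confirmed_in_window` and the dictionary layout are assumptions made for the illustration.

```python
def contains(pattern, sequence):
    """Ordered containment of `pattern` in `sequence` (itemset subset match)."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


# Valid rules from Table 5.2 as (antecedent, consequent) pairs.
RULES = {
    "R2": ([{"a"}, {"b", "d"}], [{"e"}, {"f"}, {"d"}]),
    "R3": ([{"a"}, {"b", "d"}, {"e"}], [{"f"}, {"d"}]),
    "R4": ([{"a"}, {"b", "d"}, {"e"}, {"f"}], [{"d"}]),
}


def alerts(rule_names, seq):
    """Rules whose antecedent is contained in the window sequence."""
    return [n for n in rule_names if contains(RULES[n][0], seq)]


def confirmed_in_window(rule_names, seq):
    """Rules whose consequent also follows the antecedent in the same window."""
    return [n for n in rule_names if contains(RULES[n][0] + RULES[n][1], seq)]


S_HIT = [{"a"}, {"b", "d"}, {"e"}, {"b"}, {"f"}, {"a"}, {"d"}]
S_MISS = [{"a"}, {"b", "d"}, {"e"}, {"b"}, {"f"}, {"a"}, {"c"}]
VARIABLE_LENGTH = ["R2", "R3", "R4"]
LONGEST = ["R4"]

print(len(alerts(VARIABLE_LENGTH, S_MISS)), "alerts vs",
      len(alerts(LONGEST, S_MISS)), "alert")
print(confirmed_in_window(VARIABLE_LENGTH, S_MISS))  # none confirmed in W_i
```

If the following windows do not contain the consequents either, every alert counted for the second sequence becomes a false prediction, which is where the three-to-one difference comes from.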
In addition, once an association rule is generated, the algorithm finds the
support value of the antecedent ante of the association rule ante ⇒ cons. The
number of database scans is equal to the number of association rules generated
from each rare sequential pattern P. Since in the variable length antecedent
method, a rare sequential pattern P of size-n generates n-1 sequential association
rules, finding the support values of the antecedents requires n-1 database scans. However,
the longest antecedent method requires a single database scan since the
method generates a single association rule from each sequential pattern P of
size-n.
Moreover, although the variable length antecedent method generates
association rules that have both high and low confidence values, the longest
antecedent method produces only association rules with high confidence values.
Since the longest antecedent method uses the largest antecedent, the support of
the longest antecedent is usually low compared to that of a shorter antecedent
in the database. This low antecedent support contributes to generating high
confidence association rules for the longest antecedent method. The probability that the
consequent of a high confidence association rule occurs in the streaming logs is higher
than for a low confidence association rule. This is because the higher the
confidence value of an association rule, the stronger the correlation
between its antecedent and its consequent. For example, if the user
defined minimum confidence value minconf is set to 0.5, the association rules
R2, R3 and R4 among the four association rules, which are generated by the
variable length antecedent method as shown in Table 5.2 from the rare sequential
pattern 〈{a}, {b, d}, {e}, {f}, {d}〉, become valid rules. The confidence of the
association rules R2 and R3 is 0.5, which means that in 50% of the cases in which
the antecedent of the association rules R2 and R3 occurs, the consequent of the
association rules R2 and R3 also occurs in the streaming logs. On the other
hand, the confidence of the association rule R4 is 1.0, which means that in 100%
of the cases in which the antecedent of the association rule R4 occurs, the consequent
of the association rule R4 also occurs in the streaming logs.
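As a worked check of these values, the confidences can be recomputed from Table 5.1. The supports below were counted by hand from the table as printed and are not quoted from the thesis: the pattern P = 〈{a}, {b, d}, {e}, {f}, {d}〉 occurs only in SID5; the antecedent of R1 occurs in SID1, SID2, SID3 and SID5; the antecedents of R2 and R3 occur in SID1 and SID5; and the antecedent of R4 occurs only in SID5. The block assumes the amsmath package for the `align*` environment.

```latex
\begin{align*}
C(R_1) &= \frac{S(P)}{S(\langle\{a\}\rangle)} = \frac{1}{4} = 0.25 \\
C(R_2) &= \frac{S(P)}{S(\langle\{a\},\{b,d\}\rangle)} = \frac{1}{2} = 0.5 \\
C(R_3) &= \frac{S(P)}{S(\langle\{a\},\{b,d\},\{e\}\rangle)} = \frac{1}{2} = 0.5 \\
C(R_4) &= \frac{S(P)}{S(\langle\{a\},\{b,d\},\{e\},\{f\}\rangle)} = \frac{1}{1} = 1.0
\end{align*}
```

With minconf = 0.5, only R1 falls below the threshold, which agrees with R2, R3 and R4 being the valid rules.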
The longest antecedent method generates only a single association rule R4
compared to the variable length antecedent method's four association rules. This
association rule R4 is triggered fewer times than the other rules R2 and
R3 because rule R4 has a longer antecedent than the other rules.
This is because the probability of finding a longer antecedent in the log stream
is low. So, even if the consequent part of the rule R4 does not occur once the
antecedent has appeared, the rule produces fewer false predictions. Hence, the rule
R4 generates fewer false positives. On the other hand, the shorter antecedent rules
R2 and R3 are triggered many times in the log stream. It is because the smaller number
5.5. Experimental Evaluation 151
of events in the antecedent results in a frequent sequence of events in the log
stream. If the consequent events do not occur after the antecedent has occurred,
the triggered rules become a false prediction. Hence, the more triggers produce
more false positives. Since the longest method provides the system operators
with the less number of prediction alerts and less false positives in comparison
to the variable length antecedent method, this research in this chapter uses the
longest method of anomaly prediction.
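The comparison between the two rule-generation methods can be illustrated with a
minimal Python sketch. It is not the thesis's Algorithm 5.1: the function names,
the toy database and the greedy subsequence test are assumptions made for
illustration only. The sketch assumes that the confidence of a rule is the support
of the whole pattern divided by the support of its antecedent, and that the
pattern occurs at least once in the database.

```python
from typing import FrozenSet, Iterable, List, Tuple

Itemset = FrozenSet[str]
Sequence_ = Tuple[Itemset, ...]
Rule = Tuple[Sequence_, Sequence_, float]  # (antecedent, consequent, confidence)


def seq(*itemsets: Iterable[str]) -> Sequence_:
    """Build a sequence of itemsets, e.g. seq(["a"], ["b", "d"])."""
    return tuple(frozenset(s) for s in itemsets)


def contains(sequence: Sequence_, pattern: Sequence_) -> bool:
    """True if every itemset of pattern is included, in order, in sequence."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def support(db: List[Sequence_], pattern: Sequence_) -> int:
    """Number of database sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in db)


def variable_length_rules(db: List[Sequence_], pattern: Sequence_) -> List[Rule]:
    """One rule per split point: n-1 rules and n-1 antecedent support counts."""
    sup_p = support(db, pattern)
    return [
        (pattern[:k], pattern[k:], sup_p / support(db, pattern[:k]))
        for k in range(1, len(pattern))
    ]


def longest_antecedent_rule(db: List[Sequence_], pattern: Sequence_) -> Rule:
    """Single rule whose antecedent is everything except the last itemset."""
    ante, cons = pattern[:-1], pattern[-1:]
    return ante, cons, support(db, pattern) / support(db, ante)


# Hypothetical toy database, not taken from the SCADA logs.
toy_db = [
    seq(["a"], ["b"], ["c"]),
    seq(["a"], ["b"]),
    seq(["a"], ["d"]),
    seq(["a"], ["c"]),
]
toy_pattern = seq(["a"], ["b"], ["c"])
all_rules = variable_length_rules(toy_db, toy_pattern)
longest = longest_antecedent_rule(toy_db, toy_pattern)
```

On this toy database the single-event antecedent gives a confidence of 0.25 and
the longest antecedent gives 0.5, which mirrors the observation that a longer
antecedent tends to have lower support and therefore a higher confidence.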
5.5 Experimental Evaluation
The experiment conducted in this chapter aims to predict and detect possible
anomalies from SCADA streaming logs. Anomalies can also be detected from offline
SCADA control logs, as discussed in Chapter 3 and Chapter 4 of this thesis. In
Chapter 3, we proposed a new approach for finding rare sequential patterns that
can be used to detect anomalies in SCADA control logs. In this chapter, we use
sequential association rules generated from rare sequential patterns to predict
possible anomalies in the SCADA logs. We used SCADA control system logs from three
physical devices, namely the conveyor belt, the pressure control system and the
water tank, to conduct this experiment. The logs of each control device are used
to create two sets of data, the training dataset and the testing dataset. The
association rules are generated from the training dataset, while the testing
dataset is used to represent the streaming logs. The association rules are then
applied to the streaming logs to predict possible anomalies. The rest of this
section is organised as follows: Section 5.5.1 describes the dataset used for the
experiment, Section 5.5.2 presents the pre-processing required to prepare the
training and testing datasets that are used as inputs to our proposed anomaly
prediction algorithms, and Section 5.5.3 describes the experimental methodology.
5.5.1 Dataset
In this experiment, we used SCADA process control logs that were collected from
our scaled SCADA industry test-bed laboratory system. In this chapter we use the
same dataset that was used in Chapter 4 for the constraint-based rare sequential
pattern mining experiment. Recall that the logs for the experiment in Chapter 4
were generated and collected during 8 hours of operation of the three SCADA
control systems: the conveyor belt, the pressure system, and the water tank
control system. The logs were collected using the on-change method, which records
the process control events whose values have changed during the polling time.
This means that the on-change method did not record every value that changed;
rather, the changed values were recorded in the logs at certain time intervals.
Some attacks were conducted to disrupt the process control activities during the
8 hours of operation of the three control systems. For example, in the pressure
control system, an attack was conducted by changing the pressure threshold values:
the lower pressure threshold was changed from its set value of 20 PSI (pounds per
square inch) to 25 PSI, and the upper threshold was changed from its set value of
40 PSI to 45 PSI. In the conveyor belt control system, an attack was conducted by
changing the direction of the diverter paddle, which caused the paddle to divert
in an unexpected direction. Detailed descriptions of these datasets are given in
Chapter 4 in Table 4.2, Table 4.3 and Table 4.4 respectively.
5.5.2 Pre-processing
In the pre-processing phase, we divided the logs of each control system into two
sets. The first set contains the first 40% of the logs of each control system.
This 40% is pre-processed to prepare a sequential database, which is used as the
training dataset. The remaining 60% of the logs of each control system is used to
generate the corresponding testing dataset. The purpose of the training dataset
is to generate association rules that can be used to predict anomalous events in
the testing dataset. The training dataset is composed of a set of sequences that
are generated from the control logs, whereas the testing dataset is composed of
continuous events from the control logs.
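The 40%/60% division can be sketched as a chronological split. The following
Python fragment is only an illustration: the function name and the
(timestamp, event) record layout are assumptions, and the real pre-processing
additionally converts the training part into a sequential database.

```python
from typing import List, Tuple

LogEvent = Tuple[float, str]  # (timestamp, event name), an assumed layout


def chronological_split(
    events: List[LogEvent], train_fraction: float = 0.4
) -> Tuple[List[LogEvent], List[LogEvent]]:
    """Return (training, testing): the earliest fraction and the remainder."""
    ordered = sorted(events, key=lambda e: e[0])  # keep chronological order
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical log of ten events with timestamps 0..9.
log = [(float(t), f"event_{t}") for t in range(10)]
train_part, test_part = chronological_split(log)
```

Every event in the training part precedes every event in the testing part, so
rules are always learned from the past and applied to later events.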
In preparing the training dataset, we select the required features with the help
of domain expert knowledge. While selecting the features, we choose those
features that can be changed to conduct attacks on the control system. For
example, the conveyor belt feature Conv_Run_Status holds the value that indicates
the current status of the conveyor belt control system, that is, whether the
conveyor belt is running or stopped. If the value of Conv_Run_Status is 0, the
conveyor belt is off. On the other hand, if the value of this feature is −1, the
conveyor belt is on. A feature is merged with its stored value to represent an
event of the control system. Since this feature holds either 0 or −1, two events,
{Conv_Run_Status_0} and {Conv_Run_Status_-1}, can be generated from the feature
Conv_Run_Status. These events form the sequences that make up the training
dataset.
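The merging of a feature with its stored value can be sketched in a few lines of
Python. The record layout (feature, value) and the function names are assumptions
for illustration; the feature names are those mentioned in this section.

```python
from typing import Iterable, List, Set, Tuple

# Features named in this section for the conveyor belt logs.
SELECTED_FEATURES: Set[str] = {
    "Conv_Run_Status",
    "HMI_Conv_Direction",
    "Conv_Read_Conv_HMI_Direction",
}


def to_event(feature: str, value: int) -> str:
    """Merge a feature name with its stored value into one event label."""
    return f"{feature}_{value}"


def records_to_events(
    records: Iterable[Tuple[str, int]], selected: Set[str]
) -> List[str]:
    """Keep only the selected features and encode each record as an event."""
    return [to_event(f, v) for f, v in records if f in selected]


# Hypothetical log records; "Some_Other_Tag" stands for an unselected feature.
sample_records = [
    ("Conv_Run_Status", -1),
    ("Some_Other_Tag", 7),
    ("Conv_Run_Status", 0),
]
sample_events = records_to_events(sample_records, SELECTED_FEATURES)
```

Using the same `SELECTED_FEATURES` set for both the training and the testing logs
keeps the event vocabulary identical in the two datasets.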
On the other hand, in generating the testing dataset, we apply the same feature
selection process that is used for the training dataset, and we select the same
features. The reason is that the association rules generated from the training
dataset are checked against the testing dataset. If the training dataset and the
testing dataset do not have the same features, the association rules from the
training dataset may not be found in the testing dataset. Therefore, it is
necessary to have the same set of features in both the training and the testing
dataset. For example, if the three features Conv_Run_Status, HMI_Conv_Direction,
and Conv_Read_Conv_HMI_Direction are selected from the conveyor belt control logs
for the training dataset, the same set of features must be selected for the
testing dataset.
5.5.3 Experimental Methodology
In the experiments we used two datasets, the training dataset and the testing
dataset. The training dataset is used to generate the rare sequential patterns,
which are later used to generate sequential association rules. The testing
dataset represents the streaming logs, from which possible anomalies are
predicted by using the association rules. It represents streaming events because
it is composed of continuous events from the control logs, for example the
conveyor belt control logs. Once the streaming events are recorded, they are
segmented into fixed time session windows. The events in each window are formed
into a sequence so that the rules generated from the corresponding training
dataset can be checked against the sequence. If the antecedent of an association
rule is found in a sequence generated from a window of streaming events, an
anomaly prediction is made that the consequent of the same rule is likely to
occur next in the streaming events. So, the consequent is first checked in the
same window in which the antecedent is found. If it is not found there, it is
checked in the following windows. If the consequent is found either in the same
window as the antecedent or in a following window that lies within the time-span
period, the anomaly prediction becomes true. However, if the consequent is found
neither in the same window nor in the following windows, the anomaly prediction
is removed from the system.
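The segmentation of streaming events into fixed time session windows, and the
check for an antecedent inside one window, can be sketched as follows. This is a
simplified illustration, not the thesis's implementation: it assumes time-ordered
(timestamp, event) pairs, treats each event as a single item, and uses
hypothetical function names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def session_windows(
    events: Iterable[Tuple[float, str]], width: float = 1.0
) -> List[List[str]]:
    """Group time-ordered (timestamp, event) pairs into fixed-width windows."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, event in events:
        buckets[int(timestamp // width)].append(event)
    # Windows that contain no events are simply skipped in this sketch.
    return [buckets[k] for k in sorted(buckets)]


def is_subsequence(pattern: Sequence[str], window: Sequence[str]) -> bool:
    """True if the pattern's events appear in the window in the same order."""
    remaining = iter(window)
    return all(event in remaining for event in pattern)


# Hypothetical stream: event labels are placeholders.
stream = [(0.1, "A"), (0.5, "B"), (1.2, "C"), (1.9, "A"), (2.3, "D")]
windows = session_windows(stream, width=1.0)
```

Here `windows` holds three sequences, and an antecedent such as `["A", "B"]` is
found in the first window only, because the order of events matters.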
First, we generate rare sequential patterns from the training datasets. To
generate rare sequential patterns, we applied our proposed rare sequential
pattern mining algorithms, Algorithm 3.1 and Algorithm 3.2, which were presented
in Chapter 3. We set the maximum support threshold value maxsup to 2 to find the
rare sequential patterns in each of the three control system training datasets.
Once the rare sequential patterns were generated, we used our proposed algorithm,
Algorithm 5.1, to generate sequential association rules. The association rules
indicate how the antecedent and the consequent of a rule are correlated. The
valid association rules are then selected from all the association rules
according to the user-defined minimum confidence value. In this experiment, we
set the minimum confidence value to 0.9, which signifies that there exists a
strong correlation between the antecedent and the consequent of a valid
association rule. So, if the antecedent of an association rule occurs in the
streaming logs, it is highly likely that the consequent of the same rule will
occur next in the streaming logs. These rules are then used to predict possible
anomalies in the testing dataset, which represents the streaming logs. In the
anomaly prediction process, we used the proposed algorithm, Algorithm 5.2, to
predict possible anomalies in the streaming logs.
We conducted the anomaly prediction experiment of this chapter to see whether the
sequential association rules can be used to predict possible anomalies in SCADA
streaming logs. We also wanted to see whether the predicted anomalous events are
part of actual anomalous patterns. We conducted three experiments, Experiment-1,
Experiment-2 and Experiment-3, with the training and testing datasets from the
three control system logs. Experiment-1 was conducted with the conveyor belt
control logs, Experiment-2 with the pressure system control logs, and
Experiment-3 with the water tank control logs.
In all three experiments, we generated sequential association rules from the
respective training datasets. We generated the longest antecedent rule from each
rare sequential pattern that has at least two events. Once the sequential
association rules are generated, we treat as valid the rules whose confidence
meets the user-defined minimum confidence value. The antecedents of the
sequential association rules are then checked against the streaming logs of the
testing dataset. If the antecedent of a rule is found in the streaming events,
the algorithm raises a prediction that the consequent of the rule will occur next
in the streaming events. Here, a streaming event means an individual event that
occurs in the control system logs. The streaming events in the testing dataset
are segmented into fixed time session windows. We segment the events into
one-second windows, since some events occur every second in the SCADA control
system. Therefore, if an antecedent is found in a window, the consequent may be
found in the same window after the antecedent, or in the next few windows. The
number of windows in which the consequent may appear depends on the time duration
for which the sequential association rule remains valid. So, if the consequent is
found in the same window or in a following window within the rule's validity
period, the prediction becomes true. This means that the entire association rule
(the antecedent followed by the consequent) is found in the streaming logs. The
time duration is the time-span period over which a sequence of events occurs;
this sequence of events makes up one sequence in the training dataset. If the
consequent of an association rule is not found within the time-span duration, the
prediction is removed from the system, because the consequent did not occur
within the rule's validity period.
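The life cycle of a prediction (raised, then confirmed or removed) can be
sketched as below. This is a simplified illustration and not the thesis's
Algorithm 5.2. It assumes single-event items, windows given as lists of event
labels, and a validity period expressed as `span`, the number of following
windows in which the consequent may still appear; predictions still open when the
stream ends are counted as removed. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

Rule = Tuple[Tuple[str, ...], Tuple[str, ...]]  # (antecedent, consequent)
Outcome = Tuple[Rule, int]                      # (rule, window where raised)


def match_end(pattern: Sequence[str], events: List[str], start: int = 0) -> int:
    """Index just past the earliest in-order match in events[start:], or -1."""
    pos = start
    for item in pattern:
        try:
            pos = events.index(item, pos) + 1
        except ValueError:
            return -1
    return pos


def predict(
    windows: List[List[str]], rules: List[Rule], span: int
) -> Tuple[List[Outcome], List[Outcome]]:
    """Return (confirmed, removed) predictions over a list of windows."""
    confirmed: List[Outcome] = []
    removed: List[Outcome] = []
    pending: List[Outcome] = []
    for i, window in enumerate(windows):
        # 1. Try to resolve predictions raised in earlier windows.
        still_pending: List[Outcome] = []
        for rule, raised in pending:
            if match_end(rule[1], window) != -1:
                confirmed.append((rule, raised))
            elif i - raised >= span:
                removed.append((rule, raised))  # validity period is over
            else:
                still_pending.append((rule, raised))
        pending = still_pending
        # 2. Raise new predictions for antecedents found in this window.
        for rule in rules:
            end = match_end(rule[0], window)
            if end == -1:
                continue
            if match_end(rule[1], window, end) != -1:
                confirmed.append((rule, i))  # consequent in the same window
            elif span > 0:
                pending.append((rule, i))
            else:
                removed.append((rule, i))
    removed.extend(pending)  # stream ended before confirmation
    return confirmed, removed


# Hypothetical rule and stream of windows.
demo_rule: Rule = (("A", "B"), ("C",))
demo_windows = [["A", "B"], ["X"], ["C"], ["A", "B"], ["X"], ["X"], ["X"]]
demo_confirmed, demo_removed = predict(demo_windows, [demo_rule], span=2)
```

In the demo, the prediction raised in window 0 is confirmed by the consequent in
window 2, while the prediction raised in window 3 is removed because no
consequent appears within the next two windows.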
5.6 Results and Analysis
In this section we present the experimental results and analysis. We applied our
proposed sequential association rules mining algorithm, followed by anomaly
prediction, to three control system datasets. First, we present and illustrate
the results obtained from the conveyor belt control system in Section 5.6.1.
Second, in Section 5.6.2, we discuss the results from the pressure control
system. Finally, in Section 5.6.3, we present and analyse the results from the
water tank control system.
5.6.1 Conveyor-belt Control System
The rare sequential pattern mining algorithm generated 14 rare sequential
patterns from the conveyor belt training dataset. Algorithm 5.1 generated 11
sequential association rules from these rare sequential patterns. Two of the 11
sequential association rules are shown in Table 5.3. Using these association
rules, the anomaly prediction algorithm, Algorithm 5.2, generated 45 possible
anomaly predictions. For example, the anomaly predicted by the association rule
〈{HMI_Conv_Reset_-1}, {Conv_Read_Conv_HMI_Direction_-1}〉 ⇒
〈{HMI_Conv_Direction_-1}〉 was found in the streaming logs. As a result, the
prediction was successful. This predicted anomalous pattern was identified as an
attack pattern. This is because the conveyor belt control system was reset,
changing the value recorded in the event {HMI_Conv_Reset_-1} from 0 to −1, and
then the conveyor belt direction was changed, changing the value recorded in the
event {Conv_Read_Conv_HMI_Direction_-1} from 0 to −1. These changes altered the
conveyor belt direction, as reflected in the event {HMI_Conv_Direction_-1}, whose
value also changed from 0 to −1. This attack caused objects on the conveyor belt
to be sorted in the wrong direction.
Table 5.3: Examples of sequential association rules from three control systems.
Dataset                    Sequential Association Rules
Conveyor belt dataset      〈{HMI_Conv_Reset_-1}, {Conv_Read_Conv_HMI_Direction_-1}〉
                             ⇒ 〈{HMI_Conv_Direction_-1}〉
                           〈{Conv_Run_Status_-1}, {HMI_Conv_Reset_-1}〉
                             ⇒ 〈{Conv_Run_Status_0}〉
Pressure control dataset   〈{Pipe_Pump_Run_Status_-1}, {HMI_Pipe_Solenoid_On_SP_41}〉
                             ⇒ 〈{HMI_Pipe_Solenoid_On_SP_42}〉
                           〈{Pipe_Pump_Run_Status_-1}, {HMI_Pipe_Solenoid_Off_SP_24}〉
                             ⇒ 〈{HMI_Pipe_Solenoid_Off_SP_25}〉
Water tank dataset         〈{Tank_Level_65}, {Tank_Level_68}〉
                             ⇒ 〈{Tank_Read_Tank_Level_68}〉
                           〈{Tank_Usage_Level_61}, {Tank_Read_Tank_Level_42}〉
                             ⇒ 〈{HMI_Tank_Master_Mode_-1}〉
Similarly, the association rule 〈{Conv_Run_Status_-1}, {HMI_Conv_Reset_-1}〉 ⇒
〈{Conv_Run_Status_0}〉 was successfully found in the streaming logs. This pattern
was also identified as an attack pattern, as the conveyor belt was stopped
without being scheduled while the system was running. The event
{Conv_Run_Status_-1} shows that the conveyor belt was in the running state. The
conveyor belt was then reset, as shown by the event {HMI_Conv_Reset_-1}. After
the reset, the conveyor belt stopped, as shown by the event {Conv_Run_Status_0},
which indicates that the conveyor belt was no longer running because the value
changed from −1 to 0.
Among the 45 possible anomaly predictions, 32 were true positives. Among these 32
true predictions, 3 were identified as attack patterns, out of 5 actual attacks
on the conveyor belt control system. A true positive means that once the
antecedent of an association rule occurred in the streaming logs, the consequent
of the same rule also occurred in the streaming logs, either in the same window
as the antecedent or in one of the following windows. The anomaly prediction true
positive rate for the conveyor belt experiment is 71.11%, as shown in Table 5.4.
Table 5.4: Anomaly predictions from the three control system streaming logs.
Control System   Total Pred.   True Positive   False Positive   True Positive Rate   False Positive Rate
Conveyor belt    45            32              13               71.11%               28.89%
On the other hand, 13 possible anomaly predictions were false positives, as the
consequent of the association rule occurred neither in the window in which the
antecedent occurred nor in the following windows. So, the anomaly prediction
false positive rate is 28.89%.
5.6.2 Pressure Control System
In the pressure control system training dataset, the rare sequential pattern
mining algorithm generated 8 rare sequential patterns. From these rare sequential
patterns, 6 association rules were generated. The anomaly prediction algorithm,
Algorithm 5.2, generated 26 anomaly predictions from the pressure control
streaming logs. For example, the anomaly prediction using the association rule
〈{Pipe_Pump_Run_Status_-1}, {HMI_Pipe_Solenoid_On_SP_41}〉 ⇒
〈{HMI_Pipe_Solenoid_On_SP_42}〉, which is shown in Table 5.3, was successful in
the pressure control system streaming logs. This means that after the antecedent
〈{Pipe_Pump_Run_Status_-1}, {HMI_Pipe_Solenoid_On_SP_41}〉 was found in the
streaming log, the consequent 〈{HMI_Pipe_Solenoid_On_SP_42}〉 was also found in
the streaming log. This possible anomalous pattern was identified as an attack
pattern when verified against the attack dataset, because the upper pressure
threshold value in {HMI_Pipe_Solenoid_On_SP_41} was above the set threshold
value.
Table 5.5: Anomaly predictions from the three control system streaming logs.
Control System     Total Pred.   True Positive   False Positive   True Positive Rate   False Positive Rate
Pressure control   26            18              8                69.23%               30.77%
In the attack, the upper threshold value was changed from 40 to 45. In the
predicted anomalous pattern, the pressure control system was in the running
state, as indicated by the event {Pipe_Pump_Run_Status_-1}. The next event,
{HMI_Pipe_Solenoid_On_SP_41}, indicates the current pressure status of the
control system, which is 41. As this pressure value was above the threshold
value, it was predicted that an attack was progressing on the control system and
that the pressure value was increasing. This was confirmed by the next event,
{HMI_Pipe_Solenoid_On_SP_42}, which was found in the streaming log. Among the 26
possible anomaly predictions, 18 were true positives. Among these 18 true
predictions, 4 were identified as attack patterns, out of 6 actual attacks on the
pressure control system. A prediction was true because the antecedent of an
association rule was followed by the consequent of the same rule in the streaming
logs. As a result, the true positive rate of the anomaly prediction algorithm is
69.23%, as shown in Table 5.5. On the other hand, 8 anomaly predictions were
false positives. Hence, the false positive prediction rate is 30.77%.
5.6.3 Water Tank Control System
The rare sequential pattern mining algorithm generated 52 rare sequential
patterns from the water tank training dataset. The association rule mining
algorithm, Algorithm 5.1, then generated 36 sequential association rules from
these rare sequential patterns. Some examples of the sequential association rules
are given in Table 5.3. Using these association rules, Algorithm 5.2 raised 78
possible anomaly predictions on the water tank streaming logs. Among these, 49
predictions were successful because the consequent of the association rule
occurred in the streaming logs once the antecedent of the rule had occurred. For
example, the anomaly predicted by the association rule 〈{Tank_Level_65},
{Tank_Level_68}〉 ⇒ 〈{Tank_Read_Tank_Level_68}〉, shown in Table 5.3, was found in
the streaming log. Since 49 predictions were true, the prediction true positive
rate for the water tank control system is 62.82%, as shown in Table 5.6.
Table 5.6: Anomaly predictions from the three control system streaming logs.
Control System   Total Pred.   True Positive   False Positive   True Positive Rate   False Positive Rate
Water tank       78            49              29               62.82%               37.18%
Among the successful predictions, none were identified as attack patterns. This
is because the attacks on the water tank control system were flooding attacks.
On the other hand, 29 predictions were false, as the consequent of the
association rule was not found in the streaming logs once the antecedent of the
rule had been found. For example, the possible anomaly prediction using the
association rule 〈{Tank_Usage_Level_61}, {Tank_Read_Tank_Level_42}〉 ⇒
〈{HMI_Tank_Master_Mode_-1}〉 was a false positive, because the consequent did not
occur in the streaming logs even though the antecedent was found. The anomaly
false positive prediction rate is 37.18%, which is shown in Table 5.6.
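The rates reported in Tables 5.4 to 5.6 can be reproduced from the prediction
counts with a short Python sketch. The dictionary and function names are
illustrative assumptions; the counts are those reported in this section. Note
that, as computed here, each rate is a share of the raised predictions (true or
false predictions divided by total predictions).

```python
from typing import Dict, Tuple

# (total predictions, true positives) as reported for each control system.
REPORTED: Dict[str, Tuple[int, int]] = {
    "Conveyor belt": (45, 32),
    "Pressure control": (26, 18),
    "Water tank": (78, 49),
}


def prediction_rates(total: int, true_pos: int) -> Tuple[int, float, float]:
    """Return (false positives, true positive %, false positive %)."""
    false_pos = total - true_pos
    return (
        false_pos,
        round(100 * true_pos / total, 2),
        round(100 * false_pos / total, 2),
    )


summary = {name: prediction_rates(*counts) for name, counts in REPORTED.items()}
for name, (fp, tp_rate, fp_rate) in summary.items():
    print(f"{name}: FP={fp}, TP rate={tp_rate}%, FP rate={fp_rate}%")
```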
5.7 Discussion
This section discusses the anomaly prediction method used and the results
obtained from the experiments. In this experiment, we have shown that possible
anomalies can be predicted on the SCADA control system by using real-time
streaming logs. We experimented with three control system logs: the conveyor belt
control system, the pressure control system and the water tank control system.
With each of the control system logs, we conducted the experiment in three
phases. In the first phase, we created two datasets, the training dataset and the
testing dataset, from the control system logs, and generated rare sequential
patterns from the training dataset. In the second phase, the rare sequential
patterns were used to generate the sequential association rules with which the
anomaly predictions were performed. Finally, in the third phase, we generated
possible anomaly predictions in the streaming logs by using the association
rules.
The purpose of the association rule mining was to predict possible anomalies in
the streaming logs. Every association rule is composed of two parts, the
antecedent and the consequent. Although we experimented with the variable length
antecedent method of anomaly prediction, the method did not produce the expected
prediction results. It produced a large number of anomaly predictions, because
the variable length antecedent method generated a large number of association
rules, with both shorter and longer antecedents. The shorter antecedent rules
were triggered many times because a shorter antecedent is more frequent and is
therefore found many times in the streaming logs.
Although the shorter antecedent rules were triggered many times, the consequents
of these rules often did not occur in the streaming logs. This is because the
shorter antecedent rules have a weak correlation between the antecedent and the
consequent, which is reflected in the low confidence value of these rules. A
shorter antecedent has a high frequency, which contributes to a low confidence
value for the association rule. This is because the confidence of an association
rule is derived by dividing the frequency of the rare sequential pattern by the
frequency of the antecedent. Rare sequential patterns generally have a low
frequency. Hence, dividing the low frequency of the pattern by the high frequency
of the antecedent produces a low confidence.
The weak correlation of an association rule means that the probability of the
consequent occurring in the streaming logs is low once the antecedent of the rule
is found. Therefore, even if the antecedent of a rule was found many times in the
streaming logs, the consequent did not occur next, which resulted in unsuccessful
predictions. Since the variable length antecedent method generated a large number
of predictions, the anomaly prediction success rate was lower than the
unsuccessful prediction rate. Moreover, the variable length antecedent method
generated redundant association rules from a rare sequential pattern. For a
single anomalous pattern, this method raised multiple predictions by triggering
redundant association rules. Triggering redundant association rules not only
contributed to the large number of anomaly predictions but also consumed a large
amount of computation time.
To reduce the number of anomaly predictions and remove the redundant association
rules, we applied the longest antecedent method of generating association rules.
In this method, we generated only the association rules that have the longest
antecedent, which reduced the number of association rules. The longest antecedent
association rules reduced the possible anomaly predictions by not generating
multiple predictions for a single anomalous pattern. As a result, the longest
antecedent method generated fewer anomaly predictions than the variable length
antecedent method.
The longest antecedent method also reduces the number of false positives. Once an
alert is raised for a possible anomaly prediction using the longest antecedent of
an association rule, the consequent of the same rule is checked in the streaming
logs. If the consequent is found, the anomaly prediction is successful, that is,
a true positive. However, if the consequent is not found, the prediction becomes
a false positive. Since the prediction method uses only the longest antecedent
rule, it raises fewer predictions and therefore produces fewer false positives.
On the other hand, since the variable length antecedent method uses rules with
antecedents of all possible lengths, it generates more false positives when its
predictions are unsuccessful.
In the anomaly prediction experiments, we found different successful prediction
rates. In the conveyor belt experiment, 45 anomaly predictions were raised by the
algorithm. Among these predictions, 32 were successful, while 13 were
unsuccessful. The successful prediction rate is 71.11%, against an unsuccessful
prediction rate of 28.89%. In the pressure control system experiment, 26 anomaly
predictions were raised by the algorithm. Among these, 18 predictions were
successful, while 8 were unsuccessful. The prediction success rate for the
pressure control experiment is 69.23%, while the false prediction rate is 30.77%.
Finally, in the water tank experiment, among 78 anomaly predictions, 49 were
successful against 29 unsuccessful predictions. The prediction success rate for
the water tank experiment is 62.82%, while the false prediction rate is 37.18%.
In all three control system experiments, the algorithm's anomaly prediction
success rate is higher than the unsuccessful prediction rate. However, the false
positive rate remains high. This is because a possible anomaly prediction is
raised whenever the antecedent of an association rule is found in a streaming log
window. The antecedent of a rule can be frequent even though the rule was
generated from a rare sequential pattern. Therefore, the frequent antecedent of
an association rule appeared many times in the streaming log windows. However,
the consequent did not appear as frequently, as the association rule comprising
both antecedent and consequent was rare. There are two possibilities: either the
consequent never happened, or the consequent occurred but not within the defined
time window. In both cases, the possible anomaly prediction is found to be false,
which increases the false positive rate.
The number of possible anomaly predictions can be further reduced by generating
the association rules only from the rare sequential patterns that are identified
as attack patterns. In other words, instead of using all rare sequential
patterns, only those that have been identified as attack patterns are used to
generate association rules. However, this method would not be able to predict new
or zero-day anomalies on the system.
In the experiment, the method was able to identify some possible anomaly
predictions as attack patterns. This is because these attack patterns were found
in both the training and the testing dataset. On the other hand, a few attacks
were not identified by the possible anomaly predictions, because some of the
attacks conducted on the control system were found in only one of the two
datasets, either the training dataset or the testing dataset. Therefore, the
attack patterns need to appear in both the training and the testing dataset,
since the possible anomalous patterns are predicted from the testing dataset. In
other words, to verify whether a predicted anomalous pattern is an attack
pattern, the training dataset needs to contain some attack patterns that are also
found in the testing dataset. Note that in the anomaly prediction experiment we
cannot swap the training dataset and the testing dataset. Swapping would mean
that, first, the association rules are generated from the training dataset and
used to predict possible anomalies in the testing dataset and, second, the
training and testing datasets are interchanged, so that association rules are
generated from the newly interchanged training dataset and used to predict
possible anomalies in the newly interchanged testing dataset. Such swapping is
not possible because, in the SCADA control system, the logs are recorded
sequentially, that is, the events are recorded in chronological order. The
training dataset is created from log events that occurred before the log events
from which the testing dataset was created. The purpose of the training dataset
is to generate sequential association rules that are later applied to the testing
dataset: if the antecedent of an association rule is found in the testing
dataset, it is predicted that the corresponding consequent is likely to occur
next in the testing dataset. However, if the training dataset is swapped with the
testing dataset, the sequential order of the events is lost, because the
association rules generated from the interchanged training dataset are composed
of log events that occur after the log events in the interchanged testing
dataset. Therefore, swapping the training dataset and the testing dataset is not
possible for a sequential database.
It is acknowledged that the prediction results could be more accurate if the
training and test data were split in different proportions. However, splitting
the data in different proportions is not feasible for this experiment, because
the attacks that were conducted on the control system while generating the data
fall either in the training dataset or in the testing dataset. If the
percentage of the training dataset is increased, no attack data will remain in
the testing dataset, and it cannot be verified whether the predicted anomalies
are true attack patterns. To verify whether a predicted anomalous pattern is an
attack pattern, the training dataset needs to contain some attack patterns that
are also found in the testing dataset. Therefore, attack patterns must appear
in both the training and the testing dataset, since the possible anomalous
patterns are predicted from the testing dataset.
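The feasibility argument above can be checked mechanically. The following sketch assumes a hypothetical per-event attack flag (the attack positions are invented for illustration) and tests, for several training percentages, whether attack data remains in both partitions.

```python
def attacks_in_both_partitions(is_attack, train_percent):
    """Return True if a chronological split leaves attacks in both partitions."""
    cut = len(is_attack) * train_percent // 100  # first index of the test part
    return any(is_attack[:cut]) and any(is_attack[cut:])


# Hypothetical attack flags for ten chronologically ordered sequences.
is_attack = [False, True, False, False, False, False, True, False, False, False]

# Map each candidate training percentage to whether it is usable.
feasible = {p: attacks_in_both_partitions(is_attack, p) for p in (30, 50, 80)}
```

With these invented flags, an 80% training share leaves no attack in the test part, so predictions made on it could not be verified.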
5.8 Summary
This chapter has presented a new sequential association rules mining approach
for predicting possible anomalies. The method has been used to predict and
detect possible anomalies on the SCADA control system, using the streaming logs
of that system. The experimental results showed that possible anomalies can be
predicted by using sequential association rules: the system is alerted to the
possible occurrence of an anomalous pattern in the streaming logs. The anomaly
prediction method produced fewer false positives because it raised alerts using
the association rule with the longest antecedent.
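The longest-antecedent selection can be sketched as follows. This is an illustration under assumptions, not the algorithm of Chapter 5: rules are assumed to be (antecedent, consequent) pairs of event tuples, and an antecedent is assumed to match when it occurs as an ordered, not necessarily contiguous, subsequence of the recent window of events.

```python
def is_subsequence(pattern, sequence):
    """True if the events of pattern occur in sequence in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def longest_matching_rule(rules, window):
    """Among rules whose antecedent matches the window, pick the longest antecedent."""
    matches = [rule for rule in rules if is_subsequence(rule[0], window)]
    return max(matches, key=lambda rule: len(rule[0]), default=None)


# Hypothetical rules: (antecedent, predicted consequent).
rules = [
    (("a",), ("x",)),
    (("a", "b"), ("y",)),
    (("c", "a"), ("z",)),
]
selected = longest_matching_rule(rules, ["a", "d", "b"])
```

Preferring the longest matching antecedent means an alert is backed by more observed context, which is one plausible reason for the lower false positive count reported above.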
The next chapter, Chapter 6, presents the summary of this thesis. It reviews
the contributions and how they have been achieved, and concludes with
recommendations for future work.
Chapter 6
Conclusion and Future Work
This chapter presents the overall research summary of this thesis. In this
summary we revisit the objectives that were outlined in Chapter 1 and explain
how we have achieved the goals of this thesis. This chapter also presents
future research directions that address the problems identified while
conducting this research.
6.1 Research Summary
The detection of anomalies in a SCADA control system can be performed by
analysing either control logs or network traffic packets. In this thesis, we
have analysed SCADA control logs to find anomalies. Since SCADA control process
activities are recorded sequentially, with a timestamp attached to each event,
this thesis used a sequential pattern mining approach to analyse the logs. The
process activities of a SCADA control system are limited and repetitive. As a
result, the normal activities of the SCADA control system are predictable and
produce a regular behavioural profile of the system. This regular profile is
the frequent behaviour of the system. Any abnormal process activity that
deviates from the normal activities of the control system is therefore a rare
activity of the control system. These rare activities are the anomalies of the
control system, and they could be deliberate cyber-attacks or system
malfunctions.
Although anomalies in a system can be either frequent or rare, we assumed that
anomalies are rare events in a SCADA control system. We therefore used a rare
sequential pattern mining method to find rare events in the control system. As
far as we are aware, there has been no prior work that detects anomalies using
rare sequential patterns by analysing SCADA control logs. Hadžiosmanović et
al. [17] used water treatment SCADA control logs to find threats in the system,
but they used rare itemset pattern mining to find rare events. Their approach
could not find rare sequences of events representing anomalies in a SCADA
system, nor could it identify the order of events, which is important in
identifying anomalies in a SCADA control system. We have addressed this problem
in this thesis using the proposed rare sequential pattern mining method. In
this thesis, we have achieved the three key objectives set for our research on
detecting and predicting anomalies in SCADA control system logs by using rare
sequential pattern mining. We conclude this thesis by summarising how we have
achieved the objectives of our research stated in Chapter 1.
Objective 1: To design and develop a method for finding anomalies that are
rare in SCADA control systems. This research objective is achieved with the
use of a rare sequential pattern mining approach. The details of this approach
are discussed in Chapter 3 of this thesis. In that chapter, we proposed and
developed a new method for finding rare sequential patterns in a SCADA system.
Using this method we analysed the SCADA control logs and discovered rare
sequential patterns that represent anomalous behaviour of the control system.
We also found the minimal (smallest) and the maximal (longest) rare sequential
patterns in each equivalence class, where the patterns of a class share the
same frequency and occur in the same number of sequences. The purpose of
finding minimal and maximal rare anomalous patterns was to evaluate which rare
patterns give a better understanding for detecting anomalies. It was found
that, in the SCADA domain, the maximal rare sequential pattern gives more
context for understanding the complete scenario and identifying the anomalous
pattern than the minimal rare sequential patterns, which indicate only the
starting point of the anomalous pattern. Our analysis also showed that the
order of events in SCADA control logs is important for detecting anomalies.
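The notions of rare pattern, equivalence class, and minimal and maximal pattern can be made concrete with a small sketch. It is not the Chapter 3 algorithm: the candidate patterns are supplied by hand, the rarity threshold `max_support` is a hypothetical parameter, and an equivalence class is assumed here to group patterns that occur in exactly the same sequences.

```python
from collections import defaultdict


def is_subsequence(pattern, sequence):
    """True if the events of pattern occur in sequence in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def supporting_ids(pattern, database):
    """Indices of the database sequences that contain the pattern."""
    return frozenset(i for i, seq in enumerate(database) if is_subsequence(pattern, seq))


def rare_equivalence_classes(candidates, database, max_support):
    """Group rare patterns by supporting sequences; return (minimal, maximal) per class."""
    classes = defaultdict(list)
    for pattern in candidates:
        ids = supporting_ids(pattern, database)
        if 0 < len(ids) <= max_support:  # present in the database, but rare
            classes[ids].append(pattern)
    return {ids: (min(ps, key=len), max(ps, key=len)) for ids, ps in classes.items()}


# Hypothetical database: three normal sequences and one deviating sequence.
database = [("a", "b", "c"), ("a", "b", "c"), ("a", "b", "c"), ("a", "x", "y", "c")]
candidates = [("a",), ("x",), ("x", "y"), ("a", "x", "y", "c"), ("a", "b")]
result = rare_equivalence_classes(candidates, database, max_support=1)
```

In this toy case the minimal pattern `("x",)` only marks where the deviation starts, while the maximal pattern `("a", "x", "y", "c")` shows the whole scenario, which mirrors the finding stated above.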
Objective 2: To improve the efficiency of the rare sequential pattern mining
algorithm without losing accuracy by introducing constraints. We have achieved
this objective with the constraint-based rare sequential pattern mining
algorithm. The details of this algorithm are presented in Chapter 4. In that
chapter, we used three constraints: the time-span gap constraint, the feature
reduction constraint, and the algorithmic constraint. The time-span gap
constraint is applied to the sequential database in the pre-processing stage,
that is, when segmenting the sequences to create the sequential database from
the SCADA control logs. It is used to find the significant rare sequential
patterns whose events occur within a defined time period. This constraint is
implemented by selecting the episodic events from the control logs, so events
do not overlap across consecutive sequences. Secondly, the feature reduction
constraint was used to reduce the number of unique events in the sequence
database. Fewer unique events produce fewer possible candidate super sequences.
The outcome showed that having fewer candidate super sequences improved the
efficiency of the constraint-based rare sequential pattern mining algorithm by
reducing its computational time.
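One way to read the time-span segmentation is sketched below. It rests on an assumption: a new sequence is started whenever an event falls outside a fixed span measured from the first event of the current sequence, so sequences never overlap. The span length, the numeric timestamps (in seconds), and the event labels are hypothetical, and the pre-processing in Chapter 4 may differ.

```python
def segment_by_time_span(events, span_seconds):
    """Cut (timestamp, label) events into non-overlapping sequences of bounded span."""
    sequences = []
    current = []
    start = None
    for timestamp, label in sorted(events):
        # Open a new sequence when the event lies beyond the current span.
        if start is None or timestamp - start > span_seconds:
            if current:
                sequences.append(current)
            current = []
            start = timestamp
        current.append(label)
    if current:
        sequences.append(current)
    return sequences


# Hypothetical events with timestamps in seconds.
events = [(0, "a"), (2, "b"), (4, "c"), (11, "d"), (12, "e"), (30, "f")]
sequences = segment_by_time_span(events, span_seconds=5)
```

Because every event is assigned to exactly one sequence, no event is shared between consecutive sequences.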
The third constraint is the algorithmic constraint, which was implemented
within the rare sequential pattern mining algorithm. It consists of two
constraints. The first checks the size of the candidate super sequence before
the candidate is searched for in the database. Since a larger candidate super
sequence cannot be contained in a smaller sequence of the database, scanning
the smaller sequences can be avoided. The second avoids unnecessary database
scans: once the candidate super sequence is found in a sequence of the
database, it is not searched for in the remaining sequences. The outcome showed
that these two algorithmic constraints reduced the computational time of the
constraint-based rare sequential pattern mining algorithm. Further, the
accuracy of the anomalies detected by the constraint-based rare sequential
pattern mining algorithm was compared with that of the anomalies detected by
the rare sequential pattern mining algorithm without constraints. The result
showed that the constraint-based algorithm found the same number of anomalies
as the unconstrained rare sequential pattern mining algorithm.
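The two algorithmic constraints can be sketched as a single existence check. The function below is illustrative only (its name and the returned scan counter are assumptions added to make the saving visible): it skips database sequences shorter than the candidate, and it stops as soon as the candidate is found.

```python
def is_subsequence(pattern, sequence):
    """True if the events of pattern occur in sequence in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def occurs_in_database(candidate, database):
    """Return (found, scanned), where scanned counts the sequences actually searched."""
    scanned = 0
    for sequence in database:
        if len(candidate) > len(sequence):
            continue  # first constraint: a longer candidate cannot fit, skip the scan
        scanned += 1
        if is_subsequence(candidate, sequence):
            return True, scanned  # second constraint: stop at the first occurrence
    return False, scanned


# Hypothetical sequential database.
database = [("a", "b"), ("a", "b", "c", "d"), ("a", "c", "d"), ("b", "c", "d", "e")]
found = occurs_in_database(("a", "c", "d"), database)
missing = occurs_in_database(("e", "a", "b"), database)
```

In this example the found candidate costs one scan instead of four, while a non-existent candidate still needs every sufficiently long sequence to be searched.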
Objective 3: To provide an anomaly prediction method that can extend the work
of the rare sequential pattern mining algorithm. We have achieved this
objective with a new anomaly prediction method based on the sequential
association rules mining technique. We extended our proposed rare sequential
pattern mining algorithm to predict anomalies in streaming SCADA control logs.
The rare sequential pattern mining algorithm is used to detect anomalies in
static SCADA control logs, which we achieved in Objective 1, but it cannot
predict possible anomalies before they occur on the live SCADA system. To
achieve this, we developed and implemented an anomaly prediction algorithm that
raises an alert once the precursor, or antecedent, of an anomalous pattern is
found in the incoming streaming logs. The detailed analysis and implementation
of this method are presented in Chapter 5 of this thesis. We validated the
results by using the training and the testing datasets of the SCADA control
logs. It was found that the anomaly prediction method is effective in
predicting anomalies in streaming SCADA control logs before they occur on the
system.
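The idea of alerting on an antecedent before its consequent occurs can be sketched with a generator over a stream of events. This is a simplified illustration, not the Chapter 5 algorithm: the rule, the event names, and the reset of progress after each alert are assumptions.

```python
def stream_alerts(rules, stream):
    """Yield (position, antecedent, consequent) when an antecedent completes in the stream."""
    progress = [0] * len(rules)  # matched antecedent events per rule
    for position, event in enumerate(stream):
        for i, (antecedent, consequent) in enumerate(rules):
            if event == antecedent[progress[i]]:
                progress[i] += 1
                if progress[i] == len(antecedent):
                    yield position, antecedent, consequent
                    progress[i] = 0  # start looking for the antecedent again


# Hypothetical rule: two failed logins tend to be followed by a configuration change.
rules = [(("login_fail", "login_fail"), ("config_change",))]
stream = ["pump_on", "login_fail", "pump_off", "login_fail", "config_change"]
alerts = list(stream_alerts(rules, stream))
```

The alert is raised at position 3, one event before the predicted `config_change` arrives at position 4.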
6.2 Future Research Directions
This thesis has mainly focused on applying the rare sequential pattern mining
algorithm to a SCADA control system to detect and predict anomalies. Since
the proposed rare sequential pattern mining method is generic, it can be
applied in other domains to find anomalies. Chapter 3 discussed the generation
of rare sequential patterns for detecting anomalies. In the process of
generating rare patterns, all possible candidate sequence patterns are
generated from a rare sequential pattern. Most of these candidate sequence
patterns do not exist in the database. This is because, while generating
sequential candidate patterns, the events are placed in different positions of
a sequence. As events occur sequentially, it is highly unlikely that events
which happen later in a sequence can appear before other events. These
non-existent patterns cost considerable computational time and space. In future
research, heuristics can be applied while generating candidate sequential
patterns to reduce the number of non-existent patterns so that the
computational time can be improved.
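The cost of non-existent candidates can be seen in a toy example. The sketch assumes that candidates are all orderings of the events of a pattern, which is a simplification of the candidate generation in Chapter 3; the database is hypothetical.

```python
from itertools import permutations


def is_subsequence(pattern, sequence):
    """True if the events of pattern occur in sequence in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def split_candidates(events, database):
    """Separate the orderings of events into those present in and absent from the database."""
    existing, missing = [], []
    for candidate in permutations(events):
        if any(is_subsequence(candidate, seq) for seq in database):
            existing.append(candidate)
        else:
            missing.append(candidate)
    return existing, missing


# Hypothetical database in which the events always occur in the same order.
database = [("a", "b", "c"), ("a", "b", "c", "d")]
existing, missing = split_candidates(("a", "b", "c"), database)
```

Here five of the six orderings are never observed, yet each of them still has to be checked against the database, which is the overhead a candidate-pruning heuristic would aim to remove.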
We have used the Apriori method to generate rare sequential generator patterns.
The efficiency of the Apriori method is not optimal, because it generates
candidate generator patterns, and generating candidate sequence patterns in
sequential pattern mining is more time consuming than in itemset pattern
mining. Hence, our proposed rare sequential pattern mining algorithm is not an
efficient method of finding rare sequential patterns. Its efficiency can be
further improved by implementing the algorithm using the pattern growth method.
Moreover, in this thesis we have used SCADA control system logs to find
anomalies with our proposed rare sequential pattern mining algorithm. As the
proposed method is a generic approach to finding rare sequential patterns, it
can also be used with a standard IT network to find rare behaviour: packet data
can be analysed to detect anomalies in the network.
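As a rough illustration of the pattern growth idea mentioned above, the following sketch grows patterns by prefix projection, in the style of PrefixSpan, for sequences of single events. It is not the proposed algorithm: it enumerates every pattern that exists in a hypothetical database together with its support, and rare patterns are then filtered with an assumed threshold of one supporting sequence.

```python
def pattern_growth(database, min_support, prefix=()):
    """Yield (pattern, support) by extending prefix with events of the projected database."""
    counts = {}
    for sequence in database:
        for event in set(sequence):  # count each event once per sequence
            counts[event] = counts.get(event, 0) + 1
    for event, support in sorted(counts.items()):
        if support < min_support:
            continue
        pattern = prefix + (event,)
        yield pattern, support
        # Project: keep only what follows the first occurrence of the event.
        projected = [seq[seq.index(event) + 1:] for seq in database if event in seq]
        yield from pattern_growth(projected, min_support, pattern)


# Hypothetical database: the third sequence has two events in an unusual order.
database = [("a", "b", "c"), ("a", "b", "c"), ("a", "c", "b")]
supports = dict(pattern_growth(database, min_support=1))
rare = {p: s for p, s in supports.items() if s <= 1}
```

Because patterns are only extended with events that actually follow the prefix in the data, no non-existent candidate is ever generated or searched for.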
Finally, our proposed anomaly prediction method was not fully effective, as it
produces a high number of false positives. This is because, in the proposed
method, the possible anomalous pattern flag is raised once the antecedent of a
rule is found in the streaming logs. However, not every antecedent of an
association rule is part of an anomalous pattern. Domain expertise can
therefore be used to identify the rules that correspond to anomalous patterns,
and the rule set can be reduced to these rules before it is used for anomaly
prediction. As a result, the number of false positives may be reduced. In
future research the proposed method can also be applied to a standard IT
network to predict possible anomalies by analysing the streaming packets.
Bibliography
[1] S. Collins and S. McCombie, "Stuxnet: the emergence of a new cyber
weapon and its implications," Journal of Policing, Intelligence and Counter
Terrorism, vol. 7, no. 1, pp. 80–91, 2012.
[2] R. M. van der Knijff, "Control systems/SCADA forensics, what's the
difference?," Digital Investigation, vol. 11, no. 3, pp. 160–174, 2014.
[3] A. A. Cardenas, S. Amin, Z.-S. Lin, Y.-L. Huang, C.-Y. Huang, and S. Sastry,
"Attacks against process control systems: risk assessment, detection,
and response," in Proceedings of the 6th ACM Symposium on Information,
Computer and Communications Security, pp. 355–366, ACM.
[4] "Critical Foundations: Protecting America's Infrastructures. Report of the
president's commission on critical infrastructure protection." Available
from https://fas.org/sgp/library/pccip.pdf. Accessed 11 July 2018.
[5] P. Pederson, D. Dudenhoeffer, S. Hartley, and M. Permann, "Critical
infrastructure interdependency modeling: a survey of US and international
research," Idaho National Laboratory, pp. 1–20, 2006.
[6] E. Carter, CCSP Self-study: Cisco Secure Intrusion Detection System
(CSIDS). Cisco Press, 2004.
[7] J. Weiss, "Cyber security research and development," report, KEMA Inc.,
2008.
[8] B. Miller and D. Rowe, "A survey of SCADA and critical infrastructure
incidents," in Proceedings of the 1st Annual conference on Research in
information technology, pp. 51–56, ACM, 2012.
[9] M. D. Cavelty, Cyber-security and threat politics: US efforts to secure the
information age. Routledge, 2007.
[10] J. Guan, J. H. Graham, and J. L. Hieb, "A digraph model for risk
identification and management in SCADA systems," in Intelligence and Security
Informatics (ISI), 2011 IEEE International Conference on, pp. 150–155,
IEEE, 2011.
[11] K. Wilhoit, "The SCADA that didn't cry wolf," Trend Micro Inc., White
Paper, 2013.
[12] "Year in Review: How Did the Cyberthreat Landscape
Change in 2017?."
https://securityintelligence.com/year-in-review-how-did-the-cyberthreat-landscape-change-in-2017/,
2017 (accessed April 21, 2018).
[13] "Security attacks on industrial control systems." Available from
https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=SEL03046USEN,
2015 (accessed April 21, 2018).
[14] D. Hadžiosmanović, D. Bolzoni, S. Etalle, and P. Hartel, "Challenges and
opportunities in securing industrial control systems," in Complexity in
Engineering (COMPENG), 2012, pp. 1–6, IEEE.
[15] I. Garitano, R. Uribeetxeberria, and U. Zurutuza, "A review of SCADA
anomaly detection systems," in Proceedings of the 6th International
Conference on Soft Computing Models in Industrial and Environmental
Applications, pp. 357–366, Springer.
[16] D. Hadžiosmanović, D. Bolzoni, and P. Hartel, "Towards securing SCADA
systems against process-related threats," 2010.
[17] D. Hadžiosmanović, D. Bolzoni, and P. H. Hartel, "A log
mining approach for process monitoring in SCADA," Int. J. of Inform.
Security, vol. 11, no. 4, pp. 231–251, 2012.
[18] J. Gao, J. Liu, B. Rajan, R. Nori, B. Fu, Y. Xiao, W. Liang, and
C. Philip Chen, "SCADA communication and security issues," Security and
Communication Networks, vol. 7, no. 1, pp. 175–194, 2014.
[19] Y. Ebata, H. Hayashi, Y. Hasegawa, S. Komatsu, and K. Suzuki,
"Development of the intranet-based SCADA (supervisory control and data
acquisition system) for power system," in Power Engineering Society Winter
Meeting, 2000. IEEE, vol. 3, pp. 1656–1661, IEEE.
[20] A. Giani, G. Karsai, T. Roosta, A. Shah, B. Sinopoli, and J. Wiley, "A
testbed for secure and robust SCADA systems," ACM SIGBED Review, vol. 5,
no. 2, p. 4, 2008.
[21] M. Cheminod, L. Durante, and A. Valenzano, "Review of Security Issues in
Industrial Networks," IEEE Trans. on Ind. Informat, vol. 9, no. 1,
pp. 277–293, 2013.
[22] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey,"
ACM Computing Surveys (CSUR), vol. 41, no. 3, p. 15, 2009.
[23] D. E. Denning, "An intrusion-detection model," IEEE Transactions on
Software Engineering, no. 2, pp. 222–232, 1987.
[24] A. Lazarevic, L. Ertöz, V. Kumar, A. Ozgur, and J. Srivastava,
"A comparative study of anomaly detection
schemes in network intrusion detection," in SDM, pp. 25–36, SIAM.
[25] J. Verba and M. Milvich, "Idaho national laboratory supervisory control
and data acquisition intrusion detection system (SCADA IDS)," in
Technologies for Homeland Security, 2008 IEEE Conference on, pp. 469–473, IEEE.
[26] S. Manganaris, M. Christensen, D. Zerkle, and K. Hermiz, "A data mining
analysis of RTID alarms," Computer Networks, vol. 34, no. 4, pp. 571–577,
2000.
[27] C. Clifton and G. Gengo, "Developing custom intrusion detection filters
using data mining," in IEEE Proc. 21st Century Military Commun., vol. 1,
pp. 440–443, 2000.
[28] K. Julisch and M. Dacier, "Mining intrusion detection alarms for
actionable knowledge," in Proceedings of the eighth ACM SIGKDD international
conference on Knowledge discovery and data mining, pp. 366–375, ACM,
2002.
[29] K. Sequeira and M. Zaki, "ADMIT: anomaly-based data mining for
intrusions," in Proceedings of the eighth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 386–395, ACM.
[30] Y. Fan, Y. Ye, and L. Chen, "Malicious sequential pattern mining for
automatic malware detection," Expert Systems with Applications, vol. 52,
pp. 16–25, 2016.
[31] A. Patcha and J.-M. Park, "An overview of anomaly detection techniques:
Existing solutions and latest technological trends," Computer Networks,
vol. 51, no. 12, pp. 3448–3470, 2007.
[32] T. Shon and J. Moon, "A hybrid machine learning approach to network
anomaly detection," Information Sciences, vol. 177, no. 18, pp. 3799–3821,
2007.
[33] S. Noel, D. Wijesekera, and C. Youman, Modern intrusion detection, data
mining, and degrees of attack guilt, pp. 1–31. Springer, 2002.
[34] A. Wespi, M. Dacier, H. Debar, and M. M. Nassehi, "Audit trail pattern
analysis for detecting suspicious process behavior," in Proceedings of RAID
98, Workshop on Recent Advances in Intrusion Detection, 1998.
[35] K. J. Cios and L. A. Kurgan, Trends in data mining and knowledge
discovery, pp. 1–26. Springer, 2005.
[36] W. Lee and S. J. Stolfo, "Data mining approaches for intrusion detection,"
in 7th USENIX Security Symposium.
[37] C. Shearer, "The CRISP-DM model: the new blueprint for data mining,"
Journal of Data Warehousing, vol. 5, no. 4, pp. 13–22, 2000.
[38] H. Wang, Exploring Intrinsic Structures from Samples: Supervised,
Unsupervised, and Semisupervised Frameworks. Thesis, 2007.
[39] C. Kemp, T. L. Griffiths, S. Stromsten, and J. B. Tenenbaum,
"Semi-supervised learning with trees," in Advances in neural information
processing systems, pp. 257–264, 2004.
[40] X. Zhu, "Semi-Supervised Learning Tutorial."
http://pages.cs.wisc.edu/~jerryzhu/pub/sslicml07.pdf, 2007 (accessed April 08, 2018).
[41] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in
large databases," in Proceedings of International Conference on Very Large
Databases (VLDB '94), pp. 487–499, 1994.
[42] J. Pei and J. Han, "Constrained frequent pattern mining: a pattern-growth
view," ACM SIGKDD Explorations Newsletter, vol. 4, no. 1, pp. 31–39,
2002.
[43] S. Brin, R. Motwani, and C. Silverstein, "Beyond market baskets:
Generalizing association rules to correlations," in ACM SIGMOD Record, vol. 26,
pp. 265–276, ACM, 1997.
[44] C. Silverstein, S. Brin, R. Motwani, and J. Ullman, "Scalable techniques for
mining causal structures," Data Mining and Knowledge Discovery, vol. 4,
no. 2-3, pp. 163–192, 2000.
[45] J. Pei, J. Han, and W. Wang, "Constraint-based sequential pattern mining:
the pattern-growth methods," Journal of Intelligent Information Systems,
vol. 28, no. 2, pp. 133–160, 2007.
[46] B. Lent, A. Swami, and J. Widom, "Clustering association rules," in Data
Engineering, 1997. Proceedings. 13th International Conference on,
pp. 220–231, IEEE, 1997.
[47] G. Dong and J. Li, "Efficient mining of emerging patterns: Discovering
trends and differences," in Proceedings of the fifth ACM SIGKDD
international conference on Knowledge discovery and data mining, pp. 43–52,
ACM, 1999.
[48] H. Mannila, H. Toivonen, and A. I. Verkamo, "Discovery of frequent
episodes in event sequences," Data Mining and Knowledge Discovery, vol. 1,
no. 3, pp. 259–289, 1997.
[49] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques.
Elsevier, 2011.
[50] R. Agrawal and R. Srikant, "Mining sequential patterns," in Data
Engineering, 1995. Proceedings of the Eleventh International Conference on,
pp. 3–14, IEEE, 1995.
[51] C. H. Mooney and J. F. Roddick, "Sequential pattern mining – approaches
and algorithms," ACM Computing Surveys (CSUR), vol. 45, no. 2, p. 19,
2013.
[52] S. M. Vishal, "A survey on sequential pattern mining algorithms,"
International Journal of Computer Science and Information Technologies
(IJCSIT), vol. 5, no. 2, 2014.
[53] J. Han, H. Cheng, D. Xin, and X. Yan, "Frequent pattern mining:
current status and future directions," Data Mining and Knowledge Discovery,
vol. 15, no. 1, pp. 55–86, 2007.
[54] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate
generation," in ACM SIGMOD Record, vol. 29, pp. 1–12, ACM.
[55] R. Agrawal, R. Srikant, et al., "Fast algorithms for mining association
rules," in Proc. 20th int. conf. very large data bases, VLDB, vol. 1215,
pp. 487–499, 1994.
[56] P. Fournier-Viger, J. C.-W. Lin, B. Vo, T. T. Chi, J. Zhang, and H. B.
Le, "A survey of itemset mining," Wiley Interdisciplinary Reviews: Data
Mining and Knowledge Discovery, vol. 7, no. 4, 2017.
[57] D. Brauckhoff, X. Dimitropoulos, A. Wagner, and K. Salamatian,
"Anomaly extraction in backbone networks using association rules," in
Proceedings of the 9th ACM SIGCOMM conference on Internet measurement,
pp. 28–34, ACM, 2009.
[58] B. Fernando, E. Fromont, and T. Tuytelaars, "Effective use of frequent
itemset mining for image classification," in European conference on
computer vision, pp. 214–227, Springer, 2012.
[59] E. Glatz, S. Mavromatidis, B. Ager, and X. Dimitropoulos, "Visualizing
big network traffic data using frequent pattern mining and hypergraphs,"
Computing, vol. 96, no. 1, pp. 27–38, 2014.
[60] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without
candidate generation: A frequent-pattern tree approach," Data Mining and
Knowledge Discovery, vol. 8, no. 1, pp. 53–87, 2004.
[61] M. J. Zaki, "Scalable algorithms for association mining," IEEE Transactions
on Knowledge and Data Engineering, vol. 12, no. 3, pp. 372–390, 2000.
[62] Z.-H. Deng and S.-L. Lv, "Fast mining frequent itemsets using nodesets,"
Expert Systems with Applications, vol. 41, no. 10, pp. 4505–4512, 2014.
[63] Y. W. T. Pramono et al., "Anomaly-based intrusion detection and
prevention system on website usage using rule-growth sequential pattern analysis:
Case study: Statistics of Indonesia (BPS) website," in Advanced Informatics:
Concept, Theory and Application (ICAICTA), 2014 International Conference
of, pp. 203–208, IEEE, 2014.
[64] Y. J. M. Pokou, P. Fournier-Viger, and C. Moghrabi, "Authorship
attribution using small sets of frequent part-of-speech skip-grams.," in FLAIRS
Conference, pp. 86–91, 2016.
[65] P. Fournier-Viger, J. C.-W. Lin, R. U. Kiran, Y. S. Koh, and R. Thomas,
"A survey of sequential pattern mining," Data Science and Pattern
Recognition, vol. 1, no. 1, pp. 54–77, 2017.
[66] R. Srikant and R. Agrawal, "Mining sequential patterns: Generalizations
and performance improvements," in International Conference on Extending
Database Technology, pp. 1–17, Springer, 1996.
[67] M. J. Zaki, "SPADE: An efficient algorithm for mining frequent sequences,"
Machine Learning, vol. 42, no. 1-2, pp. 31–60, 2001.
[68] A. Silva and C. Antunes, "Constrained pattern mining in the new era,"
Knowledge and Information Systems, vol. 47, no. 3, pp. 489–516, 2016.
[69] R. Srikant, Q. Vu, and R. Agrawal, "Mining association rules with item
constraints.," in KDD, vol. 97, pp. 67–73, 1997.
[70] J. Han, H. Cheng, D. Xin, and X. Yan, "Frequent pattern mining: current
status and future directions," Data Mining and Knowledge Discovery, vol. 15,
no. 1, pp. 55–86, 2007.
[71] Y.-L. Chen and Y.-H. Hu, "Constraint-based sequential pattern mining:
The consideration of recency and compactness," Decision Support Systems,
vol. 42, no. 2, pp. 1203–1215, 2006.
[72] J. Pei, J. Han, and L. V. Lakshmanan, "Mining frequent itemsets with
convertible constraints," in Data Engineering, 2001. Proceedings. 17th
International Conference on, pp. 433–442, IEEE, 2001.
[73] Y.-L. Chen, M.-C. Chiang, and M.-T. Ko, "Discovering time-interval
sequential patterns in sequence databases," Expert Systems with
Applications, vol. 25, no. 3, pp. 343–354, 2003.
[74] M. Garofalakis, R. Rastogi, and K. Shim, "Mining sequential patterns with
regular expression constraints," IEEE Transactions on Knowledge and Data
Engineering, vol. 14, no. 3, pp. 530–552, 2002.
[75] C. M. Antunes and A. L. Oliveira, "Inference of sequential association rules
guided by context-free grammars," in International Colloquium on
Grammatical Inference, pp. 1–13, Springer, 2002.
[76] M. Wojciechowski and M. Zakrzewicz, "Dataset filtering techniques in
constraint-based frequent pattern mining," in Pattern detection and
discovery, pp. 77–91, Springer, 2002.
[77] M. J. Zaki, "Sequence mining in categorical domains: incorporating
constraints," in Proceedings of the ninth international conference on
Information and knowledge management, pp. 422–429, ACM, 2000.
[78] G. Piatetsky-Shapiro, "Discovery, analysis, and presentation of strong
rules," Knowledge discovery in databases, pp. 229–238, 1991.
[79] R. Agrawal, T. Imieliński, and A. Swami, "Mining association rules between
sets of items in large databases," in ACM SIGMOD Record, vol. 22,
pp. 207–216, ACM, 1993.
[80] P.-N. Tan, M. Steinbach, and V. Kumar, "Association analysis: basic
concepts and algorithms," Introduction to Data Mining, pp. 327–414, 2005.
[81] P. Fournier-Viger, T. Gueniche, S. Zida, and V. S. Tseng, "ERMiner:
sequential rule mining using equivalence classes," in International Symposium
on Intelligent Data Analysis, pp. 108–119, Springer, 2014.
[82] Ö. Çelebi, E. Zeydan, İ. Arı, Ö. İleri, and S. Ergüt, "Alarm sequence rule
mining extended with a time confidence parameter," in IEEE International
Conference on Data Mining (ICDM), 2014.
[83] L. Ong, M. Bergés, and H. Y. Noh, "Exploring sequential and association
rule mining for pattern-based energy demand characterization," in
Proceedings of the 5th ACM Workshop on Embedded Systems For Energy-Efficient
Buildings, pp. 1–2, ACM, 2013.
[84] Wang Yong, Li Zhanhuai, and Zhang Yang, "Mining sequential
association-rule for improving web document prediction," in Sixth International
Conference on Computational Intelligence and Multimedia Applications
(ICCIMA'05), pp. 146–151, Aug 2005.
[85] M. Naedele and O. Biderbost, "Human-assisted intrusion detection for
process control systems," in Proceedings of the Second International
Conference on Applied Cryptography and Network Security, pp. 216–225.
[86] C. Balducelli, L. Lavalle, and G. Vicoli, "Novelty detection and
management to safeguard information-intensive critical infrastructures,"
International Journal of Emergency Management, vol. 4, no. 1, pp. 88–103, 2007.
[87] R. Vaarandi, "A data clustering algorithm for mining patterns from event
logs," in Proceedings of the 2003 IEEE Workshop on IP Operations and
Management (IPOM), pp. 119–126.
[88] F. Salfner, S. Tschirpke, and M. Malek, "Comprehensive logfiles for
autonomic systems," in Proceedings of the 18th International Conference on
Parallel and Distributed Processing Symposium, p. 211, IEEE.
[89] B. Zhu, A. Joseph, and S. Sastry, "A taxonomy of cyber attacks on SCADA
systems," in Proceeding of the 4th International Conference on Cyber,
Physical and Social Computing, pp. 380–388, IEEE.
[90] S. Cheung, B. Dutertre, M. Fong, U. Lindqvist, K. Skinner, and A. Valdes,
"Using model-based intrusion detection for SCADA networks," in Proc. of
the SCADA Security Scientific Symp., vol. 46, pp. 1–12, 2007.
[91] R. R. R. Barbosa, R. Sadre, and A. Pras, "A first look into SCADA network
traffic," in Network Operations and Management Symposium (NOMS),
2012 IEEE, pp. 518–521, IEEE.
[92] A. Valdes and S. Cheung, "Communication pattern anomaly detection in
process control systems," in Technologies for Homeland Security, 2009.
HST'09. IEEE Conference on, pp. 22–29, IEEE.
[93] R. A. Kemmerer and G. Vigna, "Intrusion detection: A brief history and
overview (supplement to computer magazine)," Computer, no. 4, pp. 27–30,
2002.
[94] E. Bloedorn, A. D. Christiansen, W. Hill, C. Skorupka, L. M. Talbot, and
J. Tivel, "Data mining for network intrusion detection: How to get started,"
report, MITRE Technical Report, 2001.
[95] D. Barbara, N. Wu, and S. Jajodia, "Detecting Novel Network Intrusions
Using Bayes Estimators," in 1st SIAM Conf. on Data Mining, pp. 1–17,
2001.
[96] K. Sujatha and K. R. S. Rao, "A survey on infrequent pattern mining,"
International Journal of Advances in Engineering & Technology, vol. 6,
no. 4, p. 1728, 2013.
[97] B. Saha, M. Lazarescu, and S. Venkatesh, "Infrequent item mining in
multiple data streams," in Data Mining Workshops, 2007. ICDM Workshops
2007. Seventh IEEE International Conference on, pp. 569–574, IEEE.
[98] L. Szathmary, A. Napoli, and P. Valtchev, "Towards rare itemset mining,"
in 19th IEEE Int. Conf. on Tools with Artificial Intell. (ICTAI 2007), vol. 1,
pp. 305–312, 2007.
[99] R. Vaarandi, Tools and Techniques for Event Log Analysis. Tallinn
University of Technology Press, 2005.
[100] M.-S. Chen, J. Han, and P. S. Yu, "Data mining: an overview from a
database perspective," IEEE Transactions on Knowledge and Data
Engineering, vol. 8, no. 6, pp. 866–883, 1996.
[101] C. S. Hemalatha, V. Vaidehi, and R. Lakshmi, "Minimal infrequent pattern
based approach for mining outliers in data streams," Expert Systems with
Applications, vol. 42, no. 4, pp. 1998–2012, 2015.
[102] A. Rahman, Y. Xu, K. Radke, and E. Foo, "Finding Anomalies in SCADA
Logs Using Rare Sequential Pattern Mining," in International Conference
on Network and System Security, pp. 499–506, Springer, 2016.
[103] P. Fournier-Viger, A. Gomariz, M. �ebek, and M. Hlosta, VGEN: Fast
Vertical Mining of Sequential Generator Patterns, pp. 476�488. Springer,
2014.
[104] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Efficient mining of
association rules using closed itemset lattices,” Information Systems, vol. 24,
no. 1, pp. 25–46, 1999.
[105] H. Mannila and H. Toivonen, “Levelwise search and borders of theories in
knowledge discovery,” Data Mining and Knowledge Discovery, vol. 1, no. 3,
pp. 241–258, 1997.
[106] S. Yi, T. Zhao, Y. Zhang, S. Ma, and Z. Che, “An effective algorithm
for mining sequential generators,” Procedia Engineering, vol. 15,
pp. 3653–3657, 2011.
[107] C. Gao, J. Wang, Y. He, and L. Zhou, “Efficient mining of frequent sequence
generators,” in Proceedings of the 17th International Conference on World
Wide Web, pp. 1051–1052, ACM, 2008.
[108] P. Fournier-Viger, A. Gomariz, T. Gueniche, A. Soltani, C. Wu, and V. S.
Tseng, “SPMF: a Java Open-Source Pattern Mining Library,” Journal of
Machine Learning Research (JMLR), vol. 15, pp. 3389–3393, 2014.
[109] D. Hawkins, Identification of Outliers. Chapman and Hall, London, 1980.
[110] R. Kaur and S. Singh, “A survey of data mining and social network
analysis based anomaly detection techniques,” Egyptian Informatics Journal,
vol. 17, no. 2, pp. 199–216, 2016.
[111] W. Lee, S. J. Stolfo, et al., “Data mining approaches for intrusion
detection,” in USENIX Security, 1998.
[112] S. Pan, T. Morris, and U. Adhikari, “Developing a hybrid intrusion
detection system using data mining for power systems,” IEEE Transactions on
Smart Grid, vol. 6, no. 6, pp. 3104–3113, 2015.
[113] S. Bistarelli and F. Bonchi, “Soft constraint based pattern mining,” Data
& Knowledge Engineering, vol. 62, no. 1, pp. 118–137, 2007.
[114] R. T. Ng, L. V. Lakshmanan, J. Han, and A. Pang, “Exploratory
mining and pruning optimizations of constrained associations rules,” in ACM
SIGMOD Record, vol. 27, pp. 13–24, ACM, 1998.
[115] J.-F. Boulicaut and B. Jeudy, “Constraint-based data mining,” in Data
Mining and Knowledge Discovery Handbook, pp. 339–354, Springer, 2009.
[116] J. Han, L. V. Lakshmanan, and R. T. Ng, “Constraint-based,
multidimensional data mining,” Computer, vol. 32, no. 8, pp. 46–50, 1999.
[117] R. J. Bayardo, R. Agrawal, and D. Gunopulos, “Constraint-based rule
mining in large, dense databases,” Data Mining and Knowledge Discovery, vol. 4,
no. 2-3, pp. 217–240, 2000.
[118] V. Grossi, A. Romei, and F. Turini, “Survey on using constraints in data
mining,” Data Mining and Knowledge Discovery, vol. 31, no. 2,
pp. 424–464, 2017.
[119] Y.-L. Chen and Y.-H. Hu, “The consideration of recency and compactness
in sequential pattern mining,” in Proceedings of the second workshop on
Knowledge Economy and Electronic Commerce, vol. 42, pp. 1203–1215,
2006.
[120] M. N. Garofalakis, R. Rastogi, and K. Shim, “Spirit: Sequential pattern
mining with regular expression constraints,” in VLDB, vol. 99, pp. 7–10,
1999.
[121] S. Parthasarathy, M. J. Zaki, M. Ogihara, and S. Dwarkadas, “Incremental
and interactive sequence mining,” in Proceedings of the eighth international
conference on Information and knowledge management, pp. 251–258, ACM,
1999.
[122] N. A. K. Desai and A. Ganatra, “Efficient constraint-based sequential
pattern mining (spm) algorithm to understand customers’ buying
behaviour from time stamp-based sequence dataset,” Cogent Engineering,
vol. 2, no. 1, p. 1072292, 2015.
[123] C. Antunes and A. L. Oliveira, “Generalization of pattern-growth
methods for sequential pattern mining with gap constraints,” in International
Workshop on Machine Learning and Data Mining in Pattern Recognition,
pp. 239–251, Springer, 2003.
[124] J. Han, J. Pei, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu,
“Prefixspan: Mining sequential patterns efficiently by prefix-projected
pattern growth,” in Proceedings of the 17th international conference on data
engineering, pp. 215–224, 2001.
[125] T. Morzy, M. Wojciechowski, and M. Zakrzewicz, “Efficient
constraint-based sequential pattern mining using dataset filtering techniques,” in
Databases and Information Systems II, pp. 297–309, Springer, 2002.
[126] J. Pei, J. Han, and W. Wang, “Mining sequential patterns with constraints
in large databases,” in Proceedings of the eleventh international conference
on Information and knowledge management, pp. 18–25, ACM, 2002.
[127] J. Zhu, H. Wu, and G. Gao, “An efficient method of web sequential pattern
mining based on session filter and transaction identification,” Journal of
Networks, vol. 5, no. 9, p. 1017, 2010.
[128] D. Pyle, Data Preparation for Data Mining, vol. 1. Morgan Kaufmann, 1999.
[129] R. Ranjan and G. Sahoo, “A new clustering approach for anomaly intrusion
detection,” arXiv preprint arXiv:1404.2772, 2014.
[130] “Verizon, 2013 data breach investigations report.”
http://www.verizonenterprise.com/resources/reports/rp_data-breach-investigations-report-2013_en_xg.pdf,
2013 (accessed March 10, 2018).
[131] P. Fournier-Viger, R. Nkambou, and V. S.-M. Tseng, “Rulegrowth: mining
sequential rules common to several sequences by pattern-growth,” in
Proceedings of the 2011 ACM symposium on applied computing, pp. 956–961,
ACM, 2011.
[132] P. Fournier-Viger, “An Introduction to Sequential Rule Mining – The Data
Mining Blog.”
http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/,
2015 (accessed February 7, 2018).
[133] S. K. Harms, J. Deogun, and T. Tadesse, “Discovering sequential
association rules with constraints and time lags in multiple sequences,” in
International Symposium on Methodologies for Intelligent Systems, pp. 432–441,
Springer, 2002.
[134] D. Lo, S.-C. Khoo, and L. Wong, “Non-redundant sequential rules –
theory and algorithm,” Information Systems, vol. 34, no. 4-5, pp. 438–453,
2009.
[135] P. Fournier-Viger, U. Faghihi, R. Nkambou, and E. M. Nguifo, “Cmrules:
Mining sequential rules common to several sequences,” Knowledge-Based
Systems, vol. 25, no. 1, pp. 63–76, 2012.
[136] G. Das, K.-I. Lin, H. Mannila, G. Renganathan, and P. Smyth, “Rule
discovery from time series,” in KDD, vol. 98, pp. 16–22, 1998.
[137] J. Deogun and L. Jiang, “Prediction mining – an approach to mining
association rules for prediction,” in International Workshop on Rough Sets, Fuzzy
Sets, Data Mining, and Granular-Soft Computing, pp. 98–108, Springer,
2005.
[138] P. Fournier-Viger, T. Gueniche, and V. S. Tseng, “Using partially-ordered
sequential rules to generate more accurate sequence prediction,” in
International Conference on Advanced Data Mining and Applications, pp. 431–442,
Springer, 2012.